# 🧹 Data Cleaning in Pandas: Indian Employees Dataset

In this notebook, we will:

- Load a real-world employee dataset
- Identify and handle missing values (NaNs and `inf`)
- Remove duplicates and fix inconsistent data
- Save the cleaned dataset as a new CSV


In [1]:
# 📦 Load necessary libraries
import pandas as pd
import numpy as np

## 📂 Load the Dataset

We'll load the file `indian_emp.csv` from the folder `practice/`.

> Make sure the file path exists on your local machine!


In [None]:
# Load dataset
df = pd.read_csv("practice/indian_emp.csv")

# Display first few rows to understand the structure
df.head()
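
A minimal sketch of a guarded load, in case the path is wrong. The helper name `load_if_exists` and the variable `df_checked` are hypothetical and not part of the original notebook; the helper only wraps `pd.read_csv` with a file check.

```python
from pathlib import Path
import pandas as pd

def load_if_exists(path):
    """Return a DataFrame if `path` is an existing file, else None (hypothetical helper)."""
    csv_path = Path(path)
    if not csv_path.is_file():
        # Show the absolute path so it is easy to see where Python looked
        print(f"File not found: {csv_path.resolve()}")
        return None
    return pd.read_csv(csv_path)

# Same relative path as used in this notebook
df_checked = load_if_exists("practice/indian_emp.csv")
```

If the message appears, the working directory is probably not the folder that contains `practice/`.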

## 🔍 Check for Missing Values (NaNs)

We'll now identify how many missing values (`NaN`) are present in each column.


In [None]:
# Check how many values are missing per column
print("🕳️ Missing Values:")
print(df.isnull().sum())
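
Note that `isnull()` counts `NaN` only; `inf` entries are not included in that total. The sketch below uses three rows copied from the CSV at the end of this notebook to count both kinds separately. The names `demo`, `nan_counts`, and `inf_counts` are illustrative.

```python
import io
import numpy as np
import pandas as pd

# Three rows copied from the CSV at the end of this notebook
sample = io.StringIO(
    "Emp_ID,Name,Age,Salary (INR),Experience (Years),City,Department,Performance Rating\n"
    "102,Riya Gupta,32,NaN,8,Mumbai,HR,3.8\n"
    "103,Rajesh Verma,45,1200000,inf,Bangalore,Finance,4.9\n"
    "107,Kavita Yadav,31,inf,10,Pune,IT,4\n"
)
demo = pd.read_csv(sample)

numeric = demo.select_dtypes(include=[np.number])
nan_counts = demo.isnull().sum()      # counts NaN only
inf_counts = np.isinf(numeric).sum()  # counts +inf and -inf

print(nan_counts["Salary (INR)"], inf_counts["Salary (INR)"])
```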

## 🩹 Step 1: Fill Missing Values with Mean or Median

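We'll fill missing values column by column:
- For **Salary**, **Age**, and **Experience**: use the column **mean**
- For **Performance Rating**: use the column **median**

Note that a column that still contains `inf` has an infinite mean, so its gaps are only properly filled once Step 2 has run.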


In [None]:
# Fill missing salary, age, experience using mean
df["Salary (INR)"] = df["Salary (INR)"].fillna(df["Salary (INR)"].mean())
df["Age"] = df["Age"].fillna(df["Age"].mean())
df["Experience (Years)"] = df["Experience (Years)"].fillna(df["Experience (Years)"].mean())

# Fill performance rating using median
df["Performance Rating"] = df["Performance Rating"].fillna(df["Performance Rating"].median())   
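
One thing worth checking at this step: if a column contains `inf`, its mean is `inf` as well. The sketch below shows this on a toy Series (the values are illustrative, not the full dataset) and compares it with the mean taken after converting `inf` to `NaN`.

```python
import numpy as np
import pandas as pd

# Toy salary column (illustrative values)
salary = pd.Series([650000.0, np.nan, np.inf, 720000.0])

mean_with_inf = salary.mean()  # the infinite entry makes the mean infinite

# Convert inf to NaN first; mean() then skips the NaN entries
finite_only = salary.replace([np.inf, -np.inf], np.nan)
mean_finite = finite_only.mean()

print(mean_with_inf, mean_finite)
```

Under this assumption, converting `inf` to `NaN` before computing any mean gives a finite fill value in a single pass.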

## 🧠 Step 2: Handle Infinite (`inf`) Values

We'll treat `inf` and `-inf` as missing (`NaN`) and then fill them like any other missing value.


In [None]:
# Replace all inf/-inf with NaN
df.replace([np.inf, -np.inf], np.nan, inplace=True)

# Fill any remaining numeric NaNs with their column means
num_cols = df.select_dtypes(include=[np.number]).columns
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())
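
A quick sanity check can confirm that no `NaN` or `inf` is left in the numeric columns after this step. This is a sketch on a toy frame with made-up values; `toy`, `all_finite`, and `remaining` are illustrative names.

```python
import numpy as np
import pandas as pd

# Toy frame with illustrative values
toy = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Age": [30.0, np.inf, 40.0],
    "Rating": [4.0, 5.0, -np.inf],
})

# Same two operations as in the cell above
toy = toy.replace([np.inf, -np.inf], np.nan)
num_cols = toy.select_dtypes(include=[np.number]).columns
toy[num_cols] = toy[num_cols].fillna(toy[num_cols].mean())

# Every numeric value should now be finite
all_finite = bool(np.isfinite(toy[num_cols]).all().all())
remaining = int(toy[num_cols].isna().sum().sum())
print(all_finite, remaining)
```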

## 🧹 Step 3: Remove Duplicates

Let's remove any repeated rows from the data. With no arguments, `drop_duplicates()` drops only rows that are identical in every column.

In [None]:
# Drop duplicate rows
df.drop_duplicates(inplace=True)
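
In the CSV at the end of this notebook, rows 101 and 119 differ only in `Emp_ID`, so a full-row comparison keeps both. The sketch below assumes such rows should count as duplicates (an assumption, since they could be two different employees) and compares on every column except `Emp_ID`.

```python
import pandas as pd

# Rows 101, 119 and 104 from the CSV, reduced to four columns for brevity
rows = pd.DataFrame({
    "Emp_ID": [101, 119, 104],
    "Name": ["Amit Sharma", "Amit Sharma", "Priya Singh"],
    "Age": [27, 27, 29],
    "Salary (INR)": [650000, 650000, 720000],
})

# Full-row comparison: Emp_ID differs, so nothing is dropped
full_row = rows.drop_duplicates()

# Compare on all columns except Emp_ID; the first occurrence is kept
cols = [c for c in rows.columns if c != "Emp_ID"]
ignoring_id = rows.drop_duplicates(subset=cols)

print(len(full_row), len(ignoring_id))
```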

## ⚠️ Step 4: Fix Negative Salaries

Negative salaries are invalid, so we’ll replace them with the **average salary** using `np.where`.

### 🔍 Explanation of this line:
```python
df["Salary (INR)"] = np.where(df["Salary (INR)"] < 0, df["Salary (INR)"].mean(), df["Salary (INR)"])
```

- It checks where Salary < 0
- If True → replace with the mean salary
- Else → keep the original value

In [None]:
# Replace negative salary values with column mean
df["Salary (INR)"] = np.where(df["Salary (INR)"] < 0, df["Salary (INR)"].mean(), df["Salary (INR)"])
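
The mean used above is computed over the whole column, negative entries included, which pulls the replacement value down. A possible variant, sketched here on a toy Series, takes the mean of the non-negative salaries only. `valid_mean` and `fixed` are illustrative names, and this is an alternative, not what the cell above does.

```python
import numpy as np
import pandas as pd

# Toy column: two valid salaries and the two negative values from the CSV
salary = pd.Series([650000.0, -45000.0, 720000.0, -30000.0])

mean_all = salary.mean()                 # includes the negative entries
valid_mean = salary[salary >= 0].mean()  # mean of valid salaries only

# Where salary < 0 use valid_mean, otherwise keep the original value
fixed = np.where(salary < 0, valid_mean, salary)

print(mean_all, valid_mean)
print(fixed)
```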

## 💾 Final Step: Save the Cleaned Data

Let’s save the cleaned DataFrame to a new CSV file.


In [None]:
# Save to new CSV file
df.to_csv("practice/saaf_kiya.CSV", index=False)
print("✅ Data cleaning complete! File saved as 'saaf_kiya.CSV'")
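
To confirm that the saved file reads back cleanly, a round trip can be checked. The sketch below writes to an in-memory buffer instead of `practice/` so it runs anywhere; the frame `cleaned` holds illustrative values only.

```python
import io
import pandas as pd

# Small stand-in for the cleaned DataFrame (illustrative values)
cleaned = pd.DataFrame({
    "Emp_ID": [101, 104],
    "Salary (INR)": [650000.0, 720000.0],
})

buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)  # index=False: no extra index column is written
buffer.seek(0)

reloaded = pd.read_csv(buffer)
print(list(reloaded.columns), reloaded.shape)
```

With `index=False`, the reloaded frame has the same columns as the one that was written; without it, the row index would come back as an extra unnamed column.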

# In case you want the original code, check below

In [None]:
# Load the libraries
import pandas as pd
import numpy as np

# Load the dataset
df=pd.read_csv('practice/indian_emp.csv')
# print(df.head())  # helps to understand the dataset's structure, i.e. what it contains and the column names

# Check for missing values ("NaN")
# print("Missing")
# print(df.isnull().sum())  # shows how many values are missing in each column

# Fill in the missing values
df["Salary (INR)"] = df["Salary (INR)"].fillna(df["Salary (INR)"].mean())
df["Performance Rating"] = df["Performance Rating"].fillna(df["Performance Rating"].median())
df["Age"] = df["Age"].fillna(df["Age"].mean())
df["Experience (Years)"] = df["Experience (Years)"].fillna(df["Experience (Years)"].mean())


# Now the "inf" values: replace them with NaN so that they become missing values too
df.replace([np.inf,-np.inf],np.nan,inplace=True)
df[df.select_dtypes(include=[np.number]).columns] = df.select_dtypes(include=[np.number]).fillna(df.select_dtypes(include=[np.number]).mean())  #"inf" values to "nan", "nan" to mean of col

# Remove the duplicates
df.drop_duplicates(inplace=True)

# Fix the negative salaries
df["Salary (INR)"]=np.where(df["Salary (INR)"]<0,df["Salary (INR)"].mean(),df["Salary (INR)"]) # where salary < 0 use the column mean, otherwise keep the value
df.to_csv("practice/saaf_kiya.CSV",index=False)
print("Done")


# CSV file below

In [None]:
Emp_ID,Name,Age,Salary (INR),Experience (Years),City,Department,Performance Rating
101,Amit Sharma,27,650000,4,Delhi,IT,4.5
102,Riya Gupta,32,NaN,8,Mumbai,HR,3.8
103,Rajesh Verma,45,1200000,inf,Bangalore,Finance,4.9
104,Priya Singh,29,720000,6,Kolkata,IT,4.2
105,Sunil Kumar,26,500000,3,Chennai,Sales,NaN
106,Alok Mehta,52,2000000,28,Hyderabad,Operations,5
107,Kavita Yadav,31,inf,10,Pune,IT,4
108,Pankaj Mishra,24,-45000,2,Jaipur,Marketing,3.5
109,Deepika Reddy,48,3000000,30,Bangalore,Management,5
110,Ramesh Patil,29,720000,6,Kolkata,IT,4.2
111,Meera Nair,34,820000,9,Kochi,HR,4.3
112,Sameer Desai,37,1100000,15,Delhi,Finance,4.8
113,Swati Joshi,29,650000,5,Mumbai,IT,4.6
114,Vinod Iyer,55,5000000,35,Chennai,Management,5
115,Anjali Saxena,23,380000,1,Pune,HR,3.9
116,Manish Khanna,40,1250000,18,Hyderabad,Operations,4.7
117,Arjun Pandey,31,750000,7,Jaipur,IT,4.4
118,Neha Kapoor,28,NaN,4,Bangalore,IT,3.8
119,Amit Sharma,27,650000,4,Delhi,IT,4.5
120,Rohit Bansal,36,980000,12,Mumbai,Finance,4.9
121,Suresh Choudhri,42,inf,20,Hyderabad,Operations,4.5
122,Sneha Tiwari,25,550000,2,Kolkata,Sales,3.6
123,Gopal Das,51,4800000,32,Kochi,Management,4.9
124,Preeti Sharma,33,770000,8,Pune,HR,4.1
125,Aditya Saxena,22,-30000,0,Delhi,IT,3.2
126,Kiran Kumar,29,650000,5,Mumbai,IT,4.5
127,Mohan Reddy,30,820000,7,Bangalore,HR,4.4
128,Sanjay Agarwal,43,1400000,21,Chennai,Finance,4.8
129,Rekha Nair,35,950000,11,Kochi,Operations,4.6
130,Vikash Singh,28,680000,5,Delhi,IT,4.3
131,Pooja Mishra,26,420000,2,Jaipur,Marketing,3.7
132,Ravi Kumar,39,1150000,16,Hyderabad,Finance,4.7
133,Nisha Patel,31,780000,8,Mumbai,HR,4.2
134,Ajay Gupta,44,1800000,22,Bangalore,Operations,4.9
135,Sunita Roy,27,580000,4,Kolkata,Sales,4.0
136,Manoj Sharma,33,850000,9,Pune,IT,4.4
137,Divya Nair,29,720000,6,Chennai,HR,4.1
138,Rahul Jain,38,1300000,inf,Delhi,Finance,4.8
139,Anita Singh,32,NaN,7,Mumbai,IT,4.0
140,Sachin Reddy,46,2500000,25,Bangalore,Management,5
141,Geeta Sharma,30,750000,6,Hyderabad,Operations,4.3
142,Naveen Kumar,24,450000,1,Jaipur,Marketing,3.5
143,Priyanka Das,35,1050000,13,Kochi,Finance,4.6
144,Ashok Patel,41,1450000,19,Chennai,Operations,4.8
145,Meena Gupta,28,670000,5,Delhi,IT,4.2
146,Rajiv Singh,37,1200000,15,Mumbai,Finance,4.7
147,Sushma Nair,33,inf,8,Bangalore,HR,4.3
148,Harish Kumar,29,720000,6,Pune,IT,4.1
149,Vandana Sharma,31,850000,9,Kolkata,Operations,4.5
150,Deepak Reddy,36,1150000,14,Hyderabad,Finance,4.8
151,Kavya Nair,NaN,950000,inf,Chennai,IT,NaN
152,Rohit Agarwal,34,NaN,12,Delhi,Finance,4.6
153,Seema Patel,inf,780000,NaN,Mumbai,HR,3.9
154,Vijay Kumar,28,inf,6,Bangalore,Operations,NaN
155,Nidhi Sharma,NaN,650000,inf,Pune,Marketing,4.2
156,Arun Singh,42,NaN,inf,Kolkata,Management,NaN
157,Ritu Gupta,NaN,inf,15,Hyderabad,Finance,4.7
158,Manish Reddy,35,1200000,NaN,Kochi,IT,inf
159,Shweta Nair,inf,NaN,8,Jaipur,HR,NaN
160,Prakash Das,29,850000,inf,Chennai,Operations,NaN