# Pandas Null Values Handling Practice

This notebook contains practice questions for handling null values in pandas.

**Dataset:** sales_data.csv (Employee sales data with missing values)

## Setup: Import Libraries and Load Data

In [15]:
import pandas as pd
import numpy as np

# Load the dataset
df = pd.read_csv('csv data files/null_v1.csv')
print("Dataset loaded successfully!")
df.head(10)

Dataset loaded successfully!


Unnamed: 0,employee_id,name,department,salary,age,sales_amount,join_date,performance_score
0,101,John Smith,Sales,55000.0,28.0,12500.0,2020-03-15,8.5
1,102,Maria Garcia,,62000.0,34.0,15800.0,2019-07-22,9.2
2,103,James Wilson,Marketing,48000.0,,11200.0,2021-01-10,
3,104,Lisa Anderson,Sales,,29.0,14500.0,2020-11-05,7.8
4,105,Robert Brown,IT,75000.0,42.0,,2018-05-18,8.9
5,106,,Marketing,51000.0,31.0,9800.0,,8.1
6,107,Emily Davis,Sales,58000.0,27.0,16200.0,2021-06-30,9.5
7,108,Michael Lee,,69000.0,38.0,,2019-02-14,
8,109,Sarah Taylor,IT,,33.0,13400.0,2020-09-25,8.3
9,110,David Martinez,Marketing,54000.0,,,2021-03-08,7.9


## Question 1: Detecting Null Values

**Task:** 
- Display the total number of null values in each column
- Display the percentage of null values in each column
- Which column has the most missing values?

In [16]:
# Your code here
print("=="*50)
print("Total null values in each column")
print("=="*50)
# Count nulls per column once and reuse the result below
null_counts = df.isnull().sum()
print(null_counts)
print(f"Total null values across all columns: {null_counts.sum()}")
print("=="*50)
print("Percentage of null values in each column")
print("=="*50)
# Named null_pct rather than max, so the built-in max() is not shadowed
null_pct = null_counts / len(df) * 100
print(null_pct)
print("=="*50)
print("Column with the most missing values")
print("=="*50)
null_counts[null_counts == null_counts.max()]


Total null values in each column
employee_id          0
name                 2
department           5
salary               4
age                  4
sales_amount         6
join_date            3
performance_score    5
dtype: int64
Total null values across all columns: 29
Percentage of null values in each column
employee_id           0.0
name                 10.0
department           25.0
salary               20.0
age                  20.0
sales_amount         30.0
join_date            15.0
performance_score    25.0
dtype: float64
Column with the most missing values


sales_amount    6
dtype: int64
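
The counts and percentages can also be collected into a single summary table. The following is a minimal sketch on a small hypothetical DataFrame; `toy`, `summary`, and `most_missing` are illustrative names and are not part of the notebook's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only to illustrate the technique
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, "x", "y"],
    "c": [1, 2, 3, 4],
})

# isnull().sum() counts nulls; isnull().mean() gives the null fraction
summary = pd.DataFrame({
    "null_count": toy.isnull().sum(),
    "null_pct": toy.isnull().mean() * 100,
}).sort_values("null_count", ascending=False)

# idxmax returns the label of the column with the highest null count
most_missing = summary["null_count"].idxmax()

print(summary)
print(f"Column with the most missing values: {most_missing}")
```

Note that `idxmax` returns only the first label when several columns tie, whereas the boolean filter used in the cell above returns every tied column.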

## Question 2: Visualizing Missing Data

**Task:** 
- Use `isnull()` to create a boolean DataFrame showing where nulls exist
- Display rows that contain at least one null value

### Answer

1. `isnull()` checks every cell for a null value.
2. `any(axis=1)` returns True for each row in which any column is null.
3. `df[...]` filters the DataFrame down to those rows.
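
The three steps above can be sketched separately on a small hypothetical DataFrame. The names `toy`, `null_mask`, `row_has_null`, and `rows_with_nulls` are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the notebook's dataset
toy = pd.DataFrame({
    "name": ["Ann", None, "Cy"],
    "salary": [50000.0, 62000.0, np.nan],
    "age": [28.0, 34.0, 41.0],
})

null_mask = toy.isnull()               # step 1: boolean DataFrame, True where missing
row_has_null = null_mask.any(axis=1)   # step 2: one boolean per row
rows_with_nulls = toy[row_has_null]    # step 3: keep rows with at least one null

print(rows_with_nulls)
```

Splitting the expression this way makes each intermediate result easy to inspect before combining them into `df[df.isnull().any(axis=1)]`.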

In [17]:
# Your code here
print("=="*50)
print("boolean null values")
print("=="*50)
print(df.isnull())
print("=="*50)
print("Rows that contain at least one null value")
print("=="*50)
df[df.isnull().any(axis=1)]

boolean null values
    employee_id   name  department  salary    age  sales_amount  join_date  \
0         False  False       False   False  False         False      False   
1         False  False        True   False  False         False      False   
2         False  False       False   False   True         False      False   
3         False  False       False    True  False         False      False   
4         False  False       False   False  False          True      False   
5         False   True       False   False  False         False       True   
6         False  False       False   False  False         False      False   
7         False  False        True   False  False          True      False   
8         False  False       False    True  False         False      False   
9         False  False       False   False   True          True      False   
10        False  False       False   False  False         False       True   
11        False  False        True   False  

Unnamed: 0,employee_id,name,department,salary,age,sales_amount,join_date,performance_score
1,102,Maria Garcia,,62000.0,34.0,15800.0,2019-07-22,9.2
2,103,James Wilson,Marketing,48000.0,,11200.0,2021-01-10,
3,104,Lisa Anderson,Sales,,29.0,14500.0,2020-11-05,7.8
4,105,Robert Brown,IT,75000.0,42.0,,2018-05-18,8.9
5,106,,Marketing,51000.0,31.0,9800.0,,8.1
7,108,Michael Lee,,69000.0,38.0,,2019-02-14,
8,109,Sarah Taylor,IT,,33.0,13400.0,2020-09-25,8.3
9,110,David Martinez,Marketing,54000.0,,,2021-03-08,7.9
10,111,Jennifer White,Sales,61000.0,30.0,17500.0,,
11,112,Christopher Johnson,,57000.0,35.0,12800.0,2019-12-01,8.7


## Question 3: Drop Rows with Null Values

**Task:** 
- Create a new DataFrame that drops all rows containing any null values
- How many rows remain after dropping?
- Is this a good approach for this dataset? Why or why not?

In [18]:
# Your code here
before_null=df.shape
after_null=df.dropna().shape
print("=="*50)
print("shape before removing null values")
print("=="*50)
print(f"{before_null}")
print("=="*50)
print("shape after removing null values")
print("=="*50)
print(f"{after_null}")
print("=="*50)
print("filling null values with mean of the column")
print("=="*50)
# Assign the result back instead of calling fillna(..., inplace=True) on a
# column selection; pandas warns that the chained form stops working in 3.0.
df["age"] = df["age"].fillna(df["age"].mean())
print(df)



shape before removing null values
(20, 8)
shape after removing null values
(2, 8)
filling null values with mean of the column
    employee_id                 name department   salary   age  sales_amount  \
0           101           John Smith      Sales  55000.0  28.0       12500.0   
1           102         Maria Garcia        NaN  62000.0  34.0       15800.0   
2           103         James Wilson  Marketing  48000.0  33.5       11200.0   
3           104        Lisa Anderson      Sales      NaN  29.0       14500.0   
4           105         Robert Brown         IT  75000.0  42.0           NaN   
5           106                  NaN  Marketing  51000.0  31.0        9800.0   
6           107          Emily Davis      Sales  58000.0  27.0       16200.0   
7           108          Michael Lee        NaN  69000.0  38.0           NaN   
8           109         Sarah Taylor         IT      NaN  33.0       13400.0   
9           110       David Martinez  Marketing  54000.0  33.5           N
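
On whether dropping is a good approach: the shapes above show that only 2 of the 20 rows survive a plain `dropna()`, so most of the data is lost. A less drastic option is to drop only rows that are mostly empty, or only rows missing a specific key column. The sketch below illustrates both on a hypothetical DataFrame; the threshold of 3 and the choice of `score` as the key column are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the notebook's dataset
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "salary": [50.0, np.nan, 70.0, np.nan],
    "age": [30.0, 40.0, np.nan, np.nan],
    "score": [8.0, 9.0, 7.5, np.nan],
})

strict = toy.dropna()                   # drop every row that has any null
lenient = toy.dropna(thresh=3)          # keep rows with at least 3 non-null values
by_key = toy.dropna(subset=["score"])   # only require 'score' to be present

print(len(toy), len(strict), len(lenient), len(by_key))
```

Here the strict version keeps a single row, while the two alternatives keep three, which is the kind of trade-off to weigh before dropping rows from a small dataset.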


## Question 4: Drop Columns with Many Nulls

**Task:** 
- Drop columns that have more than 20% missing values
- Which columns were dropped?

In [19]:
# Your code here
# Fraction of missing values per column
null_fraction = df.isnull().mean()
cols_to_drop = null_fraction[null_fraction > 0.20].index.tolist()

# Drop into a new DataFrame so later questions can still use every column
df_reduced = df.drop(columns=cols_to_drop)

print("=="*50)
print("Columns that have more than 20% missing values")
print("=="*50)
print(f"Dropped columns: {cols_to_drop}")




Columns that have more than 20% missing values
Dropped columns: ['department', 'sales_amount', 'performance_score']


## Question 5: Fill Null Values with Mean/Median

**Task:** 
- Fill null values in the 'salary' column with the mean salary
- Fill null values in the 'age' column with the median age
- Display the filled data

In [20]:
# Your code here
print("=="*50)
# Fill null values in the salary column with the mean salary.
# Assigning back avoids the chained fillna(..., inplace=True) pattern,
# which pandas warns will stop working in pandas 3.0.
df["salary"] = df["salary"].fillna(df["salary"].mean())

# Fill null values in the age column with the median age.
# (age was already filled with the mean in Question 3, so no nulls are left here.)
df["age"] = df["age"].fillna(df["age"].median())

print("filled data")
display(df)


filled data


Unnamed: 0,employee_id,name,department,salary,age,sales_amount,join_date,performance_score
0,101,John Smith,Sales,55000.0,28.0,12500.0,2020-03-15,8.5
1,102,Maria Garcia,,62000.0,34.0,15800.0,2019-07-22,9.2
2,103,James Wilson,Marketing,48000.0,33.5,11200.0,2021-01-10,
3,104,Lisa Anderson,Sales,60750.0,29.0,14500.0,2020-11-05,7.8
4,105,Robert Brown,IT,75000.0,42.0,,2018-05-18,8.9
5,106,,Marketing,51000.0,31.0,9800.0,,8.1
6,107,Emily Davis,Sales,58000.0,27.0,16200.0,2021-06-30,9.5
7,108,Michael Lee,,69000.0,38.0,,2019-02-14,
8,109,Sarah Taylor,IT,60750.0,33.0,13400.0,2020-09-25,8.3
9,110,David Martinez,Marketing,54000.0,33.5,,2021-03-08,7.9


## Question 6: Fill Categorical Nulls

**Task:** 
- Fill null values in the 'department' column with 'Unknown'
- Fill null values in the 'name' column with 'Anonymous'
- Verify no nulls remain in these columns

In [37]:
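# Your code here
# Fill categorical nulls with placeholder labels (lowercase here; the task
# text spells them 'Unknown' and 'Anonymous')
df["department"] = df["department"].fillna("unknown")
df["name"] = df["name"].fillna("anonymous")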
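
The cell above does not include the verification step the task asks for. A minimal sketch of filling both columns and then checking them, on a hypothetical DataFrame (`toy` and `remaining` are illustrative names), might look like this:

```python
import pandas as pd

# Hypothetical toy data, not the notebook's dataset
toy = pd.DataFrame({
    "name": ["Ann", None, "Cy"],
    "department": ["Sales", "IT", None],
})

# A dict passed to DataFrame.fillna sets one fill value per column
toy = toy.fillna({"department": "Unknown", "name": "Anonymous"})

# Verify: per-column null counts should all be zero
remaining = toy[["department", "name"]].isnull().sum()
print(remaining)
assert remaining.sum() == 0, "nulls remain in department or name"
```

The `assert` makes the check fail loudly instead of relying on reading the printed counts.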


## Question 7: Forward Fill and Backward Fill

**Task:** 
- Sort the DataFrame by 'join_date'
- Use forward fill (`ffill`) to fill null values in 'performance_score'
- Then use backward fill (`bfill`) for any remaining nulls
- Why might this approach work or not work for this data?

In [44]:
# Your code here
# Sorting by join_date was attempted but left disabled, so the fills below
# follow the original row order:
# df["join_date"] = pd.to_datetime(df["join_date"])
# df = df.sort_values(by="join_date")
df["performance_score"] = df["performance_score"].ffill()
df["performance_score"] = df["performance_score"].bfill()
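
For reference, the full sequence the task describes (sort by date, forward fill, then backward fill) can be sketched on a small hypothetical DataFrame. The data and the names `toy` and `toy_sorted` are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the notebook's dataset
toy = pd.DataFrame({
    "join_date": ["2021-01-10", "2019-07-22", "2020-03-15", "2018-05-18"],
    "performance_score": [np.nan, 9.2, np.nan, np.nan],
})

# Parse the dates so the sort is chronological rather than alphabetical
toy["join_date"] = pd.to_datetime(toy["join_date"])
toy_sorted = toy.sort_values("join_date")

# ffill copies the previous row's score forward; bfill then covers any
# leading nulls that had no earlier row to copy from
toy_sorted["performance_score"] = toy_sorted["performance_score"].ffill().bfill()

print(toy_sorted)
```

In this toy data the earliest joiner has no score, so `ffill` alone leaves that row null and `bfill` is needed. Whether the approach suits the real data is questionable: each row is a different employee, so a filled score is simply borrowed from whoever joined just before or after.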


## Question 8: Conditional Filling

**Task:** 
- For null values in 'sales_amount', fill them with different values based on department:
  - Sales department: fill with the mean of Sales department
  - Other departments: fill with 0
- How many values did you fill?

In [45]:
# Your code here
df.groupby("department")["sales_amount"].mean()


department
IT           14150.0
Marketing    10500.0
Sales        15260.0
unknown      13975.0
Name: sales_amount, dtype: float64
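
The group means above are only the first step; the fill itself is not shown. One possible way to apply the rule (Sales rows get the Sales mean, all other rows get 0) and count the filled values is sketched below on a hypothetical DataFrame. All names and numbers here are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the notebook's dataset
toy = pd.DataFrame({
    "department": ["Sales", "Sales", "Sales", "IT", "Marketing"],
    "sales_amount": [100.0, 200.0, np.nan, np.nan, 50.0],
})

n_missing_before = toy["sales_amount"].isnull().sum()

is_sales = toy["department"] == "Sales"
sales_mean = toy.loc[is_sales, "sales_amount"].mean()  # mean() skips nulls

# Build a per-row fill value: the Sales mean for Sales rows, 0 otherwise.
# fillna aligns this Series on the index and only touches null cells.
fill_values = pd.Series(np.where(is_sales, sales_mean, 0.0), index=toy.index)
toy["sales_amount"] = toy["sales_amount"].fillna(fill_values)

n_filled = n_missing_before - toy["sales_amount"].isnull().sum()
print(toy)
print(f"Values filled: {n_filled}")
```

Counting nulls before and after the fill answers the "how many values did you fill?" part of the task without hard-coding a number.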

## Question 9: Interpolation

**Task:** 
- Sort the data by 'employee_id'
- Use interpolation to fill null values in numeric columns
- Compare the results with mean filling

In [46]:
# Your code here
df

Unnamed: 0,employee_id,name,department,salary,age,sales_amount,join_date,performance_score
0,101,John Smith,Sales,55000.0,28.0,12500.0,2020-03-15,8.5
1,102,Maria Garcia,unknown,62000.0,34.0,15800.0,2019-07-22,9.2
2,103,James Wilson,Marketing,48000.0,33.5,11200.0,2021-01-10,9.2
3,104,Lisa Anderson,Sales,60750.0,29.0,14500.0,2020-11-05,7.8
4,105,Robert Brown,IT,75000.0,42.0,,2018-05-18,8.9
5,106,anonymous,Marketing,51000.0,31.0,9800.0,,8.1
6,107,Emily Davis,Sales,58000.0,27.0,16200.0,2021-06-30,9.5
7,108,Michael Lee,unknown,69000.0,38.0,,2019-02-14,9.5
8,109,Sarah Taylor,IT,60750.0,33.0,13400.0,2020-09-25,8.3
9,110,David Martinez,Marketing,54000.0,33.5,,2021-03-08,7.9
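
The cell above only displays `df`. A minimal sketch of the interpolation step and its comparison with mean filling follows, using a hypothetical DataFrame with a single numeric column; `toy`, `interpolated`, `mean_filled`, and `comparison` are illustrative names.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the notebook's dataset
toy = pd.DataFrame({
    "employee_id": [103, 101, 102, 104],
    "salary": [np.nan, 50000.0, np.nan, 80000.0],
})

# Interpolation depends on row order, so sort first
toy = toy.sort_values("employee_id").reset_index(drop=True)
numeric_cols = ["salary"]

# Linear interpolation spaces the missing values evenly between neighbours
interpolated = toy[numeric_cols].interpolate(method="linear")
# Mean filling gives every missing cell the same value
mean_filled = toy[numeric_cols].fillna(toy[numeric_cols].mean())

comparison = pd.DataFrame({
    "original": toy["salary"],
    "interpolated": interpolated["salary"],
    "mean_filled": mean_filled["salary"],
})
print(comparison)
```

Interpolation assumes neighbouring rows are related, which holds for time series but is a weak assumption when rows are ordered by an arbitrary ID such as `employee_id`.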


## Question 10: Complete Data Cleaning Pipeline

**Task:** 
Create a complete cleaning pipeline that:
1. Loads the original data
2. Fills null values appropriately for each column type
3. Verifies no nulls remain
4. Saves the cleaned data to 'sales_data_cleaned.csv'

Choose the best strategy for each column based on what you've learned!

In [25]:
# Your code here
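
One possible shape for such a pipeline is sketched below. It is not the notebook author's solution: the function name `clean_sales_data`, the median/placeholder strategy, and the tiny demo file are all assumptions made so the example runs on its own.

```python
import os
import tempfile

import pandas as pd


def clean_sales_data(in_path, out_path):
    """Load a CSV, fill nulls by column type, verify, and save (illustrative)."""
    df = pd.read_csv(in_path)

    # Numeric columns: the median is less sensitive to outliers than the mean
    for col in df.select_dtypes(include="number").columns:
        df[col] = df[col].fillna(df[col].median())

    # Any column that still has nulls is text-like: use an explicit placeholder
    for col in df.columns[df.isnull().any()]:
        df[col] = df[col].fillna("Unknown")

    # Verify that no nulls remain before writing the file
    assert df.isnull().sum().sum() == 0, "nulls remain after cleaning"
    df.to_csv(out_path, index=False)
    return df


# Demo on a tiny hypothetical file written to a temporary directory
tmp_dir = tempfile.mkdtemp()
raw_path = os.path.join(tmp_dir, "toy_raw.csv")
clean_path = os.path.join(tmp_dir, "sales_data_cleaned.csv")
pd.DataFrame({
    "name": ["Ann", None, "Cy"],
    "salary": [50000.0, None, 70000.0],
    "join_date": ["2020-03-15", None, "2021-06-30"],
}).to_csv(raw_path, index=False)

cleaned = clean_sales_data(raw_path, clean_path)
print(cleaned)
```

Treating every non-numeric column the same way is a simplification; a date column such as `join_date` may deserve its own rule rather than an `"Unknown"` placeholder.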


## Bonus Challenge

**Task:** 
- Analyze which method of handling nulls (dropping vs. filling) is better for this dataset
- Calculate summary statistics before and after handling nulls
- Discuss potential biases introduced by different filling methods

In [26]:
# Your code here
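
As a starting point for the comparison, summary statistics for the two approaches can be placed side by side. The sketch below uses one hypothetical numeric column; `toy`, `dropped`, `mean_filled`, and `stats` are illustrative names.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the notebook's dataset
toy = pd.DataFrame({"salary": [40000.0, 50000.0, np.nan, np.nan, 90000.0]})

dropped = toy["salary"].dropna()                       # discard the missing rows
mean_filled = toy["salary"].fillna(toy["salary"].mean())  # replace with the mean

# describe() reports count, mean, std, min, quartiles, and max
stats = pd.DataFrame({
    "dropped": dropped.describe(),
    "mean_filled": mean_filled.describe(),
})
print(stats)
```

Mean filling keeps the mean unchanged but shrinks the standard deviation, because every filled value sits exactly at the centre. That artificially reduced spread is one of the biases the task asks to discuss.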
