In [1]:
import pandas as pd

# Handling Missing Data
Missing data occurs when a value is absent from a dataset.

pandas can represent a missing value with:

- NaN (np.nan, for numeric data)
- None (for object data type)
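
As a small sketch of the two markers above (the variable names `numbers` and `names` are illustrative and not part of the original notebook), pandas reports both `np.nan` and `None` as missing, and a `None` placed in a numeric column is stored as `NaN`:

```python
import numpy as np
import pandas as pd

# A None in a list of numbers is stored as NaN, so the dtype becomes float64
numbers = pd.Series([1, None, 3])
# A None in a column of strings is also reported as missing
names = pd.Series(["Ali", None, "Akram"])

print(numbers.dtype)               # float64
print(pd.isnull(np.nan))           # True
print(pd.isnull(None))             # True
print(numbers.isnull().tolist())   # [False, True, False]
print(names.isnull().tolist())     # [False, True, False]
```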

.isnull() is a pandas method used to check whether each value in a DataFrame or Series is missing (NaN or None). It is an alias of .isna().

It returns True where values are missing and False where values exist.

In [21]:
data={
    "Name":["Ali",  None, "Akram", "Daniyal", "Habib", "Zohran", "Ghafoor", "Rehman"],
    "Age":[29, None, 22, 38, 29, 45, 53, 60],
    "Salary":[50000, None, 45000, 52000, 49000, 70000, 48000, 58000],
    "Performance_Score":[85, None, 78, 92, 88, 95, 80, 54]
}

df=pd.DataFrame(data)
print(df)
print(df.isnull())

      Name   Age   Salary  Performance_Score
0      Ali  29.0  50000.0               85.0
1     None   NaN      NaN                NaN
2    Akram  22.0  45000.0               78.0
3  Daniyal  38.0  52000.0               92.0
4    Habib  29.0  49000.0               88.0
5   Zohran  45.0  70000.0               95.0
6  Ghafoor  53.0  48000.0               80.0
7   Rehman  60.0  58000.0               54.0
    Name    Age  Salary  Performance_Score
0  False  False   False              False
1   True   True    True               True
2  False  False   False              False
3  False  False   False              False
4  False  False   False              False
5  False  False   False              False
6  False  False   False              False
7  False  False   False              False


To count how many values are missing in each column,

we use .isnull().sum()

In [6]:
print(df.isnull().sum())

Name                 1
Age                  1
Salary               1
Performance_Score    1
dtype: int64
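
Counts are often easier to judge as percentages. The following is a minimal sketch on a small made-up frame (the data and the names `missing_count`, `missing_share`, and `total_missing` are assumptions for illustration): because `True` counts as 1, `.mean()` on the boolean mask gives the fraction of missing values per column.

```python
import pandas as pd

# Small illustrative frame: 1 missing Name, 2 missing Age values
df = pd.DataFrame({
    "Name": ["Ali", None, "Akram", "Daniyal"],
    "Age": [29, None, 22, None],
})

missing_count = df.isnull().sum()               # missing values per column
missing_share = df.isnull().mean() * 100        # percentage missing per column
total_missing = int(df.isnull().sum().sum())    # missing values in the whole frame

print(missing_count)
print(missing_share)
print(total_missing)
```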


After detecting missing values, we have two options:
1. remove them
2. or fill them with a replacement value

To remove them we use .dropna(axis=0, inplace=True), where axis=0 drops rows (the default) and axis=1 drops columns.

dropna() is a pandas method used to remove rows or columns that contain missing values (NaN).

In [12]:
df.dropna(inplace=True)
print(df)

      Name   Age   Salary  Performance_Score
0      Ali  29.0  50000.0               85.0
2    Akram  22.0  45000.0               78.0
3  Daniyal  38.0  52000.0               92.0
4    Habib  29.0  49000.0               88.0
5   Zohran  45.0  70000.0               95.0
6  Ghafoor  53.0  48000.0               80.0
7   Rehman  60.0  58000.0               54.0
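
Besides `axis`, `dropna()` accepts `how`, `subset`, and `thresh` to control which rows are removed. The sketch below uses a small made-up frame (the data and result names are illustrative) and returns new frames instead of using `inplace=True`, so the original `df` is left untouched:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Ali", None, "Akram", "Daniyal"],
    "Age": [29, None, 22, None],
    "Salary": [50000, None, 45000, 52000],
})

any_missing = df.dropna()                  # drop rows with at least one missing value
all_missing = df.dropna(how="all")         # drop rows only if every value is missing
by_subset = df.dropna(subset=["Salary"])   # only check the Salary column
by_thresh = df.dropna(thresh=2)            # keep rows with at least 2 non-missing values
by_column = df.dropna(axis=1)              # drop columns with at least one missing value

print(len(any_missing), len(all_missing), len(by_subset), len(by_thresh))
print(by_column.shape)
```

Here every column contains at least one missing value, so `axis=1` removes all of them. This is one reason dropping columns is usually a last resort.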


**.fillna()** is a pandas method used to replace missing values (NaN) with something else.

There are many situations where cleaning (filling NaN) is better than deleting rows, because deleting can throw away a large share of your dataset.

*Example*:

Your dataset has 5000 rows.

Missing values exist in 1000 rows.

If we remove every row that contains a NaN,

we lose 1000 rows: 20% of the data is gone!
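
The arithmetic above can be checked with a quick sketch. The frame below is synthetic (the column names `id` and `score` and the position of the gaps are assumptions made only for this illustration):

```python
import numpy as np
import pandas as pd

# Synthetic data: 5000 rows, with the first 1000 values of "score" missing
df = pd.DataFrame({"id": range(5000), "score": np.arange(5000, dtype=float)})
df.loc[:999, "score"] = np.nan   # .loc slicing is inclusive, so this covers rows 0..999

rows_before = len(df)
rows_after = len(df.dropna())
lost_share = 100 * (rows_before - rows_after) / rows_before

print(rows_before, rows_after, lost_share)   # 5000 4000 20.0
```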



In [24]:
# df.fillna(0, inplace=True)
data={
    "Name":["Ali",  None, "Akram", "Daniyal", "Habib", "Zohran", "Ghafoor", "Rehman"],
    "Age":[29, None, 22, 38, 29, 45, 53, 60],
    "Salary":[50000, None, 45000, 52000, 49000, 70000, 48000, 58000],
    "Performance_Score":[85, None, 78, 92, 88, 95, 80, 54]
}

df=pd.DataFrame(data)


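# Assign the result back: fillna(..., inplace=True) on a selected column is
# deprecated in recent pandas versions and may not update df
df["Salary"] = df["Salary"].fillna(df["Salary"].mean())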
print(df)

      Name   Age        Salary  Performance_Score
0      Ali  29.0  50000.000000               85.0
1     None   NaN  53142.857143                NaN
2    Akram  22.0  45000.000000               78.0
3  Daniyal  38.0  52000.000000               92.0
4    Habib  29.0  49000.000000               88.0
5   Zohran  45.0  70000.000000               95.0
6  Ghafoor  53.0  48000.000000               80.0
7   Rehman  60.0  58000.000000               54.0
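
Different columns often need different replacements. As a hedged sketch (the `"Unknown"` placeholder, the choice of median for Age, and the name `fill_values` are assumptions for illustration, not a recommendation from the original notebook), `fillna()` also accepts a dictionary that maps column names to fill values:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Ali", None, "Akram"],
    "Age": [29, None, 22],
    "Salary": [50000, None, 45000],
})

# One fill value per column; statistics are computed from the non-missing values
fill_values = {
    "Name": "Unknown",               # placeholder text for a missing name
    "Age": df["Age"].median(),       # 25.5
    "Salary": df["Salary"].mean(),   # 47500.0
}

filled = df.fillna(fill_values)      # returns a new DataFrame; df is unchanged
print(filled)
```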


**Interpolation** is useful when you have missing values (NaN) in data that is ordered or follows a trend. Instead of deleting rows or filling with a fixed value, interpolation estimates each missing value from its neighbours, which preserves the structure and trend of the data.

- It preserves trends in the data.
- It keeps all rows and fills only the missing values.
- It estimates values according to the sequence instead of using arbitrary replacements.
- It fills gaps that would otherwise break line charts, trend plots, or moving averages, so plots stay continuous.

In [25]:
data={
    "Time":[1,2,3,4,5],
    "Value":[10,None, 30, None, 50]
}

df = pd.DataFrame(data)
print("Before interpolation")
print(df)


# Assign the result back instead of using inplace=True on a selected column
df['Value'] = df['Value'].interpolate(method='linear')
print("After interpolation")
print(df)

Before interpolation
   Time  Value
0     1   10.0
1     2    NaN
2     3   30.0
3     4    NaN
4     5   50.0
After interpolation
   Time  Value
0     1   10.0
1     2   20.0
2     3   30.0
3     4   40.0
4     5   50.0
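
To see why interpolation can suit ordered data better than a fixed fill, the sketch below compares it with a mean fill on the same values. It also shows that `method="linear"` treats the values as equally spaced, while `method="index"` takes the index into account. The `uneven` series and all variable names are made up for this illustration.

```python
import pandas as pd

values = pd.Series([10, None, 30, None, 50])

mean_filled = values.fillna(values.mean())     # every gap gets the same value, 30.0
linear = values.interpolate(method="linear")   # gaps follow the neighbouring values

print(mean_filled.tolist())   # [10.0, 30.0, 30.0, 30.0, 50.0]
print(linear.tolist())        # [10.0, 20.0, 30.0, 40.0, 50.0]

# Unevenly spaced index: the gap sits at 2, between index values 1 and 10
uneven = pd.Series([10, None, 100], index=[1, 2, 10])
by_position = uneven.interpolate(method="linear")   # ignores the index
by_index = uneven.interpolate(method="index")       # weights by index distance

print(by_position[2], by_index[2])
```

With evenly spaced data the two methods agree. With uneven spacing, `method="index"` places the estimate closer to the nearer neighbour.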
