### 📅 Day 3: Data Cleaning & Missing Values


In [48]:
import numpy as np
import pandas as pd

Detecting Missing Values

In [49]:
data = {"Name":["Asha","Rahul","Meena",None],
        "Age":[21,None,23,22],
        "Marks":[85,90,np.nan,95]}
df = pd.DataFrame(data)
df

,Name,Age,Marks
0,Asha,21.0,85.0
1,Rahul,,90.0
2,Meena,23.0,
3,,22.0,95.0


In [50]:
df.isnull()

,Name,Age,Marks
0,False,False,False
1,False,True,False
2,False,False,True
3,True,False,False


In [51]:
df.isnull().sum()

Name     1
Age      1
Marks    1
dtype: int64
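
Beyond per-column counts, the same boolean mask can answer a few related questions. The snippet below is a minimal sketch on the same sample data; the variable names `missing_share`, `rows_with_gaps`, and `total_missing` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Name": ["Asha", "Rahul", "Meena", None],
                   "Age": [21, None, 23, 22],
                   "Marks": [85, 90, np.nan, 95]})

# Fraction of missing entries per column (the mean of a boolean mask)
missing_share = df.isnull().mean()

# Rows that contain at least one missing value
rows_with_gaps = df[df.isnull().any(axis=1)]

# Total number of missing cells in the whole DataFrame
total_missing = int(df.isnull().sum().sum())

print(missing_share)
print(rows_with_gaps)
print(total_missing)
```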

Dropping Missing Values

In [52]:
df.dropna()

,Name,Age,Marks
0,Asha,21.0,85.0


In [53]:
df.dropna(axis=1)

0
1
2
3
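
The result above has no columns because every column in this sample contains one missing value, so `axis=1` removes them all and only the index remains. As a rough sketch of gentler options, `dropna` also accepts `subset` and `how`; the variable names below are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Name": ["Asha", "Rahul", "Meena", None],
                   "Age": [21, None, 23, 22],
                   "Marks": [85, 90, np.nan, 95]})

# Every column has a gap, so dropping columns with any NaN leaves none
no_columns = df.dropna(axis=1)

# Drop only the rows where Age is missing
age_known = df.dropna(subset=["Age"])

# Drop only the rows where every value is missing (there are none here)
not_all_missing = df.dropna(how="all")

print(no_columns.shape)
print(age_known)
print(len(not_all_missing))
```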


Filling Missing Values

In [54]:
df.fillna(0)

,Name,Age,Marks
0,Asha,21.0,85.0
1,Rahul,0.0,90.0
2,Meena,23.0,0.0
3,0,22.0,95.0


In [55]:
df["Age"].fillna(df["Age"].mean())

0    21.0
1    22.0
2    23.0
3    22.0
Name: Age, dtype: float64

In [56]:
df["Marks"].fillna(df["Marks"].mean())

0    85.0
1    90.0
2    90.0
3    95.0
Name: Marks, dtype: float64
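
Note that `fillna` returns a new object and leaves `df` unchanged, so the two cells above only display the filled columns. To keep the result, assign it back. A minimal sketch, working on a copy named `df_filled` (a name chosen here for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Name": ["Asha", "Rahul", "Meena", None],
                   "Age": [21, None, 23, 22],
                   "Marks": [85, 90, np.nan, 95]})

# Work on a copy so the original frame stays available for comparison
df_filled = df.copy()

# Assign the filled Series back to store the result
df_filled["Age"] = df_filled["Age"].fillna(df_filled["Age"].mean())
df_filled["Marks"] = df_filled["Marks"].fillna(df_filled["Marks"].mean())

print(df_filled)
print(df["Age"].isnull().sum())  # the original frame still has its gap
```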

Create a DataFrame with some missing values.

In [57]:
d = {"one":[2,3,None],
     "two":[4,None,6],
     "three":[7,8,9]}
df = pd.DataFrame(d)
df

,one,two,three
0,2.0,4.0,7
1,3.0,,8
2,,6.0,9


Drop rows with missing values.

In [58]:
df.dropna()

,one,two,three
0,2.0,4.0,7


Fill missing values with mean/median.

In [59]:
df.fillna(df.mean())

,one,two,three
0,2.0,4.0,7
1,3.0,5.0,8
2,2.5,6.0,9
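
The cell above uses the mean only. A median fill works the same way with `df.median()`. On this small frame the two give the same values, because each incomplete column has just two known numbers. The second part of the sketch uses a made-up Series, `skewed`, to show where they differ; it is an illustration and not part of the exercise data.

```python
import pandas as pd

df = pd.DataFrame({"one": [2, 3, None],
                   "two": [4, None, 6],
                   "three": [7, 8, 9]})

# Fill each column's gaps with that column's median
filled_median = df.fillna(df.median())

# Hypothetical data with an outlier: the mean is pulled up, the median is not
skewed = pd.Series([1.0, 2.0, 100.0, None])
by_mean = skewed.fillna(skewed.mean())
by_median = skewed.fillna(skewed.median())

print(filled_median)
print(by_mean.iloc[-1], by_median.iloc[-1])
```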


Check for null values in a DataFrame.


In [60]:
data = {"Name":["Asha","Rahul","Meena",None],
        "Age":[21,None,23,22],
        "Marks":[85,90,np.nan,None]}
df = pd.DataFrame(data)
df

,Name,Age,Marks
0,Asha,21.0,85.0
1,Rahul,,90.0
2,Meena,23.0,
3,,22.0,


In [61]:
df.isnull()

,Name,Age,Marks
0,False,False,False
1,False,True,False
2,False,False,True
3,True,False,True


Fill missing values with 0.


In [62]:
df.fillna(0)

,Name,Age,Marks
0,Asha,21.0,85.0
1,Rahul,0.0,90.0
2,Meena,23.0,0.0
3,0,22.0,0.0


Fill missing values in the Age column with the mean.


In [63]:
df["Age"].fillna(df["Age"].mean())

0    21.0
1    22.0
2    23.0
3    22.0
Name: Age, dtype: float64

Drop rows that have more than one missing value (thresh).


In [64]:
df.dropna(thresh=df.shape[1]-1)

,Name,Age,Marks
0,Asha,21.0,85.0
1,Rahul,,90.0
2,Meena,23.0,
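
To see why row 3 is dropped: `thresh` is the minimum number of non-null values a row needs in order to be kept. Here `df.shape[1]-1` is 2, and row 3 has only one non-null value. The sketch below makes the per-row counts explicit; `non_null_counts` and `kept` are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Name": ["Asha", "Rahul", "Meena", None],
                   "Age": [21, None, 23, 22],
                   "Marks": [85, 90, np.nan, None]})

# Number of non-null values in each row
non_null_counts = df.notna().sum(axis=1)

# Keep rows with at least (number of columns - 1) non-null values
min_non_null = df.shape[1] - 1
kept = df.dropna(thresh=min_non_null)

# The same selection written as an explicit boolean filter
kept_manual = df[non_null_counts >= min_non_null]

print(non_null_counts)
print(kept)
```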


Replace missing values in a column with the previous row's value (ffill).


In [66]:
data = {"Name":["Asha","Rahul","Meena",None],
        "Age":[21,None,23,22],
        "Marks":[85,90,np.nan,95]}
df = pd.DataFrame(data)
df

,Name,Age,Marks
0,Asha,21.0,85.0
1,Rahul,,90.0
2,Meena,23.0,
3,,22.0,95.0


In [70]:
df["Age"] = df["Age"].ffill()

In [71]:
df

,Name,Age,Marks
0,Asha,21.0,85.0
1,Rahul,21.0,90.0
2,Meena,23.0,
3,,22.0,95.0
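
Forward fill needs an earlier value to copy, so a gap in the first row stays missing. Its counterpart `bfill` copies the next row's value instead. A minimal sketch of both; the Series `leading_gap` is a made-up example, not part of the notebook data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Name": ["Asha", "Rahul", "Meena", None],
                   "Age": [21, None, 23, 22],
                   "Marks": [85, 90, np.nan, 95]})

# Backward fill: the gap in row 1 takes the value from row 2
age_bfill = df["Age"].bfill()

# Hypothetical Series whose first value is missing
leading_gap = pd.Series([np.nan, 1.0, np.nan])

# ffill cannot fill position 0 because nothing precedes it
leading_ffill = leading_gap.ffill()

print(age_bfill)
print(leading_ffill)
```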
