# Data Cleaning
- Data is usually messy, with:
    - Missing values - blanks, NaN, None
    - Incomplete rows/columns
    - Inconsistent placeholders - NA, '', ?
- Handle missing data before:
    - Data analysis
    - Training ML models
    - Visualization
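
The inconsistent placeholders listed above (NA, '', ?) are plain strings, so pandas does not count them as missing until they are mapped to NaN. A minimal sketch, assuming a hypothetical `raw` frame and a hand-picked `placeholders` list (neither comes from the notebook):

```python
import numpy as np
import pandas as pd

# Hypothetical raw data using several different "missing" markers
raw = pd.DataFrame({
    'Name': ['Alice', 'NA', 'Carol'],
    'Age': ['25', '?', ''],
})

# Map every placeholder to NaN so pandas treats them all as missing
placeholders = ['NA', '', '?']
clean = raw.replace(placeholders, np.nan)

# Age was stored as text; convert it to a numeric column
clean['Age'] = pd.to_numeric(clean['Age'])

print(clean.isnull().sum())
```

When the data comes from a file, `pd.read_csv` also accepts an `na_values` argument that does the same mapping at load time.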

# Missing Data

- Pandas uses NaN (Not a Number) from NumPy to mark missing values in numeric columns
- NaN is also used for missing values in object (string) columns
- None is also treated as missing; in numeric columns pandas converts it to NaN
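
A small sketch of how None ends up as a missing value. The `ages` and `cities` Series are made up for illustration:

```python
import numpy as np
import pandas as pd

# None in a numeric column is stored as NaN, so the column becomes float64
ages = pd.Series([25, None, 35])
print(ages.dtype)
print(ages.isnull())

# In a text column both None and np.nan are reported as missing
cities = pd.Series(['Pune', None, np.nan])
print(cities.isnull())
```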

`NaN` is a floating-point value.
Any equality comparison with it (even `NaN == NaN`) -> False

- Solution: test for missing values with isnull() / notnull() instead of ==
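
To see why `==` cannot be used to find missing values, here is a short sketch. The names `value`, `mask_wrong` and `mask_right` are illustrative only:

```python
import numpy as np
import pandas as pd

value = np.nan

print(value == np.nan)    # NaN never equals anything, itself included
print(pd.isnull(value))   # the reliable way to test for missing

ages = pd.Series([25, np.nan, 35])
mask_wrong = ages == np.nan   # all False, so it finds nothing
mask_right = ages.isnull()    # True where the value is missing
```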

### Check for Missing Values
- isnull() - returns DF/Series of booleans showing True if missing

In [2]:
import pandas as pd
import numpy as np

data = {
    'Name': ['Alice', 'Bob', np.nan, 'David'],
    'Age': [25, np.nan, 35, 40],
    'City': ['Pune', 'Delhi', 'Mumbai', None]
}

df = pd.DataFrame(data)
df

Unnamed: 0,Name,Age,City
0,Alice,25.0,Pune
1,Bob,,Delhi
2,,35.0,Mumbai
3,David,40.0,


In [3]:
df.isnull()

Unnamed: 0,Name,Age,City
0,False,False,False
1,False,True,False
2,True,False,False
3,False,False,True


In [None]:
df.notnull()  # opposite of isnull(): False where the value is missing

Unnamed: 0,Name,Age,City
0,True,True,True
1,True,False,True
2,False,True,True
3,True,True,False


In [None]:
# count missing values per col
df.isnull().sum()

# total number of NaN in the DF (only this last result is displayed)
df.isnull().sum().sum()

3
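
The cell above displays only the total. A sketch that keeps the per-column counts as well and adds the share of missing rows, using a copy of the same data under the hypothetical name `sample`:

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    'Name': ['Alice', 'Bob', np.nan, 'David'],
    'Age': [25, np.nan, 35, 40],
    'City': ['Pune', 'Delhi', 'Mumbai', None],
})

per_col = sample.isnull().sum()    # missing count for each column
total = int(per_col.sum())         # missing count for the whole frame
share = sample.isnull().mean()     # fraction of rows missing per column

print(per_col)
print(total)
print(share)
```

`mean()` works here because True counts as 1 and False as 0.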

### Removing Missing Data 
- dropna()

In [7]:
# 1. Remove rows with any missing val (returns a new DF; df itself is unchanged)

df_cleaned = df.dropna()
df_cleaned

Unnamed: 0,Name,Age,City
0,Alice,25.0,Pune


In [11]:
# 2. Remove cols with any missing val (every col has one here, so only the index is left)

df.dropna(axis=1)

0
1
2
3


In [12]:
# 3. Remove rows where all cols are missing (none here, so nothing is dropped)

df.dropna(how='all')

Unnamed: 0,Name,Age,City
0,Alice,25.0,Pune
1,Bob,,Delhi
2,,35.0,Mumbai
3,David,40.0,


In [13]:
# 4. Drop rows where any of the given cols is missing

df.dropna(subset=['Age', 'City'])

Unnamed: 0,Name,Age,City
0,Alice,25.0,Pune
2,,35.0,Mumbai
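
Besides `how` and `subset`, `dropna()` also takes a `thresh` argument: the minimum number of non-missing values a row needs in order to be kept. A minimal sketch on made-up data (`sample` is not the notebook's `df`):

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    'Name': ['Alice', 'Bob', np.nan, np.nan],
    'Age': [25, np.nan, 35, np.nan],
    'City': ['Pune', 'Delhi', 'Mumbai', 'Goa'],
})

# Keep rows that have at least 2 non-missing values
kept = sample.dropna(thresh=2)
print(kept)
```

Only the last row, which has a single non-missing value, is dropped.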


### Filling Missing Data
- fillna()

In [15]:
# fill every missing val with one value (Age becomes a mixed-type col)

df_filled = df.fillna('Unknown')
df_filled

Unnamed: 0,Name,Age,City
0,Alice,25.0,Pune
1,Bob,Unknown,Delhi
2,Unknown,35.0,Mumbai
3,David,40.0,Unknown


In [26]:
# fill missing val in a numeric col with a specific val

df['Age'] = df['Age'].fillna(0)  # replaces only the missing Age values with 0; overwrites df['Age']
df

Unnamed: 0,Name,Age,City
0,Alice,25.0,Pune
1,Bob,0.0,Delhi
2,,35.0,Mumbai
3,David,40.0,


In [27]:
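# fill using mean, median, or mode (pick one)
# note: Age has no NaN left after fillna(0) above, so these calls change nothing here

df['Age'] = df['Age'].fillna(df['Age'].mean())

df['Age'] = df['Age'].fillna(df['Age'].median())

df['Age'] = df['Age'].fillna(df['Age'].mode()[0])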
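
Because `Age` was already filled with 0 in the earlier cell, the calls above have nothing left to fill. A sketch on a fresh, made-up `ages` Series shows what each statistic would actually put in the gap:

```python
import numpy as np
import pandas as pd

ages = pd.Series([25, np.nan, 35, 40, 35], name='Age')

# Each statistic skips NaN by default, then fills the gap with the result
mean_filled = ages.fillna(ages.mean())        # mean of 25, 35, 40, 35
median_filled = ages.fillna(ages.median())    # middle value
mode_filled = ages.fillna(ages.mode()[0])     # most frequent value

print(mean_filled.iloc[1], median_filled.iloc[1], mode_filled.iloc[1])
```

`mode()` returns a Series because ties are possible, so `[0]` picks the first (smallest) of the most frequent values.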

In [32]:
# forward fill - fills with prev row val
# (df.ffill() replaces fillna(method='ffill'), which is deprecated in recent pandas)
df_ffill = df.ffill()
df_ffill


Unnamed: 0,Name,Age,City
0,Alice,25.0,Pune
1,Bob,0.0,Delhi
2,Bob,35.0,Mumbai
3,David,40.0,Mumbai


In [33]:
# backward fill - fills with next row val
# (df.bfill() replaces fillna(method='bfill'), which is deprecated in recent pandas)
df_bfill = df.bfill()
df_bfill


Unnamed: 0,Name,Age,City
0,Alice,25.0,Pune
1,Bob,0.0,Delhi
2,David,35.0,Mumbai
3,David,40.0,
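
In the backward-fill output above, City in the last row stays empty because there is no next row to copy from. Forward fill has the same gap at the first row. One possible way around it, sketched on a made-up `cities` Series, is to chain the two:

```python
import numpy as np
import pandas as pd

cities = pd.Series([np.nan, 'Delhi', np.nan, 'Mumbai', np.nan])

forward = cities.ffill()         # first value stays missing: nothing before it
backward = cities.bfill()        # last value stays missing: nothing after it
both = cities.ffill().bfill()    # forward fill, then back fill what is left

print(both)
```

Whether copying neighbouring values is reasonable depends on the data; it usually only makes sense when the rows have a meaningful order.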
