### Handling Missing Data

Types of missing data
1. NaN (Not a Number): the missing-value marker for floating-point data (numpy.nan)
2. None: Python's null object, treated as missing in object and string columns
3. NaT (Not a Time): the missing-value marker for datetime-like data
4. pd.NA: the missing-value marker for pandas' nullable integer, boolean, and string dtypes. Unlike NaN, it does not force an integer column to be converted to float.
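The four markers listed above can be compared side by side. The following is a minimal sketch, not part of the original notebook; the names `markers`, `nullable_ints`, and `float_values` are made up for illustration.

```python
import numpy as np
import pandas as pd

# One example of each missing-value marker
markers = {
    "NaN": np.nan,
    "None": None,
    "NaT": pd.NaT,
    "pd.NA": pd.NA,
}

# pd.isna() recognises all four markers as missing
detected = {name: bool(pd.isna(value)) for name, value in markers.items()}
print(detected)

# With the nullable "Int64" dtype the gap is stored as pd.NA and the column stays integer
nullable_ints = pd.Series([1, None, 3], dtype="Int64")

# With the default inference the same gap becomes NaN and the column becomes float64
float_values = pd.Series([1, None, 3])

print(nullable_ints.dtype, float_values.dtype)
```

The dtype difference in the last two Series is the practical reason to prefer pd.NA when a column should stay integer, boolean, or string.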

In [105]:
import numpy as np
import pandas as pd
pd.reset_option('display.max_rows')
pd.set_option('display.width', 150)

In [111]:
df = pd.read_csv('D:/Code/practice_dataset.csv')
df.head()

Unnamed: 0,ID,Name,Age,Gender,Department,Joining_Date,Salary,Performance_Score,Remote_Work
0,1.0,Person_1,56,Female,HR,2018-10-01,71919.0,Good,True
1,2.0,Person_2,46,Female,Finance,2020-06-09,35600.0,Average,False
2,3.0,Person_3,32,Male,IT,2019-03-15,59124.0,Excellent,False
3,4.0,Person_4,60,Male,HR,2015-05-27,57643.0,Good,False
4,5.0,Person_5,25,Male,HR,2023-10-25,70764.0,Average,False


Checking for missing values

In [54]:
# number of missing values in each column
df.isnull().sum()


ID                   100
Name                   2
Age                    0
Gender                 2
Department             3
Joining_Date           2
Salary                 4
Performance_Score      2
Remote_Work            0
dtype: int64
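Besides raw counts, it can help to see the share of missing values per column and the rows that are affected. This is a small sketch on a made-up frame (`sample` is a hypothetical stand-in, not the practice dataset).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the practice dataset
sample = pd.DataFrame({
    "Name": ["Person_1", None, "Person_3", "Person_4"],
    "Salary": [71919.0, np.nan, np.nan, 57643.0],
    "Age": [56, 46, 32, 60],
})

# isna() gives a boolean frame; the mean of booleans is the fraction of True values
missing_pct = sample.isna().mean() * 100
print(missing_pct)

# Rows that contain at least one missing value
incomplete_rows = sample[sample.isna().any(axis=1)]
print(incomplete_rows)
```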

### Handling Missing Values
1. Removing missing values: if rows or columns with missing data are not needed, drop them
2. Filling missing values: replace each missing value with a constant or a calculated value (mean, median, or a cumulative value)
3. Interpolation: estimate each missing value from the values around it
4. Replacing with domain knowledge: use a value that makes sense for the specific dataset
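Replacing with domain knowledge is the one strategy in this list that the cells below do not demonstrate. As a rough sketch, assume an explicit label is acceptable for unknown categories; the labels "Unassigned" and "Not Rated" and the frame `sample` are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with gaps in two categorical columns and one numeric column
sample = pd.DataFrame({
    "Department": ["HR", None, "IT"],
    "Performance_Score": ["Good", "Average", None],
    "Salary": [71919.0, 35600.0, np.nan],
})

# Assumed domain rules: one explicit replacement value per column
domain_defaults = {"Department": "Unassigned", "Performance_Score": "Not Rated"}

# A dict passed to fillna fills only the listed columns; Salary is left untouched
filled = sample.fillna(value=domain_defaults)
print(filled)
```

Columns that are not listed in the dict keep their missing values, so they can still be handled by a different strategy.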

Removing

In [37]:
# dropna(axis): axis=0 (the default) drops rows with missing values, axis=1 drops columns
new_df = df.dropna()        # df.dropna(axis=1, inplace=True) would drop columns and modify df itself
print(new_df)
print(df.shape)
print(new_df.shape)

       ID        Name    Age  Gender Department Joining_Date   Salary Performance_Score  Remote_Work
0     1.0    Person_1  40.88  Female         HR   2018-10-01  71919.0              Good         True
1     2.0    Person_2  40.88  Female    Finance   2020-06-09  35600.0           Average        False
2     3.0    Person_3  40.88    Male         IT   2019-03-15  59124.0         Excellent        False
3     4.0    Person_4  40.88    Male         HR   2015-05-27  57643.0              Good        False
4     5.0    Person_5  40.88    Male         HR   2023-10-25  70764.0           Average        False
..    ...         ...    ...     ...        ...          ...      ...               ...          ...
93   94.0   Person_94  40.88  Female         HR   2021-10-21  88510.0              Good        False
94   95.0   Person_95  40.88  Female    Finance   2019-12-05  43403.0         Excellent         True
95   96.0   Person_96  40.88    Male  Marketing   2020-09-05  62097.0         Excellent    
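Dropping every row with any missing value can discard a lot of data. The sketch below shows the `subset`, `thresh`, and `axis` parameters of `dropna` on a made-up frame; `sample` and the result names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: Age is complete, the other columns have gaps
sample = pd.DataFrame({
    "Name": ["Person_1", None, "Person_3", "Person_4"],
    "Department": ["HR", "Finance", None, "IT"],
    "Salary": [71919.0, 35600.0, np.nan, 57643.0],
    "Age": [56, 46, 32, 60],
})

# Drop a row only when the listed column is missing
rows_with_name = sample.dropna(subset=["Name"])

# Keep rows that have at least 3 non-missing values
mostly_complete = sample.dropna(thresh=3)

# Drop every column that contains a missing value
complete_columns = sample.dropna(axis=1)

print(rows_with_name.shape, mostly_complete.shape, complete_columns.shape)
```

All three calls return new frames, so `sample` itself is unchanged.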

Filling

In [45]:
# hard-coded value
# df.fillna(0)

# calculated value: fillna returns a new Series, so assign it back to keep the result
df['Age'] = df['Age'].fillna(df['Age'].mean())
df

Unnamed: 0,ID,Name,Age,Gender,Department,Joining_Date,Salary,Performance_Score,Remote_Work
0,1.0,Person_1,56,Female,HR,2018-10-01,71919.0,Good,True
1,2.0,Person_2,46,Female,Finance,2020-06-09,35600.0,Average,False
2,3.0,Person_3,32,Male,IT,2019-03-15,59124.0,Excellent,False
3,4.0,Person_4,60,Male,HR,2015-05-27,57643.0,Good,False
4,5.0,Person_5,25,Male,HR,2023-10-25,70764.0,Average,False
...,...,...,...,...,...,...,...,...,...
95,96.0,Person_96,24,Male,Marketing,2020-09-05,62097.0,Excellent,True
96,97.0,Person_97,26,Female,Marketing,2015-08-14,43121.0,Good,True
97,98.0,Person_98,41,Female,Finance,2017-03-08,85071.0,,False
98,,Person_99,18,Male,Marketing,2021-05-20,40966.0,Excellent,False
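The mean is only one possible fill value. As a sketch under the assumption that the median suits a skewed numeric column and the most frequent value suits a categorical one, the following uses a made-up frame (`sample` and `readings` are hypothetical names).

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one numeric and one categorical gap
sample = pd.DataFrame({
    "Salary": [50000.0, np.nan, 70000.0, 90000.0],
    "Department": ["HR", "HR", None, "IT"],
})

# Numeric column: fill with the median, which is less sensitive to outliers than the mean
sample["Salary"] = sample["Salary"].fillna(sample["Salary"].median())

# Categorical column: fill with the most frequent value (mode() returns a Series)
sample["Department"] = sample["Department"].fillna(sample["Department"].mode()[0])

# Ordered data: carry the last known value forward
readings = pd.Series([1.0, np.nan, np.nan, 4.0])
carried_forward = readings.ffill()

print(sample)
print(carried_forward.tolist())
```

Each result is assigned back, because `fillna` and `ffill` return new objects rather than changing the original.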


Interpolation

In [112]:
print(f"Before interpolation \nNumber of null values in ID column: {df['ID'].isna().sum()}")

# interpolation needs numeric data, so convert the ID column to a numeric dtype first
df['ID'] = pd.to_numeric(df['ID'])

# interpolate(method, axis, ...)
# df['ID'].interpolate(method='linear', inplace=True)  # not recommended: in-place changes on a selected column may not reach df

# Interpolate the column and assign the result back
# df[['ID']] = df[['ID']].interpolate(method='linear')  # DataFrame-level alternative
df['ID'] = df['ID'].interpolate(method='linear')
print(f"After interpolation \nNumber of null values: {df['ID'].isna().sum()}")
df


Before interpolation 
Number of null values in ID column: 2
After interpolation 
Number of null values: 0


Unnamed: 0,ID,Name,Age,Gender,Department,Joining_Date,Salary,Performance_Score,Remote_Work
0,1.0,Person_1,56,Female,HR,2018-10-01,71919.0,Good,True
1,2.0,Person_2,46,Female,Finance,2020-06-09,35600.0,Average,False
2,3.0,Person_3,32,Male,IT,2019-03-15,59124.0,Excellent,False
3,4.0,Person_4,60,Male,HR,2015-05-27,57643.0,Good,False
4,5.0,Person_5,25,Male,HR,2023-10-25,70764.0,Average,False
...,...,...,...,...,...,...,...,...,...
95,96.0,Person_96,24,Male,Marketing,2020-09-05,62097.0,Excellent,True
96,97.0,Person_97,26,Female,Marketing,2015-08-14,43121.0,Good,True
97,98.0,Person_98,41,Female,Finance,2017-03-08,85071.0,,False
98,99.0,Person_99,18,Male,Marketing,2021-05-20,40966.0,Excellent,False
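To see what linear interpolation computes, here is a small worked sketch on made-up Series (`ids` and `raw_ids` are hypothetical). Each gap is filled with evenly spaced values between its two known neighbours.

```python
import numpy as np
import pandas as pd

# Two consecutive gaps between the known values 1.0 and 4.0
ids = pd.Series([1.0, np.nan, np.nan, 4.0, 5.0])
filled = ids.interpolate(method="linear")
print(filled.tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0]

# Text data has to be converted first; errors="coerce" turns unparseable entries into NaN
raw_ids = pd.Series(["1", "2", "n/a", "4"])
numeric_ids = pd.to_numeric(raw_ids, errors="coerce")
repaired = numeric_ids.interpolate(method="linear")
print(repaired.tolist())  # [1.0, 2.0, 3.0, 4.0]
```

This works well for an evenly increasing column such as ID, but it would be a poor choice for values with no natural order between rows.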
