# Filling Nulls

Handling missing or null values in a dataset is a common challenge in data analysis. Incomplete or missing data can impact the accuracy of analysis and modeling results. Here are some common methods for filling null values in a dataset:

## Mean/Median/Mode Imputation

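In this method, missing values are replaced by the mean, median, or mode of the non-null values in the same column. It is simple to apply. The mean suits numeric data that is roughly symmetric (for example, normally distributed), the median is more robust when the data is skewed or contains outliers, and the mode is the usual choice for categorical columns.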
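As a minimal sketch of mean and median imputation, the snippet below fills each column with its own statistic. The frame `toy` and the names `filled_mean` and `filled_median` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few missing cells
toy = pd.DataFrame({
    "col1": [1, 2, None, 4, None],
    "col2": [1, None, 3, 4, 5],
})

# DataFrame.mean() and DataFrame.median() skip NaN by default,
# so each column is filled with the statistic of its non-null values.
filled_mean = toy.fillna(toy.mean())
filled_median = toy.fillna(toy.median())

print(filled_mean)
print(filled_median)
```

Passing a Series of per-column statistics to `fillna` matches on column names, so every column gets its own fill value.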

In [37]:
# Data Preprocessing

## Importing packages
import pandas as pd
import numpy as np

## Making data
d = {'ID': [1, 22, 333, 4444, 55555], 'col1': [1,2,None,4,None], 'col2': [1,None,3,4,5]}
df = pd.DataFrame(data=d)
df

|   | ID | col1 | col2 |
|---|---|---|---|
| 0 | 1 | 1.0 | 1.0 |
| 1 | 22 | 2.0 | NaN |
| 2 | 333 | NaN | 3.0 |
| 3 | 4444 | 4.0 | 4.0 |
| 4 | 55555 | NaN | 5.0 |


In [16]:
# Number of missing values per column
df.isnull().sum()

ID      0
col1    2
col2    1
dtype: int64

In [17]:
# Boolean mask of nulls, plus the rows and columns with and without missing values
missing_values = df.isnull()
rows_with_missing_values = df[df.isnull().any(axis=1)]
rows_without_missing_values = df.dropna()
columns_with_missing_values = df.loc[:, df.isnull().any()]
columns_without_missing_values = df.loc[:, df.notnull().all()]
missing_values

|   | ID | col1 | col2 |
|---|---|---|---|
| 0 | False | False | False |
| 1 | False | False | True |
| 2 | False | True | False |
| 3 | False | False | False |
| 4 | False | True | False |


### 1) Method 1 (Forward Fill Along Rows)

In [34]:
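# Fill each null with the value to its left in the same row (forward fill along axis=1), for all columns.
# A null in the first column has no left neighbor and stays NaN.
# Because ID is the leftmost column, nulls in col1 are filled with the ID value from the same row.
df1 = df.ffill(axis=1)  # fillna(method='ffill') is deprecated in recent pandas versions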
df1

|   | ID | col1 | col2 |
|---|---|---|---|
| 0 | 1.0 | 1.0 | 1.0 |
| 1 | 22.0 | 2.0 | 2.0 |
| 2 | 333.0 | 333.0 | 3.0 |
| 3 | 4444.0 | 4.0 | 4.0 |
| 4 | 55555.0 | 55555.0 | 5.0 |


### 2) Method 2 (Mode)

In [36]:
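# Fill nulls with the mode of each column, for all columns.
for item in df.columns:
    # Most frequent non-null value in this column; when several values tie,
    # the one that value_counts() happens to list first is used
    mode_value_every_column = df[item].value_counts().index[0]
    # Assign the result back: inplace=True on a column selection may not update df
    df[item] = df[item].fillna(mode_value_every_column)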
df

|   | ID | col1 | col2 |
|---|---|---|---|
| 0 | 1 | 1.0 | 1.0 |
| 1 | 22 | 2.0 | 5.0 |
| 2 | 333 | 4.0 | 3.0 |
| 3 | 4444 | 4.0 | 4.0 |
| 4 | 55555 | 4.0 | 5.0 |


### 3) Method 3 (Mode, Single Column)

In [38]:
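# Fill nulls with the mode in a single column (col1), using an explicit loop.
col1_without_null = []
mode_col1 = df['col1'].value_counts().index[0]
for item in df['col1']:
    # NaN never compares equal to anything, including itself, so test with pd.isna()
    if pd.isna(item):
        col1_without_null.append(mode_col1)
    else:
        # Keep the existing non-null value
        col1_without_null.append(item)

df['col1'] = col1_without_null
df

|   | ID | col1 | col2 |
|---|---|---|---|
| 0 | 1 | 1.0 | 1.0 |
| 1 | 22 | 2.0 | NaN |
| 2 | 333 | 4.0 | 3.0 |
| 3 | 4444 | 4.0 | 4.0 |
| 4 | 55555 | 4.0 | 5.0 |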


## KNN Imputation

This method uses k-nearest neighbors to fill in missing values. For each row with a missing value, it finds the k most similar rows (measured on the features that are present) and fills the gap with the mean of those neighbors' values for that feature.
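One way to try this is scikit-learn's `KNNImputer`, sketched below on a small array. The data and the choice of `n_neighbors=2` are assumptions made for illustration only.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical data: the second feature is missing in row 1
X = np.array([
    [1.0, 2.0],
    [2.0, np.nan],
    [3.0, 6.0],
    [4.0, 8.0],
])

# Distances are computed on the features both rows have in common;
# the missing entry becomes the unweighted mean of the 2 nearest rows.
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)

print(X_filled)
```

Here rows 0 and 2 are the closest to row 1 on the first feature, so the missing entry is filled with the mean of 2.0 and 6.0. Because the neighbors are chosen by distance, features on very different scales usually need scaling first.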

## Linear Interpolation

In this method, each missing value is filled by linear interpolation between the nearest known values before and after it. This method is best suited for ordered data such as time series.
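A minimal sketch with pandas, assuming a hypothetical Series `s` whose values are evenly spaced in time:

```python
import numpy as np
import pandas as pd

# Hypothetical ordered series with two gaps
s = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, 6.0])

# method="linear" treats the values as equally spaced and ignores the index
interp = s.interpolate(method="linear")

print(interp.tolist())
```

For irregularly spaced timestamps, `method="time"` on a Series with a `DatetimeIndex` weights the fill by the actual time gaps instead.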

## Multiple Imputation

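In this method, several completed copies of the dataset are created, each with the missing values filled by a different random draw from a model of the data. The analysis is run on every copy separately, and the resulting estimates are then pooled (for example, averaged) so that the final result reflects the uncertainty introduced by the missing values.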
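The idea can be sketched with scikit-learn's experimental `IterativeImputer`, drawing each imputation from the posterior with a different seed. The synthetic data, `m = 5`, and the choice of the column mean as the analysis are all assumptions for illustration. Only the point estimates are pooled here; a full analysis would also combine the variances.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the experimental class)
from sklearn.impute import IterativeImputer

# Hypothetical data: y depends on x, and roughly 20% of y is hidden
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x + rng.normal(scale=0.5, size=100)
X = np.column_stack([x, y])
X[rng.random(100) < 0.2, 1] = np.nan

m = 5  # number of imputed datasets
estimates = []
for seed in range(m):
    # sample_posterior=True draws imputations at random instead of using the point prediction
    imputer = IterativeImputer(sample_posterior=True, random_state=seed)
    completed = imputer.fit_transform(X)
    # The "analysis" in this sketch is just the mean of the second column
    estimates.append(completed[:, 1].mean())

# Pool the per-dataset estimates by averaging
pooled_mean = float(np.mean(estimates))
print(estimates, pooled_mean)
```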

## Predictive Modeling

In this method, a predictive model is trained on the available data to predict the missing values. The model can be linear regression, decision trees, or any other machine learning algorithm.
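As a small sketch of this approach, the snippet below fits a straight line by least squares on the rows where the target is known and uses it to predict the missing entry. The frame `toy` and its columns `x` and `y` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: y is missing in one row, x is fully observed
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.0, 4.0, np.nan, 8.0, 10.0],
})

known = toy["y"].notna()

# Fit y = slope * x + intercept on the complete rows only
slope, intercept = np.polyfit(
    toy.loc[known, "x"].to_numpy(),
    toy.loc[known, "y"].to_numpy(),
    deg=1,
)

# Predict the missing targets from their x values
toy.loc[~known, "y"] = slope * toy.loc[~known, "x"] + intercept

print(toy)
```

The same pattern applies to other models: train on the complete rows, then predict only where the target is missing.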

## Conclusion

Filling in missing values is a crucial step in data analysis. The imputation method affects the accuracy of the results, so it should be chosen to match the type of data and the reason the values are missing.