## How to handle missing values?

<b>Missing data</b> refers to values that are not stored (or not present) for one or more variables in a dataset. Handling missing values is an essential step in machine learning because it can significantly affect the performance of your model.<br>

Missing values in a dataset can be represented in various ways, depending on the source of the data and the conventions used. Here are some common representations:<br>
1) NaN (Not a Number)<br>
2) NULL or None<br>
3) Empty strings<br>
4) Blanks or spaces<br>
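
Because pandas only treats NaN and None as missing by default, the other representations usually need to be normalized first. The sketch below shows one way to do that; the `City` column and its values are hypothetical and exist only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical raw column mixing several missing-value representations
raw = pd.DataFrame({'City': ['Paris', '', '   ', None, 'NULL', 'Rome']})

# Turn empty and whitespace-only strings into NaN, then the literal text 'NULL'
cleaned = raw.replace(r'^\s*$', np.nan, regex=True).replace('NULL', np.nan)

# Count the missing values per column after normalization
print(cleaned.isna().sum())
```

After this step, `isna()` reports all four placeholder entries as missing, so the handling methods below can treat them uniformly.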

<b>Some common ways to handle missing values:</b><br>

<b>Drop rows with missing values:</b><br> This method is simple and effective, especially when the dataset is large and only a few rows contain missing values. However, it discards information, which becomes costly when a large share of the rows is affected.<br>
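
A minimal sketch of row dropping with pandas, using the same sample values as the mean-imputation example in this section. Checking the share of missing values first is one possible way to judge how much data would be lost.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Age': [25, 30, np.nan, 35, 40, np.nan, 50],
    'Salary': [50000, 54000, 58000, np.nan, 64000, 67000, np.nan],
    'Purchased': ['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No'],
})

# Share of missing values per column
print(df.isna().mean())

# Keep only the rows in which every column has a value
df_dropped = df.dropna()
print(df_dropped)
```

On this small dataset only three of the seven rows survive, which illustrates why dropping works best when missing values are rare.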

<b>Replace with mean/median/mode:</b><br> This method replaces the missing values with the mean, median, or mode of the respective feature. The mean and median apply to numerical features (the median is more robust to outliers), while the mode also works for categorical features.<br>
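
The sketch below fills numerical columns with the median and a categorical column with the mode, using plain pandas. The `Segment` column is a hypothetical addition, included only so that there is a categorical feature with gaps.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Age': [25, 30, np.nan, 35, 40, np.nan, 50],
    'Salary': [50000, 54000, 58000, np.nan, 64000, 67000, np.nan],
    'Segment': ['A', 'B', 'A', None, 'A', 'B', None],  # hypothetical column
})

df_filled = df.copy()

# Numerical features: fill with the column median
for col in ['Age', 'Salary']:
    df_filled[col] = df_filled[col].fillna(df_filled[col].median())

# Categorical feature: fill with the most frequent value (the mode)
df_filled['Segment'] = df_filled['Segment'].fillna(df_filled['Segment'].mode()[0])

print(df_filled)
```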

<b>Imputation using regression:</b><br> This method fits a regression model on the rows where a feature is observed and uses it to predict that feature's missing values from the other features.<br>
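
One way to try regression-based imputation is scikit-learn's `IterativeImputer`, which models each feature with missing values as a function of the other features. This is a sketch under assumptions, not the only approach: the choice of `LinearRegression` as the estimator and the parameter values are illustrative, and the class is still marked experimental in scikit-learn, hence the extra enabling import.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression

X = pd.DataFrame({
    'Age': [25, 30, np.nan, 35, 40, np.nan, 50],
    'Salary': [50000, 54000, 58000, np.nan, 64000, 67000, np.nan],
})

# Each feature with gaps is regressed on the other feature (illustrative settings)
reg_imputer = IterativeImputer(estimator=LinearRegression(), max_iter=10, random_state=0)
X_reg = pd.DataFrame(reg_imputer.fit_transform(X), columns=X.columns, index=X.index)

print(X_reg)
```

Observed values are left unchanged; only the missing cells are replaced with model predictions.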

<b>Imputation using K-Nearest Neighbors (KNN):</b><br> This method uses KNN to find the rows most similar to the one with missing values and fills each gap from those neighbors, typically by averaging their values for that feature.<br>
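
A minimal sketch of KNN imputation with scikit-learn's `KNNImputer`. The value `n_neighbors=2` is an arbitrary choice for this tiny dataset, not a recommendation.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

X = pd.DataFrame({
    'Age': [25, 30, np.nan, 35, 40, np.nan, 50],
    'Salary': [50000, 54000, 58000, np.nan, 64000, 67000, np.nan],
})

# Each missing cell is filled with the average of that feature over the 2 nearest rows
knn_imputer = KNNImputer(n_neighbors=2)
X_knn = pd.DataFrame(knn_imputer.fit_transform(X), columns=X.columns, index=X.index)

print(X_knn)
```

Because KNN relies on distances, features on very different scales (such as Age and Salary here) are often standardized before imputation so that one feature does not dominate the neighbor search.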

In [1]:
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

In [2]:
# Create a sample dataset
data = {
    'Age': [25, 30, np.nan, 35, 40, np.nan, 50],
    'Salary': [50000, 54000, 58000, np.nan, 64000, 67000, np.nan],
    'Purchased': ['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No']
    }

In [3]:
df = pd.DataFrame(data)

In [4]:
print("Original DataFrame:")
print(df)

Original DataFrame:
    Age   Salary Purchased
0  25.0  50000.0        No
1  30.0  54000.0       Yes
2   NaN  58000.0        No
3  35.0      NaN        No
4  40.0  64000.0       Yes
5   NaN  67000.0       Yes
6  50.0      NaN        No


In [5]:
# Separate features and target
X = df[['Age', 'Salary']]
y = df['Purchased']

In [6]:
# Create an imputer object with a mean filling strategy
imputer = SimpleImputer(strategy='mean')


In [7]:
# Fit the imputer on the data and transform the data
X_imputed = imputer.fit_transform(X)

In [8]:
# Convert the imputed array back to a DataFrame, reusing X's column names and index
df_imputed = pd.DataFrame(X_imputed, columns=X.columns, index=X.index)

In [9]:
# Add the target column back to the DataFrame
df_imputed['Purchased'] = y


In [10]:
print("\nDataFrame after Mean Imputation:")
print(df_imputed)


DataFrame after Mean Imputation:
    Age   Salary Purchased
0  25.0  50000.0        No
1  30.0  54000.0       Yes
2  36.0  58000.0        No
3  35.0  58600.0        No
4  40.0  64000.0       Yes
5  36.0  67000.0       Yes
6  50.0  58600.0        No
