# Missing Values 

# Definition

Missing values are data points that were not recorded or are otherwise unavailable in a dataset.
They appear as gaps or blanks where information is expected but absent. In pandas they usually show up as NaN or None.


## Step-by-Step: Handling Missing Values in Machine Learning

# Step 1: Import the Libraries

We usually use Pandas and NumPy for data analysis.

import pandas as pd
import numpy as np


# Step 2: Load or Create Your Dataset

Example dataset with missing values:

data = {
    'Name': ['John', 'Emma', 'Ryan', 'Sophia', 'Chris'],
    'Age': [25, None, 30, 22, np.nan],
    'Salary': [50000, 60000, None, 52000, 58000],
    'City': ['New York', 'Paris', None, 'London', 'Berlin']
}

df = pd.DataFrame(data)
print(df)

Output:

     Name   Age   Salary      City
0    John  25.0  50000.0  New York
1    Emma   NaN  60000.0      Paris
2    Ryan  30.0      NaN       None
3  Sophia  22.0  52000.0     London
4   Chris   NaN  58000.0     Berlin
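
When the data comes from a file rather than a dictionary, missing entries are often written as placeholder strings. The sketch below assumes a hypothetical CSV in which "?" and "-" mark missing entries (the file contents and the name `df_raw` are made up for illustration) and uses the `na_values` parameter of `pd.read_csv` so that pandas treats them as missing:

```python
import io

import pandas as pd

# Hypothetical raw file: "?" and "-" stand in for missing entries,
# and the last row has an empty City field.
raw_csv = io.StringIO(
    "Name,Age,Salary,City\n"
    "John,25,50000,New York\n"
    "Emma,?,60000,Paris\n"
    "Ryan,30,-,\n"
)

# na_values adds extra markers on top of pandas' default missing markers
# (empty fields are already treated as missing).
df_raw = pd.read_csv(raw_csv, na_values=["?", "-"])

print(df_raw)
print(df_raw.isna().sum())
```

Without `na_values`, the "?" and "-" entries would be read as ordinary strings and the Age and Salary columns would not be numeric.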


# Step 3: Detect Missing Values

3.1 Check for missing values (True/False)
df.isnull()

3.2 Count missing values column-wise
df.isnull().sum()


Output:

Name      0
Age       2
Salary    1
City      1
dtype: int64

3.3 Check if the dataset has any missing values
df.isnull().values.any()

3.4 Get total count of missing values
df.isnull().sum().sum()
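
The checks above can be combined into one overview per column. This is a minimal sketch that rebuilds the example dataset so it runs on its own; the name `summary` and its column labels are chosen here for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Emma', 'Ryan', 'Sophia', 'Chris'],
    'Age': [25, None, 30, 22, np.nan],
    'Salary': [50000, 60000, None, 52000, 58000],
    'City': ['New York', 'Paris', None, 'London', 'Berlin'],
})

# isna() is the same check as isnull(); the mean of a boolean column
# is the fraction of True values, i.e. the share of missing entries.
summary = pd.DataFrame({
    'missing_count': df.isna().sum(),
    'missing_pct': df.isna().mean() * 100,
}).sort_values('missing_pct', ascending=False)

print(summary)
```

The percentage is often more useful than the raw count when deciding whether a column is worth keeping.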

# Step 4: Visualize Missing Data (Optional)

A visualization makes it easier to see where the missing data is located.

import seaborn as sns
import matplotlib.pyplot as plt

sns.heatmap(df.isnull(), cbar=False, cmap='coolwarm')
plt.show()


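In this heatmap each cell is one value of the DataFrame. Missing cells are drawn at the warm (red) end of the 'coolwarm' palette, so gaps stand out by row and column.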


# Step 5: Handle Missing Values

Now that we've found the missing values, we can handle them in several ways.

A. Removing Missing Values
1. Remove rows with any missing data
df1 = df.dropna()

2. Remove columns with any missing data
df2 = df.dropna(axis=1)

3. Remove rows only if all values are missing
df3 = df.dropna(how='all')

4. Remove rows if specific columns have missing data
df4 = df.dropna(subset=['Age', 'Salary'])
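
To see how much data each removal option costs, it helps to compare the resulting shapes. The sketch below rebuilds the example dataset so it runs on its own and also shows the `thresh` parameter of `dropna`, which is not covered above; the variable names are chosen here for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Emma', 'Ryan', 'Sophia', 'Chris'],
    'Age': [25, None, 30, 22, np.nan],
    'Salary': [50000, 60000, None, 52000, 58000],
    'City': ['New York', 'Paris', None, 'London', 'Berlin'],
})

results = {
    'rows, any missing': df.dropna(),
    'columns, any missing': df.dropna(axis=1),
    'rows, all missing': df.dropna(how='all'),
    'rows, Age/Salary missing': df.dropna(subset=['Age', 'Salary']),
    # thresh=3 keeps rows that have at least 3 non-missing values
    'rows, fewer than 3 values': df.dropna(thresh=3),
}

print('original:', df.shape)
for label, frame in results.items():
    print(f'{label}: {frame.shape}')
```

On this small dataset, dropping every row with a gap keeps only 2 of the 5 rows, which is why imputation is often preferred.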

B. Filling (Imputing) Missing Values
Instead of deleting data, we can fill missing values using statistics or logic. The options below are alternatives for a given column. Each one assigns the result back to the column, because calling fillna(..., inplace=True) on a single column triggers chained-assignment warnings in recent pandas versions and may not update df at all.

1. Fill with a constant value
df['City'] = df['City'].fillna('Unknown')

2. Fill with the mean (numerical columns)
df['Age'] = df['Age'].fillna(df['Age'].mean())

3. Fill with the median
df['Salary'] = df['Salary'].fillna(df['Salary'].median())

4. Fill with the mode (for categorical data)
df['City'] = df['City'].fillna(df['City'].mode()[0])
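
Several columns can also be filled in one call by passing a dictionary to `fillna`. This is a minimal sketch on a fresh copy of the example dataset; the names `fill_values` and `df_filled` are chosen here for illustration, and the choice of statistic per column is only an example:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Emma', 'Ryan', 'Sophia', 'Chris'],
    'Age': [25, None, 30, 22, np.nan],
    'Salary': [50000, 60000, None, 52000, 58000],
    'City': ['New York', 'Paris', None, 'London', 'Berlin'],
})

# One fill value per column: mean for Age, median for Salary,
# and a constant label for the categorical City column.
fill_values = {
    'Age': df['Age'].mean(),
    'Salary': df['Salary'].median(),
    'City': 'Unknown',
}

# fillna returns a new DataFrame; the original df is left untouched.
df_filled = df.fillna(value=fill_values)

print(df_filled)
```

Keeping the original `df` unchanged makes it easy to compare different imputation strategies side by side.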

C. Forward / Backward Fill (from neighboring rows)
These use the dedicated ffill() and bfill() methods, since fillna(method=...) is deprecated in recent pandas versions.

1. Forward fill: use the previous value
df = df.ffill()

2. Backward fill: use the next value
df = df.bfill()
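
Forward and backward fill make the most sense for ordered data such as time series. The sketch below uses a hypothetical Series of sensor-style readings (the values and the name `readings` are made up) to show how the two directions differ and how the `limit` parameter caps the number of consecutive fills:

```python
import numpy as np
import pandas as pd

# Hypothetical ordered readings with gaps in the middle and at the end.
readings = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

forward = readings.ffill()          # carry the last seen value forward
backward = readings.bfill()         # pull the next seen value backward
limited = readings.ffill(limit=1)   # fill at most 1 consecutive gap

print(pd.DataFrame({
    'readings': readings,
    'ffill': forward,
    'bfill': backward,
    'ffill(limit=1)': limited,
}))
```

Note that a gap at the very end cannot be backward filled (and a gap at the very start cannot be forward filled), so some missing values may remain after either method.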

D. Conditional Imputation

Fill missing values based on group statistics.
Example: fill each missing Salary with the average salary of that row's city.

df['Salary'] = df.groupby('City')['Salary'].transform(lambda x: x.fillna(x.mean()))
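
Group-based filling only helps when a group has at least one observed value; a group whose salaries are all missing has no mean to fill with. The sketch below uses a hypothetical table with repeated cities (the data and the name `staff` are made up) and falls back to the overall mean for such groups:

```python
import numpy as np
import pandas as pd

# Hypothetical data: Paris and London have observed salaries,
# Berlin has none.
staff = pd.DataFrame({
    'City': ['Paris', 'Paris', 'London', 'London', 'Berlin'],
    'Salary': [60000.0, np.nan, 52000.0, 54000.0, np.nan],
})

# Compute the fallback before any filling so it reflects observed data only.
overall_mean = staff['Salary'].mean()

# Mean salary of each row's city, aligned to the original index.
city_mean = staff.groupby('City')['Salary'].transform('mean')

# First try the city mean, then the overall mean where that is still missing.
staff['Salary'] = staff['Salary'].fillna(city_mean).fillna(overall_mean)

print(staff)
```

The fallback is a design choice; depending on the data, leaving such rows missing or dropping them may be more appropriate.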

E. Using Machine Learning Models to Predict Missing Values

When simple statistics are too crude, a model can estimate the missing values from the other columns:

1. Separate rows with and without missing values.
2. Train a model (like Linear Regression or KNN) on the complete data.
3. Predict the missing values.

Example (using sklearn's KNNImputer):

from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=2)
df[['Age', 'Salary']] = imputer.fit_transform(df[['Age', 'Salary']])
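
The separate, train, and predict steps from this section can also be written out by hand. The sketch below is not how KNNImputer works internally; it fits a simple straight line with NumPy's `polyfit` on a hypothetical table (the `Experience` column, the values, and the name `df_ml` are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical data: Salary depends on years of Experience.
df_ml = pd.DataFrame({
    'Experience': [1.0, 2.0, 3.0, 4.0, 5.0],
    'Salary': [30000.0, 40000.0, np.nan, 60000.0, 70000.0],
})

# 1. Separate rows with and without the target value.
known = df_ml[df_ml['Salary'].notna()]
unknown = df_ml[df_ml['Salary'].isna()]

# 2. Train on the complete rows: least-squares fit of a straight line.
slope, intercept = np.polyfit(known['Experience'], known['Salary'], deg=1)

# 3. Predict the missing values and write them back by index.
df_ml.loc[unknown.index, 'Salary'] = slope * unknown['Experience'] + intercept

print(df_ml)
```

A real project would typically use more predictor columns and check the model's accuracy on held-out complete rows before trusting its predictions.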

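F. Add a Flag Column (Optional)

Sometimes the fact that a value was missing is itself informative, so we record it in a separate column. Create the flag before imputing, otherwise it will be False everywhere: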

df['Age_missing'] = df['Age'].isnull()
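
The same idea extends to every column that has gaps. This is a minimal sketch on a fresh copy of the example dataset; the `_missing` suffix follows the line above, and `cols_with_gaps` is a name chosen here for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Emma', 'Ryan', 'Sophia', 'Chris'],
    'Age': [25, None, 30, 22, np.nan],
    'Salary': [50000, 60000, None, 52000, 58000],
    'City': ['New York', 'Paris', None, 'London', 'Berlin'],
})

# Columns that contain at least one missing value.
cols_with_gaps = df.columns[df.isna().any()].tolist()

# Record missingness first ...
for col in cols_with_gaps:
    df[f'{col}_missing'] = df[col].isna()

# ... then impute; the flag still remembers which rows were filled.
df['Age'] = df['Age'].fillna(df['Age'].mean())

print(df[['Name', 'Age', 'Age_missing']])
```

A downstream model can then use the flag columns as extra features alongside the imputed values.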


# Step 6: Verify Changes

After handling missing values, always check again:

df.isnull().sum()


Output should be:

Name      0
Age       0
Salary    0
City      0
dtype: int64
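
In a script or pipeline, the same check can fail loudly instead of relying on someone reading the printed output. The helper below is a hypothetical sketch (`assert_no_missing` is not a pandas function, and the two small DataFrames are made up for the demonstration):

```python
import numpy as np
import pandas as pd


def assert_no_missing(frame, columns=None):
    """Raise ValueError naming the columns that still contain missing values."""
    checked = frame if columns is None else frame[columns]
    remaining = checked.isna().sum()
    remaining = remaining[remaining > 0]
    if not remaining.empty:
        raise ValueError(f"Missing values remain: {remaining.to_dict()}")


# A fully cleaned frame passes silently.
clean = pd.DataFrame({'Age': [25.0, 30.0], 'City': ['Paris', 'London']})
assert_no_missing(clean)

# A frame with a leftover gap raises an error.
dirty = pd.DataFrame({'Age': [25.0, np.nan]})
try:
    assert_no_missing(dirty)
except ValueError as err:
    print(err)
```

Calling the helper right before saving the data stops the run as soon as an unhandled gap slips through.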

# Step 7: Save the Clean Data

df.to_csv('clean_data.csv', index=False)


# Summary Table

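| Step | Action | Function |
|------|--------|----------|
| 1 | Import libraries | import pandas as pd |
| 2 | Load or create dataset | pd.read_csv() / pd.DataFrame() |
| 3 | Find missing values | df.isnull() / df.isnull().sum() |
| 4 | Visualize | sns.heatmap(df.isnull()) |
| 5A | Remove | df.dropna() |
| 5B | Fill with constant/mean/median/mode | df.fillna() |
| 5C | Forward / backward fill | df.ffill() / df.bfill() |
| 5D | Conditional (group-based) imputation | df.groupby().transform() |
| 5E | ML-based imputation | KNNImputer() |
| 5F | Flag missingness | df['Age'].isnull() |
| 6 | Verify | df.isnull().sum() |
| 7 | Save cleaned data | df.to_csv() |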