# Introduction to Data Preprocessing



**Why is data preprocessing important?**
Data in the real world is often incomplete, inconsistent, and messy. 

Preprocessing transforms raw data into a clean and usable form, helping machine learning models to perform better.

**Common preprocessing steps:**

- Handling missing data
- Feature scaling
- Encoding categorical data
- Splitting the dataset for training and testing
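As a rough overview of how these four steps can fit together, here is a minimal sketch using scikit-learn. The toy data, the column names (`age`, `city`, `purchased`), and the choice of `StandardScaler` and `OneHotEncoder` are assumptions made for illustration; they are not part of the dataset used in the rest of this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data, invented for this sketch only
toy = pd.DataFrame({
    "age": [25, np.nan, 35, 40, 29, 52],
    "city": ["Paris", "Rome", np.nan, "Rome", "Paris", "Rome"],
    "purchased": [0, 1, 1, 0, 1, 0],
})
X = toy[["age", "city"]]
y = toy["purchased"]

# Splitting: hold out 2 rows for testing, so that imputation and scaling
# statistics are learned from the training rows only
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=2, random_state=0
)

# Missing data + feature scaling for the numerical column
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

# Missing data + encoding for the categorical column
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("num", numeric, ["age"]),
    ("cat", categorical, ["city"]),
])

# Fit on the training rows, then apply the same transformation to the test rows
X_train_ready = preprocess.fit_transform(X_train)
X_test_ready = preprocess.transform(X_test)
print(X_train_ready.shape, X_test_ready.shape)
```

The rest of this notebook covers only the first step, handling missing data.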

# **Handling Missing Data**

**Why does missing data occur?**

**Missing Data in Real-World Datasets**

Missing data is common in real-world datasets due to reasons like data collection errors, participant dropout, or incomplete data entry.

**Impact on Machine Learning Models**

Many machine learning algorithms cannot handle missing values directly, which may lead to inaccurate predictions or even failure to train the model.
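To illustrate the failure case, the sketch below tries to fit scikit-learn's `LinearRegression` on a feature column that contains a `NaN`. The numbers are made up for the demonstration; the point is only that the estimator refuses to train rather than silently ignoring the gap.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical values: one feature (age) with a missing entry, and a target
X = np.array([[25.0], [np.nan], [35.0], [40.0]])
y = np.array([50000.0, 60000.0, 65000.0, 80000.0])

try:
    LinearRegression().fit(X, y)
    fit_failed = False
except ValueError as err:
    # scikit-learn validates the input and rejects NaN values
    fit_failed = True
    print("Training failed:", err)
```

Some estimators do accept missing values natively, so it is worth checking the documentation of the model you plan to use.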

**Ways to Handle Missing Data**

1. **Remove Missing Values**
    - Appropriate when dropping the affected rows has minimal impact on the dataset.

2. **Impute Missing Values**
    - Appropriate when removing the affected rows would discard too much data.

pip install pandas numpy scikit-learn

In [1]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer

data = {
    'age': [25, np.nan, 35, 40, 29],
    'salary': [50000, 60000, np.nan, 80000, 50000],
    'purchased': ['No', 'Yes', 'Yes', 'No', np.nan]
}

df = pd.DataFrame(data)
print(df)


    age   salary purchased
0  25.0  50000.0        No
1   NaN  60000.0       Yes
2  35.0      NaN       Yes
3  40.0  80000.0        No
4  29.0  50000.0       NaN
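Before choosing a technique, it usually helps to measure how much data is actually missing. The sketch below rebuilds the same small dataset under a separate name (`df_check`, chosen here so that it does not overwrite `df`) and counts the gaps per column and per row.

```python
import numpy as np
import pandas as pd

# Same values as the DataFrame above, under a separate name
df_check = pd.DataFrame({
    'age': [25, np.nan, 35, 40, 29],
    'salary': [50000, 60000, np.nan, 80000, 50000],
    'purchased': ['No', 'Yes', 'Yes', 'No', np.nan],
})

# Number of missing values in each column
missing_counts = df_check.isna().sum()

# Fraction of missing values in each column (0.2 means 20%)
missing_share = df_check.isna().mean()

# Number of rows that contain at least one missing value
rows_with_gaps = df_check.isna().any(axis=1).sum()

print(missing_counts)
print(missing_share)
print("Rows with at least one gap:", rows_with_gaps)
```

Here each column has only one missing value, but three of the five rows are affected, which is why dropping rows is costly on such a small dataset.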


## Techniques for handling missing data

1. **Removing Missing Values**

**When to remove missing data?**

- If the dataset is large and the proportion of missing data is small.
- When missing data is scattered randomly and does not impact the overall distribution.

In [2]:
# Removing rows with missing values
df_dropna = df.dropna()
print(df_dropna)


    age   salary purchased
0  25.0  50000.0        No
3  40.0  80000.0        No
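Dropping every incomplete row removed three of the five rows here. When only some columns matter, `dropna` can be made less aggressive. The sketch below shows two of its options, `subset` and `thresh`, on a copy of the same data (the name `df_demo` is chosen for this example).

```python
import numpy as np
import pandas as pd

# Same values as the original DataFrame, under a separate name
df_demo = pd.DataFrame({
    'age': [25, np.nan, 35, 40, 29],
    'salary': [50000, 60000, np.nan, 80000, 50000],
    'purchased': ['No', 'Yes', 'Yes', 'No', np.nan],
})

# Drop a row only if 'age' or 'salary' is missing;
# row 4 is kept even though 'purchased' is missing
df_subset = df_demo.dropna(subset=['age', 'salary'])
print(df_subset)

# Keep rows that have at least 2 non-missing values;
# every row qualifies here, so nothing is dropped
df_thresh = df_demo.dropna(thresh=2)
print(df_thresh)
```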


2. **Imputing Missing Values**

**When to impute missing data?**

- When the dataset is small or missing values occur frequently.
- When you need to keep every row: imputation fills the gaps without discarding much information.

**Common imputation methods:**

- Mean, median, or mode imputation.
- Forward fill or backward fill (using the previous or next value).

In [3]:
# (Numerical Data)
# For numerical columns, such as age and salary, we can replace missing values with the mean of the column.

# Imputing missing values using mean for age and salary
# The imputer computes a separate mean for each column, so both can be filled in one call
imputer = SimpleImputer(strategy='mean')
df[['age', 'salary']] = imputer.fit_transform(df[['age', 'salary']])
print(df)

     age   salary purchased
0  25.00  50000.0        No
1  32.25  60000.0       Yes
2  35.00  60000.0       Yes
3  40.00  80000.0        No
4  29.00  50000.0       NaN
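Median imputation works the same way and can be a better fit when a column contains outliers. The sketch below compares the two strategies on a made-up salary column with one extreme value; the data is hypothetical and separate from the dataset above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical salaries with one missing value and one outlier
salaries = pd.DataFrame({'salary': [50000, 60000, np.nan, 80000, 1000000]})

# Fill the gap with the column mean
mean_filled = SimpleImputer(strategy='mean').fit_transform(salaries).ravel()

# Fill the gap with the column median
median_filled = SimpleImputer(strategy='median').fit_transform(salaries).ravel()

print("Mean-imputed value:  ", mean_filled[2])
print("Median-imputed value:", median_filled[2])
```

The mean is pulled upward by the outlier, while the median stays close to the typical salaries.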


In [1]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer

# Sample data
data = {
    'age': [25, np.nan, 35, 40, 29,40,40],
    'salary': [50000, 60000, np.nan, 80000, 50000, 50000, 50000],
    'purchased': ['No', 'Yes', 'Yes', 'No', np.nan,'Yes','Yes']
}

# Create DataFrame
df = pd.DataFrame(data)

# Convert the 'purchased' column to object type (if not already)
df['purchased'] = df['purchased'].astype('object')

# Imputing missing values using mode for categorical data
imputer = SimpleImputer(strategy='most_frequent')

# Apply the imputer on the 'purchased' column
# Use ravel() to flatten the result since it's a 2D array
df['purchased'] = imputer.fit_transform(df[['purchased']]).ravel()

# Display the DataFrame after mode imputation
print(df)


    age   salary purchased
0  25.0  50000.0        No
1   NaN  60000.0       Yes
2  35.0      NaN       Yes
3  40.0  80000.0        No
4  29.0  50000.0       Yes
5  40.0  50000.0       Yes
6  40.0  50000.0       Yes


In [None]:
# Using Forward/Backward Fill
# For time-series or ordered data, you can fill missing values using nearby values.


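# Forward fill: copy the last valid value downward
# (fillna(method='ffill') is deprecated in recent pandas versions; use ffill() instead)
df_ffill = df.ffill()
print(df_ffill)

# Backward fill: copy the next valid value upward
df_bfill = df.bfill()
print(df_bfill)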
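Since the cell above has no recorded output, here is a small self-contained sketch of what forward and backward fill do on ordered data. The temperature readings and dates are invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings with gaps
readings = pd.Series(
    [20.5, np.nan, np.nan, 22.0, np.nan],
    index=pd.date_range('2024-01-01', periods=5, freq='D'),
    name='temperature',
)

# Forward fill: each gap takes the most recent earlier value
forward = readings.ffill()

# Backward fill: each gap takes the next later value;
# the last reading stays missing because nothing follows it
backward = readings.bfill()

# Forward fill at most one consecutive gap
limited = readings.ffill(limit=1)

print(pd.DataFrame({
    'original': readings,
    'ffill': forward,
    'bfill': backward,
    'ffill_limit_1': limited,
}))
```

The `limit` argument is useful when copying a value across a long run of gaps would be misleading.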

## When to Remove vs. When to Impute Missing Data

 
**Remove missing data when:**

- The amount of missing data is small (as a rule of thumb, under about 5%) and randomly distributed.

- Removing missing values doesn’t cause a significant reduction in the size of the dataset.


**Impute missing data when:**

- The missing values represent a significant portion of the data (as a rule of thumb, more than about 5%).

- You want to avoid data loss and preserve the size of the dataset.
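The rule of thumb above can be turned into a quick first-pass check. The helper `suggest_strategy` below is hypothetical, written for this example, and the 5% threshold is only the guideline from this section. The check cannot tell whether values are missing at random, so treat its output as a starting point.

```python
import numpy as np
import pandas as pd

def suggest_strategy(frame, threshold=0.05):
    """Suggest a strategy per column from its share of missing values.

    Hypothetical helper: below the threshold it suggests removing rows,
    at or above the threshold it suggests imputing.
    """
    suggestions = {}
    for column, share in frame.isna().mean().items():
        if share == 0:
            suggestions[column] = "complete"
        elif share < threshold:
            suggestions[column] = "remove rows"
        else:
            suggestions[column] = "impute"
    return suggestions

# Made-up data: 100 rows, with 2% missing in 'a', 20% missing in 'b', none in 'c'
frame = pd.DataFrame({
    'a': np.arange(100, dtype=float),
    'b': np.arange(100, dtype=float),
    'c': np.arange(100, dtype=float),
})
frame.loc[[3, 50], 'a'] = np.nan
frame.loc[frame.index[:20], 'b'] = np.nan

suggestion = suggest_strategy(frame)
print(suggestion)
```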


**Maintaining Data Distribution:**

It is important to ensure that imputing missing values does not distort the data’s underlying distribution. For example:

- Mean imputation may reduce the variance of the data.

- Mode imputation for categorical data may increase the representation of the most common class.
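Both effects can be checked directly. The sketch below uses small made-up columns and compares the variance before and after mean imputation, and the share of the most common class before and after mode imputation.

```python
import numpy as np
import pandas as pd

# Hypothetical numerical column with three missing values
ages = pd.Series([25, np.nan, 35, 40, 29, np.nan, 52, np.nan])

# pandas skips NaN when computing the variance and the mean
var_before = ages.var()
ages_filled = ages.fillna(ages.mean())
var_after = ages_filled.var()

print("Variance before imputation:", var_before)
print("Variance after imputation: ", var_after)

# Hypothetical categorical column with one missing value
purchased = pd.Series(['No', 'Yes', 'Yes', 'No', np.nan, 'Yes', 'Yes'])

# value_counts ignores NaN by default, so this is the share among observed values
share_before = purchased.value_counts(normalize=True)['Yes']
purchased_filled = purchased.fillna(purchased.mode()[0])
share_after = purchased_filled.value_counts(normalize=True)['Yes']

print("Share of 'Yes' before imputation:", share_before)
print("Share of 'Yes' after imputation: ", share_after)
```

The mean stays the same after mean imputation, but every filled value sits exactly at the center, so the spread shrinks.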

You should always analyze the dataset and decide the best technique based on its structure and how missing values are distributed.