# Data Pre-processing Steps



### Summary of Data Preprocessing Steps:
1. **Data Cleaning**:
   - Handle missing values, noisy data, outliers, duplicates.
   - Normalize inconsistent formats (units, date formats, category labels).
   
2. **Data Integration**:
   - Resolve schema conflicts, entity identification, and redundancy.
   
3. **Data Transformation**:
   - Feature scaling, encoding, engineering, and aggregation.
   
4. **Data Reduction**:
   - Dimensionality reduction, feature selection, and sampling.

These preprocessing steps ensure that data is structured, clean, and relevant, ultimately leading to better model performance.
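
As a rough illustration of how the four steps above might look in `pandas`, here is a minimal sketch. The table, its column names (`city`, `temp_c`, `humidity`), and the lookup table `countries` are all made up for this example; real pipelines will differ.

```python
import pandas as pd

# Hypothetical toy table: all names and values are invented for illustration.
raw = pd.DataFrame({
    "city": ["Paris", "Paris", "Rome", "Rome", "Oslo"],
    "temp_c": [21.0, 21.0, 27.0, None, 12.0],
    "humidity": [0.40, 0.40, 0.55, 0.60, 0.70],
})

# 1. Cleaning: remove exact duplicate rows, then fill the missing
#    temperature with the column median.
clean = raw.drop_duplicates().copy()
clean["temp_c"] = clean["temp_c"].fillna(clean["temp_c"].median())

# 2. Integration: join a second (hypothetical) table on the shared key.
countries = pd.DataFrame({"city": ["Paris", "Rome", "Oslo"],
                          "country": ["FR", "IT", "NO"]})
merged = clean.merge(countries, on="city", how="left")

# 3. Transformation: min-max scale a numeric column and one-hot encode a category.
t = merged["temp_c"]
merged["temp_scaled"] = (t - t.min()) / (t.max() - t.min())
encoded = pd.get_dummies(merged, columns=["country"])

# 4. Reduction: keep only the columns needed downstream.
reduced = encoded.drop(columns=["city", "temp_c"])
print(reduced)
```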

Dealing with missing values is a common and important part of the data cleaning step in machine learning and data analysis. In `pandas`, missing values are usually represented as `NaN` (Not a Number), and a `None` stored in a column is treated as missing too. The `pandas` library offers robust tools to handle these values. Here’s a detailed guide to handling missing values step by step:

### 1. **Identifying Missing Values**
Before dealing with missing values, the first step is to **identify** them in the dataset.

#### Example Dataset:
```python
import pandas as pd

# Sample dataset
data = {'Name': ['John', 'Alice', 'Bob', 'Emma', None],
        'Age': [28, None, 34, 29, 30],
        'Salary': [50000, 60000, None, None, 55000]}

df = pd.DataFrame(data)
print(df)
```

This would output:
```
     Name   Age   Salary
0    John  28.0  50000.0
1   Alice   NaN  60000.0
2     Bob  34.0      NaN
3    Emma  29.0      NaN
4    None  30.0  55000.0
```

In the above dataset:
- The **`Name`** column has a missing value (`None`).
- The **`Age`** column has one missing value (`NaN`).
- The **`Salary`** column has two missing values (`NaN`).

#### Step 1.1: Check for Missing Values
To check for missing values, you can use the following methods:

```python
# Check for missing values in the entire DataFrame
print(df.isnull())   # True means the value is missing

# Count the number of missing values in each column
print(df.isnull().sum())
```

Output:
```
    Name    Age  Salary
0  False  False  False
1  False   True  False
2  False  False   True
3  False  False   True
4   True  False  False

Name      1
Age       1
Salary    2
dtype: int64
```
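
It is often more useful to know the share of missing values per column than the raw count, since later choices (drop or impute) depend on that proportion. A small sketch, rebuilding the same sample dataset so it runs on its own:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Alice', 'Bob', 'Emma', None],
                   'Age': [28, None, 34, 29, 30],
                   'Salary': [50000, 60000, None, None, 55000]})

# isna() is an alias of isnull(); the mean of a boolean column
# is the fraction of True values, so this gives percent missing per column.
missing_pct = df.isna().mean() * 100
print(missing_pct)

# Rows that contain at least one missing value
rows_with_missing = df[df.isna().any(axis=1)]
print(rows_with_missing)
```

For this dataset, `Name` and `Age` are each about 20% missing and `Salary` about 40%.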

### 2. **Dealing with Missing Values**
Once you’ve identified where the missing values are, you can handle them using different techniques based on the problem and data type. The main approaches are **removing** the affected rows or columns, **imputing** values with a constant, a summary statistic, or a neighboring value, and **model-based imputation**.

---

### 2.1 **Removing Missing Values**
This approach is useful when the proportion of missing data is small and removing the records won’t significantly affect the analysis.

#### Step 2.1.1: Drop Rows with Missing Values
To remove rows that contain any missing value:

```python
# Drop rows with any missing values
df_dropped_rows = df.dropna()
print(df_dropped_rows)
```

Output:
```
    Name   Age   Salary
0   John  28.0  50000.0
```
Here, all rows with missing values were removed, leaving only the first row.

#### Step 2.1.2: Drop Columns with Missing Values
To remove columns that contain any missing values:

```python
# Drop columns with any missing values
df_dropped_columns = df.dropna(axis=1)
print(df_dropped_columns)
```

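Output:
```
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4]
```
In this case every column is removed, because `Name`, `Age`, and `Salary` each contain at least one missing value. Dropping columns this way is only useful when some columns are complete.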

#### Step 2.1.3: Drop Rows or Columns Based on a Threshold
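You may want to drop rows or columns only when they have too many missing values. The `thresh` parameter sets the minimum number of **non-missing** values a row (or column) must have to be kept. With three columns, `thresh=2` drops rows that have more than 1 missing value:

```python
# Keep rows with at least 2 non-missing values
# (with 3 columns, this drops rows with more than 1 missing value)
df_dropped_thresh = df.dropna(thresh=2)
print(df_dropped_thresh)
```

Output:
```
    Name   Age   Salary
0   John  28.0  50000.0
1  Alice   NaN  60000.0
2    Bob  34.0      NaN
3   Emma  29.0      NaN
4   None  30.0  55000.0
```
In this dataset no row is dropped: every row has at most one missing value, so each row still has at least two non-missing values.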
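
The same `thresh` argument works for columns when combined with `axis=1`, and `subset` restricts the check to chosen columns. A short sketch on the same sample data (rebuilt here so the block is self-contained):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Alice', 'Bob', 'Emma', None],
                   'Age': [28, None, 34, 29, 30],
                   'Salary': [50000, 60000, None, None, 55000]})

# Keep only columns with at least 4 non-missing values.
# Name and Age each have 4, Salary has 3, so Salary is dropped.
df_cols_thresh = df.dropna(axis=1, thresh=4)
print(df_cols_thresh.columns.tolist())

# Drop a row only if 'Name' or 'Age' is missing; a missing 'Salary' is tolerated.
df_subset = df.dropna(subset=['Name', 'Age'])
print(df_subset)
```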

---

### 2.2 **Imputing Missing Values**
Imputation is the process of replacing missing values with estimates based on the existing data. This is often preferred over removing data, as it prevents the loss of valuable information.

#### Step 2.2.1: Fill Missing Values with a Constant or a Summary Statistic (Mean, Median, Mode)

##### **Filling with a Constant Value**:
You might want to replace missing values with a fixed value, such as 0 or an empty string.

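```python
# Fill every missing value in the DataFrame with 0
df_filled_constant = df.fillna(0)
print(df_filled_constant)
```

Output:
```
    Name   Age   Salary
0   John  28.0  50000.0
1  Alice   0.0  60000.0
2    Bob  34.0      0.0
3   Emma  29.0      0.0
4      0  30.0  55000.0
```
Note that `fillna(0)` applies to every column, so the missing `Name` becomes `0` as well. To restrict the fill to the numerical columns, pass a dictionary instead, for example `df.fillna({'Age': 0, 'Salary': 0})`.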

##### **Filling with the Mean (for numerical data)**:
You can replace missing values in a column with the mean of that column.

```python
# Fill missing values in 'Age' and 'Salary' with their respective mean
df_filled_mean = df.copy()
df_filled_mean['Age'] = df['Age'].fillna(df['Age'].mean())
df_filled_mean['Salary'] = df['Salary'].fillna(df['Salary'].mean())
print(df_filled_mean)
```

Output:
```
    Name    Age   Salary
0   John  28.00  50000.0
1  Alice  30.25  60000.0
2    Bob  34.00  55000.0
3   Emma  29.00  55000.0
4   None  30.00  55000.0
```
Here, the missing values in the `Age` and `Salary` columns have been replaced with the mean values of their respective columns.

##### **Filling with the Median (for skewed distributions)**:
You can also replace missing values with the median:

```python
df_filled_median = df.copy()
df_filled_median['Age'] = df['Age'].fillna(df['Age'].median())
df_filled_median['Salary'] = df['Salary'].fillna(df['Salary'].median())
print(df_filled_median)
```
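
To see why the median is usually the safer choice for skewed data, compare the two statistics on a column with one extreme value. The salaries below are hypothetical and chosen only to make the effect obvious:

```python
import pandas as pd

# Hypothetical salaries with one extreme value and one missing entry
salaries = pd.Series([48000, 50000, 52000, 55000, 1000000, None])

mean_value = salaries.mean()      # pulled upward by the single outlier
median_value = salaries.median()  # depends only on the middle of the sorted values

print(f"mean:   {mean_value:,.0f}")
print(f"median: {median_value:,.0f}")

# The imputed value differs a lot depending on the statistic chosen
filled_mean = salaries.fillna(mean_value)
filled_median = salaries.fillna(median_value)
```

Here the mean is 241,000 while the median is 52,000, so mean imputation would insert a value that no typical employee in this sample earns.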

##### **Filling with the Mode (for categorical data)**:
For categorical columns, replacing missing values with the mode (most frequent value) makes sense:

```python
# Fill missing 'Name' with the most frequent value (mode)
df_filled_mode = df.copy()
df_filled_mode['Name'] = df['Name'].fillna(df['Name'].mode()[0])
print(df_filled_mode)
```

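Output:
```
    Name   Age   Salary
0   John  28.0  50000.0
1  Alice   NaN  60000.0
2    Bob  34.0      NaN
3   Emma  29.0      NaN
4  Alice  30.0  55000.0
```
Every name in this dataset appears exactly once, so all four are tied as modes; `mode()` returns them in sorted order and `[0]` picks `Alice`. Mode imputation is more meaningful for columns with repeated categories.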

#### Step 2.2.2: Forward and Backward Filling
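You can fill a missing value with the last valid value before it (forward fill) or the next valid value after it (backward fill). This technique works well for time-series data. Use the `ffill()` and `bfill()` methods; the older `fillna(method=...)` form is deprecated in recent `pandas` versions.

```python
# Forward fill: copy the previous valid value down
df_ffill = df.ffill()
print(df_ffill)

# Backward fill: copy the next valid value up
df_bfill = df.bfill()
print(df_bfill)
```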
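
A small time-indexed sketch makes the behavior at the edges easier to see. The daily readings below are hypothetical; `limit` caps how many consecutive gaps are filled.

```python
import numpy as np
import pandas as pd

# Hypothetical daily sensor readings with gaps
idx = pd.date_range("2024-01-01", periods=6, freq="D")
readings = pd.Series([np.nan, 10.0, np.nan, np.nan, 13.0, np.nan], index=idx)

forward = readings.ffill()         # carries the previous observation forward
backward = readings.bfill()        # pulls the next observation backward
limited = readings.ffill(limit=1)  # fills at most one consecutive missing value

print(pd.DataFrame({"raw": readings, "ffill": forward,
                    "bfill": backward, "ffill_limit1": limited}))
```

Forward fill cannot fill a gap at the very start of the series, and backward fill cannot fill one at the very end, so some missing values may remain after either call.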

---

### 2.3 **Advanced Imputation Techniques**
For more sophisticated imputation, machine learning algorithms like **K-Nearest Neighbors (KNN)**, **Multiple Imputation**, or **Regression-based imputation** can be used.

#### Step 2.3.1: K-Nearest Neighbors (KNN) Imputation
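KNN imputation fills each missing value with the average of that feature across the K nearest rows, where nearness is measured on the features that both rows have observed. This can be done using the `KNNImputer` from scikit-learn (`sklearn`). It only accepts numerical input, so select the numerical columns first.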

```python
from sklearn.impute import KNNImputer

# Initialize the KNNImputer
imputer = KNNImputer(n_neighbors=2)

# Perform KNN imputation on the DataFrame
df_knn_imputed = pd.DataFrame(imputer.fit_transform(df[['Age', 'Salary']]), columns=['Age', 'Salary'])
print(df_knn_imputed)
```
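
Because KNN relies on distances, features on large scales (such as `Salary`) can dominate features on small scales (such as `Age`). One common way to handle this, sketched below under the assumption that standardization suits the data, is to scale first, impute, and then map the result back to the original units:

```python
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({'Name': ['John', 'Alice', 'Bob', 'Emma', None],
                   'Age': [28, None, 34, 29, 30],
                   'Salary': [50000, 60000, None, None, 55000]})
num_cols = ['Age', 'Salary']

# StandardScaler ignores NaN when fitting and leaves NaN in place when transforming
scaler = StandardScaler()
scaled = scaler.fit_transform(df[num_cols])

# Impute in the scaled space so both features contribute comparably to distances
imputer = KNNImputer(n_neighbors=2)
imputed_scaled = imputer.fit_transform(scaled)

# Convert back to the original units
df_knn_scaled = pd.DataFrame(scaler.inverse_transform(imputed_scaled),
                             columns=num_cols, index=df.index)
print(df_knn_scaled)
```

On a dataset this small the scaled and unscaled results may well coincide; the difference matters more with many features on different scales.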

#### Step 2.3.2: Multivariate Imputation by Chained Equations (MICE)
MICE models each feature that has missing values as a function of the other features. It then cycles through those models repeatedly, updating the imputed values on each pass until they stabilize.
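
scikit-learn offers `IterativeImputer`, which follows this chained-equations idea. It is still marked experimental, so it must be enabled explicitly, and by default it returns a single imputed dataset rather than the multiple imputations of full MICE. A minimal sketch on the sample data, with `max_iter` and `random_state` chosen arbitrarily:

```python
import pandas as pd
# This import is required to make the experimental IterativeImputer available
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

df = pd.DataFrame({'Name': ['John', 'Alice', 'Bob', 'Emma', None],
                   'Age': [28, None, 34, 29, 30],
                   'Salary': [50000, 60000, None, None, 55000]})
num_cols = ['Age', 'Salary']

# Each feature with missing values is regressed on the others, in rounds
imputer = IterativeImputer(max_iter=10, random_state=0)
df_mice = pd.DataFrame(imputer.fit_transform(df[num_cols]),
                       columns=num_cols, index=df.index)
print(df_mice)
```

With only five rows the regressions have very little to learn from, so treat the numbers as a demonstration of the API rather than a meaningful estimate.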

---

### Conclusion
Handling missing values is crucial for building reliable machine learning models. Here are the main strategies:
1. **Removing Missing Values**: Dropping rows or columns that contain missing data.
2. **Imputation**: Filling missing values with mean, median, mode, or using advanced techniques like KNN.
3. **Filling with Forward/Backward Values**: Useful for time-series data.


The best method to deal with missing values depends on several factors related to the nature of the data, the proportion of missing values, and the context of the analysis. Here are some considerations for choosing the best method:

### 1. **Small Percentage of Missing Values**
If only a small proportion of your data has missing values, and the data is not crucial for your analysis, you can simply **drop rows or columns** with missing values.

- **When to Use:** As a rule of thumb, when fewer than about 5% of rows are affected and there is no obvious pattern to which values are missing.
- **Method:** `df.dropna()` or dropping specific columns/rows.

#### Example:
```python
df_cleaned = df.dropna()  # Remove rows with any missing values
```
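
One way to apply this rule consistently is to measure the share of incomplete rows before dropping anything. The helper below is a hypothetical sketch; the function name and the 5% default are assumptions taken from the rule of thumb above, not a `pandas` feature.

```python
import pandas as pd

def drop_if_few_missing(frame: pd.DataFrame, max_fraction: float = 0.05) -> pd.DataFrame:
    """Drop incomplete rows only when they make up at most `max_fraction` of the data."""
    incomplete = frame.isna().any(axis=1).mean()  # fraction of rows with any missing value
    if incomplete <= max_fraction:
        return frame.dropna()
    return frame  # too much would be lost; leave the rows for imputation instead

# Hypothetical data: 1 incomplete row out of 40 (2.5%), so the row is dropped
small_gap = pd.DataFrame({"x": range(40), "y": [None] + [1.0] * 39})
print(len(drop_if_few_missing(small_gap)))

# Hypothetical data: 2 of 4 rows incomplete (50%), so nothing is dropped
large_gap = pd.DataFrame({"x": [1, 2, 3, 4], "y": [None, None, 1.0, 2.0]})
print(len(drop_if_few_missing(large_gap)))
```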

### 2. **Larger Percentage of Missing Values (Imputation Required)**
If a significant portion of your dataset has missing values, **imputation** is the preferred method. There are various imputation techniques based on the type of data you’re working with:

#### **Numerical Data (Continuous)**:
- **Mean/Median Imputation**: 
  - **Mean** is used if the data is roughly symmetric (for example, approximately normally distributed).
  - **Median** is better for data with outliers or skewed distributions.
  
  Example:
  ```python
  df['Age'] = df['Age'].fillna(df['Age'].mean())  # Mean Imputation
  df['Salary'] = df['Salary'].fillna(df['Salary'].median())  # Median Imputation
  ```

#### **Categorical Data**:
- **Mode Imputation**: Replace missing values with the most frequent category (mode). This works well for columns with a small number of distinct categories.

  Example:
  ```python
  # 'Category' is a placeholder for any categorical column in your DataFrame
  df['Category'] = df['Category'].fillna(df['Category'].mode()[0])
  ```
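
The numerical and categorical rules above can be combined into one pass that picks a statistic by column type. This is a minimal sketch on the sample dataset from earlier, assuming the median is acceptable for every numerical column and the mode for everything else:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Alice', 'Bob', 'Emma', None],
                   'Age': [28, None, 34, 29, 30],
                   'Salary': [50000, 60000, None, None, 55000]})

df_simple = df.copy()
for col in df_simple.columns:
    if pd.api.types.is_numeric_dtype(df_simple[col]):
        # Numerical column: fill with the median
        df_simple[col] = df_simple[col].fillna(df_simple[col].median())
    else:
        # Non-numerical column: fill with the most frequent value, if there is one
        modes = df_simple[col].mode()
        if not modes.empty:
            df_simple[col] = df_simple[col].fillna(modes.iloc[0])

print(df_simple)
```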

### 3. **Time-Series Data**
For time-series data, **forward fill** or **backward fill** are commonly used: forward fill carries the last observed value forward, and backward fill carries the next observed value backward.

- **When to Use:** When the data points are dependent on each other over time.

#### Example:
```python
df_ffill = df.ffill()  # Forward fill (fillna(method='ffill') is deprecated)
df_bfill = df.bfill()  # Backward fill (fillna(method='bfill') is deprecated)
```

### 4. **Advanced Methods**
For more sophisticated datasets, particularly when missing values appear in patterns, **advanced imputation methods** like KNN (K-Nearest Neighbors) or regression-based imputation are effective.

- **K-Nearest Neighbors (KNN)**: This technique works by filling missing values based on the nearest observations in terms of distance.
  
  **When to Use:** When the relationship between missing values and other features is complex.
  
  Example:
  ```python
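  from sklearn.impute import KNNImputer
  # KNNImputer needs numerical input, so exclude text columns such as 'Name'
  num_cols = df.select_dtypes(include='number').columns
  imputer = KNNImputer(n_neighbors=3)
  df_imputed = pd.DataFrame(imputer.fit_transform(df[num_cols]), columns=num_cols, index=df.index)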
  ```

- **Multivariate Imputation by Chained Equations (MICE)**: This method iteratively predicts missing values using multiple regression models.
  
  **When to Use:** For complex datasets where each feature can be predicted by other features.

---

### Summary of Best Methods Based on Scenarios:
1. **Small Missing Data (Less than 5%)**: Drop rows/columns.
2. **Numerical Data**: Use **mean** for normal distributions and **median** for skewed distributions.
3. **Categorical Data**: Use **mode** to fill missing categorical values.
4. **Time-Series Data**: Use **forward fill** or **backward fill**.
5. **Complex Data Relationships**: Use **KNN imputation** or **MICE** for better performance.

---

### Best Method Overall:
For most practical applications, **mean/median imputation** for numerical data and **mode imputation** for categorical data provide a simple and reasonable baseline. **KNN imputation** can be more accurate when features are correlated, at the cost of more computation and sensitivity to feature scale. Start with the basic methods and move to more advanced ones only if the dataset’s complexity calls for it.
