## Handling Missing Categorical Data | [Link](https://github.com/AdilShamim8/50-Days-of-Machine-Learning/tree/main/Day%2021%20Handling%20Missing%20Categorical%20Data)

### 1. Introduction

Categorical features represent qualitative information (e.g., color, type, or category) and typically come as strings or labels. In many real-world datasets, missing values occur in categorical columns due to data entry issues, nonresponse in surveys, or other collection errors. If left unaddressed, these missing values can reduce model performance and may bias your analysis.

### 2. Challenges

- **Information Loss:** Simply dropping rows or columns with missing categorical data can lead to significant data loss.
- **Bias:** Imputing missing values incorrectly may bias the distribution of the categories.
- **Encoding Issues:** Many machine learning models require numerical input, so you often need to encode categorical variables after imputation.
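
The impute-then-encode order mentioned in the last point can be sketched with a Scikit-Learn pipeline. This is a minimal illustration, not a prescribed recipe: the toy data and the choice of `OneHotEncoder` are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data, invented for illustration only.
df = pd.DataFrame({
    "Color": ["Red", "Blue", np.nan, "Green", "Blue"],
    "Size": ["S", "M", "L", np.nan, "M"],
})

# Step 1 fills gaps with each column's mode; step 2 one-hot encodes the result.
preprocess = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore"),
)

encoded = preprocess.fit_transform(df)  # sparse matrix by default
print(encoded.toarray())
```

Chaining the two steps keeps the order fixed, so the encoder never sees a missing value.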

### 3. Common Strategies

1. **Deletion:**
   - **Row Deletion:** Remove observations with missing categorical values if they are few.
   - **Column Deletion:** Drop the entire feature if too many values are missing and it is not critical.

2. **Imputation:**
   - **Mode Imputation (Most Frequent Value):** Replace missing values with the mode (the most common category) of the column.  
     *Example:* If 'Red' is the most frequent color, fill missing values with 'Red'.  
   - **New Category:** Create a new category (e.g., "Missing" or "Unknown") to explicitly capture missingness. This strategy preserves the fact that the data was missing, which may hold predictive power.  

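   - **Predictive Imputation:** Train a classifier on the rows where the category is known, using the other features as inputs, then predict the category for the rows where it is missing. This is more complex but may yield better results when the other features are strongly related to the missing one.
   - **KNN Imputation:** Use a nearest-neighbors approach to impute the missing category from the most similar observations. Distance-based methods need numeric inputs, so the features used to find neighbors must be encoded first.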

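3. **Using Scikit-Learn’s SimpleImputer:**  
   - Scikit-Learn's `SimpleImputer` implements mode imputation with `strategy="most_frequent"`, which works on string columns as well as numeric ones. It also supports `strategy="constant"` with a `fill_value`, which covers the new-category approach.  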
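
The predictive imputation strategy listed above can be sketched as follows. The toy data, the column names `Height` and `Weight`, and the choice of `DecisionTreeClassifier` are all assumptions made for illustration; any classifier that fits your data could take its place.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy data, invented for illustration: 'Size' has gaps, the numeric columns do not.
df = pd.DataFrame({
    "Height": [150, 152, 170, 172, 190, 192, 151, 171, 191],
    "Weight": [50, 52, 70, 72, 90, 92, 51, 71, 91],
    "Size": ["S", "S", "M", "M", "L", "L", np.nan, np.nan, np.nan],
})

features = ["Height", "Weight"]
known = df["Size"].notna()

# Fit on the rows where the category is observed...
clf = DecisionTreeClassifier(random_state=0)
clf.fit(df.loc[known, features], df.loc[known, "Size"])

# ...then predict the category for the rows where it is missing.
df_pred = df.copy()
df_pred.loc[~known, "Size"] = clf.predict(df.loc[~known, features])
print(df_pred)
```

This only works when the predictor columns themselves are complete (or have been imputed first), and the filled values are model guesses rather than observations.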


### 4. Python Code Example

Below is a Python example that demonstrates mode imputation for a categorical feature using Pandas and Scikit-Learn's `SimpleImputer`:

```python
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer

# Sample dataset with missing categorical values
data = {
    'Color': ['Red', 'Blue', np.nan, 'Green', 'Blue', np.nan, 'Red'],
    'Size': ['S', 'M', 'L', np.nan, 'M', 'S', np.nan]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

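# Option 1: Impute using the most frequent value (mode).
# fit_transform returns a NumPy array, so wrap it back into a DataFrame
# and carry over the original column names and index.
imputer_mode = SimpleImputer(strategy='most_frequent')
df_mode_imputed = pd.DataFrame(
    imputer_mode.fit_transform(df), columns=df.columns, index=df.index
)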
print("\nDataFrame after Mode Imputation:")
print(df_mode_imputed)

# Option 2: Replace missing values with a new category 'Missing'
df_new_category = df.fillna('Missing')
print("\nDataFrame after Adding New Category 'Missing':")
print(df_new_category)
```

#### Explanation:
- **Original DataFrame:** Contains missing values (`np.nan`) in the 'Color' and 'Size' columns.
- **Mode Imputation:** The `SimpleImputer` with `strategy='most_frequent'` fills missing values with the most common value in each column.
- **New Category:** Alternatively, you can simply replace missing values with a new category such as `'Missing'` to preserve the information that the data was absent.
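
If the new-category fill needs to live inside a Scikit-Learn pipeline rather than a one-off `fillna` call, `SimpleImputer` with `strategy="constant"` does the same job. A minimal sketch on toy data invented for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data, invented for illustration only.
df = pd.DataFrame({
    "Color": ["Red", "Blue", np.nan, "Green"],
    "Size": ["S", np.nan, "L", "M"],
})

# Every missing entry is replaced by the literal string 'Missing'.
imputer_const = SimpleImputer(strategy="constant", fill_value="Missing")
df_const = pd.DataFrame(
    imputer_const.fit_transform(df), columns=df.columns, index=df.index
)
print(df_const)
```

Because the imputer is a fitted transformer, the same object can later be applied to new data with `transform`.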

### 5. Conclusion

Handling missing categorical data is essential to maintain the quality of your dataset and improve model performance. Choose your strategy based on the amount of missing data and the importance of the feature. In many cases, mode imputation or adding a new category (e.g., "Missing") is a simple yet effective approach. Always validate your imputation method by comparing model performance across the candidate strategies on held-out data, fitting the imputer on the training portion only so that no information leaks from the evaluation set.
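
One way to compare imputation strategies is cross-validation with the imputer inside a pipeline, so that it is refit on each training fold. The sketch below uses synthetic data generated only to make it runnable; the `LogisticRegression` model, the 20% missing rate, and the accuracy metric are assumptions, not recommendations.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic data, generated only to make the sketch runnable.
rng = np.random.default_rng(0)
n = 200
color = rng.choice(["Red", "Blue", "Green"], size=n).astype(object)
y = (color == "Red").astype(int)      # target derived from the true category
color[rng.random(n) < 0.2] = np.nan   # knock out roughly 20% of the values
X = pd.DataFrame({"Color": color})

strategies = {
    "most_frequent": SimpleImputer(strategy="most_frequent"),
    "new_category": SimpleImputer(strategy="constant", fill_value="Missing"),
}

scores = {}
for name, imputer in strategies.items():
    # The imputer and encoder are refit on each training fold.
    model = make_pipeline(
        imputer, OneHotEncoder(handle_unknown="ignore"), LogisticRegression()
    )
    scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: mean accuracy = {scores[name]:.3f}")
```

The numbers from synthetic data say nothing about a real dataset; the point is the comparison pattern, which carries over unchanged.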