## Handling Missing Numerical Data | [Link](https://github.com/AdilShamim8/50-Days-of-Machine-Learning/tree/main/Day%2020%20Imputing%20Numerical%20Data)

### Overview

Missing numerical data is a frequent challenge in real-world datasets. Numerical features can contain gaps due to data entry errors, sensor malfunctions, or respondents skipping questions. Handling these gaps carefully matters because improper treatment can bias model estimates and reduce performance.

### Common Techniques

#### 1. Simple Imputation
- **Mean Imputation:**  
  Replace missing values with the mean of the observed values in the column. This is easy to implement, but the mean is sensitive to outliers, and filling every gap with the same value shrinks the column's variance.
- **Median Imputation:**  
  Substitute missing values with the median. This method is more robust to outliers and works well if the distribution is skewed.
- **Mode Imputation:**  
  Mode imputation is more common for categorical data, but it might be considered for numerical features that are discrete or multimodal, where the mean or median can land on a value that rarely occurs in the data.
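
To make the difference between these strategies concrete, here is a minimal sketch using plain `pandas`. The `income` and `rooms` columns and their values are made up for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one extreme value and one missing entry
income = pd.Series([30.0, 32.0, 35.0, 38.0, 1000.0, np.nan], name="income")

# The mean is pulled toward the outlier; the median is not
mean_filled = income.fillna(income.mean())
median_filled = income.fillna(income.median())
print("Mean fill value:  ", mean_filled.iloc[-1])    # 227.0
print("Median fill value:", median_filled.iloc[-1])  # 35.0

# Hypothetical discrete column where one value dominates
rooms = pd.Series([2.0, 3.0, 3.0, 4.0, 3.0, np.nan], name="rooms")

# mode() returns a Series (there can be ties), so take the first entry
rooms_filled = rooms.fillna(rooms.mode().iloc[0])
print("Mode fill value:  ", rooms_filled.iloc[-1])   # 3.0
```

With a single outlier in five observed values, the mean fill (227.0) is far from any typical observation, while the median fill (35.0) stays in the bulk of the data.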

#### 2. Advanced Imputation Methods
- **K-Nearest Neighbors (KNN) Imputation:**  
  Uses similar data points (neighbors) to estimate missing values by averaging the values from the nearest neighbors.
- **Regression Imputation:**  
  Builds a regression model using complete cases to predict missing values based on other features.
- **Multiple Imputation:**  
  Creates several complete datasets by imputing values multiple times and then combining the results to reflect the uncertainty in the imputations.
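
The sketch below shows one way these methods might look with scikit-learn. It uses `KNNImputer` for neighbor-based imputation and `IterativeImputer` (still marked experimental in scikit-learn, hence the extra import) as a regression-style imputer. Drawing several imputations with `sample_posterior=True` and different seeds is only a rough approximation of multiple imputation, not a full procedure, and the toy array is made up.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required to import IterativeImputer)
from sklearn.impute import IterativeImputer, KNNImputer

# Toy data: the second column is roughly twice the first; one value is missing
X = np.array([
    [1.0, 2.1],
    [2.0, 3.9],
    [3.0, np.nan],
    [4.0, 8.2],
    [5.0, 9.8],
])

# KNN imputation: average the feature over the 2 nearest rows
knn = KNNImputer(n_neighbors=2)
X_knn = knn.fit_transform(X)
print("KNN estimate:       ", X_knn[2, 1])

# Regression-style imputation: model the incomplete column from the other one
# (the default estimator is BayesianRidge)
iterative = IterativeImputer(random_state=0)
X_iter = iterative.fit_transform(X)
print("Regression estimate:", X_iter[2, 1])

# Rough multiple-imputation sketch: draw several completed datasets,
# then look at the spread of the imputed value across draws
draws = [
    IterativeImputer(sample_posterior=True, random_state=seed).fit_transform(X)
    for seed in range(5)
]
imputed_values = np.array([d[2, 1] for d in draws])
print("Mean across draws:  ", imputed_values.mean())
print("Std across draws:   ", imputed_values.std())
```

In a real analysis, the downstream model would be fitted on each completed dataset and the estimates pooled, rather than averaging the imputed values directly.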

#### 3. Time Series Specific Methods
- **Forward Fill and Backward Fill:**  
  In time-ordered data, forward fill propagates the last observed value forward, while backward fill uses the next observed value. Both work best when the series changes slowly between observations.
- **Interpolation:**  
  Linear or polynomial interpolation can estimate missing values based on surrounding data points.
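
A small illustration of these methods with `pandas`, using a made-up daily series. Only linear interpolation is shown, since polynomial interpolation in `pandas` additionally requires SciPy.

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings with gaps
idx = pd.date_range("2024-01-01", periods=6, freq="D")
s = pd.Series([10.0, np.nan, np.nan, 16.0, np.nan, 20.0], index=idx)

ffilled = s.ffill()                      # carry the last observed value forward
bfilled = s.bfill()                      # pull the next observed value backward
linear = s.interpolate(method="linear")  # straight line between neighbors

print(pd.DataFrame({
    "original": s,
    "ffill": ffilled,
    "bfill": bfilled,
    "linear": linear,
}))
```

Here forward fill yields 10, 10, 10, 16, 16, 20, backward fill yields 10, 16, 16, 16, 20, 20, and linear interpolation yields 10, 12, 14, 16, 18, 20.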

### Best Practices

- **Analyze Missingness Pattern:**  
  Determine whether the data are missing completely at random (MCAR), at random (MAR), or not at random (MNAR) to guide the choice of method.
- **Check Distribution:**  
  Compare the distribution before and after imputation to ensure that the chosen method does not introduce bias.
- **Cross-Validation:**  
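  Evaluate model performance using different imputation methods to choose the one that minimizes error and maintains the integrity of the data. Fit the imputer on the training portion of each fold only, so that statistics from the validation data do not leak into training.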
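
The last two practices can be sketched as follows. This is an illustrative setup on synthetic data, not a benchmark: the features, the coefficients, and the 10% missingness rate are all assumptions, and the gaps are MCAR by construction. Putting the imputer inside a scikit-learn pipeline means it is refitted on the training part of every fold.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic regression data (assumed for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Remove about 10% of the entries completely at random
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.1] = np.nan

# Distribution check: mean imputation shrinks the spread of a column
std_before = np.nanstd(X_missing[:, 0])
std_after = SimpleImputer(strategy="mean").fit_transform(X_missing)[:, 0].std()
print(f"Std of column 0 before: {std_before:.3f}, after mean imputation: {std_after:.3f}")

# Cross-validation: the imputer is part of the pipeline, so it is
# fitted only on the training portion of each fold
scores = {}
for strategy in ("mean", "median"):
    model = make_pipeline(SimpleImputer(strategy=strategy), LinearRegression())
    cv_scores = cross_val_score(
        model, X_missing, y, cv=5, scoring="neg_mean_squared_error"
    )
    scores[strategy] = -cv_scores.mean()
    print(f"{strategy:>6} imputation, CV MSE: {scores[strategy]:.4f}")
```

The same loop can be extended with other imputers, such as `KNNImputer`, by swapping the first pipeline step.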

### Python Code Example

Below is a Python snippet that demonstrates how to handle missing numerical data using mean and median imputation with the `pandas` library and scikit-learn's `SimpleImputer`.

```python
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer

# Create a sample DataFrame with missing numerical data
data = {
    'feature1': [10, 20, np.nan, 40, 50],
    'feature2': [5, np.nan, 15, 20, 25],
    'feature3': [100, 200, 300, np.nan, 500]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# Check for missing values in the dataset
print("\nMissing values per column:")
print(df.isnull().sum())

# Impute missing values using Mean Imputation
mean_imputer = SimpleImputer(strategy='mean')
df_mean_imputed = pd.DataFrame(mean_imputer.fit_transform(df), columns=df.columns, index=df.index)
print("\nDataFrame after Mean Imputation:")
print(df_mean_imputed)

# Impute missing values using Median Imputation
median_imputer = SimpleImputer(strategy='median')
df_median_imputed = pd.DataFrame(median_imputer.fit_transform(df), columns=df.columns, index=df.index)
print("\nDataFrame after Median Imputation:")
print(df_median_imputed)
```

#### Explanation:
- **Data Creation:** A DataFrame with numerical features is created, including `np.nan` to simulate missing values.
- **Missing Value Check:** The number of missing values in each column is printed.
- **Mean and Median Imputation:**  
  Using `SimpleImputer` from scikit-learn, missing values are replaced with either the mean or the median of each column. The resulting DataFrames are displayed.
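
As a follow-up sketch, a fitted `SimpleImputer` stores the per-column fill values in its `statistics_` attribute, and the same fitted imputer can be reused on new rows. The `new_rows` DataFrame below is hypothetical; the training data repeats the example above so the snippet runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Same sample data as in the example above
df = pd.DataFrame({
    'feature1': [10, 20, np.nan, 40, 50],
    'feature2': [5, np.nan, 15, 20, 25],
    'feature3': [100, 200, 300, np.nan, 500]
})

mean_imputer = SimpleImputer(strategy='mean').fit(df)
median_imputer = SimpleImputer(strategy='median').fit(df)

# Fill values learned from the observed entries of each column
print("Means:  ", mean_imputer.statistics_)    # [ 30.    16.25 275.  ]
print("Medians:", median_imputer.statistics_)  # [ 30.   17.5 250. ]

# Hypothetical new data: gaps are filled with the values learned from df
new_rows = pd.DataFrame({
    'feature1': [np.nan],
    'feature2': [12.0],
    'feature3': [np.nan]
})
new_filled = pd.DataFrame(mean_imputer.transform(new_rows), columns=new_rows.columns)
print(new_filled)
```

Reusing the fitted imputer this way keeps the fill values consistent between training data and any data seen later.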

---

### Conclusion

Handling missing numerical data is a crucial preprocessing step. While simple imputation methods like mean or median replacement are straightforward and efficient, more advanced techniques such as KNN or multiple imputation can offer better performance, especially when the data are not missing completely at random and the missingness depends on other observed features. Always validate the imputation method to ensure that the resulting dataset retains the underlying statistical properties of the original data.