# 1. How to handle missing data in regression analysis

## 1.1. Understanding Missing Data

Before deciding how to handle missing data, it's important to understand the nature and potential reasons for the missingness. Missing data can be categorized into three types:

- **Missing Completely at Random (MCAR):** The probability of missingness is the same for all observations, and missing values are not related to any other observed or unobserved data.

- **Missing at Random (MAR):** The probability of missingness depends on observed data but, once those observed values are accounted for, not on the missing values themselves.

- **Missing Not at Random (MNAR):** The missingness is related to the unobserved data. In such cases, the missing data itself is informative.
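
The three mechanisms are easier to tell apart in a small simulation. The sketch below is illustrative only: the `age` and `income` variables, the 30% missingness rate, and the logistic missingness rules are all assumptions made up for this example. It removes `income` values under each mechanism and compares the mean of the values that remain.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical data: income rises with age.
age = rng.normal(40, 10, n)
income = 1_000 * age + rng.normal(0, 5_000, n)
df_sim = pd.DataFrame({"age": age, "income": income})


def sigmoid(z):
    return 1 / (1 + np.exp(-z))


# MCAR: every income value has the same 30% chance of being missing.
mcar_mask = rng.random(n) < 0.3

# MAR: missingness depends only on the observed variable `age`.
mar_mask = rng.random(n) < sigmoid((age - 40) / 5)

# MNAR: missingness depends on the income value itself.
mnar_mask = rng.random(n) < sigmoid((income - 40_000) / 5_000)

observed_means = {
    "full data": df_sim["income"].mean(),
    "MCAR": df_sim.loc[~mcar_mask, "income"].mean(),
    "MAR": df_sim.loc[~mar_mask, "income"].mean(),
    "MNAR": df_sim.loc[~mnar_mask, "income"].mean(),
}
for name, value in observed_means.items():
    print(f"{name:>9}: mean of observed income = {value:,.0f}")
```

Under MCAR the observed mean should stay close to the full-data mean. Under MAR and MNAR it drifts, but only in the MAR case can the drift be corrected using the observed `age` values.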

## 1.2. Common Techniques to Handle Missing Data

### 1.2.1. Deletion

- **Listwise Deletion (Complete Case Analysis)**

    - **Description:** Remove all observations with any missing values.

    - **Pros:**

        - Simple to implement.
        
        - Keeps the dataset consistent across variables.
            
    - **Cons:**

        - Can lead to a significant loss of data, especially if many observations have missing values.

        - May introduce bias if the missing data is not MCAR.

- **Pairwise Deletion**

    - **Description:** Only remove observations with missing values for specific analyses or calculations.

    - **Pros:**

        - Preserves more data than listwise deletion.

    - **Cons:**

        - Inconsistent sample size across different analyses.

        - May lead to biased estimates if the data is not MCAR.
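
The difference between the two deletion strategies can be sketched with pandas. The small `data` frame below is hypothetical. `DataFrame.dropna()` performs listwise deletion, while `DataFrame.corr()` excludes missing values pair by pair, which amounts to pairwise deletion.

```python
import numpy as np
import pandas as pd

# Small hypothetical dataset with gaps in different columns.
data = pd.DataFrame({
    "x": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
    "y": [2.1, np.nan, 6.2, 8.1, 9.9, 12.2],
    "z": [1.0, 0.5, 0.2, np.nan, 0.9, 0.4],
})

# Listwise deletion: keep only rows that are complete in every column.
complete_rows = data.dropna()
listwise_corr = complete_rows.corr()

# Pairwise deletion: DataFrame.corr() drops missing values pair by pair,
# so each coefficient uses every row where both columns are observed.
pairwise_corr = data.corr()

# Number of rows behind each pairwise coefficient.
observed = data.notna().astype(int)
pairwise_n = observed.T @ observed

print("Rows used by listwise deletion:", len(complete_rows))
print("Rows used per pair (pairwise deletion):")
print(pairwise_n)
```

Each off-diagonal entry of `pairwise_n` is larger than the listwise row count here, which is the data-preservation benefit. The cost is that different coefficients rest on different subsets of rows.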

### 1.2.2. Imputation

- **Mean/Median/Mode Imputation**

    - **Description:** Replace missing values with the mean, median, or mode of the observed data for each variable.

    - **Pros:**

        - Simple and quick to implement.

    - **Cons:**

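        - Ignores relationships between variables: every gap gets the same value, which shrinks the variable's variance and weakens its correlations with other variables.

        - Distorts variances and correlations even under MCAR, and also biases the mean when the data is MAR or MNAR.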

- **K-Nearest Neighbors (KNN) Imputation**

    - **Description:** Use the average of the nearest $k$ neighbors to impute missing values based on feature similarity.

    - **Pros:**

        - Accounts for relationships between variables.
        
        - Often more accurate than mean/median/mode imputation when related variables are informative about the missing value.

    - **Cons:**

        - Computationally intensive, especially with large datasets.

        - Choice of $k$ and distance metric can affect results.

- **Regression Imputation**

    - **Description:** Predict missing values using regression models based on other variables in the dataset.

    - **Pros:**

        - Accounts for relationships between variables.

        - Useful when relationships between variables are linear.

    - **Cons:**

        - Can underestimate variability: imputed values fall exactly on the fitted regression line, so they carry no residual noise.

        - Assumes linearity unless using non-linear models.

- **Multiple Imputation**

    - **Description:** Create multiple imputed datasets, analyze each, and combine results to account for uncertainty in imputations.

    - **Pros:**

        - Provides robust estimates by incorporating uncertainty in the imputation process.

        - Valid under MAR when the imputation model is well specified, and applicable to datasets with several incomplete variables.

    - **Cons:**

        - Computationally intensive.

        - More complex to implement and interpret.

- **Interpolation and Extrapolation**

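    - **Description:** Estimate missing values in time-series data from neighboring observations. Interpolation fills a gap using the observed values before and after it, while extrapolation extends the series beyond the observed range.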

    - **Pros:**

        - Suitable for time-series data with trends or patterns.

    - **Cons:**

        - Can be inaccurate if data is highly volatile or non-linear.
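
Interpolation is not revisited in the hands-on section, so here is a minimal sketch of it on a hypothetical daily series (the dates and values are invented for illustration). It uses pandas' `Series.interpolate` for gaps inside the series and a forward fill as one crude way to handle a trailing gap.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series with a two-day gap and a missing final value.
dates = pd.date_range("2024-01-01", periods=8, freq="D")
series = pd.Series(
    [10.0, 11.0, np.nan, np.nan, 14.0, 15.0, 16.0, np.nan],
    index=dates,
)

# Time-weighted linear interpolation; limit_area="inside" fills only gaps
# that have observed values on both sides.
interior_filled = series.interpolate(method="time", limit_area="inside")

# The trailing gap has no later observation. Carrying the last value forward
# is one simple (and crude) form of extrapolation.
fully_filled = interior_filled.ffill()

print(pd.DataFrame({
    "original": series,
    "interpolated": interior_filled,
    "with_ffill": fully_filled,
}))
```

Keeping the two steps separate makes it explicit which values were interpolated between observations and which were extrapolated past the last one.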

### 1.2.3. Advanced Techniques

- **Using Machine Learning Models:**

    - **Description:** Use machine learning models to predict and impute missing values based on other features.

    - **Pros:**

        - Flexible and can model complex relationships.

    - **Cons:**

        - Computationally intensive and requires careful tuning and validation.
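
One possible way to do model-based imputation is to plug a tree ensemble into scikit-learn's `IterativeImputer`. The sketch below is an illustration, not a recommended configuration: the quadratic toy data, the 20% missingness rate, and the forest settings are all assumptions. It hides some values on purpose so the imputations can be scored against the truth.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
n = 500

# Hypothetical data with a non-linear relationship: y depends on x squared.
x = rng.uniform(-3, 3, n)
y = x ** 2 + rng.normal(0, 0.3, n)
full = pd.DataFrame({"x": x, "y": y})

# Hide 20% of the y values completely at random.
with_gaps = full.copy()
missing = rng.random(n) < 0.2
with_gaps.loc[missing, "y"] = np.nan

# Tree-based imputer: each incomplete column is predicted from the others.
rf_imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    max_iter=5,
    random_state=0,
)
rf_filled = pd.DataFrame(
    rf_imputer.fit_transform(with_gaps),
    columns=with_gaps.columns,
    index=with_gaps.index,
)

# Score both approaches on the values that were deliberately hidden.
truth = full.loc[missing, "y"]
rf_rmse = np.sqrt(((rf_filled.loc[missing, "y"] - truth) ** 2).mean())
mean_rmse = np.sqrt(((with_gaps["y"].mean() - truth) ** 2).mean())
print(f"RMSE random forest: {rf_rmse:.2f}")
print(f"RMSE mean fill:     {mean_rmse:.2f}")
```

Masking known values and scoring the imputations against them, as done here, is also a practical way to carry out the validation that this approach requires.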

## 1.3. Implementing Missing Data Techniques in Python

Let's use a real dataset to illustrate various techniques for handling missing data. We'll use the Titanic dataset, which can be loaded with seaborn's `load_dataset` function (seaborn downloads it on first use). It contains information about the passengers on the Titanic, including whether they survived, their age, and other attributes. Several columns have missing values, which makes it a good example for demonstrating different imputation techniques.

**Load and Explore the Titanic Dataset**

```python
import pandas as pd
import seaborn as sns

# Load the Titanic dataset
df = sns.load_dataset('titanic')

# Display the first few rows
print(df.head())

# Check for missing values
print("\nMissing values per column:")
print(df.isnull().sum())
```

```
   survived  pclass     sex   age  sibsp  parch     fare embarked  class  \
0         0       3    male  22.0      1      0   7.2500        S  Third
1         1       1  female  38.0      1      0  71.2833        C  First
2         1       3  female  26.0      0      0   7.9250        S  Third
3         1       1  female  35.0      1      0  53.1000        S  First
4         0       3    male  35.0      0      0   8.0500        S  Third

     who  adult_male deck  embark_town alive  alone
0    man        True  NaN  Southampton    no  False
1  woman       False    C    Cherbourg   yes  False
2  woman       False  NaN  Southampton   yes   True
3  woman       False    C  Southampton   yes  False
4    man        True  NaN  Southampton    no   True

Missing values per column:
survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
...
```

**Analyzing Missing Data**

The output shows the structure of the dataset and which columns have missing values. In the Titanic dataset, `age`, `embarked`, `deck`, and `embark_town` have missing values.
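
Counts alone do not say much about the missingness mechanism. As a rough sketch, the snippet below computes the share of missing values per column and checks whether the missingness of one column varies across groups of another. It uses a hypothetical `toy` frame with invented missingness rates rather than the real passenger data, so the numbers it prints say nothing about the Titanic itself.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1_000

# Hypothetical stand-in for a passenger table: `age` is missing more often
# in third class than in first class.
toy = pd.DataFrame({
    "pclass": rng.choice([1, 2, 3], size=n),
    "fare": rng.gamma(2.0, 15.0, size=n),
    "age": rng.normal(30, 12, size=n),
})
p_missing = toy["pclass"].map({1: 0.05, 2: 0.10, 3: 0.30})
toy.loc[rng.random(n) < p_missing, "age"] = np.nan

# Share of missing values per column, largest first.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)

# Missingness rate of `age` within each class: large differences between
# groups are evidence against MCAR.
age_missing_by_class = toy["age"].isnull().groupby(toy["pclass"]).mean()
print(age_missing_by_class)
```

The same two checks can be applied to `df` by swapping it in for `toy`.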

### 1.3.1. Handling Missing Data

Let's go through several techniques to handle these missing values.

#### 1.3.1.1. Deletion Methods

- **Listwise Deletion:** Remove all rows with any missing values.

    - **Pros and Cons:** This method is straightforward, but a single sparsely populated column can cause most rows to be discarded.


```python
# Listwise deletion
df_listwise = df.dropna()

# Check the size of the new DataFrame
print("\nListwise deletion:")
print(df_listwise.shape)
```

```
Listwise deletion:
(182, 15)
```
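
Much of that loss can come from columns the regression never uses. A minimal sketch, assuming a hypothetical `passengers` frame and a model that needs only `survived`, `age`, and `fare`, restricts the deletion with the `subset` argument of `dropna`:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: `deck` is often missing, `age` only occasionally.
passengers = pd.DataFrame({
    "survived": [0, 1, 1, 0, 1, 0],
    "age": [20.0, 40.0, np.nan, 30.0, 50.0, 5.0],
    "fare": [7.0, 70.0, 8.0, 50.0, 52.0, 21.0],
    "deck": [np.nan, "C", np.nan, "C", "E", np.nan],
})

# Dropping on every column discards rows because of `deck`, even though
# `deck` is not part of the model.
all_columns = passengers.dropna()

# Dropping only on the columns the regression uses keeps more rows.
model_columns = ["survived", "age", "fare"]
model_only = passengers.dropna(subset=model_columns)

print("Rows after dropna():", len(all_columns))
print("Rows after dropna(subset=model_columns):", len(model_only))
```

This is still complete-case analysis, so the MCAR caveat applies, but only to the variables that enter the model.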


#### 1.3.1.2. Imputation Methods

- **Mean Imputation**

    - Replace missing `age` values with the mean age.

    - **Pros and Cons:** Easy to implement, but every imputed passenger receives the same age, which shrinks the variance of `age`.


```python
# Mean imputation for 'age'
df_mean_imputed = df.copy()
# Assign the result back to the column instead of calling fillna(inplace=True)
# on the selected column, which pandas warns will stop working in pandas 3.0
df_mean_imputed['age'] = df_mean_imputed['age'].fillna(df_mean_imputed['age'].mean())

# Check for missing values in 'age'
print("\nMean imputation for 'age':")
print(df_mean_imputed['age'].isnull().sum())
```

```
Mean imputation for 'age':
0
```
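
Median and mode imputation follow the same pattern. The sketch below uses scikit-learn's `SimpleImputer` on a hypothetical `people` frame (the values are invented for illustration), with the median for a numeric column and the most frequent category for a categorical one.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical frame with one numeric and one categorical column.
people = pd.DataFrame({
    "age": [22.0, 38.0, np.nan, 35.0, 80.0, np.nan],
    "embarked": ["S", "C", "S", np.nan, "S", "Q"],
})

# Median for the numeric column (less sensitive to outliers than the mean).
median_imputer = SimpleImputer(strategy="median")
people["age"] = median_imputer.fit_transform(people[["age"]]).ravel()

# Most frequent category (the mode) for the categorical column.
mode_imputer = SimpleImputer(strategy="most_frequent")
people["embarked"] = mode_imputer.fit_transform(people[["embarked"]]).ravel()

print(people)
```

A fitted imputer object can be reused to apply the same fill values to new data via `transform`, which a one-off `fillna` call does not give you.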


- **K-Nearest Neighbors (KNN) Imputation**

    - Use KNN to impute missing values based on similar rows. `KNNImputer` works on numeric data, so only the numeric columns are selected.

    - **Pros and Cons:** Captures relationships between variables but can be computationally intensive.



```python
from sklearn.impute import KNNImputer

# Select numeric columns for KNN imputation
df_numeric = df[['age', 'fare', 'pclass', 'sibsp', 'parch']]

# Apply KNN imputation; fit_transform returns a NumPy array,
# so rebuild a DataFrame with the original column names and index
knn_imputer = KNNImputer(n_neighbors=5)
df_knn_imputed = pd.DataFrame(
    knn_imputer.fit_transform(df_numeric),
    columns=df_numeric.columns,
    index=df_numeric.index,
)

# Check for missing values in 'age'
print("\nKNN imputation for 'age':")
print(df_knn_imputed['age'].isnull().sum())
```

```
KNN imputation for 'age':
0
```
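
Because KNN relies on distances, features measured on large scales can dominate the neighbor search. The following sketch illustrates this on hypothetical data (the `small`, `large`, and `target` columns and their scales are assumptions chosen to exaggerate the effect). It compares imputation error with and without standardizing the features first.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 300

# Hypothetical features on very different scales.
small = rng.normal(0, 1, n)                  # informative, unit scale
large = rng.normal(0, 1_000, n)              # unrelated noise, huge scale
target = 2 * small + rng.normal(0, 0.1, n)   # depends on `small` only
features = pd.DataFrame({"small": small, "large": large, "target": target})

# Hide 20% of the target values so the imputations can be scored.
truth = features["target"].to_numpy().copy()
hidden = rng.random(n) < 0.2
features.loc[hidden, "target"] = np.nan

# KNN on the raw features: distances are dominated by `large`.
filled_raw = KNNImputer(n_neighbors=5).fit_transform(features)
raw_rmse = np.sqrt(np.mean((filled_raw[hidden, 2] - truth[hidden]) ** 2))

# StandardScaler ignores NaN when fitting and keeps it in place on transform.
scaler = StandardScaler()
scaled = scaler.fit_transform(features)
filled_scaled = KNNImputer(n_neighbors=5).fit_transform(scaled)
# Imputed values come back in scaled units; undo the scaling before scoring.
filled_back = scaler.inverse_transform(filled_scaled)
scaled_rmse = np.sqrt(np.mean((filled_back[hidden, 2] - truth[hidden]) ** 2))

print(f"RMSE without scaling: {raw_rmse:.2f}")
print(f"RMSE with scaling:    {scaled_rmse:.2f}")
```

Standardization is only one option. Whether it helps on a real dataset depends on which features are actually informative, so it is worth checking with held-out values as done here.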


- **Regression Imputation**

    - Predict missing `age` values with a linear regression fitted on the rows where `age` is known, using other numeric features as predictors.

    - **Pros and Cons:** Utilizes relationships between variables but assumes linearity unless using non-linear models.

```python
from sklearn.linear_model import LinearRegression

# Prepare data for regression imputation
age_known = df[df['age'].notnull()]
age_unknown = df[df['age'].isnull()]

# Independent variables
X_known = age_known[['pclass', 'sibsp', 'parch', 'fare']]
X_unknown = age_unknown[['pclass', 'sibsp', 'parch', 'fare']]

# Dependent variable
y_known = age_known['age']

# Fit regression model
regressor = LinearRegression()
regressor.fit(X_known, y_known)

# Predict missing ages
predicted_ages = regressor.predict(X_unknown)

# Impute predicted ages
df_regression_imputed = df.copy()
df_regression_imputed.loc[df['age'].isnull(), 'age'] = predicted_ages

# Check for missing values in 'age'
print("\nRegression imputation for 'age':")
print(df_regression_imputed['age'].isnull().sum())
```

```
Regression imputation for 'age':
0
```
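
Regression imputation understates variability because every imputed value sits exactly on the fitted line. A common remedy is stochastic regression imputation, which adds random noise to each prediction. The sketch below is a simplified illustration on hypothetical data; it assumes normally distributed residuals with constant variance, which may not hold in practice.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 2_000

# Hypothetical data: y depends linearly on x, with noise.
x = rng.normal(0, 1, n)
y = 3 * x + rng.normal(0, 2, n)
data = pd.DataFrame({"x": x, "y": y})
true_std = data["y"].std()

# Hide 40% of y completely at random.
missing = rng.random(n) < 0.4
data.loc[missing, "y"] = np.nan

# Fit the regression on the complete rows and estimate the residual spread.
observed = data[~missing]
model = LinearRegression().fit(observed[["x"]], observed["y"])
residual_std = (observed["y"] - model.predict(observed[["x"]])).std()

# Deterministic imputation: every imputed point sits on the fitted line.
predictions = model.predict(data.loc[missing, ["x"]])
deterministic = data["y"].copy()
deterministic[missing] = predictions

# Stochastic imputation: add noise drawn from the assumed residual distribution.
stochastic = data["y"].copy()
stochastic[missing] = predictions + rng.normal(0, residual_std, missing.sum())

print(f"Std of y, full data:                {true_std:.2f}")
print(f"Std of y, deterministic imputation: {deterministic.std():.2f}")
print(f"Std of y, stochastic imputation:    {stochastic.std():.2f}")
```

The stochastic version restores the spread of the variable, but it is still a single imputation, so standard errors computed from the completed data remain too small.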


#### 1.3.1.3. Advanced Techniques

- **Multiple Imputation**

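    - Use scikit-learn's `IterativeImputer`, which models each incomplete column as a function of the other columns in round-robin fashion. A single call returns one completed dataset; multiple imputation proper requires repeating the imputation with `sample_posterior=True` and different random seeds, then pooling the analysis results.

    - **Pros and Cons:** Accounts for imputation uncertainty once several imputed datasets are pooled, but is computationally intensive.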

```python
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer

# IterativeImputer with default settings produces a single completed dataset
# (one chained-equations imputation, not yet multiple imputation)
iterative_imputer = IterativeImputer(max_iter=10, random_state=0)
df_multiple_imputed = pd.DataFrame(
    iterative_imputer.fit_transform(df_numeric),
    columns=df_numeric.columns,
    index=df_numeric.index,
)

# Check for missing values in 'age'
print("\nIterative imputation for 'age':")
print(df_multiple_imputed['age'].isnull().sum())
```

```
Iterative imputation for 'age':
0
```
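
To turn this into multiple imputation in the stricter sense, the imputation has to be repeated with random draws and the analysis results pooled. The sketch below shows one way to do that on hypothetical data: it runs `IterativeImputer` with `sample_posterior=True` under different seeds, fits a simple regression to each completed dataset, and combines the slopes with Rubin's rules. The data, the number of imputations `m = 20`, and the hand-computed standard error are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 500

# Hypothetical data: y = 1 + 2*x + noise, with 30% of x missing (MCAR).
x = rng.normal(0, 1, n)
y = 1 + 2 * x + rng.normal(0, 1, n)
data = pd.DataFrame({"x": x, "y": y})
data.loc[rng.random(n) < 0.3, "x"] = np.nan

m = 20  # number of imputed datasets (an arbitrary choice for this sketch)
slopes, slope_variances = [], []
for seed in range(m):
    # sample_posterior=True draws imputations from a predictive distribution,
    # so each seed yields a different completed dataset.
    imputer = IterativeImputer(sample_posterior=True, max_iter=10,
                               random_state=seed)
    completed = pd.DataFrame(imputer.fit_transform(data), columns=data.columns)

    # Analysis model: regress y on x in the completed dataset.
    X = completed[["x"]].to_numpy()
    fit = LinearRegression().fit(X, completed["y"])
    residuals = completed["y"] - fit.predict(X)
    sigma2 = (residuals ** 2).sum() / (n - 2)
    slopes.append(fit.coef_[0])
    # Squared standard error of the slope in simple linear regression.
    slope_variances.append(sigma2 / ((X[:, 0] - X[:, 0].mean()) ** 2).sum())

# Rubin's rules: pool the m estimates and their variances.
pooled_slope = np.mean(slopes)
within = np.mean(slope_variances)
between = np.var(slopes, ddof=1)
total_variance = within + (1 + 1 / m) * between

print(f"Pooled slope: {pooled_slope:.3f}")
print(f"Pooled standard error: {np.sqrt(total_variance):.3f}")
```

The between-imputation term is what a single imputed dataset cannot provide: it widens the standard error to reflect that the filled-in values are estimates rather than observations.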
