### Handling Missing Data

Handling missing data is a crucial step in data preprocessing. Values can go missing for many reasons, and the mechanism behind the missingness determines which handling strategies are appropriate. The three standard mechanisms are:

**Missing Completely at Random (MCAR):**
Whether a value is missing is unrelated to every variable, observed or unobserved, so the observed entries are effectively a random subsample of the data.

**Missing at Random (MAR):**
The missingness depends on the observed data but not on the missing data itself. In other words, the probability of missingness is the same for all units with the same observed values.

**Missing Not at Random (MNAR):**
The missingness depends on the missing values themselves, even after accounting for observed data.
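
To make the three mechanisms concrete, the following sketch simulates each one on synthetic data and compares the mean of the observed values with the true mean. The `age`/`income` variables, the 30% missing rate, and the logistic missingness functions are all illustrative assumptions, not part of the Titanic example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical toy data: income rises with age
age = rng.uniform(20, 70, size=n)
income = 1_000 * age + rng.normal(0, 5_000, size=n)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# MCAR: every income value has the same 30% chance of being missing
mcar_mask = rng.random(n) < 0.3

# MAR: missingness of income depends only on the fully observed age
mar_mask = rng.random(n) < sigmoid((age - 45) / 5)

# MNAR: missingness of income depends on the income value itself
mnar_mask = rng.random(n) < sigmoid((income - 45_000) / 5_000)

true_mean = income.mean()
observed_means = {
    "MCAR": income[~mcar_mask].mean(),
    "MAR": income[~mar_mask].mean(),
    "MNAR": income[~mnar_mask].mean(),
}

print(f"true mean: {true_mean:.0f}")
for name, value in observed_means.items():
    print(f"{name}: mean of observed income = {value:.0f}")
```

In this setup the MCAR mean should land close to the true mean, while the MAR and MNAR means come out too low because high earners (or older people, who earn more here) are missing more often. Under MAR the gap can in principle be corrected by conditioning on `age`; under MNAR the observed data alone do not contain enough information to do so.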

#### Strategies for Handling Missing Data

**Deletion:**
- Remove rows or columns with missing values.
- Suitable when the data are MCAR and only a small share of rows or columns is affected, so that deletion neither introduces bias nor discards much information.
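
A minimal sketch of both kinds of deletion with pandas, using a small made-up frame. The column names and the 50% threshold for dropping a column are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with missing entries
df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 58.0, np.nan],
    "fare": [7.25, 71.28, 8.05, 26.55, 13.00],
    "cabin": [np.nan, "C85", np.nan, np.nan, np.nan],
})

# Share of missing values in each column
missing_share = df.isna().mean()

# Column-wise deletion: drop columns where more than half the values are missing
cols_to_drop = missing_share[missing_share > 0.5].index
df_cols = df.drop(columns=cols_to_drop)

# Row-wise (listwise) deletion: drop every remaining row that has a missing value
df_rows = df_cols.dropna()

print(missing_share)
print(df_rows)
```

Dropping the sparse column first keeps more rows than calling `dropna()` on the full frame, which here would leave no rows at all.
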

**Imputation:**
- Fill in missing values with estimated values.
- Common imputation methods include mean imputation, median imputation, or more sophisticated methods like k-Nearest Neighbors (KNN) imputation or regression imputation.
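
As an illustration of how a simple and a neighbor-based method can differ, this sketch applies scikit-learn's `SimpleImputer` and `KNNImputer` to a tiny made-up table. The data and the choice of `n_neighbors=2` are assumptions for the example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

# Hypothetical toy data: two cheap-fare young passengers, two expensive-fare
# older passengers, and one cheap-fare passenger whose age is unknown
X = pd.DataFrame({
    "age": [22.0, 24.0, np.nan, 58.0, 60.0],
    "fare": [7.0, 8.0, 7.5, 80.0, 85.0],
})

# Median imputation ignores the other column entirely
median_imputer = SimpleImputer(strategy="median")
X_median = pd.DataFrame(median_imputer.fit_transform(X), columns=X.columns)

# KNN imputation averages the ages of the rows with the most similar fares
knn_imputer = KNNImputer(n_neighbors=2)
X_knn = pd.DataFrame(knn_imputer.fit_transform(X), columns=X.columns)

print("median-imputed age:", X_median.loc[2, "age"])  # 41.0
print("KNN-imputed age:", X_knn.loc[2, "age"])        # 23.0
```

The median fills in a value typical of the whole column, whereas KNN borrows from the two rows with the closest fare. Because KNN relies on distances, features on very different scales would normally be standardized first.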

**Advanced Techniques:**
- Use machine learning models to predict missing values based on other features.
- Multiple imputation generates several imputed datasets, analyzes each one, and pools the results so that uncertainty about the imputed values carries through to the final estimates.
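
One way to approximate multiple imputation in scikit-learn is to run `IterativeImputer` several times with `sample_posterior=True` and different seeds. The sketch below does this on synthetic data; the data, the 20% missing rate, and the choice of five imputations are assumptions, and the pooling shown is only the simplest part of the full procedure (it does not compute within-imputation variance).

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required to expose IterativeImputer)
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
n = 200

# Hypothetical data: x2 is a noisy linear function of x1
x1 = rng.normal(size=n)
x2 = 2.0 * x1 + rng.normal(scale=0.5, size=n)
X = np.column_stack([x1, x2])

# Remove roughly 20% of x2 completely at random
missing = rng.random(n) < 0.2
X_missing = X.copy()
X_missing[missing, 1] = np.nan

# Build m imputed datasets; sample_posterior=True draws imputations from the
# model's predictive distribution instead of using a single point prediction
m = 5
estimates = []
for seed in range(m):
    imputer = IterativeImputer(sample_posterior=True, random_state=seed, max_iter=10)
    X_imputed = imputer.fit_transform(X_missing)
    estimates.append(X_imputed[:, 1].mean())  # quantity of interest: mean of x2

# Pool the per-dataset estimates
pooled_mean = np.mean(estimates)
between_var = np.var(estimates, ddof=1)  # spread caused by imputation uncertainty

print("per-imputation means:", np.round(estimates, 3))
print("pooled mean:", round(pooled_mean, 3))
print("between-imputation variance:", between_var)
```

The spread of the per-imputation estimates is what a single imputed dataset hides: it reflects how much the answer depends on the values that had to be guessed.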

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load the Titanic dataset
titanic = pd.read_csv('./data/titanic.csv')

# Summary of missing values per column
print('Missing Data Summary:')
print(titanic.isnull().sum())

# Drop columns with a high percentage of missing values (e.g., Cabin)
titanic = titanic.drop(columns=['Cabin'])

# Split the dataset into features and target variable
X = titanic.drop(columns=['Survived'])
y = titanic['Survived']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Work on explicit copies so the column assignments below modify these frames directly
X_train, X_test = X_train.copy(), X_test.copy()

# Impute missing values in 'Age' with the training-set mean.
# fit_transform/transform return 2-D arrays of shape (n, 1), so flatten them
# with ravel() before assigning to a single column.
imputer = SimpleImputer(strategy='mean')
X_train['Age'] = imputer.fit_transform(X_train[['Age']]).ravel()
X_test['Age'] = imputer.transform(X_test[['Age']]).ravel()

# Impute missing values in 'Embarked' with the training-set mode.
# Assigning the unflattened 2-D object array here can raise `ValueError: 2`.
imputer_embarked = SimpleImputer(strategy='most_frequent')
X_train['Embarked'] = imputer_embarked.fit_transform(X_train[['Embarked']]).ravel()
X_test['Embarked'] = imputer_embarked.transform(X_test[['Embarked']]).ravel()

# Train a Random Forest classifier on the numeric columns of the imputed training set
# (text columns such as 'Embarked' are excluded by select_dtypes)
model = RandomForestClassifier(random_state=42)
model.fit(X_train.select_dtypes(include=['number']), y_train)

# Make predictions on the imputed testing set
y_pred = model.predict(X_test.select_dtypes(include=['number']))

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
```

```
Missing Data Summary:
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
```