# Day 3: Data Preprocessing - Missing Values and Encoding

**Course**: Machine Learning with Python: From Basics to Applications  
**Objective**: Learn to handle missing data and encode categorical variables to prepare datasets for machine learning models.  
**Prerequisites**: Basic Python knowledge, familiarity with NumPy and Pandas (from Day 2), and the raw Titanic dataset (`titanic.csv`).  
**Tools**: Pandas, scikit-learn (install with `pip install pandas scikit-learn`).  
**Dataset**: Titanic dataset (raw, available from Kaggle or provided).  

In this notebook, we will:  
1. Load the raw Titanic dataset.  
2. Check for and handle missing values (impute `Age` with median, drop rows with missing `Embarked`).  
3. Encode categorical variables (`Gender` with `LabelEncoder`, `Embarked` with one-hot encoding).  
4. Save the preprocessed dataset as a CSV.  
5. Verify the preprocessing steps.  

## Step 1: Import Libraries

We need Pandas for data handling and scikit-learn’s `LabelEncoder` for encoding categorical variables.

In [16]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

## Step 2: Load the Titanic Dataset

Load the raw Titanic dataset (`titanic.csv`) and inspect it to understand its structure and identify missing values.

In [19]:
# Load the dataset
df = pd.read_csv('titanic.csv')

# Display first 5 rows
print("First 5 rows of raw data:")
df.head()

# Check data info and missing values
print("\nDataset info:")
df.info()

print("\nMissing values:")
print(df.isnull().sum())

First 5 rows of raw data:

Dataset info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Gender       891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

Missing values:
PassengerId      0
Survived         0
Pclass           0
Name             0
Gender           0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare           

**Expected Output**:  
- `df.head()` shows columns like `PassengerId`, `Survived`, `Pclass`, `Gender`, `Age`, `Fare`, `Embarked`, etc.  
- `df.info()` shows data types (e.g., `Age` as float, `Gender` as object).  
- `df.isnull().sum()` likely shows ~177 missing values for `Age`, ~2 for `Embarked`, and many for `Cabin`.  

**Note**: We’ll focus on `Age` and `Embarked` for this exercise; `Cabin` has too many missing values and can be ignored.

## Step 3: Handle Missing Values

Impute missing `Age` values with the median to preserve data. Drop rows with missing `Embarked` values since they are few.

In [20]:
# Impute missing Age with median
df['Age'] = df['Age'].fillna(df['Age'].median())
# Drop rows with missing Embarked
df.dropna(subset=['Embarked'], inplace=True)

# Verify missing values
print("Missing values after handling:")
print(df.isnull().sum())

Missing values after handling:
PassengerId      0
Survived         0
Pclass           0
Name             0
Gender           0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         0
dtype: int64


**Expected Output**:  
- No missing values for `Age` or `Embarked`.  
- `Cabin` may still have missing values, but we’re ignoring it for now.

## Step 4: Encode Categorical Variables

Encode `Gender` using `LabelEncoder` (male=0, female=1). Encode `Embarked` using one-hot encoding with `pd.get_dummies` and drop the first dummy column to avoid multicollinearity.

In [21]:
# Encode Gender with LabelEncoder
le = LabelEncoder()
df['Gender'] = le.fit_transform(df['Gender'])

# Encode Embarked with one-hot encoding
df = pd.get_dummies(df, columns=['Embarked'], drop_first=True)

# Display first 5 rows after encoding
print("\nFirst 5 rows after encoding:")
df.head()


First 5 rows after encoding:


Unnamed: 0,PassengerId,Survived,Pclass,Name,Gender,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked_Q,Embarked_S
0,1,0,3,"Braund, Mr. Owen Harris",1,22.0,1,0,A/5 21171,7.25,,False,True
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0,38.0,1,0,PC 17599,71.2833,C85,False,False
2,3,1,3,"Heikkinen, Miss. Laina",0,26.0,0,0,STON/O2. 3101282,7.925,,False,True
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",0,35.0,1,0,113803,53.1,C123,False,True
4,5,0,3,"Allen, Mr. William Henry",1,35.0,0,0,373450,8.05,,False,True


**Expected Output**:  
- `Gender` is now 0 or 1 (0=male, 1=female).  
- `Embarked` is replaced by `Embarked_C` and `Embarked_Q` (binary columns), with `Embarked_S` dropped.

## Step 5: Save Preprocessed Dataset

Save the preprocessed dataset as `titanic_preprocessed.csv` for use in future sessions.

In [22]:
# Save preprocessed data
df.to_csv('titanic_preprocessed.csv', index=False)

print("Preprocessed dataset saved as titanic_preprocessed.csv")

Preprocessed dataset saved as titanic_preprocessed.csv


## Step 6: Verification

Verify that the dataset has no missing values, `Gender` is encoded as 0/1, and `Embarked` is one-hot encoded.

In [24]:
# Verify saved file
print("\nVerify titanic_preprocessed.csv:")
print(pd.read_csv('titanic_preprocessed.csv').head())

# Confirm no missing values
print("\nMissing values in preprocessed data:")
print(df.isnull().sum())

# Check unique values for Gender
print("\nUnique values in Gender:", df['Gender'].unique())

# Check columns for Embarked encoding
print("\nColumns after encoding:", df.columns.tolist())


Verify titanic_preprocessed.csv:
   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name  Gender   Age  SibSp  \
0                            Braund, Mr. Owen Harris       1  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...       0  38.0      1   
2                             Heikkinen, Miss. Laina       0  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)       0  35.0      1   
4                           Allen, Mr. William Henry       1  35.0      0   

   Parch            Ticket     Fare Cabin  Embarked_Q  Embarked_S  
0      0         A/5 21171   7.2500   NaN       False        True  
1      0          PC 17599  71.2833   C85       False       False  
2      0  STON/O2. 3101282   7.9250   NaN       False        True  
3      0      

**Expected Output**:  
- No missing values (except possibly `Cabin`, which we’re ignoring).  
- `Gender` has values [0, 1].  
- Columns include `Embarked_C`, `Embarked_Q` (but not `Embarked_S`).

## Assignment

1. Run this notebook to preprocess the Titanic dataset.  
2. Verify no missing values for `Age` and `Embarked`.  
3. Confirm `Gender` is encoded as 0/1 and `Embarked` is one-hot encoded (`Embarked_C`, `Embarked_Q`).  
4. Ensure the saved `titanic_preprocessed.csv` contains the expected columns and data.  
5. Submit a screenshot of the notebook output showing the verification steps (missing values, unique values for `Gender`, and column names).  

**Next Steps**: On Day 4, we’ll split this preprocessed dataset into training and test sets and apply feature scaling.