# Day 5: Assignment - Prepare Titanic Dataset for Modeling

**Course**: Machine Learning with Python: From Basics to Applications  
**Objective**: Combine preprocessing steps from Days 3 and 4 to fully prepare the Titanic dataset for machine learning modeling.  
**Prerequisites**: Basic Python knowledge, familiarity with NumPy, Pandas (Day 2), handling missing values and encoding (Day 3), and train/test split with feature scaling (Day 4).  
**Tools**: Pandas, scikit-learn (install with `pip install pandas scikit-learn`).  
**Dataset**: Raw Titanic dataset (`titanic.csv`, available from Kaggle or provided).  

In this notebook, we will:  
1. Load the raw Titanic dataset.  
2. Handle missing values (impute `Age` with median, drop rows with missing `Embarked`).  
3. Encode categorical variables (`Sex` with `LabelEncoder`, `Embarked` with one-hot encoding).  
4. Split data into training (80%) and test (20%) sets.  
5. Scale numerical features (`Age`, `Fare`) using `StandardScaler`.  
6. Save the preprocessed and split datasets as CSVs.  
7. Verify all steps (no missing values, correct encoding, split sizes, and scaling).  


## Step 1: Import Libraries

Import Pandas for data handling, scikit-learn for encoding, splitting, and scaling.

In [1]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

## Step 2: Load the Titanic Dataset

Load the raw Titanic dataset and inspect it to identify missing values and data structure.

In [2]:
# Load the dataset
df = pd.read_csv('https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv')

# Display first 5 rows
print("First 5 rows of raw data:")
df.head()

# Check data info and missing values
print("\nDataset info:")
df.info()

print("\nMissing values:")
print(df.isnull().sum())

First 5 rows of raw data:

Dataset info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

Missing values:
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare           

**Expected Output**:  
- `df.head()` shows columns like `PassengerId`, `Survived`, `Pclass`, `Sex`, `Age`, `Fare`, `Embarked`, etc.  
- `df.info()` shows data types (e.g., `Age` as float, `Sex` as object).  
- `df.isnull().sum()` likely shows ~177 missing values for `Age`, ~2 for `Embarked`, and many for `Cabin`.  

**Note**: We’ll focus on `Age` and `Embarked`; `Cabin` has too many missing values and will be ignored.

## Step 3: Handle Missing Values

Impute missing `Age` values with the median to preserve data. Drop rows with missing `Embarked` values since they are few.

In [4]:
# Impute missing Age with median
df['Age'] = df['Age'].fillna(df['Age'].median())

# Drop rows with missing Embarked
df.dropna(subset=['Embarked'], inplace=True)

# Verify missing values
print("Missing values after handling:")
print(df.isnull().sum())

Missing values after handling:
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         0
dtype: int64


**Expected Output**:  
- No missing values for `Age` or `Embarked`.  
- `Cabin` may still have missing values, but we’re ignoring it.

## Step 4: Encode Categorical Variables

Encode `Sex` using `LabelEncoder` (male=0, female=1). Encode `Embarked` using one-hot encoding with `pd.get_dummies` and drop the first dummy column to avoid multicollinearity.

In [None]:
# Encode Sex with LabelEncoder
le = LabelEncoder()
df['Sex'] = le.fit_transform(df['Sex'])

# Encode Embarked with one-hot encoding
df = pd.get_dummies(df, columns=['Embarked'], drop_first=True)

# Display first 5 rows after encoding
print("\nFirst 5 rows after encoding:")
df.head()

**Expected Output**:  
- `Sex` is now 0 or 1 (0=male, 1=female).  
- `Embarked` is replaced by `Embarked_C` and `Embarked_Q` (binary columns), with `Embarked_S` dropped.

## Step 5: Train/Test Split

Split the data into 80% training and 20% test sets using `train_test_split`. Set `random_state=42` for reproducibility.

In [None]:
# Separate features and target
X = df.drop('Survived', axis=1)
y = df['Survived']

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Verify shapes
print("Training set shape (X_train, y_train):", X_train.shape, y_train.shape)
print("Test set shape (X_test, y_test):", X_test.shape, y_test.shape)

**Expected Output**:  
- `X_train` and `y_train`: ~712 rows (80% of ~889 after dropping rows).  
- `X_test` and `y_test`: ~177 rows (20%).

## Step 6: Feature Scaling

Apply `StandardScaler` to scale numerical features (`Age`, `Fare`). Fit the scaler on the training data and transform both training and test data to avoid data leakage.

In [None]:
# Initialize StandardScaler
scaler = StandardScaler()

# Fit and transform on training data (Age, Fare)
X_train[['Age', 'Fare']] = scaler.fit_transform(X_train[['Age', 'Fare']])

# Transform test data (using same scaler)
X_test[['Age', 'Fare']] = scaler.transform(X_test[['Age', 'Fare']])

# Verify scaling (mean ~0, std ~1 for training data)
print("Training Age mean after scaling:", X_train['Age'].mean())
print("Training Age std after scaling:", X_train['Age'].std())
print("Training Fare mean after scaling:", X_train['Fare'].mean())
print("Training Fare std after scaling:", X_train['Fare'].std())

# Display first few rows of scaled training data
print("\nFirst 5 rows of X_train after scaling:")
X_train.head()

**Expected Output**:  
- `Age` and `Fare` in `X_train`: Mean ~0, std ~1.  
- `X_test` values are scaled but may have slightly different mean/std (normal, as scaler was fit on training data).

## Step 7: Save Datasets

Save the preprocessed and split datasets as CSV files for use in Week 2.

In [None]:
# Save training and test datasets
X_train.to_csv('titanic_X_train_final.csv', index=False)
X_test.to_csv('titanic_X_test_final.csv', index=False)
y_train.to_csv('titanic_y_train.csv', index=False)
y_test.to_csv('titanic_y_test.csv', index=False)

print("Datasets saved as titanic_X_train_final.csv, titanic_X_test_final.csv, titanic_y_train.csv, titanic_y_test.csv")

## Step 8: Verification

Verify that the datasets are correctly preprocessed, split, and scaled.

In [None]:
# Verify saved files
print("\nVerify X_train_final.csv:")
print(pd.read_csv('titanic_X_train_final.csv').head())

# Confirm no missing values
print("\nMissing values in X_train:", X_train.isnull().sum().sum())
print("Missing values in X_test:", X_test.isnull().sum().sum())

# Check unique values for Sex
print("\nUnique values in Sex (X_train):", X_train['Sex'].unique())

# Check columns for Embarked encoding
print("\nColumns in X_train:", X_train.columns.tolist())

# Verify split sizes
print("\nTraining set size:", X_train.shape[0])
print("Test set size:", X_test.shape[0])

# Verify scaling
print("\nTraining Age mean:", X_train['Age'].mean())
print("Training Age std:", X_train['Age'].std())

**Expected Output**:  
- No missing values in `X_train` or `X_test` (except possibly `Cabin`, ignored).  
- `Sex` has values [0, 1].  
- Columns include `Embarked_C`, `Embarked_Q` (not `Embarked_S`).  
- Training set: ~712 rows, test set: ~177 rows.  
- `Age` and `Fare` in `X_train`: Mean ~0, std ~1.

## Assignment

1. Run this notebook to preprocess, split, and scale the Titanic dataset.  
2. Verify:  
   - No missing values for `Age` and `Embarked`.  
   - `Sex` is encoded as 0/1, `Embarked` as `Embarked_C`, `Embarked_Q`.  
   - Split sizes are 80% training (\~712 rows), 20% test (\~177 rows).  
   - `Age` and `Fare` in `X_train` have mean ~0 and std ~1.  
3. Ensure the saved CSV files (`titanic_X_train_final.csv`, etc.) are created and correct.  
4. Submit a screenshot of the notebook output showing the verification steps (missing values, `Sex` unique values, columns, split sizes, and scaling stats).  

**Next Steps**: In Week 2, you’ll use these datasets to train machine learning models like linear regression and logistic regression.