# Project 1: Titanic Survival Classification

This notebook tackles the classic Kaggle competition: predicting passenger survival on the RMS Titanic. We will walk through the entire machine learning workflow:

1.  **Exploratory Data Analysis (EDA):** Understanding the data and uncovering initial insights.
2.  **Feature Engineering & Preprocessing:** Transforming raw data into a format suitable for machine learning models.
3.  **Model Training:** Building and training Logistic Regression, Random Forest, and XGBoost models.
4.  **Model Evaluation:** Comparing the models to see which performs best.

## 1. Setup and Data Loading

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set plot style
sns.set_style('whitegrid')

In [None]:
# Load the data
# Make sure the 'train.csv' and 'test.csv' files are in the 'data/' directory.
try:
    train_df = pd.read_csv('data/train.csv')
    test_df = pd.read_csv('data/test.csv')
except FileNotFoundError:
    print("Data files not found. Please download them from Kaggle and place them in the 'data/' directory.")
    raise  # Re-raise: the cells below cannot run without train_df and test_df.

print("Train data shape:", train_df.shape)
print("Test data shape:", test_df.shape)

train_df.head()

## 2. Exploratory Data Analysis (EDA)

Let's explore the dataset to understand its structure, find missing values, and visualize relationships between features and the survival outcome.

In [None]:
# Get a summary of the training data
train_df.info()

In [None]:
# Check for missing values
print('Missing values in training data:\n', train_df.isnull().sum())
print('\n' + '-'*30 + '\n')
print('Missing values in test data:\n', test_df.isnull().sum())

In the training set, the `Age`, `Cabin`, and `Embarked` columns have missing values. In the test set, `Age`, `Cabin`, and `Fare` have missing values. We will need to handle all of these during preprocessing.
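Raw counts are easier to judge next to the share of rows they represent. The following is a minimal sketch, not part of the original notebook: `missing_summary` is a hypothetical helper, and `toy_df` is a made-up frame standing in for `train_df`.

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages for columns that have any."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        'missing': counts,
        'percent': (counts / len(df) * 100).round(1),
    })
    # Keep only columns with at least one missing value, worst first.
    return summary[summary['missing'] > 0].sort_values('missing', ascending=False)

# Toy frame standing in for train_df (values are made up for illustration).
toy_df = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Fare': [7.25, 71.28, 7.93, 53.10],
})
print(missing_summary(toy_df))
```

Calling `missing_summary(train_df)` and `missing_summary(test_df)` on the real data would give the same view for the two Kaggle files.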

### Visualizing the Target Variable: Survival

In [None]:
plt.figure(figsize=(6, 4))
sns.countplot(x='Survived', data=train_df)
plt.title('Survival Count (0 = No, 1 = Yes)')
plt.show()

### Visualizing Survival by Categorical Features

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

sns.countplot(x='Survived', hue='Sex', data=train_df, ax=axes[0])
axes[0].set_title('Survival by Sex')

sns.countplot(x='Survived', hue='Pclass', data=train_df, ax=axes[1])
axes[1].set_title('Survival by Pclass')

sns.countplot(x='Survived', hue='Embarked', data=train_df, ax=axes[2])
axes[2].set_title('Survival by Embarked')

plt.tight_layout()
plt.show()

**Observations:**
*   **Sex:** Females had a much higher chance of survival.
*   **Pclass:** Passengers in 1st class had a higher survival rate than those in 2nd and 3rd class.
*   **Embarked:** Passengers who embarked at Cherbourg ('C') seem to have a higher survival rate.
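Count plots show absolute numbers, so groups of different sizes are hard to compare by eye. One way to put numbers on the observations above is to take the mean of the 0/1 `Survived` column per category. This is a sketch under assumptions: `survival_rate_by` is a hypothetical helper and `toy_df` holds made-up rows, not the real data.

```python
import pandas as pd

# Toy data standing in for train_df (made-up rows, illustration only).
toy_df = pd.DataFrame({
    'Survived': [1, 1, 0, 0, 0, 1, 0, 1],
    'Sex': ['female', 'female', 'male', 'male', 'male', 'female', 'male', 'male'],
    'Pclass': [1, 2, 3, 3, 2, 1, 3, 1],
})

def survival_rate_by(df, column):
    """Mean of the 0/1 'Survived' column per category, plus the group size."""
    return df.groupby(column)['Survived'].agg(rate='mean', n='size')

for col in ['Sex', 'Pclass']:
    print(survival_rate_by(toy_df, col), '\n')
```

Reporting the group size `n` next to the rate helps avoid over-reading a rate that is based on only a handful of passengers.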

### Visualizing Numerical Features

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

sns.histplot(data=train_df, x='Age', hue='Survived', multiple='stack', kde=True, ax=axes[0])
axes[0].set_title('Age Distribution by Survival')

sns.histplot(data=train_df, x='Fare', hue='Survived', multiple='stack', kde=False, ax=axes[1])
axes[1].set_title('Fare Distribution by Survival')
axes[1].set_xlim(0, 200) # Limiting fare for better visualization

plt.tight_layout()
plt.show()

**Observations:**
*   **Age:** Young children (age < 10) appear to have a higher survival rate. A large number of passengers aged 20-40 did not survive.
*   **Fare:** Passengers who paid a higher fare had a better chance of survival.
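The age observation can be checked numerically by binning ages and computing the survival rate per band. The sketch below uses `pd.cut`. The bin edges, the `AgeBand` column name, and the toy rows are all assumptions chosen for illustration.

```python
import pandas as pd

# Made-up rows standing in for train_df.
toy_df = pd.DataFrame({
    'Age': [4, 8, 25, 31, 38, 45, 62, 29],
    'Survived': [1, 1, 0, 0, 1, 0, 0, 0],
})

# Bin edges are an arbitrary choice; right=False makes each bin [low, high).
bins = [0, 10, 20, 40, 60, 100]
labels = ['0-9', '10-19', '20-39', '40-59', '60+']
toy_df['AgeBand'] = pd.cut(toy_df['Age'], bins=bins, labels=labels, right=False)

# observed=True leaves out bands that contain no passengers.
band_rates = toy_df.groupby('AgeBand', observed=True)['Survived'].agg(rate='mean', n='size')
print(band_rates)
```

The same pattern with `pd.qcut` on `Fare` would split passengers into equally sized fare groups instead of fixed-width bins.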

## 3. Feature Engineering & Preprocessing

Now we'll prepare the data for modeling. This involves handling missing values, creating new features, and converting categorical data into a numerical format.

In [None]:
# We'll process both train and test sets together to ensure consistency.
# Let's keep the PassengerId from the test set for the final submission file.
test_passenger_id = test_df['PassengerId']

# We can drop PassengerId from the training set as it's not a feature.
train_df = train_df.drop(['PassengerId'], axis=1)

# Combine train and test data for easier processing
all_df = pd.concat([train_df.drop('Survived', axis=1), test_df], axis=0)

### Handling Missing Values

In [None]:
# Fill missing 'Age' values with the median age.
all_df['Age'] = all_df['Age'].fillna(all_df['Age'].median())

# Fill missing 'Embarked' values with the mode (most frequent value).
all_df['Embarked'] = all_df['Embarked'].fillna(all_df['Embarked'].mode()[0])

# Fill missing 'Fare' in the test set with the median fare.
all_df['Fare'] = all_df['Fare'].fillna(all_df['Fare'].median())

# Drop the 'Cabin' column due to too many missing values.
all_df = all_df.drop(['Cabin'], axis=1)

print('Missing values after imputation:\n', all_df.isnull().sum())
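The cell above computes each median over the combined train and test rows. A stricter variant learns the statistics from the training rows only and then applies them to the test rows. The sketch below shows that idea with scikit-learn's `SimpleImputer`. It is an alternative, not what the notebook does, and the `train_toy` / `test_toy` frames are made up.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up numeric columns standing in for the real train and test sets.
train_toy = pd.DataFrame({'Age': [20.0, 30.0, np.nan, 40.0],
                          'Fare': [10.0, 20.0, 30.0, np.nan]})
test_toy = pd.DataFrame({'Age': [np.nan, 50.0],
                         'Fare': [np.nan, 100.0]})

imputer = SimpleImputer(strategy='median')

# fit_transform learns the medians from the training rows only...
train_filled = pd.DataFrame(imputer.fit_transform(train_toy), columns=train_toy.columns)
# ...and transform reuses those same medians for the test rows.
test_filled = pd.DataFrame(imputer.transform(test_toy), columns=test_toy.columns)

print(train_filled)
print(test_filled)
```

On a dataset this small the difference is usually minor, but fitting on training rows only keeps cross-validation estimates independent of the test set.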

### Creating New Features

In [None]:
# Create 'FamilySize' from 'SibSp' and 'Parch'.
all_df['FamilySize'] = all_df['SibSp'] + all_df['Parch'] + 1

# Create 'IsAlone' feature.
all_df['IsAlone'] = (all_df['FamilySize'] == 1).astype(int)
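Because `all_df` no longer contains `Survived`, the new features cannot be compared with the outcome there. A quick check would need a frame that still has the label. The sketch below uses a made-up `toy_df` and `pd.crosstab` to show the survival share for passengers travelling alone versus with family. The numbers are illustrative only.

```python
import pandas as pd

# Made-up rows; on the real data this would be a copy of the labelled training set.
toy_df = pd.DataFrame({
    'SibSp': [0, 1, 0, 1, 4, 0, 1, 0],
    'Parch': [0, 0, 0, 2, 2, 0, 1, 0],
    'Survived': [0, 1, 0, 1, 0, 1, 1, 0],
})
toy_df['FamilySize'] = toy_df['SibSp'] + toy_df['Parch'] + 1
toy_df['IsAlone'] = (toy_df['FamilySize'] == 1).astype(int)

# normalize='index' turns each row into shares that sum to 1.
survival_by_alone = pd.crosstab(toy_df['IsAlone'], toy_df['Survived'], normalize='index')
print(survival_by_alone)
```

The column labelled `1` in the result is the survival rate for each `IsAlone` group.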

### Converting Categorical Features & Dropping Unused Columns

In [None]:
# Convert 'Sex' to numeric.
all_df['Sex'] = all_df['Sex'].map({'male': 0, 'female': 1}).astype(int)

# One-hot encode 'Embarked'.
all_df = pd.get_dummies(all_df, columns=['Embarked'], prefix='Embarked')

# Drop original columns that are now redundant or not useful.
all_df = all_df.drop(['Name', 'Ticket', 'SibSp', 'Parch'], axis=1)

all_df.head()
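Encoding the combined frame guarantees that train and test end up with the same dummy columns. If the two sets were encoded separately, a category missing from one of them would produce mismatched columns. The sketch below shows that case on made-up data and one way to realign the columns with `reindex`.

```python
import pandas as pd

train_toy = pd.DataFrame({'Embarked': ['S', 'C', 'Q']})
test_toy = pd.DataFrame({'Embarked': ['S', 'C']})  # 'Q' never appears here

train_enc = pd.get_dummies(train_toy, columns=['Embarked'], prefix='Embarked')
test_enc = pd.get_dummies(test_toy, columns=['Embarked'], prefix='Embarked')
print('Before realignment:', list(test_enc.columns))

# test_enc has no Embarked_Q column, so realign it to the training columns
# and fill the column that was absent with zeros.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print('After realignment: ', list(test_enc.columns))
```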

### Separating Data back into Train and Test Sets

In [None]:
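# Split the combined dataframe back into training and testing sets.
# The concat above re-introduced a 'PassengerId' column (NaN for the training rows),
# so drop it from both splits to keep the feature columns identical.
X_train = all_df.iloc[:len(train_df)].drop('PassengerId', axis=1)
X_test = all_df.iloc[len(train_df):].drop('PassengerId', axis=1)
y_train = train_df['Survived']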

print('X_train shape:', X_train.shape)
print('y_train shape:', y_train.shape)
print('X_test shape:', X_test.shape)
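Shapes alone do not reveal mismatched columns or leftover `NaN` values, and either one can break model fitting or prediction. A small check before training can catch both. This is a sketch: `check_feature_frames` is a hypothetical helper and the frames are made up.

```python
import numpy as np
import pandas as pd

def check_feature_frames(X_train, X_test):
    """Return a list of problems; an empty list means the frames look consistent."""
    problems = []
    if list(X_train.columns) != list(X_test.columns):
        diff = sorted(set(X_train.columns) ^ set(X_test.columns))
        problems.append(f'column mismatch: {diff}')
    for name, frame in [('X_train', X_train), ('X_test', X_test)]:
        n_missing = int(frame.isnull().sum().sum())
        if n_missing:
            problems.append(f'{name} has {n_missing} missing values')
    return problems

good_train = pd.DataFrame({'Pclass': [1, 3], 'Sex': [0, 1]})
good_test = pd.DataFrame({'Pclass': [2], 'Sex': [1]})
# A stray all-NaN column in the training frame triggers both checks.
bad_train = good_train.assign(PassengerId=np.nan)

print(check_feature_frames(good_train, good_test))
print(check_feature_frames(bad_train, good_test))
```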

## 4. Model Training & Evaluation

It's time to train our models. We will use 5-fold cross-validation to evaluate three different classifiers and compare their performance.

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

In [None]:
# Initialize models
log_reg = LogisticRegression(max_iter=2000)
random_forest = RandomForestClassifier(n_estimators=100, random_state=42)
xgb = XGBClassifier(n_estimators=100, random_state=42, eval_metric='logloss')

models = {
    'Logistic Regression': log_reg,
    'Random Forest': random_forest,
    'XGBoost': xgb
}

results = {}

for name, model in models.items():
    cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')
    results[name] = cv_scores
    print(f'{name}: Mean Accuracy = {cv_scores.mean():.4f} (Std = {cv_scores.std():.4f})')
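Two refinements are worth considering here. Logistic regression tends to converge more reliably on standardized features, and passing an explicit `StratifiedKFold` with shuffling makes the fold assignment reproducible and independent of row order. The sketch below shows both on synthetic data from `make_classification`. The `X_demo` / `y_demo` names and all numbers are placeholders, not Titanic results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_train / y_train.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

# Shuffled, stratified folds with a fixed seed.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# The scaler is refit inside every fold, so validation rows never influence it.
scaled_log_reg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))

scores = cross_val_score(scaled_log_reg, X_demo, y_demo, cv=cv, scoring='accuracy')
print(f'Mean Accuracy = {scores.mean():.4f} (Std = {scores.std():.4f})')
```

Passing the same `cv` object to every model also ensures that all classifiers are compared on identical folds.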

### Visualizing Model Performance

In [None]:
plt.figure(figsize=(10, 6))
sns.boxplot(data=pd.DataFrame(results))
plt.title('Model Accuracy Comparison (5-Fold Cross-Validation)')
plt.ylabel('Accuracy Score')
plt.show()

**Observation:**

Random Forest and XGBoost perform similarly and are both stronger than Logistic Regression. Random Forest appears to have a slightly higher median accuracy in this run. For the final step, we can choose one of these to generate a submission file.

### Generating a Submission File

In [None]:
# Train the final model on the entire training dataset
final_model = RandomForestClassifier(n_estimators=100, random_state=42)
final_model.fit(X_train, y_train)

# Make predictions on the test data
predictions = final_model.predict(X_test)

# Create the submission DataFrame
submission_df = pd.DataFrame({'PassengerId': test_passenger_id, 'Survived': predictions})

print('Submission file preview:')
submission_df.head()

In [None]:
# To save the file for submission to Kaggle:
# submission_df.to_csv('titanic_submission.csv', index=False)

## 5. Conclusion and Next Steps

### Summary of Findings

This project aimed to predict passenger survival on the Titanic. Through our analysis, we confirmed several historical hypotheses:
- Passengers in higher classes (`Pclass`) had a better chance of survival.
- Female passengers (`Sex`) had a significantly higher survival rate than males.
- We engineered `FamilySize` and `IsAlone` features to capture the pattern that passengers travelling alone had a lower survival rate than those in small-to-medium-sized families.

Among the three models tested, **Random Forest** and **XGBoost** were the top performers, both achieving a cross-validated accuracy of over 81%. Logistic Regression, while a good baseline, was clearly outperformed.

### Limitations

While the models perform reasonably well, there are several limitations to this analysis:
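1.  **Simple Imputation:** We used median/mode imputation, which is simple but ignores relationships between features (for example, age may vary with passenger class). The statistics were also computed over the combined train and test rows, so preprocessing is not strictly independent of the test set.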
2.  **Feature Engineering:** Our feature engineering was basic. More complex features, like extracting titles from names (e.g., 'Mr.', 'Mrs.', 'Dr.'), could provide more signal.
3.  **No Hyperparameter Tuning:** The models were trained with default or near-default settings; only a few arguments such as `max_iter` and `random_state` were set explicitly. A systematic search for better hyperparameters would likely boost performance.
4.  **Information Loss:** We dropped the `Cabin` column entirely. While it had many missing values, there might be a way to extract useful information from it (e.g., the deck level).
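Limitations 2 and 4 can both be addressed with pandas string methods. The sketch below pulls a title out of `Name` and a deck letter out of `Cabin`. The sample rows follow the Kaggle column format, while the `Title` and `Deck` column names, the regular expression, and the `'Unknown'` placeholder are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# A few sample rows in the Kaggle 'Name' / 'Cabin' format.
toy_df = pd.DataFrame({
    'Name': ['Braund, Mr. Owen Harris',
             'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
             'Heikkinen, Miss. Laina'],
    'Cabin': [np.nan, 'C85', np.nan],
})

# The title sits between the comma after the surname and the first period.
toy_df['Title'] = toy_df['Name'].str.extract(r',\s*([^.]+)\.', expand=False).str.strip()

# The first character of the cabin code is the deck letter; missing cabins
# get an explicit placeholder instead of being dropped.
toy_df['Deck'] = toy_df['Cabin'].str[0].fillna('Unknown')

print(toy_df[['Title', 'Deck']])
```

Rare titles would typically be grouped into a single category before encoding, since categories with only a few passengers add little signal.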

### Potential Next Steps

To improve upon this project, one could:
- **Advanced Feature Engineering:** Extract titles from the `Name` column and group rare titles.
- **Hyperparameter Tuning:** Use `GridSearchCV` or `RandomizedSearchCV` to find the best settings for the Random Forest or XGBoost models.
- **Ensemble Methods:** Create a stacked ensemble that combines the predictions of multiple models to potentially achieve higher accuracy.
- **Error Analysis:** Perform a deeper analysis of the cases where our best model made incorrect predictions to understand its weaknesses.
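As a starting point for the hyperparameter tuning step, the sketch below runs `GridSearchCV` over a small Random Forest grid. It uses synthetic data from `make_classification` so that it runs on its own. The grid values and the `X_demo` / `y_demo` names are arbitrary placeholders, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_train / y_train.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

# A deliberately tiny grid: 2 x 2 = 4 candidates, each scored with 3-fold CV.
param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [3, None],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,
    scoring='accuracy',
)
search.fit(X_demo, y_demo)

print('Best parameters:', search.best_params_)
print(f'Best CV accuracy: {search.best_score_:.4f}')
```

After fitting, `search.best_estimator_` holds a model refit on all the supplied rows with the best parameters, which could replace `final_model` in the submission step.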