# Titanic Dataset Analysis

In this notebook, we analyze the Titanic dataset. The goal is to build a predictive model that estimates whether a given passenger survived.

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import joblib
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report


## Load Dataset

In this section, we load the datasets from the `train.csv` and `test.csv` files.

In [None]:
# Load Dataset
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

## Data Overview

In this section, we perform an initial exploration of the training dataset to understand its structure and contents.

In [None]:
# Data Overview
print(train_df.head())
train_df.info()  # info() prints its report directly and returns None
print(train_df.describe())

## Exploratory Data Analysis (EDA)

In this section, we perform exploratory data analysis to visualize and understand the relationships between different features and the target variable 'Survived'. This helps us gain insights into the data and identify patterns that may be useful for building predictive models.

In [None]:
# Exploratory Data Analysis (EDA)
sns.pairplot(train_df, hue='Survived')
plt.show()
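# Correlate numeric columns only; text columns such as 'Name' cannot be correlated
sns.heatmap(train_df.corr(numeric_only=True))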
plt.show()

## Fill Missing Values

We handle missing values so that the model receives complete inputs:
- For the 'Age' column, we fill missing values with the median age of the respective dataset.
- For the 'Embarked' column in the training data, we use the most common embarkation point (the mode).
- In the test dataset, we also fill missing values in the 'Fare' column with the median fare.

In [None]:
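# Fill missing values
# Assign the result back instead of calling fillna(inplace=True) on a column,
# which newer pandas versions warn about and which may leave the DataFrame unchanged
train_df['Age'] = train_df['Age'].fillna(train_df['Age'].median())
train_df['Embarked'] = train_df['Embarked'].fillna(train_df['Embarked'].mode()[0])
test_df['Age'] = test_df['Age'].fillna(test_df['Age'].median())
test_df['Fare'] = test_df['Fare'].fillna(test_df['Fare'].median())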
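
As a quick check of the filling strategy described above, here is a minimal sketch on a made-up DataFrame. The name `toy_df` and its values are illustrative assumptions, not part of the Titanic data.

```python
import numpy as np
import pandas as pd

# Toy data standing in for two Titanic columns; the values are made up.
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 38.0, 26.0],
    "Embarked": ["S", "C", None, "S"],
})

# Count missing values per column before filling.
missing_before = toy_df.isna().sum()

# Median for the numeric column, mode (most frequent value) for the categorical one.
toy_df["Age"] = toy_df["Age"].fillna(toy_df["Age"].median())
toy_df["Embarked"] = toy_df["Embarked"].fillna(toy_df["Embarked"].mode()[0])

# Count again to confirm nothing is left unfilled.
missing_after = toy_df.isna().sum()
print(missing_before.to_dict(), missing_after.to_dict())
```

Running the same `isna().sum()` check on `train_df` and `test_df` would show whether any other columns still contain gaps.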

## Encode Categorical Variables

We use `LabelEncoder` to convert the categorical variables 'Sex' and 'Embarked' into numerical format. This step is necessary because scikit-learn estimators expect numeric input.

- **LabelEncoder**: Assigns an integer to each category in sorted order (e.g., 'female' as 0, 'male' as 1).

The encoding fitted on the training data is reused on the test data so that both datasets share the same mapping.

In [None]:
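# Encode categorical variables
# Use one encoder per column: refitting a single encoder on 'Embarked'
# would discard the 'Sex' mapping needed to transform the test set
sex_encoder = LabelEncoder()
embarked_encoder = LabelEncoder()
train_df['Sex'] = sex_encoder.fit_transform(train_df['Sex'])
train_df['Embarked'] = embarked_encoder.fit_transform(train_df['Embarked'])
test_df['Sex'] = sex_encoder.transform(test_df['Sex'])
test_df['Embarked'] = embarked_encoder.transform(test_df['Embarked'])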
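
To see which integer each category receives, the following sketch fits one encoder per column on made-up data and reads the mapping from `classes_`. The names `toy_train`, `toy_test`, `encoders`, and `mappings` are illustrative assumptions.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up rows using the same category labels as the Titanic columns.
toy_train = pd.DataFrame({"Sex": ["male", "female", "female"],
                          "Embarked": ["S", "C", "Q"]})
toy_test = pd.DataFrame({"Sex": ["female", "male"],
                         "Embarked": ["Q", "S"]})

encoders = {}
for column in ["Sex", "Embarked"]:
    encoder = LabelEncoder()
    # Fit on the training column, then reuse the same mapping for the test column.
    toy_train[column] = encoder.fit_transform(toy_train[column])
    toy_test[column] = encoder.transform(toy_test[column])
    encoders[column] = encoder

# classes_ holds the labels in sorted order; the position is the assigned code.
mappings = {
    column: {str(label): code for code, label in enumerate(encoder.classes_)}
    for column, encoder in encoders.items()
}
print(mappings)
```

Keeping the fitted encoders in a dictionary also makes it possible to call `inverse_transform` later to recover the original labels.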

## Normalize Numerical Features

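We use `StandardScaler` to standardize the numerical features 'Age' and 'Fare' so that each has a mean of 0 and a standard deviation of 1 on the training data. The scaler is fitted on the training set only and then applied unchanged to the test set. Scaling matters most for distance-based or gradient-based models; tree-based models such as the Random Forest used below are largely insensitive to it.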

In [None]:
# Normalize numerical features
scaler = StandardScaler()
train_df[['Age', 'Fare']] = scaler.fit_transform(train_df[['Age', 'Fare']])
test_df[['Age', 'Fare']] = scaler.transform(test_df[['Age', 'Fare']])
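
The transformation can be verified by hand. This sketch, on made-up numbers, compares `StandardScaler` with the formula (x - mean) / std; the names `toy_train`, `toy_test`, and `manual` are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up values; both columns are chosen so the training means are 30.
toy_train = pd.DataFrame({"Age": [20.0, 30.0, 40.0], "Fare": [10.0, 20.0, 60.0]})
toy_test = pd.DataFrame({"Age": [30.0], "Fare": [30.0]})

scaler = StandardScaler()
train_scaled = scaler.fit_transform(toy_train)  # learns mean and std from training data
test_scaled = scaler.transform(toy_test)        # reuses the training mean and std

# StandardScaler uses the population standard deviation (ddof=0).
manual = (toy_train - toy_train.mean()) / toy_train.std(ddof=0)

print(np.allclose(train_scaled, manual.to_numpy()))
print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
```

The test row equals the training mean in both columns, so it maps to zero after scaling, which illustrates that test data is expressed relative to training statistics.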

## Feature Engineering

We create a new feature 'FamilySize' by summing 'SibSp' (siblings and spouses aboard) and 'Parch' (parents and children aboard) and adding 1 for the passenger, which gives the size of each passenger's family group aboard the Titanic.

In [None]:
# Feature Engineering
train_df['FamilySize'] = train_df['SibSp'] + train_df['Parch'] + 1
test_df['FamilySize'] = test_df['SibSp'] + test_df['Parch'] + 1

## Data Visualization

In this section, we plot the overall survival counts, survival broken down by passenger class and by sex, and the distribution of age. Because these plots are drawn after the preprocessing steps, 'Sex' appears as integer codes and 'Age' is shown in standardized units.

In [None]:
# Data Visualization
sns.countplot(x='Survived', data=train_df)
plt.show()

sns.countplot(x='Pclass', hue='Survived', data=train_df)
plt.show()

sns.countplot(x='Sex', hue='Survived', data=train_df)
plt.show()

sns.histplot(train_df['Age'], bins=30, kde=True)
plt.show()

## Model Building

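We define the feature matrix `X` (dropping the target and the free-text columns 'Name', 'Ticket', and 'Cabin') and the target variable `y`, then hold out 20% of the data as a validation set. A `RandomForestClassifier` with 100 trees is trained on the remaining 80%, and the fitted model is saved to `titanic_model.pkl` with `joblib`.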

In [None]:
# Model Building
X = train_df.drop(['Survived', 'Name', 'Ticket', 'Cabin'], axis=1)
y = train_df['Survived']

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

joblib.dump(model, 'titanic_model.pkl')
print("Model saved to 'titanic_model.pkl'")
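
A saved model can be restored with `joblib.load`. The sketch below shows the round trip on synthetic data from `make_classification`; the file name `demo_model.pkl` and the variables `demo_model` and `restored_model` are illustrative assumptions, and a temporary directory is used so nothing is left on disk.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the Titanic features and labels.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=42)
demo_model = RandomForestClassifier(n_estimators=20, random_state=42).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_model.pkl")
    joblib.dump(demo_model, model_path)      # serialize the fitted model
    restored_model = joblib.load(model_path)  # read it back into memory

# The restored model should reproduce the original predictions exactly.
predictions_match = bool((demo_model.predict(X_demo) == restored_model.predict(X_demo)).all())
print(predictions_match)
```

Loading works most reliably with the same scikit-learn version that was used for saving.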

## Feature Importance

We analyze which features were most important in predicting survival, using the importance scores from our Random Forest model.

In [None]:
# Feature Importance
importances = model.feature_importances_
feature_names = X.columns
feature_importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': importances})
# Sort so the most important features come first, then display the table
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)
print(feature_importance_df)
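
For intuition about these scores, here is a small sketch on synthetic data, where only two of five features carry signal. The names `X_demo`, `forest`, and `importance_series` are illustrative assumptions. Random Forest importances are normalized, so they sum to 1.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 2 informative features plus 3 pure-noise features.
X_arr, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=2, n_redundant=0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(5)])

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Pair each importance with its column name and rank from most to least important.
importance_series = pd.Series(
    forest.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)
print(importance_series)
print(importance_series.sum())
```

These impurity-based scores describe how the trained forest used each feature, not a causal effect on the target.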

## Hyperparameter Tuning

We use `GridSearchCV` to search exhaustively over combinations of hyperparameters for the Random Forest model, scoring each combination by 5-fold cross-validated accuracy on the training set. The grid covers the number of trees (`n_estimators`), the maximum tree depth (`max_depth`), the minimum number of samples required to split a node (`min_samples_split`), and the minimum number of samples required at a leaf node (`min_samples_leaf`).

In [None]:
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}
grid_search = GridSearchCV(model, param_grid=param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation score: {grid_search.best_score_:.4f}")

# Use the best model
best_model = grid_search.best_estimator_
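
The cost of this search can be worked out before running it. The sketch below counts the candidates in the same grid with scikit-learn's `ParameterGrid`; the variable names `n_candidates`, `n_folds`, and `n_fits` are illustrative assumptions.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Every combination of values is one candidate: 3 * 4 * 3 * 3 = 108.
n_candidates = len(ParameterGrid(param_grid))

# Each candidate is trained once per fold: 108 * 5 = 540 fits,
# plus one final refit of the best candidate on the full training set.
n_folds = 5
n_fits = n_candidates * n_folds
print(n_candidates, n_fits)  # 108 540
```

If the search takes too long, passing `n_jobs=-1` to `GridSearchCV` runs the fits in parallel across the available CPU cores.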

## Cross-Validation

We use cross-validation to estimate the model's performance more robustly than a single train/validation split allows. The data is split into several folds; the model is trained on all but one fold and scored on the held-out fold, and this is repeated so that each fold serves as the test set once. Averaging the fold scores gives a more reliable estimate of how well the model generalizes to unseen data.

In [None]:
# Cross-Validation
cv_scores = cross_val_score(model, X, y)
print(f'Cross-validation scores: {cv_scores}')
print(f'Mean cross-validation score: {cv_scores.mean()}')
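
To make the procedure concrete, this sketch writes the fold loop by hand on synthetic data and compares it with `cross_val_score`. It assumes stratified 5-fold splitting, which is what scikit-learn uses by default for classifiers; the names `estimator`, `folds`, and `manual_scores` are illustrative.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the Titanic features and labels.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=42)
estimator = RandomForestClassifier(n_estimators=20, random_state=42)
folds = StratifiedKFold(n_splits=5)

manual_scores = []
for train_idx, test_idx in folds.split(X_demo, y_demo):
    # Train a fresh copy on four folds, then score accuracy on the held-out fold.
    fold_model = clone(estimator)
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fold_model.score(X_demo[test_idx], y_demo[test_idx]))

# The library helper performs the same loop internally.
library_scores = cross_val_score(estimator, X_demo, y_demo, cv=folds)
print(np.round(manual_scores, 3))
print(np.round(library_scores, 3))
```

Because the splits and the random seed are fixed, both approaches should report the same per-fold accuracies.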

## Model Evaluation

We evaluate the model on the validation set using the accuracy score, the confusion matrix, and the classification report, which lists precision, recall, and F1-score per class. The cell below evaluates the baseline `model`; substituting `best_model` evaluates the tuned model from the grid search instead.

In [None]:
# Model Evaluation
y_pred = model.predict(X_val)
print(f'Accuracy: {accuracy_score(y_val, y_pred)}')
print(confusion_matrix(y_val, y_pred))
print(classification_report(y_val, y_pred))
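
The metrics above all derive from the four cells of the confusion matrix. The following sketch works through made-up labels (`y_true` and `y_hat` are illustrative assumptions, not model output) to show how the numbers relate.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

# Made-up labels: 1 = survived, 0 = did not survive.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_hat = [0, 0, 1, 0, 1, 1, 0, 1, 1, 0]

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct
precision = tp / (tp + fp)                  # share of predicted survivors who survived
recall = tp / (tp + fn)                     # share of actual survivors who were found

print(tn, fp, fn, tp)  # 4 1 1 4
print(accuracy, precision, recall)  # 0.8 0.8 0.8

# The hand-computed values agree with the library functions.
print(accuracy_score(y_true, y_hat), precision_score(y_true, y_hat), recall_score(y_true, y_hat))
```

In scikit-learn's confusion matrix, rows correspond to the true classes and columns to the predicted classes.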

## Conclusion

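In this analysis, we cleaned and encoded the Titanic data, engineered a 'FamilySize' feature, and trained a Random Forest classifier to predict passenger survival. We then tuned its hyperparameters with a grid search and assessed it with cross-validation and held-out validation metrics.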