# Titanic Survival Prediction Project Report

## Introduction
The Titanic Survival Prediction project aims to predict whether a passenger survived the Titanic disaster based on features such as age, gender, class, and ticket fare. This report summarizes the entire data science process, including data exploration, model training, evaluation, and interpretation of results.

## Data Understanding
The dataset used for this project is the Titanic dataset, which includes various features about the passengers. Key features include:

- **Pclass**: Passenger class (1 = 1st; 2 = 2nd; 3 = 3rd)
- **Name**: Name of the passenger
- **Sex**: Gender of the passenger
- **Age**: Age of the passenger
- **SibSp**: Number of siblings/spouses aboard
- **Parch**: Number of parents/children aboard
- **Fare**: Ticket fare
- **Embarked**: Port of embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
- **Survived**: Survival status (0 = No; 1 = Yes)

### Loading Data

In [1]:
import pandas as pd

# Load the Titanic dataset
df = pd.read_csv('data/raw/titanic.csv')
df.head()

   PassengerId  Pclass                                  Name     Sex   Age  SibSp  Parch  Ticket     Fare  Cabin  Embarked  Survived
0            1       1              Allen, Mr. William Henry    male  35.0      0      0  373450   8.0500    NaN         S         0
1            2       1          Allison, Miss. Helen Loraine  female   2.0      1      2  347074  31.0000    NaN         S         1
2            3       1  Allison, Mr. Hudson Joshua Creighton    male  30.0      1      2  347078  31.0000    NaN         S         1
3            4       1                  Anderson, Mr. Andrew    male  26.0      0      0  347081  10.5000    NaN         S         0
4            5       1                Andrews, Miss. Rebecca  female  20.0      0      0  242963  10.5000    NaN         S         0
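
Before cleaning, it can help to count the missing values in each column. The sketch below is illustrative only: `sample` is a small hand-made frame standing in for the real dataset, and its values are assumptions rather than rows from `titanic.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real dataset; values are illustrative only.
sample = pd.DataFrame({
    "Age": [35.0, np.nan, 30.0, 26.0],
    "Embarked": ["S", "C", None, "S"],
    "Cabin": [None, None, None, "C85"],
})

# Count of missing values per column, largest first.
missing = sample.isna().sum().sort_values(ascending=False)

# Share of rows that are missing in each column.
missing_share = (missing / len(sample)).round(2)

print(pd.DataFrame({"missing": missing, "share": missing_share}))
```

Running the same two lines on `df` would show which columns need imputation and which are too sparse to keep.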

## Data Preparation
In this section, we handle missing values and encode categorical variables.

### Data Cleaning Steps
- **Handle Missing Values**: Fill missing Age values with the median age and missing Embarked values with the most frequent port of embarkation.
- **Encode Categorical Variables**: Convert categorical variables into numerical format for model training.
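
To make the two steps above concrete, here is a minimal sketch on a four-row toy frame. The frame `toy` and its values are hypothetical and chosen only so the result is easy to check by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the Titanic dataset.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 40.0, 30.0],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", None, "C", "S"],
})

# Median imputation: the median of [22, 30, 40] is 30.
toy["Age"] = toy["Age"].fillna(toy["Age"].median())

# Mode imputation: "S" is the most frequent port in this toy frame.
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

# Binary encoding for Sex, one-hot encoding for Embarked.
# drop_first=True removes the first category (C) to avoid a redundant column.
toy["Sex"] = toy["Sex"].map({"male": 0, "female": 1})
toy = pd.get_dummies(toy, columns=["Embarked"], drop_first=True, dtype=int)

print(toy)
```

With `drop_first=True`, a row with `Embarked_S == 0` is read as the dropped category (C here), so no information is lost.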

### Code

In [2]:
# Handle missing values (assign the result back; chained inplace fillna
# triggers a warning in recent pandas versions)
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

# Encode categorical variables
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})
# drop_first=True drops the first dummy column (Embarked_C)
df = pd.get_dummies(df, columns=['Embarked'], drop_first=True, dtype=int)

# Drop columns that are not used as features
df = df.drop(columns=['Name', 'Ticket', 'Cabin'])
df.head()

   PassengerId  Pclass  Sex   Age  SibSp  Parch     Fare  Survived  Embarked_Q  Embarked_S
0            1       1    0  35.0      0      0   8.0500         0           0           1
1            2       1    1   2.0      1      2  31.0000         1           0           1
2            3       1    0  30.0      1      2  31.0000         1           0           1
3            4       1    0  26.0      0      0  10.5000         0           0           1
4            5       1    1  20.0      0      0  10.5000         0           0           1

## Exploratory Data Analysis (EDA)
### Visualizations
We perform exploratory data analysis to understand the distributions and relationships in the dataset.

### Age Distribution

In [3]:
import seaborn as sns
import matplotlib.pyplot as plt

# Age distribution plot
plt.figure(figsize=(10, 5))
sns.histplot(df['Age'], bins=30, kde=True)
plt.title('Age Distribution of Passengers')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()



### Survival Rate by Gender

In [4]:
# Survival rate by gender
survival_rate = df.groupby('Sex')['Survived'].mean()
plt.figure(figsize=(8, 4))
sns.barplot(x=survival_rate.index.map({0: 'Male', 1: 'Female'}), y=survival_rate.values)
plt.title('Survival Rate by Gender')
plt.ylabel('Survival Rate')
plt.xlabel('Gender')
plt.show()



## Model Training
In this section, we will outline the model selection and training process.

### Selected Model
We chose the Random Forest Classifier because it works with numeric and encoded categorical features without feature scaling, and because averaging many decorrelated decision trees makes it less prone to overfitting than a single decision tree.
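
One way to check the overfitting claim is to compare a single decision tree with a random forest under cross-validation. The sketch below uses synthetic data from `make_classification` as a stand-in (an assumption, not the Titanic data), so the scores only illustrate the procedure.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; sizes and seeds are arbitrary choices.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=0
)

candidates = {
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Mean 5-fold cross-validated accuracy for each candidate model.
scores = {}
for name, clf in candidates.items():
    scores[name] = cross_val_score(
        clf, X_demo, y_demo, cv=5, scoring="accuracy"
    ).mean()
    print(f"{name}: {scores[name]:.3f}")
```

The same loop can be run on `X` and `y` from this project to see whether the forest generalizes better than a single tree here.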

In [5]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, f1_score, roc_auc_score

# Split the data into features and target
# (PassengerId is an arbitrary row identifier, so it is excluded from the features)
X = df.drop(columns=['Survived', 'PassengerId'])
y = df['Survived']

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Evaluate the model
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
y_pred = model.predict(X_test)
# ROC AUC is computed from the predicted probability of the positive class
y_proba = model.predict_proba(X_test)[:, 1]
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_proba)

print(f'Train Accuracy: {train_score:.3f}')
print(f'Test Accuracy: {test_score:.3f}')
print(f'F1 Score: {f1:.3f}')
print(f'ROC AUC Score: {roc_auc:.3f}')
print(classification_report(y_test, y_pred))



## Model Evaluation
We evaluate the model on the held-out test set using accuracy, F1 score, and ROC AUC score. Together these metrics show how well the model predicts passenger survival, and the gap between train and test accuracy indicates how much the model overfits.
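
ROC AUC measures how well the model ranks positives above negatives, so it is normally computed from predicted probabilities rather than hard class labels. The sketch below uses four hand-made values (illustrative assumptions, not results from this project) to show how the two inputs give different scores.

```python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Hand-made labels and probabilities, for illustration only.
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.6, 0.4, 0.9]

# Hard predictions at the usual 0.5 threshold: [0, 1, 0, 1].
y_hat = [int(p >= 0.5) for p in y_prob]

accuracy = accuracy_score(y_true, y_hat)
f1 = f1_score(y_true, y_hat)

# AUC from probabilities uses the full ranking of the scores.
auc_from_probs = roc_auc_score(y_true, y_prob)
# AUC from hard labels throws that ranking away.
auc_from_labels = roc_auc_score(y_true, y_hat)

print(f"accuracy={accuracy:.2f}, f1={f1:.2f}")
print(f"AUC from probabilities={auc_from_probs:.2f}")  # 0.75
print(f"AUC from hard labels={auc_from_labels:.2f}")   # 0.50
```

In this toy case three of the four positive/negative pairs are ranked correctly, which gives 0.75, while thresholding first reduces the score to 0.50.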

### Visualizing Feature Importances

In [6]:
# Feature importance
importances = model.feature_importances_
features = X.columns

indices = importances.argsort()[::-1]

# Plotting feature importances
plt.figure(figsize=(10, 6))
plt.title('Feature Importances')
plt.bar(range(X.shape[1]), importances[indices], align='center')
plt.xticks(range(X.shape[1]), features[indices], rotation=90)
plt.xlim([-1, X.shape[1]])
plt.show()
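
The `feature_importances_` attribute is impurity-based and is computed on the training data, which can overstate features with many distinct values. Permutation importance on held-out data is a common cross-check. The sketch below names that swapped-in technique explicitly and runs it on synthetic data (an assumption, not the Titanic features).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; sizes and seeds are arbitrary choices.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_tr, y_tr)

# Shuffle each feature on the held-out set and measure the drop in accuracy.
result = permutation_importance(
    forest, X_te, y_te, n_repeats=10, random_state=0
)

# Print features from most to least important.
ranking = result.importances_mean.argsort()[::-1]
for i in ranking:
    mean = result.importances_mean[i]
    std = result.importances_std[i]
    print(f"feature_{i}: {mean:.3f} +/- {std:.3f}")
```

If both methods rank the same features at the top for the project's `model`, the importance plot can be read with more confidence.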



## Conclusion
This project trains a Random Forest Classifier to predict the survival of Titanic passengers. Gender and passenger class are among the key features influencing survival. The model achieves an accuracy of approximately XX% on the test set.

### Future Work
- Explore other machine learning models for potential improvements.
- Investigate additional features or data sources to enhance predictive power.
- Implement hyperparameter tuning for better model performance.
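
As a starting point for the tuning item above, a grid search with cross-validation could look like the following sketch. The data is synthetic and the parameter grid is an arbitrary assumption. Suitable ranges for this project would need to be chosen separately.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; replace with X_train and y_train in practice.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Small illustrative grid; the values are assumptions, not recommendations.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, None],
}

# 3-fold cross-validated search, scored by F1 on the positive class.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="f1",
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated F1: {search.best_score_:.3f}")
```

The search should be fitted on the training split only, so that the test set remains an unbiased check of the tuned model.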

### Acknowledgments
We acknowledge the original creators of the Titanic dataset, which made this project possible.