## Instructions

Fill in the cells of the notebook and report your results in Blackboard.

Upload your final notebook to Blackboard as well.


**Don't forget to comment all your code and analyze your results.**

## Presentation of Dataset

The Titanic dataset contains information about the Titanic passengers and whether they survived or not.

The problem is: "we want to predict whether a passenger survived or not".

### Data dictionary
- survival: Survival (0 = No, 1 = Yes)
- pclass: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
- sex: Sex
- age: Age in years
- sibsp: Number of siblings / spouses aboard the Titanic
- parch: Number of parents / children aboard the Titanic
- ticket: Ticket number
- fare: Passenger fare (money spent for the ticket)
- cabin: Cabin number
- embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

### Start by importing the libraries


In [None]:
# Data visualization
import matplotlib.pyplot as plt
import plotly.express as px

# Data preprocessing
import pandas as pd
from sklearn.model_selection import train_test_split

# Machine learning
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report


from sklearn.model_selection import GridSearchCV


## Data preparation

### Read the dataset

In [None]:
# Load data
titanic_data = pd.read_csv('titanic.csv')

### Explore the dataset

In [None]:
# First look at the data
titanic_data.head()

In [None]:
# Get some information

# Getting a concise summary of the dataframe, including the number of non-null values in each column
print(titanic_data.info())

In [None]:
# Displaying the number of rows and columns in the dataframe
print(f"Dataset dimensions: {titanic_data.shape}")

In [None]:
# Getting the number of unique values in each column
print(titanic_data.nunique())

In [None]:
# Getting some basic statistical details about the data
print(titanic_data.describe())

### Clean the dataset - Missing values (Q8 and Q9)

In [None]:
# Check for missing values
# Report it in Blackboard (Q8)


print(titanic_data.isnull().sum())

In [None]:
# Calculating the percentage of missing values for each column
missing_percentage = titanic_data.isnull().sum() / len(titanic_data) * 100

# Printing the missing percentage for each column
print(missing_percentage)


In [None]:
# Deal with the missing values
# Remove the column with more than 75% of missing values

# Calculating the percentage of missing values for each column
missing_percentage = titanic_data.isnull().sum() / len(titanic_data) * 100

# Identifying columns with more than 75% missing values
columns_to_drop = missing_percentage[missing_percentage > 75].index

# Dropping these columns from the DataFrame
titanic_data_cleaned = titanic_data.drop(columns=columns_to_drop)

# Displaying the columns that were dropped
print(f"Columns dropped: {columns_to_drop.tolist()}")

# Displaying the first few rows of the cleaned dataframe
titanic_data_cleaned.head(3)

In [None]:
# Deal with the missing values
# Remove the remaining rows with missing values

# Removing rows with missing values
titanic_data_cleaned = titanic_data_cleaned.dropna()

# Verifying that there are no more missing values
print(titanic_data_cleaned.isnull().sum())

In [None]:
# Get the final shape of the dataset and report it in Blackboard (Q9)

# Checking the dimensions of the updated DataFrame
print(f"New dataset dimensions: {titanic_data_cleaned.shape}")
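The two cleaning steps above can be sketched on a small toy frame. The values below are hypothetical and only stand in for `titanic_data`. The sketch also suggests why the order matters: dropping the sparse column first keeps more rows than calling `dropna()` directly.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) standing in for titanic_data
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, 54.0, 2.0],
    "Cabin": [np.nan, np.nan, np.nan, np.nan, "C85"],
    "Fare": [7.25, 71.28, 8.05, 51.86, 21.08],
})

# Percentage of missing values per column
missing_pct = toy.isnull().mean() * 100

# Step 1: drop the columns above the 75% threshold
sparse_cols = missing_pct[missing_pct > 75].index
toy_cleaned = toy.drop(columns=sparse_cols)

# Step 2: drop the rows that still contain a missing value
toy_cleaned = toy_cleaned.dropna()

# For comparison: dropping rows without removing the sparse column first
rows_only = toy.dropna()

print(f"Columns dropped: {sparse_cols.tolist()}")
print(f"Shape after both steps: {toy_cleaned.shape}")
print(f"Shape with dropna() only: {rows_only.shape}")
```

On this toy frame, removing the sparse column first keeps four rows, whereas `dropna()` alone keeps a single row.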

### Data visualization (Q10)

**The plot below shows the survival rate by passenger class.**

In [None]:
# Data visualization (Q10)
# Make a plot and analyze it. Use the library you prefer.
# Report the figure and your analysis in Blackboard
# Your figure should be self-explanatory (anyone should understand it without access to the code: include a title, a legend, ...)


# Calculating the survival rate by passenger class
survival_rate_by_class = titanic_data_cleaned.groupby('Pclass')['Survived'].mean()

# Creating a bar plot
plt.figure(figsize=(8, 6))
survival_rate_by_class.plot(kind='bar', color=['blue', 'orange', 'green'])
plt.title('Survival Rate by Passenger Class on the Titanic')
plt.xlabel('Passenger Class')
plt.ylabel('Survival Rate')
plt.xticks(ticks=[0, 1, 2], labels=['1st Class', '2nd Class', '3rd Class'], rotation=0)
plt.ylim(0, 1)
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Adding value labels on top of each bar
for index, value in enumerate(survival_rate_by_class):
    plt.text(index, value + 0.02, f'{value:.2f}', ha='center')

# Adding a legend
plt.legend(['Survival Rate'], loc='upper left')

# Saving the figure
plt.savefig('survival_rate_by_class.png', bbox_inches='tight')

# Showing the plot
plt.show()

**The plot shows that 1st class passengers had the highest chance of surviving the Titanic disaster, with a survival rate of 0.65, while 3rd class passengers had the lowest, with a survival rate of 0.24.**

**These are averages, and individual outcomes varied: some 3rd class passengers survived and some 1st class passengers did not. The survival rates also do not account for factors other than passenger class, such as sex, age, and whether a passenger had a cabin.**

**For a predictive model, passenger class is therefore likely to be a useful feature, but probably not sufficient on its own. Other features, such as age, sex, and whether a passenger had family members aboard, may also matter.**
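One way to look at a second factor alongside passenger class is to group by two columns at once. The sketch below uses a toy frame with hypothetical rows, not the real dataset; in the notebook the same `groupby` call could be applied to `titanic_data_cleaned`, assuming it has `Pclass`, `Sex` and `Survived` columns.

```python
import pandas as pd

# Toy data (hypothetical rows, not the real Titanic dataset)
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 1, 3, 3, 3, 3],
    "Sex":      ["female", "female", "male", "male",
                 "female", "female", "male", "male"],
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Mean of Survived for each (class, sex) pair, reshaped into a class x sex table
rate_by_class_and_sex = (
    toy.groupby(["Pclass", "Sex"])["Survived"].mean().unstack("Sex")
)

print(rate_by_class_and_sex)
```

Each cell of the resulting table is the survival rate of one class and sex combination, which makes it easier to see whether a difference between classes holds within each sex.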

### Prepare your data for model training (Q11)



In [None]:
# Deal with non numeric features
# Drop the "Name" and "Ticket" columns


# Dropping the "Name" and "Ticket" columns
titanic_data_cleaned = titanic_data_cleaned.drop(columns=['Name', 'Ticket'])

# Displaying the first few rows of the updated DataFrame
print(titanic_data_cleaned.head(3))

In [None]:
# Deal with non numeric features
# Option 1: Convert the remaining columns by mapping the values to numbers (use map from Pandas for example)
# Option 2 (easy): Remove all non numeric columns


# Mapping the "Sex" column to binary values
titanic_data_cleaned['Sex'] = titanic_data_cleaned['Sex'].map({'male': 0, 'female': 1})

# Mapping the "Embarked" column to integer codes
titanic_data_cleaned['Embarked'] = titanic_data_cleaned['Embarked'].map({'C': 0, 'Q': 1, 'S': 2})

# Displaying the first few rows of the updated DataFrame
print(titanic_data_cleaned.head())
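A point worth checking after a `map` call: any value that is not a key of the dictionary becomes `NaN`, which would reintroduce missing values after the cleaning step. The sketch below illustrates this on a toy Series; the value `"X"` is a hypothetical unexpected category, not something known to be in the dataset.

```python
import pandas as pd

# Toy column (hypothetical values); "X" is deliberately absent from the mapping
embarked = pd.Series(["S", "C", "Q", "X"])
embarked_codes = embarked.map({"C": 0, "Q": 1, "S": 2})

# Values that are not keys of the dictionary are turned into NaN
n_unmapped = int(embarked_codes.isnull().sum())

print(embarked_codes.tolist())
print(f"Unmapped values: {n_unmapped}")
```

In the notebook, running `titanic_data_cleaned.isnull().sum()` again after the mapping would show whether every value was covered.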

In [None]:
# Get the features and label in two variables.


# Splitting the dataset into features and label
features = titanic_data_cleaned.drop('Survived', axis=1)
label = titanic_data_cleaned['Survived']

# Displaying the first few rows of the features and label
print("Features:\n", features.head())
print("\nLabel:\n", label.head())


In [None]:
# Split the features and label in train and test set
# Put 20% of the data in the test set


# Splitting the features and label into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features, label, test_size=0.2, random_state=42)


In [None]:
# Check the shape of each set (Q11)

# Printing the shapes of the training and testing sets
print("Training set shape: ", X_train.shape)
print("Testing set shape: ", X_test.shape)

In [None]:
# Checking the statements in question 11
print("X_train.shape[0]==y_train.shape[0]: ", X_train.shape[0]==y_train.shape[0])
print("X_train and X_test have the same number of samples: ", X_train.shape[0]==X_test.shape[0])
print("X_train and X_test have the same number of columns: ", X_train.shape[1]==X_test.shape[1])
print("y_train.shape == y_test.shape: ", y_train.shape == y_test.shape)
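To see how `test_size=0.2` shapes the four outputs, here is a minimal sketch on synthetic arrays. The sizes (100 samples, 7 features) are arbitrary assumptions and do not reflect the Titanic data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: 100 samples with 7 features and a binary label
X = np.arange(700).reshape(100, 7)
y = np.arange(100) % 2

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Rows are divided between train and test; the number of columns is unchanged
print("Train:", X_tr.shape, y_tr.shape)
print("Test: ", X_te.shape, y_te.shape)
```

The features and the label of a given set always have the same number of rows, while the train and test sets differ in size.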


## Model Training (Q13)

In [None]:
# Define and train a Random Forest (Report your code in Q13)


# Defining the model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)

# Training the model
rf_model.fit(X_train, y_train)

# Printing the model
print(rf_model)

## Model Evaluation (Q14 and Q15)

In [None]:
# Use your model to make predictions on the test set (Report your code in Q14)


# Using the model to make predictions on the test set
y_pred = rf_model.predict(X_test)

# Printing the predicted values
print(y_pred)

In [None]:
# Use the classification report to evaluate your model
# Report the accuracy in Blackboard (Q15)


# Generating the classification report
report = classification_report(y_test, y_pred)
print(report)

# Calculating the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
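Accuracy is the fraction of predictions that match the true labels. The sketch below computes it by hand on small hypothetical label arrays and compares the result with `accuracy_score`.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true labels and predictions (3 of the 5 agree)
y_true = np.array([1, 0, 1, 1, 0])
y_hat = np.array([1, 0, 0, 1, 1])

# Fraction of positions where the prediction equals the true label
manual_accuracy = (y_true == y_hat).mean()
library_accuracy = accuracy_score(y_true, y_hat)

print(f"Manual accuracy:  {manual_accuracy:.2f}")
print(f"Library accuracy: {library_accuracy:.2f}")
```

Because accuracy counts all errors equally, the per-class precision and recall in the classification report remain useful when the two classes are not balanced.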

## Bonus: Try to improve your score! (Q16)
Report your best score in Blackboard (Q16)

In [None]:

# Define the model
rf_model = RandomForestClassifier(random_state=42)

# Define the grid search parameters
param_grid = {
   'n_estimators': [100, 200, 500],
   'max_depth': [None, 10, 20],
   'min_samples_split': [2, 5, 10],
   'min_samples_leaf': [1, 2, 4],
   'bootstrap': [True, False]
}

# Perform grid search
grid_search = GridSearchCV(estimator=rf_model, param_grid=param_grid, cv=3, scoring='accuracy', return_train_score=True)
grid_search.fit(X_train, y_train)

# Print the best parameters and the best mean cross-validated accuracy
# (this score is computed on the training folds, not on the test set)
print("Best Parameters: ", grid_search.best_params_)
print("Best CV Score: ", grid_search.best_score_)
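`best_score_` is the mean cross-validated accuracy on the training data, so it is not directly comparable with the test accuracy reported in Q15. The sketch below shows one way to score the tuned model on a held-out test set. It uses synthetic data from `make_classification` and a deliberately small grid (both assumptions made to keep the example quick); in the notebook, `grid_search`, `X_test` and `y_test` would play the same roles. For scale, the grid above has 3 × 3 × 3 × 3 × 2 = 162 combinations, which means 486 fits with `cv=3`, plus the final refit.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary classification data standing in for the Titanic features
X, y = make_classification(n_samples=200, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Small illustrative grid (hypothetical values chosen for speed)
small_grid = {"n_estimators": [10, 20], "max_depth": [None, 5]}

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=small_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

# With the default refit=True, best_estimator_ is retrained on the whole
# training set using the best parameters found
test_pred = search.best_estimator_.predict(X_te)
test_accuracy = accuracy_score(y_te, test_pred)

print("Best parameters:", search.best_params_)
print(f"Mean CV accuracy: {search.best_score_:.2f}")
print(f"Test accuracy:    {test_accuracy:.2f}")
```

The test accuracy is the figure to compare with the baseline model, since the cross-validated score was used to select the parameters.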
