# HOMEWORK 2.1: Titanic ML Competition


In [1]:
# Common imports
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn
# Pipeline chains data processing steps into a single, ordered workflow.
from sklearn.pipeline import Pipeline
# SimpleImputer replaces missing values using a chosen strategy (e.g. median or most frequent).
from sklearn.impute import SimpleImputer
# StandardScaler standardizes numerical features to a mean of 0 and a standard deviation of 1;
# OrdinalEncoder encodes categorical features as numerical codes.
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
# Support Vector Classifier, a model commonly used for classification tasks.
from sklearn.svm import SVC
# k-Nearest Neighbors classifier, which classifies a sample from its nearest neighbors.
from sklearn.neighbors import KNeighborsClassifier
# Random Forest classifier, an ensemble method that combines multiple decision trees.
from sklearn.ensemble import RandomForestClassifier
# cross_val_score evaluates a model on several train/validation splits of the data.
from sklearn.model_selection import cross_val_score
# accuracy_score compares predicted labels with the true labels.
from sklearn.metrics import accuracy_score


In [2]:
# to make this notebook's output stable across runs
np.random.seed(42)

## Description of HW2.1: Tackle the Titanic dataset

This is the legendary Titanic ML competition on [Kaggle](https://www.kaggle.com/),
a good first challenge for diving into ML competitions and getting familiar with how the Kaggle platform works.
Let's go to the [Titanic challenge](https://www.kaggle.com/c/titanic).

The data is already split into a training set and a test set. However, the test data does *not* contain the labels: your goal is to train the best model you can using the training data, then make your predictions on the test data and REPORT your final score.

The goal is to predict whether or not a passenger survived based on attributes such as their age, sex, passenger class, where they embarked and so on.

Note: All submitted scores will be ranked in descending order, and the top 5 students will be awarded an additional 10 points on their assignment score.

Load the data:

In [3]:
train_data = pd.read_csv("datasets/titanic/train.csv")
test_data = pd.read_csv("datasets/titanic/test.csv")

Let's take a peek at the top few rows of the training set:

In [4]:
train_data.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


The attributes have the following meaning:
* **Survived**: that's the target, 0 means the passenger did not survive, while 1 means he/she survived.
* **Pclass**: passenger class.
* **Name**, **Sex**, **Age**: self-explanatory
* **SibSp**: number of siblings and spouses of the passenger aboard the Titanic.
* **Parch**: number of children and parents of the passenger aboard the Titanic.
* **Ticket**: ticket id
* **Fare**: price paid (in pounds)
* **Cabin**: passenger's cabin number
* **Embarked**: where the passenger embarked the Titanic

Let's get more info to see how much data is missing:

In [5]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


Okay, the **Age**, **Cabin** and **Embarked** attributes are sometimes null (fewer than 891 non-null values),
especially **Cabin** (77% are null).
We will ignore **Cabin** for now and focus on the rest.
The **Age** attribute has about 20% null values,
so we will need to decide what to do with them.
Replacing null values with the median age seems reasonable.
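
As a quick check, the missing-value shares can be recomputed from the non-null counts printed by `train_data.info()`. The sketch below hard-codes those counts rather than reading the CSV, so the dictionary name `non_null_counts` is only an illustration; on the real DataFrame, `train_data.isna().mean()` should give the same fractions.

```python
# Non-null counts copied from the info() output above (891 rows in total).
total_rows = 891
non_null_counts = {"Age": 714, "Cabin": 204, "Embarked": 889}

# Fraction of missing values per attribute.
missing_share = {
    name: (total_rows - count) / total_rows
    for name, count in non_null_counts.items()
}

for name, share in missing_share.items():
    print(f"{name}: {share:.1%} missing")
# Age: 19.9% missing
# Cabin: 77.1% missing
# Embarked: 0.2% missing
```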

The **Name** and **Ticket** attributes may have some value,
 but they will be a bit tricky to convert into useful numbers that a model can consume. 
 So for now, we will ignore them.

Let's take a look at the numerical attributes:

In [6]:
train_data.describe()

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


* Only 38% **Survived**. :(  The classes are not heavily imbalanced (roughly 40% vs. 60%), so accuracy will be a reasonable metric to evaluate our model.
* The mean **Fare** was £32.20, which does not seem so expensive (but it was probably a lot of money back then).
* The mean **Age** was less than 30 years old.

Let's check that the target is indeed 0 or 1:

In [7]:
train_data["Survived"].value_counts()

0    549
1    342
Name: Survived, dtype: int64
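
These counts also give a useful reference point for accuracy: a trivial model that always predicts the majority class (did not survive) would already be right for 549 of 891 passengers. The sketch below computes that baseline from the counts above; the variable names are illustrative only.

```python
# Class counts copied from the value_counts() output above.
class_counts = {0: 549, 1: 342}
total = sum(class_counts.values())

# Accuracy of always predicting the most common class.
baseline_accuracy = max(class_counts.values()) / total
survival_rate = class_counts[1] / total

print(f"Survival rate: {survival_rate:.3f}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
# Survival rate: 0.384
# Majority-class baseline accuracy: 0.616
```

Any classifier trained later should clearly beat this baseline of about 0.616 to be worth using.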

Now let's take a quick look at all the categorical attributes:

In [8]:
train_data["Pclass"].value_counts()

3    491
1    216
2    184
Name: Pclass, dtype: int64

In [9]:
train_data["Sex"].value_counts()

male      577
female    314
Name: Sex, dtype: int64

In [10]:
train_data["Embarked"].value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

The Embarked attribute tells us where the passenger embarked: C=Cherbourg, Q=Queenstown, S=Southampton.
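
Since these port codes are later turned into numbers, it helps to know how `OrdinalEncoder` assigns them. A minimal sketch on a toy column (not the real dataset) is shown below: by default the categories are sorted, so C, Q and S become 0, 1 and 2.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Toy column of embarkation ports (illustrative values, not read from train.csv).
ports = np.array([["S"], ["C"], ["Q"], ["S"]], dtype=object)

encoder = OrdinalEncoder()
codes = encoder.fit_transform(ports)

print(encoder.categories_[0])  # learned categories, in sorted order
print(codes.ravel())           # numeric code for each row
# ['C' 'Q' 'S']
# [2. 0. 1. 2.]
```

Note that ordinal codes imply an order (C < Q < S) that has no real meaning for ports; `OneHotEncoder` is a common alternative when that matters.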


## TO-DO 2.1.1: Build your pre-processing pipelines for numerical/categorical attributes.
 - Use `SimpleImputer` for pre-processing.
 - Use the "median" strategy of `SimpleImputer` for numerical attributes.
 - Use `OrdinalEncoder` and the "most_frequent" strategy for categorical attributes.
 - Examine the changes that `SimpleImputer` makes to the data and give examples of changed values.


In [11]:
# Separating numerical and categorical attributes
numerical_features = ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
# PassengerId is not included: it is an identifier and carries no meaningful
# information for predicting survival on its own.
categorical_features = ['Sex', 'Embarked']

# 2.1.1 - build the pipeline for the numerical attributes:

num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),  # Impute missing values with the median
    ('scaler', StandardScaler())  # Standardize numerical features
])



# 2.1.1 - build the pipeline for the categorical attributes:
cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),  # Impute missing values with the most frequent value
    ('encoder', OrdinalEncoder())  # Encode categorical features
])



# 2.1.1 - interpret results:
# After fitting the pipelines to the data, you can transform the numerical and categorical features separately.
# For example, if you have a DataFrame named 'train_data', you can apply the transformations as follows:

# Apply numerical pipeline to numerical features
X_train_num = num_pipeline.fit_transform(train_data[numerical_features])

# Apply categorical pipeline to categorical features
X_train_cat = cat_pipeline.fit_transform(train_data[categorical_features])

# Display the transformed data
print("Transformed Numerical Features:")
print(pd.DataFrame(X_train_num, columns=numerical_features))

print("\nTransformed Categorical Features:")
print(pd.DataFrame(X_train_cat, columns=categorical_features))



Transformed Numerical Features:
       Pclass       Age     SibSp     Parch      Fare
0    0.827377 -0.565736  0.432793 -0.473674 -0.502445
1   -1.566107  0.663861  0.432793 -0.473674  0.786845
2    0.827377 -0.258337 -0.474545 -0.473674 -0.488854
3   -1.566107  0.433312  0.432793 -0.473674  0.420730
4    0.827377  0.433312 -0.474545 -0.473674 -0.486337
..        ...       ...       ...       ...       ...
886 -0.369365 -0.181487 -0.474545 -0.473674 -0.386671
887 -1.566107 -0.796286 -0.474545 -0.473674 -0.044381
888  0.827377 -0.104637  0.432793  2.008933 -0.176263
889 -1.566107 -0.258337 -0.474545 -0.473674 -0.044381
890  0.827377  0.202762 -0.474545 -0.473674 -0.492378

[891 rows x 5 columns]

Transformed Categorical Features:
     Sex  Embarked
0    1.0       2.0
1    0.0       0.0
2    0.0       2.0
3    0.0       2.0
4    1.0       2.0
..   ...       ...
886  1.0       2.0
887  0.0       2.0
888  0.0       2.0
889  1.0       0.0
890  1.0       1.0

[891 rows x 2 columns]


Cool! Now we have a nice preprocessing pipeline that takes the raw data and outputs numerical input features that we can feed to any Machine Learning model we want.
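
To see concretely which values `SimpleImputer` changes, the imputed column can be compared with the original one. The sketch below uses a small toy frame with made-up values (not rows from `train.csv`); the same comparison could be applied to `train_data["Age"]` and `train_data["Embarked"]`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data with missing entries (illustrative values only).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Embarked": pd.Series(["S", "C", np.nan, "S", "S"], dtype=object),
})

num_imputer = SimpleImputer(strategy="median")
cat_imputer = SimpleImputer(strategy="most_frequent")

age_filled = num_imputer.fit_transform(toy[["Age"]]).ravel()
embarked_filled = cat_imputer.fit_transform(toy[["Embarked"]]).ravel()

# Rows where the original value was missing are the only ones that change.
changed_age_rows = toy.index[toy["Age"].isna()].tolist()
changed_embarked_rows = toy.index[toy["Embarked"].isna()].tolist()

print("Median used for Age:", num_imputer.statistics_[0])
print("Age rows changed:", changed_age_rows, "->", age_filled[changed_age_rows])
print("Most frequent Embarked:", cat_imputer.statistics_[0])
print("Embarked rows changed:", changed_embarked_rows, "->", embarked_filled[changed_embarked_rows])
```

In this toy frame the two missing ages are replaced by the median of the observed ages (26.0), and the missing port is replaced by the most frequent one ("S"); all other values are left untouched.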

In [12]:
from sklearn.compose import ColumnTransformer

num_attribs = ["Age", "SibSp", "Parch", "Fare"]
cat_attribs = ["Pclass", "Sex", "Embarked"]

preprocess_pipeline = ColumnTransformer([
        ("num", num_pipeline, num_attribs),
        ("cat", cat_pipeline, cat_attribs),
    ])

In [13]:
X_train = preprocess_pipeline.fit_transform(train_data)
X_train

array([[-0.56573646,  0.43279337, -0.47367361, ...,  2.        ,
         1.        ,  2.        ],
       [ 0.66386103,  0.43279337, -0.47367361, ...,  0.        ,
         0.        ,  0.        ],
       [-0.25833709, -0.4745452 , -0.47367361, ...,  2.        ,
         0.        ,  2.        ],
       ...,
       [-0.1046374 ,  0.43279337,  2.00893337, ...,  2.        ,
         0.        ,  2.        ],
       [-0.25833709, -0.4745452 , -0.47367361, ...,  0.        ,
         1.        ,  0.        ],
       [ 0.20276197, -0.4745452 , -0.47367361, ...,  2.        ,
         1.        ,  1.        ]])

Let's not forget to get the labels:

In [14]:
y_train = train_data["Survived"]

We are now ready to train a classifier. 

## TO-DO 2.1.2: Use a Classifier (SVC, KNN, RandomForest etc.)
 - Use 3 different classifiers (using the sklearn library)
 - Train the selected classifiers using "train_data" and the "y_train" labels to predict "Survived" for the TEST DATA (test_data)
 - Use the cross-validation method to get the average accuracy for the dataset
   (example:
             from sklearn.model_selection import cross_val_score
             clf1_scores = cross_val_score(clf1, X_train, y_train, cv=10)
             clf1_scores.mean()
   )
 - Show the best prediction accuracy


In [15]:
# Load the test data from 'test.csv' (same relative path as in the loading cell above)
test_data = pd.read_csv("datasets/titanic/test.csv")

# Separating numerical and categorical attributes
numerical_features = ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
categorical_features = ['Sex', 'Embarked']

# Build the pipeline for the numerical attributes
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Build the pipeline for the categorical attributes
cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OrdinalEncoder())
])

# Combine pipelines
preprocessor = ColumnTransformer(
    transformers=[
        ('num', num_pipeline, numerical_features),
        ('cat', cat_pipeline, categorical_features)
    ])

# Fit and transform the training data
X_train = preprocessor.fit_transform(train_data)

# Transform the test data
X_test = preprocessor.transform(test_data)

# Initialize classifiers
clf_svc = SVC()
clf_knn = KNeighborsClassifier()
clf_rf = RandomForestClassifier()

# 2.1.2 - Classifier 1 - Support Vector Classifier
# Train classifier and make predictions
clf_svc.fit(X_train, y_train)
y_pred_svc = clf_svc.predict(X_test)

# 2.1.2 - Classifier 2 - KNeighbors
# Train classifier and make predictions
clf_knn.fit(X_train, y_train)
y_pred_knn = clf_knn.predict(X_test)

# 2.1.2 - Classifier 3 - Random Forest
# Train classifier and make predictions
clf_rf.fit(X_train, y_train)
y_pred_rf = clf_rf.predict(X_test)

# 2.1.2 - Report the BEST accuracy
# Cross-validation to get average accuracy
svc_scores = cross_val_score(clf_svc, X_train, y_train, cv=10)
knn_scores = cross_val_score(clf_knn, X_train, y_train, cv=10)
rf_scores = cross_val_score(clf_rf, X_train, y_train, cv=10)

# Display average accuracy for each classifier
print("SVC Average Accuracy:", svc_scores.mean())
print("KNN Average Accuracy:", knn_scores.mean())
print("Random Forest Average Accuracy:", rf_scores.mean())

# Identify the best classifier based on the highest accuracy
best_classifier = max([(svc_scores.mean(), 'SVC'), (knn_scores.mean(), 'KNN'), (rf_scores.mean(), 'Random Forest')])
print("Best Classifier:", best_classifier[1], "with an average accuracy of", best_classifier[0])



SVC Average Accuracy: 0.8204494382022471
KNN Average Accuracy: 0.7957553058676654
Random Forest Average Accuracy: 0.8149063670411986
Best Classifier: SVC with an average accuracy of 0.8204494382022471
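
One caveat about these scores: `X_train` was imputed and scaled using all 891 rows before cross-validation, so each validation fold has already influenced the medians and scaling statistics. A common way to avoid this is to put the preprocessing and the classifier into one `Pipeline` and cross-validate that, so the preprocessing is refit on each training fold. The sketch below shows the idea on a small synthetic frame; the data, the label rule and the names `toy`, `preprocess` and `model` are assumptions made for illustration, not part of the assignment.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the Titanic columns (not real data).
rng = np.random.default_rng(42)
n = 200
toy = pd.DataFrame({
    "Age": rng.uniform(1, 80, n),
    "Fare": rng.uniform(5, 100, n),
    "Sex": pd.Series(rng.choice(["male", "female"], n), dtype=object),
})
toy.loc[rng.choice(n, 20, replace=False), "Age"] = np.nan  # inject missing ages
y = (toy["Sex"] == "female").astype(int)  # synthetic label, for illustration only

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
cat_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OrdinalEncoder()),
])
preprocess = ColumnTransformer([
    ("num", num_pipeline, ["Age", "Fare"]),
    ("cat", cat_pipeline, ["Sex"]),
])

# Preprocessing and classifier in one estimator: refit from scratch on every fold.
model = Pipeline([("preprocess", preprocess), ("clf", SVC())])
scores = cross_val_score(model, toy, y, cv=10)
print("Mean CV accuracy:", scores.mean())
```

With the real data, the same pattern would be `cross_val_score(model, train_data, y_train, cv=10)`, using the raw DataFrame instead of the pre-transformed array.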
