In [None]:
# HOMEWORK 2.1: Titanic ML Competition


In [None]:
# Common imports
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

In [None]:
# to make this notebook's output stable across runs
np.random.seed(42)

## Description of HW2.1: Tackle the Titanic dataset

This is the legendary Titanic ML competition on [Kaggle](https://www.kaggle.com/), 
the best first challenge for you to dive into ML competitions and familiarize yourself with how the Kaggle platform works.
Let's go to the [Titanic challenge](https://www.kaggle.com/c/titanic).

The data is already split into a training set and a test set. However, the test data does *not* contain the labels: your goal is to train the best model you can using the training data, then make your predictions on the test data and REPORT your final score.

The goal is to predict whether or not a passenger survived based on attributes such as their age, sex, passenger class, where they embarked and so on.

Note: All submitted scores will be ranked in descending order, and the top 5 students will be awarded an additional +10 points on the assignment score.

Let's load the data:

In [None]:
train_data = pd.read_csv("/kaggle/input/titanics/datasets/titanic/train.csv")
test_data = pd.read_csv("/kaggle/input/titanics/datasets/titanic/test.csv")

Let's take a peek at the top few rows of the training set:

In [None]:
train_data.head()

The attributes have the following meaning:
* **Survived**: the target; 0 means the passenger did not survive, while 1 means they survived.
* **Pclass**: passenger class.
* **Name**, **Sex**, **Age**: self-explanatory
* **SibSp**: number of siblings and spouses the passenger had aboard the Titanic.
* **Parch**: number of children and parents the passenger had aboard the Titanic.
* **Ticket**: ticket id
* **Fare**: price paid (in pounds)
* **Cabin**: passenger's cabin number
* **Embarked**: port where the passenger boarded the Titanic

Let's get more info to see how much data is missing:

In [None]:
train_data.info()

Okay, the **Age**, **Cabin** and **Embarked** attributes are sometimes null (fewer than 891 non-null values),
 especially **Cabin** (77% are null). 
 We will ignore **Cabin** for now and focus on the rest. 
 The **Age** attribute has about 19% null values, 
 so we will need to decide what to do with them. 
 Replacing null values with the median age seems reasonable.
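
The percentages above can be read off `info()`, but they can also be computed directly. The following is a minimal sketch on a hypothetical toy frame (the values are made up and only stand in for `train_data`); it shows the fraction of missing values per column and the median that would be used to fill **Age**.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for train_data (values are made up).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan, np.nan],
    "Embarked": ["S", "C", "S", np.nan, "Q"],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05],
})

# isna() marks missing cells; the column-wise mean is the missing fraction.
missing_fraction = toy.isna().mean().sort_values(ascending=False)
print(missing_fraction)

# Median of the observed ages, the value that median imputation would fill in.
median_age = toy["Age"].median()
print("Median age:", median_age)
```

On the real data, the same two calls applied to `train_data` should reproduce the proportions quoted above.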

The **Name** and **Ticket** attributes may have some value,
 but they will be a bit tricky to convert into useful numbers that a model can consume. 
 So for now, we will ignore them.

Let's take a look at the numerical attributes:

In [None]:
train_data.describe()

* Only 38% **Survived**. :(  That's close enough to 40%, so the classes are not badly imbalanced and accuracy will be a reasonable metric to evaluate our model.
* The mean **Fare** was £32.20, which does not seem so expensive (but it was probably a lot of money back then).
* The mean **Age** was less than 30 years old.
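
To put later accuracy numbers in context, it may help to know what a trivial model scores. The sketch below is an illustration only: it uses hypothetical labels with roughly the 38% / 62% split noted above and scikit-learn's `DummyClassifier`, which always predicts the most frequent class. The names `X_toy`, `y_toy` and `baseline` are not part of the assignment.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical labels with roughly the 38% / 62% split reported above.
y_toy = np.array([1] * 38 + [0] * 62)
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by the dummy model

# Always predicts the most frequent class (here: 0, "did not survive").
baseline = DummyClassifier(strategy="most_frequent")
baseline_scores = cross_val_score(baseline, X_toy, y_toy, cv=5)
print("Baseline accuracy:", baseline_scores.mean())
```

Any real classifier should beat this baseline by a clear margin; if it does not, the features are not helping.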

Let's check that the target is indeed 0 or 1:

In [None]:
train_data["Survived"].value_counts()

Now let's take a quick look at all the categorical attributes:

In [None]:
train_data["Pclass"].value_counts()

In [None]:
train_data["Sex"].value_counts()

In [None]:
train_data["Embarked"].value_counts()

In [None]:
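# Note: dropna() drops every row that has any missing value (including a missing Cabin),
# so only a small fraction of the rows remains. The pipelines below use train_data instead.
train_data_cleaned = train_data.dropna()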

The Embarked attribute tells us where the passenger embarked: C=Cherbourg, Q=Queenstown, S=Southampton.


## TO-DO 2.1.1: Build your pre-processing pipelines for numerical/categorical attributes.
 - Use SimpleImputer for pre-processing.
 - Use the "median" strategy of SimpleImputer for numerical attributes
 - Use SimpleImputer with the "most_frequent" strategy, followed by OrdinalEncoder, for categorical attributes
 - Examine the changes that SimpleImputer makes to the data and give examples of changed values


In [None]:
# 2.1.1 - build the pipeline for the numerical attributes:
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),  # Impute missing values with the median
])

# 2.1.1 - build the pipeline for the categorical attributes:
cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),  # Impute missing values with the most frequent category
    ('encoder', OrdinalEncoder()),  # Map each category to an integer code
])

# 2.1.1 - interpret results:
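
One possible way to examine what SimpleImputer changes is to compare a column before and after imputation and look only at the rows that had a gap. The sketch below does this on a hypothetical toy frame (made-up values, not the real `train_data`); the names `toy`, `comparison` and `changed` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy rows (made-up values) with gaps in Age and Embarked.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Embarked": pd.Series(["S", np.nan, "S", "C"], dtype=object),
})

# Numerical column: missing entries become the median of the observed values.
num_imputer = SimpleImputer(strategy="median")
age_filled = num_imputer.fit_transform(toy[["Age"]])

# Categorical column: missing entries become the most frequent category.
cat_imputer = SimpleImputer(strategy="most_frequent")
embarked_filled = cat_imputer.fit_transform(toy[["Embarked"]])

# Side-by-side view; only rows that had a missing value are affected.
comparison = toy.copy()
comparison["Age_imputed"] = age_filled[:, 0]
comparison["Embarked_imputed"] = embarked_filled[:, 0]
changed = comparison[toy.isna().any(axis=1)]
print(changed)
print("Learned fill values:", num_imputer.statistics_, cat_imputer.statistics_)
```

In this toy case the missing age becomes 26.0 (the median of 22, 26 and 35) and the missing port becomes "S". Applying the same comparison to `train_data` would list the actual changed passengers.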


Cool! Now we have a nice preprocessing pipeline that takes the raw data and outputs numerical input features that we can feed to any Machine Learning model we want.

In [None]:
from sklearn.compose import ColumnTransformer

num_attribs = ["Age", "SibSp", "Parch", "Fare"]
cat_attribs = ["Sex", "Embarked"]

preprocess_pipeline = ColumnTransformer([
        ("num", num_pipeline, num_attribs),
        ("cat", cat_pipeline, cat_attribs),
    ])

In [None]:
X_train = preprocess_pipeline.fit_transform(train_data)
X_train
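
The transformed output is a plain NumPy array, so the column names are lost. Since every step here maps one input column to one output column, the columns should come out in transformer order: numerical attributes first, then categorical. A minimal sketch under that assumption, on a hypothetical toy frame with made-up values (the pipelines are rebuilt so the example runs on its own):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy rows (made-up values) with the same column names as train_data.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "SibSp": [1, 1, 0, 1],
    "Parch": [0, 0, 0, 0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
    "Sex": ["male", "female", "female", "female"],
    "Embarked": pd.Series(["S", "C", "S", np.nan], dtype=object),
})

num_attribs = ["Age", "SibSp", "Parch", "Fare"]
cat_attribs = ["Sex", "Embarked"]

preprocess = ColumnTransformer([
    ("num", Pipeline([("imputer", SimpleImputer(strategy="median"))]), num_attribs),
    ("cat", Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("encoder", OrdinalEncoder()),
    ]), cat_attribs),
])

# Output columns follow the transformer order: numerical first, then categorical.
X_toy = preprocess.fit_transform(toy)
X_toy_df = pd.DataFrame(X_toy, columns=num_attribs + cat_attribs, index=toy.index)
print(X_toy_df)
```

Wrapping `X_train` the same way makes it easier to spot which encoded value belongs to which attribute.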

Let's not forget to get the labels:

In [None]:
y_train = train_data["Survived"]

We are now ready to train a classifier. 

## TO-DO 2.1.2: Use a Classifier (SVC, KNN, RandomForest etc.) 
 - Use 3 different classifiers (from the sklearn library)
 - Train the selected classifier using "train_data" and the "y_train" labels to classify/predict the surviving passengers ("Survived") for the TEST DATA (test_data) 
 - Use cross-validation to get the average accuracy for the dataset 
   (example: 
             from sklearn.model_selection import cross_val_score
             clf1_scores = cross_val_score(clf1, X_train, y_train, cv=10)
             clf1_scores.mean())
 - Show the best prediction accuracy 


In [None]:
# Use the same number of folds for every classifier so the mean scores are comparable.
# Classifier 1: Random Forest
clf1 = RandomForestClassifier(random_state=10)
clf1_scores = cross_val_score(clf1, X_train, y_train, cv=10)
print("RandomForestClassifier Accuracy:", clf1_scores.mean())

# Classifier 2: SVC (Support Vector Classifier)
clf2 = SVC()
clf2_scores = cross_val_score(clf2, X_train, y_train, cv=10)
print("SVC Accuracy:", clf2_scores.mean())

# Classifier 3: Logistic Regression
clf3 = LogisticRegression(max_iter=1000)
clf3_scores = cross_val_score(clf3, X_train, y_train, cv=10)
print("Logistic Regression Accuracy:", clf3_scores.mean())

# Show the best prediction accuracy
best_accuracy = max(clf1_scores.mean(), clf2_scores.mean(), clf3_scores.mean())
print("The best accuracy among the classifiers is:", best_accuracy)
