<a href="https://colab.research.google.com/github/JensBlack/IEECR_Hackathon24/blob/main/IEECR_Hackathon_24_Titanic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# IEECR HACKATHON 2024

## Introduction to Machine Learning with the Titanic Dataset

This tutorial will guide you through a basic machine learning workflow using the Titanic dataset. We'll explore data preprocessing, model training, and evaluation.



In [17]:
#@title Install all necessary libraries, prepare dataset and do some magic!
print("Installing all dependencies")
! pip install numpy pandas matplotlib scipy scikit-learn seaborn
print("Importing all packages...")
# Import all necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder

# Preparing data for you (just run)
print("Loading data...")
data = sns.load_dataset("titanic")

cleaned_data = data.copy(deep = True)

#data cleaning

## fill missing age values with the mean age
print("Found some dust. Cleaning data...")
cleaned_data["age"] = cleaned_data["age"].fillna(cleaned_data["age"].mean())

## the deck column has almost no values, so let's drop it

cleaned_data = cleaned_data.drop(columns = ["deck"])

## remove passengers whose port of embarkation is unknown
cleaned_data = cleaned_data.dropna(subset=["embarked"])

targets = cleaned_data.survived
features = cleaned_data.drop(columns = "survived")

#Convert features to be usable

print("Converting data to features and targets...")

adpt_features = features.copy(deep = True)

# drop columns that are redundant or cheating ("alive" repeats the target)
adpt_features.drop(columns = ["adult_male", "embark_town", "who", "class", "alive"], inplace = True)

## binary encode sex: female = 1, male = 0
adpt_features["sex"] = (adpt_features["sex"] == "female").astype(int)

## alone is boolean, so convert it to 0/1
adpt_features["alone"] = adpt_features["alone"].astype(int)

## embarked needs to be re-encoded as numbers
l_encoder = LabelEncoder()
n_embarked = l_encoder.fit_transform(adpt_features.embarked)
adpt_features.embarked = n_embarked

print("Doing some magic...")
#split data to create test_data for later testing
train_data, test_data, train_targets, test_targets = train_test_split(adpt_features, targets
                                                    , test_size=0.3
                                                    , random_state=42)

#reset index for train and test data to avoid confusion

train_data.reset_index(drop=True, inplace=True)
train_targets.reset_index(drop=True, inplace=True)
test_data.reset_index(drop=True, inplace=True)
test_targets.reset_index(drop=True, inplace=True)

print("Done! You can start running your own code now")

# If you are reading this, you are looking at the preparation code and most likely found the test_targets :)
# In the interest of learning and fairness, we kindly ask you not to access the test data or targets until prediction!
# In the real world, test data represents unseen data (e.g., during an experiment), which you trust your classifier to handle without
# any information about it. So, do yourself a favor and try to solve this challenge without accessing this data. It is literally cheating!

Installing all dependencies
Importing all packages...
Loading data...
Found some dust. Cleaning data...
Converting data to features and targets...
Doing some magic...
Done! You can start running your own code now




## **START HERE** and check the overview of the dataset ⬇

In [18]:
# show first 5 rows of the dataset

train_data.head()

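| | pclass | sex | age | sibsp | parch | fare | embarked | alone |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 35.0 | 1 | 0 | 90.0 | 2 | 0 |
| 1 | 3 | 1 | 24.0 | 0 | 0 | 8.85 | 2 | 1 |
| 2 | 3 | 0 | 21.0 | 0 | 0 | 7.925 | 2 | 1 |
| 3 | 2 | 0 | 36.0 | 1 | 2 | 27.75 | 2 | 0 |
| 4 | 2 | 0 | 29.0 | 1 | 0 | 27.7208 | 0 | 0 |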


As you can see, each row describes one passenger and each column holds a different type of information about them.

For example, the age and sex of each person are shown in the columns "age" and "sex". Further information, such as the passenger class and the fare, is available as well.

See the section below for an explanation of each column.

## The Titanic Dataset
Sourced from the seaborn example datasets.

| Variable | Definition | Key |
|---|---|---|
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | 1 = female, 0 = male |
| age | Age in years | |
| sibsp | # of siblings / spouses aboard the Titanic | |
| parch | # of parents / children aboard the Titanic | |
| fare | Passenger fare | |
| embarked | Port of Embarkation | 0 = Cherbourg, 1 = Queenstown, 2 = Southampton |
| alone | Travelling without family (sibsp and parch are both 0) | 1 = yes, 0 = no |
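The numeric keys for "sex" and "embarked" come from encoding the text labels of the raw dataset. The sketch below reproduces that mapping on a tiny made-up frame; `raw`, `encoded`, and `port_codes` are illustrative names, and the three passengers are invented. It assumes the raw "embarked" column holds the port initials C, Q, and S, which `LabelEncoder` numbers in alphabetical order.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy passengers (made-up values) using the raw text labels
raw = pd.DataFrame({
    "sex": ["male", "female", "female"],
    "embarked": ["S", "C", "Q"],
})

encoded = raw.copy()

# female = 1, male = 0
encoded["sex"] = (raw["sex"] == "female").astype(int)

# LabelEncoder sorts the labels alphabetically: C -> 0, Q -> 1, S -> 2
encoder = LabelEncoder()
encoded["embarked"] = encoder.fit_transform(raw["embarked"])

# Look-up table from port initial to numeric code
port_codes = {label: code for code, label in enumerate(encoder.classes_)}

print(encoded)
print(port_codes)
```

Keeping a look-up table like `port_codes` around makes it easier to translate model inputs back into readable labels later.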

In [24]:
# show the targets
print("People that survived according to the training data:")
print("{} of {}".format(train_targets.sum(), train_targets.shape[0]))

People that survived according to the training data:
240 of 622


## Training a machine learning classifier to predict who survived and who did not

### Splitting data into training and validation:

Because we do not have access to the test data (which will be used to evaluate your classifier in the last step), we need to train our model and validate its performance on a separate set of data.

This set is called "validation set" and is split from the available (known) training data.

In [28]:
# split the data into 2 parts:
# a training set (the features we train on and the corresponding labels/targets, i.e. whether a person survived)
# and a validation set

# In ML features are often referred to as X and targets as y.

X_train, X_val, y_train, y_val = train_test_split(train_data
                                                  , train_targets
                                                  , test_size=0.2 # choose the fraction of the data held out for validation, between 0 and 1
                                                  , random_state=4 # setting a random state makes the split reproducible,
                                                  # as there is some randomness involved in picking
                                                  )
print("How many samples do we have?")
print("Training set: {} samples".format(X_train.shape[0]))
print("Validation set: {} samples".format(X_val.shape[0]))

How many samples do we have?
Training set: 497 samples
Validation set: 125 samples
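With only 125 validation samples, the accuracy you measure can shift noticeably depending on which passengers happen to land in the validation set. One common way to get a steadier estimate is k-fold cross-validation, sketched below with scikit-learn's `cross_val_score`. This is only an illustration on synthetic data: `X_toy`, `y_toy`, and `clf_cv` are made-up names, and the random forest is just one possible choice of classifier. In this notebook you would pass your selected columns of `train_data` together with `train_targets` instead.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the training data (not the Titanic data)
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

# Any scikit-learn classifier works here; a random forest is only an example
clf_cv = RandomForestClassifier(n_estimators=50, random_state=42)

# 5 folds: train on 4/5 of the data, validate on the remaining 1/5, repeat 5 times
scores = cross_val_score(clf_cv, X_toy, y_toy, cv=5, scoring="accuracy")

print("Accuracy per fold:", scores.round(2))
print("Mean accuracy: {:.2f}% (+/- {:.2f})".format(scores.mean() * 100, scores.std() * 100))
```

The spread across folds gives a rough idea of how much of a change in validation accuracy could be down to the split alone.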


In [41]:
#@title Example solution of model import and training (only use if you need a hint)

# select a model and import it
from sklearn.ensemble import RandomForestClassifier
# train the model with chosen parameters
clf = RandomForestClassifier(n_estimators=100
                             , random_state=42)

# some features are more useful than others. For example, sex and age should be quite informative, as the famous Simpsons quote goes:
# "Won't somebody please think of the children?"

#select the feature age
selected_features = ["age"]

sel_X_train = X_train[selected_features]
sel_X_val = X_val[selected_features]

#fit the model with training data
print("Fitting 'children'-classifier...")
clf.fit(sel_X_train, y_train)

# predict the validation set
y_pred = clf.predict(sel_X_val)

# calculate the accuracy
accuracy = accuracy_score(y_val, y_pred)
print("Validation accuracy: {:.2f}%".format(accuracy * 100))

print("Wow, that's close to chance! \n Maybe they did not think only of age when deciding who should be saved...\
\n Now it's your turn!")


Fitting 'children'-classifier...
Validation accuracy: 51.20%
Wow, that's close to chance! 
 Maybe they did not think only of age when deciding who should be saved...
 Now it's your turn!
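To judge what "close to chance" means here, it helps to compare against a baseline that ignores the features entirely. Since 240 of the 622 training passengers survived, always predicting "did not survive" is already right for 382 of them. The sketch below computes that majority-class baseline with scikit-learn's `DummyClassifier` on made-up labels with the same class balance; `X_toy`, `y_toy`, and `baseline` are illustrative names. In the notebook you would fit on `X_train`, `y_train` and score on `X_val`, `y_val`, where the balance may differ slightly.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Toy labels with the same class balance as the training targets:
# 240 survivors (1) and 382 non-survivors (0)
y_toy = np.array([1] * 240 + [0] * 382)
X_toy = np.zeros((len(y_toy), 1))  # placeholder features, ignored by DummyClassifier

# Always predicts the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)

baseline_accuracy = accuracy_score(y_toy, baseline.predict(X_toy))
print("Majority-class baseline: {:.1f}%".format(baseline_accuracy * 100))
# prints: Majority-class baseline: 61.4%
```

A classifier is only learning something useful once it clearly beats this number.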


## Testing on the test set

This is the final step of the challenge. Each time you want to see the real performance of your classifier and "submit" a result to us, run the cell below with your latest classifier (here called "clf").

Don't forget to select the same features that you used for training and validation!

In [46]:
# use the classifier to predict on the test data
y_pred = clf.predict(test_data[selected_features])

# calculate the accuracy
accuracy = accuracy_score(test_targets, y_pred)

print("Test accuracy: {:.2f}%".format(accuracy * 100))
print("You can run this again every time you train a new classifier. \n\
 We ask you to celebrate each performance boost! \n \
  Sometimes (especially in industry) even a few percentage points can save millions")


Test accuracy: 64.42%
You can run this again every time you train a new classifier. 
 We ask you to celebrate each performance boost! 
   Sometimes (especially in industry) even a few percentage points can save millions
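Selecting the same features by hand for training, validation, and testing is easy to get wrong. One possible way to avoid this, sketched below, is to bundle the column selection and the classifier into a single scikit-learn `Pipeline`, so that the full feature table can be passed in every time. This is not part of the challenge setup: the toy frames (`toy_train`, `toy_targets`, `toy_test`) contain made-up values, and `model` is an illustrative name.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

# Toy data (made-up values) using a few of the notebook's column names
toy_train = pd.DataFrame({
    "pclass": [1, 3, 3, 2, 1, 3],
    "sex": [1, 0, 0, 1, 0, 1],
    "age": [35.0, 24.0, 21.0, 36.0, 54.0, 4.0],
})
toy_targets = pd.Series([1, 0, 0, 1, 0, 1])
toy_test = pd.DataFrame({"pclass": [2, 3], "sex": [1, 0], "age": [28.0, 40.0]})

selected_features = ["sex", "age"]

# The first step keeps only the selected columns (all others are dropped),
# the second step is the classifier itself
model = Pipeline([
    ("select", ColumnTransformer([("keep", "passthrough", selected_features)])),
    ("forest", RandomForestClassifier(n_estimators=10, random_state=42)),
])

# Pass the full tables; the pipeline applies the same selection in fit and predict
model.fit(toy_train, toy_targets)
toy_pred = model.predict(toy_test)
print(toy_pred)
```

With this design, changing `selected_features` in one place is enough, and `model.predict(test_data)` would use the same columns as training.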
