# **Exercise: The Titanic Kaggle machine learning challenge, a case study in classification**

Titanic Kaggle Challenge is a competition where you'll use data to predict who could've survived the infamous Titanic disaster.

Classification: "survived" or "not survived"

<img src='https://drive.google.com/uc?export=view&id=1jbTFiD4AHpp7zqom8MWc86PVDglbhsWI' width=500px align="center">

Let's explore machine learning with a fun example from Kaggle, a competition site owned by Google. These contests can be for giggles, cash prizes, or even a job offer sometimes! The Titanic Kaggle Challenge is known as one of the classic examples for learning classification in a hands-on way.

One beginner's challenge is based on the Titanic disaster, a famous shipwreck that happened on April 15, 1912. The ship, thought to be unsinkable, hit an iceberg and sank, sadly causing 1502 out of 2224 people onboard to lose their lives because there weren't enough lifeboats.

But it's not just about guessing who made it. It's about deeply exploring the data, finding patterns, and understanding how different factors might have affected survival rates. It poses questions like 'Did socioeconomic status influence survival rates?' and 'Was the "women and children first" policy strictly followed, and what was its impact?'

Here's the interesting part: it seems that some people were more likely to survive than others. The challenge asks us to figure out who these folks were, using data like their names, ages, genders, and social classes.

We get a file with details about 891 passengers, including whether they survived or not. We'll use this data to teach our machine to make smart guesses.

But the real test comes with another file. This one has information on 418 passengers, but doesn't tell us if they survived. That's where our machine's predictions come in!

Suggested tutorial about Kaggle's Titanic challenge: [Titanic - Machine Learning from Disaster](https://www.kaggle.com/competitions/titanic)

You can follow Chris White's tutorial on this subject through this Jupyter notebook: [01-intro-classification.ipynb](https://colab.research.google.com/github/ualberta-rcg/python-machine-learning/blob/main/notebooks/01-intro-classification.ipynb#scrollTo=Gd80ekh_zQu4).

# **One-hot encoding**

<img src='https://drive.google.com/uc?export=view&id=1r-2ngkSjI7aues9ILuih25kby705yzLx' width=500px align="center">

Machine learning is a bit like a kid who only likes to play with numbers and not with words. So, when we have categories like **'apple' and 'pear'**, we need to find a way to turn them into numbers.

This is where one-hot encoding comes in handy! It's a cool trick that turns these categories into something machine learning algorithms can work with. It's like giving each category its own 'on' and 'off' switch.

And guess what? There's an easy way to do this if you're using pandas, a tool in Python. It has a function called 'get_dummies' that does all the work for you.

**I thought it would be interesting for you to know**: the term "one-hot" in "one-hot encoding" comes from the way digital circuits are designed. In digital electronics, a one-hot signal is a group of bits among which the legal combinations of values are only those with a single high bit (1) and all the others low (0).

In [None]:
X = pd.get_dummies(train_df[features])
X

The 'get_dummies' function is pretty cool. It only changes the columns that need changing and leaves the ones with numbers just the way they are.
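To make the 'on' and 'off' switches concrete, here is a minimal sketch on a tiny made-up table. The name `toy_df` and its values are assumptions for illustration only, not part of the Titanic data.

```python
import pandas as pd

# Toy data (made up for illustration, not the Titanic file)
toy_df = pd.DataFrame({
    "Fruit": ["apple", "pear", "apple"],
    "Weight": [150, 180, 140],
})

# The text column becomes one indicator column per category;
# the numeric column passes through unchanged
encoded = pd.get_dummies(toy_df)
print(encoded)
```

Each row has exactly one of `Fruit_apple` and `Fruit_pear` switched on, while `Weight` is left alone.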

By the way, even though **passenger class** is expressed as a number, have you ever wondered if we should treat it as a category instead? Just a thought!
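If you want to try that idea, one possible approach is the `columns` argument of `get_dummies`, which asks pandas to encode a column even though it holds numbers. This is only a sketch on made-up rows; `toy_df` is a hypothetical name, and whether this helps the model is something to test rather than assume.

```python
import pandas as pd

# Toy rows (made up) with a numeric class column
toy_df = pd.DataFrame({"Pclass": [1, 3, 2, 3], "SibSp": [0, 1, 0, 2]})

# columns= forces get_dummies to treat the numeric Pclass column as categories
encoded = pd.get_dummies(toy_df, columns=["Pclass"])
print(encoded.columns.tolist())
```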

We're now going to split our data into two groups. One group is like our study material that we'll use to teach our machine. The other group is like a pop quiz to see how well the machine learned from the study material. The machine won't have seen this quiz data during its 'study' time.

Here's how we do it: we'll use two-thirds of our data for teaching (or 'training') and keep one-third for the quiz (or 'testing'). We want to see if our machine's guesses on the quiz match the actual answers.

There's a handy tool in scikit-learn, a library in Python, called 'train_test_split' that does the splitting for us.

Here's a little code snippet:

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

Let's break down what this does:

- X_train is the data we use for training (two-thirds of the total data)
- y_train is the actual answers for the training data
- X_test is the quiz questions (the remaining one-third of the data)
- y_test is the real answers for the quiz. We'll use these to check how well our machine did.
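The four pieces above can be checked on a small made-up array. This is a sketch, not the Titanic data: `X_toy` and `y_toy` are illustrative names, and `random_state=0` is added here only so the shuffle is repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 12 made-up samples with 2 features each; row i is [2*i, 2*i + 1]
X_toy = np.arange(24).reshape(12, 2)
y_toy = np.array([0, 1] * 6)  # label of row i is i % 2

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=0
)

# Roughly two-thirds for training, one-third for testing
print(X_tr.shape, X_te.shape)

# Features and labels are shuffled together, so rows stay matched up
print(((X_tr[:, 0] // 2) % 2 == y_tr).all())
```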

# **Our first machine learning model: Decision Tree**

Let's kick off our machine learning adventure with a model called a decision tree. Imagine it like a game of 20 questions. The model asks questions about the data and makes decisions based on the answers. And the cool thing? It can change or fine-tune its answers as it gets more information!

<img src='https://drive.google.com/uc?export=view&id=1pd3ojKWIOaPmMzKMaZE06MQg1rDv3-g7' width=500px align="center">

(Image courtesy: Audrey Fukman and Andy Wright on SFoodie, via Serious Eats)




*sklearn.tree.DecisionTreeClassifier*

Think of sklearn's DecisionTreeClassifier as a super-smart detective. It checks out your data and figures out the right questions to ask on its own. We can even tell it how deep we want it to dig into the data with an option called 'max_depth' (basically, how many questions deep the tree is allowed to go).


In [None]:
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(max_depth=3)

---

# **Fit**

Most, if not all, models from Scikit-learn have a handy tool called a 'fit' method. It's like a personal trainer for your model, helping it learn from your data features and labels.
Think of it as the model's learning button—it uses your features and labels to teach the model.











In [None]:
# fit trains the model in place
# (and returns the same, now trained, model object)

model = model.fit(X_train, y_train)
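A quick way to see what 'fit' hands back is to compare the returned object with the original. The sketch below uses a tiny made-up dataset (`X_toy`, `y_toy` and `clf` are illustrative names): scikit-learn estimators are trained in place and `fit` returns the estimator itself, so the assignment `model = model.fit(...)` is harmless but optional.

```python
from sklearn.tree import DecisionTreeClassifier

# Made-up data: one feature, label switches from 0 to 1 between 1 and 2
X_toy = [[0], [1], [2], [3]]
y_toy = [0, 0, 1, 1]

clf = DecisionTreeClassifier(max_depth=1)
returned = clf.fit(X_toy, y_toy)

# fit returns the very same object, which is now trained
print(returned is clf)
print(clf.predict([[0], [3]]))
```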

---

# **Predict**

Almost every model in Scikit-learn comes with a neat feature called a 'predict' method. Once your model has been trained, 'predict' lets it make guesses about new data that doesn't have labels.

Here we use our **test data** (data that the model hasn't seen before) as the input:

In [None]:
predictions = model.predict(X_test)

In [None]:
print("The first ten predicted labels for the test data")
print(list(predictions[:10]))
print('The ten actual labels')
print(list(y_test[:10]))

# **But what did the decision tree do?**


In [None]:
# If you want to install graphviz ....
# Note for conda, you may have to install both graphviz and python-graphviz
# !pip install graphviz
# !conda install graphviz python-graphviz

In [None]:
import graphviz
from sklearn.tree import export_graphviz

dot_data = export_graphviz(model,
                           feature_names=X.columns,
                           class_names=['Died', 'Survived'],
                           filled=True, rounded=True,
                           special_characters=True,
                           out_file=None)
graph = graphviz.Source(dot_data)
graph
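If installing graphviz is inconvenient, scikit-learn also offers `export_text`, which prints the tree's questions as indented plain text. The sketch below fits a small tree on made-up rows (the data and the names `X_toy`, `y_toy`, `clf` are assumptions for illustration); with the notebook's own model you would pass `model` and `list(X.columns)` instead.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up rows: [Sex_male, Age]; here only Sex_male separates the labels cleanly
X_toy = [[0, 40], [1, 38], [0, 26], [1, 35], [1, 20], [0, 2]]
y_toy = [1, 0, 1, 0, 0, 1]

clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_toy, y_toy)

# One line per question or leaf, indented by depth
rules = export_text(clf, feature_names=["Sex_male", "Age"])
print(rules)
```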

# **Measuring the quality of predictions in classification models**

Alright, it's time to get into the world of stats and technical terms! We're going to explore something called a **confusion matrix**. This is a handy tool that allows us to compare our model's predictions with the actual results in our test data.

In [None]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, predictions)

A confusion matrix is a bit like a report card for our predictions. It gives us the numbers for when we got things right and when we got them wrong, in a handy little table:

|          | Predicted 0 (didn't survive) | Predicted 1 (survived) |
|----------|---------------------|---------------------|
| **Actual 0 (didn't survive)** | True negative (TN)  | False positive (FP) |
| **Actual 1 (survived)** | False negative (FN) | True positive (TP)  |

The true negatives and true positives are when our predictions match reality. True negatives are when we said a person wouldn't survive, and they didn't. True positives are when we said a person would survive, and they did.

False negatives and false positives are where we slipped up. False negatives are when we said a person wouldn't survive, but they ended up surviving. False positives are when we said a person would survive, but they didn't.
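To connect the four cells to code, here is a minimal sketch on made-up labels (`y_true` and `y_pred` are illustrative, not our Titanic results). For a two-class problem, flattening the matrix with `ravel()` yields the counts in the order TN, FP, FN, TP, matching the table row by row.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 0 = didn't survive, 1 = survived
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes
matrix = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = matrix.ravel()

print(matrix)
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```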

We can calculate accuracy as the fraction of predictions we got right: the number of true negatives and true positives divided by the total number of predictions.

Accuracy alone can mislead, though. Consider a team of medical researchers trying to predict whether or not a patient is at risk of heart failure. Accuracy is still the proportion of correct predictions, but if heart failures are rare, a model that always predicts "no heart failure" can score a high accuracy while being useless in a medical context. So, it's important to consider more than just accuracy when evaluating such predictions!

<img src='https://drive.google.com/uc?export=view&id=1os_zF6e9JpaxsP8Yw-jWTrQCDFUW6TTH' width=500px align="center">


Accuracy is basically how often our model gets its guesses right. It's calculated by adding up all the times our model correctly predicted heart failure (true positives) and correctly predicted no heart failure (true negatives), then dividing that by the total number of predictions. In other words,

accuracy = (TN + TP) / (TN + TP + FN + FP).

In Scikit-learn, we can calculate this using something called 'accuracy_score'.

In [None]:
from sklearn.metrics import accuracy_score

# Convention: true labels first, then predictions
accuracy_score(y_test, predictions)

Accuracy might not always give us the full picture. Let's say we want to predict a rare disease that only affects 1 in every 1,000 people. If we create a model that always predicts that no one has the disease, it will be 99.9% accurate, but it's not very useful, right?

The same can be applied to our **Titanic challenge**. If we made a model that simply predicted everyone on board died, it would be 62% accurate. But that model wouldn't tell us much about who actually had a better chance of survival. So, accuracy isn't always the best judge of a model's performance!
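The rare-disease scenario is easy to reproduce. In this sketch the labels are made up (exactly 1 sick person in 1,000), and the "model" is simply an array of zeros standing in for a classifier that always predicts "healthy".

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Made-up population: 1 person in 1,000 has the disease
y_true = np.zeros(1000, dtype=int)
y_true[0] = 1

# A "model" that always predicts "no disease"
y_pred = np.zeros(1000, dtype=int)

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)

print("Accuracy:", acc)  # 999 of 1,000 predictions are correct
print("Recall:", rec)    # the one sick person is missed
```

High accuracy and zero recall at the same time is exactly the situation where accuracy hides the problem.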

Precision and recall might not be as straightforward as accuracy, but in certain cases, they can give us a better understanding of how our model is performing.

Imagine a circle that represents all the patients we predicted would experience heart failure. Now, the left part of that circle represents the patients who actually did experience heart failure.

**Precision** is like asking,

"*Out of all the patients we predicted would experience heart failure, how many actually did?*" We calculate it as:

TP / (TP + FP).

**Recall**, on the other hand, is like asking,

"*Out of all the patients who actually experienced heart failure, how many did we correctly predict?*" We calculate it as:

TP / (TP + FN).
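Both formulas can be checked by hand against scikit-learn. The labels below are made up for illustration (1 = heart failure, 0 = no heart failure); the hand-counted values should agree with `precision_score` and `recall_score`.

```python
from sklearn.metrics import precision_score, recall_score

# Made-up labels: 1 = heart failure, 0 = no heart failure
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# Count the cells of the confusion matrix by hand
pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)
fp = sum(t == 0 and p == 1 for t, p in pairs)
fn = sum(t == 1 and p == 0 for t, p in pairs)

manual_precision = tp / (tp + fp)
manual_recall = tp / (tp + fn)

print(manual_precision, precision_score(y_true, y_pred))
print(manual_recall, recall_score(y_true, y_pred))
```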

Now, there are times when precision or recall matters more:

**High precision** is key when we really want to avoid false positives.

- For example, in court trials, we want to be sure that if we declare someone guilty, they really are.

- Another example is email spam filters. We don't want to accidentally label important emails as spam.

**High recall** is critical when we want to avoid false negatives.

- For example, in cancer screenings, a false negative could mean a patient who actually has cancer gets a clean bill of health.

Think about it, what do you think is more important for recommendation engines like YouTube, Netflix, or Spotify? Would it be precision or recall?

The scores for our model ...

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score

print('Accuracy: {}'.format(accuracy_score(y_test, predictions)))
print('Precision: {}'.format(precision_score(y_test, predictions)))
print('Recall: {}'.format(recall_score(y_test, predictions)))

# **Bringing it all together...**

We've been working on different pieces in separate parts of this notebook. Let's now bring all of that code together. This way, we can see the big picture of our entire process more clearly.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

# More about this line shortly ...
# np.random.seed(1337)

# Load data
train_df = pd.read_csv('data/titanic/train.csv')

# Choose features and labels
features = ["Pclass", "Sex", "SibSp", "Parch"]
X = pd.get_dummies(train_df[features], drop_first=True)
y = train_df['Survived']

# Split data into training and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

# Initialize model and fit to training data
model = DecisionTreeClassifier(max_depth=3)
model = model.fit(X_train, y_train)

# Use model to predict on unseen test data
predictions = model.predict(X_test)

# Evaluate how well the model did
print('Accuracy: {}'.format(accuracy_score(y_test, predictions)))
print('Precision: {}'.format(precision_score(y_test, predictions)))
print('Recall: {}'.format(recall_score(y_test, predictions)))

# **Repeatability ...**

Try running the previous section a few times. Notice anything different each time? A lot of machine learning code relies on random number generators, which can make it tricky to get the same results every time.

Scikit-learn, the toolkit we're using, relies on a random number generator from a library called NumPy. Fortunately, we can set a starting point for this random number generator (think of it like setting a starting line for the random numbers). This makes sure that we get the same sequence of random numbers every time.

See that line in the code that reads: np.random.seed(1337)? If you remove the '#' sign in front of it, you can run the code several times and get the same results each time.

Don't worry about the number 1337 - it's just a common choice because in internet slang it stands for 'leet' or 'elite'. Feel free to choose any number that makes you smile! Just remember to use it every time for the same process if you want to get repeatable results.
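As an alternative to seeding NumPy globally, many scikit-learn functions and models (including `train_test_split`, `DecisionTreeClassifier` and `RandomForestClassifier`) accept a `random_state` argument. The sketch below, on made-up arrays, shows that two splits with the same `random_state` come out identical; it is one possible approach, not the one used in the pipeline above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 samples, 2 features
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10) % 2

# Same random_state, same shuffle
split_a = train_test_split(X_toy, y_toy, test_size=0.3, random_state=1337)
split_b = train_test_split(X_toy, y_toy, test_size=0.3, random_state=1337)

same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same)
```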

# **Random forest**
**If a tree is good, is a forest better?**

<img src='https://drive.google.com/uc?export=view&id=1_eGAlGv9sC4qGGJvnd63PFjjRLF6FoDk' width=500px align="center">

<img src='https://drive.google.com/uc?export=view&id=15dNmxx7L9ISn5nQi0Vy72Wq8Gqj5eNbG' width=500px align="right">

Think about it this way - if one tree is handy, wouldn't a whole forest be even better?

Imagine we could create a bunch of decision trees, each one asking different questions. Then, we could combine their answers to make a final prediction. This is exactly what a Random Forest does.

A Random Forest is what we call an 'ensemble model'. It's like a supergroup band made up of lots of individual musicians, all working together to create a harmonious sound. Here, each model contributes to a final, hopefully better, prediction.




In our toolkit, Scikit-learn, there's a way to use Random Forest. It's called *'sklearn.ensemble.RandomForestClassifier'*.

Just like with DecisionTree, we can decide how deep we want the decision trees to go using 'max_depth'.

But there's another cool setting we can adjust. It's called 'n_estimators', and it lets us decide how many trees to use in our forest. For example, if we want to use 100 trees and set our max_depth to 3, we'd write:

model = RandomForestClassifier(n_estimators=100, max_depth=3). Pretty neat, right?
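To see those two settings at work without giving away the exercise below, here is a sketch on synthetic data from `make_classification` (the data, the names `X_toy`, `y_toy`, `forest`, and the `random_state` values are assumptions for illustration). After fitting, the individual trees are available in `estimators_`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic two-class data, not the Titanic file
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

forest = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=0)
forest.fit(X_toy, y_toy)

# One fitted decision tree per estimator, none deeper than max_depth
print(len(forest.estimators_))
print(max(tree.get_depth() for tree in forest.estimators_))
```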

# **Exercise: use a RandomForestClassifier**

Try your hand at setting up a machine learning pipeline using a RandomForestClassifier model. Follow these steps:

- Start by grabbing the **Titanic data** from the file.
- Divide the data into two sections: training data and testing data.
- Next, fit a random forest model using the training data.
- Then, use the test data to make predictions.
- Finally, see how your model did by evaluating its performance.

Give it a shot! It's a great way to practice what you've learned.

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Your code goes here ...
# Hint: other than a couple of lines of code, it should look
#   very much like the decision tree pipeline above

In [None]:
# PRINT SOLUTION (copy/paste output into a cell to run)
show_solution('titanic-random-forest-pipeline.py')

# **Handling missing data**
Alright, let's take another peek at the details of our data set:

In [None]:
train_df.info()

Imagine we believed that age could really help us predict who survived the Titanic. However, we hit a bit of a snag:

Some of our records don't include data for age. This is a problem because most machine learning algorithms get stumped when they encounter missing data.

Okay, there are a few ways we can tackle this problem:

- We could simply get rid of any rows with missing data.
- Alternatively, we could fill in the blanks with a default value. This could be the average value, if available, or something like zero, -1, or 9999.
- Or, perhaps the absence of a value could be informative in itself? For instance, we could record a one if there's a cabin number for the passenger, and a zero otherwise.

Now, let's explore how to discard rows and how to replace missing values with the average.
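The third option, recording whether a value is present at all, is not covered in the sections that follow, so here is a minimal sketch of it. The rows are made up, and `HasCabin` is a hypothetical column name chosen for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows: two passengers with a cabin number, two without
toy_df = pd.DataFrame({"Cabin": ["C85", np.nan, "E46", np.nan]})

# 1 if a cabin number is recorded, 0 if it is missing
toy_df["HasCabin"] = toy_df["Cabin"].notnull().astype(int)
print(toy_df)
```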

# **Removing Rows**

Firstly, let's use a handy tool to identify where we have missing (or null) values:

In [None]:
train_df.isnull()

In [None]:
# Count the missing data in each column
train_df.isnull().sum()

Pandas dataframes have this useful feature called 'dropna' that can come to the rescue. It helps us filter out rows that have missing data. We can even specify which columns we want to focus on using a special keyword called 'subset'. The best part? It doesn't mess with our original dataframe. Instead, it gives us a fresh new dataframe, all tidy and missing-data-free!

In [None]:
age_non_null_train_df = train_df.dropna(subset=['Age'])
age_non_null_train_df.info()

# If we decide to go further down this road, we might do either:
#    train_df.dropna(subset=['Age'], inplace=True)
#                or
#    train_df = train_df.dropna(subset=['Age'])

# **Replacing missing data with a mean**

Here's another approach: Instead of tossing out rows, we can get creative and fill in the missing age values with a "fictional" age. Let's calculate the mean age from the available non-null values:

In [None]:
train_df.describe()

Pandas Series have a method called fillna that allows us to replace null values in a column with a value of our choice.


In [None]:
# We can make a copy of the dataframe if we don't want to
# modify the original (optional)...
age_mean_train_df = train_df.copy()
# Overwrite the column with new data with the missing data filled
age_mean_train_df['Age'] = train_df['Age'].fillna(train_df['Age'].mean())

In [None]:
age_mean_train_df.describe()
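When comparing the two `describe()` outputs, it helps to know what to look for. The sketch below uses three made-up ages (`ages` and `filled` are illustrative names): filling with the mean keeps the mean unchanged but shrinks the spread, because every filled value sits exactly at the centre.

```python
import numpy as np
import pandas as pd

# Made-up ages with one missing value
ages = pd.Series([20.0, np.nan, 40.0])

# Replace the missing value with the mean of the known ages
filled = ages.fillna(ages.mean())

print(filled.tolist())
print(ages.mean(), filled.mean())  # the mean is unchanged
print(ages.std(), filled.std())    # the standard deviation shrinks
```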

**Here's a helpful tip:** if you're curious about the age distribution in the data, you can use the 'value_counts' method with the 'bins' option to group the ages into different ranges. This can give you a sense of how the ages are distributed across the dataset.

In [None]:
train_df["Age"].value_counts(bins=10, sort=False)

Alright, here's a fun exercise for you:

Let's include the age of the passengers as an additional feature in our machine learning pipeline.
- You can choose whether to use the dataset with the null rows thrown out (e.g., train_df = age_non_null_train_df) or the dataset with the missing age data replaced by the mean (e.g., train_df = age_mean_train_df). It's up to you to decide which approach to take.

- Next, you can pick any classifier algorithm you like, such as DecisionTreeClassifier or RandomForestClassifier.
- Feel free to experiment with different keyword arguments that you think might make a difference. Try changing them and see if they have any effect on the results.

It's a great way to explore and understand the impact of different settings on the model's performance. Have fun tinkering with it!

In [None]:
# Your code here ...

In [None]:
# PRINT SOLUTION (copy/paste output into a cell to run)
# (one possible solution ...)
show_solution('titanic-age-dropna.py')


Alright, here's an exciting exercise for you:

Let's use the most recently trained model to predict the survival of yourself (or your entire family, if you'd like!).

Here's an example to get you started:

In [None]:
# You may need to double-check the order of features, it should look like:
# ["Pclass", "SibSp", "Parch", "Age", "Sex_male"]
print(X_train.columns)
features = X_train.columns

family = [
  [2, 1, 1, 53.0, 1],  # Me
  [2, 1, 1, 52.0, 0],  # Wife
  [2, 0, 2, 10.0, 0]   # Daughter
]

family_df = pd.DataFrame(family, columns=features)
model.predict(family_df)

Feel free to modify the 'family' list with the information of your own family members or even predict your own survival individually. The model will predict the survival outcome based on the provided data. Have fun running your predictions!

# Analogy to explain Precision and Recall

 [Here](https://towardsdatascience.com/precision-and-recall-88a3776c8007) is a great analogy to explain Precision and Recall, and I've paraphrased it below:

Think of it like fishing with a net. If you cast a wide net into a lake and catch 80 out of 100 fish, that's 80% recall. However, you also end up with 80 rocks in your net, which means your precision is 50% since half of the net's contents are unwanted junk.

On the other hand, you could use a smaller net and focus on a specific area of the lake where there are lots of fish and no rocks. In this case, you might only catch 20 out of the 100 fish, but you'll have zero rocks. This results in 20% recall and 100% precision.
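The numbers in the fishing analogy follow directly from the precision and recall formulas, treating a caught fish as a true positive, a rock as a false positive, and a fish left in the lake as a false negative. The helper functions below are written for this illustration only.

```python
def precision(tp, fp):
    """Fraction of the catch that is fish."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Fraction of all fish in the lake that were caught."""
    return tp / (tp + fn)


# Wide net: 80 of the 100 fish caught, plus 80 rocks
wide_precision = precision(tp=80, fp=80)
wide_recall = recall(tp=80, fn=20)

# Small net: 20 of the 100 fish caught, no rocks
small_precision = precision(tp=20, fp=0)
small_recall = recall(tp=20, fn=80)

print(wide_precision, wide_recall)
print(small_precision, small_recall)
```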

<img src='https://drive.google.com/uc?export=view&id=1hR6BOW_L-3vXt_VTYqG48SfstZHqcesh' width=300px align="right">

<img src='https://drive.google.com/uc?export=view&id=18W2wnGFpAbAnKOI1I5aPrpXjx89L4lFI' width=300px align="left">



# **More notebooks on classification**

- [Practical Guide to 6 Classification Algorithms](https://www.kaggle.com/code/faressayah/practical-guide-to-6-classification-algorithms)
- [Basic Machine Learning with Cancer | Kaggle](https://www.kaggle.com/code/gargmanish/basic-machine-learning-with-cancer)
- [BreastCancerEDA | Kaggle](https://www.kaggle.com/code/cboychinedu/breastcancereda)


