# Introduction to machine learning

In [1]:
import os
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sb

## Importing data

The data can be imported using Pandas with the command `pd.read_csv()`.
In many cases, this does not work directly. This is usually due to one of the following issues:
- `FileNotFoundError` --> Either the file name is spelled incorrectly or the path is incorrect.
- `UnicodeDecodeError` --> Either the file name (+path) contains invalid characters (on Windows, for example, a backslash in a path string must often be written as `\\` instead of `\`, or the path passed as a raw string `r"..."`), or the file itself is not saved in the expected "encoding." For the latter, there are two options: (1) Convert the file with an editor. Or (2) set the `encoding=...` parameter.  
There are many possible encodings ([see link](https://docs.python.org/3/library/codecs.html#standard-encodings)), but the most common are "utf-8" (the standard), "ANSI" (on Mac: "iso-8859-1" or "ISO8859") or "ASCII".
- `ParserError` --> Usually means that the "delimiter" (i.e., the separator) is specified incorrectly. It is best to open the file briefly with an editor and check, then set it accordingly with `delimiter="..."` (or `sep="..."`). Typical separators are `","`, `";"`, `"\t"` (tab).
- If the file does not start with the desired column names, this can be corrected by specifying the rows to be skipped --> `skiprows=1` (1, 2, 3,... depending on the case).
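
The parameters listed above can be combined in a single call. The following is a minimal sketch that does not use the Titanic data: it writes a small, made-up file (the name `example.csv` and its content are assumptions for illustration) with one extra line at the top, `;` as separator, and Latin-1 encoding, and then reads it back.

```python
import os
import tempfile

import pandas as pd

# Hypothetical file content: one extra first line, ";" as separator
content = "some export note\nname;city\nZoë;Köln\nJosé;Málaga\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    filename = os.path.join(tmp_dir, "example.csv")

    # Save the file in Latin-1 (not UTF-8) to mimic a typical encoding issue
    with open(filename, "w", encoding="latin-1") as f:
        f.write(content)

    # Tell pandas about the encoding, the separator, and the line to skip
    df_example = pd.read_csv(filename, encoding="latin-1", sep=";", skiprows=1)

print(df_example)
```

Leaving out `encoding`, `sep`, or `skiprows` here is one way to provoke the errors described above and see what they look like.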

## Titanic dataset!

This data is taken from the [Kaggle Titanic challenge](https://www.kaggle.com/c/titanic/data).

Here, we will attempt to predict whether passengers survived the Titanic disaster based on their passenger data.

### Data Dictionary

| Variable   | Definition                        | Key                                        |
|------------|-----------------------------------|--------------------------------------------|
| survival   | Survival                          | 0 = No, 1 = Yes                            |
| pclass     | Ticket class                      | 1 = 1st, 2 = 2nd, 3 = 3rd                  |
| sex        | Sex                               |                                            |
| age        | Age in years                      |                                            |
| sibsp      | # of siblings/spouses aboard the Titanic |                                      |
| parch      | # of parents/children aboard the Titanic |                                      |
| ticket     | Ticket number                     |                                            |
| fare       | Passenger fare                    |                                            |
| cabin      | Cabin number                      |                                            |
| embarked   | Port of Embarkation               | C = Cherbourg, Q = Queenstown, S = Southampton |

### Variable Notes

- **pclass:** A proxy for socio-economic status (SES)
  - 1st = Upper
  - 2nd = Middle
  - 3rd = Lower

- **age:** Age is fractional if less than 1. If the age is estimated, it is in the form of `xx.5`.

- **sibsp:** The dataset defines family relations in this way:
  - Sibling = brother, sister, stepbrother, stepsister
  - Spouse = husband, wife (mistresses and fiancés were ignored)

- **parch:** The dataset defines family relations in this way:
  - Parent = mother, father
  - Child = daughter, son, stepdaughter, stepson
  - Some children traveled only with a nanny, therefore `parch=0` for them.


In [None]:
path_data = "/Data"
filename = os.path.join(path_data, "titanic_train.csv")

data = pd.read_csv(filename)
data.head()

FileNotFoundError: [Errno 2] No such file or directory: 'Data\\titanic_train.csv'

# (1) Initial data exploration
This should now be almost automatic.

- Are there any missing values? --> `.info()`
- Initial overview & search for problematic entries --> `.describe()` (or `.describe(include="all")`)
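
As a minimal sketch of these two checks, the toy table below uses made-up values that are only loosely modeled on the Titanic columns. The extra call `isna().sum()` is not mentioned above; it is shown here as one optional way to count missing values per column.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration, not the real Titanic file)
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
})

# Column types and number of non-missing entries per column
toy.info()

# Number of missing values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Summary statistics for all columns, including non-numeric ones
summary = toy.describe(include="all")
print(summary)
```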

## Data cleaning
We need to make some decisions here!

- Columns in which we have very few entries --> remove
- Remove columns that we deliberately do not want to use for our predictions --> `Name`, `Ticket`
- Problem case: `Age` --> Here, as an exception, we want to estimate the missing values. This is called **data imputation** and should be avoided in most cases, as it adds generated values, which are essentially *fake data*. However, in this case, please fill in the missing values with `fillna()` using the average age of all other entries.
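
The cleaning steps above can be sketched on a small toy table. The values and the choice to drop `Cabin` are assumptions for illustration; which columns to remove in the real data is a decision for the exercise.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration)
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Ticket": ["t1", "t2", "t3", "t4"],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Age": [20.0, np.nan, 30.0, 40.0],
    "Survived": [0, 1, 1, 0],
})

# Remove columns we do not want to use (here also the sparse "Cabin" column)
toy_cleaned = toy.drop(columns=["Name", "Ticket", "Cabin"])

# Data imputation: replace missing ages with the mean of the known ages
mean_age = toy_cleaned["Age"].mean()
toy_cleaned["Age"] = toy_cleaned["Age"].fillna(mean_age)

print(toy_cleaned)
```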

### Convert categorical data
We still have columns with categorical entries (as strings). These need to be converted to numerical values using `pd.get_dummies`.

Tip: Avoid duplicating the same information. So there is no need for "Sex_male" AND "Sex_female" as one of the two pieces of information is sufficient.
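
One possible way to follow this tip is the `drop_first=True` option of `pd.get_dummies`, which drops the first category of each encoded column. The sketch below uses a made-up toy table; other approaches (for example, dropping the redundant column by hand) work as well.

```python
import pandas as pd

# Toy data (made up for illustration)
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
    "Fare": [7.25, 71.28, 8.05],
})

# Convert the string columns to indicator columns;
# drop_first=True removes one redundant column per original column
toy_encoded = pd.get_dummies(toy, columns=["Sex", "Embarked"], drop_first=True)

print(toy_encoded)
```

Here `Sex_female` is dropped, so `Sex_male` alone carries the information.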

## Data exploration

In [None]:
# here only mildly informative... but feel free to try
# sb.pairplot(data_cleaned, hue="Survived", diag_kind="hist")

In [None]:
data_cleaned["Survived"].value_counts()

## Correlation matrix

Based solely on correlations: 
**Which features can we expect to play a role in predicting survival (`Survived`)?**

**Which feature appears to be the most important?**

In [None]:
fig, ax = plt.subplots(figsize=(12, 10))

corr_matrix = data_cleaned.corr(numeric_only=True)
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
sb.heatmap(corr_matrix,
           mask=mask,
           annot=True,
           vmin=-1, vmax=1,
           square=True,
           cmap="RdBu",
           linewidths=.5, fmt=".1f", ax=ax)

plt.show()

# Split into data and labels

- Label: "Survived" --> 0 did not survive | 1 survived
- Data: Everything except "Survived" --> `.drop()`

### Tasks:
- Create the data `X` and the labels `y` from `data_cleaned`.

In [None]:
# label
y = data_cleaned["Survived"]

# data
X = data_cleaned.drop(["Survived"], axis=1)
X.head()

## Train-test split
The scikit-learn function `train_test_split` randomly divides a data set into training and test data. We can specify the proportion of test data using `test_size=...`, where values between 0 (no data) and 1 (all data) are used.
See also the [Scikit-Learn documentation on train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).

Since this is a random split, it is better to set a "seed" to make it reproducible, using `random_state=0` (or another number).
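
To illustrate how `test_size` and `random_state` behave, here is a minimal sketch on made-up toy data. The value `test_size=0.3` and the names `X_toy`, `y_toy` are arbitrary choices for this example only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with 10 rows (made up for illustration)
X_toy = pd.DataFrame({"a": np.arange(10), "b": np.arange(10) * 2})
y_toy = pd.Series([0, 1] * 5)

# 30% of the rows go into the test set, the rest into the training set
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0
)
print(X_tr.shape, X_te.shape)

# The same seed gives the same split again
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0
)
print(X_te.index.tolist() == X_te2.index.tolist())
```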

In [None]:
from sklearn.model_selection import train_test_split

# Train-test split
X_train, X_test, y_train, y_test = # TODO: add your code
X_train.shape, X_test.shape

# Training of a kNN model
## Scaling data

For some algorithms, it is very important that all features are on a similar scale. k-nearest neighbors is one example, because it relies on distances between data points. To do this, we again use the `StandardScaler` from Scikit-Learn.

The "cleanest" approach here is to perform the scaling **based on the training data** so that no indirect information from the test data is included.
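
The idea of scaling based on the training data can be sketched with a few made-up numbers: the scaler learns mean and standard deviation from the training data only, and the same transformation is then applied to both sets. The variable names below are chosen for this example only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (made up for illustration)
X_tr = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_te = np.array([[2.0, 40.0]])

# Learn mean and standard deviation from the training data only
scaler_toy = StandardScaler()
scaler_toy.fit(X_tr)

# Apply the same transformation to training and test data
X_tr_scaled = scaler_toy.transform(X_tr)
X_te_scaled = scaler_toy.transform(X_te)

print(X_tr_scaled.mean(axis=0), X_tr_scaled.std(axis=0))
print(X_te_scaled)
```

After scaling, the training columns have mean 0 and standard deviation 1. The test data generally does not, because it was transformed with the training statistics.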

In [None]:
from sklearn.preprocessing import StandardScaler

# Scale the data
scaler = StandardScaler()...

X_train = pd.DataFrame(# complete code here,
                       columns=X.columns)
X_test = pd.DataFrame(# complete code here,
                      columns=X.columns)

In [None]:
X_train.head()

## Train model
First, we will try out a k-nearest neighbor model, again using scikit-learn. See [KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html?highlight=kneighborsclassifier#sklearn.neighbors.KNeighborsClassifier).
The most important parameter is `n_neighbors`, i.e., the number of neighbors (the `k` in k-NN).

### Task:
- Train a k-nearest neighbor model with the training data. This means creating a `KNeighborsClassifier` object (with the necessary parameters) and then training it with `.fit()`.

In [None]:
from sklearn.neighbors import KNeighborsClassifier

knn = # complete code here

## Making predictions
While we train a model with `.fit()`, we can make predictions with `.predict()`.
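
The interplay of `.fit()` and `.predict()` can be sketched on a tiny one-dimensional toy problem. The data and the value `n_neighbors=3` are made up for illustration and are not meant as a recommendation for the Titanic data.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy training data: small values belong to class 0, large values to class 1
X_tr = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y_tr = np.array([0, 0, 0, 1, 1, 1])

# Create the model and train it
knn_toy = KNeighborsClassifier(n_neighbors=3)
knn_toy.fit(X_tr, y_tr)

# Predict the class of two new points from their 3 nearest neighbors
pred = knn_toy.predict(np.array([[1.5], [10.5]]))
print(pred)
```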

In [None]:
prediction_survival = knn. # complete code here

## Evaluate results

A good way to check classification predictions is the confusion matrix. To do this, call `confusion_matrix()` and pass it the actual labels and the predicted labels as parameters.
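
How to read the result can be sketched with a few made-up labels. In scikit-learn, the rows of the matrix correspond to the actual labels and the columns to the predicted labels.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration
y_true = [0, 0, 0, 0, 1, 1]  # actual labels
y_pred = [0, 0, 1, 1, 1, 1]  # predicted labels

# Actual labels first, predicted labels second
cm = confusion_matrix(y_true, y_pred)
print(cm)
```

In this toy case, two of the four actual 0s were wrongly predicted as 1 (first row), while both actual 1s were predicted correctly (second row).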

In [None]:
from sklearn.metrics import confusion_matrix

confusion_matrix(# add code here)

In [None]:
# Check which prediction classes were learnt by the model
knn.classes_

In [None]:
fig, ax = plt.subplots(figsize=(4, 4))

sb.heatmap(confusion_matrix(# add code here),
           annot=True, cmap="Blues", cbar=False, fmt=".0f",
           xticklabels=["Died", "Survived"],
           yticklabels=["Died", "Survived"])

# Decision Tree

See [DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html?highlight=decision%20tree#sklearn.tree.DecisionTreeClassifier).

In [None]:
from sklearn.model_selection import train_test_split

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=#add,
    random_state=0
) 

# Important! Decision trees need no data scaling!!

In [None]:
X_train.head()

### First train a decision tree WITHOUT setting any parameters!

In [None]:
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier()  # do NOT add any parameters here --> we will use the default settings
tree.fit(# add code here)

In [None]:
# Now let's make some predictions (here: for the training data)...
prediction_survival = ...

In [None]:
fig, ax = plt.subplots(figsize=(4, 4))

sb.heatmap(confusion_matrix(y_train, prediction_survival),
           annot=True, cmap="Blues", cbar=False, fmt=".0f",
           xticklabels=["Died", "Survived"],
           yticklabels=["Died", "Survived"])

### Evaluation:
Looks like the model is pretty good. What else would we need to check to be sure?

- Take a look at the same thing, but this time for the test set.
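
Why the test set matters can be illustrated with a sketch on purely random data (not the Titanic data). The labels below have no relation to the features, so there is nothing real to learn. A decision tree with default settings can still memorize the training data.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Random features and random labels (made up, no real relationship)
rng = np.random.default_rng(0)
X_noise = rng.normal(size=(200, 5))
y_noise = rng.integers(0, 2, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_noise, y_noise, test_size=0.25, random_state=0
)

# Tree with default settings: it grows until the training data is fit perfectly
tree_toy = DecisionTreeClassifier(random_state=0)
tree_toy.fit(X_tr, y_tr)

train_acc = accuracy_score(y_tr, tree_toy.predict(X_tr))
test_acc = accuracy_score(y_te, tree_toy.predict(X_te))
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

The training accuracy is perfect even though the labels are random, which is why a good result on the training data alone says little about the model.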

## Train model
Here is a decision tree model, again using scikit-learn. See [DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html?highlight=decision%20tree#sklearn.tree.DecisionTreeClassifier).
The most important parameter is `max_depth`, i.e., the maximum depth of the tree.
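
The effect of `max_depth` can be sketched on made-up toy data. The value `max_depth=3` is an arbitrary choice for this example, and `get_depth()` is used here only to inspect the resulting trees.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data (made up): the label depends on the sum of the first two features
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)

# Tree that is limited to a maximum depth of 3
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_toy, y_toy)

# Tree without a depth limit (default settings)
deep = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)

print(shallow.get_depth(), deep.get_depth())
```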

### Task:
- Train a decision tree model with the training data and a maximum depth of 2.

In [None]:
from sklearn.tree import DecisionTreeClassifier

tree = # add own code

## Evaluate results

### Tasks:
Just as with the kNN model, the task here is to:
- Make predictions based on the test data
- Compare these with the actual values using a confusion matrix.

In [None]:
prediction_survival = # add own code
prediction_survival

In [None]:
fig, ax = plt.subplots(figsize=(4, 4))

sb.heatmap(confusion_matrix(# add own code),
           annot=True, cmap="Blues", cbar=False, fmt=".0f",
           xticklabels=["Died", "Survived"],
           yticklabels=["Died", "Survived"])

In [None]:
from sklearn.metrics import accuracy_score

accuracy_score(# add own code)

## Interesting facts about decision trees:
A popular feature of decision trees is that we can also display the trees themselves!
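
Besides a plot, a tree can also be printed as plain text with `export_text` from `sklearn.tree`. This is shown here only as an optional alternative, on made-up toy data with hypothetical feature names.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data (made up): the first feature separates the two classes
X_toy = np.array([[0.0, 5.0], [1.0, 3.0], [2.0, 8.0], [3.0, 1.0]])
y_toy = np.array([0, 0, 1, 1])

tree_toy = DecisionTreeClassifier(max_depth=1, random_state=0)
tree_toy.fit(X_toy, y_toy)

# Text version of the decision rules (feature names are hypothetical)
tree_text = export_text(tree_toy, feature_names=["feature_a", "feature_b"])
print(tree_text)
```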

In [None]:
X_train.columns

In [None]:
from sklearn.tree import plot_tree

feature_names = X_train.columns

fig, ax = plt.subplots(figsize=(10, 10))
plot_tree(tree, feature_names=feature_names, filled=True)
plt.show()

### Task:
- Repeat the same steps, but this time with a tree depth of 4.

In [None]:
tree = DecisionTreeClassifier(# add own code)
# train model


In [None]:
prediction_survival = # add own code

In [None]:
fig, ax = plt.subplots(figsize=(4, 4))

sb.heatmap(confusion_matrix(# add own code),
           annot=True, cmap="Blues", cbar=False, fmt=".0f",
           xticklabels=["Died", "Survived"],
           yticklabels=["Died", "Survived"])

In [None]:
X_train.columns

In [None]:
feature_names = X_train.columns

fig, ax = plt.subplots(figsize=(10, 10))
plot_tree(tree, feature_names=feature_names, filled=True)
plt.show()