# PB016: Artificial intelligence I, labs 12 - Deep learning

Today's topic is a quick and dirty introduction to deep learning. Specifically, we'll focus on:
1. __Dummy deep learning pipeline__
2. __Developing your own deep learning classifier__

---

## 1. Dummy deep learning pipeline

__Basic facts__
- Deep learning consists of designing, training and validating machine learning models based on various [neural architectures](https://en.wikipedia.org/wiki/Types_of_artificial_neural_networks) that typically involve multiple (hidden) layers, each consisting of many neural computing units (a simple example of such a unit is the [perceptron](https://en.wikipedia.org/wiki/Perceptron), like the one we implemented in the previous labs).
- An example of a deep learning architecture:

<img src="https://www.fi.muni.cz/~novacek/courses/pb016/labs/img/stacked-representation.png" alt="architecture" width="550px" title="Original image source: Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. ”Deep learning.” MIT press, 2016. (Chap. 1) License: Probably OK to use for academic purposes; for any other use, contact the publisher (MIT Press)."/>

- A number of libraries that integrate seamlessly with parallel computational architectures are available for developing deep learning models. Some popular examples are:
 - [PyTorch](https://pytorch.org/) - originally a general-purpose ML library written in C, now a state-of-the-art deep learning framework with relatively easy-to-use Python (and C++) abstraction layers.
 - [TensorFlow](https://www.tensorflow.org/) - a general-purpose, highly optimised library for multilinear algebra and statistical learning.
 - [Keras](https://keras.io/) - formerly a separate project, now an abstraction layer for user-friendly development of deep learning models integrated with TensorFlow, PyTorch and [JAX](https://jax.readthedocs.io/).
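
To make the notion of stacked layers of computing units more concrete, here is a minimal sketch of a forward pass through one hidden layer and one output unit, written in plain NumPy rather than in any of the libraries above. The layer sizes, the random weights and the input vector are hypothetical values chosen only for illustration.

```python
import numpy as np

def relu(z):
    # rectified linear unit: negative values are clipped to zero
    return np.maximum(0.0, z)

def sigmoid(z):
    # squashes any real number into the (0, 1) interval
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# hypothetical sizes: 3 input features, 4 hidden units, 1 output unit
W1 = rng.normal(size=(3, 4))   # weights between the input and hidden layer
b1 = np.zeros(4)               # one bias per hidden unit
W2 = rng.normal(size=(4, 1))   # weights between the hidden and output layer
b2 = np.zeros(1)               # bias of the output unit

# a batch containing a single (made-up) input vector
x = np.array([[0.5, -1.0, 2.0]])

# each layer is a weighted sum of its inputs followed by a non-linearity
hidden = relu(x @ W1 + b1)
output = sigmoid(hidden @ W2 + b2)

print(hidden.shape, output.shape)
```

Training a network then amounts to adjusting `W1`, `b1`, `W2` and `b2` so that the outputs match the known labels as closely as possible, which is what the libraries listed above automate.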

### A warm-up task - predicting onset of diabetes using Keras
- This is based on the widely used Pima Indians diabetes dataset - a classic machine learning sandbox dataset described in detail for instance [here](https://towardsdatascience.com/pima-indian-diabetes-prediction-7573698bd5fe).
- The task is to use that dataset to train a classifier for predicting whether or not a person develops diabetes.
- The prediction is based on a number of characteristics (i.e., features) such as blood pressure or body mass index.

#### Loading the data using [pandas](https://pandas.pydata.org/pandas-docs/stable/getting_started/tutorials.html)

In [None]:
# importing the library for handy CSV file processing
import pandas as pd

# loading the data, in CSV format, from the web
dataframe = pd.read_csv('https://www.fi.muni.cz/~novacek/courses/pb016/labs/data/12/example/diabetes.csv')

# checking the first few rows of the CSV
dataframe.head()

### Creating the data structures representing features and labels

In [None]:
# getting just the Outcome column as the vector of labels
# - note that the column contains 0, 1 values that correspond to negative
#   (no diabetes developed) and positive (diabetes developed) example labels,
#   respectively
df_labels = dataframe.Outcome.values.astype(float)
# the features are the data minus the label vector
# - this contains the remaining features present in the data
df_features = dataframe.drop('Outcome',axis=1).values

### Splitting the data into train and test sets using [scikit-learn](https://sklearn.org/)

In [None]:
# importing a convenience data splitting function from scikit-learn

from sklearn.model_selection import train_test_split

# computing a random 80-20 split (80% of the data for training, the remaining
# 20% held out as "unseen" data for testing the model trained on the 80%)

x_train, x_test, y_train, y_test = train_test_split(df_features, df_labels,
                                                    test_size=0.2,
                                                    random_state=42)
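
For intuition, a random split of this kind can be sketched in a few lines of plain Python. The helper `simple_train_test_split` below is a hypothetical illustration, not the scikit-learn implementation, so it will not reproduce the exact same partition as `train_test_split` even with the same seed.

```python
import random

def simple_train_test_split(rows, test_size=0.2, seed=42):
    """Shuffle the row indices and cut them into a train and a test part."""
    indices = list(range(len(rows)))
    # a dedicated generator with a fixed seed makes the split reproducible
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

# ten made-up "examples" standing in for the rows of a dataset
rows = list(range(10))
train_rows, test_rows = simple_train_test_split(rows)

print(len(train_rows), len(test_rows))
```

Fixing the seed plays the same role as `random_state=42` above: repeated runs yield the same split, which keeps the scores of different models comparable.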

### Training a baseline, classical machine learning model (logistic regression)

In [None]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(solver='liblinear')
logreg.fit(x_train, y_train)

### Evaluating the baseline model

In [None]:
# using the trained model to predict the test labels
y_pred = logreg.predict(x_test)

# importing stuff needed to display a confusion matrix
from sklearn.metrics import ConfusionMatrixDisplay
# making plots prettier
import seaborn as sns

ConfusionMatrixDisplay.from_estimator(
    logreg, x_test, y_test, xticks_rotation="vertical"
)

In [None]:
# importing some widely-used scoring functions from scikit-learn

from sklearn.metrics import f1_score, precision_score, recall_score

# computing the precision, recall and F1 scores from the predictions
score_p = precision_score(y_test, y_pred, average='macro')
score_r = recall_score(y_test, y_pred, average='macro')
score_f = f1_score(y_test, y_pred, average='macro')
# using a scoring method of the model itself (to compute accuracy)
score_a = logreg.score(x_test, y_test)

# printing out the scores
print('Various scores of the logistic regression classifier on the test set')
# the number of correct predictions (true positives and true negatives)
# divided by the number of all predictions
print('  - accuracy :', score_a)
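# precision: the number of patients correctly classified as high risk
# divided by the number of all patients classified as high risk
# (with average='macro', this is computed for each of the two classes
# separately and the two values are then averaged)
print('  - precision:', score_p)
# recall: the number of patients correctly classified as high risk,
# divided by the number of all patients that really are high risk
# (again computed per class and averaged)
print('  - recall   :', score_r)
# F1: harmonic mean of the precision and recall values
# (again computed per class and averaged)
print('  - F1       :', score_f)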
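
The scores above can also be derived by hand from the four cells of a confusion matrix. The following sketch uses hypothetical counts (not the actual results of the model above) and shows how the macro-averaged variants, as requested by `average='macro'`, combine the per-class values.

```python
def binary_scores(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 with respect to one class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# hypothetical confusion matrix counts, for illustration only
tp, fp, fn, tn = 30, 10, 20, 40

# scores with the positive (diabetes) class in focus
acc, p_pos, r_pos, f_pos = binary_scores(tp, fp, fn, tn)
# scores with the negative class in focus: the roles of the cells are swapped
_, p_neg, r_neg, f_neg = binary_scores(tn, fn, fp, tp)

# macro averaging: unweighted mean of the per-class scores
macro_p = (p_pos + p_neg) / 2
macro_r = (r_pos + r_neg) / 2
macro_f = (f_pos + f_neg) / 2

print('accuracy:', acc)
print('precision (positive class / macro):', p_pos, macro_p)
print('recall    (positive class / macro):', r_pos, macro_r)
print('F1        (positive class / macro):', f_pos, macro_f)
```

Note that the macro-averaged values can differ noticeably from the positive-class ones, especially when the classes are imbalanced.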

### Creating a [Keras](https://keras.io/) model

In [None]:
# adapted from:
#   - https://www.kaggle.com/code/atulnet/pima-diabetes-keras-implementation

# importing the basics from Keras
from keras.models import Sequential
from keras.layers import Dense, Input

# a model for simple sequential stacking of layers
model = Sequential()

# 1st layer: input layer corresponding to the feature vector of size 8
model.add(Input(shape=(8,)))
# 2nd layer: 100 fully connected nodes, a simple non-linear activation
#            (ReLU - rectified linear unit)
model.add(Dense(100, activation='relu'))
# output layer: dim=1, sigmoid activation
#               (probability that the input belongs to the positive class)
model.add(Dense(1, activation='sigmoid'))

# compiling the model with the binary cross-entropy loss (predicting 0/1)
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
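
As a rough sanity check of what has just been built, the sketch below counts the trainable parameters of the two `Dense` layers by hand and computes the binary cross-entropy loss for a few made-up predictions. The helper names `dense_params` and `binary_cross_entropy` are hypothetical and written in plain Python; they are not part of Keras.

```python
import math

def dense_params(n_inputs, n_units):
    # one weight per (input, unit) pair plus one bias per unit
    return n_inputs * n_units + n_units

hidden_params = dense_params(8, 100)    # 8 features -> 100 hidden units
output_params = dense_params(100, 1)    # 100 hidden units -> 1 output unit
total_params = hidden_params + output_params

def binary_cross_entropy(y_true, y_prob):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all examples."""
    losses = [-(y * math.log(p) + (1 - y) * math.log(1 - p))
              for y, p in zip(y_true, y_prob)]
    return sum(losses) / len(losses)

# made-up labels and predicted probabilities, for illustration only
confident_right = binary_cross_entropy([1, 0], [0.9, 0.1])
confident_wrong = binary_cross_entropy([1, 0], [0.1, 0.9])

print('trainable parameters:', total_params)
print('loss when confidently right:', confident_right)
print('loss when confidently wrong:', confident_wrong)
```

The parameter count should match what `model.summary()` reports for the model above. The loss values illustrate why this loss suits 0/1 prediction: confident mistakes are penalised far more heavily than confident correct answers.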

### Training the created model

In [None]:
# simply calling the fit function, training on the training data and validating
# on the test data after each epoch
model.fit(x_train, y_train, epochs=30,
          validation_data=(x_test, y_test))

- Interpreting the results
 - Not too great:
   - The loss is barely being optimised after about 10 epochs
   - The validation accuracy is worse than the classical ML baseline (barely over 0.7 in most runs as opposed to nearly 0.76)
 - The reasons:
   - More or less default settings of the model
   - More importantly, though, there's no preprocessing of the rather noisy and skewed input data (see for instance [this blog post](https://towardsdatascience.com/pima-indian-diabetes-prediction-7573698bd5fe) or [this Kaggle notebook](https://www.kaggle.com/code/atulnet/pima-diabetes-keras-implementation), which describe detailed exploratory analysis and input data transformations)
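
One common preprocessing step that might help here is standardising the features so that each of them has zero mean and unit variance. The sketch below shows the idea in NumPy on a hypothetical toy matrix (the `_toy` arrays are made up, not the diabetes data); scikit-learn's `StandardScaler` offers the same transformation in a more convenient form. Whether this actually improves the scores on the diabetes data is something to verify experimentally.

```python
import numpy as np

# hypothetical toy features on very different scales, for illustration only
x_train_toy = np.array([[1.0, 100.0],
                        [2.0, 200.0],
                        [3.0, 300.0]])
x_test_toy = np.array([[2.0, 400.0]])

# the statistics are estimated on the training data only ...
mean = x_train_toy.mean(axis=0)
std = x_train_toy.std(axis=0)

# ... and then applied to both the training and the test data,
# so that no information leaks from the test set into the training
x_train_scaled = (x_train_toy - mean) / std
x_test_scaled = (x_test_toy - mean) / std

print(x_train_scaled)
print(x_test_scaled)
```

With real data, the same two steps would be applied to `x_train` and `x_test` before calling `model.fit`.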

---

## 2. Developing your own deep learning classifier
- Your task is to predict which passengers survived the Titanic disaster, as described in the [Kaggle](https://www.kaggle.com) challenge on [Machine Learning from Disaster](https://www.kaggle.com/c/titanic/overview).

![titanic](https://www.fi.muni.cz/~novacek/courses/pb016/labs/img/titanic.jpg)

- Split into groups (min 2, max 4 people).
- Register at [Kaggle](https://www.kaggle.com/) so that you can officially participate in the [Titanic: Machine Learning from Disaster](https://www.kaggle.com/c/titanic/overview) competition (one account per group is enough).
- Then, use a deep learning library ([Keras](https://keras.io/) might be the easiest option for newbies, but feel free to use for instance [PyTorch](https://pytorch.org/) if that's what you're already comfortable with) to solve the Titanic survivors' prediction problem as follows:
 - Get the challenge [data](https://www.kaggle.com/c/titanic/data) via the URLs in the notebook below.
 - Design a simple neural model for classification of (non)survivors using your chosen library.
 - Train the model on the `train.csv` dataset (after possibly preprocessing the data).
 - Use the trained model to predict the labels of the set `test.csv` (i.e., the values of the column _"Survived"_; for more details, see the competition documentation itself).
 - [Upload results](https://www.kaggle.com/c/titanic/submit) on Kaggle.
- Discuss your model and score with the lab tutor! The collaborating members of the group with the best relative results and/or an interesting/elegant/efficient/unusual model can earn bonus points.


### Loading the train and test data

In [None]:
# importing pandas, just in case it wasn't imported before
import pandas as pd

# loading the train and test data using pandas

df_train = pd.read_csv('https://www.fi.muni.cz/~novacek/courses/pb016/labs/data/12/titanic/train.csv',
                       index_col='PassengerId')
df_test = pd.read_csv('https://www.fi.muni.cz/~novacek/courses/pb016/labs/data/12/titanic/test.csv',
                      index_col='PassengerId')

### Checking out the train and test data contents

In [None]:
df_train.head()

In [None]:
df_test.head()
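
The Titanic data contains text columns and missing values, so it cannot be fed to a neural model directly. As a starting point only, the sketch below shows two typical preprocessing steps on a hypothetical toy frame: the column names follow the Kaggle data, but the values are made up, and the choice of columns and strategies is just one of many possible options.

```python
import pandas as pd

# hypothetical rows mimicking a few columns of the Titanic data
toy = pd.DataFrame({
    'Pclass': [3, 1, 3, 2],
    'Sex': ['male', 'female', 'female', 'male'],
    'Age': [22.0, 38.0, None, 35.0],
    'Fare': [7.25, 71.28, 7.92, 13.0],
})

features = toy.copy()
# replacing missing ages with the median of the known ages
features['Age'] = features['Age'].fillna(features['Age'].median())
# encoding the text column as a number (1 for female, 0 for male)
features['Sex'] = (features['Sex'] == 'female').astype(int)

# converting to a plain numeric array that a neural model can consume
x_toy = features.to_numpy(dtype='float32')

print(features)
print(x_toy.shape)
```

When applying something similar to the real data, statistics such as the median should be computed on `df_train` and reused for `df_test`, so that both sets are transformed in the same way.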

### Developing the model itself

In [None]:
# TODO - YOUR CODE GOES HERE

### Notes on the solution
- Feel free to look for inspiration on the web, but make sure you understand what you're doing when using someone else's code (whether written by a human or generated by AI).
- A practical note on getting the submission file to be uploaded to Kaggle, if you're working in Google Colab:
 - You can create the CSV in your virtual environment, for instance using the `submission.to_csv('submission.csv', index=False)` command, assuming the `submission` variable is a _pandas_ data frame object.
 - Then you can download it to your local machine by importing the `files` module (`from google.colab import files`) and calling `files.download('submission.csv')`.
 - To save you some effort, the following code cell contains the Google Colab code that can take care of the above steps, assuming you have generated the `predictions` variable from the Titanic test set using your trained model's `predict()` function. Note that `predictions` has to be a one-dimensional sequence of 0/1 labels; a Keras model with a single sigmoid output returns probabilities in an array of shape `(n, 1)`, which need to be thresholded and flattened first.

In [None]:
# the pandas data frame with the results
submission = pd.DataFrame({
    'PassengerId': df_test.index,
    'Survived': predictions,
})

# storing the submissions as CSV
submission.sort_values('PassengerId', inplace=True)
submission.to_csv('submission-example.csv', index=False)

# downloading the created CSV file locally
from google.colab import files
files.download('submission-example.csv')
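
The `predictions` variable used above is expected to hold one 0/1 label per passenger. A minimal sketch of getting there from the raw output of a model with a single sigmoid output unit follows; the `probabilities` array is a made-up stand-in for the result of `model.predict(...)`, and the 0.5 threshold is a conventional but arbitrary choice.

```python
import numpy as np

# hypothetical stand-in for the output of model.predict(...) on four passengers:
# one probability of survival per row, in an array of shape (4, 1)
probabilities = np.array([[0.91], [0.12], [0.50], [0.49]])

# flattening to one dimension and thresholding at 0.5 to get 0/1 labels
predictions = (probabilities.ravel() >= 0.5).astype(int)

print(predictions)  # [1 0 1 0]
```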

---

#### _Final note_ - the materials used in this notebook are original works credited and licensed as follows:
- Image of Titanic:
 - Retrieved from [Wikimedia Commons](https://commons.wikimedia.org/wiki/File:St%C3%B6wer_Titanic.jpg)
 - Author: Willy Stöwer (image reproduction)
 - License: none (or [Public Domain](https://en.wikipedia.org/wiki/public_domain))
- Image of the DL architecture:
 - See the inline note associated with the image itself.