# MADS-Deep Learning
---
## Portfolio Examination Part 1
#### Janosch Höfer, 938969

## Table of contents

- [Imports](#imports) <br>
- [1. Exercise](#task1) <br>
- [2. Exercise](#task2) <br>
- [3. Exercise](#task3) <br>
- [4. Exercise](#task4) <br>
- [References](#ref) <br>

<a id='imports'></a>
## Imports

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import torch
import torch.nn as nn
from tqdm.notebook import tqdm
from sklearn.model_selection import StratifiedKFold
from sklearn.utils.class_weight import compute_class_weight
from imblearn.over_sampling import SMOTE
from sklearn import metrics
import matplotlib.pyplot as plt

torch.set_default_dtype(torch.float)
torch.manual_seed(42)

<a id='task1'></a>
## Exercise 1
Given a perceptron with weights $(0.1, 0.4, 0.6, 0.7)$ and bias $0.2$, compute the output for the tensor $\begin{pmatrix}1 & 0 & 1 & 0\\0.1 & 0.2 & 0.1 & 0.2\end{pmatrix}$ of dimensions (dataset, features).

In [None]:
weights = torch.tensor([0.1, 0.4, 0.6, 0.7])
bias = torch.tensor(0.2)
vector = torch.tensor([[1, 0, 1, 0], [0.1, 0.2, 0.1, 0.2]])

In [None]:
def perceptron_predict(
    input_t: torch.Tensor, weights: torch.Tensor, bias: torch.Tensor
) -> torch.Tensor:
    return torch.matmul(input_t, weights) + bias

In [None]:
output_1 = perceptron_predict(vector, weights, bias)
output_1

The result of the perceptron is a tensor with dimensionality (dataset). The results are 0.9 and 0.49.
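
The two values can be checked by hand by writing out the weighted sum plus bias for each row of the input. As in `perceptron_predict`, no activation function is applied:

$$
\begin{aligned}
y_1 &= 1 \cdot 0.1 + 0 \cdot 0.4 + 1 \cdot 0.6 + 0 \cdot 0.7 + 0.2 = 0.9 \\
y_2 &= 0.1 \cdot 0.1 + 0.2 \cdot 0.4 + 0.1 \cdot 0.6 + 0.2 \cdot 0.7 + 0.2 = 0.49
\end{aligned}
$$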

<a id='task2'></a>
## Exercise 2
The vacation platform JourneyAdvisor wants to apply deep learning in their recommender engine,
that recommends points of interest to users based on their user account properties and previously
visited places. The catalog of the platform contains $14,467$ points of interest. Users can check-in at
such places using their phones. The platform has $1,989,345$ users. When users register, they enter
their birthday, a payment method and their home address.<br>
1. Propose a list of features, suitable for the recommendation task. Explain your choice!

Suitable features could be:
* Age
* Number of visits
* Payment method (as embeddings)
* Home address (postal code)

Instead of the raw birthday, we can compute the user's age. Age is a continuous value and has a much smaller numeric range than a date, which makes it easier to scale. The number of visits is also a continuous value that should be used for recommendations: if a user visits certain types of places more often, we can recommend places that users with similar preferences have visited. The payment method could be an interesting feature. Someone who uses Apple Pay would probably prefer places that offer this payment method, and whether a place offers Apple Pay can be inferred from other users who have visited it. The difficulty is that the payment method is a categorical value. To make it usable by a neural network, it has to be transformed, using either one-hot encoding or embeddings. The last feature is the home address, more precisely the zip code. The zip code is also categorical, but there are several ways to handle it. The first approach replaces it with continuous data that can be assigned to each zip code, e.g. average salary, crime rate or house prices. The second approach uses latitude and longitude. In both approaches the data can be fine-tuned further by changing the granularity of the zip code to districts, cities or states.
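
The transformations described above can be sketched in a few lines. This is only an illustration: the helper names, the list of payment methods and the example user are hypothetical and not part of the exercise.

```python
from datetime import date


def age_in_years(birthday: date, today: date) -> int:
    """Number of full years between birthday and today."""
    had_birthday = (today.month, today.day) >= (birthday.month, birthday.day)
    return today.year - birthday.year - (0 if had_birthday else 1)


def one_hot(value: str, categories: list[str]) -> list[float]:
    """One-hot vector for value; all zeros if the category is unknown."""
    return [1.0 if value == category else 0.0 for category in categories]


# Hypothetical payment methods, for illustration only.
PAYMENT_METHODS = ["credit_card", "paypal", "apple_pay"]

example_user = {"birthday": date(1990, 6, 15), "payment": "apple_pay"}
user_features = [float(age_in_years(example_user["birthday"], date(2022, 6, 1)))]
user_features += one_hot(example_user["payment"], PAYMENT_METHODS)
print(user_features)
```

One-hot encoding is the simpler choice for a handful of payment methods. For a feature with many categories, such as the zip code, a learned embedding would keep the input dimension small.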

2. Describe a tensor to model the data for JourneyAdvisor. Describe its dimensionality.

$T\begin{matrix}( POI,&&    User,    &&& features)\end{matrix}$<br>
$T\begin{matrix}( 14,467,& 1,989,345, & 4&)\end{matrix}$

3. How many entries does the tensor have?

In [None]:
print(f"{14467 * 1989345 * 4:,d}")

With the above dimensionality the tensor has 115,119,416,460 entries, i.e. about 115 billion. Each additional feature adds about 28.8 billion entries.
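
To get a feeling for what this size means in practice, here is a back-of-the-envelope estimate. It assumes the tensor is stored densely as 32-bit floats; the exercise does not specify a dtype, so the 4 bytes per entry are an assumption.

```python
n_poi = 14_467
n_users = 1_989_345
n_features = 4
bytes_per_entry = 4  # assumption: float32 storage

entries = n_poi * n_users * n_features
entries_per_feature = n_poi * n_users
total_bytes = entries * bytes_per_entry

print(f"entries:             {entries:,d}")
print(f"entries per feature: {entries_per_feature:,d}")
print(f"dense size:          {total_bytes / 1e9:,.1f} GB")
```

Under this assumption the dense tensor would take roughly 460 GB, which suggests that a sparse representation of the check-ins would be needed in practice.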

<a id='task3'></a>
## Exercise 3
Familiarize yourself with the SMOTE [[1]](#1) algorithm. In your own words, describe the use-case of the
SMOTE. Among others, address these points:
1. In which situations can it be useful (explain in general and provide three examples)?

Synthetic Minority Oversampling Technique (SMOTE) is useful when the classes in a dataset are imbalanced: it increases the number of samples of the
underrepresented classes by generating synthetic samples.

2. What is its fundamental idea?

SMOTE does not simply duplicate the underrepresented samples. It takes the features of such a sample and
combines them with the features of its neighboring samples to generate new instances.
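
The idea can be sketched as an interpolation between a minority sample and one of its nearest minority neighbors. The code below is a minimal illustration in plain Python, not the implementation used by `imblearn`; the function name `smote_sample` and the toy points are made up for this example.

```python
import math
import random


def smote_sample(minority: list[list[float]], k: int, rng: random.Random) -> list[float]:
    """Create one synthetic sample from a list of minority-class points."""
    base = rng.choice(minority)
    # k nearest minority neighbors of the base point (excluding the point itself)
    others = [p for p in minority if p is not base]
    neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
    neighbor = rng.choice(neighbors)
    # random point on the line segment between base and neighbor
    lam = rng.random()
    return [b + lam * (n - b) for b, n in zip(base, neighbor)]


rng = random.Random(42)
minority_points = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [3.0, 3.0]]
synthetic = [smote_sample(minority_points, k=2, rng=rng) for _ in range(5)]
for point in synthetic:
    print([round(v, 3) for v in point])
```

Because every synthetic point lies on a segment between two existing minority samples, the new samples stay within the region already covered by the minority class.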

3. How is SMOTE different from oversampling with replacement?


<a id='task4'></a>
## Exercise 4
Create a Jupyter Notebook to solve the following machine learning task in Python, using PyTorch
(and other suitable libraries):
### 1. Load and arrange the dataset *portfolio_data_sose_2022.csv*. It has two features, feature_1 and feature_2, and a target variable target.

In [None]:
df = pd.read_csv("data/portfolio_data_sose_2022.csv")
print(
    f"Number of rows: \t{df.shape[0]} \n"
    f"Number of columns: \t{df.shape[1]}\n"
    f"Number of targets: \t{df['target'].nunique()}"
)

In [None]:
df.head()

### 2. Describe the class distribution.

The classes are highly imbalanced. Out of the 10,000 entries, target 0 accounts for 9,900.

In [None]:
df["target"].value_counts()

In [None]:
bar = sns.countplot(data=df, x="target")
bar.set(xlabel="Target", ylabel="Count")
plt.show()

### 3. Plot the data.

In [None]:
dis = sns.jointplot(data=df, x="feature_1", y="feature_2", hue="target", kind="scatter")
dis.set_axis_labels("Feature 1", "Feature 2", fontsize=12)
plt.show()

Plotting the data shows more than the imbalanced distribution of the two targets. With regard to feature 1 and feature 2, the data points of the two targets can hardly be separated from each other. This could pose a problem for the classification task at hand, because no further features are available for training.

In [None]:
dis2 = sns.relplot(x="feature_1", y="feature_2", data=df, col="target", hue="target")
dis2.set_axis_labels("Feature 1", "Feature 2", fontsize=12)
dis2.set_titles(col_template="Target {col_name}")
dis2._legend.set_title("Target")
dis2.tight_layout()
plt.show()

### 4. Create a simple (single layer) neural network with two output neurons, one for each of the two classes 0 and 1 (i.e. use a multiclass classification setup).

In [None]:
class NNClassifier(nn.Module):
    def __init__(self, input_size: int, classes: int):
        super().__init__()
        self.lin1 = nn.Linear(input_size, classes)

    def forward(self, x: torch.Tensor):
        return self.lin1(x)

    def predict(self, x: torch.Tensor):
        _, indices = torch.max(self.forward(x), dim=1)
        return indices

The cross-validation setup below trains five different models using the seeds 0 through 4 and returns the metrics averaged over these five models for further evaluation.

In [None]:
def cross_validation(
    model_class: type[nn.Module],
    X_train: pd.DataFrame,
    y_train: pd.Series,
    X_test: pd.DataFrame,
    y_test: pd.Series,
    lossfunc: nn.Module,
    lr: float,
    epochs: int,
) -> dict:

    X_train_t = torch.tensor(X_train.values, dtype=torch.float)
    X_test_t = torch.tensor(X_test.values, dtype=torch.float)
    y_train_t = torch.tensor(y_train.values, dtype=torch.long)
    y_test_t = torch.tensor(y_test.values, dtype=torch.long)

    results = dict()
    for seed in tqdm(range(5)):
        torch.manual_seed(seed)
        model, loss = train_model(model_class, X_train_t, y_train_t, lossfunc, lr, epochs)
        results[f"seed{seed}"] = compute_acc(model, X_test_t, y_test_t)

    results_avg = dict()
    for metric in ["Acc", "Bal_Acc", "Recall", "Precision"]:
        listy = [results[key][metric] for key in results.keys()]
        results_avg[f"Avg_{metric}"] = sum(listy) / len(listy)
    conf = [results[key]["conf_matrix"] for key in results.keys()]
    results_avg["avg_conf"] = np.mean(conf, axis=0)
    return results_avg


def train_model(
    model_class: type[nn.Module],
    x_data: torch.Tensor,
    y_data: torch.Tensor,
    lossfunc: nn.Module,
    lr: float,
    epochs: int,
) -> tuple[nn.Module, list]:

    model = model_class(2, 2)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss = [train_epoch(model, lossfunc, optimizer, x_data, y_data) for _ in range(epochs)]
    return model, loss


def train_epoch(model: nn.Module, lossfunc: nn.Module, optimizer, x_data, y_data):
    model.train()
    y_train_pred = model(x_data)
    optimizer.zero_grad()

    loss = lossfunc(y_train_pred, y_data)
    loss.backward()
    optimizer.step()
    return loss.item()

To evaluate the model the following metrics are used:
- Accuracy:<br>
    $\frac{TP + TN}{TP + TN + FP + FN}$

- Balanced Accuracy:<br>
    $\frac{1}{2} * (sensitivity + specificity)$ <br>
    with $sensitivity = \frac{TP}{TP + FN}$ <br>
    with $specificity = \frac{TN}{TN + FP}$
    
- Recall:<br>
    $\frac{TP}{TP + FN}$

- Precision:<br>
    $\frac{TP}{TP + FP}$

- Confusion Matrix:<br>
    $\begin{pmatrix}TN & FP \\FN & TP \end{pmatrix}$ 
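
The formulas above can be written out directly from the four confusion-matrix counts. The following sketch is only meant to make the definitions concrete; the function name and the example counts are hypothetical, and the notebook itself relies on `sklearn.metrics`.

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict[str, float]:
    """Metrics from confusion-matrix counts (assumes no denominator is zero)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "Acc": (tp + tn) / (tp + tn + fp + fn),
        "Bal_Acc": 0.5 * (sensitivity + specificity),
        "Recall": sensitivity,
        "Precision": tp / (tp + fp),
    }


# Hypothetical counts for a test fold with 1,980 negatives and 20 positives.
example_metrics = classification_metrics(tp=10, tn=1978, fp=2, fn=10)
for name, value in example_metrics.items():
    print(f"{name}: {value:.4f}")
```

In this made-up example the accuracy is above 0.99 although half of the positives are missed, which is why balanced accuracy, recall and precision are reported alongside it.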

In [None]:
def compute_acc(model: nn.Module, X: torch.Tensor, y: torch.Tensor) -> dict[str, float]:
    model.eval()
    with torch.no_grad():
        y_pred = model.predict(X)
    results = dict()
    results["Acc"] = metrics.accuracy_score(y, y_pred)
    results["Bal_Acc"] = metrics.balanced_accuracy_score(y, y_pred)
    results["Recall"] = metrics.recall_score(y, y_pred)
    results["Precision"] = metrics.precision_score(y, y_pred)
    results["conf_matrix"] = metrics.confusion_matrix(
        y, y_pred, labels=y.unique(), normalize="true"
    )
    return results

### 5. Compare the performance of the neural network in three different settings:

In [None]:
epochs = 5000
learning_rate = 0.1
folds = 5
loss_function = nn.CrossEntropyLoss()

data = df.iloc[:, :2]
labels = df["target"]

For the first setting the plain data is used. The data is split five times into training and test datasets, and models are trained on each training dataset using the cross-validation setup.<br>

DISCLAIMER: After a lot of discussion with my fellow students and your feedback, I decided to use the setup below. Instead of an initial split into training and test data, the complete dataset is used for the five folds. This seems to be the most sensible approach given the instructions, because we are not optimizing hyper-parameters. An initial split before the folds would result in two different sets of test data: the initial test dataset (1) and a test dataset (2) for each fold. In that case, cross-validation would be used to optimize hyper-parameters on test data (2), and the chosen hyper-parameters would then be used to train a model that is evaluated on the initial test dataset (1). That approach would result in one model per setting instead of 25.
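
To illustrate what stratified folds mean for this dataset, here is a simplified sketch in plain Python. It is not how `StratifiedKFold` is implemented (there is no shuffling, for instance); the function `stratified_folds` and the variable names are made up for this example.

```python
from collections import Counter, defaultdict


def stratified_folds(targets: list[int], n_folds: int) -> list[list[int]]:
    """Assign sample indices to folds so that each fold keeps the class ratio."""
    by_class = defaultdict(list)
    for idx, target in enumerate(targets):
        by_class[target].append(idx)
    fold_indices = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        # deal the indices of each class round-robin across the folds
        for position, idx in enumerate(indices):
            fold_indices[position % n_folds].append(idx)
    return fold_indices


toy_targets = [0] * 9900 + [1] * 100  # class counts of this dataset
toy_folds = stratified_folds(toy_targets, n_folds=5)
for fold_id, indices in enumerate(toy_folds):
    counts = Counter(toy_targets[i] for i in indices)
    print(fold_id, len(indices), dict(counts))
```

Each test fold therefore contains 2,000 samples, of which only 20 belong to target 1, so the per-fold metrics for the minority class rest on very few samples.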

#### Plain Data

In [None]:
normal_results = dict()

splits = StratifiedKFold(n_splits=folds, shuffle=True, random_state=42)
pbar = tqdm(total=folds)

for fold, (train_idx, test_idx) in enumerate(splits.split(data, labels)):
    X_train, X_test = data.iloc[train_idx], data.iloc[test_idx]
    y_train, y_test = labels.iloc[train_idx], labels.iloc[test_idx]

    tqdm.write(f"Starting crossvalidation with fold {fold+1}")
    normal_results[f"fold{fold}"] = cross_validation(
        model_class=NNClassifier,
        X_train=X_train,
        y_train=y_train,
        X_test=X_test,
        y_test=y_test,
        lossfunc=loss_function,
        lr=learning_rate,
        epochs=epochs,
    )
    pbar.update()
pbar.close()

In [None]:
metric_df = pd.DataFrame(normal_results).T.drop(columns="avg_conf")
metric_df

Using the plain data, the average Accuracy per fold ranges from $0.9935$ to $0.996$, the average Balanced Accuracy from $0.70$ to $0.82$, the average Recall from $0.40$ to $0.65$ and the average Precision from $0.77$ to $1.0$.

In [None]:
metric_df_avg = pd.DataFrame(metric_df.mean(), columns=["Normal"]).T
metric_df_avg

The average metrics across all folds are Accuracy: $0.995$, Balanced Accuracy: $0.76$, Recall: $0.52$ and Precision: $0.89$.

In [None]:
conf = [normal_results[key]["avg_conf"] for key in normal_results.keys()]
metrics.ConfusionMatrixDisplay(np.mean(conf, axis=0)).plot()
plt.show()

The above confusion matrix displays the percentage of correctly/wrongly predicted targets. Each row sums up to 100%. Target 0 has been wrongly predicted in 0.071% of cases and target 1 in 48% of cases.

#### SMOTE

In [None]:
over_sm = SMOTE(sampling_strategy="not majority", random_state=42, n_jobs=-2)
data_res, labels_res = over_sm.fit_resample(data, labels)
df_res = data_res.assign(target=labels_res.values)

dis3 = sns.jointplot(data=df_res, x="feature_1", y="feature_2", hue="target", kind="scatter")
dis3.set_axis_labels("Feature 1", "Feature 2", fontsize=12)
plt.show()

In [None]:
smote_results = dict()

splits = StratifiedKFold(n_splits=folds, shuffle=True, random_state=42)
pbar = tqdm(total=folds)

for fold, (train_idx, test_idx) in enumerate(splits.split(data, labels)):
    X_train, X_test = data.iloc[train_idx], data.iloc[test_idx]
    y_train, y_test = labels.iloc[train_idx], labels.iloc[test_idx]

    X_train_res, y_train_res = over_sm.fit_resample(X_train, y_train)

    tqdm.write(f"Starting crossvalidation with fold {fold+1}")
    smote_results[f"fold{fold}"] = cross_validation(
        model_class=NNClassifier,
        X_train=X_train_res,
        y_train=y_train_res,
        X_test=X_test,
        y_test=y_test,
        lossfunc=loss_function,
        lr=learning_rate,
        epochs=epochs,
    )
    pbar.update()
pbar.close()

In [None]:
metric_df_smote = pd.DataFrame(smote_results).T.drop(columns="avg_conf")
metric_df_smote

Using SMOTE, the average Accuracy per fold ranges from $0.909$ to $0.938$, the average Balanced Accuracy from $0.84$ to $0.93$, the average Recall from $0.75$ to $0.95$ and the average Precision from $0.091$ to $0.118$.

In [None]:
metric_df_smote_avg = pd.DataFrame(metric_df_smote.mean(), columns=["Smote"]).T
metric_df_smote_avg

The average metrics across all folds are Accuracy: $0.925$, Balanced Accuracy: $0.89$, Recall: $0.86$ and Precision: $0.11$.

In [None]:
conf = [smote_results[key]["avg_conf"] for key in smote_results.keys()]
metrics.ConfusionMatrixDisplay(np.mean(conf, axis=0)).plot()
plt.show()

Target 0 has been wrongly predicted in 7.5% of cases and target 1 in 14% of cases.
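
The low precision despite the high recall follows from the class imbalance in the test folds. A rough check, assuming a test fold with 1,980 samples of target 0 and 20 samples of target 1 (one fifth of the dataset) and using the rounded rates reported above:

```python
n_neg, n_pos = 1980, 20       # assumed test-fold composition (10,000 rows / 5 folds)
false_positive_rate = 0.075   # share of target 0 that was wrongly predicted
avg_recall = 0.86             # average recall reported above

tp = avg_recall * n_pos
fp = false_positive_rate * n_neg
precision_estimate = tp / (tp + fp)
print(f"TP = {tp:.1f}, FP = {fp:.1f}, precision = {precision_estimate:.3f}")
```

The estimate lands close to the reported average precision. It does not match exactly, because the rates are rounded and averaging precision over folds is not the same as computing it from averaged rates. Even a false positive rate below 10% produces far more false positives than there are positives to find.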

#### Using appropriate weights

In [None]:
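# np.unique returns the classes in sorted order, so the weights line up with
# the class indices expected by nn.CrossEntropyLoss.
class_weights = compute_class_weight(class_weight="balanced", classes=np.unique(labels), y=labels)
class_weights_t = torch.tensor(class_weights, dtype=torch.float)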

loss_function_w = nn.CrossEntropyLoss(class_weights_t)
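
For reference, the `"balanced"` heuristic in scikit-learn computes each weight as `n_samples / (n_classes * class_count)`. The sketch below applies this formula by hand to the class counts of this dataset; the variable names are chosen for this example only.

```python
class_counts = {0: 9900, 1: 100}  # class counts of this dataset
n_samples = sum(class_counts.values())
n_classes = len(class_counts)

balanced_weights = {
    target: n_samples / (n_classes * count) for target, count in class_counts.items()
}
print(balanced_weights)
```

An error on a sample of target 1 is thus weighted 99 times as heavily as an error on a sample of target 0, which matches the ratio of the class counts.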

In [None]:
weight_results = dict()

splits = StratifiedKFold(n_splits=folds, shuffle=True, random_state=42)
pbar = tqdm(total=folds)

for fold, (train_idx, test_idx) in enumerate(splits.split(data, labels)):
    X_train, X_test = data.iloc[train_idx], data.iloc[test_idx]
    y_train, y_test = labels.iloc[train_idx], labels.iloc[test_idx]

    tqdm.write(f"Starting crossvalidation with fold {fold+1}")
    weight_results[f"fold{fold}"] = cross_validation(
        model_class=NNClassifier,
        X_train=X_train,
        y_train=y_train,
        X_test=X_test,
        y_test=y_test,
        lossfunc=loss_function_w,
        lr=learning_rate,
        epochs=epochs,
    )
    pbar.update()
pbar.close()

In [None]:
metric_df_weight = pd.DataFrame(weight_results).T.drop(columns="avg_conf")
metric_df_weight

Using class weights, the average Accuracy per fold ranges from $0.895$ to $0.927$, the average Balanced Accuracy from $0.84$ to $0.93$, the average Recall from $0.75$ to $0.95$ and the average Precision from $0.079$ to $0.098$.

In [None]:
metric_df_weight_avg = pd.DataFrame(metric_df_weight.mean(), columns=["Weights"]).T
metric_df_weight_avg

The average metrics across all folds are Accuracy: $0.914$, Balanced Accuracy: $0.88$, Recall: $0.85$ and Precision: $0.09$.

In [None]:
conf = [weight_results[key]["avg_conf"] for key in weight_results.keys()]
metrics.ConfusionMatrixDisplay(np.mean(conf, axis=0)).plot()
plt.show()

Target 0 has been wrongly predicted in 8.5% of cases and target 1 in 15% of cases.

### 7. Interpret your results, explain your conclusions regarding SMOTE and class weights.

In [None]:
pd.concat([metric_df_avg, metric_df_smote_avg, metric_df_weight_avg])

The baseline calculations below use the confusion matrix layout $\begin{pmatrix}TN & FP \\FN & TP \end{pmatrix}$.

In [None]:
# Accuracy Baseline
# (tp + tn) / (tp + tn + fp + fn)
(0 + 9900) / 10000

In [None]:
# Balanced Accuracy Baseline
#        sensitivity   specificity
# 1/2 * (tp/(tp+fn) + tn/(tn+fp))
0.5 * (0 / (0 + 100) + 9900 / (9900 + 0))

In [None]:
# Recall Baseline
# tp / (tp + fn)
# Recall of the minority class
0 / (0 + 100)

In [None]:
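# Precision Baseline
# tp / (tp + fp)
# A classifier that always predicts the majority class has tp = 0 and fp = 0,
# so precision is undefined (0 / 0); scikit-learn reports 0.0 in this case.
0.0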

<a id='ref'></a>
## References
<a id='1'>N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “Smote: Synthetic minority
over-sampling technique,” Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, 2002.</a>