# Data & Things (RUC F2023)

## Hand-in Exercises for Exam

* This is a template for your exercise solutions. Each solution may use multiple cells. 

* Do your best to make your code clean and clear, e.g., by using comments and Markdown cells.

* Remember to fill in the information of all your group members in the following cell.

## Group Members:
* [Rasmus Kjær Nielsen, 68910, rkjaern@ruc.dk]
* [name_2, student number, email_2]
* [Add more if needed]

## 0. Loading of common modules or initialization of other common things, if any

In [None]:
import pandas as pd
import numpy as np
import scratch.deep_learning as dl
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier # Import Decision Tree Classifier
from sklearn.model_selection import train_test_split # Import train_test_split function
from sklearn import metrics #Import scikit-learn metrics module for accuracy calculation

## 1. EDA and data cleaning (Lecture 2 & 5)

Make an Exploratory Data Analysis (EDA) and data cleaning of the “titanic_survival_data.csv” dataset from Lectures 5 and 6, including dealing with outliers and missing values.
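
A minimal sketch of the two cleaning steps named above, run on a small made-up DataFrame rather than the real file. The column names mirror the Titanic data, but the values, the 1.5 × IQR rule, and the choice to clip rather than drop are assumptions for illustration, not a prescribed method.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, None, 35.0, 54.0, 2.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 51.86, 512.33],
})

# Missing values: count them per column, then fill Age with the median
# (one alternative to dropping the affected rows).
missing_per_column = toy.isna().sum()
toy["Age"] = toy["Age"].fillna(toy["Age"].median())

# Outliers: flag fares that lie more than 1.5 * IQR outside the quartiles.
q1, q3 = toy["Fare"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (toy["Fare"] < lower) | (toy["Fare"] > upper)

# One option is to clip extreme fares to the bounds instead of dropping rows.
toy["Fare_clipped"] = toy["Fare"].clip(lower=lower, upper=upper)

print(missing_per_column)
print(toy[is_outlier])
```

Whether to clip, drop, or keep outliers depends on the model: tree-based models are usually less sensitive to extreme values than distance-based ones such as KNN.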

In [None]:
titanic = pd.read_csv("data/titanic.csv")

In [None]:
titanic.head()

In [None]:
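# Keep rows with a missing cabin by labelling the cabin as "Unknown"
titanic["Cabin"] = titanic["Cabin"].fillna("Unknown")
# Age is used as a feature later, so drop the rows where it is missing
titanic = titanic.dropna(subset=["Age"])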

In [None]:
titanic.head()

In [None]:
# Encode Sex as a number: male -> 0, female -> 1
titanic["Sex"] = titanic["Sex"].map({"male": 0, "female": 1})

## One-Hot Encoding

In [None]:
# Get all possible categories for the "Pclass" column
print(f"Possible values for Pclass: {titanic['Pclass'].unique()}")

# Use pandas to one-hot encode the Pclass column
dataset_with_one_hot = pd.get_dummies(titanic, columns=["Pclass"], drop_first=False)

# Add back in the old Pclass column, for learning purposes
dataset_with_one_hot["Pclass"] = titanic.Pclass

# Print out the first few rows
dataset_with_one_hot.head()

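The same principle applies to the cabin. One-hot encoding every distinct cabin number would create a large number of columns, so we reduce each cabin number to its deck (the first letter) before encoding.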

In [None]:
dataset_with_one_hot = pd.get_dummies(titanic, columns=["Pclass", "Cabin"], drop_first=False)

cabin_column_names = list(c for c in dataset_with_one_hot.columns if c.startswith("Cabin_"))

print(len(cabin_column_names), "cabins found")

In [None]:
# The deck is the first letter of the cabin number ("Unknown" becomes deck "U")
titanic["Deck"] = titanic["Cabin"].str[0]

print("Decks: ", sorted(titanic.Deck.unique()))

dataset_with_one_hot = pd.get_dummies(titanic, columns=["Pclass", "Deck"], drop_first=False)

deck_of_cabin_column_names = [c for c in dataset_with_one_hot.columns if c.startswith("Deck_")]



In [None]:
titanic.head()

## 2. Classification (Lecture 3 & 4)

Combine the exercise from Lecture 3 with exercise 2 from Lecture 4 into one, and construct some classification models to predict if a passenger would survive or not in the Titanic dataset. 

* a) You should have (1) decision tree, (2) random forest, and (3) KNN. You may also vary the configuration of each model type.
* b) You should do necessary data preprocessing (e.g., missing value fill-in, and data scaling if needed for a classifier). 
* c) You should also do cross-validation of your models.
* d) Plot the ROC with AUC for each model you implement.
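
The four points above can be sketched as follows. This is only an outline under stated assumptions: the data is a synthetic stand-in generated with `make_classification`, not the Titanic features, and the hyperparameters (`max_depth=3`, `n_neighbors=5`, 5 folds) are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the cleaned features (an assumption for this sketch).
X, y = make_classification(n_samples=400, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

models = {
    "decision tree": DecisionTreeClassifier(max_depth=3, random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    # KNN is distance-based, so the features are scaled inside a pipeline.
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
}

results = {}
for name, model in models.items():
    # 5-fold cross-validation on the training data only.
    cv_accuracy = cross_val_score(model, X_train, y_train, cv=5).mean()
    model.fit(X_train, y_train)
    # Probability of the positive class, which the ROC curve is built from.
    scores = model.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, scores)
    auc = roc_auc_score(y_test, scores)
    results[name] = {"cv_accuracy": cv_accuracy, "auc": auc, "fpr": fpr, "tpr": tpr}
    print(f"{name}: CV accuracy {cv_accuracy:.3f}, test AUC {auc:.3f}")
```

The stored `fpr` and `tpr` arrays can be passed to `plt.plot(fpr, tpr, label=...)` to draw one ROC curve per model on a shared axis.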


## 3. Regression (Lecture 6)

Train a multiple linear regression model, a random forest model, and an AdaBoost model on the “boston_housing_data.csv” dataset from Lectures 5 and 6. Remember to do a train-test split as well as any other necessary pre-processing of the dataset.
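
A minimal sketch of how the three regressors could be trained and compared. The data here is synthetic (generated with `make_regression`), not the housing dataset, and the use of R² on a 30% hold-out set is an assumption about how the models might be evaluated.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing features and target (an assumption).
X, y = make_regression(n_samples=300, n_features=6, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

regressors = {
    "linear regression": LinearRegression(),
    "random forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "adaboost": AdaBoostRegressor(n_estimators=100, random_state=0),
}

r2 = {}
for name, regressor in regressors.items():
    regressor.fit(X_train, y_train)
    # R^2 on the held-out test set; 1.0 would be a perfect fit.
    r2[name] = r2_score(y_test, regressor.predict(X_test))
    print(f"{name}: test R^2 = {r2[name]:.3f}")
```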

## 4. Clustering (Lecture 7 & 8)

Exercise 2 (both 2.1 and 2.2) from Lecture 7 and exercise 1 from Lecture 8.

## 5. Key-value stores (Lecture 9)

Exercise 1 from Lecture 9.

## 6. Deep learning (Lecture 10)

Train a deep neural network to predict whether a passenger would survive in the Titanic dataset. Remember to do a train-test split as well as any other necessary pre-processing of the dataset.
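
One pre-processing step that often helps neural networks is feature scaling, because columns such as Age and Fare have very different ranges. The sketch below standardizes each column using statistics from the training rows only. The helper names `fit_scaler` and `transform` are hypothetical and the rows are made up; this is an illustration, not part of the course library.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def fit_scaler(rows: List[List[float]]) -> Tuple[List[float], List[float]]:
    """Compute the per-column mean and standard deviation of the training rows."""
    columns = list(zip(*rows))
    means = [mean(column) for column in columns]
    # Fall back to 1.0 for constant columns to avoid dividing by zero.
    stds = [pstdev(column) or 1.0 for column in columns]
    return means, stds


def transform(rows: List[List[float]], means: List[float], stds: List[float]) -> List[List[float]]:
    """Standardize every row with the given statistics."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


# Made-up rows in the order [Age, Fare, Sex].
train_rows = [[22.0, 7.25, 0.0], [38.0, 71.28, 1.0], [26.0, 7.92, 1.0], [35.0, 53.10, 1.0]]
test_rows = [[30.0, 8.05, 0.0]]

means, stds = fit_scaler(train_rows)               # Fit on the training data only.
train_scaled = transform(train_rows, means, stds)
test_scaled = transform(test_rows, means, stds)    # Reuse the training statistics.
```

Fitting the scaler on the training rows and reusing it for the test rows keeps information from the test set out of training.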

## Feature selection

We use the feature importances of a random forest to find the features that have the most influence on survival.

In [None]:
titanic.head()

In [None]:
features = titanic.columns.drop(['Survived', 'Name', 'Ticket', 'Cabin', 'Deck', 'Embarked', 'PassengerId'])
features

In [None]:
X = titanic[features]
y = titanic['Survived']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Random forest with 100 trees; random_state is fixed for reproducibility
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
y_pred = forest.predict(X_test)

print("Accuracy of Random Forest: {}".format(metrics.accuracy_score(y_test, y_pred)))

In [None]:
def plot_feature_importances(model, features):
    """Draw a horizontal bar chart of the model's feature importances."""
    n_features = len(features)
    plt.barh(np.arange(n_features), model.feature_importances_, align='center')

    plt.yticks(np.arange(n_features), features)
    plt.xlabel("Feature importance")
    plt.ylabel("Feature")
    plt.ylim(-1, n_features)

In [None]:
plot_feature_importances(forest, features)

This tells us that the most important features for predicting survival are Fare, Sex and Age.

In [None]:
X = titanic[['Age', 'Fare', 'Sex']]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

In [None]:
X_train = X_train.values.tolist()
X_train

In [None]:
X_test = X_test.values.tolist()

In [None]:
y_train

In [None]:
y_train_oh = [dl.one_hot_encode(y, 2) for y in y_train]
y_train_oh

In [None]:
y_test_oh = [dl.one_hot_encode(y, 2) for y in y_test]
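
The label encoding used above turns each class index into a vector with a single 1 at that index. The function below is a hypothetical stand-in that shows the idea; it is an assumption about what `dl.one_hot_encode` does, not the library's own code.

```python
from typing import List


def one_hot(label: int, num_labels: int) -> List[float]:
    """Return a vector of length num_labels with 1.0 at position label."""
    return [1.0 if j == label else 0.0 for j in range(num_labels)]


labels = [0, 1, 1, 0]  # 0 = did not survive, 1 = survived
encoded = [one_hot(label, 2) for label in labels]
print(encoded)  # [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
```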

In [None]:
import random
random.seed(0)
    
# Name the dropout layers so we can switch their training mode on and off
dropout1 = dl.Dropout(0.1)
dropout2 = dl.Dropout(0.1)
dropout3 = dl.Dropout(0.1)
    
t_model = dl.Sequential([
    dl.Linear(3, 32),  # Hidden layer 1: size 32
    dropout1,
    dl.Tanh(),
    dl.Linear(32, 16),   # Hidden layer 2: size 16
    dropout2,
    dl.Tanh(),
    dl.Linear(16, 8),   # Hidden layer 3: size 8
    dropout3,
    dl.Tanh(),
    dl.Linear(8, 2)    # Output layer: size 2
])

In [None]:
from typing import List, Optional

from numpy import argmax
import tqdm


def loop(model: dl.Layer,
         inputs: List[dl.Tensor],
         labels: List[dl.Tensor],
         loss: dl.Loss,
         optimizer: Optional[dl.Optimizer] = None) -> None:
    """Run one pass over the data; update the weights only if an optimizer is given."""
    correct = 0         # Track number of correct predictions.
    total_loss = 0.0    # Track total loss.

    with tqdm.trange(len(inputs)) as t:
        for i in t:
            predicted = model.forward(inputs[i])             # Predict.
            if argmax(predicted) == argmax(labels[i]):       # Check for
                correct += 1                                 # correctness.
            total_loss += loss.loss(predicted, labels[i])    # Compute loss.

            # If we're training, backpropagate gradient and update weights.
            if optimizer is not None:
                gradient = loss.gradient(predicted, labels[i])
                model.backward(gradient)
                optimizer.step(model)

            # And update our metrics in the progress bar.
            avg_loss = total_loss / (i + 1)
            acc = correct / (i + 1)
            t.set_description(f"titanic loss: {avg_loss:.3f} acc: {acc:.3f}")

In [None]:
optimizer = dl.Momentum(learning_rate=0.01, momentum=0.99)
loss = dl.SoftmaxCrossEntropy()

# Enable dropout and train for one pass over the training data
dropout1.train = dropout2.train = dropout3.train = True
loop(t_model, X_train, y_train_oh, loss, optimizer)

In [None]:
# Disable dropout and evaluate
dropout1.train = dropout2.train = dropout3.train = False
loop(t_model, X_test, y_test_oh, loss)

In [None]:
y_pred = [t_model.forward(x) for x in X_test]
y_pred

In [None]:
dl.softmax(y_pred)

In [None]:
y_pred_binary = [1 if x[1] > 0.5 else 0 for x in dl.softmax(y_pred)]
y_pred_binary
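
The thresholding above relies on softmax turning the two raw outputs into probabilities that sum to 1. With two classes, "probability of class 1 above 0.5" is therefore the same decision as "class 1 has the larger raw output". A small pure-Python sketch of this; `softmax_1d` is a hypothetical stand-in, not the implementation in `scratch.deep_learning`, and the raw outputs are made up.

```python
import math
from typing import List


def softmax_1d(logits: List[float]) -> List[float]:
    """Convert raw outputs into probabilities that sum to 1."""
    largest = max(logits)  # Subtract the maximum for numerical stability.
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up raw network outputs for three passengers: [class 0, class 1].
raw_outputs = [[2.0, -1.0], [-0.5, 0.3], [0.1, 0.4]]

probabilities = [softmax_1d(row) for row in raw_outputs]
by_threshold = [1 if p[1] > 0.5 else 0 for p in probabilities]
by_argmax = [row.index(max(row)) for row in raw_outputs]

print(by_threshold)  # [0, 1, 1]
print(by_argmax)     # [0, 1, 1]
```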

In [None]:
print("Accuracy:", metrics.accuracy_score(y_test, y_pred_binary))
print("Recall:", metrics.recall_score(y_test, y_pred_binary))
print("Precision:", metrics.precision_score(y_test, y_pred_binary))
print("Confusion matrix:")
print(metrics.confusion_matrix(y_test, y_pred_binary))

In [None]:
# Lower the learning rate and train for 15 more passes over the training data
optimizer = dl.Momentum(learning_rate=0.001, momentum=0.99)

dropout1.train = dropout2.train = dropout3.train = True

for _ in range(15):
    loop(t_model, X_train, y_train_oh, loss, optimizer)

In [None]:
# Disable dropout on all three layers and evaluate
dropout1.train = dropout2.train = dropout3.train = False
loop(t_model, X_test, y_test_oh, loss)

In [None]:
y_pred = [t_model.forward(x) for x in X_test]
y_pred_binary = [1 if x[1] > 0.5 else 0 for x in dl.softmax(y_pred)]

In [None]:
print("Accuracy:", metrics.accuracy_score(y_test, y_pred_binary))
print("Recall:", metrics.recall_score(y_test, y_pred_binary))
print("Precision:", metrics.precision_score(y_test, y_pred_binary))
print("Confusion matrix:")
print(metrics.confusion_matrix(y_test, y_pred_binary))

## Keras

In [None]:
from tensorflow import keras
from tensorflow.keras import layers

In [None]:
keras_model = keras.Sequential(
    [
        keras.Input(shape=(3,)),  # 3 input features: Age, Fare, Sex
        layers.Dense(32, activation='tanh'),
        layers.Dropout(0.1),
        layers.Dense(16, activation='tanh'),
        layers.Dropout(0.1),
        layers.Dense(8, activation='tanh'),
        layers.Dropout(0.1),
        layers.Dense(2, activation="softmax")  # Here we specify that we want the last layer to have a softmax activation function
    ]
)

In [None]:
keras_model.summary()

In [None]:
keras_model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

In [None]:
X_train_keras = np.array(X_train)
X_train_keras.shape

In [None]:
y_train_keras = np.array(y_train_oh)
y_train_keras.shape

In [None]:
keras_model.fit(X_train_keras, y_train_keras, batch_size=1, epochs=16)

In [None]:
score = keras_model.evaluate(np.array(X_test), np.array(y_test_oh), verbose=0)
print("Test loss:", score[0])
print("Test accuracy:", score[1])

In [None]:
y_pred_keras = keras_model.predict(np.array(X_test))
y_pred_keras

In [None]:
y_pred_keras_binary = [1 if x[1] > 0.5 else 0 for x in y_pred_keras]
y_pred_keras_binary

In [None]:
print("Accuracy:", metrics.accuracy_score(y_test, y_pred_keras_binary))
print("Recall:", metrics.recall_score(y_test, y_pred_keras_binary))
print("Precision:", metrics.precision_score(y_test, y_pred_keras_binary))
print("Confusion matrix:")
print(metrics.confusion_matrix(y_test, y_pred_keras_binary))

## 7. MapReduce (Lecture 13)

All exercises from Lecture 13.

## 8. Time Series Analysis (Lecture 14 & 15)

Do a time series analysis of the Copenhagen ice cream dataset ("cph_ice_cream_searches.csv") from Lectures 14 and 15.

## 9. IoT (Lecture 17)

All exercises from Lecture 17.