# Final Exercise

This notebook contains the final exercise of the lecture:
You are handed an initial dataset with several features and a univariate target, and you have to decide how to proceed. The initial data is not sufficient to build a classifier that performs sensibly under any evaluation metric, so the first task is to acquire more data. For this purpose, you can obtain batches of data according to your own design of experiments, which means you need to decide which experiments you consider necessary to perform.

You will have four opportunities to acquire more data. Each time, you decide which experiments to run and send them to Franz Götz-Hahn as a CSV file. The deadlines are 16.06.2023, 23.06.2023, 30.06.2023, and 07.07.2023, each at 12:00 (noon). The format in all cases is a table with one row for each choosable feature, and the column entries corresponding to the desired values. Each individual sample takes approximately 30 minutes, so pick a reasonable number of experiments. For example, you will get the results for 100 experiments roughly 50 hours after the respective deadline. If an experiment cannot be conducted, e.g., because a feature value is out of range, you will get ``None`` as the result.
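The turnaround arithmetic can be sketched as follows. This is a minimal sketch that assumes samples are processed one after another at the stated 30 minutes each; the helper names `turnaround` and `max_batch_size` are made up for illustration.

```python
from datetime import timedelta

MINUTES_PER_SAMPLE = 30  # approximate lab time per sample, as stated above


def turnaround(n_experiments: int) -> timedelta:
    """Approximate waiting time after the deadline for a batch of this size."""
    return timedelta(minutes=MINUTES_PER_SAMPLE * n_experiments)


def max_batch_size(hours_available: float) -> int:
    """Largest batch whose results arrive within the given number of hours."""
    return int(hours_available * 60 // MINUTES_PER_SAMPLE)


print(turnaround(100))         # 2 days, 2:00:00 (i.e. 50 hours)
print(max_batch_size(7 * 24))  # 336 samples fit into the week between two deadlines
```

Under these assumptions, any batch larger than 336 samples would still be running when the next deadline arrives.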

Once you have your data, you should compare the performance of different classifiers in predicting the targets. The classifiers to compare are [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html), [SVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC), and [MLPClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html). You should use all the parts of the E2ML lecture that you consider appropriate. This could include Data Preprocessing, Design of Experiments for the batches, deciding on Performance Measures, Statistical Significance Testing of a hypothesis, and Design of Experiments for Hyperparameter Optimization.
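One way such a comparison could look is sketched below. It assumes synthetic data from `make_classification` as a stand-in for the mollusc data, accuracy as the performance measure, and a paired t-test on the per-fold scores as one possible significance test. The scaling pipelines for the SVC and the MLP are a choice made here, not something the exercise prescribes.

```python
from scipy.stats import ttest_rel
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the real data (an assumption for illustration only).
X, y = make_classification(n_samples=150, n_features=7, n_informative=4,
                           n_classes=3, random_state=0)

# Reusing the same folds for every model makes the fold scores paired.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
models = {
    "rfc": RandomForestClassifier(random_state=0),
    "svc": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    "mlp": make_pipeline(StandardScaler(),
                         MLPClassifier(max_iter=2000, random_state=0)),
}
scores = {name: cross_val_score(model, X, y, cv=cv, scoring="accuracy")
          for name, model in models.items()}

for name, s in scores.items():
    print(f"{name}: mean accuracy {s.mean():.3f} +/- {s.std():.3f}")

# Paired t-test on the per-fold scores of two of the models.
t_stat, p_value = ttest_rel(scores["rfc"], scores["svc"])
print(f"rfc vs. svc: t={t_stat:.3f}, p={p_value:.3f}")
```

Fold scores from cross-validation are not independent of each other, so the p-value of a plain paired t-test tends to be optimistic and is best read as a rough indication.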

Should you wish to present the results of this exercise in the oral examination, you need to hand in your entire package by 14.07.2023, 23:59, as a GitHub repository. Send the link to the (public) repository to Franz Götz-Hahn via [E-Mail](mailto:franz.goetz-hahn@uni-kassel.de). Please use the README of the repository to describe the structure of the package, include any required packages in the setup.py, add the data in the data subfolder, save any results in the results subfolder, and include a _descriptive_ jupyter notebook in the notebooks subfolder.

Note that the point of this exercise is **not** to achieve the best performance of your models, but rather to document your process and give the motivation behind your chosen approaches, _even the ones that failed_.

In [9]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import log_loss
from e2ml.experimentation import perform_bayesian_optimization
from e2ml.preprocessing import PrincipalComponentAnalysis
from sklearn.decomposition import PCA
#from e2ml import utils

### **Mollusc Classification** <a class="anchor" id="heart"></a>

Your dataset describes some physical measurements of a specific type of mollusc. Your goal is to predict the `Stage of Life` of the mollusc. The data you can get looks as follows:


| Sex | Length | Width | Height | Weight | Non_Shell Weight | Intestine Weight | Shell Weight | Stage of Life |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| {Male (M), Female (F), Indeterminate (I)} | float (inches) | float (inches) | float (inches) | float (gram) | float (gram) | float (gram) | float (gram) | {Child, Adolescent (Adole), Adult} |

The table headings are identical to the column names in the corresponding CSV files.

We can send out divers that look for molluscs that fit your needs, which will subsequently be analyzed in a laboratory. You can request molluscs with all features except the Stage of Life attribute, as it is the target. The first day of diving has already been completed. After 8 hours of diving, they brought up the following molluscs:

In [10]:
initial_molluscs_data = pd.read_csv('../data/initial_molluscs_data.csv')
initial_molluscs_data

Unnamed: 0,Sex,Length,Width,Height,Weight,Non_Shell Weight,Intestine Weight,Shell Weight,Stage of Life
0,F,0.45,0.345,0.12,0.4165,0.1655,0.095,0.135,Adult
1,F,0.475,0.38,0.145,0.57,0.167,0.118,0.187,Adole
2,M,0.61,0.485,0.17,1.0225,0.419,0.2405,0.36,Adult
3,I,0.43,0.34,0.105,0.4405,0.2385,0.0745,0.1075,Adole
4,M,0.205,0.155,0.045,0.0425,0.017,0.0055,0.0155,Adult
5,M,0.6,0.475,0.175,1.3445,0.549,0.2875,0.36,Child
6,I,0.515,0.39,0.11,0.531,0.2415,0.098,0.1615,Adult
7,F,0.625,0.495,0.16,1.1115,0.4495,0.2825,0.345,Child
8,F,0.65,0.52,0.195,1.6275,0.689,0.3905,0.432,Adult
9,F,0.62,0.48,0.165,1.043,0.4835,0.221,0.31,Adult
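Requests for further molluscs could, for example, be drawn with a space-filling design. The sketch below uses Latin hypercube sampling from `scipy.stats.qmc`, a technique chosen here for illustration and not prescribed by the exercise. The bounds are assumptions that roughly match the ranges of the ten rows above, and only three of the numerical features are covered to keep the example short.

```python
import pandas as pd
from scipy.stats import qmc

# Assumed bounds, roughly the observed ranges in the initial data.
bounds = {
    "Length": (0.2, 0.65),
    "Width": (0.15, 0.52),
    "Height": (0.045, 0.195),
}
n_requests = 8

# Latin hypercube: each feature range is split into n_requests strata,
# and every stratum is hit exactly once.
sampler = qmc.LatinHypercube(d=len(bounds))
unit_sample = sampler.random(n=n_requests)  # points in [0, 1)^d
lower = [lo for lo, _ in bounds.values()]
upper = [hi for _, hi in bounds.values()]
scaled = qmc.scale(unit_sample, lower, upper)

request = pd.DataFrame(scaled, columns=list(bounds))
# Cycle through the three categories so that each one is requested.
request.insert(0, "Sex", [["M", "F", "I"][i % 3] for i in range(n_requests)])
print(request.round(3))
```

Independent sampling per feature ignores the strong correlation between length, width, and height, so some of these requests may describe molluscs that do not exist and come back as ``None``.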


In [15]:
adults = initial_molluscs_data.loc[initial_molluscs_data["Stage of Life"] == "Adult"]
#print(adults)
adoles = initial_molluscs_data.loc[initial_molluscs_data["Stage of Life"] == "Adole"]
#print(adoles)
children = initial_molluscs_data.loc[initial_molluscs_data["Stage of Life"] == "Child"]
#print(children)
# want 282 samples (5 days, 21 hours)
full_length = initial_molluscs_data["Length"]
full_width = initial_molluscs_data["Width"]
full_height = initial_molluscs_data["Height"]
full_weight = initial_molluscs_data["Weight"]
full_non_shell_weight = initial_molluscs_data["Non_Shell Weight"]
full_intestine_weight = initial_molluscs_data["Intestine Weight"]
full_shell_weight = initial_molluscs_data["Shell Weight"]
full_sexes = initial_molluscs_data["Sex"]
full_stages = initial_molluscs_data["Stage of Life"]

volume = full_length * full_width * full_height
#print(volume)
weight_volume_quotients = full_weight / volume 
non_shell_quotient = full_non_shell_weight / full_weight
intestine_quotient = full_intestine_weight / full_weight
shell_quotient = full_shell_weight / full_weight

def getOneHotEncoding(data):
    values = np.sort(np.unique(data))
    enc = np.zeros((len(data), len(values)))
    for i, x in enumerate(data):
        enc[i, np.where(values == x)[0][0]] = 1
    return enc

full_replaced = initial_molluscs_data.replace("Adult",0).replace("Adole",1).replace("Child",2).replace("F",0).replace("I",1).replace("M",2)
pca = PCA(3)
pca = pca.fit(full_replaced)
print(pca.explained_variance_ratio_)
print(pca.singular_values_)


x = np.concatenate((getOneHotEncoding(full_sexes), initial_molluscs_data.values[:,1:-1]), axis=1)
y = getOneHotEncoding(full_stages)

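def getNewSamples(old_data: pd.Series, size: int):
    # Draw `size` values from a normal distribution fitted to the observed column.
    return np.random.normal(old_data.mean(), scale=old_data.std(), size=size)


def score_cross_entropy_loss(mdl, x, y):
    # Scorer for cross_val_score: negated log-loss, so that higher is better.
    # Note: predict_proba already returns probabilities; the extra softmax
    # flattens them, so the scores are not plain log-loss values.
    y_pred = softmax(mdl.predict_proba(x))
    return -log_loss(y, y_pred)

def cross_entropy_loss(y_true, y_pred):
    # Per-sample cross-entropy for integer labels (1-D) or one-hot labels (2-D).
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    if y_true.ndim == 1:
        # Pick the predicted probability of the true class in each row.
        return -np.log(y_pred[np.arange(len(y_true)), y_true])
    elif y_true.ndim == 2:
        return -(y_true * np.log(y_pred)).sum(axis=1)
    raise ValueError("y_true must be 1-D or 2-D")

def softmax(x):
    # Row-wise softmax.
    x = np.array(x)
    return np.exp(x) / np.exp(x).sum(axis=1, keepdims=True)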

print("mlp")
# Hold-out check: train on the first ten samples, predict the remaining ones.
mlp = MLPClassifier(max_iter=1000)
mlp.fit(x[:10], y[:10])
y_pred = mlp.predict_proba(x[10:])
y_pred = softmax(y_pred)
# 3-fold cross-validation with a fresh model.
mlp = MLPClassifier(max_iter=1000)
cvs_mlp = cross_val_score(mlp, x, y, cv=3, scoring=score_cross_entropy_loss)
print(cvs_mlp)

# Integer-encoded targets for the SVC and the random forest.
y_rfc = full_stages.replace("Adult",0).replace("Adole",1).replace("Child",2)

print("svc")
svc = SVC(kernel="rbf", probability=True)
svc.fit(x[:10], full_stages[:10])
y_pred = softmax(svc.predict_proba(x[10:]))
svc = SVC(kernel="rbf", probability=True)
cvs_svc = cross_val_score(svc, x, y_rfc, cv=3, scoring=score_cross_entropy_loss)
print(cvs_svc)

print("rfc")

rfc = RandomForestClassifier()
rfc.fit(x[:10], y_rfc[:10])
y_pred = softmax(rfc.predict_proba(x[10:]))
rfc = RandomForestClassifier()
cvs_rfc = cross_val_score(rfc, x, y_rfc, cv=3, scoring=score_cross_entropy_loss)
print(f"{cvs_rfc=}")

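def objectiveFunction(x, y):
    # Mean cross-validated score over all three models (higher is better).
    # x: feature matrix, y: one-hot targets used by the MLP.
    # The integer targets y_rfc for the random forest and the SVC are taken
    # from the enclosing scope.
    rfc = RandomForestClassifier()
    cvs_rfc = cross_val_score(rfc, x, y_rfc, cv=3, scoring=score_cross_entropy_loss)
    svc = SVC(kernel="rbf", probability=True)
    cvs_svc = cross_val_score(svc, x, y_rfc, cv=3, scoring=score_cross_entropy_loss)
    mlp = MLPClassifier(max_iter=1000)
    cvs_mlp = cross_val_score(mlp, x, y, cv=3, scoring=score_cross_entropy_loss)
    return (cvs_rfc.mean() + cvs_svc.mean() + cvs_mlp.mean()) / 3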




0     0.0210
1     0.0980
2     0.0030
3     0.0200
4     0.0045
5     0.1480
6     0.0300
7     0.0345
8     0.1160
9     0.0285
10    0.0545
11    0.0180
12    0.1795
13    0.0575
14    0.0610
15    0.0710
dtype: float64
[0.46156403 0.39054503 0.14394688]
[3.54394245 3.25991475 1.97911857]
mlp




[-1.06500189 -1.35091072 -1.02324363]
svc
[-1.11897823 -1.02954067 -1.10863598]
rfc
cvs_rfc=array([-0.94716295, -1.16378263, -1.03031771])
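The scores above are affected by the softmax inside `score_cross_entropy_loss`. For a probability row over three classes, the softmax caps the largest entry at about 0.576, so the per-sample loss can only range from roughly 0.55 to 1.55 regardless of how confident the model is. The sketch below illustrates this and shows scikit-learn's built-in `"neg_log_loss"` scorer as one alternative; the synthetic data is an assumption for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Softmax applied to an already perfect probability row.
p = np.array([[1.0, 0.0, 0.0]])
squashed = np.exp(p) / np.exp(p).sum(axis=1, keepdims=True)
print(squashed.round(3))                 # [[0.576 0.212 0.212]]
print(round(-np.log(squashed[0, 0]), 3))  # 0.551, the best reachable loss

# Built-in scorer: negated log-loss computed directly on predict_proba.
X, y = make_classification(n_samples=120, n_features=7, n_informative=4,
                           n_classes=3, random_state=0)
rfc = RandomForestClassifier(random_state=0)
scores = cross_val_score(rfc, X, y, cv=3, scoring="neg_log_loss")
print(scores)  # three non-positive values; closer to 0 is better
```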


In [44]:
def getDict(size:int):
    # Build `size` synthetic mollusc requests from the statistics of the observed data.
    d = {}
    d["Sex"] = np.random.choice(full_sexes, size)
    # Dimensions are sampled independently of each other.
    d["Length"] = getNewSamples(full_length, size)
    d["Width"] = getNewSamples(full_width, size)
    d["Height"] =  getNewSamples(full_height, size)
    # Weight is the volume times a density factor. Each factor below is drawn
    # once per call, so all samples of a batch share the same ratio.
    d["Weight"] = d["Height"] * d["Width"] * d["Length"] * np.random.normal(weight_volume_quotients.mean(), weight_volume_quotients.std())
    d["Non_Shell Weight"] = d["Weight"] * np.random.normal(non_shell_quotient.mean(), non_shell_quotient.std())
    d["Intestine Weight"] = d["Weight"] * np.random.normal(intestine_quotient.mean(), intestine_quotient.std())
    d["Shell Weight"] = d["Weight"] * np.random.normal(shell_quotient.mean(), shell_quotient.std())
    return d
d = getDict(282)
print(f"{d['Weight'].mean()=} {d['Weight'].std()=}")
print(f"{full_weight.mean()=} {full_weight.std()=}")
print(f"{d['Non_Shell Weight'].mean()=} {d['Non_Shell Weight'].std()=}")
print(f"{full_non_shell_weight.mean()=} {full_non_shell_weight.std()=}")
print(f"{d['Intestine Weight'].mean()=} {d['Intestine Weight'].std()=}")
print(f"{full_intestine_weight.mean()=} {full_intestine_weight.std()=}")
print(f"{d['Shell Weight'].mean()=} {d['Shell Weight'].std()=}")
print(f"{full_shell_weight.mean()=} {full_shell_weight.std()=}")
# Weight not accounted for by the three partial weights, observed vs. generated.
original_diffs = full_weight - full_intestine_weight - full_non_shell_weight - full_shell_weight
new_diffs = d["Weight"] - d["Intestine Weight"] - d["Non_Shell Weight"] - d["Shell Weight"]
print(original_diffs)
print(new_diffs)
print(f"{original_diffs.mean()=} {original_diffs.std()=}")
print(f"{new_diffs.mean()=} {new_diffs.std()=}")
print(pd.DataFrame(d))

d['Weight'].mean()=0.6869174780664752 d['Weight'].std()=0.3257528636680623
full_weight.mean()=0.877875 full_weight.std()=0.4965779730649894
d['Non_Shell Weight'].mean()=0.279118848396181 d['Non_Shell Weight'].std()=0.13236490127565032
full_non_shell_weight.mean()=0.37634375 full_non_shell_weight.std()=0.22438981845217487
d['Intestine Weight'].mean()=0.13682339585740771 d['Intestine Weight'].std()=0.06488495989765193
full_intestine_weight.mean()=0.18753124999999998 full_intestine_weight.std()=0.10964373652121372
d['Shell Weight'].mean()=0.21586059137792832 d['Shell Weight'].std()=0.1023663075110119
full_shell_weight.mean()=0.25493750000000004 full_shell_weight.std()=0.13599557284460892
0     0.0210
1     0.0980
2     0.0030
3     0.0200
4     0.0045
5     0.1480
6     0.0300
7     0.0345
8     0.1160
9     0.0285
10    0.0545
11    0.0180
12    0.1795
13    0.0575
14    0.0610
15    0.0710
dtype: float64
[0.02434055 0.09934641 0.01592612 0.01760587 0.08040419 0.05707693
 0.0229674  0.08

In [45]:
print(pd.read_csv("FirstBatchBen.csv"))