## Exercise 1: Diving into Classification

In this notebook, we dive into a practical application of machine learning. For this we rely on the `scikit-learn` (`sklearn` for short) library, as well as pandas for data handling. The setup is very simple. You are invited to a party and have promised to help with the preparations. While decorating the buffet with snacks, you mix a bowl of peanuts, walnuts and chocolate-covered candy. Once you are done, the hosts inform you that they are expecting guests who are allergic to peanuts. You are now tasked with filtering the peanuts out of the bowl of snacks. The dataset `peanuts.csv`, loaded below, contains the measurements taken on the bowl of dried snacks. Let's use classification to automate the task.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.model_selection import train_test_split

In [None]:
# the data was obtained from https://zenodo.org/records/10014609
df = pd.read_csv("peanuts.csv")

In [None]:
# let's inspect the data
print(df.shape, "\n", df.dtypes)

(100, 5) 
 color      int64
shape      int64
height     int64
width      int64
label     object
dtype: object


In the output above, we see that most columns are detected as `int64`, which refers to 64-bit integer numbers. The `label` column, however, is recognized as `object`, the generic dtype that pandas typically uses for strings. To convert it into something more meaningful, we need to help pandas a bit.

In [None]:
df.label = df.label.astype("category")
df.dtypes

color        int64
shape        int64
height       int64
width        int64
label     category
dtype: object
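
To see what the conversion to `category` actually does, here is a minimal sketch on a made-up toy column (the values below are invented for illustration and are not taken from `peanuts.csv`). It shows that pandas keeps the unique label values once and stores a small integer code per row.

```python
import pandas as pd

# Made-up toy labels for illustration only (not from peanuts.csv)
toy = pd.DataFrame({"label": ["peanut", "walnut", "peanut", "candy"]})

# Convert the string column to a categorical column
toy["label"] = toy["label"].astype("category")

# The unique values are stored once, in sorted order ...
categories = toy["label"].cat.categories.tolist()
# ... and every row only holds an integer code pointing into that list
codes = toy["label"].cat.codes.tolist()

print(categories)  # ['candy', 'peanut', 'walnut']
print(codes)       # [1, 2, 1, 0]
```

The `.cat` accessor is only available on categorical columns, which is one practical reason to do the conversion early.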

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype   
---  ------  --------------  -----   
 0   color   100 non-null    int64   
 1   shape   100 non-null    int64   
 2   height  100 non-null    int64   
 3   width   100 non-null    int64   
 4   label   100 non-null    category
dtypes: category(1), int64(4)
memory usage: 3.5 KB


## Prepare data for training

In [None]:
X, y = df[["color","shape", "height","width"]].to_numpy(), df["label"]
print("input data X is available as:", X.shape, X.dtype)
print("label data y is available as:", y.shape, y.dtype)

input data X is available as: (100, 4) int64
label data y is available as: (100,) category


In [None]:
# train_test_split shuffles the data and holds out 20 % of the samples as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
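
The split above is random, so every run yields a different test set. As a hedged sketch, the following shows two optional parameters of `train_test_split` that are often useful: `random_state` makes the split reproducible, and `stratify` keeps the class proportions the same in both parts. The data here is made up (the names `X_demo` and `y_demo` and the class counts are assumptions for illustration, not properties of `peanuts.csv`).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up stand-in data: 100 samples, 4 integer features, 3 imbalanced classes
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 10, size=(100, 4))
y_demo = np.array(["peanut"] * 50 + ["walnut"] * 30 + ["candy"] * 20)

# random_state fixes the shuffling; stratify preserves the class ratios
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Count how often each class appears in the test set
labels, counts = np.unique(y_te, return_counts=True)
test_counts = {str(k): int(v) for k, v in zip(labels, counts)}

print(X_tr.shape, X_te.shape)  # (80, 4) (20, 4)
print(test_counts)             # {'candy': 4, 'peanut': 10, 'walnut': 6}
```

Whether stratification is needed for the snack data depends on how balanced its labels are, which `df["label"].value_counts()` would reveal.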

In [None]:
# inspect the data
X_test, y_test

## Create the Tree and fit it

In [None]:
# see the docs for details: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier
classifier_tree = DecisionTreeClassifier() 
classifier_tree = classifier_tree.fit(X_train, y_train)
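
With default settings, `DecisionTreeClassifier` keeps splitting until the leaves are pure, which can produce a deep tree that memorizes the training data. The sketch below, on made-up data with a made-up labeling rule (both are assumptions for illustration), compares an unrestricted tree with one limited by `max_depth`.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Made-up data: the label depends on the sum of the first two features
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 10, size=(100, 4))
y_demo = np.where(X_demo[:, 0] + X_demo[:, 1] > 9, "peanut", "other")

# Unrestricted tree versus a tree limited to two levels of splits
full_tree = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)
small_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_demo, y_demo)

print("depth:", full_tree.get_depth(), "vs", small_tree.get_depth())
print("training accuracy:", full_tree.score(X_demo, y_demo),
      "vs", small_tree.score(X_demo, y_demo))
print("classes:", small_tree.classes_)
```

The unrestricted tree fits the training data perfectly here, so its training accuracy says little about how well it generalizes. That is why the evaluation further down uses the held-out test set.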

## Plot the fitted Tree

In [None]:
plt.figure(figsize=(12, 12))  # enlarge the figure; at the default size the tree is hard to read
plot_tree(classifier_tree)
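
By default, `plot_tree` labels the splits with generic feature indices. A minimal sketch of a more readable plot follows, using the optional `feature_names`, `class_names` and `filled` parameters. The data and the labeling rule are made up for illustration; only the column names are borrowed from the dataset above.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Made-up data: the label depends only on the third feature ("height")
rng = np.random.default_rng(0)
feature_names = ["color", "shape", "height", "width"]
X_demo = rng.integers(0, 10, size=(60, 4))
y_demo = np.where(X_demo[:, 2] > 4, "peanut", "other")

demo_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_demo, y_demo)

fig, ax = plt.subplots(figsize=(8, 6))
annotations = plot_tree(
    demo_tree,
    feature_names=feature_names,                          # names instead of x[0], x[1], ...
    class_names=[str(c) for c in demo_tree.classes_],     # majority class per node
    filled=True,                                          # color nodes by majority class
    ax=ax,
)
```

For the tree fitted in this notebook, the same keyword arguments could be passed together with `classifier_tree`.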

## Calculate the Performance Metric

In [None]:
y_predict = classifier_tree.predict(X_test)
acc = 100*np.mean((y_predict==y_test))

print(f"accuracy is {acc:2.2f} %")
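
The accuracy computed above is simply the fraction of test samples whose prediction matches the true label. As a small sketch on made-up labels (invented for illustration), the following shows that `sklearn.metrics.accuracy_score` gives the same number as the manual `np.mean` comparison.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up true and predicted labels for illustration only
y_true_demo = np.array(["peanut", "other", "peanut", "other", "peanut"])
y_pred_demo = np.array(["peanut", "other", "other", "other", "peanut"])

# Manual computation: mean of the element-wise comparison (4 of 5 match)
acc_manual = 100 * np.mean(y_pred_demo == y_true_demo)
# Library computation: accuracy_score(y_true, y_pred) returns a fraction in [0, 1]
acc_sklearn = 100 * accuracy_score(y_true_demo, y_pred_demo)

print(f"manual: {acc_manual:.2f} %, sklearn: {acc_sklearn:.2f} %")
```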

In [None]:
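# one possible completion; rows are true labels, columns are predicted labels
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
cm = confusion_matrix(y_test, y_predict, labels=classifier_tree.classes_)

disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=classifier_tree.classes_)
disp.plot()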