# LA_1650

This notebook accompanies `PR_1650` and demonstrates the rough flow of the individual steps of an ML *pipeline*. You don't need to understand every step in full, but you should have an idea of roughly what happens in each one.

In [None]:
!pip install scikit-learn

In [None]:
from sklearn.datasets import load_iris

# the as_frame parameter ensures we're
# getting the data as a pandas DataFrame
iris = load_iris(as_frame=True)

In [None]:
# you can't do this for every data set:
# sometimes you have to define yourself
# what the features are and what their
# names should be. But in this case, we're
# lucky, since this was already done for us.
# Check the documentation for more information
# about the dataset and try out some of the
# commands listed here:
# https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html
iris.feature_names
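
Beyond `feature_names`, the object returned by `load_iris` carries a few more attributes worth a look. The following is a small sketch of some of them; the variable name `class_counts` is only chosen for this illustration.

```python
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)

# the class labels 0, 1 and 2 correspond to these names
print(list(iris.target_names))

# iris.frame holds the features and the target together in one DataFrame
print(iris.frame.shape)
print(iris.frame.head())

# how many flowers of each class are in the data set?
class_counts = iris.target.value_counts().sort_index()
print(class_counts)
```

The class counts show that the data set is balanced, which is worth checking before splitting any data set.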

In [None]:
# this, too, you'd normally have to define yourself
# but again, it's already done for us.
X = iris.data
y = iris.target

# so our X data has sepal length,
# sepal width, petal length, petal width
# and our y data is 0 for setosa, 
# 1 for versicolor and 2 for virginica
X[:10], y[:150]

In [None]:
# change the random_state value and re-run this block to see how
# it changes the way the data is split (check with X_train[:10]);
# with a fixed value, the split is identical on every run

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

len(X_train), len(X_test)
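
To make the effect of `random_state` visible without re-running the cell by hand, the splits can be compared directly. This is only a sketch: the seed `7` and the variable names are arbitrary choices for the illustration, and `stratify=y` is an optional extra that the cell above does not use.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris(as_frame=True)
X, y = iris.data, iris.target

# same random_state: the same rows end up in the training set
a_train, a_test, _, _ = train_test_split(X, y, test_size=0.3, random_state=42)
b_train, b_test, _, _ = train_test_split(X, y, test_size=0.3, random_state=42)
print(a_train.index.equals(b_train.index))

# different random_state: a different selection of rows
c_train, c_test, _, _ = train_test_split(X, y, test_size=0.3, random_state=7)
print(a_train.index.equals(c_train.index))

# stratify=y keeps the class proportions the same in both parts
_, _, s_train_y, s_test_y = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
print(s_test_y.value_counts().sort_index().tolist())
```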

In [None]:
# as we saw above, the sepal length is always much
# larger than the petal width - so let's normalise
# our data. In this step we'd also do other conversions,
# such as mapping our target data to
# numbers instead of using strings, but as we saw
# above, this was already done for us, too.

# check out the docs to see all the preprocessing possibilities: 
# https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing
# note that the 'scale' function shown in the slides works
# a bit differently from what we have encountered in this module so
# far; so in the example below, we're using the MinMaxScaler instead,
# which maps each feature of the training data to [0, 1]

from sklearn import preprocessing
scaler = preprocessing.MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

X_train, X_train_scaled
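
What the `MinMaxScaler` does can be reproduced by hand: for every column it computes `(x - min) / (max - min)`, using the minimum and maximum of the data it was fitted on. The sketch below checks this on a tiny made-up array (the numbers are invented for the illustration, not taken from the iris data).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# tiny made-up "training" and "test" data with two columns
train = np.array([[4.0, 1.0], [6.0, 3.0], [8.0, 2.0]])
test = np.array([[5.0, 4.0]])

scaler = MinMaxScaler().fit(train)

# manual version: (x - min) / (max - min) per column,
# always with the min and max of the training data
col_min = train.min(axis=0)
col_max = train.max(axis=0)
manual_train = (train - col_min) / (col_max - col_min)
manual_test = (test - col_min) / (col_max - col_min)

print(np.allclose(manual_train, scaler.transform(train)))
print(manual_test)
```

The second value of the test row comes out above 1, because 4.0 lies outside the range seen during fitting. This is why the scaler is fitted on the training data only and then merely applied to the test data.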

In [None]:
# did you notice that during the conversion
# our data was changed from a pandas DataFrame
# into a plain NumPy array? Compare 'type(X_train)'
# with 'type(X_train_scaled)'.
# It's not a big problem,
# but we lose the column headers. If we want to
# get them back, we could do:

import pandas as pd
X_train_pandified = pd.DataFrame(X_train_scaled, columns=X_train.columns)
X_train_pandified

# Training the model
👏 Now comes the fun part: we can finally train our first model!

🤔 But which model do we use?

Since we want to predict a *category* (is a new flower a *setosa*, a *versicolor* ...), the [decision tree](https://scikit-learn.org/stable/modules/tree.html#tree) is a natural choice among the algorithms we have covered. The question of which of the [many algorithms](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html) is best is actually a very difficult one, though: the answer depends on the problem and the data, and finding it takes a lot of experience. One thing that does make life easier: all algorithms in `sklearn` are used in the same way, through the `.fit()` method:

In [None]:
from sklearn import tree
DT = tree.DecisionTreeClassifier()
DT = DT.fit(X_train_scaled, y_train)
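
Because the interface is uniform, swapping in a different algorithm changes only one line. The following sketch trains a decision tree and, purely as a second example that is not part of this module, a k-nearest-neighbours classifier with the same `fit` and `predict` calls. Scaling is left out here to keep the example short.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# two different algorithms, one common interface
models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "k-nearest neighbours": KNeighborsClassifier(n_neighbors=5),
}

predictions = {}
for name, model in models.items():
    model.fit(X_train, y_train)                  # training is always .fit()
    predictions[name] = model.predict(X_test)    # predicting is always .predict()
    print(name, predictions[name][:10])
```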

In [None]:
# one of the great advantages of the decision tree is that it is 
# very easy to understand and see what's happening under the hood
# we can even print out the actual tree we just trained!
import matplotlib.pyplot as plt
plt.figure(figsize=(20,20))
tree.plot_tree(DT, filled=True)
plt.show()

# as you can see, at every node it asks whether the value
# in a specific column is at most some threshold, to
# find a good split for our training data.
# By the way, the colors show which class
# a node predicts!
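
If the plot is hard to read, the same tree can also be printed as text. This is a sketch using `tree.export_text`; the depth is limited to 2 here only to keep the output short, so the tree differs from the one trained above, and the unscaled data is used so the thresholds are in centimetres.

```python
from sklearn import tree
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)

# a deliberately shallow tree, just for a compact printout
small_tree = tree.DecisionTreeClassifier(max_depth=2, random_state=0)
small_tree.fit(iris.data, iris.target)

# passing the feature names makes the rules readable
rules = tree.export_text(small_tree, feature_names=list(iris.feature_names))
print(rules)
```

Each indented line is one question the tree asks, and each `class:` line is a leaf with the predicted class.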

In [None]:
# now, let's save our hard work!
import joblib
joblib.dump(DT, 'iris_tree.joblib')
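
A saved model is only useful if it can be loaded again, and a tree that was trained on scaled data also needs its scaler. The sketch below stores both in one file and restores them; the dictionary layout and the file name `iris_model.joblib` are just assumptions made for this example, and a temporary directory is used so nothing is left behind.

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scaler = MinMaxScaler().fit(X)
DT = DecisionTreeClassifier(random_state=0).fit(scaler.transform(X), y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "iris_model.joblib")
    # store scaler and tree together, since the tree
    # expects input that was scaled by exactly this scaler
    joblib.dump({"scaler": scaler, "model": DT}, path)
    restored = joblib.load(path)

# the first flower of the data set, a setosa
sample = [[5.1, 3.5, 1.4, 0.2]]
restored_prediction = restored["model"].predict(restored["scaler"].transform(sample))
print(restored_prediction)
```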

In [None]:
# and finally, if we were out in the field and saw a new
# flower, we could now ask our model what type of iris
# it thinks it is:

# the only problem is that the data we trained our model on
# was scaled, so we also have to scale our flower the
# same way we scaled the training data

new_flower = [[5.1, 3.5, 12, 12]]
new_flower_scaled = scaler.transform(pd.DataFrame(new_flower, columns=X_train.columns))

# remember that 0, 1 and 2 are the classes - you can see
# in the tree above that a very small value in the fourth column (X[3])
# always means the flower will be of type 0 (setosa), so play around
# with the values a bit

DT.predict(new_flower_scaled)
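
Forgetting to scale a new flower is an easy mistake to make. One way to rule it out is to bundle scaler and model into a single object with `make_pipeline`. This is a sketch of that idea and not the approach used in the cells above; the measurements of the new flower are made up to resemble a typical setosa.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

iris = load_iris(as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)

# fit() fits the scaler and then the tree; predict() scales
# the input with the fitted scaler before asking the tree
pipeline = make_pipeline(MinMaxScaler(), DecisionTreeClassifier(random_state=0))
pipeline.fit(X_train, y_train)

# unscaled measurements, exactly as we would take them in the field
new_flower = pd.DataFrame([[5.1, 3.5, 1.4, 0.2]], columns=iris.feature_names)
predicted_class = pipeline.predict(new_flower)[0]
print(predicted_class, iris.target_names[predicted_class])
```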

In [None]:
# now since we're not biologists, we can only guess if our model
# makes accurate predictions. But, since we have our test-set, we
# don't have to guess - we can put our model to the test:

# Normally, we'd do this step right after training the model, but in this
# demonstration, we're doing it at the end to keep the suspense up! 😬

for features, label in zip(X_test_scaled, y_test):
    # zip is a really nifty little helper
    # that allows us to go through two lists
    # simultaneously; we use new names here so that
    # the loop doesn't overwrite X and y from above
    print(DT.predict([features]), label)
    
# Looking pretty good! 😁
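
Comparing predictions by eye works for a handful of flowers, but a single number is easier to report. The sketch below repeats the steps of this notebook in compact form and then summarises the result with `accuracy_score` and `confusion_matrix` from `sklearn.metrics`. The `random_state=0` for the tree is an addition made here for reproducibility.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

DT = DecisionTreeClassifier(random_state=0).fit(X_train_scaled, y_train)
y_pred = DT.predict(X_test_scaled)

# share of test flowers that were classified correctly
accuracy = accuracy_score(y_test, y_pred)
# rows are the true classes, columns the predicted classes
cm = confusion_matrix(y_test, y_pred)

print(f"accuracy: {accuracy:.2f}")
print(cm)
```

Correct predictions sit on the diagonal of the confusion matrix, so any number off the diagonal shows which two classes the model mixes up.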