# &#127794; Digital Champion - Hands-On Python Session

## &#128210; Table of Contents:
* [1 Introduction](#first-bullet)
* [2 Decision Trees Theory & Task Description](#second-bullet)
* [3 Data Preparation](#third-bullet)
* [4 Simple Decision Tree](#fourth-bullet)
* [5 Optimized Decision Tree](#fifth-bullet)
* [6 Conclusion](#sixth-bullet)
* [7 Appendix](#seventh-bullet)
* [8 Solutions](#eight-bullet)
* [9 Sources](#nineth-bullet)

## 1 Introduction (5 min) <a class="anchor" name="first-bullet"></a>

Jupyter Notebooks are interactive development environments that allow for the combination of code, visualizations, and text in a single environment. They are frequently used for data analysis, code sharing, and documentation. With Jupyter Notebooks, Python code cells can be executed and the results displayed immediately. In this Jupyter Notebook, we will look at ```Decision Trees```.

![alt text for screen readers](./pictures/jupyter_intro_google.png "Introduction to Jupyter Notebook").

### Explanations:
* &#10133; **Code:** With this symbol, we can insert a new code cell into the notebook. Code cells are used to enter executable code. Programming languages such as Python, R, Julia, and others can be used in a code cell.
* &#10133; **Text:** With this symbol, we can insert a new text cell into the notebook. Text cells are used for entering and formatting text. They allow for the insertion of text, headings, lists, images, and other formatting elements.
* 🗑️ **Delete:** With this symbol, we can delete a cell and remove its content.
* &#9654; **Run:** With this symbol, we can execute the code in a cell.
* ```Runtime``` -> ```Run All```: This runs all cells. The kernel executes the code and returns the results to the notebook. If the kernel is no longer connected, we need to restart it and run all cells again.

Important: The following two cells must be executed, because the rest of the notebook depends on them.

In [None]:
# Loads our dataset from the LearningFriday GitHub repository
!git clone https://github.com/LearningFridayPost3/dc-jupyter-notebook.git

In [None]:
# Changes the working directory (folder where we search for data) so that data can be read.
%cd dc-jupyter-notebook

---


&#128712; **INFO:**

- We read through the notebook and complete the tasks directly in this notebook.
    
- We can access the solutions using the link provided under the tasks.
    
- Please write questions directly in the Teams chat.

- More complex questions will be addressed in breakout sessions.


---

## 2 Decision Trees: Theory & Task Description (8 min) <a class="anchor" name="second-bullet"></a>

### 2.1 What is a decision tree?

Decision trees are a method for automatically classifying data objects (e.g., people or objects such as packages). A decision tree always consists of a root node and any number of internal nodes (split nodes) as well as at least two leaves (leaf nodes). Each node represents a logical rule, and each leaf represents a class. Below is an example of a decision tree:

![alt text for screen readers](./pictures/dt-example-new.png "Example decision tree").

We have a dataset with many individuals. For each person, we know their income and whether they have a mortgage. The target variable is whether a particular person has insurance (class: ```Has Insurance```) or not (class: ```No Insurance```). This means that with the depicted decision tree, we want to determine whether a person has insurance using the information from ```income_usd``` and ```with_mortgage```.

The following information is included in the decision tree (figure above):

* **gini:** The Gini index (Gini impurity) describes how mixed the classes (e.g., ```No Insurance```, ```Has Insurance```) are within a node. With two classes, the value lies between 0 (all samples in the node belong to one class) and 0.5 (both classes are equally frequent). The smaller the Gini index, the better. When constructing the decision tree, candidate rules are compared at each node, and the rule whose resulting child nodes have the smallest sample-weighted Gini index is chosen.
* **samples:** This value describes the number of observations (e.g., data from individuals) available for splitting a specific node. For example, we see that data from 24 individuals were used to construct this tree. Furthermore, we see that the first node splits the 24 individuals into one group of 13 and another group of 11.
* **value:** Value describes how the ```samples``` are distributed in the node. The value ```[15, 9]``` in the first node, for example, indicates that out of 24 individuals, 15 do not have insurance (class: ```No Insurance```) and 9 have insurance (class: ```Has Insurance```).
* **class:** This value shows the class assigned to a specific node. Example: In the first node, we see the class ```No Insurance``` because more of the 24 individuals belong to ```No Insurance``` (15) than to ```Has Insurance``` (9). We can also infer the class from the colors. The redder a node is, the more likely it belongs to the class ```No Insurance```, and the bluer a node is, the more likely it belongs to the class ```Has Insurance```.
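
To make the ```gini``` value more concrete, the following minimal sketch recomputes it from the ```value``` counts of a node. It uses the standard Gini impurity formula (one minus the sum of the squared class shares); the helper name ```gini_impurity``` is our own and not part of any library.

```python
def gini_impurity(class_counts):
    """Return the Gini impurity of a node, given the number of samples per class."""
    total = sum(class_counts)
    if total == 0:
        raise ValueError("class_counts must contain at least one sample")
    # 1 minus the sum of the squared class shares
    return 1.0 - sum((count / total) ** 2 for count in class_counts)


# Root node from the text: 15 x 'No Insurance', 9 x 'Has Insurance'
root_gini = gini_impurity([15, 9])
print(round(root_gini, 3))  # 0.469

# A node that contains only one class is "pure" and has impurity 0
pure_gini = gini_impurity([11, 0])
print(pure_gini)  # 0.0
```

A node with counts ```[12, 12]``` would give the two-class maximum of 0.5.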

---


&#128712; **INFO:**
How to read a decision tree and how the decision tree classifies new observations (e.g., individuals):

- We always start at the root node, i.e., at the very top of the decision tree.

- If the logical rule in the node is satisfied for the new observation, the left path in the decision tree is chosen. If the logical rule in the node is not satisfied, the right path is chosen.

- We traverse the decision tree until we reach a leaf. The ```class``` attribute in the leaf describes the class of the new data object.


---

<a name='aufgabe_1'></a>
&#9989; **Task 1:** Determine the class of the following two individuals according to the decision tree depicted above:

- Person 1: income_usd = 100'000; with_mortgage = 0

- Person 2: income_usd = 73'000; with_mortgage = 1

-> Solution to [Task 1](#lösung_aufgabe_1)

In [None]:
# Own solution (with # you can write comments that are not interpreted by Python):

### 2.2 Task Description

Now we are working with the ['heart-disease'](https://archive.ics.uci.edu/dataset/45/heart+disease) (HD) dataset. This patient data is organized in a table with 14 columns (features) and 303 rows (observations). A brief description of the various features:
* age: Age in years
* sex: Male/Female
* restbp: Resting blood pressure in mm Hg at hospital admission
* chol: Serum cholesterol in mg/dl
* fbs: Whether fasting blood sugar > 120 mg/dl (True/False)
* thalach: Maximum heart rate achieved
* exang: Exercise-induced angina (True/False)
* oldpeak: ST depression induced by exercise relative to rest
* ca: Number of major vessels (0-3) colored by fluoroscopy

Our goal is to generate a model with the 303 observations that can classify new observations (or individuals) and thus determine whether heart disease (HD) is present or not. In the model, we use the features described above to predict the target variable (HD):
* hd: Type of heart disease (here: binary)

In a group of expert data scientists, we discussed which model we wanted to use. We decided to use decision trees for classifying heart diseases because they are easy to interpret.

## 3 Data preparation (12 min) <a class="anchor" name="third-bullet"></a>

### 3.1 Read data

For each new Python project, we consider which Python libraries we want to use. A Python library is a reusable block of code that we can integrate into a program or project. Integrating such code blocks is much faster and more reliable than writing the code ourselves.

---


&#128712; **INFO:**
In Python, anything that follows a '#' on the same line is a comment used to describe the code.


---

In [None]:
%%capture
%pip install pandas numpy matplotlib scikit-learn

In [None]:
# Importing libraries
# pandas: Library for reading and manipulating data
import pandas as pd
# numpy: Library for numerical computations on arrays
import numpy as np
# plt: Library for plotting graphs
import matplotlib.pyplot as plt
# DecisionTreeClassifier: Modeling kit for decision trees
from sklearn.tree import DecisionTreeClassifier
# plot_tree: Function for plotting decision trees
from sklearn.tree import plot_tree
# train_test_split: Function to split observations into training and test sets
from sklearn.model_selection import train_test_split
# cross_val_score: Function for cross-validation
from sklearn.model_selection import cross_val_score
# confusion_matrix: Function for calculating the confusion matrix
from sklearn.metrics import confusion_matrix
# ConfusionMatrixDisplay: Class for plotting the confusion matrix
from sklearn.metrics import ConfusionMatrixDisplay
# accuracy_score: Function for calculating accuracy
from sklearn.metrics import accuracy_score

---


&#128712; **INFO:**   In Python, a value is stored in a variable using the '=' operator. For example, if we want to store the number 5 in the variable 'a', we can do it with the following code:
<br><br>
a = 5 


---

In [None]:
# pd.read_csv(filepath_or_buffer, sep, encoding) allows reading '.csv' data
# pd refers to the pandas library, which is used for data manipulation. The "." indicates that we are using a function (read_csv) from pd.
# The output is stored in the variable df as a DataFrame object.
df = pd.read_csv(filepath_or_buffer='data/processed_cleveland_small.csv',
                 sep=',',
                 encoding='latin')

In [None]:
# When we call the variable 'df', it will be displayed
df

<a name='aufgabe_2'></a>
&#9989; **Task 2:**
Store the data in the variable 'df_start'.
    
-> Solution to [Task 2](#lösung_aufgabe_2)

In [None]:
# Own solution:


In [None]:
# .head() shows the first five rows of df_start
df_start.head()

### 3.2 Missing data

In [None]:
# If we write the column name in square brackets, e.g. df_start['ColumnName'], we get the values of that column
df_start['ca']

The last displayed value is a '?'. We want to check if there are more such 'special' values.

In [None]:
# The .unique() function shows all the different elements in the 'ca' column
df_start['ca'].unique()

The value '?' is the only non-numeric value in the 'ca' column. We need to be cautious with such values as they could be outliers or missing values. Therefore, the Data Science team contacts the dataset's author. After some clarifications, we are sure that these are missing data. However, we do not yet know how often such missing data occurs.

To see how often this value appears, we filter for the value '?'.

In [None]:
# [TableName['ColumnName'] == 'text to check'] can be used to filter a table by a specific column value
df_start[df_start['ca'] == '?']

The value '?' occurs very rarely. We decide to remove these observations from the dataset, i.e., we remove every observation that contains a question mark. In real applications, more sophisticated methods (e.g., imputation) are often used so that the information in the other variables of such observations is not lost. Some decision tree implementations can also handle missing values directly.
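
As an illustration of how such placeholder values can be counted and handled, here is a minimal sketch on a small made-up table (the values are hypothetical and not taken from the real dataset). It shows one possible approach, converting '?' to a proper missing value with ```pd.to_numeric```, which is not necessarily the approach used in the tasks below.

```python
import pandas as pd

# Toy data with hypothetical values (not the real dataset)
toy = pd.DataFrame({"age": [63, 67, 41, 56],
                    "ca": ["0.0", "3.0", "?", "1.0"]})

# Count how often each value occurs, including the '?' placeholder
print(toy["ca"].value_counts())

# Convert the column to numbers; entries that are not numbers (here '?') become NaN
toy["ca"] = pd.to_numeric(toy["ca"], errors="coerce")

# Number of missing values per column
n_missing = toy.isna().sum()
print(n_missing)

# Drop every row that contains at least one missing value
toy_clean = toy.dropna()
print(toy_clean)
```

A side effect of the conversion is that the column now has a numeric data type instead of containing text.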

---


&#128712; **INFO:**  The most important comparison operators:
    
* ```==```: the element to be compared must have the same content
* ```!=```: the element to be compared must not have the same content
* ```>=```: the number to be compared must be equal to or greater
* ```<=```: the number to be compared must be equal to or smaller
     

---
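
The comparison operators listed above can be tried out on a small example. The sketch below uses a made-up list of ages; the variable names are our own.

```python
import pandas as pd

ages = pd.Series([29, 54, 67], name="age")

# A comparison returns one True/False value per element
mask = ages >= 54
print(mask.tolist())   # [False, True, True]

# The True/False values can be used to keep only the matching elements
older = ages[mask]
print(older.tolist())  # [54, 67]

# '!=' keeps everything that is not equal to the given value
not_54 = ages[ages != 54]
print(not_54.tolist())  # [29, 67]
```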

<a name='aufgabe_3'></a>
&#9989; **Task 3:**
We have seen that only 4 observations in the 'ca' column contain a '?'. Since the clarifications have shown that these are missing data, and this phenomenon affects only very few observations, we want to filter out these values from our 'df_start' table and save the new table as 'df_no_missing'.

-> Solution to [Task 3](#lösung_aufgabe_3)

In [None]:
# Own solution:


In [None]:
# Test: With this code, we check if all '?' have been removed.
df_no_missing['ca'].unique()

### 3.3 Outliers

The authors of the dataset mentioned that there are inexplicable values for age. This means we suspect outliers in the following feature:
* age: Age of the observed individuals

A common way to check for outliers is through visualizations like boxplots. A boxplot is a graphical representation of statistical distributions that visualizes the median, quartiles, and outliers by displaying the data in boxes.

![alt text for screen readers](./pictures/boxplot-new.png "Example Boxplot").
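
A boxplot typically marks a point as an outlier when it lies more than 1.5 times the interquartile range (IQR) below the first or above the third quartile. This is the default whisker setting in matplotlib. The following sketch applies that rule by hand to a few made-up ages (hypothetical values, not the real data):

```python
import numpy as np

# Hypothetical ages; 120 is an implausible value
ages = np.array([29, 41, 45, 52, 54, 58, 61, 63, 67, 120])

# First and third quartile
q1, q3 = np.percentile(ages, [25, 75])
iqr = q3 - q1

# Everything outside these bounds is flagged as an outlier
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = ages[(ages < lower) | (ages > upper)]

print(lower, upper)  # 23.125 86.125
print(outliers)      # [120]
```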

In [None]:
# The .boxplot(ColumnName) function displays a boxplot for a specific feature
df_no_missing.boxplot('age')
plt.show()

We see that there are no outliers in the ```age``` feature. The authors of the dataset seem to have already cleaned the data. If outliers do occur, they should be investigated and, if they turn out to be errors, removed or corrected.

### 3.4 Data formatting
In the next step, the data is split into features and the target variable. All features are included in the DataFrame ```X```, while the values of the target variable are stored in ```y```.

Here, ```X``` represents potential observations, i.e., data from individuals, and ```y``` stands for the possible classifications (```0```=```no heart disease```; ```1```=```has heart disease```). The overarching goal is to create a decision tree that classifies ```y``` as accurately as possible based on ```X```.

In [None]:
# With the .copy() function, we create a copy of the table (to avoid altering the original data df_no_missing)
df_clean = df_no_missing.copy()

In [None]:
# All features should be stored in the table 'X'
# The .drop('column name', axis=1) function is used to remove a column
# With the .copy() function, we create a copy of the table
X = df_clean.drop('hd', axis=1).copy()
X.head()

In [None]:
# In this step, we store the target variable as 'y'
y = df_clean['hd'].copy()
y.head()

In the next step, our data is split into a ```training set``` and a ```test set```:

- ```Training set```: This consists of data used to construct a decision tree and train it based on specific observations. The goal is to teach the tree the characteristic properties of a dataset.
- ```Test set```: This dataset is used to test the decision tree. It allows for the evaluation of how well the tree can classify new observations and thus assess its accuracy.

![alt text for screen readers](./pictures/train_test_split.png "Picture train test split").

In [None]:
# The train_test_split() function splits the data into a training set and a test set
# By default, 25% of the data is in the test set and 75% is in the training set
# We see that train_test_split(X, y) generates 4 tables (one X and y table for each set)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
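
To see what the split produces, the following sketch runs ```train_test_split``` on a small made-up dataset and checks the resulting sizes. The optional ```stratify``` argument, which is not used in the cell above, keeps the class proportions similar in both sets; whether it is needed depends on the data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 observations with 2 features, 10 per class (hypothetical values)
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 10 + [1] * 10)

# Default: 25% of the observations go into the test set
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy,
                                          random_state=42,
                                          stratify=y_toy)

print(X_tr.shape, X_te.shape)                # (15, 2) (5, 2)
print(np.bincount(y_tr), np.bincount(y_te))  # class counts per set
```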

## 4 Simple Decision Tree (10 min) <a class="anchor" name="fourth-bullet"></a>
### 4.1 Simple Decision Tree

In this section, we construct/train a ```simple decision tree``` (without optimizations) based on our observations in the training set. This decision tree allows for an initial classification of objects (e.g., individuals) regarding heart disease.

In [None]:
# The DecisionTreeClassifier() function is used for constructing a decision tree. Setting the random state ensures reproducible results.
clf_dt_e = DecisionTreeClassifier(random_state=42)
# .fit(X_train, y_train) trains the decision tree on the training set (X_train and y_train)
clf_dt_e.fit(X_train, y_train)

In [None]:
# Visualization: With the following code, we can visualize/plot a decision tree
# Here, the size of the plot is defined in inches (15 stands for the width and 7.5 stands for the height)
plt.figure(figsize=(15, 7.5), dpi=600)

# The plot_tree(decision_tree, class_names, feature_names) function is used for visualizing a decision tree
# 'decision_tree' represents the decision tree to be visualized
plot_tree(decision_tree=clf_dt_e,
          # 'filled=True' results in the nodes being filled with colors
          filled=True,
          # 'class_names=["no HD", "has HD"]' specifies the classes that should be listed under 'class' in each node
          class_names=["no HD", "has HD"],
          # 'feature_names=X.columns' must be provided to ensure the features are correctly labeled in the plot
          feature_names=X.columns)

# The plt.show() function displays the plot.
plt.show()

Congratulations! We have constructed our first simple decision tree. Now, we want to see how well this decision tree, which we trained with the training set, can perform classifications on the test set. This means we want to see how new individuals, who were not used for training, are classified.

In [None]:
# The .predict(table name) function is used to perform classifications
# This function must be provided with a table name as input -> the table must contain the test features
predictions = clf_dt_e.predict(X_test)
# The output is a list of classifications (0=no HD; 1=has HD)
predictions

In the next step, we want to check whether these classifications are correct or incorrect.

In [None]:
# The confusion_matrix(y_true, y_pred, labels) function provides a way to check the classification
# 'y_true' -> list of correct classifications
cm = confusion_matrix(y_true=y_test,
                      # 'y_pred' -> list of performed classifications
                      y_pred=predictions,
                      # 'labels' -> provide classes from clf_dt_e
                      labels=clf_dt_e.classes_)

# The ConfusionMatrixDisplay(confusion_matrix, display_labels) function is used to visualize the confusion matrix 'cm'
# 'confusion_matrix' -> confusion matrix to be visualized
disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                              # 'display_labels' -> labels, which should be shown in visualization
                              display_labels=['no HD', 'has HD'])
# The .plot() function creates the visualization
disp.plot()
# The plt.show() function displays the plot
plt.show()

The graphic shown above depicts a ```Confusion Matrix```. It is read as follows:

- ```31```: Do not have heart disease -> These 31 were correctly classified (True Negative TN)
- ```12```: Do not have heart disease -> These 12 were incorrectly classified as having heart disease (False Positive FP)
- ```25```: Have heart disease -> These 25 were correctly classified (True Positive TP)
- ```7```: Have heart disease -> These 7 were incorrectly classified as not having heart disease (False Negative FN)

A common measure of the quality of a decision tree is its ```Accuracy```, which can be calculated using the ```Confusion Matrix```.

---


&#128712; **INFO:**

Accuracy = (TP + TN) / (TP + TN + FP + FN)
<br>
<br>... in our simple decision tree, for example, we have the following accuracy:
<br> 
<br>Accuracy = (25 + 31) / (25 + 31 + 12 + 7) = 56 / 75 ≈ 0.75
    

---

In [None]:
# The function accuracy_score(y_true, y_pred) shows the calculated accuracy
accuracy_score(y_true=y_test,
               y_pred=predictions)
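
The accuracy can also be recomputed by hand from the confusion matrix. The sketch below rebuilds the matrix from the counts quoted above, assuming the layout scikit-learn uses (rows are the true classes, columns the predicted classes). The variable name ```cm_example``` is our own.

```python
import numpy as np

# Counts quoted in the text; rows = true class, columns = predicted class,
# class order: [no heart disease, has heart disease]
cm_example = np.array([[31, 12],
                       [7, 25]])

# .ravel() flattens the 2x2 matrix row by row: TN, FP, FN, TP
tn, fp, fn, tp = cm_example.ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(round(accuracy, 3))  # 0.747
```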

## 5 Optimized decision tree (10 min) <a class="anchor" name="fifth-bullet"></a>
### 5.1 Pruning

![alt text for screen readers](./pictures/ccp.png "Pruning").

We have constructed a ```simple decision tree``` so far. Together with the Data Science team, we are discussing whether this decision tree is sufficient. Someone from the team has a doubt: the decision tree might be overfitted to the training set. Overfitting is a known phenomenon. It means that the classification of observations from our training set works very well, but the classification of new observations works less well. The decision tree is therefore too closely adapted to the training data (overfitted).

We can address this problem by simplifying the decision tree. We use various parameters (e.g., ```max_depth``` or ```min_samples_leaf```) to optimize the decision tree. This results in the decision tree having fewer leaves and thus being simpler and less adapted to the training data. This process is called ```pruning```. The goal is to improve the ```accuracy``` for new observations using pruning.
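
The effect of such parameters can be seen in a small experiment. The following sketch uses synthetic data from ```make_classification``` (not the heart disease data) and the scikit-learn parameters ```max_depth``` and ```min_samples_leaf```. The chosen values 3 and 10 are arbitrary examples, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data for illustration only
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

# Tree without restrictions
full_tree = DecisionTreeClassifier(random_state=42)
full_tree.fit(X_demo, y_demo)

# Tree that may be at most 3 levels deep and needs at least 10 samples per leaf
small_tree = DecisionTreeClassifier(random_state=42, max_depth=3, min_samples_leaf=10)
small_tree.fit(X_demo, y_demo)

print("full tree: ", full_tree.get_depth(), "levels,", full_tree.get_n_leaves(), "leaves")
print("small tree:", small_tree.get_depth(), "levels,", small_tree.get_n_leaves(), "leaves")
```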

```Cost Complexity Pruning``` is a specific method to find a smaller decision tree that delivers better results for new observations. We check whether smaller subtrees (tree with 38 leaves, tree with 37 leaves, etc.) deliver better results than larger ones. To compare smaller trees with larger trees, we use the parameter ```alpha``` as a penalty term: each additional leaf adds ```alpha``` to the cost of a tree, so larger values of ```alpha``` favor smaller trees (trade-off between fitting the training data and accuracy on the test dataset).

A more detailed guide to ```Cost Complexity Pruning``` can be found [here](#anhang_1).

---


&#128712; **INFO:**

The values of ```alpha``` are to be interpreted as follows:
- ``` 0```: The decision tree is not pruned. It has the maximum size and corresponds to the simple decision tree constructed in Chapter 4.
- ```>0```: The larger the value of ```alpha```, the simpler the decision tree. This means that as ```alpha``` increases, the decision tree will have fewer leaves.
    

---
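
The relationship between ```alpha``` and the size of the tree can be checked with a short experiment. This sketch again uses synthetic data from ```make_classification``` and a few arbitrarily chosen ```alpha``` values. It only illustrates the trend and does not reproduce the numbers of this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data for illustration only
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

# Arbitrary example values for alpha, from "no pruning" to "strong pruning"
alphas = [0.0, 0.005, 0.02, 0.1]

n_leaves = []
for alpha in alphas:
    tree = DecisionTreeClassifier(random_state=42, ccp_alpha=alpha)
    tree.fit(X_demo, y_demo)
    n_leaves.append(tree.get_n_leaves())
    print(f"alpha={alpha}: {n_leaves[-1]} leaves")
```

The number of leaves never increases as ```alpha``` grows, because a larger penalty removes at least as many branches as a smaller one.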

In [None]:
# .cost_complexity_pruning_path(X_train, y_train) computes the candidate values for alpha at which the tree would be pruned further.
path = clf_dt_e.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = path.ccp_alphas
ccp_alphas

In [None]:
# We select a value from the list and construct a decision tree with 'alpha' = 0.0077381.
value_alpha = 0.0077381

# Definition of the decision tree
clf_dt_new = DecisionTreeClassifier(random_state=42,
                                    ccp_alpha=value_alpha)
clf_dt_new.fit(X_train, y_train)

# Visualization of the decision tree
plot_tree(decision_tree=clf_dt_new,
          filled=True,
          class_names=["no HD", "has HD"],
          feature_names=X.columns)

# The function plt.show() displays the plot.
plt.show()

<a name='aufgabe_4'></a>
&#9989; **Task 4:**

Construct and visualize three different decision trees with different ```alpha``` values. For the construction, we can use and copy the code above.

We should ensure that the decision trees are stored in the following variables (replace ```clf_dt_new``` with ```clf_dt_1```):
- clf_dt_1
- clf_dt_2
- clf_dt_3

-> Solution to [Task 4](#lösung_aufgabe_4)

In [None]:
# Own solution:


In [None]:
# For the constructed decision tree 'clf_dt_new', we want to plot the confusion matrix to check the performance.
# Classifications
predictions = clf_dt_new.predict(X_test)
cm = confusion_matrix(y_true=y_test,
                      y_pred=predictions,
                      labels=clf_dt_new.classes_)

# Visualization of the Confusion Matrix
disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                              display_labels=['no HD', 'has HD'])

# disp.plot() creates the visualization, plt.show() displays it
disp.plot()
plt.show()

<a name='aufgabe_5'></a>
&#9989; **Task 5:**

Visualize the confusion matrix for the decision trees constructed in Task 4. For the visualizations, we can use and copy the code above.
<br>
<br>-> Solution to [Task 5](#lösung_aufgabe_5)
    

In [None]:
# Own solution:


We have seen that depending on the ```alpha```, the results can get worse or better. Our team is considering how we can easily find the ```alpha``` that yields the best results.

Since we don't want to check this manually, we will write code that does the following for us:
1. Construct a decision tree for each ```alpha``` value.
2. Calculate the ```accuracy``` for the test set and the training set for each decision tree.
3. Visualize the ```accuracy``` for each decision tree as a function of ```alpha```.

<div class="alert alert-block alert-danger"> <b>Complex code cell:</b> 

The following code cell only needs to be executed. Understanding its functionality is not part of this introduction.
    
</div>

In [None]:
# With [:-1] we remove the largest 'alpha' value (with the largest 'alpha', the tree is pruned down to the root node only).
ccp_alphas = ccp_alphas[:-1]

clf_dts = []

for ccp_alpha in ccp_alphas:
    clf_dt = DecisionTreeClassifier(random_state=42, ccp_alpha=ccp_alpha)
    clf_dt.fit(X_train, y_train)
    clf_dts.append(clf_dt)
    
# With the code below, we visualize 'accuracy' as a function of 'alpha' (for test data and training data)
train_scores = [clf_dt.score(X_train, y_train) for clf_dt in clf_dts]
test_scores = [clf_dt.score(X_test, y_test) for clf_dt in clf_dts]

fig, ax = plt.subplots()
ax.set_xlabel('alpha')
ax.set_ylabel('accuracy')
ax.set_title('Accuracy vs alpha for training and testing sets')
ax.plot(ccp_alphas, train_scores, marker='o', label='train', drawstyle='steps-post')
ax.plot(ccp_alphas, test_scores, marker='o', label='test', drawstyle='steps-post')
ax.legend()
plt.show()

In this visualization, we find the information we need to determine the best ```alpha```. Since we want to create a decision tree that generalizes well, we are particularly interested in the orange line (```test data```).

When selecting the best ```alpha```, we consider the following two points:
1. The ```accuracy``` on the ```test``` set should be as high as possible.
2. The ```accuracy``` on the ```train``` set should be as high as possible.

Considering these points, the optimal ```alpha``` is the sixth-to-last point on the ```test``` line.
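
Reading the best ```alpha``` off the test curve is simple, but it uses the test set for tuning. A common alternative is cross-validation on the training set, for which ```cross_val_score``` was imported at the beginning. The following is a minimal sketch of that idea on synthetic data. It is not the procedure used in this notebook, and names such as ```candidate_alphas``` and ```best_alpha``` are our own.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data for illustration only
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# Candidate alphas from the pruning path (without the last one, which leaves only the root)
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_tr, y_tr)
# Defensive clipping: rounding could in principle yield values marginally below zero
candidate_alphas = np.clip(path.ccp_alphas[:-1], 0.0, None)

# Mean 5-fold cross-validation accuracy on the training set for each alpha
mean_scores = []
for alpha in candidate_alphas:
    tree = DecisionTreeClassifier(random_state=42, ccp_alpha=alpha)
    scores = cross_val_score(tree, X_tr, y_tr, cv=5)
    mean_scores.append(scores.mean())

# Pick the alpha with the best mean score and evaluate once on the untouched test set
best_alpha = candidate_alphas[int(np.argmax(mean_scores))]
final_tree = DecisionTreeClassifier(random_state=42, ccp_alpha=best_alpha).fit(X_tr, y_tr)
test_accuracy = final_tree.score(X_te, y_te)

print("best alpha:", best_alpha)
print("test accuracy:", test_accuracy)
```

With this approach the test set is used only once, at the very end, so the reported accuracy is a more honest estimate for new observations.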

In [None]:
# To find the sixth-to-last point, we display the list of 'alpha' values again
ccp_alphas
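
Instead of copying a rounded number by hand, an element can also be picked directly by its position. Negative positions count from the end of the list. The sketch below shows this on a made-up list (hypothetical values, not the real ```ccp_alphas```):

```python
import numpy as np

# Hypothetical alpha values for illustration
alphas_demo = np.array([0.0, 0.001, 0.002, 0.004, 0.008, 0.016, 0.032, 0.064])

# -1 is the last element, -2 the second-to-last, ..., -6 the sixth-to-last
sixth_to_last = alphas_demo[-6]
print(sixth_to_last)  # 0.002
```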

In [None]:
# We define and save the optimal 'alpha'
alpha_opt = 0.01081731

In [None]:
# Definition of the Decision Tree
clf_dt_pruned = DecisionTreeClassifier(random_state=42,
                                       ccp_alpha=alpha_opt)
clf_dt_pruned = clf_dt_pruned.fit(X_train, y_train)

In [None]:
# Visualization of the Decision Tree
plt.figure(figsize=(15, 7.5))
plot_tree(clf_dt_pruned,
          filled=True,
          class_names=["no HD", "has HD"],
          feature_names=X.columns)
plt.show()

In [None]:
# 'Accuracy' of the New Decision Tree
predictions_new = clf_dt_pruned.predict(X_test)
accuracy_score(y_true=y_test,
               y_pred=predictions_new)

We obtain the following ```accuracy values```:
* simple decision tree = ```0.746``` (75%)
* optimized decision tree = ```0.786``` (79%)

When we compare these two values, we see that through optimization, we have obtained a decision tree that can handle new observations better.

## 6 Conclusion (5 min) <a class="anchor" name="sixth-bullet"></a>

What we have learned:
* What are decision trees?
* How do we read decision trees?
* How do we prepare data (missing data & outliers)?
* Working with training and test sets
* Creating simple decision trees
* Optimizing simple decision trees
* Accuracy and confusion matrix

What to consider when integrating models:
* Generalization
* Software engineering skills to embed models in applications
* Response times
* Monitoring prediction quality
* Collaboration of multiple roles (data engineer, data scientist, software engineer)

## 7 Appendix <a class="anchor" name="seventh-bullet"></a>

<a name='anhang_1'></a>
### Appendix 1: Guide to Cost Complexity Pruning

The algorithm for cost complexity pruning generally follows these steps:

1. Create an initial decision tree with a training dataset considering all available attributes and features.
2. Evaluate the classification performance of the decision tree using a separate validation dataset.
3. Calculate the potential costs for each internal node of the decision tree in terms of misclassification or other metrics.
4. Assign a cost complexity score to each internal node, usually calculated as the sum of its misclassification costs and a penalty term proportional to the number of descending leaf nodes.
5. Starting from the root node, iteratively prune the node with the lowest cost complexity score, creating a series of smaller decision trees.
6. Evaluate the classification performance of each pruned decision tree using the validation dataset.
7. Select the pruned tree with the best performance, often measured by accuracy or another suitable metric.
8. Optionally, further prune the selected tree by optimizing the complexity parameter α and repeating steps 5-7.
9. The final pruned decision tree is achieved when no further improvements in performance can be obtained through additional pruning.
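
For reference, the cost complexity score mentioned in the steps above is usually written as follows (this is the form of minimal cost-complexity pruning described in the scikit-learn documentation; the symbols are introduced here and do not appear elsewhere in this notebook):

$$
R_\alpha(T) = R(T) + \alpha \, |\widetilde{T}|
$$

Here $R(T)$ is the total impurity (or misclassification cost) of the leaves of tree $T$, $|\widetilde{T}|$ is the number of leaves, and $\alpha \ge 0$ is the complexity parameter. For an internal node $t$ with subtree $T_t$, the value of $\alpha$ at which pruning the subtree becomes worthwhile is

$$
\alpha_{\text{eff}}(t) = \frac{R(t) - R(T_t)}{|\widetilde{T}_t| - 1}
$$

The node with the smallest $\alpha_{\text{eff}}$ is pruned first. Repeating this yields the sequence of ```alpha``` values and ever smaller trees used in Section 5.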

## 8 Solutions <a class="anchor" name="eight-bullet"></a>

<a name='lösung_aufgabe_1'></a>
### Solution Task 1

Classes:
- Person 1 = 'Has Insurance'
- Person 2 = 'Has Insurance'

![alt text for screen readers](./pictures/Lösung_mit_Pfaden.png "Solution").

-> Back to [Task 1](#aufgabe_1)

<a name='lösung_aufgabe_2'></a>
### Solution Task 2

In [None]:
df_start = df

-> Back to [Task 2](#aufgabe_2)

<a name='lösung_aufgabe_3'></a>
### Solution Task 3

In [None]:
df_no_missing = df_start[df_start['ca'] != '?']

-> Back to [Task 3](#aufgabe_3)

<a name='lösung_aufgabe_4'></a>
### Solution Task 4

The following code cells represent a possible solution to Task 4.

In [None]:
plt.figure(figsize=(15, 7.5), dpi=600)

# We select a value from the list and construct a decision tree with 'alpha' = 0.0000000000001
value_alpha = 0.0000000000001

# Definition of the Decision Tree
clf_dt_1 = DecisionTreeClassifier(random_state=42,
                                ccp_alpha=value_alpha)
clf_dt_1.fit(X_train, y_train)

# Visualization of the Decision Tree
plot_tree(decision_tree=clf_dt_1,
          filled=True,
          class_names=["no HD", "has HD"],
          feature_names=X.columns)

# The function plt.show() displays the plot
plt.show()

In [None]:
plt.figure(figsize=(15, 7.5), dpi=600)

# We select a value from the list and construct a decision tree with 'alpha' = 0.01425422
value_alpha = 0.01425422

# Definition of the Decision Tree
clf_dt_2 = DecisionTreeClassifier(random_state=42,
                                ccp_alpha=value_alpha)
clf_dt_2.fit(X_train, y_train)

# Visualization of the Decision Tree
plot_tree(decision_tree=clf_dt_2,
          filled=True,
          class_names=["no HD", "has HD"],
          feature_names=X.columns)

# The function plt.show() displays the plot
plt.show()

In [None]:
plt.figure(figsize=(15, 7.5), dpi=600)

# We select a value from the list and construct a decision tree with 'alpha' = 0.2
value_alpha = 0.2

# Definition of the Decision Tree
clf_dt_3 = DecisionTreeClassifier(random_state=42,
                                ccp_alpha=value_alpha)
clf_dt_3.fit(X_train, y_train)

# Visualization of the Decision Tree
plot_tree(decision_tree=clf_dt_3,
          filled=True,
          class_names=["no HD", "has HD"],
          feature_names=X.columns)

# The function plt.show() displays the plot
plt.show()

-> Back to [Task 4](#aufgabe_4)

<a name='lösung_aufgabe_5'></a>
### Solution Task 5
The following code cells represent a possible solution to Task 5.

In [None]:
# Classifications
predictions = clf_dt_1.predict(X_test)
cm = confusion_matrix(y_true=y_test,
                      y_pred=predictions,
                      labels=clf_dt_1.classes_)

# Visualization of the Confusion Matrix
disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                              display_labels=["no HD", "has HD"])

# disp.plot() creates the visualization, plt.show() displays it
disp.plot()
plt.show()

In [None]:
# Classifications
predictions = clf_dt_2.predict(X_test)
cm = confusion_matrix(y_true=y_test,
                      y_pred=predictions,
                      labels=clf_dt_2.classes_)

# Visualization of the Confusion Matrix
disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                              display_labels=["no HD", "has HD"])

# disp.plot() creates the visualization, plt.show() displays it
disp.plot()
plt.show()

In [None]:
# Classifications
predictions = clf_dt_3.predict(X_test)
cm = confusion_matrix(y_true=y_test,
                      y_pred=predictions,
                      labels=clf_dt_3.classes_)

# Visualization of the Confusion Matrix
disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                              display_labels=["no HD", "has HD"])

# disp.plot() creates the visualization, plt.show() displays it
disp.plot()
plt.show()

-> Back to [Task 5](#aufgabe_5)

## 9 Sources <a class="anchor" name="nineth-bullet"></a>

### &#169; Sources:
&#128190; **Data:** https://archive.ics.uci.edu/dataset/45/heart+disease 
<br>
&#128252;  **Video:** https://youtu.be/q90UDEgYqeI?list=PLBq2sVJiEBvA9rPo3IEQsJNI4IJbn81tB

### &#128161; Further information:

``` Decision Trees: ``` &nbsp; https://www.youtube.com/watch?v=7VeUPuFGJHk&t=0s
<br>
``` Cross Validation: ``` &nbsp; https://www.youtube.com/watch?v=fSytzGwwBVw&t=0s
<br>
``` Confusion Matrix: ``` &nbsp; https://www.youtube.com/watch?v=Kdsp6soqA7o&t=0s
<br>
``` Cost-Complexity Pruning: ``` &nbsp; https://www.youtube.com/watch?v=D0efHEJsfHo&t=0s
<br>
``` Bias and Variance and Overfitting: ``` &nbsp; https://www.youtube.com/watch?v=EuBBz3bI-aA&t=0s