# Decision Tree

We use scikit-learn (sklearn) to train a decision tree classifier.
The task is to classify each iris flower into one of <b>3 classes</b>:
<p>
Versicolor<br>
Setosa<br>
Virginica<br>
</p><br>

### 1st step: Load data:

In [None]:
import pandas as pd
data = pd.read_csv("iris.csv", sep = ",")

data

#### Explore and clean:

In [None]:
#Rows 98 to 102 and all columns
data.loc[98:102,:]
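
Note that `loc` selects by index label and includes both end points, whereas `iloc` selects by position and excludes the stop. A minimal sketch on a hypothetical ten-row frame (not the iris data):

```python
import pandas as pd

# Hypothetical frame with the default integer index 0..9.
frame = pd.DataFrame({"value": range(10)})

by_label = frame.loc[3:5, :]      # labels 3, 4 and 5 (end point included)
by_position = frame.iloc[3:5, :]  # positions 3 and 4 (stop excluded)

print(len(by_label), len(by_position))
```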

#### Convert variety to number

In [None]:
variant = {'Versicolor': 0, 'Setosa' : 1, 'Virginica' : 2}

data['variety'] = data['variety'].map(variant)
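
`Series.map` silently turns every label that is missing from the dictionary into NaN, so it can be worth checking for unmapped values afterwards. A minimal sketch on a hypothetical mini-frame, where one label carries a trailing space on purpose:

```python
import pandas as pd

# Hypothetical stand-in for the iris data; "Setosa " has a trailing space.
sample = pd.DataFrame({"variety": ["Setosa", "Versicolor", "Setosa ", "Virginica"]})
variant = {"Versicolor": 0, "Setosa": 1, "Virginica": 2}

encoded = sample["variety"].map(variant)

# Labels without a dictionary entry end up as NaN instead of raising an error.
unmapped = sample.loc[encoded.isna(), "variety"].tolist()
print("Unmapped labels:", unmapped)
```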

#### View impact

In [None]:
data.loc[48:52,:]

#data.loc[98:102,:]

### 2nd step: Correlation analysis:

In [None]:
# precision=3 rounds every displayed cell to 3 digits after the decimal point.
correlation = data.corr()
correlation.style.background_gradient(cmap='coolwarm').format(precision=3)
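
Because the integer codes assigned to the variety column are arbitrary, its correlation with the features depends on the chosen encoding. The following sketch therefore looks at the feature columns only. It assumes that the copy of the iris data bundled with scikit-learn (`load_iris`) can stand in for iris.csv; its column names differ from those in the CSV file.

```python
from sklearn.datasets import load_iris

# Assumption: the bundled iris data replaces iris.csv for this sketch.
features = load_iris(as_frame=True).data

# Pairwise Pearson correlation between the four feature columns only.
feature_corr = features.corr()
print(feature_corr.round(3))
```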

### 3rd step: Inspect the label distribution:

First, let's plot the labels. A count plot is a good tool for this, because it shows at a glance how many entries each class has:

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(20,8))
sns.countplot(x="variety", data=data)
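
The same information is available as numbers through `value_counts`. A minimal sketch on a hypothetical, already encoded label column:

```python
import pandas as pd

# Hypothetical label column, encoded as in the mapping step above.
labels = pd.Series([0, 0, 1, 1, 1, 2], name="variety")

# Number of entries per class, ordered by class code.
counts = labels.value_counts().sort_index()
print(counts)
```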

### 4th step: Preparing the data:

In [None]:
# Use the variety column as the y-values (labels)
y_data = data.variety.values

# Remove the label column to obtain the features
x_data = data.drop(["variety"], axis=1)

# Split into training and test data (70% / 30%), random selection
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.3, random_state=1)

# Scale the features to [0, 1]; fit the scaler on the training data only
# so that no information from the test data leaks into the preprocessing
from sklearn.preprocessing import MinMaxScaler
mm = MinMaxScaler()
x_train = mm.fit_transform(x_train)
x_test = mm.transform(x_test)
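
A purely random split can leave the classes unevenly represented in the test set. One option is the `stratify` argument of `train_test_split`, which keeps the class proportions in both parts. This is a sketch of that variant, not the split used in this notebook, and it assumes the bundled `load_iris` data in place of iris.csv:

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: the bundled iris data (150 samples, 50 per class) replaces iris.csv.
X, y = load_iris(return_X_y=True)

# stratify=y keeps the class proportions identical in the training and test parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

test_counts = Counter(y_test.tolist())
print(test_counts)
```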

### 5th step: Generate, train and test model

In [None]:
#Import the corresponding module
from sklearn.tree import DecisionTreeClassifier

#define model
#parameters see: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
DecisionTreeClassifierModel = DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=5,min_samples_split=2,
                                    min_samples_leaf=1,min_weight_fraction_leaf=0.0,max_features=None,
                                    random_state=0, max_leaf_nodes=5)

DecisionTreeClassifierModel.fit(x_train, y_train)

#some statistics...
print('DecisionTreeClassifierModel Train Score is : ' , DecisionTreeClassifierModel.score(x_train, y_train))
print('DecisionTreeClassifierModel Test Score is : ' , DecisionTreeClassifierModel.score(x_test, y_test))

#Extract classes and important features
print('DecisionTreeClassifierModel Classes are : ' , DecisionTreeClassifierModel.classes_)
print('DecisionTreeClassifierModel feature importances are : ' , DecisionTreeClassifierModel.feature_importances_)
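
The raw `feature_importances_` array is easier to read when each value is paired with its feature name. A minimal sketch, assuming the bundled `load_iris` data in place of iris.csv and a hypothetical variable name `model`:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Assumption: the bundled iris data replaces iris.csv for this sketch.
iris = load_iris()

model = DecisionTreeClassifier(max_depth=5, max_leaf_nodes=5, random_state=0)
model.fit(iris.data, iris.target)

# Pair each importance with its feature name and sort, most important first.
ranked = sorted(
    zip(iris.feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```

The importances of a fitted tree are normalised, so they sum to 1.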

### 6th step: Use metrics

scikit-learn offers many useful functions for assessing prediction quality. A very simple check is to compare the predictions with the true labels directly:

In [None]:
# Predict the class of each test sample
y_pred = DecisionTreeClassifierModel.predict(x_test)
# Predicted probability of each class for each test sample
y_pred_prob = DecisionTreeClassifierModel.predict_proba(x_test)
print('Predicted Value for DecisionTreeClassifierModel is : ' , y_pred[:10])
# True labels of the same samples, for comparison
print("test values :" ,y_test[:10] )
print('Prediction Probabilities Value for DecisionTreeClassifierModel is : ' , y_pred_prob[:10])
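
Instead of comparing the printed arrays by eye, the comparison can be summarised with `accuracy_score` and `classification_report`. A minimal sketch on hypothetical label arrays (not results from this notebook):

```python
from sklearn.metrics import accuracy_score, classification_report

# Hypothetical true and predicted labels, encoded 0/1/2 as in the mapping step.
y_true = [0, 1, 2, 2, 1, 0]
y_hat = [0, 1, 2, 1, 1, 0]

# Fraction of samples whose predicted label equals the true label.
acc = accuracy_score(y_true, y_hat)
print("Accuracy:", acc)

# Precision, recall and F1 score per class.
print(classification_report(y_true, y_hat,
                            target_names=["versicolor", "setosa", "virginica"]))
```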


Another possibility is to use the ‘sklearn.metrics’ module:

In [None]:
from sklearn.metrics import confusion_matrix

# Calculate the confusion matrix over the whole test set
CM = confusion_matrix(y_test, y_pred)
print('Confusion Matrix is : \n', CM)

# Draw the confusion matrix as a heatmap
sns.heatmap(CM, cmap=plt.cm.Blues, annot=True)
plt.show()
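
A confusion matrix only contains the classes that occur in the arrays it is given, so a small subset of the predictions can yield a matrix smaller than 3 x 3. Passing `labels` fixes the rows and columns explicitly. A minimal sketch on hypothetical labels in which class 2 never occurs:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels; class 2 is absent from both arrays.
y_true = [0, 0, 1, 1]
y_hat = [0, 1, 1, 1]

# Without labels, only the classes that are present get a row and a column.
cm_default = confusion_matrix(y_true, y_hat)

# With labels, every listed class gets a row and a column, even if it is empty.
cm_full = confusion_matrix(y_true, y_hat, labels=[0, 1, 2])

print(cm_default.shape, cm_full.shape)
```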

### 7th step: Visualise decision tree


In [None]:
from sklearn.tree import plot_tree

plt.figure(figsize=(20,25))

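# The order must match the numeric encoding: 0 = versicolor, 1 = setosa, 2 = virginica
class_names = ['versicolor', 'setosa', 'virginica']
# Feature names in the same order as the columns of x_data
feature_names = ['sepal.length', 'sepal.width', 'petal.length','petal.width']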

plot_tree(DecisionTreeClassifierModel, filled=True, class_names=class_names, feature_names = feature_names)
plt.title("Decision tree trained on all the iris features")
plt.show()
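
When a plot is impractical, `export_text` prints the same tree as indented text rules. A minimal sketch, assuming the bundled `load_iris` data in place of iris.csv and a hypothetical variable name `model`:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Assumption: the bundled iris data replaces iris.csv for this sketch.
iris = load_iris()

model = DecisionTreeClassifier(max_depth=5, max_leaf_nodes=5, random_state=0)
model.fit(iris.data, iris.target)

# One line per split or leaf; the indentation reflects the depth in the tree.
rules = export_text(model, feature_names=list(iris.feature_names))
print(rules)
```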