# Building Trees using scikit-learn - Lab

## Introduction

Following the simple example you saw in the previous lesson, you'll now build a decision tree for a more complex dataset. This lab covers all major areas of standard machine learning practice, from data acquisition to evaluation of results. We'll continue to use the scikit-learn and pandas libraries for this analysis, following the same structure as the previous lesson.

## Objectives

In this lab you will:

- Use scikit-learn to fit a decision tree classification model 
- Use entropy and information gain to identify the best attribute to split on at each node 
- Plot a decision tree using Python 

## UCI Banknote authentication dataset

In this lab, you'll work with a popular classification dataset called the "UCI Banknote Authentication dataset". The data was extracted from images of genuine and forged banknotes. The notes were first digitized, then numerically transformed using digital signal processing (DSP) techniques. The final engineered features are all continuous, so the dataset consists entirely of floats, with no strings to worry about. If you're curious about how the dataset was created, you can visit the UCI link [here](https://archive.ics.uci.edu/ml/datasets/banknote+authentication)!

We have the following attributes in the dataset:  

1. __Variance__ of wavelet transformed image (continuous) 
2. __Skewness__ of wavelet transformed image (continuous) 
3. __Kurtosis__ of wavelet transformed image (continuous) 
4. __Entropy__ of image (continuous) 
5. __Class__ (integer) - Target/Label 

## Step 1: Import the necessary libraries 

We've imported all the modules you will need for this lab. Go ahead and run the following cell: 

In [1]:
# Import necessary libraries
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier 
from sklearn.metrics import accuracy_score, roc_curve, auc
from sklearn.preprocessing import OneHotEncoder
from sklearn import tree

## Step 2: Import data

Now, you'll load the dataset into a DataFrame, perform some basic EDA, and get a general feel for the data you'll be working with.

- Import the file `'data_banknote_authentication.csv'` as a pandas DataFrame. Note that there is no header information in this dataset 
- Assign column names `'Variance'`, `'Skewness'`, `'Kurtosis'`, `'Entropy'`, and `'Class'` to the dataset in the given order 
- View the basic statistics and shape of the dataset 
- Check for the frequency of positive and negative examples in the target variable

In [3]:
# Create DataFrame
df = pd.read_csv('https://raw.githubusercontent.com/Patriciangugi/dsc-decision-trees-lab/master/data_banknote_authentication.csv', header=None)

df.columns = ['Variance', 'Skewness', 'Kurtosis', 'Entropy', 'Class']

In [4]:
# Describe the dataset
print(df.describe())

# Display the shape of the dataset
print(f"Shape of the dataset: {df.shape}")

          Variance     Skewness     Kurtosis      Entropy        Class
count  1372.000000  1372.000000  1372.000000  1372.000000  1372.000000
mean      0.433735     1.922353     1.397627    -1.191657     0.444606
std       2.842763     5.869047     4.310030     2.101013     0.497103
min      -7.042100   -13.773100    -5.286100    -8.548200     0.000000
25%      -1.773000    -1.708200    -1.574975    -2.413450     0.000000
50%       0.496180     2.319650     0.616630    -0.586650     0.000000
75%       2.821475     6.814625     3.179250     0.394810     1.000000
max       6.824800    12.951600    17.927400     2.449500     1.000000
Shape of the dataset: (1372, 5)
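
The column names can also be assigned while reading the file, instead of setting `df.columns` afterwards. The sketch below illustrates this with `header=None` and `names=` on a tiny in-memory CSV. The two rows are made-up values in the same five-column layout, not records from the real file, and `sample_df` is a hypothetical name.

```python
import io
import pandas as pd

# Two made-up rows in the same five-column layout (illustrative values only)
sample_csv = io.StringIO(
    "3.6,8.7,-2.8,-0.4,0\n"
    "-1.4,3.3,-1.4,-2.0,1\n"
)

column_names = ['Variance', 'Skewness', 'Kurtosis', 'Entropy', 'Class']

# header=None tells pandas the first row is data; names= labels the columns
sample_df = pd.read_csv(sample_csv, header=None, names=column_names)

print(sample_df.dtypes)
print(sample_df.shape)
```

Without `header=None`, pandas would treat the first data row as the header and silently drop one observation.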


In [5]:
# Class frequency of target variable 
class_distribution = df['Class'].value_counts()
print(f"Class Distribution:\n{class_distribution}")

Class Distribution:
Class
0    762
1    610
Name: count, dtype: int64
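
Raw counts are easier to interpret as proportions. The following sketch rebuilds a label column with the same counts shown above (762 and 610) and uses `value_counts(normalize=True)` to get class shares. The majority-class share is a useful baseline accuracy to compare the tree against. The variable names are illustrative.

```python
import pandas as pd

# Rebuild a label column with the counts reported above (762 zeros, 610 ones)
labels = pd.Series([0] * 762 + [1] * 610, name='Class')

# normalize=True returns fractions instead of raw counts
proportions = labels.value_counts(normalize=True)
print(proportions.round(3))

# Accuracy obtained by always predicting the majority class
baseline_accuracy = proportions.max()
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

On the lab's DataFrame, the equivalent call would be `df['Class'].value_counts(normalize=True)`.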


## Step 3: Create features, labels, training, and test data

Now we need to create our feature set `X` and labels `y`:  
- Create `X` and `y` by selecting the appropriate columns from the dataset
- Create an 80/20 split on the dataset for training/test. Use `random_state=10` for reproducibility

In [6]:
# Create features and labels
X = df[['Variance', 'Skewness', 'Kurtosis', 'Entropy']].values
y = df['Class'].values

# Create a train/test split with 80/20 ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)

# Verify the shapes of the resulting datasets
print("Training features shape:", X_train.shape)
print("Testing features shape:", X_test.shape)
print("Training labels shape:", y_train.shape)
print("Testing labels shape:", y_test.shape)


Training features shape: (1097, 4)
Testing features shape: (275, 4)
Training labels shape: (1097,)
Testing labels shape: (275,)
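
The lab's split is purely random. As an optional variation, `train_test_split` also accepts a `stratify` argument that keeps the class ratio nearly identical in both subsets. The sketch below demonstrates this on synthetic stand-in data with the same class counts as the lab. The random features and the `_demo` names are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with the same class counts as the lab (762 vs. 610)
rng = np.random.default_rng(10)
X_demo = rng.normal(size=(1372, 4))
y_demo = np.array([0] * 762 + [1] * 610)

# stratify=y_demo preserves the class ratio in the training and test subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=10, stratify=y_demo
)

print("Share of class 1 overall: ", round(y_demo.mean(), 3))
print("Share of class 1 in train:", round(y_tr.mean(), 3))
print("Share of class 1 in test: ", round(y_te.mean(), 3))
```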


## Step 4: Train the classifier and make predictions
- Create an instance of a decision tree classifier with `random_state=10` for reproducibility
- Fit the model to the training data 
- Use the trained model to make predictions with test data

In [8]:
# Train a DT classifier
from sklearn.tree import DecisionTreeClassifier

# Create an instance of the Decision Tree Classifier
clf = DecisionTreeClassifier(random_state=10)

# Fit the model on the training data
clf.fit(X_train, y_train)

# Make predictions on the test data
y_pred = clf.predict(X_test)

# Print some details about the predictions
print("Predictions:", y_pred)



Predictions: [0 0 1 0 1 1 1 0 0 0 1 1 1 0 1 1 0 0 1 0 0 0 1 0 1 1 1 1 1 0 1 0 0 1 0 0 0
 1 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 1 1 0 0 1 1 0 1 0 1 0 0 0 0 0 0 1 0 1 1 0
 1 0 0 0 1 1 1 0 1 0 0 1 0 0 0 1 1 0 1 0 1 0 1 1 0 0 0 1 1 1 0 1 0 0 0 0 0
 1 1 0 1 1 0 0 1 1 0 0 1 0 1 1 0 1 1 0 0 0 0 1 1 1 0 1 0 1 0 0 0 0 0 0 0 0
 1 0 1 0 0 1 0 0 0 0 0 0 1 1 0 0 1 1 0 0 1 0 0 1 1 1 0 1 1 0 0 0 1 0 1 1 1
 1 1 0 1 0 0 1 1 1 0 0 0 0 0 1 1 0 0 1 1 0 0 0 1 0 0 0 1 1 1 1 1 0 1 1 1 0
 1 1 0 0 1 0 0 1 0 0 0 0 0 0 1 0 1 0 0 0 1 1 0 1 1 0 1 1 1 1 1 0 1 1 1 0 1
 0 0 0 0 0 1 1 1 0 1 0 0 0 1 0 0]
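
Beyond the raw predictions, a fitted tree can report its own size and class probabilities. The sketch below uses `get_depth()`, `get_n_leaves()`, and `predict_proba()` on a tree trained on synthetic data from `make_classification`. The data and the `demo_clf` name are stand-ins, not the lab's classifier.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the four banknote features (not the lab's data)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1, random_state=10
)

demo_clf = DecisionTreeClassifier(random_state=10)
demo_clf.fit(X_demo, y_demo)

# Size of the fitted tree
print("Depth:", demo_clf.get_depth())
print("Number of leaves:", demo_clf.get_n_leaves())

# predict_proba returns one column per class; each row sums to 1
probabilities = demo_clf.predict_proba(X_demo[:5])
print(probabilities)
```

The same three calls should work on the lab's `clf` once it has been fitted.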


In [9]:
# Evaluate the predictions made on the test data
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Calculate confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", conf_matrix)

# Print classification report
class_report = classification_report(y_test, y_pred)
print("Classification Report:\n", class_report)



Accuracy: 0.9781818181818182
Confusion Matrix:
 [[149   3]
 [  3 120]]
Classification Report:
               precision    recall  f1-score   support

           0       0.98      0.98      0.98       152
           1       0.98      0.98      0.98       123

    accuracy                           0.98       275
   macro avg       0.98      0.98      0.98       275
weighted avg       0.98      0.98      0.98       275
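
To see where the numbers in the classification report come from, the following sketch recomputes accuracy, precision, recall, and F1 for class 1 directly from the confusion matrix printed above. It assumes scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix reported above: rows are true classes, columns are predictions
conf = np.array([[149, 3],
                 [3, 120]])

# For a binary problem, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = conf.ravel()

accuracy = (tp + tn) / conf.sum()
precision_1 = tp / (tp + fp)  # share of predicted class 1 that is truly class 1
recall_1 = tp / (tp + fn)     # share of true class 1 that was found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"Accuracy:            {accuracy:.4f}")
print(f"Precision (class 1): {precision_1:.2f}")
print(f"Recall (class 1):    {recall_1:.2f}")
print(f"F1 (class 1):        {f1_1:.2f}")
```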



## Step 5: Check predictive performance

Use different evaluation measures to check the predictive performance of the classifier: 
- Check the accuracy, AUC, and create a confusion matrix 
- Interpret the results 

In [10]:
from sklearn.metrics import accuracy_score, roc_curve, auc, confusion_matrix
# Calculate accuracy 
acc = accuracy_score(y_test, y_pred)
print('Accuracy is :{0}'.format(acc))

# Check the AUC for predictions
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_pred)
roc_auc = auc(false_positive_rate, true_positive_rate)
print('\nAUC is :{0}'.format(round(roc_auc, 2)))

# Create and print a confusion matrix 
print('\nConfusion Matrix')
print('----------------')
print(confusion_matrix(y_test, y_pred))

Accuracy is :0.9781818181818182

AUC is :0.98

Confusion Matrix
----------------
[[149   3]
 [  3 120]]
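
Because `roc_curve` is given hard 0/1 predictions here, the ROC curve has a single interior point, and the AUC reduces to `(1 + TPR - FPR) / 2`. The sketch below checks this by rebuilding label vectors that reproduce the confusion matrix reported in Step 4 (149 true negatives, 3 false positives, 3 false negatives, 120 true positives). The reconstructed arrays are illustrative and do not preserve the original sample order.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Label vectors that reproduce the confusion matrix [[149, 3], [3, 120]]
y_true = np.array([0] * 152 + [1] * 123)
y_hat = np.array([0] * 149 + [1] * 3 + [0] * 3 + [1] * 120)

# AUC computed the same way as in the lab
fpr, tpr, _ = roc_curve(y_true, y_hat)
auc_from_curve = auc(fpr, tpr)

# Closed form for a curve with one interior point (FPR, TPR)
tpr_point = 120 / 123
fpr_point = 3 / 152
auc_closed_form = (1 + tpr_point - fpr_point) / 2

print(f"AUC from roc_curve:   {auc_from_curve:.4f}")
print(f"AUC from closed form: {auc_closed_form:.4f}")
```

Passing class probabilities from `predict_proba` instead of hard labels can give a curve with more points, although a fully grown tree often outputs probabilities of exactly 0 or 1.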


## Level up (Optional)


### Re-grow the tree using entropy 

The default impurity criterion in scikit-learn is the Gini impurity. We can change it to entropy by passing in the argument `criterion='entropy'` to the classifier in the training phase.  
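
To make the two criteria concrete, here is a small sketch that computes entropy, Gini impurity, and information gain from class counts. The parent counts are the class frequencies found in Step 2. The split into `left` and `right` is hypothetical, made up only to show the calculation, and the helper function names are not part of scikit-learn.

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of a node with the given class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def gini(counts):
    """Gini impurity of a node with the given class counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    total = sum(parent)
    weighted = sum(sum(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

# Class counts of the full dataset (class 0, class 1)
parent = [762, 610]

# Hypothetical split, invented for illustration only
left, right = [700, 100], [62, 510]

print(f"Parent entropy:   {entropy(parent):.4f}")
print(f"Parent Gini:      {gini(parent):.4f}")
print(f"Information gain: {information_gain(parent, [left, right]):.4f}")
```

At each node, the tree picks the feature and threshold whose split gives the largest reduction in the chosen impurity measure.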

- Create an instance of a decision tree classifier with `random_state=10` for reproducibility. Make sure you use entropy to calculate impurity 
- Fit this classifier to the training data 
- Run the given code to plot the decision tree

In [17]:
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
import matplotlib.pyplot as plt

# Instantiate and fit a DecisionTreeClassifier
classifier_2 = DecisionTreeClassifier(random_state=10, criterion='entropy')

classifier_2.fit(X_train, y_train)

In [15]:
# Plot and show decision tree
# X is a NumPy array (created with .values), so take the feature names from df instead
plt.figure(figsize=(12,12), dpi=500)
tree.plot_tree(classifier_2, 
               feature_names=list(df.columns[:-1]),
               class_names=np.unique(y).astype('str'),
               filled=True, rounded=True)
plt.show()
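
When a full plot is too large to read, `sklearn.tree.export_text` prints the same split rules as indented text. The sketch below fits a shallow entropy-based tree on synthetic data and prints its rules. The data, the `max_depth=2` limit, and the `demo_tree` name are assumptions chosen to keep the output short. Only the feature names are borrowed from the lab.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

feature_names = ['Variance', 'Skewness', 'Kurtosis', 'Entropy']

# Synthetic stand-in data (not the banknote dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=10
)

# max_depth=2 is an illustrative choice that keeps the printout short
demo_tree = DecisionTreeClassifier(criterion='entropy', max_depth=2, random_state=10)
demo_tree.fit(X_demo, y_demo)

# One line per split or leaf, indented by depth
rules = export_text(demo_tree, feature_names=feature_names)
print(rules)
```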

- We discussed earlier that decision trees are very sensitive to outliers. Try to identify and remove/fix any possible outliers in the dataset.

- Check the distributions of the data. Is there any room for normalization/scaling of the data? Apply these techniques and see if it improves the accuracy score.
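
One way to run the scaling experiment is sketched below on synthetic data. A decision tree splits on per-feature thresholds, so standardizing the features generally moves the thresholds without changing which samples fall on each side. The two accuracy scores are therefore expected to be very close. The data and variable names are stand-ins, and the result on the banknote data should be checked rather than assumed.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the banknote dataset)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1, random_state=10
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=10
)

# Tree fitted on the raw features
raw_tree = DecisionTreeClassifier(random_state=10).fit(X_tr, y_tr)
acc_raw = accuracy_score(y_te, raw_tree.predict(X_te))

# Same settings on standardized features; the scaler is fitted on training data only
scaler = StandardScaler().fit(X_tr)
scaled_tree = DecisionTreeClassifier(random_state=10).fit(scaler.transform(X_tr), y_tr)
acc_scaled = accuracy_score(y_te, scaled_tree.predict(scaler.transform(X_te)))

print(f"Accuracy on raw features:    {acc_raw:.3f}")
print(f"Accuracy on scaled features: {acc_scaled:.3f}")
```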

## Summary 

In this lesson, we looked at growing a decision tree for the banknote authentication dataset, which is composed of continuous features extracted from images. We looked at data acquisition, training, prediction, and evaluation. We also looked at growing trees using the entropy vs. Gini impurity criteria. In the following lessons, we'll look at more pre-training tuning techniques for building an optimal classifier for learning and prediction.  