
# Self-study try-it activity 10.1: Selecting tree depth in Python



    
## Overview

Decision trees are non-parametric supervised learning methods used for both classification and regression tasks. They structure decisions as a tree, comprising a root node, internal decision nodes, branches and leaf nodes, where each leaf represents a final prediction or classification. The tree recursively splits data into subsets based on feature values, forming simple if-then-else rules.

While deeper trees can capture complex patterns, they also increase the risk of overfitting. Decision trees are intuitive, easy to visualise, require minimal data preparation and learn to approximate the target variable from data features.

### About this assignment

This assignment is designed to help you apply machine learning algorithms in Python. You’ll work within a Jupyter Notebook that includes embedded instructions, relevant Python concepts and starter code to guide your progress. Be sure to run all code cells before submitting your work. After completing the assignment, we recommend comparing your results with the provided solution file for self-assessment.


### About this notebook

This notebook is structured into seven parts as follows:

- [Part 1](#part1): Import the data set and exploratory data analysis (EDA)

- [Part 2](#part2): Translate the categorical predictors into numerical predictors

- [Part 3](#part3): Shuffle the data set

- [Part 4](#part4): Calculate the accuracy of the Naïve benchmark on the validation set

- [Part 5](#part5): Train a decision tree using the default settings

- [Part 6](#part6): Train a decision tree using different maximum depths for the tree

- [Part 7](#part7): Retrain the best classifier using all the samples

## Classification and regression trees

The basic idea behind the algorithm for building a classification tree can be summarised as follows:

- Load the data set.

- Select the best attribute to split the records, using an attribute selection measure (ASM).

- Make that attribute a decision node and break the data set into smaller subsets.

- Build the tree by repeating this process recursively for each child node until one of the following conditions is met:
    - All the tuples belong to the same class.
    - There are no more remaining attributes.
    - There are no more instances.
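
The attribute selection step above can be sketched in a few lines. This is a minimal illustration, assuming Gini impurity as the selection measure (the default criterion in `scikit-learn`) and a single numerical feature. The helper names `gini` and `best_threshold` and the toy arrays are hypothetical and are not part of the assignment.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a 1D array of class labels."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_threshold(feature, labels):
    """Return (threshold, weighted impurity) of the best binary split on one feature."""
    best = (None, gini(labels))  # baseline: do not split at all
    values = np.unique(feature)
    # candidate thresholds are the midpoints between consecutive distinct values
    for t in (values[:-1] + values[1:]) / 2:
        left, right = labels[feature <= t], labels[feature > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best

# hypothetical toy data: loan amounts (in thousands) and default labels
loan = np.array([20.0, 25.0, 30.0, 40.0, 45.0, 50.0])
default = np.array([0, 0, 0, 1, 1, 1])

threshold, impurity = best_threshold(loan, default)
print(threshold, impurity)
```

On this toy data, the split at 35.0 separates the two classes completely, so the weighted impurity drops to zero and the recursion would stop for both children.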

### Predict defaults for student loan applications

For this exercise, you will use the data set `loandata.csv` to predict defaults for student loan applications using decision trees.

You will perform the following steps:

1. Load the data set `loandata.csv` into Python.

2. Translate the categorical predictors into numerical predictors.

3. Split the data set into 50% training data, 25% validation data and 25% test data.

4. Calculate the accuracy of the Naïve benchmark on the validation set.

5. Train a decision tree using the default settings.

6. Retry the previous step using different maximum depths for the tree.

7. Choose the most appropriate tree depth and justify your choice. Retrain the best classifier using all the samples from both the training and the validation sets, and score it on the test set. Finally, retrain the best classifier on all samples (including the test set) and describe the tree that you obtain.

[Back to top](#Index:)

<a id='part1'></a>

### Part 1: Import the data set and exploratory data analysis (EDA)

Begin by importing the necessary libraries. You will then use `pandas` to import the data set.

In [1]:
import numpy as np
import pandas as pd
from sklearn import tree, ensemble

Assign the data frame to the variable `df`.

In [3]:
df = pd.read_csv('loandata.csv')


Before building any machine learning algorithms, you need to explore the data.

Begin by visualising the first ten rows of the data frame `df` using the function `.head()`. By default, `.head()` displays the first five rows of a data frame.

Complete the code cell below by passing the desired number of rows as an `int` to the function `.head()`.

In [4]:
df.head(10)

Unnamed: 0,field,graduationYear,loanAmount,selectiveCollege,sex,Default
0,STEM,2006,23159.58054,0,Male,No
1,HUMANITIES,2010,47498.06121,0,Male,Yes
2,HUMANITIES,2012,29637.51953,0,Female,No
3,STEM,2008,25369.57716,1,Female,No
4,BUSINESS,2013,42398.55457,0,Male,Yes
5,HUMANITIES,2012,39253.38426,1,Female,Yes
6,STEM,2005,48903.96685,1,Male,No
7,STEM,2007,30687.01911,1,Male,No
8,STEM,2005,31999.81687,0,Male,No
9,HUMANITIES,2005,45120.41995,0,Female,Yes


For your convenience, here is a brief description of what some of the columns represent:

- `field`: the field in which each student is studying

- `graduationYear`: the year in which each student graduated

- `loanAmount`: the amount each student owes

- `selectiveCollege`: binary valued column: 1 for students who attend a selective college, 0 for students that do not

- `sex`: sex of the student
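
As part of the exploration, it can help to look at how the response is distributed. The sketch below rebuilds only the `field` and `Default` columns of the ten rows displayed above in a hypothetical data frame called `toy`, then counts the classes. On the real data you would apply the same calls to `df`.

```python
import pandas as pd

# the `field` and `Default` values of the ten rows shown by df.head(10)
toy = pd.DataFrame({
    "field": ["STEM", "HUMANITIES", "HUMANITIES", "STEM", "BUSINESS",
              "HUMANITIES", "STEM", "STEM", "STEM", "HUMANITIES"],
    "Default": ["No", "Yes", "No", "No", "Yes", "Yes", "No", "No", "No", "Yes"],
})

counts = toy["Default"].value_counts()                 # absolute class counts
share = toy["Default"].value_counts(normalize=True)    # class proportions
by_field = pd.crosstab(toy["field"], toy["Default"])   # defaults per field

print(counts)
print(share)
print(by_field)
```

Ten rows are far too few to draw conclusions from, but the same table on the full data set gives a first impression of which predictors might matter.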

[Back to top](#Index:)

<a id='part2'></a>

### Part 2: Translate the categorical predictors into numerical predictors


Most well-established machine learning systems handle categorical variables natively. However, when dealing with decision trees using `scikit-learn`, you need to encode (translate) categorical features into numerical features.

Arguably, the easiest way to achieve this is by using the `pandas` function `get_dummies()` that converts categorical variables into dummy/indicator variables.


Complete the code cell below by using the data frame `df`.

In [5]:
#encode categorical variables
df = pd.get_dummies(df)
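
To see what `get_dummies()` does, here is a small sketch on a hypothetical three-row data frame `toy` that borrows values from the rows shown earlier. Numerical columns are left untouched, and each categorical column is replaced by one indicator column per category. Depending on the `pandas` version, the indicator columns are of `bool` or integer type, which is why the notebook converts them explicitly.

```python
import pandas as pd

toy = pd.DataFrame({
    "loanAmount": [23159.58, 47498.06, 29637.52],
    "sex": ["Male", "Male", "Female"],
    "Default": ["No", "Yes", "No"],
})

encoded = pd.get_dummies(toy)
print(list(encoded.columns))
# ['loanAmount', 'sex_Female', 'sex_Male', 'Default_No', 'Default_Yes']

# convert the response indicator to 0/1 regardless of the pandas version
print(encoded["Default_Yes"].astype(int).tolist())
# [0, 1, 0]
```

Note that `Default_No` and `Default_Yes` carry the same information, so one of them is redundant.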

Because you are only interested in whether a student defaults on their loan, you only need to keep the column `Default_Yes`.

Complete the code cells below by using the functions:

- `select_dtypes()` to select the `bool` columns, then `astype()` to convert them to the `int` data type.

- `drop()` on `df` to eliminate the *column* `Default_No`. The `axis` parameter in `.drop()` controls whether the function acts on rows or columns.

- Store the `Default_Yes` column, now of `int` data type, in the variable `y`.

Run the code cell below to visualise the new dataframe with the encoded columns.

In [7]:
bool_cols = df.select_dtypes(include='bool').columns
df[bool_cols] = df[bool_cols].astype(int)

In [8]:
df = df.drop('Default_No', axis=1)
y = df['Default_Yes']

[Back to top](#Index:)

<a id='part3'></a>

### Part 3: Shuffle the data set

Convert the data frame `df` to a NumPy array and ensure that the last column, which holds the target labels, contains integer values. This is often done to prepare target labels for machine learning tasks.

In [11]:
Xy = df.to_numpy()
Xy[:,-1] = y.to_numpy().astype(int)
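
A detail worth knowing here: a NumPy array has a single data type, so a data frame that mixes integer and floating-point columns becomes a `float64` array. The sketch below illustrates this on a hypothetical two-row frame `toy`. It is also the reason the labels are cast back with `.astype(int)` when `y` is extracted from the array later on.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "graduationYear": [2006, 2010],         # integer column
    "loanAmount": [23159.58, 47498.06],     # floating-point column
    "Default_Yes": [0, 1],                  # integer column
})

arr = toy.to_numpy()
print(arr.dtype)   # float64: the common type of all columns

# the labels in the last column are stored as 0.0 and 1.0, so cast them back
labels = arr[:, -1].astype(int)
print(labels)      # [0 1]
```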

For reproducibility, set the random `seed = 2`. You can do this by using the `NumPy` function `random.seed()`.

Assign your seed to the variable `seed`.

Next, complete the code cell below by using the function `random.shuffle()` on `Xy`.

In [12]:
seed = 2
np.random.seed(seed)   # seed() returns None, so store the value in seed first
np.random.shuffle(Xy)  # shuffles the rows of Xy in place
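
The following sketch checks two properties of this step on a hypothetical toy array: shuffling with the same seed gives the same order every time, and `random.shuffle()` only reorders whole rows, so each response stays attached to its predictors. The names `seed_value`, `Xy_toy` and `Xy_repeat` are illustrative.

```python
import numpy as np

seed_value = 2                          # same seed as in the notebook
Xy_toy = np.arange(12).reshape(4, 3)    # four rows, three columns
original = Xy_toy.copy()

np.random.seed(seed_value)
np.random.shuffle(Xy_toy)               # reorders the rows in place, returns None
first_run = Xy_toy.copy()

Xy_repeat = original.copy()
np.random.seed(seed_value)
np.random.shuffle(Xy_repeat)

# same seed, same row order
print(np.array_equal(first_run, Xy_repeat))
# the rows themselves are unchanged, only their order differs
print(sorted(map(tuple, first_run)) == sorted(map(tuple, original)))
```

Both checks print `True`.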

Before splitting the data into a training set, a validation set and a test set, you need to divide `Xy` into two arrays: the first one, `X`, is a 2D array containing all the predictors, and the second one, `y`, is a 1D array containing the response.

Run the code cell below to generate `X`. Complete the remaining code to define `y`.

In [13]:
X=Xy[:,:-1]

In [21]:
y = Xy[:,-1].astype(int)

Because you need to split the data into sets of specific sizes according to the instructions given above, it is useful to know how big `X` and `y` are.

Run the code cell below to retrieve this information.

In [15]:
print(len(X))
print(len(y))

2000
2000


Next, you need to split the data into 50% training data, 25% validation data and 25% test data.

Run the code below to split `X` into training, validation and test sets.

In [16]:
trainsize = 1000
trainplusvalsize = 500  # despite its name, this is the size of the validation set
X_train=X[:trainsize]
X_val=X[trainsize:trainsize + trainplusvalsize]
X_test=X[trainsize + trainplusvalsize:]


Following the same syntax, complete the cell below to split `y` into a training set, a validation set and a test set.

**Hint:** Remember that `y` is a 1D array!

In [23]:
y_train = y[:trainsize]
y_val   = y[trainsize:trainsize + trainplusvalsize]
y_test  = y[trainsize + trainplusvalsize:]
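
The hard-coded sizes above work because the data set has exactly 2000 rows. As a sketch of a more general approach, the hypothetical helper `three_way_split` below derives the slice boundaries from the requested fractions, assuming the array has already been shuffled.

```python
import numpy as np

def three_way_split(arr, train_frac=0.5, val_frac=0.25):
    """Slice an already shuffled array into training, validation and test parts."""
    n = len(arr)
    train_end = int(n * train_frac)
    val_end = train_end + int(n * val_frac)
    return arr[:train_end], arr[train_end:val_end], arr[val_end:]

# stand-in for the 2000 responses of the notebook
y_demo = np.arange(2000)
y_tr, y_va, y_te = three_way_split(y_demo)
print(len(y_tr), len(y_va), len(y_te))   # 1000 500 500
```

Applying the same helper to `X` and `y` keeps the rows aligned, because both are sliced at the same positions.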

[Back to top](#Index:)

<a id='part4'></a>

### Part 4: Calculate the accuracy of the Naïve benchmark on the validation set

In this part, you want to calculate the accuracy of the Naïve benchmark on both the training and validation sets. In other words, you want to understand how accurate your predictions would be, assuming that no one defaulted on their student loans.

Accuracy can be computed by comparing the actual values with the predicted values. Because the Naïve benchmark always predicts 0 (no default), its accuracy equals the proportion of students who did not default. In this example, the formulae to compute accuracy are:


$$\text{acc_train} = 1 - \frac{\sum{\text{y_train}}}{\text{len(y_train)}},$$

$$ \text{acc_val} = 1 - \frac{\sum{\text{y_val}}}{\text{len(y_val)}}.$$

Note that $\frac{\sum{\text{y_train}}}{\text{len(y_train)}}$ reflects the proportion of students who defaulted on their loan in the training set, and $\frac{\sum{\text{y_val}}}{\text{len(y_val)}}$ reflects the proportion of students who defaulted on their loan in the validation set.

Compute the required accuracy in the code cell below.

In [18]:
acc_train = 1 - sum(y_train)/len(y_train)
acc_val   = 1 - sum(y_val)/len(y_val)

Run the code cell below to print the results to screen. What can you say about the baseline accuracy if you predict that no students defaulted (i.e., everyone belongs to the majority class)?

In [19]:
print ( 'Naïve guess train and validation', acc_train , acc_val)

Naïve guess train and validation 0.778 0.75
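
The formula above relies on "no default" being the majority class. As a small illustration on hypothetical labels `y_toy`, the sketch below computes the benchmark both with the formula from this part and in a more general way that first identifies the majority class.

```python
import numpy as np

# hypothetical labels: 1 = default, 0 = no default
y_toy = np.array([0, 0, 0, 1, 0, 1, 0, 0])

# formula used in this part: one minus the proportion of defaults
naive_acc = 1 - y_toy.sum() / len(y_toy)

# more general version: always predict the most frequent class
majority = np.bincount(y_toy).argmax()
naive_acc_general = np.mean(y_toy == majority)

print(naive_acc, majority, naive_acc_general)   # 0.75 0 0.75
```

Any classifier you train should beat this benchmark on the validation set to be considered useful.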


[Back to top](#Index:)

<a id='part5'></a>

### Part 5: Train a decision tree using the default settings

The easiest way to create a decision tree model is by using the function `DecisionTreeClassifier()`. This function is part of the `tree` module of `Scikit-learn` (`sklearn`).

Later, you will explore ways to improve the accuracy of the tree. For now, let's build a classifier using the default settings.

In the code cell below, use `DecisionTreeClassifier()` to define a classifier `clf` .

Next, use the method `fit()` of your classifier to fit your training sets, `X_train` and `y_train`.

In [24]:
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()
print(f"y_train dtype before fit: {y_train.dtype}")
# Fit the classifier to X_train and y_train
clf.fit(X_train, y_train)

y_train dtype before fit: int64


Run the code cell below to display the scores on the training and validation sets.

In [25]:
print('Full tree guess train/validation',
      clf.score(X_train, y_train),
      clf.score(X_val, y_val))

Full tree guess train/validation 1.0 1.0
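
With the default settings, the tree grows until its leaves are pure, which is why a default tree usually reaches a perfect training score. The sketch below illustrates this on synthetic data in which the labels are unrelated to the features, and inspects the size of the resulting tree with `get_depth()` and `get_n_leaves()`. The data and variable names are hypothetical and are not part of the assignment.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))        # three continuous features
y_toy = rng.integers(0, 2, size=200)     # labels drawn independently of the features

X_tr, y_tr = X_toy[:100], y_toy[:100]
X_va, y_va = X_toy[100:], y_toy[100:]

full_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# size of the unrestricted tree
print(full_tree.get_depth(), full_tree.get_n_leaves())
# the tree memorises the training set, but has nothing real to learn
print(full_tree.score(X_tr, y_tr), full_tree.score(X_va, y_va))
```

The training score is 1.0 even though the labels are pure noise, so a high training score on its own says little about how well a tree generalises. The validation score is the one to watch.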


[Back to top](#Index:)

<a id='part6'></a>

### Part 6: Train a decision tree using different maximum depths for the tree

One way to optimise the decision tree algorithm is by adjusting the maximum depth of the tree. This process is an example of pre-pruning.

In the following example, you will compute the scores for decision trees on the same data with maximum depths ranging from 1 to `max_depth = 15`.

You will begin by defining the variables `bestdepth` and `bestscore`, assuming the *worst case scenario*. Run the code cell below to initialise the variables as desired.

In [26]:
bestdepth=-1
bestscore=0
max_depth = 15

Next, write a loop to progressively compute the new train/validation scores for different depths.

Here is the pseudocode for the for loop you will need to implement:

```python
for i in range(max_depth):
    # define a new classifier clf with maximum depth i+1
    # fit the classifier to the X and y training sets
    # compute the updated trainscore using .score() on the training set
    # compute the updated valscore using .score() on the validation set
    # print the scores
    print('Depth:', i+1, 'Training Score:', trainscore, 'Validation Score:', valscore)

    # if valscore is better than bestscore:
        # update the value of bestscore
        # set bestdepth to the current depth, i+1
```

In [27]:
for i in range(max_depth):
    # define a classifier with maximum depth i+1
    clf = DecisionTreeClassifier(max_depth=i+1)
    # fit the training sets
    clf.fit(X_train, y_train)
    # update trainscore
    trainscore = clf.score(X_train, y_train)
    # update valscore
    valscore = clf.score(X_val, y_val)
    print('Depth:', i+1, 'Train Score:', trainscore, 'Validation Score:', valscore)
    if valscore > bestscore:
        # update bestscore
        bestscore = valscore
        # record the depth that achieved it
        bestdepth = i+1

Depth: 1 Train Score: 1.0 Validation Score: 1.0
Depth: 2 Train Score: 1.0 Validation Score: 1.0
Depth: 3 Train Score: 1.0 Validation Score: 1.0
Depth: 4 Train Score: 1.0 Validation Score: 1.0
Depth: 5 Train Score: 1.0 Validation Score: 1.0
Depth: 6 Train Score: 1.0 Validation Score: 1.0
Depth: 7 Train Score: 1.0 Validation Score: 1.0
Depth: 8 Train Score: 1.0 Validation Score: 1.0
Depth: 9 Train Score: 1.0 Validation Score: 1.0
Depth: 10 Train Score: 1.0 Validation Score: 1.0
Depth: 11 Train Score: 1.0 Validation Score: 1.0
Depth: 12 Train Score: 1.0 Validation Score: 1.0
Depth: 13 Train Score: 1.0 Validation Score: 1.0
Depth: 14 Train Score: 1.0 Validation Score: 1.0
Depth: 15 Train Score: 1.0 Validation Score: 1.0


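Choose the most appropriate tree depth and justify your choice. A perfect validation score at every depth, including depth 1, is unusual for real data, so it is worth checking that no column of `X` encodes the response.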
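
When several depths share the best validation score, the strict comparison `valscore > bestscore` keeps the first, and therefore shallowest, of them. The sketch below isolates that selection rule in a hypothetical helper `pick_depth` and applies it to two sets of made-up validation scores.

```python
def pick_depth(val_scores):
    """Mimic the selection rule of the loop: keep a depth only if it is strictly better."""
    best_depth, best_score = -1, 0
    for depth in sorted(val_scores):
        if val_scores[depth] > best_score:
            best_score = val_scores[depth]
            best_depth = depth
    return best_depth, best_score

# every depth ties, as in the output above: the shallowest depth is kept
print(pick_depth({1: 1.0, 2: 1.0, 3: 1.0}))                  # (1, 1.0)

# made-up scores with a tie between depths 2 and 3: depth 2 is kept
print(pick_depth({1: 0.78, 2: 0.81, 3: 0.81, 4: 0.79}))      # (2, 0.81)
```

Preferring the shallowest tree among equally good candidates is a reasonable default, because a simpler tree is easier to interpret and less prone to overfitting.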



[Back to top](#Index:)

<a id='part7'></a>

### Part 7: Retrain the best classifier using all the samples

For the last part of this assignment, retrain the best classifier using all the samples from the training and the validation sets *together*.

Begin by defining the arrays `X_trainval` and `y_trainval`. Below, the array `X_trainval` has been defined for you.

In [28]:
X_trainval = X[:trainsize + trainplusvalsize]


Following the syntax given above, define `y_trainval`.

Again, remember that `y` is a 1D array!

In [30]:
y_trainval = y[:trainsize + trainplusvalsize]


To retrain the best classifier, re-define `clf` using `DecisionTreeClassifier()` with `max_depth` equal to the `bestdepth` computed in Part 6.

Next, fit the classifier to the sets just defined above.

Complete the code cell below:

In [29]:
clf = DecisionTreeClassifier(max_depth=bestdepth)
clf.fit(X_trainval, y_trainval)

Finally, evaluate the retrained classifier on the test set.

Do so by using the function `score()` to compute the score on the test set. Assign the result to `test_score`.

As a last step, you can retrain the best classifier on all samples (including the test set). Please note that once you include the test set in training, its score is no longer a good predictor of the accuracy of the model. Including the test set in training can, however, improve the generalisability of the model.




In [31]:
test_score = clf.score(X_test, y_test)


In [32]:
print('testing set score', test_score)

testing set score 1.0


What do you observe?
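
The retraining on all samples (including the test set) and the description of the resulting tree could be sketched as follows. This is an illustration on synthetic data, not the solution for `loandata.csv`: the arrays, the feature names `x0` and `x1`, and the value of `chosen_depth` are assumptions made for the example. The tree is printed with `export_text()` from `sklearn.tree`.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(1)
X_all = rng.normal(size=(400, 2))
y_all = (X_all[:, 0] > 0).astype(int)    # synthetic rule: the label depends on x0 only

X_trval, y_trval = X_all[:300], y_all[:300]   # training and validation samples together
X_te, y_te = X_all[300:], y_all[300:]         # held-out test samples

chosen_depth = 1   # stand-in for the depth selected on the validation set

# step 1: score on the test set while it is still unseen
clf_eval = DecisionTreeClassifier(max_depth=chosen_depth, random_state=0)
clf_eval.fit(X_trval, y_trval)
honest_test_score = clf_eval.score(X_te, y_te)

# step 2: retrain on every sample; no unbiased score is available afterwards
final_clf = DecisionTreeClassifier(max_depth=chosen_depth, random_state=0)
final_clf.fit(X_all, y_all)

description = export_text(final_clf, feature_names=["x0", "x1"])
print(honest_test_score)
print(description)
```

Recording the test score before the final retraining keeps one honest estimate of performance, while the printed rules give a compact description of the tree that would be deployed.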

