# Classification and Regression Tree

$\underline{Problem Statement}$: Given the independent variables, we have to classify each observation using 'Job_Security' as the target variable.

In [None]:
# Importing the necessary Modules
import pandas as pd   
import numpy as np   
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
import os
os.getcwd() 

In [None]:
df = pd.read_csv("C:/Users/KIIT/Downloads/covid_impact_on_work new.csv")
df.head()

Let us now check for the missing values.

In [None]:
df.isnull().sum()

There are no missing values in the dataset and we can go ahead with building the model.

Let us check the number of rows and the number of columns in the dataframe.

In [None]:
print('The number of rows (observations) is:',df.shape[0],'\n''The number of columns(variables) is:',df.shape[1])

In [None]:
## We will drop the 'Sector', 'Childcare_Responsibilities' and 'Team_Collaboration_Challenges' variables.

In [None]:
df=df.drop(['Sector','Childcare_Responsibilities','Team_Collaboration_Challenges'], axis=1)
df.head()

Let us check the data types of each of the variables in the data.

In [None]:
df.dtypes

There are three variables (Stress_Level, Hours_Worked_Per_Day & Meetings_Per_Day) which have the object data type.

sklearn does not accept object (string) columns as input when building Classification Trees, so we need to convert these variables into some numerical form.

We can convert objects into categories when a variable has only a few levels, like Stress Level (Low / Medium / High), or when the variable is ordinal in nature, so that the assigned numbers represent the ranking of the levels.

The category data type in pandas is a $\underline{hybrid}$ data type. It looks and behaves like a string in many instances but internally is represented by an array of integers. Those integer codes can be passed to the model as a numerical variable.


If the category is nominal, with no natural ordering between its levels, one hot encoding is the recommended way forward.

The following code shows how to convert an 'object' type variable into a category (hybrid data type) and then into a numerical variable by assigning a number to each level. We are not using it in this case: our categorical variables have multiple levels, so we use one hot encoding instead.

In [None]:
#We could use the following code snippet in the loop. 
#df['Stress_Level']=pd.Categorical(df['Stress_Level']).codes #code used for assigning numerical value to each category
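To make the difference between the two encodings concrete, here is a minimal sketch on a hypothetical toy column (the level names Low / Medium / High are assumed from the text above, not read from the dataset). It shows that the default category codes follow alphabetical order, that an explicit level order is needed if the numbers are meant to express a ranking, and what one hot encoding produces instead.

```python
import pandas as pd

# Hypothetical toy column; the level names are assumed for illustration only.
toy = pd.DataFrame({"Stress_Level": ["Low", "High", "Medium", "Low"]})

# Default codes follow the alphabetical order of the levels: High=0, Low=1, Medium=2.
default_codes = pd.Categorical(toy["Stress_Level"]).codes

# Passing the levels explicitly keeps the intended ranking: Low=0, Medium=1, High=2.
ordered = pd.Categorical(toy["Stress_Level"],
                         categories=["Low", "Medium", "High"],
                         ordered=True)
ordered_codes = ordered.codes

# One hot encoding creates one indicator column per level instead of a rank.
one_hot = pd.get_dummies(toy["Stress_Level"], prefix="Stress_Level")

print(list(default_codes))
print(list(ordered_codes))
print(list(one_hot.columns))
```

If the codes are used for an ordinal variable, the explicit `categories` list is the safer choice, because the alphabetical default would rank High below Low.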

In [None]:
#We are prepping the data by separating it into the target and the independent variables used to run this model going forward

# Let us define the X(predictor) and Y(target) variables

X = df.drop("Job_Security" , axis=1)

Y = df.pop("Job_Security")
#Note: the 'pop' function removes that particular variable from the data frame and returns it,
# so df no longer contains 'Job_Security' after this line

In [None]:
# This line of code is to perform one hot encoding for Categorical Features
X = pd.get_dummies( X, drop_first = False )

In [None]:
list(X.columns)

In [None]:
X.dtypes

We see that the data types of all the variables are now either $\underline{int64}$ or $\underline{bool}$.

In [None]:
#Notice that one hot encoding converts each level of a categorical variable into a separate column with binary values
X.head()

## Split the data into Train and Test.

Before building the model we should split the data into Train and Test. We will thus build a model on the training data and use this model to predict on the test data.

We will be doing a 70:30 split.
70% of the whole data will be used to train the model and the remaining 30% will be used for testing the model thus built.

Before splitting the data, we shall make a copy of the data frame.

In [None]:
data = df.copy()
data.head()
#Just keeping a copy of the data frame (it no longer contains 'Job_Security', which was popped above).

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=.30, random_state=1)
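The split above is purely random. As an optional variant that this notebook does not use, the `stratify` argument of `train_test_split` keeps the class proportions of the target the same in both parts. A minimal sketch on a hypothetical imbalanced toy target (the names `X_toy`, `y_toy` and the 80/20 class balance are assumptions made for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 80 observations of class 0 and 20 of class 1.
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)

# Same 70:30 split as above, but stratified on the target.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=1, stratify=y_toy)

print(len(y_tr), len(y_te))      # sizes of the two parts
print(y_tr.mean(), y_te.mean())  # share of class 1 in each part
```

Whether stratification matters here depends on how balanced 'Job_Security' is, which can be checked with `Y.value_counts()`.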

Now that we have split the data into Train and Test, let us go ahead and build our Decision Tree Model.

## Building the Decision Tree

We will start by building a very basic Decision Tree model.

In [None]:
from sklearn import tree

In [None]:
dt_model = tree.DecisionTreeClassifier(criterion = 'gini',random_state=1)

In the above code snippet we have defined a Decision Tree (which is to be used for classification problems) with the splitting criterion for each node as 'gini'. The 'random_state' parameter ensures that each time we run the code snippet the results remain the same.

In the above code snippet the default values of 'min_samples_split' and 'min_samples_leaf' are used, which are 2 and 1 respectively.
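To see what the 'gini' criterion measures, here is a small worked sketch in plain Python. The helper names `gini_impurity` and `weighted_gini` and the toy labels are hypothetical and are not part of sklearn. The impurity of a node is $1 - \sum_k p_k^2$, and a split is scored by the size-weighted impurity of its two child nodes.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity 1 - sum(p_k^2) for a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_gini(left, right):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)

# Toy node with 5 'Yes' and 5 'No', split into two children.
parent = ["Yes"] * 5 + ["No"] * 5
left = ["Yes"] * 4 + ["No"] * 1
right = ["Yes"] * 1 + ["No"] * 4

parent_gini = gini_impurity(parent)       # 1 - (0.5^2 + 0.5^2) = 0.5
split_gini = weighted_gini(left, right)   # 0.5 * 0.32 + 0.5 * 0.32 = 0.32
reduction = parent_gini - split_gini      # 0.18

print(round(parent_gini, 2), round(split_gini, 2), round(reduction, 2))
```

At each node the tree picks the split with the largest such reduction among the candidate splits it considers.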

Now that we have defined a Decision Tree, let us go ahead and build the model on the training data.

In [None]:
dt_model.fit(X_train, Y_train)

## Visualizing the Decision Tree

Now that we have built the tree, let us go ahead and visualize it to understand the various nuances of the Classification Tree that we just built.

In [None]:
train_char_label = ['No', 'Yes']# class names for the target variable, for ease of reading the tree;
# the order must match the sorted class labels in dt_model.classes_

Now, we need to create a dot file which will contain all the instructions on how to build the graphical visualization of the Classification Tree that we have built.

In [None]:
os.getcwd()

In [None]:
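dot_data = tree.export_graphviz(dt_model, #passing the model that we had built earlier
                                feature_names = list(X_train), #names of the independent variables
                                class_names = list(train_char_label)) #passing the names of the classes that we had defined
#Note: in recent scikit-learn versions 'out_file' defaults to None, so this call returns the DOT source as a string;
#pass out_file=<path to a .dot file> if the file is needed on disk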


In [None]:
#If the below two libraries are not installed do install them using the following code snippet in the Jupyter Notebook
# !pip install 'package name'
import pydotplus
import graphviz

In [None]:
#Optional - Install only if the prev code throws error else skip running this line
!pip install pydotplus

In [None]:
#Optional - Install only if the prev code throws error else skip running this line
!pip install graphviz

In [None]:
dot_data = tree.export_graphviz(dt_model, out_file=None,  #passing the model built and setting the output to None as we do
                                #not need the dot file separately to visualize the graph
                         feature_names=list(X_train),    #names of the independent variables 
                         class_names=list(train_char_label),  
                         filled=True)                    #colours the nodes for classification for the ease of visualization

graph = pydotplus.graph_from_dot_data(dot_data) #extracting the visuals from the above file to plot it
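Rendering the `graph` object requires the Graphviz binaries to be installed on the machine. As a lighter-weight way to inspect the splits, sklearn also provides `export_text`, which prints the tree as indented rules. The sketch below uses a synthetic dataset from `make_classification` and hypothetical feature names, so it illustrates the call rather than the tree built above.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data; the feature names are hypothetical.
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=1)
demo_names = [f"feature_{i}" for i in range(4)]

# A shallow tree keeps the printed rules short.
demo_tree = DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=1)
demo_tree.fit(X_demo, y_demo)

# One line per node: the split condition, or the predicted class at a leaf.
rules = export_text(demo_tree, feature_names=demo_names)
print(rules)
```

For the model in this notebook the equivalent call would be `export_text(dt_model, feature_names=list(X_train))`, although an unpruned tree produces a very long listing.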

Let us check the importance of the variables in the Classification Tree that we just built. The importance of a feature or variable is computed as the (normalized) total reduction of the gini criterion brought by that feature. It is also known as the Gini importance.

In [None]:
pd.Series(dt_model.feature_importances_,index=X_train.columns).sort_values(ascending=False)

From the above output, we can see that 'Health_Issue' is the most important variable followed by 'Meetings_Per_Day_5.207.623.357.656.770' and so on.
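To illustrate how these importances behave, here is a minimal sketch on a hypothetical toy dataset (the column names `decisive` and `noise` are made up). One column fully determines the target and the other carries no information, so the whole impurity reduction is credited to the first column and the importances sum to 1.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: 'decisive' equals the target, 'noise' is unrelated to it.
toy_X = pd.DataFrame({
    "decisive": [0, 0, 1, 1, 0, 0, 1, 1],
    "noise":    [0, 1, 0, 1, 0, 1, 0, 1],
})
toy_y = toy_X["decisive"]

toy_tree = DecisionTreeClassifier(criterion="gini", random_state=1).fit(toy_X, toy_y)

# Normalized total reduction of the gini criterion per feature.
importances = pd.Series(toy_tree.feature_importances_, index=toy_X.columns)
print(importances.sort_values(ascending=False))
```

Note that after one hot encoding, the importance of an original categorical variable is spread over its dummy columns, so the values of those columns need to be added up to judge the variable as a whole.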

Let us take a look at the overall accuracy of the train and test data using the model that we just built.

In [None]:
#Train Data
dt_model.score(X_train,Y_train)

In [None]:
#Test Data
dt_model.score(X_test,Y_test)

The accuracy on the Training Data is 100% and the accuracy on the Test Data is substantially lower, which indicates that the model has overfitted the training data.
Thus, we need to prune or regularize the tree.

## Pruning/Regularizing the Tree

To prune/regularize the tree we need to decide which parameters to constrain and which values to give them.

### Method 1 for Pruning:

#### (by visualizing the tree)

In [None]:
reg_dt_model = tree.DecisionTreeClassifier(criterion = 'gini', 
                                       max_depth=13,         #upto this depth is where the tree has grown uniformly
                                      min_samples_leaf=30,   #ensures that every terminal node (leaf node) has at least 30
                                      #observations in it
                                      min_samples_split=10)  #a node needs at least 10 observations to be considered for a
                                      #split; with min_samples_leaf=30 a split in practice needs at least 60
    
#General thumb rule: 1% to 3% of the data should be the 'min_samples_split' and one third times the 'min_samples_split'
#should be 'min_samples_leaf'. These are only rough guideline values.
#Following that rule, 1% of a training set with 10,000 observations would give 100 observations for min_samples_split

In [None]:
reg_dt_model.fit(X_train, Y_train)
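Choosing these values by eye is one option. Another common option, which this notebook does not use, is to let cross-validation pick them from a small grid with `GridSearchCV`. The following is only a sketch on synthetic data from `make_classification`; the grid values are assumptions chosen for illustration and are not tuned for this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data.
X_cv, y_cv = make_classification(n_samples=500, n_features=8, random_state=1)

# Hypothetical candidate values for the three regularization parameters.
param_grid = {
    "max_depth": [3, 5, 8],
    "min_samples_leaf": [5, 10, 30],
    "min_samples_split": [20, 60],
}

# 3-fold cross-validated accuracy for every combination in the grid.
search = GridSearchCV(
    DecisionTreeClassifier(criterion="gini", random_state=1),
    param_grid=param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_cv, y_cv)

print(search.best_params_)
print(round(search.best_score_, 3))
```

On the real data the same search would be run with `search.fit(X_train, Y_train)`, and `search.best_estimator_` would then take the place of the hand-tuned tree.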

Now that we have built the Pruned/Regularized Classification Tree let us visualize the tree to understand the nuances of the tree.

### Method 1:

In [None]:
# A dot file, if one is written, is saved relative to the working directory printed below
os.getcwd()

In [None]:
dot_data = tree.export_graphviz(reg_dt_model, #passing the model that we had built earlier
                                feature_names = list(X_train), #names of the independent variables
                                class_names = list(train_char_label)) #passing the names of the classes that we had defined

### Method 2:

In [None]:
#The following is a modified version of the above code that keeps the dot data in memory instead of creating a dot file

dot_data = tree.export_graphviz(reg_dt_model, out_file=None,#passing the model built and setting the output to None as we do
                                #not need the dot file separately to visualize the graph
                         feature_names=list(X_train), #names of the independent variables 
                         class_names=list(train_char_label),  
                         filled=True) #colours the nodes for classification for the ease of visualization

graph = pydotplus.graph_from_dot_data(dot_data) #extracting the visuals from the above file to plot it

Let us now go ahead and predict both the classes and the probability values on the test data using the Pruned/Regularized Decision Tree.

In [None]:
#We are only predicting the classes over here. 'predict' returns the class with the highest predicted probability,
#which for a two-class target is equivalent to a 0.5 cutoff on the probability values

Y_train_predict_class = reg_dt_model.predict(X_train)
Y_test_predict_class = reg_dt_model.predict(X_test)

In [None]:
#Here, we are predicting the probabilities, to which we can manually apply a cutoff value different from 0.5.

Y_train_predict_prob = reg_dt_model.predict_proba(X_train)
Y_test_predict_prob = reg_dt_model.predict_proba(X_test)
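Applying a manual cutoff to such probabilities can be sketched as follows. The array `proba` is a hypothetical stand-in for the output of `predict_proba`, and `classify_with_cutoff` is a made-up helper name. The sketch assumes that column 1 holds the probability of the positive class; the actual column order follows `reg_dt_model.classes_`.

```python
import numpy as np

# Hypothetical predict_proba output: column 0 = P(class 0), column 1 = P(class 1).
proba = np.array([
    [0.80, 0.20],
    [0.55, 0.45],
    [0.35, 0.65],
    [0.10, 0.90],
])

def classify_with_cutoff(proba, cutoff=0.5):
    """Label an observation 1 when P(class 1) reaches the cutoff, else 0."""
    return (proba[:, 1] >= cutoff).astype(int)

default_labels = classify_with_cutoff(proba)               # 0.5 cutoff
lenient_labels = classify_with_cutoff(proba, cutoff=0.40)  # lower cutoff, more 1s

print(default_labels)
print(lenient_labels)
```

Lowering the cutoff turns more observations into predicted positives, which typically raises recall for that class at the cost of precision.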

## Evaluation of the Model on the Train and Test Data

Let us first build the confusion matrix, followed by the Classification Report

In [None]:
from sklearn import metrics

First we will evaluate the model on the Training Data.

In [None]:
print(metrics.confusion_matrix(Y_train,Y_train_predict_class))

In [None]:
tn, fp, fn, tp = metrics.confusion_matrix(Y_train,Y_train_predict_class).ravel()
print('True Negative:',tn,'\n''False Positives:' ,fp,'\n''False Negatives:', fn,'\n''True Positives:', tp)

Let us now go ahead and print the classification report to check the other metrics (precision, recall and F1-score).

In [None]:
print(metrics.classification_report(Y_train,Y_train_predict_class))

We have been able to predict about 60% of the training observations correctly.
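The figures in the classification report all derive from the four confusion matrix counts. The sketch below recomputes them by hand on hypothetical toy labels (not the notebook's data), treating class 1 as the positive class.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels.
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]

# With labels=[0, 1], ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)          # share of correct predictions
precision = tp / (tp + fp)                          # predicted positives that are right
recall = tp / (tp + fn)                             # actual positives that are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(tn, fp, fn, tp)
print(round(accuracy, 2), round(precision, 2), round(recall, 2), round(f1, 2))
```

The 'accuracy' row of the report corresponds to the first formula, while precision, recall and F1-score are reported separately for each class.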

Let us check the confusion matrix for the test data.

In [None]:
print(metrics.confusion_matrix(Y_test,Y_test_predict_class),'\n')
tn, fp, fn, tp = metrics.confusion_matrix(Y_test,Y_test_predict_class).ravel()
print('True Negative:',tn,'\n''False Positives:' ,fp,'\n''False Negatives:', fn,'\n''True Positives:', tp)

Let us now go ahead and print the classification report for the test data and compare between train and test.

In [None]:
print(metrics.classification_report(Y_test,Y_test_predict_class))

In [None]:

# Import necessary libraries
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

# Reuse the one hot encoded features X and the target Y defined earlier.
# 'Job_Security' was popped from df above, so it can no longer be selected from df.
y = Y

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train Decision Tree Classifier
dt_model = DecisionTreeClassifier(criterion='gini', random_state=1)
dt_model.fit(X_train, y_train)

# Make predictions
y_train_pred = dt_model.predict(X_train)
y_test_pred = dt_model.predict(X_test)

# Compute accuracy
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)

# Plot Accuracy Graph
plt.figure(figsize=(6, 4))
plt.bar(['Train Accuracy', 'Test Accuracy'], [train_accuracy, test_accuracy], color=['blue', 'green'])
plt.xlabel('Dataset')
plt.ylabel('Accuracy Score')
plt.title('Decision Tree Model Accuracy')
plt.ylim(0, 1)
plt.show()

# Print Accuracy Scores
print(f"Training Accuracy: {train_accuracy:.2f}")
print(f"Testing Accuracy: {test_accuracy:.2f}")


In [None]:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Reuse the one hot encoded features X and the target Y defined earlier
# ('Job_Security' was popped from df above, so df no longer contains it)
y = Y

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Store accuracy scores at different depths
train_accuracies = []
test_accuracies = []
depths = range(1, 21)  # Testing depths from 1 to 20

# Train model at different depths
for depth in depths:
    dt_model = DecisionTreeClassifier(max_depth=depth, random_state=1)
    dt_model.fit(X_train, y_train)

    # Compute accuracy
    train_acc = accuracy_score(y_train, dt_model.predict(X_train))
    test_acc = accuracy_score(y_test, dt_model.predict(X_test))

    train_accuracies.append(train_acc)
    test_accuracies.append(test_acc)

# Plot Accuracy Curve
plt.figure(figsize=(8, 5))
plt.plot(depths, train_accuracies, label="Training Accuracy", marker='o', linestyle='-')
plt.plot(depths, test_accuracies, label="Testing Accuracy", marker='s', linestyle='--')
plt.xlabel("Decision Tree Depth")
plt.ylabel("Accuracy")
plt.title("Model Accuracy vs. Tree Depth")
plt.legend()
plt.grid()
plt.show()
