<a href="https://colab.research.google.com/github/KorKanticha/SeniorProject/blob/main/Optimized_DecisionTree.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Optimized Decision Tree

# Base Model


In [72]:
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier # Import Decision Tree Classifier
from sklearn.model_selection import train_test_split # Import train_test_split function
from sklearn import metrics #Import scikit-learn metrics module for accuracy calculation
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [277]:
# Load the dataset from the Excel workbook on Google Drive
xls_Suicide = pd.ExcelFile("/content/gdrive/MyDrive/SeniorProject_KorBoss/Dummy_data3.xlsx")
df = pd.read_excel(xls_Suicide)  # reads the first sheet by default

In [263]:
# Build the feature list: all columns except the target ("Success") and the
# province dummies for unknown ("ไม่ทราบ" = unknown), code 99, and blank values
list_col = list(df.columns)
feature_cols = list(list_col)
feature_cols.remove("Success")
feature_cols.remove("Province_Happen_ไม่ทราบ")
feature_cols.remove("Province_Happen_99")
feature_cols.remove("Province_Happen_ ")
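
Note that list.remove raises ValueError when the named column is absent, for example if another export of the data lacks one of the dummy columns. A minimal sketch of a more tolerant way to build the same list, assuming a small made-up set of column names (Age and Sex below are hypothetical, not taken from the dataset):

```python
# Hypothetical column names standing in for df.columns
list_col = ["Age", "Sex", "Province_Happen_99", "Province_Happen_ ", "Success"]

# Target plus the dummy columns to leave out of the feature set
excluded = {"Success", "Province_Happen_ไม่ทราบ", "Province_Happen_99", "Province_Happen_ "}

# Keep the original column order; excluded names that are not present are simply ignored
feature_cols = [col for col in list_col if col not in excluded]
print(feature_cols)
```

The set lookup also keeps the exclusion list in one place, which makes it easier to review which columns were dropped.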

In [302]:

X = df[feature_cols] # Features
y = df.Success # Target variable

# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) # 70% training and 30% test

# Create Decision Tree classifier object
clf = DecisionTreeClassifier()

# Train Decision Tree Classifier
clf = clf.fit(X_train,y_train)

#Predict the response for test dataset
y_pred = clf.predict(X_test)

# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))


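from sklearn.metrics import precision_score
# Note: scikit-learn's signature is precision_score(y_true, y_pred). The recorded
# output below came from a call with the two arguments swapped, so its macro and
# per-class figures are recall values (compare the classification report).
micro_precision = precision_score(y_test, y_pred, average='micro')
print('Micro-averaged precision score: {0:0.2f}'.format(
      micro_precision))

macro_precision = precision_score(y_test, y_pred, average='macro')
print('Macro-averaged precision score: {0:0.2f}'.format(
      macro_precision))

per_class_precision = precision_score(y_test, y_pred, average=None)
print('Per-class precision score:', per_class_precision)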




Accuracy: 0.9554319415588768
Micro-averaged precision score: 0.96
Macro-averaged precision score: 0.95
Per-class precision score: [0.96347607 0.93737769]


In [303]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
confusion_matrix(y_test, y_pred)

array([[9945,  377],
       [ 288, 4311]])
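
The confusion matrix above is enough to recompute the per-class scores by hand. In scikit-learn's layout, rows are true classes and columns are predicted classes, so precision divides each diagonal entry by its column sum and recall divides it by its row sum. The sketch below uses only the four counts printed above. It also illustrates why the argument order of precision_score(y_true, y_pred) matters: swapping the arguments transposes the matrix, which turns precision into recall.

```python
# Counts copied from the confusion matrix above: rows = true class, columns = predicted class
cm = [[9945, 377],
      [288, 4311]]

n_classes = len(cm)
row_sums = [sum(row) for row in cm]  # samples per true class
col_sums = [sum(cm[r][c] for r in range(n_classes)) for c in range(n_classes)]  # samples per predicted class

precision = [cm[k][k] / col_sums[k] for k in range(n_classes)]
recall = [cm[k][k] / row_sums[k] for k in range(n_classes)]
accuracy = sum(cm[k][k] for k in range(n_classes)) / sum(row_sums)

print("precision:", [round(p, 4) for p in precision])
print("recall:   ", [round(r, 4) for r in recall])
print("accuracy: ", round(accuracy, 4))
```

The recall values computed here (about 0.9635 and 0.9374) are the numbers printed under "Per-class precision score" in the earlier output, while the precision values round to the 0.97 and 0.92 that appear in the classification report.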

In [304]:
print(classification_report(y_test,y_pred))


              precision    recall  f1-score   support

           0       0.97      0.96      0.97     10322
           1       0.92      0.94      0.93      4599

    accuracy                           0.96     14921
   macro avg       0.95      0.95      0.95     14921
weighted avg       0.96      0.96      0.96     14921
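
The macro avg and weighted avg rows of the report combine the per-class scores in two different ways: macro is the plain mean over classes, while weighted uses each class's support as its weight. A small sketch of that arithmetic for precision, recomputed from the confusion-matrix counts shown earlier:

```python
# Confusion-matrix counts from the base model: rows = true class, columns = predicted class
cm = [[9945, 377],
      [288, 4311]]

support = [sum(row) for row in cm]                   # true samples per class
predicted = [cm[0][k] + cm[1][k] for k in range(2)]  # predictions per class
precision = [cm[k][k] / predicted[k] for k in range(2)]

# Macro: unweighted mean; weighted: mean with class support as weights
macro_avg = sum(precision) / len(precision)
weighted_avg = sum(p * s for p, s in zip(precision, support)) / sum(support)

print(f"macro avg precision:    {macro_avg:.2f}")
print(f"weighted avg precision: {weighted_avg:.2f}")
```

Because class 0 has more than twice the support of class 1, the weighted average sits closer to the class 0 score than the macro average does.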



# Optimize Model

Step 1 - Import the library - GridSearchCV
Here we import the modules decomposition, datasets and tree, along with the classes Pipeline, StandardScaler and GridSearchCV, from different parts of scikit-learn. Each one is explained at the point where it is used in the code snippets below.
For now, just have a look at these imports.

In [305]:
from sklearn import decomposition, datasets
from sklearn import tree
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler

Step 2 - Setup the Data
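The code that loads scikit-learn's built-in wine dataset is kept below only as commented-out lines. The objects X and y created above from the project dataset store the features and the target value respectively, and they are the ones used in the following steps.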

In [294]:
#dataset = datasets.load_wine()
#X = dataset.data
#y = dataset.target

Step 3 - Using StandardScaler and PCA
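StandardScaler standardizes each feature by subtracting its mean and dividing by its standard deviation, so every column ends up with mean 0 and standard deviation 1. It does not remove outliers. Here we create a scaler object std_slc, a PCA object pca for dimensionality reduction, and a decision tree classifier dec_tree.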


In [306]:
std_slc = StandardScaler()
pca = decomposition.PCA()
dec_tree = tree.DecisionTreeClassifier()
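
To make the scaling step concrete, here is a minimal pure-Python sketch of the per-column computation that StandardScaler performs with its default settings: subtract the column mean and divide by the population standard deviation. The sample values are made up for illustration. Note that the extreme value is rescaled, not removed.

```python
from statistics import fmean, pstdev

# Made-up values for one feature column; 100.0 plays the role of an outlier
column = [1.0, 2.0, 3.0, 4.0, 100.0]

mean = fmean(column)
std = pstdev(column)  # population standard deviation (divides by n)

# Standardize: zero mean and unit standard deviation for this column
scaled = [(x - mean) / std for x in column]

print("scaled values:", [round(z, 3) for z in scaled])
print("std after scaling:", round(pstdev(scaled), 6))
print("number of values:", len(scaled))  # unchanged: the outlier is kept
```

Because the outlier inflates both the mean and the standard deviation, the four ordinary values end up squeezed close together after scaling.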

Step 5 - Using Pipeline for GridSearchCV
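Pipeline chains the steps so that GridSearchCV can tune them together: for each parameter combination, the data passes through the scaler, then PCA, then the decision tree. So we create a pipeline object pipe from the three objects std_slc, pca and dec_tree. A parameter of a step is addressed as the step name, two underscores, and the parameter name, for example pca__n_components.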

In [307]:
pipe = Pipeline(steps=[('std_slc', std_slc),
                       ('pca', pca),
                       ('dec_tree', dec_tree)])

# Search space: every PCA size from 1 to the number of features,
# both split criteria, and a range of tree depths
n_components = list(range(1, X.shape[1] + 1))
criterion = ['gini', 'entropy']
max_depth = [2, 4, 6, 8, 10, 12]
parameters = dict(pca__n_components=n_components,
                  dec_tree__criterion=criterion,
                  dec_tree__max_depth=max_depth)
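
GridSearchCV tries every combination in the parameters dictionary, so the cost grows with the product of the list lengths. A rough sketch of the count, assuming a hypothetical 20 feature columns (the real value is X.shape[1]) and 5 cross-validation folds, which is the scikit-learn default when cv is not set:

```python
from itertools import product

n_features = 20  # hypothetical; the notebook uses X.shape[1]
n_folds = 5      # assumed: GridSearchCV default when cv is not given

n_components = list(range(1, n_features + 1))
criterion = ['gini', 'entropy']
max_depth = [2, 4, 6, 8, 10, 12]

# Every (n_components, criterion, max_depth) triple is one candidate pipeline
candidates = list(product(n_components, criterion, max_depth))
total_fits = len(candidates) * n_folds

print("candidates:", len(candidates))
print("model fits:", total_fits)
```

With many one-hot encoded columns the feature count can be much larger than 20, so restricting n_components to a handful of values is one way to keep the search manageable.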

Step 6 - Using GridSearchCV and Printing Results

In [None]:
# Exhaustive cross-validated search over every parameter combination
clf_GS = GridSearchCV(pipe, parameters)
clf_GS.fit(X, y)

best_params = clf_GS.best_estimator_.get_params()
print('Best Criterion:', best_params['dec_tree__criterion'])
print('Best max_depth:', best_params['dec_tree__max_depth'])
print('Best Number Of Components:', best_params['pca__n_components'])
print()
print(best_params['dec_tree'])
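
The call above fits the search on all of X and y, which includes the rows later used as the test set. A common alternative is to search on the training split only and then score the refit best pipeline on the held-out split. The sketch below illustrates that pattern on synthetic data from make_classification (an assumption made so the example runs on its own) with a deliberately small grid; the names X_demo, pipe_demo, small_grid and search are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the project data (assumption for illustration only)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=1)

pipe_demo = Pipeline(steps=[('std_slc', StandardScaler()),
                            ('pca', PCA()),
                            ('dec_tree', DecisionTreeClassifier(random_state=1))])

small_grid = {'pca__n_components': [2, 4, 8],
              'dec_tree__criterion': ['gini', 'entropy'],
              'dec_tree__max_depth': [2, 4, 6]}

# Search on the training split only, so the test split stays unseen
search = GridSearchCV(pipe_demo, small_grid, cv=5)
search.fit(X_tr, y_tr)

print('Best parameters:', search.best_params_)
print('Mean CV accuracy of best candidate:', round(search.best_score_, 3))
# With refit=True (the default) the best pipeline is retrained on all of X_tr
print('Held-out accuracy:', round(search.score(X_te, y_te), 3))
```

Scoring through the fitted search object applies the scaler and PCA learned on the training split before predicting, so the held-out rows never influence the preprocessing.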


In [301]:
clf = DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=4)
# Train Decision Tree Classifier
clf = clf.fit(X_train,y_train)

#Predict the response for test dataset
y_pred = clf.predict(X_test)

# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))


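from sklearn.metrics import precision_score
# Note: scikit-learn's signature is precision_score(y_true, y_pred). The recorded
# output below came from a call with the two arguments swapped, so its macro and
# per-class figures are recall values (compare the classification report).
micro_precision = precision_score(y_test, y_pred, average='micro')
print('Micro-averaged precision score: {0:0.2f}'.format(
      micro_precision))

macro_precision = precision_score(y_test, y_pred, average='macro')
print('Macro-averaged precision score: {0:0.2f}'.format(
      macro_precision))

per_class_precision = precision_score(y_test, y_pred, average=None)
print('Per-class precision score:', per_class_precision)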

print(classification_report(y_test,y_pred))


Accuracy: 0.923530594464178
Micro-averaged precision score: 0.92
Macro-averaged precision score: 0.92
Per-class precision score: [0.92288316 0.92498369]
              precision    recall  f1-score   support

           0       0.97      0.92      0.94     10322
           1       0.84      0.92      0.88      4599

    accuracy                           0.92     14921
   macro avg       0.90      0.92      0.91     14921
weighted avg       0.93      0.92      0.92     14921

