# Decision Tree
---------


### Author Information
**Author:** PJ Gibson  
**Email:** Peter.Gibson@doh.wa.gov  
**Github:**   https://github.com/DOH-PJG1303

### Project Information
**Created Date:** 2023-05-17  
**Last Updated:** 2023-05-22  
**Version:** 1  

### Description
This notebook introduces newcomers to Python to simple machine learning techniques in the context of record linkage.
More specifically, we'll train a decision tree using very basic parameter tuning.
We will visualize our output, measure some basic metrics, and save the model.

### Notes


## 1. Import Libraries

In [None]:
# Standard data analysis tools
import pandas as pd
import numpy as np

# Machine learning tools
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Plotting tools
import matplotlib.pyplot as plt
import seaborn as sns

## 2. Prep Data

### 2.1 Read in data

#### 2.1.1 Synthetic Data for Person1

Synthetic string-type data for person 1.
This data is used at the end of the script for validation purposes.

In [None]:
df_synthetic1 = pd.read_csv('Data/synthetic_df1.csv', dtype=str)

#### 2.1.2 Synthetic Data for Person2

Synthetic string-type data for person 2.
This data is used at the end of the script for validation purposes.

In [None]:
df_synthetic2 = pd.read_csv('Data/synthetic_df2.csv', dtype=str)

#### 2.1.3 Training Data

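This data is used to train the model.
Its first two columns are read as the index: they hold the row positions of the two compared records in `df_synthetic1` and `df_synthetic2`.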

In [None]:
df_training = pd.read_csv('Data/synthetic_training_data.csv',index_col=[0,1])
df_training.head(5)

### 2.2 Visualize Comparisons

In [None]:
# Manually pick out example of two people and how they were compared
row1 = df_synthetic1[['fname','lname','dob','phone','add']].loc[78604:78604]
row2 = df_synthetic2[['fname','lname','dob','phone','add']].loc[100214:100214]
row3 = df_training.sort_index().loc[(78604,100214):(78604,100214)].drop('label',axis=1)

# Format our output
df_comparisons = pd.concat([row1,row2,row3],ignore_index=True)
df_comparisons.index = ['person1','person2','numeric_comparisons']
df_comparisons

### 2.3 Test Train Split

In [None]:
# Hold out 20% of the labeled comparisons for testing
X_train, X_test, y_train, y_test = train_test_split(
    df_training.drop('label', axis=1),
    df_training['label'],
    test_size=0.2,
    random_state=42,
)

## 3. Machine Learning Training

### Decision Tree


#### Notes on parameter tuning
----- 

Parameter tuning (more precisely, hyperparameter tuning) is the process of choosing the model settings that give the best accuracy, or another chosen metric, on held-out data. For our Decision Tree, we tune two hyperparameters: 'max_depth' and 'criterion'.

'max_depth' is the maximum depth of the tree. Limiting it helps control over-fitting, because a deeper tree can learn relations that are specific to the training sample and do not generalize.

'criterion' is the function used to measure the quality of a split. Sklearn supports “gini” for the Gini impurity and “entropy” for the information gain.

GridSearchCV is a class in sklearn's model_selection module. It fits your estimator (model) on the training set once for every combination of the hyperparameter values you list, scores each combination with cross-validation, and lets you select the best one.
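
To make the two 'criterion' options more concrete, here is a minimal sketch that computes Gini impurity and entropy by hand for the labels in a single tree node. The helper names `gini_impurity` and `entropy` and the toy node contents are assumptions made for illustration; sklearn computes these quantities internally. A node split evenly between two classes has a Gini impurity of 0.5 and an entropy of 1 bit, while a pure node scores 0 on both.

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / counts.sum()
    return 1.0 - np.sum(proportions ** 2)

def entropy(labels):
    """Shannon entropy in bits: -sum(p_k * log2(p_k))."""
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / counts.sum()
    return -np.sum(proportions * np.log2(proportions))

# Hypothetical node contents: 1 = match, 0 = non-match
mixed_node = np.array([0, 0, 1, 1])
pure_node = np.array([1, 1, 1, 1])

gini_mixed = gini_impurity(mixed_node)
entropy_mixed = entropy(mixed_node)
gini_pure = gini_impurity(pure_node)
entropy_pure = entropy(pure_node)

print(f"Mixed node: gini={gini_mixed:.3f}, entropy={entropy_mixed:.3f}")
print(f"Pure node:  gini={abs(gini_pure):.3f}, entropy={abs(entropy_pure):.3f}")
```

Both measures rank splits in a similar way, which is why trying each of them in a grid search is a cheap experiment.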

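In our code, we use GridSearchCV with 5-fold cross-validation to tune our Decision Tree model. For each parameter combination, the training set is split into 5 parts; the model is trained on 4 of them and validated on the remaining one, and this is repeated 5 times so that each part serves as the validation set once. Averaging the 5 scores gives a more reliable estimate of model performance than a single split. With the grid below (4 depths and 2 criteria, so 8 combinations), that amounts to 40 fits, plus one final refit of the best combination on the whole training set.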
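
The fold mechanics can be sketched on a toy dataset. The arrays `X_toy` and `y_toy` are made up for illustration. When `cv` is an integer and the estimator is a classifier, sklearn uses `StratifiedKFold`, so this sketch uses the same splitter directly to show which rows end up in each validation fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical stand-in for a training set: 10 rows, 5 per class
X_toy = np.arange(10).reshape(-1, 1)
y_toy = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# 5 folds, each keeping the class balance of the full dataset
splitter = StratifiedKFold(n_splits=5)
folds = list(splitter.split(X_toy, y_toy))

for fold_number, (train_idx, val_idx) in enumerate(folds, start=1):
    print(f"Fold {fold_number}: train on {len(train_idx)} rows, "
          f"validate on rows {val_idx.tolist()} (labels {y_toy[val_idx].tolist()})")
```

Every row appears in exactly one validation fold, so each observation is used for validation once and for training four times.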

After GridSearchCV identifies the best parameters, the model is evaluated on the test data. The classification report shows precision, recall, f1-score, and support for each class, while the confusion matrix counts the correct predictions and the errors made by the model, broken down by actual and predicted class.
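
As a rough illustration of how the confusion matrix relates to the numbers in the classification report, the sketch below derives precision and recall for the "match" class by hand. The labels in `y_true_toy` and `y_pred_toy` are invented for this example and are not taken from the training data.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels: 1 = the two records match, 0 = they do not
y_true_toy = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred_toy = [1, 1, 0, 0, 0, 0, 0, 1]

# Rows are actual classes, columns are predicted classes;
# for binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

# Precision: of the pairs predicted as matches, how many really match?
precision_by_hand = tp / (tp + fp)

# Recall: of the true matches, how many did the model find?
recall_by_hand = tp / (tp + fn)

print(f"tn={tn}, fp={fp}, fn={fn}, tp={tp}")
print(f"precision={precision_by_hand:.3f}, recall={recall_by_hand:.3f}")
```

In record linkage, a false positive merges two different people and a false negative leaves one person split across two records, so it is worth reading precision and recall separately instead of relying on accuracy alone.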

In [None]:
# Define parameter grid
param_grid = {'max_depth': range(1, 5), 'criterion': ['gini', 'entropy']}

# Initialize a DecisionTreeClassifier (fixed seed so tie-breaking between equally good splits is reproducible)
dt = DecisionTreeClassifier(random_state=42)

# Initialize GridSearchCV
grid = GridSearchCV(dt, param_grid, cv=5, scoring='accuracy')

# Fit GridSearchCV to the training data
grid.fit(X_train, y_train)

## 4. Visualizations

### 4.1 Parameters

In [None]:
# Print the best parameters
print(f"Best parameters: {grid.best_params_}")

# Predict on the test data
y_pred = grid.predict(X_test)

# Print classification report
print(classification_report(y_test, y_pred))

# Print accuracy score
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")

### 4.2 Confusion Matrix

In [None]:
# Plot confusion matrix
conf_mat = confusion_matrix(y_test, y_pred)
sns.heatmap(conf_mat, annot=True, fmt='d')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()

### 4.3 Decision Tree Plot

In [None]:
# Retrieve the best tree found by the grid search (already refit on the full training set)
dt_best = grid.best_estimator_

# Set larger figure size
plt.figure(figsize=(20,10))

# Plot the tree (feature_names is passed as a plain list, which every sklearn version accepts)
plot_tree(dt_best, filled=True, rounded=True, feature_names=list(X_train.columns), class_names=True, proportion=True)

# Show the plot
plt.show()

### 4.4 Manually inspect wrong decisions

This chunk identifies the test examples that the model misclassified and prints a few of them, showing the original records behind each comparison.
It's a good way to analyze where the model is going wrong.
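
The core idea can be sketched on toy data first. The index values and labels below are hypothetical. Because the true labels are a pandas Series indexed by record pair, comparing them with the array of predictions keeps that index, which lets us look up exactly which pairs were misclassified.

```python
import numpy as np
import pandas as pd

# Hypothetical record-pair index: (row in dataset 1, row in dataset 2)
pair_index = pd.MultiIndex.from_tuples([(10, 20), (11, 21), (12, 22), (13, 23)])
y_true_toy = pd.Series([1, 0, 1, 0], index=pair_index, name='label')

# Predictions come back from the model as a plain array, in the same row order
y_pred_toy = np.array([1, 1, 0, 0])

# Boolean Series: True where the prediction differs from the true label
is_wrong = y_true_toy != y_pred_toy

# The index entries of the wrong rows identify the misclassified pairs
wrong_pairs = is_wrong[is_wrong].index.tolist()
print(wrong_pairs)
```

Each tuple in `wrong_pairs` points back to one row in each source dataset, which is how the cell below retrieves the original records.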

In [None]:
# Compare the predicted values (y_pred) with the actual test values (y_test).
# The result is a boolean Series indexed like y_test (True where the prediction is correct)
list_results = y_pred == y_test

# Extract the index entries where the prediction didn't match the actual test value
wrong_guess_indices = list_results[~list_results].index

# Use these index entries to get the corresponding rows (features and label) from df_training
wrong_responses = df_training.loc[wrong_guess_indices]

# Split the incorrectly predicted data into two parts: one where the actual label is 0 and one where the actual label is 1
wrong_responses_label0 = wrong_responses.query('label == 0')
wrong_responses_label1 = wrong_responses.query('label == 1')

# List of columns that are used in the analysis
used_cols = ['ssn','fname','lname','dob','phone','add']

# Number of examples to print from each class
num_examples_printed_each_class = 2

def show_misclassified(wrong_rows, actual_label, predicted_label, n_examples):
    """Display up to n_examples misclassified record pairs side by side."""
    # head() avoids an IndexError when fewer than n_examples rows were misclassified
    for df1_index, df2_index in wrong_rows.head(n_examples).index:
        # Use the pair's index to fetch the corresponding rows from the original dataframes
        part1 = df_synthetic1.loc[int(df1_index), used_cols]
        part2 = df_synthetic2.loc[int(df2_index), used_cols]

        # Combine the two parts and transpose to get a dataframe with the used_cols as columns
        combined = pd.concat([part1, part2], ignore_index=True, axis=1).transpose()

        # Print and display the misclassified example
        print(f'label = {actual_label}')
        print(f'model decision = {predicted_label}')
        display(combined)
        print('\n\n')

# Misclassified examples where the actual label is 0 and the model predicted 1
show_misclassified(wrong_responses_label0, 0, 1, num_examples_printed_each_class)

# Misclassified examples where the actual label is 1 and the model predicted 0
show_misclassified(wrong_responses_label1, 1, 0, num_examples_printed_each_class)