# Importing Modules

In [None]:
import time
start = time.time()

Import the time module and create a variable named "start". The start variable will be used later to track the runtime of operations.

In [None]:
import datetime

curTime = datetime.datetime.now().strftime("%H:%M:%S")
print("Start Time: ", curTime)

Use datetime to get the current time and display when the script started running. This is useful for estimating when an operation will end.
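
As a minimal sketch of how a start timestamp can help estimate when an operation will end, the snippet below adds an expected duration to a start time. The helper name `estimated_end`, the fixed start time, and the 90-minute figure are hypothetical and chosen only for illustration.

```python
import datetime

def estimated_end(start_time, expected_minutes):
    """Return the clock time at which a run is expected to finish."""
    return start_time + datetime.timedelta(minutes=expected_minutes)

# Hypothetical run: started at 09:30:00 and expected to take 90 minutes.
started = datetime.datetime(2024, 1, 1, 9, 30, 0)
finish = estimated_end(started, 90)
print("Estimated End Time: ", finish.strftime("%H:%M:%S"))
```

In the notebook itself, `datetime.datetime.now()` would take the place of the fixed start time.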

In [None]:
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
import joblib

Import the remaining modules. They are used for the following:<br>
-Pandas: Provides the dataframe datatype, which is easy to operate on and to convert to other file formats. It is also compatible with the fit methods of many ML models.<br>
-DecisionTreeRegressor: The chosen ML model, used to predict the start dates of new reqs.<br>
-RandomizedSearchCV: Tries to find the best parameters for the ML model by testing a random sample of the possible parameter combinations. Quicker than GridSearchCV, but results can vary between runs and the best combination may be missed.<br>
-GridSearchCV: Finds the best parameters for the ML model by testing every combination of the parameter values it is given.<br>
-Joblib: Used to export the ML model.
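
The difference between the two search strategies can be sketched with the standard library alone. This is only an illustration of the idea, not how scikit-learn implements either class; the grid `toy_grid` and the helper names are hypothetical.

```python
import itertools
import random

# Hypothetical toy grid, much smaller than the one tuned later in this notebook.
toy_grid = {"max_depth": [20, 50, None], "min_samples_leaf": [1, 10]}

def all_combinations(grid):
    """Exhaustive search: every combination of the listed values."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in itertools.product(*grid.values())]

def random_combinations(grid, n_iter, seed=8):
    """Randomized search: a fixed number of combinations drawn at random."""
    rng = random.Random(seed)
    return rng.sample(all_combinations(grid), n_iter)

exhaustive = all_combinations(toy_grid)
sampled = random_combinations(toy_grid, n_iter=2)
print(len(exhaustive), "candidates for the exhaustive search")
print(len(sampled), "candidates for the randomized search")
```

The exhaustive list grows multiplicatively with every parameter added, while the randomized list stays at whatever size is requested. That is the trade-off between thoroughness and runtime described above.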

# Tuning

In [None]:
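runTimeBool = True
printGridTest = True  # assumed default; this flag is checked below but was not defined

When runTimeBool is set to True, the runtimes at certain checkpoints are printed to the console. When printGridTest is set to True, the best parameters and score from the grid search are printed as well.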

In [None]:
gridParams = {
    "random_state": [8],
    "max_depth": [20, 50, 100, None],
    "criterion": ["squared_error", "friedman_mse", "absolute_error"],  # criteria valid for DecisionTreeRegressor
    "min_samples_leaf": [1, 10, 100, 1000],
    "min_weight_fraction_leaf": [0.0, 0.05, 0.1, 0.2],
    "max_features": ["log2", "sqrt", None],
    "max_leaf_nodes": [None, 20, 50, 100]
}

A dictionary with the parameters to test for the DecisionTreeRegressor. Each determines the following:<br>
-random_state: Essentially a seed for the model. Rather than making a random decision when faced with two equally viable options, it will make the same decision every time. Can be set to any integer; 8 was chosen at random.<br>
-max_depth: The maximum depth of the decision tree, i.e. the maximum number of decisions that the tree can make for a single row. None means no limit.<br>
-criterion: The function used to measure the quality of a split.<br>
-min_samples_leaf: The minimum number of samples that must end up in each leaf. A split is only considered if it leaves at least this many samples in both branches.<br>
-min_weight_fraction_leaf: The minimum fraction of the total sample weight required to be at a leaf node.<br>
-max_features: The number of features to consider when looking for the best split. None means all features are considered.<br>
-max_leaf_nodes: The maximum number of leaf nodes the tree may have. None means no limit.<br>
<br>
Culling suboptimal parameters and adding new parameters to test is encouraged.
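
To make the size-limiting parameters concrete, here is a small sketch that fits one unconstrained and one constrained tree and compares their shapes. It assumes scikit-learn and NumPy are installed. The data is synthetic and the chosen limits (4, 10, 12) are arbitrary; neither comes from this project.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data for illustration only (not the notebook's xaxis/yaxis).
rng = np.random.default_rng(8)
X = rng.uniform(0, 10, size=(500, 3))
y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(0, 0.1, size=500)

# No limits: the tree keeps splitting until it cannot split any further.
unconstrained = DecisionTreeRegressor(random_state=8).fit(X, y)

# Arbitrary limits on depth, leaf size, and leaf count.
constrained = DecisionTreeRegressor(
    random_state=8, max_depth=4, min_samples_leaf=10, max_leaf_nodes=12
).fit(X, y)

print("Unconstrained:", unconstrained.get_depth(), "levels,", unconstrained.get_n_leaves(), "leaves")
print("Constrained:  ", constrained.get_depth(), "levels,", constrained.get_n_leaves(), "leaves")
```

The constrained tree never exceeds the limits it was given, which is why tighter values generally trade some fit quality for a smaller, faster model.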

In [None]:
totModels = 1
# Multiply together the number of values listed for each parameter.
for value in gridParams.values():
    totModels *= len(value)

print(f"Total models to explore: {totModels}")

Calculate the total number of models that will need to be tested by GridSearchCV based on the parameters given. Print this to the console.
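
As a worked check of that count, the sketch below multiplies the number of values listed for each parameter and then accounts for cross-validation. The dictionary `values_per_param` is a hand-copied summary of the grid sizes in this notebook, and `cv_folds` mirrors the `cv = 2` argument used further down.

```python
import math

# Number of values listed per parameter in this notebook's grid.
values_per_param = {
    "random_state": 1, "max_depth": 4, "criterion": 3, "min_samples_leaf": 4,
    "min_weight_fraction_leaf": 4, "max_features": 3, "max_leaf_nodes": 4,
}
cv_folds = 2  # each candidate model is fitted once per fold

total_models = math.prod(values_per_param.values())
total_fits = total_models * cv_folds
print(f"Total models to explore: {total_models}")
print(f"Total fits with {cv_folds}-fold cross-validation: {total_fits}")
```

Adding a single extra value to any four-value parameter raises the total by a quarter, so the count is worth checking before starting a long run.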

In [None]:
xaxis = pd.read_csv("xaxis.csv")
yaxis = pd.read_csv("yaxis.csv")

Create the xaxis and yaxis dataframes from csv files output during preprocessing.

In [None]:
gridTest = GridSearchCV(estimator=DecisionTreeRegressor(), param_grid=gridParams, cv=2,
                        # n_jobs=-1,  # uncomment to use all processing cores
                        )
gridTest.fit(xaxis, yaxis)

Create a GridSearchCV object loaded with the parameters established in the gridParams dictionary, using 2-fold cross-validation (cv = 2). The n_jobs argument is commented out; if it is enabled and set to -1, all processing cores will be used for the operation. Then, fit the object using the xaxis and yaxis dataframes.

In [None]:
joblib.dump(gridTest.best_estimator_, "tuned decision tree.pkl")
if runTimeBool == True:
    checkpoint = round(time.time()-start, 2)
    print(f"Export PKL File: {checkpoint}\n")

Use joblib to save the best model found by the grid search to a pkl file. If runTimeBool is True, print the elapsed runtime in seconds.
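
For completeness, a saved model can be read back with `joblib.load`. The sketch below assumes scikit-learn and joblib are installed. It uses a tiny stand-in model and a temporary directory so that it runs on its own; in this notebook the object being saved would be `gridTest.best_estimator_`.

```python
import os
import tempfile

import joblib
from sklearn.tree import DecisionTreeRegressor

# Tiny stand-in model trained on four made-up points.
model = DecisionTreeRegressor(random_state=8).fit([[0], [1], [2], [3]], [0.0, 1.0, 2.0, 3.0])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "tuned decision tree.pkl")
    joblib.dump(model, path)       # write the fitted model to disk
    restored = joblib.load(path)   # read it back as a ready-to-use estimator

print(restored.predict([[1], [3]]))
```

The restored object behaves like the original fitted estimator, so a later script can predict without repeating the grid search.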

In [None]:
if printGridTest == True:
    print("\n========================")
    print("\n Best params: ", gridTest.best_params_)
    print("\n Best score: ", gridTest.best_score_)
    print("\n========================\n")

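Print the best parameters and their score to the console. The score is the mean cross-validated score of the best model, which for a regressor is R-squared by default rather than accuracy.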

In [None]:
if runTimeBool == True:
    checkpoint = round(time.time()-start, 2)
    minutes, seconds = divmod(checkpoint, 60)  # whole minutes and leftover seconds
    print(f"Finish Runtime: {int(minutes)} minutes, {round(seconds, 2)} seconds")

Print the finish runtime to the console. Unlike the earlier checkpoint, this one is converted to minutes and seconds rather than just seconds, because the operation can take a long time, especially when there are many models for GridSearchCV to explore.
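
For very long searches the same conversion can be extended to hours. The helper below is a hypothetical sketch and not part of the original notebook; it splits a duration in seconds using `divmod`.

```python
def format_runtime(total_seconds):
    """Split a duration in seconds into hours, minutes, and leftover seconds."""
    minutes, seconds = divmod(total_seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return f"{int(hours)} hours, {int(minutes)} minutes, {round(seconds, 2)} seconds"

print("Finish Runtime:", format_runtime(3725.5))
```

Using `divmod` keeps each unit a whole number except the last, which avoids output such as a fractional number of minutes.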