Install missing required packages, and upgrade packages to current versions

In [None]:
!pip install --upgrade scikit-learn xgboost shap

Import packages used throughout notebook

In [None]:
import pandas as pd
import numpy as np
import xgboost as xgb
import matplotlib.pyplot as plt
import sklearn
import boto3
from IPython.display import Image, display
import warnings
warnings.filterwarnings('ignore')

# Text Detection in Images
Here we demonstrate how a pre-trained machine learning model can create value once operationalized. Amazon Rekognition can detect text in an image; teams at Amazon have already trained it on millions of images containing text. We will try three photos: a simple text-based photo, a complex photo with lots of text, and a Nutrien example photo.

In [None]:
AI_quote = 'AI_quote.jpg'
times_square = 'times_square.jpg'
nutrien_railcar = 'railcar.jpg'

print('Simple:')
display(Image(f"{AI_quote}", width=500))
print('\nComplex:')
display(Image(f"{times_square}", width=500))
print('\nNutrien:')
display(Image(f"{nutrien_railcar}", width=500))

The following function opens the specified image, parses it using an API call to Amazon Rekognition, then outputs a dataframe with the relevant information returned from the API. Further details are provided in the comments.

In [None]:
def detect_text(photo_file):
    
    # Setting up the client to make an API request to Amazon Rekognition
    client=boto3.client('rekognition')
    
    # Opening the file into our notebook, then making the API call with the opened image
    with open(photo_file, 'rb') as image:
        response=client.detect_text(Image={'Bytes': image.read()})
    
    # Storing the text part of the response from the API
    textDetections=response['TextDetections']
    
    # Setting up a dataframe to store the text found in the API response
    text_df = pd.DataFrame(columns=['Text', 'Confidence (%)', 'ID', 'ParentID', 'Type'])
    
    # Loop that parses all text detected in each image to the created dataframe
    for i, text in enumerate(textDetections):
        text_df.loc[i, 'Text'] = text['DetectedText']
        text_df.loc[i, 'Confidence (%)'] = round(text['Confidence'], 2)
        text_df.loc[i, 'ID'] = text['Id']
        # ParentId is only present for words that belong to a detected line
        if 'ParentId' in text:
            text_df.loc[i, 'ParentID'] = text['ParentId']
        text_df.loc[i, 'Type'] = text['Type']
    
    # Populated dataframe is returned
    return text_df
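
As a small illustration of working with this kind of output, the sketch below parses a hand-written list shaped like `response['TextDetections']` and keeps only whole lines of text above a confidence threshold. The sample detections, the helper name `detections_to_df`, and the 90% threshold are assumptions made for illustration; no API call is made.

```python
import pandas as pd

# Hand-written stand-in for response['TextDetections']; the values are made up.
sample_detections = [
    {'DetectedText': 'HELLO WORLD', 'Confidence': 99.1234, 'Id': 0, 'Type': 'LINE'},
    {'DetectedText': 'HELLO', 'Confidence': 99.5, 'Id': 1, 'ParentId': 0, 'Type': 'WORD'},
    {'DetectedText': 'WORLD', 'Confidence': 98.7, 'Id': 2, 'ParentId': 0, 'Type': 'WORD'},
    {'DetectedText': 'blurry', 'Confidence': 42.0, 'Id': 3, 'Type': 'LINE'},
]

def detections_to_df(detections):
    # Hypothetical helper: one row per detection; .get() leaves ParentID empty for lines
    rows = [{'Text': d['DetectedText'],
             'Confidence (%)': round(d['Confidence'], 2),
             'ID': d['Id'],
             'ParentID': d.get('ParentId'),
             'Type': d['Type']} for d in detections]
    return pd.DataFrame(rows)

sample_df = detections_to_df(sample_detections)

# Keep whole lines only, and drop low-confidence detections
is_line = sample_df['Type'] == 'LINE'
is_confident = sample_df['Confidence (%)'] >= 90
confident_lines = sample_df[is_line & is_confident]
print(confident_lines[['Text', 'Confidence (%)']])
```

Building the rows as a list of dictionaries and creating the dataframe once is a design choice for this sketch; it tends to be faster than assigning cell by cell with `.loc`.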

For each example we re-display a sample of the image, then output the returned dataframe that contains the text extracted from each image using the function above.

In [None]:
photo=f"{AI_quote}"
AI_quote_df = detect_text(photo)
display(Image(f"{AI_quote}", width=300))
AI_quote_df

In [None]:
photo=f"{times_square}"
times_square_df = detect_text(photo)
display(Image(f"{times_square}", width=300))
times_square_df

In [None]:
photo=f"{nutrien_railcar}"
nutrien_railcar_df = detect_text(photo)
display(Image(f"{nutrien_railcar}", width=300))
nutrien_railcar_df

With our Nutrien use case, we would like to know the Department of Transportation (DOT) code located just above the table in the image. We loop over all the text in the Text column of the dataframe and use a regular expression (regex) to find text that fits the format of a DOT code. More information on regular expressions can be found here:
https://docs.python.org/3/howto/regex.html

In [None]:
import re

for i, text in enumerate(nutrien_railcar_df['Text']):
    # findall returns a list of matched strings in each piece of text
    found = re.findall(r'DOT\s*[A-Za-z0-9]{9}', text)

    # if the DOT code is found, we assign it to a variable, print it, then exit the for loop
    if len(found) == 1:
        dot_code = found[0]
        print(f'{dot_code}')
        print(f'Found on line {i}')
        break
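
To see what the pattern accepts without calling the API, the sketch below applies it to a few made-up strings. The sample lines are assumptions, not text taken from the railcar photo.

```python
import re

# Same pattern as above: "DOT", optional whitespace, then nine letters or digits
dot_pattern = re.compile(r'DOT\s*[A-Za-z0-9]{9}')

# Made-up lines of text, standing in for the 'Text' column
sample_lines = ['NTRX 123456', 'LD LMT 221400 LB', 'DOT 111A100W1', 'DOT']

matches = [m.group(0) for line in sample_lines if (m := dot_pattern.search(line))]
print(matches)
```

Only the third string matches, because it is the only one with nine alphanumeric characters following "DOT".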

# Regression

The following example is a simple training exercise on carbon emission data (in tonnes) to introduce us to machine learning modelling, using linear regression.

## Data Preparation
We are going to create a function that splits our data into a training set and a test set. 80% of the data will be used for the training set and the remaining 20% will be used for the test set.

In [None]:
from sklearn.model_selection import train_test_split
def prepare_data(data, target):
    # Separate the predictor variables (X) from the target variable (y), each in its own object
    X = data.drop(target, axis=1)
    y = data[target]
    
    # Create a training and test set for the predictor and target variables
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=42)
    
    return X_train, X_test, y_train, y_test
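
A quick way to check the 80/20 split is to run the same steps on a tiny made-up dataframe. The column names and values below are assumptions for illustration, and the `toy_` names are used so that the notebook's own variables are left alone.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 10 rows, two predictors and one target (values are made up)
toy = pd.DataFrame({
    'feature_a': range(10),
    'feature_b': range(10, 20),
    'target': range(100, 110),
})

toy_X = toy.drop('target', axis=1)
toy_y = toy['target']
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, train_size=0.8, random_state=42)

print(toy_X_train.shape, toy_X_test.shape)
# No row should appear in both sets
overlap = set(toy_X_train.index) & set(toy_X_test.index)
print(overlap)
```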

We will load our emissions data into a dataframe named emissions_data. Typing the dataframe name on the last line of the cell displays a snippet of the data, so we can see what it looks like. Run the cell below to load and preview the emissions data.

Emission data can be found here: https://ourworldindata.org/grapher/annual-co2-emissions-per-country

Country information data can be found here: https://www.kaggle.com/datasets/fernandol/countries-of-the-world

In [None]:
emissions_data = pd.read_csv('carbon_emissions.csv')
emissions_data

## Categorical Encoding
Now we will use One Hot Encoding to convert our categorical variable (Continent) to numerical variables. This is done so the ML model can make numerical sense of the categorical variables. Run the cell below to encode the categorical feature and see what the encoded feature looks like after.

In [None]:
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(sparse_output=False)
ohe_continent = ohe.fit_transform(emissions_data[['Continent']])
ohe_continent_df = pd.DataFrame(ohe_continent, columns=ohe.categories_[0])

print('Original:')
display(emissions_data[['Continent']])
print('One Hot Encoded:')
display(ohe_continent_df)
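
The effect of one hot encoding is easiest to see on a handful of rows. The sketch below uses four made-up continent values; it relies on the encoder's default sparse output and converts it with `.toarray()`, which is an alternative to passing `sparse_output=False`.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

toy_df = pd.DataFrame({'Continent': ['Asia', 'Europe', 'Asia', 'Africa']})  # made-up rows

toy_ohe = OneHotEncoder()  # default output is a sparse matrix
encoded = toy_ohe.fit_transform(toy_df[['Continent']]).toarray()

# One column per category, in sorted order; each row has a single 1
toy_encoded_df = pd.DataFrame(encoded, columns=toy_ohe.categories_[0])
print(toy_encoded_df)
```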

We will now append the one hot encoded features to the original dataframe, drop the previous continent feature as it is no longer needed, and also set the row index to the country name for easier use. From here we are ready to model on the final dataframe seen after running the cell.

In [None]:
transformed_emissions_data = pd.concat([emissions_data, ohe_continent_df], axis=1)
transformed_emissions_data.drop('Continent', axis=1, inplace=True)
transformed_emissions_data.set_index('Country', inplace=True)

transformed_emissions_data

## Splitting Data
Now we will prepare our data by splitting it into training and test sets using the function we made earlier. In order to understand exactly what this function does, we will also display the X_train, y_train, X_test, and y_test datasets, in that order. You will notice that the X_train and X_test datasets hold all the predictor variables, and the y_train and y_test datasets hold the target variable (emissions).

In [None]:
X_train, X_test, y_train, y_test = prepare_data(transformed_emissions_data, 'Annual CO2 emissions')
display(X_train,y_train,X_test,y_test)

## Linear Regression Model
Now we will run a linear regression model on our prepared dataset below. We will evaluate this model with 4 metrics: mean absolute error, mean squared error, root mean squared error, and r2_score. Run the cell below to create the model, train it, and generate predictions.

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
lr_model = LinearRegression() # create model
lr_model.fit(X_train, y_train) # train model
lr_pred = lr_model.predict(X_test) # generate predictions
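
For reference, the four metrics named above can be written out by hand. The sketch below uses three made-up values rather than the emissions data, so the arithmetic is easy to follow.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0])   # made-up actual values
y_hat = np.array([2.0, 5.0, 9.0])    # made-up predictions

errors = y_hat - y_true
mae_toy = np.mean(np.abs(errors))                # mean absolute error
mse_toy = np.mean(errors ** 2)                   # mean squared error
rmse_toy = np.sqrt(mse_toy)                      # root mean squared error, in the target's units
ss_res = np.sum(errors ** 2)                     # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2_toy = 1 - ss_res / ss_tot                     # share of variance explained

print(mae_toy, mse_toy, rmse_toy, r2_toy)
```

Here the errors are -1, 0, and 2, giving an MAE of 1, an MSE of 5/3, and an R2 of 1 - 5/8 = 0.375.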

The following function will show emission prediction outputs vs true values for a specified country in the test set (X_test), using the specified predictions.

In [None]:
import math

def get_country_outputs(country, model_predictions):
    iloc_val = y_test.index.get_loc(country)

    print(f"Input:\n{X_test.loc[country].name}\n")
    print(f"\tPredicted Carbon Emissions: \t{model_predictions[iloc_val]:,.0f}")
    print(f"\tActual Carbon Emissions: \t{y_test.loc[country]:,.0f}")
    print(f"\tAbsolute error: \t\t{abs(model_predictions[iloc_val] - y_test.loc[country]):,.0f}")
    print(f"\tSquared error: \t\t\t{(model_predictions[iloc_val] - y_test.loc[country]) ** 2:,.0f}\n")
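
The function depends on translating a country label into a row position, because the predictions are a plain array with no index. A minimal sketch of that lookup, using made-up numbers and `toy_` names that are not part of the notebook:

```python
import numpy as np
import pandas as pd

# Made-up actual values, indexed by country
toy_y_test = pd.Series([700.0, 450.0, 250.0], index=['Germany', 'Brazil', 'Malaysia'])
# Made-up predictions; position i lines up with row i of toy_y_test
toy_pred = np.array([650.0, 300.0, 260.0])

pos = toy_y_test.index.get_loc('Brazil')   # label -> integer position
abs_error = abs(toy_pred[pos] - toy_y_test.loc['Brazil'])
print(pos, abs_error)
```

Note that `get_loc` returns a single integer only when the index labels are unique.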

Let's take a look at a couple of sample predictions by our model. Run the cell below to look at the input, the predicted emissions, the actual emissions, and the error. Feel free to check other countries, seen in the test portion (X_test) of the split data.

In [None]:
get_country_outputs('Germany', lr_pred)
get_country_outputs('Malaysia', lr_pred)

We can see for certain countries like Germany and Malaysia, the model predicts quite well within a certain carbon emission range. But others with either really high or really low emissions are poorly handled by the model.

In [None]:
get_country_outputs('Brazil', lr_pred)
get_country_outputs('Suriname', lr_pred)

To get a less anecdotal look at the results, let's compute metrics across the entire test set. The cell below will evaluate the model's predictions with the 4 metrics, and print the equation of the model. 

In [None]:
mae = mean_absolute_error(y_test, lr_pred)
mse = mean_squared_error(y_test, lr_pred)
rmse = np.sqrt(mse) # square root of the MSE; the squared= argument is not available in newer scikit-learn releases
r2 = r2_score(y_test, lr_pred) # results are nonsense, included for reference

coef = lr_model.coef_
intercept = lr_model.intercept_
cols = X_train.columns

print(f"Mean Absolute Error: {mae:,.0f}")
print(f"Mean Squared Error: {mse:,.0f}")
print(f"Root Mean Squared Error: {rmse:,.0f}")
print(f"R2: {r2}")
print("\nEquation for Regression Model:")
# Build one "coefficient(feature)" term per column instead of hard-coding each index
terms = " + ".join(f"{c:.2f}({col})" for c, col in zip(coef, cols))
# The target is the raw 'Annual CO2 emissions' column, not its logarithm
print(f"carbon emissions = {terms} + {intercept:.2f}")

Although the results are by no means perfect (see the R2 score), it is interesting to see how certain features push the predictions one way or the other. For example, the continent a country is on influences whether the model predicts more or less carbon emissions for that country.

To achieve stronger results we could try various other models, scaling, and hyperparameter tuning. Ideally the dataset would also contain more strongly correlated features, such as industry metrics and vehicle usage, that are more directly related to carbon emissions. For linear regression we would likely also want to remove outliers like China and India, whereas other models would likely handle these better. Further exploration, implementing some of the noted improvements, will be done on the emissions dataset at the end of this notebook.

# Binary Classification (Credit)

## Naive Rule
We want to know the accuracy of a Naive Rule Model, which always predicts the most common class seen in the training data, so we create a simple function to compute it. This is a simple and effective way to rule out any ML model that does not make value-adding predictions.

In [None]:
def naive_rule_accuracy(y_train, y_test):
    majority_class = y_train.value_counts().idxmax()

    test_counts = y_test.value_counts()
    accuracy_naive = test_counts[majority_class] / test_counts.sum()

    print(f"The accuracy of the Naive Model is: {accuracy_naive}")
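
To make the naive rule concrete, the sketch below works through it on a few made-up labels. The `toy_` series are assumptions for illustration only.

```python
import pandas as pd

toy_y_train = pd.Series([0, 0, 0, 1])      # made-up training labels; the majority class is 0
toy_y_test = pd.Series([0, 1, 0, 0, 1])    # made-up test labels

majority = toy_y_train.value_counts().idxmax()
# Predicting the majority class for every row is right whenever the true label equals it
naive_acc = (toy_y_test == majority).mean()
print(majority, naive_acc)
```

Three of the five test labels equal the majority class, so the naive accuracy is 0.6. A model scoring below that on the same test set adds no value.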

## Model Evaluation
We want to create a function to automatically evaluate our models. We will be looking at accuracy, recall, precision, f1-score, the confusion matrix, and the ROC curve.

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, ConfusionMatrixDisplay, RocCurveDisplay
import matplotlib.pyplot as plt

def evaluate(model, X_test, y_test):
    pred = model.predict(X_test)
    # accuracy = correct_predictions / all_predictions 
    acc = accuracy_score(y_test, pred)

    # true_positives / (true_positives + false_positives)
    # how many positive predictions were true
    prec = precision_score(y_test, pred, average='weighted')

    # true_positives / (true_positives + false_negatives)
    # how many positives out of all were identified
    rec = recall_score(y_test, pred, average='weighted')

    # harmonic mean of precision and recall
    f1 = f1_score(y_test, pred, average='weighted')
    
    print(f"accuracy: {acc}")
    print(f"precision: {prec}")
    print(f"recall: {rec}")
    print(f"f1: {f1}")
    
    try:
        # prob = model.predict_proba(X_test)
        # roc_auc = roc_auc_score(y_test, prob, multi_class='ovo')
        # print(f"roc_auc: {roc_auc}")
        roc_display = RocCurveDisplay.from_estimator(model, X_test, y_test)
    except Exception:
        # RocCurveDisplay raises for multi-class targets, so the ROC plot is skipped there
        pass
    
    cm_display = ConfusionMatrixDisplay.from_predictions(y_test, pred)
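
The formulas in the comments above can be checked by hand on a small example. The sketch below uses made-up binary labels and computes the metrics for the positive class only; the function above instead passes `average='weighted'`, which averages the per-class values weighted by class frequency, so its numbers will generally differ from these.

```python
# Made-up binary labels: 1 = positive class, 0 = negative class
toy_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
toy_hat = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(toy_true, toy_hat))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)   # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)   # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)   # false negatives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)   # true negatives

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of precision and recall

print(accuracy, precision, recall, f1)
```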
    

## Data Preparation
We will now look at credit data for a Binary Classification problem. We will load the data as credit_data, view it, and then split it into training and test sets with the prepare_data function. For this dataset we will be looking at payment history patterns for customers (the CustomerID field has been removed for anonymity) and try to predict whether they will be credit risks or not.

In [None]:
credit_data = pd.read_csv('Company.csv')
credit_data

In [None]:
X_train, X_test, y_train, y_test = prepare_data(credit_data, 'Risk')

## Naive Rule Benchmark
Before we do any ML, let's look at the Naive Model accuracy. If a model can't beat the accuracy of the Naive Model, then there is no point in looking at it further.

In [None]:
naive_rule_accuracy(y_train, y_test)

## Logistic Regression Model
Now let's run a Logistic Regression Model and see whether its output improves on the naive rule.

In [None]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)
evaluate(model, X_test, y_test)

ax = plt.gca()

ax.xaxis.set_ticklabels(['Low Risk','High Risk'])
ax.yaxis.set_ticklabels(['Low Risk','High Risk'])
plt.show()

Run the cell below to see how many false positives and false negatives there were.

In [None]:
pred = model.predict(X_test)
false_positives = 0
false_negatives = 0
for prediction, truth in zip(pred, y_test):
    if truth == 1 and prediction == 0:
        false_negatives += 1
    if truth == 0 and prediction == 1:
        false_positives += 1

print(f"False Positives: {false_positives}")
print(f"False Negatives: {false_negatives}")
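
The same counts can also be read off scikit-learn's confusion matrix instead of a loop. This is an alternative sketch on made-up labels, not the method used in the cell above.

```python
from sklearn.metrics import confusion_matrix

toy_truth = [1, 1, 1, 0, 0, 0, 0, 1]        # made-up labels
toy_predictions = [1, 0, 1, 0, 1, 0, 0, 1]  # made-up predictions

# For binary labels {0, 1}, ravel() yields the counts in the order tn, fp, fn, tp
n_tn, n_fp, n_fn, n_tp = confusion_matrix(toy_truth, toy_predictions).ravel()
print(f"False Positives: {n_fp}")
print(f"False Negatives: {n_fn}")
```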

# Binary Classification (Rice)

## Data Preparation
As with the previous datasets, we will load our rice data as rice_data. We will use this dataset to predict whether the rice is Jasmine or Gonen (1 = Jasmine, 0 = Gonen). We will load the data and then split it into training and test sets.

In [None]:
rice_data = pd.read_csv('rice.csv')
rice_data

In [None]:
X_train, X_test, y_train, y_test = prepare_data(rice_data, 'Class')

## Naive Rule Benchmark
Before we do any ML, let's look at the Naive Model accuracy. If a model can't beat the accuracy of the Naive Model, then there is no point in looking at it further.

In [None]:
naive_rule_accuracy(y_train, y_test)

## Logistic Regression Model
Now let's run a Logistic Regression Model and produce some evaluation metrics and the confusion matrix.

In [None]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)
evaluate(model, X_test, y_test)

fig = plt.gcf()
ax = plt.gca()

ax.xaxis.set_ticklabels(['Gonen','Jasmine'])
ax.yaxis.set_ticklabels(['Gonen','Jasmine'])
plt.show()

The evaluate function above also plots the ROC curve, and the AUC score appears in the plot legend.

# Classification (Crop Recommendation)

## Data Preparation
Now we will import the crops.csv dataset as crop_data. We will use this dataset to predict the 'label' column. Run the cell below to see what the dataset looks like after it has been loaded.

In [None]:
crop_data = pd.read_csv('crops.csv')
crop_data

Just like the previous datasets, we will now split this dataset into training and test sets. If you would like to see what these datasets look like, run the cell, open another cell below it, and type in the name(s) of the dataset(s) you wish to see.

In [None]:
X_train, X_test, y_train, y_test = prepare_data(crop_data, 'label')

## Naive Rule Benchmark
Before we do any ML, let's look at the Naive Model accuracy. If a model can't beat the accuracy of the Naive Model, then there is no point in looking at it further. The Naive Rule Benchmark for this problem will be very low, given it is a multi-class problem.

In [None]:
naive_rule_accuracy(y_train, y_test)

## Naive Bayes Model
We will first use the Naive Bayes Model on our dataset. We are using the Gaussian Naive Bayes Model because our predictor variables are continuous rather than discrete. Click the cell below to run it and get a confusion matrix, as well as the accuracy, precision, recall, and f1 score.

In [None]:
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X_train, y_train)
evaluate(model, X_test, y_test)
plt.show()

## Stochastic Gradient Descent Model
Now we will be running the Stochastic Gradient Descent Model. Click the cell below to produce the output and evaluate the model.

In [None]:
from sklearn.linear_model import SGDClassifier
model = SGDClassifier()
model.fit(X_train, y_train)
evaluate(model, X_test, y_test)
plt.show()

## Perceptron Model
Now we will be running the Perceptron Model. Click the cell below to produce the output and evaluate the model.

In [None]:
from sklearn.linear_model import Perceptron
model = Perceptron()
model.fit(X_train, y_train)
evaluate(model, X_test, y_test)
plt.show()

## Decision Tree Model
Now we will be running a Decision Tree Model. Click the cell below to produce the output and evaluate the model.

In [None]:
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
evaluate(model, X_test, y_test)
plt.show()

## XGBoost Model
Now we will be running an XGBoost Model. Click the cell below to produce the output and evaluate the model.

In [None]:
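from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder().fit(y_train) # recent XGBoost releases expect integer class labels
model = xgb.XGBClassifier()
model.fit(X_train, encoder.transform(y_train))
evaluate(model, X_test, encoder.transform(y_test))
plt.show()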

# Hyperparameter Tuning
We will use python to optimize the hyperparameters of our SGD Classifier. We want to see if through hyperparameter tuning we can improve the performance of the model. We will be using the same crop dataset as we used for the first SGD Model, so we will start by splitting the dataset again. 

In [None]:
X_train, X_test, y_train, y_test = prepare_data(crop_data, 'label')

Now let us set up the hyperparameter search. In param_grid we define the parameters we discussed in the slide. Run the two cells below to tune the model's hyperparameters; the cells that follow display the results of the tuning.

In [None]:
from sklearn.model_selection import GridSearchCV
param_grid = {
    "penalty": ['l1', 'l2', 'elasticnet'], # The various options to put a penalty on errors (also known as regularization)
    "alpha": [0.0001, 0.001, 0.01], # The constant that multiplies the regularization term. The higher the value, the stronger the penalty
    "eta0": [0.001, 0.01, 0.1], # The initial learning rate for the model. Will change with adaptive learning
    "learning_rate": ['constant', 'adaptive'] # Does the model keep the learning rate constant or change as it runs 
}
grid_cv = GridSearchCV(SGDClassifier(), param_grid, n_jobs=-1, cv=5, scoring="f1_weighted")
# n_jobs = the number of jobs to run in parallel; -1 means use all processors
# cv = the number of cross-validation folds
# scoring = the metric used to rank the candidates, in our case the weighted f1 score
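
Before starting a grid search it helps to know how many models it will train. The sketch below counts the combinations with scikit-learn's `ParameterGrid`; the grid is a copy of the one above, renamed `toy_param_grid` so it does not overwrite the notebook's variable.

```python
from sklearn.model_selection import ParameterGrid

toy_param_grid = {
    "penalty": ['l1', 'l2', 'elasticnet'],
    "alpha": [0.0001, 0.001, 0.01],
    "eta0": [0.001, 0.01, 0.1],
    "learning_rate": ['constant', 'adaptive'],
}

n_candidates = len(ParameterGrid(toy_param_grid))  # 3 * 3 * 3 * 2 combinations
n_fits = n_candidates * 5                          # one fit per combination per fold (cv=5)
print(n_candidates, n_fits)
```

With the default `refit=True`, the search also performs one final fit of the best combination on the full training set.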

In [None]:
grid_cv.fit(X_train, y_train)

Run the cell below to see the best score produced by the optimal set of hyperparameters.

In [None]:
grid_cv.best_score_

Run the cell below to find out which combination of hyperparameters turned out to be the best.

In [None]:
grid_cv.best_params_

Run both cells below to evaluate the model with the optimal set of hyperparameters and see what the performance stats look like. You will notice a significant improvement in the accuracy of the model.

In [None]:
model = grid_cv.best_estimator_

In [None]:
evaluate(model, X_test, y_test)
plt.show()

# Feature Scaling
We want to see if our model performs any better if we standardize or normalize the data. Just like before, we will be using our crop data and splitting it into training and test sets. We will be using the same SGD classifier because that model had some room for improvement. We want to see if either standardization or normalization will improve the model.

## Data Prep
The first step is to split the data into training and test sets.

In [None]:
X_train, X_test, y_train, y_test = prepare_data(crop_data, 'label')

Now we will scale the data. We are going to both normalize the data and standardize it.

In [None]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = crop_data.drop("label", axis=1)
y = crop_data["label"]

standard_scaler = StandardScaler()
standard_scaler.fit(X)
X_s_scaled = pd.DataFrame(standard_scaler.transform(X), columns=X.columns)

minmax_scaler = MinMaxScaler()
minmax_scaler.fit(X)
X_mm_scaled = pd.DataFrame(minmax_scaler.transform(X), columns=X.columns)
with pd.option_context('display.float_format', lambda x: '%.3f' % x):  
    print("Unscaled Data:") 
    display(X.describe())
    print("Standardized Data:")
    display(X_s_scaled.describe())
    print("Normalized Data:")
    display(X_mm_scaled.describe())
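
The two transformations reduce to short formulas, written out below with NumPy on one made-up feature column. Standardization subtracts the mean and divides by the standard deviation; normalization (min-max scaling) subtracts the minimum and divides by the range.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # one made-up feature column

# Standardization: mean 0 and standard deviation 1 (population std, as StandardScaler uses)
standardized = (x - x.mean()) / x.std()

# Normalization: rescaled to the range [0, 1], as MinMaxScaler does by default
normalized = (x - x.min()) / (x.max() - x.min())

print(standardized.round(3))
print(normalized)
```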

## Unscaled Data
We are going to run the same model on the three datasets above and see which one comes out with the best performance. All of them are SGD Models and we will see the confusion matrix, accuracy, precision, recall, and the f1 score.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = SGDClassifier()
model.fit(X_train, y_train)
evaluate(model, X_test, y_test)
plt.show()

## Standardized Data

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_s_scaled, y, test_size=0.2, random_state=42)
model = SGDClassifier()
model.fit(X_train, y_train)
evaluate(model, X_test, y_test)
plt.show()

## Normalized Data

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_mm_scaled, y, test_size=0.2, random_state=42)
model = SGDClassifier()
model.fit(X_train, y_train)
evaluate(model, X_test, y_test)
plt.show()
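
In the cells above the scalers are fitted on the full dataset before it is split, so the test rows influence the scaling statistics. A common alternative, sketched here under stated assumptions, is to put the scaler and the model in a scikit-learn `Pipeline` so that the scaler is fitted on the training rows only. The synthetic data from `make_classification` and the `demo` names are stand-ins, not the crop dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the crop data (not the real dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, n_informative=4,
                                     n_classes=3, random_state=42)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

# The pipeline fits the scaler on the training rows only, then reuses those
# statistics to transform the test rows before predicting
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('sgd', SGDClassifier(random_state=42)),
])
pipe.fit(Xd_train, yd_train)
demo_accuracy = pipe.score(Xd_test, yd_test)
print(f"accuracy: {demo_accuracy:.3f}")
```

A pipeline like this can also be passed to `GridSearchCV`, in which case the scaler is refitted inside every cross-validation fold.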

# SHAP Values
To understand how much each predictor variable contributes to the outcome, we can use the SHAP package in Python. We want to know how the predictors affect the outcome for our crop dataset with an XGBoost model, so we will first train an XGBoost model on that data again and then look at the SHAP values.

In [None]:
import shap

Split the dataset into training and test sets, like before.

In [None]:

X_train, X_test, y_train, y_test = prepare_data(crop_data, 'label')

Let us now train and evaluate the model, same as we did once before.

In [None]:
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
encoder.fit(y_train)
y_train_enc = encoder.transform(y_train)
y_test_enc = encoder.transform(y_test)
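
The label encoder maps each class name to an integer and can map the integers back again, which is useful when reading predictions. A minimal sketch with made-up crop names and `toy_` variables:

```python
from sklearn.preprocessing import LabelEncoder

toy_labels = ['rice', 'maize', 'wheat', 'rice']   # made-up crop names

toy_encoder = LabelEncoder()
toy_codes = toy_encoder.fit_transform(toy_labels)        # classes are numbered in sorted order
toy_decoded = toy_encoder.inverse_transform(toy_codes)   # map the integers back to names

print(list(toy_encoder.classes_))
print(list(toy_codes))
```

Because the classes are sorted alphabetically, 'maize' becomes 0, 'rice' becomes 1, and 'wheat' becomes 2.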

In [None]:
model = xgb.XGBClassifier()

model.fit(X_train, y_train_enc)
evaluate(model, X_test, y_test_enc)
plt.show()

Run the cell below to calculate the SHAP values. It may take a couple of minutes to complete.

In [None]:
explainer = shap.Explainer(model, X_train)
shap_values = explainer(X_train)

Run the cell below to see a bar plot of the SHAP values for one outcome. In the code, the number after the comma (0 to 21) selects one of the 22 outcomes the target variable can take, and the graph shows the importance of each feature for that specific outcome. To see how the SHAP values change with each target class, replace the number with any value between 0 and 21.

In [None]:
shap.plots.bar(shap_values[...,0], show=False)
fig = plt.gcf()
ax = plt.gca()
fig.set_figheight(11)
fig.set_figwidth(11)
font_dict = {'size':16}
font_dict_title = {'size':18}
fig.patch.set_facecolor('xkcd:light grey')
plt.xlabel('Mean Shap Value for Target Variable',font_dict)
plt.ylabel('Predictor Variables', font_dict)
plt.title('SHAP Values for Crop Classification', font_dict_title)
plt.show()
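
The bar plot summarizes each feature by its mean absolute SHAP value across the samples. The sketch below reproduces that summary with NumPy on a made-up array, assuming the SHAP values have the shape (samples, features, classes); the random numbers and placeholder feature names are not real model output.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up attributions with shape (n_samples, n_features, n_classes)
toy_shap = rng.normal(size=(100, 7, 22))
toy_features = [f'feature_{j}' for j in range(7)]   # placeholder names

class_index = 0
class_slice = toy_shap[..., class_index]            # shape (n_samples, n_features)
mean_abs = np.abs(class_slice).mean(axis=0)         # one importance value per feature

# Sort features from most to least important for the chosen class
ranking = sorted(zip(toy_features, mean_abs), key=lambda pair: pair[1], reverse=True)
for name, value in ranking:
    print(f"{name}: {value:.3f}")
```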

# Emissions Revisited
Now that we have seen some techniques to improve machine learning models, let's try implementing some of them on the initial linear regression example on carbon emissions, focusing on two new models: RandomForestRegressor and XGBRegressor.

In [None]:
emissions_data = pd.read_csv('carbon_emissions.csv')
emissions_data.set_index('Country', inplace=True)
emissions_data

## Categorical Encoding and Scaling
Now we will use One Hot Encoding to convert our categorical variable (Continent) to numerical variables, and we will scale the numerical values. To do both in one step we will use a column transformer, which handles the dataframe transformations more directly. Run the cell below to encode the categorical features, scale the numeric features, and see what the dataset looks like after. Note that the column names are updated to reflect their transformations.

In [None]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer

all_features = emissions_data.columns.tolist()
categorical_features = ['Continent']
passthrough_columns = ['Annual CO2 emissions']

# list comprehension to get all column names except categorical and passthrough (target)
numerical_features = [features for features in all_features if features not in passthrough_columns + categorical_features] 


ct = ColumnTransformer(transformers=[("scaled", StandardScaler(), numerical_features),
                                     ("onehot", OneHotEncoder(sparse_output=False), categorical_features)],
                                    remainder='passthrough')

transformed_emissions_data = ct.fit_transform(emissions_data)
transformed_emissions_data = pd.DataFrame(transformed_emissions_data, columns = ct.get_feature_names_out(), index = emissions_data.index)

transformed_emissions_data

## Splitting Data

In [None]:
X_train, X_test, y_train, y_test = prepare_data(transformed_emissions_data, 'remainder__Annual CO2 emissions')
display(X_train,y_train,X_test,y_test)

## Random Forest Regression Grid Search
Now we will run a grid search over a random forest regression model on our prepared dataset. We will evaluate this model with 4 metrics: mean absolute error, mean squared error, root mean squared error, and R2 score. Run the cell below to create the model as well as train it.

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [5, 50, 100, 150, 250], # The number of trees in the forest
    "max_depth": [None, 2, 5, 10, 15], # The maximum depth of each individual tree (None = no limit)
}
grid_cv = GridSearchCV(RandomForestRegressor(random_state=5), param_grid, n_jobs=-1, cv=5, scoring='neg_root_mean_squared_error')

grid_cv.fit(X_train, y_train) # train model

In [None]:
grid_cv.best_params_

In [None]:
grid_cv.best_score_
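
Note that the best score is negative. scikit-learn always maximizes scores, so the RMSE is reported with its sign flipped. The sketch below shows this on synthetic data with a much smaller grid; the data from `make_regression` and the `demo` names are assumptions, not the emissions data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data, standing in for the emissions data
X_reg, y_reg = make_regression(n_samples=80, n_features=5, noise=10.0, random_state=5)

small_grid = {"n_estimators": [5, 20], "max_depth": [None, 3]}
demo_search = GridSearchCV(RandomForestRegressor(random_state=5), small_grid,
                           cv=3, scoring='neg_root_mean_squared_error')
demo_search.fit(X_reg, y_reg)

# Flip the sign to read the cross-validated RMSE in the target's units
cv_rmse = -demo_search.best_score_
print(demo_search.best_params_, round(cv_rmse, 2))
```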

Here we take the RFR model with the best result from the grid search, refit it on the training set, and then generate predictions using that model.

In [None]:
rfr_model = grid_cv.best_estimator_ # create model
rfr_model.fit(X_train, y_train)
rfr_pred = rfr_model.predict(X_test) # generate predictions

Let's take a look at the same sample predictions from earlier, but this time use our random forest regression model. Run the cell below to look at the input, the predicted emissions, the actual emissions, and the error. Feel free to check other countries, seen in the test portion of the split data.

In [None]:
get_country_outputs('Germany', rfr_pred)
get_country_outputs('Malaysia', rfr_pred)
get_country_outputs('Brazil', rfr_pred)
get_country_outputs('Suriname', rfr_pred)

The cell below will evaluate the model with the 4 metrics.

In [None]:
mae = mean_absolute_error(y_test, rfr_pred)
mse = mean_squared_error(y_test, rfr_pred)
rmse = np.sqrt(mse) # square root of the MSE; the squared= argument is not available in newer scikit-learn releases
r2 = r2_score(y_test, rfr_pred)

print(f"Mean Absolute Error: {mae:,.0f}")
print(f"Mean Squared Error: {mse:,.0f}")
print(f"Root Mean Squared Error: {rmse:,.0f}")
print(f"R2: {r2}")

With some relatively simple changes our RMSE is about a quarter of the earlier linear regression model's RMSE, and the R2 score has become actually usable.

## XGBoost Grid Search
We can repeat the previous process for an XGBoost Regression model instead.

In [None]:
param_grid = {
    'booster': ['gbtree', 'dart'],
    "n_estimators": [1, 2, 3, 4, 5, 50, 100], # The number of boosted trees
    "max_depth": [None, 2, 5, 10], # The maximum depth of each individual tree
}
grid_cv = GridSearchCV(xgb.XGBRegressor(random_state=5), param_grid, n_jobs=-1, cv=5, scoring='neg_root_mean_squared_error')

grid_cv.fit(X_train, y_train) # train model

In [None]:
grid_cv.best_params_

In [None]:
grid_cv.best_score_

In [None]:
xgb_model = grid_cv.best_estimator_
xgb_model.fit(X_train, y_train)
xgb_pred = xgb_model.predict(X_test)

In [None]:
get_country_outputs('Germany', xgb_pred)
get_country_outputs('Malaysia', xgb_pred)
get_country_outputs('Brazil', xgb_pred)
get_country_outputs('Suriname', xgb_pred)

In [None]:
mae = mean_absolute_error(y_test, xgb_pred)
mse = mean_squared_error(y_test, xgb_pred)
rmse = np.sqrt(mse) # square root of the MSE; the squared= argument is not available in newer scikit-learn releases
r2 = r2_score(y_test, xgb_pred)

print(f"Mean Absolute Error: {mae:,.0f}")
print(f"Mean Squared Error: {mse:,.0f}")
print(f"Root Mean Squared Error: {rmse:,.0f}")
print(f"R2: {r2}")

We see an improvement using XGBoost over the Random Forest Regressor! There are also many more hyperparameters we could tune for both models, with XGBoost having even more than the RFR. If we were to continue this comparison, we would want to compare the grid_cv.best_score_ values rather than the test set RMSE, so that our choices are not biased towards the test set and the model stays as generalizable as possible.
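
One way to make that comparison, sketched here under stated assumptions, is to collect a cross-validated score per model and pick the highest. The data is synthetic, and `GradientBoostingRegressor` is used as a stand-in for `XGBRegressor` so the example only needs scikit-learn.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic regression data, standing in for the emissions data
X_cmp, y_cmp = make_regression(n_samples=80, n_features=5, noise=10.0, random_state=5)

candidates = {
    'random_forest': RandomForestRegressor(n_estimators=20, random_state=5),
    # Stand-in for XGBRegressor in this sketch
    'gradient_boosting': GradientBoostingRegressor(n_estimators=20, random_state=5),
}

# Mean cross-validated score per model; values are negated RMSE, so higher is better
cv_scores = {name: cross_val_score(est, X_cmp, y_cmp, cv=3,
                                   scoring='neg_root_mean_squared_error').mean()
             for name, est in candidates.items()}
best_name = max(cv_scores, key=cv_scores.get)
print(cv_scores, best_name)
```

The test set is then used once, at the very end, to report the performance of the model chosen this way.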