Install missing required packages

In [None]:
!conda install -c conda-forge xgboost shap -y

Import packages used throughout notebook

In [None]:
import pandas as pd
import xgboost as xgb
import matplotlib.pyplot as plt

# Regression

## Data Preparation
We are going to create a function that splits our data into a training set and a test set. 80% of the data will be used for the training set and the remaining 20% will be used for the test set.

In [None]:
from sklearn.model_selection import train_test_split
def prepare_data(data, target):
    # Separate the predictor variables (X) from the target variable (y)
    X = data.drop(target, axis=1)
    y = data[target]
    
    # Create training and test sets: 80% of the rows for training, 20% for testing
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    
    return X_train, X_test, y_train, y_test
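
To make the 80/20 split concrete, here is a minimal sketch of the idea using only the standard library. It is not how `train_test_split` is implemented; the helper name `simple_split` and the fixed seed are assumptions made purely for illustration.

```python
import random

def simple_split(rows, test_fraction=0.2, seed=42):
    """Shuffle rows and cut them into a training part and a test part."""
    shuffled = list(rows)                   # copy so the input is left untouched
    random.Random(seed).shuffle(shuffled)   # seeded shuffle gives repeatable results
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

train_rows, test_rows = simple_split(range(10))
print(len(train_rows), len(test_rows))  # 8 2
```

The seed plays the same role as the `random_state` argument of `train_test_split`: without it, every run produces a different split and slightly different scores.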

We will load our salary data into a dataframe named salary_data. Putting the dataframe name on the last line of the cell displays a preview of the data, so we can understand what it looks like. Run the cell below to load the salary data and see the preview.

In [None]:
salary_data = pd.read_csv('Salary.csv')
salary_data

## Categorical Encoding
Now we will use Label Encoding to convert our categorical variables (Profession and Equipment) to numerical variables. This is done so the ML model can make sense of the categorical variables. Run the cell below to categorically encode these variables and see what the dataset looks like after.

In [None]:
from sklearn.preprocessing import LabelEncoder
profession_encoder = LabelEncoder()
salary_data['Profession'] = profession_encoder.fit_transform(salary_data['Profession'])

equipment_encoder = LabelEncoder()
salary_data['Equipment'] = equipment_encoder.fit_transform(salary_data['Equipment'])

salary_data
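
As a rough picture of what label encoding does, the sketch below reproduces the idea by hand: the distinct labels are sorted and numbered from 0. The profession names are made up for illustration and are not taken from Salary.csv.

```python
professions = ["Nurse", "Engineer", "Teacher", "Engineer", "Nurse"]  # made-up values

# Sort the distinct labels, then number them 0, 1, 2, ...
classes = sorted(set(professions))
code_for = {label: code for code, label in enumerate(classes)}

encoded = [code_for[p] for p in professions]
decoded = [classes[c] for c in encoded]  # reverse lookup back to the labels

print(classes)  # ['Engineer', 'Nurse', 'Teacher']
print(encoded)  # [1, 0, 2, 0, 1]
```

With the fitted encoders above, the `classes_` attribute holds the same sorted label list, and `inverse_transform` maps the numbers back to the original labels.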

## Splitting Data
Now we will prepare our data by splitting it into training and test sets using the function we made earlier. To understand exactly what this function does, we will also display the X_train, y_train, X_test and y_test datasets, in that order. You will notice that the X_train and X_test datasets contain all the predictor variables, while the y_train and y_test datasets contain the target variable (Salary).

In [None]:
X_train, X_test, y_train, y_test = prepare_data(salary_data, 'Salary')
display(X_train,y_train,X_test,y_test)

## Linear Regression Model
Now we will run a linear regression model on our prepared dataset below. We will evaluate this model with 4 metrics: mean absolute error, mean squared error, root mean squared error and the R2 score. Run the cell below to create the model, train it, and generate predictions.

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
model = LinearRegression() # create model
model.fit(X_train, y_train) # train model
pred = model.predict(X_test) # generate predictions

Let's take a look at a sample prediction by our model. Run the cell below to look at the input, the predicted salary, the actual salary, and the error.

In [None]:
import math

print(f"Input:\n{X_test.iloc[0]}\n")
print(f"Predicted salary: ${pred[0]}")
print(f"Actual salary: ${y_test.iloc[0]}")
print(f"Absolute error: ${abs(pred[0] - y_test.iloc[0])}")
print(f"Squared error: {(pred[0] - y_test.iloc[0]) ** 2}")
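
The four metrics aggregate these per-row errors over the whole test set. The sketch below works through the arithmetic by hand on four made-up values (they are not from the salary data), so it shows the formulas rather than the scikit-learn implementation.

```python
import math

actual = [3.0, 5.0, 7.0, 9.0]      # made-up target values
predicted = [2.0, 5.0, 8.0, 11.0]  # made-up predictions

errors = [p - a for p, a in zip(predicted, actual)]
n = len(errors)

mae = sum(abs(e) for e in errors) / n   # mean absolute error
mse = sum(e ** 2 for e in errors) / n   # mean squared error
rmse = math.sqrt(mse)                   # root mean squared error

# R2 compares the model's squared error with that of always predicting the mean
mean_actual = sum(actual) / n
ss_res = sum(e ** 2 for e in errors)
ss_tot = sum((a - mean_actual) ** 2 for a in actual)
r2 = 1 - ss_res / ss_tot

print(mae, mse, round(rmse, 4), round(r2, 4))  # 1.0 1.5 1.2247 0.7
```

MAE and RMSE are in the same units as the target (dollars for the salary data), while R2 is unitless: 1 means perfect predictions and 0 means no better than predicting the mean.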

The cell below will evaluate the model with the 4 metrics, and print the equation of the model. 

In [None]:
mae = mean_absolute_error(y_test, pred)
mse = mean_squared_error(y_test, pred)
rmse = mse ** 0.5  # square root of the MSE; the squared=False argument is not available in newer scikit-learn versions
r2 = r2_score(y_test, pred)

coef = model.coef_
intercept = model.intercept_
cols = X_train.columns

print(f"Mean Absolute Error: {mae}")
print(f"Mean Squared Error: {mse}")
print(f"Root Mean Squared Error: {rmse}")
print(f"R2: {r2}")
# Build the equation from every coefficient and include the intercept
terms = " + ".join(f"{c}({col})" for c, col in zip(coef, cols))
print("\nEquation for Regression Model:")
print(f"Salary = {intercept} + {terms}")

# Binary Classification (Credit)

## Naive Rule
We want to know the accuracy of a Naive Rule Model, which always predicts the most common class in the training data, so we create a simple function to compute it. This is a simple and effective way to rule out any ML model that does not make value-adding predictions.

In [None]:
def naive_rule_accuracy(y_train, y_test):
    majority_class = y_train.value_counts().idxmax()

    test_counts = y_test.value_counts()
    accuracy_naive = test_counts[majority_class] / test_counts.sum()

    print(f"The accuracy of the Naive Model is: {accuracy_naive}")
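
A small worked example of the same rule, using plain lists with made-up labels instead of pandas Series:

```python
from collections import Counter

y_train_toy = [0, 0, 0, 1, 1]  # made-up training labels
y_test_toy = [0, 1, 0, 0]      # made-up test labels

# The naive rule always predicts the most common training class
majority_class = Counter(y_train_toy).most_common(1)[0][0]

# Its accuracy is the share of test rows that belong to that class
naive_accuracy = sum(label == majority_class for label in y_test_toy) / len(y_test_toy)

print(majority_class, naive_accuracy)  # 0 0.75
```

Here a model would need to score above 0.75 on the test set before it adds anything over always guessing class 0.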

## Model Evaluation
We want to create a function to automatically evaluate our models. We will be looking at accuracy, precision, recall, f1-score, the ROC AUC score (for models that can output probabilities) and the confusion matrix.

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, ConfusionMatrixDisplay, roc_auc_score
import matplotlib.pyplot as plt

def evaluate(model, X_test, y_test):
    pred = model.predict(X_test)
    # accuracy = correct_predictions / all_predictions
    acc = accuracy_score(y_test, pred)

    # true_positives / (true_positives + false_positives)
    # how many positive predictions were true
    prec = precision_score(y_test, pred, average='weighted')

    # true_positives / (true_positives + false_negatives)
    # how many positives out of all were identified
    rec = recall_score(y_test, pred, average='weighted')

    # harmonic mean of precision and recall
    f1 = f1_score(y_test, pred, average='weighted')
    
    print(f"accuracy: {acc}")
    print(f"precision: {prec}")
    print(f"recall: {rec}")
    print(f"f1: {f1}")
    
    try:
        prob = model.predict_proba(X_test)
        # binary case: roc_auc_score expects only the positive-class column
        score = prob[:, 1] if prob.shape[1] == 2 else prob
        roc_auc = roc_auc_score(y_test, score, multi_class='ovo')
        print(f"roc_auc: {roc_auc}")
    except (AttributeError, ValueError):
        # models without predict_proba (e.g. Perceptron) have no ROC AUC to report
        pass
    
    fig, ax = plt.subplots(figsize=(10, 10))
    # ConfusionMatrixDisplay.from_estimator replaces plot_confusion_matrix,
    # which was removed in scikit-learn 1.2
    ConfusionMatrixDisplay.from_estimator(model, X_test, y_test, xticks_rotation='vertical', ax=ax)
    ax.set_title('Confusion Matrix')
    ax.set_ylabel('Actual labels')

    
    

## Data Preparation
We will now look at credit data for a binary classification problem. We will load in the data as credit_data, view it, and then split it using the prepare_data function we made earlier. For this dataset we will be looking at payment history patterns for customers (the CustomerID field has been removed for anonymity) and try to predict whether they will be credit risks or not.

In [None]:
credit_data = pd.read_csv('Company.csv')
credit_data

In [None]:
X_train, X_test, y_train, y_test = prepare_data(credit_data, 'Risk')

## Naive Rule Benchmark
Before we do any ML, let's look at the Naive Model accuracy. If a model can't beat the accuracy of the Naive Model, then there is no point in looking at it further.

In [None]:
naive_rule_accuracy(y_train, y_test)

## Logistic Regression Model
Now let's run a Logistic Regression Model, evaluate it with the function we made above, and see whether it improves on the naive rule.

In [None]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)
evaluate(model, X_test, y_test)

fig = plt.gcf()
ax = plt.gca()

ax.xaxis.set_ticklabels(['Low Risk','High Risk'])
ax.yaxis.set_ticklabels(['Low Risk','High Risk'])

Run the cell below to see how many false positives and false negatives there were

In [None]:
pred = model.predict(X_test)
false_positives = 0
false_negatives = 0
for prediction, truth in zip(pred, y_test):
    if truth == 1 and prediction == 0:
        false_negatives += 1
    if truth == 0 and prediction == 1:
        false_positives += 1

print(f"False Positives: {false_positives}")
print(f"False negatives: {false_negatives}")
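
The same counts are what precision, recall and f1 are built from. The sketch below uses ten made-up label pairs (not the credit data) and computes the metrics for the positive class only, whereas `evaluate` reports the class-weighted averages.

```python
truth = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]       # made-up actual labels
prediction = [1, 0, 0, 1, 1, 0, 1, 0, 0, 0]  # made-up predicted labels

pairs = list(zip(truth, prediction))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
tn = sum(t == 0 and p == 0 for t, p in pairs)  # true negatives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)  # share of positive predictions that were right
recall = tp / (tp + fn)     # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)

print(tp, tn, fp, fn)                                # 3 4 1 2
print(accuracy, precision, recall, round(f1, 4))     # 0.7 0.75 0.6 0.6667
```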

Run the cell below to plot the ROC curve and see the ROC AUC score.

In [None]:
from sklearn.metrics import RocCurveDisplay
# RocCurveDisplay.from_estimator replaces plot_roc_curve, which was removed in scikit-learn 1.2
RocCurveDisplay.from_estimator(model, X_test, y_test)
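
One way to read the ROC AUC score: it is the probability that a randomly chosen positive example receives a higher predicted score than a randomly chosen negative one. The sketch below checks this by counting pairs on four made-up scores; it illustrates the interpretation and is not how scikit-learn computes the value.

```python
labels = [0, 0, 1, 1]           # made-up actual classes
scores = [0.1, 0.4, 0.35, 0.8]  # made-up predicted probabilities for class 1

positives = [s for s, y in zip(scores, labels) if y == 1]
negatives = [s for s, y in zip(scores, labels) if y == 0]

# Count positive/negative pairs that are ranked correctly; ties count as half
wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
auc = wins / (len(positives) * len(negatives))

print(auc)  # 0.75
```

A score of 0.5 corresponds to random ranking and 1.0 to a model that ranks every positive above every negative.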

# Binary Classification (Rice)

## Data Preparation
Like the previous datasets, we will load in our rice data as rice_data. We will use this dataset to predict whether the rice is Jasmine or Gonen (1 = Jasmine, 0 = Gonen). We will load in the data and then split it into training and test sets.

In [None]:
rice_data = pd.read_csv('rice.csv')
rice_data

In [None]:
X_train, X_test, y_train, y_test = prepare_data(rice_data, 'Class')

## Naive Rule Benchmark
Before we do any ML, let's look at the Naive Model accuracy. If a model can't beat the accuracy of the Naive Model, then there is no point in looking at it further.

In [None]:
naive_rule_accuracy(y_train, y_test)

## Logistic Regression Model
Now let's run a Logistic Regression Model and produce some evaluation metrics and the confusion matrix.

In [None]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)
evaluate(model, X_test, y_test)

fig = plt.gcf()
ax = plt.gca()

ax.xaxis.set_ticklabels(['Gonen','Jasmine'])
ax.yaxis.set_ticklabels(['Gonen','Jasmine']);

The cell below displays the ROC curve along with the ROC AUC score.

In [None]:
from sklearn.metrics import RocCurveDisplay
# RocCurveDisplay.from_estimator replaces plot_roc_curve, which was removed in scikit-learn 1.2
RocCurveDisplay.from_estimator(model, X_test, y_test)

# Classification (Crop Recommendation)

## Data Preparation
Now we will import the crops.csv dataset as crop_data. We will use this dataset to predict the 'label' column. Run the cell below to see what the dataset looks like after it has been loaded.

In [None]:
crop_data = pd.read_csv('crops.csv')
crop_data

Just like the previous datasets, we will now split this dataset into training and test sets. If you would like to see what these datasets look like, run the cell, open another cell below it and type in the name(s) of the dataset(s) you wish to see (similar to how we displayed the salary splits above).

In [None]:
X_train, X_test, y_train, y_test = prepare_data(crop_data, 'label')

## Naive Rule Benchmark
Before we do any ML, let's look at the Naive Model accuracy. If a model can't beat the accuracy of the Naive Model, then there is no point in looking at it further. The Naive Rule Benchmark for this problem will likely be very low, since in a multi-class problem even the most common class covers only a small share of the data.

In [None]:
naive_rule_accuracy(y_train, y_test)

## Naive Bayes Model
We will first use the Naive Bayes Model on our dataset. We are using the Gaussian Naive Bayes Model because our predictor variables are continuous rather than discrete. Run the cell below to get a confusion matrix, as well as the accuracy, precision, recall, f1 score and roc_auc.

In [None]:
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X_train, y_train)
evaluate(model, X_test, y_test)

## Stochastic Gradient Descent Model
Now we will be running the Stochastic Gradient Descent Model. Run the cell below to produce the output and evaluate the model.

In [None]:
from sklearn.linear_model import SGDClassifier
model = SGDClassifier()
model.fit(X_train, y_train)
evaluate(model, X_test, y_test)

## Perceptron Model
Now we will be running the Perceptron Model. Run the cell below to produce the output and evaluate the model.

In [None]:
from sklearn.linear_model import Perceptron
model = Perceptron()
model.fit(X_train, y_train)
evaluate(model, X_test, y_test)

## Decision Tree Model
Now we will be running a Decision Tree Model. Click the cell below to produce the output and evaluate the model.

In [None]:
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
evaluate(model, X_test, y_test)

## XGBoost Model
Now we will be running an XGBoost Model. Run the cell below to produce the output and evaluate the model.

In [None]:
model = xgb.XGBClassifier()
model.fit(X_train, y_train)
evaluate(model, X_test, y_test)

# Hyperparameter Tuning
We will use Python to optimize the hyperparameters of our SGD Classifier. We want to see whether hyperparameter tuning can improve the performance of the model. We will be using the same crop dataset as we used for the first SGD Model, so we will start by splitting the dataset again.

In [None]:
X_train, X_test, y_train, y_test = prepare_data(crop_data, 'label')

Now let us set up the hyperparameter tuning. In the param_grid we will define the various parameters we discussed in the slides. Run the 2 cells below to tune the model hyperparameters. The cells that follow will display the results of the tuning.

In [None]:
from sklearn.model_selection import GridSearchCV
param_grid = {
    "penalty": ['l1', 'l2', 'elasticnet'], # The regularization options, i.e. the kind of penalty placed on large model weights
    "alpha": [0.0001, 0.001, 0.01], # The constant that multiplies the regularization term. The higher the value, the stronger the penalty
    "eta0": [0.001, 0.01, 0.1], # The initial learning rate for the model. It will change during training with adaptive learning
    "learning_rate": ['constant', 'adaptive'] # Whether the model keeps the learning rate constant or changes it as it runs
}
grid_cv = GridSearchCV(SGDClassifier(), param_grid, n_jobs=-1, cv=5, scoring="f1_weighted")
# n_jobs = the number of jobs to run in parallel, -1 means use all processors
# cv = cross validation, how many folds
# scoring = what we will be scoring the model on, in our case it will be the weighted f1 score.
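
To get a feel for how much work this grid search involves, the sketch below counts the candidate settings by taking the Cartesian product of the grid values. The `grid` dictionary repeats the values from `param_grid` above so that the snippet runs on its own.

```python
from itertools import product

grid = {
    "penalty": ['l1', 'l2', 'elasticnet'],
    "alpha": [0.0001, 0.001, 0.01],
    "eta0": [0.001, 0.01, 0.1],
    "learning_rate": ['constant', 'adaptive'],
}
folds = 5  # matches cv=5 above

# Every combination of one value per hyperparameter is one candidate model
combinations = list(product(*grid.values()))
total_fits = len(combinations) * folds

print(len(combinations))  # 54
print(total_fits)         # 270
```

Each of the 54 candidates is trained once per fold, giving 270 fits, plus one final refit of the best candidate on the full training set. Adding values to any list multiplies the run time accordingly.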

In [None]:
grid_cv.fit(X_train, y_train)

Run the cell below to see what is the best score produced by the optimal set of hyperparameters.

In [None]:
grid_cv.best_score_

Run the cell below to find out which combination of hyperparameters turned out to be the best.

In [None]:
grid_cv.best_params_

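Run both the cells below to evaluate the model with the optimal set of hyperparameters, and compare the performance stats with those of the untuned SGD Model above. You will typically notice an improvement in the accuracy of the model, although the exact numbers vary from run to run because the data split is random.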

In [None]:
model = grid_cv.best_estimator_

In [None]:
evaluate(model, X_test, y_test)

# Feature Scaling
We want to see if our model performs any better if we standardize or normalize the data. Just like before, we will be using our crop data and splitting it into training and test sets. We will be using the same SGD classifier because that model had some room for improvement. We want to see if either standardization or normalization will improve the model.

## Data Prep
The first step is to split the data into training and test sets.

In [None]:
X_train, X_test, y_train, y_test = prepare_data(crop_data, 'label')

Now we will scale the data. We are going to both normalize the data and standardize it.

In [None]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = crop_data.drop("label", axis=1)
y = crop_data["label"]

standard_scaler = StandardScaler()
standard_scaler.fit(X)
X_s_scaled = pd.DataFrame(standard_scaler.transform(X), columns=X.columns)

minmax_scaler = MinMaxScaler()
minmax_scaler.fit(X)
X_mm_scaled = pd.DataFrame(minmax_scaler.transform(X), columns=X.columns)
with pd.option_context('display.float_format', lambda x: '%.3f' % x):  
    print("Unscaled Data:") 
    display(X.describe())
    print("Standardized Data:")
    display(X_s_scaled.describe())
    print("Normalized Data:")
    display(X_mm_scaled.describe())
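
The two transformations are simple formulas applied column by column. The sketch below applies them by hand to one made-up column of four values, which mirrors what `StandardScaler` and `MinMaxScaler` do with their default settings.

```python
from statistics import mean, pstdev

values = [2.0, 4.0, 6.0, 8.0]  # one made-up feature column

# Standardization: subtract the mean, divide by the population standard deviation
mu, sigma = mean(values), pstdev(values)
standardized = [(v - mu) / sigma for v in values]

# Normalization (min-max): rescale so the smallest value is 0 and the largest is 1
lowest, highest = min(values), max(values)
normalized = [(v - lowest) / (highest - lowest) for v in values]

print([round(v, 3) for v in standardized])  # [-1.342, -0.447, 0.447, 1.342]
print([round(v, 3) for v in normalized])    # [0.0, 0.333, 0.667, 1.0]
```

This is why the standardized table above shows a mean of about 0 and a standard deviation of about 1 for every column, while the normalized table shows a minimum of 0 and a maximum of 1.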

## Unscaled Data
We are going to run the same model on the three datasets above and see which one comes out with the best performance. All of them are SGD Models and we will see the confusion matrix, accuracy, precision, recall and the f1 score.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = SGDClassifier()
model.fit(X_train, y_train)
evaluate(model, X_test, y_test)

## Standardized Data

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_s_scaled, y, test_size=0.2, random_state=42)
model = SGDClassifier()
model.fit(X_train, y_train)
evaluate(model, X_test, y_test)

## Normalized Data

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_mm_scaled, y, test_size=0.2, random_state=42)
model = SGDClassifier()
model.fit(X_train, y_train)
evaluate(model, X_test, y_test)

# SHAP Values
To understand how much each predictor variable contributes to the outcome, we can use the SHAP package in Python. We want to know how the predictors affect the outcome for our crop dataset with an XGBoost model, so we will first train an XGBoost Model on that data again and then look at the SHAP values.

In [None]:
import shap

Split the dataset into training and test sets, like before.

In [None]:
X_train, X_test, y_train, y_test = prepare_data(crop_data, 'label')

Let us now train and evaluate the model, same as we did once before.

In [None]:
model = xgb.XGBClassifier()
model.fit(X_train, y_train)
evaluate(model, X_test, y_test)

Run the cell below to calculate the SHAP values. It may take a couple of minutes to complete.

In [None]:
explainer = shap.Explainer(model, X_train)
shap_values = explainer(X_train)

Run the cell below to see a bar plot of the SHAP values for one outcome. In the code, the number after the comma (an index from 0 to 21) selects one of the 22 outcomes the target variable can take, and the graph shows the importance of the features for that specific outcome. To see how the SHAP values change with each target class, replace the number with any other value between 0 and 21.

In [None]:
shap.plots.bar(shap_values[...,21], show=False)
fig = plt.gcf()
ax = plt.gca()
fig.set_figheight(11)
fig.set_figwidth(11)
font_dict = {'size':16}
font_dict_title = {'size':18}
fig.patch.set_facecolor('xkcd:light grey')
plt.xlabel('Mean Absolute SHAP Value for Target Variable',font_dict)
plt.ylabel('Predictor Variables', font_dict)
plt.title('SHAP Values for Crop Classification', font_dict_title)
plt.show()