# Exercise: Tune the area under the curve

In this exercise, we'll make and compare two models using ROC curves, and tune one using the area under the curve (AUC).

The goal of our models is to identify whether each item detected on the mountain is a hiker (`true`) or a tree (`false`). We'll work with our `motion` feature here. Let's take a look:

In [None]:
import numpy as np
import pandas
!pip install statsmodels
!wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/graphing.py
!wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/Data/hiker_or_tree.csv
!wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/m2d_make_roc.py
import matplotlib.pyplot as plt
import sklearn.model_selection

# Load our data from disk
df = pandas.read_csv("hiker_or_tree.csv", delimiter="\t")

# Remove features we no longer want
del df["height"]
del df["texture"]

# Split into train and test
train, test =  sklearn.model_selection.train_test_split(df, test_size=0.5, random_state=1)

# Graph our feature
# Define a helper function to plot histograms by class
def plot_histogram_by_group(data, column, group_column, bins=12):
    groups = data[group_column].unique()
    for group in groups:
        subset = data[data[group_column] == group]
        plt.hist(subset[column], bins=bins, alpha=0.5, label=f'{group_column}={group}')
    plt.xlabel(column)
    plt.ylabel('Count')
    plt.title(f'{column} by {group_column}')
    plt.legend()
    plt.grid(True)
    plt.show()

plot_histogram_by_group(test, "motion", "is_hiker", bins=12)

Motion seems associated with hikers more than trees, but not perfectly. Presumably, this is because trees blow in the wind and some hikers are found sitting down.

## A logistic regression model and a random forest

Let's train the same logistic regression model we used in the previous exercise, as well as a random-forest model. Both will try to predict which objects are hikers.

First, the logistic regression:

In [None]:
import statsmodels.api
from sklearn.metrics import accuracy_score

# This helper prepends a column of ones (the intercept term) so the
# data is in the shape this particular logistic regression model expects
prep_data = lambda x: np.column_stack((np.full(x.shape, 1), x))

# Train a logistic regression model to predict hiker based on motion.
# prep_data already adds the constant column, so no extra argument is needed
lr_model = statsmodels.api.Logit(train.is_hiker, prep_data(train.motion)).fit()

# Assess its performance
# -- Train
predictions = lr_model.predict(prep_data(train.motion)) > 0.5
train_accuracy = accuracy_score(train.is_hiker, predictions)

# -- Test
predictions = lr_model.predict(prep_data(test.motion)) > 0.5
test_accuracy = accuracy_score(test.is_hiker, predictions)

print("Train accuracy", train_accuracy)
print("Test accuracy", test_accuracy)

# Keep the prediction function for later use
predict_with_logistic_regression = lambda x: lr_model.predict(prep_data(x))

# Plot the model
# Scatter plot of the test data
plt.scatter(test["motion"], test["is_hiker"], alpha=0.6, label="Data", edgecolor='k')

# Create a smooth range of motion values and get predicted probabilities
x_vals = np.linspace(test["motion"].min(), test["motion"].max(), 200)
y_vals = predict_with_logistic_regression(x_vals)

# Plot the logistic regression trendline
plt.plot(x_vals, y_vals, color='red', label="Logistic Regression Model")

plt.xlabel("motion")
plt.ylabel("is_hiker")
plt.title("Logistic Regression")
plt.legend()
plt.grid(True)
plt.show()

Now, our random-forest model:

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Create a random forest model with the default number of trees
# (100 in scikit-learn 0.22 and later)
random_forest = RandomForestClassifier(random_state=2,
                                       verbose=False)

# Train the model
random_forest.fit(train[["motion"]], train.is_hiker)

# Assess its performance
# -- Train
predictions = random_forest.predict(train[["motion"]])
train_accuracy = accuracy_score(train.is_hiker, predictions)

# -- Test
predictions = random_forest.predict(test[["motion"]])
test_accuracy = accuracy_score(test.is_hiker, predictions)


# Print the results
print("Random Forest Performance:")
print("Train accuracy", train_accuracy)
print("Test accuracy", test_accuracy)


These models have similar (but not identical) performance on the test set in terms of accuracy.

## Create ROC plots

Let's create ROC curves for these models. To do this, we'll simply import code from the previous exercise so that we can focus on what we want to learn here. If you need a refresher on how these were made, reread that exercise.

Note that we've made a slight change. Now our method produces both a graph and the table of numbers we used to create the graph.

First, let's look at the logistic regression model:

In [None]:
from m2d_make_roc import create_roc_curve # import our previous ROC code

# Get ROC data
_, thresholds_lr = create_roc_curve(predict_with_logistic_regression, test, "motion")

# Plot ROC curve using matplotlib
plt.figure(figsize=(6, 6))
plt.plot(thresholds_lr["fpr"], thresholds_lr["tpr"], color='blue', label='ROC Curve')
plt.plot([0, 1], [0, 1], linestyle='--', color='gray', label='Random Guess')

plt.xlim([-0.05, 1.05])
plt.ylim([-0.05, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend(loc="lower right")
plt.grid(True)
plt.gca().set_aspect('equal', adjustable='box')
plt.show()

# Show the table of thresholds
thresholds_lr

We can see our model does better than chance (it's not a diagonal line). Our table shows the false positive rate (_fpr_) and true positive rate (_tpr_) for each threshold.
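As a quick reminder of what each row in such a table means, here's a minimal sketch that calculates TPR and FPR by hand for a few thresholds. The labels and probabilities are made up for illustration, the helper `rates_at_threshold` is a hypothetical name, and we assume predictions above the threshold count as hikers. It isn't the code inside `create_roc_curve`.

```python
import numpy as np

def rates_at_threshold(labels, probabilities, threshold):
    """Return (tpr, fpr) when probabilities above `threshold` count as positive."""
    labels = np.asarray(labels, dtype=bool)
    predicted_positive = np.asarray(probabilities) > threshold

    true_positives = np.sum(predicted_positive & labels)
    false_positives = np.sum(predicted_positive & ~labels)

    tpr = true_positives / np.sum(labels)    # share of real hikers we found
    fpr = false_positives / np.sum(~labels)  # share of trees we flagged as hikers
    return tpr, fpr

# Made-up labels (True = hiker) and model outputs, for illustration only
toy_labels = [True, True, True, False, False, False]
toy_probabilities = [0.9, 0.7, 0.4, 0.6, 0.2, 0.1]

for threshold in [0.0, 0.5, 1.0]:
    tpr, fpr = rates_at_threshold(toy_labels, toy_probabilities, threshold)
    print(f"threshold={threshold:.1f}  tpr={tpr:.2f}  fpr={fpr:.2f}")
```

A threshold of 0 labels everything as a hiker, so both rates are 1. A threshold of 1 labels nothing as a hiker, so both rates are 0. The thresholds in between trace out the rest of the curve.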

Let's repeat this for our random-forest model:

In [None]:
# Don't worry about this lambda function. It simply reorganizes 
# the data into the shape expected by the random forest model, 
# and calls predict_proba, which gives us predicted probabilities
# that the label is 'hiker'
predict_with_random_forest = lambda x: random_forest.predict_proba(np.array(x).reshape(-1, 1))[:, 1]

# Generate ROC data (thresholds_rf should be a DataFrame with 'fpr' and 'tpr')
_, thresholds_rf = create_roc_curve(predict_with_random_forest, test, "motion")

# Plot ROC curve
plt.figure(figsize=(6, 6))
plt.plot(thresholds_rf["fpr"], thresholds_rf["tpr"], color='green', label='Random Forest ROC Curve')
plt.plot([0, 1], [0, 1], linestyle='--', color='gray', label='Random Guess')

# Optional: shade the area under the curve
plt.fill_between(thresholds_rf["fpr"], thresholds_rf["tpr"], alpha=0.2, color='green')

plt.xlim([-0.05, 1.05])
plt.ylim([-0.05, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Random Forest ROC Curve")
plt.legend(loc="lower right")
plt.grid(True)
plt.gca().set_aspect('equal', adjustable='box')
plt.show()

# Show the table of results
thresholds_rf

## Area under the curve

Our models look quite similar. Which model do we think is best? Let's use _area under the curve_ (AUC) to compare them. We should expect a number larger than 0.5, because these models are both better than chance, but smaller than 1, because they aren't perfect.

In [None]:
from sklearn.metrics import roc_auc_score

# Logistic regression
print("Logistic Regression AUC:", roc_auc_score(test.is_hiker, predict_with_logistic_regression(test.motion)))

# Random Forest
print("Random Forest AUC:", roc_auc_score(test.is_hiker, predict_with_random_forest(test.motion)))

By a very thin margin, the logistic regression model comes out on top.

Remember, this doesn't mean the logistic regression model will always do a better job than the random forest. It means that, on this data, the logistic regression model is a slightly better choice, and is probably marginally less reliant on having the perfect decision threshold chosen.
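If you're curious where the AUC number comes from, here's a rough sketch that adds up the area under an ROC curve with the trapezoidal rule and compares it to `roc_auc_score`. The labels and probabilities are toy values chosen for illustration, not our hiker data, and we use scikit-learn's `roc_curve` rather than our own ROC code to get the points.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Toy labels and predicted probabilities, for illustration only
toy_labels = np.array([0, 0, 1, 1])
toy_probabilities = np.array([0.1, 0.4, 0.35, 0.8])

# One (fpr, tpr) point per candidate threshold
fpr, tpr, _ = roc_curve(toy_labels, toy_probabilities)

# Trapezoidal rule: each segment contributes width * average height
widths = np.diff(fpr)
average_heights = (tpr[1:] + tpr[:-1]) / 2
auc_by_hand = np.sum(widths * average_heights)

print("AUC by hand:", auc_by_hand)
print("AUC from scikit-learn:", roc_auc_score(toy_labels, toy_probabilities))
```

Both numbers should agree, which shows that AUC summarizes the whole curve rather than any single threshold.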

## Decision threshold tuning

We can also use our ROC information to find the best thresholds to use. We'll just work with our random-forest model for this part.

First, let's take a look at the true positive and false positive rates with the default threshold of 0.5:

In [None]:
# Print out its expected performance at the default threshold of 0.5
# We previously obtained this information when we created our graphs
row_of_0point5 = thresholds_rf[thresholds_rf.threshold == 0.5]
print("TPR at threshold of 0.5:", row_of_0point5.tpr.values[0])
print("FPR at threshold of 0.5:", row_of_0point5.fpr.values[0])

We can expect that, when real hikers are seen, we have an 86% chance of identifying them. When trees are seen, we have a 16% chance of misidentifying them as hikers.

Let's say that for our particular situation, we consider obtaining a true positive just as important as avoiding a false positive. We don't want to ignore hikers on the mountain, but we also don't want to send our team out into dangerous conditions for no reason.

We can find the best threshold by making our own scoring system and seeing which threshold would get the best result:

In [None]:
# Calculate how good each threshold is from our TPR and FPR. 
# Our criterion is that the TPR is as high as possible and 
# the FPR is as low as possible. We consider them equally important
scores = thresholds_rf.tpr - thresholds_rf.fpr

# Find the entry with the highest score according to our criterion.
# np.argmax returns a position, so we look it up with .iloc
index_of_best_score = np.argmax(scores)
best_threshold = thresholds_rf.threshold.iloc[index_of_best_score]
print("Best threshold:", best_threshold)

# Print out its expected performance
print("TPR at this threshold:", thresholds_rf.tpr.iloc[index_of_best_score])
print("FPR at this threshold:", thresholds_rf.fpr.iloc[index_of_best_score])

Our best threshold, with this criterion, is 0.74, not 0.5! This would have us still identify 83% of hikers properly (a slight decrease from 86%) but misidentify only 3.6% of trees as hikers.

If you'd like, play with how we're calculating our scores here, and see how the threshold is adjusted.
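One way to experiment is to weight false positives more or less heavily than true positives. The sketch below does this on a small made-up table. The numbers, the `best_threshold` helper, and the `fpr_weight` parameter are all hypothetical and for illustration only. In the exercise, you'd apply the same idea to `thresholds_rf`.

```python
import numpy as np
import pandas

# Made-up ROC table for illustration; it stands in for thresholds_rf
toy_thresholds = pandas.DataFrame({
    "threshold": [0.1, 0.3, 0.5, 0.7, 0.9],
    "tpr":       [0.98, 0.92, 0.86, 0.80, 0.50],
    "fpr":       [0.60, 0.30, 0.16, 0.05, 0.01],
})

def best_threshold(table, fpr_weight=1.0):
    """Pick the threshold that maximizes tpr - fpr_weight * fpr."""
    scores = table.tpr - fpr_weight * table.fpr
    # np.argmax gives a position, so look the threshold up with .iloc
    return table.threshold.iloc[np.argmax(scores)]

print("Equal weighting:", best_threshold(toy_thresholds))
print("False alarms cost 10x:", best_threshold(toy_thresholds, fpr_weight=10.0))
print("False alarms cost 0.1x:", best_threshold(toy_thresholds, fpr_weight=0.1))
```

With these toy numbers, penalizing false positives more heavily pushes the chosen threshold up, and penalizing them less pulls it down.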

## Summary

That's it! Here we've created ROC curves for two different models, using code we wrote in the previous exercise.

Visually, they were quite similar, and when we compared them using the area-under-the-curve metric we found that the logistic regression model was marginally better performing.

We then used the ROC curve to tune our random-forest model, based on criteria specific to our circumstances. Our very simple criterion of `TPR - FPR` let us pick a threshold that was right for us.