# Exercise: Assessing a logistic regression model

In the previous exercise, we fit a simple logistic regression model to predict the chance of an avalanche. This time, we will create the same model and take a deeper look at how to best understand the mistakes that it makes.

## Data visualisation

Let's remind ourselves of our data. Remember we are planning to train a model that can predict avalanches based on the number of weak layers of snow.

In [1]:
import pandas
import graphing # custom graphing code. See our GitHub repo for details

#Import the data from the .csv file
dataset = pandas.read_csv('Data/avalanche.csv', delimiter="\t")


#Let's have a look at the data and the relationship we are going to model
print(dataset)

graphing.box_and_whisker(dataset, label_x="avalanche", label_y="weak_layers")

      Unnamed: 0  avalanche  no_visitors  surface_hoar  fresh_thickness  \
0              0          1            2      6.624345         4.388244   
1              1          1            2      3.927031         5.257594   
2              2          1            1      2.707691         3.584448   
3              3          1            9      5.631902         5.376657   
4              4          1            4      6.704904         5.924346   
...          ...        ...          ...           ...              ...   
1090        1090          1            6      5.633441         4.527306   
1091        1091          1            8      4.883818         5.576013   
1092        1092          1            3      4.871239         4.679674   
1093        1093          1            5      3.473572         3.110620   
1094        1094          0            1      4.550976         3.694971   

           wind  weak_layers  tracked_out  
0    -11.126870            6            1  
1    -68.37

It seems that avalanches are associated with having more weak layers of snow. That said, on some days many weak layers were recorded but no avalanche occurred. This means our model will have difficulty being extremely accurate using this feature alone. Let's continue though, and come back to this in a future exercise.

Before we begin, we need to split our dataset into training and test sets. We will train on the _training_ set, and test on (you guessed it) the _test_ set.

In [2]:
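# TODO: split properly (e.g. shuffle the rows first)
# For now, the first 700 rows are used for training and the rest for testing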
train = dataset[:700]
test = dataset[700:]
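The slice above takes the first 700 rows in file order, which can bias both sets if the rows happen to be ordered in some way. A minimal sketch of a shuffled split follows, run on a small made-up frame. The 70/30 ratio, the seed and the `toy_*` names are illustrative assumptions, not the exercise's settings.

```python
import pandas

# Hypothetical toy frame standing in for the avalanche dataset
toy = pandas.DataFrame({
    "weak_layers": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    "avalanche":   [0, 0, 0, 1, 0, 1, 1, 0, 1, 1],
})

# Randomly pick 70% of the rows for training; random_state makes it repeatable
toy_train = toy.sample(frac=0.7, random_state=42)

# Every row that was not selected for training becomes the test set
toy_test = toy.drop(toy_train.index)

print(len(toy_train), len(toy_test))
```

Because rows are drawn at random, both sets should contain a similar mix of avalanche and no-avalanche days, which a slice by position does not guarantee.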


## Fitting a model

Let's fit a simple logistic regression model using log-loss as a cost function. This is a very standard way to fit a classification model - so standard, in fact, that we don't need to specify it at all.

In [3]:
import statsmodels.formula.api as smf

# Perform logistic regression.
model = smf.logit("avalanche ~ weak_layers", train).fit()

print(model.summary())


Optimization terminated successfully.
         Current function value: 0.656670
         Iterations 4
                           Logit Regression Results                           
Dep. Variable:              avalanche   No. Observations:                  700
Model:                          Logit   Df Residuals:                      698
Method:                           MLE   Df Model:                            1
Date:                Tue, 15 Jun 2021   Pseudo R-squ.:                 0.04167
Time:                        22:13:06   Log-Likelihood:                -459.67
converged:                       True   LL-Null:                       -479.66
Covariance Type:            nonrobust   LLR p-value:                 2.572e-10
                  coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------
Interc

## Assessing model visually

For interest's sake, let's plot our model against the actual data in the test dataset.

In [4]:
def predict(weak_layers):
    return model.predict(dict(weak_layers=weak_layers))

graphing.scatter_2D(test, label_x="weak_layers", label_y="avalanche", trendline=predict)

It's hard to see the S-shape of the trendline, because the number of weak layers of snow and the likelihood of an avalanche are only weakly related. If we zoom out, we can get a slightly better view.

In [14]:
graphing.scatter_2D(test, label_x="weak_layers", label_y="avalanche", x_range=[-20,20], trendline=predict)

Checking the earlier graph, we can see that our model will predict an avalanche when the number of weak layers of snow is greater than 5. We can tell this because the value of the line is `0.5` at `x=5`.
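The point where the line crosses `0.5` can also be worked out from the model's two coefficients, because a logistic curve equals 0.5 exactly where `intercept + slope * x = 0`. The sketch below uses hypothetical coefficient values, chosen only so that the boundary lands at 5. They are not the fitted values, which in the notebook would be read from `model.params`.

```python
import numpy

# Hypothetical coefficients (assumptions for illustration, not the fitted values)
intercept = -1.0
slope = 0.2

def probability(weak_layers):
    """Logistic curve: predicted probability of an avalanche."""
    return 1 / (1 + numpy.exp(-(intercept + slope * weak_layers)))

# The curve crosses 0.5 where intercept + slope * x = 0
boundary = -intercept / slope

print(boundary, probability(boundary))
```

With these made-up values, any day with more weak layers than `boundary` gets a probability above 0.5 and is therefore classified as an avalanche day.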

How this relates to the points is hard to tell - the points overlap, so it is difficult to see how many points are at 0 or at 1. How else can we assess the model?

## Assess with cost function

Let's assess our model with a log-loss cost function.


In [None]:
# TO-DO

Hmm. What does that mean? The number seems low, but it's hard to get a grasp on exactly what this means for real-world performance.
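For reference, here is a minimal sketch of how log-loss is calculated, using made-up labels and probabilities rather than the avalanche data. The `log_loss` helper is a hypothetical name written for this illustration. In the notebook it could be applied as `log_loss(test.avalanche, model.predict(test))`.

```python
import numpy

def log_loss(actual, predicted_probability):
    """Mean log-loss (binary cross-entropy) for 0/1 labels."""
    actual = numpy.asarray(actual, dtype=float)
    # Clip so that log(0) never occurs for probabilities of exactly 0 or 1
    p = numpy.clip(numpy.asarray(predicted_probability, dtype=float), 1e-15, 1 - 1e-15)
    return -numpy.mean(actual * numpy.log(p) + (1 - actual) * numpy.log(1 - p))

# Made-up labels and probabilities, for illustration only
print(log_loss([1, 0, 1, 0], [0.9, 0.1, 0.8, 0.3]))

# A model that always answers 0.5 scores ln(2), about 0.693
print(log_loss([1, 0], [0.5, 0.5]))
```

As a rough reference point, the `Current function value` of 0.656670 reported during fitting appears to correspond to the training log-likelihood divided by the number of observations (459.67 / 700), which is the mean log-loss on the training set. That is only a little below the 0.693 that a constant guess of 0.5 would score.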

## Assess accuracy

Let's instead assess _accuracy_. This is the proportion of predictions the model got correct, after predictions are converted from probabilities to `avalanche` or `no-avalanche`.

In [24]:
import numpy
predictions = model.predict(test)

# Convert probabilities to binary predictions (True = avalanche predicted)
avalanche_predicted = predictions >= 0.5

# Calculate how many were predicted correctly, and divide by how many predictions there were
guess_was_correct = test.avalanche == avalanche_predicted
accuracy = numpy.average(guess_was_correct)

# Print the accuracy
print(accuracy)

0.6045662100456621


It looks like it's predicting the correct answer 60% of the time. This is good information. What kind of mistakes is it making, though? Let's take a look at whether it is guessing 'avalanche' when there are none (false positives), or failing to guess 'avalanche' when one actually occurs (false negatives).

In [30]:
# Select the predictions that were wrong
guess_was_wrong = numpy.logical_not(guess_was_correct)

# False positive: calculate how often it guessed avalanche when none actually occurred
false_positive = numpy.sum(guess_was_wrong & (test.avalanche == 0)) / test.shape[0]

# False negative: calculate how often it guessed no avalanche, when one actually happened
false_negative = numpy.sum(guess_was_wrong & (test.avalanche == 1)) / test.shape[0]


print(f"Wrongly predicted an avalanche {false_positive * 100}% of the time")
print(f"Failed to predict avalanches {false_negative * 100}% of the time")

Wrongly predicted an avalanche 26.027397260273972% of the time
Failed to predict avalanches 13.515981735159817% of the time


I think we can agree that's a lot more understandable than the cost function or the graph!
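The two kinds of mistake, together with the two kinds of correct answer, can be counted in one pass. The sketch below does this on made-up labels and predictions. The `toy_*` arrays and the short count names are assumptions for illustration, not part of the exercise.

```python
import numpy

# Made-up labels and predictions, for illustration only
toy_actual    = numpy.array([1, 1, 1, 0, 0, 0, 1, 0], dtype=bool)
toy_predicted = numpy.array([1, 0, 1, 1, 0, 0, 0, 0], dtype=bool)

tp = numpy.sum(toy_predicted & toy_actual)     # predicted avalanche, and one occurred
fp = numpy.sum(toy_predicted & ~toy_actual)    # predicted avalanche, but none occurred
fn = numpy.sum(~toy_predicted & toy_actual)    # predicted none, but one occurred
tn = numpy.sum(~toy_predicted & ~toy_actual)   # predicted none, and none occurred

print(tp, fp, fn, tn)

# Accuracy is the share of all predictions that fall in the two "correct" groups
print((tp + tn) / len(toy_actual))
```

Keeping all four counts side by side shows at a glance whether the errors lean towards false alarms or missed avalanches.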

## Summary

TODO