## Example Logistic Model in Python<br>
<br>

As discussed in the lecture, we use logistic regression for classification. The term *"logistic"* refers to the logistic sigmoid function. Typically, we work with a dichotomous model, meaning there are only two classes. Note that *sklearn* also handles multi-class problems.<br>
The idea is that we start with a linear model for the log odds $\log(p/(1-p))$, where $p$ is the probability of class $A$ and $1-p$ the probability of class $B$ in a two-class problem.<br>
<br>
$\log\left(\frac{p}{1-p}\right) = \beta_0 + \sum_{n = 1}^{N} \beta_n\,x_n + \epsilon$
<br>
<br>
Solving for $p$ leads to the logistic equation<br>
<br>
$p = \frac{e^{\beta_0 + \sum_{n = 1}^{N} \beta_n\,x_n + \epsilon}}{1+e^{\beta_0 + \sum_{n = 1}^{N} \beta_n\,x_n + \epsilon}}$
<br>
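The relation between the log odds and the probability can be checked numerically. The following is a minimal sketch; the function names `logit` and `sigmoid` and the example values are chosen here for illustration and are not part of the lecture material.

```python
import numpy as np

def logit(p):
    """Log odds of a probability p in (0, 1)."""
    return np.log(p / (1 - p))

def sigmoid(z):
    """Logistic sigmoid: maps a log-odds value z back to a probability."""
    # 1 / (1 + e^(-z)) is algebraically the same as e^z / (1 + e^z)
    return 1 / (1 + np.exp(-z))

p = np.array([0.1, 0.5, 0.9])   # illustrative probabilities
z = logit(p)                    # log odds; p = 0.5 corresponds to z = 0
p_back = sigmoid(z)             # applying the sigmoid recovers p
```

Positive log odds correspond to $p > 0.5$ and negative log odds to $p < 0.5$, which is why a linear model on the log-odds scale can be turned into a class decision.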

<br>

**0) Loading Libraries**<br>
<br>

As usual, we load all the standard libraries:

In [None]:
#standard libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In contrast to a regression problem, we need to evaluate our classification model via its accuracy, i.e. how often it voted for the correct class, and via how sure it was about its decision (probabilities). For that purpose we want to create a confusion matrix and a so-called cross-entropy plot.

In [None]:
#confusion matrix
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

The actual logistic regression is done by the library **statsmodels**, imported as *sm*.

In [None]:
import statsmodels.api as sm

Of course we need to normalize the data.

In [None]:
#for scaling and normalizing the data
from sklearn.preprocessing import MinMaxScaler

<br>

**1) Loading and Extracting the Data**<br>
<br>

This time we need to load the version of the molecule data set that is categorical (*"Toxic"* vs *"Non-Toxic"*).

In [None]:
Train  = pd.read_csv("../04 Datasets/molecular_train_gbc_cat.csv")
Test   = pd.read_csv("../04 Datasets/molecular_test_gbc_cat.csv")

In [None]:
Test.head()

<br>

<br>

**2) Scaling the Data**<br>
<br>

As before, we scale the data, but extract *X* and *Y* first.

In [None]:
scaler = MinMaxScaler(feature_range = (0, 1)) 

In [None]:
XTrain = Train.drop(['logP', 'label'], axis = 1)
XTest  = Test.drop(['logP', 'label'], axis = 1)

In [None]:
TrainS = scaler.fit_transform(XTrain)
TestS  = scaler.transform(XTest)

In [None]:
#scaling returns an array, but we need a dataframe for the fit routine
TrainS = pd.DataFrame(TrainS, columns = XTrain.columns)
TestS  = pd.DataFrame(TestS, columns = XTest.columns)

In [None]:
TrainS.head()
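What the scaler does can be sketched by hand with *numpy*. This is a minimal illustration on hypothetical toy data (the arrays `X_train` and `X_test` are made up here), assuming the default behavior of min-max scaling to the range $[0, 1]$.

```python
import numpy as np

# hypothetical toy data: rows are samples, columns are features
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 40.0]])
X_test  = np.array([[2.5, 50.0]])

# "fit": learn minimum and maximum per column from the training data only
col_min = X_train.min(axis = 0)
col_max = X_train.max(axis = 0)

# "transform": apply the same training statistics to both sets
X_train_s = (X_train - col_min) / (col_max - col_min)
X_test_s  = (X_test  - col_min) / (col_max - col_min)
```

Because the test set is scaled with the minimum and maximum of the training set, scaled test values can fall outside $[0, 1]$, as the second feature of the toy test sample does here.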

Extracting Y:

In [None]:
YTrain = Train[['label']]
YTest  = Test[['label']]

In [None]:
YTrain.head()

<br>

**3) Performing the Fit**<br>
<br>

For *sm* we need to add the intercept as a constant:

In [None]:
X = sm.add_constant(TrainS)

In [None]:
X.head()

And finally, for *sm* to understand that we have two classes, we generate so-called dummy variables (see feature encoding).

In [None]:
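#two dummy columns: sm reads them as (success, failure), so the first column is the modeled class
Y = pd.get_dummies(YTrain, dtype = int) #integer 0/1 columns instead of booleans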

In [None]:
Y.head()
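The idea behind dummy variables can be sketched with plain *numpy*. The toy labels below are hypothetical; the sketch assumes that the classes are ordered alphabetically, which puts "Non-Toxic" in the first column.

```python
import numpy as np

# hypothetical toy labels
labels = np.array(['Toxic', 'Non-Toxic', 'Non-Toxic', 'Toxic'])

# np.unique returns the sorted distinct classes
classes = np.unique(labels)

# one column per class: 1 where the sample belongs to that class, else 0
dummies = (labels[:, None] == classes[None, :]).astype(int)
```

Each row contains exactly one 1, so for two classes the second column is always one minus the first.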

<br>

Now we can run the fit. Note that we have a binomial problem, *"Toxic"* or *"Non-Toxic"*, and therefore pass the binomial family as an input to *sm*.

In [None]:
my_model = sm.GLM(Y, X, family = sm.families.Binomial()).fit()

In [None]:
my_model.summary()
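To get a feeling for what such a fit does internally, the following sketch estimates the coefficients of a logistic model by maximum likelihood with Newton-Raphson iterations. It is an illustration on hypothetical synthetic data, not the routine *sm* actually runs, although the iteratively reweighted least squares scheme used for generalized linear models is closely related. All names (`X_toy`, `beta_true`, ...) are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical synthetic data: one feature in [0, 1] plus a constant column
n = 500
x = rng.uniform(0, 1, size = n)
X_toy = np.column_stack((np.ones(n), x))   # constant column plays the role of the intercept
beta_true = np.array([-2.0, 4.0])          # assumed "true" coefficients
p_true = 1 / (1 + np.exp(-X_toy @ beta_true))
y = rng.binomial(1, p_true)                # 0/1 class labels

# Newton-Raphson iterations for the maximum likelihood estimate
beta = np.zeros(2)
for _ in range(25):
    p = 1 / (1 + np.exp(-X_toy @ beta))                    # current probabilities
    gradient = X_toy.T @ (y - p)                           # gradient of the log likelihood
    hessian = X_toy.T @ (X_toy * (p * (1 - p))[:, None])   # negative Hessian
    step = np.linalg.solve(hessian, gradient)
    beta = beta + step
    if np.max(np.abs(step)) < 1e-10:                       # stop once the update is negligible
        break

print(beta)   # estimated intercept and slope
```

At the solution the gradient vanishes, i.e. the residuals `y - p` are orthogonal to every column of the design matrix.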

<br>

**4) Evaluating the Fit**<br>
<br>

The results are comparable to those from the linear regression example, since we used the same data set.<br>
Next, we want to evaluate the model: we predict the labels for the test set and compare them to the actual, true labels.

In [None]:
predProbs   = my_model.predict(sm.add_constant(TestS))

In [None]:
Pred        = np.round(predProbs).astype(int) 
predictions = ['Non-Toxic' if i==1 else 'Toxic' for i in Pred] # we saw in my_model.summary() that the first label (index 0) refers to "Non-Toxic"

Now we have the labels and the probabilities:

In [None]:
print(predProbs[:10]) #probabilities

In [None]:
predictions[:10]
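Rounding the probabilities amounts to a decision threshold at $p = 0.5$. A minimal sketch with hypothetical probabilities for the class "Non-Toxic":

```python
import numpy as np

# hypothetical predicted probabilities for "Non-Toxic"
probs_non_toxic = np.array([0.92, 0.35, 0.50, 0.71])

# probabilities above the threshold are labeled "Non-Toxic", all others "Toxic"
threshold = 0.5
labels_pred = np.where(probs_non_toxic > threshold, 'Non-Toxic', 'Toxic')
```

Writing the threshold explicitly makes it easy to shift it, for example if misclassifying a toxic molecule is considered more costly than the opposite error.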

4.1) Based on the predictions, we can calculate the **accuracy**, i.e. how often the model voted for the correct class.

In [None]:
accuracy    = 100*(Test['label'] == predictions).sum()/len(predictions)
print(f'accuracy = {accuracy: .2f}%')

An accuracy of 84% is a relatively good result, considering the overlap between "Toxic" and "Non-Toxic" molecules we saw in earlier modules.

4.2) But the accuracy gives us only limited information. Therefore we generate a **confusion matrix** as discussed in the lecture.

In [None]:
L = ['Non-Toxic', 'Toxic']

cm   = confusion_matrix(Test['label'], predictions, labels = L, normalize = 'true')
disp = ConfusionMatrixDisplay(confusion_matrix = cm, display_labels = L)
disp.plot(cmap = 'gray')
plt.show()
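Why the confusion matrix tells us more than the accuracy can be seen in a small hand-made example. The labels below are hypothetical and deliberately imbalanced; the sketch computes the counts manually and then divides each row by its sum, which corresponds to the option `normalize = 'true'`.

```python
import numpy as np

classes = ['Non-Toxic', 'Toxic']

# hypothetical toy labels: 8 "Non-Toxic" and 2 "Toxic" samples
y_true = np.array(['Non-Toxic'] * 8 + ['Toxic'] * 2)
y_pred = np.array(['Non-Toxic'] * 8 + ['Non-Toxic', 'Toxic'])

# counts[i, j] = number of samples with true class i predicted as class j
counts = np.zeros((2, 2))
for i, t in enumerate(classes):
    for j, p in enumerate(classes):
        counts[i, j] = np.sum((y_true == t) & (y_pred == p))

# accuracy: correct predictions (diagonal) over all predictions
accuracy = np.trace(counts) / counts.sum()

# row normalization: fraction of each true class assigned to each predicted class
cm_rows = counts / counts.sum(axis = 1, keepdims = True)
```

In this toy case the accuracy is 90%, yet the second row of `cm_rows` shows that only half of the "Toxic" samples were recognized.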

4.3) An even more detailed evaluation of the model performance can be obtained by plotting, for each class, a histogram of the probabilities the model assigned to the true label, a so-called **cross entropy** plot. Ideally, this plot shows a clear peak at $p = 1$ for each class. That is usually not the case, and the plot thereby reveals whether the model struggles with particular classes.

In [None]:
PredProbs = np.vstack((predProbs, 1 - predProbs))

In [None]:
fig, ax = plt.subplots(len(L), 1, sharex = True)
fig.set_figheight(6)
fig.subplots_adjust(hspace = 0.5)
fig.suptitle('entropy')

for i, l in enumerate(L):
    idx = [k for k, y in enumerate(Test['label']) if y == l]
    idx = np.array(idx)
    #101 bin edges from 0 to 1 inclusive, so that probabilities close to 1 are not dropped
    (value, where) = np.histogram(PredProbs[i,idx], bins = np.linspace(0, 1, 101), density = True)
    w = 0.5*(where[1:] + where[:-1])
    ax[i].plot(w, value, 'k-')
    ax[i].set_ylabel('density')
    ax[i].set_title(l)
ax[len(L)-1].set_xlabel('probability')
plt.show()
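The histograms can be condensed into a single number, the cross entropy (also called log loss): the mean negative logarithm of the probability the model assigned to the true label. The following sketch uses hypothetical probabilities; the variable names are chosen for illustration only.

```python
import numpy as np

# hypothetical probabilities the model assigned to the true label of each sample
p_true_label = np.array([0.9, 0.8, 0.6, 0.3])

# cross entropy: zero for perfect confidence in the true label, larger otherwise
cross_entropy = -np.mean(np.log(p_true_label))

# a model that always assigns p = 1 to the true label reaches the minimum
perfect = -np.mean(np.log(np.ones(4)))
```

A sharp peak at $p = 1$ in the plots above therefore corresponds to a cross entropy close to zero, while confident wrong predictions (small probability for the true label) are penalized heavily.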