# Classification

## Misclassification
If we want to classify customers as fraudulent or non-fraudulent, we could use a linear model, such as linear regression, to determine which features contribute most to the predictions.<br>
This translates to finding the vector of betas that minimises the number of misclassifications.

## What is meant by misclassification?

In terms of loss function, linear models typically minimise the sum of squared errors.<br>
But if the target value is one, as in binary classification, such a quadratic loss penalises any large deviation from one, even when the prediction falls on the correct side of the decision boundary.<br>
Hence the sum of squares is not ideal for classification.<br>
Being numerically close to the target value means little in a classification problem, since all we care about is whether each example is assigned to the correct class.

Instead we would use a zero-one loss, which counts the number of misclassifications.<br>
For example, if we predict correctly the loss is zero, otherwise it is one.<br>
However, such a function is difficult to minimise because it is not smooth, so we use a smoother surrogate: the log loss, which is the loss function minimised by logistic regression.<br>
The log loss relies on the sigmoid function, which maps any continuous input to a value between zero and one:

$ S(x) = \frac{1}{1+e^{-x}} $

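$ S(x) $ - The output of the sigmoid function for the raw (linear) model output $ x $; it always lies strictly between zero and one.

$ \frac{1}{1+e^{-x}} $ - In logistic regression, this value is interpreted as the probability that the example belongs to the positive class. The log loss on the training data set (the training loss) is computed from these probabilities.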
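For reference, the log loss is usually written as follows. The symbols $ N $ (number of training examples), $ y_i $ (true label, zero or one) and $ p_i = S(x_i) $ (predicted probability of the positive class) are notation introduced here for illustration and do not come from the notebook:

$ L = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right] $

A confident wrong prediction (for example $ p_i $ close to zero when $ y_i = 1 $) makes the corresponding term very large, whereas the zero-one loss would only count it as a single error.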

If the outcome is binary (zero/one, true/false, etc.), the model is fitted by minimising the log loss on the training dataset, with the sigmoid function turning the raw model output into a probability. This is equivalent to maximising the probability the model assigns to the correct class of each training example.<br>
The probability of belonging to the negative class is the complement, one minus the probability of the positive class, since the two probabilities must sum to one.
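The ideas above can be tried out on a few numbers. This is a minimal sketch: the raw scores and labels are made up for illustration, and the helper names `sigmoid`, `zero_one_loss` and `log_loss` are hypothetical, not part of the notebook.

```python
import numpy as np

def sigmoid(x):
    """Map raw scores to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def zero_one_loss(y_true, p_pos, threshold=0.5):
    """Count misclassifications when predicting 1 for p_pos >= threshold."""
    y_hat = (p_pos >= threshold).astype(int)
    return int(np.sum(y_hat != y_true))

def log_loss(y_true, p_pos):
    """Mean negative log-likelihood of the true labels."""
    return float(-np.mean(y_true * np.log(p_pos) + (1 - y_true) * np.log(1 - p_pos)))

# Made-up raw scores and labels, for illustration only
raw_scores = np.array([-3.0, -0.5, 0.2, 4.0])
y_true = np.array([0, 1, 1, 1])

p_pos = sigmoid(raw_scores)   # probability of the positive class
p_neg = 1.0 - p_pos           # probability of the negative class (complement)

print(p_pos.round(3))
print("zero-one loss:", zero_one_loss(y_true, p_pos))
print("log loss:", round(log_loss(y_true, p_pos), 3))
```

The zero-one loss only changes when a prediction crosses the threshold, while the log loss changes smoothly with every probability, which is what makes it easier to minimise.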

In [1]:
import pandas as pd
df = pd.read_csv(r'C:\Users\Daniel Ho\DFE-DATA4\DFE Data Engineering Next Steps\scikit_data\diabetes.csv')

In [4]:
X = df.drop(['diabetes'], axis = 1)
y = df['diabetes']

In [5]:
X.head() # the features

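|  | pregnancies | glucose | diastolic | triceps | insulin | bmi | dpf | age |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 |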


In [6]:
y.head() # the target is binary

0    1
1    0
2    1
3    0
4    1
Name: diabetes, dtype: int64

In [3]:
from sklearn.model_selection import train_test_split

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [8]:
from sklearn.linear_model import LogisticRegression

In [9]:
lr = LogisticRegression(max_iter=10000)

In [10]:
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)

In [12]:
y_pred

array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0,
       1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0,
       0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1,
       0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0,
       0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1,
       0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1,
       0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
      dtype=int64)

Consider the first row of `X_test` (position zero, i.e. `X_test.iloc[0]`).<br>
`y_pred` gives a zero for this position. Why is this?

In [13]:
X_test.iloc[0]

pregnancies      6.00
glucose         98.00
diastolic       58.00
triceps         33.00
insulin        190.00
bmi             34.00
dpf              0.43
age             43.00
Name: 668, dtype: float64

Because this is a linear model, we can reproduce the raw model output by hand: multiply the estimated coefficients by the feature values in this row of `X_test`, then add the intercept:

In [14]:
lr.coef_ @ X_test.iloc[0] + lr.intercept_

array([-0.96512498])

We get a negative value for the raw model output.<br>
A negative raw output means the example is classified as belonging to the negative class.<br>
That is why we see a zero at `y_pred[0]`.<br>
The classification threshold is a probability of 0.5, which corresponds to a raw output of zero, since $ S(0) = 0.5 $.

Looking at position 9:

In [16]:
lr.coef_ @ X_test.iloc[9] + lr.intercept_

array([1.25016766])

In [18]:
y_pred[9]

1

Here the raw model output is positive, so the example is predicted as one: it belongs to the positive class.
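The link between the raw score and the predicted class can be checked for every row at once. The sketch below assumes a synthetic dataset from `make_classification` as a stand-in for the diabetes data, and the names `X_demo`, `y_demo` and `clf` are hypothetical. It compares the hand-computed scores with scikit-learn's `decision_function`, then checks that `predict` returns the positive class exactly when the score is above zero.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption; not the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

clf = LogisticRegression(max_iter=10000).fit(X_demo, y_demo)

# Raw scores computed by hand: one dot product per row, plus the intercept
manual_scores = X_demo @ clf.coef_.ravel() + clf.intercept_[0]

# The same raw scores as returned by scikit-learn
library_scores = clf.decision_function(X_demo)

# A positive raw score maps to class 1, anything else to class 0
manual_labels = (manual_scores > 0).astype(int)

print(np.allclose(manual_scores, library_scores))
print(np.array_equal(manual_labels, clf.predict(X_demo)))
```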

If instead we want the predicted probabilities, we can use the `predict_proba` method, which takes `X_test` and returns an array of probabilities for each example:

In [20]:
y_pred_proba = lr.predict_proba(X_test)

In [23]:
y_pred_proba[:5]

array([[0.72414674, 0.27585326],
       [0.81155295, 0.18844705],
       [0.88551068, 0.11448932],
       [0.83645513, 0.16354487],
       [0.52847124, 0.47152876]])

We get a two-dimensional array with one row per example, each containing two values.<br>
The left column is the probability that the example belongs to the negative class.<br>
The right column is the probability that it belongs to the positive class.<br>
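These probabilities are the sigmoid of the raw model output. As a quick check, the sketch below applies the sigmoid to the raw score printed earlier for `X_test.iloc[0]`; the helper name `sigmoid` is introduced here for illustration. The result should agree with the first row of the `predict_proba` output up to rounding.

```python
import math

def sigmoid(x):
    """Logistic sigmoid, mapping a raw score to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

# Raw score printed earlier for X_test.iloc[0]
raw_score = -0.96512498

p_positive = sigmoid(raw_score)   # right-hand column of predict_proba
p_negative = 1.0 - p_positive     # left-hand column of predict_proba

print(round(p_negative, 5), round(p_positive, 5))
```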

We initialised the logistic regression model with `LogisticRegression(max_iter=10000)`.<br>
`max_iter` is one parameter of the model, but logistic regression has others, including `C`, as in `LogisticRegression(C=1)`, which controls regularisation: `C` is the inverse of the regularisation strength, so smaller values mean stronger regularisation.<br>
By default the logistic regression applies L2 regularisation, like the one applied in the Ridge regression model.<br>
The type of regularisation can be controlled with the `penalty` argument, as in `LogisticRegression(penalty='l2')`, which is set to `'l2'` by default.

In [None]:
LogisticRegression(C=1, penalty='l2', max_iter=10000)
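To see what `C` does in practice, one option is to fit the same model with several values and compare the size of the fitted coefficients. The sketch below assumes synthetic data from `make_classification` rather than the diabetes dataset, and the names `X_demo`, `y_demo` and `coef_norms` are hypothetical. It relies on the default L2 regularisation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption; not the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

coef_norms = {}
for C in [0.01, 1.0, 100.0]:
    clf = LogisticRegression(C=C, max_iter=10000).fit(X_demo, y_demo)
    # L2 norm of the fitted coefficients: smaller when regularisation is stronger
    coef_norms[C] = float(np.linalg.norm(clf.coef_))

for C, norm in coef_norms.items():
    print(f"C={C:<6} coefficient norm={norm:.3f}")
```

A smaller `C` shrinks the coefficients towards zero, which is the same effect a larger `alpha` has in Ridge regression.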

Using grid search to look for the best parameters:

In [24]:
from sklearn.model_selection import GridSearchCV

In [29]:
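import numpy as np
# Candidate values of C, evenly spaced on a log scale from 1e-5 to 1e8
c_space = np.logspace(-5, 8, 15)
param_grid = {'C': c_space, 'penalty': ['l1', 'l2']}
# Note: the default solver (lbfgs) does not support the 'l1' penalty, so those
# candidates fail to fit and are scored as NaN; only the 'l2' candidates compete
logreg = LogisticRegression()
# 5-fold cross-validation over every combination in param_grid
logreg_cv = GridSearchCV(logreg, param_grid, cv=5)
logreg_cv.fit(X_train, y_train)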

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

GridSearchCV(cv=5, estimator=LogisticRegression(),
             param_grid={'C': array([1.00000000e-05, 8.48342898e-05, 7.19685673e-04, 6.10540230e-03,
       5.17947468e-02, 4.39397056e-01, 3.72759372e+00, 3.16227766e+01,
       2.68269580e+02, 2.27584593e+03, 1.93069773e+04, 1.63789371e+05,
       1.38949549e+06, 1.17876863e+07, 1.00000000e+08]),
                         'penalty': ['l1', 'l2']})

In [31]:
logreg_cv.best_params_

{'C': 31.622776601683793, 'penalty': 'l2'}

In [32]:
logreg_cv.best_score_

0.7720378515260563

In [33]:
best_model = logreg_cv.best_estimator_

In [34]:
y_pred = best_model.predict(X_test)

In [35]:
y_pred

array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0,
       1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0,
       0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1,
       0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0,
       0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1,
       0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1,
       0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0],
      dtype=int64)
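The convergence warnings printed during the grid search suggest scaling the data or trying another solver. One possible way to follow that advice is sketched below, under several assumptions: synthetic data stands in for the diabetes dataset, the features are standardised in a `Pipeline`, and the `liblinear` solver is used because it accepts both the `'l1'` and `'l2'` penalties. The step names `scale` and `logreg` and the narrower `C` range are choices made here for illustration, not the notebook's settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Scale the features, then fit a solver that accepts both 'l1' and 'l2'
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("logreg", LogisticRegression(solver="liblinear", max_iter=10000)),
])

# Pipeline parameters are addressed as <step name>__<parameter name>
param_grid = {
    "logreg__C": np.logspace(-3, 3, 7),
    "logreg__penalty": ["l1", "l2"],
}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_tr, y_tr)

print(search.best_params_)
print("cross-validated accuracy:", round(search.best_score_, 3))
print("held-out accuracy:", round(search.score(X_te, y_te), 3))
```

Putting the scaler inside the pipeline means it is re-fitted on each training fold, so the validation folds do not leak into the scaling statistics.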

# Evaluating the model
## Using a confusion matrix

|  | Pred: No | Pred: Yes |
| --- | --- | --- |
| Act: No | 80 | 19 |
| Act: Yes | 18 | 37 |

In [41]:
from sklearn.metrics import confusion_matrix

In [42]:
confusion_matrix(y_test, y_pred)

array([[80, 19],
       [18, 37]], dtype=int64)

Generally, we want to minimise the number of false negatives and maximise the number of true positives (in this instance, since a false negative here is a case of diabetes that the model misses).<br>
From the confusion matrix we can also derive other metrics, such as precision and recall.<br>
We can compute them using:

In [43]:
from sklearn.metrics import classification_report

In [44]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.82      0.81      0.81        99
           1       0.66      0.67      0.67        55

    accuracy                           0.76       154
   macro avg       0.74      0.74      0.74       154
weighted avg       0.76      0.76      0.76       154
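The figures for class 1 in this report can be reproduced by hand from the confusion matrix printed earlier. The sketch below hard-codes that matrix; the variable names are introduced here for illustration.

```python
import numpy as np

# Confusion matrix printed above: rows are actual classes, columns are predicted
cm = np.array([[80, 19],
               [18, 37]])

# For binary labels 0/1, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)        # of the predicted positives, how many are right
recall = tp / (tp + fn)           # of the actual positives, how many are found
accuracy = (tp + tn) / cm.sum()   # overall fraction of correct predictions

print(f"precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.2f}")
# prints: precision=0.66 recall=0.67 accuracy=0.76
```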



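There is usually a trade-off between precision and recall.<br>
They tend to move in opposite directions: for example, raising the classification threshold typically increases precision but lowers recall.<br>
Improving one usually comes at the expense of the other.

To balance the two, we can use the F1 score, which is the harmonic mean of precision and recall.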
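The trade-off can be made visible by moving the classification threshold and recomputing the metrics. This is a minimal sketch on synthetic data (an assumption, not the diabetes dataset), with hypothetical names such as `X_demo` and `results`; the thresholds are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=600, n_features=8, flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

clf = LogisticRegression(max_iter=10000).fit(X_tr, y_tr)
p_pos = clf.predict_proba(X_te)[:, 1]   # probability of the positive class

results = []
for threshold in [0.2, 0.35, 0.5, 0.65, 0.8]:
    # Predict the positive class only when its probability reaches the threshold
    y_hat = (p_pos >= threshold).astype(int)
    precision = precision_score(y_te, y_hat, zero_division=0)
    recall = recall_score(y_te, y_hat, zero_division=0)
    f1 = f1_score(y_te, y_hat, zero_division=0)
    results.append((threshold, precision, recall, f1))
    print(f"threshold={threshold:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Recall can only stay the same or fall as the threshold rises, because fewer examples are predicted positive. Precision usually rises, although that is not guaranteed at every step.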