# Preparing features

<img src="./pict/MLmem.jpg"  
  width="600"
/>

To predict a class, we turn to the familiar logistic regression.

Logistic regression is suitable for classification problems such as ours, where there is a choice between two categories: whether an insurance payment will be required or not.
Let's try to train our model. Do you think it will be possible to do this using raw data? Let's take a risk.

In [19]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

data = pd.read_csv('./data/travel_insurance.csv')

In [63]:
data.head()

Unnamed: 0,Agency,Agency Type,Distribution Channel,Product Name,Claim,Duration,Destination,Net Sales,Commision (in value),Gender,Age
0,CBH,Travel Agency,Offline,Comprehensive Plan,No,186,MALAYSIA,-29.0,9.57,F,81
1,CBH,Travel Agency,Offline,Comprehensive Plan,No,186,MALAYSIA,-29.0,9.57,F,71
2,CWT,Travel Agency,Online,Rental Vehicle Excess Insurance,No,65,AUSTRALIA,-49.5,29.7,,32
3,CWT,Travel Agency,Online,Rental Vehicle Excess Insurance,No,60,AUSTRALIA,-39.6,23.76,,32
4,CWT,Travel Agency,Online,Rental Vehicle Excess Insurance,No,79,ITALY,-19.8,11.88,,41


In [64]:
train, valid = train_test_split(data, test_size=0.25, random_state=12345)

features_train = train.drop('Claim', axis=1)
target_train = train['Claim']
features_valid = valid.drop('Claim', axis=1)
target_valid = valid['Claim']

In [66]:
# model = LogisticRegression()
# model.fit(features_train, target_train) 

Now you can say: “I told you so!” Indeed, an error.

Why did this happen?

Logistic regression calculates class membership using a formula built from the features, and these can only be numerical. Our data also contains categorical features, and that caused the error.

## Direct encoding

The technique of direct encoding, or mapping (One-Hot Encoding, OHE), helps convert categorical features into numerical ones.

We will explain the principle of One-Hot Encoding using the values of the Gender attribute.

For each value of the Gender attribute (F, M, None), a column is created:
    
- Gender_F
- Gender_M
- Gender_None (no gender data).

<img src="./pict/10.png"  
  width="600"
/>

Let's summarize. Using the OHE technique, categorical features are converted into numerical ones in two stages:
- A new column is created for each value of the feature;
- If the object belongs to the category, the column gets 1, otherwise 0.

The new features (Gender_F, Gender_M, Gender_None) are called dummy variables.

For direct coding, the pandas library has a function `pd.get_dummies()` (“get dummy variables”).
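A minimal sketch of `pd.get_dummies()` on the Gender feature. The four-row table is made up for illustration, and filling missing values with the string "None" is an assumption chosen to reproduce the Gender_None column described above.

```python
import pandas as pd

# Hypothetical toy data: one missing Gender value, as in the insurance table
df = pd.DataFrame({"Gender": ["F", "M", None, "F"]})

# Assumption: missing values become their own category named "None"
gender = df["Gender"].fillna("None")

# One new column per category; each row has exactly one "hot" column
gender_ohe = pd.get_dummies(gender, prefix="Gender")
print(gender_ohe)
```

Depending on the pandas version, the new columns hold either 0/1 or False/True; both work as numerical features.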

## Dummy trap

With direct encoding, things are not so simple. When there is too much redundant data, you can fall into the dummy variable trap. Here is how to avoid it.

To apply for a US visa, you need to prove that you have money. You decided to play it safe, so you took a bank account statement, a certificate from work, and a personal income tax statement, although the visa center only needs two documents.

Your model doesn’t need extra information either. If you leave everything as is, it will be harder for it to learn.

Three new columns have been added to the table. Since they are strongly related to each other, we can remove one without regret: it can be restored from the remaining two. This way we won't fall into the dummy trap.

We will remove the column by calling the `pd.get_dummies()` function with the `drop_first` argument.

The argument is passed as `drop_first=True` or `drop_first=False`: with True the first column is dropped, with False it is kept.
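The effect of `drop_first` can be sketched on a made-up Gender column (the data and variable names are illustrative). The example also checks that the dropped column really can be restored from the remaining two.

```python
import pandas as pd

# Hypothetical toy data with three Gender categories
df = pd.DataFrame({"Gender": ["F", "M", "None", "F"]})

full = pd.get_dummies(df["Gender"], prefix="Gender")
reduced = pd.get_dummies(df["Gender"], prefix="Gender", drop_first=True)

print(list(full.columns))     # three columns
print(list(reduced.columns))  # the first one (Gender_F) is dropped

# Gender_F is 1 exactly when both remaining columns are 0
restored_f = 1 - reduced.sum(axis=1)
print((restored_f == full["Gender_F"].astype(int)).all())
```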

When training logistic regression, you may encounter a warning from the sklearn library. To avoid it, specify the solver explicitly, for example `solver='liblinear'`.

## Ordinal encoding

Let's talk about another technique for encoding features in a decision tree and a random forest.

If a decision tree asks questions at nodes, does that mean it can also work with categorical features?

<img src="./pict/1.jpg"  
  width="1100"
/>

Now let's try to train the model:

In [13]:
# from sklearn.tree import DecisionTreeClassifier

# tree = DecisionTreeClassifier(random_state=12345)
# tree.fit(features_train, target_train) 

An error again. It is caused by the way the decision tree is implemented in the `sklearn` library: it accepts only numerical features.

How to fix it? We need a technique that encodes text categories with numbers: Ordinal Encoding. It works like this:
- Each category is assigned a number;
- The numbers are placed in the column in place of the text.

The technique is suitable for feature transformation in decision trees and random forests.

<img src="./pict/11.png"  
  width="300"
/>

The conversion is carried out in three stages:
1. Create an `OrdinalEncoder` object.

2. Call the `fit()` method, as in model training, so that the encoder collects the list of categories for each feature. We pass the data to it as an argument.

3. Transform the data using the `transform()` method. The changed data will be stored in the `data_ordinal` variable.

To keep the column names, we wrap the result in a `DataFrame()` structure and pass it the original column names.

If feature transformation is required only once, as in our problem, the code can be simplified by calling the `fit_transform()` method, which combines `fit()` and `transform()`.
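To see which number each category receives, here is a small sketch on a made-up table. The column names are borrowed from the insurance data, but the rows are hypothetical. `OrdinalEncoder` stores the learned categories in its `categories_` attribute, and a category's position in that list is its code.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy table with two categorical features
toy = pd.DataFrame({
    "Agency Type": ["Airlines", "Travel Agency", "Airlines"],
    "Distribution Channel": ["Online", "Offline", "Online"],
})

encoder = OrdinalEncoder()
toy_ordinal = pd.DataFrame(encoder.fit_transform(toy), columns=toy.columns)

# Categories are sorted; the index in each list is the code used in the column
print(encoder.categories_)
print(toy_ordinal)
```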

## Coding summary

Let's figure out which encoding to choose and why Ordinal Encoding is not suitable for logistic regression.

You were introduced to two techniques for encoding categorical variables. To summarize:

1. If all features have to become quantitative, whatever the model, the OHE technique is suitable;
2. If all the features are categorical and only need to be turned into numbers, as for a decision tree or a random forest, use Ordinal Encoding.

Why is `Ordinal Encoding` not suitable for logistic regression? It calculates everything using a formula, so it treats the codes as quantities. For the Age feature this is reasonable, but with Gender there are difficulties. For example, adding the values “1” and “0” (“woman” and “man”) and dividing by 2 does not result in an “average gender”.

<img src="./pict/12.png"  
  width="900"
/>

# Feature scaling

`The variance of a random variable is a measure of the spread of its values around its mathematical expectation.`

`The square root of the variance is called the standard deviation.`

The data has the columns Age and Commission.

Let's say the possible age ranges from 0 to 100 years, and the insurance commission ranges from 100 to 1000.

The values and their spread in the Commission column are larger, so the algorithm will automatically decide that this feature is more important than age. But this is not so: all features are significant.

One of the scaling methods is `data standardization`.

Assuming that all features are normally distributed, the `mean` (M) and `variance` (D) are determined from the sample. The feature values are converted using the formula:

<img src="./pict/13.png"  
  width="600"
/>

The new feature has a mean of 0 and a variance of 1.

`sklearn` has a separate structure for data standardization, `StandardScaler`. It is located in the `sklearn.preprocessing` module, and we import it from there.

We create an object of this structure and fit it on the training data. Fitting means calculating the mean and variance.

We transform the training and validation samples using the `transform()` method and save the changed sets in the variables `features_train_scaled` and `features_valid_scaled`.
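The three steps above (import, fit on the training data, transform both samples) can be sketched as follows. The two small tables are made up for illustration; in the notebook the real `features_train` and `features_valid` would be used after all features have been made numerical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical numerical features with very different ranges
features_train = pd.DataFrame({"Age": [20, 40, 60],
                               "Commission": [100.0, 550.0, 1000.0]})
features_valid = pd.DataFrame({"Age": [30, 50],
                               "Commission": [200.0, 900.0]})

scaler = StandardScaler()
scaler.fit(features_train)  # the mean and variance come from the training data only

features_train_scaled = scaler.transform(features_train)
features_valid_scaled = scaler.transform(features_valid)

# The same result by hand: subtract the mean, divide by the standard deviation
manual = (features_train - features_train.mean()) / features_train.std(ddof=0)
print(np.allclose(features_train_scaled, manual.values))
```

The validation sample is transformed with the statistics of the training sample, so its own mean and variance will generally not be exactly 0 and 1.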

When writing changed features to the source dataframe, the code may raise a `SettingWithCopy` warning. The reason is the behavior of `sklearn` and `pandas`.

To prevent the warning from appearing, add the following line to the code:
`pd.options.mode.chained_assignment = None`

<div class="alert alert-danger">
IMPORTANT! Scaling of features should only be done after <b>train_test_split</b>
</div>

# Cheat sheet

In [None]:
# One-hot-encoding: getting dummy features
pd.get_dummies(df['column'])
pd.get_dummies(df['column'], drop_first=True)
# drop_first = True - drop the first column (avoiding the dummy trap)

In [2]:
# Ordinal Encoding

from sklearn.preprocessing import OrdinalEncoder
encoder = OrdinalEncoder()
encoder.fit(data)
data_ordinal = encoder.transform(data)

# adding column titles
data_ordinal = pd.DataFrame(encoder.transform(data), columns=data.columns)

# automatic learning and conversion
from sklearn.preprocessing import OrdinalEncoder
encoder = OrdinalEncoder()
data_ordinal = pd.DataFrame(encoder.fit_transform(data), columns=data.columns)

In [None]:
# Standardization
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(df)
df_scaled = scaler.transform(df)

# Accuracy metric

Suppose we work in a patent office. We receive a patent application for a non-invasive method for detecting leukemia in infants. It predicts a newborn's susceptibility to developing leukemia with 98% accuracy.

As we will now see, this test is indeed `98%` accurate! And yet it is useless, which makes it a good illustration of why Accuracy should not be relied on to evaluate the quality of binary classification.

- The patent states that the model predicts leukemia if and only if the patient's name is Gregory.

Let's see how this leukemia test fits into this framework. In 2022 in Ukraine, approximately 3% of babies out of 1000 are given the name Gregory, and the lifetime prevalence of leukemia is 1.4%, or 14 people in every 1000.

If we consider these two factors to be mutually independent and apply the "Gregory means leukemia" test to 1 million people, then we can expect to see a confusion matrix that looks like this:

<div class="scrollable_content">
    <table cellpadding="0" cellspacing="0" style="width: 500px; text-align: center;">
        <thead ><tr>
            <th scope="col" style="text-align: center;"></th>
            <th scope="col" style="text-align: center;">Leukemia</th>
            <th scope="col" style="text-align: center;">Not leukemia</th>
            <th scope="col" style="text-align: center;">Total</th>
               </tr>
        </thead>
        <tbody>
        <tr>
            <td style="text-align: center;">Gregory</td>
            <td style="text-align: center;">42</td>
            <td style="text-align: center;">4 958</td>
            <td style="text-align: center;">5 000</td>
        </tr>
        <tr>
            <td style="text-align: center;">Not Gregory</td>
            <td style="text-align: center;">13 958</td>
            <td style="text-align: center;">981 042</td>
            <td style="text-align: center;">995 000</td>
        </tr>
        <tr>
            <td style="text-align: center;">Total</td>
            <td style="text-align: center;">14 000</td>
            <td style="text-align: center;">986 000</td>
            <td style="text-align: center;">1 000 000</td>
        </tr>
        </tbody></table><div></div></div>

In [44]:
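# Correct answers: true positives (42) plus true negatives (981 042)
correct = 42 + 981042
# All four cells of the table: 42, 4 958, 13 958 and 981 042
total = 42 + 4958 + 13958 + 981042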

result = correct/total
    
print('result {:.2%}'.format(result))

result 98.11%


# Balance and imbalance of classes

We received a percentage of correct answers close to 100%, but it tells us nothing about what is really happening. In our problem there is a strong class imbalance.

Classes are unbalanced when their ratio is far from 1:1. Classes are balanced if their counts are approximately equal.

<img src="./pict/14.jpeg"  
  width="400"
/>

`Accuracy` is not suitable. We need a new metric! But first, a few important definitions.

You already know that a class labeled “1” is called positive, and a class labeled “0” is called negative.

If we compare the true answers with the predictions, we get the following division:
- True Positive (TP) and True Negative (TN)
- False Positive (FP) and False Negative (FN).

Let's summarize. The characteristics "positive" and "negative" refer to the `prediction`, and "true" and "false" refer to its `correctness`.

<b>True Positive</b> What does true positive (TP) mean? The model labeled the object as “1”, and its real value is also “1”.

<b>True Negative</b> If the predicted and actual class values are both negative, the response is a true negative (TN).

<b>False Positive</b> Type I errors are false positive responses (FP). They occur when the model predicted “1”, but the actual value of the class is “0”.

<b>False Negative</b> Type II errors are false negative responses (FN). They occur when the model predicted “0”, but the actual class value is “1”.

## Error Matrix

What we get:

The correct predictions lie on the main diagonal (from the upper left corner):
- TN in the upper left corner;
- TP in the lower right corner.

Outside the main diagonal are erroneous options:
- FP in the upper right corner;
- FN in the lower left corner.

<img src="./pict/15.png"  
  width="500"
/>

The confusion matrix is in the familiar `sklearn.metrics` module. The `confusion_matrix()` function takes the correct answers and the predictions as input and returns an error matrix.
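A short sketch of `confusion_matrix()` on made-up answers and predictions. For binary labels 0 and 1, the returned matrix follows the layout described above, so `ravel()` yields the cells in the order TN, FP, FN, TP.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical correct answers and model predictions
target = [0, 0, 0, 1, 1, 1, 1, 0]
predicted = [0, 1, 0, 1, 0, 1, 1, 0]

matrix = confusion_matrix(target, predicted)
print(matrix)

# Rows are actual classes, columns are predicted classes
tn, fp, fn, tp = matrix.ravel()
print(tn, fp, fn, tp)
```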

# Recall

The error matrix will help you build new metrics. Let's start with recall.

Recall reveals what proportion of the positive objects the model identified among all positive objects. Positive objects are usually worth their weight in gold, and it is important to understand how well the model finds them.

Recall is calculated using the following formula:

<img src="./pict/16.png"  
  width="700"
/>

Recall is the proportion of TP responses among all that have a true label of 1. It’s good when the recall value is close to one: the model is good at finding positive objects. If it’s closer to zero, the model needs to be rechecked and repaired.

# Precision

Precision shows what share of the objects the model labeled positive are actually positive. The more negative objects the model picks up while searching for positive ones, the lower the precision.

Precision is calculated using the following formula:

<img src="./pict/17.png"  
  width="700"
/>

Remember that TP is the number of true positive responses. FP is the number of negative objects that the model marked as positive. We need a precision close to one.
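Both metrics are also available in `sklearn.metrics` as `recall_score()` and `precision_score()`. The sketch below uses made-up labels and compares the library results with the formulas TP / (TP + FN) and TP / (TP + FP).

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical answers: 4 positive and 6 negative objects
target = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# Counted by hand from the two lists above
tp, fn, fp = 3, 1, 2

recall = recall_score(target, predicted)
precision = precision_score(target, predicted)

print(recall, tp / (tp + fn))
print(precision, tp / (tp + fp))
```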

# Precision vs. recall

When a model predicts positive classes poorly, both precision and recall are low. Is it possible to increase their values?

<img src="./pict/17.jpg"  
  width="600"
/>

In [49]:
# Example

total_row = 1000

# Matrix
#  230  170 
#  200  400

tn = 230
fp = 170
fn = 200
tp = 400

In [50]:
accuracy = (tn + tp) / total_row
accuracy

0.63

In [53]:
recall = tp / (tp + fn)
recall

0.6666666666666666

In [55]:
precision = tp / (tp + fp)
precision

0.7017543859649122

In [58]:
tn = 500
fp = 0
fn = 0
tp = 500

recall = tp / (tp + fn)
recall

1.0

We've covered recall, but how can we achieve high precision?

The formula only takes into account the objects that the model labeled positive. So we need a model that, on the contrary, answers “1” as rarely as possible.

`But you can't answer “0” for everything to “hack” the metric: the formula would then divide zero by zero.`

# F1-score

Separately, recall and precision are not very informative. You need to raise both at the same time, or turn to a new metric that unites them.

Recall and precision evaluate the quality of the positive class prediction from different perspectives. Recall describes how well the model understood the features of this class and recognized it. Precision detects whether the model is overdoing it by assigning positive labels.

Both metrics are important. Aggregating metrics, one of which is the F1-score, help control them in parallel. F1 is the harmonic mean of recall and precision. F1 equals one only when both recall and precision equal one.

<img src="./pict/18.png"  
  width="700"
/>

<b>Important:</b> when the Recall or Precision is close to zero, then the harmonic mean itself approaches 0.
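The harmonic mean can be sketched directly and compared with `f1_score()` from `sklearn.metrics`. The helper `f1_from()` and the labels are illustrative, and the standard definition F1 = 2 * precision * recall / (precision + recall) is assumed.

```python
from sklearn.metrics import f1_score


def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical answers with precision 0.6 and recall 0.75
target = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

f1_manual = f1_from(0.6, 0.75)
f1_sklearn = f1_score(target, predicted)
print(f1_manual, f1_sklearn)

# One metric near zero pulls F1 to zero even if the other is perfect
f1_low = f1_from(0.01, 1.0)
print(f1_low)
```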

The graph shows the F1-measure values at different values of precision and recall. Blue corresponds to zero, and yellow to one.

<img src="./pict/19.png"  
  width="500"
/>

# Unbalanced classification

Machine learning algorithms consider all objects in the training set to be equal by default. If it is important to indicate that some objects are more important, their class is assigned a weight (class_weight).

The logistic regression algorithm in the `sklearn` library has a `class_weight` argument. By default it is `None`, i.e. the classes are equivalent.

If you specify `class_weight='balanced'`, the algorithm will calculate how many times class “0” is more common than class “1”.

Let's denote this number N (an unknown number of times). The weights are then set in inverse proportion to class frequency, so the weight of class “1” is N times the weight of class “0”.

The rare class will have more weight.

Decision tree and random forest also have a `class_weight` argument.
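To look at the weights that the 'balanced' mode produces, one option is the helper `compute_class_weight()` from `sklearn.utils.class_weight`. The target below is made up (90 zeros and 10 ones, so N = 9), and the model is only created, not trained.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical unbalanced target: class "0" is 9 times more common
target = np.array([0] * 90 + [1] * 10)

weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]),
                               y=target)
print(weights)  # each weight is n_samples / (n_classes * class count)

ratio = weights[1] / weights[0]
print(ratio)  # the rare class weighs N times more

# The same weighting is applied when the argument is passed to the model
model = LogisticRegression(class_weight="balanced", solver="liblinear",
                           random_state=12345)
```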

## Increasing the sample

How to make objects of a rare class not so rare in the data?

Imagine a test where you get 1 point for solving any problem. The most important problems are repeated several times to make them easier to remember.

When training models, this technique is called `upsampling`.

The transformation takes place in several stages:
- Divide the training sample into negative and positive objects;
- Copy positive objects several times;
- Taking into account the received data, create a new training sample;
- Shuffle the data: repeating the same questions one after another will not help learning.

Python's list multiplication syntax can help you copy objects multiple times. To repeat the elements of a list, it is multiplied by a number (the required number of times):

In [62]:
answers = [0, 1, 0]
print(answers)

answers_x3 = answers * 3
print(answers_x3) 

[0, 1, 0]
[0, 1, 0, 0, 1, 0, 0, 1, 0]
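The upsampling stages can be sketched as a function. The name `upsample()`, its `repeat` parameter and the toy sample are illustrative assumptions. The same list multiplication is applied to a list of tables before gluing them together with `pd.concat()`, and `shuffle()` from `sklearn.utils` mixes features and target consistently.

```python
import pandas as pd
from sklearn.utils import shuffle


def upsample(features, target, repeat):
    # 1. Split the training sample into negative and positive objects
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    # 2-3. Repeat the positive objects and build the new training sample
    features_upsampled = pd.concat([features_zeros] + [features_ones] * repeat)
    target_upsampled = pd.concat([target_zeros] + [target_ones] * repeat)

    # 4. Shuffle features and target together so that rows stay aligned
    return shuffle(features_upsampled, target_upsampled, random_state=12345)


# Hypothetical toy sample: 8 negative and 2 positive objects
features = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6, 7, 8, 100, 200]})
target = pd.Series([0] * 8 + [1] * 2)

features_upsampled, target_upsampled = upsample(features, target, repeat=4)
print(target_upsampled.value_counts())
```

Only the training sample should be upsampled; the validation sample keeps its original class ratio.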


## Reducing the sample

How to make objects of a frequent class less frequent?

Instead of repeating important questions, remove some of the unimportant ones. This can be done using the `downsampling` technique.

The transformation takes place in several stages:

- Divide the training sample into negative and positive objects;
- Randomly discard some of the negative objects;
- Taking into account the received data, create a new training sample;
- Shuffle the data: positive objects should not all follow the negative ones, otherwise it will be harder for the algorithm to learn.
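A possible sketch of these stages follows. The function name `downsample()`, the `fraction` parameter and the toy sample are assumptions for illustration. The pandas `sample()` method with the `frac` argument keeps a random share of the rows, and using the same `random_state` for features and target picks the same rows in both.

```python
import pandas as pd
from sklearn.utils import shuffle


def downsample(features, target, fraction):
    # 1. Split the training sample into negative and positive objects
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    # 2-3. Keep a random share of the negatives and all of the positives
    features_downsampled = pd.concat(
        [features_zeros.sample(frac=fraction, random_state=12345), features_ones])
    target_downsampled = pd.concat(
        [target_zeros.sample(frac=fraction, random_state=12345), target_ones])

    # 4. Shuffle features and target together so that rows stay aligned
    return shuffle(features_downsampled, target_downsampled, random_state=12345)


# Hypothetical toy sample: 8 negative and 2 positive objects
features = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6, 7, 8, 100, 200]})
target = pd.Series([0] * 8 + [1] * 2)

features_downsampled, target_downsampled = downsample(features, target, fraction=0.25)
print(target_downsampled.value_counts())
```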

## Classification threshold

What's the best way to train logistic regression? Let's see what's inside it.

To determine the answer, logistic regression calculates which class an object is closer to, then compares the result to zero. For convenience, we translate closeness to classes into class probability: the model tries to estimate how likely a particular class is.

We have only two classes (zero and one), so the probability of class “1” is enough for us. It is a number from zero to one: if it is greater than 0.5, the object is positive; otherwise, negative.

The boundary where the negative class ends and the positive one begins is called the threshold. By default it is 0.5, but what if you change it?

How will precision and recall change if the threshold is reduced from 0.5 to 0.2?

`Precision will decrease, but recall will increase.` Why? With a lower threshold the model labels more objects as positive. It finds more of the truly positive ones, so recall cannot fall, but it also picks up more negative ones, so precision usually drops.

## Changing the threshold

In the `sklearn` library, class probabilities are calculated by the `predict_proba()` method. It receives object features as input and returns the probabilities of each class.
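A sketch of changing the threshold by hand follows. The data is synthetic (generated with `make_classification()`, an assumption made so the example runs on its own), and the list of thresholds is arbitrary. The second column of the `predict_proba()` result holds the probability of class “1”.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic unbalanced data: roughly 10% of objects are positive
features, target = make_classification(n_samples=1000, weights=[0.9, 0.1],
                                        random_state=12345)
features_train, features_valid, target_train, target_valid = train_test_split(
    features, target, test_size=0.25, random_state=12345)

model = LogisticRegression(solver="liblinear", random_state=12345)
model.fit(features_train, target_train)

probabilities_valid = model.predict_proba(features_valid)
probabilities_one_valid = probabilities_valid[:, 1]  # probability of class "1"

thresholds = [0.1, 0.2, 0.3, 0.4, 0.5]
recalls = []
for threshold in thresholds:
    # An object is positive if its probability exceeds the threshold
    predicted_valid = (probabilities_one_valid > threshold).astype(int)
    precision = precision_score(target_valid, predicted_valid, zero_division=0)
    recall = recall_score(target_valid, predicted_valid, zero_division=0)
    recalls.append(recall)
    print("threshold = {:.1f} | precision = {:.3f}, recall = {:.3f}".format(
        threshold, precision, recall))
```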

# PR curve

Let's plot what the metric values look like when the threshold changes.

The precision value is plotted vertically on the graph, and the recall value is plotted horizontally.

The curve showing their values is called the PR curve. The higher the curve, the better the model.

<img src="./pict/20.png"  
  width="500"
/>
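The points of a PR curve can be obtained with `precision_recall_curve()` from `sklearn.metrics`. This function is not named in the text above, so treat it as one possible tool. The answers and probabilities below are made up.

```python
from sklearn.metrics import precision_recall_curve

# Hypothetical correct answers and predicted probabilities of class "1"
target = [0, 0, 1, 1]
probabilities_one = [0.1, 0.4, 0.35, 0.8]

precision, recall, thresholds = precision_recall_curve(target, probabilities_one)

# One (recall, precision) point per threshold, plus a final point
# with precision 1 and recall 0 that has no threshold of its own
for r, p in zip(recall, precision):
    print("recall = {:.2f}, precision = {:.2f}".format(r, p))
```

The `recall` and `precision` arrays can then be passed to any plotting library, with recall on the horizontal axis and precision on the vertical one.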

# TPR and FPR

When the model labels no objects as positive, precision cannot be calculated. Let's choose other characteristics in which there is no division by zero.

Before moving on to the new curve, let's give a few important definitions.

How do we measure how well a classifier finds objects? By the ratio of correctly predicted objects to the total number of objects in the class. This ratio is called `TPR` (True Positive Rate), or recall. The formula looks like this, where `P=TP+FN`:

<img src="./pict/21.png"  
  width="700"
/>

The ratio of false positives to the total number of objects outside the class (False Positive Rate, `FPR`) is calculated similarly.

This is the ratio of `FP` responses (False Positives, negatives classified as positive) to the total number of negative objects: `FP` plus `TN` (True Negatives, correctly classified negative responses). Below is the formula, where `N=FP+TN`:

<img src="./pict/22.png"  
  width="700"
/>

There will be no division by zero: the denominators contain values that are constant and do not depend on changes in the model.

# ROC curve

We have witnessed a new confrontation - `TPR` versus `FPR`. Let's depict it on a graph.

We plot the proportion of false positive responses (`FPR`) horizontally, and the proportion of true positive responses (`TPR`) vertically. Let's go through the values of the logistic regression threshold and draw a curve.

It is called the `ROC curve`, or `error curve` (receiver operating characteristic; the term comes from signal processing theory).

For a model that always responds at random, the `ROC curve` looks like a straight line going from the bottom left to the top right. The higher the graph, the higher the `TPR` value and the better the quality of the model.

<img src="./pict/23.png"  
  width="500"
/>

To identify how much our model differs from a random one, let’s calculate the area under the ROC curve - `AUC-ROC` (Area Under Curve ROC). This is a new quality metric that ranges from 0 to 1. The `AUC-ROC` of the random model is 0.5.

The `roc_curve()` function from the `sklearn.metrics` module will help you build a ROC curve.

It takes as input the values of the target feature and the probabilities of the positive class, iterates through different thresholds, and returns three arrays: `FPR` values, `TPR` values and the thresholds considered.
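A small sketch of `roc_curve()` on made-up answers and probabilities. The area under the curve is computed with `roc_auc_score()` from the same module; that function is not mentioned in the text above and is added here as the usual companion.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical correct answers and predicted probabilities of class "1"
target = [0, 0, 1, 1]
probabilities_one = [0.1, 0.4, 0.35, 0.8]

fpr, tpr, thresholds = roc_curve(target, probabilities_one)
for f, t in zip(fpr, tpr):
    print("FPR = {:.2f}, TPR = {:.2f}".format(f, t))

# 3 of the 4 (negative, positive) pairs are ranked correctly
auc_roc = roc_auc_score(target, probabilities_one)
print(auc_roc)
```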

# MSE/RMSE in a regression task

Which metric is best suited for regression tasks? MSE, the mean squared error. But how do you tell whether the MSE of a model is good or bad?

<font size="5">  
    MSE = sum of squared object errors / number of objects
</font> 

<img src="./pict/9.png"  
  width="700" />

One simple way is to check the adequacy of the model: compare its predictions with the mean.

<img src="./pict/24.png"  
  width="700"
/>
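The adequacy check can be sketched as follows: compute MSE and RMSE (the square root of MSE) for the model and for a constant prediction equal to the mean. The numbers are made up. The mean is taken from the same sample for brevity, whereas in practice it would come from the training sample.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical correct answers and model predictions
target = np.array([3.0, 5.0, 7.0, 9.0])
predictions = np.array([2.5, 5.5, 6.0, 9.5])

mse = mean_squared_error(target, predictions)
rmse = mse ** 0.5  # same units as the target

# Constant model: always answer with the mean of the target
baseline = np.full_like(target, target.mean())
mse_baseline = mean_squared_error(target, baseline)

print("model:    MSE = {:.4f}, RMSE = {:.4f}".format(mse, rmse))
print("baseline: MSE = {:.4f}".format(mse_baseline))
```

A model is adequate only if its error is lower than that of the constant baseline.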

# Coefficient of determination

To avoid having to constantly compare the model with the average, we introduce a new metric. It is expressed not in absolute values, but in relative ones.

The `coefficient of determination`, or `R2 metric` (R-squared), divides the model's MSE by the MSE of the mean and subtracts this ratio from one. An increase in the metric means an increase in the quality of the model.
The formula for calculating R2 looks like this:

<img src="./pict/25.png"  
  width="700"
/>

- R2 equals one only if the MSE is zero: such a model predicts all answers perfectly.
- R2 is zero: the model performs the same as the mean.
- If R2 is negative, the model performs worse than the mean, so its quality is very low.
- R2 cannot be greater than one.
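These properties can be checked with `r2_score()` from `sklearn.metrics` on made-up numbers. The manual line follows the description above: one minus the model's MSE divided by the MSE of the mean.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical correct answers and model predictions
target = np.array([3.0, 5.0, 7.0, 9.0])
predictions = np.array([2.5, 5.5, 6.0, 9.5])
mean_predictions = np.full_like(target, target.mean())

r2 = r2_score(target, predictions)
r2_manual = 1 - (mean_squared_error(target, predictions)
                 / mean_squared_error(target, mean_predictions))

r2_perfect = r2_score(target, target)         # MSE is zero
r2_mean = r2_score(target, mean_predictions)  # same as the mean
r2_bad = r2_score(target, target[::-1])       # worse than the mean

print(r2, r2_manual, r2_perfect, r2_mean, r2_bad)
```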

# Mean absolute error

Let's get acquainted with a new quality metric, `MAE` (mean absolute error). It is similar to MSE, but without squaring.

Let's write the new metric in symbols, not words.

<img src="./pict/26.png"  
  width="400"
/>

1. The value of the target feature for the object with serial number `i` in the sample on which the quality is measured, for example the test sample. The letter `y` denotes the target feature, and the subscript indicates the object number.

<img src="./pict/27.png"  
  width="400"
/>

2. The prediction value for the object with serial number `i`, for example in the test sample. The circumflex (hat) above the `y` indicates that this is a model prediction and not the correct answer.

The deviation of an object is the difference between the value of the target feature and the prediction.

To get rid of the difference between underestimation and overestimation in the new metric, the `absolute deviation` is calculated.

To collect deviations for the entire sample, we add the following notation:

<img src="./pict/28.png"  
  width="400"
/>

3. Number of objects in the sample.

<img src="./pict/29.png"  
  width="400"
/>

4. Summation over all sample objects (i varies from 1 to N).

This brings us to the formula of the `mean absolute error`, or `MAE`:

<img src="./pict/30.png"  
  width="700"
/>

# MAE interpretation

To calculate MSE, we took the average value as a constant. But is it suitable for calculating MAE? Let's figure it out.

The constant model is chosen so that the value of the MAE metric is as low as possible. We need to find the value of `a` at which the minimum is reached:

<img src="./pict/31.png"  
  width="700"
/>

The minimum is obtained when `a` is equal to the `median` of the target feature.
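This claim can be checked numerically on a made-up target with one outlier: a simple search over integer constants finds the minimum of MAE at the median, not at the mean. The helper `constant_mae()` and the data are illustrative.

```python
import numpy as np

# Hypothetical target with one large outlier
target = np.array([1, 2, 3, 4, 100])


def constant_mae(a):
    """MAE of a constant model that always predicts a."""
    return np.abs(target - a).mean()


# Try every integer constant from 0 to 100 and keep the best one
candidates = np.arange(0, 101)
maes = np.array([constant_mae(a) for a in candidates])
best_a = candidates[maes.argmin()]

print("best constant:", best_a)
print("median:", np.median(target), "MAE:", constant_mae(np.median(target)))
print("mean:  ", target.mean(), "MAE:", constant_mae(target.mean()))
```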

# Impact of scatter on metrics

Let's consider how MAE and RMSE depend on the spread of the target feature.

Unlike MAE, the RMSE metric is more sensitive to large values: significant errors greatly affect the final value of the square root of the mean square error.

Here are three graphs of error distribution (the horizontal axis shows the error values, and the vertical axis shows their number):

1. The first one has an equal number of small (from 0 to 10) and large (40 and above) errors.
2. In the second and third graphs, the gap gradually increases: very large errors appear, along with more small ones. MAE values do not change: large errors are compensated by small ones. But RMSE increases.

<img src="./pict/33.jpg"  
  width="700"
/>

Now you know that you can change the value of one metric without changing another.
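A numeric sketch of the same effect, with three made-up sets of absolute errors that share the same mean but have a growing spread: MAE stays the same while RMSE grows.

```python
import numpy as np

# Hypothetical absolute errors: same mean, increasing spread
error_sets = {
    "equal errors": np.array([20.0, 20.0, 20.0, 20.0]),
    "wider spread": np.array([10.0, 10.0, 30.0, 30.0]),
    "widest spread": np.array([0.0, 0.0, 40.0, 40.0]),
}

results = {}
for name, errors in error_sets.items():
    mae = np.abs(errors).mean()
    rmse = np.sqrt((errors ** 2).mean())
    results[name] = (mae, rmse)
    print("{:<14} MAE = {:.2f}, RMSE = {:.2f}".format(name, mae, rmse))
```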