
# Building a Logistic Regression Model

In this tutorial we will build a classification model to predict whether a policyholder will claim within a year. The response (target) variable will not be a continuous value like before (in Linear Regression), but rather a categorical one. 

The model we will consider here is called **Logistic Regression**.

## Logistic Regression

For classification, linear regression is not the right approach because it implies quantitative differences between classes, which may not be appropriate. Rather than modelling the response directly, logistic regression models the *probability* that the response belongs to a particular category.

For example, if we model the probability of a student's score being above average, we classify the score as *above average* when that probability exceeds a chosen threshold: we predict *above average* if $P(X) > T$, where $T$ is the threshold probability (usually 0.5 in the binary case).

Since we are modelling a probability, $P(X)$ must lie between 0 and 1 to make sense. We therefore require a function whose output lies between 0 and 1 for all input values of $X$. For this we use the **logistic function**, displayed graphically below:

<img src="https://github.com/Samantha-movius/hello-world/blob/master/logistic_reg.png?raw=true" alt="Drawing" style="width: 500px;"/>

It is defined by:

$$P(X) = \displaystyle \frac{e^{\beta_0 + \beta_1 X}}{1+e^{\beta_0 + \beta_1 X}}$$
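As a quick numerical check of the bound, here is a minimal sketch of the logistic function in NumPy. The function name `logistic` and the coefficient values are illustrative assumptions, not values from the claims model.

```python
import numpy as np

def logistic(x, beta_0, beta_1):
    """Logistic function, using the equivalent form 1 / (1 + e^-(b0 + b1*x))."""
    z = beta_0 + beta_1 * x
    return 1.0 / (1.0 + np.exp(-z))

# Five evenly spaced inputs: -10, -5, 0, 5, 10
x = np.linspace(-10, 10, 5)
p = logistic(x, beta_0=0.0, beta_1=1.0)

# Every output lies strictly between 0 and 1, and p is 0.5 where the linear term is 0
print(p)
```

Large negative inputs give probabilities near 0, large positive inputs give probabilities near 1, and the curve increases monotonically in between.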

After a bit of manipulation we arrive at:

\begin{align}
1 - P(X) &= \displaystyle \frac{1}{1+e^{\beta_0 + \beta_1 X}} \\
\therefore \log \left( \frac{P(X)}{1-P(X)} \right) &= {\beta_0 + \beta_1 X}
\end{align}

So the quantity on the left, known as the **log odds** (or logit), is modelled as a linear function of the observations $X$. Without the log, the ratio $P(X)/(1-P(X))$ is known simply as the odds. While $P(X)$ is bounded between 0 and 1, the odds range from 0 to $\infty$.
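The relationship between probability, odds, and the linear term can be checked numerically. The sketch below uses hypothetical coefficients (`beta_0 = -1.0`, `beta_1 = 0.5`) chosen only for illustration.

```python
import numpy as np

beta_0, beta_1 = -1.0, 0.5            # hypothetical coefficients
x = np.array([-2.0, 0.0, 2.0, 6.0])

z = beta_0 + beta_1 * x               # linear predictor
p = 1.0 / (1.0 + np.exp(-z))          # logistic function, bounded in (0, 1)
odds = p / (1.0 - p)                  # bounded in (0, infinity)
log_odds = np.log(odds)               # recovers the linear predictor

print(np.allclose(log_odds, z))
```

Taking the log of the odds returns exactly $\beta_0 + \beta_1 X$, which is why the coefficients are read on the log scale.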

## Building a Logistic Regression Model

In [0]:
# Import some libraries we will need
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [0]:
# Read data in and view first few entries
df = pd.read_csv('claims_data.csv')
df.head()

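|   | age | sex | bmi | steps | children | smoker | region | insurance_claim | claim_amount |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.9 | 3009 | 0 | yes | southwest | yes | 16884.924 |
| 1 | 18 | male | 33.77 | 3008 | 1 | no | southeast | yes | 1725.5523 |
| 2 | 28 | male | 33.0 | 3009 | 3 | no | southeast | no | 0.0 |
| 3 | 33 | male | 22.705 | 10009 | 0 | no | northwest | no | 0.0 |
| 4 | 32 | male | 28.88 | 8010 | 0 | no | northwest | yes | 3866.8552 |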


### Pre-Processing

We will start by pre-processing the data so that we can run it through the algorithm. Just to recap, this involves:
* Splitting the data into features and labels
* Transforming the categorical features 
* Splitting the data into training and testing data

In [0]:
# Labels
y = df['insurance_claim']

# Features
X = df.drop(['insurance_claim', 'claim_amount'], axis=1)

In [0]:
# Transforming the Features
X_transformed = pd.get_dummies(X, drop_first=True)
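To see what `pd.get_dummies(..., drop_first=True)` does, here is a small sketch on a toy frame. The frame `toy` is an assumption built from a few values in the first rows shown earlier; it is not the full dataset.

```python
import pandas as pd

# Toy frame with one numeric and two categorical columns
toy = pd.DataFrame({
    'age': [19, 18, 28],
    'sex': ['female', 'male', 'male'],
    'smoker': ['yes', 'no', 'no'],
})

# Each categorical column becomes indicator columns; drop_first=True drops
# the first category (alphabetically) so the indicators are not redundant
toy_transformed = pd.get_dummies(toy, drop_first=True)
print(toy_transformed.columns.tolist())
```

Numeric columns pass through unchanged, while `sex` and `smoker` are each reduced to a single indicator column (`sex_male`, `smoker_yes`). The dropped category becomes the baseline against which the coefficient is interpreted.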

In [0]:
from sklearn.model_selection import train_test_split

In [0]:
X_train, X_test, y_train, y_test = train_test_split(X_transformed, y, test_size=0.2, random_state=50)

Now our data is ready. Let's train the logistic regression model.

### Training

We import LogisticRegression from sklearn.linear_model. 

In [0]:
from sklearn.linear_model import LogisticRegression

We create an instance of the `LogisticRegression()` object using the default parameters for now. In the following tutorial we'll look at varying one of the parameters in an attempt to improve model performance.

In [0]:
lm = LogisticRegression()

We use the `fit()` method to train the model.

In [0]:
lm.fit(X_train,y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

Now that the model is trained, we can extract the parameters. The parameters consist of the intercept and the coefficients related to the features. These parameters can be used to predict future claims given the features.
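To make the link between parameters and predictions concrete, the sketch below fits a model on synthetic data (an assumption for illustration, not the claims data) and rebuilds the predicted probabilities by hand from `intercept_` and `coef_`. The names `X_toy`, `y_toy` and `toy_model` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic two-feature data, for illustration only
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))
y_toy = (X_toy[:, 0] + 0.5 * X_toy[:, 1] > 0).astype(int)

toy_model = LogisticRegression().fit(X_toy, y_toy)

# Linear predictor: intercept plus the sum of coefficient * feature
z = toy_model.intercept_[0] + X_toy @ toy_model.coef_[0]
p_manual = 1.0 / (1.0 + np.exp(-z))

# Column 1 of predict_proba is the probability of the second class in classes_
p_sklearn = toy_model.predict_proba(X_toy)[:, 1]
print(np.allclose(p_manual, p_sklearn))
```

For a binary problem, passing the linear predictor through the logistic function should reproduce what `predict_proba` reports.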

Intercept

In [0]:
lm.intercept_[0]

-3.6037659420897197

Coefficients

In [0]:
coeff_df = pd.DataFrame(lm.coef_.T,X_transformed.columns,columns=['Coefficient'])
coeff_df

|   | Coefficient |
|---|---|
| age | 0.019738 |
| bmi | 0.168708 |
| steps | -0.000137 |
| children | -1.256102 |
| sex_male | -0.012538 |
| smoker_yes | 3.016435 |
| region_northwest | -0.510453 |
| region_southeast | -0.155735 |
| region_southwest | -0.157182 |


What can you infer from the coefficients above?

### Predicting
As we did before in Linear Regression, we use the `predict()` method to obtain predictions for our test data.

In [0]:
pred_lm = lm.predict(X_test)
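Note that `predict()` returns class labels rather than probabilities. The sketch below, on synthetic data with hypothetical names (`X_toy`, `y_toy`, `toy_model`), suggests how those labels relate to the 0.5 threshold described earlier.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data with string labels, mimicking a yes/no target
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(200, 2))
noise = rng.normal(scale=0.5, size=200)
y_toy = np.where(X_toy[:, 0] + noise > 0, 'yes', 'no')

toy_model = LogisticRegression().fit(X_toy, y_toy)

# classes_ is sorted, so column 1 of predict_proba is P('yes')
p_yes = toy_model.predict_proba(X_toy)[:, 1]
labels_from_threshold = np.where(p_yes > 0.5, 'yes', 'no')

print(toy_model.classes_)
print(np.array_equal(labels_from_threshold, toy_model.predict(X_toy)))
```

If a different threshold $T$ suits the problem better, the same pattern applies with `p_yes > T` in place of `p_yes > 0.5`.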

### Testing

For testing the results we will look at two different metrics called **confusion matrix** and **classification report**.

#### [Confusion Matrix](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html)

![title](https://github.com/Explore-AI/Public-Data/blob/master/Data/matrix2.png?raw=true)

A confusion matrix is a table that is often used to describe the performance of a classification model (or "classifier") on a set of test data for which the true values are known, and the format of this is displayed in the image above. The confusion matrix itself is relatively simple to understand, but the related terminology can be confusing. (Thanks [DataSchool](http://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/) for the help in explaining!)

Let's now define the most basic terms, which are whole numbers (not rates):

* true positives (TP) : These are cases in which we predicted a claim, and they did indeed claim.
* true negatives (TN) : We predicted no claim, and they did indeed not claim.
* false positives (FP): We predicted claim, but they actually didn't claim (Also known as a **Type I error**).
* false negatives (FN): We predicted no claim, but they actually claimed. (Also known as a **Type II error**).

From the confusion matrix, we can determine the model **accuracy** as (TP+TN)/(TP+TN+FP+FN), which is the proportion of observations that were correctly classified.

Now let's import the `confusion_matrix` function to check the results.

In [0]:
from sklearn.metrics import confusion_matrix

The `confusion_matrix` function takes two arguments: the true labels of the unseen test data (`y_test`) and our predictions.

In [0]:
confusion_matrix(y_test, pred_lm)

array([[ 97,  19],
       [ 17, 135]], dtype=int64)

This doesn't look that nice, so we can put the matrix into a dataframe with appropriate labels to make it clearer which values relate to which outcome. The matrix orders the classes alphabetically, so the first row/column refers to 'no' claim, since it comes before 'yes' alphabetically.

In [0]:
labels = ['No claim', 'Claim']

pd.DataFrame(data=confusion_matrix(y_test, pred_lm), index=labels, columns=labels)

|   | No claim | Claim |
|---|---|---|
| No claim | 97 | 19 |
| Claim | 17 | 135 |


Much better. The rows represent the actual output, while the columns indicate the predicted output. We see that we have classified 97+135=232 observations correctly, and 17+19=36 observations incorrectly.
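The accuracy formula can be applied directly to these counts. A minimal sketch, with the matrix values copied by hand from the output above:

```python
import numpy as np

# Rows are actual classes, columns are predicted classes ('no' first, then 'yes')
cm = np.array([[97, 19],
               [17, 135]])

# For a 2x2 matrix in this layout, ravel() yields TN, FP, FN, TP
tn, fp, fn, tp = cm.ravel()
accuracy = (tp + tn) / cm.sum()

print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
print(f"accuracy = {accuracy:.3f}")
```

Here 'Claim' is treated as the positive class, so the 19 false positives are policyholders predicted to claim who did not, and the 17 false negatives are claims the model missed.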

#### [Classification reports](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html)

The classification report gives us more information on where our model is going wrong, looking specifically at how Type I and Type II errors affect performance. The following metrics are calculated as part of the classification report:
* Precision: When the model predicts yes, how often is it correct?

$\text{Precision} = \frac{TP}{TP + FP}$

* Recall: When the outcome is actually yes, how often do we predict it?

$\text{Recall} = \frac{TP}{TP + FN}$

* [F1 score](https://en.wikipedia.org/wiki/F1_score): the harmonic mean of precision and recall.

$F_1 = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$
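As a worked example, these three formulas can be evaluated for the 'Claim' class using the counts from the confusion matrix earlier in this tutorial (TP = 135, FP = 19, FN = 17). This is a hand calculation for illustration, not how scikit-learn computes the report internally.

```python
# Counts for the 'Claim' class, read from the confusion matrix
tp, fp, fn = 135, 19, 17

precision = tp / (tp + fp)   # 135 / 154
recall = tp / (tp + fn)      # 135 / 152
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}")
# prints: precision=0.88, recall=0.89, f1=0.88
```

Repeating the calculation with 'No claim' as the positive class (TP = 97, FP = 17, FN = 19) gives the metrics for the other class.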

Now let's import the `classification_report` function to check the results.

In [0]:
from sklearn.metrics import classification_report

As with the confusion matrix, `classification_report` takes two arguments: the true labels of the unseen test data (`y_test`) and our predictions.

In [0]:
print('Classification Report')
print(classification_report(y_test, pred_lm, target_names=['No claim', 'Claim']))

Classification Report
              precision    recall  f1-score   support

    No claim       0.85      0.84      0.84       116
       Claim       0.88      0.89      0.88       152

   micro avg       0.87      0.87      0.87       268
   macro avg       0.86      0.86      0.86       268
weighted avg       0.87      0.87      0.87       268



Our accuracy is 87%: 232 of the 268 test observations were classified correctly. Can we improve this number? And what can we do to try to improve it? In the next tutorial we will look at ways to increase the accuracy and select the best model.

Before we move on to this, let's consider the advantages and disadvantages of logistic regression.

## Advantages & Disadvantages of Logistic Regression

**Advantages**

* Convenient probability scores for observations (probability of each outcome is transformed into a classification)
* Dependence between features is not a major issue (it is much worse for linear regression)

**Disadvantages**

* Can overfit when data is unbalanced (i.e. one label category dominates)
* Doesn't handle a large number of categorical features/variables well