# INTRODUCTION TO LOGISTIC REGRESSION

## What is Logistic Regression?

[Read here for more math and logic](https://www.datacamp.com/community/tutorials/understanding-logistic-regression-python)

**Logistic regression** is the appropriate regression analysis to conduct when the dependent variable is dichotomous (binary).  Like all regression analyses, the logistic regression is a predictive analysis.  Logistic regression is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables.

**Logistic regression in machine learning** is used to build predictive models for classification.

**Example applications of logistic regression**

- Predict whether a person will buy insurance (yes or no) based on attributes of the individual such as age and salary.
- Predict whether a person will get cancer based on attributes such as eating habits.
- Do body weight, calorie intake, fat intake, and age have an influence on the probability of having a heart attack (yes vs. no)? [Read more](https://en.wikipedia.org/wiki/Logistic_regression#:~:text=Logistic%20regression%20can%20be%20binomial,vs.%20%22loss%22\).)

**Conclusion**:\
Logistic regression is used for classification.

### Types of logistic regression

- **Binomial:** the target variable has only 2 possible values, “0” or “1”, which may represent “win” vs “loss”, “pass” vs “fail”, “dead” vs “alive”, etc.
- **Multinomial:** the target variable has 3 or more possible values that are not ordered (i.e. the values have no quantitative significance), like “disease A” vs “disease B” vs “disease C”.
- **Ordinal:** the target variable has ordered categories. For example, a test score can be categorized as “very poor”, “poor”, “good”, or “very good”. Here, each category can be given a score like 0, 1, 2, 3.

[source](https://www.edureka.co/community/46062/different-types-of-logistic-regression)

## CLASSIFICATION IN MACHINE LEARNING

Classification involves assigning each observation to one of two or more known categories. Logistic models are one way to do this.

### Types of classification

- **Binary classification:** grouping into two classes, 1 or 0 (yes or no, true or false), hence the name binary.
- **Multi-class classification:** grouping into more than two categories or classes, for example Democrat, Republican, or Independent.
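
The two kinds of classification can be sketched with the same scikit-learn estimator. This is only an illustration: it assumes scikit-learn's bundled iris dataset (which is not used anywhere else in this document) as a stand-in for labelled data, and the variable names are hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Assumption: the bundled iris dataset stands in for any labelled data set.
X, y = load_iris(return_X_y=True)

# Multi-class: three flower species, labelled 0, 1 and 2.
multi_model = LogisticRegression(max_iter=1000).fit(X, y)

# Binary: collapse the labels to "is species 0" (1) versus "is not" (0).
y_binary = (y == 0).astype(int)
binary_model = LogisticRegression(max_iter=1000).fit(X, y_binary)

print(multi_model.classes_)   # the known classes of the multi-class model
print(binary_model.classes_)  # the known classes of the binary model
```

In both cases the model returns one probability per known class for every row, and the predicted label is the class with the highest probability.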

## In-Depth Knowledge and Formulas

Linear regression formula:

### $ y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n $

Sigmoid (logistic) function:

### $ p = \frac{1}{1+ e^{-y}} $

Applying the sigmoid function to the linear regression output:

## $ p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n)}} $
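
The formulas above can be sketched in a few lines of NumPy. The coefficients and inputs here are hypothetical values chosen only so the arithmetic is easy to follow; they are not fitted to any data.

```python
import numpy as np

def sigmoid(y):
    """Logistic function: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-y))

def predict_probability(X, beta_0, beta):
    """Apply the sigmoid to the linear combination beta_0 + X . beta."""
    y = beta_0 + X @ beta
    return sigmoid(y)

# Hypothetical coefficients and inputs, chosen only for illustration.
beta_0 = -1.0
beta = np.array([0.5, 0.25])
X = np.array([[2.0, 0.0],    # y = -1 + 1 + 0 = 0  -> p = 0.5
              [4.0, 4.0]])   # y = -1 + 2 + 1 = 2  -> p is about 0.88
p = predict_probability(X, beta_0, beta)
print(p)
```

A linear output of 0 maps to a probability of exactly 0.5, which is why 0.5 is the usual cut-off between the two classes.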

### Properties of logistic regression

1. The dependent variable in logistic regression follows a Bernoulli distribution.
2. The coefficients are estimated by maximum likelihood.
3. There is no R-squared in the linear-regression sense; model fit is assessed with measures such as concordance and the KS statistic.
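
Properties 1 and 2 fit together: each label is treated as a Bernoulli outcome with probability p, and maximum likelihood looks for the coefficients that make the observed labels most probable. A minimal sketch of the quantity being maximized, using hypothetical labels and probabilities rather than a fitted model:

```python
import numpy as np

def log_likelihood(p, y):
    """Bernoulli log-likelihood of labels y given predicted probabilities p."""
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Hypothetical labels and two candidate sets of predicted probabilities.
y = np.array([1, 0, 1, 1, 0])
p_good = np.array([0.9, 0.2, 0.8, 0.7, 0.1])  # agrees with the labels
p_poor = np.array([0.4, 0.6, 0.5, 0.3, 0.7])  # often contradicts them

print(log_likelihood(p_good, y))
print(log_likelihood(p_poor, y))
```

The probabilities that agree with the labels give the higher (less negative) log-likelihood, so a fitting procedure would prefer the coefficients that produced them.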


[Read here for more math and logic](https://www.datacamp.com/community/tutorials/understanding-logistic-regression-python)

## Example

In [268]:
import pandas as pd
import numpy as np

In [269]:
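# Column names in the dataset, listed for reference only:
# the CSV already has a header row, so read_csv picks these names up itself.
col_names = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 
             'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']

df = pd.read_csv('dataset/diabetes.csv')
df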

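|  | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 763 | 10 | 101 | 76 | 48 | 180 | 32.9 | 0.171 | 63 | 0 |
| 764 | 2 | 122 | 70 | 27 | 0 | 36.8 | 0.340 | 27 | 0 |
| 765 | 5 | 121 | 72 | 23 | 112 | 26.2 | 0.245 | 30 | 0 |
| 766 | 1 | 126 | 60 | 0 | 0 | 30.1 | 0.349 | 47 | 1 |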


In [270]:
df.describe()

|  | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.88416 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.3 | 0.24375 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 30.5 | 32.0 | 0.3725 | 29.0 | 0.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 127.25 | 36.6 | 0.62625 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |


In [271]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


In [272]:
pd.isna(df).sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

In [273]:
x = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 
     'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']

y = 'Outcome'

In [274]:
from sklearn.model_selection import train_test_split
from sklearn import preprocessing

In [275]:
X_train, X_test, y_train, y_test = train_test_split(df[x], df[y], train_size=0.8, random_state=0)

In [276]:
X_test

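|  | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|---|
| 661 | 1 | 199 | 76 | 43 | 0 | 42.9 | 1.394 | 22 |
| 122 | 2 | 107 | 74 | 30 | 100 | 33.6 | 0.404 | 23 |
| 113 | 4 | 76 | 62 | 0 | 0 | 34.0 | 0.391 | 25 |
| 14 | 5 | 166 | 72 | 19 | 175 | 25.8 | 0.587 | 51 |
| 529 | 0 | 111 | 65 | 0 | 0 | 24.6 | 0.660 | 31 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 476 | 2 | 105 | 80 | 45 | 191 | 33.7 | 0.711 | 29 |
| 482 | 4 | 85 | 58 | 22 | 49 | 27.8 | 0.306 | 28 |
| 230 | 4 | 142 | 86 | 0 | 0 | 44.0 | 0.645 | 22 |
| 527 | 3 | 116 | 74 | 15 | 105 | 26.3 | 0.107 | 24 |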


In [277]:
y_test

661    1
122    0
113    0
14     1
529    0
      ..
476    1
482    0
230    1
527    0
380    0
Name: Outcome, Length: 154, dtype: int64

In [278]:
# X_train = preprocessing.scale(X_train)
# X_test = preprocessing.scale(X_test)

In [279]:
from sklearn.linear_model import LogisticRegression

In [280]:
model = LogisticRegression()

In [281]:
model.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression()
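
The warning above suggests either raising `max_iter` or scaling the data. A minimal sketch of the scaling route follows. It assumes synthetic data from `make_classification` as a stand-in for `dataset/diabetes.csv`, with hypothetical per-column multipliers to imitate features on very different scales; `scaled_model` is an illustrative name, not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic data stands in for dataset/diabetes.csv.
X, y = make_classification(n_samples=768, n_features=8, random_state=0)
# Put the features on very different scales, as in the diabetes columns.
X = X * np.array([1, 100, 10, 10, 100, 10, 0.1, 10])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=0)

# The scaler is fitted on the training data only, then reused on the test data.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression())
scaled_model.fit(X_train, y_train)

print(scaled_model.score(X_test, y_test))  # accuracy on the held-out data
```

Putting the scaler inside a pipeline means the same transformation is applied automatically at prediction time, which scaling `X_train` and `X_test` separately would not guarantee.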

In [282]:
model.predict([[6, 148, 72, 35, 0, 33.6, 0.627, 50]])

array([1])

In [283]:
model.predict([[12, 121, 78, 17, 0, 26.5, 0.259, 62]])

array([0])
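
The labels returned by `predict` come from the sigmoid formula given earlier. The sketch below checks this by rebuilding the class-1 probability by hand from a fitted model's intercept and coefficients. It assumes synthetic data in place of the diabetes data set, and the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Assumption: synthetic data stands in for the diabetes data set.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
clf = LogisticRegression().fit(X, y)

# Linear part: beta_0 + beta_1*X_1 + ... + beta_n*X_n for the first 5 rows.
linear_part = clf.intercept_ + X[:5] @ clf.coef_.ravel()

# The sigmoid turns the linear part into the probability of class 1.
p_manual = 1.0 / (1.0 + np.exp(-linear_part))
p_sklearn = clf.predict_proba(X[:5])[:, 1]

# predict() returns 1 where that probability exceeds 0.5, otherwise 0.
labels = clf.predict(X[:5])
print(p_manual, p_sklearn, labels)
```

Calling `predict_proba` instead of `predict` is useful when the probability itself matters, for example to apply a cut-off other than 0.5.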

## Model Evaluation

In [284]:
from sklearn import metrics

In [285]:
y_pred = model.predict(X_test)

In [286]:
print('Accuracy: ', metrics.accuracy_score(y_test, y_pred))
print('Precision: ', metrics.precision_score(y_test, y_pred))
print('Recall: ', metrics.recall_score(y_test, y_pred))

Accuracy:  0.8246753246753247
Precision:  0.7631578947368421
Recall:  0.6170212765957447
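
The three scores can be traced back to confusion-matrix counts. The counts below are not printed anywhere in the original output. They are inferred as the only whole-number counts over the 154 test rows that match the reported precision (29/38), recall (29/47), and accuracy (127/154), so treat them as a reconstruction.

```python
# Counts inferred from the reported scores on the 154 test rows
# (an inference from the printed metrics, not output of the original notebook).
tp = 29  # predicted 1, actually 1
fp = 9   # predicted 1, actually 0
fn = 18  # predicted 0, actually 1
tn = 98  # predicted 0, actually 0

accuracy = (tp + tn) / (tp + fp + fn + tn)  # share of all predictions that are correct
precision = tp / (tp + fp)                  # share of predicted positives that are correct
recall = tp / (tp + fn)                     # share of actual positives that were found

print('Accuracy: ', accuracy)
print('Precision: ', precision)
print('Recall: ', recall)
```

Read this way, the model finds about 62% of the actual positive cases, and about 76% of the cases it flags as positive really are positive.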
