# Logistic Regression

## What is Logistic Regression?
- Logistic regression is a type of classifier. Despite its name, it is different from linear regression, which predicts a continuous value
- Logistic regression predicts whether something is True or False. Its plot is an S-curve that goes from 0 to 1, read as the probability of True (values near 0 mean False, values near 1 mean True)
- How is it different from SVM?
    - A standard SVM cannot tell us the probability of being classified in a given category
    - For example, going back to the example we used in our SVM class, if my Serotonin is 3 and my Dopamine is 6, what is the chance that I would be considered happy? 90 percent? 60 percent?
- Logistic regression tells us this information!
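
The S-curve described above is the logistic (sigmoid) function. As a minimal sketch, assuming a single made-up feature with a hypothetical weight `w` and bias `b` (illustrative values, not fitted to any data), the mapping from a raw score to a probability looks like this:

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical parameters for illustration only (not learned from data)
w, b = 1.5, -4.0

def predict_probability(x):
    """Probability of the True class for feature value x."""
    return sigmoid(w * x + b)

# Larger feature values push the probability from near 0 toward 1
for x in [0, 2, 4, 6]:
    print(x, round(predict_probability(x), 3))
```

A score of exactly 0 maps to a probability of 0.5, which is the usual cut-off between predicting False and True.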

## Activity: Fit a Logistic Regression Classifier on diabetes.csv
**In groups of 3, complete the following steps:**
- Load the dataset: pd.read_csv('diabetes.csv')
- Use these features: feature_cols = ['Pregnancies', 'Insulin', 'BMI', 'Age']
- Split the data into train and test sets: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
- Obtain the statistics of y_test
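
One way to read "statistics of y_test" is the class balance. The sketch below does not load the CSV. Instead it rebuilds a stand-in for `y_test` from the class counts implied by the confusion matrix printed later in this notebook (130 zeros and 62 ones in the 192-row test set). The name `y_test_standin` is an assumption made for this example.

```python
from collections import Counter

# Stand-in for y_test, rebuilt from the row sums of the confusion matrix output
y_test_standin = [0] * 130 + [1] * 62

counts = Counter(y_test_standin)
total = len(y_test_standin)

# Print each class label with its count and its share of the test set
for label in sorted(counts):
    print(label, counts[label], round(counts[label] / total, 3))
```

In the notebook itself, `y_test.value_counts()` should give the same breakdown. Roughly two thirds of the test rows are class 0, which is worth keeping in mind when judging the classifier's accuracy.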

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

df = pd.read_csv('../Data/Diabetes.csv')

feature_cols = ['Pregnancies', 'Insulin', 'BMI', 'Age']
X = df[feature_cols]
y = df['Outcome']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

In [10]:
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [14]:
y_pred = log_reg.predict(X_test)
y_pred

array([0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [15]:
y_pred_percent = log_reg.predict_proba(X_test)
y_pred_percent

array([[0.63247571, 0.36752429],
       [0.71643656, 0.28356344],
       [0.71104114, 0.28895886],
       [0.5858938 , 0.4141062 ],
       [0.84103973, 0.15896027],
       [0.82934844, 0.17065156],
       [0.50110974, 0.49889026],
       [0.48658459, 0.51341541],
       [0.72321388, 0.27678612],
       [0.32810562, 0.67189438],
       [0.64244443, 0.35755557],
       [0.25912035, 0.74087965],
       [0.63949765, 0.36050235],
       [0.76987637, 0.23012363],
       [0.57345769, 0.42654231],
       [0.80896485, 0.19103515],
       [0.54236399, 0.45763601],
       [0.8809859 , 0.1190141 ],
       [0.56071047, 0.43928953],
       [0.63038849, 0.36961151],
       [0.55812011, 0.44187989],
       [0.62388338, 0.37611662],
       [0.80183978, 0.19816022],
       [0.58322696, 0.41677304],
       [0.84451719, 0.15548281],
       [0.7468329 , 0.2531671 ],
       [0.90256923, 0.09743077],
       [0.30366288, 0.69633712],
       [0.84641691, 0.15358309],
       [0.7802164 , 0.2197836 ],
       [0.
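
Each row of the `predict_proba` output holds the probability of class 0 followed by the probability of class 1, and for this binary model `predict` picks the class with the larger probability. A small sketch using the first ten rows printed above (copied by hand so it runs without the fitted model):

```python
# First ten rows of the predict_proba output shown above: [P(class 0), P(class 1)]
probs = [
    [0.63247571, 0.36752429],
    [0.71643656, 0.28356344],
    [0.71104114, 0.28895886],
    [0.5858938, 0.4141062],
    [0.84103973, 0.15896027],
    [0.82934844, 0.17065156],
    [0.50110974, 0.49889026],
    [0.48658459, 0.51341541],
    [0.72321388, 0.27678612],
    [0.32810562, 0.67189438],
]

# Predict class 1 only when its probability is the larger of the two
labels = [1 if p1 > p0 else 0 for p0, p1 in probs]
print(labels)
```

The resulting labels match the first ten entries of `y_pred` printed earlier.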

In [13]:
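from sklearn.metrics import r2_score, mean_squared_error

# Caution: these are regression metrics, and sklearn expects (y_true, y_pred).
# The arguments are swapped here; this does not affect the MSE but does change R^2.
# The values printed below come from the order as written.
print(r2_score(y_pred, y_test))
print(mean_squared_error(y_pred, y_test))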

-1.5427609427609426
0.3072916666666667
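
R^2 and MSE are regression metrics; for a classifier, accuracy is usually easier to interpret. As a sketch, the counts from the confusion matrix printed later in this notebook are enough to compute it by hand. With 0/1 labels, the MSE is simply the error rate (1 minus accuracy).

```python
# Counts taken from the confusion matrix output in this notebook
# (rows: true label, columns: predicted label)
matrix = [[118, 12],
          [47, 15]]

# Correct predictions sit on the diagonal
correct = matrix[0][0] + matrix[1][1]
total = sum(sum(row) for row in matrix)

accuracy = correct / total
error_rate = 1 - accuracy

print(accuracy)
print(error_rate)
```

The error rate agrees with the MSE value printed above. In the notebook, `sklearn.metrics.accuracy_score(y_test, y_pred)` should report the same accuracy.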


## Confusion Matrix
A confusion matrix is a table that describes the performance of a classifier on a set of test data for which we know the true values. Essentially, we use it to check how well our classifier's predicted values match the known values for the same data.

For a binary classifier, the confusion matrix is a simple 2x2 matrix, but it's important that we go over the terminology for each cell in the matrix:

**True Positives (TP):** we correctly predicted a positive outcome (i.e. someone has diabetes, and we correctly predicted it)

**True Negatives (TN):** we correctly predicted a negative outcome (i.e. someone does not have diabetes, and we correctly predicted it)

**False Positives (FP):** we incorrectly predicted a positive outcome (i.e. someone does not have diabetes, and we incorrectly said that they did)

**False Negatives (FN):** we incorrectly predicted a negative outcome (i.e. someone has diabetes, and we incorrectly said that they do not)



## Activity: Write a function that calculates the confusion matrix for the Pima Diabetes dataset
- How many 0s (no diabetes) in y_test are predicted correctly as 0 (no diabetes) in y_pred?
    - True Negatives
- How many 0s (no diabetes) in y_test are predicted incorrectly as 1 (diabetes) in y_pred?
    - False Positives
- How many 1s (diabetes) in y_test are predicted incorrectly as 0 (no diabetes) in y_pred?
    - False Negatives
- How many 1s (diabetes) in y_test are predicted correctly as 1 (diabetes) in y_pred?
    - True Positives

In [8]:
def compare_ys(y_test, y_pred, test_value, pred_value):
    """Count samples whose true label is test_value and predicted label is pred_value."""
    count = 0
    for i, j in zip(y_test, y_pred):
        if i == test_value and j == pred_value:
            count += 1
    return count

def confusion_matrix(y_test, y_pred):
    """Build a 2x2 matrix: rows are true labels, columns are predicted labels."""
    conf_matrix = np.zeros((2, 2))
    for row_index in [0, 1]:
        for column_index in [0, 1]:
            # Reuse the helper instead of repeating the counting loop
            conf_matrix[row_index, column_index] = compare_ys(
                y_test, y_pred, row_index, column_index)
    return conf_matrix
    
matrix = confusion_matrix(y_test, y_pred)
print(matrix)

[[118.  12.]
 [ 47.  15.]]


### An easier way to compute the confusion matrix using sklearn

In [9]:
from sklearn import metrics

confusion = metrics.confusion_matrix(y_test, y_pred)
print(confusion)

[[118  12]
 [ 47  15]]
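
sklearn puts the true labels on the rows and the predicted labels on the columns, so for 0/1 labels the layout is `[[TN, FP], [FN, TP]]`. A short sketch that names each cell, using the printed values above as a plain nested list:

```python
# Values copied from the sklearn confusion matrix output above
confusion = [[118, 12],
             [47, 15]]

# Row 0 is true class 0 (no diabetes), row 1 is true class 1 (diabetes)
(tn, fp), (fn, tp) = confusion

print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```

With the NumPy array that sklearn returns, `tn, fp, fn, tp = confusion.ravel()` gives the same four values.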
