# Overview - Logistic Regression from Linear Regression

## How is it used?

Linear regression predicts a continuous value. Logistic regression is used for classification, where the target is a discrete class label.

## Recall Linear Regression

### Formula

$$ \hat y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_N x_N = \sum_{i=0}^{N} \beta_i x_i, \quad x_0 = 1 $$
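The sum starting at $i=0$ only matches the expanded form if $x_0$ is fixed at 1, so that $\beta_0$ acts as the intercept. A minimal sketch of that bookkeeping in NumPy follows; the coefficients and observations are made-up values for illustration, not fitted ones.

```python
import numpy as np

# Hypothetical coefficients: beta_0 (intercept), beta_1, beta_2
beta = np.array([1.0, 2.0, -0.5])

# Two made-up observations with two features each
X_raw = np.array([[3.0, 4.0],
                  [0.0, 2.0]])

# Prepend a column of ones so that x_0 = 1 multiplies beta_0
X_design = np.c_[np.ones(len(X_raw)), X_raw]

# One matrix product evaluates the sum over i for every row
y_hat_linear = X_design @ beta
print(y_hat_linear)  # [5. 0.]
```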

## Classification: Use Logistic Regression

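Logistic regression models the probability that an observation belongs to a particular group. It does so by passing the linear regression output through the logistic (sigmoid) function: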

$$ \hat y = \sum_{i=0}^{N} \beta_i x_i $$

$$ P = \displaystyle \frac{1}{1+e^{-\hat y}} = \frac{1}{1+e^{-\sum_{i=0}^{N} \beta_i x_i}} $$

$$ = \frac{1}{1+e^{-\beta_0}e^{-\beta_1 x_1}\ldots e^{-\beta_N x_N}} $$
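The transformation can be checked numerically. The sketch below defines a small `sigmoid` helper (a name chosen here for illustration) and confirms that the sum form and the product-of-exponentials form give the same probability. The coefficient and feature values are arbitrary.

```python
import numpy as np

def sigmoid(z):
    """Map a linear score z to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Large negative scores approach 0, large positive scores approach 1
scores = np.array([-4.0, 0.0, 4.0])
probs = sigmoid(scores)
print(probs)

# Arbitrary illustrative values, with x_0 = 1 for the intercept term
beta = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 3.0, 0.25])

# Sum form: e^{-sum(beta_i * x_i)}
p_sum = sigmoid(beta @ x)

# Product form: e^{-beta_0 x_0} * e^{-beta_1 x_1} * ...
p_prod = 1.0 / (1.0 + np.prod(np.exp(-beta * x)))

print(p_sum, p_prod)
```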

# Implementing Logistic Regression

In [1]:
import numpy as np
import pandas as pd

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

## Play with some data

In [2]:
# import some data to play with
from sklearn import datasets
iris = datasets.load_iris()
df = pd.DataFrame(
    data= np.c_[iris['data'], iris['target']],
    columns= iris['feature_names'] + ['target']
)

In [3]:
display(df.head())
display(df.describe())

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0.0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0.0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0.0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0.0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0.0 |


| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| count | 150.0 | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 5.843333 | 3.057333 | 3.758 | 1.199333 | 1.0 |
| std | 0.828066 | 0.435866 | 1.765298 | 0.762238 | 0.819232 |
| min | 4.3 | 2.0 | 1.0 | 0.1 | 0.0 |
| 25% | 5.1 | 2.8 | 1.6 | 0.3 | 0.0 |
| 50% | 5.8 | 3.0 | 4.35 | 1.3 | 1.0 |
| 75% | 6.4 | 3.3 | 5.1 | 1.8 | 2.0 |
| max | 7.9 | 4.4 | 6.9 | 2.5 | 2.0 |


## Prepare the data to do the classification

In [4]:
# Get the features and then the target
X = df.iloc[:,:-1]
y = df.iloc[:,-1]

In [5]:
# Hold out 20% of the rows for testing; random_state makes the shuffle reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=27)
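With 150 rows and `test_size=0.2`, the split leaves 120 rows for training and 30 for testing. The sketch below reproduces the split on the raw iris arrays and additionally passes `stratify`, which the notebook does not use. It is shown only as an optional variation that keeps the three classes equally represented in the test set.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_all, y_all = iris.data, iris.target

# stratify=y_all is an addition for illustration; the notebook's split omits it
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.2, random_state=27, stratify=y_all
)

print(X_tr.shape, X_te.shape)  # row counts of the two partitions
print(np.bincount(y_te))       # number of test rows per class
```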

## Create the logistic regression model

In [6]:
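# C is the inverse regularization strength, so C=1e12 makes the L2 penalty negligible.
# fit_intercept=False fits the model without the beta_0 term.
logreg = LogisticRegression(fit_intercept=False, C=1e12, solver='lbfgs', multi_class='auto')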
model_log = logreg.fit(X_train, y_train)
model_log

LogisticRegression(C=1000000000000.0, class_weight=None, dual=False,
          fit_intercept=False, intercept_scaling=1, max_iter=100,
          multi_class='auto', n_jobs=None, penalty='l2', random_state=None,
          solver='lbfgs', tol=0.0001, verbose=0, warm_start=False)
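After fitting, the learned $\beta$ values are available on the estimator. The sketch below inspects them on a separately fitted model. It keeps `fit_intercept=False` but otherwise uses scikit-learn's default settings with a higher `max_iter`, so it is an assumption-laden stand-in and its numbers will not match the model above.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=27
)

# Illustrative settings: default C, no intercept, generous iteration cap
clf_no_intercept = LogisticRegression(fit_intercept=False, max_iter=1000)
clf_no_intercept.fit(X_tr, y_tr)

# One row of coefficients per class, one column per feature
print(clf_no_intercept.coef_.shape)

# Without a fitted intercept, beta_0 stays at zero for every class
print(clf_no_intercept.intercept_)
```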

In [7]:
y_hat_test = logreg.predict(X_test)
y_hat_train = logreg.predict(X_train)
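`predict` returns only the most likely class. The class probabilities that logistic regression actually models are available through `predict_proba`. The sketch below fits a separate model with default settings (an assumption made to keep the example short, so its values will differ from the notebook's) and shows that each row of probabilities sums to 1 and that the predicted label is the class with the highest probability.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=27
)

# Default settings apart from a higher iteration cap (illustrative choice)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

proba = clf.predict_proba(X_te)  # shape (n_samples, n_classes)
labels = clf.predict(X_te)       # most probable class per row

print(proba[:3].round(3))
print(labels[:3])

# Recover the labels by hand from the probabilities
labels_from_proba = clf.classes_[proba.argmax(axis=1)]
```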

## Evaluate the model

### Training Set

In [8]:
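# Boolean mask: True where the predicted class matches the true class
# (called residuals here, but these are not residuals in the regression sense)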
residuals = y_train == y_hat_train

print('Number of values correctly predicted:')
print(pd.Series(residuals).value_counts())

Number of values correctly predicted:
True     117
False      3
Name: target, dtype: int64


In [9]:
print('Proportion of values correctly predicted:')
print(pd.Series(residuals).value_counts(normalize=True))

Proportion of values correctly predicted:
True     0.975
False    0.025
Name: target, dtype: float64


### Testing Set

In [10]:
residuals = y_test == y_hat_test

In [11]:
print('Number of values correctly predicted:')
print(pd.Series(residuals).value_counts())

Number of values correctly predicted:
True     28
False     2
Name: target, dtype: int64


In [12]:
print('Proportion of values correctly predicted:')
print(pd.Series(residuals).value_counts(normalize=True))

Proportion of values correctly predicted:
True     0.933333
False    0.066667
Name: target, dtype: float64
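The share of `True` values computed above is the model's accuracy. scikit-learn provides `accuracy_score` for the same number and `confusion_matrix` to see which classes are being confused. The sketch below uses a small made-up set of labels rather than the notebook's predictions, so the output is easy to verify by hand.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for illustration: one class-1 sample is predicted as class 2
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 1, 2, 2, 2])

# Same calculation as the boolean-mask approach: mean of the matches
manual_accuracy = (y_true == y_pred).mean()

acc = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class

print(manual_accuracy, acc)
print(cm)
```

Off-diagonal entries of the confusion matrix count the misclassified samples, which the single accuracy figure hides.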
