<h1 class="title">Python Tips<br>#4 Logistic Regression</h1>
<br>
<center>Michael Siebel</center>
<br>

In [2]:
# Remove warnings
import warnings
warnings.filterwarnings('ignore')

%run ../HTML_Functions.ipynb 

This post will go through the standard sklearn process of running a logistic regression.  We will rerun Tip #3's code to load IRIS as a toy data set.

# Load Data

In [3]:
# Load Libraries
## Main Data Wrangling Library
import pandas as pd
## Main Linear Algebra Library
import numpy as np
## Split data
from sklearn.model_selection import train_test_split
## Logistic Regression
from sklearn.linear_model import LogisticRegression
## Practice data
from sklearn import datasets

# Load IRIS data
X, y = datasets.load_iris(return_X_y=True, as_frame=True)
## Add column names
X.columns = ["Sepal Length", "Sepal Width", "Petal Length", "Petal Width"]

# Preview data
display(X.head())

,Sepal Length,Sepal Width,Petal Length,Petal Width
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2


We added a few more libraries this time.  numpy is Python's package for linear algebra; pandas is Python's package for data wrangling.  I never run a Python script without numpy and pandas.  Here, though, we only use numpy to count the unique values of the target variable.

# Target Variable

(or dependent variable, or endogenous variable, or outcome, or whatever lexicon you use)

In [4]:
# Add target labels
labels = ['Not Versicolor', 'Versicolor']

# Make Target Binary
y = y == 1

# Target Counts
print('Final Target Counts')
unique_elements, counts_elements = np.unique(y, return_counts=True)
display(pd.DataFrame(counts_elements, index=labels, columns=["Count"]))

Final Target Counts


,Count
Not Versicolor,100
Versicolor,50
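
The same counts can also be produced with pandas alone. The snippet below is a minimal sketch, not the approach used in this post: it assumes the target is reloaded from scratch, and the names `y_binary` and `counts` are illustrative only.

```python
import pandas as pd
from sklearn import datasets

# Reload the IRIS target and recode it as Versicolor (class 1) vs. everything else
_, y = datasets.load_iris(return_X_y=True, as_frame=True)
y_binary = (y == 1)

# value_counts() tallies each unique value; sort_index() puts False before True
counts = y_binary.value_counts().sort_index()

# Relabel the boolean index with the readable labels (illustrative names)
counts.index = ["Not Versicolor", "Versicolor"]
print(counts)
```

Sorting the index first matters here, because `value_counts()` orders by frequency by default and the labels are assigned by position.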


## Stata Equivalent

The IRIS dataset contains three categories for the target variable.  sklearn's logistic regression function is also its multinomial logistic regression function, so our code would still work if we did not make the target variable binary.  Converting to a binary target was quite easy and essentially the same as in Stata, but let's compare it to Stata code:

In [5]:
# Python code:
# y = y == 1
# * Stata code:
# . replace y = y == 1
# * Alternative Stata code:
# . replace y = 0 if y != 1
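
To illustrate the point that the code still works without a binary target, here is a minimal sketch that fits the same estimator on the original three-class target. It uses the default `C` rather than `C=1e9`, and `max_iter=1000` and the name `clf_multi` are my own assumptions for the illustration, not part of this post's workflow.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

# Load IRIS with its original three-class target (codes 0, 1, 2)
X, y = datasets.load_iris(return_X_y=True, as_frame=True)

# Fit on all three classes; max_iter is raised only to give the solver room
clf_multi = LogisticRegression(max_iter=1000).fit(X, y)

# One class code per species
print(clf_multi.classes_)
# One row of coefficients per class, one column per feature
print(clf_multi.coef_.shape)
```

With three classes, `coef_` holds one row of coefficients per class instead of the single row we get for a binary target.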

Earlier, we loaded the function train_test_split to split our dataset into training and testing datasets, and LogisticRegression for the modeling.  train_test_split lets us check whether we are overfitting; this is more relevant for more complex datasets.  With LogisticRegression, I added the option C=1e9.  C is the inverse of the regularization strength, so a very large value essentially removes the effect of regularization.  Again, regularization is meant to prevent too much influence of any one column, but given that we have only 4 columns it isn't needed.  With it removed, the model is likely to produce results similar to what glm() produces in R and what logit produces in Stata.
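
The two ideas in the paragraph above can be sketched as follows. This is an illustration under stated assumptions: the names `default_fit` and `weak_penalty_fit` and the setting `max_iter=10000` are mine, and the comparison with sklearn's default `C=1.0` is added only to show what the large `C` changes.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load IRIS and make the target binary (Versicolor vs. not)
X, y = datasets.load_iris(return_X_y=True, as_frame=True)
y = (y == 1)

# test_size=0.7 holds out 70% of the rows for testing and trains on the rest
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.7, random_state=0
)
print("Train rows:", len(X_train), "Test rows:", len(X_test))

# C is the inverse of regularization strength: small C = strong penalty
default_fit = LogisticRegression(C=1.0, max_iter=10000).fit(X_train, y_train)
weak_penalty_fit = LogisticRegression(C=1e9, max_iter=10000).fit(X_train, y_train)

# The penalty shrinks coefficients, so the penalized fit has the smaller norm
norm_default = np.linalg.norm(default_fit.coef_)
norm_weak = np.linalg.norm(weak_penalty_fit.coef_)
print("Coefficient norm, C=1.0:", norm_default.round(2))
print("Coefficient norm, C=1e9:", norm_weak.round(2))
```

Note that with `test_size=0.7` the model is trained on the smaller share of the data, 45 of the 150 rows.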

# Modeling

In [6]:
# Split Data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.7, random_state=0
)

# Model
clf = LogisticRegression(C=1e9).fit(X_train, y_train)
print("Intercept")
print(clf.intercept_.round(2))
print("")
print("Coefficients")
print(clf.coef_.round(2))
print("")
print("Accuracy Score")
print(round(clf.score(X_test, y_test), 2))

Intercept
[16.02]

Coefficients
[[-0.68 -4.72  0.87 -1.44]]

Accuracy Score
0.7
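
The intercept and coefficients are on the log-odds scale. As a rough sketch of how they map to predicted probabilities, the snippet below rebuilds the probabilities by hand with the logistic function and checks them against `predict_proba`. The names `log_odds`, `manual_prob`, and `sklearn_prob` are illustrative, and the check assumes the default binary setup used above.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Same data, split, and model as in the post
X, y = datasets.load_iris(return_X_y=True, as_frame=True)
y = (y == 1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.7, random_state=0
)
clf = LogisticRegression(C=1e9).fit(X_train, y_train)

# Linear predictor (log-odds) per test row: intercept + X . coefficients
log_odds = clf.intercept_[0] + X_test.to_numpy() @ clf.coef_[0]

# The logistic (sigmoid) function maps log-odds to a probability
manual_prob = 1.0 / (1.0 + np.exp(-log_odds))

# Column 1 of predict_proba is the probability of the True class (Versicolor)
sklearn_prob = clf.predict_proba(X_test)[:, 1]
print(np.allclose(manual_prob, sklearn_prob))
```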


# Conclusion

That is it! 
Throwing all the code together:

In [7]:
# Load Libraries
## Main Data Wrangling Library
import pandas as pd
## Main Linear Algebra Library
import numpy as np
## Split data
from sklearn.model_selection import train_test_split
## Logistic Regression
from sklearn.linear_model import LogisticRegression
## Practice data
from sklearn import datasets

# Load IRIS data
X, y = datasets.load_iris(return_X_y=True, as_frame=True)
## Add column names
X.columns = ["Sepal Length", "Sepal Width", "Petal Length", "Petal Width"]

# Preview data
display(X.head())

# Add target labels
labels = ['Not Versicolor', 'Versicolor']

# Make Target Binary
y = y == 1

# Target Counts
print('Final Target Counts')
unique_elements, counts_elements = np.unique(y, return_counts=True)
display(pd.DataFrame(counts_elements, index=labels, columns=["Count"]))

# Split Data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.7, random_state=0
)

# Model
clf = LogisticRegression(C=1e9).fit(X_train, y_train)
print("Intercept")
print(clf.intercept_.round(2))
print("")
print("Coefficients")
print(clf.coef_.round(2))
print("")
print("Accuracy Score")
print(round(clf.score(X_test, y_test), 2))

,Sepal Length,Sepal Width,Petal Length,Petal Width
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2


Final Target Counts


,Count
Not Versicolor,100
Versicolor,50


Intercept
[16.02]

Coefficients
[[-0.68 -4.72  0.87 -1.44]]

Accuracy Score
0.7


# Save Log

In [None]:
from IPython.display import display, Javascript

display(Javascript(
    "document.body.dispatchEvent("
    "new KeyboardEvent('keydown', {key:'s', keyCode: 83, ctrlKey: true}"
    "))"
))

!jupyter nbconvert --to html_toc "Tip4_Logistic_Regression.ipynb"  --ExtractOutputPreprocessor.enabled=False --CSSHTMLHeaderPreprocessor.style=stata-dark 
