# Week 10 Classification Lecture Demo

This notebook demonstrates logistic regression through a worked example. The two datasets used in the demonstration are available in the `data` folder of this repo. We will continue to use the `statsmodels` library for the analysis.

## Example: Logistic Regression

**Case Background**

The `Lasagna Triers Logistic Regression.csv` file contains data on 856 people who have either tried or not tried a company’s new frozen lasagna product. The categorical dependent variable, Have Tried, and several of the potential explanatory variables contain text. Using the numeric variables, including dummies, how well is logistic regression able to classify the triers and nontriers?

Therefore, the objective of this case is to use logistic regression to classify users as triers or nontriers, and to interpret the resulting output. 
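
As a reminder of what the classifier does, logistic regression models the log-odds of trying the product as a linear function of the predictors and converts them to a probability with the logistic function. The sketch below is illustrative only: the `logistic` helper and the log-odds values are made up for the demonstration and are not taken from the lasagna data.

```python
import numpy as np

def logistic(log_odds):
    """Map log-odds (any real number) to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-log_odds))

# Hypothetical log-odds values, chosen only for illustration
log_odds = np.array([-2.0, 0.0, 1.5])
probs = logistic(log_odds)

# Classifying at a 0.5 probability cutoff is the same as asking
# whether the log-odds are positive
labels = (probs > 0.5).astype(int)
print(probs)
print(labels)
```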

<center><img src="../Image/lasana.jpg" width=400 height=400 /></center>

In [19]:
import pandas as pd
import statsmodels.formula.api as smf
import numpy as np

df_lasagna = pd.read_csv('../data/Lasagna Triers Logistic Regression.csv')
df_lasagna.head()

Unnamed: 0,Person,Age,Weight,Income,Pay Type,Car Value,CC Debt,Gender,Live Alone,Dwell Type,Mall Trips,Nbhd,Have Tried
0,1,48,175,65500,Hourly,2190,3510,Male,No,Home,7,East,No
1,2,33,202,29100,Hourly,2110,740,Female,No,Condo,4,East,Yes
2,3,51,188,32200,Salaried,5140,910,Male,No,Condo,1,East,No
3,4,56,244,19000,Hourly,700,1620,Female,No,Home,3,West,No
4,5,28,218,81400,Salaried,26620,600,Male,No,Apt,3,West,Yes


In [20]:
df_lasagna = df_lasagna.rename(columns={'Pay Type':'Pay_Type', 'Live Alone':'Live_Alone',
                                        'Dwell Type':'Dwell_Type','Have Tried':'Have_Tried',
                                        'Car Value':'Car_Value','CC Debt':'CC_Debt',
                                        'Mall Trips':'Mall_Trips'})

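# dtype=int keeps the dummies as 0/1 integers (newer pandas versions default to booleans)
df_lasagna = pd.get_dummies(df_lasagna, columns=['Pay_Type','Gender','Live_Alone',
                                                 'Dwell_Type','Have_Tried'], dtype=int)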
df_lasagna.head()

Unnamed: 0,Person,Age,Weight,Income,Car_Value,CC_Debt,Mall_Trips,Nbhd,Pay_Type_Hourly,Pay_Type_Salaried,Gender_Female,Gender_Male,Live_Alone_No,Live_Alone_Yes,Dwell_Type_Apt,Dwell_Type_Condo,Dwell_Type_Home,Have_Tried_No,Have_Tried_Yes
0,1,48,175,65500,2190,3510,7,East,1,0,0,1,1,0,0,0,1,1,0
1,2,33,202,29100,2110,740,4,East,1,0,1,0,1,0,0,1,0,0,1
2,3,51,188,32200,5140,910,1,East,0,1,0,1,1,0,0,1,0,1,0
3,4,56,244,19000,700,1620,3,West,1,0,1,0,1,0,0,0,1,1,0
4,5,28,218,81400,26620,600,3,West,0,1,0,1,1,0,1,0,0,0,1


In [21]:
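# The dummies left out of the formula (Pay_Type_Hourly, Gender_Female,
# Live_Alone_No, Dwell_Type_Apt) serve as the reference categories.
our_formula = ('Have_Tried_Yes ~ Age + Weight + Income '
               '+ Car_Value + CC_Debt + Mall_Trips '
               '+ Pay_Type_Salaried + Gender_Male '
               '+ Live_Alone_Yes + Dwell_Type_Condo + Dwell_Type_Home')

logitfit = smf.logit(formula=our_formula, data=df_lasagna).fit()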
print(logitfit.summary())

Optimization terminated successfully.
         Current function value: 0.401836
         Iterations 7
                           Logit Regression Results                           
Dep. Variable:         Have_Tried_Yes   No. Observations:                  856
Model:                          Logit   Df Residuals:                      844
Method:                           MLE   Df Model:                           11
Date:                Mon, 29 Nov 2021   Pseudo R-squ.:                  0.4098
Time:                        16:42:36   Log-Likelihood:                -343.97
converged:                       True   LL-Null:                       -582.80
Covariance Type:            nonrobust   LLR p-value:                 1.853e-95
                        coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------------
Intercept            -2.5406      0.910     -2.793      0.005      -4.324      -0.758
Age     

In [22]:
print(logitfit.summary2())

                         Results: Logit
Model:              Logit            Pseudo R-squared: 0.410     
Dependent Variable: Have_Tried_Yes   AIC:              711.9429  
Date:               2021-11-29 16:42 BIC:              768.9701  
No. Observations:   856              Log-Likelihood:   -343.97   
Df Model:           11               LL-Null:          -582.80   
Df Residuals:       844              LLR p-value:      1.8527e-95
Converged:          1.0000           Scale:            1.0000    
No. Iterations:     7.0000                                       
-----------------------------------------------------------------
                   Coef.  Std.Err.    z    P>|z|   [0.025  0.975]
-----------------------------------------------------------------
Intercept         -2.5406   0.9097 -2.7928 0.0052 -4.3236 -0.7576
Age               -0.0697   0.0108 -6.4476 0.0000 -0.0909 -0.0485
Weight             0.0070   0.0038  1.8270 0.0677 -0.0005  0.0146
Income             0.0000   0.0000  

In [23]:
model_odds = pd.DataFrame(np.exp(logitfit.params), columns=['Odds Ratio'])
model_odds

Unnamed: 0,Odds Ratio
Intercept,0.07882
Age,0.932684
Weight,1.007058
Income,1.000005
Car_Value,0.999973
CC_Debt,1.000078
Mall_Trips,1.987755
Pay_Type_Salaried,3.79145
Gender_Male,1.291162
Live_Alone_Yes,3.753284
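
Each odds ratio is the exponential of the corresponding coefficient. It gives the multiplicative change in the odds of having tried the product for a one-unit increase in that predictor, holding the others fixed. As a quick check, the sketch below reproduces the `Age` odds ratio from the rounded coefficient reported in the summary (-0.0697) and expresses it as a percent change in odds; the variable names are illustrative.

```python
import numpy as np

# Rounded Age coefficient as reported in the model summary
coef_age = -0.0697

# Odds ratio: multiplicative change in the odds per additional year of age
odds_ratio_age = np.exp(coef_age)

# Percent change in the odds per additional year of age
pct_change = (odds_ratio_age - 1) * 100

print(f"Odds ratio: {odds_ratio_age:.4f}")
print(f"Percent change in odds: {pct_change:.2f}%")
```

The result agrees with the `Age` row of the table up to rounding of the coefficient: each additional year of age lowers the odds of being a trier by roughly 7%.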


In [24]:
logitfit.pred_table()

array([[280.,  81.],
       [ 73., 422.]])
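
In statsmodels, `pred_table()` returns a confusion matrix at a 0.5 threshold in which rows are the observed classes and columns are the predicted classes. A minimal sketch of how the usual summary measures follow from these counts (the array is copied by hand from the output, and the variable names are illustrative):

```python
import numpy as np

# Confusion matrix copied from the pred_table() output:
# rows = observed (0 = nontrier, 1 = trier), columns = predicted
table = np.array([[280., 81.],
                  [73., 422.]])

tn, fp = table[0]
fn, tp = table[1]

accuracy = (tn + tp) / table.sum()   # share of all people classified correctly
sensitivity = tp / (tp + fn)         # share of triers classified as triers
specificity = tn / (tn + fp)         # share of nontriers classified as nontriers

print(f"Accuracy:    {accuracy:.3f}")
print(f"Sensitivity: {sensitivity:.3f}")
print(f"Specificity: {specificity:.3f}")
```

By this count the model classifies about 82% of the 856 people correctly.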

In [25]:
predict = logitfit.predict(df_lasagna)

df_lasagna['Prediction'] = predict
df_lasagna.head()

Unnamed: 0,Person,Age,Weight,Income,Car_Value,CC_Debt,Mall_Trips,Nbhd,Pay_Type_Hourly,Pay_Type_Salaried,Gender_Female,Gender_Male,Live_Alone_No,Live_Alone_Yes,Dwell_Type_Apt,Dwell_Type_Condo,Dwell_Type_Home,Have_Tried_No,Have_Tried_Yes,Prediction
0,1,48,175,65500,2190,3510,7,East,1,0,0,1,1,0,0,0,1,1,0,0.752757
1,2,33,202,29100,2110,740,4,East,1,0,1,0,1,0,0,1,0,0,1,0.351476
2,3,51,188,32200,5140,910,1,East,0,1,0,1,1,0,0,1,0,1,0,0.07649
3,4,56,244,19000,700,1620,3,West,1,0,1,0,1,0,0,0,1,1,0,0.091847
4,5,28,218,81400,26620,600,3,West,0,1,0,1,1,0,1,0,0,0,1,0.602193


In [26]:
# Classify a person as a trier (1) when the predicted probability exceeds 0.5
df_lasagna['Analysis_Case'] = (df_lasagna['Prediction'] > 0.5).astype(int)
df_lasagna.head(40)

Unnamed: 0,Person,Age,Weight,Income,Car_Value,CC_Debt,Mall_Trips,Nbhd,Pay_Type_Hourly,Pay_Type_Salaried,...,Gender_Male,Live_Alone_No,Live_Alone_Yes,Dwell_Type_Apt,Dwell_Type_Condo,Dwell_Type_Home,Have_Tried_No,Have_Tried_Yes,Prediction,Analysis_Case
0,1,48,175,65500,2190,3510,7,East,1,0,...,1,1,0,0,0,1,1,0,0.752757,1
1,2,33,202,29100,2110,740,4,East,1,0,...,0,1,0,0,1,0,0,1,0.351476,0
2,3,51,188,32200,5140,910,1,East,0,1,...,1,1,0,0,1,0,1,0,0.07649,0
3,4,56,244,19000,700,1620,3,West,1,0,...,0,1,0,0,0,1,1,0,0.091847,0
4,5,28,218,81400,26620,600,3,West,0,1,...,1,1,0,1,0,0,0,1,0.602193,1
5,6,51,173,73000,24520,950,2,East,0,1,...,0,1,0,0,1,0,1,0,0.076923,0
6,7,44,182,66400,10130,3500,6,West,0,1,...,0,0,1,0,1,0,0,1,0.936322,1
7,8,29,189,46200,10250,2860,5,West,0,1,...,1,1,0,0,1,0,0,1,0.867534,1
8,9,28,200,61100,17210,3180,10,West,0,1,...,1,1,0,0,1,0,0,1,0.995374,1
9,10,29,209,9800,2090,1270,7,East,0,1,...,0,0,1,1,0,0,0,1,0.9886,1


In [27]:
df_testing = pd.read_csv('../data/New_Customers.csv')
df_testing.head()

Unnamed: 0,New_Person,Age,Weight,Income,Pay Type,Car Value,CC Debt,Gender,Live Alone,Dwell Type,Mall Trips,Nbhd
0,1,36,146,85568,Salaried,10213,510,Female,No,Condo,3,South
1,2,40,225,68725,Salaried,3041,80,Female,No,Home,9,West
2,3,48,197,86876,Salaried,4806,2100,Male,No,Condo,9,West
3,4,49,177,38436,Salaried,8679,590,Male,Yes,Apt,3,West
4,5,34,223,77784,Salaried,12456,590,Female,No,Condo,8,West


In [28]:
df_testing = df_testing.rename(columns={'Pay Type':'Pay_Type', 'Live Alone':'Live_Alone',
                                        'Dwell Type':'Dwell_Type', 'Car Value':'Car_Value',
                                        'CC Debt':'CC_Debt', 'Mall Trips':'Mall_Trips'})

# The new-customer file has no 'Have Tried' column, so only the predictors are dummy-coded
df_testing = pd.get_dummies(df_testing, columns=['Pay_Type','Gender','Live_Alone','Dwell_Type'],
                            dtype=int)
df_testing.head()

Unnamed: 0,New_Person,Age,Weight,Income,Car_Value,CC_Debt,Mall_Trips,Nbhd,Pay_Type_Salaried,Gender_Female,Gender_Male,Live_Alone_No,Live_Alone_Yes,Dwell_Type_Apt,Dwell_Type_Condo,Dwell_Type_Home
0,1,36,146,85568,10213,510,3,South,1,1,0,1,0,0,1,0
1,2,40,225,68725,3041,80,9,West,1,1,0,1,0,0,0,1
2,3,48,197,86876,4806,2100,9,West,1,0,1,1,0,0,1,0
3,4,49,177,38436,8679,590,3,West,1,0,1,0,1,1,0,0
4,5,34,223,77784,12456,590,8,West,1,1,0,1,0,0,1,0
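
Note that the new-customer dummies contain no `Pay_Type_Hourly` column, because no hourly customers appear in this file. That is harmless for this model, since the formula only uses `Pay_Type_Salaried`, but a category missing from new data can break prediction when the formula needs its dummy. One defensive pattern, sketched here on toy data with hypothetical frame names, is to reindex the new dummies to the training columns:

```python
import pandas as pd

# Toy stand-ins for the training and new-customer data (hypothetical values)
train = pd.DataFrame({'Age': [48, 33], 'Pay_Type': ['Hourly', 'Salaried']})
new = pd.DataFrame({'Age': [36, 40], 'Pay_Type': ['Salaried', 'Salaried']})

train_d = pd.get_dummies(train, columns=['Pay_Type'], dtype=int)
new_d = pd.get_dummies(new, columns=['Pay_Type'], dtype=int)

# Add any dummy column seen in training but absent from the new data,
# filled with 0, and put the columns in the training order
new_aligned = new_d.reindex(columns=train_d.columns, fill_value=0)
print(new_aligned)
```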


In [30]:
new_predict = logitfit.predict(df_testing)
df_testing['New_Prediction'] = new_predict
df_testing.head()

Unnamed: 0,New_Person,Age,Weight,Income,Car_Value,CC_Debt,Mall_Trips,Nbhd,Pay_Type_Salaried,Gender_Female,Gender_Male,Live_Alone_No,Live_Alone_Yes,Dwell_Type_Apt,Dwell_Type_Condo,Dwell_Type_Home,New_Prediction
0,1,36,146,85568,10213,510,3,South,1,1,0,1,0,0,1,0,0.369351
1,2,40,225,68725,3041,80,9,West,1,1,0,1,0,0,0,1,0.985216
2,3,48,197,86876,4806,2100,9,West,1,0,1,1,0,0,1,0,0.974404
3,4,49,177,38436,8679,590,3,West,1,0,1,0,1,1,0,0,0.564361
4,5,34,223,77784,12456,590,8,West,1,1,0,1,0,0,1,0,0.97041


In [31]:
# Apply the same 0.5 cutoff to the new customers
df_testing['Analysis_Case'] = (df_testing['New_Prediction'] > 0.5).astype(int)
df_testing.head()

Unnamed: 0,New_Person,Age,Weight,Income,Car_Value,CC_Debt,Mall_Trips,Nbhd,Pay_Type_Salaried,Gender_Female,Gender_Male,Live_Alone_No,Live_Alone_Yes,Dwell_Type_Apt,Dwell_Type_Condo,Dwell_Type_Home,New_Prediction,Analysis_Case
0,1,36,146,85568,10213,510,3,South,1,1,0,1,0,0,1,0,0.369351,0
1,2,40,225,68725,3041,80,9,West,1,1,0,1,0,0,0,1,0.985216,1
2,3,48,197,86876,4806,2100,9,West,1,0,1,1,0,0,1,0,0.974404,1
3,4,49,177,38436,8679,590,3,West,1,0,1,0,1,1,0,0,0.564361,1
4,5,34,223,77784,12456,590,8,West,1,1,0,1,0,0,1,0,0.97041,1
