# Week 10 Classification Lecture Demo

This notebook demonstrates logistic regression through a worked example. Two datasets for the demonstration are available in the `data` folder of this repo. We will continue to use the `statsmodels` and `scikit-learn` libraries for the analysis.

## Example: Logistic Regression

**Case Background**

The `Lasagna Triers Logistic Regression.csv` file contains data on 856 people who have either tried or not tried a company’s new frozen lasagna product. The categorical dependent variable, Have Tried, and several of the potential explanatory variables contain text. Using the numeric variables, including dummies, how well is logistic regression able to classify the triers and nontriers?

Therefore, the objective of this case is to use logistic regression to classify users as triers or nontriers, and to interpret the resulting output. 

<center><img src="../Image/lasana.jpg" width=400 height=400 /></center>

In [None]:
#Importing the libraries we need to use

import pandas as pd
import statsmodels.formula.api as smf
import numpy as np

#Importing the dataset as a new dataframe and previewing its first rows
df_lasagna = pd.read_csv('../data/Lasagna Triers Logistic Regression.csv')
df_lasagna.head()

In [None]:
#Recalling how we renamed variables in earlier weeks
#Variable names containing spaces should be renamed before using them in a statsmodels formula

df_lasagna = df_lasagna.rename(columns={'Pay Type':'Pay_Type', 'Live Alone':'Live_Alone',
                                        'Dwell Type':'Dwell_Type','Have Tried':'Have_Tried',
                                        'Car Value':'Car_Value','CC Debt':'CC_Debt','Mall Trips':'Mall_Trips'})

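#Creating dummy variables in df_lasagna and previewing the dataframe again
#dtype=int stores the dummies as 0/1 integers; recent pandas versions default to booleans,
#which the formula interface would treat as categorical rather than numeric

df_lasagna = pd.get_dummies(df_lasagna, columns=['Pay_Type','Gender','Live_Alone','Dwell_Type','Have_Tried'], dtype=int)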
df_lasagna.head()

In [None]:
#Building the formula as a separate string first, then passing it to statsmodels
our_formula = ('Have_Tried_Yes ~ Age + Weight + Income '
               '+ Car_Value + CC_Debt + Mall_Trips '
               '+ Pay_Type_Salaried + Gender_Male '
               '+ Live_Alone_Yes + Dwell_Type_Condo + Dwell_Type_Home')
logitfit = smf.logit(formula=our_formula, data=df_lasagna).fit()
print(logitfit.summary())

In [None]:
#Using summary2 to avoid scientific notation in the outputs
print(logitfit.summary2())

In [None]:
#Exponentiating the coefficients to obtain odds ratios, which are easier to interpret

model_odds = pd.DataFrame(np.exp(logitfit.params), columns= ['OR'])
model_odds
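
To make the odds ratio column easier to read, here is a minimal sketch of what exponentiating a coefficient means. The coefficient `beta` and the baseline probability are made-up values for illustration and are not taken from the fitted lasagna model.

```python
import numpy as np

# Hypothetical logit coefficient, for illustration only (not from logitfit)
beta = 0.5
odds_ratio = np.exp(beta)

# Assumed baseline probability of being a trier
p_before = 0.40
odds_before = p_before / (1 - p_before)

# A one-unit increase in the predictor multiplies the odds by the odds ratio
odds_after = odds_before * odds_ratio
p_after = odds_after / (1 + odds_after)

print(f"odds ratio: {odds_ratio:.3f}")
print(f"probability before: {p_before:.3f}, after: {p_after:.3f}")
```

An odds ratio above 1 means the odds of being a trier rise as the predictor increases, and an odds ratio below 1 means they fall. Note that the odds ratio scales the odds, not the probability itself.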

In [None]:
#Creating a classification matrix to check the correctness of our model
#Rows are the actual values and columns are the predicted values (default cutoff of 0.5)
logitfit.pred_table()

In [None]:
#As a reminder, this gives the total number of observations in the dataframe
df_lasagna.shape[0]

### Explanation of the above matrix

In the upper-left corner, 280 observations are true negatives: their actual and predicted values on the outcome are both 0. The upper-right corner shows 81 nontriers that the model misclassified as triers (false positives). In the bottom-right corner, 422 observations are true positives: their actual and predicted values on the outcome are both 1. More specifically, 422 of the 495 triers, or 85.25%, are correctly classified as triers; the remaining 73 triers appear in the bottom-left corner as false negatives.

Overall, the model correctly classified 702 (280 + 422) of the 856 observations (see the cell above to quickly check the total number of observations), which is an accuracy of 82.01% (702/856).
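
The arithmetic above can be reproduced with a short sketch. The counts 280, 81, and 422 come from the text; the 73 false negatives are inferred as 495 triers minus 422 correctly classified triers. The variable names are illustrative, and the layout assumes rows are actual classes and columns are predicted classes, as in `pred_table()`.

```python
import numpy as np

# Rows: actual class (0 = nontrier, 1 = trier); columns: predicted class
# 73 is inferred from the text as 495 - 422
table = np.array([[280, 81],
                  [73, 422]])

tn, fp = table[0]
fn, tp = table[1]

accuracy = (tn + tp) / table.sum()
sensitivity = tp / (tp + fn)   # share of actual triers classified as triers
specificity = tn / (tn + fp)   # share of actual nontriers classified as nontriers

print(f"accuracy:    {accuracy:.2%}")
print(f"sensitivity: {sensitivity:.2%}")
print(f"specificity: {specificity:.2%}")
```

Looking at sensitivity and specificity separately shows whether the model is better at recognizing triers or nontriers, which overall accuracy alone hides.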

In [None]:
#In sample prediction
predict = logitfit.predict(df_lasagna)

#Adding the model predictions to the original dataframe as a new variable
#The Prediction variable gives the estimated probability that the observation is a trier

df_lasagna['Prediction'] = predict
df_lasagna.head()


In [None]:
# If this probability (i.e. the prediction value) is greater than 0.5, the person is classified as a trier;
# otherwise, the person is classified as a nontrier.
# The code below adds this classification to the dataframe as a new column

def case(row):
    if row['Prediction'] > 0.5:
        val = 1
    else:
        val = 0
    return val

df_lasagna['Analysis_Case'] = df_lasagna.apply(case, axis='columns')
df_lasagna.head()
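
As a possible alternative to the row-by-row `apply` above, the same cutoff rule can be written as a single vectorized comparison. This is only a sketch on a small made-up dataframe (`demo` and its probabilities are hypothetical), not a replacement for the cell above.

```python
import pandas as pd

# Hypothetical predicted probabilities, for illustration only
demo = pd.DataFrame({'Prediction': [0.12, 0.50, 0.51, 0.97]})

# Same rule as case(): 1 if the probability is strictly greater than 0.5, else 0
demo['Analysis_Case'] = (demo['Prediction'] > 0.5).astype(int)
print(demo)
```

On large dataframes the vectorized form is usually faster, because the comparison runs over the whole column at once instead of calling a Python function for each row.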

In [None]:
#For clearer presentation, we can create a new dataframe that contains only the actual outcome and the predictions

df_lasagna_classification = df_lasagna[['Have_Tried_Yes','Prediction','Analysis_Case']].copy()
df_lasagna_classification

## What can we do next?

Explanatory values for new people, those whose trier status is unknown, could be fed into the logistic regression equation to score them, that is, to estimate each person's probability of being a trier. Then perhaps some incentives could be sent to the top scorers (or the middle scorers) to increase their chances of trying the product. The point is that logistic regression is then being used as a tool to identify the people most likely to be triers.
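
Picking the top or middle scorers could look roughly like the sketch below. The customer IDs, probabilities, the choice of two top scorers, and the 0.40 to 0.65 band for "middle" scorers are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical scored customers; IDs and probabilities are made up
scored = pd.DataFrame({
    'Customer_ID': [101, 102, 103, 104, 105, 106],
    'Prediction': [0.91, 0.48, 0.15, 0.62, 0.55, 0.33],
})

# Top scorers: the n people with the highest predicted probability
top_scorers = scored.nlargest(2, 'Prediction')

# Middle scorers: people near the 0.5 cutoff (assumed band, inclusive at both ends)
middle_scorers = scored[scored['Prediction'].between(0.40, 0.65)]

print(top_scorers)
print(middle_scorers)
```

Whether to target the top or the middle of the ranking is a business decision: top scorers may try the product anyway, while middle scorers may be the ones an incentive can actually sway.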

To demonstrate this step, we can apply the model to a new dataset (which we call the testing set) to make predictions. The file name of the testing set is `New_Customers.csv`.

In [None]:
df_testing = pd.read_csv('../data/New_Customers.csv')
df_testing.head()

In [None]:
df_testing = df_testing.rename(columns={'Pay Type':'Pay_Type', 'Live Alone':'Live_Alone',
                                        'Dwell Type':'Dwell_Type','Have Tried':'Have_Tried',
                                        'Car Value':'Car_Value','CC Debt':'CC_Debt','Mall Trips':'Mall_Trips'})

#dtype=int stores the dummies as 0/1 integers rather than booleans
df_testing = pd.get_dummies(df_testing, columns=['Pay_Type','Gender','Live_Alone','Dwell_Type'], dtype=int)
df_testing.head()

In [None]:
#Applying our trained model logitfit to the testing dataset to make predictions on the newly collected data
new_predict = logitfit.predict(df_testing)
#Storing the predictions for the new customers (new_predict), not the earlier in-sample predictions
df_testing['Prediction'] = new_predict
df_testing.head()

def case(row):
    if row['Prediction'] > 0.5:
        val = 1
    else:
        val = 0
    return val

df_testing['Analysis_Case'] = df_testing.apply(case, axis='columns')
df_testing.head()

### Explanation of the above code

We simply repeat the preprocessing from the previous steps so that the variable names are consistent with the model specification. `logitfit` is the model we trained on the Lasagna Triers data. The difference here is that the trier status of the customers in `New_Customers.csv` is unknown, so we use `logitfit` to predict the classification of each new observation.
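
One thing that can go wrong with this replication is that a category present in the training data never occurs in the new data, so `get_dummies` does not create the matching column. The sketch below shows one way to check for this. The list of expected dummies is copied from the formula used in this notebook, while the two example customers and the category labels `Hourly`, `Female`, `No`, and `Apt` are hypothetical.

```python
import pandas as pd

# Dummy columns used in the model formula in this notebook
expected_dummies = ['Pay_Type_Salaried', 'Gender_Male', 'Live_Alone_Yes',
                    'Dwell_Type_Condo', 'Dwell_Type_Home']

# Hypothetical new customers: nobody lives in a condo, so that dummy is never created
new_people = pd.DataFrame({
    'Pay_Type': ['Salaried', 'Hourly'],
    'Gender': ['Male', 'Female'],
    'Live_Alone': ['Yes', 'No'],
    'Dwell_Type': ['Home', 'Apt'],
})
new_people = pd.get_dummies(
    new_people,
    columns=['Pay_Type', 'Gender', 'Live_Alone', 'Dwell_Type'],
    dtype=int,
)

# Find the dummies the formula needs but the new data does not contain
missing = [col for col in expected_dummies if col not in new_people.columns]
print("Missing columns:", missing)

# Add each missing dummy as a column of zeros so the formula can be evaluated
for col in missing:
    new_people[col] = 0
```

Filling with zeros is appropriate here because a missing dummy means no new customer belongs to that category.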