# **Assignment 1 - Logistic Regression**
## Author: Jake Brulato
## Tuesday 5:30 - 8:15, Kornelia

**Problem description and questions: A supermarket offers a new line of organic products. The
supermarket’s management wants to determine which customers are likely to purchase these
products. The supermarket has a customer loyalty program. As an initial buyer incentive plan,
the supermarket provided coupons for the organic products to all of the loyalty program
participants and collected data that includes whether these customers purchased any of the
organic products. Based on the data collected, the supermarket wants to understand the
behavior of their customers and their likelihood of purchase of organic products. The
ORGANICS data set contains 8 variables as shown in the table below and more than 22,000
observations**

In [98]:
#Import the packages you will use
import pandas as pd
import numpy as np
import statsmodels.api as sm
from sklearn.model_selection import train_test_split

In [99]:
#Import the data
df_JB = pd.read_csv('Organics.csv', sep=',')
df_JB

Unnamed: 0,ID,DemAffl,DemAge,DemGender,PromClass,PromSpend,PromTime,TargetBuy
0,140,10.0,76.0,U,Gold,16000.00,4.0,0
1,620,4.0,49.0,U,Gold,6000.00,5.0,0
2,868,5.0,70.0,F,Silver,0.02,8.0,1
3,1120,10.0,65.0,M,Tin,0.01,7.0,1
4,2313,11.0,68.0,F,Tin,0.01,8.0,0
...,...,...,...,...,...,...,...,...
22218,52834058,13.0,65.0,F,Silver,1500.00,5.0,0
22219,52834376,15.0,73.0,U,Gold,6053.06,12.0,0
22220,52837057,9.0,70.0,F,Gold,6000.00,5.0,0
22221,52838096,11.0,66.0,F,Silver,5000.00,5.0,0


In [100]:
#Calculate the nulls and then drop them
df_JB.isnull().sum()
clean_JB = df_JB.dropna()
clean_JB.info()
print(clean_JB.isnull().sum())
clean_JB

<class 'pandas.core.frame.DataFrame'>
Index: 17272 entries, 0 to 22221
Data columns (total 8 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   ID         17272 non-null  int64  
 1   DemAffl    17272 non-null  float64
 2   DemAge     17272 non-null  float64
 3   DemGender  17272 non-null  object 
 4   PromClass  17272 non-null  object 
 5   PromSpend  17272 non-null  float64
 6   PromTime   17272 non-null  float64
 7   TargetBuy  17272 non-null  int64  
dtypes: float64(4), int64(2), object(2)
memory usage: 1.2+ MB
ID           0
DemAffl      0
DemAge       0
DemGender    0
PromClass    0
PromSpend    0
PromTime     0
TargetBuy    0
dtype: int64


Unnamed: 0,ID,DemAffl,DemAge,DemGender,PromClass,PromSpend,PromTime,TargetBuy
0,140,10.0,76.0,U,Gold,16000.00,4.0,0
1,620,4.0,49.0,U,Gold,6000.00,5.0,0
2,868,5.0,70.0,F,Silver,0.02,8.0,1
3,1120,10.0,65.0,M,Tin,0.01,7.0,1
4,2313,11.0,68.0,F,Tin,0.01,8.0,0
...,...,...,...,...,...,...,...,...
22216,52830893,13.0,49.0,M,Silver,500.00,9.0,0
22218,52834058,13.0,65.0,F,Silver,1500.00,5.0,0
22219,52834376,15.0,73.0,U,Gold,6053.06,12.0,0
22220,52837057,9.0,70.0,F,Gold,6000.00,5.0,0


In [101]:
# Drop the columns that will not be used in the model
model_data_JB = clean_JB.drop(columns= ['ID'])
# model_data_JB = clean_JB.drop(columns= ['ID', 'PromClass'])
# model_data_JB = clean_JB.drop(columns= ['ID', 'DemGender'])
model_data_JB

Unnamed: 0,DemAffl,DemAge,DemGender,PromClass,PromSpend,PromTime,TargetBuy
0,10.0,76.0,U,Gold,16000.00,4.0,0
1,4.0,49.0,U,Gold,6000.00,5.0,0
2,5.0,70.0,F,Silver,0.02,8.0,1
3,10.0,65.0,M,Tin,0.01,7.0,1
4,11.0,68.0,F,Tin,0.01,8.0,0
...,...,...,...,...,...,...,...
22216,13.0,49.0,M,Silver,500.00,9.0,0
22218,13.0,65.0,F,Silver,1500.00,5.0,0
22219,15.0,73.0,U,Gold,6053.06,12.0,0
22220,9.0,70.0,F,Gold,6000.00,5.0,0


**1. What variable would you consider as the target variable? Explain your reason. (2 pts)**

The target variable is 'TargetBuy'. It is the only binary (0/1) variable in the data set, and it records whether a customer purchased any of the organic products, which is the outcome the supermarket wants to predict. Logistic regression predicts this binary outcome from the other (independent) variables in the model.

**2. Select any 5 variables to consider as independent variables in the model. Explain the
reasons for your selection. (5 pts)**

- PromSpend: 
    - The customer's overall (lifetime) spending. It varies widely and initially looked like a good predictor, but it was later found not to be statistically significant.
- DemGender_M:
    - After dummy coding DemGender into three columns, DemGender_M was one of the strongest predictors in the model, with an extremely low p-value and a statistically significant z-value.
- DemGender_F:
    - After dummy coding DemGender into three columns, DemGender_F was one of the strongest predictors in the model, with an extremely low p-value and a statistically significant z-value.
- DemAffl:
    - The customer's affluence (wealth) on a 30-point scale. It shows which economic bracket is more likely to buy the organic line.
- DemAge:
    - The customer's age. It could be a good indicator of which age range is more likely to try to eat healthier.


**3. Are there any variables which cannot be used in your model? Why? (3 pts)**

- ID cannot be used because it only serves as a unique identifier for each customer's row and carries no predictive information.
- PromClass could be used after dummy coding, but its levels were not statistically significant predictors of the target variable when I tested them, so I left it out.

**4. What variables need to be dummy coded before you run your logistic regression model?
Explain what new dummy coded columns you created. (6 pts)**

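- DemGender and PromClass need to be dummy coded because logistic regression cannot take text values. I dummy coded both so that I could run multiple models and determine which variables were better predictors of the target y. This created DemGender_F, DemGender_M, and DemGender_U from DemGender, and PromClass_Gold, PromClass_Platinum, PromClass_Silver, and PromClass_Tin from PromClass. One column per variable has to be dropped (n-1 dummies for n levels) so that perfect multicollinearity doesn't occur; I dropped DemGender_U and PromClass_Tin as the baselines.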
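
The n-1 rule can be illustrated on a small example. The sketch below uses a toy frame with made-up values (not the ORGANICS rows) and checks the rank of the design matrix with and without the baseline column. The names `toy`, `full`, and `reduced` are only for illustration.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical values, only to illustrate the dummy-variable trap
toy = pd.DataFrame({"DemGender": ["F", "M", "U", "F", "M", "U"]})

# Full dummy coding: one 0/1 column per level, plus an intercept column
full = pd.get_dummies(toy, dtype=int)
full.insert(0, "const", 1)

# The three dummies always sum to the constant column, so the matrix loses a rank
full_rank = np.linalg.matrix_rank(full.to_numpy())

# Dropping one level (here 'U', the baseline used in this notebook) removes the redundancy
reduced = full.drop(columns="DemGender_U")
reduced_rank = np.linalg.matrix_rank(reduced.to_numpy())

print("columns:", full.shape[1], "rank:", full_rank)
print("columns:", reduced.shape[1], "rank:", reduced_rank)
```

When the rank is lower than the number of columns, the coefficients are not uniquely identified, which is why one level per categorical variable is left out as the reference category.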

In [102]:
# Dummy code every text column as 0/1 integers (dtype=int already returns ints)
model_data_JB = pd.get_dummies(model_data_JB, dtype=int)
# Drop one level per variable as the baseline (n-1 dummies) to avoid perfect multicollinearity
model_data_JB = model_data_JB.drop(columns=['DemGender_U', 'PromClass_Tin'])
print(model_data_JB)
# print("Gold: ", model_data_JB['PromClass_Gold'].value_counts().get(1, 1))
# print("Plat: ", model_data_JB['PromClass_Platinum'].value_counts().get(1, 1))
# print("Silver: ", model_data_JB['PromClass_Silver'].value_counts().get(1, 1))
# print("Tin: ", model_data_JB['PromClass_Tin'].value_counts().get(1, 1))


       DemAffl  DemAge  PromSpend  PromTime  TargetBuy  DemGender_F  \
0         10.0    76.0   16000.00       4.0          0            0   
1          4.0    49.0    6000.00       5.0          0            0   
2          5.0    70.0       0.02       8.0          1            1   
3         10.0    65.0       0.01       7.0          1            0   
4         11.0    68.0       0.01       8.0          0            1   
...        ...     ...        ...       ...        ...          ...   
22216     13.0    49.0     500.00       9.0          0            0   
22218     13.0    65.0    1500.00       5.0          0            1   
22219     15.0    73.0    6053.06      12.0          0            0   
22220      9.0    70.0    6000.00       5.0          0            1   
22221     11.0    66.0    5000.00       5.0          0            1   

       DemGender_M  PromClass_Gold  PromClass_Platinum  PromClass_Silver  
0                0               1                   0                 0

**5. Do you have to consider missing values in your dataset? How did you handle the presence
of missing values, if any? (4 pts)**

In [103]:
#Calculate the nulls and then drop them
model_data_JB.isnull().sum()
clean_JB = model_data_JB.dropna()
clean_JB.info()
print(clean_JB.isnull().sum())
clean_JB

<class 'pandas.core.frame.DataFrame'>
Index: 17272 entries, 0 to 22221
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   DemAffl             17272 non-null  float64
 1   DemAge              17272 non-null  float64
 2   PromSpend           17272 non-null  float64
 3   PromTime            17272 non-null  float64
 4   TargetBuy           17272 non-null  int64  
 5   DemGender_F         17272 non-null  int64  
 6   DemGender_M         17272 non-null  int64  
 7   PromClass_Gold      17272 non-null  int64  
 8   PromClass_Platinum  17272 non-null  int64  
 9   PromClass_Silver    17272 non-null  int64  
dtypes: float64(4), int64(6)
memory usage: 1.4 MB
DemAffl               0
DemAge                0
PromSpend             0
PromTime              0
TargetBuy             0
DemGender_F           0
DemGender_M           0
PromClass_Gold        0
PromClass_Platinum    0
PromClass_Silver      0
dtype: int64


Unnamed: 0,DemAffl,DemAge,PromSpend,PromTime,TargetBuy,DemGender_F,DemGender_M,PromClass_Gold,PromClass_Platinum,PromClass_Silver
0,10.0,76.0,16000.00,4.0,0,0,0,1,0,0
1,4.0,49.0,6000.00,5.0,0,0,0,1,0,0
2,5.0,70.0,0.02,8.0,1,1,0,0,0,1
3,10.0,65.0,0.01,7.0,1,0,1,0,0,0
4,11.0,68.0,0.01,8.0,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...
22216,13.0,49.0,500.00,9.0,0,0,1,0,0,1
22218,13.0,65.0,1500.00,5.0,0,1,0,0,0,1
22219,15.0,73.0,6053.06,12.0,0,0,0,1,0,0
22220,9.0,70.0,6000.00,5.0,0,1,0,1,0,0
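
The cell above handles missing values by dropping every row that contains a null (listwise deletion). A minimal sketch of that step on a toy frame with made-up values is shown here; it also reports how much data the drop costs, which is worth checking before modelling. The names `toy`, `complete`, and `share_kept` are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; NaN/None mark missing entries
toy = pd.DataFrame({
    "DemAffl": [10.0, np.nan, 5.0, 11.0],
    "DemAge": [76.0, 49.0, np.nan, 68.0],
    "DemGender": ["U", "F", None, "M"],
    "TargetBuy": [0, 0, 1, 0],
})

# Count missing values per column before deciding how to handle them
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Listwise deletion: keep only rows with no missing value in any column
complete = toy.dropna()

# Share of rows that survive the drop
share_kept = len(complete) / len(toy)
print(f"rows kept: {len(complete)} of {len(toy)} ({share_kept:.0%})")
```

In this notebook, the `info()` output shows 17,272 rows remaining with an index running up to 22221, so roughly 78% of the rows appear to be kept, assuming the original frame had 22,222 rows.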


**6. Provide following screenshots from your logistic regression model. (5 pts)**
- a. The model result summary, along with the coefficient table
- b. The classification (confusion) matrix output


In [104]:
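# Target: 1 if the customer bought organic products, 0 otherwise
y = np.array(clean_JB['TargetBuy'])

# The five selected predictors
x = clean_JB[['DemAffl', 'DemAge', 'DemGender_M', 'PromSpend', 'DemGender_F']]

# Alternative predictor sets tried while comparing models
# x = clean_JB[['DemAffl', 'DemAge', 'PromSpend', 'PromTime', 'PromClass_Gold', 'PromClass_Platinum', 'PromClass_Silver' ]]

# x = clean_JB[['DemAffl', 'DemAge', 'PromSpend', 'PromTime', 'DemGender_M', 'DemGender_F']]
# Default split is 75% train / 25% test; no random_state is set, so results vary between runs
X_train, X_test, y_train, y_test = train_test_split(x,y)

# statsmodels does not add an intercept automatically
X_train = sm.add_constant(X_train)
X_test = sm.add_constant(X_test)

# Fit the logistic regression by maximum likelihood
model = sm.Logit(y_train, X_train)
result = model.fit()

print(result.summary())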

Optimization terminated successfully.
         Current function value: 0.444311
         Iterations 7
                           Logit Regression Results                           
Dep. Variable:                      y   No. Observations:                12954
Model:                          Logit   Df Residuals:                    12948
Method:                           MLE   Df Model:                            5
Date:                Wed, 07 Feb 2024   Pseudo R-squ.:                  0.2280
Time:                        13:41:01   Log-Likelihood:                -5755.6
converged:                       True   LL-Null:                       -7455.6
Covariance Type:            nonrobust   LLR p-value:                     0.000
                  coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------
const          -2.3074      0.178    -12.952      0.000      -2.657      -1.958
DemAffl         0.2545    

**7. Is the overall model statistically significant? Explain how you arrived at your conclusion. (4
pts)**

- The overall model is statistically significant. The likelihood-ratio test in the summary reports an LLR p-value of 0.000 (log-likelihood of -5755.6 against a null log-likelihood of -7455.6), which is far below 0.05, so the model with these predictors fits significantly better than an intercept-only model.

- Looking at the individual predictors, the following are statistically significant based on their z-statistics and p-values:
    - DemAffl
    - DemAge
    - DemGender_M
    - DemGender_F

- The one that did not pass this test is:
    - PromSpend

- If I could, I would drop PromSpend from the model, because it doesn't make sense to me to keep a statistically insignificant predictor. I also checked all the other variables (PromTime, PromClass_Gold, PromClass_Silver, PromClass_Tin, PromClass_Platinum), and all were statistically insignificant. This makes me want to omit the fifth predictor completely.
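
The overall-significance check can be reproduced by hand from the two log-likelihoods printed in the summary. The sketch below is a minimal illustration that assumes SciPy is available; the variable names are my own, and the numbers are copied from the summary output.

```python
from scipy.stats import chi2

# Values copied from the Logit summary above
ll_model = -5755.6   # Log-Likelihood of the fitted model
ll_null = -7455.6    # LL-Null (intercept-only model)
df_model = 5         # Df Model (number of predictors)

# Likelihood-ratio statistic: twice the gain in log-likelihood over the null model
llr_stat = 2 * (ll_model - ll_null)

# Under the null hypothesis the statistic follows a chi-square with df_model degrees of freedom
p_value = chi2.sf(llr_stat, df_model)

# McFadden's pseudo R-squared, as reported by statsmodels
pseudo_r2 = 1 - ll_model / ll_null

print(f"LLR statistic: {llr_stat:.1f}")
print(f"p-value: {p_value:.3g}")
print(f"pseudo R-squared: {pseudo_r2:.4f}")
```

The pseudo R-squared computed this way agrees with the 0.2280 shown in the summary, which is a useful sanity check on the inputs.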

**8. Interpret the impacts of each independent variable on the target variable. (10 pts)**

- Each coefficient is the change in the log-odds of purchasing organic products (TargetBuy = 1) for a one-unit increase in that variable, holding the other variables constant:
    - DemAffl: each additional affluence point raises the log-odds by 0.2545 (an odds ratio of about 1.29). This indicates a positive relationship between affluence and the likelihood of purchasing organic products.
    - DemAge: each additional year of age lowers the log-odds by 0.0545 (an odds ratio of about 0.95). This suggests a negative relationship between age and the likelihood of purchasing organic products.
    - DemGender_M: being male raises the log-odds by 1.1344 compared to the baseline (DemGender_U), an odds ratio of about 3.11, indicating a positive effect on the likelihood of purchasing organic products.
    - DemGender_F: being female raises the log-odds by 2.0827 compared to the baseline (DemGender_U), an odds ratio of about 8.03. This suggests a stronger positive impact of being female on the likelihood of purchasing organic products compared to being male.
    - PromSpend: the coefficient is -3.11e-06, a very small negative effect, but it is not interpreted further because it is not statistically significant.
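
Logit coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are usually easier to explain. The sketch below does this for the coefficients quoted above; the dictionary is typed in by hand from the answer, not read from the fitted `result` object.

```python
import numpy as np

# Coefficients as quoted in the interpretation above (log-odds scale)
coefs = {
    "DemAffl": 0.2545,
    "DemAge": -0.0545,
    "DemGender_M": 1.1344,
    "DemGender_F": 2.0827,
    "PromSpend": -3.11e-06,
}

# exp(coef) is the multiplicative change in the odds of TargetBuy = 1
# for a one-unit increase in the predictor, holding the others constant
odds_ratios = {name: float(np.exp(beta)) for name, beta in coefs.items()}

for name, ratio in odds_ratios.items():
    print(f"{name:12s} odds ratio = {ratio:.4f}")
```

With a fitted statsmodels result, the same numbers should come from `np.exp(result.params)`. An odds ratio above 1 means higher odds of purchase, and one below 1 means lower odds.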

**9. From the confusion matrix, compute the accuracy, precision, recall and F1 score for this
model. (6 pts)**

In [105]:
from sklearn import metrics
X_test['Predicted_Prob'] = result.predict(X_test)
X_test.head()
predictions = (X_test['Predicted_Prob'] >= .5).astype(int)

conf_matrix = metrics.confusion_matrix(y_test, predictions)

conf_matrix

array([[2964,  180],
       [ 681,  493]])

- Accuracy: (2964 + 493) / (2964 + 180 + 681 + 493) = 3457 / 4318 = 0.8006
- Precision: 493 / (493 + 180) = 493 / 673 = 0.7325
- Recall: 493 / (493 + 681) = 493 / 1174 = 0.4199
- F1 Score: 2 * (0.7325 * 0.4199) / (0.7325 + 0.4199) = 0.5338
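
The same metrics can be computed directly from the confusion matrix cells. This sketch assumes the scikit-learn layout, where rows are the true classes and columns are the predicted classes, so the matrix reads `[[TN, FP], [FN, TP]]`.

```python
# Cells of the confusion matrix shown above, in scikit-learn's [[TN, FP], [FN, TP]] layout
tn, fp = 2964, 180
fn, tp = 681, 493

total = tn + fp + fn + tp

accuracy = (tp + tn) / total          # share of all predictions that are correct
precision = tp / (tp + fp)            # share of predicted buyers who actually bought
recall = tp / (tp + fn)               # share of actual buyers the model found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print(f"accuracy:  {accuracy:.4f}")
print(f"precision: {precision:.4f}")
print(f"recall:    {recall:.4f}")
print(f"f1:        {f1:.4f}")
```

The F1 score comes out at about 0.534. The low recall relative to precision means that, at the 0.5 threshold, the model misses more than half of the actual buyers in the test set.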

**10. Summarize your findings to the Director of Marketing for this company. Based on your
summary, provide two recommendations that could address the company’s problem
described earlier. (Limit this answer to a paragraph of not more than 200 words) (5 pts)**

For the Director of Marketing: the model shows that younger, more affluent customers, and women in particular, are the most likely to buy the organic line, while total spending does not significantly predict purchase. My first recommendation is to target a younger crowd, leaning toward women but not to the point of excluding men from the marketing campaign. Some ways to bring them in would be social media campaigns, store displays, or more personalized marketing that resonates with them.

My second recommendation is to target the more affluent customers as well, drawing them in with more upscale campaigns or building a perception of the products that gets affluent customers to notice them.

A third suggestion (not strictly necessary, just my own thought) is to find out why spending is not significant in the model, and to find a way for organic food to become more viable for people's spending habits and perhaps encourage better purchasing decisions.