## Introduction to Data Science

#### University of Redlands - DATA 101
#### Prof: Joanna Bieri [joanna_bieri@redlands.edu](mailto:joanna_bieri@redlands.edu)
#### [Class Website: data101.joannabieri.com](https://joannabieri.com/data101.html)

---------------------------------------
# Homework Day 18
---------------------------------------

GOALS:

1. Practice Logistic Regression
2. Interpret Logistic Regression Results

----------------------------------------------------------


This homework has **1 Exercise** and **1 Challenge Exercise**

### Important Information

- Email: [joanna_bieri@redlands.edu](mailto:joanna_bieri@redlands.edu)
- Office Hours: Duke 209 <a href="https://joannabieri.com/schedule.html"> Click Here for Joanna's Schedule</a>


### Announcements

**Come to Lab!** If you need help we are here to help!

### Day 18 Assignment - same drill.


1. Make sure you **Pull** any new content from the class repo - then **Copy** it over into your working directory.
2. Open the file Day##-HW.ipynb and start doing the problems.
    * You can do these problems as you follow along with the lecture notes and video.
3. Get as far as you can before class.
4. Submit what you have so far **Commit** and **Push** to Git.
5. Take the daily check in quiz on **Canvas**.
6. Come to class with lots of questions!


In [76]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = 'colab'

from itables import show

# This stops a few warning messages from showing
pd.options.mode.chained_assignment = None 
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# Machine Learning Packages
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression, LogisticRegression 
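from sklearn import metrics
# confusion_matrix is called directly in the cells below, so import it by name
from sklearn.metrics import confusion_matrix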

### Data: A collection of Emails

- Emails for the first three months of 2012 for an email account
- Data from 3921 emails and 21 variables on them
- Outcome: whether the email is spam or not
- Predictors: number of characters, whether the email had "Re:" in the subject, time at which email was sent, number of times the word "inherit" shows up in the email, etc.


Data Information: https://www.openintro.org/data/index.php?data=email

This lab follows the Data Science in a Box units "Unit 4 - Deck 6: Logistic regression" by Mine Çetinkaya-Rundel. It has been updated for our class and translated to Python by Joanna Bieri.

In [77]:
file_name = 'data/email.csv'
DF = pd.read_csv(file_name)

In [78]:
DF

Unnamed: 0,spam,to_multiple,from,cc,sent_email,...,re_subj,exclaim_subj,urgent_subj,exclaim_mess,number
0,0,0,1,0,0,...,0,0,0,0,big
1,0,0,1,0,0,...,0,0,0,1,small
2,0,0,1,0,0,...,0,0,0,6,small
3,0,0,1,0,0,...,0,0,0,48,small
4,0,0,1,0,0,...,0,0,0,1,none
...,...,...,...,...,...,...,...,...,...,...,...
3916,1,0,1,0,0,...,0,0,0,0,small
3917,1,0,1,0,0,...,0,0,0,0,small
3918,0,1,1,0,0,...,0,0,0,5,small
3919,0,1,1,0,0,...,0,0,0,0,small


**Exercise 1** Logistic Regression with ONE explanatory variable.

Choose another variable from the data set to use as your explanatory variable and create a Logistic Regression model to predict if an email is spam or not. You should do all of the following:

1. Say what variable you are using to predict spam messages (do some analysis, at minimum a value_counts()). Why do you think this is a good variable to use in predicting if an email is spam?
2. Create and fit a Logistic Regression model.
3. Show the results: intercept, coefficient, basic confusion matrix prediction.
4. What do you think the decision cutoff should be? Update the cutoff and redo the confusion matrix.
5. Explain your results in words. You should talk about False Negative and False positive rates and what they mean in terms of the variables you chose.
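
Step 4 asks for a decision cutoff. As a minimal sketch of how a cutoff turns predicted probabilities into confusion matrix counts, the snippet below uses plain Python and a handful of made-up labels and probabilities. The names `y_true`, `probs`, and the helper `confusion_counts` are hypothetical and are not taken from the email data.

```python
def confusion_counts(y_true, probs, cutoff):
    """Return (tn, fp, fn, tp) when predicting 1 for probs >= cutoff."""
    tn = fp = fn = tp = 0
    for actual, p in zip(y_true, probs):
        predicted = 1 if p >= cutoff else 0
        if actual == 0 and predicted == 0:
            tn += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        elif actual == 1 and predicted == 0:
            fn += 1
        else:
            tp += 1
    return tn, fp, fn, tp

# Made-up example: 1 = spam, 0 = not spam
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
probs = [0.05, 0.08, 0.09, 0.10, 0.09, 0.12, 0.11, 0.13]

# Compare the default cutoff with a much lower one
for cutoff in (0.5, 0.1):
    print(cutoff, confusion_counts(y_true, probs, cutoff))
```

With these made-up numbers, lowering the cutoff trades a few false positives for fewer false negatives. Where to stop depends on which mistake is more costly.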


**Exercise 2 - challenge** Logistic Regression with MORE THAN ONE explanatory variable.

Try redoing the analysis, but this time add a few more explanatory variables. Again, do some analysis of the variables you are choosing and state why they are a good choice. Then answer questions 1-5 again.

In [79]:
#1
DF.groupby('exclaim_mess')['spam'].describe()

exclaim_mess,count,mean,std,min,25%,50%,75%,max
0,1435.0,0.150523,0.357708,0.0,0.0,0.0,0.0,1.0
1,733.0,0.113233,0.317094,0.0,0.0,0.0,0.0,1.0
2,507.0,0.049310,0.216728,0.0,0.0,0.0,0.0,1.0
3,128.0,0.093750,0.292626,0.0,0.0,0.0,0.0,1.0
4,190.0,0.026316,0.160496,0.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...
947,1.0,0.000000,,0.0,0.0,0.0,0.0,0.0
1197,1.0,0.000000,,0.0,0.0,0.0,0.0,0.0
1203,2.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0
1209,1.0,1.000000,,1.0,1.0,1.0,1.0,1.0


In [80]:
#2/3
# Getting a subset of the columns
DF_model = DF[['exclaim_mess','spam']]

# Getting the variables
X = DF_model['exclaim_mess'].values.reshape(-1,1)
y = DF_model['spam']

# Doing the regression
LM = LogisticRegression()
LM.fit(X,y)

# Getting the predicted probabilities 
y_pred_prob = LM.predict_proba(X)[:, 1]

# Applying the default cutoff, which is 0.5, to get class predictions
y_pred_default = (y_pred_prob >= 0.5).astype(int)

# Calculating the basic confusion matrix
cm_default = confusion_matrix(y, y_pred_default)

# Getting parameters
intercept = LM.intercept_[0]
coefficient = LM.coef_[0][0]
classes = LM.classes_

print("\nCONFUSION MATRIX")
print(cm_default)
print('Classes:')
print(LM.classes_)
print('Coefficients:')
print(LM.coef_)
print('Intercept:')
print(LM.intercept_)


CONFUSION MATRIX
[[3554    0]
 [ 367    0]]
Classes:
[0 1]
Coefficients:
[[0.00027241]]
Intercept:
[-2.27234271]
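
The confusion matrix above can be turned into the rates that step 5 asks about. This is a small sketch that copies the four counts from the printed output and assumes scikit-learn's usual layout (rows are the actual class, columns are the predicted class). The variable names are chosen here for illustration.

```python
# Confusion matrix copied from the output above.
# Layout: rows = actual class (0, 1), columns = predicted class (0, 1).
cm = [[3554, 0],
      [367, 0]]

tn, fp = cm[0]   # actual not-spam: correctly kept, wrongly flagged
fn, tp = cm[1]   # actual spam: missed, correctly flagged

false_positive_rate = fp / (fp + tn)   # share of real emails flagged as spam
false_negative_rate = fn / (fn + tp)   # share of spam emails that were missed
accuracy = (tn + tp) / (tn + fp + fn + tp)

print(f"False positive rate: {false_positive_rate:.3f}")
print(f"False negative rate: {false_negative_rate:.3f}")
print(f"Accuracy:            {accuracy:.3f}")
```

The accuracy looks high only because most emails are not spam. The model never predicts spam, so every spam email is a false negative.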


In [81]:
#4
# Getting a subset of the columns
DF_model = DF[['exclaim_mess','spam']]

# Getting the variables
X = DF_model['exclaim_mess'].values.reshape(-1,1)
y = DF_model['spam']

# Doing the regression
LM = LogisticRegression()
LM.fit(X,y)

# Getting the predicted probabilities
y_pred_prob = LM.predict_proba(X)[:, 1]  

# Extracting parameters
intercept = LM.intercept_[0]  
coefficient = LM.coef_[0][0]  
classes = LM.classes_  

#Creating a new cutoff
NEW_CUTOFF = 0.9 
y_pred_new = (y_pred_prob >= NEW_CUTOFF).astype(int)
cm_new = confusion_matrix(y, y_pred_new)

print("\nCONFUSION MATRIX (Cutoff 0.9):")
print(cm_new)
print('Classes:')
print(LM.classes_)
print('Coefficients:')
print(LM.coef_)
print('Intercept:')
print(LM.intercept_)


CONFUSION MATRIX (Cutoff 0.9):
[[3554    0]
 [ 367    0]]
Classes:
[0 1]
Coefficients:
[[0.00027241]]
Intercept:
[-2.27234271]
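
One way to see why the matrix is identical at both cutoffs is to plug the printed intercept and coefficient into the logistic formula `p = 1 / (1 + exp(-(intercept + coefficient * x)))`. The sketch below does this by hand. The helper names `spam_probability` and `count_needed` are made up for illustration, and 1209 is used because it is the largest `exclaim_mess` value shown in the groupby table earlier.

```python
import math

# Values printed above
intercept = -2.27234271
coefficient = 0.00027241

def spam_probability(exclaim_mess):
    """Predicted P(spam) from the fitted logistic curve."""
    log_odds = intercept + coefficient * exclaim_mess
    return 1 / (1 + math.exp(-log_odds))

def count_needed(cutoff):
    """Exclamation points needed for P(spam) to reach the cutoff."""
    return (math.log(cutoff / (1 - cutoff)) - intercept) / coefficient

# Probabilities for no, a few, and the maximum number of exclamation points
for n in (0, 10, 1209):
    print(n, round(spam_probability(n), 3))

# How many exclamation points it would take to cross each cutoff
print(round(count_needed(0.5)), round(count_needed(0.9)))
```

Under this model even the email with the most exclamation points stays far below 0.5, so a cutoff of 0.5 or 0.9 flags nothing. A cutoff that separates any emails would have to sit inside the narrow range of probabilities the model actually produces.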


Q: Explain your results in words. You should talk about False Negative and False positive rates and what they mean in terms of the variables you chose.

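A: The coefficient is positive, so more exclamation points raise the predicted probability of spam, but it is very small (about 0.00027 per exclamation point), so the effect is slight. At the default cutoff of 0.5 the model labels every email as not spam: the False Positive rate is 0 (0 of 3554 real emails flagged), but the False Negative rate is 100% (all 367 spam emails are missed). I chose a cutoff of 0.9 to minimize false positives, since losing an important email is worse than seeing spam in the inbox. Because no predicted probability reaches even 0.5, raising the cutoff does not change the confusion matrix. To catch any spam with this variable, the cutoff would have to be lowered instead, which would accept some False Positives.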

### Challenge

In [88]:
#1
DF.groupby(['exclaim_mess', 'urgent_subj', 'exclaim_subj'])['spam'].describe()

exclaim_mess,urgent_subj,exclaim_subj,count,mean,std,min,25%,50%,75%,max
0,0,0,1378.0,0.147315,0.354548,0.0,0.00,0.0,0.00,1.0
0,0,1,55.0,0.218182,0.416818,0.0,0.00,0.0,0.00,1.0
0,1,0,2.0,0.500000,0.707107,0.0,0.25,0.5,0.75,1.0
1,0,0,715.0,0.104895,0.306633,0.0,0.00,0.0,0.00,1.0
1,0,1,15.0,0.400000,0.507093,0.0,0.00,0.0,1.00,1.0
...,...,...,...,...,...,...,...,...,...,...
947,0,0,1.0,0.000000,,0.0,0.00,0.0,0.00,0.0
1197,0,0,1.0,0.000000,,0.0,0.00,0.0,0.00,0.0
1203,0,0,2.0,0.000000,0.000000,0.0,0.00,0.0,0.00,0.0
1209,0,0,1.0,1.000000,,1.0,1.00,1.0,1.00,1.0


In [87]:
#2/3
# Getting a subset of the columns
DF_model = DF[['exclaim_mess','urgent_subj','exclaim_subj','spam']]

# Getting the variables
X = DF_model[['exclaim_mess','urgent_subj','exclaim_subj']].values
y = DF_model['spam']

# Doing the regression
LM = LogisticRegression()
LM.fit(X,y)

# Getting the predicted probabilities 
y_pred_prob = LM.predict_proba(X)[:, 1]

# Applying the default cutoff which is 0.5, to get class predictions
y_pred_default = (y_pred_prob >= 0.5).astype(int)

# Calculating the basic confusion matrix
cm_default = confusion_matrix(y, y_pred_default)

# Getting parameters
intercept = LM.intercept_[0]
coefficients = LM.coef_[0]  # one coefficient per explanatory variable
classes = LM.classes_

print("\nCONFUSION MATRIX")
print(cm_default)
print('Classes:')
print(LM.classes_)
print('Coefficients:')
print(LM.coef_)
print('Intercept:')
print(LM.intercept_)


CONFUSION MATRIX
[[3554    0]
 [ 367    0]]
Classes:
[0 1]
Coefficients:
[[2.78894634e-04 1.61740897e+00 2.46565899e-02]]
Intercept:
[-2.27961263]
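
To compare the three coefficients, it can help to convert each one to an odds ratio with `exp(coefficient)`. The sketch below copies the printed coefficients and assumes they are in the same order as the columns passed to the model. The names `names`, `coefficients`, and `odds_ratios` are chosen here for illustration.

```python
import math

# Coefficients printed above, in the order the columns were passed to the model
names = ['exclaim_mess', 'urgent_subj', 'exclaim_subj']
coefficients = [2.78894634e-04, 1.61740897e+00, 2.46565899e-02]

# exp(coefficient) is the factor by which the odds of spam are multiplied
# when that variable goes up by 1 and the other variables stay fixed
odds_ratios = {name: math.exp(b) for name, b in zip(names, coefficients)}

for name, ratio in odds_ratios.items():
    print(f"{name:13s} odds ratio = {ratio:.4f}")
```

Read this way, an urgent subject line multiplies the odds of spam by roughly 5, while one extra exclamation point in the message changes the odds by well under 0.1%.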


In [89]:
#4
# Getting a subset of the columns
DF_model = DF[['exclaim_mess','urgent_subj','exclaim_subj','spam']]

# Getting the variables
X = DF_model[['exclaim_mess','urgent_subj','exclaim_subj']]  
y = DF_model['spam']

# Doing the regression
LM = LogisticRegression()
LM.fit(X,y)

# Getting the predicted probabilities
y_pred_prob = LM.predict_proba(X)[:, 1]

# Extracting parameters
intercept = LM.intercept_[0]  
coefficients = LM.coef_[0]  # one coefficient per explanatory variable
classes = LM.classes_  

#Creating a new cutoff
NEW_CUTOFF = 0.9 
y_pred_new = (y_pred_prob >= NEW_CUTOFF).astype(int)
cm_new = confusion_matrix(y, y_pred_new)

print("\nCONFUSION MATRIX (Cutoff 0.9):")
print(cm_new)
print('Classes:')
print(LM.classes_)
print('Coefficients:')
print(LM.coef_)
print('Intercept:')
print(LM.intercept_)


CONFUSION MATRIX (Cutoff 0.9):
[[3554    0]
 [ 367    0]]
Classes:
[0 1]
Coefficients:
[[2.78894634e-04 1.61740897e+00 2.46565899e-02]]
Intercept:
[-2.27961263]


5
Q: Explain your results in words. You should talk about False Negative and False positive rates and what they mean in terms of the variables you chose.

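A: I chose these three columns because exclamation points and urgent subject lines seem typical of spam, and all three fitted coefficients are positive. Their sizes differ a lot: the `urgent_subj` coefficient (about 1.62) is by far the largest, while each extra exclamation point in the message (about 0.00028) barely changes the predicted probability. Even with three variables, no email reaches a predicted probability of 0.5, so the model labels every email as not spam at both the default cutoff and my cutoff of 0.9. The False Positive rate is 0 (0 of 3554 real emails flagged) and the False Negative rate is 100% (all 367 spam emails are missed). I chose 0.9 to minimize false positives, because losing an important email is worse than seeing spam in the inbox, but with these variables that choice lets all of the spam through. The cutoff would need to be well below 0.5 to catch any spam.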