## Introduction to Data Science

#### University of Redlands - DATA 101
#### Prof: Joanna Bieri [joanna_bieri@redlands.edu](mailto:joanna_bieri@redlands.edu)
#### [Class Website: data101.joannabieri.com](https://joannabieri.com/data101.html)

---------------------------------------
# Homework Day 18
---------------------------------------

GOALS:

1. Practice Logistic Regression
2. Interpret Logistic Regression Results

----------------------------------------------------------


This homework has **1 Exercise** and **1 Challenge Exercise**.

### Important Information

- Email: [joanna_bieri@redlands.edu](mailto:joanna_bieri@redlands.edu)
- Office Hours: Duke 209 <a href="https://joannabieri.com/schedule.html"> Click Here for Joanna's Schedule</a>


### Announcements

**Come to Lab!** If you need help we are here to help!

### Day 18 Assignment - same drill.


1. **Pull** any new content from the class repo, then **Copy** it over into your working directory.
2. Open the file Day##-HW.ipynb and start doing the problems.
    * You can do these problems as you follow along with the lecture notes and video.
3. Get as far as you can before class.
4. Submit what you have so far: **Commit** and **Push** to Git.
5. Take the daily check in quiz on **Canvas**.
6. Come to class with lots of questions!


In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = 'colab'

from itables import show

# This stops a few warning messages from showing
pd.options.mode.chained_assignment = None 
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# Machine Learning Packages
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression, LogisticRegression 
from sklearn import metrics

### Data: A collection of Emails

- Emails for the first three months of 2012 for an email account
- Data from 3921 emails and 21 variables on them
- Outcome: whether the email is spam or not
- Predictors: number of characters, whether the email had "Re:" in the subject, time at which email was sent, number of times the word "inherit" shows up in the email, etc.


Data Information: https://www.openintro.org/data/index.php?data=email

This lab follows the Data Science in a Box units "Unit 4 - Deck 6: Logistic regression" by Mine Çetinkaya-Rundel. It has been updated for our class and translated to Python by Joanna Bieri.

In [2]:
file_name = 'data/email.csv'
DF = pd.read_csv(file_name)

In [3]:
DF

Unnamed: 0,spam,to_multiple,from,cc,sent_email,time,image,attach,dollar,winner,...,viagra,password,num_char,line_breaks,format,re_subj,exclaim_subj,urgent_subj,exclaim_mess,number
0,0,0,1,0,0,2012-01-01T06:16:41Z,0,0,0,no,...,0,0,11.370,202,1,0,0,0,0,big
1,0,0,1,0,0,2012-01-01T07:03:59Z,0,0,0,no,...,0,0,10.504,202,1,0,0,0,1,small
2,0,0,1,0,0,2012-01-01T16:00:32Z,0,0,4,no,...,0,0,7.773,192,1,0,0,0,6,small
3,0,0,1,0,0,2012-01-01T09:09:49Z,0,0,0,no,...,0,0,13.256,255,1,0,0,0,48,small
4,0,0,1,0,0,2012-01-01T10:00:01Z,0,0,0,no,...,0,2,1.231,29,0,0,0,0,1,none
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3916,1,0,1,0,0,2012-03-31T00:03:45Z,0,0,0,no,...,0,0,0.332,12,0,0,0,0,0,small
3917,1,0,1,0,0,2012-03-31T14:13:19Z,0,0,1,no,...,0,0,0.323,15,0,0,0,0,0,small
3918,0,1,1,0,0,2012-03-30T16:20:33Z,0,0,0,no,...,0,0,8.656,208,1,0,0,0,5,small
3919,0,1,1,0,0,2012-03-28T16:00:49Z,0,0,0,no,...,0,0,10.185,132,0,0,0,0,0,small


**Exercise 1** Logistic Regression with ONE explanatory variable.

Choose another variable from the data set to use as your explanatory variable and create a Logistic Regression model to predict if an email is spam or not. You should do all of the following:

1. Say what variable you are using to predict spam messages (do some analysis, at minimum a value_counts()). Why do you think this is a good variable to use in predicting if an email is spam?
2. Create and fit a Logistic Regression model.
3. Show the results: intercept, coefficient, basic confusion matrix prediction.
4. What do you think the decision cutoff should be? Update the cutoff and redo the confusion matrix.
5. Explain your results in words. You should talk about false negative and false positive rates and what they mean in terms of the variables you chose.


**Exercise 2 - challenge** Logistic Regression with MORE THAN ONE explanatory variable.

Try redoing the analysis, but this time add a few more explanatory variables. Again do some analysis of the variables you are choosing and state why they are a good choice. Then answer questions 1-5 again.

In [5]:
DF['dollar'].value_counts().head(10)


dollar
0     3175
2      151
4      146
1      120
6       44
8       35
16      23
10      22
5       20
12      20
Name: count, dtype: int64

In [7]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import pandas as pd

X = DF[['dollar']]
y = DF['spam']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression()
model.fit(X_train, y_train)

intercept = model.intercept_[0]
coef = model.coef_[0][0]
intercept, coef


(np.float64(-2.1970131885296555), np.float64(-0.05759832236683518))

In [8]:
y_pred_default = model.predict(X_test)
confusion_matrix(y_test, y_pred_default)


array([[1069,    0],
       [ 108,    0]])

In [9]:
y_prob = model.predict_proba(X_test)[:, 1]
y_pred_03 = (y_prob >= 0.3).astype(int)

confusion_matrix(y_test, y_pred_03)


array([[1069,    0],
       [ 108,    0]])

I chose the `dollar` variable, which counts how many times a dollar sign or the word "dollar" appears in an email. Most emails (3,175) have a count of 0, but the rest spread over many values, so the variable varies enough to be worth testing. My expectation was that spam often talks about money or promises of cash, so spam would have higher counts. The fitted coefficient is negative (about -0.058), however, so in this data a higher `dollar` count goes with a slightly lower predicted probability of spam.


Since we’re only using one simple variable, the model’s predicted probabilities are low, even for spam. With a negative coefficient, the highest predicted probability occurs at `dollar = 0`, and it is only about 0.10, so the standard cutoff of 0.5 labels every test email as not spam.

I tried a lower cutoff of 0.3, but the confusion matrix did not change because no predicted probability reaches 0.3. For this model the cutoff would have to be below about 0.10 before any email is flagged. In spam detection, it’s usually more important to catch spam than to avoid every false alarm, so a lower cutoff makes sense in principle, but `dollar` alone does not separate the two classes well enough for the cutoff to help.
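The cutoff can be checked against the fitted values reported earlier. The following is a minimal sketch, assuming the intercept and coefficient printed by the model cell; the helper name `spam_probability` and the particular `dollar` counts are chosen here for illustration only.

```python
import numpy as np

# Intercept and coefficient copied from the fitted model output above.
intercept = -2.1970131885296555
coef = -0.05759832236683518

def spam_probability(dollar):
    """Hypothetical helper: logistic transform of the linear predictor."""
    eta = intercept + coef * np.asarray(dollar, dtype=float)
    return 1.0 / (1.0 + np.exp(-eta))

# A few illustrative dollar counts (not taken from the test set).
dollar_counts = np.array([0, 1, 2, 4, 8, 16])
probs = spam_probability(dollar_counts)

for d, p in zip(dollar_counts, probs):
    print(f"dollar = {d:2d}  ->  P(spam) = {p:.3f}")

# Count how many of these would be flagged at each cutoff.
flagged_05 = int((probs >= 0.5).sum())
flagged_03 = int((probs >= 0.3).sum())
print("largest probability:", round(float(probs.max()), 3))
print("flagged at cutoff 0.5:", flagged_05)
print("flagged at cutoff 0.3:", flagged_03)
```

Because the coefficient is negative and `dollar` is a count that cannot go below 0, the probability at `dollar = 0` is the largest the model can produce, which is why both cutoffs give the same confusion matrix.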


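The model can make two types of mistakes. A false positive is a normal email that gets marked as spam, and a false negative is a spam email that gets marked as normal. In both confusion matrices above the model never predicts spam, so none of the 1,069 normal test emails is a false positive and all 108 spam test emails are false negatives. If we lowered the cutoff far enough, we would catch more spam but also accidentally send more real emails to the spam folder, and because the coefficient is negative, the emails flagged first would be the ones with the fewest dollar mentions.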
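The two error rates can be read off the confusion matrix shown earlier. This is a minimal sketch, assuming scikit-learn's layout for `confusion_matrix` (rows are true classes, columns are predicted classes, label 0 first); the variable names are chosen here for illustration.

```python
import numpy as np

# Confusion matrix copied from the output above:
# rows = true class (0 = not spam, 1 = spam), columns = predicted class.
cm = np.array([[1069, 0],
               [108, 0]])

# Unpack in row-major order: true negatives, false positives,
# false negatives, true positives.
tn, fp, fn, tp = cm.ravel()

# Share of normal emails wrongly flagged as spam.
false_positive_rate = fp / (fp + tn)
# Share of spam emails that slipped through as normal.
false_negative_rate = fn / (fn + tp)

print(f"False positive rate: {false_positive_rate:.1%}")
print(f"False negative rate: {false_negative_rate:.1%}")
```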


In [31]:
# Exercise 2: check value counts for the candidate predictors
for col in ['dollar','winner','viagra','exclaim_mess','attach','password','urgent_subj']:
    print(f"{col} value counts:")
    print(DF[col].value_counts())
    print()



dollar value counts:
Series([], Name: count, dtype: int64)

winner value counts:
Series([], Name: count, dtype: int64)

viagra value counts:
Series([], Name: count, dtype: int64)

exclaim_mess value counts:
Series([], Name: count, dtype: int64)

attach value counts:
Series([], Name: count, dtype: int64)

password value counts:
Series([], Name: count, dtype: int64)

urgent_subj value counts:
Series([], Name: count, dtype: int64)



In [36]:
print("Shape of dataframe:", DF.shape)
print(DF.head())


Shape of dataframe: (0, 21)
Empty DataFrame
Columns: [spam, to_multiple, from, cc, sent_email, time, image, attach, dollar, winner, inherit, viagra, password, num_char, line_breaks, format, re_subj, exclaim_subj, urgent_subj, exclaim_mess, number]
Index: []

[0 rows x 21 columns]


In [37]:
predictors = ['dollar','winner','viagra','exclaim_mess','attach','password','urgent_subj']
print(DF[predictors].info())
print(DF[predictors].head())


<class 'pandas.core.frame.DataFrame'>
Index: 0 entries
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   dollar        0 non-null      int64  
 1   winner        0 non-null      float64
 2   viagra        0 non-null      int64  
 3   exclaim_mess  0 non-null      int64  
 4   attach        0 non-null      int64  
 5   password      0 non-null      int64  
 6   urgent_subj   0 non-null      int64  
dtypes: float64(1), int64(6)
memory usage: 0.0 bytes
None
Empty DataFrame
Columns: [dollar, winner, viagra, exclaim_mess, attach, password, urgent_subj]
Index: []


In [38]:
DF['attach'] = DF['attach'].map({'yes':1, 'no':0})
DF['urgent_subj'] = DF['urgent_subj'].map({'yes':1, 'no':0})


In [39]:
num_cols = ['dollar','winner','viagra','exclaim_mess','password']
for col in num_cols:
    DF[col] = pd.to_numeric(DF[col], errors='coerce')


In [40]:
DF = DF.dropna(subset=predictors + ['spam'])
print("Shape after cleaning:", DF.shape)  # Should be >0


Shape after cleaning: (0, 21)
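One possible reading of the empty result, offered as an assumption rather than a diagnosis of this exact run: applying `.map({'yes':1, 'no':0})` to a column that already holds integers (as `attach` and `urgent_subj` do in the first data preview) matches nothing, so every value becomes `NaN`, and `dropna` then removes every row. The same thing happens when `pd.to_numeric(..., errors='coerce')` is applied to a yes/no column such as `winner`. The sketch below uses a hypothetical three-row frame, not the email data.

```python
import pandas as pd

# Hypothetical toy frame that mimics the column types in the email data.
toy = pd.DataFrame({
    "attach": [0, 1, 0],            # already integers
    "winner": ["no", "yes", "no"],  # yes/no strings
    "spam":   [0, 1, 0],
})
cols = ["attach", "winner", "spam"]

# Mapping yes/no onto an integer column matches nothing: all values become NaN.
broken = toy.copy()
broken["attach"] = broken["attach"].map({"yes": 1, "no": 0})
broken_shape = broken.dropna(subset=cols).shape
print("after mapping the integer column:", broken_shape)

# Mapping only the string column keeps every row.
fixed = toy.copy()
fixed["winner"] = fixed["winner"].map({"yes": 1, "no": 0})
fixed_shape = fixed.dropna(subset=cols).shape
print("after mapping the yes/no column:", fixed_shape)
```

If this is what happened, reloading the CSV and mapping only the yes/no column before `dropna` should leave the rows in place.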


In [45]:
num = 2
intercept = -1.80
slope = -0.0621

eta = intercept + slope*num

P = np.exp(eta)/(1+np.exp(eta))
print(P)

0.1273939433505534
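
The cell above applies the logistic function to a linear predictor. Written out with the same intercept and slope (the symbols $\beta_0$, $\beta_1$, and $x$ are notation introduced here, with $x$ standing for `num`):

$$
\eta = \beta_0 + \beta_1 x = -1.80 + (-0.0621)(2) = -1.9242
$$

$$
P = \frac{e^{\eta}}{1 + e^{\eta}} = \frac{e^{-1.9242}}{1 + e^{-1.9242}} \approx \frac{0.1460}{1.1460} \approx 0.1274
$$

Because the slope is negative, a larger $x$ makes $\eta$ more negative and pushes $P$ toward 0.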


In [46]:
num = 40
intercept = -1.80
slope = -0.0621

eta = intercept + slope*num

P = np.exp(eta)/(1+np.exp(eta))
print(P)

0.013599894814523065


In [47]:
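intercept = -1.80
slope = -0.0621

# Evaluate the logistic curve on a grid of num_char values
num_characters = np.arange(0, 200, 10)
eta = intercept + slope * num_characters
P = np.exp(eta) / (1 + np.exp(eta))

# DF_model is not defined earlier in this notebook; assume it holds the
# two columns used in the plot
DF_model = DF[['num_char', 'spam']]

# Plot the data with the logistic curve on top
fig = px.scatter(DF_model, x='num_char', y='spam', opacity=.5)

fig.add_trace(
    px.line(x=num_characters, y=P, color_discrete_sequence=['black']).data[0]
)

fig.show()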