#### Name: Akshat Bhat

#### UID: 2018130003

#### Roll No. 5

#### BE COMPS

## Experiment 7: Text Analysis - Separating Spam From Ham

### Task 1.1 - Loading the Dataset

In [19]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

In [20]:
df = pd.read_csv('emails.csv', encoding='cp1252', delimiter=',', quotechar='"') 

In [21]:
df

| | text | spam |
|---|---|---|
| 0 | Subject: naturally irresistible your corporate... | 1 |
| 1 | Subject: the stock trading gunslinger fanny i... | 1 |
| 2 | Subject: unbelievable new homes made easy im ... | 1 |
| 3 | Subject: 4 color printing special request add... | 1 |
| 4 | Subject: do not have money , get software cds ... | 1 |
| ... | ... | ... |
| 5723 | Subject: re : research and development charges... | 0 |
| 5724 | Subject: re : receipts from visit jim , than... | 0 |
| 5725 | Subject: re : enron case study update wow ! a... | 0 |
| 5726 | Subject: re : interest david , please , call... | 0 |


In [22]:
print("Shape of data (samples, features): ",df.shape)

Shape of data (samples, features):  (5728, 2)


**Question 1: How many emails are in the dataset?**

The dataset contains 5728 emails, one per row.

In [23]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5728 entries, 0 to 5727
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    5728 non-null   object
 1   spam    5728 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 89.6+ KB


In [24]:
df['spam'].value_counts()

0    4360
1    1368
Name: spam, dtype: int64

**Question 2: How many of the emails are spam?**

1368 of the 5728 emails are spam; the remaining 4360 are ham.
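
The class counts also give a useful reference point for the accuracy figures reported later. The short calculation below is only a sketch based on the `value_counts()` output above; the variable names are illustrative.

```python
# Class counts taken from the value_counts() output above
n_ham, n_spam = 4360, 1368
n_total = n_ham + n_spam

spam_share = n_spam / n_total          # fraction of emails that are spam
baseline_accuracy = n_ham / n_total    # accuracy of always predicting "ham"

print(f"spam share: {spam_share:.3f}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Any classifier should beat the majority-class baseline to be considered useful on this dataset.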

In [25]:
first_w = df.apply(lambda x:x['text'].split()[0][:-1],axis=1)
first_w.value_counts()

Subject    5728
dtype: int64

**Question 3: Which word appears at the beginning of every email in the dataset? Respond as a lower-case word with punctuation removed.**
    
subject
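
A minimal sketch of how the first word could be extracted directly in the requested form (lower case, punctuation removed). The sample emails are hypothetical stand-ins for the `text` column.

```python
import string
from collections import Counter

# Hypothetical sample emails in the same "Subject: ..." layout as the dataset
emails = [
    "Subject: naturally irresistible your corporate identity",
    "Subject: re : research and development charges",
    "Subject: unbelievable new homes made easy",
]

# First whitespace-separated token, lower-cased, with surrounding punctuation stripped
first_words = [email.split()[0].strip(string.punctuation).lower() for email in emails]
first_word_counts = Counter(first_words)
print(first_word_counts)
```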

In [26]:
# %matplotlib inline 
# sns.set(rc = {'figure.figsize':(11.7,8.27)})
# sns.set_style('darkgrid')
# nlp_words = nltk.FreqDist(first_words)
# nlp_words.plot(20) 

**Question 4: Could a spam classifier potentially benefit from including the frequency of the word that appears in every email?**
    
No -- the word appears in every email so this variable would not help us differentiate spam from ham.
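
The reasoning can be illustrated with information gain, the kind of criterion a decision tree uses to choose splits. The sketch below uses hypothetical toy data and treats each word as a presence/absence indicator; it is not part of the original experiment, and raw counts could still vary if a word repeats within an email.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(feature, labels):
    """Reduction in label entropy obtained by splitting on a feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

# Hypothetical toy data: 1 = spam, 0 = ham
labels = [1, 1, 0, 0, 0, 0]
has_subject = [1, 1, 1, 1, 1, 1]   # present in every email
has_free = [1, 1, 0, 0, 0, 1]      # present in only some emails

gain_subject = information_gain(has_subject, labels)
gain_free = information_gain(has_free, labels)
print(gain_subject, gain_free)
```

A feature with the same value in every email leaves the labels exactly as mixed as before, so its gain is zero, while a feature that varies with the label has positive gain.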

In [27]:
# Character count of every email
lengths = df['text'].str.len()
max_characters = lengths.max()
# Keep the email that actually has the maximum length, not the last one visited
longest_email = df.loc[lengths.idxmax(), 'text']

print('The longest email has', max_characters, 'characters')

The longest email has 43952 characters

**Question 5: How many characters are in the longest email in the dataset (where longest is measured in terms of the maximum number of characters)?**

43952 characters
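
When tracking the longest item, the item and its length have to be updated together. A minimal plain-Python sketch on hypothetical sample strings is shown below; with pandas, `df['text'].str.len()` combined with `idxmax()` is one way to get the same result.

```python
# Hypothetical sample emails; the longest one is deliberately not the last
emails = [
    "Subject: hi",
    "Subject: a much longer message body",
    "Subject: short one",
]

# max() with key=len returns the element with the most characters
longest_email = max(emails, key=len)
max_characters = len(longest_email)

print(max_characters)
print(longest_email)
```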

### Task 2.1 - Preparing the Corpus 

In [28]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Akshat\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [29]:
import string
from nltk.corpus import stopwords
from nltk import PorterStemmer as Stemmer

# Build the stopword set and the stemmer once instead of on every call
STOP_WORDS = set(stopwords.words('english'))
st = Stemmer()

def preprocess(text):
    # Converting to lowercase
    text = text.lower()
    # Removing punctuation
    text = ''.join([t for t in text if t not in string.punctuation])
    # Removing stopwords
    text = [t for t in text.split() if t not in STOP_WORDS]
    # Applying stemming
    text = [st.stem(t) for t in text]
    text = ' '.join(text)
    return text
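
The first three preprocessing steps can be tried out without NLTK. The sketch below is an illustration only: `STOPWORDS` is a tiny hypothetical list rather than NLTK's English list, the `clean` helper is not part of the notebook, and stemming is left out.

```python
import string

# Hypothetical, tiny stopword list; the notebook uses NLTK's full English list
STOPWORDS = {"do", "not", "have", "from", "here"}
# Translation table that deletes every ASCII punctuation character
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean(text):
    """Lower-case, strip punctuation, and drop stopwords (no stemming)."""
    text = text.lower().translate(PUNCT_TABLE)
    return " ".join(t for t in text.split() if t not in STOPWORDS)

cleaned = clean("Subject: do not have money , get software CDs from here !")
print(cleaned)  # subject money get software cds
```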

In [31]:
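from tqdm import tqdm
tqdm.pandas()  # registers progress_apply on pandas objects
# progress_apply behaves like apply but shows a progress bar
corpus = df.progress_apply(lambda x: preprocess(x['text']), axis=1)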

In [32]:
corpus # preprocessed corpus

0       subject natur irresist corpor ident lt realli ...
1       subject stock trade gunsling fanni merril muzo...
2       subject unbeliev new home made easi im want sh...
3       subject 4 color print special request addit in...
4       subject money get softwar cd softwar compat gr...
                              ...                        
5723    subject research develop charg gpg forward shi...
5724    subject receipt visit jim thank invit visit ls...
5725    subject enron case studi updat wow day super t...
5726    subject interest david pleas call shirley cren...
5727    subject news aurora 5 2 updat aurora version 5...
Length: 5728, dtype: object

In [33]:
corpus.to_csv('corpus.csv')

In [63]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer 

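# Builds a term-document table (terms as rows, documents as columns)
# and also returns the sparse document-term count matrix
def fn_tdm_df(docs): 
    vectorizer = CountVectorizer()
    vec = vectorizer.fit_transform(docs)
    # get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
    df2 = pd.DataFrame(vec.toarray().transpose(), index=vectorizer.get_feature_names_out())
    return df2,vec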

In [64]:
dtm,X_transform = fn_tdm_df(corpus) # dtm is the Document Term Matrix

In [65]:
dtm.to_csv('dtm.csv')

In [66]:
dtm

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,5718,5719,5720,5721,5722,5723,5724,5725,5726,5727
00,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,4,0,0
000,0,0,0,0,0,0,0,0,1,0,...,1,0,0,0,0,0,0,0,0,0
0000,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
000000,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
00000000,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
zymg,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
zzmacmac,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
zzn,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
zzncacst,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


**Question: How many terms are in dtm?**

29254 terms
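
The number of terms is the size of the vocabulary learned by `CountVectorizer`, which equals the number of columns in the sparse document-term matrix. A small sketch on hypothetical, already-preprocessed documents:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical preprocessed documents (lower-cased, stemmed)
docs = [
    "subject cheap softwar",
    "subject meet tomorrow",
    "subject cheap meet",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)      # sparse matrix: documents x terms
n_docs, n_terms = X.shape
vocab = sorted(vectorizer.vocabulary_)  # vocabulary_ maps term -> column index

print(n_docs, n_terms)
print(vocab)
```

Counting terms from the sparse matrix avoids materialising a dense table, which matters once the vocabulary reaches tens of thousands of terms.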

In [60]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer 

def sparse_tdm_df(docs):
    vectorizer = CountVectorizer(min_df = 0.05) # limit dtm to contain terms appearing in at least 5% of documents
    vec = vectorizer.fit_transform(docs)
    # get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
    df2 = pd.DataFrame(vec.toarray().transpose(), index = vectorizer.get_feature_names_out())
    return df2

In [61]:
spdtm = sparse_tdm_df(corpus) # Sparse Document Term Matrix
spdtm

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,5718,5719,5720,5721,5722,5723,5724,5725,5726,5727
00,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,4,0,0
000,0,0,0,0,0,0,0,0,1,0,...,1,0,0,0,0,0,0,0,0,0
01,0,0,0,0,0,0,0,0,0,0,...,0,0,2,0,0,0,0,1,0,0
02,0,0,0,0,0,0,0,0,0,0,...,0,1,2,0,0,1,1,0,0,0
03,0,0,0,0,0,0,0,0,0,0,...,0,6,0,0,3,3,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
work,0,0,0,0,0,0,1,0,0,0,...,2,0,1,0,0,0,0,1,0,0
would,0,0,0,0,0,0,0,0,0,0,...,0,1,1,0,1,1,0,0,0,0
write,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
www,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2


In [62]:
spdtm.to_csv('spdtm.csv')

**Question: How many terms are in spdtm?**

366 terms
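
The effect of `min_df` can be seen on a toy corpus. In the sketch below the documents are hypothetical and `min_df=0.5` is chosen only so that the filtering is visible with four documents; the notebook itself uses `min_df=0.05`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical preprocessed documents
docs = [
    "subject cheap softwar",
    "subject meet tomorrow",
    "subject cheap meet",
    "subject report",
]

# A float min_df is a proportion: keep terms found in at least 50% of documents
frequent = CountVectorizer(min_df=0.5).fit(docs)
kept_terms = sorted(frequent.vocabulary_)

# Default min_df=1 keeps every term
all_terms = sorted(CountVectorizer().fit(docs).vocabulary_)

print(len(all_terms), "terms without filtering")
print(kept_terms)
```

Terms that occur in only one of the four documents (`softwar`, `tomorrow`, `report`) are dropped, while `subject`, `cheap` and `meet` remain.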

### Task 3.1 - Building machine learning models

In [67]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer().fit(X_transform)
X_tfidf = tfidf_transformer.transform(X_transform)
print(X_tfidf.shape)

(5728, 29254)
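
With its default settings, `TfidfTransformer` multiplies each count by a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, and then scales every row to unit L2 norm. The sketch below checks this on a small hypothetical count matrix; it assumes the default parameters (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical count matrix: 3 documents x 3 terms
counts = np.array([
    [1, 2, 0],
    [1, 0, 1],
    [1, 0, 0],
])

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)               # documents containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1   # smoothed IDF
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)  # L2-normalise rows

library = TfidfTransformer().fit_transform(counts).toarray()
print(np.allclose(manual, library))
```

The first term appears in every document, so its IDF is exactly 1, the smallest possible weight under this formula.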


In [68]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, df['spam'], test_size=0.30, random_state=123)

In [80]:
from sklearn import tree

spamCART = tree.DecisionTreeClassifier() # threshold of 0.5 for predictions is set by default
spamCART = spamCART.fit(X_train, y_train)
pred = spamCART.predict(X_test)

In [81]:
from sklearn.metrics import confusion_matrix,classification_report
print(classification_report(y_test, pred))

              precision    recall  f1-score   support

           0       0.98      0.98      0.98      1319
           1       0.92      0.92      0.92       400

    accuracy                           0.96      1719
   macro avg       0.95      0.95      0.95      1719
weighted avg       0.96      0.96      0.96      1719
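
In the cells above, the vocabulary and IDF weights were fitted on all 5728 emails before the train/test split, so the test emails influenced the features. One possible variant, sketched here on hypothetical toy data, wraps the vectorizer and classifier in a `Pipeline` so that everything is fitted on the training split only. The names and data are illustrative and the scores in this report were not produced this way.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Hypothetical preprocessed emails: 1 = spam, 0 = ham
texts = [
    "subject cheap softwar offer", "subject win money now",
    "subject free prize click", "subject cheap pill offer",
    "subject meet tomorrow agenda", "subject research report attach",
    "subject project updat thank", "subject schedul call monday",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=123, stratify=labels
)

model = Pipeline([
    ("counts", CountVectorizer()),    # vocabulary learned from training emails only
    ("tfidf", TfidfTransformer()),    # IDF weights learned from training emails only
    ("tree", DecisionTreeClassifier(random_state=123)),
])
model.fit(train_texts, y_train)
predictions = model.predict(test_texts)
print(predictions)
```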



In [82]:
from sklearn.ensemble import RandomForestClassifier
spamRF = RandomForestClassifier() 
spamRF.fit(X_train, y_train)
rf_pred = spamRF.predict(X_test)

In [84]:
print(classification_report(y_test, rf_pred))

              precision    recall  f1-score   support

           0       0.97      1.00      0.98      1319
           1       0.99      0.91      0.95       400

    accuracy                           0.98      1719
   macro avg       0.98      0.95      0.97      1719
weighted avg       0.98      0.98      0.98      1719
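
The precision and recall columns in the reports above are derived from the confusion matrix, which `confusion_matrix` (imported earlier) returns directly. A minimal sketch on hypothetical labels and predictions:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions: 1 = spam, 0 = ham
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0]

# For binary labels, ravel() yields the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # share of predicted spam that really is spam
recall = tp / (tp + fn)      # share of actual spam that was caught

print(tn, fp, fn, tp)
print(precision, recall)
```

For a spam filter, false positives (ham flagged as spam) are usually the costlier error, so the spam-class precision deserves particular attention.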



In [None]:
from sklearn.metrics import roc_curve, auc

# y_score is assumed to be the random forest's predicted probability
# for the spam class (label 1)
y_score = spamRF.predict_proba(X_test)[:, 1]

# ROC curve and area under it for the binary spam-vs-ham problem;
# with a single positive class no per-class or micro-averaged curves are needed
fpr, tpr, _ = roc_curve(y_test, y_score)
roc_auc = auc(fpr, tpr)
roc_auc
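
For a binary problem such as spam versus ham, `roc_curve` takes 1-D true labels and one score per sample. The sketch below uses four hypothetical scores to show what the area under the curve measures: the fraction of (spam, ham) pairs in which the spam email receives the higher score.

```python
from sklearn.metrics import roc_curve, auc

# Hypothetical labels and predicted spam probabilities
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

fpr, tpr, thresholds = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)

# Same quantity computed by counting correctly ordered (spam, ham) pairs
spam_scores = [s for s, y in zip(y_score, y_true) if y == 1]
ham_scores = [s for s, y in zip(y_score, y_true) if y == 0]
pairs = [(s, h) for s in spam_scores for h in ham_scores]
pairwise_auc = sum(s > h for s, h in pairs) / len(pairs)

print(roc_auc, pairwise_auc)  # 0.75 0.75
```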