
##Lab 7 - Text Analysis
###Ojasa Chitre

####Objective: Separating Spam From Ham

Nearly every email user has at some point encountered a "spam" email, which is an unsolicited message often advertising a product, containing links to malware, or attempting to scam the recipient. Roughly 80-90% of the more than 100 billion emails sent each day are spam, most of it sent from botnets of malware-infected computers. The remaining emails are called "ham" emails.

As a result of the huge number of spam emails being sent across the Internet each day, most email providers offer a spam filter that automatically flags likely spam messages and separates them from the ham. Though these filters use a number of techniques (e.g. looking up the sender in a so-called "Blackhole List" that contains IP addresses of likely spammers), most rely heavily on the analysis of the contents of an email via text analytics.

In this homework problem, we will build and evaluate a spam filter using a publicly available dataset (see the emails.csv file in the lab folder). The "ham" messages in this dataset come from the inbox of former Enron Managing Director for Research Vincent Kaminski, one of the inboxes in the Enron Corpus. One source of spam messages in this dataset is the SpamAssassin corpus, which contains hand-labeled spam messages contributed by Internet users. The remaining spam was collected by Project Honey Pot, a project that collects spam messages and identifies spammers by publishing email addresses that humans would know not to contact but that bots might target with spam. The full dataset we will use was constructed as roughly a 75/25 mix of the ham and spam messages.

The dataset contains just two fields:

· text: The text of the email.

· spam: A binary variable indicating if the email was spam.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import pandas as pd
import re
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

##Problem 1.1 – Loading the Dataset

In [3]:
emails = pd.read_csv('/content/drive/MyDrive/Engineering/BE/Sem8/DS/Lab/Lab7/emails.csv')

In [4]:
emails.head()

Unnamed: 0,text,spam
0,Subject: naturally irresistible your corporate...,1
1,Subject: the stock trading gunslinger fanny i...,1
2,Subject: unbelievable new homes made easy im ...,1
3,Subject: 4 color printing special request add...,1
4,"Subject: do not have money , get software cds ...",1


###How many emails are in the dataset?

In [5]:
len(emails)

5728

###How many of the emails are spam?

In [6]:
emails['spam'].value_counts()

0    4360
1    1368
Name: spam, dtype: int64
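
As a quick check on these counts, the class shares can be computed directly. The sketch below hard-codes the two counts from the output above; the variable names are illustrative only. The ham share is also the accuracy of a trivial filter that labels every email as ham, which gives a baseline for the models built later.

```python
# Counts copied from the value_counts() output above.
n_ham, n_spam = 4360, 1368
n_total = n_ham + n_spam

spam_share = n_spam / n_total
ham_share = n_ham / n_total  # accuracy of always predicting "ham"

print(f"total={n_total}, spam={spam_share:.3f}, ham={ham_share:.3f}")
```

This gives roughly 24% spam and 76% ham, in line with the 75/25 mix described in the introduction.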

###Could a spam classifier potentially benefit from including the frequency of the word that appears in every email?
            Yes -- the number of times the word appears might help us differentiate spam from ham.
            No -- the word appears in every email so this variable would not help us differentiate spam from ham.

###Answer:
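Yes. In the rows shown above, every email begins with the word "Subject", so its mere presence carries no information. The number of occurrences can still differ from one email to the next, and that count might help differentiate spam from ham.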
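
To illustrate, the sketch below counts one word in a few made-up messages. The messages and the helper name `count_word` are hypothetical and not taken from the dataset; the point is only that a word present in every email can still occur a different number of times in each.

```python
import re

def count_word(text, word):
    """Count whole-word, case-insensitive occurrences of `word` in `text`."""
    return len(re.findall(rf"\b{re.escape(word)}\b", text.lower()))

# Made-up examples; every message contains "subject" at least once.
messages = [
    "Subject: cheap software",
    "Subject: re: meeting notes subject: fwd: meeting notes",
    "Subject: research update",
]
counts = [count_word(m, "subject") for m in messages]
print(counts)
```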

###How many characters are in the longest email in the dataset (where longest is measured in terms of the maximum number of characters)? The original question refers to R's nchar() function; in Python, len() gives the same character count.

In [7]:
max([len(i) for i in emails['text']])

43952

In [8]:
# Total characters if all emails were joined with one space between consecutive emails
sum([len(i) for i in emails['text']]) + len(emails) - 1

8922898
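
The second cell adds `len(emails) - 1` to the summed lengths, which matches the length of all emails joined with a single space between consecutive messages. A small sketch with made-up strings (the list `texts` is hypothetical) shows the identity:

```python
# Made-up stand-ins for emails['text'].
texts = ["subject: hello", "subject: free offer", "subject: meeting at noon"]

longest = max(len(t) for t in texts)
total_with_separators = sum(len(t) for t in texts) + len(texts) - 1

# Joining n strings with one space inserts n - 1 separators.
joined_length = len(" ".join(texts))
print(longest, total_with_separators, joined_length)
```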

##Problem 2.1 – Preparing the Corpus

Here the corpus is the 'text' column, kept as one document per email (5,728 documents in total).

###1) Build a new corpus variable called corpus.

In [9]:
corpus = emails["text"]
corpus
len(corpus)

5728

In [10]:
# corpus = corpus.split()
# corpus[:10]

###2) Convert the text to lowercase.

In [11]:
corpus = list(map(lambda x: x.lower(), corpus))
len(corpus)

5728

###3) Remove all punctuation from the corpus.

In [12]:
# Keep only word characters (letters, digits, underscore) and whitespace
corpus = [re.sub(r'[^\w\s]', '', i) for i in corpus]
len(corpus)

5728
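
The pattern `[^\w\s]` removes every character that is neither a word character nor whitespace. Since `\w` matches letters, digits, and the underscore, underscores survive this step, which is consistent with the runs of `_` visible in the outputs further down. A minimal sketch on a made-up string:

```python
import re

sample = "win $100 now!!! visit_us at http://example.com"  # made-up text
cleaned = re.sub(r"[^\w\s]", "", sample)
print(cleaned)
```

If underscores should be removed as well, a pattern such as `[^\w\s]|_` is one option.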

###4) Remove all English stopwords from the corpus.

In [13]:
stopwords_file = open("/content/drive/MyDrive/Engineering/BE/Sem8/DS/Lab/Lab7/stopwords.txt","r+")
stopwords_file

<_io.TextIOWrapper name='/content/drive/MyDrive/Engineering/BE/Sem8/DS/Lab/Lab7/stopwords.txt' mode='r+' encoding='UTF-8'>

In [14]:
stopwords = stopwords_file.read()
stopwords_file.close()
stopwords

'sw = c("i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours", "yourself", "yourselves", "he", "him", "his", "himself", "she", "her", "hers", "herself", "it", "its", "itself", "they", "them", "their", "theirs", "themselves", "what", "which", "who", "whom", "this", "that", "these", "those", "am", "is", "are", "was", "were", "be", "been", "being", "have", "has", "had", "having", "do", "does", "did", "doing", "would", "should", "could", "ought", "i\'m", "you\'re", "he\'s", "she\'s", "it\'s", "we\'re", "they\'re", "i\'ve", "you\'ve", "we\'ve", "they\'ve", "i\'d", "you\'d", "he\'d", "she\'d", "we\'d", "they\'d", "i\'ll", "you\'ll", "he\'ll", "she\'ll", "we\'ll", "they\'ll", "isn\'t", "aren\'t", "wasn\'t", "weren\'t", "hasn\'t", "haven\'t", "hadn\'t", "doesn\'t", "don\'t", "didn\'t", "won\'t", "wouldn\'t", "shan\'t", "shouldn\'t", "can\'t", "cannot", "couldn\'t", "mustn\'t", "let\'s", "that\'s", "who\'s", "what\'s", "here\'s", "there\'s", "when\'s", "where\'s", 

In [15]:
# Drop the leading 'sw = c(' (7 characters) and the final character of the file contents
stopwords = stopwords[7:-1]
stopwords

'"i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours", "yourself", "yourselves", "he", "him", "his", "himself", "she", "her", "hers", "herself", "it", "its", "itself", "they", "them", "their", "theirs", "themselves", "what", "which", "who", "whom", "this", "that", "these", "those", "am", "is", "are", "was", "were", "be", "been", "being", "have", "has", "had", "having", "do", "does", "did", "doing", "would", "should", "could", "ought", "i\'m", "you\'re", "he\'s", "she\'s", "it\'s", "we\'re", "they\'re", "i\'ve", "you\'ve", "we\'ve", "they\'ve", "i\'d", "you\'d", "he\'d", "she\'d", "we\'d", "they\'d", "i\'ll", "you\'ll", "he\'ll", "she\'ll", "we\'ll", "they\'ll", "isn\'t", "aren\'t", "wasn\'t", "weren\'t", "hasn\'t", "haven\'t", "hadn\'t", "doesn\'t", "don\'t", "didn\'t", "won\'t", "wouldn\'t", "shan\'t", "shouldn\'t", "can\'t", "cannot", "couldn\'t", "mustn\'t", "let\'s", "that\'s", "who\'s", "what\'s", "here\'s", "there\'s", "when\'s", "where\'s", "why\'s

In [16]:
stopwords = re.split('"|, ',stopwords)
stopwords

['',
 'i',
 '',
 '',
 'me',
 '',
 '',
 'my',
 '',
 '',
 'myself',
 '',
 '',
 'we',
 '',
 '',
 'our',
 '',
 '',
 'ours',
 '',
 '',
 'ourselves',
 '',
 '',
 'you',
 '',
 '',
 'your',
 '',
 '',
 'yours',
 '',
 '',
 'yourself',
 '',
 '',
 'yourselves',
 '',
 '',
 'he',
 '',
 '',
 'him',
 '',
 '',
 'his',
 '',
 '',
 'himself',
 '',
 '',
 'she',
 '',
 '',
 'her',
 '',
 '',
 'hers',
 '',
 '',
 'herself',
 '',
 '',
 'it',
 '',
 '',
 'its',
 '',
 '',
 'itself',
 '',
 '',
 'they',
 '',
 '',
 'them',
 '',
 '',
 'their',
 '',
 '',
 'theirs',
 '',
 '',
 'themselves',
 '',
 '',
 'what',
 '',
 '',
 'which',
 '',
 '',
 'who',
 '',
 '',
 'whom',
 '',
 '',
 'this',
 '',
 '',
 'that',
 '',
 '',
 'these',
 '',
 '',
 'those',
 '',
 '',
 'am',
 '',
 '',
 'is',
 '',
 '',
 'are',
 '',
 '',
 'was',
 '',
 '',
 'were',
 '',
 '',
 'be',
 '',
 '',
 'been',
 '',
 '',
 'being',
 '',
 '',
 'have',
 '',
 '',
 'has',
 '',
 '',
 'had',
 '',
 '',
 'having',
 '',
 '',
 'do',
 '',
 '',
 'does',
 '',
 '',
 'did',
 '',
 '',


In [17]:
stopwords = [i for i in stopwords if i]
stopwords

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 'her',
 'hers',
 'herself',
 'it',
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'would',
 'should',
 'could',
 'ought',
 "i'm",
 "you're",
 "he's",
 "she's",
 "it's",
 "we're",
 "they're",
 "i've",
 "you've",
 "we've",
 "they've",
 "i'd",
 "you'd",
 "he'd",
 "she'd",
 "we'd",
 "they'd",
 "i'll",
 "you'll",
 "he'll",
 "she'll",
 "we'll",
 "they'll",
 "isn't",
 "aren't",
 "wasn't",
 "weren't",
 "hasn't",
 "haven't",
 "hadn't",
 "doesn't",
 "don't",
 "didn't",
 "won't",
 "wouldn't",
 "shan't",
 "shouldn't",
 "can't",
 'cannot',
 "couldn't",
 "mustn't",
 "let's",
 "that's",
 "who's",
 "what'

In [18]:
# corpus_without_sw = [word for word in corpus if not word in stopwords]
# corpus_without_sw
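
The stopword file holds an R-style vector, `sw = c("i", "me", ...)`, so cells 15 to 17 slice off the wrapper, split on quotes and commas, and then drop the empty strings. As an alternative sketch (not the approach used above), `re.findall` can pull out the quoted entries in one step. The string `raw` below is a shortened, made-up stand-in for the file contents.

```python
import re

raw = 'sw = c("i", "me", "don\'t", "cannot")'  # shortened stand-in
stopword_list = re.findall(r'"([^"]*)"', raw)  # text between each pair of quotes
stopword_set = set(stopword_list)              # sets give fast membership tests
print(stopword_list)
```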

In [19]:
corpus_without_sw = []
stopword_set = set(stopwords)  # set membership is faster than scanning a list
for sentence in corpus:
  corpus_without_sw.append(' '.join([word for word in sentence.split() if word not in stopword_set]))
corpus_without_sw

['subject naturally irresistible corporate identity lt really hard recollect company market full suqgestions information isoverwhelminq good catchy logo stylish statlonery outstanding website will make task much easier promise havinq ordered iogo company will automaticaily become world ieader isguite ciear without good products effective business organization practicable aim will hotat nowadays market promise marketing efforts will become much effective list clear benefits creativeness hand made original logos specially done reflect distinctive company image convenience logo stationery provided formats easy use content management system letsyou change website content even structure promptness will see logo drafts within three business days affordability marketing break shouldn t make gaps budget 100 satisfaction guaranteed provide unlimited amount changes extra fees surethat will love result collaboration look portfolio _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

In [20]:
corpus = corpus_without_sw
len(corpus)

5728
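
One side effect of the step order is worth noting: punctuation was stripped before stopwords were removed, so list entries that contain an apostrophe (such as "don't") can no longer match any token. The sketch below uses a made-up sentence and a three-word stopword set to show the effect; both are hypothetical.

```python
import re

stop = {"i", "the", "don't"}      # tiny made-up stopword set
text = "i don't like the offer"   # made-up, already lowercase

def remove_stopwords(s):
    """Drop whitespace-separated tokens that appear in `stop`."""
    return " ".join(w for w in s.split() if w not in stop)

def strip_punctuation(s):
    return re.sub(r"[^\w\s]", "", s)

punct_first = remove_stopwords(strip_punctuation(text))  # order used in this lab
stop_first = strip_punctuation(remove_stopwords(text))   # alternative order
print(punct_first, "|", stop_first)
```

In this dataset the emails appear to be pre-tokenized (the "shouldn t" in the output above suggests the raw text had spaces around the apostrophe), so the contraction entries would probably not match in either order.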

###5) Stem the words in the corpus

In [21]:
porter = PorterStemmer()
corpus_stemmed = []
for sentence in corpus:
  corpus_stemmed.append(' '.join([porter.stem(word) for word in sentence.split()]))

corpus_stemmed

['subject natur irresist corpor ident lt realli hard recollect compani market full suqgest inform isoverwhelminq good catchi logo stylish statloneri outstand websit will make task much easier promis havinq order iogo compani will automaticaili becom world ieader isguit ciear without good product effect busi organ practic aim will hotat nowaday market promis market effort will becom much effect list clear benefit creativ hand made origin logo special done reflect distinct compani imag conveni logo stationeri provid format easi use content manag system letsyou chang websit content even structur prompt will see logo draft within three busi day afford market break shouldn t make gap budget 100 satisfact guarante provid unlimit amount chang extra fee surethat will love result collabor look portfolio _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ interest _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

In [22]:
corpus = corpus_stemmed
len(corpus)

5728

In [23]:
emails['text'] = corpus

###6) Build a document term matrix from the corpus, called dtm

In [24]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(emails['text'], emails['spam'], test_size=0.33, random_state=42)
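
With `test_size=0.33` and 5,728 emails, the split sizes can be worked out by hand. To my understanding, scikit-learn rounds the test share up and gives the remainder to the training set; the sketch below reproduces that arithmetic without calling the library.

```python
import math

n_samples = 5728   # from len(emails)
test_size = 0.33

n_test = math.ceil(test_size * n_samples)  # test share rounded up
n_train = n_samples - n_test               # remainder goes to training
print(n_train, n_test)
```

These values agree with the row counts of `X_train` and `X_test` reported further down.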

In [25]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfvec = TfidfVectorizer()
X_train = tfvec.fit_transform(X_train)
X_test = tfvec.transform(X_test)

In [26]:
X_train.shape

(3837, 24100)

In [27]:
X_test.shape

(1891, 24100)
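
Both matrices have 24,100 columns because the vocabulary is learned from the training emails only and then reused for the test emails. The pure-Python sketch below mimics that on made-up documents. It assumes the scikit-learn defaults as I understand them (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and L2-normalised rows) and uses plain whitespace tokenisation in place of `TfidfVectorizer`'s token pattern, so it is an illustration and not a drop-in replacement. All names are hypothetical.

```python
import math
from collections import Counter

# Made-up documents, already lowercased and stemmed.
train_docs = ["free logo offer", "meet research model", "free offer free"]
test_docs = ["free model unseen"]

# Vocabulary and document frequencies come from the training documents only.
vocab = sorted({tok for doc in train_docs for tok in doc.split()})
n_docs = len(train_docs)
doc_freq = Counter(tok for doc in train_docs for tok in set(doc.split()))
idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1 for tok in vocab}

def tfidf_row(doc):
    """Return one L2-normalised TF-IDF row over the training vocabulary."""
    counts = Counter(doc.split())
    raw = [counts[tok] * idf[tok] for tok in vocab]  # tokens outside vocab are ignored
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm if norm else 0.0 for v in raw]

train_matrix = [tfidf_row(d) for d in train_docs]
test_matrix = [tfidf_row(d) for d in test_docs]
print(len(vocab), len(train_matrix[0]), len(test_matrix[0]))
```

The word "unseen" occurs only in the test document, so it gets no column, which is the same reason the test matrix here has exactly as many columns as the training matrix.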

In [28]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
dtm = pd.DataFrame(X_train.toarray(),columns = tfvec.get_feature_names_out())
dtm



Unnamed: 0,00,000,0000,000000,00000000,000000000003619,000000000003997,000000000005168,000000000005409,000000000005411,...,zwrocic,zwwyw,zwzm,zxghlajf,zyc,zygoma,zymg,zzn,zzncacst,zzzz
0,0.00000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.00000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.00000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.03293,0.0,0.070332,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.00000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3832,0.00000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3833,0.00000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3834,0.00000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3835,0.00000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


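####How many terms are in dtm?

###Answer:
dtm has 24,100 terms (columns), as shown by X_train.shape above. Because the vectorizer was fitted on the training split only, this is the vocabulary of the 3,837 training emails.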

##Problem 3.1 – CART and Random Forest Models

###1) CART

In [37]:
from sklearn import tree
clf = tree.DecisionTreeClassifier(random_state=111)

In [38]:
clf = clf.fit(X_train, y_train)

In [39]:
clf.score(X_test,y_test)

0.9592808038075092
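
`clf.score` returns the mean accuracy, that is, the fraction of test emails whose predicted label equals the true label. For a spam filter it can also help to separate the two kinds of error, since flagging a ham email as spam is usually more costly than letting a spam email through. The labels below are made up for illustration and are not model output.

```python
# Made-up labels: 1 = spam, 0 = ham.
y_true = [0, 0, 0, 0, 1, 1, 0, 1, 0, 0]
y_pred = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]

pairs = list(zip(y_true, y_pred))
accuracy = sum(t == p for t, p in pairs) / len(pairs)
false_positives = sum(t == 0 and p == 1 for t, p in pairs)  # ham flagged as spam
false_negatives = sum(t == 1 and p == 0 for t, p in pairs)  # spam let through
print(accuracy, false_positives, false_negatives)
```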

###2) Random Forest

In [32]:
from sklearn.ensemble import RandomForestClassifier

In [33]:
scores = []
for i in range(200):
  # Refit with a different seed each time to see how test accuracy varies
  clf_RF = RandomForestClassifier(oob_score=True,random_state=i)
  clf_RF.fit(X_train, y_train)
  score = clf_RF.score(X_test,y_test)  # compute once and reuse
  scores.append(score)
  print(i,score)

0 0.973030142781597
1 0.9714436805922793
2 0.9687995769434162
3 0.973030142781597
4 0.9682707562136436
5 0.967741935483871
6 0.9719725013220518
7 0.9735589635113696
8 0.9661554732945531
9 0.970386039132734
10 0.9693283976731888
11 0.970386039132734
12 0.967741935483871
13 0.970386039132734
14 0.9714436805922793
15 0.967741935483871
16 0.970386039132734
17 0.9740877842411423
18 0.9672131147540983
19 0.9635113696456901
20 0.9698572184029614
21 0.9687995769434162
22 0.9698572184029614
23 0.9693283976731888
24 0.9709148598625066
25 0.973030142781597
26 0.9682707562136436
27 0.967741935483871
28 0.9672131147540983
29 0.9666842940243258
30 0.9661554732945531
31 0.9693283976731888
32 0.9687995769434162
33 0.9719725013220518
34 0.9725013220518244
35 0.9682707562136436
36 0.9709148598625066
37 0.9672131147540983
38 0.9698572184029614
39 0.9698572184029614
40 0.9751454257006875
41 0.9687995769434162
42 0.9709148598625066
43 0.9746166049709148
44 0.9682707562136436
45 0.9693283976731888
46 0.9693

In [36]:
print(scores.index(max(scores)),max(scores))

111 0.9767318878900053
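
Because the reported maximum is the best of 200 seeds chosen by test accuracy, it is likely to be somewhat optimistic; the average over seeds is a steadier summary. The sketch below uses only the first ten scores printed above (seeds 0 to 9), so its mean approximates the full-run mean and is not the exact value.

```python
from statistics import mean

# First ten Random Forest test accuracies printed above (seeds 0-9).
rf_scores = [
    0.973030142781597, 0.9714436805922793, 0.9687995769434162,
    0.973030142781597, 0.9682707562136436, 0.967741935483871,
    0.9719725013220518, 0.9735589635113696, 0.9661554732945531,
    0.970386039132734,
]
cart_score = 0.9592808038075092  # decision tree test accuracy from above
best_rf = 0.9767318878900053     # best of 200 seeds (random_state=111)

mean_rf = mean(rf_scores)
print(f"mean RF (10 seeds): {mean_rf:.4f}")
print(f"mean gain over CART: {mean_rf - cart_score:.4f}")
print(f"best-seed gain over CART: {best_rf - cart_score:.4f}")
```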


###Conclusion:
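I was able to carry out text analysis by building a corpus (lowercasing, removing punctuation, removing stopwords, and stemming the words). I then compared a CART model with a Random Forest model. On the test set, CART reached an accuracy of 0.9593, while the Random Forest (ensemble model) scored around 0.97 for most seeds and 0.9767 for the best seed (random_state=111). That is an improvement of about 1 percentage point in the typical case and 1.7 points for the best seed. Since the best seed was picked using test accuracy, the 0.9767 figure is probably slightly optimistic.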
