# A1: 

### A1: Based on my experience with a previous NLP project that classified job descriptions, the first and most important step in this project is labeling the data, i.e., finding the phishing emails in the data. The volume of email is huge, because it covers 100 companies over a 12-month span. If customers have reported lists of phishing emails, labeling is much easier. If not, I would start by selecting all emails related to money transfers. This can be done with a simple keyword match against a list such as ['money', 'transfer', 'checking', 'account', 'bank', ...]. After selecting these emails, I would group them by sender name and focus on the senders with multiple email addresses. This should narrow the emails down to a much smaller set that a human can check one by one to find the phishing emails.
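The keyword filter and sender grouping described above could look roughly like the sketch below. It uses only the standard library. The record keys (`sender_name`, `sender_email`, `body`), the sample emails, and the short keyword set are assumptions made for illustration; the real keyword list is open-ended and the real data layout is unknown.

```python
import re
from collections import defaultdict

# Hypothetical keyword set; the list in the text above is open-ended.
KEYWORDS = {"money", "transfer", "checking", "account", "bank"}

def mentions_money(text):
    """Return True if any keyword appears as a whole word in the text."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return not KEYWORDS.isdisjoint(tokens)

def senders_with_multiple_addresses(emails):
    """Group money-related emails by sender name and keep names seen with 2+ addresses.

    `emails` is an iterable of dicts with the assumed keys
    'sender_name', 'sender_email', and 'body'.
    """
    addresses = defaultdict(set)
    for mail in emails:
        if mentions_money(mail["body"]):
            name = mail["sender_name"].strip().lower()
            addresses[name].add(mail["sender_email"].lower())
    return {name: sorted(addrs) for name, addrs in addresses.items() if len(addrs) > 1}

# Made-up sample records, for illustration only.
emails = [
    {"sender_name": "Ann Lee", "sender_email": "ann@corp.example",
     "body": "Please review the bank transfer."},
    {"sender_name": "Ann Lee", "sender_email": "ann.lee@c0rp.example",
     "body": "Urgent: transfer money today."},
    {"sender_name": "Bob Ray", "sender_email": "bob@corp.example",
     "body": "Lunch on Friday?"},
]
suspects = senders_with_multiple_addresses(emails)
print(suspects)
```

The returned dictionary is the short list a human reviewer would go through by hand.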

### To identify phishing emails in real time: 1) We can warn the receiver of a potential risk whenever a sender starts using a new email address. 2) We can compile a labeled sample of normal communication, money transfer emails, and the phishing emails found above, and use all the information in each email as features (the header, the content, the sender IP, the sender email address, etc.) to train a classification model. Start with a logistic regression model and increase the model complexity if needed. Precision is the metric we want to maximize, because the ground truth cannot be reliably determined and recall is therefore less meaningful. We can then warn receivers when the predicted probability that an incoming email is fraudulent exceeds a specific threshold. 3) We can calculate the similarity between the content of an incoming email and previous emails from the same sender, and trigger a warning if the similarity falls below a specific threshold. A low similarity reflects a change in writing style that should be flagged.
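As a rough illustration of idea 3), the sketch below compares an incoming email with a sender's earlier emails using bag-of-words cosine similarity. The function names, the threshold of 0.2, the sample messages, and the choice to warn when a sender has no history are all assumptions for illustration, not tuned or validated values.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two token-count dictionaries."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def style_warning(incoming, history, threshold=0.2):
    """Return True when the incoming email resembles none of the sender's earlier emails."""
    if not history:
        return True  # assumed policy: nothing to compare against, so warn
    incoming_vec = bag_of_words(incoming)
    best = max(cosine(incoming_vec, bag_of_words(old)) for old in history)
    return best < threshold

# Made-up messages, for illustration only.
history = [
    "Here is the weekly sales report for the team",
    "Weekly sales report attached, see numbers",
]
print(style_warning("Weekly sales report for this week", history))
print(style_warning("Wire 5000 dollars to this account immediately", history))
```

Raw word counts are the simplest possible representation; TF-IDF weights or stylometric features would be natural next steps if this baseline proves too noisy.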

### With the above screening, the majority of phishing emails should be identified.

In [1]:
import pandas as pd
%matplotlib inline
import glob
import re

# A2.1

# Combine all the comments in the training and testing samples into one overall file each. There are line breaks in the comments; let's remove them at this step.

In [None]:
train_unsup = glob.glob("train/unsup/*.txt")

# Collect the reviews in a list first: DataFrame.append was removed in pandas 2.0,
# and appending row by row copies the whole frame on every iteration.
rows = []
for file in train_unsup:
    with open(file, "r") as f:
        comments = f.read()
    # Strip the HTML line breaks embedded in the reviews.
    comments = comments.replace('<br /><br />', '')
    rows.append({'description': comments})
df_train_unsup = pd.DataFrame(rows, columns=['description'])
print(df_train_unsup.head())
print(df_train_unsup.tail())
df_train_unsup.to_csv('all_train_unsup.csv', index=False)

In [2]:
df_train_unsup = pd.read_csv('all_train_unsup.csv')

## 1.1. Stop words

In [3]:
## You might need to run the following two lines to download stopwords for NLTK
# import nltk
# nltk.download('stopwords')

In [4]:
from nltk.corpus import stopwords

In [5]:
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

## 1.2 Stemming

In [6]:
from nltk.stem.snowball import EnglishStemmer

In [7]:
stemmer = EnglishStemmer(ignore_stopwords=False)

## 1.3 CountVectorizer

In [8]:
from sklearn.feature_extraction.text import CountVectorizer

## Build a stem_tokenizer function

In [10]:
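import re

# Build the stemmer once instead of on every call.
# ignore_stopwords=True returns stop words unchanged instead of stemming them.
stop_aware_stemmer = EnglishStemmer(ignore_stopwords=True)

def stem_tokenizer(text):
    # Replace everything except letters, digits, and hyphens with spaces, then lowercase and split.
    words = re.sub(r"[^A-Za-z0-9\-]", " ", text).lower().split()
    return [stop_aware_stemmer.stem(word) for word in words]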

print(stem_tokenizer('I am a good man. I love the world. Really'))

['i', 'am', 'a', 'good', 'man', 'i', 'love', 'the', 'world', 'realli']


In [11]:
# The first model uses CountVectorizer with stop words removed.
# I also tried CountVectorizer without removing stop words; that result is listed at the end of the answer.
cv = CountVectorizer(stop_words=stopwords.words('english'),
                     tokenizer=stem_tokenizer,
                     lowercase=True,
                     max_df=1.0,
                     min_df=1
                    )

### Fit vectorizer using texts

In [12]:
texts = df_train_unsup['description'].tolist()

In [13]:
cv.fit(texts)



CountVectorizer(stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours',
                            'ourselves', 'you', "you're", "you've", "you'll",
                            "you'd", 'your', 'yours', 'yourself', 'yourselves',
                            'he', 'him', 'his', 'himself', 'she', "she's",
                            'her', 'hers', 'herself', 'it', "it's", 'its',
                            'itself', ...],
                tokenizer=<function stem_tokenizer at 0x130c46af0>)

### Vocabulary of the vectorizer

In [14]:
cv.vocabulary_

{'newspaperman': 64701,
 'johnni': 49458,
 'twenni': 98078,
 'live': 54799,
 '90': 3445,
 'complet': 21364,
 '20': 2165,
 'person': 71341,
 'lifestyl': 54296,
 '-': 0,
 'fedora': 33648,
 'manual': 57508,
 'typewrit': 98318,
 'charleston': 18519,
 'work': 105245,
 'great': 40147,
 'idea': 46066,
 'movi': 62293,
 'done': 27683,
 'better': 12055,
 'miss': 60946,
 'clich': 20122,
 'never': 64519,
 'use': 100582,
 'one': 67918,
 'twice': 98118,
 'find': 34508,
 'anticip': 7259,
 'reaction': 77473,
 'harsher': 42299,
 '90s': 3469,
 'world': 105307,
 'goe': 39234,
 'along': 5939,
 'often': 67373,
 'guess': 40730,
 'right': 79204,
 'make': 56892,
 'much': 62617,
 'fun': 36881,
 'lot': 55510,
 'call': 16477,
 'save': 81799,
 'damsel': 24282,
 'distress': 27179,
 'name': 63617,
 'virginia': 101861,
 'natch': 63824,
 'three': 95111,
 'differ': 26387,
 'occas': 67020,
 'respond': 78676,
 'appropri': 7635,
 'flutter': 35352,
 'eyelid': 32573,
 'time': 95535,
 'independ': 46924,
 'women': 105025,
 '

In [15]:
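import numpy as np

#vocabulary = [[item,cv.vocabulary_[item]] for item in cv.vocabulary_.keys()]
cv_fit = cv.fit_transform(texts)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() (1.0+) replaces it.
words = cv.get_feature_names_out()
# Sum the columns on the sparse matrix directly; converting it to a dense array first needs far more memory.
counts = np.asarray(cv_fit.sum(axis=0)).ravel()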

# <font color='red'>This is the Answer! </font> 

In [16]:
# This is the Answer
n = len(words)
words_counts = []
for i in range(n):
    words_counts.append([words[i],counts[i]])
words_counts.sort(key = lambda x:-x[1])
print(words_counts[0:100])

[['movi', 102124], ['film', 96351], ['one', 54915], ['like', 44760], ['time', 30888], ['good', 29618], ['make', 29574], ['charact', 28848], ['get', 28050], ['see', 27929], ['watch', 27151], ['stori', 25469], ['even', 25186], ['would', 24518], ['realli', 23165], ['scene', 21571], ['bad', 20045], ['well', 19906], ['look', 19707], ['much', 19272], ['end', 19132], ['show', 18964], ['great', 18895], ['peopl', 18737], ['go', 18474], ['also', 18332], ['-', 18269], ['first', 17921], ['love', 17796], ['way', 17551], ['play', 17421], ['think', 17266], ['act', 17097], ['thing', 16457], ['could', 16000], ['made', 15710], ['seem', 15243], ['know', 14951], ['say', 14368], ['plot', 13815], ['two', 13593], ['work', 13436], ['mani', 13358], ['take', 13346], ['come', 13260], ['want', 13242], ['actor', 13232], ['never', 12960], ['tri', 12942], ['seen', 12895], ['littl', 12764], ['best', 12534], ['year', 12514], ['life', 12161], ['better', 11756], ['ever', 11735], ['give', 11678], ['man', 11154], ['find',

# Below are the top 100 most frequent words without excluding stop words, kept for archive and comparison. Stop words do appear extremely frequently in the reviews we study. The result that excludes stop words makes more sense to me, because most stop words have no distinguishing power and thus do not help.


[['the', 676672], ['and', 328836], ['a', 327346], ['of', 293124], ['to', 270470], ['is', 214246], ['it', 190771], ['in', 187552], ['i', 170425], ['this', 150470], ['that', 146176], ['s', 127658], ['movi', 102124], ['film', 96351], ['was', 95903], ['as', 92822], ['with', 89605], ['for', 88889], ['but', 83974], ['you', 69345], ['t', 68410], ['on', 66797], ['not', 59860], ['are', 59752], ['he', 59743], ['his', 58520], ['have', 55808], ['one', 54915], ['be', 54515], ['they', 46921], ['at', 46441], ['all', 46227], ['by', 45455], ['like', 44760], ['who', 43614], ['an', 43485], ['from', 41850], ['so', 40467], ['there', 38363], ['her', 36112], ['or', 36096], ['just', 35725], ['about', 34683], ['has', 33951], ['if', 33607], ['out', 33218], ['what', 32260], ['some', 32005], ['time', 30888], ['good', 29618], ['make', 29574], ['can', 29319], ['charact', 28848], ['when', 28383], ['more', 28370], ['get', 28050], ['see', 27929], ['very', 27722], ['she', 27423], ['watch', 27151], ['up', 25639], ['stori', 25469], ['no', 25278], ['even', 25186], ['would', 24518], ['their', 24195], ['my', 23893], ['which', 23708], ['only', 23384], ['realli', 23165], ['had', 21791], ['scene', 21571], ['other', 21425], ['were', 21424], ['we', 20997], ['me', 20991], ['bad', 20045], ['well', 19906], ['look', 19707], ['than', 19673], ['most', 19537], ['much', 19272], ['end', 19132], ['show', 18964], ['great', 18895], ['will', 18760], ['peopl', 18737], ['been', 18565], ['go', 18474], ['into', 18357], ['also', 18332], ['-', 18269], ['do', 18145], ['first', 17921], ['him', 17850], ['because', 17846], ['love', 17796], ['how', 17608], ['don', 17574], ['way', 17551]]
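The comparison above can be reproduced in miniature with the standard library alone. The sketch below counts tokens on a toy corpus with and without a stop word filter. The two sentences and the tiny stop word set are made up for illustration; the notebook itself uses NLTK's English list and stemming, which this sketch omits.

```python
import re
from collections import Counter

# Tiny illustrative stop word set (an assumption; the notebook uses NLTK's English list).
STOP_WORDS = {"the", "a", "is", "and", "it", "this", "of"}

def top_words(docs, n, stop_words=frozenset()):
    """Return the n most frequent tokens across docs, skipping any stop words."""
    counts = Counter()
    for doc in docs:
        tokens = re.findall(r"[a-z0-9\-]+", doc.lower())
        counts.update(t for t in tokens if t not in stop_words)
    return counts.most_common(n)

docs = [
    "The movie is good and the plot is good",
    "This movie is a waste of the evening",
]
print(top_words(docs, 3))              # ranking dominated by function words
print(top_words(docs, 3, STOP_WORDS))  # ranking of content words only
```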

# A2.2

In [24]:
with open('train/neg/3316_2.txt', "r") as f:
    comments = f.read()
cv_one_fit = cv.transform([comments])
print(cv_one_fit.toarray().sum())
print(cv_fit[0].toarray().sum())
print(comments)

96
110
This was the first Ewan McGregor movie I ever saw outside of Star Wars. Since then I have become a very big Ewan McGregor fan but I still can't bring myself to forgive this movie's existence.<br /><br />My sister has always been a huge Jane Austen fan and because of that, I have been subjected to various of the classics, Emma being one of them. I've always considered them irritating, stupid and boring. However, after watching this terrible rendition, I was forced to admit that the original Emma was delightful and charming. Ewan McGregor scarcely serves a purpose in this film after they hacked and mutilated the part of Frank Churchill. Gweneth Paltrow is ridiculous in an already ridiculous character and the rest of the film is unremarkable and stupid.<br /><br />My recommendation to anybody who is remotely interested in English period drama... go see the originals. If you're a Ewan McGregor fan... believe me, by skipping this film, you haven't missed anything but five minutes col

In [19]:
# scipy cdist is faster than sklearn.metrics.pairwise.cosine_similarity when computing cosine similarity
from scipy.spatial.distance import cdist
#from sklearn.metrics.pairwise import cosine_similarity
texts_scores = []
for i in range(len(texts)):
    sim = 1. - cdist(cv_one_fit.toarray(), cv_fit[i].toarray(), 'cosine')
    texts_scores.append([sim,texts[i]])

texts_scores.sort(key = lambda x:-x[0])



[[array([[0.29978002]]), "Avoid this film at all costs! This is just a very very cheap remake of the french movie by the same name. Being a big fan of that Taxi, I was intrigued to see what the Yanks had done with it. Watching it was painful, the only reason I watched till the end was to see how low they were going to go with this film. Queen Latifa's performance is the only redeeming aspect of this film, who does her best with an unoriginal, crass and cheap script. The best scenes are direct rip-offs of the original film and Jimmy Fallon is wasted on a one dimensional ridiculous character.Go see the original french film and give this one as wide a berth as you can!"], [array([[0.29297103]]), 'For the B movie fans, there are films to watch, and there are "films" to watch. This movie falls in the later. Having become a b movie fan earlier this year, I made it my goal to watch a majority of the "bad" films that exist. This not included garbage made today, I\'m talking about 90\'s and ear
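The per-document loop above converts each row to a dense array before calling `cdist`. A possible alternative, sketched below under the assumption that NumPy and SciPy are available, computes all similarities in one sparse matrix product. The helper name `cosine_to_all` and the toy matrices are hypothetical and not part of the notebook.

```python
import numpy as np
from scipy import sparse

def cosine_to_all(query_row, matrix):
    """Cosine similarity between one sparse row (1 x V) and every row of a sparse matrix (N x V)."""
    matrix = sparse.csr_matrix(matrix, dtype=np.float64)
    query_row = sparse.csr_matrix(query_row, dtype=np.float64)
    # Euclidean norm of every row, computed without densifying the matrix.
    row_norms = np.sqrt(np.asarray(matrix.multiply(matrix).sum(axis=1)).ravel())
    query_norm = np.sqrt(query_row.multiply(query_row).sum())
    # Dot product of every row with the query, as a flat array of length N.
    dots = (matrix @ query_row.T).toarray().ravel()
    denom = row_norms * query_norm
    # Rows with zero norm get similarity 0 instead of a division error.
    return np.divide(dots, denom, out=np.zeros_like(dots), where=denom > 0)

# Toy count matrix: three documents over a three-word vocabulary.
counts = sparse.csr_matrix(np.array([[1, 0, 2], [0, 3, 0], [2, 0, 4]]))
query = sparse.csr_matrix(np.array([[1, 0, 2]]))
sims = cosine_to_all(query, counts)
ranking = np.argsort(-sims)  # document indices, most similar first
print(sims, ranking)
```

In the notebook this would be called as `cosine_to_all(cv_one_fit, cv_fit)`, and the resulting index order could be used to look up `texts`. The scores come back as plain floats rather than 1x1 arrays.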

In [20]:
for i in range(10):
    print(texts_scores[i],'\n')

[array([[0.29978002]]), "Avoid this film at all costs! This is just a very very cheap remake of the french movie by the same name. Being a big fan of that Taxi, I was intrigued to see what the Yanks had done with it. Watching it was painful, the only reason I watched till the end was to see how low they were going to go with this film. Queen Latifa's performance is the only redeeming aspect of this film, who does her best with an unoriginal, crass and cheap script. The best scenes are direct rip-offs of the original film and Jimmy Fallon is wasted on a one dimensional ridiculous character.Go see the original french film and give this one as wide a berth as you can!"] 

[array([[0.29297103]]), 'For the B movie fans, there are films to watch, and there are "films" to watch. This movie falls in the later. Having become a b movie fan earlier this year, I made it my goal to watch a majority of the "bad" films that exist. This not included garbage made today, I\'m talking about 90\'s and ear

In [21]:
for i in range(100):
    print(texts_scores[i],'\n')

[array([[0.29978002]]), "Avoid this film at all costs! This is just a very very cheap remake of the french movie by the same name. Being a big fan of that Taxi, I was intrigued to see what the Yanks had done with it. Watching it was painful, the only reason I watched till the end was to see how low they were going to go with this film. Queen Latifa's performance is the only redeeming aspect of this film, who does her best with an unoriginal, crass and cheap script. The best scenes are direct rip-offs of the original film and Jimmy Fallon is wasted on a one dimensional ridiculous character.Go see the original french film and give this one as wide a berth as you can!"] 

[array([[0.29297103]]), 'For the B movie fans, there are films to watch, and there are "films" to watch. This movie falls in the later. Having become a b movie fan earlier this year, I made it my goal to watch a majority of the "bad" films that exist. This not included garbage made today, I\'m talking about 90\'s and ear

In [25]:
# Each similarity is stored as a 1x1 array; index both axes to get a plain scalar.
scores = [item[0][0, 0] for item in texts_scores]