## Spam Filtering
In this programming assignment, we will look at spam filtering with a real data set that has a "label" for every email, i.e. spam or not spam. We will use a logistic regression classifier to solve this assignment and participate in a friendly competition on Kaggle (details below). The assignment goes from data loading to data inspection, data pre-processing, creating a train/test data set, and finally doing machine learning, making predictions and evaluating them. This is typically one part of the "full pipeline" in ML modeling/prototyping, so you will get a taste of the "prototype pipeline" work that happens in practice. Have fun! And if you get stuck somewhere, use Discord; maybe someone has a suggestion that will unblock you.

The submission consists of two parts:
a) Your complete working code with train/validation data sets, plus a write-up with your insights and learnings (details below).
b) Evaluation of your best model on the Kaggle evaluation data set. For this part you can form a team of 2 to brainstorm ideas and make your best submission. Include your team name and team members in your submission.

Kaggle Starting Point for the competition: https://www.kaggle.com/t/7d2850f5b99a41fba457f2ad7acd0fca

### Team
**Team Name on Kaggle:** Qingchuan & Zihe

**Team Members:** Qingchuan Hou, Zihe Song

## Loading the data set

In [1]:
import nltk
# nltk.download('punkt')
# nltk.download('stopwords')

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

import pandas as pd
import numpy as np

In [2]:
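# on_bad_lines='skip' replaces error_bad_lines=False, which is deprecated since pandas 1.3 and removed in pandas 2.0
all_maills = pd.read_csv("email_data/all_emails.csv", sep=',', index_col=None, engine='python', on_bad_lines='skip')
test_maills = pd.read_csv("email_data/eval_students_2.csv", sep=',', index_col=None, engine='python', on_bad_lines='skip')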




In [3]:
print(all_maills.shape,test_maills.shape)

(4260, 3) (1468, 2)


In [4]:
test_maills.head()

     id  text
0  5074  Subject: prc review stinson , i am going to ...
1  2941  Subject: fwd : hello from charles shen at will...
2  2032  Subject: considered unsolicited bulk email fro...
3  3443  Subject: re : looking for someone who has expe...
4  3063  Subject: re : term project : brian , no prob...


## 1) Inspecting the data set

In [5]:
# 1. Print a few lines (i.e. each line is an email and a label) from the data_set containing spam (use a pandas functionality - e.g. getting the top lines)

print(all_maills.head())

     id                                               text  spam
0  1235  Subject: naturally irresistible your corporate...     1
1  1236  Subject: the stock trading gunslinger  fanny i...     1
2  1238  Subject: 4 color printing special  request add...     1
3  1239  Subject: do not have money , get software cds ...     1
4  1240  Subject: great nnews  hello , welcome to medzo...     1


In [6]:
# 2. Print a few lines from data_set that are not spam

print(all_maills[all_maills["spam"] == 0].head())

        id                                               text  spam
1026  2603  Subject: hello guys ,  i ' m " bugging you " f...     0
1027  2604  Subject: sacramento weather station  fyi  - - ...     0
1028  2605  Subject: from the enron india newsdesk - jan 1...     0
1029  2606  Subject: re : powerisk 2001 - your invitation ...     0
1030  2607  Subject: re : resco database and customer capt...     0


In [7]:
# 3. Print the emails between lines 4000 and 4010 in the data set (it has only 4260 rows, so lines 5000 to 5010 do not exist)
print(all_maills.iloc[4000:4011])

        id                                               text  spam
4000  6587  Subject: re : var calibration issues  we are p...     0
4001  6588  Subject: re : carnegie mellon team meeting  un...     0
4002  6590  Subject: re : statistica & lunch  rick ,  we a...     0
4003  6592  Subject: re : interview with research dept . c...     0
4004  6593  Subject: re : natural gas storage item  vince ...     0
4005  6594  Subject: revised : organizational changes  to ...     0
4006  6595  Subject: valuation methodology  we ' ve had a ...     0
4007  6596  Subject: again , i should have also sent the f...     0
4008  6597  Subject: registration materials for nfcf  to :...     0
4009  6599  Subject: latest  vince ,  i appologize for shi...     0
4010  6601  Subject: visit to enron  grant and vince ,  th...     0


## 2) Data processing step for this HW: 
Do the following for all emails in your data set: 1) tokenize into words, 2) remove stop/filler words, and 3) remove punctuation.
Below, each step is applied to the `text` column of both the labeled data set and the evaluation data set.

### Tokenizer
Apply a tokenizer to the text of each email so that it is broken down into words. We use `word_tokenize` from the NLTK library (Natural Language Toolkit) below.

In [8]:
def tokenizer(x):
    y = x.apply(lambda row: word_tokenize(row['text']), axis=1) 
    return y

all_tokenized_text = tokenizer(all_maills)
test_tokenized_text = tokenizer(test_maills)

In [9]:
# all_maills['tokenized_text'] = all_maills['text'].apply(word_tokenize)
# test_maills['tokenized_text'] = test_maills['text'].apply(word_tokenize)

In [10]:
all_tokenized_text.head()

0    [Subject, :, naturally, irresistible, your, co...
1    [Subject, :, the, stock, trading, gunslinger, ...
2    [Subject, :, 4, color, printing, special, requ...
3    [Subject, :, do, not, have, money, ,, get, sof...
4    [Subject, :, great, nnews, hello, ,, welcome, ...
dtype: object

### Stop Words: Remove stop words (or filler words) using a stop word list

In [11]:
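# Build the stop word set once; calling stopwords.words('english') for every token is very slow
stop_words = set(stopwords.words('english'))

def filterwords(x):
    # x is a Series of token lists; keep only the tokens that are not stop words
    return x.apply(lambda tokens: [word for word in tokens if word not in stop_words])

all_filtered_text = filterwords(all_tokenized_text)

test_filtered_text = filterwords(test_tokenized_text)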

In [12]:
print(all_filtered_text.head())

0    [Subject, :, naturally, irresistible, corporat...
1    [Subject, :, stock, trading, gunslinger, fanny...
2    [Subject, :, 4, color, printing, special, requ...
3    [Subject, :, money, ,, get, software, cds, !, ...
4    [Subject, :, great, nnews, hello, ,, welcome, ...
dtype: object


### Punctuation: Remove punctuation and other special characters from tokens

In [106]:
def remove_punctuation(x):
    return x.apply(lambda x: [word for word in x if word.isalnum()] )

all_new_text = remove_punctuation(all_filtered_text)
test_new_text = remove_punctuation(test_filtered_text)

In [105]:
print(all_new_text.head(1))
print(all_new_text[1])

0    [Subject, naturally, irresistible, corporate, ...
dtype: object
['Subject', 'stock', 'trading', 'gunslinger', 'fanny', 'merrill', 'muzo', 'colza', 'attainder', 'penultimate', 'like', 'esmark', 'perspicuous', 'ramble', 'segovia', 'group', 'try', 'slung', 'kansas', 'tanzania', 'yes', 'chameleon', 'continuant', 'clothesman', 'libretto', 'chesapeake', 'tight', 'waterway', 'herald', 'hawthorn', 'like', 'chisel', 'morristown', 'superior', 'deoxyribonucleic', 'clockwork', 'try', 'hall', 'incredible', 'mcdougall', 'yes', 'hepburn', 'einsteinian', 'earmark', 'sapling', 'boar', 'duane', 'plain', 'palfrey', 'inflexible', 'like', 'huzzah', 'pepperoni', 'bedtime', 'nameable', 'attire', 'try', 'edt', 'chronography', 'optima', 'yes', 'pirogue', 'diffusion', 'albeit']
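
To make the three steps easier to follow, here is a minimal sketch of them on a single sample sentence. It is not the notebook's actual pipeline: the regex tokenizer is a simplified stand-in for `word_tokenize`, and `STOP_WORDS` is a tiny hypothetical list used only for illustration (NLTK's English list is much longer). The sample sentence is made up as well.

```python
import re

# Hypothetical, tiny stop word list for illustration only
STOP_WORDS = {"the", "is", "a", "to", "you", "your", "for", "and"}

def tokenize(sentence):
    # Simplified stand-in for nltk.word_tokenize:
    # runs of word characters, or single punctuation marks
    return re.findall(r"\w+|[^\w\s]", sentence)

def remove_stop_words(tokens):
    # Case-sensitive match, as in the notebook's filterwords
    return [t for t in tokens if t not in STOP_WORDS]

def remove_punctuation(tokens):
    # Keep only tokens made entirely of letters and digits
    return [t for t in tokens if t.isalnum()]

sentence = "Subject: claim your free prize now, the offer is limited!"
tokens = tokenize(sentence)
filtered = remove_stop_words(tokens)
cleaned = remove_punctuation(filtered)

print(tokens)
print(filtered)
print(cleaned)
```

Note that the stop word check is case-sensitive, so a capitalized stop word such as "The" would survive unless the tokens are lowercased first.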


## 3) Exercise: 
Inspect the resulting list below for any of your emails - Does it look clean and ready to be used for the next step in spam detection? Any other pre-processing steps you can think of or may want to do before spam detection? How about including other NLP features like bi-grams and tri-grams?

Another pre-processing step I added here is converting all words to lowercase. I also tried n-grams in the next section when developing the model.
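
As a rough illustration of what bi-grams and tri-grams are, the sketch below builds them by sliding a window over a token list. The helper name `ngrams` is hypothetical and is not part of the notebook; in the model itself, n-grams are controlled by the `ngram_range` parameter of `TfidfVectorizer`.

```python
def ngrams(tokens, n):
    # Slide a window of length n over the token list and join each window
    # into one feature string; returns an empty list if n > len(tokens)
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["subject", "stock", "trading", "gunslinger"]
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)
print(trigrams)
```

With `ngram_range=(1, 2)`, for example, the vectorizer would use both the single words and the bi-grams as features, which makes the vocabulary considerably larger.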

### Lowercase word

In [107]:
def lower(x):
    return [[word.lower() for word in maill] for maill in x]

all_new_text_lower = lower(all_new_text)
test_new_text_lower = lower(test_new_text)

In [100]:
print(all_new_text_lower[1])
print()

['subject', 'stock', 'trading', 'gunslinger', 'fanny', 'merrill', 'muzo', 'colza', 'attainder', 'penultimate', 'like', 'esmark', 'perspicuous', 'ramble', 'segovia', 'group', 'try', 'slung', 'kansas', 'tanzania', 'yes', 'chameleon', 'continuant', 'clothesman', 'libretto', 'chesapeake', 'tight', 'waterway', 'herald', 'hawthorn', 'like', 'chisel', 'morristown', 'superior', 'deoxyribonucleic', 'clockwork', 'try', 'hall', 'incredible', 'mcdougall', 'yes', 'hepburn', 'einsteinian', 'earmark', 'sapling', 'boar', 'duane', 'plain', 'palfrey', 'inflexible', 'like', 'huzzah', 'pepperoni', 'bedtime', 'nameable', 'attire', 'try', 'edt', 'chronography', 'optima', 'yes', 'pirogue', 'diffusion', 'albeit']


### Dataframe organization

In [108]:
all_new = all_maills.copy()
test_new = test_maills.copy()

all_new['text'] = [' '.join(map(str, l)) for l in all_new_text_lower]  # change list to string
test_new['text'] = [' '.join(map(str, l)) for l in test_new_text_lower]

In [109]:
print(all_new.head())
print(test_new.head())

     id                                               text  spam
0  1235  subject naturally irresistible corporate ident...     1
1  1236  subject stock trading gunslinger fanny merrill...     1
2  1238  subject 4 color printing special request addit...     1
3  1239  subject money get software cds software compat...     1
4  1240  subject great nnews hello welcome medzonline s...     1
     id                                               text
0  5074  subject prc review stinson going prc review bo...
1  2941  subject fwd hello charles shen williams co ton...
2  2032  subject considered unsolicited bulk email mess...
3  3443  subject looking someone experience finance mat...
4  3063  subject term project brian problem vince brian...


In [34]:
print(all_new['text'][1])

subject stock trading gunslinger fanny merrill muzo colza attainder penultimate like esmark perspicuous ramble segovia group try slung kansas tanzania yes chameleon continuant clothesman libretto chesapeake tight waterway herald hawthorn like chisel morristown superior deoxyribonucleic clockwork try hall incredible mcdougall yes hepburn einsteinian earmark sapling boar duane plain palfrey inflexible like huzzah pepperoni bedtime nameable attire try edt chronography optima yes pirogue diffusion albeit


## 4) Train/Validation Split
Now each email in your data set has been boiled down to its essentials: a list of clean words, ready for some machine learning! But maybe punctuation matters for spam emails?
If you are curious, you may keep the punctuation (i.e. skip step 3 above) and see how it impacts the metrics.

What we will do now is split the data set into a train set and a test set. The train set gets 80% of the data (i.e. emails along with their labels), chosen at random but with good representation from both the spam and not-spam classes. The same goes for the test set, which gets the remaining 20% of the data.
Look up Python libraries that can do this data split for you automatically.

In [17]:
train = all_new.groupby("spam").sample(frac=0.8)
valid = all_new[~all_new.index.isin(train.index)]

In [18]:
print('All data shape:', all_new.shape)
print('Train data shape:', train.shape)
print('Validation data shape:', valid.shape)

All data shape: (4260, 3)
Train data shape: (3408, 3)
Validation data shape: (852, 3)
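
One library answer to the question above is scikit-learn's `train_test_split`, whose `stratify` argument keeps the class proportions the same in both parts. The sketch below is an alternative to the `groupby(...).sample(...)` approach used in this notebook, not what was actually run; the toy `texts` and `labels` and the `random_state` value are made up for illustration.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy stand-in data: 40 non-spam (0) and 10 spam (1) emails
texts = [f"email {i}" for i in range(50)]
labels = [0] * 40 + [1] * 10

# stratify=labels preserves the spam/non-spam ratio in both splits;
# random_state makes the split reproducible
train_X, valid_X, train_y, valid_y = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

print(Counter(train_y))
print(Counter(valid_y))
```

Fixing `random_state` also makes scores comparable between runs, which the unseeded `sample(frac=0.8)` call does not guarantee.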


In [19]:
train_X, train_y = train['text'], train['spam']
valid_X, valid_y = valid['text'], valid['spam']
test_X = test_new['text']

## 5) Train your model and evaluate on Kaggle
Report your train/validation F1-score for your baseline model (starter LR model) and also your best LR model. Also report your insights on what worked and what did not on the Kaggle evaluation. How can your model be improved? Where does your model make mistakes?

In [20]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score,f1_score

In [290]:
model = Pipeline([
    ('tf-idf', TfidfVectorizer(max_df=0.6, min_df=1, sublinear_tf=True, ngram_range=(1, 1))),
    ('LR', LogisticRegression()),
])


In [22]:
model.fit(train_X, train_y)

Pipeline(steps=[('tfidf', TfidfVectorizer()), ('clf', LogisticRegression())])

In [23]:
predicted = model.predict(valid_X)
print('accuracy_score:',accuracy_score(valid_y,predicted))
print('f-1 score:', f1_score(valid_y,predicted))

accuracy_score 0.9753521126760564
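
Since the assignment asks for the F1 score, it may help to see how it relates to precision and recall. The sketch below computes it by hand for the positive (spam) class on a small made-up example; the function name `f1_from_labels` is hypothetical, and in the notebook the same quantity comes from `sklearn.metrics.f1_score`.

```python
def f1_from_labels(y_true, y_pred):
    # Count outcomes for the positive class (spam = 1)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    precision = tp / (tp + fp) if tp + fp else 0.0  # flagged emails that are spam
    recall = tp / (tp + fn) if tp + fn else 0.0     # spam emails that were flagged
    if precision + recall == 0:
        return 0.0
    # F1 is the harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

# Made-up labels: 3 spam and 5 non-spam emails
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
f1 = f1_from_labels(y_true, y_pred)
print(accuracy, f1)
```

In this toy case accuracy is higher than F1, because correctly classified non-spam emails raise accuracy but do not enter the F1 computation.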


### Predict on test data

In [293]:
# The sampling code is repeated here so that a single cell can be rerun when testing the model on a different split (here 95% train, 5% validation)
train = all_new.groupby("spam").sample(frac=0.95)
valid = all_new[~all_new.index.isin(train.index)]
train_X, train_y = train['text'], train['spam']
valid_X, valid_y = valid['text'], valid['spam']
model.fit(train_X, train_y)

predicted = model.predict(valid_X)
print('accuracy_score',accuracy_score(valid_y,predicted))
print('f-1 score:', f1_score(valid_y,predicted))

accuracy_score 0.9906103286384976
f-1 score: 0.98


In [296]:
test_new['spam'] = model.predict(test_X)

In [295]:
print(test_new)

        id                                               text  spam
0     5074  subject prc review stinson going prc review bo...     0
1     2941  subject fwd hello charles shen williams co ton...     0
2     2032  subject considered unsolicited bulk email mess...     1
3     3443  subject looking someone experience finance mat...     0
4     3063  subject term project brian problem vince brian...     0
...    ...                                                ...   ...
1463  6232  subject job application enron research group d...     0
1464  3227  subject lawyer ian sorry delay responding curr...     0
1465  5773  subject enron india mark two points 1 probably...     0
1466  5421  subject sorry see 11 30 hyatt lobby vince j ka...     0
1467  4708  subject departure research group unique extrao...     0

[1468 rows x 3 columns]


### Export the predicted data to a csv file

In [297]:
prediction = test_new[['id','spam']]

print(prediction.head())
print(prediction.shape)

     id  spam
0  5074     0
1  2941     0
2  2032     1
3  3443     0
4  3063     0
(1468, 2)


In [298]:
prediction.to_csv('prediction.csv',index=False)

### Conclusion

The first model we built got an F1 score of 0.97955. The F1 score of our final model on Kaggle is 0.98722, an increase of about 0.008.

For the first model, I only did the data processing steps from the instructions, and then used the basic tf-idf and LR model.

I made a few improvements between the first model and the last one. First, I added lowercasing to the data processing step. Second, I used the `max_df` parameter of tf-idf so that the model ignores words that appear in more than 60% of the emails and focuses on the more informative ones. Both changes gave us a better score.
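
To illustrate the effect of `max_df`, here is a small sketch on four made-up documents (they are not taken from the data set). With `max_df=0.6`, `TfidfVectorizer` drops every term whose document frequency is strictly above 60%, so a word like "subject" that appears in every email never reaches the classifier.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up toy corpus; "subject" appears in all four documents
docs = [
    "subject free prize claim now",
    "subject free software offer",
    "subject meeting schedule tomorrow",
    "subject project review meeting",
]

# Terms that occur in more than 60% of the documents are ignored
vectorizer = TfidfVectorizer(max_df=0.6)
vectorizer.fit(docs)

vocab = sorted(vectorizer.vocabulary_)
print(vocab)
```

Words such as "free" and "meeting" occur in half of the documents and are kept, which is the behaviour the final model relies on.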

I also tried n-grams, but they did not seem to improve my model. Every other change I tried in the data processing decreased the score of my model.