# Classifying Emails Beyond Spam

# LDA - Latent Dirichlet Allocation

In this project, Latent Dirichlet Allocation (LDA) is used to validate the clusters returned by hierarchical clustering and K-Means clustering. It is mainly used to gain insight into the clusters formed by the clustering algorithms, thereby highlighting the topic of the contents of each cluster.

LDA was also run on the dataset before any clustering algorithm was applied, to check whether the spam emails in the dataset already fall into particular topics. It was applied at all three levels of the data: the initial data, data + synonyms, and data + synonyms + hypernyms.

The main purposes of LDA in this project are:
1. To conduct preliminary tests (on all three levels of data) to check whether groups or topics based on the subject of the emails already exist.
2. To validate the clusters returned by the clustering algorithms.
3. To check whether applying LDA after clustering generates better results.
4. To check whether adding synonyms and hypernyms contributes to better results.


### This notebook shows the implementation of LDA on the data without synonyms and hypernyms (Level 1).


### Final Project - Riti Chakraborty

In [1]:
#Riti Chakraborty

#Importing the required libraries
import pandas as pd
import numpy as np
import random
from sklearn.feature_extraction.text import CountVectorizer
from numpy import nan

#for flattening lists
from itertools import chain

#To handle warning
import warnings
warnings.filterwarnings('ignore')

#For implementing Natural Language Processing approaches
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer

#For using Regular expression
import re

#For Handling Strings
import string

#For implementing word sense disambiguation
from nltk.corpus import wordnet as wn
from wordsegment import load, segment

#Important to call load()
load()

#For LDA implementation: importing Gensim
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')
import gensim
from gensim import corpora

#For evaluating the topic models formed
from gensim.models import CoherenceModel

#For visualising LDA output
#Note: in pyLDAvis 3.3 and later this submodule is named pyLDAvis.gensim_models
import pyLDAvis
import pyLDAvis.gensim



In [2]:
#Reading Email_train Data.
Emails = pd.read_csv('../input_data/TRAININGEMAILS.csv')

#Adding the label column: 0 = spam, 1 = ham
label = pd.read_csv('../input_data/Label.txt', sep=" ", header=None)
Emails['label'] = label[0]

#Keeping only the records whose From field is not empty, i.e. removing empty emails
Emails_Notnull = Emails[pd.notnull(Emails['From'])]

#Separating the spams from the hams
Spams = Emails_Notnull.loc[Emails_Notnull['label']==0]
Hams = Emails_Notnull.loc[Emails_Notnull['label']==1]



#Extracting only the Subject column
data_subset = Spams[["Subject"]]
data_subset.to_csv("../exported_tables/data_for_LDA.csv")

#Converting the data in the dataframe to str type
data_subset = data_subset.astype(str)

#Resetting the index so that rows are numbered from 0
data_subset = data_subset.reset_index(drop=True)

#Converting each row into a single string (one subject line per email)
list1 = data_subset.values.tolist()
list2 = [' '.join(map(str, row)) for row in list1]

#Building the document-term count matrix: emails on rows, terms on columns
vectorizer1 = CountVectorizer()
row_vectors = vectorizer1.fit_transform(list2).todense()
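
To make the shape of `row_vectors` concrete, the following minimal sketch builds the same kind of email-by-term count matrix using only the standard library. The three subject lines are invented for illustration, and the tokenization rule (lowercase, tokens of two or more word characters, alphabetically sorted vocabulary) is meant to mirror the defaults of `CountVectorizer`, not to reproduce its implementation.

```python
import re
from collections import Counter

# Hypothetical subject lines, invented for this illustration
subjects = [
    "Cheap watches online",
    "Watches at a discount",
    "Online discount offer",
]

# Lowercase and keep tokens of two or more word characters
token_pattern = re.compile(r"\b\w\w+\b")
tokenized = [token_pattern.findall(s.lower()) for s in subjects]

# Columns: sorted vocabulary; rows: one count vector per subject line
vocabulary = sorted({tok for doc in tokenized for tok in doc})
matrix = [[Counter(doc)[term] for term in vocabulary] for doc in tokenized]

print(vocabulary)
for row in matrix:
    print(row)
```

Each row corresponds to one email and each column to one term, which is the layout described in the comment above.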


### Defining Stopwords removal function

In [3]:
#Defining a Function for removing Stopwords
def stopword_remove(l2):
    #Custom stop-word set, built once; it also contains dataset-specific tokens such as "spam", "hibody" and "body"
    stop_words=set(["i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours", "yourself", "spam","yourselves", "he", "him", "his", "himself", "she", "her", "hers", "herself", "it", "its", "itself", "they", "them", "their", "theirs", "themselves", "what", "which", "who", "whom", "this", "that", "hibody", "body","these", "those", "am", "is", "are", "was", "were", "be", "been", "being", "have", "has", "had", "having", "do", "does", "did", "doing", "a", "an", "the", "and", "but", "if", "or", "because", "as", "until", "while", "of", "at", "by", "for", "with", "about", "against", "between", "into", "through", "during", "before", "after", "above", "below", "to", "from", "up", "down", "in", "out", "on", "off", "over", "under", "again", "further", "then", "once", "here", "there", "when", "where", "why", "how", "all", "any", "both", "each", "few", "more", "most", "other", "some", "such", "no", "nor", "not", "only", "own", "same", "so", "than", "too", "very", "s", "t", "can", "will", "just", "don", "should", "now"])
    fil_list2=[]
    for sent in l2:
        #Lowercasing and tokenizing, then keeping only the tokens that are not stop words
        word_tokens = word_tokenize(sent.lower())
        filtered_sentence = [w for w in word_tokens if w not in stop_words]
        fil_list2.append(' '.join(filtered_sentence))
    return fil_list2



#Calling the function to remove stop words
fil_list2=stopword_remove(list2)


### Defining a function to remove punctuation

In [4]:
#Defining a function to remove punctuation
def no_punctuation(my_str):
    punctuations = '''!()-[]{};:'"\\,<>./?@#$%^&*_~'''
    #Replacing each punctuation character with a space so that adjacent words stay separated
    no_punct = "".join(" " if char in punctuations else char for char in my_str)

    #Returning the unpunctuated string
    return no_punct
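
As an alternative sketch, the same replacement can be expressed with `str.maketrans` and `str.translate`, which avoids the character-by-character loop. The helper name `no_punctuation_translate` and the sample string are hypothetical; the punctuation set is copied from the function above.

```python
# Same punctuation set as in no_punctuation above
punctuations = '''!()-[]{};:'"\\,<>./?@#$%^&*_~'''

# Translation table mapping every punctuation character to a single space
to_space = str.maketrans({ch: " " for ch in punctuations})

def no_punctuation_translate(my_str):
    """Replace each punctuation character with a space (hypothetical helper)."""
    return my_str.translate(to_space)

# Hypothetical subject line; splitting afterwards drops the extra spaces
print(no_punctuation_translate("Re: 50% off!").split())
```

Because every punctuation mark becomes a space instead of being deleted, tokens such as `buy-now` are split into two words when the result is later passed to `.split()`.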

### Defining a function to tokenize and segment the strings. Synonyms and hypernyms are fetched from the WordNet API only in Level 2 and Level 3, not here in Level 1.

In [5]:
#Word tokenisation, word segmentation
def seg_syn(l1):
    #Stripping punctuation and splitting into whitespace-delimited tokens
    tokens = no_punctuation(l1).split()

    #Dropping every token that contains a digit (this also covers pure integers)
    pure_string = [x for x in tokens if not any(c.isdigit() for c in x)]

    #Segmenting run-together tokens into words and flattening the result
    flat_seg_list = [word for s in pure_string for word in segment(s)]

    return flat_seg_list

In [6]:
list2_syn=[]
for l in fil_list2:

    #Storing the segmented tokens of each email (no synonyms or hypernyms are added at Level 1)
    emails=seg_syn(str(l))
    list2_syn.append(emails)

#Total number of tokens across all emails (length of the flattened list)
flat_list_syn_hyp=len(list(chain(*list2_syn)))

#Converting back to strings, keeping each distinct token once per email
#(set() does not preserve the original word order)
flat_list2_syn=[]
for ls in list2_syn:
    flat_list2_syn.append(' '.join(list(set(ls))))

df_flatlist=pd.DataFrame(flat_list2_syn)
df_flatlist["Index"]=df_flatlist.index

#Removing short words (one to four characters) from each document
doc_compt=[]
for text in flat_list2_syn:
    doc_compt.append(re.sub(r'\b\w{1,4}\b', '', str(text)))
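
The two text operations in this cell, de-duplicating tokens and dropping words of one to four characters, can be tried on a toy example. The token list below is invented, and `dict.fromkeys` is used here only as an order-preserving alternative to `set()`; it is an assumption of this sketch, not what the notebook does.

```python
import re

# Hypothetical token list for one email subject after segmentation
tokens = ["cheap", "replica", "watch", "for", "you", "replica", "sale"]

# Order-preserving de-duplication (set() does not guarantee any order)
unique_tokens = list(dict.fromkeys(tokens))
text = " ".join(unique_tokens)

# Same pattern as in the cell above: blank out whole words of one to four characters
long_words_only = re.sub(r"\b\w{1,4}\b", "", text)

print(long_words_only.split())
```

Since LDA works on bag-of-words counts, word order does not affect the model; preserving it mainly makes intermediate output easier to read.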


### LDA on the data

In [7]:
stop = set(stopwords.words('english'))
exclude_punct = set(string.punctuation) 
lemma = WordNetLemmatizer()

def clean(doc):
    
    #Removing Stop words
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    
    #Removing Punctuation
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude_punct)
    
    #Lemmatizing each word, then removing any remaining digits
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    result = ''.join([i for i in normalized if not i.isdigit()])
    
    return result

doc_clean = [clean(doc).split() for doc in  doc_compt]        

dictionary = gensim.corpora.Dictionary(doc_clean)

# dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)
bow_corpus = [dictionary.doc2bow(doc) for doc in doc_clean]

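#Training an LDA model with 5 topics; no random_state is passed, so the topics can differ between runs
lda_model = gensim.models.LdaMulticore(bow_corpus, num_topics=5, id2word=dictionary, passes=2, workers=2)

#Printing the top words and their weights for every topic
for idx, topic in lda_model.print_topics(-1):
    print('\n Topic: {} \nWords: {}'.format(idx, topic))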


 Topic: 0 
Words: 0.022*"discount" + 0.015*"online" + 0.015*"better" + 0.014*"watch" + 0.013*"replica" + 0.012*"viagra" + 0.010*"offer" + 0.010*"price" + 0.010*"buying" + 0.009*"order"

 Topic: 1 
Words: 0.019*"message" + 0.017*"money" + 0.013*"credit" + 0.011*"secret" + 0.010*"penis" + 0.009*"address" + 0.009*"unique" + 0.008*"rate" + 0.008*"information" + 0.008*"social"

 Topic: 2 
Words: 0.019*"price" + 0.017*"today" + 0.012*"discount" + 0.012*"offer" + 0.012*"computer" + 0.012*"system" + 0.011*"work" + 0.010*"approved" + 0.009*"special" + 0.008*"month"

 Topic: 3 
Words: 0.015*"today" + 0.014*"quote" + 0.014*"watch" + 0.013*"hello" + 0.013*"insurance" + 0.012*"friend" + 0.010*"email" + 0.009*"discount" + 0.008*"weight" + 0.008*"enter"

 Topic: 4 
Words: 0.030*"price" + 0.023*"mining" + 0.021*"pfizer" + 0.019*"watch" + 0.017*"business" + 0.014*"personal" + 0.011*"mortgage" + 0.011*"rate" + 0.011*"visitor" + 0.008*"every"
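
Each topic is printed as a weighted sum of its top words. As a minimal sketch for reading this output programmatically, the snippet below parses two of the topic strings above into `(word, weight)` pairs and lists the words they share. The helper `parse_topic` is hypothetical and not part of gensim.

```python
import re

# Two of the topic strings printed above, copied verbatim
topics = {
    0: '0.022*"discount" + 0.015*"online" + 0.015*"better" + 0.014*"watch" + 0.013*"replica" + 0.012*"viagra" + 0.010*"offer" + 0.010*"price" + 0.010*"buying" + 0.009*"order"',
    4: '0.030*"price" + 0.023*"mining" + 0.021*"pfizer" + 0.019*"watch" + 0.017*"business" + 0.014*"personal" + 0.011*"mortgage" + 0.011*"rate" + 0.011*"visitor" + 0.008*"every"',
}

def parse_topic(topic_string):
    """Turn '0.022*"discount" + ...' into [("discount", 0.022), ...] (hypothetical helper)."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

parsed = {idx: parse_topic(s) for idx, s in topics.items()}

# Words that appear in the top-10 list of both topics
shared = sorted({w for w, _ in parsed[0]} & {w for w, _ in parsed[4]})
print(shared)
```

Words shared by several topics, such as "price" and "watch" here, indicate that the topics are not fully separated on this level of the data.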


### Computing Perplexity and Coherence Score.

In [8]:
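# Note: this seeds Python's random module only; gensim takes its seed through the random_state argument of the model.
random.seed(3425)
# log_perplexity returns the per-word likelihood bound, not the perplexity itself.
# gensim reports perplexity as 2^(-bound), so a higher (less negative) bound means a lower perplexity, which is better.
perplexity=lda_model.log_perplexity(bow_corpus)
print('\n Perplexity of the Spam Classification model: ', perplexity)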

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=doc_clean, dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\n Coherence Score of the Spam Classification model: ', coherence_lda)


 Perplexity of the Spam Classification model:  -7.587339587706853

 Coherence Score of the Spam Classification model:  0.7089646830269934
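
The value printed as perplexity is negative because gensim's `log_perplexity` returns a per-word likelihood bound; according to the gensim documentation, the perplexity itself is logged as 2^(-bound). A short sketch of that conversion for the value above, which comes out at roughly 192:

```python
import math

# Per-word likelihood bound reported above by log_perplexity
per_word_bound = -7.587339587706853

# Perplexity as defined in the gensim documentation: 2 ** (-bound)
perplexity = math.pow(2, -per_word_bound)

print(round(perplexity, 1))
```

When models are compared across the three levels of data, either quantity can be used as long as the direction is kept in mind: a higher bound corresponds to a lower perplexity.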


### Visualization

In [9]:
pyLDAvis.enable_notebook()
id2word=dictionary
vis = pyLDAvis.gensim.prepare(lda_model, bow_corpus, id2word)
vis