# <font color='EC2C04'>  SPAM DETECTION SYSTEM with Ling-Spam COLLECTION </font>

This notebook includes the tasks performed in the **second part** of the Resit Coursework of the Big Data Applications module. For a complete description of the purposes of the analysis, please consult the PDF document provided which also represent the first part of the Coursework.
From that file, you will be also able to have a detailed the description of the corpus used for this analysis. Nevertheless, you can access to a summary of the file from the **[part 1]**(# 1.Access and Import the Files) of this script, in which the Readme.txt file of the dataset is accessed and shown.
This script goes through the entire process that is necessary to create and obtain the optimal model able to identify whether a message represents a Spam or Ham.
In particular, the phases in which the document is organised are summarised as following:

1. **Access and Import the File**  
2. **Clean the Text files**
3. **Convert Text files into Vectors of fixed dimensions and Encode the Target into a Binary variable**
4. **Normalise the Values and Turn the tuple Label/Dense Vector into LabeledPoints local vector**
5. **Classification models**


**! NOTE:** To better understand and expose the abilities learnt with this coursework I believe it may be useful to work in two directions:
1. Start with simple/small attempts for each function in order to verify the results of my actions
2. Give the possibility of these results to be reproducible ( when importing new data or modify parameters ) by defining functions that can include and iterate the functions

In [1]:
# PYSPARK INSTALLATION
!pip install pyspark 



In [1]:
# LOAD THE PACKAGES 
import pyspark
from pathlib import Path
import re
import nltk
import re
from nltk.corpus import stopwords
from pyspark import StorageLevel
from pyspark.mllib.feature import IDF, Normalizer
from pyspark.mllib.regression import LabeledPoint
from pyspark.ml.classification import LogisticRegression

In [2]:
# DEFINE THE CONTEXT
sc = pyspark.SparkContext.getOrCreate()

---

# 1. Access and Import the Files

In [3]:
# Define the Directory 
%cd /Users/simonezanetti/Desktop/lingspam_public

# Define Your own directory
# %cd /BigDataResitCw/lingspam_public

/Users/simonezanetti/Desktop/lingspam_public


In [4]:
# Access to the Readme.txt file to access to a description of the Corpus 
# used for this task.
!cat readme.txt

This directory contains the Ling-Spam corpus, as described in the 
paper:

I. Androutsopoulos, J. Koutsias, K.V. Chandrinos, George Paliouras, 
and C.D. Spyropoulos, "An Evaluation of Naive Bayesian Anti-Spam 
Filtering". In Potamias, G., Moustakis, V. and van Someren, M. (Eds.), 
Proceedings of the Workshop on Machine Learning in the New Information 
Age, 11th European Conference on Machine Learning (ECML 2000), 
Barcelona, Spain, pp. 9-17, 2000.

There are four subdirectories, corresponding to four versions of 
the corpus:

bare: Lemmatiser disabled, stop-list disabled.
lemm: Lemmatiser enabled, stop-list disabled.
lemm_stop: Lemmatiser enabled, stop-list enabled.
stop: Lemmatiser disabled, stop-list enabled.

Each one of these 4 directories contains 10 subdirectories (part1, 
..., part10). These correspond to the 10 partitions of the corpus 
that were used in the 10-fold experiments. In each repetition, one 
part was reserved for testing and the other 9 were use

---

**!  NOTE:** For this task the **Bare** subdirectory will be used

---

In [5]:
# Goal is Extracting the text files contained in each of the 10 subdirectories 
# contained in the 'bare' directory that has been chosen to be used in this analysis.
# I could use also the package glob ( and glob.glob. Check in case )


path = Path('bare') # Set the directory bare as path
print(*path.iterdir()) # Check that this is going to do what you want: 
                    # -> Obtaining as (sub)directories all the 10 parts within the folder bare
                    # to then iterate over all of them to assign their content into an RDD file.
        

bare/part3 bare/part4 bare/part5 bare/part2 bare/part10 bare/part9 bare/part7 bare/part1 bare/part6 bare/part8


In [6]:
# Store each directory path in a list, in order to iterate that and be able to obtain the RDD's. 
dr = list(path.iterdir()) 

# Make sure the function .resolve() works in connecting
# the path with each directory (x) and its turned into a str
# This will be useful in the loop set in the following cell
for i in range(0,len(dr)):
    print(dr[i])
    print(dr[i].resolve())

bare/part3
/Users/simonezanetti/Desktop/lingspam_public/bare/part3
bare/part4
/Users/simonezanetti/Desktop/lingspam_public/bare/part4
bare/part5
/Users/simonezanetti/Desktop/lingspam_public/bare/part5
bare/part2
/Users/simonezanetti/Desktop/lingspam_public/bare/part2
bare/part10
/Users/simonezanetti/Desktop/lingspam_public/bare/part10
bare/part9
/Users/simonezanetti/Desktop/lingspam_public/bare/part9
bare/part7
/Users/simonezanetti/Desktop/lingspam_public/bare/part7
bare/part1
/Users/simonezanetti/Desktop/lingspam_public/bare/part1
bare/part6
/Users/simonezanetti/Desktop/lingspam_public/bare/part6
bare/part8
/Users/simonezanetti/Desktop/lingspam_public/bare/part8


In [7]:
# Set the loop that allows to store the content of each directory as Rdd within a List.
store_rdds = []

# From the (sub)directories to Rdd files
for x in dr:  # iterate through the directories 
    rdd = sc.wholeTextFiles(str(x.resolve())) 
    store_rdds.append(rdd) # I append each rdd obtained with wholeTextFile() in the empty list created 

# Check that I have 10 directories in the list

print("Length file/Number of Rdd's",len(store_rdds),'\n\n')
if len(store_rdds) == 10:
    print('Right number!\n\n')
else:
    print('Wrong number. Check what happened.\n\n')
# Check that it worked 
print(store_rdds[1].take(2)) # I take the first subdirectory(turned into an Rdd)
                             # and have a look at first two messages.
    
    
# I can see from the last print that the each tuple includes the file path.
# ex. 'file:/Users/simonezanetti/Desktop/lingspam_public/bare/part4/6-266msg3.txt'
# This is not useful. By considering the example shown, what we want is only
# 6-266msg3. This is useful later on to identify the labels as filenames 
# starting with 'spmsg' identify the spam messages. ( next .. )

Length file/Number of Rdd's 10 


Right number!


[('file:/Users/simonezanetti/Desktop/lingspam_public/bare/part4/6-266msg3.txt', 'Subject: bisfai deadline extension !\n\nbisfai deadline extension ! the deadline for the bar - ilan symposium on foundations of artificial intelligence has been extended to february 27 . the conference itself will take place as scheduled , june 20-22 , in ramat - gan and jerusalem , israel . for more information contact : bisfai @ bimacs . cs . biu . ac . il daniel radzinski tovna translation machines jerusalem , israel dr @ tovna . co . il\n'), ('file:/Users/simonezanetti/Desktop/lingspam_public/bare/part4/8-1074msg1.txt', 'Subject: 8th international conference on functional grammar\n\neighth international conference on functional grammar , july 6th-9th , 1998 the biennial series on conferences on functional grammar will be continued in 1998 at the vrije universiteit amsterdam ( netherlands ) , where a four-day conference will be held from 6th to 9th july 

In [8]:
a = store_rdds[1].take(4)
for x in a:
    print(x[0])
    a = re.split('[/.]',x[0])  # square brackets indicate every of the symbols inside can be used to split
    print(a)

# I want to keep only the second last element of this list ( index -2)    
print('\n\n\nSUMMARY:\n I need this:',a[-2])

file:/Users/simonezanetti/Desktop/lingspam_public/bare/part4/6-266msg3.txt
['file:', 'Users', 'simonezanetti', 'Desktop', 'lingspam_public', 'bare', 'part4', '6-266msg3', 'txt']
file:/Users/simonezanetti/Desktop/lingspam_public/bare/part4/8-1074msg1.txt
['file:', 'Users', 'simonezanetti', 'Desktop', 'lingspam_public', 'bare', 'part4', '8-1074msg1', 'txt']
file:/Users/simonezanetti/Desktop/lingspam_public/bare/part4/6-353msg2.txt
['file:', 'Users', 'simonezanetti', 'Desktop', 'lingspam_public', 'bare', 'part4', '6-353msg2', 'txt']
file:/Users/simonezanetti/Desktop/lingspam_public/bare/part4/6-864msg1.txt
['file:', 'Users', 'simonezanetti', 'Desktop', 'lingspam_public', 'bare', 'part4', '6-864msg1', 'txt']



SUMMARY:
 I need this: 6-864msg1


In [9]:
# .. To solve this issue it is possible to use the re.split function from the package
# NLTK ( tackled on the INTRODUCTION TO NATURAL LANGUAGE PROCESSING module)
# lambda function will help to iterate the function over the sequence of text within each RDD
# map function allows me to apply a function to the Rdd file.


# A loop that iterates over each of the 10 Rdd's: 
# I think this is not the most elegant way, but I could not find another
store_rdd = []
for i in range(0,len(store_rdds)):
    a = store_rdds[i]
    a =  a.map(lambda x: (re.split('[/\.]', x[0])[-2], x[1])) # x[1] is the text itself
    store_rdd.append(a)

print('Check if it worked:\n\n',store_rdd[1].take(1))
print('\n\nWell done !')


Check if it worked:

 [('6-266msg3', 'Subject: bisfai deadline extension !\n\nbisfai deadline extension ! the deadline for the bar - ilan symposium on foundations of artificial intelligence has been extended to february 27 . the conference itself will take place as scheduled , june 20-22 , in ramat - gan and jerusalem , israel . for more information contact : bisfai @ bimacs . cs . biu . ac . il daniel radzinski tovna translation machines jerusalem , israel dr @ tovna . co . il\n')]


Well done !


In [10]:
# At this point I should turn the results of this phase into a function, 
# which will help me to be able to import a file and obtain these results
# by only setting the path/folder name

def access_and_import(path):
    path = Path('bare')
    dr = list(path.iterdir()) 
    for i in range(0,len(dr)):
        print('Verify if the directory is properly set:')
        print(dr[i])
        print(dr[i].resolve())
    print('\n\n')
    store_rdds = []
    for x in dr:  
        rdd = sc.wholeTextFiles(str(x.resolve())) 
        store_rdds.append(rdd) 
    store_rdd = []
    for i in range(0,len(store_rdds)):
        a = store_rdds[i]
        a =  a.map(lambda x: (re.split('[/\.]', x[0])[-2], x[1])) # x[1] is the text itself
        store_rdd.append(a)
    print('Check if split function worked:\n\n',store_rdd[1].take(1))
    return store_rdd
# Check if it worked 
store_rdd = access_and_import('bare')

Verify if the directory is properly set:
bare/part3
/Users/simonezanetti/Desktop/lingspam_public/bare/part3
Verify if the directory is properly set:
bare/part4
/Users/simonezanetti/Desktop/lingspam_public/bare/part4
Verify if the directory is properly set:
bare/part5
/Users/simonezanetti/Desktop/lingspam_public/bare/part5
Verify if the directory is properly set:
bare/part2
/Users/simonezanetti/Desktop/lingspam_public/bare/part2
Verify if the directory is properly set:
bare/part10
/Users/simonezanetti/Desktop/lingspam_public/bare/part10
Verify if the directory is properly set:
bare/part9
/Users/simonezanetti/Desktop/lingspam_public/bare/part9
Verify if the directory is properly set:
bare/part7
/Users/simonezanetti/Desktop/lingspam_public/bare/part7
Verify if the directory is properly set:
bare/part1
/Users/simonezanetti/Desktop/lingspam_public/bare/part1
Verify if the directory is properly set:
bare/part6
/Users/simonezanetti/Desktop/lingspam_public/bare/part6
Verify if the directory is

In [11]:
def split_train_test(store_rdd, ntest): 
#ntest: number of rdds I want to keep for test set
    ntrain = len(store_rdd)- ntest
    print('Num Rdd used for Train:',ntrain)
    train=sc.emptyRDD() # I set an empty RDD and I then unite the train RDDs together
    for x in range(0,ntrain):
        train = train.union(store_rdd[x])
    test = sc.emptyRDD() # Same for test
    for x in range(ntrain,len(store_rdd)):
        test = test.union(store_rdd[x])
    #union = train.union(test) # Not sure so far but it may be useful to perform operation when still together and split after
    return train,test

In [12]:
train,test = split_train_test(store_rdd, 1)

Num Rdd used for Train: 9


---

# 2. Apply NLP techniques to clean the strings


In [13]:
# I use only the test set not but later I will create a function to iterate everything
# and this will allow to have results over train and test data.

print(test.values().take(1)) # Function values leaves out the key 
test_val = test.values() # This leaves out the key ( the filename. ex: '8-1064msg1') 

['Subject: re : 8 . 1044 , disc : grammar in schools\n\n( re message from : linguist @ linguistlist . org ) > > linguist list : vol-8 - 1044 . sat jul 12 1997 . issn : 1068-4875 . > > subject : 8 . 1044 , disc : grammar in schools > > i know and teach that not all infinitives contain ` to \' . i also give > the students examples ( e . g . ` i asked him to kindly apologise \' ) where > placing the adverb anywhere else would cause ambiguity . > > jennifer chew an example i once concocted to justify " splitting the infintive " ( or not , as the case may be ) is : a ) after a heavy meal , i prepared slowly to go home digesting b ) after a heavy meal , i prepared to slowly go home digesting c ) after a heavy meal , i prepared to go home slowly digesting in this context , with the possible exception of the third case , the natural ( and therefore near-enough unambiguous ) association of the adverb is as follows : a ) after a heavy meal , i prepared _ slowly to go home digesting b ) after a h

In [14]:
# 1. REMOVE PUNCTUATION 
# This is helpful to remove noise from the data, i.e we remove the punctuation that does not lead
# any significant inside to the model.
# However, I keep the ! and ? because spam tend to often have these simbols as a way for advertisment to give enthusiasm

test_val =  test_val.map(lambda x: re.sub('[:()\[\],.?!";_]','',x))
test_val.take(1)[:10]

["Subject re  8  1044  disc  grammar in schools\n\n re message from  linguist @ linguistlist  org  > > linguist list  vol-8 - 1044  sat jul 12 1997  issn  1068-4875  > > subject  8  1044  disc  grammar in schools > > i know and teach that not all infinitives contain ` to '  i also give > the students examples  e  g  ` i asked him to kindly apologise '  where > placing the adverb anywhere else would cause ambiguity  > > jennifer chew an example i once concocted to justify  splitting the infintive   or not  as the case may be  is  a  after a heavy meal  i prepared slowly to go home digesting b  after a heavy meal  i prepared to slowly go home digesting c  after a heavy meal  i prepared to go home slowly digesting in this context  with the possible exception of the third case  the natural  and therefore near-enough unambiguous  association of the adverb is as follows  a  after a heavy meal  i prepared  slowly to go home digesting b  after a heavy meal  i prepared to slowly  go  home diges

In [15]:
# 2. TOKENISATION

#nltk.download('punkt') # Apply the Punkt sentence tokenizer: the standard tokenizer for English language
test_val = test_val.map(nltk.word_tokenize) # This tokenises the RDD by applying word_tokenize to each of the tests through the map function
# Verify the result
test_val.take(1)[0][:10] # Narrow the output print 

['Subject',
 're',
 '8',
 '1044',
 'disc',
 'grammar',
 'in',
 'schools',
 're',
 'message']

In [16]:
# Now I need to unite again together the key and its values
test = test.keys().zip(test_val)

# We may have risked to create lots of empty strings.
# Find a way to filter them 

In [17]:
# DEFINE A FUNCTION THAT SUMMARISE THE PROCESS DONE IN PART 2

def remove_punct_tokenisation(rdd):
    rdd_val = rdd.values()
    rdd_val =  rdd_val.map(lambda x: re.sub('[:()\[\],.?!";_]','',x))
    rdd_val = rdd_val.map(nltk.word_tokenize) # This tokenises the RDD by applying word_tokenize to each of the tests through the map function
    rdd = rdd.keys().zip(rdd_val)
    print('\n\nLook of data after tokenisation and punctuation removed:\n', rdd.take(2))
    return rdd

# Run the function 
train = remove_punct_tokenisation(train)
#train.take(1)



Look of data after tokenisation and punctuation removed:
 [('6-110msg3', ['Subject', 'job', 'posting', '-', 'apple-iss', 'research', 'center', 'content', '-', 'length', '3386', 'apple-iss', 'research', 'center', 'a', 'us', '$', '10', 'million', 'joint', 'venture', 'between', 'apple', 'computer', 'inc', 'and', 'the', 'institute', 'of', 'systems', 'science', 'of', 'the', 'national', 'university', 'of', 'singapore', 'located', 'in', 'singapore', 'is', 'looking', 'for', 'a', 'senior', 'speech', 'scientist', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', 'the', 'successful', 'candidate', 'will', 'have', 'research', 'expertise', 'in', 'computational', 'linguistics', 'including', 'natural', 'language', 'processing', 'and', '*', '*', 'english', '*', '*', 'and', '*', '*', 'chinese', '*', '*', 'statistical', 'language', 'modeling', 'knowledge', 'of', 'state-of', '-', 'the-art', 'corpus-based', 'n', '-', 'gram', 'lang

---

# 3. Convert Text files into Vectors of fixed size and Encode the Target into a Binary variable

This process is necessary to allow use the strings as useful inputs for Machine Learning models.

In [18]:
# Hash functions return a unique encode for each word.
# www.youtube.com/watch?v=2BldESGZKB8 
# Very useful reference for the concept


# Let's see an example to observe if this concept could work: 
text = 'Hello'
length = 10  # Length of the vectors. This parameter will be changed later on
vec = [0] * length  # create vector of 0s of the dimension we define 
print('Initial vector:',vec, '\n\n')
for word in text: 
                 
    hsh = hash(word)  
    print('Hashed word:', hsh)
    print('Hashed word % N:', hsh % length)
    vec[hsh % length] = vec[hsh % length] + 1 # You add one to this position of the vector
    print(vec)
print('\n\nVector identifying the Word', text, ':',vec)
 # return hashed word vector

Initial vector: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0] 


Hashed word: -3636030952904220701
Hashed word % N: 9
[0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
Hashed word: -5940189418225466560
Hashed word % N: 0
[1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
Hashed word: -1407138643904109550
Hashed word % N: 0
[2, 0, 0, 0, 0, 0, 0, 0, 0, 1]
Hashed word: -1407138643904109550
Hashed word % N: 0
[3, 0, 0, 0, 0, 0, 0, 0, 0, 1]
Hashed word: 4880476542979782855
Hashed word % N: 5
[3, 0, 0, 0, 0, 1, 0, 0, 0, 1]


Vector identifying the Word Hello : [3, 0, 0, 0, 0, 1, 0, 0, 0, 1]


In [19]:
# Let's Try to apply this to our task

# The MAIN GOAL IS TO HAVE RETURNED NUMERICAL INPUTS THAT CAN BE ABLE TO 
# 1. be input into the model
# 2. Have the same vector size for each vector ( that I define with the parameter/variable N)

def hash_vectors(text, vect_length): 
    vec = [0] * vect_length  # initialise a vector of all zeroes
    for x in text: 
        hsh = hash(x) # Every word defined by its hash
        vec[hsh % vect_length] = vec[hsh % vect_length] + 1 # This adds 1 in the index defined.
                                                            # In case the letter appears again then 1 would be sum to 1 and so on.
    return vec # vector identifying the word


test_keys = test.keys() # These are the labels that identify whether is spam or ham
test_val = test.values() # I select only the values 
test_val1 = test_val.map(lambda text: hash_vectors(text,8)) # Apply the function over the rdd values 

print('Tokens before Vectorization:\n', test_val.take(1))
print('\n\nTokens after Vectorization:\n', test_val1.take(1))

# Try another dimension of vector
test_val1 = test_val.map(lambda text: hash_vectors(text,50)) 
print('\n\nTokens after Vectorization:\n', test_val1.take(1))

Tokens before Vectorization:
 [['Subject', 're', '8', '1044', 'disc', 'grammar', 'in', 'schools', 're', 'message', 'from', 'linguist', '@', 'linguistlist', 'org', '>', '>', 'linguist', 'list', 'vol-8', '-', '1044', 'sat', 'jul', '12', '1997', 'issn', '1068-4875', '>', '>', 'subject', '8', '1044', 'disc', 'grammar', 'in', 'schools', '>', '>', 'i', 'know', 'and', 'teach', 'that', 'not', 'all', 'infinitives', 'contain', '`', 'to', "'", 'i', 'also', 'give', '>', 'the', 'students', 'examples', 'e', 'g', '`', 'i', 'asked', 'him', 'to', 'kindly', 'apologise', "'", 'where', '>', 'placing', 'the', 'adverb', 'anywhere', 'else', 'would', 'cause', 'ambiguity', '>', '>', 'jennifer', 'chew', 'an', 'example', 'i', 'once', 'concocted', 'to', 'justify', 'splitting', 'the', 'infintive', 'or', 'not', 'as', 'the', 'case', 'may', 'be', 'is', 'a', 'after', 'a', 'heavy', 'meal', 'i', 'prepared', 'slowly', 'to', 'go', 'home', 'digesting', 'b', 'after', 'a', 'heavy', 'meal', 'i', 'prepared', 'to', 'slowly', 'g

In [20]:
# ENCODE THE TARGET
# I need to turn the keys into binary factors that identify whether the message
# is Spam or not ( 1 or 0 )


# Let's organize a way to replace the string with 1 if it's spam or 0 if it's not

a = test_keys.take(4) # Example with only 4 elements in the list 
for z,i in zip(a, range(0,len(a))):
    if z.startswith('spm'):
        a[i] = 1
    else:
        a[i] = 0


# Define a function to replace the element 
def binary_target(z):
    #for z,i in zip(list_, range(0,len(list_))): # For future reference: This was the mistake
                                                 # I was doing for an hour getting error. 
                                                 # Remember, when you operate a map
                                                 # you are iterating over the token directly,
                                                 # not the list. 
                                                
    if z.startswith('spm'):
        z = 1
        #list_[i] = 1           # Future reference of what to NOT do
    else:
        z = 0
        #list_[i] = 0
    return z
            

# Make the function work for the Rdd file
test_keys_binary = test_keys.map(binary_target)
print('Check the result over the first 10 targets:\n\n','Before:\n',test_keys.take(10),
      '\n\nAfter:\n',test_keys_binary.take(10),'\n\n')   
#cls_vec_RDD = test_keys.map(lambda x: (1 if x.startswith('spm') else 0))




# NOW, I ZIP THE LABELS AND THE INPUT BACK TOGETHER

# Keep 5 as vector dimension so far
test_val1 = test_val.map(lambda text: hash_vectors(text,5)) 
test = test_keys_binary.zip(test_val1)
print('Verify if the Labels and Values have been merged back:\n', test.take(2))



Check the result over the first 10 targets:

 Before:
 ['8-1064msg1', '6-806msg1', '6-816msg1', '8-1208msg1', 'spmsgc111', 'spmsgc105', 'spmsgc139', 'spmsgc138', 'spmsgc104', 'spmsgc110'] 

After:
 [0, 0, 0, 0, 1, 1, 1, 1, 1, 1] 


Verify if the Labels and Values have been merged back:
 [(0, [78, 56, 88, 59, 69]), (0, [12, 8, 3, 4, 12])]


In [21]:
# Put into a FUNCTION 

def vector_input(rdd, vect_length):
    def hash_vectors(text, vect_length): 
        vec = [0] * vect_length  # initialise a vector of all zeroes
        for x in text: 
            hsh = hash(x) # Every word defined by its hash
            vec[hsh % vect_length] = vec[hsh % vect_length] + 1 # This adds 1 in the index defined.
                                                            # In case the letter appears again then 1 would be sum to 1 and so on.
        return vec # vector identifying the word


    rdd_keys = rdd.keys() # These are the labels that identify whether is spam or ham
    rdd_val = rdd.values() # I select only the values 
    rdd_val1 = rdd_val.map(lambda text: hash_vectors(text, vect_length)) # Apply the function over the rdd values 
    rdd_keys_binary = rdd_keys.map(binary_target)
    print('\n\nTokens before Vectorization:\n', rdd_val.take(1))
    print('\n\nTokens after Vectorization:\n', test_val1.take(1))
    rdd = rdd_keys_binary.zip(rdd_val1)
    return rdd   

# Run the Function 
train = vector_input(train, 6)
print('Check function worked:\n', train.take(3))



Tokens before Vectorization:
 [['Subject', 'job', 'posting', '-', 'apple-iss', 'research', 'center', 'content', '-', 'length', '3386', 'apple-iss', 'research', 'center', 'a', 'us', '$', '10', 'million', 'joint', 'venture', 'between', 'apple', 'computer', 'inc', 'and', 'the', 'institute', 'of', 'systems', 'science', 'of', 'the', 'national', 'university', 'of', 'singapore', 'located', 'in', 'singapore', 'is', 'looking', 'for', 'a', 'senior', 'speech', 'scientist', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', 'the', 'successful', 'candidate', 'will', 'have', 'research', 'expertise', 'in', 'computational', 'linguistics', 'including', 'natural', 'language', 'processing', 'and', '*', '*', 'english', '*', '*', 'and', '*', '*', 'chinese', '*', '*', 'statistical', 'language', 'modeling', 'knowledge', 'of', 'state-of', '-', 'the-art', 'corpus-based', 'n', '-', 'gram', 'language', 'models', 'cache', 'language', 'mod

# 4. Normalise the Values

In [22]:
# Split again the keys and the values.
# I know this is reduntant but I think it may be better in the future in case of modification
# and when I will set the functions.
# I obviously don't need to Normalise the Target since it's a binary category of 0 and 1.

test_val = test.values()
normalizer1 = Normalizer() # set the normalizer 
test_val = normalizer1.transform(test_val)
print('Check the values are now Normalised:\n',test_val.take(3))

# 'from 'https://spark.apache.org/docs/2.2.0/mllib-feature-extraction.html'
# Each sample normalized using $L^2$ norm.

test = test.keys().zip(test_val)
print('\n\nCheck the values are zipped back:\n',test.take(3))

Check the values are now Normalised:
 [DenseVector([0.4913, 0.3527, 0.5543, 0.3716, 0.4346]), DenseVector([0.618, 0.412, 0.1545, 0.206, 0.618]), DenseVector([0.4169, 0.3335, 0.4235, 0.4669, 0.5636])]


Check the values are zipped back:
 [(0, DenseVector([0.4913, 0.3527, 0.5543, 0.3716, 0.4346])), (0, DenseVector([0.618, 0.412, 0.1545, 0.206, 0.618])), (0, DenseVector([0.4169, 0.3335, 0.4235, 0.4669, 0.5636]))]


In [23]:
# I have the values back in the form (label,DenseVector(VectorValues))
# I may need to turn them into LabeledPoints ( Label, VectorValues)
# This is because Supervised Learning Machine Learning algorithms
# require this type of input to work on (Py)spark.

# From 'https://spark.apache.org/docs/2.2.0/mllib-data-types.html'
# 'A labeled point is a local vector, either dense or sparse, associated with a 
# label/response. In MLlib, labeled points are used in supervised learning algorithms. 
# We use a double to store a label, so we can use labeled points in both regression and 
# classification. For binary classification, a label should be either 0 (negative) or 1 
# (positive)' 


# Let's set the function starting with a small example
#a = test.keys().take(3)
#b = test.values().take(3)
#c = test.take(1)
    
# I think the problem is that once I map the RDD I cannot use the argument .keys and values anymore
# So I should probably work with index
    
    
# Define the function that returns a LabeledPoint from each of the lists contained in the rdd
# Then input this with a lambda
def Lab_points(c):
    c = LabeledPoint(c[0],c[1])
    return c
 
# Run the function 
TestLabPoint = test.map(Lab_points)
print('Check if Lab_points function worked:\n\n',TestLabPoint.take(3))

def Lab_points_rdd(rdd):
    def Lab_points(c):
        c = LabeledPoint(c[0],c[1])
        return c
    RDD = rdd.map(Lab_points)
    return RDD

# RUN THE FINAL FUNCTION that takes an rdd with (label,DenseVector) and returns LabeledPoints(label,vector)

test = Lab_points_rdd(test)
print('\n\nCheck if Lab_points_rdd function worked:\n\n',test.take(3))    

Check if Lab_points function worked:

 [LabeledPoint(0.0, [0.4912953308535948,0.3527248529205296,0.5542819117322608,0.37162082718412937,0.4346074080627954]), LabeledPoint(0.0, [0.6180314431495256,0.4120209620996838,0.1545078607873814,0.2060104810498419,0.6180314431495256]), LabeledPoint(0.0, [0.4168635654068489,0.3334908523254791,0.4235333824533585,0.46688719325567074,0.5635995404300597])]


Check if Lab_points_rdd function worked:

 [LabeledPoint(0.0, [0.4912953308535948,0.3527248529205296,0.5542819117322608,0.37162082718412937,0.4346074080627954]), LabeledPoint(0.0, [0.6180314431495256,0.4120209620996838,0.1545078607873814,0.2060104810498419,0.6180314431495256]), LabeledPoint(0.0, [0.4168635654068489,0.3334908523254791,0.4235333824533585,0.46688719325567074,0.5635995404300597])]


In [24]:
# NOW CREATE A FUNCTION THAT NORMALISE AND GIVES YOU VALUES BACK INTO A LABELEDPOINT LOCAL VECTOR

def Norm_and_LabPoint(rdd):
    rdd_val = rdd.values()
    normalizer1 = Normalizer() # set the normalizer 
    rdd_val = normalizer1.transform(rdd_val)
    rdd = rdd.keys().zip(rdd_val)
    rdd = Lab_points_rdd(rdd)
    print('\n\n\n\nData are now like:\n',rdd.take(3))
    return rdd

# Check on the Train data if it worked:
train = Norm_and_LabPoint(train)
print('\n\nCheck if function Norm_and_LabPoints worked:\n\n',train.take(2))





Data are now like:
 [LabeledPoint(0.0, [0.37532546250723536,0.7410271952065929,0.33683054327572404,0.24059324519694572,0.20209832596543442,0.3127712187560294]), LabeledPoint(0.0, [0.396315989816045,0.5584452583771543,0.4503590793364148,0.3512800818824035,0.33326571870894695,0.3062441739487621]), LabeledPoint(0.0, [0.27163990004069294,0.4074598500610394,0.611189775091559,0.4074598500610394,0.2910427500435996,0.3686541500552261])]


Check if function Norm_and_LabPoints worked:

 [LabeledPoint(0.0, [0.37532546250723536,0.7410271952065929,0.33683054327572404,0.24059324519694572,0.20209832596543442,0.3127712187560294]), LabeledPoint(0.0, [0.396315989816045,0.5584452583771543,0.4503590793364148,0.3512800818824035,0.33326571870894695,0.3062441739487621])]


---

# Allow Reproducibility of Results by Defining a Function 

In [25]:
# I need to take into consideration that I need to have the actions performed on both 
# Train and Test data at a certain point.
# At this regards, I am not sure the best way is run the function twice,
# But I think the possibility of splitting based on the number of subdirectories is simply 
# easy, and for this reason I sticked on that.

def Before_Modeling(directory, ntest, vect_dimension):
    print('Preparing to access to directory:', directory,'\n')
    store_rdd = access_and_import(directory)
    print('\n\nData accessed and imported correctly\n')
    train,test = split_train_test(store_rdd, ntest)
    print('\n\nTrain and Test split completed\n')
    train = remove_punct_tokenisation(train)
    test = remove_punct_tokenisation(test)
    print('\n\nPunctuation Removed and Tokenisation completed\n')
    train = vector_input(train, vect_dimension)
    test = vector_input(test, vect_dimension)
    print('\n\nTrain and Test turned into Vectors of Fixed Dimensions and Target labelled\n')
    train = Norm_and_LabPoint(train)
    test = Norm_and_LabPoint(test)
    print('\n\n Data Normalised and LabeledPoints Local Vector set\n')
    
    return train,test

# Try to verify if it worked ( Yes, At this point I am very afraid )
train,test = Before_Modeling('bare', 2 , 20)


# Ok, everything worked.

# Note that I have tried to add some print comment to give me the possibility to check easily
# at which stage the program was running.
# However, since the functions are run both for test and train I will have the information
# printed two times.
# From one aspect, this is ok because it allows me to verify that the operation have worked
# both for train and for test data.
# On the other side, this may create to much mess when printing the function and risk to not follow
# properly the results.

# I could overcome this issue by setting in each function a parameter print_comments = True
# and then set on the function something like
# if print_comments = True:
        #print('....')

# However, at this stage the operation is a little bit too long and I don't have much time to do that

Preparing to access to directory: bare 

Verify if the directory is properly set:
bare/part3
/Users/simonezanetti/Desktop/lingspam_public/bare/part3
Verify if the directory is properly set:
bare/part4
/Users/simonezanetti/Desktop/lingspam_public/bare/part4
Verify if the directory is properly set:
bare/part5
/Users/simonezanetti/Desktop/lingspam_public/bare/part5
Verify if the directory is properly set:
bare/part2
/Users/simonezanetti/Desktop/lingspam_public/bare/part2
Verify if the directory is properly set:
bare/part10
/Users/simonezanetti/Desktop/lingspam_public/bare/part10
Verify if the directory is properly set:
bare/part9
/Users/simonezanetti/Desktop/lingspam_public/bare/part9
Verify if the directory is properly set:
bare/part7
/Users/simonezanetti/Desktop/lingspam_public/bare/part7
Verify if the directory is properly set:
bare/part1
/Users/simonezanetti/Desktop/lingspam_public/bare/part1
Verify if the directory is properly set:
bare/part6
/Users/simonezanetti/Desktop/lingspam_pub



Look of data after tokenisation and punctuation removed:
 [('9-1296msg1', ['Subject', 'new', 'book', 'australian', 'language', 'australian', 'language', 'nordlinger', 'rachel', 'max', 'planck', 'institute', 'for', 'psycholinguistics', 'nijmegen', 'constructive', 'case', 'isbn', '1-57586', '-', '134', '-', '8', 'paper', '1-57586', '-', '135', '-', '6', 'cloth', 'csli', 'publications', '1998', 'http', '/', '/', 'csli-www', 'stanford', 'edu', '/', 'publications', '/', 'email', 'pubs', '@', 'roslin', 'stanford', 'edu', 'australian', 'aboriginal', 'languages', 'have', 'many', 'interesting', 'grammatical', 'characteristics', 'that', 'challenge', 'some', 'of', 'the', 'central', 'assumptions', 'of', 'current', 'linguistic', 'theory', 'these', 'languages', 'exhibit', 'many', 'unusual', 'morphosyntactic', 'characteristics', 'that', 'have', 'not', 'yet', 'been', 'adequately', 'incorporated', 'into', 'current', 'linguistic', 'theory', 'this', 'volume', 'focuses', 'on', 'the', 'complex', 'propert



Tokens after Vectorization:
 [[78, 56, 88, 59, 69]]


Tokens before Vectorization:
 [['Subject', 'new', 'book', 'australian', 'language', 'australian', 'language', 'nordlinger', 'rachel', 'max', 'planck', 'institute', 'for', 'psycholinguistics', 'nijmegen', 'constructive', 'case', 'isbn', '1-57586', '-', '134', '-', '8', 'paper', '1-57586', '-', '135', '-', '6', 'cloth', 'csli', 'publications', '1998', 'http', '/', '/', 'csli-www', 'stanford', 'edu', '/', 'publications', '/', 'email', 'pubs', '@', 'roslin', 'stanford', 'edu', 'australian', 'aboriginal', 'languages', 'have', 'many', 'interesting', 'grammatical', 'characteristics', 'that', 'challenge', 'some', 'of', 'the', 'central', 'assumptions', 'of', 'current', 'linguistic', 'theory', 'these', 'languages', 'exhibit', 'many', 'unusual', 'morphosyntactic', 'characteristics', 'that', 'have', 'not', 'yet', 'been', 'adequately', 'incorporated', 'into', 'current', 'linguistic', 'theory', 'this', 'volume', 'focuses', 'on', 'the', 'complex

---

# Classification Models

In [46]:
# Referenced this documents to define the codes and understand the concepts:
#'https://spark.apache.org/docs/latest/mllib-linear-methods.html'

from pyspark.mllib.classification import LogisticRegressionWithLBFGS, LogisticRegressionModel
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import NaiveBayes
from pyspark.mllib.classification import SVMWithSGD,SVMModel
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [28]:
# LOGISTIC REGRESSION

# 1. Build the model
model = LogisticRegressionWithLBFGS.train(train)

In [29]:
# 2. Evaluating the model on training data
labelsAndPreds = train.map(lambda p: (p.label, model.predict(p.features)))
# The documentation on 'https://spark.apache.org/docs/latest/mllib-linear-methods.html' identifies
# the error as following. However, I prefer to identify the correct answers in order to define an accuracy 
# trainErr = labelsAndPreds.filter(lambda lp: lp[0] != lp[1]).count() / float(parsedData.count())

correct = labelsAndPreds.filter(lambda lp: lp[0] == lp[1]).count() / float(train.count())
print("Accuracy = " + str(correct))

# 3. Evaluating the model on test data
labelsAndPreds = test.map(lambda p: (p.label, model.predict(p.features)))
correct = labelsAndPreds.filter(lambda lp: lp[0] == lp[1]).count() / float(test.count())
print("Accuracy = " + str(correct))

Accuracy = 0.8997840172786177
Accuracy = 0.8719723183391004


In [30]:
# Define a function that automatically iterates the process

def train_logistic(train,test):
    model = LogisticRegressionWithLBFGS.train(train)
    labelsAndPredsTrain = train.map(lambda p: (p.label, model.predict(p.features)))
    correctTrain = labelsAndPredsTrain.filter(lambda lp: lp[0] == lp[1]).count() / float(train.count())
    labelsAndPredsTest = test.map(lambda p: (p.label, model.predict(p.features)))
    correctTest = labelsAndPredsTest.filter(lambda lp: lp[0] == lp[1]).count() / float(test.count())
    print("Logistic Regression Accuracy:\n. Training data = " + str(correctTrain),'\n. Test data = ',str(correctTest))
    return model

In [31]:
# Attempt to run the function 
a = train_logistic(train,test)

Logistic Regression Accuracy:
. Training data = 0.8997840172786177 
. Test data =  0.8719723183391004


In [49]:
# SUPPORT VECTOR MACHINE

# 1. Build the model
model = SVMWithSGD.train(train)
# 2. Evaluating the model on training data
labelsAndPreds = train.map(lambda p: (p.label, model.predict(p.features)))

correct = labelsAndPreds.filter(lambda lp: lp[0] == lp[1]).count() / float(train.count())
print("Accuracy = " + str(correct))

# 3. Evaluating the model on test data
labelsAndPreds = test.map(lambda p: (p.label, model.predict(p.features)))
correct = labelsAndPreds.filter(lambda lp: lp[0] == lp[1]).count() / float(test.count())
print("Accuracy = " + str(correct))



# I realise it matches everything as 0.
# This may be cause data are highly unbalanced.
# I should try to balance the data with ex. 

Accuracy = 0.8336933045356372


In [32]:
# Naive Bayes

def train_naive(train,test):
    model = NaiveBayes.train(train)
    labelsAndPredsTrain = train.map(lambda p: (p.label, model.predict(p.features)))
    correctTrain = labelsAndPredsTrain.filter(lambda lp: lp[0] == lp[1]).count() / float(train.count())
    labelsAndPredsTest = test.map(lambda p: (p.label, model.predict(p.features)))
    correctTest = labelsAndPredsTest.filter(lambda lp: lp[0] == lp[1]).count() / float(test.count())
    print("Naive Bayes Accuracy:\n. Training data = " + str(correctTrain),'\n. Test data = ',str(correctTest))
    return model

# Actually I realised the structure of the codes is identical for both Logistic Regression Pipeline and Naive bayes.
# So I could define a function splitting Model Creation and model evaluation to automate it.
# This element will be taken into consideration in a future analysis.

In [33]:
b = train_naive(train,test)

Naive Bayes Accuracy:
. Training data = 0.8336933045356372 
. Test data =  0.8339100346020761


In [47]:
# Support Vector Machine 

def train_svm(train,test):
    model = SVMWithSGD.train(train, iterations= 50)
    labelsAndPredsTrain = train.map(lambda p: (p.label, model.predict(p.features)))
    correctTrain = labelsAndPredsTrain.filter(lambda lp: lp[0] == lp[1]).count() / float(train.count())
    labelsAndPredsTest = test.map(lambda p: (p.label, model.predict(p.features)))
    correctTest = labelsAndPredsTest.filter(lambda lp: lp[0] == lp[1]).count() / float(test.count())
    print("Support Vector Machine Accuracy:\n. Training data = " + str(correctTrain),'\n. Test data = ',str(correctTest))
    return model

In [48]:
c = train_svm(train,test)

Support Vector Machine Accuracy:
. Training data = 0.8336933045356372 
. Test data =  0.8339100346020761


In [41]:
# Create a function that automatically iterates over the 3 models

def try_models(train,test):
    print('Training and Testing different models:\n')
    a = train_logistic(train,test)
    print(' - - - - - - - - - - - - - - - - - - - \n') # Create some space
    b = train_naive(train,test)
    print(' - - - - - - - - - - - - - - - - - - - \n')
    c = train_svm(train,test)
    print(' - - - - - - - - - - - - - - - - - - - \n')
    
    return [a,b,c]

In [42]:
# Train and Test the 3 different Models 
a,b,c = try_models(train,test)

Training and Testing different models:


Logistic Regression Accuracy:
. Training data = 0.8997840172786177 
. Test data =  0.8719723183391004
 - - - - - - - - - - - - - - - - - - - 

Naive Bayes Accuracy:
. Training data = 0.8336933045356372 
. Test data =  0.8339100346020761
 - - - - - - - - - - - - - - - - - - - 

Support Vector Machine Accuracy:
. Training data = 0.8336933045356372 
. Test data =  0.8339100346020761
 - - - - - - - - - - - - - - - - - - - 



In [55]:
# Try to modify the size of the Vector in order to verify the evolution of the accuracy over this modification
# I modify the size of the test data, including only one directory, allowing more messages to be included in the training set.
# I could also avoid to have to reimport everything, and directly start from the phase in which I vectorize the texts.


def modify_vector_size(directory):
    a = [100,500,1000,1500] # I could also put values lower than 20 but this would not make sense, and I believe the same
                               # regards trying values close to that. I will start with 100 and move above.
                               # values higher than 500/1000 but they may be highly expensive 
                               # This will create lots of mess in my notebook while running cause it will print 
    for x in a:
        print('Attempt with VectorSize:',x)
        train,test = Before_Modeling(directory, 1 , x) 
        a,b,c = try_models(train,test)
        print(' - - - - - - - - - -  - - - - - - - - \n')
    # I am not interested into them to be returned, as I am interested in 
    # this to print the different accuracies

In [None]:
# This is going to lead to an excessively verbose output. I have already mention the necessity of defining in the 
# function a parameter 'print = True' ( see the conclusions for more ).
# on the other side, this has its advantages, since it allows to check if any mistake has been done,
# and also check the shape of the vector.
# ( APOLOGIES for the mess you are going to see !) 

modify_vector_size('bare')

Attempt with VectorSize: 100
Preparing to access to directory: bare 

Verify if the directory is properly set:
bare/part3
/Users/simonezanetti/Desktop/lingspam_public/bare/part3
Verify if the directory is properly set:
bare/part4
/Users/simonezanetti/Desktop/lingspam_public/bare/part4
Verify if the directory is properly set:
bare/part5
/Users/simonezanetti/Desktop/lingspam_public/bare/part5
Verify if the directory is properly set:
bare/part2
/Users/simonezanetti/Desktop/lingspam_public/bare/part2
Verify if the directory is properly set:
bare/part10
/Users/simonezanetti/Desktop/lingspam_public/bare/part10
Verify if the directory is properly set:
bare/part9
/Users/simonezanetti/Desktop/lingspam_public/bare/part9
Verify if the directory is properly set:
bare/part7
/Users/simonezanetti/Desktop/lingspam_public/bare/part7
Verify if the directory is properly set:
bare/part1
/Users/simonezanetti/Desktop/lingspam_public/bare/part1
Verify if the directory is properly set:
bare/part6
/Users/simon



Look of data after tokenisation and punctuation removed:
 [('8-1064msg1', ['Subject', 're', '8', '1044', 'disc', 'grammar', 'in', 'schools', 're', 'message', 'from', 'linguist', '@', 'linguistlist', 'org', '>', '>', 'linguist', 'list', 'vol-8', '-', '1044', 'sat', 'jul', '12', '1997', 'issn', '1068-4875', '>', '>', 'subject', '8', '1044', 'disc', 'grammar', 'in', 'schools', '>', '>', 'i', 'know', 'and', 'teach', 'that', 'not', 'all', 'infinitives', 'contain', '`', 'to', "'", 'i', 'also', 'give', '>', 'the', 'students', 'examples', 'e', 'g', '`', 'i', 'asked', 'him', 'to', 'kindly', 'apologise', "'", 'where', '>', 'placing', 'the', 'adverb', 'anywhere', 'else', 'would', 'cause', 'ambiguity', '>', '>', 'jennifer', 'chew', 'an', 'example', 'i', 'once', 'concocted', 'to', 'justify', 'splitting', 'the', 'infintive', 'or', 'not', 'as', 'the', 'case', 'may', 'be', 'is', 'a', 'after', 'a', 'heavy', 'meal', 'i', 'prepared', 'slowly', 'to', 'go', 'home', 'digesting', 'b', 'after', 'a', 'heavy'



Tokens after Vectorization:
 [[78, 56, 88, 59, 69]]


Train and Test turned into Vectors of Fixed Dimensions and Target labelled





Data are now like:
 [LabeledPoint(0.0, [0.07762601334258107,0.23287804002774323,0.0,0.029109755003467904,0.038813006671290534,0.07762601334258107,0.0,0.038813006671290534,0.029109755003467904,0.019406503335645267,0.06792276167475844,0.8635893984362144,0.048516258339113175,0.019406503335645267,0.06792276167475844,0.0,0.019406503335645267,0.06792276167475844,0.019406503335645267,0.06792276167475844,0.029109755003467904,0.029109755003467904,0.029109755003467904,0.0,0.038813006671290534,0.019406503335645267,0.048516258339113175,0.019406503335645267,0.0,0.048516258339113175,0.0,0.038813006671290534,0.019406503335645267,0.029109755003467904,0.009703251667822634,0.009703251667822634,0.0,0.029109755003467904,0.0,0.05821951000693581,0.08732926501040371,0.038813006671290534,0.0,0.009703251667822634,0.019406503335645267,0.0,0.009703251667822634,0.0485162583391131

Logistic Regression Accuracy:
. Training data = 0.9746543778801844 
. Test data =  0.9550173010380623
 - - - - - - - - - - - - - - - - - - - 

Naive Bayes Accuracy:
. Training data = 0.8337173579109063 
. Test data =  0.8339100346020761
 - - - - - - - - - - - - - - - - - - - 

Support Vector Machine Accuracy:
. Training data = 0.8337173579109063 
. Test data =  0.8339100346020761
 - - - - - - - - - - - - - - - - - - - 

 - - - - - - - - - -  - - - - - - - - 

Attempt with VectorSize: 500
Preparing to access to directory: bare 

Verify if the directory is properly set:
bare/part3
/Users/simonezanetti/Desktop/lingspam_public/bare/part3
Verify if the directory is properly set:
bare/part4
/Users/simonezanetti/Desktop/lingspam_public/bare/part4
Verify if the directory is properly set:
bare/part5
/Users/simonezanetti/Desktop/lingspam_public/bare/part5
Verify if the directory is properly set:
bare/part2
/Users/simonezanetti/Desktop/lingspam_public/bare/part2
Verify if the directory is properl



Look of data after tokenisation and punctuation removed:
 [('8-1064msg1', ['Subject', 're', '8', '1044', 'disc', 'grammar', 'in', 'schools', 're', 'message', 'from', 'linguist', '@', 'linguistlist', 'org', '>', '>', 'linguist', 'list', 'vol-8', '-', '1044', 'sat', 'jul', '12', '1997', 'issn', '1068-4875', '>', '>', 'subject', '8', '1044', 'disc', 'grammar', 'in', 'schools', '>', '>', 'i', 'know', 'and', 'teach', 'that', 'not', 'all', 'infinitives', 'contain', '`', 'to', "'", 'i', 'also', 'give', '>', 'the', 'students', 'examples', 'e', 'g', '`', 'i', 'asked', 'him', 'to', 'kindly', 'apologise', "'", 'where', '>', 'placing', 'the', 'adverb', 'anywhere', 'else', 'would', 'cause', 'ambiguity', '>', '>', 'jennifer', 'chew', 'an', 'example', 'i', 'once', 'concocted', 'to', 'justify', 'splitting', 'the', 'infintive', 'or', 'not', 'as', 'the', 'case', 'may', 'be', 'is', 'a', 'after', 'a', 'heavy', 'meal', 'i', 'prepared', 'slowly', 'to', 'go', 'home', 'digesting', 'b', 'after', 'a', 'heavy'



Tokens after Vectorization:
 [[78, 56, 88, 59, 69]]


Train and Test turned into Vectors of Fixed Dimensions and Target labelled





Data are now like:
 [LabeledPoint(0.0, [0.0,0.01040256683797447,0.0,0.0,0.0,0.0,0.0,0.01040256683797447,0.0,0.01040256683797447,0.0,0.02080513367594894,0.0,0.0,0.031207700513923405,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01040256683797447,0.0,0.0,0.02080513367594894,0.0,0.0,0.0,0.01040256683797447,0.0,0.0,0.02080513367594894,0.0,0.0,0.0,0.0,0.031207700513923405,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.031207700513923405,0.02080513367594894,0.01040256683797447,0.0,0.01040256683797447,0.09362310154177023,0.031207700513923405,0.02080513367594894,0.0,0.04161026735189788,0.01040256683797447,0.0,0.0,0.01040256683797447,0.0,0.0,0.01040256683797447,0.01040256683797447,0.0,0.0,0.01040256683797447,0.0,0.0,0.07281796786582129,0.0,0.05201283418987235,0.0,0.0,0.01040256683797447,0.02080513367594894,0.01040256683797447,0.0,0.0,0.14563593573164257,0.01040256683797447,0.0208051336





Data are now like:
 [LabeledPoint(0.0, [0.0,0.0,0.0,0.08695652173913043,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2391304347826087,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.021739130434782608,0.0,0.0,0.043478260869565216,0.0,0.021739130434782608,0.0,0.15217391304347827,0.0,0.0,0.0,0.0,0.0,0.0,0.021739130434782608,0.021739130434782608,0.0,0.0,0.0,0.021739130434782608,0.0,0.0,0.0,0.021739130434782608,0.021739130434782608,0.0,0.0,0.0,0.0,0.1956521739130435,0.0,0.0,0.0,0.043478260869565216,0.0,0.0,0.021739130434782608,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.021739130434782608,0.021739130434782608,0.021739130434782608,0.0,0.06521739130434782,0.0,0.21739130434782608,0.0,0.021739130434782608,0.0,0.021739130434782608,0.0,0.021739130434782608,0.06521739130434782,0.0,0.0,0.08695652173913043,0.021739130434782608,0.021739130434782608,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.043478260869565216,0.0,0.0,0.0,0.021739130434782608,0.0,0.15217391304347827,0.0,0.0,0.021739130434782608,0.043478260869565216,0.0,0.0217391304

Logistic Regression Accuracy:
. Training data = 1.0 
. Test data =  0.9688581314878892
 - - - - - - - - - - - - - - - - - - - 

Naive Bayes Accuracy:
. Training data = 0.8529185867895546 
. Test data =  0.8442906574394463
 - - - - - - - - - - - - - - - - - - - 

Support Vector Machine Accuracy:
. Training data = 0.8337173579109063 
. Test data =  0.8339100346020761
 - - - - - - - - - - - - - - - - - - - 

 - - - - - - - - - -  - - - - - - - - 

Attempt with VectorSize: 1000
Preparing to access to directory: bare 

Verify if the directory is properly set:
bare/part3
/Users/simonezanetti/Desktop/lingspam_public/bare/part3
Verify if the directory is properly set:
bare/part4
/Users/simonezanetti/Desktop/lingspam_public/bare/part4
Verify if the directory is properly set:
bare/part5
/Users/simonezanetti/Desktop/lingspam_public/bare/part5
Verify if the directory is properly set:
bare/part2
/Users/simonezanetti/Desktop/lingspam_public/bare/part2
Verify if the directory is properly set:
bare/pa



Look of data after tokenisation and punctuation removed:
 [('8-1064msg1', ['Subject', 're', '8', '1044', 'disc', 'grammar', 'in', 'schools', 're', 'message', 'from', 'linguist', '@', 'linguistlist', 'org', '>', '>', 'linguist', 'list', 'vol-8', '-', '1044', 'sat', 'jul', '12', '1997', 'issn', '1068-4875', '>', '>', 'subject', '8', '1044', 'disc', 'grammar', 'in', 'schools', '>', '>', 'i', 'know', 'and', 'teach', 'that', 'not', 'all', 'infinitives', 'contain', '`', 'to', "'", 'i', 'also', 'give', '>', 'the', 'students', 'examples', 'e', 'g', '`', 'i', 'asked', 'him', 'to', 'kindly', 'apologise', "'", 'where', '>', 'placing', 'the', 'adverb', 'anywhere', 'else', 'would', 'cause', 'ambiguity', '>', '>', 'jennifer', 'chew', 'an', 'example', 'i', 'once', 'concocted', 'to', 'justify', 'splitting', 'the', 'infintive', 'or', 'not', 'as', 'the', 'case', 'may', 'be', 'is', 'a', 'after', 'a', 'heavy', 'meal', 'i', 'prepared', 'slowly', 'to', 'go', 'home', 'digesting', 'b', 'after', 'a', 'heavy'



Tokens after Vectorization:
 [[78, 56, 88, 59, 69]]


Train and Test turned into Vectors of Fixed Dimensions and Target labelled





Data are now like:
 [LabeledPoint(0.0, [0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.010457024970856451,0.0,0.010457024970856451,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.010457024970856451,0.0,0.0,0.010457024970856451,0.0,0.0,0.0,0.0,0.0,0.0,0.020914049941712903,0.0,0.0,0.0,0.0,0.010457024970856451,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.020914049941712903,0.0,0.0,0.010457024970856451,0.09411322473770806,0.020914049941712903,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.010457024970856451,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.020914049941712903,0.0,0.0,0.0,0.1463983495919903,0.0,0.020914049941712903,0.0,0.0,0.0,0.0,0.0,0.0627421498251387,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.010457024970856451,0.0,0.010457024970856451,0.010457024970856451,0.1882264494754161,0.0,0.0,0.010457024970856451,0.07319917479599515,0.0,0.0,0.0,0.0,0.05228512485428225,0.9097611724645





Data are now like:
 [LabeledPoint(0.0, [0.0,0.0,0.0,0.06661733875264914,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.22205779584216379,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02220577958421638,0.0,0.0,0.04441155916843276,0.0,0.02220577958421638,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02220577958421638,0.02220577958421638,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1998520162579474,0.0,0.0,0.0,0.0,0.0,0.0,0.02220577958421638,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02220577958421638,0.0,0.0,0.0,0.22205779584216379,0.0,0.02220577958421638,0.0,0.02220577958421638,0.0,0.0,0.06661733875264914,0.0,0.0,0.0,0.02220577958421638,0.02220577958421638,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.04441155916843276,0.0,0.0,0.0,0.02220577958421638,0.0,0.15544045708951465,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02220577958421638,0.0,0.0,0.02220577958421638,0.0,0.0,0.0,0.0,0.06661733875264914,0.0,0.0,0.0,0.0,0.0,0.0,0.02220577958421638,0.02220577958421638,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.06661733875

Logistic Regression Accuracy:
. Training data = 1.0 
. Test data =  0.9688581314878892
 - - - - - - - - - - - - - - - - - - - 

Naive Bayes Accuracy:
. Training data = 0.8870967741935484 
. Test data =  0.8615916955017301
 - - - - - - - - - - - - - - - - - - - 

Support Vector Machine Accuracy:
. Training data = 0.8337173579109063 
. Test data =  0.8339100346020761
 - - - - - - - - - - - - - - - - - - - 

 - - - - - - - - - -  - - - - - - - - 

Attempt with VectorSize: 1500
Preparing to access to directory: bare 

Verify if the directory is properly set:
bare/part3
/Users/simonezanetti/Desktop/lingspam_public/bare/part3
Verify if the directory is properly set:
bare/part4
/Users/simonezanetti/Desktop/lingspam_public/bare/part4
Verify if the directory is properly set:
bare/part5
/Users/simonezanetti/Desktop/lingspam_public/bare/part5
Verify if the directory is properly set:
bare/part2
/Users/simonezanetti/Desktop/lingspam_public/bare/part2
Verify if the directory is properly set:
bare/pa



Look of data after tokenisation and punctuation removed:
 [('8-1064msg1', ['Subject', 're', '8', '1044', 'disc', 'grammar', 'in', 'schools', 're', 'message', 'from', 'linguist', '@', 'linguistlist', 'org', '>', '>', 'linguist', 'list', 'vol-8', '-', '1044', 'sat', 'jul', '12', '1997', 'issn', '1068-4875', '>', '>', 'subject', '8', '1044', 'disc', 'grammar', 'in', 'schools', '>', '>', 'i', 'know', 'and', 'teach', 'that', 'not', 'all', 'infinitives', 'contain', '`', 'to', "'", 'i', 'also', 'give', '>', 'the', 'students', 'examples', 'e', 'g', '`', 'i', 'asked', 'him', 'to', 'kindly', 'apologise', "'", 'where', '>', 'placing', 'the', 'adverb', 'anywhere', 'else', 'would', 'cause', 'ambiguity', '>', '>', 'jennifer', 'chew', 'an', 'example', 'i', 'once', 'concocted', 'to', 'justify', 'splitting', 'the', 'infintive', 'or', 'not', 'as', 'the', 'case', 'may', 'be', 'is', 'a', 'after', 'a', 'heavy', 'meal', 'i', 'prepared', 'slowly', 'to', 'go', 'home', 'digesting', 'b', 'after', 'a', 'heavy'



Tokens after Vectorization:
 [[78, 56, 88, 59, 69]]


Train and Test turned into Vectors of Fixed Dimensions and Target labelled





Data are now like:
 [LabeledPoint(0.0, [0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0104903441482036,0.0,0.0,0.0,0.0,0.0209806882964072,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0104903441482036,0.0,0.0,0.0104903441482036,0.0,0.0,0.0,0.0104903441482036,0.0,0.0,0.0209806882964072,0.0,0.0,0.0,0.0,0.0104903441482036,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0104903441482036,0.0,0.0,0.0,0.0,0.0104903441482036,0.0,0.0,0.0419613765928144,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0104903441482036,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0104903441482036,0.0,0.0,0.0,0.0,0.0,0.0,0.0209806882964072,0.0,0.0,0.0,0.0104903441482036,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0104903441482036,0.0104903441482036,0.0104903441482036,0.0,0.0,0.0,0.0104903441482036,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0104903441482036,0.0209806882964072,0.0,0.0,0.0,0.0,0.0,0.0104903441482036,0.0209806882964072,0.0,0.0,0.





Data are now like:
 [LabeledPoint(0.0, [0.0,0.0,0.0,0.09081532183729997,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.045407660918649985,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.022703830459324992,0.0,0.0,0.0,0.022703830459324992,0.0,0.0,0.0,0.0,0.022703830459324992,0.0,0.0,0.0,0.0,0.06811149137797498,0.0,0.0,0.0,0.045407660918649985,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.022703830459324992,0.0,0.022703830459324992,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.022703830459324992,0.0,0.022703830459324992,0.0,0.0,0.0,0.0,0.0,0.022703830459324992,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.045407660918649985,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.045407660918649985,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.022703830459324992,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.022703830459324992,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.15892681321527494,0.0,0.0,0.0,0.0,0.06811149137797498,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.022703830459324992,0.0,0.0,0

Logistic Regression Accuracy:
. Training data = 1.0 
. Test data =  0.9619377162629758
 - - - - - - - - - - - - - - - - - - - 

Naive Bayes Accuracy:
. Training data = 0.8940092165898618 
. Test data =  0.8685121107266436
 - - - - - - - - - - - - - - - - - - - 



---

### To overcome my mistake, I am going to manually paste the result in here:
In the future.. ( see conclusions ) 

**VECTOR SIZE**: 100

Logistic Regression Accuracy:
. Training data = 0.9746543778801844 
. Test data =  0.9550173010380623
 - - - - - - - - - - - - - - - - - - - 

Naive Bayes Accuracy:
. Training data = 0.8337173579109063 
. Test data =  0.8339100346020761
 - - - - - - - - - - - - - - - - - - - 

Support Vector Machine Accuracy:
. Training data = 0.8337173579109063 
. Test data =  0.8339100346020761
 - - - - - - - - - - - - - - - - - - - 



 
**VECTOR SIZE: 500** <- *******
 
**Logistic Regression Accuracy:
. Training data = 1.0 
. Test data =  0.9688581314878892**       <- ********************************
 - - - - - - - - - - - - - - - - - - - 

Naive Bayes Accuracy:
. Training data = 0.8529185867895546 
. Test data =  0.8442906574394463
 - - - - - - - - - - - - - - - - - - - 

Support Vector Machine Accuracy:
. Training data = 0.8337173579109063 
. Test data =  0.8339100346020761
 - - - - - - - - - - - - - - - - - - - 


**VECTOR SIZE**: 1000
 
 Logistic Regression Accuracy:
. Training data = 1.0 
. Test data =  0.9688581314878892
 - - - - - - - - - - - - - - - - - - - 

Naive Bayes Accuracy:
. Training data = 0.8870967741935484 
. Test data =  0.8615916955017301
 - - - - - - - - - - - - - - - - - - - 

Support Vector Machine Accuracy:
. Training data = 0.8337173579109063 
. Test data =  0.8339100346020761
 - - - - - - - - - - - - - - - - - - - 
 
 
**VECTOR SIZE**: 1500
 

Logistic Regression Accuracy:
. Training data = 1.0 
. Test data =  0.9619377162629758
 - - - - - - - - - - - - - - - - - - - 

Naive Bayes Accuracy:
. Training data = 0.8940092165898618 
. Test data =  0.8685121107266436
 - - - - - - - - - - - - - - - - - - -
 Support Vector Machine Accuracy:

 

---

# Conclusion