* How to preprocess text using the following techniques:
    - Stopwords removal
    - Tokenization
    - Stemming
    - Lemmatization
* How to build a spam detector using one of the following models:
    - Bag-of-words model
    - TF-IDF model

## Word Frequencies and Stopwords

A text is made of characters, words, sentences and paragraphs. The most basic statistical analysis you can do is to look at the word frequency distribution, i.e., visualising the word frequencies of a given text corpus.

To summarise, Zipf’s law (discovered by the linguist-statistician George Zipf) states that the frequency of a word is inversely proportional to the word’s rank, where rank 1 is given to the most frequent word, rank 2 to the second most frequent, and so on. This is also called the <b>power law distribution</b>.


According to Zipf’s law, the frequency of a given word is dependent on the inverse of its rank.

 

$$f(r, \alpha) \propto \frac{1}{r^{\alpha}}$$

where:
* α ≈ 1
* r = rank of a word
* f(r, α) = frequency in the corpus
 

If the most frequent term occurs $cf_1$ times, then the second most frequent term has half as many occurrences, the third most frequent term a third as many occurrences, and so on. The intuition is that frequency decreases very rapidly with rank. The relation above is one of the simplest ways of formalizing such a rapid decrease, and it has been found to be a reasonably good model.

Source: https://nlp.stanford.edu/IR-book/html/htmledition/zipfs-law-modeling-the-distribution-of-terms-1.html
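
As a rough illustration of the relation above, the sketch below computes the frequencies an ideal Zipf distribution would predict. The helper name `zipf_expected_frequency` and the starting count of 1,200 are made up for this example; real corpora follow the law only approximately.

```python
def zipf_expected_frequency(top_frequency, rank, alpha=1.0):
    """Frequency predicted for a given rank under an ideal Zipf law.

    top_frequency is the count of the most frequent (rank 1) word.
    """
    if rank < 1:
        raise ValueError("rank starts at 1")
    return top_frequency / rank ** alpha


# Hypothetical corpus whose most frequent word occurs 1,200 times
for rank in range(1, 6):
    print(rank, zipf_expected_frequency(1200, rank))
```

With α = 1, the predicted counts for ranks 1 to 5 are 1200, 600, 400, 300 and 240, which matches the "half as many, a third as many" pattern described above.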


Zipf’s law helps us form the basic intuition for stopwords: these words have the highest frequencies (or lowest ranks) in a text and are typically of limited importance.

Broadly, there are three kinds of words present in any text corpus:

* Highly frequent words called stopwords, such as ‘is’, ‘an’ and ‘the’.
* Significant words, which are typically more important to understand the text.
* Rarely occurring words, which are again less important than significant words.

Generally speaking, stopwords are removed from the text for two reasons:

* They provide no useful information, especially in applications such as spam detectors or search engines. Therefore, you’re going to remove stopwords from the spam data set.
* Because stopwords occur so frequently, removing them shrinks the data considerably, which speeds up computation. It also leaves fewer features for the model.

However, there are exceptions when these words should not be removed. In the next module, you’ll learn concepts such as POS (part-of-speech) tagging and parsing, where stopwords are preserved because they provide meaningful (grammatical) information in those applications. Generally, stopwords are removed unless they prove to be helpful in your application or analysis.

On the other hand, you won’t remove the rarely occurring words because they might provide useful information for spam detection. Also, removing them provides no added efficiency in computation since their frequency is so low.
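
The effect of stopword removal on a frequency distribution can be sketched with the standard library alone. The sentence and the tiny stopword set below are invented for illustration; NLTK's English list is much longer.

```python
from collections import Counter

# Toy text and a hand-picked stopword set (both are illustrative assumptions)
text = "the cat sat on the mat and the dog sat on the rug"
toy_stopwords = {"the", "on", "and", "is", "an"}

words = text.split()
all_counts = Counter(words)

# Keep only the words that are not in the stopword set
content_words = [word for word in words if word not in toy_stopwords]
content_counts = Counter(content_words)

print(all_counts.most_common(3))
print(content_counts.most_common(1))
print(len(words), len(content_words))
```

In this toy case the stopwords account for 7 of the 13 words, so removing them roughly halves the data while keeping the words that carry the content.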

The following cells build a frequency distribution from a text corpus and remove stopwords in Python using the NLTK library.

In [1]:
import requests
from nltk import FreqDist

# load the ebook
url = "https://www.gutenberg.org/files/16/16-0.txt"
# verify=False skips TLS certificate verification for this request
peter_pan = requests.get(url, verify=False).text

# break the book into words using the split() method
peter_pan_words = peter_pan.split()

# build the frequency distribution using NLTK's FreqDist() function
word_frequency = FreqDist(peter_pan_words)

# extract the frequency of the third most frequent word
freq = word_frequency.most_common(3)[2][1]

# print the frequency of the third most frequent word
print(freq)



1206


In [2]:
import requests
from nltk import FreqDist
from nltk.corpus import stopwords

# load the ebook
url = "https://www.gutenberg.org/files/16/16-0.txt"
peter_pan = requests.get(url, verify=False).text

# break the book into words using the split() method
peter_pan_words = peter_pan.split()

# extract the NLTK English stopword list (needs nltk.download('stopwords') once);
# a separate name avoids shadowing the imported 'stopwords' module
english_stopwords = set(stopwords.words('english'))

# remove stopwords from 'peter_pan_words'
no_stops = [word for word in peter_pan_words if word not in english_stopwords]

# create word frequency of no_stops
word_frequency = FreqDist(no_stops)

# extract the frequency of the most frequent remaining word
frequency = word_frequency.most_common(1)[0][1]

# print that frequency
print(frequency)

# Tokenization

In [3]:
document = "At nine o'clock I visited him myself. It looks like religious mania, and he'll soon think that he himself is God."

In [4]:
from nltk.tokenize import word_tokenize
words = word_tokenize(document)
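
To see why a dedicated tokenizer is used instead of `split()`, the sketch below compares whitespace splitting with a simple regular-expression rule on the same sentence. The regex is a simplified stand-in written for this example; it is not the algorithm NLTK's `word_tokenize` uses, and its output will differ from NLTK's on contractions.

```python
import re

document = ("At nine o'clock I visited him myself. It looks like religious "
            "mania, and he'll soon think that he himself is God.")

# Whitespace splitting leaves punctuation glued to the neighbouring word
split_tokens = document.split()

# Simplified rule: a word (optionally with one internal apostrophe) or a
# single punctuation mark. An approximation, not NLTK's algorithm.
regex_tokens = re.findall(r"\w+(?:'\w+)?|[^\w\s]", document)

print(split_tokens[6])
print(regex_tokens[6:8])
print(len(split_tokens), len(regex_tokens))
```

With `split()`, the seventh token is `myself.` with the full stop attached, so `myself` and `myself.` would count as different words. The regex rule separates the punctuation into its own token.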

# Bag of words

The next step is to represent text in a format that you can feed into machine learning algorithms. The most common approach is to create a bag-of-words representation of your text data. The central idea is that any given piece of text (a tweet, article, message, email and so on) can be ‘represented’ by a list of all the words that occur in it (after removing the stopwords), where the order of occurrence does not matter.

You can create ‘bags’ to represent each of the messages in your training and test data sets. But how do you go from these bags to building a spam classifier?


Let’s say for most of the spam messages, the bags contain words such as prize and lottery, and most of the ham bags do not. Now, when you run into a new message, look at its ‘bag-of-words’ representation. Does the bag for this message resemble that of messages you already know as spam, or does it not resemble them? Based on the answer to the previous question, you can then classify the message.
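
One simple way to make "resemble" concrete is to measure the overlap between bags. The sketch below uses Jaccard similarity on word sets. Both the measure and the toy messages are assumptions chosen for illustration, and this is not how the ML algorithms mentioned in this section work internally.

```python
def jaccard(bag_a, bag_b):
    """Share of distinct words that two bags have in common (0.0 to 1.0)."""
    set_a, set_b = set(bag_a), set(bag_b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


# Toy bags (invented for this example)
known_spam = "win free prize lottery claim".split()
known_ham = "see you at lunch tomorrow".split()
new_message = "claim your free lottery prize".split()

spam_score = jaccard(new_message, known_spam)
ham_score = jaccard(new_message, known_ham)

print(spam_score, ham_score)
print("spam" if spam_score > ham_score else "ham")
```

The new message shares four of six distinct words with the spam bag and none with the ham bag, so this naive rule labels it as spam.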

 

The next question is, how do you get a machine to do all that? Well, it turns out that for doing that, you need to represent all the bags in a matrix format, after which you can use ML algorithms such as Naive Bayes, logistic regression and SVM to do the final classification.

 

In this matrix representation, each document sits on a separate row, and each word of the vocabulary has its own column. The matrix can then be used to train machine learning models. The vocabulary words are also called the <b>features</b> of the text.

The bag-of-words representation is also called the bag-of-words model, but this is not to be confused with a machine learning model. A bag-of-words model is just the matrix that you get from text data.

Another thing to note is that the cells of the matrix can be filled in either of two ways:

* Fill the cell with the frequency of a word (i.e., a cell can have a value of 0 or more)
* Fill the cell with either 0, in case the word is not present, or 1, in case the word is present (binary format)

Both approaches work fine and do not usually result in a big difference. The frequency approach is slightly more popular, and scikit-learn's `CountVectorizer`, which is used below, also fills the bag-of-words model with word frequencies by default rather than binary 0 or 1 values.
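
The two ways of filling the cells can be compared side by side with a hand-rolled sketch. The two documents are invented for this example, and the code uses only the standard library.

```python
from collections import Counter

# Two toy documents (invented for this example)
documents = ["free prize prize", "call me"]

# Vocabulary: every distinct word, sorted so the column order is stable
vocabulary = sorted({word for doc in documents for word in doc.split()})

count_matrix = []
binary_matrix = []
for doc in documents:
    counts = Counter(doc.split())
    # Frequency format: how many times each vocabulary word occurs
    count_matrix.append([counts[word] for word in vocabulary])
    # Binary format: 1 if the word occurs at all, otherwise 0
    binary_matrix.append([1 if counts[word] else 0 for word in vocabulary])

print(vocabulary)
print(count_matrix)
print(binary_matrix)
```

The only cell that differs is the one for `prize` in the first document: 2 in the frequency format and 1 in the binary format.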

In [5]:
string1 = "there was one place on my ankle that was itching"
string2 = "but you did not scratch it"
string3 = "and then my ear began to itch"
string4 = "and next to my back"

In [6]:
main = " ".join([string1, string2, string3, string4])

In [7]:
main

'there was one place on my ankle that was itching but you did not scratch it and then my ear began to itch and next to my back'

In [8]:
ls = main.split()

In [9]:
len(ls)

28

In [10]:
len(set(ls))

23
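
Treating the four strings above as four separate documents gives a small worked bag-of-words matrix: 4 rows (documents) by 23 columns (distinct words). This is a hand-rolled sketch using only the standard library, with illustrative variable names. It is not how any particular library builds the matrix.

```python
from collections import Counter

documents = [
    "there was one place on my ankle that was itching",
    "but you did not scratch it",
    "and then my ear began to itch",
    "and next to my back",
]

# Vocabulary: every distinct word across all documents, in sorted order
vocabulary = sorted({word for doc in documents for word in doc.split()})

# One row per document, one column per vocabulary word, filled with counts
matrix = []
for doc in documents:
    counts = Counter(doc.split())
    matrix.append([counts[word] for word in vocabulary])

print(len(matrix), len(vocabulary))          # rows, columns
print([sum(row) for row in matrix])          # words per document
print(matrix[0][vocabulary.index("was")])    # count of "was" in document 1
```

The row sums (10, 6, 7 and 5) add up to the 28 words counted above, and the number of columns equals the 23 distinct words.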

# Bag of words on the SMS spam data


In [11]:
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
pd.set_option('max_colwidth', 100)

In [12]:
def preprocess(document):
    """Lowercase the document and remove English stopwords."""
    # build the stopword set once instead of reloading the list for every word
    english_stopwords = set(stopwords.words('english'))

    document = document.lower()
    words = word_tokenize(document)
    words = [word for word in words if word not in english_stopwords]
    document = " ".join(words)

    return document

In [13]:
spam = pd.read_csv("SMSSpamCollection.txt", sep="\t", names=["label", "message"])
spam.head()

| | label | message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives around here though |


In [14]:
spam = spam.iloc[0:100,:]


In [15]:
messages = spam.message
print(messages)

0     Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g...
1                                                                           Ok lar... Joking wif u oni...
2     Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...
3                                                       U dun say so early hor... U c already then say...
4                                           Nah I don't think he goes to usf, he lives around here though
                                                     ...                                                 
95    Your free ringtone is waiting to be collected. Simply text the password "MIX" to 85069 to verify...
96                                                                      Watching telugu movie..wat abt u?
97                                                    i see. When we finish we have loads of loans to pay
98    Hi. Wk been ok - on hols now! Yes on for

In [16]:
messages = list(messages)
print(messages)

['Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...', 'Ok lar... Joking wif u oni...', "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's", 'U dun say so early hor... U c already then say...', "Nah I don't think he goes to usf, he lives around here though", "FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv", 'Even my brother is not like to speak with me. They treat me like aids patent.', "As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune", 'WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.', 'Had your mobile 

In [17]:
messages = [preprocess(message) for message in messages]
print(messages)

['go jurong point , crazy .. available bugis n great world la e buffet ... cine got amore wat ...', 'ok lar ... joking wif u oni ...', "free entry 2 wkly comp win fa cup final tkts 21st may 2005. text fa 87121 receive entry question ( std txt rate ) & c 's apply 08452810075over18 's", 'u dun say early hor ... u c already say ...', "nah n't think goes usf , lives around though", "freemsg hey darling 's 3 week 's word back ! 'd like fun still ? tb ok ! xxx std chgs send , £1.50 rcv", 'even brother like speak . treat like aids patent .', "per request 'melle melle ( oru minnaminunginte nurungu vettam ) ' set callertune callers . press * 9 copy friends callertune", 'winner ! ! valued network customer selected receivea £900 prize reward ! claim call 09061701461. claim code kl341 . valid 12 hours .', 'mobile 11 months ? u r entitled update latest colour mobiles camera free ! call mobile update co free 08002986030', "'m gon na home soon n't want talk stuff anymore tonight , k ? 've cried enoug

In [18]:
vectorizer = CountVectorizer()
bow_model = vectorizer.fit_transform(messages)
print(bow_model.toarray())

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


In [19]:
print(bow_model.shape)
print(vectorizer.get_feature_names_out())

(100, 640)
['000' '07732584351' '0800' '08000930705' '08002986030'
 '08452810075over18' '09061209465' '09061701461' '09066364589' '10' '100'
 '1000' '10am' '11' '12' '1500' '150p' '150pm' '16' '169' '18' '20' '2005'
 '21st' '2nd' '3aj' '4403ldnw1a7rw18' '450ppw' '4txt' '50' '5000' '5249'
 '530' '5we' '6031' '6days' '81010' '85069' '87077' '87121' '87575' '8am'
 '900' '92h' '9pm' 'abiola' 'abt' 'ac' 'accomodations' 'aco' 'actin'
 'advise' 'aft' 'afternoon' 'ah' 'ahead' 'ahhh' 'aids' 'almost' 'already'
 'alright' 'always' 'amore' 'amp' 'animation' 'another' 'anymore'
 'anything' 'apologetic' 'apply' 'appointment' 'arabian' 'ard' 'around'
 'ask' 'available' 'awarded' 'babe' 'back' 'badly' 'barbie' 'becoz' 'bed'
 'beforehand' 'best' 'bit' 'blessing' 'bonus' 'box' 'breather' 'britney'
 'brother' 'buffet' 'bugis' 'burger' 'burns' 'bus' 'buy' 'bx420' 'ca'
 'call' 'callers' 'callertune' 'calls' 'camcorder' 'came' 'camera' 'car'
 'cash' 'casualty' 'catch' 'caught' 'cause' 'cave' 'chances' 'char

In [20]:
bow_model.toarray().sum()

934

# Stemming and Lemmatization

The vocabulary above contains redundant tokens: different forms of the same word, such as ‘call’ and ‘calls’, each get their own column. This will result in an inefficient model when you build your spam detector. Stemming ensures that different variations of a word, say ‘warm’, ‘warmer’, ‘warming’ and ‘warmed’, are represented by a single token, ‘warm’, because they all carry the same information (represented by the ‘stem’ of the word).

 

Another similar preprocessing step (and an alternative to stemming) is lemmatization, which reduces each word to its dictionary form, the lemma.



word = 'played'

# function that chops off the suffixes 'ing' and 'ed'
def stemmer(word):
    if word[-3:] == 'ing':
        return word[:-3]
    elif word[-2:] == 'ed':
        return word[:-2]
    return word

print(stemmer(word))
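
A suffix-chopping stemmer like the one above is easy to write but blunt. The sketch below repeats that function so it runs on its own, shows where it goes wrong, and contrasts it with a dictionary lookup, which is the basic idea behind lemmatization. The table `toy_lemmas` and the function `lemmatize` are hypothetical stand-ins for a real lexicon-based lemmatizer.

```python
def stemmer(word):
    """Chop off a trailing 'ing' or 'ed', as in the example above."""
    if word.endswith("ing"):
        return word[:-3]
    if word.endswith("ed"):
        return word[:-2]
    return word


# Hypothetical mini-lexicon mapping inflected forms to dictionary forms
toy_lemmas = {"played": "play", "running": "run", "went": "go", "feet": "foot"}


def lemmatize(word):
    """Look the word up; fall back to the word itself if it is unknown."""
    return toy_lemmas.get(word, word)


for word in ["played", "warming", "running", "red", "went"]:
    print(word, stemmer(word), lemmatize(word))
```

The stemmer turns `running` into `runn` and `red` into `r`, and it leaves the irregular form `went` untouched. The lookup handles `went` correctly, but only for words that appear in its table.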