# Text Processing: Stemming and Lemmatization

*Original Dataset: https://www.kaggle.com/uciml/sms-spam-collection-dataset/home*

For this exercise, 100 SMS messages have been parsed and categorized as "spam" or "ham". The dataframe also contains the original text of each message. We have converted the dataframe into a dictionary for this exercise (execute the first two cells).

The dictionary has 100 entries, keyed by the integers 0 to 99. Each value is itself a dictionary with two string fields, `class` and `text`. `class` is either "spam" or "ham", depending on the category of the message, and `text` holds the original text message.
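As a rough illustration of that layout, the sketch below builds a two-entry stand-in for the dictionary and reads both fields. The name `sample_msgs` and the message texts are assumptions made for illustration; the real `msgs` dictionary is created by the two cells that follow.

```python
# Hypothetical two-entry stand-in for the 100-entry dictionary.
sample_msgs = {
    0: {"class": "ham", "text": "Ok lar... Joking wif u oni..."},
    1: {"class": "spam", "text": "FREE entry in 2 a wkly comp"},
}

# Each key maps to a record with a `class` label and the message `text`.
for key, record in sample_msgs.items():
    print(key, record["class"], record["text"])

# Collect the keys of all records labelled "spam".
spam_keys = [key for key, record in sample_msgs.items() if record["class"] == "spam"]
print(spam_keys)
```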

In [2]:
import pandas as pd

df = pd.read_csv("/dsa/data/DSA-8410/spam.csv", encoding='latin1')
mini_df = df[['v1', 'v2']][:100]
mini_df.columns = ['class', 'text']

mini_df.to_csv('messages.csv', index=False)

In [3]:
df = pd.read_csv('messages.csv')
msgs = df.T.to_dict()

**Task 1.** Create a list of strings from this dictionary with the `text` values, and convert all of the strings into lowercase. Print out the first five (5) items from your list.

In [7]:
# Your code goes here
#---------------------
print(msgs.keys())

dict_keys([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99])


In [8]:
text_list = [msgs[i]['text'].lower() for i in msgs]
print(text_list[:5])

['go until jurong point, crazy.. available only in bugis n great world la e buffet... cine there got amore wat...', 'ok lar... joking wif u oni...', "free entry in 2 a wkly comp to win fa cup final tkts 21st may 2005. text fa to 87121 to receive entry question(std txt rate)t&c's apply 08452810075over18's", 'u dun say so early hor... u c already then say...', "nah i don't think he goes to usf, he lives around here though"]


**Task 2.** Use the `nltk` package's tokenization functionality on each of the strings in your list. The result should be a list of lists. Print out the first five (5) items from your list.

In [9]:
# Your code goes here
#---------------------
import nltk
from nltk.tokenize import word_tokenize

In [10]:
nltk.download('punkt')  # some newer NLTK releases may ask for 'punkt_tab' instead
tokenized_list = [word_tokenize(text) for text in text_list]
print(tokenized_list[:5])

[nltk_data] Downloading package punkt to /home/djkgg/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


[['go', 'until', 'jurong', 'point', ',', 'crazy', '..', 'available', 'only', 'in', 'bugis', 'n', 'great', 'world', 'la', 'e', 'buffet', '...', 'cine', 'there', 'got', 'amore', 'wat', '...'], ['ok', 'lar', '...', 'joking', 'wif', 'u', 'oni', '...'], ['free', 'entry', 'in', '2', 'a', 'wkly', 'comp', 'to', 'win', 'fa', 'cup', 'final', 'tkts', '21st', 'may', '2005.', 'text', 'fa', 'to', '87121', 'to', 'receive', 'entry', 'question', '(', 'std', 'txt', 'rate', ')', 't', '&', 'c', "'s", 'apply', '08452810075over18', "'s"], ['u', 'dun', 'say', 'so', 'early', 'hor', '...', 'u', 'c', 'already', 'then', 'say', '...'], ['nah', 'i', 'do', "n't", 'think', 'he', 'goes', 'to', 'usf', ',', 'he', 'lives', 'around', 'here', 'though']]
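The output above shows `word_tokenize` splitting `don't` into `do` and `n't` and keeping runs such as `...` as tokens of their own. For comparison, here is a minimal regex-based sketch that needs no downloaded tokenizer data. It is a stand-in written for illustration, not nltk's tokenizer; the names `TOKEN_PATTERN` and `simple_tokenize` are assumptions, and it deliberately discards punctuation and keeps contractions whole.

```python
import re

# Letters/digits, optionally followed by an apostrophe suffix such as 't or 's.
TOKEN_PATTERN = re.compile(r"[a-z0-9]+(?:'[a-z]+)?")

def simple_tokenize(text):
    """Lowercase the text and return its word-like tokens (hypothetical helper)."""
    return TOKEN_PATTERN.findall(text.lower())

first = simple_tokenize("Nah I don't think he goes to usf, he lives around here though")
second = simple_tokenize("ok lar... joking wif u oni...")
print(first)
print(second)
```

Whether contractions should stay whole or be split as nltk does depends on the stop-word list used afterwards, so this sketch is a point of comparison rather than a replacement.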


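**Task 3.** Remove the stop words, punctuation, and numbers from your list (list of lists). Punctuation can be checked against the characters in `string.punctuation` (a constant string, not a function), and purely numeric tokens with `str.isdigit()`. Alternatively, a string method such as `str.isalpha()` returns `False` for any token that contains a non-letter character; if the result is false, you can remove that particular string from the list.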

In [12]:
# Your code goes here
#---------------------
from nltk.corpus import stopwords
import string

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

punctuations = set(string.punctuation)

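def clean_tokens(tokens):
    # Drop stop words, single punctuation characters, and all-digit tokens.
    # Multi-character tokens such as '...' or '2005.' are not in
    # `punctuations` and are not all digits, so they pass through.
    return [
        word for word in tokens
        if word not in stop_words
        and word not in punctuations
        and not word.isdigit()
    ]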

cleaned_list = [clean_tokens(tokens) for tokens in tokenized_list]

print(cleaned_list[:5])

[['go', 'jurong', 'point', 'crazy', '..', 'available', 'bugis', 'n', 'great', 'world', 'la', 'e', 'buffet', '...', 'cine', 'got', 'amore', 'wat', '...'], ['ok', 'lar', '...', 'joking', 'wif', 'u', 'oni', '...'], ['free', 'entry', 'wkly', 'comp', 'win', 'fa', 'cup', 'final', 'tkts', '21st', 'may', '2005.', 'text', 'fa', 'receive', 'entry', 'question', 'std', 'txt', 'rate', 'c', "'s", 'apply', '08452810075over18', "'s"], ['u', 'dun', 'say', 'early', 'hor', '...', 'u', 'c', 'already', 'say', '...'], ['nah', "n't", 'think', 'goes', 'usf', 'lives', 'around', 'though']]


[nltk_data] Downloading package stopwords to /home/djkgg/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
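In the output above, tokens such as `...`, `2005.`, and `'s` remain, because membership in `set(string.punctuation)` only matches single characters and `isdigit()` only matches all-digit strings. A stricter variant, sketched below, keeps a token only if `str.isalpha()` is true. The function name `clean_tokens_strict` and the tiny stop-word set are assumptions for illustration; nltk's English list is much longer.

```python
# Hypothetical mini stop-word set standing in for stopwords.words('english').
TOY_STOP_WORDS = {"i", "he", "to", "in", "a", "until", "only", "there"}

def clean_tokens_strict(tokens, stop_words=TOY_STOP_WORDS):
    """Keep only purely alphabetic tokens that are not stop words."""
    return [token for token in tokens if token.isalpha() and token not in stop_words]

sample = ["free", "entry", "in", "2", "a", "21st", "2005.", "...", "'s", "n't", "win"]
strict = clean_tokens_strict(sample)
print(strict)
```

This filter is more aggressive than the one used in the notebook: it also drops mixed tokens such as `21st` and contraction pieces such as `n't`, which may or may not be wanted for a given analysis.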


**Task 4.** Use the `nltk` package's `PorterStemmer` to stem the cleaned-text list from **Task 3**. Store the stemmed-word list in a new variable and keep the **Task 3** result intact, as we will use the cleaned-text list in later tasks.

In [13]:
# Your code goes here
#---------------------
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

stemmed_list = [[stemmer.stem(word) for word in tokens] for tokens in cleaned_list]

print(stemmed_list[:5])

[['go', 'jurong', 'point', 'crazi', '..', 'avail', 'bugi', 'n', 'great', 'world', 'la', 'e', 'buffet', '...', 'cine', 'got', 'amor', 'wat', '...'], ['ok', 'lar', '...', 'joke', 'wif', 'u', 'oni', '...'], ['free', 'entri', 'wkli', 'comp', 'win', 'fa', 'cup', 'final', 'tkt', '21st', 'may', '2005.', 'text', 'fa', 'receiv', 'entri', 'question', 'std', 'txt', 'rate', 'c', "'s", 'appli', '08452810075over18', "'s"], ['u', 'dun', 'say', 'earli', 'hor', '...', 'u', 'c', 'alreadi', 'say', '...'], ['nah', "n't", 'think', 'goe', 'usf', 'live', 'around', 'though']]
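Stems such as `goe`, `crazi`, and `earli` in the output above are not dictionary words, because a stemmer rewrites suffixes by rule without looking anything up. The toy sketch below mimics two such rules to make that visible. It is a deliberately tiny illustration and does not implement the Porter algorithm; the name `toy_stem` and both rules are assumptions chosen to reproduce a few of the stems shown above.

```python
VOWELS = set("aeiou")

def toy_stem(word):
    """Apply two hand-picked suffix rules (not the Porter algorithm)."""
    # Rule 1: a trailing "es" loses its "s" ("goes" -> "goe").
    if word.endswith("es") and len(word) > 3:
        return word[:-1]
    # Rule 2: a trailing "y" after a consonant becomes "i" ("crazy" -> "crazi").
    if word.endswith("y") and len(word) > 2 and word[-2] not in VOWELS:
        return word[:-1] + "i"
    return word

words = ["goes", "lives", "crazy", "early", "say"]
toy_stems = [toy_stem(word) for word in words]
print(toy_stems)
```

Because the rules never consult a vocabulary, different surface forms collapse onto the same key cheaply, at the cost of producing strings that are not real words.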


**Task 5.** Use the `nltk` package's `WordNetLemmatizer` to find the lemma (or root word) of each word in the cleaned-text list from **Task 3**. Treat every word as a verb. Store the lemmatized-word list in a new variable and keep the **Task 3** result intact, as we will use the cleaned-text list in later tasks. We assume every word is a verb to make the problem easier; alternatively, we could have applied a POS tagger and inferred the part of speech of each word.

In [14]:
# Your code goes here
#---------------------
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

lemmatized_list = [[lemmatizer.lemmatize(word, pos='v') for word in tokens] for tokens in cleaned_list]

print(lemmatized_list[:5])


[nltk_data] Downloading package wordnet to /home/djkgg/nltk_data...


[['go', 'jurong', 'point', 'crazy', '..', 'available', 'bugis', 'n', 'great', 'world', 'la', 'e', 'buffet', '...', 'cine', 'get', 'amore', 'wat', '...'], ['ok', 'lar', '...', 'joke', 'wif', 'u', 'oni', '...'], ['free', 'entry', 'wkly', 'comp', 'win', 'fa', 'cup', 'final', 'tkts', '21st', 'may', '2005.', 'text', 'fa', 'receive', 'entry', 'question', 'std', 'txt', 'rate', 'c', "'s", 'apply', '08452810075over18', "'s"], ['u', 'dun', 'say', 'early', 'hor', '...', 'u', 'c', 'already', 'say', '...'], ['nah', "n't", 'think', 'go', 'usf', 'live', 'around', 'though']]
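Task 5 passes `pos='v'` for every token, and the task text mentions a POS tagger as the alternative. A minimal sketch of the usual glue step follows: mapping Penn Treebank tags (the tag set that `nltk.pos_tag` returns by default) to the single-letter codes accepted by `lemmatize(word, pos=...)`. The helper name `penn_to_wordnet` and the noun fallback are assumptions for illustration, and the tagged pairs are hand-written rather than real tagger output.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (hypothetical helper)."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # assumption: fall back to noun, the lemmatizer's default

# Hand-written (token, tag) pairs for illustration; not real tagger output.
tagged = [("goes", "VBZ"), ("lives", "VBZ"), ("early", "RB"), ("great", "JJ"), ("point", "NN")]
wordnet_tags = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(wordnet_tags)
```

With the tagger data installed, each pair could then be lemmatized as `lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))` instead of using `pos='v'` throughout.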


**Task 6.** For each lemma from **Task 5**, calculate how many times it occurs across all of the messages. Sort the lemmas in descending order of total occurrences, and print out the top ten (10) words with their counts.

In [15]:
# Your code goes here
#---------------------
from collections import Counter

all_words = [word for tokens in lemmatized_list for word in tokens]

word_counts = Counter(all_words)

top_10_words = word_counts.most_common(10)

for word, count in top_10_words:
    print(f'{word}: {count}')


...: 24
u: 17
call: 15
n't: 14
's: 13
'm: 13
go: 11
get: 11
free: 10
like: 10


**Task 7.** From the **Task 6** result, remove all words with a length of 1 and select the hundred (100) most frequent remaining terms. We will use this list of words in the next task.

In [16]:
# Your code goes here
#---------------------

filtered_word_counts = {word: count for word, count in word_counts.items() if len(word) > 1}

top_100_words = Counter(filtered_word_counts).most_common(100)

for word, count in top_100_words:
    print(f'{word}: {count}')


...: 24
call: 15
n't: 14
's: 13
'm: 13
go: 11
get: 11
free: 10
like: 10
'': 9
..: 8
ok: 8
sorry: 8
txt: 6
already: 6
home: 6
smile: 6
say: 5
still: 5
want: 5
reply: 5
yes: 5
way: 5
ur: 5
ha: 5
finish: 5
know: 5
anything: 5
please: 5
see: 5
lt: 5
gt: 5
back: 4
send: 4
prize: 4
claim: 4
mobile: 4
time: 4
try: 4
'll: 4
tell: 4
later: 4
pain: 4
come: 4
hi: 4
great: 3
lar: 3
joke: 3
text: 3
think: 3
word: 3
even: 3
customer: 3
na: 3
tonight: 3
cash: 3
make: 3
feel: 3
miss: 3
first: 3
lor: 3
meet: 3
lol: 3
catch: 3
love: 3
wait: 3
yeah: 3
look: 3
do: 3
need: 3
sms: 3
man: 3
end: 3
check: 3
wat: 2
wif: 2
entry: 2
win: 2
fa: 2
may: 2
std: 2
apply: 2
early: 2
live: 2
around: 2
though: 2
week: 2
'd: 2
å£1.50: 2
treat: 2
request: 2
callertune: 2
value: 2
network: 2
months: 2
update: 2
gon: 2
stuff: 2
anymore: 2
've: 2
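The last entries in the list above all have a count of 2, so if more words occur twice than fit into the top 100, which ones make the cut depends on how ties are broken. `Counter.most_common` keeps equal counts in first-encountered order. The sketch below, using made-up tokens, shows that behaviour next to a deterministic alternative that breaks ties alphabetically; the token list and the names `by_encounter` and `ranked` are illustrative assumptions.

```python
from collections import Counter

# Made-up tokens: "txt", "call", and "free" each occur twice.
tokens = ["txt", "call", "free", "call", "txt", "free", "ok"]
counts = Counter(tokens)

# most_common keeps tied words in the order they were first encountered.
by_encounter = counts.most_common(2)

# Deterministic alternative: count descending, then word ascending.
ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:2]

print(by_encounter)
print(ranked)
```

The two selections differ even though the counts are identical, which is worth keeping in mind when the vocabulary feeds a later step.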


**Task 8.** For each message (use the lemma list created in **Task 5**), calculate the number of times each word from **Task 7** (the top-100 words) occurs in that message.
Create a **Data-Matrix** from these counts. Each row should correspond to a message, and each column to a word from the **Task 7** list. Each cell should hold the number of times that column's word occurs in that row's message.

You can use a pandas `DataFrame` to store your **Data-Matrix**. Print the first five rows of the Data-Matrix.

In [17]:
# Your code goes here
#---------------------
import pandas as pd

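# Keep only the words (drop the counts). A separate name leaves
# top_100_words untouched, so this cell can be re-run safely.
vocab = [word for word, _ in top_100_words]

# One row per message: how often each vocabulary word occurs in it.
data_matrix = []

for tokens in lemmatized_list:
    word_count = {word: tokens.count(word) for word in vocab}
    data_matrix.append(word_count)

df = pd.DataFrame(data_matrix, columns=vocab)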

print(df.head())


   ...  call  n't  's  'm  go  get  free  like  ''  ...  request  callertune  \
0    2     0    0   0   0   1    1     0     0   0  ...        0           0   
1    2     0    0   0   0   0    0     0     0   0  ...        0           0   
2    0     0    0   2   0   0    0     1     0   0  ...        0           0   
3    2     0    0   0   0   0    0     0     0   0  ...        0           0   
4    0     0    1   0   0   1    0     0     0   0  ...        0           0   

   value  network  months  update  gon  stuff  anymore  've  
0      0        0       0       0    0      0        0    0  
1      0        0       0       0    0      0        0    0  
2      0        0       0       0    0      0        0    0  
3      0        0       0       0    0      0        0    0  
4      0        0       0       0    0      0        0    0  

[5 rows x 100 columns]
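Calling `tokens.count(word)` once per vocabulary word rescans the message for every column. For 100 short messages that is fine, but a variant that counts each message once with `Counter` and then looks up each column scales better. The sketch below uses a made-up three-word vocabulary and three made-up messages; `vocab`, `messages`, and `rows` are illustrative names, not the notebook's data.

```python
from collections import Counter

vocab = ["call", "free", "go"]  # hypothetical stand-in for the top-100 words
messages = [["call", "go", "call"], ["free", "txt"], []]

rows = []
for tokens in messages:
    counts = Counter(tokens)  # single pass over the message
    # A Counter returns 0 for missing keys, so absent words need no special case.
    rows.append([counts[word] for word in vocab])

print(rows)
```

Passing `rows` to `pd.DataFrame(rows, columns=vocab)` would give the same row-per-message, column-per-word layout as the Data-Matrix above.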


# Save your notebook, then `File > Close and Halt`