# Solution for Module-1 Exercise

In [1]:
import pandas as pd

df = pd.read_csv("/dsa/data/DSA-8410/spam.csv", encoding='latin1')
mini_df = df[['v1', 'v2']][:100]
mini_df.columns = ['class', 'text']

mini_df.to_csv('messages.csv', index=False)

In [2]:
cur_df = pd.read_csv('messages.csv')
msgs = cur_df.T.to_dict()

*Original Dataset: https://www.kaggle.com/uciml/sms-spam-collection-dataset/home*

For this exercise, we give you 100 SMS messages that have been parsed and labeled as "spam" or "ham". The dataframe also contains the original text of each message. We have converted the dataframe into a dictionary for this exercise.

The dictionary has 100 entries, keyed by the integers 0 through 99. Each value is itself a dictionary with two string fields, `class` and `text`. `class` is either "spam" or "ham", depending on the category of the message, and `text` holds the original text message.
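The shape described above can be sketched with a small stand-in dictionary. The name `msgs_example` and both message texts are made up for illustration; only the `class` and `text` field names come from the exercise.

```python
# Hypothetical two-entry stand-in for `msgs`; the texts are invented.
msgs_example = {
    0: {"class": "ham", "text": "See you at LUNCH"},
    1: {"class": "spam", "text": "WIN a FREE prize now"},
}

# Each key is a row index; each value holds the 'class' and 'text' fields.
for idx, record in msgs_example.items():
    print(idx, record["class"], record["text"].lower())

# Filtering on the label works the same way on the real dictionary.
spam_texts = [r["text"] for r in msgs_example.values() if r["class"] == "spam"]
```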

**Task 1.** Create a list of strings from this dictionary with the `text` values, and convert all of the strings into lowercase. Print out the first five (5) items from your list.

In [3]:
# Your code goes here
#---------------------
strs = [value['text'].lower() for key, value in msgs.items()]
print(strs[:5])

['go until jurong point, crazy.. available only in bugis n great world la e buffet... cine there got amore wat...', 'ok lar... joking wif u oni...', "free entry in 2 a wkly comp to win fa cup final tkts 21st may 2005. text fa to 87121 to receive entry question(std txt rate)t&c's apply 08452810075over18's", 'u dun say so early hor... u c already then say...', "nah i don't think he goes to usf, he lives around here though"]


**Task 2.** Use the `nltk` package's tokenizer on each of the strings in your list. The result should be a list of lists. Print out the first five (5) items from your list.

In [4]:
# Your code goes here
#---------------------
from nltk import word_tokenize

msg = [word_tokenize(sent) for sent in strs]
print(msg[:5])

[['go', 'until', 'jurong', 'point', ',', 'crazy', '..', 'available', 'only', 'in', 'bugis', 'n', 'great', 'world', 'la', 'e', 'buffet', '...', 'cine', 'there', 'got', 'amore', 'wat', '...'], ['ok', 'lar', '...', 'joking', 'wif', 'u', 'oni', '...'], ['free', 'entry', 'in', '2', 'a', 'wkly', 'comp', 'to', 'win', 'fa', 'cup', 'final', 'tkts', '21st', 'may', '2005.', 'text', 'fa', 'to', '87121', 'to', 'receive', 'entry', 'question', '(', 'std', 'txt', 'rate', ')', 't', '&', 'c', "'s", 'apply', '08452810075over18', "'s"], ['u', 'dun', 'say', 'so', 'early', 'hor', '...', 'u', 'c', 'already', 'then', 'say', '...'], ['nah', 'i', 'do', "n't", 'think', 'he', 'goes', 'to', 'usf', ',', 'he', 'lives', 'around', 'here', 'though']]


**Task 3.** Remove the stopwords, punctuation and numbers from your list (list of lists). Punctuation can be removed by checking each character against the built-in string `string.punctuation`: if a character is found in `string.punctuation`, remove it from the token. Note that `string.punctuation` contains no digits, so numbers have to be checked separately, for example against `string.digits`.

In [5]:
# Your code goes here
#---------------------
import nltk
import string
from nltk.corpus import stopwords
nltk.download("stopwords")

stop_words = stopwords.words("english")
clean_msg = []

for cur in msg:
    # Remove stop-words
    r_st = [word for word in cur if word not in stop_words]
    r_punc = []
    # Remove punctuations
    for nxt in r_st:
        r_punc.append("".join([lett for lett in nxt if lett not in string.punctuation]))
    # Eliminate empty strings caused by removing punctuations
    r_punc = [word for word in r_punc if word]
    clean_msg.append(r_punc)

print(clean_msg[:5])

[['go', 'jurong', 'point', 'crazy', 'available', 'bugis', 'n', 'great', 'world', 'la', 'e', 'buffet', 'cine', 'got', 'amore', 'wat'], ['ok', 'lar', 'joking', 'wif', 'u', 'oni'], ['free', 'entry', '2', 'wkly', 'comp', 'win', 'fa', 'cup', 'final', 'tkts', '21st', 'may', '2005', 'text', 'fa', '87121', 'receive', 'entry', 'question', 'std', 'txt', 'rate', 'c', 's', 'apply', '08452810075over18', 's'], ['u', 'dun', 'say', 'early', 'hor', 'u', 'c', 'already', 'say'], ['nah', 'nt', 'think', 'goes', 'usf', 'lives', 'around', 'though']]


[nltk_data] Downloading package stopwords to
[nltk_data]     /home/scottgs/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
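The solution above filters characters against `string.punctuation` only, which is why tokens such as `2005` and `87121` survive in the output. A minimal sketch of stripping punctuation and digits in one pass, assuming `str.translate` with a deletion table is acceptable here; the `tokens` sample is copied from the Task 2 output, and `strip_table` is a name introduced for this illustration.

```python
import string

# Deletion table: every punctuation character and every digit maps to None.
strip_table = str.maketrans("", "", string.punctuation + string.digits)

# A few tokens taken from the Task 2 output.
tokens = ["2005.", "'s", "n't", "08452810075over18", "crazy", "..."]

# Strip the unwanted characters, then drop tokens that became empty.
cleaned = [t.translate(strip_table) for t in tokens]
cleaned = [t for t in cleaned if t]
print(cleaned)
```

Applying this to the full list would change the word counts shown in the later tasks, since numeric tokens would no longer be present.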


**Task 4.** Use the `nltk` package's `PorterStemmer` to stem the cleaned-text list that you got as a result of **Task 3**. Store the stemmed-word list in a new variable and keep the result of **Task 3** intact, as we will use the cleaned-text list from **Task 3** in later tasks.

In [6]:
# Your code goes here
#---------------------
from nltk.stem import PorterStemmer
porter = PorterStemmer()

stems = []
for cur in clean_msg:
    stem = [porter.stem(word) for word in cur]
    stems.append(stem)

print(stems[:5])

[['go', 'jurong', 'point', 'crazi', 'avail', 'bugi', 'n', 'great', 'world', 'la', 'e', 'buffet', 'cine', 'got', 'amor', 'wat'], ['ok', 'lar', 'joke', 'wif', 'u', 'oni'], ['free', 'entri', '2', 'wkli', 'comp', 'win', 'fa', 'cup', 'final', 'tkt', '21st', 'may', '2005', 'text', 'fa', '87121', 'receiv', 'entri', 'question', 'std', 'txt', 'rate', 'c', 's', 'appli', '08452810075over18', 's'], ['u', 'dun', 'say', 'earli', 'hor', 'u', 'c', 'alreadi', 'say'], ['nah', 'nt', 'think', 'goe', 'usf', 'live', 'around', 'though']]


**Task 5.** Use the `nltk` package's `WordNetLemmatizer` to find the lemma (or root word) of each word in the cleaned-text list that you got as a result of **Task 3**. Treat every word as a verb (`pos="v"`). Store the lemmatized-word list in a new variable and keep the result of **Task 3** intact, as we will use the cleaned-text list from **Task 3** in later tasks.

In [7]:
# Your code goes here
#---------------------
from nltk.stem import WordNetLemmatizer
wordnet = WordNetLemmatizer()

lemmas = []
for cur in clean_msg:
    lemma = [wordnet.lemmatize(word, pos="v") for word in cur]
    lemmas.append(lemma)

print(lemmas[:5])

[['go', 'jurong', 'point', 'crazy', 'available', 'bugis', 'n', 'great', 'world', 'la', 'e', 'buffet', 'cine', 'get', 'amore', 'wat'], ['ok', 'lar', 'joke', 'wif', 'u', 'oni'], ['free', 'entry', '2', 'wkly', 'comp', 'win', 'fa', 'cup', 'final', 'tkts', '21st', 'may', '2005', 'text', 'fa', '87121', 'receive', 'entry', 'question', 'std', 'txt', 'rate', 'c', 's', 'apply', '08452810075over18', 's'], ['u', 'dun', 'say', 'early', 'hor', 'u', 'c', 'already', 'say'], ['nah', 'nt', 'think', 'go', 'usf', 'live', 'around', 'though']]
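To see where stemming and lemmatization disagree, the printed results for the fifth message can be lined up side by side. This is only a sketch: the three lists are copied from the Task 3, Task 4 and Task 5 outputs above rather than recomputed, and the `*_5` names are introduced for this illustration.

```python
# Fifth message as printed by Task 3 (cleaned), Task 4 (stems), Task 5 (lemmas).
cleaned_5 = ['nah', 'nt', 'think', 'goes', 'usf', 'lives', 'around', 'though']
stems_5 = ['nah', 'nt', 'think', 'goe', 'usf', 'live', 'around', 'though']
lemmas_5 = ['nah', 'nt', 'think', 'go', 'usf', 'live', 'around', 'though']

# Keep only the positions where the stem and the lemma differ.
diffs = [
    (word, stem, lemma)
    for word, stem, lemma in zip(cleaned_5, stems_5, lemmas_5)
    if stem != lemma
]

for word, stem, lemma in diffs:
    print(f"{word}: stem={stem}, lemma={lemma}")
```

For this message the only disagreement is `goes`: the stemmer trims it to `goe`, which is not a dictionary word, while the lemmatizer maps it to `go`.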


**Task 6.** For each lemma in the list that we got from **Task 5**, calculate how many times it occurs across all of the messages. Sort the lemmas in descending order by total number of occurrences, and print out the top ten (10) words and their number of occurrences.

In [8]:
# Your code goes here
#---------------------
occurs = {}

for cur in lemmas:
    for nxt in cur:
        if (nxt not in occurs):
            occurs[nxt] = 1
        else:
            occurs[nxt] += 1

cnt = 0
for k in sorted(occurs, key=occurs.get, reverse=True):
    print('word:', k, occurs[k])
    cnt += 1
    if (cnt >= 10):
        break

word: u 17
word: call 15
word: nt 14
word: s 13
word: m 13
word: go 11
word: get 11
word: free 10
word: like 10
word: ok 8
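The same tally can be sketched with `collections.Counter` from the standard library. The three short lists below are copied from the lemma lists printed in Task 5, so the counts differ from the full 100-message result; `lemmas_sample` and `occurs_sample` are names introduced for this illustration.

```python
from collections import Counter
from itertools import chain

# Second, fourth and fifth lemma lists from the Task 5 output.
lemmas_sample = [
    ['ok', 'lar', 'joke', 'wif', 'u', 'oni'],
    ['u', 'dun', 'say', 'early', 'hor', 'u', 'c', 'already', 'say'],
    ['nah', 'nt', 'think', 'go', 'usf', 'live', 'around', 'though'],
]

# Flatten the list of lists and count every lemma in one pass.
occurs_sample = Counter(chain.from_iterable(lemmas_sample))

# most_common(n) returns the n highest counts in descending order.
for word, count in occurs_sample.most_common(2):
    print('word:', word, count)
```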


**Task 7.** From the result we got from **Task 6**, remove all words of length 1, and select the hundred (100) most frequent of the remaining words. We will use this list of words in our next task.

In [9]:
# Your code goes here
#---------------------

rows = list()
for k in sorted(occurs, key=occurs.get, reverse=True):
    if (len(k) > 1):
        rows.append(k)
        if (len(rows) >= 100):
            break

print('length', len(rows), '->', rows)

length 100 -> ['call', 'nt', 'go', 'get', 'free', 'like', 'ok', 'sorry', 'txt', 'already', 'home', 'smile', 'say', 'still', 'want', 'reply', 'yes', 'way', 'ur', 'ha', 'finish', 'know', 'anything', 'please', 'see', 'lt', 'gt', 'back', 'send', 'prize', 'claim', 'mobile', 'time', 'try', 'll', 'tell', 'later', 'pain', 'come', 'hi', 'great', 'lar', 'joke', 'text', 'think', 'word', 'even', 'customer', 'na', 'tonight', 'cash', 'make', 'feel', 'miss', 'first', 'lor', 'meet', 'lol', 'catch', 'love', 'wait', 'yeah', 'look', 'do', 'need', 'sms', 'man', 'end', 'check', 'wat', 'wif', 'entry', 'win', 'fa', 'may', '87121', 'std', 'apply', 'early', 'live', 'around', 'though', 'week', 'å£150', 'treat', 'request', 'melle', 'callertune', 'value', 'network', 'months', 'update', 'gon', 'stuff', 'anymore', 've', 'enough', 'today', '16', 'urgent']
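The filter-then-truncate step can also be written as a single comprehension followed by a slice. A minimal sketch, assuming the counts are held in a `collections.Counter`; the ten counts below are the ones printed in Task 6, and `counts_sample` and `top_words` are names introduced for this illustration.

```python
from collections import Counter

# The ten (word, count) pairs printed by Task 6.
counts_sample = Counter({
    'u': 17, 'call': 15, 'nt': 14, 's': 13, 'm': 13,
    'go': 11, 'get': 11, 'free': 10, 'like': 10, 'ok': 8,
})

# Walk the words from most to least frequent, skip single-character
# tokens, and keep the first five that remain.
top_words = [word for word, _ in counts_sample.most_common() if len(word) > 1][:5]
print(top_words)
```

With the full `occurs` dictionary, replacing the `5` with `100` would give the same selection rule as the loop above.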


**Task 8.** For each message (use the lemma list we created for **Task 5**), calculate the number of times each word from **Task 7** (top-100 words) occurs in that message.
Create a **Data-Matrix** from your calculations. Each row should correspond to a message, and each column should correspond to a word from the list we got in **Task 7**. Each cell should hold the number of times that column's word occurs in that row's message.

You can use a pandas DataFrame to store your **Data-Matrix**. Print the first 5 rows of the Data-Matrix.

In [10]:
# Your code goes here
#---------------------
import pandas as pd

dm = []

for cur in lemmas:
    # Term frequencies of the top-100 words in the current message
    tf = {}
    for word in rows:
        tf[word] = cur.count(word)
    dm.append(tf)

df = pd.DataFrame(dm)
df.head()

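| | call | nt | go | get | free | like | ok | sorry | txt | already | ... | months | update | gon | stuff | anymore | ve | enough | today | 16 | urgent |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |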
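The construction of the Data-Matrix can be checked by hand on a reduced example. This sketch uses only the standard library and assumes a five-word vocabulary (the first five words from Task 7) and two lemma lists copied from the Task 5 output; `vocab`, `lemmas_sample` and `matrix` are names introduced for this illustration.

```python
from collections import Counter

# First five words of the Task 7 list.
vocab = ['call', 'nt', 'go', 'get', 'free']

# First and fifth lemma lists from the Task 5 output.
lemmas_sample = [
    ['go', 'jurong', 'point', 'crazy', 'available', 'bugis', 'n', 'great',
     'world', 'la', 'e', 'buffet', 'cine', 'get', 'amore', 'wat'],
    ['nah', 'nt', 'think', 'go', 'usf', 'live', 'around', 'though'],
]

# One row per message, one column per vocabulary word.
matrix = []
for message in lemmas_sample:
    counts = Counter(message)  # a missing word yields a count of 0
    matrix.append([counts[word] for word in vocab])

for row in matrix:
    print(row)
```

The two rows agree with the first five columns of rows 0 and 4 in the table above. Passing `matrix` to `pd.DataFrame` with `columns=vocab` would give the same labeled layout as the solution.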


# Save your notebook, then `File > Close and Halt`