# Task 2: Text Pre-Processing

**Name: Manmeet Singh**<br>
**Student Id: 30749476**

**Description**<br>
Write Python code to preprocess a set of tweets and convert them into numerical representations suitable as input to recommender-system and information-retrieval algorithms.<br>
1. Generate the corpus vocabulary with the same structure as sample_vocab.txt. The vocabulary must be sorted alphabetically.<br>
2. For each day (i.e., each sheet in the Excel file), calculate the top 100 most frequent unigrams and the top 100 most frequent bigrams. If a day has fewer than 100 bigrams, include only the top n bigrams for that day (n < 100).<br>
3. Generate the sparse representation (i.e., doc-term matrix) of the Excel file accordingly.<br>

## Importing the required Libraries

In [1]:
# importing required libraries
import pandas as pd
import numpy as np
import langid
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.stem import PorterStemmer
from nltk.collocations import *

## Loading Excel file and Stopword file

In [2]:
# loading the excel file
sheetName = pd.ExcelFile('30749476.xlsx')

# Open stopwords file and create a list of stopwords
with open('stopwords_en.txt', 'r') as stop:
    stopWords = stop.read().split()

# initializing an empty data frame
finalData = pd.DataFrame()


### Steps to perform.<br>
1. Using the "langid" package, keep only the tweets that are in English.<br>
2. Word tokenization must use the following regular expression: "[a-zA-Z]+(?:[-'][a-zA-Z]+)?".<br>
3. Context-independent and context-dependent stop words (threshold: appearing on more than 60 days) must be removed from the vocab. The provided context-independent stop word list (i.e., stopwords_en.txt) must be used.<br>
4. Tokens should be stemmed using the Porter stemmer.<br>
5. Rare tokens (threshold: appearing on fewer than 5 days) must be removed from the vocab.<br>
6. Create the sparse matrix using CountVectorizer.<br>
7. Tokens with a length of less than 3 should be removed from the vocab.<br>
8. The first 200 meaningful bigrams (i.e., collocations) must be included in the vocab using the PMI measure.<br>
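
To make step 2 concrete, here is a minimal sketch of what the required pattern accepts. It uses the standard `re` module instead of NLTK's `RegexpTokenizer`, and both the `tokenize` helper and the sample tweet are made up for illustration.

```python
import re

# The pattern required by the task: a run of letters, optionally followed by
# one hyphen or apostrophe and a second run of letters.
TOKEN_PATTERN = re.compile(r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?")

def tokenize(text):
    """Return all non-overlapping matches of the task's token pattern."""
    return TOKEN_PATTERN.findall(text)

# Hypothetical tweet, used only to show the behaviour of the pattern.
sample_tweet = "Don't panic: COVID-19 case-counts are well-known!"
tokens = tokenize(sample_tweet)
print(tokens)
# → ["Don't", 'panic', 'COVID', 'case-counts', 'are', 'well-known']
```

Digits are never part of a token, so "COVID-19" is reduced to "COVID". Only one hyphen or apostrophe is allowed per token, so a longer compound such as "state-of-the-art" is split into "state-of" and "the-art".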

## Methodology<br>

1. Initialize an empty list called finalList to store the unique words of each sheet after stop words have been removed from the tweets.<br>
2. Start a for loop that reads each sheet in turn and performs the following steps on its data:<br>
    a. Read the data from the sheet into a dataframe.<br>
    b. Drop columns and rows containing NaN values, but only those in which every value is NaN.<br>
    c. Drop the first row and rename the columns to text, id, and created_at.<br>
    d. Reset the index.<br>
    e. Set the created_at column to the sheet's date (the sheet name), since the column holds the date in UTC format.<br>
    f. Remove every row whose text column is not a string, i.e. not a tweet.<br>
    g. Create a new column lang holding the language of each tweet, then filter out the non-English tweets.<br>
    h. Initialize the tokenizer **(r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?")**.<br>
    i. Create the tokens column by tokenizing the tweets with the above tokenizer.<br>
    j. Create the contextIndependent column by removing all stop words and all tokens with a length of less than 3.<br>
    k. Build the unique list of words from contextIndependent. It is used later to remove context-dependent words.<br>
    l. Append the sheet's dataframe to finalData so that it accumulates the data from every sheet.

In [3]:
# Initialize empty list called finalList to store all the unique words after removing stop words from tweets
finalList = []
for sheet in sheetName.sheet_names:

    file = pd.read_excel("30749476.xlsx", sheet_name=sheet)
    # drop columns and rows that are entirely NaN
    file.dropna(axis =1, how='all', inplace=True)
    file.dropna(how='all', inplace=True)
    # drop first row
    file.drop([file.index[0]], inplace=True)
    # rename columns names
    file.rename(columns={file.columns[0]: "text", file.columns[1]: "id", file.columns[2]: "created_at"}, inplace=True)
    # reset index of dataframe
    file.reset_index(drop=True, inplace=True)
    
    # set created_at to the sheet name, which identifies the day
    file['created_at'] = sheet
    # keep only rows whose text is a string, i.e. an actual tweet
    # (.copy() avoids pandas' SettingWithCopyWarning on later assignments)
    file = file[file["text"].apply(lambda x: isinstance(x, str))].copy()

    # classify the language of each tweet
    file["lang"] = [langid.classify(i)[0] for i in file["text"]]

    # filtering english tweets
    file = file[file["lang"] == 'en'].copy()
    
    # initiate tokenizer
    tokenizer = RegexpTokenizer(r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?")
    
    # creating tokens from the tweets.
    file["tokens"] = file["text"].apply(lambda x: tokenizer.tokenize(x))
    
    # removing stop words
    file["contextIndependent"] = file["tokens"].apply(lambda x: [i for i in x if i not in stopWords])
    
    # removing tokens with length less than 3
    file["contextIndependent"] = file["contextIndependent"].apply(lambda x: [i for i in x if len(i) > 2])
    
    # collect all remaining tokens of this sheet
    allList = []
    for i in file["contextIndependent"]:
        allList.extend(i)
    # unique tokens of this sheet, so that each token counts once per day
    final = list(set(allList))

    # add this sheet's unique tokens to the running list
    finalList.extend(final)

    # DataFrame.append was removed in pandas 2.0; pd.concat is its replacement
    finalData = pd.concat([finalData, file], ignore_index=False)

## Context Dependent and Threshold words.<br>

1. Create an empty dictionary stopThresDict and count, for each word, how many sheets (days) it appears in, using the finalList built in the section above.<br>
2. Create an empty list stopThresWords and add every word whose count is less than 5 or more than 60.<br>
3. These are the rare tokens (fewer than 5 days) and the context-dependent stop words (more than 60 days).<br>
4. Create the contextDependent column in finalData with all of these threshold words removed.

In [4]:
# create empty dictionary to hold the day count of each word
stopThresDict = {}
# finalList holds each word once per sheet, so this counts days, not occurrences
for i in finalList:
    if i in stopThresDict:
        stopThresDict[i] += 1
    else:
        stopThresDict[i] = 1
# create empty list
stopThresWords = []
# store words seen on fewer than 5 days (rare) or on more than 60 days (context dependent)
for key, value in stopThresDict.items():
    if value < 5 or value > 60:
        stopThresWords.append(key)

# contextDependent column after removing threshold words
# (a set makes the membership test fast)
stopThresSet = set(stopThresWords)
finalData["contextDependent"] = finalData["contextIndependent"].apply(lambda x: [word for word in x if word not in stopThresSet])
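
The thresholding above can also be sketched with `collections.Counter`. This is an illustrative alternative, not the code used in this notebook: the helper names and toy data are made up, and the thresholds are scaled down to fit three days of data.

```python
from collections import Counter

def day_frequency(tokens_per_day):
    """Count, for each token, the number of days on which it appears."""
    counts = Counter()
    for tokens in tokens_per_day.values():
        counts.update(set(tokens))  # set(): a token counts once per day
    return counts

def tokens_to_remove(tokens_per_day, min_days=5, max_days=60):
    """Tokens seen on fewer than min_days days or on more than max_days days."""
    counts = day_frequency(tokens_per_day)
    return {tok for tok, n in counts.items() if n < min_days or n > max_days}

# Hypothetical toy data: three "days" of already tokenized tweets.
toy = {
    "day1": ["covid", "mask", "covid"],
    "day2": ["covid", "vaccine"],
    "day3": ["covid", "mask"],
}
removed = tokens_to_remove(toy, min_days=2, max_days=2)
print(sorted(removed))
# → ['covid', 'vaccine']
```

Here "vaccine" is dropped as rare (one day only) and "covid" as context dependent (all three days), while "mask" survives.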

## Porter Stemming<br>

Stem all the words in contextDependent with the Porter stemmer and store the result in the stemmer column.

In [5]:
ps = PorterStemmer()
finalData["stemmer"] = finalData["contextDependent"].apply(lambda x: [ps.stem(word) for word in x])


## Uni gram Dictionary<br>

1. Create unigrams from the stemmer column.<br>
2. Loop over the finalData dataframe date by date, count every word, and store the counts in a dictionary as word:count (key:value) pairs.<br>
3. Add each word:count pair to a list as a tuple.<br>
4. Sort the tuples by count in descending order and store them in uniDict under their date.

In [6]:
uniDict = {} # create empty dictionary
# loop over the data frame based on dates
for date in sheetName.sheet_names:
    uni = {}
    uniList = []
    # filter data based on date to avoid duplicate entries
    df = finalData[finalData["created_at"] == date]
    
    # take count of all the words in column stemmer
    for text in df["stemmer"]:
        for words in text:
            if words in uni:
                uni[words]+=1
            else:
                uni[words] = 1
    # add key value pair of word and count to list as tuple.
    for key, values in uni.items():
        uniList.append((key, values))
        
    # sort uniList by count, highest first
    uniList.sort(key=lambda tup: tup[1], reverse=True)
    uniDict[date] = uniList
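
For comparison, the same per-day count and ranking can be written with `collections.Counter`. This is a sketch only: `top_unigrams` is a hypothetical helper and the stems are toy data.

```python
from collections import Counter

def top_unigrams(token_lists, n=100):
    """Return the n most frequent tokens as (token, count) tuples."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(tokens)
    # most_common sorts by count in descending order
    return counts.most_common(n)

# Hypothetical stemmed tweets for a single day.
toy_day = [["vaccin", "trial"], ["vaccin", "mask"], ["vaccin", "trial"]]
top = top_unigrams(toy_day, n=2)
print(top)
# → [('vaccin', 3), ('trial', 2)]
```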

## Creating Bi grams (top 200 based on PMI measure)<br>

1. Define the function top200Bigram. Its pmi_measure parameter (default 200) is the number of bigrams to return, ranked by PMI.<br>
2. Create bigrams from the tokens column.<br>
3. Loop over the finalData dataframe date by date and pass all of a day's tokens to the top200Bigram function to get a list of bigram tuples.<br>
4. Join each bigram with "_" and add the result to a dictionary with the date as key.<br>
5. Add all the bigrams to a list and build uniqueBi, the list of unique bigrams.<br>
6. Count the bigrams and add each word:count pair to a list as a tuple.<br>
7. Sort the tuples by count and store them by date in the biGramDict dictionary.

In [8]:
# this code was adapted from the lab exercise

# pmi_measure is the number of bigrams to return (200 meaningful bigrams are required)
def top200Bigram(tokenList, pmi_measure = 200):
    bigram_measures = nltk.collocations.BigramAssocMeasures()
    bigram_finder = nltk.collocations.BigramCollocationFinder.from_words(tokenList)
    # ignore bigrams that occur fewer than 20 times
    bigram_finder.apply_freq_filter(20)
    # rank the remaining bigrams by PMI and keep the best pmi_measure of them
    topNBigrams = bigram_finder.nbest(bigram_measures.pmi, pmi_measure)
    return topNBigrams
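
As background for the PMI ranking, the sketch below computes pointwise mutual information by hand as log2(P(x, y) / (P(x) P(y))). It illustrates the idea only and is not NLTK's implementation, whose counting conventions may differ. The function name, the `min_freq` parameter, and the toy sentence are assumptions.

```python
import math
from collections import Counter

def pmi_scores(tokens, min_freq=1):
    """Score each adjacent word pair with PMI = log2(P(x, y) / (P(x) * P(y)))."""
    word_counts = Counter(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    n_words = len(tokens)
    n_pairs = max(len(tokens) - 1, 1)
    scores = {}
    for (x, y), n_xy in pair_counts.items():
        if n_xy < min_freq:
            continue  # frequency filter, same idea as apply_freq_filter
        p_xy = n_xy / n_pairs
        p_x = word_counts[x] / n_words
        p_y = word_counts[y] / n_words
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores

# Hypothetical token stream.
toy = "new york is big and new york is busy and life is good".split()
scores = pmi_scores(toy, min_freq=2)
best = sorted(scores, key=scores.get, reverse=True)
print(best)
```

With the frequency filter at 2, only ("new", "york") and ("york", "is") remain. The first ranks higher because "is" is the more common word, which lowers the PMI of any pair that contains it. Without a frequency filter, PMI tends to favour pairs of words that each occur only once.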

In [19]:
# Creating bi Gram
biDict = {}
# empty list for bi grams
extened_biGram_List = []
for date in sheetName.sheet_names:
    # empty list
    TList = []
    # filter dataframe as per the date
    biGram = finalData[finalData["created_at"] == date]
    
    # loop over all the tokens and add to TList
    for token in biGram["tokens"]:
        TList.extend(token)
    
    # pass TList as parameter to top200Bigram function
    top_200 = top200Bigram(TList)
    # join bi grams with "_"
    biDict[date] = ["_".join(x) for x in top_200] 
    # add bi grams to list
    extened_biGram_List.extend(biDict[date])
# create unique list of bi grams
uniqueBi = list(set(extened_biGram_List))

# empty bi gram dictionary
biGramDict = {}
for keys, value in biDict.items():
    TempDict = {}
    biGramList = []
    # take count of bi grams; nbest returns distinct bigrams, so each count is 1
    for word in value:
        if word in TempDict:
            TempDict[word]+=1
        else:
            TempDict[word]=1
    # save word count as tuple in list
    for word, count in TempDict.items():
        biGramList.append((word, count))
    # sort based on count
    biGramList.sort(key=lambda tup: tup[1], reverse=True)
    # update date key value as sorted list
    biGramDict[keys] = biGramList

## Create Vocabulary<br>

1. Take the set of words in the stemmer column of the dataframe.<br>
2. Add uniqueBi to the uniqueVocab list.<br>
3. Sort the list alphabetically.

In [11]:
vocabList = []
# add all words of stemmer to list
for tokens in finalData["stemmer"]:
    vocabList.extend(tokens)
# take set of vocabList to get unique vocab
uniqueVocab = list(set(vocabList))
# add bi gram to this uniqueVocab
uniqueVocab.extend(uniqueBi)
# sort the list
uniqueVocab.sort()

In [20]:
# Creating final vocabulary
Vocab = {}
# assign each term its index in alphabetical order
for i, words in enumerate(uniqueVocab):
    Vocab[words] = i
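
The two vocabulary cells above can be condensed into one helper. This is a sketch under the assumption that unigrams and bigrams arrive as plain lists of strings; `build_vocab` and the example terms are hypothetical.

```python
def build_vocab(unigrams, bigrams):
    """Map each distinct term to its position in alphabetical order."""
    terms = sorted(set(unigrams) | set(bigrams))
    return {term: index for index, term in enumerate(terms)}

# Hypothetical stems and one joined bigram.
vocab = build_vocab(["mask", "vaccin", "mask"], ["new_york"])
print(vocab)
# → {'mask': 0, 'new_york': 1, 'vaccin': 2}
```

`sorted` compares strings by code point, so any capitalised term sorts before all lowercase terms.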

## Combine values of uni and bi gram based on date.

In [21]:
# Combine Uni and Bi gram dictionaries together
combined_Grams = {}
# add values of bi gram to uni gram based on date as key.
for date in sheetName.sheet_names:
    combined_Grams[date] = uniDict[date]+biGramDict[date]

## Count Vector (doc-term matrix)<br>

1. Loop over the combined_Grams dictionary.<br>
2. For each term, fetch its index from the Vocab dictionary and its count from combined_Grams.<br>
3. Add the resulting index:count entries to a list for each date.

In [22]:
# empty countVec list
countVec = []

for key,value in combined_Grams.items():
    dataList = []
    dataList.append(key) # start the row with the date
    for words in value:
        # build the "vocab index:count" entry for this term
        x = str(Vocab[words[0]])+":"+str(words[1])
        # append the entry to the row
        dataList.append(x)
    # append each list for respective date to countVec
    countVec.append(dataList)    
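
The row format built above is `date,index:count,index:count,...`. The sketch below produces one such row. Unlike the cell above, it orders entries by vocabulary index and skips terms that are missing from the vocabulary; both choices, along with the helper name, the date, and the toy data, are assumptions for illustration.

```python
def sparse_line(date, term_counts, vocab):
    """Format one day as 'date,index:count,index:count,...'."""
    # keep only terms that are in the vocabulary, ordered by their index
    pairs = sorted((vocab[term], count) for term, count in term_counts if term in vocab)
    return ",".join([date] + [f"{index}:{count}" for index, count in pairs])

vocab = {"mask": 0, "new_york": 1, "vaccin": 2}          # hypothetical vocabulary
counts = [("vaccin", 3), ("mask", 1), ("lockdown", 4)]   # 'lockdown' is not in vocab
line = sparse_line("2020-03-08", counts, vocab)
print(line)
# → 2020-03-08,0:1,2:3
```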

## Write Vocab, countVec, uni and bi gram to text files.

In [27]:
with open('30749476_100uni.txt','w+', encoding='utf-8') as writeData:
    for key, value in uniDict.items():
        writeData.write('{}:{}\n'.format(key, value[0:100]))

In [28]:
with open('30749476_100bi.txt','w+', encoding='utf-8') as writeData:
    for key, value in biGramDict.items():
        # slicing already handles days with fewer than 100 bigrams
        writeData.write('{}:{}\n'.format(key, value[0:100]))

In [29]:
with open('30749476_countVec.txt','w+', encoding='utf-8') as writeData:
    for value in countVec:
        writeData.write(",".join(value))
        writeData.write('\n')

In [30]:
with open('30749476_vocab.txt','w+', encoding='utf-8') as writeData:
    for key,value in Vocab.items():
        writeData.write('{}:{}\n'.format(key, value))