#### Date: 12/09/2020

##### Version: 1.0

#### Environment: Python 3.7.4 and Jupyter notebook

### Libraries used:

* os (for loading the system path and files, included in Anaconda Python 3.7)
* re (for regular expressions, included in Anaconda Python 3.7)
* langid (for language classification, included in Anaconda Python 3.7)
* pandas (for dataframe manipulation, included in Anaconda Python 3.7)
* nltk (for natural language toolkit, included in Anaconda Python 3.7)
* sklearn (for machine learning, included in Anaconda Python 3.7)
* itertools (for list manipulation, included in Anaconda Python 3.7)

## 1. Importing the libraries 

In [None]:
import pandas as pd
import langid
import re
from itertools import chain
from nltk import pos_tag
from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer
from nltk.probability import FreqDist
from nltk.collocations import BigramAssocMeasures as bigram_measures
from nltk.collocations import BigramCollocationFinder as bigram_finder
from nltk.util import ngrams
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import MWETokenizer


## 2. Initializing the various lists, dictionaries and the file path for the input file used in the program

In [None]:
#Initializing the lists, dictionaries, stemmer and input file path
data_dict = {}
data_list = []
word_list = []
clean_dict = {}
token_dict = {}
bigram_dict = {}
unigram_dict = {}
index_dict ={}
without_R_tokens = {}
stemmer = PorterStemmer()
file_path = "31224075.xlsx"
token_list =[]

## 3. Parsing and reading the Input Excel File | Reading the stopwords file

In [None]:
%%time
#Creates an ExcelFile object
file = pd.ExcelFile(file_path)

# Reading the stopwords file for context independent stopwords removal
with open('stopwords_en.txt','r') as stopwords:
    stopwords_list = [line.strip() for line in stopwords]

## 4. Reading the Excel File and Pre-processing the data

### The main pre-processing tasks done in this step include:
#### 1. Tokenization
#### 2. Remove tokens less than length 3.
#### 3. Remove stopwords (Context Independent stopwords)


* Parse each sheet. To find the field with the text, drop every row and column whose values are all NA, using `dropna` with `how='all'`.
* Pass each record from the sheet on for processing.
* Classify the text with langid so that only English tweets are processed.
* If the tweet is in English, pass it to the `RegexpTokenizer`, which turns each sentence/tweet into a list of tokens.
* Remove tokens shorter than 3 characters from the token list.
* Filter the context-independent stopwords out of the token list.
* Collect the resulting word lists and store them in the clean data dictionary, with the sheet name (i.e. the date) as the key.
* Additionally, store the unfiltered token list in the token dictionary for later use when creating bigrams.
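
The per-tweet steps listed above can be tried on a single sentence. The following is a minimal sketch using only the standard library: `re.findall` stands in for NLTK's `RegexpTokenizer` with the same pattern, and a plain length check replaces the short-word regex. The function name `preprocess`, the stopword set and the sample tweet are hypothetical and exist only for illustration.

```python
import re

# Same token pattern as the notebook's RegexpTokenizer: alphabetic words,
# optionally joined once by a hyphen or an apostrophe.
TOKEN_PATTERN = re.compile(r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?")


def preprocess(text, stopwords, min_len=3):
    """Tokenize, drop tokens shorter than min_len, lowercase, remove stopwords."""
    tokens = TOKEN_PATTERN.findall(text)
    filtered = [t.lower() for t in tokens if len(t) >= min_len]
    return [t for t in filtered if t not in stopwords]


# Hypothetical stopword set and tweet, for illustration only.
sample_stopwords = {"the", "and", "for", "are"}
sample_tweet = "The well-known labs are testing a vaccine for COVID-19 in NY"
clean_tokens = preprocess(sample_tweet, sample_stopwords)
print(clean_tokens)
# ['well-known', 'labs', 'testing', 'vaccine', 'covid']
```

Note that the pattern keeps only letters, so "COVID-19" is reduced to "covid", and the two-letter token "NY" is dropped by the length filter.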

In [None]:
%%time
#Parsing the data and reading the data one sheet at a time
for sheets in file.sheet_names:
    df = pd.read_excel(file, sheets)
    
    #Handle the NA's in the sheet
    df = df.dropna(axis=1, how='all')
    df = df.dropna(axis=0, how='all')
   
    #Adjust the header for the columns in the df
    for col in df.columns:
        if 'Unnamed' in col:
            df.columns = df.iloc[0]
            
    #Loop through the length of the dataframe in each sheet        
    for i in range(1,len(df)):
        #Checking for English language tweets
        x = langid.classify(str(df.text.values[i]))
        #Process the data only if it is in the English language
        if x[0] == 'en':
            #Apply RegexTokenizer to make tokens from the sentence
            tokenizer = RegexpTokenizer(r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?")
            
            #Apply tokenize to the tokenizer defined above
            tokens = tokenizer.tokenize(str(df.text.values[i]))
           
            #Removing words less than 3 using regex
            
            #Regex expression to compile for the words having length less than 3
            short_word =re.compile(r'^(..?)(?!.)')
            
            #Removing the tokens with length less than 3 from the tokens list
            filtered = [ele.lower() for ele in tokens if not short_word.match(ele)]
            
            #Removing context independent stop words from the filtered token list
            without_sw = [w for w in filtered if w not in stopwords_list]
            
            #Storing the tokens without stopwords for this tweet
            data_dict = without_sw
           
            #Appending this tweet's clean tokens to the data list
            data_list.append(data_dict)  
            
            #Appending this tweet's raw tokens to the token list
            token_list.append(tokens)
    
    #Storing the token list to token dictionary with sheetname as the key (which stores the date)
    token_dict[sheets] = list(chain(*token_list))
    
    #Storing the without stopwords tokens list to clean data dictionary with sheetname as the key (which stores the date)
    clean_dict[sheets] = list(chain(*data_list))
    
    #Re initializing the lists to store new values for the next data record
    data_list = []
    token_list = []

## 5. Removing Rare Tokens and Context-Dependent Words (with the threshold set to more than 60 days) and Stemming

* Rare tokens, i.e. tokens that appear on fewer than 5 days, are removed.
* Context-dependent words, i.e. tokens that appear on more than 60 days, are removed from the clean data dictionary.

* The unique words of each day in the clean dictionary are combined into one list, which is passed to `FreqDist` and stored as `rare_tokens`.
* `FreqDist` then gives, for each token, the number of days on which it appears.
* A token that appears on more than 60 days is removed as a context-dependent frequent word. A token that appears on fewer than 5 days is removed as a rare token.

* Removing rare tokens is intuitive: unique words such as names, brands and HTML leftovers carry little signal for most NLP or machine learning tasks. For instance, using names as predictors in a text classification problem is a bad approach.

* Removing context-dependent words can reduce ambiguity in a classification problem, but it might not serve the purpose of every task.
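
The day-based thresholds described above can be sketched with the standard library alone. Here `collections.Counter` stands in for NLTK's `FreqDist`, and the function name `filter_by_day_frequency`, its parameters and the toy data are hypothetical. With `min_days=5` and `max_days=60` the filter would match the thresholds used in this notebook; the toy call uses smaller values so that the effect is visible on three days of data.

```python
from collections import Counter
from itertools import chain


def filter_by_day_frequency(tokens_per_day, min_days, max_days):
    """Keep tokens that occur on at least min_days and at most max_days days."""
    # set() makes each token count once per day, so the Counter holds
    # the number of days a token appears on, not its raw frequency.
    day_counts = Counter(
        chain.from_iterable(set(tokens) for tokens in tokens_per_day.values())
    )
    return {tok for tok, n in day_counts.items() if min_days <= n <= max_days}


# Hypothetical toy data: three "days" of already-cleaned tokens.
toy_days = {
    "2020-03-01": ["virus", "lockdown", "virus", "pizza"],
    "2020-03-02": ["virus", "lockdown", "mask"],
    "2020-03-03": ["virus", "mask", "mask"],
}
kept = filter_by_day_frequency(toy_days, min_days=2, max_days=2)
print(sorted(kept))
# ['lockdown', 'mask']
```

In the toy data "pizza" is dropped as a rare token (one day) and "virus" as a too-frequent token (all three days).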

In [None]:
%%time
# Creating the clean data tokens list from the clean dictionary (each token counted once per day)
clean_data_tokens = list(chain(*[set(value) for value in clean_dict.values()]))
# Find the frequency distribution for the list, i.e. the number of days each token appears on
rare_tokens = FreqDist(clean_data_tokens)

#Deleting the values from the rare_tokens dictionary
for k, v in list(rare_tokens.items()):
    # Delete rare tokens, which appear on fewer than 5 days
    if v < 5:
        del rare_tokens[k]
    # Delete context dependent words, which appear on more than 60 days
    elif v > 60:
        del rare_tokens[k]

        
#Stemming the data so that different forms of the same word are merged
clean_data = [stemmer.stem(word) for word in rare_tokens.keys()]


## 6. Creating Unigrams 

* Here we use the pre-processed clean dictionary, from which tokens shorter than 3 characters and context-independent stopwords have already been removed.  
> Even after filtering for informative words, several tokens may still represent the same meaning in different forms: they map to the same word but differ in spelling and structure because of the sentence context.  
> For example, “connect”, “connecting”, “connected”, “connections”, and “connects” can all be reduced to the word connect.

* The tokens in clean_dict are stemmed using the Porter stemmer.
* The stemmed tokens are used to build a FreqDist.
* Only the 100 most common unigrams are stored in the dictionary for each key, i.e. date.

* The top 100 unigrams for each date are written to the output file.
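
As a rough illustration of the "top N per date" step and of the resulting output line, here is a standard-library sketch. NLTK's `FreqDist` is a subclass of `collections.Counter`, so `most_common` behaves the same way here. The helper `top_n_per_key` and the toy tokens are hypothetical, and the tokens are assumed to be stemmed already.

```python
from collections import Counter


def top_n_per_key(tokens_by_date, n):
    """Return {date: [(token, count), ...]} holding the n most frequent tokens."""
    return {
        date: Counter(tokens).most_common(n)
        for date, tokens in tokens_by_date.items()
    }


# Hypothetical, already-stemmed tokens for a single date.
toy_stemmed = {
    "2020-03-01": ["connect", "viru", "connect", "mask", "connect", "viru"]
}
toy_top = top_n_per_key(toy_stemmed, n=2)

# One output line per date, in the form date:[(token, count), ...]
line = "{}:{}".format("2020-03-01", toy_top["2020-03-01"])
print(line)
# 2020-03-01:[('connect', 3), ('viru', 2)]
```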

In [None]:
%%time

#Creating unigrams from the clean tokens dictionary 
#(Which is already tokenized and the stop words have been removed from it and tokens less than 3 have also been eliminated.)
for k, v in clean_dict.items():
    #Stemming each of the tokens
    clean_dict_stem = [stemmer.stem(word) for word in v]
    #Creating frequency Distribution and passing the dictionary with words
    unigram = FreqDist(clean_dict_stem)
    #Finding the most common 100 words in each key
    unigram_dict[k] = unigram.most_common(100)

#Creating Unigram.txt file
with open('31224075_100uni.txt', 'w') as fw:
    for k, v in unigram_dict.items():
        fw.write(k)
        fw.write(':')
        fw.write(str(v))
        fw.write('\n')


## 7. Creating Bigrams 

* Use the token dictionary created initially, which holds just the tokens.
* Create n-grams of length 2 by passing n = 2.
* Find the frequency distribution of this bigram data.
* Sort the bigram dict for each key, i.e. date.

* Write the bigram_dict to the bigrams output file.
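
The bigram counting described above can be sketched without NLTK: pairing the token list with itself shifted by one position yields the same adjacent pairs that `ngrams(..., n = 2)` produces, and `Counter` plays the role of `FreqDist`. The function name `count_bigrams` and the toy tokens are hypothetical.

```python
from collections import Counter


def count_bigrams(tokens):
    """Count adjacent token pairs after lowercasing the tokens."""
    lowered = [t.lower() for t in tokens]
    # zip(lowered, lowered[1:]) yields (t0, t1), (t1, t2), ...
    return Counter(zip(lowered, lowered[1:]))


# Hypothetical tokens for one date.
toy_tokens = ["Stay", "home", "stay", "safe", "stay", "home"]
toy_bigrams = count_bigrams(toy_tokens)
print(toy_bigrams.most_common(1))
# [(('stay', 'home'), 2)]
```

Because the tokens of all tweets of a day are concatenated into one list before pairing, some pairs span the boundary between two tweets; whether that matters depends on the task.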

In [None]:
%%time
#Creating bigrams from the token dictionary
for k, v in token_dict.items():
    #Creating bigrams using Ngrams
    bigrams = ngrams([x.lower() for x in v], n = 2)
    #Finding the Frequency distribution
    bigram = FreqDist(bigrams)
    #Finding the 100 most common 
    bigram_dict[k] = bigram.most_common(100)

#Note: sorted() returns a new list; the result is not stored, so bigram_dict keeps its original order
sorted(bigram_dict.items(), key=lambda kv: kv[1], reverse=True)

# Writing the Bigram.txt file
with open('31224075_100bi.txt', 'w') as fw:
    for k, v in bigram_dict.items():
        fw.write(k)
        fw.write(':')
        fw.write(str(v))
        fw.write('\n')


## 8. Create Vocab File

* Using chain, combine all the token lists from the clean dictionary into one list.
* Using `from_words` of NLTK's BigramCollocationFinder (imported as bigram_finder), create bigrams from the token_list.
* Using the `nbest` function with `bigram_measures.pmi`, select the top 200 bigrams by PMI.
* Join the words of each bigram with '_' to form collocation words.
* Combine the collocation words and the clean data to form a vocab list.
* Sort the vocab in ascending alphabetical order.
* Store the vocab words as keys and their indices as values in the index dictionary.
* Write the vocab list to the vocab output file in the required format, i.e. token:index.
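
To make the PMI ranking and the vocab index more concrete, here is a simplified standard-library sketch. It scores each adjacent pair as log2(count(a, b) * N / (count(a) * count(b))), where N is the number of tokens. This is an illustration of the measure, not NLTK's own code, and `pmi_scores` and the toy tokens are hypothetical.

```python
import math
from collections import Counter


def pmi_scores(tokens):
    """Pointwise mutual information for every adjacent token pair."""
    word_counts = Counter(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)
    return {
        (a, b): math.log2(n_ab * total / (word_counts[a] * word_counts[b]))
        for (a, b), n_ab in pair_counts.items()
    }


# Hypothetical clean tokens.
toy_tokens = ["wash", "hands", "stay", "home", "stay", "safe", "stay", "home"]
toy_scores = pmi_scores(toy_tokens)
best_pair = max(toy_scores, key=toy_scores.get)

# Join the best pair with '_' and merge it into a sorted vocab with indices.
toy_vocab = sorted(set(toy_tokens) | {"_".join(best_pair)})
toy_index = {word: i for i, word in enumerate(toy_vocab)}
print(best_pair, toy_scores[best_pair])
# ('wash', 'hands') 3.0
print(toy_index)
# {'hands': 0, 'home': 1, 'safe': 2, 'stay': 3, 'wash': 4, 'wash_hands': 5}
```

In this toy data the pair seen only once ("wash hands") outranks the pair seen twice ("stay home"), because PMI rewards pairs whose words rarely occur apart.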


In [None]:
%%time
#Creating the token list by combining the  clean_dict values
token_list = list(chain(*clean_dict.values()))
#Creating the bigrams from the token list
finder = bigram_finder.from_words(token_list)
#Finding the top 200 bigrams by PMI from the token list
pre_collocation = finder.nbest(bigram_measures.pmi, 200)
#Combining the bigrams to create collocations with '_'
collocation = ['_'.join(value).lower() for value in pre_collocation]

#Combine the clean data and the collocation words to form the vocab
vocab = [*collocation, *set(clean_data)]

#Sort the vocab in alphabetical order
vocab.sort()

#Creating Vocab.txt file
with open('31224075_vocab.txt', 'w') as fw:
    for index, word in enumerate(vocab):
        fw.write(word)
        fw.write(':')
        fw.write(str(index))
        fw.write('\n')
        index_dict[word] = index


## 9. Create the Sparse matrix using Count Vectorization

* Use CountVectorizer to convert a collection of text documents to a matrix of token counts.
* Generate the count vector representation for each key, i.e. date.
* Re-tokenize the clean dictionary tokens so that the words of each collocation are joined.
* Stem the remaining single words (collocations are left unchanged).
* Use the vectorizer's fit_transform on the re-tokenized words to get the data features.
* Write the processed sparse matrix to the output file in the required format, replacing each word with its index.
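
The output format of the sparse representation can be illustrated with a small standard-library sketch, where `Counter` stands in for `CountVectorizer`. The helper `count_vector_line`, the toy index and the toy tokens are hypothetical. Entries are ordered by vocab index here, which is a choice made for this sketch.

```python
from collections import Counter


def count_vector_line(date, tokens, index_by_word):
    """Format one output line as date,index:count,index:count,..."""
    # Tokens that are not part of the vocab are skipped.
    counts = Counter(t for t in tokens if t in index_by_word)
    by_index = sorted((index_by_word[w], c) for w, c in counts.items())
    parts = ["{}:{}".format(idx, c) for idx, c in by_index]
    return ",".join([date] + parts)


# Hypothetical vocab index and tokens for one date.
toy_index = {"home": 0, "stay": 1, "wash_hands": 2}
toy_line = count_vector_line(
    "2020-03-01", ["stay", "home", "stay", "wash_hands", "pizza"], toy_index
)
print(toy_line)
# 2020-03-01,0:1,1:2,2:1
```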



In [None]:
%%time
#Initializing the Count Vectorizer
vectorizer = CountVectorizer(analyzer = "word") 
# Retokenize the collocation words
mweToken = MWETokenizer(pre_collocation)

#Write the count vector file
with open('31224075_countVec.txt', 'w') as fw:
    for k, v in clean_dict.items():
        word = (mweToken.tokenize(v))
        token = []
        for w in word:
            #Stem single words; leave collocations (joined with '_') unchanged
            if re.search('(_)', w) is None:
                token.append(stemmer.stem(w))
            else:
                token.append(w)
        data_features = vectorizer.fit_transform([' '.join(token)])
        #Note: scikit-learn 1.0+ offers get_feature_names_out(); get_feature_names() was removed in 1.2
        vocab_feature = vectorizer.get_feature_names()
        fw.write(k)
        for data, indices in zip(data_features.data, data_features.indices):
            word = vocab_feature[indices]
            index = index_dict.get(word, None)
            #Skip words that are not part of the vocab
            if index is None:
                continue
            fw.write(','+str(index))
            fw.write(':')
            fw.write(str(data))
        fw.write('\n')

## Some statistical information from the data:

### Plotting the most frequent words in the data after tokenization:

In [None]:
import matplotlib.pyplot as plt
#Plotting most freq words
plot_words = list(chain(*token_dict.values()))
plot_words = FreqDist(plot_words)
plot_words = list(reversed(plot_words.most_common(20)))
print(plot_words)
data_words = []
words_counts = []
for v in plot_words:
    data_words.append(v[0])
    words_counts.append(v[1])

plt.xticks(rotation=90)
plt.plot(data_words, words_counts)
plt.show()

### Plotting the most frequent tokens after removing the words shorter than length 3 and the stopwords:

In [None]:
import matplotlib.pyplot as plt
#Plotting most freq words
plot_words = list(chain(*clean_dict.values()))
plot_words = FreqDist(plot_words)
plot_words = list(reversed(plot_words.most_common(20)))
print(plot_words)
data_words = []
words_counts = []
for v in plot_words:
    data_words.append(v[0])
    words_counts.append(v[1])

plt.xticks(rotation=90)
plt.plot(data_words, words_counts)
plt.show()

In [None]:
import numpy as np
words = list(chain(*clean_dict.values()))
#vocab = set(words)
lexical_diversity = len(words)/len(vocab)
print ("Vocabulary size: ",len(vocab))
print ("Total number of tokens: ", len(words))
print ("Lexical diversity: ", lexical_diversity)
print ("Total number of documents:", len(token_dict))
lens = [len(value) for value in token_dict.values()]
print ("Average document length:", np.mean(lens))
print ("Maximum document length:", np.max(lens))
print ("Minimum document length:", np.min(lens))
print ("Standard deviation of document length:", np.std(lens))

## Conclusion & Learnings:

Almost all NLP and machine learning tasks require the data to be preprocessed before a model is trained or further tasks are performed.
Most tasks and models cannot use raw text directly, so the text has to be preprocessed first. The preprocessing methods can differ depending on the nature of the task. 
Why use NLTK to perform most of the tasks?
* NLTK is one of the leading platforms for working with language data.
* It provides easy-to-use APIs for many text preprocessing methods. 

Some of the important tasks in data preprocessing include:
* Lowercasing
* Tokenization
* Stopword filtering
* Stemming
* Finding additional collocated words from n-grams
* POS tagging

Each of these is important in determining the feature data for the model. The unigrams, bigrams and n-grams extracted from the data are used for a variety of things. Some use cases include sentence auto-completion, automatic spell checking and checking the grammar of a sentence. 

As most machine learning models only work on numbers, this character/text data has to be converted to numerical values. Hence the use of a sparse matrix built with count vectorization.
Applications of this approach include text summarization, sentiment analysis and topic modelling.




## References

1. http://www.nltk.org/api/nltk.tokenize.html
2. http://www.nltk.org/book/ch05.html
3. https://www.nltk.org/howto/collocations.html
4. https://medium.com/python-in-plain-english/collocation-discovery-with-pmi-3bde8f351833