## Text_analytics/Assignment_1/MDS201803

In [0]:
import unicodedata
import numpy as np
import pickle
import string
import re
from nltk import ngrams
from plotly import express as px
import plotly.graph_objects as go

The code below extracts the **Bengali** text corpus from the Wikipedia XML dump. As written, it processes every article in the dump and reports progress every 100,000 pages.

In [0]:
from wiki_dump_reader import Cleaner, iterate

#https://github.com/CyberZHG/wiki-dump-reader
#pip install wiki-dump-reader
#Code adapted from https://github.com/CyberZHG/wiki-dump-reader
def write_corpus():
    corpus_file = '/media/subhasish/Professional/CMI/Sem_3/Text_analysis/CorpusFileName_1.txt'
    page_count = 0
    cleaner = Cleaner()
    with open(corpus_file, 'w', encoding='utf-8') as output:
        for title, text in iterate('/home/subhasish/Downloads/bnwiki-latest-pages-articles.xml'):
            text = cleaner.clean_text(text)
            cleaned_text, links = cleaner.build_links(text)
            output.write(title + '\n' + cleaned_text + '\n')
            page_count += 1
            if page_count % 100000 == 0:
                print('Pages dumped = ', page_count)
    # the with-block closes the output file automatically

write_corpus()

In [0]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


The corpus is read as a single string named **raw**.

In [0]:
# read the whole corpus as UTF-8 and close the file afterwards
with open('/content/drive/My Drive/CorpusFileName.txt', encoding='utf-8') as f:
    raw = f.read()

### Preprocessing of the data

After examining the raw corpus, we observe that the data also contains punctuation, symbols, English words and digits. Since we are interested in the Bengali words only, we preprocess the data to remove them.

Regular Expressions are used for preprocessing of the data

In [0]:
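raw = re.sub("[0-9]","",raw)       # removing ASCII digits
raw = re.sub("\n"," ",raw)         # replacing newline characters with spaces
raw = re.sub("="," ",raw)          # replacing the '=' symbol with a space
raw = re.sub("→"," ",raw)          # replacing the '→' symbol with a space
raw = re.sub("[a-zA-Z]","",raw)    # removing English (Latin) letters
raw = re.sub("–"," ",raw)          # replacing the en dash with a space
raw = re.sub("।"," ",raw)          # replacing the Bengali full stop (danda) with a space
raw = re.sub("[!#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~\"]","",raw)   # removing ASCII punctuation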

In [0]:
raw[:200]

'বাংলা ভাষা বাংলা ভাষা বাঙলা বাঙ্গলা তথা বাঙ্গালা নামগুলোতেও পরিচিত একটি ইন্দোআর্য ভাষা যা দক্ষিণ এশিয়ার বাঙালি জাতির প্রধান কথ্য ও লেখ্য ভাষা  মাতৃভাষীর সংখ্যায় বাংলা ইন্দোইউরোপীয় ভাষা পরিবারের চতু'
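The cleaning steps above can be tried on a short string before running them on the full corpus. The following is a minimal sketch, not the notebook's exact code: the function name `clean_text` and the sample sentence are assumptions made for illustration, and the punctuation class is built from `string.punctuation` (which also covers the backslash) instead of being spelled out by hand.

```python
import re
import string

def clean_text(text):
    """Apply the same kind of regex cleaning to a small string (illustrative helper)."""
    text = re.sub(r"[0-9]", "", text)        # drop ASCII digits
    text = re.sub(r"[a-zA-Z]", "", text)     # drop Latin letters
    text = re.sub(r"[\n=→–।]", " ", text)    # separators become spaces
    # drop every ASCII punctuation character
    text = re.sub("[" + re.escape(string.punctuation) + "]", "", text)
    return text

sample = "বাংলা ভাষা (Bengali) = 2019।\nভাষা"
tokens_sample = clean_text(sample).split()
print(tokens_sample)
```

Calling `split()` without an argument discards the empty strings that runs of spaces would otherwise produce, so no separate filtering pass is needed in this sketch.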

Next we split the raw string on the space character (' ') and list the resulting terms.

In [0]:
list_words = raw.split(" ")

In [0]:
# we note the number of terms before final preprocessing
len(list_words)

39669496

In [0]:
# removing the empty strings produced by consecutive spaces
words = []
for term in list_words:
    if term not in [''] :
        words.append(term)
len(words)        

32208190

To remove the Bengali digits we use the following code, which relies on the `unicodedata` module to check whether a term starts with a Bengali digit.

In [0]:
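def is_bengai_digit(word):
    """Return True if the first character of `word` is a Bengali digit."""
    try:
        # Unicode name of the first non-whitespace character, e.g. 'BENGALI DIGIT FIVE'
        lang = unicodedata.name(word.strip()[0])
        return 'BENGALI DIGIT' in lang
    except (IndexError, ValueError):
        # empty string, or a character without a Unicode name
        return False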

In [0]:
words_no_digit = []
for i in words:
    if not is_bengai_digit(i):
        words_no_digit.append(i)

len(words_no_digit)

30814262
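Note that the check above only looks at the first character of each term. The sketch below shows what `unicodedata.name` returns for Bengali characters and contrasts a first-character test with an any-character test. Both helper names (`starts_with_bengali_digit`, `contains_bengali_digit`) and the example terms are hypothetical and are not part of the original notebook.

```python
import unicodedata

def starts_with_bengali_digit(word):
    """First-character test, mirroring the behaviour of the cell above."""
    word = word.strip()
    # the second argument of unicodedata.name is the default for unnamed characters
    return bool(word) and "BENGALI DIGIT" in unicodedata.name(word[0], "")

def contains_bengali_digit(word):
    """Stricter variant (an assumption): flags a Bengali digit at any position."""
    return any("BENGALI DIGIT" in unicodedata.name(ch, "") for ch in word)

examples = ["৫টি", "বাংলা", "ক৫"]
first_char_flags = [starts_with_bengali_digit(w) for w in examples]
any_char_flags = [contains_bengali_digit(w) for w in examples]
print(unicodedata.name("ক"))   # Unicode name of a Bengali letter
print(first_char_flags, any_char_flags)
```

Whether terms with a digit in the middle should also be dropped is a design choice; the counts reported in this notebook come from the first-character version.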

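Now we remove the words of length 2 or less, keeping only the words with more than 2 characters. Note that `len` counts Unicode code points, so vowel signs and other combining marks count as characters.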

In [0]:
words_new = []     # word list after removing the words of length 2 or less
for i in words_no_digit:
    if len(i) > 2:
        words_new.append(i)

len(words_new)

28968242

Creating a dictionary to store the words along with their frequencies:


In [0]:
word_dict = {}
for i in words_new:
    # start from 0 the first time a word is seen
    word_dict[i] = word_dict.get(i, 0) + 1

In [0]:
# Assumption: the cleaned corpus to be saved is the token list `words_new`
with open("/content/drive/My Drive/Cleaned_Corpus.txt", "wb") as myFile:
    pickle.dump(words_new, myFile)

In [0]:
# drop the words that occur 20 times or fewer
for i in list(word_dict.keys()):
    if word_dict[i] <= 20:
        del word_dict[i]

Now we sort the word dictionary by word frequency, in descending order.

In [0]:
tokens = sorted(word_dict.items(), key=lambda x: x[1], reverse=True)

In [0]:
with open("/content/drive/My Drive/token_dict.pickle", "wb") as myFile:
    pickle.dump(tokens, myFile)
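The counting, thresholding and sorting steps above can also be sketched with `collections.Counter` from the standard library. The helper name `frequent_tokens` and the toy word list are illustrative assumptions; the default threshold of 20 matches the filtering cell above.

```python
from collections import Counter

def frequent_tokens(words, min_count=20):
    """Return (word, count) pairs with count > min_count, most frequent first."""
    counts = Counter(words)
    # most_common() yields the pairs sorted by count in descending order
    return [(w, c) for w, c in counts.most_common() if c > min_count]

toy_words = ["ভাষা", "বাংলা", "ভাষা", "একটি", "ভাষা", "বাংলা"]
toy_tokens = frequent_tokens(toy_words, min_count=1)
print(toy_tokens)
```

The result has the same shape as `tokens`: a list of `(word, frequency)` tuples ordered from the most to the least frequent word.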

#### Verifying Zipf's law

According to Zipf's law, the frequency of the i-th most frequent token is proportional to 1/i. In other words, if N is the frequency of the most frequent word, the second most frequent word would have frequency N/2, the third most frequent word would have frequency N/3, and so on. However, the frequencies of the top 5 tokens in the given data do not match this criterion.
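As a small worked example of the N/i pattern, the sketch below lists the frequencies Zipf's law would predict for the first few ranks. The function name `zipf_expected` and the top frequency of 1200 are made-up values for illustration and are not taken from the corpus.

```python
def zipf_expected(top_frequency, n_ranks):
    """Frequencies predicted by Zipf's law for ranks 1..n_ranks."""
    return [top_frequency / rank for rank in range(1, n_ranks + 1)]

predicted = zipf_expected(1200, 4)
print(predicted)   # N, N/2, N/3, N/4 for N = 1200
```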

We have:<br>
\begin{align}
r &= \frac{k}{f} \\
\log(r) &= \log(k) - \log(f) \tag{1}
\end{align}

<br> where $r$ is the rank of the word, $f$ is its frequency and $k$ is the proportionality constant.
<br>Equation $(1)$ describes a straight line with slope $-1$ in terms of $\log(r)$ and $\log(f)$. We now take the observed word frequencies and plot their $\log$ values alongside the $\log(r)$ values.


Indexing the words (ranking the words w.r.t their frequencies)

In [0]:
# attach the rank (1 = most frequent) to each (word, frequency) pair
token_indexed = []
for index, i in enumerate(tokens, start=1):
    token_indexed.append((i[0], i[1], index))

In [0]:
log_rank = []
log_freq = []
for i in token_indexed:
    log_freq.append(np.log(i[1]))
    log_rank.append(np.log(i[2]))
    
y_bar = np.mean(log_freq)
x_bar = np.mean(log_rank)
log_k = y_bar + x_bar    # OLS estimate of the intercept parameter for fixed slope linear regression  
x_sim = np.arange(0,11,0.1)
y_sim = log_k - x_sim

To check the data against Zipf's law, we fit a straight line of slope $-1$ to the observed values.
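As a brief sketch of where the intercept estimate comes from (the symbols $x_i$, $y_i$, $a$ and $n$ are notation introduced here for illustration): write $x_i = \log(r_i)$ and $y_i = \log(f_i)$ for the $n$ tokens, and fix the slope at $-1$ so that the fitted line is $y = a - x$. Minimising the sum of squared residuals over $a$ gives

\begin{align}
S(a) &= \sum_{i=1}^{n} \left(y_i - a + x_i\right)^2 \\
\frac{dS}{da} &= -2 \sum_{i=1}^{n} \left(y_i - a + x_i\right) = 0 \\
\hat{a} &= \bar{y} + \bar{x}
\end{align}

which is the quantity computed in the code as `log_k = y_bar + x_bar`.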

In [0]:
log_k # the OLS estimate of the intercept

14.30345486605854
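One way to sanity-check this estimator is to apply it to synthetic frequencies that follow Zipf's law exactly, in which case it should recover the true constant. The sketch below does this under stated assumptions: the function name `fit_zipf_intercept`, the constant `k_true = 5000` and the 50 synthetic ranks are all made up for illustration.

```python
import numpy as np

def fit_zipf_intercept(frequencies):
    """Fixed-slope (-1) least squares intercept for log(f) = a - log(r)."""
    # ranks 1..n are assigned in order of decreasing frequency
    freqs = np.sort(np.asarray(frequencies, dtype=float))[::-1]
    ranks = np.arange(1, len(freqs) + 1)
    return np.mean(np.log(freqs)) + np.mean(np.log(ranks))

k_true = 5000.0
synthetic = [k_true / r for r in range(1, 51)]   # exact Zipf frequencies
log_k_hat = fit_zipf_intercept(synthetic)
print(log_k_hat, np.log(k_true))
```

On real data the two numbers would differ, and the size of the residuals around the fitted line indicates how well the slope of $-1$ describes the corpus.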

In [0]:
fig = go.Figure()
fig.add_trace(go.Scatter(y = log_freq, x = log_rank, mode='lines', name='Observed'))
fig.add_trace(go.Scatter(y = y_sim, x = x_sim, mode='lines', name='Expected'))

fig.update_layout(
    title="Observed log(freq)",
    xaxis_title="log(word_rank)",
    yaxis_title="log(word_frequency)")

In [0]:
words_num = 20
k = np.exp(log_k)
expected = list(k/i for i in range(1,words_num + 1))
observed = list(tokens[i][1] for i in range(words_num))
label = list(tokens[i][0] for i in range(words_num))

fig = go.Figure()
fig.add_trace(go.Scatter(y = expected, x = label, name='Expected frequency', mode='lines'))
fig.add_trace(go.Scatter(y = observed, x = label, name='Observed frequency', mode='lines'))

The above graph compares the frequencies expected under the fitted Zipf's law with the observed word frequencies for the 20 most frequent words.