# Word Count, Phrase Analysis, Cross-Corpus Analysis

Relative to a reference corpus, English learners overuse some words and phrases and underuse others. In this lab, we will do word counting, phrase analysis, and cross-corpus analysis to find the phrases that learners overuse.
<br><br>
One dataset is taken from the [`British National Corpus`](http://www.natcorp.ox.ac.uk/) (BNC), a 100-million-word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of British English from the late twentieth century. The other is the [`NAIST Lang-8`](https://sites.google.com/site/naistlang8corpora/) corpus, collected from Lang-8, a language exchange social networking website geared towards language learners. The website is run by Lang-8 Inc., which is based in Tokyo, Japan.


https://drive.google.com/drive/folders/1vtCjRptZL6T4mffzbnqwi5i4WrqVnZHr?usp=sharing


## N-gram counting
We will tokenize the text and then calculate token frequencies. The tokenization rules in this lab are:
 1. Ignore case (e.g., "The" is the same as "the")
 2. Split on white space <s>and punctuation</s>
 3. Ignore all punctuation
<br><br>
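
As a quick illustration of the three rules, here is a minimal sketch on a made-up sentence. The name `tokenize_sketch` and the sample text are hypothetical and only serve this example; it assumes "punctuation" means the ASCII characters in `string.punctuation`.

```python
import string

def tokenize_sketch(text):
    # Rule 1: ignore case.
    lowered = text.lower()
    # Rule 3: drop every ASCII punctuation character (an assumption about
    # what counts as punctuation in this lab).
    stripped = lowered.translate(str.maketrans('', '', string.punctuation))
    # Rule 2: split on runs of white space.
    return stripped.split()

sample_tokens = tokenize_sketch('The cat, the DOG; and   "the" bird!')
print(sample_tokens)
```

Punctuation is removed before splitting, so a contraction such as "doesn't" ends up as the single token `doesnt`.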

In [None]:
import os
import re
import string
from pprint import pprint

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:

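def tokenize(text):
    """
    Input:
    "This is an example."

    Sample output: 
    ['this', 'is', 'an', 'example']
    """  
    #### [ TODO ] transform text to lower case
    lowerText = text.lower()
    # Drop digits, then drop every ASCII punctuation character.
    # (Inside a regex character class, an unescaped '-' between two
    # characters defines a range, so str.translate is used instead.)
    noDigits = re.sub(r'[0-9]+', '', lowerText)
    pureText = noDigits.translate(str.maketrans('', '', string.punctuation))
    #### [ TODO ] separate the words by white space
    splitText = pureText.split()
    return splitText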

from collections import Counter

def calculate_frequency(tokens):
    """
    Input:
    ['this', 'is', 'an', 'example', ...]

    Sample output: 
    {
        'the': 79809, 
        'project': 288,
        ...
    }
    """
    #### [ TODO ] 
    topToken = Counter(tokens).most_common()
    di = dict(topToken)
    return di
   
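def ranking(tokens):
    """
    Input: dict of token -> frequency, ordered from most to least frequent
    Output: new dict of token -> rank (the most frequent token ranks 1)
    """
    # The dict keeps insertion order, so the position is the rank.
    return {element: rank for rank, element in enumerate(tokens, start=1)}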

def get_ngram(tokens,n):
    """
    Input:
    ['this', 'is', 'an', 'example', ...]

    Sample output (n=2): 
    ['this is', 'is an', 'an example', ...]
    """
    # Slide a window of n tokens; the last valid start index is len(tokens) - n.
    return [' '.join(tokens[i:i+n]) for i in range(len(tokens) - n + 1)]
    

In [None]:
### data test 
with open('/content/drive/MyDrive/test.txt', 'r', encoding="utf-8") as f:
    fileRead = f.read()
#### [ TODO ] generate test unigrams and calculate document frequency of unigram in test
test_unigram = tokenize(fileRead)
test_bigram = get_ngram(test_unigram,2)
print(test_bigram)
# test_bigram_counter = calculate_frequency(test_bigram)
# lang_unigram_Rank = ranking(test_bigram_counter)
# print(lang_unigram_Rank)


['having spent', 'spent half', 'half days', 'days and', 'and full', 'full weeks', 'weeks at', 'at king', 'king henry', 'henry v', 'v school', 'school in', 'in coventry', 'coventry i', 'i feel', 'feel i', 'i can', 'can appreciate', 'appreciate more', 'more what', 'what being', 'being a', 'a teacher', 'teacher is', 'is like', 'like the', 'the challenges', 'challenges and', 'and every', 'every day', 'day tasks', 'tasks they', 'they face', 'face and', 'and how', 'how relevant', 'relevant some', 'some of', 'of the', 'the aspects', 'aspects of', 'of learning', 'learning that', 'that we', 'we had', 'had studied', 'studied were', 'were to', 'to the', 'the way', 'way the', 'the pupils', 'pupils there', 'there learnt', 'learnt i', 'i spent', 'spent a', 'a lot', 'lot of', 'of my', 'my time', 'time observing', 'observing teachers', 'teachers but', 'but also', 'also did', 'did some', 'some teaching', 'teaching myself', 'myself which', 'which included', 'included taking', 'taking over', 'over a', 'a

In [None]:
#  file_path = os.path.join('data', 'bnc.txt')
with open('/content/drive/MyDrive/bnc.txt', 'r', encoding="utf-8") as f:
    fileRead = f.read()
#### [ TODO ] generate BNC unigrams and calculate document frequency of unigram in BNC
BNC_unigram = tokenize(fileRead)
BNC_unigram_counter = calculate_frequency(BNC_unigram)


In [None]:
# Read lang-8 Data
# file_path = os.path.join('data','lang8.txt')
with open('/content/drive/MyDrive/lang8.txt', 'r', encoding="utf-8") as f:
    fileRead = f.read()
#### [ TODO ] generate lang8 unigrams and calculate document frequency of unigram in lang8
lang_unigram = tokenize(fileRead)
lang_unigram_counter = calculate_frequency(lang_unigram)


## Rank
Rank unigrams by their frequencies. The higher the frequency, the higher the rank. (The most frequent unigram ranks 1.)<br>
<span style="color: red">[ TODO ]</span> <u>Rank unigrams for Lang-8 and BNC.</u>
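
The ranking rule can be sketched on a toy token list as follows. The names `toy_tokens`, `toy_counts` and `toy_ranks` are hypothetical. This sketch gives tied frequencies distinct consecutive ranks, which is one possible convention and not necessarily the one the lab expects.

```python
from collections import Counter

toy_tokens = ['the', 'cat', 'the', 'dog', 'the', 'cat']
toy_counts = Counter(toy_tokens)

# most_common() lists (token, count) pairs by descending count,
# so the 1-based position in that list serves as the rank.
toy_ranks = {
    word: rank
    for rank, (word, _count) in enumerate(toy_counts.most_common(), start=1)
}
print(toy_ranks)  # {'the': 1, 'cat': 2, 'dog': 3}
```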

In [None]:
#### [ TODO ] Rank unigrams for lang
lang_unigram_Rank = ranking(lang_unigram_counter)


In [None]:
#### [ TODO ] Rank unigrams for BNC
BNC_unigram_Rank = ranking(BNC_unigram_counter)


## Calculate Rank Ratio
In this step, you need to match each unigram across the two datasets and calculate its Rank Ratio.  <br>Please use the following formula:<br> 
<br>

$\text{Rank Ratio} = \frac{\text{Rank in BNC}}{\text{Rank in Lang-8}}$
<br><br>
If a unigram doesn't appear in BNC, its BNC rank is treated as 1.

<span style="color: red">[ TODO ]</span> Please calculate all rank ratios of unigrams in Lang-8.
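
A small worked example of the formula, using made-up ranks (the dictionaries below are hypothetical and not taken from either corpus). A word missing from the BNC side gets BNC rank 1, following the rule stated above.

```python
# Hypothetical ranks for illustration only.
toy_bnc_rank = {'the': 1, 'of': 2, 'internet': 400}
toy_lang8_rank = {'the': 1, 'internet': 20, 'kawaii': 50}

toy_ratio = {
    # dict.get supplies the fallback BNC rank of 1 for unseen words.
    word: toy_bnc_rank.get(word, 1) / lang8_rank
    for word, lang8_rank in toy_lang8_rank.items()
}
print(toy_ratio)  # {'the': 1.0, 'internet': 20.0, 'kawaii': 0.02}
```

Here `internet` gets a ratio of 400 / 20 = 20: it ranks much nearer the top in the toy Lang-8 list than in the toy BNC list, which is the pattern a high Rank Ratio is meant to surface.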

In [None]:
#### [ TODO ] Calculate Rank Ratio
lang_rank_ratio = {}
for element in lang_unigram_Rank:
    # A unigram that does not appear in BNC is given BNC rank 1.
    bnc_rank = BNC_unigram_Rank.get(element, 1)
    lang_rank_ratio[element] = bnc_rank / lang_unigram_Rank[element]

list_lang_rank_ratio = sorted(lang_rank_ratio.items(), key=lambda x: x[1],reverse=True)
list_lang_rank_ratio_top = list_lang_rank_ratio[0:30]



## Sort the result
<span style="color: red">[ TODO ]</span> Please show the top 30 unigrams by Rank Ratio, together with their Rank Ratio values, in this format: 
<br>
<img src="https://scontent-hkt1-2.xx.fbcdn.net/v/t39.30808-6/307940624_756082125461769_4218487831464443689_n.jpg?_nc_cat=100&ccb=1-7&_nc_sid=730e14&_nc_ohc=M0u8b1s2wakAX_Mgt7E&_nc_ht=scontent-hkt1-2.xx&oh=00_AT_peeQy_D2UyQYlMWbCIZjQTU7F38SJyE2A09J_SnZ-aA&oe=632E03C0" width=50%>

In [None]:
#### [ TODO ] 
from tabulate import tabulate

# Prepend a 1-based position to each (unigram, ratio) pair.
lang_rank_ratio_top_with_rank = [
    [rank, unigram, ratio]
    for rank, (unigram, ratio) in enumerate(list_lang_rank_ratio_top, start=1)
]
headers = ["rank","unigram", "rankratio"]
print(tabulate(lang_rank_ratio_top_with_rank, headers = headers))

  rank  unigram          rankratio
------  -------------  -----------
     1  doesnt             85.5124
     2  internet           72.3168
     3  countrys           69.6875
     4  opcit              51.9052
     5  radstone           50.2724
     6  isnt               49.8752
     7  uht                49.7557
     8  kants              49.1012
     9  eu                 49.0184
    10  dont               48.4824
    11  companys           48.3068
    12  anthocyanins       47.6316
    13  ibid               43.583
    14  japans             43.4812
    15  webers             43.1641
    16  luthers            41.5939
    17  bryman             40.2765
    18  ibidp              39.0174
    19  womens             38.4173
    20  creon              38.0404
    21  microneedles       37.292
    22  rtas               37.2328
    23  didnt              36.6757
    24  pneumophila        35.8189
    25  globalisation      35.6638
    26  roosevelts         35.5508
    27  punic         

## For bigrams
<span style="color: red">[ TODO ]</span> Do the same thing for bigrams  
Hint:  
1. generate all bigrams for BNC / lang8  
2. calculate the frequency of each bigram  
3. rank bigrams by frequency  
4. calculate the rank ratio of each bigram
5. print out the top 30 highest rank ratio bigrams  
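
The five steps above can be sketched end to end on two tiny made-up token lists. All names here (`bigrams`, `rank_by_frequency`, `toy_bnc`, `toy_lang8`) are hypothetical, and the sketch prints the top 3 instead of the top 30 because the toy data is so small.

```python
from collections import Counter

def bigrams(tokens):
    # Step 1: pair each token with its successor.
    return [' '.join(pair) for pair in zip(tokens, tokens[1:])]

def rank_by_frequency(items):
    # Steps 2 and 3: count, then rank by descending count (1 = most frequent).
    ordered = Counter(items).most_common()
    return {item: rank for rank, (item, _count) in enumerate(ordered, start=1)}

toy_bnc = 'the cat sat on the mat and the cat ate a lot'.split()
toy_lang8 = 'i like a lot and i eat a lot'.split()

bnc_rank = rank_by_frequency(bigrams(toy_bnc))
lang8_rank = rank_by_frequency(bigrams(toy_lang8))

# Step 4: BNC rank divided by Lang-8 rank; missing bigrams get BNC rank 1.
ratios = {bg: bnc_rank.get(bg, 1) / rank for bg, rank in lang8_rank.items()}

# Step 5: show the highest ratios.
top = sorted(ratios.items(), key=lambda kv: kv[1], reverse=True)[:3]
for position, (bg, ratio) in enumerate(top, start=1):
    print(position, bg, round(ratio, 4))
```

In this toy data `a lot` is the most frequent Lang-8 bigram but the least frequent BNC bigram, so it comes out on top.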

In [None]:
#### [ TODO ] 
BNC_bigram = get_ngram(BNC_unigram,2)
BNC_bigram_counter = calculate_frequency(BNC_bigram)
BNC_bigram_Rank = ranking(BNC_bigram_counter)

lang_bigram = get_ngram(lang_unigram,2)
lang_bigram_counter = calculate_frequency(lang_bigram)
lang_bigram_Rank = ranking(lang_bigram_counter)

#### [ TODO ] Calculate Rank Ratio
langbigram_rank_ratio = {}
for element in lang_bigram_Rank:
    # A bigram that does not appear in BNC is given BNC rank 1.
    bnc_rank = BNC_bigram_Rank.get(element, 1)
    langbigram_rank_ratio[element] = bnc_rank / lang_bigram_Rank[element]

list_langbigram_rank_ratio = sorted(langbigram_rank_ratio.items(), key=lambda x: x[1],reverse=True)
list_langbigram_rank_ratio_top = list_langbigram_rank_ratio[0:30]

# Prepend a 1-based position to each (bigram, ratio) pair.
langbigram_rank_ratio_top_with_rank = [
    [rank, bigram, ratio]
    for rank, (bigram, ratio) in enumerate(list_langbigram_rank_ratio_top, start=1)
]
  
headers = ["rank","bigram", "rankratio"]
print(tabulate(langbigram_rank_ratio_top_with_rank, headers = headers))

## TA's Notes

If you complete the assignment, please use [this link](https://docs.google.com/spreadsheets/d/1OKbXhcv6E3FEQDPnbHEHEeHvpxv01jxugMP7WwnKqKw/edit#gid=0) to reserve a demo time.  
Scores are only given after the TAs review your implementation, so <u>**make sure you make an appointment with a TA before the deadline**</u>.  <br>After the demo, please upload your assignment to the e-learn website. You only need to hand in this ipynb file, renamed as XXXXXXXXX(Your student ID).ipynb.
<br>Note that **late submissions will not be accepted**.  