# Word Count, Phrase Analysis, Cross-Corpus Analysis

When learning English, some words and phrases are overused and others are rarely used, relative to a reference corpus. Here we use word counts, phrase analysis, and cross-corpus comparison to find the words and phrases that learners overuse.
<br><br>
One dataset is taken from the [`British National Corpus`](http://www.natcorp.ox.ac.uk/) (BNC), a 100-million-word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of British English from the late twentieth century. The other is [`NAIST Lang-8`](https://sites.google.com/site/naistlang8corpora/), collected from Lang-8, a language exchange social networking website geared towards language learners. The website is run by Lang-8 Inc., which is based in Tokyo, Japan.


https://drive.google.com/drive/folders/1vtCjRptZL6T4mffzbnqwi5i4WrqVnZHr?usp=sharing


## N-gram counting
We will tokenize the text and count how often each token occurs. The tokenization rules in this lab are:
 1. Ignore case (e.g., "The" is the same as "the")
 2. Split by white spaces <s>and punctuations</s>
 3. Ignore all punctuation
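
A minimal sketch of a tokenizer that follows all three rules. The name `tokenize_no_punct` is hypothetical, and treating "punctuation" as the ASCII characters in `string.punctuation` is an assumption; the lab may intend a different set.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def tokenize_no_punct(text):
    """Lower-case, drop ASCII punctuation, then split on whitespace."""
    cleaned = text.lower().translate(_PUNCT_TABLE)
    # str.split() with no argument splits on runs of whitespace and
    # never yields empty strings, so trailing newlines are harmless.
    return cleaned.split()

print(tokenize_no_punct("This is an example.\n"))
# ['this', 'is', 'an', 'example']
```

One consequence of deleting punctuation outright is that contractions collapse ("don't" becomes "dont"). Whether that is acceptable depends on the analysis.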
<br><br>

In [2]:
import os
import re
import string

In [3]:

def tokenize(text):
    """
    Input:
    "This is an example."

    Sample output: 
    ['this', 'is', 'an', 'example.']
    """  
    #### [ TODO ] transform text to lower case
    text = text.lower()
    #### [ TODO ] separate the words by white space
    # Note: punctuation is not stripped here, so it stays attached to tokens.
    tokens = text.split(' ')
    return tokens
    
from collections import Counter

def calculate_frequency(tokens):
    """
    Input:
    ['this', 'is', 'an', 'example', ...]

    Sample output: 
    {
        'the': 79809, 
        'project': 288,
        ...
    }
    """
    #### [ TODO ] 
    # Counter maps each distinct token to its number of occurrences.
    frequency = Counter(tokens)
    return frequency
   


def get_ngram(tokens, n=2):
    """
    Input:
    ['this', 'is', 'an', 'example', ...]

    Sample output: 
    ['this is', 'is an', 'an example', ...]
    """
    #### [TODO] 
    # Stop at len(tokens) - n + 1 so that every slice holds exactly n tokens.
    return [' '.join(tokens[i:i+n]) for i in range(len(tokens) - n + 1)]
    

In [3]:
file_path = os.path.join('data', 'bnc.txt')
BNC_unigram = []
#### [ TODO ] generate BNC unigrams and count the frequency of each unigram in BNC
with open(file_path, 'r', encoding='UTF-8') as f:
    for line in f:
        tokens = tokenize(line)
        BNC_unigram.extend(tokens)

BNC_unigram_counter = calculate_frequency(BNC_unigram)

In [4]:
# Read lang-8 Data
file_path = os.path.join('data','lang8.txt')
lang_unigram = []

#### [ TODO ] generate lang8 unigrams and count the frequency of each unigram in lang8
with open(file_path, 'r', encoding='UTF-8') as f:
    for line in f:
        tokens = tokenize(line)
        lang_unigram.extend(tokens)

lang_unigram_counter = calculate_frequency(lang_unigram)


## Rank
Rank unigrams by their frequencies. The higher the frequency, the higher the rank. (The most frequent unigram ranks 1.)<br>
<span style="color: red">[ TODO ]</span> <u>Rank unigrams for Lang-8 and BNC.</u>
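
The ranking step can be sketched with `Counter.most_common()`, which returns items sorted by descending count. The helper name `rank_by_frequency` and the toy data are hypothetical, for illustration only.

```python
from collections import Counter

def rank_by_frequency(counter):
    """Map each item to its 1-based rank; the most frequent item gets rank 1."""
    # most_common() yields (item, count) pairs by descending count;
    # items with equal counts keep their first-seen order.
    return {
        item: rank
        for rank, (item, _count) in enumerate(counter.most_common(), start=1)
    }

toy_counts = Counter(['the', 'cat', 'the', 'sat', 'the', 'cat'])
print(rank_by_frequency(toy_counts))
# {'the': 1, 'cat': 2, 'sat': 3}
```

Note that in this scheme tied items still receive distinct consecutive ranks, so the rank of a rare word depends partly on the order in which it was first seen.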

In [14]:
lang_unigram_Rank = {}

#### [ TODO ] Rank unigrams for lang

sorted_lang_unigram = sorted(lang_unigram_counter.items(), key=lambda item: item[1], reverse=True)
# The most frequent unigram gets rank 1.
for rank, (unigram, _count) in enumerate(sorted_lang_unigram, start=1):
    lang_unigram_Rank[unigram] = rank

In [16]:
BNC_unigram_Rank = {}

#### [ TODO ] Rank unigrams for BNC

sorted_BNC_unigram = sorted(BNC_unigram_counter.items(), key=lambda item: item[1], reverse=True)
# The most frequent unigram gets rank 1.
for rank, (unigram, _count) in enumerate(sorted_BNC_unigram, start=1):
    BNC_unigram_Rank[unigram] = rank

## Calculate Rank Ratio
In this step, you need to match each unigram across the two datasets and calculate its Rank Ratio.  <br>Use the following formula:<br> 
<br>

$\text{Rank Ratio} = \frac{\text{Rank in BNC}}{\text{Rank in Lang-8}}$
<br><br>
If a unigram does not appear in BNC, its BNC rank is treated as 1.

<span style="color: red">[ TODO ]</span> Please calculate all rank ratios of unigrams in Lang-8.
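
A small worked example of the formula, using invented toy ranks. The function name `rank_ratio` and its `missing_rank` parameter are hypothetical; the default of 1 mirrors the rule stated above for unigrams absent from BNC.

```python
def rank_ratio(learner_rank, reference_rank, missing_rank=1):
    """Return {item: reference rank / learner rank} for every learner item."""
    return {
        item: reference_rank.get(item, missing_rank) / rank
        for item, rank in learner_rank.items()
    }

# Toy ranks, invented for illustration only.
toy_lang8_rank = {'i': 1, 'very': 2, 'enjoy': 3}
toy_bnc_rank = {'i': 4, 'very': 50}

ratios = rank_ratio(toy_lang8_rank, toy_bnc_rank)
# 'i'     -> 4 / 1  = 4.0
# 'very'  -> 50 / 2 = 25.0
# 'enjoy' -> 1 / 3  (missing from the toy BNC ranks, so rank 1 is used)
print(ratios)
```

A large ratio means the item ranks much higher in Lang-8 than in BNC, which is the signal for overuse. Because a missing item is given BNC rank 1, its ratio can never exceed 1, so items absent from BNC do not reach the top of the list.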

In [18]:
#### [ TODO ] Calculate Rank Ratio
unigram_rank_ratio = {}
for unigram, rank in lang_unigram_Rank.items():
    # Unigrams that do not appear in BNC are given a BNC rank of 1.
    unigram_rank_ratio[unigram] = BNC_unigram_Rank.get(unigram, 1) / rank

## Sort the result
<span style="color: red">[ TODO ]</span> Please show the top 30 unigrams by Rank Ratio, together with their Rank Ratio values, in this format: 
<br>
<img src="https://scontent-hkt1-2.xx.fbcdn.net/v/t39.30808-6/307940624_756082125461769_4218487831464443689_n.jpg?_nc_cat=100&ccb=1-7&_nc_sid=730e14&_nc_ohc=M0u8b1s2wakAX_Mgt7E&_nc_ht=scontent-hkt1-2.xx&oh=00_AT_peeQy_D2UyQYlMWbCIZjQTU7F38SJyE2A09J_SnZ-aA&oe=632E03C0" width=50%>
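
Selecting and printing the top entries can be sketched as follows. The helper names `top_k` and `format_rows` are hypothetical, and the fixed-width columns are a design choice of this sketch rather than the format required by the assignment.

```python
import heapq

def top_k(ratios, k=30):
    """Return the k (item, ratio) pairs with the largest ratio, largest first."""
    return heapq.nlargest(k, ratios.items(), key=lambda pair: pair[1])

def format_rows(pairs, label='unigram'):
    """Render (item, ratio) pairs as fixed-width text rows with a header."""
    lines = [f"{'rank':<6}{label:<20}{'Rank Ratio':>12}"]
    for position, (item, ratio) in enumerate(pairs, start=1):
        lines.append(f"{position:<6}{item:<20}{ratio:>12.3f}")
    return lines

# Toy ratios, invented for illustration only.
toy_ratios = {'very': 25.0, 'i': 4.0, 'enjoy': 0.3333}
for line in format_rows(top_k(toy_ratios, k=2)):
    print(line)
```

Fixed-width fields keep the ratio column aligned regardless of token length, which tab characters alone do not guarantee.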

In [37]:
#### [ TODO ] 
print('rank\tunigram\t\t\tRank\tRatio')
sorted_unigram_rank_ratio = sorted(unigram_rank_ratio.items(), key=lambda item: item[1], reverse=True)
# Print only the 30 unigrams with the highest Rank Ratio.
for j, (unigram, rank_ratio) in enumerate(sorted_unigram_rank_ratio[:30], start=1):
    print(f'{j}\t{unigram}\t\t\t{round(rank_ratio, 3)}')


rank	unigram			Rank	Ratio
1	'the			368.504
2	world.			335.697
3	years.			290.015
4	life.			203.29
5	them.			189.452
6	society.			185.065
7	value.			175.799
8	country.			170.908
9	it.			169.416
10	1.			161.564
11	year.			148.194
12	work.			146.203
13	way.			138.256
14	place.			137.674
15	below.			133.332
16	states.			132.424
17	problems.			131.027
18	activities.			126.197
19	today.			126.056
20	other.			122.825
21	out.			110.795
22	this.			105.3
23	service.			98.54
24	area.			97.132
25	strategy.			95.607
26	well.			94.41
27	again.			92.409
28	children.			91.014
29	europe.			89.698
30	performance.			89.025


## For Bigrams
<span style="color: red">[ TODO ]</span> Do the same thing for bigrams  
Hint:  
1. generate all bigrams for BNC / lang8  
2. calculate the frequency of each bigram  
3. rank bigrams by frequency  
4. calculate the rank ratio of each bigram
5. print out the 30 bigrams with the highest rank ratio  
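
Steps 1 to 3 above can be sketched as one small pipeline. The names `ngrams` and `ngram_ranks` are hypothetical, the toy sentences are invented, and the tokenization is simplified to lower-casing and whitespace splitting for brevity.

```python
from collections import Counter

def ngrams(tokens, n=2):
    """Return every run of n consecutive tokens, joined by a single space."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_ranks(lines, n=2):
    """Build n-grams per line, count them, and rank them by frequency."""
    counts = Counter()
    for line in lines:
        # n-grams are built per line, so they never span two lines.
        counts.update(ngrams(line.lower().split(), n))
    return {
        gram: rank
        for rank, (gram, _count) in enumerate(counts.most_common(), start=1)
    }

toy_lines = ["I am happy", "I am very happy", "I am"]
print(ngram_ranks(toy_lines))
# {'i am': 1, 'am happy': 2, 'am very': 3, 'very happy': 4}
```

Applying the same function to both corpora yields two rank dictionaries that can then be compared with the Rank Ratio formula.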

In [4]:
file_path = os.path.join('data', 'bnc.txt')
BNC_bigram = []
#### [ TODO ] generate BNC bigrams and count the frequency of each bigram in BNC
with open(file_path, 'r', encoding='UTF-8') as f:
    for line in f:
        tokens = tokenize(line)
        bigram = get_ngram(tokens)
        BNC_bigram.extend(bigram)

BNC_bigram_counter = calculate_frequency(BNC_bigram)

In [5]:
# Read lang-8 Data
file_path = os.path.join('data','lang8.txt')
lang_bigram = []

#### [ TODO ] generate lang8 bigrams and count the frequency of each bigram in lang8
with open(file_path, 'r', encoding='UTF-8') as f:
    for line in f:
        tokens = tokenize(line)
        bigram = get_ngram(tokens) 
        lang_bigram.extend(bigram)

lang_bigram_counter = calculate_frequency(lang_bigram)

In [6]:
lang_bigram_Rank = {}

#### [ TODO ] Rank bigrams for lang

sorted_lang_bigram = sorted(lang_bigram_counter.items(), key=lambda item: item[1], reverse=True)
# The most frequent bigram gets rank 1.
for rank, (bigram, _count) in enumerate(sorted_lang_bigram, start=1):
    lang_bigram_Rank[bigram] = rank

In [7]:
BNC_bigram_Rank = {}

#### [ TODO ] Rank bigrams for BNC

sorted_BNC_bigram = sorted(BNC_bigram_counter.items(), key=lambda item: item[1], reverse=True)
# The most frequent bigram gets rank 1.
for rank, (bigram, _count) in enumerate(sorted_BNC_bigram, start=1):
    BNC_bigram_Rank[bigram] = rank

In [8]:
#### [ TODO ] Calculate Rank Ratio
bigram_rank_ratio = {}
for bigram, rank in lang_bigram_Rank.items():
    # Bigrams that do not appear in BNC are given a BNC rank of 1.
    bigram_rank_ratio[bigram] = BNC_bigram_Rank.get(bigram, 1) / rank

In [9]:
#### [ TODO ] 
print('rank\tbigram\t\t\tRank\tRatio')
sorted_bigram_rank_ratio = sorted(bigram_rank_ratio.items(), key=lambda item: item[1], reverse=True)
# Print only the 30 bigrams with the highest Rank Ratio.
for j, (bigram, rank_ratio) in enumerate(sorted_bigram_rank_ratio[:30], start=1):
    print(f'{j}\t{bigram}\t\t\t{round(rank_ratio, 3)}')

rank	bigram			Rank	Ratio
1	the country.			1525.922
2	the internet			1289.347
3	introduction this			1258.465
4	as well.			1127.555
5	heat exchanger			1099.119
6	the other.			906.431
7	the bohr			856.34
8	of society.			849.987
9	for them.			831.704
10	in 2004			781.397
11	history relevant			735.047
12	child soldiers			722.501
13	birthweight ratio			702.668
14	exam performance			699.844
15	-1 the			676.44
16	or not.			675.417
17	2 figure			669.589
18	genetically modified			664.976
19	united states.			648.111
20	rate constant			632.863
21	open source			616.6
22	to him.			613.239
23	based care			606.654
24	tort law			598.446
25	of them.			589.531
26	induction motor			575.241
27	internet and			561.387
28	eu is			557.598
29	the tamworth			548.5
30	bowel sounds			524.632


## TA's Notes

If you complete the assignment, please use [this link](https://docs.google.com/spreadsheets/d/1OKbXhcv6E3FEQDPnbHEHEeHvpxv01jxugMP7WwnKqKw/edit#gid=0) to reserve a demo time.  
Scores are given only after the TAs review your implementation, so <u>**make sure you make an appointment with a TA before the deadline**</u>.  <br>After the demo, please upload your assignment to the e-learn website. You only need to hand in this ipynb file, renamed as XXXXXXXXX(Your student ID).ipynb.
<br>Note that **late submissions will not be accepted**.  