# Word Count, Phrase Analysis, Cross-Corpus Analysis

When learning English, some words and phrases are overused and others are seldom used; which ones depends on the corpus being examined. Here, we will do word counting, phrase analysis, and cross-corpus analysis to find the words and phrases that learners overuse.
<br><br>
One dataset is taken from the [`British National Corpus`](http://www.natcorp.ox.ac.uk/), a 100-million-word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of British English from the late twentieth century. The other is the [`NAIST Lang-8`](https://sites.google.com/site/naistlang8corpora/) corpus, drawn from Lang-8, a language exchange social networking website geared towards language learners. The website is run by Lang-8 Inc., which is based in Tokyo, Japan.    

You can access the datasets with the following link:  
https://drive.google.com/drive/folders/1vtCjRptZL6T4mffzbnqwi5i4WrqVnZHr?usp=sharing



## N-gram counting
We will tokenize the text and count how often each token occurs. The rules of tokenization in this Lab are:
 1. Ignore cases (e.g., "The" is the same as "the")
 2. Split by white spaces <s>and punctuations</s>
 3. Ignore all punctuation
<br><br>
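As a quick illustration of the three rules, here is a minimal sketch on a made-up sentence. The function name `tokenize_sketch` and the sample text are assumptions for illustration only; this is one possible way to satisfy the rules, not the required solution.

```python
import string


def tokenize_sketch(text):
    # Rule 1: ignore case
    text = text.lower()
    # Rule 3: drop every ASCII punctuation character
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Rule 2: split on any run of whitespace (spaces, tabs, newlines)
    return text.split()


print(tokenize_sketch("The cat, the hat!\n"))  # ['the', 'cat', 'the', 'hat']
```

Calling `split()` without an argument also discards the newline at the end of each line, so a line-final word is counted as the same token as the same word in mid-sentence.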

In [None]:
import os
import re
import string

In [None]:

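from collections import Counter


def tokenize(text):
    # Rule 1: ignore case
    text = text.lower()
    # Rule 3: remove all punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Rule 2: split on any whitespace; split() with no argument also drops
    # the trailing newline, so line-final tokens are not stored as "word\n"
    tokens = text.split()
    return tokens


def calculate_frequency(tokens):
    """
    Count how often each token occurs.

    Sample output: 
    {
        'the': 79809, 
        'project': 288,
        ...
    }
    """
    frequency = Counter(tokens)
    return frequency


def get_ngram(tokens, n=2):
    # Slide a window of size n over the tokens; the last valid start is len(tokens) - n
    return [' '.join(tokens[i:i+n]) for i in range(len(tokens) - n + 1)]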

In [None]:
file_path = os.path.join('data', 'bnc.txt')
BNC_unigram = []

#### [ TODO ] calculate the frequency of each unigram in BNC
with open(file_path, 'r',encoding='UTF-8') as f:
    for line in f:
        tokens = tokenize(line)
        BNC_unigram.extend(tokens)

BNC_unigram_counter = calculate_frequency(BNC_unigram)

In [None]:
# Read lang-8 Data
file_path = os.path.join('data','lang8.csv')
lang_unigram = []

#### [ TODO ] calculate the frequency of each unigram in lang8
with open(file_path,'r',encoding="utf8") as f:
    for line in f:
        tokens = tokenize(line)
        lang_unigram.extend(tokens)

lang_unigram_counter = calculate_frequency(lang_unigram)

        

## Rank
Rank unigrams by their frequencies. The higher the frequency, the higher the rank. (The most frequent unigram ranks 1.)<br>
<span style="color: red">[ TODO ]</span> <u>Rank unigrams for Lang-8 and BNC.</u>
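The ranking step can be sketched on a toy counter as follows. The helper name `rank_by_frequency` and the toy sentence are assumptions for illustration; `Counter.most_common()` is used here as one convenient way to get items sorted from most to least frequent.

```python
from collections import Counter


def rank_by_frequency(counter):
    # most_common() returns (item, count) pairs from most to least frequent;
    # enumerate(..., start=1) gives the most frequent item rank 1
    return {item: rank
            for rank, (item, _count) in enumerate(counter.most_common(), start=1)}


toy_counter = Counter("the cat saw the dog and the cat".split())
toy_rank = rank_by_frequency(toy_counter)
print(toy_rank['the'], toy_rank['cat'])  # 1 2
```

Note that in this sketch items with equal counts still receive distinct consecutive ranks, in the order they were first seen.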

In [None]:
lang_unigram_Rank = {}

#### [ TODO ] Rank unigrams for lang


# Sort by frequency (descending); the most frequent unigram gets rank 1
for i,unigram in enumerate(sorted(lang_unigram_counter.items(), key=lambda item: item[1],reverse=True)):
    lang_unigram_Rank[unigram[0]] = i+1

In [None]:
BNC_unigram_Rank = {}

#### [ TODO ] Rank unigrams for BNC

for i,unigram in enumerate(sorted(BNC_unigram_counter.items(), key=lambda item: item[1],reverse=True)):
    BNC_unigram_Rank[unigram[0]] = i+1

## Calculate Rank Ratio
In this step, you need to match the same unigram across the two datasets and calculate the Rank Ratio of each unigram in Lang-8.  <br>Use the following formula to calculate the Rank Ratio:<br> 
<br>

$\text{Rank Ratio} = \frac{\text{Rank of BNC}}{\text{Rank of Lang8}}$
<br><br>
If the unigram doesn't appear in BNC, its BNC rank is treated as 1.

<span style="color: red">[ TODO ]</span> Please calculate all rank ratios of unigrams in Lang-8.
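To make the formula concrete, here is a small worked sketch. The helper name `rank_ratio` and the toy rank tables are made up for illustration; they are not values from the real corpora.

```python
def rank_ratio(lang8_rank, bnc_rank):
    # Words missing from BNC get BNC rank 1, following the rule stated above
    return {word: bnc_rank.get(word, 1) / rank
            for word, rank in lang8_rank.items()}


toy_lang8_rank = {'i': 1, 'dont': 2, 'the': 3, 'todayi': 4}
toy_bnc_rank = {'the': 1, 'i': 2, 'dont': 500}

ratios = rank_ratio(toy_lang8_rank, toy_bnc_rank)
print(ratios['dont'])    # 250.0  (500 / 2)
print(ratios['todayi'])  # 0.25   (not in BNC, so 1 / 4)
```

A large ratio means the word ranks much higher in Lang-8 than in BNC, which is the signal used here for words that learners use comparatively often.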

In [None]:
unigram_result = {}
for term,rank in lang_unigram_Rank.items():
    if term in BNC_unigram_Rank.keys():
        unigram_result[term] = BNC_unigram_Rank[term]/rank
    else:
        unigram_result[term] = 1/rank

## Sort the result
<span style="color: red">[ TODO ]</span> Please show the top 30 unigrams by Rank Ratio, together with their Rank Ratio values, in this format: 
<br>
<img src="https://scontent-hkt1-2.xx.fbcdn.net/v/t39.30808-6/307940624_756082125461769_4218487831464443689_n.jpg?_nc_cat=100&ccb=1-7&_nc_sid=730e14&_nc_ohc=M0u8b1s2wakAX_Mgt7E&_nc_ht=scontent-hkt1-2.xx&oh=00_AT_peeQy_D2UyQYlMWbCIZjQTU7F38SJyE2A09J_SnZ-aA&oe=632E03C0" width=50%>
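One way to produce an aligned table without counting tabs by hand is to pad each column to a fixed width with format specifications. The sketch below is only an illustration: the helper name `format_table`, the column widths, and the toy data are assumptions, and the exact spacing may differ from the expected format shown above.

```python
def format_table(result, label='unigram', top_n=30, width=24):
    # Sort by Rank Ratio, largest first, and keep the top_n rows
    top = sorted(result.items(), key=lambda item: item[1], reverse=True)[:top_n]
    # '<6' and '<{width}' left-align each field and pad it with spaces
    lines = [f"{'rank':<6}{label:<{width}}Rank Ratio"]
    for i, (term, ratio) in enumerate(top, start=1):
        lines.append(f"{i:<6}{term:<{width}}{ratio:.3f}")
    return lines


toy_result = {'dont': 250.0, 'i': 2.0, 'the': 1 / 3}
print('\n'.join(format_table(toy_result)))
```

Unlike `round(value, 3)`, the `.3f` format always prints three decimal places, so `965.38` would appear as `965.380`.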

In [None]:
print(f'rank\tunigram\t\t\t\tRank Ratio')
for i,one in enumerate(sorted(unigram_result.items(), key=lambda item: item[1],reverse=True)[:30]):
    if len(one[0])<8:
        print(f'{i+1}\t{one[0]}\t\t\t\t{round(one[1],3)}')
    elif len(one[0])>=16:
        print(f'{i+1}\t{one[0]}\t\t{round(one[1],3)}')
    else:
        print(f'{i+1}\t{one[0]}\t\t\t{round(one[1],3)}')

rank	unigram				Rank Ratio
1	dont				1647.883
2	wanna				965.38
3	thats				795.332
4	didnt				658.563
5	doesnt				503.039
6	havent				497.181
7	isnt				396.261
8	favorite			352.281
9	english				338.979
10	ive				327.974
11	todayi				313.543
12	japanese			293.676
13	im				279.914
14	cant				275.829
15	everyday			246.353
16	hadnt				245.969
17	hes				233.413
18	vacation			232.454
19	wasnt				185.347
20	japan				172.697
21	itll				166.082
22	osaka				165.945
23	japans				160.445
24	theres				154.79
25	someones			153.884
26	arent				152.039
27	hasnt				151.918
28	awesome				149.661
29	internet			148.291
30	semester			146.371


## For bigrams
<span style="color: red">[ TODO ]</span> Repeat the same steps (counting, ranking, Rank Ratio, top 30) for bigrams.
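As a reminder of what a bigram is, the sketch below builds n-grams with a sliding window over a toy token list. The function name `ngrams_sketch` and the sample tokens are assumptions for illustration only.

```python
def ngrams_sketch(tokens, n=2):
    # A window of size n can start at positions 0 .. len(tokens) - n;
    # if the list is shorter than n, the range is empty and no n-gram is produced
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = ['nice', 'to', 'meet', 'you']
print(ngrams_sketch(tokens))       # ['nice to', 'to meet', 'meet you']
print(ngrams_sketch(tokens, n=3))  # ['nice to meet', 'to meet you']
```

When the corpus is processed line by line, n-grams built this way never span two lines, so the last word of one line is not paired with the first word of the next.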

In [None]:
file_path = os.path.join('data', 'bnc.txt')
bnc_bigram = []
#### [ TODO ] calculate the frequency of each bigram in BNC
with open(file_path, 'r',encoding='UTF-8') as f:
    for line in f:
        tokens = tokenize(line)
        bigram = get_ngram(tokens)
        bnc_bigram.extend(bigram)

bnc_bigram_counter = (calculate_frequency(bnc_bigram))




In [None]:
file_path = os.path.join('data','lang8.csv')
lang_bigram = []
with open(file_path,'r', encoding="utf8") as f:
    for line in f:
        tokens = tokenize(line)
        bigram = get_ngram(tokens)
        lang_bigram.extend(bigram)

lang_bigram_counter = (calculate_frequency(lang_bigram))



In [None]:
lang_bigram_Rank = {}
for i,bigram in enumerate(sorted(lang_bigram_counter.items(), key=lambda item: item[1],reverse=True)):
    lang_bigram_Rank[bigram[0]] = i+1


In [None]:
BNC_bigram_Rank = {}
for i,bigram in enumerate(sorted(bnc_bigram_counter.items(), key=lambda item: item[1],reverse=True)):
    BNC_bigram_Rank[bigram[0]] = i+1
bigram_result = {}


In [None]:
for term,rank in lang_bigram_Rank.items():
    if term in BNC_bigram_Rank.keys():
        bigram_result[term] = BNC_bigram_Rank[term]/rank
    else:
        bigram_result[term] = 1/rank


In [None]:
print(f'rank\tbigram\t\t\t\tRank Ratio')
for i,one in enumerate(sorted(bigram_result.items(), key=lambda item: item[1],reverse=True)[:30]):
    if len(one[0])<8:
        print(f'{i+1}\t{one[0]}\t\t\t\t{round(one[1],3)}')
    elif len(one[0])>=16:
        print(f'{i+1}\t{one[0]}\t\t{round(one[1],3)}')
    else:
        print(f'{i+1}\t{one[0]}\t\t\t{round(one[1],3)}')

rank	bigram				Rank Ratio
1	i dont				363080.167
2	study english			35188.888
3	so im				15431.061
4	i didnt				15370.125
5	meet you			13878.429
6	im very				12457.973
7	learn english			11868.028
8	i cant				8989.845
9	i havent			8578.976
10	my family			7718.84
11	im so				7385.273
12	my diary			6630.556
13	i wont				6090.669
14	ive been			5649.957
15	good night			5608.794
16	cant understand			5537.622
17	they dont			5516.94
18	by myself			4984.605
19	my home				4818.499
20	than before			4106.673
21	my english			4010.532
22	in japan			3990.14
23	im sorry			3897.65
24	please correct			3738.287
25	im glad				3428.818
26	im afraid			3306.523
27	dont you			3303.248
28	my room				3276.736
29	good morning			3173.661
30	im trying			3031.639


## TA's Notes

If you complete the Assignment, please use [this link](https://docs.google.com/spreadsheets/d/1OKbXhcv6E3FEQDPnbHEHEeHvpxv01jxugMP7WwnKqKw/edit#gid=0) to reserve demo time.  
The score is only given after TAs review your implementation, so <u>**make sure you make an appointment with a TA before the deadline**</u>.  <br>After the demo, please upload your assignment to eeclass. You only need to hand in this ipynb file, renamed as XXXXXXXXX(Your student ID).ipynb.
<br>Note that **late submission will not be allowed**.  