# PELIC frequency statistics

<br>

**Authors:** Cassie Maz, John Starr, Ben Naismith  
**Date:** 9 June 2020

<br>

The code in this notebook calculates the frequency of tokens (words and lemmas) in PELIC. The data come from [PELIC_compiled.csv](https://github.com/ELI-Data-Mining-Group/PELIC-dataset/blob/master/PELIC_compiled.csv), and the resulting frequency distributions are broken down by students' L1 and level.

<br>

**Notebook contents:**
- [Initial setup](#Initial-setup)
- [Total frequencies](#Total-frequencies)
    * The output is a csv with a column for total frequency count and a column for frequency per million.
- [Frequencies by level](#Frequencies-by-level): 2, 3, 4, 5
    * Note: The majority of students in PELIC are in levels 3, 4, and 5.
    * The output is an NLTK Frequency Distribution for each level, exported as a pickle file.
- [Frequencies by L1](#Frequencies-by-L1): Arabic, Chinese, Korean, Japanese, Spanish
    * Note: We are not creating frequency dictionaries for all L1s, but only for the 5 most common.
    * The output is an NLTK Frequency Distribution for each L1, exported as a pickle file.
- [Conditional frequency distributions](#Conditional-frequency-distributions)
    * There are two conditional frequency distributions, for level and L1, to allow access to word frequencies based on either Level or L1 using only one pickle file, rather than separate pickle files.
- [Demonstration](#Demonstration)
    * A short demonstration comparing the top 20 Arabic and Spanish lemmas
    
**Note:** Distributions do not take capitalization into account – a capitalized word and the same non-capitalized word will go towards the same count.

## Initial setup

In [1]:
# Import necessary modules
import pandas as pd
import pickle as pkl
from nltk import FreqDist
from nltk.probability import ConditionalFreqDist
import random
from ast import literal_eval

In [2]:
# Read in PELIC_compiled which contains all the processed texts

pelic_df = pd.read_csv('../pelic_compiled.csv', index_col = 'answer_id')
pelic_df.info()
pelic_df.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 46230 entries, 1 to 48420
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   anon_id      46230 non-null  object
 1   L1           46230 non-null  object
 2   gender       46230 non-null  object
 3   level_id     46230 non-null  int64 
 4   class_id     46230 non-null  object
 5   question_id  46230 non-null  int64 
 6   version      46230 non-null  int64 
 7   text_len     46230 non-null  int64 
 8   text         46230 non-null  object
 9   tokens       46230 non-null  object
 10  tok_lem_POS  46230 non-null  object
dtypes: int64(4), object(7)
memory usage: 4.2+ MB


answer_id,anon_id,L1,gender,level_id,class_id,question_id,version,text_len,text,tokens,tok_lem_POS
1,eq0,Arabic,Male,4,g,5,1,177,I met my friend Nife while I was studying in a...,"['I', 'met', 'my', 'friend', 'Nife', 'while', ...","(('I', 'i', 'PRP'), ('met', 'meet', 'VBD'), ('..."
2,am8,Thai,Female,4,g,5,1,137,"Ten years ago, I met a women on the train betw...","['Ten', 'years', 'ago', ',', 'I', 'met', 'a', ...","(('Ten', 'ten', 'CD'), ('years', 'year', 'NNS'..."
3,dk5,Turkish,Female,4,w,12,1,64,In my country we usually don't use tea bags. F...,"['In', 'my', 'country', 'we', 'usually', 'do',...","(('In', 'in', 'IN'), ('my', 'my', 'PRP$'), ('c..."
4,dk5,Turkish,Female,4,w,13,1,6,I organized the instructions by time.,"['I', 'organized', 'the', 'instructions', 'by'...","(('I', 'i', 'PRP'), ('organized', 'organize', ..."
5,ad1,Korean,Female,4,w,12,1,59,"First, prepare a port, loose tea, and cup.\nSe...","['First', ',', 'prepare', 'a', 'port', ',', 'l...","(('First', 'first', 'RB'), (',', ',', ','), ('..."


In [3]:
# Because lists are read in as strings, these need to be converted back to lists

pelic_df.tokens = pelic_df.tokens.apply(literal_eval)
pelic_df.tok_lem_POS = pelic_df.tok_lem_POS.apply(literal_eval)
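
To see why this conversion is needed, here is a minimal sketch using made-up cell values (the strings below are illustrative, not actual PELIC rows). When a list is written to a CSV file it is stored as its string representation, and `ast.literal_eval` parses that string back into a Python object.

```python
from ast import literal_eval

# Hypothetical cell values, mimicking how a list and a tuple of tuples
# are stored as plain strings in a CSV file
raw_tokens = "['I', 'met', 'my', 'friend']"
raw_tok_lem_pos = "(('I', 'i', 'PRP'), ('met', 'meet', 'VBD'))"

tokens = literal_eval(raw_tokens)
tok_lem_pos = literal_eval(raw_tok_lem_pos)

print(type(raw_tokens).__name__, '->', type(tokens).__name__)  # str -> list
print(tokens[1])          # met
print(tok_lem_pos[1][1])  # meet (the lemma is the second item of each tuple)
```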

In [4]:
# Keep only the five relevant columns

pelic_df = pelic_df[["anon_id", "level_id", "L1", "tokens", "tok_lem_POS"]]
pelic_df.head()

answer_id,anon_id,level_id,L1,tokens,tok_lem_POS
1,eq0,4,Arabic,"[I, met, my, friend, Nife, while, I, was, stud...","((I, i, PRP), (met, meet, VBD), (my, my, PRP$)..."
2,am8,4,Thai,"[Ten, years, ago, ,, I, met, a, women, on, the...","((Ten, ten, CD), (years, year, NNS), (ago, ago..."
3,dk5,4,Turkish,"[In, my, country, we, usually, do, n't, use, t...","((In, in, IN), (my, my, PRP$), (country, count..."
4,dk5,4,Turkish,"[I, organized, the, instructions, by, time, .]","((I, i, PRP), (organized, organize, VBD), (the..."
5,ad1,4,Korean,"[First, ,, prepare, a, port, ,, loose, tea, ,,...","((First, first, RB), (,, ,, ,), (prepare, prep..."


In [5]:
# Create a list of all word and lemma tokens, flatten these lists, and make all tokens lowercase

toks_list = [y.lower() for x in pelic_df.tokens for y in x]
print(toks_list[:10])

lemmas_list = [y[1].lower() for x in pelic_df.tok_lem_POS for y in x]
print(lemmas_list[:10])

['i', 'met', 'my', 'friend', 'nife', 'while', 'i', 'was', 'studying', 'in']
['i', 'meet', 'my', 'friend', 'nife', 'while', 'i', 'be', 'study', 'in']


In [6]:
# Calculate the total number of tokens

total_toks = len(toks_list)
total_lemmas = len(lemmas_list)
print('PELIC total tokens:',total_toks)
print('PELIC total lemmas:',total_lemmas)

# These should match.

PELIC total tokens: 4819157
PELIC total lemmas: 4819157


## Total frequencies
This will be the general frequency distribution of all tokens for all students, i.e. a dictionary where each word or lemma is the key and its raw frequency is the value.
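
As a small illustration of what such a distribution looks like, the sketch below builds one from invented toy texts. It uses `collections.Counter` as a stand-in so that it runs without NLTK; `nltk.FreqDist` is a subclass of `Counter`, so lookups and `most_common()` behave the same way.

```python
from collections import Counter

# Toy data standing in for the pelic_df.tokens column (one list per text)
texts = [['I', 'met', 'my', 'friend', '.'],
         ['My', 'friend', 'met', 'me', '.']]

# Flatten and lowercase, as in the notebook
toy_toks = [tok.lower() for text in texts for tok in text]

# Counter is used here as a stand-in for nltk.FreqDist
toy_freq = Counter(toy_toks)

print(toy_freq['my'])          # 2: 'My' and 'my' are counted together
print(len(toy_freq))           # 6 types
print(sum(toy_freq.values()))  # 10 tokens
```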

### Raw frequency counts

In [7]:
# Create the distribution using our toks_list and lemmas_list

freq_dist = FreqDist(toks_list)
freq_dist_lemmas = FreqDist(lemmas_list)

print(freq_dist) # 49,241 word types
print(freq_dist_lemmas) # 41,362 lemma types - this is lower than word types as expected

<FreqDist with 49241 samples and 4819157 outcomes>
<FreqDist with 41362 samples and 4819157 outcomes>


In [8]:
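# Check a random sample of 5 entries in the dictionaries
# (list() is needed because random.sample requires a sequence in recent Python versions)

random.sample(list(freq_dist.items()), 5)
random.sample(list(freq_dist_lemmas.items()), 5)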

# As we can see here, spelling is not corrected in the texts, a factor which must always be considered.

[('youngstown', 10),
 ('sober-minded', 1),
 ('paticpates', 1),
 ('948', 1),
 ('fluctuation', 30)]

In [9]:
# Create dataframes from these frequency distributions

# words
freq_df = pd.DataFrame.from_dict(freq_dist, orient='index', columns=["frequency"]) # create df from dict
freq_df = freq_df.sort_values(by=['frequency'], ascending = False) # sort by frequency
freq_df = freq_df.reset_index() # make new index
freq_df = freq_df.rename(columns = {"index":"word" }) # rename index
freq_df.head(10)

,word,frequency
0,.,279067
1,",",217804
2,the,193235
3,to,135001
4,and,108331
5,i,98387
6,a,92882
7,in,92013
8,of,88910
9,is,75973


In [10]:
# lemmas
freq_df_lemmas = pd.DataFrame.from_dict(freq_dist_lemmas, orient='index', columns=["frequency"]) # create df from dict
freq_df_lemmas = freq_df_lemmas.sort_values(by=['frequency'], ascending = False) # sort by frequency
freq_df_lemmas = freq_df_lemmas.reset_index() # make new index
freq_df_lemmas = freq_df_lemmas.rename(columns = {"index":"lemma" }) # rename index
freq_df_lemmas.head(10)

,lemma,frequency
0,.,279067
1,",",217804
2,the,193235
3,be,179888
4,to,135001
5,and,108331
6,a,103373
7,i,98387
8,in,92013
9,of,88910


**Notes:** 
- Using NLTK tokens, punctuation is also included in the dictionary. If considering frequency ranking (for example for frequency bands), it is important to exclude punctuation.
- Note the difference between the top 10 most frequent words and lemmas - in the lemma list the lemma 'be' is much higher as it combines 'is' and other verb forms.
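
One possible way to exclude punctuation before ranking is sketched below on invented counts. The helper `is_punctuation` is hypothetical (it is not part of the notebook), and `Counter` again stands in for `FreqDist`. The Demonstration section instead filters with `str.isalpha()`, which additionally drops tokens such as numbers and contractions.

```python
from collections import Counter
import string

# Invented counts, for illustration only
toy_freq = Counter({'.': 12, ',': 9, 'the': 8, 'be': 7, "n't": 3, '...': 2, 'tea': 1})

def is_punctuation(token):
    """True if the token consists only of ASCII punctuation characters."""
    return all(ch in string.punctuation for ch in token)

# Rank by frequency, skipping punctuation-only tokens
ranked = [(tok, n) for tok, n in toy_freq.most_common() if not is_punctuation(tok)]

print(ranked)  # [('the', 8), ('be', 7), ("n't", 3), ('tea', 1)]
```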

### Frequency per million
In addition to raw frequencies, frequency per million is a useful measurement for comparison, both within and across corpora. The formula for the per-million statistic is:  

`per M factor = 1,000,000 / total corpus frequency  
word per M = word frequency * per M factor`
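
As a worked instance of the formula, the sketch below plugs in the corpus total and the raw count of 'the' reported earlier in this notebook.

```python
total_toks = 4819157      # total tokens reported earlier in this notebook
the_frequency = 193235    # raw count of 'the' from the word table above

per_M_factor = 1_000_000 / total_toks
the_per_M = round(the_frequency * per_M_factor, 1)

print(round(per_M_factor, 4))  # 0.2075
print(the_per_M)               # 40097.3
```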

In [11]:
per_M_factor = 1000000 / total_toks # total_lemmas would produce the same result
per_M_factor

0.2075051715476379

In [12]:
# Create a new column for the per_M statistic for each word and lemma

# words
freq_df['per_M'] = freq_df.frequency.map(lambda x: round(x*per_M_factor,1)) # Round the results to one decimal
freq_df.head()

,word,frequency,per_M
0,.,279067,57907.8
1,",",217804,45195.5
2,the,193235,40097.3
3,to,135001,28013.4
4,and,108331,22479.2


In [13]:
# lemmas
freq_df_lemmas['per_M'] = freq_df_lemmas.frequency.map(lambda x: round(x*per_M_factor,1)) # Round the results to one decimal
freq_df_lemmas.head()

,lemma,frequency,per_M
0,.,279067,57907.8
1,",",217804,45195.5
2,the,193235,40097.3
3,be,179888,37327.7
4,to,135001,28013.4


In [14]:
# Export these dataframes to CSV files for easy access

freq_df.to_csv("word_frequencies.csv")
freq_df_lemmas.to_csv("lemma_frequencies.csv")

## Frequencies by level<a name="levelfreqdict"></a>
The following code is the same basic process, only now we will have four dictionaries, one per level.
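
The idea can be sketched compactly on toy data. The records below are invented, and `Counter` stands in for `FreqDist`; keeping the distributions in a dictionary keyed by level is an alternative to one variable per level, not the approach this notebook takes.

```python
from collections import Counter, defaultdict

# Invented (level_id, tokens) records standing in for rows of pelic_df
records = [
    (2, ['I', 'like', 'tea', '.']),
    (3, ['I', 'like', 'green', 'tea', '.']),
    (3, ['Tea', 'is', 'good', '.']),
]

# One frequency distribution per level
level_freqs = defaultdict(Counter)
for level, tokens in records:
    level_freqs[level].update(tok.lower() for tok in tokens)

# Overall distribution, for the sanity check below
overall = Counter(tok.lower() for _, tokens in records for tok in tokens)

print(level_freqs[3]['tea'])  # 2
# The per-level counts should add up to the overall count
print(sum(dist['.'] for dist in level_freqs.values()) == overall['.'])  # True
```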

In [15]:
# First, create a separate 'lemmas' column in pelic_df from the tok_lem_POS tuple
pelic_df['lemmas'] = pelic_df.tok_lem_POS.apply(lambda row: [x[1] for x in row])
pelic_df.head()

answer_id,anon_id,level_id,L1,tokens,tok_lem_POS,lemmas
1,eq0,4,Arabic,"[I, met, my, friend, Nife, while, I, was, stud...","((I, i, PRP), (met, meet, VBD), (my, my, PRP$)...","[i, meet, my, friend, nife, while, i, be, stud..."
2,am8,4,Thai,"[Ten, years, ago, ,, I, met, a, women, on, the...","((Ten, ten, CD), (years, year, NNS), (ago, ago...","[ten, year, ago, ,, i, meet, a, woman, on, the..."
3,dk5,4,Turkish,"[In, my, country, we, usually, do, n't, use, t...","((In, in, IN), (my, my, PRP$), (country, count...","[in, my, country, we, usually, do, not, use, t..."
4,dk5,4,Turkish,"[I, organized, the, instructions, by, time, .]","((I, i, PRP), (organized, organize, VBD), (the...","[i, organize, the, instruction, by, time, .]"
5,ad1,4,Korean,"[First, ,, prepare, a, port, ,, loose, tea, ,,...","((First, first, RB), (,, ,, ,), (prepare, prep...","[first, ,, prepare, a, port, ,, loose, tea, ,,..."


In [16]:
# Next, split pelic_df into smaller dataframes of each level

level2_df = pelic_df[pelic_df.level_id == 2]
level3_df = pelic_df[pelic_df.level_id == 3]
level4_df = pelic_df[pelic_df.level_id == 4]
level5_df = pelic_df[pelic_df.level_id == 5]

In [17]:
# Create the frequency distributions for each level

# words
level2_freq_dist = FreqDist([y.lower() for x in level2_df.tokens for y in x])
level3_freq_dist = FreqDist([y.lower() for x in level3_df.tokens for y in x])
level4_freq_dist = FreqDist([y.lower() for x in level4_df.tokens for y in x])
level5_freq_dist = FreqDist([y.lower() for x in level5_df.tokens for y in x])

# lemmas
level2_freq_dist_lemmas = FreqDist([y.lower() for x in level2_df.lemmas for y in x])
level3_freq_dist_lemmas = FreqDist([y.lower() for x in level3_df.lemmas for y in x])
level4_freq_dist_lemmas = FreqDist([y.lower() for x in level4_df.lemmas for y in x])
level5_freq_dist_lemmas = FreqDist([y.lower() for x in level5_df.lemmas for y in x])

In [18]:
# Check the results

# words
print(level2_freq_dist)
print(level2_freq_dist.most_common(10))
print(level3_freq_dist)
print(level3_freq_dist.most_common(10))
print(level4_freq_dist)
print(level4_freq_dist.most_common(10))
print(level5_freq_dist)
print(level5_freq_dist.most_common(10))

<FreqDist with 2641 samples and 34894 outcomes>
[('.', 3305), ('i', 1328), (',', 1318), ('and', 945), ('to', 802), ('the', 795), ('is', 770), ('my', 757), ('a', 739), ('in', 668)]
<FreqDist with 17219 samples and 729061 outcomes>
[('.', 49617), (',', 30768), ('the', 26880), ('i', 21546), ('to', 20039), ('and', 15475), ('a', 14547), ('in', 14370), ('is', 13065), ('of', 11469)]
<FreqDist with 28939 samples and 2213651 outcomes>
[('.', 128822), (',', 101411), ('the', 82504), ('to', 63373), ('and', 50090), ('i', 48617), ('a', 41915), ('in', 40310), ('of', 38846), ('is', 33978)]
<FreqDist with 30718 samples and 1841551 outcomes>
[('.', 97323), (',', 84307), ('the', 83056), ('to', 50787), ('and', 41821), ('of', 38340), ('in', 36665), ('a', 35681), ('is', 28160), ('i', 26896)]


In [19]:
# lemmas
print(level2_freq_dist_lemmas)
print(level2_freq_dist_lemmas.most_common(10))
print(level3_freq_dist_lemmas)
print(level3_freq_dist_lemmas.most_common(10))
print(level4_freq_dist_lemmas)
print(level4_freq_dist_lemmas.most_common(10))
print(level5_freq_dist_lemmas)
print(level5_freq_dist_lemmas.most_common(10))

<FreqDist with 2258 samples and 34894 outcomes>
[('.', 3305), ('be', 1420), ('i', 1328), (',', 1318), ('and', 945), ('to', 802), ('a', 796), ('the', 795), ('my', 757), ('in', 668)]
<FreqDist with 14184 samples and 729061 outcomes>
[('.', 49617), (',', 30768), ('be', 27537), ('the', 26880), ('i', 21546), ('to', 20039), ('a', 15677), ('and', 15475), ('in', 14370), ('of', 11469)]
<FreqDist with 23456 samples and 2213651 outcomes>
[('.', 128822), (',', 101411), ('the', 82504), ('be', 81296), ('to', 63373), ('and', 50090), ('i', 48617), ('a', 46545), ('in', 40310), ('of', 38846)]
<FreqDist with 24677 samples and 1841551 outcomes>
[('.', 97323), (',', 84307), ('the', 83056), ('be', 69635), ('to', 50787), ('and', 41821), ('a', 40355), ('of', 38340), ('in', 36665), ('i', 26896)]


In [20]:
# Check that the period counts for the four levels add up to the total period count

# words
print(level2_freq_dist['.'] + level3_freq_dist['.'] + level4_freq_dist['.'] + level5_freq_dist['.'])
print(freq_dist['.'])

# lemmas
print(level2_freq_dist_lemmas['.'] + level3_freq_dist_lemmas['.'] + level4_freq_dist_lemmas['.'] + level5_freq_dist_lemmas['.'])
print(freq_dist_lemmas['.'])

279067
279067
279067
279067


In [21]:
# Export to pickle files

# words
level2file = open('word_frequency_pickles/level2_word_frequencies.pkl','wb')
pkl.dump(level2_freq_dist, level2file)
level2file.close()

level3file = open('word_frequency_pickles/level3_word_frequencies.pkl','wb')
pkl.dump(level3_freq_dist, level3file)
level3file.close()

level4file = open('word_frequency_pickles/level4_word_frequencies.pkl','wb')
pkl.dump(level4_freq_dist, level4file)
level4file.close()

level5file = open('word_frequency_pickles/level5_word_frequencies.pkl','wb')
pkl.dump(level5_freq_dist, level5file)
level5file.close()

# lemmas
level2file_lemmas = open('lemma_frequency_pickles/level2_lemma_frequencies.pkl','wb')
pkl.dump(level2_freq_dist_lemmas, level2file_lemmas)
level2file_lemmas.close()

level3file_lemmas = open('lemma_frequency_pickles/level3_lemma_frequencies.pkl','wb')
pkl.dump(level3_freq_dist_lemmas, level3file_lemmas)
level3file_lemmas.close()

level4file_lemmas = open('lemma_frequency_pickles/level4_lemma_frequencies.pkl','wb')
pkl.dump(level4_freq_dist_lemmas, level4file_lemmas)
level4file_lemmas.close()

level5file_lemmas = open('lemma_frequency_pickles/level5_lemma_frequencies.pkl','wb')
pkl.dump(level5_freq_dist_lemmas, level5file_lemmas)
level5file_lemmas.close()
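
For readers who want to reload these files, here is a minimal round-trip sketch. It writes invented toy distributions to a temporary folder (not the notebook's output folders) and uses `with` blocks so that files are closed automatically. Note that loading the real pickle files requires NLTK to be installed, since they store `FreqDist` objects.

```python
import os
import pickle as pkl
import tempfile
from collections import Counter

# Invented distributions; Counter stands in for nltk.FreqDist
toy_dists = {'level2': Counter({'.': 3, 'i': 2}),
             'level3': Counter({'.': 5, 'the': 4})}

out_dir = tempfile.mkdtemp()  # temporary folder used only for this sketch

# Write one pickle file per distribution
for name, dist in toy_dists.items():
    path = os.path.join(out_dir, f'{name}_word_frequencies.pkl')
    with open(path, 'wb') as f:
        pkl.dump(dist, f)

# Read one of them back in
with open(os.path.join(out_dir, 'level3_word_frequencies.pkl'), 'rb') as f:
    reloaded = pkl.load(f)

print(reloaded.most_common(1))  # [('.', 5)]
```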

## Frequencies by L1<a name="langfreqdict"></a>

Now for the same process, but by L1. We will do this for the five most frequent languages: Arabic, Chinese, Korean, Japanese, and Spanish. However, this same process could be applied to create frequency dictionaries for any of the L1s in PELIC.

In [22]:
pelic_df.L1.value_counts()

Arabic               16831
Korean                9226
Chinese               8503
Japanese              2782
Spanish               1909
Turkish               1538
Thai                  1383
Taiwanese              678
Portuguese             603
Other                  493
French                 477
Italian                393
Russian                193
Hebrew                 189
English                154
Farsi                  144
Mongol                 144
Vietnamese             104
German                  85
Indonesian              71
Romanian                69
Russian,Ukrainian       62
Azerbaijani             45
Suundi                  41
Swedish                 31
Montenegrin             27
Zulu                    25
Polish                  20
Hindi                    6
Swahili                  4
Name: L1, dtype: int64

In [23]:
# Create sub-dataframes by language

ARA_df = pelic_df[pelic_df.L1 == "Arabic"]
CHI_df = pelic_df[pelic_df.L1 == "Chinese"]
KOR_df = pelic_df[pelic_df.L1 == "Korean"]
JAP_df = pelic_df[pelic_df.L1 == "Japanese"]
SPA_df = pelic_df[pelic_df.L1 == "Spanish"]

In [24]:
# Create frequency distributions (words and lemmas)

#words 
ARA_freq_dist = FreqDist([y.lower() for x in ARA_df.tokens for y in x])
CHI_freq_dist = FreqDist([y.lower() for x in CHI_df.tokens for y in x])
KOR_freq_dist = FreqDist([y.lower() for x in KOR_df.tokens for y in x])
JAP_freq_dist = FreqDist([y.lower() for x in JAP_df.tokens for y in x])
SPA_freq_dist = FreqDist([y.lower() for x in SPA_df.tokens for y in x])

# lemmas
ARA_freq_dist_lemmas = FreqDist([y.lower() for x in ARA_df.lemmas for y in x])
CHI_freq_dist_lemmas = FreqDist([y.lower() for x in CHI_df.lemmas for y in x])
KOR_freq_dist_lemmas = FreqDist([y.lower() for x in KOR_df.lemmas for y in x])
JAP_freq_dist_lemmas = FreqDist([y.lower() for x in JAP_df.lemmas for y in x])
SPA_freq_dist_lemmas = FreqDist([y.lower() for x in SPA_df.lemmas for y in x])

In [25]:
# Check data

print("Arabic (words):")
print(ARA_freq_dist)
print(ARA_freq_dist.most_common(10))

print("\nArabic (lemmas):")
print(ARA_freq_dist_lemmas)
print(ARA_freq_dist_lemmas.most_common(10))

print("\nChinese (words):")
print(CHI_freq_dist)
print(CHI_freq_dist.most_common(10))

print("\nChinese (lemmas):")
print(CHI_freq_dist_lemmas)
print(CHI_freq_dist_lemmas.most_common(10))

print("\nKorean (words):")
print(KOR_freq_dist)
print(KOR_freq_dist.most_common(10))

print("\nKorean (lemmas):")
print(KOR_freq_dist_lemmas)
print(KOR_freq_dist_lemmas.most_common(10))

print("\nJapanese (words):")
print(JAP_freq_dist)
print(JAP_freq_dist.most_common(10))

print("\nJapanese (lemmas):")
print(JAP_freq_dist_lemmas)
print(JAP_freq_dist_lemmas.most_common(10))

print("\nSpanish (words):")
print(SPA_freq_dist)
print(SPA_freq_dist.most_common(10))

print("\nSpanish (lemmas):")
print(SPA_freq_dist_lemmas)
print(SPA_freq_dist_lemmas.most_common(10))

Arabic (words):
<FreqDist with 26608 samples and 1534315 outcomes>
[('.', 87084), ('the', 67846), (',', 62284), ('to', 43467), ('and', 36320), ('in', 32747), ('i', 30720), ('of', 27889), ('a', 27225), ('is', 24273)]

Arabic (lemmas):
<FreqDist with 21653 samples and 1534315 outcomes>
[('.', 87084), ('the', 67846), (',', 62284), ('be', 56339), ('to', 43467), ('and', 36320), ('in', 32747), ('i', 30720), ('a', 30298), ('of', 27889)]

Chinese (words):
<FreqDist with 20488 samples and 950340 outcomes>
[('.', 55369), (',', 46239), ('the', 37648), ('to', 26182), ('and', 20464), ('a', 18789), ('i', 18452), ('in', 16784), ('of', 16153), ('is', 14633)]

Chinese (lemmas):
<FreqDist with 16064 samples and 950340 outcomes>
[('.', 55369), (',', 46239), ('the', 37648), ('be', 33764), ('to', 26182), ('a', 20842), ('and', 20464), ('i', 18452), ('in', 16784), ('of', 16153)]

Korean (words):
<FreqDist with 20480 samples and 987135 outcomes>
[('.', 61690), (',', 48546), ('the', 32939), ('to', 27218), ('i'

In [26]:
# Export to pickle files

# words
ARAfile = open('word_frequency_pickles/arabic_word_frequencies.pkl','wb') 
pkl.dump(ARA_freq_dist, ARAfile)
ARAfile.close()

CHIfile = open('word_frequency_pickles/chinese_word_frequencies.pkl','wb') 
pkl.dump(CHI_freq_dist, CHIfile)
CHIfile.close()

KORfile = open('word_frequency_pickles/korean_word_frequencies.pkl','wb') 
pkl.dump(KOR_freq_dist, KORfile)
KORfile.close()

JAPfile = open('word_frequency_pickles/japanese_word_frequencies.pkl','wb') 
pkl.dump(JAP_freq_dist, JAPfile)
JAPfile.close()

SPAfile = open('word_frequency_pickles/spanish_word_frequencies.pkl','wb') 
pkl.dump(SPA_freq_dist, SPAfile)
SPAfile.close()


# lemmas
ARAfile_lemmas = open('lemma_frequency_pickles/arabic_lemma_frequencies.pkl','wb') 
pkl.dump(ARA_freq_dist_lemmas, ARAfile_lemmas)
ARAfile_lemmas.close()

CHIfile_lemmas = open('lemma_frequency_pickles/chinese_lemma_frequencies.pkl','wb') 
pkl.dump(CHI_freq_dist_lemmas, CHIfile_lemmas)
CHIfile_lemmas.close()

KORfile_lemmas = open('lemma_frequency_pickles/korean_lemma_frequencies.pkl','wb') 
pkl.dump(KOR_freq_dist_lemmas, KORfile_lemmas)
KORfile_lemmas.close()

JAPfile_lemmas = open('lemma_frequency_pickles/japanese_lemma_frequencies.pkl','wb') 
pkl.dump(JAP_freq_dist_lemmas, JAPfile_lemmas)
JAPfile_lemmas.close()

SPAfile_lemmas = open('lemma_frequency_pickles/spanish_lemma_frequencies.pkl','wb') 
pkl.dump(SPA_freq_dist_lemmas, SPAfile_lemmas)
SPAfile_lemmas.close()

## Conditional frequency distributions
In the previous sections, we created frequency counts of overall word frequencies, as well as separate frequency distributions for levels 2, 3, 4, and 5 and for the languages Arabic, Chinese, Korean, Japanese, and Spanish.

Here, we create two additional _Conditional_ Frequency Distributions: one conditional on all four levels, and one conditional on all five languages above. These distributions allow access to word frequencies based on either Level or L1 using only one pickle file, rather than separate pickle files.
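
Conceptually, a conditional frequency distribution maps each condition to its own frequency distribution. The sketch below mimics this with `defaultdict(Counter)` on invented pairs so that it runs without NLTK; it is a stand-in, not NLTK's implementation. NLTK's `ConditionalFreqDist` can also be built directly by passing an iterable of (condition, sample) pairs to its constructor.

```python
from collections import Counter, defaultdict

# Invented (condition, sample) pairs; the condition here is the level
pairs = [(4, 'the'), (4, 'tea'), (5, 'the'), (5, 'the'), (5, '.')]

# defaultdict(Counter) is a stand-in for nltk's ConditionalFreqDist
toy_cond_dist = defaultdict(Counter)
for cond, sample in pairs:
    toy_cond_dist[cond][sample] += 1

print(sorted(toy_cond_dist))             # [4, 5]
print(toy_cond_dist[5]['the'])           # 2
print(toy_cond_dist[5].most_common(1))   # [('the', 2)]
```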

### Level

In [27]:
# Create a list of tuples so that the level is associated with each word or lemma

# words
level2_cond_list = [(2, y.lower()) for x in level2_df.tokens for y in x]
level3_cond_list = [(3, y.lower()) for x in level3_df.tokens for y in x]
level4_cond_list = [(4, y.lower()) for x in level4_df.tokens for y in x]
level5_cond_list = [(5, y.lower()) for x in level5_df.tokens for y in x]

# lemmas
level2_cond_list_lemmas = [(2, y.lower()) for x in level2_df.lemmas for y in x]
level3_cond_list_lemmas = [(3, y.lower()) for x in level3_df.lemmas for y in x]
level4_cond_list_lemmas = [(4, y.lower()) for x in level4_df.lemmas for y in x]
level5_cond_list_lemmas = [(5, y.lower()) for x in level5_df.lemmas for y in x]

In [28]:
# The level dataframes are no longer needed and can be deleted to save memory.

del level2_df
del level3_df
del level4_df
del level5_df

In [29]:
# Combine the above lists

level_cond_list = level2_cond_list + level3_cond_list + level4_cond_list + level5_cond_list
level_cond_list_lemmas = level2_cond_list_lemmas + level3_cond_list_lemmas + level4_cond_list_lemmas + level5_cond_list_lemmas

In [30]:
# Create conditional frequency distributions with the above lists

level_cond_dist = ConditionalFreqDist()
level_cond_dist_lemmas = ConditionalFreqDist()

In [31]:
# Make level the condition (i.e., the first tuple entry) and add up the frequencies

for x in level_cond_list:
    cond = x[0]
    level_cond_dist[cond][x[1]] += 1

for x in level_cond_list_lemmas:
    cond = x[0]
    level_cond_dist_lemmas[cond][x[1]] += 1

In [32]:
# Check the resulting conditional distributions

# words
print(level_cond_dist[2].most_common(10))
print(level_cond_dist[3].most_common(10))
print(level_cond_dist[4].most_common(10))
print(level_cond_dist[5].most_common(10))

# lemmas
print(level_cond_dist_lemmas[2].most_common(10))
print(level_cond_dist_lemmas[3].most_common(10))
print(level_cond_dist_lemmas[4].most_common(10))
print(level_cond_dist_lemmas[5].most_common(10))

[('.', 97323),
 (',', 84307),
 ('the', 83056),
 ('be', 69635),
 ('to', 50787),
 ('and', 41821),
 ('a', 40355),
 ('of', 38340),
 ('in', 36665),
 ('i', 26896)]

In [33]:
# Check that the counts for period add up to the total period count
# found to be 279067 earlier in this notebook

level_cond_dist[2]['.'] + level_cond_dist[3]['.'] + level_cond_dist[4]['.'] + level_cond_dist[5]['.']
level_cond_dist_lemmas[2]['.'] + level_cond_dist_lemmas[3]['.'] + level_cond_dist_lemmas[4]['.'] + level_cond_dist_lemmas[5]['.']

279067

### L1 
Here we repeat the process above, but with L1s instead of levels.

In [34]:
# Create a list of tuples so that the L1 is associated with each word or lemma

# words
ARA_cond_list = [('Arabic', y.lower()) for x in ARA_df.tokens for y in x]
CHI_cond_list = [('Chinese', y.lower()) for x in CHI_df.tokens for y in x]
KOR_cond_list = [('Korean', y.lower()) for x in KOR_df.tokens for y in x]
JAP_cond_list = [('Japanese', y.lower()) for x in JAP_df.tokens for y in x]
SPA_cond_list = [('Spanish', y.lower()) for x in SPA_df.tokens for y in x]

# lemmas
ARA_cond_list_lemmas = [('Arabic', y.lower()) for x in ARA_df.lemmas for y in x]
CHI_cond_list_lemmas = [('Chinese', y.lower()) for x in CHI_df.lemmas for y in x]
KOR_cond_list_lemmas = [('Korean', y.lower()) for x in KOR_df.lemmas for y in x]
JAP_cond_list_lemmas = [('Japanese', y.lower()) for x in JAP_df.lemmas for y in x]
SPA_cond_list_lemmas = [('Spanish', y.lower()) for x in SPA_df.lemmas for y in x]

In [35]:
# The dataframes are no longer needed and can be deleted to save memory.

del ARA_df
del CHI_df
del KOR_df
del JAP_df
del SPA_df

In [36]:
# Combine the above lists

lang_cond_list = ARA_cond_list + CHI_cond_list + KOR_cond_list + JAP_cond_list + SPA_cond_list
lang_cond_list_lemmas = ARA_cond_list_lemmas + CHI_cond_list_lemmas + KOR_cond_list_lemmas + JAP_cond_list_lemmas + SPA_cond_list_lemmas

In [37]:
# Create conditional frequency distributions with the above lists

lang_cond_dist = ConditionalFreqDist()
lang_cond_dist_lemmas = ConditionalFreqDist()

In [38]:
# Make L1 the condition (i.e., the first tuple entry) and add up the frequencies

for x in lang_cond_list:
    cond = x[0]
    lang_cond_dist[cond][x[1]] += 1
    
for x in lang_cond_list_lemmas:
    cond = x[0]
    lang_cond_dist_lemmas[cond][x[1]] += 1

In [39]:
# Check the resulting conditional distributions

# words
print('Arabic (words)\n',lang_cond_dist['Arabic'].most_common(10))
print('\nChinese (words)\n',lang_cond_dist['Chinese'].most_common(10))
print('\nKorean (words)\n',lang_cond_dist['Korean'].most_common(10))
print('\nJapanese (words)\n',lang_cond_dist['Japanese'].most_common(10))
print('\nSpanish (words)\n',lang_cond_dist['Spanish'].most_common(10))

# lemmas
print('\nArabic (lemmas)\n',lang_cond_dist_lemmas['Arabic'].most_common(10))
print('\nChinese (lemmas)\n',lang_cond_dist_lemmas['Chinese'].most_common(10))
print('\nKorean (lemmas)\n',lang_cond_dist_lemmas['Korean'].most_common(10))
print('\nJapanese (lemmas)\n',lang_cond_dist_lemmas['Japanese'].most_common(10))
print('\nSpanish (lemmas)\n',lang_cond_dist_lemmas['Spanish'].most_common(10))

Arabic (words)
 [('.', 87084), ('the', 67846), (',', 62284), ('to', 43467), ('and', 36320), ('in', 32747), ('i', 30720), ('of', 27889), ('a', 27225), ('is', 24273)]

Chinese (words)
 [('.', 55369), (',', 46239), ('the', 37648), ('to', 26182), ('and', 20464), ('a', 18789), ('i', 18452), ('in', 16784), ('of', 16153), ('is', 14633)]

Korean (words)
 [('.', 61690), (',', 48546), ('the', 32939), ('to', 27218), ('i', 22767), ('and', 20394), ('a', 20041), ('of', 18041), ('in', 16835), ('is', 15863)]

Japanese (words)
 [('.', 20439), (',', 16542), ('the', 13197), ('to', 10009), ('i', 7993), ('and', 7290), ('of', 6983), ('a', 6929), ('in', 6576), ('is', 5213)]

Spanish (words)
 [('.', 10380), ('the', 10033), (',', 9896), ('to', 6226), ('and', 5550), ('a', 4447), ('i', 4427), ('in', 4393), ('of', 4371), ('is', 3622)]

Arabic (lemmas)
 [('.', 87084), ('the', 67846), (',', 62284), ('be', 56339), ('to', 43467), ('and', 36320), ('in', 32747), ('i', 30720), ('a', 30298), ('of', 27889)]

Chinese (lemm

**Note:** Unlike with level, the counts for "." will not add up to the total because we are only using five languages, not all of them.
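
The gap can be checked with simple arithmetic, using the period counts for the five L1s printed above and the total period count found earlier in this notebook:

```python
# Period counts for the five L1s, copied from the output above
period_counts = {'Arabic': 87084, 'Chinese': 55369, 'Korean': 61690,
                 'Japanese': 20439, 'Spanish': 10380}
total_periods = 279067  # total period count found earlier in this notebook

five_L1_sum = sum(period_counts.values())

print(five_L1_sum)                  # 234962
print(total_periods - five_L1_sum)  # 44105 periods come from the other L1s
```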

In [40]:
# Export to pickle files

# words
levelfile = open('word_frequency_pickles/level_conditional_word_frequencies.pkl', 'wb')
pkl.dump(level_cond_dist, levelfile)
levelfile.close()

langfile = open('word_frequency_pickles/L1_conditional_word_frequencies.pkl', 'wb')
pkl.dump(lang_cond_dist, langfile)
langfile.close()

# lemmas
levelfile_lemmas = open('lemma_frequency_pickles/level_conditional_lemma_frequencies.pkl', 'wb')
pkl.dump(level_cond_dist_lemmas, levelfile_lemmas)
levelfile_lemmas.close()

langfile_lemmas = open('lemma_frequency_pickles/L1_conditional_lemma_frequencies.pkl', 'wb')
pkl.dump(lang_cond_dist_lemmas, langfile_lemmas)
langfile_lemmas.close()

## Demonstration
The following small example showcases how the frequency information above might be useful. Here we contrast the top 20 Arabic lemmas with the top 20 Spanish lemmas.

In [41]:
# Remove punctuation items from the lemma freq dictionary we are using and make into lists
Arabic_top20 = [x for x in list(lang_cond_dist_lemmas['Arabic'].most_common(30)) if x[0].isalpha()][:20]
Spanish_top20 = [x for x in list(lang_cond_dist_lemmas['Spanish'].most_common(30)) if x[0].isalpha()][:20]

In [42]:
top20_df = pd.DataFrame({'Arabic': Arabic_top20, 'Spanish': Spanish_top20})
top20_df

,Arabic,Spanish
0,"(the, 67846)","(the, 10033)"
1,"(be, 56339)","(be, 8751)"
2,"(to, 43467)","(to, 6226)"
3,"(and, 36320)","(and, 5550)"
4,"(in, 32747)","(a, 5048)"
5,"(i, 30720)","(i, 4427)"
6,"(a, 30298)","(in, 4393)"
7,"(of, 27889)","(of, 4371)"
8,"(have, 20827)","(that, 3098)"
9,"(that, 16282)","(have, 3055)"


These look very similar, so let's see if there are any differences in terms of inclusion.

In [43]:
Arabic_top20_list = [x[0] for x in top20_df.Arabic]
Spanish_top20_list = [x[0] for x in top20_df.Spanish]

print('Top 20 lemmas used by L1 Arabic but not L1 Spanish students:',
      [x for x in Arabic_top20_list if x not in Spanish_top20_list])
print('Top 20 lemmas used by L1 Spanish but not L1 Arabic students:',
      [x for x in Spanish_top20_list if x not in Arabic_top20_list])

Top 20 lemmas used by L1 Arabic but not L1 Spanish students: ['they', 'people', 'will']
Top 20 lemmas used by L1 Spanish but not L1 Arabic students: ['this', 'can', 'because']


From the above, we see that overall, as would be expected, the most frequent lemmas are very consistent regardless of the learners' L1. However, we might be interested to look deeper into the use of the items which did differ. Could it be that L1 Spanish learners use more discourse markers or complex sentences with _because_? Do L1 Spanish and L1 Arabic learners show different patterns of usage for modal verbs like _can_ and _will_? These questions can now be explored empirically in PELIC using the full texts in [`PELIC_compiled.csv`](https://github.com/ELI-Data-Mining-Group/PELIC-dataset/blob/master/PELIC_compiled.csv).

For more detailed tutorials, please see the [tutorials folder](https://github.com/ELI-Data-Mining-Group/PELIC-dataset/tree/master/tutorials) and the description in the [`README.md`](https://github.com/ELI-Data-Mining-Group/PELIC-dataset/blob/master/README.md). For example, you may wish to create concordances of the items in these top 20 lists in order to see them in context. The concordancing function is described in the [`PELIC_concordancing_tutorial`](https://github.com/ELI-Data-Mining-Group/PELIC-dataset/blob/master/tutorials/PELIC_concordancing_tutorial.ipynb).

[Back to top](#PELIC-frequency-statistics)