# project_code_final: Analysis of the ELI Data #
## Ben Naismith ##

### This notebook ###

This notebook contains the most up-to-date and streamlined code for the project. For early cleaning and analysis efforts, please refer to the following documents:
- Cleaning: https://github.com/Data-Science-for-Linguists/Bigram-analysis-of-writing-from-the-ELI/tree/master/early_experiments/project_code1_cleaning.ipynb
- Analysis: https://github.com/Data-Science-for-Linguists/Bigram-analysis-of-writing-from-the-ELI/tree/master/early_experiments/project_code2_analysis.ipynb

### Table of contents ###

1.  [Data sharing plan](#1.-Data-sharing-plan): description of sample data contents and licensing agreement
2.  [Initial setup](#2.-Initial-setup): importing necessary modules
3.  [Student information](#3.-Student-information): S_info_csv and S_info_df
4.  [Student responses](#4.-Student-responses): answer_csv and answer_df
5.  [Course IDs](#5.-Course-IDs): course_csv and course_df
6.  [User file internal](#6.-user_file_internal): user_csv and user_df
7.  [Basic info about dataframes](#7.-Basic-info-about-dataframes): description of dataframes in sections 3-6
8.  [Tokenization of answers](#8.-Tokenization-of-answers): tokenization from answer_df
9.  [Bigrams](#9.-Bigrams): creating bigram column from tokens
10. [Corpus frequency dictionary](#10.-Corpus-frequency-dictionary): creating unigram frequency dictionary
11. [Bigram frequency dictionary](#11.-Bigram-frequency-dictionary): creating bigram frequency dictionary
12. [Mutual Information](#12.-Mutual-Information): creating a function for calculating MI
13. [Combo dataframe](#13.-Combo-dataframe): combines earlier dataframes for easier analysis
14. [Occurrences per million](#14.-Occurrences-per-million): create function to calculate occurrences per million
15. [bigram_df](#15.-bigram_df): create new dataframe using new functions from earlier sections
16. [levels_df](#16.-levels_df): create a small dataframe with useful overall statistics
17. [Pickling](#17.-Pickling): saving pickles of dataframes and MI dict
18. [Visualizations](#18.-Visualizations): creating plots to visualize the results

### 1. Data sharing plan ###

The full ELI data set (see project_plan.md) is private at this time. Below is a workbook with the current code for analyzing that data. In order to see how the code works, snippets of data have been displayed throughout.

A sample of the 'sanitized' data is included in the 'data' folder of this repository. It contains samples of the four CSV files referred to in this code, covering 1000 answers, so that others can test and reproduce the code. These are the first 1000 answers in the answer_csv file and correspond to user_file_id 7505 to 10108.

Ultimately, the dataset's authors intend to make the entire dataset public under a CC license. Please see LICENSE_notes.md for details.

### 2. Initial setup ###

In [1]:
#Import necessary modules
from __future__ import division
import numpy as np
import pandas as pd
import nltk
import glob
import matplotlib.pyplot as plt
import random

#Display the output of every expression in a cell, not just the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

#Create short-hand for directory root
cor_dir = "/Users/Benjamin's/Documents/ELI_Data_Mining/Data-Archive/1_sanitized/"

In [2]:
#Add starter code created by Na-Rae Han for the ELI research group
from elitools import *

Pretty printing has been turned OFF
<class 'pandas.core.frame.DataFrame'>
Int64Index: 48384 entries, 1 to 48420
Data columns (total 8 columns):
question_id        48384 non-null int64
anon_id            48353 non-null object
user_file_id       48384 non-null int64
text               47175 non-null object
directory          14 non-null object
is_doublespaced    48384 non-null int64
is_plagiarized     48384 non-null int64
is_deleted         48384 non-null int64
dtypes: int64(5), object(3)
memory usage: 3.3+ MB
<class 'pandas.core.frame.DataFrame'>
Index: 913 entries, ez9 to bn6
Data columns (total 20 columns):
gender                       913 non-null object
birth_year                   913 non-null int64
native_language              913 non-null object
language_used_at_home        912 non-null object
language_used_at_home_now    855 non-null object
non_native_language_1        859 non-null object
yrs_of_study_lang1           863 non-null object
study_in_classroom_lang1     863 non-null 

### 3. Student information ###
- S_info_csv
- S_info_df

In [3]:
#Process the student_information.csv file
S_info_csv = cor_dir + "student_information.csv"
S_info_df = pd.read_csv(S_info_csv, index_col = 'anon_id')

S_info_df.head() #Issues still apparent with integers turned into floats
S_info_df.tail(10) #6 anon_id with no personal info - perhaps not students and to be 'pruned', as well as teachers with 'English' as the native language

anon_id,gender,birth_year,native_language,language_used_at_home,language_used_at_home_now,non_native_language_1,yrs_of_study_lang1,study_in_classroom_lang1,ways_of_study_lang1,non_native_language_2,yrs_of_study_lang2,study_in_classroom_lang2,ways_of_study_lang2,non_native_language_3,yrs_of_study_lang3,study_in_classroom_lang3,ways_of_study_lang3,createddate,modifieddate,course_history
ez9,Male,1978,Arabic,Arabic,,English,more than 5 years,1.0,Studied grammar;Worked in pairs/groups;Studied...,Turkish,less than 1 year,0.0,Studied by myself,,,0.0,other,2006-01-30 15:07:18,2006-03-14 15:13:37,6;12;18;24;30
gm3,Male,1980,Arabic,Arabic,,English,more than 5 years,1.0,Studied grammar;Had a native-speaker teacher;S...,,,0.0,other,,,0.0,other,2006-01-30 15:07:28,2006-03-14 15:12:49,6;12;24;30;38
fg5,Male,1938,Nepali,Nepali,,English,more than 5 years,1.0,Studied grammar;Worked in pairs/groups;Had a n...,French,less than 1 year,1.0,Studied grammar;Worked in pairs/groups;Had a n...,Hindi,more than 5 years,0.0,Studied by myself,2006-01-30 15:07:45,2006-03-14 15:11:36,18;24
ce5,Female,1984,Korean,Korean,,English,more than 5 years,1.0,Studied grammar;Worked in pairs/groups;Had a n...,German,1-2 years,1.0,Studied grammar;Studied vocabulary;Listened to...,,,0.0,other,2006-01-30 15:07:49,2006-03-14 15:12:24,6;12;24;30;38;56
fi7,Female,1982,Korean,Korean;Japanese,,English,more than 5 years,1.0,Studied grammar;Had a native-speaker teacher;S...,Japanese,less than 1 year,1.0,Studied grammar;Studied vocabulary;Listened to...,French,1-2 years,1.0,Studied grammar;Studied vocabulary;Listened to...,2006-01-30 15:07:52,2006-03-14 15:12:17,6;12;24;30;38


anon_id,gender,birth_year,native_language,language_used_at_home,language_used_at_home_now,non_native_language_1,yrs_of_study_lang1,study_in_classroom_lang1,ways_of_study_lang1,non_native_language_2,yrs_of_study_lang2,study_in_classroom_lang2,ways_of_study_lang2,non_native_language_3,yrs_of_study_lang3,study_in_classroom_lang3,ways_of_study_lang3,createddate,modifieddate,course_history
ec5,Female,1963,Chinese,Chinese,Chinese,,,,,,,,,,,,,2011-06-16 14:08:05,2011-06-16 14:13:03,719;720;721;722;723;772;774;785;813;819;858;85...
cy2,Male,1988,Arabic,Arabic,Arabic,English,less than 1 year,1.0,Studied grammar;Worked in pairs/groups;Had a n...,,,0.0,other,,,0.0,other,2011-06-20 14:09:05,2011-06-20 14:11:31,845;846;847;871;872;927;928;931;949;950;1008;1...
br9,Female,1981,Chinese,Chinese,Chinese,English,more than 5 years,1.0,Studied grammar;Worked in pairs/groups;Studied...,,,0.0,other,,,0.0,other,2011-06-20 14:09:15,2011-06-20 14:12:02,868;869;870;871;872;947;951;953
cl5,Male,1987,Arabic,Arabic,Arabic;English,English,less than 1 year,1.0,Studied grammar;Studied vocabulary;Practiced s...,,,0.0,other,,,0.0,other,2011-06-20 14:09:23,2011-06-20 14:13:16,770;771;778;779;781;856;857;859;861;871;952;95...
de1,Male,1983,Arabic,Arabic,Arabic,English,more than 5 years,1.0,Studied grammar;Studied vocabulary;Teacher spo...,,,0.0,other,,,0.0,other,2011-06-20 14:09:27,2011-06-20 14:12:02,850;851;852;871;872;926;932;933;944;945;1008;1...
ap0,Male,1978,Japanese,Japanese,Japanese,English,more than 5 years,1.0,Studied grammar;Studied vocabulary;Listened to...,,,0.0,other,,,0.0,other,2011-06-20 14:09:33,2011-06-20 14:12:52,845;846;847;871;872
gu4,Male,1983,Arabic,Arabic,Arabic;English,Arabic,more than 5 years,0.0,Studied by myself;I lived in a country where t...,,,0.0,other,,,0.0,other,2011-06-20 14:09:34,2011-06-20 14:13:04,772;773;774;775;776;868;869;870;871;872;922;92...
hb0,Female,1980,Arabic,Arabic,Arabic,English,3-5 years,1.0,Studied grammar;Had a native-speaker teacher;T...,,,0.0,other,,,0.0,other,2011-06-20 14:09:38,2011-06-20 14:13:01,851;869;870;871;872;923;942;944;945;946;1008;1...
dp8,Male,1991,Arabic,Arabic;English,Arabic;English,English,1-2 years,1.0,Studied grammar;Worked in pairs/groups;Had a n...,,,0.0,other,,,0.0,other,2011-06-20 14:10:15,2011-06-20 14:13:57,868;869;870;871;872
bn6,Male,1986,Arabic,Arabic;English,Arabic;English,English,more than 5 years,1.0,Studied grammar;Studied vocabulary;Teacher spo...,,,0.0,other,,,0.0,other,2011-06-20 14:11:17,2011-06-20 14:15:51,860;861;862;871;872;930;947;948;949;951;998;99...


In [4]:
#Identify anyone with 'English' or NaN as their native_language, i.e. not students

#Create boolean filters
Englishfilter = S_info_df['native_language'] == 'English'
NaNfilter = S_info_df['native_language'].isnull() #'== np.nan' is always False, so use isnull()

#Combine the filters with '|' (element-wise or); Python's 'or' raises an error on a Series
fake_Ss = S_info_df.loc[Englishfilter | NaNfilter]
fake_Ss


anon_id,gender,birth_year,native_language,language_used_at_home,language_used_at_home_now,non_native_language_1,yrs_of_study_lang1,study_in_classroom_lang1,ways_of_study_lang1,non_native_language_2,yrs_of_study_lang2,study_in_classroom_lang2,ways_of_study_lang2,non_native_language_3,yrs_of_study_lang3,study_in_classroom_lang3,ways_of_study_lang3,createddate,modifieddate,course_history
ez7,Male,1987,English,Arabic,Arabic;English,Arabic,more than 5 years,0.0,I lived in a country where they spoke Arabic,English,less than 1 year,1.0,Studied grammar;Studied vocabulary;Studied pro...,,,0.0,other,2007-02-20 10:05:39,2007-03-20 10:09:23,156;167;180;191;200;212;223;234;245;256
ay4,Female,1974,English,Korean,Korean,Korean,more than 5 years,1.0,Studied grammar;Had a native-speaker teacher;S...,,,0.0,other,,,0.0,other,2009-06-09 12:04:22,2009-11-13 12:43:36,509;515;516;517;560;571;574;601;622;642;645
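
A note on filtering for missing values: NaN never compares equal to anything, including itself, so an equality test against `np.nan` selects nothing. Missing entries need a dedicated check instead (in pandas, `isnull()`). The sketch below illustrates the idea with plain Python floats; the list `languages` and the helper `is_missing` are made up for illustration and are not taken from the ELI files.

```python
import math

# Hypothetical native_language values; float('nan') stands in for a missing entry
languages = ['Arabic', 'English', float('nan'), 'Korean']

def is_missing(value):
    # A missing entry is a float for which math.isnan() is True
    return isinstance(value, float) and math.isnan(value)

# Equality with NaN is always False, so this filter matches nothing
equal_to_nan = [v == float('nan') for v in languages]

# Element-wise "or" of two conditions, analogous to `filter1 | filter2` in pandas
not_students = [v == 'English' or is_missing(v) for v in languages]

print(equal_to_nan)  # [False, False, False, False]
print(not_students)  # [False, True, True, False]
```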


### 4. Student responses ###
- answer_csv
- answer_df

In [5]:
#Process answer.csv file
answer_csv = cor_dir + "answer.csv"
answer_df = pd.read_csv(answer_csv, index_col = 'answer_id')

answer_df.head()
answer_df.tail(10)

answer_id,question_id,anon_id,user_file_id,text,directory,is_doublespaced,is_plagiarized,is_deleted
1,5,eq0,7505,I met my friend Nife while I was studying in a...,,0,0,0
2,5,am8,7506,"Ten years ago, I met a women on the train betw...",,0,0,0
3,12,dk5,7507,In my country we usually don't use tea bags. F...,,0,0,0
4,13,dk5,7507,I organized the instructions by time.,,0,0,0
5,12,ad1,7508,"First, prepare a port, loose tea, and cup.\r\r...",,0,0,0


answer_id,question_id,anon_id,user_file_id,text,directory,is_doublespaced,is_plagiarized,is_deleted
48411,6138,dv8,100847,Early Second Language Education\r\r\r\nSaudi A...,,1,0,0
48412,6138,ce1,100848,Publicly funded health care system\r\r\r\n\r\r...,,0,0,0
48413,6139,fo7,100911,Happiness is the most effective feeling in peo...,,1,0,0
48414,6139,fs9,100912,everyone want to play some games. some people ...,,1,0,0
48415,6139,cl7,100913,Playing a game is fun only when you win?\r\r\r...,,1,0,0
48416,6139,dr8,100914,Many people enjoy a game in their free time. B...,,1,0,0
48417,6137,fv1,100915,\r\r\r\n ...,,0,0,0
48418,6137,fo1,100916,Some patients are suffering from the...,,0,0,0
48419,6119,ge8,100917,My house looks amazing and modern. I decorated...,,0,0,0
48420,6027,ge8,100918,History and Geography a...,,0,0,0


### 5. Course IDs ###
(to help with finding specific texts and linking other data frames)
- course_csv
- course_df

In [6]:
#Process course.csv file
course_csv = cor_dir + "course.csv"
course_df = pd.read_csv(course_csv, index_col = 'course_id')

course_df.head()

course_id,class_id,level_id,semester,section,course_description
1,1,2,2064,A,Reading Pre_Intermediate 2064 A
2,1,3,2064,B,Reading Low_Intermediate 2064 B
3,1,4,2064,M,Reading Intermediate 2064 M
4,1,4,2064,P,Reading Intermediate 2064 P
5,1,4,2064,Q,Reading Intermediate 2064 Q


### 6. user_file_internal ###
- big csv file with a lot of information
- helps with finding specific texts and linking other data frames
- includes file_type_id, course_id, and many other fields

In [7]:
#Process user_file_internal.csv file
user_csv = cor_dir + "user_file_internal.csv"
user_df = pd.read_csv(user_csv, index_col = 'user_file_id')

user_df.head()

user_file_id,anon_id,file_type_id,file_info_id,user_file_parent_id,course_id,session_id,document_id,activity,order_num,due_date,...,modifiedby,modifieddate,allow_submit_after_duedate,allow_multiple_accesses,allow_double_spacing,duration,pull_off_date,direction,grammar_qp_id,is_deleted
1,aj8,1,,,10,,,12,,2006-08-07 14:19:48,...,,,0,0,0,,,,,0
2,fg8,1,,,10,,,12,,2006-08-07 14:19:48,...,,,0,0,0,,,,,0
3,be0,1,,,10,,,12,,2006-08-07 14:19:48,...,,,0,0,0,,,,,0
4,fc4,1,,,10,,,12,,2006-08-07 14:19:48,...,,,0,0,0,,,,,0
5,fc4,1,,1.0,10,,,12,,2006-08-07 14:19:48,...,,,0,0,0,,,,,0


### 7. Basic info about dataframes ###

The following information is an overview of the four dataframes/csv files currently being looked at:

#### S_info_df ####
Size:
- there are 913 entries, i.e. students (per the `S_info_df.info()` output below), although at least 9 of these are not students and still need to be removed
- 21 columns including info about languages spoken, personal data like age, and learning preferences
- Some columns will likely be removed if deemed unhelpful/unnecessary (e.g. 4th language spoken)
- Some data is normalized, e.g. years of study, but other fields were open-ended, resulting in highly varied responses

Connection to other dataframes:
- link to answer_df is anon_id

Most useful columns for this project:
- anon_id (for linking to other df)
- L1, gender, time studying, age (for data analysis)  


#### answer_df ####
Size:
- there are 47175 'text' entries, i.e. student responses, out of 48384 total rows. The remaining 1209 rows have null texts and need to be removed, since without texts they serve no purpose
- 9 columns including info about the question, the answer, and characteristics of the text (like if it was plagiarized)

Connection to other dataframes:
- link to S_info_df is the anon_id column; link to user_df is the user_file_id column

Most useful columns for this project:
- answer_id (shorthand for the individual texts to be analyzed)
- text (the most important column so far) -> to be converted into tokens, bigrams, etc.  
- anon_id (for linking to other df)


#### course_df ####
Size:
- there are 1071 entries, i.e. one row for each course
- 6 columns including info about the course and class, both in terms of their assigned number and a description

Connection to other dataframes:
- link to user_df is course_id 

Most useful columns for this project:
- only really useful as a transition for linking to other df  


#### user_df ####
Size:
- there are 27134 rows (per the `user_df.info()` output below), each with a file_id number. However, it is unclear how to use this information effectively.
- There are 29 columns, although many are not useful for this project
- A lot of the cells have no input
- Some columns will likely be removed if deemed unhelpful/unnecessary

Connection to other dataframes:
- link to course_df is course_id column

Most useful columns for this project:
- course_id (to link to other DF)
- file_type_id (for indicating the type of activity used in class)
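
Taken together, the links form a chain: an answer points to a user file through user_file_id, the user file points to a course through course_id, and the course carries the level_id. A minimal sketch of that lookup chain with plain dictionaries follows. The three small tables are illustrative stand-ins for answer_df, user_df, and course_df, filled in with the values that appear for answer 3 in this notebook's outputs; the function name `level_of_answer` is made up for this example.

```python
# Illustrative stand-ins for answer_df, user_df and course_df, keyed by their index columns
answers = {3: {'anon_id': 'dk5', 'user_file_id': 7507}}
user_files = {7507: {'course_id': 115}}
courses = {115: {'class_id': 2, 'level_id': 4}}

def level_of_answer(answer_id):
    # Follow the chain: answer -> user file -> course -> level
    user_file_id = answers[answer_id]['user_file_id']
    course_id = user_files[user_file_id]['course_id']
    return courses[course_id]['level_id']

print(level_of_answer(3))  # 4
```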

In [8]:
S_info_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 913 entries, ez9 to bn6
Data columns (total 20 columns):
gender                       913 non-null object
birth_year                   913 non-null int64
native_language              913 non-null object
language_used_at_home        912 non-null object
language_used_at_home_now    855 non-null object
non_native_language_1        859 non-null object
yrs_of_study_lang1           863 non-null object
study_in_classroom_lang1     863 non-null float64
ways_of_study_lang1          863 non-null object
non_native_language_2        309 non-null object
yrs_of_study_lang2           312 non-null object
study_in_classroom_lang2     863 non-null float64
ways_of_study_lang2          863 non-null object
non_native_language_3        55 non-null object
yrs_of_study_lang3           59 non-null object
study_in_classroom_lang3     863 non-null float64
ways_of_study_lang3          863 non-null object
createddate                  913 non-null object
modifieddate    

In [9]:
answer_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 48384 entries, 1 to 48420
Data columns (total 8 columns):
question_id        48384 non-null int64
anon_id            48353 non-null object
user_file_id       48384 non-null int64
text               47175 non-null object
directory          14 non-null object
is_doublespaced    48384 non-null int64
is_plagiarized     48384 non-null int64
is_deleted         48384 non-null int64
dtypes: int64(5), object(3)
memory usage: 3.3+ MB


In [10]:
course_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1071 entries, 1 to 1123
Data columns (total 5 columns):
class_id              1071 non-null int64
level_id              1071 non-null int64
semester              1071 non-null int64
section               1071 non-null object
course_description    1058 non-null object
dtypes: int64(3), object(2)
memory usage: 50.2+ KB


In [11]:
user_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 27134 entries, 1 to 100918
Data columns (total 28 columns):
anon_id                       26922 non-null object
file_type_id                  27134 non-null int64
file_info_id                  2151 non-null float64
user_file_parent_id           25884 non-null float64
course_id                     27134 non-null int64
session_id                    26142 non-null float64
document_id                   1599 non-null float64
activity                      27134 non-null int64
order_num                     2722 non-null float64
due_date                      3286 non-null object
post_date                     3714 non-null object
assignment_name               2700 non-null object
version                       27134 non-null int64
directory                     0 non-null float64
filename                      0 non-null float64
content_text                  964 non-null object
createdby                     24955 non-null object
createddate        

### 8. Tokenization of answers ###

Tokenizing the text in answer.csv to allow for further analysis, e.g., of bigrams.


In [12]:
#column to tokenize
answer_df[['text']].head()

answer_id,text
1,I met my friend Nife while I was studying in a...
2,"Ten years ago, I met a women on the train betw..."
3,In my country we usually don't use tea bags. F...
4,I organized the instructions by time.
5,"First, prepare a port, loose tea, and cup.\r\r..."


In [13]:
#Dropping rows with no text (NaN), then creating the 'toks' column
answer_df = answer_df[answer_df['text'].notnull()].copy() #copy avoids a SettingWithCopyWarning
answer_df['toks'] = answer_df['text'].apply(nltk.word_tokenize)

answer_df.head()

answer_id,question_id,anon_id,user_file_id,text,directory,is_doublespaced,is_plagiarized,is_deleted,toks
1,5,eq0,7505,I met my friend Nife while I was studying in a...,,0,0,0,"[I, met, my, friend, Nife, while, I, was, stud..."
2,5,am8,7506,"Ten years ago, I met a women on the train betw...",,0,0,0,"[Ten, years, ago, ,, I, met, a, women, on, the..."
3,12,dk5,7507,In my country we usually don't use tea bags. F...,,0,0,0,"[In, my, country, we, usually, do, n't, use, t..."
4,13,dk5,7507,I organized the instructions by time.,,0,0,0,"[I, organized, the, instructions, by, time, .]"
5,12,ad1,7508,"First, prepare a port, loose tea, and cup.\r\r...",,0,0,0,"[First, ,, prepare, a, port, ,, loose, tea, ,,..."


### 9. Bigrams ###

Creating a bigrams column from the toks column


In [14]:
#Creating a column of bigrams from the 'toks' column

answer_df['bigrams'] = answer_df.toks.apply(lambda x: list(nltk.bigrams(x)))
answer_df.head(1)

answer_id,question_id,anon_id,user_file_id,text,directory,is_doublespaced,is_plagiarized,is_deleted,toks,bigrams
1,5,eq0,7505,I met my friend Nife while I was studying in a...,,0,0,0,"[I, met, my, friend, Nife, while, I, was, stud...","[(I, met), (met, my), (my, friend), (friend, N..."
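
A bigram is simply each token paired with the token that follows it. As a rough pure-Python equivalent of what `nltk.bigrams` produces for a token list, the pairs can be built with `zip`. The helper name `make_bigrams` is made up for this sketch, and the token list is the one shown for answer 4 above.

```python
def make_bigrams(tokens):
    # Pair each token with its successor; a list of n tokens yields n - 1 bigrams
    return list(zip(tokens, tokens[1:]))

toks = ['I', 'organized', 'the', 'instructions', 'by', 'time', '.']
bigrams = make_bigrams(toks)

print(bigrams[:3])   # [('I', 'organized'), ('organized', 'the'), ('the', 'instructions')]
print(len(bigrams))  # 6
```

A text with a single token yields an empty list, since there is no following token to pair it with.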


### 10. Corpus frequency dictionary ###

Create a frequency dictionary for all toks from answer_df

In [15]:
#Joining all the answers before tokenizing them to create a corpus of tokens

answer_corpus = ' '.join(answer_df['text'])
answer_corpus[:100]
answer_corpus_tok = nltk.word_tokenize(answer_corpus)
answer_corpus_tok[:20]

'I met my friend Nife while I was studying in a middle school. I was happy when I met him because he '

['I', 'met', 'my', 'friend', 'Nife', 'while', 'I', 'was', 'studying', 'in', 'a', 'middle', 'school', '.', 'I', 'was', 'happy', 'when', 'I', 'met']

In [16]:
#Creating a dictionary from the answer_corpus

answer_dict = nltk.FreqDist(answer_corpus_tok)
random.sample(list(answer_dict.items()),5) #random 5-item sample

[('friend=', 1), ('Vegetables', 23), ('hugged', 16), ('Booking', 3), ('disorders', 59)]
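
For readers unfamiliar with `nltk.FreqDist`: it behaves much like Python's `collections.Counter`, mapping each token to the number of times it occurs. The sketch below uses `Counter` as a stand-in on the first 20 tokens shown above. Note that counting is case-sensitive, so 'I' and 'i' would be separate entries.

```python
from collections import Counter

# The first 20 tokens of answer_corpus_tok, as displayed above
tokens = ['I', 'met', 'my', 'friend', 'Nife', 'while', 'I', 'was', 'studying', 'in',
          'a', 'middle', 'school', '.', 'I', 'was', 'happy', 'when', 'I', 'met']

freq = Counter(tokens)

print(freq['I'])           # 4
print(freq['met'])         # 2
print(freq['unseen'])      # 0 (missing keys count as zero)
print(sum(freq.values()))  # 20 (total number of tokens)
```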

### 11. Bigram frequency dictionary ###
Create a bigram frequency dictionary from answer_corpus_tok

In [17]:
#Creating bigrams from the answer_corpus_tok

answer_corpus_bigrams = list(nltk.bigrams(answer_corpus_tok))
answer_corpus_bigrams[:10]

[('I', 'met'), ('met', 'my'), ('my', 'friend'), ('friend', 'Nife'), ('Nife', 'while'), ('while', 'I'), ('I', 'was'), ('was', 'studying'), ('studying', 'in'), ('in', 'a')]

In [18]:
#Again creating a dictionary, this time a bigram dictionary

answer_bigram_dict = nltk.FreqDist(answer_corpus_bigrams)
random.sample(list(answer_bigram_dict.items()),5) #random 5-item sample

[(('think', 'taking'), 2), (('coffee', 'when'), 4), (('Accent', 'coaches'), 1), (('unhappy', 'on'), 4), (('odor', '.'), 19)]
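
One consequence of building bigrams from the joined corpus, rather than text by text, is that the last token of one answer gets paired with the first token of the next. The sketch below shows this on two short texts, with whitespace splitting standing in for `nltk.word_tokenize` and a made-up helper `make_bigrams`. Whether such boundary bigrams matter for the analysis is left open here.

```python
texts = ['By time', 'I organized the instructions']

def make_bigrams(tokens):
    # Pair each token with its successor
    return list(zip(tokens, tokens[1:]))

# Bigrams built separately for each text
per_text = [bg for text in texts for bg in make_bigrams(text.split())]

# Bigrams built from the joined corpus
corpus = make_bigrams(' '.join(texts).split())

# Bigrams that only exist because the texts were joined
boundary = [bg for bg in corpus if bg not in per_text]
print(boundary)  # [('time', 'I')]
```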

### 12. Mutual Information ###

Creating a function to calculate Mutual Information (MI), a useful measure of two-way collocation

(from https://corpus.byu.edu/mutualInformation.asp)  

Mutual Information is calculated as follows:  
MI = log ( (AB * sizeCorpus) / (A * B * span) ) / log (2)  

Suppose we are calculating the MI for the collocate color near purple in BYU-BNC.  

A = frequency of node word (e.g. purple): 1262  
B = frequency of collocate (e.g. color): 115  
AB = frequency of collocate near the node word (e.g. color near purple): 24  
sizeCorpus= size of corpus (# words; in this case the BNC): 96,263,399  
span = span of words (e.g. 3 to left and 3 to right of node word): 6  
log (2) is literally the log10 of the number 2: .30103  

MI = 11.37 = log ( (24 * 96,263,399) / (1262 * 115 * 6) ) / .30103  
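
The worked example above can be checked directly in Python. This is only a sketch of the quoted formula; the function name `mutual_information` and its parameter names are chosen here for illustration.

```python
import math

def mutual_information(ab, a, b, corpus_size, span):
    # MI = log2((AB * sizeCorpus) / (A * B * span))
    return math.log2((ab * corpus_size) / (a * b * span))

# Values from the 'color' near 'purple' example
mi = mutual_information(ab=24, a=1262, b=115, corpus_size=96263399, span=6)
print(round(mi, 2))  # 11.37

# Dividing a base-10 log by log10(2) gives the same result as taking log base 2
ratio = (24 * 96263399) / (1262 * 115 * 6)
same_mi = math.log10(ratio) / math.log10(2)
```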

In [19]:
#The above formula turned into python code; only adjacent word pairs are counted, so the span is effectively 1

import math
from math import log

def MI(word1, word2):
  prob_word1 = answer_dict[word1] / float(sum(answer_dict.values()))
  prob_word2 = answer_dict[word2] / float(sum(answer_dict.values()))
  prob_word1_word2 = answer_bigram_dict[word1, word2] / float(sum(answer_bigram_dict.values()))
  return math.log(prob_word1_word2/float(prob_word1*prob_word2),2)

In [20]:
#Example of MI:

#This is a collocation which should have a medium strength MI (between 4-7)
answer_bigram_dict['young', 'people']
answer_dict['young']
answer_dict['people']

#'young' collocates strongly with 'people' (about 29% of the time) but 'people' doesn't collocate strongly with 'young'

469

1605

24516
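
The asymmetry noted in the comment can be read off the three counts above: the share of 'young' tokens that are followed by 'people' is much larger than the share of 'people' tokens that are preceded by 'young'. A quick check, using only the numbers displayed in this cell's output:

```python
bigram_count = 469    # answer_bigram_dict['young', 'people']
young_count = 1605    # answer_dict['young']
people_count = 24516  # answer_dict['people']

# How often 'young' is followed by 'people', and how often 'people' is preceded by 'young'
share_of_young = bigram_count / young_count
share_of_people = bigram_count / people_count

print(round(share_of_young * 100, 1))   # 29.2
print(round(share_of_people * 100, 1))  # 1.9
```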

In [21]:
MI('young','people')

#That is within the standard range for an MI score

5.840354713355728

In [22]:
#Example #2 with two words that have a weaker MI

answer_bigram_dict['the', 'man']
answer_dict['the']
answer_dict['man']

MI('the', 'man')

254

171927

1547

2.1986947748534735

### 13. Combo dataframe ###
- joins answer_df, user_df, and course_df
- removes unnecessary columns
- narrows results down to only answers from writing classes and first versions of their work

In [23]:
#join answer_df and user_df along 'user_file_id' column
combo_df = answer_df.join(user_df, on='user_file_id', lsuffix='user_file_id')

#now join this new df with course_df along 'course_id' column
combo_df = combo_df.join(course_df, on='course_id', lsuffix='user_file_id')

In [24]:
#Dropping unnecessary columns (there are a lot)
combo_df = combo_df.drop(['directoryuser_file_id', 'is_doublespaced', 'is_plagiarized', 'is_deleteduser_file_id',
                          'modifiedby', 'modifieddate', 'allow_submit_after_duedate', 'anon_id', 'file_type_id',
                          'file_info_id', 'user_file_parent_id', 'createdby', 'session_id',
                          'document_id', 'filename', 'content_text', 'createddate', 'allow_multiple_accesses',
                          'activity', 'order_num', 'due_date', 'post_date', 'assignment_name', 'directory',
                          'semester', 'allow_double_spacing', 'duration', 'pull_off_date', 'direction',
                          'grammar_qp_id', 'is_deleted', 'section', 'course_description'], axis = 1)

In [25]:
#keeping only 1st versions of students' work
combo_df = combo_df.loc[combo_df['version'] == 1]

#'version' column now unnecessary
combo_df = combo_df.drop(['version'], axis = 1)

In [26]:
#keeping only answers from writing classes (class_id = 2)
combo_df = combo_df.loc[combo_df['class_id'] == 2]

#'class_id' column now unnecessary
combo_df = combo_df.drop(['class_id'], axis = 1)

In [27]:
#just change the order of columns to something more logical and rename some columns
combo_df = combo_df[['question_id','user_file_id', 'anon_iduser_file_id', 'level_id', 'course_id', 'text', 'toks', 'bigrams']]
combo_df.rename(columns={'anon_iduser_file_id':'anon_id'}, inplace=True)

#finished result =  much cleaner
combo_df.head()

answer_id,question_id,user_file_id,anon_id,level_id,course_id,text,toks,bigrams
3,12,7507,dk5,4,115,In my country we usually don't use tea bags. F...,"[In, my, country, we, usually, do, n't, use, t...","[(In, my), (my, country), (country, we), (we, ..."
4,13,7507,dk5,4,115,I organized the instructions by time.,"[I, organized, the, instructions, by, time, .]","[(I, organized), (organized, the), (the, instr..."
5,12,7508,ad1,4,115,"First, prepare a port, loose tea, and cup.\r\r...","[First, ,, prepare, a, port, ,, loose, tea, ,,...","[(First, ,), (,, prepare), (prepare, a), (a, p..."
6,13,7508,ad1,4,115,By time,"[By, time]","[(By, time)]"
7,12,7509,eg5,4,115,"First, prepare your cup, loose tea or bag tea,...","[First, ,, prepare, your, cup, ,, loose, tea, ...","[(First, ,), (,, prepare), (prepare, your), (y..."


In [28]:
#remove level 2 (too few to be usefully analyzed)

combo_df.level_id.unique()

combo_df = combo_df.loc[combo_df['level_id'] != 2]

combo_df.level_id.unique()

array([4, 5, 3, 2])

array([4, 5, 3])

In [29]:
#updated MI formula with combo_dict

def MI(word1, word2):
  prob_word1 = combo_unigram_dict[word1] / float(sum(combo_unigram_dict.values()))
  prob_word2 = combo_unigram_dict[word2] / float(sum(combo_unigram_dict.values()))
  prob_word1_word2 = combo_bigram_dict[word1, word2] / float(sum(combo_bigram_dict.values()))
  return math.log(prob_word1_word2/float(prob_word1*prob_word2),2)

In [30]:
#create a column for total number of bigrams per text

combo_df['bigram_len'] = [len(x) for x in combo_df['bigrams']]

In [31]:
#create a column of lowercase bigrams to use with MI
combo_df['bigrams_lower'] = [[(x.lower(), y.lower()) for x, y in element] for element in combo_df['bigrams']]
combo_df.head(2)

answer_id,question_id,user_file_id,anon_id,level_id,course_id,text,toks,bigrams,bigram_len,bigrams_lower
3,12,7507,dk5,4,115,In my country we usually don't use tea bags. F...,"[In, my, country, we, usually, do, n't, use, t...","[(In, my), (my, country), (country, we), (we, ...",67,"[(in, my), (my, country), (country, we), (we, ..."
4,13,7507,dk5,4,115,I organized the instructions by time.,"[I, organized, the, instructions, by, time, .]","[(I, organized), (organized, the), (the, instr...",6,"[(i, organized), (organized, the), (the, instr..."


#### Making the MI_sum column, i.e. the total MI of all the bigrams in each answer

In [32]:
#create new freq dicts for combo_df (unigrams and bigrams) using same 
#code as earlier versions with answer_df

combo_corpus = ' '.join(combo_df['text'])
combo_corpus_tok = nltk.word_tokenize(combo_corpus)
combo_corpus_tok = list(map(lambda x:x.lower(),combo_corpus_tok)) #making everything lowercase
combo_unigram_dict = nltk.FreqDist(combo_corpus_tok)

combo_corpus_bigrams = list(nltk.bigrams(combo_corpus_tok))
combo_bigram_dict = nltk.FreqDist(combo_corpus_bigrams)

In [33]:
#updated MI formula with combo_dict and workaround to avoid math domain errors
def MI(word1, word2):
  prob_word1 = combo_unigram_dict[word1] / sum(combo_unigram_dict.values())
  prob_word2 = combo_unigram_dict[word2] / sum(combo_unigram_dict.values())
  prob_word1_word2 = combo_bigram_dict[word1, word2] / sum(combo_bigram_dict.values())
  y = prob_word1*prob_word2
  x = (prob_word1_word2/y) if y != 0 else 0
  if x != 0:
    return math.log(x,2)
  else:
    return 0
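
The `MI` function above recomputes both corpus totals on every call, which is probably a large part of why the later cells are slow. The following is a minimal sketch of the same calculation with the totals computed once. It assumes a toy token list and `collections.Counter` as stand-ins for `combo_corpus_tok` and the `nltk.FreqDist` objects, and `mi_cached` is a hypothetical name, not part of the notebook.

```python
import math
from collections import Counter

# Toy corpus standing in for combo_corpus_tok (hypothetical data).
tokens = "in my country we drink tea in my house".split()
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

# Totals are computed once here instead of inside every call.
total_unigrams = sum(unigram_counts.values())
total_bigrams = sum(bigram_counts.values())

def mi_cached(word1, word2):
    """Mutual information in bits; returns 0 for unseen words or bigrams, like MI()."""
    p1 = unigram_counts[word1] / total_unigrams
    p2 = unigram_counts[word2] / total_unigrams
    p12 = bigram_counts[word1, word2] / total_bigrams
    if p1 == 0 or p2 == 0 or p12 == 0:
        return 0
    return math.log2(p12 / (p1 * p2))

score_seen = mi_cached("in", "my")       # bigram occurs twice in the toy corpus
score_unseen = mi_cached("tea", "house") # bigram never occurs, so the score is 0
```

In the toy corpus, `('in', 'my')` has probability 2/8 as a bigram and each word has probability 2/9, so the score is log2(0.25 / (4/81)).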

In [34]:
#Create list of all text_MI scores (takes a while)

row = 0
text_MI = []

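for bigrams in combo_df['bigrams_lower']:
    #sum the MI of every (word1, word2) pair in this text; kept as a one-item list because later cells read x[0]
    y = [round(sum(MI(w1, w2) for w1, w2 in bigrams), 2)]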
    row += 1
    text_MI.append(y)

In [35]:
text_MI[:20] #check the results

[[181.28], [7.65], [228.84], [-0.24], [120.11], [102.94], [269.31], [11.61], [229.97], [19.53], [160.22], [96.46], [188.73], [125.52], [74.17], [254.38], [190.67], [7.85], [407.79], [8.22]]

In [36]:
len(combo_df['bigrams_lower'])
len(text_MI)
row

12702

12702

12702

In [37]:
text_MI = pd.Series(text_MI) #turn the list into a series

In [38]:
#create a total of MI scores for each text (for machine learning later)
combo_df['MI_sum'] = [x[0] for x in text_MI]

combo_df.head(3)

Unnamed: 0_level_0,question_id,user_file_id,anon_id,level_id,course_id,text,toks,bigrams,bigram_len,bigrams_lower,MI_sum
answer_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
3,12,7507,dk5,4,115,In my country we usually don't use tea bags. F...,"[In, my, country, we, usually, do, n't, use, t...","[(In, my), (my, country), (country, we), (we, ...",67,"[(in, my), (my, country), (country, we), (we, ...",181.28
4,13,7507,dk5,4,115,I organized the instructions by time.,"[I, organized, the, instructions, by, time, .]","[(I, organized), (organized, the), (the, instr...",6,"[(i, organized), (organized, the), (the, instr...",7.65
5,12,7508,ad1,4,115,"First, prepare a port, loose tea, and cup.\r\r...","[First, ,, prepare, a, port, ,, loose, tea, ,,...","[(First, ,), (,, prepare), (prepare, a), (a, p...",73,"[(first, ,), (,, prepare), (prepare, a), (a, p...",228.84


In [39]:
#create an avg_bigram_MI scores for each text

combo_df['avg_bigram_MI'] = combo_df['MI_sum'] / combo_df['bigram_len'] 

In [40]:
combo_df['avg_bigram_MI'] = combo_df['avg_bigram_MI'].round(2) #round to 2 decimals
combo_df.head()

Unnamed: 0_level_0,question_id,user_file_id,anon_id,level_id,course_id,text,toks,bigrams,bigram_len,bigrams_lower,MI_sum,avg_bigram_MI
answer_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
3,12,7507,dk5,4,115,In my country we usually don't use tea bags. F...,"[In, my, country, we, usually, do, n't, use, t...","[(In, my), (my, country), (country, we), (we, ...",67,"[(in, my), (my, country), (country, we), (we, ...",181.28,2.71
4,13,7507,dk5,4,115,I organized the instructions by time.,"[I, organized, the, instructions, by, time, .]","[(I, organized), (organized, the), (the, instr...",6,"[(i, organized), (organized, the), (the, instr...",7.65,1.28
5,12,7508,ad1,4,115,"First, prepare a port, loose tea, and cup.\r\r...","[First, ,, prepare, a, port, ,, loose, tea, ,,...","[(First, ,), (,, prepare), (prepare, a), (a, p...",73,"[(first, ,), (,, prepare), (prepare, a), (a, p...",228.84,3.13
6,13,7508,ad1,4,115,By time,"[By, time]","[(By, time)]",1,"[(by, time)]",-0.24,-0.24
7,12,7509,eg5,4,115,"First, prepare your cup, loose tea or bag tea,...","[First, ,, prepare, your, cup, ,, loose, tea, ...","[(First, ,), (,, prepare), (prepare, your), (y...",49,"[(first, ,), (,, prepare), (prepare, your), (y...",120.11,2.45


In [41]:
#Let's also remove very short texts (fewer than 10 bigrams, i.e. fewer than 11 tokens), which are not 'essays'

combo_df = combo_df.loc[combo_df['bigram_len'] >= 10]
combo_df.head()

Unnamed: 0_level_0,question_id,user_file_id,anon_id,level_id,course_id,text,toks,bigrams,bigram_len,bigrams_lower,MI_sum,avg_bigram_MI
answer_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
3,12,7507,dk5,4,115,In my country we usually don't use tea bags. F...,"[In, my, country, we, usually, do, n't, use, t...","[(In, my), (my, country), (country, we), (we, ...",67,"[(in, my), (my, country), (country, we), (we, ...",181.28,2.71
5,12,7508,ad1,4,115,"First, prepare a port, loose tea, and cup.\r\r...","[First, ,, prepare, a, port, ,, loose, tea, ,,...","[(First, ,), (,, prepare), (prepare, a), (a, p...",73,"[(first, ,), (,, prepare), (prepare, a), (a, p...",228.84,3.13
7,12,7509,eg5,4,115,"First, prepare your cup, loose tea or bag tea,...","[First, ,, prepare, your, cup, ,, loose, tea, ...","[(First, ,), (,, prepare), (prepare, your), (y...",49,"[(first, ,), (,, prepare), (prepare, your), (y...",120.11,2.45
8,13,7509,eg5,4,115,"I organized the instructions by time, beacause...","[I, organized, the, instructions, by, time, ,,...","[(I, organized), (organized, the), (the, instr...",38,"[(i, organized), (organized, the), (the, instr...",102.94,2.71
11,12,7511,fv6,4,115,"To make tea, nothing is easier, even if someti...","[To, make, tea, ,, nothing, is, easier, ,, eve...","[(To, make), (make, tea), (tea, ,), (,, nothin...",98,"[(to, make), (make, tea), (tea, ,), (,, nothin...",269.31,2.75


### 14. Occurrences per million ###
- Create function for calculating occurrences per million  
- For unigrams and bigrams  

Formula:

FN = (FO × 1,000,000) / C

- FN = normalized frequency
- FO = observed frequency
- C = corpus size
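
As a quick check of the formula, here is a minimal sketch that applies it to two bigram counts reported further down in this notebook. The helper name `per_million` is hypothetical; the counts (2629 and 875) and the corpus size (2549011 bigrams) are copied from the notebook's own outputs.

```python
def per_million(observed, corpus_size):
    """Normalized frequency: FN = FO * 1,000,000 / C."""
    return observed * 1_000_000 / corpus_size

total_bigrams = 2549011  # corpus size in bigrams, as printed in this notebook

in_my = per_million(2629, total_bigrams)      # ('in', 'my') occurs 2629 times
my_country = per_million(875, total_bigrams)  # ('my', 'country') occurs 875 times

print(round(in_my, 2), round(my_country, 2))
```

Rounded to two decimals, these reproduce the `per_million` values shown for the same two bigrams in `bigram_df`.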

In [42]:
#total number of unigrams
total_unigrams = len(combo_corpus_tok)

#total number of bigrams
total_bigrams = len(combo_corpus_bigrams)

total_unigrams
total_bigrams

#these differ by one, as the number of bigrams is naturally the number of unigrams - 1

2549012

2549011

In [43]:
#create function where you enter the unigram and it tells you the frequency in the corpus per million tokens

def unigram_per_M(unigram):
   return (combo_unigram_dict[unigram]*1000000) / total_unigrams

In [44]:
#create function where you enter the bigram and it tells you the frequency in the corpus per million bigrams

def bigram_per_M(word1, word2):
   return (combo_bigram_dict[word1, word2]*1000000) / total_bigrams

### 15. bigram_df ###

Create bigram_df showing relevant info based on above formulas

- columns for this dataframe:
    - default index
    - bigrams
    - MI scores
    - occurrences per million
    - normalized percentage used at each proficiency level

In [45]:
#Creating bigrams and tokens columns

bigram_df = pd.DataFrame.from_dict(combo_bigram_dict,orient='index')
bigram_df = bigram_df.reset_index()
bigram_df = bigram_df.rename(columns = {0:'tokens', 'index': 'bigram'})
bigram_df.head()

Unnamed: 0,bigram,tokens
0,"(in, my)",2629
1,"(my, country)",875
2,"(country, we)",17
3,"(we, usually)",80
4,"(usually, do)",53


In [46]:
#Changing bigram tuples to lists for easier manipulation

bigram_df['bigram'] = [list(x) for x in bigram_df['bigram']]

#### Creating MI column

In [47]:
#Creating MI column (takes a few hours)

bigram_df['MI'] = [MI(x[0], x[1]) for x in bigram_df['bigram']]

In [48]:
#Rounding results to two decimal places

bigram_df['MI'] = bigram_df['MI'].round(2)
bigram_df.head()

Unnamed: 0,bigram,tokens,MI
0,"[in, my]",2629,3.13
1,"[my, country]",875,5.5
2,"[country, we]",17,0.36
3,"[we, usually]",80,3.41
4,"[usually, do]",53,3.07


#### Creating per_million column

In [49]:
bigram_df['per_million'] = [bigram_per_M(x[0], x[1]) for x in bigram_df['bigram']]

In [50]:
#Rounding to two decimal places

bigram_df['per_million'] = bigram_df['per_million'].round(2)
bigram_df.head()

Unnamed: 0,bigram,tokens,MI,per_million
0,"[in, my]",2629,3.13,1031.38
1,"[my, country]",875,5.5,343.27
2,"[country, we]",17,0.36,6.67
3,"[we, usually]",80,3.41,31.38
4,"[usually, do]",53,3.07,20.79


#### Creating 'normalized toks per level' and 'relative percentage per level' columns

In [51]:
#create level dataframes
level_3 = combo_df.loc[combo_df['level_id'] == 3, :] 
level_4 = combo_df.loc[combo_df['level_id'] == 4, :] 
level_5 = combo_df.loc[combo_df['level_id'] == 5, :] 

#create frequency dictionaries for each level
level_3_corpus = ' '.join(level_3['text'])
level_3_tok = nltk.word_tokenize(level_3_corpus)
level_3_tok = list(map(lambda x:x.lower(),level_3_tok))
level_3_bigrams = list(nltk.bigrams(level_3_tok))
level_3_bigram_dict = nltk.FreqDist(level_3_bigrams)

level_4_corpus = ' '.join(level_4['text'])
level_4_tok = nltk.word_tokenize(level_4_corpus)
level_4_tok = list(map(lambda x:x.lower(),level_4_tok))
level_4_bigrams = list(nltk.bigrams(level_4_tok))
level_4_bigram_dict = nltk.FreqDist(level_4_bigrams)

level_5_corpus = ' '.join(level_5['text'])
level_5_tok = nltk.word_tokenize(level_5_corpus)
level_5_tok = list(map(lambda x:x.lower(),level_5_tok))
level_5_bigrams = list(nltk.bigrams(level_5_tok))
level_5_bigram_dict = nltk.FreqDist(level_5_bigrams)
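
The three blocks above repeat the same steps for each level. A minimal sketch of the same idea with one helper and a loop is given here. It assumes plain whitespace splitting and `collections.Counter` as stand-ins for `nltk.word_tokenize` and `nltk.FreqDist`, and a hypothetical list of `(level_id, text)` pairs in place of `combo_df`.

```python
from collections import Counter

def bigram_freq(texts):
    """Join texts into one string, lowercase, split on whitespace, count adjacent pairs."""
    tokens = " ".join(texts).lower().split()
    return Counter(zip(tokens, tokens[1:]))

# Hypothetical stand-in for combo_df: (level_id, text) pairs.
rows = [
    (3, "I like tea"),
    (4, "In the morning I drink tea"),
    (4, "Tea is popular in the city"),
    (5, "In the end I like coffee"),
]

# One frequency dictionary per level, keyed by level_id.
level_bigram_dicts = {}
for level in (3, 4, 5):
    level_texts = [text for lvl, text in rows if lvl == level]
    level_bigram_dicts[level] = bigram_freq(level_texts)

in_the_by_level = {lvl: d["in", "the"] for lvl, d in level_bigram_dicts.items()}
```

Because the texts are joined before bigrams are built, as in the cells above, the last token of one text and the first token of the next also form a bigram (here `('tea', 'tea')` at level 4).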

In [52]:
#Example of what each cell should contain in the level_3 column
#level_3_bigram_dict divided by the value from combo_bigram_dict

#for example
"{0:.2f}%".format(level_3_bigram_dict['in', 'the'] / combo_bigram_dict['in', 'the'] * 100)

#totals for all 3 levels should add up to 100%
"{0:.2f}%".format(level_3_bigram_dict['in', 'the'] / combo_bigram_dict['in', 'the'] * 100)
"{0:.2f}%".format(level_4_bigram_dict['in', 'the'] / combo_bigram_dict['in', 'the'] * 100)
"{0:.2f}%".format(level_5_bigram_dict['in', 'the'] / combo_bigram_dict['in', 'the'] * 100)

12.17 + 40.75 + 47.07 #close enough!

'12.06%'

'12.06%'

'40.50%'

'47.01%'

99.99000000000001

In [53]:
#create updated freq dicts for combo_df (unigrams and bigrams)

combo_corpus = ' '.join(combo_df['text'])
combo_corpus_tok = nltk.word_tokenize(combo_corpus)
combo_corpus_tok = list(map(lambda x:x.lower(),combo_corpus_tok))
combo_unigram_dict = nltk.FreqDist(combo_corpus_tok)

combo_corpus_bigrams = list(nltk.bigrams(combo_corpus_tok))
combo_bigram_dict = nltk.FreqDist(combo_corpus_bigrams)

In [54]:
#Checking that level bigram dicts add up to existing total bigram dict
level_3_bigram_dict['in', 'the']
level_4_bigram_dict['in', 'the']
level_5_bigram_dict['in', 'the']

level_3_bigram_dict['in', 'the'] + level_4_bigram_dict['in', 'the'] + level_5_bigram_dict['in', 'the']

combo_bigram_dict['in', 'the']

1347

4525

5252

11124

11124

In [55]:
#it is also necessary to normalize, as there is a different number of responses at each level

#weighting for each level
level_3_weighting = len(level_3.index) / len(combo_df.index)
level_4_weighting = len(level_4.index) / len(combo_df.index)
level_5_weighting = len(level_5.index) / len(combo_df.index)

level_3_weighting
level_4_weighting
level_5_weighting

level_3_weighting+level_4_weighting+level_5_weighting #should equal 1

#difference between expected and observed, i.e. expected weighting (1/3) - actual weighting (level_N_weighting)
level_3_change = (1/3) - level_3_weighting
level_4_change = (1/3) - level_4_weighting
level_5_change = (1/3) - level_5_weighting

level_3_change
level_4_change
level_5_change

round(level_3_change + level_4_change + level_5_change, 2) # should be 0

0.24625775830595106

0.4115553121577218

0.3421869295363271

1.0

0.08707557502738225

-0.07822197882438847

-0.008853596202993808

-0.0

In [57]:
#example of normalizing with ['in', 'the'] bigram

#un-normalized number
level_3_bigram_dict['in', 'the']
level_4_bigram_dict['in', 'the']
level_5_bigram_dict['in', 'the']
combo_bigram_dict['in', 'the']

#normalized number
n3 = level_3_bigram_dict['in', 'the'] + (combo_bigram_dict['in', 'the'] * level_3_change)
n4 = level_4_bigram_dict['in', 'the'] + (combo_bigram_dict['in', 'the'] * level_4_change)
n5 = level_5_bigram_dict['in', 'the'] + (combo_bigram_dict['in', 'the'] * level_5_change)

n3
n4
n5

n3 + n4 + n5

1347

4525

5252

11124

2315.6286966046

3654.8587075575024

5153.512595837897

11124.0

In [58]:
#create a function for the above

def norm_toks_level3(word1, word2):
    return int((level_3_bigram_dict[word1,word2] + (combo_bigram_dict[word1,word2] * level_3_change)))

def norm_toks_level4(word1, word2):
    return int((level_4_bigram_dict[word1,word2] + (combo_bigram_dict[word1,word2] * level_4_change)))
            
def norm_toks_level5(word1, word2):
    return int((level_5_bigram_dict[word1,word2] + (combo_bigram_dict[word1,word2] * level_5_change)))

#Examples:
norm_toks_level3('in', 'the')
norm_toks_level4('in', 'the')
norm_toks_level5('in', 'the')

2315

3654

5153

In [59]:
#And as a comparative percentage
def norm_percent_level3(word1, word2):
    return(round(100*((level_3_bigram_dict[word1,word2] + (combo_bigram_dict[word1,word2] * level_3_change))
                      / (combo_bigram_dict[word1, word2]) if combo_bigram_dict[word1, word2] != 0 else 0),2))

def norm_percent_level4(word1, word2):
    return(round(100*((level_4_bigram_dict[word1,word2] + (combo_bigram_dict[word1,word2] * level_4_change))
                      / (combo_bigram_dict[word1, word2]) if combo_bigram_dict[word1, word2] != 0 else 0),2))

def norm_percent_level5(word1, word2):
    return(round(100*((level_5_bigram_dict[word1,word2] + (combo_bigram_dict[word1,word2] * level_5_change))
                      / (combo_bigram_dict[word1, word2]) if combo_bigram_dict[word1, word2] != 0 else 0),2))

#Examples:
norm_percent_level3('in', 'the')
norm_percent_level4('in', 'the')
norm_percent_level5('in', 'the')

20.82

32.86

46.33
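
The per-level functions above differ only in which dictionary and which offset they use. As a minimal sketch, the same normalization can be written once and checked against the `('in', 'the')` example. The function name `normalize` is hypothetical; the text counts per level are copied from the `levels_df` output later in this notebook and the bigram counts from the cells above.

```python
# Number of texts and raw ('in', 'the') counts per level, copied from this notebook's outputs.
texts_per_level = {3: 2698, 4: 4509, 5: 3749}
in_the_counts = {3: 1347, 4: 4525, 5: 5252}

total_texts = sum(texts_per_level.values())
# Offset = expected share of texts (1/3) minus the observed share at that level.
offsets = {
    lvl: 1 / len(texts_per_level) - n / total_texts
    for lvl, n in texts_per_level.items()
}

def normalize(level_counts, offsets):
    """Return (normalized counts, relative percentages) for one bigram."""
    total = sum(level_counts.values())
    norm = {lvl: c + total * offsets[lvl] for lvl, c in level_counts.items()}
    pct = {lvl: round(100 * v / total, 2) if total else 0 for lvl, v in norm.items()}
    return norm, pct

norm, pct = normalize(in_the_counts, offsets)
print({lvl: int(v) for lvl, v in norm.items()})
print(pct)
```

Since the offsets sum to zero, the normalized counts still add up to the original total (11124 in this example).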

In [60]:
#Normalized tokens applied to the whole dataframe

bigram_df['lv3_norm_toks'] = [norm_toks_level3(x[0], x[1]) for x in bigram_df['bigram']]
bigram_df['lv4_norm_toks'] = [norm_toks_level4(x[0], x[1]) for x in bigram_df['bigram']]
bigram_df['lv5_norm_toks'] = [norm_toks_level5(x[0], x[1]) for x in bigram_df['bigram']]

bigram_df.head()

Unnamed: 0,bigram,tokens,MI,per_million,lv3_norm_toks,lv4_norm_toks,lv5_norm_toks
0,"[in, my]",2629,3.13,1031.38,687,1124,805
1,"[my, country]",875,5.5,343.27,249,321,298
2,"[country, we]",17,0.36,6.67,2,10,3
3,"[we, usually]",80,3.41,31.38,14,51,13
4,"[usually, do]",53,3.07,20.79,6,26,19


In [61]:
#And now the comparative percentages

bigram_df['lv3_rel_%'] = [norm_percent_level3(x[0], x[1]) for x in bigram_df['bigram']]
bigram_df['lv4_rel_%'] = [norm_percent_level4(x[0], x[1]) for x in bigram_df['bigram']]
bigram_df['lv5_rel_%'] = [norm_percent_level5(x[0], x[1]) for x in bigram_df['bigram']]

bigram_df.head()

Unnamed: 0,bigram,tokens,MI,per_million,lv3_norm_toks,lv4_norm_toks,lv5_norm_toks,lv3_rel_%,lv4_rel_%,lv5_rel_%
0,"[in, my]",2629,3.13,1031.38,687,1124,805,26.28,42.94,30.78
1,"[my, country]",875,5.5,343.27,249,321,298,28.71,37.01,34.29
2,"[country, we]",17,0.36,6.67,2,10,3,14.59,62.77,22.64
3,"[we, usually]",80,3.41,31.38,14,51,13,18.71,64.68,16.61
4,"[usually, do]",53,3.07,20.79,6,26,19,12.48,50.67,36.85


#### Creating level per_million columns

In [62]:
#create per_million columns for each level

bigram_df['lv3_per_M'] = round(bigram_df['lv3_norm_toks']*1000000/total_bigrams, 2)
bigram_df['lv4_per_M'] = round(bigram_df['lv4_norm_toks']*1000000/total_bigrams, 2)
bigram_df['lv5_per_M'] = round(bigram_df['lv5_norm_toks']*1000000/total_bigrams, 2)

In [63]:
bigram_df.index += 1 #frequency lists look better starting at 1

bigram_df.head()

Unnamed: 0,bigram,tokens,MI,per_million,lv3_norm_toks,lv4_norm_toks,lv5_norm_toks,lv3_rel_%,lv4_rel_%,lv5_rel_%,lv3_per_M,lv4_per_M,lv5_per_M
1,"[in, my]",2629,3.13,1031.38,687,1124,805,26.28,42.94,30.78,269.52,440.96,315.81
2,"[my, country]",875,5.5,343.27,249,321,298,28.71,37.01,34.29,97.68,125.93,116.91
3,"[country, we]",17,0.36,6.67,2,10,3,14.59,62.77,22.64,0.78,3.92,1.18
4,"[we, usually]",80,3.41,31.38,14,51,13,18.71,64.68,16.61,5.49,20.01,5.1
5,"[usually, do]",53,3.07,20.79,6,26,19,12.48,50.67,36.85,2.35,10.2,7.45


In [64]:
combo_df.head()
bigram_df.head()

Unnamed: 0_level_0,question_id,user_file_id,anon_id,level_id,course_id,text,toks,bigrams,bigram_len,bigrams_lower,MI_sum,avg_bigram_MI
answer_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
3,12,7507,dk5,4,115,In my country we usually don't use tea bags. F...,"[In, my, country, we, usually, do, n't, use, t...","[(In, my), (my, country), (country, we), (we, ...",67,"[(in, my), (my, country), (country, we), (we, ...",181.28,2.71
5,12,7508,ad1,4,115,"First, prepare a port, loose tea, and cup.\r\r...","[First, ,, prepare, a, port, ,, loose, tea, ,,...","[(First, ,), (,, prepare), (prepare, a), (a, p...",73,"[(first, ,), (,, prepare), (prepare, a), (a, p...",228.84,3.13
7,12,7509,eg5,4,115,"First, prepare your cup, loose tea or bag tea,...","[First, ,, prepare, your, cup, ,, loose, tea, ...","[(First, ,), (,, prepare), (prepare, your), (y...",49,"[(first, ,), (,, prepare), (prepare, your), (y...",120.11,2.45
8,13,7509,eg5,4,115,"I organized the instructions by time, beacause...","[I, organized, the, instructions, by, time, ,,...","[(I, organized), (organized, the), (the, instr...",38,"[(i, organized), (organized, the), (the, instr...",102.94,2.71
11,12,7511,fv6,4,115,"To make tea, nothing is easier, even if someti...","[To, make, tea, ,, nothing, is, easier, ,, eve...","[(To, make), (make, tea), (tea, ,), (,, nothin...",98,"[(to, make), (make, tea), (tea, ,), (,, nothin...",269.31,2.75


Unnamed: 0,bigram,tokens,MI,per_million,lv3_norm_toks,lv4_norm_toks,lv5_norm_toks,lv3_rel_%,lv4_rel_%,lv5_rel_%,lv3_per_M,lv4_per_M,lv5_per_M
1,"[in, my]",2629,3.13,1031.38,687,1124,805,26.28,42.94,30.78,269.52,440.96,315.81
2,"[my, country]",875,5.5,343.27,249,321,298,28.71,37.01,34.29,97.68,125.93,116.91
3,"[country, we]",17,0.36,6.67,2,10,3,14.59,62.77,22.64,0.78,3.92,1.18
4,"[we, usually]",80,3.41,31.38,14,51,13,18.71,64.68,16.61,5.49,20.01,5.1
5,"[usually, do]",53,3.07,20.79,6,26,19,12.48,50.67,36.85,2.35,10.2,7.45


### 16. levels_df ###

Create a mini dataframe of overall numbers, called levels_df

In [65]:
#To see overall types and tokens by level

#first find length of sub-corpora
lv3_unigrams = len(level_3_tok)
lv4_unigrams = len(level_4_tok)
lv5_unigrams = len(level_5_tok)

lv3_bigrams = len(level_3_bigrams)
lv4_bigrams = len(level_4_bigrams)
lv5_bigrams = len(level_5_bigrams)

unigram_toks = pd.Series([lv3_unigrams, lv4_unigrams, lv5_unigrams, total_unigrams], index=['Level 3', 'Level 4', 'Level 5', 'Total'])
bigram_toks = pd.Series([lv3_bigrams, lv4_bigrams, lv5_bigrams, total_bigrams], index=['Level 3', 'Level 4', 'Level 5', 'Total'])

In [66]:
#find number of types for each level and overall

total_unigram_types = len(set(combo_corpus_tok))
lv3_unigram_types = len(set(level_3_tok))
lv4_unigram_types = len(set(level_4_tok))
lv5_unigram_types = len(set(level_5_tok))

total_bigram_types = len(set(combo_corpus_bigrams))
lv3_bigram_types = len(set(level_3_bigrams))
lv4_bigram_types = len(set(level_4_bigrams))
lv5_bigram_types = len(set(level_5_bigrams))

unigram_types = pd.Series([lv3_unigram_types, lv4_unigram_types, lv5_unigram_types, total_unigram_types], index=['Level 3', 'Level 4', 'Level 5', 'Total'])
bigram_types = pd.Series([lv3_bigram_types, lv4_bigram_types, lv5_bigram_types, total_bigram_types], index=['Level 3', 'Level 4', 'Level 5', 'Total'])

In [67]:
#find total number of texts at each level and overall

total_texts = len(combo_df.index)
lv3_texts = len(combo_df.loc[combo_df['level_id'] == 3, :])
lv4_texts = len(combo_df.loc[combo_df['level_id'] == 4, :])
lv5_texts = len(combo_df.loc[combo_df['level_id'] == 5, :])

texts = pd.Series([lv3_texts, lv4_texts, lv5_texts, total_texts], index=['Level 3', 'Level 4', 'Level 5', 'Total'])

In [68]:
#create dataframe

levels_df = pd.concat([unigram_toks, unigram_types, bigram_toks, bigram_types, texts], axis = 1)
levels_df.columns = ['unigram_toks', 'unigram_types', 'bigram_toks', 'bigram_types', 'texts']
levels_df

Unnamed: 0,unigram_toks,unigram_types,bigram_toks,bigram_types,texts
Level 3,282844,11816,282843,81209,2698
Level 4,1193172,23231,1193171,236467,4509
Level 5,1060753,23667,1060752,236637,3749
Total,2549012,39016,2549011,430738,10956
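
The token and type counts in this table come from `len()` of a list and `len()` of its `set`. A minimal sketch of that distinction follows, using two hypothetical toy sub-corpora rather than the ELI data.

```python
# Hypothetical toy sub-corpora standing in for level_3_tok, level_4_tok, etc.
level_tokens = {
    "Level 3": "i like tea and i like coffee".split(),
    "Level 4": "tea is popular in my country".split(),
}

rows = {}
for level, toks in level_tokens.items():
    bigrams = list(zip(toks, toks[1:]))
    rows[level] = {
        "unigram_toks": len(toks),          # every occurrence counts
        "unigram_types": len(set(toks)),    # each distinct word counts once
        "bigram_toks": len(bigrams),
        "bigram_types": len(set(bigrams)),
    }

# Types for the whole corpus are counted over the union, not summed per level.
all_toks = [t for toks in level_tokens.values() for t in toks]
total_types = len(set(all_toks))
summed_types = sum(r["unigram_types"] for r in rows.values())
```

This is also why the 'Total' row for types in `levels_df` is smaller than the sum of the per-level type counts: vocabulary shared between levels is only counted once.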


### 17. Pickling ###

Saving pickles of dataframes and MI dict in order to save time in future and to use in other notebooks

In [69]:
#save bigram_df as a pickle file and csv for later use

outfile = 'bigram_df.pkl'
bigram_df.to_pickle(outfile)
print(outfile, 'pickled.')

outfile = 'bigram_df.csv'
bigram_df.to_csv(outfile)
print(outfile, 'written out.')

#to read in later, use: pandas.read_pickle()

bigram_df.pkl pickled.
bigram_df.csv written out.


In [70]:
#save combo_df as a pickle file and csv for later use

outfile = 'combo_df.pkl'
combo_df.to_pickle(outfile)
print(outfile, 'pickled.')

outfile = 'combo_df.csv'
combo_df.to_csv(outfile)
print(outfile, 'written out.')

combo_df.pkl pickled.
combo_df.csv written out.


In [71]:
#save levels_df as a pickle file and csv for later use

outfile = 'levels_df.pkl'
levels_df.to_pickle(outfile)
print(outfile, 'pickled.')

outfile = 'levels_df.csv'
levels_df.to_csv(outfile)
print(outfile, 'written out.')

levels_df.pkl pickled.
levels_df.csv written out.


In [72]:
#Make and pickle an MI_dict
import pickle

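#keys are the string form of each bigram list, e.g. "['in', 'my']" (lists themselves are unhashable);
#calling str() on the whole column would instead zip over the characters of its printed form
MI_dict = dict(zip(bigram_df.bigram.map(str), bigram_df.MI))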

with open('MI_dict.pkl', 'wb') as handle:
    pickle.dump(MI_dict, handle, protocol=pickle.HIGHEST_PROTOCOL)

print('MI_dict.pkl written out.')

MI_dict.pkl written out.
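
For use in other notebooks, a minimal sketch of the round trip (write, read back, look up) is shown here. It assumes the keys are the string form of each bigram list; the two entries are copied from the `bigram_df.head()` output above, and the temporary file name is hypothetical so that the real `MI_dict.pkl` is not touched.

```python
import os
import pickle
import tempfile

# Two-entry stand-in for MI_dict; keys are str() of each bigram list (an assumption).
mi_dict = {str(["in", "my"]): 3.13, str(["my", "country"]): 5.5}

# Write to a throwaway location rather than the notebook's MI_dict.pkl.
path = os.path.join(tempfile.mkdtemp(), "MI_dict_demo.pkl")
with open(path, "wb") as handle:
    pickle.dump(mi_dict, handle, protocol=pickle.HIGHEST_PROTOCOL)

# Read it back, as another notebook would.
with open(path, "rb") as handle:
    loaded = pickle.load(handle)

lookup = loaded[str(["in", "my"])]
```

With string keys, a lookup has to rebuild the same string form, which is why `str(["in", "my"])` is used on both sides.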


### 18. Visualizations ###

Visualizations based on this data can be found in a separate notebook:

https://github.com/Data-Science-for-Linguists/Bigram-analysis-of-writing-from-the-ELI/tree/master/Visualizations.ipynb