# Project_Code 2: Clean up and analysis of ELI Data #
## Ben Naismith ##

### Changes since 'Project_Code1' ###

This new document has been created as a number of significant changes have been made to the original code. Based on discussions with other members of the ELI Data Mining Group, the following points were determined:

- For the sake of efficiency, it is better not to merge the different data frames into one big one
- A 'sanitization' step of the data was completed which duplicated some of the steps of my initial code. These duplications include removing unwanted apostrophes, changing all 'null' and 'ull' to NaN, and removing empty or unreal students (who were most likely teachers). As such, the dataset is now ready for more in-depth cleaning and analysis, i.e. the purpose of this notebook. The code for the sanitization step is 'convert_0_to_1.ipynb', in a private repository of the ELI Data Mining Group.

### Data Sharing Plan ###

The full ELI data set (see project_plan.md) is private at this time. Below is a workbook with the current code for organizing and cleaning that data. In order to see how the code works, snippets of data have been displayed throughout.

A sample of the 'sanitized' data is included in the 'data' folder in this same repository. It contains samples of the four CSV files referred to in this code, consisting of 1000 answers, so that others can test and reproduce the code. These 1000 answers are the first 1000 from the answer_csv file and correspond to user_file_id 7505 to 10108.

Ultimately, the dataset's authors intend to make the entire dataset public under a CC license. Please see LICENSE_notes.md for details.

### Initial setup ###

In [1]:
#Import necessary modules
import numpy as np
import pandas as pd
import nltk
import glob
import matplotlib.pyplot as plt

%pprint #turn off pretty printing

#return every shell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

#Create short-hand for directory root
cor_dir = "/Users/Benjamin's/Documents/ELI_Data_Mining/Data-Archive/1_sanitized/"

Pretty printing has been turned OFF


In [2]:
#Add starter code created by Na-Rae Han for the ELI research group
from elitools import *

<class 'pandas.core.frame.DataFrame'>
Int64Index: 48384 entries, 1 to 48420
Data columns (total 8 columns):
question_id        48384 non-null int64
anon_id            48353 non-null object
user_file_id       48384 non-null int64
text               47175 non-null object
directory          14 non-null object
is_doublespaced    48384 non-null int64
is_plagiarized     48384 non-null int64
is_deleted         48384 non-null int64
dtypes: int64(5), object(3)
memory usage: 3.3+ MB
<class 'pandas.core.frame.DataFrame'>
Index: 920 entries, ez9 to aq6
Data columns (total 20 columns):
gender                       920 non-null object
birth_year                   920 non-null int64
native_language              920 non-null object
language_used_at_home        919 non-null object
language_used_at_home_now    860 non-null object
non_native_language_1        864 non-null object
yrs_of_study_lang1           869 non-null object
study_in_classroom_lang1     869 non-null float64
ways_of_study_lang1         

### Student information (S_info_csv and S_info_df) ###

In [3]:
#Process the student_information.csv file
S_info_csv = cor_dir + "student_information.csv"
S_info_df = pd.read_csv(S_info_csv, index_col = 'anon_id')

S_info_df.head() #Issues still apparent with integers turned into floats
S_info_df.tail(10) #6 anon_id with no personal info - perhaps not students and to be 'pruned', as well as teachers with 'English' as the native language

Unnamed: 0_level_0,gender,birth_year,native_language,language_used_at_home,language_used_at_home_now,non_native_language_1,yrs_of_study_lang1,study_in_classroom_lang1,ways_of_study_lang1,non_native_language_2,yrs_of_study_lang2,study_in_classroom_lang2,ways_of_study_lang2,non_native_language_3,yrs_of_study_lang3,study_in_classroom_lang3,ways_of_study_lang3,createddate,modifieddate,course_history
anon_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1
ez9,Male,1978,Arabic,Arabic,,English,more than 5 years,1.0,Studied grammar;Worked in pairs/groups;Studied...,Turkish,less than 1 year,0.0,Studied by myself,,,0.0,other,2006-01-30 15:07:18,2006-03-14 15:13:37,6;12;18;24;30
gm3,Male,1980,Arabic,Arabic,,English,more than 5 years,1.0,Studied grammar;Had a native-speaker teacher;S...,,,0.0,other,,,0.0,other,2006-01-30 15:07:28,2006-03-14 15:12:49,6;12;24;30;38
fg5,Male,1938,Nepali,Nepali,,English,more than 5 years,1.0,Studied grammar;Worked in pairs/groups;Had a n...,French,less than 1 year,1.0,Studied grammar;Worked in pairs/groups;Had a n...,Hindi,more than 5 years,0.0,Studied by myself,2006-01-30 15:07:45,2006-03-14 15:11:36,18;24
ce5,Female,1984,Korean,Korean,,English,more than 5 years,1.0,Studied grammar;Worked in pairs/groups;Had a n...,German,1-2 years,1.0,Studied grammar;Studied vocabulary;Listened to...,,,0.0,other,2006-01-30 15:07:49,2006-03-14 15:12:24,6;12;24;30;38;56
fi7,Female,1982,Korean,Korean;Japanese,,English,more than 5 years,1.0,Studied grammar;Had a native-speaker teacher;S...,Japanese,less than 1 year,1.0,Studied grammar;Studied vocabulary;Listened to...,French,1-2 years,1.0,Studied grammar;Studied vocabulary;Listened to...,2006-01-30 15:07:52,2006-03-14 15:12:17,6;12;24;30;38


Unnamed: 0_level_0,gender,birth_year,native_language,language_used_at_home,language_used_at_home_now,non_native_language_1,yrs_of_study_lang1,study_in_classroom_lang1,ways_of_study_lang1,non_native_language_2,yrs_of_study_lang2,study_in_classroom_lang2,ways_of_study_lang2,non_native_language_3,yrs_of_study_lang3,study_in_classroom_lang3,ways_of_study_lang3,createddate,modifieddate,course_history
anon_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1
cy2,Male,1988,Arabic,Arabic,Arabic,English,less than 1 year,1.0,Studied grammar;Worked in pairs/groups;Had a n...,,,0.0,other,,,0.0,other,2011-06-20 14:09:05,2011-06-20 14:11:31,845;846;847;871;872;927;928;931;949;950;1008;1...
br9,Female,1981,Chinese,Chinese,Chinese,English,more than 5 years,1.0,Studied grammar;Worked in pairs/groups;Studied...,,,0.0,other,,,0.0,other,2011-06-20 14:09:15,2011-06-20 14:12:02,868;869;870;871;872;947;951;953
cl5,Male,1987,Arabic,Arabic,Arabic;English,English,less than 1 year,1.0,Studied grammar;Studied vocabulary;Practiced s...,,,0.0,other,,,0.0,other,2011-06-20 14:09:23,2011-06-20 14:13:16,770;771;778;779;781;856;857;859;861;871;952;95...
de1,Male,1983,Arabic,Arabic,Arabic,English,more than 5 years,1.0,Studied grammar;Studied vocabulary;Teacher spo...,,,0.0,other,,,0.0,other,2011-06-20 14:09:27,2011-06-20 14:12:02,850;851;852;871;872;926;932;933;944;945;1008;1...
ap0,Male,1978,Japanese,Japanese,Japanese,English,more than 5 years,1.0,Studied grammar;Studied vocabulary;Listened to...,,,0.0,other,,,0.0,other,2011-06-20 14:09:33,2011-06-20 14:12:52,845;846;847;871;872
gu4,Male,1983,Arabic,Arabic,Arabic;English,Arabic,more than 5 years,0.0,Studied by myself;I lived in a country where t...,,,0.0,other,,,0.0,other,2011-06-20 14:09:34,2011-06-20 14:13:04,772;773;774;775;776;868;869;870;871;872;922;92...
hb0,Female,1980,Arabic,Arabic,Arabic,English,3-5 years,1.0,Studied grammar;Had a native-speaker teacher;T...,,,0.0,other,,,0.0,other,2011-06-20 14:09:38,2011-06-20 14:13:01,851;869;870;871;872;923;942;944;945;946;1008;1...
dp8,Male,1991,Arabic,Arabic;English,Arabic;English,English,1-2 years,1.0,Studied grammar;Worked in pairs/groups;Had a n...,,,0.0,other,,,0.0,other,2011-06-20 14:10:15,2011-06-20 14:13:57,868;869;870;871;872
bn6,Male,1986,Arabic,Arabic;English,Arabic;English,English,more than 5 years,1.0,Studied grammar;Studied vocabulary;Teacher spo...,,,0.0,other,,,0.0,other,2011-06-20 14:11:17,2011-06-20 14:15:51,860;861;862;871;872;930;947;948;949;951;998;99...
aq6,Female,1964,English,English,English,,,,,,,,,,,,,2012-09-14 14:05:38,2012-09-14 14:09:19,1114


In [4]:
#Remove anyone with 'English' or NaN as their native_language, i.e. not students

#First try to create filters

Englishfilter = S_info_df['native_language'] == 'English' #works
NaNfilter = S_info_df['native_language'].isnull() #'== np.nan' is always False, so use isnull() instead

fake_Ss = S_info_df.loc[Englishfilter] #works, but only covers the first filter
fake_Ss

#fake_Ss = S_info_df.loc[Englishfilter | NaNfilter] #combine boolean Series with '|', not 'or'
#fake_Ss


Unnamed: 0_level_0,gender,birth_year,native_language,language_used_at_home,language_used_at_home_now,non_native_language_1,yrs_of_study_lang1,study_in_classroom_lang1,ways_of_study_lang1,non_native_language_2,yrs_of_study_lang2,study_in_classroom_lang2,ways_of_study_lang2,non_native_language_3,yrs_of_study_lang3,study_in_classroom_lang3,ways_of_study_lang3,createddate,modifieddate,course_history
anon_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1
ez7,Male,1987,English,Arabic,Arabic;English,Arabic,more than 5 years,0.0,I lived in a country where they spoke Arabic,English,less than 1 year,1.0,Studied grammar;Studied vocabulary;Studied pro...,,,0.0,other,2007-02-20 10:05:39,2007-03-20 10:09:23,156;167;180;191;200;212;223;234;245;256
ay4,Female,1974,English,Korean,Korean,Korean,more than 5 years,1.0,Studied grammar;Had a native-speaker teacher;S...,,,0.0,other,,,0.0,other,2009-06-09 12:04:22,2009-11-13 12:43:36,509;515;516;517;560;571;574;601;622;642;645
aq6,Female,1964,English,English,English,,,,,,,,,,,,,2012-09-14 14:05:38,2012-09-14 14:09:19,1114
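
The NaN comparison fails because NaN never compares equal to anything, including itself, and Python's `or` cannot combine two boolean Series. Below is a minimal sketch of one way around both issues. It uses a small made-up frame (the rows and the name `toy_df` are invented for illustration, not taken from the ELI data):

```python
import numpy as np
import pandas as pd

# Toy stand-in for S_info_df; values are invented for illustration
toy_df = pd.DataFrame(
    {"native_language": ["Arabic", "English", np.nan, "Korean"]},
    index=pd.Index(["aa1", "bb2", "cc3", "dd4"], name="anon_id"),
)

english_filter = toy_df["native_language"] == "English"
nan_filter = toy_df["native_language"].isnull()  # '== np.nan' would be all False

# '|' combines the boolean Series element-wise; 'or' raises a ValueError
fake_Ss = toy_df.loc[english_filter | nan_filter]
real_Ss = toy_df.loc[~(english_filter | nan_filter)]

print(list(fake_Ss.index))
print(list(real_Ss.index))
```

The `~` operator inverts the combined filter, which is probably the form needed for actually pruning the non-students.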


### Student responses (answer_csv and answer_df) ###

In [5]:
#Process answer.csv file
answer_csv = cor_dir + "answer.csv"
answer_df = pd.read_csv(answer_csv, index_col = 'answer_id')

answer_df.head()
answer_df.tail(10)

Unnamed: 0_level_0,question_id,anon_id,user_file_id,text,directory,is_doublespaced,is_plagiarized,is_deleted
answer_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
1,5,eq0,7505,I met my friend Nife while I was studying in a...,,0,0,0
2,5,am8,7506,"Ten years ago, I met a women on the train betw...",,0,0,0
3,12,dk5,7507,In my country we usually don't use tea bags. F...,,0,0,0
4,13,dk5,7507,I organized the instructions by time.,,0,0,0
5,12,ad1,7508,"First, prepare a port, loose tea, and cup.\r\r...",,0,0,0


Unnamed: 0_level_0,question_id,anon_id,user_file_id,text,directory,is_doublespaced,is_plagiarized,is_deleted
answer_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
48411,6138,dv8,100847,Early Second Language Education\r\r\r\nSaudi A...,,1,0,0
48412,6138,ce1,100848,Publicly funded health care system\r\r\r\n\r\r...,,0,0,0
48413,6139,fo7,100911,Happiness is the most effective feeling in peo...,,1,0,0
48414,6139,fs9,100912,everyone want to play some games. some people ...,,1,0,0
48415,6139,cl7,100913,Playing a game is fun only when you win?\r\r\r...,,1,0,0
48416,6139,dr8,100914,Many people enjoy a game in their free time. B...,,1,0,0
48417,6137,fv1,100915,\r\r\r\n ...,,0,0,0
48418,6137,fo1,100916,Some patients are suffering from the...,,0,0,0
48419,6119,ge8,100917,My house looks amazing and modern. I decorated...,,0,0,0
48420,6027,ge8,100918,History and Geography a...,,0,0,0


### Course IDs ###
(should help with finding specific texts and linking other data frames)

In [6]:
#Process course.csv file
course_csv = cor_dir + "course.csv"
course_df = pd.read_csv(course_csv, index_col = 'course_id')

course_df.head()

Unnamed: 0_level_0,class_id,level_id,semester,section,course_description
course_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,1,2,2064,A,Reading Pre_Intermediate 2064 A
2,1,3,2064,B,Reading Low_Intermediate 2064 B
3,1,4,2064,M,Reading Intermediate 2064 M
4,1,4,2064,P,Reading Intermediate 2064 P
5,1,4,2064,Q,Reading Intermediate 2064 Q


###  user_file_internal ###
- big csv file with a lot of information
- should help with finding specific texts and linking other data frames
- includes file_type_id, course_id, and paths of text and wav files (i.e. all the spoken responses I need)


In [7]:
#Process user_file_internal.csv file
user_csv = cor_dir + "user_file_internal.csv"
user_df = pd.read_csv(user_csv, index_col = 'user_file_id')

user_df.head()

Unnamed: 0_level_0,anon_id,file_type_id,file_info_id,user_file_parent_id,course_id,session_id,document_id,activity,order_num,due_date,...,modifiedby,modifieddate,allow_submit_after_duedate,allow_multiple_accesses,allow_double_spacing,duration,pull_off_date,direction,grammar_qp_id,is_deleted
user_file_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,aj8,1,,,10,,,12,,2006-08-07 14:19:48,...,,,0,0,0,,,,,0
2,fg8,1,,,10,,,12,,2006-08-07 14:19:48,...,,,0,0,0,,,,,0
3,be0,1,,,10,,,12,,2006-08-07 14:19:48,...,,,0,0,0,,,,,0
4,fc4,1,,,10,,,12,,2006-08-07 14:19:48,...,,,0,0,0,,,,,0
5,fc4,1,,1.0,10,,,12,,2006-08-07 14:19:48,...,,,0,0,0,,,,,0


### Basic info about dataframes ###

The following information is an overview of the four dataframes/csv files currently being looked at:

#### S_info_df ####
Size:
- there are 920 entries, i.e. students (per the S_info_df.info() output), although at least 9 need to be removed once filters can be made to work
- 21 columns including info about languages spoken, personal data like age, and learning preferences
- Some columns will likely be removed if deemed unhelpful/unnecessary (e.g. 4th language spoken)
- Some data is normalized, e.g. years of study, but other fields were open-ended, resulting in very varied responses

Connection to other dataframes:
- link to answer_df is anon_id

Most useful columns for this project:
- anon_id (for linking to other df)
- L1, gender, time studying, age (for data analysis)  


#### answer_df ####
Size:
- there are 47175 'text' entries, i.e. student responses, out of 48384 total rows. The remaining rows (those with null texts) need to be removed, as without texts they serve no purpose
- 9 columns including info about the question, the answer, and characteristics of the text (like if it was plagiarized)

Connection to other dataframes:
- link to S_info_df is anon_id column; link to user_df is user_file_id column

Most useful columns for this project:
- answer_id (shorthand for the individual texts to be analyzed)
- text (the most important column so far) -> to be converted into tokens, bigrams, etc.  
- anon_id (for linking to other df)


#### course_df ####
Size:
- there are 1071 entries, i.e. one row for each course
- 6 columns including info about the course and class, both in terms of their assigned number and a description

Connection to other dataframes:
- link to user_df is course_id 

Most useful columns for this project:
- only really useful as a transition for linking to other df  


#### user_df ####
Size:
- there are 27134 rows (per the user_df.info() output), each with a file_id number. However, it is unclear how to use this information effectively.
- There are 29 columns, although many are not useful for this project
- A lot of the cells have no input
- Some columns will likely be removed if deemed unhelpful/unnecessary

Connection to other dataframes:
- link to course_df is course_id column

Most useful columns for this project:
- course_id (to link to other DF)
- file_type_id (for indicating the type of activity used in class)

In [8]:
S_info_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 920 entries, ez9 to aq6
Data columns (total 20 columns):
gender                       920 non-null object
birth_year                   920 non-null int64
native_language              920 non-null object
language_used_at_home        919 non-null object
language_used_at_home_now    860 non-null object
non_native_language_1        864 non-null object
yrs_of_study_lang1           869 non-null object
study_in_classroom_lang1     869 non-null float64
ways_of_study_lang1          869 non-null object
non_native_language_2        311 non-null object
yrs_of_study_lang2           314 non-null object
study_in_classroom_lang2     869 non-null float64
ways_of_study_lang2          869 non-null object
non_native_language_3        55 non-null object
yrs_of_study_lang3           59 non-null object
study_in_classroom_lang3     869 non-null float64
ways_of_study_lang3          869 non-null object
createddate                  920 non-null object
modifieddate    

In [9]:
answer_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 48384 entries, 1 to 48420
Data columns (total 8 columns):
question_id        48384 non-null int64
anon_id            48353 non-null object
user_file_id       48384 non-null int64
text               47175 non-null object
directory          14 non-null object
is_doublespaced    48384 non-null int64
is_plagiarized     48384 non-null int64
is_deleted         48384 non-null int64
dtypes: int64(5), object(3)
memory usage: 3.3+ MB


In [10]:
course_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1071 entries, 1 to 1123
Data columns (total 5 columns):
class_id              1071 non-null int64
level_id              1071 non-null int64
semester              1071 non-null int64
section               1071 non-null object
course_description    1058 non-null object
dtypes: int64(3), object(2)
memory usage: 50.2+ KB


In [11]:
user_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 27134 entries, 1 to 100918
Data columns (total 28 columns):
anon_id                       26922 non-null object
file_type_id                  27134 non-null int64
file_info_id                  2151 non-null float64
user_file_parent_id           25884 non-null float64
course_id                     27134 non-null int64
session_id                    26142 non-null float64
document_id                   1599 non-null float64
activity                      27134 non-null int64
order_num                     2722 non-null float64
due_date                      3286 non-null object
post_date                     3714 non-null object
assignment_name               2700 non-null object
version                       27134 non-null int64
directory                     0 non-null float64
filename                      0 non-null float64
content_text                  964 non-null object
createdby                     24955 non-null object
createddate        

### Creating find_stuff function ###

Goal: create a function that allows for easy retrieval of rows from the various dataframes.


In [12]:
#adapted from initial work of Brianna - thank you!

#this works to find all the course_id entries for a particular class type, in this case '3' which == speaking

def find_stuff(df, class_type):
    class_id = df.loc[df['class_id'] == class_type]
    return class_id

test = find_stuff(course_df, 3)
test.head()

Unnamed: 0_level_0,class_id,level_id,semester,section,course_description
course_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
13,3,2,2064,A,Speaking Pre_Intermediate 2064 A
14,3,3,2064,B,Speaking Low_Intermediate 2064 B
15,3,4,2064,M,Speaking Intermediate 2064 M
16,3,4,2064,P,Speaking Intermediate 2064 P
17,3,4,2064,Q,Speaking Intermediate 2064 Q


In [13]:
#test #2

test2 = find_stuff(course_df, 5)
test2.head()

Unnamed: 0_level_0,class_id,level_id,semester,section,course_description
course_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
25,5,2,2064,A,Grammar Pre_Intermediate 2064 A
26,5,3,2064,B,Grammar Low_Intermediate 2064 B
27,5,4,2064,M,Grammar Intermediate 2064 M
28,5,4,2064,P,Grammar Intermediate 2064 P
29,5,4,2064,Q,Grammar Intermediate 2064 Q


- Next step is to either expand on this function or create other similar ones to allow look up of other types of info
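
One possible way to expand on the function is to pass the column name as a parameter, so the same lookup works for any column and any dataframe. This is only a sketch: the name `find_rows` and the toy frame are hypothetical, and the values are invented for illustration.

```python
import pandas as pd

def find_rows(df, column, value):
    """Return the rows of df whose `column` equals `value`."""
    return df.loc[df[column] == value]

# Toy stand-in for course_df; values are invented for illustration
toy_course_df = pd.DataFrame(
    {"class_id": [1, 3, 3, 5], "level_id": [2, 2, 3, 4]},
    index=pd.Index([1, 13, 14, 27], name="course_id"),
)

speaking = find_rows(toy_course_df, "class_id", 3)  # same lookup as find_stuff
level_2 = find_rows(toy_course_df, "level_id", 2)   # now works for other columns too
print(speaking)
```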

### Tokenization of answers ###

Goal: tokenize the text in answer.csv to allow for further analysis (bigrams, lexical diversity, etc.)


In [14]:
#find column to tokenize

answer_df[['text']].head()

Unnamed: 0_level_0,text
answer_id,Unnamed: 1_level_1
1,I met my friend Nife while I was studying in a...
2,"Ten years ago, I met a women on the train betw..."
3,In my country we usually don't use tea bags. F...
4,I organized the instructions by time.
5,"First, prepare a port, loose tea, and cup.\r\r..."


In [16]:
#With the magic of stackoverflow, this seems to work: drop rows with no text (NaN), then tokenize the rest
answer_df = answer_df[answer_df['text'].notnull()].copy() #.copy() avoids a SettingWithCopyWarning when adding the column
answer_df['toks'] = answer_df['text'].apply(nltk.word_tokenize)

answer_df.head()

Unnamed: 0_level_0,question_id,anon_id,user_file_id,text,directory,is_doublespaced,is_plagiarized,is_deleted,toks
answer_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
1,5,eq0,7505,I met my friend Nife while I was studying in a...,,0,0,0,"[I, met, my, friend, Nife, while, I, was, stud..."
2,5,am8,7506,"Ten years ago, I met a women on the train betw...",,0,0,0,"[Ten, years, ago, ,, I, met, a, women, on, the..."
3,12,dk5,7507,In my country we usually don't use tea bags. F...,,0,0,0,"[In, my, country, we, usually, do, n't, use, t..."
4,13,dk5,7507,I organized the instructions by time.,,0,0,0,"[I, organized, the, instructions, by, time, .]"
5,12,ad1,7508,"First, prepare a port, loose tea, and cup.\r\r...",,0,0,0,"[First, ,, prepare, a, port, ,, loose, tea, ,,..."


### Bigrams ###

Goal: create a bigrams column from the toks column


In [17]:
#mini-test to make sure I am creating bigrams correctly

bigram_test = answer_df.toks[1]
bigram_test
list(nltk.bigrams(bigram_test))

#test works, let's try on dataframe

answer_df['bigrams'] = answer_df.toks.apply(lambda x: list(nltk.bigrams(x)))
answer_df.head()

['I', 'met', 'my', 'friend', 'Nife', 'while', 'I', 'was', 'studying', 'in', 'a', 'middle', 'school', '.', 'I', 'was', 'happy', 'when', 'I', 'met', 'him', 'because', 'he', 'was', 'a', 'good', 'student', 'in', 'our', 'school', '.', 'We', 'continued', 'the', 'middle', 'and', 'high', 'school', 'to', 'gather', 'in', 'the', 'same', 'school', '.', 'We', 'were', 'studying', 'in', 'the', 'different', 'classes', 'in', 'the', 'middle', 'school', ';', 'however', ',', 'in', 'the', 'high', 'school', 'we', 'were', 'studying', 'in', 'the', 'same', 'class', '.', 'We', 'went', 'to', 'many', 'places', 'in', 'the', 'free', 'time', 'while', 'we', 'were', 'studying', 'in', 'the', 'high', 'school', '.', 'When', 'we', 'finished', 'from', 'the', 'high', 'school', ',', 'I', 'went', 'to', 'K.S', 'University', 'and', 'he', 'went', 'to', 'I.M', 'University', '.', 'While', 'we', 'were', 'enjoying', 'in', 'academic', 'life', ',', 'we', 'made', 'many', 'achievement', 'in', 'these', 'universities', '.', 'I', 'graduate

[('I', 'met'), ('met', 'my'), ('my', 'friend'), ('friend', 'Nife'), ('Nife', 'while'), ('while', 'I'), ('I', 'was'), ('was', 'studying'), ('studying', 'in'), ('in', 'a'), ('a', 'middle'), ('middle', 'school'), ('school', '.'), ('.', 'I'), ('I', 'was'), ('was', 'happy'), ('happy', 'when'), ('when', 'I'), ('I', 'met'), ('met', 'him'), ('him', 'because'), ('because', 'he'), ('he', 'was'), ('was', 'a'), ('a', 'good'), ('good', 'student'), ('student', 'in'), ('in', 'our'), ('our', 'school'), ('school', '.'), ('.', 'We'), ('We', 'continued'), ('continued', 'the'), ('the', 'middle'), ('middle', 'and'), ('and', 'high'), ('high', 'school'), ('school', 'to'), ('to', 'gather'), ('gather', 'in'), ('in', 'the'), ('the', 'same'), ('same', 'school'), ('school', '.'), ('.', 'We'), ('We', 'were'), ('were', 'studying'), ('studying', 'in'), ('in', 'the'), ('the', 'different'), ('different', 'classes'), ('classes', 'in'), ('in', 'the'), ('the', 'middle'), ('middle', 'school'), ('school', ';'), (';', 'howe

Unnamed: 0_level_0,question_id,anon_id,user_file_id,text,directory,is_doublespaced,is_plagiarized,is_deleted,toks,bigrams
answer_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
1,5,eq0,7505,I met my friend Nife while I was studying in a...,,0,0,0,"[I, met, my, friend, Nife, while, I, was, stud...","[(I, met), (met, my), (my, friend), (friend, N..."
2,5,am8,7506,"Ten years ago, I met a women on the train betw...",,0,0,0,"[Ten, years, ago, ,, I, met, a, women, on, the...","[(Ten, years), (years, ago), (ago, ,), (,, I),..."
3,12,dk5,7507,In my country we usually don't use tea bags. F...,,0,0,0,"[In, my, country, we, usually, do, n't, use, t...","[(In, my), (my, country), (country, we), (we, ..."
4,13,dk5,7507,I organized the instructions by time.,,0,0,0,"[I, organized, the, instructions, by, time, .]","[(I, organized), (organized, the), (the, instr..."
5,12,ad1,7508,"First, prepare a port, loose tea, and cup.\r\r...",,0,0,0,"[First, ,, prepare, a, port, ,, loose, tea, ,,...","[(First, ,), (,, prepare), (prepare, a), (a, p..."


### Create frequency dictionary for entire corpus ###

Attempting to create frequency dictionary for all toks

In [18]:
testdict = nltk.FreqDist(answer_df.toks[1])
testdict
#looks ok, now to apply to the whole column

FreqDist({'in': 15, '.': 12, 'the': 11, 'I': 10, 'school': 9, 'to': 6, ',': 6, 'was': 5, 'studying': 5, 'high': 5, ...})

In [19]:
fdict = answer_df.toks.apply(lambda x: nltk.FreqDist(x))
fdict.head()
#haha they are mini-dicts for each text rather than for the dataframe as a whole

answer_id
1    {'I': 10, 'met': 2, 'my': 4, 'friend': 2, 'Nif...
2    {'Ten': 1, 'years': 1, 'ago': 1, ',': 8, 'I': ...
3    {'In': 1, 'my': 1, 'country': 1, 'we': 5, 'usu...
4    {'I': 1, 'organized': 1, 'the': 1, 'instructio...
5    {'First': 1, ',': 9, 'prepare': 1, 'a': 2, 'po...
Name: toks, dtype: object

In [20]:
answer_corpus = ' '.join(answer_df['text'])
answer_corpus[:100]
answer_corpus_tok = nltk.word_tokenize(answer_corpus)
answer_corpus_tok[:20]

#probably not the most efficient way but it seems to have worked at least for tokenizing whole corpus.

'I met my friend Nife while I was studying in a middle school. I was happy when I met him because he '

['I', 'met', 'my', 'friend', 'Nife', 'while', 'I', 'was', 'studying', 'in', 'a', 'middle', 'school', '.', 'I', 'was', 'happy', 'when', 'I', 'met']

In [21]:
answer_dict = nltk.FreqDist(answer_corpus_tok)
answer_dict

#success!

FreqDist({'.': 264755, ',': 218149, 'the': 171927, 'to': 133262, 'and': 105988, 'I': 93236, 'a': 89283, 'of': 88552, 'in': 77170, 'is': 75659, ...})
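
A possible alternative that avoids joining and re-tokenizing the whole corpus is to accumulate counts from the token lists that already exist in the toks column. The sketch below uses `collections.Counter` and invented token lists so that it runs on its own; with the real data, `nltk.FreqDist` (which, as far as I know, subclasses `Counter`) and `answer_df.toks` could presumably be swapped in.

```python
from collections import Counter

# Toy token lists standing in for answer_df['toks']; invented for illustration
toy_toks = [
    ["I", "met", "my", "friend", "."],
    ["I", "organized", "the", "instructions", "."],
]

corpus_counts = Counter()
for toks in toy_toks:
    corpus_counts.update(toks)  # add this text's token counts to the running total

print(corpus_counts.most_common(2))
```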

### Create frequency dictionary for bigrams of entire corpus ###

Attempting to create frequency dictionary for all bigrams

In [22]:
#Let's try to do this from the answer_corpus_tok

answer_corpus_bigrams = list(nltk.bigrams(answer_corpus_tok))
answer_corpus_bigrams[:10]

[('I', 'met'), ('met', 'my'), ('my', 'friend'), ('friend', 'Nife'), ('Nife', 'while'), ('while', 'I'), ('I', 'was'), ('was', 'studying'), ('studying', 'in'), ('in', 'a')]

In [23]:
#Ok, now time for the dictionary
answer_bigram_dict = nltk.FreqDist(answer_corpus_bigrams)
answer_bigram_dict

#success!

FreqDist({('.', 'I'): 24177, (',', 'I'): 21399, ('in', 'the'): 18669, ('.', 'The'): 17403, (',', 'and'): 16701, ('of', 'the'): 15011, ('.', 'In'): 13393, (',', 'the'): 12288, ('.', 'It'): 9348, ('to', 'the'): 8553, ...})
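
One thing to keep in mind: because answer_corpus_tok is a single long list, nltk.bigrams will also pair the last token of one answer with the first token of the next. The sketch below counts bigrams per text instead, so that no pair spans two answers. It uses plain Python (`zip` and `collections.Counter`) and invented token lists rather than the notebook's own objects.

```python
from collections import Counter

# Toy token lists standing in for answer_df['toks']; invented for illustration
toy_toks = [
    ["I", "met", "him", "."],
    ["I", "met", "her", "."],
]

bigram_counts = Counter()
for toks in toy_toks:
    # zip(toks, toks[1:]) yields adjacent pairs within one text only
    bigram_counts.update(zip(toks, toks[1:]))

print(bigram_counts[("I", "met")])  # pair occurring inside both texts
print(bigram_counts[(".", "I")])    # pair that would only arise across the text boundary
```

Whether the cross-boundary pairs matter in practice depends on how short the answers are; for long essays the effect is probably small.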

## After Progress-report 2

The following is everything that has been completed since Progress Report 2.  See progress_report.MD for details.

### Next goals:
Create another DF called bigrams_df with bigrams, MI scores, occurrences per million scores, and perhaps more to be added later. To do so:  
1) Create function for calculating MI  
2) Create function for calculating occurrences per million for unigrams and bigrams  
3) Apply the MI formula for pairs of words in the bigram list and create a column in the new DF  
4) Apply the occurrences per million for bigrams and create a column in the new DF  
5) Create a column showing percentage of time the bigrams are used by the three proficiency levels  
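
As a rough sketch of goal 2, occurrences per million is the raw count scaled by the corpus size. The function name `per_million` and the numbers are hypothetical, chosen only to show the arithmetic.

```python
def per_million(count, corpus_size):
    """Normalize a raw frequency to occurrences per million tokens."""
    return count * 1_000_000 / corpus_size

# Invented numbers for illustration: 50 hits in a 200,000-token corpus
print(per_million(50, 200_000))
```

For unigrams the corpus size would be the total number of tokens, and for bigrams the total number of bigrams.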


### Calculating Mutual Information (MI)

(from https://corpus.byu.edu/mutualInformation.asp)  

Mutual Information is calculated as follows:  
MI = log ( (AB * sizeCorpus) / (A * B * span) ) / log (2)  

Suppose we are calculating the MI for the collocate color near purple in BYU-BNC.  

A = frequency of node word (e.g. purple): 1262  
B = frequency of collocate (e.g. color): 115  
AB = frequency of collocate near the node word (e.g. color near purple): 24  
sizeCorpus= size of corpus (# words; in this case the BNC): 96,263,399  
span = span of words (e.g. 3 to left and 3 to right of node word): 6  
log (2) is literally the log10 of the number 2: .30103  

MI = 11.37 = log ( (24 * 96,263,399) / (1262 * 115 * 6) ) / .30103  
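
The worked example above can be checked with a few lines of Python. This is just the quoted formula written out, using `math.log2` in place of dividing by log10(2); the function name `mi_byu` is hypothetical.

```python
import math

def mi_byu(ab, a, b, size_corpus, span):
    """MI as given in the formula above: log2((AB * sizeCorpus) / (A * B * span))."""
    return math.log2((ab * size_corpus) / (a * b * span))

# Values from the 'color' near 'purple' example
mi = mi_byu(ab=24, a=1262, b=115, size_corpus=96_263_399, span=6)
print(round(mi, 2))  # 11.37
```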

In [24]:
#Found something called 'Pointwise Mutual Information' - I believe it is what I am looking for.

import math

def MI(word1, word2):
    #pointwise MI (log base 2) for an adjacent pair, using the corpus-wide unigram and bigram counts
    #note: raises an error if the pair never occurs in the corpus
    total_toks = float(sum(answer_dict.values()))
    total_bigrams = float(sum(answer_bigram_dict.values()))
    prob_word1 = answer_dict[word1] / total_toks
    prob_word2 = answer_dict[word2] / total_toks
    prob_word1_word2 = answer_bigram_dict[word1, word2] / total_bigrams
    return math.log(prob_word1_word2 / (prob_word1 * prob_word2), 2)

In [25]:
#something I imagine has an average MI
answer_bigram_dict['young', 'people']
answer_dict['young']
answer_dict['people']

#Yes - 'young' collocates strongly with 'people' (469 of 1605 occurrences, about 29% of the time) but 'people' doesn't collocate strongly with 'young'

469

1605

24516

In [26]:
MI('young','people')

#That is in the standard range for an MI score

5.840354713355728

In [27]:
#Time to try one that shouldn't have as high MI, e.g. 'man' with 'the'

answer_bigram_dict['the', 'man']
answer_dict['the']
answer_dict['man']

MI('the', 'man')

#With a smoothing of MI3, this would not show up on collocation lists (a good thing)

254

171927

1547

2.1986947748534735

### Creating combined dataframe for easier analysis and viewing
- joins answer_df, user_df, and course_df
- removes unnecessary columns
- narrows results down to only answers from writing classes and first versions of their work

In [28]:
#join answer_df and user_df along 'user_file_id' column
combo_df = answer_df.join(user_df, on='user_file_id', lsuffix='user_file_id')

#now join this new df with course_df along 'course_id' column
combo_df = combo_df.join(course_df, on='course_id', lsuffix='user_file_id')
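
To make the effect of `lsuffix` clearer: when both frames share a column name, `join` appends the suffix to the left frame's copy, which is where names like 'anon_iduser_file_id' come from. Below is a small sketch on made-up frames (all values, and the suffix '_answer', are invented for illustration):

```python
import pandas as pd

# Toy stand-ins for answer_df and user_df; values are invented for illustration
answers = pd.DataFrame(
    {"anon_id": ["aa1", "bb2"], "user_file_id": [101, 102], "text": ["a", "b"]},
    index=pd.Index([1, 2], name="answer_id"),
)
files = pd.DataFrame(
    {"anon_id": ["aa1", "bb2"], "course_id": [10, 11]},
    index=pd.Index([101, 102], name="user_file_id"),
)

# 'on' names the left column to match against the right frame's index;
# the overlapping column 'anon_id' gets the suffix on the left side only
joined = answers.join(files, on="user_file_id", lsuffix="_answer")
print(list(joined.columns))
```

A more descriptive suffix than 'user_file_id' might make the resulting column names easier to read, though that would mean updating the column lists in the cells that follow.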

In [29]:
#Dropping unnecessary columns (there are a lot); duplicate names removed from the list
combo_df = combo_df.drop(['directoryuser_file_id', 'is_doublespaced', 'is_plagiarized', 'is_deleteduser_file_id',
                          'modifiedby', 'modifieddate', 'allow_submit_after_duedate', 'anon_id', 'file_type_id',
                          'file_info_id', 'user_file_parent_id', 'createdby', 'session_id',
                          'document_id', 'filename', 'content_text', 'createddate', 'allow_multiple_accesses',
                          'activity', 'order_num', 'due_date', 'post_date', 'assignment_name', 'directory',
                          'semester', 'allow_double_spacing', 'duration', 'pull_off_date', 'direction',
                          'grammar_qp_id', 'is_deleted', 'section', 'course_description'], axis = 1)

In [30]:
#keeping only 1st versions of students' work
combo_df = combo_df.loc[combo_df['version'] == 1]

#'version' column now unnecessary
combo_df = combo_df.drop(['version'], axis = 1)

In [31]:
#keeping only answers from writing classes (class_id = 2)
combo_df = combo_df.loc[combo_df['class_id'] == 2]

#'class_id' column now unnecessary
combo_df = combo_df.drop(['class_id'], axis = 1)

In [32]:
#just change the order of columns to something more logical and rename some columns
combo_df = combo_df[['question_id','user_file_id', 'anon_iduser_file_id', 'level_id', 'course_id', 'text', 'toks', 'bigrams']]
combo_df.rename(columns={'anon_iduser_file_id':'anon_id'}, inplace=True)

#finished result =  much cleaner
combo_df.head()

| answer_id | question_id | user_file_id | anon_id | level_id | course_id | text | toks | bigrams |
|---|---|---|---|---|---|---|---|---|
| 3 | 12 | 7507 | dk5 | 4 | 115 | In my country we usually don't use tea bags. F... | [In, my, country, we, usually, do, n't, use, t... | [(In, my), (my, country), (country, we), (we, ... |
| 4 | 13 | 7507 | dk5 | 4 | 115 | I organized the instructions by time. | [I, organized, the, instructions, by, time, .] | [(I, organized), (organized, the), (the, instr... |
| 5 | 12 | 7508 | ad1 | 4 | 115 | First, prepare a port, loose tea, and cup.\r\r... | [First, ,, prepare, a, port, ,, loose, tea, ,,... | [(First, ,), (,, prepare), (prepare, a), (a, p... |
| 6 | 13 | 7508 | ad1 | 4 | 115 | By time | [By, time] | [(By, time)] |
| 7 | 12 | 7509 | eg5 | 4 | 115 | First, prepare your cup, loose tea or bag tea,... | [First, ,, prepare, your, cup, ,, loose, tea, ... | [(First, ,), (,, prepare), (prepare, your), (y... |


### Create function for calculating occurrences per million for unigrams and bigrams  

Formula:

FN = (FO * 1,000,000) / C

where:
- FN = normalized frequency (occurrences per million)
- FO = observed frequency
- C = corpus size

In [53]:
#create new freq dicts for combo_df (unigrams and bigrams) using same 
#code as earlier versions with answer_df

combo_corpus = ' '.join(combo_df['text'])
combo_corpus_tok = nltk.word_tokenize(combo_corpus)
combo_unigram_dict = nltk.FreqDist(combo_corpus_tok)

combo_corpus_bigrams = list(nltk.bigrams(combo_corpus_tok))
combo_bigram_dict = nltk.FreqDist(combo_corpus_bigrams)

In [75]:
#total number of unigrams
total_unigrams = len(combo_corpus_tok)

#total number of bigrams
total_bigrams = len(combo_corpus_bigrams)

total_unigrams
total_bigrams

#these differ by one, as the number of bigrams is always the number of unigrams - 1
#(the first token has no preceding token to pair with)

2553650

2553649
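
Because the frequency dictionaries are built from one long token list, the `' '.join(...)` step also produces bigrams that span the end of one answer and the start of the next. The sketch below illustrates this with two made-up answers and shows one possible alternative that counts bigrams answer by answer. `collections.Counter` and `str.split()` are stand-ins for `nltk.FreqDist` and `nltk.word_tokenize`, and all names are hypothetical.

```python
from collections import Counter

# Two hypothetical answers; str.split() stands in for nltk.word_tokenize
answers = ["I like tea", "Tea is good"]

# Approach used in the notebook: join everything, then tokenize
joined_tokens = " ".join(answers).split()
joined_bigrams = list(zip(joined_tokens, joined_tokens[1:]))

# n tokens always give n - 1 bigrams
print(len(joined_tokens), len(joined_bigrams))

# The join creates one bigram that spans two different answers
print(('tea', 'Tea') in joined_bigrams)

# Alternative: build bigrams per answer so none cross an answer boundary
per_answer_bigrams = Counter()
for answer in answers:
    toks = answer.split()
    per_answer_bigrams.update(zip(toks, toks[1:]))

print(sum(per_answer_bigrams.values()))
```

Whether cross-answer bigrams matter depends on the analysis; with long answers they are a small fraction of the total, but they are not genuine collocations.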

In [68]:
#create function where you enter the unigram and it tells 
#you the frequency in the corpus per million tokens

def unigram_per_M(unigram):
   return (combo_unigram_dict[unigram]*1000000) / total_unigrams

In [67]:
#test manually and with defined function
combo_unigram_dict['the']

(97163*1000000)/2553650
unigram_per_M('the')

97163

38048.675425371526

38048.675425371526

In [93]:
#create function where you enter the bigram and it tells 
#you the frequency in the corpus per million bigrams

def bigram_per_M(word1, word2):
   return (combo_bigram_dict[word1, word2]*1000000) / total_bigrams

In [95]:
#test manually and with defined function
combo_bigram_dict['the', 'man']

(75*1000000)/2553649
bigram_per_M('the', 'man')

75

29.369737187843747

29.369737187843747
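
Since both functions apply the same formula, they could also be expressed as a single helper that takes the observed count and the corpus size. A minimal sketch (the name `per_million` is hypothetical), checked against the two results reported above:

```python
import math

def per_million(observed, corpus_size):
    """Normalized frequency: occurrences per million items in the corpus."""
    return observed * 1000000 / corpus_size

# Counts reported in the cells above
the_per_m = per_million(97163, 2553650)     # unigram 'the'
the_man_per_m = per_million(75, 2553649)    # bigram ('the', 'man')

print(round(the_per_m, 2), round(the_man_per_m, 2))
```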

### Create a bigram_df showing relevant info based on above formulas
- columns for this dataframe:
    - default index
    - bigrams
    - MI scores
    - occurrences per million
    - normalized percentage used at each proficiency level

In [190]:
bigram_df = pd.DataFrame.from_dict(combo_bigram_dict,orient='index')
bigram_df = bigram_df.reset_index()
bigram_df = bigram_df.rename(columns = {0:'tokens', 'index': 'bigram'})
bigram_df.head()

#first two bullet points complete - now to add more columns

|  | bigram | tokens |
|---|---|---|
| 0 | (In, my) | 808 |
| 1 | (my, country) | 825 |
| 2 | (country, we) | 17 |
| 3 | (we, usually) | 49 |
| 4 | (usually, do) | 55 |


In [210]:
#Changing bigram tuples to lists for easier manipulation
bigram_df['bigram'] = [list(x) for x in bigram_df['bigram']]

#### Creating MI column

In [211]:
#New MI calculator based on new dictionary

def MI(word1, word2):
  prob_word1 = combo_unigram_dict[word1] / float(sum(combo_unigram_dict.values()))
  prob_word2 = combo_unigram_dict[word2] / float(sum(combo_unigram_dict.values()))
  prob_word1_word2 = combo_bigram_dict[word1, word2] / float(sum(combo_bigram_dict.values()))
  return math.log(prob_word1_word2/float(prob_word1*prob_word2),2)

In [212]:
test = bigram_df['bigram'].iloc[0]
MI(test[0], test[1])

#it works on one cell, so theoretically should work on all...

4.052017460488083

In [213]:
bigram_df['MI'] = [MI(x[0], x[1]) for x in bigram_df['bigram']]

#it took a few hours to run it, but it worked!
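
Part of the long run time is probably due to `MI` calling `sum(...)` over the full dictionaries three times for every bigram. One way to avoid this, sketched here on toy data, is to compute the totals once outside the function. The corpus and the names `mi_fast` and `mi_slow` are hypothetical, and `Counter` stands in for `nltk.FreqDist`.

```python
import math
from collections import Counter

# Toy corpus (hypothetical); str.split() stands in for nltk.word_tokenize
tokens = "in my country we usually drink tea in my country".split()
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

# Totals computed once instead of on every call
total_unigrams = sum(unigram_counts.values())
total_bigrams = sum(bigram_counts.values())

def mi_fast(word1, word2):
    """Same formula as MI, using the precomputed totals."""
    p1 = unigram_counts[word1] / total_unigrams
    p2 = unigram_counts[word2] / total_unigrams
    p12 = bigram_counts[word1, word2] / total_bigrams
    return math.log2(p12 / (p1 * p2))

def mi_slow(word1, word2):
    """Reference version that re-sums the dictionaries on each call."""
    p1 = unigram_counts[word1] / sum(unigram_counts.values())
    p2 = unigram_counts[word2] / sum(unigram_counts.values())
    p12 = bigram_counts[word1, word2] / sum(bigram_counts.values())
    return math.log2(p12 / (p1 * p2))

# Scores for every observed bigram, rounded to two decimals
scores = {bigram: round(mi_fast(*bigram), 2) for bigram in bigram_counts}
print(scores['my', 'country'])
```

Both versions return the same values; the difference is only that the fast one does a constant amount of work per bigram rather than work proportional to the size of the dictionaries.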

In [227]:
bigram_df['MI'] = bigram_df['MI'].round(2)
bigram_df.head()

|  | bigram | tokens | MI |
|---|---|---|---|
| 0 | [In, my] | 808 | 4.05 |
| 1 | [my, country] | 825 | 5.59 |
| 2 | [country, we] | 17 | 0.66 |
| 3 | [we, usually] | 49 | 3.17 |
| 4 | [usually, do] | 55 | 3.38 |


#### Creating per_million column

In [214]:
#testing on one cell first
bigram_per_M(test[0], test[1])

316.4099686370366

In [228]:
bigram_df['per_million'] = [bigram_per_M(x[0], x[1]) for x in bigram_df['bigram']]

In [230]:
bigram_df['per_million'] = bigram_df['per_million'].round(2)
bigram_df.head()

|  | bigram | tokens | MI | per_million |
|---|---|---|---|---|
| 0 | [In, my] | 808 | 4.05 | 316.41 |
| 1 | [my, country] | 825 | 5.59 | 323.07 |
| 2 | [country, we] | 17 | 0.66 | 6.66 |
| 3 | [we, usually] | 49 | 3.17 | 19.19 |
| 4 | [usually, do] | 55 | 3.38 | 21.54 |


#### Creating %\_per_level column 

In [248]:
#create level dataframes
level_3 = combo_df.loc[combo_df['level_id'] == 3, :] 
level_4 = combo_df.loc[combo_df['level_id'] == 4, :] 
level_5 = combo_df.loc[combo_df['level_id'] == 5, :] 

#create frequency dictionaries for each level
level_3_corpus = ' '.join(level_3['text'])
level_3_tok = nltk.word_tokenize(level_3_corpus)
level_3_bigrams = list(nltk.bigrams(level_3_tok))
level_3_bigram_dict = nltk.FreqDist(level_3_bigrams)

level_4_corpus = ' '.join(level_4['text'])
level_4_tok = nltk.word_tokenize(level_4_corpus)
level_4_bigrams = list(nltk.bigrams(level_4_tok))
level_4_bigram_dict = nltk.FreqDist(level_4_bigrams)

level_5_corpus = ' '.join(level_5['text'])
level_5_tok = nltk.word_tokenize(level_5_corpus)
level_5_bigrams = list(nltk.bigrams(level_5_tok))
level_5_bigram_dict = nltk.FreqDist(level_5_bigrams)
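
The three near-identical blocks above could also be written as one helper called once per level. A minimal sketch, assuming a list of hypothetical `(level_id, text)` rows in place of `combo_df`, with `Counter` and `str.split()` standing in for `nltk.FreqDist` and `nltk.word_tokenize`:

```python
from collections import Counter

# Hypothetical (level_id, text) rows standing in for combo_df
rows = [
    (3, "I like tea"),
    (4, "In my country we drink tea"),
    (5, "In my country tea is popular"),
    (4, "I like my country"),
]

def bigram_counts_for_level(rows, level):
    """Count bigrams in all texts belonging to one level."""
    corpus = " ".join(text for lvl, text in rows if lvl == level)
    toks = corpus.split()   # stand-in for nltk.word_tokenize
    return Counter(zip(toks, toks[1:]))

# One frequency dictionary per level, keyed by level_id
level_bigram_dicts = {
    level: bigram_counts_for_level(rows, level) for level in (3, 4, 5)
}

print(level_bigram_dicts[4]['my', 'country'])
```

Keeping the dictionaries in a single mapping keyed by level makes it easier to add or remove levels later without copying code.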

In [257]:
#test to see what I want in each cell in the level_3 column
level_3_bigram_dict #I need the values from this dictionary divided by the value from
combo_bigram_dict #this dictionary

#for example
level_3_bigram_dict['in', 'the'] / combo_bigram_dict['in', 'the'] 

#or better yet as a percentage
"{0:.2f}%".format(level_3_bigram_dict['in', 'the'] / combo_bigram_dict['in', 'the'] * 100)

#totals for all 3 levels should add up to about 100% (here 99.85%, as combo_df still includes some answers from other levels)
"{0:.2f}%".format(level_3_bigram_dict['in', 'the'] / combo_bigram_dict['in', 'the'] * 100)
"{0:.2f}%".format(level_4_bigram_dict['in', 'the'] / combo_bigram_dict['in', 'the'] * 100)
"{0:.2f}%".format(level_5_bigram_dict['in', 'the'] / combo_bigram_dict['in', 'the'] * 100)

FreqDist({('.', 'I'): 1708, (',', 'I'): 1467, ('.', 'The'): 1254, ('in', 'the'): 1209, (',', 'and'): 1119, ('of', 'the'): 894, ('.', 'In'): 890, (',', 'the'): 785, ('is', 'a'): 729, (',', 'you'): 709, ...})

FreqDist({('in', 'the'): 10087, (',', 'and'): 9824, ('.', 'The'): 9741, ('of', 'the'): 9030, ('.', 'In'): 8533, (',', 'I'): 8147, (',', 'the'): 8050, ('.', 'I'): 7541, ('.', 'It'): 5106, ('.', 'For'): 4660, ...})

0.11985724199464658

'11.99%'

'11.99%'

'40.71%'

'47.15%'

In [296]:
#also necessary to normalize as different number of responses at each level

#weighting for each level
level_3_percent = len(level_3.index) / len(combo_df.index)
level_4_percent = len(level_4.index) / len(combo_df.index)
level_5_percent = len(level_5.index) / len(combo_df.index)

level_3_percent
level_4_percent
level_5_percent

0.2671149144254279

0.3767573349633252

0.3266350855745721

In [297]:
#example of normalizing with ['in', 'the'] bigram

#normalized number
level_3_bigram_dict['in', 'the'] * (level_3_percent/(1/3)) 

#applied 
level_3_bigram_dict['in', 'the'] * (level_3_percent/(1/3)) / combo_bigram_dict['in', 'the']

#as a percent
"{0:.2f}%".format(level_3_bigram_dict['in', 'the'] * (level_3_percent/(1/3)) / combo_bigram_dict['in', 'the']*100)

968.825794621027

0.09604697081600347

'9.60%'

In [312]:
#create a function for the above
def norm_percent_level3(word1, word2):
    return "{0:.2f}%".format(level_3_bigram_dict[word1, word2] * (level_3_percent/(1/3)) / combo_bigram_dict[word1, word2]*100)

def norm_percent_level4(word1, word2):
    return "{0:.2f}%".format(level_4_bigram_dict[word1, word2] * (level_4_percent/(1/3)) / combo_bigram_dict[word1, word2]*100)

def norm_percent_level5(word1, word2):
    return "{0:.2f}%".format(level_5_bigram_dict[word1, word2] * (level_5_percent/(1/3)) / combo_bigram_dict[word1, word2]*100)

norm_percent_level3('in', 'the')
norm_percent_level4('in', 'the')
norm_percent_level5('in', 'the')

'9.60%'

'46.01%'

'46.20%'
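
Note that multiplying by `level_percent/(1/3)` shrinks the counts of levels with a below-average share of responses and inflates those with an above-average share, and the three normalized figures for ('in', 'the') sum to 101.81% rather than 100%. One possible alternative, sketched here with made-up numbers, is to compare each level's rate of use (count divided by level size) and express each rate as a share of the three rates combined. The dictionaries `bigram_counts_by_level` and `tokens_by_level` are hypothetical, and measuring level size in tokens rather than responses is an assumption.

```python
# Hypothetical counts of one bigram at each level, and each level's size in tokens
bigram_counts_by_level = {3: 10, 4: 40, 5: 50}
tokens_by_level = {3: 100000, 4: 400000, 5: 500000}

# Rate of use within each level (occurrences per token)
rates = {
    lvl: bigram_counts_by_level[lvl] / tokens_by_level[lvl]
    for lvl in bigram_counts_by_level
}

# Each level's share of the summed rates; the shares add up to 100%
total_rate = sum(rates.values())
normalized = {lvl: rates[lvl] / total_rate * 100 for lvl in rates}

for lvl in sorted(normalized):
    print(lvl, "{0:.2f}%".format(normalized[lvl]))
```

In this toy case the raw counts differ fivefold, but every level uses the bigram at the same rate, so each receives an equal share.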

In [316]:
#Let's see what it looks like if applied to the whole dataframe

bigram_df['level_3'] = [norm_percent_level3(x[0], x[1]) for x in bigram_df['bigram']]
bigram_df['level_4'] = [norm_percent_level4(x[0], x[1]) for x in bigram_df['bigram']]
bigram_df['level_5'] = [norm_percent_level5(x[0], x[1]) for x in bigram_df['bigram']]

bigram_df.head()

|  | bigram | tokens | MI | per_million | level_3 | level_4 | level_5 |
|---|---|---|---|---|---|---|---|
| 0 | [In, my] | 808 | 4.05 | 316.41 | 9.32% | 64.77% | 30.32% |
| 1 | [my, country] | 825 | 5.59 | 323.07 | 14.18% | 51.65% | 35.40% |
| 2 | [country, we] | 17 | 0.66 | 6.66 | 4.71% | 79.78% | 23.06% |
| 3 | [we, usually] | 49 | 3.17 | 19.19 | 9.81% | 78.43% | 18.00% |
| 4 | [usually, do] | 55 | 3.38 | 21.54 | 2.91% | 63.71% | 35.63% |
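
One practical point for later sorting and modelling: the level columns hold formatted strings such as '9.32%', which sort alphabetically rather than numerically. A small sketch of converting them to floats, using the level_3 values from the table above (the frame `demo_df` and the column `level_3_num` are hypothetical names):

```python
import pandas as pd

# First rows of the table above (level_3 column only)
demo_df = pd.DataFrame({
    'bigram': [['In', 'my'], ['my', 'country'], ['country', 'we'],
               ['we', 'usually'], ['usually', 'do']],
    'level_3': ['9.32%', '14.18%', '4.71%', '9.81%', '2.91%'],
})

# String sort: '14.18%' comes first because the character '1' sorts before '2'
string_order = demo_df.sort_values('level_3')['level_3'].tolist()
print(string_order)

# Numeric version of the column for sorting and modelling
demo_df['level_3_num'] = demo_df['level_3'].str.rstrip('%').astype(float)
numeric_order = demo_df.sort_values('level_3_num')['level_3_num'].tolist()
print(numeric_order)
```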


#### A lot of work for a very small final dataframe, but at least it should be usable for machine analysis and future research.

### Next goals (for final submission of code):  
<br>
_Final analysis touch ups_:
- Deal with capitalization issues skewing data
- Remove levels from combo_df other than 3, 4, 5 (easy to do but need time to re-run whole script afterwards)


_Machine learning_:
- Predict level based on bigram frequency (types and tokens)
- Predict level based on MI of bigrams used 


_Visualizations_:
- Create visualizations (heat maps for predictions and bar graphs for observed stats)
- Sort bigram_df in different orders to produce tables of common bigrams
- Tidy up notebook / add descriptive detail