# Project_Code 2: Continued clean up and analysis of ELI Data #
## Ben Naismith ##

### Changes since 'Project_Code1' ###

This new document has been created as a number of significant changes have been made to the original code. Based on discussions with other members of the ELI Data Mining Group, the following points were determined:

- For the sake of efficiency, it is better not to merge the different data frames into one big one
- A 'sanitization' step of the data was completed which duplicated some of the steps of my initial code. These duplications include removing unwanted apostrophes, changing all 'null' and 'ull' to NaN, and removing empty or unreal students (who were most likely teachers). As such, the dataset is now ready for more in-depth cleaning and analysis, i.e. the purpose of this notebook. The code for the sanitization step is in 'convert_0_to_1.ipynb' in the ELI Data Mining Group's private repository.

### Data Sharing Plan ###

The full ELI data set (see project_plan.md) is private at this time. Below is a workbook with the current code for organizing and cleaning that data. In order to see how the code works, snippets of data have been displayed throughout.

This notebook will continue to be updated until the project is ready, at which point a sample of raw data, e.g. a CSV of 1000 answers, will be included in the repository so that others can test and reproduce the code. The exact method for sampling will be determined once the initial code is complete, as it is necessary to first have cleaner data before it can be sampled; at present, sampling results in errors due to false students, entries, etc. which cannot be linked to the appropriate CSV files.

Ultimately, it is the intention of the dataset's authors for the entire dataset to be made public, with a CC license. Please see the LICENSE.md for details.

### Initial setup ###

In [2]:
#Import necessary modules
import numpy as np
import pandas as pd
import nltk
import glob
import matplotlib.pyplot as plt

%pprint #turn off pretty printing

#return every shell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

#Create short-hand for directory root
cor_dir = "/Users/Benjamin's/Documents/ELI_Data_Mining/Data-Archive/1_sanitized/"

Pretty printing has been turned OFF


### Student information (S_info_csv and S_info_df) ###

In [3]:
#Process the student_information.csv file
S_info_csv = cor_dir + "student_information.csv"
S_info_df = pd.read_csv(S_info_csv, index_col = 'anon_id')

S_info_df.head() #Issues still apparent with integers turned into floats
S_info_df.tail(10) #6 anon_id with no personal info - perhaps not students and to be 'pruned', as well as teachers with 'English' as the native language

anon_id,gender,birth_year,native_language,language_used_at_home,language_used_at_home_now,non_native_language_1,yrs_of_study_lang1,study_in_classroom_lang1,ways_of_study_lang1,non_native_language_2,yrs_of_study_lang2,study_in_classroom_lang2,ways_of_study_lang2,non_native_language_3,yrs_of_study_lang3,study_in_classroom_lang3,ways_of_study_lang3,createddate,modifieddate,course_history
ez9,Male,1978.0,Arabic,Arabic,,English,more than 5 years,1.0,Studied grammar;Worked in pairs/groups;Studied...,Turkish,less than 1 year,0.0,Studied by myself,,,0.0,other,2006-01-30 15:07:18,2006-03-14 15:13:37,6;12;18;24;30
gm3,Male,1980.0,Arabic,Arabic,,English,more than 5 years,1.0,Studied grammar;Had a native-speaker teacher;S...,,,0.0,other,,,0.0,other,2006-01-30 15:07:28,2006-03-14 15:12:49,6;12;24;30;38
fg5,Male,1938.0,Nepali,Nepali,,English,more than 5 years,1.0,Studied grammar;Worked in pairs/groups;Had a n...,French,less than 1 year,1.0,Studied grammar;Worked in pairs/groups;Had a n...,Hindi,more than 5 years,0.0,Studied by myself,2006-01-30 15:07:45,2006-03-14 15:11:36,18;24
ce5,Female,1984.0,Korean,Korean,,English,more than 5 years,1.0,Studied grammar;Worked in pairs/groups;Had a n...,German,1-2 years,1.0,Studied grammar;Studied vocabulary;Listened to...,,,0.0,other,2006-01-30 15:07:49,2006-03-14 15:12:24,6;12;24;30;38;56
fi7,Female,1982.0,Korean,Korean;Japanese,,English,more than 5 years,1.0,Studied grammar;Had a native-speaker teacher;S...,Japanese,less than 1 year,1.0,Studied grammar;Studied vocabulary;Listened to...,French,1-2 years,1.0,Studied grammar;Studied vocabulary;Listened to...,2006-01-30 15:07:52,2006-03-14 15:12:17,6;12;24;30;38


anon_id,gender,birth_year,native_language,language_used_at_home,language_used_at_home_now,non_native_language_1,yrs_of_study_lang1,study_in_classroom_lang1,ways_of_study_lang1,non_native_language_2,yrs_of_study_lang2,study_in_classroom_lang2,ways_of_study_lang2,non_native_language_3,yrs_of_study_lang3,study_in_classroom_lang3,ways_of_study_lang3,createddate,modifieddate,course_history
hb0,Female,1980.0,Arabic,Arabic,Arabic,English,3-5 years,1.0,Studied grammar;Had a native-speaker teacher;T...,,,0.0,other,,,0.0,other,2011-06-20 14:09:38,2011-06-20 14:13:01,851;869;870;871;872;923;942;944;945;946;1008;1...
dp8,Male,1991.0,Arabic,Arabic;English,Arabic;English,English,1-2 years,1.0,Studied grammar;Worked in pairs/groups;Had a n...,,,0.0,other,,,0.0,other,2011-06-20 14:10:15,2011-06-20 14:13:57,868;869;870;871;872
bn6,Male,1986.0,Arabic,Arabic;English,Arabic;English,English,more than 5 years,1.0,Studied grammar;Studied vocabulary;Teacher spo...,,,0.0,other,,,0.0,other,2011-06-20 14:11:17,2011-06-20 14:15:51,860;861;862;871;872;930;947;948;949;951;998;99...
aq6,Female,1964.0,English,English,English,,,,,,,,,,,,,2012-09-14 14:05:38,2012-09-14 14:09:19,1114
fm3,,,,,,,,,,,,,,,,,,2012-09-17 17:12:46,,1034;1035;1036;1037;1038;1099;1100;1101;1102;1103
ey5,,,,,,,,,,,,,,,,,,2013-04-11 13:28:41,,1089
gb5,,,,,,,,,,,,,,,,,,2013-06-20 13:12:55,,1092
aa7,,,,,,,,,,,,,,,,,,2013-07-12 16:25:34,,1074;1075;1076;1077;1078
gf3,,,,,,,,,,,,,,,,,,2013-11-21 13:42:32,,1112
gl8,,,,,,,,,,,,,,,,,,2014-10-23 14:14:57,,1077
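The floats above (e.g. birth_year shown as 1978.0) appear because NaN cannot be stored in a plain integer column. One possible fix, sketched here on invented rows rather than the real data, is pandas' nullable 'Int64' dtype. The name toy_df and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for S_info_df (values are invented for illustration)
toy_df = pd.DataFrame(
    {"birth_year": [1978.0, 1980.0, np.nan]},
    index=pd.Index(["s1", "s2", "s3"], name="anon_id"),
)

# 'Int64' (capital I) is pandas' nullable integer dtype: missing values become <NA>
toy_df["birth_year"] = toy_df["birth_year"].astype("Int64")
print(toy_df)
```

Whether this is worth doing for every float column in S_info_df is a separate decision; it mainly helps readability for columns like birth_year.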


In [4]:
#Remove anyone with 'English' or 'NaN' as their native_language, i.e. not students

#First try to create filters

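Englishfilter = S_info_df['native_language'] == 'English' #first filter works
NaNfilter = S_info_df['native_language'].isna() #'== np.nan' is always False because NaN never equals itself, so use .isna()

fake_Ss = S_info_df.loc[Englishfilter] #English-only filter; this is the output shown below
fake_Ss

#Combining the two filters needs the element-wise '|' operator rather than Python's 'or':
#fake_Ss = S_info_df.loc[Englishfilter | NaNfilter]
#fake_Ss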


anon_id,gender,birth_year,native_language,language_used_at_home,language_used_at_home_now,non_native_language_1,yrs_of_study_lang1,study_in_classroom_lang1,ways_of_study_lang1,non_native_language_2,yrs_of_study_lang2,study_in_classroom_lang2,ways_of_study_lang2,non_native_language_3,yrs_of_study_lang3,study_in_classroom_lang3,ways_of_study_lang3,createddate,modifieddate,course_history
ez7,Male,1987.0,English,Arabic,Arabic;English,Arabic,more than 5 years,0.0,I lived in a country where they spoke Arabic,English,less than 1 year,1.0,Studied grammar;Studied vocabulary;Studied pro...,,,0.0,other,2007-02-20 10:05:39,2007-03-20 10:09:23,156;167;180;191;200;212;223;234;245;256
ay4,Female,1974.0,English,Korean,Korean,Korean,more than 5 years,1.0,Studied grammar;Had a native-speaker teacher;S...,,,0.0,other,,,0.0,other,2009-06-09 12:04:22,2009-11-13 12:43:36,509;515;516;517;560;571;574;601;622;642;645
aq6,Female,1964.0,English,English,English,,,,,,,,,,,,,2012-09-14 14:05:38,2012-09-14 14:09:19,1114
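As a minimal sketch of how the 'English' and NaN conditions could be combined, here is the idea on a toy dataframe with invented rows (toy_df and the filter names are illustrative, not part of the real data). The two points are that missing values are detected with .isna(), and that boolean Series are combined with '|' rather than 'or'.

```python
import numpy as np
import pandas as pd

# Toy stand-in for S_info_df (invented rows)
toy_df = pd.DataFrame(
    {"native_language": ["Arabic", "English", np.nan, "Korean"]},
    index=pd.Index(["s1", "s2", "s3", "s4"], name="anon_id"),
)

english_filter = toy_df["native_language"] == "English"
# NaN never compares equal to anything (including itself), so use .isna()
nan_filter = toy_df["native_language"].isna()

# '|' is element-wise OR; '~' negates the combined mask
fake_students = toy_df.loc[english_filter | nan_filter]
real_students = toy_df.loc[~(english_filter | nan_filter)]
print(real_students)
```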


### Student responses (answer_csv and answer_df) ###

In [5]:
#Process answer.csv file
answer_csv = cor_dir + "answer.csv"
answer_df = pd.read_csv(answer_csv, index_col = 'answer_id')

answer_df.head()
answer_df.tail(10)

answer_id,question_id,anon_id,user_file_id,text,directory,is_doublespaced,is_plagiarized,is_deleted
1,5,eq0,7505,I met my friend Nife while I was studying in a...,,0,0,0
2,5,am8,7506,"Ten years ago, I met a women on the train betw...",,0,0,0
3,12,dk5,7507,In my country we usually don't use tea bags. F...,,0,0,0
4,13,dk5,7507,I organized the instructions by time.,,0,0,0
5,12,ad1,7508,"First, prepare a port, loose tea, and cup.\r\r...",,0,0,0


answer_id,question_id,anon_id,user_file_id,text,directory,is_doublespaced,is_plagiarized,is_deleted
48411,6138,dv8,100847,Early Second Language Education\r\r\r\nSaudi A...,,1,0,0
48412,6138,ce1,100848,Publicly funded health care system\r\r\r\n\r\r...,,0,0,0
48413,6139,fo7,100911,Happiness is the most effective feeling in peo...,,1,0,0
48414,6139,fs9,100912,everyone want to play some games. some people ...,,1,0,0
48415,6139,cl7,100913,Playing a game is fun only when you win?\r\r\r...,,1,0,0
48416,6139,dr8,100914,Many people enjoy a game in their free time. B...,,1,0,0
48417,6137,fv1,100915,\r\r\r\n ...,,0,0,0
48418,6137,fo1,100916,Some patients are suffering from the...,,0,0,0
48419,6119,ge8,100917,My house looks amazing and modern. I decorated...,,0,0,0
48420,6027,ge8,100918,History and Geography a...,,0,0,0


### Course IDs ###
(should help with finding specific texts and linking other data frames)

In [6]:
#Process course.csv file
course_csv = cor_dir + "course.csv"
course_df = pd.read_csv(course_csv, index_col = 'course_id')

course_df.head()

course_id,class_id,level_id,semester,section,course_description
1,1,2,2064,A,Reading Pre_Intermediate 2064 A
2,1,3,2064,B,Reading Low_Intermediate 2064 B
3,1,4,2064,M,Reading Intermediate 2064 M
4,1,4,2064,P,Reading Intermediate 2064 P
5,1,4,2064,Q,Reading Intermediate 2064 Q


### user_file_wavtxt (user_csv and user_df) ###
- big csv file with a lot of information
- should help with finding specific texts and linking other data frames
- includes file_type_id, course_id, and paths of text and wav files (i.e. all the spoken responses I need)


In [7]:
#Process user_file_wavtxt.csv file
user_csv = cor_dir + "user_file_wavtxt.csv"
user_df = pd.read_csv(user_csv, index_col = 'user_file_id')

user_df.head()


user_file_id,anon_id,file_type_id,file_info_id,user_file_parent_id,course_id,session_id,document_id,activity,order_num,due_date,...,modifiedby,modifieddate,allow_submit_after_duedate,allow_multiple_accesses,allow_double_spacing,duration,pull_off_date,direction,grammar_qp_id,is_deleted
239,,5,13.0,,90,,,1,1.0,2006-08-07 14:19:48,...,,2006-08-08 12:17:18,0,0,0,,,,,0
240,,5,13.0,,90,,,1,2.0,2006-08-07 14:19:48,...,,2006-08-08 12:19:15,0,0,0,,,,,0
241,,5,13.0,,90,,,1,3.0,2006-08-07 14:19:48,...,,2006-08-08 12:21:18,0,0,0,,,,,0
242,,5,13.0,,90,,,1,4.0,2006-08-07 14:19:48,...,,2006-08-08 12:24:00,0,0,0,,,,,0
243,,5,13.0,,90,,,1,5.0,2006-08-07 14:19:48,...,,2006-08-08 12:14:57,0,0,0,,,,,0


### Basic info about dataframes ###

The following information is an overview of the four dataframes/csv files currently being looked at:

#### S_info_df ####
Size:
- there are 941 entries, i.e. students, although at least 9 need to be removed once filters can be made to work
- 21 columns including info about languages spoken, personal data like age, and learning preferences
- Some columns will likely be removed if deemed unhelpful/unnecessary (e.g. 4th language spoken)
- Some data is normalized, e.g. years of study, but other fields were open-ended, resulting in very varied responses

Connection to other dataframes:
- link to answer_df is anon_id

Most useful columns for this project:
- anon_id (for linking to other df)
- L1, gender, time studying, age (for data analysis)  


#### answer_df ####
Size:
- there are 48384 rows in total, of which 47175 have a 'text' entry, i.e. a student response. The remaining 1209 rows have null texts and need to be removed, as without texts they serve no purpose
- 9 columns including info about the question, the answer, and characteristics of the text (like if it was plagiarized)

Connection to other dataframes:
- link to S_info_df and user_df is the anon_id column (course_df has no anon_id column)

Most useful columns for this project:
- answer_id (shorthand for the individual texts to be analyzed)
- text (the most important column so far) -> to be converted into tokens, bigrams, etc.  
- anon_id (for linking to other df)


#### course_df ####
Size:
- there are 1071 entries, i.e. one row for each course
- 6 columns including info about the course and class, both in terms of their assigned number and a description

Connection to other dataframes:
- link to user_df is course_id 

Most useful columns for this project:
- only really useful as a transition for linking to other df  


#### user_df ####
Size:
- there are 76371 rows, each with a user_file_id number. However, it is unclear how to use this information effectively.
- There are 29 columns, although many are not useful for this project
- A lot of the cells have no input
- Some columns will likely be removed if deemed unhelpful/unnecessary

Connection to other dataframes:
- link to course_df is course_id column

Most useful columns for this project:
- course_id (to link to other DF)
- file_type_id (for indicating the type of activity used in class)

In [8]:
S_info_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 941 entries, ez9 to gl8
Data columns (total 20 columns):
gender                       920 non-null object
birth_year                   920 non-null float64
native_language              920 non-null object
language_used_at_home        919 non-null object
language_used_at_home_now    860 non-null object
non_native_language_1        866 non-null object
yrs_of_study_lang1           871 non-null object
study_in_classroom_lang1     871 non-null float64
ways_of_study_lang1          871 non-null object
non_native_language_2        312 non-null object
yrs_of_study_lang2           315 non-null object
study_in_classroom_lang2     871 non-null float64
ways_of_study_lang2          871 non-null object
non_native_language_3        56 non-null object
yrs_of_study_lang3           60 non-null object
study_in_classroom_lang3     871 non-null float64
ways_of_study_lang3          871 non-null object
createddate                  941 non-null object
modifieddate  

In [9]:
answer_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 48384 entries, 1 to 48420
Data columns (total 8 columns):
question_id        48384 non-null int64
anon_id            48353 non-null object
user_file_id       48384 non-null int64
text               47175 non-null object
directory          14 non-null object
is_doublespaced    48384 non-null int64
is_plagiarized     48384 non-null int64
is_deleted         48384 non-null int64
dtypes: int64(5), object(3)
memory usage: 3.3+ MB


In [10]:
course_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1071 entries, 1 to 1123
Data columns (total 5 columns):
class_id              1071 non-null int64
level_id              1071 non-null int64
semester              1071 non-null int64
section               1071 non-null object
course_description    1058 non-null object
dtypes: int64(3), object(2)
memory usage: 50.2+ KB


In [11]:
user_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 76371 entries, 239 to 103667
Data columns (total 28 columns):
anon_id                       76142 non-null object
file_type_id                  76371 non-null int64
file_info_id                  13241 non-null float64
user_file_parent_id           33348 non-null float64
course_id                     76371 non-null int64
session_id                    0 non-null float64
document_id                   0 non-null float64
activity                      76371 non-null int64
order_num                     40323 non-null float64
due_date                      5346 non-null object
post_date                     5346 non-null object
assignment_name               0 non-null float64
version                       76371 non-null int64
directory                     76371 non-null object
filename                      76371 non-null object
content_text                  0 non-null float64
createdby                     0 non-null float64
createddate           

### Creating Speaking Answers dataframe ###

1) Start with 'course.csv' which has class_id (we want #3 for speaking classes)  
2) In 'course.csv', class_id is linked to course_id  
3) In 'user_file_wavtxt.csv' course_id is linked to file_type_id (we want #6 for RSA)  
4) MISSING STEP OR STEPS - nothing links to answer.csv other than anon_id and this isn't specific enough - is this information in the original text file?  
(Final goal) answer_id -> text in 'answer.csv'
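Judging only from the column names displayed above, answer.csv and user_file_wavtxt.csv both carry a user_file_id, which might be the missing link. This has not been checked against the real data, so the sketch below uses invented rows and assumes that user_file_id refers to the same thing in both files. With the real dataframes, the id columns are indexes, so a reset_index() would probably be needed before merging.

```python
import pandas as pd

# Invented miniature versions of the three files (ids as ordinary columns)
course = pd.DataFrame({"course_id": [13, 25], "class_id": [3, 5]})
user = pd.DataFrame(
    {"user_file_id": [7505, 7506, 7507],
     "course_id": [13, 25, 13],
     "file_type_id": [6, 6, 2]}
)
answer = pd.DataFrame(
    {"answer_id": [1, 2, 3],
     "user_file_id": [7505, 7506, 7507],
     "text": ["a", "b", "c"]}
)

# answer -> user (via user_file_id, an assumption) -> course (via course_id)
linked = answer.merge(user, on="user_file_id", how="left")
linked = linked.merge(course, on="course_id", how="left")

# Keep speaking classes (class_id 3) with file_type_id 6
speaking = linked[(linked["class_id"] == 3) & (linked["file_type_id"] == 6)]
print(speaking[["answer_id", "text"]])
```

If the assumption does not hold in the real data, the merge would simply produce NaN in course_id and class_id for unmatched rows, which is easy to check.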

In [12]:
#ALL ATTEMPTS FAIL MISERABLY

### Creating find_stuff function ###

Goal: create a function that allows for easy retrieval of rows from the various dataframes.


In [13]:
#adapted from initial work of Brianna - thank you!

#this works to find all the course_id entries for a particular class type, in this case '3' which == speaking

def find_stuff(df, class_type):
    class_id = df.loc[df['class_id'] == class_type]
    return class_id

test = find_stuff(course_df, 3)
test.head()

course_id,class_id,level_id,semester,section,course_description
13,3,2,2064,A,Speaking Pre_Intermediate 2064 A
14,3,3,2064,B,Speaking Low_Intermediate 2064 B
15,3,4,2064,M,Speaking Intermediate 2064 M
16,3,4,2064,P,Speaking Intermediate 2064 P
17,3,4,2064,Q,Speaking Intermediate 2064 Q


In [14]:
#test #2

test2 = find_stuff(course_df, 5)
test2.head()

course_id,class_id,level_id,semester,section,course_description
25,5,2,2064,A,Grammar Pre_Intermediate 2064 A
26,5,3,2064,B,Grammar Low_Intermediate 2064 B
27,5,4,2064,M,Grammar Intermediate 2064 M
28,5,4,2064,P,Grammar Intermediate 2064 P
29,5,4,2064,Q,Grammar Intermediate 2064 Q


- Next step is to either expand on this function or create other similar ones to allow look up of other types of info
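One way this expansion might look is a version that takes the column name as a parameter, so the same helper can look up other types of info. This is only a sketch: find_rows and the toy dataframe are hypothetical names and invented data, not part of the project code.

```python
import pandas as pd

def find_rows(df, column, value):
    """Return the rows of df whose `column` equals `value` (hypothetical helper)."""
    return df.loc[df[column] == value]

# Toy stand-in for course_df (invented rows)
toy_course = pd.DataFrame(
    {"class_id": [1, 3, 3, 5], "level_id": [2, 2, 3, 4]},
    index=pd.Index([1, 13, 14, 27], name="course_id"),
)

speaking = find_rows(toy_course, "class_id", 3)   # same lookup as find_stuff(course_df, 3)
level_3 = find_rows(toy_course, "level_id", 3)    # a lookup the original function cannot do
print(speaking)
```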

### Tokenization of answers ###

Goal: tokenize the text in answer.csv to allow for further analysis (bigrams, lexical diversity, etc.)


In [15]:
#find column to tokenize

answer_df[['text']].head()

answer_id,text
1,I met my friend Nife while I was studying in a...
2,"Ten years ago, I met a women on the train betw..."
3,In my country we usually don't use tea bags. F...
4,I organized the instructions by time.
5,"First, prepare a port, loose tea, and cup.\r\r..."


In [16]:
# apply tokenizing function to 'text' column, using .map()
    #answer_df['toks'] = answer_df['text'].map(nltk.word_tokenize)
#Perhaps not working because of NaN values.

In [17]:
#With the magic of stackoverflow, this seems to work: rows whose text is NaN are dropped before tokenizing
answer_df = answer_df[answer_df['text'].notnull()].copy() #.copy() avoids a SettingWithCopyWarning when adding the new column
answer_df['toks'] = answer_df['text'].apply(nltk.word_tokenize)

answer_df.head()

answer_id,question_id,anon_id,user_file_id,text,directory,is_doublespaced,is_plagiarized,is_deleted,toks
1,5,eq0,7505,I met my friend Nife while I was studying in a...,,0,0,0,"[I, met, my, friend, Nife, while, I, was, stud..."
2,5,am8,7506,"Ten years ago, I met a women on the train betw...",,0,0,0,"[Ten, years, ago, ,, I, met, a, women, on, the..."
3,12,dk5,7507,In my country we usually don't use tea bags. F...,,0,0,0,"[In, my, country, we, usually, do, n't, use, t..."
4,13,dk5,7507,I organized the instructions by time.,,0,0,0,"[I, organized, the, instructions, by, time, .]"
5,12,ad1,7508,"First, prepare a port, loose tea, and cup.\r\r...",,0,0,0,"[First, ,, prepare, a, port, ,, loose, tea, ,,..."


### Bigrams ###

Goal: create a bigrams column from the toks column


In [18]:
#mini-test to make sure I am creating bigrams correctly

bigram_test = answer_df.toks[1]
bigram_test
list(nltk.bigrams(bigram_test))

#test works, let's try on dataframe

answer_df['bigrams'] = answer_df.toks.apply(lambda x: list(nltk.bigrams(x)))
answer_df.head()

['I', 'met', 'my', 'friend', 'Nife', 'while', 'I', 'was', 'studying', 'in', 'a', 'middle', 'school', '.', 'I', 'was', 'happy', 'when', 'I', 'met', 'him', 'because', 'he', 'was', 'a', 'good', 'student', 'in', 'our', 'school', '.', 'We', 'continued', 'the', 'middle', 'and', 'high', 'school', 'to', 'gather', 'in', 'the', 'same', 'school', '.', 'We', 'were', 'studying', 'in', 'the', 'different', 'classes', 'in', 'the', 'middle', 'school', ';', 'however', ',', 'in', 'the', 'high', 'school', 'we', 'were', 'studying', 'in', 'the', 'same', 'class', '.', 'We', 'went', 'to', 'many', 'places', 'in', 'the', 'free', 'time', 'while', 'we', 'were', 'studying', 'in', 'the', 'high', 'school', '.', 'When', 'we', 'finished', 'from', 'the', 'high', 'school', ',', 'I', 'went', 'to', 'K.S', 'University', 'and', 'he', 'went', 'to', 'I.M', 'University', '.', 'While', 'we', 'were', 'enjoying', 'in', 'academic', 'life', ',', 'we', 'made', 'many', 'achievement', 'in', 'these', 'universities', '.', 'I', 'graduate

[('I', 'met'), ('met', 'my'), ('my', 'friend'), ('friend', 'Nife'), ('Nife', 'while'), ('while', 'I'), ('I', 'was'), ('was', 'studying'), ('studying', 'in'), ('in', 'a'), ('a', 'middle'), ('middle', 'school'), ('school', '.'), ('.', 'I'), ('I', 'was'), ('was', 'happy'), ('happy', 'when'), ('when', 'I'), ('I', 'met'), ('met', 'him'), ('him', 'because'), ('because', 'he'), ('he', 'was'), ('was', 'a'), ('a', 'good'), ('good', 'student'), ('student', 'in'), ('in', 'our'), ('our', 'school'), ('school', '.'), ('.', 'We'), ('We', 'continued'), ('continued', 'the'), ('the', 'middle'), ('middle', 'and'), ('and', 'high'), ('high', 'school'), ('school', 'to'), ('to', 'gather'), ('gather', 'in'), ('in', 'the'), ('the', 'same'), ('same', 'school'), ('school', '.'), ('.', 'We'), ('We', 'were'), ('were', 'studying'), ('studying', 'in'), ('in', 'the'), ('the', 'different'), ('different', 'classes'), ('classes', 'in'), ('in', 'the'), ('the', 'middle'), ('middle', 'school'), ('school', ';'), (';', 'howe

answer_id,question_id,anon_id,user_file_id,text,directory,is_doublespaced,is_plagiarized,is_deleted,toks,bigrams
1,5,eq0,7505,I met my friend Nife while I was studying in a...,,0,0,0,"[I, met, my, friend, Nife, while, I, was, stud...","[(I, met), (met, my), (my, friend), (friend, N..."
2,5,am8,7506,"Ten years ago, I met a women on the train betw...",,0,0,0,"[Ten, years, ago, ,, I, met, a, women, on, the...","[(Ten, years), (years, ago), (ago, ,), (,, I),..."
3,12,dk5,7507,In my country we usually don't use tea bags. F...,,0,0,0,"[In, my, country, we, usually, do, n't, use, t...","[(In, my), (my, country), (country, we), (we, ..."
4,13,dk5,7507,I organized the instructions by time.,,0,0,0,"[I, organized, the, instructions, by, time, .]","[(I, organized), (organized, the), (the, instr..."
5,12,ad1,7508,"First, prepare a port, loose tea, and cup.\r\r...",,0,0,0,"[First, ,, prepare, a, port, ,, loose, tea, ,,...","[(First, ,), (,, prepare), (prepare, a), (a, p..."


### Create frequency dictionary for entire corpus ###

Attempting to create frequency dictionary for all toks

In [30]:
testdict = nltk.FreqDist(answer_df.toks[1])
testdict
#looks ok, now to apply to the whole column

FreqDist({'in': 15, '.': 12, 'the': 11, 'I': 10, 'school': 9, 'to': 6, ',': 6, 'was': 5, 'studying': 5, 'high': 5, ...})

In [63]:
fdict = answer_df.toks.apply(lambda x: nltk.FreqDist(x))
fdict.head()
#haha they are mini-dicts for each text rather than for the dataframe as a whole

answer_id
1    {'I': 10, 'met': 2, 'my': 4, 'friend': 2, 'Nif...
2    {'Ten': 1, 'years': 1, 'ago': 1, ',': 8, 'I': ...
3    {'In': 1, 'my': 1, 'country': 1, 'we': 5, 'usu...
4    {'I': 1, 'organized': 1, 'the': 1, 'instructio...
5    {'First': 1, ',': 9, 'prepare': 1, 'a': 2, 'po...
Name: toks, dtype: object

In [62]:
answer_corpus = ' '.join(answer_df['text'])
answer_corpus[:100]
answer_corpus_tok = nltk.word_tokenize(answer_corpus)
answer_corpus_tok[:20]

#probably not the most efficient way but it seems to have worked at least for tokenizing whole corpus.

'I met my friend Nife while I was studying in a middle school. I was happy when I met him because he '

['I', 'met', 'my', 'friend', 'Nife', 'while', 'I', 'was', 'studying', 'in', 'a', 'middle', 'school', '.', 'I', 'was', 'happy', 'when', 'I', 'met']

In [65]:
answer_dict = nltk.FreqDist(answer_corpus_tok)
answer_dict

#success!

FreqDist({'.': 264755, ',': 218149, 'the': 171927, 'to': 133262, 'and': 105988, 'I': 93236, 'a': 89283, 'of': 88552, 'in': 77170, 'is': 75659, ...})
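A possible alternative to joining all the texts into one string is to update a single counter with each row's existing token list. The sketch below uses collections.Counter from the standard library (nltk.FreqDist is a subclass of Counter, so the same update call should apply to it) and invented token lists in place of the real toks column.

```python
from collections import Counter

# Invented token lists standing in for answer_df.toks
toy_toks = [
    ["I", "met", "my", "friend", "."],
    ["I", "organized", "the", "instructions", "."],
]

# One counter for the whole corpus, updated text by text
corpus_counts = Counter()
for toks in toy_toks:
    corpus_counts.update(toks)

print(corpus_counts.most_common(2))
```

This reuses the tokens already computed for each text, so the corpus does not need to be tokenized a second time.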

### Create frequency dictionary for bigrams of entire corpus ###

Attempting to create frequency dictionary for all bigrams

In [76]:
#Let's try to do this from the answer_corpus_tok

answer_corpus_bigrams = list(nltk.bigrams(answer_corpus_tok))
answer_corpus_bigrams[:10]

[('I', 'met'), ('met', 'my'), ('my', 'friend'), ('friend', 'Nife'), ('Nife', 'while'), ('while', 'I'), ('I', 'was'), ('was', 'studying'), ('studying', 'in'), ('in', 'a')]

In [78]:
#Ok, now time for the dictionary
answer_bigram_dict = nltk.FreqDist(answer_corpus_bigrams)
answer_bigram_dict

#success! (although unless I use MI or do something about stop words/punctuation, then it's not very useful)

FreqDist({('.', 'I'): 24177, (',', 'I'): 21399, ('in', 'the'): 18669, ('.', 'The'): 17403, (',', 'and'): 16701, ('of', 'the'): 15011, ('.', 'In'): 13393, (',', 'the'): 12288, ('.', 'It'): 9348, ('to', 'the'): 8553, ...})
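To illustrate the "do something about stop words/punctuation" option, here is a rough sketch that drops any bigram containing a punctuation token or a stop word before counting. The stop list is a tiny made-up one and is_content_bigram is a hypothetical helper; a real analysis would need a fuller list and probably some thought about case.

```python
from collections import Counter
import string

# Tiny illustrative stop list (an assumption, not a real resource)
STOP_WORDS = {"the", "in", "of", "to", "and", "a", "i", "it"}

def is_content_bigram(bigram):
    """True if neither token is a stop word or made up only of punctuation."""
    return all(
        tok.lower() not in STOP_WORDS
        and not all(ch in string.punctuation for ch in tok)
        for tok in bigram
    )

# Invented bigrams standing in for answer_corpus_bigrams
toy_bigrams = [
    (".", "I"), ("in", "the"), ("middle", "school"),
    ("high", "school"), ("middle", "school"),
]

content_counts = Counter(bg for bg in toy_bigrams if is_content_bigram(bg))
print(content_counts.most_common())
```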

### Count vectors ###

Attempting to create count vector of toks and bigram columns


In [19]:
from sklearn.feature_extraction.text import CountVectorizer
textvec = CountVectorizer(min_df=1, tokenizer=nltk.word_tokenize)

In [20]:
# texts turned into a sparse matrix of word frequency counts (one row per text)
toks_counts = textvec.fit_transform(answer_df.text)

In [24]:
toks_counts.shape
toks_counts.toarray() #careful: this builds a dense 47175 x 63041 int64 array (roughly 24 GB)

(47175, 63041)

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ..., 
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [66]:
#not sure if this has any use! (Perhaps TF-IDF?)

#### Calculating Mutual Information (MI) ####

(from https://corpus.byu.edu/mutualInformation.asp)  

Mutual Information is calculated as follows:  
MI = log ( (AB * sizeCorpus) / (A * B * span) ) / log (2)  

Suppose we are calculating the MI for the collocate color near purple in BYU-BNC.  

A = frequency of node word (e.g. purple): 1262  
B = frequency of collocate (e.g. color): 115  
AB = frequency of collocate near the node word (e.g. color near purple): 24  
sizeCorpus= size of corpus (# words; in this case the BNC): 96,263,399  
span = span of words (e.g. 3 to left and 3 to right of node word): 6  
log (2) is literally the log10 of the number 2: .30103  

MI = 11.37 = log ( (24 * 96,263,399) / (1262 * 115 * 6) ) / .30103  
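The formula above can be sketched as a small function. This is just a direct translation of the quoted formula, with parameter names chosen here for readability; dividing log10 by log10(2) is the same as taking log base 2, so math.log2 is used.

```python
import math

def mutual_information(ab, a, b, size_corpus, span):
    """MI = log2((AB * sizeCorpus) / (A * B * span)), following the formula quoted above."""
    return math.log2((ab * size_corpus) / (a * b * span))

# Worked example from the text: 'color' near 'purple' in the BNC
mi = mutual_information(ab=24, a=1262, b=115, size_corpus=96_263_399, span=6)
print(round(mi, 2))
```

For the ELI data, the A and B counts could presumably come from the token frequency dictionary and AB from the bigram frequency dictionary, with the span adjusted to match how the bigrams were built.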