# Project_Code 2: Continued clean up and analysis of ELI Data #
## Ben Naismith ##

### Changes since 'Project_Code1' ###

This new document has been created as a number of significant changes have been made to the original code. Based on discussions with other members of the ELI Data Mining Group, the following points were determined:

- For the sake of efficiency, it is better not to merge the different data frames into one big one
- A 'sanitization' step has been completed which duplicates some of the steps of my initial code. These duplications include removing unwanted apostrophes, changing all 'null' and 'ull' to NaN, and removing empty or unreal students (who were most likely teachers). As such, the dataset is now ready for more in-depth cleaning and analysis, i.e. the purpose of this notebook. The code for the sanitization step is in 'convert_0_to_1.ipynb', in a private repository of the ELI Data Mining Group.
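
Since the group's sanitization notebook is private, the following is only a minimal sketch of what such a step might look like, not the actual code. The function name `sanitize`, the toy data, and the choice to strip apostrophes only at the start and end of a value are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def sanitize(df):
    """Illustrative clean-up: strip apostrophes, map 'null'/'ull' to NaN, drop empty rows."""
    out = df.copy()
    for col in out.columns:
        # Remove apostrophes wrapping a string value; non-string cells pass through unchanged
        out[col] = out[col].map(lambda v: v.strip("'") if isinstance(v, str) else v)
    # Treat the literal strings 'null' and 'ull' as missing values
    out = out.replace(["null", "ull"], np.nan)
    # Drop rows that carry no information in any column
    return out.dropna(how="all")

# Toy data invented for illustration (not taken from the ELI dataset)
demo = pd.DataFrame(
    {"native_language": ["'Arabic'", "null", "ull"],
     "gender": ["Male", "null", "Female"]},
    index=["s1", "s2", "s3"],
)
clean = sanitize(demo)
```

In this toy example the row `s2` is dropped because every field is missing, while `s3` is kept because it still has a gender value.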

### Data Sharing Plan ###

The full ELI data set (see project_plan.md) is private at this time. Below is a workbook with the current code for organizing and cleaning that data. In order to see how the code works, snippets of data have been displayed throughout.

This notebook will continue to be updated until the project is ready, at which point a sample of raw data, e.g. a CSV of 1000 answers, will be included in the repository so that others can test and reproduce the code. The exact method for sampling will be determined once the initial code is complete, as cleaner data is needed before it can be sampled. At present, sampling results in errors due to false students, entries, etc. which cannot be linked to the appropriate CSV files.
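
As the sampling method has not yet been decided, the sketch below shows just one possible approach: keep only answers whose `anon_id` can be linked to a student record, then draw a reproducible random sample. The function name `sample_answers`, the seed, the toy frames, and the output file name are hypothetical.

```python
import pandas as pd

def sample_answers(answers, students, n=1000, seed=0):
    """Sample up to n answers whose anon_id matches a known student."""
    # Keep only answers that can be linked to a student record
    linked = answers[answers["anon_id"].isin(students.index)]
    # Never request more rows than exist; a fixed seed keeps the sample reproducible
    return linked.sample(n=min(n, len(linked)), random_state=seed)

# Toy frames invented for illustration; anon_id is the index of the student table
students = pd.DataFrame({"native_language": ["Arabic", "Korean"]},
                        index=pd.Index(["aa1", "bb2"], name="anon_id"))
answers = pd.DataFrame({"anon_id": ["aa1", "zz9", "bb2", None],
                        "text": ["a", "b", "c", "d"]},
                       index=pd.Index([1, 2, 3, 4], name="answer_id"))

sample = sample_answers(answers, students, n=1000)
# sample.to_csv("answer_sample.csv")  # hypothetical output file name
```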

Ultimately, it is the intention of the dataset's authors for the entire dataset to be made public, with a CC license. Please see LICENSE.md for details.

### Initial setup ###

In [18]:
#Import necessary modules
import numpy as np
import pandas as pd
import nltk
import glob
import matplotlib.pyplot as plt

#toggle pretty printing (%pprint switches the current setting rather than always turning it off)
%pprint

#display the output of every expression in a cell, not only the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

#Create short-hand for directory root
cor_dir = "/Users/Benjamin's/Documents/ELI_Data_Mining/Data-Archive/1_sanitized/"

Pretty printing has been turned ON


### Student information (S_info_csv and S_info_df) ###

In [19]:
#Process the student_information.csv file
S_info_csv = cor_dir + "student_information.csv"
S_info_df = pd.read_csv(S_info_csv, index_col = 'anon_id')

S_info_df.head() #Issues still apparent with integers turned into floats
S_info_df.tail(10) #6 anon_id with no personal info - perhaps not students and to be 'pruned', as well as teachers with 'English' as the native language

anon_id,gender,birth_year,native_language,language_used_at_home,language_used_at_home_now,non_native_language_1,yrs_of_study_lang1,study_in_classroom_lang1,ways_of_study_lang1,non_native_language_2,yrs_of_study_lang2,study_in_classroom_lang2,ways_of_study_lang2,non_native_language_3,yrs_of_study_lang3,study_in_classroom_lang3,ways_of_study_lang3,createddate,modifieddate,course_history
ez9,Male,1978.0,Arabic,Arabic,,English,more than 5 years,1.0,Studied grammar;Worked in pairs/groups;Studied...,Turkish,less than 1 year,0.0,Studied by myself,,,0.0,other,2006-01-30 15:07:18,2006-03-14 15:13:37,6;12;18;24;30
gm3,Male,1980.0,Arabic,Arabic,,English,more than 5 years,1.0,Studied grammar;Had a native-speaker teacher;S...,,,0.0,other,,,0.0,other,2006-01-30 15:07:28,2006-03-14 15:12:49,6;12;24;30;38
fg5,Male,1938.0,Nepali,Nepali,,English,more than 5 years,1.0,Studied grammar;Worked in pairs/groups;Had a n...,French,less than 1 year,1.0,Studied grammar;Worked in pairs/groups;Had a n...,Hindi,more than 5 years,0.0,Studied by myself,2006-01-30 15:07:45,2006-03-14 15:11:36,18;24
ce5,Female,1984.0,Korean,Korean,,English,more than 5 years,1.0,Studied grammar;Worked in pairs/groups;Had a n...,German,1-2 years,1.0,Studied grammar;Studied vocabulary;Listened to...,,,0.0,other,2006-01-30 15:07:49,2006-03-14 15:12:24,6;12;24;30;38;56
fi7,Female,1982.0,Korean,Korean;Japanese,,English,more than 5 years,1.0,Studied grammar;Had a native-speaker teacher;S...,Japanese,less than 1 year,1.0,Studied grammar;Studied vocabulary;Listened to...,French,1-2 years,1.0,Studied grammar;Studied vocabulary;Listened to...,2006-01-30 15:07:52,2006-03-14 15:12:17,6;12;24;30;38


anon_id,gender,birth_year,native_language,language_used_at_home,language_used_at_home_now,non_native_language_1,yrs_of_study_lang1,study_in_classroom_lang1,ways_of_study_lang1,non_native_language_2,yrs_of_study_lang2,study_in_classroom_lang2,ways_of_study_lang2,non_native_language_3,yrs_of_study_lang3,study_in_classroom_lang3,ways_of_study_lang3,createddate,modifieddate,course_history
hb0,Female,1980.0,Arabic,Arabic,Arabic,English,3-5 years,1.0,Studied grammar;Had a native-speaker teacher;T...,,,0.0,other,,,0.0,other,2011-06-20 14:09:38,2011-06-20 14:13:01,851;869;870;871;872;923;942;944;945;946;1008;1...
dp8,Male,1991.0,Arabic,Arabic;English,Arabic;English,English,1-2 years,1.0,Studied grammar;Worked in pairs/groups;Had a n...,,,0.0,other,,,0.0,other,2011-06-20 14:10:15,2011-06-20 14:13:57,868;869;870;871;872
bn6,Male,1986.0,Arabic,Arabic;English,Arabic;English,English,more than 5 years,1.0,Studied grammar;Studied vocabulary;Teacher spo...,,,0.0,other,,,0.0,other,2011-06-20 14:11:17,2011-06-20 14:15:51,860;861;862;871;872;930;947;948;949;951;998;99...
aq6,Female,1964.0,English,English,English,,,,,,,,,,,,,2012-09-14 14:05:38,2012-09-14 14:09:19,1114
fm3,,,,,,,,,,,,,,,,,,2012-09-17 17:12:46,,1034;1035;1036;1037;1038;1099;1100;1101;1102;1103
ey5,,,,,,,,,,,,,,,,,,2013-04-11 13:28:41,,1089
gb5,,,,,,,,,,,,,,,,,,2013-06-20 13:12:55,,1092
aa7,,,,,,,,,,,,,,,,,,2013-07-12 16:25:34,,1074;1075;1076;1077;1078
gf3,,,,,,,,,,,,,,,,,,2013-11-21 13:42:32,,1112
gl8,,,,,,,,,,,,,,,,,,2014-10-23 14:14:57,,1077
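
The float issue noted in the comment on `S_info_df.head()` arises because NaN is itself a float, so pandas reads any numeric column containing missing values as `float64`. One possible remedy, sketched here on invented toy data rather than the real file, is pandas' nullable integer dtype `Int64`. Which columns to convert is an assumption based on the output above.

```python
import numpy as np
import pandas as pd

# Toy data invented for illustration, using two column names from S_info_df
demo = pd.DataFrame({"birth_year": [1978.0, np.nan, 1984.0],
                     "study_in_classroom_lang1": [1.0, np.nan, 0.0]},
                    index=pd.Index(["s1", "s2", "s3"], name="anon_id"))

# 'Int64' (capital I) is the nullable integer dtype: missing values become <NA>
int_cols = ["birth_year", "study_in_classroom_lang1"]
demo = demo.astype({col: "Int64" for col in int_cols})
```

After the conversion, `birth_year` displays as `1978` rather than `1978.0`, and missing years remain missing instead of forcing the whole column back to floats.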


In [87]:
#Remove anyone with 'English' or 'NaN' as their native_language, i.e. not students

#First try to create filters

Englishfilter = S_info_df['native_language'] == 'English' #first filter works
NaNfilter = S_info_df['native_language'] == np.nan #second filter doesn't: NaN never compares equal to anything, so every value is False

fake_Ss = S_info_df.loc[(Englishfilter)] #works, but...
fake_Ss

fake_Ss = S_info_df.loc[(Englishfilter) & (NaNfilter)] #doesn't work: & requires both filters to be True for a row, so the result is empty
fake_Ss


anon_id,gender,birth_year,native_language,language_used_at_home,language_used_at_home_now,non_native_language_1,yrs_of_study_lang1,study_in_classroom_lang1,ways_of_study_lang1,non_native_language_2,yrs_of_study_lang2,study_in_classroom_lang2,ways_of_study_lang2,non_native_language_3,yrs_of_study_lang3,study_in_classroom_lang3,ways_of_study_lang3,createddate,modifieddate,course_history
ez7,Male,1987.0,English,Arabic,Arabic;English,Arabic,more than 5 years,0.0,I lived in a country where they spoke Arabic,English,less than 1 year,1.0,Studied grammar;Studied vocabulary;Studied pro...,,,0.0,other,2007-02-20 10:05:39,2007-03-20 10:09:23,156;167;180;191;200;212;223;234;245;256
ay4,Female,1974.0,English,Korean,Korean,Korean,more than 5 years,1.0,Studied grammar;Had a native-speaker teacher;S...,,,0.0,other,,,0.0,other,2009-06-09 12:04:22,2009-11-13 12:43:36,509;515;516;517;560;571;574;601;622;642;645
aq6,Female,1964.0,English,English,English,,,,,,,,,,,,,2012-09-14 14:05:38,2012-09-14 14:09:19,1114


anon_id,gender,birth_year,native_language,language_used_at_home,language_used_at_home_now,non_native_language_1,yrs_of_study_lang1,study_in_classroom_lang1,ways_of_study_lang1,non_native_language_2,yrs_of_study_lang2,study_in_classroom_lang2,ways_of_study_lang2,non_native_language_3,yrs_of_study_lang3,study_in_classroom_lang3,ways_of_study_lang3,createddate,modifieddate,course_history
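
The combined filter in the cell above returns an empty dataframe for two reasons: a comparison with `np.nan` is always False, and `&` keeps only rows that satisfy both conditions at once. A possible fix, sketched on invented toy data, is to test for missing values with `isna()` and combine the filters with `|` (or). The names `english_filter`, `nan_filter` and `real_Ss` are illustrative.

```python
import numpy as np
import pandas as pd

# Toy data invented for illustration
demo = pd.DataFrame({"native_language": ["Arabic", "English", np.nan, "Korean"]},
                    index=pd.Index(["s1", "s2", "s3", "s4"], name="anon_id"))

english_filter = demo["native_language"] == "English"
nan_filter = demo["native_language"].isna()  # NaN != NaN, so test with isna()

# | (or): a row is flagged if EITHER condition holds
fake_Ss = demo.loc[english_filter | nan_filter]
# ~ negates the combined filter, keeping everyone else
real_Ss = demo.loc[~(english_filter | nan_filter)]
```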


### Student responses (answer_csv and answer_df) ###

In [21]:
#Process answer.csv file
answer_csv = cor_dir + "answer.csv"
answer_df = pd.read_csv(answer_csv, index_col = 'answer_id')

answer_df.head()
answer_df.tail(10)

answer_id,question_id,anon_id,user_file_id,text,directory,is_doublespaced,is_plagiarized,is_deleted
1,5,eq0,7505,I met my friend Nife while I was studying in a...,,0,0,0
2,5,am8,7506,"Ten years ago, I met a women on the train betw...",,0,0,0
3,12,dk5,7507,In my country we usually don't use tea bags. F...,,0,0,0
4,13,dk5,7507,I organized the instructions by time.,,0,0,0
5,12,ad1,7508,"First, prepare a port, loose tea, and cup.\r\r...",,0,0,0


answer_id,question_id,anon_id,user_file_id,text,directory,is_doublespaced,is_plagiarized,is_deleted
48411,6138,dv8,100847,Early Second Language Education\r\r\r\nSaudi A...,,1,0,0
48412,6138,ce1,100848,Publicly funded health care system\r\r\r\n\r\r...,,0,0,0
48413,6139,fo7,100911,Happiness is the most effective feeling in peo...,,1,0,0
48414,6139,fs9,100912,everyone want to play some games. some people ...,,1,0,0
48415,6139,cl7,100913,Playing a game is fun only when you win?\r\r\r...,,1,0,0
48416,6139,dr8,100914,Many people enjoy a game in their free time. B...,,1,0,0
48417,6137,fv1,100915,\r\r\r\n ...,,0,0,0
48418,6137,fo1,100916,Some patients are suffering from the...,,0,0,0
48419,6119,ge8,100917,My house looks amazing and modern. I decorated...,,0,0,0
48420,6027,ge8,100918,History and Geography a...,,0,0,0


### Other necessary CSVs ###

### Basic info about dataframes ###

#### S_info_df ####
Size:
- There are 941 entries, i.e. students, although at least 9 need to be removed once the filters can be made to work
- There are 20 columns, including information about languages spoken, personal data such as birth year, and learning preferences
- Some columns will likely be removed if deemed unhelpful/unnecessary (e.g. 4th language spoken)
- Some data is normalized, e.g. years of study, but other data was open-ended, resulting in very varied responses

Connection to other dataframes:
- The link to answer_df is anon_id (the index of S_info_df and a column of answer_df)
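
Because the dataframes are kept separate rather than merged, student attributes can be looked up per answer through `anon_id` when needed. The following is a minimal sketch on invented toy frames; the names `answer_L1` and `counts` are hypothetical.

```python
import pandas as pd

# Toy frames invented for illustration, mirroring the real index/column layout
students = pd.DataFrame({"native_language": ["Arabic", "Korean"]},
                        index=pd.Index(["aa1", "bb2"], name="anon_id"))
answers = pd.DataFrame({"anon_id": ["aa1", "bb2", "aa1", "zz9"],
                        "text": ["a", "b", "c", "d"]},
                       index=pd.Index([1, 2, 3, 4], name="answer_id"))

# Look up one student attribute per answer via anon_id; unknown ids give NaN
answer_L1 = answers["anon_id"].map(students["native_language"])

# Count answers per native language without building a merged frame
counts = answer_L1.value_counts()
```

Answers whose `anon_id` has no student record (here `zz9`) come back as NaN, which also gives a quick way to spot unlinked entries.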


In [88]:
S_info_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 941 entries, ez9 to gl8
Data columns (total 20 columns):
gender                       920 non-null object
birth_year                   920 non-null float64
native_language              920 non-null object
language_used_at_home        919 non-null object
language_used_at_home_now    860 non-null object
non_native_language_1        866 non-null object
yrs_of_study_lang1           871 non-null object
study_in_classroom_lang1     871 non-null float64
ways_of_study_lang1          871 non-null object
non_native_language_2        312 non-null object
yrs_of_study_lang2           315 non-null object
study_in_classroom_lang2     871 non-null float64
ways_of_study_lang2          871 non-null object
non_native_language_3        56 non-null object
yrs_of_study_lang3           60 non-null object
study_in_classroom_lang3     871 non-null float64
ways_of_study_lang3          871 non-null object
createddate                  941 non-null object
modifieddate  

In [89]:
answer_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 48384 entries, 1 to 48420
Data columns (total 8 columns):
question_id        48384 non-null int64
anon_id            48353 non-null object
user_file_id       48384 non-null int64
text               47175 non-null object
directory          14 non-null object
is_doublespaced    48384 non-null int64
is_plagiarized     48384 non-null int64
is_deleted         48384 non-null int64
dtypes: int64(5), object(3)
memory usage: 3.3+ MB


### Creating Speaking Answers dataframe ###
- class_id = 3 for speaking classes
- file_type_id = RSAs
- use grep