# Building `PELIC_compiled.csv`

<br>

**Author:** Ben Naismith (bnaismith@pitt.edu)  
**Date:** 9 June 2020

<br>

This notebook is a tutorial for building `PELIC_compiled.csv` from the PELIC corpus files in the [`corpus_files`](https://github.com/ELI-Data-Mining-Group/PELIC-dataset/tree/master/corpus_files) folder. The final csv file is also available in the current repository.

<br>

**Notebook contents:**
- [Reading in necessary files](#Reading-in-necessary-files)
- [Compiling dataframe](#Compiling-dataframe)
- [Writing out `PELIC_compiled`](#Writing-out-PELIC_compiled)
- [`PELIC_compiled` mini demonstration](#PELIC_compiled-mini-demonstration)

In [1]:
# Import necessary modules
import pandas as pd
import pickle as pkl
from ast import literal_eval

## Reading in necessary files

The four necessary csv files are found in the [`corpus_files`](https://github.com/ELI-Data-Mining-Group/PELIC-dataset/tree/master/corpus_files) folder.

- [answer.csv](https://github.com/ELI-Data-Mining-Group/PELIC_dataset/tree/master/corpus_files/answer.csv)
- [course.csv](https://github.com/ELI-Data-Mining-Group/PELIC_dataset/tree/master/corpus_files/course.csv)
- [student_information.csv](https://github.com/ELI-Data-Mining-Group/PELIC_dataset/tree/master/corpus_files/student_information.csv)
- test_scores.csv

In [2]:
# Read in answer.csv

answer_df = pd.read_csv("../corpus_files/answer.csv", index_col = 'answer_id',  # answer_id is unique
                        dtype = {'question_id':'object','version':'object'}, # str not ints
                        converters={'tokens':literal_eval,'tok_lem_POS':literal_eval}) # read in as lists
answer_df.info()
answer_df.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 46230 entries, 1 to 48420
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   question_id   46230 non-null  object
 1   anon_id       46230 non-null  object
 2   course_id     46230 non-null  int64 
 3   version       46230 non-null  object
 4   created_date  46230 non-null  object
 5   text_len      46230 non-null  int64 
 6   text          46230 non-null  object
 7   tokens        46230 non-null  object
 8   tok_lem_POS   46230 non-null  object
dtypes: int64(2), object(7)
memory usage: 3.5+ MB


answer_id,question_id,anon_id,course_id,version,created_date,text_len,text,tokens,tok_lem_POS
1,5,eq0,149,1,2006-09-20 16:11:08,177,I met my friend Nife while I was studying in a...,"[I, met, my, friend, Nife, while, I, was, stud...","[(I, I, PRP), (met, meet, VBD), (my, my, PRP$)..."
2,5,am8,149,1,2006-09-20 22:09:14,137,"Ten years ago, I met a women on the train betw...","[Ten, years, ago, ,, I, met, a, women, on, the...","[(Ten, ten, CD), (years, year, NNS), (ago, ago..."
3,12,dk5,115,1,2006-09-21 10:16:17,63,In my country we usually don't use tea bags. F...,"[In, my, country, we, usually, do, n't, use, t...","[(In, in, IN), (my, my, PRP$), (country, count..."
4,13,dk5,115,1,2006-09-21 10:16:17,6,I organized the instructions by time.,"[I, organized, the, instructions, by, time, .]","[(I, I, PRP), (organized, organize, VBD), (the..."
5,12,ad1,115,1,2006-09-21 10:19:01,59,"First, prepare a port, loose tea, and cup.\nSe...","[First, ,, prepare, a, port, ,, loose, tea, ,,...","[(First, first, RB), (,, ,, ,), (prepare, prep..."
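To make the role of the `dtype` and `converters` arguments concrete, here is a minimal sketch on a hypothetical one-row stand-in for `answer.csv`. The `csv_text` string and the name `demo_df` are invented for illustration and are not PELIC data.

```python
import io
from ast import literal_eval

import pandas as pd

# Hypothetical one-row stand-in for answer.csv (not real PELIC data)
csv_text = '''answer_id,question_id,tokens,tok_lem_POS
1,5,"['I', 'met']","[('I', 'I', 'PRP'), ('met', 'meet', 'VBD')]"
'''

demo_df = pd.read_csv(
    io.StringIO(csv_text),
    index_col='answer_id',
    dtype={'question_id': 'object'},           # keep the id as a string
    converters={'tokens': literal_eval,        # parse the cell into a list
                'tok_lem_POS': literal_eval})  # parse into a list of tuples

print(type(demo_df.loc[1, 'question_id']).__name__)
print(demo_df.loc[1, 'tokens'])
print(demo_df.loc[1, 'tok_lem_POS'][1])
```

`literal_eval` only evaluates Python literals (strings, numbers, tuples, lists and so on), which makes it a safer choice than `eval` for turning the stored text back into lists.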


In [3]:
# Read in course.csv

course_df = pd.read_csv("../corpus_files/course.csv", index_col='course_id')
course_df.info()
print(course_df['level_id'].value_counts())
course_df.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1066 entries, 1 to 1123
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   class_id  1066 non-null   object
 1   level_id  1066 non-null   int64 
 2   semester  1066 non-null   object
 3   section   1066 non-null   object
dtypes: int64(1), object(3)
memory usage: 41.6+ KB
4    402
5    305
3    273
2     86
Name: level_id, dtype: int64


course_id,class_id,level_id,semester,section
1,r,2,2006_spring,A
2,r,3,2006_spring,B
3,r,4,2006_spring,M
4,r,4,2006_spring,P
5,r,4,2006_spring,Q


In [4]:
# Read in student_information.csv

sinfo_df = pd.read_csv("../corpus_files/student_information.csv", index_col = 'anon_id')
sinfo_df.info()
sinfo_df.fillna('', inplace=True) # Replace all NaN with empty strings
sinfo_df.head()

<class 'pandas.core.frame.DataFrame'>
Index: 1313 entries, ez9 to gg7
Data columns (total 20 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   gender                      1313 non-null   object 
 1   birth_year                  913 non-null    float64
 2   native_language             1313 non-null   object 
 3   language_used_at_home       912 non-null    object 
 4   non_native_language_1       859 non-null    object 
 5   yrs_of_study_lang1          863 non-null    object 
 6   study_in_classroom_lang1    863 non-null    object 
 7   ways_of_study_lang1         863 non-null    object 
 8   non_native_language_2       309 non-null    object 
 9   yrs_of_study_lang2          312 non-null    object 
 10  study_in_classroom_lang2    863 non-null    object 
 11  ways_of_study_lang2         311 non-null    object 
 12  non_native_language_3       55 non-null     object 
 13  yrs_of_study_lang3          59 non-nu

anon_id,gender,birth_year,native_language,language_used_at_home,non_native_language_1,yrs_of_study_lang1,study_in_classroom_lang1,ways_of_study_lang1,non_native_language_2,yrs_of_study_lang2,study_in_classroom_lang2,ways_of_study_lang2,non_native_language_3,yrs_of_study_lang3,study_in_classroom_lang3,ways_of_study_lang3,course_history,yrs_of_english_learning,yrs_in_english_environment,age
ez9,Male,1978,Arabic,Arabic,English,more than 5 years,yes,Studied grammar;Worked in pairs/groups;Studied...,Turkish,less than 1 year,no,Studied by myself,,,no,,6;12;18;24;30,1-2 years,3-5 years,27
gm3,Male,1980,Arabic,Arabic,English,more than 5 years,yes,Studied grammar;Had a native-speaker teacher;S...,,,no,,,,no,,6;12;24;30;38,1-2 years,more than 5 years,25
fg5,Male,1938,Nepali,Nepali,English,more than 5 years,yes,Studied grammar;Worked in pairs/groups;Had a n...,French,less than 1 year,yes,Studied grammar;Worked in pairs/groups;Had a n...,Hindi,more than 5 years,no,Studied by myself,18;24,more than 5 years,more than 5 years,66
ce5,Female,1984,Korean,Korean,English,more than 5 years,yes,Studied grammar;Worked in pairs/groups;Had a n...,German,1-2 years,yes,Studied grammar;Studied vocabulary;Listened to...,,,no,,6;12;24;30;38;56,more than 5 years,3-5 years,21
fi7,Female,1982,Korean,Korean;Japanese,English,more than 5 years,yes,Studied grammar;Had a native-speaker teacher;S...,Japanese,less than 1 year,yes,Studied grammar;Studied vocabulary;Listened to...,French,1-2 years,yes,Studied grammar;Studied vocabulary;Listened to...,6;12;24;30;38,less than 1 year,none,23


In [5]:
# Read in test_scores.csv

tests_df = pd.read_csv("../corpus_files/test_scores.csv", index_col = False)
tests_df.info()
tests_df.fillna('', inplace=True) # Replace all NaN with empty strings
tests_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1419 entries, 0 to 1418
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   anon_id           1419 non-null   object 
 1   semester          1407 non-null   object 
 2   LCT_Form          1218 non-null   float64
 3   LCT_Score         1419 non-null   float64
 4   MTELP_Form        1119 non-null   object 
 5   MTELP_I           1419 non-null   float64
 6   MTELP_II          1419 non-null   float64
 7   MTELP_III         1419 non-null   float64
 8   MTELP_Conv_Score  1419 non-null   float64
 9   Writing_Sample    1419 non-null   float64
dtypes: float64(7), object(3)
memory usage: 111.0+ KB


,anon_id,semester,LCT_Form,LCT_Score,MTELP_Form,MTELP_I,MTELP_II,MTELP_III,MTELP_Conv_Score,Writing_Sample
0,aa0,2010_spring,,28.0,,20.0,31.0,13.0,80.0,4.8
1,aa1,2009_spring,1.0,12.0,Q,17.0,17.0,13.0,66.0,3.0
2,aa2,2009_spring,1.0,15.0,Q,9.0,10.0,3.0,38.0,2.3
3,aa3,2012_summer,1.0,28.0,P,23.0,23.0,12.0,73.0,3.2
4,aa5,2010_fall,2.0,13.0,,22.0,15.0,8.0,64.0,3.1


## Compiling dataframe
The `answer_df`, `course_df`, `sinfo_df`, and `tests_df` dataframes created in the previous section are now compiled into a single dataframe called `pelic_df`. Where necessary, functions are created to pull information from the different dataframes.  
Each row represents one text and is accompanied by relevant information about the author and class.
- `answer_df` (basis of pelic_df)
- `course_df` (class type, level, and semester information)
- `sinfo_df` (L1 and gender information)
- `tests_df` (placement test scores)

In [6]:
# Start with answer.csv, the primary source of texts and their info.

pelic_df = answer_df.copy()
pelic_df.head()

answer_id,question_id,anon_id,course_id,version,created_date,text_len,text,tokens,tok_lem_POS
1,5,eq0,149,1,2006-09-20 16:11:08,177,I met my friend Nife while I was studying in a...,"[I, met, my, friend, Nife, while, I, was, stud...","[(I, I, PRP), (met, meet, VBD), (my, my, PRP$)..."
2,5,am8,149,1,2006-09-20 22:09:14,137,"Ten years ago, I met a women on the train betw...","[Ten, years, ago, ,, I, met, a, women, on, the...","[(Ten, ten, CD), (years, year, NNS), (ago, ago..."
3,12,dk5,115,1,2006-09-21 10:16:17,63,In my country we usually don't use tea bags. F...,"[In, my, country, we, usually, do, n't, use, t...","[(In, in, IN), (my, my, PRP$), (country, count..."
4,13,dk5,115,1,2006-09-21 10:16:17,6,I organized the instructions by time.,"[I, organized, the, instructions, by, time, .]","[(I, I, PRP), (organized, organize, VBD), (the..."
5,12,ad1,115,1,2006-09-21 10:19:01,59,"First, prepare a port, loose tea, and cup.\nSe...","[First, ,, prepare, a, port, ,, loose, tea, ,,...","[(First, first, RB), (,, ,, ,), (prepare, prep..."


#### Add to pelic_df the native language of the author of each text
The L1 information is found in `sinfo_df`.

In [7]:
# Create a function to return the L1 based on the anon_id in sinfo_df

def get_native_lang(idstr):
    if idstr in sinfo_df.index:
        return sinfo_df.loc[idstr, 'native_language']
    else: return 'Unknown'

# Test the function
print(sinfo_df.loc['eq0','native_language'])
print(get_native_lang('eq0'))

Arabic
Arabic


In [8]:
# Create a new 'L1' (first language) column using the get_native_lang function

pelic_df['L1'] = pelic_df['anon_id'].apply(get_native_lang)
pelic_df.head()

answer_id,question_id,anon_id,course_id,version,created_date,text_len,text,tokens,tok_lem_POS,L1
1,5,eq0,149,1,2006-09-20 16:11:08,177,I met my friend Nife while I was studying in a...,"[I, met, my, friend, Nife, while, I, was, stud...","[(I, I, PRP), (met, meet, VBD), (my, my, PRP$)...",Arabic
2,5,am8,149,1,2006-09-20 22:09:14,137,"Ten years ago, I met a women on the train betw...","[Ten, years, ago, ,, I, met, a, women, on, the...","[(Ten, ten, CD), (years, year, NNS), (ago, ago...",Thai
3,12,dk5,115,1,2006-09-21 10:16:17,63,In my country we usually don't use tea bags. F...,"[In, my, country, we, usually, do, n't, use, t...","[(In, in, IN), (my, my, PRP$), (country, count...",Turkish
4,13,dk5,115,1,2006-09-21 10:16:17,6,I organized the instructions by time.,"[I, organized, the, instructions, by, time, .]","[(I, I, PRP), (organized, organize, VBD), (the...",Turkish
5,12,ad1,115,1,2006-09-21 10:19:01,59,"First, prepare a port, loose tea, and cup.\nSe...","[First, ,, prepare, a, port, ,, loose, tea, ,,...","[(First, first, RB), (,, ,, ,), (prepare, prep...",Korean
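As an aside, the same lookup can be written without a helper function. The sketch below is an alternative, not the approach this notebook uses, and assumes the lookup table has a unique index. `sinfo_demo` and `texts_demo` are hypothetical miniature stand-ins for `sinfo_df` and `pelic_df`, and the id `zz9` is invented to show the fallback.

```python
import pandas as pd

# Hypothetical miniature versions of sinfo_df and pelic_df
sinfo_demo = pd.DataFrame({'native_language': ['Arabic', 'Thai']},
                          index=pd.Index(['eq0', 'am8'], name='anon_id'))
texts_demo = pd.DataFrame({'anon_id': ['eq0', 'am8', 'zz9']},
                          index=pd.Index([1, 2, 3], name='answer_id'))

# Series.map looks up each anon_id in the index of the given Series;
# ids that are not found become NaN, which fillna turns into 'Unknown'
texts_demo['L1'] = (texts_demo['anon_id']
                    .map(sinfo_demo['native_language'])
                    .fillna('Unknown'))

print(texts_demo)
```

On a table the size of `pelic_df`, a vectorised `map` of this kind is usually faster than calling a Python function once per row with `apply`, although both give the same column here.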


#### Add to pelic_df the level of the student at the time the text was written
The level information is found in `course_df`.

In [9]:
# Create a general function to return information based on the course_id in course_df

def get_course_info(idstr, columnname):
    if idstr in course_df.index:
        return course_df.loc[idstr, columnname]
    else: return 'Unknown'

# Test the function
print(course_df.loc[1,'class_id']) 
print(get_course_info(1, 'class_id'))
print(get_course_info(1, 'level_id'))

r
r
2


In [10]:
# Create new 'class_id', 'level_id', and 'semester' columns in pelic_df using the 'get_course_info' function and lambda

pelic_df['class_id'] = pelic_df['course_id'].apply(lambda x: get_course_info(x, 'class_id'))
pelic_df['level_id'] = pelic_df['course_id'].apply(lambda x: get_course_info(x, 'level_id'))
pelic_df['semester'] = pelic_df['course_id'].apply(lambda x: get_course_info(x, 'semester'))
pelic_df.head(5)

answer_id,question_id,anon_id,course_id,version,created_date,text_len,text,tokens,tok_lem_POS,L1,class_id,level_id,semester
1,5,eq0,149,1,2006-09-20 16:11:08,177,I met my friend Nife while I was studying in a...,"[I, met, my, friend, Nife, while, I, was, stud...","[(I, I, PRP), (met, meet, VBD), (my, my, PRP$)...",Arabic,g,4,2006_fall
2,5,am8,149,1,2006-09-20 22:09:14,137,"Ten years ago, I met a women on the train betw...","[Ten, years, ago, ,, I, met, a, women, on, the...","[(Ten, ten, CD), (years, year, NNS), (ago, ago...",Thai,g,4,2006_fall
3,12,dk5,115,1,2006-09-21 10:16:17,63,In my country we usually don't use tea bags. F...,"[In, my, country, we, usually, do, n't, use, t...","[(In, in, IN), (my, my, PRP$), (country, count...",Turkish,w,4,2006_fall
4,13,dk5,115,1,2006-09-21 10:16:17,6,I organized the instructions by time.,"[I, organized, the, instructions, by, time, .]","[(I, I, PRP), (organized, organize, VBD), (the...",Turkish,w,4,2006_fall
5,12,ad1,115,1,2006-09-21 10:19:01,59,"First, prepare a port, loose tea, and cup.\nSe...","[First, ,, prepare, a, port, ,, loose, tea, ,,...","[(First, first, RB), (,, ,, ,), (prepare, prep...",Korean,w,4,2006_fall


**Note:** If desired, the same approach could be used to include a _created\_date_ or _section_ column in order to focus on development over time.
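When several course columns are wanted at once, a single join on `course_id` is one possible alternative to repeated `apply` calls. The following is a sketch under assumptions: `course_demo` and `texts_demo` are hypothetical miniature tables, and the section letters in them are invented.

```python
import pandas as pd

# Hypothetical miniature course table indexed by course_id
course_demo = pd.DataFrame(
    {'class_id': ['g', 'w'],
     'level_id': [4, 4],
     'semester': ['2006_fall', '2006_fall'],
     'section': ['A', 'B']},                # invented section letters
    index=pd.Index([149, 115], name='course_id'))

# Hypothetical miniature text table indexed by answer_id
texts_demo = pd.DataFrame({'course_id': [149, 115, 115]},
                          index=pd.Index([1, 3, 4], name='answer_id'))

# Left join: match the course_id column against course_demo's index
joined = texts_demo.join(course_demo[['level_id', 'section']], on='course_id')

print(joined)
```

One difference from `get_course_info` is that a course id missing from the course table yields `NaN` rather than the string `'Unknown'`.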

#### Add to pelic_df the gender of the author of each text (when known)
The gender information is found in `sinfo_df`.

In [11]:
# Create a general function to return information based on the anon_id in sinfo_df

def get_user_info(idstr, columnname):
    if idstr in sinfo_df.index:
        return sinfo_df.loc[idstr, columnname]
    else: return 'Unknown'

# Test the function
print(sinfo_df.loc['eq0','gender'])
print(get_user_info('eq0', 'gender'))

Male
Male


In [12]:
# Create a new 'gender' column in pelic_df using the 'get_user_info' function and lambda

pelic_df['gender'] = pelic_df['anon_id'].apply(lambda x: get_user_info(x, 'gender'))
pelic_df.head(5)

answer_id,question_id,anon_id,course_id,version,created_date,text_len,text,tokens,tok_lem_POS,L1,class_id,level_id,semester,gender
1,5,eq0,149,1,2006-09-20 16:11:08,177,I met my friend Nife while I was studying in a...,"[I, met, my, friend, Nife, while, I, was, stud...","[(I, I, PRP), (met, meet, VBD), (my, my, PRP$)...",Arabic,g,4,2006_fall,Male
2,5,am8,149,1,2006-09-20 22:09:14,137,"Ten years ago, I met a women on the train betw...","[Ten, years, ago, ,, I, met, a, women, on, the...","[(Ten, ten, CD), (years, year, NNS), (ago, ago...",Thai,g,4,2006_fall,Female
3,12,dk5,115,1,2006-09-21 10:16:17,63,In my country we usually don't use tea bags. F...,"[In, my, country, we, usually, do, n't, use, t...","[(In, in, IN), (my, my, PRP$), (country, count...",Turkish,w,4,2006_fall,Female
4,13,dk5,115,1,2006-09-21 10:16:17,6,I organized the instructions by time.,"[I, organized, the, instructions, by, time, .]","[(I, I, PRP), (organized, organize, VBD), (the...",Turkish,w,4,2006_fall,Female
5,12,ad1,115,1,2006-09-21 10:19:01,59,"First, prepare a port, loose tea, and cup.\nSe...","[First, ,, prepare, a, port, ,, loose, tea, ,,...","[(First, first, RB), (,, ,, ,), (prepare, prep...",Korean,w,4,2006_fall,Female


#### Add to pelic_df the placement test scores

The `test_scores.csv` file holds one placement test record for each student (anon_id) and semester combination in which a placement test was taken. The score used here is `MTELP_Conv_Score`; texts with no matching record are left as NaN.

In [13]:
pelic_df.head(1)

answer_id,question_id,anon_id,course_id,version,created_date,text_len,text,tokens,tok_lem_POS,L1,class_id,level_id,semester,gender
1,5,eq0,149,1,2006-09-20 16:11:08,177,I met my friend Nife while I was studying in a...,"[I, met, my, friend, Nife, while, I, was, stud...","[(I, I, PRP), (met, meet, VBD), (my, my, PRP$)...",Arabic,g,4,2006_fall,Male


In [14]:
# Create dictionary with placement test score for each (anon_id, semester) tuple in tests_df

tests_df['temp'] = list(zip(tests_df.anon_id, tests_df.semester)) # create temporary column
placement_score_dict = pd.Series(tests_df.MTELP_Conv_Score.values,tests_df.temp).to_dict()
del tests_df['temp'] # delete temporary column

In [15]:
# Map dictionary to create new column in pelic_df

pelic_df['temp'] = list(zip(pelic_df.anon_id, pelic_df.semester)) # create temporary column
pelic_df['placement_test'] = pelic_df.temp.map(placement_score_dict) # map dict to temporary column
del pelic_df['temp'] # delete temporary column
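The two-key lookup above can also be expressed as a left merge on `anon_id` and `semester`. This is offered only as an alternative sketch. `tests_demo` and `texts_demo` are hypothetical miniature tables whose values are copied from rows displayed in this notebook.

```python
import pandas as pd

# Hypothetical miniature versions of tests_df and pelic_df
tests_demo = pd.DataFrame({
    'anon_id': ['fv6', 'ei2'],
    'semester': ['2006_fall', '2006_fall'],
    'MTELP_Conv_Score': [51.0, 57.0]})
texts_demo = pd.DataFrame({
    'anon_id': ['fv6', 'eq0', 'ei2'],
    'semester': ['2006_fall', '2006_fall', '2006_fall']},
    index=pd.Index([11, 1, 13], name='answer_id'))

merged = (texts_demo.reset_index()            # keep answer_id through the merge
          .merge(tests_demo, on=['anon_id', 'semester'],
                 how='left',                  # keep every text, matched or not
                 validate='many_to_one')      # fail if a key repeats in tests_demo
          .rename(columns={'MTELP_Conv_Score': 'placement_test'})
          .set_index('answer_id'))

print(merged)
```

The `validate='many_to_one'` argument makes the merge raise an error if any (anon_id, semester) pair occurs more than once in the score table, whereas a dictionary would silently keep only the last score for a repeated key.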

## Writing out `PELIC_compiled`
Saved as a csv file with optional code for creating a pickle file.

In [16]:
# Reorder columns to show learner info, course info, then text info

pelic_df = pelic_df[['anon_id','L1','gender','semester','placement_test','course_id','level_id','class_id',
                     'question_id','version','text_len','text','tokens','tok_lem_POS']]
pelic_df.head()

answer_id,anon_id,L1,gender,semester,placement_test,course_id,level_id,class_id,question_id,version,text_len,text,tokens,tok_lem_POS
1,eq0,Arabic,Male,2006_fall,,149,4,g,5,1,177,I met my friend Nife while I was studying in a...,"[I, met, my, friend, Nife, while, I, was, stud...","[(I, I, PRP), (met, meet, VBD), (my, my, PRP$)..."
2,am8,Thai,Female,2006_fall,,149,4,g,5,1,137,"Ten years ago, I met a women on the train betw...","[Ten, years, ago, ,, I, met, a, women, on, the...","[(Ten, ten, CD), (years, year, NNS), (ago, ago..."
3,dk5,Turkish,Female,2006_fall,,115,4,w,12,1,63,In my country we usually don't use tea bags. F...,"[In, my, country, we, usually, do, n't, use, t...","[(In, in, IN), (my, my, PRP$), (country, count..."
4,dk5,Turkish,Female,2006_fall,,115,4,w,13,1,6,I organized the instructions by time.,"[I, organized, the, instructions, by, time, .]","[(I, I, PRP), (organized, organize, VBD), (the..."
5,ad1,Korean,Female,2006_fall,,115,4,w,12,1,59,"First, prepare a port, loose tea, and cup.\nSe...","[First, ,, prepare, a, port, ,, loose, tea, ,,...","[(First, first, RB), (,, ,, ,), (prepare, prep..."


In [17]:
# Write out PELIC_compiled as a csv file

pelic_df.to_csv('../PELIC_compiled.csv',index=True, header=True)

In [18]:
# Option to write out as pickle file

pelic_df.to_pickle('../pelic_compiled.pkl')
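One practical difference between the two output formats is how list columns come back when the file is read again. The sketch below uses a hypothetical one-row frame (`demo`) and a temporary directory, so it does not touch the real output files.

```python
import io
import os
import tempfile

import pandas as pd

# Hypothetical one-row frame with a list-valued column
demo = pd.DataFrame({'tokens': [['I', 'met', 'my', 'friend']]},
                    index=pd.Index([1], name='answer_id'))

# CSV round trip: the list is written as text and read back as a string
buffer = io.StringIO()
demo.to_csv(buffer, index=True, header=True)
buffer.seek(0)
from_csv = pd.read_csv(buffer, index_col='answer_id')

# Pickle round trip: Python objects are restored as they were
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'demo.pkl')
    demo.to_pickle(path)
    from_pkl = pd.read_pickle(path)

print(type(from_csv.loc[1, 'tokens']).__name__)  # str
print(type(from_pkl.loc[1, 'tokens']).__name__)  # list
```

When reading `PELIC_compiled.csv`, the `converters` argument with `literal_eval` (as used for `answer.csv` above) restores the `tokens` and `tok_lem_POS` lists. Pickle files should only be loaded from trusted sources.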

## `PELIC_compiled` mini demonstration
The following short example shows how `PELIC_compiled` can be used to apply filters to find a subset of texts, in this case texts by speakers with the following characteristics:
- Korean L1
- Female
- Level 5  

In [19]:
# Using .loc with combined boolean conditions to create a subset of PELIC

subset = pelic_df.loc[(pelic_df.L1 == 'Korean') & (pelic_df.gender == 'Female') & (pelic_df.level_id == 5)]
subset.head()

answer_id,anon_id,L1,gender,semester,placement_test,course_id,level_id,class_id,question_id,version,text_len,text,tokens,tok_lem_POS
132,at8,Korean,Female,2006_fall,,118,5,w,17,1,93,my friend is ANON_NAME_0. she is a my ELI frie...,"[my, friend, is, ANON_NAME_0, ., she, is, a, m...","[(my, my, PRP$), (friend, friend, NN), (is, be..."
134,at8,Korean,Female,2006_fall,,118,5,w,17,2,104,my friend is ANON_NAME_0. she is a my ELI frie...,"[my, friend, is, ANON_NAME_0, ., she, is, a, m...","[(my, my, PRP$), (friend, friend, NN), (is, be..."
135,at8,Korean,Female,2006_fall,,118,5,w,17,3,104,my friend is ANON_NAME_0. she is a my ELI frie...,"[my, friend, is, ANON_NAME_0, ., she, is, a, m...","[(my, my, PRP$), (friend, friend, NN), (is, be..."
145,fn2,Korean,Female,2006_fall,81.0,117,5,w,4,1,271,"1. It has been said, ""Not all learning takes p...","[1, ., It, has, been, said, ,, ``, Not, all, l...","[(1, 1, CD), (., ., .), (It, It, PRP), (has, h..."
152,dj0,Korean,Female,2006_fall,,117,5,w,4,1,299,There are many qualities of a good neighbor in...,"[There, are, many, qualities, of, a, good, nei...","[(There, there, EX), (are, be, VBP), (many, ma..."


In total, 2092 texts match these criteria (`len(subset)`); only the first five are shown above.  
<br>
We may also want to see how many students created these texts and how many texts they wrote on average.

In [20]:
print('There are',len(set(subset.anon_id)), 'students in this subset of PELIC.')
print('On average, these students produced',round(len(subset)/len(set(subset.anon_id)),1),'texts each.')

There are 67 students in this subset of PELIC.
On average, these students produced 31.2 texts each.
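The per-student breakdown behind this average can be inspected with `groupby`. This is a minimal sketch on a hypothetical five-row stand-in (`subset_demo`) built from the first rows of the subset shown above, so its numbers differ from the full subset.

```python
import pandas as pd

# Hypothetical stand-in: the first five rows of the subset shown above
subset_demo = pd.DataFrame({
    'anon_id': ['at8', 'at8', 'at8', 'fn2', 'dj0'],
    'text_len': [93, 104, 104, 271, 299]},
    index=pd.Index([132, 134, 135, 145, 152], name='answer_id'))

# Number of distinct students (same result as len(set(...)))
n_students = subset_demo['anon_id'].nunique()

# Number of texts written by each student
texts_per_student = subset_demo.groupby('anon_id').size()

print(n_students)                          # 3
print(texts_per_student)
print(round(texts_per_student.mean(), 1))  # 1.7
```

Looking at the full distribution rather than only the mean shows whether a few prolific students account for most of the texts.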


We can also check the average length of these texts.

In [21]:
print('The mean text length of this subset is', round(subset.text_len.mean(),1),'words.')

The mean text length of this subset is 94.8 words.


And we can see in which types of classes they wrote them.

In [22]:
print(subset.class_id.value_counts())

r    721
g    679
w    564
l    115
s     13
Name: class_id, dtype: int64


For more detailed tutorials, please see the [tutorials folder](https://github.com/ELI-Data-Mining-Group/PELIC-dataset/tree/master/tutorials) and the description in the [`README.md`](https://github.com/ELI-Data-Mining-Group/PELIC-dataset/blob/master/README.md). For example, you may wish to create concordances of particular linguistic items in this subset, as described in the [`PELIC_concordancing_tutorial`](https://github.com/ELI-Data-Mining-Group/PELIC-dataset/blob/master/tutorials/PELIC_concordancing_tutorial.ipynb). There is also an [`exploratory_data_analysis`](https://github.com/ELI-Data-Mining-Group/PELIC-dataset/blob/master/tutorials/exploratory_data_analysis.ipynb) tutorial, which shows how to probe `PELIC_compiled.csv` for statistics relating to PELIC's composition.

[Back to top](#Building-PELIC_compiled.csv)

In [23]:
pelic_df.loc[~pelic_df.placement_test.isnull()].head(10)

answer_id,anon_id,L1,gender,semester,placement_test,course_id,level_id,class_id,question_id,version,text_len,text,tokens,tok_lem_POS
11,fv6,French,Male,2006_fall,51.0,115,4,w,12,1,88,"To make tea, nothing is easier, even if someti...","[To, make, tea, ,, nothing, is, easier, ,, eve...","[(To, to, TO), (make, make, VB), (tea, tea, NN..."
12,fv6,French,Male,2006_fall,51.0,115,4,w,13,1,8,i organize the instructions by time and import...,"[i, organize, the, instructions, by, time, and...","[(i, i, NN), (organize, organize, VBP), (the, ..."
13,ei2,Chinese,Female,2006_fall,57.0,115,4,w,12,1,60,"First, you should take some hot water, you can...","[First, ,, you, should, take, some, hot, water...","[(First, first, RB), (,, ,, ,), (you, you, PRP..."
14,ei2,Chinese,Female,2006_fall,57.0,115,4,w,13,1,12,I organize the instruction by time. For exampl...,"[I, organize, the, instruction, by, time, ., F...","[(I, I, PRP), (organize, organize, VBP), (the,..."
15,hb4,Korean,Female,2006_fall,48.0,115,4,w,12,1,48,"In my country, make a tea is very easy because...","[In, my, country, ,, make, a, tea, is, very, e...","[(In, in, IN), (my, my, PRP$), (country, count..."
16,hb4,Korean,Female,2006_fall,48.0,115,4,w,13,1,28,Every paragragh's instructions depend on a mai...,"[Every, paragragh, 's, instructions, depend, o...","[(Every, every, DT), (paragragh, paragragh, NN..."
18,ah1,Thai,Male,2006_fall,53.0,115,4,w,12,1,49,"When I want to drink my tea, I have to make it...","[When, I, want, to, drink, my, tea, ,, I, have...","[(When, when, WRB), (I, I, PRP), (want, want, ..."
19,ah1,Thai,Male,2006_fall,53.0,115,4,w,13,1,30,The organization by time is about method to do...,"[The, organization, by, time, is, about, metho...","[(The, the, DT), (organization, organization, ..."
20,bo8,Chinese,Female,2006_fall,59.0,114,4,w,10,1,79,"Failing a test is not easy successful ,but if ...","[Failing, a, test, is, not, easy, successful, ...","[(Failing, fail, VBG), (a, a, DT), (test, test..."
23,di7,Chinese,Male,2006_fall,74.0,115,4,w,12,1,118,To boil water to 95 cent-degree. If you have w...,"[To, boil, water, to, 95, cent, -, degree, ., ...","[(To, to, TO), (boil, boil, VB), (water, water..."
