# Final Data Preparation

After exploring the TAASSC output for the first 100 rows of the PELIC dataset (and seeing how painfully slow the process was), I'm now ready to create the final dataset for the project, generate its syntactic measures with TAASSC, and perform some actual statistical analysis.

## Create Final Dataset

Since TAASSC would take far too long to process the whole dataset, I'm going to work with a much smaller subset of the data (though still considerably larger than the first 100 rows).
Some L1s and proficiency levels also have very few students to begin with, which could skew the results for those L1-proficiency combinations, so I'll only work with the most common L1s and proficiency levels.
This also helps limit the size of the final dataset (and thereby the TAASSC processing time).

In [60]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import random
sns.set(rc={'figure.figsize': (20, 10)})

In [2]:
pelic = pd.read_csv("data/PELIC_compiled.csv")
pelic.head()

Unnamed: 0,answer_id,anon_id,L1,gender,semester,placement_test,course_id,level_id,class_id,question_id,version,text_len,text,tokens,tok_lem_POS
0,1,eq0,Arabic,Male,2006_fall,,149,4,g,5,1,177,I met my friend Nife while I was studying in a...,"['I', 'met', 'my', 'friend', 'Nife', 'while', ...","[('I', 'I', 'PRP'), ('met', 'meet', 'VBD'), ('..."
1,2,am8,Thai,Female,2006_fall,,149,4,g,5,1,137,"Ten years ago, I met a women on the train betw...","['Ten', 'years', 'ago', ',', 'I', 'met', 'a', ...","[('Ten', 'ten', 'CD'), ('years', 'year', 'NNS'..."
2,3,dk5,Turkish,Female,2006_fall,,115,4,w,12,1,63,In my country we usually don't use tea bags. F...,"['In', 'my', 'country', 'we', 'usually', 'do',...","[('In', 'in', 'IN'), ('my', 'my', 'PRP$'), ('c..."
3,4,dk5,Turkish,Female,2006_fall,,115,4,w,13,1,6,I organized the instructions by time.,"['I', 'organized', 'the', 'instructions', 'by'...","[('I', 'I', 'PRP'), ('organized', 'organize', ..."
4,5,ad1,Korean,Female,2006_fall,,115,4,w,12,1,59,"First, prepare a port, loose tea, and cup.\nSe...","['First', ',', 'prepare', 'a', 'port', ',', 'l...","[('First', 'first', 'RB'), (',', ',', ','), ('..."


First, let's take another look at the L1 speaker counts:

In [3]:
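# Count each student once, then tally the number of speakers per L1.
# Naming the index and the counts explicitly gives the same column names
# regardless of the pandas version.
L1_counts = (
    pelic.drop_duplicates("anon_id")["L1"]
    .value_counts()
    .rename_axis("L1")
    .reset_index(name="Count")
)
L1_counts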

Unnamed: 0,L1,Count
0,Arabic,439
1,Chinese,220
2,Korean,214
3,Japanese,67
4,Spanish,57
5,Turkish,40
6,Thai,31
7,Portuguese,17
8,Other,14
9,Italian,12


In [22]:
biggest_L1s = L1_counts[L1_counts["Count"] > 30]["L1"]

print("L1s with more than 30 speakers:")
for L1 in biggest_L1s:
    print(L1)

L1s with more than 30 speakers:
Arabic
Chinese
Korean
Japanese
Spanish
Turkish
Thai


In [38]:
L1_level_counts = pelic.groupby(["L1", "level_id"])["anon_id"].nunique()
L1_level_counts = L1_level_counts.reset_index()
L1_level_counts.rename(columns={"level_id": "Level", "anon_id": "Count"}, inplace=True)

print("Speaker counts by level for L1s with more than 30 speakers:")
biggest_L1_counts = L1_level_counts[
    L1_level_counts["L1"].isin(list(biggest_L1s)) &
    L1_level_counts["Level"].isin([3, 4, 5])
]
biggest_L1_counts

Speaker counts by level for L1s with more than 30 speakers:


Unnamed: 0,L1,Level,Count
1,Arabic,3,244
2,Arabic,4,342
3,Arabic,5,212
7,Chinese,3,86
8,Chinese,4,154
9,Chinese,5,96
33,Japanese,3,21
34,Japanese,4,53
35,Japanese,5,35
37,Korean,3,88


Let's see how many essays there are for these L1s:

In [37]:
len(pelic[pelic["L1"].isin(list(biggest_L1s))])

42146

Okay... so that's still *way* too many to process in a reasonable amount of time.
Instead, I think I'll select a sample of *students* for each combination of L1 and proficiency level and work specifically with the essays from those students.
This should also keep the sample balanced, so that no single L1 or proficiency level dominates the syntactic measures later on.

Note: there may be cases of a student being selected for more than one proficiency level, but I don't think that should be a significant issue.
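
One way to gauge how often this happens is to count the distinct levels each student appears in. The following is only a minimal sketch on a made-up toy frame; `toy`, `levels_per_student`, and `multi_level` are illustrative names and the IDs are not PELIC data.

```python
import pandas as pd

# Toy stand-in for the sampled essays (made-up IDs, not PELIC data).
toy = pd.DataFrame({
    "anon_id": ["aa1", "aa1", "bb2", "cc3", "cc3"],
    "level_id": [3, 4, 3, 5, 5],
})

# Number of distinct proficiency levels each student appears in.
levels_per_student = toy.groupby("anon_id")["level_id"].nunique()

# Students who show up under more than one level.
multi_level = levels_per_student[levels_per_student > 1]
print(multi_level)
```

Running the same two lines on the real sample would show whether the overlap is large enough to matter.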

The list of students for a specific L1 and proficiency level could be created as follows:

In [56]:
pelic[(pelic["L1"] == "Thai") & (pelic["level_id"] == 3)]["anon_id"].unique()

array(['eu0', 'fw3', 'af0', 'ci6', 'cs1', 'ei3', 'be8', 'gy1', 'cf3',
       'gt4'], dtype=object)

In [16]:
len(pelic)

46204

In [71]:
random.seed(69)  # fixed seed so the sample is reproducible

essay_samples = []
for L1 in biggest_L1s:
    for level in [3, 4, 5]:
        # Essays written at this level by speakers of this L1
        group = pelic[(pelic["L1"] == L1) & (pelic["level_id"] == level)]
        # Pick 10 students at random and keep all of their essays at this level
        student_sample = random.sample(list(group["anon_id"].unique()), k=10)
        essay_samples.append(group[group["anon_id"].isin(student_sample)])

essay_samples = pd.concat(essay_samples)
essay_samples.head()

Unnamed: 0,answer_id,anon_id,L1,gender,semester,placement_test,course_id,level_id,class_id,question_id,version,text_len,text,tokens,tok_lem_POS
439,478,bd3,Arabic,Male,2006_fall,,98,3,r,29,1,177,(Mismanagement)\nSentence: … their would be no...,"['(', 'Mismanagement', ')', 'Sentence', ':', '...","[('(', '(', '('), ('Mismanagement', 'Mismanage..."
440,479,bd3,Arabic,Male,2006_fall,,110,3,w,50,1,256,Hospitality in Saudi Arabia\n\n The hospitalit...,"['Hospitality', 'in', 'Saudi', 'Arabia', 'The'...","[('Hospitality', 'hospitality', 'NN'), ('in', ..."
555,595,bd3,Arabic,Male,2006_fall,,98,3,r,29,2,234,Vocabulary:\n\n(Mismanagement)\nSentence: … th...,"['Vocabulary', ':', '(', 'Mismanagement', ')',...","[('Vocabulary', 'Vocabulary', 'JJ'), (':', ':'..."
1056,1101,bd3,Arabic,Male,2006_fall,,110,3,w,99,1,130,"October 17, 2006\n\nDear basma,\n\n I thought ...","['October', '17', ',', '2006', 'Dear', 'basma'...","[('October', 'October', 'NNP'), ('17', '17', '..."
1305,1364,bd3,Arabic,Male,2006_fall,,110,3,w,120,1,143,"October 17, 2006\n\nDear basma,\n\nI thought a...","['October', '17', ',', '2006', 'Dear', 'basma'...","[('October', 'October', 'NNP'), ('17', '17', '..."
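
A caveat on the sampling step: `random.sample` raises a `ValueError` when asked for more items than the population contains, so the loop relies on every L1-level combination having at least 10 students. The sketch below shows one possible guard. `sample_students` is a hypothetical helper, not part of the notebook, and the student IDs are made up.

```python
import random


def sample_students(students, k, rng):
    """Return up to k students chosen at random without replacement."""
    students = list(students)
    # Cap k at the group size so small groups do not raise ValueError.
    return rng.sample(students, k=min(k, len(students)))


rng = random.Random(0)  # separate generator with its own seed
small_group = ["aa1", "bb2", "cc3"]
large_group = [f"s{i}" for i in range(25)]

small_sample = sample_students(small_group, 10, rng)
large_sample = sample_students(large_group, 10, rng)
print(len(small_sample), len(large_sample))
```

The trade-off is that a capped group ends up with fewer than 10 students, so it would be underrepresented rather than causing an error.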


Thankfully, the size of the dataset has been *significantly* reduced:

In [65]:
len(essay_samples)

4341
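
The total alone doesn't show how the essays are spread across L1-level combinations: the number of students per combination is fixed by the sampling, but the number of essays per student presumably varies. A minimal sketch of a balance check follows, using a made-up toy frame (`toy_sample` and `balance` are illustrative names).

```python
import pandas as pd

# Toy stand-in for essay_samples (made-up rows, not PELIC data).
toy_sample = pd.DataFrame({
    "answer_id": [1, 2, 3, 4, 5, 6],
    "anon_id": ["aa1", "aa1", "bb2", "cc3", "dd4", "dd4"],
    "L1": ["Arabic", "Arabic", "Arabic", "Thai", "Thai", "Thai"],
    "level_id": [3, 3, 3, 4, 4, 4],
})

# Students and essays per L1-level combination.
balance = (
    toy_sample.groupby(["L1", "level_id"])
    .agg(students=("anon_id", "nunique"), essays=("answer_id", "count"))
    .reset_index()
)
print(balance)
```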

In [70]:
# for i in range(len(essay_samples)):
#     with open(f"data_samples/text/{essay_samples.iloc[i]['answer_id']}.txt", "w") as f:
#         f.write(essay_samples.iloc[i]["text"])
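
The commented-out cell above writes each essay to its own text file, named after its `answer_id`. A sketch of the same idea with `itertuples` and `pathlib` is shown here. It writes toy data to a temporary directory rather than `data_samples/text/`, and the explicit UTF-8 encoding is an assumption about what the downstream tool expects.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for essay_samples (made-up texts).
toy_essays = pd.DataFrame({
    "answer_id": [478, 479],
    "text": ["First toy essay.", "Second toy essay.\nWith two lines."],
})

# Temporary output directory used only for this illustration.
out_dir = Path(tempfile.mkdtemp()) / "text"
out_dir.mkdir(parents=True, exist_ok=True)

# One file per essay, named <answer_id>.txt
for row in toy_essays.itertuples(index=False):
    (out_dir / f"{row.answer_id}.txt").write_text(row.text, encoding="utf-8")

written = sorted(p.name for p in out_dir.glob("*.txt"))
print(written)
```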

## TAASSC Processing 2: Electric Boogaloo

I'll finish this once TAASSC is done processing the data.