# Arabic Learner Corpus Considerations: Classifier Training and Data Analysis
Anthony Verardi | a.verardi@pitt.edu | 3/17/2020 | University of Pittsburgh

In this Notebook (which is a continuation of [ALC Data Organization](https://github.com/Data-Science-for-Linguists-2020/Arabic-Learner-Corpus-Considerations/blob/master/Notebooks/ALC_Data_Organization.ipynb)), I'll begin the process of analyzing the data obtained from the [Arabic Learner Corpus](https://www.arabiclearnercorpus.com/).

Corpus credit to: Alfaifi, A., Atwell, E. and Hedaya, I. (2014). Arabic Learner Corpus (ALC) v2: A New Written and Spoken Corpus of Arabic Learners. In the proceedings of the Learner Corpus Studies in Asia and the World (LCSAW) 2014, 31 May - 01 Jun 2014. Kobe, Japan. http://www.arabiclearnercorpus.com.

In [1]:
# Importing the packages used in this notebook. The corpus files come in XML format, which
# BeautifulSoup can parse, and glob is for easily working with batches of files at once.
# Here the data are read back in from a pickled DataFrame, and scikit-learn supplies the
# vectorizer, classifiers, and grid search.

import nltk, glob, pickle
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
from bs4 import BeautifulSoup

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline

# Allowing for multiple lines of output
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [3]:
df = pd.read_pickle("ALC_df.pkl")

In [4]:
df.head()

| DocID | L1 | NumLangs | Nationality | Age | Gender | YearsStudy | GenLvl | LvlStdy | Title | Text | Genre | Mode | TextToks | TitleToks | TextLen | TitleLen | TTR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S001_T1_M_Pre_NNAS_W_C | Moore | 4 | Burkina Faso | 20 | Male | 14 | Pre-university | Diploma course | الرحلة إلى القرية لزيارة ذوي القربى | اعتدت الذهاب إلى قريتي في الإجازات الصيفيّة ال... | Narrative | Written | [اعتدت, الذهاب, إلى, قريتي, في, الإجازات, الصي... | [الرحلة, إلى, القرية, لزيارة, ذوي, القربى] | 169 | 6 | 0.798817 |
| S001_T2_M_Pre_NNAS_W_C | Moore | 4 | Burkina Faso | 20 | Male | 14 | Pre-university | Diploma course | الجمع بين العلم الشرعي والعلوم الدنيوية لحمزة ... | أحبّ أن ألتحق بكلِّية الشريعة بعد الانتها من ا... | Discussion | Written | [أحبّ, أن, ألتحق, بكلِّية, الشريعة, بعد, الانت... | [الجمع, بين, العلم, الشرعي, والعلوم, الدنيوية,... | 161 | 9 | 0.84472 |
| S002_T1_M_Pre_NNAS_W_C | Russian | 5 | Russian | 25 | Male | 5 | Pre-university | Diploma course | رحلة الحج المباركة | كتب الله لي أن أحج إلى بيته الحرام السنة الماض... | Narrative | Written | [كتب, الله, لي, أن, أحج, إلى, بيته, الحرام, ال... | [رحلة, الحج, المباركة] | 317 | 3 | 0.637224 |
| S002_T2_M_Pre_NNAS_W_C | Russian | 5 | Russian | 25 | Male | 5 | Pre-university | Diploma course | أكثر من التخصص | الحمد لله الذي وفقني لدراسة شرعية في جامعة الإ... | Discussion | Written | [الحمد, لله, الذي, وفقني, لدراسة, شرعية, في, ج... | [أكثر, من, التخصص] | 173 | 3 | 0.757225 |
| S003_T1_M_Pre_NNAS_W_C | Tatar | 4 | Russian | 24 | Male | 6 | Pre-university | Diploma course | رحلتي إلى الجبال | في أحد الأيام الصيف أخبرنا أبي بسفرٍ إلى الغاب... | Narrative | Written | [في, أحد, الأيام, الصيف, أخبرنا, أبي, بسفرٍ, إ... | [رحلتي, إلى, الجبال] | 133 | 3 | 0.766917 |


This is as far as I think I'm going to get with everything going on right now. The plan is as follows:

* Add a column for language family (I might go back and do this in the Organization notebook instead)
* Run some tests to see how the data are distributed within learner groups (a Shapiro-Wilk normality test might not be needed/useful since there are over 30 observations, but it might still be good to go through)
* Partition data into training, testing, and development sets
* Train a classifier and see if it can reliably distinguish the writing of L1-Arabic learners of Modern Standard Arabic (MSA) from that of non-L1-Arabic learners of MSA
* Try to tease out what the differences ARE between L1-Arabic learners and non-L1-Arabic learners, and which features are useful indicators of these differences
* I'd like to try another tokenizer that handles Arabic morphosyntax more elegantly (the NLTK tokenizer doesn't split words into morphemes, so, for example, for+her is rendered as one token instead of "for" and "her"), but I could use a hand finding/implementing one
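
The partitioning and L1-Arabic vs. non-L1-Arabic steps in the plan above could look roughly like the sketch below. It is only an illustration under assumptions: the `toy` frame stands in for the real DataFrame, the `IsL1Arabic` column name is hypothetical, and the 60/20/20 proportions are one common choice rather than anything fixed by the corpus.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the ALC DataFrame; only "L1" and "Text" columns are assumed.
toy = pd.DataFrame({
    "L1": ["Arabic"] * 10 + ["Urdu"] * 6 + ["Russian"] * 4,
    "Text": [f"text {i}" for i in range(20)],
})

# Hypothetical binary target: True for L1-Arabic writers, False otherwise.
toy["IsL1Arabic"] = toy["L1"].eq("Arabic")

# Carve off 20% as the test set, then split the remainder 75/25 into
# training and development sets (60/20/20 overall). Stratifying keeps the
# share of L1-Arabic texts roughly equal in every partition.
train_dev, test = train_test_split(
    toy, test_size=0.2, stratify=toy["IsL1Arabic"], random_state=0)
train, dev = train_test_split(
    train_dev, test_size=0.25, stratify=train_dev["IsL1Arabic"], random_state=0)

print(len(train), len(dev), len(test))
```

One caveat with the real data: many learners contributed more than one text (for example `S001_T1` and `S001_T2`), so a split by document can put the same writer in both the training and test sets. Splitting by learner instead may be worth considering.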

In [9]:
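# Note: without parentheses this returns the bound method, whose repr prints the
# whole frame (as below); df.info() would give the per-column summary instead.
df.info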

<bound method DataFrame.info of                              L1  NumLangs   Nationality Age  Gender  \
DocID                                                                 
S001_T1_M_Pre_NNAS_W_C    Moore         4  Burkina Faso  20    Male   
S001_T2_M_Pre_NNAS_W_C    Moore         4  Burkina Faso  20    Male   
S002_T1_M_Pre_NNAS_W_C  Russian         5       Russian  25    Male   
S002_T2_M_Pre_NNAS_W_C  Russian         5       Russian  25    Male   
S003_T1_M_Pre_NNAS_W_C    Tatar         4       Russian  24    Male   
...                         ...       ...           ...  ..     ...   
S939_T1_F_Uni_NNAS_S_C  Swahili         3      Comorian  23  Female   
S939_T2_F_Uni_NNAS_S_C  Swahili         3      Comorian  23  Female   
S940_T1_M_Pre_NNAS_S_C   Yoruba         3      Nigerian  26    Male   
S941_T1_M_Pre_NNAS_S_C     Urdu         3      Nepalese  25    Male   
S942_T1_M_Uni_NAS_S_C    Arabic         1         Saudi  23    Male   

                        YearsStudy          

In [7]:
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline

In [5]:
mlp = MLPClassifier(solver='lbfgs', alpha=1e-5, random_state=1)
tfIdf = TfidfVectorizer(min_df=2, max_df=.5)

pipe = Pipeline(steps=[('tfIdf', tfIdf),('mlp',mlp)])

# Trying two different solvers, one of which (lbfgs) is supposed to be quicker and better for smaller
# data sets
clf = GridSearchCV(pipe, param_grid = {"tfIdf__max_features":[2000, 5000],
                                      "mlp__solver":('adam','lbfgs')}, cv=5, return_train_score=True)

clf.fit(df.Text, df.L1)



GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=Pipeline(memory=None,
                                steps=[('tfIdf',
                                        TfidfVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.float64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=0.5,
                                                        max_features=None,
                                                        min_df=2,
                                                        ngram_range=(1, 1),
                                         

In [8]:
SVC_model = make_pipeline(TfidfVectorizer(max_features = 3000), SVC(kernel='rbf', C=1E5))
SVC_model.fit(df.Text, df.L1)



Pipeline(memory=None,
         steps=[('tfidfvectorizer',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=3000,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, use_idf=True,
                                 vocabulary=None)),
                ('svc',
                 SVC(C=100000.0, cache_size=200, class_weight=None, coef0=0.0,
                     decision_function_shape='ovr', degree

In [10]:
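from sklearn.metrics import accuracy_score, confusion_matrix

# Predicted L1 labels for the same texts the model was trained on
labels = SVC_model.predict(df.Text)

# Confusion matrix of true vs. predicted L1
matrix = confusion_matrix(df.L1, labels)

# Accuracy on the training data itself, so this is not an estimate of
# how well the model generalizes to unseen texts
accuracy = accuracy_score(df.L1, labels)
accuracy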

1.0
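
An accuracy of 1.0 here comes from scoring the model on the texts it was fitted to. A minimal sketch of a held-out evaluation is given below for comparison. The toy `texts` and `labels` are made up for illustration (in the notebook they would be `df.Text` and `df.L1`), and the 75/25 split is an assumption.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy stand-in corpus (hypothetical data, not from the ALC).
texts = ["red apple pie", "green apple tart", "red berry jam", "sweet apple jam",
         "fast blue car", "slow blue truck", "fast red car", "old blue van"] * 3
labels = (["food"] * 4 + ["vehicle"] * 4) * 3

# Hold out a quarter of the texts; the model never sees them during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0)

# Same kind of pipeline as in the notebook: TF-IDF features into an RBF SVC.
model = make_pipeline(TfidfVectorizer(), SVC(kernel="rbf", C=1e5))
model.fit(X_train, y_train)

# Compare accuracy on seen vs. unseen texts.
train_acc = accuracy_score(y_train, model.predict(X_train))
test_pred = model.predict(X_test)
test_acc = accuracy_score(y_test, test_pred)
cm = confusion_matrix(y_test, test_pred)

print("train accuracy:", train_acc)
print("test accuracy:", test_acc)
print(cm)
```

A large gap between the two accuracies would suggest that the model is memorizing the training texts rather than learning features that carry over to new writers.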

In [16]:
svc = sklearn.svm.SVC(kernel='rbf', C=1E5)
tfIdf = TfidfVectorizer(min_df=2, max_df=.5)

pipe = Pipeline(steps=[('tfIdf', tfIdf),('svc',svc)])

clf = GridSearchCV(pipe, param_grid = {"tfIdf__max_features":[2000, 5000]}, cv=3, return_train_score=True)

clf.fit(df.Text, df.L1)



GridSearchCV(cv=3, error_score='raise-deprecating',
             estimator=Pipeline(memory=None,
                                steps=[('tfIdf',
                                        TfidfVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.float64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=0.5,
                                                        max_features=None,
                                                        min_df=2,
                                                        ngram_range=(1, 1),
                                         

In [17]:
clf.best_params_

{'tfIdf__max_features': 5000}

In [18]:
df2 = pd.DataFrame.from_dict(clf.cv_results_)
df2

|  | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_tfIdf__max_features | params | split0_test_score | split1_test_score | split2_test_score | mean_test_score | std_test_score | rank_test_score | split0_train_score | split1_train_score | split2_train_score | mean_train_score | std_train_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3.987985 | 0.313749 | 0.63283 | 0.056506 | 2000 | {'tfIdf__max_features': 2000} | 0.528777 | 0.56926 | 0.579681 | 0.55836 | 0.022147 | 2 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 |
| 1 | 6.046199 | 0.709656 | 0.850659 | 0.062294 | 5000 | {'tfIdf__max_features': 5000} | 0.534173 | 0.578748 | 0.585657 | 0.5653 | 0.023049 | 1 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 |
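
The results above show a mean training score of 1.0 against a mean test score of roughly 0.56, which points to overfitting. The sketch below is one way to make that gap explicit in the `cv_results_` table and to include the SVC's `C` in the search. The toy data, the `train_test_gap` column name, and the grid values are all assumptions for illustration, not tuned settings.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Toy stand-in corpus (hypothetical data, not from the ALC).
texts = ["red apple pie", "green apple tart", "red berry jam", "sweet apple jam",
         "fast blue car", "slow blue truck", "fast red car", "old blue van"] * 3
labels = (["food"] * 4 + ["vehicle"] * 4) * 3

pipe = Pipeline(steps=[("tfidf", TfidfVectorizer()), ("svc", SVC(kernel="rbf"))])

# Search over vocabulary size and the SVC regularization strength C.
# A smaller C regularizes more strongly than the 1e5 used in the notebook.
param_grid = {"tfidf__max_features": [5, 10], "svc__C": [1.0, 1e5]}
search = GridSearchCV(pipe, param_grid=param_grid, cv=3, return_train_score=True)
search.fit(texts, labels)

# One row per parameter combination, with the train/test gap made explicit.
results = pd.DataFrame(search.cv_results_)
results["train_test_gap"] = results["mean_train_score"] - results["mean_test_score"]
summary = results[["param_tfidf__max_features", "param_svc__C",
                   "mean_train_score", "mean_test_score",
                   "train_test_gap"]].sort_values("train_test_gap")
print(summary)
```

Sorting by the gap puts the least overfit settings first, though the mean test score is still the number to compare when choosing between them.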
