# Arabic Learner Corpus Considerations: Data Organization and Cleaning
Anthony Verardi | a.verardi@pitt.edu | 2/24/2020 | University of Pittsburgh

The following notebook reads in textual data and metadata from the [Arabic Learner Corpus](https://www.arabiclearnercorpus.com/) for further exploration.

In [1]:
# Importing necessary packages to begin reading in our data. The files come in XML format,
# so we'll need to import a library that can read them in and get the data ready for input into a DataFrame

import nltk, glob, re
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup

# Allowing for multiple lines of output
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [2]:
# For toggling pretty printing off/on
%pprint

Pretty printing has been turned OFF


In [3]:
# Setting up our corpus directory. For now, I'll be using a smaller test data set while I figure
# out how to get things ready to go, but let's set a path to the full dataset while we're at it
corDirFull = '../../Arabic-Learner-Corpus-Considerations/private/ALC Texts/'
corDirTest = '../../Arabic-Learner-Corpus-Considerations/test_data/'

#C:\Users\Anthony\Documents\dataSci2020\Arabic-Learner-Corpus-Considerations\private\ALC Texts

In [4]:
# Read the test file as text and pass its contents to BeautifulSoup's XML parser
with open(corDirTest + "TEST.xml", "r") as infile:
    contents = infile.read()
soup = BeautifulSoup(contents, 'xml')
print(soup)

<?xml version="1.0" encoding="utf-8"?>



Right now, I can only get `BeautifulSoup` to read in the very first line of the test XML file. I followed a brief tutorial that used a toy XML file, and it parsed that just fine, so I'm not sure what the issue is here. I created `TEST.xml` from the first item in the actual corpus to experiment with, but deleting parts of it did not fix the parsing issue. I need to sort this out in order to move forward, but so far nothing I've tried has helped.
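One thing worth ruling out here is a text-encoding mismatch. The raw dump of `S001_T1_M_Pre_NNAS_W_C.xml` further down in this notebook starts with `ÿþ` and has `\x00` after every character, which is what a UTF-16 little-endian file with a byte-order mark (BOM) looks like when it is decoded one byte at a time. If `TEST.xml` was copied from that file, it may share the encoding. The sketch below is a diagnostic only, not a confirmed fix: `sniff_bom` is a hypothetical helper name, and the sample bytes stand in for what reading the first few bytes of the file in binary mode would return.

```python
import codecs

# Byte-order marks to check, longest first so UTF-32 LE is not mistaken for UTF-16 LE
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

def sniff_bom(raw):
    """Return a codec name suggested by a leading BOM, or None if there is none."""
    for bom, codec in BOMS:
        if raw.startswith(bom):
            return codec
    return None

# Stand-in for: raw = open(corDirTest + "TEST.xml", "rb").read(4)
raw = codecs.BOM_UTF16_LE + '<?xml version="1.0"?>'.encode("utf-16-le")
print(sniff_bom(raw))
```

If this reports `utf-16` for a corpus file, passing `encoding='utf-16'` to `open` would be the next thing to try before handing the text to `BeautifulSoup`.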

The goal right now is to get the XML data loaded into a `DataFrame` object in `pandas` so that I can do some statistical analyses on the learner samples. At the very least, let's try to get an idea of how many files we're working with here. We'll switch over to the full dataset for this, the one whose path we assigned to `corDirFull` earlier.

In [5]:
essay_fnames = glob.glob(corDirFull+'*.xml')
essay_fnames[0]

'../../Arabic-Learner-Corpus-Considerations/private/ALC Texts\\S001_T1_M_Pre_NNAS_W_C.xml'

Aside from showing a doubled backslash (`\\`) before the filename where the rest of the path uses forward slashes (`/`), `glob` is returning the filenames just fine. The doubled backslash is only how Python displays a single Windows path separator inside a string. It'd be better to trim the names down to eliminate the path, though. Let's do just that, using the fact that each filename starts with an S to our advantage. We may as well also try to read in the XML content, although it won't be as useful to us in this raw form.

In short, this regular expression matches the whole path, captures the filename itself (all of which follow the same naming convention), and keeps only the captured part.

In [6]:
# We need this to hold the filenames and their associated XML texts
essay_dict = {}

for fname in essay_fnames:
    # Keep only the captured filename; \1 refers to the text matched inside ()
    fname_short = re.sub(r'\S+\s+\S+(S\d{3}_T[12]_[MF]_\S{3}_NN?AS_[WS]_[CH]\.xml)$', r'\1', fname)
    # print(fname_short)      # print short name for checking
    with open(fname) as f:    # close each file as soon as it has been read
        essay_dict[fname_short] = f.read()
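A regex-free way to get the same short name is to let the standard library split the path. This is an alternative sketch, not what the cell above does. It uses `pathlib.PureWindowsPath` because that class treats both `/` and `\` as separators regardless of the operating system running the code.

```python
from pathlib import PureWindowsPath

# Same mixed-separator path that glob returned above
fname = '../../Arabic-Learner-Corpus-Considerations/private/ALC Texts\\S001_T1_M_Pre_NNAS_W_C.xml'

# .name is the final path component, i.e. the filename without any directories
fname_short = PureWindowsPath(fname).name
print(fname_short)
```

Unlike the regex, this does not check that the name follows the corpus naming convention, so a stray file in the directory would pass through unnoticed.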

Success! Now let's see what we've grabbed...

In [7]:
essay_dict['S001_T1_M_Pre_NNAS_W_C.xml']

'ÿþ<\x00?\x00x\x00m\x00l\x00 \x00v\x00e\x00r\x00s\x00i\x00o\x00n\x00=\x00"\x001\x00.\x000\x00"\x00?\x00>\x00\n\x00\n\x00<\x00!\x00-\x00-\x00A\x00r\x00a\x00b\x00i\x00c\x00 \x00L\x00e\x00a\x00r\x00n\x00e\x00r\x00 \x00C\x00o\x00r\x00p\x00u\x00s\x00_\x00v\x002\x00_\x002\x000\x001\x004\x00-\x00-\x00>\x00\n\x00\n\x00<\x00!\x00D\x00O\x00C\x00T\x00Y\x00P\x00E\x00 \x00d\x00o\x00c\x00 \x00[\x00\n\x00\n\x00<\x00!\x00E\x00L\x00E\x00M\x00E\x00N\x00T\x00 \x00d\x00o\x00c\x00 \x00(\x00h\x00e\x00a\x00d\x00e\x00r\x00,\x00t\x00e\x00x\x00t\x00)\x00>\x00\n\x00\n\x00<\x00!\x00A\x00T\x00T\x00L\x00I\x00S\x00T\x00 \x00d\x00o\x00c\x00\n\x00\n\x00I\x00D\x00 \x00I\x00D\x00 \x00#\x00R\x00E\x00Q\x00U\x00I\x00R\x00E\x00D\x00\n\x00\n\x00>\x00\n\x00\n\x00<\x00!\x00E\x00L\x00E\x00M\x00E\x00N\x00T\x00 \x00h\x00e\x00a\x00d\x00e\x00r\x00 \x00(\x00l\x00e\x00a\x00r\x00n\x00e\x00r\x00_\x00p\x00r\x00o\x00f\x00i\x00l\x00e\x00,\x00t\x00e\x00x\x00t\x00_\x00p\x00r\x00o\x00f\x00i\x00l\x00e\x00)\x00>\x00\n\x00\n\x00<\x00!\x00E\x00L

Yikes! Not exactly human (or really even machine) readable. Well, let's at least see how many files there are as a test of my regex abilities.
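The `ÿþ` prefix and the `\x00` after every character suggest that the bytes are UTF-16 little-endian with a byte-order mark, read through a single-byte codec. The sketch below reproduces the effect on a toy document and then decodes it properly. It assumes the corpus files really are UTF-16, which the output above hints at but does not prove for every file. Here `latin-1` stands in for whatever default codec `open` used, and the `<doc>` content is invented; only the element names come from the DTD fragment visible above.

```python
import xml.etree.ElementTree as ET

# Toy document: element names follow the DTD fragment above, the content is invented
original = '<?xml version="1.0"?>\n<doc ID="S001"><header/><text>sample</text></doc>'

# Bytes as they might sit on disk: BOM (FF FE) followed by UTF-16 LE code units
raw = b'\xff\xfe' + original.encode('utf-16-le')

# Decoding one byte per character reproduces the garbled look
garbled = raw.decode('latin-1')
print(repr(garbled[:12]))

# Decoding as UTF-16 consumes the BOM and restores the text
fixed = raw.decode('utf-16')
root = ET.fromstring(fixed)
print(root.tag, root.find('text').text)
```

In the loading loop, the equivalent change would be `open(fname, encoding='utf-16')`.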

In [8]:
len(essay_dict)

1585

At least something in this endeavor has gone to plan! So right now we know, at the very least, that we're working with 1,585 files in this corpus, and we have a way of getting their file IDs. All that's left is to crack the mysteries of `BeautifulSoup` and get the relevant data into `pandas`.
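Since every filename follows the same pattern, the filenames themselves can seed the first metadata columns. The sketch below splits a name into fields with named groups. The group names (`student`, `text_no`, `gender`, `level`, `nativeness`, `mode`, `place`) are guesses at what each slot means and should be checked against the corpus documentation; `parse_fname` is a hypothetical helper.

```python
import re

# Same pattern as the regex used earlier, with a named group per slot.
# The group names are assumptions about what each slot encodes.
FNAME_RE = re.compile(
    r'(?P<student>S\d{3})_(?P<text_no>T[12])_(?P<gender>[MF])_'
    r'(?P<level>\S{3})_(?P<nativeness>NN?AS)_(?P<mode>[WS])_(?P<place>[CH])\.xml$'
)

def parse_fname(fname_short):
    """Return a dict of filename fields, or None if the name does not fit the pattern."""
    match = FNAME_RE.match(fname_short)
    return match.groupdict() if match else None

rows = [parse_fname(name) for name in ['S001_T1_M_Pre_NNAS_W_C.xml']]
print(rows[0])
```

A list of such dicts can be passed to `pd.DataFrame(rows)` to get one row per essay, with the essay text added as another column once the parsing issue is resolved.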