# Preprocessing

Preprocessing is about transforming the data of the corpus into a usable format. Currently, the korpus data is stored in multiple text files in different folders (Academic, Culture, ...). These text files will be joined and transformed into a .csv file, because .csv files can easily be loaded into a Pandas DataFrame, which is optimised for data analytics.

The original korpus is over 6 GB in size. Most of this data comes from the Parliament and Law subjects. In order to reduce complexity, only a subset of the data in these subjects will be considered. Some of the data left out will be used as a test corpus.

The test corpus only features 5 sentences taken from the Parliament section. This is because higher sentences produce

In [2]:
import os
import re
import pandas as pd
import seaborn as sns

In [3]:
folder = os.path.join(os.getcwd(),'..','data','korpus')
subjects = ['Academic','Culture','European','Law','News',
            'Opinion','Parliament','Religion','Sport','Test']

## Removing XML tags from <i>korpus</i>

The text files contain different XML tags that indicate the start of sentences, paragraphs etc. These XML tags create complexity within the dataset. We only really need the start and end of sentences, since they will be further utilised in Text Generation.

The following regex is applied to every text file. It removes all XML tags from the korpus except <b>\<s\></b> and its closing tag.

In [3]:
regex = re.compile(r'<(?![/]?s).*?>')
def clean(text):
    return regex.sub('', text)
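
As a quick sanity check, the sketch below applies the same pattern to a made-up line (the sample string and its tag names are assumptions, not taken from the korpus). It also shows that the lookahead keeps any tag whose name merely starts with `s`. The `strict_regex` variant is a hypothetical alternative that would only be needed if such tags actually occur in the data.

```python
import re

# Same pattern as in the cell above: drop every tag unless it starts with "s" or "/s".
regex = re.compile(r'<(?![/]?s).*?>')

# Hypothetical stricter variant: "\b" requires the tag name to be exactly "s".
strict_regex = re.compile(r'<(?!/?s\b).*?>')

# Made-up sample line, for illustration only.
sample = '<p id="1"><s id="7">Dan test</s><span>x</span></p>'

cleaned = regex.sub('', sample)
strict_cleaned = strict_regex.sub('', sample)

print(cleaned)
print(strict_cleaned)
```

With the original pattern the `<span>` tags survive alongside the sentence markers, while the stricter variant removes them.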

In [4]:
for sbj in subjects:
    path = os.path.join(folder,sbj)
    
    for filename in os.listdir(path):
        filepath = os.path.join(path,filename)

        with open(filepath, 'r', encoding='utf-8') as f:
            print(filename)
            text = f.read()
        
        with open(filepath, 'w', encoding='utf-8') as f:
            f.write(clean(text))
    
    

Academic.csv
Academic1.txt
Academic2.txt
Culture.csv
Culture1.txt
Culture2.txt
European.csv
European.txt
Law.csv
Law1.txt
News.csv
News1.txt
News2.txt
News3.txt
News4.txt
News5.txt
News6.txt
News8.txt
Opinion.csv
Opinion1.txt
Opinion2.txt
Parliament.csv
Parliament2.txt
Religion.csv
Religion1.txt
Religion2.txt
Sport.csv
Sport1.txt
Sport2.txt
Test.csv
Test.txt


## Turning text files into a dataset

Next, we will combine all the text files pertaining to a subject into a single csv file.

In [5]:
for sbj in subjects:
    data = []
    path = os.path.join(folder,sbj)
    
    for filename in os.listdir(path):
        filepath = os.path.join(path,filename)
        
        with open(filepath,'r', encoding='utf-8') as f:
            print(filename)
            lines = f.readlines()
            
            for l in lines:
                d = l.split('\t')
                
                #Start and End Token
                if len(d) == 1:
                    if re.search(r'<s id="[0-9]*">', d[0]): data.append(['<s>','START',None,None])
                    elif re.search(r'</s>', d[0]):          data.append(['</s>','END',None,None])
                
                elif len(d) > 1:
                    d[-1] = d[-1].rstrip('\n') #Strip the trailing newline, if any
                    data.append(d)
                    
    df = pd.DataFrame(data, columns=['Word','POS','Lemma','Root'])
    df.to_csv(os.path.join(path,sbj+'.csv'), index=False)

Academic.csv
Academic1.txt
Academic2.txt
Culture.csv
Culture1.txt
Culture2.txt
European.csv
European.txt


KeyboardInterrupt: 
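
The per-line logic above can be isolated into a small helper, which makes it easy to try on a few lines before running over the whole korpus. This is a minimal sketch: `parse_line` and the sample lines are hypothetical, and it assumes, as the loop above does, that token lines hold four tab-separated fields. Restricting the loop to files for which `filename.endswith('.txt')` is true would also keep a previously generated csv from being read back in as input.

```python
import re

START_RE = re.compile(r'<s id="[0-9]*">')
END_RE = re.compile(r'</s>')

def parse_line(line):
    """Return a [Word, POS, Lemma, Root] row, or None if the line is skipped."""
    fields = line.rstrip('\n').split('\t')
    if len(fields) == 1:
        # Lines without tabs are either sentence markers or noise.
        if START_RE.search(fields[0]):
            return ['<s>', 'START', None, None]
        if END_RE.search(fields[0]):
            return ['</s>', 'END', None, None]
        return None
    return fields

# Made-up lines in the assumed token format (word, POS, lemma, root).
sample_lines = ['<s id="1">\n', 'kiteb\tVERB\tkiteb\tk-t-b\n', '</s>\n', '\n']
rows = [row for row in map(parse_line, sample_lines) if row is not None]
print(rows)
```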

## Joining individual csv files

Now, the individual csv files will be concatenated together. The result is a massive spreadsheet which encodes all of the corpus's information. This is done so we can easily load the entire korpus into a Pandas DataFrame.

We will create 2 versions of korpus.csv. One will simply be all the csv files together, while the other will read an equal number of bytes from each file. This is so the frequency counts won't be biased towards the subjects with the most text.

Note that we do not include the test file in the training korpus.

In [None]:
subjects = ['Academic','Culture','European','Law','News',
            'Opinion','Parliament','Religion','Sport',]

with open(os.path.join(folder,'korpus.csv'), 'w', encoding='utf-8') as f1:
    for sbj in subjects:
        with open(os.path.join(folder,sbj,sbj+'.csv'), 'r', encoding='utf-8') as f2:
            f1.write('\n')
            f1.write(f2.read())
            print(f'{sbj} Finished')

print('All Finished\n')


min_size = min([os.path.getsize(os.path.join(folder,sbj,sbj+'.csv')) for sbj in subjects])

#Read at most min_size characters from each csv file (min_size is the size in bytes of the smallest file).

with open(os.path.join(folder,'norm_korpus.csv'), 'w', encoding='utf-8') as f1:
    for sbj in subjects:
        with open(os.path.join(folder,sbj,sbj+'.csv'), 'r', encoding='utf-8') as f2:    
            f1.write('\n')
            f1.write(f2.read(min_size))
            print(f'{sbj} Finished')

print('All Finished')


We can confirm that the csv files have been joined by summing the size of the individual files and comparing it to the aggregate one.
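
One way to carry out this check is sketched below. `check_joined_size` is a hypothetical helper that assumes the folder layout used above (`<folder>/<subject>/<subject>.csv`). Because the joining loop writes a newline before each file, the aggregate is expected to be slightly larger than the sum of its parts, by roughly one or two bytes per subject depending on the platform's line ending.

```python
import os

def check_joined_size(folder, subjects, joined_name='korpus.csv'):
    """Return (sum of per-subject csv sizes, size of the joined csv) in bytes."""
    parts = sum(os.path.getsize(os.path.join(folder, sbj, sbj + '.csv'))
                for sbj in subjects)
    joined = os.path.getsize(os.path.join(folder, joined_name))
    return parts, joined

# Example usage (folder and subjects as defined earlier in the notebook):
# parts, joined = check_joined_size(folder, subjects)
# print(parts, joined, joined - parts)
```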

## Exploring Dataset

Now that the corpus is in an accessible format, we can load it into a Pandas DataFrame.
We will not use all the information in the corpus. We only need the Word and POS columns, so Lemma and Root can be omitted.

In [4]:
%%time

df_all = pd.read_csv(os.path.join(folder,'korpus.csv'),
                 usecols=["Word","POS"],
                 dtype={"Word": "U","POS": "S"},
                 nrows=50_000_000)

df_norm = pd.read_csv(os.path.join(folder,'norm_korpus.csv'),
                      usecols=["Word","POS"],
                      dtype={"Word": "U","POS": "S"})

df_test = pd.read_csv(os.path.join(folder,'Test','Test.csv'),
                      usecols=["Word","POS"],
                      dtype={"Word": "U","POS": "S"})


df = [df_all,df_norm,df_test]



df_all.head(10)

Wall time: 22.4 s


Unnamed: 0,Word,POS
0,<s>,START
1,L-,DEF
2,għan,NOUN
3,prinċipali,ADJ
4,ta',GEN
5,Conectando,NOUN-PROP
6,Mundos,NOUN-PROP
7,(,X-PUN
8,Malta,NOUN-PROP
9,),X-PUN


### Cleaning

The raw corpus is not an accurate representation of the Maltese language. The POS tags indicate the context of words within the sentence, so we will use them to clean the corpus by omitting certain types of words.

The Maltese Tagset can be found here: https://mlrs.research.um.edu.mt/resources/malti03/tagset30.html

Gibberish words need to be removed as they do not form part of coherent speech.
Interjections break the flow of sentences, so they are removed as well.
Punctuation is not necessary to encode valid sentences. Note: The end of a sentence is represented by \</s\>.

English words were deliberately left in since they can be incorporated in a Maltese sentence and removing them would break the sentence's flow.

In [5]:
df_all['POS'].unique()

array(['START', 'DEF', 'NOUN', 'ADJ', 'GEN', 'NOUN-PROP', 'X-PUN',
       'X-ABV', 'PREP', 'CONJ-CORD', 'PART-PASS', 'PREP-DEF', 'PRON-PERS',
       'COMP', 'VERB', 'END', 'LIL-DEF', 'CONJ-SUB', 'KIEN', 'GEN-PRON',
       'ADV', 'VERB-PSEU', 'NEG', 'GEN-DEF', 'QUAN', 'PRON-DEM', 'X-DIG',
       'PRON-INT', 'FOC', 'PREP-PRON', 'NUM-WHD', 'LIL', 'NUM-CRD',
       'X-ENG', 'X-FOR', 'PROG', 'INT', 'X-BOR', 'PRON-PERS-NEG',
       'LIL-PRON', 'PRON-INDEF', 'PRON-DEM-DEF', 'NUM-ORD', 'HEMM',
       'PRON-REF', 'PART-ACT', 'FUT', 'NUM-FRC', 'PRON-REC', 'POS'],
      dtype=object)

In [6]:
for i in range(len(df)):
    df[i] = df[i].dropna()

In [7]:
%%time
#Maltese Tagset: https://mlrs.research.um.edu.mt/resources/malti03/tagset30.html

for i in range(len(df)):
    df[i] = df[i].drop(df[i][df[i]['POS']=='X-PUN'].index) #Punctuation
    df[i] = df[i].drop(df[i][df[i]['POS']=='X-BOR'].index) #Gibberish
    df[i] = df[i].drop(df[i][df[i]['POS']=='INT'].index)   #Interjections


Wall time: 33.3 s
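
The three `drop` calls can also be written as a single boolean mask, so that each frame is filtered in one pass. The sketch below uses a small made-up frame; `UNWANTED_TAGS` and `drop_unwanted` are hypothetical names, and the tags are the ones removed in the cell above.

```python
import pandas as pd

# POS tags removed in the cell above: X-PUN, X-BOR and INT.
UNWANTED_TAGS = ['X-PUN', 'X-BOR', 'INT']

def drop_unwanted(frame, tags=UNWANTED_TAGS):
    """Return a copy of `frame` without the rows whose POS is in `tags`."""
    return frame[~frame['POS'].isin(tags)].copy()

# Made-up rows for illustration.
toy = pd.DataFrame({'Word': ['<s>', 'Malta', '(', 'eh', '</s>'],
                    'POS': ['START', 'NOUN-PROP', 'X-PUN', 'INT', 'END']})
filtered = drop_unwanted(toy)
print(filtered)
```

In the notebook this could be applied as `df[i] = drop_unwanted(df[i])` inside the same loop.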


### Removing semi-colons

Semicolons will be used as delimiters in ngrams later on. Having semicolons at the beginning or end of a word disrupts the delimiting process. Although individual semicolons should have been removed in the previous stage, not all semicolons were tagged properly.

In [8]:
df[0]['Word']

0                  <s>
1                   L-
2                 għan
3           prinċipali
4                  ta'
               ...    
49999994         Bajda
49999996          </s>
49999997           <s>
49999998            Dr
49999999        Zammit
Name: Word, Length: 44232071, dtype: object

In [9]:
for i in range(len(df)):
    df[i]['Word'] = df[i]['Word'].astype(str).str.replace(';', '', regex=False) #Strip every semicolon


Finally, we remove all NA values along with a few special cases: stray quotation marks and leftover HTML entities (&amp;lt, &amp;gt, &amp;amp).

In [10]:
for i in range(len(df)):
    df[i] = df[i].dropna(subset=['Word'])
    df[i] = df[i].drop(df[i][df[i]["Word"]=='"'].index)
    df[i] = df[i].drop(df[i][df[i]["Word"]=='&lt'].index)
    df[i] = df[i].drop(df[i][df[i]["Word"]=='&gt'].index)
    df[i] = df[i].drop(df[i][df[i]["Word"]=='&amp'].index)


We save the dataframes into csv files so that they can be loaded back into memory later.

In [11]:
df_all  = df[0]
df_norm = df[1]
df_test = df[2]

df_all.to_csv(os.path.join(folder,'korpus_clean.csv'), index=False)
df_norm.to_csv(os.path.join(folder,'norm_korpus_clean.csv'), index=False)
df_test.to_csv(os.path.join(folder,'Test','Test.csv'), index=False)

## Frequency Counts

We will also create frequency counts for all unique (Word, POS) pairs in both versions of the korpus. These will be used later when generating the ngram models.

In [12]:
df = pd.read_csv(os.path.join(folder, "korpus_clean.csv"),
                 usecols=["Word","POS"],
                 dtype={"Word": "U","POS": "S"})

df_norm = pd.read_csv(os.path.join(folder, "norm_korpus_clean.csv"),
                 usecols=["Word","POS"],
                 dtype={"Word": "U","POS": "S"})


In [13]:
#Count the occurrences of every unique (Word, POS) pair
df_frequency = df.value_counts().reset_index(name='Frequency')


df_normal_frequency = df_norm.value_counts().reset_index(name='Frequency')

df_frequency.to_csv(os.path.join(folder,'korpus_frequency.csv'), index=False)
df_normal_frequency.to_csv(os.path.join(folder,'norm_korpus_frequency.csv'), index=False)

df_frequency

Unnamed: 0,Word,POS,Frequency
0,<s>,START,1791836
1,</s>,END,1791835
2,l-,DEF,1584004
3,li,COMP,1391328
4,ta',GEN,1379784
...,...,...,...
491718,banama,NOUN,1
491719,banalment,ADV,1
491720,banalizzat,PART-PASS,1
491721,banakrella,NOUN,1


We can validate that this is working, as the most frequent words are the ones native Maltese speakers would expect to see at the top, such as <i>ta'</i> and <i>l-</i>.
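
To make the counting step concrete, the sketch below runs it on a tiny made-up frame. The counts are per (Word, POS) pair, so the same spelling with two different tags yields two rows. The sample words and their tags are invented for illustration, and `reset_index(name=...)` is used here as one way to name the count column without depending on what pandas calls it by default.

```python
import pandas as pd

# Invented rows; the tag assignments are arbitrary.
toy = pd.DataFrame({'Word': ["ta'", 'l-', "ta'", 'li', 'li', "ta'"],
                    'POS':  ['GEN', 'DEF', 'GEN', 'COMP', 'PRON-INT', 'GEN']})

# One row per unique (Word, POS) pair, most frequent first.
freq = toy.value_counts().reset_index(name='Frequency')
print(freq)
```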

#### Question 1. How large is the corpus?
The normalised corpus only contains about 125k occurrences of words. To keep processing time manageable, only the first 50 million rows of the non-normalised korpus were loaded when cleaning and generating the frequency counts, of which about 44 million remained after the POS-based cleaning. Loading the whole dataset and performing operations on it would take a long time and be computationally expensive.