### The goals of this notebook are to:

##### 1. Avoid loading large files into memory.
##### 2. Build training and validation sets with balanced classes.
##### 3. Make the validation set the same size as the test set.
##### 4. Avoid loading samples that exist in the test set.
##### 5. Preprocess the test set.
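
As an illustration of the first goal, the sketch below reads a text source lazily, one line at a time, and stops after a fixed number of lines. The helper name `iter_lines` and the in-memory stand-in file are assumptions made for this example; they are not part of the notebook.

```python
import io
from itertools import islice


def iter_lines(handle):
    """Yield non-empty lines one at a time from an open text handle."""
    for line in handle:
        line = line.rstrip("\n")
        if line:
            yield line


# Stand-in for a large corpus file (hypothetical content).
fake_file = io.StringIO(
    "first sentence here\n\nsecond sentence here\nthird sentence here\n"
)

# Only the first two non-empty lines are ever materialised in memory.
first_two = list(islice(iter_lines(fake_file), 2))
print(first_two)
```

Iterating over the file handle keeps memory use proportional to one line rather than to the whole file.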


In [1]:
from glob import glob
import pandas as pd
import re

### Load the test set

In [5]:
def clean(text):
    text =  re.sub(r'<.*?>', '', text) # remove anything inside <> html tags
    text = re.sub(r'[^\w\s]','',text)  # remove all punctuation 
    text = re.sub(r'[0-9]+', '', text).lower().strip() # remove numbers, strip and lowercase
    return text

In [6]:
test_lines = []
test_labels = []

with open("europarl.test") as f:
    for line in f:
        test_labels.append(line[:2])
        test_lines.append(clean(line[2:]))


In [7]:
# Build the frame with the columns already in the intended order (language label first).
test = pd.DataFrame({"lang": test_labels, "text": test_lines})
labels = test["lang"].unique()

In [8]:
test.head()

Unnamed: 0,lang,text
0,bg,европа не трябва да стартира нов конкурентен ...
1,bg,cs найголямата несправедливост на сегашната об...
2,bg,de гжо председател гн член на комисията по при...
3,bg,de гн председател бих искал да започна с комен...
4,bg,de гн председател въпросът за правата на човек...


### We notice that some test sentences contain language markers such as (DE) or (CS) that do not necessarily match the language of the sentence. For example: (DE) Señor Presidente, considero que esta Directiva marco para la protección del suelo es un grave error que pone en peligro la competitividad de la agricultura europea y el suministro de alimentos en Europa.

### Hence we remove them.

In [17]:
lang_symbols = test["lang"].unique()  # two-letter language codes present in the test set

def replace_lang_symbol(sentence, symbols):
    """
    Remove language symbols such as "de" (what is left of "(DE)" after cleaning)
    by replacing them with an empty string.

    Note: str.replace removes every occurrence, including inside longer words.
    """
    for symbol in symbols:
        sentence = sentence.replace(symbol, "")
    return sentence
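
Plain substring replacement also removes the two letters when they occur inside ordinary words (for instance, "präsident" becomes "präsint"). A stricter variant is sketched below. It assumes the marker only appears at the start of the raw sentence, in parentheses, and that it is removed before punctuation is stripped; `strip_leading_marker` is a hypothetical helper, not the function used in this notebook.

```python
import re

# Matches a two-letter code in parentheses at the start of a sentence, e.g. "(DE) ".
LEADING_MARKER = re.compile(r"^\s*\([A-Za-z]{2}\)\s*")


def strip_leading_marker(sentence):
    """Remove one leading "(xx)" marker and leave the rest of the sentence untouched."""
    return LEADING_MARKER.sub("", sentence, count=1)


raw = "(DE) Señor Presidente, considero que esta Directiva es un grave error."
stripped = strip_leading_marker(raw)
untouched = strip_leading_marker("Herr Präsident, der Bericht ist wichtig.")
print(stripped)
print(untouched)
```

Because the pattern is anchored at the start of the string, words that merely contain a language code are left intact.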

In [18]:
test["text"] = test["text"].apply(replace_lang_symbol, symbols = lang_symbols)

In [19]:
test.head()

Unnamed: 0,lang,text
0,bg,европа не трябва да стартира нов конкурентен ...
1,bg,найголямата несправедливост на сегашната обща...
2,bg,гжо председател гн член на комисията по принц...
3,bg,гн председател бих искал да започна с комента...
4,bg,гн председател въпросът за правата на човека ...


In [20]:
test.to_csv("test_lang.csv", index = False)

### A bit of analytics: let's find the length of the shortest examples. This will be useful for picking observations for the training and validation sets, because detecting the language of a short sentence can be hard, especially between closely related languages.

The five shortest sentences in our test set:

In [21]:
sorted(list(test["text"]), key = len)[:5]

['møt åbn kl',
 'r gør emridt',
 'herr präsint',
 'aäh härra hae',
 'vaakem  ümber']

In [22]:
len(sorted(list(test["text"]), key = len)[0].split())

3

The shortest sentence in the test set (by character count) contains three words. As a heuristic, we will therefore make sure that the observations in our training and validation sets contain at least three words.
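
Sorting by `len` ranks sentences by character count, which is not always the same as ranking by word count. The sketch below, on a hypothetical toy list rather than the real test set, shows one way to compute the minimum word count directly.

```python
# Toy sentences (hypothetical data, not taken from the test set).
texts = ["herr praesident", "this is a longer sentence", "a b c"]

# Shortest sentence by number of characters.
shortest_by_chars = min(texts, key=len)

# Smallest number of whitespace-separated words across all sentences.
min_words = min(len(t.split()) for t in texts)

print(shortest_by_chars, len(shortest_by_chars.split()), min_words)
```

In this toy list the shortest sentence by characters has three words, while the smallest word count is two, so the two measures can disagree.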

## Create a validation set that matches the size of the test set.


In [23]:
test["lang"].value_counts()

sl    1000
pl    1000
es    1000
it    1000
da    1000
lt    1000
el    1000
lv    1000
et    1000
nl    1000
bg    1000
ro    1000
fr    1000
hu    1000
sk    1000
en    1000
sv    1000
pt    1000
cs    1000
fi    1000
de    1000
Name: lang, dtype: int64

Each language has 1,000 sentences in the test set. Hence we will create a validation set with 1,000 texts per language and a training set with 4,000 texts per language.

In [24]:
test_size = 1000
valid_size = test_size  # the validation set matches the test set
train_size = test_size * 4
data_size = valid_size + train_size  # number of sentences to collect per language

### Build the training and validation set

The following conditions are used to pick text samples. They are conservative heuristics based on a manual check and do not reflect a thorough exploration.


A sentence is included in the training and validation datasets
    only if it passes the following tests:
    
    Condition 1: Sentences whose language doesn't match the file name start with a parenthesis, so they are excluded.
    
    Condition 2: The shortest sentences should have the same length as those in our test set. In our case, 3 words.
    
    Conditions 3, 4, 5: Sentences that end with a period, start with "-", or contain the word
                        "report" might contain languages that do not match the target language,
                        so they are excluded.
                      
    Condition 6: No overlap between the test set and the validation or training sets.
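
The line-level conditions above can be sketched as a single predicate. This is a minimal illustration under stated assumptions: the helper name `keep_line` is hypothetical, the "report" check is made case-insensitive here (an assumption), and the overlap check against the test set (condition 6) is left out because it needs the preprocessed test texts.

```python
def keep_line(line, min_words=3):
    """Return True if a raw corpus line passes the heuristic filters."""
    line = line.strip()  # drop the trailing newline before inspecting the line
    if not line:
        return False
    if line.startswith("(") or line.startswith("-"):
        return False  # parenthesised language marker or dash-prefixed line
    if len(line.split()) < min_words:
        return False  # shorter than the shortest test sentence
    if line.endswith("."):
        return False  # ends with a period
    if "report" in line.lower():
        return False  # mentions a report (case-insensitive by assumption)
    return True


print(keep_line("Approval of the minutes: see Minutes\n"))
print(keep_line("(DE) Mr President, thank you very much\n"))
```

Keeping the rules in one function makes each heuristic easy to test on its own.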

In [27]:
lang_files = glob("raw_data/europarl-v7*")  # paths of the language files
test_texts = set(test["text"])  # a set gives fast membership checks against the test set
train_texts = []
train_lang = []
valid_texts = []
valid_lang = []

for file in lang_files:
    texts = []
    languages = []
    lang = file[-2:]  # the language code is the file extension
    with open(file) as f:
        for line in f:  # stream the file line by line instead of loading it whole
            if len(texts) >= data_size:
                break  # enough samples for this language
            line = line.strip()  # drop the trailing newline so the end-of-line check works
            # skip "(xx)" lines in a non-target language, "-" lines, and lines under three words
            if line.startswith(("(", "-")) or len(line.split()) < 3:
                continue
            # skip lines that end with a period or contain the word "report"
            if line.endswith(".") or "report" in line:
                continue
            line = clean(replace_lang_symbol(line, lang_symbols))
            # last check, after applying the same preprocessing as for the test set
            if line not in test_texts:
                texts.append(line)
                languages.append(lang)

    train_texts.extend(texts[:train_size])
    train_lang.extend(languages[:train_size])
    valid_texts.extend(texts[train_size:])
    valid_lang.extend(languages[train_size:])

In [28]:
train = pd.DataFrame({"text": train_texts, "lang":train_lang})
valid = pd.DataFrame({"text": valid_texts, "lang":valid_lang})

In [29]:
train.head()

Unnamed: 0,lang,text
0,bg,състав на парламента вж протоколи
1,bg,одобряване на протокола от предишното заседани...
2,bg,състав на парламента вж протоколи
3,bg,проверка на пълномощията вж протоколи
4,bg,петиции вж протоколи


In [30]:
# cap English at train_size / valid_size rows so that it is not oversampled
train.drop(train[train["lang"] == "en"].index[train_size:], inplace=True)
valid.drop(valid[valid["lang"] == "en"].index[valid_size:], inplace=True)
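
The same kind of cap can be applied to every language at once. The sketch below uses `groupby(...).head(n)` on a hypothetical toy frame; it is an alternative illustration, not the approach used in this notebook.

```python
import pandas as pd

# Toy frame (hypothetical data): "en" is over-represented.
toy = pd.DataFrame({
    "lang": ["en", "en", "en", "en", "de", "de"],
    "text": ["a b c", "d e f", "g h i", "j k l", "m n o", "p q r"],
})

cap = 2  # stand-in for train_size

# Keep at most `cap` rows per language, in their original order.
balanced = toy.groupby("lang").head(cap)
counts = balanced["lang"].value_counts().to_dict()
print(counts)
```

This keeps the first `cap` rows of each class, so every language ends up with at most the same number of samples.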

In [31]:
valid["lang"].value_counts()

sl    1000
pl    1000
es    1000
it    1000
da    1000
lt    1000
el    1000
lv    1000
et    1000
nl    1000
bg    1000
ro    1000
fr    1000
hu    1000
sk    1000
en    1000
sv    1000
pt    1000
cs    1000
fi    1000
de    1000
Name: lang, dtype: int64

In [32]:
train["lang"].value_counts()

en    4000
it    4000
el    4000
bg    4000
pt    4000
lv    4000
pl    4000
sk    4000
cs    4000
es    4000
lt    4000
fr    4000
nl    4000
hu    4000
sv    4000
fi    4000
ro    4000
sl    4000
da    4000
et    4000
de    4000
Name: lang, dtype: int64

## Double-check that the training and validation sets contain no observations from the test set

In [33]:
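train_text_set = set(train["text"])  # `in` on a Series checks the index, so compare against the values
count = 0
for t in test["text"]:
    if t in train_text_set:
        count += 1
print(count)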

0


In [34]:
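valid_text_set = set(valid["text"])  # `in` on a Series checks the index, so compare against the values
count = 0
for t in test["text"]:
    if t in valid_text_set:
        count += 1
print(count)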

0
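
A note on the membership test: with a pandas Series, the `in` operator looks at the index labels rather than the values, so an overlap check is more reliably expressed with `isin` or a set of the values. A minimal sketch on hypothetical toy frames:

```python
import pandas as pd

# Toy frames (hypothetical data) that share exactly one sentence.
train_toy = pd.DataFrame({"text": ["alpha beta gamma", "delta epsilon zeta"]})
test_toy = pd.DataFrame({"text": ["delta epsilon zeta", "eta theta iota"]})

# `in` looks at the index labels (0 and 1 here), so a text value is not found.
found_with_in = "delta epsilon zeta" in train_toy["text"]

# isin compares against the values and returns a boolean mask.
overlap = int(test_toy["text"].isin(train_toy["text"]).sum())
print(found_with_in, overlap)
```

Here the `in` check misses the shared sentence while `isin` counts it, which is why the value-based check is the safer choice.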


We can now safely write the training and validation sets to disk.

In [36]:
train.to_csv("train_lang.csv", index = False)
valid.to_csv("valid_lang.csv", index = False)

