### The goals of this notebook are to:

##### 1. Avoid loading large files into memory.
##### 2. Build training and validation sets with balanced classes.
##### 3. Make the validation set the same size as the test set.
##### 4. Avoid picking samples that exist in the test set.
##### 5. Preprocess the test set.


In [1]:
from glob import glob
import pandas as pd
import re

### Load the test set

In [2]:
def clean(text):
    text = re.sub(r'<.*?>', '', text)  # remove HTML tags such as <p> or </p>
    text = re.sub(r'[^\w\s]', '', text)  # remove all punctuation
    text = re.sub(r'[0-9]+', '', text).lower().strip()  # remove numbers, lowercase and strip
    return text
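
As a quick check of what `clean` does, here is a small self-contained example. The sample strings are made up for illustration, and the function body is repeated only so that the snippet runs on its own.

```python
import re

def clean(text):
    text = re.sub(r'<.*?>', '', text)    # drop HTML tags such as <p> or </p>
    text = re.sub(r'[^\w\s]', '', text)  # drop punctuation
    text = re.sub(r'[0-9]+', '', text).lower().strip()  # drop digits, lowercase, trim
    return text

sample = "<p>Hello, World 123!</p>"  # made-up input
cleaned = clean(sample)

# interior whitespace is not collapsed, so a removed token can leave two spaces
gap = clean("a 1 b")

print(repr(cleaned), repr(gap))
```

The second call shows why some cleaned sentences contain a double space: only leading and trailing whitespace is stripped.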

In [3]:
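test_lines = []
test_labels = []

with open("europarl.test", encoding="utf-8") as f:
    for line in f:  # read line by line instead of loading the whole file
        test_labels.append(line[:2])        # first two characters: language code
        test_lines.append(clean(line[2:]))  # rest of the line: the text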


In [4]:
# one row per test sentence: language code and cleaned text
test = pd.DataFrame({"lang": test_labels, "text": test_lines})
labels = test["lang"].unique()

In [5]:
test.head()

|   | lang | text |
|---|------|------|
| 0 | bg | европа не трябва да стартира нов конкурентен ... |
| 1 | bg | cs найголямата несправедливост на сегашната об... |
| 2 | bg | de гжо председател гн член на комисията по при... |
| 3 | bg | de гн председател бих искал да започна с комен... |
| 4 | bg | de гн председател въпросът за правата на човек... |


### We notice that some test sentences start with a language marker such as (DE), (CS) ... that does not match the language of the sentence itself.

For example:  

(DE) Señor Presidente, considero que esta Directiva marco para la protección del suelo es un grave error que pone en peligro la competitividad de la agricultura europea y el suministro de alimentos en Europa.

### Hence we remove them.

In [6]:
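lang_symbols = test["lang"].unique()  # language codes found in the test set

def replace_lang_symbol(sentence, symbols):
    """
    Remove language codes such as "de" (what is left of a marker like
    "(DE)" after cleaning) from a sentence.

    Note: every occurrence of a code is removed, including inside words
    (with the code "de", "präsident" becomes "präsint").
    """
    for symbol in symbols:
        sentence = sentence.replace(symbol, "")
    return sentence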

In [7]:
test["text"] = test["text"].apply(replace_lang_symbol, symbols = lang_symbols)
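
Because `str.replace` removes a code wherever it occurs, it also removes it inside words. If only a leading marker should be dropped, a narrower variant could look like the sketch below. The helper name `strip_leading_lang_code` and the sample sentences are hypothetical and not part of the original notebook. The sketch also assumes the text has already been cleaned, so that "(DE)" has become the token "de".

```python
def strip_leading_lang_code(sentence, codes):
    """Drop the first token of `sentence` if it is one of `codes` (hypothetical helper)."""
    parts = sentence.split(maxsplit=1)
    if parts and parts[0] in codes:
        return parts[1] if len(parts) > 1 else ""
    return sentence

codes = {"de", "cs", "es"}  # illustrative subset of language codes

# substring removal also hits the inside of words
mangled = "herr präsident".replace("de", "")

# the token-based variant only touches a leading marker
kept = strip_leading_lang_code("herr präsident", codes)
stripped = strip_leading_lang_code("de señor presidente considero", codes)

print(mangled, "|", kept, "|", stripped)
```

The trade-off is that a sentence that genuinely begins with a word such as "de" (common in Spanish, French, or Dutch) would lose that word.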

In [8]:
len(test)

21000

In [9]:
test.drop_duplicates(inplace=True)

In [10]:
len(test)

20989

In [11]:
test.to_csv("test_lang.csv", index = False)

### A bit of analytics: let's find the length of the shortest examples.

This will be useful for picking observations for the training and validation sets, because detecting the language of a short sentence is hard, especially between closely related languages.

The five shortest sentences (by character count) in our test set:

In [12]:
sorted(list(test["text"]), key = len)[:5]

['møt åbn kl',
 'r gør emridt',
 'herr präsint',
 'aäh härra hae',
 'vaakem  ümber']

In [13]:
len(sorted(list(test["text"]), key = len)[0].split())

3

The shortest sentence in the test set (by character count) contains three words. Hence we will make sure that every observation in our training and validation sets contains at least three words.
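
Sorting with `key = len` orders sentences by character count, so the word count above belongs to the sentence with the fewest characters. To measure the minimum word count directly, one could count the words of every sentence, as in this sketch. The sample sentences are for illustration only; in the notebook the input would be `test["text"]`.

```python
texts = [
    "møt åbn kl",
    "internationalisation matters",
    "a much longer sentence with many more words",
]

# sentence with the fewest characters, and its number of words
shortest_by_chars = min(texts, key=len)
words_in_shortest = len(shortest_by_chars.split())

# smallest number of words over all sentences
min_word_count = min(len(t.split()) for t in texts)

print(words_in_shortest, min_word_count)
```

In this toy list the two measures differ: the sentence with the fewest characters has three words, while another sentence has only two.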

## Create a validation set that matches the size of the test set.


In [14]:
test["lang"].value_counts()

lt    1000
en    1000
nl    1000
lv    1000
sk    1000
bg    1000
hu    1000
el    1000
sl    1000
pt    1000
pl    1000
cs    1000
ro     999
sv     999
fr     999
da     999
es     999
de     999
et     999
it     998
fi     998
Name: lang, dtype: int64

Each language has about 1,000 test texts. Hence we will create a validation set with 1,000 texts per language and a training set with 4,000 texts per language.

In [15]:
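test_size = 1000                     # texts per language in the test set
valid_size = test_size               # validation set matches the test set
train_size = test_size * 4           # 4,000 training texts per language
data_size = valid_size + train_size  # texts to pick per language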

### Build the training and validation set

The following conditions are used to pick text samples. They are heuristics based on a manual check, chosen to stay on the safe side, and do not reflect a thorough exploration. We can afford to be strict because we have more than 100,000 texts per language and only pick 5,000.

A sentence is included in the training and validation data only if it passes all of the following tests.

#### Condition 1: The sentence does not start with "(". Some sentences whose language does not match the file name start with a parenthesis.

#### Condition 2: The sentence has at least 3 words, so that the shortest sentences are as long as the shortest ones in our test set.

#### Conditions 3, 4, 5: The sentence does not end with a period, does not start with "-" and does not contain the word "report". Some such sentences contain languages that do not match the target language.

#### Condition 6: There is no overlap between the test set and the training or validation set.
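
Conditions 1 to 5 can be sketched as a single predicate on a raw line. The function name `passes_filters` is hypothetical, and the sketch assumes that the trailing newline is stripped before the end-of-line check so that a final period is actually seen. The word "report" is matched anywhere in the line. Condition 6 needs the test set and is left out here.

```python
def passes_filters(line, min_words=3):
    """Return True if a raw line satisfies conditions 1 to 5 (hypothetical helper)."""
    line = line.strip()
    if len(line.split()) < min_words:           # condition 2: at least 3 words
        return False
    if line.startswith(("(", "-")):             # conditions 1 and 4
        return False
    if line.endswith("."):                      # condition 3
        return False
    if "report" in line:                        # condition 5
        return False
    return True

examples = [
    "Approval of the minutes\n",
    "(DE) Señor Presidente, considero\n",
    "- one two three\n",
    "Too short\n",
    "This sentence ends with a period.\n",
    "The report was adopted today\n",
]
results = [passes_filters(e) for e in examples]
print(results)
```

Only the first made-up line passes; each of the others violates exactly one of the conditions.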

In [16]:
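lang_files = glob("raw_data/europarl-v7*")  # paths of the raw language files
train_texts = []
train_lang = []
valid_texts = []
valid_lang = []

test_texts = set(test["text"])  # a set gives fast membership checks (condition 6)

for file in lang_files:
    texts = []
    seen = set()      # texts already picked from this file
    lang = file[-2:]  # the last two characters of the file name are the language code
    with open(file, encoding="utf-8") as f:
        for line in f:  # stream the file line by line instead of loading it whole
            if len(texts) >= data_size:
                break  # enough samples picked from this file
            # conditions 1, 2 and 4: no leading "(" or "-", at least 3 words
            if line.startswith(("(", "-")) or len(line.split()) < 3:
                continue
            # conditions 3 and 5; line still ends with "\n" here, so the
            # line[-1] test only fires on a final line without a newline
            if line[-1] == "." or "report" in line:
                continue
            # note: the codes are removed before clean() here, whereas the
            # test set was cleaned first
            line = clean(replace_lang_symbol(line, lang_symbols))
            # condition 6 and de-duplication, checked after preprocessing
            if line in test_texts or line in seen:
                continue
            seen.add(line)
            texts.append(line)

    # the first train_size texts go to training, the rest to validation
    train_texts.extend(texts[:train_size])
    train_lang.extend([lang] * len(texts[:train_size]))
    valid_texts.extend(texts[train_size:])
    valid_lang.extend([lang] * len(texts[train_size:]))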
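
Because the loop iterates over the file object, only one line is held in memory at a time. The same idea can be written with a generator and `itertools.islice`, which also stops reading as soon as enough samples are collected. This is a sketch rather than the notebook's code: `iter_unique` is a hypothetical helper, and an in-memory `io.StringIO` stands in for a Europarl file.

```python
import io
from itertools import islice

def iter_unique(lines, excluded):
    """Yield stripped, non-empty lines not seen before and not in `excluded`."""
    seen = set()
    for line in lines:
        line = line.strip()
        if line and line not in seen and line not in excluded:
            seen.add(line)
            yield line

# stand-in for a large file; a real file object is iterated the same way
fake_file = io.StringIO("a b c\nd e f\na b c\ng h i\nj k l\n")
held_out = {"d e f"}  # stand-in for the test set texts

# take the first two usable lines, then stop pulling from the generator
picked = list(islice(iter_unique(fake_file, held_out), 2))
print(picked)
```

The generator skips the held-out line and the repeated line, and `islice` ends the iteration once two samples have been taken.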

In [17]:
# same column order as the test set: language code first, then text
train = pd.DataFrame({"lang": train_lang, "text": train_texts})
valid = pd.DataFrame({"lang": valid_lang, "text": valid_texts})

In [18]:
train.head()

|   | lang | text |
|---|------|------|
| 0 | bg | състав на парламента вж протоколи |
| 1 | bg | одобряване на протокола от предишното заседани... |
| 2 | bg | проверка на пълномощията вж протоколи |
| 3 | bg | петиции вж протоколи |
| 4 | bg | предаване на текстове на споразумения от съвет... |


In [19]:
# make sure we do not oversample English text: keep only the first
# train_size / valid_size English rows
train.drop(train[train["lang"]=="en"].index[train_size:], inplace=True)
valid.drop(valid[valid["lang"]=="en"].index[valid_size:], inplace=True)

In [20]:
len(train)

84000

In [21]:
train.drop_duplicates(inplace=True)

In [22]:
len(train)

84000

In [23]:
len(valid)

21000

In [24]:
valid.drop_duplicates(inplace=True)

In [25]:
len(valid)

21000

In [26]:
valid["lang"].value_counts()

lt    1000
it    1000
nl    1000
lv    1000
sk    1000
bg    1000
fi    1000
de    1000
es    1000
da    1000
el    1000
fr    1000
hu    1000
en    1000
ro    1000
sv    1000
sl    1000
pt    1000
et    1000
pl    1000
cs    1000
Name: lang, dtype: int64

In [27]:
train["lang"].value_counts()

lt    4000
pt    4000
es    4000
it    4000
en    4000
sl    4000
lv    4000
sk    4000
bg    4000
sv    4000
et    4000
hu    4000
pl    4000
fr    4000
nl    4000
de    4000
da    4000
el    4000
ro    4000
fi    4000
cs    4000
Name: lang, dtype: int64

## Double-check that the training and validation files contain no observations from the test file:
Step 1: Concatenate all three sets.

Step 2: Check the length.

Step 3: Drop duplicates.

Step 4: If the length is the same as in step 2, no observation is shared between the sets.


In [28]:
d = pd.concat([train,valid,test],axis=0)

In [29]:
len(d)

125989

In [30]:
d.drop_duplicates(inplace=True)

In [31]:
len(d)  # same length as before, hence no duplicates

125989
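
`drop_duplicates` compares whole rows, so the check above looks for identical (lang, text) pairs. A text-only overlap check can be sketched with plain sets. The tiny lists below are made-up stand-ins for the `text` columns of the three data frames, and the variable names are hypothetical.

```python
toy_train = ["a b c", "d e f"]
toy_valid = ["g h i"]
toy_test = ["j k l", "d e f"]

# texts that appear both in the test set and in another set
leak_train = set(toy_train) & set(toy_test)
leak_valid = set(toy_valid) & set(toy_test)

print(leak_train, leak_valid)
```

An empty intersection for both sets would mean that no test text appears in the training or validation data, regardless of its language label.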

We can now shuffle the data and save the training and validation sets.

In [32]:
train = train.sample(frac = 1.0)
valid = valid.sample(frac = 1.0)

In [34]:
train.to_csv("train_lang.csv", index = False)
valid.to_csv("valid_lang.csv", index = False)

