# English - French Translation


## Data Extraction
In this notebook I will focus on extracting the data. The data is text: multiple document files in English, along with the same documents in French. Each document is stored in compressed .gz format. We will unpack the files and combine them into a single file per language.

In [1]:
import os
import numpy as np
import pandas as pd

In [2]:
pd.set_option('display.max_rows', 500)

In [4]:
def extract_data(dir_path, english_file_name='english', other_lang_file_name='french'):
    en_dict, fr_dict = {}, {}
    en_file_no, fr_file_no = [], []
    # Get list of files in the directory passed in (not a global `path`)
    files = [os.path.join(dir_path, f) for f in os.listdir(dir_path)]
    
    # Separate file names by language in two different dictionaries,
    # keyed by file number (names are expected to end in '<number>.<e|f>.gz')
    for file in files:
        splits = file.split('.')
        if splits[-2] == 'e':
            en_dict[splits[-3]] = file
        elif splits[-2] == 'f':
            fr_dict[splits[-3]] = file
    
    # Sort by file number so both languages are written in the same order
    en_sorted = sorted(en_dict.items(), key=lambda x: x[0])
    fr_sorted = sorted(fr_dict.items(), key=lambda x: x[0])
    
    # Extract english text
    # Append in binary mode: delete the output file before re-running,
    # otherwise the same data is appended a second time
    with open(f'./Data/{english_file_name}.gz', 'ab') as english:
        for en_file in en_sorted:
            en_file_no.append(en_file[0]) 
            with open(en_file[1], 'rb') as en_temp:        # open in binary mode also
                english.write(en_temp.read())
    
    # Extract other language text
    with open(f'./Data/{other_lang_file_name}.gz', 'ab') as other:  # append in binary mode
        for fr_file in fr_sorted:
            fr_file_no.append(fr_file[0])
            with open(fr_file[1], 'rb') as other_temp:        # open in binary mode also
                other.write(other_temp.read())
    
    # Check that both languages cover the same file numbers, in the same order
    assert en_file_no == fr_file_no, 'The two languages do not have matching file numbers. Check file names.'
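
The pairing logic in `extract_data` relies on the last two dot-separated fields before `.gz`: a file number and a language code (`e` or `f`). The sketch below illustrates that convention in isolation. The helper `parse_name` and the file names are hypothetical, chosen only to match the pattern the function expects; they are not taken from the corpus.

```python
def parse_name(file_name):
    """Split a name like '<prefix>.<number>.<lang>.gz' into (number, lang)."""
    parts = file_name.split('.')
    return parts[-3], parts[-2]

# Hypothetical file names that follow the expected pattern.
names = ['debates.002.f.gz', 'debates.001.e.gz',
         'debates.001.f.gz', 'debates.002.e.gz']

# Group the names by language, keyed by file number.
by_lang = {'e': {}, 'f': {}}
for name in names:
    number, lang = parse_name(name)
    by_lang[lang][number] = name

# Pair English and French files that share a file number, in sorted order.
pairs = [(by_lang['e'][k], by_lang['f'][k]) for k in sorted(by_lang['e'])]
print(pairs)
```

Keying on the file number rather than on list position means a missing file in one language surfaces as a `KeyError` instead of silently shifting every later pair.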
    

In [3]:
train_path = '../../../Downloads/hansard.36 3/Release-2001.1a/sentence-pairs/senate/debates/development/training/'

# Extract training files
extract_data(train_path, english_file_name='train_english', other_lang_file_name='train_french')
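
`extract_data` writes combined `.gz` files, while the cells that follow read `.txt` files, so the archives have to be decompressed in between. The notebook does not show how that was done. One possible way, as a minimal sketch: the helper name `gunzip` is an assumption, and it relies on Python's `gzip` module being able to read a file made of several concatenated gzip members.

```python
import gzip
import shutil

def gunzip(src_path, dst_path):
    """Decompress src_path (one or more concatenated gzip members) into dst_path."""
    with gzip.open(src_path, 'rb') as src, open(dst_path, 'wb') as dst:
        shutil.copyfileobj(src, dst)

# Example calls, assuming the files written by extract_data exist:
# gunzip('./Data/train_english.gz', './Data/train_english.txt')
# gunzip('./Data/train_french.gz', './Data/train_french.txt')
```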

In [37]:
# Read english and french file
train_english = pd.read_csv('./Data/train_english.txt', sep=r'\s{2,}', engine='python', header=None)
train_french = pd.read_csv('./Data/train_french.txt', sep=r'\s{2,}', engine='python', header=None)

In [38]:
train_english.columns = ['english_text']
train_french.columns = ['french_text']
train_data = train_english.merge(train_french, left_index=True, right_index=True)
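
Merging on the index pairs sentences purely by row position, and the default inner join drops unmatched rows without warning if the two frames differ in length. A small guard can make such a mismatch visible. This is a sketch with toy data; `pair_by_position` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

def pair_by_position(left, right):
    """Join two frames row by row, refusing mismatched lengths."""
    if len(left) != len(right):
        raise ValueError(f'Row counts differ: {len(left)} vs {len(right)}')
    return left.merge(right, left_index=True, right_index=True)

# Toy frames standing in for train_english and train_french.
en = pd.DataFrame({'english_text': ['Motion agreed to.', 'Thank you.']})
fr = pd.DataFrame({'french_text': ['(La motion est adoptée.)', 'Merci.']})

pairs = pair_by_position(en, fr)
print(pairs)
```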

In [22]:
test_path1 = '../../../Downloads/hansard.36/Release-2001.1a/sentence-pairs/senate/debates/development/testing/1/'
test_path2 = '../../../Downloads/hansard.36/Release-2001.1a/sentence-pairs/senate/debates/development/testing/2/'

# Extract test files
extract_data(test_path1, english_file_name='test_english1', other_lang_file_name='test_french1')
extract_data(test_path2, english_file_name='test_english2', other_lang_file_name='test_french2')

In [23]:
# Read english and french test file
test_english1 = pd.read_csv('./Data/test_english1.txt', sep=r'\s{2,}', engine='python', header=None)
test_french1 = pd.read_csv('./Data/test_french1.txt', sep=r'\s{2,}', engine='python', header=None)

test_english2 = pd.read_csv('./Data/test_english2.txt', sep=r'\s{2,}', engine='python', header=None)
test_french2 = pd.read_csv('./Data/test_french2.txt', sep=r'\s{2,}', engine='python', header=None)

In [28]:
test_english1.columns = ['english_text']
test_french1.columns = ['french_text']

test_english2.columns = ['english_text']
test_french2.columns = ['french_text']

test1 = test_english1.merge(test_french1, left_index=True, right_index=True)
test2 = test_english2.merge(test_french2, left_index=True, right_index=True)

test_data = pd.concat([test1, test2], axis=0).reset_index(drop=True)

In [39]:
train_data.tail()

Unnamed: 0,english_text,french_text
182130,"That at 3:30 p.m. tomorrow, if the business of...","Que, ‡ 15 h 30 demain, si le SÈnat n'a pas ter..."
182131,That should a division be deferred until 5:30 ...,"Que, si un vote est diffÈrÈ ‡ 17 h 30 demain, ..."
182132,That all matters on the Orders of the Day and ...,Que tous les points figurant ‡ l'Ordre du jour...
182133,Motion agreed to.,(La motion est adoptÈe.)
182134,"The Senate adjourned until Wednesday, April 5,...","(Le SÈnat s'ajourne au mercredi 5 avril 2000, ..."


In [30]:
test_data.tail()

Unnamed: 0,english_text,french_text
364265,"That at 3:30 p.m. tomorrow, if the business of...","Que, ‡ 15 h 30 demain, si le SÈnat n'a pas ter..."
364266,That should a division be deferred until 5:30 ...,"Que, si un vote est diffÈrÈ ‡ 17 h 30 demain, ..."
364267,That all matters on the Orders of the Day and ...,Que tous les points figurant ‡ l'Ordre du jour...
364268,Motion agreed to.,(La motion est adoptÈe.)
364269,"The Senate adjourned until Wednesday, April 5,...","(Le SÈnat s'ajourne au mercredi 5 avril 2000, ..."


## Data Cleaning
The French translation has some text formatting problems: certain characters were converted improperly from binary during extraction and reading of the data. Characters such as ‡ and È stand in for accented lower-case letters: ‡ for the preposition à, and È for é (as in "SÈnat" for "Sénat"). Likewise, » and … stand in for upper-case È and É. When these characters are represented incorrectly, they can completely change the meaning of a phrase or sentence, so I will replace them with the right characters. There may be other special characters that I am missing, but these changes should help as a starting point.

In [40]:
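def replace_char(sent):
    # 'È' is a mis-decoded 'é' (e.g. 'SÈnat'); fix it before lower-casing,
    # because lower() would otherwise turn it into 'è'.
    sent = sent.replace('È', 'é')
    # Lower case the string and replace the remaining characters.
    sent = sent.lower()
    sent = sent.replace('»', 'è')   # mis-decoded upper-case 'È'
    sent = sent.replace('…', 'é')   # mis-decoded upper-case 'É'
    sent = sent.replace('‡', 'à')
    sent = sent.replace('¿', '')
    return sent.strip()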
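
The substitutions seen in the output (‡ for à, È for é, » for È, … for É) are consistent with Latin-1 bytes having been decoded as Mac OS Roman. That is an assumption about the source files, not something the notebook confirms. If it holds, the damage can be undone in one step by re-encoding and decoding again, rather than listing characters one by one. A minimal sketch, with `fix_mojibake` as a hypothetical helper:

```python
def fix_mojibake(text):
    """Undo a Latin-1 -> Mac Roman mis-decoding.

    Assumes the text was Latin-1 read as Mac Roman. Text that cannot be
    re-encoded as Mac Roman is returned unchanged.
    """
    try:
        return text.encode('mac_roman').decode('latin-1')
    except UnicodeEncodeError:
        return text

print(fix_mojibake("(La motion est adoptÈe.)"))   # (La motion est adoptée.)
```

If the assumption is right, passing `encoding='latin-1'` to `pd.read_csv` when the files are first read should avoid the problem altogether.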

In [42]:
# Apply replace functions to french text
train_data.french_text = train_data.french_text.apply(replace_char)
test_data.french_text = test_data.french_text.apply(replace_char)

In [45]:
train_data.to_csv('./Data/prepped/train_data.csv', index=False)
test_data.to_csv('./Data/prepped/test_data.csv', index=False)