## Brief description of the data set and a summary of its attributes
My data set is made up of two columns: English words/sentences and French words/sentences.

It contains 175621 rows, none of which contain null values.

I pulled the data set from Kaggle because I needed data containing correct translations between the two languages. I am fluent in only one of them, English, so the data set provided me with ready-made translations.

## Importing the necessary libraries
- pandas
- nltk
- spacy
- sklearn

In [1]:
import pandas as pd
import string
import nltk
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
import spacy
from sklearn.model_selection import train_test_split


## Using pandas library to retrieve the csv file

In [1]:
import chardet

# Detect the encoding of the file
with open(r"C:\Users\Bildad Otieno\Documents\Billy_Repo\Translation_Mod\eng_-french.csv", "rb") as f:
    result = chardet.detect(f.read())
    encoding = result['encoding']
print(encoding)


ISO-8859-1


In [52]:
file = r"C:\Users\Bildad Otieno\Documents\Billy_Repo\Translation_Mod\eng_-french.csv"
df = pd.read_csv(file)
#df = df.replace('�','',regex = True) -- Very WRONG, DO NOT do this!!! What I mistakenly did was strip out all the special characters in French, e.g. é, è, ê, à, etc.
# With the wrong encoding they are read as "?", e.g. sant?! which is actually santé!
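
To make the comment above concrete, here is a minimal sketch of how an accented letter can turn into a replacement character. It uses only the standard library. The word `santé` is taken from the comment, and the bytes are built in memory rather than read from the actual file, so this is an illustration under that assumption and not a diagnosis of the CSV itself.

```python
# A Latin-1 (ISO-8859-1) file stores "é" as the single byte 0xE9.
raw = "santé".encode("latin-1")

# Decoding those bytes as UTF-8 fails on 0xE9; with errors="replace"
# the offending byte becomes U+FFFD, the replacement character.
wrong = raw.decode("utf-8", errors="replace")

# Decoding with the encoding the bytes were written in recovers the text.
right = raw.decode("latin-1")

print(repr(wrong))
print(repr(right))
```

If the file really is Latin-1, passing the detected value along, as in `pd.read_csv(file, encoding=encoding)`, should avoid the problem without deleting any characters.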

In [53]:
df.index = df.index + 1
df

Unnamed: 0,English words/sentences,French words/sentences
1,Hi.,Salut!
2,Run!,Cours !
3,Run!,Courez !
4,Who?,Qui ?
5,Wow!,Ça alors !
...,...,...
175617,"Top-down economics never works, said Obama. ""T...","« L'économie en partant du haut vers le bas, ç..."
175618,A carbon footprint is the amount of carbon dio...,Une empreinte carbone est la somme de pollutio...
175619,Death is something that we're often discourage...,La mort est une chose qu'on nous décourage sou...
175620,Since there are usually multiple websites on a...,Puisqu'il y a de multiples sites web sur chaqu...


## Printing out the first 5 rows of the data frame

In [55]:
df.head(5)

Unnamed: 0,English words/sentences,French words/sentences
1,Hi.,Salut!
2,Run!,Cours !
3,Run!,Courez !
4,Who?,Qui ?
5,Wow!,Ça alors !


## Checking for any null values
Checking for missing values: 
- **df.isnull()** or **df.isna()** - will return True if null
- **df.notnull()** - will return False if null

Handling missing values:
1)   Removing rows or columns with missing values: **df.dropna()**
2)   Interpolating missing values: **df.interpolate()**
3)   Imputing missing values: You can use **df.fillna(value)** to fill missing values with a specific value, or use more advanced techniques like mean, median, or machine learning algorithms for imputation.
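
The options above can be sketched on a small example. The `toy` frame below is hypothetical (it is not part of the data set) and simply has one missing French entry. `df.interpolate()` is left out because it only applies to numeric data, not to text columns like these.

```python
import pandas as pd

# Hypothetical toy frame with one missing translation
toy = pd.DataFrame({
    "English": ["hi", "run", "wow"],
    "French": ["salut", None, "ça alors"],
})

missing_per_column = toy.isna().sum()   # number of nulls in each column
dropped = toy.dropna()                  # rows containing any null are removed
filled = toy.fillna("<missing>")        # nulls are replaced by a placeholder

print(missing_per_column)
print(dropped)
print(filled)
```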

In [56]:
df.isna().sum()

English words/sentences    0
French words/sentences     0
dtype: int64

## Checking for unique values

In [57]:
df.nunique().sum()

289075

## Checking the number of rows
The **shape** attribute returns a tuple of the form (rows, columns), so index 0 gives the number of rows.

In [58]:
df.shape[0]

175621

## Checking for number of records
We could also use **df.count()** to see the number of non-null records in every column.

In [59]:
df.count()

English words/sentences    175621
French words/sentences     175621
dtype: int64
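
The two checks agree here only because there are no nulls. As a small sketch on a hypothetical `toy` frame (not the real data), `shape` counts every row while `count()` skips missing values:

```python
import pandas as pd

# Hypothetical toy frame with one missing French entry
toy = pd.DataFrame({
    "English": ["hi", "run", "wow"],
    "French": ["salut", None, "ça alors"],
})

n_rows, n_cols = toy.shape   # every row and column, nulls included
non_null = toy.count()       # non-null entries per column

print(n_rows, n_cols)
print(non_null)
```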

## Checking for the data types of values within the dataframe
We could use **astype(dtype)** to change the data type of records e.g. df.astype(float)


In [60]:
df.dtypes

English words/sentences    object
French words/sentences     object
dtype: object
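
As a small illustration of **astype**, the sketch below converts a column of numbers stored as text into floats. The `toy` frame and its `count` column are hypothetical; the sentence columns in this data set could not be converted this way, since text such as "Hi." is not a valid number.

```python
import pandas as pd

# Hypothetical frame where numbers were read in as text
toy = pd.DataFrame({"count": ["1", "2", "3"]})
print(toy["count"].dtype)

# Convert a single column by passing a {column: dtype} mapping
converted = toy.astype({"count": float})
print(converted["count"].dtype)
```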

## Checking for number of duplicates
- Detecting duplicates: **df.duplicated()** to check for duplicate rows.
- Removing duplicates: **df.drop_duplicates()** to remove duplicate rows.

In [61]:
df.duplicated().sum()

0
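
A result of 0 may look surprising given that "Run!" appears twice in the English column. The sketch below, on a hypothetical `toy` frame, shows that **df.duplicated()** compares whole rows by default, so two rows with the same English text but different French text are not duplicates. The `subset` argument restricts the comparison to chosen columns.

```python
import pandas as pd

# Hypothetical toy frame: the same English word with two French translations
toy = pd.DataFrame({
    "English": ["run", "run", "run"],
    "French": ["cours", "courez", "cours"],
})

dup_mask = toy.duplicated()                          # True where a full row repeats an earlier one
deduped = toy.drop_duplicates()                      # keeps the first occurrence of each row
eng_dups = toy.duplicated(subset=["English"]).sum()  # repeats judged on English only

print(dup_mask.tolist())
print(deduped)
print(eng_dups)
```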

## Printing Duplicates

In [83]:
duplicates = df[df.duplicated()]
duplicates

Unnamed: 0,English words/sentences,French words/sentences


In [84]:
df.isnull()

Unnamed: 0,English words/sentences,French words/sentences
1,False,False
2,False,False
3,False,False
4,False,False
5,False,False
...,...,...
175617,False,False
175618,False,False
175619,False,False
175620,False,False


In [68]:
Eng, Fre = df["English words/sentences"], df["French words/sentences"]

In [69]:
lower_Eng = []
for words in Eng:
    lower_Eng.append(words.lower())

In [70]:
lower_Fre = []
for words in Fre:
    lower_Fre.append(words.lower())

In [71]:
Lower_Eng_df = pd.DataFrame(lower_Eng, columns = ['English words/sentences'])

In [72]:
Lower_Eng_df.index = Lower_Eng_df.index + 1

In [163]:
Lower_Fre_df = pd.DataFrame(lower_Fre,index=range(1, len(lower_Fre)+1), columns = ["French words/sentences"])

In [164]:
Lower_Fre_df

Unnamed: 0,French words/sentences
1,salut!
2,cours !
3,courez !
4,qui ?
5,ça alors !
...,...
175617,"« l'économie en partant du haut vers le bas, ç..."
175618,une empreinte carbone est la somme de pollutio...
175619,la mort est une chose qu'on nous décourage sou...
175620,puisqu'il y a de multiples sites web sur chaqu...


In [128]:
Lemmatized_Eng.to_csv(r"C:\Users\Bildad Otieno\Documents\Billy_Repo\Translation_Mod\Data_Preprocess\lemmatized_English\Lemmatized_English_SpaCy.csv", index = False)

In [63]:
nlp = spacy.load('en_core_web_md')


In [101]:

'''# %%timeit   48.3 ms ± 8.22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
def lemm(lis):
    list2 = []
    for words in lis:
        list3 = []
        doc = nlp(words)
        for word in doc:
            list3.append(word.lemma_)
        list2.append(' '.join(list3))
    return list2

lemm(list1)
#%%timeit       110 ns ± 10.7 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)  -- note: this only times creating the lambda, not calling it
lemm = lambda lis : [' '.join(word.lemma_ for word in nlp(words)) for words in lis]'''

"# %%timeit   48.3 ms ± 8.22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\ndef lemm(lis):\n    list2 = []\n    for words in lis:\n        list3 = []\n        doc = nlp(words)\n        for word in doc:\n            list3.append(word.lemma_)\n        list2.append(' '.join(list3))\n    return list2\n\nlemm(list1)\n#%%timeit       110 ns ± 10.7 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)\nlemm = lambda lis : [' '.join(word.lemma_ for word in nlp(words)) for words in lis]"

In [79]:
# Load the spaCy language model
nlp = spacy.load("fr_core_news_md")

# Define the lemmatization function
Lemmatize_text = lambda text : ' '.join([token.lemma_ for token in nlp(text)])
# Apply lemmatization to the 'French Words' column
Lemmatized_Fre = Lower_Fre_df.applymap(Lemmatize_text)


In [80]:
Lemmatized_Fre.to_csv(r"C:\Users\Bildad Otieno\Documents\Billy_Repo\Translation_Mod\Data_Preprocess\lemmatized_French\Lemmatized_French_spaCy.csv", index = False)

In [95]:
tokenized_Eng = [nltk.sent_tokenize(word) for word in lower_Eng]
len(tokenized_Eng)

175621

In [96]:
tokenized_Fre = [nltk.sent_tokenize(word) for word in lower_Fre]
len(tokenized_Fre)

175621

In [106]:
tokenized_Fre[175616]

['?',
 "l'?conomie en partant du haut vers le bas, ?a ne marche jamais, ?",
 'a dit obama. ?',
 "le pays ne r?ussit pas lorsque seulement ceux qui sont au sommet s'en sortent bien.",
 "nous r?ussissons lorsque la classe moyenne s'?largit, lorsqu'elle se sent davantage en s?curit?.",
 '?']

In [326]:
Eng_Lemm = pd.read_csv(r'C:\Users\Bildad Otieno\Documents\Billy_Repo\Translation_Mod\Data_Preprocess\lemmatized_English\Lemmatized_English_SpaCy.csv')

In [142]:
Fre_Lemm = pd.read_csv(r'C:\Users\Bildad Otieno\Documents\Billy_Repo\Translation_Mod\Data_Preprocess\lemmatized_French\Lemmatized_French_spaCy.csv')

## Removing the Punctuation Marks

In [102]:
#Printing out a collection of punctuation marks, ASCII characters
print(string.punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


Initially I did this, but then I realized that I wasn't really using the full capabilities of the <span style = "color:red">if statement</span>. Notice that the if branch does nothing useful and that I am instead using the else branch to append the letters to my **col** list.

    def remove_punc(column):
        new_column = []
        for word in column:
            col = [] 
            for letter in word:
                if letter in string.punctuation:
                    letter = letter.replace(letter,'')
                else:
                    col.append(letter) #list for individual letters now without punctuation mark
                new_word = "".join(col)
            new_column.append(new_word)    
        return new_column

Instead I used <span style = "color:blue">not in</span> which was more effective and cleaner.

In [131]:
def remove_punc(column):
    new_column = []
    for word in column['English words/sentences']:
        col = []  # characters of the current entry, without punctuation marks
        for letter in word:
            if letter not in string.punctuation:
                col.append(letter)
        # Join once per entry, after the inner loop has finished; this also handles empty strings
        new_column.append("".join(col))
    return new_column
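
As an alternative sketch (not what the notebook uses), the same filtering can be done without an explicit character loop by using `str.maketrans` and `str.translate` from the standard library. The helper name `strip_ascii_punctuation` is made up for this example.

```python
import string

# Translation table that deletes every ASCII punctuation mark
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def strip_ascii_punctuation(text):
    # translate() drops each character that the table maps to None
    return text.translate(PUNCT_TABLE)

print(strip_ascii_punctuation("Hi. Who? I win!"))
```

Like the loop version, this only covers the ASCII marks in `string.punctuation`, so French guillemets are left in place.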

In [133]:
No_Punc_Eng = remove_punc(Eng_Lemm)

In [171]:
No_Punc_Eng_df = pd.DataFrame(No_Punc_Eng, index = range(1, len(No_Punc_Eng)+1), columns = ["English words/sentences"])

In [270]:
No_Punc_Eng_df

English words/sentences    I win 
Name: 23, dtype: object

In [None]:

# Apply lemmatization to the 'French Words' column
Lemmatized_Fre = Lower_Fre_df.applymap(Lemmatize_text)

In [144]:
def remove_punc(column):
    new_column = []
    for word in column['French words/sentences']:
        col = []  # characters of the current entry, without punctuation marks
        for letter in word:
            if letter not in string.punctuation:
                col.append(letter)
        # Join once per entry, after the inner loop has finished; this also handles empty strings
        new_column.append("".join(col))
    return new_column

In [151]:
No_Punc_Fre = remove_punc(Fre_Lemm)


In [155]:
new_No_Punc_Fre = [item.replace('\u202f', '') for item in No_Punc_Fre]
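
The `'\u202f'` character removed here is the narrow no-break space that French typography puts before marks such as `!` and `?`. A possible way to handle it together with non-ASCII punctuation such as `«` and `»` is sketched below using the standard-library `unicodedata` module. The function name is hypothetical, and this is an alternative to the notebook's approach rather than a description of it. Note that symbols such as `$` or `+` are not in a Unicode "P" category, so they are kept.

```python
import unicodedata

def strip_unicode_punctuation(text):
    # Drop every character whose Unicode category starts with "P" (punctuation);
    # this covers ASCII marks as well as « and ».
    kept = "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))
    # split() with no argument splits on any Unicode whitespace, including U+202F,
    # so joining with a plain space normalises the spacing.
    return " ".join(kept.split())

print(strip_unicode_punctuation("ça alors\u202f!"))
print(strip_unicode_punctuation("« l'économie »"))
```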

In [169]:
No_Punc_Fre_df = pd.DataFrame(new_No_Punc_Fre, index = range(1, len(new_No_Punc_Fre)+1), columns = ["French words/sentences"])

In [269]:
No_Punc_Fre_df

English words/sentences    I win 
Name: 23, dtype: object

In [210]:
nlp = spacy.load('en_core_web_md')

In [245]:
def Stopword_Removal(lis):
    list2 = []
    for sentence in lis['English words/sentences']:
        list3 = []
        doc = nlp(sentence)
        for token in doc:
            if not token.is_stop:
                list3.append(token.text)  
        list2.append(' '.join(list3))
    return list2


In [271]:
Lower_No_Punc_Eng = []

for word in No_Punc_Eng:
    Lower_No_Punc_Eng.append(word.lower())

In [311]:
nltk.download('stopwords')

# Verifying that we have English and French stopwords

from nltk.corpus import stopwords

#stopwords.fileids()  -- will print out the list of languages for which stopwords are available
stop_Eng = stopwords.words('english') #179 of them
stop_Fre = stopwords.words('french')  # needed by the French stopword loop



[nltk_data] Downloading package stopwords to C:\Users\Bildad
[nltk_data]     Otieno/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [334]:
list2 = []
for sentence in Lower_No_Punc_Eng:
    stop = []
    for word in sentence.split():  # Split the sentence into words
        if word not in stop_Eng or word in {'go', 'on', 'now'}:
            stop.append(word)
    list2.append(' '.join(stop))
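
The condition in this loop keeps a word if it is not a stop word, or if it is one of the exceptions. A compact sketch of the same logic with sets follows. The `STOP` set is a tiny hypothetical stand-in for `stopwords.words('english')`, and the function name is made up for illustration.

```python
# Hypothetical, tiny stop-word list standing in for stopwords.words('english')
STOP = {"i", "will", "the", "on", "now", "to"}
# Words to keep even though they are stop words
KEEP = {"go", "on", "now"}

def remove_stopwords(sentence, stop=STOP, keep=KEEP):
    # Removing the exceptions from the stop set is equivalent to
    # "word not in stop or word in keep"
    effective_stop = stop - keep
    return " ".join(w for w in sentence.split() if w not in effective_stop)

print(remove_stopwords("i will go on now to the shop"))
```

Set membership tests take roughly constant time, which may help when the loop runs over all 175621 sentences.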

In [336]:
#No_Stop_Eng_df = pd.DataFrame(list2, index = range(1, len(list2)+1), columns = ["English words/sentences"])

In [338]:
#No_Stop_Eng_df.to_csv(r'C:\Users\Bildad Otieno\Documents\Billy_Repo\Translation_Mod\No_Stopword_Eng\No_Stop_Eng.csv', index = False)

In [339]:
list2 = []
for sentence in No_Punc_Fre:
    stop = []
    for word in sentence.split():  # Split the sentence into words
        if word not in stop_Fre:
            stop.append(word)
    list2.append(' '.join(stop))


KeyboardInterrupt: 

In [None]:
No_Stop_Eng_df = pd.DataFrame(list2, index = range(1, len(list2)+1), columns = ['English words/sentences'])

In [323]:
Eng_df = pd.read_csv(r'No_Stopword_Eng\No_Stop_Eng.csv')

In [342]:
# Concatenate horizontally (along columns)
#combined_df = pd.concat([No_Stop_Eng_df, No_Stop_Fre_df], axis=1)

In [345]:
#combined_df.to_csv(r'C:\Users\Bildad Otieno\Documents\Billy_Repo\Translation_Mod\Combined_No_Stop.csv', index = False)

In [394]:
'''import re

def preprocess_text(text):
    # Remove French quotation marks (guillemets)
    text = re.sub(r'[«»]', '', text)
    return text

# Assuming you have a DataFrame named combined_df_filtered with a column 'French words/sentences'
combined_df_filtered['French words/sentences'] = combined_df_filtered['French words/sentences'].apply(preprocess_text)'''


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  combined_df_filtered['French words/sentences'] = combined_df_filtered['French words/sentences'].apply(preprocess_text)


In [370]:
'''# Use str.match() to find rows with only numbers in 'English words/sentences' column
rows_with_numbers = combined_df[combined_df['English words/sentences'].str.match(r'^\d+$')]

# Use ~ to negate the condition and drop rows with only numbers
combined_df_filtered = combined_df[~combined_df['English words/sentences'].str.match(r'^\d+$')]'''

In [389]:
'''import pandas as pd
import numpy as np

# Replace empty strings with NaN
combined_df_filtered.replace('', np.nan, inplace=True)

# Drop rows with missing values (NaN) from the entire DataFrame
combined_df_filtered.dropna(inplace=True)
'''

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  combined_df_filtered.replace('', np.nan, inplace=True)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  combined_df_filtered.dropna(inplace=True)


In [395]:
#combined_df_filtered.to_csv(r'C:\Users\Bildad Otieno\Documents\Billy_Repo\Translation_Mod\Combined_no_null.csv',index = False)

In [303]:
#No_Stop_Fre_df = pd.DataFrame(list2, index = range(1, len(list2)+1), columns = ['French words/sentences'])

In [306]:
#No_Stop_Fre_df.to_csv(r'C:\Users\Bildad Otieno\Documents\Billy_Repo\Translation_Mod\No_Stopword_Fre\No_Stop_Fre.csv', index = False)

In [None]:
list2 = []
for sentence in Lower_No_Punc_Eng:
    stop = []
    for word in sentence.split():  # Split the sentence into words
        if word not in stop_Eng or word in {'go', 'on'}:
            stop.append(word)
    list2.append(' '.join(stop))

In [190]:
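def Stopword_Removal(No_Punc_df):
    # Assumes nlp holds the French model (fr_core_news_md) when this runs
    Stop_list = []
    for sentence in No_Punc_df['French words/sentences']:
        doc = nlp(sentence)
        # Keep the text of every token that is not a stop word, one string per sentence
        Stop_list.append(' '.join(token.text for token in doc if not token.is_stop))
    return Stop_list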

In [192]:
#No_Stop_Fre = Stopword_Removal(No_Punc_Fre_df)

NameError: name 'Stopword_removal' is not defined

In [None]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to C:\Users\Bildad
[nltk_data]     Otieno/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [None]:
'''FrenchPar = FrenchPar.repartition(npartitions=20)
FrenchPar'''

Unnamed: 0_level_0,French Words
npartitions=20,Unnamed: 1_level_1
,object
,...
...,...
,...
,...


In [None]:
#French2 = pd.read_csv("French.csv", usecols=lambda col: col != 'Unnamed: 0')

In [None]:
'''import dask.dataframe as dd
FrenchPar = dd.read_parquet("French.parquet")
len(FrenchPar)'''

175621

In [None]:
#FrenchPar.to_parquet(r"C:\Users\Bildad Otieno\Documents\Billy_Repo\Translation_Mod\French")

df3 = pd.DataFrame(lower_Eng,columns= ["English Words"])
df3.to_parquet("C:\\Users\\Bildad Otieno\\Documents\\Billy_Repo\\Translation_Mod\\English.parquet")

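import dask.dataframe as dd

# npartitions and repartition are Dask features, so the file is read with Dask rather than pandas
EnglishPar = dd.read_parquet(r"C:\Users\Bildad Otieno\Documents\Billy_Repo\Translation_Mod\English.parquet")
EnglishPar

EnglishPar.npartitions

EnglishPar = EnglishPar.repartition(npartitions=20)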

I will opt for lemmatization rather than the stemming I used before:


    ps = PorterStemmer()
    print(" {0:25}  {1:25} ".format("--Word(s)--","--Stem--"))
    for word in lower_Eng:
        print("   {0:25}  {1:25} ".format(word,ps.stem(word)))


In [None]:
# nltk.download('all') - Every Package is Up-to-date for my Ellie

In [None]:
nltk.download("wordnet")

[nltk_data] Downloading package wordnet to C:\Users\Bildad
[nltk_data]     Otieno/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [None]:
wnl = WordNetLemmatizer()

In [None]:
df_lemm_Eng = pd.DataFrame(lemm_Eng, columns = ["English Words"])

!python -m spacy download fr_core_news_md

In [None]:
'''
import spacy
nlp = spacy.load('fr_core_news_md')
Part0 = dd.read_parquet(r"C:\Users\Bildad Otieno\Documents\Billy_Repo\Translation_Mod\French\part.0.parquet")
Part1 = dd.read_parquet(r"C:\Users\Bildad Otieno\Documents\Billy_Repo\Translation_Mod\French\part.1.parquet")
Part2 = dd.read_parquet(r"C:\Users\Bildad Otieno\Documents\Billy_Repo\Translation_Mod\French\part.2.parquet")
Part3 = dd.read_parquet(r"C:\Users\Bildad Otieno\Documents\Billy_Repo\Translation_Mod\French\part.3.parquet")
Part4 = dd.read_parquet(r"C:\Users\Bildad Otieno\Documents\Billy_Repo\Translation_Mod\French\part.4.parquet")
Part5 = dd.read_parquet(r"C:\Users\Bildad Otieno\Documents\Billy_Repo\Translation_Mod\French\part.5.parquet")
Part6 = dd.read_parquet(r"C:\Users\Bildad Otieno\Documents\Billy_Repo\Translation_Mod\French\part.6.parquet")
Part7 = dd.read_parquet(r"C:\Users\Bildad Otieno\Documents\Billy_Repo\Translation_Mod\French\part.7.parquet")
Part8 = dd.read_parquet(r"C:\Users\Bildad Otieno\Documents\Billy_Repo\Translation_Mod\French\part.8.parquet")
Part9 = dd.read_parquet(r"C:\Users\Bildad Otieno\Documents\Billy_Repo\Translation_Mod\French\part.9.parquet")
Part10 = dd.read_parquet(r"C:\Users\Bildad Otieno\Documents\Billy_Repo\Translation_Mod\French\part.10.parquet")
Part11 = dd.read_parquet(r"C:\Users\Bildad Otieno\Documents\Billy_Repo\Translation_Mod\French\part.11.parquet")
Part12 = dd.read_parquet(r"C:\Users\Bildad Otieno\Documents\Billy_Repo\Translation_Mod\French\part.12.parquet")
Part13 = dd.read_parquet(r"C:\Users\Bildad Otieno\Documents\Billy_Repo\Translation_Mod\French\part.13.parquet")
Part14 = dd.read_parquet(r"C:\Users\Bildad Otieno\Documents\Billy_Repo\Translation_Mod\French\part.14.parquet")
Part15 = dd.read_parquet(r"C:\Users\Bildad Otieno\Documents\Billy_Repo\Translation_Mod\French\part.15.parquet")
Part16 = dd.read_parquet(r"C:\Users\Bildad Otieno\Documents\Billy_Repo\Translation_Mod\French\part.16.parquet")
Part17 = dd.read_parquet(r"C:\Users\Bildad Otieno\Documents\Billy_Repo\Translation_Mod\French\part.17.parquet")
Part18 = dd.read_parquet(r"C:\Users\Bildad Otieno\Documents\Billy_Repo\Translation_Mod\French\part.18.parquet")
Part19 = dd.read_parquet(r"C:\Users\Bildad Otieno\Documents\Billy_Repo\Translation_Mod\French\part.19.parquet")
'''

In [None]:
'''pandas_df = Part0.compute()
for index, row in pandas_df.iterrows():
    print(row['French Words'])'''


"pandas_df = Part0.compute()\nfor index, row in pandas_df.iterrows():\n    print(row['French Words'])"

In [None]:
eng_sub = newdf["English Words"]

In [None]:
lemmatize_text("i m")

'I m'

In [None]:
'''pandas_df = lemmatized_df
for index, row in pandas_df.iterrows():
    print(row['French Words'])'''

"pandas_df = lemmatized_df\nfor index, row in pandas_df.iterrows():\n    print(row['French Words'])"

In [None]:
#lemmatized_df.to_csv(r'C:\Users\Bildad Otieno\Documents\Billy_Repo\Translation_Mod\lemmatized_output19.csv', index=False)

In [None]:
'''lemmatized_words = [word.lemma_ for doc in Part0.to_delayed() for word in nlp(str(doc.compute()))]
print(lemmatized_words)
'''

['                 ', 'French', 'Words', '\n', '0', '                      ', 'salut', '\n', '1', '                      ', 'cours', '\n', '2', '                     ', 'courir', '\n', '3', '                        ', 'qui', '\n', '4', '                    ', 'avoir', 'alors', '\n', '...', '                      ', '...', '\n', '8776', ' ', 'puisje', 'me', 'joindre', ' ', 'vous', '\n', '8777', ' ', 'puisje', 'vous', 'accompagner', '\n', '8778', '   ', 'puisje', 'vous', 'embrasser', '\n', '8779', '     ', 'puisje', 'masseoir', 'ici', '\n', '8780', '    ', 'puisje', 'utiliser', 'ceci', '\n\n', '[', '8781', 'row', 'x', '1', 'column', ']']


In [328]:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
print(df.to_string())

"pd.set_option('display.max_rows', None)\npd.set_option('display.max_columns', None)\nprint(df.to_string)"

In [None]:
'''new_col = []
for doc in Part0.to_delayed():
    col = []
    doc = nlp(str(doc.compute()))
    for word in doc:
        col.append(word.lemma_)
    new_word = " ".join(col)
    
    new_col.append(new_word)
    
new_col 
print(new_col[0])'''

                  French Words 
 0                        salut 
 1                        cours 
 2                       courir 
 3                          qui 
 4                      avoir alors 
 ...                        ... 
 8776   puisje me joindre   vous 
 8777   puisje vous accompagner 
 8778     puisje vous embrasser 
 8779       puisje masseoir ici 
 8780      puisje utiliser ceci 

 [ 8781 row x 1 column ]


# Padding

In [11]:
import pandas as pd
Combined_no_null = pd.read_csv(r"C:\Users\Bildad Otieno\Documents\Billy_Repo\Translation_Mod\Combined_no_null.csv", encoding = 'latin1')
Combined_no_null.index = Combined_no_null.index +1

In [13]:
len(Combined_no_null)

174211

In [14]:
Combined_no_null

Unnamed: 0,English words/sentences,French words/sentences
1,hi,salut
2,run,cours
3,run,courir
4,wow,cela alors
5,fire,feu
...,...,...
174207,top economic never work say obama country succ...,économie partir haut vers bas cela marche jama...
174208,carbon footprint amount carbon dioxide polluti...,empreinte carbone être somme pollution dioxyde...
174209,death something often discourage talk even thi...,mort être chose décourager souvent discuter pe...
174210,since usually multiple website on give topic u...,puisque avoir multiple site web chaque sujet c...


In [17]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer


tokenizer = Tokenizer()
tokenizer.fit_on_texts(Combined_no_null['English words/sentences'])


In [18]:
tokenizer2 = Tokenizer()
tokenizer2.fit_on_texts(Combined_no_null['French words/sentences'])

In [19]:
print(tokenizer.word_index)
print(tokenizer2.word_index)

{'être': 1, 'avoir': 2, 'tom': 3, 'faire': 4, 'pouvoir': 5, 'vouloir': 6, 'tout': 7, 'aller': 8, 'dire': 9, 'devoir': 10, 'savoir': 11, 'cela': 12, 'plus': 13, 'penser': 14, 'aimer': 15, 'si': 16, 'voir': 17, 'quelque': 18, 'chose': 19, 'parler': 20, 'très': 21, 'prendre': 22, 'celer': 23, 'temps': 24, 'pourquoi': 25, 'venir': 26, 'luire': 27, 'ici': 28, 'bon': 29, 'jamais': 30, 'quel': 31, 'où': 32, 'arriver': 33, 'comment': 34, 'vraiment': 35, 'bien': 36, 'personne': 37, 'vou': 38, 'besoin': 39, 'beaucoup': 40, 'trouver': 41, 'passer': 42, 'là': 43, 'heure': 44, 'autre': 45, 'manger': 46, 'rien': 47, 'seul': 48, 'maison': 49, 'peu': 50, 'mary': 51, 'train': 52, 'falloir': 53, 'toujours': 54, 'trop': 55, 'comme': 56, 'aider': 57, 'croire': 58, 'deux': 59, 'argent': 60, 'quand': 61, 'quoi': 62, 'encore': 63, 'nouveau': 64, 'monde': 65, 'attendre': 66, 'essayer': 67, 'rendre': 68, 'jour': 69, 'livre': 70, 'voiture': 71, 'entendre': 72, 'partir': 73, 'demander': 74, 'laisser': 75, 'mettr

In [21]:
Combined_no_null["Eng_sequences"] = tokenizer.texts_to_sequences(Combined_no_null["English words/sentences"])
Combined_no_null["Fre_sequences"] = tokenizer2.texts_to_sequences(Combined_no_null["French words/sentences"])
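
To show roughly what the two steps above produce, here is a plain-Python sketch of a frequency-ranked word index and the conversion of sentences into integer sequences. It is a simplified imitation, not the Keras implementation: it only splits on whitespace, and the helper names and sample sentences are made up.

```python
from collections import Counter

def build_word_index(texts):
    counts = Counter(word for text in texts for word in text.split())
    # The most frequent word gets index 1; index 0 stays free for padding
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index):
    # Words that are missing from the index are skipped
    return [[word_index[w] for w in text.split() if w in word_index] for text in texts]

texts = ["tom run", "tom win", "tom run", "tom win run"]
word_index = build_word_index(texts)
sequences = texts_to_sequences(texts, word_index)

print(word_index)
print(sequences)
```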


In [24]:
max_sequence_length = 100  # Set the maximum sequence length

# Padding English sequences
padded_eng_sequences = pad_sequences(Combined_no_null["Eng_sequences"], maxlen=max_sequence_length, padding='post', truncating='post')

# Padding French sequences
padded_fre_sequences = pad_sequences(Combined_no_null["Fre_sequences"], maxlen=max_sequence_length, padding='post', truncating='post')
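
The effect of `padding='post'` and `truncating='post'` can be illustrated with a few lines of plain Python. This is a sketch of the behaviour on lists, assuming a pad value of 0; `pad_post` is a hypothetical helper, not part of Keras.

```python
def pad_post(sequences, maxlen, value=0):
    padded = []
    for seq in sequences:
        seq = list(seq)[:maxlen]                    # truncating='post': cut from the end
        seq = seq + [value] * (maxlen - len(seq))   # padding='post': fill at the end
        padded.append(seq)
    return padded

example = [[1, 2], [3, 4, 5, 6, 7]]
print(pad_post(example, maxlen=4))

# Length of the longest sequence, which could guide the choice of maxlen
longest = max(len(seq) for seq in example)
print(longest)
```

Checking the longest sequence in the real data in the same way would show whether 100 is larger than necessary.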


In [26]:
from sklearn.model_selection import train_test_split

In [28]:
# Assuming you have separate arrays for English and French sequences
X_train_eng, X_test_eng, y_train_french, y_test_french = train_test_split(padded_eng_sequences, padded_fre_sequences, test_size=0.15, random_state=42)
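
Passing both arrays in one call matters because each English row must stay paired with its French translation after shuffling. The sketch below illustrates that idea with the standard library only. It is not how scikit-learn is implemented: `paired_split` and the sample data are hypothetical, and this version rounds the test size up.

```python
import math
import random

def paired_split(xs, ys, test_size=0.15, seed=42):
    assert len(xs) == len(ys)
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)        # one shuffle shared by both lists
    n_test = math.ceil(len(idx) * test_size)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    pick = lambda data, ids: [data[i] for i in ids]
    return pick(xs, train_idx), pick(xs, test_idx), pick(ys, train_idx), pick(ys, test_idx)

eng = [f"en{i}" for i in range(20)]
fre = [f"fr{i}" for i in range(20)]
x_train, x_test, y_train, y_test = paired_split(eng, fre)

print(len(x_train), len(x_test))
print(list(zip(x_test, y_test)))
```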


In [29]:
X_train_eng

array([[ 124, 4761, 4863, ...,    0,    0,    0],
       [ 744,  103,  298, ...,    0,    0,    0],
       [  63, 1880,    0, ...,    0,    0,    0],
       ...,
       [   2,  466,  108, ...,    0,    0,    0],
       [ 614,   93, 1011, ...,    0,    0,    0],
       [  43,  334,   57, ...,    0,    0,    0]])