## Brief description of the data set and a summary of its attributes
My data set is made up of two columns: English words/sentences and French words/sentences.

It contains 175621 rows, none of which have null values.

I pulled the data set from Kaggle because I needed data with correct translations in both languages. I am fluent in only one of them, English, so it provided me with ready-made translations.

## Importing the necessary libraries
- pandas
- string
- nltk
- spacy
- sklearn

In [1]:
import pandas as pd
import string
import nltk
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
import spacy
from sklearn.model_selection import train_test_split


## Using pandas library to retrieve the csv file

In [2]:
file = r"C:\Users\Bildad Otieno\Documents\Billy_Repo\Translation_Mod\Eng-Fre.csv"
df = pd.read_csv(file, encoding= 'utf-8')
df = df.replace('�','',regex = True)
#df.to_csv("C:\\Users\\Bildad Otieno\\Documents\\Billy_Repo\\Translation_Mod\\Eng-Fre2.csv", index = False)
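
The `�` character stripped above usually signals that a file was written in a different encoding from the one used to read it. The sketch below is an illustration only: it assumes, without confirmation from the data set, that the text was saved in Latin-1, and uses a made-up sample string to show why deleting `�` also deletes the accented letter it stood for.

```python
# Hypothetical sample: these bytes stand in for one cell of a CSV
# that was saved as Latin-1 rather than UTF-8 (an assumption).
raw = "Ça alors !".encode("latin-1")

# Decoding as UTF-8 cannot interpret the byte for "Ç",
# so it is replaced by U+FFFD (the "�" character).
as_utf8 = raw.decode("utf-8", errors="replace")

# Stripping U+FFFD, as df.replace('�', '') does, loses the letter for good.
stripped = as_utf8.replace("\ufffd", "")

# Decoding with the encoding the bytes were written in keeps the accent.
as_latin1 = raw.decode("latin-1")

print(repr(as_utf8), repr(stripped), repr(as_latin1))
```

If the real file behaves the same way, passing a matching `encoding=` to `pd.read_csv` might preserve the French accents instead of dropping them.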

## Printing out the first 5 rows of the data frame

In [3]:
df.head(5)

Unnamed: 0,English words/sentences,French words/sentences
0,Hi.,Salut!
1,Run!,Cours?!
2,Run!,Courez?!
3,Who?,Qui ?
4,Wow!,?a alors?!


## Checking for any null values
Checking for missing values: 
- **df.isnull()** or **df.isna()** - will return true if null
- **df.notnull()** - will return False if null

Handling missing values:
1)   Removing rows or columns with missing values: **df.dropna()**
2)   Interpolating missing values: **df.interpolate()**
3)   Imputing missing values: You can use **df.fillna(value)** to fill missing values with a specific value, or use more advanced techniques like mean, median, or machine learning algorithms for imputation.

In [4]:
df.isna().sum()

English words/sentences    0
French words/sentences     0
dtype: int64
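
The options listed above can be sketched on a small toy frame (the rows and the `<missing>` placeholder are made up for illustration). Interpolation is intended for numeric data, so for text columns like these, dropping or filling is the realistic choice.

```python
import pandas as pd

# Toy frame with one missing French translation (hypothetical data).
toy = pd.DataFrame({
    "English": ["Hi.", "Run!", "Who?"],
    "French": ["Salut!", None, "Qui ?"],
})

missing_per_column = toy.isna().sum()         # number of nulls in each column
dropped = toy.dropna()                        # rows containing any null are removed
filled = toy.fillna({"French": "<missing>"})  # nulls replaced with a placeholder

print(missing_per_column)
print(len(dropped), filled.loc[1, "French"])
```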

## Checking for unique values

In [5]:
df.nunique().sum()

289032
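
Note that `df.nunique().sum()` adds the per-column unique counts together, so the figure above is the number of distinct English strings plus the number of distinct French strings, not the number of distinct pairs. A minimal sketch on made-up rows shows the difference:

```python
import pandas as pd

# Toy rows (hypothetical): "Run!" has two different French translations.
toy = pd.DataFrame({
    "English": ["Run!", "Run!", "Who?"],
    "French": ["Cours !", "Courez !", "Qui ?"],
})

per_column = toy.nunique()                 # unique values in each column
total = per_column.sum()                   # the two column counts added together
unique_pairs = len(toy.drop_duplicates())  # distinct (English, French) rows

print(per_column)
print(total, unique_pairs)
```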

## Checking the number of rows
The **shape** attribute returns a tuple of the form (rows, columns), so index 0 gives the number of rows.

In [6]:
df.shape[0]

175621

## Checking for number of records
We could also use **count()** to see the number of non-null records in each column.

In [7]:
df.count()

English words/sentences    175621
French words/sentences     175621
dtype: int64

## Checking for the data types of values within the dataframe
We could use **astype(dtype)** to change the data type of a column, e.g. df.astype(float) for numeric data.


In [8]:
df.dtypes

English words/sentences    object
French words/sentences     object
dtype: object
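
As a small illustration of `astype`, the sketch below converts a column of numeric strings to integers. The `length` column is hypothetical and not part of the data set; both real columns hold text, which is why they are reported as `object`.

```python
import pandas as pd

# Hypothetical frame: "length" holds numbers stored as strings.
toy = pd.DataFrame({"English": ["Hi.", "Run!"], "length": ["3", "5"]})

before = str(toy["length"].dtype)               # strings are not a numeric dtype
toy["length"] = toy["length"].astype("int64")   # explicit 64-bit integer conversion
after = str(toy["length"].dtype)

print(before, after, toy["length"].sum())
```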

## Checking for number of duplicates
- Detecting duplicates: **df.duplicated()** to check for duplicate rows.
- Removing duplicates: **df.drop_duplicates()** to remove duplicate rows.

In [9]:
df.duplicated().sum()

5

## Printing Duplicates

In [10]:
duplicates = df[df.duplicated()]
duplicates

Unnamed: 0,English words/sentences,French words/sentences
1621,Stand back!,Reculez !
8626,It's not funny.,Ce n'est pas dr?le !
65263,What time will you leave?,? quelle heure pars-tu??
100309,What's the weather like today?,Quel temps fait-il aujourd'hui??
147869,Medical marijuana is legal in this state.,La marijuana th?rapeutique est l?gale dans cet...
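
By default `duplicated()` flags only the second and later copies of a row, which is why each duplicated pair appears once above. A minimal sketch on made-up rows shows the `keep=False` variant, which flags every copy, alongside `drop_duplicates()`:

```python
import pandas as pd

# Toy rows (hypothetical) with one exact duplicate pair.
toy = pd.DataFrame({
    "English": ["Stand back!", "Hi.", "Stand back!"],
    "French": ["Reculez !", "Salut !", "Reculez !"],
})

later_copies = toy.duplicated()           # flags only the repeat of a row
all_copies = toy.duplicated(keep=False)   # flags every row that has a twin
deduped = toy.drop_duplicates().reset_index(drop=True)

print(later_copies.tolist(), all_copies.tolist(), len(deduped))
```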


In [11]:
df.isnull()

Unnamed: 0,English words/sentences,French words/sentences
0,False,False
1,False,False
2,False,False
3,False,False
4,False,False
...,...,...
175616,False,False
175617,False,False
175618,False,False
175619,False,False


In [12]:
Eng, Fre = df["English words/sentences"], df["French words/sentences"]

In [13]:
# Print the ASCII punctuation characters provided by the string module
print(string.punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


## Removing the Punctuation Marks

Initially I did this, but then realized that I wasn't really using the full capabilities of the <span style = "color:red">if statement</span>. Notice that the if branch does nothing useful; only the else branch appends letters to my **col** list.

    def remove_punc(column):
        new_column = []
        for word in column:
            col = [] 
            for letter in word:
                if letter in string.punctuation:
                    letter = letter.replace(letter,'')
                else:
                    col.append(letter) #list for individual letters now without punctuation mark
                new_word = "".join(col)
            new_column.append(new_word)    
        return new_column

Instead I used <span style = "color:blue">not in</span>, which was cleaner and more effective.

In [14]:
def remove_punc(column):
    new_column = []
    for word in column:
        col = []
        for letter in word:
            if letter not in string.punctuation:
                col.append(letter)  # keep only non-punctuation characters
        new_word = "".join(col)  # join once per string, after the inner loop
        new_column.append(new_word)
    return new_column
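
An alternative sketch, not the approach used in this notebook, relies on `str.translate` with a deletion table built by `str.maketrans`. The name `remove_punc_translate` is made up for this example. Since `string.punctuation` covers ASCII only, characters such as the French guillemets are left in place either way.

```python
import string

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_punc_translate(column):
    """Return the strings in `column` with ASCII punctuation removed."""
    return [text.translate(PUNCT_TABLE) for text in column]

cleaned = remove_punc_translate(["Hi.", "It's not funny.", "Qui ?"])
print(cleaned)

# Non-ASCII punctuation is not in string.punctuation, so it survives.
print(remove_punc_translate(["«Salut»!"]))
```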

In [15]:
No_Punc_Eng = remove_punc(Eng)

In [16]:
No_Punc_Fre = remove_punc(Fre)

In [17]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to C:\Users\Bildad
[nltk_data]     Otieno/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [18]:
tokenized_Eng = [nltk.word_tokenize(word) for word in No_Punc_Eng]
len(tokenized_Eng)

175621

In [19]:
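# Note: sent_tokenize splits into sentences, not words, so each French row stays a sentence-level string
tokenized_Fre = [nltk.sent_tokenize(word) for word in No_Punc_Fre]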
len(tokenized_Fre)

175621

In [20]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to C:\Users\Bildad
[nltk_data]     Otieno/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [21]:
# Verifying that we have English and French stopwords
from nltk.corpus import stopwords
#stopwords.fileids()

In [22]:
stop_Eng = stopwords.words('english') #179 of them

In [23]:
stop_Fre = stopwords.words('french') #157 of them

In [24]:
No_Stop_Eng = []
for word in tokenized_Eng:
    if word not in stop_Eng:
        No_Stop_Eng.append(word)

In [25]:
len(No_Stop_Eng)

175621

In [26]:
No_Stop_Fre = []
for word in tokenized_Fre:
    if word not in stop_Fre:
        No_Stop_Fre.append(word)

In [27]:
len(No_Stop_Fre)

175621
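
One thing worth checking in the two loops above: each `word` is a whole list of tokens for a row, and a list never equals a stopword string, so the rows appear to pass through unchanged. The sketch below contrasts that row-level test with a token-level filter. The short stopword list and the sample rows are made up; in the notebook they would be `stopwords.words(...)` and the tokenized columns.

```python
# Hypothetical mini stopword list standing in for stopwords.words('english').
stop_words = ["i", "am", "the", "is"]

# Hypothetical tokenized rows standing in for tokenized_Eng.
tokenized = [["I", "am", "happy"], ["The", "sky", "is", "blue"]]

# Row-level test: a list of tokens is never "in" a list of strings,
# so every row is kept exactly as it was.
row_level = [tokens for tokens in tokenized if tokens not in stop_words]

# Token-level test: compare each lowercased token with the stopword list.
token_level = [
    [tok for tok in tokens if tok.lower() not in stop_words]
    for tokens in tokenized
]

print(row_level)
print(token_level)
```

Whether stopwords should be removed at all for a translation task is a separate design question, since words like "the" or "is" usually need translating too.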

In [28]:
lower_Fre = []
for words in No_Stop_Fre:
    for word in words:
        lower_Fre.append(word.lower())

In [29]:
lower_Eng = []
for words in No_Stop_Eng:
    for word in words:
        lower_Eng.append(word.lower())
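
The nested loops above also flatten the data: every token from every row ends up in one long list, so the row boundaries are no longer recoverable. A minimal sketch on made-up rows compares this with a version that lowercases while keeping one list per row, which may matter if the English and French rows need to stay paired.

```python
# Hypothetical tokenized rows.
rows = [["Hi"], ["I", "won"]]

# Flattened, as in the loops above: row boundaries are lost.
flat = [tok.lower() for tokens in rows for tok in tokens]

# Nested: one lowercased token list per original row.
nested = [[tok.lower() for tok in tokens] for tokens in rows]

print(flat)
print(nested)
```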

In [30]:
df = pd.DataFrame(lower_Fre,columns= ["French Words"])
df2 = pd.DataFrame(lower_Fre,columns= ["French Words"])
df.to_parquet("C:\\Users\\Bildad Otieno\\Documents\\Billy_Repo\\Translation_Mod\\French.parquet")
df2.to_csv("C:\\Users\\Bildad Otieno\\Documents\\Billy_Repo\\Translation_Mod\\French.csv")

In [31]:
French = pd.read_parquet("French.parquet")
len(French)

175621

In [32]:
#French2 = pd.read_csv("French.csv", usecols=lambda col: col != 'Unnamed: 0')

In [33]:
import dask.dataframe as dd
FrenchPar = dd.read_parquet("French.parquet")
len(FrenchPar)

175621

In [34]:
FrenchPar = FrenchPar.repartition(npartitions=20)
FrenchPar

Unnamed: 0_level_0,French Words
npartitions=20,Unnamed: 1_level_1
,object
,...
...,...
,...
,...


In [35]:
FrenchPar.to_parquet(r"C:\Users\Bildad Otieno\Documents\Billy_Repo\Translation_Mod\French")

df3 = pd.DataFrame(lower_Eng,columns= ["English Words"])
df3.to_parquet("C:\\Users\\Bildad Otieno\\Documents\\Billy_Repo\\Translation_Mod\\English.parquet")

# Read with dask (not pandas) so that npartitions and repartition are available
EnglishPar = dd.read_parquet(r"C:\Users\Bildad Otieno\Documents\Billy_Repo\Translation_Mod\English.parquet")
EnglishPar

EnglishPar.npartitions

EnglishPar = EnglishPar.repartition(npartitions=20)

I will opt for lemmatization rather than the stemming I used before:


    ps = PorterStemmer()
    print(" {0:25}  {1:25} ".format("--Word(s)--","--Stem--"))
    for word in lower_Eng:
        print("   {0:25}  {1:25} ".format(word,ps.stem(word)))


In [36]:
# nltk.download('all') - Every Package is Up-to-date for my Ellie

In [37]:
nltk.download("wordnet")

[nltk_data] Downloading package wordnet to C:\Users\Bildad
[nltk_data]     Otieno/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [38]:
wnl = WordNetLemmatizer()

In [39]:
nltk.download('wordnet')
'''print(" {0:25}  {1:25} ".format("--Word(s)--","--Lemma--"))
for word in lower_Fre:
    print("   {0:25}  {1:25} ".format(word, wnl.lemmatize(word, pos='v')))'''
    
lemm_Eng = [wnl.lemmatize(word, pos='v') for word in lower_Eng]

[nltk_data] Downloading package wordnet to C:\Users\Bildad
[nltk_data]     Otieno/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


!python -m spacy download fr_core_news_md

In [125]:
import spacy
nlp = spacy.load('fr_core_news_md')
Part0 = dd.read_parquet(r"C:\Users\Bildad Otieno\Documents\Billy_Repo\Translation_Mod\French\part.0.parquet")

In [155]:
'''pandas_df = Part0.compute()
for index, row in pandas_df.iterrows():
    print(row['French Words'])'''


"pandas_df = Part0.compute()\nfor index, row in pandas_df.iterrows():\n    print(row['French Words'])"

In [154]:
import spacy
import pandas as pd
import dask.dataframe as dd

# Load the spaCy language model
nlp = spacy.load("fr_core_news_md")

# Define the lemmatization function
def lemmatize_partition(partition):
    partition['French Words'] = partition['French Words'].apply(lambda text: ' '.join([token.lemma_ for token in nlp(text)]))
    return partition

# Apply lemmatization to each partition of the DataFrame
lemmatized_partition = Part0.map_partitions(lemmatize_partition)

# Convert the Dask DataFrame back to a Pandas DataFrame
lemmatized_df = lemmatized_partition.compute()

# Print the lemmatized DataFrame
print(lemmatized_df)


                  French Words
0                        salut
1                        cours
2                       courir
3                          qui
4                  avoir alors
...                        ...
8776  puisje me joindre   vous
8777   puisje vous accompagner
8778     puisje vous embrasser
8779       puisje masseoir ici
8780      puisje utiliser ceci

[8781 rows x 1 columns]


In [158]:
pandas_df = lemmatized_df
for index, row in pandas_df.iterrows():
    print(row['French Words'])

salut
cours
courir
qui
avoir alors
au feu
  laid
saute
avoir suffire
stop
arrtetoi
attendre
attendre
poursuivre
continuer
poursuivre
bonjour
salut
je comprendre
jessaye
jai gagn
je ler emport
jai gagn
oh non
attaque
attaquez
sant
  votre sant
merci
tchintchin
lvetoi
aller maintenant
allezy maintenant
vasy maintenant
jai pig
compris
pig
compris
tas capt
monte
monter
serremoi dans ton bras
serrezmoi dans votre bras
je être tombe
je être tomb
je savoir
je être partir
je être partir
jai mentir
jai perdre
jai pay
jai 19 an
je aller bien
avoir aller
coutez
cest pas possible
impossible
en aucun cas
sans faon
cest hors de question
il nen être pas question
cest exclure
en aucun manire
hors de question
vraiment
vrai
ah bon
merci
on essayer
nous avoir gagn
nous gagnme
nous laver emport
nous lemportme
demande   tom
fantastique
être calme
être calme
être calme
être dtendu
être juste
être juste
être juste
être quitabl
être quitabl
être quitable
être gentil
être gentil
être gentil
être gentil
être ge

In [162]:
lemmatized_df.to_csv(r'C:\Users\Bildad Otieno\Documents\Billy_Repo\Translation_Mod\lemmatized_output1.csv', index=False)

In [146]:
lemmatized_words = [word.lemma_ for doc in Part0.to_delayed() for word in nlp(str(doc.compute()))]
print(lemmatized_words)


['                 ', 'French', 'Words', '\n', '0', '                      ', 'salut', '\n', '1', '                      ', 'cours', '\n', '2', '                     ', 'courir', '\n', '3', '                        ', 'qui', '\n', '4', '                    ', 'avoir', 'alors', '\n', '...', '                      ', '...', '\n', '8776', ' ', 'puisje', 'me', 'joindre', ' ', 'vous', '\n', '8777', ' ', 'puisje', 'vous', 'accompagner', '\n', '8778', '   ', 'puisje', 'vous', 'embrasser', '\n', '8779', '     ', 'puisje', 'masseoir', 'ici', '\n', '8780', '    ', 'puisje', 'utiliser', 'ceci', '\n\n', '[', '8781', 'row', 'x', '1', 'column', ']']


In [135]:
new_col = []
for doc in Part0.to_delayed():
    col = []
    doc = nlp(str(doc.compute()))
    for word in doc:
        col.append(word.lemma_)
    new_word = " ".join(col)
    
    new_col.append(new_word)
    
new_col 
print(new_col[0])

                  French Words 
 0                        salut 
 1                        cours 
 2                       courir 
 3                          qui 
 4                      avoir alors 
 ...                        ... 
 8776   puisje me joindre   vous 
 8777   puisje vous accompagner 
 8778     puisje vous embrasser 
 8779       puisje masseoir ici 
 8780      puisje utiliser ceci 

 [ 8781 row x 1 column ]


In [133]:
import re
import pandas as pd


# Extract the string element
string_data = new_col[0]

# Split the string into lines
lines = string_data.split('\n')

# Remove leading and trailing whitespaces from each line
cleaned_lines = [line.strip() for line in lines]

# Keep only the non-empty lines (the header line is skipped in the loop below)
filtered_lines = [line for line in cleaned_lines if line]

# Extract the French words from the filtered lines
french_words = []
for line in filtered_lines[1:]:
    if 'row x' in line:
        break
    words = line.split()
    french_words.extend(words)

# Create a DataFrame from the extracted French words
df = pd.DataFrame(french_words, columns=['French Words'])

print(df)


   French Words
0             0
1         salut
2             1
3         cours
4             2
5        courir
6             3
7           qui
8             4
9         avoir
10        alors
11          ...
12          ...
13         8776
14       puisje
15           me
16      joindre
17         vous
18         8777
19       puisje
20         vous
21  accompagner
22         8778
23       puisje
24         vous
25    embrasser
26         8779
27       puisje
28     masseoir
29          ici
30         8780
31       puisje
32     utiliser
33         ceci
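
The index numbers and `...` rows in the result above come from parsing the printed form of the DataFrame, which includes index labels and is truncated for display. A possible alternative, sketched here on a small made-up pandas frame, is to read the column values directly instead of going through `str(...)`:

```python
import pandas as pd

# Hypothetical stand-in for one computed partition.
part = pd.DataFrame({"French Words": ["salut", "cours", "qui", "a alors"]})

# str(part) is the printed table, so splitting it picks up the header
# and the index labels along with the words.
printed_tokens = str(part).split()

# Reading the column directly yields only the cell values.
values = part["French Words"].tolist()
words = [w for text in values for w in text.split()]

print(printed_tokens)
print(words)
```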


In [93]:
import pandas as pd

# Assuming the list of strings is in a variable called 'new_col'
df = pd.DataFrame(new_col)

df.to_string

<bound method DataFrame.to_string of                                                    0
0                    French Words \n 0           ...>

In [51]:
new_col = []
for text in Part0:
    doc = nlp(str(text))
    col = []
    for word in doc:
        col.append(word.lemma_)
    new_word = " ".join(col)
    new_col.append(new_word)
    print(new_col)

['French', 'French Words']


In [44]:


FreWord = []  # initialise the list first; without this the loop raises a NameError
for word in FrenchPar.to_delayed():
    FreWord.append(word.compute())

In [None]:
import pandas as pd
cleaned_col = []
for text in new_col:
    rows = text.strip().split("\n")
    cleaned_rows = [row.strip() for row in rows[1:] if row.strip()]
    cleaned_col.extend(cleaned_rows)

# Clean up the strings in new_col


# Create a list of dictionaries from the cleaned_col strings
data = [{"French Words": text} for text in cleaned_col]

# Create a DataFrame from the list of dictionaries
df = pd.DataFrame(data)

# Print the resulting DataFrame
print(df)



NameError: name 'new_col' is not defined

In [None]:
df_merged = pd.DataFrame(cleaned_col, columns=["French Words"])

In [None]:
df_merged

Unnamed: 0,French Words
0,0 salut
1,1 cours
2,2 courir
3,3 qui
4,4 avoir alors
5,... ...
6,8776 puisje me joindre vous
7,8777 puisje vous accompagner
8,8778 puisje vous embrasser
9,8779 puisje masseoir ici


In [None]:
# Iterate over the twenty partitions
for i in range(20):
    # Create a new Dask dataframe with the updated data for the current partition
    new_partition = dd.from_pandas(new_col[i], npartitions=1)
    
    # Repartition the new dataframe if necessary
    new_partition = new_partition.repartition(npartitions=1)
    
    # Save the new dataframe as a parquet file, replacing the existing partition file
    new_partition.to_parquet(f"path_to_partition_{i}.parquet", overwrite=True)


TypeError: Input must be a pandas DataFrame or Series.

In [None]:
file_pattern = r"C:\Users\Bildad Otieno\Documents\Billy_Repo\Translation_Mod\French\part.*.parquet"

df4 = dd.read_parquet(file_pattern)

merged_df4 = dd.concat([df4])

merged_df4 = merged_df4.compute()
merged_df4

Unnamed: 0,French Words
0,salut
1,cours
2,courez
3,qui
4,a alors
...,...
175616,lconomie en partant du haut vers le bas a ne ...
175617,une empreinte carbone est la somme de pollutio...
175618,la mort est une chose quon nous dcourage souve...
175619,puisquil y a de multiples sites web sur chaque...


## Splitting Dataset into 70:30 Ratio

In [None]:
Eng_train, Eng_test, Fre_train, Fre_test = train_test_split(Eng, Fre, test_size=0.3, random_state=42)
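
A toy version of the split, using made-up placeholder strings, illustrates two points: `test_size=0.3` gives the 70:30 ratio, and passing both columns in one call applies the same shuffle to each, so every English item stays paired with its French counterpart.

```python
from sklearn.model_selection import train_test_split

# Hypothetical paired data: item i in `eng` translates to item i in `fre`.
eng = [f"en_{i}" for i in range(10)]
fre = [f"fr_{i}" for i in range(10)]

eng_train, eng_test, fre_train, fre_test = train_test_split(
    eng, fre, test_size=0.3, random_state=42
)

# Pairs stay aligned if the numeric suffixes match position by position.
aligned = all(e[3:] == f[3:] for e, f in zip(eng_train, fre_train))

print(len(eng_train), len(eng_test), aligned)
```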