I'm trying to figure out how I can use my coding/ML skills to help me learn new languages. I know that it takes a lot of hard work to become truly proficient in a foreign language, but I'm interested in seeing whether I can build some shortcuts for myself. I know there are already apps like Duolingo, where people are paid to work on this (at least) 8 hours a day, but they are also paid to keep users on the platform, so helping people learn a language quickly and efficiently may not be their top priority (I think).

In this notebook I will try to find similar words between Swedish and French so that, in my French-learning journey, I can focus on the words that are completely different. The plan is also to extend this to Spanish (another Romance language), which I am more proficient in.

After searching for a Swedish-French corpus without luck, I've settled on an English-French one for now.

In [24]:
!kaggle datasets download -d devicharith/language-translation-englishfrench

Dataset URL: https://www.kaggle.com/datasets/devicharith/language-translation-englishfrench
License(s): CC0-1.0
language-translation-englishfrench.zip: Skipping, found more recently modified local copy (use --force to force download)


In [25]:
import zipfile
import os

# Define the path to the zip file and the extraction directory
zip_file_path = "language-translation-englishfrench.zip" 
extraction_dir = "language-translation-dataset"

# Create the extraction directory if it doesn't exist
os.makedirs(extraction_dir, exist_ok=True)

# Extract the zip file
with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    zip_ref.extractall(extraction_dir)

In [26]:
import pandas as pd

# Load the dataset 
df = pd.read_csv(os.path.join(extraction_dir, 'eng_-french.csv')) 

df

Unnamed: 0,English words/sentences,French words/sentences
0,Hi.,Salut!
1,Run!,Cours !
2,Run!,Courez !
3,Who?,Qui ?
4,Wow!,Ça alors !
...,...,...
175616,"Top-down economics never works, said Obama. ""T...","« L'économie en partant du haut vers le bas, ç..."
175617,A carbon footprint is the amount of carbon dio...,Une empreinte carbone est la somme de pollutio...
175618,Death is something that we're often discourage...,La mort est une chose qu'on nous décourage sou...
175619,Since there are usually multiple websites on a...,Puisqu'il y a de multiples sites web sur chaqu...


In [27]:
df['English'] = df['English words/sentences']
df['French'] = df['French words/sentences']

Alright, so it's a sentence translation dataset. That's fine, let's work with it.

## Let's explore some of the similarities/differences between the "shape" of the two languages

In [28]:
# Calculate average sentence lengths
average_english_length = df['English'].str.split().str.len().mean()
average_french_length = df['French'].str.split().str.len().mean()

print(f"Average English sentence length: {average_english_length}")
print(f"Average French sentence length: {average_french_length}")

Average English sentence length: 6.161552433934438
Average French sentence length: 6.706669475746067
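
One caveat with these averages: `str.split()` counts every whitespace-separated token, and French typography puts a space before `!` and `?` (as in `Cours !` above), so those marks get counted as extra words. Here is a minimal sketch of a count that ignores free-standing punctuation, assuming a simple regex tokeniser is good enough (the helper name `count_words` is made up for illustration):

```python
import re

# Hypothetical helper: count runs of word characters instead of
# whitespace-separated tokens, so a free-standing "!" or "?" is not a word.
def count_words(sentence):
    return len(re.findall(r"\w+", sentence))

# Compare the whitespace-based count with the regex-based count
for s in ["Run!", "Cours !", "Ça alors !"]:
    print(s, len(s.split()), count_words(s))
```

On the dataframe this could be applied with `df['French'].apply(count_words).mean()`. Note that the regex also splits on apostrophes, so `J'ai` counts as two words.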


## Metaphone

Metaphone is a phonetic algorithm for indexing words by their English pronunciation; it builds on and improves the Soundex algorithm. I'm taking a quick look at it because I feel that my Swedish will help me more than my English when it comes to improving my French pronunciation.

In [29]:
!pip install metaphone


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.1.2[0m[39;49m -> [0m[32;49m24.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [30]:
from metaphone import doublemetaphone

# Phonetic encoding helper: despite its name, it returns the primary
# Double Metaphone code of the input rather than a similarity score
def phonetic_similarity(word):
    return doublemetaphone(word)[0]

# Apply the phonetic encoding to every sentence in both columns
df['English_Phonetic'] = df['English'].apply(phonetic_similarity)
df['French_Phonetic'] = df['French'].apply(phonetic_similarity)

# Comparing phonetic encodings
print(df[['English', 'French', 'English_Phonetic', 'French_Phonetic']].head())

  English      French English_Phonetic French_Phonetic
0     Hi.      Salut!               HH            SLTT
1    Run!     Cours !              RNN           KRSSS
2    Run!    Courez !              RNN           KRSSS
3    Who?       Qui ?                A               K
4    Wow!  Ça alors !                A          SLRSSS
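
Since Metaphone is meant for single words, one option is to encode each word separately rather than the whole sentence with its punctuation. A rough sketch, assuming a regex tokeniser is acceptable; `encode_words` is a hypothetical helper, and the encoder is passed in as an argument so that any word-level function can be plugged in:

```python
import re

# Hypothetical helper: apply a word-level encoder to each word of a sentence.
# `encoder` is any callable that maps one word to one code.
def encode_words(sentence, encoder):
    words = re.findall(r"\w+", sentence)  # drops punctuation, splits on apostrophes
    return [encoder(word) for word in words]

# Possible use with the library imported above (not run here):
# encode_words("Ça alors !", lambda w: doublemetaphone(w)[0])
```

Since the rules are based on English pronunciation, the codes for French words are best treated as rough approximations.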


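I've read that the Jaccard similarity index can be used to measure the similarity between, e.g., two text documents: it is the size of the intersection of two sets divided by the size of their union, and here the sets are the whitespace-separated tokens of each sentence. I wonder whether this also works for documents in two different languages. There might be some limitations, but it's still interesting to look at.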

In [31]:
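def jaccard_similarity(str1, str2):
    # Treat each sentence as a set of whitespace-separated tokens
    a = set(str1.split())
    b = set(str2.split())
    union = a | b
    # Guard against two empty sentences (empty union)
    return len(a & b) / len(union) if union else 0.0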

# Calculate Jaccard similarity between English and French sentences
df['Jaccard_Similarity'] = df.apply(lambda x: jaccard_similarity(x['English'], x['French']), axis=1)

# Display results
df[['English', 'French', 'Jaccard_Similarity']].sort_values(by=['Jaccard_Similarity'], ascending=False)

Unnamed: 0,English,French,Jaccard_Similarity
1460,Ignore Tom.,Ignore Tom.,1.000000
71906,Tom has a terrible secret.,Tom a un terrible secret.,0.666667
21224,Tom has Windows 7.,Tom a Windows 7.,0.600000
12592,Tom has a ranch.,Tom a un ranch.,0.600000
9223,Tom has a Ford.,Tom a une Ford.,0.600000
...,...,...,...
60748,I brought reinforcements.,J'ai apporté des renforts.,0.000000
60749,I brought you some lunch.,Je t'ai apporté un déjeuner.,0.000000
60750,I brought you some lunch.,Je vous ai apporté un déjeuner.,0.000000
60751,I brought you some water.,Je vous ai apporté de l'eau.,0.000000


It seems to be picking up on similarities in things that should be similar or identical, such as names (Tom, Windows 7, Ford). Maybe we need to remove these. I don't know how easy it would be to "reverse engineer" this and go from here to an English-French dictionary.
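
One way to try the "remove the names" idea is to normalise the tokens (lowercase, no punctuation) and drop a stop-list of names before computing the score. This is only a sketch: `NAMES_TO_IGNORE` is a hand-made, hypothetical list based on the rows above, and `jaccard_without_names` is not part of the notebook so far.

```python
import re

# Hypothetical stop-list, hand-picked from the top rows of the table above
NAMES_TO_IGNORE = {"tom", "windows", "ford"}

def tokens(sentence):
    # Lowercase, keep runs of word characters, drop the ignored names
    return {t for t in re.findall(r"\w+", sentence.lower()) if t not in NAMES_TO_IGNORE}

def jaccard_without_names(str1, str2):
    a, b = tokens(str1), tokens(str2)
    union = a | b
    # Two sentences with no tokens left are scored 0.0 rather than raising an error
    return len(a & b) / len(union) if union else 0.0

print(jaccard_without_names("Tom has a terrible secret.", "Tom a un terrible secret."))
print(jaccard_without_names("Tom has Windows 7.", "Tom a Windows 7."))
```

Words that still match after this, such as `terrible` and `secret`, are candidates for shared vocabulary. A hand-made list will not scale to the full dataset, so it would need replacing with something more systematic.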

## Richness of vocabulary

Let's look at the number of unique words in both columns.

In [33]:
import pandas as pd

# Function to extract unique words from a column
def unique_words(column):
    # Join all sentences, lowercase them, and split on whitespace; the set below keeps the unique tokens
    words = column.str.cat(sep=' ').lower().split()
    return set(words)

# Extract unique words from both columns
unique_english = unique_words(df['English'])
unique_french = unique_words(df['French'])

# Count unique words
count_unique_english = len(unique_english)
count_unique_french = len(unique_french)

# Print results
print(f"Unique words in English: {count_unique_english}")
print(f"Unique words in French: {count_unique_french}")


Unique words in English: 25622
Unique words in French: 42627
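
With the two vocabularies as sets, a first pass at the original goal (separating look-alike words from completely different ones) could use set intersection for identical spellings and `difflib.get_close_matches` from the standard library for near matches. The following is a sketch on toy vocabularies; `split_vocabulary` and the `cutoff` of 0.75 are assumptions for illustration, and running `difflib` over tens of thousands of words would probably be slow.

```python
import difflib

def split_vocabulary(english_words, french_words, cutoff=0.75):
    # Words spelled identically in both vocabularies
    identical = english_words & french_words
    french_list = sorted(french_words)
    near, different = {}, set()
    for word in sorted(english_words - identical):
        # Up to 3 French words whose similarity ratio is at least `cutoff`
        matches = difflib.get_close_matches(word, french_list, n=3, cutoff=cutoff)
        if matches:
            near[word] = matches
        else:
            different.add(word)
    return identical, near, different

# Toy vocabularies, not taken from the dataset
english = {"terrible", "secret", "family", "water"}
french = {"terrible", "secret", "famille", "eau"}
identical, near, different = split_vocabulary(english, french)
print(identical, near, different)
```

The sets returned by `unique_words` still have punctuation attached to many tokens (for example `secret.`), so they would need cleaning before a comparison like this is meaningful.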
