## Dataset from EU Parliament proceedings

#### I found a dataset containing phrases in English, Swedish, French and Spanish! :D

It would be cool to add a Bengali column, but to be honest I don't really know how Bengali can help me with my Spanish/French learning journey. Maybe there are some phonemes in Bengali that can help, but we can focus on that later.

This dataset contains proceedings from the European Parliament, which to me sounds like a really high-quality dataset: translated by experts and available in multiple European languages. Let's see what we can do with it.

Link to Kaggle dataset: https://www.kaggle.com/datasets/djonafegnem/europarl-parallel-corpus-19962011/data

In [8]:
!kaggle datasets download -d djonafegnem/europarl-parallel-corpus-19962011 --file english_french.csv
!kaggle datasets download -d djonafegnem/europarl-parallel-corpus-19962011 --file english_swedish.csv
!kaggle datasets download -d djonafegnem/europarl-parallel-corpus-19962011 --file english_spanish.csv

Dataset URL: https://www.kaggle.com/datasets/djonafegnem/europarl-parallel-corpus-19962011
License(s): CC0-1.0
Downloading english_french.csv.zip to /Users/sabrina/LinguaLoki
 99%|███████████████████████████████████████▋| 213M/215M [00:12<00:00, 21.2MB/s]
100%|████████████████████████████████████████| 215M/215M [00:12<00:00, 18.3MB/s]
Dataset URL: https://www.kaggle.com/datasets/djonafegnem/europarl-parallel-corpus-19962011
License(s): CC0-1.0
english_swedish.csv.zip: Skipping, found more recently modified local copy (use --force to force download)
Dataset URL: https://www.kaggle.com/datasets/djonafegnem/europarl-parallel-corpus-19962011
License(s): CC0-1.0
english_spanish.csv.zip: Skipping, found more recently modified local copy (use --force to force download)


In [2]:
import zipfile

langs = ["swedish", "spanish", "french"]
folder_name = "europarl_multilingual_dataset"

for lang in langs:
    with zipfile.ZipFile(f"english_{lang}.csv.zip", 'r') as zip_ref:
        zip_ref.extractall(folder_name)

In [2]:
import pandas as pd

folder_name = "europarl_multilingual_dataset"

df_french = pd.read_csv(f"{folder_name}/english_french.csv")
df_swedish = pd.read_csv(f"{folder_name}/english_swedish.csv")
df_spanish = pd.read_csv(f"{folder_name}/english_spanish.csv")

In [3]:
print(len(df_french))
print(len(df_swedish))
print(len(df_spanish))
df_french.head()

2007724
1862235
1965735


Unnamed: 0,English,French
0,Resumption of the session,Reprise de la session
1,I declare resumed the session of the European ...,Je déclare reprise la session du Parlement eur...
2,"Although, as you will have seen, the dreaded '...","Comme vous avez pu le constater, le grand ""bog..."
3,You have requested a debate on this subject in...,Vous avez souhaité un débat à ce sujet dans le...
4,"In the meantime, I should like to observe a mi...","En attendant, je souhaiterais, comme un certai..."


In [4]:
df_swedish.head()

Unnamed: 0,English,Swedish
0,Resumption of the session,Återupptagande av sessionen
1,I declare resumed the session of the European ...,Jag förklarar Europaparlamentets session återu...
2,"Although, as you will have seen, the dreaded '...","Som ni kunnat konstatera ägde ""den stora år 20..."
3,You have requested a debate on this subject in...,Ni har begärt en debatt i ämnet under sammantr...
4,"In the meantime, I should like to observe a mi...","Till dess vill jag att vi, som ett antal kolle..."


In [5]:
df_spanish.head()

Unnamed: 0,English,Spanish
0,Resumption of the session,Reanudación del período de sesiones
1,I declare resumed the session of the European ...,Declaro reanudado el período de sesiones del P...
2,"Although, as you will have seen, the dreaded '...","Como todos han podido comprobar, el gran ""efec..."
3,You have requested a debate on this subject in...,Sus Señorías han solicitado un debate sobre el...
4,"In the meantime, I should like to observe a mi...","A la espera de que se produzca, de acuerdo con..."


I've been having problems with merging, so let's investigate possible duplicates in the English columns and drop them.

In [6]:
duplicates = df_french[df_french.duplicated(subset='English')]
print(len(duplicates))

duplicates = df_swedish[df_swedish.duplicated(subset='English')]
print(len(duplicates))

duplicates = df_spanish[df_spanish.duplicated(subset='English')]
print(len(duplicates))

54243
66205
57156
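
Before dropping anything, it can be worth looking at what the duplicates actually are. Below is a minimal sketch on made-up rows (the `toy` frame and its sentences are my own illustration, not taken from the dataset). It shows that `duplicated` with the default `keep='first'` only flags the repeats, while `keep=False` flags every copy, so you can check whether a repeated English sentence comes with different translations that would be lost.

```python
import pandas as pd

# Toy stand-in for one of the bilingual frames (made-up rows, not from the dataset)
toy = pd.DataFrame({
    "English": ["Thank you", "The debate is closed", "Thank you", "Thank you"],
    "French": ["Merci", "Le débat est clos", "Je vous remercie", "Merci"],
})

# Default keep='first': only the repeats after the first occurrence are flagged
n_repeats = int(toy.duplicated(subset="English").sum())

# keep=False flags every copy, which makes it easy to inspect the variants
all_copies = toy[toy.duplicated(subset="English", keep=False)]

# Number of distinct French translations per repeated English sentence
variants = all_copies.groupby("English")["French"].nunique()

print(n_repeats)
print(variants)
```

If `variants` is larger than 1 for many sentences, `drop_duplicates(subset='English')` is throwing away alternative translations, which may or may not matter for what we want to do later.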


Let's remove these and see if we can arrive at bilingual datasets of equal length.

In [7]:
df_french = df_french.drop_duplicates(subset='English')
df_swedish = df_swedish.drop_duplicates(subset='English')
df_spanish = df_spanish.drop_duplicates(subset='English')

In [8]:
print(len(df_french))
print(len(df_swedish))
print(len(df_spanish))

1953481
1796030
1908579


In [11]:
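# join() defaults to a left join: every French row is kept, and the
# Swedish/Spanish columns are NaN where the English sentence has no exact match
df_merged = df_french.join(df_swedish.set_index('English'), on='English')
df_merged = df_merged.join(df_spanish.set_index('English'), on='English')
len(df_merged)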

1953481
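
The merged length equals the deduplicated French length because `join` keeps every row of the left frame by default. That means some rows may be missing a Swedish or Spanish translation. Here is a small sketch, on toy frames with placeholder strings that I made up, of how one could count the gaps and keep only the rows where all languages are present.

```python
import pandas as pd

# Toy frames with placeholder values (assumptions for illustration only)
fr = pd.DataFrame({"English": ["a", "b", "c"], "French": ["fa", "fb", "fc"]})
sv = pd.DataFrame({"English": ["a", "b"], "Swedish": ["sa", "sb"]})
es = pd.DataFrame({"English": ["a", "c"], "Spanish": ["ea", "ec"]})

# Same pattern as in the notebook: two left joins on the English sentence
merged = fr.join(sv.set_index("English"), on="English")
merged = merged.join(es.set_index("English"), on="English")

# Count missing translations per language column
language_columns = ["French", "Swedish", "Spanish"]
missing_per_column = merged[language_columns].isna().sum()

# Keep only rows that have a translation in every language
complete = merged.dropna(subset=language_columns)

print(missing_per_column)
print(len(merged), len(complete))
```

Running the same two checks on `df_merged` would tell us how many of the rows are truly four-language rows.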

In [12]:
df_merged.head()

Unnamed: 0,English,French,Swedish,Spanish
0,Resumption of the session,Reprise de la session,Återupptagande av sessionen,Reanudación del período de sesiones
1,I declare resumed the session of the European ...,Je déclare reprise la session du Parlement eur...,Jag förklarar Europaparlamentets session återu...,Declaro reanudado el período de sesiones del P...
2,"Although, as you will have seen, the dreaded '...","Comme vous avez pu le constater, le grand ""bog...","Som ni kunnat konstatera ägde ""den stora år 20...","Como todos han podido comprobar, el gran ""efec..."
3,You have requested a debate on this subject in...,Vous avez souhaité un débat à ce sujet dans le...,Ni har begärt en debatt i ämnet under sammantr...,Sus Señorías han solicitado un debate sobre el...
4,"In the meantime, I should like to observe a mi...","En attendant, je souhaiterais, comme un certai...","Till dess vill jag att vi, som ett antal kolle...","A la espera de que se produzca, de acuerdo con..."


Great - we now have a dataset of four languages that I can more or less speak and understand. Almost 2 million rows - nice!

## French vs Swedish

My immediate thought is that I want to investigate the similarities between Swedish and French. I've been hanging out with some French people lately. They're also trying to improve their Spanish, so we speak Spanish with each other. Sometimes they forget a word in Spanish and try saying the French word with a Spanish accent. Sometimes it works and sometimes it's completely off. And sometimes we realise that it's the exact same word (or almost) in Swedish, probably borrowed from French, I would guess. Anyway, I'm really interested in seeing how much French influence modern Swedish still carries. You can see it in words such as fåtölj (fauteuil) and trottoar (trottoir), but maybe there are also ways of expressing oneself in Swedish that are actually influenced by French.
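
One very rough way to start on this could be to score how similar a Swedish and a French word look on paper. The sketch below uses `difflib.SequenceMatcher` from the standard library. The `similarity` helper and the word pairs are my own assumptions for illustration, and spelling similarity is only a crude proxy: a loanword that was respelled the Swedish way (like fåtölj) will score low even though it sounds close to the French original.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0..1 spelling similarity between two lowercased words."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# (Swedish, French) pairs; the last one is an unrelated pair for contrast
pairs = [
    ("fåtölj", "fauteuil"),
    ("trottoar", "trottoir"),
    ("hus", "maison"),
]

scores = {pair: similarity(*pair) for pair in pairs}

for (swedish_word, french_word), score in scores.items():
    print(f"{swedish_word:10} {french_word:10} {score:.2f}")
```

A phonetic comparison would probably capture French loanwords in Swedish better than raw spelling, but that is a bigger project.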

## Anglicisms

The lingua franca (if I'm using the term correctly) in large parts of the world is currently English. I've heard Swedish people around me, and also myself, speak Swedish in a kind of "English" way without realising it, due to the high exposure to English content and media (I also did my whole master's in English, although it was at a Swedish university). E.g. as far as I know there is no direct translation of "it makes sense", so I've heard many Swedish people say "det makear sense", which is a clear anglicism.

Anglicisms like that are quite obvious influences of English on the Swedish language. Something less obvious is "calquing" or "structural borrowing", where an English expression is translated word for word. I've heard this more from younger people, who are maybe more exposed to English on social media platforms or just the internet in general, where English-language content gets more reach and views. Examples (taken from this Reddit thread of Swedes annoyed at the number of anglicisms used by Swedish people): https://www.reddit.com/r/sweden/comments/12ytyvb/anglicismer_i_svenskan/
- komma iväg med något (instead of "komma undan med något"): a direct translation of get away with something
- röda flaggor (instead of "varningsklockor"): red flags
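
Phrases like the ones above could be counted in the Swedish column with a plain substring search. This is just a sketch on three sentences I made up. The `phrases` mapping and the `swedish` series are hypothetical, and exact matching will miss inflected forms, so the counts would only be a lower bound. Parliamentary language is also quite formal, so I would not expect many hits in this particular dataset.

```python
import pandas as pd

# Made-up Swedish sentences standing in for the Swedish column
swedish = pd.Series([
    "Det finns flera röda flaggor här.",
    "Nu borde alla varningsklockor ringa.",
    "De kommer inte att komma undan med det.",
])

# calque -> established Swedish expression (taken from the examples above)
phrases = {
    "röda flaggor": "varningsklockor",
    "komma iväg med": "komma undan med",
}

rows = []
for calque, native in phrases.items():
    rows.append({
        "calque": calque,
        # regex=False treats the phrase as a literal substring
        "calque_hits": int(swedish.str.contains(calque, case=False, regex=False).sum()),
        "native": native,
        "native_hits": int(swedish.str.contains(native, case=False, regex=False).sum()),
    })

counts = pd.DataFrame(rows)
print(counts)
```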

In the same thread we find a person who also mentions the German and French influences, which are so old and so tangled into the Swedish language that they're just not noticeable anymore. I guess the same might happen with the English influences in the future.

## Structural borrowing

I remember when we were taught how to tell the time in French class: il est huit heures. A direct translation to English would be "he is eight hours" or "it is eight hours", and to Swedish "han är åtta timmar". Saying e.g. "hon är kvart i åtta" (she is a quarter to eight) is not uncommon in Swedish when talking about the time, so I assume it might be something that was borrowed from French? Or maybe not - I don't know.

## Gendered nouns

Another interesting thing is the grammatical gender used in e.g. Spanish and French, which doesn't exist in the same way in Swedish. For me it's not obvious that a lamp is feminine, the way it would be for Spanish or French speakers. The Spanish speakers in my French class still sometimes guess the gender wrong for some French words, so there's some consolation in that.

Swedish doesn't have feminine and masculine gender the way French or Spanish do, but I remember listening to the Swedish podcast Språket (The Language), where they spoke about gender in the Swedish language, and it seems like we do have a form of gender in terms of en vs ett.

a lamp = en lampa
a car = en bil

a house = ett hus
an animal = ett djur

In the podcast they spoke about old Swedish having genders, like other Germanic languages such as German, but over time they disappeared.