<a href="https://colab.research.google.com/github/Huertas97/Get_Multilingual_Data/blob/main/notebooks/Multilingual_Data_fit_PCA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

In this notebook we show how to use the scripts available in https://github.com/Huertas97/Get_Multilingual_Data.git to extract multilingual sentences. 

The final result of the notebook is a data frame with 1000 sentences per language, drawn from three resources: TED2020, WikiMatrix and OPUS-NewsCommentary (Hindi, Polish and Turkish are taken from TED2020 and WikiMatrix only). 

A visualization of the multilingual data extracted is shown at the end of the notebook. 

# Loading Trial Data


In [1]:
!pip install -U -q sentence-transformers
import pandas as pd

# Languages for the PCA

In [2]:
languages = "ar, cs, de, en, es, fr, hi, it, ja, nl, pl, pt, ru, tr, zh".split(", ")
print(",".join(languages))
print(len(languages))

ar,cs,de,en,es,fr,hi,it,ja,nl,pl,pt,ru,tr,zh
15


# Clone the github repository

In [3]:
!git clone https://github.com/Huertas97/Get_Multilingual_Data.git

fatal: destination path 'Get_Multilingual_Data' already exists and is not an empty directory.
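
The `fatal` message above only means that the repository had already been cloned in an earlier run. A minimal sketch of a guard that skips the clone when the directory exists is shown below. `clone_if_missing` is a hypothetical helper, not part of the repository, and its `run` parameter exists only so the function can be exercised without calling git.

```python
import os
import subprocess

REPO_URL = "https://github.com/Huertas97/Get_Multilingual_Data.git"

def clone_if_missing(url=REPO_URL, dest="Get_Multilingual_Data", run=subprocess.run):
    """Clone `url` into `dest` unless `dest` already exists.

    Returns True if a clone was started, False if it was skipped.
    """
    if os.path.isdir(dest):
        return False
    run(["git", "clone", url, dest], check=True)
    return True
```

In a notebook cell, calling `clone_if_missing()` would take the place of the `!git clone ...` line.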


In [4]:
%cd Get_Multilingual_Data/

/content/Get_Multilingual_Data


# Sentences from TED 2020

In [5]:
!python ./scripts/get_TED2020_sentences.py --help

2020-12-16 12:19:21.447146: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1

This script downloads the TED2020 corpus and create parallel sentences tsv files

The TED2020 corpus is a crawl of transcripts from TED and TEDx talks, which 
are translated to 100+ languages. With this script the user can select the 
amount of sentences and the languages desired. 

The TED2020 corpus is downloaded automatically only for the languages selected.
          
Usage:

    python get_TED2020_sentences [options] 

Options:
    -n, --n_sentences            Number of sentences to collect
    -l, --languages              Languages ​​from which we extract sentences
    -h, --help                   Help documentation


Example. Extract TED 2020  arabic and italian sentences:
    python get_TED2020_sentences.py --n_sentences 500 --languages ar,it
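
The options listed in the help text can also be assembled programmatically, which is handy when the same script is called for several language groups. A minimal sketch follows; `build_command` is a hypothetical helper and only uses the `--n_sentences` and `--languages` flags documented above.

```python
import sys

def build_command(script, n_sentences, languages):
    """Return the argument list for one of the extraction scripts."""
    return [
        sys.executable, script,
        "--n_sentences", str(n_sentences),
        "--languages", ",".join(languages),  # the scripts expect a comma-separated list
    ]

cmd = build_command("./scripts/get_TED2020_sentences.py", 500, ["ar", "it"])
print(" ".join(cmd[1:]))
# prints: ./scripts/get_TED2020_sentences.py --n_sentences 500 --languages ar,it
```

The resulting list could then be passed to `subprocess.run(cmd, check=True)` instead of using the `!python ...` shell syntax.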


In [6]:
!python ./scripts/get_TED2020_sentences.py --n_sentences 500 --languages pl,tr,hi

2020-12-16 12:19:25.474236: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Recovering TED2020 talks for languages: pl tr hi
ted2020.tsv.gz does not exists. Try to download from server
100% 581M/581M [00:29<00:00, 19.7MB/s]
Parallel sentences files en-hi, en-pl, en-tr do not exist. Create these files now
Creating data frame for TED2020 languages: pl-tr-hi
TED2020-en-tr-train_pca.tsv.gz
TED2020-en-pl-train_pca.tsv.gz
TED2020-en-hi-train_pca.tsv.gz
---Saving results in parallel-sentences/TED2020/df_TED_pl-tr-hi.pkl ---
--- Removing downloaded files ---
--- Finish ---


In [7]:
df_TED_pl_tr_hi = pd.read_pickle("./parallel-sentences/TED2020/df_TED_pl-tr-hi.pkl")
df_TED_pl_tr_hi.groupby("lang").describe()

,from,from,from,from,sentences,sentences,sentences,sentences
lang,count,unique,top,freq,count,unique,top,freq
hi,500,1,TED2020,500,500,500,"यह एक कीमत नहीं, एक लाभ है.",1
pl,500,1,TED2020,500,500,500,Chciałem dziś zrobić coś specjalnego.,1
tr,500,1,TED2020,500,500,500,Son beş veya altı aydır üzerinde çalıştığım bi...,1
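
The pickle files read in this notebook follow a regular naming scheme: a per-source prefix followed by the requested languages joined with `-`. The sketch below captures that scheme. The patterns are inferred from the cells and logs of this notebook (including the WikiMatrix and News cells further down), not from the scripts' source code, and `result_pickle` is a hypothetical helper.

```python
# File-name patterns as they appear in this notebook's cells and logs
# (an assumption based on the outputs, not on the scripts themselves).
PICKLE_PATTERNS = {
    "TED2020": "./parallel-sentences/TED2020/df_TED_{langs}.pkl",
    "WikiMatrix": "./parallel-sentences/Wikimatrix/df_wikimatrix_{langs}.pkl",
    "News": "./parallel-sentences/News/df_News_{langs}.pkl",
}

def result_pickle(source, languages):
    """Build the pickle path for a source.

    Languages keep the order in which they were passed on the command line.
    """
    return PICKLE_PATTERNS[source].format(langs="-".join(languages))

print(result_pickle("TED2020", ["pl", "tr", "hi"]))
# prints: ./parallel-sentences/TED2020/df_TED_pl-tr-hi.pkl
```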


In [8]:
!python ./scripts/get_TED2020_sentences.py --n_sentences 250 --languages ar,cs,de,en,es,fr,it,ja,nl,pt,ru,zh

2020-12-16 12:20:00.720766: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Recovering TED2020 talks for languages: ar cs de en es fr it ja nl pt ru zh
Parallel sentences files en-it, en-ja, en-fr, en-zh, en-en, en-es, en-ar, en-nl, en-pt, en-ru, en-cs, en-de do not exist. Create these files now
Creating data frame for TED2020 languages: ar-cs-de-en-es-fr-it-ja-nl-pt-ru-zh
TED2020-en-fr-train_pca.tsv.gz
TED2020-en-cs-train_pca.tsv.gz
TED2020-en-ru-train_pca.tsv.gz
TED2020-en-ja-train_pca.tsv.gz
TED2020-en-es-train_pca.tsv.gz
TED2020-en-pt-train_pca.tsv.gz
TED2020-en-nl-train_pca.tsv.gz
TED2020-en-zh-train_pca.tsv.gz
TED2020-en-en-train_pca.tsv.gz
TED2020-en-ar-train_pca.tsv.gz
TED2020-en-de-train_pca.tsv.gz
TED2020-en-it-train_pca.tsv.gz
---Saving results in parallel-sentences/TED2020/df_TED_ar-cs-de-en-es-fr-it-ja-nl-pt-ru-zh.pkl ---
--- Removing downloaded files ---
--- Finish ---


In [9]:
df_TED_langs = pd.read_pickle("./parallel-sentences/TED2020/df_TED_ar-cs-de-en-es-fr-it-ja-nl-pt-ru-zh.pkl")
df_TED_langs.groupby("lang").describe()

,from,from,from,from,sentences,sentences,sentences,sentences
lang,count,unique,top,freq,count,unique,top,freq
ar,250,1,TED2020,250,250,250,ليس هناك أدوات تكنولوجية هنا ؛ مجرد علوم الأحي...,1
cs,250,1,TED2020,250,250,250,"Za prvé, toto je očekávaný podíl USA na globál...",1
de,250,1,TED2020,250,250,250,"Zu Beginn der Rede erzählte ich, was mir am Vo...",1
en,250,1,TED2020,250,250,250,Help with the mass persuasion campaign that wi...,1
es,250,1,TED2020,250,250,250,Si podíamos usar nuestro control remoto óptico...,1
fr,250,1,TED2020,250,250,250,nous devons réarranger le tracé.,1
it,250,1,TED2020,250,250,250,"Prima di tutto, questo é dove si prevede che a...",1
ja,250,1,TED2020,250,250,250,例を２つ紹介します ２系統のハエを比べています いずれも 光で制御できる細胞が およそ１００...,1
nl,250,1,TED2020,250,250,250,Als de Criticus bij de optisch-geactiveerde ce...,1
pt,250,1,TED2020,250,250,250,E coloco novas imagens porque assim aprendo ma...,1


## Check TED2020 is parallel data. 

The same sentence in different languages should theoretically map to the same embedding vector. In practice, embeddings vary somewhat across languages. Including this variability in the data used to fit the PCA is highly recommended, and the parallel data from TED2020 serves this purpose. The two cells below show that the Portuguese and Spanish subsets contain the same sentences in the same order. 

In [10]:
df_TED_langs[df_TED_langs["lang"] == "pt"].head()

index,from,lang,sentences
1250,TED2020,pt,"Muito obrigado, Chris."
1251,TED2020,pt,É realmente uma grande honra ter a oportunidad...
1252,TED2020,pt,Fiquei muito impressionado com esta conferênci...
1253,TED2020,pt,"Digo-o sinceramente, em parte, porque... preci..."
1254,TED2020,pt,(Risos) Coloquem-se no meu lugar!


In [11]:
df_TED_langs[df_TED_langs["lang"] == "es"].head()

index,from,lang,sentences
1000,TED2020,es,Muchas gracias Chris.
1001,TED2020,es,Y es en verdad un gran honor tener la oportuni...
1002,TED2020,es,"He quedado conmovido por esta conferencia, y d..."
1003,TED2020,es,"Y digo eso sinceramente, en parte porque -- (S..."
1004,TED2020,es,(Risas) ¡Pónganse en mi posición!
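
Comparing two outputs by eye works for a handful of rows. For a slightly more systematic check, the sentences of two languages can be placed side by side. The sketch below assumes that sentences keep the same order within each language block, as the two outputs above suggest. `side_by_side` is a hypothetical helper, and the toy frame only reuses a few sentences shown above.

```python
import pandas as pd

def side_by_side(df, lang_a, lang_b, n=5):
    """Pair the first n sentences of two languages by their position
    within each language block."""
    a = df.loc[df["lang"] == lang_a, "sentences"].head(n).reset_index(drop=True)
    b = df.loc[df["lang"] == lang_b, "sentences"].head(n).reset_index(drop=True)
    return pd.DataFrame({lang_a: a, lang_b: b})

# Toy frame with the same columns as the notebook's data frames
toy = pd.DataFrame({
    "from": ["TED2020"] * 4,
    "lang": ["es", "es", "pt", "pt"],
    "sentences": [
        "Muchas gracias Chris.",
        "(Risas) ¡Pónganse en mi posición!",
        "Muito obrigado, Chris.",
        "(Risos) Coloquem-se no meu lugar!",
    ],
})
pairs = side_by_side(toy, "pt", "es")
print(pairs)
```

With the real data, `side_by_side(df_TED_langs, "pt", "es")` would show the first five Portuguese and Spanish sentences next to each other.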


# Sentences from Wiki Matrix


In [12]:
!python ./scripts/get_wikimatrix_sentences.py --help

2020-12-16 12:20:06.275426: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1

This script automatically downloads WikiMatrix corpus for the languages selected.   
The WikiMatrix corpus is a crawl of mined sentences from Wikipedia in 
different languages. With this script the user can select the amount of 
sentences and the languages desired. We only used pairs with scores
above 1.075, as pairs below this threshold were often of bad quality.
       
          
Usage:

    python get_wikimatrix_sentences.py [options] 

Options:
    -n, --n_sentences            Number of sentences to collect
    -l, --languages              Languages ​​from which we extract sentences
    -h, --help                   Help documentation


Example. Extract Wikimatrix arabic and italian sentences:
    python get_wikimatrix_sentences.py --n_sentences 500 --languages ar,it
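
The score threshold mentioned in the help text can be illustrated with a small filter. This is a sketch under stated assumptions, not the script's actual implementation: it assumes that each line of a WikiMatrix TSV file holds a score and the two sentences separated by tabs. `filter_pairs` is a hypothetical helper, and the sample rows are made up for illustration.

```python
def filter_pairs(lines, threshold=1.075, limit=None):
    """Keep (source, target) pairs whose score is above `threshold`.

    Assumes each line looks like: score <TAB> source sentence <TAB> target sentence.
    Stops early once `limit` pairs have been collected (if a limit is given).
    """
    kept = []
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # skip malformed lines
        try:
            score = float(parts[0])
        except ValueError:
            continue  # skip lines whose first field is not a number
        if score > threshold:
            kept.append((parts[1], parts[2]))
            if limit is not None and len(kept) >= limit:
                break
    return kept

# Made-up rows for illustration only
sample = [
    "1.20\tHello world.\tHola mundo.\n",
    "1.05\tLow quality pair.\tPar de baja calidad.\n",
    "1.10\tGood morning.\tBuenos días.\n",
]
pairs = filter_pairs(sample)
```

For a downloaded file, the lines could come from `gzip.open(path, "rt", encoding="utf8")`.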


In [13]:
!python ./scripts/get_wikimatrix_sentences.py --n_sentences 500 --languages pl,tr,hi

2020-12-16 12:20:10.377882: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Recovering WikiMatrix data for for languages: pl tr hi
Download https://dl.fbaipublicfiles.com/laser/WikiMatrix/v1/WikiMatrix.en-tr.tsv.gz
100% 156M/156M [00:06<00:00, 22.4MB/s]
Write 500 PCA train sentences parallel-sentences/Wikimatrix/WikiMatrix-en-tr-train_pca.tsv.gz
Download https://dl.fbaipublicfiles.com/laser/WikiMatrix/v1/WikiMatrix.en-pl.tsv.gz
100% 312M/312M [00:14<00:00, 22.0MB/s]
Write 500 PCA train sentences parallel-sentences/Wikimatrix/WikiMatrix-en-pl-train_pca.tsv.gz
Download https://dl.fbaipublicfiles.com/laser/WikiMatrix/v1/WikiMatrix.en-hi.tsv.gz
100% 93.5M/93.5M [00:04<00:00, 20.8MB/s]
Write 500 PCA train sentences parallel-sentences/Wikimatrix/WikiMatrix-en-hi-train_pca.tsv.gz
Creating data frame for Wikimatrix languages: pl-tr-hi
WikiMatrix-en-tr-train_pca.tsv.gz
WikiMatrix-en-pl-train_pca.tsv.gz
WikiMatrix-en-hi-train_

In [14]:
df_Wiki_pl_tr_hi = pd.read_pickle("./parallel-sentences/Wikimatrix/df_wikimatrix_pl-tr-hi.pkl")
df_Wiki_pl_tr_hi.groupby("lang").describe()

,from,from,from,from,sentences,sentences,sentences,sentences
lang,count,unique,top,freq,count,unique,top,freq
hi,500,1,WikiMatrix,500,500,500,यह PAL-M (ब्राजील में उपयोग होने वाला) के बहुत...,1
pl,500,1,WikiMatrix,500,500,500,Mówimy im dzisiaj: chcemy demokracji od zaraz!,1
tr,500,1,WikiMatrix,500,500,500,"""İngiltere Savour Finest Tour"".",1


In [15]:
!python ./scripts/get_wikimatrix_sentences.py --n_sentences 250 --languages ar,cs,de,en,es,fr,it,ja,nl,pt,ru,zh

2020-12-16 12:20:42.041737: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Recovering WikiMatrix data for for languages: ar cs de en es fr it ja nl pt ru zh
Download https://dl.fbaipublicfiles.com/laser/WikiMatrix/v1/WikiMatrix.ar-en.tsv.gz
100% 273M/273M [00:11<00:00, 23.1MB/s]
Write 250 PCA train sentences parallel-sentences/Wikimatrix/WikiMatrix-en-ar-train_pca.tsv.gz
Download https://dl.fbaipublicfiles.com/laser/WikiMatrix/v1/WikiMatrix.en-nl.tsv.gz
100% 327M/327M [00:23<00:00, 13.9MB/s]
Write 250 PCA train sentences parallel-sentences/Wikimatrix/WikiMatrix-en-nl-train_pca.tsv.gz
Download https://dl.fbaipublicfiles.com/laser/WikiMatrix/v1/WikiMatrix.en-ru.tsv.gz
100% 658M/658M [00:27<00:00, 23.6MB/s]
Write 250 PCA train sentences parallel-sentences/Wikimatrix/WikiMatrix-en-ru-train_pca.tsv.gz
Download https://dl.fbaipublicfiles.com/laser/WikiMatrix/v1/WikiMatrix.en-fr.tsv.gz
100% 755M/755M [00:30<00:00, 24.6MB/s

In [16]:
df_Wiki_langs = pd.read_pickle("./parallel-sentences/Wikimatrix/df_wikimatrix_ar-cs-de-en-es-fr-it-ja-nl-pt-ru-zh.pkl")
df_Wiki_langs.groupby("lang").describe()

,from,from,from,from,sentences,sentences,sentences,sentences
lang,count,unique,top,freq,count,unique,top,freq
ar,250,1,WikiMatrix,250,250,250,كما أنه لم يستخدم في نهاية الرقم.,1
cs,250,1,WikiMatrix,250,250,250,"V roce 2010 zavtipkovala: ""Teď pracuju s ženam...",1
de,250,1,WikiMatrix,250,250,250,Da ist dein Onkel Gilbert.,1
en,250,1,WikiMatrix,250,250,250,"Tripoli, I prefer you to myself.",1
es,250,1,WikiMatrix,250,250,250,Era el momento de abrir el séptimo libro negro.,1
fr,250,1,WikiMatrix,250,250,250,La volonté de Dieu me demandait de pratiquer l...,1
it,250,1,WikiMatrix,250,250,250,"In un primo tempo, tutto va come Oliver aveva ...",1
ja,250,1,WikiMatrix,250,250,250,主の「לאהבי」（愛人）であることに対する見返りが、主の「חסד」（慈悲）である。,1
nl,250,1,WikiMatrix,250,250,250,"Geef terug wat je hebt gestolen"".",1
pt,250,1,WikiMatrix,250,250,250,"Então, ele pode ter tantos anfitriões ao seu r...",1


## Check WikiMatrix data is not parallel data

As mentioned above, parallel data helps the PCA capture how the embedding of the same sentence varies across languages. However, each language also needs its own non-parallel set of sentences, so that the PCA sees content that is specific to each language. WikiMatrix data is used for this purpose: the sentences sampled for each language are different, as the two outputs below show.  

In [17]:
df_Wiki_langs[df_Wiki_langs["lang"] == "pt"].head()

index,from,lang,sentences
2500,WikiMatrix,pt,"O rabino Johanan disse: ""Há montanhas, planíci..."
2501,WikiMatrix,pt,O intercâmbio final (saída 27) fornece acesso ...
2502,WikiMatrix,pt,"Ele é o Clemente, o Misericordioso!"
2503,WikiMatrix,pt,Mas como você luta contra uma sombra do inferno?
2504,WikiMatrix,pt,Ele é um dos The Evil Dead.”


In [18]:
df_Wiki_langs[df_Wiki_langs["lang"] == "es"].head()

index,from,lang,sentences
2750,WikiMatrix,es,Nunca combate al lado o en contra de Gordon Fr...
2751,WikiMatrix,es,Vivimos entre ellos y notamos raramente esa gr...
2752,WikiMatrix,es,Dijo: «Serán mis últimos cuatro años».
2753,WikiMatrix,es,"Él es el Poderoso, el Misericordioso."
2754,WikiMatrix,es,"Dijo: ""Me gusta ese contenido oscuro."


# OPUS - News Commentary

In [19]:
!pip install opustools

Collecting opustools
  Downloading https://files.pythonhosted.org/packages/9d/5e/f2f0fdbb17a0d348ac1974965168c12157e9090df2a5348a4b35ae5d9b71/opustools-1.2.1-py3-none-any.whl (108kB)
Installing collected packages: opustools
Successfully installe

In [20]:
!python ./scripts/get_news_opus.py --help

       
OPUS-NewsCommentary is one of the different dataset available in OPUS. It consists of
a parallel corpus of News Commentaries provided by Workshop on Statistical Machine 
Translation (WMT). This type of data is the most related to fact-check news we 
hope to face up. 


Requirements: 
    This scripts requiers opustools. You can install it with the following command:
        $ pip install opustools
        
Usage:

    python get_news_opus.py [options] 

Options:
    -n, --n_sentences            Number of sentences to collect
    -l, --languages              Languages ​​from which we extract sentences
    -h, --help                   Help documentation

Example. Extract OPUS-NewsCommentary arabic and italian sentences:
    !python get_news_opus.py --n_sentences 500 --languages ar,it


In [21]:
!python ./scripts/get_news_opus.py --n_sentences 500 --languages ar,cs,de,en,es,fr,it,ja,nl,pt,ru,zh

Recovering OPUS-NewsCommentary sentences for languages: ar cs de en es fr it ja nl pt ru zh
Create: parallel-sentences/News/News-Commentary-en-ar.tsv.gz
No alignment file "/projappl/nlpl/data/OPUS/News-Commentary/latest/xml/ar-en.xml.gz" or "./opus/News-Commentary_latest_xml_ar-en.xml.gz" found
The following files are available for downloading:

  37 MB https://object.pouta.csc.fi/OPUS-News-Commentary/v14/raw/ar.zip
  40 MB https://object.pouta.csc.fi/OPUS-News-Commentary/v14/raw/en.zip
 416 KB https://object.pouta.csc.fi/OPUS-News-Commentary/v14/xml/ar-en.xml.gz

  78 MB Total size
./opus/News-Commentary_latest_raw_ar.zip ... 100% of 37 MB
./opus/News-Commentary_latest_raw_en.zip ... 100% of 40 MB
./opus/News-Commentary_latest_xml_ar-en.xml.gz ... 100% of 416 KB
Create: parallel-sentences/News/News-Commentary-en-cs.tsv.gz
No alignment file "/projappl/nlpl/data/OPUS/News-Commentary/latest/xml/cs-en.xml.gz" or "./opus/News-Commentary_latest_xml_cs-en.xml.gz" found
The following files ar

In [22]:
df_news = pd.read_pickle("./parallel-sentences/News/df_News_ar-cs-de-en-es-fr-it-ja-nl-pt-ru-zh.pkl")
df_news.groupby("lang").describe()

,from,from,from,from,sentences,sentences,sentences,sentences
lang,count,unique,top,freq,count,unique,top,freq
ar,500,1,OPUS_News_Commentary,500,500,500,ونظراً لميل الأجهزة الإشرافية الوطنية المتوقع ...,1
cs,500,1,OPUS_News_Commentary,500,500,500,Tvořil součást konsensu roku 1945.,1
de,500,1,OPUS_News_Commentary,500,500,500,NEW YORK: Mehr als die Hälfte der asiatischen ...,1
en,500,1,OPUS_News_Commentary,500,500,500,Elections can lead to illiberal democracies an...,1
es,500,1,OPUS_News_Commentary,500,500,500,"De ser así, el Congreso se limitaría a aprobar...",1
fr,500,1,OPUS_News_Commentary,500,500,500,Se sentir trop confiant serait une grave erreu...,1
it,500,1,OPUS_News_Commentary,500,500,500,Accontentarsi della ricchezza,1
ja,500,1,OPUS_News_Commentary,500,500,500,大半のドイツ国民が望んでいるのは後者であろうが、最も実現しそうなのは前者の方である｡しか...,1
nl,500,1,OPUS_News_Commentary,500,500,500,Het ESI identificeert vier gebieden van wederz...,1
pt,500,1,OPUS_News_Commentary,500,500,500,"Se as barreiras de segurança forem fortes, ou ...",1


## Check OPUS-NewsCommentary is parallel data

In [23]:
df_news[df_news["lang"] == "fr"].head()

index,from,lang,sentences
2500,OPUS_News_Commentary,fr,L’or à 10.000 dollars l’once ?
2501,OPUS_News_Commentary,fr,SAN FRANCISCO – Il n’a jamais été facile d’avo...
2502,OPUS_News_Commentary,fr,"Et aujourd’hui, alors que le cours de l’or a a..."
2503,OPUS_News_Commentary,fr,"En décembre dernier, mes collègues économistes..."
2504,OPUS_News_Commentary,fr,Mais devinez ce qui s’est passé ?


In [24]:
df_news[df_news["lang"] == "es"].head()

index,from,lang,sentences
4000,OPUS_News_Commentary,es,¿El oro a 10.000 dólares?
4001,OPUS_News_Commentary,es,SAN FRANCISCO – Nunca ha resultado fácil soste...
4002,OPUS_News_Commentary,es,"Últimamente, con los precios del oro más de un..."
4003,OPUS_News_Commentary,es,"Apenas en el pasado mes de diciembre, mis cole..."
4004,OPUS_News_Commentary,es,¿Y saben qué?


# ALL DATA

In [25]:
df_multi_PCA_train = pd.concat([df_TED_pl_tr_hi, df_TED_langs, 
                                df_Wiki_pl_tr_hi, df_Wiki_langs, 
                                df_news])
df_multi_PCA_train.to_pickle("df_multi_PCA_train_1000.pkl")
df_multi_PCA_train.groupby("lang").describe()

,from,from,from,from,sentences,sentences,sentences,sentences
lang,count,unique,top,freq,count,unique,top,freq
ar,1000,3,OPUS_News_Commentary,500,1000,1000,ليس هناك أدوات تكنولوجية هنا ؛ مجرد علوم الأحي...,1
cs,1000,3,OPUS_News_Commentary,500,1000,1000,"V roce 2010 zavtipkovala: ""Teď pracuju s ženam...",1
de,1000,3,OPUS_News_Commentary,500,1000,1000,NEW YORK: Mehr als die Hälfte der asiatischen ...,1
en,1000,3,OPUS_News_Commentary,500,1000,1000,The name Community and Development disappeared.,1
es,1000,3,OPUS_News_Commentary,500,1000,1000,Y lo que ocurra en 2009 pondrá en riesgo algun...,1
fr,1000,3,OPUS_News_Commentary,500,1000,1000,Et chaque fois qu'elles ont fait un des deux c...,1
hi,1000,2,TED2020,500,1000,1000,वहाँ एक परिकलित्र (कैलकुलेटर) है.,1
it,1000,3,OPUS_News_Commentary,500,1000,1000,"Ad esempio, nel 2010 solo l’8% dei 409 miliard...",1
ja,1000,3,OPUS_News_Commentary,500,1000,1000,これを実際に示したのがアメリカと世界のテロリズムに対する反応であった。アル・カイーダに出入り...,1
nl,1000,3,OPUS_News_Commentary,500,1000,1000,Het ESI identificeert vier gebieden van wederz...,1
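
The 1000 sentences per language follow directly from the `--n_sentences` values used in the script calls of this notebook. The arithmetic can be checked with a short sketch; the variable names are illustrative, while the quotas and language groups are the ones used in the cells above.

```python
from collections import Counter

GROUP_A = ["pl", "tr", "hi"]  # languages not requested from OPUS-NewsCommentary
GROUP_B = "ar,cs,de,en,es,fr,it,ja,nl,pt,ru,zh".split(",")

# (source, languages, sentences per language) for each script call in this notebook
calls = [
    ("TED2020", GROUP_A, 500),
    ("TED2020", GROUP_B, 250),
    ("WikiMatrix", GROUP_A, 500),
    ("WikiMatrix", GROUP_B, 250),
    ("OPUS_News_Commentary", GROUP_B, 500),
]

totals = Counter()
sources = {}
for source, langs, n in calls:
    for lang in langs:
        totals[lang] += n
        sources.setdefault(lang, set()).add(source)

for lang in sorted(totals):
    print(lang, totals[lang], len(sources[lang]))
```

Every language ends up with 1000 sentences; `hi`, `pl` and `tr` reach that total from two sources, the other twelve languages from three.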


# Visualization

In [26]:
!pip install -U -q plotly
import plotly


import matplotlib.pyplot as plt
# %matplotlib inline 
import plotly.express as px
import plotly.graph_objects as go

plotly.__version__


'4.14.1'

In [27]:
df_multi_PCA_train_1000 = pd.read_pickle("./df_multi_PCA_train_1000.pkl")
sentences_train = df_multi_PCA_train_1000.sentences.to_list()

In [28]:
import plotly.graph_objects as go

# Number of sentences per language, computed once and reused for both axes
lang_counts = df_multi_PCA_train_1000.groupby("lang")["from"].count()

fig_1 = go.Figure([go.Bar(
                        x = lang_counts.index, 
                        y = lang_counts.values,
                        hovertemplate= "Language: %{x} <br>Nº sentences: %{y}",
                        marker_color ="rgb(253,180,98)",
                        name = ""
                        ),
                 ],
                )
fig_1.update_layout(
      hoverlabel=dict(
          font_size=14,
          font_family="Arial",
          bgcolor = "white"),

      height=400, 
      width = 900, 
      title="Training PCA Multilingual Data - Number of sentences per language",
      title_font = {"size": 20},)

fig_1.update_xaxes(
          tickangle = 0,
          tickfont = {"size": 16},
          title_text = "Languages",
          title_font = {"size": 18},
          title_standoff = 10)

fig_1.update_yaxes(
          tickfont = {"size": 16},
          title_font = {"size": 18},
          title_text = "Number of sentences",
          title_standoff = 10)


fig_1.show()

In [31]:
fig_2 = go.Figure()
# Group by a plain column name so that `source` is a string, not a one-element tuple
for source, group in df_multi_PCA_train_1000.groupby("from"):
  # Count the sentences of this data source in each language
  data = group.groupby("lang", as_index = False).agg({'sentences':'count'})
  trace = go.Bar(
                  x = data["lang"], 
                  y= data["sentences"],
                  # One customdata entry per bar so the hover text can show the source
                  customdata = [source] * len(data),
                  hovertemplate= "Data source: %{customdata} <br> Language: %{x} <br>Nº sentences: %{y}",
                  # marker_color ="rgb(253,180,98)",
                  name = source
                )
  fig_2.add_trace(trace)

fig_2.update_layout(
      hoverlabel=dict(
          font_size=14,
          font_family="Arial",
          bgcolor = "white"),

      height = 400, 
      width = 1000, 
      title = "Training PCA Multilingual Data - Number of sentences per data source",
      title_font={"size": 20},
      legend_font = {"size": 16})

fig_2.update_xaxes(
          tickangle = 0,
          tickfont = {"size": 16},
          title_text = "Languages",
          title_font = {"size": 18},
          title_standoff = 10)

fig_2.update_yaxes(
          tickfont = {"size": 16},
          title_font = {"size": 18},
          title_text = "Number of sentences",
          title_standoff = 10)

fig_2.show()
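
The numbers behind the per-source bars can also be inspected as a plain table. A minimal sketch using `pd.crosstab` is shown below; `counts_by_source` is a hypothetical helper, and the toy frame only mimics the `from` and `lang` columns of the notebook's data frame.

```python
import pandas as pd

def counts_by_source(df):
    """Rows: language, columns: data source, values: number of sentences."""
    return pd.crosstab(df["lang"], df["from"])

# Toy frame with the same columns as df_multi_PCA_train_1000
toy = pd.DataFrame({
    "from": ["TED2020", "TED2020", "WikiMatrix", "OPUS_News_Commentary"],
    "lang": ["hi", "en", "hi", "en"],
    "sentences": ["a", "b", "c", "d"],
})
table = counts_by_source(toy)
print(table)
```

With the real data, `counts_by_source(df_multi_PCA_train_1000)` would give one row per language and one column per data source, with a zero wherever a language has no sentences from a source.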

The following code writes both figures to a single HTML file so that they remain interactive. Loading plotly.js from the CDN (`include_plotlyjs='cdn'`) keeps the file much lighter than saving each figure as a standalone HTML page. 

In [30]:
with open('multi_pca_barplot_sentences_per_language.html', 'w') as f:
  
  f.write("""
<body>
    <div style="width:800px; margin:0 auto;">
    <br>
    <br>
    """)
  f.write(fig_1.to_html(full_html=False, include_plotlyjs='cdn'))
  f.write(fig_2.to_html(full_html=False, include_plotlyjs='cdn'))
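
The page assembly can also be kept separate from the figure export, which makes it easy to confirm that every opened tag is closed again. A minimal sketch follows; `build_page` is a hypothetical helper and the two fragments are placeholders. In the notebook they would come from `fig_1.to_html(full_html=False, include_plotlyjs='cdn')` and `fig_2.to_html(full_html=False, include_plotlyjs=False)`, where the second call can skip plotly.js because the first fragment already loads it.

```python
def build_page(fragments, width_px=800):
    """Wrap HTML fragments in a centered <div> and close every tag that was opened."""
    parts = [
        "<html>",
        "<body>",
        f'<div style="width:{width_px}px; margin:0 auto;">',
        "<br>",
        "<br>",
        *fragments,
        "</div>",
        "</body>",
        "</html>",
    ]
    return "\n".join(parts) + "\n"

# Placeholder fragments standing in for the exported figures
fragments = ["<div id='fig-1'></div>", "<div id='fig-2'></div>"]
page = build_page(fragments)
```

The resulting string could then be written with `open(path, "w").write(page)`.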