<a href="https://colab.research.google.com/github/Huertas97/Get_Multilingual_Data/blob/main/notebooks/Multilingual_Data_fit_PCA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

In this notebook we show how to use the scripts available in https://github.com/Huertas97/Get_Multilingual_Data.git to extract multilingual sentences. 

The final result of the notebook is a data frame with 1000 sentences per language drawn from three resources: TED2020, WikiMatrix and OPUS-NewsCommentary (`hi`, `pl` and `tr` are requested from the first two only). 

A visualization of the multilingual data extracted is shown at the end of the notebook. 

# Loading Trial Data


In [6]:
!pip install -U -q sentence-transformers
import pandas as pd

  Building wheel for sentence-transformers (setup.py) ... done
  Building wheel for sacremoses (setup.py) ... done


In [1]:
!pip install -U -q sentence-transformers
!pip install -U -q tqdm

# Standard library
import csv
import gzip
import logging
import os
import re
from itertools import combinations

# Third-party
import numpy as np
import pandas as pd
from scipy.stats import pearsonr, spearmanr
from sentence_transformers import LoggingHandler
from sentence_transformers.readers import STSBenchmarkDataReader
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import paired_cosine_distances, paired_euclidean_distances, paired_manhattan_distances
from tqdm.notebook import tqdm

pd.set_option('display.max_colwidth', None)
# Note: resetting every "display" option also restores the default max_colwidth set above
pd.reset_option("^display")

  Building wheel for sentence-transformers (setup.py) ... done
  Building wheel for sacremoses (setup.py) ... done

# Languages for the PCA

In [1]:
languages = "ar, cs, de, en, es, fr, hi, it, ja, nl, pl, pt, ru, tr, zh".split(", ")
print(",".join(languages))
print(len(languages))

ar,cs,de,en,es,fr,hi,it,ja,nl,pl,pt,ru,tr,zh
15
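The extraction commands used later in this notebook request a different number of sentences per source depending on the language: `pl`, `tr` and `hi` are not requested from OPUS-NewsCommentary, so they take 500 sentences from each of TED2020 and WikiMatrix, while the other twelve languages take 250 from each of those two sources plus 500 from OPUS-NewsCommentary. A minimal sketch of that arithmetic follows; the `quota` function and the `NO_NEWS` set are illustrative names, not part of the repository's scripts.

```python
languages = "ar, cs, de, en, es, fr, hi, it, ja, nl, pl, pt, ru, tr, zh".split(", ")

# Assumption: languages requested only from TED2020 and WikiMatrix in this notebook.
NO_NEWS = {"pl", "tr", "hi"}

def quota(lang):
    """Return the number of sentences requested per source for one language."""
    if lang in NO_NEWS:
        return {"TED2020": 500, "WikiMatrix": 500}
    return {"TED2020": 250, "WikiMatrix": 250, "OPUS_News_Commentary": 500}

# Every language should add up to the same total across its sources.
totals = {lang: sum(quota(lang).values()) for lang in languages}
print(totals)
```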


# Sentences from TED 2020

In [18]:
!python get_TED2020_sentences.py --help

2020-12-16 10:03:54.700559: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1

This script downloads the TED2020 corpus and create parallel sentences tsv files

The TED2020 corpus is a crawl of transcripts from TED and TEDx talks, which 
are translated to 100+ languages. With this script the user can select the 
amount of sentences and the languages desired. 

The TED2020 corpus is downloaded automatically only for the languages selected.
          
Usage:

    python get_TED2020_sentences [options] 

Options:
    -n, --n_sentences            Number of sentences to collect
    -l, --languages              Languages from which we extract sentences


Example. Extract TED 2020  arabic and italian sentences:
    python get_TED2020_sentences.py --n_sentences 500 --languages ar,it
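The help text above documents two options. How the script parses them is not shown in this notebook; the sketch below is one plausible way to do it with `argparse`. The `parse_args` function is hypothetical and is not taken from the repository.

```python
import argparse

def parse_args(argv):
    """Parse the two documented options; the comma-separated list becomes a Python list."""
    parser = argparse.ArgumentParser(description="Hypothetical re-creation of the CLI")
    parser.add_argument("-n", "--n_sentences", type=int, required=True,
                        help="Number of sentences to collect")
    parser.add_argument("-l", "--languages", required=True,
                        type=lambda s: [code.strip() for code in s.split(",") if code.strip()],
                        help="Comma-separated language codes")
    return parser.parse_args(argv)

# Same arguments as the example in the help text.
args = parse_args(["--n_sentences", "500", "--languages", "ar,it"])
print(args.n_sentences, args.languages)
```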


In [19]:
!python get_TED2020_sentences.py --n_sentences 500 --languages pl,tr,hi

2020-12-16 10:04:14.509631: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Recovering TED2020 talks for languages: pl tr hi
Parallel sentences files en-pl, en-tr, en-hi do not exist. Create these files now
Creating data frame for TED2020 languages: pl-tr-hi
TED2020-en-tr-train_pca.tsv.gz
TED2020-en-pl-train_pca.tsv.gz
TED2020-en-hi-train_pca.tsv.gz
---Saving results in parallel-sentences/TED2020/df_TED_pl-tr-hi.pkl ---
--- Removing downloaded files ---
--- Finish ---


In [12]:
df_TED_pl_tr_hi = pd.read_pickle("/content/parallel-sentences/TED2020/df_TED_pl-tr-hi.pkl")
df_TED_pl_tr_hi.groupby("lang").describe()

| lang | from: count | from: unique | from: top | from: freq | sentences: count | sentences: unique | sentences: top | sentences: freq |
|---|---|---|---|---|---|---|---|---|
| hi | 500 | 1 | TED2020 | 500 | 500 | 500 | हर बार इसे दिखाने से पहले मैं इसे सुधारता हूँ. | 1 |
| pl | 500 | 1 | TED2020 | 500 | 500 | 500 | Znamy symptomy. | 1 |
| tr | 500 | 1 | TED2020 | 500 | 500 | 500 | Larry Lessig de bu sürecin içine dahil olacak ... | 1 |


In [16]:
!python get_TED2020_sentences.py --n_sentences 250 --languages ar,cs,de,en,es,fr,it,ja,nl,pt,ru,zh

2020-12-16 07:57:31.242155: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Recovering TED2020 talks for languages: ar cs de en es fr it ja nl pt ru zh
Parallel sentences files en-es, en-it, en-pt, en-ar, en-ru, en-zh, en-en, en-nl, en-fr, en-de, en-ja, en-cs do not exist. Create these files now
Creating data frame for Wikimatrix languages: ar-cs-de-en-es-fr-it-ja-nl-pt-ru-zh
TED2020-en-fr-train_pca.tsv.gz
TED2020-en-cs-train_pca.tsv.gz
TED2020-en-ru-train_pca.tsv.gz
TED2020-en-ja-train_pca.tsv.gz
TED2020-en-es-train_pca.tsv.gz
TED2020-en-pt-train_pca.tsv.gz
TED2020-en-nl-train_pca.tsv.gz
TED2020-en-zh-train_pca.tsv.gz
TED2020-en-en-train_pca.tsv.gz
TED2020-en-ar-train_pca.tsv.gz
TED2020-en-de-train_pca.tsv.gz
TED2020-en-it-train_pca.tsv.gz
---Saving results in parallel-sentences/TED2020/df_TED_ar-cs-de-en-es-fr-it-ja-nl-pt-ru-zh.pkl ---
--- Removing downloaded files ---
--- Finish ---


In [17]:
df_TED_langs = pd.read_pickle("/content/parallel-sentences/TED2020/df_TED_ar-cs-de-en-es-fr-it-ja-nl-pt-ru-zh.pkl")
df_TED_langs.groupby("lang").describe()

| lang | from: count | from: unique | from: top | from: freq | sentences: count | sentences: unique | sentences: top | sentences: freq |
|---|---|---|---|---|---|---|---|---|
| ar | 250 | 1 | TED2020 | 250 | 250 | 250 | هناك الكثير ليقال حول ذلك. | 1 |
| cs | 250 | 1 | TED2020 | 250 | 250 | 250 | Pomozte s masovou kampaní, která začne letos n... | 1 |
| de | 250 | 1 | TED2020 | 250 | 250 | 250 | Und wir haben einen langen Weg hinter uns, sei... | 1 |
| en | 250 | 1 | TED2020 | 250 | 250 | 250 | (Applause) | 1 |
| es | 250 | 1 | TED2020 | 250 | 250 | 250 | Estos no son resultados electorales, son pers... | 1 |
| fr | 250 | 1 | TED2020 | 250 | 250 | 250 | Ainsi au lieu d'insérer un fil dans un seul en... | 1 |
| it | 250 | 1 | TED2020 | 250 | 250 | 250 | uscimmo, iniziammo a cercare e trovammo un ris... | 1 |
| ja | 250 | 1 | TED2020 | 250 | 250 | 250 | よく見てみると 彼の頭がい骨は アクリルガラスのドームに 付け替えられています こうすること... | 1 |
| nl | 250 | 1 | TED2020 | 250 | 250 | 250 | Men doet waarvoor je ze betaalt, en als ze ins... | 1 |
| pt | 250 | 1 | TED2020 | 250 | 250 | 250 | Cortou a cabeça às moscas. | 1 |


## Check that TED2020 is parallel data

The same sentence in different languages should theoretically map to the same vector. In practice, embeddings vary somewhat across languages. Including this variability in the data used to fit the PCA is highly recommended, and the parallel data from TED2020 serves this purpose. 
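To illustrate why parallel data helps, the sketch below simulates embeddings instead of calling a real sentence encoder: each "language" gets the same underlying sentence vectors plus a small language-specific offset and noise. All names (`base`, `embed`, `fit_pca`, `paired_cosine`) and the synthetic data are assumptions, and the PCA is a plain NumPy SVD standing in for `sklearn.decomposition.PCA`.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sentences, dim, n_components = 200, 32, 8

# Synthetic "meaning" vectors shared by every translation of a sentence.
base = rng.normal(size=(n_sentences, dim))

def embed(lang_shift):
    """Simulate one language: shared meaning + constant language offset + small noise."""
    return base + lang_shift + 0.05 * rng.normal(size=base.shape)

emb_es = embed(rng.normal(scale=0.1, size=dim))
emb_pt = embed(rng.normal(scale=0.1, size=dim))

def fit_pca(x, k):
    """Return the mean and the top-k principal directions of x (rows are samples)."""
    mean = x.mean(axis=0)
    _, _, vt = np.linalg.svd(x - mean, full_matrices=False)
    return mean, vt[:k]

# Fit on both languages together so the cross-language variability is part of the fit.
mean, components = fit_pca(np.vstack([emb_es, emb_pt]), n_components)
red_es = (emb_es - mean) @ components.T
red_pt = (emb_pt - mean) @ components.T

def paired_cosine(a, b):
    """Cosine similarity between row i of a and row i of b."""
    return np.sum(a * b, axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))

print("mean cosine before PCA:", paired_cosine(emb_es, emb_pt).mean())
print("mean cosine after PCA: ", paired_cosine(red_es, red_pt).mean())
```

With real data, `emb_es` and `emb_pt` would be the encoder outputs for aligned rows of the TED2020 frame.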

In [40]:
df_TED_langs[df_TED_langs["lang"] == "pt"].head()

|  | from | lang | sentences |
|---|---|---|---|
| 1250 | TED2020 | pt | Muito obrigado, Chris. |
| 1251 | TED2020 | pt | É realmente uma grande honra ter a oportunidad... |
| 1252 | TED2020 | pt | Fiquei muito impressionado com esta conferênci... |
| 1253 | TED2020 | pt | Digo-o sinceramente, em parte, porque... preci... |
| 1254 | TED2020 | pt | (Risos) Coloquem-se no meu lugar! |


In [43]:
df_TED_langs[df_TED_langs["lang"] == "es"].head()

|  | from | lang | sentences |
|---|---|---|---|
| 1000 | TED2020 | es | Muchas gracias Chris. |
| 1001 | TED2020 | es | Y es en verdad un gran honor tener la oportuni... |
| 1002 | TED2020 | es | He quedado conmovido por esta conferencia, y d... |
| 1003 | TED2020 | es | Y digo eso sinceramente, en parte porque -- (S... |
| 1004 | TED2020 | es | (Risas) ¡Pónganse en mi posición! |
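One way to make this check explicit is to place the first rows of each language side by side. The sketch below assumes the frame has the `lang` and `sentences` columns shown above and that rows keep their original order within each language; `side_by_side` is a hypothetical helper, and the toy frame reuses two of the sentence pairs printed above.

```python
import pandas as pd

def side_by_side(df, langs, n=5):
    """Return the first n sentences of each language as columns, aligned by position."""
    columns = {
        lang: df.loc[df["lang"] == lang, "sentences"].head(n).reset_index(drop=True)
        for lang in langs
    }
    return pd.DataFrame(columns)

toy = pd.DataFrame({
    "from": ["TED2020"] * 4,
    "lang": ["es", "es", "pt", "pt"],
    "sentences": ["Muchas gracias Chris.", "(Risas) ¡Pónganse en mi posición!",
                  "Muito obrigado, Chris.", "(Risos) Coloquem-se no meu lugar!"],
})

# Each row of the view should contain translations of the same sentence.
view = side_by_side(toy, ["es", "pt"], n=2)
print(view)
```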


# Sentences from WikiMatrix


In [20]:
!python get_wikimatrix_sentences.py --help

2020-12-16 10:23:55.003485: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1

This script automatically downloads WikiMatrix corpus for the languages selected.   
The WikiMatrix corpus is a crawl of mined parallel sentences from Wikipedia in 
different languages. With this script the user can select the amount of 
sentences and the languages desired. We only used pairs with scores
above 1.075, as pairs below this threshold were often of bad quality.
       
          
Usage:

    python get_wikimatrix_sentences.py [options] 

Options:
    -n, --n_sentences            Number of sentences to collect
    -l, --languages              Languages from which we extract sentences
    -h, --help                   Help documentation


Example. Extract Wikimatrix arabic and italian sentences:
    python get_wikimatrix_sentences.py --n_sentences 500 --languages ar,it
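The help text mentions that only pairs scoring above 1.075 are kept. The sketch below shows that kind of filter on an in-memory sample. It assumes each WikiMatrix line holds a score and two sentences separated by tabs; `filter_pairs` and the sample rows are made up for illustration and are not taken from the script.

```python
import csv
import io

THRESHOLD = 1.075  # value quoted in the help text above

def filter_pairs(lines, threshold=THRESHOLD, limit=None):
    """Yield (source, target) pairs whose score is above the threshold."""
    kept = 0
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 3:
            continue  # skip malformed lines
        score, src, tgt = float(row[0]), row[1], row[2]
        if score > threshold:
            yield src, tgt
            kept += 1
            if limit is not None and kept >= limit:
                return

# Invented sample rows in the assumed "score<TAB>source<TAB>target" layout.
sample = io.StringIO(
    "1.20\tGood morning.\tBuongiorno.\n"
    "1.05\tUnrelated text.\tTesto diverso.\n"
    "1.10\tThank you.\tGrazie.\n"
)
pairs = list(filter_pairs(sample))
print(pairs)
```

Passing `limit` would stop reading as soon as enough pairs are collected, which matters for files of several hundred megabytes.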


In [29]:
!python get_wikimatrix_sentences.py --n_sentences 500 --languages pl,tr,hi

2020-12-16 10:32:11.443633: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Recovering WikiMatrix data for for languages: pl tr hi
Write 500 PCA train sentences parallel-sentences/Wikimatrix/WikiMatrix-en-tr-train_pca.tsv.gz
Write 500 PCA train sentences parallel-sentences/Wikimatrix/WikiMatrix-en-pl-train_pca.tsv.gz
Write 500 PCA train sentences parallel-sentences/Wikimatrix/WikiMatrix-en-hi-train_pca.tsv.gz
Creating data frame for Wikimatrix languages: pl-tr-hi
WikiMatrix-en-nl-train_pca.tsv.gz
WikiMatrix-en-tr-train_pca.tsv.gz
WikiMatrix-en-pl-train_pca.tsv.gz
WikiMatrix-en-hi-train_pca.tsv.gz
---Saving results in parallel-sentences/Wikimatrix/df_wikimatrix_pl-tr-hi.pkl ---
--- Removing downloaded files ---
--- Finish ---


In [26]:
df_Wiki_pl_tr_hi = pd.read_pickle("/content/parallel-sentences/Wikimatrix/df_wikimatrix_pl-tr-hi.pkl")
df_Wiki_pl_tr_hi.groupby("lang").describe()

| lang | from: count | from: unique | from: top | from: freq | sentences: count | sentences: unique | sentences: top | sentences: freq |
|---|---|---|---|---|---|---|---|---|
| hi | 500 | 1 | WikiMatrix | 500 | 500 | 500 | शहर में कम से कम तीन बाग़ थे। | 1 |
| pl | 500 | 1 | WikiMatrix | 500 | 500 | 500 | W 2009 roku powiedział: „Pamiętacie Lawrence’a... | 1 |
| tr | 500 | 1 | WikiMatrix | 500 | 500 | 500 | Irak'ta kalırlarsa ne yapacaklar? | 1 |


In [30]:
!python get_wikimatrix_sentences.py --n_sentences 250 --languages ar,cs,de,en,es,fr,it,ja,nl,pt,ru,zh

2020-12-16 10:32:15.383313: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Recovering WikiMatrix data for for languages: ar cs de en es fr it ja nl pt ru zh
Download https://dl.fbaipublicfiles.com/laser/WikiMatrix/v1/WikiMatrix.en-fr.tsv.gz
100% 755M/755M [00:31<00:00, 24.0MB/s]
Write 250 PCA train sentences parallel-sentences/Wikimatrix/WikiMatrix-en-fr-train_pca.tsv.gz
Download https://dl.fbaipublicfiles.com/laser/WikiMatrix/v1/WikiMatrix.cs-en.tsv.gz
100% 220M/220M [00:09<00:00, 23.2MB/s]
Write 250 PCA train sentences parallel-sentences/Wikimatrix/WikiMatrix-en-cs-train_pca.tsv.gz
Download https://dl.fbaipublicfiles.com/laser/WikiMatrix/v1/WikiMatrix.en-it.tsv.gz
100% 566M/566M [00:23<00:00, 23.9MB/s]
Write 250 PCA train sentences parallel-sentences/Wikimatrix/WikiMatrix-en-it-train_pca.tsv.gz
Write 250 PCA train sentences parallel-sentences/Wikimatrix/WikiMatrix-en-nl-train_pca.tsv.gz
Download https://dl.fbaipub

In [31]:
df_Wiki_langs = pd.read_pickle("/content/parallel-sentences/Wikimatrix/df_wikimatrix_ar-cs-de-en-es-fr-it-ja-nl-pt-ru-zh.pkl")
df_Wiki_langs.groupby("lang").describe()

| lang | from: count | from: unique | from: top | from: freq | sentences: count | sentences: unique | sentences: top | sentences: freq |
|---|---|---|---|---|---|---|---|---|
| ar | 250 | 1 | WikiMatrix | 250 | 250 | 250 | وأنا مثل أي شخص، أود أن أعيش حياة طويلة. | 1 |
| cs | 250 | 1 | WikiMatrix | 250 | 250 | 250 | Blažej Baláž - Moja cesta. | 1 |
| de | 250 | 1 | WikiMatrix | 250 | 250 | 250 | Ich habe einen Test gemacht: Ist der Mann immu... | 1 |
| en | 250 | 1 | WikiMatrix | 250 | 250 | 250 | The Prophet said, "O the son of Al-Khattab! | 1 |
| es | 250 | 1 | WikiMatrix | 250 | 250 | 250 | «¿Es imposible traducir dos veces (exactamente... | 1 |
| fr | 250 | 1 | WikiMatrix | 250 | 250 | 250 | These may be roughly equivalent to HEPA or ULP... | 1 |
| it | 250 | 1 | WikiMatrix | 250 | 250 | 250 | Gu finalmente trova la pace. | 1 |
| ja | 250 | 1 | WikiMatrix | 250 | 250 | 250 | また曹操は一旦許に帰還した。 | 1 |
| nl | 250 | 1 | WikiMatrix | 250 | 250 | 250 | Ames illustreerde daarnaast de werken Really a... | 1 |
| pt | 250 | 1 | WikiMatrix | 250 | 250 | 250 | E o Ale pode abrir essa nova porta." | 1 |


## Check that WikiMatrix data is not parallel data

As mentioned above, parallel data is extremely useful for capturing, in the PCA, how the embedding of the same sentence varies across languages. However, the PCA also needs a set of sentences that is specific to each language, so that each language's own representation is included. The WikiMatrix sample serves this purpose: its sentences differ from one language to the next.  

In [45]:
df_Wiki_langs[df_Wiki_langs["lang"] == "pt"].head()

|  | from | lang | sentences |
|---|---|---|---|
| 2500 | WikiMatrix | pt | O rabino Johanan disse: "Há montanhas, planíci... |
| 2501 | WikiMatrix | pt | O intercâmbio final (saída 27) fornece acesso ... |
| 2502 | WikiMatrix | pt | Ele é o Clemente, o Misericordioso! |
| 2503 | WikiMatrix | pt | Mas como você luta contra uma sombra do inferno? |
| 2504 | WikiMatrix | pt | Ele é um dos The Evil Dead.” |


In [44]:
df_Wiki_langs[df_Wiki_langs["lang"] == "es"].head()

|  | from | lang | sentences |
|---|---|---|---|
| 2750 | WikiMatrix | es | Nunca combate al lado o en contra de Gordon Fr... |
| 2751 | WikiMatrix | es | Vivimos entre ellos y notamos raramente esa gr... |
| 2752 | WikiMatrix | es | Dijo: «Serán mis últimos cuatro años». |
| 2753 | WikiMatrix | es | Él es el Poderoso, el Misericordioso. |
| 2754 | WikiMatrix | es | Dijo: "Me gusta ese contenido oscuro. |


# OPUS - News Commentary

In [46]:
!pip install opustools

Collecting opustools
  Downloading https://files.pythonhosted.org/packages/9d/5e/f2f0fdbb17a0d348ac1974965168c12157e9090df2a5348a4b35ae5d9b71/opustools-1.2.1-py3-none-any.whl (108kB)
Installing collected packages: opustools
Successfully installe

In [53]:
!python get_news_opus.py --help

       
OPUS-NewsCommentary is one of the different dataset available in OPUS. It consists of
a parallel corpus of News Commentaries provided by Workshop on Statistical Machine 
Translation (WMT). This type of data is the most related to fact-check news we 
hope to face up. 


Requirements: 
    This scripts requiers opustools. You can install it with the following command:
        $ pip install opustools
        
Usage:

    python get_news_opus.py [options] 

Options:
    -n, --n_sentences            Number of sentences to collect
    -l, --languages              Languages from which we extract sentences


Example. Extract OPUS-NewsCommentary arabic and italian sentences:
    !python get_news_opus.py --n_sentences 500 --languages ar,it


In [54]:
!python get_news_opus.py --n_sentences 500 --languages ar,cs,de,en,es,fr,it,ja,nl,pt,ru,zh

Recovering OPUS-NewsCommentary sentences for languages: ar cs de en es fr it ja nl pt ru zh
Create: parallel-sentences/News/News-Commentary-en-ar.tsv.gz
No alignment file "/projappl/nlpl/data/OPUS/News-Commentary/latest/xml/ar-en.xml.gz" or "./opus/News-Commentary_latest_xml_ar-en.xml.gz" found
The following files are available for downloading:

        ./opus/News-Commentary_latest_raw_en.zip already exists
  37 MB https://object.pouta.csc.fi/OPUS-News-Commentary/v14/raw/ar.zip
 416 KB https://object.pouta.csc.fi/OPUS-News-Commentary/v14/xml/ar-en.xml.gz

  38 MB Total size
./opus/News-Commentary_latest_raw_ar.zip ... 100% of 37 MB
./opus/News-Commentary_latest_xml_ar-en.xml.gz ... 100% of 416 KB
Create: parallel-sentences/News/News-Commentary-en-cs.tsv.gz
No alignment file "/projappl/nlpl/data/OPUS/News-Commentary/latest/xml/cs-en.xml.gz" or "./opus/News-Commentary_latest_xml_cs-en.xml.gz" found
The following files are available for downloading:

        ./opus/News-Commentary_latest

In [55]:
df_news = pd.read_pickle("/content/parallel-sentences/News/df_News_ar-cs-de-en-es-fr-it-ja-nl-pt-ru-zh.pkl")
df_news.groupby("lang").describe()

| lang | from: count | from: unique | from: top | from: freq | sentences: count | sentences: unique | sentences: top | sentences: freq |
|---|---|---|---|---|---|---|---|---|
| ar | 500 | 1 | OPUS_News_Commentary | 500 | 500 | 500 | وهناك أيضاً الخطر المتزايد من الإرهابيين ممن ن... | 1 |
| cs | 500 | 1 | OPUS_News_Commentary | 500 | 500 | 500 | Při každém železničním neštěstí či havárii let... | 1 |
| de | 500 | 1 | OPUS_News_Commentary | 500 | 500 | 500 | Sofern die Mitgliedsregierungen der ADB deren ... | 1 |
| en | 500 | 1 | OPUS_News_Commentary | 500 | 500 | 500 | In the Arab world, in particular, Islam is dom... | 1 |
| es | 500 | 1 | OPUS_News_Commentary | 500 | 500 | 500 | Muchos de estos cambios se plasmaron en la lla... | 1 |
| fr | 500 | 1 | OPUS_News_Commentary | 500 | 500 | 500 | L'objectif du PTCI est d'exploiter la puissanc... | 1 |
| it | 500 | 1 | OPUS_News_Commentary | 500 | 500 | 500 | Eppure, se la mobilità all’interno della zona ... | 1 |
| ja | 500 | 1 | OPUS_News_Commentary | 500 | 500 | 500 | ファタハに対する圧倒的な政治的拒否から利を得たハマスが、新たな権力をどう使っていくのかに審査... | 1 |
| nl | 500 | 1 | OPUS_News_Commentary | 500 | 500 | 500 | Verder naar het westen toe nemen de economisch... | 1 |
| pt | 500 | 1 | OPUS_News_Commentary | 500 | 500 | 500 | Para começar, os governos devem desenhar polít... | 1 |


## Check OPUS-NewsCommentary is parallel data

In [56]:
df_news[df_news["lang"] == "fr"].head()

|  | from | lang | sentences |
|---|---|---|---|
| 2500 | OPUS_News_Commentary | fr | L’or à 10.000 dollars l’once ? |
| 2501 | OPUS_News_Commentary | fr | SAN FRANCISCO – Il n’a jamais été facile d’avo... |
| 2502 | OPUS_News_Commentary | fr | Et aujourd’hui, alors que le cours de l’or a a... |
| 2503 | OPUS_News_Commentary | fr | En décembre dernier, mes collègues économistes... |
| 2504 | OPUS_News_Commentary | fr | Mais devinez ce qui s’est passé ? |


In [57]:
df_news[df_news["lang"] == "es"].head()

|  | from | lang | sentences |
|---|---|---|---|
| 4000 | OPUS_News_Commentary | es | ¿El oro a 10.000 dólares? |
| 4001 | OPUS_News_Commentary | es | SAN FRANCISCO – Nunca ha resultado fácil soste... |
| 4002 | OPUS_News_Commentary | es | Últimamente, con los precios del oro más de un... |
| 4003 | OPUS_News_Commentary | es | Apenas en el pasado mes de diciembre, mis cole... |
| 4004 | OPUS_News_Commentary | es | ¿Y saben qué? |


# ALL DATA

In [58]:
df_multi_PCA_train = pd.concat([df_TED_pl_tr_hi, df_TED_langs, 
                                df_Wiki_pl_tr_hi, df_Wiki_langs, 
                                df_news])
df_multi_PCA_train.to_pickle("df_multi_PCA_train_1000.pkl")
df_multi_PCA_train.groupby("lang").describe()

| lang | from: count | from: unique | from: top | from: freq | sentences: count | sentences: unique | sentences: top | sentences: freq |
|---|---|---|---|---|---|---|---|---|
| ar | 1000 | 3 | OPUS_News_Commentary | 500 | 1000 | 1000 | وهناك أيضاً الخطر المتزايد من الإرهابيين ممن ن... | 1 |
| cs | 1000 | 3 | OPUS_News_Commentary | 500 | 1000 | 1000 | Srdeční a cévní nemoci zabíjejí stále více lid... | 1 |
| de | 1000 | 3 | OPUS_News_Commentary | 500 | 1000 | 1000 | Sein Oratorium Vor langer Zeit. | 1 |
| en | 1000 | 3 | OPUS_News_Commentary | 500 | 1000 | 1000 | In the Arab world, in particular, Islam is dom... | 1 |
| es | 1000 | 3 | OPUS_News_Commentary | 500 | 1000 | 1000 | «¿Es imposible traducir dos veces (exactamente... | 1 |
| fr | 1000 | 3 | OPUS_News_Commentary | 500 | 1000 | 1000 | “Was the Great War a Watershed ? | 1 |
| hi | 1000 | 2 | TED2020 | 500 | 1000 | 1000 | हर बार इसे दिखाने से पहले मैं इसे सुधारता हूँ. | 1 |
| it | 1000 | 3 | OPUS_News_Commentary | 500 | 1000 | 1000 | La seconda è che i problemi insiti in una sing... | 1 |
| ja | 1000 | 3 | OPUS_News_Commentary | 500 | 1000 | 1000 | また曹操は一旦許に帰還した。 | 1 |
| nl | 1000 | 3 | OPUS_News_Commentary | 500 | 1000 | 1000 | Hulp aan regeringen die toestaan dat specifiek... | 1 |
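A quick sanity check on the combined frame is to confirm that every language reached the target count and to see how many sources contributed to it. The sketch below runs on a small synthetic frame with the same `from`, `lang` and `sentences` columns; `check_balance` is a hypothetical helper, and `ignore_index=True` is an optional choice (not used in the cell above) that gives the concatenated frame a unique index.

```python
import pandas as pd

def check_balance(df, expected):
    """Return per-language sentence counts, source counts, and whether the target is met."""
    summary = df.groupby("lang").agg(
        n_sentences=("sentences", "count"),
        n_sources=("from", "nunique"),
    )
    summary["ok"] = summary["n_sentences"] == expected
    return summary

# Tiny stand-ins for the per-source frames: 2 + 2 sentences for "hi", 1 + 1 + 2 for "es".
ted = pd.DataFrame({"from": "TED2020", "lang": ["hi", "hi", "es"],
                    "sentences": ["a", "b", "c"]})
wiki = pd.DataFrame({"from": "WikiMatrix", "lang": ["hi", "hi", "es"],
                     "sentences": ["d", "e", "f"]})
news = pd.DataFrame({"from": "OPUS_News_Commentary", "lang": ["es", "es"],
                     "sentences": ["g", "h"]})

combined = pd.concat([ted, wiki, news], ignore_index=True)
report = check_balance(combined, expected=4)
print(report)
```

On the real data the call would be `check_balance(df_multi_PCA_train, expected=1000)`.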


# Visualization

In [59]:
!pip install -U -q plotly
import plotly


import matplotlib.pyplot as plt
# %matplotlib inline 
import plotly.express as px
import plotly.graph_objects as go

plotly.__version__


'4.14.1'

In [62]:
df_multi_PCA_train_1000 = pd.read_pickle("/content/df_multi_PCA_train_1000.pkl")
sentences_train = df_multi_PCA_train_1000.sentences.to_list()

In [63]:
# Compute the per-language summary once and reuse it for both axes
lang_summary = df_multi_PCA_train_1000.groupby("lang").describe()

fig_1 = go.Figure([go.Bar(
                        x = lang_summary.index, 
                        y = lang_summary["from"]["count"],
                        hovertemplate= "Language: %{x} <br>Nº sentences: %{y}",
                        marker_color ="rgb(253,180,98)",
                        name = ""
                        ),
                 ],
                )
fig_1.update_layout(
      hoverlabel=dict(
          font_size=14,
          font_family="Arial",
          bgcolor = "white"),

      height=400, 
      width = 900, 
      title="Training PCA Multilingual Data - Number of sentences per language",
      title_font = {"size": 20},)

fig_1.update_xaxes(
          tickangle = 0,
          tickfont = {"size": 16},
          title_text = "Languages",
          title_font = {"size": 18},
          title_standoff = 10)

fig_1.update_yaxes(
          tickfont = {"size": 16},
          title_font = {"size": 18},
          title_text = "Number of sentences",
          title_standoff = 10)


fig_1.show()

In [64]:
fig_2 = go.Figure()
# Group by a column name rather than a one-element list so that `source` is a plain string
for source, group in df_multi_PCA_train_1000.groupby("from"):
  data = group.groupby(["lang"], as_index = False).agg({'sentences':'count'})
  source_data = [source]
  trace = go.Bar(
                  x = data["lang"], 
                  y= data["sentences"],
                  customdata = [source_data] * len(data["sentences"].to_list()),
                  hovertemplate= "Data source: %{customdata} <br> Language: %{x} <br>Nº sentences: %{y}",
                  # marker_color ="rgb(253,180,98)",
                  name = source
                )
  fig_2.add_trace(trace)

fig_2.update_layout(
      hoverlabel=dict(
          font_size=14,
          font_family="Arial",
          bgcolor = "white"),

      height = 400, 
      width = 1000, 
      title = "Training PCA Multilingual Data - Number of sentences per data source",
      title_font={"size": 20},
      legend_font = {"size": 16})

fig_2.update_xaxes(
          tickangle = 0,
          tickfont = {"size": 16},
          title_text = "Languages",
          title_font = {"size": 18},
          title_standoff = 10)

fig_2.update_yaxes(
          tickfont = {"size": 16},
          title_font = {"size": 18},
          title_text = "Number of sentences",
          title_standoff = 10)

fig_2.show()
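The counts behind the grouped bar chart can also be inspected as a table. A minimal sketch using `pd.crosstab` on a toy frame with the same `lang` and `from` columns (the rows are placeholders, not the real data):

```python
import pandas as pd

toy = pd.DataFrame({
    "from": ["TED2020", "TED2020", "WikiMatrix", "OPUS_News_Commentary", "WikiMatrix"],
    "lang": ["es", "hi", "es", "es", "hi"],
    "sentences": ["s1", "s2", "s3", "s4", "s5"],
})

# Rows are languages, columns are data sources, cells are sentence counts.
counts = pd.crosstab(toy["lang"], toy["from"])
print(counts)

# Row sums give the per-language totals, i.e. the quantity plotted in the first figure.
per_language = counts.sum(axis=1)
print(per_language)
```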

The following code writes both figures to a single HTML file so that they can be explored interactively. Loading plotly.js from a CDN (`include_plotlyjs='cdn'`) keeps the file much lighter than saving each figure as a standalone HTML page. 

In [None]:
with open('multi_pca_barplot_sentences_per_language.html', 'w') as f:
  
  f.write("""
<body>
    <div style="width:800px; margin:0 auto;">
    <br>
    <br>
    """)
  f.write(fig_1.to_html(full_html=False, include_plotlyjs='cdn'))
  f.write(fig_2.to_html(full_html=False, include_plotlyjs='cdn'))
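The wrapper written above opens a `<body>` and a `<div>`; browsers tolerate unclosed tags, but a well-formed page also closes them. The sketch below shows the same idea with a hypothetical `build_page` helper that wraps any number of HTML fragments. The placeholder strings stand in for the output of `fig_1.to_html(full_html=False, ...)` and `fig_2.to_html(full_html=False, ...)`.

```python
def build_page(fragments, width_px=800):
    """Wrap HTML fragments in a centered <div> and close every tag that was opened."""
    parts = ["<html>", "<body>",
             f'<div style="width:{width_px}px; margin:0 auto;">']
    parts.extend(fragments)
    parts.extend(["</div>", "</body>", "</html>"])
    return "\n".join(parts)

# Placeholders standing in for the two figure fragments.
page = build_page(["<div>figure 1</div>", "<div>figure 2</div>"])
print(page)
```

The resulting string can be written with a single `f.write(page)`. When several figures share one page, passing `include_plotlyjs=False` for every figure after the first is an option, since the library only needs to be loaded once.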