# word2manylanguages

This experiment will test the assumption that disparate languages should be modeled using the same parameters, as in van Paridon & Thompson (2021). We will build 60 words-by-dimensions models for each language by systematically varying each parameter, as described in the technical implementation below, and then test and rank the resulting models by their $R^2$ score. The best model for each language will be chosen, with best defined as the highest $R^2$ achieved with the smallest window size and the fewest dimensions.
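The selection rule can be sketched as a sort with tie-breakers. This is a minimal illustration, not the package's scoring code: the `ModelScore` type, the `pick_best` helper, and the example scores are all hypothetical, and the sketch assumes that "best" means the highest $R^2$, with ties going first to the smaller window and then to the fewer dimensions.

```python
from typing import NamedTuple


class ModelScore(NamedTuple):
    dimensions: int
    window: int
    algorithm: str
    r2: float


def pick_best(scores):
    """Return the entry with the highest R^2; ties go to the smaller
    window, then to the fewer dimensions (assumed tie-break order)."""
    return min(scores, key=lambda s: (-s.r2, s.window, s.dimensions))


# Made-up scores for illustration only.
example = [
    ModelScore(300, 5, "sg", 0.41),
    ModelScore(300, 2, "cbow", 0.41),
    ModelScore(100, 2, "sg", 0.41),
    ModelScore(500, 6, "sg", 0.39),
]
best = pick_best(example)
print(best)
```

Three of the made-up entries tie on $R^2$, so the sketch falls back to the window size and then the dimension count to pick one.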

![word2manylanguages Process Flow](word2manylanguages_process.png)

The fastText model (Bojanowski et al., 2016), as implemented in version 3.8.3 of the gensim Python package (Řehůřek & Sojka, 2010), will be used to generate the embeddings from the concatenated corpus files.

## Required Packages

- bz2
- html
- numpy as np
- os
- pandas as pd
- re
- requests
- simhash
- sklearn.linear_model
- sklearn.model_selection
- sklearn.preprocessing
- sklearn.utils
- zipfile
- lxml (`from lxml import etree`)
- gensim.models (`from gensim.models import FastText`)
- glob
- unicodedata
- unidecode (`from unidecode import unidecode`)


# Setup




In [1]:
import word2manylanguages as w
import os

# Set the base directory for all subsequent operations. This should already exist.
w.basedir = '../../data/processing_example'

# Create the necessary output subdirectories if they don't already exist.
for path in [w.datadir, w.processdir, w.corpusdir, w.modeldir, w.evaldir]:
    output = os.path.join(w.basedir, path)
    if not os.path.exists(output):
        os.makedirs(output)

# Set the list of languages to work on
# Full list:
# langs = ['af','ar','bg','bn','br','bs','ca','cs','da','de','el','en','eo','es','et',
#         'eu','fa','fi','fr','gl','he','hi','hr','hu','hy','id','is','it','ja','ka',
#         'kk','ko','lt','lv','mk','ml','ms','nl','no','pl','pt','ro','ru','si','sk',
#         'sl','sq','sr','sv','ta','te','th','tl','tr','tw','uk','ur','vi','zh']

langs = ['af']

# Download Raw Data

This download process is the one used at the time of the experiment. The Wikipedia download still works as shown, but the OpenSubtitles project has since changed its download procedures drastically, making the data harder to obtain. We will provide an alternative download that yields the same data the original process obtained.

Downloaded data will be stored in a subfolder of the base directory called **data**.

In [2]:
# For each language, download the Wikipedia and OpenSubtitles data.
for language in langs:
    w.download('wikipedia', language)
    w.download('subtitles', language)

Remote file http://dumps.wikimedia.your.org/afwiki/latest/afwiki-latest-pages-meta-current.xml.bz2, Local file wikipedia-af.bz2
File wikipedia-af.bz2 exists, and overwrite not specified. Skipping.
Remote file https://object.pouta.csc.fi/OPUS-OpenSubtitles/v2018/raw/af.zip, Local file subtitles-af.zip
File subtitles-af.zip exists, and overwrite not specified. Skipping.


# Clean the Raw Data

Perform data cleanup on each of the raw data files.

Cleaned data will be stored in a subfolder of the base directory called **preprocessed**.

In [3]:
# For each language, perform data cleanup
for language in langs:
    w.clean('wikipedia', language)
    w.clean('subtitles', language)

File wikipedia-af-pre.zip exists, and overwrite not specified. Skipping.
File subtitles-af-pre.zip exists, and overwrite not specified. Skipping.


# Deduplicate the Cleaned Data

This optional step was ultimately skipped in the experiment, for two reasons. First, the data sources used here are well curated, so the likelihood of document duplication is low. Second, keeping duplicated phrases serves our purposes better, because it accurately represents how common phrases such as "Thank you" are in spoken language.
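For readers who do want to prune duplicates, the idea of sentence-level deduplication can be sketched as follows. This is a simplified stand-in and not the behavior of `w.prune`: it removes only exact repeats after light normalization, whereas the `simhash` dependency listed above hints that the package may detect near-duplicates instead. The function name `dedupe_sentences` is hypothetical.

```python
import hashlib


def dedupe_sentences(sentences):
    """Yield each sentence once, keeping the first occurrence in order."""
    seen = set()
    for sentence in sentences:
        # Normalize case and whitespace so trivial variants collapse together.
        key = " ".join(sentence.lower().split())
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield sentence


lines = ["Thank you", "thank  you", "Goodbye", "Thank you"]
unique = list(dedupe_sentences(lines))
print(unique)
```

Applied to subtitles, a filter like this would reduce every repeated "Thank you" to a single line, which is exactly the frequency information the experiment chose to keep.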

In [4]:
# Perform sentence-level deduplication if desired
#for language in langs:
#    w.prune('wikipedia', language)
#    w.prune('subtitles', language)

# Create the Corpus

Concatenate the OpenSubtitles and Wikipedia data into a single corpus file, with one sentence per line.

Concatenated data will be stored in a subfolder of the base directory called **corpora**.
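The concatenation step can be pictured with the short sketch below. It is an illustration under stated assumptions rather than the code behind `w.concatenate_corpus`: it assumes the inputs are plain-text files with one sentence per line (the cleaned files in this pipeline are zip archives), and the helper name `concatenate_files` and the temporary file names are hypothetical.

```python
import os
import tempfile


def concatenate_files(sources, destination):
    """Write the non-empty lines of each source file to destination."""
    written = 0
    with open(destination, "w", encoding="utf-8") as out:
        for source in sources:
            with open(source, encoding="utf-8") as handle:
                for line in handle:
                    line = line.strip()
                    if line:  # skip blank lines so each row is one sentence
                        out.write(line + "\n")
                        written += 1
    return written


# Tiny demonstration with temporary files standing in for the cleaned data.
with tempfile.TemporaryDirectory() as tmp:
    subs = os.path.join(tmp, "subtitles.txt")
    wiki = os.path.join(tmp, "wikipedia.txt")
    corpus = os.path.join(tmp, "corpus.txt")
    with open(subs, "w", encoding="utf-8") as f:
        f.write("dankie\n\nhoe gaan dit\n")
    with open(wiki, "w", encoding="utf-8") as f:
        f.write("Afrikaans is 'n taal\n")
    n_lines = concatenate_files([subs, wiki], corpus)
    with open(corpus, encoding="utf-8") as f:
        corpus_lines = f.read().splitlines()

print(n_lines, corpus_lines)
```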

In [5]:
# For each language, create the concatenated corpus
for language in langs:
    w.concatenate_corpus(language)

File corpus-af.txt exists, and overwrite not specified. Skipping.


# Build the Words by Dimensions Models

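Using gensim, we create models with window sizes of 1, 2, 3, 4, 5, and 6 and dimensions of 50, 100, 200, 300, and 500. Each window-by-dimension combination is trained twice, once with the skip-gram (sg) algorithm and once with the continuous bag of words (cbow) algorithm, giving 6 × 5 × 2 = 60 models per language.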
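The parameter grid above can be enumerated in a few lines. The sketch below only lists the combinations and does not train anything; the file-name pattern is copied from the log output of the build step, while the helper `model_names` and the constant names are hypothetical.

```python
from itertools import product

WINDOWS = [1, 2, 3, 4, 5, 6]
DIMENSIONS = [50, 100, 200, 300, 500]
ALGORITHMS = ["cbow", "sg"]


def model_names(language):
    """List one output name per (dimensions, window, algorithm) combination."""
    return [
        f"{language}_{dim}_{win}_{alg}_wxd.csv"
        for dim, win, alg in product(DIMENSIONS, WINDOWS, ALGORITHMS)
    ]


names = model_names("af")
print(len(names), names[:2])
```

Counting the entries is a quick way to confirm that the grid yields 60 models per language.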


In [6]:
# For each language, build models
for language in langs:
    w.build_models(language)

File af_50_1_cbow_wxd.csv exists, and overwrite not specified. Skipping.
File af_50_1_sg_wxd.csv exists, and overwrite not specified. Skipping.
File af_50_2_cbow_wxd.csv exists, and overwrite not specified. Skipping.
File af_50_2_sg_wxd.csv exists, and overwrite not specified. Skipping.
File af_50_3_cbow_wxd.csv exists, and overwrite not specified. Skipping.
File af_50_3_sg_wxd.csv exists, and overwrite not specified. Skipping.
File af_50_4_cbow_wxd.csv exists, and overwrite not specified. Skipping.
File af_50_4_sg_wxd.csv exists, and overwrite not specified. Skipping.
File af_50_5_cbow_wxd.csv exists, and overwrite not specified. Skipping.
File af_50_5_sg_wxd.csv exists, and overwrite not specified. Skipping.
File af_50_6_cbow_wxd.csv exists, and overwrite not specified. Skipping.
File af_50_6_sg_wxd.csv exists, and overwrite not specified. Skipping.
File af_100_1_cbow_wxd.csv exists, and overwrite not specified. Skipping.
File af_100_1_sg_wxd.csv exists, and overwrite not specified. 

# Evaluate Norms with Replication Data Sets

The replication datasets can be downloaded here: https://github.com/jvparidon/subs2vec/tree/master/subs2vec/datasets/norms 

For this processing example, we created `af-fake-2025.tsv` to show how the prediction code runs. We used the same code as the previous manuscript to calculate our prediction effect sizes:

```    
model = sklearn.linear_model.Ridge(alpha=alpha)  # use ridge regression models
cv = sklearn.model_selection.RepeatedKFold(n_splits=5, n_repeats=10)
```
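To make the two lines above concrete, here is a minimal sketch of ridge regression scored with repeated k-fold cross-validation. It is not the evaluation code itself: the data are randomly generated stand-ins for word vectors and norm ratings, `alpha=1.0` and the `random_state` values are assumptions added for a reproducible demo, and the use of `cross_val_score` with `scoring="r2"` is one plausible way to obtain per-fold $R^2$ values.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import RepeatedKFold, cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-ins: 200 "words" with 10-dimensional vectors and a norm
# that is a noisy linear function of those vectors.
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.1, size=200)

model = Ridge(alpha=1.0)  # alpha=1.0 is an assumed value
cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=0)

# One R^2 value per held-out fold: 5 splits x 10 repeats = 50 scores.
scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
print(len(scores), round(float(np.mean(scores)), 3))
```

Repeating the 5-fold split 10 times gives 50 held-out scores per norm, which smooths out the luck of any single partition.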


In [7]:
for language in langs:
    w.loop_norms_vp(language)

Evaluating model af_500_1_cbow
predicting norms from af-fake-2025.tsv
missing vectors for 1189 out of 152736 words
Evaluating model af_500_1_sg
predicting norms from af-fake-2025.tsv
missing vectors for 1189 out of 152736 words
Evaluating model af_500_2_cbow
predicting norms from af-fake-2025.tsv
missing vectors for 1189 out of 152736 words
Evaluating model af_500_2_sg
predicting norms from af-fake-2025.tsv
missing vectors for 1189 out of 152736 words
Evaluating model af_500_3_cbow
predicting norms from af-fake-2025.tsv
missing vectors for 1189 out of 152736 words
Evaluating model af_500_3_sg
predicting norms from af-fake-2025.tsv
missing vectors for 1189 out of 152736 words
Evaluating model af_500_4_cbow
predicting norms from af-fake-2025.tsv
missing vectors for 1189 out of 152736 words
Evaluating model af_500_4_sg
predicting norms from af-fake-2025.tsv
missing vectors for 1189 out of 152736 words
Evaluating model af_500_5_cbow
predicting norms from af-fake-2025.tsv
missing vectors fo

In [8]:
# Specify which folder of evaluation results to combine
for language in langs:
    w.score_vp(language, "replication")

Loading eval ../../data/processing_example/evals/replication/af_100_3_cbow_eval.csv
Loading eval ../../data/processing_example/evals/replication/af_200_1_sg_eval.csv
Loading eval ../../data/processing_example/evals/replication/af_50_5_sg_eval.csv
Loading eval ../../data/processing_example/evals/replication/af_500_4_sg_eval.csv
Loading eval ../../data/processing_example/evals/replication/af_500_4_cbow_eval.csv
Loading eval ../../data/processing_example/evals/replication/af_50_6_cbow_eval.csv
Loading eval ../../data/processing_example/evals/replication/af_300_2_sg_eval.csv
Loading eval ../../data/processing_example/evals/replication/af_50_2_cbow_eval.csv
Loading eval ../../data/processing_example/evals/replication/af_100_5_sg_eval.csv
Loading eval ../../data/processing_example/evals/replication/af_50_5_cbow_eval.csv
Loading eval ../../data/processing_example/evals/replication/af_50_1_cbow_eval.csv
Loading eval ../../data/processing_example/evals/replication/af_200_4_sg_eval.csv
Loading e

# Evaluate Norms with Extended Data Sets

The extended datasets can be downloaded from https://github.com/SemanticPriming/semanticprimeR/releases/tag/v0.0.1. This release is periodically updated, and newly added datasets can be evaluated with the same prediction procedure shown below.

We provide a synthetic data example, `af-fake-2025.csv`, in this folder to demonstrate that the code is reproducible. We created `datasets.csv` in the datasets folder for this example; our original datasets file is available as `datasets_original.csv`.

In [9]:
for language in langs:
    w.evaluate_norms(language)

Loading model af_50_1_cbow
Evaluating af-fake-2025.csv
Considering columns ['word', 'valence_M', 'arousal_M', 'familiar_M']
missing vectors for 3 out of 114552 words
Loading model af_50_1_sg
Evaluating af-fake-2025.csv
Considering columns ['word', 'valence_M', 'arousal_M', 'familiar_M']
missing vectors for 3 out of 114552 words
Loading model af_50_2_cbow
Evaluating af-fake-2025.csv
Considering columns ['word', 'valence_M', 'arousal_M', 'familiar_M']
missing vectors for 3 out of 114552 words
Loading model af_50_2_sg
Evaluating af-fake-2025.csv
Considering columns ['word', 'valence_M', 'arousal_M', 'familiar_M']
missing vectors for 3 out of 114552 words
Loading model af_50_3_cbow
Evaluating af-fake-2025.csv
Considering columns ['word', 'valence_M', 'arousal_M', 'familiar_M']
missing vectors for 3 out of 114552 words
Loading model af_50_3_sg
Evaluating af-fake-2025.csv
Considering columns ['word', 'valence_M', 'arousal_M', 'familiar_M']
missing vectors for 3 out of 114552 words
Loading mo

In [10]:
# Specify which folder of evaluation results to combine
for language in langs:
    w.score_norms(language, "norms")

Loading eval af_eval.csv
    Language Dimensions Window Algorithm        Norm     Score
199       af         50      4      cbow   arousal_M -0.000259
19        af         50      4      cbow   arousal_M -0.000287
181       af         50      1      cbow   arousal_M -0.000306
31        af         50      6      cbow   arousal_M -0.000328
22        af         50      4        sg   arousal_M -0.000339
..       ...        ...    ...       ...         ...       ...
173       af        500      5        sg  familiar_M -0.005536
353       af        500      5        sg  familiar_M -0.005553
359       af        500      6        sg  familiar_M -0.005583
357       af        500      6        sg   valence_M -0.005818
177       af        500      6        sg   valence_M -0.005955

[360 rows x 6 columns]


# Evaluate Counts

In [11]:
for language in langs: 
    w.evaluate_counts(language)

Loading model af_50_1_cbow
Loading dedup.af.words.unigrams.tsv
Cleaning dedup.af.words.unigrams.tsv
Evaluating dedup.af.words.unigrams.tsv
vectors: 152735  freqs: 18348  matches: 20701
missing vectors for -2353 out of 18348 words
Loading dedup.afwiki-meta.words.unigrams.tsv
Cleaning dedup.afwiki-meta.words.unigrams.tsv
Evaluating dedup.afwiki-meta.words.unigrams.tsv
vectors: 152735  freqs: 472483  matches: 139317
missing vectors for 333166 out of 472483 words
Loading model af_50_1_sg
Loading dedup.af.words.unigrams.tsv
Cleaning dedup.af.words.unigrams.tsv
Evaluating dedup.af.words.unigrams.tsv
vectors: 152735  freqs: 18348  matches: 20701
missing vectors for -2353 out of 18348 words
Loading dedup.afwiki-meta.words.unigrams.tsv
Cleaning dedup.afwiki-meta.words.unigrams.tsv
Evaluating dedup.afwiki-meta.words.unigrams.tsv
vectors: 152735  freqs: 472483  matches: 139317
missing vectors for 333166 out of 472483 words
Loading model af_50_2_cbow
Loading dedup.af.words.unigrams.tsv
Cleaning de

In [12]:
# Specify which folder of evaluation results to combine
for language in langs:
    w.score_counts(language, "counts")

Loading eval ../../data/processing_example/evals/counts/af_eval.csv
✅ Saved sorted results to ../../data/processing_example/scores/counts/af_scores.csv
