Funnelling is a new ensemble method for heterogeneous transfer learning. The present Python implementation concerns the application of Funnelling to Polylingual Text Classification (PLC).
The two variants of Funnelling, Fun(KFCV) and Fun(TAT), are implemented by the FunnellingPolylingualClassifier class, and instantiated by setting folded_projections=k for Fun(KFCV) (with k>1 the number of folds) or folded_projections=1 for Fun(TAT).
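The difference between the two variants can be sketched with scikit-learn. This is a minimal illustration, not the FunnellingPolylingualClassifier API: the function names `fit_funnelling` and `predict_funnelling`, the synthetic data, the single-label multiclass setting, and the RBF-kernel meta-classifier are all assumptions made for the sake of the example.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC, LinearSVC


def fit_funnelling(X_by_lang, y_by_lang, folded_projections=1):
    """Illustrative two-tier fit (hypothetical helper, not the repo's class)."""
    first_tier, projections, labels = {}, [], []
    for lang, X in X_by_lang.items():
        y = y_by_lang[lang]
        clf = CalibratedClassifierCV(LinearSVC(random_state=0), cv=3)
        if folded_projections > 1:
            # Fun(KFCV): every training document is projected by a model
            # trained on the folds that do not contain it.
            z = cross_val_predict(clf, X, y, cv=folded_projections,
                                  method="predict_proba")
            clf.fit(X, y)  # final first-tier model, used at test time
        else:
            # Fun(TAT): train on all training documents, then project them.
            clf.fit(X, y)
            z = clf.predict_proba(X)
        first_tier[lang] = clf
        projections.append(z)
        labels.append(y)
    # The meta-classifier sees only language-independent posterior vectors.
    meta = SVC(kernel="rbf").fit(np.vstack(projections), np.concatenate(labels))
    return first_tier, meta


def predict_funnelling(first_tier, meta, X, lang):
    """Project with the language-specific model, then classify with the meta-model."""
    return meta.predict(first_tier[lang].predict_proba(X))


# Synthetic stand-in for two languages with different feature spaces.
X_by_lang, y_by_lang = {}, {}
for seed, (lang, n_features) in enumerate([("en", 30), ("it", 20)]):
    X, y = make_classification(n_samples=120, n_features=n_features,
                               n_informative=8, n_classes=3, random_state=seed)
    X_by_lang[lang], y_by_lang[lang] = X, y

tat_tier, tat_meta = fit_funnelling(X_by_lang, y_by_lang, folded_projections=1)
kfcv_tier, kfcv_meta = fit_funnelling(X_by_lang, y_by_lang, folded_projections=5)
pred_tat = predict_funnelling(tat_tier, tat_meta, X_by_lang["it"], "it")
pred_kfcv = predict_funnelling(kfcv_tier, kfcv_meta, X_by_lang["it"], "it")
```

Both calls produce one first-tier classifier per language and a single meta-classifier; only the way the training documents are projected differs.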
This code has been used to produce all experimental results reported in the article "Esuli, A., Moreo, A., & Sebastiani, F. (2019). Funnelling: A New Ensemble Method for Heterogeneous Transfer Learning and Its Application to Cross-Lingual Text Classification. ACM Transactions on Information Systems (TOIS), 37(3), 37.".
This package also contains the code implementing all baselines involved in the experimental evaluation. All baselines are implemented in the learners.py script. The list of baselines includes: Naive, Lightweight Random Indexing (LRI), Cross-Lingual Explicit Semantic Analysis (CLESA), Kernel Canonical Correlation Analysis (KCCA), Distributional Correspondence Indexing (DCI), Poly-lingual Embeddings (MLE and MLE-LSTM, referred to as PLE and PLE-LSTM elsewhere in this document and in the script options), and UpperBound. Among those, CLESA, KCCA, MLE, and MLE-LSTM require the following additional resources:
- CLESA: the class CLESAPolylingualClassifier requires a processed version of a Wikipedia dump; see section Datasets for more information.
- KCCA: the class KCCAPolylingualClassifier also requires a processed version of Wikipedia. KCCA is built on top of a wrapper of pyrcca, from the article Regularized kernel canonical correlation analysis in Python. If you intend to run KCCA, you should first fork the aforementioned project and make it accessible at the root of this project.
- MLE: the class PolylingualEmbeddingsClassifier uses the multilingual embeddings from the article Word Translation without Parallel Data which can be downloaded from the MUSE repo.
- MLE-LSTM: implemented in LSTMclassifierKeras.py; it requires:
  - The availability of the polylingual embeddings (as in MLE).
  - A Keras installation.
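The MUSE embeddings are distributed as text files in the word2vec/fastText layout: a header line with the vocabulary size and the dimensionality, followed by one word and its vector per line. A minimal reader for that layout could look as follows; `load_vec` is a hypothetical helper written for this illustration, not the loader used by PolylingualEmbeddingsClassifier, and the two-word sample is made up.

```python
import io

import numpy as np


def load_vec(stream, max_words=None):
    """Read text embeddings: a 'count dim' header, then 'word v1 ... v_dim' lines."""
    _, dim = map(int, stream.readline().split())
    words, vectors = [], []
    for line in stream:
        word, *values = line.rstrip("\n").split(" ")
        if len(values) != dim:
            continue  # skip malformed lines instead of failing
        words.append(word)
        vectors.append([float(v) for v in values])
        if max_words is not None and len(words) >= max_words:
            break
    return words, np.asarray(vectors, dtype=np.float32)


# Tiny in-memory sample standing in for a downloaded .vec file.
sample = io.StringIO("2 3\nhello 0.1 0.2 0.3\nciao 0.4 0.5 0.6\n")
words, matrix = load_vec(sample)
print(words, matrix.shape)
```

With a real file, the stream would be opened with `open(path, encoding="utf-8")`, and `max_words` can be used to cap memory usage.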
The datasets we used to run our experiments include:
- RCV1/RCV2: a comparable corpus of Reuters news stories
- JRC-Acquis: a parallel corpus of legislative texts of the European Union
The datasets need to be built before running any experiment. This process requires downloading, parsing, preprocessing, splitting, and vectorizing. The datasets we generated and used in our experiments can be directly downloaded (in vector form) from here. Note that some methods (e.g., the PLE and PLE-LSTM methods) might require the original documents in raw form, which we are not allowed to distribute. The tools we used in order to build the datasets are also available in this repo, and are explained below (feel free to skip reading if you are ok with the pre-built version).
The dataset generation relies on NLTK for text preprocessing. Make sure you have NLTK installed and you have downloaded the packages needed for enabling stopword removal and stemming (via SnowballStemmer) before building the datasets.
A multilingual dump of Wikipedia is required during the generation of the datasets for the CLESA and KCCA baselines (see section Baselines). If you are not interested in running CLESA or KCCA, you can skip this requirement by setting max_wiki=0 before running the script. Otherwise, you will need to go through the documentation, which contains tools and explanations on how to prepare the Wikipedia dump (external tools may be required).
We adapted the Wikipedia_Extractor to extract a comparable set of documents for all of the 11 languages involved in our experiments. Technical details and ad-hoc tools can be found in wikipedia_tools.py (in this repo). The toolkit allows:
- Simplifying the (huge) json dump file.
- Processing the json file as a stream and filtering out documents that do not satisfy certain conditions (e.g., documents that do not have a view for all of the specified languages).
- Extracting clean versions of documents (see the Wikipedia_Extractor for more information).
- Creating multilingual maps of comparable documents, and pickling them for faster usage.
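The streaming and filtering steps listed above can be sketched with the standard library alone. The sketch assumes a Wikidata-style dump, i.e., a JSON array with one record per line, each carrying an `id` and a `sitelinks` mapping keyed by names such as `enwiki`; these field names and the `stream_filter` helper are assumptions of this example and are not taken from wikipedia_tools.py.

```python
import json
import pickle


def stream_filter(lines, langs):
    """Yield (id, {lang: title}) for records that have a page in every language."""
    wanted = {f"{lang}wiki" for lang in langs}
    for line in lines:
        line = line.strip().rstrip(",")
        if line in ("", "[", "]"):
            continue  # array delimiters and blank lines carry no record
        record = json.loads(line)
        sitelinks = record.get("sitelinks", {})
        if wanted.issubset(sitelinks):
            yield record["id"], {lang: sitelinks[f"{lang}wiki"]["title"]
                                 for lang in langs}


# Three made-up records; only the first has both an English and an Italian page.
dump = [
    "[",
    '{"id": "Q1", "sitelinks": {"enwiki": {"title": "Universe"}, "itwiki": {"title": "Universo"}}},',
    '{"id": "Q2", "sitelinks": {"enwiki": {"title": "Earth"}}},',
    '{"id": "Q3"}',
    "]",
]
multilingual_map = dict(stream_filter(dump, ["en", "it"]))

# Pickle the map so that later runs can skip the (slow) pass over the dump.
restored = pickle.loads(pickle.dumps(multilingual_map))
print(restored)
```

Because the records are consumed one line at a time, the full dump never has to fit in memory; with a real file, `lines` would be the open file object.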
The dataset splits are built once and for all using the dataset_builder.py script, and then pickled for fast subsequent runs. JRC-Acquis is automatically downloaded the first time. RCV1/RCV2, despite being public, cannot be downloaded without formal permission. Please refer to RCV1's site and RCV2's site before proceeding.
Once the corpora are locally available, this script preprocesses the documents and vectorizes them. 10 random splits are generated for experimental purposes. The list of ids we ended up using is accessible (in pickle format) here.
Reproducing the Experiments
Most of the experiments were run using the script polylingual_classification.py. This script can be run with different command line arguments to reproduce all multilabel experiments (with the exception of PLE-LSTM, see below).
Run it with -h or --help to show this help.
    Usage: polylingual_classification.py [options]

    Options:
      -h, --help            show this help message and exit
      -d DATASET, --dataset=DATASET
                            Path to the multilingual dataset processed and
                            stored in .pickle format
      -m MODE, --mode=MODE  Model code of the polylingual classifier, valid ones
                            include ['fun-kfcv', 'fun-tat', 'naive', 'lri',
                            'clesa', 'kcca', 'dci', 'ple', 'upper', 'fun-mono']
      -o OUTPUT, --output=OUTPUT
                            Result file
      -n NOTE, --note=NOTE  A description note to be added to the result file
      -c, --optimc          Optimice hyperparameters
      -b BINARY, --binary=BINARY
                            Run experiments on a single category specified with
                            this parameter
      -L LANG_ABLATION, --lang_ablation=LANG_ABLATION
                            Removes the language from the training
      -f, --force           Run even if the result was already computed
      -j N_JOBS, --n_jobs=N_JOBS
                            Number of parallel jobs (default is -1, all)
      -s SET_C, --set_c=SET_C
                            Set the C parameter
      -r KCCAREG, --kccareg=KCCAREG
                            Set the regularization parameter for KCCA
      -w WE_PATH, --we-path=WE_PATH
                            Path to the polylingual word embeddings (required
                            only if --mode polyembeddings)
      -W WIKI, --wiki=WIKI  Path to Wikipedia raw documents
      --calmode=CALMODE     Calibration mode for the base classifiers (only for
                            class-based models). Valid ones are 'cal' (default,
                            calibrates the base classifiers and use
                            predict_proba to project), 'nocal' (does not
                            calibrate, use the decision_function to project)
                            'sigmoid' (does not calibrate, use the sigmoid of
                            the decision function to project)
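The three values of --calmode correspond to three ways of turning the output of a base classifier into a projection. The sketch below reproduces the idea with scikit-learn on synthetic data; the `project` function is a hypothetical helper written for this illustration and is not part of polylingual_classification.py.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC


def project(X_train, y_train, X, calmode="cal"):
    """Project X into a per-class score space according to the calibration mode."""
    if calmode == "cal":
        # Calibrate the base classifier and use posterior probabilities.
        clf = CalibratedClassifierCV(LinearSVC(random_state=0), cv=3)
        return clf.fit(X_train, y_train).predict_proba(X)
    clf = LinearSVC(random_state=0).fit(X_train, y_train)
    scores = clf.decision_function(X)
    if calmode == "nocal":
        return scores  # raw signed distances to the hyperplanes
    if calmode == "sigmoid":
        return 1.0 / (1.0 + np.exp(-scores))  # squashed into (0, 1), uncalibrated
    raise ValueError(f"unknown calmode: {calmode!r}")


X, y = make_classification(n_samples=150, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)
projections = {mode: project(X, y, X, mode) for mode in ("cal", "nocal", "sigmoid")}
for mode, z in projections.items():
    print(mode, z.shape, float(z.min()), float(z.max()))
```

Only the 'cal' projection is a proper probability distribution over the classes; the 'sigmoid' projection is bounded but its rows do not necessarily sum to one.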
For example, the following command will produce the results for Fun(TAT) on the first random split of the RCV1/RCV2 dataset optimizing the C parameter of the first-tier SVM classifiers.
$> python polylingual_classification.py -d "../Datasets/RCV2/rcv1-2_nltk_trByLang1000_teByLang1000_processed_run0.pickle" -o ./results.csv --mode fun-tat --optimc
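To repeat the same experiment on every random split, the command can be generated programmatically. The sketch below only builds and prints the command lines; it assumes that the remaining splits follow the same naming pattern as the first one (run1 to run9), which is an assumption of this example, and `build_commands` is a hypothetical helper.

```python
import shlex

# Assumed naming pattern, extrapolated from the run0 file used above.
DATASET_TEMPLATE = ("../Datasets/RCV2/rcv1-2_nltk_trByLang1000_"
                    "teByLang1000_processed_run{run}.pickle")


def build_commands(mode="fun-tat", n_runs=10, output="./results.csv"):
    """Return one argument list per random split."""
    return [
        ["python", "polylingual_classification.py",
         "-d", DATASET_TEMPLATE.format(run=run),
         "-o", output, "--mode", mode, "--optimc"]
        for run in range(n_runs)
    ]


commands = build_commands()
for cmd in commands:
    print(shlex.join(cmd))
```

Each argument list can be passed to `subprocess.run(cmd, check=True)` to actually launch the experiment.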
Once the experiment is over, some results will be displayed in the standard output:
    evaluation (n_jobs=-1)
    Lang nl: macro-F1=0.540 micro-F1=0.829
    Lang es: macro-F1=0.582 micro-F1=0.843
    Lang fr: macro-F1=0.499 micro-F1=0.765
    Lang en: macro-F1=0.528 micro-F1=0.764
    Lang sv: macro-F1=0.540 micro-F1=0.775
    Lang it: macro-F1=0.511 micro-F1=0.789
    Lang da: macro-F1=0.490 micro-F1=0.797
    Lang pt: macro-F1=0.706 micro-F1=0.879
    Lang de: macro-F1=0.416 micro-F1=0.741
    Averages: MF1, mF1, MK, mK [0.53464632 0.79803785 0.5088316 0.75633335]
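For readers unfamiliar with the two F1 averages, the following sketch shows how macro-F1 and micro-F1 are obtained from multilabel predictions. It is an illustration and not the evaluation code of this repo; in particular, returning 1 when a class has no true positives, false positives, or false negatives is a convention assumed here.

```python
import numpy as np


def f1(tp, fp, fn):
    """F1 from contingency counts; defined as 1 when all counts are zero (assumed convention)."""
    denominator = 2 * tp + fp + fn
    return 1.0 if denominator == 0 else float(2 * tp / denominator)


def macro_micro_f1(y_true, y_pred):
    """y_true and y_pred are (n_documents, n_classes) binary indicator matrices."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = (y_true & y_pred).sum(axis=0)
    fp = (~y_true & y_pred).sum(axis=0)
    fn = (y_true & ~y_pred).sum(axis=0)
    # Macro: average the per-class F1 values (every class weighs the same).
    macro = float(np.mean([f1(*counts) for counts in zip(tp, fp, fn)]))
    # Micro: pool the counts over all classes, then compute a single F1.
    micro = f1(tp.sum(), fp.sum(), fn.sum())
    return macro, micro


y_true = [[1, 0], [0, 1], [1, 1]]
y_pred = [[1, 0], [0, 0], [1, 1]]
macro, micro = macro_micro_f1(y_true, y_pred)
print(f"macro-F1={macro:.3f} micro-F1={micro:.3f}")  # macro-F1=0.833 micro-F1=0.857
```

Macro-F1 is sensitive to performance on rare classes, while micro-F1 is dominated by the frequent ones, which is why the two values can differ substantially for the same language.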
The complete record of the experiment is saved in the result file, which can be consulted with Pandas. For example, the following snippet will display the results for all languages:
    import pandas as pd
    results = pd.read_csv('results.csv', sep='\t')
    pd.pivot_table(results, index = ['method', 'lang'], values=['microf1','macrof1','microk','macrok'])

    Out:
                   macrof1    macrok   microf1    microk
    method  lang
    fun-tat da    0.490002  0.455626  0.796524  0.742877
            de    0.415858  0.394820  0.741391  0.698547
            en    0.528280  0.488883  0.764349  0.716628
            es    0.581849  0.577447  0.842697  0.823296
            fr    0.499307  0.477912  0.764876  0.704686
            it    0.510546  0.471447  0.788944  0.751368
            nl    0.540213  0.510137  0.828782  0.789040
            pt    0.705507  0.698201  0.879412  0.850810
            sv    0.540255  0.505011  0.775367  0.729749
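The same result file can be summarized per method by averaging over languages. The sketch below uses a small in-memory frame with the column names shown above and three of the rows from the table; with the real file, `results` would instead come from `pd.read_csv`.

```python
import pandas as pd

# Three rows copied from the table above, standing in for results.csv.
results = pd.DataFrame({
    "method": ["fun-tat", "fun-tat", "fun-tat"],
    "lang": ["da", "de", "en"],
    "macrof1": [0.490002, 0.415858, 0.528280],
    "microf1": [0.796524, 0.741391, 0.764349],
})

# One row per method, averaged across the languages present in the frame.
averages = results.groupby("method")[["macrof1", "microf1"]].mean()
print(averages.round(3))
```

When several methods have been logged to the same result file, this yields one row per method, which makes side-by-side comparison straightforward.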
The code to run PLE-LSTM is implemented in LSTMclassifierKeras.py. Note that you need the raw version of the documents to run it (see the Datasets section).
Other scripts used include: