JuliaLiu1997/CISC867

Original Paper: Word class flexibility: A deep contextualized approach

This repository is the official implementation of "Word class flexibility: A deep contextualized approach" (https://arxiv.org/abs/2009.09241).

Requirements

To install requirements:

pip install -r requirements.txt
git clone https://github.com/attardi/wikiextractor
pip install wikiextractor

Probing test of contextualized model (Optional)

cd reproduction
python run_mturk_correlations.py

Datasets

  1. Download the UD treebanks:
wget https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-3105/ud-treebanks-v2.5.tgz
mkdir -p data/ud_all
tar xf ud-treebanks-v2.5.tgz --directory data/ud_all
  2. Download BNC Baby:
mkdir -p data/bnc
cd data/bnc
wget https://ota.bodleian.ox.ac.uk/repository/xmlui/bitstream/handle/20.500.12024/2553/2553.zip
unzip 2553.zip
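
Before moving on, it can help to confirm that the archives unpacked where the later commands expect them. The following is a minimal sketch, not part of the repository: the helper name `missing_dirs` is hypothetical, and the two paths are copied from the `--ud_dir` and `--bnc_dir` arguments used further down in this README.

```python
from pathlib import Path

# Paths taken from the commands in this README; adjust if your layout differs.
EXPECTED_DIRS = [
    "data/ud_all/ud-treebanks-v2.5",
    "data/bnc/download/Texts",
]


def missing_dirs(repo_root, expected=EXPECTED_DIRS):
    """Return the expected dataset directories that do not exist under repo_root."""
    root = Path(repo_root)
    return [rel for rel in expected if not (root / rel).is_dir()]


if __name__ == "__main__":
    # Run from the repository root, not from data/bnc.
    missing = missing_dirs(".")
    if missing:
        print("Missing dataset directories:", ", ".join(missing))
    else:
        print("All dataset directories found.")
```

An empty list means both directories are in place.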

Download and extract English (and other languages) Wikipedia

To download and extract English Wikipedia, run the following from data/bnc, where the previous step leaves you:

mkdir ../wiki
cd ../wiki
wget https://dumps.wikimedia.org/enwiki/20210220/enwiki-20210220-pages-articles-multistream1.xml-p1p41242.bz2
bzip2 -d enwiki*
python -m wikiextractor.WikiExtractor enwiki* -o en
cd ../..

Wikipedia dumps for other languages are also available at https://dumps.wikimedia.org/
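
To spot-check the extraction, the sketch below walks the extractor's output directory and yields one (title, text) pair per article. It is not taken from this repository. It assumes WikiExtractor's default (non-JSON) output, where files named `wiki_NN` hold articles wrapped in `<doc id=... url=... title=...>` and `</doc>` lines; the function name `iter_docs` is hypothetical.

```python
import re
from pathlib import Path

# Assumed header line of WikiExtractor's default (non-JSON) output, e.g.
#   <doc id="12" url="..." title="Anarchism">
DOC_OPEN = re.compile(r'<doc\b[^>]*\btitle="([^"]*)"[^>]*>')


def iter_docs(extract_dir):
    """Yield (title, text) pairs from every wiki_* file under extract_dir."""
    for path in sorted(Path(extract_dir).rglob("wiki_*")):
        if not path.is_file():
            continue
        title, lines = None, []
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                match = DOC_OPEN.match(line)
                if match:
                    # A new article starts: remember its title, reset the buffer.
                    title, lines = match.group(1), []
                elif line.startswith("</doc>"):
                    if title is not None:
                        yield title, "".join(lines).strip()
                    title = None
                elif title is not None:
                    lines.append(line)
```

For example, `sum(1 for _ in iter_docs("data/wiki/en"))` counts the extracted articles when run from the repository root.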

Preprocess Wikipedia and BNC baby

To process English Wikipedia, run:

python reproduction/process_wikipedia.py \
    --wiki_dir=data/wiki/ \
    --ud_dir=data/ud_all/ud-treebanks-v2.5/ \
    --dest_dir=data/wiki/ \
    --lang=en \
    --model=stanza \
    --tokens 10000000

To process BNC baby, run:

python reproduction/process_bnc.py \
    --bnc_dir=data/bnc/download/Texts \
    --to=data/bnc/bnc.pkl
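
A quick way to confirm that preprocessing wrote a usable file is to load it and look at the top-level object. This README does not document the pickle's internal structure, so the sketch below only reports the object's type and length; `describe_pickle` is a hypothetical helper, not a function from this repository. If the file stores instances of classes defined in the repository, run it from a location where those modules are importable, and only unpickle files you created yourself, since loading a pickle can execute arbitrary code.

```python
import pickle


def describe_pickle(path):
    """Load a pickle file and return (type name, length or None)."""
    with open(path, "rb") as handle:
        obj = pickle.load(handle)
    # Not every object has a length, so fall back to None.
    size = len(obj) if hasattr(obj, "__len__") else None
    return type(obj).__name__, size
```

For example, `describe_pickle("data/bnc/bnc.pkl")` shows whether the file loads at all and how many top-level items it holds.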

Run semantic metrics

To run semantic metrics:

mkdir results
python reproduction/model_contextual.py \
      --pkl_dir data/wiki/processed_udpipe \
      --pkl_file en.pkl \
      --results_dir results/en/ \
      --model bert-base-multilingual-cased

📋 model name options: "bert-base-multilingual-cased", "bert-base-uncased", "xlm-roberta-base", "elmo"
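
A mistyped model name may only surface after the data has been loaded, so it can be worth validating the arguments up front. The following is a minimal sketch of a wrapper parser and is not the argument handling in model_contextual.py itself. The flag names and the four model names are copied from this README; `build_parser` and the decision to make every flag required are assumptions.

```python
import argparse

# The four model names listed in this README.
MODEL_CHOICES = [
    "bert-base-multilingual-cased",
    "bert-base-uncased",
    "xlm-roberta-base",
    "elmo",
]


def build_parser():
    """Build a parser mirroring the flags shown in this README (illustrative only)."""
    parser = argparse.ArgumentParser(
        description="Hypothetical pre-flight check for model_contextual.py arguments"
    )
    parser.add_argument("--pkl_dir", required=True)
    parser.add_argument("--pkl_file", required=True)
    parser.add_argument("--results_dir", required=True)
    # argparse rejects any value outside MODEL_CHOICES with a usage error.
    parser.add_argument("--model", required=True, choices=MODEL_CHOICES)
    return parser
```

Calling `build_parser().parse_args()` on the same arguments exits with a usage message if `--model` is not one of the four listed names.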

Results

The two main results are semantic metrics for 18 languages and semantic metrics for 4 models.
