X2Static Training

Cahid Arda edited this page Mar 14, 2024 · 2 revisions

The last time we used our university's HPC cluster, we had to run our scripts inside Docker containers.

This meant that our script had to work end to end: in a single job it downloads the data, clones the repositories, trains the model, and runs the evaluation scripts.
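Because every stage runs unattended inside one job, it helps if the script stops at the first failing command and labels each stage in the log. The following is a minimal sketch of that idea, not part of the script we actually ran; `run_step` is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical skeleton; run_step is an illustrative helper, not from the original script.
set -eu  # abort on the first failing command or on an unset variable

# Print a banner, run the given command, then report the elapsed time.
run_step() {
    local step_name="$1"
    shift
    local step_start
    step_start=$(date +%s)
    echo "=== ${step_name} ==="
    "$@"
    echo "=== ${step_name} done in $(( $(date +%s) - step_start ))s ==="
}

# Stand-in for a real stage such as downloading a corpus or training the model.
run_step "demo" echo "hello from the demo step"
```

Each real stage (download, clone, train, evaluate) could be wrapped the same way, so a failed download ends the job immediately instead of letting training start on missing data.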

The following is the script we ended up using:

#!/bin/bash
#SBATCH --container-image ghcr.io#bouncmpe/cuda-python3
#SBATCH --gpus=2
#SBATCH --cpus-per-gpu=16
#SBATCH --mem=100G
#SBATCH --time=90:00:00

source /opt/python3/venv/base/bin/activate

echo "bounweb corpus"
pip install wget
python3 -c "import wget; wget.download('https://tulap.cmpe.boun.edu.tr/server/api/core/bitstreams/150f2e37-1dd3-4229-a37e-111f8a365edf/content')"
python3 -c "import zipfile; zipfile.ZipFile('bounwebcorpus.txt.zip', 'r').extractall()"
rm bounwebcorpus.txt.zip

echo "huwaei"
pip install gdown
python3 -m gdown 1PTytZ7yGIl9QvxRxCsfWLlHycU4z1_Vp

echo "vocab"
python -m gdown 1Sjfh9c7gMa6lvsjprcnbtVyKYEJ8r2fH

echo "comb"
python3 -c "open('combined.txt', 'w', encoding='utf-8').writelines([line for file in ['bounwebcorpus.txt', 'turkish-texts-tokenized.txt'] for line in open(file, 'r', encoding='utf-8')])"

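# Note: this replaces combined.txt with the BounWebCorpus alone, discarding the merged file written in the "comb" step.
mv bounwebcorpus.txt combined.txt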

export REPO=Word-Embeddings-Repository-for-Turkish

git clone https://github.com/epfml/X2Static.git

pip install tqdm torch transformers nltk gensim "tensorflow-gpu==2.8.0" scikit-learn torchmetrics matplotlib

python X2Static/src/make_vocab_dataset.py --dataset_location combined.txt --location_save_vocab_dataset processed_data/

python X2Static/src/make_vocab_dataset.py --dataset_location combined.txt --min_count 10 --max_vocab_size 750000 --location_save_vocab_dataset processed_data/

ls -l processed_data

python X2Static/src/learn_from_bert_ver2_paragraph.py --pretrained_bert_model "dbmdz/bert-base-turkish-128k-cased" --location_dataset processed_data/ --model_folder x2static_model --gpu_id 0 --num_epochs 1 --lr 0.001 --algo SparseAdam --t 5e-6 --word_emb_size 768 --num_negatives 10

python X2Static/src/learn_from_bert_ver2_paragraph.py --gpu_id 0 --num_epochs 1 --lr 0.001 --algo SparseAdam --t 5e-6 --word_emb_size 768 --location_dataset processed_data/ --model_folder model/ --num_negatives 10 --pretrained_bert_model dbmdz/bert-base-turkish-128k-cased

ls -l

echo "model"
ls -l /model
mv /model/vectors_final.txt /bert-decontextualized-static.wv

pip install "protobuf==3.20.*"

# Loop through each script in the array

git clone https://github.com/Turkish-Word-Embeddings/Word-Embeddings-Repository-for-Turkish.git

echo "NER"
cd /$REPO/NLP/NER
sed -i "s@FOLDER = \"C:/Users/karab/Desktop/Models\"@FOLDER = '/'@g" ner.py
sed -i 's/"no_header": False/"no_header": True/g' ner.py
python3 ner.py -w dc_bert

echo "PoS"
cd /$REPO/NLP/PoS
sed -i "s@FOLDER = \"C:/Users/karab/Desktop/Models\"@FOLDER = '/'@g" pos.py
sed -i 's/"no_header": False/"no_header": True/g' pos.py
python3 pos.py -w dc_bert

echo "SENTIMENT"
cd "/$REPO/NLP/Sentiment Analysis/"
sed -i 's@"C:/Users/karab/Desktop/Models"@"/"@g' sentiment.py
sed -i 's/"no_header": False/"no_header": True/g' sentiment.py
python3 sentiment.py -w dc_bert -d 1 -s 7
python3 sentiment.py -w dc_bert -d 1 -s 24
python3 sentiment.py -w dc_bert -d 1 -s 30

python3 sentiment.py -w dc_bert -d 2 -s 7
python3 sentiment.py -w dc_bert -d 2 -s 24
python3 sentiment.py -w dc_bert -d 2 -s 30

python3 sentiment.py -w dc_bert -d 3 -hs 196 -s 7
python3 sentiment.py -w dc_bert -d 3 -hs 196 -s 24
python3 sentiment.py -w dc_bert -d 3 -hs 196 -s 30
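The nine sentiment runs above differ only in the dataset index, the seed, and the extra `-hs 196` flag for dataset 3, so they could also be written as a nested loop. This is a sketch under that assumption, not the script we ran; `sentiment_args` and `DRY_RUN` are hypothetical names introduced here for illustration.

```shell
#!/bin/bash
set -eu

# Build the argument string for one sentiment run.
# Dataset 3 additionally gets "-hs 196", mirroring the commands above.
sentiment_args() {
    local dataset="$1"
    local seed="$2"
    if [ "$dataset" -eq 3 ]; then
        printf '%s\n' "-w dc_bert -d $dataset -hs 196 -s $seed"
    else
        printf '%s\n' "-w dc_bert -d $dataset -s $seed"
    fi
}

# DRY_RUN is a hypothetical switch: 1 (the default here) only prints the
# commands, 0 actually runs them from the "Sentiment Analysis" directory.
DRY_RUN="${DRY_RUN:-1}"

for dataset in 1 2 3; do
    for seed in 7 24 30; do
        # Word splitting of the helper output is intentional:
        # none of the arguments contain spaces.
        if [ "$DRY_RUN" -eq 1 ]; then
            echo python3 sentiment.py $(sentiment_args "$dataset" "$seed")
        else
            python3 sentiment.py $(sentiment_args "$dataset" "$seed")
        fi
    done
done
```

With the default `DRY_RUN=1` the loop prints the same nine commands, in the same order, as the explicit list.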
