Please check the `dev` branch for the latest version.
- (Ignore this step) Check the CUDA version on GCP:
  ```shell
  sudo nvidia-smi
  ```
- There is no need to configure a virtual environment on this machine. Just use the base environment and install the dependencies:
  ```shell
  pip install -r requirements.txt
  ```
- Download `model/base_model.pt` from Google Drive into your `model/` directory.
- Download `data/gweb_sancl/` from Google Drive into your `data/` directory.
Make sure the following directory layout exists in the project root before running the scripts:
```
.
├── Analysis_int_res.ipynb
├── Analysis_output_Online_fixed_self_learning.ipynb
├── Analysis_output_Online_nonfixed_self_learning.ipynb
├── LICENSE
├── Online_fixed_self_learning_v5.ipynb
├── Online_nonfixed_self_learning_v5.ipynb
├── Online_token_self_learning_v5.ipynb
├── README.md
├── Scratch_fixed_self_learning_v5.ipynb
├── Scratch_nonfixed_self_learning_v5.ipynb
├── Scratch_token_self_learning_v5.ipynb
├── analysis.py
├── build_model.py
├── create_pseudo_data.py
├── create_pseudo_data_by_tokens.py
├── data
│   └── gweb_sancl
│       ├── pos_fine
│       │   ├── answers
│       │   ├── emails
│       │   ├── newsgroups
│       │   ├── reviews
│       │   ├── weblogs
│       │   └── wsj
│       └── unlabeled
│           └── gweb-answers.unlabeled.txt
├── docs
├── intermediate_result
├── metrics
├── model
├── plots_tags
├── requirements.txt
├── result
├── scripts
├── setup.sh
└── utils.py
```
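To sanity-check the layout before running anything, a small script like the following can help. This is a minimal sketch: the `REQUIRED` list below is an illustrative subset of the tree above, not an exhaustive or authoritative list of what the notebooks need.

```python
from pathlib import Path

# Illustrative subset of the paths the scripts expect (adjust as needed).
REQUIRED = [
    "model",
    "data/gweb_sancl/pos_fine",
    "data/gweb_sancl/unlabeled/gweb-answers.unlabeled.txt",
    "requirements.txt",
]

def missing_paths(root=".", required=REQUIRED):
    """Return the required paths that do not exist under root."""
    root = Path(root)
    return [p for p in required if not (root / p).exists()]

if __name__ == "__main__":
    missing = missing_paths()
    if missing:
        raise SystemExit(f"Missing paths: {missing}")
    print("Layout looks complete.")
```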
A brief description of each directory:
- metrics: stores the metrics (precision, recall, and F1) recorded at each self-training loop
- plots_tags: stores plots of the metrics under different parameter settings
- model: stores saved model settings to save time
- data: stores the data used by the experiments
- docs: stores the meeting records for the project
- pickles: stores the serialized Python objects produced after self-training, for future use
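The pickled objects mentioned above can be written and reloaded with small helpers like these. This is a minimal sketch: the file name and the shape of the saved object are hypothetical, not the repository's actual serialization format.

```python
import pickle
from pathlib import Path

def save_state(obj, path):
    """Serialize a self-training result (e.g. a dict of metrics) to disk."""
    Path(path).parent.mkdir(parents=True, exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(obj, f)

def load_state(path):
    """Reload a previously saved self-training result."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

For example, `save_state({"loop": 3, "f1": 0.91}, "pickles/state.pkl")` followed by `load_state("pickles/state.pkl")` round-trips the object.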