AugusteLef/CIL-Project-Text-Classification

The use of microblogging and text messaging as media of communication has greatly increased over the past 10 years. Such large volumes of data amplify the need for automatic methods to understand the opinion conveyed in a text.

File Structure

Some scripts may assume the following file structure (you might have to create missing directories; see the snippet after this list):

  • Data : Directory containing all training, test, and preprocessed data.
  • Predictions : Directory containing all predictions for the test set.
  • Preprocessing_Data : Directory containing wordlists and other data used for preprocessing.
  • LectureBaselines : Directory containing implementations of the baselines from the exercises. A separate README with instructions on how to reproduce the baselines can be found there.
  • Output_pp : Directory for all output (logging) files related to preprocessing.
  • Output_predicting : Directory for all output (logging) files related to inference on the test set.
  • Output_huggingface : Directory for all output (logging) files related to training the huggingface models.
  • Output_ensembles : Directory for all output (logging) files related to training the ensembles.
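If any of these directories are missing after cloning, they can be created in one go. A minimal sketch, assuming you run it from the project root (mkdir -p is a no-op for directories that already exist):

mkdir -p Data Predictions Preprocessing_Data LectureBaselines Output_pp Output_predicting Output_huggingface Output_ensembles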

The following Python scripts are located in the main project folder:

  • preprocessing.py : Used for preprocessing datasets with different preprocessing methods.
  • training_*.py : Training scripts for the huggingface models and our ensemble models.
  • inference_*.py : Scripts used to create predictions for the test set.
  • utils_*.py : Utility code snippets shared by the scripts above.

All our experiments can be reproduced by running the following files with 'source' on the Leonhard cluster (more details under 'Running Our Experiments' below):

  • preprocess_all : Schedules all preprocessing jobs needed in this project. These jobs take up to roughly 1 hour to finish.
  • train_huggingface : Schedules all jobs for training the huggingface models. These jobs take roughly 24 hours to finish.
  • train_ensembles : Schedules all jobs for training the ensembles. These jobs take roughly 24 hours to finish.
  • predict_all : Schedules all jobs for making predictions on the test set. These jobs finish within a few minutes.

Dataset

All the preprocessed datasets used for our experiments can be downloaded directly from this polybox link (password: data).

Download the tweet dataset:

wget http://www.da.inf.ethz.ch/teaching/2018/CIL/material/exercise/twitter-datasets.zip

Unzip it and move the contents to the Data directory:

unzip twitter-datasets.zip

mkdir Data

mv twitter-datasets/* Data

The dataset should contain the following files:

  • sample_submission.csv: an example submission in the expected format
  • train_neg.txt: a subset of the negative training samples
  • train_pos.txt: a subset of the positive training samples
  • test_data.txt: the unlabeled test samples for which predictions must be produced
  • train_neg_full.txt: the full set of negative training samples
  • train_pos_full.txt: the full set of positive training samples

Additional Datasets

Download the datasets directly from Kaggle (Sentiment140 and Tweet Sentiment Extraction) or via the polybox link using the following password: AddDatasets
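If you prefer the command line, the Kaggle CLI can fetch both. This is only a sketch; it assumes a configured Kaggle API token and that the usual identifiers (kazanova/sentiment140 for Sentiment140, tweet-sentiment-extraction for the competition) are the datasets you want:

kaggle datasets download -d kazanova/sentiment140

kaggle competitions download -c tweet-sentiment-extraction

Unzip the downloads and rename the CSV files to the names you pass to the scripts below (dataset1.csv and dataset2.csv in the examples).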

Match the format of the original dataset and create positive and negative datasets (saved in Data/):

python3 additional_dataset_1.py dataset1.csv

python3 additional_dataset_2.py dataset2.csv

Combine all datasets and write the resulting files into the Data folder (WARNING: make sure you created the Data folder as described in the Dataset section):

  • for negative:
python3 combine_datasets.py Data/train_neg_full.txt Data/train_neg_add1.txt Data/train_neg_add2.txt train_neg_all_full.txt

  • for positive:
python3 combine_datasets.py Data/train_pos_full.txt Data/train_pos_add1.txt Data/train_pos_add2.txt train_pos_all_full.txt
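As an optional sanity check, the combined files should contain at least as many lines as the original full training sets; for example:

wc -l Data/train_neg_full.txt Data/train_neg_all_full.txt

wc -l Data/train_pos_full.txt Data/train_pos_all_full.txt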

Running Our Experiments

Guidelines for running our experiments are presented here. We assume that the Git repository has been cloned on the Leonhard cluster, that the correct file structure has been set up (i.e., the missing directories have been created according to the description above), and that the datasets have been downloaded and placed in the Data directory. Our experiments store the trained models in your personal scratch space ($SCRATCH).

Further Preparations

Load necessary modules:

module load gcc/6.3.0 python_gpu/3.8.5 eth_proxy

Create and start virtual environment:

python3 -m venv venv
source venv/bin/activate

Install dependencies (make sure to be in venv):

pip install -r requirements.txt

Preprocessing

You can schedule all 10 preprocessing jobs by running:

source preprocess_all
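You can check that the jobs were submitted and follow their progress with the LSF commands listed under Miscellaneous below; once a job finishes, its log file should show up in Output_pp:

bjobs

ls Output_pp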

Training

First, the huggingface models have to be trained:

source train_huggingface

Once they are done, we can train the ensembles:

source train_ensembles

Predictions

As a last step, we create the predictions using the trained models:

source predict_all
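Once these jobs are done, the prediction files for the test set should be in the Predictions directory, with the corresponding logs in Output_predicting:

ls Predictions

ls Output_predicting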

Miscellaneous

I never manage to remember all of these commands, so here is a list of handy ones.

Virtual Environment & Dependencies

Start virtual environment:

source venv/bin/activate

Exit virtual environment:

deactivate

Update requirements.txt:

pip list --format=freeze > requirements.txt

Install dependencies (make sure to be in venv):

pip install -r requirements.txt

Leonhard Cluster

Load modules:

module load gcc/6.3.0 python_gpu/3.8.5 eth_proxy

Reset modules:

module purge
module load StdEnv

Submitting a batch job (output written to a file):

bsub -R "rusage[mem=8192]" -R "rusage[ngpus_excl_p=1]" -oo output python3 main.py [args]

Submitting an interactive job for testing (output goes to the terminal):

bsub -I -R "rusage[mem=8192]" -R "rusage[ngpus_excl_p=1]" python3 main.py [args]

Monitoring jobs:

bjobs
bbjobs
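
Two more standard LSF commands that come in handy (replace JOBID with the id shown by bjobs). Peeking at the output of a running job:

bpeek JOBID

Killing a job:

bkill JOBID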
