INF368-Final-Project

This repository is the final project for INF368 at the University of Bergen. We replicate two models, GanBert and LAMBADA, based on the papers below. Both are generative approaches intended for low-data settings: GanBert makes use of unlabeled data, while LAMBADA synthesizes additional training data with GPT2. We use each of them to try to improve classification performance on the IMDB dataset and the Medical Text dataset, and compare the results against a baseline BERT classifier.

Papers

Data

The IMDB dataset contains 50,000 documents in 2 categories. The Medical Text dataset contains 14,438 documents in 5 categories.

Setup

| Variable | Value |
| --- | --- |
| Datapoints per label | 5, 10, 25, 50 |
| Batch size | 5 |
| Learning rate | 5e-5 |
| Seed | 0 |

How to run the analysis

Get the data. IMDB and Medical Text are already supplied, but if you wish to use other datasets, you need to generate the required files yourself. Six files are required, and they should all be in data/YourDatasetName/. They take the following form:
- These four files each have one column called text with the text, and one column called label with the labels. The number at the end indicates the number of datapoints per label:
```
- train_labeled_5.csv
- train_labeled_10.csv
- train_labeled_25.csv
- train_labeled_50.csv
```
- The unlabeled data must also have two columns, text and label. The label should be "blank" or another word indicating that the row is unlabeled, although the exact value doesn't really matter as long as the column exists:
```
- train_unlabeled.csv
```
- The test data should also have two columns, text and label:
```
- test.csv
```
It is very important that the file names are exactly as written above.
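As a rough illustration of the file layout described above, the following sketch writes the six files for a new dataset with pandas. It is not part of this repository: the function name `write_dataset_files`, its arguments, and the per-label random sampling are all assumptions made for the example. Only the file names, the text and label columns, and the data/YourDatasetName/ location come from the description above.

```python
from pathlib import Path

import pandas as pd

# Datapoints per label, matching the required file names.
SIZES = (5, 10, 25, 50)


def write_dataset_files(labeled, unlabeled_texts, test, out_dir, seed=0):
    """Write the six required CSV files into out_dir (hypothetical helper).

    labeled and test are DataFrames with "text" and "label" columns;
    unlabeled_texts is any iterable of strings.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    for n in SIZES:
        # Draw exactly n rows for every label (assumes each label has >= n rows).
        subset = labeled.groupby("label", group_keys=False).sample(
            n=n, random_state=seed
        )
        subset[["text", "label"]].to_csv(
            out_dir / f"train_labeled_{n}.csv", index=False
        )

    # The unlabeled file still needs a label column; "blank" is a placeholder.
    unlabeled = pd.DataFrame({"text": list(unlabeled_texts), "label": "blank"})
    unlabeled.to_csv(out_dir / "train_unlabeled.csv", index=False)

    test[["text", "label"]].to_csv(out_dir / "test.csv", index=False)
```

A call such as `write_dataset_files(labeled_df, texts, test_df, "data/YourDatasetName")` would then produce the expected directory, provided every label has at least 50 labeled rows to sample from.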
Generate results for the different datasets by:
- Running Bert.ipynb for the baseline results.
- Running GanBert.ipynb for the GanBert results.
- Running Lambada.ipynb for the LAMBADA results.
- Note that each notebook will have a variable you need to change in order to point it to the correct dataset. This will have to be changed manually.
Run Results.ipynb to concatenate the results and save them to a single .csv file along with a plot. They can now be compared.
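For readers who want to see what the concatenation step amounts to, here is a minimal sketch using pandas. It is not the code in Results.ipynb: the helper name `combine_results`, the assumption that each notebook leaves one CSV in a common directory, and the added `source` column are all hypothetical, and the plot is left out.

```python
from pathlib import Path

import pandas as pd


def combine_results(results_dir, out_file):
    """Stack every CSV in results_dir into one file (hypothetical helper).

    Each row is tagged with the name of the file it came from, so the
    models can still be told apart after concatenation.
    """
    results_dir, out_file = Path(results_dir), Path(out_file)
    frames = []
    for path in sorted(results_dir.glob("*.csv")):
        if path.resolve() == out_file.resolve():
            continue  # skip a combined file written by an earlier run
        frame = pd.read_csv(path)
        frame["source"] = path.stem
        frames.append(frame)
    if not frames:
        raise FileNotFoundError(f"no result CSVs found in {results_dir}")

    combined = pd.concat(frames, ignore_index=True)
    combined.to_csv(out_file, index=False)
    return combined
```

The returned DataFrame can be grouped by `source` for a side-by-side comparison, assuming the individual result files share the same columns.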
