AMLS_II_assignment21_22-kaggle

The NLP project implementation for the course ELEC0134: Applied Machine Learning Systems (21/22) in the UCL MSc IMLS programme.

SN: 21056542

The ubiquity of smartphones allows people to announce any emergency they witness in real time. A large proportion of them choose to broadcast by sending tweets, so classifying tweets for disaster content is a topic that has been gaining momentum in recent years. In this report, we focus on the Kaggle competition Natural Language Processing with Disaster Tweets, whose objective is to distinguish real disaster tweets from the rest. We first acquire word embedding vectors with the word2vec architecture, then classify the embedded tweets into two labels using Bi-LSTM, TextCNN, and fine-tuned BERT models, and report their performance. The prediction results are evaluated and compared to indicate which of the selected approaches is best suited to this classification task.

Environments

Build a virtual environment and install the required modules with the following command:

pip install -r requirements.txt 

File Directory


AMLS_II_assignment21_22-kaggle
├── /code/
│  ├── data_load.py
│  ├── data_preprocess.py
│  ├── models.py
│  ├── train_test.py
│  ├── word2vec.py
├── /Datasets/
│  ├── train.csv
│  ├── test.csv
│  ├── prediction.csv
├── /models/
│  ├── /BERT/
│  ├── /CNN/
│  ├── /LSTM/
│  ├── w2v_all_200.model
│  ├── w2v_all_300.model
├── /output/
│  ├── xxxx-xx-xx_xx:xx:xx.log
├── /runs/
├── main.py
├── README.md
├── requirements.txt
├── run.sh

File Description

main.py

This file implements the whole pipeline of this report: the word embedding process, the construction of the BertTokenizer and the fine-tuned BERT model, and the training commands that select a specific deep learning model and produce test results on the dataset. You can run either the training process or the prediction process by commenting out one or the other in the main function.
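A minimal sketch of how the --model flag might dispatch to the three networks; the exact argument handling in the repository may differ:

import argparse

# Parse the command-line flag used in the usage example below.
parser = argparse.ArgumentParser()
parser.add_argument("--model", choices=["LSTM", "CNN", "BERT"], default="LSTM")
args = parser.parse_args()

print("Loading training data ...")
if args.model == "BERT":
    print("Using BERT Model")   # fine-tune BERT on tokenised tweets
elif args.model == "CNN":
    print("Using CNN Model")    # TextCNN on word2vec embeddings
else:
    print("Using LSTM Model")   # Bi-LSTM on word2vec embeddings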

data_load.py

This script contains the filtering pre-processing operations, the data_load function, and the TwitterDataset class definition.
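A minimal sketch of what a TwitterDataset class typically looks like; the field names are assumptions, not the repository's exact code:

import torch
from torch.utils.data import Dataset

class TwitterDataset(Dataset):
    """Wraps pre-embedded tweet sequences and their binary disaster labels."""
    def __init__(self, sequences, labels):
        self.sequences = sequences  # e.g. a tensor of word-index sequences
        self.labels = labels        # 0 = not a disaster, 1 = real disaster

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        return self.sequences[idx], self.labels[idx]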

data_preprocess.py

This script defines the Pre-process class that implements the word embedding operation, and also contains the BertTokenizer function.
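A minimal sketch of tokenising a tweet with Hugging Face's BertTokenizer; the bert-base-uncased checkpoint and max_length are assumptions:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")  # assumed checkpoint
encoded = tokenizer(
    "Forest fire near La Ronge Sask. Canada",
    padding="max_length",
    truncation=True,
    max_length=64,        # illustrative sequence length
    return_tensors="pt",
)
print(encoded["input_ids"].shape)  # torch.Size([1, 64])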

models.py

This script constructs the specific architectures of Bi-LSTM and TextCNN using the torch.nn package.
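A minimal sketch of a Bi-LSTM binary classifier built with torch.nn; the layer sizes are illustrative and need not match models.py:

import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    def __init__(self, embedding_dim=200, hidden_dim=128):
        super().__init__()
        self.lstm = nn.LSTM(embedding_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Linear(hidden_dim * 2, 1)  # *2 for the two directions

    def forward(self, x):
        # x: (batch, seq_len, embedding_dim) word2vec embeddings
        _, (h_n, _) = self.lstm(x)
        # concatenate the final forward and backward hidden states
        h = torch.cat((h_n[-2], h_n[-1]), dim=1)
        return self.fc(h)  # raw logit; apply sigmoid for a probability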

train_test.py

This script defines the training and testing functions used to train the models and make predictions. The evaluation function is also included here.
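A minimal sketch of the kind of per-epoch loop train_test.py implements; the model, data loader, and optimiser are placeholders supplied by the caller:

import torch
import torch.nn as nn

def train_one_epoch(model, loader, optimizer, device="cpu"):
    # Binary cross-entropy on raw logits suits the two-label tweet task.
    criterion = nn.BCEWithLogitsLoss()
    model.train()
    total_loss, correct, n = 0.0, 0, 0
    for inputs, labels in loader:
        inputs, labels = inputs.to(device), labels.float().to(device)
        optimizer.zero_grad()
        logits = model(inputs).squeeze(1)
        loss = criterion(logits, labels)
        loss.backward()
        optimizer.step()
        total_loss += loss.item() * len(labels)
        correct += ((torch.sigmoid(logits) > 0.5) == labels.bool()).sum().item()
        n += len(labels)
    # These are the train_loss and train_acc values written to the logs.
    return total_loss / n, correct / n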

word2vec.py

This script contains the code that constructs the word2vec model and saves it in the /models/ folder so that it can be loaded at any time.
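A minimal sketch of how the word2vec model might be built and saved with gensim (assuming the gensim 4 API, where the dimension argument is vector_size; older versions call it size); the toy corpus is illustrative:

from gensim.models import Word2Vec

# corpus: a list of tokenised tweets
corpus = [["forest", "fire", "near", "la", "ronge"], ["happy", "birthday"]]

model = Word2Vec(
    sentences=corpus,
    vector_size=200,   # 200 matches w2v_all_200.model; 300 for the other variant
    window=5,
    min_count=1,
    workers=4,
)
model.save("models/w2v_all_200.model")

# Reload later from /models/ and look up an embedding:
model = Word2Vec.load("models/w2v_all_200.model")
print(model.wv["fire"].shape)  # (200,)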

Dataset

The dataset is provided by Kaggle [1] and consists of a training set and a test set; the training set contains more than 7,000 samples. The main columns are text (the tweet content) and target (the binary disaster label). The prediction.csv file contains the test-set predictions generated by our model.

models

This folder stores the trained models with the best performance, such as TextCNN with an accuracy above 79% and Bi-LSTM above 78%. The BERT folder is empty because the saved BERT model is too large to upload to GitHub. In addition, the word2vec models produced by word2vec.py are also saved here.

output

This folder records the output of each individual training run, including the train_acc, valid_acc, train_loss, and valid_loss produced at each epoch. Each log is named after the date and time at which training started.

runs

This folder saves the learning curves of each neural network training run; they can be viewed with TensorBoard and correspond to the logs in the output folder.
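The curves can be inspected locally with:

tensorboard --logdir runs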

requirements.txt

Provides the specific list of external libraries and packages; the virtual environment is built on Anaconda.

run.sh

This shell script allows the training process to keep running in the background without hanging up when the terminal session ends.
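A typical way to achieve this is with nohup; the exact contents of run.sh may differ, and the log path below is only illustrative:

nohup python main.py --model LSTM > output/train.log 2>&1 &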

Usage example

To get started, select which task to solve and choose a specific model to implement it.

For example, to test the BERT model, use the following command:

python main.py --model BERT

Then it will output like this:

Loading training data ...
Using BERT Model

This means training has started successfully. For the deep learning models ['LSTM', 'CNN', 'BERT'], the correspondence between each command-line name and network is:

  • LSTM: Bi-LSTM
  • CNN: TextCNN
  • BERT: BERT

References

[1] https://www.kaggle.com/competitions/nlp-getting-started/overview
