
Tejas-Nanaware/Native-Language-Identification


Introduction

Create a text classification model that predicts the native language of an author who writes in English. Using Google's BERT, vector representations of the authors' texts are obtained and fed to neural networks and other prediction models.

This project was created for the CS585 - Natural Language Processing course (Fall 2019) at Illinois Institute of Technology, Chicago.

Workflow

  1. Clone the BERT repository and add it as a git submodule, referred to as BERT_BASE_DIR.
  2. Download the BERT-Base Uncased model files; this directory is referred to as BERT_DATA_DIR. Repo Link.
  3. Download the dataset used for training, validation, and testing from the University Repo. This is the data directory.
  4. Run Format Data For Input.sh, which programmatically reformats the data files into bert_input_data, then run run_bert_fv.sh, which writes a feature-vector representation for each example into the bert_output_data directory (see the loading sketch after this list).
  5. Apply the prediction models in the Prediction Models.ipynb file (see the baseline sketch after the directory structure below).
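
run_bert_fv.sh presumably wraps BERT's extract_features.py, which writes one JSON record per input line. As a minimal sketch, assuming the standard extract_features.py output layout (a "features" list with one entry per token, each carrying the layers requested via --layers), the resulting .jsonlines files can be loaded into per-sentence vectors like this, using the top-layer [CLS] token as the sentence representation:

  import json
  import numpy as np

  def load_cls_vectors(path):
      # Each line is one JSON record; features[0] is the [CLS] token,
      # and layers[0] is the first layer requested via --layers.
      vectors = []
      with open(path) as f:
          for line in f:
              record = json.loads(line)
              vectors.append(record["features"][0]["layers"][0]["values"])
      return np.array(vectors)

  train_X = load_cls_vectors("bert_output_data/train.jsonlines")
  eval_X = load_cls_vectors("bert_output_data/eval.jsonlines")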

Directory Structure

BERT_BASE_DIR (files from Google's BERT submodule)
BERT_DATA_DIR (files from the BERT-Base Uncased model)
data (the dataset from the University Repo)
|--lang_id_train.csv
|--lang_id_eval.csv
|--lang_id_test.csv
bert_input_data (Formatted files for vector representation)
|--train.txt
|--eval.txt
|--test.txt
bert_output_data (Obtained feature vector representation)
|--train.jsonlines
|--eval.jsonlines
|--test.jsonlines
Format Data For Input.sh
run_bert_fv.sh
Prediction Models.ipynb
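
Prediction Models.ipynb applies classifiers to the extracted vectors. As a hypothetical minimal baseline (the notebook itself may use neural networks or other models), a scikit-learn logistic regression over the [CLS] vectors loaded above could look like the following; the label column name in the CSV files is an assumption:

  import pandas as pd
  from sklearn.linear_model import LogisticRegression
  from sklearn.metrics import accuracy_score

  # The label column name is assumed; check the actual CSV header.
  train_y = pd.read_csv("data/lang_id_train.csv")["native_language"]
  eval_y = pd.read_csv("data/lang_id_eval.csv")["native_language"]

  # train_X / eval_X are the [CLS] vectors from load_cls_vectors above.
  clf = LogisticRegression(max_iter=1000)
  clf.fit(train_X, train_y)
  print("eval accuracy:", accuracy_score(eval_y, clf.predict(eval_X)))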
