NNLangID 🌎💬

Project for neural network-based tweet language identification. See project report for more details.

Terminal Screenshot

Architecture Overview

Getting Started

Fetch tweet data from Twitter via the TweetRetriever.py and place it into data/input_data/original (already done for the Twitter blog post data this project is based on).
Run Main.py for the main procedure. Depending on use_cluster_params it reads in one of the two YAML settings files, which contain all user parameters.
- Set create_splitted_data_files = True to split an original file from specified file path input_tr_va_te_data_rel_path into separate training, validation and test set files. The data is then fetched from those files, preprocessed and transformed to be readily used by the subsequent embedding and RNN.
- Set train_embed = True to train the embedding and get the embedding weights. The embedding is implemented as a Skip-Gram model with Negative Sampling. While training, the loss-based best embedding model checkpoint and extracted embedding weights are automatically saved to specified file paths in embed_model_checkpoint_rel_path and embed_weights_rel_path.
- Set train_rnn = True to use the embedding weights to embed the characters of a tweet and feed them into the RNN, which is implemented as a (uni- or bidirectional) GRU model. While training, the loss-based best RNN model checkpoint is automatically saved to the specified file path in rnn_model_checkpoint_rel_path.
- Set eval_test_set = True to evaluate a trained RNN model checkpoint on the test set, to get further metrics on the performance, which are then stored back to the checkpoint file. (File paths specified in rnn_model_checkpoint_rel_path and embed_weights_rel_path are used).
- Set run_terminal = True to run the terminal for interactive evaluation of a trained RNN model checkpoint with arbitrary input text or live tweets fetched directly from Twitter. Some trained model checkpoints and weight files may be found in data/save/trained. (File paths specified in trained_model_checkpoint_rel_path and trained_embed_weights_rel_path are used.)
- Set print_embed_testing = True to print the embedding test after the embedding calculation to the console.
- Set print_model_checkpoint_embed_weights and print_rnn_model_checkpoint or print_embed_model_checkpoint to the respective file paths to print stored model checkpoint data to the console. (Note: Some parameters in the YAML settings file, e.g. input_tr_va_te_data_rel_path and hidden_size_rnn, have to be the same as in the model checkpoint file!)

Prerequisites

Python v2.7
PyTorch v0.2.0_4
CUDA is used if available.

Authors

Project developed by Alexander Heilig, Dominik Sauter and Tabea Kiupel in the context of the Neural Networks practical course at the Karlsruhe Institute of Technology (KIT), Germany.

License

Licensed unter the MIT license (see LICENSE file for more details).

Name		Name	Last commit message	Last commit date
Latest commit History 249 Commits
LanguageIdentification		LanguageIdentification
.gitignore		.gitignore
ArchitectureOverview.png		ArchitectureOverview.png
LICENSE		LICENSE
README.md		README.md
TerminalManual.PNG		TerminalManual.PNG
TweetLanguageIdentification_D.Sauter-A.Heilig-T.Kiupel_rev1.pdf		TweetLanguageIdentification_D.Sauter-A.Heilig-T.Kiupel_rev1.pdf

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

NNLangID 🌎💬

Terminal Screenshot

Architecture Overview

Getting Started

Prerequisites

Authors

License

About

Releases

Packages

Contributors 3

Languages

License

DominikSauter/NNLangID

Folders and files

Latest commit

History

Repository files navigation

NNLangID 🌎💬

Terminal Screenshot

Architecture Overview

Getting Started

Prerequisites

Authors

License

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Languages

Packages