Machine Learning: Model

qedk edited this page Nov 13, 2020 · 1 revision

We have used TensorFlow Keras to construct the machine learning model for emotion analysis. It is a bi-directional LSTM model trained on 40,000 tweets with crowdsourced mood data. As expected, the raw tweets need substantial cleaning before they are usable for training and prediction; the steps, in order, are:

  1. Correct spellings (using GNU Aspell dataset)
  2. Expand contractions (using custom regex and Kaggle dataset)
  3. Remove mentions and URLs (using the Python tweet-preprocessor package)
  4. Parse emojis into text (using the Python emoji package)
  5. Remove punctuation (from string.punctuation)
  6. Remove stop words (with spaCy-defined stopwords)
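The cleaning steps above (minus spelling correction) can be sketched as a single function. This is a minimal, illustrative stand-in: the real pipeline uses the GNU Aspell dataset, a Kaggle contractions dataset, tweet-preprocessor, the emoji package, and spaCy's stopword list, whereas the tiny dictionaries below are placeholders for those resources.

```python
import re
import string

# Illustrative placeholders -- the real pipeline loads these from the
# Kaggle contractions dataset, the emoji package, and spaCy respectively.
CONTRACTIONS = {"i'm": "i am", "can't": "cannot", "won't": "will not"}
EMOJI_TO_TEXT = {"😊": " smiling face "}
STOPWORDS = {"i", "am", "a", "an", "the", "so", "at"}

def clean_tweet(text: str) -> str:
    text = text.lower()
    # 2. Expand contractions
    for contraction, expansion in CONTRACTIONS.items():
        text = re.sub(re.escape(contraction), expansion, text)
    # 3. Remove @mentions and URLs (tweet-preprocessor handles this robustly)
    text = re.sub(r"@\w+|https?://\S+", "", text)
    # 4. Parse emojis into text
    for emoji_char, name in EMOJI_TO_TEXT.items():
        text = text.replace(emoji_char, name)
    # 5. Remove punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    # 6. Remove stop words
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return " ".join(tokens)

print(clean_tweet("I'm so happy 😊 @friend https://t.co/xyz"))
# → happy smiling face
```

The order matters: contractions must be expanded before punctuation is stripped (otherwise `i'm` becomes the non-word `im`), and stopword removal comes last so that expanded words like `am` are caught.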

We then build the input layer as an embedding matrix from the GloVe 6B vectors (trained on a corpus of 6 billion tokens, with a 400K-word vocabulary and 300-dimensional vectors), feed it to our bi-directional LSTM, compress the features, and finally obtain a probability distribution over emotions from the softmax layer.
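Building that embedding-matrix input layer amounts to mapping each word index in the tokenizer vocabulary to its pre-trained GloVe vector. A NumPy sketch, using toy 3-dimensional vectors in place of the real `glove.6B.300d.txt` file (the hypothetical `word_index` stands in for a fitted Keras `Tokenizer`):

```python
import numpy as np

# Stand-in for glove.6B.300d.txt: each line is "word v1 v2 ... vN".
glove_lines = [
    "happy 0.1 0.2 0.3",
    "sad -0.1 -0.2 -0.3",
]
embedding_dim = 3  # 300 in the real model

# Parse the GloVe text format into a word -> vector lookup.
embeddings_index = {}
for line in glove_lines:
    word, *values = line.split()
    embeddings_index[word] = np.asarray(values, dtype="float32")

# word -> integer index, as produced by e.g. a Keras Tokenizer
# (index 0 is conventionally reserved for padding).
word_index = {"happy": 1, "sad": 2, "unseenword": 3}

# Rows of the matrix line up with token indices; out-of-vocabulary
# words keep an all-zero row. This matrix initialises the Embedding
# layer that feeds the bi-directional LSTM.
embedding_matrix = np.zeros((len(word_index) + 1, embedding_dim))
for word, i in word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector

print(embedding_matrix.shape)  # (4, 3)
```

In the real model the matrix is 400,001 × 300 and is typically passed to the Embedding layer with training frozen, so the GloVe vectors are used as-is rather than fine-tuned.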

The training data totals ~1 GiB, and the resulting model is ~86 MB.