Machine Learning: Model
We have used TensorFlow Keras to construct the machine learning model for emotion analysis. It is a bi-directional LSTM model trained on 40,000 tweets with crowdsourced mood labels. As expected, a lot of data cleaning is required before the data is usable for training and prediction, in this order:
- Correct spellings (using the GNU Aspell dataset)
- Expand contractions (using custom regex and a Kaggle dataset)
- Remove mentions and URLs (using the Python `tweet-preprocessor` package)
- Parse emojis into text (using the Python `emoji` package)
- Remove punctuation (from `string.punctuation`)
- Remove stop words (using spaCy-defined stopwords)
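The cleaning steps above can be sketched with the standard library alone. This is a simplified stand-in: spelling correction is omitted, the contraction map and stopword set are tiny illustrative samples rather than the Kaggle and spaCy lists, and the mention/URL regex approximates what `tweet-preprocessor` does more robustly.

```python
import re
import string

# Tiny stand-in for spaCy's stopword list (illustration only).
STOP_WORDS = {"i", "am", "is", "the", "a", "an", "so", "to"}

# A few common contractions; the real pipeline uses a larger Kaggle dataset.
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "i'm": "i am"}

def clean_tweet(text: str) -> str:
    text = text.lower()
    # Expand contractions via a regex lookup.
    text = re.sub(
        r"\b(?:can't|won't|i'm)\b",
        lambda m: CONTRACTIONS[m.group(0)],
        text,
    )
    # Remove @mentions and URLs.
    text = re.sub(r"@\w+|https?://\S+", "", text)
    # Strip punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Drop stop words.
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)

print(clean_tweet("I'm so happy!!! check https://t.co/x @friend"))  # → happy check
```

Each step is order-sensitive: contractions must be expanded before punctuation stripping removes the apostrophes, and stop-word removal comes last so that expanded words like "am" are caught.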
We then build an embedding matrix from the pre-trained GloVe 6B vectors (trained on a corpus of 6 billion tokens, with a 400K-word vocabulary and 300 dimensions) to serve as the input layer, feed it to our bi-directional LSTM, compress the features, and finally get a probabilistic outcome from the softmax layer.
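Constructing the embedding matrix amounts to mapping each word index from the tokenizer to its GloVe row. A minimal sketch, using a tiny fake GloVe excerpt and a hypothetical `word_index` in place of the real 400K-word files (the real vectors are 300-dimensional):

```python
import numpy as np

EMBED_DIM = 4  # stands in for the real 300-dimensional GloVe vectors

# Fake excerpt in GloVe's text format: "word v1 v2 ... vN" per line.
glove_lines = [
    "happy 0.1 0.2 0.3 0.4",
    "sad -0.1 -0.2 -0.3 -0.4",
]

# word -> integer index, as produced by a tokenizer fit on the corpus.
word_index = {"happy": 1, "sad": 2}

# Row i holds the vector for the word with index i; row 0 stays zero
# (reserved for padding / out-of-vocabulary words).
embedding_matrix = np.zeros((len(word_index) + 1, EMBED_DIM))
for line in glove_lines:
    word, *values = line.split()
    if word in word_index:
        embedding_matrix[word_index[word]] = np.asarray(values, dtype="float32")

print(embedding_matrix.shape)  # → (3, 4)
```

This matrix is then passed as frozen weights to the Keras `Embedding` layer, so each integer-encoded tweet is turned into a sequence of GloVe vectors before reaching the bi-directional LSTM.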
The training data totals ~1 GiB, and the resulting model comes to ~86 MB.
Copyright (C) 2020 Ajitesh Panda, Ankit Maity

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.3 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the section entitled "GNU Free Documentation License".