The `ml` directory is intended to be imported as an entire module (hence the `__init__.py`). To get started with making predictions, it's simple! 🌟
```shell
$ git clone git@github.com:QEDK/clarity.git
$ cd clarity/ml
$ pip3 install -r requirements.txt
```
Now, in your code, you can import the entire module using:

```python
from ml.processtext import ProcessText

nlp = ProcessText()  # this will load the Keras model and other functionality

# And then to perform analysis:
result = await nlp.process("Some text you need to analyze!")
# This module uses async so you can enable concurrency in your real-time applications ;)
```
The module gives you a prediction in JSON format:

```json
{
  "ents": "<escaped_formatted_html>",
  "tf_idf": "<tf_idf>",
  "word_associations": "<word_associations>",
  "sentiment": {
    "mood": {
      "empty": 0.0,
      "sadness": 0.0,
      "enthusiasm": 0.0,
      "neutral": 0.0,
      "worry": 0.0,
      "surprise": 0.0,
      "love": 0.0,
      "fun": 0.0,
      "hate": 0.0,
      "happiness": 0.0,
      "boredom": 0.0,
      "relief": 0.0,
      "anger": 0.0
    },
    "polarity": 0.0,
    "subjectivity": 0.0
  }
}
```
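For instance, you can load the result with the standard `json` module and pick out the dominant mood (the payload below is an illustrative sample, not real model output):

```python
import json

# Illustrative payload mirroring the schema above (values are made up)
raw = """{
  "ents": "<escaped_formatted_html>",
  "tf_idf": "",
  "word_associations": "",
  "sentiment": {
    "mood": {"neutral": 72.4, "happiness": 12.1, "worry": 5.5},
    "polarity": 0.3,
    "subjectivity": 0.6
  }
}"""

result = json.loads(raw)
moods = result["sentiment"]["mood"]
top_mood = max(moods, key=moods.get)  # the mood with the highest probability
```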
Each mood has a probability score from 0 to 100, but the model is highly unlikely to make predictions towards only one particular mood, since normal text encompasses many different emotions. These scores are generated by a machine learning model trained on a large dataset (more details below).
Polarity is assigned on a scale of [-1, 1] (most negative to most positive). Subjectivity is assigned on a scale of [0, 1] (objective to subjective). These are rule-based scores that depend on the presence of certain polarizing, sentimental words.
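The idea behind such rule-based scoring can be sketched with a toy lexicon (the words and weights here are made up for illustration; the module gets its real scores from TextBlob's full lexicon):

```python
# Toy lexicon: word -> (polarity in [-1, 1], subjectivity in [0, 1]).
# These values are hypothetical, chosen only to illustrate the mechanism.
LEXICON = {
    "great": (0.8, 0.75),
    "terrible": (-1.0, 1.0),
    "okay": (0.2, 0.4),
}

def score(text: str) -> tuple[float, float]:
    """Average the scores of any lexicon words present in the text."""
    hits = [LEXICON[w] for w in text.lower().split() if w in LEXICON]
    if not hits:
        return 0.0, 0.0  # no sentimental words -> neutral, objective
    polarity = sum(p for p, _ in hits) / len(hits)
    subjectivity = sum(s for _, s in hits) / len(hits)
    return polarity, subjectivity
```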
We have used Keras to construct the machine learning model for emotion analysis. It is a bi-directional LSTM model trained on 40,000 tweets with crowdsourced mood data. As expected, a lot of data cleaning is required before the data is usable for training and prediction; the steps, in order, are:
- Correct spellings
- Expand contractions
- Remove mentions and URLs
- Parse emojis into text
- Remove punctuation
- Remove stop words
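A minimal, standard-library-only sketch of the last few steps (the real pipeline uses `tweet-preprocessor` and `emoji`, and a full stop-word list; the regexes and stop words below are illustrative):

```python
import re
import string

# Illustrative stop-word list; the real pipeline would use a full one
STOP_WORDS = {"a", "an", "the", "is", "to", "and", "of", "in", "so", "see"}

def clean_tweet(text: str) -> str:
    text = re.sub(r"@\w+", "", text)          # remove mentions
    text = re.sub(r"https?://\S+", "", text)  # remove URLs
    # Remove punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Remove stop words
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    return " ".join(words)

cleaned = clean_tweet("@user The model is so great! See https://example.com")
# -> "model great"
```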
We then build an embedding matrix from the GloVe 6B vectors (trained on a corpus of 6 billion tokens, with a vocabulary of 400K words and 300-dimensional vectors) as the input layer, feed it to our bi-directional LSTM, compress the features, and finally get a probabilistic outcome from the softmax layer.
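The stack described above can be sketched in Keras as follows. The layer sizes here are illustrative, not the project's actual hyperparameters, and the real embedding layer would be initialized from the GloVe matrix rather than learned from scratch:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE = 1000  # illustrative; the real model uses the 400K-word GloVe vocabulary
EMBED_DIM = 50     # illustrative; the real model uses 300-dimensional vectors
MAX_LEN = 40       # illustrative maximum sequence length
NUM_MOODS = 13     # one output per mood class in the JSON shown earlier

model = keras.Sequential([
    # Embedding matrix (would be loaded with GloVe weights and frozen)
    layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    # Bi-directional LSTM compresses the sequence into a feature vector
    layers.Bidirectional(layers.LSTM(64)),
    # Softmax layer yields a probability distribution over the mood classes
    layers.Dense(NUM_MOODS, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")

# A forward pass on a dummy (all-padding) sequence returns mood probabilities
probs = model(np.zeros((1, MAX_LEN), dtype="int32")).numpy()
```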
The entirety of the training data comes to ~1 GiB and we get a model of ~86 MB.
The model has 13,738,513 parameters (4,689,013 trainable + 9,049,500 non-trainable) and will take a significant amount of time to train without a TPU/GPU. The model tries to strike a balance between efficiency and performance; while training is slow, mood analysis of an average paragraph takes <1 second on a 2017 Core i5 processor.
You can easily retrain the model using the `model.h5` file available in the module.

```python
import keras

model = keras.models.load_model("model.h5")
# Do retraining as required
```
If that's not your style, we also provide a `makemodel.py` file to generate the model however you want to train it, or if you just want to tinker around. Comments have been provided to make things easier to understand. Please note that the datasets are imported from Kaggle and need to be set up either in a kernel or locally. There are also additional dependencies on `tweet-preprocessor`, `pandas` and `emoji`.
If you've read everything above and are ready to tinker, we have additional features so you can get the maximum performance. The default `nlp.process()` makes `asyncio` run the analysis functions concurrently; however, this is bottlenecked by the Python GIL. You can get better performance by using `multiprocessing` and spinning off each `async` function into a separate process. The API allows for this like so:
```python
import spacy
from textblob import TextBlob

from ml.processtext import ProcessText

sp = spacy.load("en_core_web_md")
nlp = ProcessText()
doc = sp("Some text to analyze")
blob = TextBlob(doc.text)
result1 = await nlp.get_formatted_entities(doc)
# You can then run each call in its own process (e.g. with multiprocessing):
# result2 = await nlp.get_sentiment(doc, blob)
# And so on....
```
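As a minimal, self-contained illustration of the concurrency pattern itself, the coroutines below are hypothetical stand-ins for the real `ProcessText` methods:

```python
import asyncio

# Hypothetical stand-ins for the analysis coroutines
async def get_entities(text: str) -> str:
    await asyncio.sleep(0.01)  # simulate I/O-bound work
    return f"entities({text})"

async def get_sentiment(text: str) -> str:
    await asyncio.sleep(0.01)
    return f"sentiment({text})"

async def analyze(text: str):
    # Run both analyses concurrently, as nlp.process() does internally
    return await asyncio.gather(get_entities(text), get_sentiment(text))

ents, sent = asyncio.run(analyze("Some text"))
```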
Similar to how `makemodel.py` can be run as a script, you can also use `processtext.py` as a script:

```shell
$ python3 processtext.py "<Write the text you want to analyze>"
```

You'll get a JSON output similar to what's shown above, along with the time taken to produce it, for utility purposes.
- Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation.
- GNU Aspell