## 🚀 Getting Started

The `ml` directory is intended to be imported as an entire module (hence the `__init__.py`). Getting started with making predictions is simple! 🌟

```bash
$ git clone git@github.com:QEDK/clarity.git
$ cd clarity/ml
$ pip3 install -r requirements.txt
```

Now, in your code, you can import the entire module using:

```python
from ml.processtext import ProcessText

nlp = ProcessText()  # this will load the Keras model and other functionality
# And then, inside an async function (or an async REPL), perform the analysis:
result = await nlp.process("Some text you need to analyze!")
# This module uses async so you can enable concurrency in your real-time applications ;)
```
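
Note that `process()` is a coroutine, so in a plain script you need an event loop to await it. A minimal sketch using `asyncio.run()`:

```python
import asyncio

from ml.processtext import ProcessText

async def main():
    nlp = ProcessText()  # loads the Keras model once
    result = await nlp.process("Some text you need to analyze!")
    print(result)

asyncio.run(main())  # drives the coroutine in a fresh event loop
```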

## 🔍 Predictions

The module gives you a prediction in JSON format:

```json
{
    "ents": "<escaped_formatted_html>",
    "tf_idf": "<tf_idf>",
    "word_associations": "<word_associations>",
    "sentiment": {
        "mood": {
            "empty": 0.0,
            "sadness": 0.0,
            "enthusiasm": 0.0,
            "neutral": 0.0,
            "worry": 0.0,
            "surprise": 0.0,
            "love": 0.0,
            "fun": 0.0,
            "hate": 0.0,
            "happiness": 0.0,
            "boredom": 0.0,
            "relief": 0.0,
            "anger": 0.0
        },
        "polarity": 0.0,
        "subjectivity": 0.0
    }
}
```

Each mood has a probability score from 0-100, but the model is highly unlikely to predict only one particular mood, since normal text encompasses many different emotions. These scores are generated by a machine learning model trained on a large dataset (more details below).

Polarity is assigned on a scale of [-1, 1] (most negative to most positive). Subjectivity is assigned on a scale of [0, 1] (objective to subjective). These are rule-based scores that depend on the presence of certain polarizing, sentiment-bearing words.
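
For illustration, assuming the result has been parsed into a Python dict (e.g. with `json.loads()` if it arrives as a string), picking the dominant mood is straightforward:

```python
import json

# Assumption: `result` is the output of nlp.process() shown above
parsed = json.loads(result) if isinstance(result, str) else result
mood_scores = parsed["sentiment"]["mood"]
top_mood = max(mood_scores, key=mood_scores.get)  # mood with the highest score
print(top_mood, parsed["sentiment"]["polarity"], parsed["sentiment"]["subjectivity"])
```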

## 🧬 Model

We have used Keras to construct the machine learning model for emotion analysis. It is a bi-directional LSTM model trained on 40,000 tweets with crowdsourced mood data. As expected, a lot of data cleaning is required before the data is usable for training and prediction; the following steps are applied in order (a rough sketch of such a pipeline follows the list):

  1. Correct spellings
  2. Expand contractions
  3. Remove mentions and URLs
  4. Parse emojis into text
  5. Remove punctuation
  6. Remove stop words
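
For illustration, here is a rough sketch of what such a cleaning pipeline can look like. It is not the exact code in `makemodel.py`; in particular, the `contractions` package and the NLTK stop-word list are assumptions on our part:

```python
import string

import contractions                # assumed helper for expanding contractions
import emoji
import preprocessor as p           # tweet-preprocessor
from nltk.corpus import stopwords  # assumes the NLTK stop-word corpus is downloaded
from textblob import TextBlob

STOP_WORDS = set(stopwords.words("english"))

def clean(text: str) -> str:
    text = str(TextBlob(text).correct())                              # 1. correct spellings
    text = contractions.fix(text)                                     # 2. expand contractions
    p.set_options(p.OPT.MENTION, p.OPT.URL)
    text = p.clean(text)                                              # 3. remove mentions and URLs
    text = emoji.demojize(text)                                       # 4. parse emojis into text
    text = text.translate(str.maketrans("", "", string.punctuation))  # 5. remove punctuation
    return " ".join(w for w in text.split() if w.lower() not in STOP_WORDS)  # 6. remove stop words
```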

We then use an embedding matrix built from the GloVe 6B dataset (trained on a corpus of 6 billion tokens, with a vocabulary of 400K words and 300-dimensional vectors) as the input layer, feed it to our bi-directional LSTM, compress the features and finally get a probabilistic outcome from the softmax layer.

The entirety of the training data comes to ~1 GiB and the resulting model is ~86 MB.
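
To make the description concrete, here is a hedged sketch of such an architecture in Keras; the layer sizes, vocabulary size and sequence length below are illustrative placeholders, not the project's actual hyperparameters (see `makemodel.py` for those):

```python
import numpy as np
import keras
from keras import layers

NUM_MOODS = 13        # the 13 moods listed above
EMBEDDING_DIM = 300   # GloVe 300-dimensional vectors
VOCAB_SIZE = 30000    # illustrative vocabulary size
MAX_LEN = 100         # illustrative maximum sequence length

# In practice this matrix is filled with GloVe 6B vectors for the corpus vocabulary
embedding_matrix = np.zeros((VOCAB_SIZE, EMBEDDING_DIM))

model = keras.Sequential([
    keras.Input(shape=(MAX_LEN,)),
    layers.Embedding(VOCAB_SIZE, EMBEDDING_DIM,
                     embeddings_initializer=keras.initializers.Constant(embedding_matrix),
                     trainable=False),               # frozen GloVe embedding as the input layer
    layers.Bidirectional(layers.LSTM(128)),          # bi-directional LSTM over the embedded sequence
    layers.Dense(64, activation="relu"),             # compress the features
    layers.Dense(NUM_MOODS, activation="softmax"),   # probabilistic outcome per mood
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```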

## 🏢 Architecture

Architecture diagram

The model has 13,738,513 parameters (4,689,013 trainable + 9,049,500 non-trainable) and will take significant time to train without a TPU/GPU. The model tries to strike a balance between efficiency and performance, and while training is slow, mood analysis of an average paragraph takes <1 second on a 2017 Core i5 processor.

## ⏮️ Retraining

You can easily retrain the model using the `model.h5` file available in the module.

```python
import keras

model = keras.models.load_model("model.h5")  # load the saved model shipped with the module
# Do retraining as required
```
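
A minimal sketch of what a retraining step could look like; the placeholder data, shapes and hyperparameters below are illustrative assumptions, not the values used by the project (prepare real data as in `makemodel.py`):

```python
import numpy as np
import keras

model = keras.models.load_model("model.h5")

# Placeholder batch: in practice, use cleaned token-id sequences and one-hot mood labels
new_x = np.random.randint(0, 30000, size=(32, 100))    # illustrative (batch, sequence_length)
new_y = np.eye(13)[np.random.randint(0, 13, size=32)]  # one-hot labels over the 13 moods

model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(new_x, new_y, epochs=2, batch_size=8)
model.save("model.h5")
```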

If that's not your style, we also provide a `makemodel.py` file so you can generate and train the model however you want, or just tinker around. Comments have been provided to make things easier to understand. Please note that the datasets are imported from Kaggle and need to be set up either in a Kaggle kernel or locally. There are also additional dependencies on `tweet-preprocessor`, `pandas` and `emoji`.

## 🥽 Getting Started: Advanced Guide

If you've read everything above and are ready to tinker, we have additional features so you can get maximum performance. The default `nlp.process()` makes asyncio run the analysis functions concurrently; however, this is bottlenecked by the Python GIL, and you can get better performance by using multiprocessing to spin each analysis function off into its own worker. The API allows for this like so:

```python
import spacy
from ml.processtext import ProcessText
from textblob import TextBlob

sp = spacy.load("en_core_web_md")
nlp = ProcessText()
doc = sp("Some text to analyze")
blob = TextBlob(doc.text)
result1 = await nlp.get_formatted_entities(doc)
# You can then fork this into a separate process (e.g. with the multiprocessing module)
# result2 = await nlp.get_sentiment(doc, blob)
# And so on....
```
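
For example, one way to sidestep the GIL is to analyze several texts in parallel, with each worker process loading its own `ProcessText` instance. This is a rough sketch of that idea, not the only way to combine the calls above:

```python
import asyncio
from multiprocessing import Pool

from ml.processtext import ProcessText

def analyze(text):
    # Each worker process loads its own ProcessText (and therefore its own Keras model)
    nlp = ProcessText()
    return asyncio.run(nlp.process(text))

if __name__ == "__main__":
    texts = ["First text to analyze", "Second text to analyze"]
    with Pool(processes=2) as pool:
        results = pool.map(analyze, texts)
    print(results)
```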

Similar to how `makemodel.py` can be run as a script, you can also run `processtext.py` as a script:

```bash
$ python3 processtext.py "<Write the text you want to analyze>"
```

You'll get a JSON output similar to what's shown above, along with the time taken to produce it, for utility purposes.

## Credits