This project is based on the code used in my Bachelor's Thesis research. Its main goal is to provide a simple way of training ML models to classify tweets (using the Twitter APIs). The models can be combined using a JSON configuration file in order to build a hierarchical classifier shaped as a tree, where internal nodes are trained models and leaves are the final categories.
Although the models could be trained on datasets of a very different nature, one of the most straightforward applications is to build a hierarchical sentiment analysis classifier. This particular classification example is hosted online using packages such as Flask and Gunicorn, to showcase the project's capabilities.
First of all, the datasets are processed and the ML models trained:
1. Datasets: text files containing one sentence per row.
2. Sentence cleaning: tokenize, remove stopwords and extract the lemma of each word.
3. Feature vectors: build feature vectors using unigrams (words) and bigrams (pairs of words).
4. Feature selection: determine the most informative features (given an input percentage) using the chi-square test.
5. Filter features: keep only the most informative features in the current feature vectors.
6. Train classifier: use one of the following Scikit-learn algorithms:
- Multinomial Naïve Bayes.
- Logistic Regression.
- Linear Support Vector Machine.
- Random Forest (100 trees).
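The training steps above can be sketched as a single Scikit-learn pipeline. This is a minimal illustration, not the project's actual code: the toy sentences and labels are invented, lemmatization is omitted, and any of the four listed algorithms could replace the final estimator.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.naive_bayes import MultinomialNB

# Toy data standing in for the "one sentence per row" dataset files
sentences = ["I love this movie", "This film was terrible",
             "An average day", "What a fantastic result"]
labels = ["polarized", "polarized", "neutral", "polarized"]

pipeline = Pipeline([
    # Steps 2-3: tokenize, drop stopwords, build unigram + bigram counts
    ("vectorizer", CountVectorizer(ngram_range=(1, 2), stop_words="english")),
    # Steps 4-5: keep only the most informative features via chi-square
    ("selector", SelectPercentile(chi2, percentile=50)),
    # Step 6: any of the four listed algorithms would fit here
    ("classifier", MultinomialNB()),
])

pipeline.fit(sentences, labels)
print(pipeline.predict(["terrible movie"]))
```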
Secondly, the tweets are extracted and processed:
7. Tweet extraction: obtain tweets from the Twitter APIs using Tweepy.
8. Tweet preprocessing: tokenize, remove stopwords and extract the lemma of each word.
9. Feature vectors: build feature vectors using unigrams (words) and bigrams (pairs of words).
10. Filter features: keep only the most informative features in the current feature vectors.
Finally, the classification is performed:
11. Classification: the trained models classify the feature vectors into one of the final categories.
The classification is performed in a hierarchical way. This means that the trained models are placed in the nodes of a tree, and depending on how each model classifies a given piece of information, it follows one branch or another until it reaches a leaf.
The advantages of this approach over a classic multi-label classification are:
- Each node can use a different algorithm, depending on which one performs best for that decision.
- The set of most informative features is specific to each label-to-label differentiation.
Continuing with the sentiment analysis case, there are 3 possible categories: neutral, positive and negative. They are represented as leaves in the classification tree, so once the assigned category is one of those, the process is over. The classification tree would have this shape:
In order to build this custom classification tree, a JSON file with the following structure is required:
{
    "tree": {
        "clf_file": "subjectivity.pickle",
        "clf_object": null,
        "clf_children": {
            "polarized": {
                "clf_file": "sentiment.pickle",
                "clf_object": null,
                "clf_children": {}
            }
        }
    },
    "colors": {
        "neutral": [0.6, 0.6, 0.6],
        "negative": [0.8, 0.0, 0.0],
        "positive": [0.0, 0.8, 0.0]
    }
}
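A sketch of how such a configuration could be traversed, using a recursive walk over the `clf_children` field. The helper name `node_files` and the in-memory config string are assumptions for illustration; in the project, the `clf_object` fields would presumably be filled by unpickling each `clf_file`.

```python
import json

# The tree structure from the JSON above, inlined for illustration
CONFIG = """
{ "tree": { "clf_file": "subjectivity.pickle", "clf_object": null,
            "clf_children": { "polarized": {
                "clf_file": "sentiment.pickle", "clf_object": null,
                "clf_children": {} } } } }
"""

def node_files(node):
    """Collect every model file referenced by the tree, depth-first."""
    files = [node["clf_file"]]
    for child in node["clf_children"].values():
        files.extend(node_files(child))
    return files

config = json.loads(CONFIG)
print(node_files(config["tree"]))  # -> ['subjectivity.pickle', 'sentiment.pickle']
```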
The evaluation of the different models (defined by algorithm and percentage of informative features) is done using 10-fold cross-validation. This method divides each dataset into 10 folds and performs 10 iterations, where 9 folds are used for training and 1 for testing. Finally, the mean of the results is calculated.
However, the evaluation procedure is not the only relevant factor: choosing a good fitness metric is crucial for a fair comparison. In this project, the evaluation metric is the F-score, which is preferable to plain accuracy because it accounts for unbalanced class distributions among categories.
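The evaluation scheme can be sketched with Scikit-learn's built-in cross-validation. The toy texts, the duplicated samples, and the choice of macro-averaged F1 are assumptions for illustration; the project's datasets and exact averaging may differ.

```python
from sklearn.model_selection import cross_val_score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

# Toy corpus with 10 samples per class so that 10-fold CV is possible
texts = (["great product", "love it", "really good", "fantastic", "nice work"] * 2
         + ["awful", "terrible", "really bad", "hate it", "poor quality"] * 2)
labels = ["positive"] * 10 + ["negative"] * 10

model = make_pipeline(CountVectorizer(), LogisticRegression())

# 10 folds, scored with a macro-averaged F-score; the mean of the
# 10 per-fold scores is the figure used to compare models
scores = cross_val_score(model, texts, labels, cv=10, scoring="f1_macro")
print(round(scores.mean(), 3))
```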
The repository contains:
- Evaluation folder: contains a shell script to automatically evaluate algorithms.
- Models folder: contains the trained models.
- Profiles folder: contains configuration files:
    - Predicting folder: contains files for building a hierarchical classifier from individual models.
    - Training folder: contains files for training a model from specific datasets.
- Resources folder:
    - Datasets folder: contains datasets to train models.
    - Images folder: contains the images for this README.
    - Stopwords folder: contains lists of language-specific non-relevant words to filter.
- Source folder: contains the code. The files can be grouped by responsibility:
DISCLAIMER: Before using some of the following functionalities, you need to provide Twitter application and user keys in the "twitter_keys.py" file. They can be obtained by creating a Twitter Application.
The main file from which all functionalities are called is "main.py". The execution syntax is as follows:
$ python3 main.py <mode> <args>
Depending on the chosen mode (train_model, search_data, predict_user, predict_stream), the required arguments differ. They are specified in the following sections:
Trains a model and saves it inside the "models" folder. The expected arguments are:
- -a algorithm: {naive-bayes, logistic-regression, linear-svm, random-forest}.
- -f features percentage: percentage of most informative features to keep.
- -l language: language of the datasets sentences.
- -o output: name of the output model.
- -p training profile: JSON file specifying the datasets name and associated label. The datasets must be placed inside the "profiles/training" folder. Example:
[
    {
        "dataset_name": "neutral.txt",
        "dataset_label": "neutral"
    },
    {
        "dataset_name": "polarized.txt",
        "dataset_label": "polarized"
    }
]
Command line example:
$ ... train_model -a logistic-regression -f 2 -l english -o polarity.pickle -p polarity.json
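A training profile like the one above can be turned into parallel lists of sentences and labels. The in-memory `FAKE_FILES` dictionary is a hypothetical stand-in for the dataset files inside "resources/datasets"; real code would read each file from disk instead.

```python
import json

PROFILE = """[
  {"dataset_name": "neutral.txt",   "dataset_label": "neutral"},
  {"dataset_name": "polarized.txt", "dataset_label": "polarized"}
]"""

# Hypothetical stand-in for dataset files (one sentence per row)
FAKE_FILES = {
    "neutral.txt": "the meeting is at noon\nit rained today",
    "polarized.txt": "what a wonderful day\nthis is awful",
}

# Pair every sentence with the label declared in the profile entry
sentences, labels = [], []
for entry in json.loads(PROFILE):
    for line in FAKE_FILES[entry["dataset_name"]].splitlines():
        sentences.append(line)
        labels.append(entry["dataset_label"])

print(labels)  # -> ['neutral', 'neutral', 'polarized', 'polarized']
```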
Retrieves tweets using the Twitter Search API and saves them inside "resources/datasets". The expected arguments are:
- -q query: words or hashtags that the tweets must contain.
- -l language: language of the retrieved tweets.
- -d search_depth: number of tweets to retrieve.
- -o output: name of the output file containing all the tweets.
Command line example:
$ ... search_data -q "#excited OR #happy -filter:retweets" -l en -d 1000 -o pos_search.txt
Predicts the category of a user's historical tweets, filtered by word, using the Twitter REST API. The prediction is performed by a hierarchical classifier defined by a profile file inside "profiles/predicting". The expected arguments are:
- -u user: user account name (without the '@').
- -w filter word: word that has to be present in the retrieved tweets.
- -p profile: JSON specifying the hierarchical classification tree (inside "profiles/predicting").
Command line example:
$ ... predict_user -u david_cameron -w brexit -p sentiment.json
Predicts the category of real-time tweets, filtered by word and location, using the Twitter Streaming API. The prediction is performed by a hierarchical classifier tree. The expected arguments are:
- -s buffer size: number of tweets to represent in a live graph.
- -t filtered word: word that has to be present in the retrieved tweets.
- -l language: language of the retrieved tweets.
- -c coord_1 coord_2 coord_3 coord_4: bounding box of the desired location (southwest longitude, southwest latitude, northeast longitude, northeast latitude).
- -p profile: JSON specifying the hierarchical classification tree (inside "profiles/predicting").
Command line example:
$ ... predict_stream -s 500 -t Trump -l en -c -122.75 36.8 -121.75 37.8 -p sentiment.json
This project requires Python >= 3.4 🐍, as well as some additional packages such as: