Sentiment analysis of Tweets as “positive”, “negative”, or “neutral”
To train and test the classifier for a given model, run hw1.py with the following command:
$ hw1.py --train <path_to_resources>\data\train.csv
--test <path_to_resources>\data\dev.csv
--model "Model Name"
--lexicon_path <path_to_resources>\lexica\
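The four flags in the command above could be parsed along the following lines. This is only a sketch under stated assumptions: it is not taken from hw1.py, the helper name build_parser is hypothetical, and the value "Ngram" is used purely as an example since the exact model-name strings the script accepts are not listed here.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the four flags in the command above."""
    parser = argparse.ArgumentParser(
        description="Train and test a tweet sentiment classifier")
    parser.add_argument("--train", required=True, help="path to the training CSV")
    parser.add_argument("--test", required=True, help="path to the test (dev) CSV")
    parser.add_argument("--model", required=True, help="name of the model to run")
    # Assumption: the lexicon directory is optional, since a plain n-gram
    # model would not need it.
    parser.add_argument("--lexicon_path", default=None,
                        help="directory containing the lexicon files")
    return parser


# Example invocation with placeholder paths (not real files).
args = build_parser().parse_args([
    "--train", "data/train.csv",
    "--test", "data/dev.csv",
    "--model", "Ngram",
    "--lexicon_path", "lexica/",
])
print(args.train, args.test, args.model, args.lexicon_path)
```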
For training, I used the Hashtag-Lexicon file (for both unigrams and bigrams). The runtime of each model is:
- Ngram : 3.84 seconds
- Ngram+Lex : 6.93 seconds
- Ngram+Lex+Enc : 7.16 seconds
- Custom Model : 7.76 seconds
The best-performing model is the custom model, with a macro-averaged F1 score of 0.5077. The features of each model are detailed below.
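For reference, a macro-averaged F1 score is the unweighted mean of the per-class F1 scores, so each class counts equally regardless of how many tweets it has. The sketch below illustrates the computation in plain Python; the function name macro_f1 and the toy labels are illustrative only and are not taken from hw1.py.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1_scores.append(2 * precision * recall / (precision + recall))
        else:
            f1_scores.append(0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy example: the single "negative" tweet is missed entirely, which pulls
# the macro average down even though 3 of 4 predictions are correct.
y_true = ["positive", "negative", "neutral", "neutral"]
y_pred = ["positive", "neutral", "neutral", "neutral"]
score = macro_f1(y_true, y_pred)
print(round(score, 4))
```

This also shows why a class that is rarely predicted correctly has a large effect on the reported score.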
In the Ngram model, a vanilla n-gram model was used, with word n-grams in the range (1,4). The steps carried out are:
- The labels with value 'objective' are replaced with 'neutral'.
- The tweets are lemmatized.
- The CountVectorizer is fitted on the training data with an n-gram range of (1,4).
- The test data is transformed with this CountVectorizer, and all 4 are returned (labels as numpy arrays).
- Upon training and predicting using SVM (C=10) and Naive Bayes, SVM yielded a result with a higher F1 score. Thus, SVM is chosen.
Here, the features used were:
- tweet_token : The tweets whose sentiment is to be analyzed, in the form of n-gram vectors from the CountVectorizer
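The steps above can be sketched with scikit-learn as follows. This is a minimal illustration, not the code in hw1.py: the toy tweets are invented, lemmatization is omitted to keep the example dependency-free, and LinearSVC is assumed as the SVM implementation.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import f1_score
from sklearn.svm import LinearSVC

# Invented toy data; the real data comes from train.csv and dev.csv.
train_tweets = ["i love this phone", "worst service ever",
                "the meeting is at noon", "love the new update",
                "i hate the delay", "flight departs at noon"]
train_labels = ["positive", "negative", "objective",
                "positive", "negative", "neutral"]
test_tweets = ["love this update", "hate this service"]
test_labels = ["positive", "negative"]

# Step 1: replace 'objective' with 'neutral'.
train_labels = ["neutral" if label == "objective" else label
                for label in train_labels]

# Steps 3-4: fit word n-grams of length 1 to 4 on the training tweets only,
# then reuse the fitted vocabulary to transform the test tweets.
vectorizer = CountVectorizer(ngram_range=(1, 4))
X_train = vectorizer.fit_transform(train_tweets)
X_test = vectorizer.transform(test_tweets)

# Step 5: train a linear SVM with C=10 and score with macro-averaged F1.
clf = LinearSVC(C=10)
clf.fit(X_train, train_labels)
predictions = clf.predict(X_test)
print(f1_score(test_labels, predictions, average="macro"))
```

Fitting the vectorizer on the training data alone keeps test-set vocabulary from leaking into the features.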
In the Ngram+Lex model, word n-grams in the range (1,4) were used. In addition, features extracted from the Hashtag-Lexicon file, for both unigrams and bigrams, are fed into the model. The steps carried out are:
- The labels with value 'objective' are replaced with 'neutral'.
- The tweets are lemmatized.
- All bigrams are extracted from each tweet in order to find the HS-bigrams present in it.
- The features_unigrams and features_bigrams functions extract features from HS-unigrams and HS-bigrams (features are listed below).
- The CountVectorizer is fitted on the tweets with an n-gram range of (1,4).
- The n-gram vectors are then stacked together with the features obtained in step (4).
- After the test tweets are transformed, the same stacking (step 6) is repeated for the test data, and all 4 are returned (labels as numpy arrays).
Here, the features used were:
- tweet_token : The tweets whose sentiment is to be analyzed, in the form of n-gram vectors from the CountVectorizer
- total count of unigram tokens in the tweet with score(w, p) > 0, where score(w, p) is the lexicon score
- total score of all positive unigram tokens
- maximal score amongst all positive unigram tokens
- score of the last unigram token in the tweet with score(w, p) > 0
- total count of unigram tokens in the tweet with score(w, p) < 0, where score(w, p) is the lexicon score
- total score of all negative unigram tokens
- minimum score amongst all negative unigram tokens
- score of the last unigram token in the tweet with score(w, p) < 0
- total count of bigram tokens in the tweet with score(w, p) > 0, where score(w, p) is the lexicon score
- total score of all positive bigram tokens
- maximal score amongst all positive bigram tokens
- score of the last bigram token in the tweet with score(w, p) > 0
- total count of bigram tokens in the tweet with score(w, p) < 0, where score(w, p) is the lexicon score
- total score of all negative bigram tokens
- minimum score amongst all negative bigram tokens
- score of the last bigram token in the tweet with score(w, p) < 0
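The 16 lexicon features listed above (count, total, extreme and last score, for each polarity, for unigrams and bigrams) can be sketched in plain Python as below. The helper names polarity_features and bigrams are hypothetical and do not reproduce the features_unigrams and features_bigrams functions in hw1.py; the lexicon scores are invented, and returning 0 when a tweet has no matching token is an assumption.

```python
def polarity_features(tokens, lexicon):
    """Return 8 features: count, total, extreme and last score per polarity."""
    scores = [lexicon[token] for token in tokens if token in lexicon]
    positive = [s for s in scores if s > 0]
    negative = [s for s in scores if s < 0]
    return [
        len(positive), sum(positive),
        max(positive, default=0.0), positive[-1] if positive else 0.0,
        len(negative), sum(negative),
        min(negative, default=0.0), negative[-1] if negative else 0.0,
    ]


def bigrams(tokens):
    """Join each pair of adjacent tokens into a space-separated bigram."""
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


# Invented scores standing in for the unigram and bigram lexicon files.
unigram_lexicon = {"love": 1.5, "great": 2.0, "delay": -1.0}
bigram_lexicon = {"love this": 0.5, "long delay": -2.0}

tokens = "i love this great phone despite the long delay".split()
features = (polarity_features(tokens, unigram_lexicon)
            + polarity_features(bigrams(tokens), bigram_lexicon))
print(features)
```

The first 8 values are the unigram features and the last 8 are the bigram features, in the same order as the list above.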
In the Ngram+Lex+Enc model, word n-grams in the range (1,4) were used. In addition, the features extracted from the Hashtag-Lexicon file (for both unigrams and bigrams) and the extracted encoding features are fed into the model. The steps carried out are:
- The labels with value 'objective' are replaced with 'neutral'.
- The tweets are lemmatized.
- All bigrams are extracted from each tweet in order to find the HS-bigrams present in it.
- The features_unigrams and features_bigrams functions extract features from HS-unigrams and HS-bigrams (features are listed below).
- In addition, two encoding features are computed: the count of all-capitalized words in a tweet and the number of hashtags in a tweet.
- The CountVectorizer is fitted on the tweets with an n-gram range of (1,4).
- The n-gram vectors are then stacked together with the features obtained in steps (4, 5).
- After the test tweets are transformed, the same stacking (step 7) is repeated for the test data, and all 4 are returned (labels as numpy arrays).
Here, the features used were:
- tweet_token : The tweets whose sentiment is to be analyzed, in the form of n-gram vectors from the CountVectorizer
- total count of unigram tokens in the tweet with score(w, p) > 0, where score(w, p) is the lexicon score
- total score of all positive unigram tokens
- maximal score amongst all positive unigram tokens
- score of the last unigram token in the tweet with score(w, p) > 0
- total count of unigram tokens in the tweet with score(w, p) < 0, where score(w, p) is the lexicon score
- total score of all negative unigram tokens
- minimum score amongst all negative unigram tokens
- score of the last unigram token in the tweet with score(w, p) < 0
- total count of bigram tokens in the tweet with score(w, p) > 0, where score(w, p) is the lexicon score
- total score of all positive bigram tokens
- maximal score amongst all positive bigram tokens
- score of the last bigram token in the tweet with score(w, p) > 0
- total count of bigram tokens in the tweet with score(w, p) < 0, where score(w, p) is the lexicon score
- total score of all negative bigram tokens
- minimum score amongst all negative bigram tokens
- score of the last bigram token in the tweet with score(w, p) < 0
- number of words in a tweet with all letters capitalized
- count of hashtags in a tweet
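The two encoding features at the end of this list can be computed from the raw tweet text, before any lowercasing. The sketch below is one possible reading: the function name encoding_features is hypothetical, and skipping one-letter words such as "I" when counting capitalized words is an assumption rather than something stated for hw1.py.

```python
def encoding_features(tweet):
    """Return [count of all-caps words, count of hashtags] for a raw tweet."""
    words = tweet.split()
    # Assumption: a word counts as all-caps only if it is purely alphabetic,
    # fully upper-case, and longer than one letter.
    all_caps = sum(1 for w in words
                   if len(w) > 1 and w.isalpha() and w.isupper())
    # A hashtag is a token that starts with '#' followed by at least one character.
    hashtags = sum(1 for w in words if w.startswith("#") and len(w) > 1)
    return [all_caps, hashtags]


example = encoding_features("I LOVE this phone SO much #happy #blessed")
print(example)
```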
In the custom model, word n-grams in the range (1,4) were used with TF-IDF weighting. In addition, the features extracted from the Hashtag-Lexicon file (for both unigrams and bigrams) and the extracted encoding features are fed into the model, and the top 500 features are selected. The steps carried out are:
- The labels with value 'objective' are replaced with 'neutral'.
- The tweets are lemmatized.
- All bigrams are extracted from each tweet in order to find the HS-bigrams present in it.
- The features_unigrams and features_bigrams functions extract features from HS-unigrams and HS-bigrams (features are listed below).
- In addition, two encoding features are computed: the count of all-capitalized words in a tweet and the number of hashtags in a tweet.
- The TF-IDF vectorizer is fitted on the tweets with an n-gram range of (1,4).
- The TF-IDF vectors are then stacked together with the features obtained in steps (4, 5), and the top 500 features are selected.
- After the test tweets are transformed, the same process (step 7) is repeated for the test data, and all 4 are returned (labels as numpy arrays).
Here, the features used were:
- tweet_token : The tweets whose sentiment is to be analyzed, in the form of n-gram vectors from TF-IDF
- total count of unigram tokens in the tweet with score(w, p) > 0, where score(w, p) is the lexicon score
- total score of all positive unigram tokens
- maximal score amongst all positive unigram tokens
- score of the last unigram token in the tweet with score(w, p) > 0
- total count of unigram tokens in the tweet with score(w, p) < 0, where score(w, p) is the lexicon score
- total score of all negative unigram tokens
- minimum score amongst all negative unigram tokens
- score of the last unigram token in the tweet with score(w, p) < 0
- total count of bigram tokens in the tweet with score(w, p) > 0, where score(w, p) is the lexicon score
- total score of all positive bigram tokens
- maximal score amongst all positive bigram tokens
- score of the last bigram token in the tweet with score(w, p) > 0
- total count of bigram tokens in the tweet with score(w, p) < 0, where score(w, p) is the lexicon score
- total score of all negative bigram tokens
- minimum score amongst all negative bigram tokens
- score of the last bigram token in the tweet with score(w, p) < 0
- number of words in a tweet with all letters capitalized
- count of hashtags in a tweet
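The TF-IDF, stacking and top-500 selection steps of the custom model might look roughly like the sketch below. Several parts are assumptions: the README does not say how the top 500 features are chosen, so SelectKBest with the f_classif score is used here only as one plausible option; the extra feature columns are invented stand-ins for the lexicon and encoding features; and min(500, ...) is there only so the toy data, which has fewer than 500 columns, still runs.

```python
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, f_classif

# Invented toy data; the real data comes from train.csv and dev.csv.
train_tweets = ["i love this phone", "worst service ever",
                "the meeting is at noon", "love the new update",
                "i hate the delay", "flight departs at noon"]
train_labels = ["positive", "negative", "neutral",
                "positive", "negative", "neutral"]
test_tweets = ["love this update", "hate this service"]

# Hypothetical extra features per tweet (stand-ins for lexicon/encoding features).
train_extra = [[1, 1.5], [1, -2.0], [0, 0.0], [1, 1.5], [1, -1.0], [0, 0.0]]
test_extra = [[1, 1.5], [1, -1.0]]

# Fit TF-IDF on the training tweets, then stack the extra columns on the right.
vectorizer = TfidfVectorizer(ngram_range=(1, 4))
X_train = hstack([vectorizer.fit_transform(train_tweets),
                  csr_matrix(train_extra)]).tocsr()
X_test = hstack([vectorizer.transform(test_tweets),
                 csr_matrix(test_extra)]).tocsr()

# Keep the k best-scoring columns; the selector is fitted on training data only
# and then applied unchanged to the test data.
k = min(500, X_train.shape[1])
selector = SelectKBest(f_classif, k=k)
X_train_selected = selector.fit_transform(X_train, train_labels)
X_test_selected = selector.transform(X_test)
print(X_train_selected.shape, X_test_selected.shape)
```

A chi-squared score is a common alternative for text features, but it requires non-negative inputs, which the negative lexicon scores would violate.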
Special features of my classifier: I use a linear SVM classifier with a C value of 10 for all 4 models.
While the results are stable, the F1 score for one class was not very high because the data is skewed. This is a limitation of my classifier.