# Three-Way Sentiment Analysis for Twitter Tweets

### Overview

This NLP project implements a three-way polarity (positive, negative, neutral) classification system for tweets, *without* using NLTK's in-built sentiment analysis engine. 

Working from Twitter tweets, the project first runs (1) preprocessing, then compares the accuracy of (2) Dummy, Decision Tree, and Logistic Regression classifiers, (3) polarity lexicons (both automatically built and manually annotated), and (4) a combination of Logistic Regression with a polarity lexicon in predicting the sentiment polarity.

### Data Description

- training.json: contains ~15k raw tweets along with their polarity labels (1 = positive, 0 = neutral, -1 = negative); used for training.
- develop.json: same format as training.json; contains a smaller set of tweets used to test the predictions.

## Preprocessing Steps

First, read the JSON file line by line, extracting each tweet and its label into two separate lists.
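As a minimal sketch of this reading step, the snippet below parses a small in-memory stand-in for the data file. The two records are made up for illustration; only the `text` and `label` field names are taken from the notebook's own loading code, and `read_tweets` is a hypothetical helper name.

```python
import io
import json

# Two made-up records in a one-JSON-object-per-line layout (illustrative only).
sample_file = io.StringIO(
    '{"text": "I love this phone", "label": 1}\n'
    '{"text": "Worst update ever", "label": -1}\n'
)

def read_tweets(lines):
    """Return parallel lists of tweet texts and integer labels."""
    texts, labels = [], []
    for line in lines:
        line = line.strip()
        if not line:  # skip blank lines
            continue
        record = json.loads(line)
        texts.append(record["text"])
        labels.append(int(record["label"]))
    return texts, labels

texts, labels = read_tweets(sample_file)
print(texts)
print(labels)
```

Keeping texts and labels in parallel lists means index `i` in one list always refers to the same tweet as index `i` in the other.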

The preprocessing steps are then:

- segment tweets into sentences using an NLTK segmenter
- tokenize the sentences using an NLTK tokenizer
- lowercase all the words
- remove twitter usernames beginning with @ using regex
- remove URLs starting with http using regex
- process hashtags: tokenize hashtags, break down multi-word hashtags using a MaxMatch algorithm and NLTK word dictionary

In [1]:
import json
import re
import nltk
import os

# Download NLTK built-in vocabulary
nltk.download('words')
nltk.download('stopwords')
nltk.download('sentiwordnet')
nltk.download('word2vec_sample')
nltk.download('opinion_lexicon')

[nltk_data] Downloading package words to
[nltk_data]     C:\Users\ChristianV700\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ChristianV700\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package sentiwordnet to
[nltk_data]     C:\Users\ChristianV700\AppData\Roaming\nltk_data...
[nltk_data]   Package sentiwordnet is already up-to-date!
[nltk_data] Downloading package word2vec_sample to
[nltk_data]     C:\Users\ChristianV700\AppData\Roaming\nltk_data...
[nltk_data]   Package word2vec_sample is already up-to-date!
[nltk_data] Downloading package opinion_lexicon to
[nltk_data]     C:\Users\ChristianV700\AppData\Roaming\nltk_data...
[nltk_data]   Package opinion_lexicon is already up-to-date!


True

## 1 Preprocessing

### Some Helper Functions

In [2]:
lemmatizer = nltk.stem.wordnet.WordNetLemmatizer()
dictionary = set(nltk.corpus.words.words()) #Save words vocabulary in variable. To be used for MaxMatch

#Function to lemmatize word | Used during maxmatch
def lemmatize(word):
    lemma = lemmatizer.lemmatize(word,'v')
    if lemma == word:
        lemma = lemmatizer.lemmatize(word,'n')
    return lemma

#Function to implement the maxmatch algorithm for multi-word hashtags
def maxmatch(word,dictionary):
    if not word:
        return []
    for i in range(len(word),1,-1):
        first = word[0:i]
        rem = word[i:]
        if lemmatize(first).lower() in dictionary: #Important to lowercase lemmatized words before comparing in dictionary. 
            return [first] + maxmatch(rem,dictionary)
    first = word[0:1]
    rem = word[1:]
    return [first] + maxmatch(rem,dictionary)

#Function to preprocess a single tweet
def preprocess(tweet):
    
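    # Strip @usernames and URLs. Hashtags are collected before lowercasing,
    # so the hashtag tokens keep their original casing.
    tweet = re.sub(r"@\w+","",tweet).strip()
    tweet = re.sub(r"http\S+","",tweet).strip()
    hashtags = re.findall(r"#\w+",tweet)
    
    tweet = tweet.lower()
    tweet = re.sub(r"#\w+","",tweet).strip() 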
    
    hashtag_tokens = [] #Separate list for hashtags
    
    for hashtag in hashtags:
        hashtag_tokens.append(maxmatch(hashtag[1:],dictionary))        
    
    segmenter = nltk.data.load('tokenizers/punkt/english.pickle')
    segmented_sentences = segmenter.tokenize(tweet)
    
    #General tokenization
    processed_tweet = []
    
    word_tokenizer = nltk.tokenize.regexp.WordPunctTokenizer()
    for sentence in segmented_sentences:
        tokenized_sentence = word_tokenizer.tokenize(sentence.strip())
        processed_tweet.append(tokenized_sentence)
    
    #Processing the hashtags only when they exist in a tweet
    if hashtag_tokens:
        for tag_token in hashtag_tokens:
            processed_tweet.append(tag_token)
    
    return processed_tweet
    
#Function that takes in a file, and passes each tweet to the preprocessor
def preprocess_file(filename):
    tweets = []
    labels = []
    # Each line of the file holds one JSON object with "text" and "label" fields
    with open(filename) as f:
        for line in f:
            tweet_dict = json.loads(line)
            tweets.append(preprocess(tweet_dict["text"]))
            labels.append(int(tweet_dict["label"]))
    return tweets,labels

### How does the MaxMatch algorithm disentangle a tweet's hashtags?

In [3]:
# example 1
maxmatch('wecan',dictionary)

['we', 'can']

In [4]:
# example 2
maxmatch('casestudy',dictionary)

['cases', 'tu', 'd', 'y']

The second example shows how maxmatch incorrectly breaks down the word 'casestudy', returning 'cases' instead of 'case' in the first iteration. This is because it _greedily_ extracts the longest match, 'cases', first.

A possible improvement is to consider alternative segmentations, count the successful dictionary matches in each result, and return the one with the highest count.
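One way this idea could be sketched is shown below. It is an illustration under stated assumptions, not the notebook's method: `TOY_DICTIONARY` is a made-up vocabulary standing in for the NLTK word list, `best_segmentation` is a hypothetical name, lemmatization is left out, and the scoring rule (fewest unmatched single characters, then fewest tokens) is only one possible reading of "highest successful match count".

```python
from functools import lru_cache

# Made-up toy vocabulary standing in for the NLTK word list (assumption).
TOY_DICTIONARY = frozenset({"case", "cases", "study", "tu", "we", "can"})

def best_segmentation(word, dictionary=TOY_DICTIONARY):
    """Split `word` so that as few characters as possible are left unmatched."""

    @lru_cache(maxsize=None)
    def solve(start):
        # Returns (unmatched_count, token_count, tokens) for word[start:].
        if start == len(word):
            return (0, 0, ())
        # Fallback: emit one unmatched character, as plain MaxMatch does.
        unmatched, count, tokens = solve(start + 1)
        best = (unmatched + 1, count + 1, (word[start],) + tokens)
        # Try every dictionary word of length >= 2 that starts here,
        # not only the longest one.
        for end in range(start + 2, len(word) + 1):
            piece = word[start:end]
            if piece in dictionary:
                unmatched, count, tokens = solve(end)
                candidate = (unmatched, count + 1, (piece,) + tokens)
                if candidate[:2] < best[:2]:
                    best = candidate
        return best

    return list(solve(0)[2])

print(best_segmentation("casestudy"))  # ['case', 'study']
print(best_segmentation("wecan"))      # ['we', 'can']
```

Because every candidate split is scored rather than only the longest prefix, 'case' + 'study' wins over 'cases' followed by leftover fragments. The trade-off is extra work per hashtag, which memoization keeps manageable for hashtag-length strings.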

### Start Preprocessing

In [5]:
# Running the basic preprocessing module and capturing the data
train_data = preprocess_file(os.path.join(os.getcwd(),'data','sentiment','training.json')) # os.path.join keeps the path portable across operating systems
train_tweets = train_data[0]
train_labels = train_data[1]

Print out the first two pre-processed tweets of the training set:

In [6]:
print(train_tweets[:2])

[[['dear', 'the', 'newooffice', 'for', 'mac', 'is', 'great', 'and', 'all', ',', 'but', 'no', 'lync', 'update', '?'], ['c', "'", 'mon', '.']], [['how', 'about', 'you', 'make', 'a', 'system', 'that', 'doesn', "'", 't', 'eat', 'my', 'friggin', 'discs', '.'], ['this', 'is', 'the', '2nd', 'time', 'this', 'has', 'happened', 'and', 'i', 'am', 'so', 'sick', 'of', 'it', '!']]]


Next helper:

A simple script that goes through the training file and prints the original and processed versions of the first few tweets in which it detects a multi-word hashtag.

In [7]:
# Printing examples of multi-word hashtags (Doesn't work for multi sentence tweets)
count = 1
with open(os.path.join(os.getcwd(),'data','sentiment','training.json')) as f:
    for index,line in enumerate(f):
        if count > 5:
            break
        original_tweet = json.loads(line)["text"]
        hashtags = re.findall(r"#\w+",original_tweet)
        for hashtag in hashtags:
            if len(maxmatch(hashtag[1:],dictionary)) > 1:
                # If maxmatch returns more than one token, the algorithm has
                # split the hashtag into more than one word.
                print(str(count) + ". Original Tweet: " + original_tweet + "\nProcessed tweet: " + str(train_tweets[index]) + "\n")
                count += 1
                break

1. Original Tweet: If I make a game as a #windows10 Universal App. Will #xboxone owners be able to download and play it in November? @majornelson @Microsoft
Processed tweet: [['if', 'i', 'make', 'a', 'game', 'as', 'a', 'universal', 'app', '.'], ['will', 'owners', 'be', 'able', 'to', 'download', 'and', 'play', 'it', 'in', 'november', '?'], ['windows', '1', '0'], ['x', 'box', 'one']]

2. Original Tweet: Microsoft, I may not prefer your gaming branch of business. But, you do make a damn fine operating system. #Windows10 @Microsoft
Processed tweet: [['microsoft', ',', 'i', 'may', 'not', 'prefer', 'your', 'gaming', 'branch', 'of', 'business', '.'], ['but', ',', 'you', 'do', 'make', 'a', 'damn', 'fine', 'operating', 'system', '.'], ['Window', 's', '1', '0']]

3. Original Tweet: @MikeWolf1980 @Microsoft I will be downgrading and let #Windows10 be out for almost the 1st yr b4 trying it again. #Windows10fail
Processed tweet: [['i', 'will', 'be', 'downgrading', 'and', 'let', 'be', 'out', 'for', 

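The pre-processing module strips usernames and URLs and appends each hashtag as its own token list, although, as the examples show, the MaxMatch splits are not always ideal.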

The next step is to convert each processed tweet into a bag-of-words feature dictionary. We'll allow for options to remove stopwords during the process, and also to remove _rare_ words, i.e. words occurring fewer than n times across the whole training set.

In [8]:
from nltk.corpus import stopwords

stopwords = set(stopwords.words('english'))

# Count token frequencies over the whole training set; used later to identify words appearing fewer than n times

total_train_bow = {}

for tweet in train_tweets:
    for segment in tweet:
        for token in segment:
            total_train_bow[token] = total_train_bow.get(token,0) + 1

# Convert pre-processed tweets to bag-of-words feature dictionaries.
# Optionally removes stopwords and words occurring fewer than n times in the whole training set.
def convert_to_feature_dicts(tweets,remove_stop_words,n): 
    feature_dicts = []
    for tweet in tweets:
        # build feature dictionary for tweet
        feature_dict = {}
        for segment in tweet:
            for token in segment:
                if remove_stop_words and token in stopwords:
                    continue
                # .get avoids a KeyError for tokens never seen in the training set
                if n<=0 or total_train_bow.get(token,0)>=n:
                    feature_dict[token] = feature_dict.get(token,0) + 1
        feature_dicts.append(feature_dict)
    return feature_dicts

Now that there is a function to convert raw tweets to feature dictionaries, run it on training and development data.

In [9]:
from sklearn.feature_extraction import DictVectorizer
vectorizer = DictVectorizer()

# Conversion to feature dictionaries
train_set = convert_to_feature_dicts(train_tweets,True,2)

dev_data = preprocess_file(os.path.join(os.getcwd(),'data','sentiment','develop.json'))

dev_set = convert_to_feature_dicts(dev_data[0],False,0)

# Conversion to sparse representations
training_data = vectorizer.fit_transform(train_set)

development_data = vectorizer.transform(dev_set)

## 2 Machine Learning Classification

### Dummy Classifier
-> picks the most frequently occurring class as the output, serving as a baseline benchmark.

In [10]:
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

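# The Dummy Classifier always predicts the most frequent class, as specified in the strategy. 
# Note: it is fitted on the development set here, so the score below is the share of that set's most frequent class.
dummy_clf = DummyClassifier(strategy='most_frequent')
dummy_clf.fit(development_data,dev_data[1])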
dummy_predictions = dummy_clf.predict(development_data)

print("\nMost common class baseline accuracy: " + str(accuracy_score(dev_data[1],dummy_predictions)))


Most common class baseline accuracy: 0.42044833242208857


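For a three-way classification, the majority-class baseline is at least 1/3 (about 0.33) and can exceed 0.5 when one class dominates; here the most frequent class accounts for roughly 42% of the development set.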
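As a rough illustration of where a majority-class score comes from, the sketch below computes it by hand on two made-up label lists (both are assumptions, not the project data). The accuracy is simply the share of the most frequent label, so it depends on how balanced the classes are. `majority_baseline_accuracy` is a hypothetical helper name.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Return the most frequent label and the accuracy of always predicting it."""
    most_common_label, count = Counter(labels).most_common(1)[0]
    return most_common_label, count / len(labels)

# Made-up label lists for illustration only.
balanced = [1, 0, -1, 1, 0, -1]   # every class appears equally often
skewed = [0, 0, 0, 0, 1, -1]      # neutral dominates

print(majority_baseline_accuracy(balanced))
print(majority_baseline_accuracy(skewed))
```

With perfectly balanced classes the baseline sits at one third, while a dominant class pushes it higher, which is why the baseline is best read together with the class distribution.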

### Decision Tree Classifier
-> tuned with Grid Search over parameter combinations.

In [11]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_validate
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, classification_report

# Grid used to test the combinations of parameters
tree_param_grid = [
    {'criterion':['gini','entropy'], 'min_samples_leaf': [75,100,125,150,175], 'max_features':['sqrt','log2',None],
    }
]

tree_clf = GridSearchCV(DecisionTreeClassifier(),tree_param_grid,cv=10,scoring='accuracy')

tree_clf.fit(training_data,train_data[1])

print("Optimal parameters for Decision Tree: " + str(tree_clf.best_params_)) # Print best discovered combination of the parameters

tree_predictions = tree_clf.predict(development_data)

print("\nDecision Tree Accuracy: " + str(accuracy_score(dev_data[1],tree_predictions)))

Optimal parameters for Decision Tree: {'criterion': 'entropy', 'max_features': None, 'min_samples_leaf': 75}

Decision Tree Accuracy: 0.48715144887916895


The Decision Tree classifier scores relatively low, but better than the dummy.

### Logistic Regression Classifier

In [12]:
from sklearn.linear_model import LogisticRegression

log_param_grid = [
    {'C':[0.012,0.0125,0.130,0.135,0.14],
     'solver':['lbfgs'],'multi_class':['multinomial']
    }
]

log_clf = GridSearchCV(LogisticRegression(max_iter=200),log_param_grid,cv=10,scoring='accuracy')

log_clf.fit(training_data,train_data[1])

log_predictions = log_clf.predict(development_data)

print("Optimal parameters for LR: " + str(log_clf.best_params_))

print("Logistic Regression Accuracy: " + str(accuracy_score(dev_data[1],log_predictions)))

Optimal parameters for LR: {'C': 0.0125, 'multi_class': 'multinomial', 'solver': 'lbfgs'}
Logistic Regression Accuracy: 0.49371241115363584


**Summary:**

<table>
<tr>
<th>Classifier</th>
<th>Approx. Accuracy score (in %)</th>
</tr>
<tr>
<td>Dummy classifier (most common class)</td>
<td>42</td>
</tr>
<tr>
<td>Decision Tree classifier</td>
<td>48.7</td>
</tr>
<tr>
<td>Logistic Regression classifier</td>
<td>49.4</td>
</tr>
</table>

## 3 Polarity Lexicons

Integrate external information into the training set in the form of polarity scores for the tweets. 

Build two automatic lexicons, compare them with NLTK's manually annotated set, and then add that information to the training data.

**Lexicon 1** will be built through SentiWordNet. This lexicon contains pre-calculated positive, negative and neutral (objective) sentiment scores for some words in WordNet. As this information is arranged in the form of synsets, take the most common polarity across a word's senses (and take neutral in case of a tie).
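The voting idea can be sketched without any NLTK data, as below. The sense scores are made-up numbers, not real SentiWordNet values, and `sense_polarity` and `word_polarity` are hypothetical names. The tie handling is an assumption that mirrors the comparisons in the cell that follows: a tie between positive and negative falls to neutral, while a tie with neutral is resolved toward the polar class.

```python
def sense_polarity(pos, neg, obj):
    """Polarity of one sense: the strictly largest of its three scores, else neutral."""
    if pos > neg and pos > obj:
        return 1
    if neg > pos and neg > obj:
        return -1
    return 0

def word_polarity(senses):
    """Vote over a word's senses; each sense is a (pos, neg, obj) score triple."""
    votes = [sense_polarity(*scores) for scores in senses]
    pos_count = votes.count(1)
    neg_count = votes.count(-1)
    neutral_count = votes.count(0)
    if pos_count > neg_count and pos_count >= neutral_count:
        return 1
    if neg_count > pos_count and neg_count >= neutral_count:
        return -1
    return 0

# Made-up (pos, neg, obj) scores per sense, for illustration only.
good_senses = [(0.75, 0.0, 0.25), (0.625, 0.0, 0.375), (0.0, 0.0, 1.0)]
awful_senses = [(0.0, 0.875, 0.125), (0.0, 0.625, 0.375)]
mixed_senses = [(0.75, 0.0, 0.25), (0.0, 0.75, 0.25)]

print(word_polarity(good_senses))   # 1
print(word_polarity(awful_senses))  # -1
print(word_polarity(mixed_senses))  # 0
```

Voting over senses makes the lexicon less sensitive to a single unusual sense, at the cost of treating rare and frequent senses as equally important.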

In [13]:
from nltk.corpus import sentiwordnet as swn
from nltk.corpus import wordnet as wn
import random

In [14]:
swn_positive = []
swn_negative = []

def get_polarity_type(synset_name):
    swn_synset =  swn.senti_synset(synset_name)
    if not swn_synset:
        return None
    elif swn_synset.pos_score() > swn_synset.neg_score() and swn_synset.pos_score() > swn_synset.obj_score():
        return 1
    elif swn_synset.neg_score() > swn_synset.pos_score() and swn_synset.neg_score() > swn_synset.obj_score():
        return -1
    else:
        return 0

for synset in wn.all_synsets():      
    # count the polarity of every sense of every lemma in the synset
    pos_count = 0
    neg_count = 0
    neutral_count = 0
    
    for lemma in synset.lemma_names():
        for syns in wn.synsets(lemma):
            polarity = get_polarity_type(syns.name()) # look up once per sense
            if polarity==1:
                pos_count+=1
            elif polarity==-1:
                neg_count+=1
            else:
                neutral_count+=1
    
    # >= neutral: a word with more positive than negative senses is kept as positive
    # even when it has equally many neutral senses (and likewise for negative)
    if pos_count > neg_count and pos_count >= neutral_count:
        swn_positive.append(synset.lemma_names()[0])
    elif neg_count > pos_count and neg_count >= neutral_count:
        swn_negative.append(synset.lemma_names()[0])       

swn_positive = list(set(swn_positive))
swn_negative = list(set(swn_negative))

print('Positive words: ' + str(random.sample(swn_positive,5)))
print('Negative Words: ' + str(random.sample(swn_negative,5)))

Positive words: ['associable', 'thoroughly', 'palmatifid', 'probative', 'arable']
Negative Words: ['associative', 'hairlessness', 'spastic_bladder', 'unimpeded', 'unbefitting']


To decide the polarity of a synset, its lemma names are extracted, and every sense (synset) of each lemma name is passed to the pre-supplied 'get_polarity_type' function. The positive, negative and neutral senses are counted, and the head lemma of the synset is appended to the positive or negative list according to the majority; synsets that come out neutral are not added to either list. The head lemma (the first entry of lemma_names) was chosen because it best represents the synset object.

The printed words above are a random sample of positive and negative.

**Lexicon 2** will use the word2vec (CBOW) vectors included in NLTK. 

Using a small set of positive and negative seed terms, the next cells calculate the cosine similarity between the vector of each seed term and the vector of another word.
Gensim is used to iterate over the words in model.vocab and compare each of them against the seed terms.

After calculating the cosine similarity of a word with both the positive and the negative seeds, average the scores, flipping the sign for the negative seeds.

A threshold of ±0.03 will be used to determine if words are positive or negative. 
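The scoring rule can be illustrated with plain Python, as in the sketch below. The two-dimensional vectors are made up for readability (real word2vec vectors have hundreds of dimensions), and `cosine`, `seed_score` and `classify` are hypothetical helper names rather than Gensim calls.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def seed_score(vec, positive_seed_vecs, negative_seed_vecs):
    """Average similarity to all seeds, with the sign flipped for negative seeds."""
    pos = sum(cosine(vec, seed) for seed in positive_seed_vecs)
    neg = sum(cosine(vec, seed) for seed in negative_seed_vecs)
    return (pos - neg) / (len(positive_seed_vecs) + len(negative_seed_vecs))

def classify(score, threshold=0.03):
    """Map a score to +1, -1 or 0 using a symmetric threshold."""
    if score > threshold:
        return 1
    if score < -threshold:
        return -1
    return 0

# Made-up 2-D vectors standing in for word embeddings (assumption).
positive_seed_vecs = [(1.0, 0.1), (0.9, 0.2)]
negative_seed_vecs = [(-1.0, 0.1), (-0.9, 0.2)]
candidates = {"pleasant": (0.8, 0.3), "dreadful": (-0.9, 0.2), "chair": (0.0, 1.0)}

for word, vec in candidates.items():
    print(word, classify(seed_score(vec, positive_seed_vecs, negative_seed_vecs)))
```

Words equally close to both seed groups end up with a score near zero and stay out of both lists, which is the role of the threshold.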

In [15]:
import gensim
from nltk.data import find
import random

In [16]:
positive_seeds = ["good","nice","excellent","positive","fortunate","correct","superior","great"]
negative_seeds = ["bad","nasty","poor","negative","unfortunate","wrong","inferior","awful"]

word2vec_sample = str(find('models/word2vec_sample/pruned.word2vec.txt'))
model = gensim.models.KeyedVectors.load_word2vec_format(word2vec_sample,binary=False)

wv_positive = []
wv_negative = []

for word in model.vocab: # gensim < 4.0; newer releases expose the vocabulary as model.key_to_index
    try:
        word=word.lower()
    
        pos_score = 0.0
        neg_score = 0.0
    
        for seed in positive_seeds:
            pos_score = pos_score + model.similarity(word,seed)
    
        for seed in negative_seeds:
            neg_score = neg_score + model.similarity(word,seed)
        
        avg = (pos_score - neg_score)/16 # Total number of seeds is 16
    
        if avg>0.03:
            wv_positive.append(word)
        elif avg<-0.03:
            wv_negative.append(word)
    except KeyError:
        # The lowercased word (or a seed) has no vector in the model; skip it
        pass

print('Positive words: ' + str(random.sample(wv_positive,5)))
print('Negative Words: ' + str(random.sample(wv_negative,5)))

Positive words: ['proudly', 'professional', 'handsome', 'locate', 'distinguished']
Negative Words: ['millenarianism', 'attacked', 'unsympathetic', 'rots', 'cranky']


Again, the printed samples are randomised.

Judging by the samples, this looks like a good set of both positive and negative words.

See how it compares with NLTK's manually annotated set.

### NLTK Lexicon

The lexicon included with NLTK contains a list of positive and negative words.

First, investigate what percentage of the words in the manual lexicon are in each of the automatic lexicons.
Second, only for those words which overlap and are not in the seed set, evaluate the accuracy of each of the automatic lexicons.
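To make the two measures concrete, the sketch below works them out on tiny made-up word lists (all words and the helper names `coverage` and `agreement` are assumptions for illustration). Unlike the notebook's cell, it deduplicates the manual lexicon before dividing, so the numbers are not directly comparable with the real output.

```python
def coverage(manual_pos, manual_neg, auto_pos, auto_neg):
    """Percentage of manual-lexicon words that also appear in the automatic lexicon."""
    manual = set(manual_pos) | set(manual_neg)
    auto = set(auto_pos) | set(auto_neg)
    return len(manual & auto) / len(manual) * 100

def agreement(manual_pos, manual_neg, auto_pos, auto_neg, seeds=()):
    """Percentage of overlapping non-seed words given the same polarity by both lexicons."""
    manual = set(manual_pos) | set(manual_neg)
    auto = set(auto_pos) | set(auto_neg)
    overlap = (manual & auto) - set(seeds)
    same_pos = set(manual_pos) & set(auto_pos) & overlap
    same_neg = set(manual_neg) & set(auto_neg) & overlap
    return (len(same_pos) + len(same_neg)) / len(overlap) * 100

# Made-up lexicons for illustration only.
manual_pos = ["good", "happy", "brave", "calm"]
manual_neg = ["bad", "sad", "cruel", "angry"]
auto_pos = ["good", "happy", "sad", "shiny"]
auto_neg = ["bad", "cruel", "brave"]
seeds = ["good", "bad"]

print(coverage(manual_pos, manual_neg, auto_pos, auto_neg))          # 75.0
print(agreement(manual_pos, manual_neg, auto_pos, auto_neg, seeds))  # 50.0
```

Six of the eight manual words occur in the automatic lexicon (75%). After removing the seeds, four overlapping words remain, of which two ('happy' and 'cruel') carry the same polarity in both lexicons (50%).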

In [17]:
from nltk.corpus import opinion_lexicon
import math

In [18]:
positive_words = opinion_lexicon.positive()
negative_words = opinion_lexicon.negative()

# Calculate the percentage of words in the manually annotated lexicon set, that also appear in an automatic lexicon.
def get_perc_manual(manual_pos,manual_neg,auto_pos,auto_neg):
    return len(set(manual_pos+manual_neg).intersection(set(auto_pos+auto_neg)))/len(manual_pos+manual_neg)*100

print("Percentage of words in manual lexicons also present in the automatic lexicon")
print("First automatic lexicon: "+ str(get_perc_manual(positive_words,negative_words,swn_positive,swn_negative)))
print("Second automatic lexicon: "+ str(get_perc_manual(positive_words,negative_words,wv_positive,wv_negative)))

# Calculate the accuracy of words in the automatic lexicon. Assuming that the manual lexicons are accurate, it calculates the percentage of words that occur in both positive and negative (respectively) lists of automatic and manual lexicons.
def get_lexicon_accuracy(manual_pos,manual_neg,auto_pos,auto_neg):
    common_words = set(manual_pos+manual_neg).intersection(set(auto_pos+auto_neg))-set(negative_seeds)-set(positive_seeds)
    return (len(set(manual_pos) & set(auto_pos) & common_words)+len(set(manual_neg) & set(auto_neg) & common_words))/len(common_words)*100

print("\nAccuracy of lexicons: ")
print("First automatic lexicon: "+ str(get_lexicon_accuracy(positive_words,negative_words,swn_positive,swn_negative)))
print("Second automatic lexicon: "+ str(get_lexicon_accuracy(positive_words,negative_words,wv_positive,wv_negative)))

Percentage of words in manual lexicons also present in the automatic lexicon
First automatic lexicon: 13.610251878038001
Second automatic lexicon: 37.796435410222415

Accuracy of lexicons: 
First automatic lexicon: 84.46389496717724
Second automatic lexicon: 98.94159153273226


The second lexicon shares more words with the manual lexicon and classifies them more accurately, as it uses the most intuitive way of creating positive/negative lexicons, i.e. by identifying the words most similar to known seed terms.

### Lexicons for Classification

What if we used the lexicons for the main classification problem? 

Let's create a function that calculates a polarity score for a sentence based on a given lexicon. We'll count the positive and negative words that appear in the tweet, and then return a +1 if there are more positive words, a -1 if there are more negative words, and a 0 otherwise.

We'll then compare the results of the three lexicons on the development set. 

In [19]:
# All lexicons are converted to sets for faster membership lookups
manual_pos_set = set(positive_words)
manual_neg_set = set(negative_words)

syn_pos_set = set(swn_positive)
syn_neg_set = set(swn_negative)

wordvec_pos_set = set(wv_positive)
wordvec_neg_set = set(wv_negative)

# Function for polarity score of a sentence based on the frequency of positive or negative words
def get_polarity_score(sentence,pos_lexicon,neg_lexicon):
    pos_count = 0
    neg_count = 0
    
    for word in sentence:
        if word in pos_lexicon:
            pos_count+=1
        if word in neg_lexicon:
            neg_count+=1
    if pos_count>neg_count:
        return 1
    elif neg_count>pos_count:
        return -1
    else:
        return 0
    
# Function that scores each tweet, compares the score against the dataset's labels, and returns the accuracy as a percentage
def data_polarity_accuracy(dataset,datalabels,pos_lexicon,neg_lexicon):
    accuracy_count = 0
    
    for index,tweet in enumerate(dataset):
        if datalabels[index]==get_polarity_score([word for sentence in tweet for word in sentence],pos_lexicon,neg_lexicon):
            accuracy_count+=1
    return (accuracy_count/len(dataset))*100


print("Manual lexicon accuracy: " + str(data_polarity_accuracy(dev_data[0],dev_data[1],manual_pos_set,manual_neg_set)))
print("First auto lexicon accuracy: " + str(data_polarity_accuracy(dev_data[0],dev_data[1],syn_pos_set,syn_neg_set)))
print("Second auto lexicon accuracy: " + str(data_polarity_accuracy(dev_data[0],dev_data[1],wordvec_pos_set,wordvec_neg_set)))

Manual lexicon accuracy: 45.2159650082012
First auto lexicon accuracy: 42.3728813559322
Second auto lexicon accuracy: 45.16129032258064


The results reflect the quality metrics obtained in the previous section, with the manual and the second (word vector) lexicon slightly ahead. However, none of the lexicons does as well as the Decision Tree or Logistic Regression classifier trained without the polarity information.

## 4 Combo: Polarity Lexicon with Logistic Regression

Lastly, add the lexicon polarity score as a feature for a classifier. 

For this, create a modified version of the feature extraction function to integrate the extra *feature* and retrain the logistic regression classifier.

In [20]:
def convert_to_feature_dicts_v2(tweets,manual,first,second,remove_stop_words,n): 
    feature_dicts = []
    
    for tweet in tweets:
        # Build feature dictionary for tweet
        feature_dict = {}
        
        for segment in tweet:
            for token in segment:
                if remove_stop_words and token in stopwords:
                    continue
                if n<=0 or total_train_bow.get(token,0)>=n:
                    feature_dict[token] = feature_dict.get(token,0) + 1

        # Add the selected lexicon polarity scores as extra features
        words = [word for sentence in tweet for word in sentence]
        if manual:
            feature_dict['manual_polarity'] = get_polarity_score(words,manual_pos_set,manual_neg_set)
        if first:
            feature_dict['synset_polarity'] = get_polarity_score(words,syn_pos_set,syn_neg_set)
        if second:
            feature_dict['wordvec_polarity'] = get_polarity_score(words,wordvec_pos_set,wordvec_neg_set)

        feature_dicts.append(feature_dict)      
    return feature_dicts

In [21]:
# convert training set
training_set_v2 = convert_to_feature_dicts_v2(train_tweets,True,False,True,True,2)

training_data_v2 = vectorizer.fit_transform(training_set_v2)

In [22]:
# convert dev set
dev_set_v2 = convert_to_feature_dicts_v2(dev_data[0],True,False,True,False,0)

development_data_v2 = vectorizer.transform(dev_set_v2)

In [23]:
# train logistic regression
log_clf_v2 = LogisticRegression(C=0.012,solver='lbfgs',multi_class='multinomial')

log_clf_v2.fit(training_data_v2,train_data[1])

LogisticRegression(C=0.012, multi_class='multinomial')

In [24]:
# prediction
log_predictions_v2 = log_clf_v2.predict(development_data_v2)

print("Logistic Regression V2 (with polarity scores) Accuracy: " + str(accuracy_score(dev_data[1],log_predictions_v2)))

Logistic Regression V2 (with polarity scores) Accuracy: 0.5079278294149808


Though small (about 1.4 percentage points, from 49.4% to 50.8%), integrating the polarity data did improve the Logistic Regression classifier.

This concludes the project of building a three-way polarity classifier for tweets.