# Sentiment Analysis (supervised, feature-based)

This tutorial shows how to perform supervised sentiment analysis using feature extraction and machine learning.

## Import all important packages

In [15]:
import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC

from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

# The next imports are only needed for the preprocessing stemming function from nltk
from nltk.tokenize import TweetTokenizer
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
#from utils.nlputils import preprocess_text
from text_preprocessing import preprocess_text

In [16]:
tweet_tokenizer = TweetTokenizer()
porter_stemmer = PorterStemmer()
wordnet_lemmatizer = WordNetLemmatizer()

## Prepare training data

### Load training data

Use `pandas` to read files. Here I reference a file that contains 703 tweets, each with a polarity label.

In [17]:
df_tweets_train = pd.read_csv('data/twitter-sentiment/twitter-sentiment-bowden-training.csv')

# Print the first 5 lines
df_tweets_train.head()

Unnamed: 0,tweet,senti
0,@united UA5396 can wait for me. I'm on the gro...,0
1,I hate Time Warner! Soooo wish I had Vios. Can...,0
2,Tom Shanahan's latest column on SDSU and its N...,2
3,Found the self driving car!! /IWo3QSvdu2,2
4,@united arrived in YYZ to take our flight to T...,0


The values for "senti", i.e., the labels, are:
- 0 means negative
- 2 means neutral
- 4 means positive

From the `pandas` data frame, generate two lists, one containing the tweets, the other containing the polarities.

In [18]:
polarities_train = df_tweets_train['senti'].tolist() 
tweets_train = df_tweets_train['tweet'].tolist() 
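
As a small illustration of this step, the sketch below builds a made-up three-row data frame with the same `tweet` and `senti` columns, converts both columns to lists, and maps the numeric labels to readable names. The example rows and the `label_names` dictionary are assumptions for illustration and not part of the dataset.

```python
import pandas as pd

# Hypothetical stand-in for the training file (same column names)
df_example = pd.DataFrame({
    'tweet': ['I love this phone', 'The flight was delayed again', 'Meeting at 3pm'],
    'senti': [4, 0, 2],
})

# One list per column, aligned by position
tweets_example = df_example['tweet'].tolist()
polarities_example = df_example['senti'].tolist()

# Readable names for the numeric labels described above
label_names = {0: 'negative', 2: 'neutral', 4: 'positive'}

for tweet, polarity in zip(tweets_example, polarities_example):
    print('{} => {}'.format(tweet, label_names[polarity]))
```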

### Preprocess tweets

Use the `preprocess_text()` function to preprocess all tweets.

In [19]:
processed_tweets_train = [''] * len(tweets_train)

for idx, doc in enumerate(tweets_train):
    processed_tweets_train[idx] = preprocess_text(doc, tokenizer=tweet_tokenizer, lemmatizer=wordnet_lemmatizer)
    

TypeError: preprocess_text() got an unexpected keyword argument 'tokenizer'
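
The `TypeError` above indicates that the imported `preprocess_text()` does not accept the `tokenizer` and `lemmatizer` keywords used in the call. The following is only a rough sketch of a helper with a compatible signature. Its body (lowercasing, tokenizing, lemmatizing, re-joining with spaces) is an assumption and not the original helper from `utils.nlputils`. The two stand-in classes are hypothetical and merely mimic the `tokenize()` and `lemmatize()` methods that NLTK's `TweetTokenizer` and `WordNetLemmatizer` provide, so that the block runs without NLTK data.

```python
class WhitespaceTokenizer:
    """Hypothetical stand-in exposing tokenize(), like nltk's TweetTokenizer."""
    def tokenize(self, text):
        return text.split()


class SuffixLemmatizer:
    """Hypothetical stand-in exposing lemmatize(); only strips a plural 's'."""
    def lemmatize(self, token):
        if len(token) > 3 and token.endswith('s') and not token.endswith('ss'):
            return token[:-1]
        return token


def preprocess_text_sketch(text, tokenizer, lemmatizer=None):
    # Assumed behaviour: lowercase, tokenize, optionally lemmatize, re-join
    tokens = tokenizer.tokenize(text.lower())
    if lemmatizer is not None:
        tokens = [lemmatizer.lemmatize(token) for token in tokens]
    return ' '.join(tokens)


example = preprocess_text_sketch('Delayed flights again',
                                 tokenizer=WhitespaceTokenizer(),
                                 lemmatizer=SuffixLemmatizer())
print(example)
```

With NLTK and its WordNet data installed, `tweet_tokenizer` and `wordnet_lemmatizer` from above could be passed in place of the stand-ins.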

## Calculate feature set

Use `TfidfVectorizer` to calculate the document word matrix, which will be the feature set.

In [20]:
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 1))

X_train_tfidf = tfidf_vectorizer.fit_transform(processed_tweets_train)

ValueError: empty vocabulary; perhaps the documents only contain stop words
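
The `ValueError` above follows from the failed preprocessing step: every entry of `processed_tweets_train` is still an empty string, so there is no vocabulary to build. To see what the vectorizer produces on non-empty input, here is a minimal sketch on three made-up, already preprocessed documents (the documents and variable names are assumptions for illustration).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up, already preprocessed documents (assumption for illustration)
docs_example = ['good flight good crew', 'bad flight', 'good food']

vectorizer_example = TfidfVectorizer(ngram_range=(1, 1))
X_example = vectorizer_example.fit_transform(docs_example)

# One row per document, one column per distinct word in the vocabulary
print(X_example.shape)
print(sorted(vectorizer_example.vocabulary_))
```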

## Training of classifier

Packages like `scikit-learn` make it extremely easy to train machine learning classifiers. `scikit-learn` offers a variety of classifier algorithms. Here, I use `MultinomialNB`, a Multinomial Naive Bayes classifier, which is often a good choice for text documents.

But feel free to try other classifiers! 

Very conveniently, all `scikit-learn` classifiers use the same methods, which makes replacing classifiers very quick and easy. For the full list of supported classifiers, see the `scikit-learn` website.

In [21]:
clf = MultinomialNB().fit(X_train_tfidf, polarities_train)
#clf = DecisionTreeClassifier().fit(X_train_tfidf, polarities_train)
#clf = LinearSVC().fit(X_train_tfidf, polarities_train)
#clf = KNeighborsClassifier().fit(X_train_tfidf, polarities_train)

NameError: name 'X_train_tfidf' is not defined
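
Because all classifiers share `fit()` and `predict()`, swapping them is a one-line change. The sketch below trains the four classifiers imported at the top on a made-up toy corpus (the documents, labels, and parameter choices such as `n_neighbors=3` are assumptions for illustration) and collects their predictions on the training documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC

# Made-up toy corpus with the tutorial's label scheme (0 = negative, 4 = positive)
docs_toy = ['great service', 'love this airline', 'great crew',
            'terrible delay', 'hate this airline', 'terrible food']
labels_toy = [4, 4, 4, 0, 0, 0]

X_toy = TfidfVectorizer().fit_transform(docs_toy)

classifiers = [MultinomialNB(), DecisionTreeClassifier(random_state=0),
               KNeighborsClassifier(n_neighbors=3), LinearSVC()]

results = {}
for clf_toy in classifiers:
    # Same two calls regardless of the algorithm
    clf_toy.fit(X_toy, labels_toy)
    results[type(clf_toy).__name__] = clf_toy.predict(X_toy).tolist()

for name, predictions in results.items():
    print(name, predictions)
```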

## Testing the classifier

Check the classifier with 2 sample documents to see the 2 basic required steps:

- convert documents into the document word matrix; note that we have to make sure that the matrix is consistent (with respect to the vocabulary) with the one generated from the training data

- run classifier over the document word matrix (i.e., the feature set)

In [None]:
docs_new = ['soccer is so much fun', 'being hungry is shit']

# Use the fitted vectorizer to transform the documents: transform() not fit_transform()!
X_new_tfidf = tfidf_vectorizer.transform(docs_new)

# Use the trained classifier to predict the polarities
predicted = clf.predict(X_new_tfidf)

for doc, category in zip(docs_new, predicted):
    print('{} => {}'.format(doc, category))
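
Why `transform()` and not `fit_transform()` matters can be seen from the matrix shapes. In this minimal sketch (made-up documents, hypothetical variable names), refitting on the new document yields a matrix whose columns no longer match the training vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ['good flight', 'bad flight', 'good food']  # made-up training documents
new_docs = ['good soccer game']                          # made-up unseen document

vec = TfidfVectorizer()
X_train_example = vec.fit_transform(train_docs)

# transform(): reuses the training vocabulary, so the columns line up
X_new_ok = vec.transform(new_docs)

# fit_transform() on the new documents would build a different vocabulary
X_new_wrong = TfidfVectorizer().fit_transform(new_docs)

print(X_train_example.shape, X_new_ok.shape, X_new_wrong.shape)
```

Words that never occurred in the training data (here "soccer" and "game") are simply ignored by `transform()`.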

## Load test data

Now perform the exact same steps as I did for the training data:

- load the file with the test data

- convert `pandas` data frame into two lists (tweets + polarities)

- preprocess tweets (the same way we did the tweets of the training data)

In [None]:
df_tweets_test = pd.read_csv('data/twitter-sentiment/twitter-sentiment-bowden-test.csv')

polarities_test = df_tweets_test['senti'].tolist() 
tweets_test = df_tweets_test['tweet'].tolist() 

processed_tweets_test = [''] * len(tweets_test)

for idx, doc in enumerate(tweets_test):
    processed_tweets_test[idx] = preprocess_text(doc, tokenizer=tweet_tokenizer, lemmatizer=wordnet_lemmatizer)
        

## Perform test

The last two steps are as outlined above:

- convert documents into the document word matrix; note that we have to make sure that the matrix is consistent (with respect to the vocabulary) with the one generated from the training data

- run classifier over the document word matrix (i.e., the feature set)


In [None]:
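# Use the fitted vectorizer to transform the documents: transform() not fit_transform()!
# Transform the preprocessed tweets, since the vectorizer was fitted on preprocessed training tweets
X_test_tfidf = tfidf_vectorizer.transform(processed_tweets_test)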

predicted = clf.predict(X_test_tfidf)

## Evaluate results

`scikit-learn` provides a series of methods to calculate a variety of metrics, particularly the precision, recall, and f1-score.


In [None]:
print(confusion_matrix(polarities_test, predicted))
print()
print(classification_report(polarities_test, predicted))
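
To make the two reports easier to read, here is a minimal sketch on six made-up gold labels and predictions (all values are assumptions for illustration). In the confusion matrix, rows are the true labels and columns the predicted labels.

```python
from sklearn.metrics import confusion_matrix, classification_report

# Made-up gold labels and predictions (0 = negative, 2 = neutral, 4 = positive)
y_true_example = [0, 0, 2, 2, 4, 4]
y_pred_example = [0, 2, 2, 2, 4, 0]

# Rows are true labels, columns are predicted labels, both ordered 0, 2, 4
cm_example = confusion_matrix(y_true_example, y_pred_example, labels=[0, 2, 4])
print(cm_example)

# output_dict=True returns the per-class metrics as a nested dictionary
report_example = classification_report(y_true_example, y_pred_example, output_dict=True)
print(report_example['2'])
```

For the neutral class, three tweets were predicted as neutral and two of them were correct (precision 2/3), while both truly neutral tweets were found (recall 1.0).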

## Use additional features (optional)

This example illustrates that a feature vector can contain all kinds of dimensions/features. So far we only used the feature vectors derived from the words. In the following, we add two more values to this feature vector, which we get from the unsupervised sentiment analysis, to boost the classifier.

Remember, for the unsupervised sentiment analysis, we assigned each word a positive or negative numerical value and summed up all values within a document to get the final sentiment value. We can add this information to the feature vector. Since the feature values must be non-negative (`MultinomialNB` does not accept negative values), we add 2 dimensions to the vector: one indicating that the sentiment score was positive and one indicating that the score was negative.
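
The encoding can be sketched as follows. The helper `polarity_to_indicators` is a hypothetical name introduced here for illustration. The second half of the block shows, on made-up data, that `MultinomialNB` rejects a signed score passed in directly.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def polarity_to_indicators(polarity):
    """Hypothetical helper: map a score in {-1, 0, 1} to [is_positive, is_negative]."""
    return [1.0 if polarity > 0 else 0.0, 1.0 if polarity < 0 else 0.0]

indicators = [polarity_to_indicators(p) for p in (1.0, -1.0, 0.0)]
print(indicators)

# Passing the signed score directly is rejected by MultinomialNB
X_signed = np.array([[1.0], [-1.0], [1.0], [-1.0]])
y_signed = [4, 0, 4, 0]
try:
    MultinomialNB().fit(X_signed, y_signed)
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```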


### Unsupervised sentiment analysis step

We copy the following part from the tutorial for the unsupervised sentiment analysis:

- load sentiment lexicon

- define method `calc_polarity` that calculates the polarity of a document depending on the sentiment score

In [None]:
df_sentiment = pd.read_csv('data/sentiment-lexicon/sentilex-vader.txt', sep='\t', encoding = "ISO-8859-1", header=None)

sentiment_dict = {}

for index, row in df_sentiment.iterrows():
    token, score = row[0], row[1]
    sentiment_dict[token] = score / 4.0 # normalize score from [-4,...,4] to [-1,...,1]

def calc_polarity(doc, num_polarities=3):
    doc_score = 0.0
    for token in doc.split(): # Here split() is sufficient
        if token in sentiment_dict:
            doc_score += sentiment_dict[token]
    if doc_score > 0:
        return 1.0
    elif doc_score < 0:
        return -1.0
    else:
        if num_polarities == 3:
            return 0.0
        else:
            return 1.0
    return 0.0 # Just to be sure, should never be reached
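
To illustrate the behaviour of `calc_polarity` without loading the lexicon file, the sketch below restates the same logic in condensed form as `calc_polarity_example` and runs it against a tiny made-up lexicon. Both the lexicon scores and the function name are assumptions for illustration.

```python
# Tiny made-up lexicon with scores already normalized to [-1, 1]
sentiment_dict_example = {'love': 0.8, 'great': 0.775, 'hate': -0.675, 'delay': -0.3}

def calc_polarity_example(doc, lexicon, num_polarities=3):
    # Sum the scores of all known tokens, then keep only the sign
    doc_score = sum(lexicon.get(token, 0.0) for token in doc.split())
    if doc_score > 0:
        return 1.0
    if doc_score < 0:
        return -1.0
    # Score of exactly 0: neutral with 3 classes, otherwise counted as positive
    return 0.0 if num_polarities == 3 else 1.0

print(calc_polarity_example('i love this great airline', sentiment_dict_example))
print(calc_polarity_example('i hate every delay', sentiment_dict_example))
print(calc_polarity_example('meeting at noon', sentiment_dict_example))
print(calc_polarity_example('meeting at noon', sentiment_dict_example, num_polarities=2))
```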

### Feature extraction

The next lines cover these 3 main steps:

- Calculate the basic feature set using the `TfidfVectorizer`, same as above

- Calculate the extended feature set with 2 dimensions for each feature vector

- Merge both feature sets into one


In [None]:
from scipy.sparse import csr_matrix, hstack

X_train_tfidf = tfidf_vectorizer.fit_transform(processed_tweets_train)

print(type(X_train_tfidf))
print(X_train_tfidf.shape)

# Two extra columns per tweet: [sentiment is positive, sentiment is negative]
new_feature_col_train = csr_matrix((X_train_tfidf.shape[0], 2), dtype=float)

print(new_feature_col_train.shape)

for idx, doc in enumerate(tweets_train):
    polarity = calc_polarity(doc)
    if polarity > 0:
        new_feature_col_train[idx, 0] = 1
    elif polarity < 0:
        new_feature_col_train[idx, 1] = 1
    #new_feature_col_train[idx] = ((calc_polarity(doc) + 1.0) / 2.0)

# Merge the 699x2677 matrix with the 699x2 matrix
X_train_tfidf = hstack((X_train_tfidf, new_feature_col_train))

print(X_train_tfidf.shape)
print(len(polarities_train))
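
The merge step can be tried in isolation on small made-up matrices. This sketch uses `lil_matrix` for the two extra columns, which is swapped in here because it is designed for element-wise assignment. All values and variable names are assumptions for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix, lil_matrix, hstack

# Stand-in for a TF-IDF matrix: 3 documents, 4 vocabulary terms (made-up values)
X_words = csr_matrix(np.array([[0.5, 0.0, 0.5, 0.0],
                               [0.0, 1.0, 0.0, 0.0],
                               [0.0, 0.0, 0.3, 0.7]]))

# lil_matrix supports cheap element-wise assignment (swapped in for csr_matrix here)
X_extra = lil_matrix((X_words.shape[0], 2), dtype=float)
for idx, polarity in enumerate([1.0, -1.0, 0.0]):  # made-up polarities
    if polarity > 0:
        X_extra[idx, 0] = 1
    elif polarity < 0:
        X_extra[idx, 1] = 1

# Append the 2 extra columns to the right of the word features
X_combined = hstack((X_words, X_extra)).tocsr()
print(X_combined.shape)
print(X_combined.toarray()[:, -2:])
```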

### Training and evaluating classifier

From here on, everything is the same as above, except that the feature set of the test data also needs to be extended, of course.

In [None]:
clf = MultinomialNB().fit(X_train_tfidf, polarities_train)
#clf = DecisionTreeClassifier().fit(X_train_tfidf, polarities_train)
#clf = LinearSVC().fit(X_train_tfidf, polarities_train)
#clf = KNeighborsClassifier().fit(X_train_tfidf, polarities_train)

In [None]:

# Transform the preprocessed tweets with the vectorizer fitted on the training data
X_test_tfidf = tfidf_vectorizer.transform(processed_tweets_test)

# Two extra columns per tweet: [sentiment is positive, sentiment is negative]
new_feature_col_test = csr_matrix((X_test_tfidf.shape[0], 2), dtype=float)

for idx, doc in enumerate(tweets_test):
    polarity = calc_polarity(doc)
    if polarity > 0:
        new_feature_col_test[idx, 0] = 1
    elif polarity < 0:
        new_feature_col_test[idx, 1] = 1
    #new_feature_col_test[idx] = ((calc_polarity(doc) + 1.0) / 2.0)

print(new_feature_col_test.shape)

X_test_tfidf = hstack((X_test_tfidf, new_feature_col_test))

predicted = clf.predict(X_test_tfidf)


In [None]:
print(confusion_matrix(polarities_test, predicted))
print()
print(classification_report(polarities_test, predicted))