# Text classification and sentiment analysis: Twitter

<a href="https://colab.research.google.com/github/chu-ise/413A-2022/blob/main/notebooks/07/07-2_sentiment_analysis_twitter.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Once text data has been converted into numerical features, text classification works just like any other classification task.

In this notebook, we apply these preprocessing techniques to Twitter data and train a classifier to predict sentiment polarity.
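As a minimal sketch of this idea, the toy example below turns text into token counts and then fits a classifier on the resulting matrix, just as one would on any other numeric features. The four short texts, their labels, and the names `toy_vectorizer` and `toy_clf` are made up for illustration and are not part of the dataset used later.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus; 1 = positive, 0 = negative
texts = ["good great", "bad awful", "great fun", "awful boring"]
labels = [1, 0, 1, 0]

# Step 1: convert text into a numeric document-term matrix
toy_vectorizer = CountVectorizer()
X = toy_vectorizer.fit_transform(texts)

# Step 2: from here on, this is an ordinary classification problem
toy_clf = MultinomialNB().fit(X, labels)

# New text must go through the same fitted vectorizer before prediction
toy_pred = toy_clf.predict(toy_vectorizer.transform(["great", "boring"]))
print(toy_pred)
```

The key design point is that the vectorizer is fitted once on the training text and only applied (not refitted) to new text, so both share the same vocabulary and column order.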

## Imports

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
%matplotlib inline

from pathlib import Path
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# TextBlob for lexicon-based sentiment scoring
from textblob import TextBlob

# sklearn for feature extraction & modeling
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score, roc_curve, accuracy_score


In [None]:
sns.set_style('white')


## Twitter Sentiment

### Download the data

We use a dataset that contains 1.6 million training and 350 test tweets from 2009 with algorithmically assigned binary positive and negative sentiment scores that are fairly evenly split.

In [None]:
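import gdown

# Google Drive file id of the zipped dataset (named file_id to avoid shadowing the built-in id)
file_id = "15kGH8PG8VwLJH0mTPz5ntpeRikPIM5i-"

data_file = "twitter_sentiment.zip"
# Download once (cached on later runs) and extract the archive
gdown.cached_download(id=file_id, path=data_file, postprocess=gdown.extractall)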

The raw data has six columns, indexed 0 to 5 (example values in parentheses):

- 0 - the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive); training data has no neutral tweets
- 1 - the id of the tweet (2087)
- 2 - the date of the tweet (Sat May 16 23:58:44 UTC 2009)
- 3 - the query (lyx). If there is no query, then this value is NO_QUERY (only the test data uses a query).
- 4 - the user that tweeted (robotickilldozr)
- 5 - the text of the tweet (Lyx is cool)
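To make the column layout concrete, here is a small sketch that parses a single row built from the example values above. It assumes the raw files are comma-separated with no header row; the in-memory `StringIO` buffer is a stand-in, since the name of the raw file is not given here.

```python
from io import StringIO

import pandas as pd

names = ['polarity', 'id', 'date', 'query', 'user', 'text']

# One illustrative row assembled from the example values in the column list
raw = StringIO(
    "4,2087,Sat May 16 23:58:44 UTC 2009,lyx,robotickilldozr,Lyx is cool\n"
)

# No header row is assumed, so the column names are supplied explicitly
sample = pd.read_csv(raw, header=None, names=names)
print(sample.T)
```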

### Read and preprocess train/test data

In [None]:
data_path = Path('twitter_sentiment')

In [None]:
names = ['polarity', 'id', 'date', 'query', 'user', 'text']

Preprocessing involves a few steps; the cells below then load the resulting parquet files:
- remove tweets above the legal (at the time) length of 140 characters,
- binarize polarity, and
- move the data to the faster parquet format.
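The steps above could be sketched roughly as follows. This is an illustrative assumption rather than the exact code used to produce the parquet files: the helper `preprocess_tweets`, its parameters, and the toy frame are hypothetical.

```python
import pandas as pd


def preprocess_tweets(df, max_len=140, parquet_path=None):
    """Filter over-long tweets, binarize polarity, optionally save as parquet."""
    # Keep only tweets within the length limit
    out = df[df.text.str.len() <= max_len].copy()
    # Binarize: 0 stays 0 (negative), anything above 0 becomes 1
    out['polarity'] = (out.polarity > 0).astype(int)
    # Writing parquet requires an engine such as pyarrow or fastparquet
    if parquet_path is not None:
        out.to_parquet(parquet_path)
    return out


# Hypothetical mini-frame standing in for the raw data
toy = pd.DataFrame({
    'polarity': [0, 4, 4],
    'text': ['short and sad', 'short and happy', 'x' * 141],
})
clean = preprocess_tweets(toy)
print(clean)
```

Passing a path, for example `preprocess_tweets(toy, parquet_path='toy.parquet')`, would additionally write the cleaned frame to disk.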

In [None]:
parquet_file = data_path / 'train.parquet'
train = pd.read_parquet(parquet_file)
# show_counts replaces the null_counts argument, which was removed in pandas 2.0
train.info(show_counts=True)

In [None]:
parquet_file = data_path / 'test.parquet'
test = pd.read_parquet(parquet_file)
# show_counts replaces the null_counts argument, which was removed in pandas 2.0
test.info(show_counts=True)

### Explore data

In [None]:
train.head()

In [None]:
train.polarity = (train.polarity>0).astype(int)
train.polarity.value_counts()

In [None]:
# Note: any neutral tweets (polarity 2) would also be mapped to 1 here
test.polarity = (test.polarity>0).astype(int)
test.polarity.value_counts()

In [None]:
# histplot replaces the deprecated distplot(..., kde=False)
sns.histplot(train.text.str.len())
sns.despine();

In [None]:
train.date.describe()

In [None]:
train.user.nunique()

In [None]:
train.user.value_counts().head(10)

### Create text vectorizer

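We create a document-term matrix that drops English stop words and keeps only tokens appearing in at least 0.1% and at most 80% of the tweets, which leaves 934 tokens: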

In [None]:
vectorizer = CountVectorizer(min_df=.001, max_df=.8, stop_words='english')
train_dtm = vectorizer.fit_transform(train.text)

In [None]:
train_dtm

In [None]:
test_dtm = vectorizer.transform(test.text)
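To illustrate how fractional `min_df` and `max_df` values prune the vocabulary, here is a small sketch on a made-up four-document corpus. The documents and thresholds are hypothetical and chosen only so that the effect is visible at this tiny scale; stop-word removal is left out to keep the result easy to follow.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical four-document corpus
docs = [
    "tweet good movie",
    "tweet good day",
    "tweet bad day",
    "tweet good coffee",
]

# min_df=0.5: keep tokens found in at least 50% of documents (2 of 4)
# max_df=0.8: drop tokens found in more than 80% of documents (here: all 4)
toy_vec = CountVectorizer(min_df=0.5, max_df=0.8)
toy_dtm = toy_vec.fit_transform(docs)

kept_tokens = sorted(toy_vec.vocabulary_)
print(kept_tokens)    # tokens that survive both thresholds
print(toy_dtm.shape)  # (number of documents, number of kept tokens)
```

Here "tweet" is removed for being too common, while "movie", "bad", and "coffee" are removed for being too rare, so only "day" and "good" remain as columns.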

### Train Naive Bayes Classifier

In [None]:
nb = MultinomialNB()
nb.fit(train_dtm, train.polarity)

### Predict Test Polarity

In [None]:
predicted_polarity = nb.predict(test_dtm)

### Evaluate Results

In [None]:
accuracy_score(test.polarity, predicted_polarity)

### TextBlob for Sentiment Analysis

In [None]:
sample_positive = train.text.loc[256332]
print(sample_positive)
parsed_positive = TextBlob(sample_positive)
parsed_positive.polarity

In [None]:
sample_negative = train.text.loc[636079]
print(sample_negative)
parsed_negative = TextBlob(sample_negative)
parsed_negative.polarity

In [None]:
def estimate_polarity(text):
    return TextBlob(text).sentiment.polarity

In [None]:
train[['text']].sample(10).assign(sentiment=lambda x: x.text.apply(estimate_polarity)).sort_values('sentiment')

### Compare with TextBlob Polarity Score

We also obtain TextBlob sentiment scores for the test tweets and note (see the left panel of the figure below) that positive test tweets receive a significantly higher sentiment estimate. We then use the MultinomialNB model's .predict_proba() method to compute predicted probabilities and compare both models using their respective area under the ROC curve (see the right panel below).
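TextBlob polarity lies in [-1, 1] while predicted probabilities lie in [0, 1], yet the two can still be compared by ROC AUC because the metric depends only on how the scores rank the examples, not on their scale. The sketch below uses made-up labels and scores to show this.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and predicted probabilities for the positive class
y_true = np.array([0, 0, 1, 1])
proba = np.array([0.1, 0.4, 0.35, 0.8])

# Rescale to [-1, 1], the range of TextBlob polarity scores
rescaled = 2 * proba - 1

# The ranking of examples is unchanged, so the AUC is the same
auc_proba = roc_auc_score(y_true, proba)
auc_rescaled = roc_auc_score(y_true, rescaled)
print(auc_proba, auc_rescaled)
```

Accuracy, by contrast, requires a threshold (0 for TextBlob polarity, 0.5 for probabilities), so it is more sensitive to how each score is calibrated.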

In [None]:
test['sentiment'] = test.text.apply(estimate_polarity)

In [None]:
accuracy_score(test.polarity, (test.sentiment>0).astype(int))

#### ROC AUC Scores

In [None]:
roc_auc_score(y_true=test.polarity, y_score=test.sentiment)

In [None]:
roc_auc_score(y_true=test.polarity, y_score=nb.predict_proba(test_dtm)[:, 1])

In [None]:
fpr_tb, tpr_tb, _ = roc_curve(y_true=test.polarity, y_score=test.sentiment)
roc_tb = pd.Series(tpr_tb, index=fpr_tb)
fpr_nb, tpr_nb, _ = roc_curve(y_true=test.polarity, y_score=nb.predict_proba(test_dtm)[:, 1])
roc_nb = pd.Series(tpr_nb, index=fpr_nb)

The Naive Bayes model outperforms TextBlob in this case.

In [None]:
fig, axes = plt.subplots(ncols=2, figsize=(14, 6))
sns.boxplot(x='polarity', y='sentiment', data=test, ax=axes[0])
axes[0].set_title('TextBlob Sentiment Scores')
roc_nb.plot(ax=axes[1], label='Naive Bayes', legend=True, lw=1, title='ROC Curves')
roc_tb.plot(ax=axes[1], label='TextBlob', legend=True, lw=1)
sns.despine()
fig.tight_layout();