# Computational Linguistics - Sentiment Analysis

In this lab, we are going to explore sentiment analysis. Sentiment analysis of text is typically framed as a classification problem, where the classes are usually:

 * Positive
 * Negative
 * Neutral

Let's take a look at some tweets from Donald Trump over the past year. To score them, we are going to use the [VADER sentiment analyzer](http://comp.social.gatech.edu/papers/icwsm14.vader.hutto.pdf).

Read more about **VADER** (Valence Aware Dictionary and sEntiment Reasoner) [here](https://github.com/cjhutto/vaderSentiment/blob/master/README.rst).


In [None]:
# All the packages we are using in this project
import nltk, re, pprint

from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk import word_tokenize
from nltk import FreqDist

# Let's import matplotlib, which is helpful for plotting.
%matplotlib inline

import matplotlib
import matplotlib.pyplot as plt

import pandas as pd
import numpy as np

All tweets from Donald Trump are stored in the file 'realDonaldTrump_tweets.txt', whose path is set below.

## Text Analysis
For this text analysis we will look at the following:
 - Text [collocations](https://en.wikipedia.org/wiki/Collocation) to find common words that go together
 - Regular expressions to parse out hashtags and high-frequency user accounts
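
The two ideas above can be sketched on a tiny example. This is a minimal sketch, not the lab's own solution: `sample_tweets` is made up for illustration, the patterns `#\w+` and `@\w+` are assumed to be good enough for hashtags and user accounts, and the collocation step is only a raw bigram count (NLTK's `BigramCollocationFinder` offers proper association measures). Note that the extraction has to run on the raw lines, before any symbols are stripped.

```python
import re
from collections import Counter

# Hypothetical sample tweets, made up for illustration (not from the data file).
sample_tweets = [
    "Thank you Ohio! #MAGA https://t.co/abc123",
    "Great rally with @FoxNews tonight. Thank you! #MAGA #AmericaFirst",
    "Watch @FoxNews at 9pm. Thank you Florida!",
]

HASHTAG = re.compile(r"#\w+")
MENTION = re.compile(r"@\w+")

# Count hashtags and mentioned accounts, ignoring case.
hashtags = Counter(tag.lower() for t in sample_tweets for tag in HASHTAG.findall(t))
mentions = Counter(m.lower() for t in sample_tweets for m in MENTION.findall(t))


def bigrams(text):
    """Yield adjacent lowercase word pairs, ignoring web links."""
    text = re.sub(r"https?://\S+", " ", text)
    words = re.findall(r"[a-z']+", text.lower())
    return zip(words, words[1:])


# Frequency-based stand-in for collocations: which word pairs occur most often?
bigram_counts = Counter(bg for t in sample_tweets for bg in bigrams(t))

print(hashtags.most_common(2))
print(mentions.most_common(1))
print(bigram_counts.most_common(1))
```

On real data, the same counters can be fed the raw lines of the tweet file instead of `sample_tweets`.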

In [None]:
file_path = '/dsa/data/all_datasets/linguistic/realDonaldTrump_tweets.txt'


Each line in this file is a tweet, so we will read the file into a list of lines and then use regular expressions to strip away symbols and web links in the tweets.

In [None]:
with open(file_path, 'r') as f:
    tweets = f.read().splitlines()

# Replace web links (up to the next whitespace) and any non-word character with a space
tweets = [re.sub(r'https?://\S+|[^\w]', ' ', t) for t in tweets]

print(tweets[0:10])  
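
A note on the link pattern: a greedy `https.*\b` consumes everything from the link up to the last word boundary on the line, so any words after the link are lost, whereas `https?://\S+` stops at the next whitespace. A small sketch on a made-up tweet (the string `raw` is hypothetical, not taken from the data file):

```python
import re

# Made-up tweet for illustration (not from the data file).
raw = "Big news https://t.co/abc123 coming soon!"

# Greedy pattern: '.*' runs to the end of the line, then backs up to a word boundary.
greedy = re.sub(r"[^\w]|https.*\b", " ", raw)

# Bounded pattern: the link ends at the first whitespace character.
bounded = re.sub(r"https?://\S+|[^\w]", " ", raw)

print(greedy.split())   # ['Big', 'news']
print(bounded.split())  # ['Big', 'news', 'coming', 'soon']
```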



## Using the `SentimentIntensityAnalyzer` to analyze tweets

In [None]:
analyzer = SentimentIntensityAnalyzer()
vs = analyzer.polarity_scores(tweets[0])
print(type(vs))
print("{:-<65} {}".format(tweets[0], str(vs)))

**Read about scoring here:** https://github.com/cjhutto/vaderSentiment/blob/master/README.rst#about-the-scoring
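
To build some intuition for the `compound` value: according to the scoring notes, it is the sum of the rule-adjusted word valences, normalized to lie between -1 and +1. The sketch below assumes the normalization used in the reference implementation, `x / sqrt(x*x + alpha)` with `alpha = 15`; the input totals are made-up numbers, not scores from our tweets.

```python
import math


def normalize(score, alpha=15):
    """Squash a summed valence score into the interval (-1, 1).

    Assumption: same form as the reference VADER code, x / sqrt(x^2 + alpha).
    """
    return score / math.sqrt(score * score + alpha)


# Hypothetical summed valences, chosen only to show the shape of the curve.
for total in (-4.0, 0.0, 1.9, 4.0):
    print(total, round(normalize(total), 4))
```

Small totals map almost linearly, while large totals saturate toward -1 or +1, which is why very emotional tweets cluster near the ends of the scale.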

If we process the entire tweet data set, maybe we can understand the trends!

In [None]:
analyzer = SentimentIntensityAnalyzer()
tweets_sentiment = [analyzer.polarity_scores(t) for t in tweets]

df = pd.DataFrame(tweets_sentiment)
df['tweet'] = tweets

df = df[['tweet', 'neg', 'neu', 'pos', 'compound']]

df.head()

Let's look at the statistics of the measurements.


In [None]:
df.describe()

In our run, the `mean` row shows an average compound score of about 0.126, so the tweets lean slightly positive overall.

Note the standard classification thresholds from the documentation:

1. positive sentiment: compound score >= 0.05
1. neutral sentiment: (compound score > -0.05) and (compound score < 0.05)
1. negative sentiment: compound score <= -0.05
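
The thresholds above can be written as a small function, which makes the inclusive boundaries easy to check. This is a minimal sketch: the name `classify` and the `threshold` parameter are our own, and the labels match the `POS`/`NEU`/`NEG` codes used in this lab.

```python
def classify(compound, threshold=0.05):
    """Map a compound score to 'POS', 'NEG', or 'NEU' (boundaries inclusive)."""
    if compound >= threshold:
        return "POS"
    if compound <= -threshold:
        return "NEG"
    return "NEU"


# Boundary values fall into the positive and negative classes, not neutral.
for score in (0.05, 0.049, 0.0, -0.049, -0.05):
    print(score, classify(score))
```

With a pandas column of compound scores, the same function could be applied row by row, for example `df['compound'].apply(classify)`.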



In [None]:
# Start from neutral, then apply the documented (inclusive) thresholds
df['sentiment'] = 'NEU'
df.loc[df['compound'] >= 0.05, 'sentiment'] = 'POS'
df.loc[df['compound'] <= -0.05, 'sentiment'] = 'NEG'

df.head()

---

### Let's visualize!


In [None]:
import seaborn as sns
sns.set()
sns.boxplot(x="sentiment", y="compound", data=df);

---
# Save notebook, then `File > Close and Halt`