# *Game of Thrones* Sentiment Analysis

The goal of this project is to determine the sentiment towards the final season of HBO's hit series *Game of Thrones*. To do this, we will use tweets posted during the weeks in which the final season aired. Later, we will investigate how sentiment changed from episode to episode.

In this notebook we will be performing the sentiment analysis.

In [None]:
# Import all the necessary libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import nltk
nltk.download('twitter_samples') # dataset used to train the model
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('omw-1.4')
nltk.download('punkt')
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords, twitter_samples
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize, TweetTokenizer
from nltk import classify, NaiveBayesClassifier
import re, string
from wordcloud import WordCloud,STOPWORDS

In [None]:
# Import the file with tweets
got = pd.read_csv('/content/drive/MyDrive/gotTwitter.csv')

After importing the dataset, we will take a look at the columns that are of interest to us, the 'created_at' and 'text' columns. The 'created_at' column will help us see how sentiment changed over time and for each episode. The 'text' column contains the actual tweet and will be used to determine the sentiment. 

In [None]:
gotsa = got[['created_at', 'text']]
gotsa.head()

Unnamed: 0,created_at,text
0,2019-04-17 07:34:18,👍 on @YouTube: GAME OF THRONES 8x01 Breakdown!...
1,2019-04-16 03:34:16,👍 on @YouTube: Ups and Downs From Game Of Thro...
2,2019-04-16 03:06:08,Liked on YouTube: Ups and Downs From Game Of T...
3,2019-04-17 07:07:38,Liked on YouTube: GAME OF THRONES 8x01 Breakdo...
4,2019-04-17 07:34:09,@MrLegenDarius unpopular opinion: game of thro...
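
The `created_at` values are stored as plain strings in the CSV, so they would need to be parsed into datetimes before sentiment can be grouped by day or episode. Here is a minimal sketch using only the standard library. It assumes every timestamp follows the `YYYY-MM-DD HH:MM:SS` layout seen in the preview above; the variable names are illustrative and not part of the notebook.

```python
from datetime import datetime

# Timestamp copied from the first row of the preview above
raw_timestamp = "2019-04-17 07:34:18"

# Assumed layout: year-month-day hour:minute:second
parsed = datetime.strptime(raw_timestamp, "%Y-%m-%d %H:%M:%S")

# The date part is what a per-day or per-episode grouping would key on
air_day = parsed.date()
print(air_day.isoformat())
```

In pandas, the equivalent step would typically be `pd.to_datetime(got['created_at'])`.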


## Setting up the Sentiment Classifier

We will use the Twitter samples from the NLTK library to set up our sentiment classifier. First we load the English stop words that we want to filter out, and then we load the tokenized positive and negative tweets from the Twitter samples; we aren't interested in neutral sentiment.

In [None]:
stop_words = stopwords.words('english')

positive_tokens = twitter_samples.tokenized('positive_tweets.json')
negative_tokens = twitter_samples.tokenized('negative_tweets.json')

Next, we will clean the tweet tokens by removing hyperlinks, twitter handles, and punctuation. We will also normalize the tokens by using a lemmatizer. We do this for both the positive and negative tweets.

In [None]:
positive_cleaned_tokens = []
for i in range(len(positive_tokens)):
  row_token = []
  for token, tag in pos_tag(positive_tokens[i]):
    token = re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*\(\),]|'
                   r'(?:%[0-9a-fA-F][0-9a-fA-F]))+', '', token) # remove any hyperlinks
    token = re.sub(r'(@[A-Za-z0-9_]+)', '', token) # remove any twitter handles
    if tag.startswith('NN'): # nouns
      pos = 'n'
    elif tag.startswith('VB'): # verbs
      pos = 'v'
    else: # treat every other tag as an adjective
      pos = 'a'
    lemmatizer = WordNetLemmatizer()
    token = lemmatizer.lemmatize(token, pos) # lemmatize the token
    if len(token) > 0 and token not in string.punctuation and token.lower() not in stop_words:
      row_token.append(token.lower()) # save the token to the row (tweet)
  positive_cleaned_tokens.append(row_token) # save the row (tweet) to the list of cleaned tweets

In [None]:
negative_cleaned_tokens = []
for i in range(len(negative_tokens)):
  row_token = []
  for token, tag in pos_tag(negative_tokens[i]):
    token = re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*\(\),]|'
                   r'(?:%[0-9a-fA-F][0-9a-fA-F]))+', '', token) # remove any hyperlinks
    token = re.sub(r'(@[A-Za-z0-9_]+)', '', token) # remove any twitter handles
    if tag.startswith('NN'): # nouns
      pos = 'n'
    elif tag.startswith('VB'): # verbs
      pos = 'v'
    else: # treat every other tag as an adjective
      pos = 'a'
    lemmatizer = WordNetLemmatizer()
    token = lemmatizer.lemmatize(token, pos) # lemmatize the token
    if len(token) > 0 and token not in string.punctuation and token.lower() not in stop_words:
      row_token.append(token.lower()) # save the token to the row (tweet)
  negative_cleaned_tokens.append(row_token) # save the row (tweet) to the list of cleaned tweets
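
The two cells above repeat the same cleaning steps, so the filtering part could be factored into one helper. The following is only a sketch under stated assumptions: `strip_noise` and its `stop_words` parameter are hypothetical names, and the POS tagging and lemmatization steps are left out so that the example needs nothing beyond `re` and `string`.

```python
import re
import string

# Same patterns as in the cells above, written as raw strings
URL_PATTERN = re.compile(
    r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*\(\),]|'
    r'(?:%[0-9a-fA-F][0-9a-fA-F]))+'
)
HANDLE_PATTERN = re.compile(r'@[A-Za-z0-9_]+')

def strip_noise(tokens, stop_words=frozenset()):
    """Drop URLs, @handles, punctuation and stop words; lowercase the rest."""
    cleaned = []
    for token in tokens:
        token = URL_PATTERN.sub('', token)
        token = HANDLE_PATTERN.sub('', token)
        # `in string.punctuation` is a substring test, as in the cells above
        if token and token not in string.punctuation and token.lower() not in stop_words:
            cleaned.append(token.lower())
    return cleaned

# Made-up tweet tokens, with a one-word stop list for illustration
example = strip_noise(['@user', 'Loved', 'it', '!', 'https://t.co/abc', ':)'],
                      stop_words={'it'})
print(example)  # ['loved', ':)']
```

Note that the emoticon `:)` survives the punctuation check because it is not a contiguous substring of `string.punctuation`, which is why emoticons remain available to the classifier as features.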

After cleaning these tweets, we will create a function that turns the words of each tweet into a dictionary so that they can be passed to the Naive Bayes classifier.

In [None]:
# Define the function to put the tweet tokens into a dictionary
def get_tweets_for_model(cleaned_tokens_list):
  for tweet_tokens in cleaned_tokens_list:
    yield dict([token, True] for token in tweet_tokens)

In [None]:
# Run the function on both datasets
positive_tokens_for_model = get_tweets_for_model(positive_cleaned_tokens)
negative_tokens_for_model = get_tweets_for_model(negative_cleaned_tokens)
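
Because `get_tweets_for_model` uses `yield`, it returns a generator rather than a list, and a generator can be iterated only once. The notebook consumes each generator a single time, so this is not a problem here, but it is worth keeping in mind when re-running cells. A small sketch with made-up tokens (the `toy_tokens` data is purely illustrative):

```python
def get_tweets_for_model(cleaned_tokens_list):
    # One {token: True} feature dictionary per tweet
    for tweet_tokens in cleaned_tokens_list:
        yield dict([token, True] for token in tweet_tokens)

toy_tokens = [['great', 'episode'], ['sad', 'ending']]  # made-up tweets
features = get_tweets_for_model(toy_tokens)

first_pass = list(features)   # consumes the generator
second_pass = list(features)  # nothing left to yield

print(first_pass)
print(len(second_pass))
```

If the feature dictionaries are needed more than once, wrapping the call in `list(...)` keeps them in memory.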

Now that each tweet is a dictionary, we will attach a label to each one to denote whether it has positive or negative sentiment. Then we will combine the two datasets into one and randomly split it into training and testing sets for the model.

In [None]:
positive_dataset = [(tweet_dict, "Positive")
                     for tweet_dict in positive_tokens_for_model]

negative_dataset = [(tweet_dict, "Negative")
                     for tweet_dict in negative_tokens_for_model]

dataset = positive_dataset + negative_dataset

train_data, test_data = train_test_split(dataset, test_size=0.3, random_state=10)
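
To make the split concrete, here is a dependency-free stand-in for the idea behind `train_test_split`: shuffle the labeled tweets with a fixed seed, then cut off a fraction for testing. This is only a sketch. `shuffle_split` is a hypothetical helper, the toy data is made up, and it does not reproduce scikit-learn's exact shuffling or rounding.

```python
import random
from collections import Counter

def shuffle_split(items, test_size=0.3, seed=10):
    """Shuffle a copy of items and split off a test_size fraction at the end."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    cut = len(shuffled) - n_test
    return shuffled[:cut], shuffled[cut:]

# Made-up labeled feature dictionaries in the same (features, label) shape
toy_dataset = ([({"good": True}, "Positive")] * 5
               + [({"bad": True}, "Negative")] * 5)

toy_train, toy_test = shuffle_split(toy_dataset)

# Counting labels shows how balanced each split turned out
train_counts = Counter(label for _, label in toy_train)
test_counts = Counter(label for _, label in toy_test)
print(train_counts, test_counts)
```

Checking the label counts this way is a quick sanity test that neither split ended up dominated by one class.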

Finally, we train the classifier and then test it.

In [None]:
classifier = NaiveBayesClassifier.train(train_data)

print("Accuracy is:", classify.accuracy(classifier, test_data))

classifier.show_most_informative_features(10)  # prints the table itself and returns None

Accuracy is: 0.9966666666666667
Most Informative Features
                      :( = True           Negati : Positi =   2032.2 : 1.0
                      :) = True           Positi : Negati =   1007.5 : 1.0
                follower = True           Positi : Negati =     39.9 : 1.0
                     sad = True           Negati : Positi =     33.8 : 1.0
                     x15 = True           Negati : Positi =     17.3 : 1.0
               community = True           Positi : Negati =     16.0 : 1.0
                      aw = True           Negati : Positi =     12.7 : 1.0
                   didnt = True           Negati : Positi =     11.4 : 1.0
                    glad = True           Positi : Negati =     11.3 : 1.0
               goodnight = True           Positi : Negati =     10.6 : 1.0


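We see that the model is about 99.7% accurate on the held-out sample tweets, which is great! We also see which tokens are most strongly associated with positive or negative sentiment. A smiley face most often means a tweet is positive, whereas a tweet containing a sad face is very likely negative. Because emoticons dominate the list, much of this accuracy probably comes from them, so the classifier may be less reliable on tweets that contain no emoticons.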
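
The ratios in the output above compare how likely a feature is under one label versus the other. The sketch below shows that arithmetic on made-up counts. It assumes add-0.5 (expected likelihood) smoothing over two outcomes, feature present or absent; this is an assumption about the estimator, not something shown in this notebook. The function names are hypothetical.

```python
def smoothed_prob(count, total, gamma=0.5, bins=2):
    """Smoothed estimate of P(feature present | label)."""
    return (count + gamma) / (total + gamma * bins)

def likelihood_ratio(count_a, total_a, count_b, total_b):
    """Ratio of the larger to the smaller smoothed probability."""
    p_a = smoothed_prob(count_a, total_a)
    p_b = smoothed_prob(count_b, total_b)
    return max(p_a, p_b) / min(p_a, p_b)

# Made-up counts: a token seen in 300 of 1000 tweets with label A
# and in none of 1000 tweets with label B
toy_ratio = likelihood_ratio(count_a=300, total_a=1000, count_b=0, total_b=1000)
print(round(toy_ratio, 1))
```

Smoothing is what keeps the ratio finite when a token never appears under one label. It also explains how a token that is almost exclusive to one class, such as an emoticon, can reach a ratio in the thousands.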

## Classifying the *Game of Thrones* Tweets

After setting up the classifier model, we will clean our *Game of Thrones* tweets and then run the classifier on them to determine the sentiment.

In [None]:
# Convert the text column to a NumPy array
got_text = gotsa['text'].to_numpy()

In [None]:
# Set up the tweet tokenizer
tweet_tokenizer = TweetTokenizer()

got_tokens = []

# Tokenize the tweets
for sent in got_text:
  got_tokens.append(tweet_tokenizer.tokenize(sent))

Now that the tweets are tokenized, we will clean them up the same way we did with the twitter samples.

In [None]:
cleaned_tokens = []
for i in range(len(got_tokens)):
  row_token = []
  for token, tag in pos_tag(got_tokens[i]):
    token = re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*\(\),]|'
                   r'(?:%[0-9a-fA-F][0-9a-fA-F]))+', '', token) # remove any hyperlinks
    token = re.sub(r'(@[A-Za-z0-9_]+)', '', token) # remove any twitter handles
    if tag.startswith('NN'): # nouns
      pos = 'n'
    elif tag.startswith('VB'): # verbs
      pos = 'v'
    else: # treat every other tag as an adjective
      pos = 'a'
    lemmatizer = WordNetLemmatizer()
    token = lemmatizer.lemmatize(token, pos) # lemmatize the token
    if len(token) > 0 and token not in string.punctuation and token.lower() not in stop_words:
      row_token.append(token.lower()) # save the token to the row (tweet)
  cleaned_tokens.append(row_token) # save the row (tweet) to the list of cleaned tweets

Finally, we run the *Game of Thrones* dataset through the classifier to get the sentiment for each tweet.

In [None]:
got_sent = []
for tokens in cleaned_tokens:
  got_dict = dict([token, True] for token in tokens)
  got_sent.append(classifier.classify(got_dict))

Now let's add the results to our dataframe so we can see each tweet and its sentiment.

In [None]:
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
gotsa = gotsa.assign(sentiment=got_sent)
gotsa.head()


Unnamed: 0,created_at,text,sentiment
0,2019-04-17 07:34:18,👍 on @YouTube: GAME OF THRONES 8x01 Breakdown!...,Positive
1,2019-04-16 03:34:16,👍 on @YouTube: Ups and Downs From Game Of Thro...,Negative
2,2019-04-16 03:06:08,Liked on YouTube: Ups and Downs From Game Of T...,Negative
3,2019-04-17 07:07:38,Liked on YouTube: GAME OF THRONES 8x01 Breakdo...,Positive
4,2019-04-17 07:34:09,@MrLegenDarius unpopular opinion: game of thro...,Positive


Before saving the dataset, let's look at the proportions of positive and negative tweets.

In [None]:
gotsa['sentiment'].value_counts(normalize=True)

Negative    0.590307
Positive    0.409693
Name: sentiment, dtype: float64

Almost 60% of the tweets were classified as negative! Keep in mind, however, that the model was trained on NLTK's pre-labeled sample tweets rather than on *Game of Thrones* tweets. For a better representation of sentiment in the future, we should define our own rules for what counts as positive and negative. In the following notebook, we will see how the sentiment changed over time.

In [None]:
# Save the dataset
gotsa.to_csv('/content/drive/MyDrive/got_sa.csv', index=False)