# Naive Bayes sentiment model for Twitter data

Create a Naive Bayes sentiment model and use it to analyze Twitter data.

# Large Movie Review Dataset

The Large Movie Review Dataset is a corpus of 50,000 movie reviews from IMDB that have been classified as either positive or negative. More information about the dataset can be found at https://ai.stanford.edu/~amaas/data/sentiment/.

The following code downloads a copy of the Large Movie Review Dataset and saves it in a variable `imdb_corpus`.

In [None]:
import urllib.request, json
imdb_corpus = []
with urllib.request.urlopen("https://storage.googleapis.com/wd13/IMDBReviewSent.txt") as url:
  for line in url.readlines():
    imdb_corpus.append(line.decode().split('\t'))

`imdb_corpus` is a list. Each element is itself a two-element list: the label (`'positive'` or `'negative'`) at index 0 and the review text at index 1.

In [None]:
# print the text and label of document 16
docid = 16
print(imdb_corpus[docid])

['positive', "Some films just simply should not be remade. This is one of them. In and of itself it is not a bad film. But it fails to capture the flavor and the terror of the 1963 film of the same title. Liam Neeson was excellent as he always is, and most of the cast holds up, with the exception of Owen Wilson, who just did not bring the right feel to the character of Luke. But the major fault with this version is that it strayed too far from the Shirley Jackson story in it's attempts to be grandiose and lost some of the thrill of the earlier film in a trade off for snazzier special effects. Again I will say that in and of itself it is not a bad film. But you will enjoy the friction of terror in the older version much more.\n"]


In [None]:
# print the label of document 16
docid = 16
print(imdb_corpus[docid][0])

positive


In [None]:
# print the text of document 16
docid = 16
print(imdb_corpus[docid][1])

Some films just simply should not be remade. This is one of them. In and of itself it is not a bad film. But it fails to capture the flavor and the terror of the 1963 film of the same title. Liam Neeson was excellent as he always is, and most of the cast holds up, with the exception of Owen Wilson, who just did not bring the right feel to the character of Luke. But the major fault with this version is that it strayed too far from the Shirley Jackson story in it's attempts to be grandiose and lost some of the thrill of the earlier film in a trade off for snazzier special effects. Again I will say that in and of itself it is not a bad film. But you will enjoy the friction of terror in the older version much more.
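
As a quick check on this `[label, text]` layout, the labels can be tallied to see how balanced the corpus is. The snippet below is a minimal sketch that uses a small made-up `sample_corpus` in place of `imdb_corpus`; the three reviews in it are illustrative and not taken from the dataset.

```python
from collections import Counter

# Hypothetical stand-in for imdb_corpus: each entry is [label, text].
sample_corpus = [
    ['positive', 'A moving, well-acted film.\n'],
    ['negative', 'Dull and far too long.\n'],
    ['positive', 'Great fun from start to finish.\n'],
]

# Tally how many documents carry each label (index 0 of every entry).
label_counts = Counter(entry[0] for entry in sample_corpus)
print(label_counts)
```

Running the same tally on `imdb_corpus` itself should show how many positive and negative reviews were downloaded.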



# Create a tokenizer

Write a function `tokenize` that takes a string and returns a list of tokens.

In [None]:
import re

def tokenize(doc):
  # Lowercase the text, then extract runs of word characters (letters, digits, underscore).
  doc = doc.lower()
  tokens = re.findall(r'\w+', doc)
  return tokens
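
To see what this kind of tokenizer does to tweet-like text, here is a small self-contained sketch. It redefines `tokenize` with the same lowercase-then-`\w+` approach, and the sample string is made up for illustration.

```python
import re

def tokenize(doc):
    # Same approach as the notebook: lowercase, then keep runs of word characters.
    return re.findall(r'\w+', doc.lower())

# A made-up tweet-like string, for illustration only.
sample = "RT @user: It's GREAT!! #movies"
print(tokenize(sample))
# → ['rt', 'user', 'it', 's', 'great', 'movies']
```

Note that punctuation such as `@`, `#`, and apostrophes is dropped, so `It's` splits into `it` and `s`.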


# Import log function

In [None]:
from math import log
log(1)

import math

def safe_log(x, y):
    # Return log(x / y), or 0 (a neutral score) when either argument is not positive.
    if x <= 0 or y <= 0:
        return 0
    return math.log(x / y)

# Create a lexicon

Calculate sentiment scores for every token in the corpus, using the method discussed in class. Store these scores in a dictionary called `lexicon`.

In [None]:

lexicon = {}
total_pos_doc = 0
total_neg_doc = 0
positiveToken_count = {}
negativeToken_count = {}
corpusTokenlist = set()

for i in range(len(imdb_corpus)):
  token_list = tokenize(imdb_corpus[i][1])

  # Record every token seen anywhere in the corpus.
  for j in token_list:
    corpusTokenlist.add(j)

  # Count token occurrences separately for positive and negative documents.
  for token_list_word in token_list:
    if imdb_corpus[i][0] == 'positive':
      positiveToken_count[token_list_word] = positiveToken_count.get(token_list_word, 0) + 1
    else:
      negativeToken_count[token_list_word] = negativeToken_count.get(token_list_word, 0) + 1

  # Count documents per class.
  if imdb_corpus[i][0] == 'positive':
    total_pos_doc += 1
  else:
    total_neg_doc += 1

# Score only tokens seen in both classes: the log of the ratio of
# (positive count / positive documents) to (negative count / negative documents).
for token in corpusTokenlist:
  if token in positiveToken_count and token in negativeToken_count:
    lexicon[token] = safe_log(positiveToken_count[token] / total_pos_doc,
                              negativeToken_count[token] / total_neg_doc)
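
The scoring rule in the cell above can be traced by hand on a tiny corpus. The sketch below is a self-contained illustration, not the notebook's own code: `build_lexicon` is a hypothetical helper that follows the same counting logic, and `toy_corpus` is made up.

```python
import math
import re

def tokenize(doc):
    return re.findall(r'\w+', doc.lower())

def build_lexicon(corpus):
    # Hypothetical helper: count token occurrences and documents per class,
    # then score tokens that occur in both classes.
    pos_docs = neg_docs = 0
    pos_counts, neg_counts = {}, {}
    for label, text in corpus:
        if label == 'positive':
            pos_docs += 1
            counts = pos_counts
        else:
            neg_docs += 1
            counts = neg_counts
        for token in tokenize(text):
            counts[token] = counts.get(token, 0) + 1
    return {
        token: math.log((pos_counts[token] / pos_docs) / (neg_counts[token] / neg_docs))
        for token in pos_counts.keys() & neg_counts.keys()
    }

# Made-up corpus: two positive documents, one negative document.
toy_corpus = [
    ['positive', 'good good film'],
    ['negative', 'bad film'],
    ['positive', 'good story'],
]
toy_lexicon = build_lexicon(toy_corpus)
print(toy_lexicon)
```

Only `film` appears in both classes, so it is the only scored token: it occurs once across 2 positive documents and once in 1 negative document, giving log((1/2)/(1/1)) = log(0.5), a mildly negative score. Tokens such as `good` and `bad` that occur in only one class receive no score under this rule.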


# Create a score message function

Write a function `score_message` that takes a message `doc` and returns a sentiment score, using the method discussed in class.

In [None]:

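def score_message(doc):
  # Sum the lexicon scores of the distinct tokens in doc;
  # tokens missing from the lexicon contribute nothing.
  score = 0
  for token in set(tokenize(doc)):
    if token in lexicon:
      score += lexicon[token]
  return score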

score_message('bad idea')


-1.957736324680629
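
Because `score_message` iterates over a set of tokens, a repeated word is counted once. The sketch below illustrates that behaviour with a hypothetical hand-written lexicon (`toy_lexicon`, whose values are invented) and a hypothetical `score_with` function that takes the lexicon as an argument.

```python
import re

# Hypothetical lexicon with invented scores, for illustration only.
toy_lexicon = {'bad': -1.5, 'idea': -0.25, 'great': 1.0}

def tokenize(doc):
    return re.findall(r'\w+', doc.lower())

def score_with(doc, lexicon):
    # Add up the scores of the distinct tokens that the lexicon knows about.
    return sum(lexicon[t] for t in set(tokenize(doc)) if t in lexicon)

print(score_with('bad bad idea', toy_lexicon))   # → -1.75
print(score_with('unknown words', toy_lexicon))  # → 0
```

A message with no known tokens scores 0, which the analysis later in this notebook treats as positive because it tests `>= 0`.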

# Live Twitter Data

**YOU MUST DOWNLOAD LIVE TWITTER DATA PRIOR TO FEB 9**

Twitter announced on Feb 2 that it would shut down its free API on Feb 9. Instructions for downloading tweets to a JSON file are provided here: https://colab.research.google.com/drive/1oYzY93ZdVtLxuz237kZGBk95o4UAHWRO?usp=sharing.

Once you've downloaded your live Twitter data to a JSON file, you can upload it here.

In [None]:
import json
from google.colab import files
uploaded_files = files.upload()
tweets = json.loads(uploaded_files['tweets.json'])

Saving tweets.json to tweets.json


In [None]:
tweets

In [None]:
tweet_text_corpus=[tweet['text'] for tweet in tweets]

In [None]:
tweet_text_corpus

['RT @MatHayes5: "Conservative Leader Pierre Poilievre offered tepid support for the proposal, even promising he’d honour it if he becomes pr…',
 'RT @SpencerFernando: WATCH: Trudeau Continues To Lie About Carbon Tax Rebates. https://t.co/P7BVU6nvK3',
 "RT @RebelNewsOnline: Trudeau's criminally convicted Environment Minister is called out for his ever-increasing carbon tax raising the cost…",
 "RT @CoryBMorgan: Holy crap. \n\nYou have to squeeze a person's hand pretty hard to leave an impression like that. \n\nTrudeau's loathing of str…",
 'RT @TrueNorthCentre: Jagmeet Singh accuses Prime Minister Justin Trudeau of a "major flip-flop" on healthcare. https://t.co/0HXc8sZJci',
 "RT @truckdriverpleb: Danielle Smith's hand after Justin Trudeau FORCED her to shake his https://t.co/Etd55iegsO",
 'RT @dubsndoo: Today in Parliament Trudeau said that the farmers he’s talked to are concerned about *climate change* and support paying skyr…',
 'RT @HughMcCoy10: NYC Mayor is practising state sponso

In [None]:
conservative_total_count = 0
conservative_positive_count = 0
conservative_negative_count = 0
liberal_total_count = 0
liberal_positive_count = 0
liberal_negative_count = 0
for tweet_text in tweet_text_corpus:
  tweet_sentiment_score = score_message(tweet_text)
  tweet_tokens = set(tokenize(tweet_text))
  if {'immigration','conservative'}.intersection(tweet_tokens):
    conservative_total_count += 1
    if tweet_sentiment_score>=0:
      conservative_positive_count += 1
    else:
      conservative_negative_count += 1
  if {'immigration','liberal'}.intersection(tweet_tokens):
    liberal_total_count += 1
    if tweet_sentiment_score>=0:
      liberal_positive_count += 1
    else:
      liberal_negative_count += 1

# Analyze Twitter Data

Sentiment analysis of tweets about immigration for two parties, Conservative and Liberal. The following statistics are reported for each party:
* Number of Tweets
* Share of Voice
* Positive Percent
* Negative Percent
* Net Positive Percent
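
The statistics listed above can be computed from four counts. The sketch below is one possible formulation, with a hypothetical `summarize` helper and made-up counts; it also guards against division by zero when a party has no matching tweets, which is an added assumption about the desired behaviour.

```python
def summarize(total, positive, negative, grand_total):
    # Hypothetical helper: percentages fall back to 0.0 when the denominator is 0.
    def pct(part, whole):
        return 100 * part / whole if whole else 0.0

    return {
        'NTweets': total,
        'ShareofVoice': pct(total, grand_total),          # party tweets / all matched tweets
        'PositivePct': pct(positive, total),
        'NegativePct': pct(negative, total),
        'NetPositivePct': pct(positive - negative, total),
    }

# Made-up counts: 20 tweets for one party out of 80 matched tweets overall.
example = summarize(total=20, positive=15, negative=5, grand_total=80)
print(example)
```

With these counts the share of voice is 25%, the positive and negative percentages are 75% and 25%, and the net positive percentage is their difference, 50%.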

In [None]:
con_corpus_summary = {
    'NTweets' : conservative_total_count,
    'ShareofVoice' : 100*conservative_total_count/(conservative_total_count+liberal_total_count),
    'PositivePct' : 100*conservative_positive_count/conservative_total_count,
    'NegativePct' : 100*conservative_negative_count/conservative_total_count,
    'NetPositivePct' : 100*(conservative_positive_count-conservative_negative_count)/conservative_total_count
}
lib_corpus_summary = {
    'NTweets' : liberal_total_count,
    'ShareofVoice' : 100*liberal_total_count/(conservative_total_count+liberal_total_count),
    'PositivePct' : 100*liberal_positive_count/liberal_total_count,
    'NegativePct' : 100*liberal_negative_count/liberal_total_count,
    'NetPositivePct' : 100*(liberal_positive_count-liberal_negative_count)/liberal_total_count
}

In [None]:
from prettytable import PrettyTable

x = PrettyTable()
x.field_names = ["Party", "# Tweets", "Share of Voice","Positive %","Negative %","Net Positive %"]

x.add_row(["Liberal",lib_corpus_summary["NTweets"],lib_corpus_summary["ShareofVoice"],lib_corpus_summary["PositivePct"],lib_corpus_summary["NegativePct"],lib_corpus_summary["NetPositivePct"] ])
x.add_row(["Conservative",con_corpus_summary["NTweets"],con_corpus_summary["ShareofVoice"],con_corpus_summary["PositivePct"],con_corpus_summary["NegativePct"],con_corpus_summary["NetPositivePct"] ])


print(x)

+--------------+----------+-------------------+-------------------+--------------------+-------------------+
|    Party     | # Tweets |   Share of Voice  |     Positive %    |     Negative %     |   Net Positive %  |
+--------------+----------+-------------------+-------------------+--------------------+-------------------+
|   Liberal    |    19    | 9.268292682926829 | 89.47368421052632 | 10.526315789473685 | 78.94736842105263 |
| Conservative |   186    | 90.73170731707317 | 81.72043010752688 | 18.27956989247312  | 63.44086021505376 |
+--------------+----------+-------------------+-------------------+--------------------+-------------------+
