# <span style="color:#820747">Twitter Sentiment Analysis with Movie Training Data Set. NLP.

<img src="img/head.jpg">

<span style="color:#610023"> In this project, I use a "movie reviews" data set of short movie reviews to train my models, and then I create a module that we can use to analyse Twitter posts in real time via the Twitter API. I also build a real-time graph of the positive and negative sentiment. 

<img src="img/lin.jpg">

In [84]:
import nltk
import random
from nltk.corpus import movie_reviews
from nltk.tokenize import word_tokenize

from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB, GaussianNB, BernoulliNB
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC

from nltk.classify import ClassifierI
from statistics import mode  # mode picks the label that gets the most votes
import pickle

# <span style="color:#ffad01">1. Load: <span style="color:#004577"> Read data.

<span style="color:#be0119"><b>
--- comment ---</b>

<span style="color:#1e488f"> First I read my two text files with positive and negative reviews. Within each file, the reviews are separated by newlines. 

In [96]:
with open('data/raw_positive.txt', 'r') as f:
    raw_pos = f.read()
with open('data/raw_negative.txt', 'r') as f:
    raw_neg = f.read()

In [97]:
# Check my raw data: the first 400 characters of each file.
#_________________________________
print(raw_pos[0:400])
print('---------------------------------------------------------------------------------------------------------')
print(raw_neg[0:400])

the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal . 
the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded vision of j . r . r . tolkien's middle-ea
---------------------------------------------------------------------------------------------------------
simplistic , silly and tedious . 
it's so laddish and juvenile , only teenage boys could possibly find it funny . 
exploitative and largely devoid of the depth or sophistication that would make watching such a graphic treatment of the crimes bearable . 
[garbus] discards the potential for pathological study , exhuming instead , the skewed melodrama of the circumstantial situation . 
a visually fla


<img src="img/lin.jpg">

# <span style="color:#ffad01">2. Words Tokenize: <span style="color:#004577"> Create documents, take words from the raw data, and put them into a new variable, the bag of words.

<span style="color:#be0119"><b>
--- comment ---</b>

<span style="color:#1e488f">I want to treat each line as a document: each line becomes a tuple (review, pos/neg). I also take the words from the raw data and collect them in one place, the "bag of words". I prepare an extended stop-word list for unneeded words, and I keep only certain parts of speech ('J' is adjective, 'R' is adverb, 'V' is verb).

In [98]:
# Add a few more stop words
#_________________________________
stopwords = nltk.corpus.stopwords.words('english')
newStopWords = ['the', 'is', '21st',  "'s", 'he', 'j', 'r', '--', ';', '(', ')', ':', "'", '&',
                'us', '15-year-old', 'ms', 'er', '!', '85', '3-d', '``', ',', '.',]
stopwords.extend(newStopWords)
print(stopwords)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [99]:
%%time
# Create "documents"
# Fill bag_of_words with lowercased words of the allowed parts of speech.
# Note: the stopwords list built above is not applied in this loop; words are filtered by POS tag only.
# I will use only 3 types of words in my models. To get them I use the pos_tag function from the nltk library.
# 'J' is adjective, 'R' is adverb, 'V' is verb
# allowed_word_types = ['J','R','V']
#_________________________________

documents = []
bag_of_words = []
allowed_word_types = ['J','R','V']

for p in raw_pos.split('\n'):
    documents.append( (p, "pos") )
    words = word_tokenize(p)
    pos = nltk.pos_tag(words)
    for w in pos:
        if w[1][0] in allowed_word_types:
            bag_of_words.append(w[0].lower())  # keep only the word (index 0); index 1 is its part-of-speech tag
            
for p in raw_neg.split('\n'):
    documents.append( (p, 'neg') )
    words = word_tokenize(p)
    pos = nltk.pos_tag(words)
    for w in pos:
        if w[1][0] in allowed_word_types:
            bag_of_words.append(w[0].lower())  # keep only the word (index 0); index 1 is its part-of-speech tag
            
print('length of bag_of_words:',len(bag_of_words))
print('----------------------')
print('length of document:',len(documents))

length of bag_of_words: 71204
----------------------
length of document: 10664
Wall time: 51.8 s
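
The two loops above differ only in the raw text and the label, and the `stopwords` list built earlier is not consulted inside them. Below is a minimal sketch of one way to fold the part-of-speech filter and the stop-word filter into a single helper. It assumes the tokens have already been tagged (for example by `nltk.pos_tag`); the name `collect_words` and the sample data are hypothetical and not part of my original notebook. Applying it would change the bag-of-words size reported above.

```python
def collect_words(tagged_tokens, allowed_prefixes=("J", "R", "V"), stopwords=()):
    """Return lowercased words whose POS tag starts with an allowed prefix
    and that are not in the stop-word collection."""
    stop = set(stopwords)
    kept = []
    for word, tag in tagged_tokens:
        lowered = word.lower()
        # tag[:1] is the first letter of the tag, e.g. "J" for "JJ"
        if tag[:1] in allowed_prefixes and lowered not in stop:
            kept.append(lowered)
    return kept


# Hand-written (word, tag) pairs in Penn Treebank style, for illustration only
sample = [("Simplistic", "JJ"), (",", ","), ("silly", "JJ"),
          ("and", "CC"), ("is", "VBZ"), ("tedious", "JJ")]

print(collect_words(sample, stopwords=["is"]))
```

With the sample above, the verb "is" passes the part-of-speech filter but is dropped by the stop-word filter.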


# <span style="color:green"> Pickle documents

In [100]:
save_documents = open('_pickled_algos/documents.pickle', 'wb')
pickle.dump(documents, save_documents)
save_documents.close()

<img src="img/lin.jpg">

# <span style="color:#ffad01">3. Features: <span style="color:#004577"> Creating Features Set and Train / Test Split.

<span style="color:#be0119"><b>
--- comment ---</b>

<span style="color:#1e488f"> I create word_features: 10000 words used as features. Then I build feature sets by going through each document and checking which of the word features it contains; that is how we can see which words are related to which sentiment (positive or negative). Finally, I split the data into train and test sets.

# <span style="color:#004577">Create bag of words and pick 10000 word features

In [101]:
# I will use 10000 word features only
# _________________________________
bag_of_words = nltk.FreqDist(bag_of_words)
word_features = list(bag_of_words.keys())[:10000]
len(word_features)

10000
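
Note that `list(bag_of_words.keys())[:10000]` keeps the first 10000 distinct words in the order they were first seen, not the 10000 most frequent ones. If the most frequent words are wanted, `most_common` is one option. Here is a small sketch of the difference using `collections.Counter`, which `nltk.FreqDist` subclasses; the toy word list is made up for illustration.

```python
from collections import Counter

words = ["dull", "warm", "boring", "warm", "boring", "boring"]
freq = Counter(words)

# First two keys in insertion order (what slicing keys() gives)
first_seen = list(freq.keys())[:2]

# Two most frequent words, by descending count
most_frequent = [w for w, _ in freq.most_common(2)]

print(first_seen)
print(most_frequent)
```

Switching to `most_common` would change the feature set, so the models would need to be retrained and the accuracies below would differ.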

# <span style="color:green"> Pickle word_features

In [102]:
save_word_features = open('_pickled_algos/word_features.pickle', 'wb')
pickle.dump(word_features, save_word_features)
save_word_features.close()

# <span style="color:#004577">Create features Set

In [103]:
# Define function to find features which exist in documents and word_features
#_________________________________
def find_features(document):
    words = word_tokenize(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)
    return features
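
In the function above, `w in words` scans a list once per feature, and the tokens are compared exactly as typed while `word_features` holds lowercased words, so a capitalised token such as "Happy" does not match "happy". Here is a hedged variant that lowercases the tokens and uses a set for membership tests. `find_features_fast` is a hypothetical name, it takes already-tokenized input, and it is not what the rest of this notebook uses.

```python
def find_features_fast(tokens, word_features):
    """Map every feature word to True/False depending on whether it occurs
    among the lowercased tokens of the document."""
    token_set = {t.lower() for t in tokens}
    return {w: (w in token_set) for w in word_features}


# Toy feature list and a pre-tokenized sentence, for illustration only
toy_features = ["happy", "amazing", "dull"]
tokens = ["Happy", "Birthday", "to", "my", "amazing", "wife"]

features = find_features_fast(tokens, toy_features)
print(features)
```

Because this changes which features come out True, models trained with the original function would have to be retrained before using it.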

In [104]:
%%time
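# Create featuresets: for each document, a dict mapping every word in word_features to True or False (whether the word occurs in the document), paired with its label
# Shuffle randomly, because the documents are ordered with all positive reviews first and all negative reviews after them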
# _________________________________
featuresets = [(find_features(rev), category) for (rev, category) in documents]
random.shuffle(featuresets)
print('Number of all posts:',len(featuresets))
print('--------------')
print(featuresets[1])

Number of all posts: 10664
--------------
Wall time: 31.3 s


# <span style="color:green"> Pickle featuresets.

In [105]:
save_featureset = open('_pickled_algos/featuresets.pickle', 'wb')
pickle.dump(featuresets, save_featureset)
save_featureset.close()

# <span style="color:#004577">Train/Test split

In [106]:
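# Split data: the first 10000 feature sets for training, the remaining 664 for testing.
# Note: despite their names, X_Train and y_test both hold (features, label) pairs.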
#_________________________________
X_Train = featuresets[:10000]
y_test = featuresets[10000:]

print('Train set:',len(X_Train))
print('-----------------')
print('Test set:',len(y_test))

Train set: 10000
-----------------
Test set: 664
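
Because `random.shuffle` is unseeded here, each run produces a different split and slightly different accuracies. Below is a sketch of a reproducible split, assuming a fixed seed is acceptable. The helper `split_featuresets`, the seed value, and the toy data are hypothetical.

```python
import random
from collections import Counter


def split_featuresets(featuresets, n_train, seed=42):
    """Shuffle a copy of the data with a fixed seed and cut it into train/test."""
    shuffled = list(featuresets)           # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)  # private generator, global state unaffected
    return shuffled[:n_train], shuffled[n_train:]


# Toy (features, label) pairs, for illustration only
toy = [({"good": True}, "pos")] * 6 + [({"good": False}, "neg")] * 6

train, test = split_featuresets(toy, n_train=9)
print(len(train), len(test))
print(Counter(label for _, label in test))  # label balance of the test slice
```

Printing the label counts of the test slice is a quick check that the shuffle mixed both classes.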


<img src="img/lin.jpg">

# <span style="color:#ffad01">4. Modeling: <span style="color:#004577"> Train and Test models 

<span style="color:#be0119"><b>
--- comment ---</b>

<span style="color:#1e488f"> I will train a bunch of models and then build a "voted classifier" that makes the final decision, together with a confidence value. I will also pickle my trained models to use them later in my new module. 

# <span style="color:green"> Naive Bayes Classifier
# <span style="color:green"> Multinomial Naive Bayes Classifier
# <span style="color:green"> Bernoulli Naive Bayes Classifier  
# <span style="color:green"> Logistic Regression Classifier
# <span style="color:green"> Stochastic Gradient Descent Classifier
# <span style="color:green"> Support Vector Machine Classifier
# <span style="color:green"> Linear Support Vector Machine Classifier
# <span style="color:green"> Nu Support Vector Machine Classifier

In [109]:
%%time
NB_classifier = nltk.NaiveBayesClassifier.train(X_Train)
print('Naive Bayes accuracy percent "%":', (nltk.classify.accuracy(NB_classifier, y_test))*100)
NB_classifier.show_most_informative_features(15)
print('--------------------------------------------------------------------')

MNB_classifier = nltk.SklearnClassifier(MultinomialNB())
MNB_classifier.train(X_Train)
print('MNB_classifier accuracy percent:', (nltk.classify.accuracy(MNB_classifier, y_test))*100)
print('--------------------------------------------------------------------')

BernoulliNB_classifier = nltk.SklearnClassifier(BernoulliNB())
BernoulliNB_classifier.train(X_Train)
print('BernoulliNB_classifier accuracy percent:', (nltk.classify.accuracy(BernoulliNB_classifier, y_test))*100)
print('--------------------------------------------------------------------')

LogisticRegression_classifier = nltk.SklearnClassifier(LogisticRegression())
LogisticRegression_classifier.train(X_Train)
print('LogisticRegression_classifier accuracy percent:', (nltk.classify.accuracy(LogisticRegression_classifier, y_test))*100)
print('--------------------------------------------------------------------')

SGDClassifier_classifier = nltk.SklearnClassifier(SGDClassifier())
SGDClassifier_classifier.train(X_Train)
print('SGDClassifier_classifier accuracy percent:', (nltk.classify.accuracy(SGDClassifier_classifier, y_test))*100)
print('--------------------------------------------------------------------')

SVC_classifier = nltk.SklearnClassifier(SVC())
SVC_classifier.train(X_Train)
print('SVC_classifier accuracy percent:', (nltk.classify.accuracy(SVC_classifier, y_test))*100)
print('--------------------------------------------------------------------')

LinearSVC_classifier = nltk.SklearnClassifier(LinearSVC())
LinearSVC_classifier.train(X_Train)
print('LinearSVC_classifier accuracy percent:', (nltk.classify.accuracy(LinearSVC_classifier, y_test))*100)
print('--------------------------------------------------------------------')

NuSVC_classifier = nltk.SklearnClassifier(NuSVC())
NuSVC_classifier.train(X_Train)
print('NuSVC_classifier accuracy percent:', (nltk.classify.accuracy(NuSVC_classifier, y_test))*100)
print('--------------------------------------------------------------------')

Naive Bayes accuracy percent "%": 75.90361445783132
Most Informative Features
                provides = True              pos : neg    =     18.5 : 1.0
                 generic = True              neg : pos    =     16.2 : 1.0
                 routine = True              neg : pos    =     16.2 : 1.0
                mediocre = True              neg : pos    =     16.2 : 1.0
                  boring = True              neg : pos    =     14.3 : 1.0
               inventive = True              pos : neg    =     13.8 : 1.0
                    flat = True              neg : pos    =     13.7 : 1.0
               wonderful = True              pos : neg    =     12.7 : 1.0
              refreshing = True              pos : neg    =     12.5 : 1.0
                    dull = True              neg : pos    =     12.3 : 1.0
                    warm = True              pos : neg    =     12.3 : 1.0
            refreshingly = True              pos : neg    =     11.8 : 1.0
               realist



SGDClassifier_classifier accuracy percent: 71.6867469879518
--------------------------------------------------------------------
SVC_classifier accuracy percent: 46.3855421686747
--------------------------------------------------------------------
LinearSVC_classifier accuracy percent: 72.89156626506023
--------------------------------------------------------------------
NuSVC_classifier accuracy percent: 74.09638554216868
--------------------------------------------------------------------
Wall time: 11min 22s
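
Each accuracy above is the share of test items whose predicted label equals the true label. Here is a minimal sketch of that arithmetic with a stand-in classifier. `KeywordClassifier`, the local `accuracy` function, and the toy test set are hypothetical and only illustrate the calculation; they are not the trained models.

```python
class KeywordClassifier:
    """Stand-in classifier: predicts 'pos' when the 'wonderful' feature is True."""

    def classify(self, features):
        return "pos" if features.get("wonderful") else "neg"


def accuracy(classifier, labeled_featuresets):
    """Fraction of (features, label) pairs that the classifier labels correctly."""
    correct = sum(
        1 for feats, label in labeled_featuresets
        if classifier.classify(feats) == label
    )
    return correct / len(labeled_featuresets)


toy_test = [
    ({"wonderful": True}, "pos"),
    ({"wonderful": False}, "neg"),
    ({"wonderful": False}, "pos"),   # misclassified as 'neg'
    ({"wonderful": True}, "pos"),
]

print(accuracy(KeywordClassifier(), toy_test) * 100)  # 3 of 4 correct -> 75.0
```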


# <span style="color:green"> Pickle all my pretrained models.

In [110]:
# Naive Bayes (Pickle)
save_classifier_NB = open('_pickled_algos/Naive_Bayes.pickle', 'wb')
pickle.dump(NB_classifier, save_classifier_NB)
save_classifier_NB.close()

# Multinomial Naive Bayes (Pickle)
save_classifier_MNB = open('_pickled_algos/Multinomial_Naive_Bayes.pickle', 'wb')
pickle.dump(MNB_classifier, save_classifier_MNB)
save_classifier_MNB.close()

# Bernoulli Naive Bayes (Pickle)
save_classifier_BNB = open('_pickled_algos/Bernoulli_Naive_Bayes.pickle', 'wb')
pickle.dump(BernoulliNB_classifier, save_classifier_BNB)
save_classifier_BNB.close()

# Logistic_Regression (Pickle)
save_classifier_LR = open('_pickled_algos/Logistic_Regression.pickle', 'wb')
pickle.dump(LogisticRegression_classifier, save_classifier_LR)
save_classifier_LR.close()

# Stochastic Gradient Descent (Pickle)
save_classifier_SGD = open('_pickled_algos/Stochastic_Gradient_Descent.pickle', 'wb')
pickle.dump(SGDClassifier_classifier, save_classifier_SGD)
save_classifier_SGD.close()

# SVC (Pickle)
save_classifier_SVC = open('_pickled_algos/Support_Vector_Classifire.pickle', 'wb')
pickle.dump(SVC_classifier, save_classifier_SVC)
save_classifier_SVC.close()

# Linear SVC (Pickle)
save_classifier_LinearSVC = open('_pickled_algos/Linear_Support_Vector_Classifire.pickle', 'wb')
pickle.dump(LinearSVC_classifier, save_classifier_LinearSVC)
save_classifier_LinearSVC.close()

# NuSVC (Pickle)
save_classifier_NuSVC = open('_pickled_algos/Nu_Support_Vector_Classifire.pickle', 'wb')
pickle.dump(NuSVC_classifier, save_classifier_NuSVC)
save_classifier_NuSVC.close()
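
The open / dump / close pattern above repeats for every model. Here is a sketch of a pair of small helpers that do the same with a `with` block, so the file is closed even if pickling fails. The names `save_pickle` and `load_pickle` and the temporary path are hypothetical.

```python
import os
import pickle
import tempfile


def save_pickle(obj, path):
    """Pickle obj to path; the file is closed automatically."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def load_pickle(path):
    """Load and return the pickled object stored at path."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Round trip through a temporary directory, for illustration only
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "example.pickle")

save_pickle({"label": "pos"}, path)
restored = load_pickle(path)
print(restored)
```

With such helpers, each model could be saved in one line, for example `save_pickle(NB_classifier, '_pickled_algos/Naive_Bayes.pickle')`.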

<img src="img/lin.jpg">

# <span style="color:#ffad01">5. Module: <span style="color:#004577"> Create a new module to find the sentiment of tweets, and save it as "my_sent_mod.py"

<span style="color:#be0119"><b>
--- comment ---</b>

<span style="color:#1e488f"> In this part I create a new module, which I will use later to analyse tweets live. It defines a class that returns the mode (the majority vote) of all my models and a confidence value. The module loads my pickled models. 

In [112]:
%%time
import nltk
import random
from nltk.classify.scikitlearn import SklearnClassifier
import pickle
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC
from nltk.classify import ClassifierI
from statistics import mode
from nltk.tokenize import word_tokenize

class VoteClassifire(ClassifierI):
    def __init__(self, *classifiers):
        self._classifiers = classifiers
    
    def classify(self, features):
        votes = []
        for c in self._classifiers:
            v = c.classify(features)
            votes.append(v)
        return mode(votes)
    
    def confidence(self, features):
        votes = []
        for c in self._classifiers:
            v = c.classify(features)
            votes.append(v)
        
        choice_votes = votes.count(mode(votes))
        conf = choice_votes / len(votes)
        return conf

    
document_f = open('_pickled_algos/documents.pickle', 'rb')
documents = pickle.load(document_f)
document_f.close()

word_features_f = open('_pickled_algos/word_features.pickle', 'rb')
word_features = pickle.load(word_features_f)
word_features_f.close()



def find_features(document):
    words = word_tokenize(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)
    return features


featuresets_f = open('_pickled_algos/featuresets.pickle', 'rb')
featuresets = pickle.load(featuresets_f)
featuresets_f.close()

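# The models are loaded from pickles below, so this split is not used for training here.
# After re-shuffling, this test slice may overlap the data the pickled models were trained on.
random.shuffle(featuresets)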

y_test = featuresets[10000:]
X_Train = featuresets[:10000]


# Load all the models I will use. SVC is left out because of its low score.

open_file = open('_pickled_algos/Naive_Bayes.pickle', 'rb')
NB_classifier = pickle.load(open_file)
open_file.close()

open_file = open('_pickled_algos/Multinomial_Naive_Bayes.pickle', 'rb')
MNB_classifier = pickle.load(open_file)
open_file.close()

open_file = open('_pickled_algos/Bernoulli_Naive_Bayes.pickle', 'rb')
BernoulliNB_classifier = pickle.load(open_file)
open_file.close()

open_file = open('_pickled_algos/Logistic_Regression.pickle', 'rb')
LogisticRegression_classifier = pickle.load(open_file)
open_file.close()

open_file = open('_pickled_algos/Linear_Support_Vector_Classifire.pickle', 'rb')
LinearSVC_classifier = pickle.load(open_file)
open_file.close()

open_file = open('_pickled_algos/Stochastic_Gradient_Descent.pickle', 'rb')
SGDC_classifier = pickle.load(open_file)
open_file.close()

open_file = open('_pickled_algos/Nu_Support_Vector_Classifire.pickle', 'rb')
NuSVC_classifier = pickle.load(open_file)
open_file.close()


# Create the voting classifier
voted_classifier = VoteClassifire(NB_classifier, 
                                  LinearSVC_classifier, 
                                  MNB_classifier,
                                  BernoulliNB_classifier,
                                  LogisticRegression_classifier,
                                  SGDC_classifier,
                                  NuSVC_classifier)


def sentiment(text):
    feats = find_features(text)
    
    return voted_classifier.classify(feats), voted_classifier.confidence(feats)  # confidence level = share of votes for the winning label

Wall time: 10.7 s
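
To make the voting arithmetic concrete, here is a small sketch with stand-in classifiers that always return a fixed label. `FixedClassifier` and `vote` are hypothetical names used only for this illustration; the real module uses the pickled models.

```python
from statistics import mode


class FixedClassifier:
    """Stand-in classifier that always answers with the same label."""

    def __init__(self, label):
        self.label = label

    def classify(self, features):
        return self.label


def vote(classifiers, features):
    """Return the majority label and the share of classifiers that chose it."""
    votes = [c.classify(features) for c in classifiers]
    winner = mode(votes)
    return winner, votes.count(winner) / len(votes)


# Seven voters: four say 'pos', three say 'neg'
classifiers = [FixedClassifier("pos")] * 4 + [FixedClassifier("neg")] * 3
label, conf = vote(classifiers, {})
print(label, conf)
```

With seven voters and two labels there is never a tie, and the confidence can only be 4/7, 5/7, 6/7 or 1. A threshold of 85% therefore lets through only results where at least six of the seven models agree.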


# <span style="color:green"> Test module on a few sentences.

<span style="color:#be0119"><b>
--- comment ---</b>

<span style="color:#1e488f"> As we can see, there are sometimes mistakes, and the confidence value shows how strongly the models agree on each answer. 

In [118]:
print('Should be positive')
print('----------------------------------')
print(sentiment('And there the grass grows soft and white, And there the sun burns crimson bright'))
print(sentiment('Friendship is the rainbow between two hearts sharing seven colors: feelings, love, sadness, happiness, truth, faith, secret & respect'))
print(sentiment("Your favorite part of the week... here's a *NEW* sneak peek of our third jersey this year!"))
print(sentiment('Happy Birthday to my amazing wife'))

Should be positive
----------------------------------
('neg', 0.5714285714285714)
('pos', 1.0)
('pos', 0.5714285714285714)
('pos', 1.0)


In [117]:
print('Should be negative')
print('----------------------------------')
print(sentiment('It stops with you! Scroll away from #cyberbullying. Don’t “like,” share, or comment on #negative information that has been posted about someone else.'))
print(sentiment("broken. from the bottom of my heart, i am so so sorry. i don't have words."))
print(sentiment('The reason so many think Trump is a white supremacist is because of many things that have been reported by the media. Let the facts speak'))
print(sentiment('are u a moron?? Brock Lesnar was drug tested like crazy by USADA. He was clean.'))

Should be negative
----------------------------------
('neg', 1.0)
('neg', 1.0)
('pos', 0.5714285714285714)
('neg', 0.8571428571428571)


<img src="img/lin.jpg">

# <span style="color:#ffad01">6. Connect to Twitter: <span style="color:#004577">Use the API to connect to Twitter.

<span style="color:#be0119"><b>
--- comment ---</b>

<span style="color:#1e488f"> I will use my keys to connect to Twitter, but as I don't want to share my keys I will hide them. Then I use a listener class to read tweets and apply my module to them to find out their sentiment. I take into account only sentiments with a confidence of at least 85%. I save all such sentiments to the "twitter-records.txt" file, which I will use later to create the live chart.

In [124]:
from tweepy import Stream
from tweepy import OAuthHandler
from tweepy.streaming import StreamListener
import json

# Import my module which returns sentiment pos or neg and confidence
import my_sent_mod as s

# Define the Twitter keys in these variables:

# ckey = 'your ckey'
# csecret = 'your csecret'
# atoken = 'your atoken'
# asecret = 'your asecret'

# Load my API keys
from tweetAPIs import *

class listener(StreamListener):
    
    def on_data(self, data):
        all_data = json.loads(data)
        
        # Skip stream messages that carry no tweet text (e.g. delete or limit notices)
        if "text" not in all_data:
            return True
        
        tweet = all_data["text"]
        # Run my module on the tweet
        sentiment_value, confidence = s.sentiment(tweet)
        print(tweet, sentiment_value, confidence)
        
        # If confidence is at least 85%, append the sentiment to the file (mode "a" = append)
        if confidence*100 >= 85:
            with open("twitter-records.txt", "a") as output:
                output.write(sentiment_value)
                output.write('\n')
                
        return True
   
    def on_error(self, status):
        print(status)

# Authorising myself
auth = OAuthHandler(ckey, csecret)
auth.set_access_token(atoken, asecret)
twitterStream = Stream(auth, listener())

# Keyword(s) to track; this can be anything
twitterStream.filter(track=['weekends'])

RT @StratfordCric: 🗞 | Herald Report 

‘Old Hill made absolutely no effort to try to get the game complete, given the situation of the matc… neg 1.0
RT @itsjulesboi: Once you get a boyfriend: https://t.co/gQ4cBnG7Gd neg 0.8571428571428571
Pampalipas weekends haha https://t.co/aYPKCc4RNN neg 0.8571428571428571
This Friday, head over to The Terminus with that friend who always says “I’ll have just ONE drink” and ends up plas… https://t.co/MwpTMXuuHd neg 1.0
DJ Hannah will be at the B've tonight! Make plans NOW! 
#weekends #comeplay https://t.co/a1k6gzCQkh neg 0.8571428571428571
I dont work weekends but I kinda wanna find a part time weekend job..... neg 1.0
@MattGrandis One problem is that Vulkan has been built for the needs of big game engines. A handful game engines is… https://t.co/Te1b3j4Cxx neg 1.0
RT @Thevesh: A short thread: The insane glorification of overtime work. 

1) It's such a Malaysian/Asian mentality to overwork staff. They'… pos 1.0
RT @Thevesh: A short thread: The insan

KeyboardInterrupt: 
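
As an alternative to keeping the keys in a private `tweetAPIs.py` file, they could be read from environment variables. The sketch below assumes that approach; the function `load_twitter_keys` and the variable names `TWITTER_CKEY`, `TWITTER_CSECRET`, `TWITTER_ATOKEN` and `TWITTER_ASECRET` are hypothetical, and the values shown are placeholders.

```python
import os


def load_twitter_keys(env=os.environ):
    """Return (ckey, csecret, atoken, asecret) read from a mapping of
    environment variables; raise KeyError listing any that are missing."""
    names = ("TWITTER_CKEY", "TWITTER_CSECRET", "TWITTER_ATOKEN", "TWITTER_ASECRET")
    missing = [n for n in names if n not in env]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return tuple(env[n] for n in names)


# Placeholder mapping standing in for the real environment, for illustration only
fake_env = {
    "TWITTER_CKEY": "your ckey",
    "TWITTER_CSECRET": "your csecret",
    "TWITTER_ATOKEN": "your atoken",
    "TWITTER_ASECRET": "your asecret",
}

ckey, csecret, atoken, asecret = load_twitter_keys(fake_env)
print(ckey, atoken)
```

Calling `load_twitter_keys()` with no argument reads the real environment, so the keys never appear in the notebook or in a file next to it.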

<img src="img/lin.jpg">

# <span style="color:#ffad01">7. Live Graph: <span style="color:#004577">Creating real time chart.

<span style="color:#be0119"><b>
--- comment ---</b>

<span style="color:#1e488f"> To see the live chart, we can paste this piece of code into an editor such as Sublime Text and run it at the same time as the Twitter reader.

In [126]:
import matplotlib.pyplot as plt
import matplotlib.animation as animation
from matplotlib import style
import time

style.use('ggplot')

fig = plt.figure()
ax1 = fig.add_subplot(1,1,1)


def animate(i):
    with open('twitter-records.txt','r') as f:
        pullData = f.read()
    lines = pullData.split('\n')
    
    xar = []
    yar = []
    
    x = 0
    y = 0
    
    # As there is some bias towards negative sentiment, a 'pos' adds 1 and a 'neg' subtracts only 0.5
    for l in lines:
        x += 1
        if 'pos' in l:
            y += 1
        elif 'neg' in l:
            y -= 0.5
        
        xar.append(x)
        yar.append(y)

    ax1.clear()
    ax1.plot(xar, yar)

ani = animation.FuncAnimation(fig, animate, interval=1000)
plt.show()
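
The scoring inside `animate` can be checked without a plot or a live stream. Here is a sketch that pulls the running-score logic into a pure function; `running_score` and its step parameters are hypothetical names, with defaults matching the 1 and 0.5 weights used above.

```python
def running_score(lines, pos_step=1.0, neg_step=0.5):
    """Return x positions and the cumulative score after each line:
    'pos' adds pos_step, 'neg' subtracts neg_step, other lines add nothing."""
    xs, ys = [], []
    score = 0.0
    for i, line in enumerate(lines, start=1):
        if "pos" in line:
            score += pos_step
        elif "neg" in line:
            score -= neg_step
        xs.append(i)
        ys.append(score)
    return xs, ys


# The trailing empty string mimics the empty last item produced by split('\n')
xs, ys = running_score(["pos", "neg", "neg", "pos", ""])
print(xs)
print(ys)
```

A rising line therefore means that positive tweets outweigh negative ones by more than two to one over that stretch.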

<img src="img/final.jpg">

<span style="color:#be0119"><b>
--- comment ---</b>

<span style="color:#1e488f"> Here you can see tweets on the topic "weekends" being read in real time. As I ran this code on a Friday, we got more positive sentiments. Sentiments with at least 85% confidence go to the text file, and using matplotlib animation I load that data from the text file to draw the graph, which updates in real time. 

<img src="img/lin.jpg">