# Consent between tweets and IMDB rating on TV shows


### Objective:
Objective of our project is to analyze the relationships between the tweets that users post on Twitter and the real TV ratings provided by the people on Internet Movie Database (IMDB). The results can be used as a key to business strategies for recommending a TV show based upon tweets along with the IMDB ratings.

The Hypothesis of the Project is that, if we fetch tweets related to a particular show and perform the sentimental analysis on the data, then we will be able to compare the analysis results with the overall rating available for that show on IMDB. From the results we try to check whether there is any consent between the two.

### Data Description:  
The details of the data are as follows:

Twitter Data:-
1. The tweets with show name’s hashtag (For e.g., #DesignatedSurvivor) were collected once per week, over a period of two weeks using Twitter REST API.
2. The fields which were utilized from the tweets are User Screen name, the actual text and the tweet ID (To avoid redundancy).

IMDB Data:-
1. The latest ratings for Designated Survivor and Lethal Weapon shows were extracted from IMDB website by writing a Web Scraper using BeautifulSoup parser.
2. The fields which were utilized from IMDB show page are rating and number of users voted for that show.
3. The Ratings for shows obtained at 18:00 pm CST on 11/17/2016 are as follows:

    a. Designated Survivor- 8.1 rating from 10,308 users

    b. Lethal Weapon- 7.8 rating from 7,871 users

<h2 align="center"> Loading the data </h2>

In [5]:
"""Loading the Data"""

import hashlib
import matplotlib.pyplot as plt
import numpy as np
import os
import re
from sklearn.cross_validation import KFold
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from bs4 import BeautifulSoup
from urllib.request import urlopen
%matplotlib inline


""" Root folder name, placed parallel to project. Here all training and testing 
    tweet files are kept. """

path = 'Project_Data'

""" fetch_tweet_files function returns a list of tweet file names in the folder 
    directory that end in .txt"""

def fetch_tweet_files(path):
    
    files_list = []
    
    for file in (os.listdir(path)):
        if file.endswith(".txt"):
            files_list.append((os.path.join(path) + os.sep + file).replace("\\","/"))
    
    return files_list

""" fetching the positive, negative and neutral training tweet files from directory path"""

pos_train_files = fetch_tweet_files(path + os.sep + 'train' + os.sep + 'all_shows' + os.sep + 'pos')
neg_train_files = fetch_tweet_files(path + os.sep + 'train' + os.sep + 'all_shows' + os.sep + 'neg')
neutral_train_files = fetch_tweet_files(path + os.sep + 'train' + os.sep + 'all_shows' + os.sep + 'neu')

""" combining all positive, negative and neutral lists into a single list"""
combined_train_files = pos_train_files + neg_train_files + neutral_train_files

""" printing the Number of positive, negative and neutral tweet files"""

print('We have %d positive, %d negative and %d neutral training tweet files for both shows' %
      (len(pos_train_files), len(neg_train_files), len(neutral_train_files) ))


""" training_true_labels function is assigning the labels to the tweet files by passing 
    all training files names in the parameter . Here label '1' is assigned to positive 
    tweets, label '-1' is assigned to negative tweets and label '0' is assigned to neutral tweets.
    This function will return a numpy array of labels assigned to the training tweets."""

def training_true_labels(train_files):
    
    positive = 'pos'
    negative = 'neg'
    neutral = 'neu'
    array_list = []
    
    for files in train_files:
        
        if(positive in files):
            array_list.append(1)
        elif(negative in files):
            array_list.append(-1)
        elif(neutral in files):
            array_list.append(0)
    
        
    return np.array(array_list)
    

labels = training_true_labels(combined_train_files)

""" printing the training labels"""

print ("\nThe assigned labels are: %s" %labels)





We have 1159 positive, 153 negative and 672 neutral training tweet files for both shows

The assigned labels are: [1 1 1 ..., 0 0 0]


<h2 align="center"> Data Preprocessing and Printing the data shape </h2>

In [6]:
""" preprocessing_Tweet function takes a tweet text as a parameter and doing preprocessing
    on tweet text and it returns the tweet after preprocessing"""

def preprocessing_Tweet(tweet):
    
    #Collapse URLs starting with www.* or https?://* to THIS_IS_A_URL
    tweet = re.sub('((www\.[^\s]+)|(https?://[^\s]+))','THIS_IS_A_URL',tweet)
    
    #Collapse Mentions like @username to THIS_IS_A_MENTION
    tweet = re.sub('@[^\s]+','THIS_IS_A_MENTION',tweet)
    
    #Replace Hastags words to words without hashtags like #Quantico with Quantico
    tweet = re.sub(r'#([^\s]+)', r'\1', tweet)
    
    return tweet

""" tokenize function takes a tweet text as a parameter and tokenizing the tweet contents
    into tokens list. This function is considering punctuations. Also to handle the negations
    in the tweets, whenever the term 'not' appears in the tweet, the tokenizer will change 
    the two subsequent tokens to have the prefix 'not_' prior to the token.
    This function returns a tokens list."""

def tokenize(text):
    
    preprocessed_tweet = preprocessing_Tweet(text)
    split_text = preprocessed_tweet.lower().split()
    final_text =' '.join(split_text)
    tokens_list = re.findall(r"[\w]+|[^\w\s]", final_text)
    
    tokens_length = len(tokens_list)
    
    for i in range(tokens_length):
        if tokens_list[i] == 'not':
            if (i+1 < tokens_length):
                tokens_list[i+1] = 'not_%s' %tokens_list[i+1]
            if (i+2 < tokens_length):   
                tokens_list[i+2] = 'not_%s' %tokens_list[i+2]
                
    return tokens_list

""" perform_vectorize function takes all of the combined tweet files names, tokenizer
    function, min_df value, max_df value and other parameters. It is vectorizing the
    tweets using CountVectorizer method of sklearn package. The input value is taken as
    'filename' so that filename mentioned with the path can be extracted and read for
    fetching the text. After vectorizing, the vectorize object is used for getting
    features matrix. This function returns feature matrix and vectorizer object."""    

def perform_vectorize(tweets, tokenizer_fn=tokenize, min_df=2,
                 max_df=.7, binary=True, ngram_range=(1,1)):
    
    
    vectorizer = CountVectorizer(input='filename', tokenizer=tokenizer_fn, ngram_range=ngram_range, max_df=max_df, 
                                 min_df=min_df, stop_words="english",  binary=binary, dtype=int)
    
    X = vectorizer.fit_transform(tweets)
    
    return X, vectorizer

""" Calling the perform_vectorize method for getting a feature matrix and vectorizer obejct.
    Also printing the number of tweets and features in the feature matrix. """ 
    
matrix, vec = perform_vectorize(combined_train_files)
print ('\nThe feature matrix contains %d tweets instances with %d features\n' % (matrix.shape[0], matrix.shape[1]))

def repeatable_random(seed):
    hash = str(seed)
    hash = hash.encode('utf-8')
    while True:
        hash = hashlib.md5(hash).digest()
        for c in hash:
            yield (c)

def repeatable_shuffle(X, y, combined_train_files):
    r = repeatable_random(42)
    indices = sorted(range(X.shape[0]), key=lambda x: next(r))
    return X[indices], y[indices], np.array(combined_train_files)[indices]

X, y, filenames = repeatable_shuffle(matrix, labels, combined_train_files )


#top_n = 100
#top_features = [features[i] for i in indices[:top_n]]
#print top_features
#print(vec.get_feature_names()[:100])

#indices = np.argsort(vec.idf_)[::-1]
features = vec.get_feature_names()
#top_n = 100
#top_features = [features[i] for i in indices[:top_n]]



The feature matrix contains 1984 tweets instances with 1609 features



## The Performance Measure used:

<h3> accuracy_score </h3>

We are calculating the average cross-validation accuracy on training as well as test data.

Our data has train-test split, as we have manually labelled the tweets collected from Twitter

<h2 align="center"> Model Selection on train-split </h2>

### Firstly, we have selected Logistic Regression model with penalty='l2', C=1.0 and random_state=0

In [7]:
""" get_clf function returns the Logistic Regression classifier"""

def get_clf():
    
    return LogisticRegression(penalty='l2', C=1.0, random_state=0)

""" perform_cross_validation function is calculating the accuracy of the data using N-fold
    cross validation function. We have used 10 folds.
    This function returns the average accuracy for the dataset. """

def perform_cross_validation(X, y, n_folds, verbose=False):

    crossValidation =KFold(len(y), n_folds=n_folds)
   
    accuracies = []
    
    for train_idx, test_idx in crossValidation:
        
        clf = get_clf()
        clf.fit(X[train_idx], y[train_idx])
        predicted = clf.predict(X[test_idx])
        acc = accuracy_score(y[test_idx], predicted)
        accuracies.append(acc)
        
    avg = np.mean(accuracies)
    
    return avg

""" printing the Average cross validation accuracy on training data. """
 
print('\nThe Average cross validation accuracy on training data= %.4f' %perform_cross_validation(X, y,n_folds=10, verbose=False))

print('\nThe Average cross validation accuracy on training data= %.4f' %perform_cross_validation(X, y,n_folds=15, verbose=False))



The Average cross validation accuracy on training data= 0.7389

The Average cross validation accuracy on training data= 0.7394


#### Logistic Regression model with penalty='l1', C=1.0 and random_state=0

In [8]:
""" get_clf function returns the Logistic Regression classifier"""

def get_clf():
    
    return LogisticRegression(penalty='l1', C=1.0, random_state=0)

""" perform_cross_validation function is calculating the accuracy of the data using N-fold
    cross validation function. We have used 10 folds.
    This function returns the average accuracy for the dataset. """

def perform_cross_validation(X, y, n_folds, verbose=False):

    crossValidation =KFold(len(y), n_folds=n_folds)
   
    accuracies = []
    
    for train_idx, test_idx in crossValidation:
        
        clf = get_clf()
        clf.fit(X[train_idx], y[train_idx])
        predicted = clf.predict(X[test_idx])
        acc = accuracy_score(y[test_idx], predicted)
        accuracies.append(acc)
        
    avg = np.mean(accuracies)
    
    return avg

""" printing the Average cross validation accuracy on training data. """
 
print('\nThe Average cross validation accuracy on training data= %.4f' %perform_cross_validation(X, y,n_folds=10, verbose=False))

print('\nThe Average cross validation accuracy on training data= %.4f' %perform_cross_validation(X, y,n_folds=15, verbose=False))



The Average cross validation accuracy on training data= 0.7298

The Average cross validation accuracy on training data= 0.7258


#### Logistic Regression model with penalty='l2', C=1.0 and random_state=9

In [9]:
""" get_clf function returns the Logistic Regression classifier"""

def get_clf():
    
    return LogisticRegression(penalty='l2', C=1.0, random_state=9)

""" perform_cross_validation function is calculating the accuracy of the data using N-fold
    cross validation function. We have used 10 folds.
    This function returns the average accuracy for the dataset. """

def perform_cross_validation(X, y, n_folds, verbose=False):

    crossValidation =KFold(len(y), n_folds=n_folds)
   
    accuracies = []
    
    for train_idx, test_idx in crossValidation:
        
        clf = get_clf()
        clf.fit(X[train_idx], y[train_idx])
        predicted = clf.predict(X[test_idx])
        acc = accuracy_score(y[test_idx], predicted)
        accuracies.append(acc)
        
    avg = np.mean(accuracies)
    
    return avg

""" printing the Average cross validation accuracy on training data. """
 
print('\nThe Average cross validation accuracy on training data= %.4f' %perform_cross_validation(X, y,n_folds=10, verbose=False))

print('\nThe Average cross validation accuracy on training data= %.4f' %perform_cross_validation(X, y,n_folds=15, verbose=False))



The Average cross validation accuracy on training data= 0.7389

The Average cross validation accuracy on training data= 0.7394


### Second, we have selected Support Vector Machine model. Here all default parameters have been used.

In [10]:
""" get_clf function returns the Support Vector Machine classifier"""
from sklearn import svm
def get_clf():
    
    return svm.SVC()

""" perform_cross_validation function is calculating the accuracy of the data using N-fold
    cross validation function. We have used 10 folds.
    This function returns the average accuracy for the dataset. """

def perform_cross_validation(X, y, n_folds, verbose=False):

    crossValidation =KFold(len(y), n_folds=n_folds)
   
    accuracies = []
    
    for train_idx, test_idx in crossValidation:
        
        clf = get_clf()
        clf.fit(X[train_idx], y[train_idx])
        predicted = clf.predict(X[test_idx])
        acc = accuracy_score(y[test_idx], predicted)
        accuracies.append(acc)
        
    avg = np.mean(accuracies)
    
    return avg

""" printing the Average cross validation accuracy on training data. """

print('\nThe Average cross validation accuracy on training data for 10 folds: %.4f' %perform_cross_validation(X, y, n_folds=10, verbose=False))

print('\nThe Average cross validation accuracy on training data for 15 folds: %.4f' %perform_cross_validation(X, y, n_folds=15, verbose=False))




The Average cross validation accuracy on training data for 10 folds: 0.5842

The Average cross validation accuracy on training data for 15 folds: 0.5841


#### SVM model with C=1.0, kernel='linear'

In [11]:
""" get_clf function returns the Support Vector Machine classifier"""
from sklearn import svm
def get_clf():
    
    return svm.SVC(C=1.0, kernel='linear')

""" perform_cross_validation function is calculating the accuracy of the data using N-fold
    cross validation function. We have used 10 folds.
    This function returns the average accuracy for the dataset. """

def perform_cross_validation(X, y, n_folds, verbose=False):

    crossValidation =KFold(len(y), n_folds=n_folds)
   
    accuracies = []
    
    for train_idx, test_idx in crossValidation:
        
        clf = get_clf()
        clf.fit(X[train_idx], y[train_idx])
        predicted = clf.predict(X[test_idx])
        acc = accuracy_score(y[test_idx], predicted)
        accuracies.append(acc)
        
    avg = np.mean(accuracies)
    
    return avg

""" printing the Average cross validation accuracy on training data. """

print('\nThe Average cross validation accuracy on training data for 10 folds: %.4f' %perform_cross_validation(X, y, n_folds=10, verbose=False))

print('\nThe Average cross validation accuracy on training data for 15 folds: %.4f' %perform_cross_validation(X, y, n_folds=15, verbose=False))




The Average cross validation accuracy on training data for 10 folds: 0.7182

The Average cross validation accuracy on training data for 15 folds: 0.7152


#### SVM model with C=1.0, kernel='poly', random_state=8

In [12]:
""" get_clf function returns the Support Vector Machine classifier"""
from sklearn import svm
def get_clf():
    
    return svm.SVC(C=1.0, kernel='poly', random_state=8)

""" perform_cross_validation function is calculating the accuracy of the data using N-fold
    cross validation function. We have used 10 folds.
    This function returns the average accuracy for the dataset. """

def perform_cross_validation(X, y, n_folds, verbose=False):

    crossValidation =KFold(len(y), n_folds=n_folds)
   
    accuracies = []
    
    for train_idx, test_idx in crossValidation:
        
        clf = get_clf()
        clf.fit(X[train_idx], y[train_idx])
        predicted = clf.predict(X[test_idx])
        acc = accuracy_score(y[test_idx], predicted)
        accuracies.append(acc)
        
    avg = np.mean(accuracies)
    
    return avg

""" printing the Average cross validation accuracy on training data. """

print('\nThe Average cross validation accuracy on training data for 10 folds: %.4f' %perform_cross_validation(X, y, n_folds=10, verbose=False))

print('\nThe Average cross validation accuracy on training data for 15 folds: %.4f' %perform_cross_validation(X, y, n_folds=15, verbose=False))



The Average cross validation accuracy on training data for 10 folds: 0.5842

The Average cross validation accuracy on training data for 15 folds: 0.5841


#### SVM model with C=1.0, kernel='poly', degree=2

In [13]:
""" get_clf function returns the Support Vector Machine classifier"""
from sklearn import svm
def get_clf():
    
    return svm.SVC(C=1.0, kernel='poly', degree=2)

""" perform_cross_validation function is calculating the accuracy of the data using N-fold
    cross validation function. We have used 10 folds.
    This function returns the average accuracy for the dataset. """

def perform_cross_validation(X, y, n_folds, verbose=False):

    crossValidation =KFold(len(y), n_folds=n_folds)
   
    accuracies = []
    
    for train_idx, test_idx in crossValidation:
        
        clf = get_clf()
        clf.fit(X[train_idx], y[train_idx])
        predicted = clf.predict(X[test_idx])
        acc = accuracy_score(y[test_idx], predicted)
        accuracies.append(acc)
        
    avg = np.mean(accuracies)
    
    return avg

""" printing the Average cross validation accuracy on training data. """

print('\nThe Average cross validation accuracy on training data for 10 folds: %.4f' %perform_cross_validation(X, y, n_folds=10, verbose=False))

print('\nThe Average cross validation accuracy on training data for 15 folds: %.4f' %perform_cross_validation(X, y, n_folds=15, verbose=False))



The Average cross validation accuracy on training data for 10 folds: 0.5842

The Average cross validation accuracy on training data for 15 folds: 0.5841


### Third, we have selected Multinomial Naive Bayes model. Here all default parameters have been used.

In [14]:
"""Multinomial Naive Bayel model prediction"""
import numpy as np
from sklearn.naive_bayes import MultinomialNB
def get_clf():
    
    return MultinomialNB()


""" perform_cross_validation function is calculating the accuracy of the data using N-fold
    cross validation function. We have used 10 folds.
    This function returns the average accuracy for the dataset. """

def perform_cross_validation(X, y, n_folds, verbose=False):

    crossValidation =KFold(len(y), n_folds=n_folds)
   
    accuracies = []
    
    for train_idx, test_idx in crossValidation:
        
        clf = get_clf()
        
        predicted = clf.fit(X[train_idx], y[train_idx]).predict(X[test_idx])
        #print(predicted)               

        acc = accuracy_score(y[test_idx], predicted)
        accuracies.append(acc)
        
    avg = np.mean(accuracies)
    
    return avg

""" printing the Average cross validation accuracy on training data. """

print('\nThe Average cross validation accuracy on training data for 10 folds: %.4f' %perform_cross_validation(X, y, n_folds=10, verbose=False))

print('\nThe Average cross validation accuracy on training data for 15 folds: %.4f' %perform_cross_validation(X, y, n_folds=15, verbose=False))


The Average cross validation accuracy on training data for 10 folds: 0.7197

The Average cross validation accuracy on training data for 15 folds: 0.7192


#### Multinomial Naive Bayes model with fit_prior=False

In [15]:
"""Multinomial Naive Bayes model prediction"""
import numpy as np
from sklearn.naive_bayes import MultinomialNB
def get_clf():
    
    return MultinomialNB(fit_prior=False)


""" perform_cross_validation function is calculating the accuracy of the data using N-fold
    cross validation function. We have used 10 folds.
    This function returns the average accuracy for the dataset. """

def perform_cross_validation(X, y, n_folds, verbose=False):

    crossValidation =KFold(len(y), n_folds=n_folds)
   
    accuracies = []
    
    for train_idx, test_idx in crossValidation:
        
        clf = get_clf()
        
        predicted = clf.fit(X[train_idx], y[train_idx]).predict(X[test_idx])
        #print(predicted)               

        acc = accuracy_score(y[test_idx], predicted)
        accuracies.append(acc)
        
    avg = np.mean(accuracies)
    
    return avg

""" printing the Average cross validation accuracy on training data. """

print('\nThe Average cross validation accuracy on training data for 10 folds: %.4f' %perform_cross_validation(X, y, n_folds=10, verbose=False))

print('\nThe Average cross validation accuracy on training data for 15 folds: %.4f' %perform_cross_validation(X, y, n_folds=15, verbose=False))


The Average cross validation accuracy on training data for 10 folds: 0.6799

The Average cross validation accuracy on training data for 15 folds: 0.6764


In [27]:
"""Gaussians Naive Bayes model prediction"""

from sklearn.naive_bayes import GaussianNB
import numpy as np

clf = GaussianNB()

def perform_cross_validation(X, y, k=10):

    crossValidation =KFold(len(y))
   
    accuracies = []
    
    for train_idx, test_idx in crossValidation:
        
       
        clf.fit(X[train_idx], y[train_idx])
        
        
        predicted = clf.predict(X[test_idx])
        acc = accuracy_score(y[test_idx], predicted)
        accuracies.append(acc)
        
    avg = np.mean(accuracies)
    
    return avg

perform_cross_validation(X.toarray(), y)

""" printing the Average cross validation accuracy on training data. """

print('\nThe Average cross validation accuracy on training data for 10 folds: %.4f' %perform_cross_validation(X.toarray(), y, k=50))

print('\nThe Average cross validation accuracy on training data for 15 folds: %.4f' %perform_cross_validation(X.toarray(), y, k=10))


The Average cross validation accuracy on training data for 10 folds: 0.5030

The Average cross validation accuracy on training data for 15 folds: 0.5030


<h3> Based upon the different model selections using various parameters. We have received best results from Logistic Regression using penalty='l2', C=1.0 and random_state=0. Hence, we have chosen Logistic Regression Model for repoting the performance on testing data. </h3>

<h2 align="center"> Logistic Regression model </h2>

<h3>Performance on Full training data and labelled testing data:</h3>

In [None]:
""" get_clf function returns the Logistic Regression classifier"""

def get_clf():
    
    return LogisticRegression(C=1.0, random_state=0)

""" perform_cross_validation function is calculating the accuracy of the data using N-fold
    cross validation function. We have used 10 folds.
    This function returns the average accuracy for the dataset. """

def perform_cross_validation(X, y, n_folds=10, verbose=False):

    crossValidation =KFold(len(y), n_folds=n_folds)
   
    accuracies = []
    
    for train_idx, test_idx in crossValidation:
        
        clf = get_clf()
        clf.fit(X[train_idx], y[train_idx])
        predicted = clf.predict(X[test_idx])
        acc = accuracy_score(y[test_idx], predicted)
        accuracies.append(acc)
        
    avg = np.mean(accuracies)
     clf.predict(X[test_idx])
    return avg

""" printing the Average cross validation accuracy on training data. """
 
print('\nThe Average cross validation accuracy on training data= %.4f' %perform_cross_validation(X, y, n_folds=10, verbose=False))

clf = get_clf()
clf.fit(X, y)


pos_test_files = fetch_tweet_files(path + os.sep + 'test_labelled' + os.sep + 'pos')
neg_test_files = fetch_tweet_files(path + os.sep + 'test_labelled' + os.sep + 'neg')
neu_test_files = fetch_tweet_files(path + os.sep + 'test_labelled' + os.sep + 'neu')
shows_test_files = pos_test_files + neg_test_files + neu_test_files

X_labelled_test = vec.transform(shows_test_files)
y_labelled_test = np.array([1] * len(pos_test_files) + [-1] * len(neg_test_files) + [0] * len(neu_test_files))
print('\nX_labelled_test represents %d tweets with %d features' % (X_labelled_test.shape[0], X_labelled_test.shape[1]))
print('\ny_labelled_test has %d positive, %d negative and %d neutral labels' % (len(np.where(y_labelled_test==1)[0]),
            len(np.where(y_labelled_test==-1)[0]), len(np.where(y_labelled_test==0)[0])))

testingAccuracy=accuracy_score(y_labelled_test, clf.predict(X_labelled_test))

print('\nThe Testing accuracy for the labelled tweets= %.4g' %testingAccuracy)



<h2>Predicting the unseen tweets for both shows(Designated Survivor and Lethal Weapon)</h2>

In [None]:
""" Fetching the Designated Survivor test files(i.e. unseen tweet files) from the directory"""

designatedSurvivor_test_files = fetch_tweet_files(path + os.sep + 'test_designatedSurvivor')


""" Now, as we don't want to learn a new vocabulary. We are calling .transform
    using vectorizer object instead of .fit_transform, which was used earlier
    for training. """

designatedSurvivor_X_test = vec.transform(designatedSurvivor_test_files)

""" printing the number of testing tweets and features in the feature matrix. """
print('\ndesignatedSurvivor_X_test represents %d tweets with %d features' % (designatedSurvivor_X_test.shape[0], designatedSurvivor_X_test.shape[1]))

designatedSurvivor_pos_count= 0
designatedSurvivor_neg_count = 0
designatedSurvivor_neutral_count = 0

for i in clf.predict(designatedSurvivor_X_test):

    if i == 1:
        designatedSurvivor_pos_count+= 1
    elif i == -1:
        designatedSurvivor_neg_count+= 1
    elif i == 0:
        designatedSurvivor_neutral_count+= 1

print ("No of Positive Designated Survivor tweets predicted: %d" %designatedSurvivor_pos_count)
print ("No of Negative Designated Survivor tweets predicted: %d" %designatedSurvivor_neg_count)
print ("No of Neutral Designated Survivor tweets predicted: %d" %designatedSurvivor_neutral_count)

""" The above steps are now repeated for Lethal Weapon show. """

""" Fetching the Lethal Weapon test files(i.e. unseen tweet files) from the directory"""

lethalWeapon_test_files = fetch_tweet_files(path + os.sep + 'test_lethalWeapon')


lethalWeapon_X_test = vec.transform(lethalWeapon_test_files)

print('\nlethalWeapon_X_test represents %d tweets with %d features' % (lethalWeapon_X_test.shape[0], lethalWeapon_X_test.shape[1]))

lethalWeapon_pos_count= 0
lethalWeapon_neg_count = 0
lethalWeapon_neutral_count = 0

for i in clf.predict(lethalWeapon_X_test):

    if i == 1:
        lethalWeapon_pos_count+= 1
    elif i == -1:
        lethalWeapon_neg_count+= 1
    elif i == 0:
        lethalWeapon_neutral_count+= 1

print ("No of Positive Lethal Weapon tweets predicted: %d" %lethalWeapon_pos_count)
print ("No of Negative Lethal Weapon tweets predicted: %d" %lethalWeapon_neg_count)
print ("No of Neutral Lethal Weapon tweets predicted: %d" %lethalWeapon_neutral_count)



<h2 align="center"> Performance of predicting the majority class all the time </h2>

In our project, majority class is Positive class, whose label is 1.

In [None]:
y_maj = np.array([1] * len(designatedSurvivor_test_files))
testingAccuracy_maj=accuracy_score(y_maj, clf.predict(designatedSurvivor_X_test))
print ('\nThe Testing accuracy for predicting the majority class all the time (Designated Survivor)= %.4g' %testingAccuracy_maj)

y_maj1 = np.array([1] * len(lethalWeapon_test_files))
testingAccuracy_maj1=accuracy_score(y_maj1, clf.predict(lethalWeapon_X_test))
print ('\nThe Testing accuracy for predicting the majority class all the time (Lethal Weapon)= %.4g' %testingAccuracy_maj1)

<h2 align="center"> Performance of random prediction </h2>

In [None]:
def repeatable_random(seed):
    hash = str(seed)
    hash = hash.encode('utf-8')
    while True:
        hash = hashlib.md5(hash).digest()
        for c in hash:
            yield (c)

def repeatable_shuffle(X1, y1, designatedSurvivor_test_files):
    r = repeatable_random(42)
    indices = sorted(range(X1.shape[0]), key=lambda x: next(r))
    return X1[indices], y1[indices], np.array(designatedSurvivor_test_files)[indices]

##X1, y1, filenames = repeatable_shuffle(matrix, labels, designatedSurvivor_test_files )

testingAccuracy=accuracy_score(y_labelled_test, clf.predict(X_labelled_test))
print ('\nThe Testing accuracy for random prediction (Designated Survivor)= %.4g' %testingAccuracy)

y_rand = np.array([1] * len(designatedSurvivor_test_files))
testingAccuracy_rand=accuracy_score(y_ran, clf.predict(designatedSurvivor_X_test))
print ('\nThe Testing accuracy for random prediction (Lethal Weapon)= %.4g' %testingAccuracy_rand)


<h2 align="center"> Final conculsion after comparing the results with IMDB</h2>

In [None]:
""" The below code is a Web Scraper written for getting the Designated Survivor and Lethal Weapon ratings
    and number of users from IMDB """
rating=[]
urlList=["http://www.imdb.com/title/tt5296406/?ref_=nv_sr_1", "http://www.imdb.com/title/tt5164196/?ref_=nv_sr_1"]
outf = open('imdbDataSet.txt', 'wt')
for url in urlList:
    url_got = urlopen(url)
    soup = BeautifulSoup(url_got.read(), 'html.parser')
    for foo in soup.find_all('div', attrs={'class': 'ratingValue'}):
        
        bar = foo.find('span', attrs={'itemprop': 'ratingValue'})
        
        bar1 = foo.find('span', attrs={'itemprop': 'bestRating'})
       
        rating.append(bar.text)
        
    for name in soup.find_all('div', attrs={'class': 'title_wrapper'}):
        b=name.find('h1', attrs={'itemprop': 'name'})
    
    
    print ("%s rating from IMDB= %s" %(b.text.rstrip(),bar.text))
    obj='%s %s %s  \n'%(b.text,bar.text,bar1.text)
    outf.write(obj)
outf.close()

'''calculating score '''

score=[]
score.append((testingAccuracy*designatedSurvivor_pos_count)/(designatedSurvivor_pos_count+designatedSurvivor_neg_count))
score.append((testingAccuracy*lethalWeapon_pos_count)/(lethalWeapon_pos_count+lethalWeapon_neg_count))
print ("Positivity from experiment for Designated Survivor= %s" %(score[0]))
print ("Positivity from experiment for Lethal Weapon= %s" %(score[1]))

'''Conclusion from experiment'''

print ("\nIt can be noted that the positivity for Designated Survivor(%s) with IMDB rating %s is more than the positivity for Lethal Weapon(%s)with IMDB rating %s "%(score[0],rating[0],score[1],rating[1]))



### Conclusion: 
We have analyzed the two TV shows tweets using Sentiment Analysis, and finds the positive percentage sentiment for a TV show and checking the consent of the same with the ratings available for the show on IMDB. On a higher level, it has been concluded that the positivity percentage in the tweets is more for a show which has got more rating on IMDB and vice-versa.

The following are the top 20 positive and negative model features along with their weights/scores 

<h2 align="center"> Top 20 Positive features along with their weights </h2>

In [None]:
positive = clf.coef_[2]
featr = vec.get_feature_names()
index_sorted =np.argsort(positive)[::-1].tolist()
featres_positive = [(featr[i],positive[i]) for i in index_sorted[:20]]
print (featres_positive)


<h2 align="center"> Top 20 Negative features along with their weights </h2>

In [None]:
negative = clf.coef_[0]
index_sorted1 =np.argsort(negative)[::-1].tolist()
featres_negative = [(featr[i],negative[i]) for i in index_sorted1[:20]]
print (featres_negative)

In [None]:
positive = clf.coef_[2]
featr = vec.get_feature_names()
index_sorted =np.argsort(positive)[::-1].tolist()
featres_positive = [(featr[i],positive[i]) for i in index_sorted]
featres_positive

In [None]:
negative = clf.coef_[0]
index_sorted1 =np.argsort(negative)[::-1].tolist()
featres_negative = [(featr[i],negative[i]) for i in index_sorted1]
featres_negative

In [None]:
neutral = clf.coef_[1]
index_sorted2 =np.argsort(neutral)[::-1].tolist()
featres_neutral = [(featr[i],neutral[i]) for i in index_sorted2]
featres_neutral