# Bitcoin Tweet Sentiment Predictor

Cryptocurrencies have become very popular online in recent years, and many people try to profit from market swings by trading one cryptocurrency for another. Bitcoin, the original cryptocurrency, still holds the top position.

Bitcoin's followers, however, are often accused of being toxic and of voicing negative opinions online. This project builds a model that predicts the sentiment of a Bitcoin-related tweet.

First, we import the libraries used throughout this project to build the model.

In [27]:
import numpy as np 
import pandas as pd 
import re
import nltk
import pickle
import joblib

import utils

# Natural Language Processing imports
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from nltk.stem import WordNetLemmatizer

# Scikit learn imports
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.base import TransformerMixin, BaseEstimator
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score

# Models
from sklearn.linear_model import LogisticRegression, Perceptron, SGDClassifier
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, ComplementNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier

## Load Dataset Into Memory

We first load the dataset into memory and display it. As the output below shows, the file contains many columns that are not needed for sentiment prediction. Left in place, they would only get in the way of the model, so the data needs to be cleaned.

In [2]:
fullDataset = pd.read_csv("BTC_tweets_daily_example.csv", low_memory=False)
fullDataset

Unnamed: 0.1,Unnamed: 0,Date,Tweet,Screen_name,Source,Link,Sentiment,sent_score,New_Sentiment_Score,New_Sentiment_State,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14,Unnamed: 15
0,0,Fri Mar 23 00:40:32 +0000 2018,"RT @ALXTOKEN: Paul Krugman, Nobel Luddite. I h...",myresumerocket,[],"<a href=""http://twitter.com"" rel=""nofollow"">Tw...",['neutral'],0,0,0,,,,,,
1,1,Fri Mar 23 00:40:34 +0000 2018,@lopp @_Kevin_Pham @psycho_sage @naval But @Pr...,BitMocro,[u'Bitcoin'],"<a href=""http://twitter.com/download/android"" ...",['neutral'],0,0,0,,,,,,
2,2,Fri Mar 23 00:40:35 +0000 2018,RT @tippereconomy: Another use case for #block...,hojachotopur,"[u'blockchain', u'Tipper', u'TipperEconomy']","<a href=""http://twitter.com"" rel=""nofollow"">Tw...",['positive'],1,0.136363636363636,1,,,,,,
3,3,Fri Mar 23 00:40:36 +0000 2018,free coins https://t.co/DiuoePJdap,denies_distro,[],"<a href=""http://twitter.com"" rel=""nofollow"">Tw...",['positive'],1,0.4,1,,,,,,
4,4,Fri Mar 23 00:40:36 +0000 2018,RT @payvxofficial: WE are happy to announce th...,aditzgraha,[],"<a href=""http://twitter.com/download/android"" ...",['positive'],1,0.468181818181818,1,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
50882,50854,Fri Mar 23 08:55:16 +0000 2018,RT @fixy_app: Fixy Network brings popular cryp...,quoting_lives,[],"<a href=""http://twitter.com"" rel=""nofollow"">Tw...",['positive'],1,0.6,1,,,,,,
50883,50855,Fri Mar 23 08:55:17 +0000 2018,RT @bethereumteam: After a successful launch o...,VariPewitt,[],"<a href=""http://twitter.com"" rel=""nofollow"">Tw...",['positive'],1,0.375,1,,,,,,
50884,50856,Fri Mar 23 08:55:18 +0000 2018,"RT @GymRewards: Buy #GYMRewards Tokens, Bonus ...",urbancoinerz,"[u'GYMRewards', u'ICO', u'cryptocurrency', u'm...","<a href=""http://twitter.com"" rel=""nofollow"">Tw...",['neutral'],0,0,0,,,,,,
50885,50857,Fri Mar 23 08:55:19 +0000 2018,I added a video to a @YouTube playlist https:/...,MRDanishShahab,[],"<a href=""http://www.google.com/"" rel=""nofollow...",['positive'],1,0.4,1,,,,,,


## Cleaning The Dataset

We now have to clean the dataset before going further. The original dataset has the following problems:

 - There are many unnecessary features
 - There are some null values far into the dataset
 - There are invalid values in the "Sentiment" column, which should only contain "positive", "neutral" or "negative"
 - The "Sentiment" column stores each label as a stringified list, which needs to be converted

<br>

The following changes will be made:

 - All columns except "Tweet" and "Sentiment" will be dropped.
 - Null values will be dropped
 - Sentiment values will be converted from stringified lists to simple strings e.g. ['positive'] => positive
 - Unknown values from "Sentiment" column will be deleted
 - Cleaned dataset will be saved as cleaned_btc_tweets.csv file

In [3]:
# Dropping all columns except the tweet and sentiment columns
btc_tweets = fullDataset.drop(fullDataset.columns[[0, 1, 3, 4, 5, 7, 8, 9, 10, 11, 12, 13, 14, 15]], axis=1)

btc_tweets = btc_tweets.dropna()
# btc_tweets = btc_tweets.head(10000)

indices = []

# This loop clears all sentiment values that aren't known
for index, row in btc_tweets.iterrows():
    
    sentiment = btc_tweets['Sentiment'][index].strip('][\'')
    
    if sentiment == "positive":
        btc_tweets.loc[index, 'Sentiment'] = sentiment
        continue
    if sentiment == "neutral":
        btc_tweets.loc[index, 'Sentiment'] = sentiment
        continue
    if sentiment == "negative":
        btc_tweets.loc[index, 'Sentiment'] = sentiment
        continue
        
    indices.append(index)
    
btc_tweets = btc_tweets.drop(labels=indices, axis=0)

btc_tweets.to_csv(r'cleaned_btc_tweets.csv', index = False)

print("After cleaning the sentiment column:")
pd.options.display.max_colwidth = 200
btc_tweets

After cleaning the sentiment column:


Unnamed: 0,Tweet,Sentiment
0,"RT @ALXTOKEN: Paul Krugman, Nobel Luddite. I had to tweak the nose of this Bitcoin enemy. He says such foolish things. Here's the link: htt…",neutral
1,@lopp @_Kevin_Pham @psycho_sage @naval But @ProfFaustus (dum b a ss) said you know nothing about #Bitcoin ... 😂😂😂 https://t.co/SBAMFQ2Yiy,neutral
2,RT @tippereconomy: Another use case for #blockchain and #Tipper. The #TipperEconomy can unseat Facebook and change everything! ICO Live No…,positive
3,free coins https://t.co/DiuoePJdap,positive
4,RT @payvxofficial: WE are happy to announce that PayVX Presale Phase 1 is now LIVE!\n\nSign up --&gt;&gt; https://t.co/dhprzsSxek\nCurrencies accept…,positive
...,...,...
50882,RT @fixy_app: Fixy Network brings popular cryptocurrencies and retailers as partners with benefits from blockchain. Partner Stores will acc…,positive
50883,"RT @bethereumteam: After a successful launch of our Bounty campaign, we've managed to filter out the Bounty related questions to: https://t…",positive
50884,"RT @GymRewards: Buy #GYMRewards Tokens, Bonus Time is ending! https://t.co/HDvhoZrz2J, #ICO #cryptocurrency #mobile #app #mining #exercisin…",neutral
50885,I added a video to a @YouTube playlist https://t.co/ntFJrNvSvZ How To Bitcoin Cloud Mining Free For Lifetime Urdu / Hindi,positive
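The row-by-row loop above can also be written with vectorized pandas operations. The sketch below is a minimal illustration on a small invented DataFrame; the sample rows and the `VALID` name are assumptions made for the example, not part of the dataset or of the original code.

```python
import pandas as pd

# Toy stand-in for the raw data; these rows are invented for illustration.
raw = pd.DataFrame({
    "Tweet": ["good day", "meh", "bad day", "broken row", None],
    "Sentiment": ["['positive']", "['neutral']", "['negative']", "0.4", "['positive']"],
})

VALID = ["positive", "neutral", "negative"]  # the labels the notebook accepts

cleaned = raw[["Tweet", "Sentiment"]].dropna().copy()
# Turn "['positive']" into "positive" by stripping brackets and quotes.
cleaned["Sentiment"] = cleaned["Sentiment"].str.strip("[]'")
# Keep only rows whose label is one of the three known sentiments.
cleaned = cleaned[cleaned["Sentiment"].isin(VALID)]

print(cleaned)
```

Selecting the two columns by name avoids depending on column positions, which can shift if the CSV layout changes.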


## Checking Balance of Sentiments in Dataset

The index of the cleaned dataset still runs up to 50,885, but some rows were dropped during cleaning, so the actual number of instances is smaller: the counts printed below add up to 49,751. Next we check how these tweets are distributed across the three sentiments.

This matters because a classifier trained on heavily unbalanced data tends to favour the majority classes and to predict the rare class poorly.

The following loop will count all sentiment instances in the cleaned dataset:

In [4]:
positiveCount = 0
neutralCount = 0
negativeCount = 0

for index, row in btc_tweets.iterrows():
    sentiment = btc_tweets['Sentiment'][index].strip('][\'')
    
    if sentiment == "positive":
        positiveCount += 1
        
    if sentiment == "neutral":
        neutralCount += 1
        
    if sentiment == "negative":
        negativeCount += 1
        
print("Positive Tweets:", positiveCount)
print("Neutral Tweets:", neutralCount)
print("Negative Tweets:", negativeCount)

Positive Tweets: 22656
Neutral Tweets: 21150
Negative Tweets: 5945
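The same counts can be obtained without an explicit loop. The following is a small sketch using pandas' `value_counts` on invented labels; in the notebook the Series would be `btc_tweets["Sentiment"]`.

```python
import pandas as pd

# Invented labels, only to show the call.
labels = pd.Series(
    ["positive", "neutral", "positive", "negative", "positive"],
    name="Sentiment",
)

counts = labels.value_counts()                # absolute count per label
shares = labels.value_counts(normalize=True)  # same, as a fraction of the total

print(counts)
print(shares.round(2))
```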


## Data is Unbalanced

As we can see, negative tweets make up only about 12% of the cleaned data (5,945 of 49,751): there are roughly a quarter as many of them as positive or neutral tweets. Training on data this unbalanced would skew the model's predictions.

For this reason, we cut the positive and neutral tweets down until each matches the number of negative tweets. The loop below keeps the first 5,945 positive and the first 5,945 neutral tweets it encounters, along with every negative tweet. It relies on the negativeCount value from the previous cell:

In [5]:
newDF = pd.DataFrame(columns=["Tweet", "Sentiment"])

positiveCount = 0
neutralCount = 0

for index, row in btc_tweets.iterrows():
    sentiment = btc_tweets['Sentiment'][index]
    
    if sentiment == "positive":
        if positiveCount == negativeCount:
            continue
        positiveCount += 1
        newDF.loc[len(newDF)]=[row["Tweet"], row["Sentiment"]] 
        
    if sentiment == "neutral":
        if neutralCount == negativeCount:
            continue
        neutralCount += 1
        newDF.loc[len(newDF)]=[row["Tweet"], row["Sentiment"]] 
        
    if sentiment == "negative":
        newDF.loc[len(newDF)]=[row["Tweet"], row["Sentiment"]] 
        
newDF

Unnamed: 0,Tweet,Sentiment
0,"RT @ALXTOKEN: Paul Krugman, Nobel Luddite. I had to tweak the nose of this Bitcoin enemy. He says such foolish things. Here's the link: htt…",neutral
1,@lopp @_Kevin_Pham @psycho_sage @naval But @ProfFaustus (dum b a ss) said you know nothing about #Bitcoin ... 😂😂😂 https://t.co/SBAMFQ2Yiy,neutral
2,RT @tippereconomy: Another use case for #blockchain and #Tipper. The #TipperEconomy can unseat Facebook and change everything! ICO Live No…,positive
3,free coins https://t.co/DiuoePJdap,positive
4,RT @payvxofficial: WE are happy to announce that PayVX Presale Phase 1 is now LIVE!\n\nSign up --&gt;&gt; https://t.co/dhprzsSxek\nCurrencies accept…,positive
...,...,...
17830,RT @PumaPay: Why Did Credit Cards Fail to Adopt to the Modern Needs? https://t.co/u1qB3gxA3T #pumapay #creditcards #banking #finance #block…,negative
17831,Bitcoin Will Be World's 'Single Currency' Says Twitter CEO https://t.co/f4hsEbLgkk https://t.co/P3fuHSLwkX,negative
17832,RT @CloudMiningX: Use the code: HF18BDAY30 at purchase to get a 30% discount for all contracts. The offer is limited. \n\n10 Ghs = 0.84$\n1000…,negative
17833,Twitter CEO Says Bitcoin Will Be World’s ‘Single Currency’ Within A Decade https://t.co/2obg7hKwm5,negative
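The loop above keeps the earliest tweets of each majority class in file order. A possible alternative, shown here only as a sketch on invented data, is to draw a random sample of equal size from every class with `groupby(...).sample(...)` (available in pandas 1.1 and later). The seed value is an arbitrary choice for repeatability.

```python
import pandas as pd

# Invented, deliberately unbalanced data for illustration.
df = pd.DataFrame({
    "Tweet": [f"tweet {i}" for i in range(10)],
    "Sentiment": ["positive"] * 5 + ["neutral"] * 3 + ["negative"] * 2,
})

# Size of the smallest class; every class is cut down to this size.
n_min = df["Sentiment"].value_counts().min()

balanced = (
    df.groupby("Sentiment")
      .sample(n=n_min, random_state=42)  # fixed seed so the draw is repeatable
      .reset_index(drop=True)
)

print(balanced["Sentiment"].value_counts())
```

Random sampling avoids favouring tweets from the start of the collection window, at the cost of producing a different subset than the original loop.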


## Check Balance Again

Balancing has reduced the dataset to 17,835 instances (3 × 5,945). We now check the balance again:

In [6]:
btc_tweets = newDF

positiveCount = 0
neutralCount = 0
negativeCount = 0

for index, row in btc_tweets.iterrows():
    sentiment = btc_tweets['Sentiment'][index].strip('][\'')
    
    if sentiment == "positive":
        positiveCount += 1
        
    if sentiment == "neutral":
        neutralCount += 1
        
    if sentiment == "negative":
        negativeCount += 1
        
print("Positive Tweets:", positiveCount)
print("Neutral Tweets:", neutralCount)
print("Negative Tweets:", negativeCount)

Positive Tweets: 5945
Neutral Tweets: 5945
Negative Tweets: 5945


## Data is Balanced

The data is now balanced, with 5,945 instances of each sentiment. We can proceed to split it into training and testing data.

The following code block does this:

 - First, the full dataframe (both columns) is split into 80% training data and 20% testing data, and both parts are saved as .csv files.
 - Then the original data is split a second time, again 80/20, but with the "Tweet" and "Sentiment" columns separated. This second split is the one the rest of the code uses. Because each call to train_test_split shuffles independently, the saved .csv files do not contain the same rows as this split.

In [12]:
# Creating csv's for presentation purpose
train_tosave, test_tosave = train_test_split(btc_tweets, test_size=0.2)

train_tosave.to_csv(r'train_set.csv', index = False)
test_tosave.to_csv(r'test_set.csv', index = False)

# 80% training set, 20% testing set
tweets = btc_tweets["Tweet"]
sentiments = btc_tweets["Sentiment"]

train_data, test_data, train_sentiment, test_sentiment = train_test_split(tweets, sentiments, test_size=0.2)

rand_indexs = np.random.randint(1,len(train_data),50).tolist()

print("Number of training instances:", len(train_data.index))
print("Number of testing instances:", len(test_data.index))

Number of training instances: 14268
Number of testing instances: 3567
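If the saved files are meant to match the data the model actually sees, one option is to split once and derive everything from that single split. The sketch below assumes this goal and uses invented data; `stratify` and `random_state` are additions that the original code does not use.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented balanced data standing in for btc_tweets.
df = pd.DataFrame({
    "Tweet": [f"tweet {i}" for i in range(30)],
    "Sentiment": ["positive", "neutral", "negative"] * 10,
})

# One split, reused for both the saved CSV files and the model inputs.
train_df, test_df = train_test_split(
    df,
    test_size=0.2,
    stratify=df["Sentiment"],  # keep the three classes balanced in both parts
    random_state=42,           # arbitrary seed, chosen only for repeatability
)

train_data, train_sentiment = train_df["Tweet"], train_df["Sentiment"]
test_data, test_sentiment = test_df["Tweet"], test_df["Sentiment"]

print(len(train_df), len(test_df))
```

Calling `train_df.to_csv(...)` and `test_df.to_csv(...)` at this point would then write exactly the rows used for training and testing.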


## Training Data

In [13]:
print("Training data:")
train_data.head(60)

Training data:


4538      Name: Raiden Network Token\nSymbol: RDN\n24 hour change: -4.69%\nPrice: 1.58653\nRank: 129\nTotal Supply: 100000000.0\nVo… https://t.co/KHrEl2usc2
3436                   RT @cryptomsn: Home of Bitcoin Crypto Currency - https://t.co/O0gPKD0Ie8 \n#BTC #CryptoCurrencyNews #Eth #Ltc https://t.co/76hvwd6DlH
8986         RT @OnWindowly: Lightning Network Problems — wow!\n#BitcoinCash is #Bitcoin\n\n@el33th4xor  @ Satoshi Vision Conference in Tokyo, Japan. https…
5119             RT @DrDenaGrayson: @ericgeller Agree w/@ericgeller👉🏼likely signals #indictments of state-backed hackers. I believe that these hackers will…
17443        RT @CloudMiningX: Use the code: HF18BDAY30 at purchase to get a 30% discount for all contracts. The offer is limited. \n\n10 Ghs = 0.84$\n1000…
12025     Name: CRYPTO20\nSymbol: C20\n24 hour change: -7.7%\nPrice: 1.29942\nRank: 168\nTotal Supply: 40656082.0\nVolume: 2603890.… https://t.co/h3G19w0ywl
14952                                      Current Bitcoin

## Testing Data

In [14]:
print("Testing data:")
test_data.head(60)

Testing data:


12501             #Blockchain simplified: @CBinsights / #crypto #fintech #bitcoin, #ICO https://t.co/WuEnTKIq6r / @BourseetTrading… https://t.co/tuU6Yjy2Vh
607      Name: Tidex Token\nSymbol: TDX\n24 hour change: -38.18%\nPrice: 0.312157\nRank: 647\nTotal Supply: 10000000.0\nVolume: 29… https://t.co/UvqZKkplhU
13116                 [USD] 23/03/2018 03:00:01 Bitcoin: $8451.49 Ethereum: $516.85  #bitcoin #ethereum #altcoin #coin #blockchain… https://t.co/dZ4Wv0WcS5
10555      RT @CherylPreheim: City of Atlanta’s computers being held hostage by hacker demanding $51,000 ransom in bitcoin. FBI &amp; Homeland Security in…
15317                 RT @ErikVoorhees: CNBC: Jack Dorsey expects bitcoin to become the world's 'single currency' in about 10 years https://t.co/ERONOX5cH1
13995    Name: AdEx\nSymbol: ADX\n24 hour change: -8.72%\nPrice: 0.754066\nRank: 155\nTotal Supply: 100000000.0\nVolume: 6612940.0… https://t.co/ghfLGDKUgr
2243                                    Why Blockchain Will Surv

## Value of Emoticons as Sentiment

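Emoticons are valuable for sentiment analysis because they express emotion directly. The following code searches the training tweets with a loose regular expression for text emoticons such as :) and then counts how often each matched string occurs anywhere in the text. The pattern is broad, so some of the matches it reports (for example ': ' and 'XM') are not emoticons at all.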

This step is only for presentation purposes. These values will not actually be used later on.

In [16]:
# Checking which emoticons are used in data set
tweets_text = train_data.str.cat()

emos = set(re.findall(r" ([xX:;][-']?.) ", tweets_text))
emos_count = []
for emo in emos:
    emos_count.append((tweets_text.count(emo), emo))
print("Emoticons used in dataset:")
sorted(emos_count,reverse=True)

Emoticons used in dataset:


[(14363, ': '),
 (115, ':…'),
 (39, 'XM'),
 (36, ':)'),
 (7, ':('),
 (2, ';)'),
 (2, ':D')]
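The large count for ': ' comes from counting every occurrence of that substring in the text, not only the places where the pattern matched. A stricter variant, sketched below on an invented string, counts the regex matches themselves with `collections.Counter`. The `EMOTICON` pattern is an assumption made for this example and covers only a handful of common emoticons.

```python
import re
from collections import Counter

# Invented sample text; in the notebook this would be tweets_text.
text = "great news :) price: up :) bad news :( ok ;) RT @user: hello :D"

# Eyes, optional nose, then a mouth character. The lookarounds require
# whitespace (or the string boundary) on both sides without consuming it.
EMOTICON = re.compile(r"(?<!\S)[xX:;][-']?[)(dDpP/|](?!\S)")

emoticon_counts = Counter(EMOTICON.findall(text))
print(emoticon_counts.most_common())
```

Because the colon in "price:" and "@user:" is attached to a word, it is not counted here.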

## Emoticons in Our Dataset by Sentiment

In [17]:
# Checking which happy and sad emoticons occur in the training data

HAPPY_EMO = r" ([xX;:]-?[dD)]|:-?[\)]|[;:][pP]) "
SAD_EMO = r" (:'?[/|\(]) "

print("Emoticons specifying happy and sad expressions:\n")

print("Happy emoticons used:", set(re.findall(HAPPY_EMO, tweets_text)))
print("Sad emoticons used:", set(re.findall(SAD_EMO, tweets_text)))


Emoticons specifying happy and sad expressions:

Happy emoticons used: {';)', ':D', ':)'}
Sad emoticons used: {':('}


## Most Used Words

The following function lists the most used words in our dataset so we have a clear picture of what we are working with. nltk first downloads the punkt tokenizer data; the text is then split into tokens and the tokens are sorted by frequency.

This step is only for presentation purposes. These values will not actually be used later on.

In [19]:
nltk.download('punkt')
def most_used_words(text):
    tokens = word_tokenize(text)
    frequency_dist = nltk.FreqDist(tokens)
    print("There is %d different words in training dataset" % len(set(tokens)))
    return sorted(frequency_dist,key=frequency_dist.__getitem__, reverse=True)

most_used_words(train_data.str.cat())[:100]

[nltk_data] Downloading package punkt to /home/zozu/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


There is 30513 different words in training dataset


[':',
 '#',
 'https',
 '@',
 'Bitcoin',
 '.',
 'the',
 'to',
 ',',
 '!',
 '$',
 'a',
 'is',
 'bitcoin',
 'and',
 'of',
 'in',
 'for',
 'you',
 '?',
 'on',
 'Airdrop',
 '(',
 ')',
 '’',
 '%',
 'with',
 'that',
 'I',
 '-',
 'cryptocurrency',
 'bethereumteam',
 'our',
 'blockchain',
 'crypto',
 'we',
 "'s",
 'Price',
 'The',
 'your',
 'will',
 'Supply',
 'Total',
 'be',
 '1',
 'it',
 '24',
 'are',
 'change',
 'hour',
 's',
 ';',
 'out',
 'BTC',
 'Symbol',
 'at',
 'Rank',
 "'re",
 'We',
 '...',
 'have',
 'by',
 '&',
 'Volume',
 'this',
 'Blockchain',
 'what',
 'about',
 "''",
 '--',
 '*',
 'Ethereum',
 'Will',
 'can',
 '``',
 'New',
 'Twitter',
 'has',
 'from',
 '📢',
 'ICO',
 'Satoshi',
 'amp',
 'make',
 '“',
 'all',
 '”',
 'article',
 'Crypto',
 'how',
 'or',
 'as',
 '…',
 'ethereum',
 'not',
 'now',
 'A',
 "'",
 'ETH',
 'money']

## Feature Extraction, defining Vectorizer and Pipeline

This is the most important step. Models cannot work on raw text, so the tweets have to be converted into a numerical format first. This is where the vectorizer comes in.

For this project we use Bag of Words with TF-IDF weighting. Bag of Words builds a table with one row per tweet and one column per term encountered at least once in the dataset; each cell holds the number of times that term occurs in that tweet. TF-IDF then replaces the raw counts with weights that are higher for terms that are frequent in a tweet but rare across the dataset. With ngram_range=(1,2), as used below, the terms are single tokens and pairs of adjacent tokens.

This simple diagram illustrates the concept:

<img src="https://cdn.analyticsvidhya.com/wp-content/uploads/2020/02/BoWBag-of-Words-model-2.png">
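To make the table concrete, here is a small sketch on three invented mini tweets. It uses scikit-learn's default tokenizer, not the custom lemmatizing tokenizer from utils.py, so the columns differ from what the real pipeline produces.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Three invented mini "tweets".
docs = ["bitcoin is great", "bitcoin is bad", "free coins"]

# Bag of Words: raw counts, one column per word (columns in sorted order).
bow = CountVectorizer()
counts = bow.fit_transform(docs)
print(sorted(bow.vocabulary_))
print(counts.toarray())

# TF-IDF over single words and adjacent word pairs, as with ngram_range=(1, 2).
tfidf = TfidfVectorizer(ngram_range=(1, 2))
weights = tfidf.fit_transform(docs)
print(weights.shape)
```

The count matrix has six columns (one per distinct word), while the TF-IDF matrix has ten, because the four word pairs found in the texts become columns as well.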

<br>

We will build our own vectorizer on top of TfidfVectorizer() from the scikit-learn library. A utils.py file should sit in the same directory as this .ipynb file. It defines a lemmatizing tokenizer function and a text preprocessing class, both of which are used in the code block below.

They are kept in a separate module so that the pickled pipeline can import them later when it is deployed to a Heroku server.
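The contents of utils.py are not shown in this notebook. Purely as an illustration, a preprocessing class that fits into a scikit-learn Pipeline could look like the sketch below. Everything in it is hypothetical: the cleaning rules, and in particular the meaning given to `use_mention`, are guesses and may differ from the real `utils.TextPreProc`.

```python
import re
from sklearn.base import BaseEstimator, TransformerMixin


class TextPreProc(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for utils.TextPreProc; the real class may differ."""

    def __init__(self, use_mention=False):
        self.use_mention = use_mention

    def fit(self, X, y=None):
        # Nothing to learn; present so the class works inside a Pipeline.
        return self

    def transform(self, X):
        return [self._clean(text) for text in X]

    def _clean(self, text):
        text = re.sub(r"https?://\S+", " ", text)   # drop links
        # Assumed meaning of use_mention: keep a placeholder token, else drop.
        text = re.sub(r"@\w+", " MENTION " if self.use_mention else " ", text)
        text = re.sub(r"^RT\b", " ", text.strip())  # drop a leading retweet marker
        return re.sub(r"\s+", " ", text).strip().lower()


sample = ["RT @lopp: Bitcoin rocks https://t.co/x"]
print(TextPreProc(use_mention=True).transform(sample))
```

Pickle stores a reference to a class's module and name, not its source code, so whichever module defines the class must be importable under the same name wherever pipeline.pkl is loaded.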

In [20]:
vectorizer = TfidfVectorizer(tokenizer=utils.lemmatize_tokenize, ngram_range=(1,2))

pipeline = Pipeline([
    ('text_pre_processing', utils.TextPreProc(use_mention=True)),
    ('vectorizer', vectorizer),
])

training_data = pipeline.fit_transform(train_data)

joblib.dump(pipeline, 'pipeline.pkl')

print("Processed data ready to be passed to the model:")
print(training_data)

Processed data ready to be passed to the model:
  (0, 11436)	0.20075486694643815
  (0, 55414)	0.10031328191932584
  (0, 6107)	0.19068189100977978
  (0, 11857)	0.19068189100977978
  (0, 6553)	0.2452675750855544
  (0, 12050)	0.2452675750855544
  (0, 49429)	0.09740334370414654
  (0, 1171)	0.10024074751618044
  (0, 8589)	0.2452675750855544
  (0, 12675)	0.2452675750855544
  (0, 25698)	0.10013234145865559
  (0, 37269)	0.10024074751618044
  (0, 50499)	0.1815464986234954
  (0, 13866)	0.1815464986234954
  (0, 59557)	0.1542536565856081
  (0, 45080)	0.16530099330427037
  (0, 50342)	0.1815464986234954
  (0, 13861)	0.1815464986234954
  (0, 44817)	0.10020455957761988
  (0, 5257)	0.20075486694643815
  (0, 55408)	0.09408198338361679
  (0, 6106)	0.19068189100977978
  (0, 6552)	0.2452675750855544
  (0, 49422)	0.08723363948657727
  (0, 1110)	0.08896602587879256
  :	:
  (14267, 27298)	0.18985952615516027
  (14267, 65890)	0.18985952615516027
  (14267, 342)	0.18985952615516027
  (14267, 21835)	0.17868324112

## Model Training

The output above is the training data after it has passed through the pipeline, that is, through the text preprocessing class and the lemmatizing TF-IDF vectorizer. It is a sparse matrix that can be passed directly to a model.

We will now compare the cross-validated accuracy of 9 machine learning algorithms on the training data.

In [28]:
perceptron = Perceptron()
bnb = BernoulliNB()
mnb = MultinomialNB()
cnb = ComplementNB()
tree = DecisionTreeClassifier()
lsvc = LinearSVC()
sgdc = SGDClassifier()
randFor = RandomForestClassifier()
lr = LogisticRegression(max_iter=1000)

models = {
    "Random Forest Classifier": randFor,
    "Perceptron": perceptron,
    "Bernoulli Naive Bayes": bnb,
    "Multinomial Naive Bayes": mnb,
    "Complement Naive Bayes": cnb,
    "Decision Tree Classifier": tree,
    "Linear Support Vector Classification": lsvc,
    "Stochastic Gradient Descent": sgdc,
    "Logistic Regression": lr,
}


for model in models.keys():
    scores = cross_val_score(models[model], training_data, train_sentiment)
    print("\n===", model, "===")
    print("scores = ", scores)
    print("mean = ", scores.mean())
    print("variance = ", scores.var())
    models[model].fit(training_data, train_sentiment)
    acc_score = accuracy_score(models[model].predict(training_data), train_sentiment)
    print("score on the learning data (accuracy) = ", acc_score)
    print("")



=== Random Forest Classifier ===
scores =  [0.90749825 0.9159075  0.91065172 0.91342447 0.90606379]
mean =  0.9107091442367186
variance =  1.3257659739066788e-05
score on the learning data (accuracy) =  1.0


=== Perceptron ===
scores =  [0.92676945 0.93622985 0.93587947 0.93901157 0.93340343]
mean =  0.9342587536791698
variance =  1.7184494186189898e-05
score on the learning data (accuracy) =  1.0


=== Bernoulli Naive Bayes ===
scores =  [0.88367204 0.90049054 0.89313245 0.90536278 0.88818787]
mean =  0.8941691345934437
variance =  6.245940032019872e-05
score on the learning data (accuracy) =  0.9718951499859826


=== Multinomial Naive Bayes ===
scores =  [0.88156973 0.88822705 0.88367204 0.89730109 0.88958991]
mean =  0.8880719615271155
variance =  2.9828665773171636e-05
score on the learning data (accuracy) =  0.9670591533501542


=== Complement Naive Bayes ===
scores =  [0.88717589 0.89698669 0.8917309  0.90851735 0.89309499]
mean =  0.8955011641442109
variance =  5.2188509942218

## Testing Accuracy on Test Data

We now measure the accuracy of each model on the test data to see which one scores highest. As the output below shows, Linear Support Vector Classification comes out on top at about 94.3%, narrowly ahead of Perceptron and Stochastic Gradient Descent.

In [29]:
# We now test each model on the test set
# The test set is transformed once, outside the loop
testing_data = pipeline.transform(test_data)

for model in models.keys():
    test_model = models[model]
    test_model.fit(training_data, train_sentiment)
    print("Accuracy on test data for " + model + ":", test_model.score(testing_data, test_sentiment))

Accuracy on test data for Random Forest Classifier: 0.913372582001682
Accuracy on test data for Perceptron: 0.9425287356321839
Accuracy on test data for Bernoulli Naive Bayes: 0.8982338099243061
Accuracy on test data for Multinomial Naive Bayes: 0.8901037286234932
Accuracy on test data for Complement Naive Bayes: 0.895991028875806
Accuracy on test data for Decision Tree Classifier: 0.8873002523128679
Accuracy on test data for Linear Support Vector Classification: 0.9428090832632464
Accuracy on test data for Stochastic Gradient Descent: 0.9408466498458088
Accuracy on test data for Logistic Regression: 0.927950658816933


## Comparing Test Sentiments to Predictions

We now compare the true sentiments of the test set with the model's predictions. The table below shows both side by side; in the rows shown, every prediction matches the true label.

In [30]:
# We choose Linear Support Vector Classification due to highest accuracy
test_model = lsvc
test_learning = pipeline.transform(test_data)

tempDF = pd.DataFrame(test_sentiment)
tempDF["Predicted Sentiment"] = test_model.predict(test_learning)

tempDF.head(60)


Unnamed: 0,Sentiment,Predicted Sentiment
12501,neutral,neutral
607,negative,negative
13116,neutral,neutral
10555,neutral,neutral
15317,negative,negative
13995,negative,negative
2243,neutral,neutral
13629,negative,negative
11201,neutral,neutral
16171,negative,negative


## Finalize the Model

In [31]:
model = test_model

## Test Custom Input

We now try the model on our own input. In the two examples below, both predictions match the intended sentiment.

In [32]:
tweet = pd.Series([input(),])
tweet = pipeline.transform(tweet)

sentiment_predicted = model.predict(tweet)[0]

print("This tweet is", sentiment_predicted)

I hate bitcoin
This tweet is negative


In [33]:
tweet = pd.Series([input(),])
tweet = pipeline.transform(tweet)

sentiment_predicted = model.predict(tweet)[0]

print("This tweet is", sentiment_predicted)

I love bitcoin
This tweet is positive


## Save Model

We now save the model as a pickle file so that it can be deployed on a server.

In [34]:
# Using a context manager so the file is closed after writing
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)
print("Model Saved")

Model Saved
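On the server side, the two saved files would be loaded again and used together: the pipeline turns text into features and the model predicts from them. The sketch below walks through that round trip with a toy pipeline and model written to a temporary directory. The toy objects and paths are assumptions made for the example; the real pipeline.pkl additionally needs utils.py to be importable when it is loaded.

```python
import os
import pickle
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy pipeline and model standing in for the notebook's objects.
texts = ["love bitcoin", "great gains", "hate bitcoin", "terrible loss"]
labels = ["positive", "positive", "negative", "negative"]
pipeline = Pipeline([("vectorizer", TfidfVectorizer())])
model = LinearSVC().fit(pipeline.fit_transform(texts), labels)

workdir = tempfile.mkdtemp()
pipeline_path = os.path.join(workdir, "pipeline.pkl")
model_path = os.path.join(workdir, "model.pkl")

joblib.dump(pipeline, pipeline_path)
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# What a serving process would do: load both files, then transform and predict.
loaded_pipeline = joblib.load(pipeline_path)
with open(model_path, "rb") as f:
    loaded_model = pickle.load(f)

prediction = loaded_model.predict(loaded_pipeline.transform(["love gains"]))[0]
print("This tweet is", prediction)
```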
