# Analysing and Predicting Public Perception on Social Media

Understanding public perception of a brand can be a challenge.  With the rise of social media, data on public opinion about specific topics and brands is widely available, and Twitter is well suited to this.  By scraping Twitter data, we will run sentiment analysis on particular brands and topics.  We will then train a Time Series model on past sentiment trends to help it predict the future sentiment trajectory.  (Nike vs Adidas: Twitter sentiment trend analysis + prediction)

> Importing our standard libraries and the autoreload extension.

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
%load_ext autoreload
%autoreload 2

import sys, os
from os import path
sys.path.append("twint/")

# Scraping Twitter with TWINT

We'll begin by scraping Twitter using the TWINT module, since Twitter's standard search API is very limited. TWINT lets us search for tweets by keyword, within a date range, and much more, with almost no limitations, so the entire Twittersphere is available to us.  We can then perform sentiment analysis on specific tweets.
  
We've installed TWINT through the command line and appended it to our system path in the cell above.  Next, we will import the module, set up its configuration, and start running queries.  
  

In [3]:
# load TWINT and set up its configuration
import twint
c = twint.Config()

In [4]:
# Solve compatibility issues with notebooks and RunTime errors.
import nest_asyncio
nest_asyncio.apply()

In [5]:
c.Search = "bitcoin"
c.Limit = 1 # results are returned in blocks of 20 tweets, 1 here means 20
c.Pandas = True
twint.run.Search(c)

1154070174494003202 2019-07-24 12:45:45 EDT <XNM13> The End of Bitcoin: You [can't] be rich.  pic.twitter.com/JiVb9nBXxy
1154070166239494144 2019-07-24 12:45:43 EDT <soyiso_mpunzi> If you have no clue what bitcoin is all about and how we make money through bitcoin, just ask. For now consultation is FREE. Next is our 90 day game plan, the millionaire maker strategy.  I am in the ZONE.  https://www.instagram.com/p/B0ToNkWHdyk/?igshid=1vpxbs4xqni1b …
1154070161479163905 2019-07-24 12:45:42 EDT <CryptoCryptoNe3> Found a Bitcoin sticker outside of a restaurant in Montreal : Bitcoin CRYPTO CRYPTO NEWS -  https://cryptocryptonews.com/found-a-bitcoin-sticker-outside-of-a-restaurant-in-montreal-bitcoin/ …
1154070154457899008 2019-07-24 12:45:41 EDT <cryptofluxfr> Fausses pubs pour Libra : Arnaque impliquant la future cryptomonnaie Facebook                ➡️Plus d'actus crypto sur  https://cryptoflux.fr   https://www.thecointribune.com/actualites/fausses-pubs-pour-libra-arnaque-impliq

### Great!
> We have tweets being output as our result!  Now let's format this output into a dataframe we can work with.

In [6]:
def available_columns():
    return twint.output.panda.Tweets_df.columns

def twint_to_pandas(columns):
    return twint.output.panda.Tweets_df[columns]

In [7]:
# see what columns are available
available_columns()

Index(['cashtags', 'conversation_id', 'created_at', 'date', 'day', 'geo',
       'hashtags', 'hour', 'id', 'link', 'name', 'near', 'nlikes', 'nreplies',
       'nretweets', 'place', 'quote_url', 'retweet', 'search', 'timezone',
       'tweet', 'user_id', 'user_id_str', 'username'],
      dtype='object')

In [8]:
# create Pandas dataframe with desired columns
df = twint_to_pandas(['conversation_id', 'created_at', 'id', 'user_id', 'username', 'tweet', 'hashtags', 'date', 'day', 'nlikes', 'nretweets'])
print(df.shape)
df.head()

(20, 11)


Unnamed: 0,conversation_id,created_at,id,user_id,username,tweet,hashtags,date,day,nlikes,nretweets
0,1154070174494003202,1563986745000,1154070174494003202,36558099,XNM13,The End of Bitcoin: You [can't] be rich. pic....,[],2019-07-24 12:45:45,3,0,0
1,1154070166239494144,1563986743000,1154070166239494144,917474671,soyiso_mpunzi,If you have no clue what bitcoin is all about ...,[],2019-07-24 12:45:43,3,0,0
2,1154070161479163905,1563986742000,1154070161479163905,1129381702407741441,CryptoCryptoNe3,Found a Bitcoin sticker outside of a restauran...,[],2019-07-24 12:45:42,3,0,0
3,1154070154457899008,1563986741000,1154070154457899008,1134043719668248580,cryptofluxfr,Fausses pubs pour Libra : Arnaque impliquant l...,"[#cryptomonnaies, #bitcoin, #ethereum]",2019-07-24 12:45:41,3,0,0
4,1154070142440972288,1563986738000,1154070142440972288,2709434059,tokentalkco,Bitcoin Not Moved for at Least Five Years is a...,[#bitcoin],2019-07-24 12:45:38,3,0,0


### Success!
> We now have a dataframe with 20 tweets all containing the keyword "bitcoin", along with some additional information about the tweets.

> Now let's make our code a bit more modular so that we can run repeated queries.

In [9]:
# Context manager that temporarily silences stdout (TWINT prints every tweet it fetches)
class HiddenPrints:
    def __enter__(self):
        self._original_stdout = sys.stdout
        sys.stdout = open(os.devnull, 'w')

    def __exit__(self, exc_type, exc_val, exc_tb):
        sys.stdout.close()
        sys.stdout = self._original_stdout
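
As an aside, the standard library's `contextlib.redirect_stdout` can express the same idea in a few lines. The sketch below is an alternative, not the version used in the rest of this notebook, and the name `hidden_prints` is made up for illustration.

```python
import contextlib
import os


@contextlib.contextmanager
def hidden_prints():
    """Silence anything written to stdout inside the with-block."""
    with open(os.devnull, "w") as devnull:
        # redirect_stdout restores the previous sys.stdout on exit,
        # including when the body raises an exception
        with contextlib.redirect_stdout(devnull):
            yield


with hidden_prints():
    print("this line is swallowed")
print("this line is shown")
```

Because both context managers clean up after themselves, the null file is closed and stdout is restored without any hand-written `__exit__` logic.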

In [10]:
# function to easily get tweets
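def get_tweets(search_term, limit=100):
    """Scrape roughly `limit` tweets matching `search_term` into a dataframe."""
    c = twint.Config()
    c.Search = search_term
    c.Limit = limit  # tweets arrive in blocks, so slightly more than `limit` may come back
    c.Pandas = True
    c.Pandas_clean = True
    
    result_columns = ['id', 'username', 'tweet', 'hashtags', 'date', 'day', 'nlikes', 'nretweets']
    with HiddenPrints():
        twint.run.Search(c)  # results are stored in twint.output.panda.Tweets_df
    # return a copy so columns can be added later without pandas' SettingWithCopyWarning
    return twint.output.panda.Tweets_df[result_columns].copy()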

In [11]:
bitcoin_tweets = get_tweets("bitcoin", limit=10000)
print(bitcoin_tweets.shape)
bitcoin_tweets.head()

(10004, 8)


Unnamed: 0,id,username,tweet,hashtags,date,day,nlikes,nretweets
0,1154070862536028161,maksimmerili,Long/Short Bitcoin & altcoin volatility with u...,[],2019-07-24 12:48:29,5,0,0
1,1154070855380623360,RadsickTrrance,#Bitcoin Price Shuns Volatility as Analysts Wa...,"[#bitcoin, #crypto]",2019-07-24 12:48:28,5,0,0
2,1154070832827813890,TurgayMutlucan,"Bitcoin bu, her an herşey olabilir!",[],2019-07-24 12:48:22,5,0,0
3,1154070829812125707,WallyGideon,New video by Legit TV: Why Coincola is the bes...,[],2019-07-24 12:48:22,5,0,0
4,1154070822828617729,cryptosnarf,When real bitcoin back,[],2019-07-24 12:48:20,5,0,0


In [12]:
adidas_tweets = get_tweets("adidas", limit=10000)
print(adidas_tweets.shape)
adidas_tweets.head()

(10016, 8)


Unnamed: 0,id,username,tweet,hashtags,date,day,nlikes,nretweets
0,1154072907297054720,swerve1973,Classy combo! I had a Raleigh Burner and Adida...,[],2019-07-24 12:56:37,3,0,0
1,1154072900816662528,RoopGautam,"कभी नंगे पैर दौड़ना पड़ता था, क्योंकि जूते खरी...",[],2019-07-24 12:56:35,3,1,0
2,1154072900418359296,cinj00,Elite 8 game is a W!! We beat a tough CBC Elit...,[],2019-07-24 12:56:35,3,0,0
3,1154072890951819265,checkthekicks,http://rover.ebay.com/rover/1/711-53200-19255...,[#kidsshoes],2019-07-24 12:56:33,3,0,0
4,1154072886635880448,liberosans,i'm sick his entire wardrobe is different blac...,[],2019-07-24 12:56:32,3,0,0


In [13]:
adidas_tweets.tail()

Unnamed: 0,id,username,tweet,hashtags,date,day,nlikes,nretweets
10011,1153716514215698432,New_preloved,{NP} Ada yang mau thread tas tas branded dari ...,[],2019-07-23 13:20:26,7,6,0
10012,1153716480917344258,Felipekrf,"Meninaaaas, chegaram os vestidos de moletom da...",[],2019-07-23 13:20:18,7,0,1
10013,1153716470469267458,giselamlopez,Adidas wow #daretocreate New adidas ad Dare ...,[#daretocreate],2019-07-23 13:20:16,7,0,0
10014,1153716458704310273,masterthegamerp,I like ADIDAS or Vans maybe?,[],2019-07-23 13:20:13,7,2,0
10015,1153716452496674816,steezy_jay__,They actually sell think about it it’s only we...,[],2019-07-23 13:20:11,7,0,0


In [14]:
nike_tweets = get_tweets("nike", limit=10000)
print(nike_tweets.shape)
nike_tweets.head()

(10000, 8)


Unnamed: 0,id,username,tweet,hashtags,date,day,nlikes,nretweets
0,1154077508133933058,twanAthon,I’m glad he went to Jordan. That speaks volume...,[],2019-07-24 13:14:54,2,0,0
1,1154077437912866818,crackyspen,2009 Polo Jeans Perfect Circle/Godspeed The St...,[],2019-07-24 13:14:37,2,0,0
2,1154077431860551682,somto_jr,Nike shoes are better than adidas shoes,[],2019-07-24 13:14:36,2,0,0
3,1154077420665921536,alcarazpedro,la última de nike?,[],2019-07-24 13:14:33,2,0,0
4,1154077405717417988,FernandzCande,me calzo las Nike y me voy a caminar lejos con...,[],2019-07-24 13:14:29,2,0,0


# Testing the Vader Module for Sentiment Analysis

> First, let's test Vader on a simple line of text and analyze the results.

In [None]:
#!pip install vaderSentiment

In [15]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyser = SentimentIntensityAnalyzer()

In [20]:
def sentiment_analyzer_scores(sentence):
    score = analyser.polarity_scores(sentence)
    print('sentence: "{}"'.format(sentence))
    print('scores: {}'.format(str(score))) 
    return score['compound']

In [21]:
sentiment_analyzer_scores("Nike is the best.")

sentence: "Nike is the best."
scores: {'neg': 0.0, 'neu': 0.417, 'pos': 0.583, 'compound': 0.6369}


0.6369

> Pretty positive sentiment, how about...

In [23]:
sentiment_analyzer_scores("Nike is the BEST!")

sentence: "Nike is the BEST!"
scores: {'neg': 0.0, 'neu': 0.365, 'pos': 0.635, 'compound': 0.7371}


0.7371

> Capitalizing "BEST" and adding an exclamation mark raises the compound score from 0.6369 to 0.7371.
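
To make that jump less mysterious, here is a rough reconstruction of the arithmetic. It assumes, from a reading of Vader's source code, that the compound score is the summed word valence `x` squashed by `x / sqrt(x**2 + 15)`, that "best" has a lexicon rating of 3.2, that an all-caps word adds 0.733, and that one exclamation mark adds 0.292. All of these constants are assumptions to check against the installed version rather than documented guarantees.

```python
import math


def normalize(valence_sum, alpha=15):
    """Squash an unbounded valence sum into the range (-1, 1)."""
    return valence_sum / math.sqrt(valence_sum ** 2 + alpha)


# Assumed constants (not taken from this notebook)
BEST_VALENCE = 3.2         # lexicon rating for "best"
CAPS_BOOST = 0.733         # extra weight for an ALL-CAPS word in mixed-case text
EXCLAMATION_BOOST = 0.292  # extra weight for one "!"

plain = normalize(BEST_VALENCE)
emphatic = normalize(BEST_VALENCE + CAPS_BOOST + EXCLAMATION_BOOST)
print(round(plain, 4), round(emphatic, 4))
```

If those assumptions hold, the two printed values should line up with the compound scores of the two "Nike is the best" sentences above, which suggests the emphasis is simply added to the word's valence before normalization.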

> Let's try another simple example

In [22]:
sentiment_analyzer_scores("Adidas sucks, but I like their sustainability initiative.")

sentence: "Adidas sucks, but I like their sustainability initiative."
scores: {'neg': 0.175, 'neu': 0.5, 'pos': 0.325, 'compound': 0.3612}


0.3612

> How about...

In [24]:
sentiment_analyzer_scores("Adidas is terrible!  I just hate all their designs, and that Kanye West line looked like clothing for homeless people.  I'll never buy Adidas, ever!")

sentence: "Adidas is terrible!  I just hate all their designs, and that Kanye West line looked like clothing for homeless people.  I'll never buy Adidas, ever!"
scores: {'neg': 0.239, 'neu': 0.68, 'pos': 0.081, 'compound': -0.7081}


-0.7081

> Now that we've tested Vader, let's apply it to our dataframes and see the overall sentiment in about 10,000 tweets on each topic/brand

In [50]:
import warnings
warnings.filterwarnings('ignore')

In [43]:
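def compound_score(tweet):
    # Vader's compound score: a single value between -1 (negative) and 1 (positive)
    return analyser.polarity_scores(tweet)['compound']

def overall_sentiment(df):
    # adds a 'sentiment_score' column to df in place, then returns the mean score
    df['sentiment_score'] = df['tweet'].apply(compound_score)
    return round(df['sentiment_score'].mean(), 2)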

In [38]:
compound_score('Adidas is one of my FAV sports brands!')

0.6155

In [44]:
bitcoin_total_score = overall_sentiment(bitcoin_tweets)
print(f'The overall sentiment score for the "Bitcoin" related set of tweets is: {bitcoin_total_score}')
bitcoin_tweets.head(2)

The overall sentiment score for the "Bitcoin" related set of tweets is: 0.1


Unnamed: 0,id,username,tweet,hashtags,date,day,nlikes,nretweets,sentiment_score
0,1154070862536028161,maksimmerili,Long/Short Bitcoin & altcoin volatility with u...,[],2019-07-24 12:48:29,5,0,0,0.4184
1,1154070855380623360,RadsickTrrance,#Bitcoin Price Shuns Volatility as Analysts Wa...,"[#bitcoin, #crypto]",2019-07-24 12:48:28,5,0,0,-0.3612


> People are currently pretty neutral on Bitcoin, not surprising.

In [48]:
adidas_total_score = overall_sentiment(adidas_tweets)
print(f'The overall sentiment score for the "Adidas" related set of tweets is: {adidas_total_score}')
adidas_tweets.head(2)

The overall sentiment score for the "Adidas" related set of tweets is: 0.12


Unnamed: 0,id,username,tweet,hashtags,date,day,nlikes,nretweets,sentiment_score
0,1154072907297054720,swerve1973,Classy combo! I had a Raleigh Burner and Adida...,[],2019-07-24 12:56:37,3,0,0,0.4926
1,1154072900816662528,RoopGautam,"कभी नंगे पैर दौड़ना पड़ता था, क्योंकि जूते खरी...",[],2019-07-24 12:56:35,3,1,0,0.0


> Notice how a non-English tweet gets a sentiment score of 0.0. Vader's lexicon is English, so it finds no words it can score and reports the tweet as neutral, which pulls our averages toward zero!  
> What if the foreign-language tweet used characters from the English alphabet? How would our scores be affected?
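
A tiny worked example shows how much unscored tweets can dilute an average. The four scores below are made up, with the two `0.0` entries standing in for tweets Vader could not read.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


# Made-up compound scores; the 0.0 entries stand in for unscored tweets
scores = [0.6, 0.0, 0.0, -0.2]

mean_all = mean(scores)                                # every tweet counts
mean_scored = mean([s for s in scores if s != 0.0])    # unscored tweets dropped

print(round(mean_all, 2), round(mean_scored, 2))
```

Dropping every zero is only for illustration: it would also discard genuinely neutral English tweets, so a real pipeline would need to filter by language instead.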

In [51]:
nike_total_score = overall_sentiment(nike_tweets)
print(f'The overall sentiment score for the "Nike" related set of tweets is: {nike_total_score}')
nike_tweets.head(2)

The overall sentiment score for the "Nike" related set of tweets is: 0.07


Unnamed: 0,id,username,tweet,hashtags,date,day,nlikes,nretweets,sentiment_score
0,1154077508133933058,twanAthon,I’m glad he went to Jordan. That speaks volume...,[],2019-07-24 13:14:54,2,0,0,0.7096
1,1154077437912866818,crackyspen,2009 Polo Jeans Perfect Circle/Godspeed The St...,[],2019-07-24 13:14:37,2,0,0,0.9396


> Adidas currently trumps Nike! Whaaaaaat?? Look at Nike's first two sentiment scores.

Although averaging over the entire dataset may well have played a role in this, we're still pretty skeptical that Adidas had better sentiment ratings than Nike.  We can also see that all of our overall scores are pretty neutral. This is where data cleaning and preparing our tweets for sentiment analysis will play a big role.  Additionally, the popularity of a tweet should be taken into account: tweets with many likes and retweets should have a stronger effect on our overall score.  We will tackle this shortly.  
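
As a first sketch of popularity weighting, each tweet's score could be weighted by `1 + likes + retweets`, so that a tweet nobody reacted to still counts once. The weighting formula and the function name `weighted_sentiment` are assumptions for illustration, not a settled design.

```python
def weighted_sentiment(scores, likes, retweets):
    """Average compound scores, weighting each tweet by 1 + likes + retweets."""
    weights = [1 + n_likes + n_retweets for n_likes, n_retweets in zip(likes, retweets)]
    weighted_total = sum(score * weight for score, weight in zip(scores, weights))
    return weighted_total / sum(weights)


# Two made-up tweets: an ignored positive one and a popular negative one
example = weighted_sentiment(scores=[0.5, -0.5], likes=[0, 8], retweets=[0, 1])
print(round(example, 3))
```

A plain mean of these two tweets would be 0.0, while the weighted version leans negative because the negative tweet reached more people. A damped weight such as a logarithm of the engagement count may be preferable if a few viral tweets dominate.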

For now, let's note that we will use these sentiment scores to classify individual tweets into 5 classes:  
  
- 0 = negative	
- 1 = neutral_negative		
- 2 = neutral		
- 3 = neutral_positive	
- 4 = positive    
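
The mapping from a compound score to these five classes could look like the sketch below. The cut-off values are assumptions: a band of ±0.05 around zero is a commonly used neutral range for Vader's compound score, and ±0.5 is chosen here purely for illustration.

```python
def classify_sentiment(compound, neutral_band=0.05, strong_band=0.5):
    """Map a compound score in [-1, 1] to a class label from 0 to 4."""
    if compound <= -strong_band:
        return 0  # negative
    if compound <= -neutral_band:
        return 1  # neutral_negative
    if compound < neutral_band:
        return 2  # neutral
    if compound < strong_band:
        return 3  # neutral_positive
    return 4      # positive


# Compound scores taken from the test sentences earlier in the notebook
labels = [classify_sentiment(s) for s in (0.6369, 0.3612, 0.0, -0.7081)]
print(labels)
```

On a dataframe this could be applied with `df['sentiment_score'].apply(classify_sentiment)`, and the thresholds tuned once the cleaned data is available.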

  
The scores will also be averaged per day, month, and year for trend visualization purposes and for our Time Series analysis.  
We should aggregate and organize our tweet data accordingly.

In [None]:
# how can we take into account an average score for the day as well as the whole month
# we will want to visualize sentiment trends day by day as well as month by month over the year+
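
One possible answer to the question in the cell above is to group on the `date` column with pandas. This is a minimal sketch on a made-up four-row dataframe that reuses the `date` and `sentiment_score` column names from earlier; it assumes the dates are strings in the format shown in the outputs above.

```python
import pandas as pd

# Made-up sample using the same column names as our tweet dataframes
tweets = pd.DataFrame({
    "date": [
        "2019-07-23 13:20:26",
        "2019-07-23 18:02:10",
        "2019-07-24 12:56:37",
        "2019-08-01 09:00:00",
    ],
    "sentiment_score": [0.5, -0.1, 0.4, 0.2],
})
tweets["date"] = pd.to_datetime(tweets["date"])

# Mean score per calendar day and per calendar month
daily = tweets.groupby(tweets["date"].dt.date)["sentiment_score"].mean()
monthly = tweets.groupby(tweets["date"].dt.to_period("M"))["sentiment_score"].mean()

print(daily)
print(monthly)
```

Both results are Series indexed by day or month, which is a convenient shape for plotting trends and for feeding a time series model later on.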

### Real Data
Let's note that before we apply Vader to our real data, there are multiple things we have to consider.  
Firstly, a considerable amount of data cleaning needs to be done on the tweets for the sentiment scores to be accurate and valuable.  In our example above, some tweets are in foreign languages, so their sentiment scores carry no information for us, yet they still affect the overall average compound score.  Additionally, some special characters may be throwing off the NLP implementation within Vader, and should be removed.  
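
A first pass at that cleaning could look like the sketch below. It strips links, @mentions, and the `#` symbol while leaving punctuation and capitalization alone, since the "BEST!" test earlier showed that Vader uses both. The regular expressions and the name `clean_tweet` are illustrative assumptions, not a complete cleaning pipeline.

```python
import re

# Links (including Twitter's pic.twitter.com image links) and @mentions
URL_RE = re.compile(r"(https?://\S+|pic\.twitter\.com/\S+)")
MENTION_RE = re.compile(r"@\w+")


def clean_tweet(text):
    """Remove links and mentions, keep hashtag words, and collapse whitespace."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.replace("#", "")  # keep the hashtag's word, drop the symbol
    return " ".join(text.split())


print(clean_tweet("The End of Bitcoin: You [can't] be rich.  pic.twitter.com/JiVb9nBXxy"))
print(clean_tweet("@nike is the BEST! #justdoit https://t.co/abc"))
```

This does nothing about language, so non-English tweets would still need to be filtered out in a separate step.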

Moving forward, we will keep these considerations and many more in mind, and generate the most accurate sentiment scores we can.  We will begin by aggregating our real data, keeping in mind our date ranges so as to represent real time-linear data for our Time Series analysis.  We will then continue to Data Cleaning, and make sure our tweets are in good shape before moving on to analysing sentiment with Vader. 
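
To keep the scraped data evenly spread over time, one option is to query month by month. The helper below only produces the date windows and is a sketch: the name `month_windows` is hypothetical, and passing each pair to TWINT's date-range options (assumed here to be the `Since` and `Until` config fields) is left out, as is checking whether TWINT treats the upper bound as inclusive.

```python
from datetime import date


def month_windows(start, end):
    """Yield (since, until) ISO date strings, one pair per calendar month.

    `until` is the first day of the following month.
    """
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        if month == 12:
            next_year, next_month = year + 1, 1
        else:
            next_year, next_month = year, month + 1
        yield date(year, month, 1).isoformat(), date(next_year, next_month, 1).isoformat()
        year, month = next_year, next_month


windows = list(month_windows(date(2018, 11, 15), date(2019, 1, 3)))
for since, until in windows:
    print(since, until)
```

Running one limited query per window would give each month a comparable number of tweets, instead of a single query that returns only the most recent ones.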

# Data Aggregation

# Data Cleaning

# Implementing Vader

# Visualizing Sentiment Analysis Trends

# Introducing Time Series

# Visualizing sentiment prediction accuracy

# Trend predictions on future perception

# Conclusions and real world applications