# Analysing and Predicting Public Perception on Social Media

Understanding how the public perceives a brand can be a challenge.  With the rise of social media, data on public opinion about specific topics and brands is widely available, and Twitter is a natural source for it.  By scraping Twitter data we will run sentiment analysis on particular brands and topics, then train a Time Series model on past sentiment trends to predict the future sentiment trajectory.  (Case studies: Bitcoin, and Nike vs Adidas; Twitter sentiment trend analysis + prediction.)

> Importing our standard libraries and loading the autoreload extension.

In [19]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline

In [20]:
%load_ext autoreload
%autoreload 2

import sys, os
from os import path
sys.path.append("twint/")

# 1 . Scraping Twitter with TWINT

We'll begin by scraping Twitter using the TWINT module, since Twitter's standard search API is very limited. TWINT allows us to search for target tweets by keyword, within a date range, and much more, with almost no limitations, so the entire Twittersphere is now available to us.  We can then perform sentiment analysis on specific tweets.
  
We've installed TWINT through the command line and appended it to our system path in the cell above.  Next, we will import the module, set up its configuration, and start running queries.  
  

In [21]:
# load TWINT and set up its configuration
import twint
c = twint.Config()

In [22]:
# Allow nested event loops so TWINT's asyncio code can run inside a notebook (avoids RuntimeError).
import nest_asyncio
nest_asyncio.apply()

In [23]:
c.Search = "bitcoin"
c.Limit = 1 # results are returned in blocks of 20 tweets, 1 here means 20
c.Pandas = True
twint.run.Search(c)

1154544367035568128 2019-07-25 20:10:02 EDT <CoinCapsAi> Top 5 #cryptocurrencies   Alert Time: 2019-07-26 03:10:01 #Bitcoin: $9,896.377 #Ethereum: $219.378 #XRP: $0.314 #Litecoin: $93.287 #BitcoinCash: $301.910 #ico #airdrop #altcoin #IoT #trading http://www.coincaps.ai 
1154544359276044288 2019-07-25 20:10:00 EDT <coin_smut> @adam3us before and after bitcoin #btc pic.twitter.com/BGQzfjJ19S
1154544359154626560 2019-07-25 20:10:00 EDT <BitcoinIndeciso> O valor do Bitcoin caiu :( - R$37825
1154544356172488704 2019-07-25 20:09:59 EDT <edwinaac79g> Tell the secretary to buy Bitcoin and hodl.
1154544350736445440 2019-07-25 20:09:58 EDT <byeseldiana14> And you are my friends too. All who cares. That's why I have this for you.  5,000 Bitcoin airdrop has just begun!!!!  Join me:  http://bit.ly/2Me0kLb 
1154544349390102528 2019-07-25 20:09:57 EDT <HassMcCook> The utilities seem to be convinced that #bitcoin will die, hence they worry about having to replace the high amount of demand at short no

### Great!
> We have tweets being output as our result!  Now let's format this output into a dataframe we can work with

In [24]:
def available_columns():
    return twint.output.panda.Tweets_df.columns

def twint_to_pandas(columns):
    return twint.output.panda.Tweets_df[columns]

In [25]:
# see what columns are available
available_columns()

Index(['cashtags', 'conversation_id', 'created_at', 'date', 'day', 'geo',
       'hashtags', 'hour', 'id', 'link', 'name', 'near', 'nlikes', 'nreplies',
       'nretweets', 'place', 'quote_url', 'retweet', 'search', 'timezone',
       'tweet', 'user_id', 'user_id_str', 'username'],
      dtype='object')

In [26]:
# create Pandas dataframe with desired columns
df = twint_to_pandas(['conversation_id', 'created_at', 'id', 'user_id', 'username', 'tweet', 'hashtags', 'date', 'day', 'nlikes', 'nretweets'])
print(df.shape)
df.head()

(20, 11)


Unnamed: 0,conversation_id,created_at,id,user_id,username,tweet,hashtags,date,day,nlikes,nretweets
0,1154544367035568128,1564099802000,1154544367035568128,2507323260,CoinCapsAi,Top 5 #cryptocurrencies Alert Time: 2019-07-...,"[#cryptocurrencies, #bitcoin, #ethereum, #xrp,...",2019-07-25 20:10:02,2,0,0
1,1154544359276044288,1564099800000,1154544359276044288,901894460350636036,coin_smut,@adam3us before and after bitcoin #btc pic.twi...,[#btc],2019-07-25 20:10:00,2,0,0
2,1154544359154626560,1564099800000,1154544359154626560,1128777604033597440,BitcoinIndeciso,O valor do Bitcoin caiu :( - R$37825,[],2019-07-25 20:10:00,2,0,0
3,1154140841696538632,1564099799000,1154544356172488704,1053015067107692545,edwinaac79g,Tell the secretary to buy Bitcoin and hodl.,[],2019-07-25 20:09:59,2,0,0
4,1154528032276848640,1564099798000,1154544350736445440,1149814468299436033,byeseldiana14,And you are my friends too. All who cares. Tha...,[],2019-07-25 20:09:58,2,0,0


### Success!
> We now have a data frame with 20 tweets all containing the keyword "bitcoin", along with some additional information about the tweets

> Now let's make our code a bit more modular so that we can run queries repeatedly

In [27]:
# Disable annoying printing
class HiddenPrints:
    def __enter__(self):
        self._original_stdout = sys.stdout
        sys.stdout = open(os.devnull, 'w')

    def __exit__(self, exc_type, exc_val, exc_tb):
        sys.stdout.close()
        sys.stdout = self._original_stdout

In [28]:
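# scrape tweets matching a search term and return them as a dataframe
def get_tweets(search_term, limit=100):
    c = twint.Config()
    c.Search = search_term
    c.Limit = limit  # TWINT fetches in blocks of 20, so the result can slightly exceed the limit
    c.Pandas = True
    c.Pandas_clean = True
    
    result_columns = ['id', 'username', 'tweet', 'hashtags', 'date', 'day', 'nlikes', 'nretweets']
    # silence TWINT's per-tweet console output while the search runs
    with HiddenPrints():
        twint.run.Search(c)
    return twint.output.panda.Tweets_df[result_columns]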

In [11]:
bitcoin_tweets = get_tweets("bitcoin", limit=10000)
print(bitcoin_tweets.shape)
bitcoin_tweets.head()

(10004, 8)


Unnamed: 0,id,username,tweet,hashtags,date,day,nlikes,nretweets
0,1154070862536028161,maksimmerili,Long/Short Bitcoin & altcoin volatility with u...,[],2019-07-24 12:48:29,5,0,0
1,1154070855380623360,RadsickTrrance,#Bitcoin Price Shuns Volatility as Analysts Wa...,"[#bitcoin, #crypto]",2019-07-24 12:48:28,5,0,0
2,1154070832827813890,TurgayMutlucan,"Bitcoin bu, her an herşey olabilir!",[],2019-07-24 12:48:22,5,0,0
3,1154070829812125707,WallyGideon,New video by Legit TV: Why Coincola is the bes...,[],2019-07-24 12:48:22,5,0,0
4,1154070822828617729,cryptosnarf,When real bitcoin back,[],2019-07-24 12:48:20,5,0,0


In [12]:
adidas_tweets = get_tweets("adidas", limit=10000)
print(adidas_tweets.shape)
adidas_tweets.head()

(10016, 8)


Unnamed: 0,id,username,tweet,hashtags,date,day,nlikes,nretweets
0,1154072907297054720,swerve1973,Classy combo! I had a Raleigh Burner and Adida...,[],2019-07-24 12:56:37,3,0,0
1,1154072900816662528,RoopGautam,"कभी नंगे पैर दौड़ना पड़ता था, क्योंकि जूते खरी...",[],2019-07-24 12:56:35,3,1,0
2,1154072900418359296,cinj00,Elite 8 game is a W!! We beat a tough CBC Elit...,[],2019-07-24 12:56:35,3,0,0
3,1154072890951819265,checkthekicks,http://rover.ebay.com/rover/1/711-53200-19255...,[#kidsshoes],2019-07-24 12:56:33,3,0,0
4,1154072886635880448,liberosans,i'm sick his entire wardrobe is different blac...,[],2019-07-24 12:56:32,3,0,0


In [13]:
adidas_tweets.tail()

Unnamed: 0,id,username,tweet,hashtags,date,day,nlikes,nretweets
10011,1153716514215698432,New_preloved,{NP} Ada yang mau thread tas tas branded dari ...,[],2019-07-23 13:20:26,7,6,0
10012,1153716480917344258,Felipekrf,"Meninaaaas, chegaram os vestidos de moletom da...",[],2019-07-23 13:20:18,7,0,1
10013,1153716470469267458,giselamlopez,Adidas wow #daretocreate New adidas ad Dare ...,[#daretocreate],2019-07-23 13:20:16,7,0,0
10014,1153716458704310273,masterthegamerp,I like ADIDAS or Vans maybe?,[],2019-07-23 13:20:13,7,2,0
10015,1153716452496674816,steezy_jay__,They actually sell think about it it’s only we...,[],2019-07-23 13:20:11,7,0,0


In [14]:
nike_tweets = get_tweets("nike", limit=10000)
print(nike_tweets.shape)
nike_tweets.head()

(10000, 8)


Unnamed: 0,id,username,tweet,hashtags,date,day,nlikes,nretweets
0,1154077508133933058,twanAthon,I’m glad he went to Jordan. That speaks volume...,[],2019-07-24 13:14:54,2,0,0
1,1154077437912866818,crackyspen,2009 Polo Jeans Perfect Circle/Godspeed The St...,[],2019-07-24 13:14:37,2,0,0
2,1154077431860551682,somto_jr,Nike shoes are better than adidas shoes,[],2019-07-24 13:14:36,2,0,0
3,1154077420665921536,alcarazpedro,la última de nike?,[],2019-07-24 13:14:33,2,0,0
4,1154077405717417988,FernandzCande,me calzo las Nike y me voy a caminar lejos con...,[],2019-07-24 13:14:29,2,0,0


# 2 . Testing the Vader Module for Sentiment Analysis

> First, let's test out Vader on a simple line of text and analyze the results

In [None]:
#!pip install vaderSentiment

In [15]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyser = SentimentIntensityAnalyzer()

In [20]:
def sentiment_analyzer_scores(sentence):
    score = analyser.polarity_scores(sentence)
    print('sentence: "{}"'.format(sentence))
    print('scores: {}'.format(str(score))) 
    return score['compound']

In [21]:
sentiment_analyzer_scores("Nike is the best.")

sentence: "Nike is the best."
scores: {'neg': 0.0, 'neu': 0.417, 'pos': 0.583, 'compound': 0.6369}


0.6369

> Pretty positive sentiment, how about...

In [23]:
sentiment_analyzer_scores("Nike is the BEST!")

sentence: "Nike is the BEST!"
scores: {'neg': 0.0, 'neu': 0.365, 'pos': 0.635, 'compound': 0.7371}


0.7371

> See the increase in positivity from 0.64 to 0.74: Vader picks up on the capitalization and the exclamation mark

> Let's try another simple example

In [22]:
sentiment_analyzer_scores("Adidas sucks, but I like their sustainability initiative.")

sentence: "Adidas sucks, but I like their sustainability initiative."
scores: {'neg': 0.175, 'neu': 0.5, 'pos': 0.325, 'compound': 0.3612}


0.3612

> How about...

In [24]:
sentiment_analyzer_scores("Adidas is terrible!  I just hate all their designs, and that Kanye West line looked like clothing for homeless people.  I'll never buy Adidas, ever!")

sentence: "Adidas is terrible!  I just hate all their designs, and that Kanye West line looked like clothing for homeless people.  I'll never buy Adidas, ever!"
scores: {'neg': 0.239, 'neu': 0.68, 'pos': 0.081, 'compound': -0.7081}


-0.7081

> Now that we've tested Vader, let's apply it to our dataframes and see the overall sentiment in about 10,000 tweets on each topic/brand

In [29]:
import warnings
warnings.filterwarnings('ignore')

In [41]:
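def compound_score(tweet):
    # Vader's compound score: a single normalized value between -1 and 1
    return analyser.polarity_scores(tweet)['compound']

def overall_sentiment(df):
    # adds a 'sentiment_score' column to df in place, then returns the mean score
    df['sentiment_score'] = df['tweet'].apply(compound_score)
    return round(df['sentiment_score'].mean(), 2)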

In [38]:
compound_score('Adidas is one of my FAV sports brands!')

0.6155

In [44]:
bitcoin_total_score = overall_sentiment(bitcoin_tweets)
print(f'The overall sentiment score for the "Bitcoin" related set of tweets is: {bitcoin_total_score}')
bitcoin_tweets.head(2)

The overall sentiment score for the "Bitcoin" related set of tweets is: 0.1


Unnamed: 0,id,username,tweet,hashtags,date,day,nlikes,nretweets,sentiment_score
0,1154070862536028161,maksimmerili,Long/Short Bitcoin & altcoin volatility with u...,[],2019-07-24 12:48:29,5,0,0,0.4184
1,1154070855380623360,RadsickTrrance,#Bitcoin Price Shuns Volatility as Analysts Wa...,"[#bitcoin, #crypto]",2019-07-24 12:48:28,5,0,0,-0.3612


> At 0.1, people are currently pretty neutral on Bitcoin, which is not surprising.

In [48]:
adidas_total_score = overall_sentiment(adidas_tweets)
print(f'The overall sentiment score for the "Adidas" related set of tweets is: {adidas_total_score}')
adidas_tweets.head(2)

The overall sentiment score for the "Adidas" related set of tweets is: 0.12


Unnamed: 0,id,username,tweet,hashtags,date,day,nlikes,nretweets,sentiment_score
0,1154072907297054720,swerve1973,Classy combo! I had a Raleigh Burner and Adida...,[],2019-07-24 12:56:37,3,0,0,0.4926
1,1154072900816662528,RoopGautam,"कभी नंगे पैर दौड़ना पड़ता था, क्योंकि जूते खरी...",[],2019-07-24 12:56:35,3,1,0,0.0


> Notice how the non-English tweet gets a sentiment score of 0. Vader's lexicon is English, so it finds nothing to score; the 0 reads as neutral, but it drags our averages toward zero!  
> What if a foreign-language tweet is written in the Latin alphabet?  How would our scores be affected?
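
To get a feel for how much unscored tweets can dilute an average, here is a toy calculation. The four scores are illustrative values only (two positive scores and two zeros standing in for tweets Vader could not score), and the variable names are made up for this sketch.

```python
# Illustrative scores: two scored tweets and two that came back as 0.0.
scores = [0.4926, 0.0, 0.7096, 0.0]

# Average over every tweet, zeros included (what overall_sentiment does).
mean_all = sum(scores) / len(scores)

# Average over tweets that received a non-zero score.
scored = [s for s in scores if s != 0.0]
mean_scored = sum(scored) / len(scored)

print(round(mean_all, 4), round(mean_scored, 4))
```

Keep in mind that a 0.0 can also come from a genuinely neutral English tweet, so dropping zeros is a diagnostic rather than a fix; filtering by language before scoring is the cleaner route.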

In [51]:
nike_total_score = overall_sentiment(nike_tweets)
print(f'The overall sentiment score for the "Nike" related set of tweets is: {nike_total_score}')
nike_tweets.head(2)

The overall sentiment score for the "Nike" related set of tweets is: 0.07


Unnamed: 0,id,username,tweet,hashtags,date,day,nlikes,nretweets,sentiment_score
0,1154077508133933058,twanAthon,I’m glad he went to Jordan. That speaks volume...,[],2019-07-24 13:14:54,2,0,0,0.7096
1,1154077437912866818,crackyspen,2009 Polo Jeans Perfect Circle/Godspeed The St...,[],2019-07-24 13:14:37,2,0,0,0.9396


> Adidas currently edges out Nike (0.12 vs 0.07)! Whaaaaaat?? Look at Nike's first two sentiment scores: both are strongly positive

Averaging over the entire dataset may well have played a role in this, but we're still pretty skeptical that Adidas had better sentiment ratings than Nike.  We can also see that all of our overall scores are close to neutral.  This is where Data Cleaning, i.e. preparing our tweets to be analyzed for sentiment, will play a big role.  Additionally, the popularity of a tweet should be taken into account: tweets with many likes and retweets should have a stronger effect on our overall score.  We will tackle this shortly.  
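
One possible way to let popular tweets count for more is a weighted average. The sketch below is only an assumption about how the weighting could work: the function name `weighted_sentiment` and the weight `1 + likes + retweets` are hypothetical choices, not something fixed by the analysis above.

```python
def weighted_sentiment(scores, likes, retweets):
    """Average sentiment where each tweet counts 1 + likes + retweets times.

    The weighting scheme is an illustrative assumption; the +1 keeps
    tweets with no engagement from being ignored entirely.
    """
    weights = [1 + n_likes + n_rts for n_likes, n_rts in zip(likes, retweets)]
    if not weights:
        raise ValueError("no tweets to average")
    weighted_sum = sum(s * w for s, w in zip(scores, weights))
    return weighted_sum / sum(weights)

# Two tweets: an unnoticed positive one and a popular negative one.
example = weighted_sentiment(scores=[0.5, -0.5], likes=[0, 8], retweets=[0, 1])
print(round(example, 3))
```

With a dataframe like the ones above, the same function could be called on the 'sentiment_score', 'nlikes' and 'nretweets' columns. The plain mean of the two example scores is 0, while the weighted version leans negative because the negative tweet has more engagement.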

For now, note that we will use these sentiment scores to classify each tweet into 5 classes:  
  
- 0 = negative	
- 1 = neutral_negative		
- 2 = neutral		
- 3 = neutral_positive	
- 4 = positive    
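
A minimal sketch of how a compound score could be mapped onto these five classes. The cutoffs are assumptions: ±0.05 follows the neutral band commonly suggested for Vader's compound score, while ±0.5 is an arbitrary illustrative boundary between the weak and strong classes. The function name `sentiment_class` is hypothetical.

```python
# Class ids and names as listed above.
CLASS_NAMES = {
    0: "negative",
    1: "neutral_negative",
    2: "neutral",
    3: "neutral_positive",
    4: "positive",
}

def sentiment_class(compound, weak=0.05, strong=0.5):
    """Map a compound score in [-1, 1] to a class id (cutoffs are assumptions)."""
    if compound <= -strong:
        return 0
    if compound <= -weak:
        return 1
    if compound < weak:
        return 2
    if compound < strong:
        return 3
    return 4

# Scores taken from the Vader tests earlier in this notebook.
for score in (-0.7081, 0.0, 0.3612, 0.6369):
    print(score, CLASS_NAMES[sentiment_class(score)])
```

The thresholds would need tuning against real, cleaned tweets before the classes mean much.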

  
The scores will also be averaged per day, month, and year for trend visualization and for our Time Series analysis.  
We should aggregate and organize our tweet data accordingly.

In [None]:
# how can we take into account an average score for the day as well as the whole month
# we will want to visualize sentiment trends day by day as well as month by month over the year+
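
One way to approach the question in the cell above is pandas resampling on the tweet timestamps. This is a sketch on a made-up toy frame that reuses the 'date' and 'sentiment_score' column names; the helper name `average_sentiment` is hypothetical.

```python
import pandas as pd

# Toy frame with the same column names as our tweet dataframes.
toy = pd.DataFrame({
    "date": pd.to_datetime([
        "2019-07-23 13:20:26", "2019-07-23 19:47:29",
        "2019-07-24 12:48:29", "2019-08-01 09:00:00",
    ]),
    "sentiment_score": [0.4, 0.2, -0.3, 0.5],
})

def average_sentiment(df, freq):
    """Mean sentiment per period; freq is a pandas offset alias."""
    return df.set_index("date")["sentiment_score"].resample(freq).mean()

daily = average_sentiment(toy, "D")     # one value per calendar day
monthly = average_sentiment(toy, "MS")  # one value per month, labelled by month start

print(daily.dropna())
print(monthly)
```

Days without tweets show up as NaN in the daily series, which is worth deciding how to handle (drop, interpolate, or carry forward) before fitting a Time Series model.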

### Real Data
Before we apply Vader to our real data, there are several things we have to consider.  
Firstly, a considerable amount of data cleaning needs to be done on the tweets for the sentiment scores to be accurate and valuable.  In our example above, some tweets are in foreign languages, so their sentiment scores are meaningless to us, yet they still affect the overall average compound score.  Additionally, some special characters may be throwing off the NLP implementation within Vader, and should be removed.  
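
As a rough idea of what such cleaning might look like, the sketch below strips links, picture links and @mentions, and drops the '#' symbol while keeping the hashtag word. The patterns and the function name `clean_tweet` are assumptions for illustration, not the final cleaning pipeline.

```python
import re

# Links, including the pic.twitter.com form seen in the scraped tweets.
URL_RE = re.compile(r"https?://\S+|pic\.twitter\.com/\S+")
# @username mentions.
MENTION_RE = re.compile(r"@\w+")

def clean_tweet(text):
    """Remove links and mentions, keep hashtag words, collapse whitespace."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.replace("#", "")
    return " ".join(text.split())

print(clean_tweet("@adam3us before and after bitcoin #btc pic.twitter.com/BGQzfjJ19S"))
```

Capitalization and punctuation are deliberately left alone here, since the "BEST!" test earlier showed that Vader uses both as intensity cues.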

Moving forward, we will keep these considerations and many more in mind, and generate the most accurate sentiment scores we can.  We will begin by aggregating our real data, keeping our date ranges in mind so that the data forms a chronological series for our Time Series analysis.  We will then continue to Data Cleaning, and make sure our tweets are in good shape before moving on to analysing sentiment with Vader. 

# 2 . Real Data Aggregation

We will now scrape Twitter for targeted tweets dating back to January 2017 (a little over 2 1/2 years in total).  We want these tweets aggregated monthly, and we will save each month of tweets in its own .json file inside our twitter_dataset directory.  We will then concatenate this data into a single dataframe for each topic/brand, representing the topic/brand's entire "recent" public Twitter data.

> We want popular and significant tweets - let's keep that in mind when scraping as well as when evaluating sentiment

> So ...  
- We will set TWINT's 'Popular_tweets' option to True in order to retrieve only popular tweets  
- We will also use TWINT's 'Lang' option to make sure we only retrieve tweets that are in English  
- Finally, we will set 'Since' and 'Until' to the appropriate dates to retrieve our tweets month by month  

In [30]:
# slice our ~2.5 year time range into (first day, last day) tuples, one per month
import datetime
begin = '2017-01-01'
end = '2019-7-24'
month_ranges = []

dt_start = datetime.datetime.strptime(begin, '%Y-%m-%d')
dt_end = datetime.datetime.strptime(end, '%Y-%m-%d')
one_day = datetime.timedelta(1)
start_dates = [dt_start]
end_dates = []
today = dt_start

while today <= dt_end:
    #print(today)
    tomorrow = today + one_day
    if tomorrow.month != today.month:
        start_dates.append(tomorrow)
        end_dates.append(today)
    today = tomorrow

end_dates.append(dt_end)

# pair each month's first day with its last day (avoids reusing the name `end`)
month_ranges = list(zip(start_dates, end_dates))

month_ranges

[(datetime.datetime(2017, 1, 1, 0, 0), datetime.datetime(2017, 1, 31, 0, 0)),
 (datetime.datetime(2017, 2, 1, 0, 0), datetime.datetime(2017, 2, 28, 0, 0)),
 (datetime.datetime(2017, 3, 1, 0, 0), datetime.datetime(2017, 3, 31, 0, 0)),
 (datetime.datetime(2017, 4, 1, 0, 0), datetime.datetime(2017, 4, 30, 0, 0)),
 (datetime.datetime(2017, 5, 1, 0, 0), datetime.datetime(2017, 5, 31, 0, 0)),
 (datetime.datetime(2017, 6, 1, 0, 0), datetime.datetime(2017, 6, 30, 0, 0)),
 (datetime.datetime(2017, 7, 1, 0, 0), datetime.datetime(2017, 7, 31, 0, 0)),
 (datetime.datetime(2017, 8, 1, 0, 0), datetime.datetime(2017, 8, 31, 0, 0)),
 (datetime.datetime(2017, 9, 1, 0, 0), datetime.datetime(2017, 9, 30, 0, 0)),
 (datetime.datetime(2017, 10, 1, 0, 0), datetime.datetime(2017, 10, 31, 0, 0)),
 (datetime.datetime(2017, 11, 1, 0, 0), datetime.datetime(2017, 11, 30, 0, 0)),
 (datetime.datetime(2017, 12, 1, 0, 0), datetime.datetime(2017, 12, 31, 0, 0)),
 (datetime.datetime(2018, 1, 1, 0, 0), datetime.datetime(2

In [31]:
import json

In [32]:
# let's rewrite our get_tweets function; i is the index of the target month in month_ranges
def get_real_data_tweets(search_term, i, limit=100):
    """
    Scrapes Twitter for popular English tweets within one monthly date range.
    Requires the month_ranges list to be in memory, holding tuples of 2 dates ([0]=starting, [1]=ending).
    Pass in a search term and a month index; returns a Pandas dataframe.
    """
    # real data mining
    c = twint.Config()
    c.Search = search_term
    c.Limit = limit
    c.Pandas = True
    c.Pandas_clean = True
    c.Lang = 'en'
    c.Since = str(month_ranges[i][0])[:10]  # 'YYYY-MM-DD'
    c.Until = str(month_ranges[i][1])[:10]  # 'YYYY-MM-DD'
    c.Popular_tweets = True
    c.Store_json = True

    result_columns = ['id', 'username', 'tweet', 'hashtags', 'date', 'day', 'nlikes', 'nretweets']
    # silence TWINT's per-tweet console output while the search runs
    with HiddenPrints():
        twint.run.Search(c)
    return twint.output.panda.Tweets_df[result_columns]

In [155]:
# let's test it for one month
january_2017_bitcoin_tweets = get_real_data_tweets('bitcoin', 0, limit=100000)
print(january_2017_bitcoin_tweets.shape)
january_2017_bitcoin_tweets.head()

CRITICAL:root:twint.output:checkData:copyrightedTweet


(4294, 8)


Unnamed: 0,id,username,tweet,hashtags,date,day,nlikes,nretweets
0,826218195908456449,coindesk,The latest Bitcoin Price Index is 920.09 USD ...,[],2017-01-30 18:59:02,5,19,20
1,826216967724007425,sharkybit,Blythe Masters said #Bitcoin is bad because yo...,[#bitcoin],2017-01-30 18:54:10,1,31,15
2,826214708260765696,blockchain,A warning & important message to users & the #...,[#bitcoin],2017-01-30 18:45:11,2,41,48
3,826214582045728768,Steven_McKie,Just had a really awesome interview with @bala...,[],2017-01-30 18:44:41,2,21,6
4,826207612253458434,Xentagz,Stable #bitcoin price pic.twitter.com/ZlSiY8jb5d,[#bitcoin],2017-01-30 18:16:59,3,16,2
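
A detail worth checking in the output above: the range for January ends on 2017-01-31, yet the latest tweet returned is from 2017-01-30. That suggests the end date is treated as exclusive, although this is an inference from the output and not confirmed against TWINT's documentation. If it holds, the last day of every month is being skipped. A sketch of month ranges whose end is the first day of the following month; the function name `month_ranges_exclusive` is hypothetical:

```python
import datetime

def month_ranges_exclusive(begin, end):
    """Return (start, exclusive_end) date pairs, one per month in [begin, end)."""
    ranges = []
    start = begin
    while start < end:
        # first day of the month after `start`
        if start.month == 12:
            next_start = datetime.date(start.year + 1, 1, 1)
        else:
            next_start = datetime.date(start.year, start.month + 1, 1)
        ranges.append((start, min(next_start, end)))
        start = next_start
    return ranges

# The end is exclusive, so 2019-07-25 covers tweets through 2019-07-24.
ranges = month_ranges_exclusive(datetime.date(2017, 1, 1), datetime.date(2019, 7, 25))
print(len(ranges), ranges[0], ranges[-1])
```

Since `str()` of a `datetime.date` is already in 'YYYY-MM-DD' form, these values could be passed on as strings without slicing.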


### Ok let's scrape our full recent Twitter dataset (~2.5 years) for "bitcoin", "adidas", and "nike"

In [39]:
def create_dataset(search_term, dataframes=None):
    """
    Scrapes one month of tweets at a time, dating back to January 2017,
    saves each month to its own .json file, and concats the months into one large dataframe.
    Pass in a search term; returns (full dataframe, dict of monthly dataframes).
    """
    # start from a fresh dict on every call; a `{}` default would be shared across calls
    if dataframes is None:
        dataframes = {}
    for i, month in enumerate(month_ranges):
        # e.g. '2017_01_bitcoin_tweets'
        key = month[0].strftime('%Y_%m') + '_' + search_term + '_tweets'
        monthly = get_real_data_tweets(search_term, i, limit=100000)
        dataframes[key] = monthly.sort_values(by='date', ascending=True)
        dataframes[key].to_json(key + '.json')

    df = pd.concat(list(dataframes.values())).reset_index(drop=True)

    print(df.shape)
    return df, dataframes
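
A note on dict defaults in Python, since this function is called once per search term: a mutable default such as `{}` is created once and shared by every call that does not pass its own dict, so monthly dataframes from one search term would carry over into the next. A minimal sketch of the effect and of the usual `None` idiom; both function names are made up for this illustration:

```python
def collect_shared(term, store={}):
    # The same default dict object is reused on every call.
    store[term] = len(term)
    return store

def collect_fresh(term, store=None):
    # A new dict is created whenever the caller does not supply one.
    if store is None:
        store = {}
    store[term] = len(term)
    return store

shared_first = dict(collect_shared("bitcoin"))
shared_second = dict(collect_shared("adidas"))  # still contains "bitcoin"
fresh_second = collect_fresh("adidas")          # contains only "adidas"

print(shared_second)
print(fresh_second)
```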

In [40]:
bitcoin_dataset, bitcoin_dataframes = create_dataset('bitcoin')

CRITICAL:root:twint.output:checkData:copyrightedTweet
CRITICAL:root:twint.output:checkData:copyrightedTweet
CRITICAL:root:twint.output:checkData:copyrightedTweet


(285471, 8)


In [41]:
bitcoin_dataset.head()

Unnamed: 0,id,username,tweet,hashtags,date,day,nlikes,nretweets
0,815422325629796353,calestous,#Bitcoin Ends 2016 as the Top Currency as It N...,[#bitcoin],2017-01-01 00:00:06,6,5,10
1,815423070919921664,Satoshi_N_,Happy New Year @Bitcoin.,[],2017-01-01 00:03:04,1,44,18
2,815423326332088320,RedditBTC,"""It's a HUGE deal. It's a HUGE, HUGE, HUGE dea...",[],2017-01-01 00:04:05,2,8,2
3,815425340420001792,TigoCTM,Crypto Currency #Bitcoin #Dash #Crypto http:...,"[#bitcoin, #dash, #crypto]",2017-01-01 00:12:05,7,17,21
4,815426865561210880,DollarVigilante,Trends Generated By Jubilee 2016 Will Continue...,"[#bitcoin, #investing]",2017-01-01 00:18:09,4,15,7


In [42]:
bitcoin_dataset.tail()

Unnamed: 0,id,username,tweet,hashtags,date,day,nlikes,nretweets
285466,1153777826191020034,Rogiervdbeek,Did you know that @NavCoin is a Proof of Stake...,"[#cryptocurrency, #bitcoin]",2019-07-23 17:24:04,1,93,14
285467,1153787688501125127,saifedean,My definition: A shitcoin is anything promoted...,[#yesallshitcoins],2019-07-23 18:03:15,7,480,85
285468,1153794367104012288,bobcatbsv,My bitcoin towel is getting a workout on holid...,"[#bsv, #bitcoin]",2019-07-23 18:29:48,5,48,2
285469,1153810079981813761,TraderEscobar,$BTC You will never see Bitcoin below 10k ag...,[],2019-07-23 19:32:14,6,54,4
285470,1153813918906585089,StrictlyBidnazz,Respect the Bitcoin Marketing Team #Bitcoin #B...,"[#bitcoin, #btc]",2019-07-23 19:47:29,3,108,21


In [43]:
bitcoin_dataframes.keys()

dict_keys(['2017_01_bitcoin_tweets', '2017_02_bitcoin_tweets', '2017_03_bitcoin_tweets', '2017_04_bitcoin_tweets', '2017_05_bitcoin_tweets', '2017_06_bitcoin_tweets', '2017_07_bitcoin_tweets', '2017_08_bitcoin_tweets', '2017_09_bitcoin_tweets', '2017_10_bitcoin_tweets', '2017_11_bitcoin_tweets', '2017_12_bitcoin_tweets', '2018_01_bitcoin_tweets', '2018_02_bitcoin_tweets', '2018_03_bitcoin_tweets', '2018_04_bitcoin_tweets', '2018_05_bitcoin_tweets', '2018_06_bitcoin_tweets', '2018_07_bitcoin_tweets', '2018_08_bitcoin_tweets', '2018_09_bitcoin_tweets', '2018_10_bitcoin_tweets', '2018_11_bitcoin_tweets', '2018_12_bitcoin_tweets', '2019_01_bitcoin_tweets', '2019_02_bitcoin_tweets', '2019_03_bitcoin_tweets', '2019_04_bitcoin_tweets', '2019_05_bitcoin_tweets', '2019_06_bitcoin_tweets', '2019_07_bitcoin_tweets'])

### Perfect!  
- We now have a dataset of strictly popular bitcoin tweets, ordered chronologically and dating back to January 1st 2017.  
- We've also saved each month of tweets in its own .json file.    
- Be sure to manually move the .json files into their appropriate directory (twitter_dataset/bitcoin/ for this search term)

### Let's now do the same for Adidas and Nike  

In [None]:
adidas_dataset, adidas_dataframes = create_dataset('adidas')

In [None]:
adidas_dataset.head()

In [None]:
adidas_dataset.tail()

In [None]:
adidas_dataframes.keys()

In [None]:
nike_dataset, nike_dataframes = create_dataset('nike')

In [None]:
nike_dataset.head()

In [None]:
nike_dataset.tail()

In [None]:
nike_dataframes.keys()

In [None]:
# get monthly .json files and dataframes for bitcoin tweets dating back to January 2017, and concat() them into one large dataframe 
# bitcoin_monthly_dataframes = {}

# for i, month in enumerate(month_ranges):
#     key = str(month[0])[:4] + '_' + str(month[0])[5:7] + '_bitcoin_tweets'
#     bitcoin_monthly_dataframes[key] = get_real_data_tweets('bitcoin', i, limit=100000)
#     bitcoin_monthly_dataframes[key].sort_values(by=['date'], inplace=True, ascending=True)
#     bitcoin_monthly_dataframes[key].to_json(str(month[0])[:4] + '_' + str(month[0])[5:7] + '_bitcoin_tweets.json')

# bitcoin_tweets_2017_to_now = pd.concat([v for v in bitcoin_monthly_dataframes.values()])
# bitcoin_tweets_2017_to_now.reset_index(inplace=True)
# bitcoin_tweets_2017_to_now.drop(['index'], axis=1, inplace=True)

# print(bitcoin_tweets_2017_to_now.shape)

In [None]:
# bitcoin_tweets_2017_to_now.head()

In [None]:
# bitcoin_tweets_2017_to_now.tail()

In [None]:
# # same for adidas
# adidas_monthly_dataframes = {}

# for i, month in enumerate(month_ranges):
#     key = str(month[0])[:4] + '_' + str(month[0])[5:7] + '_adidas_tweets'
#     adidas_monthly_dataframes[key] = get_real_data_tweets('adidas', i, limit=100000)
#     adidas_monthly_dataframes[key].sort_values(by=['date'], inplace=True, ascending=True)
#     adidas_monthly_dataframes[key].to_json(str(month[0])[:4] + '_' + str(month[0])[5:7] + '_adidas_tweets.json')

# adidas_tweets_2017_to_now = pd.concat([v for v in adidas_monthly_dataframes.values()])
# adidas_tweets_2017_to_now.reset_index(inplace=True)
# adidas_tweets_2017_to_now.drop(['index'], axis=1, inplace=True)

# print(adidas_tweets_2017_to_now.shape)

In [None]:
# adidas_tweets_2017_to_now.head()

In [None]:
# adidas_tweets_2017_to_now.tail()

In [None]:
# same for nike
# nike_monthly_dataframes = {}

# for i, month in enumerate(month_ranges):
#     key = str(month[0])[:4] + '_' + str(month[0])[5:7] + '_nike_tweets'
#     nike_monthly_dataframes[key] = get_real_data_tweets('nike', i, limit=100000)
#     nike_monthly_dataframes[key].sort_values(by=['date'], inplace=True, ascending=True)
#     nike_monthly_dataframes[key].to_json(str(month[0])[:4] + '_' + str(month[0])[5:7] + '_nike_tweets.json')

# nike_tweets_2017_to_now = pd.concat([v for v in nike_monthly_dataframes.values()])
# nike_tweets_2017_to_now.reset_index(inplace=True)
# nike_tweets_2017_to_now.drop(['index'], axis=1, inplace=True)

# print(nike_tweets_2017_to_now.shape)

In [None]:
# nike_tweets_2017_to_now.head()

In [None]:
# nike_tweets_2017_to_now.tail()

### The following function allows us to recreate our dataframes from the .json files if the in-memory data is lost (e.g. after a kernel restart)

In [15]:
def recreate_dataframe(search_term, dataframes=None):
    """
    Recreates the full ~2.5 year dataset (df) and the monthly dataframes dict from the .json files.
    Make sure the search_term you pass in has a directory of the same name in the twitter_dataset folder.
    Make sure the .json files inside that directory are appropriately named.
    """
    # start from a fresh dict on every call; a `{}` default would be shared across calls
    if dataframes is None:
        dataframes = {}
    for month in month_ranges:
        # e.g. '2017_01_bitcoin_tweets'
        key = month[0].strftime('%Y_%m') + '_' + search_term + '_tweets'
        monthly = pd.read_json('twitter_dataset/' + search_term + '/' + key + '.json')
        dataframes[key] = monthly.sort_values(by='date', ascending=True)

    df = pd.concat(list(dataframes.values())).reset_index(drop=True)

    print(df.shape)
    return df, dataframes

# 3 . Data Cleaning

# 4 . Implementing VaderSentiment

# 5 . Visualizing Sentiment Distribution & Trends

# 6 . Introducing Time Series

# 7 . Analysing sentiment prediction accuracy

# 8 . Trend predictions on future perception

# 9 . Testing for Trends

# 10 . Conclusions and real world applications