# Social Media and Data Science - Part 2

###  *Goal*: Use social media posts to explore the application of text and natural language processing to see what might be learned from online interactions.

Specifically, we will retrieve, annotate, process, and interpret Twitter data on health-related issues such as depression.

--- 
References:
* [Mining Twitter Data with Python (Part 1: Collecting data)](https://marcobonzanini.com/2015/03/02/mining-twitter-data-with-python-part-1/)
* The [Tweepy Python API for Twitter](http://www.tweepy.org/)

Required Software
* [Python 3](https://www.python.org)
* [NumPy](http://www.numpy.org) - for preparing data for plotting
* [Matplotlib](https://matplotlib.org) - for plots and graphs
* [jsonpickle](https://jsonpickle.github.io) - for storing tweets
---

In [1]:
%matplotlib inline

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import jsonpickle
import json
import random
import tweepy
import time
import operator
from datetime import datetime

## 2.0.1 Setup

Before we dig in, we must grab a bit of code from [Part 1](SocialMedia - Part 1.ipynb):

1. The Tweets class used to store the tweets.
2. Our twitter API Keys - be sure to copy the keys that you generated when you completed [Part 1](SocialMedia - Part 1.ipynb).
3. Configuration of our Twitter connection

In [2]:
class Tweets:
    
    
    def __init__(self,term="",corpus_size=100):
        self.tweets={}
        if term !="":
            self.searchTwitter(term,corpus_size)
                
    def searchTwitter(self,term,corpus_size):
        searchTime=datetime.now()
        while (self.countTweets() < corpus_size):
            new_tweets = api.search(term,lang="en",tweet_mode='extended',count=10)
            for nt_json in new_tweets:
                nt = nt_json._json
                if self.getTweet(nt['id_str']) is None and self.countTweets() < corpus_size:
                    self.addTweet(nt,searchTime,term)
            time.sleep(5)
                
    def addTweet(self,tweet,searchTime,term="",count=0):
        id = tweet['id_str']
        if id not in self.tweets.keys():
            self.tweets[id]={}
            self.tweets[id]['tweet']=tweet
            self.tweets[id]['count']=0
            self.tweets[id]['searchTime']=searchTime
            self.tweets[id]['searchTerm']=term
        self.tweets[id]['count'] = self.tweets[id]['count'] +1
        
    def getTweet(self,id):
        if id in self.tweets:
            return self.tweets[id]['tweet']
        else:
            return None
    
    def getTweetCount(self,id):
        return self.tweets[id]['count']
    
    def countTweets(self):
        return len(self.tweets)
    
    # return a sorted list of tuples of the form (id,count), with the occurrence counts sorted in decreasing order
    def mostFrequent(self):
        ps = []
        for t,entry in self.tweets.items():
            count = entry['count']
            ps.append((t,count))  
        ps.sort(key=lambda x: x[1],reverse=True)
        return ps
    
    # returns tweet IDs as a set
    def getIds(self):
        return set(self.tweets.keys())
    
    # save the tweets to a file
    def saveTweets(self,filename):
        json_data =jsonpickle.encode(self.tweets)
        with open(filename,'w') as f:
            json.dump(json_data,f)
    
    # read the tweets from a file 
    def readTweets(self,filename):
        with open(filename,'r') as f:
            json_data = json.load(f)
            incontents = jsonpickle.decode(json_data)   
            self.tweets=incontents
        
    def getSearchTerm(self,id):
        return self.tweets[id]['searchTerm']
    
    def getSearchTime(self,id):
        return self.tweets[id]['searchTime']
    
    def getText(self,id):
        tweet = self.getTweet(id)
        text=tweet['full_text']
        if 'retweeted_status'in tweet:
            original = tweet['retweeted_status']
            text=original['full_text']
        return text

*REDACT FOLLOWING DETAILS*

In [3]:
consumer_key = 'YOUR-CONSUMER-KEY'
consumer_secret = 'YOUR-CONSUMER-SECRET'
access_token = 'YOUR-ACCESS-TOKEN'
access_secret = 'YOUR-ACCESS-SECRET'

In [4]:
from tweepy import OAuthHandler

auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth)

## 2.1 Annotating Tweets

Now that we have a corpus of tweets, what do we want to do with them? Turning a relatively vague notion into a well-defined research question is often a significant challenge, as examination of the data often reveals both shortcomings and unforeseen opportunities.

In our case, we are interested in looking at tweets about depression, but we're not quite sure exactly *what* we are looking for. We have a vague notion that we might learn something interesting, but understanding exactly what that is, and what sort of analyses we might need, will require a bit more work.

In situations such as this, we might look at some of the data to form some preliminary impressions of the content. Specifically, we can look at individual tweets, assigning them to one or more categories - known as *codes* - based on their content.  We can add categories as needed to capture important ideas that we might want to refer back to. This practice - known as *open coding* - allows us to begin to make sense of unfamiliar data sets. 

This sounds much more complicated than it is. For now, let's read some tweets in from a file collected using the procedures discussed in [Part 1](SocialMedia - Part 1.ipynb). We'll then use those tweets to get to work. The tweets are stored in a file called `tweets-smoking.json`.

In [5]:
tweets =Tweets()
tweets.readTweets("tweets-smoking.json")

We check the count, to verify the contents...

In [6]:
print(tweets.countTweets())

100


We will begin by taking a look at a subset of the first 20 tweets.

To get this list, we'll sort the IDs of the tweets and take the first 20 in the list, as ordered by ID.

In [7]:
ids=list(tweets.getIds())
ids.sort()
working=[]
for i in range(20):
    id = ids[i]
    working.append(id)
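
The loop above can also be written as a slice. The sketch below uses made-up IDs (two real-looking ones plus two short ones invented for illustration) to show one caveat: tweet IDs are stored as strings, so a plain sort is lexicographic. The IDs in this corpus appear to share the same length, in which case both orderings agree; passing `key=int` is a safeguard for corpora where that assumption might not hold.

```python
# Hypothetical IDs of differing lengths, chosen only to expose the difference.
sample_ids = {'974316602266128384', '974316601796452352', '99999', '1000000'}

# Plain string sort: compares character by character.
lexicographic = sorted(sample_ids)

# Numeric sort: compares the IDs as integers, keeps them as strings.
numeric = sorted(sample_ids, key=int)

# Slicing replaces the explicit range() loop; here we keep the first 3.
working_sample = numeric[:3]

print(lexicographic)
print(numeric)
print(working_sample)
```

With the real corpus, the equivalent would be `working = sorted(tweets.getIds(), key=int)[:20]`.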

*working* now has 20 tweet IDs. Let's start with the first.

In [8]:
td = working[0]
print(tweets.getSearchTerm(td))
print(tweets.getSearchTime(td))
print(tweets.getText(td))

smoking
2018-03-15 12:09:37.117405
me: smoking weed hasn’t affected me at all

someone: count to 10

me: https://t.co/SUoGzARpom


This tweet has several interesting characteristics.
1. It contains a link.
2. It contains a mention of marijuana: 'weed'.

We can model all of these points through relevant annotation. Specifically, we will add a new set of codes to each tweet object. This set will contain the categorical annotations associated with the tweet.  We add routines to add a single code to a tweet (by ID), to add multiple codes, and to retrieve the codes associated with a tweet.


See the modifications to the `Tweets` class in this new definition.

In [9]:
class Tweets:
    
    
    def __init__(self,term="",corpus_size=100):
        self.tweets={}
        if term !="":
            self.searchTwitter(term,corpus_size)
                
    def searchTwitter(self,term,corpus_size):
        searchTime=datetime.now()
        while (self.countTweets() < corpus_size):
            new_tweets = api.search(term,lang="en",tweet_mode='extended',count=10)
            for nt_json in new_tweets:
                nt = nt_json._json
                if self.getTweet(nt['id_str']) is None and self.countTweets() < corpus_size:
                    self.addTweet(nt,searchTime,term)
            time.sleep(5)
                
    def addTweet(self,tweet,searchTime,term="",count=0):
        id = tweet['id_str']
        if id not in self.tweets.keys():
            self.tweets[id]={}
            self.tweets[id]['tweet']=tweet
            self.tweets[id]['count']=0
            self.tweets[id]['searchTime']=searchTime
            self.tweets[id]['searchTerm']=term
        self.tweets[id]['count'] = self.tweets[id]['count'] +1
        
    def getTweet(self,id):
        if id in self.tweets:
            return self.tweets[id]['tweet']
        else:
            return None
    
    def getTweetCount(self,id):
        return self.tweets[id]['count']
    
    def countTweets(self):
        return len(self.tweets)
    
    # return a sorted list of tuples of the form (id,count), with the occurrence counts sorted in decreasing order
    def mostFrequent(self):
        ps = []
        for t,entry in self.tweets.items():
            count = entry['count']
            ps.append((t,count))  
        ps.sort(key=lambda x: x[1],reverse=True)
        return ps
    
    # returns tweet IDs as a set
    def getIds(self):
        return set(self.tweets.keys())
    
    # save the tweets to a file
    def saveTweets(self,filename):
        json_data =jsonpickle.encode(self.tweets)
        with open(filename,'w') as f:
            json.dump(json_data,f)
    
    # read the tweets from a file 
    def readTweets(self,filename):
        with open(filename,'r') as f:
            json_data = json.load(f)
            incontents = jsonpickle.decode(json_data)   
            self.tweets=incontents
        
    def getSearchTerm(self,id):
        return self.tweets[id]['searchTerm']
    
    def getSearchTime(self,id):
        return self.tweets[id]['searchTime']
    
    def getText(self,id):
        tweet = self.getTweet(id)
        text=tweet['full_text']
        if 'retweeted_status'in tweet:
            original = tweet['retweeted_status']
            text=original['full_text']
        return text
                
    ### NEW ROUTINE - add a code to a tweet
    def addCode(self,id,code):
        tweet=self.getTweet(id)
        if 'codes' not in tweet:
            tweet['codes']=set()
        tweet['codes'].add(code)
        
    ### NEW ROUTINE  - add multiple  codes for a tweet
    def addCodes(self,id,codes):
        for code in codes:
            self.addCode(id,code)
        
    ### NEW ROUTINE get codes for a tweet
    def getCodes(self,id):
        tweet=self.getTweet(id)
        return tweet['codes']
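
One caveat about `getCodes` as written: it indexes `tweet['codes']` directly, so calling it on a tweet that has not been coded yet raises a `KeyError`. The sketch below shows a more forgiving lookup on plain dictionaries. The helper name `get_codes_safe` and the sample tweets are hypothetical, not part of the `Tweets` class.

```python
def get_codes_safe(tweet):
    """Return the tweet's codes, or an empty set if it has none yet."""
    return tweet.get('codes', set())

# Made-up tweet dictionaries for illustration.
coded = {'full_text': 'made-up text with a link', 'codes': {'LINK'}}
uncoded = {'full_text': 'another made-up text'}

print(get_codes_safe(coded))
print(get_codes_safe(uncoded))

# Direct indexing, as in getCodes, fails on the uncoded tweet.
try:
    uncoded['codes']
    direct_lookup_failed = False
except KeyError:
    direct_lookup_failed = True
print(direct_lookup_failed)
```

Inside the class, the same idea would amount to returning `tweet.get('codes', set())` from `getCodes`.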

Now that we have this set up, we can reload the tweets from the file and reload the subset.

In [10]:
tweets =Tweets()
tweets.readTweets("tweets-smoking.json")
ids=list(tweets.getIds())
ids.sort()
working=[]

for i in range(20):
    id = ids[i]
    working.append(id)

td = working[0]
t = tweets.getTweet(td)
tweets.getText(td)

'me: smoking weed hasn’t affected me at all\n\nsomeone: count to 10\n\nme: https://t.co/SUoGzARpom'

Above we noted that this tweet was interesting because:
1. It contains a link.
2. It contains a mention of marijuana: 'weed'.

So, we will add codes to the appropriate tweet as needed:

In [11]:
tweets.addCode(td,"LINK")
tweets.addCode(td,"MARIJUANA")

We can also confirm that this tweet is associated with the desired codes:

In [12]:
tweets.getCodes(td)

{'LINK', 'MARIJUANA'}

Good. Let's look at the next tweet. 

In [13]:
td = working[1]
tweets.getText(td)

'@SeanTighe123 Gomez what you been smoking mate 😂'

This tweet contains a user mention.

In [14]:
tweets.addCodes(td,['USERMENTION'])

ok.. moving on to the third tweet..

In [15]:
td = working[2]
tweets.getText(td)

'FDA begins anti-smoking push to cut nicotine in cigarettes KKTV https://t.co/fneRanLQv8'

In [16]:
tweets.addCodes(td,['LINK','ANTI-SMOKING'])

next...

In [17]:
td = working[3]
tweets.getText(td)

'Welcome to Eternity...Smoking or non smoking ? https://t.co/W9nAU7GEaQ'

This doesn't seem to be terribly relevant, so we add a tag to that effect:

In [18]:
tweets.addCodes(td,['LINK','IRRELEVANT'])

In [19]:
td = working[4]
tweets.getText(td)

'Made a sandwich 10 min ago and been looking for it ever since then\U0001f926🏾\u200d♂️ I gotta stop smoking😂 https://t.co/NCbNOyvZXe'

This retweet includes a link, emoji, and a mention of stopping smoking.

In [20]:
tweets.addCodes(td,['STOPPING','LINK','EMOTICON'])

In [21]:
td = working[5]
tweets.getText(td)

'Made a sandwich 10 min ago and been looking for it ever since then\U0001f926🏾\u200d♂️ I gotta stop smoking😂 https://t.co/NCbNOyvZXe'

Hmm. That looks like the previous tweet. Are the IDs the same?

In [22]:
print(working[4])
print(working[5])

974316601796452352
974316602266128384


Nope. In this case, they're probably retweets of the same tweet. 

In [23]:
t4=tweets.getTweet(working[4])
t5=tweets.getTweet(working[5])
t4orig = t4['retweeted_status']['id_str']
t5orig = t5['retweeted_status']['id_str']
t4orig==t5orig

True

Yes, they are retweets of the same original, so we'll reuse the same annotations.

In [24]:
tweets.addCodes(td,['STOPPING','LINK','EMOTICON'])
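
The retweet check above can be generalized so that retweets of the same original are found in one pass. This is a minimal sketch on plain dictionaries that assumes only the `id_str` and `retweeted_status` fields used above; the function names and sample tweets are hypothetical.

```python
from collections import defaultdict

def original_id(tweet):
    """Return the ID of the retweeted original, or the tweet's own ID."""
    if 'retweeted_status' in tweet:
        return tweet['retweeted_status']['id_str']
    return tweet['id_str']

def group_by_original(tweet_list):
    """Map each original tweet ID to the IDs of the tweets that share it."""
    groups = defaultdict(list)
    for tweet in tweet_list:
        groups[original_id(tweet)].append(tweet['id_str'])
    return dict(groups)

# Made-up tweets: two retweets of tweet '100' and one original tweet.
sample = [
    {'id_str': '101', 'retweeted_status': {'id_str': '100'}},
    {'id_str': '102', 'retweeted_status': {'id_str': '100'}},
    {'id_str': '103'},
]
groups = group_by_original(sample)
print(groups)
```

Each group with more than one member is a candidate for sharing a single set of codes.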

In [25]:
td = working[6]
tweets.getText(td)

'Made a sandwich 10 min ago and been looking for it ever since then\U0001f926🏾\u200d♂️ I gotta stop smoking😂 https://t.co/NCbNOyvZXe'

once again..

In [26]:
tweets.addCodes(td,['STOPPING','LINK','EMOTICON'])

In [27]:
td = working[7]
tweets.getText(td)

'After 10 years of smoking cigarettes I can finally say I stopped for good.. had 1 stroke, nose bleeding, can’t sleep at nights, lost my senses and my mind is dead but no more bad habits Alhamdulillah and if anyone witnesses me smoking again feel free to slap me 😄'

A retweet with a comment about stopping smoking.

In [28]:
tweets.addCodes(td,['STOPPING'])

In [29]:
td = working[8]
tweets.getText(td)

'I love smoking weed in beautiful ass places, looking at beautiful ass things.'

In [30]:
tweets.addCodes(td,['MARIJUANA'])

In [31]:
td = working[9]
tweets.getText(td)

'Should smoking be banned in movies? Peterborough Public Health officials are in favour:\nhttps://t.co/2uEZPG3QF1 #Ptbo #Peterborough #smoking #smokinginmovies'

In [32]:
tweets.addCodes(td,['LINK,','HASHTAG','ANIT-SMOKING'])

In [33]:
td = working[10]
tweets.getText(td)

'Are e-cigarettes leading young people to take up smoking? A new study says yes https://t.co/3Hv17tnER5'

Notice that this tweet is about vaping (e-cigarettes) even though it was found by the search term 'smoking'.

In [34]:
tweets.addCodes(td,['LINK','VAPING'])

Now that we've gone through several tweets, we can review the codes used.

In [35]:
for i in range(11):
    td=working[i]
    print(tweets.getCodes(td))

{'LINK', 'MARIJUANA'}
{'USERMENTION'}
{'LINK', 'ANTI-SMOKING'}
{'LINK', 'IRRELEVANT'}
{'LINK', 'EMOTICON', 'STOPPING'}
{'LINK', 'EMOTICON', 'STOPPING'}
{'LINK', 'EMOTICON', 'STOPPING'}
{'STOPPING'}
{'MARIJUANA'}
{'HASHTAG', 'ANIT-SMOKING', 'LINK,'}
{'LINK', 'VAPING'}
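
The second-to-last line of this output contains `'LINK,'` and `'ANIT-SMOKING'`, which look like mistyped versions of `'LINK'` and `'ANTI-SMOKING'`. Free-text codes make slips like this easy. One possible safeguard, sketched here under the assumption that we keep a running codebook of accepted codes, is to flag any code that is not in the codebook and suggest the closest known entry using the standard library's `difflib`. The `CODEBOOK` contents and the `check_codes` helper are hypothetical.

```python
import difflib

# Assumed codebook: the codes used so far, spelled as intended.
CODEBOOK = {'LINK', 'MARIJUANA', 'USERMENTION', 'ANTI-SMOKING', 'IRRELEVANT',
            'STOPPING', 'EMOTICON', 'HASHTAG', 'VAPING'}

def check_codes(codes, codebook=CODEBOOK):
    """Map each code missing from the codebook to its closest known match.

    An empty list means no similar code was found, so the code may simply
    be new, which is expected in open coding.
    """
    problems = {}
    for code in codes:
        if code not in codebook:
            problems[code] = difflib.get_close_matches(code, sorted(codebook), n=1)
    return problems

suspect = check_codes(['LINK,', 'HASHTAG', 'ANIT-SMOKING'])
print(suspect)
```

Because open coding allows new codes at any time, a check like this is better used to prompt a review than to reject unknown codes outright.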


Having annotated several tweets, we might want to save the annotations in a file for future use. Fortunately, the approach that we've used in our save and reload code is flexible enough to handle this without any further changes to the implementation. 

How does this work? The `Tweets` class stores all of the information about the tweets in a simple dictionary. Each entry holds the tweet itself along with its count, search time, and search term; the codes are stored inside the tweet object. When we go to save the set of tweets, we simply turn this dictionary into JSON and then write it to a file. To read things in, we just read the JSON from the file and convert the result back into a dictionary. Thus, anything that we add to the dictionary will automatically be written out and read back in.  We still need additional routines to access this data (like `addCode`, `addCodes`, and `getCodes`), but we don't need to change the save/load routines.  Let's try it out.


In [36]:
tweets.saveTweets("tweets-smoking-annotated.json")

In [37]:
tweets2=Tweets()
tweets2.readTweets("tweets-smoking-annotated.json")

In [38]:
print(tweets.getText(td))
print(tweets.getCodes(td))
print(tweets2.getText(td))
print(tweets2.getCodes(td))

Are e-cigarettes leading young people to take up smoking? A new study says yes https://t.co/3Hv17tnER5
{'LINK', 'VAPING'}
Are e-cigarettes leading young people to take up smoking? A new study says yes https://t.co/3Hv17tnER5
{'LINK', 'VAPING'}
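
The codes survive the round trip because `jsonpickle` knows how to encode Python sets, which the standard `json` module rejects. The sketch below uses only the standard library to show the failure and one way a set could be round-tripped by hand. The `__set__` marker and both helper functions are invented for this illustration and are not how `jsonpickle` itself represents sets.

```python
import json

codes = {'LINK', 'VAPING'}

# Plain json cannot serialize a set.
try:
    json.dumps({'codes': codes})
    plain_json_ok = True
except TypeError:
    plain_json_ok = False

def encode_sets(obj):
    """Fallback encoder: wrap a set in a dict with a marker key."""
    if isinstance(obj, set):
        return {'__set__': sorted(obj)}
    raise TypeError(f'Cannot serialize {type(obj).__name__}')

def decode_sets(d):
    """Object hook: turn marked dicts back into sets."""
    if '__set__' in d:
        return set(d['__set__'])
    return d

text = json.dumps({'codes': codes}, default=encode_sets)
restored = json.loads(text, object_hook=decode_sets)

print(plain_json_ok)
print(text)
print(restored == {'codes': codes})
```

Using `jsonpickle` spares us from writing and maintaining this kind of conversion ourselves.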


****
# Exercise 2.1: Examining the distribution of codes across a corpus

Having annotated a number of tweets, you might want to get an idea of which codes are being used and how often. Write a new method inside the `Tweets` class that will provide a list of codes and counts, sorted by frequency.  Please be sure to reload the tweets after you redefine the class.

*ANSWER FOLLOWS - cut below here*
Following lines to be deleted when provided for student use

In [39]:
class Tweets:
    
    
    def __init__(self,term="",corpus_size=100):
        self.tweets={}
        if term !="":
            self.searchTwitter(term,corpus_size)
                
    def searchTwitter(self,term,corpus_size):
        searchTime=datetime.now()
        while (self.countTweets() < corpus_size):
            new_tweets = api.search(term,lang="en",tweet_mode='extended',count=10)
            for nt_json in new_tweets:
                nt = nt_json._json
                if self.getTweet(nt['id_str']) is None and self.countTweets() < corpus_size:
                    self.addTweet(nt,searchTime,term)
            time.sleep(5)
                
    def addTweet(self,tweet,searchTime,term="",count=0):
        id = tweet['id_str']
        if id not in self.tweets.keys():
            self.tweets[id]={}
            self.tweets[id]['tweet']=tweet
            self.tweets[id]['count']=0
            self.tweets[id]['searchTime']=searchTime
            self.tweets[id]['searchTerm']=term
        self.tweets[id]['count'] = self.tweets[id]['count'] +1
        
    def getTweet(self,id):
        if id in self.tweets:
            return self.tweets[id]['tweet']
        else:
            return None
    
    def getTweetCount(self,id):
        return self.tweets[id]['count']
    
    def countTweets(self):
        return len(self.tweets)
    
    # return a sorted list of tuples of the form (id,count), with the occurrence counts sorted in decreasing order
    def mostFrequent(self):
        ps = []
        for t,entry in self.tweets.items():
            count = entry['count']
            ps.append((t,count))  
        ps.sort(key=lambda x: x[1],reverse=True)
        return ps
    
    # returns tweet IDs as a set
    def getIds(self):
        return set(self.tweets.keys())
    
    # save the tweets to a file
    def saveTweets(self,filename):
        json_data =jsonpickle.encode(self.tweets)
        with open(filename,'w') as f:
            json.dump(json_data,f)
    
    # read the tweets from a file 
    def readTweets(self,filename):
        with open(filename,'r') as f:
            json_data = json.load(f)
            incontents = jsonpickle.decode(json_data)   
            self.tweets=incontents
        
    def getSearchTerm(self,id):
        return self.tweets[id]['searchTerm']
    
    def getSearchTime(self,id):
        return self.tweets[id]['searchTime']
    
    def getText(self,id):
        tweet = self.getTweet(id)
        text=tweet['full_text']
        if 'retweeted_status'in tweet:
            original = tweet['retweeted_status']
            text=original['full_text']
        return text
                
    def addCode(self,id,code):
        tweet=self.getTweet(id)
        if 'codes' not in tweet:
            tweet['codes']=set()
        tweet['codes'].add(code)
        
   
    def addCodes(self,id,codes):
        for code in codes:
            self.addCode(id,code)
        
 
    def getCodes(self,id):
        tweet=self.getTweet(id)
        return tweet['codes']
    
    # NEW ROUTINE - summarize how many tweets carry each code
    def getCodeProfile(self):
        summary={}
        for id in self.tweets.keys():
            tweet=self.getTweet(id)
            if 'codes' in tweet:
                for code in tweet['codes']:
                    if code not in summary:
                        summary[code]=0
                    summary[code]=summary[code]+1
        # itemgetter(0) sorts by code name (reverse alphabetical);
        # use itemgetter(1) to sort by count instead
        sortedsummary = sorted(summary.items(),key=operator.itemgetter(0),reverse=True)
        return sortedsummary

In [40]:
tweets=Tweets()
tweets.readTweets("tweets-smoking-annotated.json")

In [41]:
tweets.getCodeProfile()

[('VAPING', 1),
 ('USERMENTION', 1),
 ('STOPPING', 4),
 ('MARIJUANA', 2),
 ('LINK,', 1),
 ('LINK', 7),
 ('IRRELEVANT', 1),
 ('HASHTAG', 1),
 ('EMOTICON', 3),
 ('ANTI-SMOKING', 1),
 ('ANIT-SMOKING', 1)]
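
Note that `getCodeProfile` sorts on `operator.itemgetter(0)`, the code name, so the list above is in reverse alphabetical order rather than ordered by count. A frequency-ordered variant might look like the following sketch, which works on a plain dictionary shaped like `Tweets.tweets`. The function name and the sample data are hypothetical.

```python
from collections import Counter

def code_profile_by_frequency(entries):
    """Return (code, count) pairs, most frequent first, ties broken by name."""
    counts = Counter()
    for entry in entries.values():
        # Tweets without codes contribute nothing.
        counts.update(entry['tweet'].get('codes', ()))
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

# Made-up entries mimicking the structure of Tweets.tweets.
sample_entries = {
    '1': {'tweet': {'codes': {'LINK', 'MARIJUANA'}}},
    '2': {'tweet': {'codes': {'LINK', 'STOPPING'}}},
    '3': {'tweet': {'codes': {'LINK', 'STOPPING'}}},
    '4': {'tweet': {}},
}
profile = code_profile_by_frequency(sample_entries)
print(profile)
```

Breaking ties by name keeps the output stable from run to run, which makes profiles easier to compare.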

*END CUT HERE*
****

# Exercise 2.2: Code the Next 10 tweets in the set. 
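Start with the tags used above, adding your own as needed.  Code up to and including the tweet with index 20 in the sorted `ids` list. Note that `working` currently holds only indices 0 through 19, so you will need to rebuild or extend it. Examine the code profile and save your tweets to a new file when you are done. 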

*ANSWER FOLLOWS - cut below here*
Following lines to be deleted when provided for student use

In [50]:
ids=list(tweets.getIds())
ids.sort()
working=[]

for i in range(11,21):
    id = ids[i]
    working.append(id)

td = working[0]
tweets.getText(td)

'DYK... Research shows that a lack of #SocialConnections has the same negative health impacts as smoking 15 cigarettes per day!  #haveTHATtalk #MentalHealthMatters \n https://t.co/Jw1wUVg7ST https://t.co/w7xpfCUOpH'

In [51]:
tweets.addCodes(td,['LINK','USERMENTION','ANTI-SMOKING'])

In [52]:
td = working[1]
tweets.getText(td)

'DYK... Research shows that a lack of #SocialConnections has the same negative health impacts as smoking 15 cigarettes per day!  #haveTHATtalk #MentalHealthMatters \n https://t.co/Jw1wUVg7ST https://t.co/w7xpfCUOpH'

In [53]:
tweets.addCodes(td,['LINK','USERMENTION','ANTI-SMOKING'])

In [54]:
td = working[2]
tweets.getText(td)

'#Smoking affects multiple parts of our body. Know more: https://t.co/hwTeRdC9Hf \n#SwasthaBharat #NHPIndia #mCessation #QuitSmoking https://t.co/x7xHO9G2Cr'

In [55]:
tweets.addCodes(td,['LINK','HASHTAG','ANTI-SMOKING'])

In [56]:
td = working[3]
tweets.getText(td)

'really tryna stop smoking \U0001f926🏻\u200d♀️'

In [57]:
tweets.addCodes(td,['STOPPING','EMOTICON'])

In [58]:
td = working[4]
tweets.getText(td)

'I’m crying what kind stuff y’all smoking 😂😭 https://t.co/hkRB0z9i6E'

In [59]:
tweets.addCodes(td,['EMOTICON','LINK'])

In [60]:
td = working[5]
tweets.getText(td)

'me: smoking weed hasn’t affected me at all\n\nsomeone: count to 10\n\nme: https://t.co/SUoGzARpom'

In [61]:
tweets.addCodes(td,['LINK','MARIJUANA'])

In [62]:
td = working[6]
tweets.getText(td)

'Made a sandwich 10 min ago and been looking for it ever since then\U0001f926🏾\u200d♂️ I gotta stop smoking😂 https://t.co/NCbNOyvZXe'

In [63]:
tweets.addCodes(td,['LINK','STOPPING','EMOTICON'])

In [64]:
td = working[7]
tweets.getText(td)

'Made a sandwich 10 min ago and been looking for it ever since then\U0001f926🏾\u200d♂️ I gotta stop smoking😂 https://t.co/NCbNOyvZXe'

In [65]:
tweets.addCodes(td,['LINK','STOPPING','EMOTICON'])

In [66]:
td = working[8]
tweets.getText(td)

'#UNFAO is scaling up efforts on reducing the amount of #wood used as #fuel for #fish smoking in the #Gambia. With the new #UNFAO Thiaroye Technology stove, #women in the #fishsmoking &amp; drying #industry have improved access to #technology &amp; #livelihood. #ZeroHunger \n@FAOWestAfrica https://t.co/ifT8KSRo3O'

In [67]:
tweets.addCodes(td,['LINK','IRRE'])

In [69]:
td = working[9]
t=tweets.getTweet(td)
tweets.getText(td)

'DYK... Research shows that a lack of #SocialConnections has the same negative health impacts as smoking 15 cigarettes per day!  #haveTHATtalk #MentalHealthMatters \n https://t.co/Jw1wUVg7ST https://t.co/w7xpfCUOpH'

In [70]:
tweets.addCodes(td,['RETWEET','USERMENTION','IRRELEVANT'])

In [71]:
tweets.getCodeProfile()

[('VAPING', 1),
 ('USERMENTION', 4),
 ('STOPPING', 7),
 ('RETWEET', 1),
 ('MARIJUANA', 3),
 ('LINK,', 1),
 ('LINK', 15),
 ('IRRELEVANT', 2),
 ('IRRE', 1),
 ('HASHTAG', 2),
 ('EMOTICON', 7),
 ('ANTI-SMOKING', 4),
 ('ANIT-SMOKING', 1)]

In [72]:
tweets.saveTweets("tweets-smoking-annotated.json")

*END CUT*

---

# EXERCISE 2.3: Reflection on coding

Open coding can often be an iterative process. When we first start out, we don't really know what we're looking for. As a result, the first few items annotated might only get a few codes, and we might miss ideas that we don't initially think are important. As we see more and more items, our ideas of what needs to be annotated will change, and we'll start adding in codes that might also apply to earlier messages. Thus, we often need to review and re-annotate earlier tweets to account for changes in our interpretations.

Review your annotations of the tweets that you coded. Revise the codes associated with these tweets, adding items from the overall list of codes as appropriate. Describe the changes that you have made.


---
*ANSWER FOLLOWS - cut below here*
Following lines to be deleted when provided for student use


*END CUT*

---

# EXERCISE 2.4: Reflection on storage/serialization

In working with this small set of 100 tweets, we are taking a very simple approach to storage and management of the tweets and annotations. Storing everything in a nested Python dictionary and then dumping it to disk as JSON text can be very appealing. What are the strengths and weaknesses of this approach, and how might these strengths and weaknesses differ with larger datasets containing 100,000 or 100 million tweets? What alternative strategies might you use for larger datasets?

---
*ANSWER FOLLOWS - cut below here*
Following lines to be deleted when provided for student use

*Advantages*: JSON is easy to read, as programmers can open the text file and read contents directly. JSON is also easy to work with from multiple programming languages and platforms, allowing a collection of tweets written in Python to be read by code written in other languages.  Finally, the JSON structure is easily adaptable, allowing fields to be easily added or removed as needed.

*Disadvantages*: As a text-based format, JSON data is pretty much "all or nothing". Although it might be possible to read in part of the JSON structure, this will complicate code significantly.  Thus, most programs would read the entire JSON structure into memory all at once. This is fine for small datasets, but might get slow and bulky, requiring lots of RAM, for larger datasets. 

Very large sets of tweets might be stored in a database. A relational database might be created to store tweets in one or more tables, providing the power of the structured query language (SQL) to retrieve only those tweets matching specified criteria.  SQL could also be used to quickly and easily calculate aggregate statistics, without loading all of the tweets into RAM. Downsides of this approach include the need to manage a distinct software component (the database server), the difficulty in changing the contents of the database, the complexity of SQL queries, and the relatively complex and often language-specific tools needed to manage the connections to the database. 
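
As a rough illustration of the relational approach, the sketch below uses Python's built-in `sqlite3` module with an in-memory database, so no separate server is needed for this small demonstration. The table names, column names, and sample rows are all hypothetical. It computes a code profile with a single SQL query rather than by looping over every tweet in Python.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE tweets (id TEXT PRIMARY KEY, full_text TEXT)')
conn.execute('CREATE TABLE codes (tweet_id TEXT REFERENCES tweets(id), code TEXT)')

# Made-up tweets and codes for illustration.
sample = {
    '1': ('first made-up tweet https://example.com', ['LINK', 'STOPPING']),
    '2': ('second made-up tweet', ['STOPPING']),
    '3': ('third made-up tweet https://example.com', ['LINK', 'STOPPING']),
}
for tweet_id, (text, tweet_codes) in sample.items():
    conn.execute('INSERT INTO tweets VALUES (?, ?)', (tweet_id, text))
    conn.executemany('INSERT INTO codes VALUES (?, ?)',
                     [(tweet_id, code) for code in tweet_codes])

# Aggregate in the database: count tweets per code, most frequent first.
profile = conn.execute(
    'SELECT code, COUNT(*) AS n FROM codes GROUP BY code ORDER BY n DESC, code'
).fetchall()
print(profile)

# Retrieve only the tweets that match a criterion.
link_ids = [row[0] for row in conn.execute(
    'SELECT tweet_id FROM codes WHERE code = ? ORDER BY tweet_id', ('LINK',))]
print(link_ids)
conn.close()
```

Keeping codes in their own table, one row per tweet and code, is one possible design; it lets both queries run without loading any tweet text into memory.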

Alternatively, NoSQL databases that work very well with JSON might be considered. These databases may share some of the challenges associated with relational databases, but they are often also more flexible.


*END CUT*

---

# 2.2 Final Notes

[Part 3](SocialMedia - Part 3.ipynb) will explore the application of Natural Language Processing (NLP) techniques to tweet data. As part of this exploration, we will create and save a set of tweets based on a different search term - "vaping". Eventually, we'll apply machine learning to see if we can classify tweets as being associated with the "smoking" or "vaping" searches.

In [5]:
vapeTweets = Tweets()
vapeTweets.searchTwitter("vaping",100)
vapeTweets.saveTweets("tweets-vaping.json")