### *Goal*: Use social media posts to explore the application of text and natural language processing to see what might be learned from online interactions.

Specifically, we will retrieve, annotate, process, and interpret Twitter data on health-related issues such as depression.

--- 
References:
* [Mining Twitter Data with Python (Part 1: Collecting data)](https://marcobonzanini.com/2015/03/02/mining-twitter-data-with-python-part-1/)
* The [Tweepy Python API for Twitter](http://www.tweepy.org/)

Required Software
* [Python 3](https://www.python.org)
* [NumPy](http://www.numpy.org) - for preparing data for plotting
* [Matplotlib](https://matplotlib.org) - for plots and graphs
* [jsonpickle](https://jsonpickle.github.io) - for storing tweets
---

In [43]:
%matplotlib inline

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import jsonpickle
import json
import random
import tweepy
import time
import operator

# Introduction

This module continues the Social Media Data Science module started in [Part 1](SocialMedia - Part 1.ipynb), covering the annotation of tweets. These lessons will continue in [Part 3](SocialMedia - Part 3.ipynb) as we move on to the use of natural language processing to analyze the tweets.
  
Our case study will apply these topics to Twitter discussions of smoking and vaping. Although details of the tools used to access data and the format and content of the data may differ for various services, the strategies and procedures used to analyze the data will generalize to other tools.

## 0. Setup

Before we dig in, we must grab a bit of code from [Part 1](SocialMedia - Part 1.ipynb):

1. The Tweets class used to store the tweets.
2. The `searchTwitter` routine for grabbing tweets.
3. Our Twitter API keys - be sure to copy the keys that you generated when you completed [Part 1](SocialMedia - Part 1.ipynb).
4. Configuration of our Twitter connection

In [193]:
class Tweets:
    
    
    def __init__(self,ts=None):
        self.tweets={}
        if ts is not None:
            for tweet in ts:
                self.addTweet(tweet)
    
    def addTweet(self,tweet,count = 0):
        id = tweet['id']

        # first time we have seen this tweet: create its entry
        if id not in self.tweets.keys():
            self.tweets[id]={}
            self.tweets[id]['tweet']=tweet
            self.tweets[id]['count']=0
        # if a count is not provided in the call, increment the count;
        # otherwise, use the count that was provided
        if count == 0:
            self.tweets[id]['count'] = self.tweets[id]['count'] +1
        else:
            self.tweets[id]['count'] = count
        
    def getTweet(self,id):
        if id in self.tweets:
            return self.tweets[id]['tweet']
        else:
            return None
    
    def getTweetCount(self,id):
        return self.tweets[id]['count']
    
    def countTweets(self):
        return len(self.tweets)
    
    # return a sorted list of tuples of the form (id,count), with the occurrence counts sorted in decreasing order
    def mostFrequent(self):
        ps = []
        for t,entry in self.tweets.items():
            count = entry['count']
            ps.append((t,count))  
        ps.sort(key=lambda x: x[1],reverse=True)
        return ps
    
    # returns the ids of all stored tweets
    def getIds(self):
        return self.tweets.keys()
    
    # save the tweets to a file
    def saveTweets(self,filename):
        json_data =jsonpickle.encode(self.tweets)
        with open(filename,'w') as f:
            json.dump(json_data,f)
    
    # read the tweets from a file 
    def readTweets(self,filename):
        self.tweets={}
        with open(filename,'r') as f:
            json_data = json.load(f)
            intweets = jsonpickle.decode(json_data)   
            
            # we now have a dict of the tweets by id
            # add them in
            for id in intweets.keys():
                count = intweets[id]['count']
                tweet = intweets[id]['tweet']
                self.addTweet(tweet,count)
                

In [44]:
def searchTwitter(term,corpus_size):
    tweets={}
    while (len(tweets) < corpus_size):
        new_tweets = api.search(term,lang="en",count=10)
        for nt_json in new_tweets:
            nt = nt_json._json
            if nt['id_str'] not in tweets:
                new_entry={}
                new_entry['count']=0
                new_entry['tweet']=nt
                tweets[nt['id_str']]=new_entry
            tweets[nt['id_str']]['count'] = tweets[nt['id_str']]['count']+1
        # wait to give our twitter account a break..
        time.sleep(10)
    return tweets
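
The loop in `searchTwitter` stops only once enough distinct tweets have arrived, so a narrow search term could keep it running for a very long time. The following is a minimal sketch of one way to bound the loop. It assumes a hypothetical `fetch_batch` callable that stands in for the `api.search` call, and it omits the pause between rounds; neither `collect_unique` nor `fake_batch` is part of Part 1.

```python
def collect_unique(fetch_batch, corpus_size, max_rounds=50):
    """Collect entries keyed by id until corpus_size is reached or max_rounds have run."""
    collected = {}
    for _ in range(max_rounds):
        if len(collected) >= corpus_size:
            break
        for tweet in fetch_batch():
            key = tweet["id_str"]
            # create the entry on first sight, then count every occurrence
            entry = collected.setdefault(key, {"count": 0, "tweet": tweet})
            entry["count"] += 1
    return collected


# Hypothetical stand-in for a real search: always returns the same two tweets.
def fake_batch():
    return [{"id_str": "1", "text": "a"}, {"id_str": "2", "text": "b"}]


result = collect_unique(fake_batch, corpus_size=5, max_rounds=3)
print(len(result), result["1"]["count"])
```

Because the stand-in search never yields more than two distinct tweets, the loop gives up after `max_rounds` rather than waiting for a corpus size it cannot reach.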

Replace the placeholder values below with the keys that you generated in Part 1. Do not share your real keys or commit them to a public repository.

In [45]:
consumer_key='YOUR_CONSUMER_KEY'
consumer_secret='YOUR_CONSUMER_SECRET'
access_token='YOUR_ACCESS_TOKEN'
access_secret='YOUR_ACCESS_SECRET'

In [46]:
from tweepy import OAuthHandler

auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth)
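
One way to keep keys out of a shared notebook is to read them from environment variables. The sketch below is only an illustration: the variable names (`TWITTER_CONSUMER_KEY` and so on) and the `load_twitter_keys` helper are assumptions, not something defined in Part 1 or required by Tweepy.

```python
import os

# Hypothetical environment variable names; choose whatever suits your setup.
KEY_NAMES = [
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_SECRET",
]


def load_twitter_keys(env=os.environ):
    """Return the four keys from env, failing loudly if any are missing."""
    missing = [name for name in KEY_NAMES if not env.get(name)]
    if missing:
        raise RuntimeError("missing settings: " + ", ".join(missing))
    return {name: env[name] for name in KEY_NAMES}


# Demonstration with a plain dict standing in for the real environment.
example_env = {name: "dummy-" + name.lower() for name in KEY_NAMES}
keys = load_twitter_keys(example_env)
print(sorted(keys))
```

The returned values could then be passed to `OAuthHandler` and `set_access_token` in place of the hard-coded strings.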

## 1. Annotating Tweets

Now that we have a corpus of tweets, what do we want to do with them? Turning a relatively vague notion into a well-defined research question is often a significant challenge, as examination of the data often reveals both shortcomings and unforeseen opportunities.

In our case, we are interested in looking at tweets about smoking and vaping, but we're not quite sure exactly *what* we are looking for. We have a vague notion that we might learn something interesting, but understanding exactly what that is, and what sort of analyses we might need, will require a bit more work.

In situations such as this, we might look at some of the data to form some preliminary impressions of the content. Specifically, we can look at individual tweets, assigning them to one or more categories - known as *codes* - based on their content. We can add categories as needed to capture important ideas that we might want to refer back to. This practice - known as *open coding* - allows us to begin to make sense of unfamiliar data sets.

This sounds much more complicated than it is. For now, let's read some tweets in from a file collected in October 2017 using the procedures discussed in [Part 1](SocialMedia - Part 1.ipynb) and reproduced above. We'll then use those tweets to get to work.

In [197]:
tweets =Tweets()
tweets.readTweets("tweet-corpus.json")

We will begin by taking a look at a subset of 100 tweets. 

To get this list, we'll sort the ids of the tweets and take the first 100 in the list.

In [198]:
tweets.countTweets()

1001

In [199]:
ids=list(tweets.getIds())
ids.sort()
working=[]
for i in range(100):
    id = ids[i]
    working.append(id)
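
The same selection can be written more compactly with `sorted` and a slice. This is a small sketch using made-up ids; `first_n_sorted` is a hypothetical helper name.

```python
def first_n_sorted(ids, n):
    """Return the first n ids in sorted order (fewer if there are not n ids)."""
    return sorted(ids)[:n]


# Made-up ids for illustration only.
sample_ids = ["30", "10", "20", "40"]
working_sample = first_n_sorted(sample_ids, 2)
print(working_sample)
```

Unlike the indexed loop, the slice does not raise an `IndexError` when the corpus holds fewer tweets than requested.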

*working* now has 100 tweet ids. Let's start with the first.

In [200]:
td = working[0]
t = tweets.getTweet(td)
t['text']

'FlTNESS: RT DrugedPosts: "Wyd after smoking this?" https://t.co/OnLywTyJ0X'

This tweet has two interesting characteristics:
1. It is a retweet.
2. It contains a link.

We can model both of these points through relevant annotation. Specifically, we will add a new set of codes to each tweet object. This set will contain the categorical annotations associated with the tweet. See the modifications to the `Tweets` class in this new definition.

In [201]:
class Tweets:
    
    
    def __init__(self,ts=None):
        self.tweets={}
        if ts is not None:
            for tweet in ts:
                self.addTweet(tweet)
    
    def addTweet(self,tweet,count = 0,codes = None):
        id = tweet['id']

        # first time we have seen this tweet: create its entry
        if id not in self.tweets.keys():
            self.tweets[id]={}
            self.tweets[id]['tweet']=tweet
            self.tweets[id]['count']=0
        # if a count is not provided in the call, increment the count;
        # otherwise, use the count that was provided
        if count == 0:
            self.tweets[id]['count'] = self.tweets[id]['count'] +1
        else:
            self.tweets[id]['count'] = count
        # store codes inside the tweet object, where addCode and getCodes look for them
        if codes is not None:
            self.tweets[id]['tweet']['codes']=set(codes)
        
    def getTweet(self,id):
        if id in self.tweets:
            return self.tweets[id]['tweet']
        else:
            return None
    
    def getTweetCount(self,id):
        return self.tweets[id]['count']
    
    def countTweets(self):
        return len(self.tweets)
    
    # return a sorted list of tuples of the form (id,count), with the occurrence counts sorted in decreasing order
    def mostFrequent(self):
        ps = []
        for t,entry in self.tweets.items():
            count = entry['count']
            ps.append((t,count))  
        ps.sort(key=lambda x: x[1],reverse=True)
        return ps
    
    # returns the ids of all stored tweets
    def getIds(self):
        return self.tweets.keys()
    
    # save the tweets to a file
    def saveTweets(self,filename):
        json_data =jsonpickle.encode(self.tweets)
        with open(filename,'w') as f:
            json.dump(json_data,f)
    
    # read the tweets from a file 
    def readTweets(self,filename):
        self.tweets={}
        with open(filename,'r') as f:
            json_data = json.load(f)
            intweets = jsonpickle.decode(json_data)
            self.tweets = intweets
                
    ### NEW ROUTINE - add a code to a tweet
    def addCode(self,id,code):
        tweet=self.getTweet(id)
        if 'codes' not in tweet:
            tweet['codes']=set()
        tweet['codes'].add(code)
        
        
    ### NEW ROUTINE  - add multiple  codes for a tweet
    def addCodes(self,id,codes):
        for code in codes:
            self.addCode(id,code)
        
    ### NEW ROUTINE get codes for a tweet
    def getCodes(self,id):
        tweet=self.getTweet(id)
        return tweet['codes']

Now that we have this set up, we can reload the tweets from the file and reload the subset.

In [8]:
tweets =Tweets()
tweets.readTweets("tweet-corpus.json")
ids=list(tweets.getIds())
ids.sort()
working=[]

for i in range(100):
    id = ids[i]
    working.append(id)

td = working[0]
t = tweets.getTweet(td)
t['text']

'FlTNESS: RT DrugedPosts: "Wyd after smoking this?" https://t.co/OnLywTyJ0X'

Above we noted that this tweet was interesting because:
1. It is a retweet.
2. It contains a link.

So, we will add codes to the appropriate tweet as needed:

In [203]:
tweets.addCode(td,"RETWEET")
tweets.addCode(td,"LINK")

We can confirm that this is a retweet by checking for the `retweeted_status` attribute:

In [204]:
'retweeted_status' in t

False

Hmm. The attribute is not present. Perhaps the user copied the text and added 'RT' without actually retweeting? Something to keep our eyes on for other tweets.
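
To keep an eye on this distinction across many tweets, a small helper can separate native retweets from text that merely contains an 'RT' marker. This is a rough sketch: `retweet_evidence` and its return labels are hypothetical, and the regular expression is a heuristic that could match unrelated text.

```python
import re

# "RT" as a standalone token anywhere in the text (a heuristic, not a guarantee).
RT_MARKER = re.compile(r"\bRT\b")


def retweet_evidence(tweet):
    """Return 'native', 'manual', or None for a tweet dictionary."""
    if "retweeted_status" in tweet:
        return "native"   # the attribute is present on the tweet
    if RT_MARKER.search(tweet.get("text", "")):
        return "manual"   # 'RT' appears in the text, but the attribute is absent
    return None


manual = {"text": 'FlTNESS: RT DrugedPosts: "Wyd after smoking this?" https://t.co/OnLywTyJ0X'}
native = {"text": "RT @FootyMemes: This new anti-smoking ad is really powerful...", "retweeted_status": {}}
plain = {"text": "You ever wake up and wish you was still sleep?"}

print(retweet_evidence(manual), retweet_evidence(native), retweet_evidence(plain))
```

Depending on the research question, the two cases could share a single `RETWEET` code or be given separate codes.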

We can also confirm that this tweet is associated with the desired codes:

In [205]:
tweets.getCodes(td)

{'LINK', 'RETWEET'}

Good. Let's look at the next tweet. 

In [206]:
td = working[1]
t=tweets.getTweet(td)
t['text']

'RT DrugedPosts: "Wyd after smoking this?" https://t.co/PZ3YyYh8WB'

Notice this is similar, but not identical, to the previous tweet. This time, for simplicity, we will use the `addCodes` routine.

In [207]:
tweets.addCodes(td,["RETWEET","LINK"])

In [208]:
'retweeted_status' in t

False

ok.. moving on to the third tweet..

In [209]:
td = working[2]
t=tweets.getTweet(td)
t['text']

'RT @Anzers: #TheBetrayalPapers Video: Part II – In Plain Sight – A National Security Smoking Gun\nhttps://t.co/rpObdW9GcG'

This retweet includes a link, a hashtag reference, and a reference to a `Smoking gun`, suggesting that this is not really a tweet about tobacco, marijuana, or other smoking products. We'll label it `irrelevant`.

Note that this is a good example of a case where a single word - in this case `Smoking` - is not nearly as informative as the sequence `Smoking gun`. 
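
A short sketch makes the point concrete by extracting adjacent word pairs (bigrams) from an excerpt of the tweet text. The tokenizer here is a deliberately crude assumption (lowercase letters and apostrophes only), and `bigrams` is a hypothetical helper.

```python
import re


def bigrams(text):
    """Return adjacent word pairs from text, using a very simple tokenizer."""
    words = re.findall(r"[a-z']+", text.lower())
    return list(zip(words, words[1:]))


# Excerpt from the tweet above.
excerpt = "A National Security Smoking Gun"
pairs = bigrams(excerpt)
print(pairs)
```

The pair `('smoking', 'gun')` carries the idiomatic meaning that the single word `smoking` loses.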

In [210]:
tweets.addCodes(td,['RETWEET','LINK','USERMENTION','HASHTAG','IRRELEVANT'])

In [211]:
'retweeted_status' in t

True

next...

In [212]:
td = working[3]
t=tweets.getTweet(td)
t['text']

'RT @FootyMemes: This new anti-smoking ad is really powerful... https://t.co/pWHZDLIb7O'

Here, we have a retweet, a link, and something about anti-smoking.

In [213]:
tweets.addCodes(td,['RETWEET','LINK','ANTI-SMOKING'])

In [214]:
'retweeted_status' in t

True

In [215]:
td = working[4]
t=tweets.getTweet(td)
t['text']

'@ericschmidt @jwnichls Stop. You need to stop torturing me. No buddy nobody cares about "smoking." Stop.'

This tweet includes user mentions. It might or might not be relevant. 

In [216]:
tweets.addCodes(td,['USERMENTION','POSSIBLYRELEVANT'])

In [217]:
td = working[5]
t=tweets.getTweet(td)
t['text']

'RT @FurnyFootball: Stop smoking 😂 https://t.co/bY1ZvJy63Z'

A retweet with a user mention, an anti-smoking message, and a link.

In [218]:
tweets.addCodes(td,['RETWEET','USERMENTION','ANTI-SMOKING','LINK'])

In [219]:
td = working[6]
t=tweets.getTweet(td)
t['text']

'You ever wake up and wish you was still sleep? ... that’s me rn.'

This tweet doesn't seem to be about smoking.

In [220]:
tweets.addCode(td,'IRRELEVANT')

In [221]:
td = working[7]
t=tweets.getTweet(td)
t['text']

'"Resorted to...". Hahahaha..!  Way to be strong and brave unaided...!  Hahahaha...!  https://t.co/Lnd9N3zBCY via @YahooNews'

This tweet includes a user mention, and a link, but doesn't seem to be relevant to smoking

In [222]:
tweets.addCodes(td,['USERMENTION','LINK','IRRELEVANT'])

In [223]:
td = working[8]
t=tweets.getTweet(td)
t['text']

'RT @chocoo_loco: I just want my friends to stop smoking weed😂 https://t.co/LWI2HVofAf'

This is a retweet with a link, a user mention, and an expression of a desire that the user's friends stop smoking marijuana.

In [224]:
tweets.addCodes(td,['RETWEET','USERMENTION','LINK','ANTI-SMOKING','MARIJUANA','FRIENDS','SENTIMENT'])

In [225]:
for i in range(9):
    td=working[i]
    print(tweets.getCodes(td))

{'RETWEET', 'LINK'}
{'RETWEET', 'LINK'}
{'HASHTAG', 'IRRELEVANT', 'RETWEET', 'LINK', 'USERMENTION'}
{'ANTI-SMOKING', 'RETWEET', 'LINK'}
{'POSSIBLYRELEVANT', 'USERMENTION'}
{'ANTI-SMOKING', 'RETWEET', 'LINK', 'USERMENTION'}
{'IRRELEVANT'}
{'IRRELEVANT', 'LINK', 'USERMENTION'}
{'RETWEET', 'FRIENDS', 'SENTIMENT', 'USERMENTION', 'LINK', 'ANTI-SMOKING', 'MARIJUANA'}
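
Several of these codes (`LINK`, `USERMENTION`, `HASHTAG`) describe the structure of the tweet rather than its meaning, so they could be suggested automatically and then checked by hand. The sketch below works from the text alone with simple regular expressions; `structural_codes` is a hypothetical helper and the patterns are heuristics, not a complete parser.

```python
import re


def structural_codes(text):
    """Suggest structural codes based on simple patterns in the tweet text."""
    codes = set()
    if re.search(r"https?://\S+", text):
        codes.add("LINK")
    if re.search(r"(?<!\w)@\w+", text):
        codes.add("USERMENTION")
    if re.search(r"(?<!\w)#\w+", text):
        codes.add("HASHTAG")
    return codes


third = "RT @Anzers: #TheBetrayalPapers Video https://t.co/rpObdW9GcG"
seventh = "You ever wake up and wish you was still sleep?"
print(structural_codes(third))
print(structural_codes(seventh))
```

Codes that depend on interpretation, such as `ANTI-SMOKING` or `IRRELEVANT`, still need a human reader.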


Having annotated several tweets, we might want to save the annotations in a file for future use. Fortunately, the approach that we've used in our save and reload code is flexible enough to handle this without any further changes to the implementation.

How does this work? The `Tweets` class stores all of the information about the tweets in a simple dictionary. Each entry holds the tweet object and its count, and the codes are stored inside the tweet object. When we go to save the set of tweets, we simply turn this dictionary into JSON and then write it to a file. To read things in, we just read the JSON from the file and convert the result back into a dictionary. Thus, anything that we add to the dictionary will automatically be written out and read back in. We still need additional routines to access this data (like `addCode`, `addCodes`, and `getCodes`), but we don't need to change the save/load routines. Let's try it out.


In [226]:
tweets.saveTweets("tweet-corpus-annotated.json")

In [227]:
tweets2=Tweets()
tweets2.readTweets("tweet-corpus-annotated.json")

In [228]:
print(tweets2.getTweet(td)['text'])
tweets2.getCodes(td)

RT @chocoo_loco: I just want my friends to stop smoking weed😂 https://t.co/LWI2HVofAf


{'ANTI-SMOKING',
 'FRIENDS',
 'LINK',
 'MARIJUANA',
 'RETWEET',
 'SENTIMENT',
 'USERMENTION'}
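
The round trip above relies on jsonpickle to handle Python sets, which plain JSON cannot represent. As a rough illustration of the same idea using only the standard `json` module (a swapped-in technique, not what the `Tweets` class does), sets can be written as lists and converted back on load. The helper names and the sample entry are hypothetical.

```python
import json
import os
import tempfile


def save_entries(entries, filename):
    """Write a dict of tweet entries as JSON, storing code sets as sorted lists."""
    def encode(obj):
        if isinstance(obj, set):
            return sorted(obj)
        raise TypeError("cannot serialise " + type(obj).__name__)
    with open(filename, "w") as f:
        json.dump(entries, f, default=encode)


def load_entries(filename):
    """Read entries back, turning each list of codes into a set again."""
    with open(filename) as f:
        entries = json.load(f)
    for entry in entries.values():
        tweet = entry["tweet"]
        if "codes" in tweet:
            tweet["codes"] = set(tweet["codes"])
    return entries


entries = {"1": {"count": 1,
                 "tweet": {"id": 1, "text": "Stop smoking",
                           "codes": {"ANTI-SMOKING", "LINK"}}}}
path = os.path.join(tempfile.mkdtemp(), "sketch.json")
save_entries(entries, path)
restored = load_entries(path)
print(restored["1"]["tweet"]["codes"])
```

Either way, new fields added to an entry travel along without any change to the save and load code.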

****
## Exercise 2.1: Examining the distribution of codes across a corpus

Having annotated a number of tweets, you might want to get an idea of which codes are being used and how often. Write a new method inside the `Tweets` class that will provide a list of codes and counts, sorted by frequency. Please be sure to reload the tweets after you redefine the class.

*ANSWER FOLLOWS - cut below here*
Following lines to be deleted when provided for student use

In [4]:
class Tweets:
    
    
    def __init__(self,ts=None):
        self.tweets={}
        if ts is not None:
            for tweet in ts:
                self.addTweet(tweet)
    
    def addTweet(self,tweet,count = 0,codes = None):
        id = tweet['id']

        # first time we have seen this tweet: create its entry
        if id not in self.tweets.keys():
            self.tweets[id]={}
            self.tweets[id]['tweet']=tweet
            self.tweets[id]['count']=0
        # if a count is not provided in the call, increment the count;
        # otherwise, use the count that was provided
        if count == 0:
            self.tweets[id]['count'] = self.tweets[id]['count'] +1
        else:
            self.tweets[id]['count'] = count
        # store codes inside the tweet object, where addCode and getCodes look for them
        if codes is not None:
            self.tweets[id]['tweet']['codes']=set(codes)
        
    def getTweet(self,id):
        if id in self.tweets:
            return self.tweets[id]['tweet']
        else:
            return None
    
    def getTweetCount(self,id):
        return self.tweets[id]['count']
    
    def countTweets(self):
        return len(self.tweets)
    
    # return a sorted list of tuples of the form (id,count), with the occurrence counts sorted in decreasing order
    def mostFrequent(self):
        ps = []
        for t,entry in self.tweets.items():
            count = entry['count']
            ps.append((t,count))  
        ps.sort(key=lambda x: x[1],reverse=True)
        return ps
    
    # returns the ids of all stored tweets
    def getIds(self):
        return self.tweets.keys()
    
    # save the tweets to a file
    def saveTweets(self,filename):
        json_data =jsonpickle.encode(self.tweets)
        with open(filename,'w') as f:
            json.dump(json_data,f)
    
    # read the tweets from a file 
    def readTweets(self,filename):
        self.tweets={}
        with open(filename,'r') as f:
            json_data = json.load(f)
            intweets = jsonpickle.decode(json_data)
            self.tweets = intweets

    def addCode(self,id,code):
        tweet=self.getTweet(id)
        if 'codes' not in tweet:
            tweet['codes']=set()
        tweet['codes'].add(code)
            
    def addCodes(self,id,codes):
        for code in codes:
            self.addCode(id,code)
        
    def getCodes(self,id):
        tweet=self.getTweet(id)
        return tweet['codes']
    
    # NEW - routine to get the code profile: a list of (code,count) pairs,
    # sorted by decreasing count, with ties broken alphabetically by code
    def getCodeProfile(self):
        summary={}
        for id in self.tweets.keys():
            tweet=self.getTweet(id)
            if 'codes' in tweet:
                for code in tweet['codes']:
                    if code not in summary:
                        summary[code] =0
                    summary[code]=summary[code]+1
        sortedsummary = sorted(summary.items(),key=lambda x: (-x[1],x[0]))
        return sortedsummary

In [5]:
tweets=Tweets()
tweets.readTweets("tweet-corpus-annotated.json")

In [6]:
tweets.getCodeProfile()

[('LINK', 7),
 ('RETWEET', 6),
 ('USERMENTION', 5),
 ('ANTI-SMOKING', 3),
 ('IRRELEVANT', 3),
 ('FRIENDS', 1),
 ('HASHTAG', 1),
 ('MARIJUANA', 1),
 ('POSSIBLYRELEVANT', 1),
 ('SENTIMENT', 1)]
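
For comparison, the standard library's `collections.Counter` can do the counting and the frequency sort. This is a sketch over a made-up list of code sets rather than the `Tweets` class itself; `code_profile` is a hypothetical function name.

```python
from collections import Counter


def code_profile(code_sets):
    """Count how many tweets carry each code, most frequent first."""
    counts = Counter()
    for codes in code_sets:
        counts.update(codes)  # each code in a set is counted once per tweet
    return counts.most_common()


# Made-up annotations for three tweets.
sample = [{"RETWEET", "LINK"}, {"RETWEET"}, {"RETWEET", "LINK", "HASHTAG"}]
profile = code_profile(sample)
print(profile)
```

Inside the class, the same idea would iterate over the `codes` set of each stored tweet.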

*END CUT HERE*
****

****

## Exercise 2.2: Code the next 12 tweets in the set
Start with the codes used above, adding your own as needed. Code the tweets from index 9 up to and including index 20 in the `working` array. Examine the code profile and save your tweets to a new file when you are done.

*ANSWER FOLLOWS - cut below here*
Following lines to be deleted when provided for student use

In [9]:
td = working[9]
t=tweets.getTweet(td)
t['text']

'No kidding... https://t.co/3Kg2HkfRsc'

In [10]:
tweets.addCodes(td,['LINK','IRRELEVANT'])

In [12]:
td = working[10]
t=tweets.getTweet(td)
t['text']

'RT @xancaps: smoking by myself now\n\ni don’t need nobody else around'

In [13]:
tweets.addCodes(td,['RETWEET','USERMENTION','HABITS'])

In [14]:
td = working[11]
t=tweets.getTweet(td)
t['text']

'RT @GiveMeInternet: Anti smoking ads should show the benefits of quitting instead of the harms of smoking.'

In [16]:
tweets.addCodes(td,['RETWEET','USERMENTION','ANTI-SMOKING','QUITTING'])

In [17]:
td = working[12]
t=tweets.getTweet(td)
t['text']

'@Austin_Sosbee See I have recently started to dream again, Why again? Cause smoking alot of weed stops dreaming, I miss not dreaming lol'

In [20]:
tweets.addCodes(td,['RETWEET','USERMENTION','MARIJUANA','BENEFITS'])

In [21]:
td = working[13]
t=tweets.getTweet(td)
t['text']

'RT @FootyMemes: This new anti-smoking ad is really powerful... https://t.co/pWHZDLIb7O'

In [22]:
tweets.addCodes(td,['RETWEET','USERMENTION','LINK','ANTI-SMOKING'])

In [23]:
td = working[14]
t=tweets.getTweet(td)
t['text']

'RT @GiveMeInternet: Anti smoking ads should show the benefits of quitting instead of the harms of smoking.'

In [24]:
tweets.addCodes(td,['RETWEET','USERMENTION','ANTI-SMOKING','QUITTING'])

In [25]:
td = working[15]
t=tweets.getTweet(td)
t['text']

"Mngxitama and his nyaope smoking buddies were a no show today because the Guptas aren't targeted hahahaha"

In [27]:
tweets.addCode(td,"FRIENDS")

In [28]:
td = working[16]
t=tweets.getTweet(td)
t['text']

'RT @_youngkingdave: Smoking #doinks with @WakaFlocka\n#doinksquad https://t.co/zex6zRw4Xx'

In [29]:
tweets.addCodes(td,['RETWEET','USERMENTION','LINK','MARIJUANA'])

In [30]:
td = working[17]
t=tweets.getTweet(td)
t['text']

'RT @OnlyWayIsShawtz: My boy stopped smoking weed the day he spent 30 minutes looking for his phone under the bed.. While using his phone fl…'

In [33]:
tweets.addCodes(td,['RETWEET','USERMENTION','MARIJUANA','IMPACT'])

In [34]:
td = working[18]
t=tweets.getTweet(td)
t['text']

'RT @GiveMeInternet: Anti smoking ads should show the benefits of quitting instead of the harms of smoking.'

In [35]:
tweets.addCodes(td,['RETWEET','USERMENTION','ANTI-SMOKING','QUITTING'])

In [36]:
td = working[19]
t=tweets.getTweet(td)
t['text']

"i @KattyKayBBC 'Cannabis is a gateway 2 taking Heroin' were did u hear that pish? Joint smoking 1 day next injecting in2 souls of feet no-no"

In [37]:
tweets.addCodes(td,['USERMENTION','MARIJUANA','OPIATES'])

In [38]:
td = working[20]
t=tweets.getTweet(td)
t['text']

"https://t.co/qchtcveqkA is the world's 1st Smoking Model directory. Search for your favorite… https://t.co/21OlmjWv7Z"

In [39]:
tweets.addCodes(td,['LINK'])

In [41]:
tweets.getCodeProfile()

[('USERMENTION', 9),
 ('RETWEET', 8),
 ('ANTI-SMOKING', 4),
 ('LINK', 4),
 ('MARIJUANA', 4),
 ('QUITTING', 3),
 ('BENEFITS', 1),
 ('FRIENDS', 1),
 ('HABITS', 1),
 ('IMPACT', 1),
 ('IRRELEVANT', 1),
 ('OPIATES', 1)]

In [42]:
tweets.saveTweets("smoking-tweets-annotated.json")

*END CUT*

---

## Exercise 2.3: Additional queries

The tweets annotated above are all based on searches for 'smoking'. What if you were to try other terms, such as 'vaping'? 


1. Using the `searchTwitter` procedure defined above, run a search for a set of tweets with 'vaping' as the search term. You can limit this search to 100 tweets.

2. Annotate the first twenty tweets in this set.

3. Compare the distribution of annotated terms across the two sets. What do you see that is similar? Different?
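
One possible way to approach the comparison in step 3 is to convert each code profile into shares of the annotated tweets, so that sets of different sizes can be placed side by side. The sketch below uses made-up counts and a made-up code (`VAPE-SHOP`); `code_shares` and `compare_shares` are hypothetical helpers.

```python
def code_shares(profile, n_tweets):
    """Convert (code, count) pairs into the fraction of tweets carrying each code."""
    return {code: count / n_tweets for code, count in profile}


def compare_shares(first, second):
    """Return (code, share_in_first, share_in_second) rows for every code seen."""
    rows = []
    for code in sorted(set(first) | set(second)):
        rows.append((code, first.get(code, 0.0), second.get(code, 0.0)))
    return rows


# Made-up profiles: 10 annotated tweets in one set, 20 in the other.
set_a = code_shares([("RETWEET", 5), ("LINK", 2)], 10)
set_b = code_shares([("RETWEET", 5), ("VAPE-SHOP", 10)], 20)
rows = compare_shares(set_a, set_b)
for row in rows:
    print(row)
```

Codes that appear in only one set show up with a share of zero in the other, which makes differences easy to spot.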

### 2.3.1 Search for a new set of tweets.

In [47]:
vapeTweets=searchTwitter('vaping',100)

Note that `searchTwitter` returns a plain dictionary keyed by tweet id (`id_str`), not a `Tweets` object, so methods such as `countTweets` and `saveTweets` are not available on the result. We use `len` and ordinary dictionary operations instead.

In [366]:
len(vapeTweets)

103

In [367]:
vids = list(vapeTweets.keys())

In [368]:
vid = vids[0]
vid in vapeTweets

True

In [370]:
vid in tweets.tweets

False

#### 2.3.2 Add the search term to these new tweets

In [371]:
for id,entry in vapeTweets.items():
    vapeTweets[id]['search_term']='vaping'

#### 2.3.3 Add the new tweets to the `tweets` object

In [374]:
len(vapeTweets)

103

In [375]:
vid2=vids[1]

In [376]:
vid2

'950803896200507395'

In [377]:
vid2 in vapeTweets

True

In [378]:
vid2 in tweets.tweets

False

In [379]:
tweets.countTweets()

1001

In [380]:
for id,entry in vapeTweets.items():
    tweets.tweets[id]=entry

In [None]:
vapeIds=list(vapeTweets.keys())

In [None]:
t1 =vapeIds[0]

In [None]:
vapeTweets[t1]['search_term']


In [None]:
tweets.tweets[t1]['search_term']

`saveTweets` and `readTweets` simply save and reload the underlying dictionary as JSON, without any concern as to its contents. Thus, these routines can be used as is, without any modifications.

In [None]:
tweets.saveTweets("tweets-vape-and-smoking.json")

*END CUT*

----

---
## Exercise 2.4: Reflection on coding

Open coding is often an iterative process. When we first start out, we don't really know what we're looking for. As a result, the first few items annotated might only get a few codes, and we might miss ideas that we don't initially think are important. As we see more and more items, our ideas of what needs to be annotated will change, and we'll start adding in codes that might also apply to earlier messages. Thus, we often need to review and re-annotate earlier tweets to account for changes in our interpretations.

Review the annotations that you have made by doing the following:

1. Write a routine to extract a list of all codes used to describe all tweets in the corpus, and the number of times each code is used.

2. Use this list of codes to review your annotations of the tweets that you coded. Revise the codes associated with these tweets, adding items from the overall list of codes as appropriate. Describe the changes that you have made.

3. Look at the distribution of tweets mentioning smoking and tweets mentioning vaping - do you have a good mix of both? You don't have to reach 50-50, but 80-20 is probably not a good idea. If the mix is too skewed, with too few tweets from one set, run another search, add the resulting tweets to your collection, and try again. Be sure to save your tweets - preferably to a new file.


---
*ANSWER FOLLOWS - cut below here*

### 2.4.1 Write a routine to extract a list of all codes used to describe all tweets in the corpus


In [None]:
def getCodes(tweets):
    codes={}
    for id in tweets.getIds():
        tweet = tweets.getTweet(id)
        # if this tweet has any codes (they are stored inside the tweet object)
        if 'codes' in tweet:
            for code in tweet['codes']:
                # find the list of tweets with this code. if it doesn't exist, create it.
                if code not in codes:
                    codes[code]=[]
                codes[code].append(id)
    return codes

In [None]:
codes = getCodes(tweets)

Now, let's review those codes to see how frequently they were used. Remember, each entry in the `codes` dictionary is a pair consisting of a code and a list of the ids of tweets carrying that code. To see a sorted list of frequencies, we will iterate over the entries, create pairs consisting of the code and the length of the list, and then sort that list in descending order.

In [None]:
cs = []
for t,entry in codes.items():
    count = len(entry)
    cs.append((t,count))  
cs.sort(key=lambda x: x[1],reverse=True)
cs

So, as you can see, we have lots of retweets.
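
With the list of codes in hand, the review described in step 2 can be partly guided by simple text matching: for a given code, look for tweets whose text mentions a related keyword but that do not yet carry the code. The sketch below is only an aid to manual review; it operates on a plain dictionary of entries, and the sample data, keyword, and `candidates_for_code` helper are all hypothetical.

```python
def candidates_for_code(entries, keyword, code):
    """Return ids of tweets that mention keyword but lack the given code."""
    found = []
    for tweet_id, entry in entries.items():
        tweet = entry["tweet"]
        mentions = keyword.lower() in tweet["text"].lower()
        if mentions and code not in tweet.get("codes", set()):
            found.append(tweet_id)
    return found


# Made-up entries in the same shape as the Tweets dictionary.
sample_entries = {
    "a": {"count": 1, "tweet": {"text": "smoking weed again", "codes": {"MARIJUANA"}}},
    "b": {"count": 1, "tweet": {"text": "My boy stopped smoking weed", "codes": {"RETWEET"}}},
    "c": {"count": 1, "tweet": {"text": "No kidding..."}},
}
to_review = candidates_for_code(sample_entries, "weed", "MARIJUANA")
print(to_review)
```

Each candidate still needs to be read before a code is added, since a keyword match does not establish what the tweet is about.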


*END CUT*

---

## 2. Final Notes

Now that we have completed the initial annotation, you can move on to [Part 3](SocialMedia - Part 3.ipynb)