###  *Goal*: Use social media posts to explore the application of text and natural language processing to see what might be learned from online interactions.

Specifically, we will retrieve, annotate, process, and interpret Twitter data on health-related issues such as depression.

--- 
References:
* [Mining Twitter Data with Python (Part 1: Collecting data)](https://marcobonzanini.com/2015/03/02/mining-twitter-data-with-python-part-1/)
* The [Tweepy Python API for Twitter](http://www.tweepy.org/)

Required Software
* [Python 3](https://www.python.org)
* [NumPy](http://www.numpy.org) - for preparing data for plotting
* [Matplotlib](https://matplotlib.org) - for plots and graphs
* [jsonpickle](https://jsonpickle.github.io) - for storing tweets
---

In [1]:
%matplotlib inline

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import jsonpickle
import json
import random
import tweepy
import time
import operator
from datetime import datetime

# Introduction

This module continues the Social Media Data Science module started in [Part 1](SocialMedia - Part 1.ipynb), covering the annotation of tweets. These lessons will continue in [Part 3](SocialMedia - Part 3.ipynb) as we move on to the use of natural language processing to analyze the tweets.
  
Our case study will apply these topics to Twitter discussions of smoking and vaping. Although details of the tools used to access data and the format and content of the data may differ for various services, the strategies and procedures used to analyze the data will generalize to other tools.

## 0. Setup

Before we dig in, we must grab a bit of code from [Part 1](SocialMedia - Part 1.ipynb):

1. The Tweets class used to store the tweets.
2. Our Twitter API keys - be sure to copy the keys that you generated when you completed [Part 1](SocialMedia - Part 1.ipynb).
3. Configuration of our Twitter connection

In [10]:
class Tweets:
    
    
    def __init__(self,term="",corpus_size=100):
        self.tweets={}
        if term !="":
            self.searchTwitter(term,corpus_size)
                
    def searchTwitter(self,term,corpus_size):
        searchTime=datetime.now()
        while (self.countTweets() < corpus_size):
            new_tweets = api.search(term,lang="en",count=10)
            for nt_json in new_tweets:
                nt = nt_json._json
                if self.getTweet(nt['id_str']) is None and self.countTweets() < corpus_size:
                    self.addTweet(nt,searchTime,term)
            time.sleep(5)
                
    def addTweet(self,tweet,searchTime,term="",count=0):
        id = tweet['id_str']
        if id not in self.tweets.keys():
            self.tweets[id]={}
            self.tweets[id]['tweet']=tweet
            self.tweets[id]['count']=0
            self.tweets[id]['searchTime']=searchTime
            self.tweets[id]['searchTerm']=term
        self.tweets[id]['count'] = self.tweets[id]['count'] +1
        
    def getTweet(self,id):
        if id in self.tweets:
            return self.tweets[id]['tweet']
        else:
            return None
    
    def getTweetCount(self,id):
        return self.tweets[id]['count']
    
    def countTweets(self):
        return len(self.tweets)
    
    # return a list of tuples of the form (id,count), sorted by occurrence count in decreasing order
    def mostFrequent(self):
        ps = []
        for t,entry in self.tweets.items():
            count = entry['count']
            ps.append((t,count))  
        ps.sort(key=lambda x: x[1],reverse=True)
        return ps
    
    # returns tweet IDs as a set
    def getIds(self):
        return set(self.tweets.keys())
    
    # save the tweets to a file
    def saveTweets(self,filename):
        json_data =jsonpickle.encode(self.tweets)
        with open(filename,'w') as f:
            json.dump(json_data,f)
    
    # read the tweets from a file 
    def readTweets(self,filename):
        with open(filename,'r') as f:
            json_data = json.load(f)
            incontents = jsonpickle.decode(json_data)   
            self.tweets=incontents
        
    def getSearchTerm(self,id):
        return self.tweets[id]['searchTerm']
    
    def getSearchTime(self,id):
        return self.tweets[id]['searchTime']

In [11]:
# Replace these placeholders with the keys that you generated in Part 1.
consumer_key='YOUR_CONSUMER_KEY'
consumer_secret='YOUR_CONSUMER_SECRET'
access_token='YOUR_ACCESS_TOKEN'
access_secret='YOUR_ACCESS_SECRET'

In [12]:
from tweepy import OAuthHandler

auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth)
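
Hard-coding API keys in a notebook makes them easy to leak when the notebook is shared. One alternative, sketched below, is to read them from environment variables. The variable names (`TWITTER_CONSUMER_KEY` and so on) and the helper `load_twitter_credentials` are assumptions made for this illustration, not part of Tweepy or of Part 1.

```python
import os

# Hypothetical environment variable names; use whatever names you export.
ENV_NAMES = {
    "consumer_key": "TWITTER_CONSUMER_KEY",
    "consumer_secret": "TWITTER_CONSUMER_SECRET",
    "access_token": "TWITTER_ACCESS_TOKEN",
    "access_secret": "TWITTER_ACCESS_SECRET",
}

def load_twitter_credentials(environ=os.environ):
    """Return a dict with the four credentials, or raise if any is missing."""
    missing = [name for name in ENV_NAMES.values() if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {key: environ[name] for key, name in ENV_NAMES.items()}
```

The returned values could then be passed to `OAuthHandler` and `set_access_token` in place of the literal strings used above.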

## 1. Annotating Tweets

Now that we have a corpus of tweets, what do we want to do with them? Turning a relatively vague notion into a well-defined research question is often a significant challenge, as examination of the data often reveals both shortcomings and unforeseen opportunities.

In our case, we are interested in looking at tweets about smoking, but we're not quite sure exactly *what* we are looking for. We have a vague notion that we might learn something interesting, but understanding exactly what that is, and what sort of analyses we might need, will require a bit more work.

In situations such as this, we might look at some of the data to form some preliminary impressions of the content. Specifically, we can look at individual tweets, assigning them to one or more categories - known as *codes* - based on their content. We can add categories as needed to capture important ideas that we might want to refer back to. This practice - known as *open coding* - allows us to begin to make sense of unfamiliar data sets.

This sounds much more complicated than it is. For now, let's read some tweets in from a file collected using the procedures discussed in [Part 1](SocialMedia - Part 1.ipynb). We'll then use those tweets to get to work. The tweets are stored in a file called `tweets-smoking.json`.

In [13]:
tweets =Tweets()
tweets.readTweets("tweets-smoking.json")

We check the count, to verify the contents...

In [14]:
print(tweets.countTweets())

100


We will begin by taking a look at a subset of the first 20 tweets.

To get this list, we'll sort the IDs of the tweets and take the first 20 in the list, as ordered by ID.

In [15]:
ids=list(tweets.getIds())
ids.sort()
working=[]
for i in range(20):
    id = ids[i]
    working.append(id)
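
One detail to keep in mind: the IDs are strings (they come from `id_str`), so `ids.sort()` orders them character by character rather than numerically. For IDs that all have the same number of digits the two orderings agree, but they can differ when lengths vary. The short sketch below uses made-up IDs purely to show the difference.

```python
# Made-up IDs of different lengths, used only to illustrate the ordering issue.
sample_ids = ["99", "100", "973913827396874241"]

lexicographic = sorted(sample_ids)       # compares character by character
numeric = sorted(sample_ids, key=int)    # compares the integer values

print(lexicographic)
print(numeric)
```

If numeric order matters for a corpus with mixed ID lengths, passing `key=int` to the sort is one simple option.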

*working* now has 20 tweet IDs. Let's start with the first.

In [16]:
td = working[0]
t = tweets.getTweet(td)
print(tweets.getSearchTerm(td))
print(tweets.getSearchTime(td))
t['text']

smoking
2018-03-14 09:29:06.674341


'RT @scotgovhealth: Pamela tells how she went from smoking 20 a day to being smoke-free for five months now. Join the conversation and #tell…'

This tweet has several interesting characteristics:
1. It appears to be a retweet.
2. It contains a user mention.
3. It contains a hashtag.

We can model all of these points through relevant annotation. Specifically, we will add a new set of codes to each tweet object. This set will contain the categorical annotations associated with the tweet. We add routines to add a single code to a tweet (by ID), to add multiple codes, and to retrieve the codes associated with a tweet.


The modifications to the `Tweets` class are marked in this new definition.

In [18]:
class Tweets:
    
    
    def __init__(self,term="",corpus_size=100):
        self.tweets={}
        if term !="":
            self.searchTwitter(term,corpus_size)
                
    def searchTwitter(self,term,corpus_size):
        searchTime=datetime.now()
        while (self.countTweets() < corpus_size):
            new_tweets = api.search(term,lang="en",count=10)
            for nt_json in new_tweets:
                nt = nt_json._json
                if self.getTweet(nt['id_str']) is None and self.countTweets() < corpus_size:
                    self.addTweet(nt,searchTime,term)
            time.sleep(5)
                
    def addTweet(self,tweet,searchTime,term="",count=0):
        id = tweet['id_str']
        if id not in self.tweets.keys():
            self.tweets[id]={}
            self.tweets[id]['tweet']=tweet
            self.tweets[id]['count']=0
            self.tweets[id]['searchTime']=searchTime
            self.tweets[id]['searchTerm']=term
        self.tweets[id]['count'] = self.tweets[id]['count'] +1
        
    def getTweet(self,id):
        if id in self.tweets:
            return self.tweets[id]['tweet']
        else:
            return None
    
    def getTweetCount(self,id):
        return self.tweets[id]['count']
    
    def countTweets(self):
        return len(self.tweets)
    
    # return a list of tuples of the form (id,count), sorted by occurrence count in decreasing order
    def mostFrequent(self):
        ps = []
        for t,entry in self.tweets.items():
            count = entry['count']
            ps.append((t,count))  
        ps.sort(key=lambda x: x[1],reverse=True)
        return ps
    
    # returns tweet IDs as a set
    def getIds(self):
        return set(self.tweets.keys())
    
    # save the tweets to a file
    def saveTweets(self,filename):
        json_data =jsonpickle.encode(self.tweets)
        with open(filename,'w') as f:
            json.dump(json_data,f)
    
    # read the tweets from a file 
    def readTweets(self,filename):
        with open(filename,'r') as f:
            json_data = json.load(f)
            incontents = jsonpickle.decode(json_data)   
            self.tweets=incontents
        
    def getSearchTerm(self,id):
        return self.tweets[id]['searchTerm']
    
    def getSearchTime(self,id):
        return self.tweets[id]['searchTime']
                
    ### NEW ROUTINE - add a code to a tweet
    def addCode(self,id,code):
        tweet=self.getTweet(id)
        if 'codes' not in tweet:
            tweet['codes']=set()
        tweet['codes'].add(code)
        
    ### NEW ROUTINE  - add multiple  codes for a tweet
    def addCodes(self,id,codes):
        for code in codes:
            self.addCode(id,code)
        
    ### NEW ROUTINE get codes for a tweet
    def getCodes(self,id):
        tweet=self.getTweet(id)
        return tweet['codes']

Now that we have this set up, we can reload the tweets from the file and reload the subset.

In [22]:
tweets =Tweets()
tweets.readTweets("tweets-smoking.json")
ids=list(tweets.getIds())
ids.sort()
working=[]

for i in range(20):
    id = ids[i]
    working.append(id)

td = working[0]
t = tweets.getTweet(td)
t['text']

'RT @scotgovhealth: Pamela tells how she went from smoking 20 a day to being smoke-free for five months now. Join the conversation and #tell…'

Above we noted that this tweet was interesting because:
1. It is a retweet.
2. It contains a user mention.
3. It contains a hashtag.

So, we will add codes to the appropriate tweet as needed:

In [26]:
tweets.addCode(td,"RETWEET")
tweets.addCode(td,"USERMENTION")
tweets.addCode(td,"HASHTAG")

We can also confirm that this tweet is associated with the desired codes:

In [27]:
tweets.getCodes(td)

{'HASHTAG', 'RETWEET', 'USERMENTION'}

We can see if this is a retweet by checking for the `retweeted_status` attribute

In [29]:
'retweeted_status' in t

True

Good. Let's look at the next tweet. 

In [30]:
td = working[1]
t=tweets.getTweet(td)
t['text']

'RT @brokeangeI: me: smoking weed hasn’t affected me at all\n\nsomeone: count to 10\n\nme: https://t.co/SUoGzARpom'

This tweet is a retweet with a user mention, a mention of marijuana, and a URL.
This time, for simplicity, we will use the `addCodes` routine to add the codes.

In [34]:
tweets.addCodes(td,["RETWEET", 'USERMENTION','LINK','MARIJUANA'])

In [35]:
'retweeted_status' in t

True

OK, moving on to the third tweet...

In [33]:
td = working[2]
t=tweets.getTweet(td)
t['text']

'RT @brokeangeI: me: smoking weed hasn’t affected me at all\n\nsomeone: count to 10\n\nme: https://t.co/SUoGzARpom'

Hmm. This looks just like the previous tweet. Let's check the IDs:

In [36]:
print(working[1])
print(working[2])

973913827396874241
973913828328181760


OK, so they are two different retweets of the same original. We'll use the same codes as above.

In [37]:
tweets.addCodes(td,["RETWEET", 'USERMENTION','LINK','MARIJUANA'])
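
Duplicates like this can also be spotted programmatically. In the Twitter JSON used here, a retweet carries the original tweet under `retweeted_status`, so retweets of the same original share that original's `id_str`. The helper below, `group_by_original`, is a hypothetical name introduced for this sketch, and the sample dictionaries are tiny hand-made stand-ins rather than real tweets.

```python
from collections import defaultdict

def group_by_original(tweets_by_id):
    """Map each original tweet ID to the IDs of its retweets in the corpus."""
    groups = defaultdict(list)
    for tweet_id, tweet in tweets_by_id.items():
        original = tweet.get("retweeted_status")
        if original is not None:
            groups[original["id_str"]].append(tweet_id)
    return dict(groups)

# Hand-made stand-ins for tweet dictionaries (not real data).
sample = {
    "11": {"id_str": "11", "retweeted_status": {"id_str": "5"}},
    "12": {"id_str": "12", "retweeted_status": {"id_str": "5"}},
    "13": {"id_str": "13"},
}
print(group_by_original(sample))
```

With the class above, the input could be built as `{i: tweets.getTweet(i) for i in working}`; any group with more than one entry is a candidate for receiving identical codes.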

next...

In [38]:
td = working[3]
t=tweets.getTweet(td)
t['text']

"RT @krupnan: เจ๋ง!👍\n\nCool!\nSweet! (slang)\nGreat!\nBeautiful!\nAwesome!\nExcellent!\nThat's smoking!\nThat's fab!\n\n#บุพเพสันนิวาส https://t.co/hV…"

Note that this tweet uses `smoking` in a colloquial sense, rather than discussing the act of smoking tobacco or a drug. Thus, we mark it as `IRRELEVANT`.

In [39]:
tweets.addCodes(td,['RETWEET','USERMENTION','IRRELEVANT'])

In [40]:
td = working[4]
t=tweets.getTweet(td)
t['text']

'RT @TH_QuitRight: No Smoking Day launch at @elondonmosque 11am - 5pm find out about ways to quit. @TowerHamletsNow @eehnN2 @StopSmokingLon…'

This retweet includes a user mention, multiple hashtags, and a mention of stopping smoking.

In [41]:
tweets.addCodes(td,['RETWEET','USERMENTION','HASHTAG','STOPPING'])

In [42]:
td = working[5]
t=tweets.getTweet(td)
t['text']

'RT @scotgovhealth: Pamela tells how she went from smoking 20 a day to being smoke-free for five months now. Join the conversation and #tell…'

A retweet with a user mention, a hashtag, and a comment about stopping smoking

In [44]:
len(t['text'])

140

In [43]:
tweets.addCodes(td,['RETWEET','USERMENTION','HASHTAG','STOPPING'])
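
Several of the codes used so far (`RETWEET`, `USERMENTION`, `HASHTAG`, `LINK`) describe structural features that the tweet JSON already records: `retweeted_status` marks a retweet, and the `entities` field lists hashtags, user mentions, and URLs. As a rough sketch, these codes could be suggested automatically, leaving the content-based codes (such as `STOPPING` or `IRRELEVANT`) for manual review. The function name `structural_codes` is an assumption for illustration, and because the text of retweets is often truncated, the suggestions may not match everything a reader sees in the full tweet.

```python
def structural_codes(tweet):
    """Suggest codes that can be read directly from a tweet's metadata."""
    codes = set()
    if "retweeted_status" in tweet:
        codes.add("RETWEET")
    entities = tweet.get("entities", {})
    if entities.get("user_mentions"):
        codes.add("USERMENTION")
    if entities.get("hashtags"):
        codes.add("HASHTAG")
    if entities.get("urls") or entities.get("media"):
        codes.add("LINK")
    return codes

# Hand-made stand-in for a tweet dictionary (not real data).
sample_tweet = {
    "id_str": "1",
    "retweeted_status": {"id_str": "0"},
    "entities": {
        "hashtags": [],
        "user_mentions": [{"screen_name": "example_user"}],
        "urls": [{"url": "https://example.com"}],
    },
}
print(structural_codes(sample_tweet))
```

The result could be passed straight to `addCodes`, for example `tweets.addCodes(td, structural_codes(t))`.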

In [None]:
td = working[6]
t=tweets.getTweet(td)
t['text']

In [None]:
tweets.addCodes(td,['RETWEET','LINK','USERMENTION'])

In [None]:
td = working[7]
t=tweets.getTweet(td)
t['text']

This tweet is similar to the above tweet  about the backpack.

In [None]:
tweets.addCodes(td,['RETWEET','USERMENTION','IRRELEVANT'])

In [None]:
td = working[8]
t=tweets.getTweet(td)
t['text']

This looks irrelevant

In [None]:
tweets.addCodes(td,['RETWEET','USERMENTION','IRRELEVANT'])

Now that we've gone through several tweets, we can review the codes used.

In [None]:
for i in range(9):
    td=working[i]
    print(tweets.getCodes(td))

Having annotated several tweets, we might want to save the annotations in a file for future use. Fortunately, the approach that we've used in our save and reload code is flexible enough to handle this without any further changes to the implementation.

How does this work? The `Tweets` class stores all of the information about the tweets in a simple dictionary. Occurrence counts are stored alongside each tweet, and codes are stored inside the tweet object itself. When we save the set of tweets, we simply turn this dictionary into JSON and write it to a file. To read things in, we read the JSON from the file and convert the result back into a dictionary. Thus, anything that we add to the dictionary will automatically be written out and read back in. We still need additional routines to access this data (like `addCode`, `addCodes`, and `getCodes`), but we don't need to change the save/load routines. Let's try it out.


In [None]:
tweets.saveTweets("tweets-smoking-annotated.json")

In [None]:
tweets2=Tweets()
tweets2.readTweets("tweets-smoking-annotated.json")

In [None]:
print(tweets.getTweet(td)['text'])
print(tweets.getCodes(td))
print(tweets2.getTweet(td)['text'])
print(tweets2.getCodes(td))
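
One reason this round trip needs more than the standard `json` module is that the codes are stored as Python sets, which plain JSON cannot represent. The sketch below uses only the standard library to show the problem and one hand-rolled workaround. The `"__set__"` tag is an illustrative convention invented for this example; it is not the format that jsonpickle writes, although the idea of recording type information alongside the data is similar in spirit.

```python
import json

entry = {"text": "example tweet", "codes": {"RETWEET", "HASHTAG"}}

# Plain json refuses to serialize a set.
try:
    json.dumps(entry)
    plain_json_ok = True
except TypeError:
    plain_json_ok = False

def encode_sets(obj):
    """Called by json.dumps for objects it cannot serialize itself."""
    if isinstance(obj, set):
        return {"__set__": sorted(obj)}
    raise TypeError(f"Cannot serialize {type(obj).__name__}")

def decode_sets(d):
    """Called by json.loads for every decoded JSON object."""
    if set(d) == {"__set__"}:
        return set(d["__set__"])
    return d

text = json.dumps(entry, default=encode_sets)
restored = json.loads(text, object_hook=decode_sets)
print(plain_json_ok, restored == entry)
```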

****
## Exercise 2.1: Examining the distribution of codes across a corpus

Having annotated a number of tweets, you might want to get an idea of which codes are being used and how often. Write a new method inside the `Tweets` class that will provide a list of codes and counts, sorted by frequency. Please be sure to reload the tweets after you redefine the class.

*ANSWER FOLLOWS - cut below here*
Following lines to be deleted when provided for student use

In [None]:
class Tweets:
    
    
    def __init__(self,term="",corpus_size=100):
        self.contents={}
        self.contents['time']=datetime.now()
        self.contents['tweets']={}
        if term !="":
            self.searchTwitter(term,corpus_size)
                
    def searchTwitter(self,term,corpus_size):
        self.contents['searchTerm']=term
        self.contents['time']=datetime.now()
        while (self.countTweets() < corpus_size):
            new_tweets = api.search(term,lang="en",count=10)
            for nt_json in new_tweets:
                nt = nt_json._json
                if self.getTweet(nt['id_str']) is None and self.countTweets() < corpus_size:
                    self.addTweet(nt)
            time.sleep(5)
                
    def addTweet(self,tweet,count =0):
        id = tweet['id_str']
        tweets=self.contents['tweets']
        if id not in tweets.keys():
            tweets[id]={}
            tweets[id]['tweet']=tweet
            tweets[id]['count']=0
        # if a count is not provided in the call, increment the count 
        if count == 0:
            tweets[id]['count'] = tweets[id]['count'] +1
        else:
            tweets[id]['count'] = count
        
    def getTweet(self,id):
        id = str(id)
        tweets=self.contents['tweets']
        if id in tweets:
            return tweets[id]['tweet']
        else:
            return None
    
    def getTweetCount(self,id):
        return self.contents['tweets'][id]['count']
    
    def countTweets(self):
        return len(self.contents['tweets'])
    
    # return a list of tuples of the form (id,count), sorted by occurrence count in decreasing order
    def mostFrequent(self):
        ps = []
        tweets=self.contents['tweets']
        for t,entry in tweets.items():
            count = entry['count']
            ps.append((t,count))  
        ps.sort(key=lambda x: x[1],reverse=True)
        return ps
    
    # returns tweet IDs as a set
    def getIds(self):
        return set(self.contents['tweets'].keys())
    
    # save the tweets to a file
    def saveTweets(self,filename):
        json_data =jsonpickle.encode(self.contents)
        with open(filename,'w') as f:
            json.dump(json_data,f)
    
    # read the tweets from a file 
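    def readTweets(self,filename):
        self.contents={}
        with open(filename,'r') as f:
            json_data = json.load(f)
            incontents = jsonpickle.decode(json_data)
            # files saved by the earlier versions of this class map tweet IDs
            # directly to entries, so wrap them to match the layout used here
            if 'tweets' not in incontents:
                incontents={'tweets':incontents}
            self.contents=incontents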
        
    def getSearchTerm(self):
        return self.contents['searchTerm']
     
                
    ### NEW ROUTINE - add a code to a tweet
    def addCode(self,id,code):
        tweet=self.getTweet(id)
        if 'codes' not in tweet:
            tweet['codes']=set()
        tweet['codes'].add(code)
        
        
    def addCodes(self,id,codes):
        for code in codes:
            self.addCode(id,code)
        
    def getCodes(self,id):
        tweet=self.getTweet(id)
        return tweet['codes']
    
    # NEW -ROUTINE TO GET PROFILE
    def getCodeProfile(self):
        summary={}
        tweets=self.contents['tweets']
        for id in tweets.keys():
            tweet=self.getTweet(id)
            if 'codes' in tweet:
                for code in tweet['codes']:
                    if code not in summary:
                            summary[code] =0
                    summary[code]=summary[code]+1
        sortedsummary = sorted(summary.items(),key=operator.itemgetter(0),reverse=True)
        return sortedsummary

In [None]:
tweets=Tweets()
tweets.readTweets("tweets-smoking-annotated.json")

In [None]:
tweets.getCodeProfile()

*END CUT HERE*
****

****

## Exercise 2.2: Code the Next 10 tweets in the set. 
Start with the tags used above, adding your own as needed. Code the tweets at indices 9 through 18 of the sorted ID list (the ten tweets that follow those coded above). Examine the code profile and save your tweets to a new file when you are done.

*ANSWER FOLLOWS - cut below here*
Following lines to be deleted when provided for student use

In [None]:
ids=list(tweets.getIds())
ids.sort()
working=[]

for i in range(9,19):
    id = ids[i]
    working.append(id)

td = working[0]
t=tweets.getTweet(td)
t['text']

In [None]:
tweets.addCodes(td,['RETWEET','USERMENTION','LINK'])

In [None]:
td = working[1]
t=tweets.getTweet(td)
t['text']

In [None]:
tweets.addCodes(td,['RETWEET','USERMENTION','IRRELEVANT'])

In [None]:
td = working[2]
t=tweets.getTweet(td)
t['text']

In [None]:
tweets.addCodes(td,['RETWEET','USERMENTION'])

In [None]:
td = working[3]
t=tweets.getTweet(td)
t['text']

In [None]:
tweets.addCodes(td,['RETWEET','USERMENTION','MARIJUANA','POSAFFECT'])

In [None]:
td = working[4]
t=tweets.getTweet(td)
t['text']

In [None]:
tweets.addCodes(td,['USERMENTION'])

In [None]:
td = working[5]
t=tweets.getTweet(td)
t['text']

In [None]:
tweets.addCodes(td,['RETWEET','USERMENTION','MARIJUANA'])

In [None]:
td = working[6]
t=tweets.getTweet(td)
t['text']

In [None]:
tweets.addCodes(td,['HASHTAG','LINK','NEGAFFECT'])

In [None]:
td = working[7]
t=tweets.getTweet(td)
t['text']

In [None]:
tweets.addCodes(td,['LINK'])

In [None]:
td = working[8]
t=tweets.getTweet(td)
t['text']

In [None]:
tweets.addCodes(td,['RETWEET','USERMENTION','VAPING'])

In [None]:
td = working[9]
t=tweets.getTweet(td)
t['text']

In [60]:
tweets.addCodes(td,['RETWEET','USERMENTION','IRRELEVANT'])

In [61]:
tweets.getCodeProfile()

[('VAPING', 1),
 ('USERMENTION', 17),
 ('RETWEET', 16),
 ('POSAFFECT', 2),
 ('NEGAFFECT', 2),
 ('MARIJUANA', 5),
 ('LINK', 4),
 ('IRRELEVANT', 5),
 ('HASHTAG', 3),
 ('CRACK', 1)]
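
The profile shown above is ordered by code name (in reverse alphabetical order) rather than by count. Given a list of `(code, count)` pairs like this one, a frequency ordering can be derived with a different sort key. The sketch below copies the pairs from the output above; breaking ties alphabetically is a choice made here for reproducibility.

```python
# Code profile copied from the output above.
profile = [('VAPING', 1), ('USERMENTION', 17), ('RETWEET', 16), ('POSAFFECT', 2),
           ('NEGAFFECT', 2), ('MARIJUANA', 5), ('LINK', 4), ('IRRELEVANT', 5),
           ('HASHTAG', 3), ('CRACK', 1)]

# Sort by count (descending), breaking ties alphabetically by code name.
by_frequency = sorted(profile, key=lambda item: (-item[1], item[0]))

for code, count in by_frequency:
    print(f"{code:12s} {count}")
```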

In [62]:
tweets.saveTweets("tweets-smoking-annotated.json")

*END CUT*

---

---
## EXERCISE 2.3: Reflection on coding

Open coding can often be an iterative process. When we first start out, we don't really know what we're looking for. As a result, the first few items annotated might only get a few codes, and we might miss ideas that we don't initially think are important. As we see more and more items, our ideas of what needs to be annotated will change, and we'll start adding in codes that might also apply to earlier messages. Thus, we often need to review and re-annotate earlier tweets to account for changes in our interpretations.

Review your annotations of the tweets that you coded. Revise the codes associated with these tweets, adding items from the overall list of codes as appropriate. Describe the changes that you have made.
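
When reviewing, it can help to list which tweets do and do not carry a given code, so that codes introduced late can be checked against earlier tweets. The helpers below are a minimal sketch with hypothetical names (`ids_with_code`, `ids_without_code`), operating on a plain dictionary that maps tweet IDs to sets of codes. The sample IDs and codes are made up.

```python
def ids_with_code(codes_by_id, code):
    """Return the sorted IDs whose code set contains `code`."""
    return sorted(i for i, codes in codes_by_id.items() if code in codes)

def ids_without_code(codes_by_id, code):
    """Return the sorted IDs whose code set does not contain `code`."""
    return sorted(i for i, codes in codes_by_id.items() if code not in codes)

# Made-up IDs and codes for illustration only.
codes_by_id = {
    "1": {"RETWEET", "STOPPING"},
    "2": {"RETWEET"},
    "3": {"LINK", "IRRELEVANT"},
}
print(ids_with_code(codes_by_id, "RETWEET"))
print(ids_without_code(codes_by_id, "STOPPING"))
```

With the class used in this notebook, the dictionary could be built as `{i: tweets.getCodes(i) for i in working}` once every tweet in `working` has at least one code.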


---
*ANSWER FOLLOWS - cut below here*
Following lines to be deleted when provided for student use


*END CUT*

---

## EXERCISE 2.4: Reflection on storage/serialization

In working with this small set of 100 tweets, we are taking a very simple approach to storage and management of the tweets and annotations. Storing everything in a nested Python dictionary and then dumping it to disk as JSON text can be very appealing. What are the strengths and weaknesses of this approach, and how might these strengths and weaknesses differ with larger datasets containing 100,000 or 100 million tweets? What alternative strategies might you use for larger datasets?

---
*ANSWER FOLLOWS - cut below here*
Following lines to be deleted when provided for student use

*Advantages*: JSON is easy to read, as programmers can open the text file and read contents directly. JSON is also easy to work with from multiple programming languages and platforms, allowing a collection of tweets written in Python to be read by code written in other languages.  Finally, the JSON structure is easily adaptable, allowing fields to be easily added or removed as needed.

*Disadvantages*: As a text-based format, JSON data is pretty much "all or nothing". Although it might be possible to read in part of the JSON structure, this will complicate code significantly. Thus, most programs would read the entire JSON structure into memory all at once. This is fine for small datasets, but might get slow and bulky, requiring lots of RAM, for larger datasets.
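
One common way to soften the "all or nothing" problem, not used in this notebook, is to write one JSON record per line so that a reader can process tweets one at a time. The sketch below illustrates the idea with two made-up records and an in-memory buffer standing in for a file on disk. The function names are assumptions for this example.

```python
import io
import json

def write_json_lines(records, f):
    """Write each record as a single line of JSON."""
    for record in records:
        f.write(json.dumps(record) + "\n")

def read_json_lines(f):
    """Yield one record at a time instead of loading the whole file."""
    for line in f:
        line = line.strip()
        if line:
            yield json.loads(line)

# An in-memory buffer stands in for a file; the records are made up.
buffer = io.StringIO()
write_json_lines([{"id_str": "1", "text": "first"},
                  {"id_str": "2", "text": "second"}], buffer)
buffer.seek(0)
first = next(read_json_lines(buffer))
print(first)
```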

Very large sets of tweets might be stored in a database. A relational database might be created to store tweets in one or more tables, providing the power of the structured query language (SQL) to retrieve only those tweets matching specified criteria. SQL could also be used to quickly and easily calculate aggregate statistics, without loading all of the tweets into RAM. Downsides of this approach include the need to manage a distinct software component (the database server), the difficulty in changing the structure of the database, the complexity of SQL queries, and the relatively complex and often language-specific tools needed to manage the connections to the database.

Alternatively, NoSQL databases that work very well with JSON might be considered. These databases may share some of the challenges associated with relational databases, but they are often also more flexible.
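
As a small illustration of the relational option, the sketch below uses Python's built-in `sqlite3` module, which needs no separate server, to store tweet codes in a table and compute a code profile with a single SQL query. The table name, column names, and rows are assumptions made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, nothing written to disk
conn.execute("CREATE TABLE tweet_codes (tweet_id TEXT, code TEXT)")

# Made-up annotations for illustration only.
rows = [("1", "RETWEET"), ("1", "HASHTAG"), ("2", "RETWEET"), ("3", "LINK")]
conn.executemany("INSERT INTO tweet_codes VALUES (?, ?)", rows)

# Aggregate inside the database instead of loading every tweet into Python.
profile = conn.execute(
    "SELECT code, COUNT(*) AS n FROM tweet_codes "
    "GROUP BY code ORDER BY n DESC, code"
).fetchall()
conn.close()

print(profile)
```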


*END CUT*

---

### 1.2 Final Notes

[Part 3](SocialMedia - Part 3.ipynb) will explore the application of Natural Language Processing (NLP) techniques to tweet data. As part of this exploration, we will create and save a set of tweets based on a different search term - "vaping". Eventually, we'll apply machine learning to see if we can classify tweets as being associated with the "smoking" or "vaping" searches.

In [63]:
vapeTweets = Tweets()
vapeTweets.searchTwitter("vaping",100)
vapeTweets.saveTweets("tweets-vaping.json")