# Social Media and Human-Computer Interaction

###  *Goal*: Use social media posts to explore the application of text and natural language processing to see what might be learned from online interactions.

Specifically, we will retrieve, annotate, process, and interpret Twitter data on health-related issues such as depression.

--- 
References:
* [Mining Twitter Data with Python (Part 1: Collecting data)](https://marcobonzanini.com/2015/03/02/mining-twitter-data-with-python-part-1/)
* The [Tweepy Python API for Twitter](http://www.tweepy.org/)
---

In [2]:
%matplotlib inline

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import jsonpickle
import json

# Introduction

Analysis of social-media discussions has grown to be an important tool for biomedical informatics researchers, particularly for addressing questions relevant to public perceptions of health and related matters. Studies have examined a range of topics at the intersection of health and social media, including how [Facebook might be used to communicate health information](http://www.jmir.org/2016/8/e218/), how Tweets might be used to understand how smokers perceive [e-cigarettes, hookahs and other emerging smoking products](https://www.jmir.org/2013/8/e174/), and many others.

Although each investigation has unique aspects, studies of social media generally share several common tasks. Data acquisition is often the first challenge: although some data may be freely available, there are often [limits](https://dev.twitter.com/rest/public/rate-limits) as to how much data can be queried easily. Researchers might look out for [opportunities for accessing larger amounts of data](https://www.wired.com/2014/02/twitter-promises-share-secrets-academia/). Some studies contract with [commercial services providing fee-based access](https://gnip.com). 

Once a data set is in hand, the next step is often to identify key terms and phrases relating to the research question. Messages might be annotated to indicate specific categorizations of interest - indicating, for example, if a message referred to a certain aspect of a disease or symptom. Similarly, key words and phrases regularly occurring in the content might also be identified. Natural language and text processing techniques might be used to extract key words, phrases, and relationships, and machine learning tools might be used to build classifiers capable of distinguishing between types of tweets of interest.

This module presents a preliminary overview of these techniques, using Python 3 and several auxiliary libraries to explore the application of these techniques to Twitter data. 
  
  1. Configuration of tools to access Twitter data
  2. Twitter data retrieval
  3. Searching for tweets
  4. Annotation of tweets
  5. Natural Language Processing
  6. Examination of text patterns
  7. Construction of classifiers
  8. Exercises and next steps
  

Our case study will apply these topics to Twitter discussions of smoking and tobacco. Although details of the tools used to access data and the format and content of the data may differ for various services, the strategies and procedures used to analyze the data will generalize to other tools.

## 1. Configuration of tools to access Twitter data

[Twitter](www.twitter.com) provides limited capabilities for searching tweets through an Application Programming Interface (API) based on Representational State Transfer (REST).  [REST](https://doi.org/10.1145/337180.337228) is an approach to using web-based Hypertext-Transfer Protocol (HTTP) requests as APIs. 

Essentially, a REST API specifies conventions for HTTP requests that might be used to retrieve specific data items from a remote server. Unlike traditional HTTP requests, which return HTML markup to be rendered in web browsers, REST APIs return data formatted in XML or JSON, suitable for interpretation by computer programs. REST APIs from familiar websites underlie frequently-seen functionality such as embedded twitter widgets and "like/share" links, among others.
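To make the idea concrete, here is a minimal sketch of how a program might interpret a JSON response. The response body and its field names (`results`, `user`, `text`) are made up for illustration and are not the actual Twitter schema; `extract_texts` is a hypothetical helper.

```python
import json

# Hypothetical JSON body, shaped loosely like a search response.
# The field names are illustrative assumptions, not the Twitter schema.
sample_body = """
{
  "results": [
    {"id_str": "1", "text": "first message", "user": {"name": "Alice"}},
    {"id_str": "2", "text": "second message", "user": {"name": "Bob"}}
  ]
}
"""


def extract_texts(body):
    """Parse a JSON response body and return (author, text) pairs."""
    data = json.loads(body)
    return [(item["user"]["name"], item["text"]) for item in data["results"]]


pairs = extract_texts(sample_body)
for author, text in pairs:
    print(author, "->", text)
```

Because the response is structured data rather than HTML markup, the program can pick out exactly the fields it needs without any page scraping.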

Commercial REST applications often use "API-Keys" - unique identifiers used to associate requests with registered accounts. Here, we will walk through the process of registering for Twitter API keys and using a Python library to manage the details of making a Twitter API request and receiving a response.

### 1.1 Registering for a Twitter API key

1.1.1 *Sign up for Twitter*: The first step in registering for a Twitter API key is to [sign up](https://twitter.com/signup) for an account. If you don't want to post anything or to use the account in any way that might be linked to your regular email address, you might want to create a special-purpose account using a service such as Gmail, and use this new email address for the Twitter account.

1.1.2 *Create a Twitter application*: Go to [Twitter's developer site](https://dev.twitter.com) and click on "My Apps". Click on "Create New App" in the upper right and then fill out the form. The main things that you need to focus on here are the application name, description, and website. The rest can be ignored.

Creating the application will lead to the display of some information with some URLs and a few tabs. Look under "Keys and Access Tokens" to see the Consumer API key and API Secret - these will come in handy later.

There will also be a button that says "Create my access token". Press this button and make a note of the Access Token and Access Token Secret values that are displayed. 

Although these tokens are always available on the application page, for the purpose of this exercise, it's best to store them in Python variables directly in this Jupyter notebook. Execute the following instructions, substituting the keys for your application for the phrases "YOUR-CONSUMER-KEY", etc.

In [3]:
consumer_key = 'YOUR-CONSUMER-KEY'
consumer_secret = 'YOUR-CONSUMER-SECRET'
access_token = 'YOUR-ACCESS-TOKEN'
access_secret = 'YOUR-ACCESS-SECRET'
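Keys typed directly into a notebook are easy to leak when the notebook is shared. As an alternative, the sketch below reads the four values from environment variables. The variable names (`TWITTER_CONSUMER_KEY` and so on) and the helper `load_credentials` are assumptions made for this example, not part of Twitter or Tweepy.

```python
import os


def load_credentials(env=os.environ):
    """Read the four Twitter credentials from a mapping of environment variables.

    The variable names below are hypothetical; use whatever names you export.
    """
    names = {
        "consumer_key": "TWITTER_CONSUMER_KEY",
        "consumer_secret": "TWITTER_CONSUMER_SECRET",
        "access_token": "TWITTER_ACCESS_TOKEN",
        "access_secret": "TWITTER_ACCESS_SECRET",
    }
    missing = [var for var in names.values() if var not in env]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {key: env[var] for key, var in names.items()}


# Demonstration with a stand-in mapping instead of the real environment.
fake_env = {
    "TWITTER_CONSUMER_KEY": "ck",
    "TWITTER_CONSUMER_SECRET": "cs",
    "TWITTER_ACCESS_TOKEN": "at",
    "TWITTER_ACCESS_SECRET": "as",
}
creds = load_credentials(fake_env)
print(sorted(creds))
```

In a real session you would call `load_credentials()` with no argument and assign the returned values to `consumer_key`, `consumer_secret`, `access_token`, and `access_secret`.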

In theory, you now have all that you need to start accessing Twitter. Using these keys and the information in the [Twitter Developer Documentation](https://dev.twitter.com/docs), you might conceivably create web requests to search for tweets, post, and read your timeline. In practice, it's a bit more complicated, so most folks use third-party tools that take care of the hard work.

1.1.3 *Try the Tweepy library*: [Tweepy](http://www.tweepy.org) is a Python 3 library for using the Twitter API. Like other similar libraries - there are many for Python and other languages - Tweepy takes care of the details of authorization and provides a few simple function calls for accessing the API.  

The first step in using Tweepy is *authorization* - establishing your credentials for using the Twitter API. Tweepy uses the [OAuth](http://www.oauth.net) authorization framework, which is widely used for both API and user access to services provided over HTTP. Fortunately, Tweepy hides the OAuth details. All you need to do is to make a few calls to the Tweepy library and you're all set to go. Run the following code, making sure that the four variables are set to the values you were given when you registered your Twitter application:

In [5]:
import tweepy
from tweepy import OAuthHandler

auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth)


If this worked correctly, evaluating `api` in a new cell should display something like this:
```
<tweepy.api.API at 0x109da36d8>
``` 

If you get an error message, please check your keys and tokens to ensure that they are correct.

## 2. Twitter data retrieval

Now that you have successfully accessed the Twitter API, it's time to access the data. The simplest thing to do is to grab some Tweets off of your timeline. Try the following code:

In [6]:
top_ten = []
for tweet in tweepy.Cursor(api.home_timeline).items(10):
    top_ten.append(tweet._json)
    

There are several key components to this block of code:
* ```api.home_timeline``` is a method of the API object that retrieves your home timeline - the tweets shown on your home page.
* ```tweepy.Cursor``` is a construct in the Tweepy API that supports navigation through a large set of results.
* ```tweepy.Cursor(api.home_timeline).items(10)``` essentially asks Tweepy to set up a cursor for the home timeline and then to get the first 10 items in that set. The result is a Python iterator, which can be used to examine the items in the set in turn.
* We will grab the JSON representation of each tweet (stored as "tweet.\_json") for maximum flexibility.
* The loop takes each of those objects and adds it to a Python list.
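The cursor idea can be illustrated without any network access. The sketch below is not Tweepy's implementation; it is a toy generator, with a hypothetical `fetch_page` callable standing in for an API call, that requests pages only as needed and stops once a limit is reached.

```python
def paged_items(fetch_page, limit):
    """Yield up to `limit` items, requesting pages only as needed.

    `fetch_page(page_number)` is a hypothetical callable that returns a
    list of items, or an empty list when no results remain.
    """
    page = 0
    yielded = 0
    while yielded < limit:
        batch = fetch_page(page)
        if not batch:
            return
        for item in batch:
            if yielded >= limit:
                return
            yield item
            yielded += 1
        page += 1


# Stand-in data source: 25 fake items served 10 per page.
fake_items = ["tweet-%d" % n for n in range(25)]


def fake_fetch(page, page_size=10):
    start = page * page_size
    return fake_items[start:start + page_size]


first_twelve = list(paged_items(fake_fetch, 12))
print(len(first_twelve), first_twelve[-1])
```

The caller only sees a flat stream of items; the page boundaries stay hidden, which is the convenience a cursor provides.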

Each item in ```top_ten``` is now a Python dictionary holding the JSON representation of a tweet. Let's take a look inside, starting with the first tweet:

In [16]:
tweet1=top_ten[0]

and looking at its text:

In [17]:
tweet1['text']

'RT @LIGO: Next Monday Oct 16 at 10:00 EDT join @LIGO @ego_virgo scientists for #GravitationalWaves update. More news soon at https://t.co/R…'

.. noting that the text is roughly 140 characters long...

In [18]:
len(tweet1['text'])

140

We can also examine when the tweet was created...

In [19]:
tweet1['created_at']

'Thu Oct 12 12:25:38 +0000 2017'

.. whether it has been favorited...

In [20]:
tweet1['favorited']

False

.. the unique ID string of the tweet...

In [21]:
tweet1['id_str']

'918452599136948224'

.. and the name of the Twitter user responsible for the post. 

In [22]:
tweet1['user']['name']

'National Science Fdn'

We can check to see if a tweet is a retweet by seeing if it has the 'retweeted_status' attribute.

In [23]:
'retweeted_status' in tweet1

True

If the tweet is a retweet, the <em>retweeted_status</em> field will hold information about the original tweet, such as the name of the original author:

In [24]:

if 'retweeted_status' in tweet1:
    original = tweet1['retweeted_status']
    print(original['user']['name'])
else:
    print("not a retweet")

LIGO


The Twitter API supports many other details for users, tweets, and other entities. See [The Twitter API Overview](https://dev.twitter.com/overview/api) for general details and subpages about [Tweets](https://dev.twitter.com/overview/api/tweets), [Users](https://dev.twitter.com/overview/api/users) and related pages for specific details of other data types.
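The fields examined above are plain strings and nested dictionaries, so they are easy to post-process. As a small sketch, the hypothetical helpers below convert a `created_at` string into a timezone-aware `datetime` and look up the original author of a retweet. The sample dictionary reuses only the values shown in the outputs above and omits every other field.

```python
from datetime import datetime, timezone


def parse_created_at(value):
    """Convert a created_at string such as 'Thu Oct 12 12:25:38 +0000 2017'."""
    return datetime.strptime(value, "%a %b %d %H:%M:%S %z %Y")


def original_author(tweet):
    """Return the original author's name for a retweet, or None otherwise."""
    original = tweet.get("retweeted_status")
    if original is None:
        return None
    return original["user"]["name"]


# Trimmed-down stand-in for a tweet dictionary; real tweets carry many more fields.
sample = {
    "created_at": "Thu Oct 12 12:25:38 +0000 2017",
    "user": {"name": "National Science Fdn"},
    "retweeted_status": {"user": {"name": "LIGO"}},
}

created = parse_created_at(sample["created_at"])
print(created.isoformat(), original_author(sample))
```

Working with `datetime` objects rather than raw strings makes it straightforward to sort tweets or group them by day.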

## 3. Searching for tweets

Our next major goal will be to search for Tweets. Effective searching requires both construction of useful queries (the hard part) and use of the Tweepy search API (the easy part).

### 3.1 Formulating a query

Formulating an effective search query is often a challenging, iterative process. Trying some searches in the Twitter web page is a good way to see both how a query might be formulated and which queries might be most useful.

If you look carefully at the URL bar in your browser after running a search, you might notice that the search term is embedded in the URL. Thus, if you search for "depression", you might see a URL that looks like https://twitter.com/search?q=depression. You might also see "&src=typed" at the end of the URL, indicating that the search was typed by hand.
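The relationship between a search term and the URL can be reproduced with Python's standard `urllib.parse` module. This is only a sketch of how query strings are encoded and decoded; `build_search_url` and `search_term` are hypothetical helper names, and the code does not contact Twitter.

```python
from urllib.parse import urlencode, urlparse, parse_qs


def build_search_url(term, base="https://twitter.com/search"):
    """Embed a search term in a URL as the q parameter."""
    return base + "?" + urlencode({"q": term})


def search_term(url):
    """Recover the q parameter from a search URL, or None if absent."""
    params = parse_qs(urlparse(url).query)
    return params.get("q", [None])[0]


url = build_search_url("quit smoking")
print(url)
print(search_term("https://twitter.com/search?q=depression&src=typed"))
```

Note that spaces and other special characters are escaped when the URL is built and restored when it is parsed.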

You can also use Tweepy to conduct a search, as follows:

In [25]:
tlist = api.search("smoking",lang="en",count=10)
tweets = [t._json for t in tlist]

This search will find the first 10 English tweets matching the term "smoking".

In [26]:
tweets[0]['text']

'girl tf you smoking https://t.co/6YZTsxEZns'

We can then look at the text for these tweets. This is a good way to check to ensure that we're getting what we think we should be getting.

In [27]:
texts = [c['text'] for c in tweets]

In [28]:
texts

['girl tf you smoking https://t.co/6YZTsxEZns',
 "RT @samuelinfirmier: Knock Knock ... I Don't Feel Myself So Marketing Orientated , While Smoking #Weed Enjoying #Beer or #Champagne\n\n@GiGiH…",
 'Ayos lang sayo nag vvape? — Im against smoking and i hate the smell of vape and cigarette so no https://t.co/4Yy5UVD1dX',
 'RT @Iamjustcoke: I was smoking a wood while getting head and da bitch called me disrespectful',
 "RT @samuelinfirmier: Knock Knock ... I Don't Feel Myself So Marketing Orientated , While Smoking #Weed Enjoying #Beer or #Champagne\n\n@GiGiH…",
 'RT @Kasper23: Smoking on this OG Diesel Kush listening to my boy @CoreyFinesse \n#BussinChecks 💰',
 "RT @flyhigh2NE1: Jiyong out and about in Paris?? I don't like you smoking but I guess that's your way of coping up with stress. Stay health…",
 "RT @samuelinfirmier: Knock Knock ... I Don't Feel Myself So Marketing Orientated , While Smoking #Weed Enjoying #Beer or #Champagne\n\n@GiGiH…",
 'RT @Vichekesho_254: Continue smoking weed

You may see some tweets that don't match exactly - perhaps using 'smoke' or 'smoked' instead of 'smoking'. This suggests that Twitter uses <em>stemming</em> - removing suffixes and variations to get to the core of the word - to match more forms of a search term.
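To show what stemming does, here is a deliberately crude suffix-stripping sketch. It is an illustration only: the suffix list is an arbitrary assumption, it is not the algorithm Twitter uses, and real stemmers (such as the Porter stemmer) apply far more careful rules.

```python
def crude_stem(word, suffixes=("ing", "ed", "ion", "s")):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


for w in ["smoking", "smoked", "depression", "depressed"]:
    print(w, "->", crude_stem(w))
```

Words that reduce to the same stem would be treated as matches for one another, which is why a search for one form of a word can return tweets containing another.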

At this point, we should be able to evaluate the results to see if we are on the right track. If we aren't, we'd want to try some different queries. For now, it looks good, so let's move on.

### 3.2 Collecting and characterizing a larger corpus

Our original query only retrieved 10 tweets. This is a good start, but probably not enough for anything serious. We can loop through several times to create a longer list, with a delay between searches to avoid overstaying our welcome with Twitter:

In [29]:
import time
for i in range(10):
    new_tweets = api.search("smoking",lang="en",count=100)
    nt = [t._json for t in new_tweets]
    tweets= tweets+nt
    time.sleep(5)
    

In [None]:
len(tweets)

At this point, we might want to know something about the tweets that we have retrieved. As our goal is to shoot for linguistic diversity, we want to make sure that we don't have too many retweets, and that we have a wide range of authors. Let's run through the tweets and count the number of authors and retweets. We can count authors in a dictionary and retweets in a simple variable.

In [None]:
authors={}
retweets=0
for t in tweets:
    # is it a retweet? If so, increment
    if 'retweeted_status' in t:
        retweets = retweets+1
    # get tweet author name
    uname = t['user']['name']
    # if not in authors, put it in with zero tweets
    if uname not in authors:
        authors[uname]=0
    authors[uname]=authors[uname]+1

In [None]:
retweets

In [None]:
len(tweets)

In [None]:
len(authors.keys())

We might see a lot of retweets here - I saw at least 80% in one instance, with about 193 authors. This suggests that this corpus has a good many authors with multiple tweets. 

To explore this, let's look at the histogram of the number of tweets per author. We can use the [NumPy](http://www.numpy.org) and [Matplotlib](http://matplotlib.org) libraries to extract the number of tweets from each user (given by authors.values()) and to plot a histogram...

In [None]:
vals = np.array(list(authors.values()))
plt.xticks(range(min(vals),max(vals)+1))
plt.hist(vals,np.arange(min(vals)-0.5,max(vals)+1.5));

It looks like a broad range of the number of tweets/user, up until 10 tweets, with many users having 10 tweets. This is an interesting pattern, with no immediately obvious interpretation.
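The same counts can be tabulated without plotting by using `collections.Counter` from the standard library. The sketch below runs on a small made-up list of tweet dictionaries (the names are placeholders), so the numbers it prints say nothing about the real corpus.

```python
from collections import Counter

# Made-up stand-ins for tweet dictionaries; only the fields used here are present.
sample_tweets = [
    {"user": {"name": "alice"}, "retweeted_status": {"id_str": "1"}},
    {"user": {"name": "alice"}},
    {"user": {"name": "bob"}, "retweeted_status": {"id_str": "1"}},
    {"user": {"name": "carol"}},
    {"user": {"name": "alice"}},
]

# Number of tweets per author.
tweets_per_author = Counter(t["user"]["name"] for t in sample_tweets)

# Number of retweets: True counts as 1 when summed.
retweet_count = sum("retweeted_status" in t for t in sample_tweets)

# How many authors posted exactly n tweets.
distribution = Counter(tweets_per_author.values())

for n in sorted(distribution):
    print("%d tweet(s): %s" % (n, "#" * distribution[n]))
```

Replacing `sample_tweets` with the list of retrieved tweets would give a text version of the histogram.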

Given the number of retweets and the frequency of posting by some authors, we might be concerned that we are seeing repeated tweets. To check this, we will review the tweet IDs in a manner similar to that which we used for the authors, to see how many of the tweets are unique.

As we do this, we'll create a dictionary that will allow us to retrieve tweets by IDs. Each element of the dictionary will itself be a dictionary, containing the full tweet and the number of times it occurs in the dataset. Later, we'll add to this structure.

In [None]:
utweets={}
for t in tweets:
    if t['id_str'] not in utweets:
        new_entry={}
        new_entry['count']=0
        new_entry['tweet']=t
        utweets[t['id_str']]=new_entry
        
    utweets[t['id_str']]['count']=utweets[t['id_str']]['count']+1
len(utweets)

Now, we can turn this dictionary into a list of id, count pairs, sort by count, and see which ones were repeated most often.

In [None]:
ps = []
for t,entry in utweets.items():
    count = entry['count']
    ps.append((t,count))  
ps.sort(key=lambda x: x[1],reverse=True)

Hmm.. only a small portion of our tweets are unique

In [None]:
float(len(utweets))/float(len(tweets))

This leads to a question - how can we generate a large set of unique tweets, so as to ensure diversity of results? Our techniques for checking uniqueness provide an answer. We can retrieve tweets, checking as we go to see if we've seen them before, and discarding tweets that are repeats. This will continue until we have a large enough set.

In [None]:
tweets={}
corpus_size=1000
while (len(tweets) < corpus_size):
    new_tweets = api.search("smoking",lang="en",count=100)
    for nt_json in new_tweets:
        nt = nt_json._json
        if nt['id_str'] not in tweets:
            new_entry={}
            new_entry['count']=0
            new_entry['tweet']=nt
            tweets[nt['id_str']]=new_entry
        tweets[nt['id_str']]['count'] = tweets[nt['id_str']]['count']+1
    # wait to give our twitter account a break..
    time.sleep(10)

In [None]:
len(tweets)
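One caveat about the loop above: if repeated searches keep returning tweets we have already seen, it may run for a very long time. A possible variant, sketched here with a hypothetical `fetch_batch` callable in place of the real search call, caps the number of requests. The `max_requests` parameter and the fake data source are assumptions for this example, and the delay between requests is omitted only because nothing here touches the network.

```python
import itertools


def collect_unique(fetch_batch, corpus_size, max_requests=50):
    """Gather unique tweets by id, stopping at corpus_size or max_requests."""
    unique = {}
    for _ in range(max_requests):
        if len(unique) >= corpus_size:
            break
        for tweet in fetch_batch():
            # Create the entry on first sight, then count every occurrence.
            entry = unique.setdefault(tweet["id_str"], {"count": 0, "tweet": tweet})
            entry["count"] += 1
    return unique


# Fake data source: each batch of four overlaps the previous batch by two ids.
_batch_number = itertools.count()


def fake_fetch_batch(batch_size=4):
    start = next(_batch_number) * 2
    return [{"id_str": str(n), "text": "tweet %d" % n}
            for n in range(start, start + batch_size)]


corpus = collect_unique(fake_fetch_batch, corpus_size=10)
print(len(corpus), corpus["2"]["count"])
```

With a real API call, a `time.sleep` between requests should be kept, as in the loop above.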

### 3.3 Saving tweets

Now, we've got a good solid set of tweets to work with. Let's save these tweets to a file, using the [jsonpickle](https://jsonpickle.github.io/) library to convert the structure into a JSON file, which we will then write to disk. We'll define a function to do this, as we might want to repeat this later.

In [None]:
def saveTweets(tweets,filename):
    json_data =jsonpickle.encode(tweets)
    with open(filename,'w') as f:
        json.dump(json_data,f)

In [None]:
saveTweets(tweets,'tweet.json')

Now that that's done, we can read it in again. Once again, we'll write a function.

In [None]:
def readTweets(filename):
    with open(filename,'r') as f:
        json_data = json.load(f)
    tweets = jsonpickle.decode(json_data)
    return tweets

In [None]:
tweets2 = readTweets('tweet.json')

In [None]:
len(tweets2)

In [None]:
tweets == tweets2

Note that we might want to add additional fields to this file later. We can always rewrite the file as needed. Saving the file as is gives us a good record that we can work from, without having to recreate the dataset. For subsequent exercises, you can start from this point, without running any of the prior code.
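Because each entry stores the tweet as a plain dictionary, the standard `json` module on its own may be enough for a round trip; jsonpickle becomes necessary mainly when the structure holds objects that are not plain dictionaries, lists, strings, or numbers. The sketch below assumes plain data; the function names, file name, and sample entry are hypothetical.

```python
import json
import os
import tempfile


def save_tweets_json(tweets, filename):
    """Write a dictionary of plain-data tweet entries to a JSON file."""
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(tweets, f)


def read_tweets_json(filename):
    """Read the dictionary back from a JSON file."""
    with open(filename, "r", encoding="utf-8") as f:
        return json.load(f)


# Hypothetical entry in the same shape as the corpus built above.
sample = {"1": {"count": 1, "tweet": {"id_str": "1", "text": "example text"}}}

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "tweets_plain.json")
    save_tweets_json(sample, path)
    restored = read_tweets_json(path)

print(restored == sample)
```

A file written this way is ordinary JSON, which other tools can read directly.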

## 4. Annotating Tweets

### 4.1 Open Coding

Now that we have a corpus of tweets, what do we want to do with them? Turning a relatively vague notion into a well-defined research question is often a significant challenge, as examination of the data often reveals both shortcomings and unforeseen opportunities.

In our case, we are interested in looking at tweets about smoking, but we're not quite sure exactly *what* we are looking for. We have a vague notion that we might learn something interesting, but understanding exactly what that is, and what sort of analyses we might need, will require a bit more work.

In situations such as this, we might look at some of the data to form some preliminary impressions of the content. Specifically, we can look at individual tweets, assigning them to one or more categories - known as *codes* - based on their content. We can add categories as needed to capture important ideas that we might want to refer back to. This practice - known as *open coding* - allows us to begin to make sense of unfamiliar data sets.

This sounds much more complicated than it is. For now, let's begin by taking a look at a subset of the tweets. Keep in mind that *tweets* is a dictionary mapping id strings to information about tweets. Each entry in *tweets* is itself a dictionary, with 'count' corresponding to the number of times the tweet was found, and 'tweet' corresponding to the tweet itself. We're going to add some categories to that dictionary, but we need to start by getting a smaller set of tweets.

In [None]:
i=0
working=[]
for id,entry in tweets.items():
    working.append(entry)
    i = i +1
    if i > 99:
        break

*working* now has 100 tweets. Let's start with the first.

In [None]:
td = working[0]

In [None]:
td['tweet']['text']

This tweet has several interesting characteristics:
1. It is a retweet
2. It refers to another Twitter user
3. It mentions marijuana ('weed') in particular
4. It suggests an intent.

We can model all of these points through relevant annotation:

In [None]:
td['code']=[]
td['code'].append('USERMENTION')
td['code'].append('MARIJUANA')
td['code'].append("INTENT")
td['code'].append("RETWEET")
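Since the same few steps recur for every tweet, a small helper can cut down on repetition. The function below is a hypothetical convenience, not part of the original workflow: it creates the 'code' list when it is missing and skips codes that are already present, so re-running a cell does not add duplicates.

```python
def add_codes(entry, *codes):
    """Attach codes to a tweet entry, creating the list and skipping duplicates."""
    existing = entry.setdefault("code", [])
    for code in codes:
        if code not in existing:
            existing.append(code)
    return existing


# Hypothetical entry in the same shape as the items in `working`.
entry = {"count": 1, "tweet": {"text": "hypothetical tweet text"}}
add_codes(entry, "USERMENTION", "MARIJUANA")
add_codes(entry, "RETWEET", "USERMENTION")
print(entry["code"])
```

With such a helper, each annotation step becomes a single call such as `add_codes(td, 'USERMENTION', 'RETWEET')`.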

Let's look at the next tweet.

In [None]:
td = working[1]
td['tweet']['text']

This is a retweet, mentioning a user. It also mentions another drug ('Crack').

In [None]:
td['code']=[]
td['code'].append('USERMENTION')
td['code'].append('CRACK')
td['code'].append("RETWEET")

We can also confirm that this is a retweet by checking for the retweeted_status key:

In [None]:
'retweeted_status' in td['tweet']

OK, moving on to the third tweet...

In [None]:
td = working[2]
td['tweet']['text']

This is also a retweet with a user mention, but the notion of the 'car smoking' suggests that this tweet is not directly related to smoking of tobacco or other drugs, so we will call it irrelevant.

In [None]:
td['code']=[]
td['code'].append('USERMENTION')

td['code'].append("RETWEET")
td['code'].append('IRRELEVANT')

next...

In [None]:
td = working[3]
td['tweet']['text']

In [None]:
td['tweet']['user']

Here, we have an example of a positive statement - an affirmation of success.

In [None]:
td['retweet']=True
td['code']=[]
td['code'].append("AFFIRMATION")

In [None]:
td = working[4]
td['tweet']['text']

This retweet uses strong language to express frustration 

In [None]:
td['retweet']=True
td['code']=[]
td['code'].append("LANGUAGE")
td['code'].append("FRUSTRATION")

In [None]:
td = working[5]
td['tweet']['text']

This is a retweet sharing advice

In [None]:
td['retweet']=True
td['code']=[]
td['code'].append("SHARE")
td['code'].append("ADVICE")

In [None]:
td = working[6]
td['tweet']['text']

A retweet containing a report on a family member's reaction

In [None]:
td['retweet']=True
td['code']=[]
td['code'].append("FAMILY")

In [None]:
td = working[7]
td['tweet']['text']

This tweet mentions external perceptions of depression

In [None]:
td['code']=[]
td['code'].append("EXTERNAL PERCEPTIONS")

In [None]:
td = working[8]
td['tweet']['text']

This is a retweet with a link. The rest of the text is a bit unclear...

In [None]:
td['retweet']=True
td['code']=[]
td['code'].append("LINK")

And so it goes. You might have to code 100 or more tweets to get a good distribution.

*EXERCISE*: Code the first 50 tweets in the set. You might have to re-run the loop given above to make a list of tweets you can work with.

--- 
Following lines to be deleted when provided for student use

In [None]:
td = working[9]
td['tweet']['text']

In [None]:
td['code']=[]
td['code'].append("LINK")
td['code'].append("QUESTION")

In [None]:
td = working[10]
td['tweet']['text']

In [None]:
td['retweet']=True
td['code']=[]
td['code'].append("USERMENTION")
td['code'].append("CONCERN")

In [None]:
td = working[11]
td['tweet']['text']

In [None]:
td['retweet']=True
td['code']=[]
td['code'].append("USERMENTION")
td['code'].append("MISUNDERSTANDINGS")

In [None]:
td = working[12]
td['tweet']['text']

In [None]:
td['code']=[]
td['code'].append("USERMENTION")

In [None]:
td = working[13]
td['tweet']['text']

In [None]:
td['retweet']=True
td['code']=[]
td['code'].append("USERMENTION")
td['code'].append("CULTURE")

In [None]:
td = working[14]
td['tweet']['text']

In [None]:
td['code']=[]
td['code'].append("USERMENTION")
td['code'].append("IRRELEVANT")

In [None]:
td = working[16]
td['tweet']['text']

In [None]:
td['code']=[]
td['retweet']=True
td['code'].append("USERMENTION")
td['code'].append("INFORMATIONREQUEST")

In [None]:
td = working[17]
td['tweet']['text']

In [None]:
td['code']=[]
td['code'].append("LINK")
td['code'].append("SHARE")
td['code'].append("ADVICE")

In [None]:
td = working[18]
td['tweet']['text']

In [None]:
td['code']=[]
td['code'].append("LANGUAGE")
td['code'].append("FRUSTRATION")

In [None]:
td = working[19]
td['tweet']['text']

In [None]:
td['retweet']=True
td['code']=[]
td['code'].append("USERMENTION")
td['code'].append("LINK")
td['code'].append("ARTS")

In [None]:
td = working[20]
td['tweet']['text']

In [None]:
td['retweet']=True
td['code']=[]
td['code'].append("IRRELEVANT")

In [None]:
tweetJson = jsonpickle.encode(tweets)
with open('tweet.json','w') as f:
    json.dump(tweetJson,f)

In [None]:
with open('tweet.json','r') as f:
    tweetJson=json.load(f)
tweets=jsonpickle.decode(tweetJson)

### 4.2 Summarizing and exploring coding categories

Once we have a good set of categories, we can iterate over the tweets and create a dictionary mapping codes to relevant tweets.

-----

In [None]:
codeDict={}
for id,entry in tweets.items():
    # for every tweet, look to see if we have any codes
    if 'code' in entry:
        # for each code
        for code in entry['code']:
            # look for it in codeDict, creating a new list of tweet ids if needed
            if code not in codeDict:
                codeDict[code]=[]
            # add the id to the dictionary 
            codeDict[code].append(id)

In [None]:
codeDict
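The same index can be built a little more compactly with `collections.defaultdict`, and once it exists it is easy to see which codes occur most often. The sketch below uses a made-up set of three coded entries; `index_by_code` is a hypothetical helper name.

```python
from collections import defaultdict


def index_by_code(tweets):
    """Map each code to the list of tweet ids carrying that code."""
    index = defaultdict(list)
    for tweet_id, entry in tweets.items():
        # Entries that have not been coded yet simply contribute nothing.
        for code in entry.get("code", []):
            index[code].append(tweet_id)
    return dict(index)


# Made-up entries in the same shape as the annotated corpus.
sample = {
    "1": {"count": 1, "tweet": {}, "code": ["RETWEET", "USERMENTION"]},
    "2": {"count": 1, "tweet": {}, "code": ["RETWEET"]},
    "3": {"count": 1, "tweet": {}},
}

index = index_by_code(sample)

# Codes ordered from most to least frequent.
frequencies = sorted(((len(ids), code) for code, ids in index.items()), reverse=True)
for count, code in frequencies:
    print(code, count)
```

A frequency listing like this can help decide which codes are common enough to analyze further and which might be merged or dropped.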