# Social Media and Human-Computer Interaction - Part 2


###  *Goal*: Use social media posts to explore the application of text and natural language processing to see what might be learned from online interactions.

Specifically, we will retrieve, annotate, process, and interpret Twitter data on health-related issues such as smoking.

--- 
References:
* [Mining Twitter Data with Python (Part 1: Collecting data)](https://marcobonzanini.com/2015/03/02/mining-twitter-data-with-python-part-1/)
* The [Tweepy Python API for Twitter](http://www.tweepy.org/)

Required Software
* [Python 3](https://www.python.org)
* [NumPy](http://www.numpy.org) - for preparing data for plotting
* [Matplotlib](https://matplotlib.org) - for plots and graphs
* [jsonpickle](https://jsonpickle.github.io) - for storing tweets
---

In [193]:
%matplotlib inline

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import jsonpickle
import json
import random
import tweepy
import time

# Introduction

This module continues the Social Media Data Science module started in [Part 1](SocialMedia - Part 1.ipynb), covering the annotation of tweets and subsequent textual and natural language processing analysis:
  1. Annotation of tweets
  2. Natural Language Processing
  3. Examination of text patterns
  4. Construction of classifiers
  5. Exercises and next steps
  
Our case study will apply these topics to Twitter discussions of smoking and tobacco. Although details of the tools used to access data and the format and content of the data may differ for various services, the strategies and procedures used to analyze the data will generalize to other tools.

## 0. Setup

Before we dig in, we must grab a bit of code from [Part 1](SocialMedia - Part 1.ipynb):

1. The `searchTwitter` routine for grabbing tweets.
2. The `readTweets` routine that we wrote to read tweets in from a file.
3. Our Twitter API keys - be sure to copy the keys that you generated when you completed *Part 1*.
4. Configuration of our Twitter connection

In [185]:
def searchTwitter(term,corpus_size):
    tweets={}
    while (len(tweets) < corpus_size):
        new_tweets = api.search(term,lang="en",count=10)
        for nt_json in new_tweets:
            nt = nt_json._json
            if nt['id_str'] not in tweets:
                new_entry={}
                new_entry['count']=0
                new_entry['tweet']=nt
                tweets[nt['id_str']]=new_entry
            tweets[nt['id_str']]['count'] = tweets[nt['id_str']]['count']+1
        # wait to give our twitter account a break..
        time.sleep(10)
    return tweets
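
As written, `searchTwitter` keeps looping until it has collected `corpus_size` distinct tweets, so a rarely used search term could keep it running for a very long time. The following is a minimal sketch of one way to bound the loop, not a replacement for the routine from Part 1. The function name `search_bounded` and the parameters `fetch`, `max_rounds`, and `pause` are hypothetical names introduced for illustration; `fetch` is assumed to return status objects that expose the raw tweet as `_json`, as the objects returned by `api.search` do above.

```python
import time

def search_bounded(fetch, term, corpus_size, max_rounds=20, pause=10):
    """Collect up to corpus_size distinct tweets, giving up after max_rounds requests.

    fetch(term) is assumed to return an iterable of status objects that expose
    the raw tweet dictionary as `_json` (hypothetical interface for this sketch).
    """
    tweets = {}
    for _ in range(max_rounds):
        if len(tweets) >= corpus_size:
            break
        for status in fetch(term):
            nt = status._json
            # create the entry the first time we see this id, then count the sighting
            entry = tweets.setdefault(nt['id_str'], {'count': 0, 'tweet': nt})
            entry['count'] += 1
        # wait to give our twitter account a break
        time.sleep(pause)
    return tweets
```

In the notebook, this could be called as `search_bounded(lambda t: api.search(t, lang="en", count=10), "smoking", 100)`. Newer Tweepy releases name the search method `search_tweets`, so the lambda may need adjusting for your installed version.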

In [6]:
def readTweets(filename):
    with open(filename,'r') as f:
        json_data = json.load(f)
    tweets = jsonpickle.decode(json_data)
    return tweets

In [187]:
# Placeholders only: substitute the keys that you generated in Part 1.
consumer_key='YOUR_CONSUMER_KEY'
consumer_secret='YOUR_CONSUMER_SECRET'
access_token='YOUR_ACCESS_TOKEN'
access_secret='YOUR_ACCESS_SECRET'
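
Because these keys grant access to your Twitter account, you may prefer not to type them into a notebook that could be shared. One possible approach, sketched here under assumptions, is to read them from environment variables. The function name `load_twitter_keys` and the four variable names are hypothetical choices for this example, not a Tweepy convention.

```python
import os

def load_twitter_keys(env=os.environ):
    """Read the four Twitter credentials from environment variables.

    The variable names below are assumptions made for this sketch.
    """
    names = {
        'consumer_key': 'TWITTER_CONSUMER_KEY',
        'consumer_secret': 'TWITTER_CONSUMER_SECRET',
        'access_token': 'TWITTER_ACCESS_TOKEN',
        'access_secret': 'TWITTER_ACCESS_SECRET',
    }
    missing = [var for var in names.values() if var not in env]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {key: env[var] for key, var in names.items()}
```

The returned dictionary can then supply the values used in the connection cell, for example `keys = load_twitter_keys()` followed by `consumer_key = keys['consumer_key']`.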

In [188]:
from tweepy import OAuthHandler

auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth)

## 1. Annotating Tweets

### 1.1 Open Coding

Now that we have a corpus of tweets, what do we want to do with them? Turning a relatively vague notion into a well-defined research question is often a significant challenge, as examination of the data often reveals both shortcomings and unforeseen opportunities.

In our case, we are interested in looking at tweets about smoking, but we're not quite sure exactly *what* we are looking for. We have a vague notion that we might learn something interesting, but understanding exactly what that is, and what sort of analyses we might need, will require a bit more work.

In situations such as this, we might look at some of the data to form some preliminary impressions of the content. Specifically, we can look at individual tweets, assigning them to one or more categories - known as *codes* - based on their content.  We can add categories as needed to capture important ideas that we might want to refer back to. This practice - known as *open coding* - allows us to begin to make sense of unfamiliar data sets.

This sounds much more complicated than it is. For now, let's read in some tweets from a file collected in October 2017 using the procedures discussed in *Part 1*, loading them with the `readTweets` routine defined above. We'll then use those tweets to get to work.

In [7]:
tweets =readTweets("tweet-corpus.json")

We will begin by taking a look at a subset of 100 tweets.  Keep in mind that *tweets* is a dictionary mapping id strings to information about tweets. Each entry in *tweets* is itself a dictionary, with 'count' corresponding to the number of times the tweet was found, and 'tweet' corresponding to the tweet itself.  We're going to add some categories to that dictionary, but we need to start by getting a smaller set of tweets.

To get this list, we'll sort the ids of the tweets and take the first 100 in the list.

In [8]:
ids=list(tweets.keys())
ids.sort()
working=[]
for i in range(100):
    id = ids[i]
    entry = tweets[id]
    working.append(entry)
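
The loop above assumes that the corpus holds at least 100 tweets, and it sorts the ids as strings. The following sketch shows one alternative that also works for smaller corpora and orders the ids numerically. The name `first_n_entries` is hypothetical, and the numeric sort only matters if the id strings differ in length, which may not be the case in this corpus.

```python
def first_n_entries(tweets, n=100):
    """Return the entries for the n smallest tweet ids.

    Ids are compared as integers so that the ordering does not depend on
    all id strings having the same number of digits.
    """
    ordered_ids = sorted(tweets, key=int)
    # slicing never raises, even when fewer than n tweets are available
    return [tweets[tweet_id] for tweet_id in ordered_ids[:n]]
```

With this helper, the cell above could be written as `working = first_n_entries(tweets, 100)`.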

*working* now has 100 tweets. Let's start with the first.

In [9]:
td = working[0]

In [10]:
td['tweet']['text']

'FlTNESS: RT DrugedPosts: "Wyd after smoking this?" https://t.co/OnLywTyJ0X'

This tweet has two interesting characteristics:
1. It is a retweet.
2. It contains a link.

We can model both of these points through relevant annotation. Specifically, we will add a new list, 'code', to each tweet entry. 'code' will contain the categorical annotations associated with the tweet.

In [11]:
td['code']=[]
td['code'].append('LINK')
td['code'].append("RETWEET")
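
Typing the same `append` calls for every tweet is repetitive and makes it easy to add a code twice. As a convenience, a small helper along the following lines could be used instead. This is only a sketch, and `add_codes` is a hypothetical name that does not appear elsewhere in this module.

```python
def add_codes(entry, *codes):
    """Attach codes to a tweet entry, creating the 'code' list if needed
    and skipping any code that is already present."""
    existing = entry.setdefault('code', [])
    for code in codes:
        if code not in existing:
            existing.append(code)
    return existing
```

For example, `add_codes(td, 'LINK', 'RETWEET')` would have the same effect as the cell above on a tweet that has not yet been coded. Unlike the cells in this notebook, the helper does not reset the list first, so earlier codes are kept.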

We can confirm that this is a retweet by checking for the `retweeted_status` attribute.

In [12]:
'retweeted_status' in td['tweet']

False

Hmm. The attribute is not present. Perhaps the user copied the text and added 'RT' without actually retweeting? Something to keep our eyes on for other tweets.
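
To keep an eye on this distinction, we could separate retweets made through Twitter's retweet feature from those that merely carry an 'RT' prefix in the text. The sketch below is one possible heuristic. The name `retweet_kind` and the regular expression are assumptions made for illustration, and the pattern will miss manual retweets written in other styles.

```python
import re

# "RT", an optional "@", a screen name, then a colon, e.g. "RT DrugedPosts:"
MANUAL_RT = re.compile(r'\bRT @?\w+:')

def retweet_kind(tweet):
    """Classify a tweet dictionary as 'native', 'manual', or None."""
    if 'retweeted_status' in tweet:
        return 'native'
    if MANUAL_RT.search(tweet.get('text', '')):
        return 'manual'
    return None
```

Calling `retweet_kind(td['tweet'])` on the tweet above should report a manual retweet, since the text contains 'RT DrugedPosts:' but the `retweeted_status` attribute is absent.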

Let's look at the next tweet.

In [13]:
td = working[1]
td['tweet']['text']

'RT DrugedPosts: "Wyd after smoking this?" https://t.co/PZ3YyYh8WB'

Notice this is similar, but not identical, to the previous tweet. 

In [14]:
td['code']=[]
td['code'].append('LINK')
td['code'].append('RETWEET')

In [15]:
'retweeted_status' in td['tweet']

False

ok.. moving on to the third tweet..

In [16]:
td = working[2]
td['tweet']['text']

'RT @Anzers: #TheBetrayalPapers Video: Part II – In Plain Sight – A National Security Smoking Gun\nhttps://t.co/rpObdW9GcG'

This retweet includes a link, a hashtag reference, and a reference to a `Smoking gun`, suggesting that this is not really a tweet about tobacco, marijuana, or other smoking products. We'll label it `irrelevant`

In [17]:
td['code']=[]
td['code'].append('RETWEET')
td['code'].append('LINK')
td['code'].append('USERMENTION')
td['code'].append('HASHTAG')
td['code'].append('IRRELEVANT')

In [18]:
'retweeted_status' in td['tweet']

True
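
Codes such as `LINK`, `USERMENTION`, `HASHTAG`, and `RETWEET` describe the structure of a tweet rather than its meaning, so they could be suggested automatically and then checked by hand. The sketch below assumes that each tweet dictionary follows the Twitter API v1.1 layout, with an `entities` field holding `urls`, `user_mentions`, `hashtags`, and optionally `media`. The name `structural_codes` is hypothetical.

```python
def structural_codes(tweet):
    """Suggest LINK, USERMENTION, HASHTAG, and RETWEET codes for a tweet dictionary.

    Assumes the v1.1 'entities' layout; missing fields are treated as empty.
    """
    entities = tweet.get('entities', {})
    codes = []
    if entities.get('urls') or entities.get('media'):
        codes.append('LINK')
    if entities.get('user_mentions'):
        codes.append('USERMENTION')
    if entities.get('hashtags'):
        codes.append('HASHTAG')
    if 'retweeted_status' in tweet:
        codes.append('RETWEET')
    return codes
```

Content codes such as `IRRELEVANT` or `ANTI-SMOKING` still require a human reading of the text, so suggestions like these are a starting point for annotation, not a substitute for it.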

next...

In [19]:
td = working[3]
td['tweet']['text']

'RT @FootyMemes: This new anti-smoking ad is really powerful... https://t.co/pWHZDLIb7O'

Here, we have a retweet, a link, and something about anti-smoking.

In [20]:
td['code']=[]
td['code'].append('RETWEET')
td['code'].append('LINK')
td['code'].append('ANTI-SMOKING')

In [21]:
'retweeted_status' in td['tweet']

True

In [22]:
td = working[4]
td['tweet']['text']

'@ericschmidt @jwnichls Stop. You need to stop torturing me. No buddy nobody cares about "smoking." Stop.'

This tweet includes user mentions. It might or might not be relevant.

In [23]:
td['code']=[]
td['code'].append("USERMENTION")
td['code'].append("FRUSTRATION")
td['code'].append("POSSIBLYRELEVANT")

In [24]:
td = working[5]
td['tweet']['text']

'RT @FurnyFootball: Stop smoking 😂 https://t.co/bY1ZvJy63Z'

A retweet with a user mention, an anti-smoking message, and a link.

In [25]:
td['code']=[]
td['code'].append("RETWEET")
td['code'].append("USERMENTION")
td['code'].append("ANTI-SMOKING")
td['code'].append("LINK")

In [26]:
td = working[6]
td['tweet']['text']

'You ever wake up and wish you was still sleep? ... that’s me rn.'

This tweet doesn't seem to be about smoking.

In [27]:
td['code']=[]
td['code'].append("IRRELEVANT")

In [28]:
td = working[7]
td['tweet']['text']

'"Resorted to...". Hahahaha..!  Way to be strong and brave unaided...!  Hahahaha...!  https://t.co/Lnd9N3zBCY via @YahooNews'

This tweet includes a user mention, and a link, but doesn't seem to be relevant to smoking

In [29]:
td['code']=[]
td['code'].append("USERMENTION")
td['code'].append("LINK")
td['code'].append("IRRELEVANT")


In [30]:
td = working[8]
td['tweet']['text']

'RT @chocoo_loco: I just want my friends to stop smoking weed😂 https://t.co/LWI2HVofAf'

This is a retweet with a link, a user mention, and an expression of a desire that the user's friends stop smoking marijuana.

In [31]:
td['code']=[]
td['code'].append("RETWEET")
td['code'].append("USERMENTION")
td['code'].append("LINK")
td['code'].append("ANTI-SMOKING")
td['code'].append("MARIJUANA")
td['code'].append("FRIENDS")
td['code'].append("SENTIMENT")

And so it goes. You might have to code 100 or more tweets to get a good distribution. You'll work on this in a minute, but first, a suggestion. As you code, it might be hard to keep track of which codes you've used. Let's write a routine to collect those codes. Printing this list on occasion will help you remember what you've used and ensure that you don't miss opportunities to reuse codes.

In [32]:
def getCodes(tweets):
    codes = []
    for id, entry in tweets.items():
        # for every tweet, look to see if we have any codes
        if 'code' in entry:
            for code in entry['code']:
                # record each code the first time we see it
                if code not in codes:
                    codes.append(code)
    return codes

In [33]:
getCodes(tweets)

['RETWEET',
 'LINK',
 'ANTI-SMOKING',
 'USERMENTION',
 'MARIJUANA',
 'FRIENDS',
 'SENTIMENT',
 'IRRELEVANT',
 'FRUSTRATION',
 'POSSIBLYRELEVANT',
 'HASHTAG']
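
Beyond knowing which codes exist, it can help to know how often each has been used, since rarely used codes may be candidates for merging or for another look. The following sketch counts code usage with `collections.Counter`. The name `count_codes` is hypothetical.

```python
from collections import Counter

def count_codes(tweets):
    """Count how many annotated tweets carry each code."""
    counts = Counter()
    for entry in tweets.values():
        # set() guards against a code listed twice on the same tweet
        counts.update(set(entry.get('code', [])))
    return counts
```

For example, `count_codes(tweets).most_common(5)` would list the five most frequently used codes along with their counts.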



## EXERCISE 1: Code the Next 50 tweets in the set. 
Start with the tags used above, adding your own as needed.  

--- 
### answer - cut below this line
Following lines to be deleted when provided for student use

In [34]:
td = working[9]
td['tweet']['text']

'No kidding... https://t.co/3Kg2HkfRsc'

In [35]:
td['code']=[]
td['code'].append("LINK")
td['code'].append("IRRELEVANT")

In [36]:
td = working[10]
td['tweet']['text']

'RT @xancaps: smoking by myself now\n\ni don’t need nobody else around'

In [37]:
td['code']=[]
td['code'].append("RETWEET")
td['code'].append("USERMENTION")
td['code'].append("HABITS")

In [38]:
td = working[11]
td['tweet']['text']

'RT @GiveMeInternet: Anti smoking ads should show the benefits of quitting instead of the harms of smoking.'

In [39]:
td['code']=[]
td['code'].append("RETWEET")
td['code'].append("USERMENTION")
td['code'].append("ANTI-SMOKING")
td['code'].append("QUITTING")

In [40]:
td = working[12]
td['tweet']['text']

'@Austin_Sosbee See I have recently started to dream again, Why again? Cause smoking alot of weed stops dreaming, I miss not dreaming lol'

In [41]:
td['code']=[]
td['code'].append("USERMENTION")
td['code'].append("MARIJUANA")
td['code'].append("BENEFITS")

In [42]:
td = working[13]
td['tweet']['text']

'RT @FootyMemes: This new anti-smoking ad is really powerful... https://t.co/pWHZDLIb7O'

In [43]:
td['code']=[]
td['code'].append("RETWEET")
td['code'].append("USERMENTION")
td['code'].append("LINK")
td['code'].append("ANTI-SMOKING")

In [44]:
td = working[14]
td['tweet']['text']

'RT @GiveMeInternet: Anti smoking ads should show the benefits of quitting instead of the harms of smoking.'

In [45]:
td['code']=[]
td['code'].append("RETWEET")
td['code'].append("USERMENTION")
td['code'].append("ANTI-SMOKING")
td['code'].append("QUITTING")

In [46]:
td = working[15]
td['tweet']['text']

"Mngxitama and his nyaope smoking buddies were a no show today because the Guptas aren't targeted hahahaha"

In [47]:
td['code']=[]
td['code'].append("FRIENDS")

In [48]:
td = working[16]
td['tweet']['text']

'RT @_youngkingdave: Smoking #doinks with @WakaFlocka\n#doinksquad https://t.co/zex6zRw4Xx'

In [49]:
td['code']=[]
td['code'].append("RETWEET")
td['code'].append("USERMENTION")
td['code'].append("LINK")
td['code'].append("MARIJUANA")

In [50]:
td = working[17]
td['tweet']['text']

'RT @OnlyWayIsShawtz: My boy stopped smoking weed the day he spent 30 minutes looking for his phone under the bed.. While using his phone fl…'

In [51]:
td['code']=[]
td['code'].append("RETWEET")
td['code'].append("USERMENTION")
td['code'].append("MARIJUANA")
td['code'].append("IMPACT")

In [52]:
td = working[18]
td['tweet']['text']

'RT @GiveMeInternet: Anti smoking ads should show the benefits of quitting instead of the harms of smoking.'

In [53]:
td['code']=[]
td['code'].append("RETWEET")
td['code'].append("USERMENTION")
td['code'].append("ANTI-SMOKING")
td['code'].append("QUITTING")

In [54]:
td = working[19]
td['tweet']['text']

"i @KattyKayBBC 'Cannabis is a gateway 2 taking Heroin' were did u hear that pish? Joint smoking 1 day next injecting in2 souls of feet no-no"

In [55]:
td['code']=[]
td['code'].append("USERMENTION")
td['code'].append("MARIJUANA")
td['code'].append("OPIATES")

In [56]:
td = working[20]
td['tweet']['text']

"https://t.co/qchtcveqkA is the world's 1st Smoking Model directory. Search for your favorite… https://t.co/21OlmjWv7Z"

In [57]:
td['code']=[]
td['code'].append("LINK")
td['code'].append("IRRELEVANT")

In [58]:
td = working[21]
td['tweet']['text']

'RT @_youngkingdave: Smoking #doinks with @WakaFlocka\n#doinksquad https://t.co/zex6zRw4Xx'

In [59]:
td['code']=[]
td['code'].append("RETWEET")
td['code'].append("USERMENTION")
td['code'].append("LINK")
td['code'].append("MARIJUANA")

In [60]:
working[21]['tweet']['id_str']==working[16]['tweet']['id_str']

False

good. no repeat
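
The check above shows that two entries with identical text can still be distinct tweets. Since the same text recurs many times in this sample, it may be useful to find such groups so that they can be coded consistently. The sketch below groups entries by their text. The name `group_by_text` is hypothetical.

```python
from collections import defaultdict

def group_by_text(entries):
    """Map each tweet text to the positions in `entries` where it appears,
    keeping only texts that occur more than once."""
    positions = defaultdict(list)
    for index, entry in enumerate(entries):
        positions[entry['tweet']['text']].append(index)
    return {text: idxs for text, idxs in positions.items() if len(idxs) > 1}
```

Calling `group_by_text(working)` would return the repeated texts with their positions in `working`, making it easier to compare the codes assigned to each copy.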

In [61]:
td = working[22]
td['tweet']['text']

'@LaurenSocha Make sure people wash who come in to close contact with her, no smoking near her or change smokey clot… https://t.co/iveJEh9sfT'

In [62]:
td['code']=[]
td['code'].append("USERMENTION")
td['code'].append("LINK")
td['code'].append("ADVICE")

In [63]:
td = working[23]
td['tweet']['text']

'RT @wifisfuneral: I love you always brother &amp; I’m proud how far you’ve gone I remember riding in your whips smoking ports plotting on this…'

In [64]:
td['code']=[]
td['code'].append('RETWEET')
td['code'].append("USERMENTION")
td['code'].append("POSITIVEAFFECT")

In [65]:
td = working[24]
td['tweet']['text']

"RT @The_AOP: Well done to everyone who's got this far! Find out how smoking can impact your eye health in our blog: https://t.co/wCxqBvJxdX…"

In [66]:
td['code']=[]
td['code'].append('RETWEET')
td['code'].append("USERMENTION")
td['code'].append("LINK")
td['code'].append("INFORMATION")
td['code'].append("ANTI-SMOKING")

In [67]:
td = working[25]
td['tweet']['text']

'RT @DrugedPosts: "Wyd after smoking this?" https://t.co/n3eGNF4ywY'

In [68]:
td['code']=[]
td['code'].append('LINK')
td['code'].append("RETWEET")

In [69]:
td = working[26]
td['tweet']['text']

'RT @_youngkingdave: Smoking #doinks with @WakaFlocka\n#doinksquad https://t.co/zex6zRw4Xx'

In [70]:
td['code']=[]
td['code'].append("RETWEET")
td['code'].append("USERMENTION")
td['code'].append("LINK")
td['code'].append("MARIJUANA")

In [71]:
td = working[27]
td['tweet']['text']

'@FoodTrapper stop smoking cigarettes.'

In [72]:
td['code']=[]
td['code'].append("USERMENTION")
td['code'].append("ANTI-SMOKING")

In [73]:
td = working[28]
td['tweet']['text']

'Smoker? Learn how quitting cuts your #risk of #heartdisease in half: https://t.co/vRLhuXIF1w https://t.co/CbifCvSthc #livewell #2health'

In [74]:
td['code']=[]
td['code'].append('ANTI-SMOKING')
td['code'].append('LINK')
td['code'].append('ADVICE')

In [75]:
td = working[29]
td['tweet']['text']

'RT @onmyworst: Not to sound like tana mongeau but I love smoking weed'

In [76]:
td['code']=[]
td['code'].append('RETWEET')
td['code'].append('USERMENTION')
td['code'].append('MARIJUANA')
td['code'].append("POSITIVEAFFECT")

In [77]:
td = working[30]
td['tweet']['text']

'Max has been growing some odd vegetables and smoking them recently. I must investigate further.'

In [78]:
td['code']=[]
td['code'].append('VEGETABLES')

In [79]:
td = working[31]
td['tweet']['text']

'@KBonimtetezi @WilliamsRuto We need to get urine sample coz what you are smoking??'

In [80]:
td['code']=[]
td['code'].append('USERMENTION')
td['code'].append("DRUGTESTING")

In [81]:
td = working[32]
td['tweet']['text']

'Drinking a Petite Sour Raspberry by @CrookedStave @ Meat Smoking House 2 — https://t.co/jp3qqNwAgc #photo'

In [82]:
td['code']=[]
td['code'].append("LINK")
td['code'].append("IRRELEVANT")

In [83]:
td = working[33]
td['tweet']['text']

"So high that I'm fading away ..."

In [84]:
td['code']=[]
td['code'].append("MARIJUANA")

In [85]:
td = working[34]
td['tweet']['text']

'RT @Bhuvan_Bam: @CarryMinati *Stresses out*\n*Starts smoking* 😭😭😂'

In [86]:
td['code']=[]
td['code'].append("USERMENTION")
td['code'].append("POSITIVEAFFECT")

In [87]:
td = working[35]
td['tweet']['text']

"RT @foxnewspolitics: 'Smoking gun'email shows Obama DOJ blocked conservative groups from settlement funds,GOP lawmaker says- @AlexPappas\nht…"

In [88]:
td['code']=[]
td['code'].append("RETWEET")
td['code'].append("USERMENTION")
td['code'].append("IRRELEVANT")

In [89]:
td = working[36]
td['tweet']['text']

'RT @DrugedPosts: "Wyd after smoking this?" https://t.co/n9g9TkPqDM'

In [90]:
td['code']=[]
td['code'].append("RETWEET")
td['code'].append("USERMENTION")
td['code'].append("LINK")

In [91]:
td = working[37]
td['tweet']['text']

'RT @skwawkbox: This is huge if people understand its significance...\nhttps://t.co/IcEi2Gh1fh'

In [92]:
td['code']=[]
td['code'].append("RETWEET")
td['code'].append("USERMENTION")
td['code'].append("LINK")
td['code'].append("IRRELEVANT")

In [93]:
td = working[38]
td['tweet']['text']

"RT @Sesamee_giraffe: Sana's introduction is a perfect example of smoking through your presentation 😂"

In [94]:
td['code']=[]
td['code'].append("RETWEET")
td['code'].append("USERMENTION")
td['code'].append("IRRELEVANT")

In [95]:
td = working[39]
td['tweet']['text']

'RT @Bhuvan_Bam: @CarryMinati *Stresses out*\n*Starts smoking* 😭😭😂'

In [96]:
td['code']=[]
td['code'].append("USERMENTION")
td['code'].append("POSITIVEAFFECT")

In [97]:
td = working[40]
td['tweet']['text']

'RT @phil30mccrackin: Smoking hot Funtime babes @ashlyandersonxx &amp; @missjojokiss https://t.co/CExJIai8ZW'

In [98]:
td['code']=[]
td['code'].append("RETWEET")
td['code'].append("USERMENTION")
td['code'].append("LINK")
td['code'].append("IRRELEVANT")

In [99]:
td = working[41]
td['tweet']['text']

'RT @WeedFeed: Differences Between Eating And Smoking Weed https://t.co/8FK10ZbWqx https://t.co/l6212vXpA7'

In [100]:
td['code']=[]
td['code'].append("RETWEET")
td['code'].append("USERMENTION")
td['code'].append("LINK")
td['code'].append('MARIJUANA')

In [101]:
td = working[42]
td['tweet']['text']

'@danielmarven He must stop smoking 😂😂😂 https://t.co/8tzUrn9Jem'

In [102]:
td['code']=[]
td['code'].append("USERMENTION")
td['code'].append("LINK")
td['code'].append("ANTI-SMOKING")

In [103]:
td = working[43]
td['tweet']['text']

'RT @GiveMeInternet: Anti smoking ads should show the benefits of quitting instead of the harms of smoking.'

In [104]:
td['code']=[]
td['code'].append("RETWEET")
td['code'].append("USERMENTION")
td['code'].append("ANTI-SMOKING")
td['code'].append("QUITTING")

In [105]:
td = working[44]
td['tweet']['text']

'RT @DrugedPosts: "Wyd after smoking this?" https://t.co/n9g9TkPqDM'

In [106]:
td['code']=[]
td['code'].append('LINK')
td['code'].append("RETWEET")

In [107]:
td = working[45]
td['tweet']['text']

'“What Were They Smoking?”: On Liturgical Art from the 1970s https://t.co/dOgDAUDoVr https://t.co/CcQEkOrIW1'

In [108]:
td['code']=[]
td['code'].append('LINK')

In [109]:
td = working[46]
td['tweet']['text']

"RT @Prime_Politics: Fmr. House Speaker Boehner Describes How Obama Struggled With Smoking and Was 'Scared to Death' of Michelle\n\n#P2 https:…"

In [110]:
td['code']=[]
td['code'].append('RETWEET')
td['code'].append('USERMENTION')
td['code'].append('LINK')

In [111]:
td = working[47]
td['tweet']['text']

'RT @GiveMeInternet: Anti smoking ads should show the benefits of quitting instead of the harms of smoking.'

In [112]:
td['code']=[]
td['code'].append("RETWEET")
td['code'].append("USERMENTION")
td['code'].append("ANTI-SMOKING")
td['code'].append("QUITTING")

In [113]:
td = working[48]
td['tweet']['text']

'RT DrugedPosts: "Wyd after smoking this?" https://t.co/85w6elISAo'

In [114]:
td['code']=[]
td['code'].append('LINK')
td['code'].append("RETWEET")

In [115]:
td = working[49]
td['tweet']['text']

'RT @pyrocajun: One of my friends in Al Udeid took this photo while out smoking a cigarette, and it is goddamn amazing. https://t.co/aGNl7aT…'

In [116]:
td['code']=[]
td['code'].append('LINK')
td['code'].append("RETWEET")
td['code'].append("USERMENTION")
td['code'].append("TOBACCO")

In [117]:
td = working[50]
td['tweet']['text']

'Sitting is the smoking of our generation https://t.co/OEOTt0jkSK'

In [118]:
td['code']=[]
td['code'].append('LINK')
td['code'].append('ADVICE')

In [119]:
td = working[51]
td['tweet']['text']

'RT @NLMblog: “What Were They Smoking?”: On Liturgical Art from the 1970s https://t.co/dOgDAUDoVr https://t.co/CcQEkOrIW1'

In [120]:
td['code']=[]
td['code'].append('LINK')

In [121]:
td = working[52]
td['tweet']['text']

"Who else can share the video of their life's 1st puff of a cigarette and their parents are still proud of it. 😎… https://t.co/plfYK4AnHt"

In [122]:
td['code']=[]
td['code'].append('LINK')
td['code'].append('TOBACCO')

In [123]:
td = working[53]
td['tweet']['text']

'hot fuck jerkoff instruction hentai videos smoking porn'

In [124]:
td['code']=[]
td['code'].append('IRRELEVANT')

In [125]:
td = working[54]
td['tweet']['text']

'RT @cjsnowdon: Postcard from a country that is losing its mind. https://t.co/bbbwcW25DU'

In [126]:
td['code']=[]
td['code'].append('LINK')
td['code'].append("RETWEET")
td['code'].append("USERMENTION")
td['code'].append('IRRELEVANT')

In [127]:
td = working[55]
td['tweet']['text']

'RT @MajorPoonia: RG is the next PM of “INDEPENDENT” India🤔?\nr we Gulam🤔?\nSalman Bhai,1 thing is for sure- whatever stuff U r smoking is of…'

In [128]:
td['code']=[]
td['code'].append("RETWEET")
td['code'].append("USERMENTION")
td['code'].append('IRRELEVANT')

In [129]:
td = working[56]
td['tweet']['text']

'@BabeHeavenTV @preeti_young @BabestationTV @UKBabeChannels @RampantTV @tvbabesahoy @the_real_winsaw @murphdogg11… https://t.co/4ZtftYdDqM'

In [130]:
td['code']=[]
td['code'].append('LINK')
td['code'].append("USERMENTION")
td['code'].append('IRRELEVANT')

In [131]:
td = working[57]
td['tweet']['text']

'RT @onmyworst: Not to sound like tana mongeau but I love smoking weed'

In [132]:
td['code']=[]
td['code'].append('RETWEET')
td['code'].append("USERMENTION")
td['code'].append('MARIJUANA')
td['code'].append('POSITIVEAFFECT')

In [133]:
td = working[58]
td['tweet']['text']

'RT @_youngkingdave: Smoking #doinks with @WakaFlocka\n#doinksquad https://t.co/zex6zRw4Xx'

In [134]:
td['code']=[]
td['code'].append("RETWEET")
td['code'].append("USERMENTION")
td['code'].append("LINK")
td['code'].append("MARIJUANA")

In [135]:
td = working[59]
td['tweet']['text']

'RT @naijagym: Stop smoking at home. It makes ur children more prone to ear infections, pneumonia, bronchitis, &amp; coughs. https://t.co/CGkMD9…'

In [136]:
td['code']=[]
td['code'].append("RETWEET")
td['code'].append("USERMENTION")
td['code'].append("LINK")
td['code'].append("ADVICE")
td['code'].append("ANTI-SMOKING")

In [137]:
td = working[60]
td['tweet']['text']

'RT @wydafters: "Wyd after smoking this?" https://t.co/jjE1aF4NhB'

In [138]:
td['code']=[]
td['code'].append('LINK')
td['code'].append('RETWEET')

In [139]:
td = working[61]
td['tweet']['text']

'FlTNESS: RT DrugedPosts: "Wyd after smoking this?" https://t.co/hFQcFqtaHr'

In [140]:
td['code']=[]
td['code'].append('LINK')
td['code'].append('RETWEET')

In [141]:
getCodes(tweets)

['LINK',
 'RETWEET',
 'USERMENTION',
 'ANTI-SMOKING',
 'QUITTING',
 'IRRELEVANT',
 'ADVICE',
 'MARIJUANA',
 'FRIENDS',
 'SENTIMENT',
 'DRUGTESTING',
 'IMPACT',
 'POSITIVEAFFECT',
 'INFORMATION',
 'VEGETABLES',
 'TOBACCO',
 'FRUSTRATION',
 'POSSIBLYRELEVANT',
 'OPIATES',
 'BENEFITS',
 'HABITS',
 'HASHTAG']

In [142]:
td = working[62]
td['tweet']['text']

'RT @Joshuel1209: So ive decided to try to quit smoking.. \nLol \nLets see how long it takes😂'

In [143]:
td['code']=[]
td['code'].append("RETWEET")
td['code'].append("USERMENTION")
td['code'].append("TOBACCO")
td['code'].append("QUITTING")

In [144]:
td = working[63]
td['tweet']['text']

'RT @eBookExtremist: Conservative mindset: Censor the word "ass" on the radio in a song about smoking weed, cooking and selling crack, prost…'

In [145]:
td['code']=[]
td['code'].append("RETWEET")
td['code'].append("USERMENTION")
td['code'].append("MARIJUANA")

In [146]:
td = working[64]
td['tweet']['text']

'RT @vpybur: I hate smoking w people who are so paranoid about getting caught'

In [147]:
td['code']=[]
td['code'].append("RETWEET")
td['code'].append("USERMENTION")
td['code'].append("MARIJUANA")

In [148]:
td = working[65]
td['tweet']['text']

'RT @bkeane3030: smoking big doinks out in amish  https://t.co/En14YMDMI6'

In [149]:
td['code']=[]
td['code'].append("RETWEET")
td['code'].append("USERMENTION")
td['code'].append("LINK")
td['code'].append("MARIJUANA")

In [150]:
td = working[66]
td['tweet']['text']

'RT @Scaler17: The latest Acamprosate Made Me Quit Smoking Daily! https://t.co/Segx1Ry9uJ Thanks to @DoctorNazarian @BrownlowPrinces @EcigCl…'

In [151]:
td['code']=[]
td['code'].append("RETWEET")
td['code'].append("USERMENTION")
td['code'].append("LINK")
td['code'].append("QUITTING")
td['code'].append("ADVICE")
td['code'].append("TOBACCO")

In [152]:
td = working[67]
td['tweet']['text']

'RT @kitz007: #Irony It happens only in #India 😂😂 #Smoking ... #NoSmoking \n\nCaptions pls :D https://t.co/O39sdwVcRM'

In [153]:
td['code']=[]
td['code'].append("RETWEET")
td['code'].append("USERMENTION")
td['code'].append("LINK")
td['code'].append("ANTI-SMOKING")

In [154]:
td = working[68]
td['tweet']['text']

'RT @Scaler17: https://t.co/K8TDlJXdSD … …   Link to my ad on how I quit smoking with Acamprosate second article down  Oct. 26 / 27 /28 /29,…'

In [155]:
td['code']=[]
td['code'].append("RETWEET")
td['code'].append("USERMENTION")
td['code'].append("LINK")
td['code'].append("QUITTING")
td['code'].append("ADVICE")
td['code'].append("TOBACCO")

In [156]:
td = working[69]
td['tweet']['text']

'RT @FootyMemes: This new anti-smoking ad is really powerful... https://t.co/pWHZDLIb7O'

In [157]:
td['code']=[]
td['code'].append("RETWEET")
td['code'].append("USERMENTION")
td['code'].append("LINK")
td['code'].append("TOBACCO")
td['code'].append("ANTI-SMOKING")

In [158]:
td = working[70]
td['tweet']['text']

'RT @Scaler17: #Acamprosate enabled me to quit smoking - without trying to quit smoking.  Please view my website https://t.co/wZlB8UmVz8  No…'

In [159]:
td['code']=[]
td['code'].append("RETWEET")
td['code'].append("USERMENTION")
td['code'].append("LINK")
td['code'].append("TOBACCO")
td['code'].append("ANTI-SMOKING")

In [160]:
td = working[71]
td['tweet']['text']

'RT @2HighBros: "Wyd after smoking this?" https://t.co/W2IsDjukye'

In [161]:
td['code']=[]
td['code'].append('LINK')
td['code'].append("RETWEET")

In [162]:
td = working[72]
td['tweet']['text']

"RT @joelavinash_: Guys , I'm trying to get my friend to stop smoking please please please get this to 2.5K rts so that he'll stop smoking 🚭…"

In [163]:
td['code']=[]
td['code'].append("RETWEET")
td['code'].append('USERMENTION')
td['code'].append('QUITTING')
td['code'].append('FRIENDS')

In [164]:
td = working[73]
td['tweet']['text']

'RT @2HighBros: "Wyd after smoking this?" https://t.co/W2IsDjukye'

In [165]:
td['code']=[]
td['code'].append('LINK')
td['code'].append("RETWEET")

In [166]:
td = working[74]
td['tweet']['text']

'Former House Speaker John Boehner describes how Obama struggled with smoking and was… https://t.co/jGoZtwpyfB'

In [167]:
td['code']=[]
td['code'].append('LINK')

In [168]:
td = working[75]
td['tweet']['text']

'@AMERICAFUCKU @DBloom451 @NFL Yes!! They are ABSOLUTELY clueless &amp; @realDonaldTrump got them smoking on that Trump… https://t.co/Q0c9TAed1T'

In [169]:
td['code']=[]
td['code'].append('LINK')
td['code'].append('USERMENTION')
td['code'].append('IRRELEVANT')

In [170]:
td = working[76]
td['tweet']['text']

'RT @Scaler17: The latest Acamprosate Made Me Quit Smoking Daily! https://t.co/Segx1Rgy69 Thanks to @OakCreekDental @medsinpregnancy @TheGal…'

In [171]:
td['code']=[]
td['code'].append("RETWEET")
td['code'].append("USERMENTION")
td['code'].append("LINK")
td['code'].append("QUITTING")
td['code'].append("ADVICE")
td['code'].append("TOBACCO")

In [172]:
td = working[77]
td['tweet']['text']

'I need a smoking buddy asap.'

In [173]:
td['code']=[]
td['code'].append("FRIENDS")

In [174]:
td = working[78]
td['tweet']['text']

"seems like winter is here in bangalore. smoking a cigarette in the office terrace and can't stop shivering..."

In [175]:
td['code']=[]
td['code'].append("TOBACCO")

In [176]:
td = working[79]
td['tweet']['text']

'RT @ChickenColeman: Bitch he smoking WHAT???? RT @n1irving: nigga smoking incest 😂 RT @DrakeGoat: High AF #420 http://t.co/NRhhzU2jN2'

In [177]:
td['code']=[]
td['code'].append("RETWEET")
td['code'].append("USERMENTION")
td['code'].append("LINK")

In [178]:
td = working[80]
td['tweet']['text']

'RT @GEslave: follow @Goddess_Erotika , the ultimate GODDESS of smoking &amp; high platform heels #smoking #highheels #fetish #longnails #mistre…'

In [179]:
td['code']=[]


In [180]:
td['code'].append("RETWEET")
td['code'].append("USERMENTION")

### end cut here
---

## Exercise 2: Additional queries

The tweets annotated above are all based on searches for 'smoking'. What if you were to try other terms, such as 'tobacco' or 'vaping'? 

1. Using the `searchTwitter` procedure defined above, run a search for a set of tweets with one of these alternative terms.

2. Revise `saveTweets` and `readTweets` to store the new set of tweets in a json file, along with the original tweets. How might you distinguish between the two sets? 

---
### cut below this line


There are several ways this might be done - the following approach is reasonably straightforward.
1. Iterate over the current `tweets` dictionary, adding a new `search_term` attribute set to 'smoking' to each entry.
2. Grab a new set of tweets with the alternative term.
3. Annotate these new tweets with the new search term.
4. Add them to the `tweets` object, including the new `search_term`.


Let's go through these in turn.

#### 2.1 Iterate over the current `tweets` dictionary, adding a new `search_term` attribute set to 'smoking' to each entry

In [189]:
for id,entry in tweets.items():
    tweets[id]['search_term']='smoking'


#### 2.2 Grab a new set of tweets with the alternative term

In [None]:
vapeTweets = searchTwitter("vape",100)

#### 2.3 Add the search term to these new tweets

In [192]:
for id,entry in vapeTweets.items():
    vapeTweets[id]['search_term']='vape'

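
The final step in the list above, adding the new tweets to the `tweets` object, could be sketched as follows. A plain `tweets.update(vapeTweets)` would also work, but it would overwrite any tweet that was found under both search terms. This sketch keeps every term instead. The function name `merge_corpora` and the `search_terms` field are hypothetical, and each entry is assumed to already carry the `count` and `search_term` values set in the earlier cells.

```python
def merge_corpora(*corpora):
    """Combine tweet dictionaries keyed by id into one new dictionary.

    When the same tweet id appears under several search terms, the terms are
    collected in a (hypothetical) 'search_terms' list and the counts are added.
    """
    merged = {}
    for corpus in corpora:
        for tweet_id, entry in corpus.items():
            term = entry['search_term']
            if tweet_id not in merged:
                # shallow copy so that the original entry is left unchanged
                merged[tweet_id] = dict(entry, search_terms=[term])
            else:
                merged[tweet_id]['count'] += entry['count']
                if term not in merged[tweet_id]['search_terms']:
                    merged[tweet_id]['search_terms'].append(term)
    return merged
```

For example, `tweets = merge_corpora(tweets, vapeTweets)` would produce a single dictionary in which the two sets can be told apart by their search terms.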

### end cut here
----

## Exercise 3: Reflection on coding

Open coding can often be an iterative process. When we first start out, we don't really know what we're looking for. As a result, the first few items annotated might only get a few codes, and we might miss ideas that we don't initially think are important. As we see more and more items, our ideas of what needs to be annotated will change, and we'll start adding in codes that might also apply to earlier messages. Thus, we often need to review and re-annotate earlier tweets to account for changes in our interpretations.

Review the annotations that you have made, by doing the following:
1. Write a routine to extract a list of all codes used to describe all tweets in the corpus. Hint: the Python `set()` constructor can help ensure that you do not repeat codes.
2. Use this list of codes to review your annotations of the first 10 tweets that you reviewed. Revise the codes associated with these tweets, adding items from the overall list of codes as appropriate.

---
begin cut here



### 3.1 Write a routine to extract a list of all codes used to describe all tweets in the corpus


In [240]:
def get_codes(tweets):
    codes=set()
    for id,entry in tweets.items():
        if 'code' in entry:
            for code in entry['code']:
                codes.add(code)
    return codes

In [241]:
codes = get_codes(tweets)
codes

{'ADVICE',
 'ANTI-SMOKING',
 'BENEFITS',
 'DRUGTESTING',
 'FRIENDS',
 'FRUSTRATION',
 'HABITS',
 'HASHTAG',
 'IMPACT',
 'INFORMATION',
 'IRRELEVANT',
 'LINK',
 'MARIJUANA',
 'OPIATES',
 'POSITIVEAFFECT',
 'POSSIBLYRELEVANT',
 'QUITTING',
 'RETWEET',
 'SENTIMENT',
 'TOBACCO',
 'USERMENTION',
 'VEGETABLES'}
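
For the second part of the exercise, it may help to see each early tweet alongside the codes it already has and the codes it lacks. The following sketch builds such a report. The name `review_entries` is hypothetical, and the decision about which missing codes actually apply still has to be made by reading each tweet.

```python
def review_entries(entries, all_codes, n=10):
    """For the first n entries, pair each tweet's text with the codes it
    already has and the codes from all_codes that have not been applied."""
    report = []
    for entry in entries[:n]:
        applied = entry.get('code', [])
        unused = sorted(set(all_codes) - set(applied))
        report.append((entry['tweet']['text'], list(applied), unused))
    return report
```

For example, `review_entries(working, codes)` would return one tuple per tweet for the first 10 entries of `working`, which can then be printed and reviewed in turn.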


end cut here 

---

### 4.2 Summarizing and exploring coding categories

Once we have a good set of categories, we can iterate over the tweets and create a dictionary mapping codes to relevant tweets.

-----

In [None]:
codeDict={}
for id,entry in tweets.items():
    # for every tweet, look to see if we have any codes
    if 'code' in entry:
        # for each code
        for code in entry['code']:
            # look for it in codeDict, creating a new list of tweet ids if needed
            if code not in codeDict:
                codeDict[code]=[]
            # add the id to the dictionary 
            codeDict[code].append(id)
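
Once `codeDict` has been built, it can be used to summarize the annotations and to pull up the tweets behind a given code. The two helpers below are a sketch of how this might look. The names `code_summary` and `texts_for_code` are hypothetical.

```python
def code_summary(code_dict):
    """List (code, number of tweets) pairs, most frequent first;
    ties are ordered alphabetically by code."""
    return sorted(((code, len(ids)) for code, ids in code_dict.items()),
                  key=lambda pair: (-pair[1], pair[0]))

def texts_for_code(code_dict, tweets, code):
    """Return the text of every tweet annotated with the given code."""
    return [tweets[tweet_id]['tweet']['text'] for tweet_id in code_dict.get(code, [])]
```

For example, `texts_for_code(codeDict, tweets, 'QUITTING')` would return the text of each tweet coded as `QUITTING`, which is a convenient way to check whether a code has been applied consistently.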