## Search overview

We'll end up looking at two main kinds of searches:
* *GET search/tweets*, which returns tweets matching a search term
* *GET statuses/user_timeline*, which returns a given user's most recent tweets (Twitter caps this at roughly the 3,200 most recent)

Note on timeline iteration: https://dev.twitter.com/rest/public/timelines

## Setting up the search object

First things first; we'll set up a search object like last time.

In [None]:
import sys, os, re
from pprint import pprint                           #Important for reading through JSONs
from time import localtime,strftime,sleep,time      #Important for dealing with Twitter rate limits
import datetime                       #Important for processing Twitter timestamps
import twitter

In [None]:
cons_oauth_file = 'c.xxx'
if os.path.exists(cons_oauth_file):
    constoken, conssecret = twitter.read_token_file(cons_oauth_file)
else:
    constoken = raw_input("What is your app's 'Consumer Key'?").strip()
    conssecret = raw_input("What is your app's 'Consumer Secret'?").strip()
    with open(cons_oauth_file,'w') as wf:            #save the credentials for next time
        wf.write(constoken+'\n'+conssecret)

In [None]:
app_oauth_file = 'a.xxx'
if not os.path.exists(app_oauth_file):									#if user not authorized already
	twitter.oauth_dance("your app",constoken,conssecret,app_oauth_file)		#perform OAuth Dance
apptoken, appsecret = twitter.read_token_file(app_oauth_file)					#import user credentials

In [None]:
tsearch = twitter.Twitter(auth=twitter.OAuth(apptoken,appsecret,constoken,conssecret))	#create search command

## GET search/tweets

Now that _tsearch_ is initialized, let's get to searching! [Here](https://dev.twitter.com/rest/reference/get/search/tweets) is Twitter's documentation on the GET search/tweets call, which is pretty good.

GET search/tweets takes a range of arguments, but I find these the most important:
* *q* : the search term, which must be UTF-8 & URL-encoded
* *count* : how many tweets per search? (100 max)
* *result_type* : do you want all recent tweets, or those that Twitter thinks are most interesting? (Hint: the former, you definitely want the former.)
* *max_id* : limits results to tweets at or before the specified tweet ID (i.e., IDs less than or equal to it)
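
The encoding that *q* needs doesn't have to be typed by hand. As a minimal sketch, the standard library's `quote_plus` produces the same `+` and `%22` forms used in this notebook. The helper name `encode_query` is made up for illustration, and whether your version of the *twitter* library encodes *q* again on its own is something to check before relying on this.

```python
# Sketch: URL-encode a search phrase with the standard library.
try:
    from urllib.parse import quote_plus   # Python 3
except ImportError:
    from urllib import quote_plus         # Python 2

def encode_query(phrase):
    """Return the phrase with spaces as + and quotes as %22 (hypothetical helper)."""
    return quote_plus(phrase)

encoded = encode_query('"good morning"')   # same form as the hand-typed search terms
```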

Try out a simple search below.

In [None]:
term="placeholder+text"  #Note: use + instead of spaces

res = tsearch.search.tweets(q=term)
#                            count=10,              #just want 10 hits back
#                            result_type="recent")  #include all recent tweets, not only popular ones

The result of a search is again a nested dictionary. (The Twitter API returns data in either JSON or XML format, which the *twitter* library automatically decodes into nested dictionaries.)  The returned tweets are in the list stored under the 'statuses' key. (Internally, the API refers to tweets as statuses, which is weird sometimes.)

Let's look at the first hit:

In [None]:
pprint(res['statuses'][0])

Boy, that's a lot of information. Twitter calls this a fully-hydrated result; it includes information about the tweet, the user, and the social engagement of the tweet.  Here's Twitter's [overview of the information](https://dev.twitter.com/overview/api/tweets) in a tweet.

I like to compare this against what Twitter shows us on its website, where the visualization is easier.  Let's make a real quick function to extract the URL from this information so we can visualize the tweets as we talk about them:

In [None]:
def extracttweetURL(j):
	return 'http://twitter.com/'+j['user']['screen_name']+'/status/'+str(j['id'])

t = res['statuses'][0]
print extracttweetURL(t)

The key pieces of information depend on your goals, but in most cases these will be important:
* *created_at*: tweet's time, in UTC.
* *favorite_count*, *retweet_count*: number of favs & RTs, respectively, the tweet has amassed
* *id*: tweet's unique numerical ID
* *in_reply_to_status_id*: ID of the tweet this one's replying to (if any)
* *text*: the text of the tweet
* *user*: all the info about the tweeter

And within the user dictionary, here're some important fields:
* *id*: tweeter's numerical ID (constant throughout account's lifespan)
* *location*: self-reported location of tweeter
* *friends_count*, *followers_count*: number of people the tweeter follows and is followed by (respectively)
* *name*: tweeter's display name (can change)
* *screen_name*: tweeter's Twitter handle (i.e., @whatever; also can change)

Just to make the tweets a little more readable, I'm going to create a function that prunes each tweet down to just these features.

In [None]:
def prunetweet(t):
    d = {k: t[k] for k in ['created_at','favorite_count','retweet_count','id','in_reply_to_status_id','text']}    #keeping only relevant top-level features (user features handled below)
    d['user'] = {k: t['user'][k] for k in ['id','location','friends_count','followers_count','name','screen_name']} #keeping only relevant features
    return d

pprint(prunetweet(t))
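
One caveat: indexing with `t[k]` raises a `KeyError` if a field is ever missing. Here's a hedged sketch of a more forgiving variant that fills in `None` instead. The name `prunetweet_safe` and the hand-made `sample` dictionary are assumptions for illustration, not real API output.

```python
# Sketch: prune a tweet, tolerating missing fields (missing values become None).
TWEET_KEYS = ['created_at', 'favorite_count', 'retweet_count', 'id',
              'in_reply_to_status_id', 'text']
USER_KEYS = ['id', 'location', 'friends_count', 'followers_count',
             'name', 'screen_name']

def prunetweet_safe(t):
    """Keep only the fields listed above; absent fields are set to None."""
    d = {k: t.get(k) for k in TWEET_KEYS}
    user = t.get('user', {})
    d['user'] = {k: user.get(k) for k in USER_KEYS}
    return d

# Hand-made, deliberately incomplete stand-in for a tweet
sample = {'id': 1, 'text': 'good morning', 'user': {'screen_name': 'example'}}
pruned = prunetweet_safe(sample)
```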

### Playing around with tweets

Let's play around with these a bit; try some different searches and look at the results you get. Are there any really surprising results?

In [None]:
term="%22good+morning%22"  #Note: use + instead of spaces, and %22 instead of quotes
count=25                   #Don't bother with too many hits yet

res = tsearch.search.tweets(q=term,count=count,result_type='recent')

In [None]:
for i, status in enumerate(res['statuses']):
    print '\n',i, extracttweetURL(status)
    pprint(prunetweet(status))

A few things I've found strange/interesting/annoying:
* a query can be matched by a username in addition to the text itself.
* Jupyter isn't displaying Unicode well (so no emoji, :( )
* favorite_count in tweet, favo**u**rites_count in user
* manual RTs?
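
Two of these quirks can be screened out after the fact. The sketch below keeps only tweets whose text actually contains the phrase, and drops anything whose text starts with `RT @user`. That second test is a heuristic I'm assuming rather than an official flag, and the sample tweets are invented.

```python
# Sketch: filter out username-only matches and RT-style tweets.
import re

MANUAL_RT = re.compile(r'^\s*RT\s+@\w+', re.I)   # heuristic: text begins "RT @someone"

def phrase_in_text(tweet, phrase):
    """True if the phrase occurs in the tweet text itself (case-insensitive)."""
    return phrase.lower() in tweet['text'].lower()

def looks_like_manual_rt(tweet):
    """True if the tweet text starts like a retweet."""
    return MANUAL_RT.search(tweet['text']) is not None

# Invented examples: a real hit, a username-only match, and an RT
samples = [
    {'text': 'Good morning, everyone!', 'user': {'screen_name': 'a'}},
    {'text': 'Off to work', 'user': {'screen_name': 'good_morning_bot'}},
    {'text': 'RT @a: Good morning, everyone!', 'user': {'screen_name': 'b'}},
]
kept = [t for t in samples
        if phrase_in_text(t, 'good morning') and not looks_like_manual_rt(t)]
```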

### Iterating back in time

Twitter limits the number of tweets from any single *GET search/tweets* call to 100. But you're allowed to go back up to one week, or 3000 tweets, whichever you run afoul of first.  How do you do that?  The result of each API call has a *search_metadata* feature, which gives info both about the completed search and about where to go from here:

In [None]:
res['search_metadata']

The search API lets you specify a maximum tweet ID in each search (*max_id*), and by iteratively moving that maximum back to just below the minimum ID of the preceding search, you keep returning results from further back in time, until Twitter stops you.

Unfortunately, that minimum ID is not supplied directly here; you have to extract it from the *next_results* string, or extract it manually as the minimum ID in your results.  Also, note that the *max_id* search value is **inclusive**, so you should subtract one from it before your search or you'll get that tweet over again.

In [None]:
#Option 1: take the smallest ID in the results and subtract one (max_id is inclusive)
minid = min(s['id'] for s in res['statuses'])-1
print minid

#Option 2: regexp to the max_id value in next_results and extract it as element 1 of the match object
#see https://docs.python.org/2.7/library/re.html
#However, Twitter can't handle this for more complex searches (and next_results may be absent), so then you have to fall back on the manual value above.
if 'next_results' in res['search_metadata']:
    minid = re.search('max_id=([^&]+)&',res['search_metadata']['next_results']).group(1)
    print minid
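
As an alternative to the regexp above, the *next_results* string can be parsed as an ordinary query string, which doesn't depend on where *max_id* sits in it. This is only a sketch: the function name is hypothetical and the example string is modelled on the usual shape of *next_results*, not copied from a live response.

```python
# Sketch: read max_id out of a next_results string with a query-string parser.
try:
    from urllib.parse import parse_qs   # Python 3
except ImportError:
    from urlparse import parse_qs       # Python 2

def max_id_from_next_results(next_results):
    """Return max_id as an int, or None if the string carries no max_id."""
    params = parse_qs(next_results.lstrip('?'))
    values = params.get('max_id')
    return int(values[0]) if values else None

# Illustrative string only
example = '?max_id=249279667666817023&q=%22good%20morning%22&count=25&include_entities=1'
next_max_id = max_id_from_next_results(example)
```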

In [None]:
res = tsearch.search.tweets(q=term,count=count,result_type='recent',max_id=minid)
res['search_metadata']

Ta da!
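
The stepping-back logic can also be wrapped in a small helper so it isn't retyped for every search. The sketch below is one way to do it, under assumptions: `collect_tweets` and `fake_search` are hypothetical names, and `fake_search` is a stand-in that mimics inclusive *max_id* behavior so the loop can be tried without touching the API or your rate limit.

```python
# Sketch: page backwards through results by repeatedly lowering max_id.
def collect_tweets(search_page, max_pages=30):
    """search_page(max_id) must return a list of tweet dicts.
    max_id=None means "no upper bound" for the first page."""
    collected = []
    max_id = None
    for _ in range(max_pages):
        statuses = search_page(max_id)
        if not statuses:          # nothing older left
            break
        collected.extend(statuses)
        # max_id is inclusive, so go one below the smallest ID seen
        max_id = min(s['id'] for s in statuses) - 1
    return collected

# Stand-in for the API: 250 fake tweets with IDs 250..1, newest first
FAKE_IDS = list(range(250, 0, -1))

def fake_search(max_id, count=100):
    eligible = [i for i in FAKE_IDS if max_id is None or i <= max_id]
    return [{'id': i} for i in eligible[:count]]

tweets = collect_tweets(fake_search)
```

With the real API, `search_page` could be a small function that calls `tsearch.search.tweets(...)` and returns its `'statuses'` list.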

## Good morning!

Hey, here's a stupid test case to make sure that we're getting reasonable results. When do people say good morning?  And does it depend on where they are?

Let's compare the distribution of "good morning" tweets in Berkeley and Pittsburgh, 3 hours apart by time zone.  

In [None]:
#To go back 3000 tweets, you need 30 searches at the maximum count of 100 tweets per search. 
#Find out if you have enough searches left

In [None]:
r = tsearch.application.rate_limit_status()
remaining = r['resources']['search']['/search/tweets']['remaining']
print remaining
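
The same entry that holds *remaining* also carries a *reset* time (in epoch seconds), so you can work out how long to wait when you run out. A minimal sketch, with a hypothetical helper name and made-up sample numbers:

```python
# Sketch: how many seconds to wait before searching again.
def seconds_until_reset(limit_info, now):
    """limit_info is one entry of the rate-limit status, e.g. the
    '/search/tweets' dictionary; now is the current epoch time in seconds."""
    if limit_info['remaining'] > 0:
        return 0
    return max(0, limit_info['reset'] - now)

# Made-up values in the shape of a rate-limit entry
sample_info = {'limit': 180, 'remaining': 0, 'reset': 1403602426}
wait = seconds_until_reset(sample_info, now=1403602126)
```

In the notebook this could be used as `sleep(seconds_until_reset(r['resources']['search']['/search/tweets'], time()))`, using the `sleep` and `time` imported at the top.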

In [None]:
#Let's find the 3000 most recent "good morning" tweets
#Build a search for "good morning" (remember to convert quotes to the URL-encoded %22) & check it works as expected

In [None]:
term = '%22good+morning%22'
res = tsearch.search.tweets(q=term,count=100,result_type='recent')

In [None]:
res['search_metadata']

In [None]:
#Extract the minimum tweet ID, and get the 100 tweets before it
#Loop 30 times (or until a search comes back with no statuses)
#Be sure to save all the tweets
#Note: searching with max_id=0 is the same as having no max.

In [None]:
allres = []
minid = 9999999999999999999999
for i in range(0,30):
    print 'Up to tweet', minid, 'iteration', i
    res = tsearch.search.tweets(q=term,count=100,result_type='recent',max_id=minid,geocode='37.87,-122.27,50km')
    print 'Hits:', len(res['statuses'])   #search_metadata['count'] appears to just echo the requested count
    if len(res['statuses'])==0:
        break
    #minid = re.search('max_id=([^&]+)&',res['search_metadata']['next_results']).group(1)
    #max_id is inclusive, so move to one below the smallest ID in this batch
    minid = min(s['id'] for s in res['statuses'])-1
    print 'New minimum ID:', minid
    allres.extend(res['statuses'])


In [None]:
res['statuses']

In [None]:
#Check how many tweets you ended up with
#Now let's extract their times

def extracttime(t):
    UTCoffset = datetime.timedelta(hours=7)   #I'm assuming we're on Pacific Daylight Time (UTC-7)
    return datetime.datetime.strptime(t['created_at'],'%a %b %d %H:%M:%S +0000 %Y')-UTCoffset


In [None]:
#Check that the time is getting correctly calculated by comparing against web client's time.
t=allres[0]
print datetime.datetime.strptime(t['created_at'],'%a %b %d %H:%M:%S +0000 %Y')
print extracttime(t)
print extracttweetURL(t)
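
Since the Pittsburgh tweets come from a different time zone, it may help to make the offset a parameter instead of hard-coding seven hours. This is a sketch only: the function name is hypothetical, the offsets assume daylight time (UTC-7 Pacific, UTC-4 Eastern), and the sample timestamp is invented.

```python
# Sketch: convert a tweet's UTC created_at to a local time with a given offset.
import datetime

TWITTER_TIME_FORMAT = '%a %b %d %H:%M:%S +0000 %Y'

def extracttime_with_offset(t, utc_offset_hours):
    """utc_offset_hours is negative west of UTC, e.g. -7 for Pacific Daylight Time."""
    utc = datetime.datetime.strptime(t['created_at'], TWITTER_TIME_FORMAT)
    return utc + datetime.timedelta(hours=utc_offset_hours)

sample = {'created_at': 'Mon Sep 24 03:35:21 +0000 2012'}   # invented tweet
pacific = extracttime_with_offset(sample, -7)
eastern = extracttime_with_offset(sample, -4)
```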

In [None]:
#Extract all times
alltimes = []
for t in allres:
    ti = extracttime(t)
    alltimes.append([ti.day,ti.hour,ti.minute])

In [None]:
#Put the times into an array so we can easily plot them.
import numpy as np
timearr = np.array(alltimes)
print timearr[:,0]

In [None]:
import matplotlib.pyplot as plt

plt.title("Berkeley good mornings")
plt.hist(timearr[:,1],range(0,25))
plt.xlabel("Hour, Pacific Time")
plt.ylabel("Num. of Tweets")
plt.show()

People near Berkeley start saying "good morning" in earnest around 7am, although there's a non-negligible baseline rate at all times.

In [None]:
#Now repeating for people in Pittsburgh
allres_b = allres
allres = []
minid = 9999999999999999999999
for i in range(0,30):
    print 'Up to tweet', minid, 'iteration', i
    res = tsearch.search.tweets(q=term,count=100,result_type='recent',max_id=minid,geocode='40.44,-80.00,50km')
    print 'Hits:', len(res['statuses'])   #search_metadata['count'] appears to just echo the requested count
    if len(res['statuses'])==0:
        break
    #minid = re.search('max_id=([^&]+)&',res['search_metadata']['next_results']).group(1)
    #max_id is inclusive, so move to one below the smallest ID in this batch
    minid = min(s['id'] for s in res['statuses'])-1
    print 'New minimum ID:', minid
    allres.extend(res['statuses'])

In [None]:
alltimes = []
for t in allres:
    ti = extracttime(t)
    alltimes.append([ti.day,ti.hour,ti.minute])
    
timearr = np.array(alltimes)
print timearr

In [None]:
# the histogram of the data
np.histogram(timearr[:,1], range(0,25))   #one bin per hour, matching the plots

In [None]:
plt.title("Pittsburgh good mornings")
plt.hist(timearr[:,1],range(0,25))
plt.xlabel("Hour, Pacific Time")
plt.ylabel("Num. of Tweets")
plt.show()
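
Note that the Pittsburgh hours are still in Pacific time, because `extracttime` subtracts seven hours from every tweet. To read them in local (Eastern) time, the hours can be shifted by three and wrapped around midnight. A small sketch, with a hypothetical helper and invented hours:

```python
# Sketch: tally tweets per hour of day after shifting to another time zone.
from collections import Counter

def hourly_counts(hours, shift=0):
    """Count tweets per hour of day, shifting each hour and wrapping at 24."""
    return Counter((h + shift) % 24 for h in hours)

pacific_hours = [4, 4, 5, 22]                      # invented hours, Pacific time
eastern = hourly_counts(pacific_hours, shift=3)    # the same tweets, Eastern time
```

With the array above, the equivalent shift would be `(timearr[:,1]+3)%24`.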

### Conclusions?  And where from here?

Obviously, this is a pretty toy-ish example, but maybe it raises some interesting ideas about what sources of noise there are. Why are there "good morning" tweets at the wrong times?  Is Twitter handling locations or times incorrectly, are people lying about their self-reported locations, or what?

### A second example

How do we speak about gender?

We'll search for *women can be* and *men can be*, and look at the concordances.

In [None]:
#Women searches

allwomen = []
term = '%22women+can+be%22'
minid = 9999999999999999999999
for i in range(0,30):
    print 'Up to tweet', minid, 'iteration', i
    res = tsearch.search.tweets(q=term+'+-rt',count=100,result_type='recent',max_id=minid)   #the + separates the -rt exclusion from the quoted phrase
    print 'Hits:', len(res['statuses'])   #search_metadata['count'] appears to just echo the requested count
    if len(res['statuses'])==0:
        break
    #minid = re.search('max_id=([^&]+)&',res['search_metadata']['next_results']).group(1)
    #max_id is inclusive, so move to one below the smallest ID in this batch
    minid = min(s['id'] for s in res['statuses'])-1
    print 'New minimum ID:', minid
    allwomen.extend(res['statuses'])

In [None]:
allmen = []
term = '%22men+can+be%22'
minid = 9999999999999999999999
for i in range(0,30):
    print 'Up to tweet', minid, 'iteration', i
    res = tsearch.search.tweets(q=term,count=100,result_type='recent',max_id=minid)
    print 'Hits:', len(res['statuses'])   #search_metadata['count'] appears to just echo the requested count
    if len(res['statuses'])==0:
        break
    #minid = re.search('max_id=([^&]+)&',res['search_metadata']['next_results']).group(1)
    #max_id is inclusive, so move to one below the smallest ID in this batch
    minid = min(s['id'] for s in res['statuses'])-1
    print 'New minimum ID:', minid
    allmen.extend(res['statuses'])


### What can we look for here?

We could do more sophisticated analysis, and would probably want to if we were collecting this data for an actual project.  But let's just look at a really simple question: what adjectives are ascribed to men and women?

In [None]:
def extractconcordance(t,r):
    tweettext = t['text']
    nextwordhit = re.search(r,tweettext)
    if nextwordhit is None:
        return 'No match'
    else:
        return nextwordhit.group(1)

womenre = re.compile('women can be ([^ ]+)',re.I)

#for t in allwomen[0:15]:
#    print t['text']
#    print extractconcordance(t,womenre)

womendict = {}
for t in allwomen:
    word = extractconcordance(t,womenre).lower()
    womendict[word] = womendict.get(word,0)+1

pprint(womendict)


In [None]:
#pprint(np.sort(np.array([(k,v) for (k,v) in womendict.iteritems()],dtype=[('word', '<U43'), ('count', 'i8')]),axis=0,order=['count'])[-50:])
pprint({k: v for [k,v] in womendict.iteritems() if v>4})
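
If you'd rather see the words ranked than filtered, `collections.Counter` will sort a count dictionary for you. A minimal sketch: `top_words` is a hypothetical helper, and the counts are invented placeholders, not real results.

```python
# Sketch: rank words in a count dictionary, most frequent first.
from collections import Counter

def top_words(worddict, n=10, min_count=1):
    """Return up to n (word, count) pairs with count >= min_count."""
    counts = Counter(worddict)
    return [(w, c) for (w, c) in counts.most_common(n) if c >= min_count]

# Invented counts in the shape of womendict / mendict
sample_counts = {'alpha': 7, 'beta': 12, 'no match': 3, 'gamma': 5}
top = top_words(sample_counts, n=3, min_count=5)
```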


In [None]:
menre = re.compile(r'\bmen can be ([^ ]+)',re.I)   #\b keeps this from matching inside "women can be"

#for t in allmen[0:15]:
#    print t['text']
#    print extractconcordance(t,menre)

mendict = {}
for t in allmen:
    word = extractconcordance(t,menre).lower()
    mendict[word] = mendict.get(word,0)+1

pprint(mendict)


In [None]:
pprint({k: v for [k,v] in mendict.iteritems() if v>4})
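
One thing to watch with patterns like these: *men can be* is also a substring of *women can be*, so a pattern without a word boundary will match both. The sketch below contrasts a loose and a strict pattern on an invented sentence. The strict version also uses `\w+` so trailing punctuation isn't counted as part of the word, which is my own choice rather than anything the analysis requires.

```python
# Sketch: why a word boundary matters when extracting the next word.
import re

loose = re.compile('men can be ([^ ]+)', re.I)       # also matches inside "women"
strict = re.compile(r'\bmen can be (\w+)', re.I)     # must start at a word boundary

text = 'Women can be engineers.'                     # invented example
loose_hit = loose.search(text)
strict_hit = strict.search(text)

other_hit = strict.search('Men can be kind, too')
```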

### Conclusions, etc.

So sure enough, we see stereotypical language being used, even in this really simple analysis.  How could we expand the analysis?

### A third example, if you're interested

Here's a problem I've been working on lately: the relationship between actual and prescribed gendered and gender-neutral language use.

Many style guides still claim that *everyone put on their coats* is bad, and *everyone put on **his** coat* is the only acceptable form. Twitter might not have high editorial standards, but it's an interesting case for looking at how people use their language, and it might help us understand if *they* sounds natural.  So let's compile some data!

Specifically, let's start out by comparing two searches:
* *everybody their*
* *everybody his*

What I'd like to do is get at two important ratios:
* what are the relative frequencies of these options? (tweets/day)
* what proportion of each of these is a case where *their/his* is referring to *everybody*?
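
For the first ratio, one rough way to estimate tweets per day is to divide the number of collected tweets by the time spanned between the oldest and newest of them. A sketch under that assumption, with a hypothetical function name and invented timestamps:

```python
# Sketch: estimate tweets/day from the time span covered by a set of tweets.
import datetime

TWITTER_TIME_FORMAT = '%a %b %d %H:%M:%S +0000 %Y'

def tweets_per_day(tweets):
    """Return tweets per day, or None if the span can't be measured."""
    if len(tweets) < 2:
        return None
    times = [datetime.datetime.strptime(t['created_at'], TWITTER_TIME_FORMAT)
             for t in tweets]
    days = (max(times) - min(times)).total_seconds() / 86400.0
    if days == 0:
        return None
    return len(tweets) / days

# Invented timestamps spanning exactly two days
sample = [
    {'created_at': 'Mon Sep 24 00:00:00 +0000 2012'},
    {'created_at': 'Mon Sep 24 12:00:00 +0000 2012'},
    {'created_at': 'Tue Sep 25 00:00:00 +0000 2012'},
    {'created_at': 'Wed Sep 26 00:00:00 +0000 2012'},
]
rate = tweets_per_day(sample)
```

Comparing this rate across the *their* and *his* searches gives the relative frequencies, as long as both searches were run over comparable periods.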

And to follow it up, I'd like to compare two more alternatives:
* *everybody "his or her"*
* *everybody her*

*Everybody had his or her coat on*, or things in that vein, are often offered as compromises, and many of us take that compromise in our writing. But, as your Twitter data (probably) shows, this is a formal circumvention, and almost everyone naturally uses *their* in this situation.  As it turns out, we've been using gender-neutral *they* in these contexts for centuries! (Bodine 1975)

Anyway, this is one way of seeing the formal/informal distinction in language use between an edited corpus (like Google Books) and a more conversational corpus (like Twitter). Twitter gets us into new territories, and we'll see that even more in Part 3.

In [None]:
#To go back 3000 tweets, you need 30 searches at the maximum count of 100 tweets per search. 
#Find out if you have enough searches left

In [None]:
#r = tsearch.application.rate_limit_status()
#remaining = r['resources']['search']['/search/tweets']['remaining']
#print remaining

In [None]:
#Construct an *everybody their* search, and iterate through it 30 times