## Search overview

We'll end up looking at two main kinds of searches:
* *GET search/tweets*, which returns tweets matching a search term
* *GET statuses/user_timeline*, which returns a given user's most recent tweets (up to roughly 3,200 of them)

Note on timeline iteration: https://dev.twitter.com/rest/public/timelines

## Setting up the search object

First things first; we'll set up a search object like last time.

In [60]:
import sys, os, re
from pprint import pprint                           #Important for reading through JSONs
from time import localtime,strftime,sleep,time      #Important for dealing with Twitter rate limits
import datetime                       #Important for processing Twitter timestamps
import twitter

In [2]:
cons_oauth_file = 'c.xxx'
if os.path.exists(cons_oauth_file):
    constoken, conssecret = twitter.read_token_file(cons_oauth_file)
else:
    constoken = raw_input("What is your app's 'Consumer Key'?").strip()
    conssecret = raw_input("What is your app's 'Consumer Secret'?").strip()
    wf = open(cons_oauth_file,'w'); wf.write(constoken+'\n'+conssecret); wf.close()

In [3]:
app_oauth_file = 'a.xxx'
if not os.path.exists(app_oauth_file):									#if user not authorized already
	twitter.oauth_dance("your app",constoken,conssecret,app_oauth_file)		#perform OAuth Dance
apptoken, appsecret = twitter.read_token_file(app_oauth_file)					#import user credentials

In [4]:
tsearch = twitter.Twitter(auth=twitter.OAuth(apptoken,appsecret,constoken,conssecret))	#create search command

## GET search/tweets

Now that _tsearch_ is initialized, let's get to searching! [Here](https://dev.twitter.com/rest/reference/get/search/tweets) is Twitter's documentation on the GET search/tweets call, which is pretty good.

GET search/tweets takes a range of arguments, but I find these the most important:
* *q* : the search term, which must be UTF-8 & URL-encoded
* *count* : how many tweets per search? (100 max)
* *result_type* : do you want all recent tweets, or those that Twitter thinks are most interesting? (Hint: the former, you definitely want the former.)
* *max_id* : limits results to tweets with IDs at or below the specified tweet ID (i.e., that tweet and older ones)
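Since *q* has to be URL-encoded, it can be easier to let Python do the encoding than to type `+` and `%22` by hand. Here's a minimal sketch using the standard library's `quote_plus`; the variable names are made up for illustration, and the phrase is the "good morning" search used further down. It also shows what an already-encoded term looks like if it gets encoded a second time, which is worth knowing in case the library you're using encodes parameters for you:

```python
try:
    from urllib import quote_plus          # Python 2
except ImportError:
    from urllib.parse import quote_plus    # Python 3

raw_term = '"good morning"'                # the phrase as you'd type it in a search box
encoded_once = quote_plus(raw_term)        # spaces become +, quotes become %22
encoded_twice = quote_plus(encoded_once)   # % becomes %25, + becomes %2B

print(encoded_once)     # %22good+morning%22
print(encoded_twice)    # %2522good%2Bmorning%2522
```

The doubly encoded form is what the *query* field of *search_metadata* echoes back for the searches later in this notebook, so it's worth glancing at that field to see what Twitter actually received.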

Try out a simple search below.

In [5]:
term="placeholder+text"  #Note: use + instead of spaces

res = tsearch.search.tweets(q=term)   #A test query
#Optional arguments to try later:
#   count=10                just want 10 hits back
#   result_type="recent"    include all recent tweets, not only popular ones


The result of a search is again a nested dictionary. (The Twitter API returns data in JSON format, which the *twitter* library automatically decodes into nested dictionaries and lists.)  The returned tweets are in a list under the 'statuses' key. (Internally, the API refers to tweets as statuses, which is weird sometimes.)

Let's look at the first hit:

In [9]:
pprint(res['statuses'][0])

{u'contributors': None,
 u'coordinates': None,
 u'created_at': u'Thu Jun 22 21:38:32 +0000 2017',
 u'entities': {u'hashtags': [],
               u'symbols': [],
               u'urls': [],
               u'user_mentions': [{u'id': 302254432,
                                   u'id_str': u'302254432',
                                   u'indices': [0, 11],
                                   u'name': u'\U0001f43a\U0001f985\U0001f409\U0001f432',
                                   u'screen_name': u'pjparker16'}]},
 u'favorite_count': 0,
 u'favorited': False,
 u'geo': None,
 u'id': 878004300059987968,
 u'id_str': u'878004300059987968',
 u'in_reply_to_screen_name': u'pjparker16',
 u'in_reply_to_status_id': 878003383977639937,
 u'in_reply_to_status_id_str': u'878003383977639937',
 u'in_reply_to_user_id': 302254432,
 u'in_reply_to_user_id_str': u'302254432',
 u'is_quote_status': False,
 u'lang': u'en',
 u'metadata': {u'iso_language_code': u'en', u'result_type': u'recent'},
 u'place': None,
 u'

Boy, that's a lot of information. Twitter calls this a fully-hydrated result, and it includes information about the tweet, the user, and the social engagement of the tweet.  Here's Twitter's [overview of the information](https://dev.twitter.com/overview/api/tweets) in a tweet.

I like to compare this against what Twitter shows us on its website, where the visualization is easier.  Let's make a real quick function to extract the URL from this information so we can visualize the tweets as we talk about them:

In [10]:
def extracttweetURL(j):
	return 'http://twitter.com/'+j['user']['screen_name']+'/status/'+str(j['id'])

t = res['statuses'][0]
print extracttweetURL(t)

http://twitter.com/Twintendo_/status/878004300059987968


The key pieces of information depend on your goals, but in most cases these will be important:
* *created_at*: tweet's time, in UTC.
* *favorite_count*, *retweet_count*: number of favs & RTs, respectively, the tweet has amassed
* *id*: tweet's unique numerical ID
* *in_reply_to_status_id*: ID of the tweet this one's replying to (if any)
* *text*: the text of the tweet
* *user*: all the info about the tweeter

And within the user dictionary, here're some important fields:
* *id*: tweeter's numerical ID (constant throughout account's lifespan)
* *location*: self-reported location of tweeter
* *friends_count*, *followers_count*: number of people the tweeter follows and is followed by (respectively)
* *name*: tweeter's display name (can change)
* *screen_name*: tweeter's Twitter handle (i.e., @whatever; also can change)

Just to make the tweets a little more readable, I'm going to create a function that prunes each tweet down to just these features.

In [15]:
def prunetweet(t):
    d = {k: t[k] for k in ['created_at','favorite_count','retweet_count','id','in_reply_to_status_id','text']}    #keeping only relevant top-level features (user features handled below)
    d['user'] = {k: t['user'][k] for k in ['id','location','friends_count','followers_count','name','screen_name']} #keeping only relevant features
    return d

pprint(prunetweet(t))

{'created_at': u'Thu Jun 22 21:38:32 +0000 2017',
 'favorite_count': 0,
 'id': 878004300059987968,
 'in_reply_to_status_id': 878003383977639937,
 'retweet_count': 0,
 'text': u"@pjparker16 Probably placeholder text for those who are a very low level who haven't even picked a team yet",
 'user': {'followers_count': 8477,
          'friends_count': 240,
          'id': 2349001082,
          'location': u'Sheffield, England',
          'name': u'Twintendo',
          'screen_name': u'Twintendo_'}}


### Playing around with tweets

Let's play around with these a bit; try some different searches and look at the results you get. Are there any really surprising results?

In [16]:
term="%22good+morning%22"  #Note: use + instead of spaces, and %22 instead of quotes
count=25                   #Don't bother with too many hits yet

res = tsearch.search.tweets(q=term,count=count,result_type='recent')

In [19]:
for i in range(0,len(res['statuses'])):
    print '\n',i, extracttweetURL(res['statuses'][i])
    pprint(prunetweet(res['statuses'][i]))


0 http://twitter.com/NobitaSuperx/status/878031420542693376
{'created_at': u'Thu Jun 22 23:26:18 +0000 2017',
 'favorite_count': 0,
 'id': 878031420542693376,
 'in_reply_to_status_id': None,
 'retweet_count': 0,
 'text': u'Good morning Friday :) #nobitasuperx #\u0e15\u0e37\u0e48\u0e19\u0e02\u0e36\u0e49\u0e19\u0e21\u0e32\u0e01\u0e47\u0e15\u0e32\u0e15\u0e35\u0e48 https://t.co/ocMuhpNHGx',
 'user': {'followers_count': 5753,
          'friends_count': 591,
          'id': 3019268520,
          'location': u'Tokyo Japan',
          'name': u'NobitaSuperx',
          'screen_name': u'NobitaSuperx'}}

1 http://twitter.com/inang_salud/status/878031416893861889
{'created_at': u'Thu Jun 22 23:26:17 +0000 2017',
 'favorite_count': 0,
 'id': 878031416893861889,
 'in_reply_to_status_id': None,
 'retweet_count': 4,
 'text': u'RT @DocMarvoree: @justapple24 Good morning ate apple :) \n\nBestTHURgether MARVOREE',
 'user': {'followers_count': 311,
          'friends_count': 657,
          'id': 8213189

A few things I've found strange/interesting/annoying:
* a query can be matched by a username in addition to the text itself.
* Jupyter isn't displaying Unicode well (so no emoji, :( )
* favorite_count in tweet, favo**u**rites_count in user
* manual RTs?
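On that last point: as far as I know, retweets made with the retweet button carry a *retweeted_status* entry holding the original tweet, while manual RTs are ordinary tweets whose text just happens to start with "RT @". Here's a rough sketch for telling them apart. `classify_rt` is a made-up helper, the sample tweets are invented, and the text check is only a heuristic:

```python
def classify_rt(t):
    """Label a tweet dictionary as 'native RT', 'manual RT', or 'original'."""
    if 'retweeted_status' in t:                  # present on button retweets
        return 'native RT'
    if t.get('text', '').startswith('RT @'):     # heuristic: typed or pasted by hand
        return 'manual RT'
    return 'original'

# Made-up miniature tweets, just to exercise the three branches
samples = [
    {'text': 'RT @someone: Good morning!', 'retweeted_status': {'id': 1}},
    {'text': 'RT @someone: Good morning!'},
    {'text': 'Good morning'},
]
labels = [classify_rt(s) for s in samples]
print(labels)
```

Note that this needs the full tweet, since *prunetweet* throws away everything except the handful of fields listed earlier.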

### Iterating back in time

Twitter limits the number of tweets from any single *GET search/tweets* call to 100. But you're allowed to go back up to one week, or 3000 tweets, whichever you run afoul of first.  How do you do that?  The result of each API call has a *search_metadata* feature, which both gives info about the completed search and tells you where to go from here:

In [33]:
res['search_metadata']

{u'completed_in': 0.073,
 u'count': 25,
 u'max_id': 878031347343740927,
 u'max_id_str': u'878031347343740927',
 u'next_results': u'?max_id=878031289978281984&q=%2522good%2Bmorning%2522&count=25&include_entities=1&result_type=recent',
 u'query': u'%2522good%2Bmorning%2522',
 u'refresh_url': u'?since_id=878031347343740927&q=%2522good%2Bmorning%2522&result_type=recent&include_entities=1',
 u'since_id': 0,
 u'since_id_str': u'0'}

The search API lets you specify a maximum tweet ID in each search (*max_id*), and by iteratively moving that maximum back to the minimum ID of the preceding search, you keep returning results from further back in time, until Twitter stops you.

Unfortunately, that minimum ID is not supplied directly here; you have to extract it from the *next_results* string, or extract it manually as the minimum ID in your results.  Also, note that the *max_id* search value is **inclusive**, so you should subtract one from it before your search or you'll get that tweet over again.

In [31]:
#Option 1: take the smallest tweet ID in the results, minus one (since max_id is inclusive)
minid = min(s['id'] for s in res['statuses']) - 1
print minid

#Option 2: use a regexp to find the max_id value in next_results and extract it as group 1 of the match object
#see https://docs.python.org/2.7/library/re.html
#However, Twitter is dumb if you're doing a more complex search and can't handle this.
minid = re.search('max_id=([^&]+)&',res['search_metadata']['next_results']).group(1)
print minid

878031347343740927
878031347343740927
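The regexp above only matches if another parameter follows *max_id* (it insists on a trailing `&`), and it hands back a string rather than a number. As an alternative sketch, the standard library's `parse_qs` can split *next_results* into its parameters. The string here is copied from the *search_metadata* output shown earlier; `params` and `next_max_id` are names I've made up:

```python
try:
    from urlparse import parse_qs          # Python 2
except ImportError:
    from urllib.parse import parse_qs      # Python 3

# next_results string copied from the search_metadata output shown earlier
next_results = '?max_id=878031289978281984&q=%2522good%2Bmorning%2522&count=25&include_entities=1&result_type=recent'

params = parse_qs(next_results.lstrip('?'))   # maps each parameter name to a list of values
next_max_id = int(params['max_id'][0])        # convert to int so it compares cleanly with tweet IDs
print(next_max_id)
```

This still depends on *next_results* being present in the metadata, so taking the minimum ID from the results themselves remains the safer fallback.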


In [32]:
res = tsearch.search.tweets(q=term,count=count,result_type='recent',max_id=minid)
res['search_metadata']

Ta da!

## Good morning!

Hey, here's a stupid test case. When do people say good morning?

In [37]:
#To go back 3000 tweets, you need 30 searches at the maximum count of 100 tweets per search. 
#Find out if you have enough searches left

In [38]:
r = tsearch.application.rate_limit_status()
remaining = r['resources']['search']['/search/tweets']['remaining']
print remaining

180
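If you're running low, you'll want to know how long to wait. Here's a small sketch, assuming each rate-limit entry also carries a *reset* field with the epoch time at which the window refreshes (Twitter's rate-limit responses include one alongside *limit* and *remaining*). `seconds_until_ready` is a hypothetical helper and the two dictionaries are made up:

```python
import time

def seconds_until_ready(limit_info, needed, now=None):
    """Return 0 if enough calls remain, else the seconds until the window resets."""
    if now is None:
        now = time.time()
    if limit_info['remaining'] >= needed:
        return 0
    return max(0, limit_info['reset'] - now)

# Made-up entries shaped like r['resources']['search']['/search/tweets']
plenty = {'limit': 180, 'remaining': 180, 'reset': 1498180000}
nearly_out = {'limit': 180, 'remaining': 12, 'reset': 1498180000}

wait_a = seconds_until_ready(plenty, needed=30, now=1498179400)
wait_b = seconds_until_ready(nearly_out, needed=30, now=1498179400)
print(wait_a)   # 0
print(wait_b)   # 600
```

In practice you'd pass the result to `sleep` before starting a batch of 30 searches.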


In [39]:
#Let's find the 3000 most recent "good morning" tweets
#Build a search for "good morning" (remember to convert quotes to the URL-encoded %22) & check it works as expected

In [42]:
term = '%22good+morning%22'
res = tsearch.search.tweets(q=term,count=100,result_type='recent')

In [43]:
res['search_metadata']

{u'completed_in': 0.061,
 u'count': 100,
 u'max_id': 878078641049001984,
 u'max_id_str': u'878078641049001984',
 u'next_results': u'?max_id=878078498950205439&q=%2522good%2Bmorning%2522&count=100&include_entities=1&result_type=recent',
 u'query': u'%2522good%2Bmorning%2522',
 u'refresh_url': u'?since_id=878078641049001984&q=%2522good%2Bmorning%2522&result_type=recent&include_entities=1',
 u'since_id': 0,
 u'since_id_str': u'0'}

In [None]:
#Extract the minimum tweet ID, and get the 100 tweets before it
#Loop 30 times (or until a search comes back with no tweets)
#Be sure to save all the tweets
#Note: searching with max_id=0 is the same as having no max.

In [93]:
allres = []
minid = 0
for i in range(0,30):
    print 'Up to tweet', minid, 'iteration', i
    #geocode restricts hits to within 50km of Berkeley (latitude,longitude,radius)
    res = tsearch.search.tweets(q=term,count=100,result_type='recent',max_id=minid,geocode='37.87,-122.27,50km')
    nhits = len(res['statuses'])    #search_metadata['count'] only echoes the requested count, so count the tweets themselves
    print 'Hits:', nhits
    if nhits==0:
        break
    allres.extend(res['statuses'])
    #max_id is inclusive, so step one below the smallest ID in this batch
    minid = min(s['id'] for s in res['statuses']) - 1
    print 'New minimum ID:', minid


Up to tweet 0 iteration 0
Hits: 100
New minimum ID: 877949514019221504
Up to tweet 877949514019221504 iteration 1
Hits: 100
New minimum ID: 877921090235842559
Up to tweet 877921090235842559 iteration 2
Hits: 100
New minimum ID: 877902935270739967
Up to tweet 877902935270739967 iteration 3
Hits: 100
New minimum ID: 877894820471095295
Up to tweet 877894820471095295 iteration 4
Hits: 100
New minimum ID: 877849025365463039
Up to tweet 877849025365463039 iteration 5
Hits: 100
New minimum ID: 877709830504366079
Up to tweet 877709830504366079 iteration 6
Hits: 100
New minimum ID: 877631636304666624
Up to tweet 877631636304666624 iteration 7
Hits: 100
New minimum ID: 877591059630481407
Up to tweet 877591059630481407 iteration 8
Hits: 100
New minimum ID: 877570126010130431
Up to tweet 877570126010130431 iteration 9
Hits: 100
New minimum ID: 877558912152555520
Up to tweet 877558912152555520 iteration 10
Hits: 100
New minimum ID: 877550867490250751
Up to tweet 877550867490250751 iteration 11
Hits
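The loop above can be tried out without spending any rate limit. Here's a sketch of the same walk-backward idea with the API call swapped for a stand-in function; `collect_back`, `fake_fetch`, and the fake tweets are all invented for the example, so treat it as an illustration of the bookkeeping rather than a finished tool:

```python
def collect_back(fetch, max_pages=30):
    """Call fetch(max_id) repeatedly, walking backward until a page comes back empty."""
    collected = []
    max_id = None                          # None means "no upper limit yet"
    for _ in range(max_pages):
        statuses = fetch(max_id)
        if not statuses:                   # nothing older left
            break
        collected.extend(statuses)
        max_id = min(s['id'] for s in statuses) - 1   # max_id is inclusive, so step below
    return collected

# Stand-in for the API: 250 fake tweets with IDs 1..250, served 100 at a time, newest first
fake_tweets = [{'id': i} for i in range(250, 0, -1)]

def fake_fetch(max_id, page_size=100):
    eligible = [t for t in fake_tweets if max_id is None or t['id'] <= max_id]
    return eligible[:page_size]

got = collect_back(fake_fetch)
print(len(got))   # 250
```

Against the real API, `fetch` would be a thin wrapper around the search call that returns the list of statuses. Because each step drops one below the smallest ID seen, no tweet is collected twice.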

In [92]:
res['statuses']

[{u'contributors': None,
  u'coordinates': None,
  u'created_at': u'Fri Jun 23 02:44:47 +0000 2017',
  u'entities': {u'hashtags': [],
   u'symbols': [],
   u'urls': [],
   u'user_mentions': []},
  u'favorite_count': 0,
  u'favorited': False,
  u'geo': None,
  u'id': 878081371016527876,
  u'id_str': u'878081371016527876',
  u'in_reply_to_screen_name': None,
  u'in_reply_to_status_id': None,
  u'in_reply_to_status_id_str': None,
  u'in_reply_to_user_id': None,
  u'in_reply_to_user_id_str': None,
  u'is_quote_status': False,
  u'lang': u'en',
  u'metadata': {u'iso_language_code': u'en', u'result_type': u'recent'},
  u'place': None,
  u'retweet_count': 0,
  u'retweeted': False,
  u'source': u'<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>',
  u'text': u'Good morning',
  u'truncated': False,
  u'user': {u'contributors_enabled': False,
   u'created_at': u'Sat Apr 02 06:35:00 +0000 2016',
   u'default_profile': True,
   u'default_profile_image': False,
  

In [63]:
#Check how many tweets you ended up with
#Now let's extract their times

def extracttime(t):
    UTCoffset = datetime.timedelta(hours=7)   #I'm assuming we're on Pacific Daylight Time (UTC-7)
    return datetime.datetime.strptime(t['created_at'],'%a %b %d %H:%M:%S +0000 %Y')-UTCoffset


In [64]:
t=allres[0]
print datetime.datetime.strptime(t['created_at'],'%a %b %d %H:%M:%S +0000 %Y')
print extracttime(t)
print extracttweetURL(t)

2017-06-23 02:45:59
2017-06-22 19:45:59
http://twitter.com/GLOBALERPE/status/878081672268267521
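The seven-hour offset is baked into *extracttime*, which is fine for one week in June on the US west coast but not much else. Here's a sketch of the same conversion with the offset as an argument. `to_local` is a made-up name, and a fixed offset still ignores daylight-saving changes, so this is only safe when all the tweets fall on one side of a clock change:

```python
import datetime

def to_local(created_at, utc_offset_hours):
    """Parse Twitter's created_at string (always UTC) and shift it by a fixed offset."""
    utc = datetime.datetime.strptime(created_at, '%a %b %d %H:%M:%S +0000 %Y')
    return utc + datetime.timedelta(hours=utc_offset_hours)

stamp = 'Thu Jun 22 21:38:32 +0000 2017'   # created_at of the first tweet shown earlier
pdt = to_local(stamp, -7)                  # Pacific Daylight Time
bst = to_local(stamp, +1)                  # British Summer Time
print(pdt)   # 2017-06-22 14:38:32
print(bst)   # 2017-06-22 22:38:32
```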


In [94]:
alltimes = []
for t in allres:
    ti = extracttime(t)
    alltimes.append([ti.day,ti.hour,ti.minute])

In [100]:
import numpy as np
timearr = np.array(alltimes)
print timearr[:,0]

[22 22 22 ..., 19 19 19]


In [99]:
import numpy as np
#import matplotlib.pyplot as plt

# the histogram of the data
# (note: 23 equal-width bins over hours 0-23 put hours 22 and 23 together in the last bin)
np.histogram(timearr[:,1], 23)


(array([ 37,  35,  48,  41,  53,  91, 122, 498, 407, 248, 151, 113,  73,
        157, 192, 125, 110,  96,  89,  83,  61,  47,  88]),
 array([  0.,   1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.,  10.,
         11.,  12.,  13.,  14.,  15.,  16.,  17.,  18.,  19.,  20.,  21.,
         22.,  23.]))

People near Berkeley start saying "good morning" in earnest around 7am, although there's a non-negligible baseline rate at all times.
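One wrinkle in that histogram: there are 24 possible hours (0 through 23) but only 23 bins, so hours 22 and 23 end up sharing the last one. Passing explicit edges, e.g. `np.histogram(timearr[:,1], bins=range(25))`, gives one bin per hour. Here's a standard-library sketch of the same per-hour count; `tweets_per_hour` is a made-up helper and the sample hours are invented:

```python
from collections import Counter

def tweets_per_hour(hours):
    """Return a list of 24 counts, one per hour of the day (0-23)."""
    counts = Counter(hours)
    return [counts.get(h, 0) for h in range(24)]

# Made-up hours standing in for timearr[:,1]
sample_hours = [7, 7, 8, 22, 23, 23, 0]
per_hour = tweets_per_hour(sample_hours)
print(per_hour)
```

With one count per hour, the late-evening numbers aren't inflated by merging two hours.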

### A second example

Here's a problem I've been working on lately: the relationship between actual and prescribed gendered and gender-neutral language use.

Many style guides still claim that *everyone put on their coats* is bad, and *everyone put on **his** coat* is the only acceptable form. Twitter might not have high editorial standards, but it's an interesting case for looking at how people use their language, and it might help us understand if *they* sounds natural.  So let's compile some data!

Specifically, let's start out by comparing two searches:
* *everybody their*
* *everybody his*

What I'd like to do is get at two important ratios:
* what are the relative frequencies of these options? (tweets/day)
* what proportion of each of these is a case where *their/his* is referring to *everybody*?
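For the first ratio, one rough way to get tweets/day is to divide the number of tweets collected by the time span they cover. Here's a sketch under that assumption; `tweets_per_day` is a hypothetical helper and the timestamps are made up, so the number it prints says nothing about real usage:

```python
import datetime

def tweets_per_day(created_ats):
    """Estimate tweets/day from the span between the oldest and newest timestamp."""
    fmt = '%a %b %d %H:%M:%S +0000 %Y'
    times = [datetime.datetime.strptime(c, fmt) for c in created_ats]
    span_days = (max(times) - min(times)).total_seconds() / 86400.0
    if span_days == 0:
        return None                        # can't estimate a rate from a single instant
    return len(times) / span_days

# Made-up timestamps: 4 tweets spread over exactly 2 days
stamps = ['Tue Jun 20 12:00:00 +0000 2017',
          'Wed Jun 21 00:00:00 +0000 2017',
          'Wed Jun 21 12:00:00 +0000 2017',
          'Thu Jun 22 12:00:00 +0000 2017']
rate = tweets_per_day(stamps)
print(rate)   # 2.0
```

Running this on the *created_at* values from each search would give two rates to compare. The second ratio still needs a human to read the tweets and judge what *their/his* refers to.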

In [34]:
#To go back 3000 tweets, you need 30 searches at the maximum count of 100 tweets per search. 
#Find out if you have enough searches left

In [36]:
r = tsearch.application.rate_limit_status()
remaining = r['resources']['search']['/search/tweets']['remaining']
print remaining

180


In [None]:
#Construct an *everybody their* search, and iterate through it 30 times
res = {}
tres = 