## Search overview

We'll end up looking at two main kinds of searches:
* *GET search/tweets*, which returns tweets matching a search term
* *GET statuses/user_timeline*, which returns a given user's most recent tweets (up to 3200 of them)

Note on timeline iteration: https://dev.twitter.com/rest/public/timelines

## Conversations

One cool thing about Twitter as a data source is that it has a lot of interactions. You might be thinking: well, sure, but those interactions are mostly just alt-right trolls tossing Pepes at people who immediately block them, or fans shouting at their idols in hopes of attracting a response from [Paramore](https://twitter.com/after1aughter/status/877661990541336576) or [Five Seconds of Summer](https://twitter.com/Michael5SOS/status/878164242284765184).

But lurking beneath those well-worn tropes are actual conversations, and let's try to find a few of them.

In [None]:
#The usual initialization stuff

import sys, os, re
from pprint import pprint                           #Important for reading through JSONs
from time import localtime,strftime,sleep,time      #Important for dealing with Twitter rate limits
import datetime                       #Important for processing Twitter timestamps
import twitter

cons_oauth_file = 'c.xxx'
if os.path.exists(cons_oauth_file):
    constoken, conssecret = twitter.read_token_file(cons_oauth_file)
else:
    constoken = raw_input("What is your app's 'Consumer Key'?").strip()
    conssecret = raw_input("What is your app's 'Consumer Secret'?").strip()
    with open(cons_oauth_file,'w') as wf:           #save the key and secret for next time
        wf.write(constoken+'\n'+conssecret)
    
app_oauth_file = 'a.xxx'
if not os.path.exists(app_oauth_file):                                      #if user not authorized already
    twitter.oauth_dance("your app",constoken,conssecret,app_oauth_file)     #perform OAuth Dance
apptoken, appsecret = twitter.read_token_file(app_oauth_file)               #import user credentials

tsearch = twitter.Twitter(auth=twitter.OAuth(apptoken,appsecret,constoken,conssecret))  #create search command

def extracttweetURL(j):
    #Build the public URL of a tweet from its JSON
    return 'http://twitter.com/'+j['user']['screen_name']+'/status/'+str(j['id'])


### First, how do we reconstruct a conversation?

Lots of tweets are in reply to someone else. (A lot of the "women/men can be" tweets from the previous part were, for example.) Twitter uses a linked-list style in its conversation structure, which allows us to crawl back up a conversation.  Let's see how this works in practice.

I'm going to cheat and use the first hit from the first time I ran the "women can be" search. Here's the tweet, from the wonderfully-named *@bitchardnixon*, on Twitter: http://twitter.com/bitchardnixon/status/878355576719372288

It's a really simple back-and-forth conversation between two people who clearly have a history of interactions, exactly the sort of conversation that's been underrepresented in corpus and lab studies.  This is such a stupid, simple interaction, on par with all the other stupid, simple interactions we have every day, but boy, I've never been so excited to see this sort of thing!

So how do we collect this data from the API? Well, it's got a [*GET statuses/lookup*](https://dev.twitter.com/rest/reference/get/statuses/lookup) call that returns data on up to 100 tweets at a time.  So let's find out about tweet 878355576719372288!

In [None]:
tweets = tsearch.statuses.lookup(_id=878355576719372288)
pprint(tweets)

There's our data, and there's a feature *in_reply_to_status_id*, which indicates what tweet this one is replying to. There's also *in_reply_to_screen_name* and *in_reply_to_user_id*, which will be useful later on for improving our conversational coverage, but for now let's focus on the specific preceding tweet.  Since we have its ID, we can look it up in the same way.

(Two notes: one, since *GET statuses/lookup* can take up to 100 IDs, it returns a list of JSONs (just like *GET search/tweets*), even if only one hit comes back; two, for all the Python *twitter* search functions, to specify an ID as an argument, you have to write it as *_id=*, with the initial underscore.)
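Since a single call only takes 100 IDs, a longer list of IDs has to be split into batches first. Here's a minimal sketch of that; `batch_ids` is a helper name made up for illustration, and it assumes the IDs are sent as one comma-separated string, which is the form the REST API documents for this call.

```python
def batch_ids(ids, size=100):
    """Yield comma-separated ID strings holding at most `size` IDs each."""
    ids = list(ids)
    for start in range(0, len(ids), size):
        yield ','.join(str(i) for i in ids[start:start + size])

# Example usage (needs the authenticated `tsearch` object from above;
# `wanted_ids` is a hypothetical list of tweet IDs):
# for id_string in batch_ids(wanted_ids):
#     tweets.extend(tsearch.statuses.lookup(_id=id_string))
```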

In [None]:
tweets.extend(tsearch.statuses.lookup(_id=tweets[0]['in_reply_to_status_id']))
pprint(tweets[1])

We'll repeat with the preceding message:

In [None]:
tweets.extend(tsearch.statuses.lookup(_id=tweets[1]['in_reply_to_status_id']))
pprint(tweets[2])

And now, the *in_reply_to_status_id* field has the value **None**, so we've reached the top of the conversation. What a long, strange trip it's been.
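The three manual steps above are really one loop: look up a tweet, follow its *in_reply_to_status_id*, and stop when that field is **None**. Here's a rough sketch of that loop. The name `climb_conversation` and the `lookup` argument are made up for illustration; `lookup` can be any function that takes a tweet ID and returns a list of tweet JSONs, so the same code works with the real API or with a stand-in.

```python
def climb_conversation(lookup, start_id, max_steps=100):
    """Follow in_reply_to_status_id links upward from start_id.

    Returns (chain, complete): the tweets from newest to oldest, and
    whether the chain reached a tweet that is not a reply to anything.
    """
    chain = []
    next_id = start_id
    for _ in range(max_steps):
        found = lookup(next_id)
        if not found:
            return chain, False      # parent is deleted or inaccessible
        tweet = found[0]
        chain.append(tweet)
        next_id = tweet.get('in_reply_to_status_id')
        if next_id is None:
            return chain, True       # reached the top of the conversation
    return chain, False              # gave up after max_steps lookups

# Example usage with the API object from above:
# chain, complete = climb_conversation(
#     lambda i: tsearch.statuses.lookup(_id=i), 878355576719372288)
```

Returning the `complete` flag alongside the chain makes it easy to tell a conversation that really starts at its first tweet from one that just runs into a missing tweet.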

## How do we get down?

It's easy to climb up a conversation, in general.  There is the potential for missing branches in the conversational tree; some people regularly delete old tweets, others delete tweets that have gained undue attention, and some have protected accounts so that randos like you or me can't read them.  If a tweet is deleted or inaccessible, the JSON of any reply to it will still list it as the tweet being replied to, but you're not going to be able to reconstruct the conversation back to its root.

There's also the potential for people to reply out of order; say you've got a 150-character thought to express, so you split it over tweet A and tweet B, and make tweet B a reply to tweet A (yes, you can reply to your own tweets). Now I want to respond, and if I reply to tweet B, you have a nice chain: A-B-me. But if I reply to tweet A -- and people often do respond to the first tweet in a thread -- the context is kind of screwed up.  I don't have any particular advice on this; just something to be aware of.

What's really hard is trying to go down a conversation.  Unfortunately, there is no way of seeing the list of tweets that reply to a tweet.  To be honest, I don't even know how exactly Twitter does it when they show all the responses in the web/app views.

But here are some approaches that can help with this.

### Grab a user's timeline

This is the *GET statuses/user_timeline* call I mentioned way at the beginning.  If two people are having a long conversation, and we grab all of one person's tweets, we can be reasonably sure of getting almost all of the conversation.  (The only thing we'll miss is if the other person has the last word.)  If we grab the timelines of everyone in the conversation, we ought to get it all.
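Once the timelines are collected, going down a conversation is a matter of flipping the *in_reply_to_status_id* links around: index every tweet by the ID it replies to, and the replies to any tweet become a dictionary lookup. A minimal sketch, assuming `tweets` is a plain list of tweet JSONs pooled from several timelines; `index_replies` and `thread_below` are names made up for illustration.

```python
from collections import defaultdict

def index_replies(tweets):
    """Return (by_id, children) lookup tables for a list of tweet dicts."""
    by_id = {}
    children = defaultdict(list)
    for tweet in tweets:
        if tweet['id'] in by_id:
            continue                 # same tweet collected more than once
        by_id[tweet['id']] = tweet
        parent = tweet.get('in_reply_to_status_id')
        if parent is not None:
            children[parent].append(tweet['id'])
    return by_id, children

def thread_below(root_id, children):
    """Return the IDs of all collected replies beneath root_id, breadth-first."""
    found = []
    queue = [root_id]
    while queue:
        current = queue.pop(0)
        # Tweet IDs grow over time, so sorting keeps siblings in posting order.
        for child in sorted(children.get(current, [])):
            found.append(child)
            queue.append(child)
    return found
```

This only finds replies that are actually in the collected data, so it is exactly as complete as the set of timelines behind it.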

Unlike *GET search/tweets*, which only goes back one week, *GET statuses/user_timeline* will go back up to 3200 tweets, with no time limit.  For non-prolific tweeters, this means you can end up capturing tweets from back in the 2000s! Like, for instance, this guy:

In [None]:
#Note this is a little different from the GET search/tweets call; max_id doesn't work well with a really large
#starting value, so the first call leaves it out and later calls step down from the oldest tweet seen so far.

usertweets = []
username = 'mgrammar'
res = tsearch.statuses.user_timeline(screen_name=username,count=200)
for i in range(0,16):                       #16 calls x 200 tweets = the 3200-tweet cap
    if len(res)==0:
        break
    usertweets.extend(res)
    minid = min(t['id'] for t in res)-1     #max_id is inclusive, so go one below the oldest tweet
    print 'Iteration', i, 'hits:', len(res), 'new maximum ID:', minid
    res = tsearch.statuses.user_timeline(screen_name=username,count=200,max_id=minid)
usertweets.extend(res)                      #pick up anything returned by the final call


#### GET statuses/user_timeline arguments

There are a few important arguments you can send with a call to *GET statuses/user_timeline*:
* *user_id/screen_name*: take your pick, either include someone's twitter ID or their screenname for the search. Remember user ID is more stable; you can change your screenname, but not your ID.
* *count*: up to a maximum of 200 tweets per search (and up to 3200 total)
* *max_id*: the usual; returns the N most recent tweets with IDs at or below *max_id* (the bound is inclusive, so subtract 1 from the oldest ID you already have)

Two arguments merit a little more discussion:

*trim_user*: you know all the stuff in the 'user' dictionary, attached to each tweet?  That's redundant; it's the same for each tweet. If trim_user=True, only the user's ID is returned, and you can look up their information using *GET users/show*. In general, it's good to set this to True, so the user dictionary is left out, and just call for user info once. For ease of coding right now, I'll accept redundancy.

*include_rts*: Do you want to include tweets the user has retweeted? I usually don't. The default is to include them, and searching with this set to *False* excludes them. I do this for a few reasons.

1) Retweets aren't that user's words, and people often retweet out of spite, irony, or to point out something they disagree with.  It's very difficult to work out the relationship between a user's own feelings and the feelings expressed in the tweets they retweet.

2) Including RTs can lead to the same tweet appearing over and over again, especially if you're grabbing multiple people's timelines. 

3) Twitter doesn't have a single standardized flag that covers every kind of RT. There are signals that something is a retweet, such as the presence of the *'retweeted_status'* feature (standard retweets) or the *'quoted_status'* feature (quote tweets) in the tweet information.  You can also search for 'RT @' in the tweet text.  Unfortunately, these signals change when Twitter changes its retweet behavior, and different forms of retweets (quote tweeting, manual RTs, and standard RTs) have different signals.  This can really get frustrating.

Unfortunately, even setting include_rts to *False* doesn't block all RTs, in my experience.  Be aware of this if your data depends on non-RTs, and I'm happy to give some advice that's worked for me in the past about recognizing and excluding RTs.
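As a starting point, the signals above can be combined into one heuristic check. This is a sketch under assumptions, not a complete filter: it assumes the v1.1 tweet JSON, where standard retweets carry a *'retweeted_status'* field and quote tweets a *'quoted_status'* field, and it treats 'RT @' at the start of the text or after whitespace as a manual retweet. The function name `looks_like_retweet` is made up for illustration.

```python
import re

# 'RT @' at the start of the text or after whitespace (manual retweets)
MANUAL_RT = re.compile(r'(^|\s)RT @')

def looks_like_retweet(tweet, count_quotes=True):
    """Heuristic check for the common retweet signals in a tweet dict."""
    if 'retweeted_status' in tweet:                   # standard retweet
        return True
    if count_quotes and 'quoted_status' in tweet:     # quote tweet
        return True
    return bool(MANUAL_RT.search(tweet.get('text', '')))

# Example usage on a collected timeline:
# own_words = [t for t in usertweets if not looks_like_retweet(t)]
```

Whether quote tweets should count as retweets depends on the project, since they contain some of the user's own words; `count_quotes=False` keeps them.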

### The "seed and snowball" approach

A lot of Twitter analyses use a "user snowball" approach, which is pretty straightforward. You start out with a seed user, or a set of seed users.  You can choose your seed users in a range of ways. One common way is to just select the most recent tweeters, if you're trying to get a broad sample.  You can also pick specific users; I had a project where we were interested in the interactions between famous ("verified") and normal Twitter users, so we chose a few politicians, singers, and scientists who tweeted regularly and actually replied to (at least some of) their fans.  And the basic idea here is that you grab all of the seed user's tweets that you can find, and look at who they mention in their tweets.

The snowball part comes from taking all the users the seed user mentions, and treating them as seeds for a second generation. You collect all the mentioned-users' tweets, list everyone they mention, and collect all their tweets.  Though it depends on what your goal is, how many distinct seed users you have, and how much they engage with others, you probably only need one to three generations to build your corpus.
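The generation-by-generation logic can be sketched as a short loop. This is an illustration under assumptions, not a tuned crawler: `snowball` and `fetch_timeline` are made-up names, `fetch_timeline` is any function that maps a screen name to a list of tweet JSONs, and generation 1 is taken to mean the seed users themselves. Screen names are lowercased because Twitter treats them as case-insensitive.

```python
def snowball(seeds, fetch_timeline, generations=2):
    """Collect timelines for the seeds and for the users they mention.

    Returns a dict mapping each lowercased screen name to its tweets.
    """
    collected = {}
    current = [s.lower() for s in seeds]
    for _ in range(generations):
        next_generation = []
        for name in current:
            if name in collected:
                continue                     # already fetched this user
            collected[name] = fetch_timeline(name)
            for tweet in collected[name]:
                for mention in tweet['entities']['user_mentions']:
                    mentioned = mention['screen_name'].lower()
                    if mentioned not in collected and mentioned not in next_generation:
                        next_generation.append(mentioned)
        current = next_generation            # mentioned users become the new seeds
    return collected

# Example usage with the API object from above:
# corpus = snowball(['michael5sos'],
#                   lambda n: tsearch.statuses.user_timeline(screen_name=n, count=200))
```

With a popular seed account the second generation can already be large, so in practice you would probably add a pause between calls to stay inside the rate limits.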


Here's a simple test case, and maybe one that shows the strengths and shortcomings of this approach. For our seed user, let's use [Michael Clifford](http://twitter.com/michael5sos) of the band Five Seconds of Summer, a boy band with a pretty solid Twitter following (though I don't believe they're at the level that One Direction or Justin Bieber had).

Fandoms are nice for investigating conversations because they're usually friendly, upbeat, and talkative. Definitely much more pleasant than the conversations about politics I was originally using as an example here.

In [None]:
#First, get our seeduser's timeline.
username = 'michael5sos'
res = tsearch.statuses.user_timeline(screen_name=username,count=200,include_rts=False)

In [None]:
#Now, extract all mentions
mentionsdict = {}
for t in res:
    if len(t['entities']['user_mentions'])>0:
        for mention in t['entities']['user_mentions']:
            mentionsdict[mention['screen_name']] = mentionsdict.get(mention['screen_name'],0)+1
        print extracttweetURL(t)
        print t['text']
    

In [None]:
mentionsdict

In [None]:
#Now grab *their* timelines

In [None]:
#Last thing, build an efficient searcher to reconstruct conversations.
#(This is definitely outside the scope of today's discussion.)

But does this cover everything for reconstructing Twitter conversations?

No! Well, maybe, depending on what you're trying to do. But there are a couple small things left to deal with.

**Recognizing gaps.** You're gonna have gaps, conversations that can't be traced back to their first tweets.  Depending on your application, this may not matter or may be critical.
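Gaps are at least easy to spot: a gap is any tweet that says it replies to an ID you never collected. A minimal sketch, assuming `tweets` is a list of tweet JSONs; `find_gaps` is a name made up for illustration.

```python
def find_gaps(tweets):
    """Return {missing_parent_id: [IDs of collected tweets replying to it]}."""
    known = set(t['id'] for t in tweets)
    gaps = {}
    for tweet in tweets:
        parent = tweet.get('in_reply_to_status_id')
        if parent is not None and parent not in known:
            gaps.setdefault(parent, []).append(tweet['id'])
    return gaps
```

The keys are the tweets you would need to track down, and the values show which collected conversations are cut off above them.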

**Request limit effects.** Because different people tweet different amounts, you might be able to get all the tweets from one side of a conversation, but not the other.  This is especially true when you're looking back a year or more. You may have to throw out some old conversations because they are just too difficult to fill in.

**You can never be sure a conversation ended.** It might turn out that there was a response right after you searched.  Or some random person might have responded, whose timeline you didn't capture. In general, this is not super common and not worrisome, but be very careful using this data to test a hypothesis that rests on identifying the last message in a conversation with high accuracy.

#### Two more ways of filling in gaps

**Capturing mentions.** You can search for a Tweeter's screen name just like any other search term using *GET search/tweets*, and get the last 3000 mentions of them. The snowball approach only captures people who the seed users mentioned, and with popular accounts, they're only going to mention a small number of those who mention them.
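Most mentions of a popular account aren't part of any conversation you care about, so it helps to keep only the hits that reply to a tweet you already have. A rough sketch, assuming `statuses` is the list found under the `'statuses'` key of a *GET search/tweets* result; `replies_among_mentions` is a name made up for illustration.

```python
def replies_among_mentions(statuses, known_ids):
    """Keep the search hits that reply directly to a tweet we already have."""
    known = set(known_ids)
    return [s for s in statuses if s.get('in_reply_to_status_id') in known]

# Example usage with the API object from above (`res` is a collected timeline):
# hits = tsearch.search.tweets(q='@michael5sos', count=100)['statuses']
# fan_replies = replies_among_mentions(hits, [t['id'] for t in res])
```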

**Manual tweet retrieval.** You can also try using *GET statuses/lookup* to grab tweets whose IDs appear in a conversation thread but that haven't been found through timelines/mentions. *GET statuses/lookup* has no time or tweet-number limits, so you can go all the way back to the very earliest tweets in this way.
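Because every retrieved parent may itself be a reply to something missing, this tends to take a few rounds. Here's a sketch of that loop under assumptions: `fill_gaps` and `lookup_many` are made-up names, and `lookup_many` is any function that takes a list of up to 100 tweet IDs and returns whichever tweet JSONs it could retrieve.

```python
def fill_gaps(tweets, lookup_many, max_rounds=10):
    """Repeatedly fetch missing parent tweets until none are left to try.

    Returns (all_tweets, unavailable): the enlarged collection, and the set
    of parent IDs that could not be retrieved (deleted or protected).
    """
    by_id = dict((t['id'], t) for t in tweets)
    unavailable = set()
    for _ in range(max_rounds):
        missing = sorted(set(
            t['in_reply_to_status_id'] for t in by_id.values()
            if t.get('in_reply_to_status_id') is not None
            and t['in_reply_to_status_id'] not in by_id
            and t['in_reply_to_status_id'] not in unavailable))
        if not missing:
            break
        for start in range(0, len(missing), 100):    # 100 IDs per lookup call
            batch = missing[start:start + 100]
            for tweet in lookup_many(batch):
                by_id[tweet['id']] = tweet
            unavailable.update(i for i in batch if i not in by_id)
    return list(by_id.values()), unavailable

# Example usage with the API object from above:
# lookup = lambda ids: tsearch.statuses.lookup(_id=','.join(str(i) for i in ids))
# usertweets, lost = fill_gaps(usertweets, lookup)
```

The `unavailable` set is worth keeping around, since it marks exactly the conversations that can't be traced back to their first tweet.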