## One last thing: Twitter geography

I don't know how much of this we're going to be able to cover, so it's going to be a little sparse.  One of my major interests in Twitter corpora is the geographic aspect of tweets. Tweets *can* contain geographic information, though many don't.  This information can come from multiple sources: the tweet itself can be **geotagged**, with a latitude-longitude pair or city name, or the user can have a self-specified **location** (which is much noisier and may be a straight-up lie).  Around 3% of tweets have geotags, and in my experience, somewhere around 30-60% have usable geographic information in the user's location (the exact figure depends on what granularity of geographic information you're willing to accept).
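
As a rough sketch of where these pieces live in a tweet's JSON, the helper below checks three fields in order of reliability. The field names (`coordinates`, `place`, `user.location`) follow the v1.1 REST API's tweet object; the function name `tweet_location` and the sample tweets are made up for illustration.

```python
def tweet_location(j):
    """Return (source, value) for the best location info in tweet dict j."""
    # Exact geotag: a GeoJSON point, stored as [longitude, latitude]
    coords = j.get('coordinates')
    if coords and coords.get('coordinates'):
        lon, lat = coords['coordinates']
        return ('geotag', (lat, lon))
    # Tagged place, such as a city name
    place = j.get('place')
    if place and place.get('full_name'):
        return ('place', place['full_name'])
    # Self-reported profile location: free text, noisy, possibly a lie
    userloc = (j.get('user') or {}).get('location')
    if userloc and userloc.strip():
        return ('profile', userloc.strip())
    return (None, None)

# Hypothetical, hand-made tweets showing each case
geotagged = {'coordinates': {'type': 'Point', 'coordinates': [-84.39, 33.75]}}
placed = {'coordinates': None, 'place': {'full_name': 'Atlanta, GA'}}
profile_only = {'coordinates': None, 'place': None, 'user': {'location': ' the moon '}}
```

Note the order swap in the geotag case: the JSON stores longitude first, while most people (and the search API's `geocode` parameter) put latitude first.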

In [None]:
#The usual initialization stuff

import sys, os, re
from pprint import pprint                           #Important for reading through JSONs
from time import localtime,strftime,sleep,time      #Important for dealing with Twitter rate limits
import datetime                       #Important for processing Twitter timestamps
import twitter

cons_oauth_file = 'c.xxx'
if os.path.exists(cons_oauth_file):
    constoken, conssecret = twitter.read_token_file(cons_oauth_file)
else:
    constoken = raw_input("What is your app's 'Consumer Key'?").strip()
    conssecret = raw_input("What is your app's 'Consumer Secret'?").strip()
    wf = open(cons_oauth_file,'w'); wf.write(constoken+'\n'+conssecret); wf.close()
    
app_oauth_file = 'a.xxx'
if not os.path.exists(app_oauth_file):									#if user not authorized already
	twitter.oauth_dance("your app",constoken,conssecret,app_oauth_file)		#perform OAuth Dance
apptoken, appsecret = twitter.read_token_file(app_oauth_file)					#import user credentials

tsearch = twitter.Twitter(auth=twitter.OAuth(apptoken,appsecret,constoken,conssecret))	#create the authenticated API client

def extracttweetURL(j):
	return 'http://twitter.com/'+j['user']['screen_name']+'/status/'+str(j['id'])


To start, let's do a really simple search, and one that has a well-known geographic distribution.  Specifically, let's look for tweets mentioning interstates.

I-85 is an interstate in the Southeastern U.S., running from Alabama to Virginia.  [Wikipedia](https://en.wikipedia.org/wiki/Interstate_85) has a nice little map of the route.  Let's see if we can't re-create that map.

In [None]:
term1 = '"I-85"'
term1 = re.sub(' ','+',term1)
term = re.sub('\"','%22',term1)

print "Searching for", term

In [None]:
import sys, os
sys.path.append(os.path.abspath("lib"))

from seetweetlib225 import *
from pprint import pprint
from time import sleep

In [None]:
#Global variables
loclist = ["30.8,-98.6","39.8,-95.6","32.8,-117.6","37.8,-122.6"]
radius = "2500km"
multiloc = True
tweetspersearch = 75
maxpages = 20
maxid = float("+inf")
importcsv = ''
overwrite = 'w'
outfile = ''
keeptweets = True
startat = 0
header = True
throttle = True
checklimits = 3 		#To avoid overquerying the rate_limit function, only check every N iterations
importmultiloc = False
newmultiloc = False
onlyincltweets = True
wff = False
trackfails = False
MAXERRORS = 3
scheduled = False
raw = True
tweetcount = 0
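
Each entry in `loclist` is later joined with `radius` to form the search API's `geocode` parameter, which has the form `latitude,longitude,radius` with the radius given in `km` or `mi`. Here is a small sketch of that step with a sanity check on the pieces; `make_geocode` is a hypothetical helper, not part of SeeTweet.

```python
def make_geocode(latlong, radius):
    """Build a 'lat,long,radius' string, checking the pieces first."""
    lat, lon = [float(x) for x in latlong.split(',')]
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        raise ValueError('coordinates out of range: ' + latlong)
    if radius[-2:] not in ('km', 'mi'):
        raise ValueError('radius needs km or mi units: ' + radius)
    float(radius[:-2])  # raises ValueError if the distance is not a number
    return latlong + ',' + radius

# The same four search centers used in the globals above
loclist = ["30.8,-98.6", "39.8,-95.6", "32.8,-117.6", "37.8,-122.6"]
geocodes = [make_geocode(loc, "2500km") for loc in loclist]
```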

### Setting up your account to use SeeTweet

First things first, you need a Twitter account through which to access the REST API. Just in case you run afoul of any Twitter regulations (and it's not inconceivable as you start out), I recommend using an account you could live without if it lands in Twitter jail.

In [None]:
tsearch = authorize()
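
One regulation that is easy to trip over is the rate limit on queries. The main loop further down handles it by sleeping until the reported reset time plus 30 seconds. A minimal sketch of that wait calculation follows; `seconds_until_reset` is a hypothetical name, and the clamp at zero is an addition here so that a reset time already in the past never produces a negative sleep.

```python
from time import time

def seconds_until_reset(reset_epoch, now=None, padding=30):
    """Seconds to sleep until the rate-limit window resets, never negative."""
    if now is None:
        now = time()
    return max(0, reset_epoch - now + padding)
```

Passing `now` explicitly makes the function easy to check by hand: `seconds_until_reset(1000, now=900)` gives 130.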

### Choosing a search term

Okay, we're going to look for some term here. You can look for individual words or a phrase.

In [None]:
term1 = '"tractor trailer"'
term1 = re.sub(' ','+',term1)
term = re.sub('\"','%22',term1)

print "Searching for", term
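
The two `re.sub` calls above handle spaces and double quotes by hand. As an alternative sketch, the standard library's `quote_plus` produces the same string for these terms and also escapes other special characters; `encode_term` is a hypothetical wrapper, and the import is written to work under both Python 2 and Python 3.

```python
try:
    from urllib import quote_plus          # Python 2
except ImportError:
    from urllib.parse import quote_plus    # Python 3

def encode_term(raw):
    """Percent-encode a search term; spaces become '+', quotes become %22."""
    return quote_plus(raw)

phrase = encode_term('"tractor trailer"')   # quoted phrase search
word = encode_term('I-85')                  # single word, unchanged
```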

### Main code

This will be revised to simplify it and improve transparency. Ideally it will be reduced to simple API calls with less opaque processing.  Some additional processing I'm currently doing will also be removed.

A good feature to add here is a URL for each discovered tweet, in case you want to look at what's going on as the results come in.
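
One way that feature might look, as a sketch: build the status URL from each result's `user.screen_name` and `id`, following the same pattern as the `extracttweetURL` helper defined at the top of the notebook. The `statuses` list here is a made-up stand-in for `res['statuses']`.

```python
def tweet_url(j):
    """Status URL for a tweet dict with 'user' and 'id' fields."""
    return 'http://twitter.com/' + j['user']['screen_name'] + '/status/' + str(j['id'])

# Made-up search results; real ones would come from res['statuses']
statuses = [
    {'id': 1001, 'user': {'screen_name': 'example_a'}, 'text': 'Stuck on I-85 again'},
    {'id': 1002, 'user': {'screen_name': 'example_b'}, 'text': 'I-85 is clear today'},
]

urls = [tweet_url(j) for j in statuses]
for j, url in zip(statuses, urls):
    print(url + '  ' + j['text'])
```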

In [None]:
#Basic startup processing
if (importcsv and multiloc):
	importmultiloc = True
elif multiloc:
	newmultiloc = True


if importcsv:
	rf = open(importcsv,'r')
	firstline=True
	tidnum = 0
	tids = []
	centers = []
	incls = []
	for line in rf:
		if line[0] == '#':
			continue
		splitline = line.strip().split(',')
		if firstline:
			tidnum = splitline.index('tid')
			if importmultiloc:
				centernum = splitline.index('center')
				inclnum = splitline.index('incl')
			firstline=False
		else:
			if onlyincltweets and importmultiloc:			#The incl column only exists in multi-location CSVs; skip excluded tweets (incl=0)
				if int(splitline[inclnum])==0:
					continue
			tids.append(long(splitline[tidnum])-1)
			if importmultiloc:
				centers.append(int(splitline[centernum]))
				incls.append(int(splitline[inclnum]))
	tids = tids[startat:]
	if importmultiloc:
		centers = centers[startat:]
		incls = incls[startat:]
	maxpages = len(tids)
	tids.append(0)
	firsthit = tids[0]
	rf.close()
else:
	firsthit = float("+inf")

outcomes = {}
locnum = -1
if newmultiloc:
	tweetlist = [0]*len(loclist)
	searchesleft = [0]*len(loclist)
	mintids = [0]*len(loclist)
	maxtids = [0]*len(loclist)
elif importmultiloc:
	tweetlist = [0]*maxpages		#tweetlist[tweetnum][locnum][resnum] = [outline,loc,tid] for the resnum-th baseline tweet in loc locnum on testtweet tweetnum
	mintids = [0]*maxpages
	maxtids = [0]*maxpages
	

In [None]:
#Actual search-performing code
outfile = ''

if importcsv:
	if not outfile:
		outfile = 'base.'+term1.strip('\"')+'.'+importcsv
else:
	#if scheduled:
	#	scheddir = 'scheduled'
	#	if not os.path.exists(scheddir):
	#		os.mkdir(scheddir)
	#	outfile = scheddir+'/'+term1.strip('\"')+'.'+strftime('%Y%m%d.%H%M')+'.csv'
	if not outfile:
		outfile = term1.strip('\"')+'.csv'
wf = open(outfile,overwrite)
latlong = loclist[0]
#TODO: block out the header stuff
if header:
	wf.write('#Compiled by SeeTweet '+versionnum+'.\n')
	wf.write('#Search performed at '+strftime('%Y-%m-%d %H:%M')+'\n')
	if not multiloc:
		wf.write('#Search location: '+latlong+','+radius+'\n')
	else:
		wf.write('#Search location: U.S. 4-location coverage points (Texas, KC, SD, SF)\n')
	wf.write('#Search term: '+term1+'\n')
if (overwrite=='w' and not multiloc):
	wf.write('day,year,month,date,hour,minute,second,source,city,state,lat,long,uid,tid\n')
elif (overwrite=='w' and newmultiloc):
	wf.write('day,year,month,date,hour,minute,second,source,city,state,lat,long,uid,tid,center,incl\n')
elif (overwrite=='w' and importmultiloc):
	wf.write('day,year,month,date,hour,minute,second,source,city,state,lat,long,uid,tid,origtid,origincl,center,incl\n')
if keeptweets:
	tweetdir = 'tweetarchive'
	if not os.path.exists(tweetdir):
		os.mkdir(tweetdir)
	#if not os.path.exists(tweetdir+'/'+scheddir):
	#	os.mkdir(tweetdir+'/'+scheddir)
	outtweetfile = tweetdir+'/'+os.path.splitext(outfile)[0]+'.tweets'
	wft = open(outtweetfile,overwrite)
	if overwrite=='w':
		if raw:
			wft.write('day\tyear\tmonth\tdate\thour\tminute\tsecond\tloc\tuid\ttid\ttweet\ttlength\trt\n')
		else:
			wft.write('loc,uid,tid,tweet\n')
if trackfails:
	faildir = 'failures'
	if not os.path.exists(faildir):
		os.mkdir(faildir)
	outfailfile = faildir+'/'+os.path.splitext(outfile)[0]+'.fails'
	wff = open(outfailfile,'w')
	
	
	#Note: block out the rate check
for latlong in loclist:
	locnum = locnum + 1
	geocodestr = latlong+","+radius
	print "\nSearch centered at:", latlong, "(locnum "+str(locnum)+")"
	if newmultiloc:
		currloctweets = []
		tidbycurrloc = []
	
	for pagenum in range(0,maxpages):
		print ''
		if importmultiloc:
			if (locnum==0):
				tweetlist[pagenum] = []
				maxtids[pagenum] = []
				mintids[pagenum] = []
			#print maxtids
			currloctweets = []
			tidbycurrloc = []
		#Examining the rate limit
		if (pagenum % checklimits == 0):
			r = getlimits(tsearch)
			if r['remaining'] <= checklimits:
				print "\n\n**Paused because of rate limit.**"
				print "Current time:",strftime('%I:%M:%S')
				print "Reset time:  ",strftime('%I:%M:%S',localtime(r['reset']))
				if not multiloc:
					print "Resume with flag -s="+str(startat+pagenum)
				else:
					print "Stopped on location "+str(locnum)+", tweet "+str(startat+pagenum)+"/"+str(maxpages)
				print "--"
				print tweetcount, 'tweets found. Centered at', geocodestr
				print outcomes
				if throttle:
					waittime = max(0, r['reset']-time()+30)		#never hand sleep() a negative wait
					print "Waiting", round(waittime), "seconds before resuming."
					sleep(waittime)
				else:
					sys.exit()
			elif r['remaining'] < 11:
				print "\n\n**WARNING:", r['remaining'], "queries remaining.**"
				print "Current time:",strftime('%I:%M:%S')
				print "Reset time:  ",strftime('%I:%M:%S',localtime(r['reset']))
				print ""
				print "\n\n"
				sleep(10)
		
		#Adding a catch for various Twitter errors
		errors = 0
		while (errors < MAXERRORS):
			try:
				if (firsthit == float("+inf")):
					res = tsearch.search.tweets(q=term+'+-rt',geocode=geocodestr,count=str(tweetspersearch),result_type="recent")
					#pprint(res['statuses'][99])
				else:
					#print 'Test:', term+'+-rt',geocodestr,str(tweetspersearch),str(firsthit)
					res = tsearch.search.tweets(q=term+'+-rt',geocode=geocodestr,count=str(tweetspersearch),result_type="recent",max_id=str(firsthit))
					#pprint(res['statuses'][99])
				break
			except TwitterHTTPError as e:
				errors = errors + 1
				print "Twitter Error encountered. Retrying",MAXERRORS-errors,"more times."                                                       
				print "\n"+e.response_data
				sleep(5)
		if (errors==MAXERRORS):
			print "Repeated errors encountered, possibly due to rate limit."
			print "Will wait 15 minutes and try once more before quitting."
			sleep(900)
			try:
				if (firsthit == float("+inf")):
					res = tsearch.search.tweets(q=term+'+-rt',geocode=geocodestr,count=str(tweetspersearch),result_type="recent")
				else:
					res = tsearch.search.tweets(q=term+'+-rt',geocode=geocodestr,count=str(tweetspersearch),result_type="recent",max_id=str(firsthit))
			except Exception:			#a bare except would also swallow KeyboardInterrupt
				raise Exception("Gave up because of repeated errors, sorry.")
		res = res['statuses']
			
		print len(res), 'hits on page', startat+pagenum+1, '(max_id='+str(firsthit)+')'
		print r['remaining']-1, 'queries remaining.'
		if (len(res)==0):
			print 'Out of tweets at this location.'
			#print 'Test:', term+'+-rt',geocodestr,str(tweetspersearch),str(firsthit)
			#sleep(1)
			#res = tsearch.search.tweets(q=term+'+-rt',geocode=geocodestr,count=str(tweetspersearch),result_type="recent",max_id=str(firsthit))
			#pprint(res['statuses'][99])
			break
		for i in range(0,len(res)):
			tweetcount = tweetcount + 1
			[outline,outcome,tid,tline] = extractinfo(res[i],wff,raw)
			if outline:
				if not multiloc:
					wf.write(outline)
				elif newmultiloc:
					currloctweets.append([outline[:-1]+','+str(locnum)+'\n',locnum,long(tid)])
					tidbycurrloc.append(float(tid))
					#wf.write(outline[:-1]+','+str(locnum)+'\n')
				elif importmultiloc:
					currloctweets.append([outline[:-1]+','+str(tids[pagenum]+1)+','+str(incls[pagenum])+','+str(locnum)+'\n',locnum,long(tid)])
					tidbycurrloc.append(float(tid))
			if keeptweets:
				#wft.write(tline.encode('ascii','ignore'))
				if not raw:
					wft.write(tline.encode('ascii','ignore'))
				if raw:
					wft.write(tline.encode('ascii','replace'))		#maybe create a separate .utweets file with the Unicode versions of tweets
			outcomes[outcome] = outcomes.get(outcome,0)+1
			if importcsv:
				firsthit = tids[pagenum+1]
			else:
				if firsthit > long(tid):			#if current tweet came before previous oldest, update oldest
					firsthit = long(tid)-1
		if importmultiloc:
			if len(tidbycurrloc) > 0:
				maxtids[pagenum].append(max(tidbycurrloc))
				mintids[pagenum].append(min(tidbycurrloc))
			else:
				maxtids[pagenum].append(0)
				mintids[pagenum].append(0)				
			tweetlist[pagenum].append(currloctweets)
	#endfor searches within a location
	if newmultiloc:
		if len(tidbycurrloc) > 0:
			maxtids[locnum] = max(tidbycurrloc)
			mintids[locnum] = min(tidbycurrloc)
		else:
			maxtids[locnum] = 0
			mintids[locnum] = 0
		searchesleft[locnum] = maxpages-pagenum-1			#calculating how many pages were left in the most maxed-out search
		tweetlist[locnum] = currloctweets
	overwrite = 'a+'
	header = False
	if importcsv:
		firsthit = tids[0]
	else:
		firsthit = float("+inf")

#Endfor multiple locations
if newmultiloc:
	balanceandprint(tweetlist,mintids,maxtids,searchesleft,wf)
elif importmultiloc:
	for pagenum in range(0,maxpages):
		balanceandprint(tweetlist[pagenum],mintids[pagenum],maxtids[pagenum],[0]*len(loclist),wf)
wf.close()
if keeptweets:
	wft.close()
if trackfails:
	wff.close()
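
Stripped of the file handling and multi-location bookkeeping, the paging logic in the cell above comes down to this: ask for the most recent page, note the smallest tweet ID seen, then ask again with `max_id` set just below it until a page comes back empty. The sketch below shows only that loop. `fake_search` is a stand-in invented here so the example runs without credentials, and `page_back` is a hypothetical name.

```python
def page_back(search_fn, count, maxpages):
    """Collect tweets newest-first, walking backwards with max_id."""
    collected = []
    max_id = None                       # None means start from the newest tweet
    for pagenum in range(maxpages):
        page = search_fn(count=count, max_id=max_id)
        if not page:                    # an empty page means we ran out of tweets
            break
        collected.extend(page)
        # max_id is inclusive, so step one below the oldest ID seen
        max_id = min(t['id'] for t in page) - 1
    return collected

# Stand-in for the API: 10 fake tweets with IDs 1..10, newest first
FAKE_TWEETS = [{'id': i} for i in range(10, 0, -1)]

def fake_search(count, max_id=None):
    hits = [t for t in FAKE_TWEETS if max_id is None or t['id'] <= max_id]
    return hits[:count]

tweets = page_back(fake_search, count=4, maxpages=20)
```

Subtracting one from the oldest ID is what keeps the last tweet of one page from reappearing as the first tweet of the next.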
	

In [None]:
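import numpy as np
#Load the lat/long columns from the I-85 search results
ewf = open('I-85.csv','r')
ewlist = []
header = 0
for line in ewf:
    if header<5:                    #Skip the four '#' comment lines and the column-name row
        header += 1
        continue
    splitline = line.strip().split(',')
    if splitline[10]!='NA':         #Columns 10 and 11 are lat and long
        ewlist.append([float(splitline[10]),float(splitline[11])])    #Store as numbers so they plot on a numeric axis
ewf.close()
iarr = np.array(ewlist)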
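
The same loading pattern repeats for every output file, so it could be wrapped in a function. This is a sketch under the assumption that the file has the layout written by the search cell above (four `#` comment lines, one row of column names, then `lat` and `long` in columns 10 and 11). `load_latlong` is a hypothetical helper, and the sample rows are made up so the example can run on its own.

```python
import os, tempfile

def load_latlong(path, skiplines=5, latcol=10, longcol=11):
    """Return a list of [lat, long] float pairs, skipping rows marked NA."""
    points = []
    with open(path, 'r') as rf:
        for linenum, line in enumerate(rf):
            if linenum < skiplines:             # header comments and column names
                continue
            fields = line.strip().split(',')
            if len(fields) <= longcol:          # blank or short line
                continue
            if fields[latcol] == 'NA' or fields[longcol] == 'NA':
                continue
            points.append([float(fields[latcol]), float(fields[longcol])])
    return points

# Made-up file in the assumed layout: 4 comment lines, 1 column row, 2 data rows
sample = ('#c1\n#c2\n#c3\n#c4\n'
          'day,year,month,date,hour,minute,second,source,city,state,lat,long,uid,tid\n'
          'Mon,2014,1,6,12,0,0,geotag,Atlanta,GA,33.75,-84.39,1,100\n'
          'Mon,2014,1,6,12,1,0,NA,NA,NA,NA,NA,2,101\n')
tmp = tempfile.NamedTemporaryFile(mode='w', suffix='.csv', delete=False)
tmp.write(sample)
tmp.close()
points = load_latlong(tmp.name)
os.remove(tmp.name)
```

With this in place, each loading cell would reduce to a line like `iarr = np.array(load_latlong('I-85.csv'))`.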

In [None]:
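import numpy as np
#Load the lat/long columns from the "eighteen wheeler" search results
ewf = open('eighteen-wheeler.csv','r')
ewlist = []
header = 0
for line in ewf:
    if header<5:                    #Skip the four '#' comment lines and the column-name row
        header += 1
        continue
    splitline = line.strip().split(',')
    if splitline[10]!='NA':         #Columns 10 and 11 are lat and long
        ewlist.append([float(splitline[10]),float(splitline[11])])    #Store as numbers so they plot on a numeric axis
ewf.close()
ewarr = np.array(ewlist)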

In [None]:
import matplotlib.pyplot as plt

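#Note: ttarr is built in the tractor+trailer cell further down; run that cell before this one
fig, ax = plt.subplots()
ax.scatter(ttarr[:,1], ttarr[:,0], color='b', s=2, alpha=.4)     #"tractor trailer" in blue; longitude on x, latitude on y
ax.scatter(iarr[:,1], iarr[:,0], color='r', s=2, alpha=.4)       #"I-85" in red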

plt.show()

In [None]:
ewarr[0:4]

In [None]:
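import numpy as np
#Load the lat/long columns from the "tractor trailer" search results
ewf = open('tractor+trailer.csv','r')
ewlist = []
header = 0
for line in ewf:
    if header<5:                    #Skip the four '#' comment lines and the column-name row
        header += 1
        continue
    splitline = line.strip().split(',')
    if splitline[10]!='NA':         #Columns 10 and 11 are lat and long
        ewlist.append([float(splitline[10]),float(splitline[11])])    #Store as numbers so they plot on a numeric axis
ewf.close()
ttarr = np.array(ewlist)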


In [None]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.scatter(ewarr[:,1], ewarr[:,0], color='b', s=2, alpha=.4)     #"eighteen wheeler" in blue
ax.scatter(ttarr[:,1], ttarr[:,0], color='r', s=2, alpha=.4)     #"tractor trailer" in red, so the two terms can be told apart

plt.show()