# Working with the Twitter Search API

The goal of this notebook is to provide a strong demo of using the new Python wrapper for the Twitter Search API to Do Data Science™.


Twitter data has massive potential across many domains. What if you were curious about music patterns across the world?



Let's get our notebook started with some imports and basic setup.

In [None]:
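import itertools as it  # used later as it.chain
import json             # used later to read the query back out of a rule payload
import os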

import pandas as pd
import seaborn as sns

from tweet_parser.tweet import Tweet
from tweet_parser.getter_methods.tweet_geo import get_profile_location

from twittersearch.result_stream import ResultStream
from twittersearch.utils import *

# the following makes working in a notebook a bit easier, as you don't have to have new cells for all output
from IPython.display import HTML
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
%load_ext autoreload
%autoreload 2
%matplotlib inline



## Using the API interactively

We'll define some high-level information here. For convenience, I've stored my password in an environment variable, and we'll be using both the counts and search endpoints. I will generate the relevant arguments that we'll use for the rest of our session.

In [None]:
os.environ["TWITTER_SEARCH_ACCOUNT_NAME"] = ""
os.environ["TWITTER_SEARCH_PW"] = ""
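
If you'd rather fail early when a credential is missing, a small check like the one below can help. This is only a sketch: `require_env` and the `DEMO_SEARCH_PW` variable are names I made up for illustration and are not part of the search wrapper.

```python
import os

def require_env(name):
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.environ.get(name, "")
    if not value:
        raise RuntimeError("Set the {} environment variable first.".format(name))
    return value

# Demo with a throwaway variable name so that real credentials are untouched.
os.environ["DEMO_SEARCH_PW"] = "not-a-real-password"
print(require_env("DEMO_SEARCH_PW"))
```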

In [None]:
username = "agonzales@twitter.com"
search_api = "fullarchive"
endpoint_label = "ogformat.json"

search_endpoint = gen_endpoint(search_api,
                               os.environ["TWITTER_SEARCH_ACCOUNT_NAME"],
                               endpoint_label,
                               count_endpoint=False)

count_endpoint = gen_endpoint(search_api,
                              os.environ["TWITTER_SEARCH_ACCOUNT_NAME"],
                              endpoint_label,
                              count_endpoint=True)

search_args = {"username": username,
               "password": os.environ["TWITTER_SEARCH_PW"],
               "url": search_endpoint}
count_args = {"username": username,
              "password": os.environ["TWITTER_SEARCH_PW"],
              "url": count_endpoint}


The power of the search API comes from the rich query language that it exposes to developers. Instead of manually filtering tweets from the free 1% streaming API, we can rapidly explore *all* tweets, all the way back to the very first one:

In [None]:
HTML('<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">just setting up my twttr</p>&mdash; jack (@jack) <a href="https://twitter.com/jack/status/20">March 21, 2006</a></blockquote> <script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script>')

The search API also has a `counts` endpoint, which, instead of returning tweet data, returns the number of tweets that match your search criteria, divided into convenient time buckets. Quick explorations with the `counts` endpoint can save you time and API calls.
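
To make the idea of count buckets concrete, here is a small offline sketch that groups a handful of made-up timestamps into day buckets. The `timePeriod` and `count` keys mirror the ones used later in this notebook; the timestamp string format and the `bucket_by_day` helper are my assumptions, and no API call is made.

```python
from collections import Counter
from datetime import datetime

# Made-up tweet timestamps, for illustration only.
timestamps = [
    datetime(2017, 8, 25, 0, 5),
    datetime(2017, 8, 25, 13, 30),
    datetime(2017, 8, 26, 9, 0),
]

def bucket_by_day(times):
    """Count timestamps per calendar day, returning one record per bucket."""
    per_day = Counter(t.strftime("%Y%m%d0000") for t in times)
    return [{"timePeriod": day, "count": n} for day, n in sorted(per_day.items())]

buckets = bucket_by_day(timestamps)
print(buckets)
```

The point is that a year of daily buckets is only a few hundred small records, no matter how many tweets sit behind them.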


Let's explore some tweets that mention musicians and see how to write PowerTrack rules.

When using the helper method `gen_rule_payload`, you must define your PowerTrack rule. We'll start by matching tweets that mention "taylor swift" and use the counts API to refine our searches from there.

As a side note, in a notebook environment, I will break the DRY principle a bit.

In [None]:
_rule = """
"taylor swift"
"""


count_rule = gen_rule_payload(_rule,
                        from_date="2016-09-01",
                        to_date="2017-09-01",
                        max_results=500, 
                        count_bucket="day")

count_rule


Our main point of entry to the API is the `ResultStream` object and its primary method, `.stream()`. We'll pass it our connection and authentication arguments, the JSON rule payload, and either the `max_tweets` argument or the `max_pages` argument. By default, the `ResultStream` will paginate results from the API for you, and these two parameters let you cap the number of calls used.

When using the Counts API, a single call returns a json array of count records, one per count bucket. For our year-long range, a single call of 'day' buckets will get all the data we want.
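
If the pagination idea is new to you, the sketch below shows the general pattern with a fake, in-memory API. It is not the wrapper's actual implementation: `fake_api_call`, the page tokens, and this `stream` function are all hypothetical stand-ins that only illustrate how a generator can follow "next" tokens and stop at a limit.

```python
def fake_api_call(next_token=None):
    """Hypothetical stand-in for one HTTP request; returns (results, next_token)."""
    pages = {
        None: (["t1", "t2", "t3"], "page2"),
        "page2": (["t4", "t5", "t6"], "page3"),
        "page3": (["t7"], None),
    }
    return pages[next_token]

def stream(max_results=None, max_pages=None):
    """Yield results page by page until a limit is hit or the pages run out."""
    n_results, n_pages, token = 0, 0, None
    while True:
        if max_pages is not None and n_pages >= max_pages:
            return
        results, token = fake_api_call(token)
        n_pages += 1
        for item in results:
            if max_results is not None and n_results >= max_results:
                return
            yield item
            n_results += 1
        if token is None:
            return

print(list(stream(max_results=4)))
print(list(stream(max_pages=1)))
```

Because the generator is lazy, no request is made until you start iterating, and a result limit stops further requests as soon as it is reached.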

In [None]:
count_stream = ResultStream(**count_args,
                           rule_payload=count_rule,
                           max_tweets=500)

You can inspect the `ResultStream` object to see the arguments it was created with.

In [None]:
# print(count_stream)

The object has one main entry point, which returns a generator of results from the API. In many cases, you'll want to fetch all the results from this stream of data, which can be done as such:

In [None]:
count_results_gen = count_stream.stream()

count_results_gen

In [None]:
counts = list(count_results_gen)

In [None]:
counts[0]

As described earlier, we get back a quick count of tweets matching our rule in each bucket. We can quickly visualize this with Pandas:

In [None]:
(pd.DataFrame(counts)
 .assign(timePeriod=lambda df: pd.to_datetime(df["timePeriod"]))
 .set_index("timePeriod")
 .sort_index()
 .sum()
 
)
(pd.DataFrame(counts)
 .assign(timePeriod=lambda df: pd.to_datetime(df["timePeriod"]))
 .set_index("timePeriod")
 .sort_index()
 .plot(title=_rule)
 
)

Alright, so there are 12 million matching tweets in our year of history, with big spikes around the times Taylor does something newsworthy. We probably missed something here: the exact match of "taylor swift" will miss tweets that mention her account without spelling out her name. Let's redo our search with a new rule.

In [None]:
_rule = """
"taylor swift" OR (has:mentions @taylorswift13)
"""


count_rule = gen_rule_payload(_rule,
                        from_date="2016-09-01",
                        to_date="2017-09-01",
                        max_results=500, 
                        count_bucket="day")

count_rule


We can shorten our previous interaction with the `ResultStream` object to something like this:

In [None]:
counts = list(ResultStream(**count_args,
                           rule_payload=count_rule,
                           max_tweets=500)
              .stream())

In [None]:
(pd.DataFrame(counts)
 .assign(timePeriod=lambda df: pd.to_datetime(df["timePeriod"]))
 .set_index("timePeriod")
 .sort_index()
 .sum()
 
)
(pd.DataFrame(counts)
 .assign(timePeriod=lambda df: pd.to_datetime(df["timePeriod"]))
 .set_index("timePeriod")
 .sort_index()
 .plot(title=_rule)
 
)

Great, so we get similar patterns of tweets, but have an additional 5 million with which we can work, and we haven't begun including other things people might mention about her, such as her album or song names, or phrases like "I'm listening to tswift's new album". Let's add her common nickname to the counts one more time:

In [None]:
_rule = """
"taylor swift"
OR (has:mentions @taylorswift13)
OR "tswift"
"""


count_rule = gen_rule_payload(_rule,
                        from_date="2016-09-01",
                        to_date="2017-09-01",
                        max_results=500, 
                        count_bucket="day")

count_rule


In [None]:
counts = list(ResultStream(**count_args,
                           rule_payload=count_rule,
                           max_tweets=500)
              .stream())

In [None]:
(pd.DataFrame(counts)
 .assign(timePeriod=lambda df: pd.to_datetime(df["timePeriod"]))
 .set_index("timePeriod")
 .sort_index()
 .sum()
 
)
(pd.DataFrame(counts)
 .assign(timePeriod=lambda df: pd.to_datetime(df["timePeriod"]))
 .set_index("timePeriod")
 .sort_index()
 .plot(title=_rule)
 
)

Hardly any difference, so let's move on. 

PowerTrack rules also allow you to find tweets that have explicit information, such as geographical location data. We'll want to examine regional differences in music patterns, so let's see how many of our tweets have that data using the `has:geo` operator.

Note that geo tags are turned OFF by default, so there are usually FAR fewer tweets that have geo information than those without. We offer a `profile_geo` enrichment (queried with the `has:profile_geo` operator) that will expose the home geo of the user, which can be a very reasonable proxy for tweet geo in many cases.

In [None]:
_rule = """
("taylor swift" OR (has:mentions @taylorswift13) OR "tswift")
has:geo
"""


count_rule = gen_rule_payload(_rule,
                        from_date="2016-09-01",
                        to_date="2017-09-01",
                        max_results=500, 
                        count_bucket="day")

count_rule
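
The rules so far follow a pattern: OR-ed clauses wrapped in parentheses, followed by whitespace-separated operators that must also match. If you find yourself typing these by hand a lot, a few tiny string helpers can assemble them. The helper names below (`or_group`, `phrase`, `mention`) are my own invention and not part of the wrapper.

```python
def or_group(*clauses):
    """Join clauses with OR and wrap the result in parentheses."""
    return "(" + " OR ".join(clauses) + ")"

def phrase(text):
    """Quote a phrase for exact matching."""
    return '"{}"'.format(text)

def mention(handle):
    """Build the mention clause used in the rules above."""
    return "(has:mentions @{})".format(handle)

rule = or_group(phrase("taylor swift"), mention("taylorswift13"))
geo_rule = rule + " has:geo"  # whitespace between clauses acts as AND
print(geo_rule)
```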

In [None]:
counts = list(ResultStream(**count_args,
                           rule_payload=count_rule,
                           max_tweets=500)
              .stream())

In [None]:
(pd.DataFrame(counts)
 .assign(timePeriod=lambda df: pd.to_datetime(df["timePeriod"]))
 .set_index("timePeriod")
 .sort_index()
 .sum()
 
)
(pd.DataFrame(counts)
 .assign(timePeriod=lambda df: pd.to_datetime(df["timePeriod"]))
 .set_index("timePeriod")
 .sort_index()
 .plot(title=_rule)
 
)

As mentioned, we have wildly fewer tweets that match our criteria, but let's see what we can do with the `profile_geo` enrichment.

In [None]:
_rule = """
("taylor swift" OR (has:mentions @taylorswift13) OR "tswift")
(has:geo
OR
has:profile_geo)
"""


count_rule = gen_rule_payload(_rule,
                        from_date="2016-09-01",
                        to_date="2017-09-01",
                        max_results=500, 
                        count_bucket="day")

count_rule

In [None]:
counts = list(ResultStream(**count_args,
                           rule_payload=count_rule,
                           max_tweets=500)
              .stream())

In [None]:
(pd.DataFrame(counts)
 .assign(timePeriod=lambda df: pd.to_datetime(df["timePeriod"]))
 .set_index("timePeriod")
 .sort_index()
 .sum()
 
)
(pd.DataFrame(counts)
 .assign(timePeriod=lambda df: pd.to_datetime(df["timePeriod"]))
 .set_index("timePeriod")
 .sort_index()
 .plot(title=_rule)
 
)

Great, so we have a reasonable idea of what we're working with here and we haven't pulled a single tweet yet. 

Let's take a look at some tweet data with our finer-grained geo search.

In [None]:
_rule = """
("taylor swift" OR (has:mentions @taylorswift13) OR "tswift")
has:geo
"""


search_rule = gen_rule_payload(_rule,
                               from_date="2016-09-01",
                               to_date="2017-09-01",
                               max_results=500
                              )

search_rule

In [None]:
tweets = list(ResultStream(**search_args, rule_payload=search_rule, max_tweets=500)
              .stream())

With the python interface to the search API, tweets are automatically parsed via the [Tweet_parser library](https://github.com/tw-ddis/tweet_parser), which makes working with them fairly straightforward. We've abstracted out many of the annoying details of working with raw tweet data and provided a straightforward API with access to lots of tweet data. Let's take a look:

In [None]:
t = tweets[200]
t.created_at_datetime
t.screen_name
t.bio
t.text


t["place"]


This person doesn't seem to care much for Taylor's new music. (Shrug.)


Before we move on, we'll pause to define a few functions that will help with grabbing the geojson data from a tweet. Recall that the `has:geo` operator will return tweets that are tagged at a Twitter-defined `Place` (in this case, the city of Coralville, IA). Places are given a bounding box, which may not be useful for your application (I have a map in mind...).

We cannot cover all use cases for the tweet parser, but it's easily extensible, or you can use methods that work on dicts because each tweet is really a subclass of `dict`.
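
To illustrate what "a subclass of `dict`" buys you, here is a heavily simplified sketch. `MiniTweet` is a hypothetical class I wrote for this example, not the parser's real `Tweet` class, and the payload is made up.

```python
class MiniTweet(dict):
    """Hypothetical, heavily simplified stand-in for a parsed tweet."""

    @property
    def text(self):
        return self.get("text", "")

    @property
    def place_name(self):
        # Plain dict methods still work because the object *is* a dict.
        place = self.get("place") or {}
        return place.get("full_name")

t_demo = MiniTweet({"text": "hello", "place": {"full_name": "Somewhere, XX"}})
print(t_demo.text, t_demo["text"], t_demo.place_name)
```

Attribute-style accessors cover the common cases, while key access remains available for anything the accessors don't expose.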

To demonstrate the need for this here, let's grab tweets that have *exact* coordinates:

In [None]:
[t.geo_coordinates for t in tweets if t.geo_coordinates] 

If we are going to generate a map or work with single coordinates, we'll have to turn bounding boxes into singular points and have a single function that extracts the geo data from each tweet.


Note - the coordinates are returned as [LONG, LAT], and in the parsing functions, I will flip those to the required downstream format [LAT, LONG].
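
As a quick worked example of both steps (box to point, then the flip), here is a pure-Python sketch using a made-up bounding box. The helper names `bbox_center` and `to_lat_long` are mine, and averaging the corners in degree space is only a rough estimate of the center, which is good enough for city-sized boxes.

```python
def bbox_center(bbox):
    """Average the corner points of a bounding box given as [[long, lat], ...]."""
    longs, lats = zip(*bbox)
    return sum(longs) / len(longs), sum(lats) / len(lats)

def to_lat_long(long_lat):
    """Flip a (long, lat) pair to (lat, long)."""
    long, lat = long_lat
    return lat, long

# Made-up box, corners in [LONG, LAT] order.
demo_bbox = [[-92.0, 41.0], [-91.0, 41.0], [-91.0, 42.0], [-92.0, 42.0]]
print(to_lat_long(bbox_center(demo_bbox)))
```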

In [None]:
from functools import reduce

from tweet_parser.tweet_checking import is_original_format

try:
    import numpy as np
    mean_bbox = lambda x: list(np.array(x).mean(axis=0))
except ImportError:
    # pure-Python fallback: average each coordinate across the corner points
    mean_bbox = lambda x: [sum(c) / len(x) for c in zip(*x)]

def get_profile_geo_coords(tweet):
    geo = tweet.profile_location.get("geo")
    coords = geo.get("coordinates") # in [LONG, LAT]
    if not coords:
        # no usable profile coordinates; avoid returning unbound names
        return None
    long, lat = coords
    return lat, long


def get_place_coords(tweet, est_center=False):
    """
    Places are formal spots that define a bounding box around a place.
    The bounding box is a list of corner points, each a [long, lat] pair;
    with est_center=True, the mean of those corners is returned instead.
    
    """
    
    def get_bbox_ogformat():
        _place = tweet.get("place")
        if _place is None:
            return None
    
        return (_place
                .get("bounding_box")
                .get("coordinates")[0])

    def get_bbox_asformat():
        _place = tweet.get("location")
        if _place is None:
            return None
        return (_place
                .get("geo")
                .get("coordinates")[0])
        
    bbox = get_bbox_ogformat() if is_original_format(tweet) else get_bbox_asformat()

    return mean_bbox(bbox) if est_center else bbox


def get_exact_geo_coords(tweet):
    geo = tweet.get("geo")
    if geo is None:
        return None
    
    # coordinates.coordinates is [LONG, LAT]
    # geo.coordinates is [LAT, LONG]
    # both payload formats are read from the "geo" key here
    return geo.get("coordinates")


def get_a_geo_coordinate(tweet):
    geo = get_exact_geo_coords(tweet)
    lat, long = geo if geo else (None, None)
    if lat:
        return lat, long
    long, lat = get_place_coords(tweet, est_center=True)
    return lat, long

In [None]:
get_a_geo_coordinate(t)
t["place"]

Our top-level function, `get_a_geo_coordinate`, falls back to the center of a bounding box when a tweet has no exact coordinates. This works really well when we have a city-or-smaller place and is not as useful for a profile geo of "Texas", but so it goes.


## Working with the tweet stream / Let's build a map


Given the streaming format of tweets that is returned by our API, we can pull large amounts of data interactively and build functions that help organize it to our needs directly.

As this example is going to eventually go beyond Taylor Swift, we'll define a wrapper function that takes an input `ResultStream` object and returns a pandas dataframe with our geo coordinate information and other basic information from tweets. It will correctly filter and index the dataframe, giving us something back that we can rapidly use to Make Maps. We'll use another utility function that converts the [lat, long] pairs to the Web Mercator format ([meters_x, meters_y]) that is required by some plotting libraries. I lightly adapted this from an example in Datashader.

In [None]:
def latlng_to_meters(df, lat_name, lng_name):
    """
    Taken and modified from the datashader notebooks 
    """
    lat = df[lat_name]
    lng = df[lng_name]
    origin_shift = 2 * np.pi * 6378137 / 2.0
    mx = lng * origin_shift / 180.0
    my = np.log(np.tan((90 + lat) * np.pi / 360.0)) / (np.pi / 180.0)
    my = my * origin_shift / 180.0
    return df.assign(mx=mx).assign(my=my)
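
As a sanity check on the formula, here is a scalar version using only the standard library. `to_web_mercator` is a name I chose for this sketch; it applies the same arithmetic as the function above to a single point, so you can test known values such as the equator and the antimeridian.

```python
import math

EARTH_RADIUS_M = 6378137.0  # same radius constant as in the function above

def to_web_mercator(lat, lng):
    """Convert one (lat, lng) pair in degrees to Web Mercator meters (mx, my)."""
    origin_shift = math.pi * EARTH_RADIUS_M
    mx = lng * origin_shift / 180.0
    my = math.log(math.tan((90.0 + lat) * math.pi / 360.0)) / (math.pi / 180.0)
    my = my * origin_shift / 180.0
    return mx, my

# On the equator at 180 degrees east, mx should equal pi * radius and my should be ~0.
mx, my = to_web_mercator(0.0, 180.0)
print(mx, my)
```

Note that the projection is undefined at the poles (the tangent blows up), which rarely matters for tweet data.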

In [None]:
def tweet_geo_collector(result_stream, tag, fields=None):
    if fields is None:
        fields = ["id", "created_at_datetime", "text"]
    
    coords = []
    print("collecting tweets for {}".format(tag))
    for tweet in result_stream.stream():
        attrs = (getattr(tweet, field)
                   for field in fields)
        try:
            _coords = get_a_geo_coordinate(tweet)
            coords.append(list(it.chain.from_iterable([attrs, _coords])))
        except AttributeError:
            print("error in geo")
            print(tweet.id, tweet.text)
            continue
        
        
    columns = list(it.chain.from_iterable([fields, ["lat", "long"]]))
    
    df = (pd.DataFrame(coords, columns=columns)
          .pipe(latlng_to_meters, "lat", "long")
          .drop(["lat", "long"], axis=1)
          .assign(tag=tag)
         )
    return df



In [None]:
rs = ResultStream(**search_args, rule_payload=search_rule, max_tweets=2000)
# rs.artist = "taylor_swift"
df = tweet_geo_collector(rs, tag="taylor_swift")

In [None]:
df.head()
df.info()

In [None]:
(df
 .set_index("created_at_datetime")
 .sort_index()
 .resample("D")
 .size()
)

We'll use Bokeh to do some plotting:

In [None]:
from functools import partial


from bokeh.models import WMTSTileSource
from bokeh.tile_providers import STAMEN_TONER

from bokeh.io import output_notebook, show
from bokeh.plotting import ColumnDataSource, figure
from bokeh.models import HoverTool, value

output_notebook()

tiles = {'OpenMap': WMTSTileSource(url='http://c.tile.openstreetmap.org/{Z}/{X}/{Y}.png'),
         'ESRI': WMTSTileSource(url='https://server.arcgisonline.com/ArcGIS/rest/services/World_Imagery/MapServer/tile/{Z}/{Y}/{X}.jpg'),
         'Wikipedia': WMTSTileSource(url='https://maps.wikimedia.org/osm-intl/{Z}/{X}/{Y}@2x.png'),
         'Stamen': WMTSTileSource(url="http://tile.stamen.com/toner-background/{Z}/{X}/{Y}.png", )
         }


def plot_tweets(df, x_col="mx", y_col="my", tile="Stamen", title="title"):
    # add our DataFrame as a ColumnDataSource for Bokeh
    plot_data = ColumnDataSource(df)
    # create the plot and configure the
    # title, dimensions, and tools
    plot = figure(title=title,
                  plot_width=800,
                  plot_height=800,
                  tools= ('pan, wheel_zoom, box_zoom, reset'),
                  active_scroll='wheel_zoom')

    # add a hover tool to display the tweet text on roll-over
    plot.add_tools(HoverTool(tooltips='@text'))

    # draw the tweets as circles on the plot
    plot.circle(x=x_col, y=y_col, source=plot_data,
                     color=u'blue', line_alpha=0.1, fill_alpha=0.1,
                     size=3, hover_line_color='black')

    # configure visual elements of the plot
    plot.title.text_font_size = value('12pt')
    plot.xaxis.visible = False
    plot.yaxis.visible = False
    plot.grid.grid_line_color = None
    plot.outline_line_color = None
    plot.add_tile(tiles[tile], alpha=0.25)
    return plot



In [None]:
show(plot_tweets(df, title=json.loads(search_rule)["query"]))

So, wrapping up the first section, we have an interactive map of tweets that mention Taylor Swift-related terms, built with very little code.


We'll scale up in the next section by adding both 
- more tweets
- more artists

## Dynamically generating rules

In [None]:
base_rule = """
(("{exact_name}") OR (has:mentions @{handle}))
has:geo
"""


from functools import partial


gen_search_payload_ = partial(gen_rule_payload,
                            max_results=500,
                            from_date="2016-09-01",
                            to_date="2017-09-01",
                            )

gen_count_payload_ = partial(gen_rule_payload,
                            max_results=500,
                            from_date="2016-09-01",
                            to_date="2017-09-01",
                            count_bucket='day'
                            )





artists_names = ["taylor swift",
                 "katy perry",
                 "beyonce",
                 "lady gaga",
                 "britney spears"
                ]

artists = {"taylor swift": "taylorswift13",
           "katy perry": "katyperry",
           "beyonce": "beyonce",
           "lady gaga": "ladygaga",
           "britney spears": "britneyspears"
           
          }


artist_count_rules = [gen_count_payload_(base_rule.format(exact_name=name, handle=handle))
                for name, handle in artists.items()]
artist_search_rules = [gen_search_payload_(base_rule.format(exact_name=name, handle=handle))
                for name, handle in artists.items()]
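
If the combination of a rule template and `partial` is unfamiliar, the offline sketch below shows the same pattern with a stand-in payload builder. `build_payload`, its keyword names, and the keys other than `query` are hypothetical and only meant to illustrate the idea; they are not the wrapper's `gen_rule_payload`.

```python
import json
from functools import partial

def build_payload(query, from_date=None, to_date=None, bucket=None):
    """Hypothetical stand-in for a payload builder; returns a JSON string."""
    payload = {"query": " ".join(query.split())}  # collapse newlines and extra spaces
    if from_date:
        payload["from_date"] = from_date
    if to_date:
        payload["to_date"] = to_date
    if bucket:
        payload["bucket"] = bucket
    return json.dumps(payload)

template = '("{exact_name}" OR (has:mentions @{handle})) has:geo'
demo_artists = {"taylor swift": "taylorswift13", "katy perry": "katyperry"}

# Freeze the shared date range once; only the query varies per artist.
year_payload = partial(build_payload, from_date="2016-09-01", to_date="2017-09-01")

payloads = {name: year_payload(template.format(exact_name=name, handle=handle))
            for name, handle in demo_artists.items()}
print(json.loads(payloads["katy perry"])["query"])
```

Keying the payloads by artist name, rather than relying on list order, also makes it harder to mislabel results later.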

In [None]:
streams = [ResultStream(**count_args,
                        rule_payload=rule,
                        max_tweets=2000)
           for rule in artist_count_rules]



In [None]:
counts = (pd.concat([pd.DataFrame(list(rs.stream())).assign(artist=artist)
                    for rs, artist in zip(streams, artists_names)])
          .assign(timePeriod=lambda df: pd.to_datetime(df["timePeriod"]))
          .set_index(["timePeriod", "artist"])
          .sort_index()
         )

In [None]:
counts.unstack().sum()
counts.unstack()['count'].plot()
counts.unstack()['count'].rolling("14D").mean().plot()

In [None]:
streams = [ResultStream(**search_args,
                        rule_payload=rule,
                        max_tweets=2000)
           for rule in artist_search_rules]

In [None]:
results = [tweet_geo_collector(stream, tag)
           for stream, tag in zip(streams, artists.keys())]

In [None]:
df = pd.concat(results)

In [None]:
(df
 .set_index("created_at_datetime")
 .sort_index()
 .groupby([pd.Grouper(freq="D"), "tag"])  # pd.TimeGrouper was removed from pandas
 .size()
 .to_frame("tweets")
 ["tweets"]
 .unstack()
 .fillna(0)
 .plot()
 
)

In [None]:
def even_sample(df, cat_col):
    """Downsample so every category in `cat_col` has as many rows as the rarest one."""
    min_count = df[cat_col].value_counts().min()
    res = []
    for cat in df[cat_col].unique():
        res.append(df[df[cat_col] == cat].sample(min_count))
    return pd.concat(res)

In [None]:
even_df = even_sample(df, "tag")

In [None]:
from bokeh.models.widgets import Panel, Tabs

In [None]:
show(plot_tweets(df.query("tag == 'lady gaga'")))

In [None]:
# one map per artist; the tags must match the keys of `artists` defined above
kp = plot_tweets(df.query("tag == 'katy perry'"))
lg = plot_tweets(df.query("tag == 'lady gaga'"))
ts = plot_tweets(df.query("tag == 'taylor swift'"))
bey = plot_tweets(df.query("tag == 'beyonce'"))

tabs = Tabs(tabs=[Panel(child=kp, title="katy perry"),
                 # Panel(child=lg, title="lady gaga"),
                  Panel(child=ts, title="taylor swift"),
                 # Panel(child=bey, title="beyonce")
                 ])

In [None]:
from bokeh.io import output_file, reset_output, save

In [None]:
reset_output()
# output_file("bokeh_tabs.html")

In [None]:
output_file("test_bokeh.html")

In [None]:
save(tabs, filename="test_bokeh.html")

## Datashader

In [None]:
from bokeh import palettes

import datashader as ds
import datashader.transfer_functions as tf

from datashader.bokeh_ext import InteractiveImage

from cartopy import crs


import geoviews as gv

import holoviews as hv

from holoviews.operation.datashader import aggregate, shade, datashade, dynspread


hv.notebook_extension('mpl', 'bokeh')


def gen_col_points(categories, colormap):
    inv_cats = {k: k for k in categories}
    color_points = hv.NdOverlay({inv_cats[k]: gv.Points([0,0],
                                                        crs=crs.PlateCarree(),
                                                        label=inv_cats[k])
                                 (style=dict(color=v))
                                 for k, v in colormap.items()})
    return color_points

In [None]:
plot_df = df.assign(tag=lambda df: df["tag"].astype("category"))

In [None]:
x_min, y_min, x_max, y_max = (plot_df.mx.values.min(),
                              plot_df.my.values.min(),
                              plot_df.mx.values.max(),
                              plot_df.my.values.max())
x_range=(x_min, x_max)
y_range=(y_min, y_max)
color_key = dict(zip(artists, palettes.Category10[len(artists)]))
shade_defaults = dict(x_range=x_range,
                      y_range=y_range,
                      width=1200,
                      height=660)

In [None]:
%%output filename="artist_datashaded_points"
%%opts Overlay [width=800 height=600 xaxis=None yaxis=None show_grid=False ] (background_alpha=0.1) 
%%opts Shape (fill_color=None line_width=1.5) [apply_ranges=False] 
%%opts Points [apply_ranges=False tools=[]]
%%opts WMTS (alpha=0.25)

# shade_defaults = dict(x_range=(x_max, x_min),
                      # y_range=(y_max, y_min),
                      # width=1200,
                      # height=660)

shaded_points = datashade(hv.Points(gv.Dataset(plot_df,
                                               kdims=["mx", "my"],
                                               vdims=["tag"])),
                          cmap=color_key,
                          element_type=gv.Image,
                          aggregator=ds.count_cat("tag"),
                          **shade_defaults, 
                         )

color_points = gen_col_points(color_key.keys(), color_key)

map_ = gv.WMTS(tiles["Stamen"]) * dynspread(shaded_points,
                                            max_px=1,
                                            threshold=0.5) * color_points
map_