# _Tutorial: Working with Streaming Data and the Twitter API in Python_

This notebook follows the Twitter API [tutorial](https://www.dataquest.io/blog/streaming-data-python/) on Dataquest. The idea for this project _was not my own_; however, I thought it would be best to follow along with the lesson and gain skills related to streaming data.

The tutorial walks us through how to "_create a tool that enables us to find out how people feel about Donald Trump and Hillary Clinton, both of whom are US Presidential candidates._" This is what we'll need to do in this tutorial:
- Stream tweets from the Twitter API
- Filter out the tweets that aren't relevant
- Process the tweets to figure out what emotions they express about each candidate
- Store the tweets for additional analysis

## _Programming Paradigm: Event-driven programming_

This type of programming is built around a program that executes actions in response to external inputs (like our streaming tweets). Below is some pseudocode that will get us started.

In [None]:
def filter_tweet(tweet):
    # Remove any tweets that don't match our criteria.
    if not tweet_matches_criteria(tweet):
        return
    # Process the remaining tweets.
    process_tweet(tweet)

def process_tweet(tweet):
    # Annotate the tweet dictionary with any other information we need.
    tweet["sentiment"] = get_sentiment(tweet)
    # Store the tweet.
    store_tweet(tweet)

def store_tweet(tweet):
    # Saves a tweet for later processing.
    ...
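
To make the pseudocode concrete, here is a minimal runnable sketch of the same filter, process, store chain. The keyword list, the toy word-count "sentiment", and the in-memory `stored_tweets` list are all assumptions made for illustration; they are not the tutorial's actual logic.

```python
# Hypothetical stand-ins: the keywords and the word lists below are
# illustrative assumptions, not the tutorial's real criteria.
KEYWORDS = ("trump", "clinton")
POSITIVE = {"great", "good", "love"}
NEGATIVE = {"bad", "terrible", "hate"}

stored_tweets = []  # in-memory stand-in for a real database


def tweet_matches_criteria(tweet):
    """Keep only tweets that mention one of our keywords."""
    text = tweet["text"].lower()
    return any(keyword in text for keyword in KEYWORDS)


def get_sentiment(tweet):
    """Toy score: number of positive words minus number of negative words."""
    words = tweet["text"].lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def store_tweet(tweet):
    stored_tweets.append(tweet)


def process_tweet(tweet):
    tweet["sentiment"] = get_sentiment(tweet)
    store_tweet(tweet)


def filter_tweet(tweet):
    if not tweet_matches_criteria(tweet):
        return
    process_tweet(tweet)


# Each incoming tweet is an "event" that triggers the chain of functions.
for event in [{"text": "I love pizza"}, {"text": "Trump rally was great"}]:
    filter_tweet(event)

print(stored_tweets)
```

Only the second tweet reaches `store_tweet`, because the first one never matches the criteria. Nothing runs until an event arrives, which is the core of the event-driven style.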

# _The Twitter Streaming API_

- Link: https://developer.twitter.com/en/docs
- Main rules when streaming tweets from Twitter:
    - Create a persistent connection to the API, and read the data from that connection incrementally
    - process tweets quickly, and don't let your program get backed up
    - handle errors and other issues properly
- Link for Twitter Streaming API clients: https://developer.twitter.com/en/community
    - [Tweepy](https://github.com/tweepy/tweepy) is most popular
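
One way to follow the "process tweets quickly" rule is to keep the receiving callback tiny and hand the slow work to a background worker through a bounded queue. The sketch below uses only the standard library; the names `on_tweet` and `worker`, the queue size, and the drop-when-full policy are assumptions for illustration and are not part of tweepy or the Twitter API.

```python
import queue
import threading

# Bounded so memory cannot grow without limit if the worker falls behind.
tweet_queue = queue.Queue(maxsize=1000)
processed = []


def on_tweet(tweet):
    """Called for each incoming tweet; does almost no work so it returns quickly."""
    try:
        tweet_queue.put_nowait(tweet)
    except queue.Full:
        # Dropping the tweet is one possible policy when we are backed up.
        pass


def worker():
    """Drain the queue and do the slow processing off the receiving thread."""
    while True:
        tweet = tweet_queue.get()
        if tweet is None:  # sentinel value: stop the worker
            break
        processed.append(tweet.upper())  # placeholder for real processing


thread = threading.Thread(target=worker, daemon=True)
thread.start()

for text in ["first tweet", "second tweet"]:
    on_tweet(text)

tweet_queue.put(None)  # ask the worker to finish
thread.join()
print(processed)
```

With a single worker the queue preserves arrival order, so the processed list mirrors the order in which tweets came in.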

In [22]:
# import libraries that we'll need for later
from settings import private
import tweepy
import matplotlib.pyplot as plt
import scipy
import numpy as np
import pandas as pd
from textblob import TextBlob
from sqlalchemy.exc import ProgrammingError
import json

In [8]:
# set up tweepy to authenticate with Twitter using the following code
auth = tweepy.OAuthHandler(private.TWITTER_APP_KEY, private.TWITTER_APP_SECRET)
auth.set_access_token(private.TWITTER_KEY, private.TWITTER_SECRET)

In [9]:
# create an API object to pull data from Twitter, pass in the authentication from above
api = tweepy.API(auth)

# _Setting up a listener_

- opening Twitter stream using tweepy requires user-defined `listener` class
- `StreamListener` class has `on_data` method
    - automatically figures out what kind of data Twitter sent
    - calls appropriate method to deal with data type
- for now, we only care about when users post tweets
    - will override the `on_status` method
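
To illustrate the dispatching idea described above, here is a simplified sketch of how an `on_data` method could route a raw message to a more specific handler. This is not tweepy's actual implementation: the class name `SketchListener`, the keys it checks, and the strings it returns are assumptions chosen to keep the example small.

```python
import json


class SketchListener:
    """Simplified stand-in for a stream listener; not tweepy's real code."""

    def on_data(self, raw_data):
        # Decode the raw JSON message, then route it by the keys it contains.
        data = json.loads(raw_data)
        if "delete" in data:
            return self.on_delete(data["delete"]["status"]["id"])
        if "limit" in data:
            return self.on_limit(data["limit"]["track"])
        if "text" in data:
            return self.on_status(data)
        return self.on_unknown(data)

    def on_status(self, status):
        return "status: " + status["text"]

    def on_delete(self, status_id):
        return "delete: " + str(status_id)

    def on_limit(self, track):
        return "limit: " + str(track)

    def on_unknown(self, data):
        return "unknown message"


listener = SketchListener()
print(listener.on_data('{"text": "hello world"}'))
print(listener.on_data('{"limit": {"track": 12}}'))
```

Because the routing lives in `on_data`, a subclass only needs to override the handlers it cares about, which is why overriding `on_status` alone is enough for now.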

In [14]:
# create a listener that prints text of any tweet that comes from the Twitter API
class StreamListener(tweepy.StreamListener):
    def on_status(self, status):
        print(status.text)
        
    # override on_error to handle errors properly: Twitter sends a 420 status code when we are
    # being rate limited, so return False to disconnect; for any other error, keep the stream going
    def on_error(self, status_code):
        if status_code == 420:
            return False
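
Disconnecting on a 420 is only half of the story: a program that wants to reconnect later would typically wait first, and wait longer after each failed attempt. The helper below sketches such an exponential backoff schedule. The function name `backoff_delays` and every number in it (60 second start, doubling, 960 second cap, 5 attempts) are assumptions for illustration, so check Twitter's current guidance before relying on them.

```python
def backoff_delays(initial=60, factor=2, max_delay=960, attempts=5):
    """Return the wait (in seconds) before each reconnection attempt."""
    delays = []
    delay = initial
    for _ in range(attempts):
        delays.append(min(delay, max_delay))  # never wait longer than the cap
        delay *= factor
    return delays


print(backoff_delays())
```

A reconnect loop could call `time.sleep` with each value in turn and stop as soon as the stream connects successfully.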

# _Starting the listener_

- once we set up the listener, we're ready to wire everything up and stream tweets
- below we'll:
    - create an instance of `StreamListener` class
    - create an instance of the tweepy `Stream` class
        - streams the tweets
        - pass authentication credentials `api.auth` so that Twitter allows us to connect
        - pass in `stream_listener` so that our callback functions are called
    - start streaming tweets by calling `filter` method
        - streams tweets from `filter.json` API endpoint, passing to listener callback
            - pass `filter` a list of terms to track, as the API requires

In [None]:
# this simply prints tweets mentioning the terms below; I didn't run it because the output made
# the notebook extremely long
stream_listener = StreamListener()
stream = tweepy.Stream(auth=api.auth, listener=stream_listener)
stream.filter(track=["trump", "donald trump", "impeachment"])

# _Filtering events_

- what if we want to ignore the retweets?
    - reason: same text can show up hundreds or thousands of times, skewing results
        - one person's tweet will count thousands of times in our analysis
- tweet that is passed into `on_status` method is instance of `Status` class
    - has properties describing tweet, including if it was retweeted
- below we'll modify the `on_status` function to filter out retweets
    - if the `retweeted_status` property exists, then don't process the tweet
    - print all tweets that aren't retweets

In [18]:
# modification of on_status function that filters out retweets
class StreamListener(tweepy.StreamListener):
    def on_status(self, status):
        # a retweet carries the original tweet in `retweeted_status`; skip those
        if hasattr(status, "retweeted_status"):
            return
        print(status.text)
        
    # override on_error to handle errors properly: Twitter sends a 420 status code when we are
    # being rate limited, so return False to disconnect; for any other error, keep the stream going
    def on_error(self, status_code):
        if status_code == 420:
            return False
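
The same retweet check can be tried offline on plain dictionaries shaped like tweet payloads. The sample data and the helper name `is_retweet` below are made up for illustration; the only assumption about the real payload is that a retweet includes a `retweeted_status` field and an original tweet does not.

```python
def is_retweet(tweet):
    """A retweet payload carries the original tweet under `retweeted_status`."""
    return "retweeted_status" in tweet


# Made-up sample payloads: one original tweet and one retweet of it.
sample = [
    {"text": "Original thought"},
    {
        "text": "RT @someone: Original thought",
        "retweeted_status": {"text": "Original thought"},
    },
]

originals = [t["text"] for t in sample if not is_retweet(t)]
print(originals)
```

Filtering this way means each distinct tweet is counted once, no matter how many times it was retweeted.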

In [28]:
from datetime import datetime
# current date and time
now = datetime.now()
print("Last edit: ", now)
print('Restart from Filtering events section in Dataquest blog post; see below for link.')

Last edit:  2019-09-30 18:18:19.135810
Restart from Filtering events section in Dataquest blog post; see below for link.


# _Filtering events (cont.)_

Last edit was made approximately at the time listed above. 

Link to blog post: [Dataquest Streaming Tutorial](https://www.dataquest.io/blog/streaming-data-python/)