# _Tutorial: Working with Streaming Data and the Twitter API in Python_

This notebook is based on the following Twitter API [tutorial](https://www.dataquest.io/blog/streaming-data-python/) on Dataquest. The idea for this project _was not my own_; however, I thought it would be best to follow along with the lesson and gain skills related to streaming data.

The tutorial walks us through how to "_create a tool that enables us to find out how people feel about Donald Trump and Hillary Clinton, both of whom are US Presidential candidates._" This is what we'll need to do in this tutorial:
- Stream tweets from the Twitter API
- Filter out the tweets that aren't relevant
- Process the tweets to figure out what emotions they express about each candidate
- Store the tweets for additional analysis

## _Programming Paradigm: Event-driven programming_

This type of programming is based around a program executing actions in response to external inputs (like our streaming tweets). Below is some pseudocode that will get us started.

In [None]:
def filter_tweet(tweet):
    # Remove any tweets that don't match our criteria.
    if not tweet_matches_criteria(tweet):
        return
    # Process the remaining tweets.
    process_tweet(tweet)


def process_tweet(tweet):
    # Annotate the tweet dictionary with any other information we need.
    tweet["sentiment"] = get_sentiment(tweet)
    # Store the tweet.
    store_tweet(tweet)


def store_tweet(tweet):
    # Saves a tweet for later processing.
    ...
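
To make the pseudocode above concrete, here is a minimal runnable sketch of the same filter, process, store chain. Everything inside the helpers is an assumption made for illustration: the `KEYWORDS` tuple, the toy `get_sentiment` scorer, and the in-memory `STORED` list are hypothetical stand-ins, not part of the Dataquest tutorial.

```python
# Hypothetical stand-ins: keyword list, toy scorer, and in-memory store.
KEYWORDS = ("trump", "clinton")
STORED = []


def tweet_matches_criteria(tweet):
    # Keep only tweets whose text mentions one of our keywords.
    text = tweet["text"].lower()
    return any(word in text for word in KEYWORDS)


def get_sentiment(tweet):
    # Toy score: +1 for "love", -1 for "hate" (0 if neither or both appear).
    text = tweet["text"].lower()
    return ("love" in text) - ("hate" in text)


def store_tweet(tweet):
    # Save the tweet for later processing (here: just a list).
    STORED.append(tweet)


def process_tweet(tweet):
    # Annotate the tweet, then hand it to the storage step.
    tweet["sentiment"] = get_sentiment(tweet)
    store_tweet(tweet)


def filter_tweet(tweet):
    # Entry point: drop irrelevant tweets, pass the rest along.
    if not tweet_matches_criteria(tweet):
        return
    process_tweet(tweet)


# Each incoming "event" is handed to the entry point of the chain.
for event in [{"text": "I love pizza"}, {"text": "I hate the Trump debate"}]:
    filter_tweet(event)
```

Only the second event mentions a keyword, so only that tweet ends up in `STORED`, annotated with its sentiment.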

# _The Twitter Streaming API_

- Link: https://developer.twitter.com/en/docs
- Main rules when streaming tweets from Twitter:
    - Create a persistent connection to the API, and read each connection incrementally
    - Process tweets quickly, and don't let your program get backed up
    - Handle errors and other issues properly
- Link for Twitter Streaming API clients: https://developer.twitter.com/en/community
    - [Tweepy](https://github.com/tweepy/tweepy) is most popular

In [1]:
# import libraries that we'll need for later
from settings import private
import tweepy
import matplotlib.pyplot as plt
import scipy
import numpy as np
import pandas as pd
from textblob import TextBlob
from sqlalchemy.exc import ProgrammingError
import json

In [2]:
# set up tweepy to authenticate with Twitter using the following code
auth = tweepy.OAuthHandler(private.TWITTER_APP_KEY, private.TWITTER_APP_SECRET)
auth.set_access_token(private.TWITTER_KEY, private.TWITTER_SECRET)

In [3]:
# create an API object to pull data from Twitter, pass in the authentication from above
api = tweepy.API(auth)

# _Setting up a listener_

- opening Twitter stream using tweepy requires user-defined `listener` class
- `StreamListener` class has `on_data` method
    - automatically figures out what kind of data Twitter sent
    - calls appropriate method to deal with data type
- for now, we only care about when users post tweets
    - will override the `on_status` method
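
The dispatch idea described above can be sketched without tweepy. The `MiniListener` class below is a simplified, hypothetical illustration of how an `on_data` method might inspect a raw message and route it to a callback; it is not tweepy's actual implementation, and the message shapes are assumptions.

```python
import json


class MiniListener:
    """Simplified, hypothetical dispatcher; not tweepy's real implementation."""

    def on_data(self, raw_data):
        data = json.loads(raw_data)
        # Inspect the payload and route it to the matching callback.
        if "delete" in data:
            return self.on_delete(data["delete"]["status"]["id"])
        if "text" in data:
            return self.on_status(data)
        return None  # ignore message types we do not handle

    def on_status(self, status):
        return "status: " + status["text"]

    def on_delete(self, status_id):
        return "deleted: " + str(status_id)


listener = MiniListener()
result = listener.on_data('{"text": "hello"}')
```

Subclassing and overriding `on_status` changes what happens to tweets while leaving the routing logic untouched, which is the same pattern the notebook follows with tweepy's `StreamListener`.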

In [4]:
# create a listener that prints text of any tweet that comes from the Twitter API
class StreamListener(tweepy.StreamListener):
    def on_status(self, status):
        print(status.text)
        
    # override on_error to handle errors properly: Twitter sends a 420 status code when we are
    # being rate limited, so we return False to disconnect; any other error keeps the stream going
    def on_error(self, status_code):
        if status_code == 420:
            return False

# _Starting the listener_

- once we set up the listener, we're ready to wire everything up and stream tweets
- below we'll:
    - create an instance of `StreamListener` class
    - create an instance of the tweepy `Stream` class
        - streams the tweets
        - pass authentication credentials `api.auth` so that Twitter allows us to connect
        - pass in `stream_listener` so that our callback functions are called
    - start streaming tweets by calling `filter` method
        - streams tweets from `filter.json` API endpoint, passing to listener callback
            - pass this in a list of terms to filter on, as API requires

In [None]:
# this will simply print out tweets mentioning the terms below
# run it if you want a stream of tweets, but note that the resulting output is extremely long
stream_listener = StreamListener()
stream = tweepy.Stream(auth=api.auth, listener=stream_listener)
stream.filter(track=["trump", "donald trump", "impeachment"])

# _Filtering events_

- what if we want to ignore the retweets?
    - reason: same text can show up hundreds or thousands of times, skewing results
        - one person's tweet will count thousands of times in our analysis
- tweet that is passed into `on_status` method is instance of `Status` class
    - has properties describing tweet, including if it was retweeted
- below we'll modify the `on_status` function to filter out retweets
    - if the `retweeted_status` property exists, then don't process the tweet
    - print all tweets that aren't retweets
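
As a sketch of the check described above, the snippet below tests for the `retweeted_status` attribute. The `SimpleNamespace` objects are hypothetical stand-ins for tweepy `Status` objects, used so the example runs without a Twitter connection. Note that the separate `retweeted` flag reports whether the authenticating user retweeted the tweet, so it is not a reliable way to detect retweets in a stream.

```python
from types import SimpleNamespace


def is_retweet(status):
    # Retweets carry the original tweet under `retweeted_status`;
    # original tweets do not have that attribute at all.
    return hasattr(status, "retweeted_status")


# Hypothetical stand-ins for tweepy Status objects.
original = SimpleNamespace(text="hello world", retweeted=False)
retweet = SimpleNamespace(
    text="RT @someone: hello world",
    retweeted=False,
    retweeted_status=SimpleNamespace(text="hello world"),
)

# Keep only the text of tweets that are not retweets.
kept = [s.text for s in (original, retweet) if not is_retweet(s)]
```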

In [5]:
# modification of on_status function that filters out retweets
class StreamListener(tweepy.StreamListener):
    def on_status(self, status):
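        # a retweet carries a `retweeted_status` attribute; `status.retweeted` only says
        # whether the authenticating user retweeted it, so it does not detect retweets
        if hasattr(status, "retweeted_status"):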
            return
        print(status.text)
        
    # override on_error to handle errors properly: Twitter sends a 420 status code when we are
    # being rate limited, so we return False to disconnect; any other error keeps the stream going
    def on_error(self, status_code):
        if status_code == 420:
            return False

In [7]:
from datetime import datetime
# current date and time
now = datetime.now()
print("Last edit: ", now)

Last edit:  2019-10-01 13:43:23.910273


# _Filtering events (cont.)_

- full [list](https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object) of available properties of tweets
- could perform additional filtering by using fields such as:
    - `retweet_count` — the number of times a tweet has been retweeted.
    - `withheld_in_countries` — the tweet has been withheld in certain countries.
    - `favorite_count` — the number of times the tweet has been favorited by other users.
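
A minimal sketch of filtering on the fields listed above follows. The `SimpleNamespace` objects are hypothetical stand-ins for tweepy `Status` objects, and `MIN_RETWEETS` is an assumed threshold. In a live stream these counts tend to be 0 because the tweet was only just posted, so a threshold like this is likely more useful on tweets that were stored and revisited later.

```python
from types import SimpleNamespace

MIN_RETWEETS = 10  # hypothetical threshold


def passes_extra_filters(status):
    # Skip tweets withheld anywhere; the field is absent on most tweets.
    if getattr(status, "withheld_in_countries", None):
        return False
    # Keep only tweets that have been retweeted often enough.
    return status.retweet_count >= MIN_RETWEETS


# Hypothetical stand-ins for tweepy Status objects.
popular = SimpleNamespace(retweet_count=25, favorite_count=40)
quiet = SimpleNamespace(retweet_count=0, favorite_count=1)
withheld = SimpleNamespace(
    retweet_count=99, favorite_count=5, withheld_in_countries=["XX"]
)

results = [passes_extra_filters(s) for s in (popular, quiet, withheld)]
```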

Last edit was made approximately at the time listed above. 

Link to blog post: [Dataquest Streaming Tutorial](https://www.dataquest.io/blog/streaming-data-python/)

# _Extracting information_

- we have to process tweets quickly, i.e. don't do anything too intensive before saving them
- focus on extracting and storing properties we want
- there are a few fields that will be interesting to us:
    - `status.user.description`: the user's description (from their biography)
    - `status.user.location`: location the user who created the tweet wrote in their bio
    - `status.user.screen_name`: screen name of the user
    - `status.user.created_at`: when the user's account was created
    - `status.user.followers_count`: how many followers the user has
    - `status.user.profile_background_color`: background color the user has chosen for their profile
    - `status.text`: text of the tweet
    - `status.id_str`: unique ID Twitter assigned to the tweet
    - `status.created_at`: when tweet was sent
    - `status.retweet_count`: how many times the tweet has been retweeted
    - `status.coordinates`: geographic coordinates from where the tweet was sent
- Can extract the above information in the `on_status` method
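
One way to keep the extraction step tidy is to gather the fields into a plain dictionary. The sketch below assumes a hypothetical helper named `extract_fields` and uses `SimpleNamespace` stand-ins instead of real tweepy `Status` objects; the dictionary keys are illustrative choices, not names required by any library.

```python
from types import SimpleNamespace


def extract_fields(status):
    # Copy only the properties we plan to store into a plain dictionary.
    return {
        "user_description": status.user.description,
        "user_location": status.user.location,
        "user_name": status.user.screen_name,
        "user_followers": status.user.followers_count,
        "text": status.text,
        "id_str": status.id_str,
        "retweet_count": status.retweet_count,
    }


# Hypothetical stand-in for a tweepy Status object.
fake_status = SimpleNamespace(
    user=SimpleNamespace(
        description="news junkie",
        location="Philadelphia, PA",
        screen_name="example_user",
        followers_count=42,
    ),
    text="watching the hearings",
    id_str="1179000000000000000",
    retweet_count=0,
)

row = extract_fields(fake_status)
```

A dictionary like `row` is also a convenient shape for the storage step later on, since each key can map to one column.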

In [6]:
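# modification of on_status function that filters out retweets and extracts the fields we want
class StreamListener(tweepy.StreamListener):
    def on_status(self, status):
        # retweets carry a `retweeted_status` attribute; skip them
        if hasattr(status, "retweeted_status"):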
            return

        description = status.user.description
        loc = status.user.location
        text = status.text
        coords = status.coordinates
        geo = status.geo
        name = status.user.screen_name
        user_created = status.user.created_at
        followers = status.user.followers_count
        id_str = status.id_str
        created = status.created_at
        retweets = status.retweet_count
        bg_color = status.user.profile_background_color
        
    # override on_error to handle errors properly: Twitter sends a 420 status code when we are
    # being rate limited, so we return False to disconnect; any other error keeps the stream going
    def on_error(self, status_code):
        if status_code == 420:
            return False

# _Processing the tweets_

- interested in how people feel about the recent impeachment inquiry into President Donald Trump
    - want to analyze the text of each tweet to figure out sentiment it expresses
- can use sentiment analysis to tag each tweet with a sentiment score, from `-1` to `1`
    - `-1`: means the tweet is very negative
    - `0`: neutral
    - `1`: means the tweet is very positive
- sentiment analysis tools typically generate a score based on words known to carry positive/negative sentiment
    - ex. if `hate` occurs in a string, more likely to be negative
    - essentially string matching, extremely quick (i.e. a good thing for this case)
- can use `TextBlob` library to perform sentiment analysis
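
To illustrate the word-matching idea described above, here is a toy scorer. It is a deliberately simplified sketch and not how `TextBlob` computes its scores; the `POSITIVE` and `NEGATIVE` word sets and the `toy_polarity` function are hypothetical.

```python
# Hypothetical mini-lexicons; real tools use much larger word lists.
POSITIVE = {"love", "great", "good"}
NEGATIVE = {"hate", "awful", "bad"}


def toy_polarity(text):
    # Normalize words, count lexicon matches, and scale into [-1, 1].
    words = [w.strip(".,!?").lower() for w in text.split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total


negative_score = toy_polarity("I hate this awful debate")
mixed_score = toy_polarity("good speech, bad answer")
neutral_score = toy_polarity("nothing to see here")
```

Because the work is just set lookups over the words of a short text, scoring stays cheap enough to run on every incoming tweet.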

In [7]:
# modification of on_status function that also computes sentiment with TextBlob
class StreamListener(tweepy.StreamListener):
    def on_status(self, status):
        # retweets carry a `retweeted_status` attribute; skip them
        if hasattr(status, "retweeted_status"):
            return

        description = status.user.description
        loc = status.user.location
        text = status.text
        coords = status.coordinates
        geo = status.geo
        name = status.user.screen_name
        user_created = status.user.created_at
        followers = status.user.followers_count
        id_str = status.id_str
        created = status.created_at
        retweets = status.retweet_count
        bg_color = status.user.profile_background_color
        blob = TextBlob(text) # initialize TextBlob class on the text of the tweet
        sent = blob.sentiment # get sentiment score from the class
        
    # override on_error to handle errors properly: Twitter sends a 420 status code when we are
    # being rate limited, so we return False to disconnect; any other error keeps the stream going
    def on_error(self, status_code):
        if status_code == 420:
            return False

# _Processing the tweets (cont.)_

- once we have the `sent` object, we need to extract `polarity` and `subjectivity` from it
    - `polarity`: positivity/negativity of tweet on -1 to 1 scale
    - `subjectivity`: how objective/subjective the tweet is, 0 being very objective and 1 being very subjective
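
Once the two numbers are available, turning them into a label is straightforward. The sketch below uses a `namedtuple` stand-in with the same two fields as the `sent` object so that it runs without `TextBlob`; the `label` function and its `threshold` cutoff are hypothetical choices for illustration.

```python
from collections import namedtuple

# Stand-in with the same two fields as the `sent` object described above.
Sentiment = namedtuple("Sentiment", ["polarity", "subjectivity"])


def label(sent, threshold=0.1):
    # `threshold` is a hypothetical cutoff for calling a tweet neutral.
    if sent.polarity > threshold:
        return "positive"
    if sent.polarity < -threshold:
        return "negative"
    return "neutral"


labels = [
    label(Sentiment(polarity=0.5, subjectivity=0.6)),
    label(Sentiment(polarity=-0.8, subjectivity=0.9)),
    label(Sentiment(polarity=0.05, subjectivity=0.0)),
]
```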

In [8]:
# modification of on_status function that also extracts polarity and subjectivity
class StreamListener(tweepy.StreamListener):
    def on_status(self, status):
        # retweets carry a `retweeted_status` attribute; skip them
        if hasattr(status, "retweeted_status"):
            return

        description = status.user.description
        loc = status.user.location
        text = status.text
        coords = status.coordinates
        geo = status.geo
        name = status.user.screen_name
        user_created = status.user.created_at
        followers = status.user.followers_count
        id_str = status.id_str
        created = status.created_at
        retweets = status.retweet_count
        bg_color = status.user.profile_background_color
        blob = TextBlob(text) # initialize TextBlob class on the text of the tweet
        sent = blob.sentiment # get sentiment score from the class
        polarity = sent.polarity # get polarity score from sentiment
        subjectivity = sent.subjectivity # get subjectivity score from sentiment
        
    # override on_error to handle errors properly: Twitter sends a 420 status code when we are
    # being rate limited, so we return False to disconnect; any other error keeps the stream going
    def on_error(self, status_code):
        if status_code == 420:
            return False

# _Storing the tweets_

- once we have data we want on each tweet, we're ready to store it for later processing
- storing in CSV makes it hard to query
    - if we want to read from CSV file, either have to load whole thing or go through process to load only the pieces we want
- database is a good place to store our data, specifically a relational database called SQLite
    - simple, doesn't require any processes to be running
    - everything is stored in a single file
- need to use the [dataset](https://dataset.readthedocs.io/en/latest/) package
    - makes it simple to access a database and store data
    - we simply store data and the `dataset` package will automatically create the database and tables we need
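
To show why a database is easier to query than a CSV file, here is a small sketch using Python's built-in `sqlite3` module rather than the `dataset` package. The table layout and sample rows are hypothetical, and the database lives in memory purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
conn.execute(
    "CREATE TABLE tweets (id_str TEXT, user_name TEXT, text TEXT, polarity REAL)"
)

# Hypothetical sample rows standing in for processed tweets.
rows = [
    ("1", "alice", "great speech", 0.8),
    ("2", "bob", "awful debate", -1.0),
    ("3", "carol", "watching the news", 0.0),
]
conn.executemany("INSERT INTO tweets VALUES (?, ?, ?, ?)", rows)
conn.commit()

# Load only the pieces we want instead of reading everything.
negative = conn.execute(
    "SELECT user_name FROM tweets WHERE polarity < 0"
).fetchall()
```

The `WHERE` clause does the filtering inside the database, so only matching rows are loaded into Python.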

In [11]:
# first have to connect our database using a connection string
import dataset
db = dataset.connect("sqlite:///tweets.db")

When using _SQLite_, if the database file (i.e. `tweets.db`) doesn't exist, it will automatically be created in the current folder. 

Next, we have to dump our coordinates json dictionary to a string, so we can store it:

In [12]:
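# modification of on_status function that also serializes the location data for storage
class StreamListener(tweepy.StreamListener):
    def on_status(self, status):
        # retweets carry a `retweeted_status` attribute; skip them
        if hasattr(status, "retweeted_status"):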
            return

        description = status.user.description
        loc = status.user.location
        text = status.text
        coords = status.coordinates
        geo = status.geo
        name = status.user.screen_name
        user_created = status.user.created_at
        followers = status.user.followers_count
        id_str = status.id_str
        created = status.created_at
        retweets = status.retweet_count
        bg_color = status.user.profile_background_color
        blob = TextBlob(text) # initialize TextBlob class on the text of the tweet
        sent = blob.sentiment # get sentiment score from the class
        polarity = sent.polarity # get polarity score from sentiment
        subjectivity = sent.subjectivity # get subjectivity score from sentiment
        
        # dump the coordinates and geo JSON dictionaries to strings so we can store them
        if coords is not None:
            coords = json.dumps(coords)
        if geo is not None:
            geo = json.dumps(geo)
        
    # override on_error to handle errors properly: Twitter sends a 420 status code when we are
    # being rate limited, so we return False to disconnect; any other error keeps the stream going
    def on_error(self, status_code):
        if status_code == 420:
            return False
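
The serialization step above can be checked in isolation. The sketch below uses a hypothetical GeoJSON-style point, similar in shape to what a tweet's coordinates field might hold, and confirms that the string written to the database can be turned back into the original dictionary.

```python
import json

# Hypothetical GeoJSON-style point, similar in shape to a tweet's coordinates.
coords = {"type": "Point", "coordinates": [-75.14, 40.25]}

# Serialize only when a value is present; most tweets have no coordinates.
coords_str = json.dumps(coords) if coords is not None else None

# Later, when reading from the database, the string can be decoded again.
restored = json.loads(coords_str)
```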