# Twitter Crawler

The first step is to create an application on [Twitter Apps](https://apps.twitter.com/): select the **Create New App** button and follow the instructions to the end.

Then obtain the following keys/tokens for authentication:

* consumer_key
* consumer_secret
* access_token
* access_token_secret

**Note:** Generating Twitter API keys can take anywhere from minutes to weeks.

# **Tweepy**

> Tweepy is one of the best packages for working with the Twitter APIs. [More](https://www.tweepy.org/)

In [None]:
## Import Required Modules

import os
import json
import tweepy
import requests

## Environment Setup and Authentication

> * Set your Twitter consumer_key, consumer_secret, access_token, and access_token_secret as environment variables.
> * To find out where these values are located, see [TwitterEnvironment](https://developer.twitter.com/en/docs/apps/overview)
> * A secure way to use your credentials is to create environment variables in your terminal:
```console
export 'consumer_key'='xxxx' 
export 'consumer_secret'='xxxx' 
export 'access_token'='xxxx' 
export 'access_token_secret'='xxxx'
```
> * After authenticating with these credentials, you will be able to access the Twitter API interface.
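
Because `os.environ.get` returns `None` for an unset variable, a missing credential would only surface later as an authentication error. The following is a minimal sketch of failing early instead, assuming the four variable names used above. `load_credentials` and `REQUIRED_VARS` are hypothetical names introduced for illustration and are not part of Tweepy.

```python
import os

# Variable names assumed to match the `export` commands above
REQUIRED_VARS = ("consumer_key", "consumer_secret",
                 "access_token", "access_token_secret")

def load_credentials(environ=os.environ):
    """Return the four credentials as a dict, or raise if any is unset or empty."""
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in REQUIRED_VARS}

# Demonstration with a stand-in mapping instead of the real environment
fake_env = {name: "xxxx" for name in REQUIRED_VARS}
creds = load_credentials(fake_env)
print(sorted(creds))
```

Calling `load_credentials()` with no argument reads the real environment, and the returned values could then be passed to the authentication handler.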

In [None]:
consumer_key = os.environ.get('consumer_key')
consumer_secret = os.environ.get('consumer_secret')
access_token = os.environ.get('access_token')
access_token_secret = os.environ.get('access_token_secret')

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

**Getting user’s Tweets**
>Main parameters:
> * screen_name – Specifies the screen name of the user (user_id can be passed instead to identify the user by numeric ID).
> * count – Maximum number of the user's most recent tweets to retrieve. <br>
> * [More Details](https://tweepy.readthedocs.io/en/latest/api.html#API.user_timeline)

In [None]:
!pip install columnar

In [None]:
from columnar import columnar

username = 'boredbengio'
count = 5

# Only iterate through the first n statuses
tweets = tweepy.Cursor(api.user_timeline,
                       screen_name=username).items(count)

# Pulling information from tweets iterable object
tweets_list = [[tweet.id, tweet.created_at, tweet.text] for tweet in tweets]

#print tweets
headers = ['id', 'created_at','text']
table = columnar(tweets_list, headers, no_borders=True)
print(table)

# what are the current attributes/tags in a tweet?
# https://jsoneditoronline.org/
tweet = api.get_status('1420646753863225349')
print(json.dumps(tweet._json))

**Pagination**
>Main parameters:
> * count – Maximum number of pages to retrieve. <br>
> * [More Details](https://docs.tweepy.org/en/latest/v2_pagination.html?highlight=pagination)
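
To make the idea of pages concrete, here is a small, network-free sketch of cursor-style pagination. It only illustrates the general pattern (fetch a page, follow the returned cursor, stop after a maximum number of pages) and is not how Tweepy's `Cursor` is implemented. `fetch_page` and `iter_pages` are hypothetical helpers.

```python
def fetch_page(cursor, page_size=3, total=8):
    """Hypothetical stand-in for one API call: returns (items, next_cursor)."""
    items = list(range(cursor, min(cursor + page_size, total)))
    next_cursor = cursor + page_size if cursor + page_size < total else None
    return items, next_cursor

def iter_pages(fetch, max_pages):
    """Yield up to max_pages pages, following the cursor returned by each call."""
    cursor = 0
    for _ in range(max_pages):
        items, cursor = fetch(cursor)
        yield items
        if cursor is None:  # the source has no more results
            break

pages = list(iter_pages(fetch_page, max_pages=2))
print(pages)  # [[0, 1, 2], [3, 4, 5]]
```

With a larger `max_pages`, iteration stops on its own once the stand-in source reports that no further cursor is available.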

In [None]:
# Pagination: iterate through the timeline one page at a time
count = 1
for page in tweepy.Cursor(api.user_timeline, screen_name=username).pages(count):
    for status in page:
        # Print each tweet id with the first 30 characters of its text
        print(status.id, status.text[:30])
    # To inspect the raw JSON of the first tweet on the page instead:
    # print(json.dumps(page[0]._json))

**Getting user's followers**
>Main parameters:
> * user_id – Specifies the ID of the user.

In [None]:
user_id='14861663'
count = 5

followers = tweepy.Cursor(api.get_follower_ids,
                          user_id=user_id).items(count)

user_list = [[user] for user in followers]

headers = ['user_id']
table = columnar(user_list, headers, no_borders=True)
print(table) 

**Getting user's followees**
>Main parameters:
> * user_id – Specifies the ID of the user.
> * [More Details](https://docs.tweepy.org/en/latest/api.html?highlight=get_friends#tweepy.API.get_friends)

In [None]:
user_id='14861663'
count = 5
    
friends = tweepy.Cursor(api.get_friends,
                        user_id=user_id).items(count)
    
# Pulling information from the users iterable object
user_list = [[user.id, user.screen_name, user.created_at] for user in friends]

# Print users
headers = ['user_id', 'screen_name', 'created_at']
table = columnar(user_list, headers, no_borders=True)
print(table) 


**Getting tweet with specific id**
> Helpful when you only have tweet IDs and would like to get the corresponding attributes, such as the text.
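
When there are many IDs, fetching them one at a time is slow and uses up rate limits quickly. A common approach is to split the IDs into batches first. The sketch below assumes batches of at most 100 IDs, the per-request limit of the v1.1 bulk lookup endpoint. `chunked` is a hypothetical helper, and the IDs are placeholders, not real tweets.

```python
def chunked(ids, size=100):
    """Split a list of tweet ids into consecutive batches of at most `size`."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]

placeholder_ids = list(range(1, 251))  # stand-in values, not real tweet ids
batches = chunked(placeholder_ids)
print([len(batch) for batch in batches])  # [100, 100, 50]

# Each batch could then be passed to a bulk lookup call, for example
# api.lookup_statuses(batch) in Tweepy v4 (an assumption: check the
# method name against your installed Tweepy version).
```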

In [None]:
tweet_id='1255894886051713030'

tweet = api.get_status(tweet_id)

tweet_list = [tweet.text, tweet.favorite_count, tweet.retweet_count]
print(tweet_list)

json_tweet = json.dumps(tweet._json)
print(json_tweet)


**Twitter Search**
 > To search Twitter for recent tweets, we define the search terms and, optionally, a start date for the search. [More Details](http://docs.tweepy.org/en/latest/api.html#API.search)<br>
 > - For creating complex queries, please see [Building standard queries](https://developer.twitter.com/en/docs/twitter-api/v1/rules-and-filtering/overview/standard-operators)
 > - The standard search API only returns tweets from roughly the past week, so you cannot dig far into the history.
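
A query is just a string of terms and standard operators separated by spaces. The sketch below assembles one from parts, assuming the `-filter:retweets` and `since:` operators from the standard operators page. `build_query` is a hypothetical helper and the date is a placeholder.

```python
def build_query(terms, exclude_retweets=True, since=None):
    """Join search terms and optional standard operators into one query string."""
    parts = list(terms)
    if exclude_retweets:
        parts.append("-filter:retweets")
    if since:
        parts.append(f"since:{since}")  # date in YYYY-MM-DD form
    return " ".join(parts)

query = build_query(["#disneyland"], since="2021-07-01")  # placeholder date
print(query)  # #disneyland -filter:retweets since:2021-07-01
```

The resulting string can be passed as the `q` argument of the search call. Keep in mind that a `since:` date older than the search window returns nothing extra.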

In [None]:
# Define the search terms (retweets are filtered out)

search_words = "#disneyland -filter:retweets"

# Collect tweets
tweets = tweepy.Cursor(api.search_tweets,
                       q=search_words,
                       lang="en").items(5)

# Pulling information from tweets iterable object
tweets_list = [[tweet.id, tweet.created_at, tweet.text] for tweet in tweets]

#print tweets
headers = ['id', 'created_at', 'text']
table = columnar(tweets_list, headers, no_borders=True)
print(table)

#### Bearer Token

> * To use the Twitter API v2 endpoints, you need to set up a bearer token from your Twitter App developer dashboard.
> * The bearer token can be found on the "keys and tokens" page of the desired app in the developer dashboard; for more details, check out [BearerToken](https://developer.twitter.com/en/docs/authentication/oauth-2-0/bearer-tokens) <br>
> * A secure way to use your credentials is to create environment variables in your terminal:
```console
export 'BEARER_TOKEN'='xxxx'
```
> * The bearer_oauth function attaches the bearer token to each request as an Authorization header.

In [None]:
bearer_token = os.environ.get("BEARER_TOKEN")

def bearer_oauth(r):
    """
    Method required by bearer token authentication.
    """

    r.headers["Authorization"] = f"Bearer {bearer_token}"
    r.headers["User-Agent"] = "v2FilteredStreamPython"
    return r

#### Filtered Stream
We will now see how to get tweets based on certain rules using FilteredStream. Tweets are requested from the URL [SearchStreamURL](https://developer.twitter.com/en/docs/twitter-api/tweets/filtered-stream/api-reference/get-tweets-search-stream)

> * You can adjust the rules by changing sample_rules in the set_rules function.
> > * Here the rules match tweets containing the text "apple" or "covid19".
> > * You can add more rules by adding further entries with a "value" key.
> > * A rule value can also use operators such as has: (for example, has:images matches only tweets that contain images). Each rule may also carry an optional "tag", a free-form string used as a label to recognize which rule a tweet matched.
> > * Check out [BuildRules](https://developer.twitter.com/en/docs/twitter-api/tweets/filtered-stream/integrate/build-a-rule) for more details on building rules for the filtered stream endpoint.
> * get_stream prints the tweets retrieved from the filtered stream endpoint according to the rules.
> * Once you connect to the FilteredStream endpoint, you will keep receiving tweets matching the rules through a continuous HTTP streaming connection.
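
As an illustration of the rule format described above, the following sketch builds a payload that combines a plain keyword rule with a rule using the `has:images` operator and a tag. `make_rule` is a hypothetical helper, and the tag text is an arbitrary example label.

```python
import json

def make_rule(value, tag=None):
    """Build one rule dict; the tag is optional and only labels the rule."""
    rule = {"value": value}
    if tag is not None:
        rule["tag"] = tag
    return rule

sample_rules = [
    make_rule("apple has:images", tag="apple with images"),
    make_rule("covid19"),
]
payload = {"add": sample_rules}
print(json.dumps(payload, indent=2))
```

A payload of this shape is what gets posted to the rules endpoint when new rules are added.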

In [None]:
def get_rules():
    response = requests.get(
        "https://api.twitter.com/2/tweets/search/stream/rules", auth=bearer_oauth
    )
    if response.status_code != 200:
        raise Exception(
            "Cannot get rules (HTTP {}): {}".format(response.status_code, response.text)
        )
    print(json.dumps(response.json()))
    return response.json()

def delete_all_rules(rules):
    if rules is None or "data" not in rules:
        return None

    ids = list(map(lambda rule: rule["id"], rules["data"]))
    payload = {"delete": {"ids": ids}}
    response = requests.post(
        "https://api.twitter.com/2/tweets/search/stream/rules",
        auth=bearer_oauth,
        json=payload
    )
    if response.status_code != 200:
        raise Exception(
            "Cannot delete rules (HTTP {}): {}".format(
                response.status_code, response.text
            )
        )
    print(json.dumps(response.json()))

    
def set_rules(rules):
    # You can adjust the rules if needed
    sample_rules = [
        {"value": "apple"},
        {"value": "covid19"},
    ]
    payload = {"add": sample_rules}
    response = requests.post(
        "https://api.twitter.com/2/tweets/search/stream/rules",
        auth=bearer_oauth,
        json=payload,
    )
    if response.status_code != 201:
        raise Exception(
            "Cannot add rules (HTTP {}): {}".format(response.status_code, response.text)
        )
    print(json.dumps(response.json()))

def get_stream(rule_set):
    # rule_set is not used here; the parameter is renamed to avoid shadowing the built-in set
    response = requests.get(
        "https://api.twitter.com/2/tweets/search/stream", auth=bearer_oauth, stream=True,
    )
    print(response.status_code)
    if response.status_code != 200:
        raise Exception(
            "Cannot get stream (HTTP {}): {}".format(
                response.status_code, response.text
            )
        )
    for response_line in response.iter_lines():
        if response_line:
            json_response = json.loads(response_line)
            print(json.dumps(json_response, indent=4, sort_keys=True))

#### Test Run

> * Get any prior rules using the function get_rules.
> * Delete these prior rules using delete_all_rules. Rules persist on the stream until they are deleted, so older rules would otherwise stay active alongside the new ones. For example, if your initial rule matched tweets with the text "weather" and you later want tweets with the text "apple" or "covid19", you need to delete the "weather" rule before setting the new rules; otherwise the stream will match "weather", "apple", and "covid19".
> * The new rules specified by sample_rules are set using set_rules.
> * Tweets matching the rules are then streamed using get_stream.
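
Since the stream keeps running until it is interrupted, it can be handy to stop after a fixed number of tweets while testing. The sketch below shows the parsing logic on sample lines rather than a live connection. `take_tweets` is a hypothetical helper, and the sample lines are made-up stand-ins for what `response.iter_lines()` would yield, including an empty keep-alive line.

```python
import json

def take_tweets(lines, limit):
    """Parse up to `limit` JSON objects from raw lines, skipping blank keep-alive lines."""
    tweets = []
    for raw in lines:
        if not raw:  # empty lines are keep-alive signals, not tweets
            continue
        tweets.append(json.loads(raw))
        if len(tweets) >= limit:
            break
    return tweets

# Made-up sample lines standing in for a live stream
sample_lines = [
    b'{"data": {"id": "1", "text": "apple"}}',
    b"",
    b'{"data": {"id": "2", "text": "covid19"}}',
    b'{"data": {"id": "3", "text": "apple pie"}}',
]
first_two = take_tweets(sample_lines, limit=2)
print([tweet["data"]["id"] for tweet in first_two])  # ['1', '2']
```

With a live connection, the same function could be given `response.iter_lines()` as its first argument.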

In [None]:
rules = get_rules()
deleted = delete_all_rules(rules)
rule_set = set_rules(deleted)  # avoid shadowing the built-in set
get_stream(rule_set)