# Twitter Crawler

The first step is to create an application on [Twitter Apps](https://apps.twitter.com/): select the **Create New App** button and follow the instructions to the end.

Then obtain the following keys/tokens for authentication:

* consumer_key
* consumer_secret
* access_token
* access_token_secret

**Note:** Generating Twitter API keys can take anywhere from minutes to weeks.

# **Tweepy**

> Tweepy is one of the best packages for working with the Twitter APIs. [More](https://www.tweepy.org/)

In [1]:
## Import Required Modules

import os
import json
import tweepy


## Environment Setup and Authentication

- Set your Twitter consumer_key, consumer_secret, access_token, and access_token_secret as environment variables.
- For where to find these values, see [TwitterEnvironment](https://developer.twitter.com/en/docs/apps/overview).
- Environment variables keep the credentials out of the notebook itself. Create them in your terminal:
```console
export consumer_key='xxxx'
export consumer_secret='xxxx'
export access_token='xxxx'
export access_token_secret='xxxx'
```
- After authenticating with these credentials, you will be able to access the Twitter API interface.
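
If one of the four variables is unset, `os.environ.get` returns `None` and the problem only shows up later as an authentication error. A minimal sketch of a fail-fast check, assuming the variable names used above; `load_credentials` and `REQUIRED_KEYS` are hypothetical names, not part of Tweepy:

```python
import os

# Assumed variable names, matching the export commands above.
REQUIRED_KEYS = ("consumer_key", "consumer_secret",
                 "access_token", "access_token_secret")


def load_credentials(env=None):
    """Return the four credentials as a dict, or raise if any is missing."""
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {key: env[key] for key in REQUIRED_KEYS}
```

Calling `load_credentials()` before building the auth handler reports every missing name at once instead of failing on the first API call.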

In [2]:
consumer_key = os.environ.get('consumer_key')
consumer_secret = os.environ.get('consumer_secret')
access_token = os.environ.get('access_token')
access_token_secret = os.environ.get('access_token_secret')

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

**Getting user’s Tweets**
>Main parameters:
> * user_id / screen_name – Specifies the user by ID or by screen name (the example below uses `screen_name`).
> * count – Maximum number of the user's most recent tweets to return.
> * [More Details](https://tweepy.readthedocs.io/en/latest/api.html#API.user_timeline)

In [3]:
!pip install columnar



In [4]:
import json
import tweepy
from columnar import columnar

username = 'boredbengio'
count = 5

# Only iterate through the first n statuses
tweets = tweepy.Cursor(api.user_timeline,
                       screen_name=username).items(count)

# Pulling information from tweets iterable object
tweets_list = [[tweet.id, tweet.created_at, tweet.text] for tweet in tweets]

#print tweets
headers = ['id', 'created_at','text']
table = columnar(tweets_list, headers, no_borders=True)
print(table)

# what are the current attributes/tags in a tweet?
# https://jsoneditoronline.org/
tweet = api.get_status('1420646753863225349')
print(json.dumps(tweet._json))


        
  ID                   CREATED_AT                 TEXT                            
    
  1520417612043325448  2022-04-30 14:59:41+00:00  RT @boredyannlecun: If (i) all  
                                                   the world's a ConvNet; &amp;   
                                                  (ii) all worlds spawn a         
                                                  @ylecun, who invents ConvNets   
                                                  → (iii) the probability that…   
  1517141321927954432  2022-04-21 14:00:52+00:00  My research program so big, Go  
                                                  d had to invent Bengio paralle  
                                                  lism, forking Gogeta Bengio in  
                                                  to Yoshua &amp; Samy to accel…  
                                                   https://t.co/Tinjmxs1Kg        
  1516969990901092353  2022-04-21 02:40:04+00:00  My brain so big, I wrot
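
The `&amp;` in the output above is an HTML entity: the tweet text appears to come back with characters such as `&` escaped. A small sketch of decoding it with the standard library; `clean_text` is a hypothetical helper name:

```python
import html


def clean_text(text):
    """Decode HTML entities such as &amp; in a tweet's text."""
    return html.unescape(text)


print(clean_text("forking Gogeta Bengio into Yoshua &amp; Samy"))
# prints: forking Gogeta Bengio into Yoshua & Samy
```

This could be applied to `tweet.text` while building `tweets_list`, before the table is printed.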

**Pagination**
>Main parameters:
> * count – Maximum number of pages.
> * [More Details](https://docs.tweepy.org/en/stable/pagination.html)
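
`Cursor` hides the bookkeeping needed to move from one page to the next. As a rough illustration of ID-based paging (a sketch of the general idea, not Tweepy's actual implementation), the loop below asks a data source for items older than the last one seen. `iter_pages`, `fake_fetch`, and `FAKE_TIMELINE` are hypothetical stand-ins so the example runs without network access:

```python
def iter_pages(fetch_page, max_pages):
    """Yield up to max_pages pages, each older than the previous one.

    fetch_page(max_id) is a hypothetical callable that returns a list of
    dicts with an integer "id", newest first, none newer than max_id.
    """
    max_id = None
    for _ in range(max_pages):
        page = fetch_page(max_id)
        if not page:
            break  # no older items left
        yield page
        # Next request: only items strictly older than the last one seen.
        max_id = page[-1]["id"] - 1


# Stand-in timeline with IDs 10 down to 1, newest first.
FAKE_TIMELINE = [{"id": i} for i in range(10, 0, -1)]


def fake_fetch(max_id, page_size=4):
    older = [t for t in FAKE_TIMELINE if max_id is None or t["id"] <= max_id]
    return older[:page_size]


for page in iter_pages(fake_fetch, max_pages=2):
    print([t["id"] for t in page])
# prints [10, 9, 8, 7] then [6, 5, 4, 3]
```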

In [5]:
# Pagination: iterate through the timeline one page at a time
count = 1  # maximum number of pages to fetch
for page in tweepy.Cursor(api.user_timeline, screen_name=username).pages(count):
    # Each page is a list of Status objects
    for status in page:
        print(status.id, status.text[:30])
    # To work with the raw JSON of each tweet instead:
    # searched_tweets = [status._json for status in page]
    # json_strings = [json.dumps(json_obj) for json_obj in searched_tweets]
    # print(json_strings[0])
    


1520417612043325448 RT @boredyannlecun: If (i) all
1517141321927954432 My research program so big, Go
1516969990901092353 My brain so big, I wrote "The 
1516968578016329729 RT @boredyannlecun: My influen
1516967191601725440 My lab so big, you need Hadoop
1516965673196572672 RT @boredyannlecun: My deep ne
1425265524238389254 RT @boredyannlecun: WTF, I was
1420646753863225349 There is a lot of talk about n
1338760483097235456 RT @boredyannlecun: Damn, @pmd
1338267929587163139 RT @boredyannlecun: What did y
1338266157573419008 RT @boredyannlecun: In light o
1334782203285430273 Seems @GoogleAI had the vanish
1324772330913046530 RT @boredyannlecun: Trump has 
1324481304797290497 ICLoseR (pronounced "I see los
1322996993120231426 What if instead of minimizing 
1313922130422136832 RT @boredyannlecun: Old man Ge
1313280370649989121 RT @BasicScienceSav: Thank you
1312595632851496960 RT @Graham__Duncan: Best "epic
1312595482359857153 RT @Jarmosan: Mr. Bengio, you’
1312495310837481472 Look at thi

**Getting user's followers**
>Main parameters:
> * user_id – Specifies the ID of the user.

In [6]:
user_id='14861663'
count = 5

followers = tweepy.Cursor(api.get_follower_ids,
                          user_id=user_id).items(count)

user_list = [[user] for user in followers]

headers = ['user_id']
table = columnar(user_list, headers, no_borders=True)
print(table) 

    
  USER_ID              
    
  1480480689648816130  
  1477247508539600897  
  1270168772809306118  
  2815033546           
  1521526199142518787  
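
The follower call returns bare IDs. Turning them into profiles means looking them up in batches. Below is a sketch of the batching step only, assuming a limit of 100 IDs per lookup request (the figure I recall for the v1.1 `users/lookup` endpoint; check the current limits). `chunked` is a hypothetical helper and the IDs are placeholders:

```python
def chunked(items, size=100):
    """Split a list into consecutive batches of at most `size` items."""
    if size < 1:
        raise ValueError("size must be positive")
    return [items[i:i + size] for i in range(0, len(items), size)]


follower_ids = list(range(250))  # placeholder IDs for illustration
batches = chunked(follower_ids)
print([len(batch) for batch in batches])
# prints [100, 100, 50]

# Each batch could then be passed to a lookup call, for example
# api.lookup_users(user_id=batch) in Tweepy v4 (not run here).
```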



**Getting user's followees**
>Main parameters:
> * user_id – Specifies the ID of the user.
> * [More Details](http://docs.tweepy.org/en/v3.5.0/api.html#API.friends_ids)

In [7]:
user_id='14861663'
count = 5
    
friends = tweepy.Cursor(api.get_friends,
                        user_id=user_id).items(count)
    
# Pull the fields of interest from the iterable of User objects
user_list = [[user.id, user.screen_name, user.created_at] for user in friends]

# Print users
headers = ['user_id', 'screen_name', 'created_at']
table = columnar(user_list, headers, no_borders=True)
print(table)   

        
  USER_ID               SCREEN_NAME     CREATED_AT                 
    
  1054463094624321541  RiverHawkVideo   2018-10-22 20:02:47+00:00  
  527688814            cat_khalil       2012-03-17 17:57:43+00:00  
  768079734601310208   eforall_mvalley  2016-08-23 13:37:33+00:00  
  794429491            KSubbaswamy      2012-08-31 18:35:49+00:00  
  1063439284152213506  TogetherallNA    2018-11-16 14:30:57+00:00  




**Getting tweet with specific id**
> Helpful when you only have tweet IDs and would like to get the corresponding attributes, such as the text.


In [8]:
import json 

tweet_id='1255894886051713030'

tweet = api.get_status(tweet_id)

tweet_list = [tweet.text, tweet.favorite_count, tweet.retweet_count]
print(tweet_list)

json_tweet = json.dumps(tweet._json)

print(json_tweet)

['Al Pacino Fan Site: Al Pacino The Latest Huge Name For Tarantino’s ‘Once Upon A Time In Hollywood’ https://t.co/ldyHRX2kuH', 7, 1]
{"created_at": "Thu Apr 30 16:20:48 +0000 2020", "id": 1255894886051713030, "id_str": "1255894886051713030", "text": "Al Pacino Fan Site: Al Pacino The Latest Huge Name For Tarantino\u2019s \u2018Once Upon A Time In Hollywood\u2019 https://t.co/ldyHRX2kuH", "truncated": false, "entities": {"hashtags": [], "symbols": [], "user_mentions": [], "urls": [{"url": "https://t.co/ldyHRX2kuH", "expanded_url": "https://alpacino.life/al-pacino-the-latest-huge-name-for-tarantinos-once-upon-a-time-in-hollywood-2.html", "display_url": "alpacino.life/al-pacino-the-\u2026", "indices": [99, 122]}]}, "source": "<a href=\"http://alpacino.info\" rel=\"nofollow\">AlPacino.info</a>", "in_reply_to_status_id": null, "in_reply_to_status_id_str": null, "in_reply_to_user_id": null, "in_reply_to_user_id_str": null, "in_reply_to_screen_name": null, "user": {"id": 233717042, "id_str": 
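
The raw JSON above carries more than the three fields printed from the `Status` object. For instance, the shortened `t.co` link in the text has its full target under `entities.urls`. A minimal sketch of pulling that out of the dict; `expanded_urls` is a hypothetical helper, and `sample` is a trimmed copy of the fields visible in the output:

```python
def expanded_urls(tweet_json):
    """Return the expanded URLs listed under entities.urls, if any."""
    urls = tweet_json.get("entities", {}).get("urls", [])
    return [u["expanded_url"] for u in urls if "expanded_url" in u]


# Trimmed copy of the fields shown in the output above.
sample = {
    "id": 1255894886051713030,
    "entities": {
        "urls": [
            {
                "url": "https://t.co/ldyHRX2kuH",
                "expanded_url": "https://alpacino.life/al-pacino-the-latest-huge-name-for-tarantinos-once-upon-a-time-in-hollywood-2.html",
            }
        ]
    },
}

print(expanded_urls(sample))
```

With a live object, the same function could be called as `expanded_urls(tweet._json)`.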


**Twitter Search**
 > To search Twitter for recent tweets, we define the search terms and, optionally, a start date for the search. [More Details](http://docs.tweepy.org/en/latest/api.html#API.search)<br>
 > - For creating complex queries, see [Building standard queries](https://developer.twitter.com/en/docs/twitter-api/v1/rules-and-filtering/overview/standard-operators)
 > - The standard search API only covers roughly the past week of tweets, so you cannot dig far into the history.
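
Query strings like the one used below can be assembled from parts. Here is a minimal sketch, assuming only two of the standard operators (exact phrases in double quotes and `-filter:retweets`); `build_query` is a hypothetical helper, not a Tweepy function:

```python
def build_query(terms, exclude_retweets=True):
    """Join search terms into a standard-search query string.

    Terms are combined with spaces, which the search treats as AND.
    Multi-word terms are quoted so they match as exact phrases.
    """
    parts = [f'"{term}"' if " " in term else term for term in terms]
    query = " ".join(parts)
    if exclude_retweets:
        query += " -filter:retweets"
    return query


print(build_query(["#disneyland"]))
# prints: #disneyland -filter:retweets
```

The result can be passed as the `q` argument of the search call.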

In [9]:
# Define the search term (retweets are filtered out; no start date is set here)

search_words = "#disneyland -filter:retweets"

# Collect tweets
tweets = tweepy.Cursor(api.search_tweets,
                       q=search_words,
                       lang="en").items(5)

# Pull the fields of interest from the iterable of tweets
tweets_list = [[tweet.id, tweet.created_at, tweet.text] for tweet in tweets]

# Print tweets
headers = ['id', 'created_at', 'text']
table = columnar(tweets_list, headers, no_borders=True)
print(table)


        
  ID                    CREATED_AT                TEXT                            
    
  1521906959829590017  2022-05-04 17:37:49+00:00  It's (temporarily) possible to  
                                                   make a reservation again with  
                                                   a believe key for Disneyland   
                                                  Park on 06/04/2022.… https://t  
                                                  .co/0rNn3ZqCpy                  
  1521906957405552640  2022-05-04 17:37:48+00:00  It's no longer possible to mak  
                                                  e a reservation with a imagine  
                                                   key for Disney California Adv  
                                                  enture Park on 05/19/2022… htt  
                                                  ps://t.co/3MTX9REgoH            
  1521906955547197442  2022-05-04 17:37:48+00:00  It's (temporarily) poss

**Twitter Streaming API**
> The Twitter streaming API is used to download Twitter messages in real time. In Tweepy 3.x, an instance of `tweepy.Stream` establishes a streaming session and routes messages to a `StreamListener` instance. The `on_data` method of a stream listener receives all messages and calls functions according to the message type.<br>
> Using the streaming API has three steps:
> - Create a class inheriting from `StreamListener`.
> - Using that class, create a `Stream` object.
> - Connect to the Twitter API using the `Stream`.
>
> [More Details](https://docs.tweepy.org/en/v3.5.0/streaming_how_to.html)

The cells below do not use Tweepy. They call the Twitter API v2 streaming endpoints directly with `requests` and a bearer token.

*What kinds of filters can be used?*: [see here](https://developer.twitter.com/en/docs/twitter-api/v1/tweets/filter-realtime/api-reference/post-statuses-filter)

*What are the error codes and how to handle them?*: [see here](https://developer.twitter.com/en/docs/twitter-api/v1/tweets/filter-realtime/guides/streaming-message-types)
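
A common way to handle a dropped or rejected stream connection is to wait before reconnecting and to lengthen the wait after each failure. Below is a minimal sketch of such a delay schedule. The default values are illustrative assumptions, not Twitter's recommended figures, and `backoff_delays` is a hypothetical helper:

```python
def backoff_delays(initial=1.0, factor=2.0, maximum=60.0, attempts=6):
    """Yield reconnect delays in seconds that grow geometrically up to a cap."""
    delay = initial
    for _ in range(attempts):
        yield min(delay, maximum)
        delay *= factor


print(list(backoff_delays()))
# prints [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
```

A reconnect loop would sleep for each yielded value in turn, and start the schedule over after a successful connection.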

In [10]:
import requests
import os
import json

In [11]:
def create_url():
    return "https://api.twitter.com/2/tweets/sample/stream"

In [12]:
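def connect_to_endpoint(url):
    # bearer_oauth is defined in a later cell; run that cell before calling this function
    response = requests.request("GET", url, auth=bearer_oauth, stream=True)
    print(response.status_code)
    # Check the status before reading the body, so an error response
    # is reported instead of being parsed as stream data
    if response.status_code != 200:
        raise Exception(
            "Request returned an error: {} {}".format(
                response.status_code, response.text
            )
        )
    for response_line in response.iter_lines():
        if response_line:  # skip empty lines
            json_response = json.loads(response_line)
            print(json.dumps(json_response, indent=4, sort_keys=True))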
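
The `if response_line:` check matters because a long-lived stream can deliver empty lines between tweets. The parsing step can be separated from the network code so it is easy to try on canned input. This is a minimal sketch: `parse_stream_lines` is a hypothetical helper, and silently skipping malformed lines is an assumption about the desired behaviour:

```python
import json


def parse_stream_lines(lines):
    """Yield decoded JSON objects, skipping blank and malformed lines."""
    for raw in lines:
        if not raw:
            continue  # empty line, no payload
        try:
            yield json.loads(raw)
        except json.JSONDecodeError:
            continue  # ignore fragments that are not valid JSON


canned = [b'{"data": {"id": "1", "text": "hello"}}', b"", b"not json"]
for message in parse_stream_lines(canned):
    print(message["data"]["text"])
# prints: hello
```

With a live connection, the same generator could be fed `response.iter_lines()`.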

In [13]:
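import requests

# The v2 endpoints here authenticate with a bearer token,
# read from the BEARER_TOKEN environment variable
bearer_token = os.environ.get("BEARER_TOKEN")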

def bearer_oauth(r):
    """
    Method required by bearer token authentication.
    """

    r.headers["Authorization"] = f"Bearer {bearer_token}"
    r.headers["User-Agent"] = "v2FilteredStreamPython"
    return r

In [23]:
def get_rules():
    response = requests.get(
        "https://api.twitter.com/2/tweets/search/stream/rules", auth=bearer_oauth
    )
    if response.status_code != 200:
        raise Exception(
            "Cannot get rules (HTTP {}): {}".format(response.status_code, response.text)
        )
    print(json.dumps(response.json()))
    return response.json()

def delete_all_rules(rules):
    if rules is None or "data" not in rules:
        return None

    ids = list(map(lambda rule: rule["id"], rules["data"]))
    payload = {"delete": {"ids": ids}}
    response = requests.post(
        "https://api.twitter.com/2/tweets/search/stream/rules",
        auth=bearer_oauth,
        json=payload
    )
    if response.status_code != 200:
        raise Exception(
            "Cannot delete rules (HTTP {}): {}".format(
                response.status_code, response.text
            )
        )
    print(json.dumps(response.json()))

    
def set_rules(delete):
    # You can adjust the rules if needed
    sample_rules = [
        {"value": "apple"},
        {"value": "covid19"},
    ]
    payload = {"add": sample_rules}
    response = requests.post(
        "https://api.twitter.com/2/tweets/search/stream/rules",
        auth=bearer_oauth,
        json=payload,
    )
    if response.status_code != 201:
        raise Exception(
            "Cannot add rules (HTTP {}): {}".format(response.status_code, response.text)
        )
    print(json.dumps(response.json()))

def get_stream(set):
    response = requests.get(
        "https://api.twitter.com/2/tweets/search/stream", auth=bearer_oauth, stream=True,
    )
    print(response.status_code)
    if response.status_code != 200:
        raise Exception(
            "Cannot get stream (HTTP {}): {}".format(
                response.status_code, response.text
            )
        )
    for response_line in response.iter_lines():
        if response_line:
            json_response = json.loads(response_line)
            print(json.dumps(json_response, indent=4, sort_keys=True))

In [24]:
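rules = get_rules()                # fetch the rules currently registered
deleted = delete_all_rules(rules)  # clear them (returns None)
added = set_rules(deleted)         # register the sample rules (the argument is unused)
get_stream(added)                  # stream matching tweets until interrupted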

{"data": [{"id": "1521911941609885697", "value": "keyword apple"}, {"id": "1521911941609885698", "value": "keyword covid19"}], "meta": {"sent": "2022-05-04T18:00:08.055Z", "result_count": 2}}
{"meta": {"sent": "2022-05-04T18:00:08.499Z", "summary": {"deleted": 2, "not_deleted": 0}}}
{"data": [{"value": "covid19", "id": "1521912577487294466"}, {"value": "apple", "id": "1521912577487294465"}], "meta": {"sent": "2022-05-04T18:00:08.947Z", "summary": {"created": 2, "not_created": 0, "valid": 2, "invalid": 0}}}
200
{
    "data": {
        "id": "1521912572106223616",
        "text": "13\" MacBook Pro Price Tracker, real-time prices #macbook #apple #macbookpro : https://t.co/HJbPYB6B16"
    },
    "matching_rules": [
        {
            "id": "1521912577487294465",
            "tag": ""
        }
    ]
}
{
    "data": {
        "id": "1521912571305025536",
        "text": "RT @peaceandteachin: Thank Susan Collins!  \nHow dare they say a women\u2019s health care should be decided by their r

{
    "data": {
        "id": "1521912583032393729",
        "text": "RT @jkjskjkskjkj: || 5500\ufdfc || \u0627\u064a\u0641\u0648\u0646 13 Apple iPhone\ud83d\udcf1 \u0644\u0645\u062a\u0627\u0628\u0639\u064a \u0641\u0642\u0637 !\n\n\u2022 \u0644\u0645\u062a\u0627\u0628\u0639\u064a \u0641\u0642\u0637 \u0631\u062a\u0648\u064a\u062a \u0648\u0627\u0643\u062a\u0628 \u062a\u0645 \ud83d\udc47 \n\n\u0627\u0644\u0633\u062d\u0628 \u0645\u0646 \u0627\u0644\u0631\u062a\u0648\u064a\u062a \u0648\u0627\u0644\u0645\u062a\u0627\u0628\u0639\u0647 \u0645\u0648\u062b\u0642 \n\u0628\u0627\u0644\u062a\u0648\u2026"
    },
    "matching_rules": [
        {
            "id": "1521912577487294465",
            "tag": ""
        }
    ]
}
{
    "data": {
        "id": "1521912582038298624",
        "text": "Top 100 - Peru CLEAN \ud83c\uddf5\ud83c\uddea:\n\n\ud83e\udd47 KAROL G (5 days)\n\ud83e\udd48 Harry Styles\n\ud83e\udd49 Becky G., KAROL G\n4\ufe0f\u20e3 Bizarrap, Tiago pzk\n5\ufe0f\u20e3 Anitta\n6\ufe0f\u20e

{
    "data": {
        "id": "1521912591613935618",
        "text": "What are your thoughts on Olaplex's #omnichannel strategy?\n\nIn a world with wavering consumer trust, CEO JuE Wong has built a trust-first, synergistic path forward.\n\nLearn more on the latest episode of the Conversations with CommerceNext #podcast: https://t.co/UqwIEChzSi https://t.co/bJfJFjOtgw"
    },
    "matching_rules": [
        {
            "id": "1521912577487294465",
            "tag": ""
        }
    ]
}
{
    "data": {
        "id": "1521912597297213440",
        "text": "@Arto1_ \u062a\u0645"
    },
    "matching_rules": [
        {
            "id": "1521912577487294465",
            "tag": ""
        }
    ]
}
{
    "data": {
        "id": "1521912594277310465",
        "text": "RT @Shazam: Tell a friend to Stream and Shazam @psy_oppa's new song #ThatThat, produced by #SUGA of @BTS_twt: https://t.co/HlviTGSZ8L \ud83d\udc99 htt\u2026"
    },
    "matching_rules": [
        {
            "id": "15219

{
    "data": {
        "id": "1521912612480507905",
        "text": "\ud83d\udcdb\ud83d\udcdb\ud83d\udcdb https://t.co/Uafe84cMjt"
    },
    "matching_rules": [
        {
            "id": "1521912577487294465",
            "tag": ""
        }
    ]
}
{
    "data": {
        "id": "1521912611180367872",
        "text": "@mmm772973786 @sunbeau \u0644\u0627 \u0627\u0644\u0647 \u0622\u0644\u0627 \u0627\u0644\u0644\u0647"
    },
    "matching_rules": [
        {
            "id": "1521912577487294465",
            "tag": ""
        }
    ]
}
{
    "data": {
        "id": "1521912617312337923",
        "text": "@dremyers_ Sameee look at my wrist and Tuesday still go"
    },
    "matching_rules": [
        {
            "id": "1521912577487294465",
            "tag": ""
        }
    ]
}
{
    "data": {
        "id": "1521912618084098050",
        "text": "The Big Five Tech Companies (Apple, Amazon, Alphabet, Microsoft and Meta) Earned Over $1.4 Trillion Last Year, Here\u2019s Where That M

KeyboardInterrupt: 
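
Every `matching_rules` entry above has an empty `tag` because the sample rules were added with only a `value`. The v2 rules endpoint also accepts an optional `tag` per rule, which makes it easier to tell which rule a tweet matched. A minimal sketch of building such a payload; `build_add_payload` is a hypothetical helper and the tag names are made up for illustration:

```python
import json


def build_add_payload(rules):
    """Build the JSON body for adding rules, given (value, tag) pairs."""
    payload_rules = []
    for value, tag in rules:
        rule = {"value": value}
        if tag:
            rule["tag"] = tag  # omit the key entirely when no tag is given
        payload_rules.append(rule)
    return {"add": payload_rules}


payload = build_add_payload([("apple", "tech"), ("covid19", "health")])
print(json.dumps(payload))
```

The resulting dict has the same shape as the `payload` built inside `set_rules`, with a `tag` added to each rule.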