# Twitter Crawler

This notebook retrieves tweets that contain the keywords specified by the user.

- User's credentials must be in a file named 'Credentials.py' placed in the same folder as this notebook (authentication key and secret are needed)
- Keywords must be written in a text file, separated by commas. The path of this file is set with the parameter 'keywords_file' (default is 'keywords.txt')
- Max number of tweets retrieved is 1 million (can be changed by modifying the parameter 'maxTwts')
- All tweet operations are wrapped in a try/except construct, so an API error stops the crawl without losing the tweets already collected
- Parameter 'onlyImg' sets whether the crawler should retrieve only tweets containing images
- Tweets are stored in a CSV file containing a subset of the available fields; its location can be changed with the parameter 'base_path'
- Already seen URLs are stored in the file 'uniques.txt'

## Libraries

In [1]:
import tweepy
import time
import datetime
import pandas as pd
from Code.Credentials import auth_key, auth_secret

## Parameters

In [7]:
maxTwts = 1000000              #Max tweets to be crawled
onlyImg = True                 #Only tweets containing images are retrieved
base_path = "Datasets/"        #Base path where tweets will be saved
keywords_file = 'keywords.txt' #File that contains the keywords (Note: use commas to separate the words)

#Retrieve keywords
with open(keywords_file, 'r') as f:
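    #Strip whitespace/newlines around each keyword and drop empty entries
    keywords = [k.strip() for k in f.read().split(',') if k.strip()]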
    
#Name of the file where tweets will be saved
fName = base_path + 'crawl_' + (datetime.datetime.now().strftime("%Y%m%d%H"))+'.csv'

print("Using keywords: {0}".format(keywords))
print("Tweets will be saved in {0}".format(fName))

Using keywords: ['coronavirus', 'corona', 'virus', 'covid', 'covid19', 'covid-19', 'flu', 'wuhan', 'Coronaviridae', 'N95']
Tweets will be saved in Datasets/crawl_2020121212.csv
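
The keyword file is plain comma-separated text, so stray spaces or a trailing newline can end up inside a keyword. A minimal sketch of a tolerant parser follows; `parse_keywords` is a hypothetical helper name, not part of the notebook.

```python
def parse_keywords(raw):
    """Split a comma-separated string into keywords, trimming whitespace
    and dropping empty entries (hypothetical helper, not in the notebook)."""
    return [word.strip() for word in raw.split(',') if word.strip()]

# A trailing comma and newlines are tolerated
print(parse_keywords("coronavirus, covid19,\nflu,\n"))
# ['coronavirus', 'covid19', 'flu']
```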


## Authentication

Tries to authenticate with the key and secret imported from Credentials.py

In [8]:
#Try authentication
auth = tweepy.AppAuthHandler(auth_key, auth_secret)
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

if(not api):
    print("Can't authenticate")
else:
    print("Authentication successful")

Authentication successful


## Query

In [9]:
#Build the query using the keywords, filtering retweets and tweets without images
separator = ' OR '
searchQuery = separator.join(keywords)
searchQuery += ' -filter:retweets'
if onlyImg:
    searchQuery += ' filter:images' 
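
The same query construction can be wrapped in a function so that it is easy to inspect the resulting string before sending it. This is only a sketch mirroring the cell above; `build_query` and its `only_images` argument are hypothetical names.

```python
def build_query(keywords, only_images=True):
    """Join keywords with OR, exclude retweets, optionally require images.
    Hypothetical helper that mirrors the query cell of this notebook."""
    query = ' OR '.join(keywords)
    query += ' -filter:retweets'
    if only_images:
        query += ' filter:images'
    return query

print(build_query(['covid', 'flu']))
# covid OR flu -filter:retweets filter:images
print(build_query(['covid', 'flu'], only_images=False))
# covid OR flu -filter:retweets
```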

## Preparation

Initializes the final dataset with the expected columns. Loading the already seen URLs from previous crawls is currently commented out.

In [10]:
twts = {'id': [], 'id_str': [],'created_at':[], 'media_url':[],'url':[],'type':[], 'language_code':[],'bounding_box': [], 'country': [], 'country_code':[], 'user_loc': [],'full_text':[]}
 
#with open("Datasets/uniques.txt") as file:
    #urls = set(file.readlines())
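
If the URL loading is re-enabled, note that `readlines()` keeps the trailing newline of each line, so the stored strings would not compare equal to `media_url` values unless they are stripped. A minimal sketch, assuming one URL per line; `load_seen_urls` is a hypothetical helper and the demo uses a temporary file instead of the real 'uniques.txt'.

```python
import os
import tempfile

def load_seen_urls(path):
    """Return the set of URLs stored one per line in path.
    Returns an empty set if the file does not exist yet (hypothetical helper)."""
    if not os.path.exists(path):
        return set()
    with open(path, 'r') as url_file:
        return {line.strip() for line in url_file if line.strip()}

# Demo on a temporary file rather than the real Datasets/uniques.txt
demo_path = os.path.join(tempfile.mkdtemp(), 'uniques.txt')
print(load_seen_urls(demo_path))  # set()
with open(demo_path, 'w') as url_file:
    url_file.write('http://example.com/a.jpg\nhttp://example.com/b.jpg\n')
print(sorted(load_seen_urls(demo_path)))
# ['http://example.com/a.jpg', 'http://example.com/b.jpg']
```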

## Crawling

Retrieves the most recent tweets matching the keywords, paging backwards in time until 'maxTwts' is reached or no more tweets are found.

In [11]:
#Utilities
tweetsPerQry = 100 #Number of tweets retrieved each request (100 is maximum)
tweetCount = 0     #Number of tweets retrieved so far
max_id = -1        #Tweet id to start from (-1 retrieves the latest tweet)
savedTweets = 0    #Number of tweets that are actually saved (in case we're excluding tweets without images)
new_urls = set()

print("Downloading max {0} tweets".format(maxTwts))
with open(fName, 'a') as f:
    while tweetCount < maxTwts:
        try:
            #If max_id is not defined, we start from the most recent tweet
            #Otherwise, we request tweets with id <= max_id - 1 (i.e. older than the last one seen)
            if (max_id <= 0):
                new_tweets = api.search(q=searchQuery, count=tweetsPerQry, tweet_mode='extended')
            else:
                new_tweets = api.search(q=searchQuery, count=tweetsPerQry, max_id=str(max_id - 1), tweet_mode='extended')
                
            #Stop if no more tweets are found
            if not new_tweets:
                print("No more tweets found")
                break

            #Write tweets on file
            for tweet in new_tweets:
                if 'media' not in tweet.entities:
                    continue
                media = tweet.entities['media'][0]
                #if media['media_url'] in urls.union(new_urls):
                    #continue
                #else:
                    #new_urls.add(media['media_url'])
                twts['created_at'].append(tweet.created_at)
                twts['id'].append(tweet.id)
                twts['id_str'].append(tweet.id_str)
                text = tweet.full_text
                for char in [',',';','|', "'", '"','\n','\r']:
                    text = text.replace(char,'')
                #Keep text as str: pandas writes the CSV as UTF-8 by default, while encoding here would store a bytes repr (b'...')
                twts['full_text'].append(text)
                twts['media_url'].append(media['media_url'])
                twts['url'].append(media['url'])
                twts['type'].append(media['type'])
                if tweet.user.location is not None:
                    twts['user_loc'].append(tweet.user.location)
                else:
                    twts['user_loc'].append('-')
                twts['language_code'].append(tweet.metadata['iso_language_code'])
                if tweet.place is not None:
                    twts['country'].append(tweet.place.country)
                    twts['country_code'].append(tweet.place.country_code)
                    if tweet.place.bounding_box is not None:
                        twts['bounding_box'].append(tweet.place.bounding_box.coordinates[0])
                    else:
                        twts['bounding_box'].append('-')
                else:
                    twts['country'].append('-')
                    twts['country_code'].append('-')
                    twts['bounding_box'].append([0])
            #Update length
            tweetCount += len(new_tweets)
            percCompletion = tweetCount * 100/maxTwts
            print("Downloaded {0} tweets (completed: {1:.2f}%)".format(tweetCount, percCompletion))

            #Update max_id with the last retrieved tweet
            max_id = new_tweets[-1].id

        except tweepy.TweepError as e:
            # Just exit if any error
            print("some error : " + str(e))
            break
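        #Checkpoint: rewrite the CSV each time at least 1000 new tweets have been collected
        #(an exact multiple of 1000 is rarely hit, since some tweets are skipped)
        if len(twts['id']) - savedTweets >= 1000:
            df = pd.DataFrame(twts)
            df.to_csv(fName, index=False)
            savedTweets = len(twts['id'])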
            
print("Downloaded {0} tweets, saved them in {1}".format(tweetCount,fName))
df = pd.DataFrame(twts)
df.to_csv(fName, index=False)

Downloading max 1000000 tweets
Downloaded 100 tweets (completed: 0.01%)
Downloaded 200 tweets (completed: 0.02%)
Downloaded 299 tweets (completed: 0.03%)
Downloaded 399 tweets (completed: 0.04%)
Downloaded 499 tweets (completed: 0.05%)
Downloaded 599 tweets (completed: 0.06%)
Downloaded 699 tweets (completed: 0.07%)
Downloaded 799 tweets (completed: 0.08%)
Downloaded 895 tweets (completed: 0.09%)
Downloaded 995 tweets (completed: 0.10%)
Downloaded 1095 tweets (completed: 0.11%)
Downloaded 1192 tweets (completed: 0.12%)
Downloaded 1292 tweets (completed: 0.13%)
Downloaded 1392 tweets (completed: 0.14%)
Downloaded 1492 tweets (completed: 0.15%)
Downloaded 1592 tweets (completed: 0.16%)
Downloaded 1692 tweets (completed: 0.17%)
Downloaded 1792 tweets (completed: 0.18%)
Downloaded 1892 tweets (completed: 0.19%)
Downloaded 1992 tweets (completed: 0.20%)
Downloaded 2092 tweets (completed: 0.21%)
Downloaded 2181 tweets (completed: 0.22%)
Downloaded 2281 tweets (completed: 0.23%)
Downloaded 23

Rate limit reached. Sleeping for: 400
Rate limit reached. Sleeping for: 439
Rate limit reached. Sleeping for: 397
Rate limit reached. Sleeping for: 434
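
The paging logic of the crawl loop can be illustrated without calling the Twitter API. The sketch below assumes a search that returns ids newest-first and accepts an inclusive upper bound `max_id`; `fake_search` is a hypothetical stand-in for the search call, and `crawl` is a simplified version of the loop above.

```python
def fake_search(all_ids, count, max_id=None):
    """Stand-in for a search endpoint: returns up to `count` ids,
    newest first, none larger than max_id (hypothetical, no network)."""
    eligible = [i for i in all_ids if max_id is None or i <= max_id]
    return sorted(eligible, reverse=True)[:count]

def crawl(all_ids, per_query, limit):
    """Page backwards through ids until `limit` is reached or nothing is left."""
    collected = []
    max_id = None  # None means: start from the newest item
    while len(collected) < limit:
        page = fake_search(all_ids, per_query, max_id)
        if not page:
            break  # no older items left
        collected.extend(page)
        # Next request must only return items older than the last one seen
        max_id = page[-1] - 1
    return collected[:limit]

print(crawl(range(1, 11), per_query=4, limit=100))
# [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
```

Subtracting 1 from the last id is what prevents the oldest tweet of one page from being returned again as the first tweet of the next page.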


## Dataset Storing

Updates the uniques.txt file with the new URLs from the crawl. <br>
Saves the CSV file with the important fields from the crawled tweets.

In [8]:
keys = twts.keys()

for k in keys:
    print(len(twts[k]))

60421
60421
60421
60421
60421
60421
60421
60421
60421
60421
60421
60421


In [17]:
df = pd.DataFrame(twts)
df.to_csv(fName, index=False)
print("Saved")

Saved


In [14]:
with open("Datasets/uniques.txt", 'a+') as urlfile:
    for url in new_urls:
        urlfile.write(url+'\n')
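
Because the file is opened in append mode, running the cell twice would write the same URLs twice. A minimal sketch of an append that skips URLs already present in the file; `append_new_urls` is a hypothetical helper and the demo writes to a temporary file rather than 'Datasets/uniques.txt'.

```python
import os
import tempfile

def append_new_urls(path, new_urls):
    """Append to path only the URLs not already stored there.
    Returns the number of URLs written (hypothetical helper)."""
    try:
        with open(path, 'r') as url_file:
            seen = {line.strip() for line in url_file}
    except FileNotFoundError:
        seen = set()
    fresh = sorted(set(new_urls) - seen)
    with open(path, 'a') as url_file:
        for url in fresh:
            url_file.write(url + '\n')
    return len(fresh)

demo_path = os.path.join(tempfile.mkdtemp(), 'uniques.txt')
print(append_new_urls(demo_path, ['http://a', 'http://b']))  # 2
print(append_new_urls(demo_path, ['http://b', 'http://c']))  # 1
```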