## Scraping Tweets using snscrape

In this notebook we scrape Twitter to extract all Tweets mentioning '\\$dot' in the period from 18-08-2020 (23:00 UTC) to 18-02-2021 (23:59 UTC). Searching for the word Polkadot would have returned too much noise, so '\\$dot' is the safer option. '\\$dot' can be read as '#dot': the '\\$' symbol acts as a hashtag for financial markets. The search is not sensitive to lower or upper case variations of the word.

This notebook and the code below are based on the development version of the snscrape package (https://github.com/JustAnotherArchivist/snscrape). See the project's GitHub repository for more information about the development version.

Tweets are scraped by iteratively requesting individual samples until a threshold date is reached. Each request returns a batch of Tweets (including additional data) bounded by a specified date. A new request takes the date of the last Tweet in the previous batch and uses it as the bound for the next batch. Because only a date, not a time, can be specified in the request, consecutive batches partially overlap when the last Tweet of the previous call is not the last one of that day. To overcome this, pandas' drop_duplicates method is used. Each batch is stored as a JSON file, loaded into a DataFrame, and all Tweets belonging to a specific month are then saved to one CSV file.
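
The deduplication step can be illustrated with a small sketch. The two batches below are made-up rows, not real Tweets, and only carry the `id` and `date` columns. They assume that the second request returns the boundary Tweet (id 3) again.

```python
import pandas as pd

# Two hypothetical batches. The Tweet with id 3 appears in both, because the
# second request starts again on the day where the first one stopped.
batch_1 = pd.DataFrame({
    "id": [5, 4, 3],
    "date": pd.to_datetime(
        ["2020-08-20 10:00", "2020-08-19 18:00", "2020-08-19 09:00"], utc=True
    ),
})
batch_2 = pd.DataFrame({
    "id": [3, 2, 1],
    "date": pd.to_datetime(
        ["2020-08-19 09:00", "2020-08-18 23:30", "2020-08-18 23:05"], utc=True
    ),
})

# Stack the batches, then keep only the first occurrence of every Tweet id.
combined = pd.concat([batch_1, batch_2], ignore_index=True)
deduplicated = combined.drop_duplicates(subset="id").reset_index(drop=True)

print(len(combined), len(deduplicated))
```

Deduplicating on `id` rather than on the full row keeps the result stable even if other fields of the same Tweet differ between two requests.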

In [None]:
# Run the pip install command below if you don't already have the library. You might need to restart your program for the package to work.
#!pip3 install git+https://github.com/JustAnotherArchivist/snscrape.git

import os
import pandas as pd

In [2]:
# The following list contains the names of the csv files corresponding to the months of the search period.
# These will later be used to save the data corresponding to that month.
csv_names_list = ['DOT_august.csv', 'DOT_september.csv', 'DOT_october.csv', 'DOT_november.csv', 'DOT_december.csv', 'DOT_january.csv', 'DOT_february.csv']

# The following lists contain the start and end dates
start_dates = ["2020-08-18", "2020-09-01", "2020-10-01", "2020-11-01", "2020-12-01", "2021-01-01", "2021-02-01"]
end_dates = ["2020-09-01", "2020-10-01", "2020-11-01", "2020-12-01", "2021-01-01", "2021-02-01", "2021-02-19"]

# The one day Timedelta object will later be used to move the end date.
ONE_DAY = pd.Timedelta(1, unit = 'd')
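
As an alternative to typing the date lists by hand, the same monthly windows could be derived from the first and last day of the search period. This is only a sketch: the names `first_day`, `last_day_exclusive`, `generated_start_dates` and `generated_end_dates` are illustrative and not part of the original notebook, and the approach assumes that `first_day` is not itself the first day of a month.

```python
import pandas as pd

# Bounds of the search period; the last day acts as an exclusive upper bound.
first_day = pd.Timestamp("2020-08-18")
last_day_exclusive = pd.Timestamp("2021-02-19")

# "MS" (month start) yields the first day of every month inside the period.
month_starts = pd.date_range(first_day, last_day_exclusive, freq="MS")

# Every window runs from one boundary to the next.
boundaries = [first_day, *month_starts, last_day_exclusive]
boundaries = [boundary.strftime("%Y-%m-%d") for boundary in boundaries]

generated_start_dates = boundaries[:-1]
generated_end_dates = boundaries[1:]

for start, end in zip(generated_start_dates, generated_end_dates):
    print(start, end)
```

Generating the windows this way keeps the start and end lists aligned by construction, which is easy to get wrong when both lists are edited manually.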

In [3]:
# The for-loop goes iteratively through the start_dates and end_dates lists.
for x in range(len(start_dates)):
    since_date = start_dates[x]
    until_date = end_dates[x]
    end_date = str(pd.Timestamp.date(pd.Timestamp(since_date) + ONE_DAY))
    count = 0

    tweet_count = 15000
    text_query = "$dot"
    # Use the os library to call the snscrape CLI from Python.
    os.system('snscrape --jsonl --max-results {} --since {} twitter-search "{} until:{}"> text-query-tweets.json'.format(tweet_count, since_date, text_query, until_date))

    tweets_dot = pd.read_json('text-query-tweets.json', lines=True)
    # Take the date of the last Tweet in the batch; selecting the column by
    # name is more robust than relying on its position in the JSON output.
    new_date = tweets_dot['date'].iloc[-1]
    new_date = str(pd.Timestamp.date(new_date))
    # One call of 15,000 Tweets is usually not enough to cover all Tweets within a month,
    # so the following while-loop repeats the previous request until the end_date is reached.
    while count == 0:
        if end_date > new_date:
            count += 1
    
        until_date = new_date

        os.system('snscrape --jsonl --max-results {} --since {} twitter-search "{} until:{}"> text-query-tweets.json'.format(tweet_count, since_date, text_query, until_date))

        # DataFrame.append was removed in pandas 2.0; pd.concat does the same job.
        tweets_dot = pd.concat([tweets_dot, pd.read_json('text-query-tweets.json', lines=True)])

        new_date = tweets_dot['date'].iloc[-1]
        new_date = str(pd.Timestamp.date(new_date))
    # When the while-loop is completed, duplicates are removed from all Tweets of the month
    # and the result is saved to the corresponding csv file.
    tweets_dot = tweets_dot.drop_duplicates(subset = 'id')
    tweets_dot.to_csv(csv_names_list[x], sep=',', index=False)


2020-08-18
2020-08-18
2020-09-01
2020-09-01
2020-10-01
2020-11-01
2020-12-01
2020-12-01
2021-01-17
2021-01-14
2021-01-04
2021-01-01
2021-01-01
2021-02-05
2021-02-01
2021-02-01
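
One caveat about the `os.system` call used above: the query is placed inside double quotes, and a POSIX shell expands `$dot` there as a (probably empty) shell variable, so the search term may never reach snscrape. Whether this happens depends on the shell the notebook runs under. The sketch below avoids the shell by passing the arguments as a list to `subprocess.run`. The helper names `build_snscrape_command` and `run_snscrape` are hypothetical, and the command-line options are the ones already used in the notebook.

```python
import subprocess


def build_snscrape_command(query, since, until, max_results):
    """Return the snscrape call as an argument list (no shell involved)."""
    return [
        "snscrape", "--jsonl",
        "--max-results", str(max_results),
        "--since", since,
        "twitter-search", f"{query} until:{until}",
    ]


def run_snscrape(command, output_path):
    """Run the command and write its standard output to output_path."""
    with open(output_path, "w", encoding="utf-8") as output_file:
        # check=True raises an error if snscrape exits with a failure code.
        subprocess.run(command, stdout=output_file, check=True)


command = build_snscrape_command("$dot", "2020-08-18", "2020-09-01", 15000)
print(command)
```

Calling `run_snscrape(command, 'text-query-tweets.json')` would then replace the `os.system` line. Because no shell interprets the arguments, the `$` in the query is passed through literally.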
