# Tweet downloader

This notebook demonstrates how to download tweets from the X API.

To download all the data used in our research, create an account at https://developer.x.com and purchase access to the X API. You will need an account with a pull limit of at least 50,000 tweets (4 Basic accounts for $200/month each or 1 Pro account for $5,000/month).  
Then, run all the code snippets below and follow the instructions.

**Note:** You must provide your own X API credentials and ensure your account has sufficient access (at least 50,000 tweets).  
- Use section **3.1** to download tweets for a single user.  
- Use section **3.2** to automatically download tweets for all users included in the study.

## 1. Libraries

In [84]:
import json
from datetime import datetime, timedelta
import os 
from time import sleep
import sys

sys.path.append("../")

from src.helpers_downloader import authenticate_app_only, fetch_all_tweet_fields, save_tweets_to_json, download_tweets

## 2. Bearer token

In [93]:
# Provide your Bearer Token here for X (Twitter) API access
BEARER_TOKEN = input("Please enter your Bearer Token: ")

if not BEARER_TOKEN:
    print("Error: Bearer Token is required.")
else:
    # Authenticate only when a token was actually provided
    client = authenticate_app_only(BEARER_TOKEN)

Error: Bearer Token is required.
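
As an alternative to typing the token at a prompt, the token can be read from an environment variable so that it never appears in the notebook or its output. The following is a minimal sketch, not part of the original notebook; the helper name `read_bearer_token` and the variable name `X_BEARER_TOKEN` are assumptions chosen for illustration.

```python
import os


def read_bearer_token(env_var: str = "X_BEARER_TOKEN") -> str:
    """Return the Bearer Token stored in an environment variable.

    Both the function name and the default variable name are hypothetical;
    use whatever name you exported the token under.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        # Fail early instead of authenticating with an empty token
        raise ValueError(f"Set the {env_var} environment variable to your Bearer Token.")
    return token
```

With the variable set, the result could be passed straight to the helper imported above, for example `client = authenticate_app_only(read_bearer_token())`.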


## 3. Final downloading loop 

Below is the final implementation built on the Tweepy paginator. The paginator alone does not resolve all issues: when a user posts and retweets a lot, the API does not work correctly and a single request may not cover the whole time range. For this reason we wrap the request in a custom while loop. After each request, the loop reads the date of the last downloaded tweet and requests tweets between `START_TIME` and that date (+ 1 second), repeating until a request returns at most one tweet. This ensures that all tweets from the specified time range are downloaded.
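
The window update described above can be sketched as a small standalone helper. This is an illustration only: the function name `next_end_time` is hypothetical, and it assumes timestamps are UTC ISO 8601 strings ending in `Z`, like the `START_TIME` and `END_TIME` values used in this notebook.

```python
from datetime import datetime, timedelta


def next_end_time(created_at: str) -> str:
    """Return the timestamp 1 second after `created_at`, in the same Z-suffixed format."""
    # fromisoformat() in older Python versions does not accept a trailing "Z",
    # so it is swapped for an explicit UTC offset first
    last_tweet_time = datetime.fromisoformat(created_at.replace("Z", "+00:00"))
    new_time = last_tweet_time + timedelta(seconds=1)
    return new_time.isoformat().replace("+00:00", "Z")


print(next_end_time("2023-12-31T23:59:59Z"))  # prints 2024-01-01T00:00:00Z
```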

### 3.1. Single person downloader

In [None]:
username = "mwojcik_"  # To change: Twitter username
EXCLUDE_RETWEETS = True
EXCLUDE_REPLIES = False

START_TIME = "2023-10-16T00:00:00Z"  # To change: start time
END_TIME = "2024-10-15T23:59:59Z"  # To change: end time

org_START_TIME = START_TIME
org_END_TIME = END_TIME

tweets = []

new_tweets = fetch_all_tweet_fields(
        client=client,
        username=username,
        start_time=START_TIME,
        end_time=END_TIME,
        exclude_retweets=EXCLUDE_RETWEETS,
        exclude_replies=EXCLUDE_REPLIES
    )
tweets.extend(new_tweets)
print(f'Fetched {len(new_tweets)} tweets')

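# Repeat until a request returns at most one tweet
while len(new_tweets) > 1:
    # The last downloaded tweet marks where the previous request stopped
    last_tweet_time = datetime.fromisoformat(tweets[-1]["created_at"].replace("Z", "+00:00"))
    # Move the end of the window to 1 second after that tweet
    new_time = last_tweet_time + timedelta(seconds=1)
    new_time_iso = new_time.isoformat().replace("+00:00", "Z")
    END_TIME = new_time_iso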
    print(f"Fetching tweets from {START_TIME} to {END_TIME}")
    new_tweets = fetch_all_tweet_fields(
        client=client,
        username=username,
        start_time=START_TIME,
        end_time=END_TIME,
        exclude_retweets=EXCLUDE_RETWEETS,
        exclude_replies=EXCLUDE_REPLIES
    )
    print(f'Fetched {len(new_tweets)} tweets')
    tweets.extend(new_tweets)    
    sleep(10)

save_tweets_to_json(
    tweets,
    f"{username}_{org_START_TIME[0:10]}_{org_END_TIME[0:10]}.json",
    folder="../data/01.raw/tweets_before_elections/PiS",  # To change: folder !IMPORTANT
)
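
Because each follow-up request ends 1 second after the last downloaded tweet, the same tweet may be returned by two consecutive requests. The sketch below shows one way to drop such duplicates before saving. It is not part of the original pipeline, the helper name `deduplicate_tweets` is hypothetical, and it assumes every tweet dictionary carries a unique `id` key.

```python
def deduplicate_tweets(tweets):
    """Return the tweets with repeated ids removed, keeping the first occurrence."""
    seen_ids = set()
    unique_tweets = []
    for tweet in tweets:
        tweet_id = tweet["id"]  # assumption: each tweet dict has an "id" key
        if tweet_id in seen_ids:
            continue
        seen_ids.add(tweet_id)
        unique_tweets.append(tweet)
    return unique_tweets
```

It could be applied right before saving, for example `tweets = deduplicate_tweets(tweets)`.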

### 3.2. Research reproducibility: downloading all data used in the study

#### Downloading tweets for all users before the elections

In [102]:
# Read the people to download 
with open('../data/00.init/people_before_elections.json', 'r', encoding='utf-8') as f:
    people_before_elections = json.load(f)

# Check which people have already been downloaded
tweets_before_elections_path = "../data/01.raw/tweets_before_elections"
all_files_before_elections = os.listdir(tweets_before_elections_path)
all_files_dict = {}
for party in all_files_before_elections:
    party_folder = os.path.join(tweets_before_elections_path, party)
    all_files_dict[party] = os.listdir(party_folder)
all_downloaded_usernames = {}
for party, files in all_files_dict.items():
    all_downloaded_usernames[party] = [f.split('_2022')[0] for f in files]

START_TIME = "2022-10-16T00:00:00Z"
END_TIME = "2023-10-15T23:59:59Z"
for party, people in people_before_elections.items():
    for person in people:
        if person in all_downloaded_usernames[party]:
            print(f"Skipping {person} from {party}, already downloaded.")
            continue
        print(f"Downloading tweets for {person} from {party} before elections.")
        download_tweets(person, party, True, START_TIME, END_TIME, client)

Skipping bartlomiejpejo from Konfederacja, already downloaded.
Skipping GrzegorzBraun_ from Konfederacja, already downloaded.
Skipping JkmMikke from Konfederacja, already downloaded.
Skipping KonradBerkowicz from Konfederacja, already downloaded.
Skipping krzysztofbosak from Konfederacja, already downloaded.
Skipping MarSypniewski from Konfederacja, already downloaded.
Skipping MichalWawer from Konfederacja, already downloaded.
Skipping placzekgrzegorz from Konfederacja, already downloaded.
Skipping RJ_Iwaszkiewicz from Konfederacja, already downloaded.
Skipping SlawomirMentzen from Konfederacja, already downloaded.
Skipping TudujKrzysztof from Konfederacja, already downloaded.
Skipping Wlodek_Skalik from Konfederacja, already downloaded.
Skipping WTumanowicz from Konfederacja, already downloaded.
Skipping AndrzejSzejna from NL, already downloaded.
Skipping AnitaKDZG from NL, already downloaded.
Skipping Arek_Iwaniak from NL, already downloaded.
Skipping B_Maciejewska from NL, already 
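
The skip check above recovers usernames by splitting each file name on the year (`'_2022'`). A sketch of a year-independent alternative is shown below. It assumes the files follow the `{username}_{start date}_{end date}.json` naming used in section 3.1; whether `download_tweets` uses the same naming is an assumption, and the helper name `username_from_filename` is hypothetical.

```python
import re

# Assumed file name layout: <username>_<YYYY-MM-DD>_<YYYY-MM-DD>.json
FILENAME_PATTERN = re.compile(
    r"^(?P<username>.+)_\d{4}-\d{2}-\d{2}_\d{4}-\d{2}-\d{2}\.json$"
)


def username_from_filename(filename):
    """Return the username part of a tweet file name, or None if it does not match."""
    match = FILENAME_PATTERN.match(filename)
    return match.group("username") if match else None


print(username_from_filename("mwojcik__2023-10-16_2024-10-15.json"))  # prints mwojcik_
```

Unlike the split on a fixed year, the same helper works for both the before-elections and after-elections folders, and it keeps usernames that themselves end in an underscore intact.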

#### Downloading tweets for all users after the elections

In [101]:
# Read the people to download 
with open('../data/00.init/people_after_elections.json', 'r', encoding='utf-8') as f:
    people_after_elections = json.load(f)

# Check which people have already been downloaded
tweets_after_elections_path = "../data/01.raw/tweets_after_elections"
all_files_after_elections = os.listdir(tweets_after_elections_path)
all_files_dict = {}
for party in all_files_after_elections:
    party_folder = os.path.join(tweets_after_elections_path, party)
    all_files_dict[party] = os.listdir(party_folder)
all_downloaded_usernames = {}
for party, files in all_files_dict.items():
    all_downloaded_usernames[party] = [f.split('_2023')[0] for f in files]

START_TIME = "2023-10-16T00:00:00Z"
END_TIME = "2024-10-15T23:59:59Z"

# Download tweets for all people after elections
for party, people in people_after_elections.items():
    for person in people:
        if person in all_downloaded_usernames[party]:
            print(f"Skipping {person} from {party}, already downloaded.")
            continue
        print(f"Downloading tweets for {person} from {party} after elections.")
        download_tweets(person, party, False, START_TIME, END_TIME, client)
        

Skipping bartlomiejpejo from Konfederacja, already downloaded.
Skipping GrzegorzBraun_ from Konfederacja, already downloaded.
Skipping Iwaszkiewicz_RJ from Konfederacja, already downloaded.
Skipping KonradBerkowicz from Konfederacja, already downloaded.
Skipping MarSypniewski from Konfederacja, already downloaded.
Skipping MichalWawer from Konfederacja, already downloaded.
Skipping placzekgrzegorz from Konfederacja, already downloaded.
Skipping SlawomirMentzen from Konfederacja, already downloaded.
Skipping TudujKrzysztof from Konfederacja, already downloaded.
Skipping Wlodek_Skalik from Konfederacja, already downloaded.
Skipping WTumanowicz from Konfederacja, already downloaded.
Skipping AndrzejSzejna from NL, already downloaded.
Skipping AnitaKDZG from NL, already downloaded.
Skipping DyduchMarek from NL, already downloaded.
Skipping JoankaSW from NL, already downloaded.
Skipping KGawkowski from NL, already downloaded.
Skipping K_Smiszek from NL, already downloaded.
Skipping MarcinKu