<a href="https://colab.research.google.com/github/Ningensei848/SATwi/blob/main/notebook/bulk_get_tweets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Retrieve tweets/mentions/likes in bulk

## What is it?

In [Ningensei848/SATwi](https://github.com/Ningensei848/SATwi), `dailyUpdate.py` collects data incrementally on a regular schedule. 
Sometimes, however, you want to retrieve a large amount of data in one go.

In this Notebook, the `start_time` constraint imposed by `dailyUpdate.py` has been removed so that tweets can be retrieved up to the upper limit specified for each endpoint in [Twitter API v2](https://developer.twitter.com/en/docs/twitter-api).

- tweet ... up to 3,200 tweets
- mention ... up to 800 tweets
- like ... up to 7,500 tweets

The limit for `like`, however, does not come from the `start_time` constraint; it is derived from [the rate limit](https://developer.twitter.com/en/docs/twitter-api/rate-limits) of 75 req / 15 min (75 requests of 100 tweets each). 
If the rate limit is reached during execution, an error message is displayed and no more data can be retrieved.
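
The figure of 7,500 for likes can be checked with simple arithmetic, assuming every request returns a full page of 100 items (the `max_results` value this notebook sends). The helper names below are illustrative and are not part of SATwi:

```python
import math

REQUESTS_PER_WINDOW = 75  # rate limit quoted above (per 15 minutes)
PAGE_SIZE = 100           # `max_results` value this notebook sends per request


def max_items_per_window(requests_per_window=REQUESTS_PER_WINDOW, page_size=PAGE_SIZE):
    """Largest number of items one rate-limit window can return."""
    return requests_per_window * page_size


def requests_needed(item_count, page_size=PAGE_SIZE):
    """Number of paginated requests needed to fetch `item_count` items."""
    return math.ceil(item_count / page_size)


print(max_items_per_window())  # 7500
print(requests_needed(3200))   # 32
```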

A safer approach is to collect likes one account at a time by setting `UNIQUE_TARGET_ID`, rather than processing every account listed in `targetList.txt` in a single run.

Even so, no more than 7,500 likes can be collected for an account, even if more exist (to be addressed in the future).
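
One possible direction for lifting this cap, which this notebook does not implement, is to wait until the rate-limit window resets and then continue from the last pagination token. The sketch below only computes the wait time. It assumes the response carries an `x-rate-limit-reset` header holding a Unix timestamp and that the headers are passed as a plain dict with lowercase keys. `seconds_until_reset` is a hypothetical helper name.

```python
import time


def seconds_until_reset(headers, now=None):
    """Return how many seconds remain until the rate-limit window resets.

    `headers` is a mapping of response headers; a missing header is
    treated as "no wait needed".
    """
    now = time.time() if now is None else now
    reset_at = int(headers.get("x-rate-limit-reset", 0))
    return max(0.0, reset_at - now)


# Example with a fixed clock so the result is reproducible.
print(seconds_until_reset({"x-rate-limit-reset": "1000"}, now=940))
```

A caller could sleep for the returned duration and then resume with the saved `next_token`; whether a token stays valid across windows is an assumption that would need checking.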

## Usage

Simply enter the various variables required for execution in the "Define authentication information" code cell.

- `GITHUB_USERNAME`: your GitHub username
- `GITHUB_EMAIL`: your GitHub email address
- `REPOSITORY_NAME`: the name of your own private repository created by importing [SATwi](https://github.com/Ningensei848/SATwi) 
- `GITHUB_TOKEN`: a Personal Access Token that grants at least `repo` privileges
- `BEARER_TOKEN`: the Bearer Token of your Twitter app

To choose which types of data to collect, check the corresponding checkboxes
(unchecked types are not retrieved).

- `ENABLE_TWEETS`: if checked, collect tweets
- `ENABLE_MENTION`: if checked, collect mentions
- `ENABLE_LIKED_TWEETS`: if checked, collect likes
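
As a rough illustration of how these checkboxes and the per-endpoint limits fit together, the sketch below turns the flags into a list of jobs and clamps each requested count to the limit quoted in this notebook. `build_jobs` and `ENDPOINT_CAPS` are hypothetical names; the notebook itself simply checks each flag in turn.

```python
# Upper limits quoted in this notebook for each endpoint.
ENDPOINT_CAPS = {"tweets": 3200, "mentions": 800, "liked_tweets": 7500}


def build_jobs(enable_tweets, enable_mention, enable_liked_tweets,
               max_tweets=100, max_mentions=100, max_likes=100):
    """Return (endpoint, item_count) pairs for every enabled data type."""
    requested = [
        ("tweets", enable_tweets, max_tweets),
        ("mentions", enable_mention, max_mentions),
        ("liked_tweets", enable_liked_tweets, max_likes),
    ]
    return [
        # Clamp the requested count into the range [0, cap].
        (endpoint, max(0, min(count, ENDPOINT_CAPS[endpoint])))
        for endpoint, enabled, count in requested
        if enabled
    ]


print(build_jobs(True, False, True, max_tweets=5000, max_likes=250))
# [('tweets', 3200), ('liked_tweets', 250)]
```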

## Tips

By default, data is collected for the IDs listed in `targetList.txt` in your own private repository.
However, if you specify an ID in `UNIQUE_TARGET_ID`, data is collected only for that account.
This is useful when you want to avoid hitting [the restriction of getting "Likes"](https://developer.twitter.com/en/docs/twitter-api/tweets/likes/api-reference/get-users-id-liked_tweets).
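
The selection rule can be sketched as follows, assuming `targetList.txt` holds one numeric user ID at the start of each line and that lines not starting with digits are skipped. `parse_target_ids` and `select_targets` are illustrative names, not functions from SATwi.

```python
import re

USER_ID_PATTERN = re.compile(r"\d+")


def parse_target_ids(text):
    """Extract the leading numeric ID from each line of a target list."""
    ids = []
    for line in text.splitlines():
        match = USER_ID_PATTERN.match(line)
        if match is not None:
            ids.append(int(match[0]))
    return ids


def select_targets(text, unique_target_id=0):
    """Prefer the single explicit ID; otherwise fall back to the list."""
    if unique_target_id:
        return [unique_target_id]
    return parse_target_ids(text)


sample = "123 first account\n\nnot-an-id\n456"
print(select_targets(sample))                       # [123, 456]
print(select_targets(sample, unique_target_id=789))  # [789]
```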


## !! CAUTION !!

This notebook will necessarily contain authentication information.

Once you have entered the variables and run the notebook, it is best not to share it with third parties. If you wish to redistribute it to others, **you must delete the authentication information before doing so**.

(The creator cannot be held responsible.)
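
As a minimal sketch of what deleting the authentication information could look like, the function below blanks the string assigned to each secret variable in a cell's source text. It assumes the secrets are written as double-quoted literals on a single line, as in the form cell of this notebook. `redact_secrets` is a hypothetical helper, and it does not touch cell outputs or Git history.

```python
import re

SECRET_NAMES = ("GITHUB_TOKEN", "BEARER_TOKEN")


def redact_secrets(source, names=SECRET_NAMES):
    """Replace the quoted value of each named variable with an empty string."""
    alternatives = "|".join(re.escape(name) for name in names)
    pattern = re.compile(
        r'^(\s*(?:' + alternatives + r')\s*=\s*)"[^"]*"',
        re.MULTILINE,
    )
    # Keep the assignment prefix (group 1) and drop the quoted value.
    return pattern.sub(r'\g<1>""', source)


cell = 'GITHUB_TOKEN = "ghp_example" # @param {"type": "string"}'
print(redact_secrets(cell))
```

Clearing the form fields by hand before sharing has the same effect; the function is only useful when the cell source is processed as text.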


In [None]:
# @title Define authentication information

# @markdown #### Enter your github information || cf. [Generate access token](https://github.com/settings/tokens)
GITHUB_USERNAME = "Ningensei848" # @param {"type": "string"}
GITHUB_EMAIL = "k.kubokawa@klis.tsukuba.ac.jp" # @param {"type": "string"}
REPOSITORY_NAME = "SATwi-imported-private" # @param {"type": "string"}
GITHUB_TOKEN = "ghp_poiuytrew0987654321lkjhgfdsamnbvcxz_this_is_dummy_token" # @param {"type": "string"}
OWNER_AND_REPO = f"{GITHUB_USERNAME}/{REPOSITORY_NAME}"

# @markdown #### Enter the `BEARER_TOKEN` on Twitter || cf. [Twitter Developer Portal](https://developer.twitter.com/en/portal/dashboard)
BEARER_TOKEN = "AAAAAAAAAAAAAAAAAAAAAAAqwertyuiop1234567890qwertyuiopasdfghjklzxcvbnm_this_is_dummy_token" # @param {"type": "string"}

# @markdown ### CAUTION: The above information is **confidential information**; please be very careful not to disclose it to others!

# @markdown ---

# @markdown #### Optional settings
# @markdown ###### If you want to collect `tweets` sent by the target, check below
ENABLE_TWEETS = False #@param {type:"boolean"}
# @markdown ###### Specify the number of tweets you want to collect (up to 3200)
MAX_RESULTS_TWEET = 100 # @param {type:"integer"}

# @markdown ###### If you want to collect `mentions` to the target, check below
ENABLE_MENTION = False #@param {type:"boolean"}
# @markdown ###### Specify the number of mentions you want to collect (up to 800)
MAX_RESULTS_MENTION = 100 # @param {type:"integer"}

# @markdown ###### If you want to collect `liked_tweets` by the target, check below
ENABLE_LIKED_TWEETS = False #@param {type:"boolean"}
# @markdown ###### Specify the number of liked tweets you want to collect (up to 7500)
MAX_RESULTS_LIKE = 100 # @param {type:"integer"}

# @markdown ---

# @markdown #### If you want to collect information about one specific account, enter the user ID below (the `targetList.txt` will be ignored)
UNIQUE_TARGET_ID = 0 #@param {type:"integer"}



In [None]:
# @title Install necessary external libraries with `pip`
# @markdown > WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead:

# @markdown You will see a warning like the one above, but the environment can be reset by restarting the runtime, so it can be ignored.

%pip install --upgrade pip
%pip install --upgrade python-dotenv requests 
%pip install requests-oauthlib tqdm commentjson


In [None]:
# @title Clone individual `SATwi` repository from GitHub

import subprocess

%cd "/content"

proc = ["git", "clone", f"https://{GITHUB_TOKEN}@github.com/{GITHUB_USERNAME}/{REPOSITORY_NAME}.git"]
res = subprocess.run(proc, encoding="utf-8", capture_output=True, text=True)
print(res.stderr)

%cd "/content/$REPOSITORY_NAME"


In [None]:
# @title Import required libraries

import os
import re
import time
import json
import urllib.request
from datetime import datetime, timedelta, timezone
from pathlib import Path

import requests
import commentjson
from tqdm import tqdm
# @markdown > Import "script.lib" could not be resolved(reportMissingImports)

# @markdown You may see a warning like the one above, but the import should work without problems, so ignore it.
from script.lib import createTimelinesUrl, saveAsJSON


In [None]:
# @title Defining Functions: gitCommit()
# @markdown Define several git commands and group them together as a function.

def makeCommands():
    dt = datetime.now(timezone(timedelta(hours=9))).strftime("%Y-%m-%d %H:%M:%S")
    git_config_name = ["git", "config", "--local", "user.name", GITHUB_USERNAME]
    git_config_email = ["git", "config", "--local", "user.email", GITHUB_EMAIL]
    git_add = ["git", "add", "."]
    git_commit = ["git", "commit", "-m", f"[ipynb] Data updated || {dt}"]
    git_pull = ["git", "pull", "--rebase"]
    git_gc = ["git", "gc", "--prune=all"]
    git_push = ["git", "push"]

    return [git_config_name, git_config_email, git_add, git_commit, git_pull, git_gc, git_push]



def gitCommit():
    for proc in makeCommands():
        res = subprocess.run(proc, encoding="utf-8", capture_output=True, text=True)
        if len(res.stderr):
            print(res.stderr)


In [None]:
# @title Defining Functions: isPrivate()
# @markdown Check the visibility of repository

def isPrivate():

    url = f"https://api.github.com/repos/{OWNER_AND_REPO}"
    req = urllib.request.Request(url)
    req.headers = {"Accept": "application/vnd.github+json", "Authorization": f"token {GITHUB_TOKEN}"}

    res = urllib.request.urlopen(req)
    content = json.loads(res.read().decode("utf-8"))
    return content["private"]


In [None]:
# @title Defining Functions: connectEndpoint()
# @markdown Request the endpoint with BEARER_TOKEN and retrieve the data.

def bearerOAuth(r):
    """
    Method required by bearer token authentication.
    """

    r.headers["Authorization"] = f"Bearer {BEARER_TOKEN}"
    r.headers["User-Agent"] = "v2UserTweetsPython"
    return r


def connectEndpoint(url, params):
    response = requests.request("GET", url, auth=bearerOAuth, params=params)
    # print(response.status_code)
    if response.status_code != 200:
        raise Exception("Request returned an error: {} {}".format(response.status_code, response.text))
    return response.json()


In [None]:
# @title Defining Functions: getParams()
# @markdown Read necessary parameters from `queryParameters.json`.

def convertListToStr(list_):
    return ",".join(list_) if isinstance(list_, list) else str(list_)


def getParams(pagination_token: str = None):
    filepath = cwd / "queryParameters.json"
    config = commentjson.loads(filepath.read_text())
    param_fields = [
        "expansions",
        "tweet.fields",
        "media.fields",
        "place.fields",
        "poll.fields",
    ]
    param_dict = {k: convertListToStr(v) for k, v in config.items() if k in param_fields}
    param_dict.update(
        {
            "max_results": 100,
            # "start_time": START_TIME.isoformat(timespec="seconds") + "Z",
            "pagination_token": pagination_token,
        }
    )

    return param_dict


In [None]:
# @title Defining Functions: procedure()
# @markdown Consolidate repetitive processes.

def procedure(user_id: int, result_count: int, endpoint="tweets", next_token=None):
    url = createTimelinesUrl(user_id, endpoint)
    params = getParams() if next_token is None else getParams(next_token)

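    if result_count == 0:
        return
    elif result_count < 100:
        # Fewer than one full page remains: request only what is left.
        # Note: the API may reject very small `max_results` values.
        params["max_results"] = result_count
        result_count = 0
    else:
        result_count -= 100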

    try:
        json_response = connectEndpoint(url, params)
    except Exception as e:
        print('-' * 80 + '\n\tERROR at connectEndpoint(url, params)\n' + '-' * 80)
        print(e)
        print('-' * 80 + '\n\n')
        return

    if "data" not in json_response:
        print(f"`data` not found. user_id is {user_id} and endpoint is {endpoint}")
        return
    else:
        saveAsJSON(user_id, endpoint, json_response)

    if "meta" in json_response and "next_token" in json_response["meta"]:
        pagination_token = json_response["meta"]["next_token"]
        time.sleep(3)  # wait 3 seconds
        procedure(user_id, result_count, endpoint, pagination_token)

    return


In [None]:
# @title main()
# @markdown Read the target ID list from `targetList.txt` and execute.

# @markdown However, if `UNIQUE_TARGET_ID` is specified, only that one person is taken.

# regexp
pattern_user_id = re.compile(r"\d+")

cwd = Path.cwd()
source = cwd / "targetList.txt"

for line in source.read_text().split("\n"):
    print(line)

target_id_list = [
    int(pattern_user_id.match(line)[0])
    for line in source.read_text().split("\n")
    if len(line) > 0 and pattern_user_id.match(line) is not None
]

# Ignore `targetList.txt` when collecting data on one specific person
if UNIQUE_TARGET_ID:
    print("But the target(s) in targetList.txt above are ignored.")
    print(f"Collecting data about {UNIQUE_TARGET_ID} only")
    target_id_list = [UNIQUE_TARGET_ID]

for user_id in tqdm(target_id_list):
    if ENABLE_TWEETS:
        print(f"\nNow collecting {user_id}'s Tweets ...\n")
        procedure(user_id, MAX_RESULTS_TWEET, endpoint="tweets")
    if ENABLE_MENTION:
        print(f"\nNow collecting {user_id}'s Mentions ...\n")
        procedure(user_id, MAX_RESULTS_MENTION, endpoint="mentions")
    if ENABLE_LIKED_TWEETS:
        print(f"\nNow collecting {user_id}'s Liked Tweets ...\n")
        procedure(user_id, MAX_RESULTS_LIKE, endpoint="liked_tweets")


In [None]:
# @title Finally ...
# @markdown Push to repository to complete 
# @markdown (but the commit will not be performed unless the code below is **explicitly uncommented**)

# if isPrivate():
#     gitCommit()
# else:
#     print("This repository is not Private!")
#     print("Pushing as is violates Twitter's terms and conditions,")
#     print("and furthermore constitutes **copyright infringement**!")
#     print("---------------------------------------------------------")
#     print("If you want to save your data,")
#     print("change visibility of your remote repository now!")
