# Using Tweetscrape

## Setup

In [None]:
pip install python-dotenv

In [37]:
from tweetscrape import *
from dotenv import load_dotenv
from os import getenv
import pandas as pd

In [6]:
load_dotenv() # Load the env variables stored in local .env file
twitter_oauth_ian  = getenv("TWITTER_OAUTH_IAN")
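
`getenv` returns `None` when a variable is missing, which tends to surface later as a confusing authentication failure. A minimal sketch of an early check, assuming a hypothetical helper named `require_env` (it is not part of tweetscrape or python-dotenv):

```python
import os


def require_env(name, env=None):
    """Return an environment variable's value or raise a clear error.

    Hypothetical helper for illustration; not part of tweetscrape.
    """
    env = os.environ if env is None else env
    value = env.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return value
```

After `load_dotenv()`, calling `require_env("TWITTER_OAUTH_IAN")` in place of `getenv(...)` would fail immediately if the token is absent.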

## ScraperPopular class

This class gets tweets from the endpoint, https://api.twitter.com/1.1/search/tweets.json, and requires elevated access.

**Unfortunately, it is hard to get a sufficient number of tweets for a topic with this endpoint. For "league of legends", I got a maximum of 35, and for "(elon musk OR chief twit)", I got a maximum of about 16.**

### Initialization

In [10]:
scraper_popular = ScraperPopular(twitter_oauth_ian)

### Params

The class is initialized with several default params. You can update these params, e.g.,
`scraper_popular.params['lang']='ja'`. The params that most commonly vary between queries, however, can simply be passed in as arguments to the `scrape` function (e.g., `query`, `max_results`, `lang`).

You'll notice that `ScraperPopular` uses a `q` param. The older endpoints use `q` instead of `query` and feature a separate `lang` parameter for the language. This differs from the v2 endpoints, where the query param is `query` and the language is included within the query argument (e.g., `'query': '(elon musk OR chief twit) lang:en'`).
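
To make the difference concrete, here is a small sketch that builds both parameter shapes. The two function names are hypothetical and only illustrate the shapes described above; they are not tweetscrape functions.

```python
def v1_search_params(query, lang=None):
    """v1.1-style params: the search text goes in `q`, language in `lang`."""
    params = {"q": query}
    if lang:
        params["lang"] = lang
    return params


def v2_search_params(query, lang=None):
    """v2-style params: language is a `lang:` operator inside `query`."""
    if lang:
        query = f"{query} lang:{lang}"
    return {"query": query}


print(v1_search_params("(elon musk OR chief twit)", "en"))
print(v2_search_params("(elon musk OR chief twit)", "en"))
```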

To delete a param, call the pop function. E.g.,
`scraper_popular.params.pop('user.fields')`
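
The snippet below sketches updating and deleting params on a plain dict that stands in for `scraper_popular.params`. It assumes the attribute behaves like an ordinary Python dict, which the `pop` call above suggests.

```python
# A plain dict standing in for scraper_popular.params (assumption: it is a dict).
params = {
    "q": "",
    "lang": "",
    "max_results": "100",
    "user.fields": "id,name,username",
    "result_type": "popular",
}

params["lang"] = "en"                      # update a param in place
removed = params.pop("user.fields", None)  # delete a param and keep its old value
again = params.pop("user.fields", None)    # the None default avoids a KeyError if it is already gone

print(removed)
print(sorted(params))
```

Passing a default to `pop` is optional, but it makes re-running a notebook cell harmless.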

In [8]:
# These are the params set by default. You don't have to bother changing q, lang, or max_results. These can be specified each time you call the scrape function.
scraper_popular.params

{'q': '',
 'lang': '',
 'max_results': '100',
 'tweet.fields': 'id,text,author_id,in_reply_to_user_id,geo,conversation_id,created_at,lang,public_metrics,referenced_tweets,reply_settings,source',
 'user.fields': 'id,name,username,created_at,description,public_metrics,verified',
 'place.fields': 'full_name,id,country,country_code,geo,name,place_type',
 'result_type': 'popular'}

### Pagination token -- get new data between queries

The standard search API that ScraperPopular uses returns 14 tweets per call. To get the next page of tweets, a token from the last result has to be passed along so the same tweets aren't fetched again.

Technically, with this older endpoint the token is a `max_id` value taken from the response metadata. If you run a query and then check the class's params (e.g., `scraper_popular.params`), you will notice a `max_id` param.
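
As a rough illustration of `max_id` paging, the sketch below pages through a fake, in-memory list of tweet IDs. `fetch_page` is a stand-in for one API call and `collect` is a hypothetical loop; whether tweetscrape computes `max_id` exactly this way is an assumption. On the v1.1 endpoint `max_id` is inclusive, so the loop steps one below the oldest ID seen.

```python
def fetch_page(all_ids, max_id=None, count=3):
    """Stand-in for one API call: newest-first IDs that are <= max_id."""
    eligible = [i for i in sorted(all_ids, reverse=True)
                if max_id is None or i <= max_id]
    return eligible[:count]


def collect(all_ids, max_results):
    """Page backwards through IDs until max_results is reached or nothing is left."""
    collected, max_id = [], None
    while len(collected) < max_results:
        page = fetch_page(all_ids, max_id)
        if not page:
            break  # no remaining tweets
        collected.extend(page)
        max_id = min(page) - 1  # max_id is inclusive, so go below the oldest ID
    return collected[:max_results]


ids = collect(range(1, 11), max_results=7)
print(ids)
```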

**If you want to continue scraping tweets with the same query later**, get the last pagination token when your current query finishes using the getter, and save it somewhere:

`scraper_popular.get_pagination_token()`

When you want to continue scraping with a new instance, simply set the pagination token using the setter or pass it in as an argument to the scrape function. Either way, it will scrape from that point onward, so you will be getting new data.

`scraper_popular.set_pagination_token("asdfasdfasdfasdfa")`

**If you keep the same instance alive and repeat the same query, the pagination token is updated automatically**, so there's no need to keep setting it each time you run the scrape function.
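
One way to "save it somewhere" is a small JSON file keyed by query. This is only a sketch: `save_token` and `load_token` are hypothetical helpers, and the file name is up to you.

```python
import json
from pathlib import Path


def save_token(path, query, token):
    """Store the latest pagination token for a query in a JSON file."""
    path = Path(path)
    tokens = json.loads(path.read_text()) if path.exists() else {}
    tokens[query] = token
    path.write_text(json.dumps(tokens, indent=2))


def load_token(path, query, default=""):
    """Return the saved token for a query, or `default` if none was saved."""
    path = Path(path)
    if not path.exists():
        return default
    return json.loads(path.read_text()).get(query, default)
```

After a scrape you could call `save_token("tokens.json", query, scraper_popular.get_pagination_token())`, and in a later session pass `load_token("tokens.json", query)` to `set_pagination_token`.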



### Scraping

<span style="color: crimson; font-weight: bolder;">def scrape(self, query, max_results=500, lang='en', pagination_token=''):</span>

Scrapes the endpoint https://api.twitter.com/1.1/search/tweets.json with the 'popular' `result_type` parameter. Loops until the number of tweets captured reaches max_results or until no remaining tweets are retrieved.

<span style="color: blue; font-weight: bolder;">Args:</span><br>
&emsp;query (str): Your search query. Can be boolean, e.g., "(elon musk OR chief twit)"
<br><br>
&emsp;max_results (int, optional): The max number of rows you want in the returned DataFrame. Defaults to 500.
<br><br>
&emsp;lang (str, optional): ISO2 language code for the tweets. None will remove the lang parameter. Defaults to 'en'.
<br><br>
&emsp;pagination_token (str, optional): Really max_id for this endpoint. Used to set the 'max_id' param. Defaults to ''.
<br><br>
<span style="color: blue; font-weight: bolder;">Returns:</span><br>
&emsp;pandas.DataFrame: A DataFrame of the results.
<br>

`df1 = scraper_popular.scrape("league of legends", max_results=5_000)`

### Merging the dataframes from multiple scrapes:

Assuming the parameters for the fields to include have not changed between two scrapes, you can merge the dataframes from those scrapes as such:

```
df1 = scraper_popular.scrape("league of legends", max_results=5_000)
df2 = scraper_popular.scrape("league of legends", max_results=10_000)

df3 = pd.concat([df1, df2], ignore_index=True)
```
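
If two scrapes might overlap (for example, because a pagination token was not carried over), de-duplicating on the `id` column after concatenating keeps one row per tweet. A minimal sketch with small stand-in frames, assuming the scraped frames have an `id` column:

```python
import pandas as pd

# Two small stand-in frames; real scrapes return many more columns.
df1 = pd.DataFrame({"id": [1, 2, 3], "text": ["a", "b", "c"]})
df2 = pd.DataFrame({"id": [3, 4], "text": ["c", "d"]})

merged = (
    pd.concat([df1, df2], ignore_index=True)
    .drop_duplicates(subset="id")  # keep the first copy of each tweet ID
    .reset_index(drop=True)
)
print(len(merged))
```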

## ScraperArchive class

This class gets tweets from the endpoint, https://api.twitter.com/2/tweets/search/all, and requires academic research access.

It is used the same way as ScraperPopular.

### Scraping

<span style="color: crimson; font-weight: bolder;">def scrape(self, query, max_results=500, lang='en', pagination_token=''):</span>

Scrapes the endpoint https://api.twitter.com/2/tweets/search/all. Loops until the number of tweets captured reaches max_results or until no remaining tweets are retrieved.

<span style="color: blue; font-weight: bolder;">Args:</span><br>
&emsp;query (str): Your search query. Can be boolean, e.g., "(elon musk OR chief twit)"
<br><br>
&emsp;max_results (int, optional): The max number of rows you want in the returned DataFrame. Defaults to 500.
<br><br>
&emsp;lang (str, optional): ISO2 language code for the tweets. None will omit the lang argument. Defaults to 'en'.
<br><br>
&emsp;pagination_token (str, optional): Used to set the 'pagination_token' param. Defaults to ''.
<br><br>
<span style="color: blue; font-weight: bolder;">Returns:</span><br>
&emsp;pandas.DataFrame: A DataFrame of the results.
<br>

`df1 = scraper_archive.scrape("league of legends", max_results=5_000)`
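
The "loop until `max_results` or no more tweets" behavior can be sketched against canned pages. v2 search responses carry a `meta.next_token` field while more pages exist; everything else here (`fetch_page`, `collect`, the fake pages) is a stand-in for illustration and not tweetscrape's actual code.

```python
def fetch_page(pages, next_token=None):
    """Stand-in for one API call; returns a v2-style payload from canned pages."""
    index = 0 if next_token is None else int(next_token)
    meta = {"result_count": len(pages[index])}
    if index + 1 < len(pages):
        meta["next_token"] = str(index + 1)  # absent on the last page
    return {"data": pages[index], "meta": meta}


def collect(pages, max_results):
    """Append whole pages until max_results is reached or pages run out."""
    rows, token = [], None
    while len(rows) < max_results:
        payload = fetch_page(pages, token)
        rows.extend(payload.get("data", []))
        token = payload["meta"].get("next_token")
        if token is None:  # no further pages
            break
    return rows, token


pages = [[{"id": 1}, {"id": 2}], [{"id": 3}, {"id": 4}], [{"id": 5}]]
rows, token = collect(pages, max_results=3)
print(len(rows), token)
```

Because this sketch appends whole pages, it can return slightly more rows than requested, and the token it returns marks where a later scrape would resume.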

### Merging the dataframes from multiple scrapes:

Assuming the parameters for the fields to include have not changed between two scrapes, you can merge the dataframes from those scrapes as such:

```
df1 = scraper_archive.scrape("league of legends", max_results=5_000)
df2 = scraper_archive.scrape("league of legends", max_results=10_000)

df3 = pd.concat([df1, df2], ignore_index=True)
```

## ScraperArchive Demo

### League of Legends Tweets

In [25]:
archive_scraper = ScraperArchive(twitter_oauth_ian)

In [27]:
query = 'league of legends'

# Let's gather 1000 tweets and write them to a CSV file:
league_tweets1 = archive_scraper.scrape(query=query, max_results=1000, lang='en')
league_tweets1.to_csv("Datasets/ScrapedTwitter/league_tweets_1k.csv") # It actually managed to gather 1099

# Now let's get the pagination token for when we want to get more League of Legends tweets:
pagination_token = archive_scraper.get_pagination_token()
print(f"Pagination token: {pagination_token}")

Pagination token: b26v89c19zqg8o3fpzen8dk1mpzoekbfuz08yrie3okfx


In [28]:
# Pretend the old instance is disposed so the pagination
# token has to be passed to the new instance:
archive_scraper = ScraperArchive(twitter_oauth_ian)
archive_scraper.set_pagination_token('b26v89c19zqg8o3fpzen8dk1mpzoekbfuz08yrie3okfx')

# Let's gather 14000 MORE tweets and write them to a CSV file:
league_tweets2 = archive_scraper.scrape(query, max_results=14_000, lang='en') # This will take several minutes
league_tweets2.to_csv("Datasets/ScrapedTwitter/league_tweets_14k.csv")

In [31]:
# Let's merge both batches into one DataFrame and export the data:
league_tweets15k = pd.concat([league_tweets1, league_tweets2], ignore_index=True)
league_tweets15k.to_csv("Datasets/ScrapedTwitter/league_tweets_15k.csv")

<h3 style="color: red;">Don't forget to get the pagination token!</h3>
<p>
This will make life easier when you want to get more tweets using the same query.
</p>

In [33]:
print(f"Last pagination token for query '{query}':")
print(archive_scraper.get_pagination_token())

Last pagination token for query 'league of legends':
b26v89c19zqg8o3fpzen875uz627neg8vi5f64xyb0n7h


### Elon Musk Tweets

In [35]:
musk_archive_scraper = ScraperArchive(twitter_oauth_ian)

query = '(elon musk OR chief twit)'
musk_tweets15k = musk_archive_scraper.scrape(query, max_results=15_000, lang='en') # This will take several minutes
musk_tweets15k.to_csv("Datasets/ScrapedTwitter/musk_tweets_15k.csv")

<h3 style="color: red;">Don't forget to get the pagination token!</h3>
<p>
This will make life easier when you want to get more tweets using the same query.
</p>

In [41]:
print(f"Last pagination token for query '{query}':")
print(musk_archive_scraper.get_pagination_token())

Last pagination token for query '(elon musk OR chief twit)':
b26v89c19zqg8o3fpzen8fn9sdtflk0xqqfyru2wcj1bx


### Dinosaur Tweets

In [50]:
query = 'dinosaur'

In [79]:
dino_tweets15k = archive_scraper.scrape(query, max_results=15_000, lang='en')

In [102]:
dino_tweets15k.iloc[1]

public_metrics            {'retweet_count': 0, 'reply_count': 0, 'like_c...
edit_history_tweet_ids                                [1588245412221046784]
conversation_id                                         1588245412221046784
created_at                                         2022-11-03T19:03:09.000Z
text                      Light Painting 101: Illuminating a terrifying ...
id                                                      1588245412221046784
referenced_tweets                                                       NaN
author_id                                                         434534859
lang                                                                     en
source                                                                IFTTT
reply_settings                                                     everyone
in_reply_to_user_id                                                     NaN
geo                                                                     NaN
withheld    

In [None]:
# Plug author ID into https://tweeterid.com/ --> User: @WZ65
# --> Scrolled down their feed and found the dino tweet posted on Nov. 11

In [95]:
dino_tweets15k.to_csv("Datasets/ScrapedTwitter/dino_tweets_15k.csv")

### Dinosaur -- last pagination token

In [100]:
print("Last pagination token for query 'dinosaur':")
print(archive_scraper.get_pagination_token())

Last pagination token for query 'dinosaur':
b26v89c19zqg8o3fpzemdy1syktbx8t777oii7etrz4ot
