<a href="https://colab.research.google.com/github/AmirMotefaker/Twitter-Watch/blob/main/Scrape_Twitter_with_Python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Why Scrape Twitter?

- [Twitter](https://twitter.com/Twitter) is a major announcement hub where people and companies publish their news, which makes it a good place to follow industry trends. For example, tweets about stocks or cryptocurrencies can be scraped and used as one input when trying to predict price movements.

- Twitter is also a great source of data for sentiment analysis. You can use Twitter to find out what people think about a certain topic or brand. This is useful for market research, product development, and brand awareness.

- So, if we can scrape Twitter data with Python we can have access to this valuable public information for free!

# Setup Wizard

- We'll approach Twitter scraping in three ways:

  - We'll be using [browser automation toolkit Playwright](https://scrapfly.io/blog/web-scraping-with-playwright-and-python/)
    - This is the easiest way to scrape Twitter because we are using a real web browser: all we have to do is navigate to a URL, wait for the page to load and get the results.

  - We'll also take a look at reverse engineering Twitter's hidden API. This is a bit harder, but these types of scrapers are much faster than the browser-based ones. For this we'll be using [httpx](https://pypi.org/project/httpx/).

  - For ScrapFly users we'll also take a look at ScrapFly SDK which makes the above methods even easier.


- We'll be working with both JSON and HTML response data. So, we'll be using [parsel](https://pypi.org/project/parsel/) to parse HTML and [jmespath for JSON](https://scrapfly.io/blog/parse-json-jmespath-python/).

#### All of these libraries are available for free and can be installed via the pip install terminal command:

   $ pip install httpx playwright parsel jmespath scrapfly-sdk



In [None]:
!pip install Scrapy
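# asyncio ships with Python's standard library, so it is not installed from PyPI
!pip install httpx playwright parsel jmespath gevent scrapfly-sdk
# Playwright also needs a browser binary before it can launch Chromium
!playwright install chromium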


# Scraping Tweets

- Twitter is a complex JavaScript web application, so its pages must be rendered in a browser before they can be parsed. For tweet scraping we'll be using the [Playwright](https://playwright.dev/) browser automation library.

#### Playwright-based Twitter scraper in Python should look something like this:

  1- We'll start a headless Chrome browser

  2- Navigate it to Tweet page URL like https://twitter.com/Scrapfly_dev/status/1577664612908077062

  3- Wait for the page to load

  4- Retrieve page HTML source

  5- Load HTML to "parsel.Selector"

  6- Use CSS selectors and XPath to extract Tweet details and replies

# Python

In [None]:
from parsel import Selector
# Page is exported by the public package; no need to import from the private _generated module
from playwright.sync_api import Page, sync_playwright


def parse_tweets(selector: Selector):
    """
    parse tweets from pages containing tweets like:
    - tweet page
    - search page
    - reply page
    - homepage
    returns list of tweets on the page where 1st tweet is the 
    main tweet and the rest are replies
    """
    results = []
    # select all tweets on the page as individual boxes
    # each tweet is stored under <article data-testid="tweet"> box:
    tweets = selector.xpath("//article[@data-testid='tweet']")
    for i, tweet in enumerate(tweets):
        # using data-testid attribute we can get tweet details:
        found = {
            "text": "".join(tweet.xpath(".//*[@data-testid='tweetText']//text()").getall()),
            "username": tweet.xpath(".//*[@data-testid='User-Names']/div[1]//text()").get(),
            "handle": tweet.xpath(".//*[@data-testid='User-Names']/div[2]//text()").get(),
            "datetime": tweet.xpath(".//time/@datetime").get(),
            "verified": bool(tweet.xpath(".//svg[@data-testid='icon-verified']")),
            "url": tweet.xpath(".//time/../@href").get(),
            "image": tweet.xpath(".//*[@data-testid='tweetPhoto']/img/@src").get(),
            "video": tweet.xpath(".//video/@src").get(),
            "video_thumb": tweet.xpath(".//video/@poster").get(),
            "likes": tweet.xpath(".//*[@data-testid='like']//text()").get(),
            "retweets": tweet.xpath(".//*[@data-testid='retweet']//text()").get(),
            "replies": tweet.xpath(".//*[@data-testid='reply']//text()").get(),
            "views": (tweet.xpath(".//*[contains(@aria-label,'Views')]").re(r"(\d+) Views") or [None])[0],
        }
        # main tweet (not a reply):
        if i == 0:
            found["views"] = tweet.xpath('.//span[contains(text(),"Views")]/../preceding-sibling::div//text()').get()
            found["retweets"] = tweet.xpath('.//a[contains(@href,"retweets")]//text()').get()
            found["quote_tweets"] = tweet.xpath('.//a[contains(@href,"retweets/with_comments")]//text()').get()
            found["likes"] = tweet.xpath('.//a[contains(@href,"likes")]//text()').get()
        results.append({k: v for k, v in found.items() if v is not None})
    return results


def scrape_tweet(url: str, page: Page):
    """
    Scrape tweet and replies from tweet page like:
    https://twitter.com/Scrapfly_dev/status/1587431468141318146
    """
    # go to url
    page.goto(url)
    # wait for content to load
    page.wait_for_selector("//article[@data-testid='tweet']")  
    # retrieve final page HTML:
    html = page.content()
    # parse it for data:
    selector = Selector(html)
    tweets = parse_tweets(selector)
    return tweets


# Example run with the sync API (works in a regular Python script):
# with sync_playwright() as pw:
#     # start browser and open a new tab:
#     browser = pw.chromium.launch(headless=False)
#     page = browser.new_page(viewport={"width": 1920, "height": 1080})
#     # scrape tweet and replies:
#     tweet_and_replies = scrape_tweet("https://twitter.com/Scrapfly_dev/status/1587431468141318146", page)
#     print(tweet_and_replies)


# Example run inside a notebook. Jupyter/Colab already run an asyncio event loop,
# where the sync API refuses to start, so the same steps use the async API here.
from playwright.async_api import async_playwright


async def scrape_tweet_async(url: str):
    """Async variant of scrape_tweet() for environments with a running event loop."""
    async with async_playwright() as pw:
        browser = await pw.chromium.launch(headless=True)
        page = await browser.new_page(viewport={"width": 1920, "height": 1080})
        # go to url
        await page.goto(url)
        # wait for content to load
        await page.wait_for_selector("//article[@data-testid='tweet']")
        # retrieve final page HTML:
        html = await page.content()
        await browser.close()
    # parse it for data:
    return parse_tweets(Selector(html))


# in a notebook cell, top-level await is available:
# tweet_and_replies = await scrape_tweet_async("https://twitter.com/Scrapfly_dev/status/1587431468141318146")
# print(tweet_and_replies)

# ScrapFly

In [None]:
from parsel import Selector
from scrapfly import ScrapflyClient, ScrapeConfig

# replace the placeholder with your own key; never commit a real key to a shared notebook
scrapfly = ScrapflyClient(key="YOUR SCRAPFLY KEY")


def parse_tweets(selector: Selector):
    """
    parse tweets from pages containing tweets like:
    - tweet page
    - search page
    - reply page
    - homepage
    returns list of tweets on the page where 1st tweet is the
    main tweet and the rest are replies
    """
    results = []
    # select all tweets on the page as individual boxes
    # each tweet is stored under <article data-testid="tweet"> box:
    tweets = selector.xpath("//article[@data-testid='tweet']")
    for i, tweet in enumerate(tweets):
        # using data-testid attribute we can get tweet details:
        found = {
            "text": "".join(tweet.xpath(".//*[@data-testid='tweetText']//text()").getall()),
            "username": tweet.xpath(".//*[@data-testid='User-Names']/div[1]//text()").get(),
            "handle": tweet.xpath(".//*[@data-testid='User-Names']/div[2]//text()").get(),
            "datetime": tweet.xpath(".//time/@datetime").get(),
            "verified": bool(tweet.xpath(".//svg[@data-testid='icon-verified']")),
            "url": tweet.xpath(".//time/../@href").get(),
            "image": tweet.xpath(".//*[@data-testid='tweetPhoto']/img/@src").get(),
            "video": tweet.xpath(".//video/@src").get(),
            "video_thumb": tweet.xpath(".//video/@poster").get(),
            "likes": tweet.xpath(".//*[@data-testid='like']//text()").get(),
            "retweets": tweet.xpath(".//*[@data-testid='retweet']//text()").get(),
            "replies": tweet.xpath(".//*[@data-testid='reply']//text()").get(),
            "views": (tweet.xpath(".//*[contains(@aria-label,'Views')]").re(r"(\d+) Views") or [None])[0],
        }
        # main tweet (not a reply):
        if i == 0:
            found["views"] = tweet.xpath('.//span[contains(text(),"Views")]/../preceding-sibling::div//text()').get()
            found["retweets"] = tweet.xpath('.//a[contains(@href,"retweets")]//text()').get()
            found["quote_tweets"] = tweet.xpath('.//a[contains(@href,"retweets/with_comments")]//text()').get()
            found["likes"] = tweet.xpath('.//a[contains(@href,"likes")]//text()').get()
        results.append({k: v for k, v in found.items() if v is not None})
    return results


def scrape_tweet(url: str):
    """
    Scrape tweet and replies from tweet page like:
    https://twitter.com/Scrapfly_dev/status/1587431468141318146
    """
    result = scrapfly.scrape(ScrapeConfig(
        url=url,
        country="US",
        render_js=True,
    ))
    return parse_tweets(result.selector)


tweet_and_replies = scrape_tweet("https://twitter.com/Google/status/1622686179077357573")
print(tweet_and_replies)


[{'text': 'AI can help people, businesses and communities unlock their potential. Here’s the latest on how we’re building on our advancements in large language models, including an experimental conversational AI service and new AI-powered features in Search ↓', 'username': 'Google', 'handle': '@Google', 'datetime': '2023-02-06T19:58:27.000Z', 'verified': True, 'url': '/Google/status/1622686179077357573', 'likes': '1,025', 'retweets': '270', 'views': '308.2K', 'quote_tweets': '68'}, {'text': 'Hopefully this gets applied to @GoogleHome since turning off a simple light gets her asking me if I want to set an alarm or the light is not setup when it is.', 'username': 'Andrew Toone', 'handle': '@ssj2toone', 'datetime': '2023-02-08T10:09:53.000Z', 'verified': False, 'url': '/ssj2toone/status/1623262835332579331', 'replies': '1', 'views': '240'}, {'text': "We hear you, Andrew. We're always looking for ways to improve and we'll take this as feedback. Keep following us for future updates!", 'user
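The counters in the output above are display strings such as `'1,025'` and `'308.2K'`, and tweet URLs are relative paths. For analysis it usually helps to normalize them. The following is a minimal sketch using only the standard library; the helper names `parse_count` and `normalize_tweet` are hypothetical, and the suffix table (K, M, B) is an assumption about how Twitter abbreviates large numbers.

```python
from typing import Optional
from urllib.parse import urljoin

# assumed abbreviation suffixes; extend if other suffixes show up in scraped data
MULTIPLIERS = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}


def parse_count(value: Optional[str]) -> Optional[int]:
    """Convert a display counter such as '1,025' or '308.2K' to an integer."""
    if not value:
        return None
    text = value.strip().replace(",", "")
    if not text:
        return None
    suffix = text[-1].upper()
    if suffix in MULTIPLIERS:
        # round() guards against floating point noise, e.g. 308.2 * 1000
        return int(round(float(text[:-1]) * MULTIPLIERS[suffix]))
    return int(text)


def normalize_tweet(tweet: dict) -> dict:
    """Return a copy of a parsed tweet with integer counters and an absolute URL."""
    out = dict(tweet)
    for key in ("likes", "retweets", "replies", "views", "quote_tweets"):
        if key in out:
            out[key] = parse_count(out[key])
    if "url" in out:
        out["url"] = urljoin("https://twitter.com", out["url"])
    return out
```

It can be applied to the scraper results with `[normalize_tweet(t) for t in tweet_and_replies]`.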

# Scraping Search

- Twitter is known for its powerful search engine and it's a great place to find popular tweets and users.

- To scrape Twitter search we'll use Playwright as well. Our process looks very similar to our previous scraper:

  1- We'll start a headless Chrome browser

  2- Navigate it to Tweet search page URL like https://twitter.com/search?q=google&src=typed_query
  
  3- Wait for the page to load

  4- Retrieve page HTML source

  5- Load HTML to parsel.Selector

  6- Use CSS selectors and XPath to find Tweets or Twitter users


  - For this example, we'll cover two search endpoints:

    - People Search - scrape people profiles related to the search query.
    - Top Results - scrape recommended tweets and profiles related to the search query.
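
Both endpoints share the same search URL and differ only in the `f=user` parameter used for people search. The scrapers in this notebook interpolate the query directly into the URL, which breaks for queries containing spaces or characters like `#`. A minimal sketch of a URL builder that encodes the query first; `build_search_url` is a hypothetical helper, not part of any library:

```python
from urllib.parse import urlencode

SEARCH_URL = "https://twitter.com/search"


def build_search_url(query: str, people: bool = False) -> str:
    """Build a Twitter search URL for Top Results or, with people=True, People Search."""
    params = {"q": query, "src": "typed_query"}
    if people:
        # people search is selected with the extra f=user parameter
        params["f"] = "user"
    # urlencode percent-encodes reserved characters and turns spaces into "+"
    return f"{SEARCH_URL}?{urlencode(params)}"
```

For example, `build_search_url("google", people=True)` produces the people-search URL for the query "google".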

In [None]:
from parsel import Selector
# Page is exported by the public package; no need to import from the private _generated module
from playwright.sync_api import Page, sync_playwright

def parse_profiles(sel: Selector):
    """parse profile preview data from Twitter profile search"""
    profiles = []
    for profile in sel.xpath("//div[@data-testid='UserCell']"):
        profiles.append(
            {
                "name": profile.xpath(".//a[not(@tabindex=-1)]//text()").get().strip(),
                "handle": profile.xpath(".//a[@tabindex=-1]//text()").get().strip(),
                "bio": ''.join(profile.xpath("(.//div[@dir='auto'])[last()]//text()").getall()),
                "url": profile.xpath(".//a/@href").get(),
                "image": profile.xpath(".//img/@src").get(),
            }
        )
    return profiles


def scrape_top_search(query: str, page: Page):
    """scrape top Twitter page for featured tweets"""
    page.goto(f"https://twitter.com/search?q={query}&src=typed_query")
    page.wait_for_selector("//article[@data-testid='tweet']")  # wait for content to load
    tweets = parse_tweets(Selector(page.content()))
    return tweets


def scrape_people_search(query: str, page: Page):
    """scrape people search Twitter page for related users"""
    page.goto(f"https://twitter.com/search?q={query}&src=typed_query&f=user")
    page.wait_for_selector("//div[@data-testid='UserCell']")  # wait for content to load
    profiles = parse_profiles(Selector(page.content()))
    return profiles


# Example run with the sync API (works in a regular Python script):
# with sync_playwright() as pw:
#     browser = pw.chromium.launch(headless=False)
#     page = browser.new_page(viewport={"width": 1920, "height": 1080})
#
#     top_tweet_search = scrape_top_search("google", page)
#     people_tweet_search = scrape_people_search("google", page)


# Example run inside a notebook. Jupyter/Colab already run an asyncio event loop,
# where the sync API refuses to start, so the same steps use the async API here.
from playwright.async_api import async_playwright


async def scrape_search_async(query: str, people: bool = False):
    """Async variant of the two search scrapers above."""
    url = f"https://twitter.com/search?q={query}&src=typed_query"
    wait_for = "//article[@data-testid='tweet']"
    if people:
        url += "&f=user"
        wait_for = "//div[@data-testid='UserCell']"
    async with async_playwright() as pw:
        browser = await pw.chromium.launch(headless=True)
        page = await browser.new_page(viewport={"width": 1920, "height": 1080})
        await page.goto(url)
        await page.wait_for_selector(wait_for)  # wait for content to load
        selector = Selector(await page.content())
        await browser.close()
    return parse_profiles(selector) if people else parse_tweets(selector)


# in a notebook cell, top-level await is available:
# top_tweet_search = await scrape_search_async("google")
# people_tweet_search = await scrape_search_async("google", people=True)

# ScrapFly

In [None]:
import json
from parsel import Selector
# parse_tweets() is defined in the tweet scraping cell above
from scrapfly import ScrapflyClient, ScrapeConfig

# replace the placeholder with your own key; never commit a real key to a shared notebook
scrapfly = ScrapflyClient(key="YOUR SCRAPFLY KEY")



def parse_profiles(sel: Selector):
    """parse profile preview data from Twitter profile search"""
    profiles = []
    for profile in sel.xpath("//div[@data-testid='UserCell']"):
        profiles.append(
            {
                "name": profile.xpath(".//a[not(@tabindex=-1)]//text()").get().strip(),
                "handle": profile.xpath(".//a[@tabindex=-1]//text()").get().strip(),
                "bio": "".join(profile.xpath("(.//div[@dir='auto'])[last()]//text()").getall()),
                "url": profile.xpath(".//a/@href").get(),
                "image": profile.xpath(".//img/@src").get(),
            }
        )
    return profiles


def scrape_top_search(query: str):
    """scrape top Twitter page for featured tweets"""
    result = scrapfly.scrape(
        ScrapeConfig(
            url=f"https://twitter.com/search?q={query}&src=typed_query",
            country="US",
            render_js=True,
        )
    )
    return parse_tweets(result.selector)


def scrape_people_search(query: str):
    """scrape people search Twitter page for related users"""
    result = scrapfly.scrape(
        ScrapeConfig(
            url=f"https://twitter.com/search?q={query}&src=typed_query&f=user",
            country="US",
            render_js=True,
        )
    )
    return parse_profiles(result.selector)


if __name__ == "__main__":
    top_tweet_search = scrape_top_search("google")
    print(json.dumps(top_tweet_search, indent=2))
    people_tweet_search = scrape_people_search("google")
    print(json.dumps(people_tweet_search, indent=2))


[
  {
    "text": "Finished kwites newest video talking about the subject.\nI did my research and it's safe to say that it's all fake. He even pasted copied ptsd shit from Google in his latest tweet. He barely gave evidence to support it meanwhile kwite gave so much.\n\n#Kwite #kwiteallegations",
    "username": "\u2606Cursed\u2606",
    "handle": "@MeebooShaba",
    "datetime": "2023-03-11T23:20:41.000Z",
    "verified": false,
    "url": "/MeebooShaba/status/1634695870158368771",
    "image": "https://pbs.twimg.com/media/Fq-bOy_WcAQjUO9?format=jpg&name=900x900",
    "replies": "6"
  },
  {
    "text": "Before Google Map",
    "username": "RVCJ Media",
    "handle": "@RVCJ_FB",
    "datetime": "2023-03-11T12:55:00.000Z",
    "verified": true,
    "url": "/RVCJ_FB/status/1634538411565682691",
    "image": "https://pbs.twimg.com/media/Fq8EWePXgAE7Aao?format=jpg&name=medium",
    "likes": "993",
    "retweets": "74",
    "replies": "13",
    "views": "22270"
  },
  {
    "text": "Perform


[
  {
    "name": "Google",
    "handle": "@Google",
    "bio": "#HeyGoogle",
    "url": "/Google",
    "image": "https://pbs.twimg.com/profile_images/1605297940242669568/q8-vPggS_x96.jpg"
  },
  {
    "name": "Google News",
    "handle": "@googlenews",
    "bio": "Google News helps you learn more about the stories that matter to you and the world. Download: http://news.app.goo.gl/GetGoogleNews",
    "url": "/googlenews",
    "image": "https://pbs.twimg.com/profile_images/993911709357162496/NWRtL2J__x96.jpg"
  },
  {
    "name": "Google AI",
    "handle": "@GoogleAI",
    "bio": "Google AI is focused on bringing the benefits of AI to everyone. In conducting and applying our research, we advance the state-of-the-art in many domains.",
    "url": "/GoogleAI",
    "image": "https://pbs.twimg.com/profile_images/993649592422907904/yD7LkqU2_x96.jpg"
  },
  {
    "name": "GoogleTrends",
    "handle": "@GoogleTrends",
    "bio": "Official Google data and visualizations from the @GoogleNewsInit

# Scraping Profiles
- To scrape profiles we'll be using Twitter's backend API, which is [powered by GraphQL](https://scrapfly.io/blog/web-scraping-graphql-with-python/).

- This approach is a bit more complicated than the Playwright approach, but it's much faster because no page has to be rendered. This lets us retrieve thousands of Twitter profiles without wasting time waiting for pages to load.

- To start, let's take a look at how the Twitter profile page works by using the browser's developer tools, which can be opened in most modern browsers with the F12 key. For our examples, we'll be using Chrome.
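
Much of the speed advantage comes from being able to issue many lightweight requests concurrently instead of loading pages one by one. The sketch below shows one common way to do that with bounded concurrency in `asyncio`. Note that `fetch_profile` is a hypothetical placeholder that only sleeps; a real version would make the HTTP request for the profile. The limit of 5 is an arbitrary assumption.

```python
import asyncio
from typing import Dict, List


async def fetch_profile(handle: str) -> Dict[str, str]:
    """Hypothetical placeholder: a real version would request the profile over HTTP."""
    await asyncio.sleep(0.01)  # stands in for network latency
    return {"handle": handle}


async def fetch_many(handles: List[str], limit: int = 5) -> List[Dict[str, str]]:
    """Fetch profiles concurrently, never running more than `limit` requests at once."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(handle: str) -> Dict[str, str]:
        # the semaphore blocks here once `limit` fetches are already in flight
        async with semaphore:
            return await fetch_profile(handle)

    # gather returns results in the same order as the input handles
    return list(await asyncio.gather(*(bounded(h) for h in handles)))
```

In a script this would be run with `asyncio.run(fetch_many(["Google", "GoogleAI"]))`; in a notebook cell, `await fetch_many([...])` works directly. Keeping the limit low is a deliberate choice to avoid hammering the remote server.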

# Python

In [None]:
import httpx

def get_guest_token(auth: str):
    """register guest token for auth key"""
    headers_pre = {
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "accept-encoding": "gzip",
        "accept-language": "en-US,en;q=0.5",
        "connection": "keep-alive",
        # `auth` already carries the "Bearer " prefix, so pass it through unchanged
        "authorization": auth,
    }
    # the headers must actually be sent with the request; the original call omitted them,
    # so the response carried no "guest_token" key and the lookup below raised KeyError
    result = httpx.post("https://api.twitter.com/1.1/guest/activate.json", headers=headers_pre)
    result.raise_for_status()
    guest_token = result.json()["guest_token"]  # '1622833653452804096'
    return guest_token

authorization = "Bearer AAAAAAAAAAAAAAAAAAAAAPYXBAAAAAAACLXUNDekMxqa8h%2F40K4moUkGsoc%3DTYfbDKbT3jJPCEVnMYqilB28NHfOPqkca3qaAxGfsyKCs0wRbw"
print(get_guest_token(authorization))
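
Indexing `result.json()["guest_token"]` raises a bare `KeyError` whenever the endpoint rejects the request, which hides what the server actually returned. A minimal sketch of a more readable check follows; `extract_guest_token` is a hypothetical helper, and since the shape of an error payload is not shown in this notebook, it only tests for the presence of the key.

```python
from typing import Any, Mapping


def extract_guest_token(payload: Mapping[str, Any]) -> str:
    """Return the guest token from an activation response, or raise a readable error."""
    token = payload.get("guest_token")
    if not token:
        # include the payload so the failure reason is visible in the traceback
        raise ValueError(f"no guest_token in response: {dict(payload)!r}")
    return str(token)
```

Inside `get_guest_token` it would be called as `extract_guest_token(result.json())`.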