researches

Researches is a Google scraper with minimal requirements.

Key designs:

  • No BeautifulSoup. We want everything to run smoothly, not slowly.
  • Simple API. A great developer experience is all that matters.
  • Typed. We support typing for everything you see.

Note that researches does not clean up data for you, meaning it's better suited for LLM-based content consumption.

search("Who invented papers?")
# Result(snippet=Snippet(…), aside=None, weather=None, web=[Web(…), …], …)

Requirements

  • A decent computer with an Internet connection
  • Python ≥ 3.9 (dataclasses support)
  • primp – 🪞 HTTP connections & fingerprint impersonation.
  • selectolax – 🌯 The HTML parser.
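
Assuming the package is published on PyPI under the repo name (an assumption, not verified here), installation would be:

pip install researches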

Usage

Just start searching right away. Don't worry, Gemini won't hurt you (also gemini).

# Sync code
search(
    "US to Japan",  # query
    hl="en",        # language
    ua=None,        # custom user agent or ours
    **kwargs        # kwargs to pass to primp (optional)
) -> Result
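
As a concrete sketch, assuming the top-level import path matches the repo name:

from researches import search

# Search in English; extra kwargs are forwarded to primp.
result = search("US to Japan", hl="en")

# Print the organic web results (see the Result overview below).
for web in result.web:
    print(web.title, "->", web.url)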

For people who love async, we've also got you covered:

# Async code
await asearch(
    "US to Japan"   # query
    hl="en",        # language
    ua=None,        # custom user agent or ours
    **kwargs        # kwargs to pass to primp (optional)
) -> Result
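
A minimal async sketch under the same assumptions:

import asyncio

from researches import asearch

async def main() -> None:
    # `asearch` mirrors the sync `search` signature.
    result = await asearch("US to Japan", hl="en")
    if result.snippet:
        print(result.snippet.text)

asyncio.run(main())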

So, what does the Result class have to offer? At a glance:

result.snippet?
        .text: str
        .name: str?

result.aside?
       .text: str

result.weather?
       .c: str
       .f: str
       .precipitation: str
       .humidty: str
       .wind_metric: str
       .wind_imperial: str
       .description: str
       .forecast: PartialWeatherForReport[]
                    .weekday: str
                    .high_c: str
                    .low_c: str
                    .high_f: str
                    .low_f: str

result.web: Web[]
             .title: str
             .url: str
             .text: str

result.flights: Flight[]
                 .title: str
                 .description: str
                 .duration: str
                 .price: str

result.lyrics?
       .text: str
       .is_partial: bool
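
Most of these fields are optional (marked with ?), so a defensive-access sketch, reusing the attribute names listed above, might look like this:

from researches import search

result = search("weather in Tokyo")

# Optional fields are None when Google didn't serve that block.
if result.weather:
    print(result.weather.c, "/", result.weather.f, "-", result.weather.description)
    for day in result.weather.forecast:
        print(day.weekday, day.high_c, day.low_c)
else:
    print("No weather card on this results page.")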

Background

Data comes in different shapes and sizes, and Google played it extremely well. That includes randomizing CSS class names, which makes the pages almost impossible to scrape.

Google sucks, but it's actually the knowledge base we all need. Say, there are these types of result pages:

  • Links – What made Google, "Google." Or, &udm=14.
  • Weather – Weather forecast.
  • Wikipedia (aside) – Wikipedia text.
  • Flights – Flights.
  • Lyrics – Both full and partial lyrics. (unstable)

...and many more. (Contribute!)

Scraper APIs out there are hella expensive, and ain't no way I'm paying or entering their free tier. So, I made my own that's perfect for extracting data with LLMs.
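
As an illustration, a hypothetical helper (not part of the library) that flattens web results into prompt context for an LLM might look like:

from researches import search

def to_context(query: str, limit: int = 5) -> str:
    # Flatten the top web results into plain text that can be
    # pasted into an LLM prompt as grounding context.
    result = search(query)
    return "\n\n".join(
        f"{web.title}\n{web.url}\n{web.text}" for web in result.web[:limit]
    )

print(to_context("Who invented papers?"))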

Other projects

If you're looking for something other than Google, or something more general-purpose, check these out:

  • air_web – A lightweight package for crawling with minimal code.
  • ddginternal – Simple DuckDuckGo scraper.

(c) 2024 AWeirdDev, sus2790, and other silly people