# kibsu
> ## kibsu (Akkadian)
> 
> [Transport → Surface]
> 
> a footprint... a trail , a path , a route... a way of calculation... a line of reasoning...

_[`Kibsu` on assyrianlanguages.org](https://www.assyrianlanguages.org/akkadian/dosearch.php?searchkey=7840&language=id)_

Kibsu is an experimental high-level HTML/XML scraping framework in Python that lets you scrape by semantics, not div classes. It is built on XPath, but adds several advantages:
1. Readability
2. DRY Compositional Logic
3. Higher-level logic
4. Cross-site reusability (with luck!)

The ultimate **goal** of Kibsu is to let you write a single scraper and deploy it across many sites that have different HTML but similar content.

For this demo, we will scrape the following sites, which all have a similar semantic structure (dates and links) but have differing HTML and CSS classes, making them laborious to scrape.

+ https://www.markey.senate.gov/news/press-releases
+ https://www.young.senate.gov/newsroom/press-releases/
+ https://www.mcconnell.senate.gov/public/index.cfm/pressreleases
+ https://www.crapo.senate.gov/media/newsreleases
+ https://www.wyden.senate.gov/news/blog

In [21]:
press_release_urls = [
    "https://www.markey.senate.gov/news/press-releases",
    "https://www.young.senate.gov/newsroom/press-releases/",
    "https://www.mcconnell.senate.gov/public/index.cfm/pressreleases",
    "https://www.crapo.senate.gov/media/newsreleases",
    "https://www.wyden.senate.gov/news/blog",
]

## Resolvers
The building block of Kibsu is the `Resolver`. Every `Resolver` has these members:
1. A `resolve` method, which takes an lxml etree element and returns a value of type `Any`. This is the function that extracts the value of interest from the XML/HTML.
2. A `does_resolve` method, which takes an lxml etree element and returns a `bool`. It tells Kibsu whether the resolver found a match of interest. `does_resolve` can simply check whether `resolve` returns `None` or raises an error, but it can also do more, such as data validation.
3. An `include` property, which tells Kibsu whether the value from `resolve` should be included in the output.
4. A `resolver_type` property holding the name of the resolver's type.
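
To make the interface concrete, here is a minimal sketch of a class with those four members. `TagTextResolver` is a hypothetical name and not part of Kibsu, and the standard library's `xml.etree.ElementTree` stands in for lxml so the snippet runs without extra dependencies.

```python
import xml.etree.ElementTree as ET  # stand-in for lxml.etree in this sketch
from typing import Optional


class TagTextResolver:
    """Hypothetical resolver: returns the stripped text of an element with a given tag."""

    resolver_type = "tag_text"

    def __init__(self, tag: str, include: bool = True) -> None:
        self.tag = tag
        self.include = include

    def resolve(self, element: ET.Element) -> Optional[str]:
        # Extract the value of interest, or None when there is nothing to extract.
        if element.tag != self.tag:
            return None
        text = (element.text or "").strip()
        return text or None

    def does_resolve(self, element: ET.Element) -> bool:
        # Simplest possible check: resolve() returned a value and did not raise.
        try:
            return self.resolve(element) is not None
        except Exception:
            return False


heading = ET.fromstring("<h2> Press Releases </h2>")
resolver = TagTextResolver("h2")
print(resolver.does_resolve(heading), resolver.resolve(heading))
```

A stricter `does_resolve` could also validate the extracted value, for example by rejecting text that is too short to be a headline.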

Resolvers are very flexible, and can even be composed. Here is a composed `Resolver` which looks for the first link which is a descendant of the current element:

In [22]:
from core import HasDescendantResolver, LinkOrFragmentResolver


has_descendant_with_link = HasDescendantResolver(
    {
        'link': LinkOrFragmentResolver(),
    }
)

Resolvers can apply logic and take arbitrary functions. This composition of resolvers finds a descendant of the current element that contains a date in one of three formats. The `resolve` method calls the parser built by `date_format_to_iso_parser` for the matching format and returns an ISO-format date string, like this: `{'pub_date': '2024-03-08T00:00:00'}`.

In [23]:

from core import FunctionResolver, OneSubResolver, date_format_to_iso_parser


has_descendant_with_date = HasDescendantResolver(
    resolvers={
        'pub_date': OneSubResolver(
            resolvers={
                '%B %d, %Y': FunctionResolver(date_format_to_iso_parser('%B %d, %Y')),
                '%m/%d/%y': FunctionResolver(date_format_to_iso_parser('%m/%d/%y')),
                '%A, %B %d, %Y': FunctionResolver(date_format_to_iso_parser('%A, %B %d, %Y')),
            }
        ),
    },
)
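
The date handling above can be approximated with the standard library alone. The sketch below is not Kibsu's implementation: `make_iso_parser` and `first_match` are hypothetical helpers, and it assumes the first format that parses wins and that month and weekday names are in English.

```python
from datetime import datetime
from typing import Callable, Iterable, Optional


def make_iso_parser(date_format: str) -> Callable[[str], Optional[str]]:
    """Build a parser that turns text in `date_format` into an ISO string, or None."""

    def parse(text: str) -> Optional[str]:
        try:
            return datetime.strptime(text.strip(), date_format).isoformat()
        except ValueError:
            # The text is not in this format.
            return None

    return parse


def first_match(text: str, parsers: Iterable[Callable[[str], Optional[str]]]) -> Optional[str]:
    """Return the result of the first parser that accepts the text."""
    for parse in parsers:
        result = parse(text)
        if result is not None:
            return result
    return None


parsers = [make_iso_parser(fmt) for fmt in ("%B %d, %Y", "%m/%d/%y", "%A, %B %d, %Y")]
print(first_match("March 8, 2024", parsers))          # 2024-03-08T00:00:00
print(first_match("03/08/24", parsers))               # 2024-03-08T00:00:00
print(first_match("Friday, March 8, 2024", parsers))  # 2024-03-08T00:00:00
print(first_match("no date here", parsers))           # None
```

Returning `None` instead of raising keeps the "did this resolve?" check cheap, which matters when the same parser is tried against every element in a page.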

The real fun begins with `DeepestResolver`, a resolver that `does_resolve` if the element has children that resolve the sub-`Resolver`s, but no single child itself `does_resolve` all of the sub-`Resolver`s. As a consequence, `DeepestResolver` `does_resolve` at the deepest part of the element tree that contains your pattern of interest, which makes it easy to collect many matches at once.

In [24]:
from core import DeepestResolver


press_release = DeepestResolver(
    resolvers={
        "link_descendant": has_descendant_with_link,
        "date_descendant": has_descendant_with_date,
    },
    children_only=True,
)
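
The "deepest match" idea can be illustrated independently of Kibsu. In this sketch, `has_descendant`, `matches`, and `deepest_matches` are hypothetical helpers, the markup is made up, and `xml.etree.ElementTree` again stands in for lxml. An element is kept when it contains both a link and a date but none of its children does.

```python
import xml.etree.ElementTree as ET  # stand-in for lxml.etree in this sketch

PAGE = """
<div>
  <ul>
    <li id="1"><a href="/a">First release</a><time>March 8, 2024</time></li>
    <li id="2"><a href="/b">Second release</a><time>March 9, 2024</time></li>
    <li id="3"><a href="/c">Link without a date</a></li>
  </ul>
</div>
"""


def has_descendant(element: ET.Element, tag: str) -> bool:
    # iter() also yields the element itself, so exclude it explicitly.
    return any(d is not element for d in element.iter(tag))


def matches(element: ET.Element) -> bool:
    # The pattern of interest: a link and a date somewhere below this element.
    return has_descendant(element, "a") and has_descendant(element, "time")


def deepest_matches(root: ET.Element) -> list:
    # Keep elements that match while none of their direct children match.
    return [
        element
        for element in root.iter()
        if matches(element) and not any(matches(child) for child in element)
    ]


root = ET.fromstring(PAGE)
print([element.get("id") for element in deepest_matches(root)])  # ['1', '2']
```

The outer `<div>` and `<ul>` also contain links and dates, but each has a child that matches, so only the individual `<li>` items survive.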

## Kibsu (The Class)
Just like that, we've almost written a working press release scraper for several senators, without even looking at the specific HTML on their websites. The scraper works on all of these sites, even though they have different HTML and classes.

In [25]:
from core import Kibsu

get_press_releases = Kibsu(
    resolvers={"press_release": press_release},
)

In [26]:
from lxml import etree
import requests
from pprint import pprint

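all_press_releases = []
for url in press_release_urls:
    print(url)

    # Fetch the page and parse it into an lxml element tree.
    r = requests.get(url)
    tree = etree.ElementTree(etree.HTML(r.content))

    # Run the scraper, collect the results, and show one sample per site.
    press_releases = get_press_releases(tree)
    all_press_releases.extend(press_releases)
    pprint(press_releases[2])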

https://www.markey.senate.gov/news/press-releases
{'press_release': {'date_descendant': {'pub_date': '2024-03-15T00:00:00'},
                   'link_descendant': {'link': {'text': 'Markey, Merkley, '
                                                        'Colleagues: Don’t '
                                                        'Leave Flight Crews '
                                                        'Behind, Protect '
                                                        'Workers’ Right to '
                                                        'Pump at Work',
                                                'url': 'https://www.markey.senate.gov/news/press-releases/markey-merkley-colleagues-dont-leave-flight-crews-behind-protect-workers-right-to-pump-at-work'}}}}
https://www.young.senate.gov/newsroom/press-releases/
{'press_release': {'date_descendant': {'pub_date': '2024-03-13T00:00:00'},
                   'link_descendant': {'link': {'text': 'Young, Colleagues '
        

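Each result is a nested dictionary keyed by resolver name, so reaching a value means walking a path of keys. A minimal sketch of that lookup, using a hypothetical `flatten` helper and a made-up record (neither is part of Kibsu):

```python
from typing import Any, Dict, List


def flatten(record: Dict[str, Any], paths: Dict[str, List[str]]) -> Dict[str, Any]:
    """Build a flat dict by following each key path into the nested record."""
    flat = {}
    for name, path in paths.items():
        value = record
        for key in path:
            value = value[key]  # raises KeyError if the path is missing
        flat[name] = value
    return flat


# Made-up record shaped like the scraper output above.
record = {
    "press_release": {
        "date_descendant": {"pub_date": "2024-03-15T00:00:00"},
        "link_descendant": {
            "link": {"text": "Example headline", "url": "https://example.com/release"}
        },
    }
}

paths = {
    "date": ["press_release", "date_descendant", "pub_date"],
    "link_url": ["press_release", "link_descendant", "link", "url"],
    "link_text": ["press_release", "link_descendant", "link", "text"],
}
print(flatten(record, paths))
```
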
In [27]:
from core import FlatMapper

flat_mapper = FlatMapper({
    'date': ['press_release', 'date_descendant', 'pub_date'],
    'link_url': ['press_release', 'link_descendant', 'link', 'url'],
    'link_text': ['press_release', 'link_descendant', 'link', 'text'],
})
pprint([flat_mapper(pr) for pr in all_press_releases])

[{'date': '2024-03-15T00:00:00',
  'link_text': 'Senator Markey, Sunrise Movement Talk the Power of Organizing, '
               'American Climate Corps on Latest Episode of the “The Ed Markey '
               'Podcast”',
  'link_url': 'https://www.markey.senate.gov/news/press-releases/senator-markey-sunrise-movement-talk-the-power-of-organizing-american-climate-corps-on-latest-episode-of-the-the-ed-markey-podcast'},
 {'date': '2024-03-15T00:00:00',
  'link_text': 'Markey, Schumer, Colleagues Demand MOHELA Remedy Harms to '
               'Millions of Borrowers Following Egregious Business Practices',
  'link_url': 'https://www.markey.senate.gov/news/press-releases/markey-schumer-colleagues-demand-mohela-remedy-harms-to-millions-of-borrowers-following-egregious-business-practices'},
 {'date': '2024-03-15T00:00:00',
  'link_text': 'Markey, Merkley, Colleagues: Don’t Leave Flight Crews Behind, '
               'Protect Workers’ Right to Pump at Work',
  'link_url': 'https://www.markey.se