In [65]:
import aiohttp 
import asyncio

This is the first component of the project advertised on my website.

The RESOURCE_EXTRACTOR takes a list of manually clustered resources (manual clustering ensures quality) and extracts the HTML from them.

The next update to the extractor will improve how the url_assigner handles resource grouping, so that 'self.output' no longer has to be initialised manually for each resource cluster.
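
One way the grouping improvement described above might look, as a minimal sketch rather than the planned implementation: derive both a URL-to-cluster lookup and the empty output buffers from the cluster dictionary, so nothing is initialised by hand. The helper names `build_url_index` and `init_output`, and the example clusters and URLs, are hypothetical.

```python
from typing import Dict, List


def build_url_index(all_resources: Dict[str, List[str]]) -> Dict[str, str]:
    """Map every URL to the name of the cluster it belongs to."""
    index: Dict[str, str] = {}
    for cluster, urls in all_resources.items():
        for url in urls:
            index[url] = cluster
    return index


def init_output(all_resources: Dict[str, List[str]]) -> Dict[str, str]:
    """Create one empty text buffer per cluster, derived from the cluster names."""
    return {cluster: "" for cluster in all_resources}


# Hypothetical example data; the URLs are placeholders, not real sources.
resources = {
    "cluster_a": ["https://example.org/a1", "https://example.org/a2"],
    "cluster_b": ["https://example.org/b1"],
}
url_index = build_url_index(resources)
output = init_output(resources)
```

With this layout, adding a new cluster only requires a new entry in the resource dictionary; the lookup and the output buffers follow from it.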

Note that the design below is intended for a program that will eventually handle sources from a number of different websites.

The weekly updates will include:
- Precise terminology and commented, improved functions (accepted input types, output types, etc.).
- A parser able to extract dates and structure the text.
- A processing pipeline (PySpark).
- A data storage mechanism (Pydoop).
(I do not have a server, so I will run the Spark and Hadoop servers locally.)
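
As a rough idea of what the date-extracting parser in the list above could do, here is a minimal sketch using only the standard library. It assumes that dates appear in the text as "Month D, YYYY"; that format is an assumption and has not been checked against the actual pages. The names `TextCollector` and `extract_dates` and the sample HTML are hypothetical.

```python
import re
from datetime import date
from html.parser import HTMLParser
from typing import List

MONTHS = [
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
]
# Assumed date format: "May 31, 2024".
DATE_PATTERN = re.compile(r"\b(" + "|".join(MONTHS) + r") (\d{1,2}), (\d{4})\b")


class TextCollector(HTMLParser):
    """Collect visible text nodes, skipping <script> and <style> content."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: List[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def extract_dates(html: str) -> List[date]:
    """Return the dates found in the visible text, in order of appearance."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    found: List[date] = []
    for month, day, year in DATE_PATTERN.findall(text):
        try:
            found.append(date(int(year), MONTHS.index(month) + 1, int(day)))
        except ValueError:
            # Skip impossible dates such as "February 30, 2024".
            continue
    return found


# Hypothetical sample input.
sample = (
    "<html><body><h1>Update, May 31, 2024</h1>"
    "<p>Earlier report: January 2, 2024.</p>"
    "<script>var d = 'March 1, 2020';</script></body></html>"
)
dates = extract_dates(sample)
```

The extracted dates could later serve as keys for splitting each page into dated entries.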

Once all of the above is complete, I will try to model the data and see whether it is possible to extract insights about the evolution of this war (a small preview of what I have in mind: _*Mathematics and Politics: Strategy, Voting, Power, and Proof*_ by Alan D. Taylor and Allison M. Pacelli, available [here](https://link.springer.com/book/10.1007/978-0-387-77645-3)).

In [52]:
class RESOURCE_EXTRACTOR:
    
    def __init__(self):
        # Manually curated pages that belong to the same resource cluster.
        self.russo_ukrainian_war_sources = [
            'https://www.understandingwar.org/backgrounder/ukraine-conflict-updates-2022',
            'https://www.understandingwar.org/backgrounder/ukraine-conflicts-updates-january-2-may-31-2024',
            'https://www.understandingwar.org/backgrounder/ukraine-conflict-updates']
        self.all_resources = {'ISW_Russia_Ukraine_War': self.russo_ukrainian_war_sources}
        # One text buffer per cluster; keys must match those of all_resources.
        self.output = {'ISW_Russia_Ukraine_War': ''}

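    def url_assigner(self, url):
        # Return the name of the resource cluster that contains this URL.
        for key, value_list in self.all_resources.items():
            if url in value_list:
                return str(key)
        # Unknown URLs get a marker string, which is not a key of self.output.
        return f'{url} : NOT IDENTIFIED'
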
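    async def text_extractor(self, session, url):
        key = self.url_assigner(url)
        async with session.get(url) as response:
            # Await the body first: `self.output[key] += await ...` reads the
            # old value before the await, so concurrent tasks could overwrite
            # each other's pages.
            text = await response.text()
        self.output[key] += text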

    async def run_text_extractor(self):
        async with aiohttp.ClientSession() as session:
            # One download task per URL, across all resource clusters.
            tasks = [
                self.text_extractor(session, resource_page)
                for resource_list in self.all_resources.values()
                for resource_page in resource_list
            ]
            await asyncio.gather(*tasks)
            return self.output
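
The concurrent download step can be tried out without network access. The sketch below is only an illustration, not part of the extractor: `fake_fetch` and `fetch_cluster` are hypothetical stand-ins, and the delays simulate pages that arrive out of order. It relies on the fact that `asyncio.gather` returns results in the order the awaitables were passed in, so joining the pages after the gather keeps the page order stable regardless of which download finishes first.

```python
import asyncio
from typing import List, Tuple


async def fake_fetch(url: str, delay: float) -> str:
    """Stand-in for an HTTP request: wait, then return a marker string."""
    await asyncio.sleep(delay)
    return f"<html>{url}</html>"


async def fetch_cluster(urls_with_delays: List[Tuple[str, float]]) -> str:
    """Fetch all pages concurrently and join them in the original URL order."""
    pages = await asyncio.gather(
        *(fake_fetch(url, delay) for url, delay in urls_with_delays)
    )
    return "".join(pages)


# "page-2" finishes first, yet "page-1" still comes first in the result.
result = asyncio.run(fetch_cluster([("page-1", 0.03), ("page-2", 0.01)]))
```

`asyncio.run` is meant for a plain script; inside a notebook, where an event loop is already running, the coroutine would be awaited directly instead.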

In [63]:
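# The class defined above is RESOURCE_EXTRACTOR; ISW_EXTRACTOR is not defined.
extractor = RESOURCE_EXTRACTOR()
# Top-level await works here because the notebook already runs an event loop.
output = await extractor.run_text_extractor()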

In [64]:
output

