In [1]:
import aiohttp 
import asyncio

from bs4 import BeautifulSoup
import datetime
import dateutil.parser  # import the submodule explicitly so dateutil.parser.parse is available

# Introduction 

This is the first component of the project advertised on my website.

The `ResourceExtractor` takes a list of manually clustered resources (manual clustering ensures quality) and extracts the HTML from them.

The next update to the extractor will improve how resources are grouped under `url_assigner`, so that it won't be necessary to initialise `self.output` manually for each resource cluster.

Note that the design below is meant for a program that will eventually handle sources from several different websites.

The weekly updates will include:
- Precise terminology, plus commented and improved functions (accepted data types, output types, etc.).
- A parser able to extract dates and structure the text.

**Soon:**

- A processing pipeline (PySpark).
- A data storage mechanism (Pydoop).

(I do not have a server, so I will start the Spark and Hadoop servers locally.)

Once all of the above is complete, I will try to model the data and see whether it is possible to extract some insights about the evolution of this war (a small preview of what I have in mind: *Mathematics and Politics: Strategy, Voting, Power, and Proof* by Alan D. Taylor and Allison M. Pacelli, available [here](https://link.springer.com/book/10.1007/978-0-387-77645-3)).

# Section I - Retrieving HTML 

The purpose of the `ResourceExtractor` is to navigate the ISW website pages and retrieve their HTML. The result is stored in the variable `output`. Note that the code is defined in a general way: `self.all_resources` is a dictionary that allows multiple resources to be integrated. This does not mean that the code can be extended to arbitrary resources. For that functionality to be available, it would be necessary to devise a generalised `TextParser` (defined in Section II). Since the present project is an experiment with certain data engineering tools more than a scraper, such a generalisation **will be tackled upon completion**.

In [14]:
class ResourceExtractor:
    
    def __init__(self):
        self.russo_ukranian_war_sources = [
            'https://www.understandingwar.org/backgrounder/ukraine-conflict-updates',
            'https://www.understandingwar.org/backgrounder/ukraine-conflicts-updates-january-2-may-31-2024',
            ]
        self.all_resources = {'ISW_Russia_Ukraine_War': self.russo_ukranian_war_sources}
        self.output = {key:'' for key in self.all_resources}
    
    def url_assigner(self, url):
        for key, value_list in self.all_resources.items():
            if url in value_list:
                return str(key)
        return f'{url} : NOT IDENTIFIED'
        
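    async def text_extractor(self, session, url):
        key = self.url_assigner(url)
        async with session.get(url) as response:
            await asyncio.sleep(1.5)
            if response.status == 200:
                # Await the body first: `self.output[key] += await ...` reads the old
                # value before awaiting, so concurrent tasks could overwrite each other.
                html = await response.text()
                self.output[key] += html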

    async def run_text_extractor(self):
        async with aiohttp.ClientSession() as session:
            tasks = [self.text_extractor(session, resource_page) for resource_list in self.all_resources.values() for resource_page in resource_list]
            await asyncio.gather(*tasks)
            return self.output

In [15]:
extractor = ResourceExtractor()
output  = await extractor.run_text_extractor()
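
The extractor above appends each page to `self.output[key]` as its response arrives, so the order of the pages inside a cluster depends on which request finishes first. The sketch below shows one possible alternative, not the project's actual design: it relies on `asyncio.gather` returning results in the order the tasks were passed in, and uses a `defaultdict(str)` so that no cluster needs to be initialised by hand. `fake_fetch`, `gather_by_cluster` and `demo_resources` are hypothetical names, and `fake_fetch` only simulates a request (no network access).

```python
import asyncio
from collections import defaultdict


async def fake_fetch(url: str, delay: float) -> str:
    """Hypothetical stand-in for an HTTP request: returns fake HTML after a delay."""
    await asyncio.sleep(delay)
    return f"<p>{url}</p>"


async def gather_by_cluster(all_resources: dict[str, list[str]]) -> dict[str, str]:
    # Flatten to (cluster_key, url) pairs so each page already knows its cluster.
    pairs = [(key, url) for key, urls in all_resources.items() for url in urls]
    # Later URLs get shorter delays, so they finish first.
    delays = [0.01 * (len(pairs) - i) for i in range(len(pairs))]
    # gather returns results in the order the awaitables were passed in,
    # regardless of which one completes first.
    pages = await asyncio.gather(
        *(fake_fetch(url, delay) for (_, url), delay in zip(pairs, delays))
    )
    output = defaultdict(str)  # clusters are created on first use
    for (key, _), html in zip(pairs, pages):
        output[key] += html
    return dict(output)


# Hypothetical clusters used only for this demonstration.
demo_resources = {"cluster_a": ["u1", "u2"], "cluster_b": ["u3"]}
result = asyncio.run(gather_by_cluster(demo_resources))
print(result)
```

Inside a notebook, where an event loop is already running, `asyncio.run(...)` would be replaced by `await gather_by_cluster(demo_resources)`.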

# Section II - Retrieving Textual Elements

In this section I define a class called `TextParser`. Note that the class is tailored to the website being considered and cannot be applied to an arbitrary website.

There are two main options for generalising the methods of this class, depending on what is meant by "generalisation".
#### generalisation = integration of multiple resources
In this case it suffices to consider a finite set of websites and design methods that exploit the commonalities between them, or that change depending on the website. The scraper would then be generalised to n resources (whereas currently it can be applied to only one).
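
A minimal sketch of the "methods that change depending on the website" idea, under stated assumptions: each resource key maps to a small set of rules, and one shared extraction function reads those rules. `SITE_RULES`, `TagTextCollector`, `extract_paragraphs` and the `hypothetical_other_site` entry are all invented for illustration. The sketch uses the standard library's `html.parser` instead of BeautifulSoup only to stay self-contained.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collects the text found inside every occurrence of one tag name."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0      # > 0 while we are inside the target tag
        self.chunks = []    # one string per top-level occurrence of the tag

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1
            if self.depth == 1:
                self.chunks.append("")

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks[-1] += data


# Hypothetical per-site rules; only the first key exists in the notebook.
SITE_RULES = {
    "ISW_Russia_Ukraine_War": {"paragraph_tag": "p"},
    "hypothetical_other_site": {"paragraph_tag": "li"},
}


def extract_paragraphs(resource_key, html):
    """Pick the rules for this resource and extract its text blocks."""
    rules = SITE_RULES[resource_key]
    collector = TagTextCollector(rules["paragraph_tag"])
    collector.feed(html)
    collector.close()
    return collector.chunks


isw_demo = extract_paragraphs(
    "ISW_Russia_Ukraine_War", "<p>July 22</p><p>Some <b>text</b>.</p>"
)
other_demo = extract_paragraphs(
    "hypothetical_other_site", "<ul><li>a</li><li>b</li></ul><p>skip</p>"
)
```

Adding a website then means adding one entry to the rules table, provided its structure can be described by the same kind of rule.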

#### generalisation = widespread applicability
In this case we would like a set of methods that apply to any website. To do this it is necessary to devise an intelligent (or adaptive) program. Personally, I see an opportunity for Bayesian classifiers, but we'll see once the "important" parts of the project are completed: we're here to use PySpark and Hadoop!

In [41]:
"""In this version I have systematised the code uploaded this morning (2024-07-23) and prepared
it for integration with pydoop, which will be done tomorrow (2024-07-24).
The present code was last updated at 21:58."""

class TextParser:
    def __init__(self, resource_dictionary):
        self.resource_dictionary = resource_dictionary
        self.date_paragraph_map = {}

    def page_dissecter(self):
        soups = []
        for html_text in self.resource_dictionary.values():
            soup = BeautifulSoup(html_text, 'html.parser')
            soups.append(soup)
        return [[p_tag.text for p_tag in soup.find_all('p')] for soup in soups]

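    def is_date(self, text):
        # Returns (True, datetime) when the text parses as a date, else (False, None).
        try:
            possible_date = dateutil.parser.parse(text)
            return True, possible_date
        except (ValueError, OverflowError):
            # dateutil raises ParserError (a ValueError subclass) for non-dates.
            return False, None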

    def shuffler(self):
        resource_p_tags = self.page_dissecter()
        current_date = None
        for resource in resource_p_tags:
            for paragraph in resource:
                is_date, parsed_date = self.is_date(paragraph)
                if is_date:
                    current_date = parsed_date
                elif current_date:
                    if current_date not in self.date_paragraph_map:
                        self.date_paragraph_map[current_date] = ""
                    self.date_paragraph_map[current_date] += paragraph
        return self.date_paragraph_map


In [42]:
tp = TextParser(output)
s = tp.shuffler()
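
To make the grouping logic of `shuffler` easier to follow, here is a toy version that runs on a plain list of strings. It is a sketch under assumptions, not the parser used above: it swaps `dateutil.parser.parse` for `datetime.strptime` with a single assumed heading format (`ASSUMED_FORMAT`), which is stricter and so less likely to mistake an ordinary short paragraph for a date. It also keeps the paragraphs in a list instead of concatenating them. `parse_heading`, `group_by_date` and the sample data are hypothetical.

```python
from datetime import datetime

# Assumption: date headings look like "July 22, 2024".
ASSUMED_FORMAT = "%B %d, %Y"


def parse_heading(text):
    """Return a datetime if the text is a date heading, otherwise None."""
    try:
        return datetime.strptime(text.strip(), ASSUMED_FORMAT)
    except ValueError:
        return None


def group_by_date(paragraphs):
    """Attach every paragraph to the most recent date heading seen before it."""
    grouped = {}
    current = None
    for paragraph in paragraphs:
        parsed = parse_heading(paragraph)
        if parsed is not None:
            current = parsed  # a new date block starts here
        elif current is not None:
            grouped.setdefault(current, []).append(paragraph)
        # paragraphs that appear before the first date are dropped, as in shuffler
    return grouped


sample = [
    "Intro text before any date",
    "July 22, 2024",
    "First.",
    "Second.",
    "July 21, 2024",
    "Older.",
]
grouped = group_by_date(sample)
```

Here `grouped` maps `datetime(2024, 7, 22)` to `["First.", "Second."]` and `datetime(2024, 7, 21)` to `["Older."]`, while the introductory paragraph is discarded.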

In [54]:
# I slice the output to show the final results since GitHub cannot render them
sliced_dict = {}
count = 0
K = 1000
for key, value in tp.date_paragraph_map.items():
    count += 1
    sliced_dict[key] = value[:K] + '....CONTINUES'
    if count >= 3:
        break
sliced_dict

{datetime.datetime(2024, 7, 22, 19, 30): 'Note: The data cut-off for this product was 1:30pm ET on July 22. ISW will cover subsequent reports in the July 23 Russian Offensive Campaign Assessment.Russia and North Korea are pursuing increased cooperation in the judicial sphere.\xa0Russian Prosecutor General Igor Krasnov arrived in Pyongyang, North Korea and met with his North Korean counterpart Kim Chol Won on July 22, marking the first time that a Russian Prosecutor General has visited North Korea.[1]\xa0Krasnov and Kim reportedly discussed avenues for continued cooperation and signed an agreement for joint work between the Russian and North Korean prosecutor generals\' offices for 2024–2026.[2]\xa0The Russian and North Korean prosecutor general\'s offices have notably maintained dialogue since 2010 through a separate cooperation agreement, but the new agreement will likely be much more focused in scope, reflecting intensified Russo–North Korean cooperation over the past year.[3]\xa0Kra