# Assignment 2

Visit the "Publications" section of the [Hugo Steinhaus Center website](http://prac.im.pwr.wroc.pl/~hugo/HSC/hsc.html). Scrape the data on research papers from that site and generate a cooperation network of authors in the following way:

1. Members of HSC are the nodes.
2. The size of a node is proportional to the number of papers co-authored by the node.
3. A link between two nodes means a paper written together by the corresponding members.
4. A weight of the link indicates the total number of common papers.

Visualize the network (with names and link weights). Detect the connected components.

...

The first thing to do is to request the resource from the webpage. For this we will use the `requests` library.

In [1]:
import requests

And define the [URL](https://en.wikipedia.org/wiki/URL) which we will be requesting.

In [2]:
URL = r"http://prac.im.pwr.wroc.pl/~hugo/HSC/Publications.html"

We fetch the page with an HTTP `GET` request, so we call `requests.get`.

In [3]:
try:
    response = requests.get(URL, timeout=5)
    response.raise_for_status()  # Raise appropriate error in case the request is unsuccessful
except requests.exceptions.HTTPError as http_err:
    print(f"HTTP error occurred: {http_err}")
except Exception as err:
    print(f"Other error occurred: {err}")

We can also check the status code of the response:

In [4]:
response.status_code

200

To avoid unnecessary encoding problems, let's check the encoding and change it to UTF-8 if it is not already:

In [5]:
response.encoding 

'ISO-8859-1'

In [6]:
response.encoding = 'utf-8'
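
To see why the declared encoding matters, here is a small offline sketch (no request is made; the sample name is one that appears in the scraped data). It decodes UTF-8 bytes with the ISO-8859-1 codec reported above, which is roughly what reading `response.text` would do without the correction. Note that `response.encoding` only affects `response.text`; `response.content` always returns the raw bytes.

```python
name = "Wyłomańska"
raw = name.encode("utf-8")  # bytes as a UTF-8 encoded page would deliver them

# Decoding with the wrong codec raises no error, it silently yields mojibake
garbled = raw.decode("iso-8859-1")
# Decoding with the right codec recovers the original text
fixed = raw.decode("utf-8")

print(repr(garbled))
print(fixed)
```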

To parse the HTML text into a document tree we will use the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) library.

In [7]:
import bs4
from bs4 import BeautifulSoup

In [8]:
html = BeautifulSoup(response.content)

If we look at the site, we see that the data we are interested in is in the "Research Papers" section. There, for each year separately, is a list of papers with their authors and titles. Even better, the names of staff members are set in bold.

In [9]:
research_papers = html.find(attrs={'name': '#Research papers'})

Going two levels up in the hierarchy, we reach the `<h2>` tag, which is a sibling of the lists containing the articles.

In [10]:
h2 = research_papers.parent.parent

We have to be careful to stop at the next `<h2>` tag. Otherwise we would include the "Research reports" section, which we are not interested in.

We will store the scraped data in a dictionary that maps each year to a list of articles. Each article is itself a dictionary with three keys:
1. `authors` - all authors of the paper.
2. `local_authors` - the authors of the paper who are HSC staff.
3. `title` - the title of the paper.

We will use two additional libraries:
1. `collections`, whose `collections.defaultdict` makes the code more readable.
2. `regex`, for deciding which category a piece of text belongs to.

In [11]:
import collections
import regex

We also define a `content_type` function, which uses regular expressions to assign each piece of text to one of four categories: *name*, *year*, *title* or *other*.

In [13]:
def content_type(string: str) -> str:
    """Classify a text fragment as 'name', 'year', 'title' or 'other'."""
    string = string.strip()
    if regex.match(r'^(\pL[.][ ]?)+\pL[a-z]+', string):
        return 'name'
    elif regex.match(r'^[\d]{4}$', string):
        return 'year'
    elif regex.match(r'^".+"$', string):
        return 'title'
    else:
        return 'other'

Example usage is illustrated below:

In [15]:
content_type('A. B. Aak')

'name'

In [16]:
content_type('1234')

'year'

In [17]:
content_type('"aloha"')

'title'

In [18]:
content_type('Alice has a cat.')

'other'
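
The `\pL` class (any Unicode letter) is specific to the third-party `regex` package. As a rough sketch, assuming only the standard library is available, the same classification can be approximated with `re`, where `[^\W\d_]` matches a single Unicode letter. The name `content_type_stdlib` is hypothetical and not part of the original notebook.

```python
import re

LETTER = r"[^\W\d_]"  # a word character that is neither a digit nor "_"
NAME_RE = re.compile(rf"({LETTER}[.][ ]?)+{LETTER}[a-z]+")
YEAR_RE = re.compile(r"\d{4}$")
TITLE_RE = re.compile(r'".+"$')


def content_type_stdlib(string: str) -> str:
    """Classify a text fragment using only the standard `re` module."""
    string = string.strip()
    if NAME_RE.match(string):  # one or more initials followed by a surname
        return "name"
    if YEAR_RE.match(string):  # exactly four digits
        return "year"
    if TITLE_RE.match(string):  # text wrapped in double quotes
        return "title"
    return "other"


for sample in ["A. B. Aak", "Ł. Bielak", "1234", '"aloha"', "Alice has a cat."]:
    print(sample, "->", content_type_stdlib(sample))
```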

We also define a `process_list` function which, given an `<ol>` tag, processes each item within it and returns a list of articles.

In [20]:
def process_list(ol):  # ol - ordered list (HTML tag)
    list_data = []
    for li in ol.find_all('li'):
        article_data = collections.defaultdict(list)
        for font in li.find_all('font'):
            content = font.string
            if not content:  # Skip empty tags so no stale values are reused
                continue
            type_ = content_type(content)
            processed_content = content.replace(' ', '')  # Drop all spaces
            if type_ == 'name':
                article_data['authors'].append(processed_content)
                if content.parent.name == 'b':  # Bold names are HSC staff
                    article_data['local_authors'].append(processed_content)
            elif type_ == 'title':
                article_data['title'] = processed_content
        list_data.append(article_data)
    return list_data

Finally we can process all the data:

In [21]:
data = collections.defaultdict(list)
for sibling in h2.next_siblings:
    if isinstance(sibling, bs4.Tag): # Process only valid tags
        if sibling.name == 'h2':
            break  # Break at "Research reports"
            
        try:
            year = sibling['name']  # Update year of publication
        except KeyError:
            pass
        
        if sibling.name == 'ol':
            data[year].extend(process_list(sibling))

In [22]:
data

defaultdict(list,
            {'2020': [defaultdict(list,
                          {'authors': ['A.Grzesiek', 'A.Wyłomańska'],
                           'local_authors': ['A.Grzesiek', 'A.Wyłomańska'],
                           'title': '"SubordinatedProcesseswithInfiniteVariance"'}),
              defaultdict(list,
                          {'authors': ['A.Michalak',
                            'J.Wodecki',
                            'A.Wyłomańska',
                            'R.Zimroz'],
                           'local_authors': ['A.Wyłomańska'],
                           'title': '"InfluenceofSignaltoNoiseRatioontheEffectivenessofCointegrationAnalysisforVibrationSignal"'}),
              defaultdict(list,
                          {'authors': ['P.Poczynek',
                            'P.Kruczek',
                            'A.Wyłomańska'],
                           'local_authors': ['P.Kruczek', 'A.Wyłomańska'],
                           'title': '"Ornstein-UhlenbeckProc

In [None]:
POLISH_LETTERS_PROJECTION = {
    'ą': 'a'
#     ...
}
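
The scraped names are not spelled consistently: the same person can appear with and without diacritics (for example `A.Wyłomańska` and `A.Wylomanska`), which would split one author into two nodes. One way to merge such variants, sketched here under the assumption that plain ASCII spellings are acceptable as node labels, is to translate every Polish letter to its base Latin letter. The names `POLISH_TO_ASCII` and `normalize_name` are illustrative and not part of the original notebook.

```python
# Translation table covering the nine Polish diacritic letters in both cases
POLISH_TO_ASCII = str.maketrans({
    'ą': 'a', 'ć': 'c', 'ę': 'e', 'ł': 'l', 'ń': 'n',
    'ó': 'o', 'ś': 's', 'ź': 'z', 'ż': 'z',
    'Ą': 'A', 'Ć': 'C', 'Ę': 'E', 'Ł': 'L', 'Ń': 'N',
    'Ó': 'O', 'Ś': 'S', 'Ź': 'Z', 'Ż': 'Z',
})


def normalize_name(name: str) -> str:
    """Replace Polish diacritics with their base Latin letters."""
    return name.translate(POLISH_TO_ASCII)


print(normalize_name('A.Wyłomańska'))  # A.Wylomanska
```

Applying such a function to every name before building the edges would let both spellings count towards the same node.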

# Transforming the data

With the data prepared, we can transform it into a form better suited to network visualization: for each paper, the group of HSC authors who wrote it.

In [23]:
edges = []

In [24]:
for year, articles in data.items():
    for article in articles:
        if article['local_authors']:  # Consider only non-empty lists
            edges.append(article['local_authors'])

In [25]:
# Use frozensets so that author order does not matter and groups are hashable
# (single-author papers are kept as well)
edges = [frozenset(authors) for authors in edges]
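
A `frozenset` is used here because it ignores order and is hashable, so two papers with the same group of HSC authors compare equal and can be tallied with `collections.Counter`. A small illustration with made-up author lists:

```python
import collections

# Hypothetical sample: the first two papers list the same authors in a different order
papers = [['A.Weron', 'K.Burnecki'], ['K.Burnecki', 'A.Weron'], ['M.Magdziarz']]

groups = [frozenset(authors) for authors in papers]
counts = collections.Counter(groups)

print(counts[frozenset({'A.Weron', 'K.Burnecki'})])  # 2
```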

In [26]:
edges

[frozenset({'A.Grzesiek', 'A.Wyłomańska'}),
 frozenset({'A.Wyłomańska'}),
 frozenset({'A.Wyłomańska', 'P.Kruczek'}),
 frozenset({'A.Wyłomańska'}),
 frozenset({'A.Weron', 'H.Loch-Olszewska', 'K.Burnecki', 'M.Balcerek'}),
 frozenset({'K.Burnecki', 'Z.Palmowski'}),
 frozenset({'A.Weron', 'G.Sikora', 'K.Burnecki'}),
 frozenset({'A.Wilkowska', 'K.Burnecki', 'M.Teuerle'}),
 frozenset({'J.Szwabiński'}),
 frozenset({'A.Kumar', 'A.Wylomanska'}),
 frozenset({'K.Burnecki'}),
 frozenset({'J.Gruszka', 'J.Szwabiński'}),
 frozenset({'A.Wyłomańska'}),
 frozenset({'A.Wyłomańska'}),
 frozenset({'H.Loch-Olszewska', 'J.Szwabiński', 'P.Kowalek'}),
 frozenset({'A.Wylomanska', 'P.Kruczek'}),
 frozenset({'A.Wyłomańska'}),
 frozenset({'A.Wyłomańska', 'R.Połoczański'}),
 frozenset({'M.Magdziarz'}),
 frozenset({'M.Magdziarz'}),
 frozenset({'A.Michalak', 'A.Wylomanska'}),
 frozenset({'K.Burnecki'}),
 frozenset({'A.Wyłomańska', 'G.Sikora', 'Ł.Bielak'}),
 frozenset({'A.Weron', 'J.Janczura', 'K.Burnecki'}),
 frozens

In [27]:
collections.Counter(edges)

Counter({frozenset({'A.Grzesiek', 'A.Wyłomańska'}): 2,
         frozenset({'A.Wyłomańska'}): 20,
         frozenset({'A.Wyłomańska', 'P.Kruczek'}): 4,
         frozenset({'A.Weron',
                    'H.Loch-Olszewska',
                    'K.Burnecki',
                    'M.Balcerek'}): 1,
         frozenset({'K.Burnecki', 'Z.Palmowski'}): 1,
         frozenset({'A.Weron', 'G.Sikora', 'K.Burnecki'}): 4,
         frozenset({'A.Wilkowska', 'K.Burnecki', 'M.Teuerle'}): 1,
         frozenset({'J.Szwabiński'}): 4,
         frozenset({'A.Kumar', 'A.Wylomanska'}): 1,
         frozenset({'K.Burnecki'}): 9,
         frozenset({'J.Gruszka', 'J.Szwabiński'}): 1,
         frozenset({'H.Loch-Olszewska', 'J.Szwabiński', 'P.Kowalek'}): 1,
         frozenset({'A.Wylomanska', 'P.Kruczek'}): 2,
         frozenset({'A.Wyłomańska', 'R.Połoczański'}): 3,
         frozenset({'M.Magdziarz'}): 34,
         frozenset({'A.Michalak', 'A.Wylomanska'}): 1,
         frozenset({'A.Wyłomańska', 'G.Sikora', 'Ł.Bie

In [None]:
import itertools

### Generate all possible pairs

In [None]:
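new_edges = collections.Counter()
for authors, num_connections in collections.Counter(edges).items():
    # Sort the names so that each unordered pair maps to exactly one key
    for pair in itertools.combinations(sorted(authors), 2):
        # Accumulate: the same pair can occur in several author groups
        new_edges[pair] += num_connections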

In [None]:
new_edges
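
The assignment also asks for node sizes proportional to the number of papers each member co-authored. A possible way to obtain those counts, assuming `edges` is the list of `frozenset`s built above (a short hypothetical sample stands in for it here), is to count every appearance of an author:

```python
import collections

# Hypothetical sample in the same shape as `edges`
sample_edges = [
    frozenset({'A.Grzesiek', 'A.Wyłomańska'}),
    frozenset({'A.Wyłomańska'}),
    frozenset({'A.Weron', 'G.Sikora', 'K.Burnecki'}),
    frozenset({'K.Burnecki'}),
]

# Each frozenset is one paper, so every membership counts as one paper
papers_per_author = collections.Counter(
    author for authors in sample_edges for author in authors
)
print(papers_per_author)
```

These counts, scaled by a constant factor, could then be passed to `nx.draw` through its `node_size` argument.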

In [None]:
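import networkx as nx
import matplotlib.pyplot as plt

g = nx.Graph()
for edge, weight in new_edges.items():
    g.add_edge(*edge, weight=weight)  # Store the number of common papers

In [None]:
# Read the widths back from the graph so they follow the order of g.edges()
widths = [g[u][v]['weight'] for u, v in g.edges()]
nx.draw(g, with_labels=True, width=widths)
plt.show()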

# Network

In [None]:
import networkx as nx
import matplotlib.pyplot as plt
import matplotlib

In [None]:
matplotlib.rcParams['figure.figsize'] = [12, 12]

In [None]:
g = nx.Graph()

In [None]:
g.add_edge(1, 2)
# Graph.add_cycle is gone from recent NetworkX releases; use the function form
nx.add_cycle(g, (2, 3, 1))

In [None]:
nx.draw(g, with_labels=True)
plt.show()

In [None]:
g = nx.Graph()
for edge in edges:
    # Function form of add_cycle; a single-author group becomes a self-loop
    nx.add_cycle(g, edge)

In [None]:
nx.draw_circular(g, with_labels=True)
plt.show()
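
The last task is to detect the connected components. NetworkX provides `nx.connected_components(g)` for this; the sketch below shows the underlying idea without any dependency, using a breadth-first search over groups of co-authors. The function name `connected_components` and the sample data are illustrative only.

```python
import collections


def connected_components(groups):
    """Return the connected components induced by groups of co-authors."""
    # Build an adjacency map: every author in a group is linked to the others
    neighbours = collections.defaultdict(set)
    for group in groups:
        for author in group:
            neighbours[author] |= set(group) - {author}

    seen, components = set(), []
    for start in neighbours:
        if start in seen:
            continue
        # Breadth-first search starting from an author not visited yet
        component, queue = {start}, collections.deque([start])
        while queue:
            for other in neighbours[queue.popleft()] - component:
                component.add(other)
                queue.append(other)
        seen |= component
        components.append(component)
    return components


# Hypothetical sample in the same shape as `edges`
sample = [
    frozenset({'A.Weron', 'K.Burnecki'}),
    frozenset({'K.Burnecki', 'Z.Palmowski'}),
    frozenset({'J.Gruszka', 'J.Szwabiński'}),
    frozenset({'M.Magdziarz'}),
]
for component in connected_components(sample):
    print(sorted(component))
```

With the graph `g` from the notebook, `list(nx.connected_components(g))` should give the same kind of grouping as a list of sets.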