<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Importing-libraries" data-toc-modified-id="Importing-libraries-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Importing libraries</a></span></li><li><span><a href="#Making-Web-Requests" data-toc-modified-id="Making-Web-Requests-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Making Web Requests</a></span></li><li><span><a href="#Wrangling-HTML-with-BeautifulSoup" data-toc-modified-id="Wrangling-HTML-with-BeautifulSoup-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Wrangling HTML with BeautifulSoup</a></span></li><li><span><a href="#Putting-All-Together" data-toc-modified-id="Putting-All-Together-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Putting All Together</a></span></li></ul></div>

### Importing libraries

In [1]:
from requests import get
from requests.exceptions import RequestException
from bs4 import BeautifulSoup
from datetime import date, datetime, timedelta

### Making Web Requests

In [2]:
# download a web page
def simple_get(url):
    """
    Attempts to get the content at 'url' by making an HTTP GET request.
    If the Content-Type of the response is some kind of HTML/XML,
    returns the raw response body (bytes), otherwise returns None.
    """
    try:
        resp = get(url)
        if is_good_response(resp):
            return resp.content
        else:
            return None
            
    except RequestException as e:
        log_error('Error during requests to {0} : {1}'.format(url, str(e)))
        return None
    
# check whether the response to the request looks like a successful HTML page
def is_good_response(resp):
    """
    Returns True if the response seems to be HTML, False otherwise.
    """
    # .get() avoids a KeyError when the server sends no Content-Type header
    content_type = resp.headers.get('Content-Type', '').lower()
    return resp.status_code == 200 and 'html' in content_type

# log an error message when a download fails
def log_error(e):
    """
    It is always a good idea to log errors.
    Printing is commented out here to keep the notebook output short;
    uncomment it, or make the function do anything else you need.
    """
    #print(e)
    

In [3]:
# example where simple_get() returns None because the 'Content-Type' is JSON, not HTML
url = 'https://api.github.com/events'
raw_html = simple_get(url)
print(raw_html)

resp = get(url)
resp.headers['Content-Type'] # the 'Content-Type' header gives the media type of the response

None


'application/json; charset=utf-8'
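
The check behind this result is a plain substring test on the `Content-Type` header. Here is a minimal offline sketch of that test, assuming a hypothetical helper `looks_like_html()` that is not part of the notebook and takes the header value as a string (or `None` when the header is missing):

```python
def looks_like_html(content_type):
    """Hypothetical helper: True when a Content-Type value mentions HTML."""
    if not content_type:
        # a missing or empty header cannot be treated as HTML
        return False
    return 'html' in content_type.lower()

# try the header values seen in this notebook, plus a missing header
for value in ('text/html; charset=utf-8', 'application/json; charset=utf-8', None):
    print(value, '->', looks_like_html(value))
```

Guarding against a missing header keeps the check from raising when a server omits `Content-Type`.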

In [4]:
# simple_get() returns the page content when the response is HTML
url_1 = 'https://realpython.com/blog/'
resp_1 = get(url_1)
resp_1.headers['Content-Type']

raw_html_1 = simple_get(url_1)
len(raw_html_1)
#print(raw_html_1)

'text/html; charset=utf-8'

401546

In [5]:
# checking the simple_get() result for a web page that does not exist
url_2 = 'http://onet_comp.pl'
simple_get(url_2)

### Wrangling HTML with BeautifulSoup

In [6]:
# getting a list of famous mathematicians using the
# functions defined in section 2
url = 'http://www.fabpedigree.com/james/mathmen.htm'
raw_html = simple_get(url) # returns the page content
html = BeautifulSoup(raw_html, 'html.parser')

# printing an enumerated list of the mathematicians' names;
# the browser's developer tools help to examine the HTML document
# and find the tags that contain the names (<li>)
for i, li in enumerate(html.select('li')):
    print(i, li.text)

0  Isaac Newton
 Archimedes
 Carl F. Gauss
 Leonhard Euler
 Bernhard Riemann

1  Archimedes
 Carl F. Gauss
 Leonhard Euler
 Bernhard Riemann

2  Carl F. Gauss
 Leonhard Euler
 Bernhard Riemann

3  Leonhard Euler
 Bernhard Riemann

4  Bernhard Riemann

5  Henri Poincaré
 Joseph-Louis Lagrange
 Euclid  of Alexandria
 David Hilbert
 Gottfried W. Leibniz

6  Joseph-Louis Lagrange
 Euclid  of Alexandria
 David Hilbert
 Gottfried W. Leibniz

7  Euclid  of Alexandria
 David Hilbert
 Gottfried W. Leibniz

8  David Hilbert
 Gottfried W. Leibniz

9  Gottfried W. Leibniz

10  Alexandre Grothendieck
 Pierre de Fermat
 Évariste Galois
 John von Neumann
 René Descartes

11  Pierre de Fermat
 Évariste Galois
 John von Neumann
 René Descartes

12  Évariste Galois
 John von Neumann
 René Descartes

13  John von Neumann
 René Descartes

14  René Descartes

15  Karl W. T. Weierstrass
 Srinivasa Ramanujan
 Hermann K. H. Weyl
 Peter G. L. Dirichlet
 Niels Abel

16  Srinivasa Ramanujan
 Hermann K. H. Weyl
 
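
The output above repeats names because the `<li>` tags are nested: each item's text also contains the text of the items inside it. One way to flatten such output, sketched here on made-up sample strings rather than the downloaded page, is to split every text on newlines and keep each name once. The helper `unique_names()` and the `sample` list are assumptions for illustration only.

```python
def unique_names(li_texts):
    """Hypothetical helper: flatten nested <li> text into unique names, keeping first-seen order."""
    names = []
    for text in li_texts:
        for line in text.split('\n'):
            name = line.strip()
            if name:  # skip the empty strings left by leading/trailing newlines
                names.append(name)
    # dict.fromkeys() drops duplicates while preserving insertion order
    return list(dict.fromkeys(names))

# sample strings shaped like the li.text values printed above
sample = [
    ' Isaac Newton\n Archimedes\n Carl F. Gauss\n',
    ' Archimedes\n Carl F. Gauss\n',
    ' Carl F. Gauss\n',
]
print(unique_names(sample))
```

Unlike a `set`, this variant keeps the order in which the names appear on the page.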

In [7]:
# function extracting a single list of names
def get_names():
    """
    Downloads the page where the list of mathematicians is found
    and returns a list of strings, one per mathematician.
    """
    url = 'http://www.fabpedigree.com/james/mathmen.htm'
    response = simple_get(url)
    
    if response is not None:
        html = BeautifulSoup(response, 'html.parser')
        names = set() # a set drops the names repeated by nested <li> tags
        for li in html.select('li'):
            for name in li.text.split('\n'):
                if len(name) > 0:
                    names.add(name.strip())
        return list(names)

    # raise an exception if no data could be retrieved from the url
    raise Exception('Error retrieving contents at {}'.format(url))

In [8]:
get_names()[:2]

['Eudoxus  of Cnidus', 'Gottfried W. Leibniz']

In [9]:
# this URL comes from the Wikimedia REST API, so BeautifulSoup scraping is not needed here; checking the response to the unfilled template
url = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/{name}/daily/{start_day}/{now}'
get(url)

<Response [400]>

In [10]:
# function returns a 'popularity score': the number of pageviews for each name
# on Wikipedia (all statistics are on xtools.wmflabs.org)

def get_hits_on_name(name):
    """
    Accepts the name of a mathematician and returns the number
    of hits that mathematician's Wikipedia page received in the
    last 60 days, as an 'int'.
    """
   
    now = date.today()
    start_day = now - timedelta(60)
    now, start_day = [elm.strftime('%Y%m%d') for elm in [now, start_day]]

    # the 'pageviews' API method returns one entry per day in 'items';
    # summing their 'views' gives the hits for the last 60 days
    url = f'https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/{name}/daily/{start_day}/{now}'
    response = get(url).json()
    if response is not None:
        # a 'title' key is treated as a response without pageview data
        if 'title' in response.keys():
            log_error('No pageviews found for {}'.format(name))
            return None
        else:
            result = sum(x['views'] for x in response['items'])
            return result
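
The two steps inside `get_hits_on_name()`, building the 60-day date window and summing the daily views, can be tried without a network call. The sketch below assumes two hypothetical helpers, `date_window()` and `total_views()`, and made-up payloads that only mimic the `items`/`views` shape used above. They are not real API responses.

```python
from datetime import date, timedelta

def date_window(today, days=60):
    """Hypothetical helper: (start, end) as YYYYMMDD strings for the URL."""
    start = today - timedelta(days)
    return start.strftime('%Y%m%d'), today.strftime('%Y%m%d')

def total_views(payload):
    """Hypothetical helper: sum the daily 'views', or None when 'items' is absent."""
    if 'items' not in payload:
        return None
    return sum(item['views'] for item in payload['items'])

# made-up payloads for illustration only
ok_payload = {'items': [{'views': 120}, {'views': 80}, {'views': 100}]}
error_payload = {'title': 'Not found.'}

print(date_window(date(2024, 3, 1)))  # ('20240101', '20240301')
print(total_views(ok_payload))        # 300
print(total_views(error_payload))     # None
```

Passing the date in as an argument, instead of calling `date.today()` inside the function, makes the window easy to check by hand.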

### Putting All Together

In [11]:
# the goal of this code is to find out which mathematician is
# most beloved by the public (by pageviews, not by ranking)
# and then to sort the names by popularity
if __name__ == '__main__':
    print('Getting the list of names...')
    names = get_names()
    print('... done.\n')
    results = []
    
    print('Getting stats for each name')
    for name in names:
        try:
            hits = get_hits_on_name(name)
            if hits is None:
                hits = -1
                        
            results.append((hits, name))
        except Exception:
            results.append((-1, name))
            log_error('Error encountered while processing {}, skipping'.format(name))
            
    print('... done.\n')

    results.sort(reverse=True)
    
    # slicing is safe even when there are fewer than 5 results
    top_marks = results[:5]
        
    print('\nThe most popular mathematicians are:\n')
    for (mark, mathematician) in top_marks:
        print('{} with {} pageviews'.format(mathematician, mark))
    no_results = len([res for res in results if res[0] == -1])
    print('\nBut we did not find results for {} mathematicians on the list'.format(no_results))

Getting the list of names...
... done.

Getting stats for each name
... done.


The most popular mathematicians are:

Albert Einstein with 1165493 pageviews
Isaac Newton with 525051 pageviews
Aristotle with 403253 pageviews
Galileo Galilei with 376890 pageviews
Srinivasa Ramanujan with 285146 pageviews

But we did not find results for 28 mathematicians on the list
