# Python Block Course
# Assignment 4: Importing data from the web

Marc Luettecke, Anna Bahss

Winter Term 2020 / 2021

In this fourth assignment we will practice how to import data from the web. You can score up to 3 points in this assignment. Please submit your solutions by sending them to **marc.luettecke@subsequent.ai** or **anna.bahss@uni-konstanz.de** (ideally both). Please include "**Python Blockkurs 2020 Assignment 4**" in the subject line of the email. The deadline for submission is **Friday, October 30, 09:59**; anything submitted after that will automatically receive a grade of 0/3.

**Notice: There's nothing new under the sun. Some of these problems are inspired by problems already existing out there. We can't avoid you copying code off the web, but know that 1. it is surprisingly easy to spot, if somebody uses techniques not introduced or referenced in the assignments yet, 2. you are missing out on actually understanding how to solve the problems with the tools (or at least hints) provided. Do yourself a favor and don't plagiarize.**

## 4.1.1 Scraping data from websites
We have dealt with a lot of data analysis tools, but what good are they if you do not have interesting data? As valuable as mtcars and iris are for diving into the mechanics, eventually you will come up with interesting questions of your own, which might require collecting and combining data that nobody else has thought to combine before. In this first part, we will deal with the more manual labor of collecting data directly from the HTML code of a website, a practice called 'web scraping'. We will scrape information off of Wikipedia's main page. By the way, you can check the source code of websites in most browsers by pressing F12 (Chrome) or by right-clicking anywhere on a website and looking for an 'inspect' option. Let's get started with Python's 'requests' and 'Beautiful Soup' packages, which you already know from DataCamp.

<div class="alert alert-block alert-info">
<b>Exercise (1 Point)</b>: Work through the documented steps below and solve the assignment. Watch out for hints and specific instructions.
</div>

In [1]:
# import BeautifulSoup from the bs4 package and the requests package
from bs4 import BeautifulSoup
import requests
# define the Wikipedia start page (where it says 'Welcome to Wikipedia') as your URL and save it to 'URL'
URL = 'https://en.wikipedia.org/wiki/Main_Page'

In [2]:
# just to make sure everything worked, let us print the html source code
# get the URL with the requests package
request_0 = requests.get(URL)

# extract the text, save it to a variable 'html' and print the result
html = request_0.text
print(html)
# Create a BeautifulSoup object from the HTML: soup
soup = BeautifulSoup(html, 'html.parser')

ProxyError: HTTPSConnectionPool(host='en.wikipedia.org', port=443): Max retries exceeded with url: /wiki/Main_Page (Caused by ProxyError('Cannot connect to proxy.', NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x000001B735C89518>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it.')))

OK, we could have achieved our goal so far by just going to a website, showing the source code and copy/pasting the information into a .txt file. But let us now find all the hyperlinks on the Wikipedia start page. We will need a little background knowledge of HTML for this task, so that we can eventually retrieve information from the website in a systematic manner.
HTML builds the skeleton of a website. It structures sections within a page and works with so-called 'tags'. A very prominent one is the `<li></li>` tag, which stands for 'list item'. Everything that is written between the opening and the closing tag (`<li>` and `</li>`) will be shown as a bullet point on a webpage. Let us start by finding all list item tags on a website:

In [13]:
# Example: We will use the find_all() method on your Beautiful Soup object to find all 'li' tags. This method returns a list-like result, through which you can loop and print all the elements.
all_list_items = soup.find_all('li')

# iterate through the list items
for list_item in all_list_items:
    print(list_item)

<li><a class="external text" href="https://sl.wikipedia.org/wiki/"><span class="autonym" lang="sl" title="Slovenian (sl:)">Slovenščina</span></a></li>
<li><a class="external text" href="https://th.wikipedia.org/wiki/"><span class="autonym" lang="th" title="Thai (th:)">ไทย</span></a></li></ul>
</div></li>
<li><a class="external text" href="https://ast.wikipedia.org/wiki/"><span class="autonym" lang="ast" title="Asturian (ast:)">Asturianu</span></a></li>
<li><a class="external text" href="https://bs.wikipedia.org/wiki/"><span class="autonym" lang="bs" title="Bosnian (bs:)">Bosanski</span></a></li>
<li><a class="external text" href="https://et.wikipedia.org/wiki/"><span class="autonym" lang="et" title="Estonian (et:)">Eesti</span></a></li>
<li><a class="external text" href="https://el.wikipedia.org/wiki/"><span class="autonym" lang="el" title="Greek (el:)">Ελληνικά</span></a></li>
<li><a class="external text" href="https://simple.wikipedia.org/wiki/"><span class="autonym" lang="simple" 

Now it's your turn: find all the 'a' tags, which indicate a hyperlink. Iterate through them and, if the .text property of the element exists, append the link to a list with a name of your choice. What does the .text property show us?



In [14]:
all_links = soup.find_all('a')
link_list = []
for link in all_links:
    # .text holds the visible text between the opening and closing tag
    if link.text:
        link_list.append(link)
    print(link.text)


Jump to navigation
Jump to search
Wikipedia
free
encyclopedia
anyone can edit
6,193,769
English
The arts
Biography
Geography
History
Mathematics
Science
Society
Technology
All portals

Super Mario World
platform game
Super Nintendo Entertainment System
Mario
Princess Toadstool
Bowser
Super Mario
Luigi
Yoshi
Takashi Tezuka
Shigeru Miyamoto
greatest video games of all time
best-selling SNES game
Yoshi's Island
Full article...
Sahure
The Boat Race 2019
Thomas White (Australian politician)
Archive
By email
More featured articles

Anna May Wong
had the starring role
Daughter of the Dragon
the non-Asian actor
Keep Talking
José Antonio Álvarez Lima
Canal Once
The South's Finest
Isabelle Li
2010 Summer Youth Olympics
Herr, wir bringen in Brot und Wein
offertory
Huub Oosterhuis
first edition of Gotteslob
the second
cutter
HMS Surly
Pawnee
Bright Star
Archive
Start a new article
Nominate an article
COVID-19 pandemic
Disease
Virus
By location
Impact
Portal

National Science Foundation
Arecibo Ob

Now we will go a step further and click on a link on a website, that is, we tell Python to use the scraped information to go to a different website. For that we need to find a pattern in the way Wikipedia's URLs change when you click on a link. Click from the main page on different articles to find the portion of the link that never changes and save it as a string called 'base_url'.

In [15]:
base_url = 'https://en.wikipedia.org/wiki/'

How can we now use the links we have scraped to go to a new website, i.e. one of the pages the main page refers to?
We can scrape all the links of the main page (add the href=True option to find_all), add the changing part of the URL to the base part and then direct Python to this new URL. We will first need to do some cleaning of the links, though.
Once again find all link elements, this time with the added option.

In [16]:
all_url = soup.find_all('a', href=True)
print(type(all_url))

<class 'bs4.element.ResultSet'>


Create a result list to append to.

In [17]:
results_list = []

Iterate through the items in your links and append the 'href' portion of each link to the result list. You can access it by subsetting a link element via list_element['href']; because we passed href=True to find_all, every element is guaranteed to have this attribute.
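To see what the href option and the subsetting do without needing an internet connection, here is a minimal sketch on a tiny hand-written HTML snippet. The snippet and the variable names (`sample_html`, `sample_soup`, `tags_with_href`) are made up for illustration and are not taken from Wikipedia.

```python
from bs4 import BeautifulSoup

# a tiny hand-written HTML snippet (hypothetical, not taken from Wikipedia)
sample_html = """
<ul>
  <li><a href="/wiki/Mario">Mario</a></li>
  <li><a href="https://example.org/">External site</a></li>
  <li><a name="anchor-without-href">No link here</a></li>
</ul>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# href=True keeps only the <a> tags that actually carry an href attribute
tags_with_href = sample_soup.find_all('a', href=True)

hrefs = [tag['href'] for tag in tags_with_href]  # value of the href attribute
texts = [tag.text for tag in tags_with_href]     # visible link text

print(hrefs)
print(texts)
```

The third `<a>` tag has no href attribute, so it is filtered out and `tag['href']` never fails on the remaining elements.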


In [25]:
for item in all_url:
    # print(item)
    url = item.get('href')
    # print(type(url))
    results_list.append(url)
print(results_list)

y_(1920)', '/wiki/Irish_Republican_Army_(1919%E2%80%931922)', '/wiki/Cairo_Gang', '/wiki/Gaelic_football', '/wiki/Croke_Park', '/wiki/1950', '/wiki/Canoe_River_train_crash', '/wiki/Valemount', '/wiki/John_Diefenbaker', '/wiki/1970', '/wiki/Vietnam_War', '/wiki/Operation_Ivory_Coast', '/wiki/S%C6%A1n_T%C3%A2y_prison_camp', '/wiki/Prisoner_of_war', '/wiki/2015', '/wiki/2015_Brussels_lockdown', '/wiki/Brussels', '/wiki/Henry_Purcell', '/wiki/Alexander_Berkman', '/wiki/Stan_Musial', '/wiki/November_20', '/wiki/November_21', '/wiki/November_22', '/wiki/Wikipedia:Selected_anniversaries/November', 'https://lists.wikimedia.org/mailman/listinfo/daily-article-l', '/wiki/List_of_days_of_the_year', '/wiki/File:Adolf_Mosengel_Dorf_in_den_Berner_Alpen.jpg', '/wiki/Adolf_Mosengel', '/wiki/Hamburg', '/wiki/Westphalia', '/wiki/Kunstakademie_D%C3%BCsseldorf', '/wiki/Hans_Gude', '/wiki/Geneva', '/wiki/Alexandre_Calame', '/wiki/En_plein_air', '/wiki/Bernese_Alps', '/wiki/Adolf_Mosengel', '/wiki/Template:P

We can see that the list contains many kinds of links: internal page anchors, links to other language versions, and so on. Can you find a common beginning among the references that link to other article pages?
We can clean the data by iterating through the elements and using the startswith() string method as a condition to keep only the elements that match this pattern. Append the appropriate results to a list called 'list_clean'.
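One detail to watch when cutting off the common beginning: `str.strip()` treats its argument as a set of characters to remove from both ends of a string, not as a prefix. The following minimal sketch contrasts the two approaches on a few hand-picked example links; the helper name `remove_prefix` is hypothetical and not part of the course material.

```python
def remove_prefix(link, prefix):
    """Return link without the leading prefix (unchanged if the prefix is absent)."""
    if link.startswith(prefix):
        return link[len(prefix):]
    return link


# hand-picked example links for illustration
sample_links = ['/wiki/Luigi', '/wiki/Mario', '#mw-head', '/wiki/Yoshi']

# keep only article-style links, then cut off the common beginning in two ways
stripped = [link.strip('/wiki/') for link in sample_links if link.startswith('/wiki/')]
sliced = [remove_prefix(link, '/wiki/') for link in sample_links if link.startswith('/wiki/')]

print(stripped)  # ['Luig', 'Mario', 'Yosh'] - the trailing 'i' was removed as well
print(sliced)    # ['Luigi', 'Mario', 'Yoshi']
```

Slicing by the length of the prefix only ever touches the start of the string, so article names that end in 'i', 'k', 'w' or '/' stay intact.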


In [26]:
list_clean = []
for item in results_list:
    if item.startswith('/wiki/'):
        # slice off the '/wiki/' prefix; str.strip() would also remove matching characters at the end
        list_clean.append(item[len('/wiki/'):])
print(list_clean)

['Wikipedia', 'Free_content', 'Encyclopedia', 'Help:Introduction_to_Wikipedia', 'Special:Statistics', 'English_language', 'Portal:The_arts', 'Portal:Biography', 'Portal:Geography', 'Portal:History', 'Portal:Mathematics', 'Portal:Science', 'Portal:Society', 'Portal:Technology', 'Wikipedia:Contents/Portals', 'File:Shigeru_Miyamoto_cropped_3_Shigeru_Miyamoto_201911.jpg', 'Super_Mario_World', 'Platform_game', 'Super_Nintendo_Entertainment_System', 'Mario', 'Princess_Peach', 'Bowser_(character)', 'Super_Mario', 'Luigi', 'Yoshi', 'Takashi_Tezuka', 'Shigeru_Miyamoto', 'List_of_video_games_considered_the_best', 'List_of_best-selling_Super_Nintendo_Entertainment_System_video_games', 'Yoshi%27s_Island', 'Super_Mario_World', 'Sahure', 'The_Boat_Race_2019', 'Thomas_White_(Australian_politician)', 'Wikipedia:Today%27s_featured_article/November_2020', 'Wikipedia:Featured_articles', 'File:Poster_-_Daughter_of_the_Dragon_01.jpg', 'Anna_May_Wong', 'Anna_May_Wong_on_film_and_television', 'Daughter_of_the_

Let us now try to combine our base URL and one of the list elements. Combine the base with the first element of the cleaned list, make a new request for this page, create a new Beautiful Soup object from it, and print the HTML code of the article. Work with print statements in between to verify that the URL looks right.
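Plain string concatenation works fine here. As an alternative that is not part of the course material, Python's standard library offers `urllib.parse.urljoin`, which resolves a relative link against a base the same way a browser would. A minimal sketch, using an article name that appears in the output above:

```python
from urllib.parse import urljoin

base_url = 'https://en.wikipedia.org/wiki/'
main_page = 'https://en.wikipedia.org/wiki/Main_Page'

# a bare article name appended to a base that ends with '/'
article_url = urljoin(base_url, 'Super_Mario_World')

# a site-relative href (starting with '/') resolved against the page it was found on
resolved_url = urljoin(main_page, '/wiki/Super_Mario_World')
portal_url = urljoin(main_page, '/wiki/Portal:Science')

print(article_url)
print(resolved_url)
print(portal_url)
```

Resolving the uncleaned, site-relative form is the more robust of the two variants, since a bare name containing a colon (such as `Portal:Science`) can be mistaken for a URL scheme.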

In [28]:
request_test = requests.get(base_url + list_clean[0])
html_test = request_test.text
print(html_test)
soup_test = BeautifulSoup(html_test, 'html.parser')

tps://nso.wikipedia.org/wiki/Wikipedia" title="Wikipedia – Northern Sotho" lang="nso" hreflang="nso" class="interlanguage-link-target">Sesotho sa Leboa</a></li><li class="interlanguage-link interwiki-tn"><a href="https://tn.wikipedia.org/wiki/Wikipedia" title="Wikipedia – Tswana" lang="tn" hreflang="tn" class="interlanguage-link-target">Setswana</a></li><li class="interlanguage-link interwiki-sq"><a href="https://sq.wikipedia.org/wiki/Wikipedia" title="Wikipedia – Albanian" lang="sq" hreflang="sq" class="interlanguage-link-target">Shqip</a></li><li class="interlanguage-link interwiki-scn"><a href="https://scn.wikipedia.org/wiki/Wikipedia" title="Wikipedia – Sicilian" lang="scn" hreflang="scn" class="interlanguage-link-target">Sicilianu</a></li><li class="interlanguage-link interwiki-si"><a href="https://si.wikipedia.org/wiki/%E0%B7%80%E0%B7%92%E0%B6%9A%E0%B7%92%E0%B6%B4%E0%B7%93%E0%B6%A9%E0%B7%92%E0%B6%BA%E0%B7%8F" title="විකිපීඩියා – Sinhala" lang="si" hreflang="si" class="interlangua

This is really exciting. We have created a URL conditional on another URL. To wrap things up, we will scrape all links of all links from Wikipedia's main page (we will limit this to the first 10 links from the main page due to Jupyter's memory settings, otherwise it might crash). More concretely, you will proceed in the following steps:

1. set the soup element for the wikipedia main page

2. define a base URL

3. retrieve all link elements from the main page and append the href portion to a result list

4. clean the links so you can combine the base URL with the relative links

5. limit the list to the first 10 elements to keep the Jupyter kernel from crashing
`link_list_clean = link_list_clean[:10]`

6. set a new soup element for every element in the links from the main page and retrieve all links (full URLs!) from these article sites - this might take a couple of seconds. Important: do not scrape one level deeper (links within links within links), or your scraper might run for a couple of days or, more likely, crash - exponential growth of processing complexity.

7. save the links to an overall result dictionary with key: depth of scraping (base root is 0, one level deeper is 1) and value: all links retrieved from that level of scraping within a list.
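To get a feeling for the warning in step 6, here is a rough back-of-the-envelope sketch. The function name `pages_to_request` is hypothetical, and the figure of 300 links per article is an assumption chosen purely for illustration, not a measured value.

```python
def pages_to_request(start_links, links_per_page, depth):
    """Estimate how many pages must be downloaded to scrape level `depth`.

    Level 1 means opening each start link once. Every further level multiplies
    the count by the (assumed) average number of links per page.
    """
    return start_links * links_per_page ** (depth - 1)


# assumed numbers for illustration only: 10 start links, roughly 300 links per article
for depth in (1, 2, 3):
    print(depth, pages_to_request(10, 300, depth))
```

Under these assumptions, the exercise itself needs 10 requests, while going one level deeper would already need 3000, and yet another level 900000.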


In [52]:
# 1.
URL = 'https://en.wikipedia.org/wiki/Main_Page'
request_wiki = requests.get(URL)
html_wiki = request_wiki.text
soup = BeautifulSoup(html_wiki, 'html.parser')
# 2.
base_url = 'https://en.wikipedia.org/wiki/'
# 3.
url_list = []
all_links = soup.find_all('a', href=True)
for item in all_links:
    url = item.get('href')
    url_list.append(url)
# print(url_list)
# 4.
link_list_clean = []
for item in url_list:
    # print(type(item))
    if item.startswith('/wiki/'):
        # print(item)
        # slice off the '/wiki/' prefix; str.strip() would also remove matching characters at the end
        link_list_clean.append(base_url + item[len('/wiki/'):])
# print(link_list_clean)
# 5.
link_list_clean = link_list_clean[:10]
# 6.
new_link_list = []
for link in link_list_clean:
    rq = requests.get(link)
    html0 = rq.text
    sp = BeautifulSoup(html0, 'html.parser')
    links = sp.find_all('a', href=True)
    # print(links)
    for item in links:
        url = item.get('href')
        # print(type(url))
        if url.startswith('/wiki/'):
            # slice off the '/wiki/' prefix instead of using str.strip()
            new_link_list.append(base_url + url[len('/wiki/'):])
# 7.
html_dict = {0: link_list_clean, 1: new_link_list}
print(html_dict)

ikipedia.org/wiki/Eternal_return', 'https://en.wikipedia.org/wiki/Eternalism_(philosophy_of_time)', 'https://en.wikipedia.org/wiki/Event_(philosophy)', 'https://en.wikipedia.org/wiki/Multiple_time_dimensions', 'https://en.wikipedia.org/wiki/Perdurantism', 'https://en.wikipedia.org/wiki/Philosophical_presentism', 'https://en.wikipedia.org/wiki/Static_interpretation_of_time', 'https://en.wikipedia.org/wiki/Temporal_finitism', 'https://en.wikipedia.org/wiki/Temporal_parts', 'https://en.wikipedia.org/wiki/The_Unreality_of_Time', 'https://en.wikipedia.org/wiki/Accounting_period', 'https://en.wikipedia.org/wiki/Chronemics', 'https://en.wikipedia.org/wiki/Fiscal_year', 'https://en.wikipedia.org/wiki/Generation_time', 'https://en.wikipedia.org/wiki/Mental_chronometry', 'https://en.wikipedia.org/wiki/Duration_(music)', 'https://en.wikipedia.org/wiki/Procrastination', 'https://en.wikipedia.org/wiki/Punctuality', 'https://en.wikipedia.org/wiki/Temporal_database', 'https://en.wikipedia.org/wiki/Te

## 4.1.2 Structure your code
This is really exciting work. We have just scraped a website and built up our data level by level. Wikipedia links might not be the most interesting data source, but this excursion was supposed to illustrate how automatic data retrieval with Python can simplify your workflow by a lot (can you imagine copy/pasting all these links by yourself?).

Let's structure our code from above into components. We will build a class structure along the lines of what we explored yesterday.

<div class="alert alert-block alert-info">
<b>Exercise (0.5 Points)</b>: Work through the documented steps below and solve the assignment. Watch out for hints and specific instructions.
</div> 

Define four classes: 'SoupSetUp', 'DataPreprocessing', 'DataScraper', and 'ResultPreparation'.

**Instructions for SoupSetUp**
1. define the constructor with one parameter: url of type string, also create two class properties 'html' and 'soup', which are filled with None in the constructor
2. define method 'build_soup' in which you use the url property of the class and create the html and the beautiful soup element and assign them to the class properties



In [53]:
class SoupSetUp:
    def __init__(self, url):
        self.url = url
        self.html = None
        self.soup = None

    def build_soup(self):
        self.html = requests.get(self.url).text
        self.soup = BeautifulSoup(self.html, 'html.parser')

**Instructions for DataPreprocessing**
1. define the constructor with one parameter: list_of_links and assign to a class property. Also define a class property 'clean_data_list', which is None
2. define method 'clean_links' with one parameter: valid_start of type string in which you implement the cleaning process from above (only keep the links that start with the desired property - here 'valid_start') and assign the result to the 'clean_data_list' property.
3. define method 'shorten_list' with one parameter 'list_length' of type int, which shortens the clean_data_list property to its first 'list_length' elements.



In [54]:
class DataPreprocessing:
    def __init__(self, list_of_links):
        self.list_of_links = list_of_links
        self.clean_data_list = None

    def clean_links(self, valid_start: str):
        self.clean_data_list = []
        for item in self.list_of_links:
            if item.startswith(valid_start):
                self.clean_data_list.append(item)

    def shorten_list(self, list_length: int):
        # keep only the first 'list_length' cleaned links
        self.clean_data_list = self.clean_data_list[:list_length]

**Instructions for DataScraper**
1. define the constructor with one parameter: base_url of type string. Also set a class property 'result_list' to None
2. define method 'scrape_element' with three parameters:
    - 'tag_name' (string): defines which kind of tag we would like to scrape
    - 'href' (boolean): tells us whether we would like to access the href property of the scraped elements
    - 'soup_element': the Beautiful Soup element that we would like to scrape

In [55]:
class DataScraper:
    def __init__(self, base_url: str):
        self.base_url = base_url
        self.result_list = None

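    def scrape_element(self, tag_name: str, href: bool, soup_element):
        if href:
            # keep only tags that carry an href attribute and store the href values
            elements = soup_element.find_all(tag_name, href=True)
            self.result_list = [element['href'] for element in elements]
        else:
            # store the matching tags themselves
            self.result_list = soup_element.find_all(tag_name)
        return self.result_list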

**Instructions for ResultPreparation**
1. define the constructor with one parameter: max_depth_level of type int and a result_dictionary property which we set to None
2. define method 'populate_results' with one parameter 'list_of_lists' (a list containing one list of links per scraping level), which creates the desired dictionary from the scraped links. Remember that the range function does not include its upper bound, so range(2) is what you need to create a dictionary with levels 0 and 1. Hint: max_depth_level + 1



In [56]:
class ResultPreparation:
    def __init__(self, max_depth_level):
        self.max_depth_level = max_depth_level
        self.result_dictionary = None

    def populate_results(self, list_of_lists):
        # one entry per scraping level: keys 0 .. max_depth_level
        self.result_dictionary = {level: list_of_lists[level] for level in range(self.max_depth_level + 1)}

We will leave it as a practice exercise to rerun the above scraping with the classes you have just defined. For such a small data pipeline, four classes might be overkill, but projects become more complex as you go, so planning flexibly for potential add-ons down the road might save you some headaches.

## 4.2 Working with a web API
So far, we have dealt with the raw approach of finding information within the source code of a website. But some sites are set up to let you interact with them and offer their data to you. While most of these access points, or APIs, require some form of verification (see [Twilio](https://www.twilio.com/docs/usage/api) or [google-maps](https://cloud.google.com/maps-platform) for very exciting use cases), the API we will work with does not require any of this setup.

It is called [Datamuse](https://www.datamuse.com/api/) and offers interesting English-language functionality. Please refer to their documentation if certain tasks are not clear. Reading documentation on your own, for code that other people have provided for you, will be at least 50% of your work as a data scientist (which is one of the reasons why documenting your code is just as important as writing it!).

<div class="alert alert-block alert-info">
<b>Exercise (1.5 Points)</b>: Work through the documented steps below and solve the assignment. Watch out for hints and specific instructions.
</div>

According to their documentation, the URL tells us everything we need to know to make requests.
Define the base_url portion that apparently never changes.

In [65]:
base_url = 'https://api.datamuse.com/words?'

If we now want to find all words that are spelled like 'Python' we need to define the relative_portion of the URL including the search parameter and the word in question.

In [58]:
relative_portion = 'sp=Python'

Make a get request (like above) with the full URL.


In [59]:
request_1 = requests.get(base_url + relative_portion)

Print the data with the json() method of the request object.

In [61]:
print(request_1.json())

[{'word': 'python', 'score': 65397}, {'word': 'pithon', 'score': 27}, {'word': 'pytho', 'score': 5}, {'word': 'phython', 'score': 2}]


In [62]:
# let's practice some more, make a request for all words that are more general than 'dog' and that rhyme with 'sound'
url1 = 'https://api.datamuse.com/words?rel_gen=dog&rel_rhy=sound'

# make a get-request (like above) with the full URL
request_2 = requests.get(url1)

# print the data with the json() method of the request object
print(request_2.json())

[{'word': 'hound', 'score': 1349}]
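The query strings used so far were typed by hand. As a sketch of an alternative that is not part of the assignment, the standard library's `urllib.parse.urlencode` can assemble them from parameter names and values, which also takes care of characters such as spaces. The helper name `build_query_url` is hypothetical.

```python
from urllib.parse import urlencode

base_url = 'https://api.datamuse.com/words?'


def build_query_url(**parameters):
    """Append URL-encoded query parameters to the Datamuse base URL."""
    return base_url + urlencode(parameters)


print(build_query_url(sp='Python'))
print(build_query_url(rel_gen='dog', rel_rhy='sound'))
print(build_query_url(ml='ice cream'))  # the space is encoded as '+'
```

The requests package can do the same encoding for you if you pass a dictionary via its `params` argument, e.g. `requests.get('https://api.datamuse.com/words', params={'sp': 'Python'})`.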


We will now build something semi-useful: a translator that rewrites a sentence using words of related meaning. We will build a simple app in which a user is prompted to enter a sentence and each word is replaced by the closest word in meaning according to the API.

1. Like every software development project, let us start with a simple case and build ourselves up in complexity: we will use the example sentence 'today is a hot day' - assign this sentence to a variable


2. Split the sentence into a list of words


3. For every word, run the API query to find the closest word in meaning (highest score, which should be the first entry in the list of results) and assign the result to a list. If the list of related words is empty or if the score is below 50000, keep the original word.    
**Hint 1:** enumerate() is a helpful function to iterate over lists   
**Hint 2:** if you expect that you might run into an error in some cases, such as 'if a file is present, open it', or here 'if there is a list of words close in meaning, return the first', you can handle this with a so-called try-except statement. Error handling itself is a big topic in programming, but you can read a nice introduction here: https://www.w3schools.com/python/python_try_except.asp.


4. Print the new sentence from the list of new words as one string. Hint: https://www.programiz.com/python-programming/methods/string/join
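The two hints and the join step can be tried out offline before touching the API. The sketch below works on made-up result lists shaped like the API responses shown earlier; the helper name `pick_replacement`, the dictionary `fake_results` and all scores in it are hypothetical and for illustration only.

```python
def pick_replacement(original_word, api_results, min_score=50000):
    """Return the first result's word if it scores high enough, else the original."""
    try:
        best = api_results[0]           # IndexError if the result list is empty
        if best['score'] >= min_score:  # KeyError if a result carries no score
            return best['word']
    except (IndexError, KeyError):
        pass
    return original_word


# made-up responses for illustration; real words and scores come from the API
fake_results = {
    'today': [{'word': 'daylight', 'score': 60000}],
    'hot': [{'word': 'warm', 'score': 40000}],
    'day': [],
}

sentence = 'today is a hot day'
new_words = []
for position, word in enumerate(sentence.split()):
    replacement = pick_replacement(word, fake_results.get(word, []))
    new_words.append(replacement)
    print(position, word, '->', replacement)

# join the list of words back into one string, separated by single spaces
new_sentence = ' '.join(new_words)
print(new_sentence)
```

Separating the decision (which word to keep) from the request itself makes the logic easy to test without an internet connection.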


Maybe not the most satisfying result, but it might lead to some giggles. More realistic use cases are auto-completion in search bars, or synonym / antonym search engines that rely on similar technology.

In [72]:
# 1.
sentence = "today is a hot day"
# 2.
words = sentence.split(' ')
# 3.
for i, word in enumerate(words):
    query_url = base_url + 'ml=' + word
    request_3 = requests.get(query_url)
    r3_js = request_3.json()
    # keep the original word if there is no result or the best score is too low
    if len(r3_js) == 0 or r3_js[0]['score'] < 50000:
        pass
    else:
        words[i] = r3_js[0]['word']
# 4.
new_sentence = ""
for item in words:
    new_sentence += item
    new_sentence += ' '
print(new_sentence)


daylight is a hot day 


Now that we have found the workflow, we can define a function that asks a user for a sentence (via an input call and a nice prompt) and prints out the resulting adjacent sentence. Call the function find_neighbor_sentence(), and add inline comments for the separate steps to document your logic and your code so that the next programmer understands what you were doing.

In [73]:
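def find_neighbor_sentence():
    # ask the user for a sentence and split it into single words
    st = input("please input a sentence: ")
    word = st.split(' ')
    # base URL of the 'means like' query of the Datamuse API
    base_url = 'https://api.datamuse.com/words?ml='
    for i in range(len(word)):
        # request the words that are closest in meaning to the current word
        get_json = requests.get(base_url + word[i]).json()
        # keep the original word if there is no result or the best score is too low
        if len(get_json) == 0 or get_json[0]['score'] < 50000:
            pass
        else:
            word[i] = get_json[0]['word']
    # put the (possibly replaced) words back together into one sentence
    new_sentence = ""
    for item in word:
        new_sentence += item
        new_sentence += ' '
    print(new_sentence)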

# play around with the function and have some fun
find_neighbor_sentence()

I am glad 


We have covered a lot of ground today. From scraping the web to using a word-finding API, you have practiced working with real and dirty data, and strengthened your understanding of general data structures, code modularization and function definition. You are well on your way to becoming a well-versed data scientist!