# This Notebook contains all solutions from the Exercise 3

## Nr. 1

- Pick a list within the Wikipedia like the list of sovereign states. Choose some other list on your own, based on your personal interests. The only requirement is that there are other Wikipedia articles linked within the list.

Since we are hopeless drunkards and celebrate a lot (irony) we decided to use a list that gives an overview of the countries that have a ban on alcohol.

**URL: [https://en.wikipedia.org/wiki/List_of_countries_with_alcohol_prohibition](https://en.wikipedia.org/wiki/List_of_countries_with_alcohol_prohibition)**

## Nr.2

- Get all the names and URLs to the corresponding items in the list and export them into a CSV file that has two columns (name and URL).

To access the Wikipedia data we use the module [`Beautiful Soup`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) together with the module [`requests`](https://realpython.com/python-requests/). This allows us to work in real time with a copy of the source code of the page we want to access. To structure the program more clearly we have declared functions that describe the processes. Little by little we get the necessary information. First the names of the countries, then the URL to the Wikipedia articles and finally the CSV file is written.  

**File: [List_of_countries_with_alcohol_prohibition.csv](./List_of_countries_with_alcohol_prohibition.csv)**

In [1]:
# Importing modules
import requests
from bs4 import BeautifulSoup
import csv
# URL of Wiki list
url = "https://en.wikipedia.org/wiki/List_of_countries_with_alcohol_prohibition"
# Reading in the source code
content = requests.get(url).text
soup = BeautifulSoup(content)
# Finding all the Country names
def titles(soup):
    # Searching for the second unsorted list
    present_list = soup.find_all("ul")[1]
    titles = []
    # Fetching the title attribute containing the country name
    for x in present_list.find_all("a"):
        titles.append(x.get("title"))
    # Creating a unique list in case there are duples
    titles = list(set(titles))
    # Removing everything that is not a country or that is more defined
    remove = [None, "Indonesia", "COVID-19 pandemic in South Africa", "India", "South Yemen", "Union Territory", "Sharjah (emirate)"]
    for things in remove:
        titles.remove(things)
    # Sorting the list
    titles_clean = sorted(titles)
    return titles_clean
# Finding all URL's
def href(titles):
    # Base for the URL
    url = "https://wikipedia.org/wiki/"
    href = []
    links = []
    # Iterating through all titles and adding the base URL
    for country in titles:
        href.append(url)
    # Adding the specific URL name to the base
    for i in range(len(href)):
        links.append(list(href[i] + titles[x] for x in range(len(titles))))
    # Removing duples lists
    links = links.pop(0)
    links_clean = []
    # Replacing every whitespace for a underscore
    for entries in links:
        re = entries.replace(" ", "_")
        links_clean.append(re)
    return links_clean
# Writing the CSV File
def writer(title, link):
    column_names = ["Names", "URL"]
    with open("List_of_countries_with_alcohol_prohibition.csv", "w", newline = "") as csvfile:
        writer = csv.writer(csvfile, delimiter=",")
        writer.writerow(column_names)
        # Zipping both lists together as they have to be in two seperate columns
        writer.writerows(zip(title, link))
titles(soup)
href(titles(soup))
writer(titles(soup), href(titles(soup)))

## Nr. 3

- For every Wikipedia article in the CSV list choose a few attributes from the infobox on the right that you would like to extract (e.g., population, name of the head of state, whatever...). Extract this information for every entry in your list. Store this information in an appropriate data structure.

Now we come to the somewhat more difficult part of the task. In order to find out the individual values of the countries we have to execute the program for each row. For this task even the program needs some seconds. As a basis the program written before in no. 2 is used, because otherwise the CSV file can not be written correctly, since the attributes `titles` and `href` are missing. To find out the capital cities (`capital`) and the time values (`timezone`) we proceeded similarly as in the task solved before. Finally the CSV file was overwritten.

In [3]:
# Importing modules
from bs4 import BeautifulSoup
import csv
import pandas as pd
import requests
import re
import lxml
# URL of Wiki list
url = "https://en.wikipedia.org/wiki/List_of_countries_with_alcohol_prohibition"
# Reading in the source code
content = requests.get(url).text
soup = BeautifulSoup(content)
# Finding all the Country names
def titles(soup):
    # Searching for the second unsorted list
    present_list = soup.find_all("ul")[1]
    titles = []
    # Fetching the title attribute containing the country name
    for x in present_list.find_all("a"):
        titles.append(x.get("title"))
    # Creating a unique list in case there are duples
    titles = list(set(titles))
    # Removing everything that is not a country or that is more defined
    remove = [None, "Indonesia", "COVID-19 pandemic in South Africa", "India", "South Yemen", "Union Territory", "Sharjah (emirate)"]
    # Creating a sorted clean list
    for things in remove:
        titles.remove(things)
    titles_clean = sorted(titles)
    return titles_clean
# Finding all URL's
def href(titles):
    # Base for the URL
    url = "https://wikipedia.org/wiki/"
    href = []
    links = []
    # Iterating through all titles and adding the base URL
    for countrie in titles:
        href.append(url)
    # Adding the specific URL name to the base
    for i in range(len(href)):
        links.append(list(href[i] + titles[x] for x in range(len(titles))))
    # Removing duples lists
    links = links.pop(0)
    links_clean = []
    # Replacing every whitespace for a underscore
    for entries in links:
        repl = entries.replace(" ", "_")
        links_clean.append(repl)
    return links_clean
# Calling the functions
titles(soup)
href(titles(soup))
# Finding all capitals
def capital(link):
    # Reading in the source code
    content = requests.get(link).text
    soup = BeautifulSoup(content, "lxml")
    # Shrinking down source code
    infobox = soup.find("table", attrs={"class": "infobox geography vcard"}).tbody
    # Searching for all the table rows
    rows = infobox.find_all("tr")
    capitals = []
    # Iterating over all rows
    for x in rows:
        # Ignoring all cases in which there are no matches
        if re.search("Capital", str(x)) != None:
            # Ignoring all cases in which the capital is "capital"
            if x.find("a").get_text() == "Capital":
                capitals.append(x.find_all(["a"])[1].get_text())
            else:
                capitals.append(x.find("a").get_text())
    capitals_clean = []
    # Replacing every whitespace for a underscore
    for entries in capitals:
        repl = entries.replace(" ", "_")
        capitals_clean.append(repl)
    for x in capitals_clean:
        return x
# Finding all time zones
def timezone(link):
    # Reading in the source code
    content = requests.get(link).text
    soup = BeautifulSoup(content, "lxml")
    # Shrinking down source code
    infobox = soup.find("table", attrs={"class": "infobox geography vcard"}).tbody
    # Searching for all the table rows
    rows = infobox.find_all("tr")
    timezone = []
    # Iterating over all rows
    for x in rows:
        # Ignoring all cases in which there are no matches
        if re.search("Time zone", str(x)) != None:
            # Ignoring all cases in which the time zone is "Time zone"
            if x.find(["span", "a"]).get_text() == "Time zone":
                timezone.append(x.find_all(["span", "a"])[1].get_text())
            else:
                timezone.append(x.find(["span", "a"]).get_text())
    timezones_clean = []
    # Replacing every whitespace for a underscore
    for entries in timezone:
        repl = entries.replace(" ", "_")
        timezones_clean.append(repl)
    for x in timezones_clean:
        return x
# Writing the CSV File
def writer(title, link, capitals, timezones):
    column_names = ["Names", "URL", "Capitals", "Timezones"]
    with open("List_of_countries_with_alcohol_prohibition.csv", "w", newline = "") as csvfile:
        writer = csv.writer(csvfile, delimiter=",")
        writer.writerow(column_names)
        # Zipping all the lists together as they have to be in seperate columns
        writer.writerows(zip(title, link, capitals, timezones))
# Calling the functions
capitals = []
timezones = []
# Creating a dataframe for the CSV table created in Ex. 3.2
df = pd.read_csv("List_of_countries_with_alcohol_prohibition.csv", delimiter = ",")
# Iterating through every Wiki link in the CSV file from Ex. 3.2
for link in df["URL"]:
    # Calling the functions with every link
    capitals.append(capital(link))
    timezones.append(timezone(link))
# Calling the writer function with all variables
writer(titles(soup), href(titles(soup)), capitals, timezones)

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 3: unexpected end of data

## Nr. 4

- Save your scraped information into a JSON file. Try to export clean data.

In [3]:
import csv
import json
CSV_PATH = './List_of_countries_with_alcohol_prohibition.csv'
JSON_PATH = './List_of_countries_with_alcohol_prohibition.json'
csv_file = csv.DictReader(open(CSV_PATH, 'r'))
json_list = []
for row in csv_file:
    json_list.append(row)
open(JSON_PATH, 'w').write(json.dumps(json_list, indent=4))

3810