# Data Wrangling - Assignment 3

## 0. Setup

### 0.1. Install and Import Dependencies

In [None]:
%pip install -r requirements.txt

In [14]:
from pathlib import Path

import scrapy
from parsel import SelectorList
from scrapy.selector import Selector

### 0.2. Global Variables

In [4]:
DATA_DIR = Path.cwd() / "data"
POSTS_DIR = DATA_DIR / "200posts"

## 1. Parsing Hikes

\[*In the first part of the assignment, you need to extract the relevant attributes from the web pages scraped from hikr.org. Extend the `parse` function so that it extracts all the attributes you need to create the ranking. You may define your own helper functions and extend the `parse` function as necessary. Just keep in mind that the arguments/result types should not be changed to enable you to use the function in the second part of the assignment.*\]

The following features are extracted:

| Feature | Description | Purpose |
| :--- | :--- | :--- |
| Name | The name of the tour. | Provides a concise description of the tour. |
| Difficulty | How difficult the tour is. See the [definition on hikr.org](https://www.hikr.org/post238.html). | Can be used to select tours based on their difficulty. |
| Required Time | How long the tour takes to complete, in minutes. | Can be used to select tours of a certain duration. |

In [60]:

import re

# Optional "<n> Tag(e) " prefix followed by a clock time such as "4:30".
TIME_PATTERN = re.compile(r"(?:(\d+) Tage? )?([01]?\d|2[0-3]):([0-5]\d)")


def extract_required_time(time: SelectorList[Selector]) -> int | None:
    """Extract the required time and convert it to minutes.

    Parameters
    ----------
    time : SelectorList
        The selectors that contain the raw time value.

    Returns
    -------
    int or None
        The required time in minutes. If no time was found, `None` is
        returned.
    """
    # Debug output: show the selector(s) matched for this tour.
    print(time)

    for raw in time.getall():
        match = TIME_PATTERN.search(raw)
        if match is not None:
            # A missing day prefix counts as zero days. Days, hours and
            # minutes come from one match, so "1 Tag" and multi-digit
            # day counts are handled consistently.
            days, hours, minutes = match.groups(default="0")
            return 24 * 60 * int(days) + 60 * int(hours) + int(minutes)
    return None
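The conversion to minutes can be checked without scrapy. The following is a minimal sketch that applies the same arithmetic to plain strings. The helper name `to_minutes` and the sample inputs with a day prefix are assumptions made for illustration. The `Tag`/`Tage` prefix format is inferred from the regular expression used above, not from the hikr.org data itself.

```python
import re
from typing import Optional

# Assumed format: optional "<n> Tag " or "<n> Tage " prefix, then H:MM or HH:MM.
TIME_PATTERN = re.compile(r"(?:(\d+) Tage? )?([01]?\d|2[0-3]):([0-5]\d)")


def to_minutes(raw: str) -> Optional[int]:
    """Convert a raw time string to minutes (hypothetical helper)."""
    match = TIME_PATTERN.search(raw)
    if match is None:
        return None
    # A missing day prefix is treated as zero days.
    days, hours, minutes = match.groups(default="0")
    return 24 * 60 * int(days) + 60 * int(hours) + int(minutes)


for raw in ["\n4:30\n", "\n10:00\n", "2 Tage 4:30", "1 Tag 0:45", "unbekannt"]:
    print(repr(raw), "->", to_minutes(raw))
```

The first two inputs are taken from the cell output in this notebook and yield 270 and 600 minutes. The remaining inputs are made up to exercise the day prefix and the no-match case.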

In [61]:
def parse(tour):
    """Parse a hikr.org tour and extract the attributes we are interested in.

    Parameters
    ----------
    tour : Tuple[str, str]
        The tour to parse. The first string is the name of the file in
        which the tour is stored. The second string is the HTML content
        of that file.

    Returns
    -------
    dict
        A dictionary containing the extracted attributes for this tour.
    """
    # The file name serves as the tour id; text is the file content.
    # Named tour_id to avoid shadowing the built-in id().
    tour_id, text = tour
    # Parse it using scrapy
    document = Selector(text=text)

    # Occurrences of the required time in a tour (there should be 0 or 1).
    time = document.xpath(
        '//td[text()="Zeitbedarf:"]/following-sibling::td/text()'
    )
    # Hiking difficulty on the T1 to T6 scale, with an optional "+" or "-".
    difficulty = document.xpath(
        '//td[text()="Wandern Schwierigkeit:"]/following-sibling::td/a/text()'
    ).re_first(r"(T[1-6][+-]?)")

    # TODO: Extract more attributes and add them to the result dictionary!
    # Consider "Klettern Schwierigkeit:" to also cover climbing tours.
    return {
        "name": document.css("h1.title::text").get(),
        "difficulty": difficulty,
        "required_time_minutes": extract_required_time(time),
    }
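To rank or filter tours by the `difficulty` value that `parse` returns, the labels need an ordering. The following is a minimal sketch of one possible sort key. The helper `difficulty_rank` is hypothetical. It assumes that a `-` suffix ranks just below the plain grade, that a `+` suffix ranks just above it, and that `T3+` is still easier than `T4-`. These assumptions should be checked against the scale definition on hikr.org.

```python
import re

# Same label shape as the regular expression used in parse: T1..T6 plus optional +/-.
DIFFICULTY_PATTERN = re.compile(r"T([1-6])([+-]?)")
# Assumed ordering of the modifiers within one grade.
MODIFIER_OFFSET = {"-": -1, "": 0, "+": 1}


def difficulty_rank(label):
    """Map a label such as "T3+" to an integer; None for missing or unknown labels."""
    if label is None:
        return None
    match = DIFFICULTY_PATTERN.fullmatch(label)
    if match is None:
        return None
    grade, modifier = match.groups()
    # Three slots per grade: "-", plain, "+".
    return 3 * int(grade) + MODIFIER_OFFSET[modifier]


labels = ["T4+", "T2", "T3", "T3+", "T3-"]
print(sorted(labels, key=difficulty_rank))
```

Tours without a hiking difficulty (for example pure climbing tours) map to `None` and have to be filtered out before sorting.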

In [63]:
# Extract the 200posts.zip file so that the tours end up in POSTS_DIR
# (data/200posts next to this Jupyter notebook).
# Then run the parse function on every tour in that folder
# (the example file in the original template was post24010.html):
for post in POSTS_DIR.iterdir():
    with open(post) as f:
        content = f.read()
        r = parse((f.name, content))
        print(r)

[<Selector query='//td[text()="Zeitbedarf:"]/following-sibling::td/text()' data='\n4:30\n'>]
{'name': 'Klettersteig Pinut', 'difficulty': None, 'required_time_minutes': 270}
[]
{'name': 'Tschirgant (2370m)', 'difficulty': 'T3', 'required_time_minutes': None}
[]
{'name': 'Gemsmättli / P 2054 ', 'difficulty': 'T4+', 'required_time_minutes': None}
[<Selector query='//td[text()="Zeitbedarf:"]/following-sibling::td/text()' data='\n2:30\n'>]
{'name': 'Kurze Bergwanderung auf La Palma-der Pico Birigoyo(1807m)', 'difficulty': 'T2', 'required_time_minutes': 150}
[<Selector query='//td[text()="Zeitbedarf:"]/following-sibling::td/text()' data='\n10:00\n'>]
{'name': 'Einsamer Hochschwabberg - Der Brandstein - ein langer "Latschenruachler"', 'difficulty': 'T3', 'required_time_minutes': 600}
[<Selector query='//td[text()="Zeitbedarf:"]/following-sibling::td/text()' data='\n5:30\n'>]
{'name': 'Grigna Settentrionale (2409m)- Cresta Piancaformia', 'difficulty': 'T3+', 'required_time_minutes': 330}
[<Se

## 2. Parallelization & Aggregation (Spark)

It is highly recommended to postpone this part until after the Spark lecture!

This part only works on Databricks!

Warning: In the Community Edition, Databricks terminates your cluster after 2 hours of inactivity. If you re-create the cluster, you will lose your data.

In [None]:
%pip install scrapy

The command above might not always succeed in installing a library such as scrapy. If you run into problems, you can install the library through the cluster UI instead:

- Go to the "Clusters" panel on the left
- Select your cluster
- Go to the "Libraries" tab
- Click "Install New"
- Choose "PyPI" as library source
- Type the name of the library, "scrapy", into the package field
- Click "Install"
- Wait until the installation has finished

You can now use the newly installed library in your code.

In [None]:
# AWS Access configuration
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "AKIAYFVAOB5OOWVMUSCZ")
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "BddS/X8w8qXdBkkqbzmO+5RgmfPRQuIT+wbUxrn2")

# Contains the whole hikr dataset.
# The full dataset contains 42330 tours and has a size of around 3 GB. Use this dataset for your final results if possible. 
# Execution is likely to take around 20 to 30 minutes.
# tours = sc.wholeTextFiles("s3a://dawr-hikr3/hikr/*.html")

# There are 8176 posts starting with "post10*", which is a nicer size for smaller experiments. (~ 5 minutes to process)
# tours = sc.wholeTextFiles("s3a://dawr-hikr3/hikr/post10*.html")

# If you want to further shrink the dataset size for testing, you can add another zero (or more) to the pattern (post100*.html).
tours = sc.wholeTextFiles("s3a://dawr-hikr3/hikr/post100*.html")

In [None]:
# Apply our parse function and persist the results so that all further steps can be repeated without re-parsing
import pyspark
parsedTours = tours.map(parse).persist(pyspark.StorageLevel.MEMORY_AND_DISK)

In [None]:
# Force evaluation of the parsedTours RDD. Above it was only defined, not evaluated. This will take a while.
parsedTours.count()

In [None]:
# TODO
# Add your code here. Note that executing this cell and any below can reuse the results from "parsedTours".

# Example - let's just collect everything
parsedTours.collect()
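One possible aggregation is the number of tours and the mean required time per difficulty. The following is a minimal sketch written as plain functions so that it runs without a cluster. The function names (`is_complete`, `to_pair`, `combine`, `summarize`) are hypothetical, and the sample records are copied from the local output in part 1. On Databricks the same functions could presumably be passed to the RDD methods `filter`, `map`, `reduceByKey` and `mapValues`, as indicated in the comment; that variant is untested here.

```python
def is_complete(tour):
    """Keep only tours that have both a difficulty and a required time."""
    return tour["difficulty"] is not None and tour["required_time_minutes"] is not None


def to_pair(tour):
    """Key each tour by difficulty; the value is (tour_count, total_minutes)."""
    return (tour["difficulty"], (1, tour["required_time_minutes"]))


def combine(a, b):
    """Associative merge of two (tour_count, total_minutes) values."""
    return (a[0] + b[0], a[1] + b[1])


def summarize(tours):
    """Local stand-in for the Spark pipeline: {difficulty: (count, mean_minutes)}."""
    totals = {}
    for key, value in map(to_pair, filter(is_complete, tours)):
        totals[key] = combine(totals[key], value) if key in totals else value
    return {key: (count, total / count) for key, (count, total) in totals.items()}


# Assumed Spark equivalent (not executed in this sketch):
# parsedTours.filter(is_complete).map(to_pair).reduceByKey(combine) \
#     .mapValues(lambda v: (v[0], v[1] / v[0])).collect()

sample = [
    {"name": "Klettersteig Pinut", "difficulty": None, "required_time_minutes": 270},
    {"name": "Tschirgant (2370m)", "difficulty": "T3", "required_time_minutes": None},
    {"name": "Kurze Bergwanderung auf La Palma-der Pico Birigoyo(1807m)",
     "difficulty": "T2", "required_time_minutes": 150},
    {"name": "Grigna Settentrionale (2409m)- Cresta Piancaformia",
     "difficulty": "T3+", "required_time_minutes": 330},
]
print(summarize(sample))
```

Carrying the pair `(count, total)` instead of a running mean keeps `combine` associative, which is what a per-key reduction needs when partial results are merged in arbitrary order.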