# Now You Code In Class: Extracting Hyperlinks from a Webpage

Let's write a program that extracts the hyperlinks from a webpage. There are 4 webpage files you can try with this program:

    httpbin-org.html
    ischool-directory.html
    ist256-com.html
    wikipedia-president-of-the-united-states.html

Run these commands to download the four files:



In [2]:
!curl https://httpbin.org/ -o httpbin-org.html
!curl "https://ischool.syr.edu/directory/?cat=all" -o ischool-directory.html
!curl https://ist256.com -o ist256-com.html
!curl https://en.wikipedia.org/wiki/President_of_the_United_States -o wikipedia-president-of-the-united-states.html

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  9593  100  9593    0     0  97887      0 --:--:-- --:--:-- --:--:-- 97887
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1318k    0 1318k    0     0  44.3M      0 --:--:-- --:--:-- --:--:-- 44.3M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 15613    0 15613    0     0   118k      0 --:--:-- --:--:-- --:--:--  118k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  665k  100  665k    0     0  4435k      0 --:--:-- --:--:-- --:--:-- 4435k


In addition, you can download any page you wish by copying the `!curl` command: place the URL after `curl` and the output filename after `-o`.

## Strategy

Here's the basic strategy. Because HTML has no meaningful line structure, we must read the file's contents all at once into one big string. Within that string we look for the token `<a href="`. When we find this token, the link is every character after it, up to the next `"` character.

For example, in `<a href="http://ist256.github.io">` the hyperlink is `http://ist256.github.io`.
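
As a quick illustration of this strategy, here is a minimal sketch that applies it to a single string. The `html` sample and the variable names are made up for the example; only the token and the "read up to the next quote" rule come from the description above.

```python
# Sample HTML fragment built around the example link above (illustrative only).
html = '<p>Visit <a href="http://ist256.github.io">the course site</a></p>'

token = '<a href="'                     # marks where a link begins
start = html.find(token) + len(token)   # index of the first character of the URL
end = html.find('"', start)             # index of the closing quote
link = html[start:end]

print(link)  # http://ist256.github.io
```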

In addition, there are 4 types of links:

- **absolute** links are URLs like `https://www.google.com`. These begin with `http://` or `https://`.
- **email** links like `mailto:president@whitehouse.gov`. These begin with `mailto:`.
- **bookmarks** are links to places on the same page, like `#about`. These begin with `#`.
- **relative** links (to pages on the same site). Consider all other links to be this type.


## Recommended approach: Bottom Up

We will use an approach called Bottom Up to solve this problem. It works by writing the algorithm at a high level; then each step that is non-trivial (more than a line of code) is written as its own function. We test the functions, then assemble the final program by calling the functions we made.

THE ALGORITHM

    1. read the file into a string
    2. loop
    3.    if string has link
    4.       extract link from string returning link and remaining string
    5.       determine link type
    6.       print link and link type (absolute / relative / email / bookmark)
    7.    else
    8.       stop looping


### Which steps must be written as functions?

Ask yourself which lines of the algorithm can be implemented trivially in Python.

PROMPT 1



## Step 1a: Problem Analysis for `readFile()`

Inputs: 

    PROMPT 2

Outputs: 

    PROMPT 3

Algorithm (Steps in Program):

    PROMPT 4

How many tests are required and why?

    PROMPT 5


## Step 1b: Code the function

In [20]:
# PROMPT 6
def readFile(file):
    with open(file, "r") as f:
        contents = f.read()
        return contents

## Step 2: Write tests

Let's make sure the function works

In [26]:
# PROMPT 7 Test(s)
print(f"When file='test.html' \nEXPECT= this is a test <a href=\"https://www.testing.com\"></a> \nACTUAL=", readFile('test.html'))


When file='test.html' 
EXPECT= this is a test <a href="https://www.testing.com"></a> 
ACTUAL= this is a test <a href="https://www.testing.com"></a>
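
The test above assumes a file named `test.html` already exists next to the notebook. If it does not, a sketch like the following could create it first; the file contents simply mirror the EXPECT line, and `readFile()` is repeated here only so the cell runs on its own.

```python
def readFile(file):
    # Same function as above, repeated so this cell is self-contained.
    with open(file, "r") as f:
        contents = f.read()
        return contents

# Write the small fixture file that the test reads.
expected = 'this is a test <a href="https://www.testing.com"></a>'
with open("test.html", "w") as f:
    f.write(expected)

# Compare programmatically instead of by eye.
actual = readFile("test.html")
print("PASS" if actual == expected else "FAIL")
```

Comparing with `==` instead of reading the two printed lines makes a mismatch such as a stray trailing newline easier to spot.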


## Step 3a: Problem Analysis for `extractLink()`

Inputs: 

    PROMPT 8

Outputs: 

    PROMPT 9

Algorithm (Steps in Program):

    PROMPT 10 (Having Trouble? Let's discuss an approach. Write then re-factor into a function.)

How many tests are required and why?

    PROMPT 11


## Step 3b: Code the function

In [23]:
# PROMPT 12 - write code
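def extractLink(text):
    # Assumes text contains the token; check for it before calling.
    token = '<a href="'
    start = text.find(token) + len(token)   # first character of the link
    end = text.find('"', start)             # closing quote of the href
    link = text[start:end]
    rest = text[end+1:]                     # everything after the closing quote
    return link, rest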


## Step 4: Write tests

Let's make sure the function works

In [25]:
# PROMPT 13 - write test(s)

text = "<a href=\"https://www.testing.com\">test</a>"
link, rest = extractLink(text)
print("When:", text)
print("EXPECT: link=https://www.testing.com, rest=>test</a>")
print(f"ACTUAL: link={link}, rest={rest}")


When: <a href="https://www.testing.com">test</a>
EXPECT: link=https://www.testing.com, rest=>test</a>
ACTUAL: link=https://www.testing.com, rest=>test</a>
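
Since the final program calls `extractLink()` repeatedly on the remaining text, it may be worth adding a second test with more than one link. The sketch below is one possible version; the sample string is invented for the test, and the function is repeated so the cell runs on its own.

```python
def extractLink(text):
    # Same logic as the function above, repeated so this cell is self-contained.
    token = '<a href="'
    start = text.find(token)
    end = text.find('"', start + len(token))
    link = text[start + len(token):end]
    rest = text[end + 1:]
    return link, rest

# Two links in one string: feed the leftover text back into the function.
text = '<a href="#top">Top</a> and <a href="mailto:president@whitehouse.gov">Mail</a>'
first, rest = extractLink(text)
second, rest = extractLink(rest)

print("EXPECT: first=#top, second=mailto:president@whitehouse.gov, rest=>Mail</a>")
print(f"ACTUAL: first={first}, second={second}, rest={rest}")
```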


## Step 5a: Problem Analysis for `linkType()`

Inputs: 

    PROMPT 14

Outputs: 

    PROMPT 15

Algorithm (Steps in Program):

    PROMPT 16 

How many tests are required and why?

    PROMPT 17


## Step 5b: Code the function

In [34]:
# PROMPT 18 - write code
def linkType(link):
    if link.startswith("http://") or link.startswith("https://"):
        return "absolute"
    elif link.startswith("mailto:"):
        return "email"
    elif link.startswith("#"):
        return "bookmark"
    else:
        return "relative"


## Step 6: Write tests

Let's make sure the function works

In [None]:
# PROMPT 19 - write tests
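
One possible set of tests is sketched below: one case per branch of `linkType()`, plus a second absolute link so both `http://` and `https://` are exercised. The relative path `/directory/index.html` is a made-up example, and the function is repeated so the cell runs on its own.

```python
def linkType(link):
    # Same function as above, repeated so this cell is self-contained.
    if link.startswith("http://") or link.startswith("https://"):
        return "absolute"
    elif link.startswith("mailto:"):
        return "email"
    elif link.startswith("#"):
        return "bookmark"
    else:
        return "relative"

# (input, expected output) pairs, one or more per link type.
tests = [
    ("https://www.google.com", "absolute"),
    ("http://ist256.github.io", "absolute"),
    ("mailto:president@whitehouse.gov", "email"),
    ("#about", "bookmark"),
    ("/directory/index.html", "relative"),  # hypothetical relative path
]

for link, expect in tests:
    actual = linkType(link)
    print(f"When link={link} EXPECT={expect} ACTUAL={actual}")
```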


## Final Program

With the functions written, return to the original algorithm and implement it. Here's a layout to get it working with IPython's interact widgets.

In [37]:
from ipywidgets import interact_manual
files = ['httpbin-org.html','ischool-directory.html','ist256-com.html','wikipedia-president-of-the-united-states.html']

print("Link Extractor")
@interact_manual( choose_file=files)
def main(choose_file):
    #TODO Write code here
    contents = readFile(choose_file)
    token = '<a href="'
    while True:
        if contents.find(token) >=0:
            link, contents = extractLink(contents)
            link_type = linkType(link)
            print(f"{link_type} ==> {link}")
        else:
            break

Link Extractor


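To check the loop without the widget, the same logic can be wrapped in a plain function that works on a string. This is only a sketch: `extractAllLinks()` and the `sample` text are hypothetical additions, and the two helper functions are repeated so the cell runs on its own.

```python
def extractLink(text):
    # Same logic as Step 3b.
    token = '<a href="'
    start = text.find(token)
    end = text.find('"', start + len(token))
    link = text[start + len(token):end]
    rest = text[end + 1:]
    return link, rest

def linkType(link):
    # Same logic as Step 5b.
    if link.startswith("http://") or link.startswith("https://"):
        return "absolute"
    elif link.startswith("mailto:"):
        return "email"
    elif link.startswith("#"):
        return "bookmark"
    else:
        return "relative"

def extractAllLinks(contents):
    # Hypothetical helper: collect (link_type, link) pairs instead of printing.
    token = '<a href="'
    results = []
    while token in contents:            # same stop condition as the main loop
        link, contents = extractLink(contents)
        results.append((linkType(link), link))
    return results

# Made-up sample text with one absolute, one bookmark, and one relative link.
sample = ('<a href="https://ist256.com">Home</a> '
          '<a href="#about">About</a> '
          '<a href="syllabus.html">Syllabus</a>')

for link_type, link in extractAllLinks(sample):
    print(f"{link_type} ==> {link}")
```

Returning a list rather than printing makes the loop testable on short strings before trying it on the downloaded files.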