# Parsing an HTML Page for URLs using Beautiful Soup 4

This notebook extracts the hyperlinks from a web page and saves them in a CSV file. It covers how to fetch a URL, use Beautiful Soup 4 and Python 3 to extract the hyperlinks, and finally write them to a CSV-formatted file.

## Libraries and Resources used

-  Python 3
-  Beautiful Soup 4
-  Requests (a simple HTTP library for Python)

## Note:

To install the necessary resources and libraries, refer to their respective home pages for the installation steps for your operating system.

In this tutorial we take hyperlinks from Wikipedia. The example pages may have changed by the time you read this, in which case your own run of this code may produce different results or raise an error.

### Importing the required libraries

In [15]:
import requests
from bs4 import BeautifulSoup

### Loading the desired URL

In this example we are going to use a Wikipedia page about the Mid-Autumn Festival.

In [23]:
# Using requests we can fetch the URL we want
webPage = requests.get('https://en.wikipedia.org/wiki/Mid-Autumn_Festival#Mooncakes')
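
A request can fail or come back with an error page, so it may be worth checking the response before parsing it. The sketch below wraps the fetch in a helper. The name `fetch_html` and the 10-second timeout are assumptions made for illustration and are not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10.0):
    """Return the HTML text of `url`, raising on network or HTTP errors.

    `fetch_html` and its default timeout are illustrative choices.
    """
    # A timeout stops the call from waiting forever on an unresponsive server
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    return response.text
```

Calling `fetch_html('https://en.wikipedia.org/wiki/Mid-Autumn_Festival')` would return the page text. The `#Mooncakes` part of the URL used above is a fragment that only tells a browser where to scroll; it is not sent to the server, so the whole page is returned either way.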

### Creating a Beautiful Soup Object

A Beautiful Soup object represents the parsed page and lets us search and navigate it in a variety of ways. In this tutorial we only need a small part of the functionality built into the library.

In [17]:
# Here we create a soup object using the text from the webPage we declared earlier
# 'html.parser' tells the library which parser to use for the HTML
soup = BeautifulSoup(webPage.text, 'html.parser')

### Finding all the 'a' or 'anchor' Tags

Now that we have the web page loaded, it is time to get all the URLs. To do this we look for the 'a' (anchor) tags, which contain the URLs we are looking for.

In [18]:
# Here we ask Beautiful Soup to find all the 'a' tags in the web page
anchor_on_page = soup.find_all('a')
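
Not every anchor carries a link. As a minimal sketch, assuming a small hand-written HTML snippet in place of the downloaded page, `find_all` can also be asked to return only the anchors that have an `href` attribute. The names `sample_html` and `sample_soup` are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for a downloaded page
sample_html = """
<p>
  <a href="/wiki/Mooncake" title="Mooncake">Mooncake</a>
  <a name="top">No link here</a>
  <a href="#References">References</a>
</p>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# find_all('a') returns every anchor, including ones without an href
all_anchors = sample_soup.find_all('a')

# href=True keeps only the anchors that actually carry an href attribute
linked_anchors = sample_soup.find_all('a', href=True)

print(len(all_anchors), len(linked_anchors))
print([a['href'] for a in linked_anchors])
```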

### Storing the values we want

Before gathering the URLs, we need somewhere to store them, so that later on we have a more organized dataset to work from.

In [19]:
# Create a dictionary with information about each link
linkDictionary = {}
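
The dictionary will end up nested: the outer key is the link, and the inner dictionary holds the fields recorded for it. A small sketch of that shape, using a hypothetical `example_links` dictionary and made-up values:

```python
# Hypothetical entry showing the shape linkDictionary will take:
# the outer key is the href, the inner dictionary holds the fields for that link
example_links = {}
example_links['/wiki/Mooncake'] = {
    'link': '/wiki/Mooncake',
    'string': 'Mooncake',
    'title': 'Mooncake',
}

# Looking up a link by its href returns its fields
print(example_links['/wiki/Mooncake']['title'])
```

Because the href is the key, a link that appears several times on the page is stored only once, and a later occurrence overwrites an earlier one.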

### Gathering the URLs

Now we examine each 'a' (anchor) tag for the 'href' attribute. In HTML this attribute holds the link's destination. We can therefore skip any tags that don't have it and save the ones that do into the dictionary.

In [20]:
# Iterate through each anchor in our list of anchors
for element in anchor_on_page:

    # Skip any anchor that has no 'href' attribute
    href = element.get('href')
    if href is None:
        continue

    # Create a nested dictionary for each link, keyed by its 'href' value.
    # The nested dictionary stores any other metadata found inside the anchor
    linkDictionary[href] = {}

    # Store the URL itself under the key "link"
    linkDictionary[href]['link'] = str(href)

    # Grab the string attached to the anchor, using 'null' when there is none
    if element.string is None:
        linkDictionary[href]['string'] = 'null'
    else:
        linkDictionary[href]['string'] = element.string

    # This next part depends on the page you are using. In this example, the Wikipedia page sometimes
    # includes a "title" attribute in its anchors. When it is present we save it; otherwise we store 'null'
    if element.get('title') is None:
        linkDictionary[href]['title'] = 'null'
    else:
        linkDictionary[href]['title'] = element.get('title')


One of the main reasons we still insert 'title' and 'string' into the dictionary when they are missing is that we want our CSV to be consistent: each entry has something in every field, even if it is only a placeholder.

Depending on the website you are trying to scrape, anchors may carry metadata other than 'title'. There could be many more attributes or none at all, and how much of it you record is up to you. To see what other metadata is embedded, inspect the source code of the website.
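
Besides reading the page source, the attributes on a tag can be listed from code. A minimal sketch, assuming a hand-written anchor that resembles one from a Wikipedia page (the markup and the names `sample_html` and `anchor` are hypothetical):

```python
from bs4 import BeautifulSoup

# Hypothetical anchor resembling one found on a Wikipedia page
sample_html = '<a href="/wiki/Mooncake" title="Mooncake" class="mw-redirect">Mooncake</a>'
anchor = BeautifulSoup(sample_html, 'html.parser').find('a')

# .attrs is a dictionary of every attribute present on the tag
for name, value in anchor.attrs.items():
    print(name, '=', value)

# Sorted attribute names, handy for deciding which fields to record
attribute_names = sorted(anchor.attrs)
```

Note that `class` comes back as a list, because an element can have several classes, and that `anchor.get('rel')` returns `None` here since the attribute is absent.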

### Testing checkpoint

Before we move on to formatting the newly acquired information, let's see what linkDictionary looks like.

In [21]:
#For each key and value inside the dictionary we print out the result
for key, value in linkDictionary.items():
    print("Key: " + str(key) + "\n Value: " + str(value) + "\n")

Key: http://www.nytimes.com/content/help/site/ie9-support.html
 Value: {'link': 'http://www.nytimes.com/content/help/site/ie9-support.html', 'string': 'LEARN MORE »'}

Key: #top-news
 Value: {'link': '#top-news', 'string': 'Skip to content'}

Key: #site-index-navigation
 Value: {'link': '#site-index-navigation', 'string': 'Skip to navigation'}

Key: http://cn.nytimes.com
 Value: {'link': 'http://cn.nytimes.com', 'string': '中文 (Chinese)'}

Key: https://www.nytimes.com/es/
 Value: {'link': 'https://www.nytimes.com/es/', 'string': 'Español'}

Key: https://www.nytimes.com/
 Value: {'link': 'https://www.nytimes.com/', 'string': 'null'}

Key: http://www.nytimes.com/pages/todayspaper/index.html
 Value: {'link': 'http://www.nytimes.com/pages/todayspaper/index.html', 'string': 'null'}

Key: https://www.nytimes.com/video
 Value: {'link': 'https://www.nytimes.com/video', 'string': 'Video'}

Key: https://www.nytimes.com/section/world
 Value: {'link': 'https://www.nytimes.com/section/world', 'strin

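Each entry is meant to have three fields, 'link', 'string', and 'title', with the link as the key. The sample output above is cut off and shows links from a different site without the 'title' field, so it appears to come from an earlier run; your own output should include all three fields.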
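
Some entries above have 'string' set to 'null'. One likely cause is that `.string` is `None` whenever an anchor holds more than one child or no text at all, for example a link wrapped around an image. The sketch below compares `.string` with `get_text()` on hypothetical hand-written anchors; it is one possible alternative, not the approach the notebook uses.

```python
from bs4 import BeautifulSoup

# Hypothetical anchors: plain text, nested markup, and an image-only link
sample_html = """
<a href="/a">Plain text</a>
<a href="/b"><b>Bold</b> and plain</a>
<a href="/c"><img src="logo.png" alt="Logo"></a>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

labels = {}
for anchor in sample_soup.find_all('a', href=True):
    # .string is None when the anchor has several children or no text at all;
    # get_text() joins whatever text the anchor contains, possibly ''
    labels[anchor['href']] = (anchor.string, anchor.get_text(' ', strip=True))

for href, (string, text) in labels.items():
    print(href, repr(string), repr(text))
```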

### Writing to a csv file

Now that we have all the information we need, it is time to write it into a CSV file.

In [22]:
import csv

# Open a CSV file to save into (it is created if it doesn't exist).
# newline='' lets the csv module control the line endings
with open('webPageLink.csv', 'w', newline='', encoding='utf-8') as outputfile:
    # csv.writer quotes any field that contains a comma, so each row keeps its columns
    writer = csv.writer(outputfile)

    # Iterate through the dictionary and write one row per link
    for key, nestDictionary in linkDictionary.items():
        # The first field is the 'href' value, followed by each value in the nested dictionary
        writer.writerow([key, *nestDictionary.values()])

# The file is closed automatically when the with block ends
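
Link text and titles can themselves contain commas, so it is worth confirming that rows survive a round trip through CSV. The sketch below uses Python's `csv` module with a header row and an in-memory buffer instead of a file. The `sample_links` data is made up for illustration.

```python
import csv
import io

# Hypothetical records in the same shape as linkDictionary
sample_links = {
    '/wiki/Mooncake': {
        'link': '/wiki/Mooncake',
        'string': 'Mooncake',
        'title': 'Mooncake, a Chinese pastry',  # note the comma inside the field
    },
    '#References': {'link': '#References', 'string': 'References', 'title': 'null'},
}

# An in-memory buffer stands in for the file so the example leaves nothing on disk
buffer = io.StringIO(newline='')
writer = csv.DictWriter(buffer, fieldnames=['link', 'string', 'title'])
writer.writeheader()
writer.writerows(sample_links.values())

# Read the text back and confirm the comma-containing title survived intact
buffer.seek(0)
rows = list(csv.DictReader(buffer))
print(rows[0]['title'])
```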

## Completion

You have now parsed a web page for all of its hyperlinks and saved the relevant links and metadata into a CSV file. The CSV file does contain some unnecessary content (in this example, some links only jump down to the reference section at the bottom of the page), but every outgoing link is now saved.
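
If the in-page links are not wanted, they can be filtered out, and relative links can be turned into full URLs at the same time. A minimal sketch using the standard library's `urljoin`, with hypothetical href values and the assumption that anything starting with `#` is an in-page jump:

```python
from urllib.parse import urljoin

# The page the links were taken from; relative hrefs are resolved against it
base_url = 'https://en.wikipedia.org/wiki/Mid-Autumn_Festival'

# Hypothetical href values of the kinds that turn up in the dictionary
sample_hrefs = ['#References', '/wiki/Mooncake', 'https://www.example.org/page']

absolute_links = []
for href in sample_hrefs:
    # Fragment-only links just jump within the same page, so skip them
    if href.startswith('#'):
        continue
    # urljoin leaves absolute URLs alone and completes relative ones
    absolute_links.append(urljoin(base_url, href))

print(absolute_links)
```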