# Parsing an HTML Page for URLs Using Beautiful Soup 4

This notebook extracts the hyperlinks from a web page and writes them to a csv file. It covers how to fetch a URL, how to use Beautiful Soup 4 and Python 3 to extract the hyperlinks, and finally how to save them in csv format.

## Libraries and Resources used

-  Python 3
-  Beautiful Soup 4
-  Requests (a simple HTTP library for Python)

## Note:

To install the necessary libraries and resources, refer to the installation steps on their respective home pages for your operating system.

In this tutorial we will be using a page from Wikipedia. The page may have changed by the time you are reading this tutorial. If so, your own run of this code may produce different results or raise an error.

Written in September 2017

### Importing the required libraries

In [1]:
import requests
from bs4 import BeautifulSoup

### Loading the desired URL

In this example we are going to use the Wikipedia page about the Mid-Autumn Festival.

In [2]:
# Using requests we can fetch the URL we want
webPage = requests.get('https://en.wikipedia.org/wiki/Mid-Autumn_Festival#Mooncakes')

### Creating a Beautiful Soup Object

A Beautiful Soup object is a parsed representation of the page that lets us search and navigate its HTML in a variety of ways. In this tutorial we will only use a small part of the functionality built into the library.

In [3]:
# Here we create a soup object using the text from the webPage we declared earlier
# 'html.parser' tells the library that we are parsing HTML
soup = BeautifulSoup(webPage.text, 'html.parser')

### Finding all the 'a' or 'anchor' Tags

Now that we have the web page loaded it is time to get all the URLs. To do this we look for the 'a' Tags or the 'anchor' Tags. These contain the URLs we are looking for.

In [4]:
# Here we ask Beautiful Soup to find all the 'a' Tags in the webpage
anchor_on_page = soup.find_all('a')

### Storing the values we want

Before we gather the URLs, we need somewhere to save them, so that later on we have a more organized dataset to work from.

In [5]:
# Create a dictionary with information about each link
linkDictionary = {}

### Gathering the URLs

Now we are going to examine each 'a' or 'anchor' tag for the attribute 'href'. In HTML this is used to denote a link to another webpage. Therefore we can skip any tags that don't have this attribute and save the ones that do have it into the dictionary.

In [6]:
# Iterate through each anchor in our list of anchors
for element in anchor_on_page:

    # Read the 'href' attribute once; get() returns None when the attribute is missing
    href = element.get('href')

    # If the element does not have the attribute we are looking for we are going to skip it
    if href is None:
        continue

    # Create a nested dictionary for each link, keyed by its 'href' value.
    # The nested dictionary is used to store any other metadata/information that is also inside the anchor
    linkDictionary[href] = {}

    # In the nested dictionary we assign the key "link" the URL itself
    linkDictionary[href]['link'] = str(href)

    # Grab any string attached to the anchor, setting it to 'null' when none is present
    if element.string is None:
        linkDictionary[href]['string'] = 'null'
    else:
        linkDictionary[href]['string'] = element.string

    # This next part is dependent on the URL you are using. In this example, the Wikipedia page sometimes uses the
    # attribute "title" inside its anchors. Therefore in the cases where it does appear I want to save it; if it
    # doesn't exist then I want to enter "null" for the value.
    title = element.get('title')
    if title is None:
        linkDictionary[href]['title'] = 'null'
    else:
        linkDictionary[href]['title'] = title


One of the main reasons we still insert 'title' or 'string' into the dictionary when there isn't one is that we want our csv to be consistent: each entry will have something in every field, even if it is only a placeholder.

Depending on the website you are trying to scrape, the anchors may carry metadata other than 'title'. There could be many more attributes or none at all. How much of this information you record is up to you. To see what other metadata is embedded, inspect the source code of the website.
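If you want to record more attributes than 'title', the same idea can be sketched as a small helper. This is only an illustration: the name `buildEntry` and its parameters are hypothetical and not part of the code above. It assumes you pass in the anchor's attributes as a plain dictionary (Beautiful Soup exposes them as `element.attrs`).

```python
def buildEntry(href, text, attributes, extraFields=('title',), placeholder='null'):
    """Build one nested dictionary with the same fields for every link."""
    # 'link' and 'string' are always present, as in the loop above
    entry = {
        'link': str(href),
        'string': placeholder if text is None else text,
    }
    # Any extra attribute that is missing gets the placeholder value
    for field in extraFields:
        value = attributes.get(field)
        entry[field] = placeholder if value is None else value
    return entry


# Example with made-up attribute values
print(buildEntry('/wiki/China', None, {'title': 'China'}))
print(buildEntry('#mw-head', 'navigation', {}, extraFields=('title', 'class')))
```

Because every entry is built by the same function, every row ends up with the same number of fields, whichever attributes you choose to record.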

### Testing checkpoint

Before we move on to formatting the newly acquired information, we are going to see what linkDictionary looks like.

In [7]:
#For each key and value inside the dictionary we print out the result

#Limit it to 10 results
amountToPrint = 10

for key, value in linkDictionary.items():
    if amountToPrint > 0:
        print("Key: " + str(key) + "\n Value: " + str(value) + "\n")
        amountToPrint = amountToPrint - 1
    else:
        break

Key: #mw-head
 Value: {'link': '#mw-head', 'string': 'navigation', 'title': 'null'}

Key: #p-search
 Value: {'link': '#p-search', 'string': 'search', 'title': 'null'}

Key: /wiki/Tsukimi
 Value: {'link': '/wiki/Tsukimi', 'string': 'Tsukimi', 'title': 'Tsukimi'}

Key: /wiki/Chuseok
 Value: {'link': '/wiki/Chuseok', 'string': 'Chuseok', 'title': 'Chuseok'}

Key: /wiki/File:Mid-Autumn_Festival-beijing.jpg
 Value: {'link': '/wiki/File:Mid-Autumn_Festival-beijing.jpg', 'string': 'null', 'title': 'null'}

Key: /wiki/Beijing
 Value: {'link': '/wiki/Beijing', 'string': 'Beijing', 'title': 'Beijing'}

Key: /wiki/China
 Value: {'link': '/wiki/China', 'string': 'null', 'title': 'China'}

Key: /wiki/Taiwan
 Value: {'link': '/wiki/Taiwan', 'string': 'Taiwan', 'title': 'Taiwan'}

Key: /wiki/Malaysia
 Value: {'link': '/wiki/Malaysia', 'string': 'Malaysia', 'title': 'Malaysia'}

Key: /wiki/Singapore
 Value: {'link': '/wiki/Singapore', 'string': 'Singapore', 'title': 'Singapore'}



As you can see, each entry has three fields: 'link', 'string', and 'title'. The key of each entry is the link itself.
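The counter in the checkpoint loop can also be replaced by `itertools.islice` from the standard library. Here is a minimal sketch; the helper name `firstEntries` and the sample dictionary are made up for illustration.

```python
from itertools import islice


def firstEntries(dictionary, amount):
    """Return at most `amount` (key, value) pairs, in insertion order."""
    return list(islice(dictionary.items(), amount))


# A tiny stand-in for linkDictionary
sampleLinks = {
    '#mw-head': {'link': '#mw-head', 'string': 'navigation', 'title': 'null'},
    '/wiki/Tsukimi': {'link': '/wiki/Tsukimi', 'string': 'Tsukimi', 'title': 'Tsukimi'},
    '/wiki/Chuseok': {'link': '/wiki/Chuseok', 'string': 'Chuseok', 'title': 'Chuseok'},
}

for key, value in firstEntries(sampleLinks, 2):
    print("Key: " + str(key) + "\n Value: " + str(value) + "\n")
```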

### Writing to a csv file

Now that we have all the information we need, it is time to write it into a csv file

In [8]:
#Open a csv file to save it into (if it doesn't exist it will create one)
outputfile = open( 'webPageLink.csv', 'w' )

#Iterate through the created dictionary and create an entry for each
for key, nestDictionary in  linkDictionary.items() :
    #The first field is the 'href' value
    outputfile.write( str(key))
    #Add each value in the nested dictionary
    for value in nestDictionary.values():
        outputfile.write("," + str(value))
    #End the entry
    outputfile.write("\n")

#Close the file
outputfile.close()
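One thing to watch for when joining fields with "," by hand: a link text or title that itself contains a comma will spill into an extra column. Python's built-in `csv` module quotes such values for you. The sketch below is an alternative, not the method used above. The function name `writeLinks` and the sample data are hypothetical, and it writes to an in-memory buffer so it can run on its own.

```python
import csv
import io


def writeLinks(linkDictionary, outputfile):
    """Write one row per link: the key followed by the nested values."""
    writer = csv.writer(outputfile)
    for key, nestDictionary in linkDictionary.items():
        writer.writerow([key] + [str(value) for value in nestDictionary.values()])


# Made-up entry whose title contains a comma
sampleLinks = {
    '/wiki/Beijing': {'link': '/wiki/Beijing', 'string': 'Beijing', 'title': 'Beijing, China'},
}

buffer = io.StringIO()
writeLinks(sampleLinks, buffer)
print(buffer.getvalue())

# Reading the text back gives the original four fields
rows = list(csv.reader(io.StringIO(buffer.getvalue())))
print(rows)
```

To write to a real file instead, pass a file opened with `open('webPageLink.csv', 'w', newline='')`, preferably inside a `with` block so that it is closed automatically.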

## Completion

You have now parsed a web page for all the links on it and saved the relevant links, metadata, etc. into a csv file. The csv file does contain some unnecessary content (in this example, there are some links that only take you down to the reference section at the bottom of the web page), but every outgoing link is now saved.

## Additional Formatting

Although the code above gathers all the links, some of the entries only point to another section of the same page, and others are missing the base URL of the site. In the next part we are going to make sure we do not add entries that are not URLs, and we will add the base URL back to the entries that need it. To do this we will modify the previous code that writes the csv file.

The code also uses two helper functions. Their logic doesn't have to live in separate functions, but I have opted for that to keep the code a bit cleaner.

In [9]:
#Checks the key in our dictionary to see if it is either a new URL or a link to another page on the same site
def cleanKey(dictionaryKey, baseURL):

    #If it is a new URL don't change it
    #This is done by checking whether it starts with 'http'
    if dictionaryKey.startswith('http'):
        return dictionaryKey
    #If it is a link within the same site return it with the base site attached to it
    #This is done by checking whether it starts with '/'
    elif dictionaryKey.startswith('/'):
        return (baseURL + dictionaryKey)
    #If it is not a URL (for example an empty 'href' or a '#section' jump) return None
    else:
        return None

#Serves a similar purpose to cleanKey but only handles links within the same site
def reattachBase(tagWord, baseURL):
    if tagWord.startswith('/'):
        return (baseURL + tagWord)
    else:
        return tagWord

#Write to file called webPageLinkClean
outputfile = open( 'webPageLinkClean.csv', 'w' )

#The base url of the web page we are scraping (changes based on what site you use)
baseURL = "https://en.wikipedia.org"

#Iterate through each value in our link dictionary
for key, nestDictionary in linkDictionary.items():

    #Parse the key
    parsedURL = cleanKey(str(key), baseURL)

    #If it is a URL we will add it to the csv
    if parsedURL is not None:

        #Write into the file the parsed/formatted URL
        outputfile.write(parsedURL)

        for value in nestDictionary.values():
            #If any of the values is a link within the same site reattach the base
            outputfile.write("," + reattachBase(str(value), baseURL))

        outputfile.write("\n")

#Close the file
outputfile.close()
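Checking prefixes by hand works for this page, but it does not cover every form an 'href' can take, such as links that begin with '//'. As an alternative sketch, the standard library's `urllib.parse.urljoin` can resolve an 'href' against the address of the page. The helper name `resolveLink` and the `PAGE_URL` constant below are assumptions made for illustration.

```python
from urllib.parse import urljoin, urlparse

# Assumed for illustration: the address of the page the links were scraped from
PAGE_URL = 'https://en.wikipedia.org/wiki/Mid-Autumn_Festival'


def resolveLink(href, pageURL=PAGE_URL):
    """Return an absolute http(s) URL for href, or None if it is not one."""
    # Skip empty values and jumps within the same page such as '#mw-head'
    if not href or href.startswith('#'):
        return None
    # urljoin leaves absolute URLs alone and completes relative ones
    absolute = urljoin(pageURL, href)
    # Drop anything that is not a web address, e.g. 'mailto:' links
    if urlparse(absolute).scheme not in ('http', 'https'):
        return None
    return absolute


samples = ['/wiki/China', '#mw-head', '//upload.wikimedia.org/x.jpg',
           'https://example.org/page', 'mailto:someone@example.org']
for sample in samples:
    print(sample, '->', resolveLink(sample))
```

Whether same-page jumps should be dropped, as they are here, depends on what you plan to do with the csv afterwards.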
    
    