# Web Scraping Wikipedia 

This notebook shows how to extract the contents of different tags from a Wikipedia page. It focuses mostly on the text inside the "paragraph" tags. Although the example uses Wikipedia, parts of it can be adapted to work on other websites. It covers how to read a URL and how to use Beautiful Soup 4 with Python 3 to extract and parse different tags inside a Wikipedia page.

## Libraries and Resources used

-  Python 3
-  Beautiful Soup 4 and some of its related modules
-  Requests (a simple HTTP library for Python)

## Note:

For installation of the necessary resources and libraries, refer to their respective home pages for the installation steps for your operating system.

In this tutorial we will be using a URL from Wikipedia. It is possible that the page behind this URL has changed by the time you view this example. If so, your own run of this code may produce different results or raise an error.

Written in September 2017

### Importing the required libraries

In [1]:
# requests handles the HTTP request for the URL
import requests

# Import the main Beautiful Soup class used to parse the HTML
from bs4 import BeautifulSoup

### Loading the desired URL

In this example we are going to use the Wikipedia page about web scraping.

In [2]:
# Gets the HTML information about the web page
webPage = requests.get('https://en.wikipedia.org/wiki/Web_scraping')
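
The request above assumes the page downloads successfully. As a minimal sketch of a more defensive fetch, the hypothetical helper below (the name fetch_html, the USER_AGENT string and the optional session argument are our own assumptions, not part of the original notebook) sets a timeout and raises an error on a failed response instead of handing an error page to the parser.

```python
import requests

# Hypothetical identification string; replace it with your own details
USER_AGENT = "wiki-scraping-tutorial/0.1 (contact: you@example.com)"


def fetch_html(url, timeout=10, session=None):
    """Return the HTML text of `url`, raising an error if the request fails."""
    # Use a caller-supplied session if given, otherwise the requests module itself
    http = session if session is not None else requests
    response = http.get(url, headers={"User-Agent": USER_AGENT}, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.text
```

Calling fetch_html('https://en.wikipedia.org/wiki/Web_scraping') would then return the same text as webPage.text in the cell above, provided the request succeeds.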

### Creating a Beautiful Soup Object

A Beautiful Soup object is a special type of object that allows us to perform a variety of operations on a parsed page. The library allows us to extract key parts of the page, such as its tags.

In [3]:
# Here we create a soup object using the text from the webPage we declared earlier
# 'html.parser' tells the library to parse the text as HTML with Python's built-in parser
soup = BeautifulSoup(webPage.text, 'html.parser')
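
To get a feel for what a soup object offers without depending on the live page, here is a small sketch that parses a hand-written HTML snippet. The snippet and the variable names are illustrative assumptions; only the Beautiful Soup calls themselves come from the library.

```python
from bs4 import BeautifulSoup

# A small hand-written HTML snippet (illustrative, not taken from Wikipedia)
html = """
<html>
  <head><title>Example page</title></head>
  <body>
    <h1>Heading</h1>
    <p>First paragraph with a <a href="/wiki/Website">link</a>.</p>
    <p>Second paragraph.</p>
  </body>
</html>
"""

demo_soup = BeautifulSoup(html, "html.parser")

page_title = demo_soup.title.string              # text inside the <title> tag
first_paragraph = demo_soup.p                    # first <p> tag in the document
paragraph_count = len(demo_soup.find_all("p"))   # number of <p> tags
first_link_target = demo_soup.a["href"]          # attribute lookup on a tag

print(page_title)
print(first_paragraph)
print(paragraph_count)
print(first_link_target)
```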

### Iterating through Tags

As you may notice, any Wikipedia page has hyperlinks inside its content, and each hyperlink takes you to another Wikipedia page. These hyperlinks show up as extra markup when we look at the HTML of a "paragraph" tag inside a Wikipedia page, as shown below.

In [4]:
# Prints the 1st paragraph tag inside the soup object
print(soup.p)

<p><b>Web scraping</b>, <b>web harvesting</b>, or <b>web data extraction</b> is <a href="/wiki/Data_scraping" title="Data scraping">data scraping</a> used for <a href="/wiki/Data_extraction" title="Data extraction">extracting data</a> from <a href="/wiki/Website" title="Website">websites</a>.<sup class="reference" id="cite_ref-Boeing2016JPER_1-0"><a href="#cite_note-Boeing2016JPER-1">[1]</a></sup> Web scraping software may access the World Wide Web directly using the <a href="/wiki/Hypertext_Transfer_Protocol" title="Hypertext Transfer Protocol">Hypertext Transfer Protocol</a>, or through a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a <a href="/wiki/Internet_bot" title="Internet bot">bot</a> or <a href="/wiki/Web_crawler" title="Web crawler">web crawler</a>. It is a form of copying, in which specific data is gathered and copied from the web, typically into a central local <a href="/wiki/Dat

As you can see, the "paragraph" tag contains more embedded tags, such as the "anchor" tag. We would therefore like to clean it up so that we are left with just the text and none of the metadata and HTML tags.

### Cleaning all Tags inside the HTML Page

If the goal is to remove all the tags and the metadata associated with the HTML page, this is the simplest method to do so.


In [5]:
# Find all the tags you want to save
# In this example I am only concerned with the "paragraph" tag
# To also keep the "list item" and "header 1" tags, pass ["p", "li", "h1"] instead
# If you want everything you can just use "soup.find_all()" with no arguments
wikiContent = soup.find_all(["p"])

# Print out the first few items to check the result
count = 10

for item in wikiContent:
    # Display only the first 10 results
    if count > 0:
        # Get the item's text (does not return the metadata)
        print(item.get_text() + " \n")
        count = count - 1
    else:
        break

Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites.[1] Web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol, or through a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying, in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis. 

Web scraping a web page involves fetching it and extracting from it.[1][2] Fetching is the downloading of a page (which a browser does when you view the page). Therefore, web crawling is a main component of web scraping, to fetch pages for later processing. Once fetched, then extraction can take place. The content of a page may be parsed, searched, reformatted, its data copied into a spreadsheet, and so on. Web scrapers typically ta

As you can see, we have removed all the metadata and tags associated with the HTML and we are left with just the text.
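
The cleaned text still contains citation markers such as "[1]". As a hedged sketch (the HTML snippet below is a shortened, hand-written imitation of the paragraph shown earlier, and the variable names are our own), one way to drop them is to remove the "sup" tags with the "reference" class before calling get_text().

```python
from bs4 import BeautifulSoup

# Illustrative snippet modelled on the structure of the paragraph shown earlier
html = (
    '<p><b>Web scraping</b> is <a href="/wiki/Data_scraping">data scraping</a> '
    'used for extracting data.<sup class="reference">'
    '<a href="#cite_note-1">[1]</a></sup></p>'
)

demo_soup = BeautifulSoup(html, "html.parser")
paragraph = demo_soup.p

# get_text() joins every string inside the tag and drops the markup
with_citations = paragraph.get_text()

# Optionally remove the citation superscripts before extracting the text
for sup in paragraph.find_all("sup", class_="reference"):
    sup.decompose()
without_citations = paragraph.get_text()

print(with_citations)
print(without_citations)
```

Note that decompose() modifies the soup in place, so it should only be used when the citation markers are not needed later.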

### Saving into a Text File

Now that we have acquired a cleaned version of the HTML, we can save it into a text file.

In [6]:
# Open a text file to save it into (if it doesn't exist it will be created)
# utf-8 is set explicitly so non-ASCII characters are written the same way on every platform
with open('cleanedHTML.txt', 'w', encoding='utf-8') as outputfile:

    # For every item we stored earlier in wikiContent
    for item in wikiContent:

        # Write the text into the file and end it with a "newline"
        # If you do not want the division between the elements, remove the '+ " \n"' inside the brackets
        outputfile.write(item.get_text() + " \n")

# The file is closed automatically at the end of the "with" block

### Conclusion

Now you know how to clean a Wikipedia page and write the cleaned text into a text file for future use. The sections below show other methods to clean the HTML. Depending on your needs, these may or may not be relevant.

### Parsing What You Need

If you are only concerned with the "paragraph" tags in a web page, there is a way to extract just those tags. This becomes relevant in the next sections, since we do not want to clean what we are not going to use. For the next few sections we need to import some additional modules.

In [7]:
# The additional modules we are going to need
from bs4 import SoupStrainer, NavigableString, Tag

### Creating A More Filtered Soup Object

In this section we are going to show how to create a soup object that only contains one type of tag. To do so we are going to use SoupStrainer.

In [8]:
# This is the basic soup object previously seen
oldSoup = BeautifulSoup(webPage.text, 'html.parser')

# Create a filter that keeps only one particular tag in the HTML
# In this example it is the "paragraph" tag
only_p_tags = SoupStrainer("p")

# This creates the soup object under the condition of the filter
newSoup = BeautifulSoup(webPage.text, 'html.parser', parse_only = only_p_tags)
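
To make the difference between the two soup objects concrete, the sketch below parses the same hand-written snippet twice, once in full and once through a SoupStrainer. The snippet and variable names are illustrative assumptions.

```python
from bs4 import BeautifulSoup, SoupStrainer

# A small hand-written HTML snippet (illustrative, not taken from Wikipedia)
html = """
<html><body>
<h1>Title</h1>
<p>First <a href="/wiki/Website">link</a></p>
<ul><li>Item</li></ul>
<p>Second</p>
</body></html>
"""

full_soup = BeautifulSoup(html, "html.parser")
p_only_soup = BeautifulSoup(html, "html.parser", parse_only=SoupStrainer("p"))

# Names of every tag each soup object contains
full_tag_names = [tag.name for tag in full_soup.find_all(True)]
p_only_tag_names = [tag.name for tag in p_only_soup.find_all(True)]

print(full_tag_names)
print(p_only_tag_names)
```

The filtered soup keeps the "paragraph" tags together with whatever is nested inside them, while tags outside a paragraph, such as the heading and the list, are never parsed.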

### Cleaning the Beautiful Soup Object

This involves manually removing the tags present inside the soup object. The following example works with the old soup object as well as the new one.

### Parsing the Original

First, let's see what the first "paragraph" tag in the original soup object contains.

In [9]:
print(oldSoup.p)

<p><b>Web scraping</b>, <b>web harvesting</b>, or <b>web data extraction</b> is <a href="/wiki/Data_scraping" title="Data scraping">data scraping</a> used for <a href="/wiki/Data_extraction" title="Data extraction">extracting data</a> from <a href="/wiki/Website" title="Website">websites</a>.<sup class="reference" id="cite_ref-Boeing2016JPER_1-0"><a href="#cite_note-Boeing2016JPER-1">[1]</a></sup> Web scraping software may access the World Wide Web directly using the <a href="/wiki/Hypertext_Transfer_Protocol" title="Hypertext Transfer Protocol">Hypertext Transfer Protocol</a>, or through a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a <a href="/wiki/Internet_bot" title="Internet bot">bot</a> or <a href="/wiki/Web_crawler" title="Web crawler">web crawler</a>. It is a form of copying, in which specific data is gathered and copied from the web, typically into a central local <a href="/wiki/Dat

Now let's see if we can replace all the "anchor" tags with the text we see on the original page.

In [10]:
# This code aims to remove all the anchor tags
# Iterate through all the anchor tags
while True:

    # If there are no more anchor tags we are done
    if oldSoup.a is None:
        break

    # If the anchor has no single string inside it, just insert a white space
    elif oldSoup.a.string is None:
        oldSoup.a.replace_with(" ")

    # Replace the anchor tag with the string inside
    else:
        # Adding spaces before and after the words
        newFormat = " " + str(oldSoup.a.string) + " "
        oldSoup.a.replace_with(newFormat)

The code above removes every "anchor" tag, along with its metadata, and leaves only the link text in its place.
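
Beautiful Soup also provides an unwrap() method that removes a tag while keeping its contents. The sketch below is an alternative to the loop above, not the notebook's own method. It runs on a hand-written snippet and, unlike the loop, does not pad the link text with spaces.

```python
from bs4 import BeautifulSoup

# Illustrative snippet with two anchor tags inside one paragraph
html = (
    '<p>It is <a href="/wiki/Data_scraping">data scraping</a> '
    'from <a href="/wiki/Website">websites</a>.</p>'
)
demo_soup = BeautifulSoup(html, "html.parser")

# find_all() collects the matches first, so the tree can be modified while looping
for anchor in demo_soup.find_all("a"):
    # unwrap() removes the tag itself but leaves its contents in place
    anchor.unwrap()

print(demo_soup.p)
```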

In [11]:
print(oldSoup.p)

<p><b>Web scraping</b>, <b>web harvesting</b>, or <b>web data extraction</b> is  data scraping  used for  extracting data  from  websites .<sup class="reference" id="cite_ref-Boeing2016JPER_1-0"> [1] </sup> Web scraping software may access the World Wide Web directly using the  Hypertext Transfer Protocol , or through a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a  bot  or  web crawler . It is a form of copying, in which specific data is gathered and copied from the web, typically into a central local  database  or spreadsheet, for later  retrieval  or  analysis .</p>


As you can see, compared to the previous example we have now removed all the "anchor" tags. This exact process can be repeated for other tag types as well. For example, what if we also wanted to remove all the "b" tags?

In [12]:
# Remove the anchor tags first (the previous code)
while True:
    if oldSoup.a is None:
        break
    elif oldSoup.a.string is None:
        oldSoup.a.replace_with(" ")
    else:
        newFormat = " " + str(oldSoup.a.string) + " "
        oldSoup.a.replace_with(newFormat)

# Then remove the "b" tags in a separate loop
# (a single shared loop would stop as soon as the anchor tags ran out)
while True:

    # If there are no more "b" tags we are done
    if oldSoup.b is None:
        break

    # If the "b" tag has no single string inside it, just insert a white space
    elif oldSoup.b.string is None:
        oldSoup.b.replace_with(" ")

    # Replace the "b" tag with the string inside
    else:
        # Adding spaces before and after the words
        newFormat = " " + str(oldSoup.b.string) + " "
        oldSoup.b.replace_with(newFormat)

In [13]:
print(oldSoup.p)

<p> Web scraping ,  web harvesting , or  web data extraction  is  data scraping  used for  extracting data  from  websites .<sup class="reference" id="cite_ref-Boeing2016JPER_1-0"> [1] </sup> Web scraping software may access the World Wide Web directly using the  Hypertext Transfer Protocol , or through a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a  bot  or  web crawler . It is a form of copying, in which specific data is gathered and copied from the web, typically into a central local  database  or spreadsheet, for later  retrieval  or  analysis .</p>


This approach can therefore be expanded to cover whatever tags you wish to remove. All that is needed is to repeat the same check-and-replace steps for each additional tag.
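
One way to avoid copying the same steps for every tag is to wrap them in a function that takes a list of tag names. The helper below is a minimal sketch under that assumption. The name replace_tags_with_text is hypothetical, and the snippet it runs on is hand-written.

```python
from bs4 import BeautifulSoup


def replace_tags_with_text(soup, tag_names):
    """Replace every tag named in `tag_names` with its padded text, in place."""
    while True:
        # find() accepts a list of names and returns the first match, or None
        tag = soup.find(tag_names)
        if tag is None:
            break
        if tag.string is None:
            # No single string inside the tag (it is empty or has several children)
            tag.replace_with(" ")
        else:
            # Same padding as in the loops above
            tag.replace_with(" " + str(tag.string) + " ")
    return soup


html = '<p><b>Web scraping</b> is <a href="/wiki/Data_scraping">data scraping</a>.</p>'
demo_soup = replace_tags_with_text(BeautifulSoup(html, "html.parser"), ["a", "b"])
print(demo_soup.p)
```

Searching again after every replacement, rather than looping over a list collected up front, keeps the function safe when one listed tag is nested inside another.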

### Saving into Text File

Unlike the earlier example, a slight alteration is needed in order to save this new version. The reason is that the previous example used a prebuilt method that stripped all the tags, whereas in this case we only removed two (the "a" and "b" tags), so the remaining markup has to be written out as strings.

In [14]:
# Gather only the "p", "li" and "h1" tags
parsedWikiContent = oldSoup.find_all(["p","li","h1"])

# Open a text file to save it into (if it doesn't exist it will be created)
with open('selectiveCleanHTML.txt', 'w', encoding='utf-8') as outputfile:

    # For every item we stored in parsedWikiContent
    for item in parsedWikiContent:

        # Convert the objects to strings so we can write them to the file
        outputfile.write(str(item) + " \n")

# The file is closed automatically at the end of the "with" block
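
To check what actually ends up in such a file, the following sketch writes a few elements from a hand-written snippet to a temporary file and reads them back. The snippet, the temporary directory and the variable names are assumptions made for illustration.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# A small hand-written HTML snippet (illustrative, not taken from Wikipedia)
html = "<h1>Title</h1><p>Some <i>styled</i> text</p><ul><li>Item</li></ul>"
demo_soup = BeautifulSoup(html, "html.parser")
elements = demo_soup.find_all(["p", "li", "h1"])

# Write to a temporary directory so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "selectiveCleanHTML.txt")

    with open(path, "w", encoding="utf-8") as output_file:
        for element in elements:
            # str() keeps whatever markup is still inside the element
            output_file.write(str(element) + "\n")

    with open(path, encoding="utf-8") as input_file:
        saved_lines = input_file.read().splitlines()

print(saved_lines)
```

The elements are written in document order, and any tags that were not removed, such as the "i" tag here, are saved as markup.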