# Web Scraping  

Web scraping lets you obtain data from websites, such as news articles and blogs. However, not all websites can be scraped, and they are not all structured the same way. It is therefore important to know what information you want from a website and to check whether it can be extracted.

To start off, sites that do not embed their information directly in the page's HTML cannot be scraped by the following code. To find out whether the site you are interested in does this, see below.

This notebook uses two examples to showcase how this can be done. The first looks at the different tags located on a Wikipedia page, focusing mostly on the information within the "paragraph" tag. The second uses a news article from the New York Times.

## Libraries and Resources used

-  Python 3
-  Beautiful Soup 4 and some of its related libraries 
-  Requests (a simple HTTP library for Python)

## Note:

To install the necessary resources and libraries, refer to their respective home pages for the installation steps for your operating system.

In this tutorial we will be using a hyperlink from Wikipedia. By the time you view this example, the page at that URL may have changed. If so, your own run of this code may produce different results or throw an error.

Written in September 2018

### Importing the required libraries

In [1]:
# requests handles downloading the page at a URL
import requests

# Import the functionality we need from Beautiful Soup
from bs4 import BeautifulSoup

### Loading the desired URL

Before we move any further, it is important to know whether the site you are interested in can be scraped. First open the site in your preferred browser and press F12, or your browser's equivalent, to open the developer tools (also called the developer console). This opens a panel, somewhere on your screen, that allows you to inspect the website.

For a more detailed explanation of what is there, see: https://support.airtable.com/hc/en-us/articles/232313848-How-to-open-the-developer-console 

What is most important here is the button next to "Inspector" (this tab is called "Elements" in Chrome). It has an icon that looks like a square/rectangle with a cursor in the corner. After clicking it, we can look at each element of the webpage: if you hover your cursor over a part of the website you are interested in, you can see the "tag" it is under. This is important to know, as this is where you want to scrape for your information.

Alternatively, we can look under the "Inspector" tab itself. As you hover your mouse over the different sections listed there, the matching parts of the site are highlighted.

If you cannot find the information you want using either approach, the site uses some other method to display its content (when the content is embedded in the page, you will see an exact copy of it in the "Inspector" section). In that case the following code will not work on that site.
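The same check can also be done in code. The following is a minimal sketch and not part of the original notebook: the helper name `text_in_static_html` and both sample HTML strings are assumptions made for illustration. It tests whether a phrase you can see in the browser is present in the HTML text itself. This is a slightly stricter test than the browser view, because the "Inspector" shows the page after any scripts have run.

```python
from bs4 import BeautifulSoup


def text_in_static_html(html, phrase):
    """Return True if `phrase` appears in the text of `html` (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # Collapse runs of whitespace so line breaks in the markup do not matter
    page_text = " ".join(soup.get_text(" ").split())
    return phrase.lower() in page_text.lower()


# Made-up page whose content is embedded directly in the HTML
static_page = "<html><body><p>Web scraping is data scraping.</p></body></html>"

# Made-up page whose content would be filled in later by JavaScript
dynamic_page = "<html><body><div id='app'></div><script>render()</script></body></html>"

print(text_in_static_html(static_page, "data scraping"))
print(text_in_static_html(dynamic_page, "data scraping"))
```

If the phrase is missing from the HTML you downloaded but visible in the browser, the site most likely builds its content some other way, and the approach in this notebook will not reach it.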

For the following examples we will be using:

Wikipedia page: https://en.wikipedia.org/wiki/Web_scraping

New York Times: https://www.nytimes.com/2018/09/28/technology/facebook-hack-data-breach.html?action=click&module=Top%20Stories&pgtype=Homepage


In [2]:
# Gets the HTML information about the web page
WikiWebPage = requests.get('https://en.wikipedia.org/wiki/Web_scraping')

NewsWebPage = requests.get('https://www.nytimes.com/2018/09/28/technology/facebook-hack-data-breach.html?action=click&module=Top%20Stories&pgtype=Homepage')
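A request can fail, for example when a page has moved or the server refuses the connection. As a hedged sketch, the download step could be wrapped in a small helper that sets a timeout and raises an error on a bad HTTP status. The name `fetch_html`, the 10 second default, and the optional `session` parameter are assumptions for illustration and do not appear in the original notebook.

```python
import requests


def fetch_html(url, timeout=10.0, session=None):
    """Download `url` and return its HTML text, raising on HTTP errors."""
    # Use the caller's session if one is given, otherwise the requests module itself
    http = session if session is not None else requests
    response = http.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.text
```

For example, `fetch_html('https://en.wikipedia.org/wiki/Web_scraping')` would return the same text as `WikiWebPage.text` above, provided the request succeeds.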

### Creating a Beautiful Soup Object

A Beautiful Soup object is a special type of object that allows us to perform a variety of different operations. The library lets us extract key information about the page, such as its tags.

In [3]:
# Here we create a soup object using the text from the web pages we downloaded earlier
# 'html.parser' tells the library that we are parsing HTML

# Wikipedia Soup
soupWiki = BeautifulSoup(WikiWebPage.text, 'html.parser')

# News Soup
soupNews = BeautifulSoup(NewsWebPage.text, 'html.parser')
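To give a feel for the operations a soup object supports, here is a small sketch that runs on a made-up HTML string instead of a downloaded page. The names `sample_html` and `sample_soup` and the page content are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded page
sample_html = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <h1>Heading</h1>
    <p>First <a href="/wiki/Example">linked</a> paragraph.</p>
    <p>Second paragraph.</p>
  </body>
</html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# The text inside the <title> tag
page_title = sample_soup.title.string
# An attribute of the first <a> tag on the page
first_link = sample_soup.a["href"]
# How many <p> tags the page contains
paragraph_count = len(sample_soup.find_all("p"))

print(page_title)
print(first_link)
print(paragraph_count)
```

Writing `sample_soup.title` or `sample_soup.a` returns the first tag with that name, which is the same shortcut used with `soupWiki.p` later in this notebook.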

## Tags

Websites have content that is separated into different sections, each under a different "tag". A common place to find content is under the "paragraph" or "< p >" tag. This is not the same for every site, but it is a good place to start.

To find out more about tags see: https://www.w3schools.com/tags/ 
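Before deciding which tag to scrape, it can help to see which tags a page actually uses. The sketch below counts tag names on a made-up HTML string; the sample markup and variable names are assumptions for illustration.

```python
from collections import Counter

from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded page
sample_html = (
    "<html><body>"
    "<h1>Title</h1>"
    "<p>One <b>bold</b> word.</p>"
    "<p>Two <a href='#'>links</a> and <a href='#'>more</a>.</p>"
    "</body></html>"
)
sample_soup = BeautifulSoup(sample_html, "html.parser")

# find_all(True) matches every tag, so this yields one name per element
tag_counts = Counter(tag.name for tag in sample_soup.find_all(True))

# List the tag names from most to least frequent
for name, count in tag_counts.most_common():
    print(name, count)
```

On a real page, a tag with a high count and long text content is often a reasonable first candidate for where the main content lives.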

### Iterating through Tags (Wikipedia)

As you may notice, any Wikipedia page contains hyperlinks that bring you to other pages. When we take a look at the "< p >" tags, we can see that there is a lot of extra information we may not need for our analysis.

In [4]:
# Prints the 1st paragraph tag inside the soup object
print(soupWiki.p)

<p><b>Web scraping</b>, <b>web harvesting</b>, or <b>web data extraction</b> is <a href="/wiki/Data_scraping" title="Data scraping">data scraping</a> used for <a href="/wiki/Data_extraction" title="Data extraction">extracting data</a> from <a href="/wiki/Website" title="Website">websites</a>.<sup class="reference" id="cite_ref-Boeing2016JPER_1-0"><a href="#cite_note-Boeing2016JPER-1">[1]</a></sup> Web scraping software may access the World Wide Web directly using the <a href="/wiki/Hypertext_Transfer_Protocol" title="Hypertext Transfer Protocol">Hypertext Transfer Protocol</a>, or through a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a <a href="/wiki/Internet_bot" title="Internet bot">bot</a> or <a href="/wiki/Web_crawler" title="Web crawler">web crawler</a>. It is a form of copying, in which specific data is gathered and copied from the web, typically into a central local <a href="/wiki/Dat

As you can see, inside the "< p >" tag there are more embedded tags, such as the "anchor" or "< a >" tag. We would therefore like to clean it up so that we are left with just the text and none of the additional HTML tags or metadata.

### Removing Excess HTML Tags in the Wikipedia Page

Suppose the goal is to remove all the tags and the metadata associated with the Wikipedia page. There are many different methods to do this, so what is shown below is just one way to do it.

In [6]:
# Find all the tags you want to save
# In this example we only keep the "paragraph" tag; to include others, such as
# "list item" and "header 1", pass ["p", "li", "h1"] instead
# If you want every tag, call "soupWiki.find_all()" with no arguments
wikiContent = soupWiki.find_all(["p"])

# Print the first 10 results to check them
for item in wikiContent[:10]:
    # Get the item's text (does not return the tags or their attributes)
    print(item.get_text() + " \n")
    

Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites.[1] Web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol, or through a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying, in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.
 

Web scraping a web page involves fetching it and extracting from it.[1][2] Fetching is the downloading of a page (which a browser does when you view the page). Therefore, web crawling is a main component of web scraping, to fetch pages for later processing. Once fetched, then extraction can take place. The content of a page may be parsed, searched, reformatted, its data copied into a spreadsheet, and so on. Web scrapers typically t

As you can see, compared to before this is a lot easier to read and more useful for analysis. By removing all the additional HTML tags and metadata, we can now begin analysis on the text itself.
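The `find_all` call is not limited to a single tag name. As a sketch on a made-up HTML string (the sample markup and variable names are assumptions), passing a list of names collects several kinds of tag at once, and the `limit` argument stops the search after a given number of matches.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded page
sample_html = (
    "<h1>Web scraping</h1>"
    "<p>Intro paragraph.</p>"
    "<ul><li>First item</li><li>Second item</li></ul>"
    "<p>Closing paragraph.</p>"
)
sample_soup = BeautifulSoup(sample_html, "html.parser")

# A list of names matches any of those tags, in document order
mixed_text = [tag.get_text() for tag in sample_soup.find_all(["h1", "p", "li"])]

# limit stops the search once that many matches have been found
first_paragraph_only = [tag.get_text() for tag in sample_soup.find_all("p", limit=1)]

print(mixed_text)
print(first_paragraph_only)
```

Using `limit` is an alternative to slicing the result when only the first few matches are needed.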

Now that we have discussed how to scrape a Wikipedia article, we can try to scrape a news article.

### Iterating through Tags (New York Times)

Similar to before we are going to look at the paragraph tags in the New York Times Article.

In [7]:
# Prints the 1st paragraph tag inside the soup object
print(soupNews.p)

<p>Advertisement</p>


Unlike the Wikipedia article, we don't see many additional HTML tags in the first paragraph tag. However, this may not be the case with all the websites you encounter. It is therefore important to know beforehand what content you need and what tag it is under.

Again like before we are going to see if we can clean and find all the content located in a paragraph tag. In this example we are going to showcase the first 10 entries. 

In [10]:
# Find all the tags you want to save
# In this example we only keep the "paragraph" tag; to include others, such as
# "list item" and "header 1", pass ["p", "li", "h1"] instead
# If you want every tag, call "soupNews.find_all()" with no arguments
NewsContent = soupNews.find_all(["p"])

# Print the first 10 results to check them
for item in NewsContent[:10]:
    # Get the item's text (does not return the tags or their attributes)
    print(item.get_text() + " \n")
    

Advertisement 

Supported by 

By Mike Isaac and Sheera Frenkel 

SAN FRANCISCO — Facebook, already facing scrutiny over how it handles the private information of its users, said on Friday that an attack on its computer network had exposed the personal information of nearly 50 million users. 

The breach, which was discovered this week, was the largest in the company’s 14-year history. The attackers exploited a feature in Facebook’s code to gain access to user accounts and potentially take control of them. 

The news could not have come at a worse time for Facebook. It has been buffeted over the last year by scandal, from revelations that a British analytics firm got access to the private information of up to 87 million users to worries that disinformation on Facebook has affected elections and even led to deaths in several countries. 

Senior executives have testified several times this year in congressional hearings where some lawmakers suggested that the government will need to step
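The first few results above ("Advertisement", "Supported by", and the byline) are paragraph tags but not article text. One rough way to skip such entries is to keep only paragraphs with a minimum number of words. This is a heuristic sketch and not part of the original notebook: the threshold `MIN_WORDS`, the sample markup, and the variable names are all assumptions, and a real article may contain short paragraphs that this filter would wrongly drop.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the paragraph tags of a news page
sample_html = (
    "<p>Advertisement</p>"
    "<p>Supported by</p>"
    "<p>By A. Reporter</p>"
    "<p>This sentence stands in for a full paragraph of article text, "
    "which is normally much longer than a label or a byline.</p>"
)
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Assumed threshold; tune it for the site you are scraping
MIN_WORDS = 5

body_paragraphs = []
for tag in sample_soup.find_all("p"):
    text = tag.get_text().strip()
    # Keep only paragraphs that are long enough to be article text
    if len(text.split()) >= MIN_WORDS:
        body_paragraphs.append(text)

print(len(body_paragraphs))
```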

### Saving into a Text File

Now that we have cleaned out all the tags and metadata from the content, we can save the result into a text file for later analysis. In the following example we will use the cleaned Wikipedia page.

In [6]:
# Open a text file to save into (it is created if it doesn't exist)
# The "with" block closes the file automatically when it finishes
with open('cleanedHTML.txt', 'w', encoding='utf-8') as outputfile:

    # For every item we declared earlier in wikiContent
    for item in wikiContent:

        # Write the text into the file and end it with a "newline"
        # If you do not want the division between the elements, remove the + " \n" inside the brackets
        outputfile.write(item.get_text() + " \n")

### Conclusion

Now that you know how to clean a webpage and save the resulting content into a text file for future use, we can look into a more in-depth method of cleaning. The sections below show other methods to clean the HTML. Depending on your needs, these may or may not be relevant.

### Parsing What You Need

If you are only concerned with the "paragraph" tags in a webpage, there is a method that extracts just those tags, so that we do not clean what we are not going to use. For the next few sections we need to import some additional modules to help us do so.

In [11]:
# The additional modules we are going to need
from bs4 import SoupStrainer, NavigableString, Tag

### Creating A More Filtered Soup Object

In this section we show how to create a soup object that only contains one type of tag. To do so we use SoupStrainer (from Beautiful Soup). The following examples use the Wikipedia page as the example site and look at just the "paragraph" or "p" tag. If you would like to search for other tags, just replace the "p" in the "SoupStrainer".

In [35]:
# Create a parser that looks for one particular tag in the HTML
# In this example it is the "paragraph" tag
only_p_tags = SoupStrainer("p")

# This is the basic soup object seen previously, now restricted to the tags the strainer accepts
ParseSoup = BeautifulSoup(WikiWebPage.text, 'html.parser', parse_only=only_p_tags)
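To see what the strainer changes, the sketch below parses the same made-up HTML string twice, once normally and once with `parse_only`. The sample markup and variable names are assumptions for illustration.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical stand-in for a downloaded page
sample_html = (
    "<html><body>"
    "<h1>Heading</h1>"
    "<p>First paragraph.</p>"
    "<div>Not a paragraph.</div>"
    "<p>Second paragraph.</p>"
    "</body></html>"
)

# Parse everything, then parse again keeping only the <p> tags
full_soup = BeautifulSoup(sample_html, "html.parser")
p_only_soup = BeautifulSoup(sample_html, "html.parser", parse_only=SoupStrainer("p"))

# Names of every tag that survived in each soup
full_names = [tag.name for tag in full_soup.find_all(True)]
p_only_names = [tag.name for tag in p_only_soup.find_all(True)]

print(full_names)
print(p_only_names)
```

In the filtered soup, tags outside the paragraphs, such as the heading and the div, are never added to the tree at all.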

### Cleaning the Beautiful Soup Object

Now that we have all the "paragraph" or "p" content saved into a soup object, we can manually clean out the elements we choose. This can be important, as you may want to leave certain elements in for analysis.

### Examining the new soup object

First let's print out what is contained in "ParseSoup" to see what we are working with.

In [36]:
print(ParseSoup)

<p><b>Web scraping</b>, <b>web harvesting</b>, or <b>web data extraction</b> is <a href="/wiki/Data_scraping" title="Data scraping">data scraping</a> used for <a href="/wiki/Data_extraction" title="Data extraction">extracting data</a> from <a href="/wiki/Website" title="Website">websites</a>.<sup class="reference" id="cite_ref-Boeing2016JPER_1-0"><a href="#cite_note-Boeing2016JPER-1">[1]</a></sup> Web scraping software may access the World Wide Web directly using the <a href="/wiki/Hypertext_Transfer_Protocol" title="Hypertext Transfer Protocol">Hypertext Transfer Protocol</a>, or through a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a <a href="/wiki/Internet_bot" title="Internet bot">bot</a> or <a href="/wiki/Web_crawler" title="Web crawler">web crawler</a>. It is a form of copying, in which specific data is gathered and copied from the web, typically into a central local <a href="/wiki/Dat

Although we are only looking at content in the paragraph tags, there are still other tags embedded within them. Now let's see if we can replace each "anchor" or "a" tag with its plain text, surrounded by blank spaces.

In [37]:
# This code aims to remove all the anchor tags
# Keep looping until no anchor tags are left
while True:

    # If there are no more anchor tags we are done
    if ParseSoup.a is None:
        break

    # If the anchor has no single text string, just insert a white space
    elif ParseSoup.a.string is None:
        ParseSoup.a.replace_with(" ")

    # Replace the anchor tag with the string inside
    else:
        # Adding spaces before and after the words
        newFormat = " " + str(ParseSoup.a.string) + " "
        ParseSoup.a.replace_with(newFormat)

The code above replaces each "anchor" or "a" tag, together with its metadata, with the text it contained padded by a space on each side, or with a single " " (blank space) when the tag has no single text string.
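Beautiful Soup also has a built-in `unwrap()` method that removes a tag while leaving its contents in place. The sketch below is an alternative to the loop above, not the notebook's own method, and it runs on a made-up HTML string. Unlike the loop, it adds no padding spaces, and it keeps the text of an anchor that contains other tags, where `.string` would be `None`.

```python
from bs4 import BeautifulSoup

# Hypothetical paragraph; the second anchor contains a nested <i> tag
sample_html = (
    '<p>See <a href="/wiki/Data_scraping">data scraping</a> and '
    '<a href="/wiki/Web_crawler"><i>web</i> crawler</a>.</p>'
)
sample_soup = BeautifulSoup(sample_html, "html.parser")

# unwrap() removes the tag itself but leaves its children where they were
for anchor in sample_soup.find_all("a"):
    anchor.unwrap()

print(sample_soup)
```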

In [39]:
# Re-examine the ParseSoup object after we remove the "a" tags
print(ParseSoup)

<p><b>Web scraping</b>, <b>web harvesting</b>, or <b>web data extraction</b> is  data scraping  used for  extracting data  from  websites .<sup class="reference" id="cite_ref-Boeing2016JPER_1-0"> [1] </sup> Web scraping software may access the World Wide Web directly using the  Hypertext Transfer Protocol , or through a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a  bot  or  web crawler . It is a form of copying, in which specific data is gathered and copied from the web, typically into a central local  database  or spreadsheet, for later  retrieval  or  analysis .
</p><p>Web scraping a web page involves fetching it and extracting from it.<sup class="reference" id="cite_ref-Boeing2016JPER_1-1"> [1] </sup><sup class="reference" id="cite_ref-2"> [2] </sup> Fetching is the downloading of a page (which a browser does when you view the page). Therefore, web crawling is a main component of web scr

As you can see, compared to the previous example we have now removed all the "anchor" tags that existed. This exact process can be repeated for all other types of tag as well. For example, what if we also wanted to remove all the "b" tags? All you need to do is replace the "a" with a "b" wherever you see "ParseSoup.a".

In [40]:
# Keep looping until no "b" tags are left
while True:

    # If there are no more "b" tags we are done
    if ParseSoup.b is None:
        break

    # If the "b" tag has no single text string, just insert a white space
    elif ParseSoup.b.string is None:
        ParseSoup.b.replace_with(" ")

    # Replace the "b" tag with the string inside
    else:
        # Adding spaces before and after the words
        newFormat = " " + str(ParseSoup.b.string) + " "
        ParseSoup.b.replace_with(newFormat)

In [41]:
print(ParseSoup)

<p> Web scraping ,  web harvesting , or  web data extraction  is  data scraping  used for  extracting data  from  websites .<sup class="reference" id="cite_ref-Boeing2016JPER_1-0"> [1] </sup> Web scraping software may access the World Wide Web directly using the  Hypertext Transfer Protocol , or through a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a  bot  or  web crawler . It is a form of copying, in which specific data is gathered and copied from the web, typically into a central local  database  or spreadsheet, for later  retrieval  or  analysis .
</p><p>Web scraping a web page involves fetching it and extracting from it.<sup class="reference" id="cite_ref-Boeing2016JPER_1-1"> [1] </sup><sup class="reference" id="cite_ref-2"> [2] </sup> Fetching is the downloading of a page (which a browser does when you view the page). Therefore, web crawling is a main component of web scraping, to fetch

This approach can therefore be expanded to cover whatever tags you wish to remove. All that is needed is to add additional if statements for the desired tags inside the while loop.
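As one possible way to generalize this, the sketch below takes two lists of tag names: tags to strip while keeping their text, and tags to delete together with their contents (for example the "sup" citation markers such as [1]). The function name `clean_tags`, its parameters, and the sample HTML are assumptions for illustration. It uses Beautiful Soup's `unwrap()` and `decompose()` methods in place of the `replace_with` loop shown above, so no padding spaces are added.

```python
from bs4 import BeautifulSoup


def clean_tags(soup, unwrap=("a", "b"), drop=("sup",)):
    """Strip the `unwrap` tags but keep their text; delete the `drop` tags entirely."""
    if drop:
        # Delete first, so text inside dropped tags is never kept.
        # Search again after each deletion so we only ever touch live tags.
        tag = soup.find(list(drop))
        while tag is not None:
            tag.decompose()
            tag = soup.find(list(drop))
    if unwrap:
        # unwrap() removes each tag but leaves its contents in place
        for tag in soup.find_all(list(unwrap)):
            tag.unwrap()
    return soup


# Hypothetical paragraph in the style of the Wikipedia output above
sample_html = (
    '<p><b>Web scraping</b> is <a href="/wiki/Data_scraping">data scraping</a>'
    '<sup class="reference"><a href="#cite_note-1">[1]</a></sup> from websites.</p>'
)
sample_soup = clean_tags(BeautifulSoup(sample_html, "html.parser"))

print(sample_soup.get_text())
```

The checks for empty `unwrap` and `drop` are deliberate: searching with an empty list of names should not be relied on to match nothing.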

### Saving into Text File

Compared with how it was done earlier, a slight alteration is needed in order to save this new version. In the previous example we used a prebuilt method that stripped all the tags, whereas in this case we only removed two (the "a" and "b" tags), so the remaining tags are written out as part of the text.

In [45]:
# Open a text file to save into (it is created if it doesn't exist)
# The "with" block closes the file automatically when it finishes
with open('selectiveCleanHTML.txt', 'w', encoding='utf-8') as outputfile:

    # For every top-level item in ParseSoup
    for item in ParseSoup:

        # Convert the objects to string so we can write it to file
        outputfile.write(str(item) + " \n")
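Iterating over a soup object directly, as in the loop above, walks through its top-level children. Each child can be either a tag or a loose piece of text, which is where the `Tag` and `NavigableString` classes imported earlier become useful. The sketch below, on a made-up HTML string with assumed variable names, shows one way to tell the two apart and to skip whitespace-only text before saving.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Hypothetical input: two paragraphs separated by a line break
sample_html = "<p>First paragraph.</p>\n<p>Second paragraph.</p>"
sample_soup = BeautifulSoup(sample_html, "html.parser")

lines = []
for item in sample_soup:
    if isinstance(item, Tag):
        # A real element such as <p>: keep its text
        lines.append(item.get_text().strip())
    elif isinstance(item, NavigableString) and item.strip():
        # Loose text between tags: keep it only if it is not just whitespace
        lines.append(str(item).strip())

print(lines)
```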