Web Crawling & Scraping (Web Scraping from a Blog page)

Web Scraping using Python

We use the BeautifulSoup package in Python to perform web scraping,
extracting only the text content we are interested in.

In this lab, we scrape a blog entry by a Singapore-based influencer from this page:
http://www.mongabong.com/2017/08/largest-executive-condo-sol-acres.html

We are particularly interested in extracting a given blog entry's:
1. Title
2. Date (of entry)
3. Blog content (essay)

In [None]:
######## Step 1 ########
'''
1) requests
--> we use this module to make an HTTP request and download a webpage
2) BeautifulSoup
--> we use this module to parse text out of HTML content
'''
import requests
from bs4 import BeautifulSoup

In [None]:
######## Step 2 ########
'''
Later on, you will need to modify this script to read MULTIPLE HTML files from a directory.
For now, we're going to connect to the blogger's blog entry via HTTP.
To do that, we simply specify the blog entry's URL.
'''
source_path = "http://www.mongabong.com/2017/08/largest-executive-condo-sol-acres.html"
page = requests.get(source_path)
page_content = page.content  # body of the HTTP response (raw bytes)
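
The note in Step 2 mentions reading multiple HTML files from a directory later on. Below is a minimal sketch of how that could look, assuming the pages were saved locally with a .html extension. The function name read_html_files and the directory layout are hypothetical and not part of the lab.

```python
from pathlib import Path


def read_html_files(directory):
    """Return {filename: raw bytes} for every .html file in `directory`.

    Hypothetical helper (the lab does not define it). Bytes are returned,
    like page.content, so BeautifulSoup can work out the encoding itself.
    """
    pages = {}
    # sorted() keeps the processing order stable between runs
    for path in sorted(Path(directory).glob("*.html")):
        pages[path.name] = path.read_bytes()
    return pages
```

Each value in the returned dictionary can be handed to BeautifulSoup(content, 'html.parser') in the same way as page_content.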

In [None]:
######## Step 3 ########
'''
The way BeautifulSoup works is... you need to grab the webpage's HTML content
--> page.content

And then, you need to tell BeautifulSoup that you want it to "parse" page's HTML content
--> 'html.parser'
'''
soup = BeautifulSoup(page_content, 'html.parser')

'''
The prettify() function indents the parsed HTML so that the
tags are nicely nested for easy viewing.
The line below prints it; comment it out if the output gets too long.
'''
print(soup.prettify())

In [None]:
# 1) Retrieve blog title
# <div class="post-header"> --> <h1>

h1_items = soup.find_all('h1')
#print(h1_items)

title = h1_items[1].text
print(title)
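
Picking the title with h1_items[1] depends on how many <h1> tags appear before it on the page, which can change with the blog template. Here is a sketch of a more targeted lookup using a CSS selector, run against a small stand-in for the page. The title_html string is an assumption based on the post-header markup noted in the comment above, with an extra site-wide <h1> added for illustration.

```python
from bs4 import BeautifulSoup

# Stand-in HTML: a site-wide <h1> followed by the post header.
title_html = """
<h1>Blog name</h1>
<div class="post-header">
    <h1>First Home- Singapore's Largest Development EC, Sol Acres</h1>
    <span class="date">Friday, August 25, 2017</span>
</div>
"""
title_soup = BeautifulSoup(title_html, 'html.parser')

# select_one() returns the first element matching the CSS selector, or None.
title_tag = title_soup.select_one('div.post-header h1')
if title_tag is not None:
    print(title_tag.get_text(strip=True))
```

The selector names the enclosing div, so it keeps working even if more headings are added elsewhere on the page.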

In [None]:
# 2) Retrieve blog date
# <span class="date">Friday, August 25, 2017</span>

date_span = soup.find('span', class_="date")
#print(date_span.text)
date_text = date_span.text
print(date_text)

In [None]:
# 3) Retrieve the article content (excluding photos)

divs = soup.find_all('div', class_="separator")
#print(divs)

for div in divs:
    print("---")
    print(div.text)

In [None]:
######## Step 4 ########
'''
This line calls the find() function to look for a 'div' tag with class='post-header'.
find() returns the FIRST matching tag as a single object (or None if nothing matches).

find_all(), by contrast, returns a LIST of every match.
To get the same tag from it, we would take the FIRST item in that LIST.

As you know... in Python (and in many other programming languages),
the FIRST item in a list/array has the INDEX value of 0 (zero).
So, soup.find_all('div', class_='post-header')[0] extracts the same FIRST div with class='post-header'.

Why are we interested in extracting this div?

Inside this div... are: 1) title, 2) date of the blog entry.

This div looks like this:

<div class="post-header">
    <h1>
        First Home- Singapore's Largest Development EC, Sol Acres
    </h1>
    <span class="date">Friday, August 25, 2017</span>
</div>
'''

# retrieve the first element with class='post-header'
title_date_div = soup.find('div', class_='post-header')  # a single Tag object, not a list
print(title_date_div)

title_element = title_date_div.find('h1')
print("-------")
print(title_element.text)

# Equivalent lookup with find_all():
# title_date_div = soup.find_all('div', class_='post-header')[0]
# print(title_date_div)

date_span = title_date_div.find('span')
print(date_span.text)
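
To see how find() and find_all() relate, here is a small sketch on an inline copy of the post-header markup shown in the docstring above. The header_html string and the variable names are for illustration only.

```python
from bs4 import BeautifulSoup

header_html = """
<div class="post-header">
    <h1>
        First Home- Singapore's Largest Development EC, Sol Acres
    </h1>
    <span class="date">Friday, August 25, 2017</span>
</div>
"""
header_soup = BeautifulSoup(header_html, 'html.parser')

# find_all() returns a list-like ResultSet containing every match ...
all_headers = header_soup.find_all('div', class_='post-header')
# ... while find() returns only the first match, or None if nothing matches.
first_header = header_soup.find('div', class_='post-header')
missing = header_soup.find('div', class_='no-such-class')

print(len(all_headers))
print(all_headers[0] == first_header)
print(missing)
```

Because find() can return None, it is worth checking the result before calling .find() or .text on it when scraping pages whose layout may differ.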

In [None]:
######## Step 5 ########
'''
title_date_div contains what we want... 1) title and 2) date.

<div class="post-header">
    <h1>
        First Home- Singapore's Largest Development EC, Sol Acres
    </h1>
    <span class="date">Friday, August 25, 2017</span>
</div>

We need to extract out the blog entry's "title", inside <h1>.
How do we get ... just the TEXT portion ... inside <h1> ... </h1>?

First, we need to find h1 tag inside the div.
--> find('h1')

After this, to extract just the TEXT portion... we call get_text() function.
Try this and print. See what you get.
'''
title = title_date_div.find('h1').get_text()
# In case... the title content has end-of-line (\n), we don't want it - we want it removed.
# We use replace() function to replace (if any) '\n' (end-of-line character) with empty string ('').
title = title.replace('\n', '')
print(title)
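
Note that replace('\n', '') removes only the newline characters. If the <h1> text is indented in the HTML, as in the snippet above, the indentation spaces stay in the title. Below is a sketch of one common alternative that collapses every run of whitespace into a single space. Here raw_title is a hand-written stand-in for what get_text() would return on that snippet.

```python
# Stand-in for title_date_div.find('h1').get_text() on the indented snippet.
raw_title = "\n        First Home- Singapore's Largest Development EC, Sol Acres\n    "

# replace() drops the newlines but keeps the indentation spaces.
replaced_only = raw_title.replace('\n', '')

# split() with no arguments splits on any run of whitespace and discards
# empty pieces; joining with ' ' leaves single spaces between the words.
normalized = ' '.join(raw_title.split())

print(repr(replaced_only))
print(repr(normalized))
```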

In [None]:
######## Step 6 ########
'''
Same thing here.. we need to get "date" of this blog entry.

<div class="post-header">
    <h1>
        First Home- Singapore's Largest Development EC, Sol Acres
    </h1>
    <span class="date">Friday, August 25, 2017</span>
</div>

The "date" we want... is inside <span>... </span>.
--> find('span')
This finds us <span>...</span>.
Note that you can also... look for a tag with class="date" by doing:
--> find(class_='date')
Either way is fine.

Once you get the span... call get_text() function to extract the TEXT portion,
which is essentially what we want --> blog entry's DATE. :)

Try the below code and see what you get.
'''
date = title_date_div.find('span').get_text()
#date = title_date_div.find(class_='date').get_text()
# In case... the date content has end-of-line (\n), we don't want it - we want it removed.
# We use replace() function to replace (if any) '\n' (end-of-line character) with empty string ('').
date = date.replace('\n', '')
print(date)
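
If the date is needed for sorting or filtering rather than just for display, the extracted string can be converted into a date object. This is a sketch, assuming every entry uses the same "Friday, August 25, 2017" layout and that Python runs under an English locale. Both are assumptions; the lab itself only prints the string.

```python
from datetime import datetime

date_string = "Friday, August 25, 2017"  # sample value from the HTML snippet above

# %A = full weekday name, %B = full month name, %d = day, %Y = 4-digit year
parsed = datetime.strptime(date_string, "%A, %B %d, %Y").date()
print(parsed.isoformat())
```

strptime() raises ValueError when the text does not match the format, so entries with a different date layout would need their own handling.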

In [None]:
######## Step 7 ########
'''
Here, we need to capture blog's main content (essay portion).

If you look at the HTML content... you'll see that the blogger's article consists of
a series of:

 <div class="separator" style="clear: both; text-align: center;">
    some content...
 </div>

On her website, the content looks like one contiguous block of text, but in the HTML it is split across many separate divs.
So, we need BeautifulSoup to extract ALL ... <div class="separator"> tags.

When it does, it will insert all matched entries into a LIST.
'''
essay_div = soup.find_all('div', class_='separator')

for para in essay_div:
    print(para.text)

In [None]:
'''
What we'll do now... is to LOOP THRU this LIST of <div class="separator"> tags...
and extract out the TEXT portion.
One by one... as we extract the TEXT portion of each tag, we append it to the essay_content string variable.

In the end, essay_content will contain the ENTIRE blog entry's essay portion.
'''

# We initialize essay_content string... to an empty string.
essay_content = ""

# We LOOP THRU the list of div tags...
for segment in essay_div:
    # segment is a div object - we need to extract out the TEXT portion by calling get_text() function.
    segment_text = segment.get_text()
    # If the TEXT portion contains end-of-line character (\n), remove it.
    # We use replace() function to do this operation.
    segment_text = segment_text.replace('\n', '')

    # If the extracted TEXT portion after stripping off \n ... is NOT empty, then it must be
    # containing some essay content... so append it to essay_content string.
    if segment_text != "":
        #print(segment_text)
        essay_content = essay_content + segment_text + ' '

'''
We're done grabbing all the essay content.
Let's print and see what it looks like.
'''
print(essay_content)
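
The loop above builds the string by repeated concatenation. Here is a variant sketch that collects the non-empty segments in a list and joins them once, shown on a small stand-in document. The essay_html string is made up for illustration; the real page has many more separator divs.

```python
from bs4 import BeautifulSoup

essay_html = """
<div class="separator">First paragraph.</div>
<div class="separator"><img src="photo.jpg"/></div>
<div class="separator">Second
paragraph.</div>
"""
sample_divs = BeautifulSoup(essay_html, 'html.parser').find_all('div', class_='separator')

segments = []
for segment in sample_divs:
    # Collapse newlines and other whitespace runs into single spaces.
    text = ' '.join(segment.get_text().split())
    if text:  # image-only divs yield an empty string and are skipped
        segments.append(text)

sample_essay = ' '.join(segments)
print(sample_essay)
```

Unlike replace('\n', ''), this keeps a space where a line break separated two words, and it leaves no trailing space at the end of the result.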