Web Crawling & Scraping (Web Scraping from an HTML page)

Web Scraping using Python

We use the BeautifulSoup package in Python to perform web scraping, that is,
to extract only selected text content from a web page.

In this lab, we are particularly interested in extracting the Supreme Leader's welcome message from this page:
http://krazywoman.com/index.html

The index.html page above has the following HTML content.
Our objective is to extract the text enclosed in <p> and </p> tags.
There are two such pairs.

<html>
<head>
<title>Supreme Leader's Home</title>
</head>

<body bgcolor="black">

<center>
<font color="white">

    <h1>Supreme Leader's Online Home</h1>
    <h2>World's Best Leader</h2>
    <h3>Seriously, I am.</h3>

    <p>Welcome to my homepage. You didn't know I had a homepage, right? Well, I do and now you know.</p>
    <p>Browse around and see if you can find any interesting articles I've been writing in my Supreme Palace.</p>

</font>
</center>

</body>
</html>

In [None]:
######## Step 1 ########
'''
1) requests
--> we use this module to make an HTTP request that downloads a web page
2) BeautifulSoup
--> we use this module to parse HTML content and pull the text out of it
'''
import requests
from bs4 import BeautifulSoup

In [None]:
######## Step 2 ########
'''
Connect to Supreme Leader's homepage (index.html is the main page).
To do that, we simply specify the homepage's URL.
'''
source_path = 'http://krazywoman.com/index.html'
page = requests.get(source_path)  # send an HTTP GET request
page_content = page.content  # raw response body (bytes)
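
The request above assumes the site is reachable and answers with a normal page. A minimal sketch of a more defensive download follows; `fetch_html` is a hypothetical helper name and the 10-second timeout is an arbitrary choice, neither comes from the lab itself.

```python
import requests

def fetch_html(url, timeout=10):
    """Return the page body as bytes, or None if the request fails.

    fetch_html is a hypothetical helper, not part of the lab code.
    """
    try:
        response = requests.get(url, timeout=timeout)  # give up after `timeout` seconds
        response.raise_for_status()                    # raise an error on 4xx/5xx status codes
    except requests.exceptions.RequestException as err:
        print("Request failed:", err)
        return None
    return response.content
```

With this in place, `page_content = fetch_html(source_path)` could stand in for the two lines of Step 2, followed by a check that the result is not `None`.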

In [None]:
print(page_content)

In [None]:
######## Step 3 ########
'''
BeautifulSoup needs two things:
1) the web page's HTML content
--> page_content (taken from page.content)
2) the name of the parser that should read that HTML
--> 'html.parser' (the parser bundled with Python)
'''
soup = BeautifulSoup(page_content, 'html.parser')

'''
prettify() returns the HTML with its tags indented by nesting level,
which makes the page structure easier to read.
The line below prints that indented version.
'''
print(soup.prettify())
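
Parsing can also be tried without any download. The sketch below assumes a tiny made-up HTML document held in a bytes literal (`raw_bytes` and `demo_soup` are illustrative names) to show that BeautifulSoup accepts bytes such as `page.content` and that `prettify()` hands back an ordinary string.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for page.content (bytes, not str)
raw_bytes = b"<html><head><title>Demo</title></head><body><p>Hi</p></body></html>"
demo_soup = BeautifulSoup(raw_bytes, 'html.parser')

print(type(demo_soup))          # the parsed document object
print(demo_soup.title.text)     # text inside <title>

pretty = demo_soup.prettify()   # a str, indented by nesting depth
print(pretty)
```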

In [None]:
# Find <title>
title = soup.find('title')
print(title)
print(title.text)

In [None]:
# Find <h1>
heading1 = soup.find('h1')
print(heading1.text)

In [None]:
# Find all instances of <p>
paras = soup.find_all('p')
print(paras)

for para in paras:
    print("----")
    print(para.text)

In [None]:
# Find a single <p>
# Note: unlike find_all(), find() returns only
#   the first <p> it encounters

para = soup.find('p')
print(para)
print(para.get_text()) # strip the HTML tags and return just the text
print(para.text) # same result as get_text()
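
The two functions also differ when nothing matches: find() gives back None, while find_all() gives back an empty list. A small sketch on a made-up snippet (the names `snippet` and `demo_soup` are illustrative only):

```python
from bs4 import BeautifulSoup

snippet = "<html><body><p>First</p><p>Second</p></body></html>"  # made-up sample HTML
demo_soup = BeautifulSoup(snippet, 'html.parser')

missing_one = demo_soup.find('table')       # there is no <table> in the snippet
missing_all = demo_soup.find_all('table')

print(missing_one)        # None
print(len(missing_all))   # 0

# Check for None before using .text, because None has no such attribute
if missing_one is not None:
    print(missing_one.text)

first_p = demo_soup.find('p')
print(first_p.text)       # First
```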

In [None]:
######## Step 4 ########
'''
This line calls find_all() to look for every 'p' (paragraph) tag in the page.
--> find_all() returns a list-like ResultSet of bs4.element.Tag objects (it is empty if no 'p' tag is found).
'''
results_list = soup.find_all('p')

# Let's see what the search results look like
print(results_list)
#print(results_list[1])

In [None]:
######## Step 5 ########
'''
If you look at the list, the <p> and </p> tags are still present.
We need to remove these HTML tags.

To do that, we call get_text() on each result object (bs4.element.Tag).
get_text() strips off the HTML tags and returns just the text content.

We will store the text as strings in a new list called 'text_list'.
'''
text_list = []
for result in results_list:
    #print(type(result)) # Uncomment this and you can see that each item in the list is of bs4.element.Tag type.
    print(result.get_text())
    text_list.append(result.get_text())
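
The same collection step can be written as a list comprehension. This is only a sketch on a made-up snippet; passing `strip=True` to get_text() is an optional extra that trims whitespace around the text and is not used in the lab code.

```python
from bs4 import BeautifulSoup

snippet = "<body><p>  Welcome to my homepage.  </p><p>Browse around.</p></body>"  # made-up sample
demo_soup = BeautifulSoup(snippet, 'html.parser')

# One line replaces the loop; strip=True trims whitespace around each piece of text
demo_texts = [tag.get_text(strip=True) for tag in demo_soup.find_all('p')]
print(demo_texts)
```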


In [None]:
######## Step 6 ########
'''
Let's combine the two paragraph texts into a single string.
" ".join(...) concatenates the strings in a list, placing a space between them.
'''
final_message = " ".join(text_list)

# Let's see if we have the two paragraphs combined into one string
print(final_message)
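
Steps 3 to 6 can be folded into one reusable function. The sketch below is one possible arrangement, not the lab's own code: `extract_paragraph_text` and its `separator` parameter are hypothetical names, and the input is a made-up HTML string rather than the downloaded page.

```python
from bs4 import BeautifulSoup

def extract_paragraph_text(html, separator=" "):
    """Join the text of every <p> tag in `html` into one string (hypothetical helper)."""
    parsed = BeautifulSoup(html, 'html.parser')
    return separator.join(tag.get_text() for tag in parsed.find_all('p'))

sample = "<html><body><h1>Title</h1><p>Hello.</p><p>Goodbye.</p></body></html>"  # made-up input
print(extract_paragraph_text(sample))
```

Calling it on `page_content` from Step 2 should give the same string as `final_message`.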