The purpose of this notebook is to illustrate the use of a web scraper in Python 3 to gather text-based data.

The tutorial is based on the link below and will then be extended to other websites to gather information of interest.

https://www.dataquest.io/blog/web-scraping-tutorial-python/

In [1]:
# -*- coding: utf-8 -*-
"""
Created on Tue Oct 30 13:12:50 2018
@author: Sanil Purryag
"""

#import libraries
import requests
from bs4 import BeautifulSoup

A GET request fetches the whole HTML page.

In [2]:
#Send an HTTP GET request to get the webpage content
page = requests.get("http://dataquestio.github.io/web-scraping-pages/simple.html")
#check request status
page
#check the content of the webpage
page.content

b'<!DOCTYPE html>\n<html>\n    <head>\n        <title>A simple example page</title>\n    </head>\n    <body>\n        <p>Here is some simple content for this page.</p>\n    </body>\n</html>'
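
Before parsing, it can help to confirm that the request actually succeeded. The sketch below wraps the GET request in a small helper; the name fetch_html and the 10 second timeout are assumptions made for illustration and are not part of the original tutorial.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the raw bytes of a page, raising an error on a failed request.

    fetch_html and the default timeout are hypothetical choices for this sketch.
    """
    response = requests.get(url, timeout=timeout)
    # raise_for_status raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.content


# Example call (needs network access, so it is left commented out):
# content = fetch_html("http://dataquestio.github.io/web-scraping-pages/simple.html")
```

A status code starting with 2 generally indicates success, while codes starting with 4 or 5 indicate an error.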

BeautifulSoup helps to parse and analyse the data.

In [3]:
#Parse/structure the webpage content using BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')

#check object structure
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   A simple example page
  </title>
 </head>
 <body>
  <p>
   Here is some simple content for this page.
  </p>
 </body>
</html>


In [4]:
#move through the object one level at a time
#Note that children returns an iterator, so we need to call the list function on it

list(soup.children)

['html', '\n', <html>
 <head>
 <title>A simple example page</title>
 </head>
 <body>
 <p>Here is some simple content for this page.</p>
 </body>
 </html>]

In [5]:
#check the items present in the list

[type(item) for item in list(soup.children)]

[bs4.element.Doctype, bs4.element.NavigableString, bs4.element.Tag]

The list above contains three objects:

* The first is a Doctype object, which contains information about the type of the document. 
* The second is a NavigableString, which represents text found in the HTML document. 
* The final item is a Tag object, which contains other nested tags. 

The Tag object allows us to navigate through an HTML document, and extract other tags and text.
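
As a minimal sketch of the distinction above, the top-level children can be filtered down to Tag objects with isinstance instead of relying on a fixed list position. The HTML string is a condensed copy of the simple example page, embedded here so the snippet runs without a network connection.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Condensed copy of the simple example page (embedded for an offline run)
html_doc = (
    "<!DOCTYPE html>\n"
    "<html><head><title>A simple example page</title></head>"
    "<body><p>Here is some simple content for this page.</p></body></html>"
)
soup = BeautifulSoup(html_doc, "html.parser")

# Keep only the children that are tags; the Doctype and the newline string are skipped
top_level_tags = [child for child in soup.children if isinstance(child, Tag)]
tag_names = [tag.name for tag in top_level_tags]
print(tag_names)
```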

In [6]:
html = list(soup.children)[2]
print(html)

<html>
<head>
<title>A simple example page</title>
</head>
<body>
<p>Here is some simple content for this page.</p>
</body>
</html>


The html tag above contains two child tags, head and body.

The part of interest is the "p" tag.
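
One way to reach that tag is to walk down the tree by tag name, one level at a time. This is only a sketch on an embedded copy of the example page; attribute-style access returns the first matching tag at each step.

```python
from bs4 import BeautifulSoup

# Embedded copy of the simple example page
html_doc = """<!DOCTYPE html>
<html>
    <head>
        <title>A simple example page</title>
    </head>
    <body>
        <p>Here is some simple content for this page.</p>
    </body>
</html>"""
soup = BeautifulSoup(html_doc, "html.parser")

# html -> body -> p, each step returns the first tag with that name
paragraph = soup.html.body.p
paragraph_text = paragraph.get_text()

# The same style of navigation works for the title inside head
title_text = soup.html.head.title.get_text()
print(paragraph_text)
print(title_text)
```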


In [7]:
#Find all instances of the "p" tag

soup = BeautifulSoup(page.content, 'html.parser')
soup.find_all('p')

[<p>Here is some simple content for this page.</p>]

In [8]:
#Since find_all returns a list, index into it and use get_text to extract the text

soup.find_all('p')[0].get_text()

'Here is some simple content for this page.'
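
When only the first match is needed, find can be used in place of find_all followed by indexing. The sketch below also shows what happens when a tag is absent: find returns None and find_all returns an empty list, so a check is needed before calling get_text. The "h1" lookup is a hypothetical example of a tag that is missing from the page.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<body><p>Here is some simple content for this page.</p></body>",
    "html.parser",
)

# find returns the first matching tag directly
first_p = soup.find("p")
first_p_text = first_p.get_text()

# A tag that does not exist on the page (hypothetical lookup)
missing = soup.find("h1")
heading_text = missing.get_text() if missing is not None else None
all_headings = soup.find_all("h1")

print(first_p_text)
print(heading_text, len(all_headings))
```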

These techniques will now be extended to construct a pandas DataFrame from a weather forecast page.

In [10]:
#Download the whole weather page
weatherpage = requests.get("https://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168")
weatherpage

<Response [200]>

In [13]:
#Get appears to be successful
#check the content of the webpage

#weatherpage.content

#There is too much output, so the web inspector (Ctrl+Shift+C) in Firefox will be used to identify the element of interest
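
As a complement to the browser inspector, the ids and classes present in a page can also be listed from Python. The sketch below runs on a small hand-written fragment rather than the real forecast page; the "current-conditions" id and the fragment as a whole are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Hand-written fragment standing in for the much larger forecast page
sample_html = """
<div id="current-conditions"><p>Fair</p></div>
<div id="seven-day-forecast">
  <div class="tombstone-container"><p class="period-name">Today</p></div>
</div>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# id=True matches every tag that has an id attribute
element_ids = [tag["id"] for tag in soup.find_all(id=True)]

# class_=True matches every tag that has a class; each tag's class is a list
class_names = sorted({name for tag in soup.find_all(class_=True) for name in tag["class"]})

print(element_ids)
print(class_names)
```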

In [16]:
#Based on the page structure, each forecast item sits in a "tombstone-container" element

#First parse the webpage

weathersoup = BeautifulSoup(weatherpage.content, 'html.parser')

#Find the seven day forecast
seven_day = weathersoup.find(id="seven-day-forecast")

#Extract all the instances of the tombstone container from the seven_day forecast
forecast_items = seven_day.find_all(class_="tombstone-container")

#Show the first forecast item (stored as tonight; at download time it was "Today")
tonight = forecast_items[0]
print(tonight.prettify())

<div class="tombstone-container">
 <p class="period-name">
  Today
  <br/>
  <br/>
 </p>
 <p>
  <img alt="Today: Sunny, with a high near 72. Light and variable wind becoming west 6 to 11 mph in the afternoon. " class="forecast-icon" src="newimages/medium/few.png" title="Today: Sunny, with a high near 72. Light and variable wind becoming west 6 to 11 mph in the afternoon. "/>
 </p>
 <p class="short-desc">
  Sunny
 </p>
 <p class="temp temp-high">
  High: 72 °F
 </p>
</div>


From the above, there are certain elements that can be accessed:

* Period name (class period-name) : Today
* Short description (class short-desc) : Sunny
* Temperature (class temp temp-high) : High: 72 °F


In [18]:
#From the tonight object find the related text for each of the above.
period = tonight.find(class_="period-name").get_text()
short_desc = tonight.find(class_="short-desc").get_text()
temp = tonight.find(class_="temp").get_text()

print(period)
print(short_desc)
print(temp)

Today
Sunny
High: 72 °F


In [20]:
#Extract the full description from the image attribute

img = tonight.find("img")
desc = img['title']

print(desc)

Today: Sunny, with a high near 72. Light and variable wind becoming west 6 to 11 mph in the afternoon. 
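
Indexing a tag with square brackets raises a KeyError when the attribute is absent. As a hedged alternative, the get method returns None or a chosen default instead. The image tag below is a shortened, hand-written version of the one shown earlier, with the alt attribute deliberately left out.

```python
from bs4 import BeautifulSoup

# Shortened image tag without an alt attribute (hand-written for this sketch)
sample_html = (
    '<p><img class="forecast-icon" src="newimages/medium/few.png" '
    'title="Today: Sunny, with a high near 72."/></p>'
)
soup = BeautifulSoup(sample_html, "html.parser")
img = soup.find("img")

# get returns the attribute value when present
title = img.get("title")

# get falls back to the supplied default when the attribute is missing
alt = img.get("alt", "")

print(title)
print(repr(alt))
```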


In [21]:
#Select the period name inside every tombstone container to get all the periods in the forecast

period_tags = seven_day.select(".tombstone-container .period-name")
periods = [pt.get_text() for pt in period_tags]
periods

['Today',
 'Tonight',
 'Thursday',
 'ThursdayNight',
 'Friday',
 'FridayNight',
 'Saturday',
 'SaturdayNight',
 'Sunday']
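
Names such as 'ThursdayNight' have lost the space between the two words. A likely cause, assumed here, is that the words are separated by a br tag in the markup, which get_text drops. Passing a separator to get_text is one way to keep them apart; the fragment below is hand-written to mimic that assumed markup.

```python
from bs4 import BeautifulSoup

# Hand-written fragment; the <br/> between the two words is an assumption
sample_html = """
<p class="period-name">Today<br/><br/></p>
<p class="period-name">Thursday<br/>Night</p>
"""
soup = BeautifulSoup(sample_html, "html.parser")
period_tags = soup.select(".period-name")

# Default behaviour: text pieces are joined with nothing in between
raw_periods = [pt.get_text() for pt in period_tags]

# Join the pieces with a space and strip surrounding whitespace
spaced_periods = [pt.get_text(separator=" ", strip=True) for pt in period_tags]

print(raw_periods)
print(spaced_periods)
```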

In [23]:
#Get all the items for weather descriptions from tombstone container
short_descs = [sd.get_text() for sd in seven_day.select(".tombstone-container .short-desc")]
temps = [t.get_text() for t in seven_day.select(".tombstone-container .temp")]
descs = [d["title"] for d in seven_day.select(".tombstone-container img")]

print(short_descs)
print(temps)
print(descs)

['Sunny', 'Mostly Clear', 'Mostly Sunny', 'Partly Cloudy', 'Sunny', 'Mostly Clear', 'Mostly Sunny', 'Partly Cloudy', 'Mostly Sunny']
['High: 72 °F', 'Low: 57 °F', 'High: 74 °F', 'Low: 56 °F', 'High: 71 °F', 'Low: 56 °F', 'High: 71 °F', 'Low: 56 °F', 'High: 69 °F']
['Today: Sunny, with a high near 72. Light and variable wind becoming west 6 to 11 mph in the afternoon. ', 'Tonight: Mostly clear, with a low around 57. West wind 6 to 11 mph becoming light southwest  after midnight. ', 'Thursday: Mostly sunny, with a high near 74. Light and variable wind becoming west 9 to 14 mph in the afternoon. ', 'Thursday Night: Partly cloudy, with a low around 56. West southwest wind 5 to 13 mph. ', 'Friday: Sunny, with a high near 71. Light and variable wind becoming west 8 to 13 mph in the afternoon. ', 'Friday Night: Mostly clear, with a low around 56.', 'Saturday: Mostly sunny, with a high near 71.', 'Saturday Night: Partly cloudy, with a low around 56.', 'Sunday: Mostly sunny, with a high near 69
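
Selecting each field separately works as long as every container holds every field. If one container lacked, say, a temperature, the lists would end up with different lengths and no longer line up. A sketch of an alternative is to build one record per container. The helper text_or_none and the second container without a temperature are hypothetical, added only to illustrate the idea.

```python
from bs4 import BeautifulSoup

# Hand-written fragment; the second container deliberately has no temp element
sample_html = """
<div id="seven-day-forecast">
  <div class="tombstone-container">
    <p class="period-name">Today</p>
    <p class="short-desc">Sunny</p>
    <p class="temp temp-high">High: 72 °F</p>
  </div>
  <div class="tombstone-container">
    <p class="period-name">Tonight</p>
    <p class="short-desc">Mostly Clear</p>
  </div>
</div>
"""


def text_or_none(container, css_class):
    """Return the stripped text of the first tag with css_class, or None (hypothetical helper)."""
    tag = container.find(class_=css_class)
    return tag.get_text(strip=True) if tag is not None else None


soup = BeautifulSoup(sample_html, "html.parser")

# One dictionary per container keeps the fields of each period together
rows = [
    {
        "period": text_or_none(item, "period-name"),
        "short_desc": text_or_none(item, "short-desc"),
        "temp": text_or_none(item, "temp"),
    }
    for item in soup.find_all(class_="tombstone-container")
]
print(rows)
```

A list of dictionaries like rows can be passed directly to pd.DataFrame, with missing values kept as None.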

In [24]:
#Combine in pandas dataframe for future analysis

import pandas as pd
weather = pd.DataFrame({
        "period": periods, 
        "short_desc": short_descs, 
        "temp": temps, 
        "desc":descs
    })
weather

,period,short_desc,temp,desc
0,Today,Sunny,High: 72 °F,"Today: Sunny, with a high near 72. Light and v..."
1,Tonight,Mostly Clear,Low: 57 °F,"Tonight: Mostly clear, with a low around 57. W..."
2,Thursday,Mostly Sunny,High: 74 °F,"Thursday: Mostly sunny, with a high near 74. L..."
3,ThursdayNight,Partly Cloudy,Low: 56 °F,"Thursday Night: Partly cloudy, with a low arou..."
4,Friday,Sunny,High: 71 °F,"Friday: Sunny, with a high near 71. Light and ..."
5,FridayNight,Mostly Clear,Low: 56 °F,"Friday Night: Mostly clear, with a low around 56."
6,Saturday,Mostly Sunny,High: 71 °F,"Saturday: Mostly sunny, with a high near 71."
7,SaturdayNight,Partly Cloudy,Low: 56 °F,"Saturday Night: Partly cloudy, with a low arou..."
8,Sunday,Mostly Sunny,High: 69 °F,"Sunday: Mostly sunny, with a high near 69."
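
For later analysis, the temperature text can be turned into numbers. The sketch below rebuilds the first four rows of the table by hand so that it runs on its own; the column names temp_num and is_night are assumptions chosen for illustration.

```python
import pandas as pd

# First four rows of the table above, re-entered by hand for a standalone run
weather = pd.DataFrame({
    "period": ["Today", "Tonight", "Thursday", "ThursdayNight"],
    "temp": ["High: 72 °F", "Low: 57 °F", "High: 74 °F", "Low: 56 °F"],
})

# Pull the digits out of each temperature string and convert them to integers
weather["temp_num"] = weather["temp"].str.extract(r"(\d+)", expand=False).astype(int)

# Night periods report a low rather than a high
weather["is_night"] = weather["temp"].str.contains("Low")

mean_temp = weather["temp_num"].mean()
print(weather)
print(mean_temp)
```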
