# Week 12 Extracting Data from Websites

When performing data science tasks, it's common to want to use data found on the internet. You'll usually be able to access the data in CSV format, or via an Application Programming Interface (API). However, there are times when the data you want can only be accessed as part of a web page. In cases like this, you can use a technique called **web scraping** to get the data from the web page into a format you can work with in your analysis.

In [1]:
# Download a webpage
import requests
page = requests.get("http://dataquestio.github.io/web-scraping-pages/simple.html")
page  # a 2xx status code usually means the download succeeded

<Response [200]>
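
A status code in the 200 range signals success, but `requests.get` does not stop on its own when a download fails. The sketch below shows one possible way to guard against that. The helper name `fetch_html` and its `timeout` default are illustrative assumptions, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its body as text (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.text
```

Calling `fetch_html("http://dataquestio.github.io/web-scraping-pages/simple.html")` should return the same HTML as `page.content`, decoded to a string instead of bytes.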

In [2]:
# Show what is downloaded
print(page.content)

b'<!DOCTYPE html>\n<html>\n    <head>\n        <title>A simple example page</title>\n    </head>\n    <body>\n        <p>Here is some simple content for this page.</p>\n    </body>\n</html>'


We will use the **BeautifulSoup** library to extract useful information from the HTML document.

In [6]:
!pip install --user --upgrade pip
!pip install beautifulsoup4



In [10]:
from bs4 import BeautifulSoup
import bs4
bs4.__version__

'4.9.1'

In [11]:
soup = BeautifulSoup(page.content, 'html.parser')
# soup
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   A simple example page
  </title>
 </head>
 <body>
  <p>
   Here is some simple content for this page.
  </p>
 </body>
</html>


In [12]:
# use the children attribute to list all top-level nodes (the doctype, strings, and tags)
list(soup.children)

['html',
 '\n',
 <html>
 <head>
 <title>A simple example page</title>
 </head>
 <body>
 <p>Here is some simple content for this page.</p>
 </body>
 </html>]

In [17]:
# There are only 3 top-level items: the doctype, a newline string, and the <html> tag
len(list(soup.children))

top_level_tags = list(soup.children)
print(top_level_tags[2])

<html>
<head>
<title>A simple example page</title>
</head>
<body>
<p>Here is some simple content for this page.</p>
</body>
</html>


In [18]:
# type of each child
print([type(item) for item in list(soup.children)])

[<class 'bs4.element.Doctype'>, <class 'bs4.element.NavigableString'>, <class 'bs4.element.Tag'>]


In [19]:
# select the html tag and its children by taking the third item in the list:
html = list(soup.children)[2]
print(html.prettify())

<html>
 <head>
  <title>
   A simple example page
  </title>
 </head>
 <body>
  <p>
   Here is some simple content for this page.
  </p>
 </body>
</html>


In [20]:
list(html.children)

['\n',
 <head>
 <title>A simple example page</title>
 </head>,
 '\n',
 <body>
 <p>Here is some simple content for this page.</p>
 </body>,
 '\n']
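
The `'\n'` entries above are whitespace strings between tags, which is why the interesting children sit at odd positions. A minimal sketch of filtering them out by type follows; the inline `html_doc` string is a copy of the example page so that the block runs without a download.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Inline copy of the simple example page, so no network access is needed
html_doc = (
    "<!DOCTYPE html>\n<html>\n<head>\n<title>A simple example page</title>\n"
    "</head>\n<body>\n<p>Here is some simple content for this page.</p>\n"
    "</body>\n</html>"
)
soup = BeautifulSoup(html_doc, "html.parser")
html = soup.html

# Keep only Tag objects, dropping the NavigableString newlines
tag_children = [child for child in html.children if isinstance(child, Tag)]
names = [tag.name for tag in tag_children]
print(names)  # ['head', 'body']
```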

In [26]:
# the join method connects elements from a list
list1 = ["A", "B", "C", "D"]
print("|".join(list1)) # A|B|C|D
print("---".join(list1)) # A---B---C---D
print("\n".join(list1)) # display each item on its own line
print("\n-------\n".join(list1)) # display each item on its own line, with a line of dashes in between

A|B|C|D
A---B---C---D
A
B
C
D
A
-------
B
-------
C
-------
D


In [29]:
# Display the items more clearly
# print('\n-----\n'.join([str(idx) + ':\n' + str(item) \
#                  for idx, item in enumerate(list(html.children))]))
print('\n=============\n'.join([str(item) for item in list(html.children)]))



<head>
<title>A simple example page</title>
</head>


<body>
<p>Here is some simple content for this page.</p>
</body>




In [None]:
# Display the number of children
len(list(html.children))

In [None]:
print([type(item) for item in list(html.children)])

In [30]:
body = list(html.children)[3]
print(body)

<body>
<p>Here is some simple content for this page.</p>
</body>


In [31]:
print(list(body.children))

['\n', <p>Here is some simple content for this page.</p>, '\n']


In [32]:
p = list(body.children)[1]
print(p)

<p>Here is some simple content for this page.</p>


In [33]:
p.get_text()

'Here is some simple content for this page.'
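
Walking the tree by child index works, but it depends on the exact position of every whitespace string. BeautifulSoup also lets you reach the first tag of a given name as an attribute, which gives a shorter route to the same paragraph. The sketch below uses an inline copy of the example page.

```python
from bs4 import BeautifulSoup

html_doc = (
    "<!DOCTYPE html>\n<html>\n<head>\n<title>A simple example page</title>\n"
    "</head>\n<body>\n<p>Here is some simple content for this page.</p>\n"
    "</body>\n</html>"
)
soup = BeautifulSoup(html_doc, "html.parser")

# soup.body.p means: the first <p> inside the first <body>
paragraph_text = soup.body.p.get_text()
title_text = soup.title.get_text()
print(paragraph_text)  # Here is some simple content for this page.
print(title_text)      # A simple example page
```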

#### Finding all instances of a tag at once

In [34]:
soup = BeautifulSoup(page.content, 'html.parser')
soup.find_all('p')

[<p>Here is some simple content for this page.</p>]

In [35]:
soup.find_all('p')[0].get_text()

'Here is some simple content for this page.'

In [36]:
# Find the first instance of a tag
soup.find('p')

<p>Here is some simple content for this page.</p>
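
One detail worth knowing: `find` returns `None` when nothing matches, so calling `.get_text()` on the result raises an `AttributeError`. A small sketch of a guard follows; `first_text` is a hypothetical helper name introduced here for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<body><p>Here is some simple content for this page.</p></body>",
    "html.parser",
)


def first_text(soup, tag_name, default=None):
    """Return the text of the first matching tag, or `default` if absent."""
    tag = soup.find(tag_name)
    return tag.get_text() if tag is not None else default


found = first_text(soup, "p")
missing = first_text(soup, "table", default="(no table)")
print(found)    # Here is some simple content for this page.
print(missing)  # (no table)
```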

#### Searching for tags by class and id

In [37]:
# Let's look at another webpage with classes and ids
page = requests.get("http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html")
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())

<html>
 <head>
  <title>
   A simple example page
  </title>
 </head>
 <body>
  <div>
   <p class="inner-text first-item" id="first">
    First paragraph.
   </p>
   <p class="inner-text">
    Second paragraph.
   </p>
  </div>
  <p class="outer-text first-item" id="second">
   <b>
    First outer paragraph.
   </b>
  </p>
  <p class="outer-text">
   <b>
    Second outer paragraph.
   </b>
  </p>
 </body>
</html>


In [41]:
# Find all tags of a class
soup.find_all(class_="outer-text")

[<p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>,
 <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]

In [42]:
soup.find_all(id="first")

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>]

In [38]:
soup.find_all('p')

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>,
 <p class="inner-text">
                 Second paragraph.
             </p>,
 <p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>,
 <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]

In [40]:
# Display all four sentences:
for tag in soup.find_all('p'):
    print(tag.get_text().strip())

First paragraph.
Second paragraph.
First outer paragraph.
Second outer paragraph.
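
Tag names and classes can also be combined in a single search, and `select` accepts CSS selectors. The sketch below illustrates both on an inline, compacted copy of the page above, so that it runs without a download.

```python
from bs4 import BeautifulSoup

html_doc = """
<body>
  <div>
    <p class="inner-text first-item" id="first">First paragraph.</p>
    <p class="inner-text">Second paragraph.</p>
  </div>
  <p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
  <p class="outer-text"><b>Second outer paragraph.</b></p>
</body>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# <p> tags that carry the class "outer-text"
outer = [p.get_text(strip=True) for p in soup.find_all("p", class_="outer-text")]

# CSS selector: every <p> nested anywhere inside a <div>
inside_div = [p.get_text(strip=True) for p in soup.select("div p")]

# CSS selector: <p> tags with the class "first-item"
first_items = [p["id"] for p in soup.select("p.first-item")]

print(outer)        # ['First outer paragraph.', 'Second outer paragraph.']
print(inside_div)   # ['First paragraph.', 'Second paragraph.']
print(first_items)  # ['first', 'second']
```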


## Extract Weather Data from Weather.gov
1. Open the [weather forecast page](https://forecast.weather.gov/MapClick.php?lat=40.7146&lon=-74.0071#.Xbc5aXVKhhE)
2. Display the source code (in Chrome, press F12 or choose "Developer Tools" from the menu)
3. Identify the element containing the data (in Chrome, right-click a value and select "Inspect")

In [4]:
import requests
from bs4 import BeautifulSoup

page = requests.get("https://forecast.weather.gov/MapClick.php?lat=40.7146&lon=-74.0071#.Xbc5aXVKhhE")
soup = BeautifulSoup(page.content, 'html.parser')
# Find the div tag with id "current_conditions-summary"
div = soup.find(id="current_conditions-summary")
# print("\n-----\n".join(str(item) for item in list(div.children)))
temperature = list(div.children)[5]
print(temperature.get_text())

50°F
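
Picking the temperature by position (`[5]`) breaks as soon as the page adds or removes an element. One possible alternative is to search the `div` for a string that contains `°F`. The sample markup below is a simplified, made-up stand-in for the real page, which may be structured differently.

```python
import re

from bs4 import BeautifulSoup

# Made-up stand-in for the current-conditions block (an assumption, not the real markup)
sample = """
<div id="current_conditions-summary">
  <p>Partly Sunny</p>
  <p>50°F</p>
  <p>10°C</p>
</div>
"""
soup = BeautifulSoup(sample, "html.parser")
div = soup.find(id="current_conditions-summary")

# find(string=...) returns the first text node that matches the pattern
fahrenheit = div.find(string=re.compile("°F"))
print(fahrenheit.strip())  # 50°F
```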


In [5]:
seven_day = soup.find(id="seven-day-forecast")
# print(len(seven_day))
# print(seven_day)
forecast_items = seven_day.find_all(class_="tombstone-container")
# print(len(forecast_items))
# print(forecast_items)
first_period = forecast_items[0]  # the first forecast period ("Today" in the output below)
print(first_period.prettify())

<div class="tombstone-container">
 <p class="period-name">
  Today
  <br/>
  <br/>
 </p>
 <p>
  <img alt="Today: Partly sunny, with a high near 57. South wind 6 to 13 mph. " class="forecast-icon" src="newimages/medium/bkn.png" title="Today: Partly sunny, with a high near 57. South wind 6 to 13 mph. "/>
 </p>
 <p class="short-desc">
  Partly Sunny
 </p>
 <p class="temp temp-high">
  High: 57 °F
 </p>
</div>
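
Each forecast item holds four pieces of information: the period name, a short description, the temperature, and a long description stored in the `title` attribute of the image. A minimal sketch of pulling them out of a single item follows, using an inline, compacted copy of the markup shown above so that it runs offline.

```python
from bs4 import BeautifulSoup

item_html = """
<div class="tombstone-container">
 <p class="period-name">Today<br/><br/></p>
 <p><img alt="Today: Partly sunny, with a high near 57. South wind 6 to 13 mph. " class="forecast-icon" src="newimages/medium/bkn.png" title="Today: Partly sunny, with a high near 57. South wind 6 to 13 mph. "/></p>
 <p class="short-desc">Partly Sunny</p>
 <p class="temp temp-high">High: 57 °F</p>
</div>
"""
item = BeautifulSoup(item_html, "html.parser").find(class_="tombstone-container")

period = item.find(class_="period-name").get_text()
short_desc = item.find(class_="short-desc").get_text()
temp = item.find(class_="temp").get_text()
# Attributes of a tag are read like dictionary keys
desc = item.find("img")["title"]

print(period)      # Today
print(short_desc)  # Partly Sunny
print(temp)        # High: 57 °F
print(desc)
```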


In [6]:
# Find weather forecast for the week
period_tags = seven_day.select(".tombstone-container .period-name")
periods = [pt.get_text() for pt in period_tags]
periods

['Today',
 'Tonight',
 'Thursday',
 'ThursdayNight',
 'Friday',
 'FridayNight',
 'Saturday',
 'SaturdayNight',
 'Sunday']
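
Names such as `'ThursdayNight'` are fused because `get_text()` joins the text on either side of a line-break tag with nothing in between. Passing a separator is one way to keep the words apart. The sketch assumes the period name is marked up as `Thursday<br/>Night`, which is a guess based on the output above.

```python
from bs4 import BeautifulSoup

# Assumed markup for a two-word period name
tag = BeautifulSoup('<p class="period-name">Thursday<br/>Night</p>', "html.parser").p

fused = tag.get_text()
# First argument is the separator placed between text pieces;
# strip=True trims whitespace from each piece and drops empty ones
spaced = tag.get_text(" ", strip=True)

print(fused)   # ThursdayNight
print(spaced)  # Thursday Night
```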

In [7]:
short_descs = [sd.get_text() for sd in seven_day.select(".tombstone-container .short-desc")]
temps = [t.get_text() for t in seven_day.select(".tombstone-container .temp")]
descs = [d["title"] for d in seven_day.select(".tombstone-container img")]
print(short_descs)
print(temps)
print(descs)

['Partly Sunny', 'GradualClearing', 'Mostly Sunnythen SlightChanceShowers', 'ShowersLikely', 'Sunny', 'Mostly Clear', 'Mostly Sunny', 'Mostly Cloudy', 'Mostly Cloudy']
['High: 57 °F', 'Low: 54 °F', 'High: 65 °F', 'Low: 42 °F', 'High: 48 °F', 'Low: 39 °F', 'High: 47 °F', 'Low: 43 °F', 'High: 57 °F']
['Today: Partly sunny, with a high near 57. South wind 6 to 13 mph. ', 'Tonight: Mostly cloudy during the early evening, then gradual clearing, with a low around 54. Southwest wind around 10 mph. ', 'Thursday: A 20 percent chance of showers after 4pm.  Mostly sunny, with a high near 65. Southwest wind 10 to 14 mph. ', 'Thursday Night: Showers likely, mainly between 7pm and 1am.  Cloudy, then gradually becoming partly cloudy, with a low around 42. South wind around 15 mph becoming west after midnight.  Chance of precipitation is 60%. New precipitation amounts between a tenth and quarter of an inch possible. ', 'Friday: Sunny, with a high near 48. Northwest wind 15 to 18 mph. ', 'Friday Night:

In [8]:
# Load the weather data as a data frame
import pandas as pd
weather = pd.DataFrame({
    "period": periods,
    "short_desc": short_descs,
    "temp": temps,
    "desc": descs
})
weather

| | period | short_desc | temp | desc |
|---|---|---|---|---|
| 0 | Today | Partly Sunny | High: 57 °F | Today: Partly sunny, with a high near 57. Sout... |
| 1 | Tonight | GradualClearing | Low: 54 °F | Tonight: Mostly cloudy during the early evenin... |
| 2 | Thursday | Mostly Sunnythen SlightChanceShowers | High: 65 °F | Thursday: A 20 percent chance of showers after... |
| 3 | ThursdayNight | ShowersLikely | Low: 42 °F | Thursday Night: Showers likely, mainly between... |
| 4 | Friday | Sunny | High: 48 °F | Friday: Sunny, with a high near 48. Northwest ... |
| 5 | FridayNight | Mostly Clear | Low: 39 °F | Friday Night: Mostly clear, with a low around 39. |
| 6 | Saturday | Mostly Sunny | High: 47 °F | Saturday: Mostly sunny, with a high near 47. |
| 7 | SaturdayNight | Mostly Cloudy | Low: 43 °F | Saturday Night: Mostly cloudy, with a low arou... |
| 8 | Sunday | Mostly Cloudy | High: 57 °F | Sunday: Mostly cloudy, with a high near 57. |
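
The `temp` column holds strings such as `High: 57 °F`, so it cannot be averaged or plotted directly. A possible next step is to pull out the number with a regular expression. The sketch below rebuilds the first three rows by hand so that it runs on its own; the column names `temp_num` and `is_night` are illustrative choices.

```python
import pandas as pd

# First three rows of the table above, typed in by hand
weather = pd.DataFrame({
    "period": ["Today", "Tonight", "Thursday"],
    "temp": ["High: 57 °F", "Low: 54 °F", "High: 65 °F"],
})

# Capture the run of digits in each string and convert it to an integer
weather["temp_num"] = weather["temp"].str.extract(r"(\d+)", expand=False).astype(int)

# Night-time periods report a low rather than a high
weather["is_night"] = weather["temp"].str.contains("Low")

print(weather)
```

With a numeric column in place, expressions such as `weather["temp_num"].mean()` or `weather[weather["is_night"]]` work as usual.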
