### Importing major libraries

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import urllib3
from urllib.request import urlopen

In this tutorial, we'll see how to perform web scraping using Python 3 and the BeautifulSoup library. We'll be scraping weather forecasts from the National Weather Service, and then analyzing them using the Pandas library.

In [6]:
%%html
<img src="nws.png">

### Components of webpage

When we visit a web page, our web browser makes a request to a web server. This request is called a GET request, since we're getting files from the server. The server then sends back files that tell our browser how to render the page for us. The files fall into a few main types:

1. HTML: contains the main content of the page.
2. CSS: adds styling to make the page look nicer.
3. JS: JavaScript files add interactivity to web pages.
4. Images: image formats such as JPG and PNG allow web pages to show pictures.

### HTML

HTML isn't a programming language like Python. Instead, it's a markup language that tells a browser how to lay out content. HTML allows you to do things similar to what you do in a word processor like Microsoft Word: make text bold, create paragraphs, and so on. Because HTML isn't a programming language, it isn't nearly as complex as Python.

### The 'requests' library


The first thing we'll need to do to scrape a web page is to download the page. We can download pages using the Python requests library. The requests library will make a GET request to a web server, which will download the HTML contents of a given web page for us. There are several different types of requests we can make using requests, of which GET is just one. 
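To make the idea of a GET request more concrete, here is a minimal sketch that builds one with requests without contacting any server. The query parameters are the San Francisco coordinates used later in this tutorial, and the variable names are only illustrative.

```python
import requests

# Build (but do not send) a GET request so it can be inspected offline.
request = requests.Request(
    "GET",
    "https://forecast.weather.gov/MapClick.php",
    params={"lat": "37.7772", "lon": "-122.4168"},
)
prepared = request.prepare()

print(prepared.method)  # the HTTP verb that would be sent
print(prepared.url)     # the full URL, including the encoded query string
```

Calling requests.get(url) prepares and sends a request like this in a single step. Other HTTP verbs have matching helpers, such as requests.post.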

In [7]:
import requests

In [8]:
req = requests.get("http://dataquestio.github.io/web-scraping-pages/simple.html")

Save the body of the response (the raw HTML) in a variable:

In [10]:
con = req.content

Use BeautifulSoup to parse the content:

In [11]:
from bs4 import BeautifulSoup

In [13]:
read = BeautifulSoup(con, 'html.parser')

In [15]:
print(read.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   A simple example page
  </title>
 </head>
 <body>
  <p>
   Here is some simple content for this page.
  </p>
 </body>
</html>


In [31]:
list(read.children)

['html', '\n', <html>
 <head>
 <title>A simple example page</title>
 </head>
 <body>
 <p>Here is some simple content for this page.</p>
 </body>
 </html>]

In [34]:
[type(x) for x in read.children]

[bs4.element.Doctype, bs4.element.NavigableString, bs4.element.Tag]

As you can see, all of the items are BeautifulSoup objects. 
1. The first is a Doctype object, which contains information about the type of the document. 
2. The second is a NavigableString, which represents text found in the HTML document. 
3. The final item is a Tag object, which contains other nested tags. 

The most important object type, and the one we'll deal with most often, is the Tag object.
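Since Tag objects are usually the ones we care about, one possible way to skip the Doctype and the whitespace strings is to filter the children by type. This is a small sketch that uses an inline copy of the simple example page instead of downloading it; the variable names are illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Inline copy of the simple example page, so no download is needed.
html_doc = """<!DOCTYPE html>
<html>
<head>
<title>A simple example page</title>
</head>
<body>
<p>Here is some simple content for this page.</p>
</body>
</html>"""

soup = BeautifulSoup(html_doc, "html.parser")

# Keep only the Tag children; the Doctype and whitespace strings are skipped.
tag_children = [child for child in soup.children if isinstance(child, Tag)]
tag_names = [tag.name for tag in tag_children]
print(tag_names)
```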

In [35]:
html = list(read.children)[2]

Each item in the list returned by the children property is also a BeautifulSoup object, so we can read the children property on html as well.

Now we can find the children inside the html tag:

In [37]:
list(html.children)

['\n', <head>
 <title>A simple example page</title>
 </head>, '\n', <body>
 <p>Here is some simple content for this page.</p>
 </body>, '\n']
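Continuing in the same way, we could walk down to the p tag one level at a time. The sketch below assumes the same page layout as the output above (newline strings between tags), which is why the indexes 3 and 1 are used; it works on an inline copy of the page rather than a download.

```python
from bs4 import BeautifulSoup

html_doc = """<!DOCTYPE html>
<html>
<head>
<title>A simple example page</title>
</head>
<body>
<p>Here is some simple content for this page.</p>
</body>
</html>"""

soup = BeautifulSoup(html_doc, "html.parser")

html = list(soup.children)[2]   # Doctype, newline, then the html tag
body = list(html.children)[3]   # newline, head, newline, then the body tag
p = list(body.children)[1]      # newline, then the p tag
text = p.get_text()
print(text)

# Attribute-style access reaches the same tag without counting indexes.
shortcut_text = soup.body.p.get_text()
```

Index-based navigation breaks as soon as the whitespace in the page changes, which is one reason the search methods in the next section are usually preferred.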

### Finding all instances of a tag at once

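Navigating through children one level at a time gets tedious. If we want to extract every instance of a tag at once, we can instead use the find_all method, which returns a list of all the instances of a tag on a page. If we only want the first instance, the find method returns a single Tag object.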

In [48]:
read.find_all('p')[0].get_text()

'Here is some simple content for this page.'

In [49]:
# gets the first instance of the 'p' tag
read.find('p')

<p>Here is some simple content for this page.</p>
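The difference between the two methods matters most when nothing matches. The following sketch, using a made-up two-paragraph document, shows what each method returns in both cases.

```python
from bs4 import BeautifulSoup

# Hypothetical document used only for this illustration.
html_doc = "<html><body><p>First</p><p>Second</p></body></html>"
soup = BeautifulSoup(html_doc, "html.parser")

all_paragraphs = soup.find_all("p")   # list-like result with every match
first_paragraph = soup.find("p")      # only the first match
missing = soup.find("table")          # None when nothing matches
no_tables = soup.find_all("table")    # empty result when nothing matches

print(len(all_paragraphs), first_paragraph.get_text(), missing, len(no_tables))
```

Because find can return None, it is worth checking the result before calling get_text on it when scraping pages whose layout may change.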

### New html and exploration

In [50]:
url_get = requests.get("http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html")

In [52]:
soup = BeautifulSoup(url_get.content, 'html.parser')

In [53]:
soup

<html>
<head>
<title>A simple example page</title>
</head>
<body>
<div>
<p class="inner-text first-item" id="first">
                First paragraph.
            </p>
<p class="inner-text">
                Second paragraph.
            </p>
</div>
<p class="outer-text first-item" id="second">
<b>
                First outer paragraph.
            </b>
</p>
<p class="outer-text">
<b>
                Second outer paragraph.
            </b>
</p>
</body>
</html>

Here, the p tags carry different classes, so we can select the ones we want by passing the class_ parameter to the find_all method.

In [56]:
soup.find_all('p', class_ = "outer-text")

[<p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>, <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]

We can also leave out the tag name and search by class alone, which matches any tag that has the class outer-text:

In [60]:
soup.find_all(class_ = 'outer-text')
# this returns a list of every tag (here, the two 'p' tags) with the 'outer-text' class

[<p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>, <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]

We can also search for elements by id:



In [61]:
soup.find_all(id = "first")

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>]
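Once a tag has been found, its attributes can be read like dictionary entries. The sketch below uses a single paragraph copied from the page above and also suggests why a search for one class matches tags that carry several: BeautifulSoup treats class as a list of values.

```python
from bs4 import BeautifulSoup

html_doc = '<p class="inner-text first-item" id="first">First paragraph.</p>'
soup = BeautifulSoup(html_doc, "html.parser")

first = soup.find(id="first")
classes = first["class"]        # class can hold several values, so this is a list
tag_id = first["id"]            # id holds a single value, so this is a string
fallback = first.get("href")    # .get returns None if the attribute is absent

# Searching for either class on its own still finds the tag.
by_one_class = soup.find_all(class_="first-item")
print(classes, tag_id, fallback, len(by_one_class))
```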

### Using CSS Selectors


We can also search for items using CSS selectors. These selectors are how the CSS language allows developers to specify HTML tags to style. Here are some examples:

1. p a — finds all a tags inside of a p tag.
2. body p a — finds all a tags inside of a p tag inside of a body tag.
3. html body — finds all body tags inside of an html tag.
4. p.outer-text — finds all p tags with a class of outer-text.
5. p#first — finds all p tags with an id of first.
6. body p.outer-text — finds any p tags with a class of outer-text inside of a body tag.
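To see how several of these selectors behave, the sketch below runs them with the select method against a condensed inline copy of the ids_and_classes page and counts the matches. The page contains no a tags, so the first selector is expected to match nothing.

```python
from bs4 import BeautifulSoup

# Condensed copy of the ids_and_classes example page.
html_doc = """<html><head><title>A simple example page</title></head>
<body>
<div>
<p class="inner-text first-item" id="first">First paragraph.</p>
<p class="inner-text">Second paragraph.</p>
</div>
<p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
<p class="outer-text"><b>Second outer paragraph.</b></p>
</body></html>"""
soup = BeautifulSoup(html_doc, "html.parser")

selectors = ["p a", "html body", "p.outer-text", "p#first", "body p.outer-text"]

# Map each selector to the number of tags it matches.
match_counts = {sel: len(soup.select(sel)) for sel in selectors}
for sel, count in match_counts.items():
    print(f"{sel!r}: {count} match(es)")
```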

In [64]:
# example
# for CSS selectors, we use 'select' function 
soup.select(selector='div p')

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>, <p class="inner-text">
                 Second paragraph.
             </p>]

In [65]:
soup.select('p.outer-text')

[<p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>, <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]

### Downloading weather data


We now know enough to proceed with extracting information about the local weather from the National Weather Service website. The first step is to find the page we want to scrape. We'll extract weather information about downtown San Francisco from [this page](https://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168#.W8-6W2hKg2w).

In [69]:
%%html
<img src = "Capture.png">

With the page identified, we can download it and start parsing. In the code below, we:

1. Download the web page containing the forecast.
2. Create a BeautifulSoup object to parse the page.
3. Find the div with id seven-day-forecast, and assign it to seven_day.
4. Inside seven_day, find each individual forecast item.
5. Extract and print the first forecast item.

In [71]:
page = requests.get("https://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168#.W8-6W2hKg2w")
page = page.content

In [73]:
nsoup = BeautifulSoup(page, 'html.parser')

In [81]:
seven_day = nsoup.find(id = "seven-day-forecast")

In [82]:
seven_day

<div class="panel panel-default" id="seven-day-forecast">
<div class="panel-heading">
<b>Extended Forecast for</b>
<h2 class="panel-title">
	    	    San Francisco CA	</h2>
</div>
<div class="panel-body" id="seven-day-forecast-body">
<div id="seven-day-forecast-container"><ul class="list-unstyled" id="seven-day-forecast-list"><li class="forecast-tombstone">
<div class="tombstone-container">
<p class="period-name">Tonight<br/><br/></p>
<p><img alt="Tonight: Partly cloudy, with a low around 54. West wind 8 to 16 mph, with gusts as high as 21 mph. " class="forecast-icon" src="newimages/medium/nsct.png" title="Tonight: Partly cloudy, with a low around 54. West wind 8 to 16 mph, with gusts as high as 21 mph. "/></p><p class="short-desc">Partly Cloudy</p><p class="temp temp-low">Low: 54 °F</p></div></li><li class="forecast-tombstone">
<div class="tombstone-container">
<p class="period-name">Wednesday<br/><br/></p>
<p><img alt="Wednesday: Mostly sunny, with a high near 66. West wind 5 to 10 m

In [83]:
forecast_items = seven_day.find_all(class_ = "tombstone-container")

In [80]:
forecast_items

[<div class="tombstone-container">
 <p class="period-name">Tonight<br/><br/></p>
 <p><img alt="Tonight: Partly cloudy, with a low around 54. West wind 8 to 16 mph, with gusts as high as 21 mph. " class="forecast-icon" src="newimages/medium/nsct.png" title="Tonight: Partly cloudy, with a low around 54. West wind 8 to 16 mph, with gusts as high as 21 mph. "/></p><p class="short-desc">Partly Cloudy</p><p class="temp temp-low">Low: 54 °F</p></div>,
 <div class="tombstone-container">
 <p class="period-name">Wednesday<br/><br/></p>
 <p><img alt="Wednesday: Mostly sunny, with a high near 66. West wind 5 to 10 mph increasing to 13 to 18 mph in the afternoon. Winds could gust as high as 23 mph. " class="forecast-icon" src="newimages/medium/sct.png" title="Wednesday: Mostly sunny, with a high near 66. West wind 5 to 10 mph increasing to 13 to 18 mph in the afternoon. Winds could gust as high as 23 mph. "/></p><p class="short-desc">Mostly Sunny</p><p class="temp temp-high">High: 66 °F</p></div>,


In [84]:
tonight = forecast_items[0]

In [85]:
print(tonight.prettify())

<div class="tombstone-container">
 <p class="period-name">
  Tonight
  <br/>
  <br/>
 </p>
 <p>
  <img alt="Tonight: Partly cloudy, with a low around 54. West wind 8 to 16 mph, with gusts as high as 21 mph. " class="forecast-icon" src="newimages/medium/nsct.png" title="Tonight: Partly cloudy, with a low around 54. West wind 8 to 16 mph, with gusts as high as 21 mph. "/>
 </p>
 <p class="short-desc">
  Partly Cloudy
 </p>
 <p class="temp temp-low">
  Low: 54 °F
 </p>
</div>


In [94]:
tonight.find('img')['title']

'Tonight: Partly cloudy, with a low around 54. West wind 8 to 16 mph, with gusts as high as 21 mph. '
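The other fields of a forecast item can be read in a similar way, by finding each tag by its class and calling get_text. The sketch below works on a trimmed inline copy of the first forecast item (the img title is shortened here), so it runs without downloading the page.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the first forecast item; the title text is shortened.
item_html = """<div class="tombstone-container">
<p class="period-name">Tonight<br/><br/></p>
<p><img class="forecast-icon" src="newimages/medium/nsct.png" title="Tonight: Partly cloudy, with a low around 54."/></p>
<p class="short-desc">Partly Cloudy</p>
<p class="temp temp-low">Low: 54 °F</p>
</div>"""
tonight = BeautifulSoup(item_html, "html.parser")

period = tonight.find(class_="period-name").get_text()
short_desc = tonight.find(class_="short-desc").get_text()
temp = tonight.find(class_="temp").get_text()
print(period, short_desc, temp, sep=" | ")
```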

### Extracting all the information from the page


Now that we know how to extract each individual piece of information, we can combine our knowledge with CSS selectors and list comprehensions to extract everything at once.

In the code below, we:

1. Select all items with the class period-name inside an item with the class tombstone-container in seven_day.
2. Use a list comprehension to call the get_text method on each BeautifulSoup object.

In [98]:
period_tags = seven_day.select(".tombstone-container .period-name")

In [99]:
periods = [pt.get_text() for pt in period_tags]
periods

['Tonight',
 'Wednesday',
 'WednesdayNight',
 'Thursday',
 'ThursdayNight',
 'Friday',
 'FridayNight',
 'Saturday',
 'SaturdayNight']
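Names such as 'WednesdayNight' run together because get_text joins the pieces of text in a tag with nothing in between. The sketch below assumes the two words are separated by a br tag in the page markup (the full markup for that item is not shown above) and uses the separator and strip arguments of get_text to keep them apart.

```python
from bs4 import BeautifulSoup

# Assumed markup for a two-word period name, with a <br/> between the words.
snippet = '<p class="period-name">Wednesday<br/>Night</p>'
tag = BeautifulSoup(snippet, "html.parser").p

joined = tag.get_text()                            # pieces run together
spaced = tag.get_text(separator=" ", strip=True)   # pieces joined by a space
print(repr(joined), repr(spaced))
```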

As we can see above, our technique gets us each of the period names, in order. We can apply the same technique to get the other three fields:

In [104]:
short_descs = [sd.get_text() for sd in seven_day.select(".tombstone-container .short-desc")]
temps = [t.get_text() for t in seven_day.select(".tombstone-container .temp")]
descs = [d["title"] for d in seven_day.select(".tombstone-container img")]

print(short_descs)
print(temps)
print(descs)

['Partly Cloudy', 'Mostly Sunny', 'Partly Cloudy', 'Mostly Sunny', 'Mostly Clear', 'Sunny', 'Partly Cloudy', 'Mostly Sunny', 'Partly Cloudy']
['Low: 54 °F', 'High: 66 °F', 'Low: 55 °F', 'High: 68 °F', 'Low: 56 °F', 'High: 70 °F', 'Low: 56 °F', 'High: 71 °F', 'Low: 56 °F']
['Tonight: Partly cloudy, with a low around 54. West wind 8 to 16 mph, with gusts as high as 21 mph. ', 'Wednesday: Mostly sunny, with a high near 66. West wind 5 to 10 mph increasing to 13 to 18 mph in the afternoon. Winds could gust as high as 23 mph. ', 'Wednesday Night: Partly cloudy, with a low around 55. West wind 12 to 17 mph decreasing to 5 to 10 mph after midnight. Winds could gust as high as 22 mph. ', 'Thursday: Mostly sunny, with a high near 68. Light west wind increasing to 11 to 16 mph in the afternoon. Winds could gust as high as 21 mph. ', 'Thursday Night: Mostly clear, with a low around 56. West southwest wind 10 to 15 mph, with gusts as high as 20 mph. ', 'Friday: Sunny, with a high near 70.', 'Frida
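Building four separate lists works as long as every forecast item contains every field. If an item were missing one (an assumed edge case, not something shown in the output above), the lists would have different lengths and no longer line up. One alternative sketch is to loop over the items and build one record per item; the helper text_or_none and the sample HTML are hypothetical.

```python
from bs4 import BeautifulSoup

# Two hand-written items; the second has no temperature (assumed edge case).
forecast_html = """<div id="seven-day-forecast">
<div class="tombstone-container">
<p class="period-name">Tonight</p>
<p><img title="Tonight: Partly cloudy, with a low around 54."/></p>
<p class="short-desc">Partly Cloudy</p>
<p class="temp temp-low">Low: 54 °F</p>
</div>
<div class="tombstone-container">
<p class="period-name">Wednesday</p>
<p><img title="Wednesday: Mostly sunny, with a high near 66."/></p>
<p class="short-desc">Mostly Sunny</p>
</div>
</div>"""
seven_day = BeautifulSoup(forecast_html, "html.parser")


def text_or_none(parent, selector):
    """Return the text of the first match for selector, or None if absent."""
    match = parent.select_one(selector)
    return match.get_text() if match is not None else None


records = []
for item in seven_day.select(".tombstone-container"):
    img = item.select_one("img")
    records.append({
        "period": text_or_none(item, ".period-name"),
        "short_desc": text_or_none(item, ".short-desc"),
        "temp": text_or_none(item, ".temp"),
        "desc": img["title"] if img is not None else None,
    })
print(records)
```

A list of dictionaries like this can be passed straight to pd.DataFrame, with missing fields showing up as empty values instead of shifting the rows.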

### Combining our data into a Pandas Dataframe


In [105]:
weather = pd.DataFrame({
        "period": periods, 
        "short_desc": short_descs, 
        "temp": temps, 
        "desc":descs
    })


In [109]:
weather


| | desc | period | short_desc | temp |
|---|---|---|---|---|
| 0 | Tonight: Partly cloudy, with a low around 54. ... | Tonight | Partly Cloudy | Low: 54 °F |
| 1 | Wednesday: Mostly sunny, with a high near 66. ... | Wednesday | Mostly Sunny | High: 66 °F |
| 2 | Wednesday Night: Partly cloudy, with a low aro... | WednesdayNight | Partly Cloudy | Low: 55 °F |
| 3 | Thursday: Mostly sunny, with a high near 68. L... | Thursday | Mostly Sunny | High: 68 °F |
| 4 | Thursday Night: Mostly clear, with a low aroun... | ThursdayNight | Mostly Clear | Low: 56 °F |
| 5 | Friday: Sunny, with a high near 70. | Friday | Sunny | High: 70 °F |
| 6 | Friday Night: Partly cloudy, with a low around... | FridayNight | Partly Cloudy | Low: 56 °F |
| 7 | Saturday: Mostly sunny, with a high near 71. | Saturday | Mostly Sunny | High: 71 °F |
| 8 | Saturday Night: Partly cloudy, with a low arou... | SaturdayNight | Partly Cloudy | Low: 56 °F |


In [112]:
weather["temp"][0].split()[1]

'54'

In [114]:
weather['temp_number'] = weather['temp'].apply(lambda x: x.split()[1])

In [115]:
weather

| | desc | period | short_desc | temp | temp_number |
|---|---|---|---|---|---|
| 0 | Tonight: Partly cloudy, with a low around 54. ... | Tonight | Partly Cloudy | Low: 54 °F | 54 |
| 1 | Wednesday: Mostly sunny, with a high near 66. ... | Wednesday | Mostly Sunny | High: 66 °F | 66 |
| 2 | Wednesday Night: Partly cloudy, with a low aro... | WednesdayNight | Partly Cloudy | Low: 55 °F | 55 |
| 3 | Thursday: Mostly sunny, with a high near 68. L... | Thursday | Mostly Sunny | High: 68 °F | 68 |
| 4 | Thursday Night: Mostly clear, with a low aroun... | ThursdayNight | Mostly Clear | Low: 56 °F | 56 |
| 5 | Friday: Sunny, with a high near 70. | Friday | Sunny | High: 70 °F | 70 |
| 6 | Friday Night: Partly cloudy, with a low around... | FridayNight | Partly Cloudy | Low: 56 °F | 56 |
| 7 | Saturday: Mostly sunny, with a high near 71. | Saturday | Mostly Sunny | High: 71 °F | 71 |
| 8 | Saturday Night: Partly cloudy, with a low arou... | SaturdayNight | Partly Cloudy | Low: 56 °F | 56 |
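Note that split returns strings, so the temp_number column holds text such as '54' rather than numbers. As a possible follow-up, the sketch below rebuilds three rows of the table, extracts the digits into an integer column, and flags the overnight lows; the column names temp_num and is_night are illustrative choices.

```python
import pandas as pd

# Three rows copied from the table above.
weather = pd.DataFrame({
    "period": ["Tonight", "Wednesday", "WednesdayNight"],
    "temp": ["Low: 54 °F", "High: 66 °F", "Low: 55 °F"],
})

# Pull out the digits and convert them so arithmetic works.
weather["temp_num"] = weather["temp"].str.extract(r"(\d+)", expand=False).astype(int)

# Flag overnight lows by looking for the "Low" label.
weather["is_night"] = weather["temp"].str.contains("Low")

mean_temp = weather["temp_num"].mean()
night_periods = weather.loc[weather["is_night"], "period"].tolist()
print(mean_temp, night_periods)
```

With a numeric column in place, the usual pandas operations (mean, sorting, filtering by threshold) apply directly to the forecast temperatures.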


### Next Steps


We now have a good understanding of how to scrape web pages and extract data. A good next step would be to pick a site and try some web scraping on our own. Some good examples of data to scrape are:

1. News articles
2. Sports scores
3. Weather forecasts
4. Stock prices
5. Online retailer prices

## THANK YOU