
## What is Web Scraping?
* Extracting data from websites
* The extracted data is collected and then converted into a format that is useful to the user (e.g. CSV)
* A scraper can extract all the data on a page, or only specific data that the user selects before the scraper is run
* Extracting specific data requires techniques to identify which CSS or JavaScript elements correspond to the required data
* The user checks through the data to confirm that the scraper works properly
* The web scraper outputs the data it collected
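
Converting collected data into a format such as CSV can be sketched with the standard library alone. The records below are made-up placeholders rather than scraped values, and `records_to_csv` is a hypothetical helper name used only for this illustration.

```python
import csv
import io


def records_to_csv(records, fieldnames):
    """Serialise a list of dicts to CSV text (hypothetical helper)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Placeholder records standing in for whatever a scraper collected
records = [
    {"title": "Example A", "price": "10.00"},
    {"title": "Example B", "price": "12.50"},
]

csv_text = records_to_csv(records, ["title", "price"])
print(csv_text)
```

Writing to an in-memory buffer keeps the sketch free of side effects; passing an open file object instead of `io.StringIO()` would write the same rows to disk.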

What are web scrapers used for?
* Extracting information from the web
* What is extracted depends on the problem statement and the type of analysis the data will be used for

Types of websites
* Static: the content of the page does not change, e.g. history sites
* Dynamic: the content of the page changes, so it is not the same from one moment to the next, e.g. e-commerce sites

Beautiful Soup
* One of the most commonly used HTML parsing libraries for Python
* Very useful for pulling information out of an HTML page

## 1. Download/Import libraries

In [None]:
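# urllib and pprint ship with Python's standard library, so they need no installation
# !pip install beautifulsoup4   # provides the bs4 module
# !pip install requests
# !pip install pandas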

In [None]:
# Import libraries
import pandas as pd
import urllib
import urllib.request
from bs4 import BeautifulSoup
import requests
from pprint import pprint #pretty print helps for formatting

## 2. Scraping Basic Knowledge

In [None]:
#1. Scraping HTML
url = "https://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168"
page = requests.get(url)

In [3]:
page
#Informational responses (100–199)
#Successful responses (200–299)
#Redirection messages (300–399)
#Client error responses (400–499)
#Server error responses (500–599)

<Response [200]>
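
The ranges listed in the comments above can be turned into a small lookup. This is a minimal sketch using only the standard library; `status_class` is a hypothetical helper name, and the codes in the loop are arbitrary examples rather than responses from the weather site.

```python
from http import HTTPStatus


def status_class(code):
    """Return the response class for an HTTP status code (hypothetical helper)."""
    classes = {
        1: "informational",
        2: "successful",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    # Integer division by 100 maps 200-299 to 2, 400-499 to 4, and so on
    return classes.get(code // 100, "unknown")


for code in (200, 301, 404, 503):
    # HTTPStatus(code).phrase is the standard reason phrase for the code
    print(code, HTTPStatus(code).phrase, "->", status_class(code))
```

With `requests`, the integer is available as `page.status_code`, and `page.raise_for_status()` raises an exception for client and server error responses.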

In [None]:
# 2. Make the HTML more readable by adding indentation
soup = BeautifulSoup(page.content, 'html.parser')
# print(soup) # Just to compare between using and not using prettify
print(soup.prettify())
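
Because the real page is long, the effect of `prettify()` is easier to see on a tiny document. The HTML string below is hand-written for illustration and stands in for `page.content`; `small_soup` is a name chosen only for this sketch.

```python
from bs4 import BeautifulSoup

# A tiny hand-written document standing in for page.content
html = "<html><body><div id='main'><p>Hello</p><p>World</p></div></body></html>"

small_soup = BeautifulSoup(html, "html.parser")

print(small_soup)             # one long line, as written
print(small_soup.prettify())  # one tag per line, indented by nesting depth
```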

In [None]:
# 3. How to find information of tags from HTML
#  a) Finding all instances of a tag using find_all

soup.find_all('p')

In [16]:
# get_text() extracts the text inside a tag

soup.find_all('p')[1].get_text()
# find_all returns the <p> tags in document order; index 1 is the second <p> on the page

'Your local forecast office is'

In [15]:
# Get the text of the <p> at index 2
soup.find_all('p')[2].get_text()

'\n                    A significant arctic outbreak will spread over the northern Plains on Thanksgiving and advance farther south and east on Friday into the weekend. Much of the lower 48 states will see freezing temperatures with dangerous wind chills in the northern Plains. Heavy lake effect snow is likely downwind of the Great Lakes. Rain is forecast from southern New England into the Southeast U.S.\n                                                                Read More >\n'
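
The leading and trailing whitespace in the output above comes from the page's own indentation. A minimal sketch on a hand-written snippet (not the real page) shows how indexing into `find_all` works and how `get_text(strip=True)` trims that whitespace.

```python
from bs4 import BeautifulSoup

# Hand-written HTML that loosely mimics the structure seen above
html = """
<div>
  <p><input type="radio" value="x"/><label>NWS</label></p>
  <p>Your local forecast office is</p>
  <p>
      A forecast summary.
  </p>
</div>
"""
demo = BeautifulSoup(html, "html.parser")

paragraphs = demo.find_all("p")  # list-like, in document order
print(len(paragraphs))
print(paragraphs[1].get_text())
print(repr(paragraphs[2].get_text()))            # keeps surrounding whitespace
print(repr(paragraphs[2].get_text(strip=True)))  # whitespace trimmed
```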

In [None]:
# 3b) Finding the first instance of tag using find()
soup.find('p')

<p>
<input checked="checked" id="nws" name="affiliate" type="radio" value="nws.noaa.gov"/>
<label class="search-scope" for="nws">NWS</label>
<input id="noaa" name="affiliate" type="radio" value="noaa.gov"/>
<label class="search-scope" for="noaa">All NOAA</label>
</p>

In [18]:
# get_text() returns only the text between tags; attributes such as id="nws" or value="noaa.gov" are not text
soup.find_all('p')[0].get_text()

'\n\nNWS\n\nAll NOAA\n'

In [None]:
# Split the string into a list of words on whitespace
soup.find_all('p')[0].get_text().split()

['NWS', 'All', 'NOAA']
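
When called without arguments, `split()` and `rsplit()` give the same list, since both split on any run of whitespace and drop empty strings. The sketch below reuses the string from the output above to show where the two actually differ.

```python
text = "\n\nNWS\n\nAll NOAA\n"

# With no arguments, split() and rsplit() both split on any run of whitespace
print(text.split())
print(text.rsplit())

# They differ only when maxsplit is given: split counts from the left,
# rsplit counts from the right
print(text.split(maxsplit=1))
print(text.rsplit(maxsplit=1))
```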

In [20]:
# c) Search for tags by class or id

# Find tags with class period-name
soup.find_all(class_= 'period-name')
# class_ is a keyword argument of find_all, so the argument name itself is not quoted; the value 'period-name' is a string
# The trailing underscore is needed because class is a reserved keyword in Python (it defines classes)
# and cannot be used as an argument name, so Beautiful Soup uses class_ to filter on the CSS class instead

[<p class="period-name">Today</p>,
 <p class="period-name">Tonight</p>,
 <p class="period-name">Thanksgiving Day</p>,
 <p class="period-name">Thursday Night</p>,
 <p class="period-name">Friday</p>,
 <p class="period-name">Friday Night</p>,
 <p class="period-name">Saturday</p>,
 <p class="period-name">Saturday Night</p>,
 <p class="period-name">Sunday</p>]
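
A short sketch of why the argument is spelled `class_`: `class` is a reserved word in Python, which the standard `keyword` module confirms. The HTML below is hand-written for illustration, and `by_keyword` and `by_attrs` are names chosen only for this example.

```python
import keyword

from bs4 import BeautifulSoup

html = """
<div>
  <p class="period-name">Today</p>
  <p class="temp temp-high">High: 60 °F</p>
  <p class="period-name">Tonight</p>
</div>
"""
demo = BeautifulSoup(html, "html.parser")

# `class` is reserved by Python, so it cannot be used as an argument name
print(keyword.iskeyword("class"))

by_keyword = demo.find_all(class_="period-name")
by_attrs = demo.find_all(attrs={"class": "period-name"})  # equivalent spelling
print(by_keyword == by_attrs)
print([tag.get_text() for tag in by_keyword])

# An element with several classes matches on any one of them
print(demo.find(class_="temp").get_text())
```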

In [21]:
# Find tags with id news-items
soup.find_all(id = "news-items")
# Returns the tag with id 'news-items' together with everything nested inside it (refer to the soup output)
# id is a keyword argument of find_all, so it is not quoted; unlike class, id is not a reserved keyword and needs no underscore

[<div id="news-items">
 <div id="topnews">
 <div class="icon"><img src="/bundles/templating/images/top_news/important.png"/></div>
 <div class="body">
 <h1 style="font-size: 11pt;">Arctic Blast Set to Arrive on Thanksgiving; Dangerous Wind Chills with Lake Effect Snow </h1>
 <p>
                     A significant arctic outbreak will spread over the northern Plains on Thanksgiving and advance farther south and east on Friday into the weekend. Much of the lower 48 states will see freezing temperatures with dangerous wind chills in the northern Plains. Heavy lake effect snow is likely downwind of the Great Lakes. Rain is forecast from southern New England into the Southeast U.S.
                                                                 <a href="http://www.wpc.ncep.noaa.gov/discussions/hpcdiscussions.php?disc=pmdspd" target="_blank">Read More &gt;</a>
 </p>
 </div>
 </div>
 </div>]
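
Since an id is meant to be unique within a page, `find()` is often a more convenient fit than `find_all()` for id lookups. The sketch below uses a hand-written snippet, not the real page, and also shows the equivalent CSS selector form.

```python
from bs4 import BeautifulSoup

html = """
<div id="news-items">
  <div id="topnews"><h1>Headline</h1><p>Summary text.</p></div>
</div>
"""
demo = BeautifulSoup(html, "html.parser")

# find() returns a single tag (or None), find_all() returns a list
news = demo.find(id="news-items")
print(news.find("h1").get_text())

# The same element through a CSS selector, where '#' marks an id
print(demo.select_one("#news-items") is news)
```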

## 3. Scraping for real now
1. Download the web page containing the forecast
2. Create a BeautifulSoup object to parse the page
3. Find the div with id seven-day-forecast and assign it to seven_day
4. Inside seven_day, find each individual forecast item
5. Extract and print the first forecast item
6. Using the tag information found in Step 5, extract the following information: Period, Short Description, Temperature and Description of the conditions
7. Format the extracted data into a pandas DataFrame

In [None]:
# Show our current soup variable
print(soup.prettify())

In [None]:
# Step 3: Find the div with id seven-day-forecast and assign it to seven_day
seven_day = soup.find(id = "seven-day-forecast")
print(seven_day.prettify())

In [24]:
# Step 4: Inside the seven day, find each individual forecast item
forecast_items = seven_day.find_all(class_ = 'tombstone-container')
print(forecast_items)

[<div class="tombstone-container"><p class="period-name">Today</p><p><img alt="Today: Sunny, with a high near 60. North northwest wind around 6 mph. " class="forecast-icon" src="newimages/medium/skc.png" title="Today: Sunny, with a high near 60. North northwest wind around 6 mph. "/></p><p class="temp temp-high">High: 60 °F</p><p class="short-desc">Sunny</p></div>, <div class="tombstone-container"><p class="period-name">Tonight</p><p><img alt="Tonight: Clear, with a low around 43. Light northeast wind. " class="forecast-icon" src="newimages/medium/nskc.png" title="Tonight: Clear, with a low around 43. Light northeast wind. "/></p><p class="temp temp-low">Low: 43 °F</p><p class="short-desc">Clear</p></div>, <div class="tombstone-container"><p class="period-name">Thanksgiving Day</p><p><img alt="Thanksgiving Day: Sunny, with a high near 60. Northeast wind 6 to 8 mph. " class="forecast-icon" src="newimages/medium/few.png" title="Thanksgiving Day: Sunny, with a high near 60. Northeast wind

In [25]:
# Step 5: Extract and print the first forecast item
# (the variable is called tonight, but in this run the first item on the page is 'Today')
tonight = forecast_items[0]
print(tonight.prettify())

<div class="tombstone-container">
 <p class="period-name">
  Today
 </p>
 <p>
  <img alt="Today: Sunny, with a high near 60. North northwest wind around 6 mph. " class="forecast-icon" src="newimages/medium/skc.png" title="Today: Sunny, with a high near 60. North northwest wind around 6 mph. "/>
 </p>
 <p class="temp temp-high">
  High: 60 °F
 </p>
 <p class="short-desc">
  Sunny
 </p>
</div>



In [None]:
# Step 6: Using the tag information found in Step 5, extract the following information:
# Period, Short Description, Temperature and Description of the conditions
period = tonight.find(class_ = 'period-name').get_text()
short_desc = tonight.find(class_ = 'short-desc').get_text()
temp = tonight.find(class_ = 'temp').get_text()
print(period)
print(short_desc)
print(temp)

Today
Sunny
High: 60 °F


In [None]:
# Description of the conditions
img = tonight.find('img') # 'img' is a tag name, passed to find() as a string
desc = img['title'] # Square brackets read an attribute of the tag, here title
print(desc)

Today: Sunny, with a high near 60. North northwest wind around 6 mph. 
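
The four lookups can be gathered into one function that tolerates missing pieces, since `find()` returns `None` when nothing matches. This is a minimal sketch: `parse_forecast_item` and `text_of` are hypothetical names, and the HTML is a trimmed, hand-written copy of the structure printed in Step 5.

```python
from bs4 import BeautifulSoup


def parse_forecast_item(item):
    """Return one forecast item as a dict (hypothetical helper).

    find() returns None when nothing matches, so each lookup is guarded.
    """
    def text_of(class_name):
        tag = item.find(class_=class_name)
        return tag.get_text(strip=True) if tag is not None else None

    img = item.find("img")
    return {
        "period": text_of("period-name"),
        "short_desc": text_of("short-desc"),
        "temp": text_of("temp"),
        # .get() returns None instead of raising KeyError if title is absent
        "desc": img.get("title") if img is not None else None,
    }


html = """
<div class="tombstone-container">
  <p class="period-name">Today</p>
  <p><img class="forecast-icon" title="Today: Sunny, with a high near 60."/></p>
  <p class="temp temp-high">High: 60 °F</p>
  <p class="short-desc">Sunny</p>
</div>
"""
item = BeautifulSoup(html, "html.parser").find(class_="tombstone-container")
record = parse_forecast_item(item)
print(record)
```

Applying the same function to every element of a list of forecast items would give a list of dicts, one per period.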


In [28]:
# Extract all period names
period_tags = seven_day.select('.tombstone-container .period-name')
print(period_tags)
periods = [pt.get_text() for pt in period_tags]
periods

[<p class="period-name">Today</p>, <p class="period-name">Tonight</p>, <p class="period-name">Thanksgiving Day</p>, <p class="period-name">Thursday Night</p>, <p class="period-name">Friday</p>, <p class="period-name">Friday Night</p>, <p class="period-name">Saturday</p>, <p class="period-name">Saturday Night</p>, <p class="period-name">Sunday</p>]


['Today',
 'Tonight',
 'Thanksgiving Day',
 'Thursday Night',
 'Friday',
 'Friday Night',
 'Saturday',
 'Saturday Night',
 'Sunday']

In [None]:
# Create our variables short_descs, temps, descs
short_descs = [sd.get_text() for sd in seven_day.select('.tombstone-container .short-desc')]
# In a CSS selector a leading . selects by class (a leading # would select by id)
temps = [t.get_text() for t in seven_day.select('.tombstone-container .temp')]
descs = [d['title'] for d in seven_day.select('.tombstone-container img')]
# img has no prefix because it is a tag name, not a class; the space means "nested inside tombstone-container"

print(short_descs)
print(temps)
print(descs)

['Sunny', 'Clear', 'Sunny', 'Partly Cloudy', 'Sunny', 'Partly Cloudy', 'Mostly Sunny', 'Partly Cloudy', 'Sunny']
['High: 60 °F', 'Low: 43 °F', 'High: 60 °F', 'Low: 44 °F', 'High: 61 °F', 'Low: 43 °F', 'High: 60 °F', 'Low: 46 °F', 'High: 62 °F']
['Today: Sunny, with a high near 60. North northwest wind around 6 mph. ', 'Tonight: Clear, with a low around 43. Light northeast wind. ', 'Thanksgiving Day: Sunny, with a high near 60. Northeast wind 6 to 8 mph. ', 'Thursday Night: Partly cloudy, with a low around 44. North northeast wind around 5 mph. ', 'Friday: Sunny, with a high near 61. Northeast wind around 7 mph. ', 'Friday Night: Partly cloudy, with a low around 43.', 'Saturday: Mostly sunny, with a high near 60.', 'Saturday Night: Partly cloudy, with a low around 46.', 'Sunday: Sunny, with a high near 62.']
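
The selector strings passed to `select()` follow CSS syntax. The sketch below, on a hand-written snippet with invented content, illustrates the three prefixes (`#` for id, `.` for class, none for a tag name) and the descendant relationship expressed by a space.

```python
from bs4 import BeautifulSoup

html = """
<div id="seven-day-forecast">
  <div class="tombstone-container">
    <p class="period-name">Today</p>
    <p><img title="Today: Sunny"/></p>
  </div>
  <div class="tombstone-container">
    <p class="period-name">Tonight</p>
    <p><img title="Tonight: Clear"/></p>
  </div>
</div>
<p class="period-name">Outside the forecast</p>
"""
demo = BeautifulSoup(html, "html.parser")

# '#name' matches an id, '.name' a class, a bare word a tag name;
# a space means "descendant of"
inside = demo.select("#seven-day-forecast .tombstone-container .period-name")
everywhere = demo.select(".period-name")
titles = [img["title"] for img in demo.select(".tombstone-container img")]

print([tag.get_text() for tag in inside])
print(len(everywhere))
print(titles)
```

Scoping the selector to the container avoids picking up elements elsewhere on the page that happen to share a class name.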


In [31]:
# Step 7: Format the extracted data into a pandas DataFrame

import pandas as pd

weather = pd.DataFrame({'Period':periods,
                        'Short Descriptions':short_descs,
                        'Temperature':temps,
                        'Descriptions':descs
                        })

weather

| | Period | Short Descriptions | Temperature | Descriptions |
|---|---|---|---|---|
| 0 | Today | Sunny | High: 60 °F | Today: Sunny, with a high near 60. North north... |
| 1 | Tonight | Clear | Low: 43 °F | Tonight: Clear, with a low around 43. Light no... |
| 2 | Thanksgiving Day | Sunny | High: 60 °F | Thanksgiving Day: Sunny, with a high near 60. ... |
| 3 | Thursday Night | Partly Cloudy | Low: 44 °F | Thursday Night: Partly cloudy, with a low arou... |
| 4 | Friday | Sunny | High: 61 °F | Friday: Sunny, with a high near 61. Northeast ... |
| 5 | Friday Night | Partly Cloudy | Low: 43 °F | Friday Night: Partly cloudy, with a low around... |
| 6 | Saturday | Mostly Sunny | High: 60 °F | Saturday: Mostly sunny, with a high near 60. |
| 7 | Saturday Night | Partly Cloudy | Low: 46 °F | Saturday Night: Partly cloudy, with a low arou... |
| 8 | Sunday | Sunny | High: 62 °F | Sunday: Sunny, with a high near 62. |
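
The Temperature column holds strings such as `High: 60 °F`, which cannot be sorted or averaged as numbers. One possible follow-up, sketched here with the standard `re` module, is to split each string into a label and an integer. `parse_temp` and `TEMP_PATTERN` are hypothetical names, and the pattern assumes every value looks like the ones shown above.

```python
import re

# Assumed shape: "High" or "Low", a colon, optional spaces, then an integer
TEMP_PATTERN = re.compile(r"(High|Low):\s*(-?\d+)")


def parse_temp(text):
    """Split a string such as 'High: 60 °F' into ('High', 60).

    Returns (None, None) if the text does not match the assumed pattern.
    """
    match = TEMP_PATTERN.search(text)
    if match is None:
        return None, None
    return match.group(1), int(match.group(2))


temps = ["High: 60 °F", "Low: 43 °F", "High: 61 °F"]
parsed = [parse_temp(t) for t in temps]
print(parsed)
```

The resulting pairs could then be stored as two extra columns next to the original strings.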
