# Web Scraping

A basic tutorial that uses BeautifulSoup to extract forecast data for a U.S. city from the weather.gov website.  
Based on <a href="https://www.dataquest.io/blog/web-scraping-python-using-beautiful-soup/" target="_blank">this tutorial</a>.

If you are not confident with HTML, you can review the basics <a href="https://docs.google.com/presentation/d/1GonNbQS5eZUZIHmoM9GuGbDb-F1oOx6l8BMYGRYBbFQ/edit?usp=sharing" target="_blank">here</a>.

You will also need <a href="https://docs.mongodb.com/manual/installation/" target="_blank">MongoDB</a> to persist the scraped data. Optionally, you can use <a href="https://robomongo.org/" target="_blank">Robo3T</a> as a UI client to access MongoDB.

In [1]:
# Importing required libraries

import requests
from urllib.request import urlopen

from bs4 import BeautifulSoup

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

from pymongo import MongoClient

In [2]:
# Defining the website URL and query parameters for the analysis
SITE_URL = "https://forecast.weather.gov/"
PAGE_URL = "{site_url}MapClick.php?lat={lat}&lon={lon}"  # SITE_URL already ends with "/"
LAT = 40.7146
LON = -74.0071
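
The query string above is assembled by hand with `str.format`. As a minimal sketch, the same URL could be built with the standard library's `urllib.parse`, which also escapes the parameter values. The helper name `build_forecast_url` is hypothetical and not part of the original notebook.

```python
from urllib.parse import urlencode, urljoin

SITE_URL = "https://forecast.weather.gov/"


def build_forecast_url(lat, lon, site_url=SITE_URL):
    """Return the MapClick.php URL for a latitude/longitude pair."""
    # urljoin avoids a doubled or missing "/" between the host and the path
    base = urljoin(site_url, "MapClick.php")
    # urlencode escapes the parameter values and joins them with "&"
    return base + "?" + urlencode({"lat": lat, "lon": lon})


print(build_forecast_url(40.7146, -74.0071))
```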

In [3]:
# Creating a connection to MongoDB (assumes a server running locally on the default port)
client = MongoClient("localhost", 27017)
db = client["weather"]
collection = db["forecast"]

In [None]:
# Downloading and storing in-memory the HTML returned by the weather server
page = requests.get(PAGE_URL.format(site_url = SITE_URL, lat = LAT, lon = LON))
print(page)  # <Response [200]> means the request succeeded

In [None]:
# HTML content is passed to BeautifulSoup for scraping analysis
soup = BeautifulSoup(page.content, "html.parser")

In [None]:
# Finding by id the tag containing the forecasts
seven_day = soup.find(id = "seven-day-forecast")

In [None]:
# Tags classed with `tombstone-container` contain the different forecast data points
forecast_items = seven_day.find_all(class_ = "tombstone-container")
print(len(forecast_items)) # 9 forecast data points found

In [None]:
# Printing the HTML content for today's forecast
tonight = forecast_items[0]
print(tonight.prettify())

In [None]:
# Extracting info from the HTML content for today's forecast

period = tonight.find(class_ = "period-name").get_text()
print(period)

short_desc = tonight.find(class_ = "short-desc").get_text()
print(short_desc)

temp = tonight.find(class_ = "temp").get_text()
print(temp)
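
`find` returns `None` when no tag matches, so chaining `.get_text()` fails with an `AttributeError` if the page layout changes. A minimal sketch of a more defensive variant follows. The HTML fragment is a hand-written stand-in for one forecast item, not real weather.gov output, and `text_or_none` is a hypothetical helper.

```python
from bs4 import BeautifulSoup

# Hand-written sample that mimics the structure of one forecast item
SAMPLE_HTML = """
<div class="tombstone-container">
  <p class="period-name">Tonight</p>
  <p class="short-desc">Mostly Clear</p>
  <p class="temp temp-low">Low: 49 °F</p>
</div>
"""


def text_or_none(tag, class_name):
    """Return the stripped text of the first descendant with the class, or None."""
    found = tag.find(class_=class_name)
    return found.get_text(strip=True) if found is not None else None


item = BeautifulSoup(SAMPLE_HTML, "html.parser").find(class_="tombstone-container")
print(text_or_none(item, "period-name"))
print(text_or_none(item, "temp"))
print(text_or_none(item, "wind"))  # no such class in the sample
```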

In [None]:
# Accessing the img tag directly by name
img = tonight.find("img")

In [None]:
# Extracting and showing a static resource, the image best representing the forecast
f = urlopen(SITE_URL + img["src"])
a = plt.imread(f)
plt.imshow(a)
plt.show()

In [None]:
# Extracting additional metadata from image
desc = img["title"]
print(desc)

In [None]:
# Reproducing previous extractions for all data points

periods = [pt.get_text() for pt in seven_day.select(".tombstone-container .period-name")]
print("Periods:", periods)

short_descs = [sd.get_text() for sd in seven_day.select(".tombstone-container .short-desc")]
print("Short descriptions:", short_descs)

temps = [t.get_text() for t in seven_day.select(".tombstone-container .temp")]
print("Temperatures:", temps)

descs = [d["title"] for d in seven_day.select(".tombstone-container img")]
print("Descriptions:", descs)

In [None]:
# Transforming extracted data to a tabular format
weather_df = pd.DataFrame({
    "period": periods,
    "short_desc": short_descs,
    "temp": temps,
    "desc": descs
})

In [None]:
# Printing tabular forecast data
weather_df

In [None]:
# Cleaning temperature column
weather_df["temp_num"] = weather_df["temp"].apply(lambda x: x.split(" ")[1]).astype("int")
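
The `split(" ")[1]` approach assumes every value looks like `Low: 49 °F`. As a hedged alternative, a regular expression can pull out the first integer regardless of the surrounding text. The sample strings below are illustrative, and `parse_temp` is a hypothetical helper.

```python
import re


def parse_temp(text):
    """Return the first (possibly negative) integer found in text, or None."""
    match = re.search(r"-?\d+", text)
    return int(match.group()) if match else None


samples = ["Low: 49 °F", "High: 63 °F", "Low: -3 °F", "N/A"]
print([parse_temp(s) for s in samples])
```

With pandas, the same helper could be applied through `weather_df["temp"].apply(parse_temp)`.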

In [None]:
# Printing data, again
weather_df

In [None]:
# What is the mean forecasted temperature?
round(weather_df["temp_num"].mean(), 2)

In [None]:
# Visualizing some relevant information about forecast weather
plt.figure(figsize = (15, 7))
plt.bar(weather_df["period"], weather_df["temp_num"])
plt.title("Forecasted temperature (°F) for the next 4 days")

In [None]:
# Transforming df to a list of dicts, one per row
weather_dict = weather_df.to_dict(orient = "records")
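
For reference, the shape passed to MongoDB later on is a list with one dictionary per forecast period. The sketch below builds that shape from parallel lists using plain Python. The values are made-up placeholders rather than scraped data.

```python
# Placeholder values standing in for the scraped lists
periods = ["Tonight", "Thursday"]
short_descs = ["Mostly Clear", "Sunny"]
temps = ["Low: 49 °F", "High: 63 °F"]

# One dictionary per forecast period, keyed by column name
records = [
    {"period": p, "short_desc": s, "temp": t}
    for p, s, t in zip(periods, short_descs, temps)
]
print(records)
```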

In [None]:
weather_dict

In [None]:
# Storing extracted information for further analysis
collection.insert_many(weather_dict)
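
`insert_many` adds an `_id` field to each dictionary it is given, so running this cell twice with the same list typically fails with a duplicate-key error. One possible workaround, sketched below under the assumption that each run should produce new documents, is to insert fresh copies tagged with the query location and a scrape timestamp. `prepare_documents` and the `lat`, `lon`, and `scraped_at` field names are hypothetical.

```python
from datetime import datetime, timezone


def prepare_documents(records, lat, lon, scraped_at=None):
    """Return fresh copies of records without `_id`, tagged with scrape metadata."""
    scraped_at = scraped_at or datetime.now(timezone.utc)
    docs = []
    for record in records:
        # Copy every field except a stale `_id` left by a previous insert
        doc = {key: value for key, value in record.items() if key != "_id"}
        doc.update({"lat": lat, "lon": lon, "scraped_at": scraped_at})
        docs.append(doc)
    return docs


# Placeholder record standing in for one row of scraped data
records = [{"period": "Tonight", "temp_num": 49, "_id": "stale"}]
docs = prepare_documents(records, 40.7146, -74.0071)
print(docs)
```

The result could then be passed to `collection.insert_many(docs)` in place of the original list.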