## Web Scraping with Python Using Beautiful Soup
#### In this notebook, basic web scraping is performed using the Beautiful Soup library. Seven-day weather forecasts are scraped from the National Weather Service, and the data is then loaded into a pandas DataFrame. The program extracts weather information about downtown San Francisco from this page: https://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168#.Y2kYL-SZNpI.

In [120]:
import requests
from bs4 import BeautifulSoup


1. Download the web page containing the forecast.
2. Create a BeautifulSoup object to parse the page.
3. Find the div with id seven-day-forecast and assign it to seven_day.
4. Inside seven_day, find each individual forecast item.
5. Extract and print the first forecast item.
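
The five steps above can be tried offline before touching the live site. The sketch below runs steps 2 to 5 against a small hand-written HTML string, so no network request is needed. `SAMPLE_HTML` and the `sample_` variable names are assumptions made for illustration; the markup only imitates the structure of the forecast page and is not a copy of it.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that imitates the layout of the forecast page.
SAMPLE_HTML = """
<div id="seven-day-forecast">
  <div class="tombstone-container">
    <p class="period-name">Today</p>
    <p class="short-desc">Sunny</p>
    <p class="temp temp-high">High: 57 °F</p>
  </div>
  <div class="tombstone-container">
    <p class="period-name">Tonight</p>
    <p class="short-desc">Clear</p>
    <p class="temp temp-low">Low: 48 °F</p>
  </div>
</div>
"""

# Step 2: parse the (already available) page content.
sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
# Step 3: locate the forecast block by its id.
sample_seven_day = sample_soup.find(id="seven-day-forecast")
# Step 4: collect every individual forecast item.
sample_items = sample_seven_day.find_all("div", class_="tombstone-container")
# Step 5: inspect the first item.
first_period = sample_items[0].find("p", class_="period-name").get_text()
print(len(sample_items), first_period)
```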


In [121]:
page = requests.get("https://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168")
soup = BeautifulSoup(page.content, 'html.parser')
# print(soup.prettify())

In [122]:
seven_day = soup.find(id="seven-day-forecast")
items = seven_day.find_all('div', class_='tombstone-container')
# items

### Extracting from one tag:
1. The name of the forecast item.
2. A short description of the conditions.
3. The low temperature.
4. The full description of the conditions.

In [123]:
# Get the second forecast item (index 1), which is tonight's forecast
tonight = items[1]
print(tonight.prettify())

<div class="tombstone-container">
 <p class="period-name">
  Tonight
  <br/>
  <br/>
 </p>
 <p>
  <img alt="Tonight: Showers likely and possibly a thunderstorm before 10pm, then a chance of showers and thunderstorms between 10pm and 1am, then a chance of rain after 1am.  Mostly cloudy, with a low around 48. West southwest wind 6 to 8 mph.  Chance of precipitation is 60%. New rainfall amounts of less than a tenth of an inch, except higher amounts possible in thunderstorms. " class="forecast-icon" src="newimages/medium/nshra60.png" title="Tonight: Showers likely and possibly a thunderstorm before 10pm, then a chance of showers and thunderstorms between 10pm and 1am, then a chance of rain after 1am.  Mostly cloudy, with a low around 48. West southwest wind 6 to 8 mph.  Chance of precipitation is 60%. New rainfall amounts of less than a tenth of an inch, except higher amounts possible in thunderstorms. "/>
 </p>
 <p class="short-desc">
  Showers
  <br/>
  Likely
 </p>
 <p class="temp temp-

In [124]:
period_name = tonight.find('p', class_="period-name").get_text()
temp_low = tonight.find('p', class_="temp temp-low").get_text()
short_desc = tonight.find('p', class_="short-desc").get_text()
print(period_name)
print(temp_low)
print(short_desc)

Tonight
Low: 48 °F
ShowersLikely
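
The last line of the output reads `ShowersLikely` because the `<br/>` between the two words contributes no text of its own, so `get_text()` joins the surrounding strings directly. One possible fix, sketched here on a hypothetical one-line snippet rather than on the live page, is to pass the `separator` and `strip` arguments to `get_text`.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet shaped like the short-desc paragraph shown earlier.
snippet = BeautifulSoup('<p class="short-desc">Showers<br/>Likely</p>', "html.parser")

joined = snippet.p.get_text()                           # text nodes concatenated directly
spaced = snippet.p.get_text(separator=" ", strip=True)  # text nodes joined with a space
print(joined)
print(spaced)
```

The same arguments could be used in the list comprehensions later in the notebook, which would turn values such as `WednesdayNight` into `Wednesday Night`.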


The full description of the conditions is stored in the title attribute of an image tag. To extract it, treat the tag returned by Beautiful Soup like a dictionary and pass in the name of the attribute as the key:


In [125]:
image = tonight.find('img')
description = image['title']
description

'Tonight: Showers likely and possibly a thunderstorm before 10pm, then a chance of showers and thunderstorms between 10pm and 1am, then a chance of rain after 1am.  Mostly cloudy, with a low around 48. West southwest wind 6 to 8 mph.  Chance of precipitation is 60%. New rainfall amounts of less than a tenth of an inch, except higher amounts possible in thunderstorms. '
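
Dictionary-style indexing raises a `KeyError` when the attribute is absent. As a minimal sketch on hypothetical markup (one image with a title and one without), the tag's `get` method can be used instead when a missing attribute should not stop the scrape. The `"n/a"` fallback is an arbitrary choice for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one image with a title attribute, one without.
doc = BeautifulSoup(
    '<img class="forecast-icon" title="Tonight: Mostly cloudy"/>'
    '<img class="forecast-icon"/>',
    "html.parser",
)
with_title, without_title = doc.find_all("img")

print(with_title["title"])                # dictionary-style access
print(without_title.get("title", "n/a"))  # .get() returns the fallback instead of raising
```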

### Extracting all the information from the page:
1. Select all items with the class period-name inside an item with the class tombstone-container in seven_day.
2. Use a list comprehension to call the get_text method on each BeautifulSoup object.
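
The selector `.tombstone-container .period-name` is a CSS descendant selector: it matches elements with the class period-name only when they are nested inside an element with the class tombstone-container. The sketch below illustrates the difference from a plain class search on hypothetical markup in which one period-name paragraph sits outside any container.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the last period-name sits outside any tombstone-container.
markup = """
<div id="seven-day-forecast">
  <div class="tombstone-container"><p class="period-name">Today</p></div>
  <div class="tombstone-container"><p class="period-name">Tonight</p></div>
  <p class="period-name">Not a forecast item</p>
</div>
"""
block = BeautifulSoup(markup, "html.parser").find(id="seven-day-forecast")

# Descendant selector: only period-name elements nested in a container match.
nested = [tag.get_text() for tag in block.select(".tombstone-container .period-name")]
# Searching by class alone matches all three paragraphs.
every = [tag.get_text() for tag in block.find_all(class_="period-name")]
print(nested)
print(every)
```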


In [126]:
seven_day = soup.find(id="seven-day-forecast")
all_items = seven_day.find_all(class_='period-name')
all_items

[<p class="period-name">Today<br/><br/></p>,
 <p class="period-name">Tonight<br/><br/></p>,
 <p class="period-name">Wednesday<br/><br/></p>,
 <p class="period-name">Wednesday<br/>Night</p>,
 <p class="period-name">Thursday<br/><br/></p>,
 <p class="period-name">Thursday<br/>Night</p>,
 <p class="period-name">Veterans<br/>Day</p>,
 <p class="period-name">Friday<br/>Night</p>,
 <p class="period-name">Saturday<br/><br/></p>]

In [127]:
periods = [pt.get_text() for pt in all_items]
periods

['Today',
 'Tonight',
 'Wednesday',
 'WednesdayNight',
 'Thursday',
 'ThursdayNight',
 'VeteransDay',
 'FridayNight',
 'Saturday']

In [128]:
period_names = [pn.get_text() for pn in seven_day.select(".tombstone-container .period-name")]
temp_lows = [tl.get_text() for tl in seven_day.select(".tombstone-container .temp")]
short_descs = [sd.get_text() for sd in seven_day.select(".tombstone-container .short-desc")]
print(period_names)
print(temp_lows)
print(short_descs)
descs = [d['title'] for d in seven_day.select(".tombstone-container img")]  # the title attribute is already a string, so get_text() is not needed
descs

['Today', 'Tonight', 'Wednesday', 'WednesdayNight', 'Thursday', 'ThursdayNight', 'VeteransDay', 'FridayNight', 'Saturday']
['High: 57 °F', 'Low: 48 °F', 'High: 59 °F', 'Low: 44 °F', 'High: 58 °F', 'Low: 43 °F', 'High: 59 °F', 'Low: 46 °F', 'High: 60 °F']
['ShowersLikely', 'ShowersLikely', 'Mostly Sunny', 'Mostly Clear', 'Sunny', 'Partly Cloudy', 'Slight ChanceRain', 'Chance Rain', 'Chance Rain']


['Today: Showers likely and possibly a thunderstorm.  Mostly cloudy, with a high near 57. West southwest wind 11 to 14 mph, with gusts as high as 20 mph.  Chance of precipitation is 70%. New rainfall amounts of less than a tenth of an inch, except higher amounts possible in thunderstorms. ',
 'Tonight: Showers likely and possibly a thunderstorm before 10pm, then a chance of showers and thunderstorms between 10pm and 1am, then a chance of rain after 1am.  Mostly cloudy, with a low around 48. West southwest wind 6 to 8 mph.  Chance of precipitation is 60%. New rainfall amounts of less than a tenth of an inch, except higher amounts possible in thunderstorms. ',
 'Wednesday: Mostly sunny, with a high near 59. West wind 5 to 8 mph. ',
 'Wednesday Night: Mostly clear, with a low around 44. North northeast wind 3 to 7 mph. ',
 'Thursday: Sunny, with a high near 58. North northeast wind around 6 mph. ',
 'Thursday Night: Partly cloudy, with a low around 43.',
 'Veterans Day: A 20 percent chanc

### Create a DataFrame with the extracted data

Check that all lists have the same length:

In [129]:
print(len(descs))
print(len(short_descs))
print(len(period_names))
print(len(temp_lows))

9
9
9
9


If the lengths differ, uncomment and adjust the slices below so that all lists line up:

In [130]:
# descs = descs[1:]
# short_descs = short_descs[1:]
# period_names = period_names[1:]
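
Trimming the lists by hand works when they are only misaligned at the start. An alternative, sketched below, is to read every field from each container in turn and store `None` when a field is missing, so that each row stays aligned without slicing. The markup is hypothetical, the helper `text_or_none` is a name introduced for this example, and the idea that a container may lack a temperature paragraph is an assumption rather than something shown in the output above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first container has no temperature paragraph.
markup = """
<div class="tombstone-container">
  <p class="period-name">Notice</p>
  <p class="short-desc">Advisory</p>
</div>
<div class="tombstone-container">
  <p class="period-name">Tonight</p>
  <p class="short-desc">Clear</p>
  <p class="temp temp-low">Low: 48 °F</p>
</div>
"""

def text_or_none(container, selector):
    """Return the text of the first match for selector, or None if absent."""
    tag = container.select_one(selector)
    return tag.get_text(strip=True) if tag is not None else None

containers = BeautifulSoup(markup, "html.parser").select(".tombstone-container")
rows = [
    {
        "Forecasted Period": text_or_none(c, ".period-name"),
        "Short desc": text_or_none(c, ".short-desc"),
        "Temperature": text_or_none(c, ".temp"),
    }
    for c in containers
]
print(rows)
```

A list of dictionaries like `rows` can be passed directly to `pd.DataFrame(rows)`, giving one row per container with missing values left empty.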

In [131]:
import pandas as pd
df = pd.DataFrame.from_dict({
    'Description': descs,
    'Short desc': short_descs,
    'Forecasted Period': period_names,
    'Temperature low': temp_lows,
})
df

|   | Description | Short desc | Forecasted Period | Temperature low |
|---|---|---|---|---|
| 0 | Today: Showers likely and possibly a thunderst... | ShowersLikely | Today | High: 57 °F |
| 1 | Tonight: Showers likely and possibly a thunder... | ShowersLikely | Tonight | Low: 48 °F |
| 2 | Wednesday: Mostly sunny, with a high near 59. ... | Mostly Sunny | Wednesday | High: 59 °F |
| 3 | Wednesday Night: Mostly clear, with a low arou... | Mostly Clear | WednesdayNight | Low: 44 °F |
| 4 | Thursday: Sunny, with a high near 58. North no... | Sunny | Thursday | High: 58 °F |
| 5 | Thursday Night: Partly cloudy, with a low arou... | Partly Cloudy | ThursdayNight | Low: 43 °F |
| 6 | Veterans Day: A 20 percent chance of rain afte... | Slight ChanceRain | VeteransDay | High: 59 °F |
| 7 | Friday Night: A chance of rain, mainly after 1... | Chance Rain | FridayNight | Low: 46 °F |
| 8 | Saturday: A chance of rain. Partly sunny, wit... | Chance Rain | Saturday | High: 60 °F |


Extract the numerical values from the 'Temperature low' column (which holds both highs and lows):

In [135]:
df['Temp numerical'] = df['Temperature low'].str.extract(r'(\d+)', expand=False).astype(int)  # pull out the first run of digits with a regular expression
df

|   | Description | Short desc | Forecasted Period | Temperature low | Temp numerical |
|---|---|---|---|---|---|
| 0 | Today: Showers likely and possibly a thunderst... | ShowersLikely | Today | High: 57 °F | 57 |
| 1 | Tonight: Showers likely and possibly a thunder... | ShowersLikely | Tonight | Low: 48 °F | 48 |
| 2 | Wednesday: Mostly sunny, with a high near 59. ... | Mostly Sunny | Wednesday | High: 59 °F | 59 |
| 3 | Wednesday Night: Mostly clear, with a low arou... | Mostly Clear | WednesdayNight | Low: 44 °F | 44 |
| 4 | Thursday: Sunny, with a high near 58. North no... | Sunny | Thursday | High: 58 °F | 58 |
| 5 | Thursday Night: Partly cloudy, with a low arou... | Partly Cloudy | ThursdayNight | Low: 43 °F | 43 |
| 6 | Veterans Day: A 20 percent chance of rain afte... | Slight ChanceRain | VeteransDay | High: 59 °F | 59 |
| 7 | Friday Night: A chance of rain, mainly after 1... | Chance Rain | FridayNight | Low: 46 °F | 46 |
| 8 | Saturday: A chance of rain. Partly sunny, wit... | Chance Rain | Saturday | High: 60 °F | 60 |
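
The pattern `(\d+)` keeps only the digits and discards the High/Low label. If the label is also of interest, named groups can capture both parts at once. This is a minimal sketch on three sample strings copied from the column above; the group names `kind` and `value` are hypothetical choices for the example.

```python
import pandas as pd

# Sample values copied from the 'Temperature low' column above.
temps = pd.Series(["High: 57 °F", "Low: 48 °F", "High: 59 °F"])

# Two named groups: the label before the colon and the digits after it.
parts = temps.str.extract(r"(?P<kind>High|Low):\s*(?P<value>\d+)")
parts["value"] = parts["value"].astype(int)
print(parts)
```

With more than one group, `str.extract` returns a DataFrame whose columns take the group names.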


In [136]:
# Calculating the mean temperature
df['Temp numerical'].mean()

52.666666666666664
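
This overall mean mixes daytime highs with nighttime lows. As a sketch of one way to separate them, the block below rebuilds the two relevant columns from the values shown above and averages each label on its own. The names `sample`, `kind`, and `means` are introduced for the example only.

```python
import pandas as pd

# Values copied from the 'Temperature low' and 'Temp numerical' columns above.
sample = pd.DataFrame({
    "Temperature low": ["High: 57 °F", "Low: 48 °F", "High: 59 °F", "Low: 44 °F",
                        "High: 58 °F", "Low: 43 °F", "High: 59 °F", "Low: 46 °F",
                        "High: 60 °F"],
    "Temp numerical": [57, 48, 59, 44, 58, 43, 59, 46, 60],
})

# Label each row by the text before the colon, then average per label.
kind = sample["Temperature low"].str.split(":").str[0]
means = sample.groupby(kind)["Temp numerical"].mean()
print(means)
```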

In [140]:
# Check if the observation is night or day
df["is_night"] = df["Forecasted Period"].str.contains("ight")
df

|   | Description | Short desc | Forecasted Period | Temperature low | Temp numerical | is_night |
|---|---|---|---|---|---|---|
| 0 | Today: Showers likely and possibly a thunderst... | ShowersLikely | Today | High: 57 °F | 57 | False |
| 1 | Tonight: Showers likely and possibly a thunder... | ShowersLikely | Tonight | Low: 48 °F | 48 | True |
| 2 | Wednesday: Mostly sunny, with a high near 59. ... | Mostly Sunny | Wednesday | High: 59 °F | 59 | False |
| 3 | Wednesday Night: Mostly clear, with a low arou... | Mostly Clear | WednesdayNight | Low: 44 °F | 44 | True |
| 4 | Thursday: Sunny, with a high near 58. North no... | Sunny | Thursday | High: 58 °F | 58 | False |
| 5 | Thursday Night: Partly cloudy, with a low arou... | Partly Cloudy | ThursdayNight | Low: 43 °F | 43 | True |
| 6 | Veterans Day: A 20 percent chance of rain afte... | Slight ChanceRain | VeteransDay | High: 59 °F | 59 | False |
| 7 | Friday Night: A chance of rain, mainly after 1... | Chance Rain | FridayNight | Low: 46 °F | 46 | True |
| 8 | Saturday: A chance of rain. Partly sunny, wit... | Chance Rain | Saturday | High: 60 °F | 60 | False |
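
Matching the fragment "ight" works for the period names seen here. A slightly more explicit variant, sketched below on a few names copied from the output, matches the substring "night" regardless of letter case, which covers both `Tonight` and `WednesdayNight`.

```python
import pandas as pd

# Period names copied from the output above.
periods = pd.Series(["Today", "Tonight", "Wednesday", "WednesdayNight", "VeteransDay"])

# Match the substring "night" in any letter case instead of the fragment "ight".
is_night = periods.str.contains("night", case=False)
print(is_night.tolist())
```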


In [141]:
# Convert True and False from the 'is_night' column to 1s and 0s
import numpy as np
df["is_night_num"] = np.where(df["is_night"], 1, 0)
df

|   | Description | Short desc | Forecasted Period | Temperature low | Temp numerical | is_night | is_night_num |
|---|---|---|---|---|---|---|---|
| 0 | Today: Showers likely and possibly a thunderst... | ShowersLikely | Today | High: 57 °F | 57 | False | 0 |
| 1 | Tonight: Showers likely and possibly a thunder... | ShowersLikely | Tonight | Low: 48 °F | 48 | True | 1 |
| 2 | Wednesday: Mostly sunny, with a high near 59. ... | Mostly Sunny | Wednesday | High: 59 °F | 59 | False | 0 |
| 3 | Wednesday Night: Mostly clear, with a low arou... | Mostly Clear | WednesdayNight | Low: 44 °F | 44 | True | 1 |
| 4 | Thursday: Sunny, with a high near 58. North no... | Sunny | Thursday | High: 58 °F | 58 | False | 0 |
| 5 | Thursday Night: Partly cloudy, with a low arou... | Partly Cloudy | ThursdayNight | Low: 43 °F | 43 | True | 1 |
| 6 | Veterans Day: A 20 percent chance of rain afte... | Slight ChanceRain | VeteransDay | High: 59 °F | 59 | False | 0 |
| 7 | Friday Night: A chance of rain, mainly after 1... | Chance Rain | FridayNight | Low: 46 °F | 46 | True | 1 |
| 8 | Saturday: A chance of rain. Partly sunny, wit... | Chance Rain | Saturday | High: 60 °F | 60 | False | 0 |
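
The same 1/0 column can also be produced without NumPy, since casting a boolean Series to `int` maps `False` to 0 and `True` to 1. A minimal sketch on a stand-alone Series (the names `flags` and `as_numbers` are hypothetical):

```python
import pandas as pd

flags = pd.Series([False, True, False, True])

# Casting a boolean column to int maps False to 0 and True to 1.
as_numbers = flags.astype(int)
print(as_numbers.tolist())
print(int(as_numbers.sum()))  # number of True values
```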
