"""""
Web scraping is an automated way to retrieve unstructured data from a website and store it in a structured format.

Can you scrape every website?

Heavy scraping can cause a spike in a website's traffic and may even overload its server.

For that reason, not every website allows scraping. To find out what a site permits, look at its robots.txt file.

Append /robots.txt to the site's root URL and you will see which paths the host allows or disallows for automated clients.
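
As a rough sketch, the same check can be done in code with the standard-library urllib.robotparser module. The rules, the example.com URLs, and the "my-scraper" user agent below are made up for illustration. For a real site you would point the parser at its robots.txt with set_url() and read() instead of calling parse() on hand-written lines.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, written inline so the sketch runs offline
sample_rules = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(sample_rules)

# can_fetch(user_agent, url) reports whether the rules permit fetching the URL
allowed = parser.can_fetch("my-scraper", "https://example.com/index.html")
blocked = parser.can_fetch("my-scraper", "https://example.com/private/data.html")
print(allowed, blocked)
```

Checking can_fetch before each request is one simple way to keep a scraper within the limits the site publishes.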

We can search a page for items using CSS selectors. Selectors are how the CSS language lets developers specify which HTML tags to style. Here are some examples:

p a — finds all a tags inside of a p tag.

body p a — finds all a tags inside of a p tag inside of a body tag.

html body — finds all body tags inside of an html tag.

p.outer-text — finds all p tags with a class of outer-text.

p#first — finds all p tags with an id of first.

body p.outer-text — finds any p tags with a class of outer-text inside of a body tag.

^^^Examples
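
The selectors listed above can be tried out on a small page. This is only a sketch: the HTML snippet is invented for illustration, and it assumes the beautifulsoup4 package is installed.

```python
from bs4 import BeautifulSoup

# Made-up HTML with three p tags, two of which carry the class outer-text
html = (
    "<html><body>"
    "<div><p class='outer-text' id='first'><a href='#one'>One</a></p></div>"
    "<p class='outer-text' id='second'>Two</p>"
    "<p class='inner-text'>Three</p>"
    "</body></html>"
)
soup = BeautifulSoup(html, "html.parser")

links_in_p = soup.select("p a")                # a tags inside a p tag
outer = soup.select("p.outer-text")            # p tags with class outer-text
first = soup.select("p#first")                 # p tags with id first
body_outer = soup.select("body p.outer-text")  # same class, inside body

print(len(links_in_p), len(outer), len(first), len(body_outer))
```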

**BeautifulSoup objects support searching a page with CSS selectors through the select method. For example, to find all the p tags in the page that are inside a div:

==>soup.select("div p")

**Use find_all to search for elements by id:

==>soup.find_all(id="first")

**Look for any tag that has the class outer-text:

==>soup.find_all(class_="outer-text")

**We can also combine a tag name with a class or an id in find_all. The example below searches for any p tag that has the class outer-text:

==>soup.find_all('p', class_='outer-text')

**Note that find_all returns a list-like ResultSet, so we have to loop through it or index into it to extract text:

==>soup.find_all('p')[0].get_text()

**If you only want the first instance of a tag, use the find method, which returns a single Tag object (or None if nothing matches):

==>soup.find('p')
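
To see how find_all and find differ in what they return, here is a small sketch on an invented HTML fragment (the markup is an assumption, not taken from a real page):

```python
from bs4 import BeautifulSoup

html = (
    "<div>"
    "<p id='first' class='outer-text'>First</p>"
    "<p class='outer-text'>Second</p>"
    "<p>Third</p>"
    "</div>"
)
soup = BeautifulSoup(html, "html.parser")

# find_all returns every match, so loop over it to get the text of each tag
texts = [p.get_text() for p in soup.find_all("p", class_="outer-text")]

# find returns only the first match, or None when nothing matches
first_p = soup.find("p")
missing = soup.find("table")

print(texts, first_p.get_text(), missing)
```

Because find can return None, it is worth checking the result before calling get_text on it.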
"""

In [1]:
from bs4 import BeautifulSoup, SoupStrainer
import requests
import pandas as pd
import json

In [2]:
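# Download the forecast page for the given latitude/longitude and parse its HTML
page = requests.get("https://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168")
soup = BeautifulSoup(page.content, 'html.parser')
# The forecast sits in the element with id "seven-day-forecast";
# each forecast period is a "tombstone-container" element inside it
seven_day = soup.find(id="seven-day-forecast")
forecast_items = seven_day.find_all(class_="tombstone-container")
tonight = forecast_items[0]
print(tonight.prettify())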

<div class="tombstone-container">
 <p class="period-name">
  Tonight
  <br/>
  <br/>
 </p>
 <p>
  <img alt="Tonight: Increasing clouds, with a low around 59. Breezy, with a west wind 21 to 26 mph decreasing to 13 to 18 mph after midnight. Winds could gust as high as 33 mph. " class="forecast-icon" src="DualImage.php?i=nwind_few&amp;j=nsct" title="Tonight: Increasing clouds, with a low around 59. Breezy, with a west wind 21 to 26 mph decreasing to 13 to 18 mph after midnight. Winds could gust as high as 33 mph. "/>
 </p>
 <p class="short-desc">
  Mostly Clear
  <br/>
  and Breezy
  <br/>
  then Partly
  <br/>
  Cloudy
 </p>
 <p class="temp temp-low">
  Low: 59 °F
 </p>
</div>
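
The cell above assumes the page always contains an element with id "seven-day-forecast". If the layout changes, soup.find returns None and the next line fails. A possible guard is sketched below. The helper name get_forecast_items and the sample HTML are hypothetical, used only to keep the sketch runnable offline.

```python
from bs4 import BeautifulSoup

def get_forecast_items(html):
    # Return the forecast containers, or an empty list if the section is missing
    soup = BeautifulSoup(html, "html.parser")
    seven_day = soup.find(id="seven-day-forecast")
    if seven_day is None:
        return []
    return list(seven_day.find_all(class_="tombstone-container"))

# Invented stand-in for the downloaded page
sample_html = (
    "<div id='seven-day-forecast'>"
    "<div class='tombstone-container'><p class='period-name'>Tonight</p></div>"
    "<div class='tombstone-container'><p class='period-name'>Saturday</p></div>"
    "</div>"
)

items = get_forecast_items(sample_html)
no_items = get_forecast_items("<html><body></body></html>")
print(len(items), len(no_items))
```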


In [3]:

period = tonight.find(class_="period-name").get_text()
short_desc = tonight.find(class_="short-desc").get_text()
temp = tonight.find(class_="temp").get_text()
print(period)
print(short_desc)
print(temp)

Tonight
Mostly Clearand Breezythen PartlyCloudy
Low: 59 °F
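
The short description prints as "Mostly Clearand Breezythen PartlyCloudy" because get_text joins the pieces around each br tag with nothing in between. One way to keep the words apart is to pass a separator, as in this sketch. The HTML is a trimmed copy of the short-desc paragraph shown in the output of cell 2.

```python
from bs4 import BeautifulSoup

html = (
    "<p class='short-desc'>Mostly Clear<br/>and Breezy<br/>then Partly<br/>Cloudy</p>"
)
tag = BeautifulSoup(html, "html.parser").find(class_="short-desc")

# Default behaviour: text fragments are concatenated directly
joined = tag.get_text()

# separator inserts a space between fragments; strip trims each fragment
spaced = tag.get_text(separator=" ", strip=True)

print(joined)
print(spaced)
```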


In [4]:

img = tonight.find("img")
desc = img['title']
print(desc)

Tonight: Increasing clouds, with a low around 59. Breezy, with a west wind 21 to 26 mph decreasing to 13 to 18 mph after midnight. Winds could gust as high as 33 mph. 


In [5]:
# Extract the period name from every forecast item on the page
period_tags = seven_day.select(".tombstone-container .period-name")
periods = [pt.get_text() for pt in period_tags]
print(periods)

['Tonight', 'Saturday', 'SaturdayNight', 'Sunday', 'SundayNight', 'Monday', 'MondayNight', 'Tuesday', 'TuesdayNight']


In [6]:

short_descs = [sd.get_text() for sd in seven_day.select(".tombstone-container .short-desc")]
temps = [t.get_text() for t in seven_day.select(".tombstone-container .temp")]
descs = [d["title"] for d in seven_day.select(".tombstone-container img")]
print(short_descs)

['Mostly Clearand Breezythen PartlyCloudy', 'Partly Sunny', 'Mostly Cloudy', 'Partly Sunny', 'Partly Cloudyand Breezythen PartlyCloudy', 'Mostly Sunnythen Sunnyand Breezy', 'Mostly Clearand Breezythen PartlyCloudy', 'Sunny', 'Partly Cloudy']


In [7]:
print(temps)

['Low: 59 °F', 'High: 73 °F', 'Low: 61 °F', 'High: 75 °F', 'Low: 59 °F', 'High: 72 °F', 'Low: 58 °F', 'High: 70 °F', 'Low: 59 °F']


In [8]:
print(descs)

['Tonight: Increasing clouds, with a low around 59. Breezy, with a west wind 21 to 26 mph decreasing to 13 to 18 mph after midnight. Winds could gust as high as 33 mph. ', 'Saturday: Partly sunny, with a high near 73. Southwest wind 9 to 18 mph, with gusts as high as 23 mph. ', 'Saturday Night: Mostly cloudy, with a low around 61. West wind 8 to 11 mph. ', 'Sunday: Partly sunny, with a high near 75. West wind 11 to 20 mph, with gusts as high as 24 mph. ', 'Sunday Night: Partly cloudy, with a low around 59. Breezy, with a west wind 15 to 22 mph, with gusts as high as 26 mph. ', 'Monday: Mostly sunny, with a high near 72. Breezy. ', 'Monday Night: Mostly clear, with a low around 58. Breezy. ', 'Tuesday: Sunny, with a high near 70.', 'Tuesday Night: Partly cloudy, with a low around 59.']


In [9]:

weather = pd.DataFrame({
    "period": periods,
    "short_desc": short_descs,
    "temp": temps,
    "desc": descs
})

In [10]:
weather

Unnamed: 0,period,short_desc,temp,desc
0,Tonight,Mostly Clearand Breezythen PartlyCloudy,Low: 59 °F,"Tonight: Increasing clouds, with a low around ..."
1,Saturday,Partly Sunny,High: 73 °F,"Saturday: Partly sunny, with a high near 73. S..."
2,SaturdayNight,Mostly Cloudy,Low: 61 °F,"Saturday Night: Mostly cloudy, with a low arou..."
3,Sunday,Partly Sunny,High: 75 °F,"Sunday: Partly sunny, with a high near 75. Wes..."
4,SundayNight,Partly Cloudyand Breezythen PartlyCloudy,Low: 59 °F,"Sunday Night: Partly cloudy, with a low around..."
5,Monday,Mostly Sunnythen Sunnyand Breezy,High: 72 °F,"Monday: Mostly sunny, with a high near 72. Bre..."
6,MondayNight,Mostly Clearand Breezythen PartlyCloudy,Low: 58 °F,"Monday Night: Mostly clear, with a low around ..."
7,Tuesday,Sunny,High: 70 °F,"Tuesday: Sunny, with a high near 70."
8,TuesdayNight,Partly Cloudy,Low: 59 °F,"Tuesday Night: Partly cloudy, with a low aroun..."
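
Building each column with its own select call works as long as every forecast item has all four fields. If one item lacked, say, a temperature, the lists would have different lengths and the DataFrame constructor would raise an error. An alternative, sketched here under the assumption that each item is a tombstone-container, is to build one row per item. The helper text_or_none and the sample HTML are hypothetical.

```python
from bs4 import BeautifulSoup
import pandas as pd

def text_or_none(item, css_class):
    # Text of the first matching child, or None when the child is absent
    tag = item.find(class_=css_class)
    return tag.get_text(separator=" ", strip=True) if tag is not None else None

# Invented sample: the second item has no temperature paragraph
sample_html = (
    "<div id='seven-day-forecast'>"
    "<div class='tombstone-container'>"
    "<p class='period-name'>Tonight</p>"
    "<p class='short-desc'>Mostly Clear</p>"
    "<p class='temp temp-low'>Low: 59 °F</p>"
    "</div>"
    "<div class='tombstone-container'>"
    "<p class='period-name'>Saturday</p>"
    "<p class='short-desc'>Partly Sunny</p>"
    "</div>"
    "</div>"
)
soup = BeautifulSoup(sample_html, "html.parser")

rows = [
    {
        "period": text_or_none(item, "period-name"),
        "short_desc": text_or_none(item, "short-desc"),
        "temp": text_or_none(item, "temp"),
    }
    for item in soup.select("#seven-day-forecast .tombstone-container")
]
weather_rows = pd.DataFrame(rows)
print(weather_rows)
```

Each row stays aligned with its own forecast item, and a missing field shows up as an empty value instead of shifting later rows.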


In [11]:
# A raw string keeps \d from being read as a string escape sequence
temp_nums = weather["temp"].str.extract(r"(?P<temp_num>\d+)", expand=False)
weather["temp_num"] = temp_nums.astype('int')
print(temp_nums)


0    59
1    73
2    61
3    75
4    59
5    72
6    58
7    70
8    59
Name: temp_num, dtype: object
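
The printed Series still has dtype object because str.extract returns strings. Only the column stored through astype('int') is numeric. If some entries might contain no digits at all, astype('int') would fail on the missing value. A more forgiving option is pd.to_numeric with errors="coerce", sketched below on a made-up Series (the "N/A" entry is an assumption added to show the missing case).

```python
import pandas as pd

temps = pd.Series(["Low: 59 °F", "High: 73 °F", "N/A"])

# extract returns the matched digits as strings, and NaN where nothing matches
raw = temps.str.extract(r"(?P<temp_num>\d+)", expand=False)

# to_numeric converts the strings to numbers and keeps NaN for the gaps
nums = pd.to_numeric(raw, errors="coerce")

print(nums)
```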


In [12]:
print(weather["temp_num"].mean())

65.11111111111111


In [13]:
is_night = weather["temp"].str.contains("Low")
weather["is_night"] = is_night
print(is_night)


0     True
1    False
2     True
3    False
4     True
5    False
6     True
7    False
8     True
Name: temp, dtype: bool


In [14]:
print(weather["is_night"])

0     True
1    False
2     True
3    False
4     True
5    False
6     True
7    False
8     True
Name: is_night, dtype: bool
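
With the is_night flag in place, the average temperature can be split into night and day. The sketch below rebuilds just the two relevant columns from the values printed above, so it runs on its own. In the notebook the same groupby call could be applied to weather directly.

```python
import pandas as pd

# Values copied from the temp_num and is_night outputs above
sample = pd.DataFrame({
    "temp_num": [59, 73, 61, 75, 59, 72, 58, 70, 59],
    "is_night": [True, False, True, False, True, False, True, False, True],
})

# Mean temperature for night rows (True) and day rows (False)
means = sample.groupby("is_night")["temp_num"].mean()

# Boolean indexing keeps only the night rows
night_rows = sample[sample["is_night"]]

print(means)
print(len(night_rows))
```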


In [15]:
file_name = 'WeatherData.xlsx'

# Save the DataFrame to an Excel file (pandas needs an Excel writer such as openpyxl installed)
weather.to_excel(file_name)
print('DataFrame is written to Excel File successfully.')

DataFrame is written to Excel File successfully.
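
By default the row index is written out as an extra first column. Passing index=False, which both to_excel and to_csv accept, leaves it out. The sketch below uses to_csv and a temporary directory so it runs without an Excel writer installed. The two-row DataFrame and the file name WeatherData.csv are stand-ins for illustration.

```python
import os
import tempfile

import pandas as pd

sample = pd.DataFrame({
    "period": ["Tonight", "Saturday"],
    "temp_num": [59, 73],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "WeatherData.csv")
    # index=False skips the row index, so no extra unnamed column is written
    sample.to_csv(csv_path, index=False)
    loaded = pd.read_csv(csv_path)

print(list(loaded.columns))
```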
