# A tutorial on scraping the web

This example scrapes the BBC Weather website for a given city, collects the weather forecast for the next 14 days, and saves it as a CSV file.

*Web scraping is not always legal. Check the terms of use of any website you plan to scrape before proceeding. Also, if your code requests a URL from a server multiple times, it is good practice to either cache the responses or insert a timed delay between consecutive requests.*
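
The caching and delay advice above can be sketched as a small wrapper. This is only an illustration: `PoliteFetcher` and `fake_fetch` are hypothetical names that do not come from any library, and the stand-in download function lets the sketch run without touching a real server.

```python
import time


class PoliteFetcher:
    """Cache responses by URL and pause between real requests (hypothetical helper)."""

    def __init__(self, fetch, delay_seconds=1.0):
        self.fetch = fetch                  # any callable that takes a URL and returns its content
        self.delay_seconds = delay_seconds  # minimum gap between two real requests
        self.cache = {}
        self.last_request_time = None

    def get(self, url):
        # Serve repeated requests from the cache instead of hitting the server again
        if url in self.cache:
            return self.cache[url]
        # Wait if the previous real request was too recent
        if self.last_request_time is not None:
            elapsed = time.monotonic() - self.last_request_time
            if elapsed < self.delay_seconds:
                time.sleep(self.delay_seconds - elapsed)
        content = self.fetch(url)
        self.last_request_time = time.monotonic()
        self.cache[url] = content
        return content


# Stand-in for a real download function, so the sketch runs offline
calls = []

def fake_fetch(url):
    calls.append(url)
    return f"<html>content of {url}</html>"


fetcher = PoliteFetcher(fake_fetch, delay_seconds=0.1)
first = fetcher.get("http://example.com")
second = fetcher.get("http://example.com")        # served from the cache
third = fetcher.get("http://example.com/other")   # a real request, made after the delay
```

In real use, the `fetch` argument could be something like `lambda url: requests.get(url, timeout=10).content`.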


### Beautiful Soup

Beautiful Soup is a Python library that helps you extract information from web pages. Here’s what Beautiful Soup can do for you:

1. **Understanding Web Pages:** Beautiful Soup can read web pages and understand their structure, like figuring out the different parts of a web page such as headings, paragraphs, and links.

2. **Finding Information:** It helps you search through the web page to find specific information. For example, you can look for all the links on a page or find all the paragraphs that have a certain word.

3. **Changing Content:** You can use Beautiful Soup to make changes to the web page content, like adding new parts, removing some parts, or changing text.

4. **Handling Different Languages:** It can handle web pages written in different languages and character encodings, and in most cases converts them to Unicode automatically.

5. **Working with Other Tools:** Beautiful Soup works well with libraries that fetch web pages, such as `requests`, and it can use faster parsers such as `lxml` when they are installed.
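
As a rough illustration of points 2 and 3 in the list above, the sketch below searches a small made-up HTML snippet and then edits it. The snippet and the variable names are assumptions chosen for the example.

```python
from bs4 import BeautifulSoup

# A made-up page, used only for illustration
html = """
<html><body>
  <h1>Weather notes</h1>
  <p>Rain is expected today.</p>
  <p>Tomorrow will be sunny.</p>
  <a href="http://example.com">More</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Finding information: paragraphs that contain a given word
rainy = [p.get_text() for p in soup.find_all("p") if "Rain" in p.get_text()]

# Changing content: edit text, remove a tag, add a new tag
soup.h1.string = "Updated weather notes"
soup.a.decompose()
new_paragraph = soup.new_tag("p")
new_paragraph.string = "Added by Beautiful Soup."
soup.body.append(new_paragraph)

paragraph_count = len(soup.find_all("p"))
```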

### Simple Example

Here’s a basic example of how Beautiful Soup works. Imagine you want to get all the links from a web page:

```python
from bs4 import BeautifulSoup
import requests

# Get the web page content
url = 'http://example.com'
response = requests.get(url)
web_page_content = response.content

# Read the content with Beautiful Soup
soup = BeautifulSoup(web_page_content, 'html.parser')

# Find all the links on the page
links = soup.find_all('a')

# Print the URLs of the links
for link in links:
    print(link.get('href'))
```

In simpler terms:
- **requests.get(url)**: This line gets the content of a web page.
- **BeautifulSoup(web_page_content, 'html.parser')**: This line reads the web page content and understands its structure.
- **soup.find_all('a')**: This line finds all the links (the `<a>` tags) on the web page.
- **link.get('href')**: This line gets the URL (the web address) from each link.
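
On real pages, `link.get('href')` often returns a relative address, or `None` when an `<a>` tag has no `href` attribute. A minimal sketch of one way to handle both cases uses `urljoin` from the standard library. The HTML snippet and `base_url` are made up for the example.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base_url = "http://example.com/weather/"   # assumed address of the page being parsed
html = """
<a href="/about">About</a>
<a href="today.html">Today</a>
<a href="http://example.org/other">Other site</a>
<a name="top">Anchor without href</a>
"""

soup = BeautifulSoup(html, "html.parser")

absolute_links = []
for link in soup.find_all("a"):
    href = link.get("href")        # None when the tag has no href attribute
    if href is None:
        continue
    # Resolve relative addresses against the page address
    absolute_links.append(urljoin(base_url, href))

print(absolute_links)
```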

### Parsing

Parsing is like breaking down a complex thing into simpler parts to understand it better. In the context of web scraping and Beautiful Soup, parsing means taking the HTML or XML code of a web page and breaking it down into a structure that is easier to work with. This structure allows you to find and extract the information you need from the web page.

For example, if you have a web page that looks like this:

```html
<html>
  <head><title>My Web Page</title></head>
  <body>
    <h1>Welcome to my web page</h1>
    <a href="http://example.com">Click here</a>
  </body>
</html>
```

Parsing this web page would mean breaking it down to understand that there is a title, a heading, and a link, making it easy to find and work with these parts separately.
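
To make this concrete, here is a short sketch that parses the page shown above and reads each part separately. Only the variable names are new.

```python
from bs4 import BeautifulSoup

html = """
<html>
  <head><title>My Web Page</title></head>
  <body>
    <h1>Welcome to my web page</h1>
    <a href="http://example.com">Click here</a>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

title_text = soup.title.string     # text inside <title>
heading_text = soup.h1.string      # text inside <h1>
link_target = soup.a["href"]       # value of the href attribute
link_text = soup.a.get_text()      # visible text of the link

print(title_text, heading_text, link_target, link_text, sep=" | ")
```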

In [1]:
import json                   # JSON utilities (the API response is decoded below with requests' .json())

from urllib.parse import urlencode

import requests               # to get the webpage
from bs4 import BeautifulSoup # to parse the webpage

import pandas as pd
import re                     # regular expression operators

from datetime import datetime

First, we build the URL for the BBC location lookup service, which maps a city name to the location ID used by the weather pages.

In [27]:
required_city = "Chennai"
location_url = 'https://locator-service.api.bbci.co.uk/locations?' + urlencode({
   'api_key': 'AGbFAKx58hyjQScCXIYrxuEwJh2W2cmv',
   's': required_city,
   'stack': 'aws',
   'locale': 'en',
   'filter': 'international',
   'place-types': 'settlement,airport,district',
   'order': 'importance',
   'a': 'true',
   'format': 'json'
})
location_url

'https://locator-service.api.bbci.co.uk/locations?api_key=AGbFAKx58hyjQScCXIYrxuEwJh2W2cmv&s=Chennai&stack=aws&locale=en&filter=international&place-types=settlement%2Cairport%2Cdistrict&order=importance&a=true&format=json'
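
The `%2C` in the URL above comes from `urlencode`, which escapes characters that are not safe in a query string. The sketch below shows the same effect on a smaller dictionary. The host `example.com` and the value `New Delhi` are placeholders, not part of the original request.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

params = {
    "s": "New Delhi",                               # a space becomes '+'
    "place-types": "settlement,airport,district",   # commas become '%2C'
    "format": "json",
}

query = urlencode(params)
url = "https://example.com/locations?" + query      # example.com is a placeholder host

# Decoding the query string gives back the original values
decoded = parse_qs(urlsplit(url).query)

print(query)
```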

In [28]:
result = requests.get(location_url).json()
result

{'response': {'results': {'results': [{'id': '1264527',
     'name': 'Chennai',
     'container': 'India',
     'containerId': 1269750,
     'language': 'en',
     'timezone': 'Asia/Kolkata',
     'country': 'IN',
     'latitude': 13.08784,
     'longitude': 80.27847,
     'placeType': 'settlement'},
    {'id': '6301127',
     'name': 'Chennai International Airport',
     'container': 'India',
     'containerId': 1269750,
     'language': 'en',
     'timezone': 'Asia/Kolkata',
     'country': 'IN',
     'latitude': 12.98833,
     'longitude': 80.16578,
     'placeType': 'airport'}],
   'totalResults': 2}}}
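
The response can contain several matches (here a settlement and an airport), so taking the first entry relies on the ordering the service returns. One possible way to be explicit is to select by `placeType`. `pick_location_id` is a hypothetical helper written for this sketch, and the dictionary is a trimmed copy of the output above.

```python
# Trimmed copy of the response shown above
result = {
    "response": {"results": {"results": [
        {"id": "1264527", "name": "Chennai", "placeType": "settlement"},
        {"id": "6301127", "name": "Chennai International Airport", "placeType": "airport"},
    ], "totalResults": 2}}
}


def pick_location_id(result, place_type="settlement"):
    """Return the id of the first match with the given place type (hypothetical helper)."""
    matches = result["response"]["results"]["results"]
    if not matches:
        raise ValueError("no locations found")
    for match in matches:
        if match.get("placeType") == place_type:
            return match["id"]
    # Fall back to the first match if no entry has the requested type
    return matches[0]["id"]


city_id = pick_location_id(result)
airport_id = pick_location_id(result, place_type="airport")
```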

In [29]:
# url      = 'https://www.bbc.com/weather/1275339' # url to BBC weather, corresponding to a specific city (Mumbai, in this example)
url      = 'https://www.bbc.com/weather/'+result['response']['results']['results'][0]['id']
response = requests.get(url)

Next, we create a BeautifulSoup instance from the response.

In [30]:
soup = BeautifulSoup(response.content,'html.parser')

The information we want (daily high and low temperatures, and the daily weather summary) sits in specific blocks on the webpage.
We need to find the tag type, the type of identifier, and the identifier's name. All of these can be found by right-clicking
on the webpage and selecting 'Inspect' in the Chrome browser; other browsers work in a similar way.
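
Once the tag and class name are known, there are several equivalent ways to ask Beautiful Soup for them. The sketch below runs on a single block copied from the page markup shown in the output further down. It is an illustration of the lookup styles, not the only way to do it.

```python
from bs4 import BeautifulSoup

# One block of the page markup, as it appears in the output of the next cell
html = (
    '<span class="wr-day-temperature__high-value">'
    '<span class="wr-value--temperature">'
    '<span class="wr-value--temperature--c">35°</span>'
    '<span class="wr-hide"> </span>'
    '<span class="wr-value--temperature--f">96°</span>'
    '</span></span>'
)
soup = BeautifulSoup(html, "html.parser")

# Tag name plus an attribute dictionary
by_attrs = soup.find_all("span", attrs={"class": "wr-day-temperature__high-value"})

# Tag name plus the class_ keyword
by_keyword = soup.find_all("span", class_="wr-day-temperature__high-value")

# A CSS selector that goes straight to the Celsius value inside the block
celsius = [
    tag.get_text()
    for tag in soup.select(
        "span.wr-day-temperature__high-value span.wr-value--temperature--c"
    )
]
```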

In [31]:
daily_high_values = soup.find_all('span', attrs={'class': 'wr-day-temperature__high-value'}) # block-type: span; identifier type: class; and class name: wr-day-temperature__high-value
daily_high_values

[<span class="wr-day-temperature__high-value"><span class="wr-value--temperature"><span class="wr-value--temperature--c">35°</span><span class="wr-hide"> </span><span class="wr-value--temperature--f">96°</span></span></span>,
 <span class="wr-day-temperature__high-value"><span class="wr-value--temperature"><span class="wr-value--temperature--c">35°</span><span class="wr-hide"> </span><span class="wr-value--temperature--f">95°</span></span></span>,
 <span class="wr-day-temperature__high-value"><span class="wr-value--temperature"><span class="wr-value--temperature--c">35°</span><span class="wr-hide"> </span><span class="wr-value--temperature--f">95°</span></span></span>,
 <span class="wr-day-temperature__high-value"><span class="wr-value--temperature"><span class="wr-value--temperature--c">35°</span><span class="wr-hide"> </span><span class="wr-value--temperature--f">94°</span></span></span>,
 <span class="wr-day-temperature__high-value"><span class="wr-value--temperature"><span class="w

In [32]:
daily_low_values  = soup.find_all('span', attrs={'class': 'wr-day-temperature__low-value'})
daily_low_values

[<span class="wr-day-temperature__low-value"><span class="wr-value--temperature"><span class="wr-value--temperature--c">26°</span><span class="wr-hide"> </span><span class="wr-value--temperature--f">78°</span></span></span>,
 <span class="wr-day-temperature__low-value"><span class="wr-value--temperature"><span class="wr-value--temperature--c">26°</span><span class="wr-hide"> </span><span class="wr-value--temperature--f">79°</span></span></span>,
 <span class="wr-day-temperature__low-value"><span class="wr-value--temperature"><span class="wr-value--temperature--c">26°</span><span class="wr-hide"> </span><span class="wr-value--temperature--f">79°</span></span></span>,
 <span class="wr-day-temperature__low-value"><span class="wr-value--temperature"><span class="wr-value--temperature--c">27°</span><span class="wr-hide"> </span><span class="wr-value--temperature--f">80°</span></span></span>,
 <span class="wr-day-temperature__low-value"><span class="wr-value--temperature"><span class="wr-val

In [33]:
daily_summary = soup.find('div', attrs={'class': 'wr-day-summary'})
daily_summary

<div class="wr-day-summary"><div class="gel-wrap"><span class="">Thundery showers and a gentle breeze</span><span class="wr-hide">Thundery showers and a gentle breeze</span><span class="wr-hide">Thundery showers and a gentle breeze</span><span class="wr-hide">Thundery showers and a gentle breeze</span><span class="wr-hide">Thundery showers and a moderate breeze</span><span class="wr-hide">Thundery showers and a moderate breeze</span><span class="wr-hide">Thundery showers and a moderate breeze</span><span class="wr-hide">Sunny intervals and a moderate breeze</span><span class="wr-hide">Thundery showers and a gentle breeze</span><span class="wr-hide">Light rain and a gentle breeze</span><span class="wr-hide">Drizzle and a gentle breeze</span><span class="wr-hide">Thundery showers and a gentle breeze</span><span class="wr-hide">Thundery showers and a gentle breeze</span><span class="wr-hide">Thundery showers and a gentle breeze</span></div></div>

In [34]:
daily_summary.text

'Thundery showers and a gentle breezeThundery showers and a gentle breezeThundery showers and a gentle breezeThundery showers and a gentle breezeThundery showers and a moderate breezeThundery showers and a moderate breezeThundery showers and a moderate breezeSunny intervals and a moderate breezeThundery showers and a gentle breezeLight rain and a gentle breezeDrizzle and a gentle breezeThundery showers and a gentle breezeThundery showers and a gentle breezeThundery showers and a gentle breeze'

General bookkeeping.

With the code in the cells above, we get forecast data for 14 days, including today. We will now post-process the data: first extract the required text and discard the HTML wrapper, then combine all the variables into one list, and finally convert it into a pandas DataFrame.

In [35]:
daily_high_values[0].text.strip()

'35° 96°'

In [36]:
daily_high_values[5].text.strip()

'34° 93°'

In [37]:
daily_high_values[0].text.strip().split()[0]

'35°'

In [38]:
daily_high_values_list = [tag.text.strip().split()[0] for tag in daily_high_values]  # keep only the Celsius value
daily_high_values_list

['35°',
 '35°',
 '35°',
 '35°',
 '34°',
 '34°',
 '35°',
 '35°',
 '35°',
 '35°',
 '33°',
 '33°',
 '34°']

In [39]:
daily_low_values_list = [tag.text.strip().split()[0] for tag in daily_low_values]  # keep only the Celsius value
daily_low_values_list

['26°',
 '26°',
 '26°',
 '27°',
 '27°',
 '27°',
 '27°',
 '26°',
 '26°',
 '27°',
 '27°',
 '26°',
 '26°',
 '25°']

In [40]:
daily_summary.text

'Thundery showers and a gentle breezeThundery showers and a gentle breezeThundery showers and a gentle breezeThundery showers and a gentle breezeThundery showers and a moderate breezeThundery showers and a moderate breezeThundery showers and a moderate breezeSunny intervals and a moderate breezeThundery showers and a gentle breezeLight rain and a gentle breezeDrizzle and a gentle breezeThundery showers and a gentle breezeThundery showers and a gentle breezeThundery showers and a gentle breeze'

In [41]:
daily_summary_list = re.findall(r'[a-zA-Z][^A-Z]*', daily_summary.text)  # split the string at each uppercase letter
daily_summary_list

['Thundery showers and a gentle breeze',
 'Thundery showers and a gentle breeze',
 'Thundery showers and a gentle breeze',
 'Thundery showers and a gentle breeze',
 'Thundery showers and a moderate breeze',
 'Thundery showers and a moderate breeze',
 'Thundery showers and a moderate breeze',
 'Sunny intervals and a moderate breeze',
 'Thundery showers and a gentle breeze',
 'Light rain and a gentle breeze',
 'Drizzle and a gentle breeze',
 'Thundery showers and a gentle breeze',
 'Thundery showers and a gentle breeze',
 'Thundery showers and a gentle breeze']
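
Splitting at uppercase letters works here because each summary has exactly one capital letter, at its start. A summary with a capital letter in the middle would be cut in two. As a possible alternative, the sketch below reads each `<span>` separately. The HTML is a shortened copy of the `wr-day-summary` block shown earlier.

```python
from bs4 import BeautifulSoup

# Shortened copy of the wr-day-summary block shown earlier
html = (
    '<div class="wr-day-summary"><div class="gel-wrap">'
    '<span class="">Thundery showers and a gentle breeze</span>'
    '<span class="wr-hide">Light rain and a gentle breeze</span>'
    '<span class="wr-hide">Drizzle and a gentle breeze</span>'
    '</div></div>'
)
soup = BeautifulSoup(html, "html.parser")
daily_summary = soup.find("div", attrs={"class": "wr-day-summary"})

# One list entry per <span>, with no need to guess where a summary ends
summaries = [span.get_text() for span in daily_summary.find_all("span")]
```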

In [42]:
datelist = pd.date_range(datetime.today(), periods=len(daily_high_values)).tolist()
datelist

[Timestamp('2024-06-21 13:28:45.662231'),
 Timestamp('2024-06-22 13:28:45.662231'),
 Timestamp('2024-06-23 13:28:45.662231'),
 Timestamp('2024-06-24 13:28:45.662231'),
 Timestamp('2024-06-25 13:28:45.662231'),
 Timestamp('2024-06-26 13:28:45.662231'),
 Timestamp('2024-06-27 13:28:45.662231'),
 Timestamp('2024-06-28 13:28:45.662231'),
 Timestamp('2024-06-29 13:28:45.662231'),
 Timestamp('2024-06-30 13:28:45.662231'),
 Timestamp('2024-07-01 13:28:45.662231'),
 Timestamp('2024-07-02 13:28:45.662231'),
 Timestamp('2024-07-03 13:28:45.662231')]

In [43]:
datelist = [d.strftime('%y-%m-%d') for d in datelist]  # two-digit year, e.g. '24-06-21'
datelist

['24-06-21',
 '24-06-22',
 '24-06-23',
 '24-06-24',
 '24-06-25',
 '24-06-26',
 '24-06-27',
 '24-06-28',
 '24-06-29',
 '24-06-30',
 '24-07-01',
 '24-07-02',
 '24-07-03']
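
The `%y` directive produces a two-digit year, which is why the dates read `24-06-21`. The sketch below compares it with `%Y` (four-digit year), using a fixed start date so the result is reproducible. The fixed date is an assumption made for the example.

```python
from datetime import datetime
import pandas as pd

start = datetime(2024, 6, 21)             # fixed start date so the output is reproducible
dates = pd.date_range(start, periods=3)

short_year = [d.strftime("%y-%m-%d") for d in dates]   # two-digit year
full_year = dates.strftime("%Y-%m-%d").tolist()        # four-digit year

print(short_year)
print(full_year)
```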

In [44]:
zipped = zip(datelist, daily_high_values_list, daily_low_values_list, daily_summary_list)
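
Note that in the outputs above `daily_high_values_list` has 13 entries while `daily_low_values_list` has 14. `zip` stops at the shortest input, so the extra entry is silently dropped, and the outputs alone do not show which day the missing high value belongs to. The sketch below illustrates the behavior on toy lists and shows `itertools.zip_longest` as one way to make a mismatch visible.

```python
from itertools import zip_longest

# Toy lists with different lengths
highs = ["35°", "34°"]
lows = ["26°", "27°", "25°"]

# zip stops at the shortest input
truncated = list(zip(highs, lows))

# zip_longest pads the shorter input so the mismatch shows up
padded = list(zip_longest(highs, lows, fillvalue=None))

print(truncated)
print(padded)
```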

In [45]:
df = pd.DataFrame(list(zipped), columns=['Date', 'High','Low', 'Summary'])

In [46]:
display(df)

|   | Date | High | Low | Summary |
|---|---|---|---|---|
| 0 | 24-06-21 | 35° | 26° | Thundery showers and a gentle breeze |
| 1 | 24-06-22 | 35° | 26° | Thundery showers and a gentle breeze |
| 2 | 24-06-23 | 35° | 26° | Thundery showers and a gentle breeze |
| 3 | 24-06-24 | 35° | 27° | Thundery showers and a gentle breeze |
| 4 | 24-06-25 | 34° | 27° | Thundery showers and a moderate breeze |
| 5 | 24-06-26 | 34° | 27° | Thundery showers and a moderate breeze |
| 6 | 24-06-27 | 35° | 27° | Thundery showers and a moderate breeze |
| 7 | 24-06-28 | 35° | 26° | Sunny intervals and a moderate breeze |
| 8 | 24-06-29 | 35° | 26° | Thundery showers and a gentle breeze |
| 9 | 24-06-30 | 35° | 27° | Light rain and a gentle breeze |






In [48]:
# remove the degree sign and convert the values to numbers
df.High = df.High.str.replace('°', '', regex=False).astype(float)
df.Low  = df.Low.str.replace('°', '', regex=False).astype(float)
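
The same cleanup can also be written as a loop over the columns. The sketch below is one possible variant on a toy DataFrame: it strips the trailing degree sign with `str.rstrip` and lets `pd.to_numeric` choose the numeric type. The toy values are assumptions for the example.

```python
import pandas as pd

# Toy data in the same shape as the High and Low columns
df = pd.DataFrame({"High": ["35°", "34°"], "Low": ["26°", "27°"]})

for column in ["High", "Low"]:
    # Strip the trailing degree sign, then convert the text to numbers
    df[column] = pd.to_numeric(df[column].str.rstrip("°"))

# Numeric columns can now be used in arithmetic
daily_range = (df["High"] - df["Low"]).tolist()
```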

In [49]:
display(df)

|   | Date | High | Low | Summary |
|---|---|---|---|---|
| 0 | 24-06-21 | 35.0 | 26.0 | Thundery showers and a gentle breeze |
| 1 | 24-06-22 | 35.0 | 26.0 | Thundery showers and a gentle breeze |
| 2 | 24-06-23 | 35.0 | 26.0 | Thundery showers and a gentle breeze |
| 3 | 24-06-24 | 35.0 | 27.0 | Thundery showers and a gentle breeze |
| 4 | 24-06-25 | 34.0 | 27.0 | Thundery showers and a moderate breeze |
| 5 | 24-06-26 | 34.0 | 27.0 | Thundery showers and a moderate breeze |
| 6 | 24-06-27 | 35.0 | 27.0 | Thundery showers and a moderate breeze |
| 7 | 24-06-28 | 35.0 | 26.0 | Sunny intervals and a moderate breeze |
| 8 | 24-06-29 | 35.0 | 26.0 | Thundery showers and a gentle breeze |
| 9 | 24-06-30 | 35.0 | 27.0 | Light rain and a gentle breeze |


Extract the name of the city for which data is gathered.

In [53]:
#location = soup.find('div', attrs={'class':'wr-c-location'})
location = soup.find('h1', attrs={'id':'wr-location-name-id'})
location.text.split()



In [54]:
# save the forecast as a CSV file named after the city
filename_csv = location.text.split()[0]+'.csv'
df.to_csv(filename_csv, index=False)

In [55]:
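# save the same data as an Excel file (needs an Excel writer such as openpyxl installed)
filename_xlsx = location.text.split()[0]+'.xlsx'
df.to_excel(filename_xlsx, index=False)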
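
To check what a saved file will contain, the data can be written and read back. The sketch below does a round trip through an in-memory buffer instead of a file on disk, using a toy DataFrame. It also shows that when the index is written, `read_csv` brings it back as an extra column named `Unnamed: 0`.

```python
import io
import pandas as pd

# Toy data in the same shape as the forecast table
df = pd.DataFrame({
    "Date": ["24-06-21", "24-06-22"],
    "High": [35.0, 35.0],
    "Low": [26.0, 26.0],
    "Summary": ["Thundery showers and a gentle breeze"] * 2,
})

# Round trip without the index
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# Round trip with the index written as the first column
buffer_with_index = io.StringIO()
df.to_csv(buffer_with_index)
buffer_with_index.seek(0)
restored_with_index = pd.read_csv(buffer_with_index)

print(list(restored.columns))
print(list(restored_with_index.columns))
```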