# **Data Collection Demo**

This Jupyter Notebook illustrates 

(1) **collecting data via web scaping**: how we can use [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/), a Python package for parsing HTML and XML documents, to parse websites and extract the data of interest.

(2) **collecting data via using a website's API** how we can use the API of MetOffice to get a three-hourly five-day forecast for a location of interest

**Important Note: This demo is only provided for illustration purposes.** There may legal and/or ethical consideration of web scaping websites and you should always pay attention to the terms and conditions of the website you may want to mine.

***


### **Demo 1: Extracting The Best Selling Books From Amazon**

We will use web scraping to extract the best selling book from Amazon as listed at [https://www.amazon.co.uk/Best-Sellers-Books/zgbs/books](https://www.amazon.co.uk/Best-Sellers-Books/zgbs/books)

In [1]:
import requests
from bs4 import BeautifulSoup

pagesNum = 1

def getAmazonBestSellers(pageNum):  
  #Define the headers
  headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0", "Accept-Encoding":"gzip, deflate", "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "DNT":"1","Connection":"close", "Upgrade-Insecure-Requests":"1"}

  #Define the template of the website we want to scrape
  urlTemplate = 'https://www.amazon.co.uk/Best-Sellers-Books/zgbs/books/ref=zg_bs_pg_' + str(pageNum) + '?_encoding=UTF8&pg=' +str(pageNum)

  #Request the data
  r = requests.get(urlTemplate)

  #Check the HTTP status code; see, https://en.wikipedia.org/wiki/List_of_HTTP_status_codes
  status = r.status_code
  if status != 200: #if not executed succesfully stop
    return 

  #Get the content and instantiate a Beautifoul soup instance  
  content = r.content
  soup = BeautifulSoup(content)

  #List to keep the data of Amazon Books
  amazonBooks = []

  #Now we need to find and analyse the HTML tags that hold the necessary data
  for d in soup.findAll('div', attrs={'class':'a-section a-spacing-none aok-relative'}):    
    name = d.find('span', attrs={'class':'zg-text-center-align'})
    n = name.find_all('img', alt=True)
    author = d.find('a', attrs={'class':'a-size-small a-link-child'})
    rating = d.find('span', attrs={'class':'a-icon-alt'})
    usersRated = d.find('a', attrs={'class':'a-size-small a-link-normal'})
    price = d.find('span', attrs={'class':'p13n-sc-price'})
        
    #List for keeping data of each book
    book = []
    
    if name is not None:
      book.append(n[0]['alt'])
    else:
      book.append("unknown-product")
      
    if author is not None:
      book.append(author.text)
    elif author is None:
      author = d.find('span', attrs={'class':'a-size-small a-color-base'})
      if author is not None:
        book.append(author.text)
      else:    
        book.append('0')

    if rating is not None:
      book.append(rating.text.strip(" out of 5 stars"))#remove the "out of 5 stars"
    else:
      book.append('-1')

    if usersRated is not None:
      book.append(usersRated.text.replace(",", "")) #remove the comma
    else:
      book.append('0')     

    if price is not None:
      book.append(price.text.strip("£")) #remove the £ sign for further manipulation
    else:
      book.append('0')
    
    #Add each book's data into the list after converting its data into a tuple
    amazonBooks.append(tuple(book))
  
  return amazonBooks

In [2]:
#Invoke the function and collect a list of tuples, one for each book
amazonBooks = getAmazonBestSellers(1)  

#Print the first 3 elements
amazonBooks[:3]

[('Pinch of Nom Quick & Easy: 100 Delicious, Slimming Recipes',
  'Kay Featherstone',
  '4.9',
  '11646',
  '10.00'),
 ('The Boy, The Mole, The Fox and The Horse',
  'Charlie Mackesy',
  '4.9',
  '46180',
  '9.00'),
 ('The Thursday Murder Club: The Record-Breaking Sunday Times Number One Bestseller',
  'Richard Osman',
  '4.',
  '34167',
  '7.49')]

In [3]:
#Create the list of tuples into a numpy array
import numpy as np
amazonBooksArray = np.array(amazonBooks, dtype=[('name',"U50"), ('author',"U50"), ('score','f4'), ('reviews','i4'),('price','f4')])

#Print the first 10 elements of the array
amazonBooksArray[0:10]

array([('Pinch of Nom Quick & Easy: 100 Delicious, Slimming', 'Kay Featherstone', 4.9,  11646, 10.  ),
       ('The Boy, The Mole, The Fox and The Horse', 'Charlie Mackesy', 4.9,  46180,  9.  ),
       ('The Thursday Murder Club: The Record-Breaking Sund', 'Richard Osman', 4. ,  34167,  7.49),
       ('Pinch of Nom: 100 Slimming, Home-style Recipes', 'Kay Featherstone', 4.8,  39432,  9.99),
       ('Pinch of Nom Everyday Light: 100 Tasty, Slimming R', 'Kay Featherstone', 4.8,  22878,  9.99),
       ('Where the Crawdads Sing', 'Delia Owens', 4.7, 135417,  5.99),
       ('Good Vibes, Good Life: How Self-Love Is the Key to', 'Vex King', 4.7,  12086,  7.99),
       ('Why Men Love Bitches: From Doormat to Dreamgirl - ', 'Sherry Argov', 4. ,   8070, 11.19),
       ('Bridgerton: The Duke and I (Bridgertons Book 1): T', 'Julia Quinn', 4.4,   7804,  6.  ),
       ('Read Write Inc. Phonics: Home More Phonics Flashca', 'Ruth Miskin', 4.8,   2273,  5.74)],
      dtype=[('name', '<U50'), ('author',

In [4]:
#Fron now on we can analyse the dataset as normally
#For instance, let's calculate the mean, median, standard deviation price of the best selling books
print("Mean price:",   np.mean(amazonBooksArray['price']))
print("Median price:", np.median(amazonBooksArray['price']))
print("Std price:",    np.std(amazonBooksArray['price']))

Mean price: 7.3641996
Median price: 6.745
Std price: 3.6471958


***

### **Demo 2: Extracting The Best Hotels In York From Booking.Com**

We will use web scraping to extract the Hotels in York according to Booking.com as reported at [https://www.booking.com/city/gb/york.en-gb.html](https://www.booking.com/city/gb/york.en-gb.html[link text](https://))

In [5]:
import requests
from bs4 import BeautifulSoup

def getBestYorkHotels ():
  #Define the headers
  headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0", "Accept-Encoding":"gzip, deflate", "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "DNT":"1","Connection":"close", "Upgrade-Insecure-Requests":"1"}

  #Define the template of the website we want to scrape
  urlTemplate = 'https://www.booking.com/city/gb/york.en-gb.html'

  #Request the data
  r = requests.get(urlTemplate)

  #Check the HTTP status code; see, https://en.wikipedia.org/wiki/List_of_HTTP_status_codes
  status = r.status_code
  if status != 200: #if not executed succesfully stop
    return 

  #Get the content and instantiate a Beautifoul soup instance  
  content = r.content
  soup = BeautifulSoup(content)

  #List to keep the data of the best hotels 
  #To do this, we need to extract the necessary HTML tags from the website
  #In this case, we need the name of the hotel, its rating, the number of reviews, and its price per night
  bestHotels = []

  hotels = soup.findAll('div', attrs={'class':'sr__card_main_row bui-spacer--large'})
  for h in hotels:
    name = h.find('span', attrs={'class': 'bui-card__title'})
    nameS = name.text.strip()

    score = h.find('div', attrs={'class': 'bui-review-score__badge'})
    scoreS = score.text.strip()

    scoreStr = h.find('div', attrs={'class': 'bui-review-score__title'})
    scoreStrS = scoreStr.text.strip()

    reviews = h.find('div', attrs={'class': 'bui-review-score__text'})
    reviewsS = reviews.text.strip()
    reviewsS = reviewsS.strip(" reviews")
    reviewsS = reviewsS.replace(",", "")

    price = h.find('div', attrs={'class': 'bui-price-display__value bui-f-color-constructive'})
    priceS = price.text.strip()
    priceS = priceS.strip("£")

    bestHotels.append((nameS, scoreS, scoreStrS, reviewsS, priceS))

  return bestHotels


In [6]:
#Invoke the function and collect a list of tuples, one for each hotel
bestYorkHotels = getBestYorkHotels()

#Print the hotels' details
bestYorkHotels

[('Park Inn by Radisson York City Centre', '8.4', 'Very good', '7557', '63'),
 ('The Grand, York', '9.2', 'Superb', '6817', '139'),
 ('Hampton by Hilton York', '8.8', 'Fabulous', '5812', '64'),
 ('DoubleTree by Hilton York', '8.1', 'Very good', '3968', '81'),
 ('Novotel York Centre', '8.2', 'Very good', '5267', '61'),
 ('Hilton York', '8.1', 'Very good', '4034', '86'),
 ('ibis York Centre', '7.4', 'Good', '5724', '35'),
 ('Elmbank Hotel And Lodge - part of The Cairn Collection',
  '8.7',
  'Fabulous',
  '2490',
  '70'),
 ('Hotel Indigo York', '8.9', 'Fabulous', '3548', '88'),
 ('Principal York', '8.2', 'Very good', '3407', '119')]

In [7]:
#Create the list of tuples into a numpy array
import numpy as np
hotelsArray = np.array(bestYorkHotels, dtype=[('name',"U50"), ('score','f4'), ('scoreStr','U20'), ('reviews','i4'),('price','f4')])

#Print the hotels 
hotelsArray

array([('Park Inn by Radisson York City Centre', 8.4, 'Very good', 7557,  63.),
       ('The Grand, York', 9.2, 'Superb', 6817, 139.),
       ('Hampton by Hilton York', 8.8, 'Fabulous', 5812,  64.),
       ('DoubleTree by Hilton York', 8.1, 'Very good', 3968,  81.),
       ('Novotel York Centre', 8.2, 'Very good', 5267,  61.),
       ('Hilton York', 8.1, 'Very good', 4034,  86.),
       ('ibis York Centre', 7.4, 'Good', 5724,  35.),
       ('Elmbank Hotel And Lodge - part of The Cairn Collec', 8.7, 'Fabulous', 2490,  70.),
       ('Hotel Indigo York', 8.9, 'Fabulous', 3548,  88.),
       ('Principal York', 8.2, 'Very good', 3407, 119.)],
      dtype=[('name', '<U50'), ('score', '<f4'), ('scoreStr', '<U20'), ('reviews', '<i4'), ('price', '<f4')])

In [8]:
#Fron now on we can analyse the dataset as normally
#For instance, let's calculate the mean, median, standard deviation of the score of the best hotels books
print("Mean price:",   np.mean(hotelsArray['score']))
print("Median price:", np.median(hotelsArray['score']))
print("Std price:",    np.std(hotelsArray['score']))

Mean price: 8.4
Median price: 8.299999
Std price: 0.48989785


***

### **Demo 3: Extracting The Weather Forecast Using MetOffice's API**

We will use the API provide by MetOffice to extract the three-hourly five-day forecast for Dunkeswell Aerodrome.
For more information see (https://www.metoffice.gov.uk/services/data/datapoint/api-reference)[https://www.metoffice.gov.uk/services/data/datapoint/api-reference]

In [9]:
import requests
import json

#Define the template of the website we want to scrape
#I have download the weather forecast from 28/01/21 to 01/02/21 and saved it at
#You can also view its contents by pasting the link below at your browser
urlTemplate = "https://www-users.cs.york.ac.uk/simos/DAT1/yorkWeather.json"

#Request the data
r = requests.get(urlTemplate)

#Check the HTTP status code; see, https://en.wikipedia.org/wiki/List_of_HTTP_status_codes
status = r.status_code
if status != 200: #if not executed succesfully stop
  exit
    
#Create a JSON object
#Read more about the JSON format at https://en.wikipedia.org/wiki/JSON
yorkData = r.json()


In [10]:
#Print the JSON object
yorkData

{'SiteRep': {'DV': {'Location': {'Period': [{'Rep': [{'$': '1080',
        'D': 'S',
        'F': '2',
        'G': '11',
        'H': '99',
        'Pp': '18',
        'S': '7',
        'T': '5',
        'U': '0',
        'V': 'PO',
        'W': '5'},
       {'$': '1260',
        'D': 'SE',
        'F': '3',
        'G': '16',
        'H': '98',
        'Pp': '97',
        'S': '4',
        'T': '5',
        'U': '0',
        'V': 'PO',
        'W': '15'}],
      'type': 'Day',
      'value': '2021-01-28Z'},
     {'Rep': [{'$': '0',
        'D': 'SW',
        'F': '6',
        'G': '20',
        'H': '96',
        'Pp': '94',
        'S': '7',
        'T': '8',
        'U': '0',
        'V': 'MO',
        'W': '15'},
       {'$': '180',
        'D': 'WSW',
        'F': '5',
        'G': '27',
        'H': '91',
        'Pp': '1',
        'S': '11',
        'T': '8',
        'U': '0',
        'V': 'GO',
        'W': '2'},
       {'$': '360',
        'D': 'WSW',
        'F': '3',
      

In [11]:
#Get the dates onbjects
data = yorkData['SiteRep']['DV']['Location']['Period']

#Get the latest date (which is 01/02/21)
date0102 = data[len(data)-1]

#Since the weather forecast is in period of three hours, 
#then we expect to have the forecast for 7 periods
time = 0
for period in date0102['Rep']:
  print("%dh-%dh => %soC"% (time, time+3, period['T']))
  time += 3


0h-3h => 2oC
3h-6h => 1oC
6h-9h => 1oC
9h-12h => 1oC
12h-15h => 5oC
15h-18h => 5oC
18h-21h => 2oC
21h-24h => 2oC
