<img src="https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png" style="float: left; margin: 20px; height: 55px">

# Data acquisition and forming project questions

---



### Learning Objectives

**After this lesson, you will be able to:**
- List some data acquisition methods and data sources
- Use an API to obtain data
- Perform basic web scraping tasks
- Find machine learning and data science datasets 
- Describe, in broad terms, the question you will explore for your final project, some possible data sources, and the limitations of these sources. 

## Where can I get data from?

Every data science project will (naturally!) rely on being able to access some data that you'll then clean, analyse, model and visualise. 

There are several possible sources for data, including:

* **Collecting your own** This could be via a survey, experiment or poll, carried out by your organisation or commissioned by your organisation. 

* **Bulk download** This could be the direct download of a spreadsheet, text document, or other file. An example is http://data.gov, where data is available for bulk download in a number of formats.

* **APIs**

* **Web scraping**

In this session, we'll learn about APIs and web scraping. 


## 1. What's an API?

An API (or application programming interface) is a safe, legal and controlled way of accessing data that another organisation has chosen to make public. 

Requesting data from an API is called **making an API call** or an **API request.**

Data that is returned via an API is usually in a specific format called **JSON**, which stands for JavaScript Object Notation. 

Many well-known organisations have APIs, including Facebook, Google, Twitter, YouTube, Transport for London, Spotify, the Office for National Statistics, and the UK Police. 

APIs don't just provide access to data; they can also provide access to an organisation's computing or machine learning power. 

## 2. Getting data from an API

There are two main ways to get data from an API: through your web browser, or programmatically through a script. 

### 2.1 Accessing an API from your web browser

We can access data from an API using our web browsers. Normally, when you type a URL into your web browser and hit 'go', what's happening in the background? Your web browser is requesting information from the URL you've specified, and some HTML data is sent back to your web browser, translated by your browser into a visual web page and displayed for you. 

For example, when I type 'bbc.co.uk' into my browser, my browser sends a request to the BBC News servers (the computers, probably sitting in a data centre somewhere remote, that contain all the content for the BBC News website). The BBC servers send back some HTML, CSS and JavaScript files that my browser is able to translate into a visual, interactive website and display to me.

But a URL can be used to request **data** as well as HTML. We can usually tell that some sort of data is being requested if a URL contains a question mark. 

Here's an example.

Let's visit https://data.police.uk/docs/.

This shows the documentation for the UK Police API. 

**What sort of data is available from this API?**

Let's focus on data for street level crimes. Visit https://data.police.uk/docs/method/crime-street/.

The documentation tells us how to build a URL that will request the precise data we want.

The request parameters allow us to fine-tune our request to get crime data about a specific location, timeframe, or crime category. 

The example request gives us a template that we can modify to fit our own request, and the example response tells us what the data will look like. 

Let's try visiting this URL: 

https://data.police.uk/api/crimes-street/all-crime?lat=52.629729&lng=-1.131592&date=2017-01
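
To see how a URL like this breaks down, here is a minimal sketch that splits the URL above into its path and its request parameters. It uses Python's standard `urllib.parse` module, which is not otherwise used in this lesson. Everything after the question mark is the query string.

```python
from urllib.parse import urlsplit, parse_qs

url = ('https://data.police.uk/api/crimes-street/all-crime'
       '?lat=52.629729&lng=-1.131592&date=2017-01')

parts = urlsplit(url)

# parse_qs returns each parameter as a list, because a name can be repeated in a URL
params = {name: values[0] for name, values in parse_qs(parts.query).items()}

print(parts.path)   # the part of the URL that names the API method
print(params)       # the request parameters as a dictionary
```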

The result is raw data in JSON (JavaScript Object Notation) format. Most data returned by APIs is in JSON format.

**What does JSON visually remind you of?**

**What does each element in the JSON correspond to?**

**What sort of information is contained in each JSON element?**

JSON is formatted just like a Python dictionary, with key:value pairs. This will come in useful later!
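
As a small illustration of this similarity, the sketch below uses Python's built-in `json` module to turn a JSON string into Python objects. The string is a shortened, hand-copied version of one record returned by the URL above, not a live result.

```python
import json

# a shortened record, copied by hand from the data returned by the URL above
json_text = '{"category": "anti-social-behaviour", "outcome_status": null, "id": 54162963, "month": "2017-01"}'

record = json.loads(json_text)   # parse the JSON string into a Python dictionary

print(type(record))              # a dict
print(record['category'])        # values are looked up by key, as with any dictionary
print(record['outcome_status'])  # JSON null becomes Python None
```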

**Why isn't it ideal to access API data from a web browser?**

### 2.2 Accessing an API programmatically through Python

This is a much faster and more efficient way of accessing an API. Let's try it out with the Police API.




In [39]:
import requests # a library that lets us make HTTP requests

date_str = '2017-01'

# let's define the same URL that we accessed in our web browsers
police_api_url = 'https://data.police.uk/api/crimes-street/all-crime?lat=52.629729&lng=-1.131592&date='+date_str

police_response = requests.get(police_api_url) # make a request to the URL
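
String concatenation works well for one parameter, but it gets fiddly as the number of parameters grows. One alternative, sketched below on the assumption that we want to vary the location as well as the date, is to keep the parameters in a dictionary and let the standard library's `urlencode` build the query string. The helper name `build_police_url` is made up for this example. (`requests.get` also accepts a `params` dictionary and will build the query string for you.)

```python
from urllib.parse import urlencode

POLICE_BASE_URL = 'https://data.police.uk/api/crimes-street/all-crime'

def build_police_url(lat, lng, date_str):
    """Build a street-level crime URL from its request parameters."""
    params = {'lat': lat, 'lng': lng, 'date': date_str}
    return POLICE_BASE_URL + '?' + urlencode(params)

# the same location and month as the URL we visited in the browser
police_api_url = build_police_url('52.629729', '-1.131592', '2017-01')
print(police_api_url)
```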


We get back a 'response' object that contains a few different pieces of information.

In [40]:
type(police_response)

requests.models.Response

We can get the status code of the response. 200 means the request was successful; 404 means the page or resource we asked for could not be found.

In [41]:
police_response.status_code # this tells us whether the request was successful

200
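
Status codes are grouped by their first digit: codes in the 200s mean success, the 400s mean something was wrong with the request, and the 500s mean something went wrong on the server. As a rough sketch, the hypothetical helper below (`describe_status` is not part of `requests`) uses the standard library's `http.HTTPStatus` to look up the official phrase for a code.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short, readable description of an HTTP status code."""
    phrase = HTTPStatus(code).phrase   # e.g. 'OK' or 'Not Found'
    outcome = 'success' if 200 <= code < 300 else 'not a success'
    return f'{code} {phrase} ({outcome})'

print(describe_status(200))
print(describe_status(404))
```

With the response above, you could call `describe_status(police_response.status_code)`.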

We can access the data sent back from the API formatted as plain text. This isn't very helpful to work with, as it's just a long string rather than a dictionary or JSON object that we can search and manipulate.

In [42]:
police_text = police_response.text # this gives us the data that was sent back by the Police API, formatted as plain text
police_text



'[{"category":"anti-social-behaviour","location_type":"Force","location":{"latitude":"52.633894","street":{"id":883120,"name":"On or near Mensa Close"},"longitude":"-1.114174"},"context":"","outcome_status":null,"persistent_id":"","id":54162963,"location_subtype":"","month":"2017-01"},{"category":"anti-social-behaviour","location_type":"Force","location":{"latitude":"52.628839","street":{"id":883279,"name":"On or near De Montfort Place"},"longitude":"-1.123465"},"context":"","outcome_status":null,"persistent_id":"","id":54166589,"location_subtype":"","month":"2017-01"},{"category":"anti-social-behaviour","location_type":"Force","location":{"latitude":"52.636536","street":{"id":883356,"name":"On or near Humberstone Gate"},"longitude":"-1.128602"},"context":"","outcome_status":null,"persistent_id":"","id":54166567,"location_subtype":"","month":"2017-01"},{"category":"anti-social-behaviour","location_type":"Force","location":{"latitude":"52.640029","street":{"id":883238,"name":"On or near

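We can also get the data in JSON format. This is much more useful, because the result is made up of Python lists and dictionaries that we can search and manipulate.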

In [44]:
police_json = police_response.json() # this gives us the data that was sent back by the Police API, formatted as JSON!
police_json

[{'category': 'anti-social-behaviour',
  'location_type': 'Force',
  'location': {'latitude': '52.633894',
   'street': {'id': 883120, 'name': 'On or near Mensa Close'},
   'longitude': '-1.114174'},
  'context': '',
  'outcome_status': None,
  'persistent_id': '',
  'id': 54162963,
  'location_subtype': '',
  'month': '2017-01'},
 {'category': 'anti-social-behaviour',
  'location_type': 'Force',
  'location': {'latitude': '52.628839',
   'street': {'id': 883279, 'name': 'On or near De Montfort Place'},
   'longitude': '-1.123465'},
  'context': '',
  'outcome_status': None,
  'persistent_id': '',
  'id': 54166589,
  'location_subtype': '',
  'month': '2017-01'},
 {'category': 'anti-social-behaviour',
  'location_type': 'Force',
  'location': {'latitude': '52.636536',
   'street': {'id': 883356, 'name': 'On or near Humberstone Gate'},
   'longitude': '-1.128602'},
  'context': '',
  'outcome_status': None,
  'persistent_id': '',
  'id': 54166567,
  'location_subtype': '',
  'month': '2

In [45]:
type(police_json)

list

We can access different elements in the JSON using the same notation as we used with dictionaries and lists. You'll notice that this JSON is formatted as a list of dictionaries, so we can access the first element like this:

In [46]:
police_json[0]

{'category': 'anti-social-behaviour',
 'location_type': 'Force',
 'location': {'latitude': '52.633894',
  'street': {'id': 883120, 'name': 'On or near Mensa Close'},
  'longitude': '-1.114174'},
 'context': '',
 'outcome_status': None,
 'persistent_id': '',
 'id': 54162963,
 'location_subtype': '',
 'month': '2017-01'}

And we can access the latitude of the first element's location like this:

In [49]:
police_json[0]['location']['latitude']

'52.633894'
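
The same notation works inside a loop, so we can pull one field out of every record. The sketch below uses a small hand-copied sample (`sample_json`, trimmed from the output above) so that it runs without calling the API; with the real data you would loop over `police_json` instead.

```python
# three records trimmed from the Police API output above
sample_json = [
    {'category': 'anti-social-behaviour',
     'location': {'latitude': '52.633894',
                  'street': {'id': 883120, 'name': 'On or near Mensa Close'},
                  'longitude': '-1.114174'}},
    {'category': 'anti-social-behaviour',
     'location': {'latitude': '52.628839',
                  'street': {'id': 883279, 'name': 'On or near De Montfort Place'},
                  'longitude': '-1.123465'}},
    {'category': 'anti-social-behaviour',
     'location': {'latitude': '52.636536',
                  'street': {'id': 883356, 'name': 'On or near Humberstone Gate'},
                  'longitude': '-1.128602'}},
]

street_names = []

for crime in sample_json:
    # step down through the nested dictionaries: location -> street -> name
    street_names.append(crime['location']['street']['name'])

print(street_names)
```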

Once we have JSON from an API, it's one small step to turn it into an easily searchable data table, or **data frame**, with a library we'll be using **a lot** during this course: pandas!

In [50]:
import pandas as pd

police_df = pd.DataFrame(police_json) # convert the police JSON into a pandas data table, or data FRAME
police_df.head() # preview the first five rows of the data frame

| | category | context | id | location | location_subtype | location_type | month | outcome_status | persistent_id |
|---|---|---|---|---|---|---|---|---|---|
| 0 | anti-social-behaviour | | 54162963 | {'latitude': '52.633894', 'street': {'id': 883... | | Force | 2017-01 | | |
| 1 | anti-social-behaviour | | 54166589 | {'latitude': '52.628839', 'street': {'id': 883... | | Force | 2017-01 | | |
| 2 | anti-social-behaviour | | 54166567 | {'latitude': '52.636536', 'street': {'id': 883... | | Force | 2017-01 | | |
| 3 | anti-social-behaviour | | 54163226 | {'latitude': '52.640029', 'street': {'id': 883... | | Force | 2017-01 | | |
| 4 | anti-social-behaviour | | 54165885 | {'latitude': '52.631271', 'street': {'id': 883... | | Force | 2017-01 | | |


We'll learn more about what a data frame is, and how to search and manipulate one, in later sessions. For now it's sufficient to know that the data frame is the primary data structure in pandas for storing and representing large amounts of data.
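
Notice that the `location` column still holds a whole dictionary in each row, and that the coordinates in the JSON are strings rather than numbers. One way to deal with this, sketched below in plain Python, is to flatten each record into a single-level dictionary before building the data frame. The helper name `flatten_crime` and the choice of columns are made up for this example, and the sample record is copied from the output above.

```python
def flatten_crime(crime):
    """Turn one nested crime record into a flat dictionary."""
    location = crime['location']
    return {
        'id': crime['id'],
        'category': crime['category'],
        'month': crime['month'],
        'latitude': float(location['latitude']),    # convert text to a number
        'longitude': float(location['longitude']),
        'street_name': location['street']['name'],
    }

# the first record from the Police API output above
sample_crime = {'category': 'anti-social-behaviour',
                'location_type': 'Force',
                'location': {'latitude': '52.633894',
                             'street': {'id': 883120, 'name': 'On or near Mensa Close'},
                             'longitude': '-1.114174'},
                'context': '',
                'outcome_status': None,
                'persistent_id': '',
                'id': 54162963,
                'location_subtype': '',
                'month': '2017-01'}

flat_crime = flatten_crime(sample_crime)
print(flat_crime)
```

A list of these flat dictionaries, for example `[flatten_crime(c) for c in police_json]`, can be passed to `pd.DataFrame` in the same way as before.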

### Exercises

### 1) What sort of data does the Google Maps API provide access to? 

Who might want to use it, and why?
Why would Google release this data? 
Is this API free?

### 2) What sort of data does the Transport for London API provide access to?

Why would TfL release this data?

### 3) What does Google's Cloud Vision API do? 

Try out a demo request with an image of your choice here (scroll down to 'Try the API'): https://cloud.google.com/vision/

### 4) Using a finance API

Let's explore the Alphavantage API (https://www.alphavantage.co/) to get daily financial data on a stock of your choice.

* Visit https://www.alphavantage.co/


* Read through the documentation for the ``TIME_SERIES_DAILY`` API here: https://www.alphavantage.co/documentation/. What data will the ``TIME_SERIES_DAILY`` API give you? 


* Take a look at this demo API request for the ``TIME_SERIES_DAILY`` API. https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&symbol=MSFT&apikey=demo What are the parameters in this request? 


* Sign up for a free API key with Alphavantage, and note the key down. 


* What's an API key, and why do some APIs require them? 


* Now construct the URL for an API request to return stock information for Google (rather than Microsoft), where ``function=TIME_SERIES_DAILY``, to get daily stock information, and the value of ``apikey`` is your API key.

* Paste this URL into your browser and visit the page to confirm that the data you're getting back looks sensible.


* Now, use the ``requests`` and ``json`` libraries (the same steps we followed with the Police API on Tuesday) to make the same API request programmatically, using Python. Your result should be a JSON object.


* Use the ``pandas`` library to convert this JSON object into a pandas dataframe. Preview the first five rows of the dataframe with the ``head`` function. You might need to manipulate the JSON a bit in order to get a sensible looking dataframe!


In [None]:
api_key='YOUR API KEY HERE'
ticker = 'GOOGL'

alphavantage_url= 'https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&symbol='+ ticker + '&apikey=' + api_key

alphavantage_json = requests.get(alphavantage_url).json()
alphavantage_df = pd.DataFrame(alphavantage_json['Time Series (Daily)']).transpose()
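
API keys are best kept out of notebooks that you share. A minimal sketch of one common approach is below: the key is read from an environment variable, and the URL is built from a dictionary of parameters. The variable name `ALPHAVANTAGE_API_KEY` and the helper `build_alphavantage_url` are assumptions made for this example, not something Alphavantage requires.

```python
import os
from urllib.parse import urlencode

ALPHAVANTAGE_BASE_URL = 'https://www.alphavantage.co/query'

def build_alphavantage_url(ticker, api_key=None):
    """Build a TIME_SERIES_DAILY request URL for one stock ticker."""
    if api_key is None:
        # hypothetical environment variable name, chosen for this example
        api_key = os.environ.get('ALPHAVANTAGE_API_KEY')
    if not api_key:
        raise ValueError('No API key given and ALPHAVANTAGE_API_KEY is not set')
    params = {'function': 'TIME_SERIES_DAILY', 'symbol': ticker, 'apikey': api_key}
    return ALPHAVANTAGE_BASE_URL + '?' + urlencode(params)

# 'demo' is the key used in the demo request above
print(build_alphavantage_url('MSFT', api_key='demo'))
```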


## 3. Web scraping

### 3.1 What is web scraping?

Sometimes the data we're after won't be available via a spreadsheet download, or an API. It might be embedded in a web page, or spread across several web pages. In this case, web scraping can be a good solution. 

Web scraping is a slightly more complicated data acquisition method. It involves two steps:

* Grabbing (or 'scraping') the HTML underlying a website

* Searching (or 'parsing') it to extract the information you're interested in. HTML is a language where different pieces of content on a website are sandwiched or enclosed inside 'tags' that describe exactly what that piece of content is. So, a large heading would be enclosed between opening and closing heading tags: ``<h1> My Heading </h1>``. By searching for particular tags in our scraped HTML, we can pick out and store the exact pieces of content we're interested in.
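
To make the idea of tags concrete, here is a small sketch that searches a made-up snippet of HTML for `h1` headings. It uses Python's built-in `html.parser` module, so it runs without installing anything; the class name `HeadingCollector` and the HTML snippet are invented for this example.

```python
from html.parser import HTMLParser

class HeadingCollector(HTMLParser):
    """Collect the text found between <h1> and </h1> tags."""

    def __init__(self):
        super().__init__()
        self.in_heading = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == 'h1':
            self.in_heading = True
            self.headings.append('')   # start a new, empty heading

    def handle_endtag(self, tag):
        if tag == 'h1':
            self.in_heading = False

    def handle_data(self, data):
        if self.in_heading:
            self.headings[-1] += data  # add the text to the current heading

# a made-up page with two headings and a paragraph
html = ('<html><body><h1> My Heading </h1><p>Some text.</p>'
        '<h1>Another Heading</h1></body></html>')

collector = HeadingCollector()
collector.feed(html)
collector.close()

headings = [heading.strip() for heading in collector.headings]
print(headings)
```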

Scraping is the programmatic equivalent of browsing a website, and copy-pasting content from the website into your own local file or spreadsheet. There are some basic rules to follow when scraping websites, to avoid getting into trouble:

* **Don't scrape websites that ask you not to scrape them** It's important to avoid scraping websites that explicitly prohibit scrapers/crawlers/spiders/robots (these terms are often used interchangeably for scraping tools) in their Terms of Use or Terms and Conditions. Under special circumstances, it might be possible to get permission from a website to scrape them if you make direct contact with the owners, explain why you'd like to scrape their site, and what you'll do with the results. 

* **Ask permission** If possible, it's polite and good practice to drop the organisation behind a website a note or email to let them know you'll be scraping their site. 

* **Avoid scraping personal data**

* **Be considerate** Don't send one million requests to a website in the space of one second! If you're looping through several URLs, add in a pause of a second or two between requests using a function like ``time.sleep(2)`` to make sure you don't overwhelm a website's servers.

The Office for National Statistics has published a good set of ethical scraping guidelines here: https://www.ons.gov.uk/aboutus/transparencyandgovernance/lookingafterandusingdataforpublicbenefit/policies/policieswebscrapingpolicy
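
As a sketch of the 'be considerate' rule, the hypothetical helper below (`fetch_politely` is a name made up for this example) pauses between consecutive requests. The function that actually fetches a page is passed in as an argument, so the helper itself does not depend on any particular library.

```python
import time

def fetch_politely(urls, fetch, pause_seconds=2):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(pause_seconds)  # wait before every request after the first
        results.append(fetch(url))
    return results

# Example with a stand-in fetch function that makes no real requests:
pages = fetch_politely(['page-1', 'page-2'], lambda url: 'contents of ' + url, pause_seconds=0)
print(pages)
```

With `requests`, the `fetch` argument could be something like `lambda url: requests.get(url).text`.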



### 3.2 Let's try it out!

We're going to scrape the front page of Wikipedia, and try to extract the URL of every single link on this page. (**Aside: why might we want to do this in real life?**)

Let's start by defining the URL we want to scrape, and using ``requests.get()`` to grab the HTML behind the site. 

In [53]:
wiki_url = 'https://en.wikipedia.org/wiki/Main_Page'
wiki_text = requests.get(wiki_url).text

wiki_text

'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>Wikipedia, the free encyclopedia</title>\n<script>document.documentElement.className=document.documentElement.className.replace(/(^|\\s)client-nojs(\\s|$)/,"$1client-js$2");RLCONF={"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Main_Page","wgTitle":"Main Page","wgCurRevisionId":889268954,"wgRevisionId":889268954,"wgArticleId":15580374,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":[],"wgBreakFrames":!1,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgMonthNamesShort":["","Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec

In [54]:
type(wiki_text)

str

Eek. That's one big, messy string that contains all the HTML from the Wikipedia front page. How can we turn this into a searchable object? 

We use a library called ``beautiful soup`` to transform this string into a searchable object.

**In your terminal window, run ``conda install -c anaconda beautifulsoup4`` to install this library**

In [3]:
from bs4 import BeautifulSoup # let's import beautiful soup

Let's now convert our raw text from the Wikipedia front page into a more easily searchable object.

In [55]:
wiki_soup = BeautifulSoup(wiki_text, 'html.parser')
wiki_soup

<!DOCTYPE html>

<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>Wikipedia, the free encyclopedia</title>
<script>document.documentElement.className=document.documentElement.className.replace(/(^|\s)client-nojs(\s|$)/,"$1client-js$2");RLCONF={"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Main_Page","wgTitle":"Main Page","wgCurRevisionId":889268954,"wgRevisionId":889268954,"wgArticleId":15580374,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":[],"wgBreakFrames":!1,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgMonthNamesShort":["","Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"],"wgR

Our output doesn't look very different to ``wiki_text``, but let's check the types of our two variables:

In [56]:
type(wiki_text)

str

In [57]:
type(wiki_soup)

bs4.BeautifulSoup

Whereas ``wiki_text`` is a string, ``wiki_soup`` is a 'beautiful soup object.' This means we can very easily and precisely search for tagged HTML content. We know that the HTML tag for a hyperlink is ``'a'``. We can use this knowledge, together with the ``find_all`` method in beautiful soup, to extract every URL.

In [58]:
wiki_soup.find_all('a',href=True)

[<a class="mw-jump-link" href="#mw-head">Jump to navigation</a>,
 <a class="mw-jump-link" href="#p-search">Jump to search</a>,
 <a href="/wiki/Wikipedia" title="Wikipedia">Wikipedia</a>,
 <a href="/wiki/Free_content" title="Free content">free</a>,
 <a href="/wiki/Encyclopedia" title="Encyclopedia">encyclopedia</a>,
 <a href="/wiki/Wikipedia:Introduction" title="Wikipedia:Introduction">anyone can edit</a>,
 <a href="/wiki/Special:Statistics" title="Special:Statistics">5,904,442</a>,
 <a href="/wiki/English_language" title="English language">English</a>,
 <a href="/wiki/Portal:Arts" title="Portal:Arts">Arts</a>,
 <a href="/wiki/Portal:Biography" title="Portal:Biography">Biography</a>,
 <a href="/wiki/Portal:Geography" title="Portal:Geography">Geography</a>,
 <a href="/wiki/Portal:History" title="Portal:History">History</a>,
 <a href="/wiki/Portal:Mathematics" title="Portal:Mathematics">Mathematics</a>,
 <a href="/wiki/Portal:Science" title="Portal:Science">Science</a>,
 <a href="/wiki/Po

We've got a list of link tags! We can now loop through the list, extract just the URL from each tag, and append it to a new list.

In [59]:
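url_list = []

# find_all returns every <a> tag that has an href attribute
for result in wiki_soup.find_all('a', href=True):
    url_list.append(result['href'])  # keep just the URL from each tag

# Either version is fine. This alternative stores the search results in a variable first:
# url_list = []
# results_list = wiki_soup.find_all('a', href=True)
# for result in results_list:
#     url_list.append(result['href'])
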

In [60]:
url_list

['#mw-head',
 '#p-search',
 '/wiki/Wikipedia',
 '/wiki/Free_content',
 '/wiki/Encyclopedia',
 '/wiki/Wikipedia:Introduction',
 '/wiki/Special:Statistics',
 '/wiki/English_language',
 '/wiki/Portal:Arts',
 '/wiki/Portal:Biography',
 '/wiki/Portal:Geography',
 '/wiki/Portal:History',
 '/wiki/Portal:Mathematics',
 '/wiki/Portal:Science',
 '/wiki/Portal:Society',
 '/wiki/Portal:Technology',
 '/wiki/Portal:Contents/Portals',
 '/wiki/File:Portrait_Diptych_of_D%C3%BCrer%27s_Parents.jpg',
 '/wiki/Portrait_Diptych_of_D%C3%BCrer%27s_Parents',
 '/wiki/Albrecht_D%C3%BCrer',
 '/wiki/Albrecht_D%C3%BCrer_the_Elder',
 '/wiki/Ageing',
 '/wiki/Journeyman',
 '/wiki/Germanisches_Nationalmuseum',
 '/wiki/Portrait_Diptych_of_D%C3%BCrer%27s_Parents',
 '/wiki/Siberian_accentor',
 '/wiki/Analog_Science_Fiction_and_Fact',
 '/wiki/Stephen,_King_of_England',
 '/wiki/Wikipedia:Today%27s_featured_article/August_2019',
 'https://lists.wikimedia.org/mailman/listinfo/daily-article-l',
 '/wiki/Wikipedia:Featured_articl
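
Many of these links are relative (they start with `/` or `#`), so they only make sense in combination with the address of the page they came from. As a sketch, the standard library's `urljoin` can turn them into full URLs. The short list below is copied from the output above so that the example runs on its own.

```python
from urllib.parse import urljoin

wiki_url = 'https://en.wikipedia.org/wiki/Main_Page'

# three entries copied from url_list: a fragment, a relative link and a full URL
sample_urls = ['#mw-head',
               '/wiki/Wikipedia',
               'https://lists.wikimedia.org/mailman/listinfo/daily-article-l']

# urljoin resolves each link against the page address; full URLs are left unchanged
full_urls = [urljoin(wiki_url, url) for url in sample_urls]

for url in full_urls:
    print(url)
```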

Beautiful Soup lets us perform more complex searches. 

This will involve using the **Inspect** function in your web browser to look for the HTML tag that corresponds to the content you want to scrape, and then building a Beautiful Soup search in Python to pick out that exact content.

Let's try this out by grabbing a list of everyone who spoke in Parliament on a given date.

We start by building a URL that will take us to the Hansard webpage for the date we're interested in, and using ``requests`` to grab that URL.

In [61]:
date = '2019-01-31'
hansard_url = 'https://hansard.parliament.uk/html/Commons/' + date + '/CommonsChamber'

hansard_request = requests.get(hansard_url, allow_redirects=True)

In [63]:
hansard_text = hansard_request.text
hansard_text



We now convert our downloaded HTML into a searchable Beautiful Soup object

In [27]:
hansard_soup = BeautifulSoup(hansard_text,'html.parser')

Now we need to do some detective work. Using the ``View Source`` function in our web browser, we need to find the unique HTML tag that corresponds to the names of MPs on the page we've just scraped. 


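After some visual inspection, it looks like every MP's name is a link (an ``a`` tag) with the attribute ``title="View member's contributions"``, and that each of these links sits inside an ``h2`` HTML tag with the attribute ``class='memberLink'``. 

We can customise our search in Beautiful Soup to pick out this content. Either attribute will do; here are both searches: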

In [64]:
hansard_soup.find_all('a',{'title':"View member's contributions"})

[<a class="nohighlight" href="/search/MemberContributions?house=Commons&amp;memberId=4005" title="View member's contributions">Bob Blackman (Harrow East) (Con)</a>,
 <a class="nohighlight" href="/search/MemberContributions?house=Commons&amp;memberId=4115" title="View member's contributions">The Minister for Digital and the Creative Industries (Margot James)</a>,
 <a class="nohighlight" href="/search/MemberContributions?house=Commons&amp;memberId=4005" title="View member's contributions">Bob Blackman</a>,
 <a class="nohighlight" href="/search/MemberContributions?house=Commons&amp;memberId=4115" title="View member's contributions">Margot James</a>,
 <a class="nohighlight" href="/search/MemberContributions?house=Commons&amp;memberId=4131" title="View member's contributions">Jim Shannon (Strangford) (DUP)</a>,
 <a class="nohighlight" href="/search/MemberContributions?house=Commons&amp;memberId=4115" title="View member's contributions">Margot James</a>,
 <a class="nohighlight" href="/search

In [28]:
hansard_soup.find_all('h2',{'class':'memberLink'})

[<h2 class="memberLink">
 <a class="nohighlight" href="/search/MemberContributions?house=Commons&amp;memberId=4005" title="View member's contributions">Bob Blackman (Harrow East) (Con)</a>
 </h2>, <h2 class="memberLink">
 <a class="nohighlight" href="/search/MemberContributions?house=Commons&amp;memberId=4115" title="View member's contributions">The Minister for Digital and the Creative Industries (Margot James)</a>
 </h2>, <h2 class="memberLink">
 <a class="nohighlight" href="/search/MemberContributions?house=Commons&amp;memberId=4005" title="View member's contributions">Bob Blackman</a>
 </h2>, <h2 class="memberLink">
 <a class="nohighlight" href="/search/MemberContributions?house=Commons&amp;memberId=4115" title="View member's contributions">Margot James</a>
 </h2>, <h2 class="memberLink">
 <a class="nohighlight" href="/search/MemberContributions?house=Commons&amp;memberId=4131" title="View member's contributions">Jim Shannon (Strangford) (DUP)</a>
 </h2>, <h2 class="memberLink">
 <

Now we can cycle through the results of our search and pick out the text (discarding all the 'junk' HTML like brackets and tags) using Beautiful Soup's ``get_text()`` method.

In [77]:
mp_names = []

for result in hansard_soup.find_all('h2',{'class':'memberLink'}):
    
    mp_names.append(result.get_text().strip())

In [78]:
mp_names

['Bob Blackman (Harrow East) (Con)',
 'The Minister for Digital and the Creative Industries (Margot James)',
 'Bob Blackman',
 'Margot James',
 'Jim Shannon (Strangford) (DUP)',
 'Margot James',
 'Mr Richard Bacon (South Norfolk) (Con)',
 'Margot James',
 'Gavin Newlands (Paisley and Renfrewshire North) (SNP)',
 'Liz McInnes (Heywood and Middleton) (Lab)',
 'Nick Smith (Blaenau Gwent) (Lab)',
 'The Secretary of State for Digital, Culture, Media and Sport (Jeremy Wright)',
 'Gavin Newlands',
 'Jeremy Wright',
 'Liz McInnes',
 'Jeremy Wright',
 'Nick Smith',
 'Jeremy Wright',
 'Sir Desmond Swayne (New Forest West) (Con)',
 'Jeremy Wright',
 'Kevin Foster (Torbay) (Con)',
 'Jeremy Wright',
 'Mr Steve Reed (Croydon North) (Lab/Co-op)',
 'Jeremy Wright',
 'Hannah Bardell (Livingston) (SNP)',
 'Jeremy Wright',
 'Anna Turley (Redcar) (Lab/Co-op)',
 'Liz Twist (Blaydon) (Lab)',
 'The Secretary of State for Digital, Culture, Media and Sport (Jeremy Wright)',
 'Anna Turley',
 'Jeremy Wright',
 '
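
Because one entry is recorded each time somebody speaks, the same name appears many times. As a sketch of how we might summarise the list, the example below counts how often each entry appears and builds a list of unique entries in their original order. `sample_names` is just the first six entries of the output above; with the real data you would use `mp_names`.

```python
from collections import Counter

# the first six entries of mp_names, copied from the output above
sample_names = ['Bob Blackman (Harrow East) (Con)',
                'The Minister for Digital and the Creative Industries (Margot James)',
                'Bob Blackman',
                'Margot James',
                'Jim Shannon (Strangford) (DUP)',
                'Margot James']

name_counts = Counter(sample_names)                # how many times each entry appears
unique_names = list(dict.fromkeys(sample_names))   # drop repeats, keep the original order

print(name_counts.most_common(1))
print(len(unique_names))
```

Note that exact matching treats 'Bob Blackman (Harrow East) (Con)' and 'Bob Blackman' as different entries, so the real list would need some further cleaning before the counts are meaningful.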

## Exercise

Turn the code for getting the list of speakers into a function, where the **input** is the date we want to scrape, and the **output** is a list of MPs who spoke on that date.

**Hint: Parliament doesn't sit every day. Add a check using an ``if`` statement to find out whether the status code of the response from ``requests.get()`` is 200 (success) or 404 (not found)**

In [20]:
from bs4 import BeautifulSoup


In [22]:
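def get_hansard(date):
    """Return a list of the speakers recorded in Hansard for a date given as 'YYYY-MM-DD'."""

    hansard_url = 'https://hansard.parliament.uk/html/Commons/' + date + '/CommonsChamber'
    hansard_request = requests.get(hansard_url, allow_redirects=True)

    speaker_list = []

    # a status code of 200 means the page for this date was found
    if hansard_request.status_code == 200:

        hansard_text = hansard_request.text
        hansard_soup = BeautifulSoup(hansard_text, 'html.parser')

        # each speaker's name sits inside an h2 tag with class 'memberLink'
        for result in hansard_soup.find_all('h2', {'class': 'memberLink'}):
            speaker_list.append(result.get_text().strip())

    else:
        # any other status code is treated as "no sitting on this date"
        speaker_list = ["Sorry, Parliament wasn't sitting on this date!"]

    return speaker_list

date = '2019-01-31'
get_hansard(date)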

["Sorry, Parliament wasn't sitting on this date!"]

If you'd like to practise your scraping skills using websites that are explicitly safe to scrape, try http://toscrape.com/ 

## 4. Kaggle, UCI and other data repositories



Kaggle (https://www.kaggle.com/) and UCI (https://archive.ics.uci.edu/ml/datasets.php) are openly accessible repositories of datasets on anything from healthcare to property prices. These can be good starting points for data science projects, and the datasets are often very large (which is a good thing!). 
