<div style="display: block; width: 100%; height: 100px;">

<p style="float: left;">
    <span style="font-weight: bold; line-height: 24px; font-size: 16px;">
        Department of Digital Humanities
        <br />
        5AAVC210 Introduction to Programming in Python 2017-2018
    </span>
    <br >
    <span style="line-height: 22px; font-size: 14px; margin-top: 10px;">
        Created by Dom Weldon (dominic.weldon@kcl.ac.uk) <br />
        Office Hours: Thursdays, 11am to 1pm, K-1.026 (King's Building, Strand Campus) <br />
        Submit Assessed work to KEATS before the deadline.
    </span>
</p>


</div>

# Week 5: Requests and Statistics - Practice Challenge

This week our focus is exclusively on getting everything in place for your assessment. In the practical session, feel free to ask any questions at all, and do make sure to complete the challenges below. You should be able to adapt code from your previous weeks' tasks to complete the mid-term assessment.

## This Week's Aims

This week we will focus on making requests to web pages and using the data returned from them for research. We'll also say hello to the standard `statistics` library, and look at handy functions for calculating the mean, median and mode. By the end of this week, you should be comfortable doing the following.

* Making an HTTP GET request to a remote web page and receiving a result back.
* Using BeautifulSoup to parse a web page and get basic data out of an HTML tree.
* Using the statistics module to calculate the mean, median and mode.

If you're a bit unsure about the final bit and would like a re-cap of the terms, [check out this article](https://www.khanacademy.org/math/statistics-probability/summarizing-quantitative-data/mean-median-basics/a/mean-median-and-mode-review).

So far, in increasing levels of complexity, we've made programs which have modelled: a mock shop inventory, countries of the world, presidents of the United States, and if you did the extra exercises from week 4, London tube lines and chemical elements. However, so far all of this data has been provided to us ready-made. This week, we'll be getting some of that data from the real world, by taking it from the internet.

## 0. Recap from Last Week

Last week was a revision week aimed at ensuring we are all comfortable with complex types and can model real-world data using them. We don't want to spend too long going over this, but let's do one more example program using complex data already provided to us.

This time, we're going to make an app to tell us about London boroughs (this is important, as we'll also use this data later when we're exploring statistics in Python). In case you're not familiar with the geography of London, Greater London is formed of 32 administrative districts called boroughs, plus the City of London (a very weird institution, but one which we'll treat as a borough for our purposes today; if you'd like to know more about the history of London, [watch this two minute video](https://www.youtube.com/watch?v=6mjXCj2-c6k)).

The code in the cell below should import the list of boroughs for us. 

In [None]:
from boroughs import boroughs 

We now have a variable called `boroughs`, which is a **list of dictionaries**, one for each London borough. Each dictionary contains the name, area (in square miles) and population of the borough as a string, float and integer respectively.

In [None]:
# there should be 33 boroughs 
print(len(boroughs))

# let's look at the City of London, where we currently are
# as we can see, it's tiny
print(boroughs[-1])

**Challenge**

Design and write a program which asks the user for the name of a given London Borough and then prints out the population and area of the borough. If no borough of that name is found, it should inform the user that no borough with that name was found.

In [None]:
# Your code here!

## 1. Making Requests

As you might expect, it's very rare to have data readily available to us as a neat list or dictionary in Python to begin with. Usually, we'll need to take our data *from* somewhere, and more often than not, that data is likely to be on the internet or in a file. This week, we will focus on taking some data from an internet source and parsing it on our computers.

But first, a recap from last year about the way that the internet works: HTTP requests and responses.

Whenever we want to visit a web page, we type the URL into our browser, hit enter, and then the web page appears. But what happens "under the hood"? If we remember our History of Networked Technologies course, we should know that:

* The URL is a Uniform Resource Locator, which informs the browser of the protocol, location and path of the resource we want to get. So, for example, `http://www.bbc.co.uk/news` is actually an instruction to our browser to use the `http` protocol to look on the `www.bbc.co.uk` server (which it finds using DNS) for a resource called `/news`.
* HTTP stands for Hypertext Transfer Protocol, and was developed by Tim Berners-Lee at CERN following [his 1989 proposal](http://info.cern.ch/Proposal.html).
* The browser makes an HTTP request to the server. The request must be one of several different methods, the most common of which are `GET` and `POST`. Today we're only looking at `GET` requests.
* The server receives the request, and looks for the resource on the server. It will send a response with a status code and a body.
    * If the resource is found, the server will send a response with status code **200**, along with the body of the response.
    * If the resource is not found, the server will send a response with status code **404**, perhaps with a page saying "not found" in the body.
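
To make the anatomy of a URL concrete, here is a small sketch using `urlparse` from Python's standard `urllib.parse` module (a module not otherwise used in this worksheet) to split the BBC example URL into its parts. No request is sent here; the URL is only treated as a string.

```python
from urllib.parse import urlparse

# Split the example URL from the list above into its components
parts = urlparse('http://www.bbc.co.uk/news')

print(parts.scheme)   # the protocol: http
print(parts.netloc)   # the server: www.bbc.co.uk
print(parts.path)     # the resource on that server: /news
```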
    
So, how do we do this in Python?

Well, like all things, there are several different methods we can use, but the most common and often most effective method is to use the `requests` library which hopefully you installed in the lecture. Now, let's see an example of making a request.

In [None]:
# import the library
import requests

# let's specify a URL, the KCL news page
u = 'https://spotlight.kcl.ac.uk/'

# and now make a GET request
r = requests.get(u)

# and see the status code
print(r.status_code)

Hopefully, you should see the number `200` printed out from the cell above. Hoorah, we made a successful request. And just to prove it, let's make a nonsensical request.

In [None]:
# let's specify a URL for a page that shouldn't exist
u_bad = 'https://spotlight.kcl.ac.uk/dom/is/cool'

# and now make a GET request
r_bad = requests.get(u_bad)

# and see the status code
print(r_bad.status_code)

Uh oh, it seems they haven't gotten around to making that page yet. As you can see, we got an error: a `404`, meaning `NOT FOUND`.
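
In your own programs you will usually want to act on the status code rather than just print it. The sketch below shows one possible way to do that. `describe_status` is a hypothetical helper name (it is not part of `requests`), and it assumes the code is a standard one, since it uses the standard library's `http.HTTPStatus` to look up the official phrase for each code.

```python
from http import HTTPStatus


def describe_status(status_code):
    """Return a short, human-readable message for a standard HTTP status code."""
    # HTTPStatus knows the standard phrase for each code, e.g. 404 -> 'Not Found'
    # (it raises a ValueError for numbers that are not standard status codes)
    phrase = HTTPStatus(status_code).phrase
    if status_code == 200:
        return 'Success: ' + phrase
    elif status_code == 404:
        return 'Sorry, that page does not exist: ' + phrase
    else:
        return 'Something else happened: ' + str(status_code) + ' ' + phrase


print(describe_status(200))
print(describe_status(404))
```

With the two responses from the cells above, you could call it as `describe_status(r.status_code)` or `describe_status(r_bad.status_code)`.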

### But What Does it Mean?

Okay, so we've gotten a response back, but what did it say? Let's look at a special property of the response, called its `text`, and see if we can make sense of it.

In [None]:
print(r.text)

So hopefully you should all recognise that as HTML (HyperText Markup Language), but it's going to take us a very long time to figure out what it says and what it means. Luckily, there's also a library we can use to start to pick apart some of that HTML and to take something more meaningful out of it.

That library is called BeautifulSoup, so named because it, well, takes a messy "soup" of text, and turns it into something more structured, and 'beautiful'. Again, hopefully you will have installed this successfully in the lecture.

Let's imagine we wanted to get the titles of the news articles from the KCL website. The code below should do that for us; run it now and we'll see what happens.

In [None]:
# import BeautifulSoup
from bs4 import BeautifulSoup

# create a new BeautifulSoup object from our text,
# naming Python's built-in HTML parser explicitly to avoid a warning
tree = BeautifulSoup(r.text, 'html.parser')

# Find all the titles
titles = tree.findAll('h2', {'class': 'grid-title'})
print(titles)

Wait, what happened there? Well, the important thing lies in this line here:

    titles = tree.findAll('h2', {'class': 'grid-title'})
    
With the new tree we've created (some people call this object `soup`, or something similar; the name really doesn't matter), we've used the `findAll()` method to find the parts of the web page that we're interested in. By looking at the page's source code in our web browser, we can see that all the headers of the news articles we want to find out more about are contained in `h2` elements (header number 2), and contain links (`<a>` tags), like the first story below.

    <h2 class="grid-title">
        <a href="https://spotlight.kcl.ac.uk/2018/02/06/votes-women-womens-suffrage-history-lessons-today/">
            Votes for women: the history of women’s suffrage and lessons for today
        </a>
    </h2>

However, the site might also contain `h2` elements which aren't titles of news articles, so we needed a way to tell the difference. Often, a site will do this using a *class*, which helps to style the element in the way the designer wants. So, we supplied two arguments to `findAll()`: a string saying what kind of element we want, and a dictionary saying what properties that element must have. `findAll()` then found all the possible matches, and returned them to us as a list.
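
Because a live page can change at any time, it can help to try `findAll()` on a small piece of HTML that you control. The sketch below uses a made-up snippet (the headlines, links and the `sidebar-title` class are invented for illustration) and passes `'html.parser'`, Python's built-in parser, to `BeautifulSoup`.

```python
from bs4 import BeautifulSoup

# A tiny, made-up page: two article titles and one unrelated h2
html = """
<h2 class="grid-title"><a href="/story-one">Story one</a></h2>
<h2 class="sidebar-title">About this site</h2>
<h2 class="grid-title"><a href="/story-two">Story two</a></h2>
"""

tree = BeautifulSoup(html, 'html.parser')

# Asking only for 'h2' matches every h2 element in the snippet
all_h2 = tree.findAll('h2')
print(len(all_h2))

# Adding the class filter keeps only the article titles
titles = tree.findAll('h2', {'class': 'grid-title'})
print(len(titles))
```

The first call finds all three `h2` elements, while the second finds only the two with the `grid-title` class.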

However, that output is still a little messy and hard to read. Ideally, we just want the titles of all the articles without the extra tags in the way. Luckily, BeautifulSoup is way ahead of us, and has just such a way to do this: the `text` property. For example, see the code below.

In [None]:
# loop through the titles
for title in titles:
    print(title.text)

Excellent! Now we should know everything we need to tackle this week's first problem!

<h1 style="background-image: url(https://s3.eu-west-2.amazonaws.com/intro-to-python/homer-excited.png); padding-top: 400px; background-repeat: no-repeat; background-size: contain; background-position: right top; font-size: 36px;">
Simpsons Trivia Challenge!
</h1>

*The Simpsons* is ~~a popular animated television show which has been produced since 1989~~ the best show on television. Numerous fan sites exist on the internet which list episodes and give plot summaries for each episode. One such site is SimpsonCrazy.com.

This site lists all the seasons of the Simpsons at the following URL: [http://www.simpsoncrazy.com/episodes](http://www.simpsoncrazy.com/episodes).

Each individual season's page is also available at its own URL such that [http://www.simpsoncrazy.com/episodes/season/1](http://www.simpsoncrazy.com/episodes/season/1) is the URL for season one, and season two is available at [http://www.simpsoncrazy.com/episodes/season/2](http://www.simpsoncrazy.com/episodes/season/2). Each individual season page lists every episode in that season.

Write a program which continually asks a user for a season number of The Simpsons, retrieves the relevant page for that season from SimpsonCrazy.com, and then prints out a nicely formatted list of all the episodes in that season. If the season is not found (i.e., the response status is `404` rather than `200`), your program should tell the user that the season was not found. After each request the program should ask the user if they wish to continue and act accordingly.

*Hint: the following code should help you get a list of all the episodes from the BeautifulSoup object*. 

    episodes = simpsons_tree.findAll('h2', {'class': ''})

In [None]:
# Your code here

As Mr Burns would say: "*Excellent...*"

Now we're going to take our Simpsons program a little further. Hopefully you should have noticed that each episode of The Simpsons also has its own page on SimpsonCrazy.com. Copy and paste your code from the previous section into the cell below, and adjust it so that after printing out a list of all the episodes from a given season, the program asks the user to enter another number, saying which episode from that season they would like to find out more about. Your program should then request the page for that episode from SimpsonCrazy.com, and then print out a plot summary.

For example, if the user entered 1, for season 1, they should see a list of all the season 1 episodes, and then be able to enter 2 for episode 2 (*Homer's Odyssey*). Your program will then make a request to [http://www.simpsoncrazy.com/episodes/homers-odyssey](http://www.simpsoncrazy.com/episodes/homers-odyssey) and print out the summary of the plot from that page.

**Hints**

You will need to keep track of all of the episodes from the requested season in a complex type of some kind (either a list, or a dictionary - up to you!). The most important thing is to remember the URL for each episode from that series. You can get the path for each episode from the seasons page using code like that below. But do think: what will you need to do to the path to get the URL?

    for episode in episodes:
        episode_title = episode.text
        link_href = episode.find('a').get('href')
        print(episode_title, link_href)
        
You will also need to get the plot from the episode page. The code below should help you do that.

    plot_tree = episode_tree.find('div', {'class': 'episodePlot'})
    plot = plot_tree.text

In [None]:
# Your code here

## 2. The `statistics` Module

Now, from Springfield back to London! This time we're going to re-use the London Borough data to do some basic statistics. 

Now that we're more advanced in our Python knowledge, and we know about libraries, functions and complex types, this section shouldn't need too much introduction, so here goes.

Python has a `statistics` library which we can use to work out basic statistics from complex data types such as lists. For example, if I wanted to find out some statistics about the ages of students, I could do the following.

In [None]:
# let's get the ages of students
student_ages = [19, 20, 19, 21, 22, 19, 20, 20, 24, 19, 20, 20, 19, 20, 21]

# import our function 
from statistics import mean

# work out the mean
mean_age = mean(student_ages)
print('The mean age of a student in the class is', mean_age)

Likewise, for the median:

In [None]:
from statistics import median

median_age = median(student_ages)
print('The median age is', median_age)

And for the mode:

In [None]:
from statistics import mode

mode_age = mode(student_ages)
print('The most common age is', mode_age)
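
If you would like to see what these three functions are doing, the sketch below works out the same values by hand, using only built-in functions and `collections.Counter`, and then prints them next to the `statistics` results. The `manual_` variable names are only for illustration.

```python
from collections import Counter
from statistics import mean, median, mode

student_ages = [19, 20, 19, 21, 22, 19, 20, 20, 24, 19, 20, 20, 19, 20, 21]

# mean: add everything up and divide by how many values there are
manual_mean = sum(student_ages) / len(student_ages)

# median: sort the values and take the middle one
# (this shortcut only works because the list has an odd number of values)
ordered = sorted(student_ages)
manual_median = ordered[len(ordered) // 2]

# mode: count how often each value appears and take the most frequent
manual_mode = Counter(student_ages).most_common(1)[0][0]

print(manual_mean, mean(student_ages))
print(manual_median, median(student_ages))
print(manual_mode, mode(student_ages))
```

For a list with an even number of values, `median` returns the mean of the two middle values instead.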

## Final Problem! London Boroughs

Now, for our final problem of this half of the course. Hopefully a nice and straightforward one!

Using code, import the London boroughs dataset and work out the mean and median population, area and population density (number of people per square mile) of all London Boroughs.

In [None]:
from boroughs import boroughs

# Your code here

Then, work out which Borough has each of the following:

* The highest population 
* The lowest population 
* The greatest population density 
* The lowest population density
* The greatest area
* The smallest area.

In [None]:
# Your code here

Finally, what are the mean, median and mode of the lengths of the names of the London boroughs?

In [None]:
# your code here

<img src="https://s3.eu-west-2.amazonaws.com/intro-to-python/the-simpsons-futurama-crossover-episode.jpg" />

# Wahoo! We're Done!

Congratulations on finishing the first half of the course - you're now all programmers!

As ever, don't forget you can get help from Gabriele and Dom via email, and through office hours throughout reading week!