In [48]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re

# Data Source

To make some Bake Off predictions we first need data to infer patterns from. We could go through each episode and manually record who was `star baker`, who `left the tent` and how each baker ranked in the technical challenge. That would be tedious, so we are better off doing a bit of web scraping with Python's `requests` and `BeautifulSoup` libraries.

The Wikipedia pages for each season of Bake Off contain a treasure trove of useful information and rankings, all saved in a neat tabular format. If you look at the entry for [season 7](https://en.wikipedia.org/wiki/The_Great_British_Bake_Off_(series_7)), for example, you will notice the following types of tables:
1. A baker summary table containing each baker's name, age, occupation and hometown
2. A results summary for the season detailing overall rankings of each baker for a given episode
3. A series of episode tables containing the names of each baker's creation for a given challenge and how they ranked on the technical challenge

It's worth noting that there are also tables covering `specials` and other non-competitive content, which is irrelevant (and therefore boring) for our statistically minded study.

The first thing we need to do is download the correct webpage for a given season (season 7, say) and extract all those tables in a useful format (CSV).

## Soup Acquisition

We download the webpage by requesting the relevant URL and parse it with Beautiful Soup. The result is a `soup` object that we can query.

In [58]:
def get_soup(season_number):
    url = 'https://en.wikipedia.org/wiki/The_Great_British_Bake_Off_(series_{})'.format(season_number)
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')
    
    return soup

season = '7'

soup = get_soup(season)

## Summary Tables Parsing
The summary tables are the first two tables at the top of the Wikipedia page. They will most likely be the most useful for determining patterns of progress from episode to episode.

We parse their content from the webpage using Beautiful Soup and return two pandas dataframes: `bakers_df` and `elimination_df`.

In [59]:
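def extract_summary_tables(soup, season):
    # Tables that directly follow a section heading
    summary_tables = soup.select('h2 + table')

    # Season 10 is special-cased: its bakers table is matched via 'h2 + p + table'
    if season == '10':
        t1 = soup.select('h2 + p +  table')[0]
        t2 = soup.select('h2 + table')[0]
        summary_tables = [t1, t2]

    bakers_table = str(summary_tables[0])

    # Legend entries explaining the colours used in the elimination table
    elimination_key = soup.select('p + dl > dd')

    elimination_table = summary_tables[1]
    
    return bakers_table, elimination_key, elimination_table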

def elimination_to_df(elimination_key, elimination_table, season):
    rows = elimination_table.tbody.find_all('tr')
    
    # The second row holds the column names
    column_names = rows[1].getText().split('\n')[1:]
    
    parsed_table = [column_names]
    
    # Map each legend colour to the outcome it stands for
    elimination_colour_key = {}

    for key in elimination_key:
        colour = re.findall(r'background-color:(\w*);', key.span['style'])[0].lower()
        value = key.contents[1]
        elimination_colour_key[colour] = value

    for row in rows[2:]:
        columns = row.find_all('td')

        name = re.sub(r'\s+', '', columns[0].getText())
        rounds = [name]

        for col in columns[1:]:
            colour = re.sub('background:', '', col['style'])[:-1].lower()
            if colour in elimination_colour_key:
                # Drop the leading character of the legend text
                rank = elimination_colour_key[colour][1:]
            else:
                rank = 'out'
            # A cell spanning several episodes repeats the same outcome
            try:
                multiplier = int(col['colspan'])
            except (KeyError, ValueError):
                multiplier = 1
            rounds.extend(multiplier * [rank])
            
        parsed_table.append(rounds)
    
    # Promote the first parsed row to the dataframe header
    df = pd.DataFrame(parsed_table)
    df.columns = df.iloc[0]
    df = df.iloc[1:]
    df['season'] = season
    
    return df






In [60]:
bakers_table, elimination_key, elimination_table = extract_summary_tables(soup, season)

bakers_df = pd.read_html(bakers_table, header=0)[0]

elimination_df = elimination_to_df(elimination_key, elimination_table, season)
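
The colour lookup and `colspan` expansion inside `elimination_to_df` are easier to follow on a toy row. The sketch below mirrors those two steps with plain strings standing in for the parsed tags; the legend entries, colours and cell values are made up for illustration and are not taken from the Wikipedia page.

```python
import re

# Hypothetical legend entries: (style of the colour swatch, legend text).
legend = [
    ("background-color:LightBlue;", " Baker got through to the next round."),
    ("background-color:Gold;", " Baker was the Star Baker."),
]

# Build the colour -> outcome lookup, as elimination_to_df does.
colour_key = {}
for style, text in legend:
    colour = re.findall(r'background-color:(\w*);', style)[0].lower()
    colour_key[colour] = text

# Hypothetical cells for one baker: (style attribute, colspan).
cells = [
    ("background:LightBlue;", 2),
    ("background:Gold;", 1),
    ("background:Silver;", 3),  # colour not in the legend
]

rounds = []
for style, colspan in cells:
    colour = re.sub('background:', '', style)[:-1].lower()
    if colour in colour_key:
        outcome = colour_key[colour][1:]  # drop the leading space
    else:
        outcome = 'out'
    # One entry per episode the cell spans.
    rounds.extend(colspan * [outcome])

print(rounds)
```

Any colour missing from the legend falls through to `'out'`, and a cell with `colspan` greater than one contributes the same value once per spanned episode.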

In [61]:
bakers_df

Unnamed: 0,Baker[6][7][8],Age,Occupation,Hometown,Links
0,Andrew Smyth,25,Aerospace engineer,"Derby / Holywood, County Down",[9][10]
1,Benjamina Ebuehi,23,Teaching assistant,South London,[11]
2,Candice Brown,31,PE teacher,"Barton-Le-Clay, Bedfordshire",[12][13]
3,Jane Beedle,61,Garden designer,Beckenham,[14]
4,Kate Barmby,37,Nurse,"Brooke, Norfolk",[15][16]
5,Lee Banfield,67,Pastor,Bolton,[17]
6,Louise Williams,46,Hairdresser,Cardiff,[18]
7,Michael Georgiou,20,Student,Durham,[19]
8,Rav Bansal,28,Student support,Erith,[20]
9,Selasi Gbormittah,30,Client service associate,London,[21]


In [62]:
elimination_df

Unnamed: 0,Baker,1,2,3,4,5,6,7,8,9,10,Unnamed: 12,None,None.1,None.2,None.3,None.4,None.5,season
1,Candice,Baker was one of the judges' least favourite b...,Baker was the Star Baker.,Baker was one of the judges' least favourite b...,Baker got through to the next round.,Baker was the Star Baker.,Baker was one of the judges' favourite bakers ...,Baker got through to the next round.,Baker was the Star Baker.,Baker was one of the judges' favourite bakers ...,Baker was the series winner.,,,,,,,,7
2,Andrew,Baker got through to the next round.,Baker was one of the judges' favourite bakers ...,Baker was one of the judges' favourite bakers ...,Baker was one of the judges' favourite bakers ...,Baker got through to the next round.,Baker was one of the judges' least favourite b...,Baker was the Star Baker.,Baker got through to the next round.,Baker was the Star Baker.,Baker was a series runner-up.,,,,,,,,7
3,Jane,Baker was the Star Baker.,Baker got through to the next round.,Baker got through to the next round.,Baker got through to the next round.,Baker was one of the judges' favourite bakers ...,Baker got through to the next round.,Baker got through to the next round.,Baker was one of the judges' favourite bakers ...,Baker was one of the judges' least favourite b...,Baker was a series runner-up.,,,,,,,,7
4,Selasi,Baker was one of the judges' favourite bakers ...,Baker got through to the next round.,Baker got through to the next round.,Baker got through to the next round.,Baker got through to the next round.,Baker was one of the judges' favourite bakers ...,Baker was one of the judges' least favourite b...,Baker was one of the judges' least favourite b...,Baker was eliminated.,out,out,out,out,out,out,out,out,7
5,Benjamina,Baker was one of the judges' favourite bakers ...,Baker got through to the next round.,Baker got through to the next round.,Baker was the Star Baker.,Baker got through to the next round.,Baker got through to the next round.,Baker was one of the judges' favourite bakers ...,Baker was eliminated.,out,out,out,out,out,out,out,out,,7
6,Tom,Baker got through to the next round.,Baker got through to the next round.,Baker was the Star Baker.,Baker was one of the judges' least favourite b...,Baker was one of the judges' least favourite b...,Baker was the Star Baker.,Baker was eliminated.,out,out,out,out,out,out,out,out,out,,7
7,Rav,Baker got through to the next round.,Baker got through to the next round.,Baker got through to the next round.,Baker was one of the judges' least favourite b...,Baker got through to the next round.,Baker was eliminated.,out,out,out,out,out,out,out,out,,,,7
8,Val,Baker was one of the judges' least favourite b...,Baker was one of the judges' least favourite b...,Baker was one of the judges' least favourite b...,Baker got through to the next round.,Baker was eliminated.,out,out,out,out,out,out,out,out,,,,,7
9,Kate,Baker got through to the next round.,Baker got through to the next round.,Baker was one of the judges' favourite bakers ...,Baker was eliminated.,out,out,out,out,out,out,out,out,,,,,,7
10,Michael,Baker got through to the next round.,Baker got through to the next round.,Baker was eliminated.,out,out,out,out,out,out,out,out,,,,,,,7
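
The header of `elimination_df` is built from `rows[1].getText().split('\n')[1:]`. On a made-up header string (an assumption, not the real page text), the split shows how trailing newlines become empty column names, which may be one source of the blank-named column in the output above.

```python
# Hypothetical text of a header row: one cell per line, with
# trailing newlines after the last cell.
header_text = "\nBaker\n1\n2\n3\n\n"

# Same expression as in elimination_to_df.
column_names = header_text.split('\n')[1:]
print(column_names)

# Dropping empty strings would leave only the real column names.
cleaned_names = [name for name in column_names if name]
print(cleaned_names)
```

Filtering the names this way is only a possible clean-up; it is not something the notebook does.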


## Individual Episode Extraction
Extracting data from each individual episode can also help when we try to determine whether certain ingredients or technical challenge rankings make an actual difference to the eventual outcome of a season. One possible hypothesis is that `biscuit week` is what seals the deal for a given season. Or maybe it turns out that technical challenges have little to no bearing on whether a baker does well on the show overall.

In [77]:
def get_episode_tables(soup):
    episode_tables = soup.select('h3 + p + table')
    episode_tables = [str(table) for table in episode_tables]

    episode_names = soup.select('p + table ~ h3')
    episode_names = [name.span['id'] for name in episode_names]

    return episode_names, episode_tables


def parse_episodes(episode_tables, episode_names):
    # NOTE: relies on the module-level `season` variable being set before the call
    episode_dfs = pd.read_html(''.join(episode_tables), header=0)

    for i, episode in enumerate(episode_dfs):
        try:
            # Column headers carry the challenge brief in parentheses
            challenge_list = [re.findall(r'\((.*)\)', challenge)[0] for challenge in episode.columns[1:]]
            episode_overview = {
                "Signature": challenge_list[0],
                "Technical": challenge_list[1],
                "Showstopper": challenge_list[2]
            }
        except (IndexError, TypeError):
            episode_overview = {}

        episode.columns = ['Baker', 'Signature', 'Technical', 'Showstopper']

        # Names with exactly one colon split into episode number and theme
        episode_name_split = episode_names[i].split(':')

        if len(episode_name_split) == 2:
            episode_number = episode_name_split[0].split('_')[1]
            episode_theme = episode_name_split[1]
        else:
            episode_number = 'special'
            episode_theme = episode_names[i]

        # episode_overview is collected here but not returned
        episode_overview['episode_number'] = episode_number
        episode_overview['theme'] = episode_theme
        episode_overview['season'] = season

        episode['episode_number'] = episode_number
        episode['episode_theme'] = episode_theme
        episode['season'] = season

    season_df = pd.concat(episode_dfs)

    return season_df
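
The episode name handling in `parse_episodes` comes down to two string splits. Here is a standalone sketch, assuming heading ids shaped like `Episode_1:_Cakes` (consistent with the `_Cakes` theme that `season_df` reports; the second id is invented). The `tidy_theme` helper is a hypothetical clean-up step that the notebook does not perform.

```python
def split_episode_id(episode_id):
    """Return (episode_number, episode_theme) the way parse_episodes does."""
    parts = episode_id.split(':')
    if len(parts) == 2:
        return parts[0].split('_')[1], parts[1]
    # Anything without exactly one colon is treated as a special.
    return 'special', episode_id


def tidy_theme(theme):
    """Hypothetical clean-up: drop outer underscores, use spaces inside."""
    return theme.strip('_').replace('_', ' ')


number, theme = split_episode_id('Episode_1:_Cakes')
print(number, theme, tidy_theme(theme))

special = split_episode_id('Christmas_Special')
print(special)
```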

In [75]:
episode_names, episode_tables = get_episode_tables(soup)

season_df = parse_episodes(episode_tables, episode_names)

season_df.head()

There are 13 episode titles
There are 11 episode tables


Unnamed: 0,Baker,Signature,Technical,Showstopper,episode_number,episode_theme,season
0,Andrew,Lemon and Rosemary Drizzle Cake,12th,'Ultimate Indulgence' Mirror Glaze Cake,1,_Cakes,7
1,Benjamina,"Pistachio, Cardamom and Lemon Drizzle Cake",6th,White Chocolate Mirror Glaze with Salted Prali...,1,_Cakes,7
2,Candice,Raspberry and Rhubarb Drizzle Custard Bundt Cake,5th,"Mirror Mirror On The Wall, Who Is The Shiniest...",1,_Cakes,7
3,Jane,Lemon and Poppy Seed Drizzle Cake,7th,Chocolate Orange Mirror Cake,1,_Cakes,7
4,Kate,Berry Best Apple and Bramble Drizzle Cake,4th,One Swallow Does Not Make A Summer Cake,1,_Cakes,7
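
Before the columns are renamed, `parse_episodes` pulls the challenge brief out of each column header with a parenthesis regex. Here is a small sketch on invented header strings (the real headers on the page may differ):

```python
import re

# Hypothetical column headers; the bracketed part is the challenge brief.
headers = [
    'Baker',
    'Signature (Drizzle Cake)',
    'Technical (Jaffa Cakes)',
    'Showstopper (Mirror Glaze Cake)',
]

# Same pattern as in parse_episodes: everything between the parentheses.
briefs = [re.findall(r'\((.*)\)', header)[0] for header in headers[1:]]
overview = dict(zip(['Signature', 'Technical', 'Showstopper'], briefs))
print(overview)

# A header without parentheses gives an empty list, so indexing it with [0]
# would raise IndexError; the try/except in parse_episodes catches that case.
no_match = re.findall(r'\((.*)\)', 'Baker')
print(no_match)
```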


## Save To Disk

Now that we have all these nice dataframes, we should save them to disk to spare ourselves the effort of parsing the webpages all over again.

In [86]:
import os

def save_data(season, season_df, bakers_df, elimination_df):
    directory = 'season_{}'.format(season)

    # Create the output folder if it does not exist yet
    os.makedirs(directory, exist_ok=True)

    season_df.to_csv(os.path.join(directory, 'episodes.csv'), index=False)
    bakers_df.to_csv(os.path.join(directory, 'bakers.csv'), index=False)
    elimination_df.to_csv(os.path.join(directory, 'elimination.csv'), index=False)
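
Creating the output folder is safe to repeat when `os.makedirs` is called with `exist_ok=True`. Here is a minimal sketch using a temporary folder. The `season_directory` helper and its `base_dir` parameter are hypothetical additions, not part of the notebook.

```python
import os
import tempfile


def season_directory(base_dir, season):
    """Create (if needed) and return the folder for one season's CSV files."""
    directory = os.path.join(base_dir, 'season_{}'.format(season))
    os.makedirs(directory, exist_ok=True)  # no error if it already exists
    return directory


with tempfile.TemporaryDirectory() as base_dir:
    first = season_directory(base_dir, '7')
    second = season_directory(base_dir, '7')  # repeat call is harmless
    created = os.path.isdir(first)
    same_path = first == second
    folder_name = os.path.basename(first)

print(created, same_path, folder_name)
```

Passing a base directory explicitly would also make it clear where the `season_<n>` folders end up.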

## Get All The Seasons!

Right, so we have covered how to download, parse and save Bake Off data from Wikipedia for season 7. Now let's run it for all seasons and save the data to disk so we can start exploring it in the next section.

In [87]:
for season in range(1,11):
    season = str(season)
    
    soup = get_soup(season)
    
    try:
        bakers_table, elimination_key, elimination_table = extract_summary_tables(soup, season)

        bakers_df = pd.read_html(bakers_table, header=0)[0]

        elimination_df = elimination_to_df(elimination_key, elimination_table, season)

        episode_names, episode_tables = get_episode_tables(soup)

        season_df = parse_episodes(episode_tables, episode_names)

        # save_data builds its own 'season_<n>' output directory

        save_data(season, season_df, bakers_df, elimination_df)

        print("Acquired and parsed season {}!!".format(season))
    except Exception as e:
        print(e)
        print("Failed to acquire and parse season {}... You win some, you lose some.".format(season))


No text parsed from document: 
Failed to acquire and parse season 1... You win some, you lose some.
Acquired and parsed season 2!!
Acquired and parsed season 3!!
Acquired and parsed season 4!!
Acquired and parsed season 5!!
Acquired and parsed season 6!!
Acquired and parsed season 7!!
Acquired and parsed season 8!!
Acquired and parsed season 9!!
Acquired and parsed season 10!!
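
Printing failures works for ten seasons, but collecting them makes it easier to see afterwards which seasons need attention. Here is a hedged sketch of that pattern. `run_for_seasons` and `fake_process` are hypothetical stand-ins, and the fake failure only imitates the season 1 error above.

```python
def run_for_seasons(process_season, seasons):
    """Run process_season for each season; collect successes and failures."""
    succeeded, failed = [], {}
    for season in seasons:
        season = str(season)
        try:
            process_season(season)
            succeeded.append(season)
        except Exception as error:
            # Keep the error so it can be inspected after the loop.
            failed[season] = repr(error)
    return succeeded, failed


def fake_process(season):
    """Stand-in for the download/parse/save steps; fails for season 1."""
    if season == '1':
        raise ValueError('No text parsed from document')


succeeded, failed = run_for_seasons(fake_process, range(1, 4))
print(succeeded)
print(failed)
```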
