# Project 4: Scraping marathon results
#### Seungmin Lee || 50230433
---
## Objective
---
Our goal for this project is to retrieve data from the [2019 Chicago Marathon](https://results.chicagomarathon.com/well-known/2019/) results site and build a pandas DataFrame of the top 50 male runners.

For that purpose we will be using the following tools:
- `requests`: Retrieves pages from a website over HTTP.
- `BeautifulSoup`: Parses the HTML so that we can search it by tag.
- `pandas`: Builds the DataFrame containing the top 50 male runners.
- `time`: Adds a delay between GET requests so that we do not send them to the server in rapid succession.

In [2]:
import requests
from bs4 import BeautifulSoup as BS
import pandas as pd
import time

In [3]:
# Specific parameters to obtain top 50 male contenders
params = {"event":"MAR", "search[sex]":"M", "search[age_class]":"%", "num_results":"50"}
marathonURL = requests.get("https://results.chicagomarathon.com/well-known/2019/?pid=list", params = params).text
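soup = BS(marathonURL, "html.parser") # Name the parser explicitly so the result does not depend on which parsers are installed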
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   Bank of America Chicago Marathon
  </title>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="black" name="apple-mobile-web-app-status-bar-style"/>
  <link href="//results-static.mikatiming.com/2019/chicago/../../stages/blue/images/apple-touch-icon.png" rel="apple-touch-icon"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <meta content="198023663696085" property="fb:app_id"/>
  <meta content="https://results-static.mikatiming.com/2019/chicago/styles/responsive_2016/logo_fb.png" property="og:image"/>
  <meta content="630" property="og:image:width"/>
  <meta content="315" property="og:image:height"/>
  <meta content="website" property="og:type"/>
  <meta content="Mika timing" property="og:site_name"/>
  <meta content="https://results.chicagomarathon.com/2019/?content=well-known&amp;event=MAR&amp;lang=EN_CAP&amp;num_results=50&amp;pid=list&amp;search%5Bsex%5D=M&amp;search%5Bage_c
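
To see what the `params` dictionary turns into in the request URL, the sketch below encodes the same keys with the standard library's `urllib.parse.urlencode`. This is an illustration only: `requests` performs its own encoding, which we assume produces an equivalent query string, and `BASE_URL` and `query` are names introduced here rather than part of the notebook.

```python
from urllib.parse import urlencode

# Same parameters as the notebook's `params` dictionary.
params = {"event": "MAR", "search[sex]": "M", "search[age_class]": "%", "num_results": "50"}

# BASE_URL is a name introduced for this sketch.
BASE_URL = "https://results.chicagomarathon.com/well-known/2019/?pid=list"

# urlencode percent-encodes reserved characters:
# "[" becomes %5B, "]" becomes %5D, and "%" becomes %25.
query = urlencode(params)
print(BASE_URL + "&" + query)
```

The escapes `%5B`, `%5D`, and `%25` are the same ones that appear in the runner profile URL used in the next cell.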

Because we will pull the same fields for all 50 male runners, we first need to see how a single profile page is structured. We start with the profile of the first-place male runner, Lawrence Cherono.

In [4]:
# URL linking to Lawrence's profile
test1 = requests.get("https://results.chicagomarathon.com/well-known/2019/?content=detail&fpid=list&pid=list&idp=999999107FA317000023A4AD&lang=EN_CAP&event=MAR&lang=EN_CAP&num_results=50&search%5Bsex%5D=M&search%5Bage_class%5D=%25&search_event=MAR").text
soup1 = BS(test1, "html.parser")
# Most of the data we will be working with is under the "td" tag
tdTag = soup1.find_all("td")
tdTag

[<td class="f-__fullname last">Cherono, Lawrence (KEN)</td>,
 <td class="f-age_class last">30-34</td>,
 <td class="f-start_no_text last">4</td>,
 <td class="f-age last">31</td>,
 <td class="f-__city_state last">Chepkoilel</td>,
 <td class="f-display_name_short last">LC</td>,
 <td class="f-starttime_net last">07:30:03</td>,
 <td class="f-place_all last">1</td>,
 <td class="f-place_age last">1</td>,
 <td class="f-place_nosex last">1</td>,
 <td class="f-time_finish_netto last">02:05:45</td>,
 <td class="f-__nix last"><a class="icon l" href="?content=detail&amp;idp=999999107FA317000023A4AD&amp;lang=EN_CAP&amp;num_results=50&amp;pid=list&amp;search%5Bsex%5D=M&amp;search%5Bage_class%5D=%25&amp;event=MAR&amp;favorite_add=999999107FA317000023A4AD"><img alt="Add runner to 'My Runners'" src="//assets.mikatiming.com/img/mt/results/default/icons/add.svg" title="Add runner to 'My Runners'"/></a></td>,
 <td class="time_day">07:44:48AM</td>,
 <td class="time">00:14:45</td>,
 <td class="diff">14:45</t

The fields that we want are the following:
- Name: `<td class="f-__fullname last">`
- Age Group: `<td class="f-age_class last">`
- Bib Number: `<td class="f-start_no_text last">`
- Age: `<td class="f-age last">`
- City/State: `<td class="f-__city_state last">`
- Split Times: `<td class="time">`

The first five fields are the first five `td` elements, so the slice `tdTag[0:5]` retrieves them.

The split times appear at every seventh `td` element, starting at element 13 (positions are zero-based indices into `tdTag`):
- 05K: Element 13
- 10K: Element 20
- 15K: Element 27
- 20K: Element 34
- HALF: Element 41
- 25K: Element 48
- 30K: Element 55
- 35K: Element 62
- 40K: Element 69
- Finish: Element 76
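
As a quick check of the positions listed above, the sketch below expands the slice `13:77:7` used in the next cells into explicit indices and pairs each one with its split label. The names `labels`, `indices`, and `split_index` are introduced here for illustration only.

```python
# Split labels in the order they appear on the profile page.
labels = ["05K", "10K", "15K", "20K", "HALF", "25K", "30K", "35K", "40K", "Finish"]

# The slice tdTag[13:77:7] visits these zero-based positions.
indices = list(range(13, 77, 7))

# Pair each label with the position of its <td class="time"> cell.
split_index = dict(zip(labels, indices))
print(split_index)
```

The stop value of 77 is one past the last wanted element (76), because Python slices exclude the stop index.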

In [5]:
# Obtains name, age group, bib number, age, city/state
first5 = [x.get_text() for x in tdTag[0:5]]
first5

['Cherono, Lawrence (KEN)', '30-34', '4', '31', 'Chepkoilel']
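
Slicing by position works only as long as every profile page lists its cells in the same order. As an alternative sketch, and not the approach this notebook takes, the same five fields can be looked up by the class names listed earlier. The HTML fragment is copied from the output above; `soup_demo`, `field_classes`, and `record` are hypothetical names introduced for this example.

```python
from bs4 import BeautifulSoup

# HTML fragment copied from the profile output shown earlier.
html = """
<table><tr>
<td class="f-__fullname last">Cherono, Lawrence (KEN)</td>
<td class="f-age_class last">30-34</td>
<td class="f-start_no_text last">4</td>
<td class="f-age last">31</td>
<td class="f-__city_state last">Chepkoilel</td>
</tr></table>
"""
soup_demo = BeautifulSoup(html, "html.parser")

# Hypothetical mapping from column label to the cell's identifying class.
field_classes = {
    "Name (CTZ)": "f-__fullname",
    "Age Group": "f-age_class",
    "Bib Number": "f-start_no_text",
    "Age": "f-age",
    "City,State": "f-__city_state",
}

record = {}
for label, css_class in field_classes.items():
    # class_ matches a cell when any one of its classes equals css_class.
    cell = soup_demo.find("td", class_=css_class)
    record[label] = cell.get_text(strip=True) if cell else None
print(record)
```

A missing cell yields `None` for that field instead of silently shifting every later value by one position.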

In [6]:
# Obtains split times
times = [x.get_text() for x in tdTag[13:77:7]]
times

['00:14:45',
 '00:29:28',
 '00:44:11',
 '00:59:01',
 '01:02:15',
 '01:13:57',
 '01:28:59',
 '01:43:54',
 '01:59:10',
 '02:05:45']

In [7]:
# Brings the 2 together into 1 list containing everything we want
info = first5 + times
info

['Cherono, Lawrence (KEN)',
 '30-34',
 '4',
 '31',
 'Chepkoilel',
 '00:14:45',
 '00:29:28',
 '00:44:11',
 '00:59:01',
 '01:02:15',
 '01:13:57',
 '01:28:59',
 '01:43:54',
 '01:59:10',
 '02:05:45']
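
Each combined list must contain exactly 15 values, one per column of the final data frame. The sketch below pairs the list above with the column headers used later in the notebook and raises an error if the lengths differ, which would indicate a profile page with an unexpected layout. The `record` name and the length check are additions for illustration.

```python
# Column headers, as defined later in the notebook.
col = ["Name (CTZ)", "Age Group", "Bib Number", "Age", "City,State",
       "05K", "10K", "15K", "20K", "HALF", "25K", "30K", "35K", "40K", "Finish"]

# Combined list for the first-place runner, copied from the output above.
info = ["Cherono, Lawrence (KEN)", "30-34", "4", "31", "Chepkoilel",
        "00:14:45", "00:29:28", "00:44:11", "00:59:01", "01:02:15",
        "01:13:57", "01:28:59", "01:43:54", "01:59:10", "02:05:45"]

# A short or long list means the page did not match the expected layout.
if len(info) != len(col):
    raise ValueError(f"expected {len(col)} fields, got {len(info)}")

# Pair every header with its value.
record = dict(zip(col, info))
print(record["Name (CTZ)"], record["HALF"], record["Finish"])
```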

In [8]:
h4tags = soup.find_all("h4") # Each runner's information/profile links are separated by the common "h4" tag
marathonList = []
for h4 in h4tags:
    profileURL = "https://results.chicagomarathon.com/2019/" + h4.find('a')['href']
    temp = requests.get(profileURL) # Pulling runner's profile
    tempSoup = BS(temp.text, "html.parser")
    tdTag = tempSoup.find_all("td")
    tempList = [x.get_text() for x in tdTag[0:5]] + [x.get_text() for x in tdTag[13:77:7]]
    marathonList.append(tempList) # Appends a list containing a runner's information from their profile
    time.sleep(2) # Ensures that there is a delay between each get request
print(marathonList)

[['Cherono, Lawrence (KEN)', '30-34', '4', '31', 'Chepkoilel', '00:14:45', '00:29:28', '00:44:11', '00:59:01', '01:02:15', '01:13:57', '01:28:59', '01:43:54', '01:59:10', '02:05:45'], ['Debela, Dejene (ETH)', '20-24', '38', '24', 'West Chester', '00:14:45', '00:29:29', '00:44:12', '00:59:02', '01:02:16', '01:13:56', '01:28:59', '01:43:53', '01:59:08', '02:05:46'], ['Mengstu, Asefa (ETH)', '30-34', '5', '31', 'Addis Ababa', '00:14:46', '00:29:29', '00:44:12', '00:59:02', '01:02:15', '01:13:55', '01:28:59', '01:43:53', '01:59:08', '02:05:48'], ['Karoki, Bedan (KEN)', '25-29', '9', '29', 'Mbuyu', '00:14:45', '00:29:27', '00:44:10', '00:59:02', '01:02:15', '01:13:54', '01:28:59', '01:43:53', '01:59:09', '02:05:53'], ['Abdi, Bashir (BEL)', '30-34', '10', '30', 'Nijmegen', '00:14:47', '00:29:30', '00:44:23', '00:59:32', '01:02:54', '01:14:53', '01:30:22', '01:45:15', '01:59:53', '02:06:14'], ['Tura, Seifu (ETH)', '20-24', '39', '22', 'Addis Abeba', '00:14:46', '00:29:29', '00:44:12', '00:59:
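
The loop above pauses for two seconds after every request. A minimal sketch of the same pacing pattern as a reusable helper follows. `fetch_all`, `fetch`, and `sleep` are names introduced here, and the network call is passed in as an argument so that the pacing logic can be run offline.

```python
import time

def fetch_all(urls, fetch, delay=2.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing `delay` seconds between calls."""
    pages = []
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay)  # no pause is needed before the first request
        pages.append(fetch(url))
    return pages

# Offline demonstration: str.upper stands in for the network call and
# pauses.append records each requested delay instead of actually sleeping.
pauses = []
demo_pages = fetch_all(["a", "b", "c"], fetch=str.upper, delay=2.0, sleep=pauses.append)
print(demo_pages, pauses)
```

In the notebook, `fetch` could be something like `lambda url: requests.get(url, timeout=10).text`, with `sleep` left at its default so that real pauses occur between requests.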

In [9]:
# "col" contains the header for the data frame
col = ["Name (CTZ)","Age Group","Bib Number","Age","City,State","05K","10K","15K","20K","HALF","25K","30K","35K","40K","Finish"]
df = pd.DataFrame(marathonList, columns = col)
df

| | Name (CTZ) | Age Group | Bib Number | Age | City,State | 05K | 10K | 15K | 20K | HALF | 25K | 30K | 35K | 40K | Finish |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Cherono, Lawrence (KEN) | 30-34 | 4 | 31 | Chepkoilel | 00:14:45 | 00:29:28 | 00:44:11 | 00:59:01 | 01:02:15 | 01:13:57 | 01:28:59 | 01:43:54 | 01:59:10 | 02:05:45 |
| 1 | Debela, Dejene (ETH) | 20-24 | 38 | 24 | West Chester | 00:14:45 | 00:29:29 | 00:44:12 | 00:59:02 | 01:02:16 | 01:13:56 | 01:28:59 | 01:43:53 | 01:59:08 | 02:05:46 |
| 2 | Mengstu, Asefa (ETH) | 30-34 | 5 | 31 | Addis Ababa | 00:14:46 | 00:29:29 | 00:44:12 | 00:59:02 | 01:02:15 | 01:13:55 | 01:28:59 | 01:43:53 | 01:59:08 | 02:05:48 |
| 3 | Karoki, Bedan (KEN) | 25-29 | 9 | 29 | Mbuyu | 00:14:45 | 00:29:27 | 00:44:10 | 00:59:02 | 01:02:15 | 01:13:54 | 01:28:59 | 01:43:53 | 01:59:09 | 02:05:53 |
| 4 | Abdi, Bashir (BEL) | 30-34 | 10 | 30 | Nijmegen | 00:14:47 | 00:29:30 | 00:44:23 | 00:59:32 | 01:02:54 | 01:14:53 | 01:30:22 | 01:45:15 | 01:59:53 | 02:06:14 |
| 5 | Tura, Seifu (ETH) | 20-24 | 39 | 22 | Addis Abeba | 00:14:46 | 00:29:29 | 00:44:12 | 00:59:02 | 01:02:15 | 01:13:56 | 01:28:59 | 01:43:53 | 02:00:13 | 02:08:35 |
| 6 | Chumba, Dickson (KEN) | 30-34 | 6 | 32 | Kipngoror | 00:14:45 | 00:29:27 | 00:44:11 | 00:59:01 | 01:02:14 | 01:13:55 | 01:28:58 | 01:44:27 | 02:01:05 | 02:09:11 |
| 7 | Farah, Mo (GBR) | 35-39 | 1 | 36 | Teddington | 00:14:47 | 00:29:29 | 00:44:23 | 00:59:32 | 01:02:54 | 01:14:54 | 01:30:30 | 01:46:23 | 02:02:28 | 02:09:58 |
| 8 | Riley, Jacob (USA) | 30-34 | 16 | 30 | Boulder | 00:15:33 | 00:30:59 | 00:46:30 | 01:01:59 | 01:05:24 | 01:17:33 | 01:33:17 | 01:48:44 | 02:04:02 | 02:10:36 |
| 9 | Mock, Jerrell (USA) | 20-24 | 36 | 24 | Fort Collins | 00:15:33 | 00:30:58 | 00:46:30 | 01:01:59 | 01:05:25 | 01:17:32 | 01:33:17 | 01:48:44 | 02:04:02 | 02:10:37 |
