# Scraping marathon results
#### Seungmin Lee
---
## Objective
---
Our goal for this project is to retrieve data from the [2019 Chicago Marathon](https://results.chicagomarathon.com/well-known/2019/) and create a dataframe showing the top 50 male runners. 

For that purpose we will be using the following tools:
- `requests`: Retrieves data from a website.
- `BeautifulSoup`: Helps parse and search the HTML.
- `pandas`: Builds the data frame containing the top 50 male runners.
- `time`: Adds a delay between GET requests so we do not overload the server.

In [1]:
import requests
from bs4 import BeautifulSoup as BS
import pandas as pd
import time

In [2]:
# Query parameters for the top 50 male finishers
params = {"event":"MAR", "search[sex]":"M", "search[age_class]":"%", "num_results":"50"}
# .text returns the page's HTML as a string
marathonURL = requests.get("https://results.chicagomarathon.com/well-known/2019/?pid=list", params = params).text
soup = BS(marathonURL, "html.parser") # Name the parser explicitly so bs4 does not have to guess one
#print(soup.prettify())
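
To see what happens to the `params` dictionary, the query string can be previewed offline. The sketch below uses the standard library's `urlencode` as a stand-in, so no request is sent. The names `base_url`, `query`, and `preview_url` are ours, and we assume `requests` appends the encoded pairs after the existing `?pid=list` in the same way.

```python
from urllib.parse import urlencode

params = {"event": "MAR", "search[sex]": "M", "search[age_class]": "%", "num_results": "50"}
base_url = "https://results.chicagomarathon.com/well-known/2019/?pid=list"

# Percent-encode keys and values: "[" -> %5B, "]" -> %5D, "%" -> %25
query = urlencode(params)
# The base URL already carries a query string, so further pairs are joined with "&"
preview_url = base_url + "&" + query
print(preview_url)
```

The encoded `search%5Bsex%5D=M` and `search%5Bage_class%5D=%25` pairs are the same ones that appear in the profile URL used in the next cell.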

Since we will be pulling information for all 50 male runners, we first need to know how a runner's profile page is laid out. To start, we will look at the profile of the first-place male runner, Lawrence Cherono.

In [3]:
# URL of Lawrence Cherono's profile page
test1 = requests.get("https://results.chicagomarathon.com/well-known/2019/?content=detail&fpid=list&pid=list&idp=999999107FA317000023A4AD&lang=EN_CAP&event=MAR&lang=EN_CAP&num_results=50&search%5Bsex%5D=M&search%5Bage_class%5D=%25&search_event=MAR").text
soup1 = BS(test1, "html.parser")
# Most of the data we will be working with is in "td" elements
tdTag = soup1.find_all("td")
#tdTag

The information we want is the following:
- Name: `<td class="f-__fullname last">`
- Age Group: `<td class="f-age_class last">`
- Bib Number: `<td class="f-start_no_text last">`
- Age: `<td class="f-age last">`
- City/State: `<td class="f-__city_state last">`
- Split Times: `<td class="time">`

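The first five `td` elements can be obtained with the slice `tdTag[0:5]`. As the output of cell 4 shows, on this page those five cells hold the event name, the year, the runner's name, the age group, and the bib number, so the slice does not reach the age or city/state cells.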
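
Because positional slices depend on the page layout, an alternative is to look cells up by the class names listed above. This is a minimal sketch, not what the notebook does: `sample_html` is a hypothetical fragment (only the class names and the three values come from this notebook), and `FIELDS` and `extract_fields` are names introduced here for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating part of a profile page
sample_html = """
<table>
  <tr><td class="f-__fullname last">Cherono, Lawrence (KEN)</td></tr>
  <tr><td class="f-age_class last">30-34</td></tr>
  <tr><td class="f-start_no_text last">4</td></tr>
</table>
"""

# Label -> CSS class, taken from the list above
FIELDS = {
    "Name": "f-__fullname",
    "Age Group": "f-age_class",
    "Bib Number": "f-start_no_text",
    "Age": "f-age",
    "City/State": "f-__city_state",
}

def extract_fields(html):
    """Return a dict of label -> cell text, with None for cells that are missing."""
    soup = BeautifulSoup(html, "html.parser")
    record = {}
    for label, css_class in FIELDS.items():
        # class_ matches a td whose class list contains css_class
        cell = soup.find("td", class_=css_class)
        record[label] = cell.get_text(strip=True) if cell else None
    return record

record = extract_fields(sample_html)
print(record)
```

The sample fragment has no age or city/state cell, so those two entries come back as `None` instead of raising an error.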

The split times sit at the following positions in `tdTag`:
- 05K: Element 13
- 10K: Element 20
- 15K: Element 27
- 20K: Element 34
- HALF: Element 41
- 25K: Element 48
- 30K: Element 55
- 35K: Element 62
- 40K: Element 69
- Finish: Element 76
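
The positions are evenly spaced, which is why a single slice with a step can collect them. As a quick check, this sketch lists the indices that the slice `tdTag[13:77:7]` from cell 5 visits and pairs them with the split labels (`labels`, `indices`, and `index_by_label` are names introduced here).

```python
labels = ["05K", "10K", "15K", "20K", "HALF", "25K", "30K", "35K", "40K", "Finish"]

# A slice [13:77:7] visits the same indices as range(13, 77, 7)
indices = list(range(13, 77, 7))
index_by_label = dict(zip(labels, indices))

for label, idx in index_by_label.items():
    print(f"{label}: element {idx}")
```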

In [4]:
# First five td cells; per the output below these are event, year, name, age group, bib number
first5 = [x.get_text() for x in tdTag[0:5]]
first5

['Marathon', '2019', 'Cherono, Lawrence (KEN)', '30-34', '4']

In [5]:
# Obtains split times
times = [x.get_text() for x in tdTag[13:77:7]]
times

['14:45',
 '14:43',
 '14:43',
 '14:50',
 '03:14',
 '11:42',
 '15:02',
 '14:55',
 '15:16',
 '06:35']
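
One way to sanity-check these values is to add them up. The sketch below parses each `MM:SS` string with the standard library and sums the durations; `to_timedelta` is a helper name introduced here. The total is 2:05:45, a plausible marathon finish time, which suggests these cells hold per-segment durations and not cumulative clock times.

```python
from datetime import timedelta

# Values copied from the output of cell 5
times = ['14:45', '14:43', '14:43', '14:50', '03:14',
         '11:42', '15:02', '14:55', '15:16', '06:35']

def to_timedelta(mm_ss):
    """Convert an 'MM:SS' string into a timedelta."""
    minutes, seconds = mm_ss.split(":")
    return timedelta(minutes=int(minutes), seconds=int(seconds))

# Start the sum from a zero timedelta so sum() can add durations
total = sum((to_timedelta(t) for t in times), timedelta())
print(total)  # 2:05:45
```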

In [6]:
# Brings the 2 together into 1 list containing everything we want
info = first5 + times
info

['Marathon',
 '2019',
 'Cherono, Lawrence (KEN)',
 '30-34',
 '4',
 '14:45',
 '14:43',
 '14:43',
 '14:50',
 '03:14',
 '11:42',
 '15:02',
 '14:55',
 '15:16',
 '06:35']

In [7]:
# Each runner's profile link sits inside an "h4" tag on the results list
h4tags = soup.find_all("h4")
marathonList = []
for h4 in h4tags:
    profileURL = "https://results.chicagomarathon.com/2019/" + h4.find('a')['href']
    temp = requests.get(profileURL) # Pulling runner's profile
    tempSoup = BS(temp.text, "html.parser")
    tdTag = tempSoup.find_all("td")
    # Same slices as in cells 4 and 5: the first five cells plus the ten time cells
    tempList = [x.get_text() for x in tdTag[0:5]] + [x.get_text() for x in tdTag[13:77:7]]
    marathonList.append(tempList) # Appends a list containing a runner's information from their profile
    time.sleep(2) # Pauses 2 seconds between GET requests
#print(marathonList)
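
The loop assumes every profile page has at least 77 `td` cells. If a page were shorter, the slices would silently return fewer than 15 values. A possible guard is sketched below, working on plain lists of cell text so it can be tried without any network access; `EXPECTED_CELLS` and `row_from_cells` are hypothetical names, not part of the notebook.

```python
EXPECTED_CELLS = 77  # tdTag[13:77:7] reads indices up to 76

def row_from_cells(cell_texts):
    """Return the 15-value row, or None when the page has too few td cells."""
    if len(cell_texts) < EXPECTED_CELLS:
        return None
    return cell_texts[0:5] + cell_texts[13:77:7]

# Stand-ins for the text of each td on a complete and an incomplete page
full_page = [f"cell{i}" for i in range(80)]
short_page = [f"cell{i}" for i in range(30)]

print(row_from_cells(full_page))
print(row_from_cells(short_page))
```

Inside the loop, a `None` result could be skipped or logged instead of being appended to `marathonList`.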

In [8]:
# "col" contains the header for the data frame
# The first five labels follow the order of tdTag[0:5]; "Event" and "Year" are inferred from the cell values
col = ["Event","Year","Name (CTZ)","Age Group","Bib Number","05K","10K","15K","20K","HALF","25K","30K","35K","40K","Finish"]
df = pd.DataFrame(marathonList, columns = col)
df

| | Event | Year | Name (CTZ) | Age Group | Bib Number | 05K | 10K | 15K | 20K | HALF | 25K | 30K | 35K | 40K | Finish |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Exhibition Handcycle | 2019 | Robinson, Anthony (USA) | – | 335 | 10:52 | 10:46 | 11:02 | 11:32 | 02:37 | 09:13 | 10:46 | 10:57 | 10:54 | 04:51 |
| 1 | Marathon | 2019 | Cherono, Lawrence (KEN) | 30-34 | 4 | 14:45 | 14:43 | 14:43 | 14:50 | 03:14 | 11:42 | 15:02 | 14:55 | 15:16 | 06:35 |
| 2 | Wheelchair | 2019 | Romanchuk, Daniel (USA) | MT53 | 201 | 11:03 | 10:34 | 10:52 | 11:13 | 02:13 | 08:15 | 10:12 | 10:39 | 10:45 | 04:40 |
| 3 | Exhibition Handcycle | 2019 | Walton, Jess (USA) | – | 340 | 10:51 | 10:46 | 11:03 | 11:32 | 02:37 | 09:12 | 10:47 | 11:13 | 11:08 | 05:23 |
| 4 | Marathon | 2019 | Debela, Dejene (ETH) | 20-24 | 38 | 14:45 | 14:44 | 14:43 | 14:50 | 03:14 | 11:40 | 15:03 | 14:54 | 15:15 | 06:38 |
| 5 | Wheelchair | 2019 | Weir, David (GBR) | MT53 | 218 | 11:03 | 10:35 | 10:50 | 11:14 | 02:17 | 08:27 | 10:52 | 11:40 | 11:39 | 04:54 |
| 6 | Exhibition Handcycle | 2019 | Morgan, Carl (USA) | – | 334 | 11:40 | 12:14 | 10:33 | 12:17 | 02:52 | 10:04 | 11:53 | 12:26 | 12:22 | 06:15 |
| 7 | Marathon | 2019 | Mengstu, Asefa (ETH) | 30-34 | 5 | 14:46 | 14:43 | 14:43 | 14:50 | 03:13 | 11:40 | 15:04 | 14:54 | 15:15 | 06:40 |
| 8 | Wheelchair | 2019 | Van Dyk, Ernst (RSA) | MT53 | 202 | 11:03 | 10:37 | 10:50 | 11:13 | 02:21 | 08:24 | 10:51 | 11:40 | 11:39 | 04:54 |
| 9 | Exhibition Handcycle | 2019 | Sapp, Greg (USA) | – | 338 | 13:10 | 12:16 | 13:02 | 13:25 | 03:01 | 10:45 | 12:25 | 13:27 | 14:03 | 07:11 |
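
The output above interleaves three event types (Exhibition Handcycle, Marathon, Wheelchair). If only the foot race is wanted, one possible follow-up is a boolean filter on the column holding the event name. The sketch below rebuilds three rows from the output in a small frame of its own; the column labels and the names `sample` and `runners_only` are assumptions made for this example.

```python
import pandas as pd

# Three rows copied from the output above; the column labels are hypothetical
rows = [
    ["Exhibition Handcycle", "Robinson, Anthony (USA)", "335"],
    ["Marathon", "Cherono, Lawrence (KEN)", "4"],
    ["Wheelchair", "Romanchuk, Daniel (USA)", "201"],
]
sample = pd.DataFrame(rows, columns=["Event", "Name (CTZ)", "Bib Number"])

# Keep only rows whose event is the marathon, then renumber the index from 0
runners_only = sample[sample["Event"] == "Marathon"].reset_index(drop=True)
print(runners_only)
```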
