## Power10 Scraping Challenge

> If you have not had a chance to check out the [Lazy Guide to Web Scraping](https://github.com/sportsdatasolutions/sport_x_code_eis/blob/master/3.projects/LazyGuides/web_scraping.md) please do so! You can find the code for the guide in this [Deepnote Project](https://deepnote.com/project/0d7f30b4-7eb4-4d7e-b601-28085e59e0d3#%2F2.HTMLTable.ipynb).

#### The challenge is to extend this scraper so it scrapes `4 years/seasons` worth of rankings data from the `Men's` and `Women's` `60m`, `100m` and `200m` events.

1. ##### Below is a solution to the Data Handling Power10 Bonus Challenge. ***Note the `helpers.py` module containing the functions used in the loop below to clean each DataFrame***.

2. ##### **Simply** make the script below scrape the same tables (M, W, 60, 100, 200) from the `2020`, `2019`, `2018` and `2017` seasons. Don't be afraid to edit the `helpers.py` module. 

3. ##### This web scraper will take a ***while*** to run if scraping all 4 seasons. To speed up development, have it scrape only what is necessary to test the expected behaviour. You could either reduce the number of values to loop through or you could use [loop control statements](https://www.digitalocean.com/community/tutorials/how-to-use-break-continue-and-pass-statements-when-working-with-loops-in-python-3).
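
One way to keep test runs short is sketched below: build every combination up front, then visit only the first few while developing. The `build_url` helper and the `MAX_PAGES` name are assumptions made for this illustration; the URL pattern is the one used by the script in this notebook.

```python
from itertools import islice, product

genders = ['M', 'W']
events = ['60', '100', '200']
seasons = ['2020', '2019', '2018', '2017']

BASE_URL = "https://www.thepowerof10.info/rankings/rankinglist.aspx"


def build_url(event, sex, year):
    # Hypothetical helper: fills in the query string used by the script in this notebook
    return f"{BASE_URL}?event={event}&agegroup=ALL&sex={sex}&year={year}"


# Every (season, sex, event) combination: 4 * 2 * 3 = 24 pages for a full scrape
combinations = list(product(seasons, genders, events))

# While developing, only visit the first few combinations; set to None for the full run
MAX_PAGES = 2

for year, sex, event in islice(combinations, MAX_PAGES):
    url = build_url(event, sex, year)
    print(url)  # swap in the pd.read_html / helpers calls once the loop behaves
```

A `break` at the end of the loop body would have a similar effect, stopping after the first page.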


In [None]:
import pandas as pd
import helpers

gender = ['M', 'W']
events = ['60', '100', '200']

for sex in gender:
    for event in events:
        url = f"https://www.thepowerof10.info/rankings/rankinglist.aspx?event={event}&agegroup=ALL&sex={sex}&year=2020"

        df = pd.read_html(url, match='Rank')[1]

        df = helpers.clean_power10_table(df, event, sex)

        helpers.write_power10_table_to_csv(df, event, sex)

## Bonus Task! 
> Add the following to your solution above...

#### Edit your solution so it also adds `two` new columns to your data, one containing the `Athlete Profile URLs` and one containing the `Coach Profile URLs`.
1. ##### **Start** by inspecting the relevant links you want to scrape (within the HTML Table) e.g. Open your Browser Dev Tools > Click the Inspector Tool > Click on an athlete link in a table > you should see the HTML (anchor tag with `href` link) containing the link we want.

2. ##### **Pandas is limited** in that it ***can't*** obtain this additional information **embedded within the HTML**; for that we'll want to use **BeautifulSoup**.

3. ##### The **example code** below uses `BeautifulSoup4` to obtain all the HTML tables (similar to what pandas does behind the scenes). We use the `find_all` method and reference the HTML elements that we want data from e.g. `table`.

4. ##### **Now attempt to use BeautifulSoup to create a list of Athlete and Coach URLs to add to the DataFrames within your loop from your main challenge solution.** Even better, once you have an MVP/working solution, try to ***refactor*** it into your `helpers.py` module.
    
    **Note**: Notice that the URLs are not complete URLs. For them to be actual links, you'll have to prepend `https://www.thepowerof10.info/` to each URL you scrape off the tables.

    **Hint**: Scrape the lists of athlete URLs and coach URLs from the HTML table and add them to the main DataFrame you are constructing before proceeding to the cleaning phase!
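
As a small illustration of the note above, the standard library's `urllib.parse.urljoin` can turn scraped hrefs into full links while leaving `'N/A'` placeholders alone. The sample hrefs and the `to_full_url` helper are made up for this sketch; check the real `href` values in your browser's inspector.

```python
from urllib.parse import urljoin

BASE_URL = 'https://www.thepowerof10.info/'


def to_full_url(href):
    # Hypothetical helper: leave the 'N/A' placeholder untouched, complete everything else
    if href == 'N/A':
        return href
    return urljoin(BASE_URL, href)


# Made-up hrefs in the shapes a scrape might return (with and without a leading slash)
scraped = [
    '/athletes/profile.aspx?athleteid=123',
    'N/A',
    'athletes/profile.aspx?athleteid=456',
]

full_urls = [to_full_url(href) for href in scraped]
print(full_urls)
```

`urljoin` copes with hrefs that start with a slash and those that don't, which plain string concatenation would not.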




In [None]:
import requests
from bs4 import BeautifulSoup

url = 'https://www.thepowerof10.info/rankings/rankinglist.aspx?event=100&agegroup=ALL&sex=M&year=2020'

response = requests.get(url)

soup = BeautifulSoup(response.content, 'lxml')

len(soup.find_all('table')) # number of tables scraped

table = soup.find_all('table')[6] # 7th table in the list is the HTML we are after  

athlete_url_list = []

# Loop through all Table Rows (tr) and add links to list
for tr in table.find_all('tr'):
    # Use a try/except block so rows without a link (e.g. header rows) don't stop the loop
    try:
        # Append the href of the anchor (a) element in the Athlete Name cell (('td')[6]), if present
        athlete_url_list.append(tr.find_all('td')[6].a.get('href'))
    except (IndexError, AttributeError):
        # Append 'N/A' when the row has no such cell or the cell has no link
        athlete_url_list.append('N/A')

athlete_url_list
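
The loop above could be wrapped in a reusable function that takes the cell position as an argument, which is one way to collect athlete and coach links with the same code. The sketch below runs on a small made-up HTML table rather than the live site, so the hrefs, the column positions (`0` and `1`) and the `extract_links` name are all assumptions. On the real page you would pass the indexes you find with the inspector.

```python
from bs4 import BeautifulSoup

# Made-up table standing in for the rankings HTML (the header row has no <td> cells)
sample_html = """
<table>
  <tr><th>Name</th><th>Coach</th></tr>
  <tr><td><a href="/athletes/profile.aspx?athleteid=1">Athlete A</a></td>
      <td><a href="/coaches/profile.aspx?coachid=9">Coach Z</a></td></tr>
  <tr><td><a href="/athletes/profile.aspx?athleteid=2">Athlete B</a></td>
      <td></td></tr>
</table>
"""


def extract_links(table, column_index):
    # Hypothetical helper: one entry per table row, 'N/A' where the cell or link is missing
    links = []
    for tr in table.find_all('tr'):
        cells = tr.find_all('td')
        anchor = cells[column_index].a if len(cells) > column_index else None
        links.append(anchor.get('href') if anchor is not None else 'N/A')
    return links


soup = BeautifulSoup(sample_html, 'html.parser')
table = soup.find('table')

athlete_urls = extract_links(table, 0)
coach_urls = extract_links(table, 1)
print(athlete_urls)
print(coach_urls)
```

Each list has one entry per `<tr>`, header rows included, so it is worth comparing its length with the DataFrame before assigning it as a column.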

#### Bonus Bonus Task!
> With your solution including the first bonus...

#### Edit your solution so it also adds a `nation` column so we can differentiate between the `English`, `Welsh`, `Scottish` and `Northern Irish` athletes.

1. ##### Firstly you'll realise there is no way of telling from the existing data who is English, Scottish, Welsh etc.

2. ##### You'll want to pay attention to the Power10 Website, specifically the `Region/Nation` filter.

3. ##### You'll most likely want to `scrape` the regional/national tables separately, `merge` them together, then `sort` out the rankings for each Gender, Event and Season...

    **Note**: A full scrape should take between 6 and 7 minutes.
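
The scrape, merge and sort steps can be sketched with pandas on small in-memory tables. The frames, the `Name` and `Perf` column names and the values below are invented stand-ins for the cleaned regional tables, and `pd.concat` is used here to stack them (one reading of "merge"). Only the pattern matters: tag each table with its nation, combine, then re-sort and re-rank.

```python
import pandas as pd

# Invented stand-ins for cleaned regional tables of one gender, event and season
regional_tables = {
    'England': pd.DataFrame({'Name': ['Athlete A', 'Athlete B'], 'Perf': [10.05, 10.31]}),
    'Scotland': pd.DataFrame({'Name': ['Athlete C'], 'Perf': [10.12]}),
    'Wales': pd.DataFrame({'Name': ['Athlete D'], 'Perf': [10.40]}),
}

tagged = []
for nation, df in regional_tables.items():
    # Record where each row came from before the tables are combined
    tagged.append(df.assign(nation=nation))

# Stack the regional tables, then rebuild the overall ranking (fastest time first)
combined = pd.concat(tagged, ignore_index=True)
combined = combined.sort_values('Perf').reset_index(drop=True)
combined['Rank'] = combined['Perf'].rank(method='min').astype(int)

print(combined)
```

In a full solution this would run once per gender, event and season, with the scraped regional tables in place of the invented ones.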
