# Athletes - Web Scraping

>- A premium login is required to extract segment information; a free account doesn't have full access to all the data.
>- This script can use either a list of logins or a single login. To use a single login, set it up through the **function** `change_user()`, with `number_logins=1` and any login of your choice.
>- A web scraping process needs to handle a lost internet connection. For this, the **function** `connect()` tests the connection before each action, and each action also runs inside a `while` loop to guarantee that all the information is scraped.

Last updated on February 18, 2021.
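
The login rotation mentioned in the notes above can be illustrated with a minimal sketch. The helper name `next_login_index` is hypothetical (it does not exist in the script); it only mirrors the wrap-around logic that `user_access` applies to `aux_id`, `id_login` and `number_logins`.

```python
def next_login_index(current, first, number_logins):
    """Return the index of the next login, wrapping back to `first`.

    Hypothetical helper: it reproduces the arithmetic used in `user_access`,
    where the index returns to the first login once `number_logins`
    logins have been used.
    """
    candidate = current + 1
    if candidate - first >= number_logins:
        return first
    return candidate


# Rotate through 3 logins that start at index 0
order = []
idx = 0
for _ in range(5):
    idx = next_login_index(idx, first=0, number_logins=3)
    order.append(idx)

print(order)  # [1, 2, 0, 1, 2]
```

With `number_logins=1` the function always returns `first`, which matches the single-login setup.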

<h1>Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Setup" data-toc-modified-id="Setup-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Setup</a></span><ul class="toc-item"><li><span><a href="#Environment" data-toc-modified-id="Environment-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Environment</a></span></li></ul></li><li><span><a href="#Steps-functions" data-toc-modified-id="Steps-functions-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Steps functions</a></span></li><li><span><a href="#Informations-to-scrap" data-toc-modified-id="Informations-to-scrap-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Informations to scrap</a></span></li><li><span><a href="#Create-session,-open-page-and-change-language-to-English" data-toc-modified-id="Create-session,-open-page-and-change-language-to-English-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Create session, open page and change language to English</a></span></li><li><span><a href="#Scraping" data-toc-modified-id="Scraping-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Scraping</a></span></li></ul></div>

## Setup

In [None]:
import time
import datetime
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import pyversions  # https://pypi.org/project/pyversions/
import sys, os
import urllib.request
import re
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
from selenium.common.exceptions import NoSuchElementException
pyversions.versions();

### Environment

In [None]:
#parameters
export = "NEW YORK.csv"

In [None]:
#A group of logins is used to access the Strava platform and scrape the information
#id_login is the first index used to rotate the logins

path2 = r'../Datasets'
global logins
logins=pd.read_excel(os.path.join(path2,'logins.xlsx')) 
id_login=0 #choose the first index in the logins list to rotate the accesses
number_logins = 1 #choose the number of logins to use
count_logins = 1
aux_id = id_login #variable used to compare the current position of the index with the initial
club = 1 #1 - the user does not belong to Strava clubs, 2 - the user belongs to Strava clubs
segments = 'https://www.strava.com/segments/8386468'

last_page = 850 # a large upper bound for the pagination loop; the loop stops earlier when the last page is reached

In [None]:
logins.columns

## Steps functions

It's important to evaluate the page behavior through the DOM tree. Most of the functions locate elements by their XPath. For example, to select an age group, the script finds the element

```python
element=driver.find_element_by_xpath('//*[@id="premium-enhanced"]/ul/ul[1]/li['+str(ag)+']/a')
```

and selects each age group through the **ag** variable.
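
As a small illustration of how the XPath changes with the **ag** variable, here is a sketch that only builds the string. The helper `age_group_xpath` is hypothetical and needs no browser; the `club` parameter stands for the `ul[1]` / `ul[2]` position that the script switches for users who belong to Strava clubs.

```python
def age_group_xpath(ag, club=1):
    """Build the XPath of one age group link (hypothetical helper).

    ag   -- list position of the age group (2 to 10 in this script)
    club -- 1 or 2, the position of the <ul> that holds the age groups
    """
    return f'//*[@id="premium-enhanced"]/ul/ul[{club}]/li[{ag}]/a'


print(age_group_xpath(4))
# //*[@id="premium-enhanced"]/ul/ul[1]/li[4]/a
```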

Below is the list of functions with their descriptions:

- **open_section** - creates a new Selenium WebDriver instance and loads the login page
- **login** - fills in the login form and clicks the login button
- **user_access** - closes the session and starts a new one with another login; works together with the **change_user** function
- **change_user** - changes the logged-in user when you have a list of logins. It's important to avoid the 429 error (too many requests)
- **select_english_language** - changes the display language to English, so that all the information is scraped in English
- **go_link** - loads the segment page
- **select_gender** - selects each gender filter; the results are labeled M for male and F for female
- **select_age_group** - selects each age group through its XPath
- **next_page** - selects the next page in the footer. The **li [ ]** index is the current page number plus 2. For example

```sh
...ul/li['+str(page+2)+']/a')
```

If you are on page 2, page 3 is loaded through

```
...ul/li[4]/a')
```

There is an exception for this part: `NoSuchElementException` is raised when the element isn't found, which in this case means that the current page is the last one in the table.

In the HTML there is a specific area for page selection, and the XPath of the page links repeats after page 5. For this reason, there is a condition that keeps the value at 6 for page numbers greater than 5.

>As a consequence, to go to page 8 you need to pass through each previous page.

```python
if page>=6:
    page=6
element=driver.find_element_by_xpath('//*[@id="results"]/nav/ul/li['+str(page+2)+']/a')
```
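
The index arithmetic above can be checked without a browser. The following is a minimal sketch; the helper names `pagination_li_index`, `next_page_xpath` and `clicks_to_reach` are hypothetical and only restate the rule used in `next_page`.

```python
def pagination_li_index(page):
    """Position of the "next page" link in the footer list.

    The current page number is capped at 6 and then shifted by 2,
    as in the condition shown above.
    """
    return min(page, 6) + 2


def next_page_xpath(page):
    """XPath of the link that loads the page after `page`."""
    return f'//*[@id="results"]/nav/ul/li[{pagination_li_index(page)}]/a'


def clicks_to_reach(target_page):
    """Pages are visited one by one, so reaching page N from page 1 takes N - 1 clicks."""
    return target_page - 1


print(pagination_li_index(2))   # 4
print(pagination_li_index(8))   # 8
print(next_page_xpath(2))       # //*[@id="results"]/nav/ul/li[4]/a
print(clicks_to_reach(8))       # 7
```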

- **page_loaded** - checks whether the page was loaded. If it was not, the process restarts from the current age_group selection. During this reload the script doesn't store data; it resumes storing only after reaching the page where it stopped.
- **scraping** - stores the activity information. Personal information such as **gender** and **age group** is obtained from the for loops and selections.
- **create_dataset** - organizes the information and saves a **.csv** file
- **connect** - checks the Internet connection; a while loop waits until the connection is back
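
Most of the functions listed above share the same pattern: test the connection, try the action, and repeat on failure. A minimal sketch of that pattern follows. The names `retry_until_success`, `is_online` and `max_attempts` are assumptions made for illustration; the script itself repeats the loop inside each function and retries without a limit.

```python
import time


def retry_until_success(action, is_online, wait_seconds=5, max_attempts=None):
    """Run `action` until it succeeds, checking the connection first.

    action       -- callable that performs one step (for example, a click)
    is_online    -- callable that returns True when the connection works
    max_attempts -- optional limit (an addition of this sketch)
    """
    attempt = 0
    while max_attempts is None or attempt < max_attempts:
        attempt += 1
        if not is_online():
            print("Lost connection")
        else:
            try:
                return action()
            except Exception as exc:
                print(f"Attempt {attempt} failed: {exc}")
        time.sleep(wait_seconds)
    raise RuntimeError(f"Gave up after {max_attempts} attempts")


# Usage with a fake action that fails twice before succeeding
calls = {"count": 0}


def flaky_action():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ValueError("element not found")
    return "ok"


result = retry_until_success(flaky_action, is_online=lambda: True,
                             wait_seconds=0, max_attempts=5)
print(result, calls["count"])  # ok 3
```

Catching `Exception` instead of using a bare `except:` keeps `Ctrl+C` working while the loop is retrying.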

In [None]:
def open_section():
    while True:
        try:
            connect()
            driver = webdriver.Chrome()
            driver.get('https://www.strava.com/login')
            time.sleep(5)
            return driver
        except Exception: #retry until the browser opens and the login page loads
            continue
    
def login(driver,email,key):
    while True:
        try:
            connect()
            username = driver.find_element_by_id("email")
            password = driver.find_element_by_id("password")
            username.send_keys(email)
            password.send_keys(key)
            driver.find_element_by_id("login-button").click()
            time.sleep(2)
            break
        except: 
            driver.get('https://www.strava.com/login')
            time.sleep(5)
            continue
    
def user_access(driver,ge,ag,id_login):
    global count_logins
    global aux_id
    global segments
    
    if count_logins%3==0 and number_logins>1: #switch to another login every 3 calls, but only when more than 1 login is configured
        
        connect()
        driver.close()
        driver.quit()
        
        aux_id = aux_id + 1
        if (aux_id - id_login)>=number_logins:
            aux_id = id_login
        
        driver = open_section()
        email, key = change_user(aux_id,logins)
        login(driver,email, key)
        select_english_language(driver,aux_id)
        go_link(driver,url=segments)
        select_gender(driver,ge)
        select_age_group(driver,ag,ge)
    
    count_logins=count_logins+1
    return driver
    
def change_user(id_login,logins):
     
    email = logins['email'][id_login]
    key=logins['key'][id_login]
    
    
    return email, key

def select_english_language(driver,id_login):
    while True:
        try:
            connect()
            element=driver.find_element_by_xpath('//*[@id="language-picker"]/ul/li[3]/div') # select English language
            driver.execute_script("arguments[0].click();", element) #click
            break
        except: 
            email, key = change_user(id_login,logins)
            login(driver,email,key)
            continue
    
def go_link(driver,url):
    while True:
        try:
            connect()
            driver.get(url)
            time.sleep(2)
            break
        except: continue
    
def select_gender(driver,gender):
    while True:
        try:
            connect()
            element=driver.find_element_by_xpath('//*[@id="segment-results"]/div[2]/table/tbody/tr/td[4]/div/ul/li['+str(gender)+']/a')
            driver.execute_script("arguments[0].click();", element)
            time.sleep(1)
            break
        except: 
            driver.refresh()
            continue
    
def select_age_group(driver,ag,ge):
    global club
    while True:
        try:
            connect()
            #if the user belongs to Strava clubs, set club = 2 so that the XPath below uses "ul[2]"
            element=driver.find_element_by_xpath('//*[@id="premium-enhanced"]/ul/ul['+str(club)+']/li['+str(ag)+']/a')
            driver.execute_script("arguments[0].click();", element) 
            time.sleep(5)
                        
            loading_page = driver.find_element_by_xpath('//*[@id="segment-results"]/div[2]/h4')
            loading_page_html=loading_page.get_attribute('innerHTML')
            loading_page_soup=BeautifulSoup(loading_page_html, "html.parser")
            
            if re.search(age_group[str(ag)],str(loading_page_soup),re.IGNORECASE):
                break

        except: 
            select_gender(driver,ge)
            continue
    
def next_page(driver,page,ge,ag):
    #driver.find_element_by_link_text("→").click()
    while True:
        try:
            connect()
            if page>=6:
                page=6
            element=driver.find_element_by_xpath('//*[@id="results"]/nav/ul/li['+str(page+2)+']/a')
            element_html=element.get_attribute('innerHTML')
            element_soup=BeautifulSoup(element_html, "html.parser")
            driver.execute_script("arguments[0].click();", element) 
            time.sleep(3)
            end = False
            
            break
        except NoSuchElementException:
            if connect() == True:
                print('End page')
                end = True
                break
    return end
            
def page_loaded(driver,page,ge,ag,end):
    print('Page loaded?')
    if end == False:
        aux_page = page
        while True:
            connect()
            if page>=6:
                page=5
            try:
                connect()
                loading_page = driver.find_element_by_xpath('//*[@id="results"]/nav/ul/li['+str(page+2)+']/span')
                loading_page_html=loading_page.get_attribute('innerHTML')
                loading_page_soup=BeautifulSoup(loading_page_html, "html.parser")

                if str(aux_page+1) == str(loading_page_soup):

                    break
                    
                else:
                    select_gender(driver,ge)
                    select_age_group(driver,ag,ge)
                    reload_page = 1
                    #next_page(driver,aux_page,ge,ag)
                    aux_reload_page = reload_page
                    while (aux_reload_page <= aux_page ):
                        try:
                            connect()
                            if reload_page>=6:
                                reload_page=6
                            element=driver.find_element_by_xpath('//*[@id="results"]/nav/ul/li['+str(reload_page+2)+']/a')
                            driver.execute_script("arguments[0].click();", element) 
                            time.sleep(3)
                            reload_page = reload_page + 1
                            aux_reload_page = aux_reload_page + 1
                        except:
                            
                            continue
             

            except:
                try:
                    connect()
                    loading_page = driver.find_element_by_xpath('//*[@id="results"]/nav/ul/li['+str(page+3)+']/span')
                    loading_page_html=loading_page.get_attribute('innerHTML')
                    loading_page_soup=BeautifulSoup(loading_page_html, "html.parser")

                    if str(aux_page+1) == str(loading_page_soup):

                        break
                    
                    else:
                        select_gender(driver,ge)
                        select_age_group(driver,ag,ge)
                        reload_page = 1
                        
                        aux_reload_page = reload_page
                        while (aux_reload_page <= aux_page ):
                            try:
                                connect()
                                if reload_page>=6:
                                    reload_page=6
                                element=driver.find_element_by_xpath('//*[@id="results"]/nav/ul/li['+str(reload_page+2)+']/a')
                                driver.execute_script("arguments[0].click();", element) 
                                time.sleep(3)
                                reload_page = reload_page + 1
                                aux_reload_page = aux_reload_page + 1
                            except:
                                continue
                                            
                except:
                    select_gender(driver,ge)
                    select_age_group(driver,ag,ge)
                    reload_page = 1
                    
                    aux_reload_page = reload_page
                    while (aux_reload_page <= aux_page ):
                        try:
                            connect()
                            if reload_page>=6:
                                reload_page=6
                            element=driver.find_element_by_xpath('//*[@id="results"]/nav/ul/li['+str(reload_page+2)+']/a')
                            driver.execute_script("arguments[0].click();", element) 
                            time.sleep(3)
                            reload_page = reload_page + 1
                            aux_reload_page = aux_reload_page + 1
                        except:
                            continue
                        
                    continue
        
    CRED = '\033[92m'
    CEND = '\033[0m'
    print(CRED+'OKAY !!!!'+CEND)
    
def scraping(driver,ge,ag):
    global results
    global result_age_group
    global result_gender
    global url_activity
    global url_athlete
    
    while True:
        try:
            connect()
            information = driver.find_element_by_id("segment-leaderboard")
            html=information.get_attribute('innerHTML')
            soup =  BeautifulSoup(html, "html.parser") 
            table = soup.find("table", attrs={"class": "table table-striped table-padded table-leaderboard"})
            rows = table.findAll("tr")
            break
        except: 
            select_gender(driver,ge)
            select_age_group(driver,ag,ge)
            continue

    print('Scraping...')
    for row in rows:
        try: 
            
            a = [t.text.strip() for t in row.findAll("td")][0:] 

            b = [c['href'] for c in row.find_all('a', href=True) if c.text] #hyperlinks
            if len(a) == 6:
                results.append(a)
                
                result_age_group.append(age_group[str(ag)])
                result_gender.append(gender[str(ge)])
                url_activity.append('https://strava.com' + b[1]) #b[0] athlete url | b[1] activity url 
                url_athlete.append('https://strava.com' + b[0])

        except Exception: #skip rows that cannot be parsed
            continue

def create_dataset():
    df_result_age_group=pd.DataFrame(result_age_group,columns=['age_group'])
    df_result_gender=pd.DataFrame(result_gender,columns=['gender'])
    df_url_activity=pd.DataFrame(url_activity,columns=['url_activity'])
    df_url_athlete = pd.DataFrame(url_athlete,columns=['url_athlete'])

    columns = ['classification','name', 'date','pace', 'heart_rate','time']
    df=pd.DataFrame(results,columns=columns)


    df=pd.concat([df,df_result_age_group,df_result_gender,df_url_activity,df_url_athlete],axis=1)
    
    df['major'] = df['date'].apply(lambda x: export[:-4] +' '+ str(datetime.datetime.strptime(x, '%b %d, %Y').year))
    
    df.to_csv(export,index=False,sep=';')
    print('Dataset created')
    
def connect():
    CRED = '\033[91m'
    CEND = '\033[0m'
    while True:

        try:
            urllib.request.urlopen('https://outlook.live.com/owa/') #Python 3.x
            return True
        except Exception: #no connection: wait and try again
            time.sleep(5)
            print(CRED+'Lost connection'+CEND)
            continue

## Informations to scrap

In [None]:
results = []
result_age_group = []
result_gender=[]
url_activity = []
url_athlete = []
age_group={'2':'19 and under','3':'20 to 24','4':'25 to 34','5':'35 to 44','6':'45 to 54','7':'55 to 64','8':'65 to 69',
              '9':'70 to 74','10':'75+'}
gender={'2':'M','3':'F'}
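
The two dictionaries above map the list positions used in the XPaths to readable labels. As a rough sketch of how they drive the scraping loops, the snippet below lists every gender and age group combination that gets visited; it repeats the dictionaries so that it runs on its own.

```python
from itertools import product

# Same mappings as in the cell above: list position -> label
age_group = {'2': '19 and under', '3': '20 to 24', '4': '25 to 34',
             '5': '35 to 44', '6': '45 to 54', '7': '55 to 64',
             '8': '65 to 69', '9': '70 to 74', '10': '75+'}
gender = {'2': 'M', '3': 'F'}

# Same ranges as the nested for loops in the Scraping section
combinations = [(gender[str(ge)], age_group[str(ag)])
                for ge, ag in product(range(2, 4), range(2, 11))]

print(len(combinations))   # 18
print(combinations[0])     # ('M', '19 and under')
print(combinations[-1])    # ('F', '75+')
```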

## Create session, open page and change language to English

In [None]:
driver = open_section()
email, key = change_user(id_login,logins)
login(driver,email, key)
select_english_language(driver,id_login)
go_link(driver,url=segments)

## Scraping

In [None]:
for ge in range(2,4): #'2':'M','3':'F'
    
    for ag in range(2,11): # '2':'19 and under','3':'20 to 24','4':'25 to 34','5':'35 to 44','6':'45 to 54','7':'55 to 64','8':'65 to 69',
                                  #'9':'70 to 74','10':'75+'        
        select_gender(driver,ge)
        select_age_group(driver,ag,ge)

        for page in range(1,last_page):
            
            CRED = '\033[1m'
            CEND = '\033[0m'

            print(CRED + gender[str(ge)] + " - " + age_group[str(ag)] + " - page - " + str(page) + CEND)

            scraping(driver,ge,ag)
            driver = user_access(driver,ge,ag,id_login)

            end = next_page(driver,page,ge,ag)
            page_loaded(driver,page,ge,ag,end)
            if end == True:
                break
        create_dataset()
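
`create_dataset` derives the `major` column from the export file name and the year of each activity date. A minimal sketch of that step is shown below; the helper `major_label` is hypothetical and the date is only an example value in the `'%b %d, %Y'` format that the script expects.

```python
import datetime


def major_label(export, date_text):
    """Combine the export name (without ".csv") with the activity year."""
    event = export[:-4]  # drops the 4-character ".csv" suffix, as in create_dataset
    year = datetime.datetime.strptime(date_text, '%b %d, %Y').year
    return f'{event} {year}'


print(major_label('NEW YORK.csv', 'Nov 3, 2019'))  # NEW YORK 2019
```

Note that `'%b'` parses English month abbreviations under the default locale, which is one reason the script switches the page language to English first.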