# Part 2 - Other features analysis

The aim of our project is not clearly defined yet. We are hesitating between several topics:
- predict the winner of the 100m sprint (or another event, to be defined)
- predict the winner of the marathon
- study the influence of competing at home on French athletes in several Olympic events, such as surfing, climbing, marathon and decathlon. The results of these events might be significantly affected by crowd support, or the setup of the event might favour a French athlete who is used to the local weather conditions.

To address them, our project will be divided into 3 parts:
- a sentiment analysis using sports articles
- a basic neural network trained on past performances and categorical features
- a combination of the two above

This notebook covers part 2.

## Data collection

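Web scraping with *beautifulsoup4* from
- the world rankings page https://worldathletics.org/world-rankings/5000m/men?regionType=world&page=1&rankDate=2024-06-04&limitByCountry=0
- the all-time top lists page https://worldathletics.org/records/all-time-toplists/sprints/100-metres/outdoor/men/senior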

#### *100m sprint*

In [1]:
pip install requests beautifulsoup4 pandas

Note: you may need to restart the kernel to use updated packages.


In [5]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# URL of the page to scrape
url = "https://worldathletics.org/records/all-time-toplists/sprints/100-metres/outdoor/men/senior"

# Send a request to fetch the page content
response = requests.get(url)
response.raise_for_status()  # Ensure the request was successful

# Parse the page content
soup = BeautifulSoup(response.text, 'html.parser')

# Find the table containing the data
table = soup.find('table')

# Extract table headers
headers = [header.text.strip() for header in table.find_all('th')]

# Extract table rows
rows = []
for row in table.find_all('tr')[1:]:  # Skip the header row
    columns = row.find_all('td')
    row_data = [col.text.strip() for col in columns]
    rows.append(row_data)

# Create a DataFrame
df = pd.DataFrame(rows, columns=headers)

# Save the data to a CSV file
df.to_csv('100m_sprint_all_time_toplist.csv', index=False)

print("Data has been saved to 100m_sprint_all_time_toplist.csv")


Data has been saved to 100m_sprint_all_time_toplist.csv


In [3]:
df

Unnamed: 0,Rank,Mark,WIND,Competitor,DOB,Nat,Pos,Unnamed: 8,Venue,Date,Results Score
0,1,9.58,+0.9,Usain BOLT,21 AUG 1986,JAM,1,,"Olympiastadion, Berlin (GER)",16 AUG 2009,1356
1,2,9.69,+2.0,Tyson GAY,09 AUG 1982,USA,1,,Shanghai (CHN),20 SEP 2009,1316
2,2,9.69,-0.1,Yohan BLAKE,26 DEC 1989,JAM,1,,"Stade Olympique de la Pontaise, Lausanne (SUI)",23 AUG 2012,1316
3,4,9.72,+0.2,Asafa POWELL,23 NOV 1982,JAM,1f1,,"Stade Olympique de la Pontaise, Lausanne (SUI)",02 SEP 2008,1305
4,5,9.74,+0.9,Justin GATLIN,10 FEB 1982,USA,1,,"Suhaim bin Hamad Stadium, Doha (QAT)",15 MAY 2015,1298
...,...,...,...,...,...,...,...,...,...,...,...
95,95,9.94,-0.2,Bernard WILLIAMS,19 JAN 1978,USA,2,,"Commonwealth Stadium, Edmonton (CAN)",05 AUG 2001,1228
96,95,9.94,+1.7,Diondre BATSON,13 JUL 1992,USA,2h2,,"Hayward Field, Eugene, OR (USA)",25 JUN 2015,1227
97,95,9.94,+1.4,Andrew FISHER,15 DEC 1991,JAM,2,,"Moratalaz, Madrid (ESP)",11 JUL 2015,1227
98,95,9.94,+1.0,Ameer WEBB,19 MAR 1991,USA,2f1,,"Stadio Olimpico, Roma (ITA)",02 JUN 2016,1227


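We have now created a DataFrame and saved it to a CSV file. It gives us information about the top-ranked athletes in the all-time 100m list (about 100 rows).

We have 11 columns, plus an empty one (`Unnamed: 8`):
- ID (the row index, shown as `Unnamed: 0`)
- Rank
- Mark (the time, in seconds)
- WIND (the wind reading, in m/s)
- Competitor (the athlete's first name and surname)
- DOB (the date of birth)
- Nat (nationality)
- Pos (apparently the athlete's finishing position in that race; suffixes such as `h2` or `f1` seem to denote a heat or final, to be confirmed)
- Venue (the venue where the athlete set his best time)
- Date (the date on which he set his best time)
- Results Score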
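
Most of these columns are scraped as strings, so they need converting before they can feed a model. The sketch below is one possible way to do it, using only the standard library. The helper names (`parse_date`, `row_features`) and the derived `age_at_mark` and `is_home` fields are illustrative assumptions, not part of the scraped data. It also assumes that the venue always ends with a three-letter country code in parentheses, as in the rows shown above.

```python
import re
from datetime import date

# English month abbreviations as they appear in the scraped table (e.g. "21 AUG 1986")
MONTHS = {m: i for i, m in enumerate(
    ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
     "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"], start=1)}

def parse_date(text):
    """Convert a 'DD MON YYYY' string into a datetime.date."""
    day, month, year = text.split()
    return date(int(year), MONTHS[month.upper()], int(day))

def row_features(row):
    """Derive numeric features from one scraped row (a dict of strings)."""
    dob = parse_date(row["DOB"])
    when = parse_date(row["Date"])
    # Age in whole years on the day of the performance
    age = when.year - dob.year - ((when.month, when.day) < (dob.month, dob.day))
    # Country code in the trailing parentheses of the venue, e.g. "... Berlin (GER)"
    match = re.search(r"\(([A-Z]{3})\)\s*$", row["Venue"])
    venue_country = match.group(1) if match else None
    return {
        "mark": float(row["Mark"]),
        # The wind reading may be missing; keep None in that case
        "wind": float(row["WIND"]) if row["WIND"] else None,
        "age_at_mark": age,
        # Assumption: "home" means the venue country code equals the Nat code
        "is_home": venue_country == row["Nat"],
    }

example = {
    "Mark": "9.58", "WIND": "+0.9", "DOB": "21 AUG 1986", "Nat": "JAM",
    "Venue": "Olympiastadion, Berlin (GER)", "Date": "16 AUG 2009",
}
print(row_features(example))
```

The `is_home` flag is a crude proxy for the home-country effect mentioned in the project topics; it only says whether the best mark was set in the athlete's own country.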

Ideally, we want more information about each individual athlete. Each athlete has a profile page on the same website, so we need to scrape those pages as well.
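
One way to reach the profile pages is to collect the links attached to the athlete names in the scraped table. The sketch below assumes that each name is wrapped in an anchor whose `href` starts with `/athletes/`; this has not been checked against the live page, and the class name `ProfileLinkParser` and the sample HTML are hypothetical. It uses the standard library parser so that it runs without extra packages; the same idea carries over to BeautifulSoup.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "https://worldathletics.org"

class ProfileLinkParser(HTMLParser):
    """Collect {athlete name: absolute profile URL} from anchors under /athletes/."""

    def __init__(self):
        super().__init__()
        self.links = {}
        self._href = None   # href of the profile anchor currently open, if any
        self._text = []     # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            # Assumption: profile links start with /athletes/
            if href.startswith("/athletes/"):
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            name = "".join(self._text).strip()
            if name:
                self.links[name] = urljoin(BASE_URL, self._href)
            self._href = None

# Hypothetical fragment mimicking one table row; the real markup may differ
sample_html = """
<table><tr>
  <td>1</td>
  <td><a href="/athletes/ethiopia/hagos-gebrhiwet-14477352">Hagos GEBRHIWET</a></td>
  <td>ETH</td>
</tr></table>
"""

parser = ProfileLinkParser()
parser.feed(sample_html)
print(parser.links)
```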

#### *5000m*

In [6]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# URL of the page to scrape
url = "https://worldathletics.org/world-rankings/5000m/men?regionType=world&page=1&rankDate=2024-06-04&limitByCountry=0"

# Send a request to fetch the page content
response = requests.get(url)
response.raise_for_status()  # Ensure the request was successful

# Parse the page content
soup = BeautifulSoup(response.text, 'html.parser')

# Find the table containing the data
table = soup.find('table')

# Extract table headers
headers = [header.text.strip() for header in table.find_all('th')]

# Extract table rows
rows = []
for row in table.find_all('tr')[1:]:  # Skip the header row
    columns = row.find_all('td')
    row_data = [col.text.strip() for col in columns]
    rows.append(row_data)

# Create a DataFrame
df5000 = pd.DataFrame(rows, columns=headers)

# Save the data to a CSV file
df5000.to_csv('5000m_all_time_toplist.csv', index=False)

print("Data has been saved to 5000m_all_time_toplist.csv")


Data has been saved to 5000m_all_time_toplist.csv


In [9]:
df5000.head(10)

Unnamed: 0,Place,Competitor,DOB,Nat,Score,Event List
0,1,Yomif KEJELCHA,01 AUG 1997,ETH,1457,5000m [3000m]
1,2,Hagos GEBRHIWET,11 MAY 1994,ETH,1441,5000m [5 km Road]
2,3,Berihu AREGAWI,28 FEB 2001,ETH,1421,5000m [3000m]
3,4,Telahun Haile BEKELE,13 MAY 1999,ETH,1409,5000m [3000m]
4,5,Jakob INGEBRIGTSEN,19 SEP 2000,NOR,1405,5000m [3000m]
5,6,Jacob KIPLIMO,14 NOV 2000,UGA,1403,5000m
6,7,Selemon BAREGA,20 JAN 2000,ETH,1396,5000m [3000m sh]
7,8,Grant FISHER,22 APR 1997,USA,1364,5000m [3000m]
8,9,Luis GRIJALVA,10 APR 1999,GUA,1361,5000m [3000m]
9,10,Joshua CHEPTEGEI,12 SEP 1996,UGA,1350,5000m
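
The ranking URL used above carries a `page=1` query parameter, which suggests that further pages can be requested by changing that value. The sketch below only builds the URLs; the helper name `ranking_url` is illustrative, and whether the other pages exist and share the same table layout is an assumption that still needs checking.

```python
from urllib.parse import urlencode

RANKING_BASE = "https://worldathletics.org/world-rankings/5000m/men"

def ranking_url(page, rank_date="2024-06-04"):
    """Build the ranking URL for a given page, mirroring the query string used above."""
    params = {
        "regionType": "world",
        "page": page,
        "rankDate": rank_date,
        "limitByCountry": 0,
    }
    return f"{RANKING_BASE}?{urlencode(params)}"

# Hypothetical: the first three pages; how many pages exist is not checked here
urls = [ranking_url(p) for p in range(1, 4)]
for u in urls:
    print(u)
```

Each URL could then be passed to the same scraping code, with a short pause between requests to avoid overloading the site.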


Let's retrieve more details about these top 10 athletes.

In [32]:
import requests
from bs4 import BeautifulSoup
import json

# URL of the page to scrape
url = "https://worldathletics.org/athletes/ethiopia/hagos-gebrhiwet"

# Fetch the HTML content of the page
response = requests.get(url)
html_content = response.text

# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(html_content, 'html.parser')

# Find the <script> tag with id="__NEXT_DATA__" and type="application/json"
script_tag = soup.find('script', {'id': '__NEXT_DATA__', 'type': 'application/json'})

# Extract the JSON content from the <script> tag
json_content = script_tag.string

# Parse the JSON content
data = json.loads(json_content)

# Save the JSON content to a file
with open('athlete_profile.json', 'w') as json_file:
    json.dump(data, json_file, indent=4)

print("JSON data has been saved to 'athlete_profile.json'")


JSON data has been saved to 'athlete_profile.json'


In [33]:
import json

# Load the JSON data from the file
with open('athlete_profile.json', 'r') as json_file:
    data = json.load(json_file)

# Function to recursively search for 5000 meters results
def search_5000m_results(node, results):
    if isinstance(node, dict):
        for key, value in node.items():
            if isinstance(value, (list, dict)):
                search_5000m_results(value, results)
            # Check the type first: a non-string value (e.g. None) would make `in` raise a TypeError
            elif key == 'discipline' and isinstance(value, str) and '5000 Metres' in value:
                results.append(node)
    elif isinstance(node, list):
        for item in node:
            if isinstance(item, list) or isinstance(item, dict):
                search_5000m_results(item, results)

# Extract 5000 meters results
results_5000m = []
search_5000m_results(data, results_5000m)

# Print the extracted results
print(json.dumps(results_5000m, indent=4))



# Optionally, save the extracted results to a file
with open('5000m_results.json', 'w') as json_file:
    json.dump(results_5000m, json_file, indent=4)


[]
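
The search above returned an empty list. One plausible explanation, not verified here, is that the URL without the numeric athlete ID does not serve the full profile; the next cell uses the URL with the ID suffix. Note also that `script_tag.string` would raise an `AttributeError` if a page had no `__NEXT_DATA__` script at all. The sketch below shows one defensive way to pull that JSON out of a page, using only the standard library; the names `NextDataParser` and `extract_next_data` and the sample HTML strings are illustrative.

```python
import json
from html.parser import HTMLParser

class NextDataParser(HTMLParser):
    """Accumulate the text of <script id="__NEXT_DATA__">, if present."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("id") == "__NEXT_DATA__":
            self._inside = True

    def handle_data(self, data):
        if self._inside:
            self.chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

def extract_next_data(html):
    """Return the embedded JSON as a dict, or None if the script tag is missing."""
    parser = NextDataParser()
    parser.feed(html)
    parser.close()
    raw = "".join(parser.chunks).strip()
    return json.loads(raw) if raw else None

# Hypothetical pages: one with the embedded JSON, one without
with_data = ('<html><script id="__NEXT_DATA__" type="application/json">'
             '{"props": {"discipline": "5000 Metres"}}</script></html>')
without_data = "<html><body>Not found</body></html>"

print(extract_next_data(with_data))
print(extract_next_data(without_data))
```

Returning `None` lets the calling code tell "page has no data" apart from "page has data but no 5000m results".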


In [34]:
import json
import csv
import requests
from bs4 import BeautifulSoup

# URL of the page to scrape
url = "https://worldathletics.org/athletes/ethiopia/hagos-gebrhiwet-14477352"

# Fetch the HTML content of the page
response = requests.get(url)
html_content = response.text

# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(html_content, 'html.parser')

# Find the <script> tag with id="__NEXT_DATA__" and type="application/json"
script_tag = soup.find('script', {'id': '__NEXT_DATA__', 'type': 'application/json'})

# Extract the JSON content from the <script> tag
json_content = script_tag.string

# Parse the JSON content
data = json.loads(json_content)

# Function to recursively search for 5000 meters results
def search_5000m_results(node, results):
    if isinstance(node, dict):
        for key, value in node.items():
            if isinstance(value, (list, dict)):
                search_5000m_results(value, results)
            # Check the type first: a non-string value (e.g. None) would make `in` raise a TypeError
            elif key == 'discipline' and isinstance(value, str) and '5000 Metres' in value:
                results.append(node)
    elif isinstance(node, list):
        for item in node:
            if isinstance(item, list) or isinstance(item, dict):
                search_5000m_results(item, results)

# Extract 5000 meters results
results_5000m = []
search_5000m_results(data, results_5000m)

data_athlete_5000m = results_5000m

# List to hold the extracted rows
rows = []

# Function to extract relevant data
def extract_5000m_data(results):
    for result in results:
        # A progression node groups several results under 'results';
        # otherwise the node is itself a single result
        entries = result['results'] if 'results' in result else [result]
        for entry in entries:
            rows.append({
                'score': entry.get('resultScore', ''),
                'date': entry.get('date', ''),
                'venue': entry.get('venue', ''),
                'place': entry.get('place', '')
            })

# Extract data
extract_5000m_data(data_athlete_5000m)

# Define CSV file name
csv_file_name = '5000m_results.csv'

# Write to CSV
with open(csv_file_name, 'w', newline='') as csv_file:
    fieldnames = ['score', 'date', 'venue', 'place']
    writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
    writer.writeheader()
    for row in rows:
        writer.writerow(row)

print(f"Data successfully written to {csv_file_name}")


Data successfully written to 5000m_results.csv


In [38]:
import json
import csv
import requests
from bs4 import BeautifulSoup

# Function to extract 5000m results for a given athlete URL
def extract_5000m_results(url):
    # Fetch the HTML content of the page
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # Stop early on HTTP errors
    html_content = response.text

    # Parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(html_content, 'html.parser')

    # Find the <script> tag with id="__NEXT_DATA__" and type="application/json"
    script_tag = soup.find('script', {'id': '__NEXT_DATA__', 'type': 'application/json'})
    if script_tag is None or not script_tag.string:
        # No embedded JSON on this page: return no results instead of crashing
        return []

    # Extract the JSON content from the <script> tag
    json_content = script_tag.string

    # Parse the JSON content
    data = json.loads(json_content)

    # Function to recursively search for 5000 meters results
    def search_5000m_results(node, results):
        if isinstance(node, dict):
            for key, value in node.items():
                if isinstance(value, (list, dict)):
                    search_5000m_results(value, results)
                # Check the type first: a non-string value (e.g. None) would make `in` raise a TypeError
                elif key == 'discipline' and isinstance(value, str) and '5000 Metres' in value:
                    results.append(node)
        elif isinstance(node, list):
            for item in node:
                if isinstance(item, list) or isinstance(item, dict):
                    search_5000m_results(item, results)

    # Extract 5000 meters results
    results_5000m = []
    search_5000m_results(data, results_5000m)

    return results_5000m

# List of athletes and their corresponding URLs
athletes = {
    "Yomif KEJELCHA": "https://worldathletics.org/athletes/ethiopia/yomif-kejelcha-14594967",
    "Hagos GEBRHIWET": "https://worldathletics.org/athletes/ethiopia/hagos-gebrhiwet-14477352",
    "Berihu AREGAWI": "https://worldathletics.org/athletes/ethiopia/berihu-aregawi-14848753",
    "Telahun Haile BEKELE": "https://worldathletics.org/athletes/ethiopia/telahun-haile-bekele-14797485",
    "Jakob INGEBRIGTSEN": "https://worldathletics.org/athletes/norway/jakob-ingebrigtsen-14653717",
    "Jacob KIPLIMO": "https://worldathletics.org/athletes/uganda/jacob-kiplimo-14735365",
    "Selemon BAREGA": "https://worldathletics.org/athletes/ethiopia/selemon-barega-14751317",
    "Grant FISHER": "https://worldathletics.org/athletes/united-states/grant-fisher-14591210",
    "Luis GRIJALVA": "https://worldathletics.org/athletes/guatemala/luis-grijalva-14749285",
    "Joshua CHEPTEGEI": "https://worldathletics.org/athletes/uganda/joshua-cheptegei-14645612"
}

# List to hold all extracted rows
all_rows = []

# Turn one athlete's raw result nodes into flat rows for the CSV
def extract_5000m_data(athlete, results):
    rows = []
    for result in results:
        # A progression node groups several results under 'results';
        # otherwise the node is itself a single result
        entries = result['results'] if 'results' in result else [result]
        for entry in entries:
            rows.append({
                'Athlete': athlete,
                'Score': entry.get('resultScore', ''),
                'Date': entry.get('date', ''),
                'Venue': entry.get('venue', ''),
                'Place': entry.get('place', '')
            })
    return rows

# Loop through each athlete and extract their 5000m results
for athlete, athlete_url in athletes.items():
    print(f"Extracting 5000m results for {athlete}...")
    athlete_results = extract_5000m_results(athlete_url)

    # Add rows for this athlete to the list of all rows
    all_rows.extend(extract_5000m_data(athlete, athlete_results))

# Define CSV file name
csv_file_name = '5000m_results.csv'

# Write all rows to CSV
with open(csv_file_name, 'w', newline='') as csv_file:
    fieldnames = ['Athlete', 'Score', 'Date', 'Venue', 'Place']
    writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
    writer.writeheader()
    for row in all_rows:
        writer.writerow(row)

print(f"Data successfully written to {csv_file_name}")


Extracting 5000m results for Yomif KEJELCHA...
Extracting 5000m results for Hagos GEBRHIWET...
Extracting 5000m results for Berihu AREGAWI...
Extracting 5000m results for Telahun Haile BEKELE...
Extracting 5000m results for Jakob INGEBRIGTSEN...
Extracting 5000m results for Jacob KIPLIMO...
Extracting 5000m results for Selemon BAREGA...
Extracting 5000m results for Grant FISHER...
Extracting 5000m results for Luis GRIJALVA...
Extracting 5000m results for Joshua CHEPTEGEI...
Data successfully written to 5000m_results.csv
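
Once the per-athlete rows are written, they can be condensed into one summary per athlete before being fed to a model. The sketch below assumes the CSV layout produced above (columns `Athlete`, `Score`, `Date`, `Venue`, `Place`) and that `Score` may be empty for some rows. The function name `summarize_scores` is illustrative, and the sample rows are made up; they are not real results.

```python
import csv
import io
from collections import defaultdict

def summarize_scores(csv_text):
    """Return {athlete: (number of scored results, best score, mean score)}."""
    scores = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        value = row["Score"].strip()
        if value:  # skip results without a score
            scores[row["Athlete"]].append(float(value))
    return {
        name: (len(vals), max(vals), sum(vals) / len(vals))
        for name, vals in scores.items()
    }

# Made-up rows in the same layout as 5000m_results.csv (not real results)
sample_csv = """Athlete,Score,Date,Venue,Place
Athlete A,1200,01 JAN 2024,Somewhere,1
Athlete A,1100,02 FEB 2024,Elsewhere,3
Athlete A,,03 MAR 2024,Elsewhere,
Athlete B,1000,04 APR 2024,Somewhere,2
"""

summary = summarize_scores(sample_csv)
for name, (count, best, mean) in summary.items():
    print(f"{name}: {count} scored results, best {best:.0f}, mean {mean:.1f}")
```

To run it on the real file, the text can be read with `open('5000m_results.csv').read()` and passed to `summarize_scores`.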


#### Climbing

None
