# Scraping the data

Imports

In [8]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

For each Olympic year, Wikipedia has a page for weightlifting. The page contains information such as the competition schedule and the participating nations.

However, I am mainly interested in scraping the performance data for each athlete. To do this, I have to visit the page for every weight category in that Olympic year.

The links to these category pages can be found by using bs4 to search for span tags whose text is "details". Each matching span wraps an anchor, so I can grab the href from it.

As I can repeat this for other Olympic years, I can extract it into a function.

In [11]:
list_of_links = []

#get the links for each weightlifting weight class
def getUrlLinks(url,year):
  response = requests.get(url).content
  soup = BeautifulSoup(response, 'html.parser')

  weight_category_links = soup.find_all('span',string='details')

  for link in weight_category_links:
    list_of_links.append(link.find('a').get('href'))

As the Wikipedia URLs for the weightlifting events differ only by year, I can iterate through a list of years and call the function defined above.

In [12]:
#only works for 1996+
olympic_years = [1996,2000,2004,2008,2012,2016,2020]
for year in olympic_years:
  url = 'https://en.wikipedia.org/wiki/Weightlifting_at_the_' + str(year) + '_Summer_Olympics'
  print(url)
  getUrlLinks(url,year)

https://en.wikipedia.org/wiki/Weightlifting_at_the_1996_Summer_Olympics
https://en.wikipedia.org/wiki/Weightlifting_at_the_2000_Summer_Olympics
https://en.wikipedia.org/wiki/Weightlifting_at_the_2004_Summer_Olympics
https://en.wikipedia.org/wiki/Weightlifting_at_the_2008_Summer_Olympics
https://en.wikipedia.org/wiki/Weightlifting_at_the_2012_Summer_Olympics
https://en.wikipedia.org/wiki/Weightlifting_at_the_2016_Summer_Olympics
https://en.wikipedia.org/wiki/Weightlifting_at_the_2020_Summer_Olympics
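
As a small aside, the URL building can be wrapped in a helper. This is only a sketch: the names `BASE`, `olympics_page_url` and `absolute_url` are my own and not part of the notebook, and `/wiki/Example_page` is a placeholder path. The second helper uses `urljoin` from the standard library to turn a site-relative href into a full URL, which is what the next step does by prefixing the site root.

```python
from urllib.parse import urljoin

# Assumed site root; the notebook hard-codes the same prefix.
BASE = "https://en.wikipedia.org"

def olympics_page_url(year):
    """Build the weightlifting page URL for one Olympic year."""
    return f"{BASE}/wiki/Weightlifting_at_the_{year}_Summer_Olympics"

def absolute_url(href):
    """Turn a site-relative href (e.g. '/wiki/...') into a full URL."""
    return urljoin(BASE, href)

print(olympics_page_url(1996))
print(absolute_url("/wiki/Example_page"))  # placeholder path, for illustration
```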


Now that I have a list of links (one for each weightlifting category in each year), I can iterate through them and fetch the data I need.

As each page contains a table with the results, I find this table and use the pandas read_html function to convert it to a dataframe.

I append each dataframe created this way to a list, so that I can later concatenate them into one large dataframe.

In [13]:
dataframes = []
def getTableData(url):
    #the scraped hrefs are site-relative, so prefix the Wikipedia root
    full_url = "https://en.wikipedia.org" + url
    response = requests.get(full_url)
    soup = BeautifulSoup(response.content, 'html.parser')

    #remove all references, such as [3] or [5]
    for tag in soup.find_all(class_='reference'):
        tag.decompose()

    
    title = soup.find('h1', {'id':'firstHeading'}).text
    
    
    #get year of event
    results_year = title.split(' ')[3]
        
    #find correct table
    results_title = soup.find('span', {'id':'Results'})
    results_table = results_title.find_next('table')
    df = pd.read_html(str(results_table),header=1)[0]
    df['Year'] = results_year
    
    #fix tables that have the "athlete" called "name".
    #rename returns a new dataframe, so the result has to be assigned back
    df = df.rename(columns={'Name':'Athlete'})
    
    return df

for link in list_of_links:
    dataframes.append(getTableData(link))
    
print("finished fetching all tables from all links")

finished fetching all tables from all links
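
One fragile spot in getTableData is the year lookup, which takes the fourth word of the page title. A regular expression that looks for a four-digit year is a bit more forgiving if the title wording ever changes. The sketch below is an alternative, not what the notebook runs; `extract_year` is a hypothetical helper and the title string is only an illustration of the heading format.

```python
import re

def extract_year(title):
    """Return the first four-digit year found in a title, or None."""
    match = re.search(r"\b(?:18|19|20)\d{2}\b", title)
    return match.group(0) if match else None

# Illustrative title, shaped like the headings of the category pages.
title = "Weightlifting at the 1996 Summer Olympics – Men's 54 kg"

print(extract_year(title))   # found by pattern
print(title.split(" ")[3])   # found by word position, as in getTableData
```

Both lines print the same year for this title, but only the first one keeps working if the year moves to a different position.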


There are some things I would like to tidy up before concatenating the dataframes.

First is to fix how the rank of the top three athletes in each weight category is displayed. In the Wikipedia table it is displayed as an image, so after conversion it becomes NaN.

To fix this, I will assign the first three rows the ranks 1, 2 and 3 respectively. This is defined as assign_top_three().

In [14]:
dataframes[0].head(5)

Unnamed: 0,Rank,Athlete,Group,Body weight,1,2,3,Result,1.1,2.1,3.1,Result.1,Total,Year
0,,Halil Mutlu (TUR),A,53.91,125.0,130.0,132.5,132.5,152.5,152.5,155.0,155.0,287.5,1996
1,,Zhang Xiangsen (CHN),A,53.39,122.5,127.5,130.0,130.0,150.0,155.0,157.5,150.0,280.0,1996
2,,Sevdalin Minchev (BUL),A,54.0,117.5,122.5,125.0,125.0,147.5,152.5,157.5,152.5,277.5,1996
3,4.0,Lan Shizhang (CHN),A,53.61,120.0,125.0,127.5,125.0,150.0,157.5,162.5,150.0,275.0,1996
4,5.0,Traian Cihărean (ROU),A,53.9,115.0,120.0,122.5,120.0,140.0,145.0,152.5,145.0,265.0,1996
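
To make the rank fix concrete, here is a minimal sketch on toy data. It is a slightly more cautious variant, under the assumption that the medal rows are always the first three rows of a table: it only fills a rank when it is actually missing, and it works on a copy. `fill_medal_ranks` and the single-letter athlete names are placeholders of mine.

```python
import numpy as np
import pandas as pd

def fill_medal_ranks(df):
    """Fill missing ranks in the first three rows with 1, 2, 3 by position."""
    df = df.copy()  # leave the caller's dataframe untouched
    for position in range(min(3, len(df))):
        label = df.index[position]
        if pd.isna(df.loc[label, "Rank"]):
            df.loc[label, "Rank"] = float(position + 1)
    return df

# Toy table: the top three ranks are NaN, as after read_html.
sample = pd.DataFrame({
    "Rank": [np.nan, np.nan, np.nan, 4.0, 5.0],
    "Athlete": ["A", "B", "C", "D", "E"],
})

print(fill_medal_ranks(sample)["Rank"].tolist())
```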


Another issue is that some Olympic years use very similar but not identical column names, for example "Bodyweight" and "Body weight". To make life easier, I will remove all spaces from each column name before concatenating the dataframes. I have put this inside the function clean_df.

In [15]:
dataframes[0].columns

Index(['Rank', 'Athlete', 'Group', 'Body weight', '1', '2', '3', 'Result',
       '1.1', '2.1', '3.1', 'Result.1', 'Total', 'Year'],
      dtype='object')

In [35]:
dataframes[98].columns

Index(['Rank', 'Athlete', 'Nation', 'Group', 'Bodyweight', '1', '2', '3',
       'Result', '1.1', '2.1', '3.1', 'Result.1', 'Total', 'Year'],
      dtype='object')
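
The effect of the mismatch is easy to see on two tiny dataframes. This is only an illustration with made-up rows (`older` and `newer` are my own names): without cleaning, concat keeps both spellings as separate columns, each half full of NaN, while stripping the spaces first lines them up.

```python
import pandas as pd

# Toy frames mimicking the two spellings seen in the scraped tables.
older = pd.DataFrame({"Athlete": ["A"], "Body weight": [53.9]})
newer = pd.DataFrame({"Athlete": ["B"], "Bodyweight": [87.1]})

# Without cleaning, the two spellings end up as two columns.
raw = pd.concat([older, newer], ignore_index=True)
print(list(raw.columns))

# Remove spaces from the column names, then concatenate again.
for frame in (older, newer):
    frame.columns = frame.columns.str.replace(" ", "")

clean = pd.concat([older, newer], ignore_index=True)
print(list(clean.columns))
```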

In [36]:
def assign_top_three(df):
    for i in range(3):
        #assign a scalar rank (1.0, 2.0, 3.0) rather than a one-element list
        df.loc[i,'Rank'] = i + 1.0
    
def clean_df(df):
    df.columns = df.columns.str.replace(' ','')
    return df
    
for df in dataframes:
    assign_top_three(df)
    clean_df(df)
    
results = pd.concat(dataframes,axis=0,ignore_index=True)
results

Unnamed: 0,Rank,Athlete,Group,Bodyweight,1,2,3,Result,1.1,2.1,3.1,Result.1,Total,Year,Nation
0,1.0,Halil Mutlu (TUR),A,53.91,125.0,130.0,132.5,132.5,152.5,152.5,155.0,155.0,287.5,1996,
1,2.0,Zhang Xiangsen (CHN),A,53.39,122.5,127.5,130.0,130.0,150.0,155.0,157.5,150.0,280.0,1996,
2,3.0,Sevdalin Minchev (BUL),A,54.0,117.5,122.5,125.0,125.0,147.5,152.5,157.5,152.5,277.5,1996,
3,4.0,Lan Shizhang (CHN),A,53.61,120.0,125.0,127.5,125.0,150.0,157.5,162.5,150.0,275.0,1996,
4,5.0,Traian Cihărean (ROU),A,53.9,115.0,120.0,122.5,120.0,140.0,145.0,152.5,145.0,265.0,1996,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1694,10,Sarah Fischer,A,93.35,93,97,97,97,117,123,123,123,220,2020,Austria
1695,11,Anna Van Bellinghen,B,87.1,96,100,100,96,115,119,123,123,219,2020,Belgium
1696,12,Erdenebatyn Bilegsaikhan,B,,80,85,87,85,115,120,122,122,207,2020,Mongolia
1697,13,Scarleth Ucelo,B,113.5,86,87,87,87,107,112,116,116,203,2020,Guatemala


Now that we have finished scraping the data, let's convert it to CSV.

In [37]:
results.to_csv('olympics_weightlifting_1996_to_2020.csv',encoding='utf-8-sig',index=False)
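
A short note on the encoding: 'utf-8-sig' writes a byte order mark at the start of the file, which helps spreadsheet programs such as Excel recognise the file as UTF-8 so that names like "Cihărean" display correctly. The sketch below shows this with the standard csv module on a temporary file; it is a stand-alone illustration, not part of the scraping pipeline, and the file name is arbitrary.

```python
import csv
import os
import tempfile

# One header row and one row taken from the scraped results.
rows = [["Athlete", "Total"], ["Traian Cihărean (ROU)", "265.0"]]

path = os.path.join(tempfile.mkdtemp(), "sample.csv")
with open(path, "w", encoding="utf-8-sig", newline="") as f:
    csv.writer(f).writerows(rows)

# The file starts with the three-byte UTF-8 byte order mark.
with open(path, "rb") as f:
    starts_with_bom = f.read(3) == b"\xef\xbb\xbf"

# Reading with the same encoding strips the mark again.
with open(path, encoding="utf-8-sig", newline="") as f:
    read_back = list(csv.reader(f))

print(starts_with_bom)
print(read_back == rows)
```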