# Project: Table Extraction from a Website Using Python

## Introduction

> Extracting structured information from websites underpins many data analysis and automation tasks. Tables are a common target because they already present data in rows and columns, but pulling them out of a page reliably takes some care. Python, with its HTML parsing and data-handling libraries, is well suited to the job.
>
> Table extraction from websites involves parsing HTML content to identify tables and then extracting the data within them. Python offers several libraries and frameworks, such as Beautiful Soup and lxml, that simplify the process of web scraping and HTML parsing. Additionally, specialized libraries like Pandas can be used to transform the extracted data into a structured and analyzable format.
>
> In this project, we will explore how to use Python to extract a table from a web page step by step, covering web page retrieval, HTML parsing, and data extraction.

In [1]:
# import all the libraries we need
from bs4 import BeautifulSoup
import requests
import pandas as pd

In [2]:
# download the page and parse its HTML with BeautifulSoup
url = 'https://worldathletics.org/records/toplists/sprints/100-metres/outdoor/men/senior/2023?regionType=countries&region=ngr&timing=electronic&windReading=regular&page=1&bestResultsOnly=false'

page = requests.get(url)

# name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(page.text, 'html.parser')
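
The request above assumes the server answers successfully and promptly. A slightly more defensive sketch is shown below. The helper names `fetch_html` and `make_soup` and the 10-second timeout are assumptions made for illustration, not part of the original notebook.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download a page and return its HTML text (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return response.text


def make_soup(html):
    """Parse HTML text with the parser that ships with Python."""
    return BeautifulSoup(html, "html.parser")


# Demonstrated on an inline document so the example needs no network access.
sample_soup = make_soup("<html><body><p>hello</p></body></html>")
print(sample_soup.p.text)
```

In the notebook, `make_soup(fetch_html(url))` would stand in for the two separate calls.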

In [3]:
# find the table by its HTML class attribute
table = soup.find("table", class_="records-table")
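
`soup.find` returns `None` when nothing matches, which would only surface later as an `AttributeError` on `table.find_all`. A minimal sketch of failing early is shown below, using a small inline table as a stand-in for the real page. The helper name `find_table` is hypothetical.

```python
from bs4 import BeautifulSoup

# Tiny stand-in document; the real page is assumed to contain a similar table.
html = """
<table class="records-table">
  <tr><th>Rank</th><th>Mark</th></tr>
  <tr><td>1</td><td>9.90</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")


def find_table(soup, css_class):
    """Return the first <table> with the given class, or raise if none exists."""
    table = soup.find("table", class_=css_class)
    if table is None:
        raise ValueError(f"no <table> with class {css_class!r} found")
    return table


table = find_table(soup, "records-table")
print(len(table.find_all("tr")))
```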

In [4]:
#finding all the table headers
mtable_titles = table.find_all('th')

In [5]:
mtable_titles

[<th>
                                         Rank
                                     </th>,
 <th>
                                         Mark
                                     </th>,
 <th>
                                             WIND
                                         </th>,
 <th>
                                         Competitor
                                     </th>,
 <th>
                                         DOB
                                     </th>,
 <th>
                                         Nat
                                     </th>,
 <th>
                                         Pos
                                     </th>,
 <th></th>,
 <th>
                                         Venue
                                     </th>,
 <th>
                                         Date
                                     </th>,
 <th>
                                             Results Score
                                         </th>]

In [6]:
#store the table headers to a python list 
mtable_table_titles = [title.text.strip() for title in mtable_titles]

print(mtable_table_titles)

['Rank', 'Mark', 'WIND', 'Competitor', 'DOB', 'Nat', 'Pos', '', 'Venue', 'Date', 'Results Score']
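
One of the headers above is an empty string, which makes that column awkward to select later. One possible way to handle it, sketched below, is to give blank headers a positional placeholder name. The helper `fill_blank_headers` and the `column_` prefix are assumptions for illustration.

```python
def fill_blank_headers(headers, prefix="column_"):
    """Replace empty header strings with a name based on their position."""
    return [name if name else f"{prefix}{i}" for i, name in enumerate(headers)]


headers = ['Rank', 'Mark', 'WIND', 'Competitor', 'DOB', 'Nat', 'Pos', '',
           'Venue', 'Date', 'Results Score']
clean_headers = fill_blank_headers(headers)
print(clean_headers)
```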


In [7]:
# create an empty DataFrame that uses the header list as its column names
df = pd.DataFrame(columns = mtable_table_titles)

df

| | Rank | Mark | WIND | Competitor | DOB | Nat | Pos | | Venue | Date | Results Score |
|---|---|---|---|---|---|---|---|---|---|---|---|


In [8]:
#finding all rows in the table
column_data = table.find_all('tr')

In [9]:
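# iterate through the rows, skipping the first one because it holds the headers
for row in column_data[1:]:
    # collect the stripped text of every cell in this row
    row_data = row.find_all('td')
    individual_row_data = [data.text.strip() for data in row_data]

    # append the row at the next free integer index
    length = len(df)
    df.loc[length] = individual_row_data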
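
Appending with `df.loc[len(df)]` works, but it grows the DataFrame one row at a time and fails if a row has a different number of cells than there are headers. An alternative sketch is shown below: collect the rows in a plain list and build the DataFrame once. The inline table and the athlete names are made-up stand-ins for the real page.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Made-up stand-in for the real page.
html = """
<table class="records-table">
  <tr><th>Rank</th><th>Mark</th><th>Competitor</th></tr>
  <tr><td> 1 </td><td>9.90</td><td>Athlete A</td></tr>
  <tr><td> 2 </td><td>9.92</td><td>Athlete B</td></tr>
</table>
"""
table = BeautifulSoup(html, "html.parser").find("table", class_="records-table")
headers = [th.get_text(strip=True) for th in table.find_all("th")]

rows = []
for tr in table.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    # Keep only rows whose cell count matches the headers;
    # this also skips the header row, which has no <td> cells.
    if len(cells) == len(headers):
        rows.append(cells)

# Build the DataFrame in one step from the collected rows.
df = pd.DataFrame(rows, columns=headers)
print(df)
```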

In [10]:
df.head()

| | Rank | Mark | WIND | Competitor | DOB | Nat | Pos | | Venue | Date | Results Score |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 9.9 | 1.8 | Godson Oke OGHENEBRUME | 27 MAY 2003 | NGR | 2 | | Mike A. Myers Stadium, Austin, TX (USA) | 09 JUN 2023 | 1241 |
| 1 | 2 | 9.92 | 1.4 | Udodi Chudi ONWUZURIKE | 29 JAN 2003 | NGR | 1h2 | | Hornet Stadium - Sac St., Sacramento, CA (USA) | 26 MAY 2023 | 1234 |
| 2 | 3 | 9.93 | 0.8 | Godson Oke OGHENEBRUME | 27 MAY 2003 | NGR | 1sf3 | | Mike A. Myers Stadium, Austin, TX (USA) | 07 JUN 2023 | 1231 |
| 3 | 4 | 9.96 | 1.3 | Favour Oghene Tejiri ASHE | 28 APR 2002 | NGR | 1sf2 | | Mike A. Myers Stadium, Austin, TX (USA) | 07 JUN 2023 | 1220 |
| 4 | 5 | 9.98 | 1.8 | Udodi Chudi ONWUZURIKE | 29 JAN 2003 | NGR | 6 | | Mike A. Myers Stadium, Austin, TX (USA) | 09 JUN 2023 | 1213 |
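
Every value scraped this way is a string, so numeric and date columns need converting before they can be sorted or aggregated correctly. A minimal sketch is shown below, assuming the column names above and the `DD MON YYYY` date format visible in the output. The small DataFrame is a hand-typed sample, not freshly scraped data.

```python
import pandas as pd

# Hand-typed sample in the same shape as the scraped columns (all strings).
df = pd.DataFrame({
    "Mark": ["9.90", "9.92"],
    "WIND": ["1.8", "1.4"],
    "Date": ["09 JUN 2023", "26 MAY 2023"],
    "Results Score": ["1241", "1234"],
})

# Convert numeric columns; values that cannot be parsed become NaN.
for col in ["Mark", "WIND", "Results Score"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")

# Normalise "JUN" to "Jun" before parsing so the %b directive matches
# regardless of how strictly the parser treats letter case.
df["Date"] = pd.to_datetime(df["Date"].str.title(), format="%d %b %Y", errors="coerce")

print(df.dtypes)
```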


In [11]:
#save the data into a csv file
df.to_csv(r'C:\Users\sodiq.otubela\Downloads\python\100m_men.csv', index = False)
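
The hard-coded Windows path above only works on the author's machine. A more portable sketch is shown below, building the path with `pathlib` and reading the file back as a quick check. The temporary directory and the two-row DataFrame are stand-ins chosen for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Small stand-in for the scraped DataFrame.
df = pd.DataFrame({"Rank": [1, 2], "Mark": [9.90, 9.92]})

# A temporary directory stands in for the real output folder.
with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "100m_men.csv"
    # index=False leaves the integer row index out of the file.
    df.to_csv(out_path, index=False)
    # Read the file back to confirm what was written.
    restored = pd.read_csv(out_path)

print(restored)
```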