# Web Scraping In Python

**Olojede Joseph**

12/8/2024

## Introduction

Web scraping (or data scraping) is a technique for collecting content and data from web pages programmatically. In this project, I used the BeautifulSoup library in Python to extract data from Wikipedia. The objective was to compile the list of clubs that have played in the Premier League from its inception in 1992 to the 2024–25 season.

Import the packages

In [2]:
from bs4 import BeautifulSoup
import pandas as pd
import requests
import warnings

In Python, the `warnings` module controls which warnings are shown. Its `filterwarnings` function can ignore specific warning messages or categories; here only `FutureWarning` is suppressed.

In [3]:
# Ignore FutureWarning messages only; other warning categories are still shown
warnings.filterwarnings("ignore", category=FutureWarning)
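A global filter stays active for the rest of the session. As a minimal sketch of a narrower alternative, `warnings.catch_warnings()` limits the filter to one block and restores the previous settings afterwards. The `noisy` function below is a hypothetical stand-in for any library call that emits a `FutureWarning`.

```python
import warnings

def noisy():
    """Hypothetical helper that emits a FutureWarning and returns a value."""
    warnings.warn("this behaviour will change", FutureWarning)
    return 42

# Record everything first, to confirm that the warning really is emitted.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    noisy()
emitted = [w.category.__name__ for w in caught]

# Inside this block FutureWarning is ignored; the previous filters
# are restored automatically when the block exits.
with warnings.catch_warnings(record=True) as caught_filtered:
    warnings.simplefilter("always")
    warnings.filterwarnings("ignore", category=FutureWarning)
    result = noisy()
suppressed = len(caught_filtered) == 0
```

Scoping the filter this way keeps unrelated warnings visible elsewhere in the notebook.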

**Enter the website URL, download the page, and parse it.**

In [4]:
url = 'https://en.wikipedia.org/wiki/List_of_Premier_League_clubs'

page = requests.get(url).text

# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(page, 'html.parser')
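A bare `requests.get(url).text` returns whatever the server sends, including the body of an error page. The sketch below is one possible hardening, not the method used in this notebook: it checks the HTTP status, sets a timeout, and sends a descriptive `User-Agent`, which Wikimedia asks automated clients to provide. The `fetch_html` name, the `USER_AGENT` string, and the 10-second timeout are all assumptions made for illustration.

```python
import requests

# Hypothetical contact string; replace it with details of your own project.
USER_AGENT = "premier-league-scraper/0.1 (contact: you@example.com)"

def fetch_html(url, session=None, timeout=10):
    """Return the page body as text, raising on HTTP errors or timeouts."""
    session = session or requests.Session()
    response = session.get(url, headers={"User-Agent": USER_AGENT}, timeout=timeout)
    response.raise_for_status()  # turn 4xx/5xx responses into exceptions
    return response.text
```

With this helper, the download step would read `page = fetch_html(url)`, and a failed request would stop the notebook instead of silently parsing an error page.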

Print the result.

In [None]:
soup

Select the first table on the page.

In [None]:
table = soup.find_all('table')[0]
table
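Selecting by position works only while the target table remains the first one on the page. A slightly more robust sketch selects by CSS class instead. The markup below is a made-up stand-in, and the `wikitable` class name is an assumption about the real page that should be checked in the browser's inspector.

```python
from bs4 import BeautifulSoup

# Stand-in markup; the class names are an assumption about the real page.
html = """
<table class="infobox"><tr><th>Not the data</th></tr></table>
<table class="wikitable sortable"><tr><th>Club</th><th>Location</th></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ matches any one of the space-separated classes on the element
table = soup.find("table", class_="wikitable")
if table is None:
    raise ValueError("no table with class 'wikitable' found")

first_header = table.find("th").get_text(strip=True)
```

Here index `[0]` would have picked the unrelated first table, while the class lookup finds the data table.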

Get the header cells (`th` elements) of the table, which hold the column names.

In [None]:
title = table.find_all('th')
title

Use a list comprehension to extract the stripped text of each header cell into a list.

In [8]:
# Use a distinct loop variable so the `title` list is not shadowed
header_titles = [th.text.strip() for th in title]

header_titles

['Club',
 'Location',
 'Totalseasons',
 'Totalspells',
 'Longestspell',
 'Most recentpromotion',
 'Most recentrelegation',
 'Totalseasonsabsent',
 'Seasons',
 'Current spell',
 'Most recentfinish',
 'Highestfinish',
 'Top scorer']
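Names such as `'Totalseasons'` appear fused, most likely because the header cells contain line breaks that `.text` drops without leaving a space. The sketch below reproduces the effect on a made-up header row (the `<br>` tags are an assumption about the real markup) and shows how `get_text` with a separator keeps the words apart.

```python
from bs4 import BeautifulSoup

# Stand-in for one header row; the <br> tags are an assumption about the real markup.
html = (
    "<table><tr><th>Club</th><th>Total<br>seasons</th>"
    "<th>Most recent<br>promotion</th></tr></table>"
)
cells = BeautifulSoup(html, "html.parser").find_all("th")

# .text concatenates the text fragments with nothing between them
fused = [th.text.strip() for th in cells]

# get_text(" ", strip=True) strips each fragment and joins them with a space
spaced = [th.get_text(" ", strip=True) for th in cells]
```

Switching to the spaced form would change the column names in the DataFrame, so any later code that refers to a column by name would need the same spelling.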

Get all the table rows (`tr` elements).

In [9]:
column_data = table.find_all('tr')

Create an empty list and append each row's data to it. The first row is skipped because it holds the header cells.

In [10]:
all_data = []

for row in column_data[1:]:
    row_data = row.find_all('td')
    individual_row_data = [data.text.strip() for data in row_data]
    all_data.append(individual_row_data)
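`pd.DataFrame(all_data, columns=header_titles)` raises an error if a row is wider than the header, so it can help to check row widths first. The following is a minimal sketch of such a check using plain lists. The `check_rows` helper and the sample rows are hypothetical and are not taken from the scraped page.

```python
def check_rows(rows, n_columns):
    """Split rows into those matching the header width and those that do not."""
    clean, rejected = [], []
    for index, row in enumerate(rows):
        if len(row) == n_columns:
            clean.append(row)
        else:
            rejected.append((index, row))  # keep the position for debugging
    return clean, rejected

header = ["Club", "Location", "Totalseasons"]
rows = [
    ["Arsenal", "London (Holloway)", "33"],
    ["Aston Villa", "Birmingham (Aston)"],  # made-up short row
    [],                                      # e.g. a row with no <td> cells
]
clean_rows, rejected = check_rows(rows, len(header))
```

Inspecting `rejected` before building the DataFrame shows which rows of the page need special handling.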

Convert the list to a DataFrame.

In [11]:
df = pd.DataFrame(all_data, columns=header_titles)

df.head()

| | Club | Location | Totalseasons | Totalspells | Longestspell | Most recentpromotion | Most recentrelegation | Totalseasonsabsent | Seasons | Current spell | Most recentfinish | Highestfinish | Top scorer |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Arsenal | London (Holloway) | 33 | 1 | 33 | 1914–15[a] | Never relegated | 0 | 1992– | 32 | 2nd | 1st | Thierry Henry (175) |
| 1 | Aston Villa | Birmingham (Aston) | 30 | 2 | 24 | 2018–19 | 2015–16 | 3 | 1992–20162019– | 5 | 4th | 2nd | Gabriel Agbonlahor (74) |
| 2 | Barnsley | Barnsley | 1 | 1 | 1 | 1996–97 | 1997–98 | 31 | 1997–1998 | 0 | League One6th | 19th (relegated) | Neil Redfearn (10) |
| 3 | Birmingham City | Birmingham (Bordesley) | 7 | 3 | 4 | 2008–09 | 2010–11 | 25 | 2002–20062007–20082009–2011 | 0 | Championship22nd (relegated) | 9th | Mikael Forssell (29) |
| 4 | Blackburn Rovers | Blackburn | 18 | 2 | 11 | 2000–01 | 2011–12 | 14 | 1992–19992001–2012 | 0 | Championship19th | 1st | Alan Shearer (112) |
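The scraped values still carry Wikipedia artifacts, such as the footnote marker in `1914–15[a]` and the goal tally inside `Thierry Henry (175)`. As a rough sketch of a cleaning step (not part of the original notebook), the two hypothetical helpers below strip bracketed footnotes and split the scorer's name from the number of goals using regular expressions.

```python
import re

FOOTNOTE = re.compile(r"\[[^\]]*\]")  # bracketed markers such as "[a]"
SCORER = re.compile(r"^(?P<name>.+?)\s*\((?P<goals>\d+)\)$")

def strip_footnotes(text):
    """Remove bracketed footnote markers from a cell value."""
    return FOOTNOTE.sub("", text).strip()

def split_scorer(text):
    """Return (name, goals); goals is None when no tally is present."""
    match = SCORER.match(text.strip())
    if match is None:
        return text, None
    return match.group("name"), int(match.group("goals"))

promotion = strip_footnotes("1914–15[a]")
name, goals = split_scorer("Thierry Henry (175)")
```

Applied to the DataFrame, these could be used with something like `df["Top scorer"].map(split_scorer)`, depending on how the cleaned columns should be laid out.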


Export the DataFrame to a .csv file, encoded as `utf-8-sig` so that Excel reads the non-ASCII characters (such as the en dash in "1992–") correctly.

In [None]:
df.to_csv('Premier League Clubs from 1992 till 2024-25 Season.csv', index=False, encoding='utf-8-sig')
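The `utf-8-sig` codec differs from plain `utf-8` only in that it writes a byte order mark (BOM) at the start of the file, which Excel uses to detect the encoding. The sketch below illustrates this with the standard library alone. The file name and the two sample rows are made up for the demonstration.

```python
import csv
import os
import tempfile

rows = [["Club", "Seasons"], ["Arsenal", "1992–"]]  # the en dash is non-ASCII

# Write a small CSV to a temporary directory using the same encoding
path = os.path.join(tempfile.mkdtemp(), "clubs.csv")
with open(path, "w", newline="", encoding="utf-8-sig") as handle:
    csv.writer(handle).writerows(rows)

# The file starts with the three-byte UTF-8 BOM
with open(path, "rb") as handle:
    raw = handle.read()
has_bom = raw.startswith(b"\xef\xbb\xbf")

# Reading with utf-8-sig strips the BOM again, so the data round-trips
with open(path, newline="", encoding="utf-8-sig") as handle:
    round_trip = list(csv.reader(handle))
```

When reading the file back with pandas, passing `encoding='utf-8-sig'` to `pd.read_csv` keeps the BOM out of the first column name.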
