# Scraping data from web pages

This example scrapes data from an HTML table into a pandas dataframe and then exports it to a CSV file.  
It uses:
* [beautifulsoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/), a popular web scraping library;
* pandas, the de facto library for data analysis;
* and requests, a widely used Python library for making HTTP requests.

In [None]:
!pip install beautifulsoup4

In [None]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
from io import StringIO

View [this page](https://zomalex.co.uk/datasets/cars_table.html). It contains a simple HTML table.



Originally, this example used [a Wikipedia page](https://en.wikipedia.org/wiki/Strictly_Come_Dancing_(series_20)) giving the results for the 20th series of Strictly, a Saturday evening light entertainment dance competition on the BBC, which started in September 2022. Wikipedia now requires registration and authentication, so the example uses the simpler page linked above instead.

The next code cell gets all the details of the page, including the HTML content, into a variable. Before running it, view the page in a browser.

In [None]:
url = 'https://zomalex.co.uk/datasets/cars_table.html'
response = requests.get(url)
response.status_code, response.content[:1000]
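
The cell above displays the status code but carries on whatever its value. A minimal sketch of a more defensive fetch follows. The helper name `fetch_html` and the 10-second timeout are assumptions made for illustration and are not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page text, raising an exception on HTTP errors.

    Hypothetical helper: the name and the default timeout are illustrative.
    """
    # A timeout stops the call from hanging indefinitely on a slow server
    response = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx and 5xx status codes
    response.raise_for_status()
    return response.text


# Example call (needs network access):
# html = fetch_html('https://zomalex.co.uk/datasets/cars_table.html')
```

With this approach a failed download stops the notebook before the parsing step, rather than passing an error page on to beautifulsoup.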

The next code cell uses beautifulsoup to parse the content, and then shows the first 1000 characters in a formatted style.

In [None]:
soup = BeautifulSoup(response.content, 'html.parser')
print(soup.prettify()[:1000])


The HTML page contains a table that holds the data of interest. (In the original Wikipedia version, this was the first HTML table within a span section with the id Week_6:_Halloween_Week, holding the scores that couples obtained from the judges for a dance.) The next code cell finds the first table in the page and gets the HTML content that represents it.

In [None]:
my_table = soup.find('table')
my_table.prettify()[:1000]
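
`soup.find('table')` returns only the first table in the document. When a page holds several tables, one option is to select the table by an attribute such as `id`. The sketch below runs on a made-up HTML snippet. The ids `summary` and `cars` are invented for illustration and are not taken from the real page.

```python
from bs4 import BeautifulSoup

# Invented HTML with two tables, for illustration only
html = """
<html><body>
<table id="summary"><tr><th>Total</th></tr><tr><td>2</td></tr></table>
<table id="cars"><tr><th>Make</th></tr><tr><td>Ford</td></tr></table>
</body></html>
"""

soup_demo = BeautifulSoup(html, 'html.parser')

# find() returns the first match in document order
first_table = soup_demo.find('table')

# Keyword arguments filter on attributes, here the id
cars_table = soup_demo.find('table', id='cars')

# find_all() returns every match as a list
all_tables = soup_demo.find_all('table')

print(first_table['id'], cars_table['id'], len(all_tables))
```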

The next code cell uses the pandas read_html function to read data from the HTML table text into a dataframe.

In [None]:
# read_html needs an HTML parser such as lxml; if it is missing, run: !pip install lxml
df = pd.read_html(StringIO(str(my_table)))[0]
df.head()
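
To make clearer what `read_html` does with the table text, the sketch below builds a dataframe by walking the rows and cells with beautifulsoup. It is a simplified illustration, not how pandas implements `read_html`. The table contents (`Make`, `Price` and their values) are invented and may not match the real page.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Invented table for illustration only
html = """<table>
<tr><th>Make</th><th>Price</th></tr>
<tr><td>Ford</td><td>15000</td></tr>
<tr><td>Fiat</td><td>12000</td></tr>
</table>"""

table = BeautifulSoup(html, 'html.parser').find('table')

# One list of cell texts per table row, header cells included
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(['th', 'td'])]
    for tr in table.find_all('tr')
]

# The first row holds the column names; the rest are data rows
header, *body = rows
df_demo = pd.DataFrame(body, columns=header)

# Cell text arrives as strings, so convert numeric columns explicitly
df_demo['Price'] = pd.to_numeric(df_demo['Price'])
print(df_demo)
```

`read_html` handles these steps, plus type inference, in a single call, which is why the notebook uses it.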

The next code cell saves the dataframe to a CSV file for later use, perhaps within Power BI.

In [None]:
df.to_csv('cars.csv', index=False)
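
As a quick check, the CSV file can be read back into pandas. The sketch below uses an invented two-row dataframe and a temporary directory so that it runs on its own. The names `df_demo` and `cars_demo.csv` are illustrative only.

```python
import os
import tempfile

import pandas as pd

# Invented data standing in for the scraped table
df_demo = pd.DataFrame({'Make': ['Ford', 'Fiat'], 'Price': [15000, 12000]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'cars_demo.csv')
    # index=False leaves the row index out of the file
    df_demo.to_csv(path, index=False)
    df_back = pd.read_csv(path)

print(df_back)
```

Because the file was written with `index=False`, the dataframe read back has only the original columns and no extra unnamed index column.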