# Scraping data from web pages

This example scrapes data from an HTML table into a pandas dataframe and then exports it to a CSV file.  
It uses 
* [beautifulsoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/), a popular web scraping library; 
* pandas, the de facto library for data analysis; 
* and requests, a widely used Python library for making HTTP requests.

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

This Wikipedia page provides the results for the 20th series of Strictly Come Dancing, a Saturday evening light entertainment dance competition on the BBC. This series started in September 2022.
The next code cell downloads the page and stores the response, including the HTML content, in a variable.
Before running the next cell, [view this page](https://en.wikipedia.org/wiki/Strictly_Come_Dancing_(series_20)#Week_9:_Blackpool_Week) in a browser.

In [2]:
#strictly_url = 'https://en.wikipedia.org/wiki/Strictly_Come_Dancing_(series_20)#Week_9:_Blackpool_Week'
strictly_url = 'https://en.wikipedia.org/wiki/Strictly_Come_Dancing_(series_20)'
response = requests.get(strictly_url)
response.status_code, response.content[:1000]

(200,
 b'<!DOCTYPE html>\n<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-enabled vector-feature-main-menu-pinned-disabled vector-feature-limited-width-enabled vector-feature-limited-width-content-enabled vector-feature-zebra-design-disabled" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8">\n<title>Strictly Come Dancing (series 20) - Wikipedia</title>\n<script>document.documentElement.className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-enabled vector-feature-main-menu-pinned-disabled vector-feature-limited-width-enabled vector-feature-limited-width-content-enabled vector-feature-zebra-design-disabled";(function(){var cookie=document.cookie.
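
The cell above assumes the download succeeds. As a minimal sketch, the request could be wrapped in a helper that sets a timeout and raises on HTTP errors. The function name `fetch_html`, the `session` parameter and the User-Agent value are hypothetical and only used for illustration; some sites reject requests that do not send a descriptive User-Agent.

```python
import requests

# Hypothetical User-Agent string; replace the contact details with your own.
DEFAULT_HEADERS = {
    'User-Agent': 'strictly-scraper-example/0.1 (contact: you@example.com)'
}


def fetch_html(url, timeout=10, session=None):
    """Return the page text, raising an exception on HTTP errors or timeouts."""
    # Use the supplied session if there is one, otherwise the requests module itself.
    client = session if session is not None else requests
    response = client.get(url, headers=DEFAULT_HEADERS, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return response.text
```

With this helper, `fetch_html(strictly_url)` would return the decoded HTML as a string, which can be passed to BeautifulSoup in the same way as `response.content`.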

The next code cell uses beautifulsoup to parse the content, and then shows the first 1000 characters in a formatted style.

In [5]:
soup = BeautifulSoup(response.content, 'html.parser')
print(soup.prettify()[:1000])


<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-enabled vector-feature-main-menu-pinned-disabled vector-feature-limited-width-enabled vector-feature-limited-width-content-enabled vector-feature-zebra-design-disabled" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Strictly Come Dancing (series 20) - Wikipedia
  </title>
  <script>
   document.documentElement.className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-enabled vector-feature-main-menu-pinned-disabled vector-feature-limited-width-enabled vector-feature-limited-width-content-enabled vector-feature-zebra-design-disabled";(function(){var cookie=document.co
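
Once a page is parsed, the soup object can be queried directly instead of reading the raw text. The sketch below uses a small made-up HTML string (`mini_html` is an assumption, not the real article) to show two quick checks: reading the page title and counting the tables.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page, used here instead of the downloaded article.
mini_html = (
    "<html><head><title>Demo page</title></head>"
    "<body><table></table><table></table></body></html>"
)

mini_soup = BeautifulSoup(mini_html, 'html.parser')

# The <title> tag is available as an attribute of the soup.
page_title = mini_soup.title.get_text(strip=True)

# find_all returns every matching tag, so its length is the table count.
table_count = len(mini_soup.find_all('table'))

print(page_title, table_count)
```

Running the same two checks against `soup` gives a quick confirmation that the expected page was downloaded and how many tables it contains.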

The HTML page contains a span element with the id Week_6:_Halloween_Week. The first HTML table that follows this span contains the data of interest: the scores that couples obtained from the judges for a dance. The next code cell finds the span and then gets the HTML content that represents this table. 

In [6]:
span_halloween = soup.find('span', {'id': 'Week_6:_Halloween_Week'})
my_table = span_halloween.find_next('table')
my_table.prettify()[:1000]

'<table class="wikitable" style="width:90%;">\n <tbody>\n  <tr>\n   <th>\n    Couple\n   </th>\n   <th>\n    Score\n    <sup class="reference" id="cite_ref-44">\n     <a href="#cite_note-44">\n      [44]\n     </a>\n    </sup>\n   </th>\n   <th>\n    Dance\n    <sup class="reference" id="cite_ref-week6-songs_45-0">\n     <a href="#cite_note-week6-songs-45">\n      [45]\n     </a>\n    </sup>\n   </th>\n   <th>\n    Music\n    <sup class="reference" id="cite_ref-week6-songs_45-1">\n     <a href="#cite_note-week6-songs-45">\n      [45]\n     </a>\n    </sup>\n   </th>\n   <th>\n    Result\n    <sup class="reference" id="cite_ref-46">\n     <a href="#cite_note-46">\n      [46]\n     </a>\n    </sup>\n   </th>\n  </tr>\n  <tr>\n   <td>\n    Tony &amp; Katya\n   </td>\n   <td>\n    31 (7,8,8,8)\n   </td>\n   <td>\n    Quickstep\n   </td>\n   <td>\n    "\n    <a href="/wiki/The_Devil_Went_Down_to_Georgia" title="The Devil Went Down to Georgia">\n     The Devil Went Down to Georgia\n    </a>\
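
If the page markup changes and the id is no longer present, `soup.find` returns `None` and the following `find_next` call fails with an `AttributeError`. A minimal sketch of a more defensive lookup is shown below; the helper name `table_after_anchor` and the sample HTML are hypothetical. It searches by id on any tag rather than only on `span`, and raises a clear error when nothing is found.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page standing in for the real article.
sample_html = """
<h3><span id="Week_1">Week 1</span></h3>
<table><tr><th>Couple</th></tr><tr><td>A &amp; B</td></tr></table>
<h3><span id="Week_2">Week 2</span></h3>
<table><tr><th>Couple</th></tr><tr><td>C &amp; D</td></tr></table>
"""


def table_after_anchor(soup, anchor_id):
    """Return the first <table> that follows the element with the given id."""
    anchor = soup.find(id=anchor_id)  # matches any tag, not only <span>
    if anchor is None:
        raise ValueError(f"No element with id {anchor_id!r} found")
    table = anchor.find_next('table')
    if table is None:
        raise ValueError(f"No table found after id {anchor_id!r}")
    return table


sample_soup = BeautifulSoup(sample_html, 'html.parser')
sample_table = table_after_anchor(sample_soup, 'Week_2')
first_cell = sample_table.find('td').get_text(strip=True)
print(first_cell)
```

The same helper could be called as `table_after_anchor(soup, 'Week_6:_Halloween_Week')` on the parsed article.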

The next code cell uses the pandas read_html function to read data from the HTML table text into a dataframe.

In [7]:
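from io import StringIO
# Wrap the HTML in StringIO: recent pandas versions deprecate passing literal HTML strings
df = pd.read_html(StringIO(str(my_table)))[0]
df.head()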

,Couple,Score[44],Dance[45],Music[45],Result[46]
0,Tony & Katya,"31 (7,8,8,8)",Quickstep,"""The Devil Went Down to Georgia""—Charlie Danie...",Safe
1,Will & Nancy,"32 (8,8,8,8)",Cha-Cha-Cha,"""Mama Told Me Not to Come""—Tom Jones & Stereop...",Safe
2,Kym & Graziano,"34 (8,8,9,9)",Rumba,"""Frozen""—Madonna",Safe
3,James & Amy,"27 (6,7,7,7)",Charleston,"""Bumble Bee""—LaVern Baker",Eliminated
4,Molly & Carlos,"33 (6,9,9,9)",Argentine Tango,"""Running Up That Hill""—Kate Bush",Safe


The current column names include Wikipedia footnote markers such as [44]. The next code cell renames the columns of the pandas dataframe.

In [8]:
df.columns = ['couple', 'score','dance','music','result']
df.head()


,couple,score,dance,music,result
0,Tony & Katya,"31 (7,8,8,8)",Quickstep,"""The Devil Went Down to Georgia""—Charlie Danie...",Safe
1,Will & Nancy,"32 (8,8,8,8)",Cha-Cha-Cha,"""Mama Told Me Not to Come""—Tom Jones & Stereop...",Safe
2,Kym & Graziano,"34 (8,8,9,9)",Rumba,"""Frozen""—Madonna",Safe
3,James & Amy,"27 (6,7,7,7)",Charleston,"""Bumble Bee""—LaVern Baker",Eliminated
4,Molly & Carlos,"33 (6,9,9,9)",Argentine Tango,"""Running Up That Hill""—Kate Bush",Safe
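
The score column still holds the total and the individual judges' marks in one string, such as "31 (7,8,8,8)". As a sketch of one possible clean-up step, assuming every value follows that "total (a,b,c,d)" pattern, the string can be split with a regular expression. The names `SCORE_PATTERN` and `parse_score` are hypothetical.

```python
import re

# Matches a total followed by comma-separated marks in brackets, e.g. "31 (7,8,8,8)".
SCORE_PATTERN = re.compile(r'^\s*(\d+)\s*\(([\d,\s]+)\)\s*$')


def parse_score(text):
    """Split a score string into (total, list of individual judges' marks)."""
    match = SCORE_PATTERN.match(text)
    if match is None:
        raise ValueError(f"Unexpected score format: {text!r}")
    total = int(match.group(1))
    judges = [int(part) for part in match.group(2).split(',')]
    return total, judges


total, judges = parse_score('31 (7,8,8,8)')
print(total, judges)
```

Applied to the dataframe, something like `df['score'].apply(lambda s: parse_score(s)[0])` would give a numeric total column, provided all rows match the assumed pattern.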


The next code cell saves the dataframe to a CSV file for later use, perhaps within Power BI.

In [None]:
df.to_csv('strictly.csv', index=False)
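
To check that an export can be read back unchanged, the file can be reloaded with `pd.read_csv`. The sketch below uses a small made-up dataframe and a temporary directory (both assumptions, so the real `strictly.csv` is not touched). It also writes with the `utf-8-sig` encoding, which adds a byte order mark that some spreadsheet tools use to detect UTF-8. Whether that is needed depends on the tool reading the file.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical one-row sample in the same shape as the scraped data.
sample = pd.DataFrame({
    'couple': ['Tony & Katya'],
    'score': ['31 (7,8,8,8)'],
    'result': ['Safe'],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / 'sample.csv'
    # index=False leaves out the row index, as in the cell above.
    sample.to_csv(csv_path, index=False, encoding='utf-8-sig')
    # Reading with the same encoding strips the byte order mark again.
    restored = pd.read_csv(csv_path, encoding='utf-8-sig')

print(restored.to_dict('records'))
```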