# Python surgery 2: Dealing with dates

This notebook works through Sam's question about scraping dates, using the GOV.UK employment tribunal decisions page as the example.

First we fetch the page and parse it.

In [3]:
#importing requests
import requests
#importing beautiful soup scraper library
from bs4 import BeautifulSoup
import pandas as pd


#fetch URL 
page = requests.get("https://www.gov.uk/employment-tribunal-decisions")

#command beautiful soup to parse the page
soup = BeautifulSoup(page.content,'html.parser')

#grab every link (<a> tag) on the page, to check the scrape has worked
cases = soup.select('a')

print(cases)

[<a class="gem-c-skip-link govuk-skip-link govuk-!-display-none-print" data-module="govuk-skip-link" href="#content">Skip to main content</a>, <a class="govuk-link" href="/help/cookies">View cookies</a>, <a class="govuk-link" data-module="gem-track-click" data-track-action="Cookie banner settings clicked from confirmation" data-track-category="cookieBanner" href="/help/cookies">change your cookie settings</a>, <a class="govuk-header__link govuk-header__link--homepage" data-module="gem-track-click" data-track-action="logoLink" data-track-category="headerClicked" data-track-dimension="GOV.UK" data-track-dimension-index="29" data-track-label="https://www.gov.uk" href="https://www.gov.uk" id="logo" title="Go to the GOV.UK homepage">
<span class="govuk-header__logotype">
<!--[if gt IE 8]><!-->
<svg aria-hidden="true" class="govuk-header__logotype-crown gem-c-layout-super-navigation-header__logotype-crown" focusable="false" height="30" viewbox="0 0 132 97" width="36" xmlns="http://www.w3.org

## Grabbing all the links

Next we check that we can grab just the links to each case, using the class those links share.

In [4]:
#this grabs the links to each case
cases = soup.select('a[class="gem-c-document-list__item-title govuk-link"]')
#check them
cases

[<a class="gem-c-document-list__item-title govuk-link" data-ecommerce-index="1" data-ecommerce-path="/employment-tribunal-decisions/ms-vidgen-and-others-v-k2-smiles-ltd-2302095-slash-2019-and-others" data-ecommerce-row="1" data-track-action="Employment tribunal decisions.1" data-track-category="navFinderLinkClicked" data-track-label="/employment-tribunal-decisions/ms-vidgen-and-others-v-k2-smiles-ltd-2302095-slash-2019-and-others" data-track-options='{"dimension28":50,"dimension29":"Ms Vidgen and Others v K2 Smiles Ltd: 2302095/2019 and Others"}' href="/employment-tribunal-decisions/ms-vidgen-and-others-v-k2-smiles-ltd-2302095-slash-2019-and-others">Ms Vidgen and Others v K2 Smiles Ltd: 2302095/2019 and Others</a>,
 <a class="gem-c-document-list__item-title govuk-link" data-ecommerce-index="2" data-ecommerce-path="/employment-tribunal-decisions/mr-michael-zawadzki-v-oak-furnitureland-group-ltd-2303383-slash-2020" data-ecommerce-row="1" data-track-action="Employment tribunal decisions.2

## Grabbing just the text, and just the link

Now we loop through those matches and store the linked text, and the link URL, separately in two lists. 

Those two lists are used to make a dataframe.

In [5]:
#remove superfluous html/css
case = []
link = []

for i in cases:
  #extract the text
  casename = i.get_text()
  #extract link
  linkname = i['href']
  #add the text and link to the previously empty lists
  case.append(casename)
  link.append(linkname)

caselist = pd.DataFrame({"case name" : case, "URL" : link})

caselist

| | case name | URL |
| --- | --- | --- |
| 0 | Ms Vidgen and Others v K2 Smiles Ltd: 2302095/... | /employment-tribunal-decisions/ms-vidgen-and-o... |
| 1 | Mr M Zawadzki v Oak Furnitureland Group Ltd: 2... | /employment-tribunal-decisions/mr-michael-zawa... |
| 2 | Mr M Yasin v Mitie Ltd and Butler Rose Recruit... | /employment-tribunal-decisions/mr-m-yasin-v-mi... |
| 3 | Dr M Nze v Guy’s & St Thomas’ NHS Foundation T... | /employment-tribunal-decisions/dr-m-nze-v-guy-... |
| 4 | Mr J Riley v Digital Mauve Ltd T/a Mauve Partn... | /employment-tribunal-decisions/mr-j-riley-v-di... |
| 5 | Miss N H v A Company: 2300952/2020 and 2301085... | /employment-tribunal-decisions/miss-n-h-v-a-co... |
| 6 | Miss E Gradzik v Hall and Woodhouse Ltd and So... | /employment-tribunal-decisions/miss-e-gradzik-... |
| 7 | Mr D Antwi and others v The Royal Parks Ltd: 2... | /employment-tribunal-decisions/mr-d-antwi-and-... |
| 8 | Mr T Ikeda v Misuho Bank Ltd and others: 22013... | /employment-tribunal-decisions/mr-t-ikeda-v-mi... |
| 9 | Mr A Akabogu v Notting Hill Genesis: 2200018/2021 | /employment-tribunal-decisions/mr-a-akabogu-v-... |
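
The `href` values stored in the `URL` column are relative paths, so they cannot be fetched as they stand. A minimal sketch of one way to turn them into full addresses, assuming the standard library's `urljoin` and the address of the page scraped earlier as the base (the two sample paths are copied from the link output above):

```python
from urllib.parse import urljoin

# the page the links were scraped from
base_url = "https://www.gov.uk/employment-tribunal-decisions"

# two relative links copied from the scraped output
relative_links = [
    "/employment-tribunal-decisions/ms-vidgen-and-others-v-k2-smiles-ltd-2302095-slash-2019-and-others",
    "/employment-tribunal-decisions/mr-michael-zawadzki-v-oak-furnitureland-group-ltd-2303383-slash-2020",
]

# urljoin keeps the scheme and domain of the base and swaps in the new path
full_links = [urljoin(base_url, path) for path in relative_links]

for url in full_links:
    print(url)
```

The same list comprehension could be applied to the `link` list before it goes into the dataframe.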


## Grabbing list items

The first time we try to grab list items (the `<li>` tag), the selection is too broad: it matches every list item on the page, including the navigation menus.

In [6]:
#grab all li tags
dates = soup.select('li')
#show the results
dates


[<li class="gem-c-layout-super-navigation-header__navigation-item gem-c-layout-super-navigation-header__navigation-item--with-children">
 <div class="gem-c-layout-super-navigation-header__navigation-toggle-wrapper govuk-clearfix">
 <a class="gem-c-layout-super-navigation-header__navigation-item-link" data-module="gem-track-click" data-track-action="topicsLink" data-track-category="headerClicked" data-track-dimension="Topics" data-track-dimension-index="29" data-track-label="/browse" href="/browse">Topics</a>
 <button aria-controls="super-navigation-menu__section-c4cfd67d" aria-expanded="false" aria-label="Show Topics menu" class="gem-c-layout-super-navigation-header__navigation-second-toggle-button" data-text-for-hide="Hide Topics menu" data-text-for-show="Show Topics menu" data-toggle-desktop-group="top" data-toggle-mobile-group="second" data-tracking-key="topics" hidden="hidden" id="super-navigation-menu__section-c4cfd67d-toggle" type="button">
 <span class="gem-c-layout-super-naviga

## Specify a class attribute

So we specify that we only want `<li>` tags with a specific `class=` attribute.

In [7]:
#grab the tags that match the description
dates = soup.select('li[class="gem-c-document-list__attribute"]')
dates

[<li class="gem-c-document-list__attribute">
                     Decided: <time datetime="2022-02-07">7 February 2022</time>
 </li>, <li class="gem-c-document-list__attribute">
                     Decided: <time datetime="2022-01-17">17 January 2022</time>
 </li>, <li class="gem-c-document-list__attribute">
                     Decided: <time datetime="2021-03-22">22 March 2021</time>
 </li>, <li class="gem-c-document-list__attribute">
                     Decided: <time datetime="2020-01-20">20 January 2020</time>
 </li>, <li class="gem-c-document-list__attribute">
                     Decided: <time datetime="2022-01-05">5 January 2022</time>
 </li>, <li class="gem-c-document-list__attribute">
                     Decided: <time datetime="2020-09-19">19 September 2020</time>
 </li>, <li class="gem-c-document-list__attribute">
                     Decided: <time datetime="2020-08-24">24 August 2020</time>
 </li>, <li class="gem-c-document-list__attribute">
                     Decid
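
The selector `li[class="..."]` matches only when the whole `class=` value is exactly that string. A class selector (`li.classname`) is a slightly looser alternative that still matches if the tag carries other classes as well. The sketch below compares the two on a small hand-written snippet, not the real page; the `extra-class` and `other` class names are made up for illustration.

```python
from bs4 import BeautifulSoup

# hand-written sample standing in for the real page (hypothetical markup)
html = """
<ul>
  <li class="gem-c-document-list__attribute">Decided: <time datetime="2022-02-07">7 February 2022</time></li>
  <li class="gem-c-document-list__attribute extra-class">Decided: <time datetime="2022-01-17">17 January 2022</time></li>
  <li class="other">Topics</li>
</ul>
"""

sample_soup = BeautifulSoup(html, "html.parser")

# exact attribute match: the class value must be identical to the string given
exact = sample_soup.select('li[class="gem-c-document-list__attribute"]')

# class selector: matches any <li> that has this class among its classes
by_class = sample_soup.select("li.gem-c-document-list__attribute")

print(len(exact))     # 1
print(len(by_class))  # 2
```

On the tribunal decisions page both selectors return the same matches, because each of these `<li>` tags has only the one class.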

## Loop to store the results in lists

Now we can generate a three-column dataframe by adding another loop for the new list of matches. Note that the date text still includes the line breaks and the "Decided:" label from the page.

In [8]:
date = []

for i in dates:
  #extract the date
  datename = i.get_text()
  #add the date to the date list
  date.append(datename)

case = []
link = []

for i in cases:
  #extract the text
  casename = i.get_text()
  #extract link
  linkname = i['href']
  #add the text and link to the previously empty lists
  case.append(casename)
  link.append(linkname)

caselist = pd.DataFrame({"case name" : case, "URL" : link, "date" : date})

caselist

| | case name | URL | date |
| --- | --- | --- | --- |
| 0 | Ms Vidgen and Others v K2 Smiles Ltd: 2302095/... | /employment-tribunal-decisions/ms-vidgen-and-o... | \n Decided: 7 February 2022\n |
| 1 | Mr M Zawadzki v Oak Furnitureland Group Ltd: 2... | /employment-tribunal-decisions/mr-michael-zawa... | \n Decided: 17 January 2022\n |
| 2 | Mr M Yasin v Mitie Ltd and Butler Rose Recruit... | /employment-tribunal-decisions/mr-m-yasin-v-mi... | \n Decided: 22 March 2021\n |
| 3 | Dr M Nze v Guy’s & St Thomas’ NHS Foundation T... | /employment-tribunal-decisions/dr-m-nze-v-guy-... | \n Decided: 20 January 2020\n |
| 4 | Mr J Riley v Digital Mauve Ltd T/a Mauve Partn... | /employment-tribunal-decisions/mr-j-riley-v-di... | \n Decided: 5 January 2022\n |
| 5 | Miss N H v A Company: 2300952/2020 and 2301085... | /employment-tribunal-decisions/miss-n-h-v-a-co... | \n Decided: 19 September 20... |
| 6 | Miss E Gradzik v Hall and Woodhouse Ltd and So... | /employment-tribunal-decisions/miss-e-gradzik-... | \n Decided: 24 August 2020\n |
| 7 | Mr D Antwi and others v The Royal Parks Ltd: 2... | /employment-tribunal-decisions/mr-d-antwi-and-... | \n Decided: 16 November 2021\n |
| 8 | Mr T Ikeda v Misuho Bank Ltd and others: 22013... | /employment-tribunal-decisions/mr-t-ikeda-v-mi... | \n Decided: 17 February 2022\n |
| 9 | Mr A Akabogu v Notting Hill Genesis: 2200018/2021 | /employment-tribunal-decisions/mr-a-akabogu-v-... | \n Decided: 17 February 2022\n |
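
The `date` column above still carries line breaks and the "Decided:" label. A minimal sketch of one way to tidy that text and turn it into a real date, assuming the text always follows the "Decided: day month year" pattern seen in the output (the sample strings are modelled on that output, and the amount of whitespace is a guess). The `%B` code matches full English month names under Python's default locale.

```python
from datetime import datetime

# sample strings modelled on the get_text() results above
raw_dates = [
    "\n                    Decided: 7 February 2022\n",
    "\n                    Decided: 17 January 2022\n",
]

clean_dates = []
for raw in raw_dates:
    # drop the label, then trim the surrounding whitespace and line breaks
    text = raw.replace("Decided:", "").strip()
    # parse "7 February 2022" into a date object
    parsed = datetime.strptime(text, "%d %B %Y").date()
    clean_dates.append(parsed)

for d in clean_dates:
    print(d.isoformat())
```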


## Grab the `time` attributes

Although the dates appear on the page as text, each one also sits inside a `<time>` tag whose `datetime=` attribute holds the same date in a much more useful year-month-day format, with the month shown as a number, e.g. `<time datetime="2022-02-07">`.

In [11]:
#grab the <time> tags
times = soup.select('time')
times

[<time datetime="2022-02-07">7 February 2022</time>,
 <time datetime="2022-01-17">17 January 2022</time>,
 <time datetime="2021-03-22">22 March 2021</time>,
 <time datetime="2020-01-20">20 January 2020</time>,
 <time datetime="2022-01-05">5 January 2022</time>,
 <time datetime="2020-09-19">19 September 2020</time>,
 <time datetime="2020-08-24">24 August 2020</time>,
 <time datetime="2021-11-16">16 November 2021</time>,
 <time datetime="2022-02-17">17 February 2022</time>,
 <time datetime="2022-02-17">17 February 2022</time>,
 <time datetime="2021-09-14">14 September 2021</time>,
 <time datetime="2022-02-16">16 February 2022</time>,
 <time datetime="2022-02-09">9 February 2022</time>,
 <time datetime="2022-02-09">9 February 2022</time>,
 <time datetime="2022-02-05">5 February 2022</time>,
 <time datetime="2020-09-21">21 September 2020</time>,
 <time datetime="2020-07-27">27 July 2020</time>,
 <time datetime="2020-09-04">4 September 2020</time>,
 <time datetime="2020-07-02">2 July 2020</

## Extract the `datetime` attribute

To extract the `datetime=` attribute, add `['datetime']` after a single match (for example `times[0]`), not after the whole list.

In [25]:
#show the first match in full
print(times[0])
#now show the datetime attribute of that match
print(times[0]['datetime'])
#store it
testdatetime = times[0]['datetime']
#now extract the first 4 characters (the year)
print(testdatetime[0:4])
#and the characters at positions 8 to 9 (the day)
print(testdatetime[8:10])
#and the characters at positions 5 to 6 (the month)
#converting to an integer at the same time
print(int(testdatetime[5:7]))

<time datetime="2022-02-07">7 February 2022</time>
2022-02-07
2022
07
2
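
Slicing works here because the `datetime=` value always has the same year-month-day layout. An alternative that avoids counting character positions is to let Python parse the string. This is a sketch using the standard library's `date.fromisoformat` (available from Python 3.7), assuming the value is a plain date with no time part:

```python
from datetime import date

# the datetime= value of the first match above
testdatetime = "2022-02-07"

# parse the string into a date object
parsed = date.fromisoformat(testdatetime)

print(parsed.year)   # 2022
print(parsed.month)  # 2
print(parsed.day)    # 7

# the same three values, extracted by slicing and converted to integers
print(int(testdatetime[0:4]), int(testdatetime[5:7]), int(testdatetime[8:10]))
```

Parsing also raises an error if a value is not a valid date, which slicing would not notice.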


## Loop to store the `datetime` in the dataframe too

Now we expand the earlier code to include this. 

We create four new lists: one for the whole datetime string, and one each for the year, month and day that we extract from it.

In [27]:
#create four empty lists
datetimes = []
years = []
days = []
months = []

#loop through those <time> tag matches
for i in times:
  #extract the datetime= value
  casedate = i['datetime']
  #add to the datetimes list
  datetimes.append(casedate)
  #'slice' the string to extract the year, month and day, adding each to a different list
  years.append(int(casedate[0:4]))
  months.append(int(casedate[5:7]))
  days.append(int(casedate[8:10]))

In [16]:
#show the datetimes list
datetimes

['2022-02-07',
 '2022-01-17',
 '2021-03-22',
 '2020-01-20',
 '2022-01-05',
 '2020-09-19',
 '2020-08-24',
 '2021-11-16',
 '2022-02-17',
 '2022-02-17',
 '2021-09-14',
 '2022-02-16',
 '2022-02-09',
 '2022-02-09',
 '2022-02-05',
 '2020-09-21',
 '2020-07-27',
 '2020-09-04',
 '2020-07-02',
 '2020-08-28',
 '2020-08-31',
 '2020-07-21',
 '2021-07-20',
 '2020-04-28',
 '2020-07-24',
 '2020-08-04',
 '2020-09-24',
 '2021-01-11',
 '2020-01-07',
 '2020-07-01',
 '2020-07-30',
 '2020-05-01',
 '2020-09-14',
 '2021-02-26',
 '2020-09-01',
 '2020-01-16',
 '2019-08-06',
 '2020-07-31',
 '2020-07-09',
 '2020-07-06',
 '2020-09-10',
 '2020-07-09',
 '2020-08-31',
 '2020-07-16',
 '2022-02-18',
 '2022-02-17',
 '2022-02-18',
 '2022-02-16',
 '2022-02-17',
 '2022-02-17']
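
The loop above fills the four lists but stops short of putting them in the dataframe. One possible way to finish the job is sketched below with two sample rows copied from the earlier outputs. The new column names (`datetime`, `year`, `month`, `day`) are assumptions, and every list must be the same length for `pd.DataFrame` to accept them.

```python
import pandas as pd

# two sample rows copied from the outputs above (rows 0 and 9)
case = [
    "Ms Vidgen and Others v K2 Smiles Ltd: 2302095/2019 and Others",
    "Mr A Akabogu v Notting Hill Genesis: 2200018/2021",
]
datetimes = ["2022-02-07", "2022-02-17"]

# slice each string into year, month and day, as in the loop above
years = [int(d[0:4]) for d in datetimes]
months = [int(d[5:7]) for d in datetimes]
days = [int(d[8:10]) for d in datetimes]

caselist = pd.DataFrame({
    "case name": case,
    # convert the strings to real dates so they sort and filter properly
    "datetime": pd.to_datetime(datetimes),
    "year": years,
    "month": months,
    "day": days,
})

# sort with the most recent decision first
caselist = caselist.sort_values("datetime", ascending=False)
print(caselist)
```

With the real data, the `case` and `link` lists from the earlier loop would go in alongside the full `datetimes`, `years`, `months` and `days` lists, provided the number of `<time>` matches equals the number of case links.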