# Python case study: dealing with dates

When we import or scrape dates, they usually arrive as text strings, when we actually want to treat them *as* dates.

In this notebook we scrape some data that includes dates, and then convert the dates to a 'datetime' variable so we can perform related actions (such as extracting a month, or identifying which day of the week a date fell on).
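
To see why the distinction matters, here is a minimal sketch using only the standard library's `datetime` module. The sample strings are illustrative (written in the style of the dates on the page scraped later), and the `"%d %B %Y"` format assumes English month names.

```python
from datetime import datetime

# Illustrative date strings (not scraped): day, month name, year
text_dates = ["7 February 2023", "23 January 2023", "3 January 2023"]

# Sorted as text, the strings are compared character by character,
# so "23 January" lands before "3 January"
sorted_as_text = sorted(text_dates)

# Parsed into datetime objects, the same values sort chronologically
parsed_dates = [datetime.strptime(d, "%d %B %Y") for d in text_dates]
sorted_as_dates = sorted(parsed_dates)

print(sorted_as_text)
print([d.strftime("%Y-%m-%d") for d in sorted_as_dates])

# A datetime also exposes its parts directly
first = sorted_as_dates[0]
print(first.month)      # month as a number
print(first.weekday())  # day of the week, where Monday is 0
```

The text sort puts 23 January ahead of 3 January, while the parsed values come out in calendar order.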

## Scrape the page

First we scrape the data: a page of employment tribunal decisions.

In [4]:
#importing requests
import requests
#importing beautiful soup scraper library
from bs4 import BeautifulSoup
import pandas as pd


#fetch URL 
page = requests.get("https://www.gov.uk/employment-tribunal-decisions")

#command beautiful soup to parse the page
soup = BeautifulSoup(page.content,'html.parser')

#select every link (<a> tag) on the page
cases = soup.select('a')

#print the first match
print(cases[0])

<a class="gem-c-skip-link govuk-skip-link govuk-!-display-none-print" data-module="govuk-skip-link" href="#content">Skip to main content</a>


## Grabbing all the links

Next we narrow the selection to the links for each case, identified by their `class` attribute, and check the first match.

In [7]:
#this grabs the links to each case
cases = soup.select('a[class="gem-c-document-list__item-title govuk-link"]')
#check the first
print(cases[0])

<a class="gem-c-document-list__item-title govuk-link" data-ecommerce-index="1" data-ecommerce-path="/employment-tribunal-decisions/miss-c-lewis-v-joy2care-ltd-2600556-slash-2021" data-ecommerce-row="1" data-ga4-ecommerce-path="/employment-tribunal-decisions/miss-c-lewis-v-joy2care-ltd-2600556-slash-2021" data-track-action="Employment tribunal decisions.1" data-track-category="navFinderLinkClicked" data-track-label="/employment-tribunal-decisions/miss-c-lewis-v-joy2care-ltd-2600556-slash-2021" data-track-options='{"dimension28":50,"dimension29":"Miss C Lewis v Joy2Care Ltd: 2600556/2021"}' href="/employment-tribunal-decisions/miss-c-lewis-v-joy2care-ltd-2600556-slash-2021">Miss C Lewis v Joy2Care Ltd: 2600556/2021</a>
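
The attribute selector used above matches the full `class` string exactly, so it would miss a link whose classes were listed in a different order. As a hedged alternative, a CSS class selector (`a.classname`) matches on a single class regardless of what else is present. The sketch below uses made-up HTML rather than the live page, and `sample_soup` is a name introduced only for this illustration.

```python
from bs4 import BeautifulSoup

# Made-up HTML: the same two classes, listed in a different order on the second link
html = """
<a class="gem-c-document-list__item-title govuk-link" href="/case-one">Case one</a>
<a class="govuk-link gem-c-document-list__item-title" href="/case-two">Case two</a>
<a class="govuk-link" href="/help">Help</a>
"""
sample_soup = BeautifulSoup(html, "html.parser")

# Attribute selector: the whole class string must match exactly
exact = sample_soup.select('a[class="gem-c-document-list__item-title govuk-link"]')

# Class selector: matches any <a> that has this class among its classes
by_class = sample_soup.select("a.gem-c-document-list__item-title")

print(len(exact))
print(len(by_class))
```

Either approach works on the page as scraped here; the class selector is simply less sensitive to small changes in the markup.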


## Grabbing just the text, and just the link

Now we loop through those matches and store the linked text, and the link URL, separately in two lists. 

Those two lists are used to make a dataframe.

In [8]:
#create two empty lists to hold the text and links, stripped of superfluous HTML/CSS
case = []
link = []

#loop through cases 
for i in cases:
  #extract the text
  casename = i.get_text()
  #extract link
  linkname = i['href']
  #add the text and link to the previously empty lists
  case.append(casename)
  link.append(linkname)

#create a new dataframe which uses those two lists as its two columns
caselist = pd.DataFrame({"case name" : case, "URL" : link})

#show the first 5 rows
caselist[:5]

| | case name | URL |
|---|---|---|
| 0 | Miss C Lewis v Joy2Care Ltd: 2600556/2021 | /employment-tribunal-decisions/miss-c-lewis-v-... |
| 1 | Ms B Reddish v SMN Care Homes Ltd and SMN Inve... | /employment-tribunal-decisions/ms-b-reddish-v-... |
| 2 | Mrs N Frost and Mrs M Schofield v DMJ Drainage... | /employment-tribunal-decisions/mrs-n-frost-and... |
| 3 | Mr S Mamodaly v EEV Management Ltd: 2601955/20... | /employment-tribunal-decisions/mr-s-mamodaly-v... |
| 4 | Mr G Peck v Esland North Ltd: 2602432/2022 | /employment-tribunal-decisions/mr-g-peck-v-esl... |
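
The links in the `URL` column are relative: they start with `/` rather than `https://`. If full addresses are needed later (for example, to scrape each case page), one option is `urljoin` from the standard library. This is a sketch, not part of the original notebook; `base_url`, `relative_links` and `full_links` are names introduced for illustration.

```python
from urllib.parse import urljoin

# The address of the page that was scraped
base_url = "https://www.gov.uk/employment-tribunal-decisions"

# One relative link in the style of the URL column (illustrative sample)
relative_links = [
    "/employment-tribunal-decisions/miss-c-lewis-v-joy2care-ltd-2600556-slash-2021",
]

# urljoin resolves each relative path against the base address
full_links = [urljoin(base_url, path) for path in relative_links]
print(full_links[0])
```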


## Grabbing list items

The first time we try to grab list items (the tag `<li>`), the selection is too broad.

In [10]:
#grab all li tags
dates = soup.select('li')
#show the first 3 results
dates[:3]


[<li class="gem-c-layout-super-navigation-header__dropdown-list-item">
 <a class="govuk-link gem-c-layout-super-navigation-header__navigation-second-item-link" data-ga4-link='{"event_name":"navigation","type":"header menu bar","index":"1.1.1","index_total":29,"section":"Topics"}' data-track-action="topicsLink" data-track-category="headerClicked" data-track-dimension="Benefits" data-track-dimension-index="29" data-track-label="/browse/benefits" href="/browse/benefits">Benefits</a>
 </li>, <li class="gem-c-layout-super-navigation-header__dropdown-list-item">
 <a class="govuk-link gem-c-layout-super-navigation-header__navigation-second-item-link" data-ga4-link='{"event_name":"navigation","type":"header menu bar","index":"1.1.2","index_total":29,"section":"Topics"}' data-track-action="topicsLink" data-track-category="headerClicked" data-track-dimension="Births, death, marriages and care" data-track-dimension-index="29" data-track-label="/browse/births-deaths-marriages" href="/browse/births

## Specify a class attribute

So we specify we only want `<li>` tags with a specific `class=`.

In [11]:
#grab the tags that match the description
dates = soup.select('li[class="gem-c-document-list__attribute"]')
#show the first 3
dates[:3]

[<li class="gem-c-document-list__attribute">
                     Decided: <time datetime="2023-02-07">7 February 2023</time>
 </li>, <li class="gem-c-document-list__attribute">
                     Decided: <time datetime="2023-02-02">2 February 2023</time>
 </li>, <li class="gem-c-document-list__attribute">
                     Decided: <time datetime="2023-01-23">23 January 2023</time>
 </li>]

## Loop to store the results in lists

Now we can generate a three-column dataframe by adding another loop for the new list of matches.

In [12]:
#create an empty list
date = []

#loop through 'dates'
for i in dates:
  #extract the date
  datename = i.get_text()
  #add the date to the date list
  date.append(datename)

#create two more empty lists
case = []
link = []

#loop through 'cases'
for i in cases:
  #extract the text
  casename = i.get_text()
  #extract link
  linkname = i['href']
  #add the text and link to the previously empty lists
  case.append(casename)
  link.append(linkname)

#create a new dataframe with those three lists as the 3 columns
caselist = pd.DataFrame({"case name" : case, "URL" : link, "date" : date})

#show the first 5 rows
caselist[:5]

| | case name | URL | date |
|---|---|---|---|
| 0 | Miss C Lewis v Joy2Care Ltd: 2600556/2021 | /employment-tribunal-decisions/miss-c-lewis-v-... | \n Decided: 7 February 2023\n |
| 1 | Ms B Reddish v SMN Care Homes Ltd and SMN Inve... | /employment-tribunal-decisions/ms-b-reddish-v-... | \n Decided: 2 February 2023\n |
| 2 | Mrs N Frost and Mrs M Schofield v DMJ Drainage... | /employment-tribunal-decisions/mrs-n-frost-and... | \n Decided: 23 January 2023\n |
| 3 | Mr S Mamodaly v EEV Management Ltd: 2601955/20... | /employment-tribunal-decisions/mr-s-mamodaly-v... | \n Decided: 12 January 2023\n |
| 4 | Mr G Peck v Esland North Ltd: 2602432/2022 | /employment-tribunal-decisions/mr-g-peck-v-esl... | \n Decided: 3 January 2023\n |
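
The `date` values still carry newlines, padding and the `Decided:` label. If the visible text were the only source for the date, it could be tidied with ordinary string methods. The sketch below assumes every value follows the pattern shown above; `clean_date_text` is a hypothetical helper, not something from the notebook or from a library.

```python
# Sample values in the style of the 'date' column above
raw_dates = ["\n Decided: 7 February 2023\n", "\n Decided: 2 February 2023\n"]

def clean_date_text(raw):
    """Collapse whitespace and drop the 'Decided:' label (hypothetical helper)."""
    # split() with no arguments splits on any run of whitespace, including newlines
    collapsed = " ".join(raw.split())
    # keep only what follows the label, if the label is present
    return collapsed.split("Decided:", 1)[-1].strip()

clean_dates = [clean_date_text(d) for d in raw_dates]
print(clean_dates)
```

The same helper could be applied to the dataframe column with `caselist["date"].apply(clean_date_text)`.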


## Grab the `time` attributes

Although the dates appear on the page as text, the `<time>` tags also carry those dates in a much more useful format, with months shown as numbers, e.g. `<time datetime="2023-02-07">`.

In [13]:
#grab the <time> tags
times = soup.select('time')
#show the first 5
times[:5]

[<time datetime="2023-02-07">7 February 2023</time>,
 <time datetime="2023-02-02">2 February 2023</time>,
 <time datetime="2023-01-23">23 January 2023</time>,
 <time datetime="2023-01-12">12 January 2023</time>,
 <time datetime="2023-01-03">3 January 2023</time>]

## Extract the `datetime` attribute

To extract the `datetime=` attribute, add `['datetime']` after a single match (not after the whole list).

In [None]:
#show the first match in full
print(times[0])
#now show the datetime attribute of that match
print(times[0]['datetime'])
#store it
testdatetime = times[0]['datetime']
#now extract the first 4 characters (the year)
print(testdatetime[0:4])
#and the characters at positions 8 to 9 (the day)
print(testdatetime[8:10])
#and the characters at positions 5 to 6 (the month)
#converting to an integer at the same time
print(int(testdatetime[5:7]))

<time datetime="2023-02-07">7 February 2023</time>
2023-02-07
2023
07
2
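
Slicing works because the string always has the same layout, but it will happily slice anything, including an impossible date. As a hedged alternative, the standard library's `date.fromisoformat` parses a `YYYY-MM-DD` string and raises an error if it is not a real date. The sample strings below are illustrative.

```python
from datetime import date

# An ISO-format string like those held in the datetime= attribute
iso_text = "2023-02-07"

# fromisoformat parses the YYYY-MM-DD string into a date object
parsed_date = date.fromisoformat(iso_text)
print(parsed_date.year, parsed_date.month, parsed_date.day)

# Slicing accepts anything; parsing rejects impossible dates
try:
    date.fromisoformat("2023-13-45")
    is_valid = True
except ValueError:
    is_valid = False
print(is_valid)
```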


## Loop to store the `datetime` in the dataframe too

Now we expand the earlier code to include this. 

We create four new lists: one for the whole datetime string, and one each for the year, month and day that we extract from it.

In [15]:
#create 4 empty lists
datetimes = []
years = []
days = []
months = []

#loop through those <time> tag matches
for i in times:
  #extract the datetime= value
  casedate = i['datetime']
  #add to the datetimes list
  datetimes.append(casedate)
  #'slice' the string to extract the year, month and day, adding each to a different list
  years.append(int(casedate[0:4]))
  months.append(int(casedate[5:7]))
  days.append(int(casedate[8:10]))

In [16]:
#show the first 5 items in the datetimes list
datetimes[:5]

['2023-02-07', '2023-02-02', '2023-01-23', '2023-01-12', '2023-01-03']
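
With the ISO strings collected, one way to reach the goal set out at the start (a true datetime column from which the month or day of the week can be read) is `pd.to_datetime` together with the `.dt` accessor. This is a minimal sketch: `sample_datetimes` repeats the five values shown above, and the `decided` dataframe and its column names are assumptions made for illustration.

```python
import pandas as pd

# The first five values from the datetimes list above
sample_datetimes = ["2023-02-07", "2023-02-02", "2023-01-23", "2023-01-12", "2023-01-03"]

# Convert the strings to pandas datetime values
decided = pd.DataFrame({"decided": pd.to_datetime(sample_datetimes)})

# The .dt accessor exposes date parts for the whole column at once
decided["month"] = decided["decided"].dt.month
decided["weekday"] = decided["decided"].dt.day_name()

print(decided)
```

Compared with slicing inside a loop, this keeps a single datetime column and derives the parts from it when they are needed.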