# Behind the scenes: Accessing webpages with user log-in

### Web scraping court case event logs from the government e-filing system

UK legal claims are recorded in a public database at https://efile.cefile-app.com/.
The retrieval app requires the user to set up an account and log in with a username and password.
Once logged in, the user can request the event log for a specific case number. Retrieving the actual file content is a paid service, but viewing which events have occurred in a case is free.

A friend asked me to retrieve the event logs of 10k+ cases, including the dates, event type, and description of every logged event in each case.

The input, queries, is a list of the desired case numbers. The output, cases_out, is a list of dictionaries, one per case, which is saved to a JSON file. Each case dictionary holds the case number (case_nr) and events, which is itself a list of dictionaries, one per event, storing the details of that event.
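As a rough illustration of that structure, the snippet below builds three entries by hand. The case numbers, dates, and texts are made-up placeholders; only the key names are taken from the code further down, which stores each event field as a single-item list.

```python
import json

# Hypothetical example of the shape of cases_out; every value is a placeholder.
cases_out_example = [
    {
        "case_nr": "XX-0000-000001",
        "events": [
            {
                "EventNr": ["1"],
                "Submitted Date": ["01/01/2020"],
                "Filed Date": ["02/01/2020"],
                "Type": ["Claim form"],
                "Description": ["Placeholder description"],
            }
        ],
    },
    # A case with an empty event log stores a message instead of a list.
    {"case_nr": "XX-0000-000002", "events": "No records found"},
    # A case number the site does not recognise stores an error message.
    {"case_nr": "XX-0000-000003", "error": "case does not exist or is not available"},
]

print(json.dumps(cases_out_example[0], indent=2))
```

Note that events is a list for cases with a log but a plain string for empty logs, so any code reading the file back has to check the type first.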

In [6]:
# MechanicalSoup provides a stateful browser for opening pages and submitting HTML forms
import mechanicalsoup
# Requests is the HTTP library that MechanicalSoup uses to fetch pages
import requests
# pandas was used for dataframe manipulation in an earlier version and is no longer needed
#import pandas as pd
# json for reading and writing JSON files
import json

In [7]:
# List of case numbers to look up (hard-coded here; could be loaded from a CSV file instead)
queries = ['CL-2014-000636',
'CL-2015-000128',
'BL-2018-002514',
'CR-2019-003497',
'LM-2019-000044',
'BL-2018-002522',
'HC-2017-002130',
'HC-2015-004346',
'CL-2015-000634',
'CL-2018-000709']
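The hard-coded list above is enough for a test run. For the full set of 10k+ case numbers, one option is to read them from a CSV file with the standard csv module. The sketch below assumes a hypothetical file with a header row and a column named case_number; neither the file name nor the column name comes from the original data, and the helper load_case_numbers is made up for illustration.

```python
import csv
import io


def load_case_numbers(csv_file, column="case_number"):
    """Return the non-empty case numbers from an open CSV file, without duplicates."""
    seen = set()
    numbers = []
    for row in csv.DictReader(csv_file):
        number = (row.get(column) or "").strip()
        if number and number not in seen:
            seen.add(number)
            numbers.append(number)
    return numbers


# Stand-in for open('case_numbers.csv', newline=''); the file name is hypothetical.
sample = io.StringIO("case_number\nCL-2014-000636\nCL-2015-000128\n\nCL-2014-000636\n")
example_queries = load_case_numbers(sample)
print(example_queries)
```

Dropping duplicates up front avoids requesting the same case twice, which matters when each request goes to a live website.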

In [8]:
# set up our output: a list of dictionaries
cases_out = []

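# Read the username and password from a text file (first and second line) rather than typing them out here.
with open('Credentials.txt', 'r') as f:
    # splitlines() tolerates a trailing newline at the end of the file
    username, password = f.read().splitlines()[:2]
# The with block closes the file automatically, so no explicit close() is needed.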

# Instantiate a browser environment    
browser = mechanicalsoup.StatefulBrowser(raise_on_404=True)

# First, try to open an arbitrary court case to see whether we are asked for log-in details.
try:
    url = "https://efile.cefile-app.com/officecopies/filing/search?caseNumber="+"CL-2015-000128"+"&formToken=1582146479269"
    response = browser.open(url)
    form = browser.select_form('form[id=loginForm]')
    form['username'] = username
    form['password'] = password
    response = browser.submit_selected()
    print("Submitted username and password")
except mechanicalsoup.LinkNotFoundError:
    # select_form raises LinkNotFoundError when the page contains no log-in form
    print("No log-in needed")

# Once we're in, we run through our list of cases
for case in queries:
    print("running case: ", case)
    # First test whether the webpage opens at all:
    try:
        url = "https://efile.cefile-app.com/officecopies/filing/search?caseNumber="+str(case)+"&formToken=1582146479269"
        response = browser.open(url)
    except (mechanicalsoup.LinkNotFoundError, requests.RequestException):
        print("Could not open case: ", case)
        continue

    if 'case number entered' in response.text:
        # The page only contains this phrase when the case number does not match an existing court case.
        print('case does not exist')
        cases_out.append({"case_nr" : case, "error" : "case does not exist or is not available"})
    elif '<tbody></tbody>' in response.text:
        # The case exists, but the table body is empty (opening and closing tbody tags are adjacent),
        # so there are no events in the log. Report an empty log:
        print('No records found')
        cases_out.append({"case_nr" : case, "events" : "No records found"})
    else:
        # The tbody tags are not adjacent, so there are events in the log.
        # Go through each row of the content table and retrieve the info:
        case_events = []
        for table_row in response.text.split("Case Event Log")[1].split('tbody>')[1].split('<tr>')[1:]:
            # [:-5] strips the closing </td> tag; the date cells wrap their text in a nested tag
            eventNr = table_row.split('<td>')[1][:-5]
            subDate = table_row.split('<td>')[2].split('>')[1].split('<')[0]
            fileDate = table_row.split('<td>')[3].split('>')[1].split('<')[0]
            entryType = table_row.split('<td>')[4][:-5]
            description = table_row.split('<td>')[5][:-5]

            # Save the retrieved event in a dictionary:
            current_event = {'EventNr': [eventNr],
                             'Submitted Date': [subDate],
                             'Filed Date': [fileDate],
                             'Type' : [entryType],
                             'Description' : [description] }
            # Append the single event to the other events for this case:
            case_events.append(current_event)
        print("Case retrieved")
        # Append the completed case search to the list of cases:
        cases_out.append({"case_nr" : case, "events" : case_events})


Submitted username and password
running case:  CL-2014-000636
Case retrieved
running case:  CL-2015-000128
Case retrieved
running case:  BL-2018-002514
Case retrieved
running case:  CR-2019-003497
No records found
running case:  LM-2019-000044
case does not exist
running case:  BL-2018-002522
No records found
running case:  HC-2017-002130
Case retrieved
running case:  HC-2015-004346
Case retrieved
running case:  CL-2015-000634
Case retrieved
running case:  CL-2018-000709
Case retrieved
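
The string splitting in the cell above depends on the exact markup of the page, down to the length of the closing tag. A more forgiving alternative is to let an HTML parser find the table cells. The sketch below is not the method used in this notebook: it uses the standard library's html.parser, the class and function names are made up, and the sample markup is an assumption about what the page looks like, reconstructed from the splits above rather than copied from the site.

```python
from html.parser import HTMLParser


class EventTableParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
            self._cell = None
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # header rows hold <th> cells only and stay empty
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def parse_events(html):
    """Return one dictionary per table row found after the 'Case Event Log' heading."""
    section = html.split("Case Event Log", 1)[-1]
    parser = EventTableParser()
    parser.feed(section)
    keys = ["EventNr", "Submitted Date", "Filed Date", "Type", "Description"]
    return [dict(zip(keys, row)) for row in parser.rows if len(row) >= 5]


# Made-up markup that mimics the layout the string splitting relies on.
sample_html = """
<h2>Case Event Log</h2>
<table><thead><tr><th>No.</th><th>Submitted</th><th>Filed</th><th>Type</th><th>Description</th></tr></thead>
<tbody><tr><td>1</td><td><span>01/01/2020</span></td><td><span>02/01/2020</span></td><td>Claim form</td><td>Placeholder</td></tr></tbody></table>
"""
print(parse_events(sample_html))
```

Unlike the loop above, this version stores each field as a plain string rather than a single-item list, so the two outputs are not interchangeable without adjusting whatever reads the JSON file.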


In [9]:
# Export our list of cases to a JSON file
with open('case_event_logs.json', 'w') as fout:
    json.dump(cases_out, fout)
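
To check the export, the file can be read back and summarised. This sketch writes a small placeholder list to a temporary file first so that it runs on its own; with the real data, json.load would be pointed at case_event_logs.json instead. The helper summarise and the demo values are hypothetical.

```python
import json
import os
import tempfile


def summarise(cases):
    """Map each case number to its event count, or to its status message."""
    summary = {}
    for case in cases:
        events = case.get("events")
        if isinstance(events, list):
            summary[case["case_nr"]] = len(events)
        else:
            # Either the "No records found" string or the text of an "error" entry.
            summary[case["case_nr"]] = events or case.get("error")
    return summary


# Placeholder data in the same shape as cases_out.
demo_cases = [
    {"case_nr": "XX-0000-000001", "events": [{"EventNr": ["1"]}, {"EventNr": ["2"]}]},
    {"case_nr": "XX-0000-000002", "events": "No records found"},
    {"case_nr": "XX-0000-000003", "error": "case does not exist or is not available"},
]

path = os.path.join(tempfile.mkdtemp(), "demo_event_logs.json")
with open(path, "w") as fout:
    json.dump(demo_cases, fout)

with open(path) as fin:
    loaded = json.load(fin)

demo_summary = summarise(loaded)
print(demo_summary)
```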