This repository has been archived by the owner on Mar 5, 2019. It is now read-only.

Data Gathering notes

charlottebrf edited this page Oct 5, 2016 · 1 revision

Example files from @elischutze: https://www.dropbox.com/sh/nuciypizp91bc10/AAB6z7G8LEgzWeQv6PBrdeu1a?dl=0 (see https://github.com/HackBrexit/MinistersUnderTheInfluence/issues/14)

@sikesLpp Thanks for this. See the attached (gov uk_meetings). That was the correct website; we just need to ensure the search is correctly filtered.

Priority government departments (the 15 we discussed):

1. Cabinet Office
2. HM Treasury
3. Department for Communities and Local Government
4. Department for Culture, Media & Sport
5. Department for Education
6. Department for Environment, Food & Rural Affairs
7. Department for International Development
8. Department for Transport
9. Department for Work and Pensions
10. Department of Health
11. Foreign & Commonwealth Office
12. Home Office
13. Ministry of Defence
14. Ministry of Justice

New departments that we will want the data from once they start publishing it

15.	Department for Business, Energy & Industrial Strategy
16. 	Department for Exiting the European Union
17. 	Department for International Trade

Notes from Momo

As discussed last meeting, I have created a script to harvest data. I had to create a fork of the repo at https://github.com/sikesLpp/MinistersUnderTheInfluence as I do not have push access. The script requires a Linux machine with libxml and php-cli installed and must be run from the shell. It takes a 'rich' URL for a search at https://www.gov.uk/government/publications and dumps links to all documents found, plus some relevant metadata, to a CSV file (govharvester_listfile.csv).
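To make the idea concrete, here is a minimal Python sketch of the same kind of harvesting step: pull links out of a results page and write them as CSV. It is not Momo's PHP script. The sample markup, the `/government/publications/` path filter, the `url` and `title` column names, and the `LinkCollector`, `harvest` and `to_csv` helpers are all assumptions made for illustration; the real page structure and the real columns of govharvester_listfile.csv are not described in these notes.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "https://www.gov.uk"  # site named in the notes


class LinkCollector(HTMLParser):
    """Collect (absolute URL, link text) pairs for every <a href> in a page."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (url, title) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = urljoin(BASE_URL, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def harvest(html, path_prefix="/government/publications/"):
    """Return (url, title) pairs under path_prefix (an assumed filter)."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    wanted = BASE_URL + path_prefix
    return [(url, title) for url, title in parser.links if url.startswith(wanted)]


def to_csv(rows):
    """Serialise rows as CSV text; the column names are hypothetical."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["url", "title"])
    writer.writerows(rows)
    return out.getvalue()


# Made-up markup standing in for a search results page.
SAMPLE_HTML = """
<ul>
  <li><a href="/government/publications/example-meetings-q1">Example meetings Q1</a></li>
  <li><a href="/help/cookies">Cookies</a></li>
</ul>
"""

print(to_csv(harvest(SAMPLE_HTML)), end="")
```

The sketch works on a string so it can be tried offline; fetching the live page and handling pagination would still need to be added.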

As discussed, this is a proof-of-concept script that will need some further tuning (including actually downloading the docs and storing the metadata in some sort of database).
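One possible shape for the "some sort of database" step is sketched below in Python with SQLite. This is only an illustration under stated assumptions: the `url`, `title` and `department` columns, the `documents` table, and the `load_rows` and `store` helpers are hypothetical, and the sample rows are made up rather than taken from a real harvest.

```python
import csv
import io
import sqlite3

# Made-up stand-in for the contents of govharvester_listfile.csv;
# the real column layout is not described in these notes.
SAMPLE_CSV = """url,title,department
https://www.gov.uk/government/publications/example-a,Example A,attorney-generals-office
https://www.gov.uk/government/publications/example-b,Example B,attorney-generals-office
"""


def load_rows(csv_text):
    """Parse CSV text into a list of dicts keyed by the header line."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def store(conn, rows):
    """Insert rows into a 'documents' table, skipping URLs already stored."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS documents (
               url TEXT PRIMARY KEY,
               title TEXT,
               department TEXT
           )"""
    )
    conn.executemany(
        "INSERT OR IGNORE INTO documents (url, title, department) "
        "VALUES (:url, :title, :department)",
        rows,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # use a file path to keep the data
rows = load_rows(SAMPLE_CSV)
store(conn, rows)
store(conn, rows)  # a second run adds nothing: url is the primary key
count = conn.execute("SELECT COUNT(*) FROM documents").fetchone()[0]
print(count)
```

Keying the table on the document URL means the harvester could be re-run against the same search without creating duplicate rows.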

Short instructions for usage:

1. Go to https://www.gov.uk/government/publications in your browser.
2. Make a selection via the search dropdowns.
3. Push search.
4. Paste the URL the page created as the first argument to the script.

NB: do not forget to enclose the URL in single quotes. Otherwise the shell treats each '&' as a control operator (it runs the preceding command in the background) and the URL is cut short.

Example:

./govharvester.php 'https://www.gov.uk/government/publications?keywords=&publication_filter_option=transparency-data&topics[]=all&departments[]=attorney-generals-office'
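For anyone who wants to see what the example URL actually encodes, the following Python sketch splits it into its search filters and rebuilds it. The `search_filters` and `build_search_url` helpers are hypothetical names introduced here; only the URL itself comes from the notes above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

EXAMPLE_URL = (
    "https://www.gov.uk/government/publications"
    "?keywords=&publication_filter_option=transparency-data"
    "&topics[]=all&departments[]=attorney-generals-office"
)


def search_filters(url):
    """Return the query parameters of a search URL as a dict of lists."""
    return parse_qs(urlsplit(url).query, keep_blank_values=True)


def build_search_url(base, filters):
    """Rebuild a search URL from a dict of lists, leaving '[]' unescaped."""
    return base + "?" + urlencode(filters, doseq=True, safe="[]")


filters = search_filters(EXAMPLE_URL)
print(filters["departments[]"])

rebuilt = build_search_url("https://www.gov.uk/government/publications", filters)
print(rebuilt == EXAMPLE_URL)
```

Each filter is a separate `key=value` pair joined by `&`, which is why the whole URL has to be quoted when it is passed on the command line.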

PS: I will not be able to make it to the meetup tomorrow, as I am out of town for a wedding.

Do we have a Slack channel or mailing list for better communication already?

grtz Momo

Hey @sikesLpp, I tried to run this script today but kept getting this error at line 120, when parsing the type from the DOM: "Fatal error: Cannot use object of type DOMNodeList as array". I haven't done any PHP, so I'm not sure how to move forward from this...

Also, our Slack channel is hackbrexit.

@sikesLpp it was my PHP version, and the script needed to have PHP tags.