# NLP and Local Authority Climate Action Plans

The aim is to programmatically assess how well Local Authority (LA) climate action plans treat electric vehicles (EVs). This involves three stages:

1. Importing the LA Climate Action plans
2. Making them machine readable and ready for NLP
3. NLP to find the parts that are relevant to electric vehicles and charging infrastructure

Packages go here:

In [167]:
#Downloading & pre-processing tools
import os
import glob
import PyPDF2
import requests
import re

#NLP
import spacy
import pandas as pd

## Importing a Climate Action Plan


Helpfully, mySociety have created a list of local authority climate action plans here (https://docs.google.com/spreadsheets/d/1tEnjJRaWsdXtCkMwA25-ZZ8D75zAY6c2GOOeUchZsnU/edit#gid=0).

We're going to start with Bristol. 

In time we will read the URLs automatically from the mySociety spreadsheet but, until then, we'll do an old-fashioned copy-paste.
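
As a starting point for that automation, a Google Sheets document can usually be fetched as CSV through its export endpoint. The sketch below only builds that export URL from the edit link above. The helper name `sheet_csv_url` is hypothetical, and it assumes the sheet is publicly readable. The column that holds the plan URLs is not known here, so reading it is left as a comment.

```python
import re

def sheet_csv_url(edit_url):
    """Build a CSV export URL from a Google Sheets edit link (hypothetical helper)."""
    # The spreadsheet id is the path segment after /d/
    sheet_id = re.search(r'/spreadsheets/d/([^/]+)', edit_url).group(1)
    # The tab id, if present, follows gid= in the link; default to the first tab
    gid_match = re.search(r'gid=(\d+)', edit_url)
    gid = gid_match.group(1) if gid_match else '0'
    return f'https://docs.google.com/spreadsheets/d/{sheet_id}/export?format=csv&gid={gid}'

plans_url = 'https://docs.google.com/spreadsheets/d/1tEnjJRaWsdXtCkMwA25-ZZ8D75zAY6c2GOOeUchZsnU/edit#gid=0'
print(sheet_csv_url(plans_url))
# pd.read_csv(sheet_csv_url(plans_url)) should then load the list, assuming the sheet is public
```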



### Download the file

In [142]:
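cs_url = 'https://www.bristolonecity.com/wp-content/uploads/2020/02/one-city-climate-strategy.pdf'
r = requests.get(cs_url, timeout=30)
r.raise_for_status() #stop here if the download failed, rather than saving an error page as a PDF
with open('bristol.pdf', 'wb') as f:
    f.write(r.content)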
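
Even a successful request can return something other than the PDF, for example an HTML landing page. As a minimal sketch, the check below relies on PDF files normally beginning with the bytes `%PDF-`. The function name `looks_like_pdf` is made up for this example.

```python
def looks_like_pdf(content):
    """Return True if the downloaded bytes start with the usual PDF header."""
    return content[:5] == b'%PDF-'

# Made-up byte strings; in the notebook this would be looks_like_pdf(r.content)
print(looks_like_pdf(b'%PDF-1.7 rest of file'))    # True
print(looks_like_pdf(b'<html>Not found</html>'))   # False
```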

### Create a 'Bristol' folder

In [143]:
#Create the folder if it does not already exist
os.makedirs('Bristol', exist_ok=True)

### Scrape the text from the PDF, clean and tidy
Annoyingly, the PDF is not in a very machine-readable format, so the extracted text will need some clean-up.

In [144]:
#Each page is scraped to a separate file for now; writing straight to a single document would probably be simpler
#(PdfFileReader, numPages and extractText are the PyPDF2 1.x names; PyPDF2 3.x uses PdfReader, len(reader.pages) and extract_text)
with open('bristol.pdf', 'rb') as pdf_file:
    pdf = PyPDF2.PdfFileReader(pdf_file)
    for i in range(pdf.numPages):
        with open(f'Bristol/bristol{i}.txt', 'w', encoding='utf-8') as f:
            f.write(pdf.pages[i].extractText())
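
The per-page files are handy for inspection, but the page texts could also be joined in memory. The sketch below assumes the page texts arrive as a list of strings. Here that is a made-up list, whereas in the notebook the strings would come from `extractText()` on each page. `join_pages` is a hypothetical name.

```python
def join_pages(page_texts):
    """Join page texts into one string, dropping line breaks as the notebook does."""
    return ''.join(text.replace('\n', '') for text in page_texts)

# Made-up page texts standing in for the output of extractText() on each page
sample_pages = ['One City\nClimate Strategy', ' A strategy for\n a carbon neutral Bristol']
print(join_pages(sample_pages))
```

Note that dropping line breaks without inserting a space fuses the words on either side of the break, which is one source of the lost whitespace that needs cleaning up afterwards.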

Now to put it into one file: bristol_full.txt

In [145]:
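#List the per-page text files, leaving out the combined file if it already exists
files = [f for f in glob.glob('Bristol/bristol*.txt') if not f.endswith('bristol_full.txt')]

#Sort by page number so the pages are joined in document order (glob does not guarantee any order)
files.sort(key=lambda path: int(re.search(r'(\d+)\.txt$', path).group(1)))
#No placeholder file is needed: the next cell creates bristol_full.txt when it first appends to it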

In [146]:
#Collect the data all into the one file, removing line breaks (and with them any empty lines) as we go along.
if not os.path.isfile('Bristol/bristol_full.txt'):
    for item in files:
        with open(item, 'r', encoding='utf-8') as f:
            #join the lines of this page with the line breaks stripped out
            new_text = ''.join(line.replace('\n', '') for line in f)
        #now append the page text to the master text file
        with open('Bristol/bristol_full.txt', 'a', encoding='utf-8') as p:
            p.write(new_text)
else:
    print("There's already a document called 'bristol_full.txt', delete this first if you want to create a fresh one.")


## Cleaning up the document
Everything's a bit messy: the extraction has lost a lot of whitespace and left stray page numbers in the text. We'll clean it up now.

First, we insert spaces between digits followed immediately by letters, and between letters followed immediately by digits, e.g. "In2030 Bristol will" or "Net Zero by2050".

In [147]:
#Open the document and save the text to climate_plan_text
climate_plan_text = ''
with open('Bristol/bristol_full.txt', 'r', encoding='utf-8') as f:
    climate_plan_text = f.read()

print(len(climate_plan_text))

144871


In [148]:
#Define a regex that matches the empty position between a digit and an adjacent character (either order) that is not a digit, whitespace, comma or full stop
dl = re.compile(r'(?<=\d)(?=[^\d\s,.])|(?<=[^\d\s,.])(?=\d)')

In [149]:
# Run a test
test_string = 'Bristol will be carbon neutral by2030 and from 2040will only use 140EVs. 2025 will see the launch of 1.5kWh battery.'
print(dl.sub(' ', test_string))

Bristol will be carbon neutral by 2030 and from 2040 will only use 140 EVs. 2025 will see the launch of 1.5 kWh battery.


In [150]:
#Apply the changes, replacing the text rather than appending to it
climate_plan_text = dl.sub(' ', climate_plan_text)

In some places, a year is followed by an extra digit, usually a rogue page number. Years are important because they indicate the timing, and ambition, of particular commitments, so we'd like to preserve years where possible.

In [151]:
# Define a regex that matches any digit preceded by at least four other digits, i.e. the 5th digit onwards in a run of digits. We assume that longer numbers use comma separators.
yr = re.compile(r'(\d)(?<=\d{5})')

In [152]:
test_string_2 = 'climate resilient Bristol by 20302 Foreword'
print(yr.sub('', test_string_2))

climate resilient Bristol by 2030 Foreword


In [153]:
# Now apply this to the document:
climate_plan_text = yr.sub('', climate_plan_text)
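
The rule above trims every run of five or more digits, so an unseparated figure such as 12500 would also be shortened. If that proves too aggressive, one narrower option is to trim only a single stray digit after something that looks like a year. The pattern below is a sketch under that assumption (years from 1900 to 2099 only), and `year_extra` is a name invented here. It would still misread a genuine five-digit number that happens to start with 19 or 20.

```python
import re

# A four-digit year (19xx or 20xx) followed by exactly one extra digit
year_extra = re.compile(r'(?<!\d)((?:19|20)\d{2})\d(?!\d)')

# Keep the year (group 1) and drop the stray digit
print(year_extra.sub(r'\1', 'climate resilient Bristol by 20302 Foreword'))
print(year_extra.sub(r'\1', 'a budget of 12500 pounds'))
```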

We also have some full stops with no space after them. Now to fix those.

In [154]:
#Define a regex to match full stops followed immediately by a letter (so decimals such as 1.5 are left alone) and insert a space.
fs = re.compile(r'\.(?=[A-Za-z])')

In [155]:
#Run a test...
test_string_3 = 'in Bristol.This will help.Testing, testing. Some are. Written correctly.'
print(fs.sub('. ', test_string_3))

in Bristol. This will help. Testing, testing. Some are. Written correctly.


In [156]:
# Now apply to text
climate_plan_text = fs.sub('. ', climate_plan_text)

climate_plan_text now holds a pretty good text rendering of Bristol's climate action plan. The clean-up was applied in memory, so Bristol/bristol_full.txt on disk still holds the raw text.
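
The three clean-up steps can be gathered into one function so that the same cleaning can be reused for other plans. This is a sketch rather than a finished pipeline. `clean_text` is a hypothetical name, and the full-stop rule here assumes a space is only wanted when the full stop is followed by a letter, which keeps decimals such as 1.5 intact.

```python
import re

# Boundary between a digit and an adjacent non-digit, non-space, non-punctuation character
dl = re.compile(r'(?<=\d)(?=[^\d\s,.])|(?<=[^\d\s,.])(?=\d)')
# Any digit preceded by at least four other digits
yr = re.compile(r'(\d)(?<=\d{5})')
# A full stop followed immediately by a letter (assumption: decimals are left alone)
fs = re.compile(r'\.(?=[A-Za-z])')

def clean_text(text):
    """Apply the three clean-up rules in the same order as the cells above."""
    text = dl.sub(' ', text)
    text = yr.sub('', text)
    text = fs.sub('. ', text)
    return text

print(clean_text('Net Zero by2050.This costs 1.5bn by 20302 Foreword'))
```

The cleaned string could then be written to a new file, leaving the raw extraction untouched for comparison.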

## Extracting keyword sentences
We use spaCy to find keyword phrases and pull out the sentences that contain them.

In [157]:
from spacy.matcher import PhraseMatcher

#Load the small English pipeline and build a phrase matcher on its vocabulary
nlp = spacy.load('en_core_web_sm')
matcher = PhraseMatcher(nlp.vocab)


In [163]:
text = climate_plan_text #use the raw text - not tokenized or processed beyond being a clean(ish) txt file

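phrases = ['electric vehicles', 'electric vehicle', 'charging', 'cars', 'zero emission vehicles', 'emission vehicles']

#Turn each phrase into a Doc for the matcher; make_doc only tokenizes, which is all the PhraseMatcher needs
patterns = [nlp.make_doc(t) for t in phrases]

#spaCy v3 takes the patterns as a list (v2 used matcher.add('EVs', None, *patterns))
matcher.add('EVs', patterns)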

doc = nlp(text)

#list for some tags:
tags = []

#list for some sentences:
tagged_sents = []
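
By default the PhraseMatcher compares exact token text, so "Electric vehicles" at the start of a sentence would not match the lower-case phrase. The sketch below shows case-insensitive matching with the matcher's `attr='LOWER'` option. It uses `spacy.blank('en')` and a made-up sentence so that it runs without the downloaded model, the names `nlp_blank` and `ci_matcher` are chosen here to avoid clashing with the notebook's own, and the `add` call assumes the spaCy v3 signature.

```python
import spacy
from spacy.matcher import PhraseMatcher

# A blank English pipeline is enough for tokenizing; no trained model is needed here
nlp_blank = spacy.blank('en')

# attr='LOWER' compares lower-cased token text, so 'Cars' matches 'cars'
ci_matcher = PhraseMatcher(nlp_blank.vocab, attr='LOWER')
ci_matcher.add('EVs', [nlp_blank.make_doc(p) for p in ['electric vehicles', 'charging', 'cars']])

sample_doc = nlp_blank('Electric vehicles need charging. Cars do not.')
# Sort by start token so the matches read in document order
ordered = sorted(ci_matcher(sample_doc), key=lambda m: m[1])
found = [sample_doc[start:end].text for _, start, end in ordered]
print(found)
```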

In [164]:
matches = matcher(doc)

for match_id, start, end in matches:
    span = doc[start:end]
    tags.append(span.text)
    tagged_sents.append(span.sent)
    print(span.text)

charging
charging
electric vehicles
electric vehicle
charging
cars
cars
charging
cars
emission vehicles
electric vehicle
charging
charging
emission vehicles
emission vehicles


In [168]:
data = {'tags': tags, 'tagged_sents': tagged_sents}
df = pd.DataFrame(data = data)
df.head()

| | tags | tagged_sents |
|---|---|---|
| 0 | charging | (This, will, include, working, with, regulator... |
| 1 | charging | (We, will, need, signi˜cant, new, walking, ,, ... |
| 2 | electric vehicles | (We, will, need, signi˜cant, new, walking, ,, ... |
| 3 | electric vehicle | ( , -Development, of, a, citywide, plan, for, ... |
| 4 | charging | ( , -Development, of, a, citywide, plan, for, ... |


In [169]:
df.to_csv('keywords.csv')
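
As a first step towards scoring a plan, it may help to tally how often each phrase was matched. The sketch below uses Python's `collections.Counter` on a short illustrative list, not the real matcher output. The `canonical` mapping that folds plural forms into singular ones is a hypothetical choice.

```python
from collections import Counter

# Illustrative tags only, not the real matcher output
sample_tags = ['charging', 'cars', 'electric vehicles', 'charging', 'electric vehicle']

# Fold singular and plural forms together before counting (a hypothetical mapping)
canonical = {'electric vehicles': 'electric vehicle'}
counts = Counter(canonical.get(tag, tag) for tag in sample_tags)

print(counts.most_common())
```

In the notebook, `tags` would take the place of `sample_tags`.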