# Pilot Project: Learning the Climate

Authors: Shashi Badloe, Ioannis N. Athanasiadis


Purpose: Provides a step-by-step interactive pipeline for cleaning, modeling, and storing the data used in the paper:  
Robbert Biesbroek, Shashi Badloe, Ioannis N. Athanasiadis (2020). [Machine learning for research on climate change adaptation policy integration: an exploratory UK case study](http://dx.doi.org/10.1007/s10113-020-01677-8), published in Regional Environmental Change.


Note: Do not rename or move any folders created during intermediate steps of the pipeline. Doing so may raise errors, because the scripts look for folders with these specific names at fixed relative paths.


For further details on the function of each script, please look into the code.

## Step 1 - Defining input data
To start the analysis, we need to define the training and testing data. The folder `PDF_files` contains the PDF documents used as training or testing data.

The following four folders should be in `PDF_files`:
- `Adaptation policy documents`: Training data for adaptation policies
- `Mitigation policy documents`: Training data for mitigation policies
- `Non-climate policy documents`: Training data for non-climate documents
- `Mixed policy documents`: Testing data, any PDF document(s) you want to predict on.

You may define your own training and testing data by placing the PDF files in the specified folders.
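
Before running the pipeline, it can help to confirm that the folder layout matches the list above. The following is a minimal sketch, not part of the project's scripts: the helper name `count_pdfs` is hypothetical, and it assumes `PDF_files` is reachable from the current working directory.

```python
from pathlib import Path

# Folder names as listed above; the helper itself is hypothetical.
EXPECTED_FOLDERS = [
    "Adaptation policy documents",
    "Mitigation policy documents",
    "Non-climate policy documents",
    "Mixed policy documents",
]

def count_pdfs(root="PDF_files"):
    """Return {folder name: number of PDF files}, with None for a missing folder."""
    root = Path(root)
    counts = {}
    for name in EXPECTED_FOLDERS:
        folder = root / name
        if folder.is_dir():
            # Count files with a .pdf extension, ignoring case.
            counts[name] = sum(1 for p in folder.iterdir() if p.suffix.lower() == ".pdf")
        else:
            counts[name] = None
    return counts
```

Calling `count_pdfs()` and printing the result shows at a glance whether a folder is missing (`None`) or empty (`0`).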

## Step 2 - Turning PDF into raw text and translating into bag-of-words
The machine learning algorithm cannot read PDF files directly, so we first need to extract the raw text from them. Converting PDF to raw text can alter or break the composition of the text (e.g. by adding invalid characters or extra whitespace between words). We have applied some automated checks and edits to fix the most common problems, but some files may still not convert to raw text properly.

The following scripts are used to convert and clean the PDF files:

In [None]:
#Set working directory to the folder that holds all scripts
import os
script_folder = 'Python Scripts'
if not os.getcwd().endswith(script_folder):
    os.chdir(script_folder)
print("Current Working Directory " , os.getcwd())

In [None]:
#Run script to convert PDF into raw text
#Attempts to preserve the paragraph structure of the text and fixes invalid characters.
#This connects to a server for PDF parsing, so an internet connection is required. If the connection fails, try re-running this cell.
#Outputs folder named parsed_files that holds raw text for every PDF file
exec(open('pdf_parser.py').read())
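
To illustrate the kind of automated fixes mentioned above, here is a minimal sketch of generic text normalization. It is not the logic of `pdf_parser.py`; the function name `normalize_text` and the specific rules are assumptions chosen for illustration.

```python
import re
import unicodedata

def normalize_text(raw):
    """Apply a few generic fixes to text extracted from a PDF (illustrative only)."""
    text = unicodedata.normalize("NFKC", raw)      # e.g. a ligature character becomes plain letters
    text = re.sub(r"[^\S\n]+", " ", text)          # collapse runs of spaces/tabs, keep newlines
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)   # rejoin words hyphenated across a line break
    # Drop control and other non-printable characters, but keep newlines.
    text = "".join(ch for ch in text if ch == "\n" or ch.isprintable())
    text = re.sub(r"\n{3,}", "\n\n", text)         # at most one blank line between paragraphs
    return text.strip()
```

Note that the hyphenation rule also merges genuine hyphenated compounds that happen to break at a line end, so a real pipeline may need a more careful check.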

In [None]:
#Run script to clean up text and create eligible blocks based on paragraph size.
#Uses tagger to determine word types and filters for useful words
#Outputs folder named structured_files that contains a bag-of-words for every file in python list format
exec(open('text_cleanup.py').read())
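
The idea of turning paragraphs into eligible bag-of-words blocks can be sketched as follows. This is a simplified stand-in, not the code in `text_cleanup.py`: it replaces the tagger-based word filter with a tiny stopword list, and the names `text_to_blocks`, `STOPWORDS` and `MIN_WORDS` as well as the threshold value are hypothetical.

```python
import re

# Tiny stand-in stopword list; the real script filters by word type using a tagger.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on", "will"}
MIN_WORDS = 5  # hypothetical minimum block size

def text_to_blocks(text, min_words=MIN_WORDS):
    """Split text on blank lines and return one bag-of-words list per eligible paragraph."""
    blocks = []
    for paragraph in re.split(r"\n\s*\n", text):
        words = re.findall(r"[a-z]+", paragraph.lower())
        words = [w for w in words if w not in STOPWORDS]
        # Short paragraphs (page numbers, headings) are dropped.
        if len(words) >= min_words:
            blocks.append(words)
    return blocks
```

With this sketch, a paragraph such as "Page 3" yields too few words and is discarded, while a full sentence about flood defences is kept as one block.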

## Step 3 - Building the database
A SQLite database is created to hold every block. The database allows quick storage and retrieval of the blocks, which is needed when working with large collections of documents.
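
As a rough illustration of storing blocks in SQLite, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. The table names `Labeled_data` and `Unlabeled_data` come from this pipeline, but the column layout and the helper names are assumptions; the actual schema is defined in `sqlite_db.py`.

```python
import sqlite3

def create_tables(conn):
    """Create the two tables; the column layout here is an assumption."""
    conn.execute("CREATE TABLE IF NOT EXISTS Labeled_data "
                 "(id INTEGER PRIMARY KEY, filename TEXT, block TEXT, label TEXT)")
    conn.execute("CREATE TABLE IF NOT EXISTS Unlabeled_data "
                 "(id INTEGER PRIMARY KEY, filename TEXT, block TEXT)")

def insert_labeled(conn, filename, words, label):
    """Store one block as a space-separated string."""
    conn.execute("INSERT INTO Labeled_data (filename, block, label) VALUES (?, ?, ?)",
                 (filename, " ".join(words), label))

conn = sqlite3.connect(":memory:")  # in-memory, so no file such as climate.db is touched
create_tables(conn)
insert_labeled(conn, "example.pdf", ["flood", "risk", "coastal"], "adaptation")
conn.commit()
rows = conn.execute("SELECT filename, block, label FROM Labeled_data").fetchall()
```

Using `?` placeholders rather than string formatting keeps filenames with quotes or other special characters from breaking the query.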

In [None]:
#Run script to create the database and load training data into the Labeled_data table and testing data into the Unlabeled_data table.
#Outputs 'climate.db' file in scripts folder. Will overwrite any file with the same name.

#Note that this requires a supply of metadata in 'metadata.txt'
#The metadata should have a python dictionary format where the key is the filename and contents are a 
#tuple of date and department like so: "pdf_filename.pdf: ('day month year', 'Department')"
exec(open('sqlite_db.py').read())
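
The metadata format described in the comments above can be checked before building the database. The sketch below assumes that `metadata.txt` holds a single Python dictionary literal with quoted filenames as keys; the function name `parse_metadata` and the example entry are hypothetical.

```python
import ast

def parse_metadata(text):
    """Parse a dictionary literal and check each value is a (date, department) pair."""
    metadata = ast.literal_eval(text)  # safe: evaluates literals only, never code
    if not isinstance(metadata, dict):
        raise ValueError("metadata must be a dictionary")
    for filename, value in metadata.items():
        if not (isinstance(value, tuple) and len(value) == 2
                and all(isinstance(v, str) for v in value)):
            raise ValueError(f"bad metadata entry for {filename!r}: {value!r}")
    return metadata

# Hypothetical example entry in the assumed format.
example = "{'flood_plan.pdf': ('12 March 2015', 'Environment')}"
metadata = parse_metadata(example)
date, department = metadata["flood_plan.pdf"]
```

`ast.literal_eval` is used here instead of `eval` so that a malformed or malicious metadata file cannot execute arbitrary code.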

## Step 4 - Training the model
The model is a simple feed-forward neural network. Its inputs are the blocks. During training, it adjusts the weight of each word towards the three classes, depending on how often the word occurs in the training data for that class or how often it co-occurs in the same bag with a word that is strongly correlated with a class.
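
To make the description above more concrete, here is a minimal pure-Python sketch of a forward pass through such a network: the word embeddings of a block are averaged, passed through one dense layer, and turned into class probabilities with softmax. It is not the TensorFlow model used in the pipeline; the averaging step, the toy sizes and the random (untrained) weights are assumptions for illustration.

```python
import math
import random

def forward(word_ids, embeddings, weights, biases):
    """Average the word embeddings of a block, then apply one dense layer and softmax."""
    dim = len(embeddings[0])
    mean = [sum(embeddings[i][d] for i in word_ids) / len(word_ids) for d in range(dim)]
    # One row of weights per class: logit = dot(mean, row) + bias.
    logits = [sum(m * w for m, w in zip(mean, row)) + b
              for row, b in zip(weights, biases)]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

random.seed(0)
N_WORDS, EMBEDDING_SIZE, N_CLASSES = 6, 4, 3  # toy sizes, not the values used in the paper
embeddings = [[random.uniform(-1, 1) for _ in range(EMBEDDING_SIZE)] for _ in range(N_WORDS)]
weights = [[random.uniform(-1, 1) for _ in range(EMBEDDING_SIZE)] for _ in range(N_CLASSES)]
biases = [0.0] * N_CLASSES

probabilities = forward([0, 2, 5], embeddings, weights, biases)
predicted_class = probabilities.index(max(probabilities))
```

Training would adjust `embeddings`, `weights` and `biases` so that blocks from each class receive a high probability for that class; here they are random, so the prediction is meaningless.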


`TF_classification_predict.py` uses the stored model to predict on new data. The results are stored in the database.

In [None]:
#Run script to build vocabulary from training data 
#Output file 'conversion disctionary.txt' is a python dictionary where every word in the training
#corpus is assigned a number
exec(open('numberizer.py').read())
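
Building a vocabulary that maps every word to a number can be sketched as below. This is an illustration of the idea, not the code in `numberizer.py`; the function names and the choice to reserve 0 for unknown words are assumptions.

```python
def build_vocabulary(blocks):
    """Assign each distinct word an integer, in order of first appearance."""
    vocabulary = {}
    for block in blocks:
        for word in block:
            if word not in vocabulary:
                vocabulary[word] = len(vocabulary) + 1  # 0 is kept for unknown words
    return vocabulary

def encode(block, vocabulary):
    """Replace each word with its number; unseen words map to 0."""
    return [vocabulary.get(word, 0) for word in block]

blocks = [["flood", "risk", "coastal"], ["emission", "risk", "carbon"]]
vocabulary = build_vocabulary(blocks)
encoded = encode(["carbon", "flood", "tax"], vocabulary)
```

In this toy example "tax" never appears in the training blocks, so it is encoded as 0.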

In [None]:
#Run this script to start training the neural network
#Internal validation with cross validation or regular split is possible
#Outputs a folder named 'tensorflow/logdir' in the root folder containing the model
exec(open('TF_classification_BW.py').read())
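
The two internal validation options mentioned in the comments, a regular split and cross validation, differ in how the training blocks are divided. The sketch below shows both on plain index lists; the function names, the 20% hold-out and the five folds are assumptions and not necessarily the settings in `TF_classification_BW.py`.

```python
import random

def train_test_split(n_items, test_fraction=0.2, seed=0):
    """Shuffle the indices once and hold out a fixed fraction for validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_test = int(n_items * test_fraction)
    return indices[n_test:], indices[:n_test]

def k_fold(n_items, k=5, seed=0):
    """Yield (train, validation) index lists; each item is validated exactly once."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation
```

A regular split trains once and is faster, while cross validation trains `k` times and gives a more stable estimate because every block is used for validation exactly once.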

In [None]:
import tensorflow as tf
from numberizer import connect_to_db
from tensorflow import keras
import numpy as np
import sqlite3
import os
import pandas
import time
import TF_classification_BW as m1

In [None]:
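#Excerpt of the model's core layers (TensorFlow 1.x API).
#Assumes WORDS_FEATURE, n_words, EMBEDDING_SIZE, MAX_LABEL and features are already defined.
#Map word ids to embeddings, combine them per block, and feed them to a dense layer with one output per class.
bow_column = tf.feature_column.categorical_column_with_identity(WORDS_FEATURE, num_buckets=n_words)
bow_embedding_column = tf.feature_column.embedding_column(bow_column, dimension=EMBEDDING_SIZE)
bow = tf.feature_column.input_layer(features, feature_columns=[bow_embedding_column])
logits = tf.layers.dense(bow, MAX_LABEL, activation=None)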

In [None]:
import TF_classification_predict as m2

In [None]:
model_directory = '../tensorflow/logdir'

In [None]:
imported = tf.saved_model.load("../tensorflow/logdir/1563368487/")

In [None]:
imported.initializer.resource_handle()

In [None]:
imported.graph.get_collection('trainable_variables')

In [None]:
#Run this script to predict on unseen data
#Predicts the class of blocks in 'Unlabeled_data' table in the climate.db database.
exec(open('TF_classification_predict.py').read())

In [None]:
converter = tf.lite.TFLiteConverter.from_saved_model('../tensorflow/logdir/1563368487/')
tflite_model = converter.convert()

The pipeline is finished, and you should now have probabilities and class predictions for the input data in the `Mixed policy documents` folder.