<h1 align="center"> Script for GIS Data Catalogue Generation </h1>

The script uses the docxtpl library to generate Word documents. A Word template, 'templateDataCatalogue.docx', is read and populated by the script once per dataset. As of now, the document is generated for 66 datasets.

The prerequisites for running this script are:
1. templateDataCatalogue.docx : The Word template, placed in the folder 'catalogue (template)'
2. GISMetadata_8-06.csv : The latest CSV export from the metadata pipeline
3. catalogue (images) : Folder containing one image (.jpg file) per dataset, named after its datasetCode


The Word exports are saved to the folder 'catalogue (exports)'.
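
Before running the cells, it can help to confirm that the inputs listed above are in place. The following is a minimal sketch, not part of the original script: `missing_prerequisites` is a hypothetical helper, and the paths assume the folder layout used by the cells in this notebook, with the export folder already created.

```python
import os

# Paths as used elsewhere in this script, relative to the script's folder.
# Assumption: the export folder must already exist before documents are saved.
REQUIRED_PATHS = [
    '../catalogue (template)/templateDataCatalogue.docx',
    '../GISMetadata_8-06.csv',
    '../catalogue (images)',
    '../catalogue (exports)',
]

def missing_prerequisites(paths=REQUIRED_PATHS, exists=os.path.exists):
    """Return the entries of `paths` for which `exists` returns False."""
    return [p for p in paths if not exists(p)]

# Report anything that is not where the script expects it
missing = missing_prerequisites()
if missing:
    print('Missing prerequisites:', missing)
```

The `exists` argument is only there so the check can be exercised without touching the file system.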

In [1]:
#Making the necessary library imports
import os, sys
import pandas as pd
from docxtpl import DocxTemplate, InlineImage
from docx.shared import Mm

# Changing the working directory to the script's folder so that the relative paths below resolve
os.chdir(sys.path[0])

## Populating the Template

In [2]:
#Reading the document template
template = DocxTemplate('../catalogue (template)/templateDataCatalogue.docx')

#Reading the latest export from metadata pipeline
gisMetaDf = pd.read_csv('../GISMetadata_8-06.csv')

#Grouping datasets using the datasetCode
gisMetaLayerGroups=gisMetaDf.groupby("datasetCode")

#Running page number, incremented once per dataset
pageNumber = 0

#The layers of each dataset are read as a DataFrame, 'layerDf'
for datasetCode, layerDf in gisMetaLayerGroups:

    #Stripping the list punctuation (', [ and ], i.e. code points 39, 91 and 93) from dimension and layerName
    target = {39 : None, 91 : None, 93 : None}
    dimension = str(layerDf['dimension'].unique().tolist()).translate(target)
    layerName = str(layerDf['layerTitle'].unique().tolist()).translate(target)
    pageNumber += 1
    
    #Creating a range of years
    yearList = layerDf['year'].unique().tolist()
    if max(yearList) == min(yearList):
        yearRange = str(min(yearList))
    else:
        yearRange = str(min(yearList))+'-'+str(max(yearList))

    #Building the context dictionary used to populate the Word template
    context = {
        'layerID': layerName if dimension == 'nan' else dimension,
        'datasetTitle': layerDf['datasetTitle'].unique()[0],
        'fileName': layerDf['fileName'].unique()[0],
        'citation': layerDf['citation'].unique()[0],
        'datasetCode': layerDf['datasetCode'].unique()[0],
        'format': layerDf['dataFormat'].unique()[0],
        'resolution': layerDf['resolution'].unique()[0],
        'unit': layerDf['dataType'].unique()[0],
        'year': yearRange,
        'datasetDescription': layerDf['datasetDescription'].unique()[0],
        'link': layerDf['dataLink'].unique()[0],
        'pageNumber': pageNumber,
        'myimage': InlineImage(template, image_descriptor='../catalogue (images)/'+datasetCode+'.jpg', width=Mm(111.7), height=Mm(59.5)),
    }
    
    #Template is rendered using the context dictionary
    template.render(context)
    
    #Generated output word document is saved in the 'catalogue (exports)' folder with the datasetCode in the name
    template.save('../catalogue (exports)/catalogue-'+datasetCode+'.docx')
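
The two string-building steps in the loop above (flattening a list of unique values and collapsing the years into a range) can be checked in isolation. The sketch below is illustrative only: `flatten_list` and `year_range` are hypothetical helper names, and the sample values are made up.

```python
# Code points 39, 91 and 93 are ', [ and ]; mapping them to None deletes them
LIST_PUNCTUATION = {39: None, 91: None, 93: None}

def flatten_list(values):
    """Turn ['a', 'b'] into the string 'a, b' by stripping list punctuation."""
    return str(list(values)).translate(LIST_PUNCTUATION)

def year_range(years):
    """Return '2012' for a single year, or '2012-2015' for a span of years."""
    first, last = min(years), max(years)
    return str(first) if first == last else f'{first}-{last}'

print(flatten_list(['Slope', 'Aspect']))   # Slope, Aspect
print(year_range([2015, 2012, 2015]))      # 2012-2015
print(year_range([2020]))                  # 2020
```

Note that the translate step also removes any apostrophes or square brackets that occur inside the values themselves.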


## Conversion of Word Document (Exports) to pdf

Here, all the Word documents in the folder 'catalogue (exports)' are converted to PDF files. This step drives Microsoft Word through COM (comtypes), so it requires Windows with Word installed.

In [3]:
#Making the necessary imports
import os
from comtypes.client import CreateObject

folder = '../catalogue (exports)/'

wdToPDF = CreateObject("Word.Application")
wdFormatPDF = 17  #Word's WdSaveFormat value for PDF
word_files = [f for f in os.listdir(folder) if f.endswith((".doc", ".docx"))]
for word_file in word_files:
    #Word needs absolute paths; the working directory was set in the first cell
    word_path = os.path.abspath(os.path.join(folder, word_file))
    #splitext handles both the '.doc' and the '.docx' extension
    pdf_path = os.path.splitext(word_path)[0] + ".pdf"

    #Removing any PDF left over from a previous run
    if os.path.exists(pdf_path):
        os.remove(pdf_path)

    pdfCreate = wdToPDF.Documents.Open(word_path)
    pdfCreate.SaveAs(pdf_path, wdFormatPDF)
    pdfCreate.Close()

#Closing the Word instance so that it does not linger in the background
wdToPDF.Quit()
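
The file selection and output naming in this step can be tested without Word. This is a minimal sketch under two assumptions: `select_word_files` and `pdf_path_for` are hypothetical helpers, and files starting with '~$' are the temporary lock files that Word creates next to an open document, which should not be converted.

```python
import os

def select_word_files(names):
    """Keep .doc/.docx names, skipping Word's '~$' lock files."""
    return [n for n in names
            if n.lower().endswith(('.doc', '.docx')) and not n.startswith('~$')]

def pdf_path_for(word_path):
    """Swap the Word extension for '.pdf', whatever its length."""
    root, _ext = os.path.splitext(word_path)
    return root + '.pdf'

names = ['catalogue-A.docx', '~$talogue-A.docx', 'old.doc', 'notes.txt']
print(select_word_files(names))          # ['catalogue-A.docx', 'old.doc']
print(pdf_path_for('catalogue-A.docx'))  # catalogue-A.pdf
print(pdf_path_for('old.doc'))           # old.pdf
```

Slicing a fixed five characters off the end only works for '.docx'; for 'old.doc' it would produce 'ol.pdf', which is why `os.path.splitext` is used here.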
    
    

## Merging all pdf files 

The pdf exports are finally merged to a single file named 'GISDataCatalogue.pdf'

In [4]:
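#Making the necessary library imports
# pip install PyPDF2
import os
try:
    from PyPDF2 import PdfMerger  #PyPDF2 2.0 and later
except ImportError:
    from PyPDF2 import PdfFileMerger as PdfMerger  #older PyPDF2 releases

folder = '../catalogue (exports)/'
source_dir = os.getcwd()
merger = PdfMerger()

#os.listdir returns names in arbitrary order, so they are sorted for a repeatable merge order
for item in sorted(os.listdir(folder)):
    if item.endswith('.pdf'):
        merger.append(os.path.join(folder, item))

merger.write(os.path.join(source_dir, 'GISDataCatalogue.pdf'))
merger.close()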

Page numbers have to be taken into account when merging: each document receives its page number in the order in which it is generated, and the PDF files may not be merged in that same order. The merged GIS Data Catalogue is saved to the folder '../processing script/'.
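
One way to keep the merge order aligned with the page numbers is to build the file list from the dataset codes rather than from the directory listing. The sketch below assumes the export naming pattern 'catalogue-<datasetCode>.pdf' used in this script; `ordered_pdfs` is a hypothetical helper and the codes are made-up examples.

```python
def ordered_pdfs(dataset_codes, available):
    """Return the PDF names for `dataset_codes`, in that order, skipping missing files."""
    names = [f'catalogue-{code}.pdf' for code in dataset_codes]
    return [name for name in names if name in available]

# In the script, the codes could come from the grouped metadata, for example
# sorted(gisMetaDf['datasetCode'].unique()), which matches groupby's sorted key order.
codes = ['B02', 'A01', 'C03']
available = {'catalogue-A01.pdf', 'catalogue-B02.pdf'}
print(ordered_pdfs(codes, available))  # ['catalogue-B02.pdf', 'catalogue-A01.pdf']
```

The resulting list can then be passed to the merger one entry at a time, in place of iterating over the folder contents.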

Once GISDataCatalogue.pdf is obtained, it has to be merged with:

1_Introduction.pdf

2_OverviewTable.pdf

placed within the folder 'Introduction'
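
This final merge could be scripted in the same way as the previous step. The following is a sketch only, with several assumptions: the two introduction files come first, in the order suggested by their numeric prefixes; the location of the 'Introduction' folder and the output file name are placeholders; and `final_merge_order` and `merge_pdfs` are hypothetical helpers. The merger class is named `PdfMerger` in PyPDF2 2.0 and later and `PdfFileMerger` in older releases.

```python
import os

def final_merge_order(intro_dir, catalogue_pdf):
    """List the files for the final document: introduction, overview table, catalogue."""
    return [
        os.path.join(intro_dir, '1_Introduction.pdf'),
        os.path.join(intro_dir, '2_OverviewTable.pdf'),
        catalogue_pdf,
    ]

def merge_pdfs(paths, output_path):
    """Append each PDF in `paths`, in order, and write the result to `output_path`."""
    # Imported here so the ordering helper can be used without PyPDF2 installed
    try:
        from PyPDF2 import PdfMerger
    except ImportError:
        from PyPDF2 import PdfFileMerger as PdfMerger
    merger = PdfMerger()
    for path in paths:
        merger.append(path)
    merger.write(output_path)
    merger.close()

# Example call (both paths and the output name are placeholders):
# merge_pdfs(final_merge_order('Introduction', 'GISDataCatalogue.pdf'),
#            'GISDataCatalogue_full.pdf')
```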