# Zoning PDF Text Extraction and Parsing Functions Demo

### Introduction
Zoning by-laws set out regulations on land use, building height, density, and other aspects of development. They inform urban planning and development decisions in cities.

They are often published as long, unstructured PDF legal documents, which makes it difficult to find information within them. Zoning information is also spatial and tied to geospatial datasets, so it would be useful to extract the zoning information in the by-laws in an efficient, automated way and join it with those datasets.

These functions extract text from a zoning by-law and parse it so that it can be used in other pipelines. They can also be used on their own to split up zoning by-law text, making it easier to read and search.

### Extract Markdown Function Demo
The code opens a zoning by-law PDF document and uses the pymupdf4llm library to extract the text as Markdown. Markdown is used because LLMs tend to handle Markdown-formatted text well: it gives the text a hierarchy, which helps an LLM understand the structure of the PDF document.
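The internals of `extract_markdown` are not shown in this demo. As a rough sketch of the step described above, assuming the method is a thin wrapper around `pymupdf4llm.to_markdown`, it could look like the following. The function name `extract_markdown_sketch` and the `converter` parameter are hypothetical and exist only for illustration.

```python
from pathlib import Path
from typing import Callable, Optional


def extract_markdown_sketch(
    pdf_path: str,
    converter: Optional[Callable[[str], str]] = None,
) -> str:
    """Convert a PDF to Markdown text (illustrative sketch, not parsing.py)."""
    path = Path(pdf_path)
    if path.suffix.lower() != ".pdf":
        raise ValueError(f"Expected a .pdf file, got: {path.name}")

    if converter is None:
        # Assumption: the real method delegates to pymupdf4llm.
        # Imported lazily so the sketch can be exercised without the library.
        import pymupdf4llm

        converter = pymupdf4llm.to_markdown

    print(f"Processing {path.name}...")
    return converter(str(path))
```

Passing a different `converter` makes it possible to try the surrounding pipeline on a canned string without opening a real PDF.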

First, let's import the required parsing class from parsing.py.

In [1]:
from parsing import Parsing

Next, we're going to set up the variables and initialize the parsing module.
*Note: Please ensure that a PDF by-law document is in the same directory as the script being run.*

In [2]:
PARSING = Parsing()
zoningpdf_path = "NiagaraFalls_by-law-no.-79-200-updated-to-may-31-2024.pdf" # Change the filename accordingly

Run the extract markdown function from the parsing module.

In [3]:
markdown_text = PARSING.extract_markdown(zoningpdf_path)
print(markdown_text[:1500])

Processing NiagaraFalls_by-law-no.-79-200-updated-to-may-31-2024.pdf...
**_CITY OF NIAGARA FALLS_**

**_By-law No. 79-200_**

.................

A by-law to regulate the use of land and the erection, use, height, bulk, location, spacing of any
other matters relating to buildings and structures, and to prohibit certain uses of lands and the
erection and use of certain buildings and structures in various areas of the City of Niagara Falls;
and WHEREAS there is an official plan in effect in the City of Niagara Falls; and

WHEREAS the Council of The Corporation of the City of Niagara Falls deems it desirable to pass
this By-law pursuant to section 35 and other provisions or The Planning Act, R.S.O. 1970, as
amended;

NOW THEREFORE THE COUNCIL OF THE CORPORATION OF THE CITY OF NIAGARA
FALLS ENACTS AS FOLLOWS:

**SECTION 1 - TITLE AND INTERPRETATION**

1.1 This By-law may be cited as "the Zoning By-law".

1.2 NUMBERING SYSTEM: The provisions of this By-law have been numbered using a
decimal 

### Extract Zoning Titles Using Regex Demo
As zoning by-laws from different municipalities vary greatly in formatting and complexity, a general pattern in formatting needs to be identified to design an algorithm that can successfully extract data from as many by-law formats as possible.

Section titles are extracted from the body of the by-law rather than from the table of contents, because table-of-contents formats vary too much across documents to split the PDF into sections reliably. The by-law text has to be split into sections by zoning category because LLMs accept only a limited number of tokens.

Once the text is extracted into markdown syntax, zoning categories and codes are extracted from the document section titles and stored in memory as a list/array.

First, set up the list/array to store all the extracted zoning titles.

In [4]:
titles = []

Next, call the parsing module's extract zoning titles function to extract the zoning category titles.

In [5]:
titles = PARSING.extract_zoning_titles(markdown_text)
titles[:10]

['**SECTION 3 - ZONES**',
 '**ZONES** **SHORT TITLES**',
 '**SECTION 5 - GENERAL PROVISIONS FOR RESIDENTIAL ZONES**',
 '**SECTION 7 - RESIDENTIAL ZONES**',
 '**7.1** **RESIDENTIAL 1A DENSITY ZONE (R1A ZONE).**',
 '**7.2  RESIDENTIAL 1B DENSITY ZONE (R1B ZONE).**',
 '**7.3** **RESIDENTIAL 1C DENSITY ZONE (R1C ZONE).**',
 '**7.4  RESIDENTIAL 1D DENSITY ZONE (R1D ZONE).**',
 '**7.5** **RESIDENTIAL 1E DENSITY ZONE (R1E ZONE).**',
 '**7.5A  RESIDENTIAL 1F DENSITY ZONE (R1F ZONE).**']
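The regular expression used by `extract_zoning_titles` is not shown in this demo. A minimal sketch of the idea, assuming that titles appear as bold Markdown lines containing the word ZONE or ZONES, is given below. The pattern and the name `extract_zoning_titles_sketch` are illustrative assumptions, and other by-laws may need a different pattern.

```python
import re

# Assumption: a title is a single line that starts and ends with "**"
# and contains the whole word ZONE or ZONES somewhere in between.
TITLE_PATTERN = re.compile(
    r"^\*\*[^\n]*\bZONES?\b[^\n]*\*\*[ \t]*$",
    re.MULTILINE,
)


def extract_zoning_titles_sketch(markdown_text: str) -> list:
    """Return bold title lines that mention ZONE(S), in document order."""
    titles = []
    for match in TITLE_PATTERN.finditer(markdown_text):
        title = match.group(0).strip()
        if title not in titles:  # keep only the first occurrence
            titles.append(title)
    return titles
```

Body lines that merely mention zones are skipped because they do not start with bold markers, and bold headings without the word ZONE are skipped as well.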

### Split Zoning By-law PDF Text into Sections Based on Zoning Categories Demo
The list/array of extracted titles can be used to identify and extract the relevant zoning category sections of the text. The extracted sections are stored in memory as a dictionary.

First, set up the dictionary that stores the zoning category titles and their section content. The for loop goes through each extracted zoning title in the list "titles" and adds an entry for it to the dictionary, with an empty "section" string to be filled in later.

In [6]:
zoning_sections = {}

for zoning_title in titles:
    zoning_sections[zoning_title] = {
        "title": zoning_title,
        "section": ""
    }
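How `split_zoning_sections` fills in this dictionary is not shown in this demo. One plausible approach, sketched here under the assumption that each section runs from the end of its title to the start of the next title found in the text, is the following. The name `split_zoning_sections_sketch` is hypothetical.

```python
def split_zoning_sections_sketch(titles, markdown_text, zoning_sections):
    """Fill each entry's "section" with the text between consecutive titles."""
    # Locate each title, searching forward so titles match in document order.
    positions = []
    cursor = 0
    for title in titles:
        start = markdown_text.find(title, cursor)
        if start == -1:
            continue  # title not found after the previous one; skip it
        positions.append((title, start))
        cursor = start + len(title)

    # Slice from the end of one title up to the start of the next one.
    for index, (title, start) in enumerate(positions):
        body_start = start + len(title)
        if index + 1 < len(positions):
            body_end = positions[index + 1][1]
        else:
            body_end = len(markdown_text)
        entry = zoning_sections.setdefault(title, {"title": title, "section": ""})
        entry["section"] = markdown_text[body_start:body_end].strip()

    return zoning_sections
```

Under this assumption the last section extends to the end of the document, so trailing material such as schedules or appendices would need separate handling.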

Call the split zoning sections function to extract the sections and save them into the dictionary.

In [14]:
zoning_sections = PARSING.split_zoning_sections(titles, markdown_text, zoning_sections)
print(zoning_sections)

{'**SECTION 3 - ZONES**': {'title': '**SECTION 3 - ZONES**', 'section': '3.1 CLASSIFICATION OF ZONES: For the purpose of this By-law, the following defined areas\nof the City of Niagara Falls, namely:\n\n(a) all of the former Village of Chippawa, and\n(b) all of the remaining part of the City of Niagara Falls lying north of the middle of the\nmain channel of the Welland River are hereby divided into the following zones,\nwhich are hereby established and the lands included in each zone are shown on\nthe Zoning Maps appended hereto as Schedule "A".'}, '**ZONES** **SHORT TITLES**': {'title': '**ZONES** **SHORT TITLES**', 'section': 'Residential 1A Density Zone R1A\n\nResidential 1B Density Zone R1B\n\nResidential 1C Density Zone R1C\n\nResidential 1D Density Zone R1D\n\nResidential 1E Density Zone R1E\n\nResidential 1F Density Zone R1F\n\nResidential Mobile Home Park Zone RMP\n\nResidential Two Zone R2\n\nResidential Mixed Zone R3\n\nResidential Low Density, Group Multiple Dwelling Zone R