
## What you need to be able to run this notebook (and the application)

- **OpenAI API key:** The key allows you to query various GPT models, in particular the vision-capable model that we use to perform optical character recognition.
    - Sign up for an OpenAI account and get an API key here: https://platform.openai.com/docs/quickstart
    - Once you have your API key, create a file called `.env` in the root directory of this repository (i.e., not the directory that contains this notebook, but its parent) and add the line `OPENAI_API_KEY = <YOUR API KEY>` anywhere inside.
- **Python libraries:** Uncomment and execute the cell below to install all the requirements needed for the application and notebook. NOTE: You will probably want to work in a virtual environment to avoid adding these libraries to your global Python installation.


In [1]:
## installing requirements; uncomment line below
# %pip install -r ../requirements.txt

In [2]:
# needed libraries
import base64
import pprint
import os
import json
import time
from rapidfuzz import fuzz, process, utils
from dotenv import load_dotenv
from openai import OpenAI
import pandas as pd

In [4]:
# loading environment variables from the .env file in the repository root
load_dotenv('../.env', override=True)

# read your OpenAI API key from the environment; never hard-code the key here or push it to the remote repo
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
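
If the `.env` file is missing or misnamed, `os.getenv` quietly returns `None`, and the problem only shows up later at the first API call. Below is a minimal sketch of a fail-fast check. The helper name `require_env` is hypothetical and not part of the project.

```python
import os


def require_env(name):
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; check that the .env file exists and contains a line for it"
        )
    return value


# Example usage (uncomment once your .env file is in place):
# OPENAI_API_KEY = require_env("OPENAI_API_KEY")
```

Raising early keeps the error message close to its cause instead of surfacing as an authentication failure further down the notebook.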

----------------------

# Onboarding Notebook

**(July 18, 2024)** 

Woot! Thanks for your interest in the Ballot Initiative Project! This notebook was created to walk you through the main parts of the project (as they exist as of the above date), and thus give you a good background on how to contribute to the fundamental tech stack. Currently the application is quite simple, but we are working each month to improve functionality and usability, and YOU can be a part of these improvements!

## The Ballot Initiative Problem

Let's say you wanted to get a minimum wage increase on the ballot in DC. By "get on the ballot," we mean put the "Should Minimum Wage be raised to $15?" question on a ballot in the next election. That way voters could vote on the issue and it could be changed citywide.

How would you get such an issue on the ballot? According to the Council of DC (https://code.dccouncil.gov/us/dc/council/code/sections/1-1001.16), after getting your initiative approved by the council, you would then need to collect signatures from the registered voters in the District. If you get enough *validated* signatures, then your issue gets on the ballot and people can vote on it. But how many validated signatures do you need? Here is a direct quote from the website:

> In order for any initiative measure or referendum measure to qualify for the ballot for consideration by the electors of the District, the proposer of the initiative measure or referendum measure shall secure the valid signatures of registered qualified electors upon the initiative or referendum measure equal in number to **5% of the registered qualified electors in the District; provided, that the total signatures submitted include 5% of the registered qualified electors in each of 5 or more of the 8 wards**.

So you would need valid signatures from 5% of all registered voters in the District and, in at least 5 of the 8 wards, from 5% of that ward's registered voters. How could you ensure you meet this threshold? Mostly through ballpark estimates. You get a bunch of volunteers to go out into the city and collect signatures, and at the end of each day you tally how many you collected. You keep a running tally per ward and check the results against a voter database, so that when you finally submit the list to the DC Board of Elections (BOE) you have a good sense of how many signatures are valid.
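
The two-part rule can be written down as a small check. This is only a sketch of the rule as quoted above: the function name, the dictionary layout, and all the numbers are hypothetical, and real ward totals would have to come from the DC BOE.

```python
def meets_signature_threshold(valid_by_ward, registered_by_ward,
                              fraction=0.05, wards_required=5):
    """Check the citywide and per-ward signature requirements.

    valid_by_ward: dict mapping ward -> number of validated signatures
    registered_by_ward: dict mapping ward -> number of registered voters
    """
    total_valid = sum(valid_by_ward.get(ward, 0) for ward in registered_by_ward)
    total_registered = sum(registered_by_ward.values())

    # requirement 1: valid signatures from 5% of all registered voters
    citywide_ok = total_valid >= fraction * total_registered

    # requirement 2: 5% of registered voters in each of at least 5 wards
    wards_meeting = sum(
        1 for ward, registered in registered_by_ward.items()
        if valid_by_ward.get(ward, 0) >= fraction * registered
    )
    return citywide_ok and wards_meeting >= wards_required


# Hypothetical numbers: 8 wards with 1,000 registered voters each
registered = {ward: 1000 for ward in range(1, 9)}
valid = {1: 90, 2: 90, 3: 90, 4: 90, 5: 90}  # 450 valid signatures across 5 wards
print(meets_signature_threshold(valid, registered))
```

With these made-up numbers both requirements are met. Concentrating the same effort in only four wards would satisfy the citywide count but fail the per-ward requirement.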

![Ballot Initiative data flow](ballot_initiative_flow.png "Ballot Initiative data flow")

The general data flow of the project is shown above. PDFs of ballots come in and a validated list of those who signed the ballot comes out. This flow can be broken into two parts: an OCR (i.e., Optical Character Recognition) part and a Validation/Match part. Here are the two parts:

**OCR Processing of Signatures**
- A collection of signed ballots in PDF format comes in. We select a single page of that PDF and convert it into an image (importantly, one that can be ingested by GPT). We then use GPT's vision capabilities to perform Optical Character Recognition on the image and extract the names and addresses of those who signed the ballot.

**Validation/Matching of Extracted Signatures**
- We compare each extracted name and address with a record of voters. We want to determine whether the voter name and address extracted from the ballot match an entry in the voter records. If there is a match, then that signature is said to be validated.

Below, we break down which pieces of this process currently exist, and we end by outlining some ways to contribute to the existing project.

## OCR Processing of Signatures

#### Basic GPT Image Recognition

For our OCR, we are using the OpenAI Vision API (described here: https://platform.openai.com/docs/guides/vision) to extract signatures from a PDF page of the ballot. To get familiar with it, we can consider a simpler example than a ballot and ask GPT to explain the Ballot Initiative diagram at the start of this notebook.

The first thing we need to do is put the image in a format that the API can recognize. Below is a function (complete with the stackoverflow page it was stolen from) that properly converts the image. 

In [5]:
# Function is needed to put image in proper format for uploading
# From: https://stackoverflow.com/questions/77284901/upload-an-image-to-chat-gpt-using-the-api
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

Next, we apply `encode_image` to the `ballot_initiative_flow.png` file shown at the start of this notebook.

In [6]:
# Path to your image
image_path = "ballot_initiative_flow.png"

# Getting the base64 string
base64_image = encode_image(image_path)
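
The API receives the image as a data URL whose prefix declares the image type (for example `image/png` or `image/jpeg`). As a hedged sketch, the standard-library `mimetypes` module can infer that prefix from the file extension. The helper name `image_to_data_url` is hypothetical and is not used elsewhere in this notebook.

```python
import base64
import mimetypes


def image_to_data_url(image_path):
    """Build a base64 data URL whose MIME type is guessed from the file extension."""
    mime_type, _ = mimetypes.guess_type(image_path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Could not determine an image type for {image_path!r}")

    # read the raw bytes and encode them as base64 text
    with open(image_path, "rb") as image_file:
        encoded = base64.b64encode(image_file.read()).decode("utf-8")

    return f"data:{mime_type};base64,{encoded}"


# Example usage (assumes the diagram file sits next to this notebook):
# data_url = image_to_data_url("ballot_initiative_flow.png")
```

Guessing the type from the extension keeps the declared type in step with the file when the same code handles both PNG diagrams and JPG ballot pages.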

Finally, we ask the API to explain the image.

In [8]:
# sample use of open AI API

prompt = "Please explain the meaning of the provided diagram."

client = OpenAI(api_key= OPENAI_API_KEY)

messages = [{"role": "user",
             "content": [{"type": "text",
                          "text": prompt},
                          {"type": "image_url",
                           # the diagram is a PNG file, so the data URL declares image/png
                           "image_url": {"url": f"data:image/png;base64,{base64_image}"}
                           }
                        ]
              }
              ]

results = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages,
    temperature=0.0,
)

print(results.choices[0].message.content)

The diagram outlines the workflow for processing signed forms related to a ballot initiative project. Here’s a breakdown of the steps involved:

1. **Signed Forms in PDF Format**: The process begins with a collection of signed forms stored in PDF format.

2. **Select Single Form**: A specific form is selected from the collection for further processing.

3. **Convert to Image**: The selected PDF form is converted into an image format to facilitate text extraction.

4. **Open AI API**: The image is then processed using the OpenAI API, specifically utilizing the GPT-4 vision API along with OCR (Optical Character Recognition) prompt engineering to extract text.

5. **Extracted Text**: The extracted text includes key information such as names, addresses, wards, and dates, formatted as a list of dictionaries.

6. **Python Package: rapidfuzz**: The extracted data is then processed using the Python package "rapidfuzz" for fuzzy matching, which helps in validating the extracted information agai

#### Extract Signature Function

Following the simple example above, we can now pursue the desired use case: extracting signatures from a JPG image of a ballot page. We currently have a prompt to do this, but we have to tell the API that these signatures are "toy examples" in order for it to process the personal data. This is not an ideal approach to OCR, and finding better OCR approaches is one of the tasks that can be worked on for this project.

In [9]:
def extract_signature_info(image_path):

    """
    Extracts names and addresses from single ballot image.
    """

    # Getting the base64 string
    base64_image = encode_image(image_path)

    # open AI client definition
    client = OpenAI(api_key= OPENAI_API_KEY)

    # prompt message
    messages = [
          {
            "role": "user",
            "content": [
              {
                "type": "text",
                "text": """Using the written text in the image create a list of dictionaries where each dictionary consists of keys 'Name', 'Address', 'Date', and 'Ward'. Fill in the values of each dictionary with the correct entries for each key. Write all the values of the dictionary in full. Only output the list of dictionaries. No other intro text is necessary. The output should be in JSON format, and look like
                {'data': [{"Name": "John Doe",
                          "Address": "123 Picket Lane",
                          "Date": "11/23/2024",
                          "Ward": "2"},
                          {"Name": "Jane Plane",
                          "Address": "456 Fence Field",
                          "Date": "11/23/2024",
                          "Ward": "3"},
                          ]} """
              },
              {
                "type": "image_url",
                "image_url": {
                  "url": f"data:image/jpeg;base64,{base64_image}"
                }
              }
            ]
          }
        ]

    # processing result through GPT
    results = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        temperature=0.0,
        response_format={"type": "json_object"}
    )

    # convert json into list
    signator_list = json.loads(results.choices[0].message.content)['data']

    return signator_list
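
The function above assumes the model always returns a JSON object with a `data` list of complete records. As a hedged sketch, the parsing step could be made more defensive. The helper name `parse_signature_json` and its fill-with-empty-string behavior are assumptions for illustration, not the project's current approach.

```python
import json

# keys requested in the prompt
EXPECTED_KEYS = ("Name", "Address", "Date", "Ward")


def parse_signature_json(raw_text):
    """Parse the model's JSON reply into a list of records with all expected keys."""
    payload = json.loads(raw_text)
    records = payload.get("data", []) if isinstance(payload, dict) else []
    if not isinstance(records, list):
        return []

    cleaned = []
    for record in records:
        # skip anything that is not a dictionary of fields
        if not isinstance(record, dict):
            continue
        # fill missing keys with empty strings and normalize values to stripped text
        cleaned.append({key: str(record.get(key, "")).strip() for key in EXPECTED_KEYS})
    return cleaned


sample_reply = '{"data": [{"Name": " John Doe ", "Ward": 2}, "junk"]}'
print(parse_signature_json(sample_reply))
```

Normalizing every record to the same four string fields means downstream matching code does not need to handle missing keys or mixed types.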

Testing `extract_signature_info` on a single ballot page:

In [10]:
# timing the result
start_time = time.time()

# get the repository root directory (assumes the notebook is run from its own directory, one level below the root)
repo_root = os.path.dirname(os.path.dirname(os.path.abspath('notebooks')))

# ocr extraction of the text
resulting_data = extract_signature_info(f'{repo_root}/sample_data/page-0.jpg')

# pretty printing the data
pprint.pprint(resulting_data)

# reporting elapsed time
print(f'\nElapsed Time: {time.time()-start_time:.3f} secs')

[{'Address': '1234 Main St, Seattle, WA 98101',
  'Date': '11/15/2022',
  'Name': 'Marion Jones',
  'Ward': '1'},
 {'Address': '980 Oak Dr, Seattle, WA 98103',
  'Date': '1/15/2022',
  'Name': 'James Smith',
  'Ward': '2'},
 {'Address': '765 Cedar Ln, Seattle, WA 98105',
  'Date': '1/15/2022',
  'Name': 'Sarah Williams',
  'Ward': '3'},
 {'Address': '432 Elm St, Seattle, WA 98104',
  'Date': '1/15/2022',
  'Name': 'Michael Johnson',
  'Ward': '4'},
 {'Address': '765 Cedar Ln, Seattle, WA 98105',
  'Date': '1/15/2022',
  'Name': 'Emily Brown',
  'Ward': '5'}]

Elapsed Time: 8.775 secs


## Validation/Matching of Extracted Signatures

The second part of the pipeline is the validation/matching of the extracted signatures through "fuzzy matching" (https://en.wikipedia.org/wiki/Approximate_string_matching) between the OCR output of the ballot pages and a voter records file. Why is this necessary? Here is a direct quote from the Council of DC page.

> For the purpose of verifying a signature on any petition filed pursuant to this section, **the Board shall first determine that the address on the petition is the same as the residence shown on the signer’s voter registration record. If the address is different, the signature shall not be counted as valid unless the Board’s records show that the person was registered to vote from the address listed on the petition at the time the person signed the petition.**

So for each signature in an initiative, we need to extract the name of the signer and their address, and we need to ensure that both exist in the record of registered voters. The OCR output is not always a clean name and address (i.e., it's not always the exact name/address the signer intended to write), so we need a way to find the "closest matches" to the names and addresses in the voter records file. The next few cells walk through how we do this for names only. We will also need to do this for addresses.

#### Basics of Fuzzy Matching

"Fuzzy matching" is called such because the matching it aims for is not exact or precise like a crystal clear image. It's kind of fuzzy, like when you wear the wrong glasses prescription. In fuzzy matching, the words "Bomegranate" and "Pomegranate" would have a high match score even though they are different words, because they only differ by one character.

There are many fuzzy matching approaches for strings, but the one we use is from the library `rapidfuzz` (https://pypi.org/project/rapidfuzz/). Below is an application of the library to our fruit-motivated example.

*(We haven't discussed exactly what's happening under the hood of the library, but feel free to check the library docs (https://rapidfuzz.github.io/RapidFuzz/Usage/fuzz.html#ratio) for details.)*

In [11]:
from rapidfuzz import fuzz
fuzz.ratio('Bomegranate', 'Pomegranate')

90.9090909090909
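
To get a feel for where a score like this comes from: according to the RapidFuzz documentation, `fuzz.ratio` is a normalized Indel similarity, where only insertions and deletions are counted as edits. The pure-Python sketch below reproduces that idea with a longest-common-subsequence computation. The function names are hypothetical, and this is an illustration of the formula, not the library's actual implementation.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b."""
    previous = [0] * (len(b) + 1)
    for char_a in a:
        current = [0]
        for j, char_b in enumerate(b, start=1):
            if char_a == char_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def indel_ratio(a, b):
    """Similarity in [0, 100] based on the insert/delete-only edit distance."""
    total_length = len(a) + len(b)
    if total_length == 0:
        return 100.0
    # characters outside the common subsequence must be deleted or inserted
    indel_distance = total_length - 2 * lcs_length(a, b)
    return 100.0 * (1 - indel_distance / total_length)


print(indel_ratio("Bomegranate", "Pomegranate"))
```

Swapping "B" for "P" counts as one deletion plus one insertion, so the distance is 2 out of a combined length of 22, which gives 100 × (1 − 2/22) ≈ 90.9, in line with the score above.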

The fruit example can be extended to one that better resembles the voter records problem. Say that a user inputs a string to a program. There might be misspellings in the input string, but we want to determine which fruit they *meant* to write. How can we use fuzzy matching to get the closest fruit?

**One Approach:** 
1. Begin with a list of standard fruits.
2. Go through the list and compute the fuzzy match score between the user input and each element in the list, recording the score each time.
3. Output the fruits that have the highest match scores to the user input.

Here is a simple implementation of this procedure. 

In [12]:
# user input
user_fruit = 'Bomegranate'

# list of fruits
fruit_list = ['Apple', 'Banana', 'Orange', 'Strawberry', 'Grapes', 'Mango', 'Pineapple', 'Watermelon', 'Blueberry', 'Cherry', 'Peach', 'Pear', 'Kiwi', 'Lemon', 'Lime', 'Raspberry', 'Blackberry', 'Pomegranate', 'Coconut', 'Papaya']

# dictionary of scores
score_dict = dict()
for fruit_elem in fruit_list:
    score_dict[fruit_elem] = fuzz.ratio(user_fruit, fruit_elem)

# scores sorted by highest values
sorted(score_dict.items(), key=lambda item: item[1], reverse=True)[:5]

[('Pomegranate', 90.9090909090909),
 ('Banana', 47.05882352941176),
 ('Orange', 47.05882352941176),
 ('Grapes', 35.29411764705882),
 ('Coconut', 33.333333333333336)]

We see that we correctly determined that "Pomegranate" has the highest score. Now, we want to apply the same procedure to the ballot initiative problem. The only difference is that we will use the OCR output (e.g., the name and address extracted from the ballot) in place of `user_fruit`, and we will use the list of registered voter names and addresses in place of `fruit_list`.

#### Fuzzy Matching and Voter Records

Above, we applied "fuzzy matching" to a `user_fruit` input and a `fruit_list` of reference fruit names. Next, we apply the same logic to check the signatures output by the OCR. The idea is the same: we have an input (or a collection of inputs) and we want to compare it with a list of possible matches.

For the ballot initiative problem, let's gather the list that is analogous to `fruit_list`: the list of registered voters. For simplicity, we will focus only on the full names of these voters.

First, we import the voter records file.

In [13]:
import pandas as pd
# reading in the (fake) sample voter records; real voter records are not stored in this repository
voter_records_2023_df = pd.read_csv(f'{repo_root}/sample_data/fake_voter_records.csv', dtype=str)

In [14]:
# displaying the first rows of the data
voter_records_2023_df.head()

Unnamed: 0,First_Name,Last_Name,Street_Number,Street_Name,Street_Type,Street_Dir_Suffix
0,Erica,Massey,6071,Martin Island,,
1,Terry,Osborne,395,Kathryn Mall,,
2,David,Holmes,30154,Tara Ports Apt. 314,,
3,Michele,Ballard,310,Landry Hills,,
4,Mary,Wiggins,26734,Susan Cliffs Suite 119,,


Next, we create a "full name" column in the dataframe, which we will use as our reference list.

In [15]:
# creating full name column
voter_records_2023_df['Full Name'] = voter_records_2023_df["First_Name"] + ' ' + voter_records_2023_df['Last_Name']

In [16]:
# converting column into a list
full_name_list = list(voter_records_2023_df['Full Name'])

# displaying first 20 entries
print(full_name_list[:20])

['Erica Massey', 'Terry Osborne', 'David Holmes', 'Michele Ballard', 'Mary Wiggins', 'Audrey Smith', 'Willie Davis', 'Candace Jones', 'Patricia Hayes', 'Deborah Davies', 'Robert Boyd', 'Terry Meyer', 'Rachel Brown', 'Kathryn Brown', 'Sandra Stewart', 'Jacqueline Knox', 'Dr. Gina Burnett', 'Martin Collins', 'Colleen Stewart', 'Ryan Clark']


We now have our full list of registered voter names. When we extract a name from the OCR, we can compare it with this list to find the closest match. This is the first step in validating a signature.


Take the first name that we extracted from the OCR:


In [17]:
resulting_data[0]['Name']

'Marion Jones'

Now, we apply the same procedure as in the `user_fruit` and `fruit_list` example above to find the closest matches to 'Marion Jones' in the list of voter names.

In [18]:
# signor name
signor_name = resulting_data[0]['Name']

# the list of voters is `full_name_list`, built in the cells above

# dictionary of scores
voter_score_dict = dict()
for voter_name in full_name_list:
    voter_score_dict[voter_name] = fuzz.ratio(signor_name, voter_name)

# scores sorted by highest values
sorted(voter_score_dict.items(), key=lambda item: item[1], reverse=True)[:5]

[('Mario Jones', 95.65217391304348),
 ('Marvin Jones', 91.66666666666666),
 ('Madison Jones', 88.0),
 ('Aaron Jones', 86.95652173913044),
 ('Marco Jones', 86.95652173913044)]

So we've found that the closest match to "Marion Jones" in the voter records is "Mario Jones", with a score of about 95.7. That is a near match but not an exact one, and it completes the first part of the validation. The next part is to check that the address written on the ballot matches the one in the voter registration records. We'll leave that as an extra task.
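
As a starting point for that extra task, the address columns in the voter records would first need to be combined into a single string that can be compared with the OCR address, much like the "Full Name" column. Below is a minimal sketch under that assumption. The helper names are hypothetical, and the column names are taken from the sample file shown earlier.

```python
import math

# address columns as they appear in the sample voter records file
ADDRESS_COLUMNS = ("Street_Number", "Street_Name", "Street_Type", "Street_Dir_Suffix")


def is_missing(value):
    """True for None, NaN, or blank strings (how empty CSV cells typically show up)."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return str(value).strip() == ""


def build_address(record):
    """Join the non-empty address fields of one voter record into a single string."""
    parts = [
        str(record[column]).strip()
        for column in ADDRESS_COLUMNS
        if not is_missing(record.get(column))
    ]
    return " ".join(parts)


example_record = {
    "Street_Number": "6071",
    "Street_Name": "Martin Island",
    "Street_Type": float("nan"),
    "Street_Dir_Suffix": None,
}
print(build_address(example_record))
```

The resulting strings could then be scored against the OCR address with the same fuzzy matching used for names. How to combine the name score and the address score into one decision is left open here.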

RapidFuzz provides a `process.extract` function that performs the name matching shown above for one query against an entire collection of choices (e.g., a Python list or a pandas Series) at once.

[Documentation for process.extract](https://rapidfuzz.github.io/RapidFuzz/Usage/process.html#extract)

For ease of reference later, we can write this full process as a function.

In [19]:
def score_fuzzy_match_slim(query_name, names_list):
    # using fuzz.ratio as the scorer returns a normalized similarity score between 0 and 100, rather than the score of the default scorer (fuzz.WRatio)
    # default_process lowercases all letters, replaces non-alphanumeric characters with whitespace, and trims leading/trailing whitespace
    list_of_match_tuples = process.extract(query=query_name, choices=names_list, scorer=fuzz.ratio, processor=utils.default_process, limit=5)
    # this produces a list of tuples of the form (match, score, index):
    #   match: string, the record that the query matched
    #   score: float, similarity (0-100) between the query and the matched record
    #   index: the position of the match when `choices` is a standard Python list; the key when `choices` is a mapping or pandas Series
    return list_of_match_tuples

In [20]:
# finding elements in the voter records that are similar to a given string
list_of_match_tuples = score_fuzzy_match_slim(resulting_data[0]['Name'], full_name_list)
# returns a list of tuples in the format (matched_name, score, index)
list_of_match_tuples

[('Mario Jones', 95.65217391304348, 80106),
 ('Marvin Jones', 91.66666666666666, 24287),
 ('Madison Jones', 88.0, 3636),
 ('Aaron Jones', 86.95652173913044, 2002),
 ('Marco Jones', 86.95652173913044, 36862)]

## Full (Mini) Pipeline

Now that we have worked through the two parts of the diagram, we can put them together in a mini-pipeline.

<img src="ballot_initiative_flow.png" alt="drawing" width="1000"/>

The steps of the pipeline:
1. An image of a page of signed ballots comes in.
2. Perform GPT-based OCR to extract the names from the page.
3. Compare each extracted name to the voter record names.
4. Output the closest match for each name (preferably in a table format).

In [21]:
# full (mini) pipeline

##############
# BALLOT OCR #
##############

# defining single image path
ballot_image = 'page-0.jpg'

# ocr processing of image
ocr_data = extract_signature_info(f"{repo_root}/sample_data/{ballot_image}")

#######################
# VALIDATION/MATCHING #
#######################

# empty list of voter signature data
match_data = list()

# cycling through processed data
for elem in ocr_data:

    # temporary dictionary of results
    tmp_dict = dict()

    # name determined from OCR
    tmp_dict['OCR NAME'] = elem['Name']

    # closest matched name, index and score of closest match name in records
    name, score, index = score_fuzzy_match_slim(elem['Name'], full_name_list)[0]

    # matched voter name
    tmp_dict['MATCHED VOTER NAME'] = name

    # match score
    tmp_dict['MATCH_SCORE'] = score

    # appending the result dictionary to the list of matches
    match_data.append(tmp_dict)

In [22]:
# displaying data and matches
pd.DataFrame(match_data)

Unnamed: 0,OCR NAME,MATCHED VOTER NAME,MATCH_SCORE
0,Marion Jones,Mario Jones,95.652174
1,James Smith,James Smith,100.0
2,Sarah Williams,Sarah Williams,100.0
3,Michael Johnson,Michael Johnson,100.0
4,Emily Brown,Emily Brown,100.0
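
The table reports scores but does not yet decide which signatures count as validated. One simple option, sketched below, is a score cutoff. The function name `label_matches`, the `VALID` column, and the cutoff value of 98 are all assumptions chosen for illustration; the project has not settled on a threshold.

```python
MATCH_THRESHOLD = 98.0  # hypothetical cutoff, not a project setting


def label_matches(match_rows, threshold=MATCH_THRESHOLD):
    """Return copies of the match rows with a boolean 'VALID' field added."""
    labeled = []
    for row in match_rows:
        labeled_row = dict(row)  # copy so the input rows are left unchanged
        labeled_row["VALID"] = row["MATCH_SCORE"] >= threshold
        labeled.append(labeled_row)
    return labeled


# two rows copied from the table above
example_rows = [
    {"OCR NAME": "Marion Jones", "MATCHED VOTER NAME": "Mario Jones", "MATCH_SCORE": 95.652174},
    {"OCR NAME": "James Smith", "MATCHED VOTER NAME": "James Smith", "MATCH_SCORE": 100.0},
]

for labeled_row in label_matches(example_rows):
    print(labeled_row["OCR NAME"], labeled_row["VALID"])
```

A near match such as "Marion Jones" to "Mario Jones" may be an OCR slip or a different person entirely, which is why a name score alone is not enough and the address check still matters.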


## Post-Onboarding Work

That's the end of the onboarding notebook. If you worked through this, you now know the basic functions that run in the background of the application.

