# idea
### background
Large Language Models have been growing in popularity, especially as chatbots and pseudo-search engines. But how could they be used to benefit our work?

If you load up an LLM from a source like OpenAI, you're getting access to a model that has already been trained on text data and is familiar with navigating at least English text.
- we have tested using English prompts on Spanish documents, and it appears plausible to use this process and get similar quality in results without use of a translator API. Because the model can be instructed to return snippets of a document in its response ([among other things](https://platform.openai.com/docs/guides/prompt-engineering)), you have the option to capture the raw, Spanish text and the model's working English translation of it with the rest of the output for transparency and auditing. 
- I don't believe we've tested using strictly non-English prompts & responses, but it looks like [someone else has](https://community.openai.com/t/force-api-response-to-be-in-non-english-language-how/175381/2). If you go that route, let us know how it goes!

Using an LLM through an API, we can feed in a document and then prompt the model with specific questions to "extract" information from the document. In this way, we avoid asking the model to use referential "knowledge" (info gleaned from its training data, which we don't know about or trust to be accurate) in favor of relying on what it found in the input document. 

- quick disclaimer: I'm going to talk about OpenAI as the core tool here, but we're actually going to use LangChain in our code because it provides some more user-friendly wrappings to the same underlying functionality.

### use case
In the DPA report data, we have text data in the form of `findings_of_fact`, which can be anywhere from a sentence or two to multiple pages worth of text. To better prepare the reports for review by the busy Public Defender's Office, we want to test an LLM's ability to identify and summarize the meaningful justifications provided in the text for the `finding` given.

For example, in the document:
> "The complainants described the officers as having been angry and peaking rudely to them, but did not corroborate one another in any detailed manner. The officers denied that any of them were rude or angry. There were no independent witnesses to the interaction between officers and complainants. There was insufficient evidence to prove or disprove the allegation."

If we ask the model to list the justifications, we should get back all of these snippets:
- "Complainants did not corroborate one another in any detailed manner"
- "The officers denied that any of them were rude or angry."
- "There were no independent witnesses to the interaction between officers and complainants."

We store some other expected and verified responses in a hand file, `"../hand/examples.yml"`.
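One way to put verified responses like these to work is to score a model response against them. The sketch below is an assumption about how that comparison might look, not something taken from the notebook: `normalize` and `score_snippets` are hypothetical helpers, and matching is exact after lowercasing and stripping punctuation, so paraphrased justifications will count as misses.

```python
import string


def normalize(snippet):
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    table = str.maketrans('', '', string.punctuation)
    return ' '.join(snippet.lower().translate(table).split())


def score_snippets(expected, returned):
    """Return (recall, precision) of returned snippets against expected ones."""
    exp = {normalize(s) for s in expected}
    ret = {normalize(s) for s in returned}
    hits = len(exp & ret)
    recall = hits / len(exp) if exp else 1.0
    precision = hits / len(ret) if ret else 1.0
    return recall, precision


expected = [
    "The officers denied that any of them were rude or angry.",
    "There were no independent witnesses to the interaction between officers and complainants.",
]
returned = [
    "the officers denied that any of them were rude or angry",
    "There was insufficient evidence to prove or disprove the allegation.",
]
recall, precision = score_snippets(expected, returned)
print(recall, precision)
```

Here one of the two expected justifications was found, and one of the two returned items was not an expected justification, so both scores come out to one half.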

# security + privacy

### credentials
##### A VERY IMPORTANT note
In order to access one of these models, you will need credentials in the form of an **API key**. As with SSH keys, part of this is private, and it is extremely important not to share it or (accidentally) make it public. Without the secret part, your API requests won't be authenticated and the calls will fail.

When you sign up for an account with OpenAI, you will have an ["API Keys" tab](https://platform.openai.com/api-keys) where you can make new keys and view existing ones. ***Crucially, when you generate a new key, the secret part will be shown to you once during that process, and then never again.*** If you lose it or make it public, you will need to make a new one. That being said, trashing and re-generating keys is an easy and effective response if you're worried that your key may have been leaked, so use that option freely.

In this demo, and in other projects, I store my private key in a text file in a credentials folder of my dotfiles, `"../../../dotfiles/creds/openai"`. Since my dotfiles get pushed to GitHub, I also have a `.gitignore` file in that repo that ignores the `creds` folder, so that I don't accidentally make it public. With this structure, I can use a handy `getcreds()` function that just reads the text file and loads the key into this notebook environment, without me pasting it here or into a `.bashrc` file.
- This notebook is set up for this style of credentialing because it's what I use, but there are other ways to do this kind of thing secretly and flexibly, and each one has its tradeoffs. Please feel free to explore other methods or use what you know already!
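As one example of another method, the sketch below checks an environment variable first and falls back to a creds file. It is an illustration under assumptions rather than the notebook's own approach: `load_api_key` is a hypothetical helper, and the file is assumed to hold the key on its first line.

```python
import os
from pathlib import Path


def load_api_key(creds_path, env_var="OPENAI_API_KEY"):
    """Return the API key from the environment if set, else from a creds file."""
    key = os.environ.get(env_var, "").strip()
    if key:
        return key
    path = Path(creds_path).expanduser()
    if not path.is_file():
        raise FileNotFoundError(f"no {env_var} in the environment and no file at {path}")
    # only the first line is read, so anything after it in the file is ignored
    lines = path.read_text().splitlines()
    key = lines[0].strip() if lines else ""
    if not key:
        raise ValueError(f"{path} exists but its first line is empty")
    return key
```

Failing loudly when neither source has a key is a deliberate choice here: an empty key would otherwise only surface later as an authentication error from the API.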

### document sharing
In this demo, we use publicly available report data and supply the model with text snippets from those documents. However, if you wanted to use data that may contain PII or other sensitive information, I highly recommend reviewing [OpenAI's FAQs](https://help.openai.com/en/articles/7039943-data-usage-for-consumer-services-faq) and current privacy policy and deciding for yourself if it's an appropriate tool to use. At the moment of this demo's creation, mid-Dec 2023, it's our understanding that the documents only live on OpenAI's servers for the duration of the API call, and are otherwise not stored or kept by the company. However, privacy policies can be changed at any time and you don't agree to any permissions when you sign up, so it's unclear what we should expect in perpetuity.

In lieu of sharing entire plain-text documents, you can also consider limiting the shared text to a specific section that would leave the model context-unaware or not include the sensitive info. Alternatively, vectorizing with an embedding tool could provide an optimized run and better privacy. 
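As a rough illustration of trimming sensitive details before a document leaves your machine, the sketch below masks two kinds of strings with regular expressions. Everything in it is an assumption for demonstration: `PATTERNS` and `redact` are hypothetical names, the two patterns cover only simple email addresses and US-style phone numbers, and real PII screening would need far more than this.

```python
import re

# illustrative patterns only; they will miss many real-world formats
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def redact(text):
    """Replace each pattern match with a bracketed label such as [EMAIL]."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


sample = "Contact the complainant at jane@example.org or 415-555-0100."
print(redact(sample))
```

Names, addresses, and badge numbers would pass straight through this filter, so it is a starting point for thinking about the problem rather than a safeguard.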

# OpenAI setup
Python is used here, but other languages are available.
- [ ] sign up for an account at [openai.com](https://openai.com/)
- [ ] if applicable, follow a link to join an organization already registered on the platform
    - **note about using an 'organization' during signup:** the platform doesn't intuitively merge identically named organizations you're connected to, so if you list "HRDAG" when you sign up and then join "hrdag" via invite later, you will be a member of two organizations with separate IDs. If you then rename "HRDAG" to "hrdag", it won't prompt you to pick a different name. You will just be a member of two identically-named "hrdag" teams, one with other team members and account funds, and one with just you that was generated and renamed by you.
    - The Organizations tab in Settings has an editable name field and an uneditable "Organization ID" field. There's no option to delete the org.
    - The Team tab in Settings has an option to Leave, but you can't leave a team you are the sole Owner of without transferring ownership, even if you are also the only member.
    - No help topics about deleting the org, closest is deleting account. I don't feel like contacting support about this. I renamed the org I signed up under as "hrdag-backup". 
- [ ] follow the first few steps in the [developer quickstart guide](https://platform.openai.com/docs/quickstart?context=python)
    - [ ] install the `openai`, `langchain` python packages
    - [ ] set up the `OPENAI_API_KEY` either as an environment variable (described in the quickstart guide) or a creds file (as described in "Credentials" above)
    - [ ] try the `openai-test.py` script under "Sending your first API request" with the suggested code and make sure it runs successfully


# references
- Ayyub Ibrahim's [llm-criminal-justice-research](https://github.com/ayyubibrahimi/llm-criminal-justice-research/tree/main)
- Tarak and Ayyub's blogpost, [Using large language models for structured information extraction from the Innocence Project New Orleans' wrongful conviction case files](https://hrdag.org/tech-notes/large-language-models-IPNO.html)
- Tristan Chambers' 30 Nov BIDS seminar, [Unpacking police violence and misconduct records: Solving information extraction challenges using Large Language Models (LLMs)](https://events.berkeley.edu/BIDS/event/209002-bids-seminar-with-tristan-chambers)
- [Answering Question About Custom Documents Using LangChain (and OpenAI)](https://kleiber.me/blog/2023/02/25/question-answering-using-langchain/)

Also:
- LangChain's [Introduction docs](https://python.langchain.com/docs/get_started/introduction)
- OpenAI's [Prompt Engineering guide](https://platform.openai.com/docs/guides/prompt-engineering)

# jupyter note
When I want to demo or explore something, I rely on the [`runtools`](https://github.com/ipython-contrib/jupyter_contrib_nbextensions/tree/master/src/jupyter_contrib_nbextensions/nbextensions/runtools) and [`codefolding`](https://github.com/ipython-contrib/jupyter_contrib_nbextensions/tree/master/src/jupyter_contrib_nbextensions/nbextensions/codefolding) Jupyter notebook extensions to mark and collapse key setup cells, so that exploratory and analytical work stays in focus. This is totally optional, but I wanted to mention it.

# setup for the demo

In [1]:
# dependencies
from random import randint
import yaml
import pandas as pd

from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate

In [2]:
# support methods
def getcreds():
    with open('../../../../dotfiles/creds/openai') as f:
        out = f.readline().strip()
    return out


def read_yaml(fname):
    with open(fname, 'r') as f:
        data = yaml.safe_load(f)
    return data


def read_template(fname):
    with open(fname, 'r') as f:
        out = f.readlines()
    return ''.join(out)


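def getex():
    # pick one verified example at random; handles examples.yml loading
    # as either a list or a dict, and never indexes past the end
    pool = list(examples.values()) if isinstance(examples, dict) else examples
    return pool[randint(0, len(pool) - 1)]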


def getdoc():
    return allegs.findings_of_fact.sample(1).values[0]


def query_model(doc, ex=None):
    if not ex: ex = getex()
    res = chain.invoke({'EX_DOCUMENT': ex['DOCUMENT'],
                        'EX_RESPONSE': ex['RESPONSE'],
                        'DOCUMENT': doc})
    return res.content.strip()


def getreport(samp):
    rep = f"""
    ALLEGATION(S):\t{samp.allegations.values}
    FINDING:\t{samp.finding.values}


    QUERY RESPONSE:\n{samp.q_justifications.values}

    ---
    DOCUMENT (AKA, "FINDINGS OF FACT"):\n
    {samp.findings_of_fact.values}
    ---
    """
    return rep

In [3]:
# main
# setup
OPENAI_API_KEY = getcreds()
examples = read_yaml("../hand/examples.yml")
templ = read_template("../hand/template.txt")

# get the data
allegs = pd.read_parquet("../input/allegations.parquet", columns=[
    'allegation_id', 'allegations', 'finding', 'findings_of_fact']).dropna()

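# this is about context window length,
# which is worth reading about but isn't discussed here.
# note: this counts whitespace-separated words, not true model tokens,
# so it is only a rough proxy for how much of the window a document uses
allegs['ntokens'] = allegs.findings_of_fact.apply(lambda x: len(x.split()))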
allegs = allegs.loc[allegs.ntokens < 4000].copy()

# setup the model
llm = ChatOpenAI(model_name='gpt-3.5-turbo', temperature=0, api_key=OPENAI_API_KEY)
prompt = ChatPromptTemplate.from_template(templ)
chain = prompt | llm
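The `ntokens` column above counts words, and a model token is usually shorter than a word, so the real token count tends to be higher. Below is a rough sketch of a more conservative estimate based on OpenAI's rule of thumb for English text (about 4 characters, or about three quarters of a word, per token). `estimate_tokens` is a hypothetical helper and the result is an approximation; for exact counts, OpenAI's `tiktoken` package is the tool built for the job.

```python
import math


def estimate_tokens(text):
    """Rough token estimate for English text; takes the larger of two heuristics."""
    by_words = len(text.split()) * 4 / 3   # about 0.75 words per token
    by_chars = len(text) / 4               # about 4 characters per token
    return math.ceil(max(by_words, by_chars))


text_sample = "The officers denied that any of them were rude or angry."
print(len(text_sample.split()), estimate_tokens(text_sample))
```

Taking the larger of the two heuristics errs on the side of overestimating, which is the safer direction when the goal is to stay under a context limit.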

### preview allegation data

In [4]:
allegs.sample(3).T

| | 15082 | 10721 | 21654 |
|---|---|---|---|
| allegation_id | a7eb190ad309f884 | ece472d40320f4b0 | 190c72b7e4f0de2c |
| allegations | 1: The officer entered the residence. | 1: The officer behaved and spoke inappropriate... | 1: The officer arrested the complainant withou... |
| finding | PC | NS | PC |
| findings_of_fact | The San Francisco Police Department officers o... | The complainant stated the named officer mistr... | The officer conducted a warrants check through... |
| ntokens | 81 | 47 | 108 |


# prompt + examples

### about the template
The prompt you supply the LLM with has a direct effect on the response you get, including, for example:
- **format:** whether the response is a complete sentence or a piece of data
- **ambiguity handling:** whether the model supplies a response like "No" when "Unclear" or "Not mentioned" might have been more accurate
- **citations:** whether the model includes a snippet of the text in its response and/or the location of the text in the larger document

You also have the option to supply things like background information and a role/persona to the model through the prompt. 

Recommended reading if you get into using these tools:
- [Prompt Engineering](https://platform.openai.com/docs/guides/prompt-engineering)
- [LangChain Templates](https://python.langchain.com/docs/templates)

In [5]:
print(templ)

# Backstory
In the last 30 years, allegations of police misconduct have been investigated by one of two agencies, depending on when the allegation was received and opened for investigation:
- the Office of Citizen Complaints, or the "OCC" for short.
- the Department of Police Accountability, or the "DPA" for short.
Each month, the agency publishes a report that includes their findings related to each investigation.

# Assignment
Your role: You are an AI assistant retrieving information from the reports published by these agencies for your team to review.
Your focus: The reasons for the conclusion.

# Example
Here is an example document and the correct response for that document:
Ex)
{EX_DOCUMENT}

Response:
{EX_RESPONSE}

# Query
Now, below is the document for you to review. 
What are the reasons given by the investigating agency for their finding?
- When you identify a reason in the text, add it to a list and then just give me that list.
- If the allegation was mediated, say that the 

In [6]:
# get an example set
ex = getex()

In [7]:
ex['DOCUMENT']

'The complainant stated that her son was stopped for no apparent reason. She was not present during the traffic stop. The named officers stated they were driving behind the complainant’s son’s vehicle when they observed that the vehicle had expired registration tabs. Records from the California Department of Motor Vehicles indicate that the vehicle in question had expired registration when the vehicle was stopped. Department General Order 5.03 allows a police officer to briefly detain a person for questioning or request identification only if the officer has reasonable suspicion that the person’s behavior is related to criminal activity. The evidence proved that the act, which provided the basis for the allegation, occurred. However, the act was justified, lawful and proper.'

In [8]:
ex['FINDING']

'PC'

In [9]:
ex['RESPONSE']

['The named officers stated they observed that the vehicle had expired registration tabs.',
 'Records from the California Department of Motor Vehicles indicate that the vehicle in question had expired registration when the vehicle was stopped.',
 'Department General Order 5.03 allows a police officer to briefly detain a person for questioning or request identification only if the officer has reasonable suspicion that the person’s behavior is related to criminal activity.']

# explore 1: single query

In [10]:
# get a single doc
doc = getdoc()

In [11]:
query_model(doc=doc, ex=ex)

"['Department records showed that the complainant was detained during a homicide investigation.', 'Records also showed that the inspector assigned to the homicide investigation has retired from the Department.', 'The identity of the alleged officer could not be established.']"

#### is the above a reasonable response for the below document?
Why or why not?

In [12]:
doc

'In her written complaint, the complainant stated that she was strip-searched without cause. The complainant did not provide an interview. Department records showed that the complainant was detained during a homicide investigation, and that the SFPD’s Tactical Unit was at the scene. Records also showed that the inspector assigned to the homicide investigation has retired from the Department. In addition, the officer in charge of the Tactical Unit has also retired from the Department. The identity of the alleged officer could not be established. DATE OF COMPLAINT: 12/04/13                DATE OF COMPLETION: 07/31/17                PAGE# 2 of 2'

Is there any detail missing that should be included, or vice versa? Should this be a reason to fine-tune our template/prompt? Is it understandable, fair, or acceptable for the model to respond this way given this kind of info?

Are there any other questions to stop and ask before we deploy this method on a larger sample?
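One practical detail before scaling up: the response in this section comes back as a single string that happens to look like a Python list. A minimal sketch of turning it into a real list with the standard library's `ast.literal_eval` follows. `parse_list_response` is a hypothetical helper, and it assumes the model keeps following the list format shown in the example, which the prompt encourages but does not guarantee.

```python
import ast


def parse_list_response(res):
    """Turn a response like "['a', 'b']" into a list of strings, or None on failure."""
    try:
        parsed = ast.literal_eval(res.strip())
    except (ValueError, SyntaxError):
        return None
    if isinstance(parsed, list) and all(isinstance(s, str) for s in parsed):
        return parsed
    return None


res = ("['The identity of the alleged officer could not be established.', "
       "'The complainant did not provide an interview.']")
items = parse_list_response(res)
print(len(items))
```

Returning `None` instead of raising keeps a batch run going when one response is malformed, for instance when a snippet contains an unescaped straight apostrophe, and the `None` rows can be reviewed by hand afterwards.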

# explore 2: batch querying
This approach is still experimental to us, but we're very curious about how this might work on a larger document collection. Let's take a sample of the DPA allegation data and run `query_model()` over the `findings_of_fact` "document"s.

In [13]:
test_allegs = allegs[['allegation_id', 'allegations', 'finding', 'findings_of_fact']].sample(20).copy()

In [14]:
test_allegs['q_justifications'] = test_allegs.findings_of_fact.apply(query_model)
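A batch run makes one API call per row, and any of those calls can fail for transient reasons such as rate limits or network hiccups. The sketch below shows one way to retry a failed call a few times before giving up. It is an assumption about how you might harden the run, not part of the notebook: `with_retries` is a hypothetical wrapper, and it retries on any exception, which is broader than you would want in production.

```python
import time


def with_retries(fn, attempts=3, wait=2.0):
    """Wrap fn so a failed call is retried, with a doubling pause between tries."""
    def wrapped(*args, **kwargs):
        delay = wait
        for attempt in range(1, attempts + 1):
            try:
                return fn(*args, **kwargs)
            except Exception:
                if attempt == attempts:
                    raise  # out of tries: surface the original error
                time.sleep(delay)
                delay *= 2
    return wrapped
```

With this in place, the cell above could pass `with_retries(query_model)` to `.apply` instead of `query_model` directly.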

### review justification(s) 

In [15]:
samp = test_allegs.sample(1)
rep = getreport(samp)

#### in the below report, is the LLM's response valid?

In [16]:
print(rep)


    ALLEGATION(S):	['1: The officer failed to take required action. ']
    FINDING:	['PC']


    QUERY RESPONSE:
["['The officer correctly determined she did not have sufficient evidence to make an arrest.', 'The officer explained the citizens arrest process to the complainant.', 'The officer properly investigated and documented the incident in a report.']"]

    ---
    DOCUMENT (AKA, "FINDINGS OF FACT"):

    ['The complainant stated the officer negligently responded to his call for service. The complainant alleged his neighbor assaulted him and the officer failed to arrest the neighbor. The officer responded, along with two backup officers. The officer interviewed the complainant and the neighbor. The officer also interviewed the complainant’s landlord. There were no witnesses to the alleged assault and the complainant was not injured. The officer correctly determined she did not have sufficient evidence to make an arrest. The officer explained the citizens arrest process to the co

# explore 3: 'SUSTAINED' allegation

In [17]:
sust = allegs.loc[allegs.finding == 'S'].copy()

In [18]:
sust_samp = sust.sample()
sust_samp['q_justifications'] = sust_samp.findings_of_fact.apply(query_model)

In [19]:
print(getreport(sust_samp))


    ALLEGATION(S):	['11-12: The officers detained the complainant without justification. ']
    FINDING:	['S']


    QUERY RESPONSE:
["['Entering a residence without a search warrant, consent or exigent circumstance is prohibited by Department General Orders, California State law and the Fourth Amendment to the United States Constitution.', 'The allegation was sustained.']"]

    ---
    DOCUMENT (AKA, "FINDINGS OF FACT"):

    ['The officers stated they detained the complainant when the complainant refused to allow officers to enter the residence without a search warrant. The officers acknowledged that they did not have a search warrant. Entering a residence without a search warrant, consent or exigent circumstance is prohibited by Department General Orders, California State law and the Fourth Amendment to the United States Constitution. The allegation was sustained.']
    ---
    


- Is the model missing or under-reporting justifications?
- Is it over-reporting sentences that are not related to the justification?

These findings are constructed with an audience of lawyers in mind, and the boundary between a legal justification and any other statement might be a problem for this kind of prompting. Our PDO partners might be able to intuitively audit the responses and know when something is off about an individual case, but this experiment seems to have worked better as an exploration than as a test of a real application.
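One partial check that could support that kind of audit is whether each returned snippet appears word for word in the document. The sketch below is an assumption about how to do that, with hypothetical helper names (`squash`, `split_by_grounding`); it says nothing about whether a snippet is a good justification, only whether the model quoted or rewrote the text.

```python
def squash(text):
    """Collapse runs of whitespace so line breaks don't block a match."""
    return ' '.join(text.split())


def split_by_grounding(snippets, doc):
    """Split snippets into (verbatim, altered) by whether each appears in doc."""
    haystack = squash(doc)
    verbatim = [s for s in snippets if squash(s) in haystack]
    altered = [s for s in snippets if squash(s) not in haystack]
    return verbatim, altered


doc_text = ("The officers acknowledged that they did not have a search warrant. "
            "The allegation was sustained.")
snippets = ["The allegation was sustained.",
            "The officers admitted they had no warrant."]
verbatim, altered = split_by_grounding(snippets, doc_text)
print(verbatim, altered)
```

An "altered" snippet is not necessarily wrong. In explore 1, the first returned item is a trimmed version of a longer sentence, so it would land in the altered list while still being faithful to the document.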

# explore 4: precision + quality
This is an interesting template that might benefit from fine-tuning, but let's try a more structured question that we can compare to ground truth data to get a better idea of how the model is actually doing. 

Let's go back to our dataset and pull the `complaint_meta` column, which contains the raw text from the top of the document where the `date_complained` and `date_completed` fields live. We can also use the `time_to_complete` variable as an extra challenge. 

In [20]:
# re-read allegation data, with a focus on date fields
dates = pd.read_parquet("../input/allegations.parquet", columns=[
    'allegation_id', 'complaint_meta',
    'date_complained', 'date_completed', 'time_to_complete']).dropna()

Then we can reframe the prompt around extracting the date info that we already have, and see how accurate the responses actually are.

In [21]:
# reset setup, with a focus on date fields
def query_dates(doc):
    res = date_chain.invoke({'DOCUMENT': doc})
    return res.content.strip()


date_tmpl = """
# Backstory
In the last 30 years, allegations of police misconduct have been investigated by one of two agencies, depending on when the allegation was received and opened for investigation:
- the Office of Citizen Complaints, or the "OCC" for short.
- the Department of Police Accountability, or the "DPA" for short.
Each month, the agency publishes a report that includes their findings related to each investigation.

# Assignment
Your role: You are an AI assistant retrieving information from the reports published by these agencies for your team to review.
Your focus: Collecting the date of the complaint and when the investigation into it was closed.

# Query
Now, below is the document for you to review.
Please respond with ONLY these 3 pieces of information in a numbered list:
    1. The date the complaint was made, as just "%Y-%m-%d". Do not write a complete sentence.
    2. The date the investigation was completed, as just "%Y-%m-%d". Do not write a complete sentence.
    3. How much time passed between the completed date and the complained date, as "n days"?

Also,
- If the date is incomplete, cannot be parsed or formatted, say that.
- If you could not find and format both dates, or there was a different problem calculating how many days passed, return 'None'.
- If the date completed is earlier than the date complained, say NEGATIVE days for item 3.

---

{DOCUMENT}

---
"""
date_prompt = ChatPromptTemplate.from_template(date_tmpl)
date_chain = date_prompt | llm

Again, starting with a sample set rather than the full table.

In [22]:
less_dates = dates.sample(20).copy()

In [23]:
less_dates['q_dateinfo'] = less_dates.complaint_meta.apply(query_dates)

In [24]:
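# add a method to help fish the answer out of the response string
def find_res(v, resn):
    # locate the "N. " marker for the requested item; None if the model skipped it
    marker = f"{resn}. "
    start = v.find(marker)
    if start == -1:
        return None
    start += len(marker)
    # the answer runs up to the next numbered item, or to the end of the response
    end = v.find(f"{resn+1}. ", start)
    return v[start:end].strip() if end != -1 else v[start:].strip()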

In [25]:
# unpack the response column
less_dates['q_date_complained'] = less_dates.q_dateinfo.apply(lambda x: find_res(x, 1))
less_dates['q_date_completed'] = less_dates.q_dateinfo.apply(lambda x: find_res(x, 2))
less_dates['q_time_to_complete'] = less_dates.q_dateinfo.apply(lambda x: find_res(x, 3))

In [26]:
# count the rows where the extracted complaint date differs from the known one
(less_dates.date_complained.astype(str) != less_dates.q_date_complained).sum()

0

In [27]:
# count the rows where the extracted completion date differs from the known one
(less_dates.date_completed.astype(str) != less_dates.q_date_completed).sum()

0

In [28]:
less_dates.loc[(less_dates.date_complained.astype(str) != less_dates.q_date_complained) |
               (less_dates.date_completed.astype(str) != less_dates.q_date_completed),
               ['date_complained', 'q_date_complained', 'date_completed', 'q_date_completed']]

| | date_complained | q_date_complained | date_completed | q_date_completed |
|---|---|---|---|---|


It looks like extracting dates is a pretty reliable process for most of the texts, enough that we could probably make a test case or assertion out of this statement with a few changes.

In [29]:
less_dates.loc[(less_dates.date_complained.astype(str) != less_dates.q_date_complained) |
               (less_dates.date_completed.astype(str) != less_dates.q_date_completed),
               'complaint_meta'].values

array([], dtype=object)

If the model was wrong in an above response, is there an obvious reason or tweak to be made to the prompt based on what we see in the metadata document? How might we avoid this mistake in the future, or prevent a similar error with other documents?

And finally, how does the challenge question hold up?

In [30]:
less_dates[['time_to_complete', 'q_time_to_complete', 'date_complained', 'date_completed']]

| | time_to_complete | q_time_to_complete | date_complained | date_completed |
|---|---|---|---|---|
| 16588 | 65 days | 65 days | 2009-08-12 | 2009-10-16 |
| 1049 | 110 days | 110 days | 2021-10-08 | 2022-01-26 |
| 19167 | 170 days | 170 days | 2008-04-22 | 2008-10-09 |
| 1701 | 55 days | 55 days | 2022-05-18 | 2022-07-12 |
| 15894 | 69 days | 69 days | 2009-05-22 | 2009-07-30 |
| 3584 | 75 days | 75 days | 2021-01-20 | 2021-04-05 |
| 20913 | 62 days | 62 days | 2007-04-26 | 2007-06-27 |
| 18929 | 205 days | 205 days | 2008-03-04 | 2008-09-25 |
| 15469 | 337 days | 337 days | 2008-06-23 | 2009-05-26 |
| 14945 | 101 days | 101 days | 2008-11-18 | 2009-02-27 |


### models vs. libraries

In this dataset, `time_to_complete` was calculated by the `pandas` and `datetime` packages, which I expect to be more accurate than the model's math. 

That being said, I tested this code with two different [available models](https://platform.openai.com/docs/models/gpt-3-5) and they performed quite differently when it came to this challenge Q/A.

The first test used a legacy LLM, `text-davinci-003`, and while some of the numbers were pretty close, especially when the true value was under 60 days, longer time periods were significantly off in the time calculation. This is likely due to a naive model's assumption that every month has 30 days, or something similar. However, a more recent model like the one used in this demo, `gpt-3.5-turbo`, appears to do significantly better at making accurate calculations with time data.
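To make the suspected failure mode concrete, the sketch below compares a 30-day-month shortcut against real calendar arithmetic for two rows from the table above. This only illustrates the guess; it is not a confirmed account of what the legacy model was doing, and `naive_days` and `true_days` are hypothetical helpers.

```python
from datetime import date


def naive_days(start, end):
    """Day count under the shortcut that every month has 30 days."""
    return ((end.year - start.year) * 360
            + (end.month - start.month) * 30
            + (end.day - start.day))


def true_days(start, end):
    """Day count from calendar arithmetic."""
    return (end - start).days


short_gap = (date(2009, 8, 12), date(2009, 10, 16))   # the 65-day row
long_gap = (date(2008, 6, 23), date(2009, 5, 26))     # the 337-day row

for start, end in (short_gap, long_gap):
    print(true_days(start, end), naive_days(start, end))
```

The shortcut is off by one day on the short gap (64 instead of 65) and by four days on the long one (333 instead of 337), which fits the pattern of errors growing with the length of the period.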

A programming library could be expected to check the true possible number of days when doing a calculation between two `datetime` values, and an LLM might be correct if prompted about the number of days in a particular month, but it might not be realistic to assume that any LLM will do this kind of calculation correctly. 

**TL;DR:** Unless you've already confirmed the selected model will make the selected calculations appropriately for your purpose, it's worth checking the documentation and running some examples first. Eventually, it might be the case that all production models have this knowledge, but it's good to be cautious.

Regardless of how the time-difference math went, the date extraction appears pretty solid. If we wanted to, we could pull the dates from the documents that way and then calculate the time to complete the investigation with known libraries, which would still be super helpful.
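A minimal sketch of that split, with the model doing the extraction and the standard library doing the math, might look like the following. `days_between` is a hypothetical helper, and it assumes the extracted strings follow the "%Y-%m-%d" format requested in the prompt; anything else, including the model's 'None' fallback, comes back as `None`.

```python
from datetime import datetime


def days_between(complained, completed, fmt="%Y-%m-%d"):
    """Days from complained to completed, or None if either string won't parse."""
    try:
        start = datetime.strptime(str(complained).strip(), fmt)
        end = datetime.strptime(str(completed).strip(), fmt)
    except ValueError:
        return None
    return (end - start).days


print(days_between("2021-10-08", "2022-01-26"))   # a row from the table: 110
print(days_between("2021-10-08", "None"))         # unparseable input: None
```

A completion date earlier than the complaint date simply yields a negative number, so that case needs no special handling.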