## Exploratory Analysis for Police Summary Reports

### Packages

In [18]:
# !pip install transformers
# !pip install nltk

In [19]:
import os
import re
import random
import pandas as pd
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Local imports
from text_cleaner import TextParser

### Parameters

In [20]:
PATH = "C:/Users/fdmol/Desktop/MSCAPP/CAPP30255/NLP-Police-Complaints/data/text_files"
CHARS_TO_REMOVE = ["\n", "§"]
REGEX_PATTERNS = [
    r"civilian office of police accountability\s+",
    r"log\s*\#\s*\d+",
    r"-\s*\d+\s*\d+",
    r"summary report of investigation\s+",
    # Escape the dot so it matches a literal "i." rather than "i" + any character
    r"i\.\s+executive\s+summary",
    r"_+",
    r"\s*date of incident:\s*\w+\s+\d+,\s+\d+",
    r"\s*time of incident:\s*\d+:\d+\w+",
    r"\s*location of incident:\s*\d+\w+\s*\w+",
    r"\s*date of copa notification:\s*\w+\s+\d+,\s+\d+",
    r"\s*time of copa notification:\s*\d+:\d+\w+",
    # Section titles, joined into a single alternation. No trailing "|":
    # an empty alternative would match the empty string at every position.
    r"applicable rules and laws|"
    r"conclusion|"
    r"digital evidence|"
    r"documentary evidence|"
    r"legal standard",
    r"appendix\s+.*",
    r"\s+deputy chief administrator\s+",
    r"\s+deputy chief investigator\s+",
    r"\s+ibid\s+",
]

#### Functions

I define a class for reading and processing the data. I took some ideas from Matt's analysis to remove headers and other elements that are not relevant to us, which could also help produce better results for the summarization task.
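The `TextParser` class lives in the local `text_cleaner` module and is not shown in this notebook. As a rough illustration only, a cleaning step driven by `CHARS_TO_REMOVE` and `REGEX_PATTERNS` might look like the sketch below. The `clean_text` function, the lowercasing step, and the shortened pattern list are assumptions made for the example, not the actual implementation.

```python
import re

# Shortened copies of the notebook's parameters so the sketch runs on its own
CHARS_TO_REMOVE = ["\n", "§"]
REGEX_PATTERNS = [
    r"civilian office of police accountability\s+",
    r"log\s*\#\s*\d+",
    r"summary report of investigation\s+",
    r"_+",
]


def clean_text(text, chars_to_remove=CHARS_TO_REMOVE, patterns=REGEX_PATTERNS):
    """Hypothetical helper: lowercase, then strip characters and header patterns."""
    # The patterns are written in lowercase, so lowercase the text first (assumption)
    text = text.lower()
    for char in chars_to_remove:
        text = text.replace(char, " ")
    for pattern in patterns:
        text = re.sub(pattern, " ", text)
    # Collapse the whitespace left behind by the removals
    return re.sub(r"\s+", " ", text).strip()


sample = (
    "CIVILIAN OFFICE OF POLICE ACCOUNTABILITY\n"
    "LOG #1089956\n"
    "____\n"
    "The officer stopped the car."
)
print(clean_text(sample))
```

Replacing each match with a single space rather than an empty string keeps neighboring words from being glued together; the final whitespace pass then tidies up the result.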

The function below wraps Hugging Face's tokenizer and model to generate a summary for each complaint. As described in the summarization section, I tuned the generation parameters to get better summaries.

In [21]:
def generate_summary(complaint_text, tokenizer, model):
    """
    Generates a summary of a complaint given
    the complaint text
    """
    # Tokenize the text
    inputs = tokenizer(
        complaint_text, return_tensors="pt", max_length=2512, truncation=True
    )

    # Generate summary
    summary_ids = model.generate(
        inputs["input_ids"],
        max_length=1200,
        min_length=40,
        length_penalty=2.0,
        no_repeat_ngram_size=2,
        num_beams=4,
        early_stopping=True,
    )

    # Decode the generated token IDs back into text
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

    return summary

In [22]:
text_parser = TextParser(PATH, nlp_task="summarization")
complaints = os.listdir(PATH)
complaints = [complaint for complaint in complaints if complaint.endswith(".txt")]
complaints = random.sample(complaints, 10)


for complaint in complaints:
    complaint_text = text_parser.file_to_string(complaint)
    print(complaint)
    print("=====================================")
    # print(complaint_text)


# complaint_text

2019-0000577.txt
1089956.txt
2018-1089681.txt
2017-1084713.txt
2018-1089882.txt
1088316.txt
1090499.txt
2015-1083058.txt
1092634.txt
2017-1087409.txt


#### NLP Task: Summarization

I will use a T5 sequence-to-sequence model (see the `T5ForConditionalGeneration` documentation):

https://huggingface.co/docs/transformers/main/en/model_doc/t5#transformers.T5ForConditionalGeneration

In the cells below, I use the model to generate a summary for three random complaints. The quality is mixed and varies by complaint: some summaries capture the allegation and the finding, while others are fragmented or cut off mid-sentence.

I did the following to try to improve the quality of the summaries:

- Adjusted the `max_length` parameter to limit the length of the summary
- Adjusted the `min_length` parameter to ensure the summary is at least a certain length
- Adjusted the `num_beams` parameter to increase the number of beams used in beam search
- Adjusted the `no_repeat_ngram_size` parameter to avoid repeating n-grams in the summary

I also experimented with the tokenizer's `max_length`, which caps the number of input tokens kept before truncation.
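One way to keep this kind of experimentation organized is to enumerate the parameter combinations up front. The sketch below is only an illustration: the `generation_configs` helper and the candidate values are hypothetical, and only the parameter names come from the `generate` call used in this notebook.

```python
from itertools import product

# Hypothetical candidate values; only the parameter names come from the notebook
PARAM_GRID = {
    "max_length": [150, 300],
    "min_length": [40],
    "num_beams": [2, 4],
    "no_repeat_ngram_size": [2, 3],
}


def generation_configs(grid):
    """Yield one keyword-argument dict per combination of values in the grid."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


configs = list(generation_configs(PARAM_GRID))
print(f"{len(configs)} configurations")
for config in configs:
    print(config)
```

Each dictionary could then be unpacked into the generation call, for example `model.generate(inputs["input_ids"], **config)`, so that summaries produced under different settings can be compared side by side.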

In [23]:
model_name = "Falconsai/text_summarization"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

In [None]:
text_parser = TextParser(PATH, nlp_task="summarization")

# Get a random sample of 3 complaints
complaints = os.listdir(PATH)
complaints = [complaint for complaint in complaints if complaint.endswith(".txt")]
complaints = random.sample(complaints, 3)


for complaint in complaints:
    complaint_text = text_parser.file_to_string(complaint)
    summary = generate_summary(complaint_text, tokenizer, model)
    print(f"Complaint: {complaint}")
    print("=====================================")
    print(f"Summary: {summary}\n\n")

Complaint: 2019-0003037.txt
Summary: alleged that on july 22, 2019, at approximately 10:37 pm, at or near 803 w. 80th st., you: 1. performed a traffic stop on without justification, in violation of rule 2 and rule 8. iv. exonerated.


Complaint: 2008-1018328.txt
Summary: a gunshot wound to the right shoulder and multiple rib fractures. witness 7 was driving southbound on the i expressway when subject 2 accelerated and drove toward officer f. Witness 7, stated that he did not witness the police involved shooting. witnesses 6 provided an account that was consistent with the description of events that were docum ented in the report of assistant deputy superintendent.


Complaint: 2016-1079728.txt
Summary: officer roldan had a gun in his right hand and his left hand was on mr. neck. giovanni was walking to his parked car near his apartment building and yelling from inside ipra on march 20, 2016, swung his service weapon and began searching he walked away from the officer's t-shirt, and the

Some next steps could be:

- Improve cleaning process to remove irrelevant headers and footers
- Try other models
- Fine-tune a model on this dataset. This would require a labeled dataset, possibly with summaries written by hand.
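
For the fine-tuning idea, the labeled data could be stored as simple text/summary pairs. The sketch below shows one possible layout using JSON Lines; the `build_records` and `to_jsonl` helpers, the field names, and the placeholder strings are all assumptions for illustration, not an existing part of this project.

```python
import json


def build_records(complaints, summaries):
    """Pair each complaint text with its hand-written summary."""
    if len(complaints) != len(summaries):
        raise ValueError("Each complaint needs exactly one summary")
    return [
        {"text": text, "summary": summary}
        for text, summary in zip(complaints, summaries)
    ]


def to_jsonl(records):
    """Serialize records as JSON Lines, one training example per line."""
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)


# Placeholder strings, not real complaint data
records = build_records(
    ["full text of complaint A", "full text of complaint B"],
    ["hand-written summary of A", "hand-written summary of B"],
)
print(to_jsonl(records))
```

Keeping one example per line makes it easy to append new hand-written summaries over time and to split the file into training and validation sets later.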