## Exploratory Analysis for Police Summary Reports

### Packages

In [18]:
# pip install transformers
# !pip install nltk


In [19]:
import os
import re
import random
import pandas as pd
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Local imports
from text_parser import TextParser

### Parameters

In [25]:
PATH = "C:/Users/fdmol/Desktop/MSCAPP/CAPP30255/NLP-Police-Complaints/data/text_files"

#### Functions

I define a class for reading and processing the data. I borrowed some ideas from Matt's analysis to remove headers and other elements that are not relevant to us, which could also help the summarization task produce better results.

The function below wraps HuggingFace's tokenizer and model to generate a summary for each complaint. As I mention below, I tweaked the parameters to get better summaries.

In [26]:
def generate_summary(complaint_text, tokenizer, model):
    """
    Generates a summary of a complaint given
    the complaint text
    """
    # Tokenize the text; inputs longer than max_length tokens are truncated
    inputs = tokenizer(
        complaint_text, return_tensors="pt", max_length=2512, truncation=True
    )

    # Generate the summary token ids with beam search
    summary_ids = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=1200,
        min_length=40,
        length_penalty=2.0,
        no_repeat_ngram_size=2,
        num_beams=4,
        early_stopping=True,
    )

    # Decode the summary token ids back into text
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

    return summary

In [27]:
text_parser = TextParser(PATH, nlp_task="summarization")
complaints = os.listdir(PATH)
complaints = [complaint for complaint in complaints if complaint.endswith(".txt")]
complaints = random.sample(complaints, 10)


for complaint in complaints:
    complaint_text = text_parser.file_to_string(complaint)
    print(complaint)
    print("=====================================")
    # print(complaint_text)


# complaint_text

2017-1086745.txt
2019-0000512.txt
2019-0002776.txt
1086047.txt
2019-0003780.txt
2019-0005246.txt
2016-1083001.txt
2019-1092533.txt
2020-0004803.txt
2019-0005312.txt
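
The sample above changes on every run because no random seed is set, which makes it hard to compare summaries across experiments. A minimal sketch of a reproducible draw follows; the helper name `sample_complaints` and its `seed` argument are assumptions for illustration and are not part of the notebook or of `TextParser`.

```python
import os
import random


def sample_complaints(path, k, seed=0):
    """Return up to k .txt file names from path, drawn reproducibly."""
    # Sort first so the result does not depend on the OS listing order
    files = sorted(f for f in os.listdir(path) if f.endswith(".txt"))
    # A dedicated generator avoids touching the global random state
    rng = random.Random(seed)
    return rng.sample(files, min(k, len(files)))
```

Calling `sample_complaints(PATH, 10, seed=42)` twice should return the same ten file names, as long as the folder contents do not change.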


#### NLP Task: Summarization

I use `Falconsai/text_summarization`, a T5-based model. The T5 documentation is here:

https://huggingface.co/docs/transformers/main/en/model_doc/t5#transformers.T5ForConditionalGeneration

In the cells below, I use the model to generate a summary for three random complaints. The quality is uneven and depends on the complaint: some summaries capture the finding, while others are incoherent or cut off.

I did the following to try to improve the quality of the summaries:

- Adjusted the `max_length` parameter to limit the length of the summary
- Adjusted the `min_length` parameter to ensure the summary is at least a certain length
- Adjusted the `num_beams` parameter to increase the number of beams used in beam search
- Adjusted `no_repeat_ngram_size` parameter to avoid repeating n-grams in the summary
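
To make the effect of `no_repeat_ngram_size` concrete, here is a minimal sketch that lists the n-grams occurring more than once in a generated summary. The helper `repeated_ngrams` is hypothetical (it is not part of the notebook or of transformers), and it works on whitespace-split words, whereas the model applies the constraint to token ids, so this is only an approximate check.

```python
from collections import Counter


def repeated_ngrams(text, n=2):
    """Return the word-level n-grams that occur more than once, with counts."""
    words = text.lower().split()
    # Build consecutive n-grams by zipping n shifted views of the word list
    ngrams = zip(*(words[i:] for i in range(n)))
    counts = Counter(ngrams)
    return {gram: count for gram, count in counts.items() if count > 1}
```

With `no_repeat_ngram_size=2`, running `repeated_ngrams(summary, n=2)` on a summary should usually return few or no bigrams; a long list would suggest trying a different setting.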

I also experimented with the tokenizer's `max_length`, which caps how many input tokens the model sees; longer complaints are truncated.
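
Because the tokenizer is called with `truncation=True`, any text past the token limit never reaches the model. One possible workaround, sketched here under stated assumptions, is to split a long complaint into overlapping windows and summarize each window separately. The helper `chunk_words` and its default sizes are hypothetical, and it counts whitespace-separated words as a rough stand-in for tokens; a real version would count tokens with the tokenizer.

```python
def chunk_words(text, max_words=400, overlap=50):
    """Split text into overlapping word windows (a proxy for token windows)."""
    if max_words <= overlap:
        raise ValueError("max_words must be larger than overlap")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once the current window already reaches the end of the text
        if start + max_words >= len(words):
            break
    return chunks
```

Each chunk could then be passed to `generate_summary` and the partial summaries joined, although I have not tested whether this gives better results on this dataset.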

In [23]:
model_name = "Falconsai/text_summarization"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

In [24]:
text_parser = TextParser(PATH, nlp_task="summarization")

# Get a random list of 3 complaints
complaints = os.listdir(PATH)
complaints = [complaint for complaint in complaints if complaint.endswith(".txt")]
complaints = random.sample(complaints, 3)


for complaint in complaints:
    complaint_text = text_parser.file_to_string(complaint)
    summary = generate_summary(complaint_text, tokenizer, model)
    print(f"Complaint: {complaint}")
    print("=====================================")
    print(f"Summary: {summary}\n\n")

Complaint: 2019-0002186.txt
Summary: copa finds that allegation #1 against both officers is sustained. the officers deactivated their bwc while still eng aged in law -enforcement-related activity, in violation of rule 6. a prepondera nce of evidence can be described as evidence indicating that it is more likely that the conduct occurred, even if by narrow margin, then the 'preponderance of th e evidence standard is met'


Complaint: 2019-0003830.txt
Summary: officer homar navar and officer pedro anaya arrested on september 23, 2019, at approximately 2:55 pm. lieutenant darwin butler said he intended the accused officers to be the partners who were arrested officer, lt dr. klaus's' iv. rule 5: failure to assure the safety of arrestee by leaving in possession of a belt while was left unattended in the holding room'


Complaint: 2021-0002220.txt
Summary: sgt. amelia kessem posted on her facebook page on june 7th, 202 1. she was informed that her posts were “clearly homophobic” and violate

Some next steps could be:

- Improve the cleaning process to remove irrelevant headers and footers
- Try other models
- Fine-tune a model on this dataset (we would need to create a labeled dataset for this, possibly by writing the summaries by hand).
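
For the first next step, one possible approach is to treat lines that recur across many reports as headers or footers and drop them before summarizing. The sketch below is an assumption about how that could work, not what `TextParser` currently does; the function name `strip_repeated_lines` and the `min_share` threshold are hypothetical.

```python
from collections import Counter


def strip_repeated_lines(documents, min_share=0.5):
    """Drop lines that appear in at least min_share of the documents."""
    # With fewer than two documents there is nothing to compare against
    if len(documents) < 2:
        return list(documents)
    doc_lines = [[line.strip() for line in doc.splitlines()] for doc in documents]
    # Count each distinct non-empty line once per document
    counts = Counter()
    for lines in doc_lines:
        counts.update({line for line in lines if line})
    threshold = min_share * len(documents)
    boilerplate = {line for line, count in counts.items() if count >= threshold}
    return [
        "\n".join(line for line in lines if line and line not in boilerplate)
        for lines in doc_lines
    ]
```

The threshold would need tuning: set too low, it could remove legitimate sentences that several reports share, such as standard legal definitions.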