## Exploratory Analysis for Police Summary Reports

### Packages

In [1]:
# !pip install transformers
# !pip install nltk

In [2]:
import os
import random
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import pandas as pd

# Local imports
from text_parser import TextParser


### Parameters

In [3]:
PATH = "C:/Users/fdmol/Desktop/MSCAPP/CAPP30255/NLP-Police-Complaints/data/text_files"
DATA_PATH = "C:/Users/fdmol/Desktop/MSCAPP/CAPP30255/NLP-Police-Complaints/data"

#### Functions

I define a class for reading and processing the data (`TextParser`, imported above). I took some ideas from Matt's analysis to remove headers and other elements that are not relevant to us, which could also help produce better results for the summarization task.

The class below wraps Hugging Face's tokenizer and model to generate a summary for each complaint. As I mention below, I tweaked the generation parameters to get better summaries.

In [4]:
class Summarizer:
    """
    Summarizes a given text using a Hugging Face
    sequence-to-sequence model
    """

    BART_INPUT_SIZE = 1024

    def __init__(self, model_name: str, complaint_text: str, input_size: int = 2048):
        self.model_name = model_name
        self.complaint_text = complaint_text
        self.input_size = input_size

        self.model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)

        # BART uses its own input limit; every other model uses input_size
        max_input_length = (
            self.BART_INPUT_SIZE
            if self.model_name == "facebook/bart-large-cnn"
            else self.input_size
        )

        # Tokenize the complaint, truncating anything beyond the limit
        self.inputs = self.tokenizer(
            self.complaint_text,
            return_tensors="pt",
            max_length=max_input_length,
            truncation=True,
        )

    def generate_summary(
        self,
        max_length: int = 1200,
        min_length: int = 40,
        length_penalty: float = 2.0,
        no_repeat_ngram_size: int = 2,
        num_beams: int = 4,
        early_stopping: bool = True,
    ):
        """
        Generates a summary of the complaint text
        and returns it as a string
        """

        # Generate summary
        summary_ids = self.model.generate(
            self.inputs["input_ids"],
            max_length=max_length,
            min_length=min_length,
            length_penalty=length_penalty,
            no_repeat_ngram_size=no_repeat_ngram_size,
            num_beams=num_beams,
            early_stopping=early_stopping,
        )

        # Decode the generated token ids back into text
        summary = self.tokenizer.decode(summary_ids[0], skip_special_tokens=True)

        return summary

#### Using default function to test out some models

In [5]:
def generate_summary(complaint_text, model_name):
    """
    Generates a summary of a complaint given
    the complaint text
    """

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    # Tokenize the text

    if model_name == "facebook/bart-large-cnn":
        inputs = tokenizer(
            complaint_text, return_tensors="pt", max_length=1024, truncation=True
        )

    else:
        inputs = tokenizer(
            complaint_text, return_tensors="pt", max_length=2048, truncation=True
        )

    # Generate summary
    summary_ids = model.generate(
        inputs["input_ids"],
        max_length=1200,
        min_length=40,
        length_penalty=2.0,
        no_repeat_ngram_size=2,
        num_beams=4,
        early_stopping=True,
    )

    # Decode the generated token ids back into text
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

    return summary

#### NLP Task: Summarization

I will use the following models:

- https://huggingface.co/docs/transformers/main/en/model_doc/t5#transformers.T5ForConditionalGeneration
- https://huggingface.co/facebook/bart-large-cnn
- https://huggingface.co/google/flan-t5-large


In the cells below, I use each model to generate a summary for five random complaints. The summaries are of medium quality, and the quality varies from complaint to complaint.

I did the following to try to improve the quality of the summaries:

- Adjusted the `max_length` parameter to limit the length of the summary
- Adjusted the `min_length` parameter to ensure the summary is at least a certain length
- Adjusted the `num_beams` parameter to increase the number of beams used in beam search
- Adjusted the `no_repeat_ngram_size` parameter to avoid repeating n-grams in the summary

I also experimented with the `max_length` passed to the tokenizer, which caps the number of input tokens.
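
To make the effect of `no_repeat_ngram_size` more concrete, here is a minimal pure-Python sketch of the idea: at each step, any token that would complete an n-gram already present in the output is banned. The function name `banned_next_tokens` and the example sentence are hypothetical, and this is a simplified illustration rather than Hugging Face's actual implementation.

```python
def banned_next_tokens(generated, ngram_size):
    """Return the tokens that would repeat an n-gram already in `generated`.

    `generated` is the list of tokens produced so far.
    """
    if ngram_size < 1 or len(generated) < ngram_size - 1:
        return set()

    # The last (n - 1) tokens form the prefix of the next candidate n-gram
    prefix = tuple(generated[len(generated) - (ngram_size - 1):])

    banned = set()
    # Slide over every n-gram that has already been generated
    for i in range(len(generated) - ngram_size + 1):
        ngram = tuple(generated[i : i + ngram_size])
        # If it starts with the current prefix, its last token would repeat it
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned


tokens = "the officer stopped the vehicle and the".split()
print(sorted(banned_next_tokens(tokens, ngram_size=2)))
# prints ['officer', 'vehicle']
```

With `ngram_size=2`, the bigrams "the officer" and "the vehicle" already exist, so neither word may follow the final "the". Larger values ban fewer continuations, which is why the setting trades repetition against flexibility.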

In [6]:
model_name_falcon = "Falconsai/text_summarization"
model_name_bart = "facebook/bart-large-cnn"
model_name_flan = "google/flan-t5-large"

text_parser = TextParser(PATH, nlp_task="summarization")

# Get a random sample of 5 complaints
complaints = os.listdir(PATH)
complaints = [complaint for complaint in complaints if complaint.endswith(".txt")]
sample_complaints = random.sample(complaints, 5)

Initializing parsers for summarization


### Let us compare the three models to see which one is the best for our task.

We will generate the summaries for 5 random complaints and compare the results.

In [7]:
for complaint in sample_complaints:
    complaint_text = text_parser.file_to_string(complaint)
    print(f"Model: {model_name_falcon}")
    summary = generate_summary(complaint_text, model_name_falcon)
    print(f"Complaint: {complaint}")
    print("=====================================")
    print(f"Summary: {summary}\n")

    print(f"Model: {model_name_bart}")
    summary = generate_summary(complaint_text, model_name_bart)
    print(f"Summary: {summary}\n")

    print(f"Model: {model_name_flan}")
    summary = generate_summary(complaint_text, model_name_flan)
    print(f"Summary: {summary}\n")

Model: Falconsai/text_summarization


Complaint: 1045950.txt
Summary: [girlfriend of subject 1] informed officers c and d that she and herboyfriend, subject 1, became engaged in an altercation because subject1 left their children home alone to go drink with his friends. he repeatedly kicked and punched [daughter] about the head, face and body, causing minor swelling to her head. the incident occurred on 07 june 2011, at approximately 0142 hours, at xxxx s. honore street, chicago, police and ambulance personnel arrived.

Model: facebook/bart-large-cnn
Summary: Officer #1, “officer a’sinjuries: multiple gunshot wounds; shot seven (7) total times; fatal. Officer a, unit 007:1) violated department policy regarding the use of deadly force in that he shot thesubject, subject 1, without justification, in violation of chicago police departmentgeneral order 03.

Model: google/flan-t5-large
Summary: rd#ht/domestic battery, and detective case supplementary report, gt /u#11involved officer #1: “officer a” (chicago police officer); mal

The best-performing models are Facebook's BART Large and the Falconsai T5 model, so we will use these two to generate the summaries for the entire dataset. The hyperparameters used are:

``` Python
max_length=1200,
min_length=40,
length_penalty=2.0,
no_repeat_ngram_size=2,
num_beams=4,
early_stopping=True,
```
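
One way to keep these settings in a single place is a plain dictionary that can be unpacked into a call. The sketch below uses hypothetical names (`GENERATION_KWARGS`, `with_overrides`) that are not part of the notebook; it only shows the pattern.

```python
# Generation settings reported above, collected in one place
GENERATION_KWARGS = {
    "max_length": 1200,
    "min_length": 40,
    "length_penalty": 2.0,
    "no_repeat_ngram_size": 2,
    "num_beams": 4,
    "early_stopping": True,
}


def with_overrides(**overrides):
    """Return a copy of the defaults with some values replaced."""
    unknown = set(overrides) - set(GENERATION_KWARGS)
    if unknown:
        raise KeyError(f"Unknown generation settings: {sorted(unknown)}")
    return {**GENERATION_KWARGS, **overrides}


settings = with_overrides(num_beams=5)
print(settings["num_beams"], GENERATION_KWARGS["num_beams"])
# prints 5 4
```

Because the keys match the keyword arguments of `Summarizer.generate_summary`, a dictionary like `settings` could be passed as `generate_summary(**settings)`.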

### Grid Search for Facebook BART model

In [9]:
# We will now perform the same task using the Summarizer class, changing some of the parameters
no_repeat_ngram_sizes = [2, 3, 4]
num_beams = [3, 4, 5]

sample_complaints = random.sample(complaints, 1)

for complaint in sample_complaints:
    complaint_text = text_parser.file_to_string(complaint)
    print(f"Complaint: {complaint}")
    print("=====================================")
    for no_repeat_ngram_size in no_repeat_ngram_sizes:
        print(
            f" ============== No Repeat Ngram Size: {no_repeat_ngram_size} ============"
        )
        for num_beam in num_beams:
            print(f"Num Beams: {num_beam}")
            summary = Summarizer(model_name_bart, complaint_text).generate_summary(
                max_length=1200,
                min_length=40,
                length_penalty=2,
                no_repeat_ngram_size=no_repeat_ngram_size,
                num_beams=num_beam,
            )
            print(f"Summary: {summary}\n")

Complaint: 2022-0000737.txt
Num Beams: 3
Summary: Officer ryan edwards, officer calla roulds, and officer adan pedroza jr. were on patrol in a marked chicago police department (cpd) vehicle in the area of 97th st. when they observed a hyundai tucson hatchback without a front license plate. The officers conducted a traffic stop of the vehicle at 9700 s harvard ave. An initial law enforcement agencies data system ( lead s) inquiry revealed the hy Hyundai’s registration to be expired since may 2020. He was then arrested and transported to the 5th district police s tation.

Num Beams: 4
Summary: Officer ryan edwards, star #19672, and officer adan pedroza jr. were on patrol in a marked chicago police department vehicle in the area of 97th st. when they observed a hyundai tucson hatchback without a front license plate. The officers conducted a traffic stop of the vehicle at 9700 s harvard ave. An initial law enforcement agencies data system ( lead s) inquiry revealed the hy Hyundai’s registr
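
The nested loops of the grid search can also be written with `itertools.product`, which scales more easily if further parameters are added. This is a minimal sketch that only enumerates the combinations; the helper name `grid` is hypothetical and no model is loaded here.

```python
from itertools import product

no_repeat_ngram_sizes = [2, 3, 4]
num_beams = [3, 4, 5]


def grid(**options):
    """Yield one dict per combination of the given option lists."""
    names = list(options)
    for values in product(*(options[name] for name in names)):
        yield dict(zip(names, values))


combinations = list(
    grid(no_repeat_ngram_size=no_repeat_ngram_sizes, num_beams=num_beams)
)
print(len(combinations))
# prints 9
print(combinations[0])
# prints {'no_repeat_ngram_size': 2, 'num_beams': 3}
```

Each dictionary could be unpacked into `generate_summary(**combination)`. Creating the `Summarizer` once per complaint, outside the parameter loop, would also avoid reloading the model for every combination.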

### Grid Search for Falcon T5 model

In [10]:
for complaint in sample_complaints:
    complaint_text = text_parser.file_to_string(complaint)
    print(f"Complaint: {complaint}")
    print("=====================================")
    for no_repeat_ngram_size in no_repeat_ngram_sizes:
        print(
            f" ============== No Repeat Ngram Size: {no_repeat_ngram_size} ============"
        )
        for num_beam in num_beams:
            print(f"Num Beams: {num_beam}")
            summary = Summarizer(model_name_falcon, complaint_text).generate_summary(
                max_length=1200,
                min_length=40,
                length_penalty=2,
                no_repeat_ngram_size=no_repeat_ngram_size,
                num_beams=num_beam,
            )
            print(f"Summary: {summary}\n")


Complaint: 2022-0000737.txt
Num Beams: 3
Summary: officer ryan edwards, star #19672, doa : february 28, 2022, at approximately 3:12 p.m. copa conducted a traffic stop of the vehicle at 9700 south harvard avenue, chicago, il 60628. Officer roulds said that while pedroza ran the driver’s name, she tried to run the name and license plate but was unable to get the car's identity - so they radio

Num Beams: 4
Summary: officer ryan edwards allegedly searched his hyundai without his consent. copa conducted a traffic stop of the vehicle at 9700 south harvard avenue, chicago, il 60628 p.m. Officer roulds explained that she tried to run the driver’s name and license plate but wasn’t able to get his name, so she ran the name instead of pedroza jr. he was arrested and transported to

Num Beams: 5
Summary: officer ryan edwards was arrested for various traffic offenses and a criminal arrest warrant from another jurisdiction. officer calla roulds said that on february 28, 2022, at approximately 3:12 

### Next Steps

- Fine-tune the models to improve the quality of the summaries
- Possibly cross-reference the summaries with the topic modeling results to check whether they are consistent with the topics

In [None]:
# Random list of 25 complaints to summarize by hand
complaints_to_summarize = random.sample(complaints, 25)

# Create a dataframe to store the results, with an empty column for the manual summaries
manual_summaries = pd.DataFrame(
    {"complaint": complaints_to_summarize, "manual_summary": ""}
)

# Save the dataframe

# manual_summaries.to_csv(f"{DATA_PATH}/complaints_to_summarize.csv", index=False)
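
One possible use of the hand-written summaries, assuming they are meant as references for evaluating the generated ones, is a simple word-overlap score. The sketch below computes a ROUGE-1-style F1 on lowercase whitespace tokens. The function name `unigram_f1` and the two example sentences are hypothetical, and a dedicated ROUGE package would handle stemming and tokenization more carefully.

```python
from collections import Counter


def unigram_f1(reference, candidate):
    """ROUGE-1-style F1 between two texts, using lowercase whitespace tokens."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())

    # Clipped overlap: each word counts at most as often as it appears in both
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Made-up example texts, not taken from the dataset
manual = "officer searched the vehicle without consent"
generated = "the officer searched the car"
print(round(unigram_f1(manual, generated), 3))
# prints 0.545
```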