## Exploratory Analysis for Police Summary Reports

### Packages

In [1]:
# pip install transformers
# !pip install nltk

In [15]:
import os
import random
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import pandas as pd

# Local imports
from text_parser import TextParser



### Parameters

In [9]:
PATH = "C:/Users/fdmol/Desktop/MSCAPP/CAPP30255/NLP-Police-Complaints/data/text_files"
DATA_PATH = "C:/Users/fdmol/Desktop/MSCAPP/CAPP30255/NLP-Police-Complaints/data"

#### Functions

I define a class for reading and processing the data. I took some ideas from Matt's analysis to remove headers and other elements that are not relevant to us, which may also help produce better results for the summarization task.

The class below wraps Hugging Face's tokenizer and model to generate a summary for each complaint. As I mention below, I tweaked the parameters to get better summaries.

In [10]:
class Summarizer:
    """
    This class is in charge of summarizing a given text using
    any Hugging Face model
    """

    BART_INPUT_SIZE = 1024

    def __init__(self, model_name: str, complaint_text: str, input_size: int = 2048):
        self.model_name = model_name
        self.complaint_text = complaint_text
        self.input_size = input_size

        self.model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)

        if self.model_name == "facebook/bart-large-cnn":
            self.inputs = self.tokenizer(
                self.complaint_text,
                return_tensors="pt",
                max_length=self.BART_INPUT_SIZE,
                truncation=True,
            )

        else:
            # Other models use the caller-supplied input size
            self.inputs = self.tokenizer(
                self.complaint_text,
                return_tensors="pt",
                max_length=self.input_size,
                truncation=True,
            )

    def generate_summary(
        self,
        max_length: int = 1200,
        min_length: int = 40,
        length_penalty: float = 2.0,
        no_repeat_ngram_size: int = 2,
        num_beams: int = 4,
        early_stopping: bool = True,
    ):
        """
        This function will generate the summary
        """

        # Generate summary
        summary_ids = self.model.generate(
            self.inputs["input_ids"],
            max_length=max_length,
            min_length=min_length,
            length_penalty=length_penalty,
            no_repeat_ngram_size=no_repeat_ngram_size,
            num_beams=num_beams,
            early_stopping=early_stopping,
        )

        # Decode the summary
        summary = self.tokenizer.decode(summary_ids[0], skip_special_tokens=True)

        return summary

#### Using default function to test out some models

In [11]:
def generate_summary(complaint_text, model_name):
    """
    Generates a summary of a complaint given
    the complaint text
    """

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    # Tokenize the text

    if model_name == "facebook/bart-large-cnn":
        inputs = tokenizer(
            complaint_text, return_tensors="pt", max_length=1024, truncation=True
        )

    else:
        inputs = tokenizer(
            complaint_text, return_tensors="pt", max_length=2048, truncation=True
        )

    # Generate summary
    summary_ids = model.generate(
        inputs["input_ids"],
        max_length=1200,
        min_length=40,
        length_penalty=2.0,
        no_repeat_ngram_size=2,
        num_beams=4,
        early_stopping=True,
    )

    # Decode the summary
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

    return summary

#### NLP Task: Summarization

I will use the following models:

- https://huggingface.co/docs/transformers/main/en/model_doc/t5#transformers.T5ForConditionalGeneration
- https://huggingface.co/facebook/bart-large-cnn
- https://huggingface.co/google/flan-t5-large


In the cells below, I use each model to generate a summary for five random complaints. The quality of the summaries is mixed and varies by complaint.

I did the following to try to improve the quality of the summaries:

- Adjusted the `max_length` parameter to limit the length of the summary
- Adjusted the `min_length` parameter to ensure the summary is at least a certain length
- Adjusted the `num_beams` parameter to increase the number of beams used in beam search
- Adjusted the `no_repeat_ngram_size` parameter to avoid repeating n-grams in the summary

I also experimented with the `max_length` passed to the tokenizer, which caps how many input tokens the model sees.
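
To get a feel for what the tokenizer's `max_length` does to long complaints, the sketch below estimates how much of a text survives truncation. It is only a rough illustration: the helper `truncation_report` is hypothetical, and it assumes whitespace-separated words as a stand-in for model tokens, so the real share kept by a subword tokenizer is usually lower.

```python
def truncation_report(text: str, max_tokens: int) -> dict:
    """Report how much of a text survives truncation to max_tokens.

    Assumption: whitespace-separated words stand in for model tokens.
    Subword tokenizers usually produce more tokens than words, so the
    true share kept is typically lower than the figure reported here.
    """
    words = text.split()
    kept = words[:max_tokens]
    share = len(kept) / len(words) if words else 1.0
    return {
        "total_words": len(words),
        "kept_words": len(kept),
        "share_kept": round(share, 3),
    }


# Hypothetical 3,000-word complaint truncated at the 1,024 input size used for BART
example_text = " ".join(["word"] * 3000)
report = truncation_report(example_text, max_tokens=1024)
print(report)
```

A low `share_kept` suggests that the summary is based mostly on the opening of the complaint, which may explain summaries that focus on header-like material.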

In [12]:
model_name_falcon = "Falconsai/text_summarization"
model_name_bart = "facebook/bart-large-cnn"
model_name_flan = "google/flan-t5-large"

text_parser = TextParser(PATH, nlp_task="summarization")

# Get a random sample of 5 complaints
complaints = os.listdir(PATH)
complaints = [complaint for complaint in complaints if complaint.endswith(".txt")]
sample_complaints = random.sample(complaints, 5)

Initializing parsers for summarization
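
Because the sample is drawn without a seed, every run compares the models on different complaints. A minimal sketch of a reproducible draw is shown here; the helper `sample_files`, the seed value, and the file names are assumptions for illustration and not part of the notebook.

```python
import random


def sample_files(filenames, k, seed=42):
    """Return k filenames drawn without replacement, reproducibly."""
    rng = random.Random(seed)
    # Sort first so the result does not depend on directory listing order
    return rng.sample(sorted(filenames), k)


# Hypothetical file names standing in for the complaint files
files = [f"{i}.txt" for i in range(100)]
first = sample_files(files, k=5)
second = sample_files(files, k=5)
print(first == second)
```

Using a dedicated `random.Random` instance keeps the seed local, so other calls to `random` in the notebook are unaffected.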


### Let us compare the three models to see which one is the best for our task.

We will generate the summaries for 5 random complaints and compare the results.

In [12]:
for complaint in sample_complaints:
    complaint_text = text_parser.file_to_string(complaint)
    print(f"Model: {model_name_falcon}")
    summary = generate_summary(complaint_text, model_name_falcon)
    print(f"Complaint: {complaint}")
    print("=====================================")
    print(f"Summary: {summary}\n")

    print(f"Model: {model_name_bart}")
    summary = generate_summary(complaint_text, model_name_bart)
    print(f"Summary: {summary}\n")

    print(f"Model: {model_name_flan}")
    summary = generate_summary(complaint_text, model_name_flan)
    print(f"Summary: {summary}\n")

Initializing parsers for summarization
Model: Falconsai/text_summarization
Complaint: 1079738.txt
Summary: lt. a resigned in july of 2017 and did not respond to requests for statement. 1 on september 15, 2017, the copa replaced the independent police review authority (ipra) as the civilian oversight agency of the chicago police department. This investigation revealed that while cpd was investigating the stabbing, subject 1 actively interfered with police’s treatment of an injured party, while on or off duty.

Model: facebook/bart-large-cnn
Summary: The investigation was conducted by the Chicago Police Department's civilian oversight agency (copa) The investigation began on March 21, 2016, when a civilian was stopped by police after being stabbed. The complaint alleges that while cpd was investigating a stabbing, subject 1 actively interfered with cfd’s treatment of an injured party. It is more likely than not that both subjects were intoxicated at the time of the incident.

Model: goog

Based on these samples, the best-performing model is BART Large by Facebook (`facebook/bart-large-cnn`), so we will use it to generate the summaries for the entire dataset. The hyperparameters used are:

```python
max_length=1200,
min_length=40,
length_penalty=2.0,
no_repeat_ngram_size=2,
num_beams=4,
early_stopping=True,
```

### Grid Search for Facebook BART model

In [21]:
# We will now perform the same task using the Summarizer class, changing some of the parameters
min_lengths = [40, 60]
max_lengths = [1200, 1400]
no_repeat_ngram_sizes = [2, 3, 4, 5]
num_beams = [3, 4, 5, 6]

sample_complaints = random.sample(complaints, 1)

for complaint in sample_complaints:
    complaint_text = text_parser.file_to_string(complaint)
    # Load the model and tokenize once per complaint, not once per combination
    summarizer = Summarizer(model_name_bart, complaint_text)
    print(f"Complaint: {complaint}")
    print("=====================================")
    for min_length in min_lengths:
        for max_length in max_lengths:
            for no_repeat_ngram_size in no_repeat_ngram_sizes:
                for num_beam in num_beams:
                    print(f"Min Length: {min_length}")
                    print(f"Max Length: {max_length}")
                    print(f"No Repeat Ngram Size: {no_repeat_ngram_size}")
                    print(f"Num Beams: {num_beam}")
                    print("=====================================")
                    summary = summarizer.generate_summary(
                        max_length=max_length,
                        min_length=min_length,
                        length_penalty=2.0,
                        no_repeat_ngram_size=no_repeat_ngram_size,
                        num_beams=num_beam,
                    )
                    print(f"Summary: {summary}\n")

Complaint: 2015-1076089.txt
Min Length: 40
Max Length: 1200
No Repeat Ngram Size: 2
Num Beams: 3
Summary: The complainant, subject 1, was involved in an argument with her fifteen year -old son, juvenile 1, which escalated into a physical altercation. She slapped him with the back of her hand and he bit her on her right forearm and punched her about her upper arm. off -duty officer b intervened and allegedly physically maltreatedsubject 1. Subject 1 was advised on how to obtain an order of protection. The complainant was not injured and did not seek medical attention.

Min Length: 40
Max Length: 1200
No Repeat Ngram Size: 2
Num Beams: 4
Summary: independent police review authority  3 allegations: on 11 july 2015, at approximately 2021 hours, sergeant a, #xxxx, unit xxx, telephoned cpic and registered this complaint, on behalf of subject 1 with officer a. it is also alleged that on an date in august 2014,. at xxxx n. canfield avenue, officer b: 2) punchedsubject 1 on her head. applicable
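
The four parameter lists in the grid search cell produce 2 × 2 × 4 × 4 = 64 combinations per complaint. As a sketch of an alternative to four nested loops, the grid can be enumerated with `itertools.product`; the `param_grid` dictionary and the `iter_grid` helper are illustrative names that do not appear in the notebook.

```python
from itertools import product

# Same values as the lists in the grid search cell
param_grid = {
    "min_length": [40, 60],
    "max_length": [1200, 1400],
    "no_repeat_ngram_size": [2, 3, 4, 5],
    "num_beams": [3, 4, 5, 6],
}


def iter_grid(grid):
    """Yield one dict of keyword arguments per parameter combination."""
    keys = list(grid)
    for values in product(*(grid[key] for key in keys)):
        yield dict(zip(keys, values))


combinations = list(iter_grid(param_grid))
print(len(combinations))  # 2 * 2 * 4 * 4 = 64
print(combinations[0])
```

Since the keys match the keyword arguments of `Summarizer.generate_summary`, each dictionary could be passed as `generate_summary(**params)`.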

### Next Steps

- Fine-tune the models to improve the quality of the summaries
- Possibly cross-reference the summaries with the topic modeling results to check whether the summaries are consistent with the topics

In [16]:
# Random list of 25 complaints to summarize by hand
complaints_to_summarize = random.sample(complaints, 25)

# Create a dataframe to store the results; the manual_summary column
# starts empty and is filled in by hand
manual_summaries = pd.DataFrame(
    {"complaint": complaints_to_summarize, "manual_summary": ""}
)

# Save the dataframe
manual_summaries.to_csv(
    os.path.join(DATA_PATH, "complaints_to_summarize.csv"), index=False
)
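
The notebook does not say how the hand-written summaries will be used. One plausible use, assumed here, is to score model summaries against them. The sketch below computes a unigram-overlap F1, which is a simplified stand-in for ROUGE-1 and not the official metric; the function name and the example sentences are hypothetical.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between two texts (a simplified ROUGE-1)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared word
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical summaries, not taken from the dataset
model_summary = "officer b punched subject 1"
manual_summary = "officer b allegedly punched subject 1 on the head"
print(round(unigram_f1(model_summary, manual_summary), 3))
```

A score like this only measures word overlap, so it would complement a manual read of the summaries instead of replacing it.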