## Exploratory Analysis for Police Summary Reports

### Packages

In [11]:
# !pip install transformers
# !pip install nltk

In [1]:
import os
import random
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import pandas as pd

# Local imports
from text_parser import TextParser

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\fdmol\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\fdmol\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### Parameters

In [2]:
PATH = "C:/Users/fdmol/Desktop/MSCAPP/CAPP30255/NLP-Police-Complaints/data/text_files"
DATA_PATH = "C:/Users/fdmol/Desktop/MSCAPP/CAPP30255/NLP-Police-Complaints/data"

#### Functions

The `TextParser` class imported above reads and processes the data. I took some ideas from Matt's analysis to remove headers and other elements that are not relevant to us, which may also improve the results of the summarization task.

The class below wraps Hugging Face's tokenizer and model to generate a summary for each complaint. As I mention below, I tweaked the generation parameters to get better summaries.

In [3]:
class Summarizer:
    """
    This class is in charge of summarizing a given text using
    any Hugging Face model
    """

    BART_INPUT_SIZE = 1024

    def __init__(self, model_name: str, complaint_text: str, input_size: int = 2048):
        self.model_name = model_name
        self.complaint_text = complaint_text
        self.input_size = input_size

        self.model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)

        # BART accepts at most 1024 input tokens; other models use the
        # caller-supplied input_size.
        if self.model_name == "facebook/bart-large-cnn":
            max_input_length = self.BART_INPUT_SIZE
        else:
            max_input_length = self.input_size

        # Tokenize the complaint, truncating anything past the limit
        self.inputs = self.tokenizer(
            self.complaint_text,
            return_tensors="pt",
            max_length=max_input_length,
            truncation=True,
        )

    def generate_summary(
        self,
        max_length: int = 1200,
        min_length: int = 40,
        length_penalty: float = 2.0,
        no_repeat_ngram_size: int = 2,
        num_beams: int = 4,
        early_stopping: bool = True,
    ):
        """
        Generate a summary of the tokenized complaint and return it
        as a string
        """

        # Generate summary
        summary_ids = self.model.generate(
            self.inputs["input_ids"],
            max_length=max_length,
            min_length=min_length,
            length_penalty=length_penalty,
            no_repeat_ngram_size=no_repeat_ngram_size,
            num_beams=num_beams,
            early_stopping=early_stopping,
        )

        # Decode the summary
        summary = self.tokenizer.decode(summary_ids[0], skip_special_tokens=True)

        return summary

#### Using default function to test out some models

In [4]:
def generate_summary(complaint_text, model_name):
    """
    Generates a summary of a complaint given
    the complaint text
    """

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    # Tokenize the text

    if model_name == "facebook/bart-large-cnn":
        inputs = tokenizer(
            complaint_text, return_tensors="pt", max_length=1024, truncation=True
        )

    else:
        inputs = tokenizer(
            complaint_text, return_tensors="pt", max_length=2048, truncation=True
        )

    # Generate summary
    summary_ids = model.generate(
        inputs["input_ids"],
        max_length=1200,
        min_length=40,
        length_penalty=2.0,
        no_repeat_ngram_size=2,
        num_beams=4,
        early_stopping=True,
    )

    # Decode the summary
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

    return summary

#### NLP Task: Summarization

I will use the following models:

- https://huggingface.co/docs/transformers/main/en/model_doc/t5#transformers.T5ForConditionalGeneration
- https://huggingface.co/facebook/bart-large-cnn
- https://huggingface.co/google/flan-t5-large


In the cells below, I use the models to generate summaries for five random complaints. The quality of the summaries is moderate and varies from complaint to complaint.

I did the following to try to improve the quality of the summaries:

- Adjusted the `max_length` parameter to limit the length of the summary
- Adjusted the `min_length` parameter to ensure the summary is at least a certain length
- Adjusted the `num_beams` parameter to increase the number of beams used in beam search
- Adjusted the `no_repeat_ngram_size` parameter to avoid repeating n-grams in the summary

I also experimented with the tokenizer's `max_length`, which caps the number of input tokens passed to the model.
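
To make the effect of `no_repeat_ngram_size` more concrete, here is a small sketch in plain Python that lists the word n-grams occurring more than once in a piece of text. It uses whitespace tokens as a rough stand-in for the model's subword tokens (an assumption; during generation the constraint applies to token IDs), and `repeated_ngrams` is a hypothetical helper, not part of `transformers`. The example sentence is made up.

```python
from collections import Counter


def repeated_ngrams(text: str, n: int = 2) -> dict:
    """Return every word n-gram that occurs more than once in `text`.

    Whitespace tokens only approximate the model's subword tokens, so this
    is a rough check of what `no_repeat_ngram_size=n` would forbid.
    """
    words = text.lower().split()
    ngrams = [tuple(words[i : i + n]) for i in range(len(words) - n + 1)]
    counts = Counter(ngrams)
    return {ngram: count for ngram, count in counts.items() if count > 1}


# Made-up sentence with an obvious repetition
example = "she did not remember being slammed but she did not remember it"
print(repeated_ngrams(example, n=2))
```

An empty result for a generated summary suggests the setting is doing its job at that n-gram size; a non-empty one can still happen here because words and subword tokens do not line up exactly.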

In [5]:
model_name_falcon = "Falconsai/text_summarization"
model_name_bart = "facebook/bart-large-cnn"
model_name_flan = "google/flan-t5-large"

text_parser = TextParser(PATH, nlp_task="summarization")

# Get a random sample of 5 complaints
complaints = os.listdir(PATH)
complaints = [complaint for complaint in complaints if complaint.endswith(".txt")]
sample_complaints = random.sample(complaints, 5)

Initializing parsers for summarization
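
Because `random.sample` draws a different set of complaints on every run, the comparison is not reproducible as written. One option, sketched here under the assumption that a fixed sample is desirable, is to draw from a seeded `random.Random` instance. The helper name `sample_files`, the seed value, and the file names are placeholders introduced for illustration.

```python
import random


def sample_files(filenames, k, seed=42):
    """Draw k file names reproducibly.

    Sorting first removes any dependence on the order in which the
    file names were listed.
    """
    rng = random.Random(seed)  # local generator, leaves global state untouched
    return rng.sample(sorted(filenames), k)


# Placeholder names standing in for the real complaint files
files = [f"complaint_{i}.txt" for i in range(20)]
print(sample_files(files, k=5))
```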


### Let us compare the three models to see which one is the best for our task.

We will generate the summaries for 5 random complaints and compare the results.

In [6]:
for complaint in sample_complaints:
    complaint_text = text_parser.file_to_string(complaint)
    print(f"Model: {model_name_falcon}")
    summary = generate_summary(complaint_text, model_name_falcon)
    print(f"Complaint: {complaint}")
    print("=====================================")
    print(f"Summary: {summary}\n")

    print(f"Model: {model_name_bart}")
    summary = generate_summary(complaint_text, model_name_bart)
    print(f"Summary: {summary}\n")

    print(f"Model: {model_name_flan}")
    summary = generate_summary(complaint_text, model_name_flan)
    print(f"Summary: {summary}\n")

Model: Falconsai/text_summarization
Complaint: 2015-1073745.txt
Summary: officer a, #xxxx:1.took complainant’s coat and refused to return it, in violation of rule 2;2. ipra closed the case.complainant said that she didn’t remember being slammed to the floor, but she did not remember it.

Model: facebook/bart-large-cnn
Summary: complainant alleged that on february 10, 2015, officer a “slammed” her on the floor andscratched her hand during a domestic argument. when asked to provide an affidavit, complainantdeclined to cooperate with ipra’s investigation. given that there was insufficient evidence to justify arequest for a affidavit override, ipRA closed the case.

Model: google/flan-t5-large
Summary: introductioncomplainant alleged that on february 10, 2015, officer a “slammed” her on the floor andscratched her hand duringa domestic argument. when asked to provide an affidavit, complainantdeclined to cooperate with ipra’s investigation. given that there was insufficient evidence to justi

We can see that the best-performing models are Facebook's BART Large and the Falconsai T5 model, so we will use these models to generate the summaries for the entire dataset. The hyperparameters used are:

```python
max_length=1200,
min_length=40,
length_penalty=2.0,
no_repeat_ngram_size=2,
num_beams=4,
early_stopping=True,
```
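
Since these settings are repeated in several cells, they could be kept in one dictionary and unpacked into the call. The sketch below only shows the pattern: `fake_generate` is a stand-in so the snippet runs without downloading a model, and `GENERATION_KWARGS` is a name introduced here, not something defined elsewhere in the notebook.

```python
# Shared generation settings (values copied from the list above)
GENERATION_KWARGS = {
    "max_length": 1200,
    "min_length": 40,
    "length_penalty": 2.0,
    "no_repeat_ngram_size": 2,
    "num_beams": 4,
    "early_stopping": True,
}


def fake_generate(input_ids, **kwargs):
    """Stand-in for `model.generate`; it just returns the settings it received."""
    return kwargs


# Override a single setting without retyping the rest
settings = fake_generate([0, 1, 2], **{**GENERATION_KWARGS, "num_beams": 5})
print(settings["num_beams"], settings["max_length"])
```

With a real model, the corresponding call would look like `model.generate(inputs["input_ids"], **GENERATION_KWARGS)`.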

### Grid Search for Facebook BART model

In [18]:
# We will now perform the same task using the Summarizer class, changing some of the parameters
no_repeat_ngram_sizes = [2, 3, 4]
num_beams = [3, 4, 5]

sample_complaints = random.sample(complaints, 1)

for complaint in sample_complaints:
    complaint_text = text_parser.file_to_string(complaint)
    print(f"Complaint: {complaint}")
    print("=====================================")
    for no_repeat_ngram_size in no_repeat_ngram_sizes:
        print(
            f" ============== No Repeat Ngram Size: {no_repeat_ngram_size} ============"
        )
        for num_beam in num_beams:
            print(f"Num Beams: {num_beam}")
            summary = Summarizer(model_name_bart, complaint_text).generate_summary(
                max_length=1200,
                min_length=40,
                length_penalty=2,
                no_repeat_ngram_size=no_repeat_ngram_size,
                num_beams=num_beam,
            )
            print(f"Summary: {summary}\n")

Complaint: 2019-0002625.txt
Num Beams: 3
Summary: The incident occurred on July 11, 2019 at approximately 10:20 p.m. in the home of officer charles sykes. The wife called 911 for police assistance and reported her husband had attacked her and was under the influence of alcohol. copa obtained an affidavit override from the chicago police department to proceed with this investigation.

Num Beams: 4
Summary: Officer charles sykes and sergeant dennis graber engaged in a verbal altercation that turned physical in that she was pushed to the ground. copa obtained an affidavit override from the chicago police department in order to proceed with this investigation. It is alleged that the accused was intoxicated while off-duty in violation of rule 15.

Num Beams: 5
Summary: Officer charles sykes and sergeant dennis graber engaged in a verbal altercation that turned physical in that she was pushed to the ground. copa obtained an affidavit override from the chicago police department in order to pr

### Grid Search for Falcon T5 model

In [19]:
for complaint in sample_complaints:
    complaint_text = text_parser.file_to_string(complaint)
    print(f"Complaint: {complaint}")
    print("=====================================")
    for no_repeat_ngram_size in no_repeat_ngram_sizes:
        print(
            f" ============== No Repeat Ngram Size: {no_repeat_ngram_size} ============"
        )
        for num_beam in num_beams:
            print(f"Num Beams: {num_beam}")
            summary = Summarizer(model_name_falcon, complaint_text).generate_summary(
                max_length=1200,
                min_length=40,
                length_penalty=2,
                no_repeat_ngram_size=no_repeat_ngram_size,
                num_beams=num_beam,
            )
            print(f"Summary: {summary}\n")

Complaint: 2019-0002625.txt
Num Beams: 3
Summary: officer unit of assignment 116, dob: 1976 gender: male, race: black, black 1 ii. allegations officer allegation finding / recommendation charles sykes sergeant dennis graber 1. it is alleged that on or about july 11, 2019 at or near the location of chicago, il, the accused was intoxicated and allowed him to operate a motor vehicle in violation of rule 3 & 6. 6. e.g. he failed to protect/preser

Num Beams: 4
Summary: officer unit of assignment 116, dob: 1976 gender: male, race: black, black 1 ii. allegations officer allegation finding / recommendation charles sykes sergeant dennis graber 1. it is alleged that on or about july 11, 2019 at or near the location of chicago, il, the accused was intoxicated and allowed him to operate a motor vehicle in violation of rule 3 & 6. 4. he failed to provide adequate police service in his interaction with involved

Num Beams: 5
Summary: officer unit of assignment 116, dob: 1976 gender: male, race: bla
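
Both grid searches construct a new `Summarizer` for every parameter combination, which reloads the model and tokenizer each time. A possible alternative, sketched below, builds the summarizer once and iterates over the grid with `itertools.product`. `StubSummarizer` is a stand-in that mimics the `Summarizer` interface so the snippet runs offline, and `grid_search` is a hypothetical helper.

```python
from itertools import product


class StubSummarizer:
    """Stand-in for the notebook's `Summarizer` so the sketch runs offline."""

    def __init__(self, model_name, complaint_text):
        self.model_name = model_name
        self.complaint_text = complaint_text

    def generate_summary(self, no_repeat_ngram_size, num_beams, **kwargs):
        # A real summarizer would call model.generate here
        return f"summary(ngram={no_repeat_ngram_size}, beams={num_beams})"


def grid_search(summarizer, no_repeat_ngram_sizes, num_beams):
    """Run every (no_repeat_ngram_size, num_beams) pair on one summarizer."""
    results = {}
    for ngram_size, beams in product(no_repeat_ngram_sizes, num_beams):
        results[(ngram_size, beams)] = summarizer.generate_summary(
            max_length=1200,
            min_length=40,
            length_penalty=2.0,
            no_repeat_ngram_size=ngram_size,
            num_beams=beams,
        )
    return results


# The summarizer is built once, outside the loop
summarizer = StubSummarizer("facebook/bart-large-cnn", "complaint text goes here")
results = grid_search(summarizer, [2, 3, 4], [3, 4, 5])
print(len(results))
```

Keeping the results in a dictionary keyed by the parameter pair also makes it easier to compare summaries side by side than reading them from printed output.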

### Next Steps

- Fine-tune the models to improve the quality of the summaries
- Possibly cross-reference the summaries with the topic modeling results to check whether the summaries are consistent with the topics

In [20]:
# Random list of 25 complaints to summarize by hand
complaints_to_summarize = random.sample(complaints, 25)

# Create a dataframe to store the results, with an empty column
# for the hand-written summaries
manual_summaries = pd.DataFrame(
    {"complaint": complaints_to_summarize, "manual_summary": ""}
)

# Save the dataframe

# manual_summaries.to_csv(f"{DATA_PATH}/complaints_to_summarize.csv", index=False)
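
Once the manual summaries are written, they could serve as references for scoring the model output. The notebook does not say which metric will be used, so the following is only a sketch of a unigram-overlap score in the spirit of ROUGE-1, written in plain Python. `unigram_f1` is a hypothetical helper and the two strings are invented examples, not real summaries from the dataset.

```python
from collections import Counter


def unigram_f1(reference: str, candidate: str) -> float:
    """F1 score over word overlap between a reference and a candidate summary."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Invented example pair
manual = "complainant declined to sign an affidavit and the case was closed"
generated = "the case was closed because complainant declined to cooperate"
print(round(unigram_f1(manual, generated), 3))
```

A word-overlap score like this rewards copying phrases from the source, so it is best read alongside a manual check of the summaries rather than on its own.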