## Exploratory Analysis for Police Summary Reports

### Packages

In [2]:
# !pip install transformers
# !pip install nltk

In [1]:
import os
import random
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Local imports
from text_parser import TextParser

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\fdmol\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\fdmol\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### Parameters

In [2]:
PATH = "C:/Users/fdmol/Desktop/MSCAPP/CAPP30255/NLP-Police-Complaints/data/text_files"

### Functions

The `TextParser` class (imported from `text_parser`) reads and processes the data. I borrowed some ideas from Matt's analysis to remove headers and other elements that are not relevant to us. Removing this noise may also improve the results of the summarization task.

The function below wraps a Hugging Face tokenizer and model to generate a summary for each complaint. As described in the summarization section below, I tuned the generation parameters to get better summaries.

In [3]:
def generate_summary(complaint_text, model_name):
    """
    Generates a summary of a complaint given
    the complaint text
    """

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    # Tokenize the text, truncating inputs that exceed the token limit
    # (1024 tokens for BART, 2048 for the other models)
    max_input_length = 1024 if model_name == "facebook/bart-large-cnn" else 2048
    inputs = tokenizer(
        complaint_text,
        return_tensors="pt",
        max_length=max_input_length,
        truncation=True,
    )

    # Generate summary
    summary_ids = model.generate(
        inputs["input_ids"],
        max_length=1200,
        min_length=40,
        length_penalty=2.0,
        no_repeat_ngram_size=2,
        num_beams=4,
        early_stopping=True,
    )

    # Decode the summary (the caller is responsible for printing it)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

    return summary
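
`generate_summary` calls `from_pretrained` on every invocation, so the comparison loop reloads each model once per complaint. The following is a minimal sketch of one way to avoid that, assuming the models fit in memory at the same time. `load_summarizer` is a hypothetical helper that is not part of the original notebook; it relies only on `functools.lru_cache` and the same `from_pretrained` calls used above.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def load_summarizer(model_name):
    """Load the tokenizer and model for model_name once and reuse them.

    Hypothetical helper: repeated calls with the same model_name return
    the cached (tokenizer, model) pair instead of reloading from disk.
    """
    # Imported lazily so that defining the helper does not load the library
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    model.eval()  # inference only, disables dropout
    return tokenizer, model
```

Inside `generate_summary`, the two `from_pretrained` lines could then be replaced with `tokenizer, model = load_summarizer(model_name)`.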

### NLP Task: Summarization

I will use the following models:

- https://huggingface.co/docs/transformers/main/en/model_doc/t5#transformers.T5ForConditionalGeneration
- https://huggingface.co/facebook/bart-large-cnn
- https://huggingface.co/google/flan-t5-large


In the cells below, I use each model to generate a summary for five random complaints. Summary quality is moderate and varies from complaint to complaint.

I did the following to try to improve the quality of the summaries:

- Adjusted the `max_length` parameter to limit the length of the summary
- Adjusted the `min_length` parameter to ensure the summary is at least a certain length
- Adjusted the `num_beams` parameter to increase the number of beams used in beam search
- Adjusted the `no_repeat_ngram_size` parameter to avoid repeating n-grams in the summary

I also experimented with the tokenizer's `max_length`, which sets how many input tokens are kept before the text is truncated.
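
To make the effect of `no_repeat_ngram_size` concrete, here is a small sketch that finds repeated n-grams in a piece of text. It is an approximation for illustration only: the model applies the constraint to subword token IDs during generation, while this hypothetical `repeated_ngrams` helper works on whitespace-separated words after the fact.

```python
def repeated_ngrams(text, n=2):
    """Return the set of word n-grams that occur more than once in text."""
    words = text.lower().split()
    seen, repeated = set(), set()
    for i in range(len(words) - n + 1):
        gram = tuple(words[i:i + n])
        if gram in seen:
            repeated.add(gram)
        seen.add(gram)
    return repeated


print(repeated_ngrams("the officer said the officer left"))
# {('the', 'officer')}
```

With `no_repeat_ngram_size=2`, beam search rejects any continuation that would create a repeated token bigram like the one flagged here.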

In [4]:
model_name_falcon = "Falconsai/text_summarization"
model_name_bart = "facebook/bart-large-cnn"
model_name_flan = "google/flan-t5-large"

In [5]:
text_parser = TextParser(PATH, nlp_task="summarization")

# Get a random sample of 5 complaints
complaints = os.listdir(PATH)
complaints = [complaint for complaint in complaints if complaint.endswith(".txt")]
complaints = random.sample(complaints, 5)


for complaint in complaints:
    complaint_text = text_parser.file_to_string(complaint)
    print(f"Model: {model_name_falcon}")
    summary = generate_summary(complaint_text, model_name_falcon)
    print(f"Complaint: {complaint}")
    print("=====================================")
    print(f"Summary: {summary}\n")

    print(f"Model: {model_name_bart}")
    summary = generate_summary(complaint_text, model_name_bart)
    print(f"Summary: {summary}\n")

    print(f"Model: {model_name_flan}")
    summary = generate_summary(complaint_text, model_name_flan)
    print(f"Summary: {summary}\n")

Initializing parsers for summarization
Model: Falconsai/text_summarization
Complaint: 1085581.txt
Summary: officer a was on his way to work on june 14, 2017, at approximately 6:45 am, at xxxx east 103rd street chicago, il. subject 1 related she told officera she was speeding, and she had some trouble retrieving her insurance from hervisor so she could get back in the car.subject 1 recalled that sergeanta asked her to “step away” and said, “this never should have happened.” Captain said he recognized the officer, now

Model: facebook/bart-large-cnn
Summary: Officer a, star # xxxxx, allegedly slammed subject 1’s left foot in the door of her vehicle and approached her with his hand on his firearm. Officer a told captain a that “this was a real traffic stop” and ‘not to get involved. Captain a said to officer a "she was one of you, you don't have to do this"

Model: google/flan-t5-large
Summary: officer b and said, “you’re not going to get a ticket. you’ve been pulled over by an ass / jerk
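
The complaints above were drawn with `random.sample` without a seed, so rerunning the cell compares the models on different files. A minimal sketch of a reproducible draw follows, assuming the file names come from `os.listdir(PATH)`. The `sample_complaints` helper and the seed value are hypothetical choices, not part of the original notebook.

```python
import random


def sample_complaints(filenames, k=5, seed=42):
    """Return k .txt file names drawn reproducibly from filenames."""
    # Sort first: os.listdir returns names in arbitrary order, which
    # would otherwise change the draw even with a fixed seed
    txt_files = sorted(name for name in filenames if name.endswith(".txt"))
    rng = random.Random(seed)  # local generator, global random state untouched
    return rng.sample(txt_files, k)
```

Calling `sample_complaints(os.listdir(PATH))` would then return the same five files on every run, which makes model comparisons easier to repeat.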

Based on these samples, BART Large by Facebook (`facebook/bart-large-cnn`) produces the most coherent summaries, so we will use it to generate the summaries for the entire dataset. The hyperparameters used are:

```python
max_length=1200,
min_length=40,
length_penalty=2.0,
no_repeat_ngram_size=2,
num_beams=4,
early_stopping=True,
```
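
For the full-dataset run, it may help to keep these settings in one place rather than inline in the `generate` call. The sketch below is one possible arrangement, not the notebook's actual code; `GENERATION_KWARGS` and `generation_kwargs` are hypothetical names.

```python
# Shared generation settings, copied from the values listed above
GENERATION_KWARGS = {
    "max_length": 1200,
    "min_length": 40,
    "length_penalty": 2.0,
    "no_repeat_ngram_size": 2,
    "num_beams": 4,
    "early_stopping": True,
}


def generation_kwargs(**overrides):
    """Return a copy of the shared settings with any overrides applied."""
    return {**GENERATION_KWARGS, **overrides}
```

The settings could then be passed as `model.generate(inputs["input_ids"], **generation_kwargs())`, with a one-off change written as `generation_kwargs(num_beams=8)`.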

### Next Steps

- Fine-tune the models to improve the quality of the summaries
- Cross-reference the summaries with the topic modeling results to check whether the summaries are consistent with the topics
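
As a rough starting point for the cross-referencing step, one option is to measure how many of a topic's keywords appear in a summary. This is only a sketch under assumptions: the `topic_coverage` helper is hypothetical, and it presumes the topic model exposes each topic as a list of keywords, which the notebook does not show.

```python
import re


def topic_coverage(summary, topic_keywords):
    """Return the fraction of topic keywords that appear as words in summary."""
    summary_words = set(re.findall(r"\w+", summary.lower()))
    keywords = {kw.lower() for kw in topic_keywords}
    if not keywords:
        return 0.0
    return len(keywords & summary_words) / len(keywords)


# Illustrative input only, not real topic modeling output
print(topic_coverage(
    "The officer stopped the vehicle.",
    ["officer", "vehicle", "firearm", "search"],
))
# 0.5
```

Exact word matching ignores inflections such as "officers" or "searched", so lemmatizing both sides first would likely give a fairer score.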