## Exploratory Analysis for Police Summary Reports

In [21]:
pip install transformers


In [22]:
import os
import pandas as pd
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer




### Parameters

In [34]:
PATH = "C:/Users/fdmol/Desktop/MSCAPP/CAPP30255/NLP-Police-Complaints/data/text_files"
CHARS_TO_REMOVE = ["\n"]


## Testing Approaches

In [41]:
class TextParser:
    CHARS_TO_REMOVE = CHARS_TO_REMOVE

    def __init__(self, path):
        self.path = path

    def txt_to_list(self, filename):
        """
        Read a text file and return a list with one entry per line,
        where each entry is that line's list of whitespace-separated tokens.
        """

        file_path = os.path.join(self.path, filename)
        lines = []
        with open(file_path, "r", encoding="utf-8") as f:
            for line in f:
                line = line.strip().split()
                lines.append(line)

        return lines

    def file_to_string(self, filename):
        """
        Read a text file into a single string, deleting every character
        listed in CHARS_TO_REMOVE (newlines, by default) from each line.
        """
        text = ""
        file_path = os.path.join(self.path, filename)
        with open(file_path, "r", encoding="utf-8") as f:
            for line in f:
                for char in self.CHARS_TO_REMOVE:
                    line = line.replace(char, "")
                text += line

        return text
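
`TextParser` reads one file per call. To load every report in the folder at once, something like the following could work. It is a minimal sketch that uses only the standard library; `load_all_reports` is a hypothetical helper, not part of the notebook.

```python
from pathlib import Path


def load_all_reports(folder, chars_to_remove=("\n",)):
    """Hypothetical helper: map each .txt filename in `folder` to its cleaned text."""
    reports = {}
    # sorted() keeps the file order stable across runs
    for file_path in sorted(Path(folder).glob("*.txt")):
        text = file_path.read_text(encoding="utf-8")
        # Same cleaning rule as file_to_string: delete each listed character
        for char in chars_to_remove:
            text = text.replace(char, "")
        reports[file_path.name] = text
    return reports
```

The resulting dictionary can be passed to `pd.DataFrame(list(reports.items()), columns=["filename", "text"])` if a table is more convenient.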


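##### I will use this model:

`Falconsai/text_summarization`, a T5-based summarization checkpoint hosted on the Hugging Face Hub. The T5 model class is documented here:

https://huggingface.co/docs/transformers/main/en/model_doc/t5#transformers.T5ForConditionalGeneration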

In [42]:
model_name = "Falconsai/text_summarization"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)


In [51]:
text_parser = TextParser(PATH)

# Choose a complaint to summarize

complaint_text = text_parser.file_to_string("2019-0000389.txt")
complaint_text


'CIVILIAN OFFICE OF POLICE ACCOUNTABILITY  LOG # 2019 -0000389  1  SUMMARY REPORT OF INVESTIGATION  I. EXECUTIVE SUMMARY     Date of Incident:  March 12, 2019  Time of Incident:  2:58 PM  Location of Incident:  12300 S. Harvard Avenue (Alley)  Date of COPA Notification:  March 13, 2019  Time of COPA Notification:  11:12 AM   Officer  and Officer  conducted a traffic stop of  on March 12, 2019, during which Officer asked Ms.  why she failed to stop at a stop sign.  Ms.   answered that it was because she needed to use the restroom at her residence.  Officer  ran Ms.  name and released her without issuing a traffic citation.  Ms.  made allegations that Officer  ordered her to get out of her vehicle and that he fondled her breasts and vagina during a protec tive pat down.  Ms.  allegations are unfounded by the body worn camera video of the traffic stop.   II. INVOLVED PARTIES   Involved Officer #1:  ; # ; Employee  # ; Date of Appointment : , 2015 ; Police Officer ; ; DOB :  , 1992 ; male 
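
The extracted text contains runs of consecutive spaces left over from the source layout. Below is a minimal sketch for collapsing them before tokenization, assuming the extra whitespace carries no meaning. `normalize_whitespace` is a hypothetical helper, and it does not repair words that were split during extraction (such as `protec tive`).

```python
def normalize_whitespace(text):
    """Hypothetical helper: collapse any run of whitespace into a single space."""
    # str.split() with no arguments splits on any whitespace and drops empty pieces
    return " ".join(text.split())
```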

In [52]:
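# Tokenize the text, truncating anything beyond 2048 tokens.
# T5 checkpoints are typically trained on inputs of up to 512 tokens,
# so summary quality on longer inputs may degrade.
inputs = tokenizer(
    complaint_text, return_tensors="pt", max_length=2048, truncation=True
)

# Generate a summary of 40 to 1500 tokens using beam search
summary_ids = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    max_length=1500,
    min_length=40,
    length_penalty=2.0,
    num_beams=4,
    early_stopping=True,
)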

# Decode and print the summary
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(summary)
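
Reports that exceed the tokenizer's `max_length` are truncated by the tokenizer call above, so anything past the cutoff never reaches the summary. One common workaround is to split the text into pieces, summarize each, and join the results. The sketch below shows only the splitting step, by word count rather than by tokens. `chunk_by_words` and the default of 300 words are assumptions for illustration, not values tuned for this model.

```python
def chunk_by_words(text, max_words=300):
    """Hypothetical helper: split text into consecutive chunks of at most max_words words."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    # Step through the word list in strides of max_words
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]
```

Each chunk could then go through the same `tokenizer` and `model.generate` calls used above, with the partial summaries concatenated.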


In [50]:
complaint_text[:2048]

'CIVILIAN OFFICE OF POLICE ACCOUNTABILITY  LOG  #2019 -0000246   1 SUMMARY REPORT OF INVESTIGATION  I. EXECUTIVE SUMMARY     Date of Incident:  February 25, 2019  Time of Incident:  12:25 p.m.  Location of Incident:  4337 West Maypole Avenue  Date of COPA Notification:  March 20, 2019  Time of COPA Notification:  5:48 p.m.   On February 25, 2019, Officer  Eric Acevedo  (Officer Acevedo) , Officer Michael Donnelly  (Officer Donnelly)  and Officer Cody Maloney  (Officer Maloney)  attempted to stop  ( )  as he walked from his vehicle toward a house.  ran from the officers . Officer Acevedo and Officer Maloney apprehended  and escorted him back to where his vehicle was parked. Officers searched ’s vehicle and recovered narcotics.  was arrested and, after a physica l struggle, was placed into a police squad car. Officer Ronald Pendleton Jr. (Officer Pendleton) rode in the backseat of the squad car with  to the police station .     alleged the officers had no reason to stop him or search his