# Streamlit-Based Script Summarization App
- We will combine standard transformer architectures and pipelines from Hugging Face with the Streamlit library to build a text summarization app that can handle text of any length
- We will also try to tailor it to film and TV script formatting, which is very distinctive and rather awkward to parse

- To test the application we will use a selection of scripts that are freely available online; I have left an open link to a folder containing them
  - Breaking Bad Pilot
  - Single Drunk Female Pilot
  - Better Call Saul Season 1 Episode 6


In [None]:
!gdown --id gUlZLnaYrjvbfe-A7jCikxuMXddiYY3I

# Takeaways
### Pros
- We managed to clean the scripts really well using a brute-force approach
  - Regex removal of dialogue
  - Brute-force regex removal of all numbers and special characters
  - Reformatting the paragraph structure using a base transformer model
  - An option to enter the page number where the story content begins, so that the script's table of contents and other front matter are not summarized
- We also wrote a short function that builds paragraph chunks of a set number of tokens, up to a maximum of 500, which lets us experiment with summarizing different amounts of text at a time
  - Using the upper limit seemed to work best in this instance
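
The regex cleaning steps listed above can be tried in isolation on a toy excerpt. The sketch below reuses the two patterns from `parse_script` in `app.py`; the sample text and the `clean_script` name are made up for illustration, and the punctuation-restoration step is left out because it needs a model download.

```python
import re

# Hypothetical excerpt in screenplay layout: a scene heading, an action line,
# then a character cue followed by that character's dialogue.
SAMPLE = (
    "INT. DINER - DAY\n"
    "Mike sits alone at the counter, stirring his coffee.\n"
    "\n"
    "WAITRESS\n"
    "More coffee?\n"
    "\n"
    "He nods without looking up.\n"
)

# An all-caps word alone on a line (the character cue) plus the lines under it
DIALOGUE = re.compile(r"(?m)^\s*\b([A-Z]+)\b\s*\n(.*(?:\n.+)*)")


def clean_script(raw_text):
    """Illustrative helper: strip dialogue, symbols and scene descriptors."""
    text = DIALOGUE.sub(" ", raw_text)
    # Replace anything that is not a letter or apostrophe, and the INT/EXT markers
    text = re.sub(r"[^a-zA-Z']|INT|EXT", " ", text)
    # Lowercase and collapse runs of whitespace
    return " ".join(text.lower().split())


print(clean_script(SAMPLE))
```

Only the scene description survives: the `WAITRESS` cue and her line are dropped, while the action lines around them are kept.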

### Cons
- The summary quality is lacking in some respects, because the entire script is parsed in arbitrary chunks (which may start midway through a scene, or run from the end of one scene into the start of another), which is not ideal
- We will improve on this by researching general script structures, then counting out pages and sectioning the acts more appropriately; we can then summarize given parts of each scene or act and feed the model the scenes in a more targeted manner

# Next Steps 
### Research & Restructure Chunks
- As mentioned above, I will continue this project by studying general act and scene formatting across the 20-30, 60 and 120 minute script formats
- I will look for a way to consistently isolate the correct acts/scenes so that the model can be fed with a more targeted methodology
- In the event that there are no triggers for the sectioning, I will create a fallback based on the average page lengths for scenes
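
One possible shape for this, as a rough sketch rather than a finished design: treat lines starting with `INT` or `EXT` as scene triggers, and fall back to fixed-size page groups when no such line is found. The function name, the `pages` input (one string per page) and the default of two pages per scene are all assumptions made for illustration, not measured values.

```python
import re

# Scene headings conventionally start with INT or EXT at the beginning of a line
SCENE_HEADING = re.compile(r"(?m)^[ \t]*(?:INT|EXT)\b")


def split_into_scenes(pages, pages_per_scene=2):
    """Split a script (a list of page texts) into scene-sized pieces.

    pages_per_scene is a placeholder for the average scene length in pages.
    """
    text = "\n".join(pages)
    starts = [m.start() for m in SCENE_HEADING.finditer(text)]
    if starts:
        # Each scene runs from its heading to the next heading (or end of text).
        # Anything before the first heading, such as a title page, is dropped.
        bounds = starts + [len(text)]
        return [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    # Fallback: no headings found, so group a fixed number of pages per "scene"
    return [
        "\n".join(pages[i:i + pages_per_scene]).strip()
        for i in range(0, len(pages), pages_per_scene)
    ]


pages = [
    "INT. DINER - DAY\nMike sits.\n",
    "EXT. PARKING LOT - NIGHT\nA car idles.\n",
]
for scene in split_into_scenes(pages):
    print(repr(scene))
```

Each returned scene could then be cleaned and summarized on its own instead of being cut at an arbitrary word count.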

In [None]:
%%capture
!pip install -q streamlit
!pip install -q transformers
!pip install -q PyPDF2
!pip install -q docx2txt
!pip install -q deepmultilingualpunctuation
!pip install -q sentencepiece

In [None]:
# Import them
from PyPDF2 import PdfFileReader
import numpy as np
import seaborn as sns
import re
from deepmultilingualpunctuation import PunctuationModel


# Testing Space

In [None]:
# model = PunctuationModel()

# # load
# start_page = 3
# def read_pdf(file):
#   pdfReader = PdfFileReader(file)
#   count = pdfReader.numPages
#   all_page_text = ""
#   for i in range(count - start_page):
#     page = pdfReader.getPage(start_page+i)
#     all_page_text += page.extractText()
#   return all_page_text


# text = read_pdf("Better_Call_Saul_1x06_-_Five-O.pdf")
# # start_idx = re.search("INT|EXT", text).start()
# # text = text[start_idx:]

# # All Dialogue
# match = re.compile(r"(?m)^\s*\b([A-Z]+)\b\s*\n(.*(?:\n.+)*)")
# text = match.sub(r' ',text)

# # Special script preprocessing in brackets and scene setting
# text = re.sub("[^a-zA-Z']|INT|EXT"," ", text)

# # Standard preprocess (lower/remove newline/spaces)
# text = text.lower()

# # Remove the header text from the script
# title_text ="Better_Call_Saul_1x06_-_Five-O.pdf"
# raw_tokens = re.split("[^A-Za-z0-9-]", title_text.lower())
# title_tokens = [t for t in raw_tokens if t]
# text = re.sub("|".join(title_tokens), "", text)
# text = ' '.join(text.split())
# text = model.restore_punctuation(text)

# # Lets save it at this stage first
# with open('BCS_Clean.txt', 'w') as f_out:
#     f_out.write(text)

In [None]:
# text

In [None]:
# sentences = text.split(".")

# max_chunk = 350
# list_index = 0 
# chunks = []
# for sent in sentences:
#     if len(chunks) == list_index + 1: 
#         if len(chunks[list_index]) + len(sent.split(' ')) <= max_chunk:
#             chunks[list_index].extend(sent.split(' '))
#         else:
#             list_index += 1
#             chunks.append(sent.split(' '))
#     else:
#         chunks.append(sent.split(' '))

# for chunk_id in range(len(chunks)):
#     chunks[chunk_id] = ' '.join(chunks[chunk_id])


# Streamlit Summarization App

In [None]:
%%writefile app.py
import streamlit as st
from transformers import pipeline
import sentencepiece
import re
from PyPDF2 import PdfFileReader
import docx2txt
from deepmultilingualpunctuation import PunctuationModel


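@st.cache_resource
def load_puntuator():
    # Cached so the punctuation model is loaded once, not on every Streamlit rerun
    punc_model = PunctuationModel()
    return punc_model

@st.cache_resource
def load_summarizer(model):
    # device=0 places the pipeline on the first GPU
    model = pipeline("summarization", model=model, device=0)
    return model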

def parse_script(raw_input, title_text="Better_Call_Saul_1x06_-_Five-O.pdf"):
    # Remove dialogue: an all-caps character cue on its own line plus the lines below it
    match = re.compile(r"(?m)^\s*\b([A-Z]+)\b\s*\n(.*(?:\n.+)*)")
    raw_input = match.sub(' ', raw_input)
    # Replace every character that is not a letter or "'", and the INT/EXT scene descriptors
    raw_input = re.sub(r"[^a-zA-Z']|INT|EXT", " ", raw_input)
    # Lowercase the text
    raw_input = raw_input.lower()
    # Remove all or most of the page header text, using the words in the script title.
    # Whole-word matching keeps e.g. "called" intact when "call" is a title word.
    raw_tokens = re.split(r"[^A-Za-z0-9-]", title_text.lower())
    title_tokens = [re.escape(t) for t in raw_tokens if t]
    raw_input = re.sub(r"\b(?:" + "|".join(title_tokens) + r")\b", "", raw_input)
    # Split and join to collapse runs of whitespace
    processed_text = ' '.join(raw_input.split())
    # Restore punctuation (uses the module-level `punctuator` loaded after model selection)
    punctuated_text = punctuator.restore_punctuation(processed_text)
    # Split into sentences ahead of paragraphing
    sentences = punctuated_text.split(".")
    return sentences


def create_paragraphs(sentences, max_chunk=256):
    # Greedily pack whole sentences into chunks of at most max_chunk words
    chunks = []
    for sent in sentences:
        words = sent.split()
        if not words:
            # Skip empty fragments left over from splitting on "."
            continue
        if chunks and len(chunks[-1]) + len(words) <= max_chunk:
            chunks[-1].extend(words)
        else:
            chunks.append(words)
    return [' '.join(chunk) for chunk in chunks]


# Streamlit App Construction
st.title("Script Summarizer")

max = st.sidebar.slider('Maximum summary length per chunk (tokens)', 50, 250, step=10, value=150)
min = st.sidebar.slider('Minimum summary length per chunk (tokens)', 10, 100, step=10, value=50)

model_option = st.selectbox(
     'Select summarization model',
     ("",
      "deep-learning-analytics/wikihow-t5-small",
      "facebook/bart-large-cnn", 
      "google/pegasus-xsum", 
      "sshleifer/distilbart-cnn-6-6", 
      "t5-large", 
      "google/pegasus-large"))

# Compare by value: `is not ""` tests identity and raises a SyntaxWarning on Python 3.8+
if model_option != "":
    summarizer = load_summarizer(model_option)
    punctuator = load_puntuator()


uploaded_file = st.file_uploader(
                "Upload your PDF or DOCX file",
                type=["pdf", "docx"],
                accept_multiple_files=False
)

start_page = st.text_input('Enter the page number where the story begins', '')

file_button = st.button("Summarize File")

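# Reading Different File Types
# PdfFileReader, numPages, getPage and extractText were removed in PyPDF2 3.0
from PyPDF2 import PdfReader

def read_pdf(file):
    pdf_reader = PdfReader(file)
    # Treat an empty or non-numeric start page as 0 (read from the first page)
    first_page = int(start_page) if start_page.strip().isdigit() else 0
    all_page_text = ""
    # Skip the first `first_page` pages and extract text from the rest
    for i in range(first_page, len(pdf_reader.pages)):
        all_page_text += pdf_reader.pages[i].extract_text()
    return all_page_text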
 
if file_button:
    if uploaded_file is None:
        st.warning("Please upload a PDF or DOCX file first.")
    elif model_option == "":
        st.warning("Please select a summarization model first.")
    else:
        with st.spinner("Generating Summary.."):
            # Pick the reader from the file extension (the uploader only allows pdf/docx)
            if uploaded_file.name.lower().endswith('.docx'):
                raw_input = docx2txt.process(uploaded_file)
            else:
                raw_input = read_pdf(uploaded_file)

            if raw_input:
                sentences = parse_script(raw_input)
                punct_paragraphs = create_paragraphs(sentences)
                res = summarizer(punct_paragraphs,
                                 max_length=max,
                                 min_length=min)
                text = ' '.join([summ['summary_text'] for summ in res])
                st.write(text)

Overwriting app.py


In [None]:
!streamlit run app.py & npx localtunnel --port 8501

# Conclusion
- We lose a little summary context by removing the periods from the sentences, which gives some of the paragraphs an odd start or finish
- On the other hand, keeping them creates too much noise, as scripts are written in short, heavily punctuated sentences



# Further Development
- An algorithm trained to insert punctuation would be ideal here, to transform the raw structure of the cleaned scene descriptions into a story and from there parse it into chunks whose beginning and end are aligned with sentence boundaries
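
As a starting point for sentence-aligned chunks, here is a minimal sketch that assumes the text already carries usable punctuation. Unlike `create_paragraphs`, it keeps each sentence's closing punctuation and only breaks a chunk between sentences. The `chunk_by_sentence` name and the simple `.`/`!`/`?` splitting rule are illustrative assumptions; abbreviations and ellipses would need more careful handling.

```python
import re


def chunk_by_sentence(text, max_words=256):
    """Group whole sentences into chunks of roughly max_words words each."""
    # Split on whitespace that follows ., ! or ? so the terminator stays attached
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget
        if current and count + n_words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "Mike sits alone. He nods. A car idles outside."
for chunk in chunk_by_sentence(sample, max_words=5):
    print(chunk)
```

A single sentence longer than `max_words` still becomes its own chunk, so the limit is a target rather than a hard cap.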