# From Dialogue to Data

## Using Computational Text Analysis and Large Language Models for Dialogue-Rich Text Analysis



## Table of Contents

1. OCR: From Image to Digital Text
    (Sopranos Pilot, remove OCR layer)
2. XML: Adding structure to unstructured text
3. NLP / CTA - From Characters to Words: Processing Texts
4. XML Parsing
5. EDA: Exploratory Data Analysis of Texts: Extracting summary data
    (and Visualizations)
6. Asking Questions of Texts
    (Bechdel test, etc.)


## 1. OCR/HTR From Image to Digital Text

Optical Character Recognition (OCR) is the method used to convert an image of a printed text to a digital text. Handwritten Text Recognition (HTR) does the same for handwriting.

Options for OCR'ing images of printed text include:

1. Proprietary software with graphical user interfaces

    a. e.g. Abbyy OCR Editor

2. Open-source OCR software run using a programming language

    a. e.g. Tesseract with Python

3. OCR with LLMs that have vision capabilities (a.k.a. Vision-Language Models, or VLMs)

    a. e.g. GPT-4o (mini), Gemini Flash, etc.



### Exercise 1

Upload a screenshot image (not a PDF with a text layer) of a screenplay to chat.dartmouth.edu. Write a prompt asking the model (for example, GPT-4o mini) to transcribe the text for you. Review the results.

### Exercise 1b

Review the code below. It uses Tesseract to generate a transcript from an image of a text. Revise this code to OCR the same screenshot you used in Exercise 1.

Note: to run the code below, you need to install Tesseract itself, following the directions here: https://tesseract-ocr.github.io/tessdoc/Installation.html. The code also imports the Python packages `pytesseract`, `PIL` (Pillow), and `fitz` (PyMuPDF).

In [56]:
import pytesseract
from PIL import Image

def ocr_image_to_text(image_path):
    """
    Reads an image file and returns the OCR'd transcript of the printed text.
    
    Args:
    image_path (str): The file path to the image to be processed.

    Returns:
    str: The text extracted from the image.
    """
    # Load the image from the specified file path
    try:
        image = Image.open(image_path)
    except Exception as e:
        return f"Error opening image file: {e}"

    # Use pytesseract to do OCR on the image
    try:
        transcript = pytesseract.image_to_string(image)
    except Exception as e:
        return f"Error during OCR processing: {e}"

    return transcript

In [63]:
import fitz

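# Windows only: point pytesseract at the Tesseract executable (adjust or remove this line on other systems)
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"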

def transcribe_multipage_pdf(pdf_path):
    """
    Reads a PDF file and returns the OCR'd transcript of the printed text.

    Args:
    pdf_path (str): The file path to the PDF to be processed.

    Returns:
    str: The text extracted from the PDF.
    """

    # Initialize an empty string to store the complete transcript
    complete_transcript = ""

    # Open the PDF file
    try:
        document = fitz.open(pdf_path)
    except Exception as e:
        return f"Error opening PDF file: {e}"
    
    # Process each page in the PDF
    for page_number in range(len(document)):
        # Get the page
        page = document[page_number]
        
        # Render the page to an image
        pix = page.get_pixmap()
        
        # Convert the pixmap to a PIL image
        img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)

        # Use pytesseract to do OCR on the image
        try:
            transcript = pytesseract.image_to_string(img)
            complete_transcript += transcript + "\n"  # Append each page's text
        except Exception as e:
            return f"Error during OCR processing on page {page_number + 1}: {e}"

    return complete_transcript    

In [None]:
#transcript1 = transcribe_multipage_pdf("../../texts/The Sopranos 101 - Production - Goldenrod Revision - Aug 19 1997.pdf")
#print(len(transcript1))

#transcript1[3000:6000]

transcript2 = transcribe_multipage_pdf("../../texts/image-only_The Sopranos 101 - Production - Goldenrod Revision - Aug 19 1997.pdf")

In [69]:
print(len(transcript2))

transcript2[:3000]

67921


'ae sopranos <é-<{? GREEN (REV. 8/07/97)\nTHE SOPRANOS\n\nLOCATIONS\nINTERIOR EXTERIOR\nDR. MELFI’S WAITING ROOM ‘SOPRANO\nDR. MELAY’S OFFICE ‘SOPRANO\nSOPRANO HC KITCHEN SOPRANO ~ Poon\n‘SOPRANO FAMILY ROOM STREET/~ 0S/HMO *\nSOPRANO HOUSE PARKING LOT\nSOPRANO HOUSE PORK STORE\nSOPRANO HOUSE "\'$ ROOM (pocco\'s vEesuvro\n\n/ BLM PARKWAY/LIVIA\'S HOUSE\n\nBUCCO’S VESUVIO ~ KITC!\nLIVIA’ S HOUSE\n\nCHRIS’ CAR\n\nMRI CENTER - MRI MACHINE ROOM\nTHE PORK STORE\n\nTHE PORK STORE - BUTCHERING\nAIRPORT AREA TOPLESS BAR\nGREEN GROVE RETIREMENT Ci\n\nCONSTRUCTION SITE\nSTATE PARK\n\nPITCH “PUTT\nBARONE SANITATION\nMANHATTAN\n\nTL GRANAIO\n\nCITY PAROCHIAL SCHOOL,\n\nTL GRANAIO PLASKIC MOUND\nCABIN CRUISER - BELOW (AL SCH. BALL COURT *\nCABIN CRUISER - TOP\nAPARTMENT BEDROOM\n\nCATHEDRAL\nJUNIOR’ S LINCOLN\nSOPRANO HOUSE - GARAGE .\n\n‘THE SOPRANOS GOLDENROD (REV. 8/19/97)\n\n‘THE SOPRANOS\nFADE IN:\nINT. DR, MELFI\'S WAITING ROOM ~ DAY 1\n\nTHOMAS SOPRANO, 40, sits and waits. Uneasily. Staring\

## 2. XML: Adding structure to unstructured text

Traditionally, there were two main ways researchers added XML tags to unstructured texts:

1. Manually, by inserting the tags by hand, often with dedicated XML editors like OxygenXML.

2. Programmatically, often using *regular expressions*. Regular expressions, or regexes, allow you to identify different text components by searching for specific text patterns. For example, you could identify footnotes by searching for lines that begin with a number followed by a period and a space. (Checking that such a line falls near the bottom of a page would need extra logic beyond the regex itself.) The regex for this would look like this: `"^[0-9]+\.\s\w.*$"`, which can be translated as follows:

    + `^` = start of the line
    + `[0-9]+` matches any digit 0-9 appearing 1 or more times (`+`)
    + `\.` is a literal period (which needs to be escaped with a backslash because an unescaped period is a wildcard character in regex)
    + `\s` matches a single whitespace character
    + `\w` matches any letter, digit, or underscore
    + `.*` matches any character (`.`) appearing 0 or more times (`*`) until the end of the line (`$`).
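As a minimal sketch, the pattern above can be tried with Python's built-in `re` module. The sample page text is invented for illustration. The `re.MULTILINE` flag makes `^` and `$` match at the start and end of each line rather than only at the start and end of the whole string.

```python
import re

# Hypothetical page text: two body lines followed by two footnotes.
page = """The treaty was signed in 1648.
It ended the war. 3
1. See Smith, p. 12.
2. Ibid., p. 14."""

# Digits, a literal period, one whitespace character, one word character,
# then the rest of the line.
footnote_pattern = re.compile(r"^[0-9]+\.\s\w.*$", flags=re.MULTILINE)

footnotes = footnote_pattern.findall(page)
print(footnotes)
```

Only the last two lines match, because the body lines do not start with a number followed by a period.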

Now, we can use Generative AI tools to do this for us. 





### Exercise 2. 

Develop your own prompt or revise and improve the prompt below. Then upload your digitized script to chat.dartmouth.edu and run this prompt.

``` 
Here is the full transcript of __________________________
Without adding or changing a single word in this transcript, add xml following TEI guidelines for performance texts. Among other things, this TEI-encoded text should identify different elements of front matter (i.e. <title>, <prologue>, <set>, and <castList> tags), the body text (<div> tags identifying each new scene, <head> tags for the title of each scene, <sp> tags to identify dialogue by each speaker with <speaker> tags tagging each speaker’s name, <l> to identify each line break, <stage> tags to identify stage directions, etc.), and back matter (i.e. <epilogue>). The <sp> tags should identify the name of each speaker, which will help if a particular speaker is implied but not named.

```

### 2b. Working with XML text

In [38]:
# %pip install beautifulsoup4
# %pip install lxml

from bs4 import BeautifulSoup
from lxml import etree

In [39]:

with open("../../texts/sopranos-partial.xml", encoding='utf-8') as f:
    txt = f.read()

#

In [40]:
soup = BeautifulSoup(txt, features="xml")
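For readers without the notebook's XML file, here is a small self-contained sketch of the kind of structure the Exercise 2 prompt asks for. The fragment is hand-written for illustration, and the `type="scene"` attribute is an assumption about how scenes might be marked. The sketch uses the standard library's `xml.etree.ElementTree` instead of BeautifulSoup, purely so that it runs without extra installs.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment in the style of TEI performance-text markup (illustrative only).
sample = """<body>
  <div type="scene">
    <head>INT. DR. MELFI'S WAITING ROOM - DAY</head>
    <stage>Inner door opens.</stage>
    <sp>
      <speaker>MELFI</speaker>
      <l>Mr. Soprano?</l>
    </sp>
  </div>
</body>"""

root = ET.fromstring(sample)

# Walk every <sp> element and pair the speaker with the spoken line.
pairs = [(sp.findtext("speaker"), sp.findtext("l")) for sp in root.iter("sp")]

# Collect the heading of every scene.
scene_heads = [head.text for head in root.iter("head")]

print(scene_heads)
print(pairs)
```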

## 3. NLP / CTA - From Characters to Words: Processing Texts

In [41]:
import re
body = soup.body
bodytext = body.text


bodytext = re.sub(r"\s{2,}", " ", bodytext)

bodytext

" INT. DR. MELFI'S WAITING ROOM - DAY\nTHOMAS SOPRANO, 40, sits and waits. Uneasily. Staring confusedly at a vaguely erotic Klimpt reproduction. Inner door opens. DR. JENNIFER MELFI (attractive, 35) appears. MELFI\nMr. Soprano? INT. DR. MELFI'S OFFICE - DAY\nMelfi gestures Tommy to a choice of seating. MELFI\nHave a seat. She seats herself in a facing armchair. She looks at him with a polite, expectant gaze. He stares back, waiting. There is utter silence. Nothing happens. Such is psychotherapy. Finally— MELFI\nMy understanding from your family physician, Dr. Cusamano, is you collapsed? Were unable to breathe? Possibly a panic attack? TOMMY\nThey said it was a panic attack — because all the neurological work and blood came back negative. They sent me here. MELFI\nYou don't agree you had a panic attack? He laughs — too loud. MELFI\nHow are you feeling now? TOMMY\nNow? Fine. I'm back at work. MELFI\nWhat line of work are you in? TOMMY\nWaste management consultant. She keeps that psychiat

### 3a. Extracting some basic data from text

In notebook 02, we created a function `get_text_data` that uses the **spaCy library** to extract some basic information from a text, including:

+ number of tokens (as counted using two different methods)
+ a list of geopolitical entities (GPEs)
+ a list of person names
+ a filtered list of tokens (words and punctuation, "filtered" by removing common "stopwords" like "the", "and", "this", and "that")

The function is also copied in a standalone Python file (.py) named "text_functions.py". We can import the `get_text_data` function from this file:

In [42]:
from text_functions import get_text_data
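The contents of `text_functions.py` are not reproduced in this notebook. As a rough illustration of two of the ideas in the list above (counting tokens in two ways, and filtering stop words), here is a standard-library sketch. It is not the `get_text_data` function: the regex tokenizer and the tiny stop-word set are stand-ins for spaCy's tokenizer and stop-word list, and the name `rough_text_data` is hypothetical.

```python
import re

# Tiny illustrative stop-word set; spaCy's built-in list is much longer.
STOPWORDS = {"the", "and", "a", "this", "that", "to", "of", "is", "you"}

def rough_text_data(text):
    """Return (split-based token count, regex-based token count, filtered tokens)."""
    # Method 1: split on whitespace, so punctuation stays attached to words.
    split_tokens = text.split()
    # Method 2: separate runs of word characters from individual punctuation marks.
    regex_tokens = re.findall(r"\w+|[^\w\s]", text)
    # Drop stop words, compared case-insensitively.
    filtered = [tok for tok in regex_tokens if tok.lower() not in STOPWORDS]
    return len(split_tokens), len(regex_tokens), filtered

n_split, n_regex, filtered = rough_text_data("Have a seat. What line of work are you in?")
print(n_split, n_regex, filtered)
```

The two counts differ because the second method treats each punctuation mark as its own token, which is closer to what a dedicated tokenizer does.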

In [None]:
# The get_text_data function returns the following data:
# * numtokens_split = number of tokens in the text, counted using the split() method
# * numtokens_spacy = number of tokens, counted using the more advanced tokenization of the spaCy library
# * gpelist = list of geopolitical entities (GPEs) in the text (e.g. countries, major cities)
# * personlist = list of people named in the text
# * filtered_tokens = list of tokens after removing stop words (common words that carry little meaning on their own, e.g. "and", "the", "a", "this", "that", "there")

textdata = get_text_data(bodytext)
# Index 1 is numtokens_spacy
textdata[1]

2360

In [44]:
gpes = textdata[2]

from collections import Counter

Counter(gpes)

Counter({'Aspen': 2,
         'New Jersey': 1,
         'Canada': 1,
         'Carmela': 1,
         "Jersey Shark's": 1,
         'US': 1,
         'Tommy': 1})

In [45]:
people = textdata[3]

Counter(people)

Counter({'Tommy': 15,
         'Chris': 11,
         'V.O.': 8,
         'Mahaffey': 6,
         'Soprano': 2,
         'Cusamano': 2,
         'Meadow': 2,
         "Tommy Jr.'s": 2,
         'SOPRANO BACKYARD - POOL - DAY': 2,
         'CHRIS': 2,
         'Klimpt': 1,
         'JENNIFER MELFI': 1,
         'SOPRANO HOUSE - BEDROOM - DAWN': 1,
         'SOPRANO BACKYARD - DAY': 1,
         'Tommy beams': 1,
         'Bon Jovi': 1,
         'Skeet Ulrich': 1,
         'Omigod': 1,
         'Tommy Jr.': 1,
         'Tommy Jr. CARMELA\n': 1,
         'Christopher': 1,
         'CHRIS MOLTISANTI': 1,
         'Howard Stern': 1,
         'TOMMY': 1})

In [46]:
filt_toks = textdata[4]
Counter(filt_toks)

Counter({'.': 201,
         '\n': 95,
         ',': 74,
         '-': 45,
         'TOMMY': 39,
         '?': 33,
         'MELFI': 24,
         '(': 23,
         ')': 23,
         'Tommy': 22,
         '—': 22,
         'DAY': 17,
         'SOPRANO': 13,
         'INT': 11,
         "'": 11,
         'Chris': 11,
         'V.O.': 10,
         '!': 10,
         'Mahaffey': 10,
         '...': 9,
         'EXT': 8,
         'work': 7,
         'know': 7,
         'HOUSE': 7,
         'CARMELA': 7,
         'DR': 6,
         'HUNTER': 6,
         'CHRIS': 6,
         'morning': 5,
         'pool': 5,
         'MEADOW': 5,
         'going': 5,
         'Lexus': 5,
         'door': 4,
         'OFFICE': 4,
         'ground': 4,
         'like': 4,
         'ducks': 4,
         'takes': 4,
         'KITCHEN': 4,
         'Meadow': 4,
         'home': 4,
         'Jr.': 4,
         'saying': 4,
         '/': 4,
         'MAHAFFEY': 4,
         'stares': 3,
         'silence': 3,
         'fa

### 3b. Other Preprocessing: Lemmatization

In [47]:
import spacy

# Load spaCy's small English pipeline
nlp = spacy.load("en_core_web_sm")

doc = nlp(bodytext)

# Collect the lemma (dictionary form) of every token
lemmas = [token.lemma_ for token in doc]

Counter(lemmas)


Counter({'.': 201,
         '\n': 95,
         ',': 74,
         'the': 72,
         'be': 67,
         '-': 45,
         'I': 44,
         'a': 41,
         'TOMMY': 38,
         'he': 36,
         'to': 34,
         '?': 33,
         'in': 32,
         'you': 32,
         'MELFI': 24,
         'and': 23,
         '(': 23,
         ')': 23,
         '—': 22,
         'Tommy': 20,
         'get': 18,
         "'s": 17,
         'DAY': 16,
         'have': 15,
         'his': 15,
         'not': 14,
         'of': 13,
         'with': 13,
         'it': 13,
         'do': 13,
         'SOPRANO': 12,
         'at': 12,
         'my': 12,
         'for': 12,
         'INT': 11,
         'that': 11,
         'go': 11,
         "'": 11,
         'Chris': 11,
         'they': 10,
         'V.O.': 10,
         'this': 10,
         '!': 10,
         'Mahaffey': 10,
         'she': 9,
         'back': 9,
         'from': 9,
         '...': 9,
         'up': 9,
         'your': 8,
         'what

### 3c. Word Embeddings

## 4. XML Parsing

In [48]:
speeches = body.find_all("sp")
print(len(speeches))
speeches[2]

77


<sp>
<speaker>MELFI</speaker>
<l>My understanding from your family physician, Dr. Cusamano, is you collapsed? Were unable to breathe? Possibly a panic attack?</l>
</sp>

In [49]:
# Collect the speaker name from each <sp> element
speakerlist = [speech.speaker.text for speech in speeches]
Counter(speakerlist)

Counter({'TOMMY': 29,
         'MELFI': 16,
         'TOMMY (V.O.)': 7,
         'CARMELA': 6,
         'HUNTER': 5,
         'MEADOW': 4,
         'CHRIS': 4,
         'MAHAFFEY': 3,
         'MELFI (V.O.)': 2,
         'KIDS': 1})

In [50]:
import copy

speechlist2 = []
for speech in speeches:
    # Work on a copy so the original soup keeps its <speaker> tags
    speechcopy = copy.copy(speech)
    # Remove the <speaker> tag, leaving only the spoken text
    for speaker_tag in speechcopy.find_all("speaker"):
        speaker_tag.decompose()
    speechlist2.append(speechcopy)

# Extract the remaining text of each speech
speechlist2 = [speech.text.strip() for speech in speechlist2]
speechlist2

['Mr. Soprano?',
 'Have a seat.',
 'My understanding from your family physician, Dr. Cusamano, is you collapsed? Were unable to breathe? Possibly a panic attack?',
 'They said it was a panic attack — because all the neurological work and blood came back negative. They sent me here.',
 "You don't agree you had a panic attack?",
 'How are you feeling now?',
 "Now? Fine. I'm back at work.",
 'What line of work are you in?',
 'Waste management consultant.',
 "Look...it's impossible for me to talk to a psychiatrist.",
 'Any thoughts at all on why you blacked out?',
 "I don't know. Stress/ maybe?",
 'Stress? About what?',
 "That morning of the day I got sick? I'd been thinking: it's good to be in a thing from the ground floor. I came too late for that, I know. But lately I'm getting the feeling I might be in at the end. That the best is over.",
 'Many Americans, I think, feel this.',
 'Take my father. He never reached the heights like me. But in ways he had it better. He had his people — the

In [51]:
import pandas as pd

df = pd.DataFrame({"speaker": speakerlist, "speech": speechlist2})

df

|  | speaker | speech |
|---|---|---|
| 0 | MELFI | Mr. Soprano? |
| 1 | MELFI | Have a seat. |
| 2 | MELFI | My understanding from your family physician, D... |
| 3 | TOMMY | They said it was a panic attack — because all ... |
| 4 | MELFI | You don't agree you had a panic attack? |
| ... | ... | ... |
| 72 | TOMMY | I'll give you a fuckin' bone. Where's my money? |
| 73 | MAHAFFEY | I'll get the money! |
| 74 | TOMMY | I know you'll get the — (sees Chris fussing ov... |
| 75 | TOMMY | The fuck you doing? Get over here. |


In [52]:
df["speaker"].value_counts()

speaker
TOMMY           29
MELFI           16
TOMMY (V.O.)     7
CARMELA          6
HUNTER           5
MEADOW           4
CHRIS            4
MAHAFFEY         3
MELFI (V.O.)     2
KIDS             1
Name: count, dtype: int64

Compare this with the GPT-4o Mini response:

```

Here is the frequency list of the number of times each character spoke, derived from the screenplay: 

    TOMMY: 20
    MELFI: 10
    CARMELA: 7
    CHRIS: 6
    HUNTER: 4
    MAHAFFEY: 4
    MEADOW: 4
    BIG PUSSY: 3
    DAD (TOMMY SR.): 3
    ARTHUR: 2
    LIVIA: 2
    FATHER PHIL: 2
    SILVIO: 2
    TOMMY JR.: 2
    DICK BARONE: 1
    CHARMAINE: 1
    HERMAN: 1
     

This list counts each individual line spoken by the characters in the screenplay as a separate instance. 
```
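One way to make the comparison concrete is to line the two sets of counts up side by side. The sketch below copies the counts reported in the two outputs above into plain dictionaries (the variable names are illustrative) and computes, for each speaker, the Python count minus the GPT-4o Mini count, along with the speakers that appear in only one of the two lists.

```python
# Counts copied from the pandas output above.
python_counts = {"TOMMY": 29, "MELFI": 16, "TOMMY (V.O.)": 7, "CARMELA": 6,
                 "HUNTER": 5, "MEADOW": 4, "CHRIS": 4, "MAHAFFEY": 3,
                 "MELFI (V.O.)": 2, "KIDS": 1}

# Counts copied from the GPT-4o Mini response above.
gpt_counts = {"TOMMY": 20, "MELFI": 10, "CARMELA": 7, "CHRIS": 6, "HUNTER": 4,
              "MAHAFFEY": 4, "MEADOW": 4, "BIG PUSSY": 3, "DAD (TOMMY SR.)": 3,
              "ARTHUR": 2, "LIVIA": 2, "FATHER PHIL": 2, "SILVIO": 2,
              "TOMMY JR.": 2, "DICK BARONE": 1, "CHARMAINE": 1, "HERMAN": 1}

# Difference per speaker (a missing speaker counts as 0).
all_speakers = sorted(set(python_counts) | set(gpt_counts))
difference = {s: python_counts.get(s, 0) - gpt_counts.get(s, 0) for s in all_speakers}

# Speakers that appear in only one of the two lists.
only_python = sorted(set(python_counts) - set(gpt_counts))
only_gpt = sorted(set(gpt_counts) - set(python_counts))

for speaker in all_speakers:
    print(f"{speaker:16} {difference[speaker]:+d}")
print("Only in Python counts:", only_python)
print("Only in GPT counts:", only_gpt)
```

A positive difference means the parsed XML recorded more speeches for that speaker than the model reported.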


## Most Frequent Words (by Speaker)

### Python + spaCy vs. GPT-4o Mini

First, group the speeches by speaker.

In [53]:
# Group by 'speaker' and concatenate speeches
grouped_df = df.groupby('speaker')['speech'].agg(lambda x: ' '.join(x)).reset_index()

# Rename the 'speech' column if desired
grouped_df.rename(columns={'speech': 'combined_speech'}, inplace=True)

grouped_df

|  | speaker | combined_speech |
|---|---|---|
| 0 | CARMELA | Have something more than just cran- apple juic... |
| 1 | CHRIS | I got home too late last night. I didn't want ... |
| 2 | HUNTER | (staring out window) Meadow, your father with ... |
| 3 | KIDS | (bored, humoring) Nice, dad. National Geograph... |
| 4 | MAHAFFEY | Security! My leg! It's broken! Oh fuck, oh fuc... |
| 5 | MEADOW | Get out of here with that fat. Wait — like Ita... |
| 6 | MELFI | Mr. Soprano? Have a seat. My understanding fro... |
| 7 | MELFI (V.O.) | Many Americans, I think, feel this. Did you ha... |
| 8 | TOMMY | They said it was a panic attack — because all ... |
| 9 | TOMMY (V.O.) | That morning of the day I got sick? I'd been t... |


In [54]:
# Function to lemmatize a text and drop stop words
def lemmatize_text(text):
    lemmatized_words = []

    # Process the text with spaCy
    doc = nlp(text)

    # Keep the lowercased lemma of each token that is alphabetic
    # (which excludes punctuation and numbers) and is not a stop word
    for token in doc:
        if not token.is_stop and token.is_alpha:
            lemmatized_words.append(token.lemma_.lower())

    return lemmatized_words

# Function to find the top N frequent words
def top_n_frequent_words(word_list, n=5):
    # Count the frequency of each word
    word_counts = Counter(word_list)
    
    # Get the N most common words
    return word_counts.most_common(n)

In [55]:
grouped_df["lemmas"] = grouped_df["combined_speech"].apply(lemmatize_text)
grouped_df["top_words"] = grouped_df["lemmas"].apply(top_n_frequent_words, n=5)
grouped_df


|  | speaker | combined_speech | lemmas | top_words |
|---|---|---|---|---|
| 0 | CARMELA | Have something more than just cran- apple juic... | [apple, juice, lady, need, brain, food, school... | [(school, 2), (apple, 1), (juice, 1), (lady, 1... |
| 1 | CHRIS | I got home too late last night. I didn't want ... | [get, home, late, night, want, wake, man, naus... | [(boo, 2), (get, 1), (home, 1), (late, 1), (ni... |
| 2 | HUNTER | (staring out window) Meadow, your father with ... | [stare, window, meadow, father, duck, male, fe... | [(duck, 2), (aspen, 2), (stare, 1), (window, 1... |
| 3 | KIDS | (bored, humoring) Nice, dad. National Geograph... | [bored, humor, nice, dad, national, geographic... | [(bored, 1), (humor, 1), (nice, 1), (dad, 1), ... |
| 4 | MAHAFFEY | Security! My leg! It's broken! Oh fuck, oh fuc... | [security, leg, break, oh, fuck, oh, fuck, bon... | [(oh, 2), (fuck, 2), (security, 1), (leg, 1), ... |
| 5 | MEADOW | Get out of here with that fat. Wait — like Ita... | [fat, wait, like, italian, pastry, brain, food... | [(fat, 1), (wait, 1), (like, 1), (italian, 1),... |
| 6 | MELFI | Mr. Soprano? Have a seat. My understanding fro... | [soprano, seat, understanding, family, physici... | [(say, 4), (tell, 3), (family, 2), (physician,... |
| 7 | MELFI (V.O.) | Many Americans, I think, feel this. Did you ha... | [americans, think, feel, feeling, loss, acutel... | [(americans, 1), (think, 1), (feel, 1), (feeli... |
| 8 | TOMMY | They said it was a panic attack — because all ... | [say, panic, attack, neurological, work, blood... | [(work, 4), (know, 3), (beat, 3), (fuckin, 3),... |
| 9 | TOMMY (V.O.) | That morning of the day I got sick? I'd been t... | [morning, day, get, sick, think, good, thing, ... | [(get, 2), (good, 2), (know, 2), (daughter, 2)... |


Top 5 word frequency list as generated by GPT-4o Mini:

*Prompt:*
```
Create a frequency list of words to aid the analysis of this screenplay. First, extract only the spoken text, indicated by the <l> tag within each <sp> tag. Second, from this spoken text, create a frequency list of the top 5 words spoken by each speaker, but only after removing all stop words.
```

*Response:*


GPT-4o Mini (Cloud, Paid, $) 2024-07-18
 

Here's the frequency list of the top 5 words spoken by each character, after removing stop words: 

**TOMMY** 

    money: 5
    ducks: 3
    work: 3
    father: 3
    feel: 3
     

**MELFI** 

    panic: 3
    understanding: 2
    attack: 2
    work: 2
    stress: 2
     

**CARMELA** 

    Meadow: 4
    school: 3
    grades: 2
    father: 2
    ziti: 2
     

**CHRIS** 

    Mahaffey: 3
    business: 3
    money: 3
    call: 2
    family: 2
     

**HUNTER** 

    Aspen: 2
    cool: 2
    think: 2
    fun: 2
    vacation: 2
     

**MAHAFFEY** 

    sorry: 3
    money: 3
    broken: 2
    leg: 2
    company: 2
     

**MEADOW** 

    dad: 3
    think: 2
    stupid: 2
    school: 2
    trip: 2
     

**BIG PUSSY** 

    business: 2
    Kolar: 2
    money: 2
    good: 2
    family: 2
     

**ARTHUR** 

    help: 2
    vacation: 2
    deal: 2
    restaurant: 2
    days: 2
     

**LIVIA** 

    home: 2
    nursing: 2
    mother: 2
    day: 2
    people: 2
     

**FATHER PHIL** 

    damn: 2
    night: 2
    incredible: 2
    just: 2
    think: 2
     

**SILVIO** 

    night: 2
    family: 2
    go: 2
    talk: 2
    people: 2
     

**TOMMY JR.** 

    grandma: 2
    happy: 2
    home: 2
    dad: 2
    birthday: 2
     

**DICK BARONE** 

    contract: 1
    money: 1
    business: 1
    call: 1
    good: 1
     

**CHARMAINE** 

    gift: 1
    stay: 1
    home: 1
    better: 1
    anything: 1
     

**HERMAN** 

    chance: 1
    money: 1
    problem: 1
    time: 1
    feelings: 1
     

This frequency list highlights the common words spoken by each character, providing insights into their dialogue patterns and focal points in the screenplay. 

## Extracting Information for each Scene

In [79]:
# Each <div> in the body holds one scene
divs = body.find_all("div")
div = divs[4]
type(div)

bs4.element.Tag

In [None]:
def get_scene_info(tag):
    """
    Takes a bs4 <div> tag containing one scene of a screenplay (head = name of the scene; stage = stage directions; speaker = name of each speaker).

    Returns a dictionary containing the scene name, the first stage direction (any later stage directions are ignored), and a count of speeches per speaker for the scene.
    """
    if tag.name != "div":
        print("error, not a div tag!")
        # Return the same keys as a valid scene so results stay consistent
        return {"head": None, "stage": None, "speaker_ct": Counter()}

    head = tag.find("head")
    if head is not None:
        head = head.text
    stage = tag.find("stage")
    if stage is not None:
        stage = stage.text

    # find_all returns an empty list when the scene has no speakers
    speakerlist = [speaker.text for speaker in tag.find_all("speaker")]
    speaker_ct = Counter(speakerlist)

    return {"head": head, "stage": stage, "speaker_ct": speaker_ct}


In [88]:
divs = body.find_all("div")

sceneinfodict_list = []

for div in divs:
    sceneinfodict_list.append(get_scene_info(div))

scene_df = pd.DataFrame(sceneinfodict_list)
scene_df.head()

|  | head | stage | speaker_ct |
|---|---|---|---|
| 0 | INT. DR. MELFI'S WAITING ROOM - DAY | THOMAS SOPRANO, 40, sits and waits. Uneasily. ... | {'MELFI': 1} |
| 1 | INT. DR. MELFI'S OFFICE - DAY | Melfi gestures Tommy to a choice of seating. | {'MELFI': 7, 'TOMMY': 5} |
| 2 | EXT. SOPRANO HOUSE - DAWN | split-level. New Jersey. The only thing distin... | {} |
| 3 | INT. SOPRANO HOUSE - BEDROOM - DAWN | TOMMY'S EYE slams open from sleep. He stares s... | {'TOMMY (V.O.)': 1} |
| 4 | EXT. SOPRANO HOUSE - DAY | Bathrobed Tommy reads his morning paper in the... | {'MELFI (V.O.)': 2, 'TOMMY (V.O.)': 1} |


## 5. EDA

Visualizations to make:
1. when and how often each speaker speaks
2. sentiment analysis by speaker
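
As a starting point for the first visualization, one could record the position of each speaker's turns in the order they occur. The sketch below uses only the standard library and a short made-up speaker sequence; in the notebook, the `speakerlist` built in section 4 could be passed in instead. The function name `turn_positions` is hypothetical.

```python
from collections import defaultdict

def turn_positions(speakers):
    """Map each speaker to the list of turn indices at which they speak."""
    positions = defaultdict(list)
    for index, speaker in enumerate(speakers):
        positions[speaker].append(index)
    return dict(positions)

# Made-up sequence of speaker labels, in order of appearance.
sample_speakers = ["MELFI", "MELFI", "TOMMY", "MELFI", "TOMMY", "CARMELA"]

positions = turn_positions(sample_speakers)
for speaker, turns in positions.items():
    print(f"{speaker}: {len(turns)} turns at {turns}")
```

Plotting each list of indices on a shared horizontal axis, with one row per speaker, would show both when and how often each character speaks.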

## 6. Asking Questions of Texts