# word_importance
To follow the plot of an entire film, or even a single scene, we need to understand what the characters are talking about. We'll use a variety of NLP tools to parse the dialogue and identify the most important subjects of conversation, whether those are other characters or concepts.

In [1]:
from subtitle_dataframes_io import *
import pysrt
import spacy
import nltk
import string
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
from collections import Counter
pd.set_option('display.max_colwidth', None)
# 'en' is a spaCy 2.x shortcut link; spaCy 3+ needs a full model name such as 'en_core_web_sm'
nlp = spacy.load('en')

We'll start with a scene from *Plus One* (2019). We've previously defined a few functions to parse the subtitles file.

In [2]:
subs = pysrt.open('../subtitles/plus_one.srt')
subtitle_df = generate_base_subtitle_df(subs)
subtitle_df = generate_subtitle_features(subtitle_df)
subtitle_df['cleaned_text'] = subtitle_df['concat_sep_text'].map(clean_line)
sentences = partition_sentences(remove_blanks(subtitle_df['cleaned_text'].tolist()), nlp)

We now have a list of every sentence in the film, called `sentences`. We'll also define two scene-level objects: `scene_sentences`, a single long string of the scene's sentences, and `scene_nlp_doc`, a `spaCy` doc built from that string.

In [3]:
# sentences 880 through 975 make up the scene
scene_sentences = ' '.join(sentences[880:976])
scene_nlp_doc = nlp(scene_sentences)

## Entity Counting
`spaCy` can perform named entity recognition, identifying people and organizations as well as more abstract things like quantities or periods of time (like "tomorrow"). As a first pass, we can simply count the most common entities in a scene.

In [4]:
# collect the text of every entity spaCy found in the scene
entities = [ent.text for ent in scene_nlp_doc.ents]
count = Counter(entities)
count.most_common(10)

[('Nate', 6),
 ('Alice', 3),
 ('Jess Ramsey', 2),
 ('first', 2),
 ('Ben King', 2),
 ('Maggie', 1),
 ('one night', 1),
 ('Alice Mori', 1),
 ('tomorrow', 1),
 ('two weeks', 1)]

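In this scene, the pair (Alice and Ben King) is arguing about exes and prospective romantic partners: Nate, Maggie, and Jess Ramsey. Counting entities is a rough solution, but it works here, mostly because the conversation revolves around named entities. It has visible limits: non-person entities such as "first" and "tomorrow" make the list, and "Alice" and "Alice Mori" are counted as two separate entities.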
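One way to tidy the counts is to keep only entities labelled as people and fold a bare first name into a matching full name. The sketch below is a minimal illustration, not part of the pipeline above. The helper `count_people` and the sample data are hypothetical, and it assumes the entities have already been pulled out as `(text, label)` pairs, for example with `[(ent.text, ent.label_) for ent in scene_nlp_doc.ents]`.

```python
from collections import Counter

# spaCy's English models tag people with the PERSON label
PERSON_LABELS = {"PERSON"}

def count_people(labelled_entities):
    """Count person mentions, folding a first name into its full name.

    labelled_entities: iterable of (text, label) pairs.
    A first name is merged only when exactly one full name starts with it.
    """
    names = [text for text, label in labelled_entities if label in PERSON_LABELS]
    full_names = {name for name in names if " " in name}

    canonical = {}
    for name in set(names):
        matches = [full for full in full_names if full.split()[0] == name]
        canonical[name] = matches[0] if len(matches) == 1 else name

    return Counter(canonical[name] for name in names)

# Hypothetical sample, shaped like the (text, label) pairs described above
sample = [
    ("Nate", "PERSON"), ("Alice", "PERSON"), ("Alice Mori", "PERSON"),
    ("tomorrow", "DATE"), ("first", "ORDINAL"), ("Nate", "PERSON"),
]
people_counts = count_people(sample)
print(people_counts.most_common())
```

Matching on the first token is deliberately conservative: if two characters shared a first name, the bare name would stay unmerged rather than be credited to the wrong person. Whether spaCy labels every character as `PERSON` in this film is not something the sketch guarantees.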

This would improve further with pronoun (coreference) resolution, which links each pronoun to the entity it refers to. A string of sentences early in the scene is all about Maggie: *Um... she was good. Yeah she was... she was really... Yeah. She was cool.* If we could tell that these sentences refer to Maggie, she would appear higher in our list.
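To make the idea concrete, here is a deliberately naive sketch that credits each pronoun to the most recently mentioned name. It is not real coreference resolution, and it will misattribute pronouns whenever the referent is not the last name spoken. The function `count_with_pronouns`, the pronoun set, and the opening question in the sample text are all assumptions made for illustration; only the sentences after that question come from the scene.

```python
from collections import Counter
import re

# Pronouns to attribute; a real resolver would also handle gender and number
FEMALE_PRONOUNS = {"she", "her", "hers"}

def count_with_pronouns(text, known_names, pronouns=FEMALE_PRONOUNS):
    """Count name mentions, crediting each pronoun to the last name seen.

    known_names: set of single-token names to track (case-sensitive).
    Pronouns that appear before any name has been mentioned are ignored.
    """
    counts = Counter()
    last_name = None
    for token in re.findall(r"[A-Za-z']+", text):
        if token in known_names:
            last_name = token
            counts[token] += 1
        elif token.lower() in pronouns and last_name is not None:
            counts[last_name] += 1
    return counts

# The leading question is hypothetical; the rest is quoted from the scene
sample_text = (
    "How was Maggie? Um... she was good. Yeah she was... "
    "she was really... Yeah. She was cool."
)
mention_counts = count_with_pronouns(sample_text, {"Maggie"})
print(mention_counts)
```

With the pronouns attributed, Maggie's single named mention in this snippet becomes five mentions, which is the kind of boost that would move her up the entity list.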