# Prompt  #2

#### What words are characteristic of the movie summaries in those genres?

In this notebook I will be:
-  CHANGE THIS

We start by importing necessary modules and the dataset:

In [187]:
import pandas as pd
import numpy as np
import collections
import ast
import nltk
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.corpus import wordnet
from stop_words import get_stop_words
from gensim import corpora, models
import gensim

movies = pd.read_csv("movie_data.csv")

movies['genres'] = pd.Series(ast.literal_eval(genres) for genres in movies['genres'])

movies.head(5)



Unnamed: 0,id,title,release_date,box_office_revenue,runtime,genres,summary
0,0,Ghosts of Mars,2001-08-24,14010832.0,98.0,"[Space western, Horror, Supernatural, Thriller...","Set in the second half of the 22nd century, th..."
1,1,White Of The Eye,1987,,110.0,"[Erotic thriller, Psychological thriller, Thri...",A series of murders of rich young women throug...
2,2,A Woman in Flames,1983,,106.0,[Drama],"Eva, an upper class housewife, becomes frustra..."
3,3,The Sorcerer's Apprentice,2002,,86.0,"[Adventure, Fantasy, World cinema, Family Film]","Every hundred years, the evil Morgana returns..."
4,4,Little city,1997-04-04,,93.0,"[Romance Film, Ensemble Film, Comedy-drama, Co...","Adam, a San Francisco-based artist who works a..."


First we want to group summaries of the same genre together. Note there is no reasonable way for us to create disjoint groups, hence we allow a movie to belong to more than one genre grouping. 

In this process we also want to take the opportunity to add part-of-speech(POS) tags to each word. We can only determine an individual word's speech tag (e.g. Verb, Adjective, etc) when we observe it in context. As linguist J.R. Firth said, "You shall know a word by the company it keeps." The POS tags will allow us to lemmatize the words (Further explained below).

In [123]:
unique_genres = ["Drama", "Comedy", "Action/Adventure", "Romance Film", "Thriller"]

genres_list = {genre: [] for genre in unique_genres}

for genre in unique_genres:
    for genres, summary in zip(movies['genres'], movies['summary']):
        
        if genre in genres:
            
            genres_list[genre].append(nltk.pos_tag(summary.split()))

The following functions help with the necessary translations required when using the NLTK's lemmatizer.

In [126]:
#get_wordnet_pos maps tags from NLTK's pos_tag to tags utilized in the lemmatizing method.
def get_wordnet_pos(treebank_tag):

    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return None
    
#Used to associate lemmatized tag object to literal string name
def tag_helper(word,tag):
    wnl = WordNetLemmatizer()
    wntag = get_wordnet_pos(tag)
    
    if wntag is None:# not supply tag in case of None
        lemma = wnl.lemmatize(word)
        tb_tag = ""
        
    #assigning a variable to the english equivalent of our tags
    #Useful if we want to characterize summaries by types of words used
    else:
        lemma = wnl.lemmatize(word, pos = wntag)
        if wntag == wordnet.ADJ:
            tb_tag = "Adjective"
        elif wntag == wordnet.VERB:
            tb_tag = "Verb"
        elif wntag == wordnet.NOUN:
            tb_tag = "Noun"
        elif wntag == wordnet.ADV:
            tb_tag = "Adverb"
        elif wntag is None:
            tb_tag = ""
            
    return lemma, tb_tag

Now we use lemmatization to bring words up a level in abstraction and away from their more specific use. By lemmatizing a corpus of words we reduce the morphological variation. For instance, "better" becomes "good", "running" and "ran" become "run", "frustrated" maps to "frustrate", and so on. By aggregating words into their base form we can come to more accurate characterizations through the term-frequency inverse document-frequency (TF-IDF) score. 

Further, since parts-of-speech are necessary for lemmatization we are able to characterize genres' words by POS distribution. (e.g. Is the proportion of verbs used in Action/Adventure different than Comedy?)

In [144]:
#Genre name is key with list of lemmed summaries (in list form)
lem_dict = {} 

#Same structure as lem_dict but with POS
tag_dict ={} 

lem_count = 0
for genre,tup_list in genres_list.items():
    
    #list that contains all lemmed summaries for a given genre
    temp_lem = [] 
    temp_pos = []
    for summary in tup_list:
            
        #list of lemmed words for a given summary
        summary_lem = [] 
        summary_pos = []

        for tup in summary:

            _word, _tag = tup
            #Leave out proper nouns (NNP), plural proper nouns (NNPS), and personal pronouns (PRP)
            if str(_tag) != "NNP" or str(_tag) != "NNPS" or str(_tag) != "PRP" or str(_tag) != "PRP$":
                #We let compound words stay as is (e.g. chimney-sweep)
                _word = _word.lower().replace(".", "").strip('"!,')

                #Performs lemmatization and converts POS tag into full form (e.g. "Adjective", "Noun", etc)
                lemma, tag = tag_helper(_word, _tag) 

                #Counter to see how many words we changed
                if lemma != _word:
                    lem_count += 1
                    
                summary_lem.append(lemma)
                summary_pos.append(tag)

        temp_lem.append(summary_lem)
        temp_pos.append(summary_pos)

    lem_dict[genre] = temp_lem
    tag_dict[genre] = temp_pos
lem_count       

2743970

We can now compute the TF-IDF score for the lemmas. We start by computing term frequency of each lemma across all summaries and follow up by counting how many summaries ("documents") each lemma appears in. We use raw term count rather than actual term frequency as we are only concered with ranking and so they are proportionally equivalent.

__Note__: In the previous code block we threw out all proper nouns, singular and plural, and also all personal pronouns. Since movie characters' names can be used numerous times in an individual movie's summary but not in any other movie's summary it necessarily has a high TF-IDF score. Without discarding proper nouns much of the top n highest TF-IDF scores belonged to words that did not help us characterize genres whatsoever.

In [145]:
#Building vocab set now helps us simply do dict comprehensions and slightly simplifies tf and df dict creation
vocab = set()
for genre, summaries in lem_dict.items():
    for summary in summaries:
        for word in summary:
            vocab.add(word)
            
vocab = list(vocab)


tf = {genre: {word: 0 for word in vocab} for genre in unique_genres}
df = {genre: {word: 0 for word in vocab} for genre in unique_genres}

for genre, summaries in lem_dict.items():
    for summary in summaries:
        
        #We create a temp docuement frequency list to ensure that even if a word shows up multiple times in...
        #...a document it is only counted in df once
        temp_df = []
        
        for word in summary:
            tf[genre][word] += 1
            
            #Had we not used temp_df here we would just be counting like tf
            if word not in temp_df:
                df[genre][word] += 1
                temp_df.append(word)
                
tf_idf = {genre: {word: 0 for word in vocab} for genre in unique_genres}

for genre in unique_genres:
    for word in vocab:
        
        #We use 1 + the number of documents a word has appeared to avoid division by 0 in the idf term
        tf_idf[genre][word] = tf[genre][word] * np.log(len(lem_dict[genre]) / (1 + df[genre][word]))
        
print("Total unique words in corpus:",len(vocab))

Total unique words in corpus: 177612


Before dissecting TF-IDF scores let's consider what the part-of-speech make-up is for the genres. When looking at the distribution of parts-of-speech across genres it's best to look at intra-distributions proportionally considering there is large variation in number of summaries.

In [146]:
pos_counts = {genre: {} for genre in unique_genres}

#Building out the totals for each part-of-speech for each genre
for genre in unique_genres:
    for tags in tag_dict[genre]:
        for tag in tags:
            if tag != "":
                pos_counts[genre][tag] = pos_counts[genre].get(tag, 0) + 1

#Here we normalize the totals so we can compare proportions across genres
for genre, pos_dict in pos_counts.items():
    temp_total = 0
    
    #Summing pos counts
    for parts in pos_dict.values():
        temp_total += parts
    
    #Normalizing
    for types in pos_dict.keys():
        pos_counts[genre][types] = pos_counts[genre][types] / temp_total
    
for genre, types in pos_counts.items():
    print("{}:".format(genre))
    
    for pos, res in types.items():
        print(pos, ": ", "{}%".format(round(100*res,2)))
        
    print("...")
    print("...")       

Romance Film:
Verb :  31.05%
Adjective :  10.15%
Adverb :  6.55%
Noun :  52.26%
...
...
Thriller:
Verb :  31.23%
Adjective :  9.52%
Adverb :  5.92%
Noun :  53.32%
...
...
Comedy:
Verb :  30.32%
Adjective :  10.01%
Adverb :  6.5%
Noun :  53.18%
...
...
Drama:
Verb :  30.47%
Adjective :  10.42%
Adverb :  6.19%
Noun :  52.93%
...
...
Action/Adventure:
Adjective :  9.39%
Verb :  30.3%
Adverb :  5.67%
Noun :  54.64%
...
...


Perhaps not so surprisingly there are no significant differences across genres in types of speech used. We therefore must rely and look to the words themselves to differentiate and characterize genres.
***
Now back to the TF-IDF scores. Let's look at the highest scoring 30 words from each genre.

In [196]:
top_words = {genre: None for genre in unique_genres}

for genre in unique_genres:
    temp_top = sorted(tf_idf[genre], key=tf_idf[genre].get, reverse=True)[:102]
    temp_top = [word for word in temp_top if word != ""]
    top_words[genre] = temp_top

#This forces jupyter to display all rows
pd.set_option('display.max_rows', None)

top_100 = pd.DataFrame.from_dict(top_words)

Looking at the dataframe below we can see some interesting results. To save the reader from some scrolling here are noteworthy words that seem to characterize the genres in line with our intuition.

Note the listings are given in descending order according to TF-IDF rank.
***
 - __Action/Adventure__: - Kill, - Police, - Gang, - Fight, - Shoot, - Escape, - Attack, - Money, - Agent, - Gun, - Order, - Force, - Ship, - Officer, - Death, and even a very specific Action/Adeventure hero - Bond.
<br><br> 
 - __Comedy__: - Leave, - Kill, - Family, - Friend, - House, - Love, - School, - Money, - Mother, - Work, - Help, - Bug
<br><br>
 - __Drama__: - Kill, - Father, - Leave, - Love, - Family, - Mother, - Home, - House, - Return, - Friend, - Son, - Life, - Police, - Child, - Work, - Wife, - Money, - School, - Daughter
<br><br>
 - __Romance Film__: - Father, - Leave, - Love, - Family, - Mother, - Return, - Friend, - Marry, - Home, - House, - Kill, - Life, - Time, - Relationship, - Son, - Help
<br><br>
 - __Thriller__: - Kill, - Police, - House, - Car, - Leave, - Escape, - Shoot, - Father, - Murder, - Money, - Attack, - Return, - Home, - Meet, -Body, - Reveal, - Time, - Death, - Run, - Family, - Gun, - Mother

Action/Adventure and Thriller both seem to be characterized well as genres by their high-scoring TF-IDF words. It would come as no surprise for a Action/Adventure summary to mention police, gangs, fights, killings, shots, escapes, attacks, agents, and etc. These words actually seem to do a relatively good job describing some of the genres in toto.

Though it is obvious some genres highlighted words seem to match their respective genre better than others. Comedy for instance doesn't really have any truly defining characteristic words for their summaries. Which makes some sense seeing as how comedies plots are less constrained than say a Thriller's is. I would be hard-pressed to come up with defining general characteristics of a comedy's summary.

As seen in the list $alike$ below, every genre actually shares 65 of their 100 characterizing words. Most of these words are easy to understand being used frequently in descriptions of any stories. Yet some of them convey information about what summaries overall, and hence the movies themselves, are about. Generally these genres contain themes about friendship, killing, new meetings, returning to what once was, and families. Or at least that's what these characterizing words, along with a vague understanding of movie narratives, point to.

In [195]:
alike = set(top_100[unique_genres[0]])

for i in range(len(unique_genres)):
    if i != len(unique_genres) - 1:
        alike = alike.intersection(set(top_100[unique_genres[i + 1]]))
        
print(alike)

{'but', 'they', 'when', 'on', 'will', 'friend', 'do', 'begin', 'it', 'her', 'film', 'off', 'kill', 'time', 'he', 'call', 'while', 'after', 'then', 'all', 'try', 'come', 'him', 'an', 'from', 'new', 'return', 'out', 'be', 'for', 'one', 'meet', 'father', 'who', 'have', 'find', 'two', 'about', 'with', 'not', 'family', 'get', 'by', 'this', 'that', 'she', 'help', 'his', 'up', 'their', 'leave', 'see', 'go', 'at', 'make', 'back', 'them', 'before', 'take', 'tell', 'which', 'where', 'give', 'man', 'into'}


In [186]:
top_100

Unnamed: 0,Action/Adventure,Comedy,Drama,Romance Film,Thriller
0,her,her,her,she,her
1,she,she,she,her,she
2,him,he,he,he,he
3,that,him,that,that,that
4,he,that,him,him,him
5,they,they,his,his,they
6,kill,his,they,they,his
7,it,their,have,have,kill
8,his,have,at,tell,it
9,at,it,tell,at,at


In [None]:
like = set(top_100['Comedy']).intersection(set(top_100['Drama'])).intersection(set(top_100['Thriller'])).intersection(set(top_100['Romance Film'])).intersection(set(top_100['Action/Adventure']))
set(top_100['Action/Adventure']) - set(like)

Counter({'': 189, 'Adjective': 36, 'Adverb': 21, 'Noun': 79, 'Verb': 90})