## Basic Pipeline

0. Extract a list of genres
1. Get all the movie titles for these genres, then build the urls of all the movies from their titles
2. Randomly sample some of these urls and get their scripts
3. Clean the scripts
4. Get some basic statistics by comparing the character names (usually in bold) in the scripts against some female/male/unisex list of names
5. Create a mapping between character names and what these characters say
6. Get some more advanced statistics using this mapping
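
Step 4 can be illustrated with a minimal sketch. The name sets and the list of bold names below are made-up placeholders (assumptions for illustration, not data from this notebook); the real pipeline loads the name lists from text files and the bold names from a script.

```python
from collections import Counter

# Hypothetical, tiny name lists (the real ones are loaded from text files)
female_names = {"anna", "mary"}
male_names = {"john", "peter"}
unisex_names = {"alex"}

def gender_of(name):
    """Return 'f', 'm', 'u', or 's' (not a known name) for a lowercase name."""
    if name in female_names:
        return "f"
    if name in male_names:
        return "m"
    if name in unisex_names:
        return "u"
    return "s"

# Hypothetical cleaned bold tags: one entry per time a character speaks
bold_names = ["anna", "john", "anna", "alex", "int.", "mary", "john"]

# Non-unique counts per category, ignoring entries that are not names
codes = [gender_of(name) for name in bold_names]
counts = Counter(code for code in codes if code != "s")
print(counts)
```

Entries such as scene headings ("int.") are also bold in many scripts, which is why anything not found in a name list is skipped rather than counted.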

## More detailed Pipeline
0. Download a list of male/female/unisex names to create some name database
1. Extract the list of movie-genres from the scripts website
2. Get all the movie titles for these genres
3. For each movie create a Movie object instantiated using the title and the genre of the movie
4. Sample randomly some of these urls and get their scripts as soup objects 
5. Extract the bold tags in these soup objects which usually are the names of the characters
6. Now we can get some statistics regarding the number of male/female/unisex names by testing against the database created
7. The next stage is to split the script by the bold tags (usually names): after a bold tag (name) comes the text of what the character says
8. Now we have both the bold tags (character names) and what these characters say
9. Then we group up the conversations between characters
10. Finally, find which of these conversations are between females and in how many of them a male name or a male word such as he, his, him etc. appears, to get some more advanced statistics (this essentially tries to automate the Bechdel test).
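
Steps 5 to 9 can be sketched on a toy input as follows. The script fragment and the name sets are invented for illustration, and the sketch assumes the format described above (speaker in a bold tag, speech right after it). Unlike the class further down, it keeps each speaker code paired with its text so the two cannot drift out of alignment.

```python
import re
from itertools import groupby

# Hypothetical script fragment in the assumed format
script = ("<b> ANNA </b> Where were you? "
          "<b> MARY </b> At the station. "
          "<b> INT. KITCHEN </b> Morning light. "
          "<b> JOHN </b> Coffee?")

# Steps 5 and 7: the bold tags are the speakers, the text between them is what they say
speakers = [s.strip().lower() for s in re.findall(r"<b>(.*?)</b>", script)]
texts = [t.strip().lower() for t in re.split(r"<b>.*?</b>", script)[1:]]

# Encode each speaker: 'f' female, 'm' male, 's' anything that is not a known name
female_names, male_names = {"anna", "mary"}, {"john"}
codes = ["f" if s in female_names else "m" if s in male_names else "s"
         for s in speakers]

# Step 9: a conversation is a run of named speakers not interrupted by an 's' entry
pairs = list(zip(codes, texts))
conversations = [list(group)
                 for is_name, group in groupby(pairs, key=lambda p: p[0] != "s")
                 if is_name]

for conv in conversations:
    print(conv)
```

Here the scene heading splits the fragment into two conversations: one between two female names and one with a single male speaker.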


## Statistics/Features returned

0. Title: The title of the movie
1. Genre: The genre of the movie
2. Male/Female/Unisex counts: These are the overall (non-unique) counts for names belonging in the corresponding category.
3. Male/Female/Unisex unique counts (not included in the final dataframe)
4. Male/Female/Unisex percentage: The share of each category among the unique character names in the movie
5. Top n protagonists: The genders of the n names appearing most often in a movie (n = 3 in the returned dataframe, but any value can be used). E.g. ['f', 'm', 'u'] means that the top protagonist has a female name, the second a male name and the third a unisex name
6. F-F conversations: The number of conversations between females (the min conversation length is set to 2 but can be changed)
7. F-U conversations: The number of conversations between female and unisex (again, the min conversation length is set to 2 but can be changed)
8. Contain male: How many of the conversations between females contain a mention of a male name or a male word such as he, his, him
9. Not Contain male: How many of the conversations between females do NOT contain a mention of a male name or a male word

Note: The last two features (Contain male, Not Contain male) essentially try to perform the Bechdel test automatically. They make very strong assumptions about the format of the html/script and are very susceptible to format changes, but could nonetheless be useful as features.
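
The idea behind the Contain male / Not Contain male features can be sketched as below. This is not the notebook's implementation: the word list, the function names and the regex tokenisation are assumptions of this sketch, and it checks every utterance in a conversation.

```python
import re

# Hypothetical male-word list: pronouns plus a few male names
MALE_WORDS = {"he", "him", "his", "himself", "john"}

def mentions_male(conversation_texts):
    """Return True if any utterance in the conversation contains a male word."""
    for text in conversation_texts:
        words = re.findall(r"[a-z']+", text.lower())
        if MALE_WORDS.intersection(words):
            return True
    return False

def classify_female_conversations(conversations, min_len=2):
    """Count female-only conversations that do / do not mention a male word.

    Each conversation is a list of (gender_code, text) pairs.
    """
    contain, not_contain = 0, 0
    for conv in conversations:
        codes = {code for code, _ in conv}
        # Keep only conversations that are long enough and have female speakers only
        if len(conv) < min_len or codes != {"f"}:
            continue
        if mentions_male(text for _, text in conv):
            contain += 1
        else:
            not_contain += 1
    return contain, not_contain

# Made-up conversations for illustration
conversations = [
    [("f", "Did he call you?"), ("f", "Not yet.")],
    [("f", "The train leaves at noon."), ("f", "Then we should hurry.")],
    [("f", "Hello."), ("m", "Hi.")],
    [("f", "Alone on stage.")],
]
result = classify_female_conversations(conversations)
print(result)  # (1, 1)
```

Matching whole tokens rather than substrings matters here: "the" and "then" in the second conversation must not count as "he".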

## Some problems/suggestions/improvements
0. Time: Right now the code is slow getting the statistics (especially the more advanced ones) and as a result statistics for only 30 movies were returned. Possible solutions: reduce for loops, store intermediate values, split functions to get basic features more quickly and more advanced ones as a next step.
1. Ideally the Movie_Analysis class should be split into two classes, Movie and Analysis(Movie)
2. Year of the movie is not used (could be useful)
3. Unisex names are not very helpful; a frequency table of them per decade could be used along with the year of the movie to assign them to male/female
4. Some movies do not follow the same format as the vast majority of them and it is currently not possible to clean them and collect statistics for them. These cases need to be examined one by one.
5. The functions that perform the automated Bechdel test make very strong assumptions about the formatting of the html file. I know of no easy solution to that.
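
One way to "store intermediate values", sketched under assumptions: the class name `CachedMovie` and the injected `fetch` callable are hypothetical and only illustrate the idea of downloading a page once and reusing it. `functools.cached_property` requires Python 3.8 or later.

```python
from functools import cached_property

class CachedMovie:
    """Hypothetical variant of a movie object that downloads its page only once."""

    def __init__(self, title, fetch):
        self.title = title
        self._fetch = fetch  # callable taking a url and returning the page content

    @property
    def url(self):
        return 'https://www.imsdb.com/scripts/' + self.title + '.html'

    @cached_property
    def page(self):
        # Runs on first access only; the result is stored on the instance afterwards
        return self._fetch(self.url)

calls = []

def fake_fetch(url):
    """Stand-in for requests.get(url).content so the sketch runs offline."""
    calls.append(url)
    return "<pre><b> ANNA </b> Hello.</pre>"

movie = CachedMovie("Juno", fake_fetch)
first = movie.page
second = movie.page
print(len(calls))  # 1
```

The same pattern could be applied to any derived value that is expensive to recompute, such as the cleaned bold tags.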

In [11]:
import requests
import numpy as np
import re
import pickle
import pandas as pd
import time
from bs4 import BeautifulSoup
from collections import Counter
from itertools import groupby

In [12]:
# Get the website url and create the male/female/unisex name db

url = 'http://www.imsdb.com/'
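female_df = pd.read_csv('female-names.txt', header=None, dtype=str)
male_df = pd.read_csv('male-names.txt', header=None, dtype=str)
# Work with plain strings (first column of each file, assuming one name per line)
# so that later membership tests compare strings instead of one-element NumPy arrays
female_names = set(female_df[0].dropna())
male_names = set(male_df[0].dropna())
# Names found in both files are unisex; keep lists because later code concatenates them with +
female_db = sorted(female_names - male_names)
male_db = sorted(male_names - female_names)
unisex_db = sorted(male_names & female_names)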

In [13]:
def get_anchors(url, keyword):
    
    """
    Extracts texts from anchor tags containing a specific keyword
    
    Args:
        url: the url (string) of the webpage
        keyword: a word (string) to filter the anchor tags
    Returns:
        anchor_texts: a list containing the filtered texts
    """
    
    # Get the soup object
    request = requests.get(url)
    data = request.content
    soup = BeautifulSoup(data, 'lxml')
    
    # Get the anchors
    anchors = soup.find_all('a')
    
    # Filter the anchors for a specific keyword in href
    # (anchors without an href attribute are skipped instead of raising a KeyError)
    keyword_anchors = list(filter(lambda anchor: keyword in anchor.get('href', ''), anchors))
    
    # Get the body of the filtered anchors
    anchor_texts = list(map(lambda anchor: anchor.text, keyword_anchors))
    
    return anchor_texts

def extract_genre_movies(url, genre):
    """
    Extracts the movie titles for a given genre

    Args: 
        url: the url (string) of the webpage
        genre: the genre (string) to get the movies for

    Returns:
        movie_titles: a list of all the movies title for the given genre
    """
    # Build the url
    genre_url = url + 'genre/' + genre
    
    # Get the names of the movies
    movie_titles = get_anchors(genre_url, '/Movie')
    
    # Format the movie names
    movie_titles = map(lambda movie: re.sub(r':', '', movie), movie_titles)
    movie_titles = list(map(lambda movie: re.sub(r' ', '-', movie), movie_titles))

    return movie_titles
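
The title formatting above and the url built later in `Movie_Analysis.get_url` can be checked offline with a small sketch. The helper names `title_to_slug` and `script_url` are hypothetical, and the input titles are only examples of the kind of text an anchor might contain.

```python
import re

def title_to_slug(title):
    """Drop colons and replace spaces with hyphens, as extract_genre_movies does."""
    return re.sub(r' ', '-', re.sub(r':', '', title))

def script_url(slug):
    """Build the script url the same way Movie_Analysis.get_url does."""
    return 'https://www.imsdb.com/scripts/' + slug + '.html'

slug = title_to_slug('Die Hard 2')
print(slug)              # Die-Hard-2
print(script_url(slug))  # https://www.imsdb.com/scripts/Die-Hard-2.html
print(title_to_slug('Star Wars: The Force Awakens'))  # Star-Wars-The-Force-Awakens
```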

In [14]:
# Get the list of genres
genres = get_anchors(url, 'genre')
print(genres)

['Action', 'Adventure', 'Animation', 'Comedy', 'Crime', 'Drama', 'Family', 'Fantasy', 'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Short', 'Thriller', 'War', 'Western']


In [15]:
class Movie_Analysis(object):
    
    def __init__(self, genre, title):
        
        self.genre = genre
        self.title = title
        self.script = None
        self.bold_tags = None
        self.convs = None
        self.titles_list = None
        self.convs_texts_list = None
        
        
    @property
    def get_url(self):
        """
        Returns the url of a movie according to its name
        """
        
        url = 'https://www.imsdb.com/scripts/' + self.title + '.html'
        
        return url
    
    @property    
    def get_soup(self):
        """
        Returns the soup object of a movie
        """
        request = requests.get(self.get_url)
        script = request.content
        soup = BeautifulSoup(script, 'lxml')
        
        return soup
  
    @property
    def set_script(self):
        """
        Sets the script of a movie
        """    
        # Access get_soup once: every access triggers a new HTTP request
        pre_tags = self.get_soup.find_all('pre')
        if pre_tags:
            self.script = pre_tags[0]

    
    @property
    def extract_bold_tags(self):
        """
        Extracts the bold text of a script
        """ 
        if self.script:
            script = self.script
            
            # Filter for bold text
            bold_tags_soup = script.find_all('b')
            bold_tags = list(map(lambda name: name.text, bold_tags_soup))

            return bold_tags
        
    @property
    def set_bold_tags(self):
        """
        Cleans the bold text of a script
        """      
        bold_text = self.extract_bold_tags
        if bold_text:
            
            # Collapse whitespace, make lower case and keep the first name of each movie character
            # (the text usually starts with a space, so split(' ') yields an empty first element, hence the [1])
            cleaned_text = map(lambda word: re.sub(r'\s+', ' ', word), bold_text)
            cleaned_text = map(lambda word: word.lower() , cleaned_text)
            cleaned_text = list(map(lambda word: word.split(' ')[1] if len(word.split(' ')) > 1 else word.split(' ')[0], cleaned_text))
            
            self.bold_tags = cleaned_text
        
    @property
    def set_convs(self):
        """
        Creates the texts of the conversations by splitting the script on the bold text in it
        """  
        if self.script :
            script = str(self.script)
            split_text = re.sub(r'\s+', " ", script)
            split_text = re.split('<b>.*?</b>', split_text)
            split_text = list(map(lambda word: word.lower() , split_text))[1:]
            self.convs = split_text
            
    
    def extract_names(self, male_db, female_db, unisex_db, mode='all'):
        """
        Extracts the names of the characters in the text given a database
        
        Args: 
            male_db: a database (list) of male names
            female_db: a database (list) of female names
            unisex_db: a database (list) of unisex names
            mode: a mode to operate with (default='all'), possible modes: 'all', 'male', 'female', 'unisex'
        Returns: 
            num_names: the number of all names in the movie (non-unique)
            names: a list of all the names (non-unique)
        """ 
        script = self.bold_tags
        names = []
        if script:
            if mode == 'all':
                names =  [word for word in script if word in male_db + female_db + unisex_db ]
            elif mode == 'male':
                names =  [word for word in script if word in male_db]
            elif mode == 'female':
                names =  [word for word in script if word in female_db]
            elif mode == 'unisex':
                names =  [word for word in script if word in unisex_db]
            else:
                # Callers unpack a (count, names) tuple, so fail loudly instead of returning -1
                raise ValueError('Incorrect mode, available modes: all, male, female, unisex')
        
        num_names = len(names) 
        
        return num_names, names
    
    
    def extract_unique_names(self, male_db, female_db, unisex_db, mode='all'):
        """
        Extracts the unique names of the characters in the text given a database
        
        Args: 
            male_db: a database (list) of male names
            female_db: a database (list) of female names
            unisex_db: a database (list) of unisex names
            mode: a mode to operate with (default='all'), possible modes: 'all', 'male', 'female', 'unisex'
        Returns: 
            num_unique_names: the number of unique names in the movie
            unique_names: a list of all unique names
        """      
        _, names = self.extract_names(male_db, female_db, unisex_db, mode=mode)
        
        # The mode was already applied by extract_names, so only deduplication is left
        unique_names = list(set(names))
        
        num_unique_names = len(unique_names)
        
        return num_unique_names, unique_names
    
    
    def get_counts(self, male_db, female_db, unisex_db, mode='all'):
        """
        Counts the occurrences of unique names of the characters in the text, given a database
        
        Args: 
            male_db: a database (list) of male names
            female_db: a database (list) of female names
            unisex_db: a database (list) of unisex names
            mode: a mode to operate with (default='all'), possible modes: 'all', 'male', 'female', 'unisex'
        Returns: 
            counts: the total number of names appearing in the movie
            counts_dict: a dictionary of all unique names and number of occurrences for each
        """    
        
        counts, names = self.extract_names(male_db, female_db, unisex_db, mode=mode)
        counts_dict = Counter(names)
        
        return counts, counts_dict
   
    def get_percentage(self, male_db, female_db, unisex_db, mode='unisex'):
        """
        Gets the percentage of male/female/unisex names in the movie
        
        Args: 
            male_db: a database (list) of male names
            female_db: a database (list) of female names
            unisex_db: a database (list) of unisex names
            mode: a mode to operate with (default='unisex'), possible modes: 'all', 'male', 'female', 'unisex'
        Returns: 
            percentage: the fraction (between 0 and 1) of unique names that are male/female/unisex, depending on the operating mode
        """         
        
        all_unique_names, _ = self.extract_unique_names(male_db, female_db, unisex_db, mode='all')
        if all_unique_names == 0:
            percentage = 0
        else:
            related_unique_names, _ = self.extract_unique_names(male_db, female_db, unisex_db, mode=mode)
            percentage = related_unique_names / all_unique_names
        
        return percentage

    @property
    def set_titles_text_map(self):
        """
        Creates two corresponding lists, one containing the character names (bold tags) and the other what
        each character says
        """     
        titles = self.bold_tags
        convs_texts = self.convs
        
        # Create a "mapping" between titles(names) and what text they say (if the same name is found consecutively then
        # the name and the corresponding text gets concatenated)
        if titles and convs_texts and len(titles) <= len(convs_texts):
            titles_list = []
            convs_texts_list = []
            titles_list.append(titles[0])
            convs_texts_list.append(convs_texts[0])
            for i in range(1, len(titles)):
                if titles[i] != titles[i - 1]:
                    titles_list.append(titles[i])
                    convs_texts_list.append(convs_texts[i])                
                else:
                    convs_texts_list[-1] += ' ' + convs_texts[i]

            self.titles_list = titles_list
            self.convs_texts_list = convs_texts_list
    
    def encode_convs(self, male_db, female_db, unisex_db):
        """
        Encodes the conversations with 'f' for female, 'm' for male, 'u' for unisex and 's' for everything else
        
        Args: 
            male_db: a database (list) of male names
            female_db: a database (list) of female names
            unisex_db: a database (list) of unisex names
        Returns: 
            encoded_convs: encoded conversations 'f' denoting a person with a female name talking, 'm' a male, 
                            'u' a unisex and 's' none of the above
        """         
        encoded_convs = []
        convs = self.titles_list
        if convs:
            for word in convs:
                if word in male_db:
                    encoded_convs.append('m')
                elif word in female_db:
                    encoded_convs.append('f')
                elif word in unisex_db:
                    encoded_convs.append('u')
                else:
                    encoded_convs.append('s')

            return encoded_convs

    def protagonists_ngram(self, male_db, female_db, unisex_db, n=1):
        """
        Creates an n-gram denoting the genders of the names of the top n protagonists based on their occurrences

        Args: 
            male_db: a database (list) of male names
            female_db: a database (list) of female names
            unisex_db: a database (list) of unisex names
            n: length of the n-gram
        Returns: 
            encoded_ngram: a list of the genders of the names of the top n protagonists based on their occurrences
        """     
        _, counts_dict = self.get_counts(male_db, female_db, unisex_db)
        
        # Get the character names with most counts and encode them
        ngram = [name for name in sorted(counts_dict, key = counts_dict.get)[-n:]][::-1]
        encoded_ngram = ['f' if name in female_db else 'm' if name in male_db else 'u' for name in ngram]

        return encoded_ngram
    
    def count_fem_convs(self, male_db, female_db, unisex_db, splitword='s', conv_len=2):
        """
        Counts the female conversations (as defined by a predefined length) and the potential female conversations, defined as
        conversations between two people one with a female name and one with a unisex name
        
        Args: 
            male_db: a database (list) of male names
            female_db: a database (list) of female names
            unisex_db: a database (list) of unisex names
            splitword: the code denoting that no one is talking
            conv_len: the minimum length of a conversation
        Returns: 
            fem_conv_count: the sum of female conversations
            fem_conv_count_pot: the sum of potential female conversations (female with unisex)
        """  
        encoded_convs = self.encode_convs(male_db, female_db, unisex_db)
        fem_conv_count = 0
        fem_conv_count_pot = 0
        
        # Group the encoded conversations and check if there are two females talking or female with unisex
        if encoded_convs:
            split_convs = list(list(g) for k,g in groupby(encoded_convs, key=lambda x: x != splitword) if k)  

            for conv in split_convs:
                if len(conv) >= conv_len and 'f' in conv and 'm' not in conv and 'u' not in conv:
                    fem_conv_count += 1
                if len(conv) >= conv_len and 'f' in conv and 'm' not in conv and 'u' in conv :
                    fem_conv_count_pot += 1
                
        return fem_conv_count, fem_conv_count_pot

    def talk_about_male(self, male_db, female_db, unisex_db, splitword='s', conv_len=2):
        """
        Counts the number of female conversations containing/not containing a male word
        
        Args: 
            male_db: a database (list) of male names
            female_db: a database (list) of female names
            unisex_db: a database (list) of unisex names
            splitword: the code denoting that no one is talking
            conv_len: the minimum length of a conversation
        Returns: 
            yes_counter: the number of female conversations containing a male word
            no_counter: the number of female conversations not containing a male word
        """       
        yes_counter = 0
        no_counter = 0
        convs_texts_list = self.convs_texts_list
        titles_list = self.encode_convs(male_db, female_db, unisex_db)
        
        # First group the encoded titles/characters and based on this grouping, group the corresponding texts
        if titles_list:
            split_convs = list(list(g) for k,g in groupby(titles_list, key=lambda x: x if x != 's' else 's') if k) 
            split_texts = []
            idx_counter = 0
            for lst in split_convs:
                texts_list = []
                iters = len(lst)
                for idx in range(iters):
                    texts_list.append(convs_texts_list[idx_counter])
                    idx_counter += 1
                split_texts.append(texts_list)
            
            # Using the two corresponding lists (character genders and texts), detect whether two females
            # are talking about a male or not
            for conv_gender, conv_text in zip(split_convs, split_texts):
                cont_male = 'm' in conv_gender
                cont_female = 'f' in conv_gender
                cont_unisex = 'u' in conv_gender
                male_word_list = male_db + ['he', 'him', 'his', 'himself']
                cont_male_word = list(np.intersect1d(conv_text[0].split(' '), male_word_list))

                if len(conv_gender) >= conv_len and cont_female and not cont_male and not cont_unisex and not cont_male_word:
                    no_counter += 1
                if len(conv_gender) >= conv_len and cont_female and not cont_male and not cont_unisex and cont_male_word:
                    yes_counter += 1
                
        return yes_counter, no_counter

In [16]:
# Get all the movie names and create an object corresponding to each movie

movies_list = []
for genre in genres:
    genre_movies = extract_genre_movies(url, genre)
    for movie_title in genre_movies:
        movie = Movie_Analysis(genre, movie_title)
        movies_list.append(movie)
print(len(movies_list))

3263


In [29]:
# Sample randomly 150 movies (might contain some movies with no script)

sample_size = 150
seed = 1
np.random.seed(seed)
sampled_movies = np.random.choice(movies_list, sample_size, replace=False)
for movie in sampled_movies:
    movie.set_script
    if movie.script:
        movie.set_bold_tags
        movie.set_convs
        movie.set_titles_text_map

In [31]:
def query_movies(movies, male_db=male_db, female_db=female_db, unisex_db=unisex_db, conv_len=2):
    """
    Creates a dataframe with movies and their statistics

    Args: 
        movies: a list of movies to calculate statistics for
        male_db: a database (list) of male names
        female_db: a database (list) of female names
        unisex_db: a database (list) of unisex names
        conv_len: the minimum length of a conversation
    Returns: 
        stats_df: a dataframe containing the statistics of the movies
    """  
    
    
    columns = ['Title', 'Genre', 'Top 3 Protagonists', 'Male Counts', 'Female Counts', 'Unisex Counts',
               'Male Percent', 'Female Percent', 'Unisex Percent', 'F-F Convs', 'F-U Convs',
               'Contain Male', 'Dont Contain Male']
    # Collect one dict per movie (DataFrame.append was removed in pandas 2.0)
    rows = []
    
    for movie in movies:
        start = time.time()
        title = movie.title
        genre = movie.genre
        n_fem_conv, n_fem_uni_conv = movie.count_fem_convs(male_db, female_db, unisex_db, conv_len=conv_len)
        cont_male, not_cont_male =  movie.talk_about_male(male_db, female_db, unisex_db, conv_len=conv_len)
        top_3_protag =  movie.protagonists_ngram(male_db, female_db, unisex_db, n=3)
        male_counts, _ = movie.get_counts(male_db, female_db, unisex_db, mode='male')
        female_counts, _ = movie.get_counts(male_db, female_db, unisex_db, mode='female')
        unisex_counts, _ = movie.get_counts(male_db, female_db, unisex_db, mode='unisex')
        male_percent = movie.get_percentage(male_db, female_db, unisex_db, mode='male')
        female_percent = movie.get_percentage(male_db, female_db, unisex_db, mode='female')
        unisex_percent = movie.get_percentage(male_db, female_db, unisex_db, mode='unisex')
        rows.append({'Title':title, 'Genre':genre, 'Top 3 Protagonists':top_3_protag,
                     'Male Counts': male_counts, 'Female Counts':female_counts, 'Unisex Counts':unisex_counts,
                     'Male Percent':male_percent, 'Female Percent':female_percent, 'Unisex Percent':unisex_percent,
                     'F-F Convs':n_fem_conv, 'F-U Convs':n_fem_uni_conv,
                     'Contain Male':cont_male, 'Dont Contain Male': not_cont_male})
        end = time.time()
        print(movie.title, ' Done!', ' Time elapsed: {} seconds'.format(end - start))
        # Rewrite the CSV after every movie so partial results survive an interruption
        pd.DataFrame(rows, columns=columns).to_csv('movie_statistics.csv', sep=',')
        
    return pd.DataFrame(rows, columns=columns)

In [32]:
df = query_movies(sampled_movies)

Sunset-Blvd.  Done!  Time elapsed: 220.18349838256836 seconds
Boxtrolls,-The  Done!  Time elapsed: 263.233904838562 seconds
Die-Hard-2  Done!  Time elapsed: 301.334538936615 seconds
Eastern-Promises  Done!  Time elapsed: 182.6364278793335 seconds
Schindler's-List  Done!  Time elapsed: 154.56933975219727 seconds
Bad-Country  Done!  Time elapsed: 199.75597739219666 seconds
Frankenstein  Done!  Time elapsed: 171.6743552684784 seconds
Les-Miserables  Done!  Time elapsed: 172.79608130455017 seconds
Blue-Valentine  Done!  Time elapsed: 201.2129898071289 seconds
Gremlins-2  Done!  Time elapsed: 340.9937045574188 seconds
Siege,-The  Done!  Time elapsed: 255.2493360042572 seconds
Hard-to-Kill  Done!  Time elapsed: 179.66108775138855 seconds
Highlander  Done!  Time elapsed: 192.49867296218872 seconds
No-Country-for-Old-Men  Done!  Time elapsed: 197.79273676872253 seconds
Tombstone  Done!  Time elapsed: 225.92404580116272 seconds
Terminator  Done!  Time elapsed: 194.5569703578949 seconds
Man-Who-

Suspect-Zero  Done!  Time elapsed: 200.00950717926025 seconds
Thunderbirds  Done!  Time elapsed: 140.38749051094055 seconds
Frozen-(Disney)  Done!  Time elapsed: 275.1185142993927 seconds
Eternal-Sunshine-of-the-Spotless-Mind  Done!  Time elapsed: 234.48773407936096 seconds
Juno  Done!  Time elapsed: 195.66777801513672 seconds
Bull-Durham  Done!  Time elapsed: 215.09172654151917 seconds
Talented-Mr.-Ripley,-The  Done!  Time elapsed: 168.4535903930664 seconds
Color-of-Night  Done!  Time elapsed: 264.40199398994446 seconds
Lone-Star  Done!  Time elapsed: 267.81465458869934 seconds
Next  Done!  Time elapsed: 567.8624362945557 seconds
Lord-of-the-Rings-Fellowship-of-the-Ring,-The  Done!  Time elapsed: 254.44688892364502 seconds
Clockwork-Orange,-A  Done!  Time elapsed: 0.0 seconds
Starman  Done!  Time elapsed: 202.9780993461609 seconds
Star-Wars-The-Force-Awakens  Done!  Time elapsed: 263.04563665390015 seconds
What-Lies-Beneath  Done!  Time elapsed: 226.8413906097412 seconds
Collateral  D

In [33]:
df.head(50)

| | Title | Genre | Top 3 Protagonists | Male Counts | Female Counts | Unisex Counts | Male Percent | Female Percent | Unisex Percent | F-F Convs | F-U Convs | Contain Male | Dont Contain Male |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Sunset-Blvd. | Drama | [f, f, u] | 27 | 320 | 71 | 0.375 | 0.375 | 0.25 | 4 | 11 | 4 | 0 |
| 1 | Boxtrolls,-The | Comedy | [u, m, m] | 34 | 0 | 152 | 0.8 | 0.0 | 0.2 | 0 | 0 | 0 | 0 |
| 2 | Die-Hard-2 | Action | [m, m, m] | 167 | 1 | 45 | 0.722222 | 0.055556 | 0.222222 | 0 | 0 | 0 | 0 |
| 3 | Eastern-Promises | Thriller | [m, f, f] | 194 | 242 | 0 | 0.222222 | 0.777778 | 0.0 | 21 | 0 | 11 | 12 |
| 4 | Schindler's-List | Drama | [m, f, m] | 83 | 14 | 6 | 0.75 | 0.125 | 0.125 | 0 | 0 | 0 | 0 |
| 5 | Bad-Country | Drama | [m, u, m] | 317 | 3 | 137 | 0.65 | 0.15 | 0.2 | 0 | 0 | 0 | 0 |
| 6 | Frankenstein | Romance | [m, f, m] | 336 | 130 | 4 | 0.538462 | 0.384615 | 0.076923 | 5 | 3 | 2 | 7 |
| 7 | Les-Miserables | Drama | [m, f, f] | 74 | 42 | 0 | 0.5 | 0.5 | 0.0 | 0 | 0 | 0 | 0 |
| 8 | Blue-Valentine | Drama | [m, f, u] | 277 | 266 | 72 | 0.285714 | 0.357143 | 0.357143 | 3 | 14 | 2 | 4 |
| 9 | Gremlins-2 | Horror | [u, f, m] | 78 | 138 | 188 | 0.473684 | 0.368421 | 0.157895 | 6 | 38 | 0 | 6 |
