<h1>Evolution AI: Bechdel Test Task</h1>

<p>
    We are interested in looking for movies that pass the Bechdel test (https://en.wikipedia.org/wiki/Bechdel_test). We've found a website that has a large collection of movie scripts - http://www.imsdb.com/.<br/>
We would like you to scrape a few hundred scripts from this website (doesn't have to be all of them), and look for instances where a female character is talking to another female character.  Produce some statistics for the movies you have data for, that you think would be relevant to the Bechdel test.
</p>

<h2>Solution</h2>

<p>To complete this assignment I have divided it in three parts:</p>
<ol>
    <li>Scrape Data from IMSB and clean it;</li>
    <li>Perform the Bechdel Test on the Data;</li>
    <li>Analyze the Data.</li>
</ol>

<h3>Scraping the Data</h3>
<p>
    To do the scraping I've chosen to use the requests module along with the BeautifulSoup module, the first one handles http requests to URLs and the second one parses the requested document for specific data.
</p>

In [1]:
import requests
from bs4 import BeautifulSoup

<p>
    Looking at the raw html data for the sample script for the movie "Forrest Gump", we can see that all the data related to the script is stored in the class "scrtext", the titles of the scenes and names of the characters are always represented with the bold tag. The clear line breaks after the scene names and character names make the document suitable for Python's open() function. 
</p>

<p>
    The class BechTest handles a single URL. The Get method is used to receive the raw html from the URL and BeautifullSoup is used to extract only the "scrtext" class of the html document.<br>
    After some html tag cleaning we are able to generate a list of all the lines that were marked with the bold tag (blines), which indicates either a camera transition, a change in the scene or the subject we are interested in, a character's line.
    All the computation mentioned above is done in the constructor method.
</p>

In [3]:
class BechTest():
    
    def __init__(self, URL):
        self.f_names = []
        self.m_names = []
        for name in open('files/female.txt', 'r'):
            self.f_names.append(name.rstrip())
        for name in open('files/male.txt', 'r'):
            self.m_names.append(name.rstrip())
        
        #print(URL)
        page = requests.get(URL)
        soup = BeautifulSoup(page.content, 'html.parser')
        
        results = soup.find('td', class_='scrtext').prettify()
        results = results.replace('</b>', '').replace('<pre>', '').split('\n')[1:] #Delete the noise from the first line.
        self.script = results[:-1] #Delete the noise from the last line.
        
        self.blines = []
        for line in self.script:
            if line.startswith('<b>'):
                self.blines.append(line.replace('<b>', '').lstrip())
        
    def isBechdel(self):
        n_scene = True  #Go to next scene
        parser = iter(self.blines)
        if len(self.blines) < 10:
            raise Exception("Invalid Script")
        curr = next(parser).capitalize().rstrip()
        fs_in_scene = set()
    
        try:
            while True:
                #print("Evaluating: %s, next scene: %s" % (curr, n_scene)) 
                if n_scene:
                    while True:
                        if curr in self.f_names:
                            n_scene = False
                            break
                        #print("Evaluating: %s, next scene: %s" % (curr, n_scene)) 
                        curr = next(parser).capitalize().rstrip()
                    continue
                
                else:
                    if curr in self.f_names:
                        fs_in_scene.add(curr.rstrip())
                        if len(fs_in_scene) > 1:
                            return True
                    elif curr in self.m_names:
                        None
                    else:
                        n_scene = True
                            
                curr = next(parser).capitalize().rstrip()
                
        except StopIteration:
            return False  

<h3>Performing the Bechdel Test</h3>
<p>
    The approach I've taken to implement the bechdel test (or .isBechdel()) is simple: we create an iterator on the blines list and evaluate each item, if the item indicates a screen transition we fastfoward to the next female line in the script. If that female line is followed by a screen transition we fastfoward to the next female line until we get at least two named female characters with lines in a single scene, we are guarded from repetition by storing the females in scene in a set. There are mixed rules about the presence of male characters in the scenes, so if we find a male character in the middle of a scene we just ignore it.<br/>
    To understand if the character is male of female the name of the character is run against a list of more than 5000 female names in the 'females.txt' file, the list was obtained at the NLP corpora of the CMU AI repository: <a href="http://www.cs.cmu.edu/afs/cs/project/ai-repository/ai/areas/nlp/corpora/names/">http://www.cs.cmu.edu/afs/cs/project/ai-repository/ai/areas/nlp/corpora/names/</a>.
</p>

<p>
    Next, to obtain the list of scripts to test against the .isBechdel() method, I've extracted the text using copy and paste from the url <a href="https://imsdb.com/all-scripts.html">https://imsdb.com/all-scripts.html</a> and saved it in the "script-list.txt" file, but the text that was extracted had some noise, such as blank lines and author name lines. In order to parse it effectively we need only the movie titles with hifen separation (Forrest-Gump), to do that we use the following script, which will save the script names in the "clean-scripts.txt" file.
</p>

In [4]:
file = open('files/script-list.txt', 'r')
writef = open('files/clean-scripts.txt', 'w')

for line in file:
    if not line.startswith('Written'):
        if not line.startswith('\n'):
            clean_line = line[:line.find('(') - 1]
            script_endpoint = '-'.join(clean_line.split(' '))
            writef.write(script_endpoint.rstrip() +'\n')

<p>
    Finally, to run a smaller but representative sample of the list we just generated against .isBechdel(), we can use the random module to shuffle the list and then run the tests for the first 500 results. The results are saved in the file "results.csv".
</p>

In [5]:
import random

endpoint_list = list(open('files/clean-scripts.txt', 'r'))
results = open('files/results.csv', 'w')

random.shuffle(endpoint_list) #Randomize the list of movies.

#For the 500 first random results, we will create a new object and perform the isBechdel test, then save the 
#results to the results.csv file.
for endpoint in endpoint_list[:500]:
    url = 'https://imsdb.com/scripts/' + endpoint.rstrip() + '.html'
    try:
        bech = BechTest(url)
        results.write(endpoint.rstrip() + ';' + str(bech.isBechdel()) + '\n')
    except:
        #print("invalid-url or error-in-script")
        continue

<h3>Analyzing the Data</h3>
<p>
    To process the results we can use pandas.
</p>

In [6]:
import pandas as pd

In [7]:
df = pd.read_csv('files/results.csv', delimiter=";")
df.columns = ["movie", "bechdel"]

In [14]:
print("We checked %s scripts." % len(df))
bechdel = [result for result in df.bechdel if result]
print("%s scripts passed the test." % len(bechdel))
nope = len(df) - len(bechdel)
print("%s scripts didn't pass the test." % nope)
percent = (len(bechdel) / len(df)) * 100
print("The rate of approval was : %.2f percent" % percent)

We checked 422 scripts.
382 scripts passed the test.
40 scripts didn't pass the test.
The rate of approval was : 90.52 percent


<p>
    As we can see, discarting URL related problems, we came just over 400 movie scripts analyzed; of those, only 40 didn't pass the Bechdel test. That is not the ammount expected for the end result.<br/>
    As it says in the Wikipedia article of the Bechdel test (<a href="https://en.wikipedia.org/wiki/Bechdel_test">https://en.wikipedia.org/wiki/Bechdel_test</a>), about half of all films pass the criteria.<br/>
    This is probably because one of the most important requirements of the Bechdel test is that the women do not talk about a man during the in-scene exchange. Because the test didn't account for that we are off by more than 25 percentage points from the target value. To correct that problem we would need to modify the code to take that into account, maybe looking for keywords related to men like "him", "his" or even man names.   
</p>