<h1>Evolution AI: Bechdel Test Task</h1>

<p>
    We are interested in looking for movies that pass the Bechdel test (https://en.wikipedia.org/wiki/Bechdel_test). We've found a website that has a large collection of movie scripts - http://www.imsdb.com/.<br/>
We would like you to scrape a few hundred scripts from this website (doesn't have to be all of them), and look for instances where a female character is talking to another female character.  Produce some statistics for the movies you have data for, that you think would be relevant to the Bechdel test.
</p>

<h2>Solution</h2>

<p>To complete this assignment, I have divided it into three parts:</p>
<ol>
    <li>Scrape data from IMSDb and clean it;</li>
    <li>Perform the Bechdel Test on the Data;</li>
    <li>Analyze the Data.</li>
</ol>

<h3>Scraping the Data</h3>
<p>
    For the scraping I've chosen the requests module together with the BeautifulSoup module. The first handles HTTP requests to URLs, and the second parses the returned document for specific data.
</p>

In [4]:
import requests
from bs4 import BeautifulSoup

<p>
    Looking at the raw HTML for the sample script of the movie "Forrest Gump", we can see that all the data related to the script is stored in the td element with the class "scrtext", and that scene titles and character names are always wrapped in the bold tag. The clear line breaks after the scene titles and character names make the document easy to process line by line in Python.
</p>
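<p>
    The structure described above can be illustrated without any network access. The sketch below is only an illustration: it swaps BeautifulSoup for the standard library's html.parser so that it runs without extra dependencies, and SAMPLE_HTML and BoldLineCollector are hypothetical names, with the fragment written by hand to imitate the layout rather than copied from a real page.
</p>

```python
from html.parser import HTMLParser

# Hypothetical, hand-written fragment imitating the layout described above;
# it is not copied from a real script page.
SAMPLE_HTML = """
<table><tr><td class="scrtext"><pre>
<b>EXT. A SAVANNAH STREET - DAY</b>
A feather floats through the air.

<b>                    FORREST</b>
          Hello. My name's Forrest.

<b>                    NURSE</b>
          Hello.
</pre></td></tr></table>
<b>SITE FOOTER</b>
"""


class BoldLineCollector(HTMLParser):
    """Collect the text of every <b> element nested inside td.scrtext."""

    def __init__(self):
        super().__init__()
        self.td_depth = 0       # > 0 while inside the td with class "scrtext"
        self.in_bold = False    # True while inside a <b> within that td
        self.bold_lines = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            classes = (dict(attrs).get("class") or "").split()
            # Count nested td elements too, so the matching close is tracked.
            if self.td_depth or "scrtext" in classes:
                self.td_depth += 1
        elif tag == "b" and self.td_depth:
            self.in_bold = True
            self.bold_lines.append("")

    def handle_endtag(self, tag):
        if tag == "td" and self.td_depth:
            self.td_depth -= 1
        elif tag == "b":
            self.in_bold = False

    def handle_data(self, data):
        if self.in_bold:
            self.bold_lines[-1] += data


collector = BoldLineCollector()
collector.feed(SAMPLE_HTML)
collector.close()

# Strip the indentation that scripts use to centre character names.
bold_lines = [line.strip() for line in collector.bold_lines]
print(bold_lines)
```

<p>
    Only the bold lines inside the "scrtext" cell are kept, so the bold footer outside it is ignored. Scene titles and character names end up in one flat list, which is the input the Bechdel check needs.
</p>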

<p>
    The class BechTest handles a single URL. Its constructor uses requests.get to retrieve the raw HTML from the URL and BeautifulSoup to parse it.
</p>

In [None]:
class BechTest():
    
    def __init__(self, URL):
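        # Load the reference list of female first names (one per line) into a set
        # for fast membership tests; the context manager closes the file afterwards.
        with open('files/female.txt', 'r') as names_file:
            self.f_names = {name.strip() for name in names_file}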
        
        print(URL)
        page = requests.get(URL)
        soup = BeautifulSoup(page.content, 'html.parser')
        
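        script_cell = soup.find('td', class_='scrtext')
        if script_cell is None:  #The page has no script container.
            raise Exception("Invalid Script")
        results = script_cell.prettify()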
        results = results.replace('</b>', '').replace('<pre>', '').split('\n')[1:] #Delete the noise from the first line.
        self.script = results[:-1] #Delete the noise from the last line.
        
        self.blines = []
        for line in self.script:
            if line.startswith('<b>'):
                self.blines.append(line.replace('<b>', '').lstrip())
        
    def isBechdel(self):
        n_scene = True  #Go to next scene
        parser = iter(self.blines)
        if len(self.blines) < 10:
            raise Exception("Invalid Script")
        curr = next(parser).capitalize().rstrip()
        fs_in_scene = set()
    
        try:
            while True:
                #print("Evaluating: %s, next scene: %s" % (curr, n_scene)) 
                if n_scene:
                    while True:
                        if curr in self.f_names:
                            n_scene = False
                            break
                        #print("Evaluating: %s, next scene: %s" % (curr, n_scene)) 
                        curr = next(parser).capitalize().rstrip()
                    continue
                
                else:
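                    if curr in self.f_names:
                        fs_in_scene.add(curr)
                        if len(fs_in_scene) > 1:
                            return True
                    else:
                        #A scene heading or a non-female name ends the exchange,
                        #so forget the names collected so far.
                        fs_in_scene.clear()
                        n_scene = True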
                            
                curr = next(parser).capitalize().rstrip()
                
        except StopIteration:
            return False  
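
<p>
    The scan performed by isBechdel can also be sketched as a plain function over a list of bold lines, which makes it easy to try on small hand-made inputs. This is a simplified restatement, not the class itself: has_female_exchange and the sample data are hypothetical, and clearing the set of names whenever a non-female line interrupts the run is an assumption about the intended behaviour.
</p>

```python
def has_female_exchange(bold_lines, female_names):
    """Return True if two different female names appear in one unbroken run.

    A run ends at any bold line that is not a known female name, such as a
    scene heading or a male character's name.
    """
    run = set()
    for line in bold_lines:
        name = line.strip().capitalize()
        if name in female_names:
            run.add(name)
            if len(run) > 1:
                return True
        else:
            # Assumption: an interruption starts a fresh exchange.
            run.clear()
    return False


# Hypothetical data for illustration only.
names = {"Jenny", "Mary", "Louise"}
passing = ["INT. KITCHEN - DAY", "JENNY", "MARY", "FORREST"]
failing = ["INT. KITCHEN - DAY", "JENNY", "FORREST", "EXT. PORCH - NIGHT", "MARY"]

print(has_female_exchange(passing, names))  # True: JENNY is followed directly by MARY
print(has_female_exchange(failing, names))  # False: every run holds a single female name
```

<p>
    Note that this only approximates the first two Bechdel criteria (two women who talk to each other). It does not look at the dialogue itself, so it cannot tell what the conversation is about.
</p>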