### Prepping Data Challenge: The Bechdel Test (Week 10)

If you're unfamiliar, there are 3 criteria to passing the Bechdel Test:

1. The film has to have at least two [named] women in it,
2. who talk to each other,
3. about something besides a man

### Inputs
We are taking our data from this website for this challenge. It has been webscraped to a certain point and we will continue parsing out the data. As part of cleaning up the data, we have to deal with various html codes from different symbols.

### Requirements
- Input the data
- Parse out the data in the Download Data field so that we have one field containing the Movie title and one field containing information about whether of not the movie passes the Bechdel Test
Before we deal with the majority of the html codes, I would recommend replacing <code>&amp;</code> instances with <code>'&'</code> because of this film on the website incorrectly converting the html code 
- Extract the html codes from the Movie titles
  - These will always start with a '&' and end with a ';'
  - The maximum number of html codes in a Movie title is 5
- Replace the html codes with their correct characters
  - Ensure that codes which match up to spaces have a space in their character cell rather than a null value
- Parse out the information for whether a film passes or fails the Bechdel test as well as the detailed reasoning behind this
- Rank the Bechdel Test Categorisations from 1 to 5, 1 being the best result, 5 being the worst result
- Where a film has multiple categorisations, keep only the worse ranking, even if this means the movie moves from pass to fail
- Output the data

In [1]:
import pandas as pd
from re import sub

In [2]:
# Input the data.
with pd.ExcelFile('WK10-PD Bechdel Test.xlsx') as xlsx:
    web = pd.read_excel(xlsx, 'Webscraping')
    html = pd.read_excel(xlsx, 'html')

In [3]:
web.head()

Unnamed: 0,DownloadData,Year
0,"<a href=""http://us.imdb.com/title/tt3155794/"">...",1874
1,"<a href=""http://us.imdb.com/title/tt14495706/""...",1877
2,"<a href=""http://us.imdb.com/title/tt12592084/""...",1878
3,"<a href=""http://us.imdb.com/title/tt2221420/"">...",1878
4,"<a href=""http://us.imdb.com/title/tt7816420/"">...",1881


In [4]:
html.head()

Unnamed: 0,Char,Numeric,Named,Description
0,,code,code,
1,,&#32;,,space
2,!,&#33;,,exclamation mark
3,"""",&#34;,&quot;,double quote
4,#,&#35;,,number


In [5]:
#Parse out the data in the Download Data field so that we have one field containing the Movie title 
#and one field containing information about whether of not the movie passes the Bechdel Test
#web["Movie"] = web['DownloadData'].str.extract('view/[0-9]+/[([\$#:.a-zA-Z?<_&\s=>0-9;,\-()\'"!\[\]\/]+/">([([\$#:.a-zA-Z?_&\s=0-9;,\-()\'"!\[\]\/]+)')
web["Movie"] = web['DownloadData'].str.extract('view/[0-9]+.*?>(.*?)<.*')
web["Categorisation"] = web['DownloadData'].str.extract('\[\[\d]\]"\s+title="\[(.*?)\].*?')
#web["Categorisation"] = web['DownloadData'].str.extract('\[\[\d]\]"\s+title="\[([:.a-zA-Z?<_&\s=>0-9;,\-()\']+)')
web["Pass/Fail"] = web['DownloadData'].str.extract("static*\/([a-z]+)\.png")

In [6]:
web['Movie'].isnull().sum()

0

In [7]:
#web['Movie'].isnull().sum()
html.fillna(" ", inplace = True)

In [8]:
html = html.drop(labels=0, axis=0)

In [9]:
html.head()

Unnamed: 0,Char,Numeric,Named,Description
1,,&#32;,,space
2,!,&#33;,,exclamation mark
3,"""",&#34;,&quot;,double quote
4,#,&#35;,,number
5,$,&#36;,,dollar


In [10]:
#Extract the html codes from the Movie titles
#Replace the html codes with their correct characters
numeric_dict = dict(zip(html['Numeric'], html["Char"]))
named_dict = dict(zip(html['Named'], html["Char"]))
T_dict = [numeric_dict, named_dict]
pattern = ['(?P<html_code>&#\[0-9]+;)', '(?P<html_code>&.*?;)']

In [11]:
#Replace the html codes with their correct characters
for p, t in zip(pattern, T_dict):
    def new_html(n):
        return t.get(n.group('html_code'))
    web['Movie'] = web['Movie'].apply(lambda x: sub(p, new_html, x))

In [12]:
web['Movie'].unique()

array(['Passage de Venus', 'La Rosace Magique', 'Le singe musicien', ...,
       'Without Remorse', 'Zack Snyders Justice League', 'The 355'],
      dtype=object)

In [13]:
web["Categorisation"].unique()

array(['Fewer than two women in this movie',
       "There are two or more women in this movie, but they don't talk to each other",
       'There are two or more women in this movie and they talk to each other about something other than a man',
       'There are two or more women in this movie, but they only talk to each other about a man',
       'There are two or more women in this movie and they talk to each other about something other than a man, although dubious'],
      dtype=object)

In [14]:
#Parse out the information for whether a film passes or fails the Bechdel test as well as the detailed reasoning behind this
web['Pass/Fail'] = web['Pass/Fail'].apply(lambda x: 'Pass' if x == 'pass' else 'Fail')

In [15]:
web['Pass/Fail'].unique()

array(['Fail', 'Pass'], dtype=object)

In [16]:
#Rank the Bechdel Test Categorisations from 1 to 5, 1 being the best result, 5 being the worst result
points = {'There are two or more women in this movie and they talk to each other about something other than a man' : 1, 
          'There are two or more women in this movie and they talk to each other about something other than a man, although dubious' : 2,
          'There are two or more women in this movie, but they only talk to each other about a man' : 3, 
          "There are two or more women in this movie, but they don't talk to each other" : 4, 
          'Fewer than two women in this movie' : 5}
web['Ranking'] = web['Categorisation'].map(points)

In [17]:
#Where a film has multiple categorisations, keep only the worse ranking, even if this means the movie moves from pass to fail
web = web.sort_values(by='Ranking', ascending=False)\
         .drop_duplicates(subset=['Movie','Year'])

In [18]:
df = web[['Movie','Year','Pass/Fail','Ranking','Categorisation']]

In [19]:
df.head()

Unnamed: 0,Movie,Year,Pass/Fail,Ranking,Categorisation
0,Passage de Venus,1874,Fail,5,Fewer than two women in this movie
5921,The Corridor,2010,Fail,5,Fewer than two women in this movie
1297,Reconstruction,1968,Fail,5,Fewer than two women in this movie
1295,The Producers,1968,Fail,5,Fewer than two women in this movie
5897,Ca$h,2010,Fail,5,Fewer than two women in this movie


In [20]:
#output the dataset
df.to_csv('wk10-output.csv', index=False)