## Assignment 3 Data Analysis using Pandas

This assignment consists of one question, detailed below. It is due on Friday, October 16, 2020, at 23:59. Each late day costs 20% of the total points.

### Question 1 (100 points) Celluloid ceiling

Wonder Woman             |  Captain Marvel
:-------------------------:|:-------------------------:
![wonderwoman](https://upload.wikimedia.org/wikipedia/en/e/ed/Wonder_Woman_%282017_film%29.jpg) | ![marvel](https://upload.wikimedia.org/wikipedia/pt/5/59/Captain_Marvel_%282018%29.jpg)

Women are involved in the film industry in all roles, including as film directors, actresses, cinematographers, film producers, film critics, and other film industry professions, though women have been underrepresented in all these positions. Studies found that women have always had a presence in film acting, but have consistently been underrepresented, and on average significantly less well paid. 

In 2015, Forbes reported that "...just 21 of the 100 top-grossing films of 2014 featured a female lead or co-lead, while only 28.1% of characters in 100 top-grossing films were female... This means it’s much rarer for women to get the sort of blockbuster role which would warrant the massive backend deals many male counterparts demand (Tom Cruise in Mission: Impossible or Robert Downey Jr. in Iron Man, for example)".

Also, Forbes' analysis of US acting salaries in 2013 determined that the "...men on Forbes’ list of top-paid actors for that year made 2½ times as much money as the top-paid actresses. That means that Hollywood's best-compensated actresses made just 40 cents for every dollar that the best-compensated men made."


In this assignment, we want to examine whether and how women's representation is lacking in the film industry. We will adopt the Bechdel test as a measure of the representation of women in film. The test is named after the American cartoonist Alison Bechdel, in whose 1985 comic strip Dykes to Watch Out For the test first appeared. **A movie is said to pass the Bechdel test if it meets the following three criteria: (1) it has at least two women in it, (2) who talk to each other, (3) about something besides a man.**

We are going to obtain the data ourselves to perform the analysis. Specifically, we will retrieve the movie metadata from IMDb (Internet Movie Database), an online database of information related to films, television programs, home videos, video games, and streaming content online – including cast, production crew and personal biographies, plot summaries, trivia, ratings, and fan and critical reviews. As of January 2020, IMDb has approximately 6.5 million titles (including episodes) and 10.4 million personalities in its database, as well as 83 million registered users.


The IMDb Top 250 is a list of the 250 top-rated films, based on ratings by the website's registered users. We will focus on these famous movies in this analysis:

**Question 1.1** (20 points): We will retrieve the metadata of IMDb Top 250 movies from the [IMDb charts](https://www.imdb.com/chart/top/). For each movie on the list, we can scrape the following characteristics from the information page. For example, from the [page of top rated movie "The Shawshank Redemption"](https://www.imdb.com/title/tt0111161/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-322d-4646-8962-327b42fe94b1&pf_rd_r=F4QFC0SVZN1HTDHCY3C0&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_tt_1), we want to extract the metadata about this movie as:
- IMDb id (0111161)
- Movie name (The Shawshank Redemption)
- Year (1994)
- Director (Frank Darabont)
- Starring (Tim Robbins, Morgan Freeman, Bob Gunton)
- Rating (9.3)
- Number of reviews (2,291,324)
- Genres (Drama)
- Country (USA)
- Language (English)
- Budget (\$25,000,000)
- Box Office Revenue (\$28,815,291)
- Runtime (142 min)

![imdb](https://mrfloris.com/files/images/imdb-top250-page-start.png)


After scraping the 250 movies, save the data as a dataframe ```imdb_top_movies```. 
Also save the dataframe to a local file ```imdb_top_movies.csv``` so that you can load it later without scraping the website again.

Hint: You can get the links to these movies from the IMDb top chart page, and then scrape each movie page by sending a request to each link. On each movie page, the requested information is located in different sections.
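
The link-collection step in the hint can be tried offline first. The following is a minimal sketch, assuming the chart page links each movie through `/title/tt<id>/` hrefs and that the same movie may be linked more than once. It uses the standard library's `html.parser` in place of BeautifulSoup so it runs without extra packages; `TitleLinkParser`, `unique_title_ids`, and `sample_html` are illustrative names, not part of the assignment.

```python
import re
from html.parser import HTMLParser


class TitleLinkParser(HTMLParser):
    """Collect IMDb ids from <a href="/title/tt..."> links, in page order."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        match = re.match(r"/title/tt(\d+)", href)
        if match:
            self.ids.append(match.group(1))


def unique_title_ids(html):
    """Return the ids in first-seen order with duplicates removed."""
    parser = TitleLinkParser()
    parser.feed(html)
    return list(dict.fromkeys(parser.ids))


# Hypothetical fragment standing in for the chart page: poster and title
# both link to the same movie, so each id can show up twice.
sample_html = """
<a href="/title/tt0111161/"><img alt="poster"></a>
<a href="/title/tt0111161/">The Shawshank Redemption</a>
<a href="/title/tt0068646/">The Godfather</a>
<a href="/chart/top/">Top 250</a>
"""

title_ids = unique_title_ids(sample_html)
movie_urls = ["https://www.imdb.com/title/tt" + i + "/" for i in title_ids]
```

Keeping the ids as strings preserves their leading zeros, which matters later when they are passed to the Bechdel API.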

In [1]:
import requests
import re
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
import string
from scipy import stats

In [None]:
# Question 1.1
#Get links and information to top250 movies
url="https://www.imdb.com/chart/top?ref_=nv_mv_250"
req = requests.get(url)
page = req.text
soup = BeautifulSoup(page, "html.parser")
links=[]
for a in soup.find_all("a"):
    links.append(a.get("href"))
links=['https://www.imdb.com'+a.strip() for a in links if a is not None and a.startswith('/title/tt') ]

top_250_links=[]
for c in links:
    if c not in top_250_links:
        top_250_links.append(c)
column_list=["Rank","Movie_name" ,"IMDb_id" ,"Release_Year" ,"IMDB_Rating" ,
"Reviewer_count","Movie_Length" ,"Genre" , "Director", "Starring", "Budget","Cum_Worldwide_Gross",
"Country","Language"]
df = pd.DataFrame(columns=column_list)

for x in np.arange(0, len(top_250_links)):
    url=top_250_links[x]
    req = requests.get(url)
    page = req.text
    soup = BeautifulSoup(page, "html.parser")

    Movie_name=(soup.find("div",{"class":"title_wrapper"}).get_text(strip=True).split('|')[0]).split('(')[0]
        
    year_released = int(((soup.find("div",{"class":"title_wrapper"}).get_text(strip=True).split('|')[0]).split('(')[1]).split(')')[0].replace("vol. 1","2003"))
    
    imdb_rating = soup.find("span",{"itemprop":"ratingValue"}).text
    
    reviewer_count=soup.find("span",{"itemprop":"ratingCount"}).text
    
    subtext= soup.find("div",{"class":"subtext"}).get_text(strip=True).split('|') #Censor_rating
    if len(subtext)<4:
        censor_rating="Not Rated"
        movie_len=subtext[0]
        genre_list=subtext[1].split(",")
        genre = genre_list[0:len(genre_list)]
    else:
        movie_len=subtext[1]
        genre_list=subtext[2].split(",")
        genre = genre_list[0:len(genre_list)]
    
    if int(re.sub("[^0-9]", "", movie_len.split()[0])) > 10:
        movie_len = int(re.sub("[^0-9]", "", movie_len.split()[0]))
    else:
        if len(movie_len.split()) == 2:
            movie_len = int(re.sub("[^0-9]", "", movie_len.split()[0]))*60 + int(re.sub("[^0-9]", "", movie_len.split()[1]))
        else:
            movie_len = int(re.sub("[^0-9]", "", movie_len.split()[0]))*60
    
    b=[]
    for a in soup.find_all("div",{"class":"credit_summary_item"}):
        c=re.split(r',|:|\|',a.get_text(strip=True))         
        b.append(c)
    stars=b.pop()
    writers=b.pop()
    directors=b.pop()
    if "See full cast & crew»" in stars: stars.remove("See full cast & crew»")
    if "1 more credit»" in writers: writers.remove("1 more credit»") 
    if "1 more credit»" in directors: directors.remove("1 more credit»")
    stars=stars[1:]
    directors=directors[1:]
    while len(stars)<5:         stars.append(" ")

    starring = stars[0:3]
    
    director=directors[0]
        
    
    b=[]                
    d={"Budget":"", "Cumulative Worldwide Gross":"","Country":"","Language":""}
    for a in soup.find_all("div",{"class":"txt-block"}):
        c=a.get_text(strip=True).split(':')
        if c[0] in d:
            b.append(c)

    for i in b:            
            if i[0] in d: 
                d.update({i[0]:i[1]})                
    # Keep only the digits, dropping currency symbols/codes and "(estimated)"; amounts are not converted between currencies
    cum_world_gross=int(re.sub(r"[^0-9]", "", d["Cumulative Worldwide Gross"].split(" ")[0]) or 0)
    budget=int(re.sub(r"[^0-9]", "", d["Budget"]) or 0)
    language = d["Language"].split("|")
    country = d["Country"].split("|")
    
    movie_dict={
        "Rank":x+1,
        "Movie_name" : Movie_name,
        "IMDb_id" : re.search(r"tt(\d+)", url).group(1),  # digits after "tt"; str.strip() removes a character set, not a prefix
        "Release_Year" : year_released,
        "IMDB_Rating" : float(imdb_rating),
        "Reviewer_count" : int(reviewer_count.replace(",","")),
        "Movie_Length" : movie_len,
        "Genre" : genre,
        "Country" : country,
        "Language" : language,
        "Director" : director,
        "Starring" : starring,
        "Budget" : budget,
        "Cum_Worldwide_Gross" : cum_world_gross,
        }
    
    # DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
    df = pd.concat([df, pd.DataFrame.from_records([movie_dict])])
df=df[column_list]  
df=df.set_index(["Rank"], drop=False)
df

#If the budget or the global revenue is missing, it has been replaced by a 0 in the data

In [5]:
df.to_csv("imdb_top_movies.csv")
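
One caveat when reloading the CSV: list-valued columns are written as their string representation, and ids with leading zeros are parsed as integers unless told otherwise. The sketch below shows one way to restore both, assuming the column names used above. The two rows are illustrative, and an in-memory buffer stands in for `imdb_top_movies.csv`.

```python
import ast
import io

import pandas as pd

# Two illustrative rows shaped like the scraped data
movies = pd.DataFrame({
    "IMDb_id": ["0111161", "0068646"],
    "Genre": [["Drama"], ["Crime", "Drama"]],
})

buffer = io.StringIO()  # stands in for imdb_top_movies.csv
movies.to_csv(buffer, index=False)
buffer.seek(0)

list_columns = ["Genre"]  # in the full frame also Starring, Country, Language
restored = pd.read_csv(
    buffer,
    dtype={"IMDb_id": str},  # keep leading zeros
    converters={c: ast.literal_eval for c in list_columns},  # "['Drama']" -> ['Drama']
)
```

Without the `dtype` and `converters` arguments, `IMDb_id` would come back as `111161` and `Genre` as plain strings, which breaks the later genre loops and the API lookups.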

**Question 1.2** (5 points) Group the movies by release year and show the number of movies in each decade, in descending order.

In [6]:
# Question 1.2
decades = df.groupby(pd.cut(df["Release_Year"], np.arange(1910, 2030, 10))).size()
decades.sort_values(ascending = False)

Release_Year
(2010, 2020]    47
(2000, 2010]    46
(1990, 2000]    45
(1980, 1990]    26
(1970, 1980]    22
(1950, 1960]    22
(1960, 1970]    16
(1940, 1950]    11
(1930, 1940]     8
(1920, 1930]     7
(1910, 1920]     0
dtype: int64
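
Note that the intervals above are right-closed, so a movie released in 2000 is counted in `(1990, 2000]`. If calendar decades (1990 to 1999, 2000 to 2009, ...) are wanted instead, integer division is one alternative. A small sketch on made-up years, not on the scraped data:

```python
import pandas as pd

years = pd.Series([1994, 1999, 2000, 2008, 2009, 2010], name="Release_Year")  # made-up sample

# Floor each year to the start of its decade: 1999 -> 1990, 2000 -> 2000
decade = (years // 10) * 10
counts = decade.value_counts()  # sorted by count, descending

# For comparison: the right-closed binning puts the year 2000 into (1990, 2000]
cut = pd.cut(years, bins=[1990, 2000, 2010])
```

Here `decade` assigns 2000 to the 2000s, while `cut` assigns it to the interval ending in 2000. Passing `right=False` to `pd.cut` would be another way to get left-closed bins.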

**Question 1.3** (5 points) Show the number of movies in each quartile of the runtime distribution (0-25%, 25-50%, 50-75%, 75-100%).

In [7]:
# Question 1.3
dis = df.groupby(pd.qcut(df["Movie_Length"], 4)).size().reset_index()
dis = pd.DataFrame(dis)
dis.columns = ["Movie Length", "Number of movies"]
dis["Quartile"] = ["0 - 25%","25% - 50%","50% - 75%","75% - 100%"]
order = ["Quartile", "Movie Length", "Number of movies"]
dis = dis[order]
dis

| | Quartile | Movie Length | Number of movies |
|---|---|---|---|
| 0 | 0 - 25% | (44.999, 107.25] | 63 |
| 1 | 25% - 50% | (107.25, 126.5] | 62 |
| 2 | 50% - 75% | (126.5, 146.0] | 64 |
| 3 | 75% - 100% | (146.0, 321.0] | 61 |


**Question 1.4** (5 points) What is the proportion of movies that have Budget higher than 75% of all movies (i.e. the third quartile)?

In [8]:
# Question 1.4
# Share of movies whose budget lies strictly above the third quartile of all budgets
proportion = (df["Budget"] > df["Budget"].quantile(0.75)).mean()

print("The proportion of movies that have a budget higher than 75% of all movies is", round(proportion*100, 2), "%\nThis is expected, since each quartile contains about 25% of the movies.")

The proportion of movies that have a budget higher than 75% of all movies is 25.2 %
This is expected, since each quartile contains about 25% of the movies.


**Question 1.5** (5 points) Show the top 10 most popular actors/actresses in terms of the number of movies they have starred in.

In [9]:
# Question 1.5
di = {}
x  = 1
while x < (len(df)+1):
    a = df["Starring"][x]
    for i in a:
        if i in di:
            di[i] = di[i] + 1
        else:
            di[i] = 1
    x = x+1
max_key = sorted(di, key=di.get, reverse=True)[:10]
for i in max_key:
    print(i,di[i])

Robert De Niro 9
Tom Hanks 6
Leonardo DiCaprio 6
Harrison Ford 6
Charles Chaplin 6
Christian Bale 5
Clint Eastwood 5
Morgan Freeman 4
Al Pacino 4
Brad Pitt 4
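
The same count can be written without an explicit loop. The sketch below uses `Series.explode` and `value_counts` on a made-up frame with placeholder names; it also drops the blank `" "` padding that the scraping cell appends when a movie lists fewer stars.

```python
import pandas as pd

cast = pd.DataFrame({
    "Movie_name": ["A", "B", "C"],  # placeholder titles
    "Starring": [["Ann", "Bob", " "], ["Bob", "Cy"], ["Bob", "Ann"]],
})

stars = cast["Starring"].explode()        # one row per (movie, star)
stars = stars[stars.str.strip() != ""]    # drop blank padding entries
top_actors = stars.value_counts().head(10)
```

`value_counts` already sorts in descending order, so `head(10)` gives the ten most frequent names.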


**Question 1.6** (5 points) Show the top 5 directors with the most total box office revenues.

In [10]:
# Question 1.6
di_16 = {}
x  = 1
while x < (len(df)+1):
    a = df["Director"][x]
    if a in di_16:
        di_16[a] = di_16[a] + df["Cum_Worldwide_Gross"][x]
    else:
        di_16[a] = df["Cum_Worldwide_Gross"][x]
    x = x+1
max_key = sorted(di_16, key=di_16.get, reverse=True)[:5]
for i in max_key:
    print(i,":",di_16[i])

Anthony Russo : 4846160318
Christopher Nolan : 4143007170
Steven Spielberg : 3055115821
Peter Jackson : 2973971329
Pete Docter : 2172645522
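
A grouped sum expresses the same aggregation more compactly. This is a sketch on made-up numbers with placeholder director names; it assumes the revenue column is numeric (if the scraped column has dtype `object`, converting it with `pd.to_numeric` first may be needed).

```python
import pandas as pd

films = pd.DataFrame({
    "Director": ["D1", "D2", "D1", "D3"],        # placeholder names
    "Cum_Worldwide_Gross": [100, 250, 300, 50],  # made-up amounts
})

# Total revenue per director, largest first
top_directors = films.groupby("Director")["Cum_Worldwide_Gross"].sum().nlargest(5)
```

As in the loop above, only the first listed director of each movie is credited, so co-directed films count toward one person.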


**Question 1.7** (5 points) Show the average ratings of movies across the genres and decades.

In [11]:
# Question 1.7
#average ratings of movies across the genres
#Films that are assigned to multiple genres are considered multiple times
all_genre = []
for i in df["Genre"]:
    for g in i:
        if g not in all_genre:
            all_genre.append(g)
dict_Genre_IMDb_Rating = {}
for g in all_genre:
    l = []
    x = 1
    for i in df["Genre"]:
        if g in i:
            l.append(df.loc[x,"IMDB_Rating"])
        x  = x+1    
    dict_Genre_IMDb_Rating[g] = l

for g in all_genre:    
    Average = round(sum(dict_Genre_IMDb_Rating[g])/len(dict_Genre_IMDb_Rating[g]),4)
    print("Genre:",g, "\nAverage rating:", Average)

Genre: Drama 
Average rating: 8.3106
Genre: Crime 
Average rating: 8.3444
Genre: Action 
Average rating: 8.3614
Genre: Biography 
Average rating: 8.2615
Genre: History 
Average rating: 8.2429
Genre: Adventure 
Average rating: 8.3054
Genre: Western 
Average rating: 8.4
Genre: Romance 
Average rating: 8.2565
Genre: Sci-Fi 
Average rating: 8.3136
Genre: Fantasy 
Average rating: 8.3357
Genre: Mystery 
Average rating: 8.2933
Genre: Comedy 
Average rating: 8.2395
Genre: Thriller 
Average rating: 8.2575
Genre: Family 
Average rating: 8.3214
Genre: War 
Average rating: 8.285
Genre: Animation 
Average rating: 8.2545
Genre: Music 
Average rating: 8.4
Genre: Horror 
Average rating: 8.35
Genre: Film-Noir 
Average rating: 8.2667
Genre: Musical 
Average rating: 8.4
Genre: Sport 
Average rating: 8.1667


In [12]:
#average ratings of movies across the decades
AvRating_Dec = df.groupby(pd.cut(df["Release_Year"], np.arange(1920, 2030, 10)))["IMDB_Rating"].mean()
AvRating_Dec

Release_Year
(1920, 1930]    8.185714
(1930, 1940]    8.262500
(1940, 1950]    8.300000
(1950, 1960]    8.290909
(1960, 1970]    8.300000
(1970, 1980]    8.368182
(1980, 1990]    8.257692
(1990, 2000]    8.375556
(2000, 2010]    8.308696
(2010, 2020]    8.259574
Name: IMDB_Rating, dtype: float64
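
The two breakdowns can also be combined into a single genre-by-decade table. The following sketch runs on three made-up rows; it assumes list-valued `Genre` entries as above and uses floor division for the decade, so its decade boundaries differ from the right-closed bins used in the previous cell.

```python
import pandas as pd

films = pd.DataFrame({
    "Release_Year": [1994, 1999, 2008],          # made-up sample
    "Genre": [["Drama"], ["Drama", "Crime"], ["Crime"]],
    "IMDB_Rating": [9.0, 8.0, 8.5],
})

# One row per (movie, genre); multi-genre movies count once per genre
long = films.explode("Genre").reset_index(drop=True)
long["Decade"] = (long["Release_Year"] // 10) * 10

by_genre = long.groupby("Genre")["IMDB_Rating"].mean()
by_genre_decade = long.pivot_table(
    index="Genre", columns="Decade", values="IMDB_Rating", aggfunc="mean"
)
```

Cells with no movies for a genre and decade come out as `NaN`, which keeps empty combinations distinguishable from low averages.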

**Question 1.8** (5 points) Create a new column ```ROI``` that measures the return on investment as (box office revenue - budget)/budget, and compare the ROI between English-language and non-English-language movies. Use a t-test to examine whether the difference is statistically significant (you can use ```scipy.stats.ttest_ind``` to test the difference in means between two samples).

In [15]:
# Question 1.8
ROI = []
x = 1
while x < (len(df)+1):
    if df["Budget"][x] == 0:
        roi = 0
    else:
        roi = ((df["Cum_Worldwide_Gross"][x] - df["Budget"][x])/df["Budget"][x])
    ROI.append(roi)
    x = x+1
    
df["ROI"] = ROI

y = 1
ROI_eng = []
Count_eng = 0
ROI_Neng = []
Count_Neng = 0
while y < (len(df)+1):
    eng = False
    for j in df["Language"][y]:
        if j == "English":
            eng = True
            
    if eng == False:
        ROI_Neng.append(df["ROI"][y])
        Count_Neng = Count_Neng + 1
    else:
        ROI_eng.append(df["ROI"][y])
        Count_eng = Count_eng + 1
            
    y = y+1

print(round(np.mean(ROI_eng),4), "is the average ROI of ", Count_eng, "movies in English.")
print(round(np.mean(ROI_Neng),4), "is the average ROI of ", Count_Neng,"movies not in English.")
print(round(np.mean(ROI_eng) - np.mean(ROI_Neng),4), "is the difference in average ROI between English and non-English movies.")

# Welch's t-test (unequal variances), computed once and reused
ttest = stats.ttest_ind(ROI_eng,ROI_Neng, equal_var = False)
print(ttest)
print(ttest[1])
print("The p-value of the t-test is",round(ttest[1],4),"(> 0.05), so we cannot reject the null hypothesis of identical average ROIs.")

7.1881 is the average ROI of  201 movies in English.
4.2764 is the average ROI of  49 movies not in English.
2.9117 is the difference in average ROI between English and non-English movies.
Ttest_indResult(statistic=1.7363988429696204, pvalue=0.08536797951389721)
0.08536797951389721
The p-value of the t-test is 0.0854 (> 0.05), so we cannot reject the null hypothesis of identical average ROIs.
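
The cell above assigns an ROI of 0 to movies whose budget is missing, which pulls both group means toward zero. An alternative is to mark those cases as missing and leave them out of the comparison. The sketch below shows this on made-up numbers and also spells out the Welch statistic that `ttest_ind(..., equal_var=False)` is based on; `welch_t` is a helper written for this illustration.

```python
import math

import numpy as np
import pandas as pd

films = pd.DataFrame({
    "Budget": [10, 0, 20, 5],                 # 0 marks a missing budget
    "Cum_Worldwide_Gross": [30, 40, 10, 20],  # made-up amounts
})

budget = films["Budget"].replace(0, np.nan)   # missing budget -> NaN instead of ROI 0
films["ROI"] = (films["Cum_Worldwide_Gross"] - budget) / budget


def welch_t(a, b):
    """Welch's t statistic for two samples, ignoring missing values."""
    a, b = pd.Series(a).dropna(), pd.Series(b).dropna()
    se = math.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    return (a.mean() - b.mean()) / se
```

With this variant the group sizes shrink to the movies that have a known budget, so the resulting means and p-value would differ from the figures printed above.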


**Question 1.9** (5 points) Do commercially successful movies also receive higher ratings? Check the correlation between box office revenue and rating using the Pearson and Spearman correlations.

In [16]:
#Question 1.9
CWG = df["Cum_Worldwide_Gross"].astype(float)
IR = df["IMDB_Rating"].astype(float)

#Spearman
cor_Spea = CWG.corr(IR, method = "spearman")
print("Spearman correlation is", round(cor_Spea,4))


#Pearson
cor_Pear = CWG.corr(IR, method = "pearson")
print("Pearson correlation is", round(cor_Pear,4))

print("Both correlations are weak, so a high rating does not guarantee commercial success.")

Spearman correlation is 0.1466
Pearson correlation is 0.2047
Both correlations are weak, so a high rating does not guarantee commercial success.
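
To see why the two coefficients can disagree, it helps to look at a relationship that is monotonic but not linear. The toy example below computes Spearman as the Pearson correlation of the ranks, which is what the rank correlation amounts to; the data are made up for illustration.

```python
import pandas as pd

x = pd.Series([1, 2, 3, 4, 5], dtype=float)
y = x ** 3  # strictly increasing, but not linear in x

pearson = x.corr(y)                   # measures linear association
spearman = x.rank().corr(y.rank())    # Pearson on ranks = Spearman
```

Spearman reaches 1 here because the ordering is perfectly preserved, while Pearson stays below 1. Revenue figures are heavily skewed, so the rank-based measure is often the more robust of the two for this kind of data.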


**Question 1.10** (10 points) Now let's retrieve data from Bechdel Test Movie website [for each movie](https://bechdeltest.com/). You can send the requests to the API: https://bechdeltest.com/api/v1/doc#getMovieByImdbId. For example, for the movie The Shawshank Redemption (the IMDb id: 0111161), you can simply call: http://bechdeltest.com/api/v1/getMovieByImdbId?imdbid=0111161. 

Create a dataframe ```bechdel_imdb_top``` that merges the Bechdel test info with ```imdb_top_movies```, and show how many of the top 250 movies are also on the Bechdel test website.
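
One way to separate listed from unlisted movies is to parse each response body and treat an error status as "not listed". The sketch below works on hypothetical JSON strings rather than live requests. The field names follow the columns used in this assignment, the status values are the ones checked in this notebook, and `parse_bechdel_response` is a helper name made up for illustration.

```python
import json


def parse_bechdel_response(body):
    """Return the movie record as a dict, or None if the API reported an error."""
    data = json.loads(body)
    # "404"/"403" status values are treated as "no record for this id"
    if data.get("status") in ("404", "403"):
        return None
    return data


# Hypothetical response bodies for illustration only
found = parse_bechdel_response(
    '{"imdbid": "0111161", "rating": 0, "title": "Shawshank Redemption, The"}'
)
missing = parse_bechdel_response('{"status": "404"}')
```

Keeping the parsing in a small function makes it possible to test the logic without sending 250 requests to the website.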

In [17]:
import urllib.request, json

# Query the Bechdel Test API once per movie: keep found records as rows
# (indexed by IMDb rank) and remember the names of movies the site does not list.
bechdel_columns = ["rating","submitterid","title","imdbid","year","id","date","visible","dubious"]
rows = []
ranks = []
not_in_bech = []
for rank, (imdb_id, name) in enumerate(zip(df["IMDb_id"], df["Movie_name"]), start=1):
    with urllib.request.urlopen("http://bechdeltest.com/api/v1/getMovieByImdbId?imdbid="+imdb_id) as url:
        data = json.loads(url.read().decode())
    if data.get("status") in ("404", "403"):
        not_in_bech.append(name)
    else:
        rows.append(data)
        ranks.append(rank)
# Build the frame once; DataFrame.append was removed in pandas 2.0
df_f = pd.DataFrame(rows, index=ranks, columns=bechdel_columns)

print(not_in_bech)
df_f

['Harakiri', 'Hamilton', 'Tengoku to jigoku', 'Cafarnaum', 'Taare Zameen Par', 'Bacheha-Ye aseman', 'Anand', 'O Julgamento de Nuremberga', 'Vikram Vedha', 'Babam ve Oglum', 'O Homem Elefante', 'Eskiya', 'Hotel Ruanda', 'Gangs of Wasseypur', 'Drishyam']


| | rating | submitterid | title | imdbid | year | id | date | visible | dubious |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 1 | Shawshank Redemption, The | 0111161 | 1994 | 339 | 2009-06-13 14:43:18 | 1 | |
| 2 | 2 | 3113 | Godfather, The | 0068646 | 1972 | 2224 | 2011-04-23 18:52:32 | 1 | 0 |
| 3 | 2 | 6585 | Godfather: Part II, The | 0071562 | 1974 | 3896 | 2013-02-14 11:01:57 | 1 | 0 |
| 4 | 3 | 1 | Dark Knight, The | 0468569 | 2008 | 66 | 2008-07-23 00:00:00 | 1 | 1 |
| 5 | 0 | 17 | 12 Angry Men | 0050083 | 1957 | 302 | 2009-03-24 15:18:12 | 1 | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 246 | 1 | 406 | La battaglia di Algeri | 0058946 | 1966 | 658 | 2010-01-22 01:40:20 | 1 | 0 |
| 247 | 3 | 1587 | Terminator, The | 0088247 | 1984 | 1418 | 2010-08-01 13:08:15 | 1 | 0 |
| 248 | 0 | 1 | Aladdin | 0103639 | 1992 | 98 | 2008-07-30 00:36:20 | 1 | |
| 249 | 0 | 14709 | Tangerines | 2991224 | 2013 | 7533 | 2017-03-26 18:04:55 | 1 | 0 |


In [20]:
in_bech = len(df)-len(not_in_bech)
print(in_bech, "of top 250 movies are also in the bechdel test website.")
print("\nOut of the IMDb Top250 movies the following are not on the Bechdel test website.", not_in_bech)

235 of top 250 movies are also in the bechdel test website.

Out of the IMDb Top250 movies the following are not on the Bechdel test website. ['Harakiri', 'Hamilton', 'Tengoku to jigoku', 'Cafarnaum', 'Taare Zameen Par', 'Bacheha-Ye aseman', 'Anand', 'O Julgamento de Nuremberga', 'Vikram Vedha', 'Babam ve Oglum', 'O Homem Elefante', 'Eskiya', 'Hotel Ruanda', 'Gangs of Wasseypur', 'Drishyam']
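
The same count can be cross-checked at merge time with the `indicator` option of `DataFrame.merge`, which labels each row by where its key was found. A sketch on three illustrative rows follows; the id `0000000` is a placeholder for a movie that is not in the Bechdel data, not a real IMDb id.

```python
import pandas as pd

top = pd.DataFrame({
    "IMDb_id": ["0111161", "0068646", "0000000"],  # last id is a placeholder
    "Movie_name": ["The Shawshank Redemption", "The Godfather", "Harakiri"],
})
bechdel = pd.DataFrame({"imdbid": ["0111161", "0068646"], "rating": [0, 2]})

# how="left" keeps every top movie; _merge records whether a Bechdel row matched
merged = top.merge(bechdel, left_on="IMDb_id", right_on="imdbid",
                   how="left", indicator=True)
matched = int((merged["_merge"] == "both").sum())
unmatched = merged.loc[merged["_merge"] == "left_only", "Movie_name"].tolist()
```

Starting the merge from the IMDb side keeps unmatched movies visible in the result instead of dropping them silently.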


In [21]:
df_fn = df_f[["imdbid","visible","rating","id","dubious"]]
bechdel_imdb_top = pd.merge(df_fn, df, left_on=["imdbid"],
                   right_on= ["IMDb_id"], 
                   how = 'left')
bechdel_imdb_top = bechdel_imdb_top.set_index("Rank").drop(columns="imdbid")
#merged Dataframe containing the movies on the Bechdel Website which are also in the top 250 of IMDb
bechdel_imdb_top

| Rank | visible | rating | id | dubious | Movie_name | IMDb_id | Release_Year | IMDB_Rating | Reviewer_count | Movie_Length | Genre | Director | Starring | Budget | Cum_Worldwide_Gross | Country | Language | ROI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 0 | 339 | | Os Condenados de Shawshank | 0111161 | 1994 | 9.3 | 2297698 | 142 | [Drama] | Frank Darabont | [Tim Robbins, Morgan Freeman, Bob Gunton] | 25000000 | 28815291 | [USA] | [English] | 0.152612 |
| 2 | 1 | 2 | 2224 | 0 | O Padrinho | 0068646 | 1972 | 9.2 | 1585756 | 175 | [Crime, Drama] | Francis Ford Coppola | [Marlon Brando, Al Pacino, James Caan] | 6000000 | 246120986 | [USA] | [English, Italian, Latin] | 40.020164 |
| 3 | 1 | 2 | 3896 | 0 | O Padrinho: Parte II | 0071562 | 1974 | 9.0 | 1107919 | 202 | [Crime, Drama] | Francis Ford Coppola | [Al Pacino, Robert De Niro, Robert Duvall] | 13000000 | 48035783 | [USA] | [English, Italian, Spanish, Latin, Sicilian] | 2.695060 |
| 4 | 1 | 3 | 66 | 1 | O Cavaleiro das Trevas | 0468569 | 2008 | 9.0 | 2261899 | 152 | [Action, Crime, Drama] | Christopher Nolan | [Christian Bale, Heath Ledger, Aaron Eckhart] | 185000000 | 1005456758 | [USA, UK] | [English, Mandarin] | 4.434901 |
| 5 | 1 | 0 | 302 | | Doze Homens em Fúria | 0050083 | 1957 | 8.9 | 675116 | 96 | [Crime, Drama] | Sidney Lumet | [Henry Fonda, Lee J. Cobb, Martin Balsam] | 350000 | 576 | [USA] | [English] | -0.998354 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 246 | 1 | 1 | 658 | 0 | A Batalha de Argel | 0058946 | 1966 | 8.1 | 51765 | 121 | [Drama, War] | Gillo Pontecorvo | [Brahim Hadjadj, Jean Martin, Yacef Saadi] | 800000 | 964028 | [Italy, Algeria] | [Arabic, French, English] | 0.205035 |
| 247 | 1 | 3 | 1418 | 0 | O Exterminador Implacável | 0088247 | 1984 | 8.0 | 789830 | 107 | [Action, Sci-Fi] | James Cameron | [Arnold Schwarzenegger, Linda Hamilton, Michae... | 6400000 | 78680331 | [UK, USA] | [English, Spanish] | 11.293802 |
| 248 | 1 | 0 | 98 | | Aladdin | 0103639 | 1992 | 8.0 | 367789 | 90 | [Animation, Adventure, Comedy] | Ron Clements | [Scott Weinger, Robin Williams, Linda Larkin] | 28000000 | 504050219 | [USA] | [English] | 17.001794 |
| 249 | 1 | 0 | 7533 | 0 | Tangerinas | 2991224 | 2013 | 8.2 | 38325 | 87 | [Drama, War] | Zaza Urushadze | [Lembit Ulfsak, Elmo Nüganen, Giorgi Nakashidze] | 650000 | 1024132 | [Estonia, Georgia] | [Estonian, Russian, Georgian] | 0.575588 |


**Question 1.11** (5 points) Show the percentage of movies at each test rating, from 0 to 3 (0 means fewer than two women, 1 means the women do not talk to each other, 2 means they only talk about a man, 3 means the movie passes the test).

In [22]:
# Share of movies at each Bechdel rating (0 to 3); only a rating of 3 passes the test
passed = bechdel_imdb_top["rating"].value_counts(normalize=True)
for i in passed.index:
    print(round(passed[i]*100,2), "% of the movies have a rating of",i)


34.89 % of the movies have a rating of 3
34.47 % of the movies have a rating of 1
20.85 % of the movies have a rating of 0
9.79 % of the movies have a rating of 2


**Question 1.12** (5 points) For each genre, show the percentage of movies at each test rating, from 0 to 3 (0 means fewer than two women, 1 means the women do not talk to each other, 2 means they only talk about a man, 3 means the movie passes the test).

In [23]:
all_genre = []
for i in bechdel_imdb_top["Genre"]:
    for g in i:
        if g not in all_genre:
            all_genre.append(g)
print(all_genre)

['Drama', 'Crime', 'Action', 'Biography', 'History', 'Adventure', 'Western', 'Romance', 'Sci-Fi', 'Fantasy', 'Mystery', 'Comedy', 'Thriller', 'Family', 'War', 'Animation', 'Music', 'Horror', 'Film-Noir', 'Musical', 'Sport']


In [24]:
#Films that are assigned to more than one genre are considered more than once. 
dict_Genre_Rating = {}
for g in all_genre:
    # Bechdel ratings of all movies tagged with genre g (looked up by column name, not position)
    dict_Genre_Rating[g] = [rating for genres, rating in zip(bechdel_imdb_top["Genre"], bechdel_imdb_top["rating"]) if g in genres]

In [25]:
# Percentage of movies at each Bechdel rating, per genre
ratings = [0,1,2,3]
for g in all_genre:
    print("Genre: ", g)
    for r in ratings:
        print("Rating: ", r, round((dict_Genre_Rating[g].count(r)/len(dict_Genre_Rating[g])*100),4),"%")
        

Genre:  Drama
Rating:  0 20.4819 %
Rating:  1 35.5422 %
Rating:  2 11.4458 %
Rating:  3 32.5301 %
Genre:  Crime
Rating:  0 16.3265 %
Rating:  1 38.7755 %
Rating:  2 14.2857 %
Rating:  3 30.6122 %
Genre:  Action
Rating:  0 17.0732 %
Rating:  1 34.1463 %
Rating:  2 0.0 %
Rating:  3 48.7805 %
Genre:  Biography
Rating:  0 13.0435 %
Rating:  1 43.4783 %
Rating:  2 13.0435 %
Rating:  3 30.4348 %
Genre:  History
Rating:  0 16.6667 %
Rating:  1 25.0 %
Rating:  2 8.3333 %
Rating:  3 50.0 %
Genre:  Adventure
Rating:  0 23.2143 %
Rating:  1 35.7143 %
Rating:  2 5.3571 %
Rating:  3 35.7143 %
Genre:  Western
Rating:  0 50.0 %
Rating:  1 33.3333 %
Rating:  2 0.0 %
Rating:  3 16.6667 %
Genre:  Romance
Rating:  0 26.087 %
Rating:  1 17.3913 %
Rating:  2 8.6957 %
Rating:  3 47.8261 %
Genre:  Sci-Fi
Rating:  0 9.0909 %
Rating:  1 27.2727 %
Rating:  2 9.0909 %
Rating:  3 54.5455 %
Genre:  Fantasy
Rating:  0 7.1429 %
Rating:  1 28.5714 %
Rating:  2 14.2857 %
Rating:  3 50.0 %
Genre:  Mystery
Rating:  0 18
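
The per-genre percentages can also be produced as one table with `pd.crosstab`. The sketch below uses four made-up movies; it assumes list-valued `Genre` entries and integer ratings from 0 to 3, as in the merged frame.

```python
import pandas as pd

films = pd.DataFrame({
    "Genre": [["Drama"], ["Drama", "Crime"], ["Crime"], ["Drama"]],
    "rating": [3, 1, 3, 0],  # Bechdel rating, 0 to 3 (made-up values)
})

# One row per (movie, genre); multi-genre movies count once per genre
long = films.explode("Genre").reset_index(drop=True)

# Row-normalised counts: each genre's ratings sum to 100 %
share = pd.crosstab(long["Genre"], long["rating"], normalize="index") * 100
# Make sure all four ratings appear as columns, even if one never occurs
share = share.reindex(columns=[0, 1, 2, 3], fill_value=0.0)
```

A table with genres as rows and ratings as columns is easier to scan than the printed list, and it can be sorted by the rating-3 column to rank genres by pass rate.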

**Question 1.13** (5 points) Show the top 10 highest-rated movies that passed the test completely (rating=3) 

In [26]:
# bechdel_imdb_top is ordered by IMDb rank, so the first 10 rows are the highest-rated movies
df_f_HR = bechdel_imdb_top[bechdel_imdb_top["rating"]==3]
for i in df_f_HR["Movie_name"].head(10):
    print(i)
print("top 10 highest-rated movies that passed the test completely (rating=3)")

O Cavaleiro das Trevas
A Lista de Schindler
Pulp Fiction
A Origem
O Senhor dos Anéis - As Duas Torres
Matrix
Tudo Bons Rapazes
O Silêncio dos Inocentes
Do Céu Caiu Uma Estrela
A Viagem de Chihiro
top 10 highest-rated movies that passed the test completely (rating=3)


**Question 1.14** (5 points) Comparing the movies that passed (rating=3) with those that failed the test (rating=0), are their ROIs different? Explain.

In [27]:
df_f_HR = bechdel_imdb_top[bechdel_imdb_top["rating"]==3]
df_f_LR = bechdel_imdb_top[bechdel_imdb_top["rating"]==0]
print(round(np.mean(df_f_LR["ROI"]),4),"ROI for movies that failed (rating = 0)")
print(round(np.mean(df_f_HR["ROI"]),4),"ROI for movies that passed (rating = 3)")
diff_in_roi = np.mean(df_f_HR["ROI"])-np.mean(df_f_LR["ROI"])
print(round(diff_in_roi,4),"Empirical difference in ROI between movies which passed the test and the ones which failed.")
# Welch's t-test (unequal variances) on the two ROI samples
roi_passed = list(df_f_HR["ROI"])
roi_failed = list(df_f_LR["ROI"])
p_value = stats.ttest_ind(roi_passed,roi_failed, equal_var = False)[1]
print("The p-value of the t-test is",round(p_value,4),"(> 0.05), so we cannot reject the null hypothesis of identical average ROIs for movies which passed the test and the ones which failed")
print("Empirically the ROI differs, but the difference is not statistically significant!")

4.7564 ROI for movies that failed (rating = 0)
7.7107 ROI for movies that passed (rating = 3)
2.9543 Empirical difference in ROI between movies which passed the test and the ones which failed.
The p-value of the t-test is 0.1364 (> 0.05), so we cannot reject the null hypothesis of identical average ROIs for movies which passed the test and the ones which failed
Empirically the ROI differs, but the difference is not statistically significant!


**Question 1.15** (10 points) Now load ```bechdel_imdb.json```, which contains all movies rated by the Bechdel Test website, into a dataframe ```bechdel_imdb```. Has women's representation improved over the decades? Comparing the top 250 movies with all other movies, what percentage of each group passed or failed the test?

In [41]:
#load the json file from the file directory
with open("bechdel_imdb.json") as f:
  data = json.load(f)

bechdel_imdb = pd.DataFrame(data) 
bechdel_imdb

| | year | imdbid | rating | title | id |
|---|---|---|---|---|---|
| 0 | 1888 | 0392728 | 0 | Roundhay Garden Scene | 8040 |
| 1 | 1892 | 0000003 | 0 | Pauvre Pierrot | 5433 |
| 2 | 1895 | 0132134 | 0 | Execution of Mary, Queen of Scots, The | 6200 |
| 3 | 1895 | 0000014 | 0 | Tables Turned on the Gardener | 5444 |
| 4 | 1896 | 0000131 | 0 | Une nuit terrible | 5406 |
| ... | ... | ... | ... | ... | ... |
| 8569 | 2020 | 7134096 | 2 | Rhythm Section, The | 8994 |
| 8570 | 2020 | 8461042 | 3 | Marijuana Conspiracy , The | 8859 |
| 8571 | 2020 | 1502397 | 2 | Bad Boys For Life | 9071 |
| 8572 | 2020 | 7713068 | 3 | Birds of Prey | 9008 |


In [42]:
#download the json file with the API directly from the website
#I prefer loading via the API to get a more current data set (is the same as in the file directory)
with urllib.request.urlopen("http://bechdeltest.com/api/v1/getAllMovies") as url:
        data = json.loads(url.read().decode())       

In [43]:
bechdel_imdb = pd.DataFrame(data) 
group_tot = bechdel_imdb.groupby([pd.cut(bechdel_imdb["year"], np.arange(1880, 2030, 10)),"rating"]).size()
#print(group_tot)
bechdel_imdb

| | imdbid | id | year | title | rating |
|---|---|---|---|---|---|
| 0 | 0392728 | 8040 | 1888 | Roundhay Garden Scene | 0 |
| 1 | 0000003 | 5433 | 1892 | Pauvre Pierrot | 0 |
| 2 | 0132134 | 6200 | 1895 | Execution of Mary, Queen of Scots, The | 0 |
| 3 | 0000014 | 5444 | 1895 | Tables Turned on the Gardener | 0 |
| 4 | 0000131 | 5406 | 1896 | Une nuit terrible | 0 |
| ... | ... | ... | ... | ... | ... |
| 8569 | 8461042 | 8859 | 2020 | Marijuana Conspiracy , The | 3 |
| 8570 | 7134096 | 8994 | 2020 | Rhythm Section, The | 2 |
| 8571 | 1502397 | 9071 | 2020 | Bad Boys For Life | 2 |
| 8572 | 10655686 | 9144 | 2020 | Never Ricking Rick | 1 |


In [44]:
group_perc = (group_tot/group_tot.groupby(level=0).sum())
print(group_perc)

year          rating
(1880, 1890]  0         1.000000
              1         0.000000
              2         0.000000
              3         0.000000
(1890, 1900]  0         0.956522
              1         0.000000
              2         0.000000
              3         0.043478
(1900, 1910]  0         0.882353
              1         0.058824
              2         0.029412
              3         0.029412
(1910, 1920]  0         0.384615
              1         0.000000
              2         0.153846
              3         0.461538
(1920, 1930]  0         0.411111
              1         0.144444
              2         0.211111
              3         0.233333
(1930, 1940]  0         0.113636
              1         0.195455
              2         0.204545
              3         0.486364
(1940, 1950]  0         0.141509
              1         0.207547
              2         0.160377
              3         0.490566
(1950, 1960]  0         0.144366
              1       

In [45]:
print("Share of movies which failed the test (rating = 0) by decade.\n",group_perc[:,0])
print("Share of movies which passed the test (rating = 3) by decade.\n",group_perc[:,3])
print("The share of movies which failed the test is declining over the decades \nwhile the share of movies which passed the test is increasing over the decades.")


Share of movies which failed the test (rating = 0) by decade.
 year
(1880, 1890]    1.000000
(1890, 1900]    0.956522
(1900, 1910]    0.882353
(1910, 1920]    0.384615
(1920, 1930]    0.411111
(1930, 1940]    0.113636
(1940, 1950]    0.141509
(1950, 1960]    0.144366
(1960, 1970]    0.193548
(1970, 1980]    0.117493
(1980, 1990]    0.114420
(1990, 2000]    0.078924
(2000, 2010]    0.084981
(2010, 2020]    0.071025
dtype: float64
Share of movies which passed the test (rating = 3) by decade.
 year
(1880, 1890]    0.000000
(1890, 1900]    0.043478
(1900, 1910]    0.029412
(1910, 1920]    0.461538
(1920, 1930]    0.233333
(1930, 1940]    0.486364
(1940, 1950]    0.490566
(1950, 1960]    0.489437
(1960, 1970]    0.445748
(1970, 1980]    0.480418
(1980, 1990]    0.543887
(1990, 2000]    0.579372
(2000, 2010]    0.602019
(2010, 2020]    0.637102
dtype: float64
The share of movies which failed the test is declining over the decades 
while the share of movies which passed the test is increasing over the decades.
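
If only the pass rate per decade is of interest, it can be computed directly as the mean of a boolean column, without building the full rating-by-decade table first. A sketch on made-up rows; it assumes integer `year` and `rating` columns and uses floor division for the decade, so a year ending in 0 opens a decade rather than closing one.

```python
import pandas as pd

ratings = pd.DataFrame({
    "year": [1985, 1988, 1993, 1997, 1999],  # made-up sample
    "rating": [0, 3, 3, 1, 3],
})

ratings["decade"] = (ratings["year"] // 10) * 10
# The mean of True/False values is the share of True, i.e. the pass rate
pass_share = ratings["rating"].eq(3).groupby(ratings["decade"]).mean()
```

The early decades in the real data contain very few movies, so their shares should be read with caution; adding a `.size()` column next to the share makes that visible.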

In [46]:
IMDb_failed = len(bechdel_imdb_top[bechdel_imdb_top["rating"]==0])/len(bechdel_imdb_top)
IMDb_passed = len(bechdel_imdb_top[bechdel_imdb_top["rating"]==3])/len(bechdel_imdb_top)
print(round(IMDb_failed*100,4),"% of the rated IMDb top 250 movies FAILED the test (rating = 0)")
print(round(IMDb_passed*100,4),"% of the rated IMDb top 250 movies PASSED the test (rating = 3)")

#create a Bechdel DataFrame without the movies in the IMDb top 250
bechdel_not_imdb = bechdel_imdb[(~bechdel_imdb.imdbid.isin(bechdel_imdb_top.IMDb_id))]
Not_IMDb_failed = len(bechdel_not_imdb[bechdel_not_imdb["rating"]==0])/len(bechdel_not_imdb) 
Not_IMDb_passed = len(bechdel_not_imdb[bechdel_not_imdb["rating"]==3])/len(bechdel_not_imdb)
print(round(Not_IMDb_failed*100,4),"% of the movies which are in the Bechdel data and NOT in the IMDb top 250 list FAILED the test (rating = 0)")
print(round(Not_IMDb_passed*100,4),"% of the movies which are in the Bechdel data and NOT in the IMDb top 250 list PASSED the test (rating = 3)")

20.8511 % of the rated IMDb top 250 movies FAILED the test (rating = 0)
34.8936 % of the rated IMDb top 250 movies PASSED the test (rating = 3)
9.8573 % of the movies which are in the Bechdel data and NOT in the IMDb top 250 list FAILED the test (rating = 0)
58.3523 % of the movies which are in the Bechdel data and NOT in the IMDb top 250 list PASSED the test (rating = 3)
