# Data Preparation 

## Deep Learning Final Project 

### Jade Benson

In this project, we use multiple types of movie data from Wikipedia to describe this art form with deep learning methods. We explore the different types of data individually and together to understand the landscape of movies, and will eventually use them to predict genre and whether a movie passes the Bechdel test.

This notebook combines three separate data sources.

CMU Wikipedia plots, characters, genres, networks: http://www.cs.cmu.edu/~ark/personas/

Movie poster links (Kaggle scraped from IMDB): https://www.kaggle.com/datasets/neha1703/movie-genre-from-its-poster

Bechdel test: https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-03-09/readme.md

In [1]:
import pandas as pd 
import numpy as np 
import sklearn 
import matplotlib

In [4]:
## CMU Wikipedia movie dataset 

#first use "Movie meta data"
#Movie name, release date, genres, wikipedia movie ID 
#use wikipedia movie ID to link with the plot summaries in plot_summaries.txt

movie_info = pd.read_csv('MovieSummaries/movie.metadata.tsv', sep = '\t', header = None)


In [5]:
movie_info.head()
#final columns are dictionaries with freebase ID: value 

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,975900,/m/03vyhn,Ghosts of Mars,2001-08-24,14010832.0,98.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01jfsb"": ""Thriller"", ""/m/06n90"": ""Science..."
1,3196793,/m/08yl5d,Getting Away with Murder: The JonBenét Ramsey ...,2000-02-16,,95.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/02n4kr"": ""Mystery"", ""/m/03bxz7"": ""Biograp..."
2,28463795,/m/0crgdbh,Brun bitter,1988,,83.0,"{""/m/05f_3"": ""Norwegian Language""}","{""/m/05b4w"": ""Norway""}","{""/m/0lsxr"": ""Crime Fiction"", ""/m/07s9rl0"": ""D..."
3,9363483,/m/0285_cd,White Of The Eye,1987,,110.0,"{""/m/02h40lc"": ""English Language""}","{""/m/07ssc"": ""United Kingdom""}","{""/m/01jfsb"": ""Thriller"", ""/m/0glj9q"": ""Erotic..."
4,261236,/m/01mrr1,A Woman in Flames,1983,,106.0,"{""/m/04306rv"": ""German Language""}","{""/m/0345h"": ""Germany""}","{""/m/07s9rl0"": ""Drama""}"
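
The last three columns hold JSON-formatted dictionaries keyed by Freebase ID. As a minimal sketch (the helper name `freebase_names` is hypothetical and not part of the original notebook), they could be turned into plain lists of names with the standard `json` module, assuming every cell is a valid JSON object string:

```python
import json

def freebase_names(cell):
    """Return the values of a '{"freebase_id": "name", ...}' string as a list."""
    # json.loads turns the string into a dict; an empty '{}' yields an empty list
    return list(json.loads(cell).values())

example = '{"/m/01jfsb": "Thriller", "/m/06n90": "Science Fiction"}'
print(freebase_names(example))
```

On the full table this could be applied per column, for example `movie_info[8].apply(freebase_names)` while the columns still carry their default integer labels.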


In [7]:
movie_info.columns = ["wiki_ID", "freebase_ID", "title", "date", "revenue", "runtime", "language", "country", "genres"]

In [10]:
movie_info.head()

Unnamed: 0,wiki_ID,freebase_ID,title,date,revenue,runtime,language,country,genres
0,975900,/m/03vyhn,Ghosts of Mars,2001-08-24,14010832.0,98.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01jfsb"": ""Thriller"", ""/m/06n90"": ""Science..."
1,3196793,/m/08yl5d,Getting Away with Murder: The JonBenét Ramsey ...,2000-02-16,,95.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/02n4kr"": ""Mystery"", ""/m/03bxz7"": ""Biograp..."
2,28463795,/m/0crgdbh,Brun bitter,1988,,83.0,"{""/m/05f_3"": ""Norwegian Language""}","{""/m/05b4w"": ""Norway""}","{""/m/0lsxr"": ""Crime Fiction"", ""/m/07s9rl0"": ""D..."
3,9363483,/m/0285_cd,White Of The Eye,1987,,110.0,"{""/m/02h40lc"": ""English Language""}","{""/m/07ssc"": ""United Kingdom""}","{""/m/01jfsb"": ""Thriller"", ""/m/0glj9q"": ""Erotic..."
4,261236,/m/01mrr1,A Woman in Flames,1983,,106.0,"{""/m/04306rv"": ""German Language""}","{""/m/0345h"": ""Germany""}","{""/m/07s9rl0"": ""Drama""}"


In [8]:
len(movie_info)
#81,741 movies

81741

In [9]:
len(movie_info.drop_duplicates())
#no duplicates

81741

In [13]:
#lowercase titles, no spaces or punctuation to hopefully make matching easier 
#note: accented letters are dropped as well (e.g. "JonBenét" becomes "jonbent")
import re

clean_titles = lambda x: re.sub(r'[^a-z\d]', '', x.lower())
movie_info['match_title'] = movie_info["title"].apply(clean_titles)


In [14]:
movie_info.head()

Unnamed: 0,wiki_ID,freebase_ID,title,date,revenue,runtime,language,country,genres,match_title
0,975900,/m/03vyhn,Ghosts of Mars,2001-08-24,14010832.0,98.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01jfsb"": ""Thriller"", ""/m/06n90"": ""Science...",ghostsofmars
1,3196793,/m/08yl5d,Getting Away with Murder: The JonBenét Ramsey ...,2000-02-16,,95.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/02n4kr"": ""Mystery"", ""/m/03bxz7"": ""Biograp...",gettingawaywithmurderthejonbentramseymystery
2,28463795,/m/0crgdbh,Brun bitter,1988,,83.0,"{""/m/05f_3"": ""Norwegian Language""}","{""/m/05b4w"": ""Norway""}","{""/m/0lsxr"": ""Crime Fiction"", ""/m/07s9rl0"": ""D...",brunbitter
3,9363483,/m/0285_cd,White Of The Eye,1987,,110.0,"{""/m/02h40lc"": ""English Language""}","{""/m/07ssc"": ""United Kingdom""}","{""/m/01jfsb"": ""Thriller"", ""/m/0glj9q"": ""Erotic...",whiteoftheeye
4,261236,/m/01mrr1,A Woman in Flames,1983,,106.0,"{""/m/04306rv"": ""German Language""}","{""/m/0345h"": ""Germany""}","{""/m/07s9rl0"": ""Drama""}",awomaninflames
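
The second row above shows that the pattern `[^a-z\d]` removes accented letters entirely ("JonBenét" becomes "jonbent"). A possible variant, sketched here under the assumption that transliterating to ASCII is acceptable for matching, decomposes accented characters first so the base letter survives. The function name `clean_title_ascii` is hypothetical:

```python
import re
import unicodedata

def clean_title_ascii(title):
    # decompose accented letters (e.g. é -> e + combining accent), then drop the accents
    ascii_title = unicodedata.normalize("NFKD", title).encode("ascii", "ignore").decode("ascii")
    # same rule as clean_titles: lowercase, keep only letters a-z and digits
    return re.sub(r"[^a-z\d]", "", ascii_title.lower())

print(clean_title_ascii("The JonBenét Ramsey Mystery"))
```

Whether this improves matching depends on how the other datasets spell such titles, which is not checked here.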


In [81]:
#movie summaries also from CMU 

#code for how to merge with the summaries 

movie_lines = []

with open('MovieSummaries/plot_summaries.txt') as file:
    for line in file:
        line_clean = line.rstrip()
        line_clean = line_clean.split("\t")
        movie_lines.append(line_clean)


In [82]:
movie_lines[0:5]

[['23890098',
  "Shlykov, a hard-working taxi driver and Lyosha, a saxophonist, develop a bizarre love-hate relationship, and despite their prejudices, realize they aren't so different after all."],
 ['31186339',
  'The nation of Panem consists of a wealthy Capitol and twelve poorer districts. As punishment for a past rebellion, each district must provide a boy and girl  between the ages of 12 and 18 selected by lottery  for the annual Hunger Games. The tributes must fight to the death in an arena; the sole survivor is rewarded with fame and wealth. In her first Reaping, 12-year-old Primrose Everdeen is chosen from District 12. Her older sister Katniss volunteers to take her place. Peeta Mellark, a baker\'s son who once gave Katniss bread when she was starving, is the other District 12 tribute. Katniss and Peeta are taken to the Capitol, accompanied by their frequently drunk mentor, past victor Haymitch Abernathy. He warns them about the "Career" tributes who train intensively at speci

In [83]:
movie_plots = pd.DataFrame(movie_lines, columns = ['wiki_ID', 'plot'])
movie_plots.head()
#the plots should be cleaned before using! 

Unnamed: 0,wiki_ID,plot
0,23890098,"Shlykov, a hard-working taxi driver and Lyosha..."
1,31186339,The nation of Panem consists of a wealthy Capi...
2,20663735,Poovalli Induchoodan is sentenced for six yea...
3,2231378,"The Lemon Drop Kid , a New York City swindler,..."
4,595909,Seventh-day Adventist Church pastor Michael Ch...
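
The manual line-by-line read above could probably also be done with `pd.read_csv`. The sketch below assumes the file is strictly two tab-separated fields per line. The helper `load_plots` is hypothetical, and an in-memory sample stands in for `plot_summaries.txt`:

```python
import csv
import io
import pandas as pd

def load_plots(source):
    # QUOTE_NONE keeps quotation marks inside plots as literal text
    return pd.read_csv(source, sep="\t", header=None,
                       names=["wiki_ID", "plot"], quoting=csv.QUOTE_NONE)

# two made-up lines in the same ID<TAB>plot layout
sample = io.StringIO('23890098\tShlykov, a "hard-working" taxi driver\n'
                     '31186339\tThe nation of Panem\n')
plots = load_plots(sample)
print(plots.dtypes)
```

Read this way, `wiki_ID` is inferred as an integer column, so the later `astype(np.int64)` step would not be needed.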


In [84]:
movie_plots.dtypes

wiki_ID    object
plot       object
dtype: object

In [85]:
movie_plots.wiki_ID = movie_plots.wiki_ID.astype(np.int64)

In [86]:
len(movie_plots)

42306

In [88]:
len(movie_plots.drop_duplicates())
#no duplicates - good

42306

In [89]:
#Merge with the CMU movie metadata 

movie_info_plots = movie_info.merge(movie_plots, how = 'inner', left_on = 'wiki_ID', right_on ='wiki_ID')

len(movie_info_plots)
#only dropped 99 of the 42,306 plots - much better 

42207
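
To see which rows fail to match rather than only how many, one option is an outer merge with `indicator=True`, which adds a `_merge` column. The toy frames below are invented for illustration and are not the real data:

```python
import pandas as pd

# toy stand-ins for movie_info and movie_plots
info = pd.DataFrame({"wiki_ID": [1, 2, 3], "title": ["A", "B", "C"]})
plots = pd.DataFrame({"wiki_ID": [2, 3, 4], "plot": ["p2", "p3", "p4"]})

# _merge is 'both', 'left_only' or 'right_only' for every row
audit = info.merge(plots, how="outer", on="wiki_ID", indicator=True)
counts = audit["_merge"].value_counts().to_dict()

# plots that have no metadata row
unmatched_plots = audit[audit["_merge"] == "right_only"]
print(counts)
```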

In [91]:
len(movie_info_plots.drop_duplicates())
#no duplicates 

42207

In [143]:
movie_info_plots.head(10)

Unnamed: 0,wiki_ID,freebase_ID,title,date,revenue,runtime,language,country,genres,match_title,plot
0,975900,/m/03vyhn,Ghosts of Mars,2001-08-24,14010832.0,98.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01jfsb"": ""Thriller"", ""/m/06n90"": ""Science...",ghostsofmars,"Set in the second half of the 22nd century, th..."
1,9363483,/m/0285_cd,White Of The Eye,1987,,110.0,"{""/m/02h40lc"": ""English Language""}","{""/m/07ssc"": ""United Kingdom""}","{""/m/01jfsb"": ""Thriller"", ""/m/0glj9q"": ""Erotic...",whiteoftheeye,A series of murders of rich young women throug...
2,261236,/m/01mrr1,A Woman in Flames,1983,,106.0,"{""/m/04306rv"": ""German Language""}","{""/m/0345h"": ""Germany""}","{""/m/07s9rl0"": ""Drama""}",awomaninflames,"Eva, an upper class housewife, becomes frustra..."
3,18998739,/m/04jcqvw,The Sorcerer's Apprentice,2002,,86.0,"{""/m/02h40lc"": ""English Language""}","{""/m/0hzlz"": ""South Africa""}","{""/m/0hqxf"": ""Family Film"", ""/m/01hmnh"": ""Fant...",thesorcerersapprentice,"Every hundred years, the evil Morgana returns..."
4,6631279,/m/0gffwj,Little city,1997-04-04,,93.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/06cvj"": ""Romantic comedy"", ""/m/0hj3n0w"": ...",littlecity,"Adam, a San Francisco-based artist who works a..."
5,171005,/m/016ywb,Henry V,1989-11-08,10161099.0,137.0,"{""/m/02h40lc"": ""English Language""}","{""/m/07ssc"": ""United Kingdom""}","{""/m/04xvh5"": ""Costume drama"", ""/m/082gq"": ""Wa...",henryv,{{Plot|dateAct 1Act 2Act 3Act 4Act 5 Finally n...
6,18296435,/m/04cqrs4,Aaah Belinda,1986,,,"{""/m/02hwyss"": ""Turkish Language""}","{""/m/01znc_"": ""Turkey""}","{""/m/01z4y"": ""Comedy""}",aaahbelinda,"Serap, a young actress with a strong, lively p..."
7,11250635,/m/02r52hc,The Mechanical Monsters,,,,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/06n90"": ""Science Fiction"", ""/m/03k9fj"": ""...",themechanicalmonsters,The story starts as one of the robots flies i...
8,77856,/m/0kcn7,Mary Poppins,1964-08-27,102272727.0,139.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/0hj3myq"": ""Children's/Family"", ""/m/04t36""...",marypoppins,The film opens with Mary Poppins perched in a...
9,32456683,/m/0gyryjt,Die Fahne von Kriwoj Rog,1967,,108.0,"{""/m/04306rv"": ""German Language""}","{""/m/03f2w"": ""German Democratic Republic""}",{},diefahnevonkriwojrog,"Otto Brosowski, a communist miner, writes to t..."


In [150]:
movie_info_plots.dtypes

wiki_ID          int64
freebase_ID     object
title           object
date            object
revenue        float64
runtime        float64
language        object
country         object
genres          object
match_title     object
plot            object
dtype: object

In [171]:
#create simple year to help with merging 
import math
import datetime 

def clean_dates(date): 
    #missing dates are read in as float NaN; anything that is not a string has no usable year
    if not isinstance(date, str): 
        return None

    #dates look like '1987' or '2001-08-24', so the first four characters are the year
    return date.lstrip()[:4]

        
movie_info_plots['year'] = movie_info_plots['date'].apply(clean_dates)


In [175]:
movie_info_plots.head(20)

Unnamed: 0,wiki_ID,freebase_ID,title,date,revenue,runtime,language,country,genres,match_title,plot,year
0,975900,/m/03vyhn,Ghosts of Mars,2001-08-24,14010832.0,98.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01jfsb"": ""Thriller"", ""/m/06n90"": ""Science...",ghostsofmars,"Set in the second half of the 22nd century, th...",2001.0
1,9363483,/m/0285_cd,White Of The Eye,1987,,110.0,"{""/m/02h40lc"": ""English Language""}","{""/m/07ssc"": ""United Kingdom""}","{""/m/01jfsb"": ""Thriller"", ""/m/0glj9q"": ""Erotic...",whiteoftheeye,A series of murders of rich young women throug...,1987.0
2,261236,/m/01mrr1,A Woman in Flames,1983,,106.0,"{""/m/04306rv"": ""German Language""}","{""/m/0345h"": ""Germany""}","{""/m/07s9rl0"": ""Drama""}",awomaninflames,"Eva, an upper class housewife, becomes frustra...",1983.0
3,18998739,/m/04jcqvw,The Sorcerer's Apprentice,2002,,86.0,"{""/m/02h40lc"": ""English Language""}","{""/m/0hzlz"": ""South Africa""}","{""/m/0hqxf"": ""Family Film"", ""/m/01hmnh"": ""Fant...",thesorcerersapprentice,"Every hundred years, the evil Morgana returns...",2002.0
4,6631279,/m/0gffwj,Little city,1997-04-04,,93.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/06cvj"": ""Romantic comedy"", ""/m/0hj3n0w"": ...",littlecity,"Adam, a San Francisco-based artist who works a...",1997.0
5,171005,/m/016ywb,Henry V,1989-11-08,10161099.0,137.0,"{""/m/02h40lc"": ""English Language""}","{""/m/07ssc"": ""United Kingdom""}","{""/m/04xvh5"": ""Costume drama"", ""/m/082gq"": ""Wa...",henryv,{{Plot|dateAct 1Act 2Act 3Act 4Act 5 Finally n...,1989.0
6,18296435,/m/04cqrs4,Aaah Belinda,1986,,,"{""/m/02hwyss"": ""Turkish Language""}","{""/m/01znc_"": ""Turkey""}","{""/m/01z4y"": ""Comedy""}",aaahbelinda,"Serap, a young actress with a strong, lively p...",1986.0
7,11250635,/m/02r52hc,The Mechanical Monsters,,,,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/06n90"": ""Science Fiction"", ""/m/03k9fj"": ""...",themechanicalmonsters,The story starts as one of the robots flies i...,
8,77856,/m/0kcn7,Mary Poppins,1964-08-27,102272727.0,139.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/0hj3myq"": ""Children's/Family"", ""/m/04t36""...",marypoppins,The film opens with Mary Poppins perched in a...,1964.0
9,32456683,/m/0gyryjt,Die Fahne von Kriwoj Rog,1967,,108.0,"{""/m/04306rv"": ""German Language""}","{""/m/03f2w"": ""German Democratic Republic""}",{},diefahnevonkriwojrog,"Otto Brosowski, a communist miner, writes to t...",1967.0


In [173]:
movie_info_plots.iloc[0,3]

'2001-08-24'

In [174]:
movie_info_plots.iloc[0,-1]

'2001'
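
The same year extraction can be sketched without a row-wise function by using the vectorized string methods. This assumes every non-missing date starts with a four-digit year, as in the rows shown above. The sample values are illustrative:

```python
import numpy as np
import pandas as pd

dates = pd.Series(["2001-08-24", "1987", np.nan, " 1964-08-27"])

# capture the first four digits after optional leading whitespace;
# missing dates stay missing
years = dates.str.extract(r"^\s*(\d{4})", expand=False)
print(years.tolist())
```

The result is a column of strings, which matters later because a merge on `year` only matches values of the same type.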

## Posters 

In [92]:
# Poster dataset 
movie_posters = pd.read_csv('MoviePosters/MovieGenre.csv', encoding = "ISO-8859-1") #different encoding 
movie_posters.head()


Unnamed: 0,imdbId,Imdb Link,Title,IMDB Score,Genre,Poster
0,114709,http://www.imdb.com/title/tt114709,Toy Story (1995),8.3,Animation|Adventure|Comedy,https://images-na.ssl-images-amazon.com/images...
1,113497,http://www.imdb.com/title/tt113497,Jumanji (1995),6.9,Action|Adventure|Family,https://images-na.ssl-images-amazon.com/images...
2,113228,http://www.imdb.com/title/tt113228,Grumpier Old Men (1995),6.6,Comedy|Romance,https://images-na.ssl-images-amazon.com/images...
3,114885,http://www.imdb.com/title/tt114885,Waiting to Exhale (1995),5.7,Comedy|Drama|Romance,https://images-na.ssl-images-amazon.com/images...
4,113041,http://www.imdb.com/title/tt113041,Father of the Bride Part II (1995),5.9,Comedy|Family|Romance,https://images-na.ssl-images-amazon.com/images...


In [93]:
len(movie_posters)

40108

In [180]:
#want to create new column with the years 
def extract_year(title): 
    #titles look like "Toy Story (1995)", so the year is the last space-separated token
    date = title.split(' ')[-1]
    #strip the surrounding parentheses 
    year = re.sub(r'[()]', '', date)
    return year 

movie_posters['year'] = movie_posters['Title'].apply(extract_year)
   


In [181]:
movie_posters.head(10)

Unnamed: 0,imdbId,Imdb Link,Title,IMDB Score,Genre,Poster,match_title,year
0,114709,http://www.imdb.com/title/tt114709,Toy Story (1995),8.3,Animation|Adventure|Comedy,https://images-na.ssl-images-amazon.com/images...,toystory,1995
1,113497,http://www.imdb.com/title/tt113497,Jumanji (1995),6.9,Action|Adventure|Family,https://images-na.ssl-images-amazon.com/images...,jumanji,1995
2,113228,http://www.imdb.com/title/tt113228,Grumpier Old Men (1995),6.6,Comedy|Romance,https://images-na.ssl-images-amazon.com/images...,grumpieroldmen,1995
3,114885,http://www.imdb.com/title/tt114885,Waiting to Exhale (1995),5.7,Comedy|Drama|Romance,https://images-na.ssl-images-amazon.com/images...,waitingtoexhale,1995
4,113041,http://www.imdb.com/title/tt113041,Father of the Bride Part II (1995),5.9,Comedy|Family|Romance,https://images-na.ssl-images-amazon.com/images...,fatherofthebridepartii,1995
5,113277,http://www.imdb.com/title/tt113277,Heat (1995),8.2,Action|Crime|Drama,https://images-na.ssl-images-amazon.com/images...,heat,1995
6,114319,http://www.imdb.com/title/tt114319,Sabrina (1995),6.3,Comedy|Drama,https://images-na.ssl-images-amazon.com/images...,sabrina,1995
7,112302,http://www.imdb.com/title/tt112302,Tom and Huck (1995),5.6,Adventure|Comedy|Drama,https://images-na.ssl-images-amazon.com/images...,tomandhuck,1995
8,114576,http://www.imdb.com/title/tt114576,Sudden Death (1995),5.7,Action|Crime|Thriller,https://images-na.ssl-images-amazon.com/images...,suddendeath,1995
9,113189,http://www.imdb.com/title/tt113189,GoldenEye (1995),7.2,Action|Adventure|Thriller,https://images-na.ssl-images-amazon.com/images...,goldeneye,1995
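
Taking the last space-separated token relies on every title ending in "(YYYY)". A slightly stricter sketch anchors a regular expression at the end of the title, so anything that does not end in a four-digit year in parentheses yields a missing value instead of an arbitrary word. The third sample title is invented, and whether such titles occur in the poster data is not checked here:

```python
import pandas as pd

titles = pd.Series(["Toy Story (1995)",
                    "Father of the Bride Part II (1995)",
                    "No Year Here"])

# capture four digits inside a trailing pair of parentheses
poster_years = titles.str.extract(r"\((\d{4})\)\s*$", expand=False)

# remove that trailing "(YYYY)" to leave the bare title
bare_titles = titles.str.replace(r"\s*\(\d{4}\)\s*$", "", regex=True)
print(poster_years.tolist(), bare_titles.tolist())
```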


In [182]:
#now want to remove the dates 

#remove text from within parentheses 
remove_dates = lambda x: re.sub(r'\([^()]*\)', '', x)
movie_posters['match_title'] = movie_posters["Title"].apply(remove_dates)
movie_posters['match_title'] = movie_posters["match_title"].apply(clean_titles)


In [183]:
movie_posters.head(10)
#looks good

Unnamed: 0,imdbId,Imdb Link,Title,IMDB Score,Genre,Poster,match_title,year
0,114709,http://www.imdb.com/title/tt114709,Toy Story (1995),8.3,Animation|Adventure|Comedy,https://images-na.ssl-images-amazon.com/images...,toystory,1995
1,113497,http://www.imdb.com/title/tt113497,Jumanji (1995),6.9,Action|Adventure|Family,https://images-na.ssl-images-amazon.com/images...,jumanji,1995
2,113228,http://www.imdb.com/title/tt113228,Grumpier Old Men (1995),6.6,Comedy|Romance,https://images-na.ssl-images-amazon.com/images...,grumpieroldmen,1995
3,114885,http://www.imdb.com/title/tt114885,Waiting to Exhale (1995),5.7,Comedy|Drama|Romance,https://images-na.ssl-images-amazon.com/images...,waitingtoexhale,1995
4,113041,http://www.imdb.com/title/tt113041,Father of the Bride Part II (1995),5.9,Comedy|Family|Romance,https://images-na.ssl-images-amazon.com/images...,fatherofthebridepartii,1995
5,113277,http://www.imdb.com/title/tt113277,Heat (1995),8.2,Action|Crime|Drama,https://images-na.ssl-images-amazon.com/images...,heat,1995
6,114319,http://www.imdb.com/title/tt114319,Sabrina (1995),6.3,Comedy|Drama,https://images-na.ssl-images-amazon.com/images...,sabrina,1995
7,112302,http://www.imdb.com/title/tt112302,Tom and Huck (1995),5.6,Adventure|Comedy|Drama,https://images-na.ssl-images-amazon.com/images...,tomandhuck,1995
8,114576,http://www.imdb.com/title/tt114576,Sudden Death (1995),5.7,Action|Crime|Thriller,https://images-na.ssl-images-amazon.com/images...,suddendeath,1995
9,113189,http://www.imdb.com/title/tt113189,GoldenEye (1995),7.2,Action|Adventure|Thriller,https://images-na.ssl-images-amazon.com/images...,goldeneye,1995


In [184]:
movie_info_plots.head(10)

Unnamed: 0,wiki_ID,freebase_ID,title,date,revenue,runtime,language,country,genres,match_title,plot,year
0,975900,/m/03vyhn,Ghosts of Mars,2001-08-24,14010832.0,98.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01jfsb"": ""Thriller"", ""/m/06n90"": ""Science...",ghostsofmars,"Set in the second half of the 22nd century, th...",2001.0
1,9363483,/m/0285_cd,White Of The Eye,1987,,110.0,"{""/m/02h40lc"": ""English Language""}","{""/m/07ssc"": ""United Kingdom""}","{""/m/01jfsb"": ""Thriller"", ""/m/0glj9q"": ""Erotic...",whiteoftheeye,A series of murders of rich young women throug...,1987.0
2,261236,/m/01mrr1,A Woman in Flames,1983,,106.0,"{""/m/04306rv"": ""German Language""}","{""/m/0345h"": ""Germany""}","{""/m/07s9rl0"": ""Drama""}",awomaninflames,"Eva, an upper class housewife, becomes frustra...",1983.0
3,18998739,/m/04jcqvw,The Sorcerer's Apprentice,2002,,86.0,"{""/m/02h40lc"": ""English Language""}","{""/m/0hzlz"": ""South Africa""}","{""/m/0hqxf"": ""Family Film"", ""/m/01hmnh"": ""Fant...",thesorcerersapprentice,"Every hundred years, the evil Morgana returns...",2002.0
4,6631279,/m/0gffwj,Little city,1997-04-04,,93.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/06cvj"": ""Romantic comedy"", ""/m/0hj3n0w"": ...",littlecity,"Adam, a San Francisco-based artist who works a...",1997.0
5,171005,/m/016ywb,Henry V,1989-11-08,10161099.0,137.0,"{""/m/02h40lc"": ""English Language""}","{""/m/07ssc"": ""United Kingdom""}","{""/m/04xvh5"": ""Costume drama"", ""/m/082gq"": ""Wa...",henryv,{{Plot|dateAct 1Act 2Act 3Act 4Act 5 Finally n...,1989.0
6,18296435,/m/04cqrs4,Aaah Belinda,1986,,,"{""/m/02hwyss"": ""Turkish Language""}","{""/m/01znc_"": ""Turkey""}","{""/m/01z4y"": ""Comedy""}",aaahbelinda,"Serap, a young actress with a strong, lively p...",1986.0
7,11250635,/m/02r52hc,The Mechanical Monsters,,,,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/06n90"": ""Science Fiction"", ""/m/03k9fj"": ""...",themechanicalmonsters,The story starts as one of the robots flies i...,
8,77856,/m/0kcn7,Mary Poppins,1964-08-27,102272727.0,139.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/0hj3myq"": ""Children's/Family"", ""/m/04t36""...",marypoppins,The film opens with Mary Poppins perched in a...,1964.0
9,32456683,/m/0gyryjt,Die Fahne von Kriwoj Rog,1967,,108.0,"{""/m/04306rv"": ""German Language""}","{""/m/03f2w"": ""German Democratic Republic""}",{},diefahnevonkriwojrog,"Otto Brosowski, a communist miner, writes to t...",1967.0


In [185]:
#inner merge between the info/plots and the posters 
#use both the title and the year it was produced 
#the result is smaller than matching on title alone, but more accurate 

movies_df = movie_info_plots.merge(movie_posters, how = 'inner', on = ['match_title', 'year'])

In [186]:
len(movies_df)


15732
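
A merge on `['match_title', 'year']` only pairs rows whose keys are equal in both value and type, so a year stored as the string `'2001'` on one side and the number `2001.0` on the other would not line up. The sketch below, on invented toy frames, normalizes both year columns to a nullable integer before merging. The helper `year_key` is hypothetical:

```python
import pandas as pd

left = pd.DataFrame({"match_title": ["ghostsofmars", "henryv"],
                     "year": ["2001", "1989"]})          # years as strings
right = pd.DataFrame({"match_title": ["ghostsofmars", "henryv"],
                      "year": [2001.0, 1944.0]})         # years as floats

def year_key(series):
    # coerce anything unparseable to missing, then store as nullable integers
    return pd.to_numeric(series, errors="coerce").astype("Int64")

left["year"] = year_key(left["year"])
right["year"] = year_key(right["year"])

merged = left.merge(right, how="inner", on=["match_title", "year"])
print(merged)
```

Only the "ghostsofmars" row survives, because the two "henryv" entries differ in year.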

In [187]:
movies_df.head(10)

Unnamed: 0,wiki_ID,freebase_ID,title,date,revenue,runtime,language,country,genres,match_title,plot,year,imdbId,Imdb Link,Title,IMDB Score,Genre,Poster
0,975900,/m/03vyhn,Ghosts of Mars,2001-08-24,14010832.0,98.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01jfsb"": ""Thriller"", ""/m/06n90"": ""Science...",ghostsofmars,"Set in the second half of the 22nd century, th...",2001,228333,http://www.imdb.com/title/tt228333,Ghosts of Mars (2001),4.9,Action|Horror|Sci-Fi,https://images-na.ssl-images-amazon.com/images...
1,9363483,/m/0285_cd,White Of The Eye,1987,,110.0,"{""/m/02h40lc"": ""English Language""}","{""/m/07ssc"": ""United Kingdom""}","{""/m/01jfsb"": ""Thriller"", ""/m/0glj9q"": ""Erotic...",whiteoftheeye,A series of murders of rich young women throug...,1987,94320,http://www.imdb.com/title/tt94320,White of the Eye (1987),6.4,Horror|Thriller,https://images-na.ssl-images-amazon.com/images...
2,6631279,/m/0gffwj,Little city,1997-04-04,,93.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/06cvj"": ""Romantic comedy"", ""/m/0hj3n0w"": ...",littlecity,"Adam, a San Francisco-based artist who works a...",1997,119548,http://www.imdb.com/title/tt119548,Little City (1997),6.1,Comedy|Romance,https://images-na.ssl-images-amazon.com/images...
3,171005,/m/016ywb,Henry V,1989-11-08,10161099.0,137.0,"{""/m/02h40lc"": ""English Language""}","{""/m/07ssc"": ""United Kingdom""}","{""/m/04xvh5"": ""Costume drama"", ""/m/082gq"": ""Wa...",henryv,{{Plot|dateAct 1Act 2Act 3Act 4Act 5 Finally n...,1989,97499,http://www.imdb.com/title/tt97499,Henry V (1989),7.7,Action|Biography|Drama,https://images-na.ssl-images-amazon.com/images...
4,77856,/m/0kcn7,Mary Poppins,1964-08-27,102272727.0,139.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/0hj3myq"": ""Children's/Family"", ""/m/04t36""...",marypoppins,The film opens with Mary Poppins perched in a...,1964,58331,http://www.imdb.com/title/tt58331,Mary Poppins (1964),7.8,Comedy|Family|Fantasy,https://images-na.ssl-images-amazon.com/images...
5,21926710,/m/05p45cv,White on Rice,2009,,82.0,{},"{""/m/09c7w0"": ""United States of America""}","{""/m/06cvj"": ""Romantic comedy"", ""/m/02l7c8"": ""...",whiteonrice,Jimmy ([[Hiroshi Watanabe loves dinosaurs and...,2009,892904,http://www.imdb.com/title/tt892904,White on Rice (2009),6.2,Comedy,https://images-na.ssl-images-amazon.com/images...
6,156558,/m/014k4y,Baby Boy,2001-06-27,29381649.0,123.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/0lsxr"": ""Crime Fiction"", ""/m/07s9rl0"": ""D...",babyboy,A young 20-year-old named Jody lives with his...,2001,255819,http://www.imdb.com/title/tt255819,Baby Boy (2001),6.4,Crime|Drama|Romance,https://images-na.ssl-images-amazon.com/images...
7,26067101,/m/0b6c_nw,Siam Sunset,1999,,91.0,{},"{""/m/0chghy"": ""Australia"", ""/m/0ctw_b"": ""New Z...","{""/m/06cvj"": ""Romantic comedy"", ""/m/02l7c8"": ""...",siamsunset,Perry is an English chemist working for a pain...,1999,178022,http://www.imdb.com/title/tt178022,Siam Sunset (1999),6.5,Adventure|Comedy|Romance,https://images-na.ssl-images-amazon.com/images...
8,9548445,/m/02pjlrp,Archie: To Riverdale and Back Again,1990-05-06,,100.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01z4y"": ""Comedy""}",archietoriverdaleandbackagain,"Archie Andrews, fifteen years after graduating...",1990,99054,http://www.imdb.com/title/tt99054,Archie: To Riverdale and Back Again (1990),5.8,Comedy|Family,https://images-na.ssl-images-amazon.com/images...
9,25960460,/m/0b6kc_5,Daddy and Them,2001,,101.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America"", ""/m/...","{""/m/0vgkd"": ""Black comedy"", ""/m/01z4y"": ""Come...",daddyandthem,Ruby and Claude Montgomery are a very insecure...,2001,166158,http://www.imdb.com/title/tt166158,Daddy and Them (2001),5.7,Comedy|Drama,https://images-na.ssl-images-amazon.com/images...


In [195]:
print(len(movies_df.drop_duplicates()))
#there were duplicates (343 rows: 15732 - 15389)

movies_df.drop_duplicates(inplace = True)
print(len(movies_df))


15389
15389
15371


In [204]:
#any repeated wiki IDs? 
print(len(movies_df.drop_duplicates(subset = ['wiki_ID'])))


15371


In [203]:
movies_df[movies_df.duplicated(subset = ['wiki_ID']) == True]
#these have different wiki IDs and plots but the same image and IMDB number, that's fine for our application. 

Unnamed: 0,wiki_ID,freebase_ID,title,date,revenue,runtime,language,country,genres,match_title,plot,year,imdbId,Imdb Link,Title,IMDB Score,Genre,Poster
965,945305,/m/03s6l2,Confessions of a Dangerous Mind,2002-12-31,33013805.0,113.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America"", ""/m/...","{""/m/0lsxr"": ""Crime Fiction"", ""/m/01jfsb"": ""Th...",confessionsofadangerousmind,Tired of being rejected by the beautiful women...,2002,270288,http://www.imdb.com/title/tt270288,Confessions of a Dangerous Mind (2002),7.1,Biography|Comedy|Crime,https://images-na.ssl-images-amazon.com/images...
1290,1588930,/m/05drr_,The Blue Angel,1930-04-01,77982.0,105.0,"{""/m/04306rv"": ""German Language"", ""/m/02h40lc""...","{""/m/084n_"": ""Weimar Republic"", ""/m/0345h"": ""G...","{""/m/07s9rl0"": ""Drama"", ""/m/04t36"": ""Musical"",...",theblueangel,Immanuel Rath is an esteemed educator at the ...,1930,818931,http://www.imdb.com/title/tt818931,The Blue Angel (1930),7.8,Drama|Music,https://images-na.ssl-images-amazon.com/images...
1999,29475100,/m/03xzlz0,Berlin Calling,2008-08-08,,100.0,"{""/m/04306rv"": ""German Language""}","{""/m/0345h"": ""Germany""}","{""/m/07s9rl0"": ""Drama"", ""/m/01z4y"": ""Comedy"", ...",berlincalling,Berlin techno DJ and producer Martin Karow is...,2008,211946,http://www.imdb.com/title/tt211946,Berlin Calling (2008),7.3,Comedy|Drama|Music,https://images-na.ssl-images-amazon.com/images...
2672,34521846,/m/0j26mkz,11/11/11,2011-11-01,5739384.0,87.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01jfsb"": ""Thriller"", ""/m/03npn"": ""Horror""}",111111,Jack and Melissa Vales become increasingly fri...,2011,2015261,http://www.imdb.com/title/tt2015261,11/11/11 (2011),2.6,Horror|Thriller,https://images-na.ssl-images-amazon.com/images...
2674,33388421,/m/0gy42wb,11-11-11,2011-11-11,5232771.0,90.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America"", ""/m/...","{""/m/01jfsb"": ""Thriller"", ""/m/03npn"": ""Horror""}",111111,The film starts with a dream sequence depictin...,2011,2015261,http://www.imdb.com/title/tt2015261,11/11/11 (2011),2.6,Horror|Thriller,https://images-na.ssl-images-amazon.com/images...
2712,2638362,/m/07tj4c,Emma,1996-08-02,22231658.0,123.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America"", ""/m/...","{""/m/06cvj"": ""Romantic comedy"", ""/m/04xvh5"": ""...",emma,The film describes a year in the life of Emma ...,1996,118308,http://www.imdb.com/title/tt118308,Emma (1996),7.1,Comedy|Drama|Romance,https://images-na.ssl-images-amazon.com/images...
2876,535335,/m/02mmwk,War of the Worlds,2005-06-13,591745550.0,112.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01jfsb"": ""Thriller"", ""/m/06n90"": ""Science...",waroftheworlds,Ray Ferrier is a container crane operator at ...,2005,449040,http://www.imdb.com/title/tt449040,War of the Worlds (2005),3.3,Sci-Fi,https://images-na.ssl-images-amazon.com/images...
2998,4457586,/m/0c3hzc,Soldier,1998-11-20,,,"{""/m/03k50"": ""Hindi Language""}","{""/m/03rk0"": ""India""}","{""/m/01chg"": ""Bollywood""}",soldier,Captain Vijay Malhotra attempts to defend him...,1998,211634,http://www.imdb.com/title/tt211634,Soldier (1998),6.1,Action|Drama|Musical,https://images-na.ssl-images-amazon.com/images...
3000,730819,/m/036g88,Soldier,1998-10-23,14594226.0,99.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America"", ""/m/...","{""/m/03btsm8"": ""Action/Adventure"", ""/m/06n90"":...",soldier,"In the near future, as part of a new military ...",1998,211634,http://www.imdb.com/title/tt211634,Soldier (1998),6.1,Action|Drama|Musical,https://images-na.ssl-images-amazon.com/images...
3493,2452691,/m/07f458,Chaos,2005-08-10,20166.0,74.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/0qdzd"": ""B-movie"", ""/m/03npn"": ""Horror"", ...",chaos,"While living at her parents' mountain home, Em...",2005,405977,http://www.imdb.com/title/tt405977,Chaos (2005),3.2,Crime|Horror|Thriller,https://images-na.ssl-images-amazon.com/images...
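
To list every row that shares an IMDb number with another row (for example the two "Soldier" entries above), `duplicated` can be called with `keep=False`, which flags all members of each group rather than only the later ones. A small sketch on toy rows copied from the table above:

```python
import pandas as pd

toy = pd.DataFrame({
    "wiki_ID": [4457586, 730819, 975900],
    "title": ["Soldier", "Soldier", "Ghosts of Mars"],
    "imdbId": [211634, 211634, 228333],
})

# keep=False marks every row whose imdbId appears more than once
shared_poster = toy[toy.duplicated(subset=["imdbId"], keep=False)]
print(shared_poster)
```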


In [205]:
movies_df.head(10)

Unnamed: 0,wiki_ID,freebase_ID,title,date,revenue,runtime,language,country,genres,match_title,plot,year,imdbId,Imdb Link,Title,IMDB Score,Genre,Poster
0,975900,/m/03vyhn,Ghosts of Mars,2001-08-24,14010832.0,98.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01jfsb"": ""Thriller"", ""/m/06n90"": ""Science...",ghostsofmars,"Set in the second half of the 22nd century, th...",2001,228333,http://www.imdb.com/title/tt228333,Ghosts of Mars (2001),4.9,Action|Horror|Sci-Fi,https://images-na.ssl-images-amazon.com/images...
1,9363483,/m/0285_cd,White Of The Eye,1987,,110.0,"{""/m/02h40lc"": ""English Language""}","{""/m/07ssc"": ""United Kingdom""}","{""/m/01jfsb"": ""Thriller"", ""/m/0glj9q"": ""Erotic...",whiteoftheeye,A series of murders of rich young women throug...,1987,94320,http://www.imdb.com/title/tt94320,White of the Eye (1987),6.4,Horror|Thriller,https://images-na.ssl-images-amazon.com/images...
2,6631279,/m/0gffwj,Little city,1997-04-04,,93.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/06cvj"": ""Romantic comedy"", ""/m/0hj3n0w"": ...",littlecity,"Adam, a San Francisco-based artist who works a...",1997,119548,http://www.imdb.com/title/tt119548,Little City (1997),6.1,Comedy|Romance,https://images-na.ssl-images-amazon.com/images...
3,171005,/m/016ywb,Henry V,1989-11-08,10161099.0,137.0,"{""/m/02h40lc"": ""English Language""}","{""/m/07ssc"": ""United Kingdom""}","{""/m/04xvh5"": ""Costume drama"", ""/m/082gq"": ""Wa...",henryv,{{Plot|dateAct 1Act 2Act 3Act 4Act 5 Finally n...,1989,97499,http://www.imdb.com/title/tt97499,Henry V (1989),7.7,Action|Biography|Drama,https://images-na.ssl-images-amazon.com/images...
4,77856,/m/0kcn7,Mary Poppins,1964-08-27,102272727.0,139.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/0hj3myq"": ""Children's/Family"", ""/m/04t36""...",marypoppins,The film opens with Mary Poppins perched in a...,1964,58331,http://www.imdb.com/title/tt58331,Mary Poppins (1964),7.8,Comedy|Family|Fantasy,https://images-na.ssl-images-amazon.com/images...
5,21926710,/m/05p45cv,White on Rice,2009,,82.0,{},"{""/m/09c7w0"": ""United States of America""}","{""/m/06cvj"": ""Romantic comedy"", ""/m/02l7c8"": ""...",whiteonrice,Jimmy ([[Hiroshi Watanabe loves dinosaurs and...,2009,892904,http://www.imdb.com/title/tt892904,White on Rice (2009),6.2,Comedy,https://images-na.ssl-images-amazon.com/images...
6,156558,/m/014k4y,Baby Boy,2001-06-27,29381649.0,123.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/0lsxr"": ""Crime Fiction"", ""/m/07s9rl0"": ""D...",babyboy,A young 20-year-old named Jody lives with his...,2001,255819,http://www.imdb.com/title/tt255819,Baby Boy (2001),6.4,Crime|Drama|Romance,https://images-na.ssl-images-amazon.com/images...
7,26067101,/m/0b6c_nw,Siam Sunset,1999,,91.0,{},"{""/m/0chghy"": ""Australia"", ""/m/0ctw_b"": ""New Z...","{""/m/06cvj"": ""Romantic comedy"", ""/m/02l7c8"": ""...",siamsunset,Perry is an English chemist working for a pain...,1999,178022,http://www.imdb.com/title/tt178022,Siam Sunset (1999),6.5,Adventure|Comedy|Romance,https://images-na.ssl-images-amazon.com/images...
8,9548445,/m/02pjlrp,Archie: To Riverdale and Back Again,1990-05-06,,100.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01z4y"": ""Comedy""}",archietoriverdaleandbackagain,"Archie Andrews, fifteen years after graduating...",1990,99054,http://www.imdb.com/title/tt99054,Archie: To Riverdale and Back Again (1990),5.8,Comedy|Family,https://images-na.ssl-images-amazon.com/images...
9,25960460,/m/0b6kc_5,Daddy and Them,2001,,101.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America"", ""/m/...","{""/m/0vgkd"": ""Black comedy"", ""/m/01z4y"": ""Come...",daddyandthem,Ruby and Claude Montgomery are a very insecure...,2001,166158,http://www.imdb.com/title/tt166158,Daddy and Them (2001),5.7,Comedy|Drama,https://images-na.ssl-images-amazon.com/images...


In [206]:
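#save the merged dataset (plots + posters)
#note: to_csv also writes the row index as an unnamed first column by default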

movies_df.to_csv('movies.csv')

In [114]:
# Missingness exploration

#which posters are dropped by the merge? 

final_ids = movies_df['imdbId']


In [115]:
missing_posters = movie_posters[~movie_posters['imdbId'].isin(final_ids)]
len(missing_posters)

21241

In [116]:
missing_posters.head(50)
#many of these look like foreign-language or very small movies
#which might not have Wikipedia pages 

Unnamed: 0,imdbId,Imdb Link,Title,IMDB Score,Genre,Poster,match_title
29,115012,http://www.imdb.com/title/tt115012,"Yao a yao, yao dao wai po qiao (1995)",7.2,Crime|Drama|Romance,https://images-na.ssl-images-amazon.com/images...,yaoayaoyaodaowaipoqiao
32,114952,http://www.imdb.com/title/tt114952,Wings of Courage (1995),6.5,Adventure|Romance,https://images-na.ssl-images-amazon.com/images...,wingsofcourage
36,112286,http://www.imdb.com/title/tt112286,Across the Sea of Time (1995),6.4,Adventure|Drama|Family,https://images-na.ssl-images-amazon.com/images...,acrosstheseaoftime
39,112749,http://www.imdb.com/title/tt112749,"Cry, the Beloved Country (1995)",6.9,Drama|Thriller,https://images-na.ssl-images-amazon.com/images...,crythebelovedcountry
45,113347,http://www.imdb.com/title/tt113347,How to Make an American Quilt (1995),6.2,Comedy|Drama|Romance,https://images-na.ssl-images-amazon.com/images...,howtomakeanamericanquilt
46,114369,http://www.imdb.com/title/tt114369,Se7en (1995),8.6,Crime|Drama|Mystery,https://images-na.ssl-images-amazon.com/images...,se7en
50,109950,http://www.imdb.com/title/tt109950,Guardian Angel (1994),4.8,Action|Drama|Thriller,https://images-na.ssl-images-amazon.com/images...,guardianangel
54,113158,http://www.imdb.com/title/tt113158,Georgia (1995),6.5,Drama|Music,https://images-na.ssl-images-amazon.com/images...,georgia
55,113541,http://www.imdb.com/title/tt113541,Kids of the Round Table (1995),5.0,Adventure|Comedy|Family,https://images-na.ssl-images-amazon.com/images...,kidsoftheroundtable
57,110877,http://www.imdb.com/title/tt110877,Il Postino: The Postman (1994),7.7,Biography|Comedy|Drama,https://images-na.ssl-images-amazon.com/images...,ilpostinothepostman


In [65]:
#makes sense: the poster data uses the stylized title "Se7en", the CMU data uses "Seven"
movie_info[movie_info['title'] == "Seven"]

Unnamed: 0,wiki_ID,freebase_ID,title,date,revenue,runtime,language,country,genres,match_title
66693,206818,/m/01dc0c,Seven,1995-09-22,,127.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01jfsb"": ""Thriller"", ""/m/0lsxr"": ""Crime F...",seven


In [117]:
#not sure if this is 2 
movie_info[movie_info['title'] == "Poison Ivy"]

Unnamed: 0,wiki_ID,freebase_ID,title,date,revenue,runtime,language,country,genres,match_title
70247,17646903,/m/0462mpt,Poison Ivy,1985-02-10,,97.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/06cvj"": ""Romantic comedy"", ""/m/01z4y"": ""C...",poisonivy
78060,555857,/m/02plb3,Poison Ivy,1992-01,,88.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/0hn10"": ""LGBT"", ""/m/01jfsb"": ""Thriller"", ...",poisonivy


In [123]:
#fuzzy match?
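
One way to try the fuzzy-match idea is the standard library's `difflib`, which scores string similarity and returns the closest candidates. The sketch below is only an illustration on a few `match_title` values shown above; `fuzzy_match` and the 0.75 cutoff are assumptions, not part of the notebook, and a cutoff that low could also pair unrelated short titles.

```python
import difflib

def fuzzy_match(title, candidates, cutoff=0.75):
    """Return the candidate most similar to `title`, or None if none clears `cutoff`."""
    matches = difflib.get_close_matches(title, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# Toy candidate list built from match_title values that appear in this notebook
wiki_titles = ["seven", "poisonivy", "ghostsofmars"]

best = fuzzy_match("se7en", wiki_titles)       # stylized poster title
no_match = fuzzy_match("georgia", wiki_titles)  # nothing similar in the toy list
print(best, no_match)
```

Running this over all 21241 unmatched posters against every Wikipedia title would be slow, so restricting candidates to films from the same year is probably a sensible first filter.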

## Genre cleaning 

In [124]:
movies_df.head()

Unnamed: 0,wiki_ID,freebase_ID,title,date,revenue,runtime,language,country,genres,match_title,plot,imdbId,Imdb Link,Title,IMDB Score,Genre,Poster
0,975900,/m/03vyhn,Ghosts of Mars,2001-08-24,14010832.0,98.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01jfsb"": ""Thriller"", ""/m/06n90"": ""Science...",ghostsofmars,"Set in the second half of the 22nd century, th...",228333,http://www.imdb.com/title/tt228333,Ghosts of Mars (2001),4.9,Action|Horror|Sci-Fi,https://images-na.ssl-images-amazon.com/images...
1,9363483,/m/0285_cd,White Of The Eye,1987,,110.0,"{""/m/02h40lc"": ""English Language""}","{""/m/07ssc"": ""United Kingdom""}","{""/m/01jfsb"": ""Thriller"", ""/m/0glj9q"": ""Erotic...",whiteoftheeye,A series of murders of rich young women throug...,94320,http://www.imdb.com/title/tt94320,White of the Eye (1987),6.4,Horror|Thriller,https://images-na.ssl-images-amazon.com/images...
2,18998739,/m/04jcqvw,The Sorcerer's Apprentice,2002,,86.0,"{""/m/02h40lc"": ""English Language""}","{""/m/0hzlz"": ""South Africa""}","{""/m/0hqxf"": ""Family Film"", ""/m/01hmnh"": ""Fant...",thesorcerersapprentice,"Every hundred years, the evil Morgana returns...",963966,http://www.imdb.com/title/tt963966,The Sorcerer's Apprentice (2010),6.1,Action|Adventure|Family,https://images-na.ssl-images-amazon.com/images...
3,12621957,/m/05pdd86,The Sorcerer's Apprentice,2010-07-08,215283742.0,111.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/06n90"": ""Science Fiction"", ""/m/03k9fj"": ""...",thesorcerersapprentice,"In AD 740, one of Merlin's three apprentices...",963966,http://www.imdb.com/title/tt963966,The Sorcerer's Apprentice (2010),6.1,Action|Adventure|Family,https://images-na.ssl-images-amazon.com/images...
4,6631279,/m/0gffwj,Little city,1997-04-04,,93.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/06cvj"": ""Romantic comedy"", ""/m/0hj3n0w"": ...",littlecity,"Adam, a San Francisco-based artist who works a...",119548,http://www.imdb.com/title/tt119548,Little City (1997),6.1,Comedy|Romance,https://images-na.ssl-images-amazon.com/images...


In [127]:
#oh no, genres is not a dictionary, it's a string 
#so the dictionary-based approach below does not work as-is
#list_genres = lambda x: list(x.values())
#movies_df['genres_list'] = movies_df["genres"].apply(list_genres)
#movies_df.head()
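
Since each `genres` value is a string that looks like a JSON object mapping Freebase IDs to names, one alternative is to parse it with `json.loads` and keep the values. This is a sketch under the assumption that every row holds valid JSON (the doubled quotes in the displayed tables are taken to be CSV escaping, not part of the data); `parse_genres` is a hypothetical helper name.

```python
import json

def parse_genres(genre_str):
    """Parse a '{"freebase_id": "genre name", ...}' string into a list of genre names."""
    return list(json.loads(genre_str).values())

# Example string in the same shape as the genres column
example = '{"/m/01jfsb": "Thriller", "/m/06n90": "Science Fiction"}'
genres = parse_genres(example)

# An empty dictionary string, as seen in some language values, yields an empty list
empty = parse_genres("{}")
print(genres, empty)
```

Parsing this way avoids splitting on punctuation, so a genre name that happens to contain a comma or colon would stay intact.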

In [137]:
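import re

def remove_ids(genre_str): 
    """Return the genre names from a '{"id": "name", ...}' string, dropping the Freebase IDs."""
    #split string on colons and commas
    #(a genre name that itself contains ', ' or ': ' would be split incorrectly)
    genre_id_lists = re.split(r': |, ', genre_str)
    #select odd-indexed values (genre names; even-indexed values are the IDs)
    genre_list = genre_id_lists[1::2]
    #remove leftover quotes and braces
    clean_genres = [re.sub(r'["{}]', '', g) for g in genre_list] 
    return clean_genres


#apply to the genres column directly rather than row by row
movies_df['clean_genres'] = movies_df['genres'].apply(remove_ids)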

In [140]:
movies_df.head(10)

Unnamed: 0,wiki_ID,freebase_ID,title,date,revenue,runtime,language,country,genres,match_title,plot,imdbId,Imdb Link,Title,IMDB Score,Genre,Poster,noid_genres,clean_genres
0,975900,/m/03vyhn,Ghosts of Mars,2001-08-24,14010832.0,98.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01jfsb"": ""Thriller"", ""/m/06n90"": ""Science...",ghostsofmars,"Set in the second half of the 22nd century, th...",228333,http://www.imdb.com/title/tt228333,Ghosts of Mars (2001),4.9,Action|Horror|Sci-Fi,https://images-na.ssl-images-amazon.com/images...,"[""Thriller"", ""Science Fiction"", ""Horror"", ""Adv...","[Thriller, Science Fiction, Horror, Adventure,..."
1,9363483,/m/0285_cd,White Of The Eye,1987,,110.0,"{""/m/02h40lc"": ""English Language""}","{""/m/07ssc"": ""United Kingdom""}","{""/m/01jfsb"": ""Thriller"", ""/m/0glj9q"": ""Erotic...",whiteoftheeye,A series of murders of rich young women throug...,94320,http://www.imdb.com/title/tt94320,White of the Eye (1987),6.4,Horror|Thriller,https://images-na.ssl-images-amazon.com/images...,"[""Thriller"", ""Erotic thriller"", ""Psychological...","[Thriller, Erotic thriller, Psychological thri..."
2,18998739,/m/04jcqvw,The Sorcerer's Apprentice,2002,,86.0,"{""/m/02h40lc"": ""English Language""}","{""/m/0hzlz"": ""South Africa""}","{""/m/0hqxf"": ""Family Film"", ""/m/01hmnh"": ""Fant...",thesorcerersapprentice,"Every hundred years, the evil Morgana returns...",963966,http://www.imdb.com/title/tt963966,The Sorcerer's Apprentice (2010),6.1,Action|Adventure|Family,https://images-na.ssl-images-amazon.com/images...,"[""Family Film"", ""Fantasy"", ""Adventure"", ""World...","[Family Film, Fantasy, Adventure, World cinema]"
3,12621957,/m/05pdd86,The Sorcerer's Apprentice,2010-07-08,215283742.0,111.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/06n90"": ""Science Fiction"", ""/m/03k9fj"": ""...",thesorcerersapprentice,"In AD 740, one of Merlin's three apprentices...",963966,http://www.imdb.com/title/tt963966,The Sorcerer's Apprentice (2010),6.1,Action|Adventure|Family,https://images-na.ssl-images-amazon.com/images...,"[""Science Fiction"", ""Adventure"", ""Fantasy"", ""C...","[Science Fiction, Adventure, Fantasy, Comedy, ..."
4,6631279,/m/0gffwj,Little city,1997-04-04,,93.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/06cvj"": ""Romantic comedy"", ""/m/0hj3n0w"": ...",littlecity,"Adam, a San Francisco-based artist who works a...",119548,http://www.imdb.com/title/tt119548,Little City (1997),6.1,Comedy|Romance,https://images-na.ssl-images-amazon.com/images...,"[""Romantic comedy"", ""Ensemble Film"", ""Comedy-d...","[Romantic comedy, Ensemble Film, Comedy-drama,..."
5,171005,/m/016ywb,Henry V,1989-11-08,10161099.0,137.0,"{""/m/02h40lc"": ""English Language""}","{""/m/07ssc"": ""United Kingdom""}","{""/m/04xvh5"": ""Costume drama"", ""/m/082gq"": ""Wa...",henryv,{{Plot|dateAct 1Act 2Act 3Act 4Act 5 Finally n...,97499,http://www.imdb.com/title/tt97499,Henry V (1989),7.7,Action|Biography|Drama,https://images-na.ssl-images-amazon.com/images...,"[""Costume drama"", ""War film"", ""Epic"", ""Period ...","[Costume drama, War film, Epic, Period piece, ..."
6,171005,/m/016ywb,Henry V,1989-11-08,10161099.0,137.0,"{""/m/02h40lc"": ""English Language""}","{""/m/07ssc"": ""United Kingdom""}","{""/m/04xvh5"": ""Costume drama"", ""/m/082gq"": ""Wa...",henryv,{{Plot|dateAct 1Act 2Act 3Act 4Act 5 Finally n...,36910,http://www.imdb.com/title/tt36910,Henry V (1944),7.3,Biography|Drama|History,https://images-na.ssl-images-amazon.com/images...,"[""Costume drama"", ""War film"", ""Epic"", ""Period ...","[Costume drama, War film, Epic, Period piece, ..."
7,80493,/m/0ktqc,Henry V,1944,,135.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America"", ""/m/...","{""/m/04xvh5"": ""Costume drama"", ""/m/0520lz"": ""R...",henryv,We first see a panorama of London in 1600. We...,97499,http://www.imdb.com/title/tt97499,Henry V (1989),7.7,Action|Biography|Drama,https://images-na.ssl-images-amazon.com/images...,"[""Costume drama"", ""Roadshow theatrical release...","[Costume drama, Roadshow theatrical release, D..."
8,80493,/m/0ktqc,Henry V,1944,,135.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America"", ""/m/...","{""/m/04xvh5"": ""Costume drama"", ""/m/0520lz"": ""R...",henryv,We first see a panorama of London in 1600. We...,36910,http://www.imdb.com/title/tt36910,Henry V (1944),7.3,Biography|Drama|History,https://images-na.ssl-images-amazon.com/images...,"[""Costume drama"", ""Roadshow theatrical release...","[Costume drama, Roadshow theatrical release, D..."
9,77856,/m/0kcn7,Mary Poppins,1964-08-27,102272727.0,139.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/0hj3myq"": ""Children's/Family"", ""/m/04t36""...",marypoppins,The film opens with Mary Poppins perched in a...,58331,http://www.imdb.com/title/tt58331,Mary Poppins (1964),7.8,Comedy|Family|Fantasy,https://images-na.ssl-images-amazon.com/images...,"[""Children's/Family"", ""Musical"", ""Fantasy"", ""C...","[Children's/Family, Musical, Fantasy, Comedy, ..."


In [139]:
movies_df.iloc[0,-1]

['Thriller',
 'Science Fiction',
 'Horror',
 'Adventure',
 'Supernatural',
 'Action',
 'Space western']

In [None]:
#many genres can be applied to a single film 
#to keep it simple, let's select only the first one
#note that they're not ordered alphabetically 
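
Selecting the first listed genre might look like the sketch below. `first_genre` is a hypothetical helper, and returning `None` for films with no genres is an assumption about how empty lists should be handled.

```python
def first_genre(genre_list):
    """Return the first genre in the list, or None when the list is empty."""
    return genre_list[0] if genre_list else None

# The cleaned genres of the first row shown above
ghosts_of_mars = ['Thriller', 'Science Fiction', 'Horror', 'Adventure',
                  'Supernatural', 'Action', 'Space western']

main = first_genre(ghosts_of_mars)
no_genre = first_genre([])
print(main, no_genre)
```

On the full table this could be applied as `movies_df['clean_genres'].apply(first_genre)` and stored in a new column (for example a hypothetical `main_genre`).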

## Bechdel Test

In [118]:
bechdel_df = pd.read_csv('bechdel_movies.csv')
bechdel_df.head()

Unnamed: 0.1,Unnamed: 0,year,imdb,title,test,clean_test,binary,budget,domgross,intgross,...,director,released,actors,genre,awards,runtime,type,poster,imdb_votes,error
0,1,2013,tt1711425,21 &amp; Over,notalk,notalk,FAIL,13000000.0,25682380.0,42195766.0,...,,,,,,,,,,
1,2,2012,tt1343727,Dredd 3D,ok-disagree,ok,PASS,45000000.0,13414714.0,40868994.0,...,,,,,,,,,,
2,3,2013,tt2024544,12 Years a Slave,notalk-disagree,notalk,FAIL,20000000.0,53107035.0,158607035.0,...,Steve McQueen,08 Nov 2013,"Chiwetel Ejiofor, Dwight Henry, Dickie Gravois...","Biography, Drama, History",Won 3 Oscars. Another 131 wins & 137 nominations.,134 min,movie,http://ia.media-imdb.com/images/M/MV5BMjExMTEz...,143446.0,
3,4,2013,tt1272878,2 Guns,notalk,notalk,FAIL,61000000.0,75612460.0,132493015.0,...,Baltasar Kormákur,02 Aug 2013,"Denzel Washington, Mark Wahlberg, Paula Patton...","Action, Comedy, Crime",1 win.,109 min,movie,http://ia.media-imdb.com/images/M/MV5BNTQ5MTgz...,87301.0,
4,5,2013,tt0453562,42,men,men,FAIL,40000000.0,95020213.0,95020213.0,...,Brian Helgeland,12 Apr 2013,"Chadwick Boseman, Harrison Ford, Nicole Behari...","Biography, Drama, Sport",3 wins & 13 nominations.,128 min,movie,http://ia.media-imdb.com/images/M/MV5BMTQwMDU4...,43608.0,


In [119]:
bechdel_df.columns

Index(['Unnamed: 0', 'year', 'imdb', 'title', 'test', 'clean_test', 'binary',
       'budget', 'domgross', 'intgross', 'code', 'budget_2013',
       'domgross_2013', 'intgross_2013', 'period_code', 'decade_code',
       'imdb_id', 'plot', 'rated', 'response', 'language', 'country', 'writer',
       'metascore', 'imdb_rating', 'director', 'released', 'actors', 'genre',
       'awards', 'runtime', 'type', 'poster', 'imdb_votes', 'error'],
      dtype='object')

In [120]:
#use the IMDb ID to merge the Bechdel data into the dataset with plots 

movies_bechdel = movies_df_plots.merge(bechdel_df, how = 'inner', left_on = 'imdbId', right_on ='imdb_id')
movies_bechdel.head()

Unnamed: 0,wiki_ID,freebase_ID,title_x,date,revenue,runtime_x,language_x,country_x,genres,match_title,...,director,released,actors,genre,awards,runtime_y,type,poster,imdb_votes,error
0,18998739,/m/04jcqvw,The Sorcerer's Apprentice,2002,,86.0,"{""/m/02h40lc"": ""English Language""}","{""/m/0hzlz"": ""South Africa""}","{""/m/0hqxf"": ""Family Film"", ""/m/01hmnh"": ""Fant...",thesorcerersapprentice,...,,,,,,,,,,
1,12621957,/m/05pdd86,The Sorcerer's Apprentice,2010-07-08,215283742.0,111.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/06n90"": ""Science Fiction"", ""/m/03k9fj"": ""...",thesorcerersapprentice,...,,,,,,,,,,
2,171005,/m/016ywb,Henry V,1989-11-08,10161099.0,137.0,"{""/m/02h40lc"": ""English Language""}","{""/m/07ssc"": ""United Kingdom""}","{""/m/04xvh5"": ""Costume drama"", ""/m/082gq"": ""Wa...",henryv,...,Kenneth Branagh,08 Nov 1989,"Derek Jacobi, Kenneth Branagh, Simon Shepherd,...","Action, Biography, Drama",Won 1 Oscar. Another 10 wins & 11 nominations.,137 min,movie,http://ia.media-imdb.com/images/M/MV5BMTI1ODg1...,20002.0,
3,80493,/m/0ktqc,Henry V,1944,,135.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America"", ""/m/...","{""/m/04xvh5"": ""Costume drama"", ""/m/0520lz"": ""R...",henryv,...,Kenneth Branagh,08 Nov 1989,"Derek Jacobi, Kenneth Branagh, Simon Shepherd,...","Action, Biography, Drama",Won 1 Oscar. Another 10 wins & 11 nominations.,137 min,movie,http://ia.media-imdb.com/images/M/MV5BMTI1ODg1...,20002.0,
4,28271896,/m/0cp0zcq,The Net,1953-11-05,,86.0,"{""/m/02h40lc"": ""English Language""}","{""/m/07ssc"": ""United Kingdom""}","{""/m/07s9rl0"": ""Drama""}",thenet,...,Irwin Winkler,28 Jul 1995,"Sandra Bullock, Jeremy Northam, Dennis Miller,...","Action, Crime, Drama",1 nomination.,114 min,movie,http://ia.media-imdb.com/images/M/MV5BMTU0NjA2...,40448.0,
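
The two sources write IMDb IDs differently: the poster data stores integers such as 405977 (link `tt405977`), while the Bechdel `imdb` column stores strings such as `tt0453562`. If the key columns disagree in type or zero padding, rows can silently fail to match. A minimal sketch of a normalizer follows, assuming every ID is an optional `tt` prefix plus digits; `imdb_key` is a hypothetical helper, not something defined in this notebook.

```python
def imdb_key(raw):
    """Normalize 'tt0453562', '0453562', 453562, or 453562.0 to the int 453562."""
    text = str(raw).strip().lower()
    if text.startswith("tt"):
        text = text[2:]
    # float() tolerates a trailing '.0' left by pandas float columns
    return int(float(text))

keys = [imdb_key(v) for v in ["tt0453562", "0453562", 453562, 453562.0]]
print(keys)
```

Missing values are not handled here (a NaN raises `ValueError`), so they would need to be dropped or filled before applying this to a key column.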


In [121]:
len(movies_bechdel)
#very small! 


2037
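
To see why the matched set is so small, and whether repeated keys are multiplying rows (as with the two Henry V films above), it may help to count duplicate keys and use the `indicator` option of `merge`. The sketch below runs on toy data; the IDs, titles, and labels are made up for illustration and are not taken from either dataset.

```python
import pandas as pd

# Toy stand-ins for the two tables; all values are illustrative only
left = pd.DataFrame({"imdbId": [1, 1, 2, 3],
                     "title": ["A (2002)", "A (2010)", "B", "C"]})
right = pd.DataFrame({"imdb_id": [1, 2, 4],
                      "binary": ["PASS", "FAIL", "PASS"]})

# Rows whose key appears more than once on the left all match the same right row
dup_keys = left[left.duplicated("imdbId", keep=False)]

# An outer merge with indicator=True labels each row both / left_only / right_only
merged = left.merge(right, how="outer", left_on="imdbId", right_on="imdb_id",
                    indicator=True)
match_counts = merged["_merge"].value_counts()
print(dup_keys)
print(match_counts)
```

The `left_only` and `right_only` counts show how many films each source loses in an inner merge, and `dup_keys` lists the rows that would need deduplicating before the Bechdel labels can be trusted.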

In [122]:
#save anyway? 
movies_bechdel.to_csv('small_matched_bechdel.csv')