# Data Preparation 

## Deep Learning Final Project 

### Jade Benson

In this project, we use several types of movie data (Wikipedia plot summaries and metadata, posters, and Bechdel test results) to describe this art form with deep learning methods. We explore each data type individually and in combination to understand the landscape of movies. We will eventually use them to predict genre and whether a movie passes the Bechdel test.

This notebook combines three disparate data sources:

CMU Wikipedia plots, characters, genres, networks: http://www.cs.cmu.edu/~ark/personas/

Movie poster links (Kaggle scraped from IMDB): https://www.kaggle.com/datasets/neha1703/movie-genre-from-its-poster

Bechdel test: https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-03-09/readme.md

In [1]:
import pandas as pd 
import numpy as np 
import sklearn 
import matplotlib

In [4]:
## CMU Wikipedia movie dataset 

#first use the "Movie meta data" file
#it holds movie name, release date, genres, and Wikipedia movie ID
#use the Wikipedia movie ID to link with the plot summaries in plot_summaries.txt

movie_info = pd.read_csv('MovieSummaries/movie.metadata.tsv', sep = '\t', header = None)


In [5]:
movie_info.head()
#final columns are dictionaries with freebase ID: value 

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,975900,/m/03vyhn,Ghosts of Mars,2001-08-24,14010832.0,98.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01jfsb"": ""Thriller"", ""/m/06n90"": ""Science..."
1,3196793,/m/08yl5d,Getting Away with Murder: The JonBenét Ramsey ...,2000-02-16,,95.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/02n4kr"": ""Mystery"", ""/m/03bxz7"": ""Biograp..."
2,28463795,/m/0crgdbh,Brun bitter,1988,,83.0,"{""/m/05f_3"": ""Norwegian Language""}","{""/m/05b4w"": ""Norway""}","{""/m/0lsxr"": ""Crime Fiction"", ""/m/07s9rl0"": ""D..."
3,9363483,/m/0285_cd,White Of The Eye,1987,,110.0,"{""/m/02h40lc"": ""English Language""}","{""/m/07ssc"": ""United Kingdom""}","{""/m/01jfsb"": ""Thriller"", ""/m/0glj9q"": ""Erotic..."
4,261236,/m/01mrr1,A Woman in Flames,1983,,106.0,"{""/m/04306rv"": ""German Language""}","{""/m/0345h"": ""Germany""}","{""/m/07s9rl0"": ""Drama""}"


In [7]:
movie_info.columns = ["wiki_ID", "freebase_ID", "title", "date", "revenue", "runtime", "language", "country", "genres"]

In [10]:
movie_info.head()

Unnamed: 0,wiki_ID,freebase_ID,title,date,revenue,runtime,language,country,genres
0,975900,/m/03vyhn,Ghosts of Mars,2001-08-24,14010832.0,98.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01jfsb"": ""Thriller"", ""/m/06n90"": ""Science..."
1,3196793,/m/08yl5d,Getting Away with Murder: The JonBenét Ramsey ...,2000-02-16,,95.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/02n4kr"": ""Mystery"", ""/m/03bxz7"": ""Biograp..."
2,28463795,/m/0crgdbh,Brun bitter,1988,,83.0,"{""/m/05f_3"": ""Norwegian Language""}","{""/m/05b4w"": ""Norway""}","{""/m/0lsxr"": ""Crime Fiction"", ""/m/07s9rl0"": ""D..."
3,9363483,/m/0285_cd,White Of The Eye,1987,,110.0,"{""/m/02h40lc"": ""English Language""}","{""/m/07ssc"": ""United Kingdom""}","{""/m/01jfsb"": ""Thriller"", ""/m/0glj9q"": ""Erotic..."
4,261236,/m/01mrr1,A Woman in Flames,1983,,106.0,"{""/m/04306rv"": ""German Language""}","{""/m/0345h"": ""Germany""}","{""/m/07s9rl0"": ""Drama""}"
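
The `language`, `country`, and `genres` columns hold dictionary strings that map a Freebase ID to a label. A minimal sketch of turning them into plain label lists follows, assuming each cell is a valid JSON string; the toy frame and the helper name `parse_labels` are illustrative and not part of the original notebook.

```python
import json
import pandas as pd

# Toy rows shaped like the `genres` column above (Freebase ID -> label).
toy = pd.DataFrame({
    "wiki_ID": [975900, 261236],
    "genres": [
        '{"/m/01jfsb": "Thriller", "/m/06n90": "Science Fiction"}',
        '{"/m/07s9rl0": "Drama"}',
    ],
})

def parse_labels(cell):
    """Hypothetical helper: return the labels of a JSON dictionary string, [] if missing."""
    if not isinstance(cell, str):
        return []
    return list(json.loads(cell).values())

toy["genre_list"] = toy["genres"].apply(parse_labels)
print(toy[["wiki_ID", "genre_list"]])
```

Keeping only the labels discards the Freebase IDs, which is probably fine for genre prediction but would need revisiting if the IDs are used for linking later.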


In [8]:
len(movie_info)
#81,741 movies

81741

In [9]:
len(movie_info.drop_duplicates())
#no duplicates

81741

In [13]:
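#lowercase titles and strip spaces and punctuation to make matching easier
#note: the regex keeps only a-z and digits, so accented letters are dropped too (e.g. "JonBenét" becomes "jonbent")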
import re

clean_titles = lambda x: re.sub(r'[^a-z\d]', '', x.lower())
movie_info['match_title'] = movie_info["title"].apply(clean_titles)


In [14]:
movie_info.head()

Unnamed: 0,wiki_ID,freebase_ID,title,date,revenue,runtime,language,country,genres,match_title
0,975900,/m/03vyhn,Ghosts of Mars,2001-08-24,14010832.0,98.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01jfsb"": ""Thriller"", ""/m/06n90"": ""Science...",ghostsofmars
1,3196793,/m/08yl5d,Getting Away with Murder: The JonBenét Ramsey ...,2000-02-16,,95.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/02n4kr"": ""Mystery"", ""/m/03bxz7"": ""Biograp...",gettingawaywithmurderthejonbentramseymystery
2,28463795,/m/0crgdbh,Brun bitter,1988,,83.0,"{""/m/05f_3"": ""Norwegian Language""}","{""/m/05b4w"": ""Norway""}","{""/m/0lsxr"": ""Crime Fiction"", ""/m/07s9rl0"": ""D...",brunbitter
3,9363483,/m/0285_cd,White Of The Eye,1987,,110.0,"{""/m/02h40lc"": ""English Language""}","{""/m/07ssc"": ""United Kingdom""}","{""/m/01jfsb"": ""Thriller"", ""/m/0glj9q"": ""Erotic...",whiteoftheeye
4,261236,/m/01mrr1,A Woman in Flames,1983,,106.0,"{""/m/04306rv"": ""German Language""}","{""/m/0345h"": ""Germany""}","{""/m/07s9rl0"": ""Drama""}",awomaninflames
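
The second row above shows that `clean_titles` drops accented letters entirely ("JonBenét" becomes "jonbent"). One possible alternative, sketched here as an assumption rather than the approach used in this notebook, is to fold accents to their ASCII base letters before stripping. The function name `clean_title_ascii` is hypothetical.

```python
import re
import unicodedata

def clean_title_ascii(title):
    """Lowercase, fold accents to ASCII, then keep only a-z and digits."""
    folded = unicodedata.normalize("NFKD", title)              # split letters from accents
    folded = folded.encode("ascii", "ignore").decode("ascii")  # drop the accent marks
    return re.sub(r"[^a-z\d]", "", folded.lower())

sample = "JonBen\u00e9t Ramsey"  # "JonBenét Ramsey" with a precomposed é

original = re.sub(r"[^a-z\d]", "", sample.lower())  # the notebook's current rule
folded = clean_title_ascii(sample)

print(original)
print(folded)
```

Whether this improves matching depends on how the poster dataset spells such titles, so it is worth comparing match counts under both rules before switching.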


In [17]:
# Poster dataset 

movie_posters = pd.read_csv('MoviePosters/MovieGenre.csv', encoding = "ISO-8859-1") #different encoding 
movie_posters.head()


Unnamed: 0,imdbId,Imdb Link,Title,IMDB Score,Genre,Poster
0,114709,http://www.imdb.com/title/tt114709,Toy Story (1995),8.3,Animation|Adventure|Comedy,https://images-na.ssl-images-amazon.com/images...
1,113497,http://www.imdb.com/title/tt113497,Jumanji (1995),6.9,Action|Adventure|Family,https://images-na.ssl-images-amazon.com/images...
2,113228,http://www.imdb.com/title/tt113228,Grumpier Old Men (1995),6.6,Comedy|Romance,https://images-na.ssl-images-amazon.com/images...
3,114885,http://www.imdb.com/title/tt114885,Waiting to Exhale (1995),5.7,Comedy|Drama|Romance,https://images-na.ssl-images-amazon.com/images...
4,113041,http://www.imdb.com/title/tt113041,Father of the Bride Part II (1995),5.9,Comedy|Family|Romance,https://images-na.ssl-images-amazon.com/images...


In [20]:
#want to remove the release years from the titles

#remove any text within parentheses (this also removes parenthesized text that is not a year)
remove_dates = lambda x: re.sub(r'\([^()]*\)', '', x)
movie_posters['match_title'] = movie_posters["Title"].apply(remove_dates)
movie_posters['match_title'] = movie_posters["match_title"].apply(clean_titles)


In [21]:
movie_posters.head()
#looks good

Unnamed: 0,imdbId,Imdb Link,Title,IMDB Score,Genre,Poster,match_title
0,114709,http://www.imdb.com/title/tt114709,Toy Story (1995),8.3,Animation|Adventure|Comedy,https://images-na.ssl-images-amazon.com/images...,toystory
1,113497,http://www.imdb.com/title/tt113497,Jumanji (1995),6.9,Action|Adventure|Family,https://images-na.ssl-images-amazon.com/images...,jumanji
2,113228,http://www.imdb.com/title/tt113228,Grumpier Old Men (1995),6.6,Comedy|Romance,https://images-na.ssl-images-amazon.com/images...,grumpieroldmen
3,114885,http://www.imdb.com/title/tt114885,Waiting to Exhale (1995),5.7,Comedy|Drama|Romance,https://images-na.ssl-images-amazon.com/images...,waitingtoexhale
4,113041,http://www.imdb.com/title/tt113041,Father of the Bride Part II (1995),5.9,Comedy|Family|Romance,https://images-na.ssl-images-amazon.com/images...,fatherofthebridepartii
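
The year in parentheses is discarded above, but it could also be kept as a separate column for later disambiguation. The sketch below assumes the year always appears as a trailing four-digit number in parentheses; the column names `poster_year` and `bare_title` are illustrative.

```python
import pandas as pd

posters = pd.DataFrame({
    "Title": ["Toy Story (1995)", "The Sorcerer's Apprentice (2010)", "Untitled"],
})

# Capture a trailing four-digit year in parentheses; rows without one become <NA>.
year = posters["Title"].str.extract(r"\((\d{4})\)\s*$", expand=False)
posters["poster_year"] = pd.to_numeric(year).astype("Int64")

# Remove only the trailing year, leaving other parenthesized text untouched.
posters["bare_title"] = posters["Title"].str.replace(r"\s*\(\d{4}\)\s*$", "", regex=True)

print(posters)
```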


In [23]:
#inner merge between the two dataframes 

movies_df = movie_info.merge(movie_posters, how = 'inner', left_on = 'match_title', right_on ='match_title')

In [24]:
len(movies_df)
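#32,278 rows. That looks great, but matching on title alone is many-to-many:
#films that share a title (e.g. the three "The Sorcerer's Apprentice" entries below) all pair with the same poster, so some rows are mismatches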

32278

In [25]:
movies_df.head()

Unnamed: 0,wiki_ID,freebase_ID,title,date,revenue,runtime,language,country,genres,match_title,imdbId,Imdb Link,Title,IMDB Score,Genre,Poster
0,975900,/m/03vyhn,Ghosts of Mars,2001-08-24,14010832.0,98.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01jfsb"": ""Thriller"", ""/m/06n90"": ""Science...",ghostsofmars,228333,http://www.imdb.com/title/tt228333,Ghosts of Mars (2001),4.9,Action|Horror|Sci-Fi,https://images-na.ssl-images-amazon.com/images...
1,9363483,/m/0285_cd,White Of The Eye,1987,,110.0,"{""/m/02h40lc"": ""English Language""}","{""/m/07ssc"": ""United Kingdom""}","{""/m/01jfsb"": ""Thriller"", ""/m/0glj9q"": ""Erotic...",whiteoftheeye,94320,http://www.imdb.com/title/tt94320,White of the Eye (1987),6.4,Horror|Thriller,https://images-na.ssl-images-amazon.com/images...
2,18998739,/m/04jcqvw,The Sorcerer's Apprentice,2002,,86.0,"{""/m/02h40lc"": ""English Language""}","{""/m/0hzlz"": ""South Africa""}","{""/m/0hqxf"": ""Family Film"", ""/m/01hmnh"": ""Fant...",thesorcerersapprentice,963966,http://www.imdb.com/title/tt963966,The Sorcerer's Apprentice (2010),6.1,Action|Adventure|Family,https://images-na.ssl-images-amazon.com/images...
3,18997933,/m/04j9kx1,The Sorcerer's Apprentice,1955-05,,13.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America"", ""/m/...","{""/m/02hmvc"": ""Short Film""}",thesorcerersapprentice,963966,http://www.imdb.com/title/tt963966,The Sorcerer's Apprentice (2010),6.1,Action|Adventure|Family,https://images-na.ssl-images-amazon.com/images...
4,12621957,/m/05pdd86,The Sorcerer's Apprentice,2010-07-08,215283742.0,111.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/06n90"": ""Science Fiction"", ""/m/03k9fj"": ""...",thesorcerersapprentice,963966,http://www.imdb.com/title/tt963966,The Sorcerer's Apprentice (2010),6.1,Action|Adventure|Family,https://images-na.ssl-images-amazon.com/images...
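
Rows 2 to 4 above show three different films all matched to the 2010 poster. One way to reduce such mismatches is to merge on title and release year together. This is a sketch on toy data, assuming the year can be read from the first four characters of `date` and from the parentheses in `Title`; the `year` column is an addition that does not exist in the notebook.

```python
import pandas as pd

info = pd.DataFrame({
    "wiki_ID": [18998739, 18997933, 12621957],
    "date": ["2002", "1955-05", "2010-07-08"],
    "match_title": ["thesorcerersapprentice"] * 3,
})
posters = pd.DataFrame({
    "imdbId": [963966],
    "Title": ["The Sorcerer's Apprentice (2010)"],
    "match_title": ["thesorcerersapprentice"],
})

# Derive a nullable integer year on both sides.
info["year"] = pd.to_numeric(info["date"].str[:4], errors="coerce").astype("Int64")
poster_year = posters["Title"].str.extract(r"\((\d{4})\)\s*$", expand=False)
posters["year"] = pd.to_numeric(poster_year).astype("Int64")

title_only = info.merge(posters, on="match_title")            # current behaviour
title_year = info.merge(posters, on=["match_title", "year"])  # stricter key

print(len(title_only), len(title_year))
```

The stricter key will also lose some true matches, since the two sources may disagree on the release year or lack a date altogether, so the trade-off should be checked on the full data.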


In [26]:
#save this 

movies_df.to_csv('movie_info_posters.csv')

In [34]:
#code for how to merge with the summaries 

movie_lines = []

with open('MovieSummaries/plot_summaries.txt') as file:
    for line in file:
        line_clean = line.rstrip()
        line_clean = line_clean.split("\t")
        movie_lines.append(line_clean)

In [35]:
movie_lines[0:5]

[['23890098',
  "Shlykov, a hard-working taxi driver and Lyosha, a saxophonist, develop a bizarre love-hate relationship, and despite their prejudices, realize they aren't so different after all."],
 ['31186339',
  'The nation of Panem consists of a wealthy Capitol and twelve poorer districts. As punishment for a past rebellion, each district must provide a boy and girl  between the ages of 12 and 18 selected by lottery  for the annual Hunger Games. The tributes must fight to the death in an arena; the sole survivor is rewarded with fame and wealth. In her first Reaping, 12-year-old Primrose Everdeen is chosen from District 12. Her older sister Katniss volunteers to take her place. Peeta Mellark, a baker\'s son who once gave Katniss bread when she was starving, is the other District 12 tribute. Katniss and Peeta are taken to the Capitol, accompanied by their frequently drunk mentor, past victor Haymitch Abernathy. He warns them about the "Career" tributes who train intensively at speci

In [36]:
movie_plots = pd.DataFrame(movie_lines, columns = ['wiki_ID', 'plot'])
movie_plots.head()
#the plots should be cleaned before using! 

Unnamed: 0,wiki_ID,plot
0,23890098,"Shlykov, a hard-working taxi driver and Lyosha..."
1,31186339,The nation of Panem consists of a wealthy Capi...
2,20663735,Poovalli Induchoodan is sentenced for six yea...
3,2231378,"The Lemon Drop Kid , a New York City swindler,..."
4,595909,Seventh-day Adventist Church pastor Michael Ch...
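
As an alternative to the manual loop, the tab-separated file can probably be read directly with `pd.read_csv`. The sketch below uses an in-memory stand-in with abbreviated, illustrative text rather than the real file, and assumes every line has exactly one tab. `quoting=csv.QUOTE_NONE` keeps the quotation marks inside the plots from being treated as field delimiters.

```python
import csv
import io
import pandas as pd

# In-memory stand-in for MovieSummaries/plot_summaries.txt (text abbreviated).
sample = io.StringIO(
    '23890098\tShlykov, a hard-working taxi driver and Lyosha, a saxophonist.\n'
    '31186339\tHe warns them about the "Career" tributes.\n'
)

plots = pd.read_csv(
    sample,
    sep="\t",
    header=None,
    names=["wiki_ID", "plot"],
    quoting=csv.QUOTE_NONE,  # treat double quotes as ordinary characters
)

print(plots.dtypes)
```

Read this way, `wiki_ID` is parsed as an integer column from the start, so no separate type conversion is needed before merging.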


In [39]:
movie_plots.dtypes

wiki_ID    object
plot       object
dtype: object

In [40]:
movie_plots.wiki_ID = movie_plots.wiki_ID.astype(np.int64)

In [41]:
movies_df_plots = movies_df.merge(movie_plots, how = 'inner', left_on = 'wiki_ID', right_on ='wiki_ID')
movies_df_plots.head()

Unnamed: 0,wiki_ID,freebase_ID,title,date,revenue,runtime,language,country,genres,match_title,imdbId,Imdb Link,Title,IMDB Score,Genre,Poster,plot
0,975900,/m/03vyhn,Ghosts of Mars,2001-08-24,14010832.0,98.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01jfsb"": ""Thriller"", ""/m/06n90"": ""Science...",ghostsofmars,228333,http://www.imdb.com/title/tt228333,Ghosts of Mars (2001),4.9,Action|Horror|Sci-Fi,https://images-na.ssl-images-amazon.com/images...,"Set in the second half of the 22nd century, th..."
1,9363483,/m/0285_cd,White Of The Eye,1987,,110.0,"{""/m/02h40lc"": ""English Language""}","{""/m/07ssc"": ""United Kingdom""}","{""/m/01jfsb"": ""Thriller"", ""/m/0glj9q"": ""Erotic...",whiteoftheeye,94320,http://www.imdb.com/title/tt94320,White of the Eye (1987),6.4,Horror|Thriller,https://images-na.ssl-images-amazon.com/images...,A series of murders of rich young women throug...
2,18998739,/m/04jcqvw,The Sorcerer's Apprentice,2002,,86.0,"{""/m/02h40lc"": ""English Language""}","{""/m/0hzlz"": ""South Africa""}","{""/m/0hqxf"": ""Family Film"", ""/m/01hmnh"": ""Fant...",thesorcerersapprentice,963966,http://www.imdb.com/title/tt963966,The Sorcerer's Apprentice (2010),6.1,Action|Adventure|Family,https://images-na.ssl-images-amazon.com/images...,"Every hundred years, the evil Morgana returns..."
3,12621957,/m/05pdd86,The Sorcerer's Apprentice,2010-07-08,215283742.0,111.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/06n90"": ""Science Fiction"", ""/m/03k9fj"": ""...",thesorcerersapprentice,963966,http://www.imdb.com/title/tt963966,The Sorcerer's Apprentice (2010),6.1,Action|Adventure|Family,https://images-na.ssl-images-amazon.com/images...,"In AD 740, one of Merlin's three apprentices..."
4,6631279,/m/0gffwj,Little city,1997-04-04,,93.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/06cvj"": ""Romantic comedy"", ""/m/0hj3n0w"": ...",littlecity,119548,http://www.imdb.com/title/tt119548,Little City (1997),6.1,Comedy|Romance,https://images-na.ssl-images-amazon.com/images...,"Adam, a San Francisco-based artist who works a..."


In [45]:
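#fewer rows than before: the inner merge drops every movie without a matching plot summary
#the corpus has summaries for only a subset of the movies in the metadata file, so this is expected rather than a wiki_ID formatting problem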
len(movies_df_plots)

22773
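
To see exactly which rows are lost in this merge, a left merge with `indicator=True` labels each row by where its key was found. The frames below are toy stand-ins for `movies_df` and `movie_plots`, with placeholder plot text.

```python
import pandas as pd

movies = pd.DataFrame({
    "wiki_ID": [975900, 28463795, 261236],
    "title": ["Ghosts of Mars", "Brun bitter", "A Woman in Flames"],
})
plots = pd.DataFrame({
    "wiki_ID": [975900, 23890098],
    "plot": ["placeholder plot A", "placeholder plot B"],
})

# how="left" keeps every movie; the _merge column says whether a plot was found.
check = movies.merge(plots, on="wiki_ID", how="left", indicator=True)
print(check["_merge"].value_counts())

# Movies that an inner merge would have dropped.
missing_plots = check.loc[check["_merge"] == "left_only", "title"].tolist()
print(missing_plots)
```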

In [42]:
#save this too 
movies_df_plots.to_csv('movie_info_posters_plots.csv')

In [43]:
#lastly, merge in the Bechdel test data in case we want to use it later

bechdel_df = pd.read_csv('bechdel_movies.csv')
bechdel_df.head()

Unnamed: 0.1,Unnamed: 0,year,imdb,title,test,clean_test,binary,budget,domgross,intgross,...,director,released,actors,genre,awards,runtime,type,poster,imdb_votes,error
0,1,2013,tt1711425,21 &amp; Over,notalk,notalk,FAIL,13000000.0,25682380.0,42195766.0,...,,,,,,,,,,
1,2,2012,tt1343727,Dredd 3D,ok-disagree,ok,PASS,45000000.0,13414714.0,40868994.0,...,,,,,,,,,,
2,3,2013,tt2024544,12 Years a Slave,notalk-disagree,notalk,FAIL,20000000.0,53107035.0,158607035.0,...,Steve McQueen,08 Nov 2013,"Chiwetel Ejiofor, Dwight Henry, Dickie Gravois...","Biography, Drama, History",Won 3 Oscars. Another 131 wins & 137 nominations.,134 min,movie,http://ia.media-imdb.com/images/M/MV5BMjExMTEz...,143446.0,
3,4,2013,tt1272878,2 Guns,notalk,notalk,FAIL,61000000.0,75612460.0,132493015.0,...,Baltasar Kormákur,02 Aug 2013,"Denzel Washington, Mark Wahlberg, Paula Patton...","Action, Comedy, Crime",1 win.,109 min,movie,http://ia.media-imdb.com/images/M/MV5BNTQ5MTgz...,87301.0,
4,5,2013,tt0453562,42,men,men,FAIL,40000000.0,95020213.0,95020213.0,...,Brian Helgeland,12 Apr 2013,"Chadwick Boseman, Harrison Ford, Nicole Behari...","Biography, Drama, Sport",3 wins & 13 nominations.,128 min,movie,http://ia.media-imdb.com/images/M/MV5BMTQwMDU4...,43608.0,


In [44]:
bechdel_df.columns

Index(['Unnamed: 0', 'year', 'imdb', 'title', 'test', 'clean_test', 'binary',
       'budget', 'domgross', 'intgross', 'code', 'budget_2013',
       'domgross_2013', 'intgross_2013', 'period_code', 'decade_code',
       'imdb_id', 'plot', 'rated', 'response', 'language', 'country', 'writer',
       'metascore', 'imdb_rating', 'director', 'released', 'actors', 'genre',
       'awards', 'runtime', 'type', 'poster', 'imdb_votes', 'error'],
      dtype='object')

In [46]:
#use the IMDb ID to merge the Bechdel data into the dataset with plots

movies_bechdel = movies_df_plots.merge(bechdel_df, how = 'inner', left_on = 'imdbId', right_on ='imdb_id')
movies_bechdel.head()

Unnamed: 0,wiki_ID,freebase_ID,title_x,date,revenue,runtime_x,language_x,country_x,genres,match_title,...,director,released,actors,genre,awards,runtime_y,type,poster,imdb_votes,error
0,18998739,/m/04jcqvw,The Sorcerer's Apprentice,2002,,86.0,"{""/m/02h40lc"": ""English Language""}","{""/m/0hzlz"": ""South Africa""}","{""/m/0hqxf"": ""Family Film"", ""/m/01hmnh"": ""Fant...",thesorcerersapprentice,...,,,,,,,,,,
1,12621957,/m/05pdd86,The Sorcerer's Apprentice,2010-07-08,215283742.0,111.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/06n90"": ""Science Fiction"", ""/m/03k9fj"": ""...",thesorcerersapprentice,...,,,,,,,,,,
2,171005,/m/016ywb,Henry V,1989-11-08,10161099.0,137.0,"{""/m/02h40lc"": ""English Language""}","{""/m/07ssc"": ""United Kingdom""}","{""/m/04xvh5"": ""Costume drama"", ""/m/082gq"": ""Wa...",henryv,...,Kenneth Branagh,08 Nov 1989,"Derek Jacobi, Kenneth Branagh, Simon Shepherd,...","Action, Biography, Drama",Won 1 Oscar. Another 10 wins & 11 nominations.,137 min,movie,http://ia.media-imdb.com/images/M/MV5BMTI1ODg1...,20002.0,
3,80493,/m/0ktqc,Henry V,1944,,135.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America"", ""/m/...","{""/m/04xvh5"": ""Costume drama"", ""/m/0520lz"": ""R...",henryv,...,Kenneth Branagh,08 Nov 1989,"Derek Jacobi, Kenneth Branagh, Simon Shepherd,...","Action, Biography, Drama",Won 1 Oscar. Another 10 wins & 11 nominations.,137 min,movie,http://ia.media-imdb.com/images/M/MV5BMTI1ODg1...,20002.0,
4,28271896,/m/0cp0zcq,The Net,1953-11-05,,86.0,"{""/m/02h40lc"": ""English Language""}","{""/m/07ssc"": ""United Kingdom""}","{""/m/07s9rl0"": ""Drama""}",thenet,...,Irwin Winkler,28 Jul 1995,"Sandra Bullock, Jeremy Northam, Dennis Miller,...","Action, Crime, Drama",1 nomination.,114 min,movie,http://ia.media-imdb.com/images/M/MV5BMTU0NjA2...,40448.0,


In [47]:
len(movies_bechdel)
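#very small: only 2,037 rows
#some are title-only mismatches carried over from the first merge (e.g. the 1944 "Henry V" above has picked up the 1989 film's Bechdel data)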


2037
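
Because several Wikipedia entries can share one `imdbId` after the title-based merge, each Bechdel record may be copied onto the wrong film. A hedged sketch of flagging and excluding those ambiguous rows follows. All IDs, titles, and `binary` values are made up for illustration, and dropping every ambiguous row is only one possible policy.

```python
import pandas as pd

movies = pd.DataFrame({
    "wiki_ID": [1, 2, 3],
    "title": ["Film A (older)", "Film A (remake)", "Film B"],
    "imdbId": [111, 111, 222],  # hypothetical IDs; two entries share one
})
bechdel = pd.DataFrame({"imdb_id": [111, 222], "binary": ["FAIL", "PASS"]})

# Rows whose imdbId appears more than once cannot be trusted.
ambiguous = movies[movies.duplicated("imdbId", keep=False)]

# validate= makes pandas raise instead of silently duplicating Bechdel rows.
try:
    movies.merge(bechdel, left_on="imdbId", right_on="imdb_id", validate="one_to_one")
    merge_ok = True
except pd.errors.MergeError:
    merge_ok = False

# Conservative policy: keep only films with an unambiguous imdbId.
unique_movies = movies.drop_duplicates("imdbId", keep=False)
clean = unique_movies.merge(
    bechdel, left_on="imdbId", right_on="imdb_id", validate="one_to_one"
)

print(len(ambiguous), merge_ok, clean["title"].tolist())
```

This shrinks an already small dataset further, so resolving the ambiguity upstream, for example by also matching on release year, may be preferable to discarding rows here.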

In [48]:
#save anyway, despite the small size
movies_bechdel.to_csv('small_full_movie_info.csv')