# Genre Popularity

In this notebook we will create the Content Popularity column we will use for our dataset.

## Description

We are going to pull in the title dataframe and attach it to the genre info.

Once each title has a genre, we can estimate a genre's popularity by pooling the popular movies together and seeing which genre has the most high-quality titles.

## The Process

### Dependencies

In [3]:
import pandas as pd
import os

Here we start with the imdb_id column (we don't need the title column for now).

In [74]:
path = os.path.join(os.pardir, 'data', 'interim', 'movie_titles.csv')
titles_df = pd.read_csv(path)
titles_df.drop_duplicates(inplace=True)

Now let's bring in the genres.

In [6]:
path = os.path.join(os.pardir, 'data', 'raw', 'title.basics.tsv.gz')
genres_df = pd.read_csv(path, delimiter='\t')

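The IMDb file marks missing values with `\N` (visible in the samples in this notebook), which keeps columns such as the year stored as strings, and a file this size with mixed value types can trigger pandas' mixed-dtype warning. Below is a minimal sketch of one way to read it more cleanly. The inline `sample_tsv` and the name `genres_sample` are stand-ins made up for illustration; the real call would point at the `title.basics.tsv.gz` path instead.

```python
import io
import pandas as pd

# Tiny stand-in for title.basics.tsv.gz (rows adapted from the samples in this notebook).
sample_tsv = (
    "tconst\ttitleType\tprimaryTitle\tstartYear\truntimeMinutes\tgenres\n"
    "tt0072761\tmovie\tCapone\t1975\t101\tBiography,Crime,Drama\n"
    "tt3335554\tmovie\tArthouse Junkies\t\\N\t80\tDrama,Family,Music\n"
    "tt0144679\tmovie\tWarm to the Touch\t1992\t\\N\tAdult\n"
)

genres_sample = pd.read_csv(
    io.StringIO(sample_tsv),
    sep="\t",
    na_values="\\N",   # treat IMDb's \N marker as a missing value
    low_memory=False,  # infer each column's dtype from the whole file at once
)
print(genres_sample.dtypes)
```

With the missing marker handled at read time, the year and runtime columns come back numeric (with NaN for gaps) rather than as object columns.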


In [21]:
genres_df.rename(columns={'tconst':'imdb_id', 'startYear': 'year'}, inplace=True)

We will now join the two data frames on the 'imdb_id' value (renamed from 'tconst' above).

In [75]:
title_genre_df = titles_df.set_index('imdb_id').join(genres_df.set_index('imdb_id'))

In [76]:
title_genre_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 683297 entries, tt0000009 to tt9916754
Data columns (total 9 columns):
title             683297 non-null object
titleType         683297 non-null object
primaryTitle      683297 non-null object
originalTitle     683297 non-null object
isAdult           683297 non-null int64
year              683297 non-null object
endYear           683297 non-null object
runtimeMinutes    683297 non-null object
genres            683297 non-null object
dtypes: int64(1), object(8)
memory usage: 52.1+ MB


In [77]:
title_genre_df.sample(5)

imdb_id,title,titleType,primaryTitle,originalTitle,isAdult,year,endYear,runtimeMinutes,genres
tt0357504,La bomboniera,movie,La bomboniera,La bomboniera,0,2002,\N,52,Documentary
tt0072761,Capone,movie,Capone,Capone,0,1975,\N,101,"Biography,Crime,Drama"
tt3335554,Arthouse Junkies,movie,Arthouse Junkies,Arthouse Junkies,0,\N,\N,80,"Drama,Family,Music"
tt0144679,Warm to the Touch,movie,Warm to the Touch,Warm to the Touch,1,1992,\N,\N,Adult
tt5899256,Ski Wolf,movie,Ski Wolf,Ski Wolf,0,2008,\N,71,"Comedy,Horror"


In [78]:
title_genre_df.reset_index(inplace=True)

In [79]:
title_genre_year_df = title_genre_df.loc[:, ['imdb_id', 'title', 'genres', 'year']]

Now there are a bunch of duplicates from the joins we performed. Let's drop them and see where we're at.

In [81]:
title_genre_year_df.drop_duplicates(inplace=True)

Next we bring in the dataset from The Numbers, which gives us the budget and revenue info.

In [46]:
path = os.path.join(os.pardir, 'data', 'raw', 'tn.movie_budgets.csv.gz')
budgets_df = pd.read_csv(path)

In [82]:
budgets_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5782 entries, 0 to 5781
Data columns (total 6 columns):
id                   5782 non-null int64
release_date         5782 non-null object
title                5782 non-null object
production_budget    5782 non-null object
domestic_gross       5782 non-null object
worldwide_gross      5782 non-null object
dtypes: int64(1), object(5)
memory usage: 271.2+ KB


In [83]:
budgets_df.sample(5)

Unnamed: 0,id,release_date,title,production_budget,domestic_gross,worldwide_gross
2912,13,"Sep 26, 2017","Mune, le gardien de la lune","$17,000,000",$0,"$14,534,046"
4626,27,"Oct 1, 2006",The Secret,"$3,500,000",$0,$0
4302,3,"Mar 15, 2013",Spring Breakers,"$5,000,000","$14,124,286","$31,149,251"
3708,9,"Oct 26, 2001",High Heels and Low Lifes,"$10,000,000","$226,792","$226,792"
4246,47,"Oct 3, 1980",Somewhere in Time,"$5,100,000","$9,709,597","$9,709,597"


In [84]:
budgets_df.rename(columns={'movie': 'title'}, inplace=True)

In [85]:
genres_budgets = budgets_df.set_index('title').join(title_genre_year_df.set_index('title'))

In [86]:
genres_budgets.reset_index(inplace=True)

In [87]:
genres_budgets.sample(5)

Unnamed: 0,title,id,release_date,production_budget,domestic_gross,worldwide_gross,imdb_id,genres,year
8277,Someone Like You,79,"Mar 30, 2001","$23,000,000","$27,338,033","$38,684,906",tt3495028,Romance,\N
9658,The Four Feathers,16,"Sep 20, 2002","$35,000,000","$18,306,166","$29,882,645",tt0018908,"Adventure,Drama,Romance",1929
7058,Por amor en el caserio,26,"Dec 22, 2015","$1,000,000",$0,$0,tt3107890,"Action,Drama,Romance",2014
4433,Home,60,"Apr 23, 2009","$500,000","$15,433","$44,793,168",tt6626800,Documentary,\N
3805,Grave Encounters,76,"Sep 9, 2011","$2,000,000",$0,"$2,151,887",tt1703199,Horror,2011


In [88]:
genres_budgets.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12716 entries, 0 to 12715
Data columns (total 9 columns):
title                12716 non-null object
id                   12716 non-null int64
release_date         12716 non-null object
production_budget    12716 non-null object
domestic_gross       12716 non-null object
worldwide_gross      12716 non-null object
imdb_id              11995 non-null object
genres               11995 non-null object
year                 11995 non-null object
dtypes: int64(1), object(8)
memory usage: 894.2+ KB


In [89]:
genres_budgets.drop_duplicates(inplace=True)

In [90]:
genres_budgets.sort_values('worldwide_gross', ascending=False)

Unnamed: 0,title,id,release_date,production_budget,domestic_gross,worldwide_gross,imdb_id,genres,year
3187,Fifty Dead Men Walking,38,"Aug 21, 2009","$10,000,000",$0,"$997,921",tt1097643,"Crime,Drama,Thriller",2008
2784,Duma,33,"Sep 30, 2005","$12,000,000","$870,067","$994,790",tt2112940,"Biography,Crime,Documentary",2011
2783,Duma,33,"Sep 30, 2005","$12,000,000","$870,067","$994,790",tt0361715,"Adventure,Drama,Family",2005
4796,Insidious,63,"Apr 1, 2011","$1,500,000","$54,009,150","$99,870,886",tt1225292,"Drama,Mystery,Romance",2008
4797,Insidious,63,"Apr 1, 2011","$1,500,000","$54,009,150","$99,870,886",tt1591095,"Horror,Mystery,Thriller",2010
...,...,...,...,...,...,...,...,...,...
11982,Une Femme Mariée,76,"May 24, 2016","$120,000",$0,$0,,,
2775,"Dude, Where's My Dog",93,"Dec 31, 2014","$100,000",$0,$0,,,
2803,Dutch Kills,52,"Dec 1, 2015","$25,000",$0,$0,tt2759066,"Crime,Drama,Thriller",2015
2804,Dwegons and Leprechauns,14,"Aug 29, 2014","$20,000,000",$0,$0,tt1134666,Animation,2014


Look at that, two Duma movies but with different IMDb IDs. Did we make a mistake?

After going back and looking at everything, I decided to plug each ID into the IMDb website search bar and was surprised to see a different movie named 'Duma' for each one. Nice! Not a bug.

In [91]:
title_genre_df.loc[title_genre_df['imdb_id'] == 'tt0361715']

Unnamed: 0,imdb_id,title,titleType,primaryTitle,originalTitle,isAdult,year,endYear,runtimeMinutes,genres
228829,tt0361715,Duma,movie,Duma,Duma,0,2005,\N,100,"Adventure,Drama,Family"


In [92]:
title_genre_df.loc[title_genre_df['imdb_id'] == 'tt2112940']

Unnamed: 0,imdb_id,title,titleType,primaryTitle,originalTitle,isAdult,year,endYear,runtimeMinutes,genres
414282,tt2112940,Duma,movie,Duma,Duma,0,2011,\N,55,"Biography,Crime,Documentary"
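Because the join is on title alone, each budget row picks up every IMDb entry sharing that title. One possible way to choose between them, sketched here as an assumption rather than something this notebook does, is to keep the IMDb entry whose year is closest to the release year from The Numbers. The frame `matches` and the column `year_gap` are hypothetical names built from the two Duma rows above.

```python
import pandas as pd

# Hypothetical miniature of genres_budgets, using the two 'Duma' rows shown above.
matches = pd.DataFrame({
    "title": ["Duma", "Duma"],
    "release_date": ["Sep 30, 2005", "Sep 30, 2005"],
    "imdb_id": ["tt2112940", "tt0361715"],
    "year": ["2011", "2005"],
})

# Year from The Numbers' release date, e.g. 'Sep 30, 2005' -> 2005.
release_year = pd.to_datetime(matches["release_date"], format="%b %d, %Y").dt.year
# IMDb year is stored as text; anything unparseable (such as '\N') becomes NaN.
imdb_year = pd.to_numeric(matches["year"], errors="coerce")

matches["year_gap"] = (release_year - imdb_year).abs()

# Keep the closest-year candidate for each budget row (NaN gaps sort last).
best = (
    matches.sort_values("year_gap")
    .drop_duplicates(subset=["title", "release_date"], keep="first")
)
print(best[["title", "imdb_id", "year"]])
```

This is only a heuristic: remakes released in the same year, or titles whose IMDb year is missing, would still need a manual check.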


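Now we need to drop the columns we're not going to be using. The slice below only previews the columns we want to keep; the result is not assigned back to the dataframe.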

In [96]:
genres_budgets.loc[:, 'release_date':'genres']

Unnamed: 0,release_date,production_budget,domestic_gross,worldwide_gross,imdb_id,genres
0,"Nov 20, 2015","$1,500,000",$0,$0,tt3526286,"Crime,Drama,Horror"
1,"Jul 17, 2009","$7,500,000","$32,425,665","$34,439,060",,
2,"Mar 11, 2016","$5,000,000","$72,082,999","$108,286,422",tt1179933,"Drama,Horror,Mystery"
3,"Nov 11, 2015","$12,000,000","$14,616","$14,616",tt3453052,Drama
4,"Mar 31, 1999","$13,000,000","$38,177,966","$60,413,950",tt0147800,"Comedy,Drama,Romance"
...,...,...,...,...,...,...
12711,"Sep 15, 2017","$30,000,000","$17,800,004","$42,531,076",,
12712,"Aug 9, 2002","$70,000,000","$141,930,000","$267,200,000",tt0295701,"Action,Adventure,Thriller"
12713,"Jan 20, 2017","$85,000,000","$44,898,413","$345,033,359",tt1293847,"Action,Adventure,Thriller"
12714,"Apr 15, 2008","$3,000,000",$0,"$895,932",,


There we go! Now we need to come up with a point system to evaluate the popularity of the genre.

The first thing that popped into my mind is that people tend to vote with their money. So, if we take the highest-grossing movies and figure out the percentage each genre has in that list, then we will have a solid metric for popularity. Not perfect, but a solid representation of popularity.

I also think the score should be scaled by the gross revenue, so that the more popular the movie, the higher the weight for its genre.
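Here is a rough sketch of what a revenue-weighted genre score could look like. It rests on a few assumptions not settled in this notebook: the gross column has already been converted to integers, a film's full gross is credited to every genre it lists (so the scores do not sum to 1), and the names `films`, `per_genre` and `genre_score` are made up for the example.

```python
import pandas as pd

# Hypothetical sample; gross values come from rows shown earlier, already parsed to integers.
films = pd.DataFrame({
    "imdb_id": ["tt1591095", "tt0361715", "tt1703199"],
    "genres": ["Horror,Mystery,Thriller", "Adventure,Drama,Family", "Horror"],
    "worldwide_gross": [99_870_886, 994_790, 2_151_887],
})

# One row per (film, genre) pair, then total gross per genre.
per_genre = (
    films.assign(genre=films["genres"].str.split(","))
    .explode("genre")
    .groupby("genre")["worldwide_gross"]
    .sum()
)

# Share of the total gross that each genre took part in.
genre_score = (per_genre / films["worldwide_gross"].sum()).sort_values(ascending=False)
print(genre_score)
```

Splitting a film's gross evenly across its genres would be an equally reasonable choice; that version would make the scores sum to 1.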


Now we need the revenue for each title, and the year so we can adjust it for inflation.
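The gross columns are still strings such as "$14,124,286", which is also why the earlier `sort_values` call ordered the rows as text and put "$997,921" ahead of "$99,870,886". A minimal sketch of the conversion follows; `parse_dollars` and `budgets_sample` are hypothetical names for illustration, and the inflation adjustment itself is left out because it needs price-index data we have not loaded.

```python
import pandas as pd

def parse_dollars(column: pd.Series) -> pd.Series:
    """Convert strings like '$14,124,286' to 64-bit integers."""
    return column.str.replace(r"[$,]", "", regex=True).astype("int64")

# Hypothetical sample built from rows shown in the budgets_df output above.
budgets_sample = pd.DataFrame({
    "title": ["Spring Breakers", "Somewhere in Time", "The Secret"],
    "worldwide_gross": ["$31,149,251", "$9,709,597", "$0"],
})

budgets_sample["worldwide_gross"] = parse_dollars(budgets_sample["worldwide_gross"])

# With a numeric column, sorting now reflects actual revenue.
print(budgets_sample.sort_values("worldwide_gross", ascending=False))
```

The same helper could be applied to `production_budget` and `domestic_gross`, since they share the format.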

### Desired Shape

We want the outcome of this step to be a dataframe with imdb_id and genre columns. We can go ahead and stitch it all together at the end of the data obtaining process.

### Processing

### Completed

### Notes