# Project:  Investigate A Dataset:  TMDb  Movie Data Analysis


## Introduction

#### Questions to Investigate:

1.  Which genres generate the most adjusted revenue from year to year?

2.  Which genres are the most popular over each decade since 1960?

In [273]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.ticker as ticker
% matplotlib inline
% pylab inline

Populating the interactive namespace from numpy and matplotlib


In [274]:
# Loading the .csv data for the dataset
df = pd.read_csv("tmdb-movies.csv")

In [275]:
df.head()

Unnamed: 0,id,imdb_id,popularity,budget,revenue,original_title,cast,homepage,director,tagline,...,overview,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj
0,135397,tt0369610,32.985763,150000000,1513528810,Jurassic World,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,http://www.jurassicworld.com/,Colin Trevorrow,The park is open.,...,Twenty-two years after the events of Jurassic ...,124,Action|Adventure|Science Fiction|Thriller,Universal Studios|Amblin Entertainment|Legenda...,6/9/2015,5562,6.5,2015,137999939.3,1392446000.0
1,76341,tt1392190,28.419936,150000000,378436354,Mad Max: Fury Road,Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic...,http://www.madmaxmovie.com/,George Miller,What a Lovely Day.,...,An apocalyptic story set in the furthest reach...,120,Action|Adventure|Science Fiction|Thriller,Village Roadshow Pictures|Kennedy Miller Produ...,5/13/2015,6185,7.1,2015,137999939.3,348161300.0
2,262500,tt2908446,13.112507,110000000,295238201,Insurgent,Shailene Woodley|Theo James|Kate Winslet|Ansel...,http://www.thedivergentseries.movie/#insurgent,Robert Schwentke,One Choice Can Destroy You,...,Beatrice Prior must confront her inner demons ...,119,Adventure|Science Fiction|Thriller,Summit Entertainment|Mandeville Films|Red Wago...,3/18/2015,2480,6.3,2015,101199955.5,271619000.0
3,140607,tt2488496,11.173104,200000000,2068178225,Star Wars: The Force Awakens,Harrison Ford|Mark Hamill|Carrie Fisher|Adam D...,http://www.starwars.com/films/star-wars-episod...,J.J. Abrams,Every generation has a story.,...,Thirty years after defeating the Galactic Empi...,136,Action|Adventure|Science Fiction|Fantasy,Lucasfilm|Truenorth Productions|Bad Robot,12/15/2015,5292,7.5,2015,183999919.0,1902723000.0
4,168259,tt2820852,9.335014,190000000,1506249360,Furious 7,Vin Diesel|Paul Walker|Jason Statham|Michelle ...,http://www.furious7.com/,James Wan,Vengeance Hits Home,...,Deckard Shaw seeks revenge against Dominic Tor...,137,Action|Crime|Thriller,Universal Pictures|Original Film|Media Rights ...,4/1/2015,2947,7.3,2015,174799923.1,1385749000.0


In [276]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 21 columns):
id                      10866 non-null int64
imdb_id                 10856 non-null object
popularity              10866 non-null float64
budget                  10866 non-null int64
revenue                 10866 non-null int64
original_title          10866 non-null object
cast                    10790 non-null object
homepage                2936 non-null object
director                10822 non-null object
tagline                 8042 non-null object
keywords                9373 non-null object
overview                10862 non-null object
runtime                 10866 non-null int64
genres                  10843 non-null object
production_companies    9836 non-null object
release_date            10866 non-null object
vote_count              10866 non-null int64
vote_average            10866 non-null float64
release_year            10866 non-null int64
budget_adj              1

Since the adjusted revenue is in a float, we need to change the values to integers so that they will become dollars and compute correctly later.

In [277]:
#Converting revenue_adj from float to integer
df[['revenue_adj']] = df[['revenue_adj']].apply(lambda x: x.astype(int))
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 21 columns):
id                      10866 non-null int64
imdb_id                 10856 non-null object
popularity              10866 non-null float64
budget                  10866 non-null int64
revenue                 10866 non-null int64
original_title          10866 non-null object
cast                    10790 non-null object
homepage                2936 non-null object
director                10822 non-null object
tagline                 8042 non-null object
keywords                9373 non-null object
overview                10862 non-null object
runtime                 10866 non-null int64
genres                  10843 non-null object
production_companies    9836 non-null object
release_date            10866 non-null object
vote_count              10866 non-null int64
vote_average            10866 non-null float64
release_year            10866 non-null int64
budget_adj              1

In [278]:
#Checking for missing values in the dataset
df[df.columns[:]].isnull().sum()

id                         0
imdb_id                   10
popularity                 0
budget                     0
revenue                    0
original_title             0
cast                      76
homepage                7930
director                  44
tagline                 2824
keywords                1493
overview                   4
runtime                    0
genres                    23
production_companies    1030
release_date               0
vote_count                 0
vote_average               0
release_year               0
budget_adj                 0
revenue_adj                0
dtype: int64

Since there are no missing revenue_adj values, we don't need to focus on those.  We do have 23 missing values for the genres category that we need to look at.

In [279]:
df[pd.isnull(df.genres)].loc[:,('id','original_title','genres')]

Unnamed: 0,id,original_title,genres
424,363869,Belli di papÃ,
620,361043,All Hallows' Eve 2,
997,287663,Star Wars Rebels: Spark of Rebellion,
1712,21634,Prayers for Bobby,
1897,40534,Jonas Brothers: The Concert Experience,
2370,127717,Freshman Father,
2376,315620,Doctor Who: A Christmas Carol,
2853,57892,Vizontele,
3279,54330,ì•„ê¸°ì™€ ë‚˜,
4547,123024,London 2012 Olympic Opening Ceremony: Isles of...,


Due to these movies having no genre listed, we need to clean them from the dataset so that we have no null values.  Since the amount of movies is only 23 comparative to the thousands of movies in the dataset, the effect of removal will be negligible.

In [280]:
#Removing movies in the dataset that have no genres listed (Nan or null values)
df=df[pd.notnull(df.genres)]

In [281]:
#Checking to ensure that movies with null values in the dataset have been removed.
df[pd.isnull(df.genres)].loc[:,('id','original_title','genres')]

Unnamed: 0,id,original_title,genres


Next we need to fix the genres category in the dataset.  Many movies have more than one genre listed.  This could pose issues with analyzing our dataset.  We will need to split these movie genres into seperate columns so that each genre listed for a particular movie can be counted in our analyzation of popular movie genres.

In [282]:
#Splitting multiple movie genres into separate categories within the dataset
splitting_genres = df['genres'].astype(str).apply(lambda x: pd.Series(x.split('|')))
splitting_genres

Unnamed: 0,0,1,2,3,4
0,Action,Adventure,Science Fiction,Thriller,
1,Action,Adventure,Science Fiction,Thriller,
2,Adventure,Science Fiction,Thriller,,
3,Action,Adventure,Science Fiction,Fantasy,
4,Action,Crime,Thriller,,
5,Western,Drama,Adventure,Thriller,
6,Science Fiction,Action,Thriller,Adventure,
7,Drama,Adventure,Science Fiction,,
8,Family,Animation,Adventure,Comedy,
9,Comedy,Animation,Family,,


Now that we have split the genres up into separate categories, we can see that no movie has more than 5 categories of various genres listed.  We need to rename these new genre categories so that we can easily identify them in the dataset.  First, we need to concatenate the new categories into the dataset before we rename them.

In [283]:
#Concatenating new categories into the dataset
new_genre_data=pd.concat([df,splitting_genres], axis=1)

In [284]:
new_genre_data.head()

Unnamed: 0,id,imdb_id,popularity,budget,revenue,original_title,cast,homepage,director,tagline,...,vote_count,vote_average,release_year,budget_adj,revenue_adj,0,1,2,3,4
0,135397,tt0369610,32.985763,150000000,1513528810,Jurassic World,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,http://www.jurassicworld.com/,Colin Trevorrow,The park is open.,...,5562,6.5,2015,137999939.3,1392445893,Action,Adventure,Science Fiction,Thriller,
1,76341,tt1392190,28.419936,150000000,378436354,Mad Max: Fury Road,Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic...,http://www.madmaxmovie.com/,George Miller,What a Lovely Day.,...,6185,7.1,2015,137999939.3,348161292,Action,Adventure,Science Fiction,Thriller,
2,262500,tt2908446,13.112507,110000000,295238201,Insurgent,Shailene Woodley|Theo James|Kate Winslet|Ansel...,http://www.thedivergentseries.movie/#insurgent,Robert Schwentke,One Choice Can Destroy You,...,2480,6.3,2015,101199955.5,271619025,Adventure,Science Fiction,Thriller,,
3,140607,tt2488496,11.173104,200000000,2068178225,Star Wars: The Force Awakens,Harrison Ford|Mark Hamill|Carrie Fisher|Adam D...,http://www.starwars.com/films/star-wars-episod...,J.J. Abrams,Every generation has a story.,...,5292,7.5,2015,183999919.0,1902723130,Action,Adventure,Science Fiction,Fantasy,
4,168259,tt2820852,9.335014,190000000,1506249360,Furious 7,Vin Diesel|Paul Walker|Jason Statham|Michelle ...,http://www.furious7.com/,James Wan,Vengeance Hits Home,...,2947,7.3,2015,174799923.1,1385748801,Action,Crime,Thriller,,


In [285]:
#Moving genres into their own categories
new_genre_data=new_genre_data.rename(columns={
    0:'first_genre', 
    1:'second_genre', 
    2:'third_genre', 
    3:'fourth_genre', 
    4:'fifth_genre'
})

Now we will check to make sure the change in the renaming of categories has taken place.

In [286]:
#Checking to ensure genre categories have been renamed
new_genre_data.head()

Unnamed: 0,id,imdb_id,popularity,budget,revenue,original_title,cast,homepage,director,tagline,...,vote_count,vote_average,release_year,budget_adj,revenue_adj,first_genre,second_genre,third_genre,fourth_genre,fifth_genre
0,135397,tt0369610,32.985763,150000000,1513528810,Jurassic World,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,http://www.jurassicworld.com/,Colin Trevorrow,The park is open.,...,5562,6.5,2015,137999939.3,1392445893,Action,Adventure,Science Fiction,Thriller,
1,76341,tt1392190,28.419936,150000000,378436354,Mad Max: Fury Road,Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic...,http://www.madmaxmovie.com/,George Miller,What a Lovely Day.,...,6185,7.1,2015,137999939.3,348161292,Action,Adventure,Science Fiction,Thriller,
2,262500,tt2908446,13.112507,110000000,295238201,Insurgent,Shailene Woodley|Theo James|Kate Winslet|Ansel...,http://www.thedivergentseries.movie/#insurgent,Robert Schwentke,One Choice Can Destroy You,...,2480,6.3,2015,101199955.5,271619025,Adventure,Science Fiction,Thriller,,
3,140607,tt2488496,11.173104,200000000,2068178225,Star Wars: The Force Awakens,Harrison Ford|Mark Hamill|Carrie Fisher|Adam D...,http://www.starwars.com/films/star-wars-episod...,J.J. Abrams,Every generation has a story.,...,5292,7.5,2015,183999919.0,1902723130,Action,Adventure,Science Fiction,Fantasy,
4,168259,tt2820852,9.335014,190000000,1506249360,Furious 7,Vin Diesel|Paul Walker|Jason Statham|Michelle ...,http://www.furious7.com/,James Wan,Vengeance Hits Home,...,2947,7.3,2015,174799923.1,1385748801,Action,Crime,Thriller,,


We have now removed all nan values from the adjusted revenue portion of the dataset.  However, we now have a big dataset with the added genre categories.  We need to reshape the dataset and use only what we need to make accessing the data easier.  We are going to accomplish this through "melting" the dataset.

In [287]:
#Combining the five possible genre categories into one column
new_df=new_genre_data[['popularity','release_year','revenue_adj','first_genre','second_genre','third_genre','fourth_genre','fifth_genre']]
new_df_melted=pd.melt(new_df, id_vars=['popularity','release_year','revenue_adj'], var_name='genre_type', value_name='genre_value')

In [288]:
#Defining bin edges and names for decades
bin_edges = [1960, 1970, 1980, 1990, 2000, 2010, 2015]
bin_names = ['1960', '1970', '1980', '1990', '2000', '2010']

new_df_melted['decades']=pd.cut(new_genre_data['release_year'], bin_edges, labels=bin_names)

In [289]:
#Convert 
#new_df_melted['decades'] = pd.to_numeric(new_df_melted['decades'])

In [290]:
df_grouping = new_df_melted.groupby(['genre_value', 'decades'],as_index=False)

Now that we have cleaned the genre data, we need to clean up the adjusted revenue data.  Earlier, we determined that there are no null values in adjusted revenue.  Now, we need to check for zero values in the adjusted revenue so that those can be removed from the dataset.

In [291]:
df_grouping.describe()

Unnamed: 0_level_0,popularity,popularity,popularity,popularity,popularity,popularity,popularity,popularity,release_year,release_year,release_year,release_year,release_year,revenue_adj,revenue_adj,revenue_adj,revenue_adj,revenue_adj,revenue_adj,revenue_adj,revenue_adj
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max,count,mean,...,75%,max,count,mean,std,min,25%,50%,75%,max
0,62.0,0.729209,1.447538,0.020760,0.191897,0.308635,0.601131,8.654359,62.0,1970.080645,...,1970.75,2015.0,62.0,5.918930e+07,1.286895e+08,0.000000e+00,0.00,0.0,3.312806e+07,5.617734e+08
1,107.0,0.698177,1.143940,0.032936,0.212768,0.336237,0.659351,9.432768,107.0,1979.149533,...,1980.00,2009.0,107.0,7.529935e+07,2.670713e+08,-2.147484e+09,0.00,63.0,1.091038e+08,6.640539e+08
2,191.0,0.634171,0.803989,0.001662,0.228684,0.382943,0.621517,5.400826,191.0,1986.691099,...,1989.00,2005.0,191.0,6.764590e+07,1.458628e+08,0.000000e+00,0.00,14790369.0,5.712209e+07,1.135764e+09
3,327.0,0.675416,0.836342,0.018196,0.263996,0.458778,0.775334,6.591277,327.0,1995.400612,...,1998.00,2013.0,327.0,7.465021e+07,1.560209e+08,0.000000e+00,0.00,5159581.0,6.557363e+07,1.137692e+09
4,555.0,0.724892,0.943406,0.004433,0.247111,0.441854,0.874529,9.363643,555.0,2005.443243,...,2008.50,2011.0,555.0,5.865671e+07,1.262307e+08,0.000000e+00,0.00,405347.0,6.297243e+07,9.048154e+08
5,343.0,1.403238,2.890624,0.002648,0.291922,0.574723,1.393020,32.985763,343.0,2012.443149,...,2014.00,2015.0,343.0,9.275344e+07,2.249863e+08,0.000000e+00,0.00,0.0,7.417907e+07,1.902723e+09
6,37.0,3.249777,7.118452,0.082856,0.250451,0.630778,1.778746,32.985763,37.0,1977.810811,...,1986.00,2015.0,37.0,3.104132e+08,4.903733e+08,0.000000e+00,0.00,8424552.0,5.477497e+08,1.902723e+09
7,40.0,1.227421,1.407966,0.040689,0.287773,0.518438,1.830282,5.488441,40.0,1981.075000,...,1990.00,2007.0,40.0,2.414639e+08,3.836344e+08,0.000000e+00,0.00,65217905.5,2.996882e+08,1.424626e+09
8,80.0,0.988892,1.129259,0.015727,0.249194,0.542935,1.450911,5.939927,80.0,1987.337500,...,1990.00,2005.0,80.0,1.761492e+08,3.072920e+08,0.000000e+00,0.00,27833510.5,2.142632e+08,1.574815e+09
9,104.0,1.034554,1.359283,0.071358,0.272820,0.630397,1.196664,8.575419,104.0,1994.490385,...,1998.00,2013.0,104.0,1.545693e+08,2.595415e+08,0.000000e+00,0.00,42348885.0,1.690337e+08,1.202518e+09


In [292]:
#Checking to see if there are zeros listed in the adjusted revenue of movies in the dataset.
df_grouping['revenue_adj']

<pandas.core.groupby.DataFrameGroupBy object at 0x0000019E02FE0780>

Due to us having some movies that have zero adjusted revenue, we need to apply nan or null values to those movies, then we can clean those from the dataset.  This would leave only movies that had adjusted revenue reported in the dataset.

In [293]:
#Replacing zero values for adjusted revenue with nan (null values)
new_df_melted=df.replace({'revenue_adj': {0: np.nan}})

Checking to ensure that adjusted revenue now has no zeros in the dataset, and has been changed to null values.

In [294]:
#Checking again to see if zero values have been populated with nan
new_df_melted['revenue_adj']

0        1.392446e+09
1        3.481613e+08
2        2.716190e+08
3        1.902723e+09
4        1.385749e+09
5        4.903142e+08
6        4.053551e+08
7        5.477497e+08
8        1.064192e+09
9        7.854116e+08
10       8.102203e+08
11       1.692686e+08
12       3.391984e+07
13       2.241460e+08
14       1.292632e+09
15       1.432992e+08
16       2.997096e+08
17       4.771138e+08
18       4.989630e+08
19       5.984813e+08
20       1.923127e+08
21       8.437300e+07
22       4.328514e+08
23       5.240791e+08
24       1.226787e+08
25       6.277435e+08
26       1.985944e+08
27       3.714978e+08
28       8.127872e+07
29       2.863562e+08
             ...     
10836             NaN
10837             NaN
10838             NaN
10839             NaN
10840             NaN
10841             NaN
10842             NaN
10843             NaN
10844             NaN
10845             NaN
10846             NaN
10847             NaN
10848    8.061618e+07
10849             NaN
10850     

We will remove these nan (null values) from the dataset now to clean up the adjusted revenue.

In [295]:
#Removing nan (null values) in adjusted revenue
new_df_melted=df[pd.notnull(df.revenue_adj)]

We will now check to ensure that the nan (null values) have been removed from the adjusted revenue.

In [296]:
#Checking again for nan (null values) in adjusted revenue
new_df_melted['revenue_adj']

0        1392445893
1         348161292
2         271619025
3        1902723130
4        1385748801
5         490314247
6         405355075
7         547749654
8        1064192017
9         785411574
10        810220283
11        169268630
12         33919845
13        224146025
14       1292632337
15        143299244
16        299709578
17        477113780
18        498963025
19        598481289
20        192312729
21         84373003
22        432851375
23        524079119
24        122678731
25        627743451
26        198594430
27        371497801
28         81278719
29        286356245
            ...    
10836             0
10837             0
10838             0
10839             0
10840             0
10841             0
10842             0
10843             0
10844             0
10845             0
10846             0
10847             0
10848      80616176
10849             0
10850             0
10851             0
10852             0
10853             0
10854             0


In [297]:
new_df_melted['genres']

0              Action|Adventure|Science Fiction|Thriller
1              Action|Adventure|Science Fiction|Thriller
2                     Adventure|Science Fiction|Thriller
3               Action|Adventure|Science Fiction|Fantasy
4                                  Action|Crime|Thriller
5                       Western|Drama|Adventure|Thriller
6              Science Fiction|Action|Thriller|Adventure
7                        Drama|Adventure|Science Fiction
8                      Family|Animation|Adventure|Comedy
9                                Comedy|Animation|Family
10                                Action|Adventure|Crime
11              Science Fiction|Fantasy|Action|Adventure
12                                 Drama|Science Fiction
13                         Action|Comedy|Science Fiction
14                      Action|Adventure|Science Fiction
15                           Crime|Drama|Mystery|Western
16                                 Crime|Action|Thriller
17                      Science

# #Analyzing the data

In [298]:
#Descriptive data for all interesting categories
new_df_melted.describe()

Unnamed: 0,id,popularity,budget,revenue,runtime,vote_count,vote_average,release_year,budget_adj,revenue_adj
count,10843.0,10843.0,10843.0,10843.0,10843.0,10843.0,10843.0,10843.0,10843.0,10843.0
mean,65868.49193,0.647456,14656720.0,39907790.0,102.137508,217.813705,5.973974,2001.315595,17588270.0,49732050.0
std,91977.394803,1.000986,30938640.0,117113100.0,31.29332,576.155351,0.93426,12.813298,34332990.0,142714100.0
min,5.0,6.5e-05,0.0,0.0,0.0,10.0,1.5,1960.0,0.0,-2147484000.0
25%,10589.5,0.208253,0.0,0.0,90.0,17.0,5.4,1995.0,0.0,0.0
50%,20558.0,0.384555,0.0,0.0,99.0,38.0,6.0,2006.0,0.0,0.0
75%,75182.0,0.715349,15000000.0,24136750.0,111.0,146.0,6.6,2011.0,20935300.0,33732800.0
max,417859.0,32.985763,425000000.0,2781506000.0,900.0,9767.0,9.2,2015.0,425000000.0,1907006000.0


In [299]:
df_grouping.describe()

Unnamed: 0_level_0,popularity,popularity,popularity,popularity,popularity,popularity,popularity,popularity,release_year,release_year,release_year,release_year,release_year,revenue_adj,revenue_adj,revenue_adj,revenue_adj,revenue_adj,revenue_adj,revenue_adj,revenue_adj
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max,count,mean,...,75%,max,count,mean,std,min,25%,50%,75%,max
0,62.0,0.729209,1.447538,0.020760,0.191897,0.308635,0.601131,8.654359,62.0,1970.080645,...,1970.75,2015.0,62.0,5.918930e+07,1.286895e+08,0.000000e+00,0.00,0.0,3.312806e+07,5.617734e+08
1,107.0,0.698177,1.143940,0.032936,0.212768,0.336237,0.659351,9.432768,107.0,1979.149533,...,1980.00,2009.0,107.0,7.529935e+07,2.670713e+08,-2.147484e+09,0.00,63.0,1.091038e+08,6.640539e+08
2,191.0,0.634171,0.803989,0.001662,0.228684,0.382943,0.621517,5.400826,191.0,1986.691099,...,1989.00,2005.0,191.0,6.764590e+07,1.458628e+08,0.000000e+00,0.00,14790369.0,5.712209e+07,1.135764e+09
3,327.0,0.675416,0.836342,0.018196,0.263996,0.458778,0.775334,6.591277,327.0,1995.400612,...,1998.00,2013.0,327.0,7.465021e+07,1.560209e+08,0.000000e+00,0.00,5159581.0,6.557363e+07,1.137692e+09
4,555.0,0.724892,0.943406,0.004433,0.247111,0.441854,0.874529,9.363643,555.0,2005.443243,...,2008.50,2011.0,555.0,5.865671e+07,1.262307e+08,0.000000e+00,0.00,405347.0,6.297243e+07,9.048154e+08
5,343.0,1.403238,2.890624,0.002648,0.291922,0.574723,1.393020,32.985763,343.0,2012.443149,...,2014.00,2015.0,343.0,9.275344e+07,2.249863e+08,0.000000e+00,0.00,0.0,7.417907e+07,1.902723e+09
6,37.0,3.249777,7.118452,0.082856,0.250451,0.630778,1.778746,32.985763,37.0,1977.810811,...,1986.00,2015.0,37.0,3.104132e+08,4.903733e+08,0.000000e+00,0.00,8424552.0,5.477497e+08,1.902723e+09
7,40.0,1.227421,1.407966,0.040689,0.287773,0.518438,1.830282,5.488441,40.0,1981.075000,...,1990.00,2007.0,40.0,2.414639e+08,3.836344e+08,0.000000e+00,0.00,65217905.5,2.996882e+08,1.424626e+09
8,80.0,0.988892,1.129259,0.015727,0.249194,0.542935,1.450911,5.939927,80.0,1987.337500,...,1990.00,2005.0,80.0,1.761492e+08,3.072920e+08,0.000000e+00,0.00,27833510.5,2.142632e+08,1.574815e+09
9,104.0,1.034554,1.359283,0.071358,0.272820,0.630397,1.196664,8.575419,104.0,1994.490385,...,1998.00,2013.0,104.0,1.545693e+08,2.595415e+08,0.000000e+00,0.00,42348885.0,1.690337e+08,1.202518e+09


In [308]:


#mean of groupby object
df_means=new_df_melted.mean()



In [309]:
attend = sns.load_dataset("attention")
g = sns.FacetGrid(df_means, col="genre_value", col_wrap=5, size=1.5)
g = g.map(plt.plot, "decades", "revenue_adj", marker=".")

KeyError: 'genre_value'

In [310]:



#plotting with facegrid
g = sns.FacetGrid(df_means, col='genres',col_wrap=3)
g = g.map(plt.plot,'decades','popularity',marker='.')
g.set_xticklabels(['1950','1960','1970','1980','1990','2000','2010','2020'])

#loop to set the labels
for ax in g.axes:
    ax.xaxis.set_major_locator(ticker.MultipleLocator(base=10))
    for label in ax.get_xticklabels():
        label.set_rotation(45)

KeyError: 'genres'

In [None]:
#MAY DELETE-NEED TO CHANGE TO WORK
pltdf = top_genres_df.reset_index()

plt = sns.factorplot(x="release_decade", y="id", hue="genres", data=pltdf, kind="bar", palette="muted", size=8, legend_out=True);
plt.set_xlabels("Release Decade");
plt.set_ylabels("Number of Films");
plt.fig.suptitle("Number of Films in each Genre by Release Decade", fontsize=14);
plt._legend.set_title("Genre");

In [None]:
#rev_df = df[(df['budget_adj'] > 0) & (df['revenue_adj'] > 0)]

In [None]:
#rev_genre_df = df(rev_df, 'genre_value', sep='|')

In [None]:
#df.groupby('genre_value').revenue_adj.mean().sort_values(ascending=False)

In [None]:
#plt = df_means.groupby('genre_value').revenue_adj.mean().plot.pie(autopct='%.2f', figsize=(10,10), fontsize=12);
#plt.set_title("% of Average Revenue by Genre", fontsize=14);
#plt.set_ylabel('');