<a href="https://colab.research.google.com/github/ErikSeguinte/movie_data/blob/master/processing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
import pandas as pd
import cufflinks as cf
import numpy as np
from plotly import graph_objs as go
import altair as alt

In [0]:
def enable_plotly_in_cell():
  import IPython
  from plotly.offline import init_notebook_mode
  display(IPython.core.display.HTML('''<script src="/static/components/requirejs/require.js"></script>'''))
  init_notebook_mode(connected=False)

In [0]:
cf.set_config_file(offline=True)

In [0]:
merge_keys = {'left_index':True, 'right_index':True}

* I previously pulled CSV files from Kaggle, but the files were too big to host on github.
* I imported the files I wanted into pandas, and then exported them back out as compressed pickles.
* I was able to compress a 700MB csv to a 3 MB Pickle

In [130]:
try: 
    movies = pd.read_pickle('data/movies.pkl.xz')
    #ratings = pd.read_pickle('data/ratings2.pkl.xz')
    #cpi = pd.read_csv('data/cpi.csv')
    
except:
    # Download pickles from github
    #!wget https://github.com/ErikSeguinte/movie_data/raw/master/data/ratings2.pkl.xz
    !wget https://github.com/ErikSeguinte/movie_data/raw/master/data/movies.pkl.xz
    #!wget 'https://datahub.io/core/cpi/r/cpi.csv'
    #cpi = pd.read_csv('data/cpi.csv')
    movies = pd.read_pickle('movies.pkl.xz')
    #ratings = pd.read_pickle('ratings2.pkl.xz')

--2020-03-02 21:48:28--  https://github.com/ErikSeguinte/movie_data/raw/master/data/movies.pkl.xz
Resolving github.com (github.com)... 192.30.253.112
Connecting to github.com (github.com)|192.30.253.112|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/ErikSeguinte/movie_data/master/data/movies.pkl.xz [following]
--2020-03-02 21:48:29--  https://raw.githubusercontent.com/ErikSeguinte/movie_data/master/data/movies.pkl.xz
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7917036 (7.5M) [application/octet-stream]
Saving to: ‘movies.pkl.xz.3’


2020-03-02 21:48:29 (21.1 MB/s) - ‘movies.pkl.xz.3’ saved [7917036/7917036]



## Clean Movie DF

In [0]:
movies.dtypes

adult                            object
belongs_to_collection            object
budget                          float64
genres                           object
homepage                         object
imdb_id                          object
original_language                object
original_title                   object
overview                         object
popularity                       object
poster_path                      object
production_companies             object
production_countries             object
release_date             datetime64[ns]
revenue                         float64
runtime                         float64
spoken_languages                 object
status                           object
tagline                          object
title                            object
video                            object
vote_average                    float64
vote_count                      float64
year                            float64
dtype: object

* Movies Dataframe has malformed data. `id` Should be numeric.
* After inspection, it looks like there are rows that are missing a comma somewhere, making columns not line up, and adding the wrong data to columns. Let's clean those up.
* All malformed rows have strings for IDs instead of numeric, so we will coerce them into numeric columns, and strings will be returned as `NaN`, which we'll then drop.

* `budget` and `revanue` should also be numeric, but Nans won't be dropped






In [0]:
movies['id'] = pd.to_numeric(movies['id'], errors='coerce')
movies = movies[movies['id'].notnull()]
movies = movies.set_index('id')

In [0]:
def to_numeric(df, labels):
    
    for label in labels:
        df[label] = pd.to_numeric(movies[label], errors='coerce').copy()
    return df

In [0]:
movies = to_numeric(movies, ['budget', 'revenue', 'vote_average'])

In [0]:
movies['release_date'] =pd.to_datetime(movies['release_date'], infer_datetime_format= True)

In [0]:
movies['year'] = movies['release_date'].dt.year

In [0]:
clean_movies = movies[['title','genres', 'release_date','budget', 'revenue','year' ,'runtime', 'vote_average', 'vote_count']]

## Process User Reviews
* User reviews come in a collection of individual reviews where a review gives a movie a score of 1 to 5.
* We will take the mean ratings for each movie

In [0]:
from sklearn.preprocessing import StandardScaler

In [0]:
scaler = StandardScaler()
mean_rating = pd.DataFrame(scaler.fit_transform(ratings.groupby('movieId')[['rating']].mean()), columns = ["rating"])
#mean_rating = pd.DataFrame(ratings.groupby('movieId')[['rating']].mean(), columns = ["rating"])

In [0]:
# Aggregate mean ratings and number of votes per movie
movie_ratings =pd.DataFrame(ratings.groupby('movieId')[['rating']].agg(['mean', 'count']))['rating']
movie_ratings = movie_ratings.rename({'mean': 'rating', 'count': 'num_votes'}, axis = 1)


In [0]:
ratings = None

* Lets drop any movies with less than 10 votes. Those are more easily swayed by outliers and aren't reliable.

In [0]:
movie_ratings = movie_ratings[~(movie_ratings['num_votes'] < 100)]

* And now we merge the averaged ratings back with the movie database.
* Note that not all movies are present in the user votings.

In [0]:
movie_ratings = clean_movies.merge(movie_ratings, left_index = True, right_index=True)

In [0]:
movie_ratings[['title', 'rating']].nlargest(10, 'rating')

* The movie Database also provides a rating and suffer from a similar problem of some movies having a tiny sample size.

In [0]:
movie_ratings[['title', 'vote_average', 'vote_count']].sort_values(by='vote_average', ascending = False).nlargest(10, 'vote_average')

In [0]:
movie_ratings[['title', 'revenue']].nlargest(10, 'revenue')

## Inflation
* Inflation means that a 1940 dollar is worth more than a 2020 dollar. Let's adjust Revenue for that.
* The Consumer price index can be used to convert to standarized dollars.
* Here, we'll be using 2014 dollars.
* Years later than 2014 will not be adjusted.
$$ \textrm{adjusted dollars} = \frac{\textrm{New CPI}}{\textrm{Base CPI}}$$
* where x is the current cpi and y is the cpi of that year 

In [114]:
!wget 'https://datahub.io/core/cpi/r/cpi.csv'
cpi = pd.read_csv('cpi.csv')

--2020-03-02 21:45:14--  https://datahub.io/core/cpi/r/cpi.csv
Resolving datahub.io (datahub.io)... 104.24.112.103, 104.24.113.103, 2606:4700:3035::6818:7167, ...
Connecting to datahub.io (datahub.io)|104.24.112.103|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://pkgstore.datahub.io/core/cpi/cpi_csv/data/04cb8fe18892497287d23e20d0e1ceb9/cpi_csv.csv [following]
--2020-03-02 21:45:15--  https://pkgstore.datahub.io/core/cpi/cpi_csv/data/04cb8fe18892497287d23e20d0e1ceb9/cpi_csv.csv
Resolving pkgstore.datahub.io (pkgstore.datahub.io)... 104.24.113.103, 104.24.112.103, 2606:4700:3035::6818:7167, ...
Connecting to pkgstore.datahub.io (pkgstore.datahub.io)|104.24.113.103|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 254135 (248K) [text/plain]
Saving to: ‘cpi.csv.1’


2020-03-02 21:45:15 (5.15 MB/s) - ‘cpi.csv.1’ saved [254135/254135]



In [143]:
cpi = cpi[cpi['Country Name'] == 'United States'][['Year', 'CPI']]

KeyError: ignored

In [0]:
cpi = cpi.set_index(cpi['Year'])

In [0]:
def adjust_dollars(value, year):
    year = int(year)
    try:
        current = cpi.loc[2014,'CPI']
        base = cpi.loc[year,'CPI']
        adjusted_value = value * (current/base)
        return adjusted_value
    except: 
        return value

In [0]:
clean_movies['year']= clean_movies['release_date'].dt.year

In [144]:
df = clean_movies[clean_movies['revenue'].notnull() & clean_movies['year'].notnull()]
df

Unnamed: 0_level_0,title,genres,release_date,budget,revenue,year,runtime,vote_average,vote_count
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
862.0,Toy Story,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",1995-10-30,30000000.0,373554033.0,1995.0,81.0,7.7,5415.0
8844.0,Jumanji,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",1995-12-15,65000000.0,262797249.0,1995.0,104.0,6.9,2413.0
31357.0,Waiting to Exhale,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",1995-12-22,16000000.0,81452156.0,1995.0,127.0,6.1,34.0
11862.0,Father of the Bride Part II,"[{'id': 35, 'name': 'Comedy'}]",1995-02-10,,76578911.0,1995.0,106.0,5.7,173.0
949.0,Heat,"[{'id': 28, 'name': 'Action'}, {'id': 80, 'nam...",1995-12-15,60000000.0,187436818.0,1995.0,170.0,7.7,1886.0
...,...,...,...,...,...,...,...,...,...
280422.0,All at Once,"[{'id': 80, 'name': 'Crime'}, {'id': 35, 'name...",2014-06-05,750000.0,3.0,2014.0,,6.0,4.0
240789.0,The Miracle,"[{'id': 18, 'name': 'Drama'}, {'id': 36, 'name...",2009-10-09,,50656.0,2009.0,110.0,6.3,3.0
62757.0,Savages,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",2006-11-23,800000.0,1328612.0,2006.0,100.0,5.8,6.0
63281.0,Pro Lyuboff,"[{'id': 10749, 'name': 'Romance'}, {'id': 18, ...",2010-09-30,2000000.0,1268793.0,2010.0,107.0,4.0,3.0


In [0]:
adjusted = pd.DataFrame([adjust_dollars(x,y) for x,y in zip(df['revenue'], df['year'])], index = df.index, columns = ['adjusted_revenue'])

In [0]:
clean_movies = clean_movies.merge(adjusted, left_index=True, right_index=True).nlargest(10,'adjusted_revenue')

In [0]:
movie_ratings.corr()

In [0]:
# Adding a year and decade to examine trends over time
movie_ratings['year'] = movie_ratings['release_date'].dt.year

In [0]:
movie_ratings['decade'] = [x - (x%10) for x in movie_ratings['year']]

In [0]:
#enable_plotly_in_cell()
movie_ratings.groupby('year')['vote_average'].mean().iplot(kind='bar')

In [0]:
alt.Chart(movie_ratings).mark_bar().encode(
    alt.Y('mean(vote_average)'),
    alt.X('year')
)

In [0]:
alt.Chart(movie_ratings).mark_bar().encode(
    alt.Y('mean(rating)'),
    alt.X('year')
)

In [0]:
alt.Chart(movie_ratings).mark_bar().encode(
    alt.Y('mean(vote_average)'),
    alt.X('decade')
)

In [0]:
movie_ratings

In [0]:
# enable_plotly_in_cell()
movie_ratings.groupby('decade')['rating'].mean().iplot(kind='bar')

* I'd like to compare the votes from TMB to the user ratings, but they are on different scales. We'll use standard scaler to normalize them so we can more easily compare.

In [0]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
to_scale = movie_ratings[movie_ratings['vote_average'].notnull() & movie_ratings['year'].notnull() & movie_ratings['rating'].notnull()][['vote_average', 'rating']]
scaled = scaler.fit_transform(to_scale)
movie_ratings = movie_ratings.merge(
    pd.DataFrame(
        scaled,
        index = to_scale.index,
        columns = ['scaled_tmdb_vote', 'scaled_user_rating']
    ),
    left_index = True,
    right_index = True,
)

In [0]:
movie_ratings

In [0]:
traces = [
    go.Bar(name='TMDB rating',
        x = movie_ratings.groupby('year')['scaled_tmdb_vote'].mean().index,
        y = movie_ratings.groupby('year')['scaled_tmdb_vote'].mean()
    ),
        go.Bar(name='user rating',
        x = movie_ratings.groupby('year')['scaled_user_rating'].mean().index,
        y = movie_ratings.groupby('year')['scaled_user_rating'].mean()
    )
]

go.Figure(data = traces,
    layout_xaxis_tick0 = 1890
)

In [0]:
from sklearn.decomposition import PCA

pca = PCA(1)

pca_df = pd.DataFrame(pca.fit_transform(scaled), index=to_scale.index, columns = ['PCA'])

movie_ratings = movie_ratings.merge(pca_df, left_index = True, right_index = True)
movie_ratings.head(2)

In [0]:
movie_ratings[['title', 'PCA']].nlargest(10, 'PCA')

In [0]:
# enable_plotly_in_cell()
trace = go.Box(
    x = movie_ratings[movie_ratings['decade'].notnull()]['decade'],
    y = movie_ratings[movie_ratings['decade'].notnull()]['rating'],
    
)
go.Figure(
    trace,
    layout_xaxis_title = "Decade",
    layout_yaxis_title = "Movie Rating",
    layout_title = "Movie Ratings by decade"
)


In [0]:
movie_ratings['q_budget'] = pd.qcut(movie_ratings['budget'], labels = ['vlow', 'low', 'med', 'high', 'vhigh'], q = 5)

In [0]:
budget_ratings = movie_ratings[['title', 'budget', 'revenue', 'rating']].dropna()

In [0]:
budget_ratings.corr()

In [0]:
trace = go.Scatter(
    y = budget_ratings['rating'],
    x = budget_ratings['revenue'],
    mode = 'markers'
)

go.Figure(
    trace,
    layout_xaxis_title = "Budget",
    layout_yaxis_title = "Movie Rating",
    layout_title = "Movie Ratings by budget",
    
)

In [0]:
movie_ratings.shape

In [0]:
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.decomposition import PCA

In [0]:
all_ratings = movie_ratings[['vote_average','rating']].dropna()
all_ratings

In [0]:
all_ratings.isnull().sum()

In [0]:
scaler = StandardScaler()
x =scaler.fit_transform(all_ratings)

In [0]:
pca = PCA(1)

In [0]:
x = pca.fit_transform(x)

In [0]:
scaled_ratings = pd.DataFrame(x, index = all_ratings.index, columns=['scaled_rating'])

In [0]:
scaled_ratings

In [0]:
movie_ratings = movie_ratings.merge(scaled_ratings, left_index=True, right_index=True)

In [0]:
movie_ratings.corr()

In [0]:
enable_plotly_in_cell()
movie_ratings.groupby('year')['scaled_rating'].mean().iplot(kind='bar')

In [0]:
movie_ratings.groupby('year')['scaled_rating'].mean().iplot(kind='bar', title = "Scaled Rating by Year", xTitle="year", yTitle="Scaled Rating")

In [0]:
movie_ratings.groupby('decade')['scaled_rating'].mean().iplot(kind='bar')

In [0]:
budget_ratings = movie_ratings[['budget', 'rating']].dropna()

In [0]:
budget_ratings['q_budget'] = pd.qcut(budget_ratings['budget'], q = 5, labels = ['vlow', 'low', 'med', 'high', 'blockbuster'])


In [0]:
enable_plotly_in_cell()
budget_ratings.groupby('q_budget')['rating'].mean().iplot(kind='bar')

In [0]:
movie_ratings.nlargest(25, 'rating')

In [0]:
enable_plotly_in_cell()
trace = go.Scatter(
    x = movie_ratings["budget"],
    y = movie_ratings['revenue'],
    mode = "markers"
)

go.Figure(trace)

## Genre

In [0]:
genres = clean_movies[['title','genres']]

In [0]:
import ast

In [0]:
bad_genres = [
           'Aniplex',
 'BROSTA TV',
 'Carousel Productions',   
  'GoHands',
   'Mardock Scramble Production Committee',
    'Odyssey Media',
     'Pulser Productions',
 'Rogue State',
  'Sentai Filmworks',
   'Telescene Film Group Productions',
 'The Cartel',
  'Vision View Entertainment',
]

#genre_set = genre_set.difference(bad_genres)

In [148]:
genres['genres_ls'] = [
                    [d['name'] for d in ast.literal_eval(x) if d['name'] not in bad_genres ]
                    for x in genres['genres']
                    ]
genres['genres_ls']

id
862.0        [Animation, Comedy, Family]
8844.0      [Adventure, Fantasy, Family]
15602.0                [Romance, Comedy]
31357.0         [Comedy, Drama, Romance]
11862.0                         [Comedy]
                        ...             
439050.0                 [Drama, Family]
111109.0                         [Drama]
67758.0        [Action, Drama, Thriller]
227506.0                              []
461257.0                              []
Name: genres_ls, Length: 45463, dtype: object

In [0]:
from sklearn.preprocessing import MultiLabelBinarizer

In [0]:
mlb = MultiLabelBinarizer()

In [0]:
mlb_ = mlb.fit_transform(genres['genres_ls'])

In [153]:
encoded_genres = pd.DataFrame(mlb_, columns = mlb.classes_, index = genres.index)
encoded_genres

Unnamed: 0_level_0,Action,Adventure,Animation,Comedy,Crime,Documentary,Drama,Family,Fantasy,Foreign,History,Horror,Music,Mystery,Romance,Science Fiction,TV Movie,Thriller,War,Western
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1
862.0,0,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
8844.0,0,1,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0
15602.0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
31357.0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0
11862.0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
439050.0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0
111109.0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
67758.0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0
227506.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [173]:
clean_movies

Unnamed: 0_level_0,title,genres,release_date,budget,revenue,year,runtime,vote_average,vote_count,adjusted_revenue
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
19995.0,Avatar,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",2009-12-10,237000000.0,2787965000.0,2009.0,162.0,7.2,12114.0,3076449000.0
11.0,Star Wars,"[{'id': 12, 'name': 'Adventure'}, {'id': 28, '...",1977-05-25,11000000.0,775398000.0,1977.0,121.0,8.1,6778.0,3028728000.0
597.0,Titanic,"[{'id': 18, 'name': 'Drama'}, {'id': 10749, 'n...",1997-11-18,200000000.0,1845034000.0,1997.0,194.0,7.5,7770.0,2721128000.0
9552.0,The Exorcist,"[{'id': 18, 'name': 'Drama'}, {'id': 27, 'name...",1973-12-26,8000000.0,441306100.0,1973.0,122.0,7.5,2046.0,2351851000.0
15121.0,The Sound of Music,"[{'id': 18, 'name': 'Drama'}, {'id': 10751, 'n...",1965-03-02,8200000.0,286214300.0,1965.0,174.0,7.4,966.0,2148826000.0
578.0,Jaws,"[{'id': 27, 'name': 'Horror'}, {'id': 53, 'nam...",1975-06-18,7000000.0,470654000.0,1975.0,124.0,7.5,2628.0,2069945000.0
140607.0,Star Wars: The Force Awakens,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",2015-12-15,245000000.0,2068224000.0,2015.0,136.0,7.5,7993.0,2068224000.0
601.0,E.T. the Extra-Terrestrial,"[{'id': 878, 'name': 'Science Fiction'}, {'id'...",1982-04-03,10500000.0,792965300.0,1982.0,115.0,7.3,3359.0,1945322000.0
12230.0,One Hundred and One Dalmatians,"[{'id': 12, 'name': 'Adventure'}, {'id': 16, '...",1961-01-25,4000000.0,215880000.0,1961.0,79.0,6.8,1643.0,1708505000.0
24428.0,The Avengers,"[{'id': 878, 'name': 'Science Fiction'}, {'id'...",2012-04-25,220000000.0,1519558000.0,2012.0,143.0,7.4,12000.0,1566829000.0


In [159]:
encoded_genres = clean_movies.merge(encoded_genres, **merge_keys)
encoded_genres

Unnamed: 0_level_0,title,genres,release_date,budget,revenue,year,runtime,vote_average,vote_count,adjusted_revenue,Action,Adventure,Animation,Comedy,Crime,Documentary,Drama,Family,Fantasy,Foreign,History,Horror,Music,Mystery,Romance,Science Fiction,TV Movie,Thriller,War,Western
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1
11.0,Star Wars,"[{'id': 12, 'name': 'Adventure'}, {'id': 28, '...",1977-05-25,11000000.0,775398000.0,1977.0,121.0,8.1,6778.0,3028728000.0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
578.0,Jaws,"[{'id': 27, 'name': 'Horror'}, {'id': 53, 'nam...",1975-06-18,7000000.0,470654000.0,1975.0,124.0,7.5,2628.0,2069945000.0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0
597.0,Titanic,"[{'id': 18, 'name': 'Drama'}, {'id': 10749, 'n...",1997-11-18,200000000.0,1845034000.0,1997.0,194.0,7.5,7770.0,2721128000.0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0
601.0,E.T. the Extra-Terrestrial,"[{'id': 878, 'name': 'Science Fiction'}, {'id'...",1982-04-03,10500000.0,792965300.0,1982.0,115.0,7.3,3359.0,1945322000.0,0,1,0,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0
9552.0,The Exorcist,"[{'id': 18, 'name': 'Drama'}, {'id': 27, 'name...",1973-12-26,8000000.0,441306100.0,1973.0,122.0,7.5,2046.0,2351851000.0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0
12230.0,One Hundred and One Dalmatians,"[{'id': 12, 'name': 'Adventure'}, {'id': 16, '...",1961-01-25,4000000.0,215880000.0,1961.0,79.0,6.8,1643.0,1708505000.0,0,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
15121.0,The Sound of Music,"[{'id': 18, 'name': 'Drama'}, {'id': 10751, 'n...",1965-03-02,8200000.0,286214300.0,1965.0,174.0,7.4,966.0,2148826000.0,0,0,0,0,0,0,1,1,0,0,0,0,1,0,1,0,0,0,0,0
19995.0,Avatar,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",2009-12-10,237000000.0,2787965000.0,2009.0,162.0,7.2,12114.0,3076449000.0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0
24428.0,The Avengers,"[{'id': 878, 'name': 'Science Fiction'}, {'id'...",2012-04-25,220000000.0,1519558000.0,2012.0,143.0,7.4,12000.0,1566829000.0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
140607.0,Star Wars: The Force Awakens,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",2015-12-15,245000000.0,2068224000.0,2015.0,136.0,7.5,7993.0,2068224000.0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0


In [0]:
for genre in mlb.classes_:
    print(genre, encoded_genres[encoded_genres[genre] == 1]['vote_average'].mean())

Action 5.75075674832266
Adventure 5.886064363743711
Animation 6.443680592279214
Comedy 5.979235709189001
Crime 6.107923892100158
Documentary 6.6706053550640005
Drama 6.187262592898546
Family 5.934633965068733
Fantasy 5.9245916114790065
Foreign 5.981445937300071
History 6.425392083644524
Horror 5.313143601998684
Music 6.355135135135147
Mystery 5.970215410107693
Romance 6.049242424242426
Science Fiction 5.472276231981214
TV Movie 5.67937853107345
Thriller 5.7423293172690695
War 6.298033044846587
Western 5.727872340425541


In [0]:
d = {g: [encoded_genres[encoded_genres[g] == 1]['vote_average'].mean(),
         encoded_genres[encoded_genres[g] == 1]['revenue'].sum()]
     for g in mlb.classes_}

In [174]:
df = pd.DataFrame.from_dict(d, orient='index',  columns = ['tMDb rating', 'total revenue'])
df

Unnamed: 0,tMDb rating,total revenue
Action,7.55,7151145000.0
Adventure,7.4,8630644000.0
Animation,6.8,215880000.0
Comedy,6.8,215880000.0
Crime,,0.0
Documentary,,0.0
Drama,7.466667,2572555000.0
Family,7.166667,1295060000.0
Fantasy,7.333333,5649154000.0
Foreign,,0.0


In [0]:
trace = go.Bar(
    x=pd.Series(d)
)

go.Figure(
    data = trace,
    layout_x)



In [0]:
melted = df.reset_index().melt(id_vars='index')

In [164]:
alt.Chart(df.reset_index()).mark_bar().encode(
    x = 'index',
    y = 'tMDb rating',
)

In [171]:
alt.Chart(df.reset_index()).mark_bar().encode(
    x = 'index',
    y = 'total revenue',
)

