<a href="https://colab.research.google.com/github/ErikSeguinte/movie_data/blob/master/processing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import pandas as pd
import cufflinks as cf
import numpy as np
from plotly import graph_objs as go

In [2]:
def enable_plotly_in_cell():
  # Load require.js and plotly.js into the current cell's output so plots render (needed in Colab)
  from IPython.display import HTML, display
  from plotly.offline import init_notebook_mode
  display(HTML('''<script src="/static/components/requirejs/require.js"></script>'''))
  init_notebook_mode(connected=False)

In [3]:
cf.set_config_file(offline=True)

* I originally pulled the CSV files from Kaggle, but they were too big to host on GitHub.
* I loaded the files I wanted into pandas, then exported them as compressed pickles.
* This shrank a 700 MB CSV to a 3 MB pickle.
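
A minimal sketch of that CSV-to-pickle round trip, assuming pandas picks the xz compression from the `.xz` suffix. The helper `csv_to_pickle` and the tiny demo file are hypothetical stand-ins, not the original conversion code or the Kaggle data.

```python
import os
import tempfile

import pandas as pd


def csv_to_pickle(csv_path, pickle_path, usecols=None):
    """Read a CSV (optionally only some columns) and save it as a compressed pickle."""
    df = pd.read_csv(csv_path, usecols=usecols)
    # compression defaults to 'infer', so the '.xz' suffix selects xz
    df.to_pickle(pickle_path)
    return df


# Tiny stand-in dataset so the sketch runs without the Kaggle download
tmp_dir = tempfile.mkdtemp()
demo_csv = os.path.join(tmp_dir, 'demo_ratings.csv')
demo_pickle = os.path.join(tmp_dir, 'demo_ratings.pkl.xz')
pd.DataFrame({
    'userId': [1, 1, 2],
    'movieId': [10, 20, 10],
    'rating': [4.0, 3.5, 5.0],
}).to_csv(demo_csv, index=False)

saved = csv_to_pickle(demo_csv, demo_pickle)
restored = pd.read_pickle(demo_pickle)  # decompression is inferred the same way
```

Passing `usecols` lets you keep only the columns you need before pickling, which also helps the file size.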

In [4]:
try:
    # Use the local copies if they exist
    movies = pd.read_pickle('data/movies.pkl.xz')
    ratings = pd.read_pickle('data/ratings.pkl.xz')
except FileNotFoundError:
    # Download pickles from GitHub (e.g. when running in Colab)
    !wget https://github.com/ErikSeguinte/movie_data/raw/master/data/ratings.pkl.xz
    !wget https://github.com/ErikSeguinte/movie_data/raw/master/data/movies.pkl.xz
    # Unpickle dataframes
    movies = pd.read_pickle('movies.pkl.xz')
    ratings = pd.read_pickle('ratings.pkl.xz')

In [0]:
movies.head(1)

In [0]:
ratings.dtypes

In [0]:
# `timestamp` holds Unix epoch seconds
ratings['datetime'] = pd.to_datetime(ratings['timestamp'], unit='s')

In [0]:
ratings.shape

In [0]:
movies.shape

In [0]:
movies.dtypes

* The movies DataFrame has malformed data: `id` should be numeric.
* On inspection, some rows appear to be missing a comma, which shifts their values into the wrong columns. Let's clean those up.
* Every malformed row has a string in `id` instead of a number, so we coerce the column to numeric; strings that cannot be parsed become `NaN`, and we then drop those rows.
* `budget` and `revenue` should also be numeric, but rows with `NaN` in those columns are kept.
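
The cleanup described above can be sketched on a toy frame. The `demo` DataFrame and its values are made up for illustration; only `pd.to_numeric(..., errors='coerce')` and the drop-on-`id` rule come from the notes above.

```python
import pandas as pd

# Toy frame: the second row mimics a shifted row where a date landed in `id`
demo = pd.DataFrame({
    'id': ['101', '1997-08-20', '103', '104'],
    'budget': ['30000000', 'oops', '65000000', 'unknown'],
})

# Strings that cannot be parsed become NaN instead of raising
demo['id'] = pd.to_numeric(demo['id'], errors='coerce')
demo['budget'] = pd.to_numeric(demo['budget'], errors='coerce')

# Drop only the rows whose id failed to parse; copy so later assignments are safe
demo = demo[demo['id'].notnull()].copy()
demo['id'] = demo['id'].astype(int)
```

The row with id 104 survives even though its budget is `NaN`, which is the behaviour we want for `budget` and `revenue`.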

In [0]:
ratings.dtypes

In [0]:
ratings['rating'].value_counts()

In [0]:
movies['id'] = pd.to_numeric(movies['id'], errors='coerce')

In [0]:
# Keep only rows with a parseable id; copy so later column assignments don't hit a view
movies = movies[movies['id'].notnull()].copy()

In [0]:
movies['id'] = movies['id'].astype(int)

In [0]:
movies['id'].dtype

In [0]:
movies['budget'] = pd.to_numeric(movies['budget'], errors='coerce')

In [0]:
movies['revenue'] = pd.to_numeric(movies['revenue'], errors='coerce')

In [0]:
movies['revenue'].value_counts()

In [0]:
# Treat zeros as missing values in every column
movies = movies.replace(0, np.nan)

In [0]:
movies['release_date'] = pd.to_datetime(movies['release_date'])

In [0]:
clean_movies = movies[['id','title', 'release_date','budget', 'revenue', 'runtime']]

In [0]:
clean_movies.head()

In [0]:
# as_index=False keeps movieId as a column, so only `rating` needs selecting
mean_rating = ratings.groupby('movieId', as_index=False)['rating'].mean()

In [0]:
median_rating = ratings.groupby('movieId', as_index=False)['rating'].median()


In [0]:
avg_ratings = mean_rating.merge(median_rating, on='movieId', suffixes=('_mean', '_median'))

In [0]:
movie_ratings = clean_movies.merge(avg_ratings, left_on='id', right_on='movieId')

In [0]:
movie_ratings.nlargest(10, 'rating_mean')

In [0]:
# Use movie_ratings' own release_date: the merge reset the index, so clean_movies would misalign
movie_ratings['year'] = movie_ratings['release_date'].dt.year

In [0]:
# ids of the five movies with the highest mean rating
top = movie_ratings.nlargest(5, 'rating_mean')['id'].tolist()

In [0]:
enable_plotly_in_cell()
ratings[ratings['movieId'].isin(top)].boxplot(by='movieId', column='rating')

In [0]:
enable_plotly_in_cell()
movie_ratings.groupby('year')['rating_mean'].mean().iplot(kind='bar')

In [0]:
# Round each year down to the start of its decade (e.g. 1997 -> 1990)
movie_ratings['decade'] = movie_ratings['year'] // 10 * 10

In [0]:
enable_plotly_in_cell()
movie_ratings.groupby('decade')['rating_mean'].mean().iplot(kind='bar')

In [0]:
enable_plotly_in_cell()
trace = go.Box(
    x=movie_ratings['decade'],
    y=movie_ratings['rating_mean'],
)
go.Figure(trace)


In [0]:
enable_plotly_in_cell()
trace = go.Box(
    x=movie_ratings['year'],
    y=movie_ratings['rating_median'],
)
go.Figure(trace)

In [0]:
enable_plotly_in_cell()
# mode='markers' draws points; the default connects them with lines
movie_ratings[['budget', 'rating_mean']].iplot(kind='scatter', mode='markers', x='budget', y='rating_mean')