<a href="https://colab.research.google.com/github/ErikSeguinte/movie_data/blob/master/processing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
import pandas as pd
import cufflinks as cf
import numpy as np
from plotly import graph_objs as go

In [0]:
def enable_plotly_in_cell():
  # Load require.js so offline plotly figures can render in this output cell
  from IPython.display import HTML, display
  from plotly.offline import init_notebook_mode
  display(HTML('''<script src="/static/components/requirejs/require.js"></script>'''))
  init_notebook_mode(connected=False)

In [0]:
cf.set_config_file(offline=True)

* I originally pulled the CSV files from Kaggle, but they were too large to host on GitHub.
* I loaded the files I needed into pandas and exported them again as compressed pickles.
* This shrank a 700 MB CSV to a 3 MB pickle.
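
The round trip described above can be sketched on toy data. This is a minimal illustration, not the exact export used for this notebook: the `demo` frame, the temporary directory, and the file names are all made up for the example. It relies on pandas inferring xz compression from the `.xz` suffix.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Toy frame standing in for one of the Kaggle CSVs (hypothetical data).
demo = pd.DataFrame({
    "movieId": np.arange(1000) % 50,
    "rating": np.tile([3.0, 4.5, 2.0, 5.0], 250),
})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "demo.csv")
    pkl_path = os.path.join(tmp, "demo.pkl.xz")

    demo.to_csv(csv_path, index=False)
    # compression="infer" is the default, so ".xz" selects xz compression
    demo.to_pickle(pkl_path)

    csv_size = os.path.getsize(csv_path)
    pkl_size = os.path.getsize(pkl_path)

    # Reading back also infers the compression from the suffix
    restored = pd.read_pickle(pkl_path)

print(f"csv: {csv_size} bytes, pickle.xz: {pkl_size} bytes")
```

How much space is saved depends on the data; keeping only the needed columns before exporting also reduces the size.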

In [0]:
try: 
    movies = pd.read_pickle('data/movies.pkl.xz')
    ratings = pd.read_pickle('data/ratings.pkl.xz')
except FileNotFoundError:
    # Download pickles from github
    !wget https://github.com/ErikSeguinte/movie_data/raw/master/data/ratings.pkl.xz
    !wget https://github.com/ErikSeguinte/movie_data/raw/master/data/movies.pkl.xz
    # Unpickle dataframes
    movies = pd.read_pickle('movies.pkl.xz')
    ratings = pd.read_pickle('ratings.pkl.xz')

In [0]:
movies.head(1)

In [0]:
ratings.shape

In [0]:
movies.shape

## Clean Movie DF

In [0]:
movies.dtypes

* The movies DataFrame contains malformed data: `id` should be numeric, but it is not.
* On inspection, some rows appear to be missing a comma, so their values are shifted into the wrong columns. Let's clean those up.
* All malformed rows have a string in `id` instead of a number, so we coerce the column to numeric. Strings become `NaN`, and we then drop those rows.
* `budget` and `revenue` should also be numeric, but there we keep the `NaN`s instead of dropping the rows.
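
As a small sketch of the coercion described above, here is the same idea on a hypothetical three-row frame (the values are invented; the third row mimics a shifted record whose `id` holds a date string):

```python
import pandas as pd

# Hypothetical rows, not taken from the real dataset.
toy = pd.DataFrame({
    "id": ["862", "8844", "1997-08-20"],
    "budget": ["30000000", "unknown", "/poster.jpg"],
})

# Non-numeric ids become NaN, and those rows are dropped.
toy["id"] = pd.to_numeric(toy["id"], errors="coerce")
toy = toy[toy["id"].notnull()].copy()
toy["id"] = toy["id"].astype("Int64")

# budget is coerced as well, but its NaN values are kept.
toy["budget"] = pd.to_numeric(toy["budget"], errors="coerce")

print(toy)
```

Two rows survive: the one with the date string in `id` is removed, while the row with an unparseable budget stays and simply has `NaN` in that column.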






In [0]:
movies['id'] = pd.to_numeric(movies['id'], errors='coerce')
# .copy() avoids a SettingWithCopyWarning on the assignment below
movies = movies[movies['id'].notnull()].copy()
movies['id'] = movies['id'].astype('Int64')
movies = movies.set_index('id')

In [0]:
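def to_numeric(df, labels):
    # Work on a copy so the caller's frame is not modified in place
    df = df.copy()
    for label in labels:
        # Convert the frame that was passed in, not the global `movies`
        df[label] = pd.to_numeric(df[label], errors='coerce')
    # Treat zeros as missing values (this applies to every column in df)
    df = df.replace({"0": np.nan, 0: np.nan})
    return df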

In [0]:
movies = to_numeric(movies, ['budget', 'revenue', 'vote_average'])

In [0]:
movies['revenue'].value_counts()

In [0]:
movies['budget'].isnull().sum()

In [0]:
# infer_datetime_format is deprecated in pandas 2.x; unparseable dates become NaT
movies['release_date'] = pd.to_datetime(movies['release_date'], errors='coerce')

In [0]:
clean_movies = movies[['title', 'release_date','budget', 'revenue', 'runtime', 'vote_average']]

In [0]:
clean_movies.head()

## Process User Reviews
* User reviews come as a collection of individual ratings, each giving a movie a score from 1 to 5.
* We take the mean rating for each movie.

In [0]:
mean_rating = ratings.groupby('movieId')[['rating']].mean()

In [0]:
avg_ratings = mean_rating

* Now we merge the mean ratings back into the movie data.
* Note that not all movies appear in the user ratings. The merge defaults to an inner join, so only movies with at least one rating are kept.
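
To illustrate how the merge treats movies without ratings, here is a minimal sketch on toy frames. The names `movies_toy`, `ratings_toy`, and `mean_toy` are hypothetical and exist only for this example.

```python
import pandas as pd

# Three movies, but only movies 1 and 2 have ratings.
movies_toy = pd.DataFrame(
    {"title": ["A", "B", "C"]},
    index=pd.Index([1, 2, 3], name="id"),
)
ratings_toy = pd.DataFrame({
    "movieId": [1, 1, 2],
    "rating": [4.0, 2.0, 5.0],
})

# Mean rating per movie, indexed by movieId.
mean_toy = ratings_toy.groupby("movieId")[["rating"]].mean()

# Default how="inner": movie 3 disappears because it has no ratings.
inner = movies_toy.merge(mean_toy, left_index=True, right_index=True)

# how="left" keeps every movie and leaves NaN where no rating exists.
left = movies_toy.merge(mean_toy, left_index=True, right_index=True, how="left")

print(inner)
print(left)
```

The inner join suits this analysis because every later step needs a rating. A left join would be the choice if unrated movies had to stay in the table.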

In [0]:
movie_ratings = clean_movies.merge(avg_ratings, left_index = True, right_index=True)

In [0]:
movie_ratings.dtypes

In [0]:
movie_ratings['year'] = movie_ratings['release_date'].dt.year.astype('Int64')

In [0]:
# Round each year down to the start of its decade (missing years stay missing)
movie_ratings['decade'] = (movie_ratings['year'] // 10 * 10).astype('Int64')

In [0]:
# The mean-rating column is named 'rating' after the groupby and merge above
top = [int(x) for x in movie_ratings.nlargest(5, 'rating').index.to_list()]

In [0]:
# enable_plotly_in_cell()
ratings[ratings['movieId'].isin(top)].boxplot( by= 'movieId', column ='rating')

In [0]:
#enable_plotly_in_cell()
movie_ratings.groupby('year')['rating'].mean().iplot(kind='bar')

In [0]:
movie_ratings['decade']

In [0]:
# enable_plotly_in_cell()
movie_ratings.groupby('decade')['rating'].mean().iplot(kind='bar')

In [0]:
# enable_plotly_in_cell()
# Keep only rows with a known decade, then plot the mean rating per decade
by_decade = movie_ratings[movie_ratings['decade'].notnull()]
trace = go.Box(
    x = by_decade['decade'],
    y = by_decade['rating']
)
go.Figure(trace)


In [0]:
# enable_plotly_in_cell()
# Keep only rows with a known year, then plot the mean rating per year
by_year = movie_ratings[movie_ratings['year'].notnull()]
trace = go.Box(
    x = by_year['year'],
    y = by_year['rating']
)
go.Figure(trace)

In [0]:
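# NOTE: 'qbudget' and 'scaled_rating' are created in later cells, and
# 'qbudget' lives on budget_ratings, not movie_ratings. Run this cell
# only after those cells have been executed.
trace = go.Bar(
    x = budget_ratings['qbudget'],
    y = budget_ratings['scaled_rating']
)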

In [0]:
movie_ratings.shape

In [0]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

In [0]:
all_ratings = movie_ratings[['vote_average','rating']].dropna()
all_ratings

In [0]:
all_ratings.isnull().sum()

In [0]:
scaler = StandardScaler()
x = scaler.fit_transform(all_ratings)

In [0]:
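# Reduce the two standardized rating columns to one combined score.
# The sign of a principal component is arbitrary, so check its direction.
pca = PCA(n_components=1)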

In [0]:
x = pca.fit_transform(x)

In [0]:
scaled_ratings = pd.DataFrame(x, index = all_ratings.index, columns=['scaled_rating'])

In [0]:
scaled_ratings

In [0]:
movie_ratings = movie_ratings.merge(scaled_ratings, left_index=True, right_index=True)

In [0]:
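# numeric_only=True skips non-numeric columns such as 'title' (required in pandas 2.x)
movie_ratings.corr(numeric_only=True)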

In [0]:
enable_plotly_in_cell()
movie_ratings.groupby('year')['scaled_rating'].mean().iplot(kind='bar')

In [0]:
movie_ratings.groupby('year')['scaled_rating'].mean().iplot(kind='bar', title = "Scaled Rating by Year", xTitle="year", yTitle="Scaled Rating")

In [0]:
movie_ratings.groupby('decade')['scaled_rating'].mean().iplot(kind='bar')

In [0]:
budget_ratings = movie_ratings[['budget', 'scaled_rating']].dropna()

In [0]:
budget_ratings['qbudget'] = pd.qcut(budget_ratings['budget'], q = 5, labels = ['vlow', 'low', 'med', 'high', 'blockbuster'])


In [0]:
enable_plotly_in_cell()
# observed=True avoids the categorical groupby FutureWarning in recent pandas
budget_ratings.groupby('qbudget', observed=True)['scaled_rating'].mean()

In [0]:
movie_ratings.nlargest(25, 'scaled_rating')

In [0]:
enable_plotly_in_cell()
trace = go.Scatter(
    x = movie_ratings["budget"],
    y = movie_ratings['revenue'],
    mode = "markers"
)

go.Figure(trace)