# IMDB
This notebook contains the first few steps of the data science pipeline on a dataset containing movies.

## Group
V2H-Groep 1: Films (IMDB)
- Niels Hoiting
- Jari Oostrom
- Yusuf Syakur

## Research questions
1. What is the correlation between the gender of actors and the popularity of the movie? How does this change over time?
2. What happens if we cluster this dataset, leaving out the genre variable?
3. To what extent can you predict the gross of a movie based on its popularity on Facebook and IMDB?

## Dataset
- Movie information with duration, genres, languages, country, budget and gross
- Likes on Facebook for the director, main cast, total cast and the movie itself
- Score on IMDB and reviews

## Step 1: Data collection
Import needed libraries. The dataset is already available.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import itertools
import json

In [None]:
from sklearn.metrics import mean_squared_error, r2_score
df_movies = pd.read_csv('movie.csv')
df_movies

## Step 2: Data processing (Data munging)
Look at the current dataframe and its column types.

In [None]:
df_movies.describe()

In [None]:
df_movies.dtypes

The current column order is hard to read. Reorder the columns so that related fields sit together.

In [None]:
df_movies = df_movies[['movie_imdb_link', 'movie_title', 'imdb_score', 'title_year', 'director_name', 'director_facebook_likes', 'actor_1_name',
                      'actor_1_facebook_likes', 'actor_2_name', 'actor_2_facebook_likes', 'actor_3_name', 'actor_3_facebook_likes',
                      'cast_total_facebook_likes', 'movie_facebook_likes', 'genres', 'budget', 'gross', 'country', 'language',
                      'num_critic_for_reviews', 'num_user_for_reviews', 'num_voted_users', 'plot_keywords', 'color', 'content_rating',
                      'duration', 'aspect_ratio', 'facenumber_in_poster']]
df_movies

## Step 3: Data Cleaning

Drop overall duplicates first.

In [None]:
print('Before removing duplicates:', df_movies.shape)
df_movies = df_movies.drop_duplicates()
print('After removing duplicates:', df_movies.shape)

### 3.1 movie_imdb_link

The movie_imdb_link duplicates only differ on a few columns like likes and votes. Extract the unique identifier from the URL and remove these duplicate rows.

In [None]:
pd.concat(gby_result for _, gby_result in df_movies.groupby("movie_imdb_link") if len(gby_result) > 1)

In [None]:
df_movies['movie_imdb_link'] = df_movies['movie_imdb_link'].str.extract(r'(?<=title\/)(.*)(?=\/\?)', expand=False)
print('Length before removing duplicates', df_movies.shape)
df_movies = df_movies.drop_duplicates(subset='movie_imdb_link')
print('Length after removing duplicates:',df_movies.shape)
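To see what the extraction keeps, the same pattern can be tried on a single link. This is a minimal sketch using the standard `re` module instead of pandas; the example URL is an assumed illustration of the link format, not a row taken from the dataset.

```python
import re

# Same pattern as in the cell above: everything between 'title/' and '/?'
pattern = re.compile(r'(?<=title\/)(.*)(?=\/\?)')

# Hypothetical link in the format the dataset appears to use
example_link = 'http://www.imdb.com/title/tt0499549/?ref_=fn_tt_tt_1'

match = pattern.search(example_link)
imdb_id = match.group(1) if match else None
print(imdb_id)
```

Because the query string after `/?` is discarded, two links that differ only in their tracking parameters reduce to the same identifier, which is what makes `drop_duplicates(subset='movie_imdb_link')` effective.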

### 3.2 movie_title

Strip whitespace from both ends of the title. Rows with a duplicate movie_title might be remakes or reboots of a movie, so leave them.

In [None]:
df_movies['movie_title'] = df_movies['movie_title'].str.strip()

### 3.3 title_year
Rows that have NaN for title_year are series/reviews, not movies. We won't need these for our analysis. Change title_year to datetime64 for time series analysis.

In [None]:
df_movies.loc[df_movies['title_year'].isnull()]

In [None]:
print('Length before removing NaN for title_year:', df_movies.shape)
df_movies = df_movies.drop(df_movies.loc[df_movies['title_year'].isnull()].index)
print('Length after removing NaN for title_year:', df_movies.shape)
df_movies['title_year'] = pd.to_datetime(df_movies['title_year'], format='%Y', errors='coerce')
df_movies

### 3.4 actor_1_name
Rows that have NaN for actor_1_name are documentaries, not movies. Remove them.

In [None]:
df_movies.loc[df_movies['actor_1_name'].isnull()]


In [None]:
print('Length before removing NaN for actor_1_name:', df_movies.shape)
df_movies = df_movies.drop(df_movies.loc[df_movies['actor_1_name'].isnull()].index)
print('Length after removing NaN for actor_1_name:', df_movies.shape)
df_movies

### 3.5 genres

Genres are separated by a '|' delimiter. In total there are 28 unique genres. There are no NaN values. Split them and give each genre its own boolean column.

In [None]:
list_genres = list(set(itertools.chain.from_iterable(df_movies.genres.str.split('|'))))
print(list_genres)

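def add_genre(df, genre):
    # Match whole genre names, so that e.g. 'Music' does not also match 'Musical'
    genre_column = 'genre_' + genre
    df_copy = df.copy()
    df_copy[genre_column] = df_copy['genres'].str.split('|').apply(lambda genres: genre in genres)
    return df_copy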

for genre in list_genres:
    df_movies = add_genre(df_movies, genre)

df_movies
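An alternative worth considering is `Series.str.get_dummies`, which splits on the delimiter and creates one column per whole genre name in a single call. The sketch below runs on a made-up DataFrame; the titles and genre combinations are assumptions for illustration only.

```python
import pandas as pd

# Made-up rows in the shape of the `genres` column
df_example = pd.DataFrame({
    'movie_title': ['A', 'B', 'C'],
    'genres': ['Action|Sci-Fi', 'Drama', 'Action|Drama|Romance'],
})

# One boolean column per genre, prefixed like the columns created above
genre_flags = (
    df_example['genres']
    .str.get_dummies(sep='|')
    .astype(bool)
    .add_prefix('genre_')
)

df_example = df_example.join(genre_flags)
print(df_example.filter(like='genre_'))
```

This avoids the loop and the repeated `df.copy()`, at the cost of creating all genre columns at once rather than one at a time.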

### 3.6 plot_keywords
Replace the '|' delimiter with a space to be able to use text mining (if needed).

In [None]:
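# regex=False: treat '|' as a literal character, not as the regex alternation operator
df_movies['plot_keywords'] = df_movies['plot_keywords'].str.replace('|', ' ', regex=False)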
df_movies['plot_keywords']

### 3.7 content_rating
Replace NaN and 'Unrated' with 'Not Rated'.

In [None]:
print(df_movies['content_rating'].unique())

df_movies['content_rating'] = df_movies['content_rating'].str.replace('Unrated', 'Not Rated')
df_movies['content_rating'] = df_movies['content_rating'].fillna(value='Not Rated')

print(df_movies['content_rating'].unique())

### 3.8 color
All rows with NaN for color were released after 1990. Assume these are in color (widely available since the 1950s).

In [None]:
df_movies['color'] = df_movies['color'].fillna(value='Color')
df_movies['color'].unique()

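### 3.9 Remove unimportant NaNs

Remove rows that still have NaN values in any column; these can't be filled in with a sensible default. Budget is set aside before dropping and joined back afterwards, because dropping on it might cause too much data loss. Note that only budget is exempted in the cell below, so rows with a missing gross are removed.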

In [None]:
print('Length before removing NaNs', len(df_movies))

cols_to_ignore = ['movie_imdb_link', 'budget']
df_budget_gross = df_movies[cols_to_ignore]
df_movies = df_movies.drop(['budget'], axis=1)

df_movies = df_movies.dropna()

print('Length after removing NaNs', len(df_movies))

df_movies = df_movies.join(df_budget_gross.set_index('movie_imdb_link'), on='movie_imdb_link')
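The same effect can be sketched without the drop-and-join round trip by telling `dropna` which columns to check. The toy frame below is an assumption for illustration; only `budget` is exempted, as in the cell above.

```python
import numpy as np
import pandas as pd

# Made-up rows: one with a missing duration, one with a missing budget
df_example = pd.DataFrame({
    'movie_imdb_link': ['tt1', 'tt2', 'tt3'],
    'duration': [120.0, np.nan, 95.0],
    'budget': [np.nan, 1_000_000.0, 5_000_000.0],
})

# Check every column except budget for NaN values
cols_to_check = df_example.columns.difference(['budget']).tolist()
df_clean = df_example.dropna(subset=cols_to_check)
print(df_clean)
```

The row with a missing duration is dropped, while the row with a missing budget survives with its NaN intact.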

### 3.10 Change to int64

In [None]:
df_movies = df_movies.astype({'director_facebook_likes': 'int64',
                            'actor_1_facebook_likes': 'int64',
                            'actor_2_facebook_likes': 'int64',
                            'actor_3_facebook_likes': 'int64',
                            'cast_total_facebook_likes': 'int64',
                            'num_critic_for_reviews': 'int64',
                            'num_user_for_reviews': 'int64',
                            'num_voted_users': 'int64',
                            'duration': 'int64',
                            'facenumber_in_poster': 'int64',
                            'gross': 'float64'})

df_movies

## Step 4: Data Visualization

In [None]:
import matplotlib.pyplot as plt

Check whether there is a correlation between budget and duration. The budget axis is capped so that a few extreme values do not flatten the scatterplot. Looking at the result, there is no obvious correlation.

In [None]:
fig = plt.figure(1, figsize=(10,10))

y_budget = df_movies[['budget']]
x_duration = df_movies[['duration']]

axScatter = plt.subplot(111)
axScatter.scatter(x_duration, y_budget)
plt.ylim(0, 300000000)
axScatter.set_title('Scatterplot between budget and duration')
axScatter.set_xlabel('Duration in minutes')
axScatter.set_ylabel('Budget in US Dollars')
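A visual impression can be backed up with a number. As a minimal sketch, `Series.corr` returns the Pearson correlation coefficient between two columns; the values below are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Made-up durations (minutes) and budgets (US dollars)
df_example = pd.DataFrame({
    'duration': [90, 100, 110, 120, 130],
    'budget': [5e6, 1e6, 8e6, 2e6, 6e6],
})

# Pearson correlation: -1 and 1 are perfect linear relations, 0 is none
r = df_example['duration'].corr(df_example['budget'])
print(round(r, 2))
```

A coefficient close to 0 would be consistent with a scatterplot that shows no clear trend.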

We can add another dataset to see the gender of every actor. Both datasets come from the same source, The Movie Database, so we'll join them on the title.

In [None]:
df_credits = pd.read_csv('tmdb_5000_credits.csv')
df_credits = df_credits.rename(columns={'title': 'movie_title'})
df_credits

In [None]:
movie_with_cast = pd.merge(df_movies, df_credits, how="inner", on="movie_title")
movie_with_cast

Cast is a nested (JSON-encoded) field. The function below returns the gender code for a given actor name within a cast, or 0 if the name is not found.

In [None]:

def actor_to_gender(cast, name):
    cast = json.loads(cast)
    for actor in cast:
        if name == actor['name']:
            return actor['gender']
    return 0
 
movie_with_cast['actor_1_gender'] = movie_with_cast.apply(lambda movie: actor_to_gender(movie.cast, movie.actor_1_name), axis=1)
movie_with_cast['actor_2_gender'] = movie_with_cast.apply(lambda movie: actor_to_gender(movie.cast, movie.actor_2_name), axis=1)
movie_with_cast['actor_3_gender'] = movie_with_cast.apply(lambda movie: actor_to_gender(movie.cast, movie.actor_3_name), axis=1)
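To make the lookup concrete, here is the same function applied to a small hand-written cast. The cast entries and actor names are made up for illustration; the label mapping follows the TMDB convention of 0 for not set, 1 for female and 2 for male.

```python
import json

def actor_to_gender(cast, name):
    """Return the gender code of `name` in a JSON-encoded cast list, or 0 if absent."""
    for actor in json.loads(cast):
        if actor['name'] == name:
            return actor['gender']
    return 0

# Made-up cast in the shape of the `cast` column
example_cast = json.dumps([
    {'name': 'Actor One', 'gender': 2},
    {'name': 'Actor Two', 'gender': 1},
])

gender_labels = {0: 'Unknown', 1: 'Female', 2: 'Male'}
for name in ['Actor One', 'Actor Two', 'Actor Three']:
    print(name, gender_labels[actor_to_gender(example_cast, name)])
```

An actor who does not appear in the cast list falls through to 0, so such cases are counted together with actors whose gender is not set.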


In [None]:
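# Label by gender code (0 = not set, 1 = female, 2 = male) instead of relying on the order of the counts
movie_with_cast.actor_1_gender.map({2: 'Male', 1: 'Female', 0: 'Unknown'}).value_counts().plot(kind='pie')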
plt.show();

In [None]:
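# Label by gender code (0 = not set, 1 = female, 2 = male) instead of relying on the order of the counts
movie_with_cast.actor_2_gender.map({2: 'Male', 1: 'Female', 0: 'Unknown'}).value_counts().plot(kind='pie')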
plt.show();

In [None]:
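# Label by gender code (0 = not set, 1 = female, 2 = male) instead of relying on the order of the counts
movie_with_cast.actor_3_gender.map({2: 'Male', 1: 'Female', 0: 'Unknown'}).value_counts().plot(kind='pie')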
plt.show();

The gender ratio is largely the same across the first three featured actors.

In [None]:
from pandas.plotting import scatter_matrix
plt_scatter = scatter_matrix(df_movies[['gross', 'num_critic_for_reviews', 'num_user_for_reviews', 'imdb_score', 'movie_facebook_likes']], alpha=0.2, figsize=(9,9), diagonal='kde')
plt.show()


## To what extent can you predict the gross of a movie based on its popularity on Facebook and IMDB?

First we will determine how the gross is impacted by all the factors, then we will see what the best (most accurate) formula is.
The factors we want to check are 
1. number of critic reviews
2. number of user reviews
3. movie facebook likes
4. imdb score


In [None]:
# assign the result back and drop rows that miss any of the model columns
df_movies = df_movies.replace([np.inf, -np.inf], np.nan).dropna(subset=['num_critic_for_reviews','num_user_for_reviews', 'movie_facebook_likes', 'imdb_score','gross'])
df_movies = df_movies.reset_index(drop=True)

In [None]:
# create a Python list of feature names
feature_cols = ['num_critic_for_reviews','num_user_for_reviews', 'movie_facebook_likes', 'imdb_score',]

# use the list to select a subset of the original DataFrame
X = df_movies[feature_cols]


# print the first 5 rows
X.head()

In [None]:
# select a Series from the DataFrame
y = df_movies['gross']

# print the first 5 values
y.head()
 

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)


Next we fit a linear regression model on all four features.

In [None]:
# import model
from sklearn.linear_model import LinearRegression

# instantiate
linreg = LinearRegression()

# fit the model to the training data (learn the coefficients)
linreg.fit(X_train, y_train)


Next we'll look at how accurate the fitted model is. We do this by inspecting the intercept and coefficients, and by calculating the root mean squared error and the coefficient of determination on the testing set.


In [None]:

print(linreg.intercept_)
print(linreg.coef_)
list(zip(feature_cols, linreg.coef_))


In [None]:
from sklearn import metrics

# make predictions on the testing set
y_pred = linreg.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
# The coefficients
print('Coefficients: \n', linreg.coef_)
# The mean squared error
print('Mean squared error: %.2f'
      % mse)
print('Root mean squared error: %.2f'
      % np.sqrt(mse))
# The coefficient of determination: 1 is perfect prediction
print('Coefficient of determination: %.2f'
      % r2_score(y_test, y_pred))
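For reference, the metrics printed above can be written out by hand. The sketch below uses plain Python on made-up numbers (not values from the dataset) to show what `mean_squared_error` and `r2_score` compute.

```python
import math

# Made-up true values and predictions
y_true = [3.0, 5.0, 7.0, 9.0]
y_hat = [2.5, 5.5, 6.0, 10.0]

n = len(y_true)
mean_true = sum(y_true) / n

# Residual sum of squares and total sum of squares
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_hat))
ss_tot = sum((t - mean_true) ** 2 for t in y_true)

mse = ss_res / n              # mean squared error
rmse = math.sqrt(mse)         # same unit as the target variable
r2 = 1 - ss_res / ss_tot      # 1 is a perfect fit, 0 is no better than the mean

print(mse, rmse, r2)
```

The root mean squared error is expressed in the unit of the target (here US dollars of gross), so it should be small relative to typical gross values, whereas the coefficient of determination is unitless and should be close to 1.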


This is not good: a coefficient of determination close to 1 is ideal, and the root mean squared error is extremely high. We'll drop the columns that seem to have the least correlation with gross according to the scatter plot, starting with the number of critic reviews.

In [None]:
# create a Python list of feature names
feature_cols_2 = ['movie_facebook_likes', 'imdb_score','num_user_for_reviews']

# use the list to select a subset of the original DataFrame
X_2 = df_movies[feature_cols_2]

from sklearn.model_selection import train_test_split
X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(X_2, y, random_state=0)

from sklearn.linear_model import LinearRegression

# instantiate
linreg_2 = LinearRegression()

# fit the model to the training data (learn the coefficients)
linreg_2.fit(X_train_2, y_train_2)

print(linreg_2.intercept_)
print(linreg_2.coef_)
list(zip(feature_cols_2, linreg_2.coef_))

from sklearn import metrics

# make predictions on the testing set
y_pred_2 = linreg_2.predict(X_test_2)
mse = mean_squared_error(y_test_2, y_pred_2)
# The coefficients
print('Coefficients: \n', linreg_2.coef_)
# The mean squared error
print('Mean squared error: %.2f'
      % mse)
print('Root mean squared error: %.2f'
      % np.sqrt(mse))
# The coefficient of determination: 1 is perfect prediction
print('Coefficient of determination: %.2f'
      % r2_score(y_test_2, y_pred_2))

This is not a good result: we want an error that is not far from 0, and it even went up slightly. Let's drop more columns; this time we'll remove the number of user reviews and look again.

In [None]:
# create a Python list of feature names
feature_cols_3 = ['movie_facebook_likes', 'imdb_score',]

# use the list to select a subset of the original DataFrame
X_3 = df_movies[feature_cols_3]

from sklearn.model_selection import train_test_split
X_train_3, X_test_3, y_train_3, y_test_3 = train_test_split(X_3, y, random_state=0)

from sklearn.linear_model import LinearRegression

# instantiate
linreg_3 = LinearRegression()

# fit the model to the training data (learn the coefficients)
linreg_3.fit(X_train_3, y_train_3)

print(linreg_3.intercept_)
print(linreg_3.coef_)
list(zip(feature_cols_3, linreg_3.coef_))

from sklearn import metrics

# make predictions on the testing set
y_pred_3 = linreg_3.predict(X_test_3)
mse = mean_squared_error(y_test_3, y_pred_3)
# The coefficients
print('Coefficients: \n', linreg_3.coef_)
# The mean squared error
print('Mean squared error: %.2f'
      % mse)
print('Root mean squared error: %.2f'
      % np.sqrt(mse))
# The coefficient of determination: 1 is perfect prediction
print('Coefficient of determination: %.2f'
      % r2_score(y_test_3, y_pred_3))

Dropping columns is not the answer: our first model is still the best, but it has a large margin of error. The correlation is weak at best. Popularity on IMDB and Facebook seems to be a weak indicator of how well a movie will do financially.


# Z-test IMDB

A film critic claims that the score of English-language films is lower than average.

Use the dataset to investigate whether this film critic is right. Take a sample (with ```pandas.DataFrame.sample(n=100,random_state=1)```) of 100 English-language films and treat the whole dataset as the population. Use a confidence level of 90%. From the dataset, use only the film records for which both the language (`language`) and the score (`imdb_score`) are known.



In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as st

In [None]:
movies = pd.read_csv('movie.csv') 

In [None]:
all_movies = movies[movies.imdb_score.notnull()]
all_movies = all_movies[all_movies.language.notnull()]

In [None]:
movies_english = all_movies[all_movies.language == 'English']

In [None]:
sample = movies_english.sample(n=100,random_state=1)

In [None]:
sample.boxplot(column="imdb_score")
plt.show()
all_movies.boxplot(column="imdb_score")
plt.show()

In [None]:
sample.imdb_score.mean()

In [None]:
all_movies.imdb_score.mean()

In [None]:
stdev_en = st.tstd(sample["imdb_score"])

print(stdev_en)

To test the critic's claim we perform a Z-test. We state the null hypothesis and the alternative hypothesis, and test the former.

These are as follows:

H0: English films score as well as or better than other movies on IMDB. μother <= μenglish = 6.35

H1: English films score worse than other movies on IMDB. μother > μenglish = 6.35

In [None]:
n = 100
good_score = movies_english[movies_english.imdb_score >= all_movies.imdb_score.mean()].count()
q = .5
z_alpha = 1.28  # critical value for a one-sided test at 90% confidence
mean_english_score = sample.imdb_score.mean()
mean_score = all_movies.imdb_score.mean()

# standard error of the sample mean
se = stdev_en / (np.sqrt(n))

z = (mean_score - mean_english_score) / se
print(z)

The z value from the calculation is 0.68, which is below the critical value of 1.28.
We therefore cannot reject the null hypothesis, so the sample does not support the critic's claim.
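The critical value and a one-sided p-value can also be derived instead of looked up in a table. This is a minimal sketch using `statistics.NormalDist` from the standard library; the observed z value is the 0.68 reported above, and the variable names are chosen for illustration.

```python
from statistics import NormalDist

alpha = 0.10        # 90% confidence, one-sided test
z_observed = 0.68   # z value reported in this notebook

# Critical value: the z score with 90% of the standard normal distribution below it
z_critical = NormalDist().inv_cdf(1 - alpha)

# One-sided p-value: probability of a z score at least this large under H0
p_value = 1 - NormalDist().cdf(z_observed)

reject_h0 = z_observed > z_critical
print(round(z_critical, 2), round(p_value, 2), reject_h0)
```

Comparing the p-value with alpha leads to the same decision as comparing z with the critical value; reporting both makes it easier to see how far the result is from the threshold.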

