# IMDB
This notebook contains the first few steps of the data science pipeline on a dataset containing movies.

## Group
V2H-Groep 1: Films (IMDB)
- Niels Hoiting
- Jari Oostrom
- Yusuf Syakur

## Research questions
1. [What is the correlation between the gender of actors and the popularity of the movie, and how does this change over time?](#What-is-the-correlation-between-the-gender-of-the-cast-and-the-popularity-of-the-movie.)
2. [What happens if we cluster this dataset, leaving out the genre variable?](#What-happens-if-we-cluster-this-dataset,-leaving-out-the-genre-variable?)
3. [To what extent can you predict the gross of a movie based on its popularity on Facebook and IMDB?](#To-what-extend-can-you-predict-the-gross-of-a-movie-based-on-its-popularity-on-Facebook-and-IMDB?)

## Dataset
Movie information with duration, genres, languages, country, budget and gross;
likes on Facebook for the director, main cast, total cast and the movie itself;
score on IMDB and reviews.

## Step 1: Data collection
Import needed libraries. The dataset is already available.

In [None]:
import pandas as pd
import itertools

In [None]:
from sklearn.cluster import KMeans
from sklearn.metrics import mean_squared_error
df_movies = pd.read_csv('movie.csv')
df_movies

## Step 2: Data processing (Data munging)
Look at the current dataframe and its column types.

In [None]:
df_movies.describe()

In [None]:
df_movies.dtypes

Current column order does not make sense. Order them.

In [None]:
df_movies = df_movies[['movie_imdb_link', 'movie_title', 'imdb_score', 'title_year', 'director_name', 'director_facebook_likes', 'actor_1_name',
                      'actor_1_facebook_likes', 'actor_2_name', 'actor_2_facebook_likes', 'actor_3_name', 'actor_3_facebook_likes',
                      'cast_total_facebook_likes', 'movie_facebook_likes', 'genres', 'budget', 'gross', 'country', 'language',
                      'num_critic_for_reviews', 'num_user_for_reviews', 'num_voted_users', 'plot_keywords', 'color', 'content_rating',
                      'duration', 'aspect_ratio', 'facenumber_in_poster']]
df_movies

## Step 3: Data Cleaning

Drop overall duplicates first.

In [None]:
print('Before removing duplicates', df_movies.shape)
df_movies = df_movies.drop_duplicates()
print('After removing duplicates:', df_movies.shape)

### 3.1 movie_imdb_link

The movie_imdb_link duplicates only differ on a few columns like likes and votes. Extract the unique identifier from the URL and remove these duplicate rows.

In [None]:
pd.concat(gby_result for _, gby_result in df_movies.groupby("movie_imdb_link") if len(gby_result) > 1)

In [None]:
df_movies['movie_imdb_link'] = df_movies['movie_imdb_link'].str.extract(r'(?<=title\/)(.*)(?=\/\?)', expand=False)
print('Length before removing duplicates', df_movies.shape)
df_movies = df_movies.drop_duplicates(subset='movie_imdb_link')
print('Length after removing duplicates:',df_movies.shape)
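
To illustrate what the pattern keeps, here is a minimal sketch on two made-up links (the URLs and their `ref_` suffixes are illustrative assumptions, not rows copied from the dataset). Both links reduce to the same identifier, which is why dropping duplicates on the extracted value removes such rows.

```python
import re

# Same pattern as above: everything between 'title/' and '/?'.
pattern = re.compile(r'(?<=title\/)(.*)(?=\/\?)')

# Hypothetical links that differ only in their tracking suffix.
links = [
    'http://www.imdb.com/title/tt0499549/?ref_=fn_tt_tt_1',
    'http://www.imdb.com/title/tt0499549/?ref_=fn_tt_tt_2',
]

ids = [pattern.search(link).group(1) for link in links]
print(ids)
```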

### 3.2 movie_title

Strip whitespace from both ends of the title. Rows with a duplicate movie_title might be a remake or a reboot of the movie. Leave them.

In [None]:
df_movies['movie_title'] = df_movies['movie_title'].str.strip()

### 3.3 title_year
Rows that have NaN for title_year are series/reviews, not movies. We won't need these for our analysis. Change title_year to datetime64 for time series analysis.

In [None]:
df_movies.loc[df_movies['title_year'].isnull()]

In [None]:
print('Length before removing NaN for title_year:', df_movies.shape)
df_movies = df_movies.drop(df_movies.loc[df_movies['title_year'].isnull()].index)
print('Length after removing NaN for title_year:', df_movies.shape)
df_movies['title_year'] = pd.to_datetime(df_movies['title_year'], format='%Y', errors='coerce')
df_movies

### 3.4 actor_1_name
Rows that have NaN for actor_1_name are documentaries, not movies. Remove them.

In [None]:
df_movies.loc[df_movies['actor_1_name'].isnull()]


In [None]:
print('Length before removing NaN for actor_1_name:', df_movies.shape)
df_movies = df_movies.drop(df_movies.loc[df_movies['actor_1_name'].isnull()].index)
print('Length after removing NaN for actor_1_name:', df_movies.shape)
df_movies

### 3.5 genres

Genres are separated by a '|' delimiter. In total there are 28 unique genres. There are no NaN values. Split them and give each genre its own boolean column.

In [None]:
list_genres = list(set(itertools.chain.from_iterable(df_movies.genres.str.split('|'))))
print(list_genres)

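def add_genre(df, genre):
    """Return a copy of df with a boolean column 'genre_<genre>'."""
    genreConcat = 'genre_' + genre
    df_copy = df.copy()
    # Test exact membership after splitting, so that 'Music' does not also match 'Musical'.
    df_copy[genreConcat] = df_copy['genres'].str.split('|').apply(lambda parts: genre in parts)
    return df_copy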

for genre in list_genres:
    df_movies = add_genre(df_movies, genre)

df_movies
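
When turning the genre string into boolean columns, it matters whether a genre name is matched as a substring or as an exact list element. The following is a minimal sketch on made-up rows (the `toy` frame is an illustrative assumption, not data from the set) that shows the difference for 'Music' and 'Musical'.

```python
import pandas as pd

# Toy data (illustrative, not from the dataset).
toy = pd.DataFrame({'genres': ['Drama|Musical', 'Music|Romance', 'Action']})

# Substring match: 'Music' is also found inside 'Musical'.
substring_hit = toy['genres'].str.contains('Music', regex=False)

# Exact match: split on the delimiter and test list membership.
exact_hit = toy['genres'].str.split('|').apply(lambda parts: 'Music' in parts)

print(substring_hit.tolist())
print(exact_hit.tolist())
```

Only the exact match leaves the 'Drama|Musical' row out of the 'Music' column.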

### 3.6 plot_keywords
Replace the '|' delimiter with a space to be able to use text mining (if needed).

In [None]:
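# regex=False makes '|' a literal character instead of the regex alternation operator.
df_movies['plot_keywords'] = df_movies['plot_keywords'].str.replace('|', ' ', regex=False)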
df_movies['plot_keywords']

### 3.7 content_rating
Replace NaN and 'Unrated' with 'Not Rated'.

In [None]:
print(df_movies['content_rating'].unique())

df_movies['content_rating'] = df_movies['content_rating'].str.replace('Unrated', 'Not Rated')
df_movies['content_rating'] = df_movies['content_rating'].fillna(value='Not Rated')

print(df_movies['content_rating'].unique())

### 3.8 color
All rows with NaN for color were released after 1990. Assume these movies are in color (available since the 1950s).

In [None]:
df_movies['color'] = df_movies['color'].fillna(value='Color')
df_movies['color'].unique()

### 3.9 Remove unimportant NaN's

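Remove rows that contain NaN values in the remaining columns, since these can't be filled in with a 'default' value. Budget is set aside first and joined back afterwards, so rows with a missing budget are kept (dropping them might turn out to be too much data loss). Note that the code below sets aside only budget: rows with a missing gross are dropped along with the rest.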

In [None]:
print('Length before removing NaNs', len(df_movies))

cols_to_ignore = ['movie_imdb_link', 'budget']
df_budget_gross = df_movies[cols_to_ignore]
df_movies = df_movies.drop(['budget'], axis=1)

df_movies = df_movies.dropna()

print('Length after removing NaNs', len(df_movies))

df_movies = df_movies.join(df_budget_gross.set_index('movie_imdb_link'), on='movie_imdb_link')
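
The set-aside-and-join-back pattern used above can be sketched on a tiny frame. The column values here are made up for illustration; only the column names `movie_imdb_link` and `budget` come from the notebook.

```python
import numpy as np
import pandas as pd

# Toy data (illustrative): one row misses budget, another misses duration.
toy = pd.DataFrame({
    'movie_imdb_link': ['tt1', 'tt2', 'tt3'],
    'budget': [100.0, np.nan, 300.0],
    'duration': [90.0, 120.0, np.nan],
})

# Set the column we want to protect aside, keyed by the identifier.
budget = toy[['movie_imdb_link', 'budget']]

# Drop rows with NaN in any of the remaining columns.
rest = toy.drop(columns=['budget']).dropna()

# Join the protected column back; its NaN values survive.
result = rest.join(budget.set_index('movie_imdb_link'), on='movie_imdb_link')
print(result)
```

The row with a missing duration is gone, while the row with a missing budget is kept. `DataFrame.dropna` also accepts a `subset` argument, which could express the same intent in a single call.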

### 3.10 Change to int64

In [None]:
df_movies = df_movies.astype({'director_facebook_likes': 'int64',
                            'actor_1_facebook_likes': 'int64',
                            'actor_2_facebook_likes': 'int64',
                            'actor_3_facebook_likes': 'int64',
                            'cast_total_facebook_likes': 'int64',
                            'num_critic_for_reviews': 'int64',
                            'num_user_for_reviews': 'int64',
                            'num_voted_users': 'int64',
                            'duration': 'int64',
                            'facenumber_in_poster': 'int64',
                              'gross': 'float64'})

df_movies

## Step 4: Data Visualization

In [None]:
import matplotlib.pyplot as plt

# What is the correlation between the gender of the cast and the popularity of the movie.

In order to find a correlation between gender of actors and popularity we need to define what a 'popular' movie is.


In [None]:
# imports 

import pandas as pd
import numpy as np
import json
import seaborn as sns; sns.set(color_codes=True)

import holoviews as hv
hv.extension('matplotlib')
hv.extension('bokeh')

In [None]:
%%html
<style>
  table {margin-left: 0 !important;}
</style>

In [None]:
# Load dataset
df_gender = df_movies.copy()

# Place genres into a array
df_gender.genres = df_gender.genres.str.split(pat = "|")

# Get the unique genres
movie_genres = df_gender.genres.explode().unique()

## What is popularity

There are a couple of fields that can indicate popularity
- `movie_facebook_likes`
- `num_critic_for_reviews`
- `gross`
- `imdb_score`

In [None]:
print("Total shape", df_gender.shape)
print("After dropping NAs", df_gender.dropna().shape)

print("movie_facebook_likes > 0", df_gender[df_gender.movie_facebook_likes > 0].shape)
print("num_critic_for_reviews > 0", df_gender[df_gender.num_critic_for_reviews > 0].shape)
print("gross > 0", df_gender[df_gender.gross > 0].shape)

## Cast and gender
Our current dataset does not contain data about gender. We will join the dataset with another dataset from the same source: The Movie Database. First we need to remove the trailing non-ASCII characters from the movie titles.




In [None]:
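# Remove non-ASCII characters from the titles (regex=True is required for the character class).
df_gender['movie_title'] = df_gender.movie_title.str.replace(r'[^\x00-\x7F]', '', regex=True)
# Extract the release year as a plain number.
df_gender['year'] = df_gender.title_year.dt.year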

The new dataset contains 4 columns: `movie_id`, `movie_title`, `cast` and `crew`. We are interested in the `cast` column, which contains an array with the cast of the movie.

In [None]:
# A credits dataset we can join with our movie dataset.
df_credits = pd.read_csv('tmdb_5000_credits.csv')
df_credits = df_credits.rename(columns={'title': 'movie_title'})

df_credits.head(3)

Inner joining based on the title of the movie.

In [None]:
# Joining the two datasets
movie_with_cast = pd.merge(df_gender, df_credits, how="inner", on="movie_title")
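
Titles are not unique identifiers (section 3.2 deliberately keeps remakes), so a title-based inner join has two side effects worth checking. The sketch below uses made-up frames; the titles, years and empty cast strings are illustrative assumptions.

```python
import pandas as pd

# Toy frames (illustrative): 'Hero' exists twice on the left (original and remake).
left = pd.DataFrame({'movie_title': ['Hero', 'Hero', 'Solo'],
                     'year': [1992, 2002, 2018]})
right = pd.DataFrame({'movie_title': ['Hero', 'Other'],
                      'cast': ['[]', '[]']})

merged = pd.merge(left, right, how='inner', on='movie_title')
print(len(left), len(right), len(merged))
print(merged)
```

Both 'Hero' rows receive the same cast, and 'Solo' disappears because it has no match. Comparing row counts before and after the join is a cheap way to see how much of this happens on the real data.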

The credits dataset adds a column called `cast`. This column contains an array of objects. Each object represents an actor or actress.
The gender of each actor is stored in the `gender` field of the object. The three possible values are:

|Value   | Gender  |
|---|---|
| 0  | Unknown  |
| 1  | Female  |
| 2  | Male  |

Ideally we have one value that represents the share of males and females within the cast of a movie. The first step toward this value is creating a vector for each possible value. After that a column for each gender (male, female and unknown) is created.

In [None]:
# cast is a nested JSON field; this function returns the counts [unknown, female, male] for the given cast.
def actor_to_gender(cast):
    cast = json.loads(cast)
    ratio = [0, 0, 0]
    for actor in cast:
        ratio[actor['gender']] += 1
    return ratio

movie_with_cast['gender_ratio'] = movie_with_cast.apply(lambda movie: actor_to_gender(movie.cast), axis=1)
movie_with_cast['unknown_actors'] = movie_with_cast.apply(lambda movie: movie['gender_ratio'][0], axis=1)
movie_with_cast['female_actors'] = movie_with_cast.apply(lambda movie: movie['gender_ratio'][1], axis=1)
movie_with_cast['male_actors'] = movie_with_cast.apply(lambda movie: movie['gender_ratio'][2], axis=1)
movie_with_cast['total_known_actors'] = movie_with_cast.female_actors + movie_with_cast.male_actors
movie_with_cast[['movie_title', 'gender_ratio', 'unknown_actors', 'female_actors', 'male_actors', 'total_known_actors']].head()

The histograms below show that most movies have between 3 and 15 male actors and between 3 and 7 female actors.

In [None]:
male = movie_with_cast.male_actors
plt.hist(male, bins=range(0, 50, 1))
plt.title("Frequency male")
plt.show()

female = movie_with_cast.female_actors
plt.title("Frequency female")
plt.hist(female, bins=range(0, 50, 1))
plt.show()

In order to find a correlation between gender and popularity we need a numeric value that represents _gender_. For this we take a ratio. By dividing `male_actors` by the sum of `female_actors` and `male_actors` we get a ratio representing the distribution of gender, where a low number means relatively more female actors and a high number means relatively more male actors.

In [None]:
movie_with_cast['ratio'] = movie_with_cast.male_actors / (movie_with_cast.male_actors + movie_with_cast.female_actors)
movie_with_cast[['movie_title', 'ratio']].head()
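
The counting and ratio steps can be traced on a single hand-made cast. The cast string below is a hypothetical example in the same shape as the `cast` column, and `count_genders` is an illustrative helper, not a function from the notebook.

```python
import json

# Hypothetical cast string in the same shape as the `cast` column.
cast_json = json.dumps([
    {'name': 'Actor A', 'gender': 2},
    {'name': 'Actor B', 'gender': 1},
    {'name': 'Actor C', 'gender': 2},
    {'name': 'Actor D', 'gender': 0},
])

def count_genders(cast):
    """Return [unknown, female, male] counts for a JSON-encoded cast list."""
    counts = [0, 0, 0]
    for actor in json.loads(cast):
        counts[actor['gender']] += 1
    return counts

counts = count_genders(cast_json)
unknown, female, male = counts

# Share of male actors among the actors with a known gender.
ratio = male / (male + female)
print(counts, ratio)
```

Here two of the three actors with a known gender are male, so the ratio is 2/3. The actor with unknown gender does not influence the ratio.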

In order to compare similar movies with each other we add one more filter: the total number of known actors should be at least 20. This probably results in a dataset with higher grossing movies.

In [None]:
movie_with_cast[['ratio', 'gender_ratio', 'imdb_score', 'gross', 'movie_facebook_likes', 'num_critic_for_reviews', 'male_actors' , 'female_actors', 'total_known_actors']].describe()
filtered = movie_with_cast[movie_with_cast.total_known_actors >= 20]

As shown in the scatterplot, there seems to be a weak linear correlation between the two.

In [None]:
sns.regplot(x="ratio", y="imdb_score", data=filtered);

The degree of correlation between `imdb_score` and `ratio` can be measured by the _Pearson correlation coefficient_. Values close to -1 or 1 indicate a strong linear correlation; 0 indicates no linear correlation.

A coefficient of `0.26653` is not very high but it means there is a small correlation.

In [None]:
filtered[['ratio', 'imdb_score']].corr(method='pearson')
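
As a reminder of what the coefficient measures, the sketch below computes Pearson's r by hand (covariance divided by the product of the standard deviations) and compares it with NumPy's built-in. The `x` and `y` values are made up for illustration.

```python
import numpy as np

# Toy values (illustrative only).
x = np.array([0.2, 0.4, 0.5, 0.7, 0.9])
y = np.array([5.1, 6.0, 5.8, 6.9, 7.2])

# Pearson r written out: sum of co-deviations over the root of the squared deviations.
dx, dy = x - x.mean(), y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# NumPy's correlation matrix holds the same value off the diagonal.
r_numpy = np.corrcoef(x, y)[0, 1]
print(r_manual, r_numpy)
```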

Next, we fit a linear regression model to determine the coefficient of determination (R²). The coefficient is very low, which seems logical given the large number of outliers shown in the scatterplot.

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

X = filtered[['ratio']]
y = filtered['imdb_score']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
reg = LinearRegression().fit(X_train, y_train)
y_pred = reg.predict(X_test)

# The coefficient of determination: 1 is perfect prediction
print('Coefficient of determination: %.2f'
      % r2_score(y_test, y_pred))
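
To help read the score, here is a small sketch of the usual definition, R² = 1 - SS_res / SS_tot, on made-up numbers. The helper `r_squared` is an illustrative re-implementation, not the scikit-learn function itself.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    ss_res = ((y_true - y_pred) ** 2).sum()
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()
    return 1 - ss_res / ss_tot

# Toy targets (illustrative only).
y_true = np.array([6.0, 7.0, 5.0, 8.0])

perfect = r_squared(y_true, y_true)                                # exact predictions
baseline = r_squared(y_true, np.full_like(y_true, y_true.mean()))  # always predict the mean
worse = r_squared(y_true, y_true[::-1])                            # predictions in reversed order
print(perfect, baseline, worse)
```

A score near 0 means the model does no better than always predicting the mean score, and on a held-out test set the score can even be negative.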

We also want to know how the correlation changes over time. This is illustrated in the graph below. The x axis represents time, the y axis the _Pearson correlation coefficient_. The graph does not indicate any clear trend.

In [None]:
corr = filtered[['year', 'ratio', 'imdb_score']].groupby(['year']).corr()
corr = corr.dropna()
corr = corr.iloc[0::2,-1]
corr.index = corr.index.get_level_values(0)

corr.plot();
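
The `iloc[0::2, -1]` slice above relies on the row layout of the grouped correlation matrix. A more explicit way to get one coefficient per year, sketched here on made-up data, is to correlate the two columns within each group directly.

```python
import pandas as pd

# Toy data (illustrative): a rising relation in 2000, a falling one in 2001.
toy = pd.DataFrame({
    'year': [2000, 2000, 2000, 2001, 2001, 2001],
    'ratio': [0.5, 0.6, 0.7, 0.5, 0.6, 0.7],
    'imdb_score': [6.0, 7.0, 8.0, 8.0, 7.0, 6.0],
})

# One Pearson coefficient per year, computed inside each group.
per_year = pd.Series({
    year: group['ratio'].corr(group['imdb_score'])
    for year, group in toy.groupby('year')
})
print(per_year)
```

Years with very few movies produce unstable or missing coefficients, which is one reason a per-year graph can look noisy.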

Genres can also change the correlation between success and the gender ratio. The interactive graph below illustrates a scatterplot based on the genre. 

In [None]:
def load_genre(genre, **kwargs):
    #filter
    if genre == 'All':
        genre_filtered = filtered
    else:
        genre_filtered = filtered[filtered.apply(lambda m: genre in m.genres, axis=1)]
    scat = hv.Scatter(genre_filtered[['ratio', 'imdb_score']])
    
    #Pearson correlation coefficient
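    # Correlate only the two numeric columns; the frame also holds text and list columns.
    r = genre_filtered['ratio'].corr(genre_filtered['imdb_score'], method='pearson')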
    
    # Create title with genre and options.
    (scat).opts(title='Genre: ' + genre + " | Pearson correlation coefficient: %.3f" % r)
    
    return scat

genres = movie_genres
genres = np.append(genres, 'All')
dmap = hv.DynamicMap(load_genre, kdims='Genre').redim.values(Genre=genres)
dmap.opts(height=500, width=500)
dmap

Notable genres:

| Genre | Coefficient|
|--|--|
| Western | 0.716 |
| Music | -0.026 |
| Musical | 0.509 |

## What happens if we cluster this dataset, leaving out the genre variable?
The goal is to see whether the groups formed by clustering make any sense when compared to genre. Let's make a subset of the dataframe, picking up all the variables except the boolean genre columns.

In [None]:
df_nogenre = df_movies[['movie_imdb_link', 'movie_title', 'genres', 'imdb_score', 'title_year',
       'director_name', 'director_facebook_likes', 'actor_1_name',
       'actor_1_facebook_likes', 'actor_2_name', 'actor_2_facebook_likes',
       'actor_3_name', 'actor_3_facebook_likes', 'cast_total_facebook_likes',
       'movie_facebook_likes', 'country', 'language',
       'num_critic_for_reviews', 'num_user_for_reviews', 'num_voted_users',
       'plot_keywords', 'color', 'content_rating', 'duration', 'aspect_ratio',
       'facenumber_in_poster', 'budget', 'gross']]
df_nogenre.shape

### Data pre-processing

See if clustering with k-Means using a few combinations of numerical values will create easy-to-understand groups. There are a lot of categorical variables in this set; these will have to be dropped because k-Means can only use numerical variables. There is a similar algorithm for categorical variables, called k-Modes. The most important categorical variables that could indicate genres or similar groups are actors and directors. Let's see if we can use k-Modes.

In [None]:
print('Unique actors on 1: ', df_nogenre['actor_1_name'].unique().size)
print('Unique actors on 2: ', df_nogenre['actor_2_name'].unique().size)
print('Unique actors on 3: ', df_nogenre['actor_3_name'].unique().size)
print('Unique directors: ', df_nogenre['director_name'].unique().size)

The important categorical variables have many unique values compared with the number of records. Therefore we will not use the k-Modes algorithm. Since we have Facebook likes for each director and actor, we can use these numerical values for the k-Means algorithm. Let's look at a sample of those likes.

In [None]:
df_nogenre.head(5)

Hmm, that is strange. A highly acclaimed director such as James Cameron has zero Facebook likes. Visiting his Facebook page shows that this is not the case. Avatar is not the only movie that James Cameron has directed.

In [None]:
df_nogenre.loc[df_nogenre['director_name'] == 'James Cameron']

All of his great works show zero Facebook likes for the director. Something probably went wrong during the retrieval of this dataset.

A first option would be to scrape the director Facebook likes again wherever director_facebook_likes == 0. But these values would have a different retrieval timestamp than the rest of the data, which harms the integrity of the dataset. Let's not do that.

Another option would be to scrape ALL of the directors facebook likes, but that is out of scope for this research.

In [None]:
df_nogenre.loc[df_nogenre['director_facebook_likes'] == 0].shape[0]

The number of rows with directors having zero Facebook likes is 745. Dropping these would also mean dropping strong movies such as Titanic. Let's leave this variable for now.

In [None]:
print(df_nogenre.loc[df_nogenre['actor_1_facebook_likes'] == 0].shape[0])
print(df_nogenre.loc[df_nogenre['actor_2_facebook_likes'] == 0].shape[0])
print(df_nogenre.loc[df_nogenre['actor_3_facebook_likes'] == 0].shape[0])

The number of rows with actors having zero Facebook likes totals around 80. We can drop these rows and keep the rest for our k-Means model.

In [None]:
print('Rows before:', df_nogenre.shape[0])
# Keep only rows where all three actors have at least one Facebook like.
# Reassigning (instead of dropping in place on a slice) avoids a SettingWithCopyWarning.
actor_like_cols = ['actor_1_facebook_likes', 'actor_2_facebook_likes', 'actor_3_facebook_likes']
df_nogenre = df_nogenre[(df_nogenre[actor_like_cols] != 0).all(axis=1)]
print('Rows left:', df_nogenre.shape[0])

The column cast_total_facebook_likes looks like the sum of the Facebook likes of all 3 actors. Let's test this.

In [None]:
df_nogenre[['actor_1_facebook_likes', 'actor_2_facebook_likes', 'actor_3_facebook_likes']].sum(1)
boolean_arr = np.where(df_nogenre[['actor_1_facebook_likes', 'actor_2_facebook_likes', 'actor_3_facebook_likes']].sum(1) == df_nogenre['cast_total_facebook_likes']
                     , 'True', 'False')
truefalse, amount = np.unique(boolean_arr, return_counts=True)
np.asarray((truefalse, amount))

Apparently this is false for most of the rows. This means the column is not simply the sum of the 3 actor columns; a redundant column like that would interfere with creating the k-Means clusters. Let's keep this column.

Since we noted that the data might be invalid, let's see how many rows have a value of 0 for each of the remaining columns.

In [None]:
print(df_nogenre.loc[df_nogenre['movie_facebook_likes'] == 0].shape[0])
print(df_nogenre.loc[df_nogenre['num_critic_for_reviews'] == 0].shape[0])
print(df_nogenre.loc[df_nogenre['num_user_for_reviews'] == 0].shape[0])
print(df_nogenre.loc[df_nogenre['num_voted_users'] == 0].shape[0])
print(df_nogenre.loc[df_nogenre['facenumber_in_poster'] == 0].shape[0])
print(df_nogenre.loc[df_nogenre['budget'] == 0].shape[0])
print(df_nogenre.loc[df_nogenre['gross'] == 0].shape[0])
print(df_nogenre.loc[df_nogenre['cast_total_facebook_likes'] == 0].shape[0])

The number of rows with movies having zero Facebook likes is 1994. This makes sense because the dataset contains movies made from 1934 to 2016, while Facebook was founded in 2004. Let's not use this column.

A facenumber_in_poster of zero makes sense, because some posters do not show a face, for example a lot of thriller movies.

Since we are going with the k-Means algorithm, let's drop all categorical variables as well, except for the identifiers (movie_imdb_link, movie_title, genres). We also skip the director_facebook_likes and movie_facebook_likes columns mentioned above because their data is invalid.

In [None]:
# .copy() gives an independent frame, so the column assignments below do not act on a slice.
df_nocat = df_nogenre[['movie_imdb_link', 'movie_title', 'genres', 'imdb_score', 'title_year', 'actor_1_facebook_likes', 'actor_2_facebook_likes',
       'actor_3_facebook_likes', 'cast_total_facebook_likes', 'num_critic_for_reviews', 'num_user_for_reviews', 
       'num_voted_users', 'duration', 'aspect_ratio', 'facenumber_in_poster', 'budget', 'gross']].copy()
df_nocat.head(5)

Because k-Means works with the mean of a cluster, a date variable would make no sense. We can transform this variable to a numerical value. Since this dataset only contains year numbers, we only need to extract the year.

In [None]:
df_nocat['title_year'] = pd.DatetimeIndex(df_nocat['title_year']).year
df_nocat['title_year']

To optimize the results of k-Means, let's use the Z-Score form of standardization (as opposed to min-max or decimal normalization) for all of the variables. We can only standardize if there are no NaN values, so we should drop them. The data is now ready to be clustered using the k-Means algorithm.

In [None]:
df_nocat = df_nocat.dropna()
# Z-score standardize every numeric column (all columns after the three identifiers).
numeric_cols = df_nocat.columns[3:]
df_nocat[numeric_cols] = (df_nocat[numeric_cols] - df_nocat[numeric_cols].mean()) / df_nocat[numeric_cols].std()
df_nocat
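
The effect of Z-score standardization can be checked on a tiny made-up frame: after the transformation every column has mean 0 and standard deviation 1, regardless of its original scale. The values below are illustrative assumptions.

```python
import pandas as pd

# Toy data (illustrative): two columns on very different scales.
toy = pd.DataFrame({'duration': [90.0, 100.0, 110.0],
                    'budget': [1e6, 2e6, 6e6]})

# Z-score: subtract the column mean, divide by the column standard deviation.
z = (toy - toy.mean()) / toy.std()
print(z)
print(z.mean(), z.std())
```

Note that pandas `std()` uses the sample standard deviation (ddof=1) by default, whereas scikit-learn's `StandardScaler` uses the population standard deviation (ddof=0), so the two give slightly different values on small samples.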

### Model building

We need to decide on a heuristic to evaluate whether the clustered groups have any connection with the genre of a movie. The total number of genres in this dataset is 28. We say that if every cluster has a genre that covers more than 60% of that cluster, the clustering made sense (success). If not, the clustering failed to form groups split by genre. In essence, we are testing how well the clustering classifies movies by using the genre variable.

We need to define a function that lets us choose the hyperparameters for the model building. We are only going to vary the number of clusters, the list of genres and the variables that are used for clustering. For every combination that passes the heuristic, the function prints the cluster, the percentage of the genre and the genre itself.

To start, let's create a function that prints the commonly found genres per cluster. We set n_cluster to 28 (the number of genres), the heuristic percentage to 60% and pick up all numeric variables.

In [None]:
n_cluster = 28
heuristic_perc = 60
chosen_genres = list(set(itertools.chain.from_iterable(df_nocat.genres.str.split('|'))))
chosen_var = df_nocat.iloc[:,3:df_nocat.shape[1]].columns

def sum_per_cluster(df, list_var, list_genres, n_clsters, heuristic_percentage):
    print('Cluster\tPercentage\tGenre')
    allMeans = KMeans(n_clusters=n_clsters, random_state=0).fit(df.loc[:,list_var])
    df['Clusters'] = allMeans.predict(df.loc[:,list_var])
    # Loop over the function's own parameter rather than the global n_cluster.
    for cluster in range(0, n_clsters):
        for genre in list_genres:
            percentage = (df.loc[df['Clusters'] == cluster].genres.str.contains(genre).sum() / df.loc[df['Clusters'] == cluster].shape[0]) * 100
            if percentage >= heuristic_percentage:
                print(cluster, '\t', np.around(percentage, decimals=2), '\t\t', genre)
                
sum_per_cluster(df_nocat, chosen_var, chosen_genres, n_cluster, heuristic_perc)
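
The heuristic itself is independent of k-Means and can be tried on hand-assigned cluster labels. In this sketch the `toy` frame and the helper `genre_share` are illustrative assumptions; only the column names `Clusters` and `genres` match the notebook.

```python
import pandas as pd

# Toy data (illustrative): cluster labels assigned by hand instead of by k-Means.
toy = pd.DataFrame({
    'Clusters': [0, 0, 0, 1, 1],
    'genres': ['Drama|Romance', 'Drama', 'Comedy', 'Action|Sci-Fi', 'Action'],
})

def genre_share(df, genre):
    """Percentage of rows per cluster whose genres string contains `genre`."""
    hits = df['genres'].str.contains(genre, regex=False)
    return hits.groupby(df['Clusters']).mean() * 100

drama = genre_share(toy, 'Drama')
action = genre_share(toy, 'Action')
print(drama)
print(action)
```

Cluster 0 passes a 60% threshold for Drama (2 of 3 rows) and cluster 1 reaches 100% for Action, but with only two rows. This is the small-cluster effect to keep in mind when reading the real results.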

There are a few things to note about these results. A few clusters contain 100% of a certain genre. This might seem ideal, but it could also be due to the small size of a cluster. We also see a lot of the genre Drama.

The result could be a bit messy because:
- There is no minimum on the size of a cluster when testing the heuristic
- We set the number of clusters to 28, because of the number of genres. But the genres may not be evenly distributed and a movie may have multiple genres.

In [None]:
list_genres = list(set(itertools.chain.from_iterable(df_nocat.genres.str.split('|'))))
genre_percentage = list()

for genre in list_genres:
    percentage = df_nocat.loc[df_nocat['genres'].str.contains(genre)].shape[0] / df_nocat.shape[0] * 100
    genre_percentage.append((genre, np.around(percentage, decimals=2)))
    
genre_percentage.sort(key=lambda x: x[1], reverse=True)

pl_genre = [i[0] for i in genre_percentage]
pl_percentage = [i[1] for i in genre_percentage]
x_pos = np.arange(len(pl_genre)) 

plt.figure(figsize=(20,10))
plt.bar(x_pos, pl_percentage,align='center')
plt.xticks(x_pos, pl_genre, rotation=20) 
plt.ylabel('Ratio record occurrences in percentage')
plt.show()

As you can see, Drama appears to be in about 50% of the dataset. The least appearing genres would be Western, Documentary and Film-Noir.

In [None]:
print(df_nocat.loc[df_nocat['genres'].str.contains('Western')].shape[0], 'movies with Western')
print(df_nocat.loc[df_nocat['genres'].str.contains('Documentary')].shape[0], 'movies with Documentary')
print(df_nocat.loc[df_nocat['genres'].str.contains('Film-Noir')].shape[0], 'movies with Film-Noir')

Film-Noir only appears on 1 row. Western and Documentary stand on their own.

We can try again with fewer clusters, combining some genres. When testing these clusters, all of the genres within an AND combination should appear in that row. Let's cap the total number of combined genres at 14. The genres are combined with either AND (lookaheads such as `(?=.*Drama)`) or OR (`|`) in regex.

In [None]:
new_genre_list = ['(?=.*Drama)(?=.*Romance)', '(?=.*Comedy)(?=.*Romance)', '(?=.*Action)(?=.*Adventure)', 'Sport', 'Crime',
                  '(?=.*Drama)(?=.*Thriller)', 'Horror|Mystery', '(?=.*Comedy)(?=.*Family)', '(?=.*Action)(?=.*Sci-Fi)'
                  , 'War|History|Western', 'Animation', 'Music|Musical', 'Documentary|Biography', '(?=.*Action)(?=.*Fantasy)']
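
How the AND and OR patterns behave can be checked with Python's `re` module, which is what pandas `str.contains` uses when `regex=True`. The sample genre strings below are made up for illustration.

```python
import re

# AND: two lookaheads that must both succeed from the start of the string.
and_pattern = '(?=.*Drama)(?=.*Romance)'
# OR: plain regex alternation.
or_pattern = 'Horror|Mystery'

# Toy genre strings (illustrative only).
samples = ['Drama|Romance', 'Romance|Comedy|Drama', 'Drama|Thriller', 'Mystery|Thriller']

and_hits = [bool(re.search(and_pattern, s)) for s in samples]
or_hits = [bool(re.search(or_pattern, s)) for s in samples]
print(and_hits)
print(or_hits)
```

The AND pattern matches regardless of the order in which the genres appear, and a row with only one of the two genres is rejected.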

We mostly combined the more frequently occurring genres, so that the distribution evens out. Let's see what the current distribution of the combined genres is.

In [None]:
genre_percentage = list()

for genre in new_genre_list:
    percentage = df_nocat.loc[df_nocat['genres'].str.contains(genre, regex=True)].shape[0] / df_nocat.shape[0] * 100
    genre_percentage.append((genre, np.around(percentage, decimals=2)))
    
genre_percentage.sort(key=lambda x: x[1], reverse=True)

pl_genre = [i[0] for i in genre_percentage]
pl_percentage = [i[1] for i in genre_percentage]
x_pos = np.arange(len(pl_genre)) 

plt.figure(figsize=(20,10))
plt.bar(x_pos, pl_percentage,align='center')
plt.xticks(x_pos, pl_genre, rotation=40) 
plt.ylabel('Ratio record occurrences in percentage')
plt.show()

The new genre list is somewhat better balanced than before.

We need to add the new heuristic (size of cluster) and test cases (combined genres) to the function. For the size of a cluster, let's pick a minimum of 20. We also lower the percentage threshold from 60 to 25 to be able to print results.

In [None]:
#Drop last Cluster
df_nocat = df_nocat.iloc[:, :-1]

n_cluster = 14
heuristic_perc = 25
chosen_genres = new_genre_list
chosen_var = df_nocat.iloc[:,3:df_nocat.shape[1]].columns

def sum_per_cluster(df, list_var, list_genres, n_clsters, heuristic_percentage, min_cluster_size=20):
    """Print every (cluster, genre) pair whose genre share reaches the heuristic."""
    print('Cluster\tPercentage\tGenre')
    allMeans = KMeans(n_clusters=n_clsters, random_state=0).fit(df.loc[:, list_var])
    df['Clusters'] = allMeans.predict(df.loc[:, list_var])
    # Iterate over the n_clsters argument instead of the global n_cluster
    for cluster in range(n_clsters):
        cluster_genres = df.loc[df['Clusters'] == cluster, 'genres']
        # Skip clusters that are too small to judge
        if cluster_genres.shape[0] < min_cluster_size:
            continue
        for genre in list_genres:
            percentage = (cluster_genres.str.contains(genre, regex=True).sum() / cluster_genres.shape[0]) * 100
            if percentage >= heuristic_percentage:
                print(cluster, '\t', np.around(percentage, decimals=2), '\t\t', genre)
                
sum_per_cluster(df_nocat, chosen_var, chosen_genres, n_cluster, heuristic_perc)
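
The two heuristics (minimum cluster size and minimum genre share) can also be illustrated in isolation. The sketch below is pure Python and does not run KMeans. The cluster labels, genre strings, thresholds, and the helper name `genre_share_per_cluster` are all made up for illustration.

```python
import re
from collections import defaultdict

def genre_share_per_cluster(labels, genres, patterns, min_size, min_percentage):
    """Return (cluster, pattern, percentage) for every pattern that reaches
    min_percentage inside a cluster with at least min_size members."""
    members = defaultdict(list)
    for label, genre in zip(labels, genres):
        members[label].append(genre)

    results = []
    for cluster in sorted(members):
        rows = members[cluster]
        if len(rows) < min_size:
            continue  # cluster too small to say anything meaningful
        for pattern in patterns:
            hits = sum(1 for row in rows if re.search(pattern, row))
            percentage = hits / len(rows) * 100
            if percentage >= min_percentage:
                results.append((cluster, pattern, round(percentage, 2)))
    return results

# Toy data: three clusters, the last one has a single member.
labels = [0, 0, 0, 0, 1, 1, 1, 1, 2]
genres = ['Drama|Romance', 'Drama|Romance', 'Comedy', 'Crime',
          'Horror', 'Mystery', 'Horror', 'Crime', 'Animation']
patterns = ['(?=.*Drama)(?=.*Romance)', 'Horror|Mystery', 'Animation']

report = genre_share_per_cluster(labels, genres, patterns, min_size=2, min_percentage=60)
print(report)
```

In this toy data, cluster 2 is skipped because it has only one member, even though it is 100% Animation. That is the situation the minimum-size heuristic is meant to filter out.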

Hmm, it looks like the results are even worse when combining genres, because the percentages are a lot lower. What if we pick different variables? Let's first keep all the social media variables and remove the properties of the movie itself, and then do it the other way around.

In [None]:
chosen_var = ['imdb_score', 'actor_1_facebook_likes',
       'actor_2_facebook_likes', 'actor_3_facebook_likes',
       'cast_total_facebook_likes', 'num_critic_for_reviews',
       'num_user_for_reviews', 'num_voted_users']

sum_per_cluster(df_nocat, chosen_var, chosen_genres, n_cluster, heuristic_perc)

In [None]:
chosen_var = ['title_year', 'duration', 'aspect_ratio',
       'facenumber_in_poster', 'budget', 'gross']

sum_per_cluster(df_nocat, chosen_var, chosen_genres, n_cluster, heuristic_perc)

### Conclusion

In the latter model we see somewhat more distinct genres. However, the percentages are still not high enough to say that KMeans can cluster this set into different groups of genres. To get a better picture, we can visualize this by plotting the three largest clusters and their genre distribution.

In [None]:
df_nocat.Clusters.value_counts()

In [None]:
df8 = df_nocat.loc[df_nocat['Clusters'] == 8]
df12 = df_nocat.loc[df_nocat['Clusters'] == 12]
df3 = df_nocat.loc[df_nocat['Clusters'] == 3]

genre_percentage8 = list()
genre_percentage12 = list()
genre_percentage3 = list()

for genre in new_genre_list:
    percentage8 = df8.loc[df8['genres'].str.contains(genre, regex=True)].shape[0] / df8.shape[0] * 100
    percentage12 = df12.loc[df12['genres'].str.contains(genre, regex=True)].shape[0] / df12.shape[0] * 100
    percentage3 = df3.loc[df3['genres'].str.contains(genre, regex=True)].shape[0] / df3.shape[0] * 100
    genre_percentage8.append((genre, np.around(percentage8, decimals=2)))
    genre_percentage12.append((genre, np.around(percentage12, decimals=2)))
    genre_percentage3.append((genre, np.around(percentage3, decimals=2)))

barwidth = 0.25    

pl_genre8 = [i[0] for i in genre_percentage8]
pl_percentage8 = [i[1] for i in genre_percentage8]
pl_percentage12 = [i[1] for i in genre_percentage12]
pl_percentage3 = [i[1] for i in genre_percentage3]
x_pos8 = np.arange(len(pl_genre8))
x_pos12 = [x + barwidth for x in x_pos8]
x_pos3 = [x + barwidth for x in x_pos12]

plt.figure(figsize=(20,10))
plt.bar(x_pos8, pl_percentage8, edgecolor='white', width=barwidth)
plt.bar(x_pos12, pl_percentage12,edgecolor='white', width=barwidth)
plt.bar(x_pos3, pl_percentage3, edgecolor='white', width=barwidth)
plt.xticks([r + barwidth for r in range(len(pl_genre8))], pl_genre8, rotation=40)
plt.ylabel('Record occurrences (%)')
plt.ylim(0, 70)
plt.legend(['8', '12', '3'], title='Clusters')
plt.hlines(60, xmin=0, xmax=14, linestyle='dashed', colors='red')
plt.show()

There isn't really a clear distinction between the large clusters themselves, except maybe for the green one, which stands out on War|History|Western and Documentary|Biography. **But none of the distributions are close to the 60% heuristic**. Therefore it is safe to say that clustering this dataset with KMeans **cannot** result in groups of genres.


## To what extend can you predict the gross of a movie based on its popularity on Facebook and IMDB?

First we determine how the gross is affected by all the factors; then we look for the best (most accurate) formula.
The factors we want to check are:
1. number of critic reviews
2. number of user reviews
3. movie Facebook likes
4. IMDB score


In [None]:
from pandas.plotting import scatter_matrix
plt_scatter = scatter_matrix(df_movies[['gross', 'num_critic_for_reviews', 'num_user_for_reviews', 'imdb_score', 'movie_facebook_likes']], alpha=0.2, figsize=(9,9), diagonal='kde').view()
plt.show()

In [None]:
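# Treat infinite values as missing and drop rows that lack any of the model columns
model_cols = ['num_critic_for_reviews', 'num_user_for_reviews', 'movie_facebook_likes', 'imdb_score', 'gross']
df_movies = df_movies.replace([np.inf, -np.inf], np.nan).dropna(subset=model_cols, how='any')
df_movies = df_movies.reset_index(drop=True)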

In [None]:
# create a Python list of feature names
feature_cols = ['num_critic_for_reviews','num_user_for_reviews', 'movie_facebook_likes', 'imdb_score',]

# use the list to select a subset of the original DataFrame
X = df_movies[feature_cols]


# print the first 5 rows
X.head()

In [None]:
# select a Series from the DataFrame
y = df_movies['gross']

# print the first 5 values
y.head()
 

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)


Next we fit a linear regression model on all the selected variables.

In [None]:
# import model
from sklearn.linear_model import LinearRegression

# instantiate
linreg = LinearRegression()

# fit the model to the training data (learn the coefficients)
linreg.fit(X_train, y_train)


Next we'll look at the metrics of our formula to see how accurate it is. We do this by looking at the intercept and
the coefficients, and by calculating the root mean squared error on the test set.
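
As a reminder of what the two reported numbers mean, the sketch below computes the root mean squared error and the coefficient of determination by hand in pure Python. The helper name `rmse_and_r2` and the toy values are made up. It assumes the usual definition of the coefficient of determination, one minus the residual sum of squares divided by the total sum of squares.

```python
import math

def rmse_and_r2(y_true, y_pred):
    """Return (RMSE, R^2) for two equally long lists of numbers."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    # Residual sum of squares: how far the predictions are from the truth
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: how far the truth is from its own mean
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    rmse = math.sqrt(ss_res / n)
    r2 = 1 - ss_res / ss_tot
    return rmse, r2

# Toy values, not taken from the movie dataset
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 37.0]

rmse, r2 = rmse_and_r2(y_true, y_pred)
print('RMSE: %.2f' % rmse)
print('R^2: %.3f' % r2)
```

The RMSE is expressed in the units of the target, so for a gross measured in dollars a large absolute value is to be expected. The coefficient of determination is unitless, which makes it easier to compare between models.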


In [None]:

print(linreg.intercept_)
print(linreg.coef_)
list(zip(feature_cols, linreg.coef_))


In [None]:

from sklearn.metrics import r2_score

# make predictions on the testing set
y_pred = linreg.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
# The coefficients
print('Coefficients: \n', linreg.coef_)
# The mean squared error
print('Mean squared error: %.2f'
      % mse)
# np.sqrt replaces the np.math.sqrt alias, which newer NumPy versions no longer provide
print('Root mean squared error: %.2f'
      % np.sqrt(mse))
# The coefficient of determination: 1 is perfect prediction
print('Coefficient of determination: %.2f'
      % r2_score(y_test, y_pred))


This is not a good result: a coefficient of determination close to 1 is ideal, and the root mean squared error is extremely high. We'll drop the columns that seem to have the least correlation according to the scatter plot.
We start by dropping the number of critic reviews, as it seems to have the least correlation.

In [None]:
# create a Python list of feature names
feature_cols_2 = ['movie_facebook_likes', 'imdb_score','num_user_for_reviews']

# use the list to select a subset of the original DataFrame
X_2 = df_movies[feature_cols_2]

from sklearn.model_selection import train_test_split
X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(X_2, y, random_state=0)

from sklearn.linear_model import LinearRegression

# instantiate
linreg_2 = LinearRegression()

# fit the model to the training data (learn the coefficients)
linreg_2.fit(X_train_2, y_train_2)

print(linreg_2.intercept_)
print(linreg_2.coef_)
list(zip(feature_cols_2, linreg_2.coef_))


from sklearn.metrics import r2_score

# make predictions on the testing set
y_pred_2 = linreg_2.predict(X_test_2)
mse = mean_squared_error(y_test_2, y_pred_2)
# The coefficients
print('Coefficients: \n', linreg_2.coef_)
# The mean squared error
print('Mean squared error: %.2f'
      % mse)
# np.sqrt replaces the np.math.sqrt alias, which newer NumPy versions no longer provide
print('Root mean squared error: %.2f'
      % np.sqrt(mse))
# The coefficient of determination: 1 is perfect prediction
print('Coefficient of determination: %.2f'
      % r2_score(y_test_2, y_pred_2))

This is not a good result: we want an error that is not very far from 0, and it even went up slightly. Let's drop even more columns; this time we'll remove the number of user reviews and look again.

In [None]:
# create a Python list of feature names
feature_cols_3 = ['movie_facebook_likes', 'imdb_score',]

# use the list to select a subset of the original DataFrame
X_3 = df_movies[feature_cols_3]

from sklearn.model_selection import train_test_split
X_train_3, X_test_3, y_train_3, y_test_3 = train_test_split(X_3, y, random_state=0)

from sklearn.linear_model import LinearRegression

# instantiate
linreg_3 = LinearRegression()

# fit the model to the training data (learn the coefficients)
linreg_3.fit(X_train_3, y_train_3)

print(linreg_3.intercept_)
print(linreg_3.coef_)
list(zip(feature_cols_3, linreg_3.coef_))

from sklearn.metrics import r2_score

# make predictions on the testing set
y_pred_3 = linreg_3.predict(X_test_3)
mse = mean_squared_error(y_test_3, y_pred_3)
# The coefficients
print('Coefficients: \n', linreg_3.coef_)
# The mean squared error
print('Mean squared error: %.2f'
      % mse)
# np.sqrt replaces the np.math.sqrt alias, which newer NumPy versions no longer provide
print('Root mean squared error: %.2f'
      % np.sqrt(mse))
# The coefficient of determination: 1 is perfect prediction
print('Coefficient of determination: %.2f'
      % r2_score(y_test_3, y_pred_3))

It seems dropping columns is not the answer: our first prediction is still the best, but it definitely has a large margin of error.
The correlation appears to be weak at best, and removing columns does not solve the problem. Popularity on IMDB and Facebook seems to be
a weak indicator of how well a movie will do financially.
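
One way to put a number on "weak at best" is the Pearson correlation coefficient between a feature and the gross. The sketch below computes it by hand in pure Python. The helper name `pearson` and the toy values are made up and are not taken from the movie dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy values: one perfectly linear target and one that barely follows the feature
likes = [1.0, 2.0, 3.0, 4.0, 5.0]
gross_linear = [2.0, 4.0, 6.0, 8.0, 10.0]
gross_noisy = [5.0, 1.0, 4.0, 2.0, 3.0]

r_linear = pearson(likes, gross_linear)
r_noisy = pearson(likes, gross_noisy)
print(r_linear, r_noisy)
```

A coefficient near 1 or -1 indicates a strong linear relation, and one near 0 a weak relation. On the notebook's data, pandas' `DataFrame.corr()` computes the same coefficient for every pair of numeric columns.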


# Z-test IMDB

A film critic claims that the score of English-language films is lower than average.

Use the dataset to investigate whether this film critic is right. Take a sample (with ```pandas.DataFrame.sample(n=100,random_state=1)```) of 100 English-language films and treat the entire dataset as the population. Use a confidence level of 90%. Use only the records in the dataset for which both the language (`language`) and the score (`imdb_score`) are known.



In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as st

In [None]:
movies = pd.read_csv('movie.csv') 

In [None]:
all_movies = movies[movies.imdb_score.notnull()]
all_movies = all_movies[all_movies.language.notnull()]

In [None]:
movies_english = all_movies[all_movies.language == 'English']

In [None]:
sample = movies_english.sample(n=100,random_state=1)

In [None]:
sample.boxplot(column="imdb_score")
plt.show()
all_movies.boxplot(column="imdb_score")
plt.show()

In [None]:
sample.imdb_score.mean()

In [None]:
all_movies.imdb_score.mean()

In [None]:
stdev_en = st.tstd(sample["imdb_score"])

print(stdev_en)

To determine whether our findings are significant we have to do a Z-test. We will set out our alternative hypothesis and null hypothesis and test the latter.

These are as follows:

H0: English films score as well as or better than other movies on IMDB. μother <= μenglish = 6.35

H1: English films score significantly worse than other movies on IMDB. μother > μenglish = 6.35

In [None]:
n = 100
# critical z value used for the one-sided test at 90% confidence
z_alpha = 1.29
mean_english_score = sample.imdb_score.mean()
mean_score = all_movies.imdb_score.mean()

# standard error of the sample mean
se = stdev_en / (np.sqrt(n))

# test statistic: how many standard errors the sample mean lies below the population mean
z = (mean_score - mean_english_score) / se
print(z)

The z value we found from the calculation is 0.68, which is lower than the critical value of 1.29.
We can therefore not reject the null hypothesis, so the alternative hypothesis is not supported either.
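
The decision rule can be sketched with Python's standard `statistics.NormalDist`, which gives the critical value and a p-value without a lookup table. The helper name `one_sided_z_test` and the input numbers are made up for illustration. The sign convention follows the calculation above: population mean minus sample mean, with H0 rejected when z exceeds the critical value.

```python
from statistics import NormalDist

def one_sided_z_test(pop_mean, sample_mean, sample_std, n, confidence=0.90):
    """Test whether the sample mean lies significantly below the population mean."""
    se = sample_std / n ** 0.5                 # standard error of the sample mean
    z = (pop_mean - sample_mean) / se          # test statistic
    z_crit = NormalDist().inv_cdf(confidence)  # one-sided critical value
    p_value = 1 - NormalDist().cdf(z)          # chance of a z at least this large under H0
    return z, z_crit, p_value, z > z_crit

# Made-up numbers, not the notebook's results
z, z_crit, p_value, reject_h0 = one_sided_z_test(
    pop_mean=6.5, sample_mean=6.4, sample_std=1.0, n=100)
print('z = %.2f, critical value = %.2f, p = %.3f, reject H0: %s'
      % (z, z_crit, p_value, reject_h0))
```

For a confidence level of 90% the standard normal quantile is roughly 1.28, close to the 1.29 used above, so the conclusion for a z value of 0.68 is the same either way.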
