# **Actor clustering**
*Artem Panin (August 2018)*

About a week ago, the movie Hotel Artemis was released in Russian cinemas. My colleagues described it as a "super-cool premiere", but I was less optimistic because of its mediocre director and mostly unknown cast (IMHO, of course). When I said so, I heard a lot of opinions that differed from mine.

In this notebook I explore the space of actors and split it into two clusters: "popular actors" and "no-name actors". And of course I try to show that most of the Hotel Artemis cast are "no-name actors".

For this research I used the TMDB movie dataset from Kaggle, which contains data on approximately 5000 movies: https://www.kaggle.com/tmdb/tmdb-movie-metadata

## 1. Preprocessing

I used functions taken from [Sohier's code](https://www.kaggle.com/sohier/getting-imdb-kernels-working-with-tmdb-data) to convert the TMDB data to the structure of the original IMDB dataset:

In [None]:
import numpy as np 
import pandas as pd 

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

In [None]:
import scipy
import numpy as np
import matplotlib 
import matplotlib.pyplot as plt
import pandas as pd
import sklearn
import seaborn as sns
import warnings

sns.set_style("whitegrid")
%matplotlib inline

In [None]:
import plotly.offline as pyo
pyo.init_notebook_mode()
from plotly.graph_objs import *
import plotly.graph_objs as go

In [None]:
import json
#__________________
def load_tmdb_movies(path):
    df = pd.read_csv(path)
    df['release_date'] = pd.to_datetime(df['release_date']).apply(lambda x: x.date())
    json_columns = ['genres', 'keywords', 'production_countries', 'production_companies', 'spoken_languages']
    for column in json_columns:
        df[column] = df[column].apply(json.loads)
    return df
#____________________________
def load_tmdb_credits(path):
    df = pd.read_csv(path)
    json_columns = ['cast', 'crew']
    for column in json_columns:
        df[column] = df[column].apply(json.loads)
    return df
#_______________________________________
def safe_access(container, index_values):
    result = container
    try:
        for idx in index_values:
            result = result[idx]
        return result
    except (IndexError, KeyError):
        # A tuple is needed to catch both; `IndexError or KeyError` evaluates to IndexError only.
        return np.nan
#_______________________________________
LOST_COLUMNS = [
    'actor_1_facebook_likes',
    'actor_2_facebook_likes',
    'actor_3_facebook_likes',
    'aspect_ratio',
    'cast_total_facebook_likes',
    'color',
    'content_rating',
    'director_facebook_likes',
    'facenumber_in_poster',
    'movie_facebook_likes',
    'movie_imdb_link',
    'num_critic_for_reviews',
    'num_user_for_reviews']
#_______________________________________
TMDB_TO_IMDB_SIMPLE_EQUIVALENCIES = {
    'budget': 'budget',
    'genres': 'genres',
    'revenue': 'gross',
    'title': 'movie_title',
    'runtime': 'duration',
    'original_language': 'language',  
    'keywords': 'plot_keywords',
    'vote_count': 'num_voted_users'}
#_______________________________________     
IMDB_COLUMNS_TO_REMAP = {'imdb_score': 'vote_average'}
#_______________________________________
def get_director(crew_data):
    directors = [x['name'] for x in crew_data if x['job'] == 'Director']
    return safe_access(directors, [0])
#_______________________________________
def pipe_flatten_names(keywords):
    return '|'.join([x['name'] for x in keywords])
#_______________________________________
def convert_to_original_format(movies, credits):
    tmdb_movies = movies.copy()
    tmdb_movies.rename(columns=TMDB_TO_IMDB_SIMPLE_EQUIVALENCIES, inplace=True)
    tmdb_movies['title_year'] = pd.to_datetime(tmdb_movies['release_date']).apply(lambda x: x.year)
    tmdb_movies['country'] = tmdb_movies['production_countries'].apply(lambda x: safe_access(x, [0, 'name']))
    tmdb_movies['language'] = tmdb_movies['spoken_languages'].apply(lambda x: safe_access(x, [0, 'name']))
    tmdb_movies['director_name'] = credits['crew'].apply(get_director)
    # NOTE: list indices start at 0, so indices 1-5 skip the first cast entry (index 0).
    tmdb_movies['actor_1_name'] = credits['cast'].apply(lambda x: safe_access(x, [1, 'name']))
    tmdb_movies['actor_2_name'] = credits['cast'].apply(lambda x: safe_access(x, [2, 'name']))
    tmdb_movies['actor_3_name'] = credits['cast'].apply(lambda x: safe_access(x, [3, 'name']))
    tmdb_movies['actor_4_name'] = credits['cast'].apply(lambda x: safe_access(x, [4, 'name']))
    tmdb_movies['actor_5_name'] = credits['cast'].apply(lambda x: safe_access(x, [5, 'name']))
    tmdb_movies['genres'] = tmdb_movies['genres'].apply(pipe_flatten_names)
    tmdb_movies['plot_keywords'] = tmdb_movies['plot_keywords'].apply(pipe_flatten_names)
    return tmdb_movies
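
To make the nested lookup used above more concrete, here is a minimal sketch on a toy `cast` value. The helper name `safe_get` and the sample JSON are illustrative assumptions, not part of the dataset or of Sohier's kernel. The exceptions are listed as a tuple, because an expression such as `IndexError or KeyError` evaluates to just `IndexError`.

```python
import json
import math

def safe_get(container, keys):
    """Follow a chain of indices/keys; return NaN if any step is missing."""
    result = container
    try:
        for key in keys:
            result = result[key]
        return result
    except (IndexError, KeyError):
        return float("nan")

# Toy stand-in for one row of the JSON-encoded `cast` column (made-up names).
raw_cast = '[{"name": "Actor A", "order": 0}, {"name": "Actor B", "order": 1}]'
cast = json.loads(raw_cast)

first_name = safe_get(cast, [0, "name"])      # list index, then dict key
missing_entry = safe_get(cast, [5, "name"])   # IndexError -> NaN
missing_key = safe_get(cast, [0, "gender"])   # KeyError -> NaN
print(first_name, missing_entry, missing_key)
```

Returning NaN instead of raising lets the lookup be applied to every row, including movies with short or empty cast lists.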

Let's load the data and convert it to the original structure:

In [None]:
credits = load_tmdb_credits("../input/tmdb_5000_credits.csv")
movies = load_tmdb_movies("../input/tmdb_5000_movies.csv")
movie = convert_to_original_format(movies, credits)

In [None]:
movie.head()

Summary Statistics

In [None]:
movie.describe()

Correlations Between Attributes

In [None]:
corr = movie.select_dtypes(include = ['float64', 'int64']).iloc[:, 1:].corr()
plt.figure(figsize=(16, 16))
sns.heatmap(corr, vmax=1, square=True)
plt.show()

## Aggregate IMDB score and gross per actor

I decided to use the top 5 billed actors from each movie. I think that is enough to measure an actor's popularity reasonably well.

The first metric is the product of a movie's IMDB score (`vote_average`) and the number of users who voted (`num_voted_users`), summed over all of an actor's movies. In my opinion, a sum over an actor's movies works better here than a mean. And of course it reflects the quality of the actor's films.

The second metric is the aggregated gross of an actor's movies. It reflects the popularity of the actor and their movies, which does not always correlate with quality.
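
As a small worked example of the two metrics, here is a sketch on made-up data. The actor names, the numbers, and the column name `weighted_score` are assumptions for illustration only. Each movie is first expanded to one row per billed actor, then both metrics are summed per actor.

```python
import pandas as pd

# Toy data with made-up numbers: three movies, two billed actors each.
toy = pd.DataFrame({
    "actor_1_name": ["Actor A", "Actor A", "Actor B"],
    "actor_2_name": ["Actor B", "Actor C", "Actor C"],
    "vote_average": [8.0, 6.0, 5.0],
    "num_voted_users": [1000, 500, 100],
    "gross": [300.0, 100.0, 20.0],
})

# One row per (movie, actor) pair.
long_df = toy.melt(
    id_vars=["vote_average", "num_voted_users", "gross"],
    value_vars=["actor_1_name", "actor_2_name"],
    value_name="actor_name",
)
long_df["weighted_score"] = long_df["vote_average"] * long_df["num_voted_users"]

# Sum both metrics over each actor's movies.
metrics = long_df.groupby("actor_name")[["weighted_score", "gross"]].sum()
print(metrics)
```

Because both metrics are sums, an actor with many movies scores higher than an actor with a single movie of the same rating.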

In [None]:
actor = movie[['actor_1_name', 'actor_2_name', 'actor_3_name', 'actor_4_name', 'actor_5_name', 'gross', 'vote_average', 'num_voted_users', 'popularity']]
actor.head()

In [None]:
actor_list = pd.melt(actor, id_vars=['vote_average', 'num_voted_users'], value_vars=['actor_1_name', 'actor_2_name', 'actor_3_name', 'actor_4_name', 'actor_5_name'],
                    var_name='variable', value_name='actor_name')
actor_list.head()

In [None]:
actor_score = (actor_list['vote_average'] * actor_list['num_voted_users']).groupby(actor_list['actor_name']).sum()

In [None]:
actor_list_gross = pd.melt(actor, id_vars=['gross'], value_vars=['actor_1_name', 'actor_2_name', 'actor_3_name', 'actor_4_name', 'actor_5_name'],
                    var_name='variable', value_name='actor_name')
actor_score_gross= actor_list_gross['gross'].groupby(actor_list_gross['actor_name']).sum()

## Measuring actor quality

Let's make scatter plots and define clusters of "popular" actors and "no-name" actors.

In the graph below you can see the actors from Hotel Artemis (blue markers) and all other actors (white markers).

The X-axis corresponds to the gross metric, and the Y-axis to the IMDB score metric.

So the larger these metrics are, the more popular the actor is.

The Hotel Artemis actors are: 'Kenneth Choi', 'Sterling K. Brown', 'Jeff Goldblum', 'Zachary Quinto', 'Charlie Day', 'Dave Bautista', 'Sofia Boutella', 'Brian Tyree Henry'
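
Not every name in such a list necessarily appears in the dataset. The following is a minimal sketch, on made-up data, of one way to split a list of names into those present in an actor-indexed table and those missing. The table `toy_scores` and the names in `wanted` are hypothetical.

```python
import pandas as pd

# Toy per-actor table; names and numbers are made up.
toy_scores = pd.DataFrame(
    {"vote_average": [11000.0, 8500.0, 3500.0], "gross": [400.0, 320.0, 120.0]},
    index=["Actor A", "Actor B", "Actor C"],
)
wanted = ["Actor B", "Actor D", "Actor A"]

# isin() builds a boolean mask, so names missing from the index are
# skipped instead of raising a KeyError.
mask = toy_scores.index.isin(wanted)
present = toy_scores[mask]
absent = [name for name in wanted if name not in toy_scores.index]

print(present)
print("not in dataset:", absent)
```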

In [None]:
df = pd.concat([actor_score, actor_score_gross], axis=1)
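# The nested list creates one-level MultiIndex columns, which is why later cells
# read single values with `.values[0]`.
df.columns = [['vote_average', 'gross']]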

In [None]:
warnings.filterwarnings("ignore")

artemis_actors = ['Kenneth Choi', 'Sterling K. Brown', 'Jeff Goldblum', 'Zachary Quinto', 
               'Charlie Day', 'Dave Bautista', 'Sofia Boutella', 'Brian Tyree Henry']

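# Put the Hotel Artemis actors last so their markers are drawn on top.
# A boolean mask skips actors that are missing from the dataset;
# `.loc` with missing labels raises a KeyError in recent pandas versions.
is_artemis = df.index.isin(artemis_actors)
df = pd.concat([df[~is_artemis], df[is_artemis]]).dropna()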

In [None]:
def quality_graph(df):
    # One marker per actor; Hotel Artemis actors get a different color value.
    node_trace = Scatter(
        x=[],
        y=[],
        text=[],
        mode='markers',
        hoverinfo='text',
        marker=Marker(
            colorscale='YlGnBu',
            reversescale=True,
            color=[],
            size=10,
            line=dict(width=2)))
    
    for ind, col in df.iterrows():
        node_trace['x'] += (col['gross'].values[0], )
        node_trace['y'] += (col['vote_average'].values[0], )
        node_trace['text'] += (ind,)
        if ind in artemis_actors:
            node_trace['marker']['color'] += (10, )
        else:
            node_trace['marker']['color'] += (1, )
        
    fig = Figure(data=Data([node_trace]),
                 layout=Layout(
                    title='<br>Quality of actors',
                    titlefont=dict(size=16),
                    showlegend=False,
                    hovermode='closest',
                    margin=dict(b=20,l=5,r=5,t=40),
                    annotations=[ dict(
                        showarrow=False,
                        xref="paper", yref="paper",
                        x=0.005, y=-0.002 ) ],
                    xaxis=XAxis(title='Sum Gross', showgrid=True, zeroline=False, showticklabels=True),
                    yaxis=YAxis(title='IMDB score x Users count', showgrid=True, zeroline=False, showticklabels=True)))    
    return fig

In [None]:
warnings.filterwarnings("ignore")

fig = quality_graph(df)
pyo.iplot(fig)

Let's take a closer look at the actors from Hotel Artemis:

In [None]:
def quality_graph_artemis_actors(df):
    # Same plot as above, restricted to the Hotel Artemis actors.
    node_trace = Scatter(
        x=[],
        y=[],
        text=[],
        mode='markers',
        hoverinfo='text',
        marker=Marker(
            colorscale='YlGnBu',
            reversescale=True,
            color=[],
            size=10,
            line=dict(width=2)))
    
    for ind, col in df.iterrows():
        if ind in artemis_actors:
            node_trace['x'] += (col['gross'].values[0], )
            node_trace['y'] += (col['vote_average'].values[0], )
            node_trace['text'] += (ind,)
            node_trace['marker']['color'] += (10, )
                
        
    fig = Figure(data=Data([node_trace]),
                 layout=Layout(
                    title='<br>Quality of actors in Hotel Artemis',
                    titlefont=dict(size=16),
                    showlegend=False,
                    hovermode='closest',
                    margin=dict(b=20,l=5,r=5,t=40),
                    annotations=[ dict(
                        showarrow=False,
                        xref="paper", yref="paper",
                        x=0.005, y=-0.002 ) ],
                    xaxis=XAxis(title='Sum Gross', showgrid=True, zeroline=False, showticklabels=True),
                    yaxis=YAxis(title='IMDB score x Users count', showgrid=True, zeroline=False, showticklabels=True)))
    
    return fig

In [None]:
warnings.filterwarnings("ignore")

fig = quality_graph_artemis_actors(df)
pyo.iplot(fig)

Let's see the clusters of "popular" actors and "no-names":

I used the K-means algorithm with 2 clusters: https://en.wikipedia.org/wiki/K-means_clustering

Red points are the actors from Hotel Artemis.
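
K-means numbers its clusters arbitrarily, so the labels have to be put in a fixed order before names like "NoNames" and "Popular Actors" can be attached to them. Here is a minimal sketch on synthetic points (the data, the seed, and the variable names are assumptions for illustration) that reorders the clusters by the first coordinate of their centers.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic 2-D points: a large low-valued group and a small high-valued group.
low = rng.normal(loc=[1.0, 1.0], scale=0.2, size=(50, 2))
high = rng.normal(loc=[10.0, 10.0], scale=0.5, size=(5, 2))
points = np.vstack([low, high])

model = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0).fit(points)

# Reorder clusters by the first coordinate of each center:
# new label 0 = lower center, new label 1 = higher center.
order = np.argsort(model.cluster_centers_[:, 0])
centers = model.cluster_centers_[order]
rank = np.empty_like(order)
rank[order] = np.arange(len(order))
labels = rank[model.labels_]

print(centers)
print(np.bincount(labels))
```

Reordering whole rows of `cluster_centers_` keeps the two coordinates of each center together, which a column-wise sort does not guarantee.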

In [None]:
from sklearn.metrics.pairwise import pairwise_distances_argmin
from sklearn.cluster import KMeans

k_means = KMeans(init='k-means++', n_clusters=2, n_init=10)
k_means.fit(df)
colors = ['#4EACC5', '#FF9C34']
titles = ['NoNames', 'Popular Actors']


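# Reorder the centers by their first coordinate so that cluster 0 is "NoNames"
# and cluster 1 is "Popular Actors". Reordering whole rows keeps each center's
# coordinates together, unlike a column-wise np.sort(..., axis=0).
center_order = np.argsort(k_means.cluster_centers_[:, 0])
k_means_cluster_centers = k_means.cluster_centers_[center_order]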
k_means_labels = pairwise_distances_argmin(df, k_means_cluster_centers)

for k, col, title in zip(range(len(k_means.cluster_centers_)), colors, titles):
    my_members = k_means_labels == k
    cluster_center = k_means_cluster_centers[k]
    kmeans1 = go.Scatter(x=df.iloc[my_members, 1], y=df.iloc[my_members, 0],
                        showlegend=False, 
                         text = df.index[my_members],
                         mode='markers',         
                         marker=Marker(
                                    colorscale='YlGnBu',
                                    size=5,
                                    line=dict(width=2)))
    kmeans2 = go.Scatter(x=[cluster_center[1]], y=[cluster_center[0]],
                         showlegend=False,
                         mode='markers', marker=dict(color=col, size=14,
                                                    line=dict(color='black',
                                                              width=1)))
    
    actors_ = list(set(artemis_actors).intersection(df.index[my_members]))
    print(actors_, ' are ', title)
    actors =  go.Scatter(x=[x[0] for x in df.loc[actors_, 'gross'].values],
                          y=[x[0] for x in df.loc[actors_, 'vote_average'].values],
                         showlegend=False, 
                         text = actors_,
                         mode='markers', 
                         marker=dict(color='red', size=14))
    layout = dict(title = title)
    pyo.iplot(dict(data=[kmeans1, kmeans2, actors], layout=layout))

### Conclusion

This approach separates the actors into 2 clusters: "no-names" and "popular".

There are 2 popular actors in Hotel Artemis: Jeff Goldblum and Zachary Quinto.

And there are a lot of no-names: 'Kenneth Choi', 'Sterling K. Brown', 'Charlie Day', 'Dave Bautista', 'Sofia Boutella', 'Brian Tyree Henry'. To be honest, most of them do not appear in the dataset at all.