# Zombie Adjanceny Score
#### The new z-score for the zombie apocolypse

The idea behind the Zombie Ajanceny Score (z-score) is to assume that actors in Zombie movies become infected as Zombies and see how how the virus would spread throughout Hollywood using the IMDB database and NetworkX.  The virus is assumed to spread to other actors who share a movie with an infected actor.  Those closest to the most original infected actors and hence most likely to be infected would have a high Zombie Adjency Score whereas those furthest away and least likely to be infected would have a lower score. 

In [1]:
#imports
import pandas as pd
import numpy as np
import time
import networkx as nx
pd.options.mode.chained_assignment = None  # default='warn'

### Prepping the data
Here I import various tables from the IMDB database.  The tables can be found [here](https://datasets.imdbws.com/) and descriptions of the metadata can be found [here](https://www.imdb.com/interfaces/).  I then use the .head() command to take a look at all the tables.

In [2]:
#Load Principals, Titles, Names, Attributes
principals = pd.read_csv('principals.tsv' , sep='\t')
titles = pd.read_csv('title.basics.tsv' , sep='\t', low_memory=False) 
names = pd.read_csv('name_basics.tsv' , sep='\t') 
ratings = pd.read_csv('title.ratings.tsv' , sep='\t')
principals.head()

Unnamed: 0,tconst,ordering,nconst,category,job,characters
0,tt0000001,1,nm1588970,self,\N,"[""Herself""]"
1,tt0000001,2,nm0005690,director,\N,\N
2,tt0000001,3,nm0374658,cinematographer,director of photography,\N
3,tt0000002,1,nm0721526,director,\N,\N
4,tt0000002,2,nm1335271,composer,\N,\N


In [3]:
titles.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short"
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892,\N,4,"Animation,Comedy,Romance"
3,tt0000004,short,Un bon bock,Un bon bock,0,1892,\N,\N,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,"Comedy,Short"


In [4]:
names.head()

Unnamed: 0,nconst,primaryName,birthYear,deathYear,primaryProfession,knownForTitles
0,nm0000001,Fred Astaire,1899,1987,"soundtrack,actor,miscellaneous","tt0053137,tt0043044,tt0072308,tt0050419"
1,nm0000002,Lauren Bacall,1924,2014,"actress,soundtrack","tt0037382,tt0038355,tt0117057,tt0071877"
2,nm0000003,Brigitte Bardot,1934,\N,"actress,soundtrack,producer","tt0054452,tt0049189,tt0059956,tt0057345"
3,nm0000004,John Belushi,1949,1982,"actor,writer,soundtrack","tt0080455,tt0078723,tt0072562,tt0077975"
4,nm0000005,Ingmar Bergman,1918,2007,"writer,director,actor","tt0083922,tt0069467,tt0050976,tt0050986"


In [5]:
ratings.head()

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,1527
1,tt0000002,6.2,185
2,tt0000003,6.5,1174
3,tt0000004,6.3,114
4,tt0000005,6.1,1889


Next, I engage in some paring down of the dataset.  Principals has almost 35 million rows.  There are also some columns that are not needed for the analysis which can be dropped.  The following are the changes made to the dataset:
    1. Narrow down the type of artwork to movies only
    2. Removing adult movies
    3. Removing all non-actors (eg. directors, writers, etc.)
    4. Narrow down to only living actors.  It is assumed that the dead actors would already be zombies!
    5. Removing films that have less than 1000 reviews on IMDB

I use some print commands to show how many rows were eliminated.  The output is percentage of size the new dataframe is to the original, the number of rows in the new dataframe and the number of rows in the original data frame.  From the last line we can see that the new dataframe is a mere 91k rows and is .26% the size of the original.
    

In [52]:
#Clean Principals
  #use Titles to get movies only
movie_IDs = titles[titles['titleType'] == 'movie']['tconst'].unique()
principals_final = principals[principals['tconst'].isin(movie_IDs)]
print ("#1", "{0:.2%}".format(round(len(principals_final)) / len (principals),4), len(principals_final), len (principals))
  #Use isAdult to remove porn
non_adult_IDs = titles[titles['isAdult'] == 0]['tconst'].unique()
principals_final = principals_final[principals_final['tconst'].isin(non_adult_IDs)]
print ("#2", "{0:.2%}".format(round(len(principals_final)) / len (principals),4), len(principals_final), len (principals))
  #Keep only Actor, Actresses
principals_final = principals_final[principals_final['category'].isin(['actor', 'actress'])]
print ("#3", "{0:.2%}".format(round(len(principals_final)) / len (principals),4), len(principals_final), len (principals))
  #Use death date to get living only
living = names[names['deathYear'].str.isnumeric() == False]['nconst'].unique().tolist()
principals_final = principals_final[principals_final['nconst'].isin(living)]
print ("#4", "{0:.2%}".format(round(len(principals_final)) / len (principals),4), len(principals_final), len (principals))
  #Use ratings to get only those that are > than 1000 ratings
ratings_1000 = ratings[ratings['numVotes']>=1000]['tconst'].tolist()
principals_final = principals_final[principals_final['tconst'].isin(ratings_1000)]
print ("#5", "{0:.2%}".format(round(len(principals_final)) / len (principals),4), len(principals_final), len (principals))

#1 10.69% 3737875 34961324
#2 10.52% 3677477 34961324
#3 4.65% 1625187 34961324
#4 3.41% 1193244 34961324
#5 0.26% 91023 34961324


Now I remove all columns except for nconst and tconst which are IMBDs unique keys for person names and title names.  I then further reduce the data by keeping the Top 100, 1,000 and 10,000 actors based on the number of films they are in.  This allows me to experiment with the size of the dataset that is a large as possible without taking too long to run.  I eventually settled on Top1000 actors as the best size for me.  In the Addendum there is a network_eval function that can be used to help select the best size for you.

In [9]:
#Create sorted groupby of actors for 3ish buckets of N size
principals_final = principals_final[['tconst', 'nconst']]
sorted_principals = principals_final.groupby(['nconst'])['tconst'].count().sort_values(ascending = False).index
Top100_actors = principals_final[principals_final['nconst'].isin(sorted_principals[:99])]
Top1000_actors = principals_final[principals_final['nconst'].isin(sorted_principals[:999])]
Top10000_actors = principals_final[principals_final['nconst'].isin(sorted_principals[:9999])]

In [10]:
###See the length of possible dataframes to use
len (Top100_actors), len (Top1000_actors), len (Top10000_actors), len (principals_final)

(4939, 24520, 62472, 91023)

Next up I load the lists of Zombie movies to determine the initially infected actors.  I then add the names to each of the four dataframes. 

The lists of Zombie movies comes from four user defined lists:
[pete-john-mcgarvey,](https://www.imdb.com/list/ls059468572/)
[bigfatbaloney,](https://www.imdb.com/list/ls055027705/)
[aronharde,](https://www.imdb.com/list/ls062760555/)
[smothlop](https://www.imdb.com/list/ls073818691/)

Note: IMDB has its own Python library [IMDbY](https://imdbpy.github.io/).  This library does have a search by topic function which I tried, but the results I got were a little od.  I have an example in the addendum.

In [12]:
#Load the Zombie Lists
bfb = pd.read_csv('Complete zombie movies list - bigfatbaloney.csv', encoding='latin1')
ah = pd.read_csv('Zombie Movies - aronharde.csv', encoding='latin1')
pjm = pd.read_csv('zombie movies - pete-john-mcgarvey.csv', encoding='latin1')
sl = pd.read_csv('zombie movies - smothlop.csv', encoding='latin1')
all_lists = pd.concat([ah,bfb,pjm,sl])
all_lists = all_lists[all_lists['Num Votes'] > 1000]
all_lists = all_lists[all_lists['Title Type'] == 'movie']
all_zombie_movie_ids = all_lists['Const'].unique().tolist()

title_merge = titles[['tconst', 'primaryTitle']]
names_merge = names[['nconst', 'primaryName']]

def add_names(df, ids):
    df['isZombie'] = df['tconst'].isin(ids)
    #Replace tconst and nconst with actual names for the target dfs
    df = df.merge(title_merge, how = 'left', left_on = 'tconst', right_on = 'tconst')
    df = df.merge(names_merge, how = 'left', left_on = 'nconst', right_on = 'nconst')
    df.drop(['tconst', 'nconst'], axis = 1, inplace = True)
    return df

Top100_actors = add_names(Top100_actors,all_zombie_movie_ids )
Top1000_actors = add_names(Top1000_actors,all_zombie_movie_ids )
Top10000_actors = add_names(Top10000_actors,all_zombie_movie_ids )
principals_final = add_names(principals_final,all_zombie_movie_ids )

### Calculating the Zombie Adjancency Score 
The function below takes one of the dataframes (in this case Top1000_actors) and creates a graph using NetworkX.  I then use the shortest_path from NetworkX to get a path from each original;y infected actor to a possible infectee.  Because NetworkX includes movies as part of the path, I need remove them to get to my definition of an infection path.  I also mandate that the infection path can only be a length of 6 or less via the max_hops variable.  

The function returns a dataframe of possible infectees with their z-score as well as a network graph to allows us some additional analysis.  As ususal, I output some stats about the network and how long it takes the function to run.

In [19]:
def get_infectees(df):
    start = time.time()
    graph = nx.from_pandas_dataframe(df, 'primaryTitle', 'primaryName')
    print ('network created')
    print ('network is connected: ',nx.is_connected(graph))
    print ('edges: ', len(graph.edges()))
    print ('nodes: ', len(graph.nodes()))
    print (f'Time to create network: {(time.time() - start):.2f} seconds')  ##Change to start time
    net_time = time.time()
    z_dict = {}
    zombies = df[df['isZombie'] == True]['primaryName'].unique().tolist()
    actors = df['primaryName'].unique().tolist()
    movie_names = df['primaryTitle'].unique().tolist()
    max_hops = 6
    for actor in actors:       
        z_dict[actor] = 0
        for zombie in zombies:
            if nx.has_path(graph, zombie, actor):
                path = nx.shortest_path(graph, source=zombie, target=actor)
                value = max(max_hops - (len(path) - len ([i for i in path if i in movie_names])), 0)
                z_dict[actor] = z_dict[actor]+value
    df['z_score'] = df['primaryName'].map(z_dict)
    df.sort_values(by='z_score', ascending = False, inplace = True)
    print (f'Total time: {(time.time() - net_time)/60:.2f} minutes')
    return df[df['isZombie'] == False].drop(columns = ['primaryTitle']).drop_duplicates(), graph

In [20]:
infected, contagion_graph = get_infectees(Top1000_actors)

network created
network is connected:  False
edges:  24513
nodes:  13713
Time to create network: 0.27 seconds
Total time: 1.49 minutes


#### We can now see who is most and least likely to be infected

In [24]:
print ('Top 10 most likely to be infected:')
infected[['primaryName', 'z_score']].head(10)

Top 10 most likely to be infected:


Unnamed: 0,primaryName,z_score
4947,Nicolas Cage,252
3620,Ray Liotta,250
21009,Samuel L. Jackson,250
21231,Robert De Niro,249
7648,Bruce Willis,249
8329,Ving Rhames,249
20730,Julianne Moore,248
23919,Ali,248
13077,Forest Whitaker,247
16800,Antonio Banderas,247


In [25]:
print ('Bottom 10 least likely to be infected:')
infected[['primaryName', 'z_score']].tail(10).sort_values(by='z_score')

Bottom 10 least likely to be infected:


Unnamed: 0,primaryName,z_score
15913,Ahmed Helmy,0
208,Tatsuya Nakadai,0
15301,Maaya Sakamoto,4
7810,Masako Nozawa,26
16498,Megumi Hayashibara,73
14331,Metin Akpinar,93
18645,Demet Akbag,93
11686,Kamal Haasan,101
7795,Mayumi Tanaka,105
20302,Dhanush,117


### Comparing Z-score with other centrality measures

Now I want to see how the z-score relates to other centrality measures and build a model to predict z-score.  First I begin by calculating some popular centrailty measures using NetworkX.

I use the following centrality measures:
    * Degree Centrality
    * Closeness Centrality
    * Betweeness Centrality
    * PageRank

After creating the measures, I add them to dataframe with .map()

Note:  Betweenness Centrality often takes 15-20 minutes to run for graph based on the Top1000 dataframe.  I have included some code to save and retrieve the dictionaries that are created from the centrality measures in the Addendum.

In [23]:
new_time = time.time()
degCent = nx.degree_centrality(contagion_graph)
print('degCent', round(time.time() - new_time,2), 'seconds')
new_time = time.time()
closeCent = nx.closeness_centrality(contagion_graph)
print('closeCent', round((time.time() - new_time)/60,2), 'minutes')
new_time = time.time()
between = nx.betweenness_centrality(contagion_graph)
print('between', round((time.time() - new_time)/60,2), 'minutes')
new_time = time.time()
p_rank = nx.pagerank(contagion_graph)
print('p_rank',round(time.time() - new_time)2), 'seconds')

degCent 0.0 seconds
closeCent 2.56 minutes
between 17.62 minutes
p_rank 0.04 minutes


In [27]:
##Add centrality measures to infected df
infected['degCent'] = infected['primaryName'].map(degCent)
infected['closeCent'] = infected['primaryName'].map(closeCent)
infected['between'] = infected['primaryName'].map(between)
infected['p_rank'] = infected['primaryName'].map(p_rank)

## Further Analysis
Let's see the correlation between Z-score and the centrality measures

In [31]:
print ('Correlation betweeen Degree Centrality and z_score:', round(infected['degCent'].corr(infected['z_score']),2))
print ('Correlation betweeen Closeness Centrality and z_score:', round(infected['closeCent'].corr(infected['z_score']),2))
print ('Correlation betweeen Betweenness Centrality and z_score:', round(infected['between'].corr(infected['z_score']),2))
print ('Correlation betweeen Page Rank and z_score:', round(infected['degCent'].corr(infected['p_rank']),2))

Correlation betweeen Degree Centrality and z_score: 0.35
Correlation betweeen Closeness Centrality and z_score: 0.96
Correlation betweeen Betweenness Centrality and z_score: 0.32
Correlation betweeen Page Rank and z_score: 0.94


Looks like Closeness and Page Rank are pretty close to the Z-score.
Now I'll see if I can predict it with a Gradient Boosting Regressor as well as a simple linear regression

In [33]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

from sklearn import linear_model
from sklearn.metrics import r2_score

In [35]:
X = infected.drop(['isZombie','primaryName','z_score'], axis=1)
y = infected['z_score']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

reg = GradientBoostingRegressor(random_state = 0).fit(X_train, y_train) 
y_pred = reg.predict(X_test)
r2_GB = r2_score(y_test, y_pred)
print("r2 for Gradient Boosted Regression: %.4f" % r2_GB)

r2 for Gradient Boosted Regression: 0.9469


In [36]:
linreg = linear_model.LinearRegression()
linreg.fit(X_train, y_train)
y_pred2 = linreg.predict(X_test)
r2_LR = r2_score(y_test, y_pred2)
print("r2 for Linear Regression: %.4f" % r2_LR)

r2 for Linear Regression: 0.9188


While the scores are fairly high, Centrality closeness has a .96 correlation.  Simply using it as a proxy would work better than the models!

### Addendum

#### A function to help evaluate how big your network will be

In [40]:
def network_eval(df):
    start = time.time()
    network = nx.from_pandas_dataframe(df, 'primaryTitle', 'primaryName')
    net_time = time.time()
    print ('network created: ', time.time() - net_time)
    net_time = time.time()
    print ('network is connected: ',nx.is_connected(network))
    print ('edges: ', len(network.edges()))
    print ('nodes: ', len(network.nodes()))
    print ((time.time() - net_time), '\n')

    return network

In [41]:
network_eval(Top1000_actors)

network created:  0.0
network is connected:  False
edges:  24513
nodes:  13713
0.031249523162841797 



<networkx.classes.graph.Graph at 0x1d6a9740320>

#### Code to save and retrieve centrality measures
[Hat tip to this Stack Overflow Article](https://stackoverflow.com/questions/18114628/pickling-multiple-dictionaries/18115159)

In [14]:
MyDicts = [degCent, closeCent, between, p_rank]

outputFile = open( "myDicts.txt", "w")
outputFile.write(str(MyDicts))
outputFile.flush()
outputFile.close()

In [15]:
import ast

inputFile = open( "myDicts.txt", "r")
lines = inputFile.readlines()

objects = []
for line in lines:
    objects.append( ast.literal_eval(line))

degCent, closeCent, between,p_rank  = objects[0][0], objects[0][1], objects[0][2], objects[0][3]

#### IMDbPY

IMDB also has it's own Python library.  [Docs can be found here](https://imdbpy.readthedocs.io/en/latest/#). It even has a function to allow searches of titles by keyword.  I considered using it but after some data exploration, I decide against it.

In [43]:
from imdb import IMDb
ia = IMDb()
zombie_movies = ia.get_keyword('zombie')

First of all it only returns the top 50 per category:

In [44]:
len(zombie_movies)

50

And then there is this:

In [46]:
for movie in zombie_movies:
    if movie['title'] == 'Wreck-It Ralph':
        print ('Wreck-It Ralph is a Zombie Movie!')

Wreck-It Ralph is a Zombie Movie!


I don't watch a lot of kids movies, but somehow I think that is not right.  Moral of the story, be sure to check any data before using.