# **Optimal Actorle Strategy**

## **Introduction**

### [Actorle](https://actorle.com/) is a fantastic game that is like wordle for actors where you have to guess the actor of the day given their filmography. However, the titles of their films are blanked out. Instead, you have access to the films' genres and IMDB scores. If you guess the wrong actor but they share a film with the right one, this film will be revealed. 

### Can we use data science to win at this game?

## **Method**

### First, we should find a database of movies and their actors. Such a dataset can be found [here](https://www.kaggle.com/datasets/juzershakir/tmdb-movies-dataset?rvi=1). Our intuition tells us that we should look for the most popular actor in the database.

### This is a good start, but it neglects a crucial bit of information: the genre of films. Therefore we should also find the most popular actor by genre.

### These two classes are obviously good for Actorle but there is another class which is also important. We remind that guessing an actor who shares a lot of films with the correct answer will reveal the names of the films they share (as per the rules) providing useful information. Thus, we should also look for the actor who has appeared in the most films shared with other popular actors, essentially the ['Kevin Bacon'](https://simple.wikipedia.org/wiki/Bacon_number#:~:text=The%20Bacon%20number%20of%20an,concept%20to%20the%20movie%20industry.) of popular movie stars. 

## **1) Most popular actor (regardless of genre)**

In [1]:
import pandas as pd
import numpy as np

In [2]:
#import dataset
data = pd.read_csv('tmdb_movies_data.csv')

In [3]:
#clean data
data = data[data["cast"].isnull() == False]
data = data[data["genres"].isnull() == False]
data = data[data.budget_adj != 0]
data = data[data.revenue_adj != 0]

In [4]:
#subset of data relevant for actorle
actorle = data[['original_title','popularity', 'genres','cast', 'vote_average']]

actorle.head(10)

Unnamed: 0,original_title,popularity,genres,cast,vote_average
0,Jurassic World,32.985763,Action|Adventure|Science Fiction|Thriller,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,6.5
1,Mad Max: Fury Road,28.419936,Action|Adventure|Science Fiction|Thriller,Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic...,7.1
2,Insurgent,13.112507,Adventure|Science Fiction|Thriller,Shailene Woodley|Theo James|Kate Winslet|Ansel...,6.3
3,Star Wars: The Force Awakens,11.173104,Action|Adventure|Science Fiction|Fantasy,Harrison Ford|Mark Hamill|Carrie Fisher|Adam D...,7.5
4,Furious 7,9.335014,Action|Crime|Thriller,Vin Diesel|Paul Walker|Jason Statham|Michelle ...,7.3
5,The Revenant,9.1107,Western|Drama|Adventure|Thriller,Leonardo DiCaprio|Tom Hardy|Will Poulter|Domhn...,7.2
6,Terminator Genisys,8.654359,Science Fiction|Action|Thriller|Adventure,Arnold Schwarzenegger|Jason Clarke|Emilia Clar...,5.8
7,The Martian,7.6674,Drama|Adventure|Science Fiction,Matt Damon|Jessica Chastain|Kristen Wiig|Jeff ...,7.6
8,Minions,7.404165,Family|Animation|Adventure|Comedy,Sandra Bullock|Jon Hamm|Michael Keaton|Allison...,6.5
9,Inside Out,6.326804,Comedy|Animation|Family,Amy Poehler|Phyllis Smith|Richard Kind|Bill Ha...,8.0


In [5]:
actors = actorle['cast']
actors = actors.str.split('|')
freq = {}

# Using a hashmap to total prevalence
for i in actors:
    for j in i:
        if j in freq:
            freq[j]+=1
        else:
            freq[j]=1

# Sorting data
freq = sorted(freq.items(), key=lambda x:x[1], reverse = True)
print('The twenty most popular actors are')
print()
for i,j in enumerate(freq[:20]):
    print(i+1,j[0])



The twenty most popular actors are

1 Robert De Niro
2 Bruce Willis
3 Samuel L. Jackson
4 Nicolas Cage
5 Matt Damon
6 Johnny Depp
7 Harrison Ford
8 Brad Pitt
9 Tom Hanks
10 Sylvester Stallone
11 Morgan Freeman
12 Tom Cruise
13 Denzel Washington
14 Eddie Murphy
15 Liam Neeson
16 Owen Wilson
17 Julianne Moore
18 Arnold Schwarzenegger
19 Mark Wahlberg
20 Meryl Streep


### It is clear from the data above that a really good guess would be something like Robert De Niro, Bruce Willis or Samuel L. Jackson. Of course, we have not taken into account genre which is a crucial bit of information actorle gives us. We do that next.

## **2) Most popular actor by genre**

In [6]:
genres = actorle['genres']
genres = genres.str.split('|')

# Making a list of all genres with index keys
genre_list = []
for i in genres:
    for j in i:
        if j not in genre_list:
            genre_list.append(j)

print('List of genres:')
print(genre_list)
print()
genre_list_keys = {}
for i,j in enumerate(genre_list):
    genre_list_keys[j] = i


# Preparing a dictionary with entries: {... actor: (genre_list[0].freq, genre_list[1].freq...)...)}
actor_genre_freq = {}
for i in actors:
    for j in i:
        if j not in actor_genre_freq:
            actor_genre_freq[j] = [0]*len(genre_list)        
            
# Filling dictionary
for i,j in enumerate(actors):
    for actor in j:
        try:
            for genre in genres[i]:
                actor_genre_freq[actor][genre_list_keys[genre]]+=1
        except KeyError:
            pass
            
# Now one can find greatest appearances per genre, e.g. action       
genre = 'Drama'
actor_genre_freq_ans= sorted(actor_genre_freq.items(), key=lambda x:x[1][genre_list_keys[genre]], reverse = True)  

print('The twenty most popular actors in the genre of %s are:' %genre)
print()
for i,j in enumerate(actor_genre_freq_ans[:20]):
    print(i+1,j[0])

List of genres:
['Action', 'Adventure', 'Science Fiction', 'Thriller', 'Fantasy', 'Crime', 'Western', 'Drama', 'Family', 'Animation', 'Comedy', 'Mystery', 'Romance', 'War', 'History', 'Music', 'Horror', 'Documentary', 'Foreign', 'TV Movie']

The twenty most popular actors in the genre of Drama are:

1 Samuel L. Jackson
2 Robert De Niro
3 Tom Hanks
4 Meryl Streep
5 Dennis Quaid
6 Brad Pitt
7 Alec Baldwin
8 Keanu Reeves
9 Ben Affleck
10 Dwayne Johnson
11 Arnold Schwarzenegger
12 Helena Bonham Carter
13 Steve Carell
14 Tom Cruise
15 James Franco
16 Owen Wilson
17 Julia Roberts
18 Ben Stiller
19 Tommy Lee Jones
20 Greg Kinnear


### If one checks out 'Action', 'Adventure', 'Drama', 'Crime' Samuel. L. Jackson is near or at the top of all of them. He's looking like our best bet

## **Rank actors based on appearances in top 10 of genres**

### As a corollary we find the top actors across all genres. We find the top 10 per genre and give first 10 points, second 9 points etc and add them up.

In [7]:
# We now award points based on genre positions
top_actors = {}
for g in genre_list:
    for i,j in enumerate(sorted(actor_genre_freq.items(), key=lambda x:x[1][genre_list_keys[g]], reverse = True)[:10]):
        if j[0] in top_actors:
            top_actors[j[0]] += 10-i
        else:
            top_actors[j[0]] = 10-i
            
print('The twenty top actors across all genres are:')
print()       
for i,j in enumerate(sorted(top_actors.items(), key = lambda x:x[1], reverse = True)[:20]):
    print(i+1,j[0])
        
        

The twenty top actors across all genres are:

1 Samuel L. Jackson
2 Tom Hanks
3 Sandra Bullock
4 Harrison Ford
5 Arnold Schwarzenegger
6 Bruce Willis
7 Danny DeVito
8 Robert De Niro
9 Tom Cruise
10 Meryl Streep
11 Dennis Quaid
12 Brad Pitt
13 Michael Douglas
14 Ben Stiller
15 Gwyneth Paltrow
16 Nicolas Cage
17 Vincent D'Onofrio
18 Gene Hackman
19 Tom Hardy
20 Sigourney Weaver


## **Aside: list of movies for a given actor**

In [8]:
name_of_actor = 'Alicia Vikander'
list_of_films = []

for i,j in enumerate(actors):
    for actor in j:
        if actor == name_of_actor:
            try:
                list_of_films.append(actorle.loc[i, 'original_title'])
            except KeyError:
                pass
print('%s has been in these films:' %name_of_actor)
print()
print(list_of_films)

Alicia Vikander has been in these films:

['Ex Machina', 'The Last Witch Hunter', 'Everest', 'Burnt', "Valentine's Day"]


## **Aside aside: list of actors in a given film**

In [9]:
name_of_film = "Astro Boy"
actor_list = actorle.loc[actorle['original_title'] == name_of_film,'cast']
print('%s has the following top-billed cast:' %name_of_film)
print()
print(list(actor_list.str.split('|')))

Astro Boy has the following top-billed cast:

[['Nicolas Cage', 'Kristen Bell', 'Bill Nighy', 'Donald Sutherland', 'Freddie Highmore']]


## **'Kevin Bacon' of popular actors**

### Let us now find the 'Kevin Bacon' of popular actors. We know the top actors across all genres and we can find what films they have been in. Then, we go through our database of actors and if they appear in one of these films we give them a point.

In [10]:
# Finding our 'bacon_films': the films in which the most popular actors have appeared.
bacon_films = []
for i,j in enumerate(sorted(top_actors.items(), key = lambda x:x[1], reverse = True)[:20]):
    name_of_actor = j[0]

    for i,j in enumerate(actors):
        for actor in j:
            if actor == name_of_actor:
                try:
                    bacon_films.append(actorle.loc[i, 'original_title'])
                except KeyError:
                    pass

# Finding our 'bacon_actors': actors who have appeared in bacon_films.
bacon_actors = {}
for i in bacon_films:
    name_of_film = i
    actor_list = actorle.loc[actorle['original_title'] == name_of_film,'cast'].str.split('|')
    for j in actor_list:
        for a in j:
            if a in bacon_actors:
                bacon_actors[a] +=1
            else:
                bacon_actors[a] = 1

print('Top twenty Bacon actors are:')
print()       
for i,j in enumerate(sorted(bacon_actors.items(), key = lambda x:x[1], reverse = True)[:20]):
    print(i+1,j[0])

Top twenty Bacon actors are:

1 Steve Buscemi
2 John Goodman
3 Matt Damon
4 Seth Rogen
5 Elizabeth Banks
6 Robert De Niro
7 Heath Ledger
8 Nicolas Cage
9 Reese Witherspoon
10 Vince Vaughn
11 Sandra Bullock
12 Jonah Hill
13 Christopher Plummer
14 Mickey Rourke
15 Jim Cummings
16 Ashley Judd
17 Bryce Dallas Howard
18 Christian Bale
19 Jennifer Garner
20 Samuel L. Jackson


### Interestingly, once we factor in top-billed cast in movies shared with the top actors we have different stars coming out on top. Steve Buscemi has been top-billed in the most films that have also starred the most popular actors

## **Testing**

### Let us now test what we've learnt on today's [Actorle](https://actorle.com/).

## **Conclusions**

### Weekly achievement before/after Data Science

| | #Correct (/7) | Avg. Score (/8) |
| --- | --- | --- |
| **Before DS** |  3|7.00 |
| **After DS** | 6 | 4.71|

### Several improvements can be made:

### 1. Age is a vital yet practically unused bit of information that can help us narrow down and optimise our search.

### 2. Useful to scrape from the Actorle website the genre counts so we know which genre is the most popular without having to manually count.

### 3. In testing, it was clear several films have been omitted ultimately calling for a better, more complete dataset.