## Data Processing and Creation of the Network

This section describes the data used in the project and how it has been processed for the analysis.

In [10]:
#Imports
from IPython.display import Image, clear_output
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import networkx as nx
import re
import nltk
import json
import bar_chart_race as bcr
import random
#from fa2 import ForceAtlas2
from collections import Counter, defaultdict
import urllib.request  # urlopen lives in the urllib.request submodule
from datetime import datetime
from ipywidgets import *
import ipywidgets as widgets
import time
import math
from nltk.corpus import stopwords
#from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
from PIL import Image  # NOTE: shadows IPython.display.Image imported above
import seaborn as sns
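try:
    from pandas.errors import SettingWithCopyWarning  # location in newer pandas releases
except ImportError:
    from pandas.core.common import SettingWithCopyWarning  # location in older pandas releases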
import warnings
warnings.simplefilter(action="ignore", category=SettingWithCopyWarning)

#import gif


#Init stopwords
stop_words = set(stopwords.words('english'))

### TMDB 5000

In [11]:
# Datasets
RELATIVE_FILE_PATH='../'
movies = pd.read_csv(RELATIVE_FILE_PATH+'tmdb_5000_movies.csv')
credits = pd.read_csv(RELATIVE_FILE_PATH + 'tmdb_5000_credits.csv')

In [12]:
# Creating DF
df = movies.merge(right = credits, left_on = 'id', right_on = 'movie_id')
df['release_date'] = pd.to_datetime(df['release_date'])


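# Convert JSON to dict and extract the names of each actor in each movie.
# NOTE: the names are joined and re-split on ',', so a name that itself contains
# a comma is split in two (this appears to be the source of the 'Jr.' row in the output below).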
actors_per_movie = df['cast'].apply(lambda x: ','.join([y['name'] for y in json.loads(x)])).str.split(',', expand = True).fillna('')
actors_per_movie = df[['original_title', 'id']].merge(right = actors_per_movie, left_index = True, right_index = True)

#Unpivot
actors_per_movie = actors_per_movie.melt(id_vars=['original_title', 'id'])

#Remove empty rows
actors_per_movie = actors_per_movie[actors_per_movie['value'] != '']

#Number of movies per actor
pt = actors_per_movie.pivot_table(index = 'value', values='id', aggfunc = len).reset_index().sort_values(by='id', ascending = False).reset_index()

In [13]:
pt.head()

| | index | value | id |
|---|---|---|---|
| 0 | 45586 | Samuel L. Jackson | 67 |
| 1 | 2 | Jr. | 60 |
| 2 | 43599 | Robert De Niro | 57 |
| 3 | 6819 | Bruce Willis | 51 |
| 4 | 34677 | Matt Damon | 48 |


<b>Movie Count</b>

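This table shows the five most frequently featured actors; the column "id" holds the number of movies each has appeared in. The "Jr." row is most likely not a separate actor but an artifact of splitting the cast names on commas, which cuts names that contain a comma in two.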
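The comma artifact can be avoided by never joining the names into one string. The following is a minimal sketch on a hypothetical two-movie sample (the titles and actor names are made up); it assumes the `cast` column holds JSON lists of objects with a `name` key, as in the data above, and uses `DataFrame.explode` to get one row per movie and actor.

```python
import json
import pandas as pd

# Hypothetical sample in the same shape as the merged movie data.
sample = pd.DataFrame({
    "id": [1, 2],
    "original_title": ["Movie A", "Movie B"],
    "cast": [
        json.dumps([{"name": "Jane Doe"}, {"name": "John Smith, Jr."}]),
        json.dumps([{"name": "John Smith, Jr."}]),
    ],
})

# Parse each JSON string into a list of names, then emit one row per (movie, actor).
sample["value"] = sample["cast"].apply(lambda s: [c["name"] for c in json.loads(s)])
actors_long = (
    sample[["original_title", "id", "value"]]
    .explode("value")
    .dropna(subset=["value"])  # movies with an empty cast list yield NaN
)

# Number of movies per actor; names containing commas stay intact.
movie_counts = actors_long.groupby("value")["id"].count().sort_values(ascending=False)
print(movie_counts)
```

The resulting long table has the same columns as `actors_per_movie` after the melt step, so the rest of the pipeline would not need to change.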

In [14]:
print(f"In the original {len(df)} movies, there are {len(pt)} actors. Of these actors {len(pt[pt['id']<5])} have starred in less than 5 movies")

In the original 4803 movies, there are 54199 actors. Of these actors 50394 have starred in less than 5 movies


In [15]:
df.head(2)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,spoken_languages,status,tagline,title_x,vote_average,vote_count,movie_id,title_y,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...",...,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."


<b>Movie Data set</b>

As shown above, the data set consists of 4803 movies featuring a total of 54199 actors, of whom 50394 have appeared in fewer than 5 movies.

Actors who appear in only a handful of movies are of limited interest for the analysis, so we reduce the data to the 500 actors with the most appearances. The reduction is also necessary to keep the running time of the analysis reasonably short.

In [9]:
#Select top N actors by number of movies
pt = pt.nlargest(500, 'id')
#Top N actors as Nodes
nodes = pt['value'].tolist()
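One detail of the selection above: by default `nlargest` returns exactly the requested number of rows, so actors tied at the cutoff are included or excluded based on row order. The sketch below, on made-up counts, contrasts this with `keep="all"`, which keeps every actor tied with the last selected one. Whether that matters for the analysis is a judgment call; this only illustrates the difference.

```python
import pandas as pd

# Hypothetical movie counts; three actors are tied at 3 movies.
counts = pd.DataFrame({
    "value": ["A", "B", "C", "D", "E"],
    "id": [5, 3, 3, 3, 1],
})

top_first = counts.nlargest(2, "id")            # default keep="first": exactly 2 rows
top_all = counts.nlargest(2, "id", keep="all")  # also keeps every actor tied at the cutoff

print(top_first["value"].tolist())
print(top_all["value"].tolist())
```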

### The Network

The purpose of the project is to investigate the network of actors in Hollywood, so it is crucial to model the network in an appropriate manner. We use an <b>undirected weighted</b> graph, where each node represents an actor and each edge is weighted by the number of movies the two actors have starred in together.

The edges are weighted by the number of shared movies because of the later community analysis. With these weights, two actors are more likely to end up in the same community if they have starred in multiple movies together, which seems like reasonable behaviour.
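As an illustration of this edge model, here is a minimal sketch on a hypothetical three-movie cast list (the names `casts`, `shared` and `G_toy` are made up for the example). It is not the construction used in the notebook, which works from the long actor table, but it produces the same kind of graph: one edge per pair of co-stars, with the number of shared movies as weight.

```python
from collections import defaultdict
from itertools import combinations

import networkx as nx

# Hypothetical cast lists keyed by movie title.
casts = {
    "Movie A": ["Ann", "Bob", "Cy"],
    "Movie B": ["Ann", "Bob"],
    "Movie C": ["Bob", "Cy"],
}

# Collect the shared movies for each unordered pair of actors.
shared = defaultdict(list)
for title, cast in casts.items():
    for a, b in combinations(sorted(set(cast)), 2):
        shared[(a, b)].append(title)

# One undirected edge per pair, weighted by the number of shared movies.
G_toy = nx.Graph()
for (a, b), titles in shared.items():
    G_toy.add_edge(a, b, weight=len(titles), movies=titles)

print(G_toy["Ann"]["Bob"]["weight"])  # 2
```

Sorting each cast before forming pairs means every pair has a single canonical key, so each edge is counted once rather than once per direction.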

In addition, each node is given several attributes that are used in the later analysis.

**Nodes**
* <b>Movie list</b> - List of movies the actor has featured in
* <b>Movie count</b> - Number of movies the actor has featured in
* <b>Dictionary of genres</b> - Genres as keys, and the number of movies the actor has featured in within each genre as values
* <b>Top genre</b> - The genre the actor has featured in most often
* <b>Community</b> - The community the actor is assigned to (added after the community analysis)
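A minimal sketch of how the genre-related attributes can be derived, using a made-up actor and genre lists. The attribute keys `genre`, `top_genre` and `movies_count` match those used in the notebook; everything else (`actor_genres`, `G_toy`) is hypothetical.

```python
from collections import Counter

import networkx as nx

# Hypothetical genre lists, one inner list per movie the actor appeared in.
actor_genres = {
    "Ann": [["Drama", "Romance"], ["Drama"], ["Comedy"]],
}

G_toy = nx.Graph()
for actor, per_movie in actor_genres.items():
    # Flatten the per-movie lists and count how often each genre occurs.
    genre_counts = Counter(g for genres in per_movie for g in genres)
    G_toy.add_node(
        actor,
        genre=genre_counts,
        # Guard against actors without any genre information.
        top_genre=genre_counts.most_common(1)[0][0] if genre_counts else None,
        movies_count=len(per_movie),
    )

print(G_toy.nodes["Ann"]["top_genre"])  # Drama
```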

In [13]:
# Creating the network

# Map each movie title to its list of genre names
movie_genre = {}
for _, row in movies.iterrows():
    movie_genre[row['original_title']] = [g['name'] for g in json.loads(row['genres'])]

    
# Create edges between actors that have been in the same movie.
# Each shared movie (matched by title) contributes one (actor, co-star) tuple,
# so the number of repeats of a tuple is the number of movies the pair shares.
edges = []
actor_genre = {}
actor_movies_count = {}
actor_movies_list = {}
edge_movie_lookup = {}  # not populated in this cell
top_actor_rows = actors_per_movie[actors_per_movie['value'].isin(nodes)]
for node in nodes:
    actor_movies = actors_per_movie[actors_per_movie['value'] == node]['original_title'].tolist()
    co_stars = top_actor_rows[top_actor_rows['original_title'].isin(actor_movies)]['value']
    edges += [(node, x) for x in co_stars if node != x]

    # Per-actor attributes
    actor_movies_count[node] = len(actor_movies)
    actor_movies_list[node] = actor_movies
    actor_genre[node] = [movie_genre[y] for y in actor_movies]


In [14]:
# Init graph
G = nx.Graph()

# Add nodes and edges
G.add_nodes_from(nodes)
G.add_edges_from(edges)

In [15]:
# Set attributes

# Add weights. (a, b) and (b, a) are counted separately, but both refer to the
# same undirected edge and carry the same count.
ct = Counter(edges)
for key, value in ct.items():
    G.edges[key]['weight'] = value

# Attach the list of shared movies to each edge
# (edge_movie_lookup is empty here, so this loop currently does nothing)
for key, value in edge_movie_lookup.items():
    G.edges[key]['movies'] = value

# Add node attributes: genre counts, most frequent genre, movie count and movie list
for actor in G.nodes:
    G.nodes[actor]['genre'] = Counter([x for y in actor_genre[actor] for x in y])
    G.nodes[actor]['top_genre'] = G.nodes[actor]['genre'].most_common(1)[0][0]
    G.nodes[actor]['movies_count'] = actor_movies_count[actor]
    G.nodes[actor]['movies'] = actor_movies_list[actor]

In [16]:
# Taking a look at a node in the graph (before adding community).
# `actor` is the last node visited by the loop in the previous cell.
print(actor)
for key, value in G.nodes[actor].items():
    print(f"{key}:", value)

Elisabeth Shue
genre: Counter({'Drama': 8, 'Comedy': 7, 'Thriller': 5, 'Romance': 4, 'Science Fiction': 4, 'Horror': 3, 'Adventure': 3, 'Family': 3, 'Action': 2, 'Mystery': 2, 'Crime': 1, 'Music': 1})
top_genre: Drama
movies_count: 17
movies: ['Piranha 3D', 'Molly', 'Hollow Man', 'The Saint', 'Don McKay', 'Leaving Las Vegas', 'Chasing Mavericks', 'House at the End of the Street', 'Back to the Future Part II', 'Dreamer: Inspired By a True Story', 'Hide and Seek', 'Back to the Future Part III', 'The Karate Kid', 'Deconstructing Harry', 'Gracie', 'Hamlet 2', 'Hope Springs']


### WIKIPEDIA YEAR LOOKUP

The other data set used in this project is a lookup of years/decades on Wikipedia. The goal of using this data alongside the movie data is to see whether Hollywood reflects the state of the real world.

Our hypothesis is that Hollywood movies are indeed a <b>sign of the times</b>. If so, the sentiment of the movie descriptions from a decade should correlate with the sentiment of the Wikipedia description of that decade, which we test by performing sentiment analysis on both.
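To compare the two sources, the movie descriptions first have to be grouped by decade. A minimal sketch of that step, on three made-up rows that reuse the `release_date` and `overview` column names of the movie data (the names `toy` and `overviews_per_decade` are hypothetical, and the sentiment scoring itself is not shown):

```python
import pandas as pd

# Hypothetical rows with the same column names as the merged movie DataFrame.
toy = pd.DataFrame({
    "release_date": pd.to_datetime(["1994-06-23", "1999-03-31", "2009-12-10"]),
    "overview": ["First plot.", "Second plot.", "Third plot."],
})

# Map each release year to the first year of its decade, e.g. 1994 -> 1990.
toy["decade"] = toy["release_date"].dt.year // 10 * 10

# One text blob per decade, ready to be scored and compared with the Wikipedia text.
overviews_per_decade = toy.groupby("decade")["overview"].apply(" ".join)
print(overviews_per_decade)
```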


**Cleaning and preprocessing**

The data is fetched through the Wikipedia API and parsed with regular expressions (regex) into a raw but filtered state.

The filtering applied is:
* Removing links and references to other pages, identified by being enclosed in <> or {}
* Removing links identified by starting with url=
* Removing non-alphanumeric characters (spaces, newlines and periods are kept)
* Removing links identified by starting with http
* Removing redirect markers, identified by starting with redirect

This leaves partly filtered text, still containing stopwords etc., that is ready for analysis.
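The effect of these filtering steps (including the removal of redirect markers done in the notebook's code) can be illustrated on a small made-up string. The helper name `clean_wikitext` and the sample text are hypothetical, and the patterns are a simplified sketch of the steps rather than the exact expressions used in the project.

```python
import re

def clean_wikitext(text: str) -> str:
    """Apply the filtering steps in order (hypothetical helper)."""
    text = re.sub(r"[{<].*?[}>]", "", text)       # content enclosed in {} or <>
    text = re.sub(r"url=\S*", "", text)           # links given as url=...
    text = re.sub(r"[^a-zA-Z0-9 \n.]", "", text)  # non-alphanumeric characters
    text = re.sub(r"http\S*", "", text)           # remaining links
    text = re.sub(r"redirect\S*", "", text)       # redirect markers
    return text

sample = "Peace {{cite}} came <br> slowly! url=http://a.org See http://b.org today."
cleaned = clean_wikitext(sample)

# Collapse the leftover runs of spaces for display only.
print(" ".join(cleaned.split()))  # Peace came slowly See today.
```

Note that the order matters: the `url=` pattern has to run before the non-alphanumeric step, which would otherwise strip the `=` it relies on.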

In [16]:
# The content of the decade pages is fetched through the Wikipedia API
baseurl = 'https://en.wikipedia.org/w/api.php?'
action = 'action=query'
title = 'titles='
content = 'prop=revisions&rvprop=content'
dataformat = 'format=json'

def look_up_decade(year: int) -> list:
    """Return the raw wikitext of the Wikipedia page(s) for the decade containing `year`."""
    decade_start = year // 10 * 10
    query = f'{baseurl}{action}&{title}{decade_start}s&{content}&{dataformat}'
    res = json.loads(urllib.request.urlopen(query).read().decode('utf-8'))
    pages = res.get('query', {}).get('pages')
    if not pages:
        raise Exception('No pages found')
    data = []
    for page_id, page in pages.items():
        try:
            data.append(page['revisions'][0]['*'])
        except (KeyError, IndexError):
            # The page is missing or has no revisions
            print(f"Failed on page {page_id}")
    return data

def process_data(d: list, limit=3) -> str:
    """Clean at most `limit` raw wiki texts and concatenate them into one string."""
    data_string = ''
    for x in d[:limit]:
        # Remove content enclosed in {} or <> (templates, tags, references)
        temp_str = re.sub(r'[{<].*?[}>]', '', x)
        # Remove links given as url=...
        temp_str = re.sub(r'url=.\S*', '', temp_str)
        # Remove everything except letters, digits, spaces, newlines and periods
        temp_str = re.sub(r'[^a-zA-Z0-9 \n.]', '', temp_str)
        # Remove remaining links
        temp_str = re.sub(r'http.\S*', '', temp_str)
        # Remove redirect markers
        temp_str = re.sub(r'redirect.\S*', '', temp_str)
        data_string += temp_str
    return data_string
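As a side note on the lookup, the query string can also be assembled with `urllib.parse.urlencode`, which takes care of escaping the parameter values. The sketch below only builds the URL and makes no network request; the function name `decade_query_url` is hypothetical, while the parameters are the ones used in the cell above.

```python
from urllib.parse import urlencode

def decade_query_url(year: int) -> str:
    """Build the API URL for the Wikipedia page of the decade containing `year`."""
    decade_start = year // 10 * 10
    params = {
        "action": "query",
        "titles": f"{decade_start}s",
        "prop": "revisions",
        "rvprop": "content",
        "format": "json",
    }
    return "https://en.wikipedia.org/w/api.php?" + urlencode(params)

print(decade_query_url(1994))
```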

**NOTE: MOST OF THE FUNCTIONS AND DATA PROCESSING STEPS HAVE BEEN MOVED TO A SEPARATE HELPER FILE IN THE OTHER NOTEBOOKS TO IMPROVE READABILITY**