## Basic stats of the Data and network

This section presents some basic information about the dataset and the initial preparation, to give the reader a good idea of the data that the project revolves around.
### <span style="color:red">Elaborate</span>

In [5]:
#Imports
from IPython.display import Image, clear_output
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import networkx as nx
import re
import nltk
import json
import bar_chart_race as bcr
import random
from fa2 import ForceAtlas2
from collections import Counter, defaultdict
import urllib.request
from datetime import datetime
from ipywidgets import *
import ipywidgets as widgets
import time
import math
from nltk.corpus import stopwords
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
from PIL import Image
import seaborn as sns
from pandas.core.common import SettingWithCopyWarning
import warnings
warnings.simplefilter(action="ignore", category=SettingWithCopyWarning)

#import gif


#Init stopwords
stop_words = set(stopwords.words('english'))

### TMDB 5000
### <span style="color:red">WHAT IS THE DATA ETC: WRITE MORE</span>

In [6]:
# Datasets
RELATIVE_FILE_PATH='../'
movies = pd.read_csv(RELATIVE_FILE_PATH+'tmdb_5000_movies.csv')
credits = pd.read_csv(RELATIVE_FILE_PATH + 'tmdb_5000_credits.csv')

In [7]:
# Creating DF
df = movies.merge(right = credits, left_on = 'id', right_on = 'movie_id')
df['release_date'] = pd.to_datetime(df['release_date'])


#Convert JSON to dict and extract the names of each actor in each movie
actors_per_movie = df['cast'].apply(lambda x: ','.join([y['name'] for y in json.loads(x)])).str.split(',', expand = True).fillna('')
actors_per_movie = df[['original_title', 'id']].merge(right = actors_per_movie, left_index = True, right_index = True)

#Unpivot
actors_per_movie = actors_per_movie.melt(id_vars=['original_title', 'id'])

#Remove empty rows
actors_per_movie = actors_per_movie[actors_per_movie['value'] != '']

#Number of movies per actor
pt = actors_per_movie.pivot_table(index = 'value', values='id', aggfunc = len).reset_index().sort_values(by='id', ascending = False).reset_index()

In [8]:
pt.head()

| | index | value | id |
|---|---|---|---|
| 0 | 45586 | Samuel L. Jackson | 67 |
| 1 | 2 | Jr. | 60 |
| 2 | 43599 | Robert De Niro | 57 |
| 3 | 6819 | Bruce Willis | 51 |
| 4 | 34677 | Matt Damon | 48 |
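
The `Jr.` row is most likely a side effect of joining the cast names with commas and then splitting on commas again: a name that itself contains a comma is cut in two. The sketch below reproduces the effect on a made-up cast (the movie titles and actor names are invented for illustration) and shows one possible way around it, which is to keep every parsed name as its own list element.

```python
import json
from collections import Counter

# Toy stand-in for the `cast` column; titles and names are invented for illustration
toy_cast = {
    "Movie A": json.dumps([{"name": "Jane Doe"}, {"name": "John Smith, Jr."}]),
    "Movie B": json.dumps([{"name": "John Smith, Jr."}]),
}

# Join on ',' and split on ',' again: a name containing a comma is cut in two
split_names = [
    name
    for cast in toy_cast.values()
    for name in ",".join(c["name"] for c in json.loads(cast)).split(",")
]

# Alternative: keep every parsed name as its own list element
whole_names = [c["name"] for cast in toy_cast.values() for c in json.loads(cast)]

split_counts = Counter(split_names)
whole_counts = Counter(whole_names)
print(split_counts.most_common())
print(whole_counts.most_common())
```

In pandas, the same idea can be expressed by storing the list of names per movie and calling `Series.explode` instead of `str.split(',', expand=True)`.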


In [9]:
print(f"In the original {len(df)} movies, there are {len(pt)} actors. Of these actors {len(pt[pt['id']<5])} have starred in less than 5 movies")

In the original 4803 movies, there are 54199 actors. Of these actors 50394 have starred in less than 5 movies


#### Decreasing the size of the data

We now have the full dataset, but as shown above, the vast majority of actors (50394 of 54199) appear in fewer than 5 movies and contribute very little information to the questions we want to answer. If we model the network with actors as nodes and movies as the links between them, we get a lot of low-degree nodes that potentially distort the picture.
Many of these low-degree actors are most likely extras. In addition, the data contains only 4803 movies, whereas as of 2012 the United States had produced roughly 44,000 [movies](https://babelniche.com/2013/06/29/where-are-movies-made/) according to IMDB, so a low count can also mean that most of an actor's movies are simply missing from the dataset.

Because of this, and to reduce complexity, we have decided to look only at the 500 most productive actors and actresses.

In [11]:
#Select top N actors by number of movies
pt = pt.nlargest(500, 'id')
#Top N actors as Nodes
nodes = pt['value'].tolist()

#### The Network itself
As mentioned, the network consists of actors as nodes and movies as edges. Actors can of course appear in several movies together, and this is also part of our investigation. To handle this, we model the network as a weighted undirected graph where the weight of an edge is the number of movies the two actors have starred in together.

The network is also going to contain some other basic stats on nodes and edges:

### <span style="color:red">UPDATE IF ASBJØRN CHANGES</span>

**Edges**
* Weight

**Nodes**
* Name of actor/actress
* Movie id list
* Genre dictionary

We store only the movie identifiers rather than the full movie data in order to reduce redundancy without sacrificing speed.
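
As a minimal sketch of the weighting idea, the snippet below counts how many movies each pair of actors shares. The movie titles and actor names are invented for illustration, and this is not the construction used in the notebook, which builds its edge list from the `actors_per_movie` frame.

```python
from collections import Counter
from itertools import combinations

# Toy cast lists; movie titles and actor names are invented for illustration
toy_movies = {
    "Movie A": ["Ann", "Bob", "Cid"],
    "Movie B": ["Ann", "Bob"],
    "Movie C": ["Bob", "Cid"],
}

# Count how many movies each unordered pair of actors shares
pair_weights = Counter()
for cast in toy_movies.values():
    # sorted() gives each pair one canonical orientation, so (Ann, Bob) == (Bob, Ann)
    for a, b in combinations(sorted(set(cast)), 2):
        pair_weights[(a, b)] += 1

# (u, v, weight) triples describing the weighted undirected graph
weighted_edges = [(a, b, w) for (a, b), w in pair_weights.items()]
print(weighted_edges)
```

Triples of this form can be handed to networkx through `Graph.add_weighted_edges_from`, which stores the third element as the `weight` attribute of each edge.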

In [13]:
# Creating the network

# Extracting genre: map each movie title to its list of genre names
movie_genre={}
for _, row in movies.iterrows():
    movie_genre[row['original_title']]=[y['name'] for y in json.loads(row['genres'])]

    
#Create edges between actors that have been in the same movie
edges = []
actor_genre = {}
actor_movies_count = {}
actor_movies_list = {}
edge_movie_lookup = {}
for node in nodes:
    actor_movies = actors_per_movie[actors_per_movie['value'] == node]['original_title'].tolist()
    edges += [(node, x) for x in\
              actors_per_movie[(actors_per_movie['original_title'].isin(actor_movies)) \
                & (actors_per_movie['value'].isin(nodes))]['value'] if node != x]

    # Actor
    cnt=len(actor_movies)
    actor_movies_count[node] = cnt
    actor_movies_list[node] = actor_movies
    actor_genre[node]=[movie_genre[y] for y in actor_movies]


In [14]:
# Init graph
G = nx.Graph()

# Add nodes and edges
G.add_nodes_from(nodes)
G.add_edges_from(edges)

In [15]:
# Set attributes

ct=Counter(edges)
# Add weights
for key, value in ct.items():
    G.edges[key]['weight'] = value
    
# Attach the list of shared movies to each edge
# (edge_movie_lookup is not filled in above, so this loop currently adds nothing)
for key, value in edge_movie_lookup.items():
    G.edges[key]['movies']=value

# Add most frequent genre
for actor in G.nodes:
    G.nodes[actor]['genre']=Counter([x for y in actor_genre[actor] for x in y])
    G.nodes[actor]['top_genre']=G.nodes[actor]['genre'].most_common(1)[0][0]
    G.nodes[actor]['movies_count']=actor_movies_count[actor]
    G.nodes[actor]['movies']=actor_movies_list[actor]

In [16]:
# Taking a look at a node in the graph
# (`actor` still holds the last node visited in the loop above)
print(actor)
for x in G.nodes[actor]:
    print (f"{x}:",G.nodes[actor][x])

Elisabeth Shue
genre: Counter({'Drama': 8, 'Comedy': 7, 'Thriller': 5, 'Romance': 4, 'Science Fiction': 4, 'Horror': 3, 'Adventure': 3, 'Family': 3, 'Action': 2, 'Mystery': 2, 'Crime': 1, 'Music': 1})
top_genre: Drama
movies_count: 17
movies: ['Piranha 3D', 'Molly', 'Hollow Man', 'The Saint', 'Don McKay', 'Leaving Las Vegas', 'Chasing Mavericks', 'House at the End of the Street', 'Back to the Future Part II', 'Dreamer: Inspired By a True Story', 'Hide and Seek', 'Back to the Future Part III', 'The Karate Kid', 'Deconstructing Harry', 'Gracie', 'Hamlet 2', 'Hope Springs']


### WIKIPEDIA YEAR LOOKUP

The second part of the dataset is a lookup of the Wikipedia page for each year/decade, to see whether the findings from our analysis correspond to the sentiment found on those pages. Our hypothesis is that we will find a direct correlation between the sentiment of the movies and that of the surrounding decade's page.

One assumption here is that we let the output of Hollywood reflect the general state of the world. The reason is that Wikipedia has no consistent page pattern for looking up a year or decade for the US only. Scraping multiple sites would take too much time and shift focus away from the core analysis. The assumption is arguably reasonable because the earliest movies in the dataset are from around the 1970s, by which time society was already quite globalised.

**Cleaning and preprocessing**

The data is fetched through the Wikipedia API and parsed with regular expressions into a raw but filtered state.

The filtering applied is:
* Removing links and references to other pages, identified by being within <> or {}
* Removing links identified by starting with url=
* Removing non-alphanumeric characters
* Removing links identified by starting with http
* Removing redirect markers identified by starting with redirect

This leaves us with partly filtered data that still contains stopwords etc., ready for analysis.
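
The first four filtering steps listed above can be sketched on a small made-up string. The function name `clean_wikitext` and the sample text are hypothetical, and the patterns are simplified illustrations rather than the exact expressions used in the notebook.

```python
import re

def clean_wikitext(text: str) -> str:
    """Illustrative cleaner following the first four filtering steps."""
    # 1. Drop spans enclosed in <...> or {...} (non-greedy, within a single line)
    text = re.sub(r"[{<].*?[}>]", "", text)
    # 2. Drop link parameters starting with url=
    text = re.sub(r"url=\S*", "", text)
    # 3. Keep only letters, digits, spaces, newlines and full stops
    text = re.sub(r"[^a-zA-Z0-9 \n.]", "", text)
    # 4. Drop remaining links, which now start with a bare http
    text = re.sub(r"http\S*", "", text)
    return text

# Made-up sample containing a tag pair, a url= parameter and a plain link
sample = "The 1990s <small>saw</small> growth url=http://example.org and https://example.org ended!"
cleaned = clean_wikitext(sample)
print(cleaned)
```

The order matters: step 3 strips `:` and `/`, so step 4 has to match links by their leading `http` alone.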

In [17]:
# The content of the decade wiki-pages is extracted using the Wikipedia API
baseurl = 'https://en.wikipedia.org/w/api.php?'
action = 'action=query'
title = 'titles='
content = 'prop=revisions&rvprop=content'
dataformat = 'format=json'

def look_up_decade(year: int) -> list:
    """Return the raw wikitext of the Wikipedia page(s) for the decade containing `year`."""
    # Integer division maps e.g. 1994 -> 1990
    decade_start = (year // 10) * 10
    query = '%s%s&%s&%s&%s' % (baseurl,action,f'titles={decade_start}s',content,dataformat)
    res = json.loads(urllib.request.urlopen(query).read().decode('utf-8'))
    pages = res.get('query', {}).get('pages')
    if not pages:
        raise Exception('No pages found')
    data = []
    for page_id, page in pages.items():
        try:
            data.append(page['revisions'][0]['*'])
        except (KeyError, IndexError):
            # Pages without a 'revisions' entry are skipped
            print(f"Failed on page {page_id}")
    return data

def process_data(d: list, limit=3) -> str:
    """Clean and concatenate the text of at most `limit` pages."""
    data_string = ''
    for x in d[:limit]:
        # Remove content enclosed in <> or {}
        temp_str = re.sub(r'[\{\<].*?[\}\>]', '', x)
        # Remove links
        temp_str = re.sub(r'url=.\S*', '', temp_str)
        # Remove non-alphanumeric characters (keep spaces, newlines and full stops)
        temp_str = re.sub(r'[^a-zA-Z0-9 \n\.]', '', temp_str)
        # Remaining links
        temp_str = re.sub(r'http.\S*', '', temp_str)
        # Remaining redirects
        temp_str = re.sub(r'redirect.\S*', '', temp_str)
        data_string += temp_str
    return data_string
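
The decade lookup builds its query string by hand. As a possible alternative, the sketch below assembles the same parameters with `urllib.parse.urlencode`, which also takes care of escaping. The helper name `decade_query_url` is hypothetical, and no request is sent here.

```python
from urllib.parse import urlencode

API_URL = "https://en.wikipedia.org/w/api.php"

def decade_query_url(year: int) -> str:
    """Build the query URL for the decade page containing `year` (hypothetical helper)."""
    # Integer division maps e.g. 1994 -> 1990
    decade_start = (year // 10) * 10
    params = {
        "action": "query",
        "titles": f"{decade_start}s",
        "prop": "revisions",
        "rvprop": "content",
        "format": "json",
    }
    return f"{API_URL}?{urlencode(params)}"

url_1994 = decade_query_url(1994)
print(url_1994)
```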

**NOTE: MOST OF THE FUNCTIONS AND DATA PROCESSING STEPS HAVE BEEN MOVED TO A SEPARATE HELPER FILE IN THE OTHER NOTEBOOKS TO IMPROVE READABILITY**