# Creating Networks from JSON Data

This notebook reads movie data from the file `imdb_movies_2000to2022.prolific.json` and constructs a graph of actors. The dataset contains a sample of movies released between 2000 and 2022, with their titles, genres, release years, ratings, and top-billed actors.

Using this dataset, we build a graph and perform some rudimentary graph analysis, extracting centrality metrics from it.
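
The code in this notebook reads the file one line at a time and parses each line as a separate JSON object. The following is a minimal sketch of that parsing step, assuming only the `title` and `actors` fields that the notebook's code relies on. The two records and the `fake_file` stand-in are made up for illustration and are not taken from the dataset.

```python
import io
import json

# Two made-up lines standing in for the real file; only `title` and `actors`
# are field names taken from the notebook's code.
fake_file = io.StringIO(
    '{"title": "Example Movie", "actors": [["id1", "Actor A"], ["id2", "Actor B"]]}\n'
    '{"title": "Another Movie", "actors": [["id2", "Actor B"]]}\n'
)

# Each line is an independent JSON document
movies = [json.loads(line) for line in fake_file]

for movie in movies:
    # Every entry in `actors` is an [actor_id, actor_name] pair
    names = [actor_name for _actor_id, actor_name in movie["actors"]]
    print(movie["title"], "->", names)
```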

In [None]:
%matplotlib inline

In [1]:
import json
import random

import numpy as np
import pandas as pd
import networkx as nx


## Exercise 1: Build Graph of Actors, Finding Most Prolific Actor

The dataset contains a list of movies. We want to convert that list into a network of actors, where each node represents an actor and each edge connects two actors who have co-starred in at least one movie.

From there, we want to rank the actors by the number of neighboring actors to whom they are connected, and print the top 20.
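
The pairing step can be sketched on a single toy record before running it on the full file. This is a minimal sketch, assuming each movie carries an `actors` list of `[actor_id, actor_name]` pairs; the record `toy_movie` and its IDs are made up, and `itertools.combinations` is swapped in for the nested loops used in the notebook.

```python
from itertools import combinations

import networkx as nx

# Hypothetical record mimicking one line of the dataset; the IDs are made up.
toy_movie = {
    "title": "Example Movie",
    "actors": [["id1", "Actor A"], ["id2", "Actor B"], ["id3", "Actor C"]],
}

toy_g = nx.Graph()

# Keep only the names, then add one node per actor
names = [actor_name for _actor_id, actor_name in toy_movie["actors"]]
toy_g.add_nodes_from(names)

# combinations(names, 2) yields each unordered pair exactly once
toy_g.add_edges_from(combinations(names, 2))

print(sorted(toy_g.edges))
# Rank actors by number of neighbors, highest first
print(sorted(toy_g.degree, key=lambda pair: pair[1], reverse=True))
```

Because the graph is an undirected `nx.Graph`, adding the same pair again for a second shared movie would not create a second edge, so an actor's degree counts distinct co-stars rather than shared movies.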

In [17]:
g = nx.Graph() # Build the graph

In [None]:
with open("imdb_movies_2000to2022.prolific.json", "r") as in_file:
    for line in in_file:
        
        # Load the movie from this line
        this_movie = json.loads(line)
            
        # Create a node for every actor
        for actor_id,actor_name in this_movie['actors']:
            # add the actor to the graph
            g.add_node(actor_name)
            
        # Iterate through the list of actors, generating all pairs
        #. Starting with the first actor in the list, generate pairs with all subsequent actors
        #. then continue to second actor in the list and repeat
        for i, (left_actor_id, left_actor_name) in enumerate(this_movie['actors']):
            for right_actor_id, right_actor_name in this_movie['actors'][i+1:]:

                # Add an edge for these actors
                g.add_edge(left_actor_name, right_actor_name)

                # Print edges
                print(left_actor_name, "<->", right_actor_name)

print("Nodes:", len(g.nodes))
# Print the 20 nodes with the most connections
sorted(g.degree, key=lambda x: x[1], reverse=True)[:20]

Meg Ryan <-> Hugh Jackman
Meg Ryan <-> Liev Schreiber
Meg Ryan <-> Breckin Meyer
Hugh Jackman <-> Liev Schreiber
Hugh Jackman <-> Breckin Meyer
Liev Schreiber <-> Breckin Meyer
Kenneth Tobey <-> Brinke Stevens
Kenneth Tobey <-> R.G. Wilson
Kenneth Tobey <-> John Goodwin
Brinke Stevens <-> R.G. Wilson
Brinke Stevens <-> John Goodwin
R.G. Wilson <-> John Goodwin
Crispin Glover <-> Vanessa Redgrave
Crispin Glover <-> John Hurt
Crispin Glover <-> Margot Kidder
Vanessa Redgrave <-> John Hurt
Vanessa Redgrave <-> Margot Kidder
John Hurt <-> Margot Kidder
Dean Cain <-> Thomas Ian Griffith
Dean Cain <-> Justin Whalin
Dean Cain <-> Jodi Bianca Wise
Thomas Ian Griffith <-> Justin Whalin
Thomas Ian Griffith <-> Jodi Bianca Wise
Justin Whalin <-> Jodi Bianca Wise
Jackie Shroff <-> Nana Patekar
Jackie Shroff <-> Kumar Gaurav
Jackie Shroff <-> Jaaved Jaaferi
Nana Patekar <-> Kumar Gaurav
Nana Patekar <-> Jaaved Jaaferi
Kumar Gaurav <-> Jaaved Jaaferi
Rosa Navarra <-> Rudy Giuliani
Rosa Navarra <-> D

[('Eric Roberts', 578),
 ('Tom Sizemore', 280),
 ('Michael Madsen', 268),
 ('Danny Trejo', 243),
 ('Joe Hammerstone', 222),
 ('Tony Devon', 216),
 ('Dean Cain', 199),
 ('Lloyd Kaufman', 190),
 ('Joe Estevez', 186),
 ('Nicolas Cage', 172),
 ('Bill Oberst Jr.', 170),
 ('Debbie Rochon', 169),
 ('James Franco', 165),
 ('Tony Todd', 161),
 ('Theodore Bouloukos', 161),
 ('Bruce Willis', 156),
 ('Michael Paré', 151),
 ('Samuel L. Jackson', 149),
 ('Willem Dafoe', 148),
 ('Lance Henriksen', 145)]

In [38]:
centrality_degree = nx.degree_centrality(g)
for actor in sorted(centrality_degree, key=centrality_degree.get, reverse=True)[:20]:
    print(actor, centrality_degree[actor])

Eric Roberts 0.017297620828969025
Tom Sizemore 0.008379470297770463
Michael Madsen 0.008020350142151729
Danny Trejo 0.0072721831512793655
Joe Hammerstone 0.006643722878946581
Tony Devon 0.006464162801137214
Dean Cain 0.005955409247344007
Lloyd Kaufman 0.005686069130629957
Joe Estevez 0.005566362412090379
Nicolas Cage 0.0051473888972018555
Bill Oberst Jr. 0.005087535537932067
Debbie Rochon 0.005057608858297172
James Franco 0.0049379021397575945
Tony Todd 0.004818195421218016
Theodore Bouloukos 0.004818195421218016
Bruce Willis 0.004668562023043544
Michael Paré 0.004518928624869071
Samuel L. Jackson 0.004459075265599282
Willem Dafoe 0.004429148585964387
Lance Henriksen 0.004339368547059704
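
Degree centrality, as computed by `nx.degree_centrality`, is a node's degree divided by the maximum possible degree, n - 1, where n is the number of nodes. For instance, Eric Roberts's degree of 578 and centrality of about 0.0173 are consistent with an actor graph of 33,416 nodes. Since every degree is divided by the same constant, the ranking is identical to the raw degree ranking. The following is a minimal sketch of the normalization on a made-up toy graph.

```python
import networkx as nx

# Toy graph made up for illustration: A is linked to everyone, B and C are also linked
toy_g = nx.Graph()
toy_g.add_edges_from([("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")])

n = toy_g.number_of_nodes()
centrality = nx.degree_centrality(toy_g)

for node, degree in toy_g.degree:
    # Manual normalization: degree over the largest possible degree
    manual = degree / (n - 1)
    print(node, degree, manual, centrality[node])
```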


In [41]:
"""
Create a graph of movies, where each node represents a movie
Add edges between movies that share at least one common actor
Print the top 10 most interconnected movies (i.e., those that share actors with the largest number of other movies)
"""
movie_g = nx.Graph() # Build the graph
with open("imdb_movies_2000to2022.prolific.json", "r") as in_file:
    for line in in_file:
        
        # Load the movie from this line
        this_movie = json.loads(line)

        # Create a node for this movie
        #. Note: nodes are keyed by title, so movies sharing a title collapse into one node
        movie_g.add_node(this_movie['title'], actors=this_movie['actors'])

        # Extract this movie's actor names once, outside the comparison loop
        this_movie_actors = {actor[1] for actor in this_movie['actors']}

        # Add edges between movies that share at least one common actor
        for other_movie in movie_g.nodes:
            # Skip the same movie
            if other_movie == this_movie['title']:
                continue

            # Extract actor names from the other movie's list of actors
            other_movie_actors = {actor[1] for actor in movie_g.nodes[other_movie]['actors']}

            # Add an edge if the two movies share at least one actor
            if this_movie_actors & other_movie_actors:
                movie_g.add_edge(this_movie['title'], other_movie)
            

print("Nodes:", len(movie_g.nodes))
# Print the top 10 movies that are most interconnected
top_10_movies = sorted(movie_g.degree, key=lambda x: x[1], reverse=True)[:10]
for movie, connections in top_10_movies:
    print(f"{movie}: {connections} connections")

Nodes: 19794
Beyond the Game: 377 connections
Blackbird: 360 connections
Hunting Season: 338 connections
The Immortal Wars: 333 connections
The Sector: 316 connections
Dead Ringer: 315 connections
Vice: 309 connections
Luck of the Draw: 307 connections
Spreading Darkness: 305 connections
The Alternate: 303 connections
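
The cell above compares each new movie against every movie already in the graph, so the work grows quadratically with the number of movies. One alternative, not used in this notebook, is to first index movies by actor and then connect the movies that appear under the same actor. The following is a minimal sketch under that assumption; the records in `toy_movies` and the helper name `movies_by_actor` are made up for illustration. Like the notebook, it keys nodes by title, so movies that share a title would still be merged.

```python
from collections import defaultdict
from itertools import combinations

import networkx as nx

# Hypothetical records in the dataset's shape; titles, IDs, and names are made up.
toy_movies = [
    {"title": "Movie 1", "actors": [["id1", "Actor A"], ["id2", "Actor B"]]},
    {"title": "Movie 2", "actors": [["id2", "Actor B"], ["id3", "Actor C"]]},
    {"title": "Movie 3", "actors": [["id4", "Actor D"]]},
]

toy_movie_g = nx.Graph()
movies_by_actor = defaultdict(set)  # actor name -> titles they appear in

for movie in toy_movies:
    toy_movie_g.add_node(movie["title"])
    for _actor_id, actor_name in movie["actors"]:
        movies_by_actor[actor_name].add(movie["title"])

# Every pair of movies listed under the same actor shares that actor
for titles in movies_by_actor.values():
    toy_movie_g.add_edges_from(combinations(sorted(titles), 2))

print(sorted(toy_movie_g.edges))
print(sorted(toy_movie_g.degree, key=lambda pair: pair[1], reverse=True))
```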


In [42]:
centrality_degree_movies = nx.degree_centrality(movie_g)
for movie in sorted(centrality_degree_movies, key=centrality_degree_movies.get, reverse=True)[:10]:
    print(movie, centrality_degree_movies[movie])

Beyond the Game 0.01904713787702723
Blackbird 0.018188248370636085
Hunting Season 0.017076744303541655
The Immortal Wars 0.016824129742838378
The Sector 0.01596524023644723
Dead Ringer 0.015914717324306573
Vice 0.015611579851462639
Luck of the Draw 0.015510534027181326
Spreading Darkness 0.015409488202900015
The Alternate 0.015308442378618704


In [None]:
# If you want to explore this graph in Gephi or some other
#. graph analysis tool, NetworkX makes it easy to export data.
#. Here, we use the GraphML format, which Gephi can read
#. natively; the actor names are stored as the node IDs
nx.write_graphml(g, "actors.graphml")
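
To check what an export preserves, the file can be read back with `nx.read_graphml`. The following is a minimal sketch on a made-up two-node graph written to a temporary directory; the `movies` node attribute is hypothetical and is included only to show that node attributes survive the round trip.

```python
import os
import tempfile

import networkx as nx

# Toy graph made up for illustration
toy_g = nx.Graph()
toy_g.add_edge("Actor A", "Actor B")
toy_g.nodes["Actor A"]["movies"] = 3  # hypothetical node attribute

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_actors.graphml")
    nx.write_graphml(toy_g, path)
    # Read the file back before the temporary directory is removed
    reloaded = nx.read_graphml(path)

print(sorted(reloaded.nodes))
print(reloaded.nodes["Actor A"])
```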