# Spectral analysis of 5000 movies network
### by Macko Vladimir, Novakovic Milica, Pavué Clément, Roussaky Mehdi

## Goal of the Project:


The aim is to build a graph in which nodes represent movies and edges represent the similarity between the movies they connect. The graph is built from the descriptions of 5000 selected movies and is used to classify movies by genre, to suggest the movie that best represents each genre, and to quantify how *mainstream* a movie is.
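As a minimal sketch of the idea, the snippet below weights the edge between two movies by the Jaccard overlap of their genre sets. The choice of genres and of the Jaccard measure is an assumption made for illustration only; the text above does not fix which attributes or which similarity measure define the edges. The genre strings are copied from three rows of the dataset preview in this notebook.

```python
from itertools import combinations

# Genre strings copied from three rows of the dataset preview in this notebook.
genres = {
    "(500) Days of Summer": "Comedy,Drama,Romance",
    "10 Things I Hate About You": "Comedy,Romance,Drama",
    "12 Angry Men": "Drama",
}


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b| (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Assumption: the edge weight is the genre overlap; other attributes could be used instead.
genre_sets = {title: set(text.split(",")) for title, text in genres.items()}
edges = {
    (u, v): jaccard(genre_sets[u], genre_sets[v])
    for u, v in combinations(genre_sets, 2)
}

for (u, v), weight in edges.items():
    print(f"{u} -- {v}: {weight:.2f}")
```

Movies with identical genre sets get weight 1, and movies with no genre in common get weight 0, so the resulting weight matrix can serve directly as a weighted adjacency matrix.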

## Data Acquisition:

The starting point is the TMDB 5000 Movie Dataset, available on Kaggle, which contains information about 5000 selected movies provided by users and reviewers of The Movie Database (TMDb). Each of the selected movies has the following attributes: budget, genres, homepage, id, keywords, original language, original title, overview, popularity, production companies, production countries, release date and revenue, together with cast and crew details listing names, genders and roles or contributions to the movie production, and other details. The majority of the data is text encoded in a JSON structure.

Since JSON-encoded data is not practical to manipulate, a single dataset in a Pandas data-frame structure is prepared from the 2 original data files (CSV files whose fields are JSON-encoded). During this preparation the data is cleaned: columns that are not useful are removed, and corrupted lines (with a missing movie title or other issues) are removed. It was also checked that there are no duplicates in the produced dataset.
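The cleaning steps described above can be sketched with plain pandas calls. This is an illustration on a made-up toy table, not the project's actual cleaning code; the column names `title`, `budget` and `homepage` are used only because they appear in the attribute list.

```python
import pandas as pd

# Toy stand-in for the raw table; the rows are made up for illustration.
raw = pd.DataFrame(
    {
        "title": ["Movie A", None, "Movie B"],
        "budget": [100, 200, 300],
        "homepage": ["http://a.example", None, None],
    }
)

# Remove corrupted lines: here, rows with a missing movie title.
cleaned = raw.dropna(subset=["title"])

# Check (rather than assume) that no title appears twice.
n_duplicates = cleaned["title"].duplicated().sum()
print(f"{len(cleaned)} rows kept, {n_duplicates} duplicated titles")

# With unique titles, the title can serve as the row index.
cleaned = cleaned.set_index("title")
```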

In [2]:
#standard imports
import sys, os, pathlib
#setting the path to folder with modules
sys.path.insert(0, str(pathlib.Path(os.getcwd()).parents[0] / 'python'))
from Load_Datasets import *

#Loading information about movies (transforming JSON files into pandas frame)
FileAddress_movies ="../Datasets/tmdb_5000_movies.csv"
FileAddress_credits="../Datasets/tmdb_5000_credits.csv"

The custom-made function `Load_Datasets` converts the JSON-encoded data from the two input files into one single Pandas dataframe which contains movies as rows and movie attributes as columns. For more details refer to the `Load_Datasets` [implementation](https://github.com/ryancier/FinalProjectNTDS2017/blob/master/python/Load_Datasets.py).
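To illustrate the kind of conversion involved, here is a minimal sketch that flattens one JSON-encoded cell into a comma-separated string. The helper `flatten_json_field` is hypothetical and is not taken from `Load_Datasets`; the shape of the raw cell (a JSON list of objects with `id` and `name` keys) is an assumption about the input files.

```python
import json


def flatten_json_field(raw, key="name"):
    """Join the `key` values of a JSON-encoded list of objects with commas."""
    if not raw:
        return ""
    return ",".join(str(item[key]) for item in json.loads(raw))


# Hypothetical raw cell, in the shape assumed for the `genres` column.
raw_genres = '[{"id": 35, "name": "Comedy"}, {"id": 10751, "name": "Family"}]'

print(flatten_json_field(raw_genres))            # Comedy,Family
print(flatten_json_field(raw_genres, key="id"))  # 35,10751
```

Extracting the `name` and `id` keys separately would yield pairs of columns such as `genres` and `genres_id`.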


After a first examination of the data, the following movie attributes are considered not useful for further analysis: movie id, production status (since the vast majority of movies are released, and those which are not released have no actors published and hence are subsequently removed) and homepage (since the homepage address contains only information which is already in the movie title).

In [3]:
#Loading information about movies
Movies = Load_Datasets(FileAddress_movies,FileAddress_credits)

#Removing the attributes that are not useful for further analysis
#(the columns= keyword replaces the positional axis argument, which recent pandas versions no longer accept)
Drops = ['homepage','status','id']
Movies = Movies.drop(columns=Drops)

#pandas entries contain comma-separated strings, which can be easily converted to lists using string.split(",")
#new datafile is generated
Movies.to_csv("../../Datasets/Transformed.csv")

The final categories are listed below. For each movie there is a list of actors and a list of their genders, a list of crew names with their jobs and departments, and other movie markers such as popularity, revenue, etc.

In [6]:
list(Movies)

['budget',
 'genres',
 'keywords',
 'original_language',
 'original_title',
 'overview',
 'popularity',
 'production_companies',
 'production_countries',
 'release_date',
 'revenue',
 'runtime',
 'spoken_languages',
 'tagline',
 'vote_average',
 'vote_count',
 'genres_id',
 'keywords_id',
 'production_companies_id',
 'actors',
 'actors_id',
 'actor_gender',
 'crew_names',
 'crew_names_id',
 'crew_jobs',
 'crew_departments']

In [4]:
#Example 
print("Avatar 1st crew member is "+Movies['crew_names']['Avatar'].split(",")[0]
      +", he works in the department "+Movies['crew_departments']['Avatar'].split(",")[0]
      +", and his job is "+Movies['crew_jobs']['Avatar'].split(",")[0])

Avatar 1st crew member is Stephen E. Rivkin, he works in the department Editing, and his job is Editor
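Because the crew names, jobs and departments are stored as parallel comma-separated strings, they can be zipped into one record per crew member. The sketch below is a hypothetical helper, not part of the project code; its input strings are copied from the `10th & Wolf` row of the dataset preview. It assumes that no individual name, job or department contains a comma, since such a value would break the alignment.

```python
from collections import namedtuple

CrewMember = namedtuple("CrewMember", ["name", "job", "department"])


def parse_crew(names, jobs, departments):
    """Split three parallel comma-separated strings into CrewMember records."""
    fields = [names.split(","), jobs.split(","), departments.split(",")]
    # Guard against misaligned fields, e.g. caused by a comma inside a value.
    if len({len(f) for f in fields}) != 1:
        raise ValueError("crew fields have different lengths")
    return [CrewMember(*row) for row in zip(*fields)]


# Values copied from the "10th & Wolf" row of the dataset preview.
crew = parse_crew(
    "Robert Moresco,Robert Moresco,Allan Steele",
    "Director,Writer,Writer",
    "Directing,Writing,Writing",
)
directors = [member.name for member in crew if member.job == "Director"]
print(directors)  # ['Robert Moresco']
```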


After removal of all the movies which are incomplete or not fully released, 4809 movie entries remain in the dataset.

In [9]:
Movies

title,budget,genres,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,...,genres_id,keywords_id,production_companies_id,actors,actors_id,actor_gender,crew_names,crew_names_id,crew_jobs,crew_departments
#Horror,1500000,"Drama,Mystery,Horror,Thriller",,de,#Horror,"Inspired by actual events, a group of 12 year ...",2.815228,"AST Studios,Lowland Pictures",United States of America,2015-11-20,...,1896482753,,7527775278,"Taryn Manning,Natasha Lyonne,Chloë Sevigny,Bal...","343,10871,2838,9296,16327,210573,180425,110233...",111221001,"Tara Subkoff,Tara Subkoff,Tara Subkoff,Jason L...",611116111161111138244513824461382448,"Screenplay,Director,Producer,Producer,Producer...","Writing,Directing,Production,Production,Produc..."
(500) Days of Summer,7500000,"Comedy,Drama,Romance","date,sex,jealousy,fight,architect,gallery,inte...",en,(500) Days of Summer,"Tom (Joseph Gordon-Levitt), greeting-card writ...",45.610993,"Fox Searchlight Pictures,Watermark,Dune Entert...",United States of America,2009-07-17,...,351810749,"248,572,931,1721,2301,2861,4434,5923,8508,9673...",4343646332,"Joseph Gordon-Levitt,Zooey Deschanel,Chloë Gra...","24045,11664,56734,5375,5661,9048,56358,96624,9...",21122211122110210010020,"Mychael Danna,Hope Hanafin,Steven J. Wolfe,Mas...","5359,16469,22433,52446,52449,53648,54050,66519...","Original Music Composer,Costume Design,Produce...","Sound,Costume & Make-Up,Production,Production,..."
10 Cloverfield Lane,15000000,"Thriller,Science Fiction,Drama","kidnapping,bunker,paranoia,basement,survivalis...",en,10 Cloverfield Lane,"After a car accident, Michelle awakens to find...",53.698683,"Paramount Pictures,Bad Robot,Spectrum Effects",United States of America,2016-03-10,...,5387818,193023212340986610833123321306315381,41146178177,"Mary Elizabeth Winstead,John Goodman,John Gall...","17628,1230,17487,51329,60881,1354257,8269,1413...",1222201020,"Monika Mikkelsen,J.J. Abrams,Matthew W. Mungle...","2325,15344,23788,69506,59811,66491,92336,13649...","Casting,Producer,Makeup Effects,Director of Ph...","Production,Production,Crew,Camera,Sound,Art,Cr..."
10 Days in a Madhouse,1200000,Drama,"undercover,insane asylum,reporter",en,10 Days in a Madhouse,"Nellie Bly, a 23 year-old reporter for Joseph ...",0.489271,,United States of America,2015-11-20,...,18,1568492412193,,"Caroline Barry,Christopher Lambert,Kelly LeBro...","1478271,38559,46948,1239372,1478272,1478273,14...",0210000000011001000020,"Martin Wiley,Jan Glaser,Strathford Hamilton,Ma...","64468,71716,998473,1011452,1128550,1128550,113...","Producer,Casting,Executive Producer,Executive ...","Production,Production,Production,Production,Di..."
10 Things I Hate About You,16000000,"Comedy,Romance,Drama","shakespeare,sister,high school,cannabis,decept...",en,10 Things I Hate About You,"Bianca, a tenth grader, has never gone on a da...",54.550275,"Mad Chance,Jaret Entertainment,Touchstone Pict...",United States of America,1999-03-30,...,351074918,"497,5923,6270,8224,9758,11870,53994,53995,1561...",175717839195,"Heath Ledger,Julia Stiles,Joseph Gordon-Levitt...","1810,12041,24045,40978,38582,40979,40980,17773...","2,1,2,1,2,2,1,1,2,2,1,2,2,0,2,0,2,0,0,0,0,1,0,...","Charles Graffeo,William Shakespeare,Mark Irwin...","1800,6210,7413,16593,20359,21068,29525,32279,4...","Set Decoration,Theatre Play,Director of Photog...","Art,Writing,Camera,Editing,Costume & Make-Up,S..."
102 Dalmatians,85000000,"Comedy,Family","london england,prison,release from prison,wome...",en,102 Dalmatians,Get ready for a howling good time as an all ne...,9.895061,"Walt Disney Pictures,Cruella Productions",United States of America,2000-10-07,...,3510751,2123783398542464918841890515162158369,210472,"Glenn Close,Ioan Gruffudd,Alice Evans,Tim McIn...",51565524655354104316927,12122,"David Newman,Adrian Biddle,Kevin Lima,Gregory ...","3393,7783,15775,15779,11302,60534,60534,60543,...","Music,Director of Photography,Director,Editor,...","Sound,Camera,Directing,Editing,Production,Writ..."
10th & Wolf,8000000,"Action,Crime,Drama,Mystery,Thriller","undercover,mafia,mobster,crime family",en,10th & Wolf,A former street tough returns to his Philadelp...,3.942464,Thinkfilm,United States of America,2006-02-19,...,288018964853,1568103911157833421,446,"James Marsden,Brian Dennehy,Leo Rossi,Dennis H...","11006,6197,67524,2778,1771,84989,31007,18304,1...",2222222222221,"Robert Moresco,Robert Moresco,Allan Steele",1370641370641331795,"Director,Writer,Writer","Directing,Writing,Writing"
11:14,6000000,"Crime,Drama,Thriller","alcohol,sex,robbery,secret,gun,ambulance,vanda...",en,11:14,Tells the seemingly random yet vitally connect...,15.048067,"Firm Films,Media 8 Entertainment,MDP Worldwide","Canada,United States of America",2003-05-16,...,801853,"567,572,642,1328,1419,1546,2016,3713,5638,6149...",1838222610828,"Henry Thomas,Blake Heron,Barbara Hershey,Hilar...","9976,57127,10767,448,9048,52647,33532,3492,111...",0211222222122,"Hilary Swank,John Morrissey,Mary Vernieu,Clint...","448,816,5914,6377,6890,6892,6893,9543,11455,33...","Executive Producer,Producer,Casting,Original M...","Production,Production,Production,Sound,Product..."
12 Angry Men,350000,Drama,"judge,jurors,sultriness,death penalty,father m...",en,12 Angry Men,The defense and the prosecution have rested an...,59.259204,"United Artists,Orion-Nova Productions",United States of America,1957-03-25,...,18,"934,1417,2118,2122,2123,2124,2142,3012,3772,60...",6010212,"Henry Fonda,Martin Balsam,John Fiedler,Lee J. ...","4958,1936,5247,5248,5249,5250,2651,5251,5252,3...",22222222022220000,"Henry Fonda,Sidney Lumet,Reginald Rose,Reginal...","4958,39996,5246,5246,5246,5259,5260,5261,5262,...","Producer,Director,Screenplay,Producer,Story,As...","Production,Directing,Writing,Production,Writin..."
12 Rounds,20000000,"Action,Adventure,Drama,Thriller","police,cops,cat and mouse,family,revenge drama",en,12 Rounds,When New Orleans cop Danny Fisher prevents a b...,15.661350,"The Mark Gordon Company,Fox Atomic,20th Centur...",United States of America,2009-03-19,...,28121853,614982331507618035210347,1557289036351033917887,"John Cena,Aidan Gillen,Ashley Scott,Steve Harr...","56446,49735,71128,2202,31137,84754,54182,84756...","2,2,1,2,0,2,1,2,2,2,2,2,1,0,0,1,2,2,0,0,0,0,0,...","John Papsidera,Trevor Rabin,Brian Berdan,Mark ...","561,894,3189,6048,16938,27040,41680,50459,5544...","Casting,Music,Editor,Producer,Director,Product...","Production,Sound,Editing,Production,Directing,..."


## Data Exploration

In [None]:
#

## Data Exploitation

In [10]:
#

## Evaluation

In [11]:
#

## Conclusion