# Project: Investigate the TMDB 5000 Movie Dataset from Kaggle
## Overview
In this project, we'll analyze a data set containing information on about 10,000 movies collected from The Movie Database (TMDb), including user ratings and revenue. In particular, we're interested in trends in movie genre and popularity over time, as well as the properties associated with high-revenue movies.

In [23]:
# Loading the data and the appropriate libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

In [5]:
movie_df = pd.read_csv('tmdb-movies.csv')
movie_df.head()

Unnamed: 0,id,imdb_id,popularity,budget,revenue,original_title,cast,homepage,director,tagline,...,overview,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj
0,135397,tt0369610,32.985763,150000000,1513528810,Jurassic World,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,http://www.jurassicworld.com/,Colin Trevorrow,The park is open.,...,Twenty-two years after the events of Jurassic ...,124,Action|Adventure|Science Fiction|Thriller,Universal Studios|Amblin Entertainment|Legenda...,6/9/15,5562,6.5,2015,137999900.0,1392446000.0
1,76341,tt1392190,28.419936,150000000,378436354,Mad Max: Fury Road,Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic...,http://www.madmaxmovie.com/,George Miller,What a Lovely Day.,...,An apocalyptic story set in the furthest reach...,120,Action|Adventure|Science Fiction|Thriller,Village Roadshow Pictures|Kennedy Miller Produ...,5/13/15,6185,7.1,2015,137999900.0,348161300.0
2,262500,tt2908446,13.112507,110000000,295238201,Insurgent,Shailene Woodley|Theo James|Kate Winslet|Ansel...,http://www.thedivergentseries.movie/#insurgent,Robert Schwentke,One Choice Can Destroy You,...,Beatrice Prior must confront her inner demons ...,119,Adventure|Science Fiction|Thriller,Summit Entertainment|Mandeville Films|Red Wago...,3/18/15,2480,6.3,2015,101200000.0,271619000.0
3,140607,tt2488496,11.173104,200000000,2068178225,Star Wars: The Force Awakens,Harrison Ford|Mark Hamill|Carrie Fisher|Adam D...,http://www.starwars.com/films/star-wars-episod...,J.J. Abrams,Every generation has a story.,...,Thirty years after defeating the Galactic Empi...,136,Action|Adventure|Science Fiction|Fantasy,Lucasfilm|Truenorth Productions|Bad Robot,12/15/15,5292,7.5,2015,183999900.0,1902723000.0
4,168259,tt2820852,9.335014,190000000,1506249360,Furious 7,Vin Diesel|Paul Walker|Jason Statham|Michelle ...,http://www.furious7.com/,James Wan,Vengeance Hits Home,...,Deckard Shaw seeks revenge against Dominic Tor...,137,Action|Crime|Thriller,Universal Pictures|Original Film|Media Rights ...,4/1/15,2947,7.3,2015,174799900.0,1385749000.0


In [4]:
movie_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 21 columns):
id                      10866 non-null int64
imdb_id                 10856 non-null object
popularity              10866 non-null float64
budget                  10866 non-null int64
revenue                 10866 non-null int64
original_title          10866 non-null object
cast                    10790 non-null object
homepage                2936 non-null object
director                10822 non-null object
tagline                 8042 non-null object
keywords                9373 non-null object
overview                10862 non-null object
runtime                 10866 non-null int64
genres                  10843 non-null object
production_companies    9836 non-null object
release_date            10866 non-null object
vote_count              10866 non-null int64
vote_average            10866 non-null float64
release_year            10866 non-null int64
budget_adj              1

The dataset contains the following features:<br>

id - A unique identifier for each movie.<br>
imdb_id - The movie's identifier on IMDb.<br>
popularity - A numeric quantity specifying the movie's popularity.<br>
budget - The budget with which the movie was made.<br>
revenue - The worldwide revenue generated by the movie.<br>
original_title - The title of the movie before translation or adaptation.<br>
cast - The names of the lead and supporting actors.<br>
homepage - A link to the homepage of the movie.<br>
director - The name of the director.<br>
tagline - The movie's tagline.<br>
keywords - The keywords or tags related to the movie.<br>
overview - A brief description of the movie.<br>
runtime - The running time of the movie in minutes.<br>
genres - The genres of the movie (Action, Comedy, Thriller, etc.).<br>
production_companies - The production companies behind the movie.<br>
release_date - The date on which it was released.<br>
vote_count - The number of votes received.<br>
vote_average - The average rating the movie received.<br>
release_year - The year in which it was released.<br>
budget_adj - The budget adjusted for inflation.<br>
revenue_adj - The revenue adjusted for inflation.<br>
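
The non-null counts reported by `info()` can also be read as per-column missing counts, which makes sparse columns easier to spot. The following is a minimal sketch on a small made-up frame; `toy_df` and its values are hypothetical stand-ins for `movie_df`, not data from the actual file.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for movie_df; the values are invented for illustration
toy_df = pd.DataFrame({
    "homepage": ["http://example.com/", np.nan, np.nan],
    "genres": ["Action|Drama", np.nan, "Comedy"],
    "runtime": [124, 120, 119],
})

# Count missing entries per column, most incomplete column first
missing_counts = toy_df.isnull().sum().sort_values(ascending=False)
print(missing_counts)
```

Applied to `movie_df`, the same call should rank columns such as `homepage` near the top, in line with the low non-null count shown by `info()`.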

## First analysis plan: Which genres are most popular from year to year?
For the first analysis, I plan to track the popularity of each genre over time. The goal is a line graph with one line per genre in the dataset.
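
One possible way to reach that kind of graph is sketched below on made-up data. It is only an illustration of the idea, not necessarily the approach taken in the rest of this notebook: `toy_df` and its values are hypothetical, using the mean of `popularity` as the per-genre measure is an assumption, and `DataFrame.explode` requires pandas 0.25 or later.

```python
import pandas as pd

# Hypothetical stand-in for movie_df; the values are invented for illustration
toy_df = pd.DataFrame({
    "release_year": [1960, 1960, 1961, 1961],
    "genres": ["Action|Drama", "Drama", "Action", "Comedy|Drama"],
    "popularity": [1.0, 3.0, 2.0, 4.0],
})

# Split the '|'-delimited string into a list, then emit one row per (movie, genre)
long_df = toy_df.assign(genres=toy_df["genres"].str.split("|")).explode("genres")

# Mean popularity per year and genre, reshaped so each genre is a column
genre_by_year = (
    long_df.groupby(["release_year", "genres"])["popularity"]
    .mean()
    .unstack("genres")
)
print(genre_by_year)
```

In a notebook, calling `genre_by_year.plot()` on the result would draw one line per genre column; years in which a genre has no movies show up as gaps (NaN).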

To start off with, let's drop the columns that won't be useful to our investigation:

In [29]:
# Drop identifier and free-text columns that are not needed for the genre analysis
movie_df.drop(columns=['id', 'imdb_id', 'original_title', 'homepage', 'overview', 'cast', 'director', 'tagline'], inplace=True)

Most, if not all, movie entries list multiple genres, separated by a '|' delimiter.<br><br>
Let's look at the frequency of the data in the 'release_year' column:

In [30]:
pd.DataFrame(movie_df['release_year'].value_counts()).sort_index()

Unnamed: 0,release_year
1960,32
1961,31
1962,32
1963,34
1964,42
1965,35
1966,46
1967,40
1968,39
1969,31


The dataset appears to contain movies released in every year from 1960 to 2015, and the number of movies per year generally increases in more recent years. This means we can include every year from 1960 to 2015 in our assessment without worrying about blank years.
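
The "no blank years" observation can also be checked programmatically rather than by reading the table. Below is a minimal sketch on a made-up series; `toy_years` is hypothetical and deliberately has a gap so the check has something to report.

```python
import pandas as pd

# Hypothetical release years with a deliberate gap at 1962
toy_years = pd.Series([1960, 1961, 1961, 1963], name="release_year")

# Every year between the earliest and latest release, inclusive
expected_years = set(range(int(toy_years.min()), int(toy_years.max()) + 1))

# Years in that range with no movie at all
missing_years = sorted(expected_years - set(toy_years.tolist()))
print(missing_years)
```

Running the same check on `movie_df['release_year']` should give an empty list if every year in the range is represented.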

Now let's separate out the genres:

In [31]:
total_genres = set()

def getUniqueGenres(genre_data):
    split_genres = str(genre_data).split("|")
    for genre in split_genres:
        total_genres.add(genre)

# Skip missing entries (NaN) so that no 'nan' string ends up in the set
movie_df["genres"].dropna().apply(getUniqueGenres)
# total_genres now holds every unique genre; convert it to a sorted list
total_genres = sorted(total_genres)
total_genres

['Action',
 'Adventure',
 'Animation',
 'Comedy',
 'Crime',
 'Documentary',
 'Drama',
 'Family',
 'Fantasy',
 'Foreign',
 'History',
 'Horror',
 'Music',
 'Mystery',
 'Romance',
 'Science Fiction',
 'TV Movie',
 'Thriller',
 'War',
 'Western']
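
As an alternative to collecting the genres by hand, pandas can split the delimited strings into indicator columns directly. The sketch below uses `Series.str.get_dummies` on a made-up series; `toy_genres` is hypothetical, and this is offered as one option rather than the method used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical genre strings, including one missing entry
toy_genres = pd.Series(["Action|Drama", "Drama", np.nan, "Comedy|Drama"])

# One 0/1 column per genre; the missing entry becomes a row of zeros
genre_flags = toy_genres.str.get_dummies(sep="|")

# The column labels are the unique genres, in sorted order
unique_genres = list(genre_flags.columns)

# Summing each column counts how many movies carry that genre
movies_per_genre = genre_flags.sum()
print(unique_genres)
print(movies_per_genre)
```

The indicator frame shares its index with the original series, so it can be joined back onto the movie data when per-genre filtering is needed.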

Let's query the movies that were released in 1960:

In [32]:
movie_df.query('release_year == 1960')

Unnamed: 0,popularity,budget,revenue,keywords,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj
10141,2.610362,806948,32000000,hotel|clerk|arizona|shower|rain,109,Drama|Horror|Thriller,Shamley Productions,8/14/60,1180,8.0,1960,5949601.0,235935000.0
10142,1.872132,2000000,4905000,horse|village|friendship|remake|number in title,128,Action|Adventure|Western,The Mirisch Corporation|Alpha Productions,10/23/60,224,7.0,1960,14745930.0,36164410.0
10143,1.136943,12000000,60000000,gladiator|roman empire|gladiator fight|slavery...,197,Action|Drama|History,Bryna Productions,10/6/60,211,6.9,1960,88475610.0,442378000.0
10144,0.947307,3000000,25000000,new york|new year's eve|lovesickness|age diffe...,125,Comedy|Drama|Romance,United Artists|The Mirisch Company,6/15/60,235,7.9,1960,22118900.0,184324200.0
10145,0.875173,0,0,musical,114,Comedy|Romance,Twentieth Century Fox Film Corporation|The Com...,10/7/60,15,4.9,1960,0.0,0.0
10146,0.712389,750000,0,london|inventor|future|time travel|dystopia,103,Thriller|Adventure|Fantasy|Science Fiction|Rom...,George Pal Productions|Galaxy Films Inc.,8/17/60,101,7.3,1960,5529726.0,0.0
10147,0.569424,0,0,cinematographer|photography|illegal prostitution,101,Horror|Thriller,National Film Finance Corporation (NFFC)|Anglo...,4/6/60,56,7.4,1960,0.0,0.0
10148,0.465879,0,0,tree house|island|shipwreck|pirate gang|swiss ...,126,Adventure|Family,Buena Vista Pictures,12/21/60,47,6.9,1960,0.0,0.0
10149,0.423531,0,0,gambling|perfect crime|casino|new year's eve|r...,127,Thriller|Music|Comedy|Crime,Warner Bros.|Dorchester,8/10/60,39,6.6,1960,0.0,0.0
10150,0.421043,0,0,indian|texas|farm|siblings|saddle,125,Action|Drama|Western,James Productions,1/1/60,17,4.9,1960,0.0,0.0
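
The same kind of query can be parameterized and combined with a genre filter. The sketch below is an assumption about how one might proceed, not the notebook's own next step: `toy_df`, `year`, and the choice of "Drama" are hypothetical, and `@year` is the `query` syntax for referring to a local Python variable.

```python
import pandas as pd

# Hypothetical stand-in for movie_df; the values are invented for illustration
toy_df = pd.DataFrame({
    "release_year": [1960, 1960, 1961],
    "genres": ["Drama|Horror|Thriller", "Action|Adventure|Western", "Drama"],
    "popularity": [2.6, 1.8, 0.9],
})

year = 1960
# '@year' refers to the local variable defined above
movies_in_year = toy_df.query("release_year == @year")

# regex=False treats the genre name as a literal substring
is_drama = movies_in_year["genres"].str.contains("Drama", regex=False)
drama_mean = movies_in_year.loc[is_drama, "popularity"].mean()
print(len(movies_in_year), drama_mean)
```

Because this is a substring match, it is only safe when no genre name is contained in another; in the real data, rows with a missing `genres` value would also need `na=False` passed to `str.contains`.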
