# Data analysis of TMDB
In this task I am going to do some data analysis of the movies and TV shows that are in The Movie Database (TMDB).
The insights I am looking for are the following:
- Which directors have made the most movies and/or TV shows.
- Which actors have been the most active.
- Which genres were the most popular per decade.

## Getting the data

At first I was going to just use [TMDB's API](https://developer.themoviedb.org/reference/intro/getting-started), but I ran into a problem. When I tried to request the 501st page, it gave me an error with the text `Invalid page: Pages start at 1 and max at 500. They are expected to be an integer.` This is quite annoying, since the number of pages as of today (16 01 2025) is a whopping `48184`. With the page limit I would only be able to access **just a bit over 1%** of the data using this method. I then discovered the [daily ID](https://developer.themoviedb.org/docs/daily-id-exports) lists that TMDB provides. This was pretty sweet, but sadly I stumbled upon a "me problem" this time: my code was not good enough and I did not know how to improve it further. Approximately 190 hours of processing time to find the directors from just the `People` list? Not acceptable. I then went for something a bit easier: I found this lovely [.csv file](https://www.kaggle.com/datasets/alanvourch/tmdb-movies-daily-updates) that user Alan Vourc'h had uploaded to Kaggle, and that file is what I am going to use for this project.
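
To put the page cap in numbers, here is a quick check of how much of the catalogue 500 pages cover. The page counts are the ones quoted above; the variable names are only for illustration.

```python
# Page counts quoted above (as of 16 01 2025).
max_accessible_pages = 500
total_pages = 48184

# Share of all pages that can be reached through the paged API.
coverage = max_accessible_pages / total_pages
print(f"{coverage:.2%}")
```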

## Start of code
I am going to use Python as my coding language, since it is what I am the most comfortable with.

The very first thing I do is choose my modules for my code:
- I chose `pandas` as my module of choice to handle the .csv file.
- I chose `numpy` for its useful math library.
- I chose `seaborn` for its ability to visualize data in a friendly way.
- I chose `matplotlib.pyplot` for making graphing possible.
- I chose `date` from `datetime` for removing movies that have not been released yet.
- I chose `Counter` from `collections` to go over how often a genre shows up.

In [25]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from datetime import date
from collections import Counter

Since the file itself has many columns that I do not need, I made a list of the ones I do want and passed that list to the `usecols` parameter of pandas' `read_csv`.

In [2]:

fields = ["title","id","popularity","vote_average","vote_count","release_date","revenue","budget","runtime","genres","cast","director","writers","producers","imdb_rating","imdb_votes"] # Put the info you want here

df = pd.read_csv(r"Data\TMDB_all_movies.csv", skipinitialspace=True, usecols=fields)
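
As a small illustration of what `usecols` does, the sketch below reads a tiny in-memory CSV instead of the real file. The `extra_column` name and its values are made up for the example.

```python
import io

import pandas as pd

# A tiny CSV with one column that is not needed (extra_column is made up).
sample_csv = io.StringIO(
    "id,title,extra_column,popularity\n"
    "2,Ariel,foo,12.553\n"
    "5,Four Rooms,bar,19.476\n"
)

wanted = ["id", "title", "popularity"]

# Only the columns listed in `usecols` are parsed and kept.
sample_df = pd.read_csv(sample_csv, usecols=wanted)
print(sample_df.columns.tolist())
```

Skipping unused columns at read time also keeps memory use down, which helps with a file of this size.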

Then, since some rows had `NaN` as their values, I changed those to 0 where it was fitting. (I did not do this for the movies with no credits; those are filtered out.)
I also set the index to be the movie's ID.

In [3]:
#Data cleaning START
df.set_index("id", inplace=True) #Sets the index to be the movie's ID

cols_to_fill = ["vote_average","vote_count","revenue","budget","imdb_rating","imdb_votes"]
df[cols_to_fill] = df[cols_to_fill].fillna(0)
#This replaces NaN values in the listed columns with 0
#(assigning the result back avoids the pandas warning about chained inplace calls)


df["release_date"] = pd.to_datetime(df["release_date"]) #Turns the dates into the datetime format

today = pd.to_datetime(date.today())
df = df[df["release_date"] <= today] #This removes movies in the dataset that are not out yet.

df.drop_duplicates(inplace = True)

df.dropna(inplace=True)
#Data cleaning END
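
The cleaning steps can be tried out on a toy frame first. This is only a minimal sketch: the rows are invented, and `errors="coerce"` is an added assumption (that some dates might be malformed), not something the real file is known to need.

```python
import pandas as pd

# Invented rows: one missing vote count and one movie that is not out yet.
toy = pd.DataFrame(
    {
        "title": ["Old movie", "No votes", "Not out yet"],
        "vote_count": [10.0, None, 3.0],
        "release_date": ["1988-10-21", "1995-12-09", "2200-01-01"],
    }
)

# Assigning the result back avoids calling fillna(inplace=True) on a column copy.
toy["vote_count"] = toy["vote_count"].fillna(0)

# errors="coerce" (assumption) turns unparsable dates into NaT instead of raising.
toy["release_date"] = pd.to_datetime(toy["release_date"], errors="coerce")

# Keep only movies released on or before today.
today = pd.Timestamp.today().normalize()
released = toy[toy["release_date"] <= today]
print(released["title"].tolist())
```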


## The First Question

To find out who has directed the most movies, and how many movies they have directed, I do a quick query on the `director` column.

In [4]:
df["director"].describe(include="object")

count             225889
unique            106871
top       David DeCoteau
freq                 142
Name: director, dtype: object

Here I look at `top` in the output to find that it is David DeCoteau who has directed the most movies, and at `freq` to see that he has directed 142 of them.

## The Second Question

Now to find the person who has acted in the most movies. This is found in a very similar way to the first question, but now I will use the column `cast`.

In [5]:
df["cast"].describe(include="object")

count        225889
unique       223313
top       Mel Blanc
freq            139
Name: cast, dtype: object

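And it is Mel Blanc who tops the `cast` column, with 139 movies. Note that `describe` compares whole cells, so these are the movies where he is the only listed cast member. The same applies to the `director` column, where movies with several directors are counted under the combined string.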
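
A possible way to count each person separately is to split the comma-separated names and explode them into one row per name. The sketch below shows this on an invented frame. `count_people` is a hypothetical helper, and it assumes the names are separated by a comma followed by a space.

```python
import pandas as pd


def count_people(frame: pd.DataFrame, column: str) -> pd.Series:
    """Count how often each individual name appears in a comma-separated column."""
    names = frame[column].dropna().str.split(", ").explode().str.strip()
    return names.value_counts()


# Invented rows, only to show the mechanics.
toy = pd.DataFrame(
    {
        "director": ["Ann Lee", "Ann Lee, Bo Chen", "Bo Chen", "Ann Lee"],
        "cast": ["X Actor, Y Actor", "Y Actor", "Y Actor, Z Actor", "X Actor"],
    }
)

director_counts = count_people(toy, "director")
cast_counts = count_people(toy, "cast")

# The first entry of each Series is the most frequent name.
print(director_counts.head(1))
print(cast_counts.head(1))
```

On the real data this would be called as `count_people(df, "director")` or `count_people(df, "cast")`. The results may differ from the `describe` output above, since shared credits are then counted for every person involved.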

## The Third Question

Now to find out which genres are the most popular per decade, I first assign each movie to the decade it was released in.

In [22]:
df["decade"] = (df["release_date"].dt.year // 10) * 10
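
Integer division by 10 drops the last digit of the year, and multiplying by 10 puts a zero back in its place. A quick check on three release dates taken from the dataset:

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["1988-10-21", "1995-12-09", "2004-09-02"]))

# For example: 1988 // 10 == 198, and 198 * 10 == 1980
decades = (dates.dt.year // 10) * 10
print(decades.tolist())
```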

I then run `df.head()` to make sure that the new column `decade` has been added.

In [7]:
df.head()

| id | title | vote_average | vote_count | release_date | revenue | runtime | budget | popularity | genres | cast | director | writers | producers | imdb_rating | imdb_votes | decade |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | Ariel | 7.1 | 338.0 | 1988-10-21 | 0.0 | 73.0 | 0.0 | 12.553 | Comedy, Drama, Romance, Crime | Jyrki Olsonen, Jaakko Talaskivi, Olli Varja, H... | Aki Kaurismäki | Aki Kaurismäki | Aki Kaurismäki | 7.4 | 8892.0 | 1980 |
| 3 | Shadows in Paradise | 7.3 | 402.0 | 1986-10-17 | 0.0 | 74.0 | 0.0 | 16.411 | Comedy, Drama, Romance | Sirkka Silin, Helmeri Pellonpää, Tanja Talaski... | Aki Kaurismäki | Aki Kaurismäki | Mika Kaurismäki | 7.5 | 7681.0 | 1980 |
| 5 | Four Rooms | 5.9 | 2650.0 | 1995-12-09 | 4257354.0 | 98.0 | 4000000.0 | 19.476 | Comedy | Lili Taylor, Valeria Golino, Marisa Tomei, Mar... | Quentin Tarantino, Robert Rodriguez, Allison A... | Quentin Tarantino, Robert Rodriguez, Allison A... | Quentin Tarantino, Lawrence Bender, Alexandre ... | 6.7 | 113177.0 | 1990 |
| 6 | Judgment Night | 6.5 | 333.0 | 1993-10-15 | 12136938.0 | 109.0 | 21000000.0 | 12.11 | Action, Crime, Thriller | Stephen Dorff, Everlast, Will Zahrn, Emilio Es... | Stephen Hopkins | Jere Cunningham, Lewis Colick | Gene Levy, Marilyn Vance, Lloyd Segan | 6.6 | 19493.0 | 1990 |
| 9 | Sunday in August | 7.135 | 26.0 | 2004-09-02 | 0.0 | 15.0 | 0.0 | 2.729 | Drama | Rita Lengyel, Milton Welsh | Marc Meyer, Anna Haas | Marc Meyer | Marc Meyer | 6.8 | 14.0 | 2000 |


And it has!
I will now go through each movie and find its genres. I am going to test a few different approaches for this:
1. I will tally up the genres for each movie, so if a movie is a `drama` & `comedy` movie that was released in the decade `1980`, the decade `1980` will get `1 drama movie` "point" and `1 comedy movie` "point".
2. Similar to 1., but instead each movie will count as a fraction (`1/n`, where `n` is the number of genres that movie has). So for the same movie as in method 1., the decade `1980` will get `0.5 drama movie` "points" and `0.5 comedy movie` "points".
3. Similar to 1., but this time the `popularity` column comes into play. If the movie in method 1. has a `popularity` score of `10`, the decade `1980` will get `10 drama movie` "points" and `10 comedy movie` "points".
4. Similar to 3., but this time based on method 2., so the popularity is divided by the number of genres. The decade `1980` will then get `5 drama movie` "points" and `5 comedy movie` "points".
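
The four scoring rules can be sketched in one small function. This is an illustration on invented movies rather than the code used further down; `genre_points` and its parameters are hypothetical names.

```python
from collections import Counter


def genre_points(genres: str, popularity: float, fractional: bool, weighted: bool) -> Counter:
    """Return the points one movie gives to each of its genres."""
    names = genres.split(", ")
    # Methods 3 and 4 start from the popularity score, methods 1 and 2 from 1 point.
    points = popularity if weighted else 1.0
    # Methods 2 and 4 share the points equally between the movie's genres.
    if fractional:
        points /= len(names)
    return Counter({name: points for name in names})


# The example movie from the list above: two genres, popularity 10.
movie = {"genres": "Drama, Comedy", "popularity": 10.0}

method_1 = genre_points(**movie, fractional=False, weighted=False)
method_2 = genre_points(**movie, fractional=True, weighted=False)
method_3 = genre_points(**movie, fractional=False, weighted=True)
method_4 = genre_points(**movie, fractional=True, weighted=True)

# Summing the per-movie counters gives the totals for a decade (here with method 4).
decade_total = Counter()
for m in [movie, {"genres": "Drama", "popularity": 4.0}]:
    decade_total.update(genre_points(**m, fractional=True, weighted=True))
print(decade_total)
```

Keeping the two switches separate makes it easy to see that method 4 is simply methods 2 and 3 combined.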

I start by finding all the different decades in the data set.

In [30]:
print(df["decade"].unique())

[1980 1990 2000 1970 1940 1920 1960 1930 1950 2010 1910 1900 2020 1890]


Now with the different decades, I go through the movies decade by decade, starting with the earliest one. I split each movie's `genres` string and use pandas' `explode()` to get every genre on its own row. Then I use `Counter` to count how many times each genre shows up, add the result to a dictionary, and go on to the next decade. I repeat this until there are no more decades.

In [38]:
def first_way() -> dict:
    """Counts how often each genre shows up per decade (method 1).

    For each decade, the `genres` strings are split on ", " and exploded so that
    every genre gets its own row, then `Counter` tallies how often each genre appears.

    returns:
        dict: maps each decade to a dict of {genre: number of movies}
    """

    genre_popularity_per_decade = {}

    for i in sorted(df["decade"].unique()):
        temporary_df = df[df["decade"] == i]
        #I first use the pandas .explode() method to get each genre on its own row
        all_genres_that_decade = temporary_df["genres"].str.split(", ").explode()
        genre_count = Counter(all_genres_that_decade)
        genre_popularity_per_decade[int(i)] = dict(genre_count)

    return genre_popularity_per_decade

first_way()

{1890: {'Comedy': 5,
  'Horror': 3,
  'Fantasy': 4,
  'Drama': 1,
  'Family': 1,
  'Romance': 1},
 1900: {'Adventure': 6,
  'Science Fiction': 3,
  'Comedy': 15,
  'Fantasy': 18,
  'Horror': 6,
  'Western': 1,
  'Crime': 5,
  'Action': 3,
  'Family': 2,
  'Drama': 26,
  'History': 8,
  'War': 1,
  'Romance': 1,
  'Mystery': 1,
  'Thriller': 1},
 1910: {'Drama': 571,
  'History': 44,
  'War': 34,
  'Romance': 91,
  'Comedy': 291,
  'Horror': 39,
  'Science Fiction': 19,
  'Fantasy': 28,
  'Adventure': 58,
  'Mystery': 18,
  'Action': 23,
  'Documentary': 7,
  'Crime': 50,
  'Western': 141,
  'Family': 8,
  'Animation': 8,
  'Thriller': 12},
 1920: {'Drama': 1056,
  'Science Fiction': 16,
  'Documentary': 16,
  'Thriller': 38,
  'Crime': 121,
  'Horror': 74,
  'Fantasy': 49,
  'Romance': 362,
  'History': 66,
  'War': 51,
  'Comedy': 644,
  'Action': 86,
  'Adventure': 134,
  'Mystery': 50,
  'Music': 31,
  'Western': 119,
  'Family': 19,
  'Animation': 16},
 1930: {'Documentary': 74,
  