# Exploratory Data Analysis

This notebook is an exploratory data analysis (EDA) of **The Movies Dataset** from Kaggle (https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset/data?select=movies_metadata.csv), specifically the `movies_metadata.csv` file.

In [26]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [27]:
df = pd.read_csv("../data/movies_metadata.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  45466 non-null  object 
 1   belongs_to_collection  4494 non-null   object 
 2   budget                 45466 non-null  object 
 3   genres                 45466 non-null  object 
 4   homepage               7782 non-null   object 
 5   id                     45466 non-null  object 
 6   imdb_id                45449 non-null  object 
 7   original_language      45455 non-null  object 
 8   original_title         45466 non-null  object 
 9   overview               44512 non-null  object 
 10  popularity             45461 non-null  object 
 11  poster_path            45080 non-null  object 
 12  production_companies   45463 non-null  object 
 13  production_countries   45463 non-null  object 
 14  release_date           45379 non-null  object 
 15  re



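The `df.info()` output above reports `id`, `budget`, and `popularity` as `object` even though the values shown for them look numeric, which suggests that some entries may not parse as numbers. A minimal sketch of one way to coerce such columns follows. It runs on a small toy frame whose values, including the malformed ones, are made up for illustration; it is not the notebook's actual cleaning step.

```python
import pandas as pd

# Toy frame mimicking the object-typed columns reported by df.info().
# All values are hypothetical, including the deliberately malformed entries.
toy = pd.DataFrame(
    {
        "id": ["862", "8844", "not-an-id"],
        "budget": ["30000000", "65000000", "not-a-number"],
        "popularity": ["21.946943", "17.015539", None],
    }
)

numeric_columns = ["id", "budget", "popularity"]
for column in numeric_columns:
    # errors="coerce" turns anything unparsable into NaN instead of raising.
    toy[column] = pd.to_numeric(toy[column], errors="coerce")

# Inspect the resulting dtypes and how many values were lost to coercion.
print(toy.dtypes)
print(toy.isna().sum())
```

Counting the `NaN` values after coercion gives a quick estimate of how many rows are malformed before deciding whether to drop or repair them.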
In [28]:
df.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0
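In the preview above, `genres` appears to be stored as a string that looks like a Python list of dictionaries. The sketch below shows one way to turn such strings into plain lists of genre names with `ast.literal_eval`. The helper `parse_genre_names` and the toy rows are hypothetical and only mirror the format visible in the preview.

```python
import ast

import pandas as pd


def parse_genre_names(raw):
    """Return the genre names found in a stringified list of dicts.

    Anything that cannot be parsed (NaN, malformed text) yields an empty list.
    """
    try:
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError, TypeError):
        return []
    if not isinstance(parsed, list):
        return []
    return [
        item["name"]
        for item in parsed
        if isinstance(item, dict) and "name" in item
    ]


# Toy rows in the same format as the genres column (hypothetical values).
toy = pd.DataFrame(
    {
        "genres": [
            "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]",
            "[{'id': 35, 'name': 'Comedy'}]",
            "[]",
        ]
    }
)

toy["genre_names"] = toy["genres"].apply(parse_genre_names)
print(toy["genre_names"].tolist())
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for text read from a file.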


In [29]:
df.columns

Index(['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
       'imdb_id', 'original_language', 'original_title', 'overview',
       'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'video',
       'vote_average', 'vote_count'],
      dtype='object')

### Discard useless columns

Many columns are unlikely to be useful for our models, for example:
- adult: we are not interested in filtering content right now
- production_companies: I don't 
- homepage: Not relevant for our model
- etc...

Why did we choose the remaining columns? These are my assumptions:
- id: Essential if we use the other files in The Movies Dataset.
- imdb_id: May be unnecessary for the recommender, but it would be useful for our front end later on.
- budget: Can be indirectly correlated with popularity or user ratings.
- genres: Users often prefer movies within specific genres.
- original_language: Relevant if the user's language preferences are known.
- overview: Relevant for the LLM.
- popularity: Relevant as it could help recommend movies that are widely liked.
- release_date: Relevant if users like newer movies or have an era preference.
- revenue: Indirectly relevant as it could indicate a movie's success and thus its appeal to a broad audience.
- runtime: Relevant if the user prefers a certain length.
- vote_average, vote_count: Highly relevant as indicators of a movie's overall reception and quality.

In [30]:
useful_columns = [
    "id",
    "imdb_id",
    "title",
    "budget",
    "genres",
    "original_language",
    "overview",
    "popularity",
    "release_date",
    "revenue",
    "runtime",
    "vote_average",
    "vote_count",
]

df = df[useful_columns].reset_index(drop=True)
df.head()

Unnamed: 0,id,imdb_id,title,budget,genres,original_language,overview,popularity,release_date,revenue,runtime,vote_average,vote_count
0,862,tt0114709,Toy Story,30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",en,"Led by Woody, Andy's toys live happily in his ...",21.946943,1995-10-30,373554033.0,81.0,7.7,5415.0
1,8844,tt0113497,Jumanji,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",en,When siblings Judy and Peter discover an encha...,17.015539,1995-12-15,262797249.0,104.0,6.9,2413.0
2,15602,tt0113228,Grumpier Old Men,0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",en,A family wedding reignites the ancient feud be...,11.7129,1995-12-22,0.0,101.0,6.5,92.0
3,31357,tt0114885,Waiting to Exhale,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",en,"Cheated on, mistreated and stepped on, the wom...",3.859495,1995-12-22,81452156.0,127.0,6.1,34.0
4,11862,tt0113041,Father of the Bride Part II,0,"[{'id': 35, 'name': 'Comedy'}]",en,Just when George Banks has recovered from his ...,8.387519,1995-02-10,76578911.0,106.0,5.7,173.0
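The preview above contains rows with a `budget` of 0 and a `revenue` of 0.0, which may mean "not reported" rather than a true zero. The sketch below assumes that interpretation, which this notebook has not verified, and also parses `release_date` into a datetime so that a release year can be derived. The toy frame copies a few values from the preview and already uses numeric columns; the `release_year` column is a hypothetical addition.

```python
import pandas as pd

# Toy frame with values copied from the preview above.
toy = pd.DataFrame(
    {
        "title": ["Toy Story", "Grumpier Old Men", "Father of the Bride Part II"],
        "budget": [30000000, 0, 0],
        "revenue": [373554033.0, 0.0, 76578911.0],
        "release_date": ["1995-10-30", "1995-12-22", "1995-02-10"],
    }
)

# Assumption: a value of 0 means the figure is missing, so mask it as NaN.
for column in ["budget", "revenue"]:
    toy[column] = toy[column].mask(toy[column] == 0)

# Unparsable dates become NaT instead of raising an error.
toy["release_date"] = pd.to_datetime(toy["release_date"], errors="coerce")
toy["release_year"] = toy["release_date"].dt.year

print(toy)
```

In the real frame `budget` is still an `object` column, so it would need to be converted to a numeric dtype before the comparison with 0 behaves as shown here.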


## Let's visualize the dataset