<a href="https://colab.research.google.com/github/JohnnySolo/Data-Analysis-Project---Blockbuster-Movies/blob/main/Data_Analysis_Project_Blockbuster_Movies.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Datasets for Blockbuster Movie Analysis

To build a comprehensive dataset for analyzing blockbuster movies, we'll combine information from multiple sources. Below are some datasets that align with our project requirements:

---

**1. Movie Data Analysis Dataset**  
- Details about 7,668 movies, including:
  - Titles, ratings, genres, release years
  - IMDb scores, votes
  - Directors, writers, main stars
  - Production countries, budgets, gross earnings
  - Production companies, runtimes  
- **Source**: [GitHub Repository](https://github.com/1tannu5/Movie-Data-Analysis?utm_source=chatgpt.com)

---

**2. Global Movie Franchise Revenue and Budget Data**  
- Comprehensive data on movie franchises worldwide between 2000–2020:
  - Lifetime gross, budget, rating
  - Runtime, release date, vote count/average  
- **Source**: [Kaggle Dataset](https://www.kaggle.com/datasets/thedevastator/global-movie-franchise-revenue-and-budget-data?utm_source=chatgpt.com)

---

**3. TMDB 5000 Movies Dataset**  
- Information on over 5,000 movies:
  - Budget, cast, director
  - Keywords, runtime, genres
  - Production companies, release dates  
- **Source**: [Hugging Face Dataset](https://huggingface.co/datasets/AiresPucrs/tmdb-5000-movies/blob/main/README.md?utm_source=chatgpt.com)

---

**4. Complete Movie Metadata Dataset**  
- Data on over 722,000 movies, including:
  - ID, title, genres, budget, revenue  
- Suitable for analyzing trends in movie popularity, production companies, budgets, and revenues.  
- **Source**: [Gigasheet Dataset](https://www.gigasheet.com/sample-data/movies-daily-update-dataset?utm_source=chatgpt.com)

---

**5. Movie Revenue Analysis Dataset**  
- Approx. 5,800 movies released between 1915 and 2020:
  - Domestic and worldwide gross revenues
  - Production budgets, release dates  
- **Source**: [GitHub Repository](https://github.com/ntdoris/movie-revenue-analysis?utm_source=chatgpt.com)

---



### Dataset Acquisition

In [2]:
# 1. Movie Data Analysis Dataset

# Clone the repository and load the CSV
!git clone https://github.com/1tannu5/Movie-Data-Analysis.git
import pandas as pd

# Load the dataset
movie_data_path = "Movie-Data-Analysis/movie.csv"
movie_data = pd.read_csv(movie_data_path)

# Display the first few rows
movie_data.head()

Cloning into 'Movie-Data-Analysis'...
remote: Enumerating objects: 12, done.[K
remote: Counting objects: 100% (12/12), done.[K
remote: Compressing objects: 100% (11/11), done.[K
remote: Total 12 (delta 2), reused 0 (delta 0), pack-reused 0 (from 0)[K
Receiving objects: 100% (12/12), 612.93 KiB | 7.30 MiB/s, done.
Resolving deltas: 100% (2/2), done.


Unnamed: 0,name,rating,genre,year,released,score,votes,director,writer,star,country,budget,gross,company,runtime
0,The Shining,R,Drama,1980,"June 13, 1980 (United States)",8.4,927000.0,Stanley Kubrick,Stephen King,Jack Nicholson,United Kingdom,19000000.0,46998772.0,Warner Bros.,146.0
1,The Blue Lagoon,R,Adventure,1980,"July 2, 1980 (United States)",5.8,65000.0,Randal Kleiser,Henry De Vere Stacpoole,Brooke Shields,United States,4500000.0,58853106.0,Columbia Pictures,104.0
2,Star Wars: Episode V - The Empire Strikes Back,PG,Action,1980,"June 20, 1980 (United States)",8.7,1200000.0,Irvin Kershner,Leigh Brackett,Mark Hamill,United States,18000000.0,538375067.0,Lucasfilm,124.0
3,Airplane!,PG,Comedy,1980,"July 2, 1980 (United States)",7.7,221000.0,Jim Abrahams,Jim Abrahams,Robert Hays,United States,3500000.0,83453539.0,Paramount Pictures,88.0
4,Caddyshack,R,Comedy,1980,"July 25, 1980 (United States)",7.3,108000.0,Harold Ramis,Brian Doyle-Murray,Chevy Chase,United States,6000000.0,39846344.0,Orion Pictures,98.0


In [4]:
# 2. Global Movie Franchise Revenue and Budget Data

# Install Kaggle API if not already installed
!pip install kaggle

# Upload your Kaggle API key (json file)
from google.colab import files
files.upload()

# Download the dataset
!mkdir -p ~/.kaggle
!mv kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
!kaggle datasets download -d thedevastator/global-movie-franchise-revenue-and-budget-data

# Unzip and load the dataset
!unzip global-movie-franchise-revenue-and-budget-data.zip -d movie_franchise_data
movie_franchise_path = "movie_franchise_data/<name_of_file>.csv"
movie_franchise_data = pd.read_csv(movie_franchise_path)

# Display the first few rows
movie_franchise_data.head()


mv: cannot stat 'kaggle.json': No such file or directory
chmod: cannot access '/root/.kaggle/kaggle.json': No such file or directory
Dataset URL: https://www.kaggle.com/datasets/thedevastator/global-movie-franchise-revenue-and-budget-data
License(s): other
Downloading global-movie-franchise-revenue-and-budget-data.zip to /content
  0% 0.00/8.61k [00:00<?, ?B/s]
100% 8.61k/8.61k [00:00<00:00, 15.6MB/s]
Archive:  global-movie-franchise-revenue-and-budget-data.zip
  inflating: movie_franchise_data/MovieFranchises.csv  


FileNotFoundError: [Errno 2] No such file or directory: 'movie_franchise_data/<name_of_file>.csv'

In [5]:
# 3. TMDB 5000 Movies Dataset

# Install the datasets library
!pip install datasets

from datasets import load_dataset

# Load the dataset
tmdb_dataset = load_dataset("AiresPucrs/tmdb-5000-movies", split="train")
tmdb_df = pd.DataFrame(tmdb_dataset)

# Display the first few rows
tmdb_df.head()


Collecting datasets
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.2.0-py3-none-any.whl (480 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m480.6/480.6 kB[0m [31m9.2 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading dill-0.3.8-py3-none-any.whl (116 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m10.1 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading fsspec-2024.9.0-py3-none-any.whl (

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md:   0%|          | 0.00/1.62k [00:00<?, ?B/s]

(…)-00000-of-00001-6db04ab1c75d6817.parquet:   0%|          | 0.00/13.9M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/4803 [00:00<?, ? examples/s]

Unnamed: 0,id,budget,genres,homepage,keywords,original_language,original_title,overview,popularity,production_companies,...,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,cast,crew
0,5,4000000,"[{""id"": 80, ""name"": ""Crime""}, {""id"": 35, ""name...",,"[{""id"": 612, ""name"": ""hotel""}, {""id"": 613, ""na...",en,Four Rooms,It's Ted the Bellhop's first night on the job....,22.87623,"[{""name"": ""Miramax Films"", ""id"": 14}, {""name"":...",...,4300000,98.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,Twelve outrageous guests. Four scandalous requ...,Four Rooms,6.5,530,"[{""cast_id"": 42, ""character"": ""Ted the Bellhop...","[{""credit_id"": ""52fe420dc3a36847f800012d"", ""de..."
1,11,11000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 28, ""...",http://www.starwars.com/films/star-wars-episod...,"[{""id"": 803, ""name"": ""android""}, {""id"": 4270, ...",en,Star Wars,Princess Leia is captured and held hostage by ...,126.393695,"[{""name"": ""Lucasfilm"", ""id"": 1}, {""name"": ""Twe...",...,775398007,121.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"A long time ago in a galaxy far, far away...",Star Wars,8.1,6624,"[{""cast_id"": 3, ""character"": ""Luke Skywalker"",...","[{""credit_id"": ""52fe420dc3a36847f8000437"", ""de..."
2,12,94000000,"[{""id"": 16, ""name"": ""Animation""}, {""id"": 10751...",http://movies.disney.com/finding-nemo,"[{""id"": 494, ""name"": ""father son relationship""...",en,Finding Nemo,"Nemo, an adventurous young clownfish, is unexp...",85.688789,"[{""name"": ""Pixar Animation Studios"", ""id"": 3}]",...,940335536,100.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"There are 3.7 trillion fish in the ocean, they...",Finding Nemo,7.6,6122,"[{""cast_id"": 8, ""character"": ""Marlin (voice)"",...","[{""credit_id"": ""52fe420ec3a36847f80006b1"", ""de..."
3,13,55000000,"[{""id"": 35, ""name"": ""Comedy""}, {""id"": 18, ""nam...",,"[{""id"": 422, ""name"": ""vietnam veteran""}, {""id""...",en,Forrest Gump,A man with a low IQ has accomplished great thi...,138.133331,"[{""name"": ""Paramount Pictures"", ""id"": 4}]",...,677945399,142.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"The world will never be the same, once you've ...",Forrest Gump,8.2,7927,"[{""cast_id"": 7, ""character"": ""Forrest Gump"", ""...","[{""credit_id"": ""52fe420ec3a36847f800076b"", ""de..."
4,14,15000000,"[{""id"": 18, ""name"": ""Drama""}]",http://www.dreamworks.com/ab/,"[{""id"": 255, ""name"": ""male nudity""}, {""id"": 29...",en,American Beauty,"Lester Burnham, a depressed suburban father in...",80.878605,"[{""name"": ""DreamWorks SKG"", ""id"": 27}, {""name""...",...,356296601,122.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,Look closer.,American Beauty,7.9,3313,"[{""cast_id"": 6, ""character"": ""Lester Burnham"",...","[{""credit_id"": ""52fe420ec3a36847f8000809"", ""de..."


In [7]:
# 4. Complete Movie Metadata Dataset

# Download and load dataset manually
!wget <direct_download_link_to_dataset> -O movie_metadata.csv

# Load the dataset
movie_metadata = pd.read_csv("movie_metadata.csv")

# Display the first few rows
movie_metadata.head()


FileNotFoundError: [Errno 2] No such file or directory: 'movie_metadata.csv'

In [8]:
# 5. Movie Revenue Analysis Dataset

# Clone the repository and load the CSV
!git clone https://github.com/ntdoris/movie-revenue-analysis.git

# Load the dataset
revenue_data_path = "movie-revenue-analysis/<name_of_file>.csv"
revenue_data = pd.read_csv(revenue_data_path)

# Display the first few rows
revenue_data.head()


Cloning into 'movie-revenue-analysis'...
remote: Enumerating objects: 85, done.[K
remote: Counting objects:   3% (1/31)[Kremote: Counting objects:   6% (2/31)[Kremote: Counting objects:   9% (3/31)[Kremote: Counting objects:  12% (4/31)[Kremote: Counting objects:  16% (5/31)[Kremote: Counting objects:  19% (6/31)[Kremote: Counting objects:  22% (7/31)[Kremote: Counting objects:  25% (8/31)[Kremote: Counting objects:  29% (9/31)[Kremote: Counting objects:  32% (10/31)[Kremote: Counting objects:  35% (11/31)[Kremote: Counting objects:  38% (12/31)[Kremote: Counting objects:  41% (13/31)[Kremote: Counting objects:  45% (14/31)[Kremote: Counting objects:  48% (15/31)[Kremote: Counting objects:  51% (16/31)[Kremote: Counting objects:  54% (17/31)[Kremote: Counting objects:  58% (18/31)[Kremote: Counting objects:  61% (19/31)[Kremote: Counting objects:  64% (20/31)[Kremote: Counting objects:  67% (21/31)[Kremote: Counting objects:  70% (22/31)[Kre

FileNotFoundError: [Errno 2] No such file or directory: 'movie-revenue-analysis/<name_of_file>.csv'