<a href="https://colab.research.google.com/github/JohnnySolo/Data-Analysis-Project---Blockbuster-Movies/blob/main/Data_Analysis_Project_Blockbuster_Movies.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Datasets for Blockbuster Movie Analysis

To build a comprehensive dataset for analyzing blockbuster movies, we'll combine information from multiple sources. Below are some datasets that align with our project requirements:

---

**1. Movie Data Analysis Dataset**  
- Details about 7,668 movies, including:
  - Titles, ratings, genres, release years
  - IMDb scores, votes
  - Directors, writers, main stars
  - Production countries, budgets, gross earnings
  - Production companies, runtimes  
- **Source**: [GitHub Repository](https://github.com/1tannu5/Movie-Data-Analysis?utm_source=chatgpt.com)

---

**2. Global Movie Franchise Revenue and Budget Data**  
- Data on movie franchises worldwide, described by the source as covering 2000–2020 (the preview loaded below also includes earlier releases, e.g. 1977):
  - Lifetime gross, budget, rating
  - Runtime, release date, vote count/average  
- **Source**: [Kaggle Dataset](https://www.kaggle.com/datasets/thedevastator/global-movie-franchise-revenue-and-budget-data?utm_source=chatgpt.com)

---

**3. TMDB 5000 Movies Dataset**  
- Information on over 5,000 movies:
  - Budget, cast, director
  - Keywords, runtime, genres
  - Production companies, release dates  
- **Source**: [Hugging Face Dataset](https://huggingface.co/datasets/AiresPucrs/tmdb-5000-movies/blob/main/README.md?utm_source=chatgpt.com)

---

**4. Complete Movie Metadata Dataset**  
- Data on over 722,000 movies, including:
  - ID, title, genres, budget, revenue  
- Suitable for analyzing trends in movie popularity, production companies, budgets, and revenues.  
- **Source**: [Gigasheet Dataset](https://www.gigasheet.com/sample-data/movies-daily-update-dataset?utm_source=chatgpt.com)

---

**5. Movie Revenue Analysis Dataset**  
- Approx. 5,800 movies released between 1915 and 2020:
  - Domestic and worldwide gross revenues
  - Production budgets, release dates  
- **Source**: [GitHub Repository](https://github.com/ntdoris/movie-revenue-analysis?utm_source=chatgpt.com)

---
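
Combining these sources comes down to joining tables on a shared key. The sketch below is one possible approach, not the project's final method: it assumes title plus release year is a usable key, and it uses a few toy rows copied from the previews in the Dataset Acquisition section. The helper `normalize_title` and the column `title_key` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Toy rows copied from the dataset previews; not the full tables.
movies = pd.DataFrame({
    "name": ["The Shining", "Star Wars: Episode V - The Empire Strikes Back"],
    "year": [1980, 1980],
    "score": [8.4, 8.7],
})
franchises = pd.DataFrame({
    "Title": ["Star Wars: Episode V - The Empire Strikes Back", "Jurassic Park"],
    "Year": [1980, 1993],
    "FranchiseID": [101.0, 102.0],
})


def normalize_title(titles: pd.Series) -> pd.Series:
    """Hypothetical helper: trim and lowercase titles before matching."""
    return titles.str.strip().str.lower()


movies["title_key"] = normalize_title(movies["name"])
franchises_keyed = franchises.rename(columns={"Year": "year"})
franchises_keyed["title_key"] = normalize_title(franchises_keyed["Title"])

# A left join keeps every movie; indicator=True records which rows found a match.
combined = movies.merge(
    franchises_keyed[["title_key", "year", "FranchiseID"]],
    on=["title_key", "year"],
    how="left",
    indicator=True,
)
print(combined[["name", "year", "FranchiseID", "_merge"]])
```

The `_merge` column makes unmatched rows easy to inspect, which helps when titles are spelled differently across sources.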



### Dataset Acquisition

In [46]:
# 1. Movie Data Analysis Dataset

!wget https://raw.githubusercontent.com/JohnnySolo/Data-Analysis-Project---Blockbuster-Movies/main/movie.csv -O movie.csv

# Load the CSV file
import pandas as pd
data1 = pd.read_csv("movie.csv")
data1.head()


--2025-01-12 20:13:41--  https://raw.githubusercontent.com/JohnnySolo/Data-Analysis-Project---Blockbuster-Movies/main/movie.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1294548 (1.2M) [text/plain]
Saving to: ‘movie.csv’


2025-01-12 20:13:41 (24.3 MB/s) - ‘movie.csv’ saved [1294548/1294548]



Unnamed: 0,name,rating,genre,year,released,score,votes,director,writer,star,country,budget,gross,company,runtime
0,The Shining,R,Drama,1980,"June 13, 1980 (United States)",8.4,927000.0,Stanley Kubrick,Stephen King,Jack Nicholson,United Kingdom,19000000.0,46998772.0,Warner Bros.,146.0
1,The Blue Lagoon,R,Adventure,1980,"July 2, 1980 (United States)",5.8,65000.0,Randal Kleiser,Henry De Vere Stacpoole,Brooke Shields,United States,4500000.0,58853106.0,Columbia Pictures,104.0
2,Star Wars: Episode V - The Empire Strikes Back,PG,Action,1980,"June 20, 1980 (United States)",8.7,1200000.0,Irvin Kershner,Leigh Brackett,Mark Hamill,United States,18000000.0,538375067.0,Lucasfilm,124.0
3,Airplane!,PG,Comedy,1980,"July 2, 1980 (United States)",7.7,221000.0,Jim Abrahams,Jim Abrahams,Robert Hays,United States,3500000.0,83453539.0,Paramount Pictures,88.0
4,Caddyshack,R,Comedy,1980,"July 25, 1980 (United States)",7.3,108000.0,Harold Ramis,Brian Doyle-Murray,Chevy Chase,United States,6000000.0,39846344.0,Orion Pictures,98.0
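
In the preview above, the `released` column packs a date and a country into one string, such as `June 13, 1980 (United States)`. A minimal sketch of splitting the two, assuming every value follows that `Month day, year (Country)` pattern; rows that do not match end up as missing values rather than raising. The column names `date_text`, `release_date`, and `release_country` are hypothetical.

```python
import pandas as pd

# Two values copied from the preview above.
released = pd.Series([
    "June 13, 1980 (United States)",
    "July 2, 1980 (United States)",
])

# Capture the date text and the parenthesized country as separate columns.
parts = released.str.extract(
    r"^(?P<date_text>.+?)\s*\((?P<release_country>[^)]+)\)\s*$"
)

# errors="coerce" turns unparseable date text into NaT instead of failing.
parts["release_date"] = pd.to_datetime(
    parts["date_text"], format="%B %d, %Y", errors="coerce"
)
print(parts[["release_date", "release_country"]])
```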


In [40]:
# 2. Global Movie Franchise Revenue and Budget Data

!wget https://raw.githubusercontent.com/JohnnySolo/Data-Analysis-Project---Blockbuster-Movies/main/MovieFranchises.csv -O MovieFranchises.csv
import pandas as pd
data2 = pd.read_csv("MovieFranchises.csv")
data2.head()

--2025-01-12 20:10:16--  https://raw.githubusercontent.com/JohnnySolo/Data-Analysis-Project---Blockbuster-Movies/main/MovieFranchises.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 26322 (26K) [text/plain]
Saving to: ‘MovieFranchises.csv’


2025-01-12 20:10:16 (23.5 MB/s) - ‘MovieFranchises.csv’ saved [26322/26322]



Unnamed: 0,index,MovieID,Title,Lifetime Gross,Year,Studio,Rating,Runtime,Budget,ReleaseDate,VoteAvg,VoteCount,FranchiseID
0,0,1001,Star Wars: Episode IV - A New Hope,775398007,1977,Lucasfilm,PG,121.0,11000000.0,05-25-77,4.09,96233.0,101.0
1,1,1002,Star Wars: Episode V - The Empire Strikes Back,538375067,1980,Lucasfilm,PG,124.0,18000000.0,06-20-80,4.12,79231.0,101.0
2,2,1003,Star Wars: Episode VI - Return of the Jedi,475106177,1983,Lucasfilm,PG,135.0,32500000.0,05-25-83,3.98,76082.0,101.0
3,3,1004,Jurassic Park,1109802321,1993,Universal Pictures,PG-13,127.0,63000000.0,06-11-93,3.69,82700.0,102.0
4,4,1005,The Lost World: Jurassic Park,618638999,1997,Universal Pictures,PG-13,129.0,73000000.0,05-23-97,3.01,19721.0,102.0
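
The `ReleaseDate` values above use two-digit years (`05-25-77`). With the `%y` directive, two-digit years 69 to 99 are read as 1969 to 1999 and 00 to 68 as 2000 to 2068, so older films could land in the wrong century. The sketch below shows one way to parse the column and use `Year` as a cross-check. The first three rows come from the preview; the last row is a hypothetical pre-1969 release added only to trigger the correction.

```python
import pandas as pd

franchise_sample = pd.DataFrame({
    "Year": [1977, 1980, 1993, 1965],
    # The last entry is hypothetical, not taken from the dataset.
    "ReleaseDate": ["05-25-77", "06-20-80", "06-11-93", "12-25-65"],
})

parsed = pd.to_datetime(
    franchise_sample["ReleaseDate"], format="%m-%d-%y", errors="coerce"
)

# If the parsed year is a century ahead of the Year column, shift it back.
century_off = (parsed.dt.year - franchise_sample["Year"]) >= 100
parsed = parsed.where(~century_off, parsed - pd.DateOffset(years=100))

franchise_sample["release_date"] = parsed
print(franchise_sample)
```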


In [41]:
# 3. TMDB 5000 Movies Dataset

!pip install datasets

from datasets import load_dataset
import pandas as pd

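# Load the TMDB dataset from Hugging Face and convert it to a pandas DataFrame
dataset = load_dataset("AiresPucrs/tmdb-5000-movies", split="train")
data3 = dataset.to_pandas()

# Preview the first few rows. A bare expression is rendered only when it is
# the last line of a cell, so this call produces no output here.
data3.head()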

# Save the DataFrame to a CSV file
data3.to_csv("tmdb_movies.csv", index=False)

# Confirm the file exists in the current directory
import os
os.listdir()





['movie.csv', 'tmdb_movies.csv', 'drive', 'MovieFranchises.csv', 'sample_data']

In [42]:
# 4. Complete Movie Metadata Dataset

from google.colab import drive
drive.mount('/content/drive')

import pandas as pd
file_path = '/content/drive/My Drive/Tel Aviv University/BSc Statistics and Operations Research/4th Year - Winter semester/Database Systems/movies.csv'  # Adjust path as needed
data4 = pd.read_csv(file_path)
data4.head()

# Save the DataFrame to a CSV file
data4.to_csv("movies.csv", index=False)

# Confirm the file exists in the current directory
import os
os.listdir()

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


['movies.csv',
 'movie.csv',
 'tmdb_movies.csv',
 'drive',
 'MovieFranchises.csv',
 'sample_data']

In [43]:
# 5. Movie Revenue Analysis Dataset

!wget https://raw.githubusercontent.com/JohnnySolo/Data-Analysis-Project---Blockbuster-Movies/main/final_dataset.csv -O final_dataset.csv
import pandas as pd
data5 = pd.read_csv("final_dataset.csv")
data5.head()


--2025-01-12 20:12:38--  https://raw.githubusercontent.com/JohnnySolo/Data-Analysis-Project---Blockbuster-Movies/main/final_dataset.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 456039 (445K) [text/plain]
Saving to: ‘final_dataset.csv’


2025-01-12 20:12:38 (11.4 MB/s) - ‘final_dataset.csv’ saved [456039/456039]



Unnamed: 0.1,Unnamed: 0,movie,year,production_budget,domestic_gross,foreign_gross,worldwide_gross,month,profit,profit_margin,...,History,Horror,Music,Mystery,Romance,Science Fiction,TV Movie,Thriller,War,Western
0,0,Avatar,2009,425000000,760507625,2015837654,2776345279,12,2351345279,0.846921,...,0,0,0,0,0,1,0,0,0,0
1,1,Pirates of the Caribbean: On Stranger Tides,2011,410600000,241063875,804600000,1045663875,5,635063875,0.607331,...,0,0,0,0,0,0,0,0,0,0
2,2,Avengers: Age of Ultron,2015,330600000,459005868,944008095,1403013963,5,1072413963,0.764364,...,0,0,0,0,0,1,0,0,0,0
3,3,Avengers: Infinity War,2018,300000000,678815482,1369318718,2048134200,4,1748134200,0.853525,...,0,0,0,0,0,0,0,0,0,0
4,4,Justice League,2017,300000000,229024295,426920914,655945209,11,355945209,0.542645,...,0,0,0,0,0,1,0,0,0,0
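
Judging from the rows above, `profit` appears to equal `worldwide_gross - production_budget`, and `profit_margin` appears to equal `profit / worldwide_gross`. This is inferred from the preview, not from the dataset's documentation, so it is worth checking before relying on either column. A small sketch using two of the rows shown:

```python
import numpy as np
import pandas as pd

# Values copied from the preview above (Avatar and Justice League).
revenue_sample = pd.DataFrame({
    "movie": ["Avatar", "Justice League"],
    "production_budget": [425_000_000, 300_000_000],
    "worldwide_gross": [2_776_345_279, 655_945_209],
    "profit": [2_351_345_279, 355_945_209],
    "profit_margin": [0.846921, 0.542645],
})

recomputed_profit = (
    revenue_sample["worldwide_gross"] - revenue_sample["production_budget"]
)
recomputed_margin = recomputed_profit / revenue_sample["worldwide_gross"]

# Exact comparison for integer profits; tolerance for the rounded margins.
profit_matches = bool((recomputed_profit == revenue_sample["profit"]).all())
margin_matches = bool(
    np.allclose(recomputed_margin, revenue_sample["profit_margin"], atol=1e-6)
)
print(profit_matches, margin_matches)
```

Running the same comparison over the full `data5` frame would show whether the relationship holds for every row.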


In [47]:
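# Bind descriptive names to the loaded DataFrames. These are aliases for the
# same objects, not copies. Several names do not match the dataset contents
# described earlier, so each comment states what the variable actually holds.

movie_franchises = data1  # Data 1: Movie Data Analysis Dataset (movie.csv)
print("Renamed Data 1 to 'movie_franchises'.")

movie_metadata = data2  # Data 2: Global Movie Franchise Revenue and Budget Data (MovieFranchises.csv)
print("Renamed Data 2 to 'movie_metadata'.")

tmdb_movies = data3  # Data 3: TMDB 5000 Movies Dataset
print("Renamed Data 3 to 'tmdb_movies'.")

movie_quotes = data4  # Data 4: Complete Movie Metadata Dataset (movies.csv); it contains no quotes
print("Renamed Data 4 to 'movie_quotes'.")

movie_revenues = data5  # Data 5: Movie Revenue Analysis Dataset (final_dataset.csv)
print("Renamed Data 5 to 'movie_revenues'.")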


import os

# Define the mapping of old file names to new file names
file_renames = {
    "movie.csv": "movie_franchises.csv",            # Data 1
    "MovieFranchises.csv": "movie_metadata.csv",    # Data 2
    "tmdb_movies.csv": "tmdb_movies.csv",           # Data 3
    "movies.csv": "movie_quotes.csv",               # Data 4
    "final_dataset.csv": "movie_revenues.csv"       # Data 5
}

# Rename files in the working directory
for old_name, new_name in file_renames.items():
    if os.path.exists(old_name):
        os.rename(old_name, new_name)
        print(f"Renamed {old_name} to {new_name}")
    else:
        print(f"{old_name} does not exist.")



Renamed Data 1 to 'movie_franchises'.
Renamed Data 2 to 'movie_metadata'.
Renamed Data 3 to 'tmdb_movies'.
Renamed Data 4 to 'movie_quotes'.
Renamed Data 5 to 'movie_revenues'.
Renamed movie.csv to movie_franchises.csv
Renamed MovieFranchises.csv to movie_metadata.csv
Renamed tmdb_movies.csv to tmdb_movies.csv
Renamed movies.csv to movie_quotes.csv
Renamed final_dataset.csv to movie_revenues.csv
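
Re-running the rename cell reports every already-renamed file as missing, and the `tmdb_movies.csv` entry renames a file to itself. One possible variant, sketched with `pathlib`, distinguishes those cases. The function `rename_files` and its status labels are hypothetical, and the demo runs in a temporary directory so it does not touch the notebook's files.

```python
import tempfile
from pathlib import Path


def rename_files(mapping: dict[str, str], directory: Path) -> dict[str, str]:
    """Hypothetical helper: rename files and report a status per entry."""
    status = {}
    for old_name, new_name in mapping.items():
        source, target = directory / old_name, directory / new_name
        if old_name == new_name:
            # Nothing to do; only report whether the file is present.
            status[old_name] = "unchanged" if source.exists() else "missing"
        elif source.exists():
            source.rename(target)
            status[old_name] = "renamed"
        elif target.exists():
            # The source is gone but the target exists: an earlier run did it.
            status[old_name] = "already renamed"
        else:
            status[old_name] = "missing"
    return status


with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    (workdir / "movie.csv").write_text("name\n")
    (workdir / "tmdb_movies.csv").write_text("title\n")
    mapping = {
        "movie.csv": "movie_franchises.csv",
        "tmdb_movies.csv": "tmdb_movies.csv",
        "final_dataset.csv": "movie_revenues.csv",
    }
    first_run = rename_files(mapping, workdir)
    second_run = rename_files(mapping, workdir)

print(first_run)
print(second_run)
```

Returning a status dictionary instead of printing inside the loop keeps the helper easy to check on a second run.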
