<a href="https://colab.research.google.com/github/JohnnySolo/Data-Analysis-Project---Blockbuster-Movies/blob/main/data_analysis_project_blockbuster_movies.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Datasets for Blockbuster Movie Analysis

To build a comprehensive dataset for analyzing blockbuster movies, we'll combine information from multiple sources. Below are some datasets that align with our project requirements:

---

**1. Movie Data Analysis Dataset**  
- Details about 7,668 movies, including:
  - Titles, ratings, genres, release years
  - IMDb scores, votes
  - Directors, writers, main stars
  - Production countries, budgets, gross earnings
  - Production companies, runtimes  
- **Source**: [GitHub Repository](https://github.com/1tannu5/Movie-Data-Analysis?utm_source=chatgpt.com)

---

**2. Global Movie Franchise Revenue and Budget Data**  
- Comprehensive data on movie franchises worldwide between 2000–2020:
  - Lifetime gross, budget, rating
  - Runtime, release date, vote count/average  
- **Source**: [Kaggle Dataset](https://www.kaggle.com/datasets/thedevastator/global-movie-franchise-revenue-and-budget-data?utm_source=chatgpt.com)

---

**3. TMDB 5000 Movies Dataset**  
- Information on over 5,000 movies:
  - Budget, cast, director
  - Keywords, runtime, genres
  - Production companies, release dates  
- **Source**: [Hugging Face Dataset](https://huggingface.co/datasets/AiresPucrs/tmdb-5000-movies/blob/main/README.md?utm_source=chatgpt.com)

---

**4. Complete Movie Metadata Dataset**  
- Data on over 722,000 movies, including:
  - ID, title, genres, budget, revenue  
- Suitable for analyzing trends in movie popularity, production companies, budgets, and revenues.  
- **Source**: [Gigasheet Dataset](https://www.gigasheet.com/sample-data/movies-daily-update-dataset?utm_source=chatgpt.com)

---

**5. Movie Revenue Analysis Dataset**  
- Approx. 5,800 movies released between 1915 and 2020:
  - Domestic and worldwide gross revenues
  - Production budgets, release dates  
- **Source**: [GitHub Repository](https://github.com/ntdoris/movie-revenue-analysis?utm_source=chatgpt.com)

---



### Dataset Acquisition

In [None]:
# 1. Movie Data Analysis Dataset

!wget https://raw.githubusercontent.com/JohnnySolo/Data-Analysis-Project---Blockbuster-Movies/main/movie.csv -O movie.csv

# Load the CSV file
import pandas as pd
data1 = pd.read_csv("movie.csv")
data1.head()


--2025-01-19 11:56:06--  https://raw.githubusercontent.com/JohnnySolo/Data-Analysis-Project---Blockbuster-Movies/main/movie.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1294548 (1.2M) [text/plain]
Saving to: ‘movie.csv’


2025-01-19 11:56:06 (30.0 MB/s) - ‘movie.csv’ saved [1294548/1294548]



Unnamed: 0,name,rating,genre,year,released,score,votes,director,writer,star,country,budget,gross,company,runtime
0,The Shining,R,Drama,1980,"June 13, 1980 (United States)",8.4,927000.0,Stanley Kubrick,Stephen King,Jack Nicholson,United Kingdom,19000000.0,46998772.0,Warner Bros.,146.0
1,The Blue Lagoon,R,Adventure,1980,"July 2, 1980 (United States)",5.8,65000.0,Randal Kleiser,Henry De Vere Stacpoole,Brooke Shields,United States,4500000.0,58853106.0,Columbia Pictures,104.0
2,Star Wars: Episode V - The Empire Strikes Back,PG,Action,1980,"June 20, 1980 (United States)",8.7,1200000.0,Irvin Kershner,Leigh Brackett,Mark Hamill,United States,18000000.0,538375067.0,Lucasfilm,124.0
3,Airplane!,PG,Comedy,1980,"July 2, 1980 (United States)",7.7,221000.0,Jim Abrahams,Jim Abrahams,Robert Hays,United States,3500000.0,83453539.0,Paramount Pictures,88.0
4,Caddyshack,R,Comedy,1980,"July 25, 1980 (United States)",7.3,108000.0,Harold Ramis,Brian Doyle-Murray,Chevy Chase,United States,6000000.0,39846344.0,Orion Pictures,98.0


In [None]:
# 2. Global Movie Franchise Revenue and Budget Data

!wget https://raw.githubusercontent.com/JohnnySolo/Data-Analysis-Project---Blockbuster-Movies/main/MovieFranchises.csv -O MovieFranchises.csv
import pandas as pd
data2 = pd.read_csv("MovieFranchises.csv")
data2.head()

--2025-01-19 11:56:12--  https://raw.githubusercontent.com/JohnnySolo/Data-Analysis-Project---Blockbuster-Movies/main/MovieFranchises.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 26322 (26K) [text/plain]
Saving to: ‘MovieFranchises.csv’


2025-01-19 11:56:12 (10.6 MB/s) - ‘MovieFranchises.csv’ saved [26322/26322]



Unnamed: 0,index,MovieID,Title,Lifetime Gross,Year,Studio,Rating,Runtime,Budget,ReleaseDate,VoteAvg,VoteCount,FranchiseID
0,0,1001,Star Wars: Episode IV - A New Hope,775398007,1977,Lucasfilm,PG,121.0,11000000.0,05-25-77,4.09,96233.0,101.0
1,1,1002,Star Wars: Episode V - The Empire Strikes Back,538375067,1980,Lucasfilm,PG,124.0,18000000.0,06-20-80,4.12,79231.0,101.0
2,2,1003,Star Wars: Episode VI - Return of the Jedi,475106177,1983,Lucasfilm,PG,135.0,32500000.0,05-25-83,3.98,76082.0,101.0
3,3,1004,Jurassic Park,1109802321,1993,Universal Pictures,PG-13,127.0,63000000.0,06-11-93,3.69,82700.0,102.0
4,4,1005,The Lost World: Jurassic Park,618638999,1997,Universal Pictures,PG-13,129.0,73000000.0,05-23-97,3.01,19721.0,102.0


In [None]:
# 3. TMDB 5000 Movies Dataset

!pip install datasets

from datasets import load_dataset
import pandas as pd

# Load the TMDB dataset from Hugging Face
dataset = load_dataset("AiresPucrs/tmdb-5000-movies", split="train")
data3 = pd.DataFrame(dataset)

# Display the first few rows
data3.head()

# Save the DataFrame to a CSV file
data3.to_csv("tmdb_movies.csv", index=False)

# Confirm the file exists in the current directory
import os
os.listdir()



Collecting datasets
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.2.0-py3-none-any.whl (480 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 480.6/480.6 kB 12.9 MB/s eta 0:00:00
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 116.3/116.3 kB 10.0 MB/s eta 0:00:00
[?25hDownloading fsspec-2024.9.0-py3-none-any.whl 

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md:   0%|          | 0.00/1.62k [00:00<?, ?B/s]

(…)-00000-of-00001-6db04ab1c75d6817.parquet:   0%|          | 0.00/13.9M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/4803 [00:00<?, ? examples/s]

['.config',
 'movie.csv',
 'MovieFranchises.csv',
 'tmdb_movies.csv',
 'sample_data']

In [None]:
# 4. Complete Movie Metadata Dataset

from google.colab import drive
drive.mount('/content/drive')

import pandas as pd
file_path = '/content/drive/My Drive/Tel Aviv University/BSc Statistics and Operations Research/4th Year - Winter semester/Database Systems/movies.csv'  # Adjust path as needed
data4 = pd.read_csv(file_path)
data4.head()

# Save the DataFrame to a CSV file
data4.to_csv("movies.csv", index=False)

# Confirm the file exists in the current directory
import os
os.listdir()

Mounted at /content/drive


['.config',
 'movie.csv',
 'MovieFranchises.csv',
 'drive',
 'movies.csv',
 'tmdb_movies.csv',
 'sample_data']

In [None]:
# 5. Movie Revenue Analysis Dataset

!wget https://raw.githubusercontent.com/JohnnySolo/Data-Analysis-Project---Blockbuster-Movies/main/final_dataset.csv -O final_dataset.csv
import pandas as pd
data5 = pd.read_csv("final_dataset.csv")
data5.head()


--2025-01-19 11:58:03--  https://raw.githubusercontent.com/JohnnySolo/Data-Analysis-Project---Blockbuster-Movies/main/final_dataset.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 456039 (445K) [text/plain]
Saving to: ‘final_dataset.csv’


2025-01-19 11:58:04 (12.9 MB/s) - ‘final_dataset.csv’ saved [456039/456039]



Unnamed: 0.1,Unnamed: 0,movie,year,production_budget,domestic_gross,foreign_gross,worldwide_gross,month,profit,profit_margin,...,History,Horror,Music,Mystery,Romance,Science Fiction,TV Movie,Thriller,War,Western
0,0,Avatar,2009,425000000,760507625,2015837654,2776345279,12,2351345279,0.846921,...,0,0,0,0,0,1,0,0,0,0
1,1,Pirates of the Caribbean: On Stranger Tides,2011,410600000,241063875,804600000,1045663875,5,635063875,0.607331,...,0,0,0,0,0,0,0,0,0,0
2,2,Avengers: Age of Ultron,2015,330600000,459005868,944008095,1403013963,5,1072413963,0.764364,...,0,0,0,0,0,1,0,0,0,0
3,3,Avengers: Infinity War,2018,300000000,678815482,1369318718,2048134200,4,1748134200,0.853525,...,0,0,0,0,0,0,0,0,0,0
4,4,Justice League,2017,300000000,229024295,426920914,655945209,11,355945209,0.542645,...,0,0,0,0,0,1,0,0,0,0


In [None]:
# Assign descriptive aliases to the loaded DataFrames.
# NOTE: the alias names are kept because later cells depend on them, but they do
# not all match the dataset contents; each comment states what the alias holds.

movie_franchises = data1  # Data 1: Movie Data Analysis Dataset (movie.csv)
print("Renamed Data 1 to 'movie_franchises'.")

movie_metadata = data2  # Data 2: Global Movie Franchise Revenue and Budget Data (MovieFranchises.csv)
print("Renamed Data 2 to 'movie_metadata'.")

tmdb_movies = data3  # Data 3: TMDB 5000 Movies Dataset
print("Renamed Data 3 to 'tmdb_movies'.")

movie_quotes = data4  # Data 4: Complete Movie Metadata Dataset (movies.csv)
print("Renamed Data 4 to 'movie_quotes'.")

movie_revenues = data5  # Data 5: Movie Revenue Analysis Dataset (final_dataset.csv)
print("Renamed Data 5 to 'movie_revenues'.")


import os

# Define the mapping of old file names to new file names
file_renames = {
    "movie.csv": "movie_franchises.csv",            # Data 1
    "MovieFranchises.csv": "movie_metadata.csv",    # Data 2
    "tmdb_movies.csv": "tmdb_movies.csv",           # Data 3
    "movies.csv": "movie_quotes.csv",               # Data 4
    "final_dataset.csv": "movie_revenues.csv"       # Data 5
}

# Rename files in the working directory
for old_name, new_name in file_renames.items():
    if os.path.exists(old_name):
        os.rename(old_name, new_name)
        print(f"Renamed {old_name} to {new_name}")
    else:
        print(f"{old_name} does not exist.")



Renamed Data 1 to 'movie_franchises'.
Renamed Data 2 to 'movie_metadata'.
Renamed Data 3 to 'tmdb_movies'.
Renamed Data 4 to 'movie_quotes'.
Renamed Data 5 to 'movie_revenues'.
Renamed movie.csv to movie_franchises.csv
Renamed MovieFranchises.csv to movie_metadata.csv
Renamed tmdb_movies.csv to tmdb_movies.csv
Renamed movies.csv to movie_quotes.csv
Renamed final_dataset.csv to movie_revenues.csv


### E/R Diagram



In [None]:
!pip install graphviz

from graphviz import Digraph

# Initialize the graph
schema = Digraph(format='png', engine='dot')
schema.attr(rankdir='LR', size='10')

# Add nodes for tables
schema.node('movie_franchises', 'movie_franchises\n- franchise_id (PK)\n- franchise_name\n- total_movies\n- total_revenue\n- total_budget')
schema.node('movie_metadata', 'movie_metadata\n- movie_id (PK)\n- title\n- genre\n- runtime\n- release_year\n- rating')
schema.node('tmdb_movies', 'tmdb_movies\n- movie_id (PK, FK)\n- budget\n- revenue\n- production_company\n- release_date\n- cast')
schema.node('movie_quotes', 'movie_quotes\n- quote_id (PK)\n- movie_id (FK)\n- quote\n- character\n- actor')
schema.node('movie_revenues', 'movie_revenues\n- movie_id (PK, FK)\n- domestic_revenue\n- international_revenue\n- total_revenue\n- budget')

# Add edges for relationships
schema.edge('movie_franchises', 'movie_metadata', label='1:N')
schema.edge('movie_metadata', 'tmdb_movies', label='1:1')
schema.edge('movie_metadata', 'movie_quotes', label='1:N')
schema.edge('movie_metadata', 'movie_revenues', label='1:1')

# Render the schema to a file
schema.render('movie_schema', cleanup=False)
print("Schema diagram saved as 'movie_schema.png'.")


Schema diagram saved as 'movie_schema.png'.


### Normalization Steps for the Database

Normalization is a database design process that organizes data to reduce redundancy and improve integrity.
In our movie database project, normalization was applied to the `movie_metadata` table by separating genre and cast information into distinct tables.

---

**Normalization Rationale**

**1. First Normal Form (1NF)**  
To achieve 1NF, we'll ensure that each column contains atomic values. As of now, the columns for genres and cast contain multiple values within a single record, which violates 1NF. The solution will be to create separate tables for:  

- **`movie_genres`**: Stores the relationship between movies and their genres.  
- **`movie_cast`**: Stores the relationship between movies and their actors.

Before normalization, the `movie_metadata` table contained multiple values within a single column, such as:

| movie_id | title     | genre               | cast                    |
|----------|-----------|--------------------|-------------------------|
| 1        | Inception | Sci-Fi, Action      | Leonardo, Joseph, Ellen |
| 2        | Titanic   | Romance, Drama      | Leonardo, Kate          |

*Issues:*  
- `genre` and `cast` contain multiple values, violating atomicity (1NF).  

**Solution:** Split into separate tables.

| movie_id | genre     |
|----------|-----------|
| 1        | Sci-Fi     |
| 1        | Action     |
| 2        | Romance    |
| 2        | Drama      |

| movie_id | actor      |
|----------|------------|
| 1        | Leonardo   |
| 1        | Joseph     |
| 1        | Ellen      |
| 2        | Leonardo   |
| 2        | Kate       |

---
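The split shown in the tables above can be sketched in pandas. This is a minimal illustration on the two example rows only, not on the project's actual data; the DataFrame `wide` and the helper `split_column` are hypothetical names introduced here.

```python
import pandas as pd

# Toy copy of the example table above (not the real dataset).
wide = pd.DataFrame({
    "movie_id": [1, 2],
    "title": ["Inception", "Titanic"],
    "genre": ["Sci-Fi, Action", "Romance, Drama"],
    "cast": ["Leonardo, Joseph, Ellen", "Leonardo, Kate"],
})

def split_column(df, column, new_name):
    """Turn a comma-separated column into one row per (movie_id, value)."""
    out = df[["movie_id", column]].copy()
    out[column] = out[column].str.split(",")   # string -> list of values
    out = out.explode(column)                  # one row per list element
    out[column] = out[column].str.strip()      # drop the space after each comma
    return out.rename(columns={column: new_name}).reset_index(drop=True)

movie_genres = split_column(wide, "genre", "genre")
movie_cast = split_column(wide, "cast", "actor_name")
print(movie_genres)
print(movie_cast)
```

Each resulting frame has the same two-column shape as the `movie_genres` and `movie_cast` tables, so it could be written to SQLite with `to_sql` if the source data really stored genres or cast as delimited strings.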

In [None]:
!apt-get install sqlite3

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Suggested packages:
  sqlite3-doc
The following NEW packages will be installed:
  sqlite3
0 upgraded, 1 newly installed, 0 to remove and 49 not upgraded.
Need to get 768 kB of archives.
After this operation, 1,873 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy-updates/main amd64 sqlite3 amd64 3.37.2-2ubuntu0.3 [768 kB]
Fetched 768 kB in 1s (671 kB/s)
Selecting previously unselected package sqlite3.
(Reading database ... 124565 files and directories currently installed.)
Preparing to unpack .../sqlite3_3.37.2-2ubuntu0.3_amd64.deb ...
Unpacking sqlite3 (3.37.2-2ubuntu0.3) ...
Setting up sqlite3 (3.37.2-2ubuntu0.3) ...
Processing triggers for man-db (2.10.2-1) ...


In [None]:
import sqlite3

# Create the first table
conn = sqlite3.connect('movies.db')
cursor = conn.cursor()
cursor.execute('''
    CREATE TABLE IF NOT EXISTS movie_genres (
        movie_id INTEGER,
        genre TEXT,
        FOREIGN KEY (movie_id) REFERENCES movie_metadata(movie_id)
    );
''')
conn.commit()
conn.close()

# Create the second table after reopening connection
conn = sqlite3.connect('movies.db')
cursor = conn.cursor()
cursor.execute('''
    CREATE TABLE IF NOT EXISTS movie_cast (
        movie_id INTEGER,
        actor_name TEXT,
        FOREIGN KEY (movie_id) REFERENCES movie_metadata(movie_id)
    );
''')
conn.commit()
conn.close()


In [None]:
# Check that the separate tables were created
conn = sqlite3.connect('movies.db')
cursor = conn.cursor()
cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
print(cursor.fetchall())  # This should list the created tables
conn.close()

[('movie_genres',), ('movie_cast',)]


**2. Second Normal Form (2NF)**  
2NF eliminates partial dependencies: every non-key attribute must depend on the whole primary key, not on just part of a composite key. In `movie_genres` and `movie_cast`, each row is identified by the pair (`movie_id`, `genre`) or (`movie_id`, `actor_name`), so movie-level attributes such as the title stay in `movie_metadata`, where they depend on `movie_id` alone.

Before normalization:

| movie_id | title     | franchise | genre   |
|----------|-----------|------------|---------|
| 1        | Inception | Nolan Films | Sci-Fi  |
| 2        | Batman    | Nolan Films | Action  |

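*Issue:*  
- `franchise` describes the franchise rather than the individual movie, so the same name is repeated on every movie in that franchise. It should be stored once and referenced through an independent `franchise_id`.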

**Solution:** Separate into:

| franchise_id | franchise   |
|--------------|-------------|
| 1            | Nolan Films  |

| movie_id | title     | franchise_id |
|----------|-----------|--------------|
| 1        | Inception | 1            |
| 2        | Batman    | 1            |

---
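As a rough sketch of the franchise split above, the snippet below builds the two tables in a throwaway in-memory database. The table names `franchises` and `movies` are assumptions made for this illustration and are not tables from the project. It also sets `PRAGMA foreign_keys = ON`, because SQLite does not enforce `FOREIGN KEY` clauses unless that pragma is enabled on the connection.

```python
import sqlite3

# In-memory database so the sketch does not touch movies.db.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforcement is off by default in SQLite

conn.executescript("""
    CREATE TABLE franchises (
        franchise_id INTEGER PRIMARY KEY,
        franchise    TEXT NOT NULL UNIQUE
    );
    CREATE TABLE movies (
        movie_id     INTEGER PRIMARY KEY,
        title        TEXT NOT NULL,
        franchise_id INTEGER REFERENCES franchises(franchise_id)
    );
""")

conn.execute("INSERT INTO franchises VALUES (1, 'Nolan Films')")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?)",
    [(1, "Inception", 1), (2, "Batman", 1)],
)

# The franchise name is stored once and recovered with a join.
rows = conn.execute("""
    SELECT m.title, f.franchise
    FROM movies AS m
    JOIN franchises AS f ON f.franchise_id = m.franchise_id
    ORDER BY m.movie_id
""").fetchall()
print(rows)

# A movie that points at a non-existent franchise is rejected.
try:
    conn.execute("INSERT INTO movies VALUES (3, 'Orphan', 99)")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True
print(fk_rejected)
conn.close()
```

Renaming a franchise then means updating a single row in `franchises` instead of every movie row that mentions it.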

In [None]:
import sqlite3

# Connect to the database
conn = sqlite3.connect('movies.db')
cursor = conn.cursor()

# Create a separate table for movie runtime, keyed by movie_id
cursor.execute('''
    CREATE TABLE IF NOT EXISTS movie_runtime (
        movie_id INTEGER,
        runtime INTEGER,
        FOREIGN KEY (movie_id) REFERENCES movie_metadata(movie_id)
    );
''')

# Create a separate table for movie ratings, keyed by movie_id
cursor.execute('''
    CREATE TABLE IF NOT EXISTS movie_ratings (
        movie_id INTEGER,
        rating FLOAT,
        FOREIGN KEY (movie_id) REFERENCES movie_metadata(movie_id)
    );
''')

conn.commit()
conn.close()


**3. Third Normal Form (3NF)**  
3NF eliminates transitive dependencies: every non-key attribute must depend directly on the primary key, not on other non-key attributes.

Before normalization:

| movie_id | budget  | revenue  | profit  |
|----------|---------|---------|---------|
| 1        | 1000000 | 5000000  | 4000000 |

*Issue:*  
- `profit` is fully determined by two other non-key columns (`profit = revenue - budget`), so storing it is redundant and it can drift out of sync when either value is updated.

**Solution:** Remove `profit` and calculate dynamically.

---
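One way to "calculate dynamically" is a SQL view that derives `profit` at query time. The sketch below uses an in-memory database and a hypothetical `finances` table holding the single example row from the table above; neither name comes from the project schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE finances (movie_id INTEGER PRIMARY KEY, budget INTEGER, revenue INTEGER)"
)
conn.execute("INSERT INTO finances VALUES (1, 1000000, 5000000)")

# profit is not stored; the view derives it whenever it is queried.
conn.execute("""
    CREATE VIEW finances_with_profit AS
    SELECT movie_id, budget, revenue, revenue - budget AS profit
    FROM finances
""")

result = conn.execute("SELECT movie_id, profit FROM finances_with_profit").fetchall()
print(result)  # [(1, 4000000)]
conn.close()
```

Because the view recomputes the value on every read, an update to `budget` or `revenue` is reflected in `profit` without any extra bookkeeping.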

**Updated Database Schema**

After normalization, the following schema will be adopted:

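- **`movie_metadata`**: (movie_id, title, release_year)  
- **`movie_genres`**: (movie_id, genre)  
- **`movie_cast`**: (movie_id, actor_name)  
- **`movie_runtime`**: (movie_id, runtime)  
- **`movie_ratings`**: (movie_id, rating)  
- **`production_companies`**: (company_id, company_name)  
- **`movie_production`**: (movie_id, company_id)  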

---


In [None]:
conn = sqlite3.connect('movies.db')
cursor = conn.cursor()

# Create a separate table for production companies
cursor.execute('''
    CREATE TABLE IF NOT EXISTS production_companies (
        company_id INTEGER PRIMARY KEY AUTOINCREMENT,
        company_name TEXT UNIQUE
    );
''')

# Link production companies to movies via foreign key
cursor.execute('''
    CREATE TABLE IF NOT EXISTS movie_production (
        movie_id INTEGER,
        company_id INTEGER,
        FOREIGN KEY (movie_id) REFERENCES movie_metadata(movie_id),
        FOREIGN KEY (company_id) REFERENCES production_companies(company_id)
    );
''')

conn.commit()
conn.close()


**Conclusion**  

By applying normalization techniques, the database will:

- Reduce redundancy and improve data integrity.  
- Support efficient querying and maintenance.  
- Handle many-to-many relationships effectively.

### Updated E/R Diagram



In [None]:
# Install graphviz if not already installed
!pip install graphviz

from graphviz import Digraph

# Initialize the graph
schema = Digraph(format='png', engine='dot')
schema.attr(rankdir='LR', size='10')

# Add nodes for tables
schema.node('movie_metadata', 'movie_metadata\n- movie_id (PK)\n- title\n- release_year')
schema.node('movie_genres', 'movie_genres\n- movie_id (FK)\n- genre')
schema.node('movie_cast', 'movie_cast\n- movie_id (FK)\n- actor_name')
schema.node('movie_runtime', 'movie_runtime\n- movie_id (FK)\n- runtime')
schema.node('movie_ratings', 'movie_ratings\n- movie_id (FK)\n- rating')
schema.node('production_companies', 'production_companies\n- company_id (PK)\n- company_name')
schema.node('movie_production', 'movie_production\n- movie_id (FK)\n- company_id (FK)')
schema.node('movie_revenues', 'movie_revenues\n- movie_id (PK, FK)\n- domestic_revenue\n- international_revenue\n- total_revenue\n- budget')
schema.node('movie_quotes', 'movie_quotes\n- quote_id (PK)\n- movie_id (FK)\n- quote\n- character\n- actor')

# Add edges for relationships
schema.edge('movie_metadata', 'movie_genres', label='1:N')
schema.edge('movie_metadata', 'movie_cast', label='1:N')
schema.edge('movie_metadata', 'movie_runtime', label='1:1')
schema.edge('movie_metadata', 'movie_ratings', label='1:1')
schema.edge('production_companies', 'movie_production', label='1:N')
schema.edge('movie_metadata', 'movie_production', label='1:N')
schema.edge('movie_metadata', 'movie_revenues', label='1:1')
schema.edge('movie_metadata', 'movie_quotes', label='1:N')

# Render the schema to a file
schema.render('updated_movie_schema', cleanup=False)
print("Schema diagram saved as 'updated_movie_schema.png'.")


Schema diagram saved as 'updated_movie_schema.png'.


### Insert Data into SQLite Database

With the normalization steps defined, we can now load the datasets into the project's SQLite database.

In [None]:
import pandas as pd

# Load CSV files to inspect the data
metadata_df = pd.read_csv('movie_metadata.csv')
# NOTE: despite its name, genres_df holds the full contents of movie_franchises.csv
genres_df = pd.read_csv('movie_franchises.csv')
quotes_df = pd.read_csv('movie_quotes.csv')
revenues_df = pd.read_csv('movie_revenues.csv')
tmdb_df = pd.read_csv('tmdb_movies.csv')

# Display the first few rows of each dataset
print("Movie Metadata:\n", metadata_df.head())
print("\nMovie Genres:\n", genres_df.head())
print("\nMovie Quotes:\n", quotes_df.head())
print("\nMovie Revenues:\n", revenues_df.head())
print("\nTMDB Movies:\n", tmdb_df.head())


Movie Metadata:
    index MovieID                                           Title  \
0      0    1001              Star Wars: Episode IV - A New Hope   
1      1    1002  Star Wars: Episode V - The Empire Strikes Back   
2      2    1003      Star Wars: Episode VI - Return of the Jedi   
3      3    1004                                   Jurassic Park   
4      4    1005                   The Lost World: Jurassic Park   

  Lifetime Gross  Year              Studio Rating  Runtime      Budget  \
0      775398007  1977           Lucasfilm     PG    121.0  11000000.0   
1      538375067  1980           Lucasfilm     PG    124.0  18000000.0   
2      475106177  1983           Lucasfilm     PG    135.0  32500000.0   
3     1109802321  1993  Universal Pictures  PG-13    127.0  63000000.0   
4      618638999  1997  Universal Pictures  PG-13    129.0  73000000.0   

  ReleaseDate  VoteAvg  VoteCount  FranchiseID  
0    05-25-77     4.09    96233.0        101.0  
1    06-20-80     4.12    79231

In [None]:
import sqlite3

# Connect to SQLite database
conn = sqlite3.connect('movies.db')
cursor = conn.cursor()

# Insert data from DataFrames into the tables
metadata_df.to_sql('movie_metadata', conn, if_exists='replace', index=False)
genres_df.to_sql('movie_franchises', conn, if_exists='replace', index=False)
quotes_df.to_sql('movie_quotes', conn, if_exists='replace', index=False)
revenues_df.to_sql('movie_revenues', conn, if_exists='replace', index=False)
tmdb_df.to_sql('tmdb_movies', conn, if_exists='replace', index=False)

# Commit and close the connection
conn.commit()
conn.close()

print("Data successfully inserted into the database.")


Data successfully inserted into the database.


In [None]:
# Connect to the database
conn = sqlite3.connect('movies.db')

# Check inserted data
print(pd.read_sql_query("SELECT * FROM movie_metadata LIMIT 5;", conn))
print(pd.read_sql_query("SELECT * FROM movie_franchises LIMIT 5;", conn))
print(pd.read_sql_query("SELECT * FROM movie_quotes LIMIT 5;", conn))

conn.close()

   index MovieID                                           Title  \
0      0    1001              Star Wars: Episode IV - A New Hope   
1      1    1002  Star Wars: Episode V - The Empire Strikes Back   
2      2    1003      Star Wars: Episode VI - Return of the Jedi   
3      3    1004                                   Jurassic Park   
4      4    1005                   The Lost World: Jurassic Park   

  Lifetime Gross  Year              Studio Rating  Runtime      Budget  \
0      775398007  1977           Lucasfilm     PG    121.0  11000000.0   
1      538375067  1980           Lucasfilm     PG    124.0  18000000.0   
2      475106177  1983           Lucasfilm     PG    135.0  32500000.0   
3     1109802321  1993  Universal Pictures  PG-13    127.0  63000000.0   
4      618638999  1997  Universal Pictures  PG-13    129.0  73000000.0   

  ReleaseDate  VoteAvg  VoteCount  FranchiseID  
0    05-25-77     4.09    96233.0        101.0  
1    06-20-80     4.12    79231.0        101.0  

### Basic Exploratory Data Analysis (EDA)

Now that we've finished building and designing our DB, we'll proceed with Basic Exploratory Data Analysis (EDA) to understand our database better and define meaningful queries for our project.

#### Objectives of EDA:

EDA will help us:

1. Understand the structure and quality of the data.
   - Identify missing values, data types, and distributions.
2. Identify patterns and trends.
   - Analyze revenue trends, budget patterns, and franchise statistics.
3. Detect potential relationships.
   - Examine relationships between different attributes such as budget vs. revenue, genre vs. ratings, etc.
4. Define our main analytical focus.
   - Use the findings to craft the most valuable queries for insights.

#### **Understanding the Dataset Variables**

| Column          | Description                          | Example           |
|-----------------|--------------------------------------|-------------------|
| `movie_id`      | Unique identifier for each movie     | 101               |
| `title`         | Movie title                          | "The Dark Knight" |
| `genre`         | Movie genre                          | "Action"          |
| `runtime`       | Duration of the movie (minutes)      | 152               |
| `release_year`  | Year the movie was released          | 2008              |
| `rating`        | Audience rating (out of 10)          | 9.0               |
| `budget`        | Production budget (USD)              | 185,000,000       |
| `total_revenue` | Total revenue from all sources (USD) | 1,004,000,000     |

---

#### **Summary Statistics**

Before EDA:

| movie_id | title          | runtime | release_year | rating | budget     | total_revenue |
|----------|----------------|---------|--------------|--------|------------|---------------|
| 1        | Inception       | 148     | 2010         | 8.8    | 160000000  | 829000000     |
| 2        | Titanic         | 195     | 1997         | 7.8    | 200000000  | 2200000000    |
| 3        | Avatar          | 162     | 2009         | 7.9    | 237000000  | 2788000000    |
| 4        | The Dark Knight | 152     | 2008         | 9.0    | 185000000  | 1004000000    |

*Observations:*  
- **Runtime:** Values range between 148 and 195 minutes, which is within a typical movie length.  
- **Ratings:** Generally above 7, indicating good audience reception.  
- **Budget vs Revenue:** Some movies show exceptionally high revenue compared to budget, indicating profitable ventures.

---
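The budget-versus-revenue observation can be made concrete with a revenue-to-budget ratio. The sketch below recomputes it for the four sample rows shown above; the column name `revenue_to_budget` is an assumption introduced here, and the numbers are the illustrative ones from the table rather than values read from the database.

```python
import pandas as pd

# Sample rows copied from the table above.
sample = pd.DataFrame({
    "title": ["Inception", "Titanic", "Avatar", "The Dark Knight"],
    "budget": [160_000_000, 200_000_000, 237_000_000, 185_000_000],
    "total_revenue": [829_000_000, 2_200_000_000, 2_788_000_000, 1_004_000_000],
})

# How many dollars of revenue each budget dollar returned.
sample["revenue_to_budget"] = sample["total_revenue"] / sample["budget"]
ranked = sample.sort_values("revenue_to_budget", ascending=False)
print(ranked[["title", "revenue_to_budget"]])
```

Sorting by the ratio rather than by raw revenue separates films that were simply expensive from films that were unusually profitable for their cost.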


#### Step 1: Summary Statistics

**Why?**

Summarizing the data gives an overview of the numerical and categorical columns to identify ranges, outliers, and potential data quality issues.

In [None]:
# Import necessary libraries
import pandas as pd
import sqlite3

# Connect to the database
conn = sqlite3.connect('movies.db')

# Read the data from SQL tables
metadata_df = pd.read_sql_query("SELECT * FROM movie_metadata", conn)
revenues_df = pd.read_sql_query("SELECT * FROM movie_revenues", conn)
tmdb_df = pd.read_sql_query("SELECT * FROM tmdb_movies", conn)

# Summary statistics for numerical columns
print("Movie Metadata Summary:\n", metadata_df.describe())
print("\nMovie Revenues Summary:\n", revenues_df.describe())
print("\nTMDB Movies Summary:\n", tmdb_df.describe())

conn.close()


Movie Metadata Summary:
             index     Runtime        Budget    VoteAvg     VoteCount  \
count  605.000000   60.000000  6.000000e+01  60.000000     60.000000   
mean   302.000000  137.983333  1.752417e+08   3.556000  21722.266667   
std    174.792734   19.022281  7.568151e+07   0.361359  23551.977668   
min      0.000000   92.000000  1.100000e+07   2.640000    238.000000   
25%    151.000000  125.500000  1.287500e+08   3.305000   8186.250000   
50%    302.000000  135.000000  1.700000e+08   3.645000  13226.500000   
75%    453.000000  147.000000  2.050000e+08   3.822500  26603.750000   
max    604.000000  201.000000  4.000000e+08   4.120000  96233.000000   

       FranchiseID  
count    60.000000  
mean    103.600000  
std       1.596607  
min     101.000000  
25%     102.000000  
50%     104.000000  
75%     105.000000  
max     105.000000  

Movie Revenues Summary:
         Unnamed: 0         year  production_budget  domestic_gross  \
count  1759.000000  1759.000000       1.7

#### **Handling Missing Values**

Before handling missing values:

| movie_id | title     | genre   | runtime | release_year | rating | budget    | total_revenue |
|----------|-----------|---------|---------|--------------|--------|-----------|---------------|
| 1        | Inception | Sci-Fi  | 148     | 2010         | 8.8    | 160000000 | 829000000     |
| 2        | Titanic   | Romance | NULL    | 1997         | 7.8    | NULL      | 2200000000    |
| 3        | Avatar    | NULL    | 162     | 2009         | 7.9    | 237000000 | 2788000000    |

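*Issues Identified:*  
- Missing values in the `runtime`, `budget`, and `genre` columns.
- Possible outliers in `budget` (extremely high values).

**Solution:**  
- Fill missing numeric values (`runtime`, `budget`) with the median of the available values in that column.
- A missing `genre` is categorical, so a median does not apply; it needs to be looked up or flagged explicitly.
- Investigate unusually high budgets to validate correctness.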

---
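The median-fill step can be sketched as follows. The frame below is a toy copy of the three example rows, not the project data, and replacing a missing genre with the label `"Unknown"` is an assumption made for this illustration rather than a choice stated in the notebook.

```python
import pandas as pd

# Toy copy of the table above; None marks the NULL cells.
df = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "title": ["Inception", "Titanic", "Avatar"],
    "genre": ["Sci-Fi", "Romance", None],
    "runtime": [148, None, 162],
    "budget": [160_000_000, None, 237_000_000],
})

print(df.isna().sum())  # number of missing values per column

# Median imputation only makes sense for numeric columns.
for col in ["runtime", "budget"]:
    df[col] = df[col].fillna(df[col].median())

# Assumption: flag a missing genre instead of guessing one.
df["genre"] = df["genre"].fillna("Unknown")
print(df)
```

With only two observed values per column, the median here is just their midpoint; on the real tables it is less sensitive to extreme budgets than the mean would be.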

#### **Distribution Analysis**

**Movie Runtime Distribution:**

A histogram of runtime values reveals a peak around 150-160 minutes, with a few outliers exceeding 180 minutes.

**Revenue Distribution:**

Most movies fall in the range of 500M to 1B USD, with some extreme blockbusters exceeding 2B USD.

---
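A distribution like the ones described above can be tabulated without plotting. The helper `runtime_histogram` below is a hypothetical function written for this sketch, and the runtimes passed to it are made-up values used only to exercise it.

```python
import numpy as np
import pandas as pd

def runtime_histogram(runtimes, bin_width=10):
    """Count movies per runtime bin, ignoring missing values."""
    values = pd.Series(runtimes, dtype="float64").dropna()
    # Round the range outward to multiples of bin_width.
    start = np.floor(values.min() / bin_width) * bin_width
    stop = np.ceil(values.max() / bin_width) * bin_width + bin_width
    edges = np.arange(start, stop, bin_width)
    counts, edges = np.histogram(values, bins=edges)
    labels = [f"{int(lo)}-{int(hi)}" for lo, hi in zip(edges[:-1], edges[1:])]
    return pd.Series(counts, index=labels)

# Made-up runtimes purely to exercise the function.
demo = runtime_histogram([92, 121, 124, 127, 129, 135, None, 201])
print(demo)
```

In the notebook, the same function could presumably be applied to `metadata_df["Runtime"]`, and the returned Series can be drawn with `demo.plot.bar()` if matplotlib is available.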

#### **Key Insights from EDA:**
- Most movies in the dataset are long-form blockbusters with significant budgets.
- Revenue distribution suggests a few movies contribute to the majority of total earnings.
- Outliers exist in runtime and budget columns, which may require further investigation.

---
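The claim that a few movies account for most of the earnings can be checked with a simple concentration measure. The function `top_share` is a hypothetical helper for this sketch, and the input below is invented to show the calculation, not taken from the datasets.

```python
import pandas as pd

def top_share(revenue, top_fraction=0.1):
    """Fraction of total revenue earned by the top `top_fraction` of movies."""
    values = pd.Series(revenue, dtype="float64").dropna().sort_values(ascending=False)
    k = max(1, int(round(len(values) * top_fraction)))  # at least one movie
    return values.iloc[:k].sum() / values.sum()

# Invented numbers: one blockbuster and nine modest releases.
share = top_share([1000] + [100] * 9, top_fraction=0.1)
print(round(share, 3))
```

Applied to a column such as `revenues_df["worldwide_gross"]`, a share well above `top_fraction` would support the insight, while a share close to it would suggest earnings are spread fairly evenly.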

By conducting this EDA, we have gained better insight into our data and identified potential issues before moving to more complex querying.
