# Project Title: Vulcans Analysis

# OVERVIEW

To add

# BUSINESS UNDERSTANDING

## Objectives

1. What is the optimal production budget range for maximizing ROI? (Analyze the relationship between production budget and return on investment to identify sweet spots for budget allocation.)

2. How does critical reception (vote_average) correlate with commercial success?

3. What is the domestic vs international revenue split trend? (Understanding geographic revenue distribution helps in marketing budget allocation and release strategies.)

4. Which genres provide the best risk-adjusted returns? (Genre analysis helps in portfolio planning and risk management for production companies.)

5. How does movie popularity correlate with actual box office performance? (Social media buzz and pre-release popularity as predictors of commercial success.)
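
As a rough illustration of the first objective, ROI can be computed as (revenue - budget) / budget and summarized per budget range. The sketch below uses made-up numbers; the column names `production_budget` and `worldwide_gross` and the bin edges are assumptions for illustration, not fields or thresholds taken from the datasets described in this notebook.

```python
import pandas as pd

# Toy data: values and column names are illustrative assumptions, not project data.
films = pd.DataFrame({
    "production_budget": [5e6, 2e7, 4e7, 9e7, 1.5e8, 2.5e8],
    "worldwide_gross":   [2e7, 3e7, 1.2e8, 1.8e8, 3e8, 3e8],
})

# ROI = (revenue - budget) / budget
films["roi"] = (
    films["worldwide_gross"] - films["production_budget"]
) / films["production_budget"]

# Group budgets into ranges; the bin edges are arbitrary placeholders.
bins = [0, 1e7, 5e7, 1e8, float("inf")]
labels = ["<10M", "10-50M", "50-100M", ">100M"]
films["budget_range"] = pd.cut(films["production_budget"], bins=bins, labels=labels)

# The median is less sensitive to a few runaway hits than the mean.
roi_by_range = films.groupby("budget_range", observed=True)["roi"].median()
print(roi_by_range)
```

Using the median per range is one possible choice; with heavily skewed returns it tends to give a steadier picture than the mean.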

# DATA

## Data Overview

## Data 1:  rt.reviews.tsv

### 1.1 Data Overview

The dataset **rt.reviews.tsv** contains movie review data collected from Rotten Tomatoes. It includes metadata such as reviewer information, review content, ratings, freshness labels, publisher names, and publication dates.

### 1.2: Data Description

#### 1.2.1: Importing the dataset

In [1]:
#Importing the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import ast
%matplotlib inline

In [2]:
#Reading the dataset and checking top five rows
df= pd.read_csv("../Original_Data/rt.reviews.tsv", sep='\t', encoding='ISO-8859-1')
df.head()

FileNotFoundError: [Errno 2] No such file or directory: '../Original_Data/rt.reviews.tsv'

**The file is not encoded in UTF-8, so loading it with the default encoding raises an error. Passing `encoding="ISO-8859-1"` solves the problem.**
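
When the encoding of a file is not known in advance, one option is to try UTF-8 first and fall back to ISO-8859-1. The helper `read_tsv_with_fallback` below is a hypothetical sketch, not part of the original notebook; it is demonstrated on a throwaway file rather than on `rt.reviews.tsv`.

```python
import os
import tempfile

import pandas as pd


def read_tsv_with_fallback(path, encodings=("utf-8", "ISO-8859-1")):
    """Hypothetical helper: try each encoding in order, return (DataFrame, encoding used)."""
    last_error = None
    for enc in encodings:
        try:
            return pd.read_csv(path, sep="\t", encoding=enc), enc
        except UnicodeDecodeError as err:
            last_error = err
    raise last_error


# Demo file containing byte 0xE9 ("é" in Latin-1), which is not valid UTF-8 here.
with tempfile.NamedTemporaryFile("wb", suffix=".tsv", delete=False) as tmp:
    tmp.write("id\treview\n1\tCaf\u00e9 society\n".encode("ISO-8859-1"))
    demo_path = tmp.name

demo_df, used_encoding = read_tsv_with_fallback(demo_path)
os.remove(demo_path)
print(used_encoding)
```

ISO-8859-1 maps every byte to a character, so it never fails to decode; that makes it a safe last resort, but it can silently produce wrong characters if the file is actually in another encoding.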

#### 1.2.2: Basic structure

In [3]:
#checking shape
df.shape

NameError: name 'df' is not defined

In [None]:
#checking columns
df.columns

#### 1.2.3: Overview of column types and non-null values

In [None]:
df.info()

#### 1.2.4: Summary statistics numerical

In [None]:
df.describe(include='number').T

#### 1.2.5: Summary statistics categorical

In [None]:
df.describe(include='O').T

#### 1.2.6: Missing Values

In [None]:
#missing values as sum
df.isnull().sum()

In [None]:
#missing values as mean
df.isnull().mean()*100

#### 1.2.7: Duplicates

In [None]:
df.duplicated().sum()

### 1.3: Data Summary

The **RT.Reviews** dataset consists of **54,432 records** and **8 attributes**, capturing movie reviews from **Rotten Tomatoes**.

#### Key Columns:
- `review`: The text of the review
- `rating`: Rating values (e.g., `"3/5"`, `"B+"`, etc.)
- `fresh`: Indicates sentiment (`fresh` or `rotten`)
- `critic`: Name of the reviewer
- `publisher`: Source of the review
- `date`: Review publication date

#### Data Completeness:
- `rating`: **24.8% missing**
- `review`: **10% missing**
- `critic`: **5% missing**

All other fields (`id`, `fresh`, `top_critic`, `publisher`, `date`) are nearly complete.

#### Additional Insights:
- **Unique publishers**: `1,281`
- **Unique critics**: `3,496`
- **Top rating value**: `"3/5"`
- **Duplicate records**: `9` (should be removed)
- **`top_critic` field**: Binary (0 or 1), with ~**24%** marked as top critics
- **`date` field**: Spans multiple years and should be converted to datetime for analysis

#### Next Steps:
- Handle missing values
- Standardize `rating` formats
- Parse `date` into datetime objects
- Remove duplicate records
- Impute or filter out null values in key fields (`review`, `critic`)

### 1.4 Data Cleaning Strategy Summary

To ensure data quality and preserve analytical integrity, we apply the following cleaning rules:

##### Drop Column:
- **Column `id`**: an index with no analytical value. Decision: drop.

##### Drop Rows:
- **Missing `rating`**: As the most critical field representing reviewer opinion, rows without a rating are dropped entirely.
- **Fields with few missing values** (`critic`, `publisher`): Rows missing either of these fields are also dropped, since only a small share of rows (about 5% or less) is affected and dropping them avoids imputation.

##### Impute Missing Values:
- **Missing `review`**: Rows with missing review text are retained but imputed with the placeholder `"UNKNOWN"` to preserve structure for analysis.

#### 1.4.1: Dropping Rows

In [None]:
#creating a list and dropping the rows with null
print(f"Shape Before: {df.shape}")
columns_to_droprows = ['rating','critic', 'publisher']
df.dropna(subset=columns_to_droprows, inplace=True)
print(f"Shape After: {df.shape}")

#### 1.4.2: Imputing Null with UNKNOWN string

In [None]:
#Imputing the nan values with UNKNOWN
df['review'] = df['review'].fillna('UNKNOWN')

In [None]:
#checking null values
df.isna().sum()

#### 1.4.3: Changing date to datetime dtype

In [None]:
df['date'].dtype

In [None]:
#Converting the date column into a datetime type
df['date'] = pd.to_datetime(df['date'], errors='coerce')
df['date'].dtype
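
Because `errors='coerce'` turns unparseable values into `NaT` without raising, it can help to count how many values were lost in the conversion. A minimal sketch on made-up date strings (the date format shown is an assumption, not confirmed from the file):

```python
import pandas as pd

# Illustrative values only; the last two entries cannot be parsed as dates.
raw_dates = pd.Series(["November 10, 2018", "May 23, 2018", "not a date", None])

parsed = pd.to_datetime(raw_dates, errors="coerce")

# Values that were present in the input but became NaT after parsing.
n_failed = parsed.isna().sum() - raw_dates.isna().sum()
print(f"{n_failed} value(s) could not be parsed")
```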

#### 1.4.4: Standardizing the column rating

In [None]:
#checking unique items
df['rating'].unique()

In [None]:
# Defining letter grade mapping
letter_grades = {
    'A+': 10.0, 'A': 9.5, 'A-': 9.0,
    'B+': 8.5, 'B': 8.0, 'B-': 7.5,
    'C+': 7.0, 'C': 6.5, 'C-': 6.0,
    'D+': 5.5, 'D': 5.0, 'D-': 4.5,
    'F+': 4.0, 'F': 3.5, 'F-': 3.0
}


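# Normalizing ratings to a 10-point scale:
# fractions ("3/5") are rescaled, letter grades are mapped, plain numbers are kept as-is
def normalize_rating(val):
    try:
        if '/' in val:
            num, den = val.split('/')
            return round(float(num) / float(den) * 10, 2)
        elif val in letter_grades:
            return letter_grades[val]
        else:
            return float(val)
    except (TypeError, ValueError, ZeroDivisionError):
        # non-string input, unparseable text, or a zero denominator
        return None

df['rating'] = df['rating'].apply(normalize_rating)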
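
To see how the normalization rule behaves on different inputs, the sketch below applies a standalone copy of it to a handful of made-up values. The letter-grade dictionary here is only a subset of the mapping above, and the sample inputs are illustrative.

```python
letter_grades = {"A": 9.5, "B+": 8.5, "C": 6.5}  # subset of the mapping above


def normalize_rating(val):
    """Standalone copy of the rule above, for illustration only."""
    try:
        if "/" in val:
            num, den = val.split("/")
            return round(float(num) / float(den) * 10, 2)
        if val in letter_grades:
            return letter_grades[val]
        return float(val)
    except (TypeError, ValueError, ZeroDivisionError):
        return None


samples = ["3/5", "B+", "7", "5/4", "N", "2/0", float("nan")]
results = [normalize_rating(s) for s in samples]
print(results)
```

A fraction such as `"5/4"` normalizes to a value above 10, which is why an upper-bound filter on the normalized column is still needed afterwards.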

In [None]:
#checking current value counts
df['rating'].value_counts()

In [None]:
#dropping out-of-range ratings (> 10); rows whose rating is NaN are filtered out by this comparison too
df = df[df['rating'] <= 10]
df.shape

#### 1.4.5: Checking for duplicates

In [None]:
df.duplicated().sum()
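
The data summary notes that duplicate records should be removed. A minimal sketch of that step on a toy frame (the frame `reviews` and its values are made up for illustration):

```python
import pandas as pd

reviews = pd.DataFrame({
    "critic": ["A. Critic", "A. Critic", "B. Critic"],
    "rating": [6.0, 6.0, 8.5],
})

n_before = len(reviews)
# Keep the first occurrence of each fully identical row.
reviews = reviews.drop_duplicates().reset_index(drop=True)
n_removed = n_before - len(reviews)
print(f"Removed {n_removed} duplicate row(s)")
```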

### 1.5 Saving Cleaned Data

In [None]:
#current structure
df.head()

In [None]:
#saving the DataFrame to CSV
df.to_csv('../Cleaned_Data/cleaned_rt-reviews.csv', index=False)

## Data 2: tmdb.movies.csv

### 2.1 Data Overview


### 2.2: Data Description

#### 2.2.1: Importing the dataset

In [None]:
#importing the dataset
df1= pd.read_csv("../Original_Data/tmdb.movies.csv")
df1.head()

#### 2.2.2: Basic structure

In [None]:
#checking shape
df1.shape

In [None]:
#checking columns
df1.columns

#### 2.2.3: Overview of column types and non-null values

In [None]:
df1.info()

#### 2.2.4: Summary statistics numerical

In [None]:
df1.describe(include='number').T

#### 2.2.5: Summary statistics categorical

In [None]:
df1.describe(include='O').T

#### 2.2.6: Missing Values

In [None]:
#missing values as sum
df1.isnull().sum()

In [None]:
#missing values as mean
df1.isnull().mean()*100

#### 2.2.7: Duplicates

In [None]:
df1.duplicated().sum()

### 2.3: Data Summary

The **TMDB.Movies** dataset contains **26,517 movie records** and **10 attributes**, capturing details like titles, genres, ratings, and popularity.

####  **Key Columns:**
- `genre_ids`: List of genre codes (e.g., `[28, 12]` for Action & Adventure)
- `original_language`: Language code (e.g., `'en'`)
- `original_title` / `title`: Original and localized movie titles
- `popularity`: Popularity score (float)
- `release_date`: Date the movie was released
- `vote_average`: Average user rating (0–10 scale)
- `vote_count`: Number of user votes

---

####  **Data Quality:**
- **No missing values**
- **No duplicate rows**
- `release_date` is an object type and should be converted to datetime for analysis
- `genre_ids` contains 2,477 unique combinations (top: `[99]`)

---

####  **Quick Stats:**
- **Most common language**: `en` (English) — 23,291 records
- **Most frequent release date**: `2010-01-01` (269 movies)
- **Top genres combo**: `[99]` — 3,700 movies
- **Average vote**: `6.0`, with a max of `10.0`
- **Highest vote count**: `22,186` (Inception)

---

#### **Next Steps:**
1. Convert `genre_ids` from a string to an actual Python list.
2. Convert the `id` column to a numeric type.
3. Convert `release_date` to datetime format.
4. Remove leading and trailing whitespace in string columns (`original_language`, `original_title`, `title`).
5. Ensure `popularity`, `vote_average`, and `vote_count` are numeric and fill missing values with 0.
6. Remove duplicate movies based on the `id` column.
7. Drop rows that are missing crucial information (`title`, `release_date`).

### 2.4: Data Cleaning Strategy Summary

To ensure data quality and preserve analytical integrity, the following cleaning rules were applied to the **TMDB.Movies** dataset:

####  Drop Rows:
- **Missing `title` or `release_date`**: Rows missing these crucial fields were dropped since they are essential for timeline and identification-based analyses.
- **Duplicate entries**: Duplicate rows based on the `id` column were removed to prevent overrepresentation of movies.

#### Transformations & Imputations:
- **`genre_ids`**: Converted from string to a Python list for easier genre-level operations.
- **`release_date`**: Parsed into proper datetime format for time-series analysis.
- **String fields** (`original_language`, `original_title`, `title`): Trimmed to remove unnecessary whitespace.
- **Numeric columns** (`popularity`, `vote_average`, `vote_count`): Ensured valid numeric types and missing values were imputed with `0` to maintain completeness.


In [None]:
# Convert 'genre_ids' from a string to an actual Python list
df1['genre_ids'] = df1['genre_ids'].apply(
    lambda x: ast.literal_eval(x) if pd.notnull(x) else []
)
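
Once `genre_ids` holds real lists, genre-level analysis usually needs one row per movie and genre. The sketch below shows one way to get there with `explode`, on toy data. The `genre_names` dictionary is a partial, illustrative mapping, not a complete list of TMDB genre codes.

```python
import ast

import pandas as pd

# Toy data: genre_ids stored as strings, as in the raw CSV.
movies = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C"],
    "genre_ids": ["[28, 12]", "[99]", "[]"],
})
movies["genre_ids"] = movies["genre_ids"].apply(ast.literal_eval)

# Partial, illustrative mapping of genre codes to names.
genre_names = {28: "Action", 12: "Adventure", 99: "Documentary"}

# One row per (movie, genre); movies with an empty list produce NaN and are dropped.
per_genre = movies.explode("genre_ids").dropna(subset=["genre_ids"])
per_genre["genre"] = per_genre["genre_ids"].map(genre_names)
print(per_genre[["title", "genre"]])
```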

In [None]:
# Convert 'id' column to numeric type
df1['id'] = pd.to_numeric(df1['id'], errors='coerce')

In [None]:
# Convert 'release_date' to datetime format
df1['release_date'] = pd.to_datetime(df1['release_date'], errors='coerce')

In [None]:
# Remove leading and trailing whitespace in string columns
string_columns = ['original_language', 'original_title', 'title']
for col in string_columns:
    # .str.strip() leaves missing values as NaN, so a later dropna() can still find them
    df1[col] = df1[col].str.strip()
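
When trimming whitespace, it matters whether missing titles stay missing. Depending on the pandas version, casting a column with `astype(str)` before stripping can turn `NaN` into the literal text `'nan'`, which a later `dropna` will not catch. The sketch below, on a made-up series, shows that `.str.strip()` alone keeps missing values intact.

```python
import numpy as np
import pandas as pd

# Toy column with padded text and one missing value.
titles = pd.Series(["  Inception ", np.nan], dtype="object")

# Strip whitespace without casting: NaN stays NaN.
kept_missing = titles.str.strip()

# The missing entry is still detectable and can be dropped later.
cleaned = kept_missing.dropna()
print(cleaned.tolist())
```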

In [None]:
#Ensure numeric columns are in the right format and fill missing values
df1['popularity'] = pd.to_numeric(df1['popularity'], errors='coerce').fillna(0)
df1['vote_average'] = pd.to_numeric(df1['vote_average'], errors='coerce').fillna(0)
df1['vote_count'] = pd.to_numeric(df1['vote_count'], errors='coerce').fillna(0)

In [None]:
#Remove duplicated movies
df1.drop_duplicates(subset='id', inplace=True)

In [None]:
#Drop rows with missing crucial info
df1.dropna(subset=['title', 'release_date'], inplace=True)

In [None]:
#Reset the index after dropping rows
df1.reset_index(drop=True, inplace=True)

### 2.5 Saving Cleaned Data

In [None]:
df1.to_csv("../Cleaned_Data/cleaned_tmdb_movies.csv", index=False)

## Data 3: im.db

### 3.1 Data Overview

### 3.2: Data Description

#### 3.2.1: Importing the dataset

In [None]:
#additional libraries needed
import sqlite3
import pandas as pd

In [None]:
#connect the database
db = '../Original_Data/im.db'
conn = sqlite3.connect(db)
cursor = conn.cursor()

In [None]:
query_for_tables = """
                   SELECT name
                   FROM sqlite_master
                   WHERE type = 'table';
                   """

cursor.execute(query_for_tables)

tables = cursor.fetchall()
print(f"Tables in the database: {tables}")

#### 3.2.2: Basic structure

In [None]:
#defining function and observing movie_basics table
def table_as_df(conn, table_name):
    #takes an open SQLite connection and a table name,
    #returns the table as a Pandas DataFrame.
    return pd.read_sql_query(f"SELECT * FROM {table_name}", conn)

df_movie_basics = table_as_df(conn, 'movie_basics')
print(df_movie_basics.info())
print(df_movie_basics.head())
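
Since the table name is interpolated into the SQL string, a typo only surfaces as a database error. One possible safeguard is to check the name against `sqlite_master` first. The function `table_as_df_checked` is a hypothetical variant, shown here against an in-memory database rather than `im.db`.

```python
import sqlite3

import pandas as pd


def table_as_df_checked(conn, table_name):
    """Hypothetical variant of table_as_df that validates the table name first."""
    known = {
        row[0]
        for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    }
    if table_name not in known:
        raise ValueError(f"Unknown table {table_name!r}; available: {sorted(known)}")
    return pd.read_sql_query(f'SELECT * FROM "{table_name}"', conn)


# Demo on an in-memory database with a single made-up row.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE movie_basics (movie_id TEXT, primary_title TEXT)")
demo_conn.execute("INSERT INTO movie_basics VALUES ('tt0000001', 'Example Film')")

demo_df = table_as_df_checked(demo_conn, "movie_basics")

try:
    table_as_df_checked(demo_conn, "no_such_table")
    rejected = False
except ValueError:
    rejected = True

demo_conn.close()
print(demo_df, rejected)
```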

#### 3.2.3: Loading both tables

In [None]:
df_basics = table_as_df(conn, 'movie_basics')
df_directors = table_as_df(conn, 'directors')

# Preview their columns
print("movie_basics columns:", df_basics.columns.tolist())
print("directors columns:", df_directors.columns.tolist())

#### 3.2.4: Merging the tables

In [None]:
merged_df = df_basics.merge(df_directors, on='movie_id', how='left')
print(merged_df.head())
print(merged_df.info())
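
A left merge repeats a movie once for every matching row on the right, so repeated rows in the right-hand table inflate the result. The toy example below (made-up ids, not rows from `im.db`) compares row counts with and without de-duplicating the right-hand table first.

```python
import pandas as pd

basics = pd.DataFrame({
    "movie_id": ["m1", "m2"],
    "primary_title": ["Film A", "Film B"],
})
# The (m1, p1) pair appears twice; m2 has no director row at all.
directors = pd.DataFrame({
    "movie_id": ["m1", "m1", "m1"],
    "person_id": ["p1", "p1", "p2"],
})

naive = basics.merge(directors, on="movie_id", how="left")
deduped = basics.merge(directors.drop_duplicates(), on="movie_id", how="left")

print(len(basics), len(naive), len(deduped))
```

Comparing `len()` before and after each merge is a cheap way to notice unexpected row growth early, before further tables are joined on top.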

In [None]:
df_ratings = table_as_df(conn, 'movie_ratings')
print(df_ratings.head())
print(df_ratings.info())
merged_df = merged_df.merge(df_ratings, on='movie_id', how='left')
merged_df.info()

In [None]:
df_principals = table_as_df(conn, 'principals')

# Merge with principals on movie_id and person_id
merged_df = merged_df.merge(
    df_principals,
    on=['movie_id', 'person_id'],
    how='left'
)

In [None]:
df_persons = table_as_df(conn, 'persons')

# Merge on person_id to get name, birth year, profession
merged_df = merged_df.merge(
    df_persons,
    on='person_id',
    how='left'
)

merged_df.info()

In [None]:
merged_df.head()

In [None]:
# Drop exact row duplicates
clean_df = merged_df.drop_duplicates()
clean_df.head()

In [None]:
columns_to_drop = [
    'birth_year',
    'death_year',
    'job',
    'characters',
    'ordering'
]

clean_df = clean_df.dropna(subset=['averagerating', 'numvotes', 'primary_name'])

clean_df.drop(columns=columns_to_drop, inplace=True)
clean_df.info()

In [None]:
# Calculate the median runtime (ignoring NaNs)
median_runtime = clean_df['runtime_minutes'].median()

# Fill NaN runtimes with the median
clean_df['runtime_minutes'] = clean_df['runtime_minutes'].fillna(median_runtime)
clean_df.info()

In [None]:
clean_df = clean_df.dropna(subset=['genres', 'category', 'primary_profession'])
clean_df.info()

In [None]:
clean_df.tail()

In [None]:
clean_df.category.unique()

### 3.5 Saving Cleaned Data

In [None]:
clean_df.to_csv('../Cleaned_Data/clean_imdb_data.csv', index=False)
conn.close()

## Data 4: bom.movie_gross.csv.gz

### 4.1 Data Overview

### 4.2: Data Description

#### 4.2.1: Importing the dataset

In [None]:
# loading the dataset
df2 = pd.read_csv('../Original_Data/bom.movie_gross.csv.gz', encoding='latin1',low_memory=False)
df2.head()

#### 4.2.2: Basic structure

In [None]:
#checking shape
df2.shape

In [None]:
#checking columns
df2.columns

#### 4.2.3: Overview of column types and non-null values

In [None]:
df2.info()

#### 4.2.4: Summary statistics numerical

In [None]:
df2.describe(include='number').T

#### 4.2.5: Summary statistics categorical

In [None]:
df2.describe(include='O').T

#### 4.2.6: Missing Values

In [None]:
#missing values as sum
df2.isnull().sum()

In [None]:
#missing values as mean
df2.isnull().mean()*100

#### 4.2.7: Duplicates

In [None]:
df2.duplicated().sum()

### 4.3: Data Summary

The **BoxOffice.Revenue** dataset consists of **3,387 records** and **5 attributes**, capturing domestic and foreign earnings of movies across multiple years.

#### **Key Columns:**
- `title`: Name of the movie
- `studio`: Producing or distributing studio
- `domestic_gross`: U.S./Canada revenue (float, in dollars)
- `foreign_gross`: Revenue from outside the U.S./Canada (object, needs conversion)
- `year`: Release year (integer)

---

#### **Data Completeness:**
- `studio`: **5 missing** values
- `domestic_gross`: **28 missing**
- `foreign_gross`: **1,350 missing** — over **39%** of the data
- `title` and `year`: **100% complete**

---

#### **Additional Insights:**
- **Unique movie titles**: `3,386`
- **Unique studios**: `257` (most frequent: `IFC`, with 166 movies)
- **Max domestic gross**: `$936,700,000`
- **Top frequent `foreign_gross` value**: `1,200,000` (appears 23 times)
- **Year range**: `2010 – 2018`, centered around recent box office trends

### 4.4: Data Cleaning Strategy Summary

1. **Convert `year` to datetime format**  
   Transformed the `year` column from `int64` to datetime for consistency in time-based operations.

2. **Extract clean `year_only` column**  
   Created a separate `year_only` column by extracting the year component from the datetime object for easy filtering and plotting.

3. **Convert `foreign_gross` to float**  
   Replaced commas in `foreign_gross` values and converted the column from string to float to enable numeric analysis.

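4. **Handle missing values in `studio`**  
   Filled missing entries in the `studio` column with the mode (the most frequent studio).

5. **Handle missing values in `domestic_gross` and `foreign_gross`**  
   Filled missing entries with `0`, on the assumption that a missing value means no revenue was recorded for that market.

6. **Create `total_gross`**  
   Added a `total_gross` column as the sum of `domestic_gross` and `foreign_gross`.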


In [None]:
# converting year from int64 to datetime format
df2['year'] = pd.to_datetime(df2['year'],format='%Y')

# extracting the year_only for further analysis (created a new column called year_only)
df2['year_only'] = df2['year'].dt.year

# converting the foreign_gross to a float(similar to the domestic_gross)
# the foreign_gross has a comma which we will have to replace so as to convert it to float.
df2['foreign_gross'] = pd.to_numeric(df2['foreign_gross'].str.replace(',', '', regex=False), errors='coerce')
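
The comma removal and numeric conversion can be checked on a few made-up strings before applying it to the full column. The values below are illustrative, not taken from the dataset.

```python
import pandas as pd

# Toy values: a plain number, a number with a thousands separator,
# a missing entry, and a non-numeric placeholder.
foreign = pd.Series(["652000000", "1,131.6", None, "n/a"])

converted = pd.to_numeric(
    foreign.str.replace(",", "", regex=False),
    errors="coerce",  # anything that still is not a number becomes NaN
)
print(converted)
```

Counting `converted.isna().sum()` against the number of nulls in the raw column shows how many values were lost to coercion rather than missing to begin with.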

In [None]:
# checking whether the columns are now in the right format.
df2.info()

In [None]:
# checking for the null values, from the highest to the lowest %
(df2.isna().mean()*100).sort_values(ascending = False)

In [None]:
# checking the unique values in each column
for col in df2.columns:
    print(col)
    print(df2[col].unique(), "\n")

In [None]:
# Show only object columns with missing values
missing_obj_col = df2.select_dtypes(include='object').isna().sum()
print(missing_obj_col,'\n')

# Get the mode of the studio column
mode_value = df2['studio'].mode()[0]
print(mode_value)

# Fill nulls with the mode
df2['studio'] = df2['studio'].fillna(mode_value)

In [None]:
# Show only numeric columns with missing values
missing_num_col = df2.select_dtypes('number').isna().sum()
print(missing_num_col,'\n')

# checking the skewness of the gross amounts to determine how best to deal with the missing data
print("Skewness of domestic_gross:", df2['domestic_gross'].skew())
print("Skewness of foreign_gross:", df2['foreign_gross'].skew())

# Both columns are highly right-skewed (skew > 1), so the mean would be
# inflated by large outliers and is not a good fill value.
# The median would only be appropriate if the missing values were recording errors
# rather than a genuine absence of revenue.
# As a compromise, we assume that movies with a missing value had no foreign
# (or domestic) revenue and fill with 0 instead of the median.

# Fill missing values with 0
df2['domestic_gross'] = df2['domestic_gross'].fillna(0)
df2['foreign_gross'] = df2['foreign_gross'].fillna(0)

# creating a new column total_gross 
df2['total_gross'] = df2['domestic_gross'] + df2['foreign_gross']
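
Filling with 0 makes a missing figure indistinguishable from a real zero. One option, sketched below on made-up numbers, is to record which values were missing before filling. The flag column `foreign_gross_missing` is a hypothetical name, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy data with one missing value in each revenue column.
gross = pd.DataFrame({
    "domestic_gross": [415e6, 1.2e6, np.nan],
    "foreign_gross": [652e6, np.nan, 3.0e6],
})

# Hypothetical flag column: remember which foreign figures were originally missing.
gross["foreign_gross_missing"] = gross["foreign_gross"].isna()

# Fill both revenue columns with 0, then sum them.
filled = gross.fillna({"domestic_gross": 0, "foreign_gross": 0})
filled["total_gross"] = filled["domestic_gross"] + filled["foreign_gross"]
print(filled)
```

With the flag in place, later analysis of the domestic vs international split can exclude rows where the foreign figure was never reported.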

In [None]:
# to confirm that we do not have any missing values, numbers and objects:
missing_num_col = df2.select_dtypes('number').isna().sum()
print(missing_num_col,'\n')

missing_obj_col = df2.select_dtypes(include='object').isna().sum()
print(missing_obj_col,'\n')

In [None]:
# checking the columns we will be working on:
df2.columns

In [None]:
# boxplot to check for skewness and outliers
plt.figure(figsize=(10, 4))
sns.boxplot(x=df2['domestic_gross'])
plt.title('Boxplot of Domestic Gross')
plt.show()

plt.figure(figsize=(10, 4))
sns.boxplot(x=df2['foreign_gross'])
plt.title('Boxplot of Foreign Gross')
plt.show()
# both the domestic and foreign gross are heavily right-skewed.

### 4.5 Saving Cleaned Data

In [None]:
#saving the DataFrame to CSV
df2.to_csv('../Cleaned_Data/cleaned_bom.movie_gross.csv', index=False)