# STADVDB MCO 1

[GitHub repository](https://github.com/420Rain/STADVDB_MCO1.git) 

**BALAJADIA**, John Ryan Uy<br />
**DULATRE**, Rainier Antolin<br />
**MARQUESES**, Simon Anthony Asuncion<br />


<br> <!-- Cell padder -->
<a name="setup"></a>
## Importing and data frame setup

---

In [None]:
# import ipywidgets as widgets
from sqlalchemy import create_engine
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as sst
from dotenv import load_dotenv
import os, pandas as pd

load_dotenv()

In [None]:
# postgresql://username:password@hostname/dbname
# Single quotes inside the f-strings keep these lines valid on Python versions before 3.12.
#conn_str = f"postgresql://{os.getenv('DB_USER')}:{os.getenv('DB_PASSWORD')}@{os.getenv('DB_HOST')}/{os.getenv('DB_DATABASE')}"
conn_str = f"postgresql://{os.getenv('DW_USER')}:{os.getenv('DW_PASS')}@{os.getenv('DW_HOST')}/{os.getenv('DW_DB')}"
%sql $conn_str

engine = create_engine(conn_str)

In [None]:
import olap_queries as oq

queries = oq.OLAP(engine)

<br> <!-- Cell padder -->
<a name="eda"></a>
## OLAP Querying

---

In [None]:
result = pd.DataFrame(queries.query_1())
result


In [None]:
result = pd.DataFrame(queries.query_2())
result

In [None]:
result = pd.DataFrame(queries.query_3())
result

In [None]:
result = pd.DataFrame(queries.query_4_1())
result

In [None]:
result = pd.DataFrame(queries.query_4_2())
result

In [None]:
result = pd.DataFrame(queries.query_4_3())
result

In [None]:
result = pd.DataFrame(queries.query_5())
result

In [None]:
result = pd.DataFrame(queries.query_6())
result

In [None]:
result = pd.DataFrame(queries.query_7())
result

<br> <!-- Cell padder -->
<a name="stats"></a>

## Statistical Analysis (Two-Sample Independent T-test)

In this OLAP application, we incorporated statistical analysis to complement multidimensional exploration of the IMDb data. Specifically, we performed two-sample independent t-tests to investigate whether differences in average ratings, number of votes, and TV series lifespan between groups are statistically significant. The t-statistics were computed directly in PostgreSQL using aggregate queries, and the p-values were then calculated in Python.

For every test, the t-statistic computed in PostgreSQL is passed to a Python helper, `queries.print_p_value_report`, which calculates the p-value and prints the t-statistic, the sample sizes, the p-value, and a conclusion on whether the observed difference is statistically significant at the 5% significance level (α = 0.05). Both two-tailed and one-tailed hypotheses are supported.
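
The step from a t-statistic to a p-value and a significance decision can be sketched as follows. This is not the notebook's `print_p_value_report`, whose implementation is not shown here. The helper names `approx_p_value` and `is_significant` are hypothetical, and the sketch assumes a large-sample normal approximation to the t distribution, which is reasonable only when both groups are large. For small samples, `scipy.stats.t.sf` with the appropriate degrees of freedom would be the more accurate choice.

```python
from statistics import NormalDist


def approx_p_value(t_stat: float, tail: str = "two-sided") -> float:
    """Approximate the p-value of a t-statistic with the standard normal.

    Hypothetical helper. ASSUMPTION: both samples are large enough for the
    t distribution to be close to the standard normal distribution.
    """
    z = NormalDist()  # standard normal: mean 0, standard deviation 1
    if tail == "two-sided":  # H1: the two group means differ
        return 2.0 * (1.0 - z.cdf(abs(t_stat)))
    if tail == "greater":  # H1: mean of group 1 > mean of group 2
        return 1.0 - z.cdf(t_stat)
    if tail == "less":  # H1: mean of group 1 < mean of group 2
        return z.cdf(t_stat)
    raise ValueError(f"unknown tail: {tail!r}")


def is_significant(p_value: float, alpha: float = 0.05) -> bool:
    """Reject the null hypothesis when the p-value falls below alpha."""
    return p_value < alpha


# Illustrative value only, not a result from the warehouse.
p = approx_p_value(2.5)
print(f"p = {p:.4f}, significant at 5%: {is_significant(p)}")
```

A one-tailed test uses the same t-statistic but only one tail of the distribution, so the direction of the alternative hypothesis has to be fixed before looking at the data.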

---

**1. Adult vs. Non-Adult Title Ratings**

The first analysis compared the average ratings of adult and non-adult titles. We computed group statistics (sample size n, mean, and variance) for each category. The test determines whether adult titles receive significantly different audience ratings from non-adult titles. Producers, marketers, and content platforms can use the result to inform content strategy, marketing focus, and demographic targeting.
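
Group statistics of this kind (n, mean, variance) are enough to compute a t-statistic without touching the raw rows. The sketch below assumes Welch's unequal-variance form. The notebook's SQL is not shown here and may use the pooled-variance form instead. The function name `welch_t` and the numbers passed to it are hypothetical.

```python
import math


def welch_t(n1, mean1, var1, n2, mean2, var2):
    """Return (t_statistic, degrees_of_freedom) from group summaries.

    var1 and var2 are sample variances. Hypothetical helper; ASSUMPTION:
    Welch's unequal-variance t-test rather than the pooled form.
    """
    se1 = var1 / n1  # squared standard error of the first group mean
    se2 = var2 / n2  # squared standard error of the second group mean
    t_stat = (mean1 - mean2) / math.sqrt(se1 + se2)
    # Welch-Satterthwaite approximation of the degrees of freedom
    dof = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
    return t_stat, dof


# Made-up summary values, not results from the warehouse.
t_stat, dof = welch_t(n1=1000, mean1=6.9, var1=1.8, n2=400, mean2=6.3, var2=2.4)
print(f"t = {t_stat:.3f}, degrees of freedom = {dof:.1f}")
```

Computing from summaries keeps the heavy aggregation in the database and moves only a handful of numbers into Python.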

In [None]:
result = pd.DataFrame(queries.t_test_1())
df = result  # result is already a DataFrame; pandas has no .DataFrame() method

t_stat = df.loc[0, 't_statistic_adult_vs_non_adult_rating']
n_non_adult = df.loc[0, 'n_non_adult']
n_adult = df.loc[0, 'n_adult']

queries.print_p_value_report(t_stat, n_non_adult, n_adult, "non adult movies", "adult movies", alpha=0.05)

**2. 19th vs. 20th Century Movie Ratings**

This test makes a historical comparison: do movies released in the 19th century differ in average rating from movies released in the 20th century? Group statistics were aggregated by century using the `dim_date` dimension, and the t-statistic was calculated from the means and variances of the two centuries to quantify the significance of the difference. Knowing how audience reception differs across periods helps stakeholders make informed decisions about promoting classic films or curating historical collections.

In [None]:
result = pd.DataFrame(queries.t_test_2())
df = result  # result is already a DataFrame; pandas has no .DataFrame() method

t_stat = df.loc[0, 't_statistic_century_rating_comparison']
n_19thCentury = df.loc[0, 'n_19th']
n_20thCentury = df.loc[0, 'n_20th']

queries.print_p_value_report(t_stat, n_19thCentury, n_20thCentury, "19th Century", "20th Century", alpha=0.05)

**3. Action vs. Comedy Movie Votes**

This third analysis examined whether Action films receive significantly different numbers of votes compared to Comedy films. We aggregated the number of votes for movies in these two primary genres and calculated the t-test statistic to assess whether audience engagement differs by genre. For studios, distributors, and streaming platforms, this information helps guide content investment decisions. If Action films consistently attract more votes, stakeholders might prioritize production or marketing in that genre to maximize audience engagement.

In [None]:
result = pd.DataFrame(queries.t_test_3())
df = result  # result is already a DataFrame; pandas has no .DataFrame() method

t_stat = df.loc[0, 't_statistic_action_vs_comedy_votes']
n_action = df.loc[0, 'n_action']
n_comedy = df.loc[0, 'n_comedy']

queries.print_p_value_report(t_stat, n_action, n_comedy, "action", "comedy", alpha=0.05)

**4. TV Series Lifespan Comparison**

The fourth analysis investigated the lifespan of TV series, comparing those that began in the 1990s with those that began in the 2010s. By computing the mean and variance of the lifespan (end year minus start year) per decade, we assessed whether TV series from these two decades differ significantly in longevity. The result gives network executives and streaming services insight into longevity trends in television production. Understanding whether newer series have shorter or longer lifespans can inform content development, renewal decisions, and scheduling strategies.
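
The per-decade lifespan summary can be illustrated in plain Python. The warehouse computes it in SQL, which is not shown here, so this is only a sketch. The function name `lifespan_stats`, the `(start_year, end_year)` input layout, and the decision to skip series without an end year are all assumptions, and the sample data is made up.

```python
from statistics import mean, variance


def lifespan_stats(series, decade_start):
    """Summarize lifespan (end_year - start_year) for one starting decade.

    `series` is an iterable of (start_year, end_year) pairs. ASSUMPTION:
    pairs with end_year None (no recorded end) are skipped, since no
    lifespan can be computed for them. Needs at least two usable series,
    otherwise statistics.variance raises StatisticsError.
    """
    lifespans = [
        end - start
        for start, end in series
        if end is not None and decade_start <= start < decade_start + 10
    ]
    return {
        "n": len(lifespans),
        "mean": mean(lifespans),
        "variance": variance(lifespans),  # sample variance (n - 1 denominator)
    }


# Made-up (start_year, end_year) pairs, not rows from the warehouse.
sample = [
    (1990, 1995), (1994, 2004), (1998, 1999),
    (2011, 2013), (2015, 2019), (2016, None), (2012, 2012),
]
print(lifespan_stats(sample, 1990))
print(lifespan_stats(sample, 2010))
```

How unfinished series are handled matters for this comparison: recent decades contain more series that are still running, so excluding or truncating them changes the group means.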

In [None]:
result = pd.DataFrame(queries.t_test_4())
df = result  # result is already a DataFrame; pandas has no .DataFrame() method

t_stat = df.loc[0, 't_statistic_tv_series_lifespan']
n_1990s = df.loc[0, 'n_1990s']
n_2010s = df.loc[0, 'n_2010s']

queries.print_p_value_report(t_stat, n_1990s, n_2010s, "1990s", "2010s", alpha=0.05)

**5. Franchise vs. Standalone Film Votes**

The fifth test analyzed whether franchise films differ from standalone films in audience engagement, measured by number of votes. Titles with a `parent_tconst` were classified as franchise titles, while all others were considered standalone. Aggregate statistics enabled the computation of the t-statistic to evaluate whether franchises generally attract more votes than individual films. Film studios, marketing teams, and franchise managers can use this analysis to evaluate the performance and appeal of franchise films relative to standalone releases. It can inform decisions about sequels, spin-offs, and long-term franchise planning.

In [None]:
result = pd.DataFrame(queries.t_test_5())
df = result  # result is already a DataFrame; pandas has no .DataFrame() method

t_stat = df.loc[0, 't_statistic_franchise_vs_standalone_votes']
n_franchise = df.loc[0, 'n_franchise']
n_standalone = df.loc[0, 'n_standalone']

queries.print_p_value_report(t_stat, n_franchise, n_standalone, "franchise titles", "standalone titles", alpha=0.05)

In [None]:
engine.dispose()

---

## Concluding Statement

This integration of OLAP queries and statistical analysis enhances the analytical capabilities of our data warehouse application. Users can not only explore trends and aggregate information across multiple dimensions (such as year, genre, language, and principal), but also quantitatively validate whether observed differences are statistically meaningful. This combination of descriptive analytics and inferential statistics provides a robust platform for data-driven insights into film and television production, audience reception, and genre-specific trends.

It is important to note that the statistical analyses presented here represent just a subset of the possibilities supported by our data warehouse. By adjusting query parameters, exploring additional dimensions, or incorporating different metrics, stakeholders can derive further insights into IMDb titles, including deeper genre analysis, director and actor impact, seasonal trends, or entirely new areas of research based on the rich dataset available in our warehouse. This flexibility demonstrates the capability of the OLAP system to support both routine reporting and advanced exploratory analysis.