## Final Project Submission

Please fill out:
* Student name: ALFRED RICKY OTIENO
* Student pace: PART TIME
* Scheduled project review date/time: 
* Instructor name: Daniel Ekale
* Blog post URL:



#### Step 1: **Introduction**
The primary aim of this project is to provide data-driven insights for a company looking to establish a new movie studio. With the entertainment industry evolving, the goal is to analyze existing film data to understand what types of movies perform best at the box office, helping the studio decide which genres, contributors, and themes to prioritize.



#### Step 2: **Objectives**
Our analysis will focus on the following objectives:

1. **Identify Popular Genres**: Analyze which genres tend to have higher ratings and more viewer engagement.
2. **Analyze Characteristics of High-Rated Movies**: Examine factors such as runtime, year of release, and genre combinations to see if they correlate with higher ratings.
3. **Determine Key Contributors**: Identify directors, writers, and actors who have contributed to successful movies, as potential partners for the new studio.
4. **Investigate Trends Over Time**: Look at how preferences in ratings, movie length, and genres have evolved over the years, highlighting trends that may be valuable for the new studio to consider.




#### Step 3: **Business Problem**
The company needs guidance on what types of films to produce in order to maximize success and minimize risk. This analysis provides data-backed recommendations on movie genres, lengths, themes, contributors (such as popular directors or actors), and other key characteristics that correlate with successful films.



#### Step 4: **Data Understanding**


**Key Questions for Data Understanding**:
- What is the distribution of genres in the dataset?
- Are there missing values in critical columns (e.g., ratings, genres, runtime)?
- How are ratings distributed across movies?
- What are the relationships between tables that can help us analyze contributor impact (e.g., directors, writers)?
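As a rough illustration of how the first and last questions might be approached, the sketch below counts genres and joins ratings onto basic movie details. The rows are hypothetical toy data, and the sketch assumes that `genres` holds one comma-separated string per movie and that `movie_id` is the shared key between tables.

```python
import pandas as pd

# Hypothetical toy rows; column names mirror movie_basics and movie_ratings.
basics = pd.DataFrame({
    "movie_id": ["tt1", "tt2", "tt3"],
    "genres": ["Action,Drama", "Drama", None],
})
ratings = pd.DataFrame({
    "movie_id": ["tt1", "tt2"],
    "averagerating": [7.0, 6.0],
})

# Assumption: genres are stored as one comma-separated string per movie,
# so each movie can contribute to several genre counts.
genre_counts = (
    basics["genres"]
    .dropna()
    .str.split(",")
    .explode()
    .value_counts()
)

# A left join keeps movies without a rating, so the gap stays visible.
merged = basics.merge(ratings, on="movie_id", how="left")
unrated = int(merged["averagerating"].isna().sum())
```

A left join is used here so that unrated movies are counted rather than silently dropped; an inner join would be the choice if only rated movies matter.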



In [1]:
# Import all libraries to be used
import pandas as pd
import numpy as np
import seaborn as sns
import scipy as sp
import scipy.stats as st
import sqlite3
import zipfile
import os
import statsmodels.api as sm
from statsmodels.stats.power import TTestIndPower, TTestPower
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
%matplotlib inline
import plotly.express as px
# Ignore all warnings
import warnings
warnings.filterwarnings('ignore')

#### **Data Loading and Initial Display**

With the objectives defined, we load the data and display a snapshot of each table's structure.

In [4]:
# Define paths
# Raw strings keep the Windows backslashes from being read as escape sequences
zip_file_path = r"C:\Projects\group3-phase2-project\data\im.db.zip"
extracted_dir = r"C:\Projects\group3-phase2-project\data\extracted"

In [5]:
# Step 1: Unzip the file
with zipfile.ZipFile(zip_file_path, "r") as zip_ref:
    zip_ref.extractall(extracted_dir)
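Before extracting, it can help to confirm that the archive actually contains the expected database file. The following is a minimal sketch of that check; it builds a small in-memory archive with placeholder bytes (an assumption made so the example runs without the project files) instead of reading `im.db.zip`.

```python
import io
import zipfile

# Build a small in-memory archive so the sketch runs without the project files.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("im.db", b"placeholder bytes")

# List member names and uncompressed sizes before extracting anything.
with zipfile.ZipFile(buffer, "r") as zf:
    members = {info.filename: info.file_size for info in zf.infolist()}
    has_db = "im.db" in zf.namelist()
```

With the real archive, the same `namelist()` and `infolist()` calls would be made on `zipfile.ZipFile(zip_file_path, "r")`.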

In [6]:
# Define the path to the extracted database file
db_path = os.path.join(extracted_dir, "im.db")

In [7]:
# Step 2: Check if the database file exists and connect to it
if os.path.exists(db_path) and os.path.getsize(db_path) > 0:
    # Connect to the SQLite database
    conn = sqlite3.connect(db_path)

In [8]:
# Step 3: Confirm the database file exists and is not empty, then list its tables
if os.path.exists(db_path) and os.path.getsize(db_path) > 0:
    print("Database file exists and is not empty. Proceeding with connection.")
    
    # Connect to the SQLite database
    conn = sqlite3.connect(db_path)
    
    # Query sqlite_master for the names of all tables
    tables = pd.read_sql_query("SELECT name FROM sqlite_master WHERE type='table';", conn)
    
    if tables.empty:
        print("No tables found in the database. The database might be empty or corrupted.")
    else:
        print("Tables found:", tables)
else:
    print("Database file is either missing or empty.")

Database file exists and is not empty. Proceeding with connection.
Tables found:             name
0   movie_basics
1      directors
2      known_for
3     movie_akas
4  movie_ratings
5        persons
6     principals
7        writers
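Once the table names are known, the columns of each table can be inspected without loading any rows. The sketch below uses SQLite's `PRAGMA table_info` on an in-memory database; the `conn_demo` connection and the column types in the `CREATE TABLE` statement are assumptions for illustration, not the actual `im.db` schema.

```python
import sqlite3
import pandas as pd

# In-memory stand-in for im.db; the declared column types are assumed.
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute(
    "CREATE TABLE movie_ratings "
    "(movie_id TEXT, averagerating REAL, numvotes INTEGER)"
)

# PRAGMA table_info returns one row per column: cid, name, type, notnull, ...
schema = pd.read_sql_query("PRAGMA table_info(movie_ratings);", conn_demo)
column_names = schema["name"].tolist()
conn_demo.close()
```

Running the same query against `conn` for each of the eight tables would show which ones share `movie_id` or a person identifier, which is useful when planning joins.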


In [9]:
# Step 4: Load tables into DataFrames for analysis
movie_basics = pd.read_sql_query("SELECT * FROM movie_basics;", conn)
movie_ratings = pd.read_sql_query("SELECT * FROM movie_ratings;", conn)
principals = pd.read_sql_query("SELECT * FROM principals;", conn)
persons = pd.read_sql_query("SELECT * FROM persons;", conn)
known_for = pd.read_sql_query("SELECT * FROM known_for;", conn)
directors = pd.read_sql_query("SELECT * FROM directors;", conn)
writers = pd.read_sql_query("SELECT * FROM writers;", conn)
movie_akas = pd.read_sql_query("SELECT * FROM movie_akas;", conn)
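The eight near-identical `read_sql_query` calls could also be written as a loop over the table names. This is only a sketch of that alternative, run against a hypothetical in-memory database with two toy tables; the `frames` dictionary and `conn_demo` connection are names introduced here, not part of the original notebook.

```python
import sqlite3
import pandas as pd

# Hypothetical in-memory database with two small tables.
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute("CREATE TABLE directors (movie_id TEXT, person_id TEXT)")
conn_demo.execute("CREATE TABLE writers (movie_id TEXT, person_id TEXT)")
conn_demo.execute("INSERT INTO directors VALUES ('tt1', 'nm1')")

# Read the table names from sqlite_master, as in the cell above.
table_names = pd.read_sql_query(
    "SELECT name FROM sqlite_master WHERE type='table';", conn_demo
)["name"]

# The names come from the database itself rather than user input, so
# interpolating them (quoted as identifiers) into the query is acceptable here.
frames = {
    name: pd.read_sql_query(f'SELECT * FROM "{name}";', conn_demo)
    for name in table_names
}
conn_demo.close()
```

A dictionary keeps every table reachable by name (for example `frames["directors"]`), at the cost of losing the separate variable names used in the rest of the notebook.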

In [10]:
# Display a brief info summary for each DataFrame.
# DataFrame.info() prints its summary and returns None, so it is called
# in a loop instead of storing the return values in a dict.
for df in (movie_basics, movie_ratings, principals, persons,
           known_for, directors, writers, movie_akas):
    df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 146144 entries, 0 to 146143
Data columns (total 6 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   movie_id         146144 non-null  object 
 1   primary_title    146144 non-null  object 
 2   original_title   146123 non-null  object 
 3   start_year       146144 non-null  int64  
 4   runtime_minutes  114405 non-null  float64
 5   genres           140736 non-null  object 
dtypes: float64(1), int64(1), object(4)
memory usage: 6.7+ MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 73856 entries, 0 to 73855
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   movie_id       73856 non-null  object 
 1   averagerating  73856 non-null  float64
 2   numvotes       73856 non-null  int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 1.7+ MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1028186 entr
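The `info()` summary for `movie_basics` shows fewer non-null values in `runtime_minutes` (114,405) and `genres` (140,736) than the 146,144 total entries. One way to turn such counts into a percentage of missing values per column is sketched below on a hypothetical four-row frame; the `toy` data is invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with the same kind of gaps as movie_basics.
toy = pd.DataFrame({
    "movie_id": ["tt1", "tt2", "tt3", "tt4"],
    "runtime_minutes": [90.0, np.nan, 120.0, np.nan],
    "genres": ["Drama", None, "Comedy", "Action"],
})

# isna() marks missing cells; the column mean of those flags is the
# missing share, converted here to a percentage.
missing_pct = toy.isna().mean().mul(100).round(1)
```

Applying the same expression to `movie_basics` would give the missing share for each of its six columns in one step.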