# Business Understanding

## Project Overview
For this project, you will use exploratory data analysis to generate insights for a business stakeholder.

## Business Problem
Your company sees the big companies creating original video content and wants to get in on the fun. It has decided to create a new movie studio, but it doesn’t know anything about creating movies. You are charged with exploring what types of films are currently doing best at the box office. You must then translate those findings into actionable insights that the head of the new movie studio can use to decide what types of films to create.

## Project Objectives
### Main Objective
To analyze movie data and uncover patterns in sales, popularity, ratings, and director influence across genres, providing actionable insights for business growth and strategy.

### Specific Objectives
1. **Genre by Sales**  
   Identify which genres generate the most revenue and analyze the trends behind their sales performance.  
   Tables: `bom.movie_gross.csv`, `rt.movie_info.tsv`

2. **Genre by Popularity**  
   Understand which genres are most popular among audiences and explore the factors driving their popularity.  
   Tables: `bom.movie_gross.csv`, `tmdb.movies.csv`

3. **Genre by Rating**  
   Examine the ratings of movies across different genres to evaluate their critical reception.  
   DB: `im.db`; tables: `movie_basics`, `movie_ratings`

4. **Directors by Genre**  
   Determine which directors are most associated with specific genres and assess their impact on genre success.  
   DB: `im.db`; tables: `movie_basics`, `directors`

### The Data
In the folder `zippedData` are movie datasets from:

* [Box Office Mojo](https://www.boxofficemojo.com/)
* [IMDB](https://www.imdb.com/)
* [Rotten Tomatoes](https://www.rottentomatoes.com/)
* [TheMovieDB](https://www.themoviedb.org/)
* [The Numbers](https://www.the-numbers.com/)

Because the data was collected from various sources, the files have different formats. Some are compressed CSV (comma-separated values) or TSV (tab-separated values) files that can be opened using spreadsheet software or `pd.read_csv`, while the data from IMDB is stored in a SQLite database.

![movie data erd](https://raw.githubusercontent.com/learn-co-curriculum/dsc-phase-2-project-v3/main/movie_data_erd.jpeg)

Note that the above diagram shows ONLY the IMDB data. You will need to look carefully at the features to figure out how the IMDB data relates to the other provided data files.
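
Since the files share no common ID, one possible way to relate the IMDB tables to the other files is to join on a normalized title plus release year. The sketch below uses small made-up frames. The column names (`primary_title`, `start_year`, `title`, `year`, `domestic_gross`) are assumptions about the file layouts, and the helper `title_key` is hypothetical.

```python
import pandas as pd

def title_key(titles: pd.Series) -> pd.Series:
    """Lower-case titles and strip punctuation and extra spaces so near-identical titles match."""
    return (titles.str.lower()
                  .str.replace(r"[^a-z0-9 ]", "", regex=True)
                  .str.replace(r"\s+", " ", regex=True)
                  .str.strip())

# Made-up rows standing in for IMDB movie_basics (column names are assumptions)
imdb = pd.DataFrame({
    "movie_id": ["tt001", "tt002"],
    "primary_title": ["Toy Film 3", "Dream Heist"],
    "start_year": [2010, 2010],
})

# Made-up rows standing in for a box-office file (column names are assumptions)
gross = pd.DataFrame({
    "title": ["Toy Film 3 ", "DREAM HEIST!", "Unmatched Film"],
    "year": [2010, 2010, 2011],
    "domestic_gross": [100.0, 200.0, 1.0],
})

imdb["key"] = title_key(imdb["primary_title"])
gross["key"] = title_key(gross["title"])

# Match on the normalized title AND the year to reduce false matches between remakes
linked = imdb.merge(gross, left_on=["key", "start_year"],
                    right_on=["key", "year"], how="inner")
print(linked[["movie_id", "title", "domestic_gross"]])
```

Titles that differ by more than case or punctuation will still go unmatched, so it is worth checking how many rows survive the join.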

It is up to you to decide which of this data to use and how to use it. If you want to make this more challenging, you can scrape websites or make API calls to get additional data. If you are feeling overwhelmed or behind, we recommend you use only the following data files:

* `im.db.zip`
  * Zipped SQLite database (you will need to unzip then query using SQLite)
  * `movie_basics` and `movie_ratings` tables are most relevant
* `bom.movie_gross.csv.gz`
  * Compressed CSV file (you can open without expanding the file using `pd.read_csv`)
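
The two access patterns described above can be sketched as follows. This is a minimal, self-contained illustration that builds tiny stand-in files in a temporary folder. The file names `toy_gross.csv.gz` and `toy.db.zip` and their contents are made up and are not the project data.

```python
import gzip
import sqlite3
import tempfile
import zipfile
from pathlib import Path

import pandas as pd

workdir = Path(tempfile.mkdtemp())

# --- Build tiny stand-in files (toy data, not the project data) ---
csv_gz = workdir / "toy_gross.csv.gz"
with gzip.open(csv_gz, "wt", encoding="utf-8") as fh:
    fh.write("title,domestic_gross\nFilm A,100\nFilm B,250\n")

db_path = workdir / "toy.db"
con = sqlite3.connect(db_path)
con.execute("CREATE TABLE movie_basics (movie_id TEXT, primary_title TEXT)")
con.execute("INSERT INTO movie_basics VALUES ('tt1', 'Film A')")
con.commit()
con.close()

zip_path = workdir / "toy.db.zip"
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.write(db_path, arcname="toy.db")
db_path.unlink()  # keep only the zipped copy, like the provided archive

# --- Read them the way the text describes ---
# pandas infers gzip compression from the ".gz" suffix, so no manual expansion is needed
toy_gross = pd.read_csv(csv_gz)

# The SQLite database has to be unzipped before it can be queried
with zipfile.ZipFile(zip_path) as zf:
    zf.extractall(workdir)

con = sqlite3.connect(workdir / "toy.db")
toy_basics = pd.read_sql("SELECT * FROM movie_basics;", con)
con.close()

print(toy_gross)
print(toy_basics)
```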

# Data Understanding 

In [2]:
# importing libraries for data manipulation (pandas, numpy), visualization (seaborn, matplotlib) and database access (sqlite3)

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
import sqlite3
import warnings

# Suppress all warnings
warnings.filterwarnings('ignore')

In [3]:
# set the maximum number of columns to 40 to display all columns
pd.set_option('display.max_columns', 40)

<b>rt.movie_info.tsv</b>

In [4]:
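# Note: this raises FileNotFoundError unless 'rt.movie_info.tsv' is in the working directory.
# The provided data lives in the `zippedData` folder, so adjust the path if needed.
movie_df = pd.read_csv('rt.movie_info.tsv', sep='\t')
movie_df.head()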

FileNotFoundError: [Errno 2] No such file or directory: 'rt.movie_info.tsv'

In [None]:
movie_df.tail()

In [None]:
movie_df.shape

In [None]:
movie_df.info()

<b>bom.movie_gross.csv</b>

In [None]:
gross_df = pd.read_csv("bom.movie_gross.csv")

In [None]:
gross_df.head()

In [None]:
gross_df.tail()

In [None]:
gross_df.info()

In [None]:
gross_df.describe()

<b>tmdb.movies.csv</b>

In [None]:
tmdb_df = pd.read_csv("tmdb.movies.csv")
tmdb_df.head()

In [None]:
tmdb_df.tail()

In [None]:
tmdb_df.info()

In [None]:
tmdb_df.describe()

In [None]:
# connecting to the SQLite database
conn = sqlite3.connect('im.db')


In [None]:
#getting table names
cursor = conn.cursor()
cursor.execute("""SELECT name
    FROM sqlite_master
    WHERE type = 'table';""")
print(cursor.fetchall())
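
Beyond the table names, it can help to list each table's columns before loading anything. The following is a minimal sketch against an in-memory database. The two tables and their columns (`averagerating`, `numvotes`, `person_id`) are stand-ins chosen for illustration, not a confirmed copy of the `im.db` schema.

```python
import sqlite3

import pandas as pd

# In-memory stand-in database (table layouts are assumptions)
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE movie_ratings (movie_id TEXT, averagerating REAL, numvotes INTEGER)")
con.execute("CREATE TABLE directors (movie_id TEXT, person_id TEXT)")

# Same sqlite_master query as above, read straight into a DataFrame
tables = pd.read_sql("SELECT name FROM sqlite_master WHERE type = 'table';", con)

# PRAGMA table_info returns one row per column of the given table
schema = {
    name: pd.read_sql(f"PRAGMA table_info({name});", con)["name"].tolist()
    for name in tables["name"]
}
con.close()

print(schema)
```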

In [None]:
mbasics_df = pd.read_sql("""SELECT * FROM movie_basics;""",conn)

In [None]:
mbasics_df.head()

In [None]:
mbasics_df.tail()

In [None]:
mbasics_df.info()

In [None]:
rating_df = pd.read_sql("""SELECT * FROM movie_ratings;""",conn)

In [None]:
rating_df.head()

In [None]:
rating_df.tail()

In [None]:
rating_df.info()

In [None]:
rating_df.describe()

In [None]:
directors_df = pd.read_sql("""SELECT * FROM directors;""",conn)

In [None]:
directors_df.head()

In [None]:
directors_df.tail()

In [None]:
directors_df.info()

In [None]:
#closing database
conn.close()

## Data Cleaning

### Missing Values

In [None]:
movie_df.isna().sum()

In [None]:
# drop the columns with more than 1000 null values
movie_df = movie_df.drop(['currency', 'box_office', 'studio'], axis=1)

In [None]:
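# replacing movie genre nulls with the mode
# (assign the result back; calling fillna(inplace=True) on a selected column is deprecated in recent pandas)
genre_mode = movie_df['genre'].mode()[0]
movie_df['genre'] = movie_df['genre'].fillna(genre_mode)
movie_df['genre'].isna().sum()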

In [None]:
# replacing movie rating nulls with the mode
rating_mode = movie_df['rating'].mode()[0]
movie_df['rating'] = movie_df['rating'].fillna(rating_mode)
movie_df['rating'].isna().sum()
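
Mode imputation inflates the most common category, so it may be worth checking how many rows are affected before filling. A minimal sketch on made-up data (the frame `toy` is hypothetical):

```python
import numpy as np
import pandas as pd

# Made-up genre column with two missing values
toy = pd.DataFrame({"genre": ["Drama", np.nan, "Comedy", "Drama", np.nan]})

genre_mode = toy["genre"].mode()[0]          # most frequent non-null value
missing_share = toy["genre"].isna().mean()   # fraction of rows that will be imputed

# Assign the filled column back instead of using inplace=True on a selection
toy["genre"] = toy["genre"].fillna(genre_mode)

print(genre_mode, missing_share)
print(toy["genre"].value_counts())
```

If the missing share is large, dropping those rows or labelling them as a separate "Unknown" category are alternatives to consider.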

In [None]:
#drop the rest with nulls
movie_df.dropna(inplace=True)
movie_df.isna().sum()

In [None]:
gross_df.isna().sum()

In [None]:
# replacing missing domestic and foreign gross values with 0
gross_df['foreign_gross'] = gross_df['foreign_gross'].fillna(0)
gross_df['domestic_gross'] = gross_df['domestic_gross'].fillna(0)

In [None]:
#drop the rest with nulls
gross_df.dropna(inplace=True)
gross_df.isna().sum()

In [None]:
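# converting 'foreign_gross' to float; strip thousands separators first so that any
# strings such as '1,131.6' are parsed instead of being coerced to NaN
gross_df['foreign_gross'] = pd.to_numeric(
    gross_df['foreign_gross'].astype(str).str.replace(',', '', regex=False),
    errors='coerce')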

# calculating 'total_gross' as the sum of 'domestic_gross' and 'foreign_gross'
gross_df['total_gross'] = gross_df['domestic_gross'] + gross_df['foreign_gross']

gross_df[['domestic_gross', 'foreign_gross', 'total_gross']].head()
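
The conversion step is easy to get subtly wrong, because `pd.to_numeric(..., errors='coerce')` silently turns any string it cannot parse into `NaN`, including numbers written with thousands separators. A small sketch on made-up values (the frame `toy` and its numbers are illustrative only):

```python
import pandas as pd

# Made-up values: one plain string, one with a thousands separator, one missing
toy = pd.DataFrame({
    "domestic_gross": [100.0, 50.0, 20.0],
    "foreign_gross": ["200", "1,131.6", None],
})

# Without cleaning, the comma-formatted value is lost
naive = pd.to_numeric(toy["foreign_gross"], errors="coerce")

# Remove separators first, then convert, then treat missing values as 0
cleaned = toy["foreign_gross"].str.replace(",", "", regex=False)
toy["foreign_gross"] = pd.to_numeric(cleaned, errors="coerce").fillna(0)
toy["total_gross"] = toy["domestic_gross"] + toy["foreign_gross"]

print(naive.isna().sum())
print(toy)
```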


In [None]:
tmdb_df.isna().sum()

In [None]:
mbasics_df.isna().sum()

In [None]:
rating_df.isna().sum()

In [None]:
directors_df.isna().sum()

### Changing Columns

In [None]:
# Renaming columns in tmdb_df
tmdb_df = tmdb_df.rename(columns={'Unnamed: 0': 'id', 'id': 'tmdb_id'})

tmdb_df.head()


### Checking Duplicates

In [None]:
movie_df.duplicated().sum()

In [None]:
gross_df.duplicated().sum()

In [None]:
tmdb_df.duplicated().sum()

In [None]:
mbasics_df.duplicated().sum()

In [None]:
# drop_duplicates returns a new DataFrame, so assign the result back
mbasics_df = mbasics_df.drop_duplicates()
mbasics_df.duplicated().sum()

In [None]:
rating_df.duplicated().sum()

In [None]:
directors_df.duplicated().sum()

In [None]:
directors_df.drop_duplicates(inplace=True)
directors_df.duplicated().sum()
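
As a quick illustration of how `drop_duplicates` behaves, the sketch below uses a made-up frame shaped like a movie-to-director link table (the column names are assumptions). Without `inplace=True` or reassignment, the original frame is left unchanged.

```python
import pandas as pd

# Made-up link table with one fully repeated row
toy = pd.DataFrame({
    "movie_id": ["tt1", "tt1", "tt2", "tt2"],
    "person_id": ["nm1", "nm1", "nm2", "nm3"],
})

n_before = len(toy)
result = toy.drop_duplicates()   # returns a new frame; `toy` itself is unchanged
n_removed = n_before - len(result)

print(n_removed, len(toy), len(result))
```

Rows that share a `movie_id` but differ in `person_id` are kept, since only fully identical rows count as duplicates by default.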

### Feature engineering

In [None]:
# convert runtime to a number: split on the space to separate the value from the 'minutes' label
movie_df['runtime'] = movie_df['runtime'].str.split(" ").str[0]
movie_df['runtime'] = pd.to_numeric(movie_df['runtime'], errors='coerce')

# rename the column from 'runtime' to 'runtime_in_minutes'
movie_df = movie_df.rename(columns={'runtime': 'runtime_in_minutes'})

# preview the first few rows
movie_df['runtime_in_minutes'].head()
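
An alternative to splitting on a space is to pull out the leading digits with a regular expression, which also copes with entries that do not follow the `"<number> minutes"` pattern. A minimal sketch on made-up values:

```python
import pandas as pd

# Made-up runtime strings, including a missing and a malformed entry
toy = pd.Series(["104 minutes", "95 minutes", None, "unknown"])

# Capture the first run of digits; rows without digits become NaN
minutes = pd.to_numeric(toy.str.extract(r"(\d+)", expand=False), errors="coerce")

print(minutes)
```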

In [None]:
# Split 'genre' into 'main_genre' and 'supporting_genre'
movie_df['main_genre'] = movie_df['genre'].str.split('|').str[0]
movie_df['supporting_genre'] = movie_df['genre'].str.split('|').apply(lambda x: '|'.join(x[1:]) if len(x) > 1 else '')

# Preview the result
movie_df[['genre', 'main_genre', 'supporting_genre']].head()
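
The same split can be done in a single pass with `str.split(..., n=1, expand=True)`, which separates the first genre from everything after it. A sketch on made-up genre strings (the values are illustrative):

```python
import pandas as pd

# Made-up pipe-separated genre strings
toy = pd.DataFrame({"genre": ["Action and Adventure|Drama|Comedy", "Drama", "Comedy|Romance"]})

# n=1 splits only at the first '|': column 0 is the main genre, column 1 the remainder
parts = toy["genre"].str.split("|", n=1, expand=True)
toy["main_genre"] = parts[0]
toy["supporting_genre"] = parts[1].fillna("")   # no remainder -> empty string

print(toy)
```

Note that treating the first listed genre as the "main" one is a modelling choice; it assumes the order in the source string is meaningful.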


In [None]:
# Convert 'theater_date' and 'dvd_date' columns to datetime format
movie_df['theater_date'] = pd.to_datetime(movie_df['theater_date'], format='%b %d, %Y')
movie_df['dvd_date'] = pd.to_datetime(movie_df['dvd_date'], format='%b %d, %Y')

# preview the result
movie_df[['theater_date', 'dvd_date']].head()

In [None]:
# convert 'release_date' to datetime format
tmdb_df['release_date'] = pd.to_datetime(tmdb_df['release_date'], format='%Y-%m-%d')

# extract the year and create a new column 'release_year'
tmdb_df['release_year'] = tmdb_df['release_date'].dt.year

tmdb_df[['release_date', 'release_year']].head()
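
With a fixed `format`, `pd.to_datetime` raises on any value that does not match. If malformed dates are a possibility, `errors='coerce'` turns them into `NaT` so they can be counted instead. A sketch on made-up values:

```python
import pandas as pd

# Made-up dates, including a malformed and a missing entry
toy = pd.DataFrame({"release_date": ["2010-11-19", "2011-07-15", "not a date", None]})

toy["release_date"] = pd.to_datetime(toy["release_date"], format="%Y-%m-%d", errors="coerce")

# Nullable integer dtype keeps years as integers even when some dates are missing
toy["release_year"] = toy["release_date"].dt.year.astype("Int64")
n_unparsed = toy["release_date"].isna().sum()

print(toy)
print(n_unparsed)
```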

### Saving Dataset

In [None]:
movie_df.to_csv("movie_info_clean.csv")

In [None]:
gross_df.to_csv("movie_gross_clean.csv")

In [None]:
# merging mbasics_df and rating_df on 'movie_id' (inner join keeps only movies that have a rating)
movie_basics_rating_df = pd.merge(mbasics_df, rating_df, on='movie_id', how='inner')

movie_basics_rating_df.to_csv("movie_basics_rating_clean.csv")
movie_basics_rating_df.head()
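
Two optional safeguards may be useful around this step: `validate` makes the merge fail loudly if `movie_id` is not unique on both sides, and `index=False` avoids writing the row index, which would otherwise come back as an `Unnamed: 0` column when the file is read again. A sketch on made-up frames (column names other than `movie_id` are assumptions):

```python
import io

import pandas as pd

# Made-up stand-ins for the two IMDB tables
basics = pd.DataFrame({"movie_id": ["tt1", "tt2", "tt3"],
                       "primary_title": ["Film A", "Film B", "Film C"]})
ratings = pd.DataFrame({"movie_id": ["tt1", "tt2"],
                        "averagerating": [7.1, 6.4]})

# validate raises pandas.errors.MergeError if either side has repeated movie_id values
merged = pd.merge(basics, ratings, on="movie_id", how="inner", validate="one_to_one")

# Round-trip through CSV without the index (an in-memory buffer stands in for a file)
buffer = io.StringIO()
merged.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(reloaded)
```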


In [None]:
directors_df.to_csv("director_clean.csv")

# Data Preparation


# Modeling


# Evaluation