# From Data to Decisions: Understanding Movie Success Through Rotten Tomatoes and TMDB

**Author:** David Mwai Gathimba

## Project Overview

In this project, we provide a data-driven report to Microsoft's new movie studio on the factors behind movie success, using data from Rotten Tomatoes and TMDB. Because the company has virtually no experience in filmmaking, a core business question is which types of movies are likely to perform best in theaters. To look for trends and correlations, we use datasets of movie reviews, ratings, and popularity metrics. We walk through data cleaning, exploratory data analysis, and visualization to surface leading indicators. The results show that success is associated with genre, review scores, and popularity. Based on these results, we recommend that Microsoft focus on well-rated movies in popular genres to maximize performance in a competitive film market.


## Business Problem
***
Microsoft is launching a new film studio and entering a competitive industry in which it has no experience, so it is turning to data to guide its decisions. The business problem is to identify the features that most influence a movie's revenue and popularity. We address the following data questions:
- Which genres perform well at the box office?
- Do Rotten Tomatoes scores affect box office revenue?
- How does a movie's popularity on TMDB relate to its box office performance?
- Do attributes such as runtime or release period play a significant role in a movie's success?
- How can these insights inform a strategic entry into movie production for Microsoft?

These questions map directly to quantitative properties of movies and look for patterns in the data that can advise Microsoft at the strategy level. Insight into these factors is essential for Microsoft to control costs, allocate resources efficiently, and give its studio the best chance of success in an already crowded market. Beyond genre popularity and the effects of reviews and popularity, we will use 10-fold cross-validation to assess how well a predictive model can forecast box office performance, giving Microsoft a guide for data-driven production decisions.
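To make the 10-fold cross-validation step concrete, here is a minimal sketch using only NumPy. It is not the project's actual model: the helper `k_fold_rmse`, the least-squares linear fit, and the synthetic arrays `X_demo` and `y_demo` are all assumptions for illustration, since the provided datasets do not contain revenue.

```python
import numpy as np

def k_fold_rmse(X, y, k=10, seed=0):
    """Return the RMSE on each of k held-out folds for a least-squares linear fit."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(y))      # shuffle rows once
    folds = np.array_split(indices, k)     # k roughly equal folds
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # Add an intercept column, then fit on the k-1 training folds
        A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        # Score on the single held-out fold
        A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        residuals = y[test_idx] - A_test @ coef
        scores.append(float(np.sqrt(np.mean(residuals ** 2))))
    return scores

# Synthetic placeholder data, NOT real movie figures: two features standing in
# for review score and popularity, and a made-up target standing in for revenue.
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 100, size=(200, 2))
y_demo = 1.5 * X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(0, 5, size=200)

fold_scores = k_fold_rmse(X_demo, y_demo, k=10)
print(f"mean RMSE over {len(fold_scores)} folds: {np.mean(fold_scores):.2f}")
```

Each row is held out exactly once, so the mean of the ten fold scores estimates out-of-sample error rather than fit on the training data. scikit-learn's `KFold` and `cross_val_score` provide the same procedure if that library is available.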
***

## Data Understanding
***
The data used in this project comes from two primary sources: Rotten Tomatoes (rt.reviews.tsv) and TheMovieDB (tmdb.movies.csv). These datasets provide a comprehensive view of movie reviews, ratings, and popularity, which are essential for analyzing the factors that contribute to a movie's success.

The Rotten Tomatoes dataset includes reviews and ratings for a wide range of movies. This data is crucial for understanding how critical reception correlates with box office performance. Key variables in this dataset include the movie title, review scores, and the number of reviews. The sample consists of various movies reviewed on the Rotten Tomatoes platform, capturing both popular and niche films.

The TMDB dataset contains detailed information about movies listed on TheMovieDB. The key variables are the movie title, genre, popularity score, release date, and runtime. This dataset helps us analyze how movie characteristics and popularity on a major database relate to success. The sample encompasses a diverse array of movies from different genres and periods.

Our target variable in this analysis is box office revenue. It is not directly included in the provided datasets, but it can be inferred or matched from external sources if needed. For this project, we focus on understanding how the review scores from Rotten Tomatoes and the popularity scores from TMDB relate to a movie's box office success.
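If revenue is matched from an external source, one possible approach is a left join on a normalized title key. The sketch below uses tiny placeholder frames. The column names `title`, `popularity`, and `worldwide_gross`, the helper `normalize_title`, and all values are hypothetical and do not come from the project data.

```python
import pandas as pd

# Tiny illustrative frames; titles, columns, and values are placeholders.
tmdb_sample = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C"],
    "popularity": [12.3, 4.5, 30.1],
})
revenue_sample = pd.DataFrame({
    "title": ["film a ", "Film C", "Film D"],
    "worldwide_gross": [1_000_000, 5_000_000, 250_000],
})

def normalize_title(series):
    """Strip whitespace and lowercase so trivially different titles still match."""
    return series.str.strip().str.lower()

tmdb_sample["title_key"] = normalize_title(tmdb_sample["title"])
revenue_sample["title_key"] = normalize_title(revenue_sample["title"])

# A left join keeps every TMDB row; indicator=True records which rows found a match.
merged = tmdb_sample.merge(
    revenue_sample[["title_key", "worldwide_gross"]],
    on="title_key",
    how="left",
    indicator=True,
)
unmatched = merged.loc[merged["_merge"] == "left_only", "title"].tolist()

print(merged[["title", "popularity", "worldwide_gross"]])
print("No revenue match for:", unmatched)
```

Matching on title alone can confuse remakes and films that share a name, so in practice a release year would likely be added to the join key.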

The properties of the variables we intend to use include:
- *Review Scores (Rotten Tomatoes):* Represents the average rating given by critics or users. This variable is numerical and ranges from 0 to 100.
- *Popularity Scores (TMDB):* Indicates the popularity of a movie on TheMovieDB platform. This is a numerical variable with values reflecting the relative popularity of the movies.
- *Genres:* Categorical variable representing the movie genre (e.g., Action, Drama, Comedy).
- *Release Dates:* Temporal variable indicating when the movie was released.
- *Runtime:* Numerical variable representing the duration of the movie in minutes.

These variables provide a robust basis for exploring the correlations and trends that can inform Microsoft's movie production strategy.
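Raw files do not always store variables in the types listed above, so a short coercion step is usually needed before analysis. The sketch below assumes, purely for illustration, that ratings may arrive as fraction strings such as "3/5" and that dates and runtimes arrive as text. The helper `rating_to_percent`, the column names, and the sample values are hypothetical.

```python
import pandas as pd

def rating_to_percent(value):
    """Convert 'x/y' or plain numeric strings to a 0-100 score; None if unparseable."""
    try:
        text = str(value).strip()
        if "/" in text:
            num, den = text.split("/", 1)
            den = float(den)
            if den == 0:
                return None
            score = 100 * float(num) / den
        else:
            score = float(text)
    except ValueError:
        return None
    # Reject values outside the stated 0-100 range (this also rejects NaN)
    return score if 0 <= score <= 100 else None

# Placeholder rows, not project data
sample = pd.DataFrame({
    "rating": ["3/5", "85", "B+", None],
    "release_date": ["2018-11-16", "2010-05-07", "not a date", "2015-01-01"],
    "runtime": ["120", "95", None, "n/a"],
})

sample["rating_pct"] = sample["rating"].map(rating_to_percent)
# errors="coerce" turns unparseable entries into NaT / NaN instead of raising
sample["release_date"] = pd.to_datetime(sample["release_date"], errors="coerce")
sample["release_month"] = sample["release_date"].dt.month
sample["runtime"] = pd.to_numeric(sample["runtime"], errors="coerce")

print(sample.dtypes)
```

Coercing bad entries to missing values, rather than dropping rows immediately, lets us count how much data each variable loses before deciding how to handle it.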
***

In [1]:
# Import standard packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [5]:
# Loading the data
# Load Rotten Tomatoes reviews data (tab-separated, so pass sep='\t')
rt_reviews = pd.read_csv('/mnt/Data/rt.reviews.tsv', sep='\t')

# Load TMDB movies data
tmdb_reviews = pd.read_csv('/mnt/Data/tmdb.movies.csv')

# First few rows of each
print(rt_reviews.head())
print(tmdb_reviews.head())

FileNotFoundError: [Errno 2] No such file or directory: '/mnt/Data/rt.reviews.tsv'
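The `FileNotFoundError` above means the file was not at the given location when the cell ran. One way to make this failure easier to diagnose is to check the path before reading and to choose the separator from the file extension. The helper `load_table` below is a hypothetical sketch, and the demo writes a small temporary file rather than touching the project data.

```python
import tempfile
from pathlib import Path

import pandas as pd

def load_table(path, **read_kwargs):
    """Read a delimited file, using a tab separator for .tsv and a comma otherwise."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(
            f"Expected data file at {path.resolve()}; check the data directory."
        )
    # Only set the separator if the caller did not pass one explicitly
    read_kwargs.setdefault("sep", "\t" if path.suffix.lower() == ".tsv" else ",")
    return pd.read_csv(path, **read_kwargs)

# Demo on a throwaway file with placeholder contents
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.tsv"
    demo_path.write_text("id\trating\n1\t3/5\n2\t4/5\n", encoding="utf-8")
    demo = load_table(demo_path)

print(demo)
```

The error message reports the resolved absolute path, which makes it clear whether the problem is the working directory or the file name.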