# Group 10 - Project 1

# Members: Bryan Groves, Randy Lam, Zach Wood, Marti Reisinger

# Topic: Drivers in revenue for top 1000 movies

## Overview: We intend to use a Kaggle dataset that lists the top 1000 movies by ranking. We analyze studio, runtime, and performance by release date, looking at both revenue and the number of movies that fall within each category.

In [None]:
# Import 
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats
from scipy.stats import linregress
import datetime

In [None]:
#Load CSV
movie_df = pd.read_csv("Resources/movies.csv")

In [None]:
cleaned_columns = movie_df.drop(columns = ["Movie Info","Unnamed: 0","Genre"])

In [None]:
final_movie_list = cleaned_columns.dropna()

# Analysis

## In analyzing the top 1000 dataset to understand what truly drives revenue in these popular movies, we wanted to see whether there were clear distinctions between studios. The following two questions provide an initial analysis of the movie studios and their revenue performance.

### 1. How much total revenue is each studio generating?

#### By first looking at the total revenue for each studio, we could see that the big names we would expect were producing the most revenue. Walt Disney, Warner Bros., Twentieth Century Fox, Universal Studios, and Sony held the top 5 spots and made up a vast majority of all the revenue.


In [None]:
rev_total_columns = final_movie_list.drop(columns = ["Movie Runtime","Title","License","Release Date","Domestic Sales (in $)","International Sales (in $)"]).set_index('Distributor')

In [None]:
rev_totals = rev_total_columns.groupby('Distributor').sum()

In [None]:
x_axis = np.arange(len(rev_totals))
tick_locations = [value for value in x_axis]

In [None]:
plot_pandas= rev_totals.sort_values(by="World Sales (in $)", ascending=False).plot.bar(color='b')
plt.xlabel("Distributor")
plt.ylabel("World Sales (in $)")
plt.title("World Sales Revenue by Distributor")
plt.show()

### 2. Per studio, what is the studio's average revenue for movies that fall within the top 1000 list?

#### Because one studio may need many more films to reach the same total as a studio producing fewer films, we wanted to normalize our comparison by taking the average revenue per film for each studio. When measuring average revenue in the top 1000 list per studio, the ranking changed, as we expected. One interesting observation was that a studio in the bottom half of the list for total revenue became the leading studio for average revenue. The top studios for average revenue now included Newmarket Films, Walt Disney, Summit Entertainment, and DreamWorks.

In [None]:
average_sales = rev_total_columns.groupby('Distributor').mean()


In [None]:
plot_pandas=average_sales.sort_values(by="World Sales (in $)", ascending=False).plot.bar(color='b')
plt.xlabel("Distributor")
plt.ylabel("Average World Sales (in $)")
plt.title("Average World Sales Revenue by Distributor")
plt.show()
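
#### An average can be driven by a studio with only one or two films on the list, so it can help to view the film count next to the total and the mean. The following is a minimal sketch on made-up data (the studio names and figures are hypothetical, not taken from the Kaggle file); it assumes only the `Distributor` and `World Sales (in $)` column names used above.

```python
import pandas as pd

# Hypothetical toy data; studio names and sales figures are invented.
toy = pd.DataFrame({
    "Distributor": ["Studio A", "Studio A", "Studio A", "Studio B"],
    "World Sales (in $)": [300_000_000, 200_000_000, 100_000_000, 500_000_000],
})

# Total, average, and number of films per distributor in one table.
summary = (
    toy.groupby("Distributor")["World Sales (in $)"]
    .agg(total="sum", average="mean", films="count")
    .sort_values("average", ascending=False)
)
print(summary)
```

#### In this toy example, Studio B leads on average revenue with a single film, while Studio A leads on total revenue with three.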

## 3. We would like to look at the top 1000 movies and understand whether summer blockbusters (June-Aug) fare better than holiday movies (Nov-Jan)

### Based on this data sample, we found that June, July, and December pull in more revenue than all other months. This could be due to the start of summer break and the winter holidays, which give many people time to meet with friends and family.

## 4. Which year had the best performance for revenue?

### The year 2009 stands out as an outlier because Avatar, released in December, took in 2.9 billion on its own. The closest movie to this would have been 2019's Avengers: Endgame; however, it was not counted in our data set because it was released in April 2019.

In [None]:
#importing new csv file
shortlist_df = pd.read_csv("Resources/Movies 3.csv")
pd.set_option('display.max_rows', None)


In [None]:
shortlist_df['Release Date']=pd.to_datetime(shortlist_df['Release Date']).dt.month


In [None]:
grouped=shortlist_df.groupby(['Release Date'], as_index=False).sum()

In [None]:
grouped.set_index('Release Date').groupby("Release Date").sum()


In [None]:
plt.rcParams["figure.figsize"] = (10,10)
# Drop the summed rankings (not meaningful) and plot one bar per release month
grouped = grouped.drop(columns=["Ranking"])
grouped.plot.bar(x='Release Date', y='World Sales')
plt.xticks([0,1,2,3,4,5],["January", "June", "July", "August","November","December"])
plt.xlabel('Release Month')
plt.ylabel('World Sales in Billions')
plt.title('Blockbuster Releases')
plt.legend(loc='upper center')
plt.show()
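
#### To compare the two seasons directly rather than month by month, release months can be mapped to a season label before summing. This is a minimal sketch on made-up rows; the dates and sales values are hypothetical, and the `Release Date` and `World Sales` column names are assumed to match the shortlist file.

```python
import pandas as pd

# Hypothetical rows; dates and sales are invented for illustration.
toy = pd.DataFrame({
    "Release Date": ["2010-06-18", "2009-12-18", "2015-07-03", "2014-01-10"],
    "World Sales": [1_000_000_000, 2_900_000_000, 800_000_000, 300_000_000],
})

def season_for_month(month):
    """Label a month number as summer, holiday, or other."""
    if month in (6, 7, 8):
        return "Summer (Jun-Aug)"
    if month in (11, 12, 1):
        return "Holiday (Nov-Jan)"
    return "Other"

# Extract the month, label each film, then total sales per season.
season = pd.to_datetime(toy["Release Date"]).dt.month.map(season_for_month)
totals = toy.groupby(season)["World Sales"].sum()
print(totals)
```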

In [None]:
movieslist = pd.read_csv("Resources/Movies 3.csv")
pd.set_option('display.max_rows', None)


In [None]:
movieslist['Release Date']=pd.to_datetime(movieslist['Release Date']).dt.year


In [None]:
years=movieslist.groupby(['Release Date'], as_index=False).sum()


In [None]:
years.set_index('Release Date').groupby("Release Date").sum()/1e9

In [None]:
years.plot.barh(x="Release Date", y="World Sales")
plt.xlabel('World Sales in Billions')
plt.ylabel('Release Year')

## 5. Create chart based on revenue by market

### By analyzing International vs Domestic revenue we want to see if there is a clear correlation between the two. Will movies that did well domestically also do well internationally?

#### Using a linear regression model, we can see that most movies in our data set are close to the regression line. With an R-squared of 63%, we can say there is a moderate level of correlation: 63% of the variation in international sales can be explained by the variation in domestic sales.

#### There are a few outliers among the high-grossing films. Most of these outliers did very well internationally and not as well domestically. This was surprising; we expected the opposite.


In [None]:
plt.rcParams["figure.figsize"] = (10,5)
x = final_movie_list['Domestic Sales (in $)'] 
y = final_movie_list['International Sales (in $)']
plt.scatter(x , y)
plt.xlabel("Domestic Sales (in $)")
plt.ylabel("International Sales (in $)")
plt.title("International Vs Domestic Sales for Top 1000 Grossing Movies")            
plt.show()

In [None]:
(slope, intercept, rvalue, pvalue, stderr) = linregress(x, y)
regress_values = x * slope + intercept
line_eq = "y = " + str(round(slope,2)) + "x + " + str(round(intercept,2))
plt.rcParams["figure.figsize"] = (10,5)
plt.scatter(x,y)
plt.plot(x,regress_values,"r-")
plt.xlabel("Domestic Sales (in $)")
plt.ylabel("International Sales (in $)")
plt.title("International Vs Domestic Sales for Top 1000 Grossing Movies")  
plt.annotate(line_eq ,(510000000,200000000),fontsize=10,color="r")
print(f"The r-squared is: {rvalue**2}")
plt.show()
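
#### For a simple linear regression with one predictor, the R-squared is the square of the Pearson correlation between the two variables. The sketch below checks this on made-up numbers (the values are hypothetical and unrelated to the movie data).

```python
import numpy as np
from scipy.stats import linregress

# Hypothetical sales figures, invented for illustration.
domestic = np.array([100.0, 200.0, 300.0, 400.0, 500.0])
international = np.array([150.0, 260.0, 310.0, 480.0, 520.0])

# R-squared from the fitted regression.
result = linregress(domestic, international)
r_squared = result.rvalue ** 2

# Pearson correlation computed independently of the regression.
pearson_r = np.corrcoef(domestic, international)[0, 1]

print(f"R-squared from linregress: {r_squared:.4f}")
print(f"Squared Pearson r:         {pearson_r ** 2:.4f}")
```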

## 6. Of the top 1000 grossing movies, what License will the majority fall within? G, PG, PG-13 or R?

#### We found that PG-13 had by far the largest number of movies. This is what we expected to find, since PG-13 movies appeal not only to younger teens but also to adults of all ages. With 48.8% of the movies falling in this license category, we can assume that producing a film with this license allows studios to market to the most people. G-rated movies made up the smallest share of our data set. This was surprising to us considering the large number of children's movies in our dataset. Upon looking into this, we found that the film rating industry made an alteration to its system in 1984, when it added the PG-13 rating. According to the MPAA official guidelines, in order to have a movie rated “G” there cannot be any language, violence, etc. that would offend parents whose younger children will see the film. This change made the “G” rating for films almost nonexistent.









In [None]:
G_data = final_movie_list.loc[final_movie_list["License"]=='G']


In [None]:
PG_data = final_movie_list.loc[final_movie_list["License"]=='PG']


In [None]:
PG13_data = final_movie_list.loc[final_movie_list["License"]=='PG-13']


In [None]:
R_data = final_movie_list.loc[final_movie_list["License"]=='R']


In [None]:
G_data["Title"].nunique()

In [None]:
PG_data["Title"].nunique()


In [None]:
PG13_data["Title"].nunique()

In [None]:
R_data["Title"].nunique()

In [None]:
G = 14
PG = 173
PG13 = 363
R = 194

In [None]:
license_df = pd.DataFrame({
    "G Rating":G,
    "PG Rating":PG,
    "PG-13 Rating":PG13,
    "R Rating":R,}, index=[0])


In [None]:
license_df[["G Rating","PG Rating","PG-13 Rating","R Rating"]]


In [None]:
ratings = final_movie_list["License"].value_counts()
numbers = ratings.index
quantity = ratings.values
import plotly.express as px
fig = px.pie(values=quantity, names=numbers, title = 'Ratings of Top 1000 Grossing Movies')
fig.show()

In [None]:
License_Chart = license_df.plot.bar(color=['blue', 'red', 'green', 'cyan'], align= "center")

plt.ylabel("Movie Count")
plt.xlabel("Movie Rating")
plt.title('Ratings of Top 1000 Grossing Movies')
plt.show()
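
#### The 48.8% share quoted above follows from the four counts recorded in this notebook. A short sketch of the arithmetic, assuming those counts (14, 173, 363, 194) are the complete set of rated films:

```python
import pandas as pd

# Counts per license as recorded earlier in this notebook.
counts = pd.Series({"G": 14, "PG": 173, "PG-13": 363, "R": 194})

# Each license's share of all rated films, as a percentage.
shares = (counts / counts.sum() * 100).round(1)
print(shares)

# With the full DataFrame, value_counts(normalize=True) on the
# "License" column gives the same proportions without hard-coding counts.
```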



# Hypothesis Test

### Null Hypothesis: There is no difference in earnings in films based on runtime.

### Our hypothesis: Shorter films within our sample will have higher global sales totals than longer films.

#### We based our hypothesis on the notion that shorter films would bring in more total viewership because they would logically fit into one’s schedule more easily. It might be more challenging to get to the theaters if you have to set aside 2.5 or 3 hours as opposed to 90 minutes. Additionally, shorter films can be shown more frequently, and more showtimes could produce higher ticket sales.


In [None]:
#Zach Hypothesis

In [None]:
# Work on a copy so the runtime conversion below does not modify final_movie_list
hypoth_df = final_movie_list.copy()

In [None]:
hypoth_df["Movie Runtime"] = hypoth_df["Movie Runtime"].str.replace(" hr", "*60").str.replace(" ", " + ").str.replace("min", "0").apply(eval)
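
#### The conversion above relies on `eval`, which executes whatever text is in the column. A possible alternative is to parse the hours and minutes with a regular expression. This is a sketch only: the helper name `runtime_to_minutes` is hypothetical, and it assumes runtimes look like `2 hr 15 min`, `2 hr`, or `45 min`.

```python
import re

def runtime_to_minutes(text):
    """Convert strings such as '2 hr 15 min', '2 hr' or '45 min' to minutes."""
    match = re.fullmatch(r"\s*(?:(\d+)\s*hr)?\s*(?:(\d+)\s*min)?\s*", text)
    # Reject strings with neither an hour part nor a minute part.
    if match is None or not any(match.groups()):
        raise ValueError(f"Unrecognized runtime: {text!r}")
    hours, minutes = (int(part) if part else 0 for part in match.groups())
    return hours * 60 + minutes

for sample in ["2 hr 15 min", "2 hr", "45 min"]:
    print(sample, "->", runtime_to_minutes(sample))
```

#### On a pandas column, the same helper could be used with `Series.apply(runtime_to_minutes)`.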

In [None]:
hypoth_rev_list = hypoth_df["World Sales (in $)"].tolist()
hypoth_time_list = hypoth_df["Movie Runtime"].tolist()

In [None]:
print(hypoth_df["Movie Runtime"].median())
print(hypoth_df["Movie Runtime"].mean())
print(hypoth_df["Movie Runtime"].min())
print(hypoth_df["Movie Runtime"].max())

In [None]:
hypoth_time_series = hypoth_df["Movie Runtime"] 

hypoth_quartiles = hypoth_time_series.quantile([.25, .5, .75])
hypoth_lower = hypoth_quartiles[.25]
hypoth_upper = hypoth_quartiles[.75]
hypoth_iqr = hypoth_upper - hypoth_lower

print(hypoth_lower)
print(hypoth_upper)

#### In order to analyze “short” and “long” films, we needed to decide what those terms actually meant. The histogram below shows the distribution of films in our sample by runtime, binned in 10-minute segments to provide a visual summary that is easier to follow. The median (116) and mean (117.2) runtimes in minutes were similar, and the distribution had a right skew produced by movies that ran very long. Half of all films we analyzed ran between about 100 and 130 minutes, so we deemed that group “medium” in length. We chose not to include these in our analysis and instead kept only movies of 101 minutes or less or 130 minutes or more, which roughly correspond to the bottom and top quarters of the sample.

In [None]:
plt.hist(hypoth_time_list, bins = [70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 210], rwidth = .95)

plt.title("Distribution of Films by Runtime")
plt.xlabel("Film Runtime in Minutes")
plt.ylabel("Count")

plt.show()

In [None]:
hypoth_s_df = hypoth_df.loc[hypoth_df["Movie Runtime"] <= 101]
hypoth_l_df = hypoth_df.loc[hypoth_df["Movie Runtime"] >= 130]

hypoth_m_df = hypoth_df.loc[(hypoth_df["Movie Runtime"] > 101) & (hypoth_df["Movie Runtime"] < 130)]
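
#### The 101 and 130 minute cutoffs come from the first and third quartiles. As a sketch, the same three-way split can be derived directly from the quartiles instead of hard-coding the thresholds; the runtimes below are made up for illustration.

```python
import pandas as pd

# Hypothetical runtimes in minutes.
runtimes = pd.Series([88, 95, 101, 110, 116, 120, 129, 130, 145, 180])

# First and third quartiles define the short and long groups.
q1, q3 = runtimes.quantile([0.25, 0.75])

short = runtimes[runtimes <= q1]
long_ = runtimes[runtimes >= q3]
medium = runtimes[(runtimes > q1) & (runtimes < q3)]

print(f"Q1 = {q1}, Q3 = {q3}")
print(len(short), "short,", len(medium), "medium,", len(long_), "long")
```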

In [None]:
hypoth_s_series = hypoth_s_df["World Sales (in $)"].squeeze()
hypoth_l_series = hypoth_l_df["World Sales (in $)"].squeeze()
hypoth_m_series = hypoth_m_df["World Sales (in $)"].squeeze()

hypoth_s_timelist = hypoth_s_df["Movie Runtime"].to_list()
hypoth_l_timelist = hypoth_l_df["Movie Runtime"].to_list()
hypoth_m_timelist = hypoth_m_df["Movie Runtime"].to_list()

#### We then ran our independent samples t-test, where sample 1 included movies of 101 minutes or less and sample 2 included movies of 130 minutes or more. Our t-test was one-tailed and specified that sample 1 would have a greater mean than sample 2. Therefore, we could only reject our null hypothesis if that condition occurred; if there was no significant difference or sample 2’s mean was greater, we’d fail to reject the null. Unfortunately for our prediction, the latter was the case. Our p-value was .99, so our sample does not support the idea that shorter films make more money. In fact, looking at our sample 1 mean (334,578,465) versus our sample 2 mean (563,375,537), the opposite may be true.

In [None]:
stats.ttest_ind(hypoth_s_series, hypoth_l_series, alternative = "greater")

In [None]:
print(hypoth_s_series.mean())
print(hypoth_l_series.mean())
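
#### A p-value near 1 in a one-tailed test usually means the data point in the opposite direction. The sketch below illustrates this on simulated data (the samples are random draws, not our movie data). It also passes `equal_var=False` to run Welch's t-test, which is an assumption on our part and differs from the default equal-variance test used above.

```python
import numpy as np
from scipy import stats

# Simulated samples (in millions of dollars); values are invented.
rng = np.random.default_rng(0)
short_sales = rng.normal(loc=330, scale=150, size=200)
long_sales = rng.normal(loc=560, scale=300, size=200)

# H1: short films earn more than long films.
greater = stats.ttest_ind(short_sales, long_sales,
                          alternative="greater", equal_var=False)
# H1: short films earn less than long films.
less = stats.ttest_ind(short_sales, long_sales,
                       alternative="less", equal_var=False)

print(f"p-value (short > long): {greater.pvalue:.4f}")
print(f"p-value (short < long): {less.pvalue:.4f}")
```

#### The two one-tailed p-values add up to 1, so a very high value for one direction corresponds to a very low value for the other.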

#### The scatterplot we made backs up this idea. Overall, long films had more record-setting box office numbers. Some of the data points for that subgroup were substantially higher than anything else on our chart. A general rule seems to be that movies have greater potential to earn if their runtime is longer. There are plenty of movies that hit similar earning figures regardless of runtime, as evidenced by the cluster of data points at the bottom of the plot. Longer films broke away from that cluster more frequently than shorter films and even medium-length ones.

In [None]:
plt.scatter(hypoth_s_timelist, hypoth_s_series, marker = "o", facecolors = "lightblue", edgecolors = "black")
plt.scatter(hypoth_l_timelist, hypoth_l_series, marker = "o", facecolors = "lightgreen", edgecolors = "black")
plt.scatter(hypoth_m_timelist, hypoth_m_series, marker = "o", facecolors = "orange", edgecolors = "black")

plt.title("Total Film Sales by Runtime")
plt.xlabel("Film Runtime in Minutes")
plt.ylabel("World Sales in Billions of Dollars")

plt.legend(["Short Films", "Long Films", "Medium Films"])

plt.show()

In [None]:
hypoth_s_earnlist = hypoth_s_df["World Sales (in $)"].to_list()
hypoth_l_earnlist = hypoth_l_df["World Sales (in $)"].to_list()

In [None]:
hypoth_earndict = {"Short Films": hypoth_s_earnlist, "Long Films": hypoth_l_earnlist}
hypoth_boxid = [1, 2]

#### Finally, our boxplot provides another visual that suggests more variance in earnings among lengthier films. The range of world sales for short films is compressed, with a larger group of potential outliers lying above the upper whisker. Only 5 films are considered potential outliers in the long films group, and a few of them earned substantially more than the rest of the sample.


In [None]:
plt.boxplot(hypoth_earndict.values())

plt.title("Distribution of Earnings, Short vs. Long Films")
plt.ylabel("World Sales in Billions of Dollars")
plt.xticks(hypoth_boxid, hypoth_earndict.keys())

plt.show()
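
#### The potential outliers drawn by `plt.boxplot` with its default settings are the points more than 1.5 times the interquartile range beyond the quartiles. As a sketch, the high-side outliers can be counted directly; the helper name `count_high_outliers` and the sample values are hypothetical.

```python
import numpy as np

def count_high_outliers(values):
    """Count points above Q3 + 1.5 * IQR (the default boxplot upper fence)."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    upper_fence = q3 + 1.5 * (q3 - q1)
    return int((values > upper_fence).sum())

# Made-up earnings with one extreme value.
print(count_high_outliers([1, 2, 3, 4, 5, 100]))
```

#### Applied to the short and long earnings lists, the same helper would give a count to compare against the points drawn above each box.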

In [None]:
hypoth_s_dfdis = hypoth_s_df.sort_values(by = ["World Sales (in $)"], ascending = False)

hypoth_s_dfdis.head(5)

In [None]:
hypoth_l_dfdis = hypoth_l_df.sort_values(by = ["World Sales (in $)"], ascending = False)

hypoth_l_dfdis.head(5)

#### To get a snapshot of why our hypothesis test went the way it did, we looked at the top-selling films in each subgroup. Many of the short films are G or PG rated and also animated, so their target audience may be narrower. These films are probably shorter for a reason: they are made with children in mind, and their makers likely understand that this age group doesn’t necessarily want or need films that last multiple hours. Plenty of the top-earning “long” films were PG-13 and may have a larger target audience that includes children, teens, and adults alike. Another notable observation was the clear trend of franchise movies breaking box office records. Star Wars, The Avengers, and Despicable Me are all staples of the movie world, and these movies were some of the highest earners. Their reputation and the anticipation leading up to their releases are likely part of the cause.

#### Our hypothesis proved to be off base, but we gained some valuable insights and potential factors in how runtime influences the earnings of films worldwide.