
# Project: Investigate a Dataset - TMDb

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

### Dataset Description 

> This dataset contains information about 10,000 movies collected from The Movie Database (TMDb), including user ratings and revenue. Its columns include id, imdb_id, popularity, budget, revenue, original_title, cast, homepage, tagline, keywords, overview, runtime, genres, production_companies, vote_count, vote_average and release_year.


In this project, I'll be answering the following questions:
  + What month is considered "best" for releasing a film/show?
  + What is the relationship between runtime and vote average?
  + What genres are associated with films/shows that have high revenues?
  + What percentage do the top 5 genres make up?


In [9]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [10]:
# Upgrade pandas to use dataframe.explode() function. 
!pip install --upgrade pandas==0.25.0

Requirement already up-to-date: pandas==0.25.0 in /opt/conda/lib/python3.6/site-packages (0.25.0)


<a id='wrangling'></a>
## Data Wrangling

### General Properties


In [None]:
tmdb_df = pd.read_csv('tmdb-movies.csv')
tmdb_df.head(1)


### Data Cleaning
### Data Cleaning - Drop Unnecessary Columns
 > Remove columns that are not needed to answer the questions: budget, revenue, homepage, tagline, keywords, overview, release_year, cast, director and production_companies.


In [None]:
# Each column is listed once; the adjusted columns (budget_adj, revenue_adj) are kept.
tmdb_df.drop(['budget', 'revenue', 'homepage', 'tagline', 'keywords', 'overview', 'release_year', 'cast', 'director', 'production_companies'], axis=1, inplace=True)
tmdb_df.columns


In [None]:
tmdb_df.describe()

In [None]:
print(tmdb_df['budget_adj'].mean())

In [None]:
tmdb_df['budget_adj'] = tmdb_df['budget_adj'].replace(0, 17551039.822886847)

In [None]:
print(tmdb_df['revenue_adj'].mean())

In [None]:
tmdb_df['revenue_adj'] = tmdb_df['revenue_adj'].replace(0, 51364363.25325093)

In [None]:
print(tmdb_df['runtime'].mean())

In [None]:
tmdb_df['runtime'] = tmdb_df['runtime'].replace(0, 102.07086324314375)
tmdb_df.describe()
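
The cells above replace zeros with the mean of the full column, and that mean itself includes the zeros. A minimal sketch of one alternative, assuming a 0 marks a value that was never recorded: treat zeros as missing, take the mean of the recorded values only, then fill. The frame `toy_df` and its values are hypothetical and stand in for `tmdb_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; 0 stands in for "value not recorded".
toy_df = pd.DataFrame({'runtime': [90, 0, 120, 0, 150]})

# Mean over every row, zeros included (what .mean() returns on the raw column).
mean_with_zeros = toy_df['runtime'].mean()

# Treat 0 as missing first, so the mean is taken over recorded values only.
runtime_known = toy_df['runtime'].replace(0, np.nan)
mean_without_zeros = runtime_known.mean()

# Fill the missing entries with the mean of the recorded values.
toy_df['runtime_filled'] = runtime_known.fillna(mean_without_zeros)

print(mean_with_zeros, mean_without_zeros)
print(toy_df)
```

Including the zeros pulls the fill value down, so the two means can differ noticeably when many entries are 0.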


### Data Cleaning - Removing Duplicates

In [None]:
sum(tmdb_df.duplicated())

In [None]:
tmdb_df.drop_duplicates(inplace=True)

### Data Cleaning - Changing Datatypes
>Change the datatypes of columns to appropriate kinds. For example, 'release_date' needs to be datetime.


In [None]:
tmdb_df['release_date'] = pd.to_datetime(tmdb_df['release_date'])
tmdb_df.dtypes

<a id='eda'></a>
## Exploratory Data Analysis

In [None]:
tmdb_df.hist(figsize=(15,8));

### What month is considered "best" for releasing a film/show?


In [None]:
tmdb_df['month'] = tmdb_df['release_date'].apply(lambda x: x.month)
tmdb_df.head(3)
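
The month can also be pulled out without a row-by-row lambda. A small sketch, assuming the dates are stored as month/day/two-digit-year strings (the sample values here are hypothetical): the `.dt` accessor extracts the month for the whole column at once and gives the same result.

```python
import pandas as pd

# Hypothetical release dates in month/day/year form.
toy_df = pd.DataFrame({'release_date': ['6/9/15', '12/18/09', '3/1/11']})
toy_df['release_date'] = pd.to_datetime(toy_df['release_date'], format='%m/%d/%y')

# Vectorized month extraction through the .dt accessor.
toy_df['month'] = toy_df['release_date'].dt.month

# Element-wise version, as used in the cell above, for comparison.
month_via_apply = toy_df['release_date'].apply(lambda x: x.month)

print(toy_df)
```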

In [None]:
month_revenue = tmdb_df.groupby('month')['revenue_adj'].sum()
month_revenue

In [None]:
sns.set_style('darkgrid')
# One bar per month, labelled with the month number taken from the grouped index.
plt.bar(month_revenue.index, month_revenue, tick_label=month_revenue.index)
plt.title('Month Released vs. Revenue')
plt.ylabel('Revenue Adjusted')
plt.xlabel('Month');


In [None]:
tmdb_df['month'].value_counts()

In [None]:
tmdb_df['month'].value_counts().mean()
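
Because the number of releases differs from month to month, a month can lead on total revenue simply by having more releases. A minimal sketch on hypothetical numbers, showing the total, the per-release mean and the release count side by side with `groupby(...).agg(...)`:

```python
import pandas as pd

# Hypothetical data: June has more releases, December has fewer but larger ones.
toy_df = pd.DataFrame({
    'month':       [6, 6, 6, 6, 12, 12, 2],
    'revenue_adj': [100.0, 50.0, 30.0, 40.0, 120.0, 80.0, 90.0],
})

# Total, average per release, and number of releases for each month.
monthly = toy_df.groupby('month')['revenue_adj'].agg(['sum', 'mean', 'count'])
print(monthly)

print('Highest total:', monthly['sum'].idxmax())
print('Highest mean: ', monthly['mean'].idxmax())
```

In this toy data the month with the highest total is not the month with the highest average, which is why it can help to look at both.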

### What is the relationship between runtime and vote average?


In [None]:
tmdb_df.plot(x='vote_average', y='runtime', kind='scatter', figsize=(15,10))
plt.title('Ratings vs. Runtime')
plt.xlabel('Rating')
plt.ylabel('Runtime (minutes)');
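
A scatter plot shows the shape of the relationship; a correlation coefficient puts a single number on its linear strength. A minimal sketch using `Series.corr` (Pearson by default) on hypothetical values, not on the TMDb data itself:

```python
import pandas as pd

# Hypothetical runtimes (minutes) and ratings.
toy_df = pd.DataFrame({
    'runtime':      [85, 95, 100, 110, 130, 150],
    'vote_average': [5.5, 6.0, 5.8, 6.4, 7.0, 7.2],
})

# Pearson correlation: +1 is a perfect increasing line, 0 is no linear
# relationship, -1 is a perfect decreasing line.
pearson_r = toy_df['runtime'].corr(toy_df['vote_average'])
print(round(pearson_r, 3))
```

A value near 0 on the real data would be consistent with a scatter plot in which ratings spread widely at similar runtimes.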

### What genres are associated with films/shows that have high revenues?

In [None]:
tmdb_df.info()

In [None]:
tmdb_df = tmdb_df.dropna(subset=['genres'], axis=0)
tmdb_df.info()

In [None]:
# Split the pipe-separated genres into separate columns named genre1, genre2, ...
genres = tmdb_df['genres'].str.split('|', expand=True).rename(columns = lambda x: "genre"+str(x+1))

In [None]:
tmdb_df.drop('genres', axis=1, inplace=True)

In [None]:
tmdb_df = pd.merge(tmdb_df, genres, left_index=True, right_index=True, how='inner')
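
The split-and-merge steps above produce one column per genre position. An alternative sketch uses `explode()` (available from pandas 0.25), which gives one row per film/genre pair and lets `value_counts()` do the tallying. The titles and genres below are hypothetical.

```python
import pandas as pd

# Hypothetical films with pipe-separated genres.
toy_df = pd.DataFrame({
    'original_title': ['Film A', 'Film B', 'Film C'],
    'genres': ['Action|Adventure', 'Drama', 'Action|Drama|Thriller'],
})

# Split each genre string into a list, then emit one row per list element.
long_df = toy_df.assign(genre=toy_df['genres'].str.split('|')).explode('genre')

# Count how often each genre label appears.
genre_counts = long_df['genre'].value_counts()
print(genre_counts)
```

With this layout there is no need to add up `genre1` through `genre5` by hand, and films with fewer genres do not introduce empty columns.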

In [None]:
top_rev = tmdb_df.nlargest(10, 'revenue_adj')
top_rev


In [None]:
copy_df = top_rev.copy()
# Keep only the genre columns so that melt() stacks genre labels alone.
copy_df = copy_df[['genre1', 'genre2', 'genre3', 'genre4', 'genre5']]
df1 = copy_df.melt()

In [None]:
df2 = pd.crosstab(index=df1['value'], columns=df1['variable'])
df2

In [None]:
df2['totals'] = df2['genre1'] + df2['genre2'] + df2['genre3'] + df2['genre4'] + df2['genre5']
df2


In [None]:
df2['totals'].plot(kind="bar", figsize=(8,5), fontsize=12)
plt.xlabel('Genre', fontsize = 14)
plt.ylabel('Frequency', fontsize = 14)
plt.title('Genres of the Highest Earning Films/Shows', fontsize = 14);

### What percentage do the top 5 genres make up?


In [None]:
copy_df = tmdb_df.copy()
# Keep only the genre columns so that melt() stacks genre labels alone.
copy_df = copy_df[['genre1', 'genre2', 'genre3', 'genre4', 'genre5']]
df3 = copy_df.melt()


In [None]:
df4 = pd.crosstab(index=df3['value'], columns=df3['variable'])

In [None]:
df4['totals'] = df4['genre1'] + df4['genre2'] + df4['genre3'] + df4['genre4'] + df4['genre5']

In [None]:
top5 = df4.nlargest(5, 'totals')
top5

In [None]:
df4.drop(['Drama','Comedy','Thriller','Action','Romance'], inplace=True)

In [None]:
df4['totals'].sum()

In [None]:
# Add the combined total of all remaining genres as one extra row (numeric, not a string).
count = top5.append({'totals': 11399}, ignore_index=True)
count

In [None]:
count.index=['Drama','Comedy','Thriller','Action','Romance','Other']
count

In [None]:
genre_total = count['totals'].sum()
genre_total

In [None]:
# Divide by the total computed above instead of a hard-coded figure.
count['percentage'] = count['totals'] / genre_total * 100
count
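
The figures typed in by hand above (the "Other" total and the overall total) can also be derived from the data. A minimal sketch, assuming a Series of per-genre totals; the numbers here are hypothetical and the name `totals` is only illustrative:

```python
import pandas as pd

# Hypothetical per-genre label counts.
totals = pd.Series({
    'Drama': 40, 'Comedy': 30, 'Thriller': 20, 'Action': 15,
    'Romance': 10, 'Horror': 5, 'Western': 5,
})

# The five most frequent genres, and everything else folded into "Other".
top5 = totals.nlargest(5)
other = totals.drop(top5.index).sum()
count = pd.concat([top5, pd.Series({'Other': other})])

# Share of each slice, as a percentage of all genre labels.
percentage = count / count.sum() * 100
print(percentage.round(1))
```

Computing the values this way keeps the percentages correct if the cleaning steps change the number of rows.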

In [None]:
count['percentage'].plot(kind="pie", figsize=(8,8), fontsize=13, autopct='%1.0f%%')
plt.title('Percentage of Genres', fontsize = 14)
plt.ylabel('');

>From this pie chart, we can see that Drama is the most frequent of the top 5 genres: close to 1 out of every 5 genre labels in the dataset is Drama. Because a film/show can list several genres, these shares count genre labels rather than individual titles.
>
>However, the top 5 genres make up only just over half of the total. The remaining, less frequently produced genres still account for a good portion of the whole when combined.
>
>Also, a genre that appears often among the highest-revenue films/shows is not necessarily one of the most frequently produced genres.


<a id='conclusions'></a>
## Conclusions

>Throughout this data analysis, I posed questions that production companies might find useful, and I've come to several conclusions.
>
>The best months to release a movie/show appear to be June and December: in this dataset, releases in those months tend to bring in the most revenue, which suggests they are the most popular with audiences. This could be because families are looking for things to do together in the summer and winter.
>
>On the relationship between ratings and runtime, short films (less than 10 minutes) are likely to have a mid-to-high rating, and TV series (greater than 300 minutes) consistently get higher-than-average ratings. The ratings of films/shows with a runtime of around 100 minutes are unpredictable, as they range from low to high, while films with a runtime well above or below 100 minutes tend to have mid-to-high ratings. At first glance of the scatterplot, users appear generous, in that they give mostly mid-to-high ratings overall, so production companies will want to make sure their film/show is reviewed on TMDb.
>
>If you're a production company and you want to know which genres earn the highest revenues, my bar chart above shows that Adventure, Action and Science Fiction were the most frequent genres among the top-earning films/shows. This suggests that you are more likely to earn a higher revenue if you produce those genres.
>
>Finally, when I calculated the percentages of each genre, I noticed that only one of the highest-earning genres (Action) is in the top 5 most frequently produced genres. Perhaps Adventure and Science Fiction movies are more expensive to produce and so are made more rarely, or perhaps production companies want to focus on genres that are more popular with people, not necessarily the genres that produce the highest revenues. Whatever the cause, I can conclude that a genre that produces a larger revenue than others is not necessarily one of the most frequently produced genres.
>
>A note about my data cleaning: in the runtime, budget_adj and revenue_adj columns, I replaced all of the 0 values with the column means. This could possibly have been more accurate if I had used regression on films/shows with similar properties to estimate the 0 values instead of using the mean.

> Resources I used in my analysis:
>   + https://stackoverflow.com/questions/47517831/how-to-copy-column-with-the-pandas-and-change-the-name
>   + https://stackoverflow.com/questions/25146121/extracting-just-month-and-year-from-pandas-datetime-column-python
>   + https://stackoverflow.com/questions/30405413/python-pandas-extract-year-from-datetime-dfyear-dfdate-year-is-not
>   + https://stackoverflow.com/questions/48733618/how-to-drop-rows-from-a-

In [None]:
from subprocess import call
call(['python', '-m', 'nbconvert', 'Investigate_a_Dataset.ipynb'])