# Project: Investigate a Dataset - TMDb Movie Data

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

### Dataset Description 

|Column|Significance|
|:----:|:----:|
| imdb_id | ID of the movie on IMDb |
| popularity | Popularity score (higher means more popular) |
| budget | Cost of producing the movie |
| revenue | How much the movie earned from box office and Blu-ray sales |
| original_title | Title of the movie |
| cast | Main actors/actresses in the movie |
| homepage | Web page corresponding to the movie |
| director | Director of the movie |
| tagline | Short description of the movie |
| keywords | Main narrative themes of the movie |
| overview | Synopsis of the movie |
| runtime | Duration of the movie in minutes |
| genres | Genre(s) of the movie |
| production_companies | Production companies credited for the movie |
| release_date | Date the movie was released (%m/%d/%Y) |
| vote_count | Number of users who voted for the movie |
| vote_average | Average of the votes |
| release_year | Year the movie was released |

### Question(s) for Analysis

> Question 1: Is there a correlation between when a movie is released (seasonality) and its revenue?

> Question 2: Do movies with longer runtime have higher popularity?


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
%matplotlib inline

In [None]:
# Upgrade pandas to use dataframe.explode() function. 
#!pip install --upgrade pandas
#!pip install --upgrade seaborn

<a id='wrangling'></a>
## Data Wrangling



### General Properties

In [None]:
df=pd.read_csv("Database_TMDb_movie_data/tmdb-movies.csv")

In [None]:
df.info()


### Data Cleaning
> First, I will convert the release_date column from object dtype to datetime dtype. I will drop the release_year column because it is redundant with release_date, along with the homepage, imdb_id, and overview columns. I will also drop budget_adj and revenue_adj, on the assumption that inflation does not materially weaken the correlations between the features examined in the EDA, as well as the other columns irrelevant to the EDA (cast, production_companies, tagline, keywords, and id).
 

In [None]:
# Columns that are redundant or irrelevant to the questions in this report.
cols_to_drop=["release_year","homepage","imdb_id","overview","budget_adj","revenue_adj",
              "cast","production_companies","tagline","keywords","id"]
df.drop(columns=cols_to_drop,inplace=True)

In [None]:
df["release_date"]=pd.to_datetime(df["release_date"])
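
The conversion above lets pandas infer the date format. As a minimal sketch, assuming the `%m/%d/%Y` format listed in the dataset description holds for the file (if the file stores two-digit years, `%y` would be needed instead), an explicit format with `errors="coerce"` makes malformed entries visible as `NaT`. The sample values below are made up for illustration.

```python
import pandas as pd

# Made-up sample values; the last one is deliberately malformed.
raw_dates = pd.Series(["6/9/2015", "12/25/2014", "not a date"])

# An explicit format avoids guessing the day/month order;
# errors="coerce" turns unparseable entries into NaT instead of raising.
parsed = pd.to_datetime(raw_dates, format="%m/%d/%Y", errors="coerce")

months = parsed.dt.month              # month number per row (NaN where parsing failed)
n_unparsed = int(parsed.isna().sum()) # how many entries could not be parsed
```

Counting the `NaT` entries right after the conversion is a cheap way to confirm that the assumed format matches the data.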

In [None]:
df.head()

> To clean the dataset, I separate the columns by dtype (numerical and categorical), because each type requires a different cleaning method.

In [None]:
num=['budget', 'revenue', 'runtime', 'vote_count','vote_average','popularity']
obj=["original_title","director","genres"]


> I will impute (fillna) the missing values in each numerical column with the mean of that column.

In [None]:
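def impute_by_mean(df,cols: list):
    """Fill missing values in each listed numerical column with the column mean."""
    for i in cols:
        print(i)
        avg=df[i].mean()
        # Assign back instead of calling fillna(inplace=True) on the selected column,
        # which may act on a copy and leave df unchanged in recent pandas versions.
        df[i]=df[i].fillna(value=avg)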

In [None]:
impute_by_mean(df,num)
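
To make the effect of mean imputation concrete, here is a small worked sketch on a made-up toy frame (not the TMDb data). The missing runtime is replaced by the mean of the observed values, (90 + 110) / 2 = 100.

```python
import numpy as np
import pandas as pd

# Made-up toy frame with one missing runtime.
toy = pd.DataFrame({"runtime": [90.0, np.nan, 110.0]})

# mean() skips NaN by default, so the fill value is the mean of 90 and 110.
fill_value = toy["runtime"].mean()

# Assign the filled column back to the frame.
toy["runtime"] = toy["runtime"].fillna(fill_value)
```

Note that mean imputation leaves the column mean unchanged but shrinks its variance, which can slightly affect correlations computed afterwards.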

> I will impute (fillna) the missing values in each categorical column with the mode of that column.

In [None]:
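def impute_by_mode(df,cols: list):
    """Fill missing values in each listed categorical column with the column mode."""
    for i in cols:
        print(i)
        mode=df[i].mode()[0] # mode() returns all tied values in sorted order; take the first
        print(mode)
        # Assign back rather than relying on fillna(inplace=True) on the selected column.
        df[i]=df[i].fillna(value=mode)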

In [None]:
impute_by_mode(df,obj)
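
The `[0]` index matters when several values are tied for most frequent. A minimal sketch on made-up values shows what `Series.mode()` returns in that case and which value ends up being used as the fill.

```python
import pandas as pd

# Made-up toy column: "Drama" and "Comedy" both appear twice, so there is a tie.
genres = pd.Series(["Drama", "Comedy", "Drama", "Comedy", None])

# mode() ignores missing values and returns every tied value in sorted order,
# so position 0 holds the alphabetically first one.
modes = genres.mode()
filled = genres.fillna(modes[0])
```

With a tie, the choice of fill value is therefore driven by sort order rather than by the data, which is worth keeping in mind for columns with many equally common values.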

> After imputing the numerical and categorical columns, verify that no NaN values remain in the dataframe (every count should be 0).

In [None]:
df.isnull().sum(axis=0)
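
`isnull()` only counts NaN entries. Assuming (this is not confirmed anywhere in this report) that some numerical columns might use 0 as a placeholder for unknown values, a check along the following lines would reveal them. The helper name `missing_report` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def missing_report(frame: pd.DataFrame, cols: list) -> pd.DataFrame:
    """Count NaN entries and exact zeros for each listed column."""
    return pd.DataFrame({
        "n_nan": frame[cols].isna().sum(),
        "n_zero": (frame[cols] == 0).sum(),  # NaN == 0 is False, so NaN is not counted here
    })

# Made-up toy data, not taken from the TMDb file.
toy = pd.DataFrame({"budget": [0, 1_000_000, np.nan], "runtime": [95, 0, 120]})
report = missing_report(toy, ["budget", "runtime"])
```

On the notebook's data this could be called as `missing_report(df, num)`; whether a zero is a genuine value or a placeholder would still need to be judged per column.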

> Check for correlations between the cleaned numerical columns.

In [None]:

pd.plotting.scatter_matrix(df[num],figsize=(15,15));

In [None]:
sns.heatmap(df[num].corr(),annot=True,cmap='rocket');

<a id='eda'></a>
## Exploratory Data Analysis



### Do movies with longer runtime have higher popularity?
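
One possible way to approach this question, as a minimal sketch: compute the Pearson correlation between `runtime` and `popularity`, and compare the mean popularity across runtime bins. The helper name `runtime_popularity_summary`, the number of bins, and the toy data are assumptions for illustration, and the toy numbers say nothing about the actual TMDb data.

```python
import pandas as pd

def runtime_popularity_summary(frame: pd.DataFrame, bins: int = 3):
    """Return the Pearson correlation and the mean popularity per runtime bin."""
    corr = frame["runtime"].corr(frame["popularity"])  # Pearson by default
    runtime_bin = pd.cut(frame["runtime"], bins=bins)  # equal-width runtime intervals
    by_bin = frame.groupby(runtime_bin, observed=True)["popularity"].mean()
    return corr, by_bin

# Made-up toy data for illustration only.
toy = pd.DataFrame({
    "runtime": [80, 90, 100, 110, 120, 130],
    "popularity": [0.5, 0.7, 0.6, 1.0, 1.2, 1.1],
})
corr, by_bin = runtime_popularity_summary(toy)
```

Applied to the notebook's data, this would be `runtime_popularity_summary(df)`. A correlation coefficient only captures linear association, so the binned means are a useful second view.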

### Is there a correlation between the budget of a given movie and its revenue?

In [None]:
plt.figure(figsize=(10,10))
sns.regplot(x=df['budget'],y=df['revenue'])
# regplot sets the axis labels from the Series names, so custom labels go after it.
plt.xlabel("Budget in 100s of millions")
plt.ylabel("Revenues in billions")
plt.grid()
plt.title("Budget vs revenue");

### Is there a correlation between when a movie is released (seasonality) and its revenue?

In [None]:
# Work on a sorted copy and replace release_date with the release month (1-12).
df_sorted=df.sort_values(by='release_date')
df_sorted["release_date"]=pd.to_datetime(df_sorted["release_date"]).dt.month
ticks=["Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"]
# Mean revenue of the movies released in each month.
df_sorted.groupby("release_date")["revenue"].mean().plot(kind="line",figsize=(10,10))
plt.xticks(np.arange(1,13),ticks)
plt.xlabel("Release Month");
plt.ylabel("Mean Earning of movie in 10s of Millions");
plt.title("Comparison of mean earnings of movies released in a given month of the year")
plt.grid()
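
A monthly mean can be pulled upward by a handful of very high-grossing releases. As a hedged sketch, reporting the median and the number of movies next to the mean shows how much of a monthly peak is driven by outliers. The helper name `monthly_revenue_summary` and the toy data are hypothetical.

```python
import pandas as pd

def monthly_revenue_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Aggregate revenue by the calendar month of a datetime release_date column."""
    month = frame["release_date"].dt.month.rename("month")
    return frame.groupby(month)["revenue"].agg(["mean", "median", "count"])

# Made-up toy data: one very large June release pulls the June mean upward.
toy = pd.DataFrame({
    "release_date": pd.to_datetime(["2015-06-01", "2015-06-15", "2015-06-20", "2015-12-25"]),
    "revenue": [10, 20, 1000, 50],
})
summary = monthly_revenue_summary(toy)
```

The function expects `release_date` to still be a datetime column, so in this notebook it would be called on `df` rather than on `df_sorted`, where the column has been replaced by the month number.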

In [None]:
# Share of movies released in each month (slice labels are month numbers 1-12).
df_sorted.groupby("release_date")["release_date"].count().plot(kind="pie",figsize=(10,10), autopct='%1.0f%%')
plt.ylabel("")  # a pie chart has no meaningful axis label
plt.title("Percentage of movies released in each month of the year");

### What is the vote average for movies released in each month of the year, and is the vote average normally distributed?

In [None]:
df_sorted.groupby("release_date")["vote_average"].mean().plot(kind="line");
plt.xlabel("Release month")
plt.ylabel("Average vote")
plt.xticks(np.arange(1,13),ticks)
plt.title("vote average for movies released in each month of the year")
plt.grid()

In [None]:
# Estimate the density of the vote_average column itself, not of the 12 monthly means.
df_sorted["vote_average"].plot(kind="kde");
plt.xlabel("Average vote");
plt.title("The PDF of the Average vote column")
plt.grid()
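
A density plot gives a visual impression only. As a rough numeric complement, sample skewness and excess kurtosis are both close to 0 for normally distributed data. The sketch below uses synthetic draws (not TMDb data) to show what the two statistics look like for a normal and a clearly non-normal sample; the helper name `shape_summary` is hypothetical.

```python
import numpy as np
import pandas as pd

def shape_summary(values: pd.Series) -> dict:
    """Sample skewness and excess kurtosis; both are near 0 for normal data."""
    return {"skew": float(values.skew()), "excess_kurtosis": float(values.kurt())}

# Synthetic draws for illustration: a normal sample versus a right-skewed one.
rng = np.random.default_rng(0)
normal_like = pd.Series(rng.normal(loc=6.0, scale=1.0, size=5000))
skewed = pd.Series(rng.exponential(scale=1.0, size=5000))

normal_shape = shape_summary(normal_like)
skewed_shape = shape_summary(skewed)
```

For the notebook's data this would be `shape_summary(df["vote_average"])`. Values far from 0 would suggest that the normality claim needs qualification.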

<a id='conclusions'></a>
## Conclusions

> We can conclude that movies with higher budgets are more likely to be successful in terms of revenue; usually only a few production companies can provide such budgets.
 
> Movies released in summer (May-July) and around Christmas usually have higher revenues than movies released at other times of the year, and movies released around Christmas also tend to have better ratings.

> The percentage of movies released is roughly consistent across the months of the year.

> The vote_average column is approximately normally distributed.


### Limitations
> Outliers in the vote_average column may occur for movies where the vote_count is extremely low. For example, if only one person rated a movie and gave it 10/10 or 0/10, that rating is not representative of the movie. We might therefore need to ignore movies that have few votes, or select the movies to compare according to their popularity.
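
The filtering idea described above can be sketched as follows. The helper name `filter_by_min_votes` and the cutoff of 50 votes are hypothetical choices for illustration, and the toy data is made up.

```python
import pandas as pd

def filter_by_min_votes(frame: pd.DataFrame, min_votes: int) -> pd.DataFrame:
    """Keep only the movies that received at least `min_votes` votes."""
    return frame[frame["vote_count"] >= min_votes]

# Made-up toy data: a single 10/10 vote dominates the unfiltered maximum.
toy = pd.DataFrame({
    "original_title": ["A", "B", "C"],
    "vote_count": [1, 250, 900],
    "vote_average": [10.0, 6.4, 7.1],
})

# 50 is an arbitrary, hypothetical cutoff; a suitable value depends on the data.
reliable = filter_by_min_votes(toy, min_votes=50)
```

Any such cutoff trades sample size for reliability, so the chosen threshold and the number of movies it removes should be reported alongside the results.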



In [None]:
from subprocess import call
call(['python', '-m', 'nbconvert', 'Investigate_a_Dataset.ipynb'])