# Project: Investigate a Dataset - [TMDB Movies]

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

### - Dataset Description:
This dataset comes from The Movie Database (TMDB) and holds basic information about each movie, such as title, release date, genres, budget, revenue, popularity, rating and runtime.
The data is available online at www.themoviedb.org


### - Questions for Analysis:

#### 1- What are the most produced genres?
#### 2- What is the production rate of the most produced genres by year?
#### 3- What are the most popular movies?
#### 4- What are the most successful movies based on profit?
#### 5- What factors affect profits?
#### 6- Important facts to consider

In [None]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sb
%matplotlib inline 

<a id='wrangling'></a>
## Data Wrangling

### Overview

In [None]:
# Loading data and checking it
df = pd.read_csv('tmdb-movies.csv')
df.head()

Genres are separated by the `|` character.
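Because several genres share one string, each movie has to be split before genres can be counted. The following is a minimal sketch on made-up rows (the titles and the `sample` frame are assumptions, not rows from `tmdb-movies.csv`) showing one way to do that with `str.split` and `explode`:

```python
import pandas as pd

# Toy rows in the same pipe-separated layout as the `genres` column (titles are made up)
sample = pd.DataFrame({
    'original_title': ['Movie A', 'Movie B', 'Movie C'],
    'genres': ['Action|Adventure', 'Drama', None],
})

# Split each string into a list, then give every genre its own row
exploded = sample.assign(genre=sample['genres'].str.split('|')).explode('genre')

# A missing genres value survives as a NaN row, so drop it before counting
exploded = exploded.dropna(subset=['genre'])
genre_counts = exploded['genre'].value_counts()
print(genre_counts)
```

Keeping one row per movie and genre also preserves the other columns, which helps when genres need to be grouped by year later.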

In [None]:
df.shape

In [None]:
df.duplicated('id').sum()
#Check for movie ID duplicates

In [None]:
df['id'].nunique()

The dataset has duplicate movie IDs that need to be removed.

In [None]:
df.info()

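`df.info()` lists the non-null count of every column; a column whose count is below the row count has missing values. The `director` column is cleaned with `dropna()` later in this notebook, so the text columns should be checked here rather than assumed complete.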
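The non-null counts printed by `df.info()` can be hard to scan, so a direct count of missing values per column may be easier to read. A small sketch on made-up rows (the `sample` frame and its values are assumptions for illustration):

```python
import pandas as pd

# Toy frame with one missing director (values are made up)
sample = pd.DataFrame({
    'id': [1, 2, 3],
    'director': ['Director A', None, 'Director C'],
    'budget': [100, 0, 250],
})

# Number of missing values in each column
null_counts = sample.isnull().sum()

# Show only the columns that actually have missing values
print(null_counts[null_counts > 0])
```

On the real data the same two lines would be run against `df` instead of `sample`.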

In [None]:
df.describe()

#### Key findings:
#### 75% of popularity values are below 1
#### 50% of budget entries are 0 (which will affect the accuracy of the analysis)
#### 75% of runtime entries are below 111 minutes, while the maximum runtime is 900 minutes
#### 75% of vote count entries are below 145 votes, but the maximum vote count is 9767 votes
#### The release years in the dataset range from 1960 to 2015
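The share of zero entries noted above can be measured directly rather than read off the quartiles. A minimal sketch, assuming made-up numbers in a hypothetical `sample` frame:

```python
import pandas as pd

# Toy values; zeros stand in for unreported figures (numbers are made up)
sample = pd.DataFrame({
    'budget': [0, 0, 150_000_000, 20_000_000],
    'revenue': [0, 5_000_000, 900_000_000, 0],
    'runtime': [95, 0, 124, 110],
})

# Comparing to 0 gives a boolean frame; the mean of booleans is the fraction of True
zero_share = (sample == 0).mean()
print(zero_share)
```

Applied to `df[['budget', 'revenue', 'runtime']]`, this would give the fraction of rows affected in each column.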

### Data Cleaning

In [None]:
df.drop_duplicates(['id'], inplace = True)
#Deleting ID duplicates 

In [None]:
df.duplicated('id').sum()
#Check for movie ID duplicates

In [None]:
df.shape
#Checking again

In [None]:
df = df.drop(columns=['imdb_id','cast','homepage','tagline','keywords','overview','vote_count','budget_adj','revenue_adj'])
#Deleting irrelevant columns that will not be used in the analysis

In [None]:
df.head()
#Checking again

In [None]:
df.info()

<a id='eda'></a>
## Exploratory Data Analysis



### 1- What are the most produced genres?

In [None]:
genre_list = df['genres'].str.cat(sep = '|')
genre_list = pd.Series(genre_list.split('|'))
genre_prod = genre_list.value_counts()

In [None]:
genre_prod

Visualising genre production as a bar chart:

In [None]:
fig = plt.figure(figsize = (15, 15))
genre_prod.plot.bar(color='pink');

### 2- What is production rate for most produced genres by year?

In [None]:
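def genre_production(genre):
    # Match the genre inside the pipe-separated string; isin() only matched single-genre movies
    has_genre = df['genres'].str.split('|').apply(lambda g: isinstance(g, list) and genre in g)
    # Count the movies released in each year
    per_year = df[has_genre].groupby('release_year')['id'].count()
    fig = plt.figure(figsize = (25, 10))
    plt.barh(per_year.index, per_year.values)
    plt.xlabel('Number of movies')
    plt.ylabel('Release year')
    plt.title(genre + ' movies released per year');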

In [None]:
genre_production('Drama')

In [None]:
genre_production('Comedy')

In [None]:
genre_production('Thriller')
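The per-genre plots can also be summarised in one table of movie counts by year and genre. The sketch below uses made-up rows (the `sample` frame is an assumption) and `pd.crosstab`; it is one possible approach, not the only one:

```python
import pandas as pd

# Toy rows with the same two columns used in this section (values are made up)
sample = pd.DataFrame({
    'release_year': [2014, 2014, 2015, 2015, 2015],
    'genres': ['Drama|Thriller', 'Comedy', 'Drama', 'Drama|Comedy', None],
})

# One row per movie and genre, skipping movies without a genre
exploded = (
    sample.assign(genre=sample['genres'].str.split('|'))
          .explode('genre')
          .dropna(subset=['genre'])
)

# Rows are years, columns are genres, cells are movie counts
per_year = pd.crosstab(exploded['release_year'], exploded['genre'])
print(per_year)
```

Selecting a few columns, for example `per_year[['Drama', 'Comedy']]`, and calling `.plot()` would then draw the genres on shared axes for comparison.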

### 3- What are the most popular movies?

In [None]:
most_pop = df.sort_values(by='popularity',ascending=False)
most_pop = most_pop[['original_title', 'popularity']]

In [None]:
most_pop.head(10)

In [None]:
def get_pie(DF,column,number,labels,title):
    ax = DF[column].head(number).plot.pie(labels=DF[labels].head(number),figsize=(15,15),autopct='%1.1f%%',legend=True,fontsize=15)
    plt.ylabel('')
    plt.title(title,fontsize=35);

In [None]:
get_pie(most_pop,'popularity',10,'original_title','Top 10 Popular Movies')

### 4- What are the most successful movies based on profit?

In [None]:
df.insert(4,'profit',df['revenue']-df['budget'])

In [None]:
most_succ = df.sort_values(by='profit',ascending=False)
most_succ[['original_title','profit']].head(10)

In [None]:
get_pie(most_succ,'profit',10,'original_title','Top 10 Successful Movies')

### 5- What factors affect profits?

In [None]:
df.describe()

In [None]:
df_ = df.head(50)

In [None]:
def prof_relation(col,colr='blue'):
    df.plot.scatter(col,'profit',figsize=(10,10),color=colr);

In [None]:
prof_relation('popularity','red')

In [None]:
prof_relation('budget','green')

In [None]:
prof_relation('runtime')

In [None]:
prof_relation('vote_average','black')

In [None]:
prof_relation('release_year','grey')
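Scatter plots show the shape of a relation, and a correlation coefficient can put a number on it. A minimal sketch with made-up values (the `sample` frame is an assumption) using Pearson correlation from `DataFrame.corr`:

```python
import pandas as pd

# Toy numeric columns (values are made up for illustration)
sample = pd.DataFrame({
    'budget': [10, 20, 30, 40, 50],
    'runtime': [120, 90, 100, 95, 130],
    'profit': [15, 35, 50, 85, 95],
})

# Pearson correlation of every column with profit, leaving out profit itself
corr_with_profit = sample.corr()['profit'].drop('profit')
print(corr_with_profit.sort_values(ascending=False))
```

Pearson correlation only captures linear relations, so it complements the plots rather than replacing them. On the real data, restricting `df` to its numeric columns first avoids errors from text columns.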

### 6- Important facts to consider:

In [None]:
dirc_votes = df[['director', 'vote_average','original_title']].sort_values(by='vote_average',ascending=False)
dirc_votes = dirc_votes.dropna()
dirc_votes.head(10)

In [None]:
get_pie(dirc_votes,'vote_average',10,'director','Directors')

In [None]:
prod_comp = df['production_companies'].str.cat(sep = '|')
prod_comp = pd.Series(prod_comp.split('|'))
prod_comp = prod_comp.value_counts()
prod_comp.head(10)

In [None]:
prod_comp.head(10).plot.barh(figsize=(20,10),title='Top Producers');

<a id='conclusions'></a>
## Conclusions

#### 75% of popularity values are below 1
#### 50% of budget entries are 0 (which will affect the accuracy of the analysis)
#### 75% of runtime entries are below 111 minutes, while the maximum runtime is 900 minutes
#### 75% of vote count entries are below 145 votes, but the maximum vote count is 9767 votes
#### The release years in the dataset range from 1960 to 2015
#### Drama, comedy and thriller are the most produced genres
#### Production of the most produced genres peaked in 2015
#### The most popular movies are Jurassic World, Mad Max and Interstellar
#### The most profitable movies are Avatar, Star Wars and Titanic
#### The scatter plots show no clear relation between profit and popularity, budget or runtime
#### Vote average is related to profit: the most profitable movies have vote averages between 5 and 8
#### There is also a positive relation between release year and profit: the later the release year, the higher the profit
#### The top rated directors are Mark Cousins, Jennifer Siebel and Carl Tibbetts
#### The production companies with the most movies are Universal Pictures, Warner Bros. and Paramount Pictures

### Most Affecting Limitations
###### Many budget and revenue entries are 0
###### Most vote counts are too small to give a reliable average rating
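One possible way to reduce the effect of both limitations is to filter rows before analysing them. The sketch below uses made-up rows, and the `MIN_VOTES` cutoff is a hypothetical value chosen only for illustration. In this notebook `vote_count` is dropped during cleaning, so such a filter would have to run before that step.

```python
import pandas as pd

MIN_VOTES = 50  # hypothetical cutoff, not taken from the analysis above

# Toy rows (values are made up)
sample = pd.DataFrame({
    'budget': [0, 100, 200, 50],
    'revenue': [500, 0, 700, 120],
    'vote_count': [10, 300, 900, 20],
    'vote_average': [9.0, 6.5, 7.8, 8.9],
})

# Profit is only meaningful when both budget and revenue were reported
has_financials = (sample['budget'] > 0) & (sample['revenue'] > 0)
financial = sample[has_financials].copy()
financial['profit'] = financial['revenue'] - financial['budget']

# Ratings backed by very few votes are excluded from rating comparisons
well_rated = sample[sample['vote_count'] >= MIN_VOTES]

print(financial[['budget', 'revenue', 'profit']])
print(well_rated[['vote_count', 'vote_average']])
```

Filtering shrinks the dataset, so any conclusion drawn afterwards applies only to movies with reported financials or enough votes.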

In [None]:
from subprocess import call
call(['python', '-m', 'nbconvert', 'Investigate_a_Dataset.ipynb'])