
# Project: TMDb Movie Data Analysis



## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

### Dataset Description 

>This dataset contains information about 10,000 movies collected from The Movie Database (TMDb), including user ratings and revenue. The analysis of this TMDb movie dataset is done using NumPy, pandas, and Matplotlib.

### Columns:
>The dataset has 21 columns = ['id', 'imdb_id', 'popularity', 'budget', 'revenue', 'original_title',
       'cast', 'homepage', 'director', 'tagline', 'keywords', 'overview',
       'runtime', 'genres', 'production_companies', 'release_date',
       'vote_count', 'vote_average', 'release_year', 'budget_adj',
       'revenue_adj']


       
### Question(s) for Analysis

>1. Which genres are most popular from year to year?
>2. What kinds of properties are associated with movies that have high revenues?
>3. Which years have the highest and lowest number of movie releases?
>4. Which movies have the highest and lowest budgets?
>5. What is the average runtime of movies from year to year?



In [1]:
# Import the packages used in this analysis.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Plot visualizations inline with the notebook. The magic must be written
#   without a space after the percent sign. See this page for more:
#   http://ipython.readthedocs.io/en/stable/interactive/magics.html
%matplotlib inline


<a id='wrangling'></a>
## Data Wrangling

> After inspecting the dataset with the analysis questions in mind, we remove the data we do not need and keep the data relevant to the analysis.




### General Properties


In [None]:
# Load the data and print out a few rows to inspect the data types and look
#   for instances of missing or possibly erroneous data.
movie_df = pd.read_csv('Database_TMDb_movie_data/tmdb-movies.csv')
movie_df.head(31)


In [None]:
movie_df.tail(10)

In [None]:
# Print the number of rows and columns in the dataset
movie_df.shape

In [None]:
# Show summary statistics (count, mean, std, min, quartiles, max) for the numeric columns
movie_df.describe()

In [None]:
# Print a concise summary of the dataset (column dtypes and non-null counts)
movie_df.info()

In [None]:
# Count the null values in each column
movie_df.isna().sum()

In [None]:
# Count the rows that are exact duplicates of an earlier row
movie_df.duplicated().sum()


### Data Cleaning
> Based on the information printed above, we need to:
> 1. Remove duplicate rows.
> 2. Remove unused columns.
> 3. Remove rows with NaN values.
> 4. Remove movies that have a budget or revenue of 0.
> 5. Convert `release_date` to datetime format.
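
The first four cleaning steps can be sketched as one small function. This is only an illustration on a toy DataFrame: `clean_movies`, `toy`, and the toy column values are hypothetical names and data, not part of the notebook, and the date conversion step is left out to keep the sketch short.

```python
import pandas as pd

def clean_movies(df, unused_cols):
    """Return a cleaned copy of df (hypothetical helper, not the notebook's code)."""
    cleaned = (
        df.drop_duplicates()                           # remove duplicate rows
          .drop(columns=unused_cols, errors="ignore")  # remove unused columns
          .dropna()                                    # remove rows with NaN values
    )
    # Keep only rows where both budget and revenue are non-zero.
    cleaned = cleaned[(cleaned[["budget", "revenue"]] != 0).all(axis=1)]
    # Renumber rows from 0 without keeping the old index as a column.
    return cleaned.reset_index(drop=True)

# Toy data: one duplicate row, one zero budget, one missing title.
toy = pd.DataFrame({
    "original_title": ["A", "A", "B", "C", None],
    "homepage": ["x", "x", None, "y", "z"],
    "budget": [10, 10, 0, 5, 7],
    "revenue": [20, 20, 30, 15, 9],
})
result = clean_movies(toy, unused_cols=["homepage"])
print(result)
```

Dropping the unused columns before calling `dropna()` matters here: a sparse column such as `homepage` would otherwise cause many rows to be discarded for a value the analysis never uses.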



 

#### 1. Remove duplicates

In [None]:
# Remove duplicate rows
movie_df.drop_duplicates(inplace = True)
# Check that the duplicated rows were removed
movie_df.duplicated().sum()

#### 2. Remove unused columns

In [None]:
# Print the 21 columns
movie_df.columns

In [None]:
# Drop the columns that are not used in this analysis
movie_df.drop(['imdb_id','homepage','tagline','keywords','overview','budget_adj','revenue_adj'],axis =1,inplace = True)

In [None]:
# Print the columns that remain after the drop
movie_df.columns

#### 3. Remove NaN values

In [None]:
# Count the null values in each column
movie_df.isna().sum()

In [None]:
# Remove rows that contain NaN values
movie_df.dropna(inplace=True)


In [None]:
# Check that no null values remain
movie_df.isna().sum()

#### 4. Remove movies that have 0 budget or revenue

In [None]:
# Count the rows that have a budget of 0
movie_df.query('budget == 0')['budget'].value_counts()

In [None]:
# Count the rows that have a revenue of 0
movie_df.query('revenue == 0')['revenue'].value_counts()

In [None]:
# Keep only rows where both budget and revenue are non-zero
movie_df = movie_df[(movie_df[['budget','revenue']] != 0).all(axis=1)]
# Renumber the rows from 0; drop=True avoids adding the old index as a new column
movie_df = movie_df.reset_index(drop=True)

In [None]:
movie_df.head(31)

In [None]:
# Check that no rows with 0 budget or revenue remain
movie_df.query('budget == 0')['budget'].value_counts()

In [None]:
movie_df.query('revenue == 0')['revenue'].value_counts()

#### 5. Convert release_date to datetime

In [None]:
# Convert release_date to datetime and show the first values
movie_df['release_date'] = pd.to_datetime(movie_df['release_date'])
movie_df['release_date'].head()
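
One thing worth checking after this conversion is the century of each parsed date. The raw `release_date` values are not shown in this notebook, so the following is a sketch under the assumption that they are stored as `m/d/yy` strings with two-digit years. In that case older films can be parsed into the wrong century, and the separate `release_year` column can be used to rebuild the date. The `toy` frame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical sample: two-digit years, with release_year as the trusted year.
toy = pd.DataFrame({
    "release_date": ["6/9/15", "12/15/66"],
    "release_year": [2015, 1966],
})

# Parse with an explicit format so month/day order is never guessed.
parsed = pd.to_datetime(toy["release_date"], format="%m/%d/%y")

# Rebuild each date from release_year plus the parsed month and day,
# so the century always agrees with release_year.
fixed = pd.to_datetime(pd.DataFrame({
    "year": toy["release_year"],
    "month": parsed.dt.month,
    "day": parsed.dt.day,
}))
print(fixed)
```

If the dates in the real file already carry four-digit years, this step is unnecessary and the plain `pd.to_datetime` call is enough.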

In [None]:
# Print the dataset after cleaning
movie_df.head()

In [None]:
movie_df.info()

In [None]:
movie_df.shape

<a id='eda'></a>
## Exploratory Data Analysis

> With the data cleaned, we are ready to move on to exploration. We compute statistics and create visualizations with the goal of addressing the research questions posed in the Introduction section.

### 1. Which genres are most popular from year to year?

In [None]:
# Create an empty list to hold every distinct genre
genres_list = []
# Create an array of release years
years_list = np.array(movie_df['release_year'])
# The genres column holds multiple values separated by a pipe (|), so split
# each entry and add any genre not seen before to genres_list
for genres in movie_df['genres'] :
    genres = genres.split('|')
    for genre in genres :
        if genre not in genres_list:
            genres_list.append(genre)
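
The same list of distinct genres can also be built without explicit loops. The sketch below uses pandas string splitting and `explode` on a toy column; `toy` and `unique_genres` are illustrative names, and this is an alternative to the loop above rather than the notebook's own method.

```python
import pandas as pd

toy = pd.DataFrame({"genres": ["Action|Adventure", "Drama", "Action|Drama"]})

# Split each pipe-separated string into a list, then give each genre its own row.
exploded = toy["genres"].str.split("|").explode()

# unique() keeps the order of first appearance, like the loop-based version.
unique_genres = exploded.unique().tolist()
print(unique_genres)
```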

In [None]:
# Create the popularity DataFrame with one row per genre and one column per year
popularity_df = pd.DataFrame(index = genres_list, columns = range(min(years_list), max(years_list)+1))
popularity_df

In [None]:
# Replace NaN values with zero
popularity_df = popularity_df.fillna(value = 0.0)
popularity_df.head()

In [None]:
# Add each movie's popularity to the row of every genre it belongs to, in the column of its release year
i = 0
for genres in movie_df['genres'] :
    split_genre = genres.split('|')
    popularity_df.loc[split_genre, years_list[i]] =  popularity_df.loc[split_genre, years_list[i]] + movie_df['popularity'][i]
    i+=1
    
popularity_df
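
A genre-by-year table of summed popularity can also be produced in one step with `explode` and `pivot_table`. This is a sketch on toy data with hypothetical values; `long_df` and `popularity_by_genre` are illustrative names. Unlike the table above, it only creates columns for years that actually occur in the data.

```python
import pandas as pd

toy = pd.DataFrame({
    "genres": ["Action|Adventure", "Drama", "Action|Drama"],
    "release_year": [2014, 2014, 2015],
    "popularity": [2.0, 1.0, 3.0],
})

# One row per (movie, genre) pair.
long_df = toy.assign(genre=toy["genres"].str.split("|")).explode("genre")

# Sum popularity per genre and year; missing combinations become 0.
popularity_by_genre = long_df.pivot_table(
    index="genre",
    columns="release_year",
    values="popularity",
    aggfunc="sum",
    fill_value=0.0,
)
print(popularity_by_genre)
```

Because a movie with several genres contributes its full popularity to each of them, the totals measure how much popularity is attached to a genre, not an average per movie.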

In [None]:
# Bar plot of the summed popularity of each genre in the last 3 years
popularity_df.iloc[:,-3:].plot(kind='bar',figsize = (15,6),fontsize=13)
# Set up the title and labels of the plot
plt.title("Genre Popularity in the Last 3 Years",fontsize=15)
plt.xlabel("Genres",fontsize = 14)
plt.ylabel("Summed Popularity",fontsize=14)


In [None]:
# How does the popularity of each genre differ from year to year?
# Make a 5x4 grid of subplots, one panel per genre
fig, axs = plt.subplots(5,4,figsize = (16,10))
fig.suptitle('Genre Popularity From Year To Year',fontsize = 16)
for genre ,ax in zip(genres_list, axs.ravel()):
    popularity_df.loc[genre].plot(label = genre,ax=ax,c=np.random.rand(3,))
    # Title each panel so the genre it shows can be identified
    ax.set_title(genre)
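
To read the answer off the table directly instead of from the plots, the top genre for each year can be taken with `idxmax`. The sketch below uses a small hypothetical genre-by-year table shaped like `popularity_df`; the numbers are made up.

```python
import pandas as pd

# Hypothetical table: rows are genres, columns are years, values are summed popularity.
toy_popularity = pd.DataFrame(
    {2014: [2.0, 2.5, 1.0], 2015: [3.0, 0.0, 3.5]},
    index=["Action", "Adventure", "Drama"],
)

# For each year (column), the row label with the largest value.
top_genre_per_year = toy_popularity.idxmax()

# How many years each genre was on top.
years_on_top = top_genre_per_year.value_counts()
print(top_genre_per_year)
print(years_on_top)
```

When applying this to `popularity_df`, it may be necessary to convert it with `astype(float)` first if its columns ended up with the `object` dtype.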



        

### 2. What kinds of properties are associated with movies that have high revenues?

In [None]:
# Plot revenue against each property (budget, runtime, vote_average, release_year, popularity)
fig,axes = plt.subplots(3,2,figsize = (16,6))
properties = ["budget","runtime","vote_average","release_year","popularity"]
fig.suptitle("Revenue vs. (budget, runtime, vote_average, release_year, popularity)",fontsize=14)
for item,ax in zip(properties,axes.ravel()):
    # Put the current property on the x-axis and revenue on the y-axis
    movie_df.plot.scatter(x=item,y="revenue",ax=ax,c=np.random.rand(3,))
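
Scatter plots show the shape of each relationship; a correlation coefficient puts a number on its strength. The following is a minimal sketch on hypothetical data, assuming a Pearson correlation is an acceptable summary. It measures linear association only and says nothing about causation.

```python
import pandas as pd

# Hypothetical values, chosen so budget tracks revenue exactly and runtime does not.
toy = pd.DataFrame({
    "revenue": [10, 20, 30, 40],
    "budget": [1, 2, 3, 4],
    "runtime": [120, 90, 110, 100],
})

# Pearson correlation of every property with revenue, strongest first.
correlations = (
    toy.corr()["revenue"]
       .drop("revenue")
       .sort_values(ascending=False)
)
print(correlations)
```

On the real data the same expression could be applied to `movie_df[["revenue"] + properties]`.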

### 3. Which years have the highest and lowest number of movie releases?

In [None]:
# Group the movies by release year and count them
movies_every_year = movie_df.groupby('release_year').count()['id']
movies_every_year.head()

In [None]:
# Largest and smallest number of movies released in a single year
movies_every_year.max()

In [None]:
movies_every_year.min()
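
`max()` and `min()` return the counts, while the question asks for the years. `idxmax()` and `idxmin()` return the index label (here, the year) at which those counts occur. A sketch on toy data with hypothetical values:

```python
import pandas as pd

toy = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "release_year": [2010, 2011, 2011, 2011, 2012, 2012],
})

# Number of movies per release year.
movies_per_year = toy.groupby("release_year")["id"].count()

# The years (index labels) with the most and the fewest releases.
busiest_year = movies_per_year.idxmax()
quietest_year = movies_per_year.idxmin()
print(busiest_year, quietest_year)
```

If several years tie, both methods return only the first one.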

In [None]:
# Bar plot of the number of movies released each year
movies_every_year.plot.bar(figsize = (18,10))


### 4. Which movies have the highest and lowest budgets?

In [None]:
# Sort the movies by budget, highest first, and preview the top rows
top_movie_budget =movie_df.sort_values(by=['budget'],ascending=False)
top_movie_budget.head()

In [None]:
# Plot budget against original_title for the 20 highest-budget movies
top_movie_budget.iloc[0:20,].plot(x="original_title",y="budget",figsize = (18,10), marker='o')


In [None]:
# Sort the movies by budget, lowest first, and preview the top rows
lowest_movie_budget =movie_df.sort_values(by=['budget'])
lowest_movie_budget.head()

In [None]:
# Plot budget against original_title for the 20 lowest-budget movies
lowest_movie_budget.iloc[0:20,].plot(x="original_title",y="budget",figsize = (18,10), marker='o')


In [None]:
high_budget = movie_df.budget.max()
high_budget

In [None]:
# Look up the title(s) using the computed maximum instead of a hard-coded value
high_budget_movie = movie_df.query('budget == @high_budget')['original_title']
high_budget_movie

In [None]:
low_budget = movie_df.budget.min()
low_budget

In [None]:
# Look up the title(s) using the computed minimum instead of a hard-coded value
low_budget_movie = movie_df.query('budget == @low_budget')['original_title']
low_budget_movie
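
An alternative that handles ties in one call is `nlargest` / `nsmallest` with `keep="all"`, which returns every row sharing the extreme value. A sketch on hypothetical data:

```python
import pandas as pd

toy = pd.DataFrame({
    "original_title": ["A", "B", "C", "D"],
    "budget": [1, 500, 1, 90],
})

# keep="all" returns every row tied for the extreme budget.
highest = toy.nlargest(1, "budget", keep="all")
lowest = toy.nsmallest(1, "budget", keep="all")
print(highest["original_title"].tolist())
print(lowest["original_title"].tolist())
```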

### 5. What is the average runtime of movies from year to year?

In [None]:
# Group by release year and compute the mean runtime
ave_runtime = movie_df.groupby('release_year').runtime.mean()
ave_runtime

In [None]:
ave_runtime.plot.bar(figsize = (18,10))


<a id='conclusions'></a>
## Conclusions

>1. Action is the most popular genre from year to year.
>2. Revenue tends to increase with budget and runtime.
>3. 2011 has the highest number of movie releases.
>4. The Warrior's Way has the highest budget.
>5. Lost & Found and Love, Wedding, Marriage have the lowest budget.
>6. 1965 has the highest average runtime.

>#### Limitations
1. During the data cleaning process, I split the data separated by '|' into lists for easy parsing during the exploration phase. This increases the time taken to calculate the results.
2. In the cleaning process, we removed rows with NaN and zero values, so some movies are missing from the analysis and the results cannot be considered 100 percent certain.


In [2]:
from subprocess import call
call(['python', '-m', 'nbconvert', 'Investigate_a_Dataset.ipynb'])

255