<a id='A'></a>
# Project: TMDB Movie Dataset (with limitation completed input for every movie)

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#Data Wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#con">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

> **Tips**:  After cleaning and wrangling you will face some analysis with 3805 Movie Dataset extracted from aproxemately 11 000 movie the rule was that every Movie has name of (the director, casts, and production_company),and get the data away of 0 budget or 0 revenue with assuming that all currency was (US-Dollars).    

> data has titles or columns ['id', 'imdb_id', 'popularity', 'budget', 'revenue', 'original_title',
       'cast', 'homepage', 'director', 'tagline', 'keywords', 'overview',
       'runtime', 'genres', 'production_companies', 'release_date',
       'vote_count', 'vote_average', 'release_year', 'budget_adj',
       'revenue_adj'] but for my analysis i used only this columns ['popularity', 'budget(US-Dollars)', 'revenue(US-Dollars)',
       'original_title', 'cast', 'director', 'runtime', 'genres',
       'production_companies', 'vote_count', 'vote_average', 'release_year',
       'profit'] and i added the profit column which  = the revenue - the budget.
       
> to make Dataset readable :<br><br>
        cast - The name of lead and supporting actors.<br>
        budget - The budget in which the movie was made.<br>
        genre - The genre of the movie, Action, Comedy ,Thriller etc.<br>
        popularity - A numeric quantity specifying the movie popularity.<br>
        production_companies - The production house of the movie.<br>
        revenue - The worldwide revenue generated by the movie.<br>
        title - Title of the movie.<br>
        vote_average - average ratings the movie recieved.<br>
        vote_count - the count of votes recieved.

<a href="#A">Top NoteBook</a>

<a id='Data Wrangling'></a>
## Data Wrangling <br>
### Data Cleaning
1) Removing the columns that won't be useful in my analysis.

2) Discard 0 budget or 0 revenue movies and assum that main currency is (US-Dollars).

3) clean the empty data.

4) Removing duplicated rows.

5) Convert some data types to be integers (converting floats to integers).

### Extract the dataset from csv file.

In [3]:
#set up import statements numpy, pandas, seaborn,and matplotlib.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
#load data from csv file with command pd.read_csv('the file in the same folder or the url of it').
df = pd.read_csv('tmdb-movies.csv')

### determine budget and revenue as assumed the main currency is us dollar (US-Dollars).

In [4]:
#rename the columns to be clear.
df.rename(columns={'budget': 'budget(US-Dollars)', 'revenue': 'revenue(US-Dollars)'}, inplace=True)

### wrangle  0 budget or 0 revenue.

In [5]:
#convert zeros from columns budget and revenue to NAN value to remove it.
money = ['budget(US-Dollars)', 'revenue(US-Dollars)']
df[money] = df[money].replace(0, np.NAN)

### clean the data from all not completed information.

In [6]:
#delete all this columns ['','','',...]. 
df.drop(['id', 'imdb_id', 'homepage', 'tagline', 'keywords', 'overview',
         'budget_adj', 'revenue_adj', 'release_date'], axis=1, inplace=True)

In [7]:
#delete all rows with NAN value.
df.dropna(inplace=True)

In [8]:
#remove this duplicates
df.drop_duplicates(inplace=True)

### wrangle budget and revenue and extract the profit<br>
prifit = revenue - budget

In [9]:
#convert the dataframe['column_name']from its type to be integer.
df['budget(US-Dollars)']=df['budget(US-Dollars)'].astype(int)
df['revenue(US-Dollars)']=df['revenue(US-Dollars)'].astype(int)
#make another column profit and determine it = revenue - budget
df['profit(US-Dollars)'] = df['revenue(US-Dollars)'] - df['budget(US-Dollars)']

### all information after wrangling.

In [10]:
#extract the information from every column. 
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3805 entries, 0 to 10848
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   popularity            3805 non-null   float64
 1   budget(US-Dollars)    3805 non-null   int32  
 2   revenue(US-Dollars)   3805 non-null   int32  
 3   original_title        3805 non-null   object 
 4   cast                  3805 non-null   object 
 5   director              3805 non-null   object 
 6   runtime               3805 non-null   int64  
 7   genres                3805 non-null   object 
 8   production_companies  3805 non-null   object 
 9   vote_count            3805 non-null   int64  
 10  vote_average          3805 non-null   float64
 11  release_year          3805 non-null   int64  
 12  profit(US-Dollars)    3805 non-null   int32  
dtypes: float64(2), int32(3), int64(3), object(5)
memory usage: 297.3+ KB


### 13 column and 3805 movie

In [11]:
#determine how many columns and rows
df.shape

(3805, 13)

### name of columns .

In [12]:
#extract only name of the columns.
df.columns

Index(['popularity', 'budget(US-Dollars)', 'revenue(US-Dollars)',
       'original_title', 'cast', 'director', 'runtime', 'genres',
       'production_companies', 'vote_count', 'vote_average', 'release_year',
       'profit(US-Dollars)'],
      dtype='object')

### view of first 5 row in data after wrangling.

In [13]:
# extract first 5 rows.
df.head()

Unnamed: 0,popularity,budget(US-Dollars),revenue(US-Dollars),original_title,cast,director,runtime,genres,production_companies,vote_count,vote_average,release_year,profit(US-Dollars)
0,32.985763,150000000,1513528810,Jurassic World,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,Colin Trevorrow,124,Action|Adventure|Science Fiction|Thriller,Universal Studios|Amblin Entertainment|Legenda...,5562,6.5,2015,1363528810
1,28.419936,150000000,378436354,Mad Max: Fury Road,Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic...,George Miller,120,Action|Adventure|Science Fiction|Thriller,Village Roadshow Pictures|Kennedy Miller Produ...,6185,7.1,2015,228436354
2,13.112507,110000000,295238201,Insurgent,Shailene Woodley|Theo James|Kate Winslet|Ansel...,Robert Schwentke,119,Adventure|Science Fiction|Thriller,Summit Entertainment|Mandeville Films|Red Wago...,2480,6.3,2015,185238201
3,11.173104,200000000,2068178225,Star Wars: The Force Awakens,Harrison Ford|Mark Hamill|Carrie Fisher|Adam D...,J.J. Abrams,136,Action|Adventure|Science Fiction|Fantasy,Lucasfilm|Truenorth Productions|Bad Robot,5292,7.5,2015,1868178225
4,9.335014,190000000,1506249360,Furious 7,Vin Diesel|Paul Walker|Jason Statham|Michelle ...,James Wan,137,Action|Crime|Thriller,Universal Pictures|Original Film|Media Rights ...,2947,7.3,2015,1316249360


<a href="#A">Top NoteBook</a>

<a id='eda'></a>
## More Insights.
## Exploratory Data Analysis

### the question is which genre of movies is more profitable?
### so i combined genres with profit in seperated dataframe to know that 

In [14]:
#genres_df is a new dataframe with all of our df data but in column genres as you see 
#assign>>genres str to assign as string and split every word in the row and explode to make every word in its own row alone. 
genres_df = df.assign(genres=df['genres'].str.split('|')).explode('genres')
#genres_profit = genres_df but only with column genres grouped
#with mean of profit(US-Dollars) to every genres and sort descending.
genres_profit = genres_df.groupby('genres')['profit(US-Dollars)'].mean().sort_values(ascending=False)
genres_profit

genres
Animation          1.819434e+08
Adventure          1.487091e+08
Fantasy            1.463063e+08
Family             1.433311e+08
Science Fiction    1.071376e+08
Action             1.003329e+08
Comedy             6.531357e+07
War                6.377623e+07
Thriller           6.091785e+07
Romance            5.924098e+07
Music              5.917397e+07
Mystery            5.595021e+07
Crime              5.125899e+07
Drama              4.697623e+07
History            4.101761e+07
Horror             3.902479e+07
TV Movie           3.700000e+07
Western            3.457804e+07
Documentary        2.259236e+07
Foreign           -5.402044e+06
Name: profit(US-Dollars), dtype: float64

In [None]:
#To visualize our data
plt.subplots(figsize=(20,8))
plt.bar(genres_profit.index, genres_profit)
plt.title('Profit By Genre', fontsize=30)
plt.xlabel('genres',fontsize=30)
plt.ylabel('Profit(US-Dollars)', fontsize=30);

### in the same question above but with revenue to get which genre is more popular ?

In [None]:
#genres_profit = genres_df but only with column genres grouped
#with mean of profit(US-Dollars) to every genres and sort descending.
genres_revenue = genres_df.groupby('genres')['revenue(US-Dollars)'].mean().sort_values(ascending=False)
genres_revenue

In [None]:
genres_revenue.plot.pie(figsize=(20, 20))
plt.title('Pie chart Genres', fontsize=30)
plt.ylabel('genres by revenue(US-Dollars)', fontsize=30);

### in the same question but with budget to see which genre of movies need a few budget? 

In [None]:
#genres_profit = genres_df but only with column genres grouped 
#with mean of profit(US-Dollars) to every genres and sort descending.
genres_budget = genres_df.groupby('genres')['budget(US-Dollars)'].mean().sort_values()
genres_budget

### question: how is popularity the same with popularity we know from revenue?

In [None]:
# group exploded dataframe by genres, get average popularity
popularity_ = genres_df.groupby('genres').popularity.mean().sort_values(ascending=False)
popularity_

In [None]:
popularity_.plot.pie(figsize=(20, 20))
plt.title('Pie chart Genres_popularity', fontsize=30)
plt.ylabel('genres by popularity(US-Dollars)', fontsize=30);

In [None]:
df.hist(figsize=(20,14))
plt.title('df.histograms', fontsize=30)

### question which movie got the top average vote?

In [None]:
#extract max of floats in vote_average.
df[df['vote_average'] == df.vote_average.max()]

### which movie is less verage votes?

In [None]:
#Foodfight is the loser in average vote.
df[df['vote_average'] == df.vote_average.min()]

### question which movie more popular than others?

In [None]:
#most popular movie is Jurassic World people love to watch it.
df[df['popularity'] == df.popularity.max()]

### which movie the most revenue ?

In [None]:
df[df['revenue(US-Dollars)'] == df['revenue(US-Dollars)'].max()]

### which actor more profit?

In [None]:
#cast_df is a new dataframe with all of our df data but in column cast as you see 
#assign>>cast str to assign as string and split every word in the row and explode to make every word in its own row alone. 
cast_df = df.assign(cast=df['cast'].str.split('|')).explode('cast')
#cast_profit = cast_df but only with column cast grouped with mean of profit(US-Dollars) to every actor and sort descending.
cast_profit = cast_df.groupby('cast')['profit(US-Dollars)'].mean().sort_values(ascending=False)
cast_profit

### which actor work more?

In [None]:
cast_df.cast.value_counts()

### animation which more profitable  frequent or not? and every type with its iteration infact to profit?

In [None]:
genres_df.genres.value_counts()

### more frequent movie A Nightmare on Elm Street
### Steven Spielberg more frequent director
### and so on 

In [None]:
df.mode().iloc[0]

In [None]:
np.round(df.corr(), 2)

### the best runtime is 124 minutes.

In [None]:
runtime_group_0 = df.groupby(['popularity'])['runtime'].mean()
runtime_group = df.groupby(['popularity'])['runtime'].mean().tail()
runtime_group

In [None]:
runtime_group_0.hist(figsize=(20,8))
plt.title('runtime_group', fontsize=30)
plt.xlabel('runtime',fontsize=30)
plt.ylabel('popularity', fontsize=30);

### Top rated Stop Making Sense, The Shawshank Redemption, The Godfather, Whiplash

In [None]:
Top_rated = df[df['vote_average'] > 8.1]
Top_rated 

### Jurassic World, Mad Max: Fury Road, Interstellar Top popular

In [None]:
Top_rated_0 = df[df['popularity'] > 20]
Top_rated_0

<a href="#A">Top NoteBook</a>

<a id='con'></a>
# Conclusions

### limitations:

> We have used TMBD Movies dataset for our analysis and worked with popularity, revenue and runtime. Our analysis is limited to only the provided dataset. For example, the dataset does not confirm that every release of every director is listed.
There is no normalization or exchange rate or currency conversion is considered during this analysis and our analysis is limited to the numerical values of revenue.
Dropping missing or Null values from variables of our interest might skew our analysis and could show unintentional bias towards the relationship being analyzed.

> **conclusion** :as you see the average of animation movies was more profit than others <br>
    Animation, Adventure, Fantasy incourage producers to make business <br>
    Foreign, Documentary, Western not incourage producers to make business

>**conclusion**: the most 5 popular movies types is Animation, Adventure, Family, Fantasy, Science Fiction<br>
    the most 5 not popular movies types is Foreign, Documentary, TV Movie, Horror, Drama

>**conclusion**: comedy movies is a few budget and also not bad choise to get profit.<br>
fewest budget TV Movie

>**Conclusions**: popularity is not the same with popularity from revenue  

>**conclusion**: The Shawshank Redemption and Stop Making Sense with 8.4 average is the most rate

>**conclusion**: Foodfight! with 2.2 average is the fewest rate.

>**conclusion**:  most popular movie is Jurassic World people love to watch it.

>**conclusion**:  most revenue is Star Wars: The Force Awakens people love to watch it.

>**conclusion** which actor more profit? Daisy Ridley

>**conclusion** which actor work more? Robert De Niro

>**conclusion**:  animation which more profitable  frequent or not? and every type with its iteration infact to profit? No

>**conclusion**: more frequent movie A Nightmare on Elm Street<br>
Steven Spielberg more frequent director

>**conclusion**: the best runtime is 124 minutes.

>**conclusion**: Top rated Stop Making Sense, The Shawshank Redemption, The Godfather, Whiplash

>**conclusion**: Jurassic World, Mad Max: Fury Road, Interstellar Top popular

<a href="#A">Top NoteBook</a>