In [1]:
ls 'Data/'

tmdb-movies.csv




# Project: Investigate the TMDb Movie Dataset

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

> **Tip**: In this section of the report, provide a brief introduction to the dataset you've selected for analysis. At the end of this section, describe the questions that you plan on exploring over the course of the report. Try to build your report around the analysis of at least one dependent variable and three independent variables.
>
> If you haven't yet selected and downloaded your data, make sure you do that first before coming back here. If you're not sure what questions to ask right now, then make sure you familiarize yourself with the variables and the dataset context for ideas of what to explore.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

<a id='wrangling'></a>
## Data Wrangling

> **Tip**: In this section of the report, you will load in the data, check for cleanliness, and then trim and clean your dataset for analysis. Make sure that you document your steps carefully and justify your cleaning decisions.

### General Properties

In [3]:
ls

[1m[34mData[m[m/                     Project_2_template.ipynb  README.md


In [4]:
df = pd.read_csv('Data/tmdb-movies.csv')


### Data Exploration

In [5]:
df.columns

Index(['id', 'imdb_id', 'popularity', 'budget', 'revenue', 'original_title',
       'cast', 'homepage', 'director', 'tagline', 'keywords', 'overview',
       'runtime', 'genres', 'production_companies', 'release_date',
       'vote_count', 'vote_average', 'release_year', 'budget_adj',
       'revenue_adj'],
      dtype='object')

In [6]:
df.shape

(10866, 21)

In [7]:
df.head(1).T

Unnamed: 0,0
id,135397
imdb_id,tt0369610
popularity,32.9858
budget,150000000
revenue,1513528810
original_title,Jurassic World
cast,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...
homepage,http://www.jurassicworld.com/
director,Colin Trevorrow
tagline,The park is open.


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 21 columns):
id                      10866 non-null int64
imdb_id                 10856 non-null object
popularity              10866 non-null float64
budget                  10866 non-null int64
revenue                 10866 non-null int64
original_title          10866 non-null object
cast                    10790 non-null object
homepage                2936 non-null object
director                10822 non-null object
tagline                 8042 non-null object
keywords                9373 non-null object
overview                10862 non-null object
runtime                 10866 non-null int64
genres                  10843 non-null object
production_companies    9836 non-null object
release_date            10866 non-null object
vote_count              10866 non-null int64
vote_average            10866 non-null float64
release_year            10866 non-null int64
budget_adj              1

In [9]:
df.describe().T[['min','max']]

Unnamed: 0,min,max
id,5.0,417859.0
popularity,6.5e-05,32.98576
budget,0.0,425000000.0
revenue,0.0,2781506000.0
runtime,0.0,900.0
vote_count,10.0,9767.0
vote_average,1.5,9.2
release_year,1960.0,2015.0
budget_adj,0.0,425000000.0
revenue_adj,0.0,2827124000.0


### Data Cleaning

In [10]:
# Remove columns that are not needed for this analysis
df.drop(columns=['id',
                 'imdb_id',
                 'homepage',
                 'tagline',
                 'keywords',
                 'overview',
                 'budget_adj',
                 'revenue_adj'],
        inplace=True)

In [11]:
df.shape

(10866, 13)

In [12]:
# Count null values per column
df.isnull().sum()

popularity                 0
budget                     0
revenue                    0
original_title             0
cast                      76
director                  44
runtime                    0
genres                    23
production_companies    1030
release_date               0
vote_count                 0
vote_average               0
release_year               0
dtype: int64

In [13]:
# Drop rows that contain null values.
# A more thorough approach would fill in the missing data from an API,
# but dropping the rows is enough for this simple project.
df.dropna(inplace=True)
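
Before dropping rows, it can help to measure how much data the drop will cost. The following is a minimal sketch on a made-up toy frame (the `toy` frame and its values are hypothetical, not taken from the dataset); it counts the rows that contain at least one null, which are exactly the rows `dropna()` removes.

```python
import pandas as pd

# Toy frame standing in for the movie data (values are made up for illustration).
toy = pd.DataFrame({
    "original_title": ["A", "B", "C", "D"],
    "cast": ["x|y", None, "z", "w"],
    "production_companies": ["p", "q", None, "r"],
})

# Rows with at least one null are the ones dropna() would remove.
rows_with_nulls = int(toy.isnull().any(axis=1).sum())
share_lost = rows_with_nulls / len(toy)

print(f"rows with at least one null: {rows_with_nulls} "
      f"({share_lost:.0%} of {len(toy)})")
```

In this notebook the frame goes from 10866 rows before the drop to 9773 after it, so 1093 rows (about 10%) are removed.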

In [14]:
df.isnull().sum()

popularity              0
budget                  0
revenue                 0
original_title          0
cast                    0
director                0
runtime                 0
genres                  0
production_companies    0
release_date            0
vote_count              0
vote_average            0
release_year            0
dtype: int64

In [15]:
# Parse the release_date strings into datetime values
df['release_date'] = pd.to_datetime(df['release_date'])

In [16]:
df.describe()

Unnamed: 0,popularity,budget,revenue,runtime,vote_count,vote_average,release_year
count,9773.0,9773.0,9773.0,9773.0,9773.0,9773.0,9773.0
mean,0.694711,16181080.0,44226780.0,102.925509,239.298782,5.96343,2000.879362
std,1.036879,32209390.0,122583400.0,27.876224,602.982068,0.913179,13.036453
min,0.000188,0.0,0.0,0.0,10.0,1.5,1960.0
25%,0.232756,0.0,0.0,90.0,18.0,5.4,1994.0
50%,0.419765,200000.0,0.0,100.0,46.0,6.0,2005.0
75%,0.77638,19400000.0,31042040.0,112.0,173.0,6.6,2011.0
max,32.985763,425000000.0,2781506000.0,877.0,9767.0,8.7,2015.0


In [17]:
# Check for duplicated rows

df.duplicated().sum()

1

In [18]:
# Inspect the duplicated row before deleting it
df[df.duplicated()]

Unnamed: 0,popularity,budget,revenue,original_title,cast,director,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year
2090,0.59643,30000000,967000,TEKKEN,Jon Foo|Kelly Overton|Cary-Hiroyuki Tagawa|Ian...,Dwight H. Little,92,Crime|Drama|Action|Thriller|Science Fiction,Namco|Light Song Films,2010-03-20,110,5.0,2010


In [19]:
df.query("original_title == 'TEKKEN'")

Unnamed: 0,popularity,budget,revenue,original_title,cast,director,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year
2089,0.59643,30000000,967000,TEKKEN,Jon Foo|Kelly Overton|Cary-Hiroyuki Tagawa|Ian...,Dwight H. Little,92,Crime|Drama|Action|Thriller|Science Fiction,Namco|Light Song Films,2010-03-20,110,5.0,2010
2090,0.59643,30000000,967000,TEKKEN,Jon Foo|Kelly Overton|Cary-Hiroyuki Tagawa|Ian...,Dwight H. Little,92,Crime|Drama|Action|Thriller|Science Fiction,Namco|Light Song Films,2010-03-20,110,5.0,2010


In [20]:
# Drop the duplicated row
df.drop_duplicates(inplace=True)

In [21]:
df.head(1).T

Unnamed: 0,0
popularity,32.9858
budget,150000000
revenue,1513528810
original_title,Jurassic World
cast,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...
director,Colin Trevorrow
runtime,124
genres,Action|Adventure|Science Fiction|Thriller
production_companies,Universal Studios|Amblin Entertainment|Legenda...
release_date,2015-06-09 00:00:00


In [22]:
df.dtypes

popularity                     float64
budget                           int64
revenue                          int64
original_title                  object
cast                            object
director                        object
runtime                          int64
genres                          object
production_companies            object
release_date            datetime64[ns]
vote_count                       int64
vote_average                   float64
release_year                     int64
dtype: object

In [32]:
df.release_date.describe()

count                    9772
unique                   5602
top       2009-01-01 00:00:00
freq                       24
first     1969-01-01 00:00:00
last      2068-12-22 00:00:00
Name: release_date, dtype: object

In [33]:
# Something is wrong with the dates: it is 2019, yet some movies are dated 2068!
# Let's look at that row.

In [39]:
df.query('release_date == "2068-12-22 00:00:00"')

Unnamed: 0,popularity,budget,revenue,original_title,cast,director,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year
9725,0.757746,0,0,The Love Bug,Dean Jones|Michele Lee|Buddy Hackett|Joe Flynn...,Robert Stevenson,107,Comedy|Family|Fantasy,Walt Disney Productions,2068-12-22,62,5.8,1968


In [40]:
# This row shows the mistake: some dates before 2000 were parsed into the wrong
# century (1968 became 2068). The correct year is in the release_year column.
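
A likely cause, assuming the raw CSV stores dates with a two-digit year such as `12/22/68` (the raw strings are not shown in this notebook), is the usual two-digit-year convention: values 00 to 68 are read as 2000 to 2068 and values 69 to 99 as 1969 to 1999. That matches the `first` (1969) and `last` (2068) values in the `describe()` output. The sketch below shows the convention with Python's standard `datetime.strptime`; pandas may reach the same result through a different code path.

```python
from datetime import datetime

# Assumption: raw dates look like "MM/DD/YY" with a two-digit year.
raw_dates = ["12/22/68", "01/01/69"]

# %y follows the POSIX pivot: 00-68 -> 20xx, 69-99 -> 19xx.
parsed = {raw: datetime.strptime(raw, "%m/%d/%y") for raw in raw_dates}

for raw, value in parsed.items():
    print(raw, "->", value.date())
```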

In [42]:
# Let's make sure
# df[['release_date','release_year']]

In [92]:
def FixDate(r_d, r_y):
    '''
    r_d = release_date
    r_y = release_year

    Replace the year of release_date with release_year
    and return the corrected date.
    '''
    # Month and day are kept as they are; only the year changes.
    return r_d.replace(year=int(r_y))

In [94]:
# Now fix the dates with the FixDate function
df['release_date'] = df[['release_date', 'release_year']].apply(
    lambda x: FixDate(x['release_date'], x['release_year']), axis=1)
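
As an alternative to the row-by-row `apply`, the same correction can be sketched in a vectorized form by rebuilding each date from its parts. This is only an illustration on a hypothetical toy frame, not the method used in this notebook; it relies on `pd.to_datetime` accepting a DataFrame with `year`, `month`, and `day` columns.

```python
import pandas as pd

# Toy frame (hypothetical) with one date parsed into the wrong century.
toy = pd.DataFrame({
    "release_date": pd.to_datetime(["2068-12-22", "1977-03-20", "2015-06-09"]),
    "release_year": [1968, 1977, 2015],
})

# Take the year from release_year and keep the month and day from release_date.
parts = pd.DataFrame({
    "year": toy["release_year"],
    "month": toy["release_date"].dt.month,
    "day": toy["release_date"].dt.day,
})
toy["release_date"] = pd.to_datetime(parts)

# Sanity check: every date now agrees with its release_year.
all_years_match = bool((toy["release_date"].dt.year == toy["release_year"]).all())
print(toy)
print("all years match:", all_years_match)
```

The final check is also a quick way to confirm the fix on the full frame instead of inspecting a few rows by eye.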

In [99]:
# Check some dates before 2000
df.query('release_year < 2000').head()

Unnamed: 0,popularity,budget,revenue,original_title,cast,director,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year
1329,12.037933,11000000,775398007,Star Wars,Mark Hamill|Harrison Ford|Carrie Fisher|Peter ...,George Lucas,121,Adventure|Action|Science Fiction,Lucasfilm|Twentieth Century Fox Film Corporation,1977-03-20,4428,7.9,1977
1330,2.379469,14000000,185438673,The Spy Who Loved Me,Roger Moore|Barbara Bach|Curd Jürgens|Richard...,Lewis Gilbert,125,Adventure|Action|Thriller,Eon Productions|Metro-Goldwyn-Mayer (MGM)|Danjaq,1977-07-07,279,6.2,1977
1331,1.719385,1200000,71215869,The Rescuers,Bob Newhart|Eva Gabor|Geraldine Page|Joe Flynn...,John Lounsbery|Wolfgang Reitherman|Art Stevens,78,Fantasy|Family|Animation|Adventure,Walt Disney Productions,1977-06-22,332,6.6,1977
1332,1.179653,4000000,38251425,Annie Hall,Woody Allen|Diane Keaton|Tony Roberts|Carol Ka...,Woody Allen,93,Comedy|Drama|Romance,United Artists,1977-04-19,493,7.6,1977
1333,1.104997,10000000,0,Pete's Dragon,Sean Marshall|Helen Reddy|Mickey Rooney|Red Bu...,Don Chaffey,128,Fantasy|Animation|Comedy|Family,Walt Disney Productions,1977-11-03,113,6.4,1977


In [100]:
# ALL GOOD :)

In [103]:
# The cast, director, genres and production_companies columns hold several
# values separated by '|'. I want to separate them so the data is easier to use :)

# Step 1
# Collect all cast, director, genres and production_companies values in separate sets

In [None]:
ALL_CAST = set() 
ALL_director = set() 
ALL_genres = set() 
ALL_production_companies = set()
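
The cell that fills these sets is not shown in this notebook, so the following is only a sketch of one way Step 1 could be done, assuming each value in those columns is a string of names joined by `|` (as the `cast` and `genres` outputs suggest). The `collect_unique` helper and the toy frame are hypothetical.

```python
import pandas as pd

def collect_unique(series):
    """Return the set of distinct '|'-separated tokens in a string column."""
    unique = set()
    for cell in series.dropna():
        unique.update(part.strip() for part in cell.split("|"))
    return unique

# Toy frame (hypothetical) in the same pipe-separated layout as the dataset.
toy = pd.DataFrame({
    "genres": ["Action|Adventure", "Adventure|Comedy"],
    "director": ["Colin Trevorrow", "John Lounsbery|Wolfgang Reitherman"],
})

ALL_genres = collect_unique(toy["genres"])
ALL_director = collect_unique(toy["director"])

print("len(ALL_genres) =", len(ALL_genres))
print("len(ALL_director) =", len(ALL_director))
```

On the real frame the same helper would be called once per column, for example `collect_unique(df['cast'])`.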

In [151]:
print( 'len(ALL_CAST) = ' , len(ALL_CAST))

print("len(ALL_director) = ", len(ALL_director))

print("len(ALL_genres) = " , len(ALL_genres))

print("len(ALL_production_companies) = " , len(ALL_production_companies))

len(ALL_CAST) =  17124
len(ALL_director) =  4758
len(ALL_genres) =  20
len(ALL_production_companies) =  7842


<a id='eda'></a>
## Exploratory Data Analysis

> **Tip**: Now that you've trimmed and cleaned your data, you're ready to move on to exploration. Compute statistics and create visualizations with the goal of addressing the research questions that you posed in the Introduction section. It is recommended that you be systematic with your approach. Look at one variable at a time, and then follow it up by looking at relationships between variables.

### Research Question 1 (Replace this header name!)

In [None]:
# Use this, and more code cells, to explore your data. Don't forget to add
#   Markdown cells to document your observations and findings.


### Research Question 2  (Replace this header name!)

In [None]:
# Continue to explore the data to address your additional research
#   questions. Add more headers as needed if you have more questions to
#   investigate.


<a id='conclusions'></a>
## Conclusions

> **Tip**: Finally, summarize your findings and the results that have been performed. Make sure that you are clear with regards to the limitations of your exploration. If you haven't done any statistical tests, do not imply any statistical conclusions. And make sure you avoid implying causation from correlation!

> **Tip**: Once you are satisfied with your work, you should save a copy of the report in HTML or PDF form via the **File** > **Download as** submenu. Before exporting your report, check over it to make sure that the flow of the report is complete. You should probably remove all of the "Tip" quotes like this one so that the presentation is as tidy as possible. Congratulations!