# Project 1: Explanatory Data Analysis & Data Presentation (Movies Dataset)

# Project Brief for Self-Coders

Here you'll have the opportunity to code major parts of Project 1 on your own. If you need help or inspiration, have a look at the videos or the Jupyter Notebook with the full code.

Keep in mind that it's all about __getting the right results/conclusions__, not about writing identical code. Things can be coded in many different ways, so even if you come to the same conclusions, it's very unlikely that your code will match ours exactly.

## Data Import and first Inspection

1. __Import__ the movies dataset from the CSV file "movies_complete.csv". __Inspect__ the data.

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
 

__Some additional information on Features/Columns__:

In [8]:
df= pd.read_csv('C:/Users/Ajaz/Python/alexander/Project_01_Materials/movies_complete.csv')

In [5]:
import os
print(os.getcwd())

C:\Users\Ajaz\Python\alexander\Project_01_Materials


In [10]:
df.head(1)

Unnamed: 0,id,title,tagline,release_date,genres,belongs_to_collection,original_language,budget_musd,revenue_musd,production_companies,...,vote_average,popularity,runtime,overview,spoken_languages,poster_path,cast,cast_size,crew_size,director
0,862,Toy Story,,1995-10-30,Animation|Comedy|Family,Toy Story Collection,en,30.0,373.554033,Pixar Animation Studios,...,7.7,21.946943,81.0,"Led by Woody, Andy's toys live happily in his ...",English,<img src='http://image.tmdb.org/t/p/w185//uXDf...,Tom Hanks|Tim Allen|Don Rickles|Jim Varney|Wal...,13,106,John Lasseter


In [15]:
df.describe()

| | id | budget_musd | revenue_musd | vote_count | vote_average | popularity | runtime | cast_size | crew_size |
|---|---|---|---|---|---|---|---|---|---|
| count | 44691.0 | 8854.0 | 7385.0 | 44691.0 | 42077.0 | 44691.0 | 43179.0 | 44691.0 | 44691.0 |
| mean | 107186.242845 | 21.669886 | 68.968649 | 111.653778 | 6.003341 | 2.95746 | 97.56685 | 12.47909 | 10.313643 |
| std | 111806.362236 | 34.359837 | 146.608966 | 495.322313 | 1.28106 | 6.040008 | 34.653409 | 12.124663 | 15.892154 |
| min | 2.0 | 1e-06 | 1e-06 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 25% | 26033.5 | 2.0 | 2.40542 | 3.0 | 5.3 | 0.402038 | 86.0 | 6.0 | 2.0 |
| 50% | 59110.0 | 8.2 | 16.872671 | 10.0 | 6.1 | 1.150055 | 95.0 | 10.0 | 6.0 |
| 75% | 154251.0 | 25.0 | 67.642693 | 35.0 | 6.8 | 3.768882 | 107.0 | 15.0 | 12.0 |
| max | 469172.0 | 380.0 | 2787.965087 | 14075.0 | 10.0 | 547.488298 | 1256.0 | 313.0 | 435.0 |


In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44691 entries, 0 to 44690
Data columns (total 22 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   id                     44691 non-null  int64  
 1   title                  44691 non-null  object 
 2   tagline                20284 non-null  object 
 3   release_date           44657 non-null  object 
 4   genres                 42586 non-null  object 
 5   belongs_to_collection  4463 non-null   object 
 6   original_language      44681 non-null  object 
 7   budget_musd            8854 non-null   float64
 8   revenue_musd           7385 non-null   float64
 9   production_companies   33356 non-null  object 
 10  production_countries   38835 non-null  object 
 11  vote_count             44691 non-null  float64
 12  vote_average           42077 non-null  float64
 13  popularity             44691 non-null  float64
 14  runtime                43179 non-null  float64
 15  ov

* **id:** The ID of the movie (clear/unique identifier).
* **title:** The Official Title of the movie.
* **tagline:** The tagline of the movie.
* **release_date:** Theatrical Release Date of the movie.
* **genres:** Genres associated with the movie.
* **belongs_to_collection:** Gives information on the movie series/franchise the particular film belongs to.
* **original_language:** The language in which the movie was originally shot.
* **budget_musd:** The budget of the movie in million dollars.
* **revenue_musd:** The total revenue of the movie in million dollars.
* **production_companies:** Production companies involved with the making of the movie.
* **production_countries:** Countries where the movie was shot/produced.
* **vote_count:** The number of votes by users, as counted by TMDB.
* **vote_average:** The average rating of the movie.
* **popularity:** The Popularity Score assigned by TMDB.
* **runtime:** The runtime of the movie in minutes.
* **overview:** A brief blurb of the movie.
* **spoken_languages:** Spoken languages in the film.
* **poster_path:** The URL of the poster image.
* **cast:** (Main) Actors appearing in the movie.
* **cast_size:** Number of actors appearing in the movie.
* **director:** Director of the movie.
* **crew_size:** Size of the film crew (incl. director, excl. actors).

In [18]:
df.columns

Index(['id', 'title', 'tagline', 'release_date', 'genres',
       'belongs_to_collection', 'original_language', 'budget_musd',
       'revenue_musd', 'production_companies', 'production_countries',
       'vote_count', 'vote_average', 'popularity', 'runtime', 'overview',
       'spoken_languages', 'poster_path', 'cast', 'cast_size', 'crew_size',
       'director'],
      dtype='object')

In [21]:
len(df)

44691

In [23]:
df.isna().sum()

id                           0
title                        0
tagline                  24407
release_date                34
genres                    2105
belongs_to_collection    40228
original_language           10
budget_musd              35837
revenue_musd             37306
production_companies     11335
production_countries      5856
vote_count                   0
vote_average              2614
popularity                   0
runtime                   1512
overview                   951
spoken_languages          3597
poster_path                224
cast                      2189
cast_size                    0
crew_size                    0
director                   731
dtype: int64

## The best and the worst movies...

2. __Filter__ the Dataset and __find the best/worst n Movies__ with the

- Highest Revenue
- Highest Budget
- Highest Profit (=Revenue - Budget)
- Lowest Profit (=Revenue - Budget)
- Highest Return on Investment (=Revenue / Budget) (only movies with Budget >= 10) 
- Lowest Return on Investment (=Revenue / Budget) (only movies with Budget >= 10)
- Highest number of Votes
- Highest Rating (only movies with 10 or more Ratings)
- Lowest Rating (only movies with 10 or more Ratings)
- Highest Popularity

__Define__ an appropriate __user-defined function__ to reuse code.
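
One possible shape for such a helper is sketched below. It is only an illustration under stated assumptions: the function name `top_n`, its parameters, and the toy DataFrame are hypothetical and not part of the project materials. Only the column names follow the dataset. On the real data the same function could be called with `df` instead of `toy`.

```python
import pandas as pd

# Toy data, made up for illustration (not taken from movies_complete.csv).
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "budget_musd": [5.0, 20.0, 50.0, 100.0],
    "revenue_musd": [50.0, 10.0, 200.0, 150.0],
    "vote_count": [5.0, 40.0, 300.0, 1000.0],
    "vote_average": [9.0, 4.0, 7.5, 6.0],
})

# Derived measures as defined in the task list above.
toy["profit_musd"] = toy["revenue_musd"] - toy["budget_musd"]
toy["roi"] = toy["revenue_musd"] / toy["budget_musd"]


def top_n(data, by, n=5, ascending=False, min_budget=0, min_votes=0):
    """Return the n highest (or lowest) movies ranked by column `by`."""
    # Missing budgets/vote counts count as 0, so they pass only if no threshold is set.
    mask = ((data["budget_musd"].fillna(0) >= min_budget)
            & (data["vote_count"].fillna(0) >= min_votes))
    return (data.loc[mask, ["title", by]]
                .dropna(subset=[by])  # rows without a value in `by` cannot be ranked
                .sort_values(by=by, ascending=ascending)
                .head(n))


highest_roi = top_n(toy, "roi", n=2, min_budget=10)
lowest_profit = top_n(toy, "profit_musd", n=1, ascending=True)
best_rated = top_n(toy, "vote_average", n=1, min_votes=10)
print(highest_roi)
```

Sorting returns a new DataFrame rather than changing `data`, and `dropna` is limited to the ranking column, so the original dataset keeps all of its rows.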

__Movies Top 5 - Highest Revenue__

In [36]:
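# Note: sort_values() returns a new DataFrame; the result is not assigned, so df stays unsorted.
df.sort_values(by=['revenue_musd'])
# Note: dropna() without a subset removes every row with a missing value in any column.
df.dropna(inplace=True)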

In [37]:
df.revenue_musd[:6]

9     352.194034
18    212.385533
20    115.101622
33    254.134910
43    122.195920
47    346.079773
Name: revenue_musd, dtype: float64

__Movies Top 5 - Highest Budget__

In [38]:
# Note: this sorts by revenue_musd rather than budget_musd, and the sorted result is not assigned.
df.sort_values(by=['revenue_musd'])
df.dropna(inplace=True)

In [39]:
df.budget_musd[:6]

9     58.00
18    30.00
20    30.25
33    30.00
43    18.00
47    55.00
Name: budget_musd, dtype: float64

__Movies Top 5 - Highest Profit__

In [43]:
profit = df.revenue_musd - df.budget_musd

In [72]:
# Sort from high to low and keep the top 5
profit.sort_values(ascending=False).head(5)

14448    2550.965087
26265    1823.223624
24812    1363.528810
28501    1316.249360
17669    1299.557910
dtype: float64

__Movies Top 5 - Lowest Profit__

In [73]:
# Profit stays Revenue - Budget; sort from low to high and keep the bottom 5
lowest_profit = (df.revenue_musd - df.budget_musd).sort_values(ascending=True).head(5)

In [74]:
lowest_profit

1625    -62.373766
4278    -51.868170
1363    -43.533912
13352   -37.235799
2665    -34.332107
dtype: float64

__Movies Top 5 - Highest ROI__

In [75]:
hroi = (df.revenue_musd / df.budget_musd).sort_values(ascending=True)

# Note: the "Budget >= 10" restriction from the task is not applied here yet.

In [77]:
hroi.sort_values()

8966     5.940000e-04
3128     5.078500e-03
14837    6.032333e-03
10019    8.383167e-03
12833    1.569914e-02
             ...     
9388     4.205227e+02
7723     4.396166e+02
2569     4.133333e+03
14093    1.289039e+04
2284     1.018619e+06
Length: 1131, dtype: float64

__Movies Top 5 - Lowest ROI__

In [None]:
# Keep only movies with Budget >= 10, then sort ROI from low to high
lroi = (df.revenue_musd / df.budget_musd)[df.budget_musd >= 10].sort_values(ascending=True)

__Movies Top 5 - Most Votes__

In [83]:
df.vote_count.sort_values(ascending = False).head(5)

12396    12269.0
14448    12114.0
17669    12000.0
26272    11444.0
23495    10014.0
Name: vote_count, dtype: float64

__Movies Top 5 - Highest Rating__

In [84]:
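# Note: ascending=True lists the lowest ratings first; the highest ratings are at the bottom of the output.
df.vote_average.sort_values(ascending=True)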

6665     2.8
9578     3.1
3963     3.5
10622    3.6
1587     3.8
        ... 
1144     8.2
12396    8.3
1168     8.3
1166     8.3
826      8.5
Name: vote_average, Length: 1131, dtype: float64

__Movies Top 5 - Lowest Rating__

__Movies Top 5 - Most Popular__

## Find your next Movie

3. __Filter__ the Dataset for movies that meet the following conditions:

__Search 1: Science Fiction Action Movie with Bruce Willis (sorted from high to low Rating)__

__Search 2: Movies with Uma Thurman and directed by Quentin Tarantino (sorted from short to long runtime)__

__Search 3: Most Successful Pixar Studio Movies between 2010 and 2015 (sorted from high to low Revenue)__

__Search 4: Action or Thriller Movie with original language English and minimum Rating of 7.5 (most recent movies first)__
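
Searches like these can be built from one boolean mask per condition. The sketch below shows the pattern on invented toy data. The titles, actor names, and date range are placeholders rather than values from the dataset, and the pipe-separated `genres`/`cast` format is assumed from the `df.head(1)` output above.

```python
import pandas as pd

# Toy data with invented titles and names (not taken from movies_complete.csv).
toy = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C", "Film D"],
    "genres": ["Action|Science Fiction", "Comedy",
               "Science Fiction|Action|Thriller", None],
    "cast": ["Actor X|Actor Y", "Actor X", "Actor X|Actor Z", "Actor X"],
    "release_date": ["2012-05-01", "2014-03-10", "2001-07-20", None],
    "vote_average": [6.5, 7.0, 8.0, 5.0],
})
# release_date is stored as text, so convert it before filtering by date.
toy["release_date"] = pd.to_datetime(toy["release_date"])

# One boolean mask per condition; na=False turns missing entries into False.
mask_genres = (toy["genres"].str.contains("Action", na=False)
               & toy["genres"].str.contains("Science Fiction", na=False))
mask_actor = toy["cast"].str.contains("Actor X", na=False)
mask_time = toy["release_date"].between("2010-01-01", "2015-12-31")

# Combine masks with & (and) or | (or), then sort the matches.
by_rating = (toy.loc[mask_genres & mask_actor, ["title", "vote_average"]]
                .sort_values(by="vote_average", ascending=False))
in_period = toy.loc[mask_actor & mask_time, "title"]
print(by_rating)
```

Keeping each condition in its own named mask makes it easy to reuse and recombine them across the four searches.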

## Are Franchises more successful?

4. __Analyze__ the Dataset and __find out whether Franchises (Movies that belong to a collection) are more successful than stand-alone movies__ in terms of:

- mean revenue
- median Return on Investment
- mean budget raised
- mean popularity
- mean rating

hint: use groupby()
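
As a rough sketch of the hint, assuming a movie counts as a franchise movie whenever `belongs_to_collection` is not missing: the toy values and the helper columns `kind` and `roi` below are made up for illustration.

```python
import pandas as pd

# Toy data, made up for illustration (not taken from movies_complete.csv).
toy = pd.DataFrame({
    "belongs_to_collection": ["Saga X", None, "Saga X", None, None],
    "budget_musd": [100.0, 10.0, 150.0, 20.0, float("nan")],
    "revenue_musd": [400.0, 15.0, 450.0, 10.0, float("nan")],
    "popularity": [20.0, 2.0, 30.0, 4.0, 1.0],
    "vote_average": [7.0, 6.0, 6.5, 5.5, 6.5],
})

# Label each movie: a non-missing collection means it is part of a franchise.
toy["kind"] = toy["belongs_to_collection"].notna().map(
    {True: "Franchise", False: "Stand-alone"})
toy["roi"] = toy["revenue_musd"] / toy["budget_musd"]

# Named aggregation: one output column per requested measure.
summary = toy.groupby("kind").agg(
    mean_revenue=("revenue_musd", "mean"),
    median_roi=("roi", "median"),
    mean_budget=("budget_musd", "mean"),
    mean_popularity=("popularity", "mean"),
    mean_rating=("vote_average", "mean"),
)
print(summary)
```

`mean` and `median` skip missing values by default, so movies without budget or revenue data still contribute to the popularity and rating columns.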

__Franchise vs. Stand-alone: Average Revenue__

__Franchise vs. Stand-alone: Return on Investment / Profitability (median)__

__Franchise vs. Stand-alone: Average Budget__

__Franchise vs. Stand-alone: Average Popularity__

__Franchise vs. Stand-alone: Average Rating__

## Most Successful Franchises

5. __Find__ the __most successful Franchises__ in terms of

- __total number of movies__
- __total & mean budget__
- __total & mean revenue__
- __mean rating__
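
One way to collect all of these measures in a single table is to group by the collection name. The following sketch uses invented franchises and numbers, and the output column names are arbitrary choices.

```python
import pandas as pd

# Toy data, made up for illustration (not taken from movies_complete.csv).
toy = pd.DataFrame({
    "title": ["X1", "X2", "X3", "Y1", "Y2", "Solo"],
    "belongs_to_collection": ["Saga X", "Saga X", "Saga X",
                              "Saga Y", "Saga Y", None],
    "budget_musd": [50.0, 60.0, 70.0, 100.0, 120.0, 10.0],
    "revenue_musd": [100.0, 150.0, 200.0, 500.0, 400.0, 30.0],
    "vote_average": [7.0, 6.0, 5.0, 8.0, 7.0, 9.0],
})

# groupby() drops rows with a missing key, so stand-alone movies are left out.
franchises = toy.groupby("belongs_to_collection").agg(
    n_movies=("title", "count"),
    total_budget=("budget_musd", "sum"),
    mean_budget=("budget_musd", "mean"),
    total_revenue=("revenue_musd", "sum"),
    mean_revenue=("revenue_musd", "mean"),
    mean_rating=("vote_average", "mean"),
)

# Rank the franchises by whichever measure is of interest.
most_movies = franchises.nlargest(1, "n_movies")
highest_total_revenue = franchises.nlargest(1, "total_revenue")
print(franchises)
```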

## Most Successful Directors

6. __Find__ the __most successful Directors__ in terms of

- __total number of movies__
- __total revenue__
- __mean rating__
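
For the directors, each measure can also be computed on its own. The sketch below uses invented directors and numbers. Restricting the mean rating to movies with at least 10 votes is an assumption carried over from the rating task earlier in this brief, not a requirement stated here.

```python
import pandas as pd

# Toy data, made up for illustration (not taken from movies_complete.csv).
toy = pd.DataFrame({
    "director": ["Dir A", "Dir A", "Dir B", "Dir B", "Dir B", "Dir C", None],
    "revenue_musd": [300.0, 200.0, 50.0, float("nan"), 100.0, 900.0, 10.0],
    "vote_average": [7.0, 8.0, 6.0, 6.5, 5.5, 9.5, 4.0],
    "vote_count": [500.0, 800.0, 50.0, 20.0, 30.0, 3.0, 100.0],
})

# Total number of movies per director (missing directors are ignored).
n_movies = toy["director"].value_counts()

# Total revenue per director; sum() skips missing revenues.
total_revenue = toy.groupby("director")["revenue_musd"].sum().nlargest(2)

# Mean rating over movies with at least 10 votes (assumed threshold),
# so a single barely-rated film cannot put a director on top.
rated = toy[toy["vote_count"] >= 10]
mean_rating = (rated.groupby("director")["vote_average"]
                    .mean()
                    .sort_values(ascending=False))
print(mean_rating)
```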