# Project 1: Explanatory Data Analysis & Data Presentation (Movies Dataset)

# Project Brief for Self-Coders

Here you'll have the opportunity to code major parts of Project 1 on your own. If you need help or inspiration, have a look at the videos or the Jupyter Notebook with the full code.

Keep in mind that it's all about __getting the right results/conclusions__, not about producing identical code. Things can be coded in many different ways, so even if you reach the same conclusions, it's very unlikely that your code will match ours line for line.

## Data Import and first Inspection

1. __Import__ the movies dataset from the CSV file "movies_complete.csv". __Inspect__ the data.

__Some additional information on Features/Columns__:

* **id:** The ID of the movie (clear/unique identifier).
* **title:** The Official Title of the movie.
* **tagline:** The tagline of the movie.
* **release_date:** Theatrical Release Date of the movie.
* **genres:** Genres associated with the movie.
* **belongs_to_collection:** Gives information on the movie series/franchise the particular film belongs to.
* **original_language:** The language in which the movie was originally shot.
* **budget_musd:** The budget of the movie in million dollars.
* **revenue_musd:** The total revenue of the movie in million dollars.
* **production_companies:** Production companies involved with the making of the movie.
* **production_countries:** Countries where the movie was shot/produced.
* **vote_count:** The number of votes by users, as counted by TMDB.
* **vote_average:** The average rating of the movie.
* **popularity:** The Popularity Score assigned by TMDB.
* **runtime:** The runtime of the movie in minutes.
* **overview:** A brief blurb of the movie.
* **spoken_languages:** Spoken languages in the film.
* **poster_path:** The URL of the poster image.
* **cast:** (Main) Actors appearing in the movie.
* **cast_size:** Number of actors appearing in the movie.
* **director:** Director of the movie.
* **crew_size:** Size of the film crew (incl. director, excl. actors).

In [1]:
import pandas as pd

movies = pd.read_csv('./movies_complete.csv', parse_dates=['release_date'])
movies

Unnamed: 0,id,title,tagline,release_date,genres,belongs_to_collection,original_language,budget_musd,revenue_musd,production_companies,...,vote_average,popularity,runtime,overview,spoken_languages,poster_path,cast,cast_size,crew_size,director
0,862,Toy Story,,1995-10-30,Animation|Comedy|Family,Toy Story Collection,en,30.0,373.554033,Pixar Animation Studios,...,7.7,21.946943,81.0,"Led by Woody, Andy's toys live happily in his ...",English,<img src='http://image.tmdb.org/t/p/w185//uXDf...,Tom Hanks|Tim Allen|Don Rickles|Jim Varney|Wal...,13,106,John Lasseter
1,8844,Jumanji,Roll the dice and unleash the excitement!,1995-12-15,Adventure|Fantasy|Family,,en,65.0,262.797249,TriStar Pictures|Teitler Film|Interscope Commu...,...,6.9,17.015539,104.0,When siblings Judy and Peter discover an encha...,English|Français,<img src='http://image.tmdb.org/t/p/w185//vgpX...,Robin Williams|Jonathan Hyde|Kirsten Dunst|Bra...,26,16,Joe Johnston
2,15602,Grumpier Old Men,Still Yelling. Still Fighting. Still Ready for...,1995-12-22,Romance|Comedy,Grumpy Old Men Collection,en,,,Warner Bros.|Lancaster Gate,...,6.5,11.712900,101.0,A family wedding reignites the ancient feud be...,English,<img src='http://image.tmdb.org/t/p/w185//1FSX...,Walter Matthau|Jack Lemmon|Ann-Margret|Sophia ...,7,4,Howard Deutch
3,31357,Waiting to Exhale,Friends are the people who let you be yourself...,1995-12-22,Comedy|Drama|Romance,,en,16.0,81.452156,Twentieth Century Fox Film Corporation,...,6.1,3.859495,127.0,"Cheated on, mistreated and stepped on, the wom...",English,<img src='http://image.tmdb.org/t/p/w185//4wjG...,Whitney Houston|Angela Bassett|Loretta Devine|...,10,10,Forest Whitaker
4,11862,Father of the Bride Part II,Just When His World Is Back To Normal... He's ...,1995-02-10,Comedy,Father of the Bride Collection,en,,76.578911,Sandollar Productions|Touchstone Pictures,...,5.7,8.387519,106.0,Just when George Banks has recovered from his ...,English,<img src='http://image.tmdb.org/t/p/w185//lf9R...,Steve Martin|Diane Keaton|Martin Short|Kimberl...,12,7,Charles Shyer
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
44686,439050,Subdue,Rising and falling between a man and woman,NaT,Drama|Family,,fa,,,,...,4.0,0.072051,90.0,Rising and falling between a man and woman.,فارسی,<img src='http://image.tmdb.org/t/p/w185//pfC8...,Leila Hatami|Kourosh Tahami|Elham Korda,3,9,Hamid Nematollah
44687,111109,Century of Birthing,,2011-11-17,Drama,,tl,,,Sine Olivia,...,9.0,0.178241,360.0,An artist struggles to finish his work while a...,,<img src='http://image.tmdb.org/t/p/w185//xZkm...,Angel Aquino|Perry Dizon|Hazel Orencio|Joel To...,11,6,Lav Diaz
44688,67758,Betrayal,A deadly game of wits.,2003-08-01,Action|Drama|Thriller,,en,,,American World Pictures,...,3.8,0.903007,90.0,"When one of her hits goes wrong, a professiona...",English,<img src='http://image.tmdb.org/t/p/w185//eGga...,Erika Eleniak|Adam Baldwin|Julie du Page|James...,15,5,Mark L. Lester
44689,227506,Satan Triumphant,,1917-10-21,,,en,,,Yermoliev,...,,0.003503,87.0,"In a small town live two brothers, one a minis...",,<img src='http://image.tmdb.org/t/p/w185//aorB...,Iwan Mosschuchin|Nathalie Lissenko|Pavel Pavlo...,5,2,Yakov Protazanov
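
Beyond displaying the DataFrame, a first inspection usually covers column types, missing values, and summary statistics. The following is a minimal sketch on a three-row stand-in: the `sample` frame is an assumption built from a few preview values above, and with the real data the same methods would be called on `movies`.

```python
import pandas as pd

# Tiny stand-in for the movies table (values copied from the preview above;
# the real data comes from movies_complete.csv).
sample = pd.DataFrame({
    "title": ["Toy Story", "Jumanji", "Grumpier Old Men"],
    "release_date": pd.to_datetime(["1995-10-30", "1995-12-15", "1995-12-22"]),
    "budget_musd": [30.0, 65.0, None],
    "revenue_musd": [373.554033, 262.797249, None],
})

sample.info()                  # column dtypes and non-null counts
print(sample.describe())       # summary statistics
missing = sample.isna().sum()  # number of missing values per column
print(missing)
```

Counting missing values early is useful here because several later rankings depend on `budget_musd` and `revenue_musd`, both of which contain gaps.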


## The best and the worst movies...

2. __Filter__ the Dataset and __find the best/worst n Movies__ with the

- Highest Revenue
- Highest Budget
- Highest Profit (=Revenue - Budget)
- Lowest Profit (=Revenue - Budget)
- Highest Return on Investment (=Revenue / Budget) (only movies with Budget >= 10 million USD)
- Lowest Return on Investment (=Revenue / Budget) (only movies with Budget >= 10 million USD)
- Highest number of Votes
- Highest Rating (only movies with 10 or more Ratings)
- Lowest Rating (only movies with 10 or more Ratings)
- Highest Popularity

__Define__ an appropriate __user-defined function__ to reuse code.

__Movies Top 5 - Highest Revenue__

In [None]:
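def Filter(df: pd.DataFrame, on: pd.Series | str, asc: bool = True, top: int | None = 5, apply=None):
    """Return the rows of df ordered by `on` (a column name or a Series), limited to the first `top` rows."""
    # Look up the column in the DataFrame that was passed in, not in the global `movies`
    s = df[on] if isinstance(on, str) else on
    if apply is not None:
        # Series.apply returns a new Series, so the result has to be kept
        s = s.apply(apply)
    # Missing values sort last by default, so they never lead the ranking
    idx = s.sort_values(ascending=asc).index
    ranked = df.loc[idx]
    return ranked.head(top) if top else ranked

Filter(movies, movies['revenue_musd'], asc=False, top=5)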

__Movies Top 5 - Highest Budget__

In [None]:
Filter(movies, 'budget_musd', asc=False, top=5)

__Movies Top 5 - Highest Profit__

In [None]:
# Highest profit first, so sort in descending order
Filter(movies, movies['revenue_musd'] - movies['budget_musd'], asc=False, top=5)

__Movies Top 5 - Lowest Profit__

In [None]:
# Profit = revenue - budget; ascending order puts the largest losses first
Filter(movies, movies['revenue_musd'] - movies['budget_musd'], asc=True, top=5)

__Movies Top 5 - Highest ROI__

In [None]:
Filter(movies, movies['revenue_musd'] / movies[movies['budget_musd'] >= 10]['budget_musd'], asc=False, top=5)

__Movies Top 5 - Lowest ROI__

In [None]:
Filter(movies, movies['revenue_musd'] / movies[movies['budget_musd'] >= 10]['budget_musd'], top=5)
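
The two ROI cells above rely on index alignment: dividing the full revenue column by a filtered budget column leaves `NaN` for every movie outside the filter, and `sort_values` places `NaN` last by default. A minimal sketch of that behaviour, using a made-up four-row frame (`toy` and its values are assumptions for illustration only):

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "budget_musd": [5.0, 20.0, 50.0, 100.0],
    "revenue_musd": [500.0, 60.0, 50.0, 400.0],
})

big_budget = toy["budget_musd"] >= 10
# Division aligns on the index: row 0 has no partner on the right-hand
# side, so its ROI becomes NaN even though its raw ratio would be 100.
roi = toy["revenue_musd"] / toy.loc[big_budget, "budget_musd"]

# sort_values puts NaN last by default, so the filtered-out row cannot
# reach the top of either ranking.
top_roi = toy.loc[roi.sort_values(ascending=False).index, "title"].tolist()
print(top_roi)
```

Calling `roi.dropna()` before sorting would make the exclusion explicit instead of relying on the sort order of missing values.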

__Movies Top 5 - Most Votes__

In [None]:
Filter(movies, 'vote_count', asc=False, top=5)

__Movies Top 5 - Highest Rating__

In [None]:
Filter(movies, movies[movies['vote_count'] >= 10]['vote_average'], asc=False, top=5)

__Movies Top 5 - Lowest Rating__

In [None]:
Filter(movies, movies[movies['vote_count'] >= 10]['vote_average'], top=5)

__Movies Top 5 - Most Popular__

In [None]:
Filter(movies, 'popularity', asc=False, top=5)

## Find your next Movie

3. __Filter__ the Dataset for movies that meet the following conditions:

__Search 1: Science Fiction Action Movie with Bruce Willis (sorted from high to low Rating)__

__Search 2: Movies with Uma Thurman and directed by Quentin Tarantino (sorted from short to long runtime)__

__Search 3: Most Successful Pixar Studio Movies between 2010 and 2015 (sorted from high to low Revenue)__

__Search 4: Action or Thriller Movie with original language English and minimum Rating of 7.5 (most recent movies first)__

In [None]:
# na=False treats movies with missing cast/genres as "no match"
bruce_willis = movies['cast'].str.contains('Bruce Willis', na=False)
sci_fi = movies['genres'].str.contains('Science Fiction', na=False) & movies['genres'].str.contains('Action', na=False)
movies[bruce_willis & sci_fi].sort_values('vote_average', ascending=False)

In [None]:
uma_thurman = movies['cast'].str.contains('Uma Thurman', na=False)
quentin_tarantino = movies['director'] == 'Quentin Tarantino'
movies[uma_thurman & quentin_tarantino].sort_values('runtime')

In [None]:
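pixar = movies['production_companies'].str.contains('Pixar', na=False)
# Cover both boundary years in full; '2015' alone would stop at 2015-01-01
time_period = movies['release_date'].between('2010-01-01', '2015-12-31')
# Highest revenue first
movies[pixar & time_period].sort_values('revenue_musd', ascending=False)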

In [None]:
# '|' inside the pattern means OR (regex); na=False treats missing genres as "no match"
action_or_thriller = movies['genres'].str.contains('Action|Thriller', na=False)
eng = movies['original_language'] == 'en'
minimum_rating = movies['vote_average'] >= 7.5

movies[eng & minimum_rating & action_or_thriller].sort_values('release_date', ascending=False)
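
All four searches build boolean masks with `str.contains` on pipe-separated text columns. Those columns contain missing values, so it can help to pass `na=False` to get a clean boolean mask. A minimal sketch under that assumption, on a made-up frame (`toy` is hypothetical, not part of the dataset):

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "genres": ["Action|Science Fiction", "Drama", None, "Thriller|Crime"],
})

# na=False turns a missing genre into False, so the mask is strictly boolean.
action = toy["genres"].str.contains("Action", na=False)
# With the default regex=True, "|" inside the pattern means OR.
action_or_thriller = toy["genres"].str.contains("Action|Thriller", na=False)

print(toy[action_or_thriller]["title"].tolist())
```

Because the pattern is a regular expression by default, names containing characters such as `.` or `(` would need `regex=False` to be matched literally.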

## Are Franchises more successful?

4. __Analyze__ the Dataset and __find out whether Franchises (Movies that belong to a collection) are more successful than stand-alone movies__ in terms of:

- mean revenue
- median Return on Investment
- mean budget raised
- mean popularity
- mean rating

hint: use groupby()
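
One way to follow the hint is to group by a boolean flag that marks franchise movies, so that both groups are compared at the movie level. The sketch below uses a made-up frame; the names `toy`, `is_franchise`, and `roi` are assumptions for illustration only.

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "belongs_to_collection": ["Saga X", "Saga X", None, None],
    "revenue_musd": [100.0, 300.0, 50.0, None],
    "budget_musd": [50.0, 100.0, 25.0, 10.0],
})

# Hypothetical flag: True for franchise movies, False for stand-alone ones.
is_franchise = toy["belongs_to_collection"].notna().rename("franchise")
roi = (toy["revenue_musd"] / toy["budget_musd"]).rename("roi")

# Grouping by the flag compares movie-level values in both groups;
# missing values are skipped by mean() and median().
mean_revenue = toy.groupby(is_franchise)["revenue_musd"].mean()
median_roi = roi.groupby(is_franchise).median()
print(mean_revenue)
print(median_roi)
```

Grouping by the collection name instead would yield one row per franchise, which answers a different question (how individual franchises compare with each other).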

__Franchise vs. Stand-alone: Average Revenue__

In [2]:
collection_groups = movies.groupby('belongs_to_collection')

In [None]:
not_collection = movies['belongs_to_collection'].isna()

In [None]:
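# Mean revenue per franchise movie; averaging per-franchise totals would not be comparable with per-movie means
mean_revenue_collection = movies[~not_collection]['revenue_musd'].mean()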
mean_revenue_not_collection = movies[not_collection]['revenue_musd'].mean()
mean_revenue_not_collection > mean_revenue_collection

__Franchise vs. Stand-alone: Return on Investment / Profitability (median)__

In [None]:
# Movie-level ROI on both sides, so the two medians are comparable
return_on_invest_collection = movies[~not_collection]['revenue_musd'] / movies[~not_collection]['budget_musd']
return_on_invest_not_collection = movies[not_collection]['revenue_musd'] / movies[not_collection]['budget_musd']
return_on_invest_not_collection.median() > return_on_invest_collection.median()

__Franchise vs. Stand-alone: Average Budget__

In [None]:
# Mean budget per franchise movie, not the mean of per-franchise totals
budget_collection = movies[~not_collection]['budget_musd'].mean()
budget_not_collection = movies[not_collection]['budget_musd'].mean()
budget_not_collection > budget_collection 

__Franchise vs. Stand-alone: Average Popularity__

In [None]:
# Mean popularity per franchise movie, not the mean of per-franchise totals
popularity_collection = movies[~not_collection]['popularity'].mean()
popularity_not_collection = movies[not_collection]['popularity'].mean()
popularity_not_collection > popularity_collection

__Franchise vs. Stand-alone: Average Rating__

In [None]:
# Mean rating per franchise movie, consistent with the stand-alone side
rating_collection = movies[~not_collection]['vote_average'].mean()
rating_not_collection = movies[not_collection]['vote_average'].mean()
rating_not_collection > rating_collection

## Most Successful Franchises

5. __Find__ the __most successful Franchises__ in terms of

- __total number of movies__
- __total & mean budget__
- __total & mean revenue__
- __mean rating__

In [9]:
total_num_movies = collection_groups['title'].count()  # non-null titles per franchise
total_revenue = collection_groups['revenue_musd'].sum().sort_index()
total_budget = collection_groups['budget_musd'].sum().sort_index()
mean_revenue = collection_groups['revenue_musd'].mean().sort_index()
mean_budget = collection_groups['budget_musd'].mean().sort_index()
mean_rating = collection_groups['vote_average'].mean().sort_index()

most_successful_Franchises = pd.DataFrame({'total_num_movies' : total_num_movies,
                                            'total_budget' : total_budget,
                                            'total_revenue' : total_revenue,
                                            'mean_revenue' : mean_revenue,
                                            'mean_budget' : mean_budget,
                                            'mean_rating' : mean_rating})

most_successful_Franchises.sort_values('total_revenue', ascending=False)

belongs_to_collection,total_num_movies,total_budget,total_revenue,mean_revenue,mean_budget,mean_rating
Harry Potter Collection,8,1280.00,7707.367425,963.420928,160.000000,7.537500
Star Wars Collection,8,854.35,7434.494790,929.311849,106.793750,7.375000
James Bond Collection,26,1539.65,7106.970239,273.345009,59.217308,6.338462
The Fast and the Furious Collection,8,1009.00,5125.098793,640.637349,126.125000,6.662500
Pirates of the Caribbean Collection,5,1250.00,4521.576826,904.315365,250.000000,6.880000
...,...,...,...,...,...,...
Les Profs,1,12.00,0.000000,,12.000000,5.400000
Les Mystères de l'ouest (Collection),2,0.00,0.000000,,,5.700000
Les Charlots - Saga,1,0.00,0.000000,,,6.500000
Les Boys,1,0.00,0.000000,,,5.400000
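
The tail of the table above shows totals of 0.000000 next to missing means. A likely reason is that a grouped `sum()` returns 0 for a group whose values are all missing, while `mean()` returns `NaN`. The sketch below illustrates this on made-up data (`toy` and the collection names are assumptions) and shows `min_count=1` as one way to keep such totals missing.

```python
import pandas as pd

toy = pd.DataFrame({
    "belongs_to_collection": ["Saga X", "Saga X", "Saga Y"],
    "revenue_musd": [100.0, 300.0, None],
})
groups = toy.groupby("belongs_to_collection")["revenue_musd"]

default_total = groups.sum()            # all-missing group sums to 0.0
strict_total = groups.sum(min_count=1)  # all-missing group stays NaN
mean_revenue = groups.mean()            # NaN for the all-missing group
print(pd.DataFrame({"default_total": default_total,
                    "strict_total": strict_total,
                    "mean_revenue": mean_revenue}))
```

Whether an unknown total should appear as 0 or as missing is a presentation choice. Keeping it missing avoids suggesting that a franchise earned nothing.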


## Most Successful Directors

6. __Find__ the __most successful Directors__ in terms of

- __total number of movies__
- __total revenue__
- __mean rating__

In [14]:
directors = movies.groupby('director')
total_num_movies = directors['title'].count()  # non-null titles per director
total_revenue = directors['revenue_musd'].sum().sort_index()
mean_rating = directors['vote_average'].mean().sort_index()

most_successful_directors = pd.DataFrame({'total_num_movies' : total_num_movies,
                                          'total_revenue' : total_revenue,
                                          'mean_rating' : mean_rating})

most_successful_directors.sort_values('total_revenue', ascending=False)

director,total_num_movies,total_revenue,mean_rating
Steven Spielberg,33,9256.621422,6.893939
Peter Jackson,13,6528.244659,7.138462
Michael Bay,13,6437.466781,6.392308
James Cameron,11,5900.610310,6.927273
David Yates,9,5334.563196,6.700000
...,...,...,...
Héctor Olivera,4,0.000000,4.725000
Hélène Fillières,1,0.000000,4.400000
Héléna Klotz,1,0.000000,4.800000
I. Robert Levy,1,0.000000,4.000000
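
The per-director summary above can also be written as a single named aggregation, which builds all columns in one `agg` call. This is a sketch of an alternative on made-up data (`toy`, the director names, and the values are assumptions), not the notebook's own approach.

```python
import pandas as pd

toy = pd.DataFrame({
    "director": ["Dir A", "Dir A", "Dir B"],
    "title": ["M1", "M2", "M3"],
    "revenue_musd": [100.0, 50.0, 400.0],
    "vote_average": [6.0, 8.0, 7.5],
})

# Each keyword becomes an output column: name=(source column, aggregation).
summary = toy.groupby("director").agg(
    total_num_movies=("title", "count"),
    total_revenue=("revenue_musd", "sum"),
    mean_rating=("vote_average", "mean"),
)
ranked = summary.sort_values("total_revenue", ascending=False)
print(ranked)
```

Since mean ratings of directors with a single movie can dominate a ranking by `mean_rating`, filtering on `total_num_movies` first may give a more meaningful comparison.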
