# Pandas review

This notebook serves as a review of the first four sections of [Chapter 3](https://jakevdp.github.io/PythonDataScienceHandbook/03.00-introduction-to-pandas.html) of the Python Data Science Handbook.

We're going to be using a dataset about movies to try out processing some data with Pandas.

We start with some standard imports:

In [None]:
import pandas as pd
import numpy as np

Then we load the data from a local file and check out the first few rows:

In [None]:
df = pd.read_csv('movies_metadata.csv').dropna(axis=1, how='all')
df.head()
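
The `dropna(axis=1, how='all')` call above removes only those columns in which every value is missing. Here is a minimal sketch of that behaviour on a made-up frame (the `toy` data is hypothetical and not taken from the movie file):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: column 'c' is entirely missing, 'b' only partly.
toy = pd.DataFrame({
    'a': [1, 2, 3],
    'b': [np.nan, 5.0, 6.0],
    'c': [np.nan, np.nan, np.nan],
})

# axis=1 works on columns; how='all' drops a column only if every value is NaN.
cleaned = toy.dropna(axis=1, how='all')
print(list(cleaned.columns))
```

Column `b` survives because it still holds some data; only the completely empty column `c` is dropped.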

## Exploring the data

This dataset was obtained from [Kaggle](https://www.kaggle.com/rounakbanik/the-movies-dataset/home), where it was compiled through the TMDB API.

The movies in this dataset correspond to the movies listed in the MovieLens Latest Full Dataset.

Let's see what data we have:

In [None]:
df.shape

Twenty-three columns of data for over 45,000 movies is going to be a lot to look at, but let's start by looking at what the columns represent:

In [None]:
df.columns

Here's an explanation of each column:
- __belongs_to_collection__: A stringified dictionary that identifies the collection that a movie belongs to (if any).
- __budget__: The budget of the movie in dollars.
- __genres__: A stringified list of dictionaries that list out all the genres associated with the movie.
- __homepage__: The Official Homepage of the movie.
- __id__: An arbitrary ID for the movie.
- __imdb_id__: The IMDB ID of the movie.
- __original_language__: The language in which the movie was filmed.
- __original_title__: The title of the movie in its original language.
- __overview__: A blurb of the movie.
- __popularity__: The Popularity Score assigned by TMDB.
- __poster_path__: The URL of the poster image (relative to http://image.tmdb.org/t/p/w185/).
- __production_companies__: A stringified list of production companies involved with the making of the movie.
- __production_countries__: A stringified list of countries where the movie was filmed or produced.
- __release_date__: Theatrical release date of the movie.
- __revenue__: World-wide revenue of the movie in dollars.
- __runtime__: Duration of the movie in minutes.
- __spoken_languages__: A stringified list of spoken languages in the film.
- __status__: Released, To Be Released, Announced, etc.
- __tagline__: The tagline of the movie.
- __title__: The official title of the movie.
- __video__: Indicates if there is a video present of the movie with TMDB.
- __vote_average__: The average rating of the movie on TMDB.
- __vote_count__: The number of votes by users, as counted by TMDB.
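
Several of the columns above are described as stringified lists or dictionaries. One possible way to turn such a string back into Python objects is `ast.literal_eval` from the standard library. The sketch below assumes the strings look like Python literals; the sample string `raw_genres` and the helper `genre_names` are hypothetical and only illustrate the idea:

```python
import ast

# Hypothetical example of a stringified list of dictionaries,
# assumed to be formatted like a Python literal.
raw_genres = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"

def genre_names(text):
    """Return the genre names stored in a stringified list of dictionaries."""
    parsed = ast.literal_eval(text)  # evaluates Python literals only, not arbitrary code
    return [entry['name'] for entry in parsed]

print(genre_names(raw_genres))
```

A function like this could be applied to a whole column with `Series.apply`, provided missing or malformed entries are handled first.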

## Filtering the data 

Let's start by looking only at films that lost money.

Let's create a variable called `money_loser_df` that contains all columns for the movies whose revenue was less than their budget.

In [None]:
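# Boolean mask: keep rows where revenue is below budget; copy so later column assignments are safe
money_loser_df = df[df['revenue'] < df['budget']].copy()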
print(money_loser_df.shape)
money_loser_df.head()

That's more than 5000 movies that lost money! Clearly a risky business.
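
The filter above works because comparing two columns yields a boolean Series that is then used as a mask. The sketch below shows this on made-up data. It also assumes, purely for illustration, that a numeric column might load as text (which can happen when a CSV contains malformed rows) and coerces it with `pd.to_numeric` before comparing:

```python
import pandas as pd

# Hypothetical toy data; 'budget' is deliberately stored as text with one bad entry.
toy = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'budget': ['100', '50', 'not a number', '10'],
    'revenue': [80.0, 70.0, 5.0, 10.0],
})

# Coerce text to numbers; entries that cannot be parsed become NaN.
toy['budget'] = pd.to_numeric(toy['budget'], errors='coerce')

# The comparison yields a boolean Series; any comparison with NaN is False.
mask = toy['revenue'] < toy['budget']
losers = toy[mask]
print(losers['title'].tolist())
```

Rows with a missing budget are never counted as money losers here, because a comparison against `NaN` evaluates to `False`.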

Let's create a Series object called `vote_lookup` such that we are able to use a call to `vote_lookup['Dead Presidents']` to find the vote average of that movie.

In [None]:
vote_lookup = pd.Series(money_loser_df['vote_average'].values, index=money_loser_df['title'])
vote_lookup['Dead Presidents']
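
One thing to keep in mind with a title-indexed Series is that titles need not be unique. The sketch below uses made-up ratings (the `ratings` Series is hypothetical) to show that a unique label returns a scalar, while a repeated label returns a Series:

```python
import pandas as pd

# Hypothetical ratings; 'Hamlet' appears twice on purpose.
ratings = pd.Series([6.5, 7.1, 5.9], index=['Heat', 'Hamlet', 'Hamlet'])

single = ratings['Heat']      # unique label: a scalar
several = ratings['Hamlet']   # repeated label: a Series with both entries

print(single)
print(several)
print(ratings.index.is_unique)
```

Checking `index.is_unique` is a quick way to find out which of the two behaviours to expect.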

We can use the `startswith` predicate (available through the `str` accessor) to find all movies whose titles start with a particular string or letter. Combining `sort_index` with double-bracket notation (`[[]]`) lets us find the first movie that starts with a 'P' or the last one that starts with an 'R':

In [None]:
print(vote_lookup[vote_lookup.index.str.startswith('Star')])

In [None]:
# .iloc selects by position explicitly, independent of the string labels
print(vote_lookup[vote_lookup.index.str.startswith('P')].sort_index().iloc[[0]])
print(vote_lookup[vote_lookup.index.str.startswith('R')].sort_index().iloc[[-1]])

Note that `iloc` with a single integer instead of a list gives us only the value, not the index:

In [None]:
print(vote_lookup[vote_lookup.index.str.startswith('P')].sort_index().iloc[0])
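
The difference between a single integer and a one-element list is easy to see on a small example. This is a sketch on made-up ratings (the `ratings` Series is hypothetical):

```python
import pandas as pd

# Hypothetical ratings, sorted alphabetically by title.
ratings = pd.Series(
    [6.0, 7.5, 5.5],
    index=['Pi', 'Paper Moon', 'Psycho'],
).sort_index()

value_only = ratings.iloc[0]     # scalar: just the rating
with_label = ratings.iloc[[0]]   # one-row Series: title and rating together

print(value_only)
print(with_label)
```

After sorting, 'Paper Moon' comes first, so the scalar form yields only its rating while the list form keeps the title alongside it.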

We can even do slices using strings:

In [None]:
# Label slice on the sorted index: every title from "P2" through "Ryna" in string order
vote_lookup_p_to_r = vote_lookup.sort_index()["P2":"Ryna"]
vote_lookup_p_to_r
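
The slice endpoints do not have to be actual titles. On a sorted index, a label slice selects everything that falls between the two strings, and an endpoint that does exist is included. A small sketch on made-up ratings (the `ratings` Series is hypothetical):

```python
import pandas as pd

# Hypothetical ratings; sorting the index is what makes label slicing work here.
ratings = pd.Series(
    [6.0, 7.5, 5.5, 8.0, 4.0],
    index=['Rocky', 'Paper Moon', 'Se7en', 'Psycho', 'Ran'],
).sort_index()

# Neither 'P2' nor 'Ryna' is a title; the slice still selects every
# label that lies between them in string order.
between = ratings['P2':'Ryna']
print(between)

# When the endpoints are real labels, both of them are included.
inclusive = ratings['Psycho':'Ran']
print(inclusive)
```

Unlike positional slicing, label-based slicing includes the end label, which is why 'Ran' appears in the second result.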

## Column Arithmetic

We can do arithmetic on columns as if they were just numbers:

In [None]:
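# assign returns a new frame with the extra column, avoiding chained-assignment warnings on a filtered frame
money_loser_df = money_loser_df.assign(loss=money_loser_df['budget'] - money_loser_df['revenue'])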
money_loser_df.head()
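
Column arithmetic is applied row by row, and a missing input produces a missing result. A minimal sketch on made-up numbers (the `toy` frame is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical budgets and revenues; the last budget is missing.
toy = pd.DataFrame({
    'budget': [100.0, 50.0, np.nan],
    'revenue': [80.0, 70.0, 20.0],
})

# Subtraction is element-wise; NaN in either operand gives NaN in the result.
toy = toy.assign(loss=toy['budget'] - toy['revenue'])
print(toy)
```

A negative `loss` (second row) means the film actually made money, so the sign of the result matters when interpreting it.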

## Merging

Frequently data comes from different sources and has to be merged into a single data frame. For example, let's say that I have some notes about some of these movies that I want to merge:

In [None]:
my_notes_dict = {
    "Cutthroat Island": "Has one of my favorite stunts",
    "The Neverending Story III: Escape from Fantasia": "Too many sequels here",
    "Bio-Dome": "First Pauly Shore movie I ever saw",
    "The Empire Strikes Back": "My favorite in the SW series",
    "Mighty Aphrodite": "Features Helena Bonham Carter",
}
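# Turn the dictionary into a one-column DataFrame indexed by title
my_notes = pd.DataFrame(pd.Series(my_notes_dict), columns=['my_notes'])
# Copy the index into a regular column so it can serve as the merge key
my_notes['title'] = my_notes.index
# Inner join on 'title'; keep only the columns of interest
pd.merge(my_notes, money_loser_df, on='title')[["title", "my_notes", "loss"]]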
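
By default `pd.merge` performs an inner join, so any note whose title is absent from the other frame silently drops out of the result. The sketch below, on made-up frames (`notes` and `movies` are hypothetical), contrasts the default with `how='left'` and `indicator=True`, which keep every note and record where each row came from:

```python
import pandas as pd

# Hypothetical frames sharing a 'title' column.
notes = pd.DataFrame({
    'title': ['Film A', 'Film B'],
    'my_notes': ['great stunt', 'too long'],
})
movies = pd.DataFrame({
    'title': ['Film A', 'Film C'],
    'loss': [10, 30],
})

# Default inner join: only titles present in both frames survive.
inner = pd.merge(notes, movies, on='title')

# Left join keeps every note; the _merge column shows which rows found a match.
left = pd.merge(notes, movies, on='title', how='left', indicator=True)

print(inner)
print(left)
```

In the left join, 'Film B' is kept with a missing `loss` and is flagged as `left_only`, which makes unmatched notes easy to spot.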