## ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png) Pandas for EDA
by [@josephofiowa](https://twitter.com/josephofiowa)
 
<!---
This assignment was developed by Joseph Nelson

Questions? Comments?
1. Log an issue to this repo to alert me of a problem.
2. Suggest an edit yourself by forking this repo, making edits, and submitting a pull request with your changes back to our master branch.
3. Hit me up on Slack @josephnelson
--->

# Pandas Unit Lab

**Woo!** We've made it to the end of our Pandas Unit. Let's put our skills to the test.

We're going to explore data from the top movies according to IMDB. This is a guided question-and-response lab where some areas are specific asks and others are open ended for you to explore.

In this lab, we will:
- Use `movie_app.py` to obtain relevant movie rating data
- Leverage Pandas to conduct exploratory data analysis, including:
    - Assess data integrity
    - Create exploratory visualizations
    - Produce insights on top actors/actresses across films
    
Let's get going!

## The Dataset

We'll work with a dataset on the top [100 movies](https://www.imdb.com/search/title?count=100&groups=top_1000&sort=user_rating), as rated by IMDB.


Specifically, we have a CSV that contains:
- IMDB star rating
- Movie title
- Year
- Content rating
- Genre
- Duration
- Gross

_[Details available at the above link]_


In [None]:
import pandas as pd
import numpy as np

### Read in the dataset

First, read the dataset, `imdb_100.csv`, into a DataFrame called `movies`.

The data is at the file path `./data/imdb_100.csv`.

In [None]:
movies = pd.read_csv('./data/imdb_100.csv')

## Check the dataset basics

Let's first explore our dataset to verify we have what we expect.

Print the first five rows.

How many rows and columns are in the dataset?

What are the column names?

How many unique genres are there?

How many movies are there per genre?
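
If you get stuck on the prompts above, here is a minimal sketch of the Pandas calls involved. It runs on a tiny made-up DataFrame rather than the real file, and the column names `genre` and `star_rating` are assumptions, so check `movies.columns` for the actual names.

```python
import pandas as pd

# Tiny stand-in for the real data; values and column names are made up.
sample = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "genre": ["Drama", "Crime", "Drama", "Action"],
    "star_rating": [9.3, 9.2, 9.0, 8.9],
})

print(sample.head())                    # first five rows (here, all four)

n_rows, n_cols = sample.shape           # (number of rows, number of columns)
column_names = list(sample.columns)     # column names as a plain list
n_genres = sample["genre"].nunique()    # count of distinct genres
movies_per_genre = sample["genre"].value_counts()  # movies per genre

print(n_rows, n_cols)
print(column_names)
print(n_genres)
print(movies_per_genre)
```

The same calls should work on `movies` once you swap in the real column names.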

## Obtain more data (with an API call)!

Let's take advantage of our `movie_app.py` program to obtain movie ratings data from the OMDb API. This will enable us to answer the question: **How do Rotten Tomatoes scores compare to IMDB ratings?** Where do Rotten Tomatoes critics most disagree with IMDB reviews?

We're going to write a function that allows us to obtain the Rotten Tomatoes ratings for the top 100 IMDB movies. We will store these ratings in a new column in our `movies` DataFrame.

First, we need to load `movie_app.py` into a Jupyter Notebook cell.

In [None]:
%load movie_app.py


Notice a few changes!

We've provided you with a new function, `return_single_movie_rating(movie_query)`, which is very similar to the `print_single_movie_rating(movie_query)` function you've written. The key difference is that our new function *returns* the Rotten Tomatoes value so we can store it.



Here's the new function:

```python
def return_single_movie_rating(movie_query):
    my_movie = return_single_movie_object(movie_query)
    # Return the rating. Note we are only returning the percentage.
    return(my_movie.get_movie_rating())
```

A slightly unintuitive note: you need to re-run the above cell after loading the script. Loading the script only pastes its code into the cell; it does not execute it. Running the cell again is what actually defines the functions.

Second, let's test out this new function by querying a single movie.

Before we use our script, you must fill in `omdb-api-key.txt` with the OMDb API key you used previously! This file already exists in the current directory, but it is empty. Open the file and paste in the API key that you were emailed.

In [None]:
# let's run the function on one of the world's best movies
movie_name = "Flubber"
return_single_movie_rating(movie_name)

Great! We have a function that returns the Rotten Tomatoes score for a given movie. 

Now, we need to pass the movies from our DataFrame to this function one-by-one, and store the result of this function in a list.

So, third, write a loop that prints out each movie title from your dataframe.
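
As a minimal sketch of the idea, iterating over a DataFrame column yields one value at a time. The DataFrame below is a made-up stand-in; on the real data you would loop over `movies.title` instead.

```python
import pandas as pd

# Made-up stand-in for the movies DataFrame.
sample = pd.DataFrame({"title": ["Movie A", "Movie B", "Movie C"]})

seen_titles = []
for name in sample["title"]:
    print(name)               # one title per loop iteration
    seen_titles.append(name)  # kept only so we can double-check the order
```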

Nice! So far, we are able to:

- Pass a movie name to a function (`return_single_movie_rating()`) to obtain a rating value
- Loop through all the titles in our `movies` DataFrame


Fourth: it's time to combine these steps, and store those ratings in a list!

In [None]:
# declare empty list to hold our ratings
rotten_ratings = list()

In [None]:
# loop through each movie title in the DataFrame, and pass that name to our function
# store the result of that function, using append(), in our rotten_ratings list
for name in movies.title:

    # try to find the Rotten Tomatoes rating
    try:
        rotten_ratings.append(return_single_movie_rating(name))

    # append a null if the lookup fails
    except Exception:
        rotten_ratings.append(np.nan)

**Great work!** (Note: a few movies will return `OMDB Error: Movie not found!`)

Fifth, and certainly not least, let's create a column to store that new list of data!

In [None]:
movies['rotten_rating'] = rotten_ratings

## Checking the basics [continued] and cleaning

Now that we have a new dataset, let's keep exploring.

Print the first five rows again, just to verify everything is looking ok.

Check for null values in all of your columns.

Check our datatypes. Notice anything potentially problematic?
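
A minimal sketch of both checks, run on a made-up DataFrame whose `rotten_rating` values mimic the percent strings and missing entries we expect from the API:

```python
import numpy as np
import pandas as pd

# Made-up stand-in: ratings arrive as strings like "91%", with a null
# where a movie was not found.
sample = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "rotten_rating": ["91%", np.nan, "87%"],
})

null_counts = sample.isnull().sum()   # number of nulls in each column
print(null_counts)

print(sample.dtypes)                  # one dtype per column
```

In this toy example `rotten_rating` is not reported as a numeric dtype, which is the hint that something needs cleaning.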

Because `rotten_rating` contains a % sign, the datatype is an object. We need to clean this up!

We need to strip the "%" sign from every entry in the `rotten_rating` column. This is a great opportunity to practice apply functions!

First, we'll grab a single entry from the relevant column in our DataFrame.

In [None]:
movies.rotten_rating[0]

Write code that strips this single entry of its % sign.
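
One possible approach, sketched on a hard-coded example value (the `"91%"` below is made up, not a real rating): Python's built-in `str.strip()` removes the given characters from both ends of a string.

```python
raw_value = "91%"                  # hypothetical entry from the column
stripped = raw_value.strip("%")    # remove the % sign from the ends
as_int = int(stripped)             # convert the remaining digits to an integer

print(stripped, as_int)
```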

Now, turn the above code into a function that accepts any given text (a single string **not** a list of strings) with a % sign, and returns that text formatted as an integer and without the % sign.

In [None]:
def clean_percents(string_with_percent):
    try:
        # convert the value to a string and strip the '%',
        # then convert the result to an integer and return it
        return int(str(string_with_percent).strip('%'))
    except ValueError:
        # anything that isn't a number (such as a null) stays null
        return np.nan

**Test** our function on a single value.

In [None]:
clean_percents(movies.rotten_rating[0])

Now, if we get the desired result, we can *apply* this function to the whole column of interest. We will store the result on top of the old column entries.

In [None]:
movies['rotten_rating'] = movies['rotten_rating'].apply(clean_percents)

Did it work?
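
Here is a sketch of what to expect, using a made-up Series and a copy of the cleaning function. One thing that may surprise you: even though the function returns integers, a column that also holds `np.nan` ends up with a float dtype, because NaN is itself a float.

```python
import numpy as np
import pandas as pd

def clean_percents(string_with_percent):
    # Same logic as the lab's function: strip '%' and convert to int,
    # falling back to a null when the value is not a number.
    try:
        return int(str(string_with_percent).strip("%"))
    except ValueError:
        return np.nan

ratings = pd.Series(["91%", np.nan, "87%"])   # made-up example values
cleaned = ratings.apply(clean_percents)

print(cleaned)
print(cleaned.dtype)
```

A numeric dtype for the cleaned column is the sign that the cleanup worked.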

In [None]:
movies.dtypes

## Exploratory data analysis

Let's transition to asking and answering some questions with our data.

What are the top five R-Rated movies?

*hint: Boolean filters needed!*
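
A minimal sketch of a Boolean filter on made-up data. The column names `content_rating` and `star_rating` are assumptions here; use whatever your `movies` DataFrame actually calls them.

```python
import pandas as pd

# Made-up stand-in for the movies DataFrame.
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E", "F", "G"],
    "content_rating": ["R", "PG-13", "R", "R", "R", "R", "R"],
    "star_rating": [9.3, 9.2, 9.0, 8.9, 8.8, 8.7, 8.6],
})

# A Boolean Series: True where the movie is rated R.
is_r_rated = sample["content_rating"] == "R"

# Keep only the True rows, sort by rating, and take the first five.
top_five_r = (
    sample[is_r_rated]
    .sort_values("star_rating", ascending=False)
    .head(5)
)
print(top_five_r)
```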

What is the average Rotten Tomatoes score for the top 100 IMDB films?

What does the five-number summary of IMDB ratings look like for these top-rated films? Is the distribution skewed?
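
One way to approach both summary questions, sketched on made-up numbers. The column name `star_rating` is an assumption. Note that `.mean()` skips nulls by default, and `.describe()` includes the five-number summary (min, quartiles, max) alongside the count, mean, and standard deviation.

```python
import numpy as np
import pandas as pd

# Made-up stand-in data, including one missing Rotten Tomatoes rating.
sample = pd.DataFrame({
    "star_rating": [8.1, 8.3, 8.5, 8.7, 8.9],
    "rotten_rating": [90.0, np.nan, 80.0, 100.0, 70.0],
})

avg_rotten = sample["rotten_rating"].mean()   # nulls are ignored

# Pull just the five-number summary out of describe().
summary = sample["star_rating"].describe()
five_number = summary.loc[["min", "25%", "50%", "75%", "max"]]

print(avg_rotten)
print(five_number)
print(sample["star_rating"].skew())           # near 0 suggests little skew
```

Comparing the mean with the median (`50%`) is another quick way to judge skew: a mean well above the median hints at a right skew, and a mean well below it hints at a left skew.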

Create your own question...then answer it!

**Challenge:** Create a column that holds the ratio of IMDB rating to Rotten Tomatoes rating. Which film has the highest IMDB : Rotten Tomatoes ratio? The lowest?

*[skip this if you are low on time]*
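
A minimal sketch of the challenge on made-up data, assuming the IMDB column is named `star_rating` (check your own column names). Dividing one column by another works element by element, and `idxmax()` / `idxmin()` return the row labels of the largest and smallest values.

```python
import pandas as pd

# Made-up stand-in for the movies DataFrame.
sample = pd.DataFrame({
    "title": ["A", "B", "C"],
    "star_rating": [9.0, 8.0, 8.5],
    "rotten_rating": [90.0, 50.0, 100.0],
})

# Element-wise division creates the new column.
sample["imdb_rotten_ratio"] = sample["star_rating"] / sample["rotten_rating"]

# Look up the titles with the largest and smallest ratios.
highest = sample.loc[sample["imdb_rotten_ratio"].idxmax(), "title"]
lowest = sample.loc[sample["imdb_rotten_ratio"].idxmin(), "title"]

print(highest, lowest)
```

Keep in mind that IMDB ratings run from 0 to 10 while Rotten Tomatoes scores run from 0 to 100, so the raw ratios will be small numbers.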

In [None]:
movies['imdb_rotten_ratio'] = 

## Exploratory data analysis with visualizations

For each of these prompts, create a plot to visualize the answer. Consider what plot is *most appropriate* to explore the given prompt.


What is the relationship between IMDB ratings and Rotten Tomatoes ratings?

What is the relationship between IMDB rating and movie duration?

How many movies are there in each category?

What does the distribution of Rotten Tomatoes ratings look like?
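
If you are unsure which plot fits which prompt, here is one possible pairing, sketched on made-up data: a scatter plot for the relationship between two numeric columns, a bar chart for counts per category, and a histogram for a distribution. The column names `star_rating` and `genre` are assumptions, and the sketch assumes `matplotlib` is installed, since Pandas plotting relies on it.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in for the movies DataFrame.
sample = pd.DataFrame({
    "star_rating": [9.3, 9.2, 9.0, 8.9, 8.8, 8.7],
    "rotten_rating": [91.0, 98.0, 94.0, 85.0, 79.0, 88.0],
    "genre": ["Drama", "Crime", "Drama", "Action", "Crime", "Drama"],
})

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

# Relationship between two numeric columns: scatter plot.
sample.plot.scatter(x="star_rating", y="rotten_rating", ax=axes[0])

# Count of movies per category: bar chart of value_counts().
sample["genre"].value_counts().plot.bar(ax=axes[1])

# Distribution of a single numeric column: histogram.
sample["rotten_rating"].plot.hist(bins=5, ax=axes[2])

fig.tight_layout()
```

The duration prompt follows the same scatter-plot pattern with a different column on one axis.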

## Bonus

There are many things left unexplored! Consider investigating something about actors and genres.