# IMDb Movie Data Exploration using Descriptive Statistics

In [5]:
import pandas as pd

In [6]:
#Read URL into a dataframe

movies = pd.read_csv('http://bit.ly/imdbratings')

In [9]:
#Display first 10 rows to get a feel of what the table looks like
#head() returns 5 rows by default when called without an argument

movies.head(10)

| | star_rating | title | content_rating | genre | duration | actors_list |
|---|---|---|---|---|---|---|
| 0 | 9.3 | The Shawshank Redemption | R | Crime | 142 | [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt... |
| 1 | 9.2 | The Godfather | R | Crime | 175 | [u'Marlon Brando', u'Al Pacino', u'James Caan'] |
| 2 | 9.1 | The Godfather: Part II | R | Crime | 200 | [u'Al Pacino', u'Robert De Niro', u'Robert Duv... |
| 3 | 9.0 | The Dark Knight | PG-13 | Action | 152 | [u'Christian Bale', u'Heath Ledger', u'Aaron E... |
| 4 | 8.9 | Pulp Fiction | R | Crime | 154 | [u'John Travolta', u'Uma Thurman', u'Samuel L.... |
| 5 | 8.9 | 12 Angry Men | NOT RATED | Drama | 96 | [u'Henry Fonda', u'Lee J. Cobb', u'Martin Bals... |
| 6 | 8.9 | The Good, the Bad and the Ugly | NOT RATED | Western | 161 | [u'Clint Eastwood', u'Eli Wallach', u'Lee Van ... |
| 7 | 8.9 | The Lord of the Rings: The Return of the King | PG-13 | Adventure | 201 | [u'Elijah Wood', u'Viggo Mortensen', u'Ian McK... |
| 8 | 8.9 | Schindler's List | R | Biography | 195 | [u'Liam Neeson', u'Ralph Fiennes', u'Ben Kings... |
| 9 | 8.9 | Fight Club | R | Drama | 139 | [u'Brad Pitt', u'Edward Norton', u'Helena Bonh... |


In [21]:
#Display column labels of the dataframe
movies.columns

Index(['star_rating', 'title', 'content_rating', 'genre', 'duration',
       'actors_list'],
      dtype='object')

In [25]:
#What are the data types of each column? 
movies.dtypes

star_rating       float64
title              object
content_rating     object
genre              object
duration            int64
actors_list        object
dtype: object

In [10]:
movies.describe()

| | star_rating | duration |
|---|---|---|
| count | 979.0 | 979.0 |
| mean | 7.889785 | 120.979571 |
| std | 0.336069 | 26.21801 |
| min | 7.4 | 64.0 |
| 25% | 7.6 | 102.0 |
| 50% | 7.8 | 117.0 |
| 75% | 8.1 | 134.0 |
| max | 9.3 | 242.0 |


## Method: describe()

### Why did only two columns populate here? 
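By default, describe() summarizes only the numerical columns, which is why only star_rating and duration appear here. If a dataframe has no numerical columns at all, the output is not empty: describe() falls back to summarizing the non-numerical columns instead (count, unique, top, freq).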
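To see this default in action, here is a minimal sketch on a tiny made-up dataframe (the `sample` frame and its values are illustrative assumptions, not rows from the IMDb data). It compares describe() on a mixed frame with describe() on a frame that has only text columns.

```python
import pandas as pd

# Hypothetical mini-table shaped like the movies dataframe (values are made up)
sample = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C"],
    "genre": ["Drama", "Crime", "Drama"],
    "star_rating": [8.1, 7.9, 8.4],
    "duration": [120, 95, 142],
})

# Mixed frame: only the numerical columns are summarized
numeric_summary = sample.describe()
print(list(numeric_summary.columns))

# Text-only frame: describe() summarizes the text columns instead
text_only_summary = sample[["title", "genre"]].describe()
print(list(text_only_summary.index))
```

The first print lists only star_rating and duration, while the second shows the count, unique, top and freq rows used for non-numerical data.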

### What does count mean? 
Count is the number of non-null values in the column

### Mean
The mean, or average, is the sum of the numerical values in that column divided by the number of values (the count)

### Std
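Std is shorthand for standard deviation. The standard deviation measures the usual (or standard) amount of distance (or deviation) from the average. Strictly speaking, it is the square root of the average squared distance from the mean, not the plain average distance. pandas reports the sample version, which divides by count - 1 rather than count.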
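As a rough check of the mean and std definitions, the sketch below computes both by hand on a short made-up series (the `ratings` values are assumptions for illustration) and compares them with what pandas returns. It assumes the pandas default of dividing by count - 1 for the standard deviation.

```python
import math
import pandas as pd

# Made-up ratings, not taken from the IMDb dataset
ratings = pd.Series([7.5, 8.0, 8.5, 9.0])

n = ratings.count()                      # number of non-null values
mean_by_hand = ratings.sum() / n         # sum divided by count

# Sample standard deviation: square root of the summed squared
# deviations divided by (n - 1), matching the pandas default (ddof=1)
squared_dev = (ratings - mean_by_hand) ** 2
std_by_hand = math.sqrt(squared_dev.sum() / (n - 1))

print(mean_by_hand, ratings.mean())
print(std_by_hand, ratings.std())
```

Each pair of printed numbers should agree, which confirms that mean() and std() follow these formulas.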

### Min
Min is the smallest (or minimum) value in that column

### 25%
Also known as the first quartile: 25% of the column's values fall at or below it, so roughly 75% of the values are higher

### 50% 
Also known as the median: the middle value when the column's values are ordered from least to greatest

### 75%
Also known as the third quartile: 75% of the column's values fall at or below it, so roughly 25% of the values are higher

### Max
Max is the largest (or maximum) value in that column
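The min, quartile and max rows can also be pulled out one at a time. The following is a small sketch on a made-up series of durations (the `durations` values are assumptions, chosen so the quartiles land exactly on data points); it uses Series.quantile() and checks that the result matches the corresponding describe() row.

```python
import pandas as pd

# Nine made-up durations in minutes, already sorted for readability
durations = pd.Series([90, 100, 110, 120, 130, 140, 150, 160, 170])

# quantile() accepts a list of fractions and returns one value per fraction
q1, median, q3 = durations.quantile([0.25, 0.5, 0.75])

summary = durations.describe()
print(durations.min(), q1, median, q3, durations.max())
print(summary["25%"], summary["50%"], summary["75%"])
```

On real data the quartiles often fall between two values, in which case pandas interpolates between them by default.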

## Observations

There are **979** non-null values in both the star_rating and the duration column

The rest of the descriptive statistics are written out below. This isn't necessary for regular practice but may help you find an interesting aspect from the numbers.


Average star_rating = 7.89 | Average duration = 121 minutes

Standard deviation for star_rating = 0.34 | Standard deviation for duration = 26 minutes

Minimum star_rating = 7.4 | Minimum duration = 64 minutes

First quartile star_rating = 7.6 | First quartile duration = 102 minutes

Median star_rating = 7.8 | Median duration = 117 minutes

Third quartile star_rating = 8.1 | Third quartile duration = 134 minutes

Maximum star_rating = 9.3 | Maximum duration = 242 minutes

### Quick Glance
The minimum star_rating from the data frame is 7.4. Therefore, no movie received less than 7.4 on their star rating.
No movie received higher than a 9.3 star rating. 

The shortest movie was 64 minutes in duration while the longest movie was 242 minutes. 

75% of the movies received a star_rating of 7.6 or higher

75% of the movies run 102 minutes or longer, and at least half run under 2 hours (the median is 117 minutes)

Star ratings typically fall within about 0.34 of the average, so the ratings are tightly clustered and the average is a fair summary of a typical movie on this list
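Statements about "what share of movies is above X" can be checked directly instead of being read off the quartiles. Here is a minimal sketch using a made-up series (the `durations` values and the 120-minute threshold are assumptions for illustration); taking the mean of a boolean comparison gives the fraction of rows where it is true.

```python
import pandas as pd

# Five made-up durations in minutes
durations = pd.Series([95, 105, 115, 125, 150])

# A comparison yields True/False per row; the mean of booleans is the share of True
share_over_two_hours = (durations > 120).mean()
print(share_over_two_hours)
```

Applied to the real table, the same pattern would be `(movies['duration'] > 120).mean()`.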

### Are there any null or missing values in those columns? 

Comparing the total number of rows with the count from describe() tells us whether those columns have any null values, because count only includes non-null values

We can get the number of rows from the shape attribute

In [16]:
movies.shape

(979, 6)

## Attribute: Shape

Here, there are two numbers formatted similarly to an (x,y) coordinate

The first number, 979, refers to the number of rows in the movies dataframe

The second number, 6, refers to the number of columns
#### Therefore, since the dataframe has 979 rows and describe() counted 979 non-null values in each, there are no null or missing values in the star_rating or duration columns
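A more direct way to count missing values is isnull().sum(), which reports the number of nulls per column. The sketch below uses a small made-up frame with one missing content rating (the `sample` data is an assumption for illustration) and shows that it agrees with the rows-minus-count comparison used above.

```python
import pandas as pd

# Made-up frame with one missing content_rating
sample = pd.DataFrame({
    "star_rating": [8.1, 7.9, 8.4],
    "content_rating": ["R", None, "PG-13"],
})

# Number of null values in each column
missing_per_column = sample.isnull().sum()

# Same answer via the shape-style comparison: total rows minus non-null count
rows_minus_count = len(sample) - sample.count()

print(missing_per_column)
print(rows_minus_count)
```

Running `movies.isnull().sum()` on the full dataframe would cover every column at once, including the non-numerical ones.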

## What about the other non-numerical columns?

We only populated statistics for the numerical columns 

Let's take a look at the other columns with the dtype = 'object' 

This requires passing the include argument when calling describe()

In [31]:
movies.describe(include='object')

| | title | content_rating | genre | actors_list |
|---|---|---|---|---|
| count | 979 | 976 | 979 | 979 |
| unique | 975 | 12 | 16 | 969 |
| top | True Grit | R | Drama | [u'Daniel Radcliffe', u'Emma Watson', u'Rupert... |
| freq | 2 | 460 | 278 | 6 |


## Object Columns Explained

**Title**
    
- No missing or null rows

- 975 unique titles (only 4 rows repeat an earlier title)

- "True Grit" is the most frequent title

- ^ Appears twice in the dataset
    
    
 **Content Rating**
 
- Missing ratings for 3 rows

- 12 unique ratings

- Rated "R" is the most frequent rating

- ^ 460 movies are rated "R"

**Genre**

- No missing or null rows

- 16 unique genres

- Drama is the most frequent genre

- ^ 278 movies are in the Drama genre

**Actors List**

- No missing or null rows

- 969 unique values, so 10 rows repeat an earlier list (979 - 969 = 10)
  (Keep in mind that this column has a list in each row)

- The list starting with Daniel Radcliffe, Emma Watson, Rupert Grint is the most frequent value

- ^ Appears in 6 rows (6 movies)
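The unique, top and freq rows can be reproduced with nunique() and value_counts(). Below is a minimal sketch on a made-up genre column (the `genres` values are assumptions for illustration) showing how each row of the object summary is derived.

```python
import pandas as pd

# Made-up genre column
genres = pd.Series(["Drama", "Crime", "Drama", "Action", "Drama", "Crime"])

# value_counts() sorts from most to least frequent
counts = genres.value_counts()

n_unique = genres.nunique()   # the "unique" row
top_value = counts.index[0]   # the "top" row
top_freq = counts.iloc[0]     # the "freq" row

print(n_unique, top_value, top_freq)
```

When several values tie for the highest frequency, describe() still shows only one of them as top, so value_counts() is the safer way to see the full picture.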


**Assumption**

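From the 10 repeated rows in the actors_list column (979 - 969 = 10):
- 5 of them are repeats of the most frequent list, which appears 6 times and looks like the Harry Potter franchise
- The other 5 are repeats I haven't identified yet

I can make the Harry Potter assumption because there are 8 movies in the HP franchise with the same lead actors, 
so the actor list can easily be repeated multiple times. The two True Grit rows share a title, but that alone doesn't tell me they share an actor list.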
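An assumption like this can be tested rather than guessed by listing the rows whose actor list appears more than once. The following sketch uses a made-up frame (the `sample` titles and actor strings are assumptions for illustration) and DataFrame.duplicated() with keep=False, which flags every member of a repeated group.

```python
import pandas as pd

# Made-up frame: three rows share the same actors_list string
sample = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C", "Film D"],
    "actors_list": ["[u'X', u'Y']", "[u'X', u'Y']", "[u'Z']", "[u'X', u'Y']"],
})

# keep=False marks all rows in a repeated group, not just the later ones
repeats = sample[sample.duplicated("actors_list", keep=False)]
print(repeats["title"].tolist())

# Default keep='first' counts only rows beyond the first occurrence,
# which equals total rows minus the number of unique values
n_repeat_rows = sample["actors_list"].duplicated().sum()
print(n_repeat_rows)
```

On the real data, `movies[movies.duplicated('actors_list', keep=False)]` would show exactly which titles share a cast list.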

## Conclusion

That's all for today, folks! This was just a quick and easy way to dip my toes back in after a few months of not practicing Python. In my next post, I'll be answering some questions that popped up while doing this tiny project. 

## Questions to Answer

1. What are the top 10 star rated movies?

2. Of the top 100, which genre received the highest star ratings?

3. What is the least popular content rating?

4. What are the bottom 10 star rated movies?

5. What is the most popular content rating?

6. What is the least popular genre?

7. What movie has the longest duration in minutes?

8. What genre has the highest average duration in minutes?

9. Which actor appears in the most movies on this list?

10. Which actor had the highest average star rating?

11. What is the most popular genre?

12. Is there a trend between star rating and the other columns in the dataframe?

13. What percentage of the whole dataset does each genre make up?