# Lesson 8 - Introduction to Python - Pandas Library for Data Exploration

## What is Pandas?

Pandas is one of the most widely used open source libraries for data exploration. It lets the user easily explore, manipulate, query, aggregate, and visualize tabular data, that is, data composed of rows and columns. Two primary objects, the Series and the DataFrame, account for nearly everything you will do with it.

## Pandas creator

Wes McKinney began building pandas in 2008 while working at the hedge fund AQR. If you are interested in the history, you can hear it from the creator [himself](https://www.youtube.com/watch?v=kHdkFyGCxiY).

## Resources used

* Pandas official documentation: https://pandas.pydata.org/pandas-docs/stable/getting_started/index.html
* Pandas Cookbook by Ted Petrou: https://subscription.packtpub.com/book/data/9781784393878/1/ch01lvl1sec03/dissecting-the-anatomy-of-a-dataframe
* 10 minutes to pandas: https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html

## Almost 200,000 questions are discussed on Stack Overflow

* https://stackoverflow.com/questions/tagged/pandas

## Pandas data structure: Series and DataFrame

The **Series** is a single column of data with an **index** that references each element. The index is **very** important in pandas and is what separates a Series from a NumPy array.

The **DataFrame** is a collection of Series (columns) and forms your normal concept of a table with rows and columns. Again, the **index** is very important. Both the rows and columns have an **index** that references them.

For now, think of the **index** as a set of labels that can reference a particular row or column of data.
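To make the idea of labels concrete, here is a minimal sketch that contrasts a NumPy array, which is addressed only by position, with a Series built from the same values. The variable names and the shortened label `Pirates` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative runtimes in minutes (hypothetical three-element sample).
runtimes = np.array([178.0, 169.0, 148.0])

# A NumPy array can only be addressed by integer position.
first_by_position = runtimes[0]

# A Series carries the same values plus an index of labels.
labelled = pd.Series(runtimes, index=["Avatar", "Pirates", "Spectre"])
first_by_label = labelled["Avatar"]

print(first_by_position, first_by_label)
```

Both lookups return the same value; the difference is that the Series lookup keeps working even if the rows are reordered.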

## Import pandas and read in some data
By convention pandas is imported and aliased as **pd**. We will read in the movie dataset with the **read_csv** function. We display the first five rows with the **head** method.

In [1]:
import pandas as pd
import numpy as np

#read data csv file and convert to pandas DataFrame object
df = pd.read_csv('data/movie.csv')
df.head()

,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
2,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
3,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
4,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,...,,,,,,,12.0,7.1,,0


In [2]:
type(df)

pandas.core.frame.DataFrame
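If `data/movie.csv` is not at hand, the same `read_csv` and `head` pattern can be tried on an in-memory CSV. The sketch below assumes a tiny three-row extract with only three of the columns; `csv_text` and `sample` are hypothetical names, not part of the lesson's dataset.

```python
import io
import pandas as pd

# Hypothetical extract; the real data/movie.csv has many more rows and columns.
csv_text = """movie_title,director_name,duration
Avatar,James Cameron,178
Spectre,Sam Mendes,148
John Carter,Andrew Stanton,132
"""

# read_csv accepts a file path or any file-like object, such as StringIO.
sample = pd.read_csv(io.StringIO(csv_text))

# head(n) returns the first n rows (n defaults to 5).
first_two = sample.head(2)
print(first_two)
```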

## Anatomy of a DataFrame

![](images/dataframe_anatomy.png)

The DataFrame is the most common object you will be working with during your analysis. It is important to understand all of its parts. A DataFrame has three main components: the **index**, the **columns** and the **data**.


Image is taken from: https://subscription.packtpub.com/book/data/9781784393878/1/ch01lvl1sec03/dissecting-the-anatomy-of-a-dataframe 

In [5]:
#get dataframe index
df.index

RangeIndex(start=0, stop=4916, step=1)

In [6]:
#get data
df.values

array([['Color', 'James Cameron', 723.0, ..., 7.9, 1.78, 33000],
       ['Color', 'Gore Verbinski', 302.0, ..., 7.1, 2.35, 0],
       ['Color', 'Sam Mendes', 602.0, ..., 6.8, 2.35, 85000],
       ...,
       ['Color', 'Benjamin Roberds', 13.0, ..., 6.3, nan, 16],
       ['Color', 'Daniel Hsia', 14.0, ..., 6.3, 2.35, 660],
       ['Color', 'Jon Gunn', 43.0, ..., 6.6, 1.85, 456]], dtype=object)

In [7]:
#get column names
df.columns

Index(['color', 'director_name', 'num_critic_for_reviews', 'duration',
       'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name',
       'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name',
       'movie_title', 'num_voted_users', 'cast_total_facebook_likes',
       'actor_3_name', 'facenumber_in_poster', 'plot_keywords',
       'movie_imdb_link', 'num_user_for_reviews', 'language', 'country',
       'content_rating', 'budget', 'title_year', 'actor_2_facebook_likes',
       'imdb_score', 'aspect_ratio', 'movie_facebook_likes'],
      dtype='object')

In [8]:
#check columns types
df.dtypes

color                         object
director_name                 object
num_critic_for_reviews       float64
duration                     float64
director_facebook_likes      float64
actor_3_facebook_likes       float64
actor_2_name                  object
actor_1_facebook_likes       float64
gross                        float64
genres                        object
actor_1_name                  object
movie_title                   object
num_voted_users                int64
cast_total_facebook_likes      int64
actor_3_name                  object
facenumber_in_poster         float64
plot_keywords                 object
movie_imdb_link               object
num_user_for_reviews         float64
language                      object
country                       object
content_rating                object
budget                       float64
title_year                   float64
actor_2_facebook_likes       float64
imdb_score                   float64
aspect_ratio                 float64
movie_facebook_likes           int64
dtype: object

In [9]:
df.head()

,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
2,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
3,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
4,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,...,,,,,,,12.0,7.1,,0


In [10]:
df.index.names

FrozenList([None])

In [11]:
#assign index name
df.index.names = ['index']
df.head(3)

index,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
2,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000


In [12]:
df.columns.names

FrozenList([None])

In [13]:
#assign columns name
df.columns.names = ['movie_columns']
df.head(3)

movie_columns,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
index,,,,,,,,,,,,,,,,,,,,,
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
2,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000


## Select a column from the dataframe

In [14]:
#select a column director_name 
s = df['director_name']
s = df.director_name
s

index
0           James Cameron
1          Gore Verbinski
2              Sam Mendes
3       Christopher Nolan
4             Doug Walker
              ...        
4911          Scott Smith
4912                  NaN
4913     Benjamin Roberds
4914          Daniel Hsia
4915             Jon Gunn
Name: director_name, Length: 4916, dtype: object

In [15]:
type(s)

pandas.core.series.Series

## Anatomy of a Series

![Series](images/series_anatomy.png)

Image is taken from: https://subscription.packtpub.com/book/data/9781784393878/1/ch01lvl1sec03/dissecting-the-anatomy-of-a-dataframe 

In [16]:
#get series indexes
s.index

RangeIndex(start=0, stop=4916, step=1, name='index')

In [18]:
#get the type of the series values
type(s.values)

numpy.ndarray

In [19]:
#get series name
s.name

'director_name'

In [20]:
#get series type
s.dtype

dtype('O')

## Series creation

Creating a Series by passing an array of values, letting pandas create a default integer index:

In [22]:
s = pd.Series(np.array([1, 3, 5, np.nan, 6, 8]))
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

Creating a Series with a specific index and name:

In [23]:
s = pd.Series(data = [1, 3, 5, np.nan, 6, 8], index = ['a','b','c','d','e','f'], name = 'series')
s

a    1.0
b    3.0
c    5.0
d    NaN
e    6.0
f    8.0
Name: series, dtype: float64

## Build dataframe from a dictionary

Creating a DataFrame by passing a dict of objects (key-value pairs). The dictionary keys become the column names and the values become the column contents; scalar values such as `1` and `"foo"` are repeated to fill every row.

In [24]:
[.5] * 4

[0.5, 0.5, 0.5, 0.5]

In [28]:
#create dataframe
df2 = pd.DataFrame({"A": 1,
                    "B": np.array([.5] * 4, dtype="float"),
                    "C": pd.Series(10, index=list(range(4)), dtype="float32"),
                    "D": np.array([3] * 4, dtype="int32"),
                    "E": pd.Categorical(["test", "train", "test", "train"]),
                    "F": "foo"})

df2

,A,B,C,D,E,F
0,1,0.5,10.0,3,test,foo
1,1,0.5,10.0,3,train,foo
2,1,0.5,10.0,3,test,foo
3,1,0.5,10.0,3,train,foo


## Setting an index in dataframe

Let's read the movie dataset again and replace the simple integer index with the title of the movie. Any column may be used for the index.

In [30]:
#read movie data again
df = pd.read_csv('data/movie.csv')
df.tail(2)

,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
4914,Color,Daniel Hsia,14.0,100.0,0.0,489.0,Daniel Henney,946.0,10443.0,Comedy|Drama|Romance,...,9.0,English,USA,PG-13,,2012.0,719.0,6.3,2.35,660
4915,Color,Jon Gunn,43.0,90.0,16.0,16.0,Brian Herzlinger,86.0,85222.0,Documentary,...,84.0,English,USA,PG,1100.0,2004.0,23.0,6.6,1.85,456


In [31]:
df.set_index('movie_title', inplace = True)
df.head(2)

movie_title,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
Avatar,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
Pirates of the Caribbean: At World's End,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
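As a side note, `set_index` does not have to modify the DataFrame in place: by default it returns a new DataFrame, and `read_csv` can also set the index directly through `index_col`. The sketch below uses a hypothetical two-row CSV held in memory rather than the real file.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for data/movie.csv.
csv_text = """movie_title,duration
Avatar,178
Spectre,148
"""

# Option 1: set_index returns a new DataFrame and leaves the original unchanged.
raw = pd.read_csv(io.StringIO(csv_text))
by_title = raw.set_index("movie_title")

# Option 2: choose the index column while reading the file.
by_title_direct = pd.read_csv(io.StringIO(csv_text), index_col="movie_title")

print(by_title.index.name, by_title_direct.index.name)
```

Returning a new object makes it easier to keep the original data around for comparison, which is one reason many pandas users avoid `inplace=True`.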


## Select duration column from movie dataframe

In [45]:
#assign duration movie column 
duration = df['duration']

#show the first 10 observations
duration.head(10)

movie_title
Avatar                                        178.0
Pirates of the Caribbean: At World's End      169.0
Spectre                                       148.0
The Dark Knight Rises                         164.0
Star Wars: Episode VII - The Force Awakens      NaN
John Carter                                   132.0
Spider-Man 3                                  156.0
Tangled                                       100.0
Avengers: Age of Ultron                       141.0
Harry Potter and the Half-Blood Prince        153.0
Name: duration, dtype: float64

## Selection by integer location with **`.iloc`**
To do integer location selection you must use **`.iloc`**. pandas calls this an **indexer**. The locations of the elements of a Series begin with 0 and end at n-1 where n is the number of rows. There are a few ways to use .iloc. You can pass a single integer, a slice or a list of integers. 

### Single integer

In [36]:
duration.iloc[5]

132.0

In [38]:
duration.iloc[-1]

90.0

### Slice 
You may use slice notation **`start:stop:step`** to select elements as well. 

In [39]:
duration.iloc[3:30:3]

movie_title
The Dark Knight Rises                          164.0
Spider-Man 3                                   156.0
Harry Potter and the Half-Blood Prince         153.0
Quantum of Solace                              106.0
Man of Steel                                   143.0
Pirates of the Caribbean: On Stranger Tides    136.0
The Amazing Spider-Man                         153.0
The Golden Compass                             113.0
Captain America: Civil War                     147.0
Name: duration, dtype: float64

### List of integers

In [40]:
duration.iloc[[10,12,27]]

movie_title
Batman v Superman: Dawn of Justice    183.0
Quantum of Solace                     106.0
Captain America: Civil War            147.0
Name: duration, dtype: float64

## Selection by index label with `.loc`
The **`.loc`** indexer selects data by accepting the label (or labels) of the index. Just like **`.iloc`**, **`.loc`** can accept a single label, a slice or a list.

In [41]:
duration.index

Index(['Avatar', 'Pirates of the Caribbean: At World's End', 'Spectre',
       'The Dark Knight Rises', 'Star Wars: Episode VII - The Force Awakens',
       'John Carter', 'Spider-Man 3', 'Tangled', 'Avengers: Age of Ultron',
       'Harry Potter and the Half-Blood Prince',
       ...
       'Primer', 'Cavite', 'El Mariachi', 'The Mongol King', 'Newlyweds',
       'Signed Sealed Delivered', 'The Following', 'A Plague So Pleasant',
       'Shanghai Calling', 'My Date with Drew'],
      dtype='object', name='movie_title', length=4916)

In [43]:
duration.loc['Avatar'], duration.iloc[0]

(178.0, 178.0)

In [44]:
duration.loc['Star Wars: Episode VII - The Force Awakens']

nan

#### KeyError
A KeyError will be raised if you try to access a label not in the index.

In [46]:
duration.loc['Movie']

KeyError: 'Movie'
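When a label may be missing, the exception can be avoided or handled. The following is a small sketch on a made-up two-element Series; it shows a membership test on the index, the `get` method with a default, and a `try`/`except` around `.loc`.

```python
import pandas as pd

# Hypothetical miniature version of the duration Series.
durations = pd.Series([178.0, 148.0], index=["Avatar", "Spectre"], name="duration")

label = "Movie"  # a label that is not in the index

# Option 1: test membership on the index before selecting.
is_present = label in durations.index

# Option 2: get returns a default instead of raising.
fallback = durations.get(label, default=None)

# Option 3: catch the KeyError raised by .loc.
try:
    durations.loc[label]
    raised = False
except KeyError:
    raised = True

print(is_present, fallback, raised)
```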

### Slicing with labels
It is possible to slice from one label to another. Pandas always **includes** the end label when slicing with labels.

In [47]:
duration.loc['The Dark Knight Rises':'Tangled']

movie_title
The Dark Knight Rises                         164.0
Star Wars: Episode VII - The Force Awakens      NaN
John Carter                                   132.0
Spider-Man 3                                  156.0
Tangled                                       100.0
Name: duration, dtype: float64

In [48]:
# slice with negative step
duration.loc['Tangled':'Spectre':-1]

movie_title
Tangled                                       100.0
Spider-Man 3                                  156.0
John Carter                                   132.0
Star Wars: Episode VII - The Force Awakens      NaN
The Dark Knight Rises                         164.0
Spectre                                       148.0
Name: duration, dtype: float64
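The inclusive end label is the main difference from positional slicing, where the stop position is excluded. A minimal sketch on a hypothetical four-element Series (with a shortened `Pirates` label) makes the contrast visible:

```python
import pandas as pd

# Hypothetical miniature version of the duration Series.
durations = pd.Series(
    [178.0, 169.0, 148.0, 164.0],
    index=["Avatar", "Pirates", "Spectre", "The Dark Knight Rises"],
)

# .iloc excludes the stop position: positions 0 and 1 only.
by_position = durations.iloc[0:2]

# .loc includes the end label: Avatar, Pirates and Spectre.
by_label = durations.loc["Avatar":"Spectre"]

print(len(by_position), len(by_label))
```

"Spectre" sits at position 2, so the two slices name the same end point but differ by one element.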

### Lists of labels

In [50]:
duration.index

Index(['Avatar', 'Pirates of the Caribbean: At World's End', 'Spectre',
       'The Dark Knight Rises', 'Star Wars: Episode VII - The Force Awakens',
       'John Carter', 'Spider-Man 3', 'Tangled', 'Avengers: Age of Ultron',
       'Harry Potter and the Half-Blood Prince',
       ...
       'Primer', 'Cavite', 'El Mariachi', 'The Mongol King', 'Newlyweds',
       'Signed Sealed Delivered', 'The Following', 'A Plague So Pleasant',
       'Shanghai Calling', 'My Date with Drew'],
      dtype='object', name='movie_title', length=4916)

In [52]:
duration.loc[['Avatar', "Pirates of the Caribbean: At World's End", 'Spectre',
               'The Dark Knight Rises', 'Star Wars: Episode VII - The Force Awakens',
               'John Carter', 'Spider-Man 3', 'Tangled', 'Avengers: Age of Ultron',
               'Harry Potter and the Half-Blood Prince']]

movie_title
Avatar                                        178.0
Pirates of the Caribbean: At World's End      169.0
Spectre                                       148.0
The Dark Knight Rises                         164.0
Star Wars: Episode VII - The Force Awakens      NaN
John Carter                                   132.0
Spider-Man 3                                  156.0
Tangled                                       100.0
Avengers: Age of Ultron                       141.0
Harry Potter and the Half-Blood Prince        153.0
Name: duration, dtype: float64

## Exercise #1
Create a 5 element pandas Series using the Series constructor with characters as the index and numbers as the values. Display the Series.

In [54]:
#your code
#series = pd.Series()
srs = pd.Series([1,2,3,4,99], index=['a','b','c','d','z'])
srs

a     1
b     2
c     3
d     4
z    99
dtype: int64

## Exercise #2
Create a dictionary with at least 5 elements and use it to create a series. Display the Series.

In [55]:
#your code
your_dict = {'a':1 , 'b':2, 'c': 3, 'd': 4, 'e': 5}
series = pd.Series(your_dict)
series

a    1
b    2
c    3
d    4
e    5
dtype: int64

## Exercise #3
Create a dictionary with at least 5 elements and use it to create a dataframe. Display the DataFrame.

In [62]:
#your code
your_dict = {'a':[1, 2] , 'b':[2, 3], 'c': [3, 4], 'd': [4, 5], 'e': [5, 6]}
df3 = pd.DataFrame(your_dict, index = ['11', '22'])
df3

,a,b,c,d,e
11,1,2,3,4,5
22,2,3,4,5,6


## Exercise #4
Use the **`read_csv`** function to read in the movie dataset (located at data/movie.csv) and set the index to the title of the movie. Output the first 10 rows.

In [63]:
#your code
df = pd.read_csv('data/movie.csv')
df.set_index('movie_title', inplace = True)
df.head(10)

movie_title,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
Avatar,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
Pirates of the Caribbean: At World's End,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
Spectre,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
The Dark Knight Rises,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
Star Wars: Episode VII - The Force Awakens,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,...,,,,,,,12.0,7.1,,0
John Carter,Color,Andrew Stanton,462.0,132.0,475.0,530.0,Samantha Morton,640.0,73058679.0,Action|Adventure|Sci-Fi,...,738.0,English,USA,PG-13,263700000.0,2012.0,632.0,6.6,2.35,24000
Spider-Man 3,Color,Sam Raimi,392.0,156.0,0.0,4000.0,James Franco,24000.0,336530303.0,Action|Adventure|Romance,...,1902.0,English,USA,PG-13,258000000.0,2007.0,11000.0,6.2,2.35,0
Tangled,Color,Nathan Greno,324.0,100.0,15.0,284.0,Donna Murphy,799.0,200807262.0,Adventure|Animation|Comedy|Family|Fantasy|Musi...,...,387.0,English,USA,PG,260000000.0,2010.0,553.0,7.8,1.85,29000
Avengers: Age of Ultron,Color,Joss Whedon,635.0,141.0,0.0,19000.0,Robert Downey Jr.,26000.0,458991599.0,Action|Adventure|Sci-Fi,...,1117.0,English,USA,PG-13,250000000.0,2015.0,21000.0,7.5,2.35,118000
Harry Potter and the Half-Blood Prince,Color,David Yates,375.0,153.0,282.0,10000.0,Daniel Radcliffe,25000.0,301956980.0,Adventure|Family|Fantasy|Mystery,...,973.0,English,UK,PG,250000000.0,2009.0,11000.0,7.5,2.35,10000


### Read from web

In [67]:
df_sp = pd.read_html('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')[0]
df_sp.head()

df_sp.to_csv('data/List_of_S&P_500_companies.csv',index=False)

## Boolean Indexing

In [69]:
#take only the first 5 elements from the duration series
duration_5 = duration.head(5)
duration_5

movie_title
Avatar                                        178.0
Pirates of the Caribbean: At World's End      169.0
Spectre                                       148.0
The Dark Knight Rises                         164.0
Star Wars: Episode VII - The Force Awakens      NaN
Name: duration, dtype: float64

### Use [] with boolean indexing

In [70]:
#select index to keep
keep = [True, False, True, False, False]
duration_5[keep]

movie_title
Avatar     178.0
Spectre    148.0
Name: duration, dtype: float64

### Creating boolean Series
You can compare each element with another value using the comparison operators `<`, `>`, `<=`, `>=`, `==` and `!=`. The result is a Series of booleans with the same index labels as the original. This result can then be used inside the indexing operator, as was done above.

In [71]:
duration_5

movie_title
Avatar                                        178.0
Pirates of the Caribbean: At World's End      169.0
Spectre                                       148.0
The Dark Knight Rises                         164.0
Star Wars: Episode VII - The Force Awakens      NaN
Name: duration, dtype: float64

In [72]:
duration_5 < 150

movie_title
Avatar                                        False
Pirates of the Caribbean: At World's End      False
Spectre                                        True
The Dark Knight Rises                         False
Star Wars: Episode VII - The Force Awakens    False
Name: duration, dtype: bool

In [73]:
#create criteria for boolean selection
criteria = duration_5 < 150

duration_5[criteria]

movie_title
Spectre    148.0
Name: duration, dtype: float64
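Note that the row with a missing duration came out as `False` in the comparison above. Comparisons with `NaN` evaluate to `False` in both directions, so missing values are silently left out of either selection. The sketch below, on a hypothetical three-element Series, shows this and uses `isna` to find the missing rows explicitly.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature Series with one missing duration.
durations = pd.Series(
    [178.0, 148.0, np.nan],
    index=["Avatar", "Spectre", "Star Wars"],
)

short = durations < 150       # NaN < 150 evaluates to False
not_short = durations >= 150  # NaN >= 150 also evaluates to False
missing = durations.isna()    # True only where the value is missing

print(durations[missing])
```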

In [74]:
#multiple boolean expressions
criteria = (duration_5 >= 150) & (duration_5 <= 170)
criteria


movie_title
Avatar                                        False
Pirates of the Caribbean: At World's End       True
Spectre                                       False
The Dark Knight Rises                          True
Star Wars: Episode VII - The Force Awakens    False
Name: duration, dtype: bool

In [75]:
duration_5[criteria]

movie_title
Pirates of the Caribbean: At World's End    169.0
The Dark Knight Rises                       164.0
Name: duration, dtype: float64
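The parentheses around each comparison matter: `&` binds more tightly than `>=` and `<=`, so without them Python would try to evaluate `150 & durations` first. For a closed range like this one, the `between` method is an alternative. A sketch on a hypothetical four-element Series:

```python
import pandas as pd

# Hypothetical miniature version of the duration Series.
durations = pd.Series(
    [178.0, 169.0, 148.0, 164.0],
    index=["Avatar", "Pirates", "Spectre", "The Dark Knight Rises"],
)

# Each comparison is wrapped in parentheses before combining with &.
with_operators = (durations >= 150) & (durations <= 170)

# between includes both endpoints by default.
with_between = durations.between(150, 170)

same = with_operators.equals(with_between)
print(durations[with_between])
```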

## Boolean Indexing with String Columns

In [78]:
df = pd.read_csv('data/movie.csv', index_col='movie_title')

In [79]:
#create series with main actor
actor1 = df['actor_1_name']
actor1.head()

movie_title
Avatar                                            CCH Pounder
Pirates of the Caribbean: At World's End          Johnny Depp
Spectre                                       Christoph Waltz
The Dark Knight Rises                               Tom Hardy
Star Wars: Episode VII - The Force Awakens        Doug Walker
Name: actor_1_name, dtype: object

In [80]:
actor1[actor1 == 'Johnny Depp'].head()

movie_title
Pirates of the Caribbean: At World's End       Johnny Depp
Pirates of the Caribbean: Dead Man's Chest     Johnny Depp
The Lone Ranger                                Johnny Depp
Pirates of the Caribbean: On Stranger Tides    Johnny Depp
Alice in Wonderland                            Johnny Depp
Name: actor_1_name, dtype: object

### Using `isin` method to check for multiple equalities

In [81]:
criteria = actor1.isin(['Johnny Depp', 'Matt Damon', 'Tom Hanks'])
criteria.head(10)

movie_title
Avatar                                        False
Pirates of the Caribbean: At World's End       True
Spectre                                       False
The Dark Knight Rises                         False
Star Wars: Episode VII - The Force Awakens    False
John Carter                                   False
Spider-Man 3                                  False
Tangled                                       False
Avengers: Age of Ultron                       False
Harry Potter and the Half-Blood Prince        False
Name: actor_1_name, dtype: bool

In [82]:
actor1[criteria].head(10)

movie_title
Pirates of the Caribbean: At World's End       Johnny Depp
Pirates of the Caribbean: Dead Man's Chest     Johnny Depp
The Lone Ranger                                Johnny Depp
Pirates of the Caribbean: On Stranger Tides    Johnny Depp
Alice in Wonderland                            Johnny Depp
Toy Story 3                                      Tom Hanks
The Polar Express                                Tom Hanks
Alice Through the Looking Glass                Johnny Depp
Charlie and the Chocolate Factory              Johnny Depp
Angels & Demons                                  Tom Hanks
Name: actor_1_name, dtype: object
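`isin` is shorthand for several `==` comparisons joined with `|`, and the resulting boolean Series can be inverted with `~` to select everything that is not in the list. The sketch below uses the first four lead actors shown earlier; the variable names are illustrative.

```python
import pandas as pd

# First four lead actors from the movie dataset, as displayed above.
actors = pd.Series(
    ["CCH Pounder", "Johnny Depp", "Christoph Waltz", "Tom Hardy"],
    index=[
        "Avatar",
        "Pirates of the Caribbean: At World's End",
        "Spectre",
        "The Dark Knight Rises",
    ],
)

in_list = actors.isin(["Johnny Depp", "Tom Hanks"])

# The same criteria written out with == and |.
equivalent = (actors == "Johnny Depp") | (actors == "Tom Hanks")

# ~ flips every boolean: movies whose lead actor is NOT in the list.
not_in_list = ~in_list

print(actors[not_in_list])
```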

## Exercise #5
Use boolean indexing to select movies whose facebook likes are either greater than 0 and less than 100, or greater than 10,000.

In [83]:
df.head(1)

movie_title,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
Avatar,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000


In [84]:
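#your code
facebook_likes = df['movie_facebook_likes']

#target condition: ((< 100) & (> 0)) | (> 10000)
#first attempt, kept as a comment because it is not valid Python
#(an assignment cannot be followed by `if`, and `or` does not work element-wise):
#criteria = if 0 < facebook_likes < 100 or 10000 < facebook_likes

Running that line uncommented raises:

SyntaxError: invalid syntax (<ipython-input-84-f9f808a23818>, line 5)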

In [96]:
facebook_likes = df['movie_facebook_likes']

#a chained comparison such as 0 < facebook_likes < 100 does not work on a Series
#for "and" use &
#for "or" use |

criteria = ((facebook_likes > 0)&(facebook_likes < 100)) | (facebook_likes > 10000)

facebook_likes[criteria].value_counts()

11000     79
13000     58
12000     58
15000     48
14000     47
          ..
153000     1
99000      1
55         1
96         1
150000     1
Name: movie_facebook_likes, Length: 205, dtype: int64
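The reason the element-wise operators are needed can be seen on a small example. Python expands `0 < s < 100` into `(0 < s) and (s < 100)`, and `and` asks for a single truth value of a whole Series, which pandas refuses to give. The four values below are made up for illustration.

```python
import pandas as pd

# Hypothetical like counts.
likes = pd.Series([0, 55, 5000, 33000], index=["a", "b", "c", "d"])

# The chained comparison raises ValueError because the truth value
# of a Series with several elements is ambiguous.
try:
    chained = 0 < likes < 100
    chained_failed = False
except ValueError:
    chained_failed = True

# The element-wise version: & for "and", | for "or", with parentheses.
criteria = ((likes > 0) & (likes < 100)) | (likes > 10000)
selected = likes[criteria]

print(chained_failed)
print(selected)
```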

## Exercise #6
How many movies have more than 100,000 facebook likes?

In [101]:
#your code: > 100,000 likes
criteria = facebook_likes > 100000

#criteria.sum() gives the number of such movies; here we list the five most liked
facebook_likes[criteria].sort_values(ascending=False).head()

movie_title
Interstellar                          349000
Django Unchained                      199000
Batman v Superman: Dawn of Justice    197000
Mad Max: Fury Road                    191000
The Revenant                          190000
Name: movie_facebook_likes, dtype: int64
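To answer the "how many" part directly, the boolean Series can be summed, since `True` counts as 1 and `False` as 0. The sketch below uses five like counts taken from the outputs above rather than the full dataset, so the count it produces applies only to this sample.

```python
import pandas as pd

# Five like counts copied from the outputs shown earlier (not the full dataset).
likes = pd.Series(
    [33000, 0, 85000, 164000, 349000],
    index=[
        "Avatar",
        "Pirates of the Caribbean: At World's End",
        "Spectre",
        "The Dark Knight Rises",
        "Interstellar",
    ],
)

criteria = likes > 100000

# Summing booleans counts the True values.
count_via_sum = int(criteria.sum())

# Equivalent: filter first, then take the length.
count_via_len = len(likes[criteria])

print(count_via_sum, count_via_len)
```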

# The DataFrame Basics

In [102]:
#read movies dataframe one more time
df = pd.read_csv('data/movie.csv', index_col='movie_title')
df.head(2)

movie_title,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
Avatar,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
Pirates of the Caribbean: At World's End,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0


## DataFrame Dimensions
+ **`shape`** attribute - tuple of the number of rows and columns
+ **`size`** attribute - total number of elements (rows times columns)
+ **`len`** function - number of rows
+ **`ndim`** attribute - number of dimensions, always two for a DataFrame

In [103]:
df.shape

(4916, 27)

In [104]:
df.size

132732

In [105]:
len(df)

4916

In [106]:
df.ndim

2
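These four quantities are tied together: `size` equals rows times columns (above, 4916 × 27 = 132732), and `len` equals the first entry of `shape`. A small sketch on a hypothetical three-row DataFrame checks the relationships:

```python
import pandas as pd

# Hypothetical three-row, two-column DataFrame.
small = pd.DataFrame(
    {
        "duration": [178.0, 169.0, 148.0],
        "imdb_score": [7.9, 7.1, 6.8],
    }
)

n_rows, n_cols = small.shape

# size is rows times columns; len counts rows only.
size_matches = small.size == n_rows * n_cols
len_matches = len(small) == n_rows

print(small.shape, small.size, len(small), small.ndim)
```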

### DataFrame info method

In [52]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4916 entries, Avatar to My Date with Drew
Data columns (total 27 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   color                      4897 non-null   object 
 1   director_name              4814 non-null   object 
 2   num_critic_for_reviews     4867 non-null   float64
 3   duration                   4901 non-null   float64
 4   director_facebook_likes    4814 non-null   float64
 5   actor_3_facebook_likes     4893 non-null   float64
 6   actor_2_name               4903 non-null   object 
 7   actor_1_facebook_likes     4909 non-null   float64
 8   gross                      4054 non-null   float64
 9   genres                     4916 non-null   object 
 10  actor_1_name               4909 non-null   object 
 11  num_voted_users            4916 non-null   int64  
 12  cast_total_facebook_likes  4916 non-null   int64  
 13  actor_3_name               4893 non

### Get summary statistics for numeric columns

In [53]:
df.describe()

Unnamed: 0,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_1_facebook_likes,gross,num_voted_users,cast_total_facebook_likes,facenumber_in_poster,num_user_for_reviews,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
count,4867.0,4901.0,4814.0,4893.0,4909.0,4054.0,4916.0,4916.0,4903.0,4895.0,4432.0,4810.0,4903.0,4916.0,4590.0,4916.0
mean,137.988905,107.090798,691.014541,631.276313,6494.488491,47644510.0,82644.92,9579.815907,1.37732,267.668846,36547490.0,2002.447609,1621.923516,6.437429,2.222349,7348.294142
std,120.239379,25.286015,2832.954125,1625.874802,15106.986884,67372550.0,138322.2,18164.31699,2.023826,372.934839,100242700.0,12.453977,4011.299523,1.127802,1.40294,19206.016458
min,1.0,7.0,0.0,0.0,0.0,162.0,5.0,0.0,0.0,1.0,218.0,1916.0,0.0,1.6,1.18,0.0
25%,49.0,93.0,7.0,132.0,607.0,5019656.0,8361.75,1394.75,0.0,64.0,6000000.0,1999.0,277.0,5.8,1.85,0.0
50%,108.0,103.0,48.0,366.0,982.0,25043960.0,33132.5,3049.0,1.0,153.0,19850000.0,2005.0,593.0,6.6,2.35,159.0
75%,191.0,118.0,189.75,633.0,11000.0,61108410.0,93772.75,13616.75,2.0,320.5,43000000.0,2011.0,912.0,7.2,2.35,2000.0
max,813.0,511.0,23000.0,23000.0,640000.0,760505800.0,1689764.0,656730.0,43.0,5060.0,4200000000.0,2016.0,137000.0,9.5,16.0,349000.0


### Get summary statistics for only object type columns

In [54]:
df.describe(include=['object'])

Unnamed: 0,color,director_name,actor_2_name,genres,actor_1_name,actor_3_name,plot_keywords,movie_imdb_link,language,country,content_rating
count,4897,4814,4903,4916,4909,4893,4764,4916,4904,4911,4616
unique,2,2397,3030,914,2095,3519,4756,4916,47,65,18
top,Color,Steven Spielberg,Morgan Freeman,Drama,Robert De Niro,Steve Coogan,based on novel,http://www.imdb.com/title/tt0385004/?ref_=fn_t...,English,USA,R
freq,4693,26,18,233,48,8,4,1,4582,3710,2067


#### Transpose the summary

In [55]:
df.describe(include=['object']).T

Unnamed: 0,count,unique,top,freq
color,4897,2,Color,4693
director_name,4814,2397,Steven Spielberg,26
actor_2_name,4903,3030,Morgan Freeman,18
genres,4916,914,Drama,233
actor_1_name,4909,2095,Robert De Niro,48
actor_3_name,4893,3519,Steve Coogan,8
plot_keywords,4764,4756,based on novel,4
movie_imdb_link,4916,4916,http://www.imdb.com/title/tt0385004/?ref_=fn_t...,1
language,4904,47,English,4582
country,4911,65,USA,3710
content_rating,4616,18,R,2067


### Get summary statistics for only integer type columns

In [56]:
df.describe(include=['int'])

Unnamed: 0,num_voted_users,cast_total_facebook_likes,movie_facebook_likes
count,4916.0,4916.0,4916.0
mean,82644.92,9579.815907,7348.294142
std,138322.2,18164.31699,19206.016458
min,5.0,0.0,0.0
25%,8361.75,1394.75,0.0
50%,33132.5,3049.0,159.0
75%,93772.75,13616.75,2000.0
max,1689764.0,656730.0,349000.0
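
The `include` argument can be sketched on a small frame with one column per dtype. The frame below is hypothetical (the column names `votes`, `score`, and `rating` are made up), and the dtypes are set explicitly so that the selection does not depend on how a particular pandas version infers them.

```python
import pandas as pd

# Hypothetical toy frame with explicitly typed columns
toy = pd.DataFrame(
    {
        "votes": pd.Series([10, 20, 30], dtype="int64"),
        "score": pd.Series([7.9, 7.1, 6.8], dtype="float64"),
        "rating": pd.Series(["PG-13", "PG-13", "R"], dtype="object"),
    }
)

int_summary = toy.describe(include=["int"])        # integer columns only
float_summary = toy.describe(include=["float"])    # float columns only
number_summary = toy.describe(include=["number"])  # every numeric column
object_summary = toy.describe(include=["object"])  # object columns only

print(list(int_summary.columns))     # ['votes']
print(list(float_summary.columns))   # ['score']
print(list(number_summary.columns))  # ['votes', 'score']
print(list(object_summary.columns))  # ['rating']
```

Numeric summaries report count, mean, std, min, quartiles, and max, while object summaries report count, unique, top, and freq.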


### Changing Display Settings

Pandas comes with default values for a couple dozen display settings that control how output is rendered. One of these settings is the maximum number of columns displayed on screen. The options can all be found under **`pd.options.display`**. All the display settings that can be changed are listed below.

In [57]:
dir(pd.options.display)

['chop_threshold',
 'colheader_justify',
 'column_space',
 'date_dayfirst',
 'date_yearfirst',
 'encoding',
 'expand_frame_repr',
 'float_format',
 'html',
 'large_repr',
 'latex',
 'max_categories',
 'max_columns',
 'max_colwidth',
 'max_info_columns',
 'max_info_rows',
 'max_rows',
 'max_seq_items',
 'memory_usage',
 'min_rows',
 'multi_sparse',
 'notebook_repr_html',
 'pprint_nest_depth',
 'precision',
 'show_dimensions',
 'unicode',
 'width']

In [58]:
pd.options.display.max_columns

20

In [59]:
pd.options.display.max_columns = 50

In [60]:
df.head(2)

movie_title,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,actor_1_name,num_voted_users,cast_total_facebook_likes,actor_3_name,facenumber_in_poster,plot_keywords,movie_imdb_link,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
Avatar,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,CCH Pounder,886204,4834,Wes Studi,0.0,avatar|future|marine|native|paraplegic,http://www.imdb.com/title/tt0499549/?ref_=fn_t...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
Pirates of the Caribbean: At World's End,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,Johnny Depp,471220,48350,Jack Davenport,0.0,goddess|marriage ceremony|marriage proposal|pi...,http://www.imdb.com/title/tt0449088/?ref_=fn_t...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
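
Assigning to `pd.options.display.max_columns` changes the setting for the rest of the session. If only a temporary change is wanted, one possible approach is `pd.option_context`, sketched below together with `pd.reset_option`. The value 50 is just the example value used in this lesson.

```python
import pandas as pd

default_max = pd.get_option("display.max_columns")

# option_context applies the setting only inside the with-block
with pd.option_context("display.max_columns", 50):
    inside = pd.get_option("display.max_columns")

after = pd.get_option("display.max_columns")
print(inside, after == default_max)  # 50 True

# A permanent change can be undone with reset_option
pd.set_option("display.max_columns", 50)
pd.reset_option("display.max_columns")
```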


## Selecting data by iloc

In [61]:
df.iloc[5:8]

movie_title,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,actor_1_name,num_voted_users,cast_total_facebook_likes,actor_3_name,facenumber_in_poster,plot_keywords,movie_imdb_link,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
John Carter,Color,Andrew Stanton,462.0,132.0,475.0,530.0,Samantha Morton,640.0,73058679.0,Action|Adventure|Sci-Fi,Daryl Sabara,212204,1873,Polly Walker,1.0,alien|american civil war|male nipple|mars|prin...,http://www.imdb.com/title/tt0401729/?ref_=fn_t...,738.0,English,USA,PG-13,263700000.0,2012.0,632.0,6.6,2.35,24000
Spider-Man 3,Color,Sam Raimi,392.0,156.0,0.0,4000.0,James Franco,24000.0,336530303.0,Action|Adventure|Romance,J.K. Simmons,383056,46055,Kirsten Dunst,0.0,sandman|spider man|symbiote|venom|villain,http://www.imdb.com/title/tt0413300/?ref_=fn_t...,1902.0,English,USA,PG-13,258000000.0,2007.0,11000.0,6.2,2.35,0
Tangled,Color,Nathan Greno,324.0,100.0,15.0,284.0,Donna Murphy,799.0,200807262.0,Adventure|Animation|Comedy|Family|Fantasy|Musi...,Brad Garrett,294810,2036,M.C. Gainey,1.0,17th century|based on fairy tale|disney|flower...,http://www.imdb.com/title/tt0398286/?ref_=fn_t...,387.0,English,USA,PG,260000000.0,2010.0,553.0,7.8,1.85,29000
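
`iloc` selects by integer position, and a slice such as `5:8` excludes its stop position, which is why the cell above returns three rows (positions 5, 6, and 7). A minimal sketch on a hypothetical four-row frame (labels `a` to `d` and the values are made up):

```python
import pandas as pd

# Hypothetical toy frame; labels and values are made up
toy = pd.DataFrame(
    {"score": [7.9, 7.1, 6.8, 8.5], "year": [2009, 2007, 2015, 2012]},
    index=["a", "b", "c", "d"],
)

rows_1_2 = toy.iloc[1:3]    # positions 1 and 2; the stop position is excluded
last_row = toy.iloc[-1]     # a single row comes back as a Series
cell = toy.iloc[0, 1]       # row position 0, column position 1
block = toy.iloc[:2, [0]]   # first two rows, first column, kept as a DataFrame

print(list(rows_1_2.index))  # ['b', 'c']
print(cell)                  # 2009
print(block.shape)           # (2, 1)
```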


## Selecting data by loc

In [62]:
df.loc["Harry Potter and the Sorcerer's Stone":'Wall Street: Money Never Sleeps'].head(2)

movie_title,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,actor_1_name,num_voted_users,cast_total_facebook_likes,actor_3_name,facenumber_in_poster,plot_keywords,movie_imdb_link,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
Harry Potter and the Sorcerer's Stone,Color,Chris Columbus,258.0,159.0,0.0,645.0,Fiona Shaw,11000.0,317557891.0,Adventure|Family|Fantasy,Daniel Radcliffe,444683,13191,Verne Troyer,4.0,based on novel|birthday|evil wizard|quidditch|...,http://www.imdb.com/title/tt0241527/?ref_=fn_t...,1571.0,English,UK,PG,125000000.0,2001.0,687.0,7.5,2.35,16000
R.I.P.D.,Color,Robert Schwentke,208.0,96.0,124.0,1000.0,Jeff Bridges,16000.0,33592415.0,Action|Comedy|Fantasy,Ryan Reynolds,91640,31549,Stephanie Szostak,2.0,drug dealer|gold|partner|police|undead,http://www.imdb.com/title/tt0790736/?ref_=fn_t...,210.0,English,USA,PG-13,130000000.0,2013.0,12000.0,5.6,2.35,20000
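
`loc` selects by label rather than by position, and unlike `iloc` a label slice includes both of its endpoints. A minimal sketch on a hypothetical frame (labels `a` to `d` and the values are made up):

```python
import pandas as pd

# Hypothetical toy frame; labels and values are made up
toy = pd.DataFrame(
    {"score": [7.9, 7.1, 6.8, 8.5], "year": [2009, 2007, 2015, 2012]},
    index=["a", "b", "c", "d"],
)

rows_b_to_c = toy.loc["b":"c"]            # both 'b' and 'c' are included
one_value = toy.loc["a", "year"]          # row label, then column label
score_only = toy.loc["b":"d", ["score"]]  # label slice plus a list of columns

print(list(rows_b_to_c.index))  # ['b', 'c']
print(one_value)                # 2009
print(score_only.shape)         # (3, 1)
```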


## Exercise #7
Select all three actor name columns.

In [63]:
#your code

#first find actor columns from df.columns

#then use .loc to slice the df dataframe, selecting only the actor name columns

## Missing data

In [64]:
df.isnull().head()

movie_title,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,actor_1_name,num_voted_users,cast_total_facebook_likes,actor_3_name,facenumber_in_poster,plot_keywords,movie_imdb_link,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
Avatar,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
Pirates of the Caribbean: At World's End,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
Spectre,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
The Dark Knight Rises,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
Star Wars: Episode VII - The Force Awakens,True,False,True,True,False,True,False,False,True,False,False,False,False,True,False,True,False,True,True,True,True,True,True,False,False,True,False


In [65]:
#missing values in each column
df.isnull().sum(axis=0)

color                         19
director_name                102
num_critic_for_reviews        49
duration                      15
director_facebook_likes      102
actor_3_facebook_likes        23
actor_2_name                  13
actor_1_facebook_likes         7
gross                        862
genres                         0
actor_1_name                   7
num_voted_users                0
cast_total_facebook_likes      0
actor_3_name                  23
facenumber_in_poster          13
plot_keywords                152
movie_imdb_link                0
num_user_for_reviews          21
language                      12
country                        5
content_rating               300
budget                       484
title_year                   106
actor_2_facebook_likes        13
imdb_score                     0
aspect_ratio                 326
movie_facebook_likes           0
dtype: int64

In [66]:
#missing values in each row
df.isnull().sum(axis=1)

movie_title
Avatar                                         0
Pirates of the Caribbean: At World's End       0
Spectre                                        0
The Dark Knight Rises                          0
Star Wars: Episode VII - The Force Awakens    14
                                              ..
Signed Sealed Delivered                        4
The Following                                  5
A Plague So Pleasant                           4
Shanghai Calling                               2
My Date with Drew                              0
Length: 4916, dtype: int64

In [67]:
#count number of missing values in single title_year column
df['title_year'].isnull().sum()

106
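
The pattern behind these cells is that `isnull` returns a Boolean frame and `sum` counts the `True` values along the chosen axis. A minimal sketch on a hypothetical frame (column names reused from the movie data, values made up):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few missing values
toy = pd.DataFrame(
    {
        "gross": [100.0, np.nan, 300.0],
        "budget": [np.nan, np.nan, 50.0],
        "imdb_score": [7.9, 7.1, 6.8],
    },
    index=["a", "b", "c"],
)

mask = toy.isnull()              # True where a value is missing
per_column = mask.sum(axis=0)    # missing values in each column
per_row = mask.sum(axis=1)       # missing values in each row
total = int(mask.sum().sum())    # missing values in the whole frame
share = mask.mean()              # fraction of missing values per column

print(per_column.tolist())  # [1, 2, 0]
print(per_row.tolist())     # [1, 2, 0]
print(total)                # 3
```

Taking the mean of the Boolean mask works because `True` counts as 1 and `False` as 0.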

## Value counts

In [68]:
#omits missing values
df['country'].value_counts()

USA                     3710
UK                       434
France                   154
Canada                   124
Germany                   94
                        ... 
Nigeria                    1
Philippines                1
Cameroon                   1
Libya                      1
United Arab Emirates       1
Name: country, Length: 65, dtype: int64
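
Two optional arguments of `value_counts` are often useful here: `dropna=False` keeps missing values as their own row, and `normalize=True` returns fractions instead of counts. A minimal sketch on a hypothetical five-element Series (the values are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing value
country = pd.Series(
    ["USA", "UK", "USA", np.nan, "USA"], name="country", dtype="object"
)

counts = country.value_counts()                # missing values are dropped
with_nan = country.value_counts(dropna=False)  # missing values get their own row
shares = country.value_counts(normalize=True)  # fractions of non-missing values

print(counts["USA"], counts.sum())  # 3 4
print(with_nan.sum())               # 5
print(shares["USA"])                # 0.75
```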

## Statistics operations for Series

In [69]:
#find mean value of a single column cast_total_facebook_likes (Series)
df['cast_total_facebook_likes'].mean()

9579.81590724166

In [70]:
#median value of a single column cast_total_facebook_likes (Series)
df['cast_total_facebook_likes'].quantile(0.5), df['cast_total_facebook_likes'].median()

(3049.0, 3049.0)

In [71]:
#quick stats summary of a single column cast_total_facebook_likes (Series)
df['cast_total_facebook_likes'].describe()

count      4916.000000
mean       9579.815907
std       18164.316990
min           0.000000
25%        1394.750000
50%        3049.000000
75%       13616.750000
max      656730.000000
Name: cast_total_facebook_likes, dtype: float64

## Statistics operations for DataFrame

In [72]:
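#find mean values for each numerical column in movies dataframe
#numeric_only=True skips object columns (newer pandas versions raise an error without it)
df.mean(axis=0, numeric_only=True)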

num_critic_for_reviews       1.379889e+02
duration                     1.070908e+02
director_facebook_likes      6.910145e+02
actor_3_facebook_likes       6.312763e+02
actor_1_facebook_likes       6.494488e+03
gross                        4.764451e+07
num_voted_users              8.264492e+04
cast_total_facebook_likes    9.579816e+03
facenumber_in_poster         1.377320e+00
num_user_for_reviews         2.676688e+02
budget                       3.654749e+07
title_year                   2.002448e+03
actor_2_facebook_likes       1.621924e+03
imdb_score                   6.437429e+00
aspect_ratio                 2.222349e+00
movie_facebook_likes         7.348294e+03
dtype: float64

In [73]:
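#find standard deviation values for each numerical column in movies dataframe
#numeric_only=True skips object columns (newer pandas versions raise an error without it)
df.std(axis=0, numeric_only=True)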

num_critic_for_reviews       1.202394e+02
duration                     2.528602e+01
director_facebook_likes      2.832954e+03
actor_3_facebook_likes       1.625875e+03
actor_1_facebook_likes       1.510699e+04
gross                        6.737255e+07
num_voted_users              1.383222e+05
cast_total_facebook_likes    1.816432e+04
facenumber_in_poster         2.023826e+00
num_user_for_reviews         3.729348e+02
budget                       1.002427e+08
title_year                   1.245398e+01
actor_2_facebook_likes       4.011300e+03
imdb_score                   1.127802e+00
aspect_ratio                 1.402940e+00
movie_facebook_likes         1.920602e+04
dtype: float64

In [74]:
#count non-missing values in each column
df.count()

color                        4897
director_name                4814
num_critic_for_reviews       4867
duration                     4901
director_facebook_likes      4814
actor_3_facebook_likes       4893
actor_2_name                 4903
actor_1_facebook_likes       4909
gross                        4054
genres                       4916
actor_1_name                 4909
num_voted_users              4916
cast_total_facebook_likes    4916
actor_3_name                 4893
facenumber_in_poster         4903
plot_keywords                4764
movie_imdb_link              4916
num_user_for_reviews         4895
language                     4904
country                      4911
content_rating               4616
budget                       4432
title_year                   4810
actor_2_facebook_likes       4903
imdb_score                   4916
aspect_ratio                 4590
movie_facebook_likes         4916
dtype: int64

In [75]:
#find the column with the fewest missing values (the first one if several tie) and its non-missing count
df.count().idxmax(), df.count().max()

('genres', 4916)
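
Several columns in the movie data have no missing values, yet `idxmax` reports only `genres`, because it returns the first label at which the maximum occurs. A minimal sketch, on a hypothetical frame with made-up values, of how to list every tied column and how to find the least complete one:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; values are made up
toy = pd.DataFrame(
    {
        "color": ["Color", None, "Color"],
        "genres": ["Drama", "Comedy", "Action"],
        "imdb_score": [7.9, 7.1, 6.8],
        "gross": [1.0, np.nan, np.nan],
    }
)

non_missing = toy.count()
most_complete = non_missing.idxmax()   # first label among ties
complete_cols = non_missing[non_missing == non_missing.max()].index.tolist()
least_complete = non_missing.idxmin()  # column with the most missing values

print(most_complete)   # genres
print(complete_cols)   # ['genres', 'imdb_score']
print(least_complete)  # gross
```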