Over the next two missions, we'll dive into some of pandas' internals to better understand how it does things under the hood.

The **three key data structures in pandas** are:

* Series objects (collections of values)
* DataFrames (collections of Series objects)
* Panels (collections of DataFrame objects; deprecated and removed in later pandas releases)

## Series

**Series** objects use NumPy arrays for fast computation, but add valuable features for analyzing data. While NumPy arrays use only an integer index, Series objects can use **other index types**, such as a string index. Series objects also allow **mixed data types**, and use the **NaN** value (NumPy's floating-point "not a number") to represent missing values.
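As a minimal sketch of these features, the snippet below builds two small Series by hand. The film labels and values are made up for illustration and are not taken from the Fandango data set.

```python
import numpy as np
import pandas as pd

# Hypothetical scores with a string index and one missing value.
scores = pd.Series([74, 85, np.nan], index=["Film A", "Film B", "Film C"])

print(scores["Film B"])      # look up a value by its string label
print(scores.isna().sum())   # count the missing values
print(scores.dtype)          # NaN is a float, so the ints are stored as floats

# A Series holding mixed data types falls back to the generic object dtype.
mixed = pd.Series(["a", 1, 2.5])
print(mixed.dtype)
```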

In [4]:
import pandas as pd

fandango = pd.read_csv('fandango_score_comparison.csv')
print(fandango.head(2))

                             FILM  RottenTomatoes  RottenTomatoes_User  \
0  Avengers: Age of Ultron (2015)              74                   86   
1               Cinderella (2015)              85                   80   

   Metacritic  Metacritic_User  IMDB  Fandango_Stars  Fandango_Ratingvalue  \
0          66              7.1   7.8             5.0                   4.5   
1          67              7.5   7.1             5.0                   4.5   

   RT_norm  RT_user_norm         ...           IMDB_norm  RT_norm_round  \
0     3.70           4.3         ...                3.90            3.5   
1     4.25           4.0         ...                3.55            4.5   

   RT_user_norm_round  Metacritic_norm_round  Metacritic_user_norm_round  \
0                 4.5                    3.5                         3.5   
1                 4.0                    3.5                         4.0   

   IMDB_norm_round  Metacritic_user_vote_count  IMDB_user_vote_count  \
0              

In [5]:
series_film = fandango['FILM']
print(series_film.head(5))


series_rt = fandango['RottenTomatoes']
print(series_rt.head(5))


0    Avengers: Age of Ultron (2015)
1                 Cinderella (2015)
2                    Ant-Man (2015)
3            Do You Believe? (2015)
4     Hot Tub Time Machine 2 (2015)
Name: FILM, dtype: object
0    74
1    85
2    80
3    18
4    14
Name: RottenTomatoes, dtype: int64


Both of these Series objects **use the same integer indexes**. This means that the value at index 5, for example, would describe the same film in both Series objects (The Water Diviner (2015)). To look up information about a specific movie, we would need to know its integer index.

If we only had these two Series objects and wanted to look up the Rotten Tomatoes scores for *Minions (2015)* and *Leviathan (2014)*, we'd have to:

* Find the integer index corresponding to Minions (2015) in series_film
* Look up the value at that integer index from series_rt
* Find the integer index corresponding to Leviathan (2014) in series_film
* Look up the value at that integer index from series_rt
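The steps above can be sketched on a toy example. The two small Series below are hypothetical stand-ins for `series_film` and `series_rt`, so the names and values are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-ins for series_film and series_rt (illustrative values only).
films = pd.Series(["Film A", "Film B", "Film C"])
ratings = pd.Series([74, 85, 80])

# Step 1: find the integer index whose value matches the film name.
idx = films[films == "Film B"].index[0]

# Step 2: use that integer index to look up the score in the other Series.
score = ratings.loc[idx]
print(idx, score)
```

Each additional film repeats both steps, which is what makes this approach tedious.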

This becomes especially cumbersome as we scale up the problem to look for a larger number of movies. What we really want is a way to retrieve the Rotten Tomatoes scores for many movies at the same time with just one command (and one Series object). To accomplish this, we need to move away from using integer indexes, and use string indexes corresponding to the film names instead. Then we can pass in a list of strings matching the film names to retrieve the scores, like so:

`series_custom[['Minions (2015)', 'Leviathan (2014)']]`

### Exercise: 
Create a **new Series object** named series_custom that has a string index (based on the values from film_names), and contains all of the Rotten Tomatoes scores from 



In [6]:
film_names = series_film.values
rt_scores = series_rt.values

series_custom = pd.Series(data = rt_scores, index = film_names)

print(series_custom.head(5))

Avengers: Age of Ultron (2015)    74
Cinderella (2015)                 85
Ant-Man (2015)                    80
Do You Believe? (2015)            18
Hot Tub Time Machine 2 (2015)     14
dtype: int64


Even though we specified a custom string index for the Series object, we can **still select values by integer position** (unlike with a dictionary). When it comes to indexes, **Series objects act like both dictionaries and lists**. We can access values with our custom index (like the keys in a dictionary), or by integer position (like the index in a list).


In [7]:
# Use string index like a dictionary with [[ ... ]]
print(series_custom[['Minions (2015)', 'Leviathan (2014)']])

# Use int index with [ ... ]
print(series_custom[5:11])

Minions (2015)      54
Leviathan (2014)    99
dtype: int64
The Water Diviner (2015)             63
Irrational Man (2015)                42
Top Five (2014)                      86
Shaun the Sheep Movie (2015)         99
Love & Mercy (2015)                  89
Far From The Madding Crowd (2015)    84
dtype: int64


**Reindexing** is the pandas way of modifying the alignment between labels (indexes) and the data (values). The `reindex()` method allows us to specify a different order for the labels (indexes) in a Series object. This method takes in a list of strings corresponding to the order we'd like for that Series object.

In [8]:
original_index = series_custom.index
# Sort the film names alphabetically, then reorder the Series to match
sorted_index = sorted(original_index)
sorted_by_index = series_custom.reindex(sorted_index)
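One behavior worth knowing is that `reindex()` also accepts labels that are not in the original index and fills them with NaN. The sketch below uses a hypothetical two-film Series to show this.

```python
import pandas as pd

# Hypothetical Series; labels and scores are made up for illustration.
s = pd.Series([74, 85], index=["Film B", "Film A"])

# Reorder the existing labels and ask for one label that does not exist.
reordered = s.reindex(["Film A", "Film B", "Film C"])
print(reordered)  # "Film C" has no data, so its value is NaN
```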


To make sorting easier, pandas comes with a `sort_index()` method that **sorts a Series by index**, and a `sort_values()` method that **sorts a Series by its values**. 

In [9]:
sc2 = series_custom.sort_index()
sc3 = series_custom.sort_values()
print(sc2.head(10))
print(sc3.head(10))

'71 (2015)                    97
5 Flights Up (2015)           52
A Little Chaos (2015)         40
A Most Violent Year (2014)    90
About Elly (2015)             97
Aloha (2015)                  19
American Sniper (2015)        72
American Ultra (2015)         46
Amy (2015)                    97
Annie (2014)                  27
dtype: int64
Paul Blart: Mall Cop 2 (2015)     5
Hitman: Agent 47 (2015)           7
Hot Pursuit (2015)                8
Fantastic Four (2015)             9
Taken 3 (2015)                    9
The Boy Next Door (2015)         10
The Loft (2015)                  11
Unfinished Business (2015)       11
Mortdecai (2015)                 12
Seventh Son (2015)               12
dtype: int64


### Vector operations 
Since **pandas builds on NumPy**, it takes advantage of **NumPy's vectorization capabilities**. Vectorized operations run as optimized, precompiled loops written in C instead of looping over the values in Python. A traditional for loop would be much slower, especially for large data sets.
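To make the contrast concrete, here is a small sketch that computes the same result with an explicit Python loop and with a vectorized expression. The Series is hypothetical, and the example only checks that the results match; it does not measure speed.

```python
import pandas as pd

# Hypothetical scores on a 100-point scale.
s = pd.Series([74, 85, 80], index=["Film A", "Film B", "Film C"])

# Explicit Python loop: one interpreter-level division per element.
looped = pd.Series([value / 10 for value in s], index=s.index)

# Vectorized: a single expression; the loop runs in compiled code.
vectorized = s / 10

print(vectorized.equals(looped))
```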

We can use any of the standard Python arithmetic operators (+, -, *, and /) to transform each of the values in a Series object. If we wanted to transform the Rotten Tomatoes scores from a 100-point scale to a 10-point scale, for example, we could use the Python division operator (/) to divide the Series by 10. We can even use NumPy functions to transform and run calculations over Series objects:

In [10]:
import numpy as np

# Divide all entries by 10
series_custom/10

# Add the Series to itself, element by element
np.add(series_custom, series_custom)
# Apply sine function to each value
np.sin(series_custom)
# Return the highest value (will return a single value, not a Series)
np.max(series_custom)

print(sc3[len(sc3)-10:])

Paddington (2015)                             98
Mr. Turner (2014)                             98
Timbuktu (2015)                               99
Shaun the Sheep Movie (2015)                  99
Leviathan (2014)                              99
Song of the Sea (2014)                        99
Phoenix (2015)                                99
Selma (2014)                                  99
Seymour: An Introduction (2015)              100
Gett: The Trial of Viviane Amsalem (2015)    100
dtype: int64


### Series vs. ndarray
The values in a **Series object are stored in an ndarray**, the core data type in NumPy. Applying some NumPy functions to a Series object will **return a new Series object**, while other functions will return a single value. NumPy's documentation (http://docs.scipy.org/doc/numpy/reference/generated/numpy.sin.html#numpy.sin) gives us a good sense of the return value for each function. If a particular NumPy function **usually returns an ndarray**, it will return a **Series object** instead when we **apply it to a Series**.
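The difference between the two kinds of functions can be checked directly. This sketch uses a hypothetical three-value Series with `np.sqrt` (element-wise) and `np.max` (a reduction).

```python
import numpy as np
import pandas as pd

s = pd.Series([4, 9, 16], index=["a", "b", "c"])

roots = np.sqrt(s)    # element-wise function: returns a Series with the same index
largest = np.max(s)   # reduction: returns a single value

print(type(roots).__name__)
print(largest)
```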

### Pandas and vectorized operations

Pandas uses **vectorized operations** for many tasks, such as filtering values within a single Series object and comparing two different Series objects. For example, to find all films with an average critic rating above 50 on Rotten Tomatoes, running:

`series_custom > 50`

will actually **return a Series object with a Boolean value for each film**. That's because pandas applies the filter (> 50) to each value in the Series object. To retrieve the actual film names, we need to pass this Boolean series into the original Series object.

In [11]:
series_greater_than_50 = series_custom[series_custom > 50]
print(series_greater_than_50.shape[0])
print(series_custom.shape[0])
print('Ratio of movies greater 50:')
print(series_greater_than_50.shape[0]/series_custom.shape[0])

94
146
Ratio of movies greater 50:
0.6438356164383562


In [12]:
# Combining boolean criteria with '&', '|'

bigger_50 = series_custom > 50
smaller_55 = series_custom <55

between_50_55 = series_custom[bigger_50 & smaller_55]
print(between_50_55)

Unbroken (2014)                  51
5 Flights Up (2015)              52
Saint Laurent (2015)             51
The Age of Adaline (2015)        54
Maggie (2015)                    54
Escobar: Paradise Lost (2015)    52
Woman in Gold (2015)             52
Minions (2015)                   54
Spare Parts (2015)               52
dtype: int64
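When the comparisons are written inline rather than stored in variables first, each one needs its own parentheses, because `&` and `|` bind more tightly than `>` and `<` in Python. A minimal sketch with hypothetical scores:

```python
import pandas as pd

s = pd.Series([49, 51, 54, 60], index=["A", "B", "C", "D"])

# Parentheses around each comparison are required here.
between = s[(s > 50) & (s < 55)]   # both conditions must hold
outside = s[(s < 50) | (s > 55)]   # either condition may hold

print(between)
print(outside)
```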


In [13]:
rt_critics = pd.Series(fandango['RottenTomatoes'].values, index=fandango['FILM'])
rt_users = pd.Series(fandango['RottenTomatoes_User'].values, index=fandango['FILM'])


rt_mean = pd.Series((rt_critics+rt_users)/2)
print(rt_mean[:10])

FILM
Avengers: Age of Ultron (2015)    80.0
Cinderella (2015)                 82.5
Ant-Man (2015)                    85.0
Do You Believe? (2015)            51.0
Hot Tub Time Machine 2 (2015)     21.0
The Water Diviner (2015)          62.5
Irrational Man (2015)             47.5
Top Five (2014)                   75.0
Shaun the Sheep Movie (2015)      90.5
Love & Mercy (2015)               88.0
dtype: float64


## DataFrames

**DataFrames** use Series objects to represent columns, which makes it easy to query and interact with many columns at once. When we select a single column from a DataFrame, pandas will return the **Series object representing that column**. By default, pandas indexes each individual Series object in a DataFrame with the **integer data type**. Each value in the Series has a unique integer index, or position. Like most Python data structures, the Series object uses 0-indexing. The **indexing ranges from 0 to n-1**, where n is the number of rows. We can use an integer index to select an individual value in a Series if we know its position.

**Series** objects maintain **data alignment between values and their index labels**. Because dataframes are basically collections of Series objects, they maintain **alignment along both columns and rows**.
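Data alignment means that arithmetic between two Series matches values by index label, not by position. The sketch below uses two hypothetical Series whose labels appear in a different order.

```python
import pandas as pd

# Hypothetical critic and user scores; note the different label order.
critics = pd.Series([74, 85], index=["Film A", "Film B"])
users = pd.Series([80, 86], index=["Film B", "Film A"])

# Values are paired by label before the arithmetic is applied.
mean = (critics + users) / 2
print(mean)
```

Here "Film A" is averaged from 74 and 86, and "Film B" from 85 and 80, regardless of where each label sits in its Series.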

Whenever you call a method that returns or prints a dataframe, the **index values** (such as a sequence of integers) appear in the **leftmost column**. You can also use the `index` attribute to access the index values directly.

In [14]:
print('First two rows:')
print(fandango.head(2))

print('Index:')
print(fandango.index)

First two rows:
                             FILM  RottenTomatoes  RottenTomatoes_User  \
0  Avengers: Age of Ultron (2015)              74                   86   
1               Cinderella (2015)              85                   80   

   Metacritic  Metacritic_User  IMDB  Fandango_Stars  Fandango_Ratingvalue  \
0          66              7.1   7.8             5.0                   4.5   
1          67              7.5   7.1             5.0                   4.5   

   RT_norm  RT_user_norm         ...           IMDB_norm  RT_norm_round  \
0     3.70           4.3         ...                3.90            3.5   
1     4.25           4.0         ...                3.55            4.5   

   RT_user_norm_round  Metacritic_norm_round  Metacritic_user_norm_round  \
0                 4.5                    3.5                         3.5   
1                 4.0                    3.5                         4.0   

   IMDB_norm_round  Metacritic_user_vote_count  IMDB_user_vote_count  \

In [15]:
print(fandango.index[39])

print(fandango.iloc[39])

39
FILM                          Maps to the Stars (2015)
RottenTomatoes                                      60
RottenTomatoes_User                                 46
Metacritic                                          67
Metacritic_User                                    5.8
IMDB                                               6.3
Fandango_Stars                                     3.5
Fandango_Ratingvalue                               3.1
RT_norm                                              3
RT_user_norm                                       2.3
Metacritic_norm                                   3.35
Metacritic_user_nom                                2.9
IMDB_norm                                         3.15
RT_norm_round                                        3
RT_user_norm_round                                 2.5
Metacritic_norm_round                              3.5
Metacritic_user_norm_round                           3
IMDB_norm_round                                      3
Metacri

With **Series**, each unique index value refers to a data value. With **dataframes**, however, each index value refers to an entire row. We can use the integer index to select rows in a few different ways:

In [16]:
# First five rows
fandango[0:5]
# From row at 140 and higher
fandango[140:]
# Just row at index 50 --> needs iloc!
fandango.iloc[50]
# Just row at index 45 and 90
fandango.iloc[[45,90]]

Unnamed: 0,FILM,RottenTomatoes,RottenTomatoes_User,Metacritic,Metacritic_User,IMDB,Fandango_Stars,Fandango_Ratingvalue,RT_norm,RT_user_norm,...,IMDB_norm,RT_norm_round,RT_user_norm_round,Metacritic_norm_round,Metacritic_user_norm_round,IMDB_norm_round,Metacritic_user_vote_count,IMDB_user_vote_count,Fandango_votes,Fandango_Difference
45,Tomorrowland (2015),50,53,60,6.4,6.6,4.0,3.7,2.5,2.65,...,3.3,2.5,2.5,3.0,3.0,3.5,262,42937,8077,0.3
90,The SpongeBob Movie: Sponge Out of Water (2015),78,55,62,6.5,6.1,3.5,3.3,3.9,2.75,...,3.05,4.0,3.0,3.0,3.5,3.0,196,26046,4493,0.2


### iloc with dataframes
We use **bracket notation** to select a slice (a contiguous sequence) of rows, just as we would for a list. To select an **individual row**, however, we need the **`iloc[]` indexer**. It accepts the following objects for selection:

* An integer
* A list of integers
* A slice object
* A Boolean array

When selecting an **individual row**, pandas will return a **Series** object. When selecting **multiple rows**, it will return a **subset of the original dataframe** as a new dataframe.
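The four kinds of selection and their return types can be sketched on a small hypothetical dataframe (the column names mirror the data set, the rows are made up).

```python
import pandas as pd

df = pd.DataFrame({
    "FILM": ["Film A", "Film B", "Film C"],
    "RottenTomatoes": [74, 85, 80],
})

one_row = df.iloc[1]                    # integer: returns a Series
two_rows = df.iloc[[0, 2]]              # list of integers: returns a DataFrame
sliced = df.iloc[0:2]                   # slice: returns a DataFrame, end excluded
masked = df.iloc[[True, False, True]]   # Boolean list: keeps rows marked True

print(type(one_row).__name__, type(two_rows).__name__)
```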



In [17]:
first_last = fandango.iloc[[0,len(fandango)-1]] # use double brackets!

### Indexing Dataframes

The dataframe object has a `set_index()` method that allows us to pass in the **name of the column** we want pandas to use as the dataframe index. By default, pandas will **create a new dataframe**, index it by the values in the column we specify, then drop that column. The `set_index()` method has a **few parameters** that allow us to tweak this behavior:

* `inplace`: If set to True, this parameter will **set the index for the current**, "live" dataframe, instead of returning a new dataframe.
* `drop`: If set to False, this parameter will **keep the column we specified** as the index, instead of dropping it.
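The effect of `drop` is easiest to see by comparing shapes. This is a sketch on a hypothetical two-row dataframe, leaving `inplace` at its default so the original is not modified.

```python
import pandas as pd

df = pd.DataFrame({
    "FILM": ["Film A", "Film B"],
    "RottenTomatoes": [74, 85],
})

dropped = df.set_index("FILM")            # default: FILM moves into the index
kept = df.set_index("FILM", drop=False)   # FILM is both the index and a column

print(df.shape, dropped.shape, kept.shape)
```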

In [18]:
# Create a new dataframe indexed by the "FILM" column, keeping the column itself.
fandango_films = fandango.set_index('FILM', drop = False, inplace = False)

print(fandango.shape)
print(fandango_films.shape)

print(fandango_films.iloc[[5,10]])
print(fandango_films.index)

(146, 22)
(146, 22)
                                                                FILM  \
FILM                                                                   
The Water Diviner (2015)                    The Water Diviner (2015)   
Far From The Madding Crowd (2015)  Far From The Madding Crowd (2015)   

                                   RottenTomatoes  RottenTomatoes_User  \
FILM                                                                     
The Water Diviner (2015)                       63                   62   
Far From The Madding Crowd (2015)              84                   77   

                                   Metacritic  Metacritic_User  IMDB  \
FILM                                                                   
The Water Diviner (2015)                   50              6.8   7.2   
Far From The Madding Crowd (2015)          71              7.5   7.2   

                                   Fandango_Stars  Fandango_Ratingvalue  \
FILM                          

Now that we have a custom index, we can **select a row by film name instead of row number** (which is the default integer index). We can select rows using the custom index by either:

* Using the **`loc[]` indexer** (the same way we would use `iloc[]`)
* Creating a slice using **bracket notation**
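One difference from `iloc[]` is worth keeping in mind: a label slice includes both endpoints, while a positional slice excludes the end. A minimal sketch with a hypothetical dataframe indexed by film name:

```python
import pandas as pd

df = pd.DataFrame(
    {"RottenTomatoes": [74, 85, 80, 18]},
    index=["Film A", "Film B", "Film C", "Film D"],
)

by_label = df.loc["Film B":"Film C"]   # label slice: both endpoints included
by_position = df.iloc[1:2]             # positional slice: end excluded

print(len(by_label), len(by_position))
```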

In [19]:
# Slice using either bracket notation or loc[]
fandango_films["Avengers: Age of Ultron (2015)":"Hot Tub Time Machine 2 (2015)"]
fandango_films.loc["Avengers: Age of Ultron (2015)":"Hot Tub Time Machine 2 (2015)"]

# Specific movie
fandango_films.loc['Kumiko, The Treasure Hunter (2015)']

# Selecting list of movies
fandango_films.loc[['Kumiko, The Treasure Hunter (2015)', 'Do You Believe? (2015)', 'Ant-Man (2015)']]

#fandango_films.iloc[5]


FILM,FILM,RottenTomatoes,RottenTomatoes_User,Metacritic,Metacritic_User,IMDB,Fandango_Stars,Fandango_Ratingvalue,RT_norm,RT_user_norm,...,IMDB_norm,RT_norm_round,RT_user_norm_round,Metacritic_norm_round,Metacritic_user_norm_round,IMDB_norm_round,Metacritic_user_vote_count,IMDB_user_vote_count,Fandango_votes,Fandango_Difference
"Kumiko, The Treasure Hunter (2015)","Kumiko, The Treasure Hunter (2015)",87,63,68,6.4,6.7,3.5,3.5,4.35,3.15,...,3.35,4.5,3.0,3.5,3.0,3.5,19,5289,41,0.0
Do You Believe? (2015),Do You Believe? (2015),18,84,22,4.7,5.4,5.0,4.5,0.9,4.2,...,2.7,1.0,4.0,1.0,2.5,2.5,31,3136,1793,0.5
Ant-Man (2015),Ant-Man (2015),80,90,64,8.1,7.8,5.0,4.5,4.0,4.5,...,3.9,4.0,4.5,3.0,4.0,4.0,627,103660,12055,0.5


### Apply to DataFrames

Recall that a dataframe object represents both rows and columns as Series objects. The `apply()` method in pandas allows us to specify **Python logic** that we want to evaluate **over the Series objects in a dataframe** (row or column), e.g.
* Calculate the standard deviations for each numeric column
* Lowercase all film names in the FILM column
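The two tasks listed above could be sketched as follows. The dataframe is a hypothetical three-column example, and calling `np.std` on the raw column values is just one way to compute a standard deviation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "FILM": ["Film A", "Film B"],
    "IMDB": [7.8, 7.1],
    "Fandango_Stars": [5.0, 4.5],
})

# Standard deviation of each numeric column: one value per column,
# so apply() returns a Series indexed by column name.
numeric = df[["IMDB", "Fandango_Stars"]]
stds = numeric.apply(lambda col: np.std(col.to_numpy()))

# Lowercase every film name: the function returns a whole column,
# so apply() returns a dataframe.
lower = df[["FILM"]].apply(lambda col: col.str.lower())

print(stds)
print(lower)
```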

The `apply()` method requires us to pass in the function we want to apply to each Series object; that function should be a **vectorized operation** that works on a whole Series. The method runs **over the dataframe's columns by default**, but we can use the `axis` parameter to change this (which we'll do later). If the function returns a single value for each column, `apply()` returns a Series object containing the computed value for each column. For example, in the following code, we use a lambda function to take the square root of each float column:


In [40]:
import numpy as np

types = fandango_films.dtypes
#print(types.index)

float_columns = types[types.values == 'float64'].index

#print(float_columns)

# float_df contains only the float columns
float_df = fandango_films[float_columns]

float_df.apply(lambda x: np.sqrt(x)) 

#fandango_films
# usage of a lambda function
#float_df.apply(lambda x: x*2) 

FILM,Metacritic_User,IMDB,Fandango_Stars,Fandango_Ratingvalue,RT_norm,RT_user_norm,Metacritic_norm,Metacritic_user_nom,IMDB_norm,RT_norm_round,RT_user_norm_round,Metacritic_norm_round,Metacritic_user_norm_round,IMDB_norm_round,Fandango_Difference
Avengers: Age of Ultron (2015),2.664583,2.792848,2.236068,2.121320,1.923538,2.073644,1.816590,1.884144,1.974842,1.870829,2.121320,1.870829,1.870829,2.000000,0.707107
Cinderella (2015),2.738613,2.664583,2.236068,2.121320,2.061553,2.000000,1.830301,1.936492,1.884144,2.121320,2.000000,1.870829,2.000000,1.870829,0.707107
Ant-Man (2015),2.846050,2.792848,2.236068,2.121320,2.000000,2.121320,1.788854,2.012461,1.974842,2.000000,2.121320,1.732051,2.000000,2.000000,0.707107
Do You Believe? (2015),2.167948,2.323790,2.236068,2.121320,0.948683,2.049390,1.048809,1.532971,1.643168,1.000000,2.000000,1.000000,1.581139,1.581139,0.707107
Hot Tub Time Machine 2 (2015),1.843909,2.258318,1.870829,1.732051,0.836660,1.183216,1.204159,1.303840,1.596872,0.707107,1.224745,1.224745,1.224745,1.581139,0.707107
The Water Diviner (2015),2.607681,2.683282,2.121320,2.000000,1.774824,1.760682,1.581139,1.843909,1.897367,1.732051,1.732051,1.581139,1.870829,1.870829,0.707107
Irrational Man (2015),2.756810,2.626785,2.000000,1.870829,1.449138,1.627882,1.627882,1.949359,1.857418,1.414214,1.581139,1.581139,2.000000,1.870829,0.707107
Top Five (2014),2.607681,2.549510,2.000000,1.870829,2.073644,1.788854,2.012461,1.843909,1.802776,2.121320,1.732051,2.000000,1.870829,1.870829,0.707107
Shaun the Sheep Movie (2015),2.966479,2.720294,2.121320,2.000000,2.224860,2.024846,2.012461,2.097618,1.923538,2.236068,2.000000,2.000000,2.121320,1.870829,0.707107
Love & Mercy (2015),2.915476,2.792848,2.121320,2.000000,2.109502,2.085665,2.000000,2.061553,1.974842,2.121320,2.121320,2.000000,2.121320,2.000000,0.707107


### Lambda function
In the code above, we passed a lambda function to the DataFrame.apply() method. A **lambda function** is also called an *anonymous function*, because we aren't defining a new, named function (e.g. a function called double()) and then using it. A lambda function such as `lambda x: x*2` only **lives for the life of the DataFrame.apply() method call**. It consists of 2 parts:
* a variable name that we can refer to in our transformation logic: `lambda x:`
* the transformation logic: `x*2` (multiply by 2)

The function passed to `apply()` has to work on a whole Series (it must be **vectorized**). For example, `math.sqrt(x)` doesn't work because it expects a single number, but `numpy.sqrt(x)` does.
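The difference can be reproduced on a small hypothetical dataframe: `np.sqrt` handles each column as a whole, while `math.sqrt` raises a `TypeError` because it cannot convert a multi-value Series to a single float.

```python
import math

import numpy as np
import pandas as pd

df = pd.DataFrame({"IMDB": [4.0, 9.0], "Fandango_Stars": [16.0, 25.0]})

# Vectorized: np.sqrt receives each column as a Series and returns a Series.
roots = df.apply(lambda col: np.sqrt(col))

# Not vectorized: math.sqrt expects a single number, so this call fails.
try:
    df.apply(lambda col: math.sqrt(col))
    math_sqrt_failed = False
except TypeError:
    math_sqrt_failed = True

print(roots)
print(math_sqrt_failed)
```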

In [52]:
halved_df = float_df.apply(lambda x: x/2)
#print(halved_df.head(5))

# Applying lambda function to rows

# select relevant columns
rt_mt_user = float_df[['RT_user_norm', 'Metacritic_user_nom']]

#get standard deviation across rows
std = rt_mt_user.apply(lambda x: np.std(x), axis=1)
print(std.head(5))

#get mean rating of certain columns
columns = float_df[['RT_user_norm', 'Metacritic_user_nom']]
rt_mt_means = columns.apply(lambda x: np.mean(x), axis = 1)
print(rt_mt_means.head(5))


FILM
Avengers: Age of Ultron (2015)    0.375
Cinderella (2015)                 0.125
Ant-Man (2015)                    0.225
Do You Believe? (2015)            0.925
Hot Tub Time Machine 2 (2015)     0.150
dtype: float64
FILM
Avengers: Age of Ultron (2015)    3.925
Cinderella (2015)                 3.875
Ant-Man (2015)                    4.275
Do You Believe? (2015)            3.275
Hot Tub Time Machine 2 (2015)     1.550
dtype: float64
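For common row-wise statistics, dataframes also provide built-in methods that take the same `axis` argument. The sketch below uses a hypothetical two-row dataframe and compares them with the `apply()` approach. It assumes the population standard deviation is wanted, so `ddof=0` is passed explicitly because `DataFrame.std()` defaults to `ddof=1`.

```python
import numpy as np
import pandas as pd

# Hypothetical normalized user scores for two films.
df = pd.DataFrame(
    {"RT_user_norm": [4.3, 4.0], "Metacritic_user_nom": [3.55, 3.75]},
    index=["Film A", "Film B"],
)

# Built-in row-wise statistics (axis=1 means "across the columns of each row").
row_means = df.mean(axis=1)
row_stds = df.std(axis=1, ddof=0)

# The same standard deviation computed through apply(), for comparison.
via_apply = df.apply(lambda row: np.std(row.to_numpy()), axis=1)

print(row_means)
print(np.allclose(row_stds, via_apply))
```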
