The three key data structures in pandas are:

    Series objects (collections of values)
    DataFrames (collections of Series objects)
    Panels (collections of DataFrame objects; deprecated in pandas 0.20 and removed in later releases)

Series objects use NumPy arrays for fast computation, but add valuable features to them for analyzing data. While NumPy arrays use an integer index, for example, Series objects can use other index types, such as a string index. Series objects also allow for mixed data types, and use NaN (the floating-point "not a number" value that NumPy provides) to represent missing values.
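
As a minimal sketch of the features described above, the snippet below builds a small Series with a string index and one missing value. The film labels and scores are hypothetical placeholders, not rows from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical labels and scores, used only for illustration.
scores = pd.Series([74.0, np.nan, 80.0], index=["Film A", "Film B", "Film C"])

# Look up a value by its string label, which a plain NumPy array cannot do.
film_a_score = scores["Film A"]

# NaN marks the missing value; isna() flags it and mean() skips it by default.
missing_mask = scores.isna()
mean_score = scores.mean()

print(film_a_score)
print(missing_mask.tolist())
print(mean_score)
```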


In [1]:
import pandas as pd
fandango = pd.read_csv('fandango_score_comparison.csv')
fandango.head(2)

(index),FILM,RottenTomatoes,RottenTomatoes_User,Metacritic,Metacritic_User,IMDB,Fandango_Stars,Fandango_Ratingvalue,RT_norm,RT_user_norm,...,IMDB_norm,RT_norm_round,RT_user_norm_round,Metacritic_norm_round,Metacritic_user_norm_round,IMDB_norm_round,Metacritic_user_vote_count,IMDB_user_vote_count,Fandango_votes,Fandango_Difference
0,Avengers: Age of Ultron (2015),74,86,66,7.1,7.8,5.0,4.5,3.7,4.3,...,3.9,3.5,4.5,3.5,3.5,4.0,1330,271107,14846,0.5
1,Cinderella (2015),85,80,67,7.5,7.1,5.0,4.5,4.25,4.0,...,3.55,4.5,4.0,3.5,4.0,3.5,249,65709,12640,0.5


Integer Indexes

DataFrames use Series objects to represent columns. When we select a single column from a DataFrame, pandas will return the Series object representing that column. By default, pandas indexes each individual Series object in a DataFrame with the integer data type. Each value in the Series has a unique integer index, or position. Like most Python data structures, the Series object uses 0-indexing. The indexing ranges from 0 to n-1, where n is the number of rows. We can use an integer index to select an individual value in a Series if we know its position.

With both NumPy arrays and Series objects, we can pass integer indexes into bracket notation to slice and select values. With Series objects, however, we can also specify custom indexes.

In [2]:
fandango = pd.read_csv('fandango_score_comparison.csv')
series_film = fandango['FILM']
print(series_film[0:5])
series_rt = fandango['RottenTomatoes']
print(series_rt[0:5])

0    Avengers: Age of Ultron (2015)
1                 Cinderella (2015)
2                    Ant-Man (2015)
3            Do You Believe? (2015)
4     Hot Tub Time Machine 2 (2015)
Name: FILM, dtype: object
0    74
1    85
2    80
3    18
4    14
Name: RottenTomatoes, dtype: int64


Custom Indexes

Both of these Series objects use the same integer indexes. This means that the value at index 5, for example, describes the same film in both Series objects (The Water Diviner (2015)). To look up information about a specific movie, we would need to know its integer index. A custom index of film names removes that requirement: we can look up a score directly by title.


In [10]:
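from pandas import Series
# Build a Series of Rotten Tomatoes scores indexed by film name
film_names = series_film.values
rt_scores = series_rt.values
series_custom = Series(rt_scores, index=film_names)
# Select several films by label (each label must match the index exactly)
series_custom[['Minions (2015)', 'Leviathan (2014)', 'Avengers: Age of Ultron (2015)', 'Cinderella (2015)']]
# Positional slicing still works with a custom index
fiveten = series_custom[5:11]
fiveten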


The Water Diviner (2015)             63
Irrational Man (2015)                42
Top Five (2014)                      86
Shaun the Sheep Movie (2015)         99
Love & Mercy (2015)                  89
Far From The Madding Crowd (2015)    84
dtype: int64
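
The sketch below contrasts label-based and position-based selection on a custom index. It uses a three-film stand-in for series_custom (the name series_small is hypothetical), with scores copied from the output above. The explicit loc[] and iloc[] indexers are one way to make the intent unambiguous.

```python
import pandas as pd

# Stand-in for series_custom; scores are copied from the output above.
series_small = pd.Series(
    [63, 42, 86],
    index=["The Water Diviner (2015)", "Irrational Man (2015)", "Top Five (2014)"],
)

# loc[] selects by label, iloc[] selects by position.
by_label = series_small.loc["Irrational Man (2015)"]
by_position = series_small.iloc[1]

# A list of labels returns a new Series in the requested order.
several = series_small.loc[["Top Five (2014)", "The Water Diviner (2015)"]]

print(by_label, by_position)
print(several)
```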

Reindexing

Reindexing is the pandas way of modifying the alignment between labels (indexes) and the data (values). The reindex() method allows us to specify a different order for the labels (indexes) in a Series object. This method takes in a list of strings corresponding to the order we'd like for that Series object.

We can use the reindex() method to sort series_custom alphabetically by film. To accomplish this, we need to retrieve the current index, sort it with sorted(), and pass the sorted result to reindex().

In [14]:
original_index = series_custom.index 
#print('original_index is ',original_index)
sorted_index = sorted(original_index)
#print('sorted_index is',sorted_index)
sorted_by_index = series_custom.reindex(sorted_index)
sorted_by_index.head(5)

'71 (2015)                    97
5 Flights Up (2015)           52
A Little Chaos (2015)         40
A Most Violent Year (2014)    90
About Elly (2015)             97
dtype: int64
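
One detail worth knowing about reindex() is what happens when a requested label does not exist in the original index. The sketch below uses three scores from the output above plus one made-up label ("Not In The Data (2015)" is hypothetical) to show that pandas fills the gap with NaN.

```python
import pandas as pd

s = pd.Series(
    [97, 52, 40],
    index=["'71 (2015)", "5 Flights Up (2015)", "A Little Chaos (2015)"],
)

# Request a new order, including one label that is not in the index.
new_order = ["A Little Chaos (2015)", "'71 (2015)", "Not In The Data (2015)"]
reindexed = s.reindex(new_order)

# Existing labels keep their values; the unknown label gets NaN,
# which also converts the integer values to floats.
print(reindexed)
```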

Sorting

We just learned how to sort a Series object by the index using the reindex() method. This can be cumbersome if we just want to do some quick exploratory data analysis, or reorder by rating instead of film name.

To make sorting easier, pandas comes with a sort_index() method that sorts a Series by index, and a sort_values() method that sorts a Series by its values. Since the values representing the Rotten Tomatoes scores are integers, sorting by values will return the data in numerically ascending order (low to high).

In both cases, pandas preserves the link between each element's index (film name) and value (score). We call this data alignment, which is a key tenet of pandas that's incredibly important when analyzing data. Pandas allows us to assume the linking will be preserved, unless we specifically change a value or an index.


In [16]:
sorting_by_index = series_custom.sort_index()
sorting_by_values = series_custom.sort_values()
print('sorting_by_index is ',sorting_by_index[0:5])
print('sorting_by_values is ',sorting_by_values[0:5])

('sorting_by_index is ', '71 (2015)                    97
5 Flights Up (2015)           52
A Little Chaos (2015)         40
A Most Violent Year (2014)    90
About Elly (2015)             97
dtype: int64)
('sorting_by_values is ', Paul Blart: Mall Cop 2 (2015)    5
Hitman: Agent 47 (2015)          7
Hot Pursuit (2015)               8
Fantastic Four (2015)            9
Taken 3 (2015)                   9
dtype: int64)
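
Both sorting methods return a new Series and accept an ascending parameter. The following sketch, on a three-film subset with scores copied from the output above, sorts from high to low and confirms that the original Series is left unchanged.

```python
import pandas as pd

s = pd.Series(
    [9, 5, 8],
    index=["Taken 3 (2015)", "Paul Blart: Mall Cop 2 (2015)", "Hot Pursuit (2015)"],
)

# Sort by score, highest first; each label stays attached to its value.
highest_first = s.sort_values(ascending=False)

# Sort alphabetically by film name.
by_title = s.sort_index()

print(highest_first)
print(by_title)
```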


In [17]:
#Normalise column 0-100 to 0-5 scale by dividing by 20
series_normalized = (series_custom/20)

In [18]:
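# Boolean masks: scores above 50 and scores below 75
criteria_one = series_custom > 50
criteria_two = series_custom < 75
# Combine the masks with & and keep only the films that satisfy both
both_criteria = series_custom[criteria_one & criteria_two]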
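
The two cells above rely on vectorized arithmetic and Boolean masks. As a small sketch with hypothetical labels and scores, the snippet below shows the intermediate mask, which is itself a Series of True/False values sharing the original index.

```python
import pandas as pd

# Hypothetical labels and scores, used only for illustration.
s = pd.Series([74, 18, 63], index=["Film A", "Film B", "Film C"])

# Arithmetic applies to every value and keeps the index labels.
normalized = s / 20

# Each comparison yields a Boolean Series; & combines them element by element.
mask = (s > 50) & (s < 75)
filtered = s[mask]

print(normalized)
print(mask)
print(filtered)
```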

Alignment

One of pandas' core tenets is data alignment. Series objects align along indices, and DataFrame objects align along both indices and columns. With Series objects, pandas implicitly preserves the link between the index labels and the values across operations and transformations, unless we explicitly break it. With DataFrame objects, the values link to the index labels and the column labels. Pandas also preserves these links, unless we explicitly break them (by reassigning or editing a column or index label, for example).

This core tenet allows us to use pandas effectively when working with data, and offers a big advantage over using NumPy objects. For Series objects in particular, this means we can use the standard Python arithmetic operators (+, -, *, and /) to add, subtract, multiply, and divide the values at each index label for two different Series objects.

Let's use this functionality to calculate the mean ratings from both critics and users on Rotten Tomatoes.

In [20]:
rt_critics = Series(fandango['RottenTomatoes'].values, index=fandango['FILM'])
rt_users = Series(fandango['RottenTomatoes_User'].values, index=fandango['FILM'])
rt_mean = (rt_critics + rt_users)/2

print(rt_mean.head(10))

FILM
Avengers: Age of Ultron (2015)    80.0
Cinderella (2015)                 82.5
Ant-Man (2015)                    85.0
Do You Believe? (2015)            51.0
Hot Tub Time Machine 2 (2015)     21.0
The Water Diviner (2015)          62.5
Irrational Man (2015)             47.5
Top Five (2014)                   75.0
Shaun the Sheep Movie (2015)      90.5
Love & Mercy (2015)               88.0
dtype: float64
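
In the cell above both Series share the same index in the same order, so the alignment is easy to miss. The sketch below uses hypothetical labels and scores where the two indexes differ in order and content. It illustrates that pandas matches values by label rather than by position, and produces NaN where a label appears in only one Series.

```python
import pandas as pd

# Hypothetical labels and scores; note the different order and the extra labels.
critics = pd.Series([74, 85, 80], index=["Film A", "Film B", "Film C"])
users = pd.Series([80, 86, 90], index=["Film C", "Film A", "Film D"])

# Values are paired by label before the arithmetic is applied.
mean = (critics + users) / 2

print(mean)
```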


Pandas Internals: Dataframes

Shared Indexes

Dataframe objects make it easy to query and interact with many columns. They represent each of these columns as a Series object. We discussed how Series objects work in the previous mission. In this mission, we'll learn how dataframes build on Series objects to provide a powerful data analysis toolkit.

Series objects maintain data alignment between values and their index labels. Because dataframes are basically collections of Series objects, they maintain alignment along both columns and rows.

Pandas dataframes share a row index across columns. By default, this is an integer index. Pandas enforces this shared row index by throwing an error if we read in a CSV file with columns that contain a different number of elements.

Whenever you call a method that returns or prints a dataframe, the index values (such as a sequence of integers) appear in the leftmost column. You can also use the index attribute to access the index values directly. 
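
As a brief sketch of the index attribute, the snippet below builds a tiny dataframe in memory (the rows are hypothetical) and checks that a column selected from it carries the same row index as the dataframe itself.

```python
import pandas as pd

# Hypothetical rows, used only for illustration.
df = pd.DataFrame(
    {"FILM": ["Film A", "Film B", "Film C"], "RottenTomatoes": [74, 85, 80]}
)

# The default index is a range of integers starting at 0.
index_values = list(df.index)

# Every column is a Series that shares the dataframe's row index.
shares_index = df["FILM"].index.equals(df.index)

print(index_values)
print(shares_index)
```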

Using Integer Indexes to Select Rows

In the previous cell, we explored the default integer index that pandas uses for the dataframe. With Series, each unique index value refers to a data value. With dataframes, however, each index value refers to an entire row. We can use the integer index to select rows in a few different ways:

fandango[0:5]            # first five rows
fandango[140:]           # rows from position 140 onward
fandango.iloc[50]        # just the row at position 50
fandango.iloc[[45, 90]]  # just the rows at positions 45 and 90

We use bracket notation to select a slice (contiguous sequence) of rows, just as we would for a list. To select an individual row, however, we'll need to use iloc[], which is an indexer that takes square brackets rather than a method we call with parentheses. It accepts the following objects for selection:

    An integer
    A list of integers
    A slice object
    A Boolean array

When selecting an individual row, pandas will return a Series object. When selecting multiple rows, it will return a subset of the original dataframe as a new dataframe. For example, the following selects the first and last rows:

first_last = fandango.iloc[[0, fandango.shape[0] - 1]]
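
The four selector types listed above can be sketched on a small in-memory dataframe. The rows here are hypothetical, and the negative position in the last line is simply an alternative way to reach the final row.

```python
import pandas as pd

# Hypothetical rows, used only for illustration.
df = pd.DataFrame(
    {
        "FILM": ["Film A", "Film B", "Film C", "Film D"],
        "RottenTomatoes": [74, 85, 80, 18],
    }
)

one_row = df.iloc[2]          # an integer returns a Series
some_rows = df.iloc[[0, 3]]   # a list of integers returns a dataframe
row_slice = df.iloc[1:3]      # a slice excludes the end position

# A Boolean array keeps the rows where the array is True.
mask = (df["RottenTomatoes"] > 75).to_numpy()
masked_rows = df.iloc[mask]

# Negative positions count from the end, so -1 is the last row.
first_last = df.iloc[[0, -1]]

print(type(one_row).__name__)
print(masked_rows)
```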

Using Custom Indexes

The dataframe object has a set_index() method that allows us to pass in the name of the column we want pandas to use as the Dataframe index. By default, pandas will create a new dataframe, index it by the values in the column we specify, then drop that column. The set_index() method has a few parameters that allow us to tweak this behavior:

    inplace: If set to True, this parameter will set the index for the current, "live" dataframe, instead of returning a new dataframe.
    drop: If set to False, this parameter will keep the column we specified as the index, instead of dropping it.
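
The effect of the two parameters can be sketched on a tiny dataframe with hypothetical rows. The snippet shows that the default call drops the column, that drop=False keeps it, and that inplace=True modifies the dataframe and returns None.

```python
import pandas as pd

# Hypothetical rows, used only for illustration.
df = pd.DataFrame({"FILM": ["Film A", "Film B"], "RottenTomatoes": [74, 85]})

# Default: a new dataframe indexed by FILM, with the FILM column dropped.
dropped = df.set_index("FILM")

# drop=False keeps FILM as a regular column as well as the index.
kept = df.set_index("FILM", drop=False)

# inplace=True changes the dataframe itself and returns None.
df_copy = df.copy()
result = df_copy.set_index("FILM", inplace=True)

print(dropped.columns.tolist())
print(kept.columns.tolist())
print(result, df_copy.index.name)
```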


In [21]:
fandango = pd.read_csv('fandango_score_comparison.csv')
fandango_films=fandango.set_index('FILM',drop=False)
print(fandango_films.index)

Index([u'Avengers: Age of Ultron (2015)', u'Cinderella (2015)',
       u'Ant-Man (2015)', u'Do You Believe? (2015)',
       u'Hot Tub Time Machine 2 (2015)', u'The Water Diviner (2015)',
       u'Irrational Man (2015)', u'Top Five (2014)',
       u'Shaun the Sheep Movie (2015)', u'Love & Mercy (2015)',
       ...
       u'The Woman In Black 2 Angel of Death (2015)', u'Danny Collins (2015)',
       u'Spare Parts (2015)', u'Serena (2015)', u'Inside Out (2015)',
       u'Mr. Holmes (2015)', u''71 (2015)', u'Two Days, One Night (2014)',
       u'Gett: The Trial of Viviane Amsalem (2015)',
       u'Kumiko, The Treasure Hunter (2015)'],
      dtype='object', name=u'FILM', length=146)


Using a Custom Index for Selection

Now that we have a custom index, we can select a row by film name instead of row number (which is the default integer index). We can select rows using the custom index by either:

    Using loc[] (the same way we would use iloc[])
    Creating a slice using bracket notation



In [23]:
best_movies=["The Lazarus Effect (2015)","Gett: The Trial of Viviane Amsalem (2015)","Mr. Holmes (2015)"]
best_movies_ever=fandango_films.loc[best_movies]
best_movies_ever

FILM (index),FILM,RottenTomatoes,RottenTomatoes_User,Metacritic,Metacritic_User,IMDB,Fandango_Stars,Fandango_Ratingvalue,RT_norm,RT_user_norm,...,IMDB_norm,RT_norm_round,RT_user_norm_round,Metacritic_norm_round,Metacritic_user_norm_round,IMDB_norm_round,Metacritic_user_vote_count,IMDB_user_vote_count,Fandango_votes,Fandango_Difference
The Lazarus Effect (2015),The Lazarus Effect (2015),14,23,31,4.9,5.2,3.0,3.0,0.7,1.15,...,2.6,0.5,1.0,1.5,2.5,2.5,62,17691,1651,0.0
Gett: The Trial of Viviane Amsalem (2015),Gett: The Trial of Viviane Amsalem (2015),100,81,90,7.3,7.8,3.5,3.5,5.0,4.05,...,3.9,5.0,4.0,4.5,3.5,4.0,19,1955,59,0.0
Mr. Holmes (2015),Mr. Holmes (2015),87,78,67,7.9,7.4,4.0,4.0,4.35,3.9,...,3.7,4.5,4.0,3.5,4.0,3.5,33,7367,1348,0.0
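
The second option, slicing by label, behaves slightly differently from slicing by position: a label slice includes its end label. The sketch below uses a small dataframe with hypothetical film labels to show both forms.

```python
import pandas as pd

# Hypothetical labels and scores, used only for illustration.
df = pd.DataFrame(
    {"RottenTomatoes": [74, 85, 80, 18]},
    index=["Film A", "Film B", "Film C", "Film D"],
)

# Label slices include both endpoints.
via_loc = df.loc["Film B":"Film C"]
via_brackets = df["Film B":"Film C"]

# A positional slice over the same start excludes its end position.
via_iloc = df.iloc[1:2]

print(via_loc)
print(via_iloc)
```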


Apply() Logic Over the Columns in a Dataframe

Recall that a dataframe object represents both rows and columns as Series objects. The apply() method in pandas allows us to specify Python logic that we want to evaluate over the Series objects in a dataframe. Here are some examples of what we can accomplish using the apply() method:

    Calculate the standard deviations for each numeric column
    Lowercase all film names in the FILM column

The apply() method requires us to pass in the vectorized operation we want to apply over each Series object. The method runs over the dataframe's columns by default, but we can use the axis parameter to change this (which we'll do later). If the vectorized operation usually returns a single value (such as the NumPy std() function), it will return a Series object containing the computed value for each column. If it usually returns a value for each element (such as multiplying or dividing by 2), it will transform all of the values and return them as a dataframe.

In the following code cell, we select only the float columns and assign the dataframe containing them to float_df. Then we wrap the NumPy function std() in a lambda function and pass it to the dataframe method apply() to calculate the standard deviation of each column. Under the hood, pandas calls the function once per column, and NumPy computes each result with vectorized operations. The final result is a Series object containing the standard deviation of each column (i.e., each film rating measure).

In [24]:
import numpy as np

# returns the data types as a Series
types = fandango_films.dtypes
# filter data types to just floats; the index attribute returns just the column names
float_columns = types[types.values == 'float64'].index
# use bracket notation to filter columns to just float columns
float_df = fandango_films[float_columns]
# `x` is a Series object representing a column
deviations = float_df.apply(lambda x: np.std(x))

print(deviations)

Metacritic_User               1.505529
IMDB                          0.955447
Fandango_Stars                0.538532
Fandango_Ratingvalue          0.501106
RT_norm                       1.503265
RT_user_norm                  0.997787
Metacritic_norm               0.972522
Metacritic_user_nom           0.752765
IMDB_norm                     0.477723
RT_norm_round                 1.509404
RT_user_norm_round            1.003559
Metacritic_norm_round         0.987561
Metacritic_user_norm_round    0.785412
IMDB_norm_round               0.501043
Fandango_Difference           0.152141
dtype: float64
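
The two return shapes of apply() described above can be sketched on a small dataframe with hypothetical columns a and b. The snippet also compares np.std() with the dataframe's own std() method: NumPy computes the population standard deviation (ddof=0) by default, while pandas defaults to the sample standard deviation (ddof=1), so the two give different numbers.

```python
import numpy as np
import pandas as pd

# Hypothetical columns, used only for illustration.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [2.0, 4.0, 6.0]})

# A function that reduces each column to one value yields a Series.
col_std = df.apply(lambda col: np.std(col))

# A function that returns one value per element yields a dataframe.
halved = df.apply(lambda col: col / 2)

# The built-in method uses ddof=1 unless told otherwise.
sample_std = df.std()

print(col_std)
print(halved)
print(sample_std)
```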


In [25]:
# Passing axis=1 applies the function across each row instead of each column
rt_mt_user = float_df[['RT_user_norm', 'Metacritic_user_nom']]
rt_mt_deviations = rt_mt_user.apply(lambda x: np.std(x), axis=1)
print(rt_mt_deviations[0:5])
# Reuse the same two columns to compute the mean user rating for each film
rt_mt_means = rt_mt_user.apply(lambda x: np.mean(x), axis=1)
print(rt_mt_means[0:5])

FILM
Avengers: Age of Ultron (2015)    0.375
Cinderella (2015)                 0.125
Ant-Man (2015)                    0.225
Do You Believe? (2015)            0.925
Hot Tub Time Machine 2 (2015)     0.150
dtype: float64
FILM
Avengers: Age of Ultron (2015)    3.925
Cinderella (2015)                 3.875
Ant-Man (2015)                    4.275
Do You Believe? (2015)            3.275
Hot Tub Time Machine 2 (2015)     1.550
dtype: float64
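
For common reductions such as the mean, the built-in dataframe methods accept the same axis parameter and may be a simpler alternative to apply(). The sketch below rebuilds the first two rows in memory; the RT_user_norm values come from the head() output earlier, while the Metacritic_user_nom values are an assumption chosen to be consistent with the row means printed above.

```python
import numpy as np
import pandas as pd

# Two rows reconstructed for illustration; Metacritic_user_nom values are assumed.
df = pd.DataFrame(
    {"RT_user_norm": [4.3, 4.0], "Metacritic_user_nom": [3.55, 3.75]},
    index=["Avengers: Age of Ultron (2015)", "Cinderella (2015)"],
)

# Row-wise mean via apply(), as in the cell above.
row_means_apply = df.apply(lambda row: np.mean(row), axis=1)

# The same result from the built-in method.
row_means_builtin = df.mean(axis=1)

# ddof=0 matches NumPy's default for the standard deviation.
row_std_builtin = df.std(axis=1, ddof=0)

print(row_means_builtin)
print(row_std_builtin)
```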
