# Data Indexing & Selection

___Resources___

https://bit.ly/2u8dj6p - pandas documentation - Indexing & Selecting Data

In [1]:
## Base imports

import pandas as pd
import numpy as np
pd.set_option('display.max_columns', 50)  # full option name; the short form is ambiguous in newer pandas

## Data Selection in a `Series`

In [6]:
# Work with our IMDB dataset again - this time using the 'title' column as the DataFrame index

movies = pd.read_csv('./Data/imdb_movies.csv', index_col='title')
movies.head()

title,star_rating,content_rating,genre,duration,actors_list
The Shawshank Redemption,9.3,R,Crime,142,"[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt..."
The Godfather,9.2,R,Crime,175,"[u'Marlon Brando', u'Al Pacino', u'James Caan']"
The Godfather: Part II,9.1,R,Crime,200,"[u'Al Pacino', u'Robert De Niro', u'Robert Duv..."
The Dark Knight,9.0,PG-13,Action,152,"[u'Christian Bale', u'Heath Ledger', u'Aaron E..."
Pulp Fiction,8.9,R,Crime,154,"[u'John Travolta', u'Uma Thurman', u'Samuel L...."


In [7]:
# Selecting a Series - genre
Genre = movies['genre'].copy()

type(Genre)

pandas.core.series.Series

In [8]:
# Let's take a look at the Genre series

Genre.head()

title
The Shawshank Redemption     Crime
The Godfather                Crime
The Godfather: Part II       Crime
The Dark Knight             Action
Pulp Fiction                 Crime
Name: genre, dtype: object

**Similar to a dictionary**, a `Series` object provides a mapping from a collection of keys to a collection of values:

In [9]:
Genre['12 Angry Men']

'Drama'

Likewise, dictionary-like expressions can be used to examine the keys/indices and values:

In [10]:
'Pulp Fiction' in Genre

True

In [11]:
# Alternative: test membership on the index directly
# (Index.contains was removed in pandas 1.0)

'Pulp Fiction' in Genre.index

True
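One subtlety is worth a quick sketch: as with a dictionary, the `in` operator on a `Series` tests the index (the keys), not the values. The toy `Series` below uses made-up titles and is not part of the IMDB data.

```python
import pandas as pd

# Toy Series (illustrative data, not the IMDB dataset)
genre = pd.Series(['Crime', 'Action'], index=['Film A', 'Film B'])

# `in` checks the index labels, just like dictionary keys
label_found = 'Film A' in genre            # True
value_as_key = 'Crime' in genre            # False: 'Crime' is a value, not a label

# To test membership among the values, look at the underlying array
value_found = 'Crime' in genre.to_numpy()  # True

print(label_found, value_as_key, value_found)
```

If the question is "does this value occur anywhere?", checking against the values explicitly avoids a silent `False`.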

In [12]:
Genre.keys()

Index(['The Shawshank Redemption', 'The Godfather', 'The Godfather: Part II',
       'The Dark Knight', 'Pulp Fiction', '12 Angry Men',
       'The Good, the Bad and the Ugly',
       'The Lord of the Rings: The Return of the King', 'Schindler's List',
       'Fight Club',
       ...
       'Law Abiding Citizen', 'Wonder Boys', 'Death at a Funeral',
       'Blue Valentine', 'The Cider House Rules', 'Tootsie',
       'Back to the Future Part III',
       'Master and Commander: The Far Side of the World', 'Poltergeist',
       'Wall Street'],
      dtype='object', name='title', length=979)

In [13]:
list(Genre.items())

[('The Shawshank Redemption', 'Crime'),
 ('The Godfather', 'Crime'),
 ('The Godfather: Part II', 'Crime'),
 ('The Dark Knight', 'Action'),
 ('Pulp Fiction', 'Crime'),
 ('12 Angry Men', 'Drama'),
 ('The Good, the Bad and the Ugly', 'Western'),
 ('The Lord of the Rings: The Return of the King', 'Adventure'),
 ("Schindler's List", 'Biography'),
 ('Fight Club', 'Drama'),
 ('The Lord of the Rings: The Fellowship of the Ring', 'Adventure'),
 ('Inception', 'Action'),
 ('Star Wars: Episode V - The Empire Strikes Back', 'Action'),
 ('Forrest Gump', 'Drama'),
 ('The Lord of the Rings: The Two Towers', 'Adventure'),
 ('Interstellar', 'Adventure'),
 ("One Flew Over the Cuckoo's Nest", 'Drama'),
 ('Seven Samurai', 'Drama'),
 ('Goodfellas', 'Biography'),
 ('Star Wars', 'Action'),
 ('The Matrix', 'Action'),
 ('City of God', 'Crime'),
 ("It's a Wonderful Life", 'Drama'),
 ('The Usual Suspects', 'Crime'),
 ('Se7en', 'Drama'),
 ('Life Is Beautiful', 'Comedy'),
 ('Once Upon a Time in the West', 'Western'),
 ...]

`Series` objects can even be modified with a dictionary-like syntax. A `Series` can be extended by assigning to a new index value:

In [14]:
Genre['Blue Streak'] = 'Comedy'

In [15]:
Genre.tail()

title
Back to the Future Part III                        Adventure
Master and Commander: The Far Side of the World       Action
Poltergeist                                           Horror
Wall Street                                            Crime
Blue Streak                                           Comedy
Name: genre, dtype: object

**This easy mutability of the objects abstracts decisions about memory layout and data copying away from the user.**

A `Series` object also has **array-style** item selection - slicing, masking and fancy indexing.

In [16]:
# Slicing by explicit index - Note how the Blue Streak index is included

Genre['Poltergeist':'Blue Streak']

title
Poltergeist    Horror
Wall Street     Crime
Blue Streak    Comedy
Name: genre, dtype: object

In [17]:
# Slicing by implicit integer index - Note that the 10th index is excluded

Genre[0:10]

title
The Shawshank Redemption                             Crime
The Godfather                                        Crime
The Godfather: Part II                               Crime
The Dark Knight                                     Action
Pulp Fiction                                         Crime
12 Angry Men                                         Drama
The Good, the Bad and the Ugly                     Western
The Lord of the Rings: The Return of the King    Adventure
Schindler's List                                 Biography
Fight Club                                           Drama
Name: genre, dtype: object

In [18]:
# Boolean indexing

Genre[Genre == 'Sci-Fi']

title
Blade Runner                     Sci-Fi
Brazil                           Sci-Fi
Gravity                          Sci-Fi
The Day the Earth Stood Still    Sci-Fi
The Butterfly Effect             Sci-Fi
Name: genre, dtype: object

Boolean indexing uses a boolean vector to filter the data: pandas keeps the rows where the vector is True and drops the rows where it is False. The operators are `|` for or, `&` for and, and `~` for not. Each comparison __must__ be wrapped in parentheses, because these operators bind more tightly than comparisons such as `==` and `>`.

In [19]:
Genre[(Genre == 'Sci-Fi')|(Genre == 'Film-Noir')]

title
The Third Man                    Film-Noir
Blade Runner                        Sci-Fi
Laura                            Film-Noir
Brazil                              Sci-Fi
Gravity                             Sci-Fi
The Day the Earth Stood Still       Sci-Fi
The Butterfly Effect                Sci-Fi
Spellbound                       Film-Noir
Name: genre, dtype: object
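The or case is shown above; the and and not operators follow the same pattern. A minimal sketch on a made-up `DataFrame` (the titles, genres and durations are illustrative, not taken from the IMDB data):

```python
import pandas as pd

# Toy data (illustrative values only)
films = pd.DataFrame(
    {'genre': ['Sci-Fi', 'Drama', 'Sci-Fi', 'Film-Noir'],
     'duration': [117, 96, 91, 104]},
    index=['A', 'B', 'C', 'D'],
)

# and: both conditions must hold; each comparison sits in its own parentheses
long_scifi = films[(films['genre'] == 'Sci-Fi') & (films['duration'] > 100)]

# not: invert a boolean mask with ~
not_scifi = films[~(films['genre'] == 'Sci-Fi')]

print(list(long_scifi.index))
print(list(not_scifi.index))
```

Without the parentheses, Python would try to evaluate `'Sci-Fi' & films['duration']` first, which is not the intended comparison.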

In [20]:
# Fancy Indexing

Genre[['Fight Club', 'City of God']]

title
Fight Club     Drama
City of God    Crime
Name: genre, dtype: object

**Question** - What could be a potential problem with being able to select by both the implicit index and the explicit index?

### Indexer attributes - loc and iloc

`loc` -  gets rows (or columns) with particular **labels** from the index  
`iloc` -  gets rows (or columns) at particular **positions** in the index

**Note - these were introduced because of the confusion in the case of integer indexes**
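To make the integer-index confusion concrete, here is a small sketch. The `Series` is a toy example whose integer labels deliberately do not match the positions.

```python
import pandas as pd

# Toy Series with an integer index that does not start at 0 (illustrative)
s = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

by_label = s.loc[1]           # label 1    -> 'a'
by_position = s.iloc[1]       # position 1 -> 'b'

label_slice = s.loc[1:3]      # labels 1 through 3, end label included
position_slice = s.iloc[1:3]  # positions 1 and 2, end position excluded

print(by_label, by_position)
print(list(label_slice), list(position_slice))
```

The same number means two different things depending on the indexer, which is why spelling out `loc` or `iloc` is safer than relying on plain `[]`.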

In [21]:
# loc attribute allows indexing and slicing that always references the explicit index

Genre.loc['Blade Runner']

'Sci-Fi'

In [22]:
Genre.loc['Pulp Fiction': 'Fight Club']

title
Pulp Fiction                                         Crime
12 Angry Men                                         Drama
The Good, the Bad and the Ugly                     Western
The Lord of the Rings: The Return of the King    Adventure
Schindler's List                                 Biography
Fight Club                                           Drama
Name: genre, dtype: object

In [23]:
# iloc attribute allows indexing and slicing that always references the implicit index

Genre.iloc[145] # Genre.index.get_loc('Blade Runner')

'Sci-Fi'

In [24]:
Genre.iloc[4:10]

title
Pulp Fiction                                         Crime
12 Angry Men                                         Drama
The Good, the Bad and the Ugly                     Western
The Lord of the Rings: The Return of the King    Adventure
Schindler's List                                 Biography
Fight Club                                           Drama
Name: genre, dtype: object

## Data Selection in a `DataFrame`

**Remember** - A `DataFrame` can be seen to act like a dictionary of `Series` structures sharing the same index.
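A minimal sketch of that dictionary analogy, using two made-up `Series` that share an index (the names and values are illustrative only):

```python
import pandas as pd

# Two toy Series sharing the same index (illustrative values)
rating = pd.Series([9.3, 9.2], index=['Film A', 'Film B'])
genre = pd.Series(['Crime', 'Drama'], index=['Film A', 'Film B'])

# Column names play the role of dictionary keys; each Series is a value
df = pd.DataFrame({'star_rating': rating, 'genre': genre})

has_genre = 'genre' in df        # membership tests the column names
column_names = list(df.keys())   # same labels as df.columns
genre_column = df['genre']       # key lookup returns a Series

print(has_genre, column_names, type(genre_column).__name__)
```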

In [25]:
movies.head()

title,star_rating,content_rating,genre,duration,actors_list
The Shawshank Redemption,9.3,R,Crime,142,"[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt..."
The Godfather,9.2,R,Crime,175,"[u'Marlon Brando', u'Al Pacino', u'James Caan']"
The Godfather: Part II,9.1,R,Crime,200,"[u'Al Pacino', u'Robert De Niro', u'Robert Duv..."
The Dark Knight,9.0,PG-13,Action,152,"[u'Christian Bale', u'Heath Ledger', u'Aaron E..."
Pulp Fiction,8.9,R,Crime,154,"[u'John Travolta', u'Uma Thurman', u'Samuel L...."


### How to select a single column of data as a `Series`

In [26]:
# Using square bracket notation - the most common indexing operator
# Passing a single string to the DataFrame indexing operator selects that column

movies['genre']

title
The Shawshank Redemption                                 Crime
The Godfather                                            Crime
The Godfather: Part II                                   Crime
The Dark Knight                                         Action
Pulp Fiction                                             Crime
12 Angry Men                                             Drama
The Good, the Bad and the Ugly                         Western
The Lord of the Rings: The Return of the King        Adventure
Schindler's List                                     Biography
Fight Club                                               Drama
The Lord of the Rings: The Fellowship of the Ring    Adventure
Inception                                               Action
Star Wars: Episode V - The Empire Strikes Back          Action
Forrest Gump                                             Drama
The Lord of the Rings: The Two Towers                Adventure
Interstellar                                     

In [27]:
# Alternative method - using dot notation and accessing an attribute

movies.genre

title
The Shawshank Redemption                                 Crime
The Godfather                                            Crime
The Godfather: Part II                                   Crime
The Dark Knight                                         Action
Pulp Fiction                                             Crime
12 Angry Men                                             Drama
The Good, the Bad and the Ugly                         Western
The Lord of the Rings: The Return of the King        Adventure
Schindler's List                                     Biography
Fight Club                                               Drama
The Lord of the Rings: The Fellowship of the Ring    Adventure
Inception                                               Action
Star Wars: Episode V - The Empire Strikes Back          Action
Forrest Gump                                             Drama
The Lord of the Rings: The Two Towers                Adventure
Interstellar                                     

**Dot notation is not best practice!**

-  Column names containing spaces or special characters cannot be accessed with dot notation.
-  Column names that clash with `DataFrame` methods or attributes (such as `head`) return the method, not the column.

However, dot notation is still very commonly used ... because laziness!

In [28]:
# Like the Series objects above, a DataFrame can be modified with dictionary-style syntax
# Example - add a new column called 'head', a name that clashes with the DataFrame.head method

movies['head'] = np.random.randint(0, 1000, movies.shape[0])
movies.head()

title,star_rating,content_rating,genre,duration,actors_list,head
The Shawshank Redemption,9.3,R,Crime,142,"[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...",977
The Godfather,9.2,R,Crime,175,"[u'Marlon Brando', u'Al Pacino', u'James Caan']",664
The Godfather: Part II,9.1,R,Crime,200,"[u'Al Pacino', u'Robert De Niro', u'Robert Duv...",465
The Dark Knight,9.0,PG-13,Action,152,"[u'Christian Bale', u'Heath Ledger', u'Aaron E...",452
Pulp Fiction,8.9,R,Crime,154,"[u'John Travolta', u'Uma Thurman', u'Samuel L....",617


In [29]:
# check for equality between object returned by dot notation and by square brackets

movies.head is movies['head']

False
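Both dot-notation pitfalls can be reproduced on a tiny frame. This is a sketch with made-up column names and values, chosen only to trigger the two cases:

```python
import pandas as pd

# Toy frame: one column name contains a space, another shadows a DataFrame method
df = pd.DataFrame({'star rating': [9.3, 9.2], 'head': [1, 2]})

# `df.star rating` would be a SyntaxError, so brackets are the only option
ratings = df['star rating']

# df.head resolves to the DataFrame.head method, not the 'head' column
dot_is_method = callable(df.head)
bracket_is_series = isinstance(df['head'], pd.Series)

print(dot_is_method, bracket_is_series)
```

Square brackets behave the same way for every column name, which is the main argument for preferring them.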

We can also view the `DataFrame` as an enhanced two-dimensional array.
We can examine the raw underlying data array using the `values` attribute:

In [30]:
movies.values

array([[9.3, 'R', 'Crime', 142,
        "[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunton']", 977],
       [9.2, 'R', 'Crime', 175,
        "[u'Marlon Brando', u'Al Pacino', u'James Caan']", 664],
       [9.1, 'R', 'Crime', 200,
        "[u'Al Pacino', u'Robert De Niro', u'Robert Duvall']", 465],
       ...,
       [7.4, 'PG-13', 'Action', 138,
        "[u'Russell Crowe', u'Paul Bettany', u'Billy Boyd']", 583],
       [7.4, 'PG', 'Horror', 114,
        '[u\'JoBeth Williams\', u"Heather O\'Rourke", u\'Craig T. Nelson\']',
        211],
       [7.4, 'R', 'Crime', 126,
        "[u'Charlie Sheen', u'Michael Douglas', u'Tamara Tunie']", 547]],
      dtype=object)

This allows many familiar array-like operations to be performed on the `DataFrame` itself:

In [31]:
# Example = Transpose a dataframe

movies.T

title,The Shawshank Redemption,The Godfather,The Godfather: Part II,The Dark Knight,Pulp Fiction,12 Angry Men,"The Good, the Bad and the Ugly",The Lord of the Rings: The Return of the King,Schindler's List,Fight Club,The Lord of the Rings: The Fellowship of the Ring,Inception,Star Wars: Episode V - The Empire Strikes Back,Forrest Gump,The Lord of the Rings: The Two Towers,Interstellar,One Flew Over the Cuckoo's Nest,Seven Samurai,Goodfellas,Star Wars,The Matrix,City of God,It's a Wonderful Life,The Usual Suspects,Se7en,...,X-Men,Zero Dark Thirty,Manhattan Murder Mystery,National Lampoon's Vacation,My Sister's Keeper,Deconstructing Harry,The Way Way Back,Capote,Driving Miss Daisy,La Femme Nikita,Lincoln,Limitless,The Simpsons Movie,The Rock,The English Patient,Law Abiding Citizen,Wonder Boys,Death at a Funeral,Blue Valentine,The Cider House Rules,Tootsie,Back to the Future Part III,Master and Commander: The Far Side of the World,Poltergeist,Wall Street
star_rating,9.3,9.2,9.1,9,8.9,8.9,8.9,8.9,8.9,8.9,8.8,8.8,8.8,8.8,8.8,8.7,8.7,8.7,8.7,8.7,8.7,8.7,8.7,8.7,8.7,...,7.4,7.4,7.4,7.4,7.4,7.4,7.4,7.4,7.4,7.4,7.4,7.4,7.4,7.4,7.4,7.4,7.4,7.4,7.4,7.4,7.4,7.4,7.4,7.4,7.4
content_rating,R,R,R,PG-13,R,NOT RATED,NOT RATED,PG-13,R,R,PG-13,PG-13,PG,PG-13,PG-13,PG-13,R,UNRATED,R,PG,R,R,APPROVED,R,R,...,PG-13,R,PG,R,PG-13,R,PG-13,R,PG,R,PG-13,PG-13,PG-13,R,R,R,R,R,NC-17,PG-13,PG,PG,PG-13,PG,R
genre,Crime,Crime,Crime,Action,Crime,Drama,Western,Adventure,Biography,Drama,Adventure,Action,Action,Drama,Adventure,Adventure,Drama,Drama,Biography,Action,Action,Crime,Drama,Crime,Drama,...,Action,Drama,Comedy,Comedy,Drama,Comedy,Comedy,Biography,Comedy,Action,Biography,Mystery,Animation,Action,Drama,Crime,Drama,Comedy,Drama,Drama,Comedy,Adventure,Action,Horror,Crime
duration,142,175,200,152,154,96,161,201,195,139,178,148,124,142,179,169,133,207,146,121,136,130,130,106,127,...,104,157,104,98,109,96,103,114,99,118,150,105,87,136,162,109,107,90,112,126,116,118,138,114,126
actors_list,"[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...","[u'Marlon Brando', u'Al Pacino', u'James Caan']","[u'Al Pacino', u'Robert De Niro', u'Robert Duv...","[u'Christian Bale', u'Heath Ledger', u'Aaron E...","[u'John Travolta', u'Uma Thurman', u'Samuel L....","[u'Henry Fonda', u'Lee J. Cobb', u'Martin Bals...","[u'Clint Eastwood', u'Eli Wallach', u'Lee Van ...","[u'Elijah Wood', u'Viggo Mortensen', u'Ian McK...","[u'Liam Neeson', u'Ralph Fiennes', u'Ben Kings...","[u'Brad Pitt', u'Edward Norton', u'Helena Bonh...","[u'Elijah Wood', u'Ian McKellen', u'Orlando Bl...","[u'Leonardo DiCaprio', u'Joseph Gordon-Levitt'...","[u'Mark Hamill', u'Harrison Ford', u'Carrie Fi...","[u'Tom Hanks', u'Robin Wright', u'Gary Sinise']","[u'Elijah Wood', u'Ian McKellen', u'Viggo Mort...","[u'Matthew McConaughey', u'Anne Hathaway', u'J...","[u'Jack Nicholson', u'Louise Fletcher', u'Mich...","[u'Toshir\xf4 Mifune', u'Takashi Shimura', u'K...","[u'Robert De Niro', u'Ray Liotta', u'Joe Pesci']","[u'Mark Hamill', u'Harrison Ford', u'Carrie Fi...","[u'Keanu Reeves', u'Laurence Fishburne', u'Car...","[u'Alexandre Rodrigues', u'Matheus Nachtergael...","[u'James Stewart', u'Donna Reed', u'Lionel Bar...","[u'Kevin Spacey', u'Gabriel Byrne', u'Chazz Pa...","[u'Morgan Freeman', u'Brad Pitt', u'Kevin Spac...",...,"[u'Patrick Stewart', u'Hugh Jackman', u'Ian Mc...","[u'Jessica Chastain', u'Joel Edgerton', u'Chri...","[u'Woody Allen', u'Diane Keaton', u'Jerry Adler']","[u'Chevy Chase', u""Beverly D'Angelo"", u'Imogen...","[u'Cameron Diaz', u'Abigail Breslin', u'Alec B...","[u'Woody Allen', u'Judy Davis', u'Julia Louis-...","[u'Steve Carell', u'Toni Collette', u'Allison ...","[u'Philip Seymour Hoffman', u'Clifton Collins ...","[u'Morgan Freeman', u'Jessica Tandy', u'Dan Ay...","[u'Anne Parillaud', u'Marc Duret', u'Patrick F...","[u'Daniel Day-Lewis', u'Sally Field', u'David ...","[u'Bradley Cooper', u'Anna Friel', u'Abbie Cor...","[u'Dan Castellaneta', u'Julie Kavner', u'Nancy...","[u'Sean 
Connery', u'Nicolas Cage', u'Ed Harris']","[u'Ralph Fiennes', u'Juliette Binoche', u'Will...","[u'Gerard Butler', u'Jamie Foxx', u'Leslie Bibb']","[u'Michael Douglas', u'Tobey Maguire', u'Franc...","[u'Matthew Macfadyen', u'Peter Dinklage', u'Ew...","[u'Ryan Gosling', u'Michelle Williams', u'John...","[u'Tobey Maguire', u'Charlize Theron', u'Micha...","[u'Dustin Hoffman', u'Jessica Lange', u'Teri G...","[u'Michael J. Fox', u'Christopher Lloyd', u'Ma...","[u'Russell Crowe', u'Paul Bettany', u'Billy Bo...","[u'JoBeth Williams', u""Heather O'Rourke"", u'Cr...","[u'Charlie Sheen', u'Michael Douglas', u'Tamar..."
head,977,664,465,452,617,42,951,663,168,253,247,563,726,431,833,535,15,170,954,773,673,947,968,413,870,...,869,430,336,660,979,368,928,846,624,167,753,161,584,377,539,935,129,898,177,259,805,866,583,211,547


For array-style indexing, we use the `loc` and `iloc` indexers mentioned earlier.

In [32]:
# Remove the previously created head column

del movies['head']

In [33]:
# For label specific indexing using the explicit index and column names

movies.loc[['The Godfather', 'Pulp Fiction'],['genre', 'duration']]

title,genre,duration
The Godfather,Crime,175
Pulp Fiction,Crime,154


In [34]:
# For non label specific indexing using the implicit index

movies.iloc[:4, 1:5]

# Note how the index and column names are maintained

title,content_rating,genre,duration,actors_list
The Shawshank Redemption,R,Crime,142,"[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt..."
The Godfather,R,Crime,175,"[u'Marlon Brando', u'Al Pacino', u'James Caan']"
The Godfather: Part II,R,Crime,200,"[u'Al Pacino', u'Robert De Niro', u'Robert Duv..."
The Dark Knight,PG-13,Action,152,"[u'Christian Bale', u'Heath Ledger', u'Aaron E..."


Any of the previously mentioned `Series`-style data access patterns can be used within these indexers.
For example, in the `loc` indexer we can combine masking and fancy indexing:

In [35]:
movies.loc[movies.duration > 220, ['duration', 'star_rating']]

title,duration,star_rating
Once Upon a Time in America,229,8.4
Lagaan: Once Upon a Time in India,224,8.3
Gone with the Wind,238,8.2
Hamlet,242,7.8


## Recipe: Constructing multiple boolean conditions

Building a precise filter for your dataset often means combining several boolean expressions to extract an exact subset.

**Aim:** Find all movies that have a **star_rating** over 8.5 in the Action or Adventure genres.

**Solution:** Construct and combine multiple boolean expressions

In [36]:
# Usually we would load in our dataset, but we already have it in the variable movies

movies.head()

title,star_rating,content_rating,genre,duration,actors_list
The Shawshank Redemption,9.3,R,Crime,142,"[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt..."
The Godfather,9.2,R,Crime,175,"[u'Marlon Brando', u'Al Pacino', u'James Caan']"
The Godfather: Part II,9.1,R,Crime,200,"[u'Al Pacino', u'Robert De Niro', u'Robert Duv..."
The Dark Knight,9.0,PG-13,Action,152,"[u'Christian Bale', u'Heath Ledger', u'Aaron E..."
Pulp Fiction,8.9,R,Crime,154,"[u'John Travolta', u'Uma Thurman', u'Samuel L...."


In [37]:
# Create a variable to hold each set of criteria independently as a boolean Series

star_8 = movies['star_rating'] > 8.5
action_adventure = (movies['genre'] == 'Action') | (movies['genre'] == 'Adventure')

# Note the use of parentheses and the pandas or operator (|)

In [38]:
# Combine all the criteria together into a single boolean Series:

criteria = star_8 & action_adventure

criteria.head()

# Note the use of the pandas and operator (&)

title
The Shawshank Redemption    False
The Godfather               False
The Godfather: Part II      False
The Dark Knight              True
Pulp Fiction                False
dtype: bool

In [39]:
# Once you have your boolean Series, you simply pass it to the indexing operator to filter the data:

movies[criteria].head()

title,star_rating,content_rating,genre,duration,actors_list
The Dark Knight,9.0,PG-13,Action,152,"[u'Christian Bale', u'Heath Ledger', u'Aaron E..."
The Lord of the Rings: The Return of the King,8.9,PG-13,Adventure,201,"[u'Elijah Wood', u'Viggo Mortensen', u'Ian McK..."
The Lord of the Rings: The Fellowship of the Ring,8.8,PG-13,Adventure,178,"[u'Elijah Wood', u'Ian McKellen', u'Orlando Bl..."
Inception,8.8,PG-13,Action,148,"[u'Leonardo DiCaprio', u'Joseph Gordon-Levitt'..."
Star Wars: Episode V - The Empire Strikes Back,8.8,PG,Action,124,"[u'Mark Hamill', u'Harrison Ford', u'Carrie Fi..."


In [40]:
movies.loc[criteria, ['star_rating', 'genre']]

title,star_rating,genre
The Dark Knight,9.0,Action
The Lord of the Rings: The Return of the King,8.9,Adventure
The Lord of the Rings: The Fellowship of the Ring,8.8,Adventure
Inception,8.8,Action
Star Wars: Episode V - The Empire Strikes Back,8.8,Action
The Lord of the Rings: The Two Towers,8.8,Adventure
Interstellar,8.7,Adventure
Star Wars,8.7,Action
The Matrix,8.7,Action
Saving Private Ryan,8.6,Action


## Exercises
***

In [41]:
# We will cover methods in a future section but to help with these exercises the 'unique' method can be called on a Series 
# to understand the unique values that it contains

movies.genre.unique()

array(['Crime', 'Action', 'Drama', 'Western', 'Adventure', 'Biography',
       'Comedy', 'Animation', 'Mystery', 'Horror', 'Film-Noir', 'Sci-Fi',
       'History', 'Thriller', 'Family', 'Fantasy'], dtype=object)

In [42]:
movies.head()

title,star_rating,content_rating,genre,duration,actors_list
The Shawshank Redemption,9.3,R,Crime,142,"[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt..."
The Godfather,9.2,R,Crime,175,"[u'Marlon Brando', u'Al Pacino', u'James Caan']"
The Godfather: Part II,9.1,R,Crime,200,"[u'Al Pacino', u'Robert De Niro', u'Robert Duv..."
The Dark Knight,9.0,PG-13,Action,152,"[u'Christian Bale', u'Heath Ledger', u'Aaron E..."
Pulp Fiction,8.9,R,Crime,154,"[u'John Travolta', u'Uma Thurman', u'Samuel L...."


__1)__ You are a parent with four children under 6 who were promised to be able to watch a film before bedtime. The film has to be content rated PG and within the 'Family' genre. What film do you pick?

__2)__ You are a child who has been given permission to watch __any__ film as long as it is age appropriate (content rating PG or PG-13). You would also like to stay up as late as possible, so you want the longest film available (at least 3 hours). What do you pick?

__3)__ What's the best R rated film from the movies `DataFrame` that isn't in the Drama, Comedy, Action or Crime genres? Note - the movies `DataFrame` is already sorted by star_rating.

__Advanced Questions__

__4)__ Select all movies in the `DataFrame` that have a title that starts with 'The'.  
**Help** - [**`String Methods`**](https://pandas.pydata.org/pandas-docs/stable/text.html)

# Potential Pitfalls

## SettingWithCopyWarning

In [49]:
# examine the DataFrame rows that contain missing values
movies[movies.content_rating.isnull()]

title,star_rating,content_rating,genre,duration,actors_list
Butch Cassidy and the Sundance Kid,8.2,,Biography,110,"[u'Paul Newman', u'Robert Redford', u'Katharin..."
Where Eagles Dare,7.7,,Action,158,"[u'Richard Burton', u'Clint Eastwood', u'Mary ..."
True Grit,7.4,,Adventure,128,"[u'John Wayne', u'Kim Darby', u'Glen Campbell']"


In [50]:
# examine the unique values in the 'content_rating' Series
movies.content_rating.value_counts()

R            460
PG-13        189
PG           123
NOT RATED     65
APPROVED      47
UNRATED       38
G             32
PASSED         7
NC-17          7
X              4
GP             3
TV-MA          1
Name: content_rating, dtype: int64

**Aim:** Mark the 'NOT RATED' values as missing values, represented by 'NaN'.

In [51]:
# first, locate the relevant rows
movies[movies.content_rating=='NOT RATED'].head()

title,star_rating,content_rating,genre,duration,actors_list
12 Angry Men,8.9,NOT RATED,Drama,96,"[u'Henry Fonda', u'Lee J. Cobb', u'Martin Bals..."
"The Good, the Bad and the Ugly",8.9,NOT RATED,Western,161,"[u'Clint Eastwood', u'Eli Wallach', u'Lee Van ..."
Sunset Blvd.,8.5,NOT RATED,Drama,110,"[u'William Holden', u'Gloria Swanson', u'Erich..."
M,8.4,NOT RATED,Crime,99,"[u'Peter Lorre', u'Ellen Widmann', u'Inge Land..."
Munna Bhai M.B.B.S.,8.4,NOT RATED,Comedy,156,"[u'Sunil Dutt', u'Sanjay Dutt', u'Arshad Warsi']"


In [52]:
# then, select the 'content_rating' Series from those rows
movies[movies.content_rating=='NOT RATED'].content_rating.head()

title
12 Angry Men                      NOT RATED
The Good, the Bad and the Ugly    NOT RATED
Sunset Blvd.                      NOT RATED
M                                 NOT RATED
Munna Bhai M.B.B.S.               NOT RATED
Name: content_rating, dtype: object

In [53]:
# finally, replace the 'NOT RATED' values with 'NaN'

movies[movies.content_rating=='NOT RATED'].content_rating = np.nan

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self[name] = value


**Problem:** That statement involves two operations, a **`__getitem__`** and a **`__setitem__`**. Pandas can't guarantee whether the **`__getitem__`** operation returns a view or a copy of the data.

- If **`__getitem__`** returns a view of the data, **`__setitem__`** will affect the 'movies' DataFrame.
- But if **`__getitem__`** returns a copy of the data, **`__setitem__`** will not affect the 'movies' DataFrame.

**Solution:** Use the **`loc`** indexer, which replaces the 'NOT RATED' values in a single **`__setitem__`** operation.

In [54]:
# replace the 'NOT RATED' values with 'NaN' (does not cause a SettingWithCopyWarning)
movies.loc[movies.content_rating=='NOT RATED', 'content_rating'] = np.nan

In [55]:
# this time, the 'content_rating' Series has changed
movies.content_rating.isnull().sum()

68
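The difference between the two-step and one-step assignment can be checked on a toy frame. This is a sketch on made-up data; it silences warnings only so the demo runs quietly, and it assumes boolean-mask selection returns a new object, which is the behaviour described in the problem statement above.

```python
import warnings

import numpy as np
import pandas as pd

# Toy data (illustrative values only)
df = pd.DataFrame({'content_rating': ['R', 'NOT RATED', 'PG']},
                  index=['Film A', 'Film B', 'Film C'])
mask = df['content_rating'] == 'NOT RATED'

with warnings.catch_warnings():
    warnings.simplefilter('ignore')  # silence the warning for this demo only
    # Two steps: df[mask] builds a new object, then the assignment hits that object
    df[mask].content_rating = np.nan
missing_after_chained = int(df['content_rating'].isnull().sum())

# One step: a single .loc assignment acts on df itself
df.loc[mask, 'content_rating'] = np.nan
missing_after_loc = int(df['content_rating'].isnull().sum())

print(missing_after_chained, missing_after_loc)
```

The chained form leaves `df` untouched here, while the `loc` form updates it.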

### Second Instance

In [56]:
# create a DataFrame only containing movies with a high 'star_rating'
top_movies = movies.loc[movies.star_rating >= 9, :]
top_movies

title,star_rating,content_rating,genre,duration,actors_list
The Shawshank Redemption,9.3,R,Crime,142,"[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt..."
The Godfather,9.2,R,Crime,175,"[u'Marlon Brando', u'Al Pacino', u'James Caan']"
The Godfather: Part II,9.1,R,Crime,200,"[u'Al Pacino', u'Robert De Niro', u'Robert Duv..."
The Dark Knight,9.0,PG-13,Action,152,"[u'Christian Bale', u'Heath Ledger', u'Aaron E..."


In [57]:
# overwrite the relevant cell with the correct duration
top_movies.loc['The Shawshank Redemption', 'duration'] = 150

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s


**Problem:** pandas isn't sure whether 'top_movies' is a view or a copy of 'movies'.

In [58]:
# 'top_movies' DataFrame has been updated
top_movies

title,star_rating,content_rating,genre,duration,actors_list
The Shawshank Redemption,9.3,R,Crime,150,"[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt..."
The Godfather,9.2,R,Crime,175,"[u'Marlon Brando', u'Al Pacino', u'James Caan']"
The Godfather: Part II,9.1,R,Crime,200,"[u'Al Pacino', u'Robert De Niro', u'Robert Duv..."
The Dark Knight,9.0,PG-13,Action,152,"[u'Christian Bale', u'Heath Ledger', u'Aaron E..."


In [59]:
# 'movies' DataFrame has not been updated
movies.head(1)

title,star_rating,content_rating,genre,duration,actors_list
The Shawshank Redemption,9.3,R,Crime,142,"[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt..."


**Solution:** Whenever you intend to work on an independent copy of a DataFrame, create it explicitly with the `copy` method.

In [63]:
# explicitly create a copy of 'movies'
top_movies = movies.loc[movies.star_rating >= 9, :].copy()

In [64]:
# pandas now knows that you are updating a copy instead of a view (does not cause a SettingWithCopyWarning)
top_movies.loc['The Shawshank Redemption', 'duration'] = 150

In [65]:
# 'top_movies' DataFrame has been updated
top_movies

title,star_rating,content_rating,genre,duration,actors_list
The Shawshank Redemption,9.3,R,Crime,150,"[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt..."
The Godfather,9.2,R,Crime,175,"[u'Marlon Brando', u'Al Pacino', u'James Caan']"
The Godfather: Part II,9.1,R,Crime,200,"[u'Al Pacino', u'Robert De Niro', u'Robert Duv..."
The Dark Knight,9.0,PG-13,Action,152,"[u'Christian Bale', u'Heath Ledger', u'Aaron E..."


# Recap
***

1. `Series` and `DataFrame` objects can be manipulated in a similar fashion to both dictionaries and arrays. The main indexing operator is the square bracket (`[]`).


2. Pandas data structures have both implicit and explicit indices, which can be used for data selection.  


3. The `loc` and `iloc` indexers are considered best practice both for selecting data and for assigning new values to a `DataFrame` or `Series`.

    -  `loc` provides label based indexing
    -  `iloc` provides position based indexing


4. Boolean indexing with multiple criteria can be performed with both `loc` and `[]` indexing.


5. A `SettingWithCopyWarning` is raised when pandas cannot tell whether an assignment targets a view or a copy of the data. Best practice is to assign with `loc` in a single step and to call `copy` explicitly when an independent copy is required.
