In [4]:
# DATA INDEXING AND SELECTION

# Data Selection in Series
# A Series object acts in many ways like a one-dimensional NumPy array,
# and in many ways like a standard Python dictionary. Keeping these two overlapping
# analogies in mind helps us understand the patterns of data indexing and selection in Series.

# Series as dictionary
# Like a dictionary, the Series object provides a mapping from a collection of
# keys to a collection of values:
import pandas as pd
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                index=['a', 'b', 'c', 'd'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [2]:
data['b']

0.5

In [3]:
# We can use dictionary-like Python expressions and methods to examine
# the keys/indices and values:
'a' in data

True

In [4]:
data.keys()

Index(['a', 'b', 'c', 'd'], dtype='object')

In [5]:
list(data.items())

[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]
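
The dictionary analogy can be checked directly by building a Series from a plain dict and comparing the two. This is a minimal sketch; the names `prices` and `series` are illustrative and not part of the notebook above.

```python
import pandas as pd

# Hypothetical data mirroring the Series used above.
prices = {'a': 0.25, 'b': 0.5, 'c': 0.75, 'd': 1.0}
series = pd.Series(prices)

# Membership, keys, and items behave the same way on both objects.
same_membership = ('a' in prices) == ('a' in series)
same_keys = list(prices.keys()) == list(series.keys())
same_items = list(prices.items()) == list(series.items())

# As with a dict, `in` tests the keys (index labels), not the values.
value_is_member = 0.25 in series
```

Here `value_is_member` is false because 0.25 is a value, not an index label.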

In [6]:
# Series objects can even be modified with a dictionary-like syntax.
# Just as you can extend a dictionary by assigning to a new key, you can extend a Series
# by assigning to a new index value:
data['e'] = 1.25
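
As a small sketch of this behaviour, assigning to a new label grows the Series, while assigning to an existing label overwrites the stored value. The variables `before` and `after` are illustrative only.

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
before = len(data)

data['e'] = 1.25   # new label: the Series grows by one entry
data['a'] = 0.3    # existing label: the value is overwritten

after = len(data)
```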

In [8]:
# Series as one-dimensional array
# A Series builds on this dictionary-like interface and provides array-style item selection
# via the same basic mechanisms as NumPy arrays: slices, masking, and fancy indexing.

# slicing by explicit index
data['a':'c']

a    0.25
b    0.50
c    0.75
dtype: float64

In [9]:
# slicing by implicit integer index
data[0:2]

a    0.25
b    0.50
dtype: float64

In [10]:
# masking
data[(data > 0.3) & (data < 0.8)]

b    0.50
c    0.75
dtype: float64

In [11]:
# fancy indexing
data[['a', 'e']]

a    0.25
e    1.25
dtype: float64

In [5]:
# Slicing may be the most confusing of these. When you slice with an explicit index
# (e.g., data['a':'c']), the final index is included in the slice, while when you slice
# with an implicit index (e.g., data[0:2]), the final index is excluded from the slice.

# Indexers: loc, iloc, and ix
# These slicing and indexing conventions can be a source of confusion. For example,
# if your Series has an explicit integer index, an indexing operation such as data[1]
# will use the explicit indices, while a slicing operation like data[1:3] will use the
# implicit Python-style index.
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data

1    a
3    b
5    c
dtype: object

In [13]:
# explicit index when indexing
data[1]

'a'

In [14]:
# implicit index when slicing
data[1:3]

3    b
5    c
dtype: object

Because of this potential confusion in the case of integer indexes, Pandas provides some special indexer attributes that explicitly expose certain indexing schemes. These are not functional methods but attributes that expose a particular slicing interface to the data in the Series.


In [6]:
# First, the loc attribute allows indexing and slicing that always references the explicit index:

data.loc[1]

'a'

In [7]:
data.loc[1:3]

1    a
3    b
dtype: object

In [9]:
# The iloc attribute allows indexing and slicing that always references the implicit
# Python-style index:
data.iloc[1]


'b'

In [10]:
data.iloc[1:3]

3    b
5    c
dtype: object
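
The difference between the two indexers is easiest to see when the integer labels do not coincide with positions at all. The following is a minimal sketch using a hypothetical Series `s` (not defined in the notebook above).

```python
import pandas as pd

# Hypothetical Series whose integer labels do not match positions.
s = pd.Series(['w', 'x', 'y', 'z'], index=[10, 20, 30, 40])

by_label = s.loc[10:30]     # label slice: both endpoints are included
by_position = s.iloc[0:2]   # positional slice: the stop position is excluded

# Position 1 exists, but there is no label 1, so loc raises KeyError.
try:
    s.loc[1]
    label_found = True
except KeyError:
    label_found = False
```

Because `loc` never falls back to positions, a missing label fails loudly instead of silently returning a different element.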

A third indexing attribute, ix, was a hybrid of the two and, for Series objects, was equivalent to standard []-based indexing; it has since been removed from pandas.
One guiding principle of Python code is that "explicit is better than implicit." The explicit nature of loc and iloc makes them very useful in maintaining clean and readable code, especially with integer indexes.

**Data Selection in DataFrame**
A DataFrame acts in many ways like a two-dimensional or structured array, and in other ways like a dictionary of Series structures sharing the same index.

*DataFrame as a dictionary*
The first analogy we will consider is the DataFrame as a dictionary of related Series objects. 

In [11]:
area = pd.Series({'California': 423967, 'Texas': 695662, 'New York': 141297, 'Florida': 170312, 'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193, 'New York': 19651127, 'Florida': 19552860, 'Illinois': 12882135})
data = pd.DataFrame({'area':area, 'pop':pop})
data

              area       pop
California  423967  38332521
Texas       695662  26448193
New York    141297  19651127
Florida     170312  19552860
Illinois    149995  12882135


The individual Series that make up the columns of the DataFrame can be accessed via dictionary-style indexing of the column name:

In [12]:
data['area']

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

Equivalently, we can use attribute-style access with column names that are strings:

In [13]:
data.area

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

The attribute-style column access actually accesses the exact same object as the dictionary-style access:

In [14]:
data.area is data['area']

True

This shorthand does not work in all cases!
If the column names are not strings, or if they conflict with methods of the DataFrame, attribute-style access is not possible. For example, the DataFrame has a pop() method, so data.pop will point to this rather than the "pop" column:

In [15]:
data.pop is data['pop']

False

Avoid the temptation to try column assignment via attribute
(i.e., use data['pop'] = z rather than data.pop = z).
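
Both pitfalls can be reproduced on a small frame. This is a sketch under stated assumptions: the two-row `df` and the names `total` and `total_col` are hypothetical, and the warning filter is there only because pandas emits a UserWarning when a list-like is assigned to a new attribute.

```python
import warnings
import pandas as pd

df = pd.DataFrame({'area': [423967, 695662], 'pop': [38332521, 26448193]},
                  index=['California', 'Texas'])

# 'pop' collides with the DataFrame.pop method, so attribute access
# returns the bound method rather than the column.
attr_is_method = callable(df.pop)
column_is_series = isinstance(df['pop'], pd.Series)

# Assigning a list-like to a new attribute does not create a column;
# the resulting UserWarning is silenced here for the demonstration.
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    df.total = [1, 2]        # hypothetical name: becomes a plain attribute

df['total_col'] = [1, 2]     # bracket assignment does create a column
```

After running this, `df.columns` contains `total_col` but not `total`.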

As with the Series objects discussed earlier, this dictionary-style syntax can also be used to modify the object, in this case to add a new column:

In [16]:
data['density'] = data['pop'] / data['area']
data

              area       pop     density
California  423967  38332521   90.413926
Texas       695662  26448193   38.018740
New York    141297  19651127  139.076746
Florida     170312  19552860  114.806121
Illinois    149995  12882135   85.883763


*DataFrame as two-dimensional array*
We can also view the DataFrame as an enhanced two-dimensional array. We can examine the raw underlying data array using the values attribute:

In [17]:
data.values

array([[4.23967000e+05, 3.83325210e+07, 9.04139261e+01],
       [6.95662000e+05, 2.64481930e+07, 3.80187404e+01],
       [1.41297000e+05, 1.96511270e+07, 1.39076746e+02],
       [1.70312000e+05, 1.95528600e+07, 1.14806121e+02],
       [1.49995000e+05, 1.28821350e+07, 8.58837628e+01]])
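
The output above shows every entry as a float because a NumPy array holds a single dtype. The sketch below illustrates this on a hypothetical two-row frame; `to_numpy()` is used only as a cross-check against `values`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'area': [423967, 695662],
                   'density': [90.413926, 38.018740]},
                  index=['California', 'Texas'])

arr = df.values                            # 2-D NumPy array without the labels
same = np.array_equal(arr, df.to_numpy())  # both return the same data

# Mixing an integer column with a float column yields one common dtype.
common_dtype = arr.dtype
```

Here the integer `area` column is upcast so that the whole array can be stored as floating point.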

With this picture in mind, we can perform many familiar array-like operations on the DataFrame itself. For example, we can transpose the full DataFrame to swap rows and columns:

In [18]:
data.T

         California       Texas    New York     Florida    Illinois
area       423967.0    695662.0    141297.0    170312.0    149995.0
pop      38332520.0  26448190.0  19651130.0  19552860.0  12882140.0
density    90.41393    38.01874    139.0767    114.8061    85.88376


When it comes to indexing of DataFrame objects, however, it is clear that the dictionary-style indexing of columns precludes our ability to simply treat it as a NumPy array. In particular, passing a single index to an array accesses a row:

In [19]:
data.values[0]

array([4.23967000e+05, 3.83325210e+07, 9.04139261e+01])

and passing a single "index" to a DataFrame accesses a column:

In [20]:
data['area']

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

For array-style indexing, we need another convention. Here Pandas again uses the loc and iloc indexers mentioned earlier (the older ix indexer is no longer available). Using the iloc indexer, we can index the underlying array as if it were a simple NumPy array (using the implicit Python-style index), but the DataFrame index and column labels are maintained in the result:

In [21]:
data.iloc[:3, :2]

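              area       pop
California  423967  38332521
Texas       695662  26448193
New York    141297  19651127

Similarly, using the loc indexer, we can index the underlying data in an array-like style but using the explicit index and column names:
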
In [22]:
data.loc[:'Illinois', :'pop']

              area       pop
California  423967  38332521
Texas       695662  26448193
New York    141297  19651127
Florida     170312  19552860
Illinois    149995  12882135
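
The two indexers can select the same block of data, one by position and one by label. The sketch below uses a shortened, hypothetical copy of the frame (density values rounded for illustration).

```python
import pandas as pd

df = pd.DataFrame(
    {'area': [423967, 695662, 141297, 170312],
     'pop': [38332521, 26448193, 19651127, 19552860],
     'density': [90.4, 38.0, 139.1, 114.8]},   # rounded, illustrative values
    index=['California', 'Texas', 'New York', 'Florida'])

by_position = df.iloc[:3, :2]            # rows 0-2, columns 0-1
by_label = df.loc[:'New York', :'pop']   # up to and including both labels

selections_match = by_position.equals(by_label)
```

Note that the positional stop (3) is one past the last row kept, while the label stop ('New York') is itself the last row kept.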


The ix indexer allowed a hybrid of these two approaches, but it was deprecated in pandas 0.20 and removed in pandas 1.0, so it no longer works; use loc or iloc instead.
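
A mixed position-and-label selection that would once have been written with ix can be expressed with either remaining indexer. The following is one possible translation, not the only one; the frame and the names `via_iloc` and `via_loc` are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {'area': [423967, 695662, 141297, 170312],
     'pop': [38332521, 26448193, 19651127, 19552860],
     'density': [90.4, 38.0, 139.1, 114.8]},   # rounded, illustrative values
    index=['California', 'Texas', 'New York', 'Florida'])

# Old style (no longer available): df.ix[:3, :'pop']

# Option 1: translate the column label into a position and use iloc.
stop = df.columns.get_loc('pop') + 1     # +1 because label slices are inclusive
via_iloc = df.iloc[:3, :stop]

# Option 2: translate the row positions into labels and use loc.
via_loc = df.loc[df.index[:3], :'pop']
```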

Any of the familiar NumPy-style data access patterns can be used within these indexers. For example, in the loc indexer we can combine masking and fancy indexing as in the following:

In [26]:
data.loc[data.density > 100, ['pop', 'density']]

               pop     density
New York  19651127  139.076746
Florida   19552860  114.806121
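
The same indexers can also appear on the left-hand side of an assignment to set or modify values. This is a minimal sketch on a hypothetical copy of the frame; the replacement values (90.0 and 0) are arbitrary and chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {'area': [423967, 695662, 141297, 170312],
     'pop': [38332521, 26448193, 19651127, 19552860],
     'density': [90.4, 38.0, 139.1, 114.8]},   # rounded, illustrative values
    index=['California', 'Texas', 'New York', 'Florida'])

df.iloc[0, 2] = 90.0                       # by position: row 0, column 2
df.loc[df['density'] > 100, 'pop'] = 0     # by boolean mask and column label
```

Only the rows selected by the mask (New York and Florida here) have their `pop` value replaced.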


*Additional indexing conventions*
There are a couple of extra indexing conventions that might seem at odds with the preceding discussion, but nevertheless can be very useful in practice. First, while indexing refers to columns, slicing refers to rows:

In [27]:
data['Florida':'Illinois']

            area       pop     density
Florida   170312  19552860  114.806121
Illinois  149995  12882135   85.883763


Slices can also refer to rows by number rather than by index:

In [28]:
data[1:3]

            area       pop     density
Texas     695662  26448193   38.018740
New York  141297  19651127  139.076746


Direct masking operations are also interpreted row-wise rather than column-wise:

In [29]:
data[data.density > 100]

            area       pop     density
New York  141297  19651127  139.076746
Florida   170312  19552860  114.806121


These two conventions are syntactically similar to those on a NumPy array, and while they may not precisely fit the mold of the Pandas conventions, they are quite useful in practice.
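
Each of these row-wise shortcuts has an explicit counterpart written with loc or iloc. The sketch below checks the correspondence on a hypothetical copy of the frame (density values rounded for illustration).

```python
import pandas as pd

df = pd.DataFrame(
    {'area': [423967, 695662, 141297, 170312],
     'pop': [38332521, 26448193, 19651127, 19552860],
     'density': [90.4, 38.0, 139.1, 114.8]},   # rounded, illustrative values
    index=['California', 'Texas', 'New York', 'Florida'])

mask = df['density'] > 100

# Shortcut on the left, explicit equivalent on the right.
label_slice_ok = df['New York':'Florida'].equals(df.loc['New York':'Florida'])
position_slice_ok = df[1:3].equals(df.iloc[1:3])
mask_ok = df[mask].equals(df.loc[mask])
```

The explicit forms are longer, but they make it clear at a glance whether labels or positions are intended.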