In [2]:
import pandas as pd

## Data Selection in `Series`

A `Series` object

- acts in many ways like a one-dimensional NumPy array, 

- and in many ways like a standard Python dictionary.

### `Series` as dictionary

In [5]:
data = pd.Series([0.25, 0.5, 0.75, 1.0], 
                 index=['a', 'b', 'c', 'd'])

In [6]:
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [7]:
data['b']

0.5

Use dictionary-like Python expressions and methods to examine the keys/indices and values:

In [8]:
'a' in data

True

In [9]:
'e' in data

False

In [10]:
data.keys()

Index(['a', 'b', 'c', 'd'], dtype='object')

In [15]:
list(data.items())

[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]

`Series` objects can even be modified with a dictionary-like syntax.

In [18]:
# Extend Series by assigning to a new index value
data['e'] = 1.25
data

a    0.25
b    0.50
c    0.75
d    1.00
e    1.25
dtype: float64

In [19]:
# Modify value
data['e'] = 2
data

a    0.25
b    0.50
c    0.75
d    1.00
e    2.00
dtype: float64

### `Series` as one-dimensional array

A `Series`:

- builds on this dictionary-like interface

- provides array-style item selection via the same basic mechanisms as NumPy arrays: slices, masking, and fancy indexing

In [26]:
data

a    0.25
b    0.50
c    0.75
d    1.00
e    2.00
dtype: float64

Slicing:

- When slicing with an **explicit index**, the final index is **included** in the slice, 

- while when slicing with an **implicit index**, the final index is **excluded** from the slice.

In [20]:
# Slicing by explicit index
data['a':'c'] # 'c' is also included

a    0.25
b    0.50
c    0.75
dtype: float64

In [21]:
# Slicing by implicit integer index
data[0:2] # index 2 is NOT included

a    0.25
b    0.50
dtype: float64

In [24]:
# Masking
data[(data > 0.3) & (data < 0.8)]

b    0.50
c    0.75
dtype: float64

In [25]:
# Fancy indexing
data[['a', 'e']]

a    0.25
e    2.00
dtype: float64

### Indexers: `loc`, `iloc`, `ix`

In order to avoid potential confusion in the case of integer indexes, Pandas provides some special *indexer* attributes that explicitly expose certain indexing schemes.
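The confusion is easiest to see on a `Series` whose labels are themselves integers. The sketch below uses a hypothetical Series `s` (not part of the running example) and assumes the usual pandas behaviour, in which a single integer in `[]` is treated as a label while a slice in `[]` is treated as a range of positions:

```python
import pandas as pd

# Hypothetical Series whose explicit labels are integers
s = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

# Plain [] with a single integer uses the explicit label ...
by_label = s[1]            # label 1 -> 'a'

# ... while plain [] with a slice uses the implicit position
by_position = s[1:3]       # positions 1 and 2 -> labels 3 and 5

# loc and iloc state the intent explicitly
loc_item = s.loc[1]        # label 1 -> 'a'
iloc_item = s.iloc[1]      # position 1 -> 'b'
loc_slice = s.loc[1:3]     # labels 1 through 3, end point included
iloc_slice = s.iloc[1:3]   # positions 1 and 2, end point excluded
```

The same expression `1:3` selects different elements depending on the indexer, which is why the explicit `loc` and `iloc` forms are preferable whenever the index holds integers.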

In [27]:
data

a    0.25
b    0.50
c    0.75
d    1.00
e    2.00
dtype: float64

#### `loc`

The `loc` attribute allows indexing and slicing that always reference the **explicit label(s)**.

In [31]:
data.loc['a'] # Access by label(s)

0.25

#### `iloc`

The `iloc` attribute allows indexing and slicing that always reference the implicit Python-style **index** (integer position).

In [30]:
data.iloc[1] # Access by integer-location based indexing

0.5

#### `ix`

~~`ix` is a hybrid of the two above. For `Series` objects is equivalent to standard `[]`-based indexing.~~

Update: `Series.ix` and `DataFrame.ix` were deprecated in pandas 0.20 and removed in pandas 1.0. Use `loc` or `iloc` instead.
(See: https://pandas.pydata.org/pandas-docs/version/1.0.0/whatsnew/v1.0.0.html#removal-of-prior-version-deprecations-changes)

One guiding principle of Python code is that "explicit is better than implicit." The explicit nature of `loc` and `iloc` makes them very useful for keeping code clean and readable, especially when the index holds integers. Using them makes code easier to read and understand, and prevents subtle bugs caused by the mixed indexing/slicing convention of plain `[]`.

## Data Selection in DataFrame

A `DataFrame` acts in many ways like a two-dimensional or structured array, and in other ways like a dictionary of `Series` structures sharing the same index. 

### DataFrame as dictionary

Consider `DataFrame` as a dictionary of related `Series` objects.

In [32]:
area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
data = pd.DataFrame({'area':area, 'pop':pop})
data

|            | area   | pop      |
|------------|--------|----------|
| California | 423967 | 38332521 |
| Texas      | 695662 | 26448193 |
| New York   | 141297 | 19651127 |
| Florida    | 170312 | 19552860 |
| Illinois   | 149995 | 12882135 |


Access column via **dictionary-style** indexing of the column name:

In [33]:
data['area']

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

Access column via **attribute-style**:

In [34]:
data.area

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

Both forms return the same column:

In [35]:
data['area'] is data.area

True

If the column names are not strings, or if they conflict with methods of the `DataFrame`, attribute-style access does NOT work!

**In particular, we should NOT do column assignment via attribute!**

**Therefore, dictionary-style should be the first choice.**
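A minimal sketch of both pitfalls, using a hypothetical two-row frame `df` that reuses the column names from this section. It relies on the fact that `pop` is also the name of a `DataFrame` method, and on pandas storing an unknown attribute as a plain Python attribute instead of a column (it emits a `UserWarning` when it does so):

```python
import pandas as pd

# Hypothetical two-row frame reusing the column names from this section
df = pd.DataFrame({'area': [423967, 695662], 'pop': [38332521, 26448193]},
                  index=['California', 'Texas'])

# 'pop' collides with the DataFrame.pop method, so attribute access
# returns the method rather than the column
pop_is_method = callable(df.pop)
pop_is_column = df['pop'] is df.pop

# Assigning to a new attribute sets a Python attribute, not a column
# (pandas emits a UserWarning here)
df.density = df['pop'] / df['area']
has_density_column = 'density' in df.columns

# Dictionary-style assignment creates the column as intended
df['density'] = df['pop'] / df['area']
```

Here `pop_is_method` is `True`, while `pop_is_column` and `has_density_column` are both `False`; only the final dictionary-style assignment adds a `density` column.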

In [36]:
# Use dict-style syntax to modify the DataFrame object
data['density'] = data['pop'] / data['area']
data

|            | area   | pop      | density    |
|------------|--------|----------|------------|
| California | 423967 | 38332521 | 90.413926  |
| Texas      | 695662 | 26448193 | 38.01874   |
| New York   | 141297 | 19651127 | 139.076746 |
| Florida    | 170312 | 19552860 | 114.806121 |
| Illinois   | 149995 | 12882135 | 85.883763  |


### DataFrame as two-dimensional array

We can also view the `DataFrame` as an enhanced two-dimensional array. Its `values` attribute exposes the underlying data as a NumPy array:

In [37]:
data.values

array([[4.23967000e+05, 3.83325210e+07, 9.04139261e+01],
       [6.95662000e+05, 2.64481930e+07, 3.80187404e+01],
       [1.41297000e+05, 1.96511270e+07, 1.39076746e+02],
       [1.70312000e+05, 1.95528600e+07, 1.14806121e+02],
       [1.49995000e+05, 1.28821350e+07, 8.58837628e+01]])

With this picture in mind, many familiar array-like operations can be performed on the `DataFrame` itself.

In [38]:
# Transpose
data.T

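|         | California | Texas      | New York   | Florida    | Illinois   |
|---------|------------|------------|------------|------------|------------|
| area    | 423967.0   | 695662.0   | 141297.0   | 170312.0   | 149995.0   |
| pop     | 38332520.0 | 26448190.0 | 19651130.0 | 19552860.0 | 12882140.0 |
| density | 90.41393   | 38.01874   | 139.0767   | 114.8061   | 85.88376   |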


In particular, passing a single "index" to a `DataFrame` accesses a column:

In [39]:
data['area']

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64
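This is where the array analogy breaks down: on a NumPy array a single index selects a row, whereas on a `DataFrame` it selects a column. A small sketch on a hypothetical two-row frame `df` makes the contrast concrete:

```python
import pandas as pd

# Hypothetical two-row frame reusing the column names from this section
df = pd.DataFrame({'area': [423967, 695662], 'pop': [38332521, 26448193]},
                  index=['California', 'Texas'])

# On the underlying NumPy array, a single index selects the first ROW
first_row = df.values[0]

# On the DataFrame, a single "index" selects a COLUMN
area_column = df['area']
```

`first_row` holds California's area and population, while `area_column` holds the area of every state.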

For array-style indexing, we therefore need another convention: the `loc` and `iloc` indexers (and, in pandas versions before 1.0, `ix`).

#### `iloc`

Using the `iloc` indexer, we can index the underlying array as if it were a simple NumPy array (using the implicit Python-style index), but the `DataFrame` index and column labels are maintained in the result:

In [40]:
data

|            | area   | pop      | density    |
|------------|--------|----------|------------|
| California | 423967 | 38332521 | 90.413926  |
| Texas      | 695662 | 26448193 | 38.01874   |
| New York   | 141297 | 19651127 | 139.076746 |
| Florida    | 170312 | 19552860 | 114.806121 |
| Illinois   | 149995 | 12882135 | 85.883763  |


In [41]:
data.iloc[:3, :2]

|            | area   | pop      |
|------------|--------|----------|
| California | 423967 | 38332521 |
| Texas      | 695662 | 26448193 |
| New York   | 141297 | 19651127 |


#### `loc`

Using the `loc` indexer we can index the underlying data in an array-like style but using the **explicit** index and column names:

In [42]:
data.loc[:'New York', :'pop']

|            | area   | pop      |
|------------|--------|----------|
| California | 423967 | 38332521 |
| Texas      | 695662 | 26448193 |
| New York   | 141297 | 19651127 |


#### `ix`

~~The `ix` indexer allows a hybrid of these two approaches:~~

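Update: `Series.ix` and `DataFrame.ix` were deprecated in pandas 0.20 and removed in pandas 1.0, so the call below raises an `AttributeError` on current versions.
(See: https://pandas.pydata.org/pandas-docs/version/1.0.0/whatsnew/v1.0.0.html#removal-of-prior-version-deprecations-changes)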

In [43]:
data.ix[:3, :'pop']

AttributeError: 'DataFrame' object has no attribute 'ix'
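The hybrid selection that `ix` used to offer (rows by position, columns by label) can still be expressed with `loc` or `iloc` alone. The sketch below shows two possible translations on a hypothetical frame `df`; the rounded density values are illustrative only, and `via_loc`, `via_iloc`, and `end` are names introduced here:

```python
import pandas as pd

# Hypothetical frame; the rounded density values are illustrative only
df = pd.DataFrame({'area': [423967, 695662, 141297, 170312],
                   'pop': [38332521, 26448193, 19651127, 19552860],
                   'density': [90.4, 38.0, 139.1, 114.8]},
                  index=['California', 'Texas', 'New York', 'Florida'])

# Option 1: turn the positional row slice into labels, then use loc
via_loc = df.loc[df.index[:3], :'pop']

# Option 2: turn the column label into a position, then use iloc
# (+1 because label slices include the end point, positional slices do not)
end = df.columns.get_loc('pop') + 1
via_iloc = df.iloc[:3, :end]
```

Both options select the first three rows and the columns up to and including `pop`. Which one reads better depends on whether the surrounding code mostly works with labels or with positions.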

#### NumPy-style data access patterns within these indexers

##### Combine masking and fancy indexing

For example, in the `loc` indexer we can combine masking and fancy indexing as in the following:

In [45]:
data.loc[data.density > 100, ['pop', 'density']]

|          | pop      | density    |
|----------|----------|------------|
| New York | 19651127 | 139.076746 |
| Florida  | 19552860 | 114.806121 |


In [47]:
data.loc[data.area > 200000, ['area', 'pop']]

|            | area   | pop      |
|------------|--------|----------|
| California | 423967 | 38332521 |
| Texas      | 695662 | 26448193 |


##### Modify values

Any of these indexers can also be used to set or modify values, in the same way as assigning into a NumPy array:

In [49]:
print('Before modification: ', data.iloc[0, 2])

# Modification
data.iloc[0, 2] = 90

print('After modification: ', data.iloc[0, 2])

Before modification:  90.41392608386974
After modification:  90.0
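Assignment works the same way through `loc`, including with a Boolean mask. The following is a minimal sketch on a hypothetical two-row frame `df` with rounded values, so that it does not alter the `data` object used in this section:

```python
import pandas as pd

# Hypothetical two-row frame with rounded, illustrative values
df = pd.DataFrame({'area': [423967, 695662], 'density': [90.41, 38.02]},
                  index=['California', 'Texas'])

# Label-based assignment to a single cell
df.loc['Texas', 'density'] = 38.0

# Mask-based assignment: rows and column are selected in ONE indexing step
df.loc[df['density'] > 50, 'density'] = 90.0
```

Selecting rows and column in a single `loc` call matters here: a chained form such as `df[mask]['density'] = ...` may assign into a temporary copy and leave `df` unchanged.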


## Additional indexing conventions

**Indexing** refers to **columns**:

In [55]:
data['pop']

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
Name: pop, dtype: int64

**Slicing** refers to **rows**:

In [52]:
data['Florida': 'Illinois']

|          | area   | pop      | density    |
|----------|--------|----------|------------|
| Florida  | 170312 | 19552860 | 114.806121 |
| Illinois | 149995 | 12882135 | 85.883763  |


In [57]:
data[1:3]

|          | area   | pop      | density    |
|----------|--------|----------|------------|
| Texas    | 695662 | 26448193 | 38.01874   |
| New York | 141297 | 19651127 | 139.076746 |


Direct masking operations are also interpreted row-wise rather than column-wise:

In [58]:
data[data['density'] > 100]

|          | area   | pop      | density    |
|----------|--------|----------|------------|
| New York | 141297 | 19651127 | 139.076746 |
| Florida  | 170312 | 19552860 | 114.806121 |
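Each of these row-wise shortcuts has an explicit `loc` or `iloc` counterpart. As a rough check, the sketch below rebuilds the example frame under the hypothetical name `df` and compares each shortcut with its explicit form:

```python
import pandas as pd

# Rebuild the example frame under a separate, hypothetical name
df = pd.DataFrame({'area': [423967, 695662, 141297, 170312, 149995],
                   'pop': [38332521, 26448193, 19651127, 19552860, 12882135]},
                  index=['California', 'Texas', 'New York', 'Florida', 'Illinois'])
df['density'] = df['pop'] / df['area']

# Label slice on rows: same result as loc
same_label_slice = df['Florida':'Illinois'].equals(df.loc['Florida':'Illinois'])

# Positional slice on rows: same result as iloc
same_position_slice = df[1:3].equals(df.iloc[1:3])

# Boolean mask on rows: same result as loc with the mask
mask = df['density'] > 100
same_mask = df[mask].equals(df.loc[mask])
```

All three comparisons evaluate to `True`, so the shortcuts are a convenience; the explicit forms say the same thing without relying on the reader to remember the row-versus-column convention.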
