## Data Selection in Series

### Series as dictionary

Like a dictionary, the `Series` object provides a mapping from a collection of keys to a collection of values.

In [2]:
import pandas as pd
data = pd.Series([1,2,3,4],index=['a','b','c','d'])
print(data)

a    1
b    2
c    3
d    4
dtype: int64


In [3]:
data['b']

2

In [7]:
print(data.get('b', 'None'))
print(data.get('e','None'))

2
None


In [14]:
data.keys()

Index(['a', 'b', 'c', 'd'], dtype='object')

In [16]:
list(data.items())

[('a', 1), ('b', 2), ('c', 3), ('d', 4)]
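
The dictionary analogy also covers membership tests, which check the index labels rather than the values. The following is a minimal sketch; the variable names `has_label`, `has_value`, and `missing` are illustrative only.

```python
import pandas as pd

data = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])

# Membership tests look at the index labels (the "keys"), not the values.
has_label = 'a' in data      # 'a' is an index label
has_value = 1 in data        # 1 is a value, not an index label

# get() returns the fallback instead of raising KeyError for a missing label.
missing = data.get('e', 'None')

print(has_label, has_value, missing)
```

To test whether a value is present, check `data.values` instead.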

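`Series` objects can even be modified with a dictionary-like syntax. Just as you can extend a dictionary by assigning to a new key, you can extend a `Series` by assigning to a new index label.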

In [18]:
data['e'] = 1.25
data

a    1.00
b    2.00
c    3.00
d    4.00
e    1.25
dtype: float64
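
Note that the dtype in the output above changed from `int64` to `float64`, because a `Series` holds a single dtype for all of its values. The sketch below makes the change explicit; `dtype_before` and `dtype_after` are illustrative names.

```python
import pandas as pd

data = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
dtype_before = data.dtype    # integer dtype

# Assigning to a label that does not exist yet appends a new entry.
# The float value forces the whole Series to be upcast.
data['e'] = 1.25
dtype_after = data.dtype     # floating-point dtype

print(dtype_before, dtype_after)
```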

### Series as one-dimensional array


A `Series` builds on this dictionary-like interface and provides array-style item selection via the same basic mechanisms as NumPy arrays, that is, **slices, masking, and fancy indexing**.

In [29]:
# slicing by explicit index
# when slicing with an explicit index, the final index is included in the slice
print(data)
data['d':'e']

a    1.00
b    2.00
c    3.00
d    4.00
e    1.25
dtype: float64


d    4.00
e    1.25
dtype: float64

In [30]:
# slicing by implicit integer index
# when slicing with an implicit index, the final index is excluded from the slice
data[1:3]

b    2.0
c    3.0
dtype: float64
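
The difference between the two slicing styles is easiest to see side by side. This is a small sketch on the same data; `by_label` and `by_position` are illustrative names.

```python
import pandas as pd

data = pd.Series([1.0, 2.0, 3.0, 4.0, 1.25], index=['a', 'b', 'c', 'd', 'e'])

# Explicit (label) slice: the end label 'd' is included.
by_label = data['b':'d']

# Implicit (position) slice: the end position 3 is excluded.
by_position = data[1:3]

print(list(by_label.index))
print(list(by_position.index))
```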

In [27]:
# masking
data[(data>2) | (data<=1)]

a    1.0
c    3.0
d    4.0
dtype: float64

In [28]:
# fancy indexing
data[['a','e']]


a    1.00
e    1.25
dtype: float64

### Indexers: loc, iloc, ix

In [31]:
data = pd.Series(['a','b','c'], index=[1,3,5])
data

1    a
3    b
5    c
dtype: object

In [32]:
# explicit index when indexing
data[1]

'a'

In [33]:
# implicit index when indexing
data[1:3]

3    b
5    c
dtype: object

With an integer index, a single item such as `data[1]` uses the explicit index, while a slice such as `data[1:3]` uses the implicit, positional index. Because of this potential confusion, Pandas provides special indexer attributes that explicitly expose a particular indexing scheme.

First, the `loc` attribute allows indexing and slicing that always references the explicit index.

In [36]:
data

1    a
3    b
5    c
dtype: object

In [34]:
data.loc[1]

'a'

In [49]:
data.loc[1:3]

1    a
3    b
dtype: object

The `iloc` attribute allows indexing and slicing that always references the implicit, Python-style positional index.

In [51]:
data

1    a
3    b
5    c
dtype: object

In [52]:
data.iloc[1]

'b'

In [59]:
data.iloc[0:2]

1    a
3    b
dtype: object
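
The following sketch puts the two indexers side by side on the same integer-indexed `Series`, so the same numbers can be seen to select different items. The variable names are illustrative.

```python
import pandas as pd

data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

# loc: 1 and 3 are index labels, and the end label is included.
label_item = data.loc[1]
label_slice = data.loc[1:3]

# iloc: 1 and 3 are positions, and the end position is excluded.
position_item = data.iloc[1]
position_slice = data.iloc[1:3]

print(label_item, list(label_slice.index))
print(position_item, list(position_slice.index))
```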

The `ix` indexer was a hybrid of the two. It is deprecated and was removed in pandas 1.0, so use `loc` or `iloc` instead.

## Data Selection in DataFrame

### DataFrame as a dictionary

In [60]:
area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
data = pd.DataFrame({'area':area, 'pop':pop})
data

              area       pop
California  423967  38332521
Texas       695662  26448193
New York    141297  19651127
Florida     170312  19552860
Illinois    149995  12882135


The individual `Series` that make up the columns of the `DataFrame` can be accessed via dictionary-style indexing of the column name.

In [61]:
data['area']

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

In [62]:
data['pop']

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
Name: pop, dtype: int64

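Equivalently, we can use attribute-style access with column names that are strings (**not recommended**: it fails when a column name clashes with a `DataFrame` method or attribute; for example, `data.pop` is the `pop()` method rather than the `'pop'` column).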

In [63]:
data.area

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64
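
The sketch below illustrates why attribute-style access is fragile, using a shortened two-row version of the data. The names `df`, `same_area`, and `pop_is_method` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'area': [423967, 695662],
                   'pop': [38332521, 26448193]},
                  index=['California', 'Texas'])

# 'area' does not collide with any DataFrame attribute, so both forms agree.
same_area = df.area.equals(df['area'])

# 'pop' collides with the DataFrame.pop() method, so the attribute form
# returns the bound method rather than the column.
pop_is_method = callable(df.pop)

# Dictionary-style indexing always returns the column.
pop_column = df['pop']

print(same_area, pop_is_method)
```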

As with the `Series` objects discussed earlier, this dictionary-style syntax can also be used to modify the object.

In [64]:
# Adding a new column 
data['density'] = data['pop'] / data['area']

In [65]:
data

              area       pop     density
California  423967  38332521   90.413926
Texas       695662  26448193   38.018740
New York    141297  19651127  139.076746
Florida     170312  19552860  114.806121
Illinois    149995  12882135   85.883763


### DataFrame as two-dimensional array

In [66]:
data

              area       pop     density
California  423967  38332521   90.413926
Texas       695662  26448193   38.018740
New York    141297  19651127  139.076746
Florida     170312  19552860  114.806121
Illinois    149995  12882135   85.883763


In [67]:
data.values

array([[4.23967000e+05, 3.83325210e+07, 9.04139261e+01],
       [6.95662000e+05, 2.64481930e+07, 3.80187404e+01],
       [1.41297000e+05, 1.96511270e+07, 1.39076746e+02],
       [1.70312000e+05, 1.95528600e+07, 1.14806121e+02],
       [1.49995000e+05, 1.28821350e+07, 8.58837628e+01]])
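
The array above is printed in floating-point notation because `values` returns a single homogeneous NumPy array: once the float `density` column is present, the integer columns are upcast. The following sketch, on a shortened two-row version of the data, shows the effect; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'area': [423967, 695662],
                   'pop': [38332521, 26448193]},
                  index=['California', 'Texas'])

# With integer columns only, the underlying array is an integer array.
ints_only = df.values.dtype

# Adding a float column forces a common floating-point dtype for the array.
df['density'] = df['pop'] / df['area']
mixed = df.values.dtype

# Indexing the array with a single integer returns the first row.
first_row = df.values[0]

print(ints_only, mixed, first_row.shape)
```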

In [68]:
# Transpose: swap rows and columns
data.T

         California       Texas    New York     Florida    Illinois
area       423967.0    695662.0    141297.0    170312.0    149995.0
pop      38332520.0  26448190.0  19651130.0  19552860.0  12882140.0
density    90.41393    38.01874    139.0767    114.8061    85.88376


In [71]:
# passing a single index to an array accesses a row
data.values[0]


array([4.23967000e+05, 3.83325210e+07, 9.04139261e+01])

In [77]:
# passing a single index to DataFrame accesses a column
data['area']

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

In [95]:
data.iloc[1:3,:2]


            area       pop
Texas     695662  26448193
New York  141297  19651127


In [96]:
# prefer label-based loc over integer positions when the index or columns are not integers
data.loc['Texas':,:]

            area       pop     density
Texas     695662  26448193   38.018740
New York  141297  19651127  139.076746
Florida   170312  19552860  114.806121
Illinois  149995  12882135   85.883763


The `ix` indexer allowed a hybrid of these two approaches, but it is deprecated and was removed in pandas 1.0. The same selection (the first two rows by position, and the columns from `'pop'` onward by label) can be written with `iloc` by looking up the column position:

In [98]:
data.iloc[:2, data.columns.get_loc('pop'):]


                 pop    density
California  38332521  90.413926
Texas       26448193  38.018740
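
Since `ix` is not available in pandas 1.0 and later, a selection that mixes positions and labels has to be expressed through either `iloc` or `loc`. The sketch below shows both options on a shortened three-row version of the data; `by_iloc` and `by_loc` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'area': [423967, 695662, 141297],
                   'pop': [38332521, 26448193, 19651127]},
                  index=['California', 'Texas', 'New York'])
df['density'] = df['pop'] / df['area']

# Option 1: stay positional, converting the column label to a position.
by_iloc = df.iloc[:2, df.columns.get_loc('pop'):]

# Option 2: stay label-based, converting the row positions to labels.
by_loc = df.loc[df.index[:2], 'pop':]

print(by_iloc.equals(by_loc))
```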


Any of the familiar NumPy-style data access patterns can be used within these indexers. For example, `loc` can combine masking on the rows with fancy indexing on the columns.


In [103]:
data.loc[data['density']>100, ['pop', 'area']]

               pop    area
New York  19651127  141297
Florida   19552860  170312
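
As a self-contained sketch of the same pattern, the block below rebuilds the data and combines a Boolean row mask with a list of column labels inside `loc`. The names `df`, `mask`, and `dense` are illustrative.

```python
import pandas as pd

area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
df = pd.DataFrame({'area': area, 'pop': pop})
df['density'] = df['pop'] / df['area']

# Boolean mask over the rows: True where density exceeds 100.
mask = df['density'] > 100

# Rows are chosen by the mask, columns by a list of labels in the given order.
dense = df.loc[mask, ['pop', 'area']]

print(dense)
```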


Any of these indexing conventions may also be used to set or modify values.

In [111]:
data.iloc[1:4] = 90
data


              area       pop    density
California  423967  38332521  90.413926
Texas           90        90  90.000000
New York        90        90  90.000000
Florida         90        90  90.000000
Illinois    149995  12882135  85.883763
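
Assignment works the same way through labels, positions, or masks. The sketch below applies each to a copy of a shortened three-row version of the data, so the original stays untouched; the replacement values are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'area': [423967, 695662, 141297],
                   'pop': [38332521, 26448193, 19651127]},
                  index=['California', 'Texas', 'New York'])

updated = df.copy()

# Label-based: set one cell by row label and column label.
updated.loc['Texas', 'area'] = 700000

# Position-based: set the cell in row 2, column 1 (New York, pop).
updated.iloc[2, 1] = 20000000

# Mask-based: set 'area' for every row whose population exceeds 30 million.
updated.loc[updated['pop'] > 30000000, 'area'] = 0

print(updated)
```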


### Additional indexing conventions

While indexing refers to columns, slicing refers to rows. The example below uses the explicit `loc` form of a row slice.

In [116]:
data.loc['Florida':'Illinois']

            area       pop    density
Florida       90        90  90.000000
Illinois  149995  12882135  85.883763


Similarly, direct masking operations are also interpreted row-wise rather than column-wise.


In [124]:
data.loc[data['density']>89]

              area       pop    density
California  423967  38332521  90.413926
Texas           90        90  90.000000
New York        90        90  90.000000
Florida         90        90  90.000000
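
These conventions also apply to plain square brackets on a `DataFrame`, without `loc`. The sketch below rebuilds the original, unmodified data and contrasts the cases; the variable names are illustrative.

```python
import pandas as pd

area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
df = pd.DataFrame({'area': area, 'pop': pop})
df['density'] = df['pop'] / df['area']

# A single label selects a column.
column = df['area']

# A slice selects rows, by label (end included) or by position (end excluded).
rows_by_label = df['Florida':'Illinois']
rows_by_position = df[1:3]

# A Boolean mask also selects rows.
rows_by_mask = df[df['density'] > 100]

print(list(rows_by_label.index))
print(list(rows_by_position.index))
print(list(rows_by_mask.index))
```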


## Summary

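### Series
1. A `Series` behaves like a generalized dictionary: it supports many Python dictionary-style operations, such as key lookup, `get`, `keys`, and `items`.
2. A `Series` behaves like a one-dimensional NumPy array: it supports slicing, masking, and fancy indexing, either by explicit label (`loc`) or by implicit position (`iloc`).

### DataFrame
1. A `DataFrame` behaves like a dictionary of `Series` columns.
2. A `DataFrame` behaves like a two-dimensional NumPy array.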