# Data Selection in Series

As we begin this section, keep in mind that we can look at a `Series` object in two ways: as a *one-dimensional NumPy array* and, in many ways, as a *standard Python dictionary*. Keeping both analogies in mind will help us understand the patterns of data indexing and selection in these arrays.

## Series as a dictionary

In [1]:
import numpy as np
import pandas as pd

In [2]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index = ['a', 'b', 'c', 'd'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [3]:
data['a']

0.25

In [4]:
'a' in data

True

In [7]:
# you can extend a `Series` object like a dictionary
data['e'] = 1.25
data

a    0.25
b    0.50
c    0.75
d    1.00
e    1.25
dtype: float64

In [15]:
data.keys()

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [13]:
data.values

array([0.25, 0.5 , 0.75, 1.  , 1.25])

In [18]:
list(data.items())

[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0), ('e', 1.25)]

## `Series` as one-dimensional array

A `Series` behaves like a dictionary in some ways, but it also provides NumPy-array-style selection mechanisms: *slices*, *masking*, and *fancy indexing*.

In [8]:
# slicing by explicit index
data['a':'c']

a    0.25
b    0.50
c    0.75
dtype: float64

In [9]:
# slicing by implicit integer index
data[0:2]

a    0.25
b    0.50
dtype: float64

Notice that when slicing with an *explicit index* (`data['a':'c']`), the final index is included in the slice, while when slicing with an *implicit index* (`data[0:2]`), the usual Python convention applies and the final index is excluded.
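
The difference between the two slicing conventions can be checked directly. The following is a minimal sketch that rebuilds the same five-element `Series` under the stand-in name `s` (an assumption made so the block runs on its own):

```python
import pandas as pd

# Stand-in for the `data` Series used above (same labels and values).
s = pd.Series([0.25, 0.5, 0.75, 1.0, 1.25],
              index=['a', 'b', 'c', 'd', 'e'])

by_label = s['a':'c']    # label slice: the endpoint 'c' is included
by_position = s[0:2]     # integer slice: position 2 is excluded

print(list(by_label.index))
print(list(by_position.index))
```

The label slice returns three rows and the integer slice returns two, even though both are written with the same `start:stop` syntax.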

In [10]:
# masking
data[(data>0.3) & (data<0.8)]

b    0.50
c    0.75
dtype: float64

In [11]:
# fancy indexing
data[['a','e']]

a    0.25
e    1.25
dtype: float64

## Indexers: loc, iloc, and ix

Slicing conventions can be a source of confusion. For example, when a `Series` has an explicit integer index, an indexing operation such as `data[1]` uses the explicit indices, while a slicing operation such as `data[1:3]` uses the implicit Python-style index:

In [19]:
data = pd.Series(['a', 'b', 'c'], index=[1,3,5])
data

1    a
3    b
5    c
dtype: object

In [20]:
# explicit index when indexing
data[1]

'a'

In [21]:
# implicit index when slicing
data[1:3]

3    b
5    c
dtype: object

Because of the confusion that can arise from these two indexing conventions, Pandas provides special *indexer* attributes that explicitly expose certain indexing schemes. These are not functional methods, but attributes that expose a particular slicing interface to the data in the `Series`.

`loc` - allows indexing and slicing that always references the explicit index:

In [22]:
data.loc[1]

'a'

In [30]:
data.loc[1:3]

1    a
3    b
dtype: object

`iloc` - allows indexing and slicing that always references the implicit Python-style index:

In [26]:
data.iloc[1]

'b'

In [27]:
data.iloc[1:3]

3    b
5    c
dtype: object

A guiding principle in Python code is "explicit is better than implicit". The explicit nature of `loc` and `iloc` makes them very useful in maintaining clean and readable code. It is recommended to use both of these to make code easier to read and understand and to prevent subtle bugs due to the mixed indexing/slicing convention.
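
As a compact summary of the two indexers, here is a small sketch on the same integer-labelled `Series` (rebuilt under the stand-in name `s` so the block is self-contained). It only restates the behaviour shown in the cells above:

```python
import pandas as pd

# Stand-in for the integer-labelled Series used above.
s = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

# loc always works with labels; a label slice includes its endpoint.
label_item = s.loc[1]
label_slice = list(s.loc[1:3])

# iloc always works with positions; a position slice excludes its endpoint.
position_item = s.iloc[1]
position_slice = list(s.iloc[1:3])

print(label_item, label_slice)
print(position_item, position_slice)
```

Because the indexer name states which convention is in force, a reader does not need to know the index dtype to predict the result.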

# Data Selection in DataFrame
Keep in mind the ways in which you can view a DataFrame:
1. a two-dimensional or structured array
2. like a dictionary of `Series` structures sharing the same index

## DataFrame as a dictionary

In [31]:
area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
population = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
data = pd.DataFrame({'area':area, 'pop':population})
data

              area       pop
California  423967  38332521
Texas       695662  26448193
New York    141297  19651127
Florida     170312  19552860
Illinois    149995  12882135


The individual `Series` that make up the columns of the `DataFrame` can be accessed via dictionary-style indexing of the column name:

In [32]:
data['pop']

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
Name: pop, dtype: int64

Equivalently, we can use attribute-style access for column names that are strings:

In [33]:
data.area

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

In [37]:
data.area is data['area']

True
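
Attribute-style access is convenient, but it only works when the column name does not collide with an existing `DataFrame` attribute. The sketch below uses a two-row stand-in frame named `df` (an assumption for illustration) to show that `pop`, which is also the name of the `DataFrame.pop()` method, cannot be reached as an attribute:

```python
import pandas as pd

# Small stand-in for the states DataFrame used above.
df = pd.DataFrame({'area': [423967, 695662],
                   'pop': [38332521, 26448193]},
                  index=['California', 'Texas'])

# 'area' is not a DataFrame attribute, so both access styles return the column.
area_matches = df.area.equals(df['area'])

# 'pop' is the name of a DataFrame method, so df.pop is that method, not the column.
pop_is_column = isinstance(df.pop, pd.Series)
pop_is_method = callable(df.pop)

print(area_matches, pop_is_column, pop_is_method)
```

For this reason, dictionary-style access such as `df['pop']` is the safer choice, particularly when assigning to a column.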

Using the dictionary-style syntax, we can also modify the object, in this case adding a new column:

In [38]:
data['density'] = data['pop']/data['area']
data

              area       pop     density
California  423967  38332521   90.413926
Texas       695662  26448193   38.018740
New York    141297  19651127  139.076746
Florida     170312  19552860  114.806121
Illinois    149995  12882135   85.883763


This is the same vectorized arithmetic we saw with NumPy arrays: the division is carried out element-wise, row by row.
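
To make "element-wise" concrete, the sketch below compares the vectorized division with an explicit per-row loop. It assumes a three-row stand-in frame `df`; the names `vectorized` and `looped` are illustrative only:

```python
import pandas as pd

# Three-row stand-in for the states DataFrame used above.
df = pd.DataFrame({'area': [423967, 695662, 141297],
                   'pop': [38332521, 26448193, 19651127]},
                  index=['California', 'Texas', 'New York'])

# Vectorized: one expression, rows are matched up by index label.
vectorized = df['pop'] / df['area']

# Explicit loop over the row labels, computing the same ratio one row at a time.
looped = pd.Series({state: df.loc[state, 'pop'] / df.loc[state, 'area']
                    for state in df.index})

same_values = bool((vectorized.to_numpy() == looped.to_numpy()).all())
print(same_values)
```

Both approaches give the same numbers, but the vectorized form is shorter and avoids a Python-level loop.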

## DataFrame as two-dimensional array
We can also look at a `DataFrame` as an enhanced two-dimensional array. We can view the raw underlying data using the `values` attribute:

In [39]:
data.values

array([[4.23967000e+05, 3.83325210e+07, 9.04139261e+01],
       [6.95662000e+05, 2.64481930e+07, 3.80187404e+01],
       [1.41297000e+05, 1.96511270e+07, 1.39076746e+02],
       [1.70312000e+05, 1.95528600e+07, 1.14806121e+02],
       [1.49995000e+05, 1.28821350e+07, 8.58837628e+01]])

In [40]:
data.T

         California       Texas    New York     Florida    Illinois
area       423967.0    695662.0    141297.0    170312.0    149995.0
pop      38332520.0  26448190.0  19651130.0  19552860.0  12882140.0
density    90.41393    38.01874    139.0767    114.8061    85.88376


In [41]:
data.values[0]

array([4.23967000e+05, 3.83325210e+07, 9.04139261e+01])

In [42]:
data['area']

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

For array-style indexing, we need another convention. This is where we return to the Pandas `loc` and `iloc` indexers introduced earlier. Using the `iloc` indexer, we can index the underlying array as if it were a simple NumPy array (using the implicit Python-style index), but the `DataFrame` index and column labels are maintained in the result:

In [44]:
data.iloc[:3,:2]

              area       pop
California  423967  38332521
Texas       695662  26448193
New York    141297  19651127
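
The relationship between `iloc` and plain NumPy indexing can be sketched as follows. The frame `df` is a four-row stand-in, and `raw` and `subset` are illustrative names:

```python
import pandas as pd

# Four-row stand-in for the states DataFrame used above.
df = pd.DataFrame({'area': [423967, 695662, 141297, 170312],
                   'pop': [38332521, 26448193, 19651127, 19552860]},
                  index=['California', 'Texas', 'New York', 'Florida'])

raw = df.to_numpy()[:3, :2]   # plain ndarray: same numbers, labels are lost
subset = df.iloc[:3, :2]      # DataFrame: same numbers, labels are kept

print(type(raw).__name__, type(subset).__name__)
print(list(subset.index), list(subset.columns))
```

The positions selected are identical in both cases; only `iloc` carries the row and column labels through to the result.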


Similarly, using the `loc` indexer, we can index the underlying data in an array-like style but using the explicit index and column names:

In [46]:
data.loc[:'New York', :'pop']

              area       pop
California  423967  38332521
Texas       695662  26448193
New York    141297  19651127


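The `ix` indexer, a hybrid of the two, was deprecated in pandas 0.20.0 and removed in pandas 1.0. The same mixed selection (rows by position, columns by label) can be written by chaining `iloc` and `loc`: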

In [51]:
data.iloc[:3].loc[:,:'pop']

              area       pop
California  423967  38332521
Texas       695662  26448193
New York    141297  19651127
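
Chaining is one way to mix positions and labels. As a sketch of two alternatives, the positions can be translated into labels with `df.index[...]`, or the label can be translated into a position with `Index.get_loc`. The frame `df` is a rebuilt stand-in, and `rows` and `col_stop` are illustrative names:

```python
import pandas as pd

# Stand-in for the states DataFrame used above, including the density column.
df = pd.DataFrame({'area': [423967, 695662, 141297, 170312, 149995],
                   'pop': [38332521, 26448193, 19651127, 19552860, 12882135]},
                  index=['California', 'Texas', 'New York', 'Florida', 'Illinois'])
df['density'] = df['pop'] / df['area']

# Chained form, as in the cell above.
chained = df.iloc[:3].loc[:, :'pop']

# Option 1: turn the row positions into labels, then use loc only.
rows = df.index[:3]
via_loc = df.loc[rows, :'pop']

# Option 2: turn the column label into a position, then use iloc only.
col_stop = df.columns.get_loc('pop') + 1   # +1 because iloc excludes the endpoint
via_iloc = df.iloc[:3, :col_stop]

print(chained.equals(via_loc), chained.equals(via_iloc))
```

The single-indexer forms are usually preferable when assigning values, since assignment through a chained selection may act on a temporary copy.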


Any familiar NumPy-style data access pattern can be used within these indexers. For example, we can combine masking and fancy indexing in the `loc` indexer:

In [52]:
data.loc[data.density>100,['pop', 'density']]

               pop     density
New York  19651127  139.076746
Florida   19552860  114.806121
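
It can help to see the two ingredients of this selection separately. In the sketch below (with `df` as a rebuilt stand-in and `mask` as an illustrative name), the Boolean mask picks the rows and the list of names picks the columns:

```python
import pandas as pd

# Stand-in for the states DataFrame used above, including the density column.
df = pd.DataFrame({'area': [423967, 695662, 141297, 170312, 149995],
                   'pop': [38332521, 26448193, 19651127, 19552860, 12882135]},
                  index=['California', 'Texas', 'New York', 'Florida', 'Illinois'])
df['density'] = df['pop'] / df['area']

# Row selector: a Boolean Series, True where the density exceeds 100.
mask = df['density'] > 100

# Column selector: a list of labels (fancy indexing).
dense = df.loc[mask, ['pop', 'density']]

print(mask.tolist())
print(list(dense.index))
```

Only the rows where the mask is `True` survive, and the columns appear in the order given in the list.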


Any of these indexing conventions can also be used to set or modify values:

In [53]:
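data.loc['Illinois', 'density'] = 85
data

              area       pop     density
California  423967  38332521   90.413926
Texas       695662  26448193   38.018740
New York    141297  19651127  139.076746
Florida     170312  19552860  114.806121
Illinois    149995  12882135   85.000000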
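
When assigning through an indexer, the row selector and the column selector are separated by a comma. A colon between two labels is read as a slice of row labels instead, which fails with a `KeyError` if the second label is not in the row index. A minimal sketch, assuming a rebuilt stand-in frame `df`:

```python
import pandas as pd

# Stand-in for the states DataFrame used above, including the density column.
df = pd.DataFrame({'area': [423967, 695662, 141297, 170312, 149995],
                   'pop': [38332521, 26448193, 19651127, 19552860, 12882135]},
                  index=['California', 'Texas', 'New York', 'Florida', 'Illinois'])
df['density'] = df['pop'] / df['area']

# Label-based assignment: row label, then column label.
df.loc['Illinois', 'density'] = 85

# Position-based assignment: row 0 (California), column 2 (density).
df.iloc[0, 2] = 90

print(df['density'])
```

Both assignments change a single cell in place and leave the other rows untouched.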