# Chapter 14. Data Indexing and Selection
Here we'll look at means of accessing and modifying values in Pandas Series and DataFrame objects, similar to the indexing, slicing, masking, and fancy indexing used with NumPy arrays.

In [2]:
import numpy as np
import pandas as pd

## Data Selection in Series

In [3]:
data = pd.Series([0.25,0.5,0.75,1.0],
                 index=['a','b','c','d'])

In [4]:
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [5]:
data['b']

np.float64(0.5)

In [6]:
'c' in data

True

In [7]:
'd' in data

True

In [8]:
data.keys()

Index(['a', 'b', 'c', 'd'], dtype='object')

In [9]:
list(data.items())

[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]

### modification

In [10]:
data['b'] = 12

In [11]:
data

a     0.25
b    12.00
c     0.75
d     1.00
dtype: float64

In [12]:
# adding a value
data['f'] = 21
data

a     0.25
b    12.00
c     0.75
d     1.00
f    21.00
dtype: float64

### slicing

In [20]:
data['a':'d']

a     0.25
b    12.00
c     0.75
d     1.00
dtype: float64

In [14]:
data['c':]

c     0.75
d     1.00
f    21.00
dtype: float64

In [21]:
data[0:2]

a     0.25
b    12.00
dtype: float64

### masking

In [15]:
data[data>1]

b    12.0
f    21.0
dtype: float64

In [17]:
data[(data>1) & (data <20)]

b    12.0
dtype: float64

### fancy indexing

In [19]:
data[['a','b','f','d']]

a     0.25
b    12.00
f    21.00
d     1.00
dtype: float64

If your Series has an explicit integer index, an indexing operation such as data[1] will use the explicit indices, while a slicing operation like data[1:3] will use the implicit Python-style indices:

In [23]:
item = pd.Series([1,2,3,4,5,6],
                 index=[1,2,3,4,5,6])
item

1    1
2    2
3    3
4    4
5    5
6    6
dtype: int64

In [24]:
item[2] # explicit

np.int64(2)

In [28]:
item[1:4] # it uses the implicit indices

2    2
3    3
4    4
dtype: int64

Because of this potential confusion in the case of integer indexes, Pandas provides some special indexer attributes that explicitly expose certain indexing schemes. These are not functional methods, but attributes that expose a particular slicing interface to the data in the Series.

First, the **loc** attribute allows indexing and slicing that always references the explicit index:

In [27]:
item.loc[1:4]

1    1
2    2
3    3
4    4
dtype: int64

In [30]:
item.loc[2]

np.int64(2)

The **iloc** attribute allows indexing and slicing that always references the
implicit Python-style index:

In [31]:
item.iloc[2]

np.int64(3)

In [32]:
item.iloc[1:4]

2    2
3    3
4    4
dtype: int64

**One guiding principle of Python code is that “explicit is better than
implicit.” The explicit nature of loc and iloc makes them helpful in
maintaining clean and readable code; especially in the case of integer
indexes, using them consistently can prevent subtle bugs due to the mixed
indexing/slicing convention.**
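
The difference can be checked side by side. The following is a minimal sketch that rebuilds the `item` Series from above; the names `by_label` and `by_position` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Same integer-labelled Series as above: labels 1..6, values 1..6
item = pd.Series([1, 2, 3, 4, 5, 6], index=[1, 2, 3, 4, 5, 6])

# .loc always uses the explicit labels; a label slice includes its endpoint
by_label = item.loc[1:4]

# .iloc always uses implicit positions; a position slice excludes its endpoint
by_position = item.iloc[1:4]

print(by_label.tolist())     # [1, 2, 3, 4]
print(by_position.tolist())  # [2, 3, 4]
```

Spelling out the indexer makes the intent visible at the call site, so a reader does not have to recall which convention plain square brackets follow.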

## Data Selection in DataFrames

In [33]:
area = pd.Series({'California': 423967, 'Texas': 695662,
                  'Florida': 170312, 'New York': 141297,
                  'Pennsylvania': 119280})
pop = pd.Series({'California': 39538223, 'Texas': 29145505,
                 'Florida': 21538187, 'New York': 20201249,
                 'Pennsylvania': 13002700})
data = pd.DataFrame({'area': area, 'pop': pop})

In [34]:
data

                area       pop
California    423967  39538223
Texas         695662  29145505
Florida       170312  21538187
New York      141297  20201249
Pennsylvania  119280  13002700


In [35]:
# Access the individual Series that make up the DataFrame
data['area']

California      423967
Texas           695662
Florida         170312
New York        141297
Pennsylvania    119280
Name: area, dtype: int64

In [36]:
data.area

California      423967
Texas           695662
Florida         170312
New York        141297
Pennsylvania    119280
Name: area, dtype: int64

In [37]:
data.pop

<bound method DataFrame.pop of                 area       pop
California    423967  39538223
Texas         695662  29145505
Florida       170312  21538187
New York      141297  20201249
Pennsylvania  119280  13002700>

**NB**
*Though attribute-style access is a useful shorthand, keep in mind that it does not work in all
cases. If the column names are not strings, or if they conflict with
methods of the DataFrame, attribute-style access is not possible.
For example, the DataFrame has a pop method, so data.pop points to that
method rather than to the pop column, as the output above shows.*
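
A minimal sketch of this collision, assuming a smaller two-state version of the DataFrame above (the name `states` is illustrative): dictionary-style access returns the column, while attribute access returns the method.

```python
import pandas as pd

# Hypothetical two-row frame with the same column names as above
states = pd.DataFrame(
    {"area": [423967, 695662], "pop": [39538223, 29145505]},
    index=["California", "Texas"],
)

# 'area' does not clash with any DataFrame attribute, so both forms agree
print(states.area.equals(states["area"]))    # True

# 'pop' clashes with DataFrame.pop, so attribute access yields the method
print(callable(states.pop))                  # True
print(isinstance(states["pop"], pd.Series))  # True
```

Dictionary-style access works for any column name, which makes it the safer default when column names are not known in advance.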

### modification

In [38]:
data['density'] = data['pop'] / data['area']

In [39]:
data

                area       pop     density
California    423967  39538223   93.257784
Texas         695662  29145505   41.896072
Florida       170312  21538187  126.463121
New York      141297  20201249  142.970120
Pennsylvania  119280  13002700  109.009893


In [40]:
data.keys()

Index(['area', 'pop', 'density'], dtype='object')

In [42]:
data.values

array([[4.23967000e+05, 3.95382230e+07, 9.32577842e+01],
       [6.95662000e+05, 2.91455050e+07, 4.18960717e+01],
       [1.70312000e+05, 2.15381870e+07, 1.26463121e+02],
       [1.41297000e+05, 2.02012490e+07, 1.42970120e+02],
       [1.19280000e+05, 1.30027000e+07, 1.09009893e+02]])

In [45]:
data.values[0]

array([4.23967000e+05, 3.95382230e+07, 9.32577842e+01])

In [47]:
data.iloc[:3,:2]

              area       pop
California  423967  39538223
Texas       695662  29145505
Florida     170312  21538187


In [50]:
data.loc['Texas':'New York']

            area       pop     density
Texas     695662  29145505   41.896072
Florida   170312  21538187  126.463121
New York  141297  20201249  142.970120


In [52]:
data.loc['Texas':'New York',:'pop']

            area       pop
Texas     695662  29145505
Florida   170312  21538187
New York  141297  20201249
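
The masking and fancy indexing shown earlier for Series can also be combined inside the loc indexer of a DataFrame. The following is a hedged sketch rather than part of the original notebook: it rebuilds the same DataFrame, and both the threshold of 120 and the name `dense` are chosen only for illustration.

```python
import pandas as pd

# Rebuild the DataFrame used in this section
area = pd.Series({'California': 423967, 'Texas': 695662,
                  'Florida': 170312, 'New York': 141297,
                  'Pennsylvania': 119280})
pop = pd.Series({'California': 39538223, 'Texas': 29145505,
                 'Florida': 21538187, 'New York': 20201249,
                 'Pennsylvania': 13002700})
data = pd.DataFrame({'area': area, 'pop': pop})
data['density'] = data['pop'] / data['area']

# Rows: Boolean mask on density; columns: explicit list of labels
dense = data.loc[data['density'] > 120, ['pop', 'density']]

print(dense.index.tolist())    # ['Florida', 'New York']
print(dense.columns.tolist())  # ['pop', 'density']
```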
