# Chapter 10: Indexing Operations

- Both Series and DataFrame objects have attributes (``.loc`` and ``.iloc``) that we can index against.

In [3]:
import pandas as pd
import numpy as np

url = "http://github.com/mattharrison/datasets/raw/master/data/vehicles.csv.zip"
df = pd.read_csv(url)
city_mpg = df.city08
highway_mpg = df.highway08
make = df.make


## 10.1 Prepping Data and Renaming the Index

- We can use the ``.rename`` method to change the index labels.
- We can pass in a dictionary that maps each old index label to its new label.
- The ``.rename`` method also accepts a series or a function that takes an old label and returns a new label. A scalar or a sequence is treated as a new name for the series and leaves the index untouched.
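The kinds of arguments listed above can be compared on a small series. This is a minimal sketch; the toy series `mpg` and its values are made up for illustration and are not part of the vehicles data.

```python
import pandas as pd

# Hypothetical toy series (not the vehicles dataset)
mpg = pd.Series([19, 9, 23], index=[0, 1, 2], name="city08")

# dictionary: old label -> new label
by_dict = mpg.rename({0: "Alfa Romeo", 1: "Ferrari", 2: "Dodge"})

# function: called once per old label
by_func = mpg.rename(lambda label: f"car_{label}")

# series: looked up by old label, like a dictionary
by_series = mpg.rename(pd.Series(["Alfa Romeo", "Ferrari", "Dodge"]))

# scalar: changes the series name, the index is left alone
renamed = mpg.rename("city_mpg")

print(by_dict.index.tolist())
print(by_func.index.tolist())
print(renamed.name, renamed.index.tolist())
```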

In [4]:
make.to_dict()

{0: 'Alfa Romeo',
 1: 'Ferrari',
 2: 'Dodge',
 3: 'Dodge',
 4: 'Subaru',
 5: 'Subaru',
 6: 'Subaru',
 7: 'Toyota',
 8: 'Toyota',
 9: 'Toyota',
 10: 'Toyota',
 11: 'Volkswagen',
 12: 'Volkswagen',
 13: 'Volkswagen',
 14: 'Dodge',
 15: 'Volkswagen',
 16: 'Volvo',
 17: 'Volvo',
 18: 'Audi',
 19: 'Audi',
 20: 'BMW',
 21: 'BMW',
 22: 'BMW',
 23: 'Buick',
 24: 'Buick',
 25: 'Dodge',
 26: 'Buick',
 27: 'Buick',
 28: 'Buick',
 29: 'Buick',
 30: 'Buick',
 31: 'Cadillac',
 32: 'Cadillac',
 33: 'Cadillac',
 34: 'Cadillac',
 35: 'Chevrolet',
 36: 'Dodge',
 37: 'Chevrolet',
 38: 'Chevrolet',
 39: 'Chevrolet',
 40: 'Chevrolet',
 41: 'Chrysler',
 42: 'CX Automotive',
 43: 'CX Automotive',
 44: 'Nissan',
 45: 'Nissan',
 46: 'Nissan',
 47: 'Dodge',
 48: 'Dodge',
 49: 'Dodge',
 50: 'Dodge',
 51: 'Dodge',
 52: 'Dodge',
 53: 'Dodge',
 54: 'Dodge',
 55: 'Dodge',
 56: 'Ford',
 57: 'Ford',
 58: 'Dodge',
 59: 'Ford',
 60: 'Ford',
 61: 'Ford',
 62: 'Ford',
 63: 'Ford',
 64: 'Hyundai',
 65: 'Hyundai',
 66: 'Hyu

In [5]:
# using dictionary
city2 = city_mpg.rename(make.to_dict())

In [6]:
city2

Alfa Romeo    19
Ferrari        9
Dodge         23
Dodge         10
Subaru        17
              ..
Subaru        19
Subaru        20
Subaru        18
Subaru        18
Subaru        16
Name: city08, Length: 41144, dtype: int64

In [7]:
city2.index

Index(['Alfa Romeo', 'Ferrari', 'Dodge', 'Dodge', 'Subaru', 'Subaru', 'Subaru',
       'Toyota', 'Toyota', 'Toyota',
       ...
       'Saab', 'Saturn', 'Saturn', 'Saturn', 'Saturn', 'Subaru', 'Subaru',
       'Subaru', 'Subaru', 'Subaru'],
      dtype='object', length=41144)

In [8]:
city2 = city_mpg.rename(make)

In [9]:
city2

Alfa Romeo    19
Ferrari        9
Dodge         23
Dodge         10
Subaru        17
              ..
Subaru        19
Subaru        20
Subaru        18
Subaru        18
Subaru        16
Name: city08, Length: 41144, dtype: int64

## 10.2 Resetting the Index

In [10]:
city2.reset_index()

            index  city08
0      Alfa Romeo      19
1         Ferrari       9
2           Dodge      23
3           Dodge      10
4          Subaru      17
...           ...     ...
41139      Subaru      19
41140      Subaru      20
41141      Subaru      18
41142      Subaru      18


In [11]:
# drop current index and return a series
city2.reset_index(drop=True)

0        19
1         9
2        23
3        10
4        17
         ..
41139    19
41140    20
41141    18
41142    18
41143    16
Name: city08, Length: 41144, dtype: int64
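The two calls above return different types, which is easy to miss. A small sketch on a made-up two-row series (the values are illustrative only) shows the difference, plus one way to give the new column a better name than ``index``.

```python
import pandas as pd

# Hypothetical toy series (not the vehicles dataset)
s = pd.Series([19, 9], index=["Alfa Romeo", "Ferrari"], name="city08")

# Without drop: the old index becomes a column, so we get a DataFrame
as_frame = s.reset_index()

# With drop=True: the old index is discarded, so we still have a Series
as_series = s.reset_index(drop=True)

# Naming the index first controls the name of the new column
named = s.rename_axis("make").reset_index()

print(type(as_frame).__name__, as_frame.columns.tolist())
print(type(as_series).__name__, as_series.index.tolist())
print(named.columns.tolist())
```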

## 10.3 The .loc Attribute

- Indexing directly on a series object (``s[...]``) is not recommended, because it can be ambiguous whether a label or a position is meant.
- Prefer indexing off of the ``.loc`` or ``.iloc`` attributes.
- The ``.loc`` attribute selects by index label.
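The difference between label and position is easiest to see when the labels are themselves integers. The following sketch uses a made-up series whose integer labels run in reverse order.

```python
import pandas as pd

# Hypothetical toy series: integer labels that do not match positions
s = pd.Series([10, 20, 30], index=[2, 1, 0])

by_label = s.loc[0]      # the row whose label is 0
by_position = s.iloc[0]  # the first row, whatever its label

print(by_label, by_position)
```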

In [12]:
city2.loc["Subaru"]

Subaru    17
Subaru    21
Subaru    22
Subaru    19
Subaru    20
          ..
Subaru    19
Subaru    20
Subaru    18
Subaru    18
Subaru    16
Name: city08, Length: 885, dtype: int64

In [13]:
# when only one entry has that label, a scalar is returned
city2.loc["Fisker"]

20

In [14]:
# pass a list of labels to guarantee that a series is returned
city2.loc[["Fisker"]]

Fisker    20
Name: city08, dtype: int64

In [15]:
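# slicing by label: sort the index first because it has duplicate labels
# lowercase "lamborghini" sorts after every capitalized make, so the slice runs through Yugo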
city2.sort_index().loc["Ferrari":"lamborghini"]

Ferrari    10
Ferrari    13
Ferrari    13
Ferrari     9
Ferrari    10
           ..
Yugo       21
Yugo       24
Yugo       23
Yugo       23
Yugo       22
Name: city08, Length: 28504, dtype: int64
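Two details of label slicing explain the output above: the stop label is included, and string labels are compared case-sensitively, with uppercase letters sorting before lowercase ones. A minimal sketch on a made-up series (labels and values are illustrative):

```python
import pandas as pd

# Hypothetical toy series; sort_index puts capitalized labels first
s = pd.Series([1, 2, 3, 4], index=["Yugo", "smart", "Audi", "Ford"]).sort_index()
print(s.index.tolist())

# The stop label is included, unlike a normal Python slice
inclusive = s.loc["Audi":"Ford"]

# "lamborghini" is not in the index; the slice stops where it would sort,
# which is after "Yugo" but before "smart"
lowercase_stop = s.loc["Ford":"lamborghini"]

print(inclusive.index.tolist())
print(lowercase_stop.index.tolist())
```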

In [16]:
idx = pd.Index(['Dodge'])
city2.loc[idx]

Dodge    23
Dodge    10
Dodge    12
Dodge    11
Dodge    11
         ..
Dodge    18
Dodge    17
Dodge    14
Dodge    14
Dodge    11
Name: city08, Length: 2583, dtype: int64

- Using boolean array with ``.loc``

In [17]:
# using boolean array
mask = city2 > 50
city2.loc[mask]

Nissan     81
Toyota     81
Toyota     81
Ford       74
Nissan     84
         ... 
Tesla     140
Tesla     115
Tesla     104
Tesla      98
Toyota     55
Name: city08, Length: 236, dtype: int64
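Boolean arrays can also be combined before they are passed to ``.loc``. The sketch below uses a made-up four-row series; note that ``&``, ``|`` and ``~`` are used rather than ``and``, ``or`` and ``not``.

```python
import pandas as pd

# Hypothetical toy series (not the vehicles dataset)
mpg = pd.Series([19, 81, 55, 9],
                index=["Alfa Romeo", "Nissan", "Toyota", "Ferrari"])

high = mpg > 50                  # boolean series aligned with mpg
toyota = mpg.index == "Toyota"   # boolean array built from the labels

high_toyota = mpg.loc[high & toyota]  # both conditions hold
not_high = mpg.loc[~high]             # negated condition

print(high_toyota)
print(not_high)
```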

- Passing a function to ``.loc`` comes in handy when chaining operations: the function receives the intermediate series and returns a boolean array to select with.

In [18]:
# using function
cost = pd.Series([1.00, 2.25, 3.99, 0.99, 2.79],
                 index=['Gum', 'Cookie', 'Melon', 'Roll', 'Carrots'])

inflation = 1.10

In [20]:
cost

Gum        1.00
Cookie     2.25
Melon      3.99
Roll       0.99
Carrots    2.79
dtype: float64

In [19]:
# option 1
(cost
.mul(inflation)
.loc[lambda s_: s_ > 3])

Melon      4.389
Carrots    3.069
dtype: float64

In [24]:
# option 2
def gt3(s):
    return s > 3

(cost
.mul(inflation)
.loc[gt3])

Melon      4.389
Carrots    3.069
dtype: float64
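The reason a function helps in a chain is that the intermediate series has no variable name, so a mask built from the original ``cost`` tests the wrong values. This sketch reuses the ``cost`` and ``inflation`` values from above; the names ``stale`` and ``fresh`` are only for illustration.

```python
import pandas as pd

cost = pd.Series([1.00, 2.25, 3.99, 0.99, 2.79],
                 index=['Gum', 'Cookie', 'Melon', 'Roll', 'Carrots'])
inflation = 1.10

# Mask built from the original prices: Carrots (2.79) is excluded
# even though its inflated price is above 3
stale = cost.mul(inflation).loc[cost > 3]

# Function receives the inflated prices, so both rows qualify
fresh = cost.mul(inflation).loc[lambda s_: s_ > 3]

print(stale.index.tolist())
print(fresh.index.tolist())
```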

## 10.4 The .iloc Attribute

- When we index off of this attribute, we pull out items by integer position (starting at 0), regardless of the index labels.

In [26]:
city2

Alfa Romeo    19
Ferrari        9
Dodge         23
Dodge         10
Subaru        17
              ..
Subaru        19
Subaru        20
Subaru        18
Subaru        18
Subaru        16
Name: city08, Length: 41144, dtype: int64

In [25]:
city2.iloc[0]

19

In [27]:
city2.iloc[-1]

16

In [28]:
city2.iloc[[0,1,-1]]

Alfa Romeo    19
Ferrari        9
Subaru        16
Name: city08, dtype: int64

In [29]:
city2.iloc[0:5]

Alfa Romeo    19
Ferrari        9
Dodge         23
Dodge         10
Subaru        17
Name: city08, dtype: int64
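Position-based slices follow normal Python rules: the stop position is excluded, negative positions count from the end, and a slice that runs past the end is truncated, while a single out-of-range position raises an error. A small sketch on a made-up series:

```python
import pandas as pd

# Hypothetical toy series (not the vehicles dataset)
s = pd.Series([19, 9, 23, 10, 17],
              index=["Alfa Romeo", "Ferrari", "Dodge", "Dodge", "Subaru"])

middle = s.iloc[1:3]         # positions 1 and 2; position 3 is excluded
last_two = s.iloc[-2:]       # negative positions count from the end
every_other = s.iloc[::2]    # a step works as well
past_end = s.iloc[3:100]     # slices beyond the end are truncated

try:
    s.iloc[100]              # a single out-of-range position is an error
    raised = False
except IndexError:
    raised = True

print(middle.tolist(), last_two.tolist(), every_other.tolist(), past_end.tolist())
```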

## 10.5 Heads and Tails

- Useful for quick inspection of a chunk of data

In [30]:
city2.head(3)

Alfa Romeo    19
Ferrari        9
Dodge         23
Name: city08, dtype: int64

In [31]:
city2.tail(3)

Subaru    18
Subaru    18
Subaru    16
Name: city08, dtype: int64
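``.head`` and ``.tail`` return the same rows as the matching ``.iloc`` slices, and both accept a negative count. A brief sketch on a made-up series to confirm this:

```python
import pandas as pd

# Hypothetical toy series (not the vehicles dataset)
s = pd.Series([19, 9, 23, 10, 17],
              index=["Alfa Romeo", "Ferrari", "Dodge", "Dodge", "Subaru"])

same_head = s.head(2).equals(s.iloc[:2])
same_tail = s.tail(2).equals(s.iloc[-2:])

# A negative count means "everything except the last n rows"
all_but_last = s.head(-1)

print(same_head, same_tail, all_but_last.tolist())
```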

## 10.6 Sampling

- Often the first few entries of the data may be incomplete, test data, or not representative of all the values.
- The ``.sample`` method pulls values at random; pass ``random_state`` to make the result reproducible.

In [32]:
city2.sample(6, random_state=42)

Volvo         16
Mitsubishi    19
Buick         27
Jeep          15
Land Rover    13
Saab          17
Name: city08, dtype: int64
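Beyond a fixed count, ``.sample`` can take a fraction of the rows or sample with replacement. The sketch below uses a made-up series of ten integers and checks only the properties that do not depend on which rows are drawn.

```python
import pandas as pd

# Hypothetical toy series: the integers 0 through 9
s = pd.Series(range(10))

# The same random_state gives the same sample every time
a = s.sample(3, random_state=42)
b = s.sample(3, random_state=42)

# frac selects a proportion of the rows instead of a count
half = s.sample(frac=0.5, random_state=0)

# replace=True allows a sample larger than the series
boot = s.sample(n=20, replace=True, random_state=0)

print(a.equals(b), len(half), len(boot))
```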

## 10.7 Filtering Index Values

- The ``.filter`` method filters index labels by exact match (``items``), substring (``like``), or regular expression (``regex``).
- Exact matching (``items``) fails when the index has duplicate labels.

In [34]:
# using substring matches
city2.filter(like='rd')

Ford    18
Ford    16
Ford    17
Ford    17
Ford    15
        ..
Ford    26
Ford    19
Ford    21
Ford    18
Ford    19
Name: city08, Length: 3371, dtype: int64

In [35]:
# regular expression
city2.filter(regex='(Ford)|(Subaru)')

Subaru    17
Subaru    21
Subaru    22
Ford      18
Ford      16
          ..
Subaru    19
Subaru    20
Subaru    18
Subaru    18
Subaru    16
Name: city08, Length: 4256, dtype: int64
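The exact-match form is not shown above because ``city2`` has duplicate labels. As a sketch, the three modes can be compared on a made-up series with unique labels, followed by a check of what happens when the labels repeat.

```python
import pandas as pd

# Hypothetical toy series with unique labels
s = pd.Series([18, 24, 17], index=["Ford", "Fiat", "Subaru"])

exact = s.filter(items=["Ford", "Subaru"])         # exact labels
substring = s.filter(like="F")                     # label contains "F"
pattern = s.filter(regex=r"^(Ford|Subaru)$")       # regular expression

# With duplicate labels, exact matching is expected to raise
dup = pd.Series([18, 16], index=["Ford", "Ford"])
try:
    dup.filter(items=["Ford"])
    raised = False
except ValueError:
    raised = True

print(exact.index.tolist(), substring.index.tolist(), pattern.index.tolist())
print(raised)
```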

## 10.8 Reindexing

- The ``.reindex`` method allows you to pull out values by index label.
- Unlike ``.loc``, we can pass in labels that are not in the index and it will not throw an error. Rather, it will insert missing values (``NaN``) for those labels.
- ``.reindex`` throws an error if the index of the series itself has duplicate labels; duplicates in the requested labels are fine.
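These behaviours can be checked side by side on a small made-up series. The sketch also uses the ``fill_value`` parameter of ``.reindex``, which is not discussed above, to avoid the conversion to floats that ``NaN`` causes.

```python
import pandas as pd

# Hypothetical toy series with unique labels
s = pd.Series([10, 20, 30], index=["a", "b", "c"])

# .loc raises when a requested label is missing
try:
    s.loc[["a", "z"]]
    loc_raised = False
except KeyError:
    loc_raised = True

# .reindex inserts NaN for the missing label (values become floats)
with_nan = s.reindex(["a", "z"])

# fill_value supplies a replacement and keeps the integer dtype
with_zero = s.reindex(["a", "z"], fill_value=0)

# A series whose own index has duplicates cannot be reindexed
dup = pd.Series([1, 2], index=["a", "a"])
try:
    dup.reindex(["a", "b"])
    dup_raised = False
except ValueError:
    dup_raised = True

print(loc_raised, with_nan.isna().tolist(), with_zero.tolist(), dup_raised)
```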

In [46]:
city_mpg

0        19
1         9
2        23
3        10
4        17
         ..
41139    19
41140    20
41141    18
41142    18
41143    16
Name: city08, Length: 41144, dtype: int64

In [47]:
city_mpg.reindex([0,0, 10, 20, 2_000_000])

0          19.0
0          19.0
10         23.0
20         14.0
2000000     NaN
Name: city08, dtype: float64

In [48]:
s1 = pd.Series([10,20,30], index=['a', 'b', 'c'])
s2 = pd.Series([15,25,35], index=['b', 'c', 'd'])

In [49]:
s1

a    10
b    20
c    30
dtype: int64

In [50]:
s2

b    15
c    25
d    35
dtype: int64

In [52]:
s2.reindex(s1.index)

a     NaN
b    15.0
c    25.0
dtype: float64