In [1]:
import pandas as pd
import numpy as np

# Indexing Operations
Prepping the data and renaming the index:

In [2]:
url = 'https://github.com/mattharrison/datasets/raw/master/data/vehicles.csv.zip'
df=pd.read_csv(url)
city_mpg = df.city08
highway_mpg = df.highway08
make = df.make


In [3]:
# Pass a dictionary that maps each existing index label to its new label
city2 = city_mpg.rename(make.to_dict())
city2

Alfa Romeo    19
Ferrari        9
Dodge         23
Dodge         10
Subaru        17
              ..
Subaru        19
Subaru        20
Subaru        18
Subaru        18
Subaru        16
Name: city08, Length: 41144, dtype: int64

In [4]:
city2.index

Index(['Alfa Romeo', 'Ferrari', 'Dodge', 'Dodge', 'Subaru', 'Subaru', 'Subaru',
       'Toyota', 'Toyota', 'Toyota',
       ...
       'Saab', 'Saturn', 'Saturn', 'Saturn', 'Saturn', 'Subaru', 'Subaru',
       'Subaru', 'Subaru', 'Subaru'],
      dtype='object', length=41144)

In [5]:
city2 = city_mpg.rename(make)  # a Series also works as the mapping from old label to new label

In [6]:
city2

Alfa Romeo    19
Ferrari        9
Dodge         23
Dodge         10
Subaru        17
              ..
Subaru        19
Subaru        20
Subaru        18
Subaru        18
Subaru        16
Name: city08, Length: 41144, dtype: int64
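
To make the renaming behavior concrete, here is a minimal sketch on a two-row series. The toy data and the new name `'city_mpg'` are assumptions for illustration; only `.rename` itself comes from the cells above. It shows that a dictionary, a Series, and a function all relabel the index, while a scalar changes the series name instead.

```python
import pandas as pd

# Hypothetical two-row stand-in for city_mpg.
s = pd.Series([19, 9], index=[0, 1], name='city08')

# A dict maps old label -> new label.
by_dict = s.rename({0: 'Alfa Romeo', 1: 'Ferrari'})

# A Series is dict-like: its index holds old labels, its values the new ones.
by_series = s.rename(pd.Series(['Alfa Romeo', 'Ferrari']))

# A function is applied to each label.
by_func = s.rename(lambda label: f'row_{label}')

# A scalar leaves the index alone and changes the name of the series.
renamed = s.rename('city_mpg')
```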

In [7]:
city2.reset_index()

            index  city08
0      Alfa Romeo      19
1         Ferrari       9
2           Dodge      23
3           Dodge      10
4          Subaru      17
...           ...     ...
41139      Subaru      19
41140      Subaru      20
41141      Subaru      18
41142      Subaru      18
41143      Subaru      16


In [8]:
city2.reset_index(drop=True)  # drop=True discards the old labels instead of keeping them as a column

0        19
1         9
2        23
3        10
4        17
         ..
41139    19
41140    20
41141    18
41142    18
41143    16
Name: city08, Length: 41144, dtype: int64

### 10.3 The `.loc` Attribute

- `.loc` is used with the indexing operator (square brackets) to pull data out of a series
- it works on index labels, so you select pieces of the series by label
- you can pass any of the following into an index operation on `.loc`:
    - a scalar value that is one of the index labels
    - a list of index labels
    - a slice of labels
    - an Index
    - a boolean array (same index labels as the series, but with True or False values)
    - a function that accepts a series and returns one of the above
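
The options in the list above can be sketched on a small series. The `prices` data is a hypothetical stand-in (not from the vehicles dataset), chosen so that one label is duplicated.

```python
import pandas as pd

# Hypothetical toy data with a duplicated label ('tea').
prices = pd.Series([3, 5, 4, 7], index=['tea', 'coffee', 'tea', 'milk'])

scalar_hit = prices.loc['coffee']            # unique label -> scalar value
dup_hit = prices.loc['tea']                  # duplicated label -> Series
list_hit = prices.loc[['milk', 'coffee']]    # list -> Series, in list order
index_hit = prices.loc[pd.Index(['milk'])]   # an Index is treated like a list
bool_hit = prices.loc[prices > 4]            # boolean Series -> rows that are True
func_hit = prices.loc[lambda s: s > 4]       # callable receives the Series

# Label slices include the stop label; sorting makes the range follow label order.
slice_hit = prices.sort_index().loc['coffee':'milk']
```

Note that the scalar form returns a plain value for a unique label but a Series for a duplicated one, which is why passing a list is the safer choice when the return type matters.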


In [9]:
city2.loc['Subaru']

Subaru    17
Subaru    21
Subaru    22
Subaru    19
Subaru    20
          ..
Subaru    19
Subaru    20
Subaru    18
Subaru    18
Subaru    16
Name: city08, Length: 885, dtype: int64

In [10]:
city2.loc['Fisker']

20

In [11]:
city2.loc[['Fisker']]  # to guarantee that a Series is returned, pass a list rather than a scalar

Fisker    20
Name: city08, dtype: int64

In [12]:
city2.loc[['Ferrari', 'Lamborghini']]

Ferrari         9
Ferrari        12
Ferrari        11
Ferrari        10
Ferrari        11
               ..
Lamborghini     6
Lamborghini     8
Lamborghini     8
Lamborghini     8
Lamborghini     8
Name: city08, Length: 357, dtype: int64

There is another trick up the label-slicing sleeve. If the index is sorted, you can slice with
strings that are not actual labels. For example, `"F":"J"` selects every label in `city2` that sorts
at or after "F" and at or before "J": everything that starts with F, G, H, or I, plus a label that is exactly 'J' if one exists, but nothing longer that starts with J (such as 'Jaguar'). No label in this data has the literal value
of the start or the stop, so the result runs from 'Federal Coach' to 'Isuzu':

In [13]:
city2.sort_index().loc["F":"J"]

Federal Coach    15
Federal Coach    13
Federal Coach    13
Federal Coach    14
Federal Coach    13
                 ..
Isuzu            15
Isuzu            15
Isuzu            15
Isuzu            27
Isuzu            18
Name: city08, Length: 9040, dtype: int64
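
The same slicing rule can be checked on a handful of labels. This is a sketch with made-up values; the stop string `'Jz'` is an assumption picked only to show how to widen the range.

```python
import pandas as pd

# Hypothetical values; the labels are makes that appear in the dataset.
makes = pd.Series(
    [15, 9, 20, 30, 18, 22],
    index=['Federal Coach', 'Ferrari', 'Fisker', 'Honda', 'Isuzu', 'Jaguar'],
)

# Neither 'F' nor 'J' is a label; the slice keeps labels with 'F' <= label <= 'J'.
sliced = makes.sort_index().loc['F':'J']

# 'Jaguar' sorts after 'J', so a later stop string is needed to include it.
wider = makes.sort_index().loc['F':'Jz']
```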

You can also pass a pandas Index to `.loc`. This is useful when you have parallel pandas objects that share the same index: if you have filtered one of them, you can make the other conform by passing the filtered object's index into `.loc`. Be aware of duplicate index labels, though. The `city2` series has many duplicated labels, so indexing `.loc` with an Index that contains only 'Dodge' returns every value with that label, and an Index that lists 'Dodge' twice returns each of those values twice (5166 rows instead of 2583):

In [14]:
idx = pd.Index(['Dodge'])

In [15]:
city2.loc[idx]

Dodge    23
Dodge    10
Dodge    12
Dodge    11
Dodge    11
         ..
Dodge    18
Dodge    17
Dodge    14
Dodge    14
Dodge    11
Name: city08, Length: 2583, dtype: int64

In [16]:
idx = pd.Index(['Dodge', 'Dodge'])
city2.loc[idx]

Dodge    23
Dodge    10
Dodge    12
Dodge    11
Dodge    11
         ..
Dodge    18
Dodge    17
Dodge    14
Dodge    14
Dodge    11
Name: city08, Length: 5166, dtype: int64
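
A small sketch of both points, under the assumption of toy data: `city` and `highway` are hypothetical parallel series with unique labels, and `mpg` is a hypothetical series with a duplicated label. Calling `.unique()` on the Index is one way to avoid the doubled rows.

```python
import pandas as pd

# Parallel objects sharing a unique index (hypothetical values).
city = pd.Series([19, 9, 23], index=['a', 'b', 'c'])
highway = pd.Series([25, 14, 33], index=['a', 'b', 'c'])

efficient = city.loc[city > 15]              # filter one object
highway_same = highway.loc[efficient.index]  # conform the other to it

# Duplicate labels multiply: every requested label returns every match.
mpg = pd.Series([23, 10, 9], index=['Dodge', 'Dodge', 'Ferrari'])
idx = pd.Index(['Dodge', 'Dodge'])
repeated = mpg.loc[idx]            # 2 requests x 2 matches = 4 rows
deduped = mpg.loc[idx.unique()]    # 1 request x 2 matches = 2 rows
```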

You can pass a boolean array to `.loc`; it returns only the entries where the array is True:


In [17]:
mask = city2 > 50

In [18]:
mask

Alfa Romeo    False
Ferrari       False
Dodge         False
Dodge         False
Subaru        False
              ...  
Subaru        False
Subaru        False
Subaru        False
Subaru        False
Subaru        False
Name: city08, Length: 41144, dtype: bool

In [19]:
city2.loc[mask]

Nissan     81
Toyota     81
Toyota     81
Ford       74
Nissan     84
         ... 
Tesla     140
Tesla     115
Tesla     104
Tesla      98
Toyota     55
Name: city08, Length: 236, dtype: int64

In [20]:
cost = pd.Series([1.00, 2.25, 3.99, .99, 2.79],
    index=['Gum', 'Cookie', 'Melon', 'Roll', 'Carrots'])
inflation=1.10

In [22]:
(cost
 .mul(inflation)
 .loc[lambda x: x > 3]
 )

Melon      4.389
Carrots    3.069
dtype: float64

If you instead calculate the boolean array before applying inflation, the mask reflects the original prices, so Carrots (2.79 before inflation, 3.069 after) is missed:

In [24]:
mask = cost > 3

In [25]:
(cost.mul(inflation).loc[mask])

Melon    4.389
dtype: float64

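A *lambda function* is an anonymous function written inline as a single expression. When you pass one to `.loc`, pandas calls it with the series at that point in the chain (here, the prices after inflation), so the condition is evaluated against the current values rather than the original ones.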
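
As a minimal sketch of the same idea, the lambda can be swapped for a named function. `over_three` is a hypothetical helper name, not something from the original notebook; `.loc` treats any callable the same way.

```python
import pandas as pd

cost = pd.Series([1.00, 2.25, 3.99, 0.99, 2.79],
                 index=['Gum', 'Cookie', 'Melon', 'Roll', 'Carrots'])
inflation = 1.10

def over_three(s):
    """Return a boolean Series marking values above 3 (hypothetical helper)."""
    return s > 3

# Both callables receive the inflated prices, not the original ones.
with_def = cost.mul(inflation).loc[over_three]
with_lambda = cost.mul(inflation).loc[lambda s: s > 3]
```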

In [26]:
city2.iloc[-1]  # .iloc pulls items by integer position; .loc pulls them by index label

16
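
The difference between position and label is easiest to see when the labels are integers that do not match the row order. The series below is a hypothetical example built for that purpose.

```python
import pandas as pd

# Hypothetical series whose integer labels run opposite to the row positions.
s = pd.Series([10, 20, 30], index=[2, 1, 0])

by_position = s.iloc[0]   # first row, whatever its label is
by_label = s.loc[0]       # the row whose label is 0 (the last row here)
last = s.iloc[-1]         # negative positions count from the end
first_two = s.iloc[0:2]   # position slices exclude the stop, like Python lists
```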

In [30]:
city2.head(3)

Alfa Romeo    19
Ferrari        9
Dodge         23
Name: city08, dtype: int64

In [31]:
city2.tail(3)

Subaru    18
Subaru    18
Subaru    16
Name: city08, dtype: int64

In [36]:
city2.sample(6, random_state=22)

Ford         21
Chevrolet    22
Infiniti     18
Nissan       15
Chevrolet    31
Jaguar       14
Name: city08, dtype: int64

In [37]:
city2.filter(items=['Ford', 'Subaru'])  # fails because city2 has duplicate index labels

ValueError: cannot reindex on an axis with duplicate labels
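
One possible workaround when the labels are duplicated (an assumption on my part, not something the original notebook uses) is to build a boolean mask from the index with `Index.isin` and pass it to `.loc`. The `mpg` data below is hypothetical.

```python
import pandas as pd

# Hypothetical series with a duplicated 'Ford' label.
mpg = pd.Series([18, 17, 9, 21], index=['Ford', 'Subaru', 'Ferrari', 'Ford'])
wanted = ['Ford', 'Subaru']

# isin returns a boolean array aligned with the index, so duplicates are fine.
picked = mpg.loc[mpg.index.isin(wanted)]
```

Unlike a list of labels passed to `.loc`, the mask keeps rows in their original order.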

In [38]:
city2.filter(like='rd')  # keeps labels that contain the substring 'rd'

Ford    18
Ford    16
Ford    17
Ford    17
Ford    15
        ..
Ford    26
Ford    19
Ford    21
Ford    18
Ford    19
Name: city08, Length: 3371, dtype: int64

In [39]:
city2.filter(regex='(Ford)|(Subaru)')

Subaru    17
Subaru    21
Subaru    22
Ford      18
Ford      16
          ..
Subaru    19
Subaru    20
Subaru    18
Subaru    18
Subaru    16
Name: city08, Length: 4256, dtype: int64

In [40]:
city2.reindex(['Missing', 'Ford'])  # raises because city2 has duplicate index labels

ValueError: cannot reindex on an axis with duplicate labels

In [41]:
city_mpg.reindex([0, 0, 10, 20, 200_000])  # a label can be passed multiple times and is repeated; a label that is not in the index comes back as NaN

0         19.0
0         19.0
10        23.0
20        14.0
200000     NaN
Name: city08, dtype: float64

`.reindex` is a lifesaver when two series share only some of their index labels
and you want one to take on the index of the other. Labels that are missing from the original come back as NaN, which is why the result below is float64:

In [44]:
s1 = pd.Series([10,20, 30], index=['a', 'b', 'c']) 

In [45]:
s2 = pd.Series([15,25, 35], index=['b', 'c', 'd']) 

In [46]:
s2

b    15
c    25
d    35
dtype: int64

In [47]:
s2.reindex(s1.index)

a     NaN
b    15.0
c    25.0
dtype: float64