<a href="https://github.com/theonaunheim">
    <img style="border-radius: 100%; float: right;" src="static/strawberry_thief_square.png" width=10% alt="Theo Naunheim's Github">
</a>

<br style="clear: both">
<hr>
<br>

<h1 align='center'>Advanced Indexing</h1>

<br>

<div style="display: table; width: 100%">
    <div style="display: table-row; width: 100%;">
        <div style="display: table-cell; width: 50%; align: middle; text-align: center;">            
            $\huge{{x}_{i}}$            
        </div>
        <div style="display: table-cell; width: 10%">
        </div>
        <div style="display: table-cell; width: 40%; vertical-align: top;">
            <blockquote>
                <p style="font-style: italic;">"Simplify, simplify."</p>
                <br>
                <p>-Henry David Thoreau</p>
            </blockquote>
        </div>
    </div>
</div>

<br>



<hr>

In [1]:
import numpy as np
import pandas as pd

## What other kind of more complex indexers can we use?

In addition to using simple values for indexers, you can also use:
* Boolean (a.k.a. "fancy") indexes
* Index objects

### Boolean (a.k.a. "fancy") indexing

No, I didn't come up with that name.

One of the nifty things that pandas stole from R was the concept of
boolean indexing.

Explained simply, if we have a series of True/False values that shares an index with another series, we can pass that **boolean index** to loc[]. This is super useful when we get to dataframes.

The result will be that we get back all the rows with True and all the False rows will be excluded. We can also do this with non-series provided certain requirements are met.

This makes a lot more sense if you see it in action.

In [2]:
# Here we have a list of names.
data = pd.Series(['Jim', 'Jane', 'Steve', 'Stacey', 'Mark'])
data

0       Jim
1      Jane
2     Steve
3    Stacey
4      Mark
dtype: object

In [3]:
# And here we have a boolean index
bix = pd.Series([False, True, False, True, False])
bix

0    False
1     True
2    False
3     True
4    False
dtype: bool

In [4]:
# Now we can use the loc attribute and boolean index
data.loc[bix]

1      Jane
3    Stacey
dtype: object

As you can see, index items 1 and 3 of the boolean index are True, so we get back a new Series containing items 1 and 3 of the data.
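Boolean indexing also works with masks that are not a Series. The sketch below assumes the main requirement is that the mask holds exactly one True/False value per row of the data; the variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

data = pd.Series(['Jim', 'Jane', 'Steve', 'Stacey', 'Mark'])

# A plain Python list of booleans, one entry per row of `data`.
list_mask = [False, True, False, True, False]

# The same mask as a NumPy boolean array.
array_mask = np.array(list_mask)

# Both masks are applied by position, because they carry no index of their own.
from_list = data.loc[list_mask]
from_array = data.loc[array_mask]

print(from_list.tolist())             # ['Jane', 'Stacey']
print(from_array.equals(from_list))   # True
```

A Series mask, by contrast, is matched to the data by its index labels rather than by position.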

This is a silly example, so let's do something more pragmatic. As you may remember, we can get a boolean series by operating on a data series as a whole. For example, if you evaluated:

    series > 5
    
... you will get back a Series that has True or False for each item depending on whether it was greater than 5 or not. This boolean series will be indexed like the data series ... and in turn it can be used for loc[]ing.
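A minimal sketch of that comparison, using made-up numbers (the values in `series` are only an illustration, not data from this notebook):

```python
import pandas as pd

# Hypothetical numeric data, used only for illustration.
series = pd.Series([3, 8, 5, 12, 1])

# Comparing the whole series to a scalar gives a boolean series
# with the same index as the original.
greater_than_5 = series > 5
print(greater_than_5.tolist())   # [False, True, False, True, False]

# That boolean series can be handed straight to loc[].
selected = series.loc[greater_than_5]
print(selected.tolist())         # [8, 12]
```

Note that 5 itself is excluded, because the comparison is strictly greater than.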

In [5]:
# As an example, say we want all names that start with 'Sa'
name_series = pd.Series(['Sarah', 'Sam', 'David', 'Sadie', 'Elsa', 'Dave'])
name_series

0    Sarah
1      Sam
2    David
3    Sadie
4     Elsa
5     Dave
dtype: object

In [6]:
# We can use the handy Python string method startswith in the .str namespace
# There are a bajillion of these.
sa_bix = name_series.str.startswith('Sa')
sa_bix

0     True
1     True
2    False
3     True
4    False
5    False
dtype: bool

In [7]:
# And get the names we want. Doesn't matter if it's 10 or 10 million.
sa_series = name_series.loc[sa_bix]
sa_series

0    Sarah
1      Sam
3    Sadie
dtype: object

In [8]:
# Another thing you will see is comparing a series to a value for equality
only_sam_bix = name_series == 'Sam'
only_sam_bix

0    False
1     True
2    False
3    False
4    False
5    False
dtype: bool

In [9]:
# And we can get Sam.
name_series.loc[only_sam_bix]

1    Sam
dtype: object

Note: you can do this in a single line, but if you combine multiple conditions without assigning each one to a separate series, you have to wrap each condition in parentheses and join them with the operators below.

* **|** is your 'or' operator.
* **&** is your 'and' operator.
* **^** is your 'xor' operator.
* **~** is your 'not' operator.

In [10]:
# Here we negate the Sam only bix, giving us everything but Sam in a bix.
opposite_of_sam_only_bix = ~only_sam_bix

# And selecting everyone but Sam
name_series.loc[opposite_of_sam_only_bix]

0    Sarah
2    David
3    Sadie
4     Elsa
5     Dave
dtype: object

In [11]:
# Here we have a condition of 1) name is 5 letters in length or 2) name is 'Dave'
# You can get as granular as you want, but be warned this can get really messy really quickly.
# Because space doesn't matter inside delimiters, I like to separate each condition on a line.
name_series[
    (name_series.str.len() == 5) | 
    (name_series == 'Dave')
]

0    Sarah
2    David
3    Sadie
5     Dave
dtype: object
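The 'and' and 'xor' operators follow the same pattern as the 'or' example. This is a small sketch that reuses the names from the cells above; the variable names `both` and `exactly_one` are made up for illustration.

```python
import pandas as pd

name_series = pd.Series(['Sarah', 'Sam', 'David', 'Sadie', 'Elsa', 'Dave'])

# The parentheses matter: & and ^ bind more tightly than ==, so without
# them Python would try to evaluate `5 & ...` before the comparison.
both = name_series.loc[
    (name_series.str.startswith('Sa')) &
    (name_series.str.len() == 5)
]

# xor keeps rows where exactly one of the two conditions is True.
exactly_one = name_series.loc[
    (name_series.str.startswith('Sa')) ^
    (name_series.str.len() == 5)
]

print(both.tolist())         # ['Sarah', 'Sadie']
print(exactly_one.tolist())  # ['Sam', 'David']
```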

## Index object indexing

We can also use index objects for indexing. This is useful when we want to use one of the large number of Index [methods](https://pandas.pydata.org/pandas-docs/stable/api.html#index). The most useful ones are:

* [Index.difference](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Index.difference.html#pandas.Index.difference): get the items that are in index a but not in index b.
* [Index.duplicated](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Index.duplicated.html#pandas.Index.duplicated): whether an index item occurs more than once
* [Index.get_loc](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Index.get_loc.html): get an integer location based on an index value
* [Index.intersection](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Index.intersection.html): the items shared by both indexes.
* [Index.isin](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Index.isin.html#pandas.Index.isin): whether an index value is in an arbitrary list.
* [Index.symmetric_difference](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Index.symmetric_difference.html): the items that are listed in only one of the two indexes.
* [Index.union](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Index.union.html#pandas.Index.union): merge two indexes.
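Index.get_loc is the one method in this list that translates a label into a position. Here is a minimal sketch, assuming a unique index and using made-up values in a hypothetical `scores` series:

```python
import pandas as pd

# Hypothetical data with fixed values so the result is predictable.
scores = pd.Series(
    index=['Jane', 'John', 'Jenny', 'Julie', 'Jessica'],
    data=[10, 20, 30, 40, 50]
)

# get_loc turns a label into its integer position (for a unique index).
position = scores.index.get_loc('Jenny')
print(position)               # 2

# The position can then be used with iloc[], for example to slice
# everything that comes before 'Jenny'.
print(scores.iloc[position])  # 30
before = scores.iloc[:position]
print(before.index.tolist())  # ['Jane', 'John']
```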

In [12]:
# Say we have a series ...
original_series = pd.Series(
    index=['Jane', 'John', 'Jenny', 'Julie', 'Jessica'],
    data=np.random.randint(0,100,5)
)

# and an index ...
index = pd.Index(['Jane', 'Jessica'])

# we can use loc with the index to get a result.
original_series.loc[index]

Jane        0
Jessica    48
dtype: int32

In [13]:
# Get a random sample of our series
s1 = original_series.sample(3)
s1

Jenny      14
Jessica    48
Julie      57
dtype: int32

In [14]:
# Another random sample
s2 = original_series.sample(3)
s2

John     94
Jane      0
Jenny    14
dtype: int32

In [15]:
# Combined index values of s1 and s2
union = s1.index.union(s2.index)
union

Index(['Jane', 'Jenny', 'Jessica', 'John', 'Julie'], dtype='object')

In [16]:
# The intersecting values of s1 and s2
intersect = s1.index.intersection(s2.index)
intersect

Index(['Jenny'], dtype='object')

In [17]:
# The items in s1 but not s2.
s1_diff = s1.index.difference(s2.index)
s1_diff

Index(['Jessica', 'Julie'], dtype='object')

In [18]:
# The items in s2 and not s1.
s2_diff = s2.index.difference(s1.index)
s2_diff

Index(['Jane', 'John'], dtype='object')

In [19]:
# The items that are in s1 or s2, but that are not shared.
sym_diff = s2.index.symmetric_difference(s1.index)
sym_diff

Index(['Jane', 'Jessica', 'John', 'Julie'], dtype='object')

In [20]:
# Boolean index of items in s1 that are included in an arbitrary list.
included = s1.index.isin(['Jenny', 'Jebediah', 'Julie'])
included

array([ True, False,  True])
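The array returned by isin is a boolean mask, so it can go straight into loc[]. A small sketch with fixed values (the `sample` series stands in for a random sample such as s1):

```python
import pandas as pd

# Fixed stand-in for a random sample, so the output is predictable.
sample = pd.Series(index=['Jenny', 'Jessica', 'Julie'], data=[14, 48, 57])

# One True/False per index item; 'Jebediah' is simply never matched.
included = sample.index.isin(['Jenny', 'Jebediah', 'Julie'])

# Use the mask to keep the matching rows, or negate it to drop them.
kept = sample.loc[included]
dropped = sample.loc[~included]

print(kept.index.tolist())     # ['Jenny', 'Julie']
print(dropped.index.tolist())  # ['Jessica']
```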

In [21]:
# The datatype of our index
datatype = s1.index.dtype
datatype

dtype('O')

In [22]:
# The raw values of our index
raw_values = s1.index.values
raw_values

array(['Jenny', 'Jessica', 'Julie'], dtype=object)

In [23]:
# The index of s1, sorted by value.
sorted_index = s1.index.sort_values()
sorted_index

Index(['Jenny', 'Jessica', 'Julie'], dtype='object')

In [24]:
# Boolean array marking duplicated index items (by default the first occurrence is not marked)
duplicated_bix = s1.index.duplicated()
duplicated_bix

array([False, False, False])
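Since s1 has no repeated labels, every value above is False. The sketch below uses a hypothetical `readings` series with repeated labels to show what duplicated() reports and how the `keep` parameter changes it:

```python
import pandas as pd

# Hypothetical series whose index contains repeated labels.
readings = pd.Series(
    index=['Jane', 'John', 'Jane', 'Julie', 'John'],
    data=[1, 2, 3, 4, 5]
)

# Default keep='first': later repeats are flagged, first occurrences are not.
dup_first = readings.index.duplicated()
print(dup_first.tolist())   # [False, False, True, False, True]

# keep=False flags every item that occurs more than once.
dup_all = readings.index.duplicated(keep=False)
print(dup_all.tolist())     # [True, True, True, False, True]

# Negating the default mask keeps one row per label.
deduped = readings.loc[~dup_first]
print(deduped.tolist())     # [1, 2, 4]
```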

In [25]:
# A new index with values of previous index dropped.
jessicaless_index = original_series.index.drop('Jessica')
jessicaless_index

Index(['Jane', 'John', 'Jenny', 'Julie'], dtype='object')

##### This pays off because we can loc[] the original series based on our new indexes!

    series.loc[new_index]

In [26]:
# Here we have the intersection of s1 and s2 as an indexer for the original series.
original_series.loc[intersect]

Jenny    14
dtype: int32

In [27]:
# Here we use s1's index as an indexer for the original series.
# This is useful for transferring results across series and dataframes.
s1_index = s1.index
original_series.loc[s1_index]

Jenny      14
Jessica    48
Julie      57
dtype: int32
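One thing to watch for when transferring an index between series: in recent pandas versions, as far as I know, loc[] raises a KeyError if the indexer contains labels the series does not have. A hedged sketch of one way around that, using intersection first (the `wanted` index and its 'Zed' label are made up for illustration):

```python
import pandas as pd

original = pd.Series(
    index=['Jane', 'John', 'Jenny', 'Julie', 'Jessica'],
    data=[10, 20, 30, 40, 50]
)

# An index from somewhere else; 'Zed' does not exist in `original`.
wanted = pd.Index(['Jessica', 'Zed', 'Jane'])

# Keep only the labels that both indexes share before calling loc[].
available = wanted.intersection(original.index)
result = original.loc[available]

print(sorted(result.index.tolist()))  # ['Jane', 'Jessica']
```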

## Multiindexes

Pandas also allows for hierarchical indexes, which are called 'multiindexes'. We're not going to go too far into these, but just know this is a great way to deal with subpopulations of data. The best way to access multiindexes is to use loc[] and then pass it a tuple with one key per level of your index.

Side note: generally a groupby([column_1, column_2]) is a really common way to generate a multiindex.
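A minimal sketch of that side note, assuming a small made-up dataframe (the `people` frame and its column names are hypothetical). Dataframes are covered later; the point here is only the index of the result.

```python
import pandas as pd

# Hypothetical data: one row per observation.
people = pd.DataFrame({
    'build': ['big', 'big', 'small', 'small', 'big'],
    'height': ['short', 'tall', 'short', 'tall', 'tall'],
    'n_people': [1, 2, 3, 0, 3],
})

# Grouping by a list of two columns gives a result whose index
# has one level per grouping column.
totals = people.groupby(['build', 'height'])['n_people'].sum()

print(isinstance(totals.index, pd.MultiIndex))  # True
print(totals.loc[('big', 'tall')])              # 5
```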

In [28]:
# Build a hierarchical index (multiindex) from tuples, one tuple per row.
# We won't go into detail on these until later.
multiindex = pd.MultiIndex.from_tuples([('big', 'short'), ('big', 'tall'), ('small', 'short'), ('small', 'tall')])

# Create multiindexed series
ms = pd.Series(
    index=multiindex,
    data=[1,5,3,0]
)

# Demonstrate indexing a multiindex
print('The number of big/tall people are:')
print(ms.loc[('big', 'tall')])
print()

# You can also select bigger chunks
print('The number of big people are:')
print(ms.loc['big'])

# Show multiindexed series
ms

The number of big/tall people are:
5

The number of big people are:
short    1
tall     5
dtype: int64


big    short    1
       tall     5
small  short    3
       tall     0
dtype: int64
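Selecting on the outer level works with a plain key, as shown above. For the inner level, one option is the Series.xs method; this is only a sketch of one approach, rebuilt on the same data so it runs on its own.

```python
import pandas as pd

multiindex = pd.MultiIndex.from_tuples(
    [('big', 'short'), ('big', 'tall'), ('small', 'short'), ('small', 'tall')]
)
ms = pd.Series(index=multiindex, data=[1, 5, 3, 0])

# Take a cross-section on the second level (level=1), keeping every
# value of the first level.
tall = ms.xs('tall', level=1)

print(tall.to_dict())  # {'big': 5, 'small': 0}
```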

# Additional Learning Resources

* ### [Official Pandas Advanced Indexing](https://pandas.pydata.org/pandas-docs/stable/advanced.html#advanced)
* ### [Multiindex API](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.MultiIndex.html)


---

# Next Up: [Series Part 1 Exercises](6_series_part_1_exercises.ipynb)

<br>

<h1 align="left"> $W=-\Delta PE$ </h1>

---