<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Setup" data-toc-modified-id="Setup-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Setup</a></span></li><li><span><a href="#Basic-information" data-toc-modified-id="Basic-information-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Basic information</a></span></li><li><span><a href="#Index" data-toc-modified-id="Index-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Index</a></span><ul class="toc-item"><li><span><a href="#Subtleties-of-indices" data-toc-modified-id="Subtleties-of-indices-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Subtleties of indices</a></span></li></ul></li><li><span><a href="#Basic-indexing-(filtering)" data-toc-modified-id="Basic-indexing-(filtering)-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Basic indexing (filtering)</a></span><ul class="toc-item"><li><span><a href="#Chained-indexing" data-toc-modified-id="Chained-indexing-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Chained indexing</a></span></li><li><span><a href="#Filtering-by-column-values" data-toc-modified-id="Filtering-by-column-values-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Filtering by column values</a></span></li></ul></li><li><span><a href="#Indices-for-alignment" data-toc-modified-id="Indices-for-alignment-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Indices for alignment</a></span></li><li><span><a href="#Finding-missing-values" data-toc-modified-id="Finding-missing-values-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Finding missing values</a></span></li><li><span><a href="#Sorting" data-toc-modified-id="Sorting-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Sorting</a></span></li><li><span><a href="#Optional:-More-advanced-filtering" data-toc-modified-id="Optional:-More-advanced-filtering-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Optional: More advanced filtering</a></span></li><li><span><a href="#Optional:-.reindex()-and-.query()-methods" 
data-toc-modified-id="Optional:-.reindex()-and-.query()-methods-9"><span class="toc-item-num">9&nbsp;&nbsp;</span>Optional: <code>.reindex()</code> and <code>.query()</code> methods</a></span></li></ul></div>

In the following lessons, we will learn more advanced `pandas` concepts by applying them to a very simple `DataFrame` related to a retail operation. Keeping `DataFrame`s simple in this way should help you both to predict what the result of various operations will be, and to understand their actual effects. In the practice questions later today, you will apply the concepts you've learned to more complex `DataFrame`s derived from the `omni_pool` database we have used already.

# Setup

Let's get started by creating the simple `stock` `DataFrame` we'll be using in this lesson

In [1]:
import pandas as pd
import numpy as np

stock = pd.DataFrame({
    'item_no': pd.Series([1, 2, 2, 4, 5, 6, 7, 8, 9, 10], dtype='Int64'),
    'cost_class': pd.Series(['1st', '2nd', '3rd', '4th', '4th', '3rd', '2nd', np.nan, '1st', '3rd'], dtype='string'),
    'cost': pd.Series([10.99, np.nan, 2.99, np.nan, 2.99, 2.45, 5.99, 5.99, 3.00, None], dtype='float64'),
    'stock_code': pd.Series(['a', 'a', 'c', 'b', 'a', 'b', np.nan, np.nan, 'a', 'c'], dtype='string'),
    'priority_code': pd.Series([np.nan, None, 'a', 'b', None, 'a', 'e', None, 'a', 'd'], dtype='string'),
    'tax_rate': pd.Series([0, 0, 20, 20, 20, 0, 20, 20, 5, 20])
})

stock

Unnamed: 0,item_no,cost_class,cost,stock_code,priority_code,tax_rate
0,1,1st,10.99,a,,0
1,2,2nd,,a,,0
2,2,3rd,2.99,c,a,20
3,4,4th,,b,b,20
4,5,4th,2.99,a,,20
5,6,3rd,2.45,b,a,0
6,7,2nd,5.99,,e,20
7,8,,5.99,,,20
8,9,1st,3.0,a,a,5
9,10,3rd,,c,d,20


# Basic information: Recap

The methods and attributes providing basic information on the data held in a `DataFrame` are pretty straightforward. Let's see them in use again. 

In [2]:
stock.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   item_no        10 non-null     Int64  
 1   cost_class     9 non-null      string 
 2   cost           7 non-null      float64
 3   stock_code     8 non-null      string 
 4   priority_code  6 non-null      string 
 5   tax_rate       10 non-null     int64  
dtypes: Int64(1), float64(1), int64(1), string(3)
memory usage: 618.0 bytes


We see we have some `null`/`NaN` values in some of the columns. The datatypes are also given in the final column.

In [3]:
stock.describe(include='all')

Unnamed: 0,item_no,cost_class,cost,stock_code,priority_code,tax_rate
count,10.0,9,7.0,8,6,10.0
unique,,4,,3,4,
top,,3rd,,a,a,
freq,,3,,4,3,
mean,5.4,,4.914286,,,12.5
std,3.134042,,3.065169,,,9.78945
min,1.0,,2.45,,,0.0
25%,2.5,,2.99,,,1.25
50%,5.5,,3.0,,,20.0
75%,7.75,,5.99,,,20.0
max,10.0,,10.99,,,20.0


Here `.describe()` provides information on columns requested by the `include=` argument. By selecting 'all' we get the most output, but note the different behaviour for columns that `pandas` believes are categorical and numerical:

* Outputs `unique`, `top` and `freq` appear only for categorical columns
* The 'five number summary' (`min`, `25%`, `50%`, `75%` and `max`), `std` (standard deviation) and `mean` appear only for numerical columns
* Output `count` appears for both categorical and numerical columns

Finally, the `.shape` attribute provides a `tuple`: `(num_rows, num_columns)`

In [4]:
stock.shape

(10, 6)

# Index

All `pandas` `DataFrame`s have an `index` object, used for row selection, data alignment between `DataFrame`s and various other purposes.  Let's have a look at `stock`

In [5]:
stock

Unnamed: 0,item_no,cost_class,cost,stock_code,priority_code,tax_rate
0,1,1st,10.99,a,,0
1,2,2nd,,a,,0
2,2,3rd,2.99,c,a,20
3,4,4th,,b,b,20
4,5,4th,2.99,a,,20
5,6,3rd,2.45,b,a,0
6,7,2nd,5.99,,e,20
7,8,,5.99,,,20
8,9,1st,3.0,a,a,5
9,10,3rd,,c,d,20


The `index` is down the left-hand side, and by default is set to be a simple range of integers. For our data, it looks like the `item_no` column might be a more meaningful and useful `index`, so let's set it as `index` and use `inplace=True` to make the change to the original `stock` DataFrame. 

In [6]:
stock.set_index('item_no', inplace=True)

stock

Unnamed: 0_level_0,cost_class,cost,stock_code,priority_code,tax_rate
item_no,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,1st,10.99,a,,0
2,2nd,,a,,0
2,3rd,2.99,c,a,20
4,4th,,b,b,20
5,4th,2.99,a,,20
6,3rd,2.45,b,a,0
7,2nd,5.99,,e,20
8,,5.99,,,20
9,1st,3.0,a,a,5
10,3rd,,c,d,20
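
As an aside on `inplace=True`, here is a minimal sketch of what the argument changes. The `demo` frame is a hypothetical miniature stand-in for `stock`, made up purely for illustration.

```python
import pandas as pd

# Hypothetical miniature frame standing in for `stock`
demo = pd.DataFrame({'item_no': [1, 2, 3], 'cost': [10.99, 2.99, 2.45]})

# Without inplace=True a new DataFrame is returned and `demo` is unchanged
indexed = demo.set_index('item_no')
print(demo.index.name)     # None: the original still has its default index
print(indexed.index.name)  # item_no

# With inplace=True the method modifies `demo` itself and returns None
result = demo.set_index('item_no', inplace=True)
print(result)              # None
print(demo.index.name)     # item_no
```

Reassigning the result (`demo = demo.set_index('item_no')`) has the same end effect as `inplace=True`, so which one you use is largely a matter of style.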


The other 'index' of the `DataFrame` is technically the set of column names. We can access that like so

In [7]:
stock.columns

Index(['cost_class', 'cost', 'stock_code', 'priority_code', 'tax_rate'], dtype='object')

And we can see both indices reported together in the `axes` attribute of the `DataFrame`

In [8]:
stock.axes

[Index([1, 2, 2, 4, 5, 6, 7, 8, 9, 10], dtype='object', name='item_no'),
 Index(['cost_class', 'cost', 'stock_code', 'priority_code', 'tax_rate'], dtype='object')]

## Subtleties of indices

Alright, we've set our new `index`, but is it going to be fit for purpose? Let's check a few properties

In [9]:
stock.index.is_unique

False

Hmm, looks like we have some repetition of values in the `index`. We can see which item is replicated in the following way: 

In [10]:
stock.loc[stock.index.duplicated()]

Unnamed: 0_level_0,cost_class,cost,stock_code,priority_code,tax_rate
item_no,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2,3rd,2.99,c,a,20
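
Note that only the later of the two rows is flagged: by default `.duplicated()` uses `keep='first'`, which treats the first occurrence of a label as the 'original'. A small sketch, using a hypothetical miniature frame rather than `stock` itself, shows how `keep=False` flags every row that shares a repeated label.

```python
import pandas as pd

# Hypothetical miniature frame with a repeated index label
demo = pd.DataFrame(
    {'cost_class': ['1st', '2nd', '3rd', '4th']},
    index=pd.Index([1, 2, 2, 4], name='item_no'),
)

# Default keep='first': only the later repeat is flagged
later_repeats = demo.loc[demo.index.duplicated()]

# keep=False: every row sharing a repeated label is flagged
all_repeats = demo.loc[demo.index.duplicated(keep=False)]

print(later_repeats)
print(all_repeats)
```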


Ok, so index `item_no` == 2 is replicated. Let's try to select the row(s) we think correspond to item 2. 

In [11]:
stock.loc[2]

Unnamed: 0_level_0,cost_class,cost,stock_code,priority_code,tax_rate
item_no,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2,2nd,,a,,0
2,3rd,2.99,c,a,20


So `item_no` does not uniquely index rows. Let's try to fix what we take to be a simple error, and change the second 2 to 3

In [12]:
stock.index[2] = 3

TypeError: Index does not support mutable operations

So you can see we get an error - we can't mutate an `Index` in this way. If we want to change an `Index`, we either have to: 

* Provide some sort of **collection of values of the correct size** (either the number of rows or number of columns)  
* Use the `.rename()` method specifying the `index=` argument

Let's try both of these now. 

In [12]:
# provide a collection of values to use as the index
stock.index = range(1, 11)
stock.index.name = 'item_no'
stock

Unnamed: 0_level_0,cost_class,cost,stock_code,priority_code,tax_rate
item_no,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,1st,10.99,a,,0
2,2nd,,a,,0
3,3rd,2.99,c,a,20
4,4th,,b,b,20
5,4th,2.99,a,,20
6,3rd,2.45,b,a,0
7,2nd,5.99,,e,20
8,,5.99,,,20
9,1st,3.0,a,a,5
10,3rd,,c,d,20


In [13]:
# replace 10 anywhere in the index with 200
stock.rename(index={10: 200}, inplace=True)
stock

Unnamed: 0_level_0,cost_class,cost,stock_code,priority_code,tax_rate
item_no,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,1st,10.99,a,,0
2,2nd,,a,,0
3,3rd,2.99,c,a,20
4,4th,,b,b,20
5,4th,2.99,a,,20
6,3rd,2.45,b,a,0
7,2nd,5.99,,e,20
8,,5.99,,,20
9,1st,3.0,a,a,5
200,3rd,,c,d,20


Let's undo this damage to the `index` by setting it back to an appropriate `RangeIndex` object

In [14]:
stock.index = pd.RangeIndex(1, 11, name='item_no')
stock

Unnamed: 0_level_0,cost_class,cost,stock_code,priority_code,tax_rate
item_no,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,1st,10.99,a,,0
2,2nd,,a,,0
3,3rd,2.99,c,a,20
4,4th,,b,b,20
5,4th,2.99,a,,20
6,3rd,2.45,b,a,0
7,2nd,5.99,,e,20
8,,5.99,,,20
9,1st,3.0,a,a,5
10,3rd,,c,d,20



<hr style="border:8px solid black"> </hr>

***

**<u>Task - 2 mins</u>**

* Reset the `Index` using the `reset_index()` method (passing in arguments `inplace=True` and `drop=False`)
* What does the `drop` argument do in the line above? Investigate this.
* Now set the `Index` back once again to the `item_no` column using the `set_index()` method. Make sure the change is persisted in the `DataFrame`.

**Solution**

In [15]:
# drop=False tells pandas to insert the current index as a column in the DataFrame
# before it creates the default RangeIndex object, starting at zero
stock.reset_index(drop=False, inplace=True)
stock

Unnamed: 0,item_no,cost_class,cost,stock_code,priority_code,tax_rate
0,1,1st,10.99,a,,0
1,2,2nd,,a,,0
2,3,3rd,2.99,c,a,20
3,4,4th,,b,b,20
4,5,4th,2.99,a,,20
5,6,3rd,2.45,b,a,0
6,7,2nd,5.99,,e,20
7,8,,5.99,,,20
8,9,1st,3.0,a,a,5
9,10,3rd,,c,d,20


In [16]:
# set the index back to item_no again and persist the change
stock.set_index("item_no", inplace=True)
stock

Unnamed: 0_level_0,cost_class,cost,stock_code,priority_code,tax_rate
item_no,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,1st,10.99,a,,0
2,2nd,,a,,0
3,3rd,2.99,c,a,20
4,4th,,b,b,20
5,4th,2.99,a,,20
6,3rd,2.45,b,a,0
7,2nd,5.99,,e,20
8,,5.99,,,20
9,1st,3.0,a,a,5
10,3rd,,c,d,20


***

<hr style="border:8px solid black"> </hr>

# Indexing

Let's see some examples of basic indexing operations in `pandas`. The two major methods of use here are `.loc[]` and `.iloc[]`, corresponding to label-based and position-based indexing, respectively. We've already looked at using `.loc[]` a lot, but this part of the lesson will add on to this a bit.

As a recap, let's look again at how you index using label based indexing (`.loc[]`). 

> **Select item_nos 1, 2 and 3, with columns cost_class, cost and stock_code**


In [17]:
# arguments to .loc[] are always in the order [rows, columns]
stock.loc[[1, 2, 3], ['cost_class', 'cost', 'stock_code']]

Unnamed: 0_level_0,cost_class,cost,stock_code
item_no,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,1st,10.99,a
2,2nd,,a
3,3rd,2.99,c


We can also use **slice notation** where it would be meaningful for `index` labels. Note that, unlike essentially everywhere else in `Python`, the last value in `.loc[]` slices **is included**. So, in the code below, the row with `index` 3 and the column with label `stock_code` will appear in the resulting `DataFrame`

In [18]:
stock.loc[1:3, 'cost_class':'stock_code']

Unnamed: 0_level_0,cost_class,cost,stock_code
item_no,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,1st,10.99,a
2,2nd,,a
3,3rd,2.99,c


Now let's recap position-based indexing using `.iloc[]`

> **Select the first four rows and first three columns of the data**


In [19]:
# as for .loc[], arguments to .iloc[] are always [rows, columns]
# remember zero-based indexing in Python!
stock.iloc[[0, 1, 2, 3], [0, 1, 2]] 

Unnamed: 0_level_0,cost_class,cost,stock_code
item_no,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,1st,10.99,a
2,2nd,,a
3,3rd,2.99,c
4,4th,,b


We can use slice notation here too. However, in a confusing turn, the slice end values in `.iloc[]` are **not included** (a return to normal `Python` behaviour)

In [20]:
stock.iloc[0:4, 0:3]

Unnamed: 0_level_0,cost_class,cost,stock_code
item_no,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,1st,10.99,a
2,2nd,,a
3,3rd,2.99,c
4,4th,,b



Let's check what `pandas` data structure these indexing operations are returning.

In [21]:
type(stock.loc[[1, 2], ['cost_class', 'cost']])

pandas.core.frame.DataFrame

This makes sense, as we simultaneously have multiple rows and multiple columns in these 'cut-down' data sets. But what type does indexing a single row return?

In [22]:
# Can do this two ways
# Here : as the second argument means 'all columns'
way1 = stock.loc[1, :]
way2 = stock.loc[1]
type(way1), type(way2)

(pandas.core.series.Series, pandas.core.series.Series)

Both ways return a `pandas` `Series` object: think of this as a dedicated type holding a single `list` (or `array`) of values, each labelled by an `index`. What about indexing to extract a single column?

In [23]:
# Again, can do this two ways
# Here : as the first argument means 'all rows'
way1 = stock.loc[:, 'cost']
way2 = stock.cost
type(way1), type(way2)

(pandas.core.series.Series, pandas.core.series.Series)

## Chained indexing

The `.loc[]` and `.iloc[]` indexing methods are the recommended ways to extract specific rows and columns from a `DataFrame`. There are other ways that are generally not recommended (we'll discuss why later) but you are likely to encounter them. The most common of them is **'chained indexing'**. Here's an example

In [24]:
stock['cost'][[1, 2]]

item_no
1    10.99
2      NaN
Name: cost, dtype: float64

We call this 'chained indexing' because the `DataFrame` is indexed two (or more) times in successive and separate operations:

* Operation 1: extract the `cost` column as a `Series` object
* Operation 2: extract the elements with index labels `[1, 2]` (that is, `item_no` 1 and 2) within that `Series`

It is better to extract this data in a **single** indexing operation using either `.loc[]` or `.iloc[]`

In [25]:
stock.loc[[1, 2], 'cost']

item_no
1    10.99
2      NaN
Name: cost, dtype: float64

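Chained indexing comes with some costs and is best avoided when another approach is available. In particular, with chaining it is hard to predict whether `pandas` hands back a view of the original data or a copy of it. For selecting data this is not a big deal, though it may be slower than necessary because two indexing operations run instead of one. For assigning values it matters much more: the assignment may land on a temporary copy and leave the original `DataFrame` unchanged. So do avoid chaining where possible.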
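
Here is a minimal sketch of why chaining is risky once you assign rather than just select. The `demo` frame is a hypothetical stand-in made up for illustration, and the exact warning `pandas` emits for the chained form depends on the installed version, so the sketch simply silences it.

```python
import warnings
import pandas as pd

# Hypothetical miniature frame standing in for `stock`
demo = pd.DataFrame(
    {'cost': [10.99, 2.99, 2.45], 'tax_rate': [0, 20, 20]},
    index=pd.Index([1, 2, 3], name='item_no'),
)

# Chained: demo[mask] builds a new temporary DataFrame, so the assignment
# lands on that temporary object and `demo` is left untouched.
with warnings.catch_warnings():
    warnings.simplefilter('ignore')  # pandas warns about this pattern
    demo[demo.tax_rate == 20]['cost'] = 0.0
after_chained = demo.cost.tolist()
print(after_chained)  # [10.99, 2.99, 2.45]

# Single .loc[] call: rows and column are resolved together on `demo` itself
demo.loc[demo.tax_rate == 20, 'cost'] = 0.0
after_loc = demo.cost.tolist()
print(after_loc)      # [10.99, 0.0, 0.0]
```

The single `.loc[]` call is the form that reliably changes the original `DataFrame`.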


## Indexing: Alignment

One of the main uses of a `DataFrame` `index` is to provide efficient **alignment of rows**. Consider the following `Series`

In [26]:
new_series = pd.Series(['a', 'b', 'c', 'd'], index = [2, 3, 4, 6])
new_series

2    a
3    b
4    c
6    d
dtype: object

Let's add it as a new column to the `stock` `DataFrame`

In [27]:
stock.loc[:, 'new'] = new_series
stock

Unnamed: 0_level_0,cost_class,cost,stock_code,priority_code,tax_rate,new
item_no,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
1,1st,10.99,a,,0,
2,2nd,,a,,0,a
3,3rd,2.99,c,a,20,b
4,4th,,b,b,20,c
5,4th,2.99,a,,20,
6,3rd,2.45,b,a,0,d
7,2nd,5.99,,e,20,
8,,5.99,,,20,
9,1st,3.0,a,a,5,
10,3rd,,c,d,20,


Note what's happened to the elements of `new_series`: they have been 'split up' when they were added to `stock`. Each value in `new_series` has been **aligned by index** when added to `stock`. It's important to keep this idea of 'alignment' in mind as you work with `pandas`! 
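
Alignment also applies when two `Series` are combined in arithmetic. The sketch below uses two hypothetical `Series` (`on_hand` and `on_order`, invented for illustration) whose indices overlap only partly and appear in different orders.

```python
import pandas as pd

# Hypothetical quantities, labelled by item number
on_hand = pd.Series([5, 3, 8], index=[1, 2, 3])
on_order = pd.Series([10, 20], index=[3, 1])

# Values are matched by index label, not by position;
# a label missing from either side gives NaN
total = on_hand + on_order
print(total)

# .add() with fill_value treats a missing label as 0 instead
filled = on_hand.add(on_order, fill_value=0)
print(filled)
```

Item 1 is paired with 20 and item 3 with 10, even though `on_order` lists them in the opposite order, and item 2 comes out as `NaN` in `total` because `on_order` has no entry for it.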

For now, let's get rid of the column we added (more on the `.drop()` method later)

In [28]:
stock.drop(columns='new', inplace=True)

## Indexing: Finding missing values 

Missing values in `pandas` are dealt with in a flexible way. Various values are treated as indicating missing data: 

* `numpy`'s `nan` ('not a number') value (`np.nan`)
* `None` from base `Python`
* `pd.NaT` ('not a time') indicates a missing value in a `DateTime` column
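
A quick sketch of how the top-level `pd.isna()` function treats these markers. It also includes `pd.NA`, which is not in the list above: it is the marker `pandas` uses for the `'string'` and `'Int64'` dtypes that appear in `stock`.

```python
import numpy as np
import pandas as pd

# All of these are recognised as 'missing'
markers = [np.nan, None, pd.NaT, pd.NA]
print([pd.isna(m) for m in markers])  # [True, True, True, True]

# An empty string and zero are ordinary values, not missing ones
print(pd.isna(''), pd.isna(0))        # False False
```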


Let's see now how to find and count missing values in `DataFrame`s. The most useful method here is `.isna()` (or its equivalent `.isnull()`)

> **Find any items lacking a cost**

In [29]:
stock.loc[stock.cost.isna()]
# or stock.loc[stock.cost.isnull()] if you prefer!

Unnamed: 0_level_0,cost_class,cost,stock_code,priority_code,tax_rate
item_no,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2,2nd,,a,,0
4,4th,,b,b,20
10,3rd,,c,d,20


> **Find any rows containing at least one missing value**

We can use the `.any()` method for this. See the 'Advanced filtering' section below for more on this method

In [30]:
stock.loc[stock.isna().any(axis='columns')]

Unnamed: 0_level_0,cost_class,cost,stock_code,priority_code,tax_rate
item_no,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,1st,10.99,a,,0
2,2nd,,a,,0
4,4th,,b,b,20
5,4th,2.99,a,,20
7,2nd,5.99,,e,20
8,,5.99,,,20
10,3rd,,c,d,20


> **Find all complete rows**

We can use the `pandas` *negation* operator `~` for this (the base `Python` `not` keyword won't work here)

In [31]:
stock.loc[~stock.isna().any(axis='columns')]

Unnamed: 0_level_0,cost_class,cost,stock_code,priority_code,tax_rate
item_no,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
3,3rd,2.99,c,a,20
6,3rd,2.45,b,a,0
9,1st,3.0,a,a,5


<hr style="border:8px solid black"> </hr>

***

**<u>Task - 2 mins</u>**

Consider the output of the following code:

In [32]:
stock.notna().all(axis='rows')

cost_class       False
cost             False
stock_code       False
priority_code    False
tax_rate          True
dtype: bool

1. How could we use this to output **all complete columns** (i.e. columns without missing values)?
2. How can we count how many missing values there are in each column?


**Solution**

In [33]:
# output all complete columns
stock.loc[:, stock.notna().all(axis='rows')]

Unnamed: 0_level_0,tax_rate
item_no,Unnamed: 1_level_1
1,0
2,0
3,20
4,20
5,20
6,0
7,20
8,20
9,5
10,20


In [34]:
# count how many missing values there are in each column
stock.isna().sum(axis='rows')

cost_class       1
cost             3
stock_code       2
priority_code    4
tax_rate         0
dtype: int64

To understand what the code above is doing, realise that when `pandas` sums a `Boolean` `Series` (whether through the `.sum()` method or the `np.sum()` function), it is first **converting (or 'casting')** the Booleans to `integer` equivalents (`True` -> 1 and `False` -> 0). To see this, consider the following code:

In [35]:
np.sum(pd.Series([True, True, False]))

2

And those are really the main uses for indexing in `pandas`!

*** 



# Filtering

You learnt how to filter data earlier this week. Let's do a quick recap, then move onto more advanced filtering. 

## Filtering by column values

Let's recap how to filter `DataFrame`s by applying logical conditions to values in columns (or to `index` objects)

> **Find the cost of items that have a priority_code of 'a'**

In [36]:
stock.loc[stock.priority_code == 'a', 'cost']

item_no
3    2.99
6    2.45
9    3.00
Name: cost, dtype: float64

> **Find any items that are in the 1st cost_class or have a stock_code of 'c'**

When answering this, note that the `Python` `or` keyword won't work in `pandas`: we have to use `|` instead, with each condition wrapped in its own parentheses

In [37]:
stock.loc[(stock.cost_class == '1st') | (stock.stock_code == 'c')]

# note as well we could also write this as
#stock.loc[(stock.cost_class == '1st') | (stock.stock_code == 'c'), :]

Unnamed: 0_level_0,cost_class,cost,stock_code,priority_code,tax_rate
item_no,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,1st,10.99,a,,0
3,3rd,2.99,c,a,20
9,1st,3.0,a,a,5
10,3rd,,c,d,20


> **Find any items in the 3rd cost_class that have a priority_code of 'a'**

Note that this is really an 'and' combination of conditions. However, `Python` `and` won't work in `pandas`: we need to use `&` instead

In [38]:
# Again, Python 'and' keyword won't work, need to use & instead
stock.loc[(stock.cost_class == '3rd') & (stock.priority_code == 'a')]

Unnamed: 0_level_0,cost_class,cost,stock_code,priority_code,tax_rate
item_no,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
3,3rd,2.99,c,a,20
6,3rd,2.45,b,a,0
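
To see why the `and` and `or` keywords won't work, here is a minimal sketch on a hypothetical miniature frame (`demo` is made up for illustration). `Python` evaluates `and`/`or` by asking each operand for a single `True`/`False` value, and a `Series` of several Booleans has no single answer.

```python
import pandas as pd

# Hypothetical miniature frame standing in for `stock`
demo = pd.DataFrame({'cost_class': ['1st', '3rd', '3rd'],
                     'priority_code': ['a', 'a', 'b']})

# `and` needs one truth value per operand, so pandas raises a ValueError
try:
    demo.loc[(demo.cost_class == '3rd') and (demo.priority_code == 'a')]
except ValueError as err:
    message = str(err)
    print(message)

# `&` compares element by element and returns a Boolean Series
both = demo.loc[(demo.cost_class == '3rd') & (demo.priority_code == 'a')]
print(both)
```

The parentheses around each condition matter too: `&` and `|` bind more tightly than `==`, so without them `Python` would group the expression differently.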


<hr style="border:8px solid black"> </hr>

***

**<u>Task - 2 mins</u>**

What is returned if a filtering condition results in no matches? Give this a try yourself, filtering for a value that **doesn't exist** in a column in the data.

**Solution**

In [39]:
no_matching_rows = stock.loc[stock.cost_class == '5th']
no_matching_rows

Unnamed: 0_level_0,cost_class,cost,stock_code,priority_code,tax_rate
item_no,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1


In [40]:
type(no_matching_rows)

pandas.core.frame.DataFrame

In [41]:
print(no_matching_rows)

Empty DataFrame
Columns: [cost_class, cost, stock_code, priority_code, tax_rate]
Index: []


What happens if we try to index for a column that doesn't exist in the data?

In [42]:
stock.loc[:, 'banana']

KeyError: 'banana'

In this case, we get a `KeyError` instead: this is `pandas`' way of telling us that the requested column doesn't exist in the `DataFrame`!

## Advanced filtering: max( ), min( ), and isin( )


Now let's see some more advanced filtering techniques

> **Find any rows with items in the 1st or 2nd cost_class**

We can make use of the `Series` `.isin()` method for this. Think of this method as asking a question for each value in the `Series`: 'Are you a member of the `list` passed in as an argument?'



In [43]:
stock.loc[stock.cost_class.isin(['1st', '2nd'])]

Unnamed: 0_level_0,cost_class,cost,stock_code,priority_code,tax_rate
item_no,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,1st,10.99,a,,0
2,2nd,,a,,0
7,2nd,5.99,,e,20
9,1st,3.0,a,a,5


> **Find any rows with items taxed at the maximum rate**

We can use the `.max()` method for this

In [44]:
stock.loc[stock.tax_rate == stock.tax_rate.max()]

Unnamed: 0_level_0,cost_class,cost,stock_code,priority_code,tax_rate
item_no,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
3,3rd,2.99,c,a,20
4,4th,,b,b,20
5,4th,2.99,a,,20
7,2nd,5.99,,e,20
8,,5.99,,,20
10,3rd,,c,d,20




**<u>Task - 5 mins</u>**

Are there any items having the lowest cost which are also taxed at the lowest rate? Remember: do this by *coding* rather than by inspection of the `DataFrame`.

**Solution**

In [45]:
stock.loc[(stock.tax_rate == stock.tax_rate.min()) & (stock.cost == stock.cost.min())]

Unnamed: 0_level_0,cost_class,cost,stock_code,priority_code,tax_rate
item_no,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
6,3rd,2.45,b,a,0


> **Find any items having 'a' or 'b' in stock_code or priority_code**

In [46]:
stock.loc[stock.stock_code.isin(['a', 'b']) | stock.priority_code.isin(['a', 'b'])]

Unnamed: 0_level_0,cost_class,cost,stock_code,priority_code,tax_rate
item_no,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,1st,10.99,a,,0
2,2nd,,a,,0
3,3rd,2.99,c,a,20
4,4th,,b,b,20
5,4th,2.99,a,,20
6,3rd,2.45,b,a,0
9,1st,3.0,a,a,5


We can also answer the query above using the `.any()` `DataFrame` method with the correct choice of `axis`. Let's break down what this method does. First, let's see if 'a' or 'b' occurs in `stock_code` or `priority_code`

In [47]:
stock.loc[:,['stock_code', 'priority_code']].isin(['a', 'b'])

Unnamed: 0_level_0,stock_code,priority_code
item_no,Unnamed: 1_level_1,Unnamed: 2_level_1
1,True,False
2,True,False
3,False,True
4,True,True
5,True,False
6,True,True
7,False,False
8,False,False
9,True,True
10,False,False


Now we'd like to return any rows on which `True` occurs at least once: `.any()` can do this for us, if we set `axis='columns'` or `axis=1` (these mean the same thing). Basically we are asking for each row: 'Are any of the values on this row `True`?'

In [48]:
stock[['stock_code', 'priority_code']].isin(['a', 'b']).any(axis='columns')

item_no
1      True
2      True
3      True
4      True
5      True
6      True
7     False
8     False
9      True
10    False
dtype: bool

Next we use the expression above as a `Boolean` mask to index the rows. Only those rows where the mask is `True` will be returned

In [49]:
stock.loc[stock[['stock_code', 'priority_code']].isin(['a', 'b']).any(axis='columns')]

Unnamed: 0_level_0,cost_class,cost,stock_code,priority_code,tax_rate
item_no,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,1st,10.99,a,,0
2,2nd,,a,,0
3,3rd,2.99,c,a,20
4,4th,,b,b,20
5,4th,2.99,a,,20
6,3rd,2.45,b,a,0
9,1st,3.0,a,a,5


There is also a corresponding `.all()` method, which when passed an argument `axis='columns'` is equivalent to asking for each row: 'Are **all** of the values on this row `True`?' Have a go at using it in this task.


**<u>Task - 2 mins</u>**

Find any items with an 'a' or 'b' in both stock_code and priority_code

**Solution**

In [50]:
stock.loc[stock[['stock_code', 'priority_code']].isin(['a', 'b']).all(axis='columns')]

Unnamed: 0_level_0,cost_class,cost,stock_code,priority_code,tax_rate
item_no,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
4,4th,,b,b,20
6,3rd,2.45,b,a,0
9,1st,3.0,a,a,5


## Advanced filtering: strings and types

What about if we want to look for variables which contain certain strings? We know how to do this in R, and it's pretty useful. 

> **Return all columns starting with the word 'cost'**

To do this, we can use the `StringMethods` functions available on `Index` objects (you access these through the use of the `.str` **accessor** on the `Index` object)

In [51]:
stock.loc[:, stock.columns.str.startswith('cost')]

Unnamed: 0_level_0,cost_class,cost
item_no,Unnamed: 1_level_1,Unnamed: 2_level_1
1,1st,10.99
2,2nd,
3,3rd,2.99
4,4th,
5,4th,2.99
6,3rd,2.45
7,2nd,5.99
8,,5.99
9,1st,3.0
10,3rd,



**<u>Task - 2 mins</u>**

Use similar code to select all the columns that **don't start** with the word 'cost'

**Solution**

In [52]:
# StringMethods
stock.loc[:, ~stock.columns.str.startswith('cost')]

Unnamed: 0_level_0,stock_code,priority_code,tax_rate
item_no,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,a,,0
2,a,,0
3,c,a,20
4,b,b,20
5,a,,20
6,b,a,0
7,,e,20
8,,,20
9,a,a,5
10,c,d,20


We can also select by column type, again like you know how to do in R. 

> **Select all numeric columns**

The `DataFrame` `.select_dtypes()` method can be used here; we can choose to `include` or `exclude` types as required. In `pandas` 'number' corresponds to `int`s and `float`s

In [53]:
stock.select_dtypes(include='number')

Unnamed: 0_level_0,cost,tax_rate
item_no,Unnamed: 1_level_1,Unnamed: 2_level_1
1,10.99,0
2,,0
3,2.99,20
4,,20
5,2.99,20
6,2.45,0
7,5.99,20
8,5.99,20
9,3.0,5
10,,20


> **Select all character columns**

In [54]:
stock.select_dtypes(include=['string', 'object'])

Unnamed: 0_level_0,cost_class,stock_code,priority_code
item_no,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,1st,a,
2,2nd,a,
3,3rd,c,a
4,4th,b,b
5,4th,a,
6,3rd,b,a
7,2nd,,e
8,,,
9,1st,a,a
10,3rd,c,d


## Advanced filtering: `.reindex()`

What if we want to do a combination of filtering? Say we want to:

> **Select rows for item_nos 1, 2 and 12 (if they exist), together with all columns starting with the word 'cost' and column stock_date (if it exists)**

If we try to do this using `.loc[]`, it will fail with a `KeyError` as neither `item_no` 12 nor column `stock_date` exist.

In [55]:
item_nos = [1, 2, 12]
cols = [col for col in stock.columns if col.startswith('cost')] + ['stock_date']
cols

['cost_class', 'cost', 'stock_date']

In [56]:
stock.loc[item_nos, cols]

KeyError: '[12] not in index'

But the `.reindex()` method will work, filling in missing values (`np.nan`) for row indices or columns that were not found in the original data

In [57]:
stock.reindex(index=item_nos, columns=cols)

Unnamed: 0_level_0,cost_class,cost,stock_date
item_no,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,1st,10.99,
2,2nd,,
12,,,
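
If you would rather drop the labels that don't exist than have them filled with missing values, one possible alternative (not part of the approach above) is to intersect the requested labels with those actually present before calling `.loc[]`. A minimal sketch on a hypothetical miniature frame:

```python
import pandas as pd

# Hypothetical miniature frame standing in for `stock`
demo = pd.DataFrame(
    {'cost_class': ['1st', '2nd', '3rd'], 'cost': [10.99, 2.99, 2.45]},
    index=pd.Index([1, 2, 3], name='item_no'),
)

wanted_rows = [1, 2, 12]                            # 12 is not in the index
wanted_cols = ['cost_class', 'cost', 'stock_date']  # stock_date is not a column

# Keep only the labels that actually exist, then index as usual
rows = demo.index.intersection(wanted_rows)
cols = demo.columns.intersection(wanted_cols)
subset = demo.loc[rows, cols]
print(subset)
```

Which behaviour you want depends on the task: `.reindex()` keeps a visible placeholder for everything requested, while the intersection silently ignores anything that is absent.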


And that's all we're going to cover in slightly more advanced indexing and filtering! Hopefully that covers most of what you'd need to do involving subsetting your data in Python 




# Optional: the `.query()` method

The `.query()` method available on each `DataFrame` can be used to write queries in more directly readable form. Let's use this method to find items matching the following:

> **Find all items with a cost greater than 5.00 that are in the 1st or 2nd cost_class**

In [58]:
stock.query("cost > 5.00 and cost_class in ['1st', '2nd']")

Unnamed: 0_level_0,cost_class,cost,stock_code,priority_code,tax_rate
item_no,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,1st,10.99,a,,0
7,2nd,5.99,,e,20


You can also use local variables in a `.query()` call by prefixing them with `@`

In [59]:
cost_threshold = 5.00
cost_classes = ['1st', '2nd']

stock.query('cost > @cost_threshold and cost_class in @cost_classes')

Unnamed: 0_level_0,cost_class,cost,stock_code,priority_code,tax_rate
item_no,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,1st,10.99,a,,0
7,2nd,5.99,,e,20


Where the `NumExpr` package is installed, the `.query()` method can make use of it in the background: this was designed as a fast numerical expression engine for `numpy`. For this reason, `.query()` can offer performance benefits for **large datasets**, with the `NumExpr` project reporting speed-ups of up to 15-times over `numpy` in some cases.
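
As a sanity check, the sketch below compares a `.query()` call with the equivalent Boolean mask on a hypothetical miniature frame. It also passes `engine='python'`, a keyword that `.query()` forwards to `pandas.eval`, to evaluate the same expression without `NumExpr`.

```python
import pandas as pd

# Hypothetical miniature frame standing in for `stock`
demo = pd.DataFrame(
    {'cost_class': ['1st', '2nd', '3rd', '2nd'],
     'cost': [10.99, 5.99, 2.45, 1.50]},
    index=pd.Index([1, 2, 3, 4], name='item_no'),
)

expr = "cost > 5.00 and cost_class in ['1st', '2nd']"

via_query = demo.query(expr)
via_mask = demo.loc[(demo.cost > 5.00) & demo.cost_class.isin(['1st', '2nd'])]
print(via_query.equals(via_mask))  # True

# Same expression, evaluated by the plain Python engine
via_python = demo.query(expr, engine='python')
print(via_python.equals(via_mask))  # True
```

On a frame this small any speed difference is negligible; the choice between `.query()` and a Boolean mask is mostly about readability.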