# <font color=#14F278>Unit 3 - Data Indexing and Selection</font>
---

In the previous units we learnt about __Series__ and __DataFrames__ and how to construct them. Now let's explore how to access subsets of the data they hold.

## <font color=#14F278> 1. Data Indexing and Selection - Definition:</font>

<font color=#14F278>**Data Indexing**</font> (also known as <font color=#14F278>**Subset Selection**</font>) in Pandas means selecting a subset of the data held in a Pandas object.

In the context of a __Series__, data indexing could mean selecting one or multiple elements of a Series. In the context of __DataFrames__, data indexing could refer to selecting a subset of rows and columns, individual values, etc.

In [2]:
# Imports
import pandas as pd
import numpy as np
import datetime as dt

---
## <font color=#14F278> 2. Series - Data Indexing and Selection:</font>

Recall that __Series__ are 1-dimensional objects of indexed data.
We can index a Series object in one of the following ways:
- <font color=#14F278>**Explicit vs Implicit Indexing**</font>
- <font color=#14F278>**Slicing**</font>
- <font color=#14F278>**Boolean Indexing**</font>

---
### <font color=#14F278> 2.1 Explicit vs Implicit Indexing:</font>

The first way to select an element from a Series is by using its <font color=#14F278>**index**</font> as a key!


- <font color=#14F278>**Explicit Indexing:**</font>
     - accessing a single element via the <font color=#14F278>**actual label (name)**</font> in the index
     - Syntax: `series_name.loc[index_label]`
- <font color=#14F278>**Implicit Indexing:**</font>
     - accessing a single element via its <font color=#14F278>**integer position**</font> (counted from 0)
     - Syntax: `series_name.iloc[index_position]`

<font color=#FF8181>**Point to Note:**</font> Indexing is done by using one of two possible accessors - `.loc` and `.iloc`. An easy way to distinguish them from one another is by remembering that `.iloc` stands for __integer location__. In fact, associating the __i__ in `.iloc` with __integer, index,__ or even __implicit__ will help you remember which indexing method does what in the future! 


<center>
    <div>
        <img src="..\images\selection_001.png"/>
    </div>
</center>

In [8]:
# Create a series
s = pd.Series(['a','b','c','d'], index=[1,2,3,4])
display(s)

1    a
2    b
3    c
4    d
dtype: object

In [7]:
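# Relabel a copy, so that `s` keeps its labels 1-4 for the cells below
s_copy = s.copy()
s_copy.index = range(2, 6)
s_copy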

2    a
3    b
4    c
5    d
dtype: object

In [4]:
# Explicit indexing (using the actual index labels)
# in this example we access the element whose index label is 1
print('Explicit indexing s.loc[1]: ', s.loc[1])

Explicit indexing s.loc[1]:  a


In [5]:
# Implicit indexing (using the position)
# in this example we access the element at position 1
# Recall that positions are counted from 0 -- the element at position 1 is in fact the 2nd element in the Series
print('Implicit indexing s.iloc[1]: ', s.iloc[1])

Implicit indexing s.iloc[1]:  b


<font color=#FF8181>**Warning:**</font> Using <font color=#FF8181>**unspecified indexing**</font> is <font color=#FF8181>**NOT recommended!**</font>

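Without an accessor, the meaning of `s[key]` depends on the type of the index. When the index holds integers, as in our example, `s[1]` is a label lookup and raises a `KeyError` if that label does not exist. When the index holds other labels (for example strings), an integer key has historically been treated as a position, a fallback that recent Pandas versions deprecate. Either way, indexing a Series without an accessor introduces __ambiguity__.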

In [9]:
# in this case unspecified indexing located the element with index label 1 (not the element at position 1)
print('Unspecified indexing s[1]: ', s[1])

Unspecified indexing s[1]:  a
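
To make the ambiguity concrete, here is a minimal sketch (an illustrative addition that reuses the Series with integer labels 1 to 4). It assumes an integer index, in which case plain brackets are resolved by label, so asking for `0` fails even though a first element exists, while the two accessors stay unambiguous. The variable names are chosen for illustration only.

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c', 'd'], index=[1, 2, 3, 4])

by_label = s.loc[1]      # label 1 -> first element
by_position = s.iloc[1]  # position 1 -> second element

# Plain brackets on an integer index are resolved by label, not position,
# so asking for 0 raises KeyError even though a first element exists.
try:
    s[0]
    lookup_failed = False
except KeyError:
    lookup_failed = True

print(by_label, by_position, lookup_failed)  # a b True
```

Writing `.loc` or `.iloc` every time removes the need to remember these rules.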


---
### <font color=#14F278> 2.2 Slicing:</font>

We learnt how to access a single element of a Series via Explicit or Implicit Indexing.
To access a subset of elements of a Series, we use a technique called <font color=#14F278>**Slicing**</font>!


<font color=#14F278>**Slicing**</font> leverages the same accessors as Indexing - `.loc` and `.iloc`. This time, however, instead of passing a single key (the index label or the index position of the desired element), we pass a start and a stop, separated by a colon (__:__).


Syntax:
- <font color=#14F278>**Explicit Slicing**</font> - `series_name.loc[start_index_label: stop_index_label]` - the value corresponding to `stop_index_label` is <font color=#14F278>**included**</font> in the output
- <font color=#14F278>**Implicit Slicing**</font> - `series_name.iloc[start_index_position: stop_index_position]` - the value corresponding to `stop_index_position` is <font color=#14F278>**excluded**</font> from the output

<center>
    <div>
        <img src="..\images\selection_002.png"/>
    </div>
</center>

In [10]:
# Create a Series
s = pd.Series(['a','b','c','d'], index=[1,2,3,4])
display(s)

1    a
2    b
3    c
4    d
dtype: object

In [11]:
# Explicit Slicing - both values at index labels 1 and 3 included -- output has length 3
print('Using explicit slicing s.loc[1:3]')
s.loc[1:3]

Using explicit slicing s.loc[1:3]


1    a
2    b
3    c
dtype: object

In [12]:
# Implicit Slicing - value at index position 2 is excluded - output has length 2
print('Using implicit slicing s.iloc[0:2]')
s.iloc[0:2]

Using implicit slicing s.iloc[0:2]


1    a
2    b
dtype: object
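
The inclusive/exclusive difference is easier to see when labels and positions cannot be confused. The following sketch is an illustrative addition: it uses a Series with made-up string labels and also shows the optional step, which works as it does for Python lists.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=['w', 'x', 'y', 'z'])

explicit = s.loc['x':'z']   # labels x, y, z -> stop label included
implicit = s.iloc[1:3]      # positions 1 and 2 -> stop position excluded
every_other = s.iloc[::2]   # positions 0 and 2 -> optional step

print(list(explicit))     # [20, 30, 40]
print(list(implicit))     # [20, 30]
print(list(every_other))  # [10, 30]
```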

---
### <font color=#14F278> 2.3 Boolean Indexing (Masking):</font>

<font color=#14F278>**Boolean Indexing**</font> is indexing based on the <font color=#14F278>**actual values of the elements in the Series**</font> rather than on their labels or positions.
Put simply, we perform __Boolean Indexing__ by testing every element of the Series against a __True/False__ condition and keeping only the elements for which the condition is __True__.

<center>
    <div>
        <img src="..\images\selection_003.png"/>
    </div>
</center>

Boolean Indexing is carried out with a technique called <font color=#14F278>**Masking**</font>:

A <font color=#14F278>**Mask**</font> is a sequence of Boolean values - e.g. `[True, False, True]` - with one entry per element. When applied to a Pandas object (a Series or a DataFrame), the mask returns the subset of values whose entry is True.

Syntax:
- <font color=#14F278>**Explicit Mask**</font> - `mask = [bool1, bool2, ...]` where each `bool` is `True` or `False`
- <font color=#14F278>**Implicit Mask**</font> - `mask = (True/False statement on series_name)` - this evaluates to a Boolean Series with one `True`/`False` value per element

Although it sounds complicated in theory, <font color=#14F278>**Boolean Indexing (Masking) is just filtering!**</font>

In [13]:
# Create a Series
s = pd.Series([1,2,3,4,5])
display(s)

0    1
1    2
2    3
3    4
4    5
dtype: int64

In [14]:
# Create a mask explicitly (manually entering True/False bools)
# note - once the mask is created, we apply it to the Series with the [] operator
mask = [True, True, True, False, False]
print('After applying the boolean mask: ')
s[mask]

After applying the boolean mask: 


0    1
1    2
2    3
dtype: int64

In [18]:
# Create a mask automatically via a True/False statement
mask = (s <= 3)
print(mask)
s[mask]

0     True
1     True
2     True
3    False
4    False
dtype: bool


0    1
1    2
2    3
dtype: int64
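
Conditions can also be combined into a single mask. The sketch below is an illustrative addition to the examples above: it uses the element-wise operators `&` (and), `|` (or) and `~` (not), with each condition wrapped in its own parentheses.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

# Each condition needs its own parentheses; use & (and), | (or), ~ (not)
# rather than the Python keywords and / or / not.
middle = s[(s >= 2) & (s <= 4)]
edges = s[(s < 2) | (s > 4)]
not_small = s[~(s <= 3)]

print(list(middle))     # [2, 3, 4]
print(list(edges))      # [1, 5]
print(list(not_small))  # [4, 5]
```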

---
### <font color=#14F278> 2.4 Assigning Values to a Series:</font>
Data Indexing or Slicing is often used not just for 'selection and display purposes', but also for assigning new values to a Series. Below are some examples of how to do this:

In [19]:
# Create our sample data
s = pd.Series([1,2,3,4])
display(s)

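# Assigning a new value to a single element:
s.iloc[0] = 9000
display(s)

# Assigning one value to every element of a slice (labels 1 and 2, stop label included):
s.loc[1:2] = 9005
display(s)

# Assigning a list of values to a slice (positions 2 and 3, stop position excluded):
s.iloc[2:4] = [9010, 9015]
display(s)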
display(s)

0    1
1    2
2    3
3    4
dtype: int64

0    9000
1       2
2       3
3       4
dtype: int64

0    9000
1    9005
2    9005
3       4
dtype: int64

0    9000
1    9005
2    9010
3    9015
dtype: int64
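
Assignment also works through a Boolean mask. As a small illustrative sketch (not one of the original examples), the condition selects the elements to overwrite and everything else is left untouched:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])

# Overwrite only the elements for which the condition is True
s.loc[s > 2] = 0

print(list(s))  # [1, 2, 0, 0]
```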

---
## <font color=#14F278> 3. DataFrames - Data Indexing and Selection:</font>
Recall that __DataFrames__ are 2-dimensional objects of data, indexed by their rows and columns.

We can index a DataFrame object in one of the following ways:
- <font color=#14F278>**Explicit vs Implicit Indexing**</font>
- <font color=#14F278>**Explicit vs Implicit Slicing**</font>
- <font color=#14F278>**Boolean Indexing**</font>
- <font color=#14F278>**Column Selection**</font>

In [20]:
# Create a helper that builds a DataFrame whose cells combine the column name and the row number (e.g. 'a1')
def make_df(cols, rows):
    data = {c:[str(c)+str(r) for r in rows] for c in cols}
    return pd.DataFrame(data)

In [21]:
# Construct a dataframe using function above
df = make_df('abc', [1,2,3,4])
df.index = [5,6,7,8]
display(df)

    a   b   c
5  a1  b1  c1
6  a2  b2  c2
7  a3  b3  c3
8  a4  b4  c4


---
### <font color=#14F278> 3.1 Explicit vs Implicit Indexing:</font>

Explicit and Implicit Indexing for DataFrames work in exactly the same way as for Series! The only thing to remember is that a value in a __DataFrame__ is now uniquely identified by a pair of keys - its row and its column!

Syntax:
- <font color=#14F278>**Explicit Indexing**</font> - `dataframe_name.loc[row_label, column_label]`
- <font color=#14F278>**Implicit Indexing**</font> - `dataframe_name.iloc[row_number, column_number]`

In [22]:
# Getting the value in the first row and first column using implicit indexing
df.iloc[0,0]

'a1'

In [23]:
# Getting the value in the first row and first column using explicit indexing
df.loc[5,'a']

'a1'

---
### <font color=#14F278> 3.2 Slicing:</font>
Again, slicing a DataFrame is very similar to slicing a Series - we shall again use the `.iloc` and `.loc` accessors. However, as __DataFrames__ are 2-dimensional objects, we now can obtain subsets of a DataFrame of all sorts and shapes!


Syntax:
- <font color=#14F278>**Explicit Slicing**</font> - `dataframe_name.loc[start_row_label: stop_row_label, start_column_label: stop_column_label]`
- <font color=#14F278>**Implicit Slicing**</font> - `dataframe_name.iloc[start_row_num: stop_row_num, start_column_num: stop_column_num]`

Depending on what output we are after, we can use a number of variations to the above syntax! Let's look into some examples!

In [24]:
# Explicit Slicing -- getting a sub-dataframe
df.loc[6:7, 'b':'c']

    b   c
6  b2  c2
7  b3  c3


In [25]:
# Implicit Slicing -- getting just the second row in full
# note - df.iloc[1] yields the same!
df.iloc[1,:]

a    a2
b    b2
c    c2
Name: 6, dtype: object

In [26]:
# Explicit Slicing -- getting the 2nd and 3rd row in full
# note - df.loc[6:7,:] yields the same!
df.loc[6:7]

    a   b   c
6  a2  b2  c2
7  a3  b3  c3
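
A few more variations of the slicing syntax are sketched below as an illustrative addition. The DataFrame is rebuilt directly (with the same contents as `df`) so that the block runs on its own, and the variable names are chosen for illustration only.

```python
import pandas as pd

df = pd.DataFrame(
    {'a': ['a1', 'a2', 'a3', 'a4'],
     'b': ['b1', 'b2', 'b3', 'b4'],
     'c': ['c1', 'c2', 'c3', 'c4']},
    index=[5, 6, 7, 8],
)

col_b = df.loc[:, 'b']                 # every row of column 'b' -> Series
corners = df.loc[[5, 8], ['a', 'c']]   # lists of labels pick non-adjacent rows and columns
top_left = df.iloc[:2, :2]             # first two rows and columns, stop positions excluded

print(list(col_b))               # ['b1', 'b2', 'b3', 'b4']
print(corners.values.tolist())   # [['a1', 'c1'], ['a4', 'c4']]
print(top_left.values.tolist())  # [['a1', 'b1'], ['a2', 'b2']]
```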


---
### <font color=#14F278> 3.3 Boolean Indexing (Masking):</font>
You guessed it - <font color=#14F278>**Boolean Masking**</font> on DataFrames works the same way as it does on Series!

The one thing to remember is that <font color=#14F278>**Masking filters Rows**</font>. Applying a mask to a DataFrame returns a new DataFrame with the same columns, containing only the rows where the mask is True!

<center>
    <div>
        <img src="..\images\selection_004.png"/>
    </div>
</center>

In [27]:
# Explicit Masking -- it returns only the rows that correspond to True
mask = [False, True, False, True] 
df[mask]

    a   b   c
6  a2  b2  c2
8  a4  b4  c4


In [28]:
# Implicit Masking - keeps only the rows whose value in column 'a' equals 'a3'
mask = df['a'] == 'a3'
df[mask]

    a   b   c
7  a3  b3  c3
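
A mask can be combined with a column selection by passing both to `.loc`. The sketch below is an illustrative addition: it rebuilds the same DataFrame so that it runs on its own and builds the mask with `isin`, which tests each value for membership in a list.

```python
import pandas as pd

df = pd.DataFrame(
    {'a': ['a1', 'a2', 'a3', 'a4'],
     'b': ['b1', 'b2', 'b3', 'b4'],
     'c': ['c1', 'c2', 'c3', 'c4']},
    index=[5, 6, 7, 8],
)

# True for the rows whose value in column 'a' is one of the listed values
mask = df['a'].isin(['a2', 'a4'])

# Rows come from the mask, columns from the list of labels
subset = df.loc[mask, ['b', 'c']]

print(list(subset.index))      # [6, 8]
print(subset.values.tolist())  # [['b2', 'c2'], ['b4', 'c4']]
```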


---
### <font color=#14F278> 3.4 Column Selection:</font>
<font color=#14F278>**Column Selection**</font> is specific to <font color=#14F278>**DataFrames**</font>. Often we only need some of the columns for our analysis and want to leave the unnecessary ones out. We can specify which columns to select in the following way:

Syntax: 
- <font color=#14F278>**Select Multiple Columns & Return a DataFrame**</font> - `dataframe_name[['col1', 'col2', ...]]`
- <font color=#14F278>**Select a Single Column & Return a DataFrame**</font> - `dataframe_name[['col1']]`
- <font color=#14F278>**Select a Single Column & Return a Series**</font> - `dataframe_name['col1']`

In [30]:
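display(df['b'])         # single label -> Series
display(df[['b']])       # list with one label -> DataFrame with one column
display(df[['a', 'b']])  # list of labels -> DataFrame with those columns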

5    b1
6    b2
7    b3
8    b4
Name: b, dtype: object

    b
5  b1
6  b2
7  b3
8  b4


    a   b
5  a1  b1
6  a2  b2
7  a3  b3
8  a4  b4


---
## <font color=#14F278> 4. Summary:</font>

__Data Indexing__ refers to selecting a subset of the data in a Pandas object. The main ways to do this are:
- __Explicit vs Implicit Indexing__ - applicable to both DataFrames and Series
- __Slicing__ - applicable to both DataFrames and Series
- __Boolean Masking__ - applicable to both DataFrames and Series
- __Column Selection__ - applicable to DataFrames only

---
## <font color=#FF8181> 5. Concept Check: </font>
1. What are the different ways we can index a Series?
2. Suppose you have a Series `pd.Series([1,2,3,4], index=['a','b','c','d'])`
   - Using implicit and explicit indexing, get the second element
   - Using explicit indexing, slice the Series to get the first three elements
   - Using a boolean mask, select the even numbers

In [None]:
# 1) There are 3 ways to index a Series - Explicitly, Implicitly and via Boolean Indexing
# Explicit indexing specifies the index label -- .loc[]
# Implicit indexing specifies the integer position -- .iloc[]
# Boolean indexing applies a True/False condition to the Series values and returns those that satisfy it

In [39]:
s = pd.Series([1,2,3,4], index=['a','b','c','d'])
s

a    1
b    2
c    3
d    4
dtype: int64

In [37]:
# Implicit indexing - second element by position
s.iloc[1]


2

In [41]:
# Explicit indexing - second element by label
s.loc['b']

2

In [42]:
# Explicit slicing - first three elements (stop label 'c' included)
s.loc['a':'c']

a    1
b    2
c    3
dtype: int64

In [44]:
# Boolean mask - True where the value is even
mask = (s % 2 == 0)
print(mask)
s[mask]

a    False
b     True
c    False
d     True
dtype: bool


b    2
d    4
dtype: int64