**Table of contents**<a id='toc0_'></a>    
- [Import Statements](#toc1_1_)    
- [Importing the data](#toc1_2_)    
- [Indexing & Filtering](#toc2_)    
  - [`->` **Indexing**](#toc2_1_)    
    - [*Renaming index labels: The `.rename()` method*](#toc2_1_1_)    
    - [*Resetting index labels to monotonically increasing integers: The `.reset_index()` method*](#toc2_1_2_)    
    - [*Indexing by Index labels: The `.loc[]` method*](#toc2_1_3_)    
    - [*Indexing by Index positions: The `.iloc[]` method*](#toc2_1_4_)    
  - [`->` **Filtering**](#toc2_2_)    
    - [*Filtering Index and Column Labels with `.filter(items, like, regex, axis)`*](#toc2_2_1_)    
    - [*Filtering with boolean arrays (Boolean Masking)*](#toc2_2_2_)    
    - [*Using `functions with .loc` (for filtering)*](#toc2_2_3_)    
    - [*Filtering with the `.query()` method*](#toc2_2_4_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=2
	maxLevel=5
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

**Read the official documentation on pandas DataFrames @ https://pandas.pydata.org/pandas-docs/stable/reference/frame.html**

**`Note:`** **Chaining functions/methods** in pandas works the same way as in plain Python: each call returns an object, and the next method is called on that object.

DataFrames are **column oriented**, unlike many common databases, which are row oriented. **Each column** in the dataframe is a **pandas Series object**, so any operation that can be performed on a pandas Series can be applied to a column too.

There are **two axes** for a dataframe, commonly referred to as axis 0 and axis 1, or the **"index"** (rows) axis and the **"columns"** axis respectively. When an **operation** such as `.sum()` is applied **along axis 0**, it runs **down through all the rows, giving one result per column**. Likewise, an operation **along axis 1** runs **across the values in all the columns, giving one result per row**.
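
As a minimal sketch of the two points above (columns are Series, and what "along an axis" means), here is a toy frame. The frame `scores` and its values are made up for illustration and are not part of the Siena dataset.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the axis convention.
scores = pd.DataFrame(
    {"math": [90, 70, 80], "art": [60, 85, 75]},
    index=["ann", "bob", "cy"],
)

# A single column is a pandas Series, so Series methods apply to it.
math_col = scores["math"]
print(type(math_col).__name__)
print(math_col.mean())

# axis=0: move down the rows, one result per column (math 240, art 220).
per_column = scores.sum(axis=0)

# axis=1: move across the columns, one result per row (ann 150, bob 155, cy 155).
per_row = scores.sum(axis=1)

print(per_column)
print(per_row)
```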

### <a id='toc1_1_'></a>[Import Statements](#toc0_)

In [1]:
# import statements
import numpy as np
import pandas as pd

In [2]:
# view options
pd.set_option("display.max_columns", 14)
pd.set_option("display.max_rows", 8)

### <a id='toc1_2_'></a>[Importing the data](#toc0_)

- We will be exploring a dataset from a Siena College Poll in 2018. This data has rankings of United States Presidents in various attributes. These attributes are:

In [3]:
siena_2018_cols = """
• Bg = Background
• Im = Imagination
• Int = Integrity
• IQ = Intelligence
• L = Luck
• WR = Willing to take risks
• AC = Ability to compromise
• EAb = Executive ability
• LA = Leadership ability
• CAb = Communication ability
• OA = Overall ability
• PL = Party leadership
• RC = Relations with Congress
• CAp = Court appointments
• HE = Handling of economy
• EAp = Executive appointments
• DA = Domestic accomplishments
• FPA = Foreign policy accomplishments
• AM = Avoid crucial mistakes
• EV = Experts’ view
• O = Overall
"""

In [4]:
# reading from github url

# it is a good practice to define your index column when reading the data file.
# it is generally frowned upon if you don't have an index column

url = "https://github.com/mattharrison/datasets/raw/master/data/siena2018-pres.csv"
siena_2018 = pd.read_csv(url, index_col=0)

In [5]:
siena_2018.head(3)

Unnamed: 0,Seq.,President,Party,Bg,Im,Int,IQ,...,HE,EAp,DA,FPA,AM,EV,O
1,1,George Washington,Independent,7,7,1,10,...,1,1,2,2,1,2,1
2,2,John Adams,Federalist,3,13,4,4,...,13,15,19,13,16,10,14
3,3,Thomas Jefferson,Democratic-Republican,2,2,14,1,...,20,4,6,9,7,5,5


In [6]:
# this will print all the column names, number of non null values in each column and the datatype of that column
siena_2018.info()

<class 'pandas.core.frame.DataFrame'>
Index: 44 entries, 1 to 44
Data columns (total 24 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Seq.       44 non-null     object
 1   President  44 non-null     object
 2   Party      44 non-null     object
 3   Bg         44 non-null     int64 
 4   Im         44 non-null     int64 
 5   Int        44 non-null     int64 
 6   IQ         44 non-null     int64 
 7   L          44 non-null     int64 
 8   WR         44 non-null     int64 
 9   AC         44 non-null     int64 
 10  EAb        44 non-null     int64 
 11  LA         44 non-null     int64 
 12  CAb        44 non-null     int64 
 13  OA         44 non-null     int64 
 14  PL         44 non-null     int64 
 15  RC         44 non-null     int64 
 16  CAp        44 non-null     int64 
 17  HE         44 non-null     int64 
 18  EAp        44 non-null     int64 
 19  DA         44 non-null     int64 
 20  FPA        44 non-null     int64 
 21  

In [7]:
# Datatype casting and renaming the columns
cols_list = [
    col.strip().split("=") for col in siena_2018_cols.strip().split(sep="•")[1:]
]

# we will replace the spaces in the full form with underscores (_)
siena_2018_cols_dict = {
    col_prev.strip(): col_full.strip().replace(" ", "_")
    for col_prev, col_full in cols_list
}
siena_2018 = siena_2018.rename(columns={"Seq.": "Seq"}).rename(
    columns=siena_2018_cols_dict
)
siena_2018 = siena_2018.astype({"Party": "category"})
siena_2018 = siena_2018.astype(
    {col_name: "uint8" for col_name in siena_2018.select_dtypes("int64").columns}
)
siena_2018 = siena_2018.assign(
    Average_rank=siena_2018.loc[:, "Background":"Experts’_view"]
    .sum(axis=1)
    .rank(method="dense")
    .astype("uint8"),
    Quartile_rank=lambda df_: pd.qcut(
        df_.Average_rank, 4, labels=["1st", "2nd", "3rd", "4th"]
    ),
)

--------------------------

## <a id='toc2_'></a>[Indexing & Filtering](#toc0_)

------------------------

### <a id='toc2_1_'></a>[`->` **Indexing**](#toc0_)

#### <a id='toc2_1_1_'></a>[*Renaming index labels: The `.rename()` method*](#toc0_)

<u> Parameters: </u>
- **mapper**: Dict-like or function transformation to apply to the labels of the specified axis. When using a function, pass the function object itself (e.g. `name_to_initial`) without calling it.
- **axis**: the axis whose labels are renamed, index (0, the default) or columns (1).
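
The two kinds of mapper described above can be sketched on a small stand-in frame. The frame `df` below is a hypothetical two-row example, not the Siena data.

```python
import pandas as pd

# Hypothetical stand-in frame with abbreviated column names.
df = pd.DataFrame(
    {"Bg": [7, 3], "Im": [7, 13]},
    index=["George Washington", "John Adams"],
)

# Dict-like mapper: only the labels present in the dict are renamed.
renamed_cols = df.rename({"Bg": "Background"}, axis="columns")

# Function mapper: pass the function itself (str.upper), not a call to it.
renamed_rows = df.rename(str.upper, axis="index")

print(renamed_cols.columns.tolist())
print(renamed_rows.index.tolist())
```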

In [8]:
# say, we would like to set the president name as our index and use initial for first name and not the full name
def name_to_initial(val):
    vals = val.split(" ")
    return " ".join(
        [f"{vals[0][0]}.", *vals[1:]]
    )  # unpack the items in the vals[1:] list


siena_2018.set_index("President").rename(
    name_to_initial
)  # or, lambda name_: " ".join([f'{name_.split()[0][0]}.', *name_.split()[1:]])

President,Seq,Party,Background,Imagination,Integrity,Intelligence,Luck,...,Domestic_accomplishments,Foreign_policy_accomplishments,Avoid_crucial_mistakes,Experts’_view,Overall,Average_rank,Quartile_rank
G. Washington,1,Independent,7,7,1,10,1,...,2,2,1,2,1,1,1st
J. Adams,2,Federalist,3,13,4,4,24,...,19,13,16,10,14,13,2nd
T. Jefferson,3,Democratic-Republican,2,2,14,1,8,...,6,9,7,5,5,5,1st
J. Madison,4,Democratic-Republican,4,6,7,3,16,...,11,19,11,8,7,7,1st
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
B. Clinton,42,Democratic,21,12,39,8,11,...,9,18,30,14,15,14,2nd
G. W. Bush,43,Republican,17,29,33,41,21,...,30,38,36,34,33,33,3rd
B. Obama,44,Democratic,24,11,13,9,15,...,13,20,10,11,17,17,2nd
D. Trump,45,Republican,43,40,44,44,10,...,40,42,41,42,42,42,4th


#### <a id='toc2_1_2_'></a>[*Resetting index labels to monotonically increasing integers: The `.reset_index()` method*](#toc0_)

In [9]:
siena_2018.set_index("President").reset_index()

Unnamed: 0,President,Seq,Party,Background,Imagination,Integrity,Intelligence,...,Domestic_accomplishments,Foreign_policy_accomplishments,Avoid_crucial_mistakes,Experts’_view,Overall,Average_rank,Quartile_rank
0,George Washington,1,Independent,7,7,1,10,...,2,2,1,2,1,1,1st
1,John Adams,2,Federalist,3,13,4,4,...,19,13,16,10,14,13,2nd
2,Thomas Jefferson,3,Democratic-Republican,2,2,14,1,...,6,9,7,5,5,5,1st
3,James Madison,4,Democratic-Republican,4,6,7,3,...,11,19,11,8,7,7,1st
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
40,Bill Clinton,42,Democratic,21,12,39,8,...,9,18,30,14,15,14,2nd
41,George W. Bush,43,Republican,17,29,33,41,...,30,38,36,34,33,33,3rd
42,Barack Obama,44,Democratic,24,11,13,9,...,13,20,10,11,17,17,2nd
43,Donald Trump,45,Republican,43,40,44,44,...,40,42,41,42,42,42,4th
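
By default `.reset_index()` moves the old index into a regular column, as in the output above. If the old labels are not needed, the `drop=True` argument discards them instead. A minimal sketch on a hypothetical two-row frame:

```python
import pandas as pd

# Hypothetical frame whose index is named "President".
df = pd.DataFrame(
    {"Party": ["Independent", "Federalist"]},
    index=pd.Index(["George Washington", "John Adams"], name="President"),
)

kept = df.reset_index()              # old index becomes a "President" column
dropped = df.reset_index(drop=True)  # old index is thrown away

print(kept.columns.tolist())
print(dropped.columns.tolist())
```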


#### <a id='toc2_1_3_'></a>[*Indexing by Index labels: The `.loc[]` method*](#toc0_) [&#8593;](#toc0_)

The **`.loc[row indexer, column indexer]`** attribute is **primarily label based**, but it also accepts a boolean array.

Allowable inputs may be:
- **Scalar label:** if one of the indexers is passed as a scalar, it will return,
    - a dataframe if the label occurs more than once on that axis and,
    - a series if the label occurs exactly once. This series will be indexed by,
        - the column labels if the scalar selected a row (axis=0).
        - the row labels if the scalar selected a column (axis=1).

    For it to return a dataframe in all cases we have to pass in the scalar inside a list.
- **List or array of labels**
- **Slice object:** Slicing with `.loc` includes both the start and the end label. *Some notes:*
    - If the axis being sliced has unsorted duplicate index labels we will first need to sort the index with **`.sort_index()`**.
    - Slicing with a bound that is not present in the index (partial slicing) only works if the index is sorted.
    - Partial slicing can only be done on string labels and not on categorical labels.
- **A boolean array:** of the same length as the indexing axis.
- **A callable function:** that receives the dataframe and returns one of the above.
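
The input kinds listed above can be sketched on a small frame. The three-row `df` below is a hypothetical stand-in with unique, unsorted string labels.

```python
import pandas as pd

# Hypothetical stand-in frame with unique string labels.
df = pd.DataFrame(
    {
        "Party": ["Independent", "Federalist", "Democratic-Republican"],
        "Overall": [1, 14, 5],
    },
    index=["Washington", "Adams", "Jefferson"],
)

row = df.loc["Adams"]                   # scalar label -> Series indexed by column labels
frame = df.loc[["Adams"]]               # one label in a list -> one-row DataFrame
cell = df.loc["Adams", "Overall"]       # two scalars -> a single value
sliced = df.loc["Washington":"Adams"]   # label slice -> both endpoints included
masked = df.loc[df["Overall"] < 10]     # boolean mask over the rows
called = df.loc[lambda d: d["Overall"] < 10, ["Party"]]  # callable plus column list

print(sliced)
print(called)
```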

#### <a id='toc2_1_4_'></a>[*Indexing by Index positions: The `.iloc[]` method*](#toc0_) [&#8593;](#toc0_)

The **`.iloc[row indexer, column indexer]`** attribute operates on **integer positions and not index labels**. It can also be used with a boolean array.

Allowable inputs may be:
- **Scalar position:** if one of the indexers is passed as a scalar integer, it will return a series (a position always refers to exactly one row or column). This series will be indexed by,
    - the column labels if the scalar selected a row (axis=0).
    - the row labels if the scalar selected a column (axis=1).

    For it to return a dataframe we have to pass in the scalar inside a list.
- **List or array of integer positions**
- **Slice object:** Slicing with `.iloc` includes the start but not the end, like regular Python slicing. *Note:* because `.iloc` works on positions, duplicate or unsorted index labels do not affect the slice and no sorting is needed.
- **A boolean array:** of the same length as the indexing axis.
- **A callable function:** that receives the dataframe and returns one of the above.
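
A minimal sketch of the same input kinds with `.iloc`, using a hypothetical three-row frame. The boolean mask is converted to a NumPy array first, since `.iloc` expects a plain boolean array rather than a labelled boolean Series.

```python
import pandas as pd

# Hypothetical stand-in frame; labels are irrelevant to .iloc.
df = pd.DataFrame(
    {
        "Party": ["Independent", "Federalist", "Democratic-Republican"],
        "Overall": [1, 14, 5],
    },
    index=["Washington", "Adams", "Jefferson"],
)

first = df.iloc[0]            # scalar position -> Series indexed by column labels
first_frame = df.iloc[[0]]    # position in a list -> one-row DataFrame
head2 = df.iloc[0:2]          # positions 0 and 1; the end (2) is excluded
last_col = df.iloc[:, -1]     # last column as a Series
masked = df.iloc[(df["Overall"] < 10).to_numpy()]  # boolean array
called = df.iloc[lambda d: [0, 2], [0]]            # callable returning positions

print(head2)
print(called)
```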

### <a id='toc2_2_'></a>[`->` **Filtering**](#toc0_)

#### <a id='toc2_2_1_'></a>[*Filtering Index and Column Labels with `.filter(items, like, regex, axis)`*](#toc0_)

- **items** (passed as a list) is used for exact matches. Note that an exact match (with items) raises an error if the axis contains duplicate labels, but a requested label that doesn't exist is silently ignored.
- **like** is used for substring matches.
- **regex** lets you specify a regular expression to match against index or column labels.
- **axis** specifies whether to filter index labels (0) or column labels (1). For a dataframe the default is the columns.

Only one of **items**, **like** and **regex** can be passed in a single call.
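
The four parameters can be sketched as follows. The frame `df` is a hypothetical two-row example whose column names imitate the renamed Siena columns.

```python
import pandas as pd

# Hypothetical stand-in frame.
df = pd.DataFrame(
    {
        "Background": [7, 3],
        "Executive_ability": [2, 21],
        "Leadership_ability": [1, 21],
    },
    index=["Washington", "Adams"],
)

exact = df.filter(items=["Background", "Missing"])  # "Missing" is silently skipped
substr = df.filter(like="ability")                   # labels containing "ability"
pattern = df.filter(regex=r"^[BL]")                  # labels starting with B or L
rows = df.filter(like="Wash", axis=0)                # filter index labels instead

print(exact.columns.tolist())
print(substr.columns.tolist())
print(pattern.columns.tolist())
print(rows.index.tolist())
```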

#### <a id='toc2_2_2_'></a>[*Filtering with boolean arrays (Boolean Masking)*](#toc0_)

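Boolean arrays can be used to filter data from a dataframe. Comparison operators (such as <, >, ==) produce the boolean arrays, and the element-wise operators & (and), | (or) and ~ (not) combine them into more complex filters. Note that you can't use the plain Python keywords *or, and, not* here, because they try to reduce a whole array to a single truth value.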

In [10]:
# let's select the presidents who were Republican and have an average rank < 10.
try:
    siena_2018[siena_2018.Average_rank < 10 & siena_2018.Party == "Republican"]
except TypeError as err:
    print(err)

unsupported operand type(s) for &: 'int' and 'Categorical'


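The takeaway is that you should always put parentheses around each condition when you combine them inline. `&` and `|` bind more tightly than comparison operators such as `<` and `==`, so without parentheses Python evaluates `10 & siena_2018.Party` first, which is what raised the error above.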

Now let's do this properly.

In [11]:
siena_2018[(siena_2018.Average_rank < 10) & (siena_2018.Party == "Republican")]

Unnamed: 0,Seq,President,Party,Background,Imagination,Integrity,Intelligence,...,Domestic_accomplishments,Foreign_policy_accomplishments,Avoid_crucial_mistakes,Experts’_view,Overall,Average_rank,Quartile_rank
16,16,Abraham Lincoln,Republican,28,1,2,2,...,1,6,2,1,3,3,1st
25,26,Theodore Roosevelt,Republican,5,4,8,6,...,4,3,5,4,4,4,1st
33,34,Dwight D. Eisenhower,Republican,11,18,5,17,...,8,7,3,6,6,6,1st
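
The other element-wise operators work the same way. The sketch below uses a hypothetical four-row frame (not the real rankings) to show `|`, `~`, the `.isin()` shortcut, and the error raised by the plain `and` keyword.

```python
import pandas as pd

# Hypothetical stand-in frame; the values are illustrative only.
df = pd.DataFrame(
    {
        "Party": ["Republican", "Democratic", "Republican", "Federalist"],
        "Average_rank": [3, 12, 33, 13],
    },
    index=[16, 42, 43, 2],
)

# OR: rows that satisfy at least one condition.
either = df[(df["Average_rank"] < 10) | (df["Party"] == "Federalist")]

# NOT: invert a mask with ~.
negated = df[~(df["Party"] == "Republican")]

# Membership test, often clearer than chaining several == with |.
members = df[df["Party"].isin(["Democratic", "Federalist"])]

# The plain keyword `and` asks for one truth value per Series and fails.
try:
    df[(df["Average_rank"] < 10) and (df["Party"] == "Republican")]
except ValueError as err:
    and_error = str(err)
    print(and_error)
```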


#### <a id='toc2_2_3_'></a>[*Using `functions with .loc` (for filtering)*](#toc0_)

The main advantage of using functions with .loc is that the function receives the current state of the dataframe as input. This is especially useful when multiple operations are chained together, because the intermediate result has no variable name you could refer to.

It is also possible to filter rows and select specific columns in the same call.

In [12]:
# let us select presidents with average rank < 10 and return first 3 columns of data about them
siena_2018.loc[
    siena_2018.Average_rank < 10, lambda df_: df_.columns[:3]
]  # columns[:3] gives Seq, President and Party; the index is not part of df_.columns

Unnamed: 0,Seq,President,Party
1,1,George Washington,Independent
3,3,Thomas Jefferson,Democratic-Republican
4,4,James Madison,Democratic-Republican
5,5,James Monroe,Democratic-Republican
...,...,...,...
25,26,Theodore Roosevelt,Republican
31,32,Franklin D. Roosevelt,Democratic
32,33,Harry S. Truman,Democratic
33,34,Dwight D. Eisenhower,Republican


In [13]:
# the same can be achieved by the following section of code
siena_2018[siena_2018.Average_rank < 10].iloc[:, :3]

Unnamed: 0,Seq,President,Party
1,1,George Washington,Independent
3,3,Thomas Jefferson,Democratic-Republican
4,4,James Madison,Democratic-Republican
5,5,James Monroe,Democratic-Republican
...,...,...,...
25,26,Theodore Roosevelt,Republican
31,32,Franklin D. Roosevelt,Democratic
32,33,Harry S. Truman,Democratic
33,34,Dwight D. Eisenhower,Republican
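
The benefit of a callable is clearest inside a chain, where the column being filtered on does not exist in the original frame. A minimal sketch, assuming a hypothetical three-row frame and a made-up `Total` column:

```python
import pandas as pd

# Hypothetical stand-in frame; the values are illustrative only.
df = pd.DataFrame(
    {
        "President": ["George Washington", "John Adams", "Thomas Jefferson"],
        "Domestic": [2, 19, 6],
        "Foreign": [2, 13, 9],
    }
)

result = (
    df.assign(Total=lambda d: d["Domestic"] + d["Foreign"])
    # `Total` exists only on the intermediate frame, so a mask built from
    # `df` would not work here; the lambda receives the frame returned by assign.
    .loc[lambda d: d["Total"] < 20, ["President", "Total"]]
)

print(result)
```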


#### <a id='toc2_2_4_'></a>[*Filtering with the `.query()` method*](#toc0_)

- Instead of using boolean arrays in combination with *.loc[]*, we can use the *.query()* method. Unlike with boolean arrays, we can use both the plain 'and', 'or', 'not' keywords and the operator forms such as &, |, ~ etc. We also don't need to worry as much about precedence and parentheses, because inside a query & and | have the same precedence as 'and' and 'or'.
- In the .query() method we use a string to express our conditions, similar to SQL. One of the powerful aspects of .query() is that we can `access external variables using the @ sign as prefix` from inside the string. So we don't need string formatting or concatenation to implement complex logic in our search.
- `To access a column of the dataframe, just use the name of the column`.
- `To match a string literal, pass it in as a string (within quote marks) as you would in any other situation.`

In [14]:
# to do the same filtering as we've done in the filtering with boolean arrays section
lt10 = siena_2018.Average_rank < 10
# siena_2018.query("Average_rank < 10 and Party == 'Republican'")
siena_2018.query('@lt10 and Party == "Republican"')

Unnamed: 0,Seq,President,Party,Background,Imagination,Integrity,Intelligence,...,Domestic_accomplishments,Foreign_policy_accomplishments,Avoid_crucial_mistakes,Experts’_view,Overall,Average_rank,Quartile_rank
16,16,Abraham Lincoln,Republican,28,1,2,2,...,1,6,2,1,3,3,1st
25,26,Theodore Roosevelt,Republican,5,4,8,6,...,4,3,5,4,4,4,1st
33,34,Dwight D. Eisenhower,Republican,11,18,5,17,...,8,7,3,6,6,6,1st
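
The `@` prefix also works with scalars and lists, not only with a precomputed mask. A minimal sketch on a hypothetical four-row frame; the names `max_rank` and `parties` are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in frame; the values are illustrative only.
df = pd.DataFrame(
    {
        "Party": ["Republican", "Democratic", "Republican", "Federalist"],
        "Average_rank": [3, 14, 33, 13],
    },
    index=[16, 42, 43, 2],
)

max_rank = 10
parties = ["Republican", "Federalist"]

# External scalar referenced with @, combined with a string literal.
by_scalar = df.query("Average_rank < @max_rank and Party == 'Republican'")

# External list referenced with @ in a membership test.
by_list = df.query("Party in @parties")

# Negation with the plain `not` keyword.
negated = df.query("not (Party == 'Republican')")

print(by_scalar)
print(by_list)
print(negated)
```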


Both `.query()` and boolean masks with `.loc[]` are effective methods for filtering data in pandas. The choice between them depends on factors such as readability, performance, and your personal coding preferences. Let's compare both approaches:

1. **`.query()` Method:**
   - **Advantages:**
     - Readability: `.query()` allows you to write filtering expressions in a more SQL-like syntax, which can be more intuitive for some users.
     - Avoidance of repetitive DataFrame name: You don't need to repeat the DataFrame name within the query expression.
   - **Considerations:**
     - Python variables need the `@` prefix: Names without `@` are looked up as column names, so an external variable has to be written as `@name` inside the query string.
     - Limited to what the expression string can express: Column labels that are not valid Python identifiers need backtick quoting, and more complex operations might be easier with boolean masks.
     - No column selection: The .query() method only filters rows. This is important to keep in mind, because selecting columns as well requires a separate step chained after the query.


2. **Boolean Masks with `.loc[]`:**
   - **Advantages:**
     - Flexibility: You can use Python variables and more complex conditions within boolean masks, providing more fine-grained control over filtering.
     - Compatibility with other operations: Boolean masks can easily be used with other DataFrame operations like grouping and aggregation.
   - **Considerations:**
     - Slightly more verbose: Boolean mask expressions can become longer when compared to concise `.query()` expressions.

For simple filtering scenarios, both methods work well. `.query()` is often favored when the filtering conditions are straightforward and you want a more human-readable syntax. When the conditions involve calculations, other DataFrame operations, or selecting columns in the same step, boolean masks with `.loc[]` are often the better choice due to their versatility.

Ultimately, it's a matter of preference and context. You can even mix and match both methods within your codebase, using the one that suits each situation best.
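
As a rough side-by-side of the two approaches, the sketch below filters a hypothetical three-row frame both ways and selects the same two columns. With `.query()` the column selection has to be chained as a separate step.

```python
import pandas as pd

# Hypothetical stand-in frame; the values are illustrative only.
df = pd.DataFrame(
    {
        "President": ["Abraham Lincoln", "Bill Clinton", "George W. Bush"],
        "Party": ["Republican", "Democratic", "Republican"],
        "Average_rank": [3, 14, 33],
    },
    index=[16, 40, 41],
)

cols = ["President", "Party"]

# Boolean mask: rows and columns are selected in a single .loc call.
mask_result = df.loc[
    (df["Average_rank"] < 10) & (df["Party"] == "Republican"), cols
]

# .query(): rows first, then a chained column selection.
query_result = df.query("Average_rank < 10 and Party == 'Republican'")[cols]

print(mask_result)
print(mask_result.equals(query_result))
```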