**Table of contents**<a id='toc0_'></a>    
- [Import Statements](#toc1_1_)    
- [Importing the data](#toc1_2_)    
- [Sorting Columns and Indexes](#toc2_)    
    - [Setting indexes: The `.set_index()` method](#toc2_1_1_)    
    - [Sorting indexes: The `.sort_index()` method](#toc2_1_2_)    
    - [Sorting values: The `.sort_values()` method](#toc2_1_3_)    


**Read the official documentation on pandas DataFrames @ https://pandas.pydata.org/pandas-docs/stable/reference/frame.html**

**`Note:`** **Chaining functions/methods** in pandas works the same way as in plain Python: each method returns an object, so the next call can be made directly on that result.
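As a minimal sketch of method chaining, the snippet below strings three DataFrame methods together. The frame `df` and its column names are hypothetical toy data, not the Siena poll.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate chaining
df = pd.DataFrame({"name": ["b", "a", "c"], "score": [2, 3, 1]})

# Each method returns a new DataFrame, so the next call applies to that result
chained = (
    df.rename(columns={"score": "rank"})
    .sort_values(by="rank")
    .reset_index(drop=True)
)
print(chained)
```

Wrapping the chain in parentheses lets each step sit on its own line, which keeps longer chains readable.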

DataFrames are **column-oriented**, unlike most common databases. **Each column** in a DataFrame is a **pandas Series object**, so any operation that can be performed on a Series can be applied to a column too.

A DataFrame has **two axes**, commonly referred to as axis 0 and axis 1, or the **"index"** (rows) axis and the **"columns"** axis respectively. When an **operation** such as `.sum()` is applied **along axis 0**, it runs **down the rows, giving one result per column**. Likewise, an operation **along axis 1** runs **across the columns, giving one result per row**.
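The axis convention is easy to check on a small frame. The sketch below uses a hypothetical two-column DataFrame `toy` and sums it along each axis.

```python
import pandas as pd

# Hypothetical toy frame: 3 rows, 2 columns
toy = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=0: walk down the rows, one result per column
col_totals = toy.sum(axis=0)

# axis=1: walk across the columns, one result per row
row_totals = toy.sum(axis=1)

# A single column is a pandas Series
col_a = toy["a"]

print(col_totals)
print(row_totals)
print(type(col_a))
```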

### <a id='toc1_1_'></a>[Import Statements](#toc0_)

In [9]:
# import statements
import numpy as np
import pandas as pd

In [10]:
# view options
pd.set_option("display.max_columns", 14)
pd.set_option("display.max_rows", 8)

### <a id='toc1_2_'></a>[Importing the data](#toc0_)

- We will be exploring a dataset from the 2018 Siena College poll, which ranks United States presidents on various attributes. These attributes are:

In [11]:
siena_2018_cols = """
• Bg = Background
• Im = Imagination
• Int = Integrity
• IQ = Intelligence
• L = Luck
• WR = Willing to take risks
• AC = Ability to compromise
• EAb = Executive ability
• LA = Leadership ability
• CAb = Communication ability
• OA = Overall ability
• PL = Party leadership
• RC = Relations with Congress
• CAp = Court appointments
• HE = Handling of economy
• EAp = Executive appointments
• DA = Domestic accomplishments
• FPA = Foreign policy accomplishments
• AM = Avoid crucial mistakes
• EV = Experts’ view
• O = Overall
"""

In [12]:
# reading from a GitHub URL

# It is good practice to specify the index column when reading the data file;
# here index_col=0 makes the first CSV column the row index.

url = "https://github.com/mattharrison/datasets/raw/master/data/siena2018-pres.csv"
siena_2018 = pd.read_csv(url, index_col=0)

In [13]:
siena_2018.head(3)

Unnamed: 0,Seq.,President,Party,Bg,Im,Int,IQ,...,HE,EAp,DA,FPA,AM,EV,O
1,1,George Washington,Independent,7,7,1,10,...,1,1,2,2,1,2,1
2,2,John Adams,Federalist,3,13,4,4,...,13,15,19,13,16,10,14
3,3,Thomas Jefferson,Democratic-Republican,2,2,14,1,...,20,4,6,9,7,5,5


In [14]:
# this will print all the column names, number of non null values in each column and the datatype of that column
siena_2018.info()

<class 'pandas.core.frame.DataFrame'>
Index: 44 entries, 1 to 44
Data columns (total 24 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Seq.       44 non-null     object
 1   President  44 non-null     object
 2   Party      44 non-null     object
 3   Bg         44 non-null     int64 
 4   Im         44 non-null     int64 
 5   Int        44 non-null     int64 
 6   IQ         44 non-null     int64 
 7   L          44 non-null     int64 
 8   WR         44 non-null     int64 
 9   AC         44 non-null     int64 
 10  EAb        44 non-null     int64 
 11  LA         44 non-null     int64 
 12  CAb        44 non-null     int64 
 13  OA         44 non-null     int64 
 14  PL         44 non-null     int64 
 15  RC         44 non-null     int64 
 16  CAp        44 non-null     int64 
 17  HE         44 non-null     int64 
 18  EAp        44 non-null     int64 
 19  DA         44 non-null     int64 
 20  FPA        44 non-null     int64 
 21  

In [15]:
# Datatype casting and renaming the columns
cols_list = [
    col.strip().split("=") for col in siena_2018_cols.strip().split(sep="•")[1:]
]

# we will replace the spaces in the full form with underscores (_)
siena_2018_cols_dict = {
    col_prev.strip(): col_full.strip().replace(" ", "_")
    for col_prev, col_full in cols_list
}
siena_2018 = siena_2018.rename(columns={"Seq.": "Seq"}).rename(
    columns=siena_2018_cols_dict
)
siena_2018 = siena_2018.astype({"Party": "category"})
siena_2018 = siena_2018.astype(
    {col_name: "uint8" for col_name in siena_2018.select_dtypes("int64").columns}
)
siena_2018 = siena_2018.assign(
    Average_rank=siena_2018.loc[:, "Background":"Experts’_view"]
    .sum(axis=1)
    .rank(method="dense")
    .astype("uint8"),
    Quartile_rank=lambda df_: pd.qcut(
        df_.Average_rank, 4, labels=["1st", "2nd", "3rd", "4th"]
    ),
)

--------------------------------------------

## <a id='toc2_'></a>[Sorting Columns and Indexes](#toc0_)

---------------------------------------------

#### <a id='toc2_1_1_'></a>[Setting indexes: The `.set_index()` method](#toc0_)

Returns a new DataFrame that uses the given column(s) as its index.

<u> Parameters -- </u>

- **keys**: column(s) to be set as index.
- **drop = True**: whether to remove the column(s) used for the index from the data columns.
- **verify_integrity = False**: set `verify_integrity=True` to check the new index for duplicate values; an error is raised if any are found.
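The parameters above can be sketched on a small frame. The three rows below reuse values shown in the `head(3)` output; the `dup` frame and its column names are hypothetical.

```python
import pandas as pd

# Three rows taken from the head(3) output above
toy = pd.DataFrame(
    {
        "President": ["George Washington", "John Adams", "Thomas Jefferson"],
        "Party": ["Independent", "Federalist", "Democratic-Republican"],
        "O": [1, 14, 5],
    }
)

# Default drop=True: "President" moves out of the columns and into the index
by_name = toy.set_index("President")

# drop=False keeps "President" as a regular column as well
kept = toy.set_index("President", drop=False)

# verify_integrity=True raises a ValueError when the new index has duplicates
dup = pd.DataFrame({"key": ["x", "x"], "val": [1, 2]})  # hypothetical data
try:
    dup.set_index("key", verify_integrity=True)
    duplicate_detected = False
except ValueError:
    duplicate_detected = True

print(by_name)
print(duplicate_detected)
```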

#### <a id='toc2_1_2_'></a>[Sorting indexes: The `.sort_index()` method](#toc0_)

<u> Parameters -- </u>

- **axis = 0**: sort the row index (axis=0) or the column labels (axis=1); a sorted copy of the DataFrame is returned.
- **ascending = True**: sort in ascending order; pass False for descending order.
- **key = None**: a function applied to the index values before sorting. It accepts an index and should return an index of the same shape. For multi-level indexes, each level is passed to the function independently.

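This operation is usually done after setting a new index. If the new index is of **string type**, then **sorting it lets us slice reliably** on the index with `.loc`: on a sorted index, the slice bounds do not have to be labels that actually exist. On an unsorted index, a slice bound that is missing from the index raises a `KeyError`.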
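The slicing behaviour described above can be sketched as follows. The labels and values in `toy` are hypothetical and chosen so that the unsorted index is not monotonic.

```python
import pandas as pd

# Hypothetical frame with an unsorted string index
toy = pd.DataFrame(
    {"value": [4, 1, 3, 2]},
    index=["delta", "alpha", "charlie", "bravo"],
)

sorted_toy = toy.sort_index()

# On the sorted index, the bounds "b" and "d" need not be existing labels
window = sorted_toy.loc["b":"d"]

# On the unsorted index, the same slice fails because "b" is not a label
try:
    toy.loc["b":"d"]
    unsorted_slice_failed = False
except KeyError:
    unsorted_slice_failed = True

# ascending=False reverses the order
descending = toy.sort_index(ascending=False)

print(window)
print(unsorted_slice_failed)
```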

#### <a id='toc2_1_3_'></a>[Sorting values: The `.sort_values()` method](#toc0_)

<u> Parameters -- </u>

- **by**: column name or a list of names to sort by.
- **ascending = True**: bool or list of bool, default True.
- **key = None**: Apply the key function to the values before sorting. It will be applied to each column in `by` independently. A key function accepts a series and should return a series with the same shape as the input. 
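As a small sketch of these parameters, the snippet below sorts a hypothetical frame `toy` first by two columns with mixed directions, then by one column with a `key` function that makes the sort case-insensitive.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame(
    {
        "group": ["b", "a", "b", "a"],
        "label": ["x", "Y", "z", "W"],
        "value": [1, 2, 3, 4],
    }
)

# One ascending flag per column in `by`: group ascending, value descending
multi = toy.sort_values(by=["group", "value"], ascending=[True, False])

# Without a key, uppercase letters sort before lowercase ones
plain = toy.sort_values(by="label")

# The key receives the "label" Series and returns a Series of the same shape
case_insensitive = toy.sort_values(by="label", key=lambda s: s.str.lower())

print(multi)
print(case_insensitive)
```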

In [16]:
siena_2018.sort_values(by=["Quartile_rank", "Intelligence"], ascending=[True, False])

Unnamed: 0,Seq,President,Party,Background,Imagination,Integrity,Intelligence,...,Domestic_accomplishments,Foreign_policy_accomplishments,Avoid_crucial_mistakes,Experts’_view,Overall,Average_rank,Quartile_rank
11,11,James K. Polk,Democratic,19,10,23,23,...,12,8,8,13,12,11,1st
32,33,Harry S. Truman,Democratic,31,16,9,21,...,7,4,9,7,9,9,1st
5,5,James Monroe,Democratic-Republican,9,14,11,18,...,10,5,6,9,8,8,1st
33,34,Dwight D. Eisenhower,Republican,11,18,5,17,...,8,7,3,6,6,6,1st
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21,21,Chester A. Arthur,Republican,41,31,37,36,...,25,32,23,31,34,34,4th
23,23,Benjamin Harrison,Republican,33,34,30,35,...,32,29,29,33,35,36,4th
10,10,John Tyler,Independent,34,33,35,34,...,36,26,32,36,37,37,4th
30,31,Herbert Hoover,Republican,13,35,15,13,...,39,33,40,35,36,35,4th


In [17]:
siena_2018.President.str.split()[1]

['George', 'Washington']

**`?`** Suppose we want to sort by the last names of the presidents. We can pass the `key` parameter a function that extracts the last name from the full name.

Here the `.apply()` method is a reasonable choice because we are working with strings: string columns hold Python objects, so the work is done element by element in Python either way.

In [18]:
siena_2018.sort_values(
    by=["President"],
    key=lambda byCol_: byCol_.str.split().apply(lambda val_lst: val_lst[-1]),
)

Unnamed: 0,Seq,President,Party,Background,Imagination,Integrity,Intelligence,...,Domestic_accomplishments,Foreign_policy_accomplishments,Avoid_crucial_mistakes,Experts’_view,Overall,Average_rank,Quartile_rank
2,2,John Adams,Federalist,3,13,4,4,...,19,13,16,10,14,13,2nd
6,6,John Quincy Adams,Democratic-Republican,1,9,6,5,...,21,15,14,18,18,18,2nd
21,21,Chester A. Arthur,Republican,41,31,37,36,...,25,32,23,31,34,34,4th
15,15,James Buchanan,Democratic,36,43,40,39,...,44,43,44,44,43,43,4th
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
44,45,Donald Trump,Republican,43,40,44,44,...,40,42,41,42,42,42,4th
10,10,John Tyler,Independent,34,33,35,34,...,36,26,32,36,37,37,4th
1,1,George Washington,Independent,7,7,1,10,...,2,2,1,2,1,1,1st
27,28,Woodrow Wilson,Democratic,8,8,19,7,...,14,11,25,15,11,12,2nd
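As an alternative sketch, the last element of each split list can also be taken with the `.str` accessor instead of `.apply()`. The snippet uses three names from the data above in a standalone frame `names` (a hypothetical variable) and checks that both key functions give the same order.

```python
import pandas as pd

# Three names from the dataset, placed in a small standalone frame
names = pd.DataFrame(
    {"President": ["George Washington", "John Adams", "Thomas Jefferson"]}
)

# Key function from the cell above: split, then take the last item with apply
with_apply = names.sort_values(
    by="President",
    key=lambda col: col.str.split().apply(lambda parts: parts[-1]),
)

# Same idea using .str[-1] to index into each list
with_str = names.sort_values(
    by="President",
    key=lambda col: col.str.split().str[-1],
)

print(with_str)
```

Both versions sort on the last whitespace-separated token, so suffixes or multi-word surnames would need extra handling.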
