**Table of contents**<a id='toc0_'></a>    
- [Import Statements](#toc1_1_)    
- [Importing the data](#toc1_2_)    
- [Creating and Updating columns: The `.assign()` method](#toc2_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=2
	maxLevel=5
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

**Read the official documentation on pandas DataFrames at https://pandas.pydata.org/pandas-docs/stable/reference/frame.html**

**`Note:`** **Chaining functions/methods** in pandas works the same way as in plain Python: each method returns an object, and the next method is called on that returned object.

DataFrames are **column oriented**, unlike most common databases. **Each column** in a DataFrame is a **pandas Series object**, so any operation that can be performed on a Series can be applied to a column too.
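As a quick illustration of columns being Series, here is a minimal sketch on a made-up two-column frame (the names `toy`, `name`, and `score` are hypothetical and not part of the Siena dataset):

```python
import pandas as pd

# Hypothetical toy frame, used only for illustration
toy = pd.DataFrame({"name": ["ann", "bob"], "score": [3, 5]})

# Selecting a single column returns a pandas Series
score = toy["score"]
print(type(score))

# Any Series method is therefore available on a column
upper_names = toy["name"].str.upper()  # string accessor on an object column
mean_score = toy["score"].mean()       # numeric reduction on an integer column
print(upper_names.tolist(), mean_score)
```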

A DataFrame has **two axes**, commonly referred to as axis 0 and axis 1, or the **"index"** (or 'rows') axis and the **"columns"** axis respectively. Note that when an **operation** is applied **along axis 0**, it is applied **down through all the rows for each column**. Likewise, an operation **along axis 1** is applied **across the values in all the columns for each row**.
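The axis convention is easiest to see with a reduction such as `.sum()`. The sketch below uses a small made-up frame (`toy`, with hypothetical columns `a` and `b`), not the Siena data:

```python
import pandas as pd

# Hypothetical toy frame, used only for illustration
toy = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]}, index=["x", "y", "z"])

# axis=0: move down the rows, so we get one result per column
col_totals = toy.sum(axis=0)

# axis=1: move across the columns, so we get one result per row
row_totals = toy.sum(axis=1)

print(col_totals)
print(row_totals)
```

The result of `axis=0` is indexed by the column labels, while the result of `axis=1` is indexed by the row labels.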

### <a id='toc1_1_'></a>[Import Statements](#toc0_)

In [1]:
# import statements
import numpy as np
import pandas as pd

In [2]:
# view options
pd.set_option("display.max_columns", 14)
pd.set_option("display.max_rows", 8)

### <a id='toc1_2_'></a>[Importing the data](#toc0_)

- We will be exploring a dataset from a 2018 Siena College poll. The data ranks United States Presidents on various attributes. These attributes are:

In [3]:
siena_2018_cols = """
• Bg = Background
• Im = Imagination
• Int = Integrity
• IQ = Intelligence
• L = Luck
• WR = Willing to take risks
• AC = Ability to compromise
• EAb = Executive ability
• LA = Leadership ability
• CAb = Communication ability
• OA = Overall ability
• PL = Party leadership
• RC = Relations with Congress
• CAp = Court appointments
• HE = Handling of economy
• EAp = Executive appointments
• DA = Domestic accomplishments
• FPA = Foreign policy accomplishments
• AM = Avoid crucial mistakes
• EV = Experts’ view
• O = Overall
"""

In [4]:
# reading from github url

# it is good practice to set the index column when reading the data file;
# here index_col=0 uses the first column of the CSV as the index

url = "https://github.com/mattharrison/datasets/raw/master/data/siena2018-pres.csv"
siena_2018 = pd.read_csv(url, index_col=0)

In [5]:
siena_2018.head(3)

Unnamed: 0,Seq.,President,Party,Bg,Im,Int,IQ,...,HE,EAp,DA,FPA,AM,EV,O
1,1,George Washington,Independent,7,7,1,10,...,1,1,2,2,1,2,1
2,2,John Adams,Federalist,3,13,4,4,...,13,15,19,13,16,10,14
3,3,Thomas Jefferson,Democratic-Republican,2,2,14,1,...,20,4,6,9,7,5,5


In [6]:
# this will print all the column names, number of non null values in each column and the datatype of that column
siena_2018.info()

<class 'pandas.core.frame.DataFrame'>
Index: 44 entries, 1 to 44
Data columns (total 24 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Seq.       44 non-null     object
 1   President  44 non-null     object
 2   Party      44 non-null     object
 3   Bg         44 non-null     int64 
 4   Im         44 non-null     int64 
 5   Int        44 non-null     int64 
 6   IQ         44 non-null     int64 
 7   L          44 non-null     int64 
 8   WR         44 non-null     int64 
 9   AC         44 non-null     int64 
 10  EAb        44 non-null     int64 
 11  LA         44 non-null     int64 
 12  CAb        44 non-null     int64 
 13  OA         44 non-null     int64 
 14  PL         44 non-null     int64 
 15  RC         44 non-null     int64 
 16  CAp        44 non-null     int64 
 17  HE         44 non-null     int64 
 18  EAp        44 non-null     int64 
 19  DA         44 non-null     int64 
 20  FPA        44 non-null     int64 
 21  

In [7]:
# Datatype casting and renaming the columns
cols_list = [
    col.strip().split("=") for col in siena_2018_cols.strip().split(sep="•")[1:]
]

# we will replace the spaces in the full form with underscores (_)
siena_2018_cols_dict = {
    col_prev.strip(): col_full.strip().replace(" ", "_")
    for col_prev, col_full in cols_list
}
siena_2018 = siena_2018.rename(columns={"Seq.": "Seq"}).rename(
    columns=siena_2018_cols_dict
)
siena_2018 = siena_2018.astype({"Party": "category"})
siena_2018 = siena_2018.astype(
    {col_name: "uint8" for col_name in siena_2018.select_dtypes("int64").columns}
)

-------------------------------------------------

## <a id='toc2_'></a>[Creating and Updating columns: The `.assign()` method](#toc0_)

---------------------------------------------------

**Why use `.assign()`?** This method returns a new DataFrame and doesn't mutate the existing one. This is very useful for chaining operations: each step returns an updated DataFrame, and the subsequent methods operate on that updated DataFrame.

<u>**\*\*kwargs:** `argument (column name) = argument value (callable or Series), ......` </u>
- if the column already exists, its values are replaced in the returned DataFrame
- if the column doesn't exist, a new column is created
- if the argument value is a Series or a scalar, those values are simply assigned to the column
- a callable (a function or *lambda*) is called with the DataFrame and must return a scalar or Series. It can be a normal function, but often we use a lambda to keep the logic inline. Using a callable has a less obvious benefit: if any manipulation or filtering was done on the DataFrame earlier in the chain, before `.assign()`, those changes are already reflected, because *the function receives the current state of the DataFrame.*
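The rules above can be sketched on a small made-up frame (`toy` and its column names are hypothetical). The callables receive the frame as it exists at that point in the chain, which here is the filtered frame, and the original frame is left untouched:

```python
import pandas as pd

# Hypothetical toy frame, used only for illustration
toy = pd.DataFrame({"x": [1, 2, 3, 4]})

result = (
    toy
    .query("x > 2")  # keep only the rows where x is 3 or 4
    .assign(
        # callable: df_ is the filtered frame, not the original `toy`
        double=lambda df_: df_["x"] * 2,
        # scalar: broadcast to every row
        flag=True,
        # a later keyword can use a column created earlier in the same call
        triple=lambda df_: df_["double"] + df_["x"],
    )
)

print(result)
print(toy.columns.tolist())  # the original frame still has only "x"
```

Writing `double=toy["x"] * 2` instead of the lambda would compute the values from the unfiltered `toy`, which is why the callable form is the safer choice inside a chain.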

**`lambda` function Refresher:** A lambda function can take any number of arguments, but can only have one expression. 

*Syntax --* `lambda arguments : expression`. The expression is evaluated and its result is returned.
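A minimal sketch of the syntax in plain Python (all names below are made up for illustration):

```python
# A lambda with one argument and a lambda with two arguments
square = lambda x: x * x
add = lambda a, b: a + b

# The equivalent named function for `add`
def add_named(a, b):
    return a + b

print(square(4), add(2, 3), add_named(2, 3))

# Lambdas are most useful inline, e.g. as a sort key
words = ["pandas", "np", "assign"]
by_length = sorted(words, key=lambda w: len(w))
print(by_length)
```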

In [8]:
# First, we add a column named "Average_rank" that ranks the presidents by their total
# (the sum of the numeric values across the columns of each row),
# using the dense method (tied rows share the lowest rank in the group, and the rank always increases by 1 between groups)
# this is essentially the "Overall" column but using a different ranking method

# Next, we add another column named "Quartile_rank" that has 4 bins (1st, 2nd, 3rd, 4th)
# this is where we see the power of using a function:
# the lambda receives the current state of the dataframe, in which the Average_rank column already exists

siena_2018 = siena_2018.assign(
    Average_rank=siena_2018.loc[:, "Background":"Experts’_view"]
    .sum(axis=1)
    .rank(method="dense")
    .astype("uint8"),
    Quartile_rank=lambda df_: pd.qcut(
        df_.Average_rank, 4, labels=["1st", "2nd", "3rd", "4th"]
    ),
)
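To see what `rank(method="dense")` and `pd.qcut` do in the cell above, here is a small sketch on made-up totals (the `totals` values are hypothetical and not taken from the poll):

```python
import pandas as pd

# Hypothetical row totals; note the tie between the two 20s
totals = pd.Series([50, 20, 20, 90, 70, 30, 60, 10])

# Dense ranking: ties share a rank and the next rank is not skipped
dense = totals.rank(method="dense").astype("uint8")
print(dense.tolist())

# qcut splits the ranks into 4 quantile-based bins and labels them
quartile = pd.qcut(dense, 4, labels=["1st", "2nd", "3rd", "4th"])
print(quartile.tolist())
```

Because `qcut` places its bin edges at the sample quantiles, the bins hold roughly equal numbers of rows; ties can make the counts uneven, as they do here.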

In [9]:
siena_2018

Unnamed: 0,Seq,President,Party,Background,Imagination,Integrity,Intelligence,...,Domestic_accomplishments,Foreign_policy_accomplishments,Avoid_crucial_mistakes,Experts’_view,Overall,Average_rank,Quartile_rank
1,1,George Washington,Independent,7,7,1,10,...,2,2,1,2,1,1,1st
2,2,John Adams,Federalist,3,13,4,4,...,19,13,16,10,14,13,2nd
3,3,Thomas Jefferson,Democratic-Republican,2,2,14,1,...,6,9,7,5,5,5,1st
4,4,James Madison,Democratic-Republican,4,6,7,3,...,11,19,11,8,7,7,1st
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41,42,Bill Clinton,Democratic,21,12,39,8,...,9,18,30,14,15,14,2nd
42,43,George W. Bush,Republican,17,29,33,41,...,30,38,36,34,33,33,3rd
43,44,Barack Obama,Democratic,24,11,13,9,...,13,20,10,11,17,17,2nd
44,45,Donald Trump,Republican,43,40,44,44,...,40,42,41,42,42,42,4th
