**Table of contents**<a id='toc0_'></a>    
- [Import Statements](#toc1_1_)    
- [Importing the data](#toc1_2_)    
- [Casting Datatypes and Renaming the columns](#toc2_)    
  - [*Renaming the columns with proper full form*](#toc2_1_)    
    - [The `.rename()` method](#toc2_1_1_)    
  - [*Casting DataTypes*](#toc2_2_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=2
	maxLevel=5
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

**Read the official documentation on pandas DataFrames @ https://pandas.pydata.org/pandas-docs/stable/reference/frame.html**

**`Note:`** Method chaining in pandas works the same way as in plain Python: most methods return a new object, so the next method can be called directly on the result.

DataFrames are **column-oriented**, unlike most common databases. **Each column** in a DataFrame is a **pandas `Series` object**, so any operation that can be performed on a Series can also be applied to a column.

A DataFrame has **two axes**: axis 0, the **"index"** (or rows) axis, and axis 1, the **"columns"** axis. When an **operation** is applied **along axis 0**, it runs **down the rows of each column**, so a reduction such as `.sum()` gives one result per column. Likewise, an operation **along axis 1** runs **across the columns of each row**, giving one result per row.
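The difference between the two axes is easiest to see on a small example. The sketch below uses a made-up two-column frame (the names `x` and `y` are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical toy frame
df = pd.DataFrame({"x": [1, 2, 3], "y": [10, 20, 30]})

# axis=0: collapse the rows, one result per column
col_sums = df.sum(axis=0)
print(col_sums.to_dict())   # {'x': 6, 'y': 60}

# axis=1: collapse the columns, one result per row
row_sums = df.sum(axis=1)
print(row_sums.tolist())    # [11, 22, 33]
```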

### <a id='toc1_1_'></a>[Import Statements](#toc0_)

In [1]:
# import statements
import numpy as np
import pandas as pd

In [2]:
# view options
pd.set_option("display.max_columns", 14)
pd.set_option("display.max_rows", 8)

### <a id='toc1_2_'></a>[Importing the data](#toc0_)

- We will be exploring a dataset from a 2018 Siena College poll. It ranks United States presidents on various attributes. These attributes are:

In [3]:
siena_2018_cols = """
• Bg = Background
• Im = Imagination
• Int = Integrity
• IQ = Intelligence
• L = Luck
• WR = Willing to take risks
• AC = Ability to compromise
• EAb = Executive ability
• LA = Leadership ability
• CAb = Communication ability
• OA = Overall ability
• PL = Party leadership
• RC = Relations with Congress
• CAp = Court appointments
• HE = Handling of economy
• EAp = Executive appointments
• DA = Domestic accomplishments
• FPA = Foreign policy accomplishments
• AM = Avoid crucial mistakes
• EV = Experts’ view
• O = Overall
"""

In [4]:
# reading from github url

# it is a good practice to define your index column when reading the data file.
# it is generally frowned upon if you don't have an index column

url = "https://github.com/mattharrison/datasets/raw/master/data/siena2018-pres.csv"
siena_2018 = pd.read_csv(url, index_col=0)
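To illustrate what `index_col=0` changes, here is a small sketch that reads an in-memory CSV instead of the GitHub URL. The two-row `csv_text` sample is an assumption made up for this example; only its shape (an unnamed first column) mirrors the situation described above.

```python
import io

import pandas as pd

# Hypothetical sample: the first header cell is empty, as happens when a
# DataFrame is saved with its index
csv_text = ",President,Bg\n1,George Washington,7\n2,John Adams,3\n"

# Without index_col, the unnamed first column becomes a regular column
df_default = pd.read_csv(io.StringIO(csv_text))
print(df_default.columns.tolist())   # ['Unnamed: 0', 'President', 'Bg']

# With index_col=0, that column becomes the row index instead
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(df_indexed.columns.tolist())   # ['President', 'Bg']
print(df_indexed.index.tolist())     # [1, 2]
```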

In [5]:
siena_2018.head(3)

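| | Seq. | President | Party | Bg | Im | Int | IQ | ... | HE | EAp | DA | FPA | AM | EV | O |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | George Washington | Independent | 7 | 7 | 1 | 10 | ... | 1 | 1 | 2 | 2 | 1 | 2 | 1 |
| 2 | 2 | John Adams | Federalist | 3 | 13 | 4 | 4 | ... | 13 | 15 | 19 | 13 | 16 | 10 | 14 |
| 3 | 3 | Thomas Jefferson | Democratic-Republican | 2 | 2 | 14 | 1 | ... | 20 | 4 | 6 | 9 | 7 | 5 | 5 |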


In [6]:
# this will print all the column names, number of non null values in each column and the datatype of that column
siena_2018.info()

<class 'pandas.core.frame.DataFrame'>
Index: 44 entries, 1 to 44
Data columns (total 24 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Seq.       44 non-null     object
 1   President  44 non-null     object
 2   Party      44 non-null     object
 3   Bg         44 non-null     int64 
 4   Im         44 non-null     int64 
 5   Int        44 non-null     int64 
 6   IQ         44 non-null     int64 
 7   L          44 non-null     int64 
 8   WR         44 non-null     int64 
 9   AC         44 non-null     int64 
 10  EAb        44 non-null     int64 
 11  LA         44 non-null     int64 
 12  CAb        44 non-null     int64 
 13  OA         44 non-null     int64 
 14  PL         44 non-null     int64 
 15  RC         44 non-null     int64 
 16  CAp        44 non-null     int64 
 17  HE         44 non-null     int64 
 18  EAp        44 non-null     int64 
 19  DA         44 non-null     int64 
 20  FPA        44 non-null     int64 
 21  

---------------------

## <a id='toc2_'></a>[Casting Datatypes and Renaming the columns](#toc0_)

-----------------------

**Note: Casting datatypes and renaming columns should be the first step whenever we load a dataset. We should also write these steps as functions so that the code can be reused in other notebooks.**
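One possible way to package these steps as a reusable function is sketched below. The helper name `tidy_columns`, its parameters, and the toy `raw` frame are all hypothetical; this is not the notebook's own code, only an illustration of the idea.

```python
import pandas as pd


def tidy_columns(df, rename_map, category_cols):
    """Rename columns, then cast the listed columns to 'category'.

    Hypothetical helper: returns a new DataFrame and leaves `df` untouched.
    """
    return df.rename(columns=rename_map).astype(
        {col: "category" for col in category_cols}
    )


# Hypothetical toy data standing in for a freshly loaded dataset
raw = pd.DataFrame({"Pty": ["Whig", "Whig", "Federalist"], "Bg": [1, 2, 3]})

# .pipe() passes the DataFrame as the first argument of the function
clean = raw.pipe(tidy_columns, rename_map={"Pty": "Party"}, category_cols=["Party"])
print(clean.dtypes)
```

Because the function takes and returns a DataFrame, it can sit inside a longer method chain through `.pipe()`.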

### <a id='toc2_1_'></a>[*Renaming the columns with proper full form*](#toc0_)

- Getting the full form of each column from the "siena_2018_cols" string

In [7]:
# Build a Python dictionary from the multiline string "siena_2018_cols" above, whose
# entries are formatted as "short form = long form". The dictionary is used to rename
# the columns of the dataframe "siena_2018".

# First, create a list of the form [[short, full], ...]
cols_list = [
    col.strip().split("=") for col in siena_2018_cols.strip().split(sep="•")[1:]
]

# we will replace the spaces in the full form with underscores (_)
siena_2018_cols_dict = {
    col_prev.strip(): col_full.strip().replace(" ", "_")
    for col_prev, col_full in cols_list
}

**Note:** In the dictionary comprehension above, the `for` loop iterates over the outer list, and on each iteration the inner `[short, full]` pair is unpacked into `col_prev` and `col_full`.
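The unpacking pattern can be seen in isolation with a tiny nested list. The `pairs` list below is a hand-written stand-in for `cols_list`, assumed only for illustration:

```python
# Hypothetical stand-in for cols_list: an outer list of two-item inner lists
pairs = [["Bg ", " Background"], ["Im ", " Imagination"]]

# The loop walks the outer list; each inner two-item list is unpacked
# into `short` and `full` on every iteration.
mapping = {short.strip(): full.strip() for short, full in pairs}
print(mapping)  # {'Bg': 'Background', 'Im': 'Imagination'}
```

If an inner list held more or fewer than two items, the unpacking would raise a `ValueError`.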

#### <a id='toc2_1_1_'></a>[The `.rename()` method](#toc0_)

In [8]:
# Reassign the result instead of using inplace=True, which is frowned upon
siena_2018 = siena_2018.rename(columns={"Seq.": "Seq"}).rename(
    columns=siena_2018_cols_dict
)

In [9]:
siena_2018

| | Seq | President | Party | Background | Imagination | Integrity | Intelligence | ... | Handling_of_economy | Executive_appointments | Domestic_accomplishments | Foreign_policy_accomplishments | Avoid_crucial_mistakes | Experts’_view | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | George Washington | Independent | 7 | 7 | 1 | 10 | ... | 1 | 1 | 2 | 2 | 1 | 2 | 1 |
| 2 | 2 | John Adams | Federalist | 3 | 13 | 4 | 4 | ... | 13 | 15 | 19 | 13 | 16 | 10 | 14 |
| 3 | 3 | Thomas Jefferson | Democratic-Republican | 2 | 2 | 14 | 1 | ... | 20 | 4 | 6 | 9 | 7 | 5 | 5 |
| 4 | 4 | James Madison | Democratic-Republican | 4 | 6 | 7 | 3 | ... | 14 | 7 | 11 | 19 | 11 | 8 | 7 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 41 | 42 | Bill Clinton | Democratic | 21 | 12 | 39 | 8 | ... | 5 | 12 | 9 | 18 | 30 | 14 | 15 |
| 42 | 43 | George W. Bush | Republican | 17 | 29 | 33 | 41 | ... | 36 | 29 | 30 | 38 | 36 | 34 | 33 |
| 43 | 44 | Barack Obama | Democratic | 24 | 11 | 13 | 9 | ... | 10 | 13 | 13 | 20 | 10 | 11 | 17 |
| 44 | 45 | Donald Trump | Republican | 43 | 40 | 44 | 44 | ... | 39 | 44 | 40 | 42 | 41 | 42 | 42 |


### <a id='toc2_2_'></a>[*Casting DataTypes*](#toc0_)

The first thing we should do when we load a dataset is check the datatype of each column and convert it to a more suitable datatype where one exists. This saves memory and can speed up many operations.

In [10]:
siena_2018.dtypes.to_dict()  # we could've also used the .info() method

{'Seq': dtype('O'),
 'President': dtype('O'),
 'Party': dtype('O'),
 'Background': dtype('int64'),
 'Imagination': dtype('int64'),
 'Integrity': dtype('int64'),
 'Intelligence': dtype('int64'),
 'Luck': dtype('int64'),
 'Willing_to_take_risks': dtype('int64'),
 'Ability_to_compromise': dtype('int64'),
 'Executive_ability': dtype('int64'),
 'Leadership_ability': dtype('int64'),
 'Communication_ability': dtype('int64'),
 'Overall_ability': dtype('int64'),
 'Party_leadership': dtype('int64'),
 'Relations_with_Congress': dtype('int64'),
 'Court_appointments': dtype('int64'),
 'Handling_of_economy': dtype('int64'),
 'Executive_appointments': dtype('int64'),
 'Domestic_accomplishments': dtype('int64'),
 'Foreign_policy_accomplishments': dtype('int64'),
 'Avoid_crucial_mistakes': dtype('int64'),
 'Experts’_view': dtype('int64'),
 'Overall': dtype('int64')}

> **First, let's explore the columns with "Object" datatype**

- The "Seq" column (the sequence number of each presidency)

In [11]:
siena_2018.Seq.values

array(['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12',
       '13', '14', '15', '16', '17', '18', '19', '20', '21', '22/24',
       '23', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34',
       '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45'],
      dtype=object)

Upon inspection, we can see a value of '22/24' (Grover Cleveland served two non-consecutive terms), so this column cannot be cast directly to an integer type. It can either remain as "object" type or be converted to "string" type.

- The "President" column lists the name of the president. So, this can either be converted to "string" type or can remain as is. 

- The "Party" column gives the name of the party with which the president was elected.

In [12]:
siena_2018.Party.value_counts()

Party
Republican               19
Democratic               15
Democratic-Republican     4
Whig                      3
Independent               2
Federalist                1
Name: count, dtype: int64

This column has only 6 unique values, so it can be converted to the "category" type.

In [13]:
siena_2018 = siena_2018.astype({"Party": "category"})
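As a rough sketch of what the "category" type stores, the example below converts a small hand-made Series (hypothetical values, not the full "Party" column). Each distinct label is kept once, and the rows hold small integer codes pointing at those labels:

```python
import pandas as pd

# Hypothetical toy column with repeated labels
party = pd.Series(["Whig", "Federalist", "Whig", "Whig"])
party_cat = party.astype("category")

# The distinct labels are stored once, sorted
print(party_cat.cat.categories.tolist())  # ['Federalist', 'Whig']

# Each row stores only the integer code of its label
print(party_cat.cat.codes.tolist())       # [1, 0, 1, 1]
```

The memory benefit depends on how often labels repeat; on a very short column the bookkeeping overhead can outweigh the savings.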

> **Now, let's explore the columns with "int64" as datatype**

**Note:** One of the interesting and important pandas methods is the `.select_dtypes()` method. This will select all the columns with the specified datatype and return those columns as a new DataFrame.

In [14]:
siena_2018.select_dtypes("int64")

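| | Background | Imagination | Integrity | Intelligence | Luck | Willing_to_take_risks | Ability_to_compromise | ... | Handling_of_economy | Executive_appointments | Domestic_accomplishments | Foreign_policy_accomplishments | Avoid_crucial_mistakes | Experts’_view | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 7 | 7 | 1 | 10 | 1 | 6 | 2 | ... | 1 | 1 | 2 | 2 | 1 | 2 | 1 |
| 2 | 3 | 13 | 4 | 4 | 24 | 14 | 31 | ... | 13 | 15 | 19 | 13 | 16 | 10 | 14 |
| 3 | 2 | 2 | 14 | 1 | 8 | 5 | 14 | ... | 20 | 4 | 6 | 9 | 7 | 5 | 5 |
| 4 | 4 | 6 | 7 | 3 | 16 | 15 | 6 | ... | 14 | 7 | 11 | 19 | 11 | 8 | 7 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 41 | 21 | 12 | 39 | 8 | 11 | 17 | 3 | ... | 5 | 12 | 9 | 18 | 30 | 14 | 15 |
| 42 | 17 | 29 | 33 | 41 | 21 | 20 | 28 | ... | 36 | 29 | 30 | 38 | 36 | 34 | 33 |
| 43 | 24 | 11 | 13 | 9 | 15 | 23 | 16 | ... | 10 | 13 | 13 | 20 | 10 | 11 | 17 |
| 44 | 43 | 40 | 44 | 44 | 10 | 25 | 42 | ... | 39 | 44 | 40 | 42 | 41 | 42 | 42 |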
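Besides a single dtype name, `.select_dtypes()` also accepts `include` and `exclude` arguments, and the generic `"number"` selector matches every numeric dtype. A small sketch on a made-up frame (column names are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy frame with one text, one integer and one float column
df = pd.DataFrame({"name": ["a", "b"], "rank": [1, 2], "score": [0.5, 1.5]})

# Keep every numeric column, whatever its exact width
numeric = df.select_dtypes(include="number")
print(numeric.columns.tolist())      # ['rank', 'score']

# Keep everything that is not numeric
non_numeric = df.select_dtypes(exclude="number")
print(non_numeric.columns.tolist())  # ['name']
```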


- Let's see the max and min values of the numeric columns

In [15]:
_ = siena_2018.select_dtypes("int64").agg(["max", "min"])

In [16]:
_

| | Background | Imagination | Integrity | Intelligence | Luck | Willing_to_take_risks | Ability_to_compromise | ... | Handling_of_economy | Executive_appointments | Domestic_accomplishments | Foreign_policy_accomplishments | Avoid_crucial_mistakes | Experts’_view | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| max | 43 | 43 | 44 | 44 | 44 | 41 | 43 | ... | 44 | 44 | 44 | 44 | 44 | 44 | 44 |
| min | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 |


In [17]:
_.loc["max"].max()

44

In [18]:
_.loc["min"].min()

1

As we can see, no column has a value greater than 44 or less than 1. Since "uint8" can hold integers from 0 to 255, these columns can safely be converted to "uint8" without changing any value.

In [19]:
siena_2018 = siena_2018.astype(
    {col_name: "uint8" for col_name in siena_2018.select_dtypes("int64").columns}
)
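The range check can also be done programmatically before casting. The sketch below uses `np.iinfo` to look up the limits of `uint8` and a made-up `ranks` frame as a stand-in for the integer columns; the variable names are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical toy stand-in for the integer rank columns
ranks = pd.DataFrame({"a": [1, 44], "b": [7, 30]})

# Smallest and largest values a uint8 can represent (0 and 255)
limits = np.iinfo(np.uint8)

# True only if every value in the frame lies within the uint8 range
fits = bool(
    ranks.min().min() >= limits.min and ranks.max().max() <= limits.max
)

# Cast only when nothing would overflow or wrap around
if fits:
    ranks = ranks.astype("uint8")
print(ranks.dtypes)
```

Checking first matters because casting out-of-range values to a narrower integer type can silently change them.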

After casting to more appropriate types, the memory footprint of the dataframe shrinks considerably: each integer value now takes 1 byte instead of 8.
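One way to measure the effect is `DataFrame.memory_usage()`. The sketch below uses a synthetic 44-row column rather than the Siena data, so the numbers only illustrate the per-value saving:

```python
import numpy as np
import pandas as pd

# Synthetic column of 44 ranks stored as 64-bit integers
wide = pd.DataFrame({"rank": np.arange(1, 45, dtype="int64")})
narrow = wide.astype("uint8")

# Bytes used by the column data alone (index excluded)
before = wide.memory_usage(index=False, deep=True).sum()
after = narrow.memory_usage(index=False, deep=True).sum()
print(before, after)  # 352 44
```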

In [20]:
siena_2018.info()

<class 'pandas.core.frame.DataFrame'>
Index: 44 entries, 1 to 44
Data columns (total 24 columns):
 #   Column                          Non-Null Count  Dtype   
---  ------                          --------------  -----   
 0   Seq                             44 non-null     object  
 1   President                       44 non-null     object  
 2   Party                           44 non-null     category
 3   Background                      44 non-null     uint8   
 4   Imagination                     44 non-null     uint8   
 5   Integrity                       44 non-null     uint8   
 6   Intelligence                    44 non-null     uint8   
 7   Luck                            44 non-null     uint8   
 8   Willing_to_take_risks           44 non-null     uint8   
 9   Ability_to_compromise           44 non-null     uint8   
 10  Executive_ability               44 non-null     uint8   
 11  Leadership_ability              44 non-null     uint8   
 12  Communication_ability        