**Table of contents**<a id='toc0_'></a>    
- [Import Statements](#toc1_1_)    
- [Importing the data](#toc1_2_)    
- [Aggregations](#toc2_)    
    - [*Multiple aggregations on a dataframe using the `.agg` method*](#toc2_1_1_)    
    - [*The `.describe` method returns a dataframe with summary statistics for each numeric column*](#toc2_1_2_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=2
	maxLevel=5
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

**Read the official documentation on pandas DataFrames @ https://pandas.pydata.org/pandas-docs/stable/reference/frame.html**

**`Note:`** **Chaining functions/methods** in pandas works the same way as in plain Python: each method returns an object, and the next method in the chain is called on that result.
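As a minimal sketch of method chaining, the snippet below uses a small made-up Series (`ranks` and `top_two` are hypothetical names, not part of the Siena data). Each call returns a new Series, so the calls can be stacked one after another.

```python
import pandas as pd

# Hypothetical toy data, not the Siena poll
ranks = pd.Series([3, 1, 2], index=["a", "b", "c"])

# Each method returns a new Series, so the next method is applied to that result
top_two = (
    ranks
    .sort_values()  # order ascending: b=1, c=2, a=3
    .head(2)        # keep the two smallest values
    .mul(10)        # multiply every value by 10
)
print(top_two)
```

Wrapping the chain in parentheses is only a layout choice; it lets each step sit on its own line.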

DataFrames are **column oriented**, unlike most common databases. **Each column** in a dataframe is a **pandas Series object**, so any operation that can be performed on a Series can also be applied to a column.
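A quick way to check this, sketched on a hypothetical two-column frame (`df` and its values are made up for illustration), is to pull out one column and inspect its type:

```python
import pandas as pd

# Hypothetical toy frame, not the Siena poll
df = pd.DataFrame({"IQ": [10, 4, 1], "L": [1, 24, 8]}, index=[1, 2, 3])

iq = df["IQ"]                         # selecting one column returns a Series
is_series = isinstance(iq, pd.Series)
iq_mean = iq.mean()                   # any Series method works on the column

print(type(iq))
print(iq_mean)
```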

A dataframe has **two axes**, commonly referred to as axis 0 and axis 1, or the **"index"** (rows) axis and the **"columns"** axis respectively. Note that when an **operation** is applied **along axis 0**, it runs **down the rows of each column**, producing one result per column. Likewise, an operation **along axis 1** runs **across the columns of each row**, producing one result per row.
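The difference between the two axes is easier to see on a tiny example. The frame below (`toy`, with made-up values) is only an illustration:

```python
import pandas as pd

# Hypothetical toy frame with two columns and three rows
toy = pd.DataFrame({"x": [1, 2, 3], "y": [10, 20, 30]})

col_sums = toy.sum(axis=0)  # down the rows: one result per column
row_sums = toy.sum(axis=1)  # across the columns: one result per row

print(col_sums)
print(row_sums)
```

Here `col_sums` is indexed by the column labels, while `row_sums` is indexed by the row labels.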

### <a id='toc1_1_'></a>[Import Statements](#toc0_)

In [1]:
# import statements
import numpy as np
import pandas as pd

In [2]:
# view options
pd.set_option("display.max_columns", 14)
pd.set_option("display.max_rows", 8)

### <a id='toc1_2_'></a>[Importing the data](#toc0_)

- We will be exploring a dataset from a 2018 Siena College poll. The data ranks United States presidents on various attributes. These attributes are:

In [3]:
siena_2018_cols = """
• Bg = Background
• Im = Imagination
• Int = Integrity
• IQ = Intelligence
• L = Luck
• WR = Willing to take risks
• AC = Ability to compromise
• EAb = Executive ability
• LA = Leadership ability
• CAb = Communication ability
• OA = Overall ability
• PL = Party leadership
• RC = Relations with Congress
• CAp = Court appointments
• HE = Handling of economy
• EAp = Executive appointments
• DA = Domestic accomplishments
• FPA = Foreign policy accomplishments
• AM = Avoid crucial mistakes
• EV = Experts’ view
• O = Overall
"""

In [4]:
# read the data from a GitHub URL

# It is good practice to specify the index column when reading a data file.
# Here index_col=0 makes the first column of the CSV the index.

url = "https://github.com/mattharrison/datasets/raw/master/data/siena2018-pres.csv"
siena_2018 = pd.read_csv(url, index_col=0)
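To see what `index_col=0` changes without downloading anything, here is a small sketch that reads an in-memory CSV. The text in `csv_text` is made up to mimic the layout of the file (an unnamed first column holding the row labels); it is not the real data.

```python
import io

import pandas as pd

# Hypothetical CSV text; the first column has no header, like a saved index
csv_text = ",President,Bg\n1,George Washington,7\n2,John Adams,3\n"

# With index_col=0 the first column becomes the row index
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Without it, pandas keeps that column as data and adds a default 0..n-1 index
without_index = pd.read_csv(io.StringIO(csv_text))

print(with_index)
print(without_index)
```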

In [5]:
siena_2018.head(3)

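| | Seq. | President | Party | Bg | Im | Int | IQ | ... | HE | EAp | DA | FPA | AM | EV | O |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | George Washington | Independent | 7 | 7 | 1 | 10 | ... | 1 | 1 | 2 | 2 | 1 | 2 | 1 |
| 2 | 2 | John Adams | Federalist | 3 | 13 | 4 | 4 | ... | 13 | 15 | 19 | 13 | 16 | 10 | 14 |
| 3 | 3 | Thomas Jefferson | Democratic-Republican | 2 | 2 | 14 | 1 | ... | 20 | 4 | 6 | 9 | 7 | 5 | 5 |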


In [6]:
# print each column's name, its number of non-null values, and its datatype
siena_2018.info()

<class 'pandas.core.frame.DataFrame'>
Index: 44 entries, 1 to 44
Data columns (total 24 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Seq.       44 non-null     object
 1   President  44 non-null     object
 2   Party      44 non-null     object
 3   Bg         44 non-null     int64 
 4   Im         44 non-null     int64 
 5   Int        44 non-null     int64 
 6   IQ         44 non-null     int64 
 7   L          44 non-null     int64 
 8   WR         44 non-null     int64 
 9   AC         44 non-null     int64 
 10  EAb        44 non-null     int64 
 11  LA         44 non-null     int64 
 12  CAb        44 non-null     int64 
 13  OA         44 non-null     int64 
 14  PL         44 non-null     int64 
 15  RC         44 non-null     int64 
 16  CAp        44 non-null     int64 
 17  HE         44 non-null     int64 
 18  EAp        44 non-null     int64 
 19  DA         44 non-null     int64 
 20  FPA        44 non-null     int64 
 21  

--------------------------------

## <a id='toc2_'></a>[Aggregations](#toc0_)

--------------------------

Aggregations that are applicable to a Series object are also applicable to a DataFrame. The only difference is that, in a dataframe, an aggregation can be applied along either of the two axes (i.e., index or columns).

In [7]:
# Slice out the portion of the siena_2018 df that holds the numerical values.
# It is good practice to call .copy() on a slice so that operations applied to
# the sliced df don't affect the original df.
scores = siena_2018.loc[:, "Bg":"O"].copy()
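The effect of `.copy()` can be sketched on a made-up frame (`toy` and `part` are hypothetical names). Note that a label slice with `.loc` includes both endpoints, which is why `"Bg":"O"` keeps the `O` column.

```python
import pandas as pd

# Hypothetical toy frame standing in for siena_2018
toy = pd.DataFrame({"Seq": ["1", "2"], "Bg": [7, 3], "O": [1, 14]})

# Label-based slice (both ends included), copied so it is independent of toy
part = toy.loc[:, "Bg":"O"].copy()

# Changing the copy leaves the original untouched
part.loc[0, "Bg"] = 99

print(part)
print(toy)
```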

#### <a id='toc2_1_1_'></a>[*Multiple aggregations on a dataframe using the `.agg` method*](#toc0_)

- **Aggregate over axis 1 (aggregate each row across the columns)**

In [8]:
scores.agg(["sum", "mean"], axis=1)

| | sum | mean |
|---|---|---|
| 1 | 80.0 | 3.809524 |
| 2 | 305.0 | 14.523810 |
| 3 | 139.0 | 6.619048 |
| 4 | 205.0 | 9.761905 |
| ... | ... | ... |
| 41 | 307.0 | 14.619048 |
| 42 | 635.0 | 30.238095 |
| 43 | 331.0 | 15.761905 |
| 44 | 833.0 | 39.666667 |


- **Aggregate over axis 0 (aggregate each column down the rows)**

In [9]:
scores.agg(["sum", "mean"], axis=0)

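| | Bg | Im | Int | IQ | L | WR | AC | ... | HE | EAp | DA | FPA | AM | EV | O |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| sum | 968.0 | 957.0 | 990.0 | 990.0 | 990.0 | 953.0 | 968.0 | ... | 990.0 | 990.0 | 990.0 | 990.0 | 990.0 | 990.0 | 990.0 |
| mean | 22.0 | 21.75 | 22.5 | 22.5 | 22.5 | 21.659091 | 22.0 | ... | 22.5 | 22.5 | 22.5 | 22.5 | 22.5 | 22.5 | 22.5 |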


- **Different aggregations per column (a statistic that was not requested for a column is left missing)**

In [10]:
scores.agg({"Int": ["max", "mean"], "IQ": ["min", "mean"]})

| | Int | IQ |
|---|---|---|
| max | 44.0 | NaN |
| mean | 22.5 | 22.5 |
| min | NaN | 1.0 |
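The missing cells in the result above come from asking for different statistics per column. A minimal sketch on a hypothetical three-row frame (`toy`, with made-up values) shows the same pattern:

```python
import pandas as pd

# Hypothetical toy frame, not the Siena poll
toy = pd.DataFrame({"Int": [1, 4, 14], "IQ": [10, 4, 1]})

# Each column gets its own list of aggregations
result = toy.agg({"Int": ["max", "mean"], "IQ": ["min", "mean"]})

# "max" was only requested for Int, so the IQ cell in that row is missing
max_iq_is_missing = pd.isna(result.loc["max", "IQ"])

print(result)
```

The result has one row per distinct aggregation name, and any column/aggregation pair that was not requested is filled with `NaN`.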


#### <a id='toc2_1_2_'></a>[*The `.describe` method returns a dataframe with summary statistics for each numeric column*](#toc0_)

**`Note:`** This is the default behaviour. To summarise all the columns (both numeric and non-numeric), pass `include="all"`, as in `df.describe(include="all")`. To summarise a single column, extract it as a Series and call the describe method on it, e.g. `df.column_name.describe()`.
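The three variants mentioned in the note can be compared on a small mixed-type frame. The frame `toy` and its values are made up for illustration:

```python
import pandas as pd

# Hypothetical toy frame with one text column and one numeric column
toy = pd.DataFrame({"Party": ["Federalist", "Whig", "Whig"], "Bg": [3, 20, 31]})

numeric_only = toy.describe()             # default: numeric columns only
everything = toy.describe(include="all")  # numeric and non-numeric columns
one_column = toy["Party"].describe()      # a single column, as a Series

print(numeric_only)
print(everything)
print(one_column)
```

For a text column, `describe` reports `count`, `unique`, `top` (the most frequent value), and `freq` (how often it occurs) instead of the numeric statistics.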

In [11]:
scores.describe()

| | Bg | Im | Int | IQ | L | WR | AC | ... | HE | EAp | DA | FPA | AM | EV | O |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 44.0 | 44.0 | 44.0 | 44.0 | 44.0 | 44.0 | 44.0 | ... | 44.0 | 44.0 | 44.0 | 44.0 | 44.0 | 44.0 | 44.0 |
| mean | 22.0 | 21.75 | 22.5 | 22.5 | 22.5 | 21.659091 | 22.0 | ... | 22.5 | 22.5 | 22.5 | 22.5 | 22.5 | 22.5 | 22.5 |
| std | 12.409674 | 12.519984 | 12.845233 | 12.845233 | 12.845233 | 11.892822 | 12.409674 | ... | 12.845233 | 12.845233 | 12.845233 | 12.845233 | 12.845233 | 12.845233 | 12.845233 |
| min | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 25% | 11.75 | 11.0 | 11.75 | 11.75 | 11.75 | 11.75 | 11.75 | ... | 11.75 | 11.75 | 11.75 | 11.75 | 11.75 | 11.75 | 11.75 |
| 50% | 22.0 | 21.5 | 22.5 | 22.5 | 22.5 | 22.5 | 22.0 | ... | 22.5 | 22.5 | 22.5 | 22.5 | 22.5 | 22.5 | 22.5 |
| 75% | 32.25 | 32.25 | 33.25 | 33.25 | 33.25 | 31.25 | 32.25 | ... | 33.25 | 33.25 | 33.25 | 33.25 | 33.25 | 33.25 | 33.25 |
| max | 43.0 | 43.0 | 44.0 | 44.0 | 44.0 | 41.0 | 43.0 | ... | 44.0 | 44.0 | 44.0 | 44.0 | 44.0 | 44.0 | 44.0 |


**Note:** The `count` row in the summary statistics has a particular meaning in pandas. It is not the number of rows; it is the number of non-missing (non-NA) values in each column.
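The Siena scores have no missing values, so `count` equals the number of rows there. A small sketch on a made-up frame with one `NaN` (`toy` is a hypothetical name) shows where the two numbers differ:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: column "a" has one missing value
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})

n_rows = len(toy)                     # number of rows, missing or not
counts = toy.describe().loc["count"]  # non-missing values per column

print(n_rows)
print(counts)
```

Column `a` reports a count of 2 even though the frame has 3 rows, because the missing value is skipped.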