# Deep dive into Pandas DataFrames

**Table of contents**<a id='toc0_'></a>    
- [Import Statements](#toc1_)    
- [Importing the data](#toc2_)    
- [Mathematical operations on DataFrames](#toc3_)    
- [Looping over a DataFrame (using the `for` loop)](#toc4_)    
- [Aggregations](#toc5_)    
    - [*Multiple aggregations on a dataframe using the `.agg` method*](#toc5_1_1_)    
    - [*The `.describe` method returns a dataframe with summary statistics for each numeric column*](#toc5_1_2_)    
- [Casting Datatypes and Renaming the columns](#toc6_)    
  - [*Renaming the columns with proper full form*](#toc6_1_)    
    - [The `.rename()` method](#toc6_1_1_)    
  - [*Casting DataTypes*](#toc6_2_)    
- [Creating and Updating columns: The `.assign()` method](#toc7_)    
- [Dealing with Missing and Duplicated Data](#toc8_)    
  - [*Locating missing data*](#toc8_1_)    
  - [*Handling missing values*](#toc8_2_)    
  - [*Handling duplicate data*](#toc8_3_)    
- [Sorting Columns and Indexes](#toc9_)    
    - [Setting indexes: The `.set_index()` method](#toc9_1_1_)    
    - [Sorting indexes: The `.sort_index()` method](#toc9_1_2_)    
    - [Sorting values: The `.sort_values()` method](#toc9_1_3_)    
- [Indexing & Filtering](#toc10_)    
  - [`->` **Indexing**](#toc10_1_)    
    - [*Renaming index labels: The `.rename()` method*](#toc10_1_1_)    
    - [*Resetting index labels to monotonically increasing integers: The `.reset_index()` method*](#toc10_1_2_)    
    - [*Indexing by Index labels: The `.loc[]` method*](#toc10_1_3_)    
    - [*Indexing by Index positions: The `.iloc[]` method*](#toc10_1_4_)    
  - [`->` **Filtering**](#toc10_2_)    
    - [*Filtering Index and Column Labels with `.filter(items, like, regex, axis)`*](#toc10_2_1_)    
    - [*Filtering with boolean arrays (Boolean Masking)*](#toc10_2_2_)    
    - [*Using `functions with .loc` (for filtering)*](#toc10_2_3_)    
    - [*Filtering with the `.query()` method*](#toc10_2_4_)    
- [Reshaping Dataframes (Grouping and Aggregating)](#toc11_)    
    - [Reshaping dataframes with `dummies`](#toc11_1_1_)    
  - [*The `pivot_table()` method*](#toc11_2_)    
  - [*The `.groupby()` method*](#toc11_3_)    
  - [*Accessing values from a MultiIndexed Dataframe*](#toc11_4_)    
- [Flattening Hierarchical Indexes and Columns](#toc12_)    
  - [*Removing Hierarchical index*](#toc12_1_)    
  - [*Flattening hierarchical columns*](#toc12_2_)    
- [Melting, Transposing and Stacking Data](#toc13_)    
  - [*Melting & Unmelting Data*](#toc13_1_)    
  - [*Transposing Data*](#toc13_2_)    
  - [*Stacking and Unstacking Data*](#toc13_3_)    
- [Concatenation, Joining DataFrames](#toc14_)    
  - [The `concat()` method](#toc14_1_)    
  - [Joining Dataframes](#toc14_2_)    


**Read the official documentation on pandas DataFrames @ https://pandas.pydata.org/pandas-docs/stable/reference/frame.html**

**`Note:`** Method chaining in pandas works the same way as in Python: each method returns an object on which the next method is called.

DataFrames are **column oriented**, unlike most common databases, and **each column** in a dataframe is a **pandas Series object**. So, any operation that can be performed on a Series can be applied to a column too.

A dataframe has **two axes**, commonly referred to as axis 0 and axis 1, or the **"index"** (rows) axis and the **"columns"** axis respectively. When an **operation** is applied **along axis 0**, it runs **down the rows of each column**, so a reduction such as a sum gives one result per column. Likewise, an operation **along axis 1** runs **across the columns of each row**, giving one result per row.
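The axis convention is easy to mix up, so here is a minimal sketch on a made-up three-row frame (`toy` is hypothetical and not one of the datasets used in this notebook):

```python
import pandas as pd

# Hypothetical toy frame, used only to illustrate the axis argument
toy = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]}, index=["r1", "r2", "r3"])

# axis=0 ("index"): collapse the rows, giving one result per column
col_totals = toy.sum(axis=0)

# axis=1 ("columns"): collapse the columns, giving one result per row
row_totals = toy.sum(axis=1)

print(col_totals)
print(row_totals)
```

`col_totals` is indexed by the column names and `row_totals` by the row labels.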

## <a id='toc1_'></a>[Import Statements](#toc0_)

--------------------------

In [1]:
# import statements
import numpy as np
import pandas as pd

In [2]:
# view options
pd.set_option("display.max_columns", 14)
pd.set_option("display.max_rows", 8)

---------------------------

## <a id='toc2_'></a>[Importing the data](#toc0_)

------------------------

- We will be exploring a dataset from a Siena College Poll in 2018. This data has rankings of United States Presidents in various attributes. These attributes are:

In [3]:
siena_2018_cols = """
• Bg = Background
• Im = Imagination
• Int = Integrity
• IQ = Intelligence
• L = Luck
• WR = Willing to take risks
• AC = Ability to compromise
• EAb = Executive ability
• LA = Leadership ability
• CAb = Communication ability
• OA = Overall ability
• PL = Party leadership
• RC = Relations with Congress
• CAp = Court appointments
• HE = Handling of economy
• EAp = Executive appointments
• DA = Domestic accomplishments
• FPA = Foreign policy accomplishments
• AM = Avoid crucial mistakes
• EV = Experts’ view
• O = Overall
"""

In [4]:
# reading from github url

# it is a good practice to define your index column when reading the data file.
# it is generally frowned upon if you don't have an index column

url = "https://github.com/mattharrison/datasets/raw/master/data/siena2018-pres.csv"
siena_2018 = pd.read_csv(url, index_col=0)

In [5]:
siena_2018.head(3)

Unnamed: 0,Seq.,President,Party,Bg,Im,Int,IQ,...,HE,EAp,DA,FPA,AM,EV,O
1,1,George Washington,Independent,7,7,1,10,...,1,1,2,2,1,2,1
2,2,John Adams,Federalist,3,13,4,4,...,13,15,19,13,16,10,14
3,3,Thomas Jefferson,Democratic-Republican,2,2,14,1,...,20,4,6,9,7,5,5


In [6]:
# this will print all the column names, number of non null values in each column and the datatype of that column
siena_2018.info()

<class 'pandas.core.frame.DataFrame'>
Index: 44 entries, 1 to 44
Data columns (total 24 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Seq.       44 non-null     object
 1   President  44 non-null     object
 2   Party      44 non-null     object
 3   Bg         44 non-null     int64 
 4   Im         44 non-null     int64 
 5   Int        44 non-null     int64 
 6   IQ         44 non-null     int64 
 7   L          44 non-null     int64 
 8   WR         44 non-null     int64 
 9   AC         44 non-null     int64 
 10  EAb        44 non-null     int64 
 11  LA         44 non-null     int64 
 12  CAb        44 non-null     int64 
 13  OA         44 non-null     int64 
 14  PL         44 non-null     int64 
 15  RC         44 non-null     int64 
 16  CAp        44 non-null     int64 
 17  HE         44 non-null     int64 
 18  EAp        44 non-null     int64 
 19  DA         44 non-null     int64 
 20  FPA        44 non-null     int64 
 21  

- Another dataset that we will be exploring is the "./Data/vehicles.csv.zip".

In [7]:
vehicles = pd.read_csv("./Data/vehicles.csv.zip")



In [8]:
vehicles.head()

Unnamed: 0,barrels08,barrelsA08,charge120,charge240,city08,city08U,cityA08,...,c240bDscr,createdOn,modifiedOn,startStop,phevCity,phevHwy,phevComb
0,15.695714,0.0,0.0,0.0,19,0.0,0,...,,Tue Jan 01 00:00:00 EST 2013,Tue Jan 01 00:00:00 EST 2013,,0,0,0
1,29.964545,0.0,0.0,0.0,9,0.0,0,...,,Tue Jan 01 00:00:00 EST 2013,Tue Jan 01 00:00:00 EST 2013,,0,0,0
2,12.207778,0.0,0.0,0.0,23,0.0,0,...,,Tue Jan 01 00:00:00 EST 2013,Tue Jan 01 00:00:00 EST 2013,,0,0,0
3,29.964545,0.0,0.0,0.0,10,0.0,0,...,,Tue Jan 01 00:00:00 EST 2013,Tue Jan 01 00:00:00 EST 2013,,0,0,0
4,17.347895,0.0,0.0,0.0,17,0.0,0,...,,Tue Jan 01 00:00:00 EST 2013,Tue Jan 01 00:00:00 EST 2013,,0,0,0


In [9]:
# vehicles.info()

- The Stack Overflow Developer Survey data from 2019

In [10]:
dev_survey = pd.read_csv("./Data/dev_survey_2019.zip")

In [11]:
dev_survey.head(3)

Unnamed: 0,Respondent,MainBranch,Hobbyist,OpenSourcer,OpenSource,Employment,Country,...,Gender,Trans,Sexuality,Ethnicity,Dependents,SurveyLength,SurveyEase
0,1,I am a student who is learning to code,Yes,Never,The quality of OSS and closed source software ...,"Not employed, and not looking for work",United Kingdom,...,Man,No,Straight / Heterosexual,,No,Appropriate in length,Neither easy nor difficult
1,2,I am a student who is learning to code,No,Less than once per year,The quality of OSS and closed source software ...,"Not employed, but looking for work",Bosnia and Herzegovina,...,Man,No,Straight / Heterosexual,,No,Appropriate in length,Neither easy nor difficult
2,3,"I am not primarily a developer, but I write co...",Yes,Never,The quality of OSS and closed source software ...,Employed full-time,Thailand,...,Man,No,Straight / Heterosexual,,Yes,Appropriate in length,Neither easy nor difficult


In [12]:
# dev_survey.info()

---------------

## <a id='toc3_'></a>[Mathematical operations on DataFrames](#toc0_)

----------------

**Similar to Series objects, math operations on DataFrames are index aligned. What's more, they are column aligned too.** 

Alignment takes each index entry from a column in the left df and matches it with every entry that has the same index label and column name in the right df. This is repeated for all the overlapping columns. If either df has duplicate index labels, the result can be surprising: every matching pair of labels is combined, so a label that appears m times on the left and n times on the right produces m × n rows.

In [13]:
# df1: 3 rows and 4 columns
# df2: 2 rows and 5 columns
df1 = pd.DataFrame(
    np.linspace(2, 13, 12).reshape(3, 4),
    columns=["a1", "b1", "c1", "d1"],
    index=[1, 2, 3],
)
df2 = pd.DataFrame(
    np.linspace(2, 11, 10).reshape(2, 5),
    columns=["a1", "b1", "c1", "d1", "e1"],
    index=[2, 2],
)

In [14]:
df1

Unnamed: 0,a1,b1,c1,d1
1,2.0,3.0,4.0,5.0
2,6.0,7.0,8.0,9.0
3,10.0,11.0,12.0,13.0


In [15]:
df2

Unnamed: 0,a1,b1,c1,d1,e1
2,2.0,3.0,4.0,5.0,6.0
2,7.0,8.0,9.0,10.0,11.0


In [16]:
df1 + df2

Unnamed: 0,a1,b1,c1,d1,e1
1,,,,,
2,8.0,10.0,12.0,14.0,
2,13.0,15.0,17.0,19.0,
3,,,,,


As we can see, only the **overlapping rows** (index label 2) **and columns** (a1 through d1) get added together. The other values are missing. We can use the **.add method instead of "+" and define a fill value** if we want, just as we did with Series objects. Note that a cell stays missing when neither dataframe has a value for it, as the next output shows for column e1 at index labels 1 and 3.

In [17]:
df1.add(df2, fill_value=0)

Unnamed: 0,a1,b1,c1,d1,e1
1,2.0,3.0,4.0,5.0,
2,8.0,10.0,12.0,14.0,6.0
2,13.0,15.0,17.0,19.0,11.0
3,10.0,11.0,12.0,13.0,


> Some of the available operator methods are, 
    
    add(), sub(), mul(), div(), mod(), pow(), rfloordiv(), lt(), gt(), eq(), ne(), le(), ge(), dot() etc.

**`Note`**: *If the dataframes have a multi-level index, we can specify the level on which to align and broadcast the operation using the **level** parameter.*
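As a rough sketch of the `level` parameter, the example below divides a multi-indexed frame by per-group totals. The `sales` and `region_totals` frames and their labels are made up for illustration:

```python
import pandas as pd

# Hypothetical data: units sold per (region, product) and a total per region
sales = pd.DataFrame(
    {"units": [10.0, 30.0, 5.0, 15.0]},
    index=pd.MultiIndex.from_tuples(
        [("east", "x"), ("east", "y"), ("west", "x"), ("west", "y")],
        names=["region", "product"],
    ),
)
region_totals = pd.DataFrame(
    {"units": [40.0, 20.0]},
    index=pd.Index(["east", "west"], name="region"),
)

# Align region_totals on the "region" level and broadcast it over "product"
share = sales.div(region_totals, level="region")
print(share)
```

Each row of `share` holds that product's fraction of its region's total.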

---------------------------------------------------

## <a id='toc4_'></a>[Looping over a DataFrame (using the `for` loop)](#toc0_)

-----------------------------------------------------

Looping over a pandas dataframe with a for loop is generally discouraged. Pandas' built-in methods are vectorized and therefore much faster than a Python-level loop, so a loop throws that advantage away. However, a for loop can still be useful at times, for example when plotting visuals.
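To make the contrast concrete, here is a small sketch that computes the same quantity both ways. The `orders` frame is hypothetical, and no timings are shown because they depend on the machine and the data size:

```python
import pandas as pd

# Hypothetical frame used only to contrast the two styles
orders = pd.DataFrame({"price": [2.5, 4.0, 1.5], "qty": [4, 1, 6]})

# Row-by-row loop: works, but executes Python code once per row
loop_totals = [row.price * row.qty for row in orders.itertuples()]

# Vectorized: a single expression evaluated over whole columns
vector_totals = orders["price"] * orders["qty"]

print(loop_totals)
print(vector_totals.tolist())
```

Both approaches give the same numbers; the vectorized form is shorter and scales better as the frame grows.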

Some iterator methods that are useful while looping over a dataframe are, **.items(), .iterrows(), .itertuples()**.

- The `.items()` method: returns a tuple of **(column name, column content as Series)**

The indexes of the returned series will be the indexes of the dataframe.

In [18]:
for col_label, col_content in siena_2018.items():
    print(f"Column Name: {col_label}")
    print(f"Data contents of the column: \n{col_content}")
    break

Column Name: Seq.
Data contents of the column: 
1      1
2      2
3      3
4      4
      ..
41    42
42    43
43    44
44    45
Name: Seq., Length: 44, dtype: object


- The `.iterrows()` method: returns a tuple of **(index, row content as Series)**

The indexes of the returned series will be the associated column names.

In [19]:
for idx, row_content in siena_2018.iterrows():
    print(f"Row index: {idx}\n")
    print(f"Data contents of the row: \n\n{row_content}")
    break

Row index: 1

Data contents of the row: 

Seq.                         1
President    George Washington
Party              Independent
Bg                           7
                   ...        
FPA                          2
AM                           1
EV                           2
O                            1
Name: 1, Length: 24, dtype: object


- The `.itertuples()` method: returns the **rows as namedtuples**

In [20]:
for row in siena_2018.itertuples():
    print(row)
    break

Pandas(Index=1, _1='1', President='George Washington', Party='Independent', Bg=7, Im=7, Int=1, IQ=10, L=1, WR=6, AC=2, EAb=2, LA=1, CAb=11, OA=2, PL=18, RC=1, CAp=1, HE=1, EAp=1, DA=2, FPA=2, AM=1, EV=2, O=1)
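In the output above, the `Seq.` column shows up as `_1` because `Seq.` is not a valid Python identifier, and `.itertuples()` replaces such names with positional ones. A minimal sketch on a made-up two-column frame (`demo` is hypothetical):

```python
import pandas as pd

# Hypothetical frame; "Seq." cannot be used as a namedtuple field name
demo = pd.DataFrame({"Seq.": ["1", "2"], "President": ["A", "B"]}, index=[1, 2])

first = next(demo.itertuples())
fields = first._fields  # invalid names are replaced by positional ones

# index=False leaves out the index; name=None yields plain tuples
plain = list(demo.itertuples(index=False, name=None))

print(fields)
print(plain)
```

Renaming such columns first (as done later with `.rename()`) keeps attribute access readable.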


--------------------------------

## <a id='toc5_'></a>[Aggregations](#toc0_)

--------------------------

Aggregations that are applicable to a Series object are also applicable to a DataFrame. The only difference is that, in dataframes, aggregations can be applied along either of the 2 axes (i.e., index and columns).

In [21]:
# let's slice out the portion of the siena_2018 df that has numerical values
# it's good practice to use .copy() while slicing a df so that operations
# applied on the sliced df don't affect the original df
scores = siena_2018.loc[:, "Bg":"O"].copy()

#### <a id='toc5_1_1_'></a>[*Multiple aggregations on a dataframe using the `.agg` method*](#toc0_)

- **Aggregate over axis 1 (aggregate each row across the columns)**

In [22]:
scores.agg(["sum", "mean"], axis=1)

Unnamed: 0,sum,mean
1,80.0,3.809524
2,305.0,14.523810
3,139.0,6.619048
4,205.0,9.761905
...,...,...
41,307.0,14.619048
42,635.0,30.238095
43,331.0,15.761905
44,833.0,39.666667


- **Aggregate over axis 0 (aggregate each column down the rows)**

In [23]:
scores.agg(["sum", "mean"], axis=0)

Unnamed: 0,Bg,Im,Int,IQ,L,WR,AC,...,HE,EAp,DA,FPA,AM,EV,O
sum,968.0,957.0,990.0,990.0,990.0,953.0,968.0,...,990.0,990.0,990.0,990.0,990.0,990.0,990.0
mean,22.0,21.75,22.5,22.5,22.5,21.659091,22.0,...,22.5,22.5,22.5,22.5,22.5,22.5,22.5


- Different aggregations per column

In [24]:
scores.agg({"Int": ["max", "mean"], "IQ": ["min", "mean"]})

Unnamed: 0,Int,IQ
max,44.0,
mean,22.5,22.5
min,,1.0


#### <a id='toc5_1_2_'></a>[*The `.describe` method returns a dataframe with summary statistics for each numeric column*](#toc0_)

**`Note:`** This is the default behaviour. To generate a summary of all the columns (both numeric and non-numeric type) you can set, `df.describe(include="all")` or to get the summary of any column you can extract that column as a series object and then call the describe method e.g, `df.column_name.describe()`.

In [25]:
scores.describe()

Unnamed: 0,Bg,Im,Int,IQ,L,WR,AC,...,HE,EAp,DA,FPA,AM,EV,O
count,44.0,44.0,44.0,44.0,44.0,44.0,44.0,...,44.0,44.0,44.0,44.0,44.0,44.0,44.0
mean,22.0,21.75,22.5,22.5,22.5,21.659091,22.0,...,22.5,22.5,22.5,22.5,22.5,22.5,22.5
std,12.409674,12.519984,12.845233,12.845233,12.845233,11.892822,12.409674,...,12.845233,12.845233,12.845233,12.845233,12.845233,12.845233,12.845233
min,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0
25%,11.75,11.0,11.75,11.75,11.75,11.75,11.75,...,11.75,11.75,11.75,11.75,11.75,11.75,11.75
50%,22.0,21.5,22.5,22.5,22.5,22.5,22.0,...,22.5,22.5,22.5,22.5,22.5,22.5,22.5
75%,32.25,32.25,33.25,33.25,33.25,31.25,32.25,...,33.25,33.25,33.25,33.25,33.25,33.25,33.25
max,43.0,43.0,44.0,44.0,44.0,41.0,43.0,...,44.0,44.0,44.0,44.0,44.0,44.0,44.0


**Note:** The count row in the summary statistics has a particular meaning in pandas. It is not the count of the rows, rather it is the count of the non-missing (not na) rows.
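Since the `scores` frame has no missing values, the point about `count` is easier to see on a small made-up frame with gaps (`gaps` is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in each column
gaps = pd.DataFrame({"score": [1.0, np.nan, 3.0], "name": ["a", "b", None]})

numeric_summary = gaps.describe()             # numeric columns only
full_summary = gaps.describe(include="all")   # also summarizes "name"

print(len(gaps))                              # number of rows
print(numeric_summary.loc["count", "score"])  # number of non-missing scores
print(full_summary)
```

The frame has three rows, but `count` reports two for each column because one value is missing in each.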

---------------------

## <a id='toc6_'></a>[Casting Datatypes and Renaming the columns](#toc0_)

-----------------------

**Note: This (i.e, casting datatypes and renaming columns) should be the first step whenever we load in a dataset. Also, we should write these commands as functions, allowing us to reuse the code in other notebooks if necessary.**

### <a id='toc6_1_'></a>[*Renaming the columns with proper full form*](#toc0_)

- Getting the full form of each column from the "siena_2018_cols" string

In [26]:
# we want to write a code to generate a python dictionary from the above multiline string named "siena_2018_cols", which is
# formatted as short form = long form. This dictionary will be used to rename the columns of the dataframe "siena_2018"

# first we create a list of the form, [[short, full], .....]
cols_list = [
    col.strip().split("=") for col in siena_2018_cols.strip().split(sep="•")[1:]
]

# we will replace the spaces in the full form with underscores (_)
siena_2018_cols_dict = {
    col_prev.strip(): col_full.strip().replace(" ", "_")
    for col_prev, col_full in cols_list
}

**Note:** When this unpacking pattern is used in a for loop over a nested list, each inner list is unpacked into the loop variables on every iteration, so each inner list must contain exactly as many items as there are variables.
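Following the earlier advice to wrap such steps in functions, the parsing above could be packaged as a reusable helper. This is only a sketch: `parse_legend` is a hypothetical name, and it assumes every entry has the form `short = long` after a bullet.

```python
def parse_legend(legend: str, bullet: str = "•") -> dict[str, str]:
    """Turn 'short = long' bullet entries into {short: long_with_underscores}."""
    mapping = {}
    # the piece before the first bullet is empty, so skip it
    for entry in legend.split(bullet)[1:]:
        # partition splits on the first "=" only
        short, _, full = entry.partition("=")
        mapping[short.strip()] = full.strip().replace(" ", "_")
    return mapping


sample = """
• Bg = Background
• WR = Willing to take risks
"""
legend_map = parse_legend(sample)
print(legend_map)
```

The resulting dictionary can be passed straight to `.rename(columns=...)`.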

#### <a id='toc6_1_1_'></a>[The `.rename()` method](#toc0_)

In [27]:
# inplace = True is frowned upon
siena_2018 = siena_2018.rename(columns={"Seq.": "Seq"}).rename(
    columns=siena_2018_cols_dict
)

In [28]:
siena_2018

Unnamed: 0,Seq,President,Party,Background,Imagination,Integrity,Intelligence,...,Handling_of_economy,Executive_appointments,Domestic_accomplishments,Foreign_policy_accomplishments,Avoid_crucial_mistakes,Experts’_view,Overall
1,1,George Washington,Independent,7,7,1,10,...,1,1,2,2,1,2,1
2,2,John Adams,Federalist,3,13,4,4,...,13,15,19,13,16,10,14
3,3,Thomas Jefferson,Democratic-Republican,2,2,14,1,...,20,4,6,9,7,5,5
4,4,James Madison,Democratic-Republican,4,6,7,3,...,14,7,11,19,11,8,7
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41,42,Bill Clinton,Democratic,21,12,39,8,...,5,12,9,18,30,14,15
42,43,George W. Bush,Republican,17,29,33,41,...,36,29,30,38,36,34,33
43,44,Barack Obama,Democratic,24,11,13,9,...,10,13,13,20,10,11,17
44,45,Donald Trump,Republican,43,40,44,44,...,39,44,40,42,41,42,42


### <a id='toc6_2_'></a>[*Casting DataTypes*](#toc0_)

The first thing we should do when we load in a dataset is check the datatype of each column and convert it to a more suitable one where possible. This saves memory and can speed up operations.

In [29]:
siena_2018.dtypes.to_dict()  # we could've also used the .info() method

{'Seq': dtype('O'),
 'President': dtype('O'),
 'Party': dtype('O'),
 'Background': dtype('int64'),
 'Imagination': dtype('int64'),
 'Integrity': dtype('int64'),
 'Intelligence': dtype('int64'),
 'Luck': dtype('int64'),
 'Willing_to_take_risks': dtype('int64'),
 'Ability_to_compromise': dtype('int64'),
 'Executive_ability': dtype('int64'),
 'Leadership_ability': dtype('int64'),
 'Communication_ability': dtype('int64'),
 'Overall_ability': dtype('int64'),
 'Party_leadership': dtype('int64'),
 'Relations_with_Congress': dtype('int64'),
 'Court_appointments': dtype('int64'),
 'Handling_of_economy': dtype('int64'),
 'Executive_appointments': dtype('int64'),
 'Domestic_accomplishments': dtype('int64'),
 'Foreign_policy_accomplishments': dtype('int64'),
 'Avoid_crucial_mistakes': dtype('int64'),
 'Experts’_view': dtype('int64'),
 'Overall': dtype('int64')}

> **First, let's explore the columns with "Object" datatype**

- The "Seq" column (Sequences of the presidency)

In [30]:
siena_2018.Seq.values

array(['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12',
       '13', '14', '15', '16', '17', '18', '19', '20', '21', '22/24',
       '23', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34',
       '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45'],
      dtype=object)

Upon inspection we can see that there's a value of '22/24'. So this column cannot be cast to an integer type as is; it can either remain as "Object" type or be converted to "string" type. 

- The "President" column lists the name of the president. So, this can either be converted to "string" type or can remain as is. 

- The "Party" column provides the name of the party the president was elected with.

In [31]:
siena_2018.Party.value_counts()

Party
Republican               19
Democratic               15
Democratic-Republican     4
Whig                      3
Independent               2
Federalist                1
Name: count, dtype: int64

This column has only 6 unique values. So, this can be converted to "categorical" type.

In [32]:
siena_2018 = siena_2018.astype({"Party": "category"})

> **Now, let's explore the columns with "int64" as datatype**

**Note:** One of the interesting and important pandas methods is the `.select_dtypes()` method. This will select all the columns with the specified datatype and return those columns as a new DataFrame.

In [33]:
siena_2018.select_dtypes("int64")

Unnamed: 0,Background,Imagination,Integrity,Intelligence,Luck,Willing_to_take_risks,Ability_to_compromise,...,Handling_of_economy,Executive_appointments,Domestic_accomplishments,Foreign_policy_accomplishments,Avoid_crucial_mistakes,Experts’_view,Overall
1,7,7,1,10,1,6,2,...,1,1,2,2,1,2,1
2,3,13,4,4,24,14,31,...,13,15,19,13,16,10,14
3,2,2,14,1,8,5,14,...,20,4,6,9,7,5,5
4,4,6,7,3,16,15,6,...,14,7,11,19,11,8,7
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41,21,12,39,8,11,17,3,...,5,12,9,18,30,14,15
42,17,29,33,41,21,20,28,...,36,29,30,38,36,34,33
43,24,11,13,9,15,23,16,...,10,13,13,20,10,11,17
44,43,40,44,44,10,25,42,...,39,44,40,42,41,42,42


- Let's see the max and min values of the number type columns

In [34]:
_ = siena_2018.select_dtypes("int64").agg(["max", "min"])

In [35]:
_

Unnamed: 0,Background,Imagination,Integrity,Intelligence,Luck,Willing_to_take_risks,Ability_to_compromise,...,Handling_of_economy,Executive_appointments,Domestic_accomplishments,Foreign_policy_accomplishments,Avoid_crucial_mistakes,Experts’_view,Overall
max,43,43,44,44,44,41,43,...,44,44,44,44,44,44,44
min,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1


In [36]:
_.loc["max"].max()

44

In [37]:
_.loc["min"].min()

1

As we can see, none of the columns has a value greater than 44 or less than 1. So, these columns can safely be converted to the "uint8" type (range 0 to 255) and still accommodate the values as they are.

In [38]:
siena_2018 = siena_2018.astype(
    {col_name: "uint8" for col_name in siena_2018.select_dtypes("int64").columns}
)

After casting datatypes to more appropriate types, the memory footprint of the dataframe reduces drastically.
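One way to check the saving is `.memory_usage(deep=True)`, compared before and after the cast. The sketch below uses a made-up frame (`wide` is hypothetical) rather than `siena_2018`, so the numbers it prints are illustrative only:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: a low-cardinality text column and small integers
wide = pd.DataFrame(
    {
        "Party": ["Republican", "Democratic"] * 500,
        "rank": (np.arange(1000) % 44 + 1).astype("int64"),
    }
)

before = wide.memory_usage(deep=True).sum()

# uint8 can hold 0..255, which is enough for ranks between 1 and 44
slim = wide.astype({"Party": "category", "rank": "uint8"})
after = slim.memory_usage(deep=True).sum()

print(before, after)
```

`deep=True` matters for object columns, because it counts the memory held by the strings themselves.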

In [39]:
siena_2018.info()

<class 'pandas.core.frame.DataFrame'>
Index: 44 entries, 1 to 44
Data columns (total 24 columns):
 #   Column                          Non-Null Count  Dtype   
---  ------                          --------------  -----   
 0   Seq                             44 non-null     object  
 1   President                       44 non-null     object  
 2   Party                           44 non-null     category
 3   Background                      44 non-null     uint8   
 4   Imagination                     44 non-null     uint8   
 5   Integrity                       44 non-null     uint8   
 6   Intelligence                    44 non-null     uint8   
 7   Luck                            44 non-null     uint8   
 8   Willing_to_take_risks           44 non-null     uint8   
 9   Ability_to_compromise           44 non-null     uint8   
 10  Executive_ability               44 non-null     uint8   
 11  Leadership_ability              44 non-null     uint8   
 12  Communication_ability        

-------------------------------------------------

## <a id='toc7_'></a>[Creating and Updating columns: The `.assign()` method](#toc0_)

---------------------------------------------------

**Why use .assign ?** This method returns a new dataframe and doesn't mutate the existing one. This is very useful for chaining operations, as each subsequent method operates on the updated dataframe.

<u>**\*\*kwargs:** `argument (column name) = argument value (callable or Series), ......` </u>
- if the column already exists, its values are overwritten
- if the column doesn't exist, a new column is created
- if the argument value is a Series or a scalar, those values are simply assigned to the column
- the callable (a function or *lambda*) must return a scalar or Series. Using a function (it can be a normal function, but often we use a lambda to keep the logic inline) has a less obvious benefit: if any manipulation or filtering was done on the dataframe earlier in the chain, *the function receives the dataframe in its current state*, with those changes already applied.

**`lambda` function Refresher:** A lambda function can take any number of arguments, but can only have one expression. 

*Syntax --* `lambda arguments : expression`. The expression is executed and the result is returned.
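The benefit of passing a callable is easiest to see in a chain. In this sketch (the `players` frame is made up), the lambdas receive the already-filtered frame, not the original one:

```python
import pandas as pd

# Hypothetical frame used only for this illustration
players = pd.DataFrame({"name": ["a", "b", "c", "d"], "score": [10, 40, 30, 20]})

result = (
    players
    .loc[lambda df_: df_.score > 15]  # keeps b, c and d
    .assign(
        # the sum is taken over the three remaining rows only
        share=lambda df_: df_.score / df_.score.sum(),
        # a later keyword can use a column created just above
        double_share=lambda df_: df_.share * 2,
    )
)
print(result)
```

Writing `players.score / players.score.sum()` instead of the lambda would have used all four rows, and `players` itself is left unchanged.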

In [40]:
# First, we will add a column named Average_rank that ranks the presidents based on their total score (summing the numeric values across the columns)
# using dense method (lowest rank in the group but rank always increases by 1 between groups)
# this is essentially the "Overall" column but using a different ranking method

# Next, we will add another column named, "Quartile_rank" that will have 4 bins (1st, 2nd, 3rd, 4th)
# this is when we will see the power of using a function
# the lambda function will take the current state of the dataframe when the Average_rank column exists

siena_2018 = siena_2018.assign(
    Average_rank=siena_2018.loc[:, "Background":"Experts’_view"]
    .sum(axis=1)
    .rank(method="dense")
    .astype("uint8"),
    Quartile_rank=lambda df_: pd.qcut(
        df_.Average_rank, 4, labels=["1st", "2nd", "3rd", "4th"]
    ),
)

In [41]:
siena_2018

Unnamed: 0,Seq,President,Party,Background,Imagination,Integrity,Intelligence,...,Domestic_accomplishments,Foreign_policy_accomplishments,Avoid_crucial_mistakes,Experts’_view,Overall,Average_rank,Quartile_rank
1,1,George Washington,Independent,7,7,1,10,...,2,2,1,2,1,1,1st
2,2,John Adams,Federalist,3,13,4,4,...,19,13,16,10,14,13,2nd
3,3,Thomas Jefferson,Democratic-Republican,2,2,14,1,...,6,9,7,5,5,5,1st
4,4,James Madison,Democratic-Republican,4,6,7,3,...,11,19,11,8,7,7,1st
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41,42,Bill Clinton,Democratic,21,12,39,8,...,9,18,30,14,15,14,2nd
42,43,George W. Bush,Republican,17,29,33,41,...,30,38,36,34,33,33,3rd
43,44,Barack Obama,Democratic,24,11,13,9,...,13,20,10,11,17,17,2nd
44,45,Donald Trump,Republican,43,40,44,44,...,40,42,41,42,42,42,4th


--------------------------------------------

## <a id='toc8_'></a>[Dealing with Missing and Duplicated Data](#toc0_)

----------------------------------------------

### <a id='toc8_1_'></a>[*Locating missing data*](#toc0_)

- The `.isna()` method

Works similarly to series.isna() method. Returns a Boolean dataframe when used with dataframes.  

In [42]:
siena_2018.isna()

Unnamed: 0,Seq,President,Party,Background,Imagination,Integrity,Intelligence,...,Domestic_accomplishments,Foreign_policy_accomplishments,Avoid_crucial_mistakes,Experts’_view,Overall,Average_rank,Quartile_rank
1,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False
42,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False
43,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False
44,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False


We can use other methods such as **.any(), .all() etc.** in combination with the .isna() method to see whether any data is missing from a column, whether all the data in a column is NaN, and so on.

In [43]:
siena_2018.isna().any()

Seq              False
President        False
Party            False
Background       False
                 ...  
Experts’_view    False
Overall          False
Average_rank     False
Quartile_rank    False
Length: 26, dtype: bool

To `count` how many values are `missing` from each column we can use **df.isna().sum()**.

In [44]:
siena_2018.isna().sum()

Seq              0
President        0
Party            0
Background       0
                ..
Experts’_view    0
Overall          0
Average_rank     0
Quartile_rank    0
Length: 26, dtype: int64

To see what `percentage` of the data in a column is missing we can use something like, **df.isna().mean().mul(100)**.

In [45]:
siena_2018.isna().mean().mul(100)

Seq              0.0
President        0.0
Party            0.0
Background       0.0
                ... 
Experts’_view    0.0
Overall          0.0
Average_rank     0.0
Quartile_rank    0.0
Length: 26, dtype: float64
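Because `siena_2018` has no missing data, every result above is False or zero. The same three calls on a small made-up frame with gaps (`weather` is hypothetical) give more telling output:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with missing values in two of its three columns
weather = pd.DataFrame(
    {
        "city": ["Oslo", None, "Lima", "Pune"],
        "temp": [3.5, np.nan, np.nan, 31.0],
        "year": [2020, 2021, 2022, 2023],
    }
)

has_missing = weather.isna().any()            # True where a column has any gap
missing_count = weather.isna().sum()          # number of gaps per column
missing_pct = weather.isna().mean().mul(100)  # share of gaps per column, in %

print(missing_count)
print(missing_pct)
```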

### <a id='toc8_2_'></a>[*Handling missing values*](#toc0_)

- The `.dropna(subset)` method 

We can use the good old .dropna() method to drop the rows with missing values. By default (`how="any"`), a row is dropped if it has a NaN value in any column; with `how="all"`, a row is dropped only if all of its values are NaN. To restrict which columns are checked when dropping rows, **we can feed the subset parameter a list of column names (that we want it to look at for dropping NaN values)**.
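A short sketch of how the `how` and `subset` arguments change which rows survive, using a made-up frame (`pairs` is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 0 is complete, rows 1 and 2 are partial, row 3 is empty
pairs = pd.DataFrame(
    {
        "a": [1.0, np.nan, 3.0, np.nan],
        "b": [1.0, 2.0, np.nan, np.nan],
    }
)

any_dropped = pairs.dropna()                 # default how="any"
all_dropped = pairs.dropna(how="all")        # drop only fully empty rows
subset_dropped = pairs.dropna(subset=["a"])  # look at column "a" only

print(any_dropped.index.tolist())
print(all_dropped.index.tolist())
print(subset_dropped.index.tolist())
```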

- The `.fillna()` method

We can also use the .fillna() method to fill in the missing values. This method takes a `value parameter` (value: scalar, dict, Series, or DataFrame) which will be used to fill the NaN values. The **.mean(), .median(), .mode()** etc. methods may come in handy when defining the value parameter. To propagate neighbouring values instead, use the `.ffill()` and `.bfill()` methods; passing `method="ffill"` or `method="bfill"` to .fillna() is deprecated in recent pandas versions (2.1 and later).
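As an illustration, the sketch below fills each column with its own value through a dict, and forward-fills one column. The `people` frame and the `"unknown"` placeholder are assumptions made for this example:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one gap in each column
people = pd.DataFrame(
    {
        "age": [20.0, np.nan, 40.0],
        "city": ["Oslo", None, "Lima"],
    }
)

# A dict lets each column get its own fill value
filled = people.fillna({"age": people["age"].median(), "city": "unknown"})

# Forward fill: carry the last valid value down
age_forward = people["age"].ffill()

print(filled)
print(age_forward.tolist())
```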

- The `.interpolate()` method

This will replace NaN values with values interpolated from the data around them. This method comes in very handy when dealing with ordered data such as time series data.
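A minimal sketch of the default linear interpolation on a made-up daily series (`readings` is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings with a two-day gap
readings = pd.DataFrame(
    {"temp": [10.0, np.nan, np.nan, 40.0]},
    index=pd.date_range("2024-01-01", periods=4, freq="D"),
)

# The default method="linear" treats the rows as equally spaced
interpolated = readings.interpolate()
print(interpolated)
```

The two missing days are filled with evenly spaced values between the known endpoints.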

- The `.where(cond, other)` method

Although not specific to handling missing values, this method is a powerful tool for doing just that. It **keeps values where the condition is True and replaces values where the condition is False with the corresponding value from 'other'** (NaN by default).

- The `.mask(cond, other)` method

Opposite of the .where() method in the sense that, this method will replace values **where the condition is True** with the corresponding value from 'other'. Equivalent to, **.where(~cond, other)**.
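The two methods can be compared side by side. In this sketch the sentinel value `-999.0` and the `sensor` frame are assumptions, standing in for a dataset that encodes missing readings as a magic number:

```python
import numpy as np
import pandas as pd

# Hypothetical sensor data where -999.0 marks a missing reading
sensor = pd.DataFrame(
    {"temp": [21.0, -999.0, 19.5], "humidity": [40.0, 55.0, -999.0]}
)

# where: keep values where the condition is True, replace the rest
cleaned = sensor.where(sensor != -999.0, np.nan)

# mask: replace values where the condition is True
same = sensor.mask(sensor == -999.0, np.nan)

print(cleaned)
```

Both calls produce the same frame; which one reads better depends on whether the condition describes the values to keep or the values to replace.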

**`CAUTION:`** The data in each column of a dataframe usually represents different things. Thus applying methods such as .dropna(), .fillna(), .interpolate() to the whole dataframe at once is rarely logical and rarely does any good (this is like using a spoon for woodworking).

**So, the best approach is to treat each column as a separate Series object, clean and modify it, and then add it to or replace it in the dataframe using the .assign() method.**

### <a id='toc8_3_'></a>[*Handling duplicate data*](#toc0_)

- The `.duplicated(subset, keep)` method will return a boolean Series denoting duplicate rows

    - subset : column label or sequence of labels, optional. Only consider certain columns for identifying duplicates, by default uses all of the columns.

    - keep : {'first', 'last', False}. Determines which duplicates (if any) to mark.

        - first (default) : Mark duplicates as True except for the first occurrence.
        - last : Mark duplicates as True except for the last occurrence.
        - False : Mark all duplicates as True.


- The `.drop_duplicates(subset=None, keep='first', ignore_index=False)` method

If called without any parameters, it will drop only the rows that are complete copies of an earlier row, keeping the first occurrence. The subset parameter lets us specify which columns to check when looking for duplicates.
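The following sketch shows how `subset` and `keep` interact on a made-up frame (`visits` is hypothetical):

```python
import pandas as pd

# Hypothetical frame: rows 0 and 2 are identical, row 3 repeats only the name
visits = pd.DataFrame(
    {
        "name": ["Ann", "Bob", "Ann", "Ann"],
        "city": ["Oslo", "Lima", "Oslo", "Pune"],
    }
)

dup_rows = visits.duplicated()                            # full-row duplicates
dup_names = visits.duplicated(subset="name", keep=False)  # every repeated name
unique_rows = visits.drop_duplicates()                    # drops row 2 only
last_per_name = visits.drop_duplicates(
    subset="name", keep="last", ignore_index=True
)

print(dup_rows.tolist())
print(dup_names.tolist())
print(last_per_name)
```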

--------------------------------------------

## <a id='toc9_'></a>[Sorting Columns and Indexes](#toc0_)

---------------------------------------------

#### <a id='toc9_1_1_'></a>[Setting indexes: The `.set_index()` method](#toc0_)

Returns a dataframe with the new index.

<u> Parameters -- </u>

- **keys**: column(s) to be set as index.
- **drop = True** : default True. Indicates whether to remove columns used for the index.
- **verify_integrity = False** : check for duplicate index values by setting verify_integrity=True.
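
A minimal sketch of these parameters, assuming a made-up `staff` dataframe (hypothetical names and values):

```python
import pandas as pd

# Hypothetical toy data: emp_id is unique, name is not.
staff = pd.DataFrame(
    {
        "emp_id": [101, 102, 103],
        "name": ["Ann", "Ben", "Ann"],
        "dept": ["ops", "dev", "dev"],
    }
)

# Default drop=True: emp_id moves into the index and leaves the columns.
by_id = staff.set_index("emp_id")

# drop=False keeps emp_id as a regular column as well.
by_id_kept = staff.set_index("emp_id", drop=False)

# verify_integrity=True raises a ValueError when the new index has duplicates.
duplicate_error = None
try:
    staff.set_index("name", verify_integrity=True)
except ValueError as err:
    duplicate_error = err

print(by_id.columns.tolist(), duplicate_error)
```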

#### <a id='toc9_1_2_'></a>[Sorting indexes: The `.sort_index()` method](#toc0_)

<u> Parameters -- </u>

- **axis = 0**: This method will return dataframe with index (axis=0) or columns (axis=1) sorted.  
- **ascending = True**: default True.
- **key = None**:  A key function accepts an index and should return an index. For multi-level indexes, each index is passed in independently to the function.

This operation is usually done after setting a new index. If the new index is of **string type**, then **sorting it makes slicing on the index reliable**. On an unsorted index, slicing can throw a KeyError, for example when a slice bound is a duplicated label or does not exist in the index.
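
A minimal sketch of why sorting matters for slicing, on a made-up dataframe whose string index is unsorted and contains a duplicate (the `scores` data is hypothetical):

```python
import pandas as pd

# Hypothetical toy data: "alpha" appears twice and the labels are not sorted.
scores = pd.DataFrame(
    {"score": [3, 1, 4, 2]},
    index=["delta", "alpha", "charlie", "alpha"],
)

# Slicing from a duplicated label on the unsorted index raises a KeyError.
slice_error = None
try:
    scores.loc["alpha":"charlie"]
except KeyError as err:
    slice_error = err

# After sorting, the same slice works and includes both end points.
sorted_scores = scores.sort_index()
sliced = sorted_scores.loc["alpha":"charlie"]

# On a sorted index the bounds do not even have to be existing labels.
partial = sorted_scores.loc["a":"c"]

print(sliced)
print(partial)
```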

#### <a id='toc9_1_3_'></a>[Sorting values: The `.sort_values()` method](#toc0_)

<u> Parameters -- </u>

- **by**: column name or a list of names to sort by.
- **ascending = True**: bool or list of bool, default True.
- **key = None**: Apply the key function to the values before sorting. It will be applied to each column in `by` independently. A key function accepts a series and should return a series with the same shape as the input. 

In [46]:
siena_2018.sort_values(by=["Quartile_rank", "Intelligence"], ascending=[True, False])

Unnamed: 0,Seq,President,Party,Background,Imagination,Integrity,Intelligence,...,Domestic_accomplishments,Foreign_policy_accomplishments,Avoid_crucial_mistakes,Experts’_view,Overall,Average_rank,Quartile_rank
11,11,James K. Polk,Democratic,19,10,23,23,...,12,8,8,13,12,11,1st
32,33,Harry S. Truman,Democratic,31,16,9,21,...,7,4,9,7,9,9,1st
5,5,James Monroe,Democratic-Republican,9,14,11,18,...,10,5,6,9,8,8,1st
33,34,Dwight D. Eisenhower,Republican,11,18,5,17,...,8,7,3,6,6,6,1st
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21,21,Chester A. Arthur,Republican,41,31,37,36,...,25,32,23,31,34,34,4th
23,23,Benjamin Harrison,Republican,33,34,30,35,...,32,29,29,33,35,36,4th
10,10,John Tyler,Independent,34,33,35,34,...,36,26,32,36,37,37,4th
30,31,Herbert Hoover,Republican,13,35,15,13,...,39,33,40,35,36,35,4th


In [47]:
siena_2018.President.str.split()[1]

['George', 'Washington']

**`?`** Say we wanted to sort by the last name of the presidents. In this case we can use the `key` parameter to pass a function that extracts the last name from the full name.

Here we can use the apply method, and this is an appropriate application of apply since we are working with strings. (The `.str[-1]` accessor on the split result would give the same last names without apply.)

In [48]:
siena_2018.sort_values(
    by=["President"],
    key=lambda byCol_: byCol_.str.split().apply(lambda val_lst: val_lst[-1]),
)

Unnamed: 0,Seq,President,Party,Background,Imagination,Integrity,Intelligence,...,Domestic_accomplishments,Foreign_policy_accomplishments,Avoid_crucial_mistakes,Experts’_view,Overall,Average_rank,Quartile_rank
2,2,John Adams,Federalist,3,13,4,4,...,19,13,16,10,14,13,2nd
6,6,John Quincy Adams,Democratic-Republican,1,9,6,5,...,21,15,14,18,18,18,2nd
21,21,Chester A. Arthur,Republican,41,31,37,36,...,25,32,23,31,34,34,4th
15,15,James Buchanan,Democratic,36,43,40,39,...,44,43,44,44,43,43,4th
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
44,45,Donald Trump,Republican,43,40,44,44,...,40,42,41,42,42,42,4th
10,10,John Tyler,Independent,34,33,35,34,...,36,26,32,36,37,37,4th
1,1,George Washington,Independent,7,7,1,10,...,2,2,1,2,1,1,1st
27,28,Woodrow Wilson,Democratic,8,8,19,7,...,14,11,25,15,11,12,2nd


--------------------------

## <a id='toc10_'></a>[Indexing & Filtering](#toc0_)

------------------------

### <a id='toc10_1_'></a>[`->` **Indexing**](#toc0_)

#### <a id='toc10_1_1_'></a>[*Renaming index labels: The `.rename()` method*](#toc0_)

<u> Parameters: </u>
- **mapper**: Dict-like or function transformations to apply to the specified axis' values. In case of a function, pass the function object itself (its name, without calling it); pandas will call it on each label.
- **axis**: index (0) or columns (1).

In [49]:
# say, we would like to set the president name as our index and use initial for first name and not the full name
def name_to_initial(val):
    vals = val.split(" ")
    return " ".join(
        [f"{vals[0][0]}.", *vals[1:]]
    )  # unpack the items in the vals[1:] list


siena_2018.set_index("President").rename(
    name_to_initial
)  # or, lambda name_: " ".join([f'{name_.split()[0][0]}.', *name_.split()[1:]])

Unnamed: 0_level_0,Seq,Party,Background,Imagination,Integrity,Intelligence,Luck,...,Domestic_accomplishments,Foreign_policy_accomplishments,Avoid_crucial_mistakes,Experts’_view,Overall,Average_rank,Quartile_rank
President,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
G. Washington,1,Independent,7,7,1,10,1,...,2,2,1,2,1,1,1st
J. Adams,2,Federalist,3,13,4,4,24,...,19,13,16,10,14,13,2nd
T. Jefferson,3,Democratic-Republican,2,2,14,1,8,...,6,9,7,5,5,5,1st
J. Madison,4,Democratic-Republican,4,6,7,3,16,...,11,19,11,8,7,7,1st
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
B. Clinton,42,Democratic,21,12,39,8,11,...,9,18,30,14,15,14,2nd
G. W. Bush,43,Republican,17,29,33,41,21,...,30,38,36,34,33,33,3rd
B. Obama,44,Democratic,24,11,13,9,15,...,13,20,10,11,17,17,2nd
D. Trump,45,Republican,43,40,44,44,10,...,40,42,41,42,42,42,4th


#### <a id='toc10_1_2_'></a>[*Resetting index labels to monotonically increasing integers: The `.reset_index()` method*](#toc0_)

In [50]:
siena_2018.set_index("President").reset_index()

Unnamed: 0,President,Seq,Party,Background,Imagination,Integrity,Intelligence,...,Domestic_accomplishments,Foreign_policy_accomplishments,Avoid_crucial_mistakes,Experts’_view,Overall,Average_rank,Quartile_rank
0,George Washington,1,Independent,7,7,1,10,...,2,2,1,2,1,1,1st
1,John Adams,2,Federalist,3,13,4,4,...,19,13,16,10,14,13,2nd
2,Thomas Jefferson,3,Democratic-Republican,2,2,14,1,...,6,9,7,5,5,5,1st
3,James Madison,4,Democratic-Republican,4,6,7,3,...,11,19,11,8,7,7,1st
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
40,Bill Clinton,42,Democratic,21,12,39,8,...,9,18,30,14,15,14,2nd
41,George W. Bush,43,Republican,17,29,33,41,...,30,38,36,34,33,33,3rd
42,Barack Obama,44,Democratic,24,11,13,9,...,13,20,10,11,17,17,2nd
43,Donald Trump,45,Republican,43,40,44,44,...,40,42,41,42,42,42,4th


#### <a id='toc10_1_3_'></a>[*Indexing by Index labels: The `.loc[]` method*](#toc0_) [&#8593;](#toc0_)

The **`.loc[row indexer, column indexer]`** attribute is **primarily label based**, but may also be used with a boolean array.

Allowable inputs may be:
- **Scalar:** if either indexer is passed as a scalar label, it will return,
    - a dataframe if that label occurs more than once on the axis and,
    - a series if it occurs exactly once. This series will have,
        - the column labels as its index if the scalar selects a row (axis=0).
        - the row labels as its index if the scalar selects a column (axis=1).

    For it to return a dataframe in all cases we have to pass in the scalar as a list.
- **Array of labels**
- **Slice object:** Slicing with .loc includes both the start and the end. *Some notes:*
    - If the axis being sliced has unsorted duplicate index labels, we first need to sort the index with **.sort_index()**.
    - Slicing with string labels is only reliable on a sorted index.
    - Partial slicing can only be done on string types and not on categorical type.
- **A boolean array:** of the same length as the indexing axis.
- **A callable function:** that returns one of the above.
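
A minimal sketch of each kind of input, on a made-up dataframe with a sorted, unique string index (the `pets` data is hypothetical):

```python
import pandas as pd

# Hypothetical toy data with a sorted, unique string index.
pets = pd.DataFrame(
    {"species": ["cat", "dog", "cat"], "age": [3, 5, 1]},
    index=["ava", "bo", "cy"],
)

# Scalar label: a series indexed by the column labels.
row_as_series = pets.loc["bo"]

# The same label inside a list: a one-row dataframe.
row_as_frame = pets.loc[["bo"]]

# Label slice: both "ava" and "bo" are included.
inclusive = pets.loc["ava":"bo", "age"]

# Boolean array for the rows, list of labels for the columns.
masked = pets.loc[pets.age > 2, ["species"]]

# Callable: receives the dataframe and returns a boolean series.
cats = pets.loc[lambda df_: df_.species == "cat"]

print(row_as_series, row_as_frame, inclusive, masked, cats, sep="\n\n")
```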

#### <a id='toc10_1_4_'></a>[*Indexing by Index positions: The `.iloc[]` method*](#toc0_) [&#8593;](#toc0_)

The **`.iloc[row indexer, column indexer]`** attribute operates on **integer positions and not index labels**. It can also be used with a boolean array.

Allowable inputs may be:
- **Scalar:** a scalar position always selects exactly one row or column, so it will return,
    - a series if only one of the indexers is a scalar. This series will have,
        - the column labels as its index if the scalar selects a row (axis=0).
        - the row labels as its index if the scalar selects a column (axis=1).
    - a single value if both indexers are scalars.

    For it to return a dataframe we have to pass in the scalar as a list.
- **Array of integer positions**
- **Slice object:** Slicing with .iloc includes the start but not the end, just like regular Python slicing. *Note:* since .iloc works purely by position, slicing does not require the index labels to be sorted or unique.
- **A boolean array:** of the same length as the indexing axis.
- **A callable function:** that returns one of the above.
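
A minimal sketch of positional indexing on a made-up dataframe (the `pets` data is hypothetical; its labels are deliberately unsorted and repeated, since positions ignore them):

```python
import pandas as pd

# Hypothetical toy data: the index labels are unsorted and "zed" appears twice.
pets = pd.DataFrame(
    {"species": ["cat", "dog", "cat", "owl"], "age": [3, 5, 1, 2]},
    index=["zed", "amy", "zed", "bob"],
)

# Scalar position: a series indexed by the column labels.
first_row = pets.iloc[0]

# The same position inside a list: a one-row dataframe.
first_row_frame = pets.iloc[[0]]

# Half-open slice: positions 1 and 2, position 3 is excluded.
half_open = pets.iloc[1:3]

# All rows, last column.
last_column = pets.iloc[:, -1]

# Boolean array (a NumPy array, not a labelled series).
older = pets.iloc[(pets.age > 2).to_numpy()]

# Callable returning a list of positions: first and last row.
ends = pets.iloc[lambda df_: [0, len(df_) - 1]]

print(half_open)
```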

### <a id='toc10_2_'></a>[`->` **Filtering**](#toc0_)

#### <a id='toc10_2_1_'></a>[*Filtering Index and Column Labels with `.filter(items, like, regex, axis)`*](#toc0_)

- **items** (passed as a list) is used for exact matches. Note that an exact match (with items) fails if the axis has duplicate labels, but a label that doesn't exist is silently ignored instead of throwing an error.
- **like** is used for substring matches.
- **regex** lets us specify a regular expression to match against index or column labels.
- **axis** specifies whether to filter the index (0) or the columns (1). For a dataframe the default is the columns.
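
A minimal sketch of the four parameters on a made-up dataframe (the `stats` data and its labels are hypothetical):

```python
import pandas as pd

# Hypothetical toy data with descriptive column names and country names as the index.
stats = pd.DataFrame(
    {
        "age_min": [15.0, 5.0, 19.0],
        "age_max": [40.0, 63.0, 49.0],
        "comp_mean": [21000.0, 34000.0, 10000.0],
    },
    index=["Albania", "Algeria", "Zambia"],
)

# Exact matches; the label that does not exist is silently skipped.
exact = stats.filter(items=["age_min", "not_a_column"])

# Substring match on the column labels.
age_cols = stats.filter(like="age")

# Regular expression: labels ending in _min or _max.
min_max_cols = stats.filter(regex=r"_m(in|ax)$")

# axis=0 filters the index labels instead of the columns.
al_rows = stats.filter(like="Al", axis=0)

print(al_rows)
```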

#### <a id='toc10_2_2_'></a>[*Filtering with boolean arrays (Boolean Masking)*](#toc0_)

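Boolean arrays can be used to filter data from a dataframe. Complex filters can be built by combining comparison operators (such as <, >, ==) with the element-wise logical operators & (and), | (or) and ~ (not). Note that you can't use the plain Python keywords *or, and, not* here, because they try to reduce a whole series to a single True/False value.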

In [51]:
# let's select the presidents who were Republican and have an average rank < 10.
try:
    siena_2018[siena_2018.Average_rank < 10 & siena_2018.Party == "Republican"]
except TypeError as err:
    print(err)

unsupported operand type(s) for &: 'int' and 'Categorical'


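The takeaway is that you should always put parentheses around each condition when you combine them inline, because & and | bind more tightly than comparison operators such as < and ==. Without the parentheses, Python tried to evaluate `10 & siena_2018.Party` first, which is what raised the error above.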

Now let's do this properly.

In [52]:
siena_2018[(siena_2018.Average_rank < 10) & (siena_2018.Party == "Republican")]

Unnamed: 0,Seq,President,Party,Background,Imagination,Integrity,Intelligence,...,Domestic_accomplishments,Foreign_policy_accomplishments,Avoid_crucial_mistakes,Experts’_view,Overall,Average_rank,Quartile_rank
16,16,Abraham Lincoln,Republican,28,1,2,2,...,1,6,2,1,3,3,1st
25,26,Theodore Roosevelt,Republican,5,4,8,6,...,4,3,5,4,4,4,1st
33,34,Dwight D. Eisenhower,Republican,11,18,5,17,...,8,7,3,6,6,6,1st


#### <a id='toc10_2_3_'></a>[*Using `functions with .loc` (for filtering)*](#toc0_)

The main advantage of using functions with .loc is that the function receives the current state of the dataframe as input. This is especially useful when multiple operations are chained together.

It is also possible to filter rows and select specific columns simultaneously.

In [53]:
# let us select presidents with average rank < 10 and return first 3 columns of data about them
siena_2018.loc[
    siena_2018.Average_rank < 10, lambda df_: df_.columns[:3]
]  # [:3] picks Seq, President and Party; the index is not part of df_.columns

Unnamed: 0,Seq,President,Party
1,1,George Washington,Independent
3,3,Thomas Jefferson,Democratic-Republican
4,4,James Madison,Democratic-Republican
5,5,James Monroe,Democratic-Republican
...,...,...,...
25,26,Theodore Roosevelt,Republican
31,32,Franklin D. Roosevelt,Democratic
32,33,Harry S. Truman,Democratic
33,34,Dwight D. Eisenhower,Republican


In [54]:
# the same can be achieved by the following section of code
siena_2018[siena_2018.Average_rank < 10].iloc[:, :3]

Unnamed: 0,Seq,President,Party
1,1,George Washington,Independent
3,3,Thomas Jefferson,Democratic-Republican
4,4,James Madison,Democratic-Republican
5,5,James Monroe,Democratic-Republican
...,...,...,...
25,26,Theodore Roosevelt,Republican
31,32,Franklin D. Roosevelt,Democratic
32,33,Harry S. Truman,Democratic
33,34,Dwight D. Eisenhower,Republican


#### <a id='toc10_2_4_'></a>[*Filtering with the `.query()` method*](#toc0_)

- Instead of using boolean arrays in combination with .loc[], we can use the .query() method. Unlike with boolean arrays, we can use both the plain 'and', 'or', 'not' keywords and the operator forms such as &, |, ~. We also don't need to worry as much about precedence and parentheses.
- In the .query() method we use a string to express our conditions, similar to SQL. One of the powerful aspects of .query() is that we can `access external variables using the @ sign as prefix` from inside the string. So we don't need string formatting or concatenation to implement complex logic in our search.
- `To access a column of the dataframe, just use the name of the column`.
- `To match a string literal, pass it in as a string (within quote marks) as you would in any other situation.`

In [55]:
# to do the same filtering as we've done in the filtering with boolean arrays section
lt10 = siena_2018.Average_rank < 10
# siena_2018.query("Average_rank < 10 and Party == 'Republican'")
siena_2018.query('@lt10 and Party == "Republican"')

Unnamed: 0,Seq,President,Party,Background,Imagination,Integrity,Intelligence,...,Domestic_accomplishments,Foreign_policy_accomplishments,Avoid_crucial_mistakes,Experts’_view,Overall,Average_rank,Quartile_rank
16,16,Abraham Lincoln,Republican,28,1,2,2,...,1,6,2,1,3,3,1st
25,26,Theodore Roosevelt,Republican,5,4,8,6,...,4,3,5,4,4,4,1st
33,34,Dwight D. Eisenhower,Republican,11,18,5,17,...,8,7,3,6,6,6,1st


Both `.query()` and boolean masks with `.loc[]` are effective methods for filtering data in pandas. The choice between them depends on factors such as readability, performance, and your personal coding preferences. Let's compare both approaches:

1. **`.query()` Method:**
   - **Advantages:**
     - Readability: `.query()` allows you to write filtering expressions in a more SQL-like syntax, which can be more intuitive for some users.
     - Avoidance of repetitive DataFrame name: You don't need to repeat the DataFrame name within the query expression.
   - **Considerations:**
     - Python variables need the `@` prefix: a variable can't be referenced by its bare name within the query, because bare names are interpreted as column names.
     - Limited to string expressions: the condition is written in terms of column names, and more complex operations might be easier with boolean masks.
     - The .query() method only filters rows; it doesn't support column selection. This is very important to keep in mind when filtering data with the .query() method.


2. **Boolean Masks with `.loc[]`:**
   - **Advantages:**
     - Flexibility: You can use Python variables and more complex conditions within boolean masks, providing more fine-grained control over filtering.
     - Compatibility with other operations: Boolean masks can easily be used with other DataFrame operations like grouping and aggregation.
   - **Considerations:**
     - Slightly more verbose: Boolean mask expressions can become longer when compared to concise `.query()` expressions.

For simple filtering scenarios, both methods can work well. `.query()` is often favored when the filtering conditions are straightforward and you want a more human-readable syntax. However, if you need to use complex conditions involving variables, multiple columns, or other DataFrame operations, boolean masks with `.loc[]` offer greater flexibility. Also in more complex scenarios, especially those involving calculations or chaining multiple operations, boolean masks with `.loc[]` are often a better choice due to their versatility.

Ultimately, it's a matter of preference and context. You can even mix and match both methods within your codebase, using the one that suits each situation best.
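
To make the comparison concrete, here is a minimal sketch that runs the same filter both ways on a small hand-built dataframe (the five rows reuse values shown in the outputs above; the `presidents` and `max_rank` names are assumptions for illustration):

```python
import pandas as pd

# Small hand-built sample with three of the columns from siena_2018.
presidents = pd.DataFrame(
    {
        "President": [
            "Abraham Lincoln",
            "Theodore Roosevelt",
            "Harry S. Truman",
            "Herbert Hoover",
            "Donald Trump",
        ],
        "Party": ["Republican", "Republican", "Democratic", "Republican", "Republican"],
        "Average_rank": [3, 4, 9, 35, 42],
    }
)

max_rank = 10  # plain Python variable, referenced with @ inside the query string

# .query filters the rows; the columns are selected in a separate step.
via_query = presidents.query("Average_rank < @max_rank and Party == 'Republican'")[
    ["President", "Average_rank"]
]

# .loc filters the rows and selects the columns in a single step.
via_loc = presidents.loc[
    (presidents.Average_rank < max_rank) & (presidents.Party == "Republican"),
    ["President", "Average_rank"],
]

print(via_query.equals(via_loc))
```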

----------------------------------

## <a id='toc11_'></a>[Reshaping Dataframes (Grouping and Aggregating)](#toc0_)

----------------------------------

#### <a id='toc11_1_1_'></a>[Reshaping dataframes with `dummies`](#toc0_)

So, what are dummies? Dummy columns are one of the ways of converting a categorical column to multiple indicator columns. Each category in a column is converted to a column of its own. These columns are filled with 1 or 0 (displayed as True or False in recent pandas versions, as in the output below) based on whether that category was present in a particular row of data or not. To create dummy columns from a series (or a dataframe that has multiple string columns), call the `pd.get_dummies` function.

In [56]:
data = {
    "name": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "gender": ["Female", "Male", "Male", "Male", "Female"],
    "score": [75, 52, 89, 47, 63],
}

df = pd.DataFrame(data)

In [57]:
pd.get_dummies(df, columns=["gender"])

Unnamed: 0,name,score,gender_Female,gender_Male
0,Alice,75,True,False
1,Bob,52,False,True
2,Charlie,89,False,True
3,David,47,False,True
4,Eve,63,True,False
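
The output above shows True/False rather than 1/0. If numeric indicator columns are needed, the `dtype` parameter of `pd.get_dummies` can be used. A minimal sketch on the same toy data (the variable names are only for illustration):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["Alice", "Bob", "Charlie", "David", "Eve"],
        "gender": ["Female", "Male", "Male", "Male", "Female"],
        "score": [75, 52, 89, 47, 63],
    }
)

# dtype=int gives 1/0 indicator columns instead of booleans.
dummies_int = pd.get_dummies(df, columns=["gender"], dtype=int)

# drop_first=True drops the first category, which is redundant for a two-level column.
dummies_drop = pd.get_dummies(df, columns=["gender"], dtype=int, drop_first=True)

print(dummies_int)
print(dummies_drop)
```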


### <a id='toc11_2_'></a>[*The `pivot_table()` method*](#toc0_)

<u> **Parameters** </u>

- **index:** Keys to group by on the pivot table index (i.e, along axis 0).
- **columns:** Keys to group by on the pivot table column (i.e, along axis 1). The *unique values of the specified column(s) are converted to indexes and columns.* If multiple column names are specified then it will have a MultiIndex structure.
- **values:** The column(s) to apply aggregate function to. 
- **aggfunc:** Function or list of functions (default *mean*). If a list of functions is passed, the resulting pivot table will have hierarchical columns (whose top level are the function names, inferred from the function objects themselves). If dict is passed, the *key is column to aggregate and value is function or list of functions.*
- **fill_value:** Value to replace missing values with (in the resulting pivot table, after aggregation). If not defined, the missing values in the pivot table will be left as *NaN*.

**Note:** If both the columns and the values parameters are specified then,
- at first, the pivoted (modified) dataframe is laid out from the names provided to the index and columns parameters, i.e., the unique values in the supplied columns are converted to index labels and/or column labels.
- after that, the aggregation function is applied to the data of the columns supplied as values. A hierarchical column structure will be created (i.e., a MultiIndex along axis 1). The 'values' columns will be at the top of the MultiIndexed dataframe.

**`Note:`** `When you see ”for each” or ”by”, your mind should think that whatever is following either of the terms, should go in the index.`
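
Before moving to the real datasets, here is a minimal sketch of how index, columns, values and fill_value interact, on a made-up dataframe (the `cars` data is hypothetical and only borrows the column names year, make and city08):

```python
import pandas as pd

# Hypothetical toy data: make "A" has no rows for 2020.
cars = pd.DataFrame(
    {
        "year": [2019, 2019, 2019, 2020, 2020],
        "make": ["A", "A", "B", "B", "B"],
        "city08": [20, 24, 18, 19, 21],
    }
)

# "max city08 for each year, by make": year goes in the index, make in the columns.
wide = cars.pivot_table(index="year", columns="make", values="city08", aggfunc="max")

# The (2020, "A") combination has no data, so it is NaN unless fill_value is given.
wide_filled = cars.pivot_table(
    index="year", columns="make", values="city08", aggfunc="max", fill_value=0
)

# A list of functions puts the function names on the top column level.
two_funcs = cars.pivot_table(index="year", values="city08", aggfunc=["max", "min"])

print(wide)
print(wide_filled)
print(two_funcs)
```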

> Using a custom function to calculate percentage of Emacs users by country

In this case we still want country in the index, but we only want a single column, the percentage of emacs users. So we don’t provide a columns parameter.

In [58]:
def emacs_per(ser):
    return ser.str.contains("Emacs").mean() * 100

In [59]:
dev_survey.pivot_table(index="Country", values="DevEnviron", aggfunc=emacs_per).rename(
    {"DevEnviron": "% of Emacs users"}, axis=1
)

Unnamed: 0_level_0,% of Emacs users
Country,Unnamed: 1_level_1
Afghanistan,10.810811
Albania,1.190476
Algeria,1.550388
Andorra,14.285714
...,...
Viet Nam,3.083700
Yemen,0.000000
Zambia,0.000000
Zimbabwe,0.000000


> Say, we want to know the max mileage (both city08 and highway08 values) of the cars produced by different companies in each year.

In [60]:
# max mileage of cars produced by different companies in each year
max_mpg_year_manufac = vehicles.pivot_table(
    index=["year", "make"], values=["city08", "highway08"], aggfunc="max", fill_value=0
)
max_mpg_year_manufac

Unnamed: 0_level_0,Unnamed: 1_level_0,city08,highway08
year,make,Unnamed: 2_level_1,Unnamed: 3_level_1
1984,AM General,18,17
1984,Alfa Romeo,18,25
1984,American Motors Corporation,19,23
1984,Aston Martin,8,11
...,...,...,...
2020,Mitsubishi,25,30
2020,Nissan,19,26
2020,Subaru,21,27
2020,Toyota,55,53


In [61]:
# if we wanted to see how the max mileage evolved across the years for each manufacturer we can do,
max_mpg_manufac_year = vehicles.pivot_table(
    index=["make", "year"], values=["city08", "highway08"], fill_value=0, aggfunc="max"
)
max_mpg_manufac_year

Unnamed: 0_level_0,Unnamed: 1_level_0,city08,highway08
make,year,Unnamed: 2_level_1,Unnamed: 3_level_1
AM General,1984,18,17
AM General,1985,16,17
ASC Incorporated,1987,14,21
Acura,1986,23,28
...,...,...,...
smart,2016,122,93
smart,2017,124,94
smart,2018,124,94
smart,2019,124,94


> Multiple aggregations

In [62]:
# select_dtypes takes (include, exclude); here we keep only the float64 columns
vals = dev_survey.select_dtypes(include="float64").columns.to_list()

We will apply the aggregate functions below to each of the columns in the 'vals' list.

In [63]:
dev_survey.pivot_table(index="Country", values=vals, aggfunc=["max", "min"])

Unnamed: 0_level_0,max,max,max,max,max,min,min,min,min,min
Unnamed: 0_level_1,Age,CodeRevHrs,CompTotal,ConvertedComp,WorkWeekHrs,Age,CodeRevHrs,CompTotal,ConvertedComp,WorkWeekHrs
Country,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2
Afghanistan,85.0,90.0,648838511.0,1000000.0,168.0,1.0,1.0,1.0,0.0,1.0
Albania,40.0,20.0,2688000.0,187668.0,65.0,15.0,1.0,400.0,1320.0,8.0
Algeria,63.0,20.0,300000.0,1000000.0,168.0,5.0,1.0,0.0,0.0,6.0
Andorra,50.0,3.0,150000.0,171862.0,44.0,19.0,2.0,3000.0,150000.0,40.0
...,...,...,...,...,...,...,...,...,...,...
Viet Nam,99.0,99.0,365000000.0,140000.0,168.0,1.0,0.5,0.0,200.0,8.0
Yemen,39.0,45.0,300000.0,60000.0,90.0,22.0,1.0,5000.0,799.0,7.0
Zambia,49.0,40.0,150000.0,40524.0,75.0,19.0,5.0,100.0,400.0,40.0
Zimbabwe,46.0,20.0,70000.0,180000.0,96.0,20.0,2.0,75.0,900.0,7.0


> Per column aggregations (Say, we wanted to know what are the minimum and maximum ages and the average compensation for each country?)

In [64]:
dev_survey.pivot_table(
    index="Country", aggfunc={"Age": ["min", "max"], "ConvertedComp": ["mean"]}
)

Unnamed: 0_level_0,Age,Age,ConvertedComp
Unnamed: 0_level_1,max,min,mean
Country,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
Afghanistan,85.0,1.0,101953.333333
Albania,40.0,15.0,21833.700000
Algeria,63.0,5.0,34924.047619
Andorra,50.0,19.0,160931.000000
...,...,...,...
Viet Nam,99.0,1.0,17233.436782
Yemen,39.0,22.0,16909.166667
Zambia,49.0,19.0,10075.375000
Zimbabwe,46.0,20.0,34046.666667


### <a id='toc11_3_'></a>[*The `.groupby()` method*](#toc0_)

A groupby operation splits the data into groups. You can apply aggregate functions to the group. Then the results of the aggregates are combined. The column we are grouping by will be placed in the index. 

<u>**Parameters**</u>

- **by:** used for determining the groups for the groupby. If a list of labels is passed, the aggregated result will have a MultiIndex.

The groupby operation can be summarized as, *Split-Apply-Combine*.

An example of GroupBy operation is shown in the following picture (in which the data is grouped by the 1st column, x and the applied aggregate function is mean) -- 

<img src="./groupby_demo.png">

> Say, we want to know the max mileage (both city08 and highway08 values) of the cars produced by different companies in each year.

In [65]:
# Without unstack()
vehicles.groupby(["year", "make"]).agg({"city08": "max", "highway08": "max"})

Unnamed: 0_level_0,Unnamed: 1_level_0,city08,highway08
year,make,Unnamed: 2_level_1,Unnamed: 3_level_1
1984,AM General,18,17
1984,Alfa Romeo,18,25
1984,American Motors Corporation,19,23
1984,Aston Martin,8,11
...,...,...,...
2020,Mitsubishi,25,30
2020,Nissan,19,26
2020,Subaru,21,27
2020,Toyota,55,53


<u>**Note:**</u> The `.unstack()` method can be called on the aggregated result of a groupby (not on the groupby object itself) to pull the inner-most index level to the inner-most column level.
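
A minimal sketch of `.unstack()` on a made-up dataframe (the `cars` data is hypothetical and only borrows the column names year, make and city08):

```python
import pandas as pd

# Hypothetical toy data: make "A" has no rows for 2020.
cars = pd.DataFrame(
    {
        "year": [2019, 2019, 2019, 2020, 2020],
        "make": ["A", "A", "B", "B", "B"],
        "city08": [20, 24, 18, 19, 21],
    }
)

# Long form: one row per (year, make) pair, with a two-level index.
grouped_max = cars.groupby(["year", "make"]).agg({"city08": "max"})

# Wide form: the inner index level (make) becomes the inner column level.
# fill_value replaces the combination that has no data.
wide = grouped_max.unstack(fill_value=0)

print(grouped_max)
print(wide)
```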

- ##### Named Aggregations of groupby objects

*When calling the .agg method on a groupby object, we can use a keyword parameter and pass in a tuple of ("column", "agg func to apply") as its value. The keyword parameter will be turned into a **(flattened)** column name.*

<u>**Note:**</u> This is special to groupby and no equivalent is present in the pivot_table() method.

> Say, we wanted to know what are the minimum and maximum ages and the average compensation for each country? And we also want to name the columns, 'Age_min', 'Age_max', 'mean_ConvertedComp'.

In [66]:
dev_survey.groupby(by=["Country"]).agg(
    Age_min=("Age", "min"),
    Age_max=("Age", "max"),
    mean_ConvertedComp=("ConvertedComp", "mean"),
)

Unnamed: 0_level_0,Age_min,Age_max,mean_ConvertedComp
Country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Afghanistan,1.0,85.0,101953.333333
Albania,15.0,40.0,21833.700000
Algeria,5.0,63.0,34924.047619
Andorra,19.0,50.0,160931.000000
...,...,...,...
Viet Nam,1.0,99.0,17233.436782
Yemen,22.0,39.0,16909.166667
Zambia,19.0,49.0,10075.375000
Zimbabwe,20.0,46.0,34046.666667


- ##### The `get_group()` method of a groupby object

In [67]:
# For a multiindex object we need to pass in a tuple as the argument to the get_group method
vehicles.groupby(["year", "make"]).get_group((1984, "Aston Martin"))

Unnamed: 0,barrels08,barrelsA08,charge120,charge240,city08,city08U,cityA08,...,c240bDscr,createdOn,modifiedOn,startStop,phevCity,phevHwy,phevComb
18258,36.623333,0.0,0.0,0.0,8,0.0,0,...,,Tue Jan 01 00:00:00 EST 2013,Tue Jan 01 00:00:00 EST 2013,,0,0,0
18261,36.623333,0.0,0.0,0.0,8,0.0,0,...,,Tue Jan 01 00:00:00 EST 2013,Tue Jan 01 00:00:00 EST 2013,,0,0,0
18262,41.20125,0.0,0.0,0.0,7,0.0,0,...,,Tue Jan 01 00:00:00 EST 2013,Tue Jan 01 00:00:00 EST 2013,,0,0,0
19500,36.623333,0.0,0.0,0.0,8,0.0,0,...,,Tue Jan 01 00:00:00 EST 2013,Tue Jan 01 00:00:00 EST 2013,,0,0,0
19501,36.623333,0.0,0.0,0.0,8,0.0,0,...,,Tue Jan 01 00:00:00 EST 2013,Tue Jan 01 00:00:00 EST 2013,,0,0,0
19502,41.20125,0.0,0.0,0.0,7,0.0,0,...,,Tue Jan 01 00:00:00 EST 2013,Tue Jan 01 00:00:00 EST 2013,,0,0,0


##### *The `.transform()` and `.filter()` methods on a Groupby object*

We often group and aggregate. This returns the result with the aggregated index. But sometimes we want to get the results in terms of the original index, not the aggregated index. This way we can easily add the returned series or dataframe to the original dataframe (*using the .assign method*). 

There are two specific methods that work on **groupby** objects and allow us to group and aggregate while keeping the original index. The **pivot_table** method can't do that. This is one of the reasons why some developers favour groupby over pivot_table.

- ##### The Groupby `.transform(func)` method

The `.transform` method allows us to preserve the original index while giving us the ability to apply whatever aggregation we want to the grouped object (either with existing aggregation functions or with our own if we need), thus increasing flexibility and functionality.

In other words, `<groupby_object>.transform` is similar to `<groupby_object>.agg` but it returns the result against the original index instead of the group index.
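
The difference is easiest to see on a small made-up dataframe (the `survey` data is hypothetical; one row deliberately has no country):

```python
import pandas as pd

# Hypothetical toy data: row 3 has a missing Country.
survey = pd.DataFrame(
    {
        "Country": ["India", "Spain", "India", None, "Spain", "India"],
        "Age": [20.0, 30.0, 40.0, 22.0, 40.0, 30.0],
    }
)

# .agg: one value per group, indexed by the group labels.
per_group = survey.groupby("Country").Age.agg("mean")

# .transform: one value per original row, indexed like the original dataframe.
# Rows whose group key is missing belong to no group and get NaN.
per_row = survey.groupby("Country").Age.transform("mean")

# Because the index matches, the result can be attached directly with .assign.
with_mean = survey.assign(country_mean_age=per_row)

print(per_group)
print(with_mean)
```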

> `(?)` Say, we want to know how many respondents there were from each country in the dev_survey df. And we want to add that result at the end of the original dev_survey df in a column named 'total_res_from_this_country'.

Let's do this with the help of the `.transform` and `.assign` methods.

Since we want to know the total number of responses from each country, we can first group by the Country column and then get the size of any one column in that group.

In [68]:
# first let's see the .transform method in action
dev_survey.groupby("Country").Age.transform("size")

0         5737.0
1          108.0
2          214.0
3        20949.0
          ...   
88879        NaN
88880        NaN
88881        NaN
88882     1604.0
Name: Age, Length: 88883, dtype: float64

Looks like there are some rows for which the transform method returned NaN. Let's explore why that is.

Our first guess is that there's something wrong in the "Country" column. The guess is based on the fact that it is the column we used for the group by.

In [69]:
dev_survey.Country.isna().sum()

132

Looks like there are some rows that do not have a "Country" value, and as a result those rows are not assigned to any group. For now let's drop the rows that don't have a Country in the new dataframe.

In [70]:
filt = ~dev_survey.Country.isna()

**Note:** We could've used the `.query` method instead of boolean masking. In fact, most of the time using .query is preferred, but since the condition here is very basic we have used a boolean array for filtering.

In [71]:
# now let's use the assign method to append a column to the dev_survey df
dev_survey[filt].assign(
    total_res_from_this_country=dev_survey.groupby("Country").Age.transform("size")
).set_index("Country").sort_index().total_res_from_this_country

Country
Afghanistan    44.0
Afghanistan    44.0
Afghanistan    44.0
Afghanistan    44.0
               ... 
Zimbabwe       39.0
Zimbabwe       39.0
Zimbabwe       39.0
Zimbabwe       39.0
Name: total_res_from_this_country, Length: 88751, dtype: float64

A list of strings that the groupby transform method accepts as functions -

<img src=groupby_transform_method_func_strings.png>

- ##### The Groupby `.filter(func)` method

The `.filter` method allows us to filter based on aggregated data but keep the original index.

**The `.filter` method accepts a function that takes the current group**. If the function returns True (it must return a scalar, not a series or dataframe), the rows are kept for the result.
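
A minimal sketch on a made-up dataframe before we apply this to the survey data (the `survey` data and the `min_size` threshold are hypothetical):

```python
import pandas as pd

# Hypothetical toy data: Chile has a single respondent.
survey = pd.DataFrame(
    {
        "Country": ["India", "Spain", "India", "Chile", "Spain", "India"],
        "Age": [20.0, 30.0, 40.0, 22.0, 40.0, 30.0],
    }
)

min_size = 2  # hypothetical threshold: keep countries with at least 2 rows

# The function receives each group as a dataframe and must return a single bool.
kept = survey.groupby("Country").filter(lambda grp_: len(grp_) >= min_size)

print(kept)
```

The rows of every group that passes the test come back with their original index labels and in their original order.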

> `(?)` Say, we want to remove any row from the dev_survey dataframe where the number of respondents from its country is less than the median number of respondents per country.

First let's try to do this **with our existing pandas knowledge**.

In [72]:
# first let's find out the median size
mdn_size = dev_survey.Country.value_counts().median()
mdn_size

54.0

In [73]:
# a list of the countries to be removed
filt = dev_survey.Country.value_counts() < mdn_size
countries_to_remove = dev_survey.Country.value_counts().index[filt].to_list()
countries_to_remove
country_nan = np.nan

In [74]:
dev_survey.query("~Country.isin(@countries_to_remove) and Country.isna() == False")

Unnamed: 0,Respondent,MainBranch,Hobbyist,OpenSourcer,OpenSource,Employment,Country,...,Gender,Trans,Sexuality,Ethnicity,Dependents,SurveyLength,SurveyEase
0,1,I am a student who is learning to code,Yes,Never,The quality of OSS and closed source software ...,"Not employed, and not looking for work",United Kingdom,...,Man,No,Straight / Heterosexual,,No,Appropriate in length,Neither easy nor difficult
1,2,I am a student who is learning to code,No,Less than once per year,The quality of OSS and closed source software ...,"Not employed, but looking for work",Bosnia and Herzegovina,...,Man,No,Straight / Heterosexual,,No,Appropriate in length,Neither easy nor difficult
2,3,"I am not primarily a developer, but I write co...",Yes,Never,The quality of OSS and closed source software ...,Employed full-time,Thailand,...,Man,No,Straight / Heterosexual,,Yes,Appropriate in length,Neither easy nor difficult
3,4,I am a developer by profession,No,Never,The quality of OSS and closed source software ...,Employed full-time,United States,...,Man,No,Straight / Heterosexual,White or of European descent,No,Appropriate in length,Easy
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
88876,88212,,No,Less than once per year,"OSS is, on average, of HIGHER quality than pro...",Employed full-time,Spain,...,Man,No,Straight / Heterosexual,White or of European descent,No,Appropriate in length,Easy
88877,88282,,Yes,Once a month or more often,The quality of OSS and closed source software ...,"Not employed, but looking for work",United States,...,Man,No,Straight / Heterosexual,,No,Too short,Neither easy nor difficult
88878,88377,,Yes,Less than once a month but more than once per ...,The quality of OSS and closed source software ...,"Not employed, and not looking for work",Canada,...,Man,No,,,No,Appropriate in length,Easy
88882,88863,,Yes,Less than once per year,"OSS is, on average, of HIGHER quality than pro...","Not employed, and not looking for work",Spain,...,Man,No,Straight / Heterosexual,Hispanic or Latino/Latina;White or of European...,No,Appropriate in length,Easy


Now let's do this **with the filter method** for groupby objects.

In [75]:
dev_survey.groupby("Country").filter(lambda grp_: grp_.Age.size >= mdn_size)

Unnamed: 0,Respondent,MainBranch,Hobbyist,OpenSourcer,OpenSource,Employment,Country,...,Gender,Trans,Sexuality,Ethnicity,Dependents,SurveyLength,SurveyEase
0,1,I am a student who is learning to code,Yes,Never,The quality of OSS and closed source software ...,"Not employed, and not looking for work",United Kingdom,...,Man,No,Straight / Heterosexual,,No,Appropriate in length,Neither easy nor difficult
1,2,I am a student who is learning to code,No,Less than once per year,The quality of OSS and closed source software ...,"Not employed, but looking for work",Bosnia and Herzegovina,...,Man,No,Straight / Heterosexual,,No,Appropriate in length,Neither easy nor difficult
2,3,"I am not primarily a developer, but I write co...",Yes,Never,The quality of OSS and closed source software ...,Employed full-time,Thailand,...,Man,No,Straight / Heterosexual,,Yes,Appropriate in length,Neither easy nor difficult
3,4,I am a developer by profession,No,Never,The quality of OSS and closed source software ...,Employed full-time,United States,...,Man,No,Straight / Heterosexual,White or of European descent,No,Appropriate in length,Easy
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
88876,88212,,No,Less than once per year,"OSS is, on average, of HIGHER quality than pro...",Employed full-time,Spain,...,Man,No,Straight / Heterosexual,White or of European descent,No,Appropriate in length,Easy
88877,88282,,Yes,Once a month or more often,The quality of OSS and closed source software ...,"Not employed, but looking for work",United States,...,Man,No,Straight / Heterosexual,,No,Too short,Neither easy nor difficult
88878,88377,,Yes,Less than once a month but more than once per ...,The quality of OSS and closed source software ...,"Not employed, and not looking for work",Canada,...,Man,No,,,No,Appropriate in length,Easy
88882,88863,,Yes,Less than once per year,"OSS is, on average, of HIGHER quality than pro...","Not employed, and not looking for work",Spain,...,Man,No,Straight / Heterosexual,Hispanic or Latino/Latina;White or of European...,No,Appropriate in length,Easy


As we can see, the filter method does this in a single line of code, whereas our previous approach took several steps.
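To make the comparison concrete, here is a minimal sketch on a made-up frame. The names `toy` and `min_size` are hypothetical stand-ins for `dev_survey` and `mdn_size`; the point is only that both approaches keep the same rows, including the fact that rows with a missing country are dropped.

```python
import pandas as pd

# Hypothetical toy data standing in for dev_survey
toy = pd.DataFrame(
    {
        "Country": ["A", "A", "A", "B", "C", "C", None],
        "Age": [21, 35, 40, 29, 33, 50, 41],
    }
)
min_size = 2  # assumed threshold, playing the role of mdn_size

# One step: keep the rows whose Country group has at least min_size rows
kept_filter = toy.groupby("Country").filter(lambda grp: len(grp) >= min_size)

# Several steps: count rows per country, then mask the original frame
counts = toy["Country"].value_counts()
big_countries = counts[counts >= min_size].index
kept_mask = toy[toy["Country"].isin(big_countries)]

# Both approaches keep the same row labels
print(kept_filter.index.tolist())
print(kept_mask.index.tolist())
```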

### <a id='toc11_4_'></a>[*Accessing values from a MultiIndexed Dataframe*](#toc0_)

In [76]:
max_mpg_year_manufac.unstack()

Unnamed: 0_level_0,city08,city08,city08,city08,city08,city08,city08,...,highway08,highway08,highway08,highway08,highway08,highway08,highway08
make,AM General,ASC Incorporated,Acura,Alfa Romeo,American Motors Corporation,Aston Martin,Audi,...,Vixen Motor Company,Volga Associated Automobile,Volkswagen,Volvo,Wallace Environmental,Yugo,smart
year,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2
1984,18.0,,,18.0,19.0,8.0,21.0,...,,,43.0,31.0,,,
1985,16.0,,,19.0,15.0,7.0,21.0,...,,,41.0,28.0,,,
1986,,,23.0,19.0,15.0,,22.0,...,16.0,22.0,40.0,27.0,,29.0,
1987,,14.0,23.0,19.0,15.0,8.0,22.0,...,,,37.0,26.0,,29.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2017,,,29.0,24.0,,15.0,33.0,...,,,111.0,36.0,,,94.0
2018,,,28.0,24.0,,18.0,34.0,...,,,111.0,36.0,,,94.0
2019,,,28.0,24.0,,18.0,74.0,...,,,111.0,36.0,,,94.0
2020,,,23.0,,,18.0,13.0,...,,,,,,,


In [77]:
# the values of max mileage in city08 and highway08 of the Subaru cars for the years 1984 to 1987
max_mpg_year_manufac.unstack().loc[
    range(1984, 1988), (["city08", "highway08"], "Subaru")
]

Unnamed: 0_level_0,city08,highway08
make,Subaru,Subaru
year,Unnamed: 1_level_2,Unnamed: 2_level_2
1984,26.0,33.0
1985,27.0,33.0
1986,26.0,33.0
1987,32.0,37.0
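The selection above gives `.loc` a row selector plus a tuple with one entry per column level. Here is a minimal sketch of the same pattern on a made-up frame (the makes and numbers below are hypothetical, not taken from the vehicles data):

```python
import pandas as pd

# Hypothetical two-level columns: (measure, make)
cols = pd.MultiIndex.from_product(
    [["city08", "highway08"], ["Subaru", "Volvo"]], names=[None, "make"]
)
toy = pd.DataFrame(
    [[26, 18, 33, 31], [27, 19, 33, 28], [26, 20, 33, 27]],
    index=pd.Index([1984, 1985, 1986], name="year"),
    columns=cols,
)

# Rows: a list of labels. Columns: a tuple with one selector per column level.
subaru = toy.loc[[1984, 1985], (["city08", "highway08"], "Subaru")]

# A tuple of plain labels addresses exactly one column and returns a Series
city_subaru = toy.loc[:, ("city08", "Subaru")]

print(subaru)
print(city_subaru)
```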


-------------------------

## <a id='toc12_'></a>[Flattening Hierarchical Indexes and Columns](#toc0_)

-------------------------------------------------------------------------------

### <a id='toc12_1_'></a>[*Removing Hierarchical index*](#toc0_)

`.reset_index()` removes a hierarchical index by moving each index level into its own column.

In [78]:
# example of hierarchical index
hr_idx = dev_survey.groupby(["Country", "Age"]).ConvertedComp.agg(["max", "min"])
hr_idx

Unnamed: 0_level_0,Unnamed: 1_level_0,max,min
Country,Age,Unnamed: 2_level_1,Unnamed: 3_level_1
Afghanistan,1.0,0.0,0.0
Afghanistan,18.0,,
Afghanistan,21.0,,
Afghanistan,23.0,,
...,...,...,...
Zimbabwe,33.0,180000.0,28800.0
Zimbabwe,35.0,42000.0,42000.0
Zimbabwe,41.0,30000.0,30000.0
Zimbabwe,46.0,,


In [79]:
# removing hierarchical index with .reset_index()
hr_idx.reset_index()

Unnamed: 0,Country,Age,max,min
0,Afghanistan,1.0,0.0,0.0
1,Afghanistan,18.0,,
2,Afghanistan,21.0,,
3,Afghanistan,23.0,,
...,...,...,...,...
4139,Zimbabwe,33.0,180000.0,28800.0
4140,Zimbabwe,35.0,42000.0,42000.0
4141,Zimbabwe,41.0,30000.0,30000.0
4142,Zimbabwe,46.0,,


Alternatively, we can pass **as_index=False** to the `groupby()` method. This keeps the grouping columns as regular columns instead of moving them into the index.
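A minimal sketch of the two routes side by side, on a made-up frame (the column names and the named aggregations `comp_max` / `comp_min` are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Country": ["A", "A", "B", "B"],
        "Age": [20, 20, 30, 31],
        "Comp": [100.0, 150.0, 90.0, 120.0],
    }
)

# Route 1: the grouping keys form a MultiIndex, which is then moved back into columns
via_reset = (
    toy.groupby(["Country", "Age"])
    .agg(comp_max=("Comp", "max"), comp_min=("Comp", "min"))
    .reset_index()
)

# Route 2: the grouping keys stay ordinary columns from the start
via_flag = toy.groupby(["Country", "Age"], as_index=False).agg(
    comp_max=("Comp", "max"), comp_min=("Comp", "min")
)

print(via_flag)
```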

### <a id='toc12_2_'></a>[*Flattening hierarchical columns*](#toc0_)

Sadly, the `.reset_index()` method does not work on hierarchical columns, and there is no single built-in method that flattens them for us. We have to build the new column names ourselves if we want to collapse the multi-level columns into one level.

In [80]:
# Example of hierarchical columns
hr_cols = hr_idx.unstack()
hr_cols

Unnamed: 0_level_0,max,max,max,max,max,max,max,...,min,min,min,min,min,min,min
Age,1.0,2.0,3.0,4.0,5.0,9.0,10.0,...,91.0,94.0,95.0,97.0,98.0,98.9,99.0
Country,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2
Afghanistan,0.0,,,,,,,...,,,,,,,
Albania,,,,,,,,...,,,,,,,
Algeria,,,,,,,,...,,,,,,,
Andorra,,,,,,,,...,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Viet Nam,,,,,,,,...,,,,,,,200.0
Yemen,,,,,,,,...,,,,,,,
Zambia,,,,,,,,...,,,,,,,
Zimbabwe,,,,,,,,...,,,,,,,


- `flatten_cols(df)` function

The following function joins the labels from every level of each column with an underscore (for example, `("max", 1.0)` becomes `max_1.0`). It can be used with the `pipe()` method, making it possible to flatten multi-level columns in a chaining operation.

In [81]:
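def flatten_cols(df):
    # join the labels of each column tuple, e.g. ("max", 1.0) -> "max_1.0"
    cols = ["_".join(map(str, col_comb)) for col_comb in df.columns.to_flat_index()]
    # note: this assignment renames the columns of the passed dataframe in place
    df.columns = cols
    return df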

<u>**Explanation**</u>
1. `pandas.MultiIndex.to_flat_index()` returns a `pandas.Index` object in which each multi-level column label is represented as a tuple.
2. Recall that the `map(func, iterable)` function calls `func` on each value of the iterable and returns a map object.
3. So, we map the `str` function over each tuple to convert any non-string entry to a string before joining.
4. Finally, the strings are joined with "_" between them.
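The steps above can be traced on a tiny frame. This is only an illustrative sketch; the column labels are made up to resemble the `(statistic, age)` pairs used in this section.

```python
import pandas as pd

# Hypothetical two-level columns: (statistic, age)
cols = pd.MultiIndex.from_tuples([("max", 1.0), ("max", 2.0), ("min", 1.0)])
toy = pd.DataFrame([[10, 20, 5]], columns=cols)

# Step 1: one tuple per column
flat = toy.columns.to_flat_index()

# Steps 2-4: stringify every entry of each tuple, then join with "_"
joined = ["_".join(map(str, pair)) for pair in flat]

print(list(flat))
print(joined)
```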


<u>**A note on the** `DataFrame.pipe(func, *args, **kwargs)` method</u>
- Use `.pipe()` when chaining together functions that expect Series, DataFrames or GroupBy objects.
- Parameters
    - `func`: function to apply to the series or dataframe
    - `*args`: positional arguments passed on to func
    - `**kwargs`: keyword arguments passed on to func
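As a small illustration of how the extra arguments travel through `.pipe()`, consider the sketch below. The helper `add_prefix_to_cols` is a hypothetical function written for this example, not part of pandas.

```python
import pandas as pd


def add_prefix_to_cols(df, prefix, sep="_"):
    """Return a copy of df whose column names start with prefix + sep."""
    return df.rename(columns=lambda name: f"{prefix}{sep}{name}")


toy = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

# "score" is forwarded positionally, sep="-" is forwarded as a keyword argument
piped = toy.pipe(add_prefix_to_cols, "score", sep="-")

# The call above is equivalent to calling the function directly
direct = add_prefix_to_cols(toy, "score", sep="-")

print(list(piped.columns))
```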

In [82]:
# now let's see the flatten_cols() function in action
hr_cols.pipe(flatten_cols)

Unnamed: 0_level_0,max_1.0,max_2.0,max_3.0,max_4.0,max_5.0,max_9.0,max_10.0,...,min_91.0,min_94.0,min_95.0,min_97.0,min_98.0,min_98.9,min_99.0
Country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
Afghanistan,0.0,,,,,,,...,,,,,,,
Albania,,,,,,,,...,,,,,,,
Algeria,,,,,,,,...,,,,,,,
Andorra,,,,,,,,...,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Viet Nam,,,,,,,,...,,,,,,,200.0
Yemen,,,,,,,,...,,,,,,,
Zambia,,,,,,,,...,,,,,,,
Zimbabwe,,,,,,,,...,,,,,,,


**<u> Difference between the `.apply()` method and the `.pipe()` method</u>**

Both `.apply()` and `.pipe()` are pandas methods for applying functions to the data in a DataFrame, but they have different purposes and use cases.

1. **`.apply()`**:
   - The `.apply()` method applies a function along an axis (rows or columns) of a DataFrame.
   - It can be used with both Series and DataFrame objects.
   - On a DataFrame, the function is called once for each column (axis=0, the default) or once for each row (axis=1), and receives that column or row as a Series.

2. **`.pipe()`**:
   - The `.pipe()` method passes the whole DataFrame to a function and returns whatever that function returns.
   - It is focused on chaining a series of operations in a readable and compact manner.
   - When several `.pipe()` calls are chained, each function takes the output of the previous function as input.
   - It is particularly useful for avoiding nested function calls when applying multiple operations.

In summary, `.apply()` is primarily used **for applying a function *along an axis* of a DataFrame**, while `.pipe()` is used for **chaining *custom functions that expect a whole `DataFrame`***. Since `.apply()` calls the function separately for every column or row, it carries more overhead than `.pipe()`, which calls the function only once; the timings below show the gap for this small example. Both have their use cases, and the right choice depends on the situation.
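One way to see the difference is to record what each method hands to the function. The sketch below uses a hypothetical helper, `make_recorder`, that notes the type of every object it receives.

```python
import pandas as pd

calls = {"apply": [], "pipe": []}


def make_recorder(label):
    """Build a doubling function that records the type of each input it sees."""

    def double(obj):
        calls[label].append(type(obj).__name__)
        return obj * 2

    return double


toy = pd.DataFrame({"A": [10, 20, 30], "B": [5, 10, 15]})

via_apply = toy.apply(make_recorder("apply"))  # function receives one column at a time
via_pipe = toy.pipe(make_recorder("pipe"))  # function receives the whole frame once

print(calls)
```

Both calls return the same doubled frame here; only the number and type of function calls differ.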

In [83]:
data = {"A": [10, 20, 30], "B": [5, 10, 15]}

df = pd.DataFrame(data)


def double_column_values(df):
    return df * 2


def square_column_values(df):
    return df**2

In [84]:
%%timeit
result_pipe = df.pipe(double_column_values).pipe(square_column_values)

179 µs ± 17.5 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


In [85]:
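# variables assigned inside a %%timeit cell are discarded, so compute the result again before printing
result_pipe = df.pipe(double_column_values).pipe(square_column_values)
print(result_pipe)

      A    B
0   400  100
1  1600  400
2  3600  900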

In [None]:
%%timeit
result_apply = df.apply(double_column_values).apply(square_column_values)

886 µs ± 22.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [None]:
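# recompute outside of %%timeit so that the variable exists
result_apply = df.apply(double_column_values).apply(square_column_values)
print(result_apply)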

      A    B
0   400  100
1  1600  400
2  3600  900


-------------------------------------------

## <a id='toc13_'></a>[Melting, Transposing and Stacking Data](#toc0_)

-------------------------------------------

### <a id='toc13_1_'></a>[*Melting & Unmelting Data*](#toc0_)

To understand melting of dataframes we first need to understand two terms associated with the data in a dataframe.
- **fact:** A fact is a value that is measured and reported on.
- **dimension:** A dimension is a value that describes the conditions of the fact.

For example, in a sales scenario, typical facts would be the number of sales of an item and the cost. The dimensions might include the store where the item was sold, the date, and the customer.

Based on the idea of fact and dimension, the way data is stored can be categorized as,
- **wide form:** if a single row has multiple facts and,
- **long or tidy form:** if a single row of data has only one fact (possibly along with other variables describing the dimensions).

**Melting** is the process of converting data from wide form to long/tidy form. The pandas `pd.melt()` function provides a convenient way of melting a dataframe.

In [None]:
# first let's create a dataframe of wide form
wide = pd.DataFrame(
    {
        "Student_name": ["Ashly", "Cole", "Young", "Dave"],
        "Age": [15, 14, 15, 15],
        "Test1": [13, 18, 17, np.nan],
        "Test2": [19, 18, 16, 19],
        "Teacher": ["Abdullah", "Pial", "Hasan", "Arafat"],
    }
)

In [None]:
wide

Unnamed: 0,Student_name,Age,Test1,Test2,Teacher
0,Ashly,15,13.0,19,Abdullah
1,Cole,14,18.0,18,Pial
2,Young,15,17.0,16,Hasan
3,Dave,15,,19,Arafat


This dataframe has two columns (Test1 and Test2) that contain the facts, i.e. the test scores. The other columns are dimensions of those facts.

- #### Melting Data: The `pd.melt()` function

<u> **Parameters** </u>
- frame: the dataframe to melt
- id_vars: identifier variables i.e, dimension columns
- value_vars: fact columns
- var_name: name to use for the variable column
- value_name: name to use for the value column
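To see what the optional parameters default to, here is a minimal sketch on a made-up frame (`toy_wide` is hypothetical). When only `id_vars` is given, every remaining column is treated as a fact column, and the two new columns are named `variable` and `value`.

```python
import pandas as pd

toy_wide = pd.DataFrame(
    {"name": ["Ann", "Bob"], "Test1": [13, 18], "Test2": [19, 17]}
)

# No value_vars, var_name or value_name: pandas falls back to its defaults
toy_long = pd.melt(toy_wide, id_vars="name")

print(toy_long)
```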


In [None]:
# Now, let's convert this dataframe of wide form into long form
long = pd.melt(
    wide,
    id_vars=["Student_name", "Age", "Teacher"],
    value_vars=["Test1", "Test2"],
    var_name="Test",
    value_name="Test_scores",
)

In [None]:
long

Unnamed: 0,Student_name,Age,Teacher,Test,Test_scores
0,Ashly,15,Abdullah,Test1,13.0
1,Cole,14,Pial,Test1,18.0
...,...,...,...,...,...
6,Young,15,Hasan,Test2,16.0
7,Dave,15,Arafat,Test2,19.0


- #### Unmelting data with the `pivot_table()` method

In [None]:
long.pivot_table(
    index=["Student_name", "Age", "Teacher"], columns="Test", values="Test_scores"
).reset_index()

Test,Student_name,Age,Teacher,Test1,Test2
0,Ashly,15,Abdullah,13.0,19.0
1,Cole,14,Pial,18.0,18.0
2,Dave,15,Arafat,,19.0
3,Young,15,Hasan,17.0,16.0


**Notes:**
- Each name passed in a list to the `columns` parameter becomes its own column level, and passing a list to `values` (even with a single name) adds one more level on top. So pass in a scalar whenever you can.
- `.reset_index()` was used to move the hierarchical index back into regular columns.
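The first note can be checked by counting column levels. This is a small sketch on a made-up long frame (`toy_long` and its column names are hypothetical):

```python
import pandas as pd

toy_long = pd.DataFrame(
    {
        "name": ["Ann", "Ann", "Bob", "Bob"],
        "Test": ["Test1", "Test2", "Test1", "Test2"],
        "score": [13.0, 19.0, 18.0, 17.0],
    }
)

# Scalar for values: the result has a single column level
scalar_values = toy_long.pivot_table(index="name", columns="Test", values="score")

# One-element list for values: an extra "score" level sits above the Test labels
list_values = toy_long.pivot_table(index="name", columns="Test", values=["score"])

print(scalar_values.columns.nlevels)
print(list_values.columns.nlevels)
```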

### <a id='toc13_2_'></a>[*Transposing Data*](#toc0_)

Transposing converts the columns into rows and the rows into columns. This can be done either with the `.transpose()` method or the `.T` property.

Some use cases for transposing the data are:
- **Swapping axes for plotting**
- **Viewing more data in Jupyter**: if you transpose just to fit more data on your screen, you might not want to transpose the whole data set. Remember that pandas stores and optimizes data by column type. Turning rows that mix different data types (strings, dates, numbers) into columns can be a slow and memory-hungry operation. It is better to pull off the head, the tail, or a sample of the data and then transpose that.
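A minimal sketch of the "transpose only a preview" advice, on a made-up frame (`toy` is hypothetical). It also shows the cost mentioned above: once rows of mixed types become columns, every column falls back to the generic `object` dtype.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "name": ["Ann", "Bob", "Cy"],
        "age": [15, 14, 15],
        "score": [13.5, 18.0, 17.0],
    }
)

# Transpose just the first two rows instead of the whole frame
preview = toy.head(2).T

print(toy.dtypes)      # one specialised dtype per column
print(preview.dtypes)  # mixed rows turned into columns
```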


### <a id='toc13_3_'></a>[*Stacking and Unstacking Data*](#toc0_)

First, let's create a multi-level dataframe (with both a multi-level index and multi-level columns).

Note that **the levels of a multi-level index or column are counted from the outside in, starting from 0**.

In [None]:
ds_sus = dev_survey.pivot_table(
    index=["Country", "Hobbyist"],
    columns="Employment",
    values="Age",
    aggfunc=["size", "mean", "max", "min"],
)

In [None]:
ds_sus

Unnamed: 0_level_0,Unnamed: 1_level_0,size,size,size,size,size,size,mean,...,max,min,min,min,min,min,min
Unnamed: 0_level_1,Employment,Employed full-time,Employed part-time,"Independent contractor, freelancer, or self-employed","Not employed, and not looking for work","Not employed, but looking for work",Retired,Employed full-time,...,Retired,Employed full-time,Employed part-time,"Independent contractor, freelancer, or self-employed","Not employed, and not looking for work","Not employed, but looking for work",Retired
Country,Hobbyist,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2
Afghanistan,No,7.0,2.0,1.0,,1.0,1.0,37.200000,...,,24.0,23.0,,,,
Afghanistan,Yes,14.0,2.0,5.0,3.0,2.0,,26.111111,...,,18.0,23.0,24.0,,18.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Zimbabwe,No,3.0,,2.0,,2.0,,28.666667,...,,23.0,,21.0,,25.0,
Zimbabwe,Yes,15.0,3.0,6.0,1.0,5.0,1.0,29.214286,...,,23.0,21.0,25.0,25.0,20.0,


- #### The `.stack()` method

The stack method moves a **column level into the index**. By default it moves the inner-most column level to the inner-most index position. We can specify which column level to move either by its position or by its name.

In [None]:
# say we wanted to pull the aggregate functions (size, mean, max, min) level to the inner-most index
ds_sus_stack = ds_sus.stack(0)
ds_sus_stack


- #### The `.unstack()` method

As we have previously seen with the groupby method, the unstack method moves an **index level into the columns**. By default it moves the inner-most index level to the inner-most column position. We can specify which index level to move either by its position or by its name.
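Both moves can be followed on a small frame by watching the level names. The frame below is a made-up sketch; the level names `stat` and `job` are hypothetical and only loosely mirror `ds_sus`.

```python
import pandas as pd

cols = pd.MultiIndex.from_product(
    [["max", "min"], ["full", "part"]], names=["stat", "job"]
)
idx = pd.MultiIndex.from_product(
    [["A", "B"], ["No", "Yes"]], names=["Country", "Hobbyist"]
)
toy = pd.DataFrame(
    [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]],
    index=idx,
    columns=cols,
)

# stack: the column level "stat" becomes the inner-most index level
stacked = toy.stack("stat")

# unstack: the index level "Hobbyist" becomes the inner-most column level
unstacked = stacked.unstack("Hobbyist")

print(list(stacked.index.names), list(stacked.columns.names))
print(list(unstacked.index.names), list(unstacked.columns.names))
```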

In [None]:
# say we wanted to pull the Hobbyist index level from the ds_sus_stack dataframe into the inner-most column level
ds_sus_stack.unstack("Hobbyist")


- #### The `.swaplevel()` method: Swapping levels of a multilevel dataframe

The swaplevel method swaps two levels of a multi-level index or column (chosen with the axis parameter). By default it swaps the two inner-most levels; other levels can be chosen with the `i` and `j` parameters.

In [None]:
ds_sus.swaplevel(axis="columns")

Unnamed: 0_level_0,Employment,Employed full-time,Employed part-time,"Independent contractor, freelancer, or self-employed","Not employed, and not looking for work","Not employed, but looking for work",Retired,Employed full-time,...,Retired,Employed full-time,Employed part-time,"Independent contractor, freelancer, or self-employed","Not employed, and not looking for work","Not employed, but looking for work",Retired
Unnamed: 0_level_1,Unnamed: 1_level_1,size,size,size,size,size,size,mean,...,max,min,min,min,min,min,min
Country,Hobbyist,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2
Afghanistan,No,7.0,2.0,1.0,,1.0,1.0,37.200000,...,,24.0,23.0,,,,
Afghanistan,Yes,14.0,2.0,5.0,3.0,2.0,,26.111111,...,,18.0,23.0,24.0,,18.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Zimbabwe,No,3.0,,2.0,,2.0,,28.666667,...,,23.0,,21.0,,25.0,
Zimbabwe,Yes,15.0,3.0,6.0,1.0,5.0,1.0,29.214286,...,,23.0,21.0,25.0,25.0,20.0,
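With only two column levels, as in `ds_sus`, a swap has just one possible outcome. A three-level index makes the choice of levels visible; the frame below is a made-up sketch with hypothetical level names.

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [("A", "No", 2019), ("A", "Yes", 2020)],
    names=["Country", "Hobbyist", "Year"],
)
toy = pd.DataFrame({"n": [1, 2]}, index=idx)

# Default: swap the two inner-most levels
default_swap = toy.swaplevel()

# Explicit positions: swap the outer-most level (0) with the inner-most level (2)
outer_swap = toy.swaplevel(0, 2)

print(list(default_swap.index.names))
print(list(outer_swap.index.names))
```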


--------------------------------

## <a id='toc14_'></a>[Concatenation, Joining DataFrames](#toc0_)

----------------------------------

### <a id='toc14_1_'></a>[The `concat()` method](#toc0_)

The `pd.concat()` function takes a sequence of Series or DataFrame objects. Columns that share the same name are aligned into a single column; values for a column that is missing from one of the inputs are filled with NaN.

1. To **add rows** of different dataframes together, use concat along **axis=0**.
    Note that `pd.concat()` preserves index values, so the resulting dataframe may well have duplicate index values.
    - To raise an error if there are duplicate index values, use **verify_integrity=True**.
    - However, if you want pandas to create a new index, use **ignore_index=True**.
2. To **add columns** of different dataframes together, use concat along **axis=1**. However, using the `.assign()` method is often preferable.
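The three index behaviours from the first point can be compared on two tiny frames. This is a sketch with made-up data; `top` and `bottom` are hypothetical names.

```python
import pandas as pd

top = pd.DataFrame({"name": ["John", "George"]}, index=[0, 1])
bottom = pd.DataFrame({"name": ["Paul", "Ringo"]}, index=[1, 2])

# Default: the original index values are kept, so label 1 appears twice
kept = pd.concat([top, bottom])

# ignore_index=True: pandas builds a fresh index
renumbered = pd.concat([top, bottom], ignore_index=True)

# verify_integrity=True: the duplicated label raises a ValueError
try:
    pd.concat([top, bottom], verify_integrity=True)
    overlap_detected = False
except ValueError as err:
    overlap_detected = True
    print(err)

print(kept.index.tolist())
print(renumbered.index.tolist())
```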

In [None]:
# example dataframes
car_df1 = pd.DataFrame(
    {"name": ["John", "George", "Ringo"], "color": ["Blue", "Blue", "Purple"]}
)
car_df2 = pd.DataFrame(
    {"name": ["Paul", "George", "Ringo"], "carcolor": ["Red", "Blue", np.nan]},
    index=[3, 1, 2],
)

In [None]:
pd.concat([car_df1, car_df2]).T

Unnamed: 0,0,1,2,3,1,2
name,John,George,Ringo,Paul,George,Ringo
color,Blue,Blue,Purple,,,
carcolor,,,,Red,Blue,


### <a id='toc14_2_'></a>[Joining Dataframes](#toc0_)

The four common types of joins are, `inner`, `outer`, `left`, and `right` joins. The dataframe has two methods to support these operations, `.join()` and `.merge()`.

<img src="how_to_merge.png" width="450" height="400">

The `.merge()` method is often preferred because the `.join()` method is meant for joining on the index rather than on columns, and in practice we usually join dataframes on columns.

To join dataframes on columns using the `.join()` method, the usual approach is to first move the joining column into the index with the `.set_index()` method.
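A minimal sketch of that extra step, next to the equivalent `.merge()` call. The frames `left` and `right` are made up for this illustration.

```python
import pandas as pd

left = pd.DataFrame({"ID": [1, 2, 3], "Job": ["Chef", "Clerk", "Teacher"]})
right = pd.DataFrame({"ID": [1, 3, 4], "Age": [30, 41, 25]})

# merge works on the column directly
via_merge = left.merge(right, on="ID", how="inner")

# join works on the index, so ID is moved into the index first and restored afterwards
via_join = (
    left.set_index("ID")
    .join(right.set_index("ID"), how="inner")
    .reset_index()
)

print(via_merge)
print(via_join)
```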

<u> The ``.merge()`` method Parameters</u>

0. `right:` object to merge with.
1. `on:` Column names to join on. String or list. These must be found in both DataFrames. (If `on` is None and the index is not used for merging, the default is the intersection of the column names of both DataFrames.)
2. `left_on:` Column names for left dataframe. String or list. Used when names don’t overlap.
3. `right_on:` Column names for right dataframe. String or list. Used when names don’t overlap.
4. `left_index:` Join based on left dataframe index. Boolean.
5. `right_index:` Join based on right dataframe index. Boolean.
6. `how:` Type of merge to be performed. {‘left’, ‘right’, ‘outer’, ‘inner’, ‘cross’} default ‘inner’.
7. `indicator:` Adds a column showing where the data in each row comes from {'left_only', 'right_only', 'both'}. Boolean or String.
    - indicator=True: pandas will create a column called _merge where the information will be shown.
    - If a string is passed, it will be the new column name rather than _merge.
8. `validate:` Raise an error if the defined validation constraint is not met. {'1:1', '1:m', 'm:1', 'm:m'}. (m stands for many; 'm:m' is accepted but performs no check.)
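The last two parameters are easiest to understand when they fire. Below is a small sketch with made-up frames (`orders`, `customers` and the column name `source` are hypothetical): `validate="m:1"` rejects a right-hand side with duplicated keys, and a string passed to `indicator` names the origin column.

```python
import pandas as pd

orders = pd.DataFrame({"cust": [1, 1, 2], "item": ["pen", "ink", "pad"]})
customers = pd.DataFrame({"cust": [1, 2, 2], "city": ["Oslo", "Rome", "Lima"]})

# customer 2 appears twice on the right, so the many-to-one check fails
try:
    orders.merge(customers, on="cust", validate="m:1")
    validation_failed = False
except pd.errors.MergeError as err:
    validation_failed = True
    print(err)

# After removing the duplicate key the merge passes; "source" replaces the default "_merge"
tagged = orders.merge(
    customers.drop_duplicates("cust"),
    on="cust",
    how="left",
    validate="m:1",
    indicator="source",
)
print(tagged)
```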


- Example dataframes

In [95]:
df1 = pd.DataFrame(
    {
        "ID": [1001, 1002, 1003, 1004, 1006, 1008],
        "FirstName": ["Hasan", "Dave", "Arafat", "Alice", "Zhang", "Yamale"],
        "Job": ["Student", "Teacher", "Web Developer", "Clerk", "Chef", "Footballer"],
    }
)

df2 = pd.DataFrame(
    {
        "ID": [1001, 1002, 1003, 1008, 1009, 1010],
        "FirstName": ["Hasan", "David", "Arafat", "Yamale", "John", "Harold"],
        "Age": [23, 48, 24, 16, 32, 35],
    }
)

In [96]:
df1

Unnamed: 0,ID,FirstName,Job
0,1001,Hasan,Student
1,1002,Dave,Teacher
2,1003,Arafat,Web Developer
3,1004,Alice,Clerk
4,1006,Zhang,Chef
5,1008,Yamale,Footballer


In [97]:
df2

Unnamed: 0,ID,FirstName,Age
0,1001,Hasan,23
1,1002,David,48
2,1003,Arafat,24
3,1008,Yamale,16
4,1009,John,32
5,1010,Harold,35


- Example of Inner join

An inner join selects and combines only the rows that have matching values in the specified columns (keys) in both the left and right dataframes, i.e. it includes only the entries that appear in both dataframes based on the merging columns. Here ID 1002 is dropped because its FirstName differs between the two dataframes (Dave vs. David).

In [94]:
df1.merge(df2, on=["ID", "FirstName"], how="inner")

Unnamed: 0,ID,FirstName,Job,Age
0,1001,Hasan,Student,23
1,1003,Arafat,Web Developer,24
2,1008,Yamale,Footballer,16


- Example of Outer join

An outer join combines data from both the left and right dataframes, including all rows from both sides. Matching entries are combined into a single row, while non-matching entries get NaN (Not a Number) in the columns that come only from the other dataframe. This type of join ensures that no rows are lost, as it includes all available information from both dataframes. When `on` is not given, as below, pandas merges on the columns the two dataframes have in common (here ID and FirstName).

In [99]:
df1.merge(df2, how="outer").T

Unnamed: 0,0,1,2,3,4,5,6,7,8
ID,1001,1002,1003,1004,1006,1008,1002,1009,1010
FirstName,Hasan,Dave,Arafat,Alice,Zhang,Yamale,David,John,Harold
Job,Student,Teacher,Web Developer,Clerk,Chef,Footballer,,,
Age,23.0,,24.0,,,16.0,48.0,32.0,35.0


- Example of Left join

A left join includes all rows from the left dataframe and brings in the corresponding data from the right dataframe. If some keys have no match, the resulting dataframe will have NaN (Not a Number) values in the columns from the right dataframe. This method ensures that no data is lost from the left dataframe while incorporating relevant information from the right dataframe where possible.

In [100]:
df1.merge(df2, how="left")

Unnamed: 0,ID,FirstName,Job,Age
0,1001,Hasan,Student,23.0
1,1002,Dave,Teacher,
2,1003,Arafat,Web Developer,24.0
3,1004,Alice,Clerk,
4,1006,Zhang,Chef,
5,1008,Yamale,Footballer,16.0


- Example of Right join

A right join is the reverse of a left join. It includes all rows from the right dataframe and merges corresponding data from the left dataframe based on the specified columns (keys) where possible. If no matching entries are found in the merging columns, the resulting dataframe will contain NaN values in the left dataframe's columns. This method ensures that data from the right dataframe is preserved while incorporating relevant information from the left dataframe when available.

In [102]:
df1.merge(df2, how="right")

Unnamed: 0,ID,FirstName,Job,Age
0,1001,Hasan,Student,23
1,1002,David,,48
2,1003,Arafat,Web Developer,24
3,1008,Yamale,Footballer,16
4,1009,John,,32
5,1010,Harold,,35


`(?)` `Exercise:` Below are two dataframes. The first holds each employee's name and company. The second holds some company names (in the ticker column) and their locations. Get the location of each employee, assuming that an employee's location is the same as the company's location.


In [103]:
# example dataframes
employees = pd.DataFrame(
    {
        "name": ["Fred", "Johm", "Sally", "Annie"],
        "company": ["AMZN", "GOOG", "GOOG", "NFLX"],
    }
)
locations = pd.DataFrame(
    {
        "ticker": ["AMZN", "GOOG"],
        "location": ["Seattle", "SF"],
    }
)

In [104]:
employees

Unnamed: 0,name,company
0,Fred,AMZN
1,Johm,GOOG
2,Sally,GOOG
3,Annie,NFLX


In [105]:
locations

Unnamed: 0,ticker,location
0,AMZN,Seattle
1,GOOG,SF


In [106]:
# to get the location of each employee
employees.merge(
    locations,
    left_on="company",
    right_on="ticker",
    how="left",
    validate="m:1",
    indicator=True,
)

Unnamed: 0,name,company,ticker,location,_merge
0,Fred,AMZN,AMZN,Seattle,both
1,Johm,GOOG,GOOG,SF,both
2,Sally,GOOG,GOOG,SF,both
3,Annie,NFLX,,,left_only
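The `_merge` column can also be used as a filter to list the rows that found no match. The sketch below rebuilds a shortened version of the exercise data under the hypothetical names `staff` and `offices`, then drops the helper columns once they have served their purpose.

```python
import pandas as pd

staff = pd.DataFrame(
    {"name": ["Fred", "Sally", "Annie"], "company": ["AMZN", "GOOG", "NFLX"]}
)
offices = pd.DataFrame({"ticker": ["AMZN", "GOOG"], "location": ["Seattle", "SF"]})

merged = staff.merge(
    offices, left_on="company", right_on="ticker", how="left", indicator=True
)

# Employees whose company has no entry in the offices frame
unmatched = merged.loc[merged["_merge"] == "left_only", "name"].tolist()

# Final answer without the duplicated key and the indicator column
tidy = merged.drop(columns=["ticker", "_merge"])

print(unmatched)
print(tidy)
```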
