# Deep dive into Pandas DataFrames 

**Read the official documentation on pandas DataFrames @ https://pandas.pydata.org/pandas-docs/stable/reference/frame.html**

**`Note:`** **Chaining functions/methods** in pandas works the same way as in plain Python: each method returns an object, and the next method is called on that result.

DataFrames are **column oriented**, unlike most common databases. **Each column** in a dataframe is a **pandas Series object**, so any operation that can be performed on a Series can also be applied to a column.

There are **two axes** for a dataframe commonly referred to as axis 0 and 1, or the **"index"** (or 'rows') axis and the **"columns"** axis respectively. Note that, when an **operation** is applied **along axis 0**, it is applied **down the column**. Likewise, operations **along axis 1** operate **across the values in the row**.
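
To make the axis convention concrete, here is a minimal sketch on a made-up frame (the name `demo` and its values are illustrative only). Summing along axis 0 collapses the rows and gives one value per column; summing along axis 1 collapses the columns and gives one value per row.

```python
import pandas as pd

# A tiny, made-up frame used only for illustration
demo = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]}, index=["x", "y", "z"])

# axis=0 ("index"): work down each column, one result per column
col_totals = demo.sum(axis=0)

# axis=1 ("columns"): work across each row, one result per row
row_totals = demo.sum(axis=1)

print(col_totals.to_dict())  # {'a': 6, 'b': 60}
print(row_totals.to_dict())  # {'x': 11, 'y': 22, 'z': 33}
```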

## Import Statements

--------------------------

In [1]:
# import statements
import numpy as np
import pandas as pd

In [2]:
# view options
pd.set_option("display.max_columns", 40)
pd.set_option("display.max_rows", 10)

---------------------------

## Importing the data

------------------------

- We will be exploring a dataset from a Siena College Poll in 2018. This data has rankings of United States Presidents in various attributes. These attributes are:

In [3]:
siena_2018_cols = """
• Bg = Background
• Im = Imagination
• Int = Integrity
• IQ = Intelligence
• L = Luck
• WR = Willing to take risks
• AC = Ability to compromise
• EAb = Executive ability
• LA = Leadership ability
• CAb = Communication ability
• OA = Overall ability
• PL = Party leadership
• RC = Relations with Congress
• CAp = Court appointments
• HE = Handling of economy
• EAp = Executive appointments
• DA = Domestic accomplishments
• FPA = Foreign policy accomplishments
• AM = Avoid crucial mistakes
• EV = Experts’ view
• O = Overall
"""

In [4]:
# reading from github url

# it is a good practice to define your index column when reading the data file.
# it is generally frowned upon if you don't have an index column

url = "https://github.com/mattharrison/datasets/raw/master/data/siena2018-pres.csv"
siena_2018 = pd.read_csv(url, index_col=0)

In [5]:
siena_2018.head(3)

Unnamed: 0,Seq.,President,Party,Bg,Im,Int,IQ,L,WR,AC,EAb,LA,CAb,OA,PL,RC,CAp,HE,EAp,DA,FPA,AM,EV,O
1,1,George Washington,Independent,7,7,1,10,1,6,2,2,1,11,2,18,1,1,1,1,2,2,1,2,1
2,2,John Adams,Federalist,3,13,4,4,24,14,31,21,21,13,8,28,17,4,13,15,19,13,16,10,14
3,3,Thomas Jefferson,Democratic-Republican,2,2,14,1,8,5,14,6,6,4,4,5,5,7,20,4,6,9,7,5,5


In [6]:
# this will print all the column names, number of non null values in each column and the datatype of that column
# siena_2018.info()

- Another dataset that we will be exploring is the "/Data/vehicles.csv.zip".

In [7]:
vehicles = pd.read_csv("./Data/vehicles.csv")


In [8]:
vehicles.head(3)

Unnamed: 0,barrels08,barrelsA08,charge120,charge240,city08,city08U,cityA08,cityA08U,cityCD,cityE,cityUF,co2,co2A,co2TailpipeAGpm,co2TailpipeGpm,comb08,comb08U,combA08,combA08U,combE,...,year,youSaveSpend,guzzler,trans_dscr,tCharger,sCharger,atvType,fuelType2,rangeA,evMotor,mfrCode,c240Dscr,charge240b,c240bDscr,createdOn,modifiedOn,startStop,phevCity,phevHwy,phevComb
0,15.695714,0.0,0.0,0.0,19,0.0,0,0.0,0.0,0.0,0.0,-1,-1,0.0,423.190476,21,0.0,0,0.0,0.0,...,1985,-2250,,,,,,,,,,,0.0,,Tue Jan 01 00:00:00 EST 2013,Tue Jan 01 00:00:00 EST 2013,,0,0,0
1,29.964545,0.0,0.0,0.0,9,0.0,0,0.0,0.0,0.0,0.0,-1,-1,0.0,807.909091,11,0.0,0,0.0,0.0,...,1985,-11500,T,,,,,,,,,,0.0,,Tue Jan 01 00:00:00 EST 2013,Tue Jan 01 00:00:00 EST 2013,,0,0,0
2,12.207778,0.0,0.0,0.0,23,0.0,0,0.0,0.0,0.0,0.0,-1,-1,0.0,329.148148,27,0.0,0,0.0,0.0,...,1985,0,,SIL,,,,,,,,,0.0,,Tue Jan 01 00:00:00 EST 2013,Tue Jan 01 00:00:00 EST 2013,,0,0,0


In [9]:
# vehicles.info()

- The Stack Overflow developer survey data from 2019

In [10]:
dev_survey = pd.read_csv("Data/dev_survey_2019.zip")

In [11]:
dev_survey.head(3)

Unnamed: 0,Respondent,MainBranch,Hobbyist,OpenSourcer,OpenSource,Employment,Country,Student,EdLevel,UndergradMajor,EduOther,OrgSize,DevType,YearsCode,Age1stCode,YearsCodePro,CareerSat,JobSat,MgrIdiot,MgrMoney,...,SOVisitFreq,SOVisitTo,SOFindAnswer,SOTimeSaved,SOHowMuchTime,SOAccount,SOPartFreq,SOJobs,EntTeams,SOComm,WelcomeChange,SONewContent,Age,Gender,Trans,Sexuality,Ethnicity,Dependents,SurveyLength,SurveyEase
0,1,I am a student who is learning to code,Yes,Never,The quality of OSS and closed source software ...,"Not employed, and not looking for work",United Kingdom,No,Primary/elementary school,,"Taught yourself a new language, framework, or ...",,,4.0,10,,,,,,...,A few times per month or weekly,Find answers to specific questions;Learn how t...,3-5 times per week,Stack Overflow was much faster,31-60 minutes,No,,"No, I didn't know that Stack Overflow had a jo...","No, and I don't know what those are",Neutral,Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,14.0,Man,No,Straight / Heterosexual,,No,Appropriate in length,Neither easy nor difficult
1,2,I am a student who is learning to code,No,Less than once per year,The quality of OSS and closed source software ...,"Not employed, but looking for work",Bosnia and Herzegovina,"Yes, full-time","Secondary school (e.g. American high school, G...",,Taken an online course in programming or softw...,,"Developer, desktop or enterprise applications;...",,17,,,,,,...,Daily or almost daily,Find answers to specific questions;Learn how t...,3-5 times per week,Stack Overflow was much faster,11-30 minutes,Yes,A few times per month or weekly,"No, I knew that Stack Overflow had a job board...","No, and I don't know what those are","Yes, somewhat",Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,19.0,Man,No,Straight / Heterosexual,,No,Appropriate in length,Neither easy nor difficult
2,3,"I am not primarily a developer, but I write co...",Yes,Never,The quality of OSS and closed source software ...,Employed full-time,Thailand,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)",Web development or web design,"Taught yourself a new language, framework, or ...",100 to 499 employees,"Designer;Developer, back-end;Developer, front-...",3.0,22,1.0,Slightly satisfied,Slightly satisfied,Not at all confident,Not sure,...,A few times per week,Find answers to specific questions;Learn how t...,6-10 times per week,They were about the same,,Yes,Less than once per month or monthly,Yes,"No, I've heard of them, but I am not part of a...",Neutral,Just as welcome now as I felt last year,Tech meetups or events in your area;Courses on...,28.0,Man,No,Straight / Heterosexual,,Yes,Appropriate in length,Neither easy nor difficult


In [12]:
# dev_survey.info()

---------------

## Mathematical operations on DataFrames

----------------

**Similar to series objects, Math operations for DataFrames are Index Aligned.**

Alignment takes each index entry of a column in the left dataframe and matches it with every entry that has the same index label in the same column of the right dataframe. This is repeated for all overlapping columns. If either dataframe has duplicate index labels, the result can be surprising: every matching pair of labels is combined, so the duplicates multiply (a Cartesian product of the matching index entries).

In [13]:
# s1: 3 rows and 4 columns
# s2: 2 rows and 5 columns
s1 = pd.DataFrame(
    np.linspace(2, 13, 12).reshape(3, 4),
    columns=["a1", "b1", "c1", "d1"],
    index=[1, 2, 3],
)
s2 = pd.DataFrame(
    np.linspace(2, 11, 10).reshape(2, 5),
    columns=["a1", "b1", "c1", "d1", "e1"],
    index=[2, 2],
)

In [14]:
s1 + s2

Unnamed: 0,a1,b1,c1,d1,e1
1,,,,,
2,8.0,10.0,12.0,14.0,
2,13.0,15.0,17.0,19.0,
3,,,,,


As we can see, only the **overlapping rows** (2nd row) **and columns** (a1 through d1) get added together. The other values are missing. We can use the **.add method instead of "+" and define a fill value** if we wanted, similar to what we've done in case of series objects.

In [15]:
s1.add(s2, fill_value=0)

Unnamed: 0,a1,b1,c1,d1,e1
1,2.0,3.0,4.0,5.0,
2,8.0,10.0,12.0,14.0,6.0
2,13.0,15.0,17.0,19.0,11.0
3,10.0,11.0,12.0,13.0,
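
The frames above have duplicates on one side only. As a hedged sketch of what happens when both sides repeat a label, the made-up frames `left` and `right` below each contain the label 2 twice; the sum then has four rows for that label, one per pairing.

```python
import pandas as pd

# Hypothetical frames: label 2 appears twice in each of them
left = pd.DataFrame({"v": [1, 2, 3]}, index=[1, 2, 2])
right = pd.DataFrame({"v": [10, 20, 30]}, index=[2, 2, 3])

total = left + right

# Labels 1 and 3 exist on one side only, so they become NaN.
# Label 2 yields 2 x 2 = 4 rows: every left entry paired with every right entry.
print(total)
print(len(total))  # 6
```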


---------------------------------------------------

## Looping over a DataFrame (using the `for` loop)

-----------------------------------------------------

Using a for loop over a pandas dataframe is generally discouraged. The built-in pandas methods are vectorized and therefore much faster than a Python-level loop, so looping gives up that advantage. However, a for loop can sometimes be useful with dataframes, for example when plotting visuals.
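
As a rough illustration (on synthetic data, not one of the datasets loaded above), the sketch below computes row totals twice: once with a Python loop over `.iterrows()` and once with a single vectorized call. Both give the same answer; the vectorized form is shorter and avoids building one Series object per row.

```python
import numpy as np
import pandas as pd

# Synthetic frame, used only for this comparison
rng = np.random.default_rng(0)
frame = pd.DataFrame(rng.integers(1, 45, size=(1_000, 5)), columns=list("abcde"))

# Loop version: Python-level iteration, one Series built per row
loop_totals = pd.Series(
    [row.sum() for _, row in frame.iterrows()], index=frame.index
)

# Vectorized version: a single method call
vector_totals = frame.sum(axis=1)

same = bool((loop_totals == vector_totals).all())
print(same)  # True
```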

Some iterator methods that are useful while looping over a dataframe are, **.items(), .iterrows(), .itertuples()**.

- The `.items()` method: returns a tuple of **(column name, column content as Series)**

The indexes of the returned series will be the indexes of the dataframe.

In [16]:
for col_label, col_content in siena_2018.items():
    print(f"Column Name: {col_label}")
    print(f"Data contents of the column: \n{col_content}")
    break

Column Name: Seq.
Data contents of the column: 
1      1
2      2
3      3
4      4
5      5
      ..
40    41
41    42
42    43
43    44
44    45
Name: Seq., Length: 44, dtype: object


- The `.iterrows()` method: returns a tuple of **(index, row content as Series)**

The indexes of the returned series will be the associated column names.

In [17]:
for idx, row_content in siena_2018.iterrows():
    print(f"Row index: {idx}\n")
    print(f"Data contents of the row: \n\n{row_content}")
    break

Row index: 1

Data contents of the row: 

Seq.                         1
President    George Washington
Party              Independent
Bg                           7
Im                           7
                   ...        
DA                           2
FPA                          2
AM                           1
EV                           2
O                            1
Name: 1, Length: 24, dtype: object


- The `.itertuples()` method: returns the **rows as namedtuples**

In [18]:
for row in siena_2018.itertuples():
    print(row)
    break

Pandas(Index=1, _1='1', President='George Washington', Party='Independent', Bg=7, Im=7, Int=1, IQ=10, L=1, WR=6, AC=2, EAb=2, LA=1, CAb=11, OA=2, PL=18, RC=1, CAp=1, HE=1, EAp=1, DA=2, FPA=2, AM=1, EV=2, O=1)


--------------------------------

## Aggregations

--------------------------

Aggregations that are applicable to a Series object are also applicable to a DataFrame. The only difference is that in dataframes, aggregations can be applied along either of the two axes (i.e., index and columns).

In [19]:
# let's slice out the portion of the siena_2018 df that holds numerical values.
# it's good practice to call .copy() when slicing a df, so that operations on the slice don't affect the original df
scores = siena_2018.loc[:, "Bg":"O"].copy()

#### *Multiple aggregations on a dataframe using the `.agg` method*

- **Aggregate over axis 1 (apply function to each row i.e, aggregate across the columns)**

In [20]:
scores.agg(["sum", "mean"], axis=1).tail(3)

Unnamed: 0,sum,mean
42,635.0,30.238095
43,331.0,15.761905
44,833.0,39.666667


- **Aggregate over axis 0 (apply function to each column i.e, aggregate across the rows)**

In [21]:
scores.agg(["sum", "mean"], axis=0).head(3)

Unnamed: 0,Bg,Im,Int,IQ,L,WR,AC,EAb,LA,CAb,OA,PL,RC,CAp,HE,EAp,DA,FPA,AM,EV,O
sum,968.0,957.0,990.0,990.0,990.0,953.0,968.0,978.0,990.0,990.0,990.0,990.0,979.0,990.0,990.0,990.0,990.0,990.0,990.0,990.0,990.0
mean,22.0,21.75,22.5,22.5,22.5,21.659091,22.0,22.227273,22.5,22.5,22.5,22.5,22.25,22.5,22.5,22.5,22.5,22.5,22.5,22.5,22.5


- Different aggregations per column

In [22]:
scores.agg({"Int": ["max", "mean"], "IQ": ["min", "mean"]})

Unnamed: 0,Int,IQ
max,44.0,
mean,22.5,22.5
min,,1.0


#### *The `.describe()` method returns a dataframe with summary statistics for each numeric column*

In [23]:
scores.describe()

Unnamed: 0,Bg,Im,Int,IQ,L,WR,AC,EAb,LA,CAb,OA,PL,RC,CAp,HE,EAp,DA,FPA,AM,EV,O
count,44.0,44.0,44.0,44.0,44.0,44.0,44.0,44.0,44.0,44.0,44.0,44.0,44.0,44.0,44.0,44.0,44.0,44.0,44.0,44.0,44.0
mean,22.0,21.75,22.5,22.5,22.5,21.659091,22.0,22.227273,22.5,22.5,22.5,22.5,22.25,22.5,22.5,22.5,22.5,22.5,22.5,22.5,22.5
std,12.409674,12.519984,12.845233,12.845233,12.845233,11.892822,12.409674,12.500909,12.845233,12.845233,12.845233,12.845233,12.519984,12.845233,12.845233,12.845233,12.845233,12.845233,12.845233,12.845233,12.845233
min,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
25%,11.75,11.0,11.75,11.75,11.75,11.75,11.75,11.75,11.75,11.75,11.75,11.75,11.75,11.75,11.75,11.75,11.75,11.75,11.75,11.75,11.75
50%,22.0,21.5,22.5,22.5,22.5,22.5,22.0,22.5,22.5,22.5,22.5,22.5,22.5,22.5,22.5,22.5,22.5,22.5,22.5,22.5,22.5
75%,32.25,32.25,33.25,33.25,33.25,31.25,32.25,32.25,33.25,33.25,33.25,33.25,33.0,33.25,33.25,33.25,33.25,33.25,33.25,33.25,33.25
max,43.0,43.0,44.0,44.0,44.0,41.0,43.0,43.0,44.0,44.0,44.0,44.0,43.0,44.0,44.0,44.0,44.0,44.0,44.0,44.0,44.0


**Note:** The count row in the summary statistics has a particular meaning in pandas. It is not the count of the rows, rather it is the count of the non-missing (not na) rows.
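
A small sketch of that distinction, using a made-up frame with one missing value: the frame has three rows, but the count for the column with the gap is two.

```python
import numpy as np
import pandas as pd

# Made-up frame: "gappy" has one missing value
ranks = pd.DataFrame({"full": [1.0, 2.0, 3.0], "gappy": [1.0, np.nan, 3.0]})

counts = ranks.describe().loc["count"]

print(len(ranks))        # 3 rows in total
print(counts.to_dict())  # {'full': 3.0, 'gappy': 2.0}
```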

---------------------

## Casting Datatypes and Renaming the columns

-----------------------

**Note: This (casting datatypes and renaming columns) should be the first step whenever we load in a dataset. Also, we should write these commands as functions, allowing us to reuse the code in other notebooks if necessary.**

### *Renaming the columns with proper full form*

- Getting the full form of each column from the "siena_2018_cols" string

In [24]:
# we want to write a code to generate a python dictionary from the above multiline string named "siena_2018_cols", which is
# formatted as short form = long form. This dictionary will be used to rename the columns of the dataframe "siena_2018"

# first we create a list of the form, [[short, full], .....]
cols_list = [
    col.strip().split("=") for col in siena_2018_cols.strip().split(sep="•")[1:]
]

# we will replace the spaces in the full form with underscores (_)
siena_2018_cols_dict = {
    col_prev.strip(): col_full.strip().replace(" ", "_")
    for col_prev, col_full in cols_list
}

**Note:** In `for col_prev, col_full in cols_list`, each element of the outer list is itself a two-item list. On every iteration the loop unpacks that inner list into the two names, so the unpacking applies to the inner layer and not the outer one.

#### The `.rename()` method

In [25]:
# reassign the result instead of using inplace=True, which is generally frowned upon
siena_2018 = siena_2018.rename(columns={"Seq.": "Seq"}).rename(
    columns=siena_2018_cols_dict
)

In [26]:
siena_2018.head(3)

Unnamed: 0,Seq,President,Party,Background,Imagination,Integrity,Intelligence,Luck,Willing_to_take_risks,Ability_to_compromise,Executive_ability,Leadership_ability,Communication_ability,Overall_ability,Party_leadership,Relations_with_Congress,Court_appointments,Handling_of_economy,Executive_appointments,Domestic_accomplishments,Foreign_policy_accomplishments,Avoid_crucial_mistakes,Experts’_view,Overall
1,1,George Washington,Independent,7,7,1,10,1,6,2,2,1,11,2,18,1,1,1,1,2,2,1,2,1
2,2,John Adams,Federalist,3,13,4,4,24,14,31,21,21,13,8,28,17,4,13,15,19,13,16,10,14
3,3,Thomas Jefferson,Democratic-Republican,2,2,14,1,8,5,14,6,6,4,4,5,5,7,20,4,6,9,7,5,5


### *Casting DataTypes*

The first thing we should do after loading a dataset is check the datatype of each column and convert it to a more suitable one where possible. This saves memory and can speed up subsequent operations.

In [27]:
siena_2018.dtypes.to_dict()  # we could've also used the .info() method

{'Seq': dtype('O'),
 'President': dtype('O'),
 'Party': dtype('O'),
 'Background': dtype('int64'),
 'Imagination': dtype('int64'),
 'Integrity': dtype('int64'),
 'Intelligence': dtype('int64'),
 'Luck': dtype('int64'),
 'Willing_to_take_risks': dtype('int64'),
 'Ability_to_compromise': dtype('int64'),
 'Executive_ability': dtype('int64'),
 'Leadership_ability': dtype('int64'),
 'Communication_ability': dtype('int64'),
 'Overall_ability': dtype('int64'),
 'Party_leadership': dtype('int64'),
 'Relations_with_Congress': dtype('int64'),
 'Court_appointments': dtype('int64'),
 'Handling_of_economy': dtype('int64'),
 'Executive_appointments': dtype('int64'),
 'Domestic_accomplishments': dtype('int64'),
 'Foreign_policy_accomplishments': dtype('int64'),
 'Avoid_crucial_mistakes': dtype('int64'),
 'Experts’_view': dtype('int64'),
 'Overall': dtype('int64')}

> **First, let's explore the columns with "Object" datatype**

- The "Seq" column (Sequences of the presidency)

In [28]:
siena_2018.Seq.values

array(['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12',
       '13', '14', '15', '16', '17', '18', '19', '20', '21', '22/24',
       '23', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34',
       '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45'],
      dtype=object)

Upon inspection we can see that, there's a value of '22/24'. So this column has to remain as "Object" type (or, can be converted to "string" type). 

- The "President" column lists the name of the president. So, this can either be converted to "string" type or can remain as is. 

- The "Party" column gives the name of the party the president was elected with

In [29]:
siena_2018.Party.value_counts()

Republican               19
Democratic               15
Democratic-Republican     4
Whig                      3
Independent               2
Federalist                1
Name: Party, dtype: int64

This column has only 6 unique values. So, this can be converted to "categorical" type.

In [30]:
siena_2018 = siena_2018.astype({"Party": "category"})

> **Now, let's explore the columns with "int64" as datatype**

**Note:** One of the interesting and important pandas methods is the `.select_dtypes()` method. This will select all the columns with the specified datatype and return those columns as a new DataFrame.

In [31]:
siena_2018.select_dtypes("int64").head(3)

Unnamed: 0,Background,Imagination,Integrity,Intelligence,Luck,Willing_to_take_risks,Ability_to_compromise,Executive_ability,Leadership_ability,Communication_ability,Overall_ability,Party_leadership,Relations_with_Congress,Court_appointments,Handling_of_economy,Executive_appointments,Domestic_accomplishments,Foreign_policy_accomplishments,Avoid_crucial_mistakes,Experts’_view,Overall
1,7,7,1,10,1,6,2,2,1,11,2,18,1,1,1,1,2,2,1,2,1
2,3,13,4,4,24,14,31,21,21,13,8,28,17,4,13,15,19,13,16,10,14
3,2,2,14,1,8,5,14,6,6,4,4,5,5,7,20,4,6,9,7,5,5


- Let's see the max and min values of the number type columns

In [32]:
siena_2018.select_dtypes("int64").agg(["max", "min"])

Unnamed: 0,Background,Imagination,Integrity,Intelligence,Luck,Willing_to_take_risks,Ability_to_compromise,Executive_ability,Leadership_ability,Communication_ability,Overall_ability,Party_leadership,Relations_with_Congress,Court_appointments,Handling_of_economy,Executive_appointments,Domestic_accomplishments,Foreign_policy_accomplishments,Avoid_crucial_mistakes,Experts’_view,Overall
max,43,43,44,44,44,41,43,43,44,44,44,44,43,44,44,44,44,44,44,44,44
min,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1


As we can see, none of the columns has a value greater than 44 or less than 1. So, these columns can safely be converted to the "uint8" type (range 0 to 255) and still accommodate the values as they are.

In [33]:
siena_2018 = siena_2018.astype(
    {col_name: "uint8" for col_name in siena_2018.select_dtypes("int64").columns}
)

After casting the columns to more appropriate types, the memory footprint of the dataframe shrinks considerably.

In [34]:
siena_2018.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 44 entries, 1 to 44
Data columns (total 24 columns):
 #   Column                          Non-Null Count  Dtype   
---  ------                          --------------  -----   
 0   Seq                             44 non-null     object  
 1   President                       44 non-null     object  
 2   Party                           44 non-null     category
 3   Background                      44 non-null     uint8   
 4   Imagination                     44 non-null     uint8   
 5   Integrity                       44 non-null     uint8   
 6   Intelligence                    44 non-null     uint8   
 7   Luck                            44 non-null     uint8   
 8   Willing_to_take_risks           44 non-null     uint8   
 9   Ability_to_compromise           44 non-null     uint8   
 10  Executive_ability               44 non-null     uint8   
 11  Leadership_ability              44 non-null     uint8   
 12  Communication_ability   
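
To put a number on the memory claim, here is a hedged sketch on a synthetic frame (the names `toy` and `slim` and the data are assumptions, not the Siena data). It measures the footprint with `.memory_usage(deep=True)` before and after the same kind of casts used above.

```python
import numpy as np
import pandas as pd

n = 10_000
# Synthetic stand-in: a low-cardinality text column and small integer ranks
toy = pd.DataFrame(
    {
        "party": np.tile(["Republican", "Democratic", "Whig", "Federalist"], n // 4),
        "rank": np.tile(np.arange(1, 45, dtype="int64"), n // 44 + 1)[:n],
    }
)

slim = toy.astype({"party": "category", "rank": "uint8"})

# deep=True also counts the memory held by the Python string objects
before = int(toy.memory_usage(deep=True).sum())
after = int(slim.memory_usage(deep=True).sum())

print(f"before: {before} bytes, after: {after} bytes")
```

The exact byte counts depend on the pandas version and platform, so only the direction of the change should be relied on.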

-------------------------------------------------

## Creating and Updating columns: The `.assign()` method

---------------------------------------------------

**Why use .assign?** This method returns a new dataframe and doesn't mutate the existing one. This is very useful for chaining operations, because each step in the chain produces an updated dataframe and the subsequent methods operate on that updated dataframe.

<u>**\*\*kwargs:** argument (column name) = argument value (callable or Series), ...... </u>
- if the column already exists it will modify the values of the column
- if the column doesn't exist then it will create a new column
- if the argument value is a series or a scalar, it will simply assign those values to the column
- the callable (a function or *lambda*) must return a scalar or series. Using a function (it can be a normal function, but often we use a lambda to keep the logic inline) has a less obvious benefit. If any manipulation or filtering was done on the dataframe earlier in the chain, before `.assign()`, those changes are reflected in the dataframe the function receives: *the function is called with the current state of the dataframe.*

**`lambda` function Refresher:** A lambda function can take any number of arguments, but can only have one expression. *Syntax:* `lambda arguments : expression`. The expression is executed and the result is returned.
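
The benefit of passing a callable can be sketched with a small made-up frame (`polls` and its columns are illustrative names). The frame is filtered first, and the lambdas inside `.assign()` then see only the filtered rows; the second lambda also uses a column created by the first.

```python
import pandas as pd

# Made-up data for illustration
polls = pd.DataFrame({"name": ["A", "B", "C", "D"], "score": [10, 40, 20, 30]})

result = (
    polls
    .query("score >= 20")  # keeps B, C and D
    .assign(
        # The callable receives the filtered frame, not the original `polls`
        share=lambda df_: df_.score / df_.score.sum(),
        # A later keyword can use a column created by an earlier one
        share_pct=lambda df_: (df_.share * 100).round(1),
    )
)

print(result)
print("share" in polls.columns)  # False: the original frame is untouched
```

Had `share` been written as `polls.score / polls.score.sum()`, the denominator would have included the row that the filter removed.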

In [35]:
# First, we add a column named Average_rank that ranks the presidents by their total score
# (the sum of the numeric values across the columns), using the dense method
# (tied values share the lowest rank in the group, and the rank always increases by 1 between groups).
# This is essentially the "Overall" column, but computed with a different ranking method.

# Next, we add another column named Quartile_rank with 4 bins (1st, 2nd, 3rd, 4th).
# This is where we see the power of using a function:
# the lambda receives the current state of the dataframe, in which the Average_rank column already exists.

siena_2018 = siena_2018.assign(
    Average_rank=siena_2018.loc[:, "Background":"Experts’_view"]
    .sum(axis=1)
    .rank(method="dense")
    .astype("uint8"),
    Quartile_rank=lambda df_: pd.qcut(
        df_.Average_rank, 4, labels=["1st", "2nd", "3rd", "4th"]
    ),
)

In [36]:
siena_2018.head(3)

Unnamed: 0,Seq,President,Party,Background,Imagination,Integrity,Intelligence,Luck,Willing_to_take_risks,Ability_to_compromise,Executive_ability,Leadership_ability,Communication_ability,Overall_ability,Party_leadership,Relations_with_Congress,Court_appointments,Handling_of_economy,Executive_appointments,Domestic_accomplishments,Foreign_policy_accomplishments,Avoid_crucial_mistakes,Experts’_view,Overall,Average_rank,Quartile_rank
1,1,George Washington,Independent,7,7,1,10,1,6,2,2,1,11,2,18,1,1,1,1,2,2,1,2,1,1,1st
2,2,John Adams,Federalist,3,13,4,4,24,14,31,21,21,13,8,28,17,4,13,15,19,13,16,10,14,13,2nd
3,3,Thomas Jefferson,Democratic-Republican,2,2,14,1,8,5,14,6,6,4,4,5,5,7,20,4,6,9,7,5,5,5,1st


--------------------------------------------

## Dealing with Missing and Duplicated Data

----------------------------------------------

### *Locating missing data*

- The `.isna()` method

Works similarly to series.isna() method. Returns a Boolean dataframe when used with dataframes.  

In [37]:
siena_2018.isna().head(3)

Unnamed: 0,Seq,President,Party,Background,Imagination,Integrity,Intelligence,Luck,Willing_to_take_risks,Ability_to_compromise,Executive_ability,Leadership_ability,Communication_ability,Overall_ability,Party_leadership,Relations_with_Congress,Court_appointments,Handling_of_economy,Executive_appointments,Domestic_accomplishments,Foreign_policy_accomplishments,Avoid_crucial_mistakes,Experts’_view,Overall,Average_rank,Quartile_rank
1,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False


We can combine the .isna() method with other methods such as **.any(), .all() etc.** to see whether any data at all is missing from a column, whether all the data in a column is NaN, and so on.

In [38]:
siena_2018.isna().any().head(3)

Seq          False
President    False
Party        False
dtype: bool

To `count` how many values are `missing` from each column we can use **df.isna().sum()**.

In [39]:
siena_2018.isna().sum().head(3)

Seq          0
President    0
Party        0
dtype: int64

To see what `percentage` of the data in a column is missing we can use something like, **df.isna().mean().mul(100)**.

In [40]:
siena_2018.isna().mean().mul(100).sample(3)

Communication_ability    0.0
Integrity                0.0
Luck                     0.0
dtype: float64

### *Handling missing values*

- The `.dropna(subset)` method 

We can use the good old .dropna() method to drop the rows with missing values. Note that, when used with a dataframe, .dropna() by default (`how="any"`) drops a row if it has a NaN value in any column; pass `how="all"` to drop only the rows in which every column is NaN. To restrict which columns are inspected when dropping rows, we can feed the subset parameter a list of column names.

- The `.fillna()` method

We can also use the .fillna() method to fill in the missing values. It takes a `value parameter` (value: scalar, dict, Series, or DataFrame) which is used to fill the NaN values. The **.mean(), .median(), .mode()** etc. methods may come in handy when defining the value parameter. Forward and backward filling (`ffill`, `bfill`) can also be requested; in recent pandas versions the `method` argument of .fillna() is deprecated in favor of the dedicated `.ffill()` and `.bfill()` methods.

- The `.interpolate()` method

This will replace Nan values with interpolation of the values around the missing value. This method comes in very handy when dealing with ordered data such as time series data.

- The `.where(cond, other)` method

Although it is not specific to handling missing values, this method is a powerful tool for doing just that. It **replaces values where the condition is False with the corresponding value from 'other'**.

- The `.mask(cond, other)` method

Opposite of the .where() method in the sense that, this method will replace values **where the condition is True** with the corresponding value from 'other'. Equivalent to, **.where(~cond, other)**.
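
The methods listed above can be compared side by side on a small made-up frame (`raw` and its values are assumptions for illustration). This is a sketch of typical calls, not a recipe for any particular dataset.

```python
import numpy as np
import pandas as pd

# Made-up frame: row 1 is missing in every column, row 3 only in "temp"
raw = pd.DataFrame(
    {
        "temp": [20.0, np.nan, 24.0, np.nan],
        "city": ["Oslo", None, "Rome", "Lima"],
    }
)

any_dropped = raw.dropna()                    # default how="any": keeps rows 0 and 2
all_dropped = raw.dropna(how="all")           # drops only row 1
subset_dropped = raw.dropna(subset=["city"])  # looks at "city" only

# A dict gives a different fill value per column
filled = raw.fillna({"temp": raw["temp"].mean(), "city": "unknown"})

forward = raw["temp"].ffill()             # carry the last valid value forward
interpolated = raw["temp"].interpolate()  # linear by default

# .mask replaces where the condition is True; .where replaces where it is False
too_warm = raw["temp"] > 22
masked = raw["temp"].mask(too_warm, 22.0)
kept = raw["temp"].where(~too_warm, 22.0)

print(filled)
print(interpolated.tolist())  # [20.0, 22.0, 24.0, 24.0]
print(masked.equals(kept))    # True
```

Note that the trailing NaN in `temp` has no value after it, so the default linear interpolation simply repeats the last valid value there.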

**`CAUTION:`** The data in each column of a dataframe usually represents a different thing. Applying methods such as .dropna(), .fillna(), or .interpolate() to the whole dataframe at once therefore rarely makes sense (it is like using a spoon for woodworking).

**So, the best approach is to treat each column as a separate series object, clean and modify it, and then add it to (or replace it in) the dataframe using the .assign() method.**
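
A minimal sketch of this column-by-column approach, assuming a made-up `survey` frame in which each column needs its own fill rule:

```python
import numpy as np
import pandas as pd

# Made-up frame: each column needs a different cleaning rule
survey = pd.DataFrame(
    {
        "age": [25.0, np.nan, 40.0],
        "country": ["India", None, "Chile"],
    }
)

cleaned = survey.assign(
    # numeric column: fill with the column median
    age=lambda df_: df_.age.fillna(df_.age.median()),
    # text column: fill with an explicit placeholder
    country=lambda df_: df_.country.fillna("Unknown"),
)

print(cleaned)
```

The choice of median and of the "Unknown" placeholder is only an example; the right rule depends on what each column means.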

### *Handling duplicate data*

- The `.drop_duplicates(subset=None,keep='first', ignore_index=False)` method

If called without any parameters, it drops only the rows that are exact copies of an earlier row, keeping the first occurrence. The subset parameter lets us specify which columns to compare when checking for duplicates.
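
The effect of each parameter can be seen on a small made-up frame (`visits` is an illustrative name):

```python
import pandas as pd

# Made-up frame: row 1 is an exact copy of row 0
visits = pd.DataFrame(
    {
        "user": ["ann", "ann", "bob", "ann"],
        "page": ["home", "home", "home", "cart"],
    }
)

# No arguments: only fully identical rows count as duplicates
exact = visits.drop_duplicates()

# subset: compare the "user" column only, keeping the first row per user
per_user_first = visits.drop_duplicates(subset=["user"])

# keep="last" keeps the last row per user; ignore_index renumbers from 0
per_user_last = visits.drop_duplicates(
    subset=["user"], keep="last", ignore_index=True
)

print(exact.index.tolist())           # [0, 2, 3]
print(per_user_first.index.tolist())  # [0, 2]
print(per_user_last)
```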

--------------------------------------------

## Sorting Columns and Indexes 

---------------------------------------------

#### Setting indexes: The `.set_index()` method

Returns a dataframe with the new index.

<u> Parameters -- </u>

- **keys**: column(s) to be set as index.
- **drop = True** : default True. Indicates whether to remove columns used for the index.
- **verify_integrity = False** : check for duplicate index values by setting verify_integrity=True.

#### Sorting indexes: The `.sort_index()` method

<u> Parameters -- </u>

- **axis = 0**: This method will return dataframe with index (axis=0) or columns (axis=1) sorted.  
- **ascending = True**: default True.
- **key = None**:  A key function accepts an index and should return an index. For multi-level indexes, each index is passed in independently to the function.

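This operation is usually done after setting a new index. Sorting matters for label-based slicing: if the new index is, for example, of **string type**, then **sorting it allows us to slice** between bounds that are not themselves labels in the index. On an unsorted index, pandas throws a KeyError when a slice bound is missing from the index or is a duplicated label.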
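
The interaction between `.set_index()`, `.sort_index()` and slicing can be sketched as follows. The frame `people` and its values are made up for illustration.

```python
import pandas as pd

# Made-up frame for illustration
people = pd.DataFrame(
    {"name": ["Polk", "Adams", "Grant", "Hayes"], "rank": [12, 14, 24, 32]}
)

# verify_integrity=True would raise a ValueError if "name" had duplicates
by_name = people.set_index("name", verify_integrity=True)
sorted_by_name = by_name.sort_index()

# "B" and "H" are not labels in the index; a sorted index can still bracket them
sliced = sorted_by_name.loc["B":"H"]
print(sliced.index.tolist())  # ['Grant']

# The same slice on the unsorted index fails because "B" cannot be located
try:
    by_name.loc["B":"H"]
    unsorted_failed = False
except KeyError:
    unsorted_failed = True
print(unsorted_failed)  # True
```

"Hayes" is excluded because it sorts after the bound "H".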

#### Sorting values: The `.sort_values()` method

<u> Parameters -- </u>

- **by**: column name or a list of names to sort by.
- **ascending = True**: bool or list of bool, default True.
- **key = None**: Apply the key function to the values before sorting. This is similar to the `key` argument in the builtin **sorted** function. A key function accepts a series and should return a series with the same index.


In [41]:
siena_2018.sort_values(by=["Quartile_rank", "Intelligence"], ascending=[True, False]).sample(3)

Unnamed: 0,Seq,President,Party,Background,Imagination,Integrity,Intelligence,Luck,Willing_to_take_risks,Ability_to_compromise,Executive_ability,Leadership_ability,Communication_ability,Overall_ability,Party_leadership,Relations_with_Congress,Court_appointments,Handling_of_economy,Executive_appointments,Domestic_accomplishments,Foreign_policy_accomplishments,Avoid_crucial_mistakes,Experts’_view,Overall,Average_rank,Quartile_rank
6,6,John Quincy Adams,Democratic-Republican,1,9,6,5,29,19,24,22,23,12,16,29,29,15,17,18,21,15,14,18,18,18,2nd
1,1,George Washington,Independent,7,7,1,10,1,6,2,2,1,11,2,18,1,1,1,1,2,2,1,2,1,1,1st
26,27,William Howard Taft,Republican,12,28,12,14,27,31,19,23,26,21,23,30,21,16,19,21,18,22,19,23,22,22,2nd
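
The `key` parameter of `.sort_values()` is not used in the cell above. As a hedged sketch on a made-up frame (`parties`, with deliberately inconsistent capitalization), it can be used to sort text case-insensitively:

```python
import pandas as pd

# Made-up frame with inconsistent capitalization
parties = pd.DataFrame(
    {"party": ["whig", "Republican", "democratic", "Federalist"]}
)

# Default ordering: uppercase letters sort before lowercase ones
default_order = parties.sort_values(by="party")

# key receives the column as a Series and must return a Series of the same shape
case_insensitive = parties.sort_values(by="party", key=lambda s: s.str.lower())

print(default_order["party"].tolist())
# ['Federalist', 'Republican', 'democratic', 'whig']
print(case_insensitive["party"].tolist())
# ['democratic', 'Federalist', 'Republican', 'whig']
```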


--------------------------

## Indexing & Filtering 

------------------------

### *Renaming an index: The `.rename()` method*

<u> Parameters: </u>
- **mapper**: Dict-like or function transformations to apply to the specified axis' values. When using a function, pass the function object itself (its name, without calling it).
- **axis**: index (0) or, columns(1).

In [42]:
# say, we would like to set the president name as our index and use initial for first name and not the full name
def name_to_initial(val):
    vals = val.split(" ")
    return " ".join([f'{vals[0][0]}.', *vals[1:]])  # unpack the items in the vals[1:] list

siena_2018.set_index("President").rename(name_to_initial).head(3) # or, lambda name_: " ".join([f'{name_.split()[0][0]}.', *name_.split()[1:]]) 

President,Seq,Party,Background,Imagination,Integrity,Intelligence,Luck,Willing_to_take_risks,Ability_to_compromise,Executive_ability,Leadership_ability,Communication_ability,Overall_ability,Party_leadership,Relations_with_Congress,Court_appointments,Handling_of_economy,Executive_appointments,Domestic_accomplishments,Foreign_policy_accomplishments,Avoid_crucial_mistakes,Experts’_view,Overall,Average_rank,Quartile_rank
G. Washington,1,Independent,7,7,1,10,1,6,2,2,1,11,2,18,1,1,1,1,2,2,1,2,1,1,1st
J. Adams,2,Federalist,3,13,4,4,24,14,31,21,21,13,8,28,17,4,13,15,19,13,16,10,14,13,2nd
T. Jefferson,3,Democratic-Republican,2,2,14,1,8,5,14,6,6,4,4,5,5,7,20,4,6,9,7,5,5,5,1st
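
As a small complement to the function mapper above, the sketch below shows the dict form of **mapper** and the **axis** parameter. The frame `toy` is a hypothetical stand-in and not the Siena data.

```python
import pandas as pd

# Hypothetical toy frame standing in for siena_2018 (values are illustrative)
toy = pd.DataFrame(
    {"Seq": [1, 2], "Party": ["Independent", "Federalist"]},
    index=["George Washington", "John Adams"],
)

# Dict mapper on the row labels: labels missing from the dict are left unchanged
renamed_rows = toy.rename({"George Washington": "G. Washington"}, axis="index")

# Function mapper on the column labels: pass the function itself, without calling it
renamed_cols = toy.rename(str.lower, axis="columns")

print(renamed_rows.index.to_list())
print(renamed_cols.columns.to_list())
```

Neither call modifies `toy`; each returns a new dataframe, which is what makes `.rename()` convenient inside a chain.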


#### Resetting indexes to monotonically increasing integers: The `.reset_index()` method

In [43]:
siena_2018.set_index("President").reset_index().head(3)

Unnamed: 0,President,Seq,Party,Background,Imagination,Integrity,Intelligence,Luck,Willing_to_take_risks,Ability_to_compromise,Executive_ability,Leadership_ability,Communication_ability,Overall_ability,Party_leadership,Relations_with_Congress,Court_appointments,Handling_of_economy,Executive_appointments,Domestic_accomplishments,Foreign_policy_accomplishments,Avoid_crucial_mistakes,Experts’_view,Overall,Average_rank,Quartile_rank
0,George Washington,1,Independent,7,7,1,10,1,6,2,2,1,11,2,18,1,1,1,1,2,2,1,2,1,1,1st
1,John Adams,2,Federalist,3,13,4,4,24,14,31,21,21,13,8,28,17,4,13,15,19,13,16,10,14,13,2nd
2,Thomas Jefferson,3,Democratic-Republican,2,2,14,1,8,5,14,6,6,4,4,5,5,7,20,4,6,9,7,5,5,5,1st
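
A minimal sketch of the two common forms of `.reset_index()`, on a hypothetical toy frame: by default the old index is kept as a column, while `drop=True` discards it.

```python
import pandas as pd

# Hypothetical toy frame with a named, non-integer index
toy = pd.DataFrame({"Seq": [1, 2, 3]}, index=["Washington", "Adams", "Jefferson"])
toy.index.name = "President"

kept = toy.reset_index()              # old index becomes a "President" column
dropped = toy.reset_index(drop=True)  # old index is thrown away

print(kept.columns.to_list())
print(dropped.index.to_list())
```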


### *Filtering Index and Column Labels with `.filter(items, like, regex, axis)`*

- **items** (passed as a list) is used for exact matches. Note that an exact match (with items) fails if the axis has duplicate labels, but a label that doesn't exist is silently ignored instead of raising an error.
- **like** is used for substring matches.
- **regex** specifies a regular expression to match against index or column labels.
- **axis** specifies whether to filter the index (0) or the columns (1).
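
The parameters above can be sketched on a small frame. The frame `toy` and the label `"Missing"` are made up for illustration; for a dataframe, `.filter()` works on the columns unless `axis=0` is given.

```python
import pandas as pd

# Hypothetical toy frame (illustrative values only)
toy = pd.DataFrame(
    {
        "Seq": [1, 2],
        "Party": ["Independent", "Federalist"],
        "Average_rank": [1, 13],
        "Quartile_rank": ["1st", "2nd"],
    },
    index=["George Washington", "John Adams"],
)

exact = toy.filter(items=["Seq", "Missing"])  # "Missing" does not exist and is ignored
substring = toy.filter(like="rank")           # column labels containing "rank"
pattern = toy.filter(regex=r"^[SP]")          # column labels starting with S or P
rows = toy.filter(like="Adams", axis=0)       # filter the row labels instead

print(exact.columns.to_list())
print(substring.columns.to_list())
print(pattern.columns.to_list())
print(rows.index.to_list())
```

Note that `.filter()` only looks at labels, never at the values stored in the dataframe.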

### *Indexing by Name: The `.loc[]` method*

The **`.loc[row indexer, column indexer]`** attribute is **primarily label based**, but may also be used with a boolean array.

Allowable inputs may be:
- **Scalar:** if one of the indexers is passed as a scalar label, it will return,
    - a dataframe if the label occurs more than once on that axis and,
    - a series if the label occurs exactly once. This series is indexed by,
        - the column labels if the scalar selects a row (axis=0).
        - the row labels if the scalar selects a column (axis=1).
    - To get a dataframe in all cases, pass the scalar wrapped in a list.
- **Array of labels**
- **Slice object:** Slicing with .loc includes both the start and the end label. *Some notes:*
    - If the axis being sliced has unsorted duplicate index labels, we first need to sort the index with **.sort_index()**.
    - Slicing with string labels that are not present in the index only works if the index is sorted.
    - Partial slicing can only be done on string types and not on categorical type.
- **A boolean array:** of the same length as the indexing axis.
- **A callable function:** that returns one of the above.
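
The allowable inputs listed above can be sketched as follows. The frame `toy` is a hypothetical stand-in with a sorted string index, not the Siena data.

```python
import pandas as pd

# Hypothetical toy frame with a sorted, unique string index
toy = pd.DataFrame(
    {
        "Seq": [2, 3, 1],
        "Party": ["Federalist", "Democratic-Republican", "Independent"],
        "Average_rank": [13, 5, 1],
    },
    index=["Adams", "Jefferson", "Washington"],
)

row_as_series = toy.loc["Adams"]                         # scalar -> Series indexed by column labels
row_as_frame = toy.loc[["Adams"]]                        # list -> DataFrame with one row
inclusive = toy.loc["Adams":"Jefferson", "Seq":"Party"]  # both end labels are included
masked = toy.loc[toy["Average_rank"] < 10, ["Seq"]]      # boolean array + list of labels
via_callable = toy.loc[lambda df_: df_["Average_rank"] < 10]  # callable returning a mask

print(type(row_as_series).__name__, type(row_as_frame).__name__)
print(inclusive.shape)
print(masked.index.to_list())
```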

#### Using `functions with .loc` (for filtering)

The main advantage of using functions with .loc is that the function receives the current state of the dataframe as input. This is especially useful when multiple operations are chained together.

In [44]:
# let us select presidents with average rank < 10 and return first 3 columns of data about them
siena_2018.loc[siena_2018.Average_rank < 10, lambda df_: df_.columns[:3]].head(3)  # columns[:3] -> Seq, President, Party (the index is not part of .columns)

Unnamed: 0,Seq,President,Party
1,1,George Washington,Independent
3,3,Thomas Jefferson,Democratic-Republican
4,4,James Madison,Democratic-Republican


In [45]:
# the same can be achieved by the following section of code
siena_2018.loc[siena_2018.Average_rank < 10, "Seq":"Party"].head(3)

Unnamed: 0,Seq,President,Party
1,1,George Washington,Independent
3,3,Thomas Jefferson,Democratic-Republican
4,4,James Madison,Democratic-Republican


### *Indexing by Position: The `.iloc[]` method*

The **`.iloc[row indexer, column indexer]`** attribute operates on **integer positions and not index labels**. It can also be used with a boolean array.

Allowable inputs may be:
- **Scalar:** an integer position always identifies exactly one row or column, so if one of the indexers is passed as a scalar the result is a series. This series will have,
    - columns set as index labels if the scalar selects a row (axis=0).
    - rows set as index labels if the scalar selects a column (axis=1).
    - To get a dataframe instead, pass the scalar wrapped in a list.
- **Array of integer positions**
- **Slice object:** Slicing with .iloc includes the start but not the end. *Note:* because .iloc selects by position, unsorted or duplicate index labels do not affect the result, so no prior **.sort_index()** is needed.
- **A boolean array:** of the same length as the indexing axis.
- **A callable function:** that returns one of the above.
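
A short sketch of the same kinds of input with `.iloc`, on a hypothetical toy frame. One detail assumed here: the boolean mask is converted to a NumPy array first, since `.iloc` does not accept a boolean Series as a mask.

```python
import pandas as pd

# Hypothetical toy frame (illustrative values only)
toy = pd.DataFrame(
    {
        "Seq": [1, 2, 3],
        "Party": ["Independent", "Federalist", "Democratic-Republican"],
        "Average_rank": [1, 13, 5],
    },
    index=["Washington", "Adams", "Jefferson"],
)

first_row = toy.iloc[0]          # scalar position -> Series indexed by column labels
first_row_frame = toy.iloc[[0]]  # list of positions -> DataFrame with one row
half_open = toy.iloc[0:2, 0:2]   # rows 0-1 and columns 0-1; the end position is excluded

mask = (toy["Average_rank"] < 10).to_numpy()  # plain boolean array, same length as the axis
masked = toy.iloc[mask]

last_row = toy.iloc[lambda df_: [len(df_) - 1]]  # callable returning a list of positions

print(half_open.shape)
print(masked.index.to_list())
print(last_row.index.to_list())
```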

### *Filtering with boolean arrays (Boolean Masking)*

Boolean arrays can be used to filter data from a dataframe. By combining comparison operators (such as <, >, ==) with the element-wise logical operators (&, |, ~), complex filters can be implemented.

In [46]:
# let's select the presidents who were Republican and have an average rank < 10.
try:
    siena_2018[siena_2018.Average_rank < 10 & siena_2018.Party == "Republican"]
except TypeError as err: print(err)

unsupported operand type(s) for &: 'int' and 'Categorical'


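The takeaway is that you should always put parentheses around each condition when you combine several of them inline, because `&` and `|` have higher precedence than comparison operators such as `<` and `==`. Without the parentheses, Python evaluates `10 & siena_2018.Party` first, which causes the error above.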

Now let's do this properly.

In [47]:
siena_2018[(siena_2018.Average_rank < 10) & (siena_2018.Party == "Republican")]

Unnamed: 0,Seq,President,Party,Background,Imagination,Integrity,Intelligence,Luck,Willing_to_take_risks,Ability_to_compromise,Executive_ability,Leadership_ability,Communication_ability,Overall_ability,Party_leadership,Relations_with_Congress,Court_appointments,Handling_of_economy,Executive_appointments,Domestic_accomplishments,Foreign_policy_accomplishments,Avoid_crucial_mistakes,Experts’_view,Overall,Average_rank,Quartile_rank
16,16,Abraham Lincoln,Republican,28,1,2,2,18,1,1,1,2,1,1,2,4,3,4,2,1,6,2,1,3,3,1st
25,26,Theodore Roosevelt,Republican,5,4,8,6,2,2,15,4,4,5,5,7,7,9,3,5,4,3,5,4,4,4,1st
33,34,Dwight D. Eisenhower,Republican,11,18,5,17,7,21,5,5,5,20,7,15,9,5,6,11,8,7,3,6,6,6,1st


### *Filtering with the `.query()` method*

- Instead of using boolean arrays in combination with .loc[], we can use the .query() method. Unlike boolean arrays, it accepts both the plain `and`, `or`, `not` keywords and the operator forms `&`, `|`, `~`. We also don't need to worry as much about precedence and parentheses.
- In the .query() method we express our conditions as a string, similar to SQL. One of the powerful aspects of .query() is that we can `access external variables using the @ sign as prefix` from inside the string, so we don't need string formatting or concatenation to implement complex logic in our search.
- `To access a column of the dataframe, just use the name of the column`.
- `To match a string literal, pass it in as a string (within quote marks) as you would in any other situation.`
- .query() **vs** .loc[]:
    - Both of these methods can be used to work on intermediate data in a chain. But usually when we use .loc[] with a boolean array, the mask we pass in is built from the original dataframe and not the intermediate one. As a result we have to use a function with .loc to get access to the intermediate dataframe, whereas .query() always evaluates against the dataframe it is called on.
    - The .loc[] method supports column selection but the .query() method doesn't. This is very important to keep in mind when filtering data with the .query() method.
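
The comparison above can be sketched on a toy frame. The column `rank_gap` and the variable `max_gap` are hypothetical names introduced only for this illustration; `rank_gap` exists only on the intermediate frame created inside the chain.

```python
import pandas as pd

# Hypothetical toy frame (illustrative values only)
toy = pd.DataFrame(
    {"Party": ["Republican", "Republican", "Whig"], "Average_rank": [3, 30, 25]}
)
max_gap = 10  # external variable, referenced inside the query string as @max_gap

# .query() sees the intermediate frame, including the freshly assigned column
via_query = (
    toy.assign(rank_gap=lambda df_: df_["Average_rank"] - df_["Average_rank"].min())
    .query("rank_gap < @max_gap and Party == 'Republican'")
)

# .loc needs a callable to reach the same intermediate frame,
# but it can also select columns in the same step
via_loc = (
    toy.assign(rank_gap=lambda df_: df_["Average_rank"] - df_["Average_rank"].min())
    .loc[lambda df_: (df_["rank_gap"] < max_gap) & (df_["Party"] == "Republican")]
)
party_only = via_loc.loc[:, ["Party"]]

print(via_query)
```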

In [48]:
# to do the same filtering as we've done in the filtering with boolean arrays section
lt10 = siena_2018.Average_rank < 10
# siena_2018.query("Average_rank < 10 and Party == 'Republican'")
siena_2018.query('@lt10 and Party == "Republican"')

Unnamed: 0,Seq,President,Party,Background,Imagination,Integrity,Intelligence,Luck,Willing_to_take_risks,Ability_to_compromise,Executive_ability,Leadership_ability,Communication_ability,Overall_ability,Party_leadership,Relations_with_Congress,Court_appointments,Handling_of_economy,Executive_appointments,Domestic_accomplishments,Foreign_policy_accomplishments,Avoid_crucial_mistakes,Experts’_view,Overall,Average_rank,Quartile_rank
16,16,Abraham Lincoln,Republican,28,1,2,2,18,1,1,1,2,1,1,2,4,3,4,2,1,6,2,1,3,3,1st
25,26,Theodore Roosevelt,Republican,5,4,8,6,2,2,15,4,4,5,5,7,7,9,3,5,4,3,5,4,4,4,1st
33,34,Dwight D. Eisenhower,Republican,11,18,5,17,7,21,5,5,5,20,7,15,9,5,6,11,8,7,3,6,6,6,1st


----------------------------------

## Reshaping Dataframes (Grouping and Aggregating) 

----------------------------------

#### Reshaping dataframes with `dummies`

So, what are dummies? Dummy columns are one way of converting a categorical column into multiple numerical columns. Each category in the column becomes a column of its own, filled with 1 or 0 (or True/False, depending on the pandas version) based on whether that category was present in a particular row of data. To create dummy columns from a series (or from a dataframe that has multiple string columns), call the `pd.get_dummies` function.
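
A minimal sketch of `pd.get_dummies` on a made-up series. Passing `dtype=int` is a choice made here so that the result holds 1/0 regardless of the pandas version.

```python
import pandas as pd

# Hypothetical categorical column
parties = pd.Series(["Republican", "Whig", "Republican"], name="Party")

# One column per category; exactly one of them is 1 in every row
dummies = pd.get_dummies(parties, dtype=int)

print(dummies)
```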

### *The `pivot_table()` method*

<u> **Parameters** </u>

- **values:** The column(s) to apply the aggregate function to.
- **aggfunc:** Function or list of functions. By default set to **mean**. If a list of functions is passed, the resulting pivot table will have hierarchical columns whose top level are the function names (inferred from the function objects themselves). If a dict is passed, the **key is the column to aggregate and the value is a function or list of functions.**
- **fill_value:** Value to replace missing values with (in the resulting pivot table, after aggregation). If not defined, the missing values in the pivot table are left as **NaN**.
- **index:** **Keys to group by on the pivot table index.**
- **columns:** **Keys to group by on the pivot table columns.**

The *unique values of the specified column(s) are converted to index and column labels.* If multiple column names are specified, the result will have a nested MultiIndex.

**NOTE:** If both the columns and the values parameters are specified, the newly created columns (one per unique value of the specified columns) are filled with the aggregated values of the corresponding groups, and the positions that have no value are filled with **NaN**.
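
The effect of **columns**, **values** and **fill_value** can be seen on a tiny frame. The data below is made up and only borrows column names from the vehicles dataset.

```python
import pandas as pd

# Hypothetical miniature of the vehicles data
cars = pd.DataFrame(
    {
        "year": [1984, 1984, 1985, 1985],
        "make": ["Audi", "Audi", "Audi", "BMW"],
        "city08": [18, 21, 20, 17],
    }
)

# There is no BMW row for 1984, so that position is NaN after pivoting
with_nan = cars.pivot_table(index="year", columns="make", values="city08", aggfunc="max")

# Same table, with the empty position replaced after aggregation
filled = cars.pivot_table(
    index="year", columns="make", values="city08", aggfunc="max", fill_value=0
)

print(with_nan)
print(filled)
```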

> Say, we want to know the maximum mileage (both city08 and highway08 values) of the cars produced by different companies in each year.

In [49]:
# max mileage of cars produced by different companies in each year
max_mpg_year_manufac = vehicles.pivot_table(index="year", columns="make", values=["city08", "highway08"], aggfunc="max")
max_mpg_year_manufac.sample(3)

Unnamed: 0_level_0,city08,city08,city08,city08,city08,city08,city08,city08,city08,city08,city08,city08,city08,city08,city08,city08,city08,city08,city08,city08,...,highway08,highway08,highway08,highway08,highway08,highway08,highway08,highway08,highway08,highway08,highway08,highway08,highway08,highway08,highway08,highway08,highway08,highway08,highway08,highway08
make,AM General,ASC Incorporated,Acura,Alfa Romeo,American Motors Corporation,Aston Martin,Audi,Aurora Cars Ltd,Autokraft Limited,Avanti Motor Corporation,Azure Dynamics,BMW,BMW Alpina,BYD,Bentley,Bertone,Bill Dovell Motor Car Company,Bitter Gmbh and Co. Kg,Bugatti,Buick,...,Shelby,Spyker,Sterling,Subaru,Superior Coaches Div E.p. Dutton,Suzuki,TVR Engineering Ltd,"Tecstar, LP",Tesla,Texas Coach Company,Toyota,VPG,Vector,Vixen Motor Company,Volga Associated Automobile,Volkswagen,Volvo,Wallace Environmental,Yugo,smart
year,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2,Unnamed: 22_level_2,Unnamed: 23_level_2,Unnamed: 24_level_2,Unnamed: 25_level_2,Unnamed: 26_level_2,Unnamed: 27_level_2,Unnamed: 28_level_2,Unnamed: 29_level_2,Unnamed: 30_level_2,Unnamed: 31_level_2,Unnamed: 32_level_2,Unnamed: 33_level_2,Unnamed: 34_level_2,Unnamed: 35_level_2,Unnamed: 36_level_2,Unnamed: 37_level_2,Unnamed: 38_level_2,Unnamed: 39_level_2,Unnamed: 40_level_2,Unnamed: 41_level_2
1989,,,23.0,19.0,,8.0,20.0,,,,,17.0,,,,20.0,,,,23.0,...,,,21.0,33.0,,45.0,,,,,37.0,,,,,39.0,26.0,,29.0,
2014,,,38.0,,,14.0,25.0,,,,,137.0,,61.0,15.0,,,,8.0,25.0,...,,,,35.0,,,,,97.0,,74.0,,,,,48.0,30.0,,,93.0
2016,,,28.0,24.0,,14.0,37.0,,,,,137.0,,73.0,15.0,,,,,27.0,...,,,,36.0,,,,,107.0,,53.0,,,,,105.0,37.0,,,93.0


> Using a custom function to calculate percentage of Emacs users by country

In [50]:
def emacs_per(ser):
    return ser.str.contains("Emacs").mean() * 100

In this case we still want country in the index, but we only want a single column, the emacs percentage. So we don’t provide a columns parameter.

In [51]:
dev_survey.pivot_table(index="Country", values="DevEnviron", aggfunc=emacs_per).sample(3)

Unnamed: 0_level_0,DevEnviron
Country,Unnamed: 1_level_1
Latvia,3.787879
Democratic Republic of the Congo,0.0
The former Yugoslav Republic of Macedonia,0.0


- **Note:** When you see "for each" or "by", whatever follows either of the terms should usually go in the index.

> Multiple aggregations

In [52]:
# keep only the float64 columns; to select several dtypes, pass a list to include=
vals = dev_survey.select_dtypes(include="float64").columns.to_list()

We will be applying these aggregate functions to each of the columns in `vals`. So we shouldn't pass anything to the columns parameter, as that would convert each unique entry of the specified columns into a separate column.

In [53]:
dev_survey.pivot_table(index="Country", values=vals, aggfunc=["max", "min"]).head(3)

Unnamed: 0_level_0,max,max,max,max,max,min,min,min,min,min
Unnamed: 0_level_1,Age,CodeRevHrs,CompTotal,ConvertedComp,WorkWeekHrs,Age,CodeRevHrs,CompTotal,ConvertedComp,WorkWeekHrs
Country,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2
Afghanistan,85.0,90.0,648838511.0,1000000.0,168.0,1.0,1.0,1.0,0.0,1.0
Albania,40.0,20.0,2688000.0,187668.0,65.0,15.0,1.0,400.0,1320.0,8.0
Algeria,63.0,20.0,300000.0,1000000.0,168.0,5.0,1.0,0.0,0.0,6.0


> Per column aggregations (Say, we want to know the minimum and maximum ages and the average compensation for each country.)

In [54]:
dev_survey.pivot_table(index="Country", aggfunc={"Age": ["min", "max"], "ConvertedComp": ["mean"]}).head(3)

Unnamed: 0_level_0,Age,Age,ConvertedComp
Unnamed: 0_level_1,max,min,mean
Country,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
Afghanistan,85.0,1.0,101953.333333
Albania,40.0,15.0,21833.7
Algeria,63.0,5.0,34924.047619


### *The `.groupby()` method*

A groupby operation splits the data into groups, applies an aggregate function to each group, and then combines the results. The column(s) we group by are placed in the index of the result.

<u>**Parameters**</u>

- **by:** used for determining the groups for the groupby. If a list of labels is passed, the aggregated result will have a MultiIndex.
- **named aggregations:** passed as keyword arguments to the `.agg()` method of the groupby object (not to `.groupby()` itself).

> Say, we want to know the maximum mileage (both city08 and highway08 values) of the cars produced by different companies in each year.

In [55]:
# Without unstack()
vehicles.groupby(["year", "make"]).agg({"city08":"max", "highway08": "max"}).head(3)

Unnamed: 0_level_0,Unnamed: 1_level_0,city08,highway08
year,make,Unnamed: 2_level_1,Unnamed: 3_level_1
1984,AM General,18,17
1984,Alfa Romeo,18,25
1984,American Motors Corporation,19,23


In [56]:
# With unstack()
vehicles.groupby(["year", "make"]).agg({"city08":"max", "highway08": "max"}).unstack().head(3)

Unnamed: 0_level_0,city08,city08,city08,city08,city08,city08,city08,city08,city08,city08,city08,city08,city08,city08,city08,city08,city08,city08,city08,city08,...,highway08,highway08,highway08,highway08,highway08,highway08,highway08,highway08,highway08,highway08,highway08,highway08,highway08,highway08,highway08,highway08,highway08,highway08,highway08,highway08
make,AM General,ASC Incorporated,Acura,Alfa Romeo,American Motors Corporation,Aston Martin,Audi,Aurora Cars Ltd,Autokraft Limited,Avanti Motor Corporation,Azure Dynamics,BMW,BMW Alpina,BYD,Bentley,Bertone,Bill Dovell Motor Car Company,Bitter Gmbh and Co. Kg,Bugatti,Buick,...,Shelby,Spyker,Sterling,Subaru,Superior Coaches Div E.p. Dutton,Suzuki,TVR Engineering Ltd,"Tecstar, LP",Tesla,Texas Coach Company,Toyota,VPG,Vector,Vixen Motor Company,Volga Associated Automobile,Volkswagen,Volvo,Wallace Environmental,Yugo,smart
year,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2,Unnamed: 22_level_2,Unnamed: 23_level_2,Unnamed: 24_level_2,Unnamed: 25_level_2,Unnamed: 26_level_2,Unnamed: 27_level_2,Unnamed: 28_level_2,Unnamed: 29_level_2,Unnamed: 30_level_2,Unnamed: 31_level_2,Unnamed: 32_level_2,Unnamed: 33_level_2,Unnamed: 34_level_2,Unnamed: 35_level_2,Unnamed: 36_level_2,Unnamed: 37_level_2,Unnamed: 38_level_2,Unnamed: 39_level_2,Unnamed: 40_level_2,Unnamed: 41_level_2
1984,18.0,,,18.0,19.0,8.0,21.0,,,15.0,,21.0,,,,20.0,17.0,12.0,,23.0,...,,,,33.0,11.0,22.0,,,,,43.0,,,,,43.0,31.0,,,
1985,16.0,,,19.0,15.0,7.0,21.0,,,,,21.0,,,,20.0,17.0,,,23.0,...,,,,33.0,,47.0,22.0,,,,42.0,,,,,41.0,28.0,,,
1986,,,23.0,19.0,15.0,,22.0,,15.0,,,21.0,,,,20.0,,14.0,,23.0,...,,,,33.0,,27.0,22.0,,,14.0,35.0,,,16.0,22.0,40.0,27.0,,29.0,


<u>**Note:**</u> The `.unstack()` method is applied to the aggregated result of the groupby to pull the inner-most index level out and set it as the columns.
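
To see on a small scale that grouping followed by `.unstack()` gives the same table as `pivot_table()`, here is a sketch on made-up data that only borrows column names from the vehicles dataset.

```python
import pandas as pd

# Hypothetical miniature of the vehicles data
cars = pd.DataFrame(
    {
        "year": [1984, 1984, 1985, 1985],
        "make": ["Audi", "Audi", "Audi", "BMW"],
        "city08": [18, 21, 20, 17],
    }
)

grouped = cars.groupby(["year", "make"])["city08"].max()  # Series with a (year, make) MultiIndex
wide = grouped.unstack()                                  # inner-most level "make" becomes the columns

pivoted = cars.pivot_table(index="year", columns="make", values="city08", aggfunc="max")

print(wide)
print(pivoted)
```

Combinations that never occur in the data, such as (1984, BMW) here, show up as NaN in both results.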

- ##### Named Aggregations of groupby objects

<u>**Note:**</u> This is special to groupby and no equivalent is present in the pivot_table() method.

When calling the .agg method on a groupby object, we can use a keyword parameter and pass in a tuple of the column and aggregation function (to be applied to that column) as its value. The keyword parameter will be turned into a (flattened) column name.

> Say, we want to know the minimum and maximum ages and the average compensation for each country, and we also want to name the columns 'Age_min', 'Age_max', 'mean_ConvertedComp'.

In [57]:
dev_survey.groupby(by=["Country"]).agg(
    Age_min=("Age", "min"), 
    Age_max=("Age", "max"), 
    mean_ConvertedComp=("ConvertedComp", "mean")).head(3)

Unnamed: 0_level_0,Age_min,Age_max,mean_ConvertedComp
Country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Afghanistan,1.0,85.0,101953.333333
Albania,15.0,40.0,21833.7
Algeria,5.0,63.0,34924.047619


- ##### The `get_group()` method of a groupby object

In [58]:
# when grouping by multiple keys we need to pass in a tuple as the argument to the get_group method
vehicles.groupby(["year", "make"]).get_group((1984, "Aston Martin")).head(2)

Unnamed: 0,barrels08,barrelsA08,charge120,charge240,city08,city08U,cityA08,cityA08U,cityCD,cityE,cityUF,co2,co2A,co2TailpipeAGpm,co2TailpipeGpm,comb08,comb08U,combA08,combA08U,combE,...,year,youSaveSpend,guzzler,trans_dscr,tCharger,sCharger,atvType,fuelType2,rangeA,evMotor,mfrCode,c240Dscr,charge240b,c240bDscr,createdOn,modifiedOn,startStop,phevCity,phevHwy,phevComb
18258,36.623333,0.0,0.0,0.0,8,0.0,0,0.0,0.0,0.0,0.0,-1,-1,0.0,987.444444,9,0.0,0,0.0,0.0,...,1984,-15750,T,,,,,,,,,,0.0,,Tue Jan 01 00:00:00 EST 2013,Tue Jan 01 00:00:00 EST 2013,,0,0,0
18261,36.623333,0.0,0.0,0.0,8,0.0,0,0.0,0.0,0.0,0.0,-1,-1,0.0,987.444444,9,0.0,0,0.0,0.0,...,1984,-15750,T,,,,,,,,,,0.0,,Tue Jan 01 00:00:00 EST 2013,Tue Jan 01 00:00:00 EST 2013,,0,0,0


### *Accessing values from a MultiIndexed Dataframe*

In [59]:
# the values of max mileage in city08 and highway08 of the Subaru cars for the years 1984 to 1987
max_mpg_year_manufac.loc[range(1984, 1988), (["city08", "highway08"], "Subaru")]

Unnamed: 0_level_0,city08,highway08
make,Subaru,Subaru
year,Unnamed: 1_level_2,Unnamed: 2_level_2
1984,26.0,33.0
1985,27.0,33.0
1986,26.0,33.0
1987,32.0,37.0
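
The same kind of selection can be tried on a small, made-up pivot table. The sketch below also shows `.xs()`, a pandas method for taking a cross-section on one level, as an alternative to the tuple form.

```python
import pandas as pd

# Hypothetical miniature of the vehicles data
cars = pd.DataFrame(
    {
        "year": [1984, 1984, 1985, 1985],
        "make": ["Audi", "Audi", "Audi", "BMW"],
        "city08": [18, 21, 20, 17],
        "highway08": [25, 28, 27, 24],
    }
)
wide = cars.pivot_table(
    index="year", columns="make", values=["city08", "highway08"], aggfunc="max"
)

# Tuple indexer: one entry per column level (measure, make)
audi = wide.loc[1984:1985, (["city08", "highway08"], "Audi")]

# A full tuple of scalars picks out a single value
one_value = wide.loc[1984, ("city08", "Audi")]

# Cross-section on the "make" level; that level is dropped from the result
audi_xs = wide.xs("Audi", axis=1, level="make")

print(audi)
print(one_value)
print(audi_xs)
```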


----------------------------------------

## The `.transform()` and `.filter()` method of a Groupby object

------------------------------------------

We often group and aggregate. This returns the result with the aggregated index. But sometimes we want the results in terms of the original index, not the aggregated index. This way we can easily add the returned series or dataframe to the original dataframe (using the .assign method).

There are two methods that work on **groupby** objects and allow us to group and aggregate while keeping the original index. The **pivot_table** method can't do that, which is one of the reasons why some developers favour groupby over pivot_table.

- The `.transform` method preserves the original index while letting us apply whatever aggregation we want to the grouped object (either an existing aggregation function or one we define ourselves), thus increasing flexibility and functionality.

- The `.filter` method allows us to filter based on aggregated data but keep the original index.
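
The difference between the aggregated index and the original index can be seen on a toy frame. The data and the column name `country_mean_age` are made up for this sketch.

```python
import pandas as pd

# Hypothetical miniature of the survey data
survey = pd.DataFrame(
    {
        "Country": ["Spain", "Canada", "Spain", "Canada", "Spain"],
        "Age": [18, 30, 25, 41, 33],
    }
)

per_group = survey.groupby("Country")["Age"].mean()           # one row per country
per_row = survey.groupby("Country")["Age"].transform("mean")  # one row per original row

# Because per_row shares the original index, it lines up directly in .assign()
with_mean = survey.assign(country_mean_age=per_row)

print(per_group)
print(with_mean)
```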


### *The Groupby `.transform(func)` method*

> Say, we want to know how many respondents there were from each country in the dev_survey df. And we want to add that result at the end of the original dev_survey df in a column named 'total_res_from_this_country'.

Let's do this with the help of the `.transform` and `.assign` methods. Since we want the total number of responses from each country, we can first group by the Country column and then get the size of any column in each group.

In [60]:
# first let's see the .transform method in action
dev_survey.groupby("Country").Age.transform("size").tail(3)

88880       NaN
88881       NaN
88882    1604.0
Name: Age, dtype: float64

Looks like there are some rows for which the transform method returned NaN. Let's explore why.

Our first guess is that something is wrong in the "Country" column, since it is the column we grouped by.

In [61]:
dev_survey.Country.isna().sum()

132

Looks like some rows do not have a "Country" value, and as a result those rows are not assigned to any group. For now let's drop the rows that don't have a Country.

In [62]:
filt = ~dev_survey.Country.isna()

In [63]:
# now let's use the assign method to append a column to the dev_survey df
dev_survey[filt].assign(total_res_from_this_country=dev_survey.groupby("Country").Age.transform("size"))\
    .set_index("Country").sort_index().total_res_from_this_country

Country
Afghanistan    44.0
Afghanistan    44.0
Afghanistan    44.0
Afghanistan    44.0
Afghanistan    44.0
               ... 
Zimbabwe       39.0
Zimbabwe       39.0
Zimbabwe       39.0
Zimbabwe       39.0
Zimbabwe       39.0
Name: total_res_from_this_country, Length: 88751, dtype: float64

**Note:** We could've used the `.query` method instead of boolean masking. In fact, most of the time .query is preferred, but since the condition here is very basic we have used a boolean array for filtering.

- A list of strings that the groupby transform method accepts as functions -

<img src=groupby_transform_method_func_strings.png>

### *The Groupby `.filter(func)` method*

This method lets us filter out whole groups based on an aggregation while returning the result with the original index.


The `.filter` method accepts a function that receives the current group. If the function returns True (it must return a scalar, not a series or dataframe), the rows of that group are kept in the result.

> Say we want to remove from the dev_survey dataframe every row whose country has fewer respondents than the median number of respondents per country.

- First let's try to do this **with our existing pandas knowledge**

In [64]:
# first let's find out the median size
mdn_size = dev_survey.Country.value_counts().median()
mdn_size;

In [65]:
# a list of the countries to be removed
country_counts = dev_survey.Country.value_counts()
countries_to_remove = country_counts.index[country_counts < mdn_size].to_list()
countries_to_remove;

In [66]:
dev_survey.query("~Country.isin(@countries_to_remove) and Country.isna() == False").tail(2)

Unnamed: 0,Respondent,MainBranch,Hobbyist,OpenSourcer,OpenSource,Employment,Country,Student,EdLevel,UndergradMajor,EduOther,OrgSize,DevType,YearsCode,Age1stCode,YearsCodePro,CareerSat,JobSat,MgrIdiot,MgrMoney,...,SOVisitFreq,SOVisitTo,SOFindAnswer,SOTimeSaved,SOHowMuchTime,SOAccount,SOPartFreq,SOJobs,EntTeams,SOComm,WelcomeChange,SONewContent,Age,Gender,Trans,Sexuality,Ethnicity,Dependents,SurveyLength,SurveyEase
88878,88377,,Yes,Less than once a month but more than once per ...,The quality of OSS and closed source software ...,"Not employed, and not looking for work",Canada,No,Primary/elementary school,,"Taught yourself a new language, framework, or ...",,,,,,,,,,...,A few times per week,Find answers to specific questions;Learn how t...,3-5 times per week,Stack Overflow was slightly faster,11-30 minutes,Yes,I have never participated in Q&A on Stack Over...,"No, I knew that Stack Overflow had a job board...","No, I've heard of them, but I am not part of a...","No, not at all",,Tech articles written by other developers;Tech...,,Man,No,,,No,Appropriate in length,Easy
88882,88863,,Yes,Less than once per year,"OSS is, on average, of HIGHER quality than pro...","Not employed, and not looking for work",Spain,"Yes, full-time","Professional degree (JD, MD, etc.)","Computer science, computer engineering, or sof...",Taken an online course in programming or softw...,,,8.0,11.0,3.0,,,,,...,Daily or almost daily,Find answers to specific questions;Learn how t...,6-10 times per week,Stack Overflow was much faster,11-30 minutes,Yes,A few times per month or weekly,Yes,"No, I've heard of them, but I am not part of a...","Yes, somewhat",Somewhat less welcome now than last year,Tech articles written by other developers;Indu...,18.0,Man,No,Straight / Heterosexual,Hispanic or Latino/Latina;White or of European...,No,Appropriate in length,Easy


- Now let's do this **with the filter method** for groupby objects 

In [67]:
dev_survey.groupby("Country").filter(lambda g: g.Age.size >= mdn_size).tail(2)

Unnamed: 0,Respondent,MainBranch,Hobbyist,OpenSourcer,OpenSource,Employment,Country,Student,EdLevel,UndergradMajor,EduOther,OrgSize,DevType,YearsCode,Age1stCode,YearsCodePro,CareerSat,JobSat,MgrIdiot,MgrMoney,...,SOVisitFreq,SOVisitTo,SOFindAnswer,SOTimeSaved,SOHowMuchTime,SOAccount,SOPartFreq,SOJobs,EntTeams,SOComm,WelcomeChange,SONewContent,Age,Gender,Trans,Sexuality,Ethnicity,Dependents,SurveyLength,SurveyEase
88878,88377,,Yes,Less than once a month but more than once per ...,The quality of OSS and closed source software ...,"Not employed, and not looking for work",Canada,No,Primary/elementary school,,"Taught yourself a new language, framework, or ...",,,,,,,,,,...,A few times per week,Find answers to specific questions;Learn how t...,3-5 times per week,Stack Overflow was slightly faster,11-30 minutes,Yes,I have never participated in Q&A on Stack Over...,"No, I knew that Stack Overflow had a job board...","No, I've heard of them, but I am not part of a...","No, not at all",,Tech articles written by other developers;Tech...,,Man,No,,,No,Appropriate in length,Easy
88882,88863,,Yes,Less than once per year,"OSS is, on average, of HIGHER quality than pro...","Not employed, and not looking for work",Spain,"Yes, full-time","Professional degree (JD, MD, etc.)","Computer science, computer engineering, or sof...",Taken an online course in programming or softw...,,,8.0,11.0,3.0,,,,,...,Daily or almost daily,Find answers to specific questions;Learn how t...,6-10 times per week,Stack Overflow was much faster,11-30 minutes,Yes,A few times per month or weekly,Yes,"No, I've heard of them, but I am not part of a...","Yes, somewhat",Somewhat less welcome now than last year,Tech articles written by other developers;Indu...,18.0,Man,No,Straight / Heterosexual,Hispanic or Latino/Latina;White or of European...,No,Appropriate in length,Easy


As we can see, the filter method does this in a single line of code, whereas our previous approach took several.
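
To check on a small scale that the two approaches agree, here is a sketch on made-up data (three countries with 3, 2 and 1 respondents, so the median group size is 2).

```python
import pandas as pd

# Hypothetical miniature of the survey data
survey = pd.DataFrame(
    {
        "Country": ["Spain", "Spain", "Spain", "Canada", "Canada", "Peru"],
        "Age": [18, 25, 33, 30, 41, 22],
    }
)

sizes = survey["Country"].value_counts()
median_size = sizes.median()

# Multi-step approach: list the small countries, then mask them out
small_countries = sizes.index[sizes < median_size]
manual = survey[~survey["Country"].isin(small_countries)]

# Groupby .filter: the function returns one boolean per group
one_liner = survey.groupby("Country").filter(lambda g: len(g) >= median_size)

print(one_liner)
```

Both results keep the original row labels and drop only the rows of the one country below the median size.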

-------------------------

## Flattening Hierarchical Indexes and Columns

-------------------------------------------------------------------------------

### *Removing Hierarchical index*

`.reset_index()` is used to remove hierarchical indexing and push the index levels into their own columns.

In [68]:
# example of hierarchical index
hr_idx = dev_survey.groupby(["Country", "Age"]).ConvertedComp.agg(["max", "min"])
hr_idx.head(3)

Unnamed: 0_level_0,Unnamed: 1_level_0,max,min
Country,Age,Unnamed: 2_level_1,Unnamed: 3_level_1
Afghanistan,1.0,0.0,0.0
Afghanistan,18.0,,
Afghanistan,21.0,,


In [69]:
# removing hierarchical index with .reset_index()
hr_idx.reset_index().head(3)

Unnamed: 0,Country,Age,max,min
0,Afghanistan,1.0,0.0,0.0
1,Afghanistan,18.0,,
2,Afghanistan,21.0,,


Alternatively, we can pass **as_index = False** to the `groupby()` method. This keeps the grouping columns as regular columns instead of moving them into the index.
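
A minimal sketch comparing the two routes on made-up data: aggregating and then calling `.reset_index()`, versus grouping with `as_index=False` from the start.

```python
import pandas as pd

# Hypothetical miniature of the survey data
survey = pd.DataFrame(
    {
        "Country": ["Spain", "Spain", "Canada", "Canada"],
        "Age": [20, 20, 30, 31],
        "ConvertedComp": [100.0, 300.0, 50.0, 70.0],
    }
)

# Route 1: hierarchical index first, flattened afterwards
via_reset = survey.groupby(["Country", "Age"])["ConvertedComp"].max().reset_index()

# Route 2: the grouping keys never enter the index
via_flag = survey.groupby(["Country", "Age"], as_index=False)["ConvertedComp"].max()

print(via_reset)
print(via_flag)
```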

### *Flattening hierarchical columns*

Sadly, the `.reset_index()` method does not work for hierarchical columns, and there is no single built-in method that flattens them into joined names. We have to relabel the dataframe's columns ourselves if we want to flatten the multi-level columns into one level.

In [70]:
# Example of hierarchical columns
hr_cols = hr_idx.unstack()
hr_cols.head(3)

Unnamed: 0_level_0,max,max,max,max,max,max,max,max,max,max,max,max,max,max,max,max,max,max,max,max,...,min,min,min,min,min,min,min,min,min,min,min,min,min,min,min,min,min,min,min,min
Age,1.0,2.0,3.0,4.0,5.0,9.0,10.0,11.0,12.0,13.0,13.5,14.0,14.1,14.5,15.0,16.0,16.5,16.9,17.0,17.3,...,76.0,77.0,78.0,79.0,80.0,81.0,82.0,83.0,84.0,85.0,87.0,88.0,90.0,91.0,94.0,95.0,97.0,98.0,98.9,99.0
Country,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2,Unnamed: 22_level_2,Unnamed: 23_level_2,Unnamed: 24_level_2,Unnamed: 25_level_2,Unnamed: 26_level_2,Unnamed: 27_level_2,Unnamed: 28_level_2,Unnamed: 29_level_2,Unnamed: 30_level_2,Unnamed: 31_level_2,Unnamed: 32_level_2,Unnamed: 33_level_2,Unnamed: 34_level_2,Unnamed: 35_level_2,Unnamed: 36_level_2,Unnamed: 37_level_2,Unnamed: 38_level_2,Unnamed: 39_level_2,Unnamed: 40_level_2,Unnamed: 41_level_2
Afghanistan,0.0,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,
Albania,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,
Algeria,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,


- `flatten_cols(df)` function

The following function joins the labels from each level of a column with an underscore. It can then be used with the `pipe()` method, making it possible to flatten multi-level columns in a chaining operation.

In [71]:
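def flatten_cols(df):
    # join the levels of each column tuple with "_", converting non-string labels to str first
    cols = ["_".join(map(str, col_comb)) for col_comb in df.columns.to_flat_index()]
    df.columns = cols  # note: this relabels the columns of the passed-in dataframe in place
    return df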

<u>**Explanation**</u>
1. `pandas.MultiIndex.to_flat_index()` returns a `pandas.Index` object in which each entry of the MultiIndex is represented as a tuple.
2. Recall that the `map(func, iterable)` function calls "func" on each value of the iterable and returns a map object.
3. So, we map the `str` function over each index tuple in order to convert any non-string entry to a string before joining.
4. Finally, the strings are joined with "_" between them.


<u>**A note on the** `DataFrame.pipe(func, *args, **kwargs)` **method**</u>
- Use .pipe when chaining together functions that expect Series, DataFrames or GroupBy objects.
- Parameters
    - func: function to apply to the series or dataframe; the object is passed to it as the first argument
    - `*args`: positional arguments passed into func
    - `**kwargs`: keyword arguments passed into func
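
To see how the extra arguments are forwarded, here is a minimal sketch. The helper `add_prefix_to_cols` and the `scores` frame are hypothetical and exist only for this illustration.

```python
import pandas as pd


def add_prefix_to_cols(df, prefix, sep="_"):
    """Hypothetical helper: return a copy of df with a prefix on every column name."""
    return df.rename(columns=lambda name: f"{prefix}{sep}{name}")


scores = pd.DataFrame({"Test1": [13, 18], "Test2": [19, 18]})

# "score" is forwarded as a positional argument and sep="-" as a keyword argument.
piped = scores.pipe(add_prefix_to_cols, "score", sep="-")

# The pipe call is equivalent to calling the function directly on the dataframe.
direct = add_prefix_to_cols(scores, "score", sep="-")

print(list(piped.columns))  # ['score-Test1', 'score-Test2']
```

The benefit of `.pipe` is only readability: the custom function can sit inside a longer chain instead of wrapping it.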

In [72]:
# now let's see the flatten_cols() function in action
hr_cols.pipe(flatten_cols).head(3)

Unnamed: 0_level_0,max_1.0,max_2.0,max_3.0,max_4.0,max_5.0,max_9.0,max_10.0,max_11.0,max_12.0,max_13.0,max_13.5,max_14.0,max_14.1,max_14.5,max_15.0,max_16.0,max_16.5,max_16.9,max_17.0,max_17.3,...,min_76.0,min_77.0,min_78.0,min_79.0,min_80.0,min_81.0,min_82.0,min_83.0,min_84.0,min_85.0,min_87.0,min_88.0,min_90.0,min_91.0,min_94.0,min_95.0,min_97.0,min_98.0,min_98.9,min_99.0
Country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1
Afghanistan,0.0,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,
Albania,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,
Algeria,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,


-------------------------------------------

## Melting, Transposing and Stacking Data

-------------------------------------------

### *Melting & Unmelting Data*

To understand melting of dataframes we first need to understand two terms associated with the data in a dataframe.
- fact: A fact is a value that is measured and reported on.
- dimension: A dimension is a value that describes the conditions of the fact.

For example, in a sales scenario, typical facts would be the number of sales of an item and the cost. The dimensions might include the store where the item was sold, the date, and the customer.

Based on the idea of fact and dimension, the way data is stored can be categorized as,
- wide form: if a single row holds multiple facts, and
- long or tidy form: if a single row holds only one fact (possibly along with other variables describing its dimensions).

**Melting** is the process of converting data from wide form to long/tidy form. Pandas provides `pd.melt()` as a convenient way of melting a dataframe.

In [73]:
# first let's create a dataframe of wide form
wide = pd.DataFrame({
    "Student_name": ["Ashly", "Cole", "Young", "Dave"], 
    "Age": [15, 14, 15, 15], 
    "Test1": [13, 18, 17, np.nan], 
    "Test2": [19, 18, 16, 19], 
    "Teacher": ["Abdullah", "Pial", "Hasan", "Arafat"]})

In [74]:
wide

Unnamed: 0,Student_name,Age,Test1,Test2,Teacher
0,Ashly,15,13.0,19,Abdullah
1,Cole,14,18.0,18,Pial
2,Young,15,17.0,16,Hasan
3,Dave,15,,19,Arafat


This dataframe has two columns (Test1 and Test2) that contain the facts, i.e., the test scores. The other columns are dimensions of those facts.

- #### Melting Data: The `pd.melt()` function

<u> **Parameters** </u>
- frame: the dataframe to melt
- id_vars: identifier variables, i.e., dimension columns
- value_vars: fact columns to unpivot; if omitted, all columns not listed in id_vars are used
- var_name: name to use for the variable column
- value_name: name to use for the value column


In [75]:
# Now, let's convert this dataframe of wide form into long form
long = pd.melt(wide, id_vars=["Student_name", "Age", "Teacher"], value_vars=["Test1", "Test2"],
               var_name="Test", value_name="Test_scores")

In [76]:
long

Unnamed: 0,Student_name,Age,Teacher,Test,Test_scores
0,Ashly,15,Abdullah,Test1,13.0
1,Cole,14,Pial,Test1,18.0
2,Young,15,Hasan,Test1,17.0
3,Dave,15,Arafat,Test1,
4,Ashly,15,Abdullah,Test2,19.0
5,Cole,14,Pial,Test2,18.0
6,Young,15,Hasan,Test2,16.0
7,Dave,15,Arafat,Test2,19.0


- #### Unmelting data with the `pivot_table()` method

In [77]:
long.pivot_table(index=["Student_name", "Age", "Teacher"], columns="Test", values="Test_scores").reset_index()

Test,Student_name,Age,Teacher,Test1,Test2
0,Ashly,15,Abdullah,13.0,19.0
1,Cole,14,Pial,18.0,18.0
2,Dave,15,Arafat,,19.0
3,Young,15,Hasan,17.0,16.0


**Notes:**
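- If a list is passed to the columns or values parameters, the result can gain extra hierarchical column levels, so pass in a scalar whenever you can.
- `.reset_index()` was used to move the index levels (Student_name, Age, Teacher) back into ordinary columns.
- `pivot_table()` aggregates values that share the same index/column combination (using the mean by default), so it recovers the original values only when each combination occurs once.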
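
The effect of passing a list instead of a scalar can be checked on a small frame. This is a sketch only: `long_toy` is a made-up frame shaped like the melted `long` dataframe, with fewer columns.

```python
import pandas as pd

# Hypothetical long-form frame, shaped like the melted data.
long_toy = pd.DataFrame({
    "Student_name": ["Ashly", "Cole", "Ashly", "Cole"],
    "Test": ["Test1", "Test1", "Test2", "Test2"],
    "Test_scores": [13.0, 18.0, 19.0, 18.0],
})

# Scalar `values`: the resulting columns have a single level (Test1, Test2).
flat = long_toy.pivot_table(index="Student_name", columns="Test", values="Test_scores")

# List `values`: an extra outer column level ("Test_scores") is added on top.
nested = long_toy.pivot_table(index="Student_name", columns="Test", values=["Test_scores"])

print(flat.columns.nlevels)    # 1
print(nested.columns.nlevels)  # 2
```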

### *Transposing Data* 

Transposing converts the columns into rows and the rows into columns. This can be done either with the `.transpose()` method or with the `.T` property.

Some use cases for transposing the data may be, 
- **Swapping axis for plotting**
- **Viewing more data in jupyter**: if you transpose only to see more data on your screen, you might not want to transpose your whole data set. Remember that pandas stores and optimizes data by column type. Turning a row that contains different data types (strings, dates, numbers) into a column can be a slow and memory-hungry operation. It is better to pull off the head, the tail, or a sample of the data and then transpose it.
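
A minimal sketch of that advice, using a made-up `people` frame as a stand-in for a larger dataset:

```python
import pandas as pd

# Hypothetical mixed-type frame standing in for a much larger dataset.
people = pd.DataFrame({
    "name": ["Ashly", "Cole", "Young", "Dave"],
    "age": [15, 14, 15, 15],
    "score": [13.0, 18.0, 17.0, 19.0],
})

# Transpose only the first few rows instead of the whole frame.
preview = people.head(3).T

print(preview)
# Each transposed column now mixes strings and numbers,
# so it can no longer use a compact numeric dtype.
print(preview.dtypes)
```

The same pattern works with `.tail()` or `.sample()` in place of `.head()`.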


### *Stacking and Unstacking Data*

First let's create a dataframe with both a multi-level index and multi-level columns.

Note that the levels of a multi-level index or column are numbered from the outside in, starting at 0: level 0 is the outermost level.

In [78]:
ds_sus = dev_survey.pivot_table(index=["Country", "Hobbyist"], columns="Employment",
                                values="Age", aggfunc=["size", "mean", "max", "min"])

In [79]:
ds_sus.head(4)

Unnamed: 0_level_0,Unnamed: 1_level_0,size,size,size,size,size,size,mean,mean,mean,mean,mean,mean,max,max,max,max,max,max,min,min,min,min,min,min
Unnamed: 0_level_1,Employment,Employed full-time,Employed part-time,"Independent contractor, freelancer, or self-employed","Not employed, and not looking for work","Not employed, but looking for work",Retired,Employed full-time,Employed part-time,"Independent contractor, freelancer, or self-employed","Not employed, and not looking for work","Not employed, but looking for work",Retired,Employed full-time,Employed part-time,"Independent contractor, freelancer, or self-employed","Not employed, and not looking for work","Not employed, but looking for work",Retired,Employed full-time,Employed part-time,"Independent contractor, freelancer, or self-employed","Not employed, and not looking for work","Not employed, but looking for work",Retired
Country,Hobbyist,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2,Unnamed: 22_level_2,Unnamed: 23_level_2,Unnamed: 24_level_2,Unnamed: 25_level_2
Afghanistan,No,7.0,2.0,1.0,,1.0,1.0,37.2,24.0,,,,,85.0,25.0,,,,,24.0,23.0,,,,
Afghanistan,Yes,14.0,2.0,5.0,3.0,2.0,,26.111111,23.0,25.0,,19.5,,34.0,23.0,26.0,,21.0,,18.0,23.0,24.0,,18.0,
Albania,No,12.0,1.0,2.0,,1.0,,26.090909,,32.5,,,,37.0,,40.0,,,,21.0,,25.0,,,
Albania,Yes,52.0,5.0,6.0,2.0,5.0,,25.034884,24.0,28.6,17.0,21.25,,35.0,37.0,36.0,19.0,24.0,,19.0,18.0,21.0,15.0,18.0,


- #### The `.stack()` method

The stack method moves a **column level into the index**. By default it moves the inner-most column level to the inner-most position of the index, but we can choose which column level to move, either by its position or by its name.

In [80]:
# say we wanted to pull the aggregate functions (size, mean, max, min) level to the inner-most index
ds_sus_stack = ds_sus.stack(0)
ds_sus_stack

Unnamed: 0_level_0,Unnamed: 1_level_0,Employment,Employed full-time,Employed part-time,"Independent contractor, freelancer, or self-employed","Not employed, and not looking for work","Not employed, but looking for work",Retired
Country,Hobbyist,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Afghanistan,No,max,85.000000,25.000000,,,,
Afghanistan,No,mean,37.200000,24.000000,,,,
Afghanistan,No,min,24.000000,23.000000,,,,
Afghanistan,No,size,7.000000,2.000000,1.0,,1.0,1.0
Afghanistan,Yes,max,34.000000,23.000000,26.0,,21.0,
...,...,...,...,...,...,...,...,...
Zimbabwe,No,size,3.000000,,2.0,,2.0,
Zimbabwe,Yes,max,41.000000,24.000000,46.0,25.0,26.0,
Zimbabwe,Yes,mean,29.214286,22.666667,32.0,25.0,23.4,
Zimbabwe,Yes,min,23.000000,21.000000,25.0,25.0,20.0,
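
Selecting the level by position or by name can be compared on a small frame. This is a sketch under assumptions: `toy` and its level names "agg" and "group" are made up for the illustration (in `ds_sus` the outer column level has no name, so only its position can be used).

```python
import pandas as pd

# Hypothetical frame with named column levels: "agg" (outer, level 0) and "group" (inner, level 1).
cols = pd.MultiIndex.from_product([["max", "min"], ["a", "b"]], names=["agg", "group"])
toy = pd.DataFrame([[5, 6, 1, 2], [7, 8, 3, 4]], index=["x", "y"], columns=cols)

by_position = toy.stack(0)   # outer column level, selected by position
by_name = toy.stack("agg")   # the same level, selected by name

# "agg" is now the inner-most index level; only "group" remains in the columns.
print(by_name)
```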


- #### The `.unstack()` method

As we have previously seen with the groupby method, the unstack method moves an **index level into the columns**. By default it moves the inner-most index level to the inner-most position of the columns, but we can choose which index level to move, either by its position or by its name.

In [81]:
# say we wanted to pull the Hobbyist index level from the ds_sus_stack dataframe into the inner-most column level
ds_sus_stack.unstack("Hobbyist")

Unnamed: 0_level_0,Employment,Employed full-time,Employed full-time,Employed part-time,Employed part-time,"Independent contractor, freelancer, or self-employed","Independent contractor, freelancer, or self-employed","Not employed, and not looking for work","Not employed, and not looking for work","Not employed, but looking for work","Not employed, but looking for work",Retired,Retired
Unnamed: 0_level_1,Hobbyist,No,Yes,No,Yes,No,Yes,No,Yes,No,Yes,No,Yes
Country,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2
Afghanistan,max,85.000000,34.000000,25.0,23.000000,,26.0,,,,21.0,,
Afghanistan,mean,37.200000,26.111111,24.0,23.000000,,25.0,,,,19.5,,
Afghanistan,min,24.000000,18.000000,23.0,23.000000,,24.0,,,,18.0,,
Afghanistan,size,7.000000,14.000000,2.0,2.000000,1.0,5.0,,3.0,1.0,2.0,1.0,
Albania,max,37.000000,35.000000,,37.000000,40.0,36.0,,19.0,,24.0,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
Zambia,size,,4.000000,1.0,,,4.0,,2.0,,1.0,,
Zimbabwe,max,33.000000,41.000000,,24.000000,26.0,46.0,,25.0,25.0,26.0,,
Zimbabwe,mean,28.666667,29.214286,,22.666667,23.5,32.0,,25.0,25.0,23.4,,
Zimbabwe,min,23.000000,23.000000,,21.000000,21.0,25.0,,25.0,25.0,20.0,,
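
The default and the by-name behaviour can be compared on a small frame. This is only a sketch: `counts` is a made-up frame that borrows the level names "Country" and "Hobbyist" for readability.

```python
import pandas as pd

# Hypothetical frame with a two-level index.
idx = pd.MultiIndex.from_product(
    [["Albania", "Zambia"], ["No", "Yes"]], names=["Country", "Hobbyist"]
)
counts = pd.DataFrame({"size": [12, 52, 1, 4]}, index=idx)

default_unstack = counts.unstack()         # moves the inner-most index level ("Hobbyist")
by_name = counts.unstack("Hobbyist")       # the same level, selected by name
country_cols = counts.unstack("Country")   # moves the outer index level instead

print(default_unstack)
print(country_cols)
```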


- #### The `.swaplevel()` method: Swapping levels of a multilevel dataframe

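The swaplevel method swaps the positions of two levels of the index or columns (chosen with the axis parameter). By default it swaps the two inner-most levels, so on a two-level axis the inner level becomes the outer one. Only the order of the levels changes; the data itself is not re-sorted.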

In [82]:
ds_sus.swaplevel(axis='columns')

Unnamed: 0_level_0,Employment,Employed full-time,Employed part-time,"Independent contractor, freelancer, or self-employed","Not employed, and not looking for work","Not employed, but looking for work",Retired,Employed full-time,Employed part-time,"Independent contractor, freelancer, or self-employed","Not employed, and not looking for work","Not employed, but looking for work",Retired,Employed full-time,Employed part-time,"Independent contractor, freelancer, or self-employed","Not employed, and not looking for work","Not employed, but looking for work",Retired,Employed full-time,Employed part-time,"Independent contractor, freelancer, or self-employed","Not employed, and not looking for work","Not employed, but looking for work",Retired
Unnamed: 0_level_1,Unnamed: 1_level_1,size,size,size,size,size,size,mean,mean,mean,mean,mean,mean,max,max,max,max,max,max,min,min,min,min,min,min
Country,Hobbyist,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2,Unnamed: 22_level_2,Unnamed: 23_level_2,Unnamed: 24_level_2,Unnamed: 25_level_2
Afghanistan,No,7.0,2.0,1.0,,1.0,1.0,37.200000,24.000000,,,,,85.0,25.0,,,,,24.0,23.0,,,,
Afghanistan,Yes,14.0,2.0,5.0,3.0,2.0,,26.111111,23.000000,25.000000,,19.500000,,34.0,23.0,26.0,,21.0,,18.0,23.0,24.0,,18.0,
Albania,No,12.0,1.0,2.0,,1.0,,26.090909,,32.500000,,,,37.0,,40.0,,,,21.0,,25.0,,,
Albania,Yes,52.0,5.0,6.0,2.0,5.0,,25.034884,24.000000,28.600000,17.0,21.250000,,35.0,37.0,36.0,19.0,24.0,,19.0,18.0,21.0,15.0,18.0,
Algeria,No,15.0,2.0,4.0,1.0,12.0,,33.090909,33.000000,23.000000,22.0,23.545455,,63.0,33.0,24.0,22.0,36.0,,26.0,33.0,22.0,22.0,18.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Yemen,Yes,6.0,2.0,3.0,,5.0,,26.800000,36.000000,29.000000,,27.200000,,30.0,39.0,34.0,,34.0,,25.0,33.0,23.0,,22.0,
Zambia,No,,1.0,,,,,,26.000000,,,,,,26.0,,,,,,26.0,,,,
Zambia,Yes,4.0,,4.0,2.0,1.0,,26.250000,,26.333333,36.0,26.000000,,33.0,,28.0,49.0,26.0,,19.0,,23.0,23.0,26.0,
Zimbabwe,No,3.0,,2.0,,2.0,,28.666667,,23.500000,,25.000000,,33.0,,26.0,,25.0,,23.0,,21.0,,25.0,
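
With more than two levels, the choice of levels matters. A minimal sketch, assuming a made-up three-level index (`toy` and the level name "agg" are hypothetical):

```python
import pandas as pd

# Hypothetical frame with a three-level index.
idx = pd.MultiIndex.from_tuples(
    [("Albania", "No", "max"), ("Albania", "Yes", "min")],
    names=["Country", "Hobbyist", "agg"],
)
toy = pd.DataFrame({"Age": [37, 19]}, index=idx)

default_swap = toy.swaplevel()             # swaps the two inner-most levels
outer_inner = toy.swaplevel(0, 2)          # swaps level 0 with level 2, by position
by_name = toy.swaplevel("Country", "agg")  # the same swap, by name

print(list(default_swap.index.names))  # ['Country', 'agg', 'Hobbyist']
print(list(outer_inner.index.names))   # ['agg', 'Hobbyist', 'Country']
```

Since the rows keep their original order, a `.sort_index()` call afterwards is a common follow-up when the swapped level should be grouped together.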


--------------------------------

## Joining DataFrames

----------------------------------