**Table of contents**<a id='toc0_'></a>    
- [Import Statements](#toc1_1_)    
- [Importing the data](#toc1_2_)    
- [Melting, Transposing and Stacking Data](#toc2_)    
  - [*Melting & Unmelting Data*](#toc2_1_)    
  - [*Transposing Data*](#toc2_2_)    
  - [*Stacking and Unstacking Data*](#toc2_3_)    


**Read the official documentation on pandas DataFrames @ https://pandas.pydata.org/pandas-docs/stable/reference/frame.html**

**`Note:`** **Chaining functions/methods** in pandas works the same way as in plain Python: each method returns an object, so the next call can be written directly after it.

DataFrames are **column-oriented**, unlike most common databases. **Each column** in a dataframe is a **pandas Series object**, so any operation that can be performed on a Series can also be applied to a column.
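As a minimal sketch of both points, the toy frame below (its column names and values are made up for illustration and are not part of the survey data) shows that selecting a column returns a Series and that Series methods can be chained on it:

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({"name": ["ashly", "cole", "ashly"], "score": [13, 18, 17]})

# Selecting a single column returns a pandas Series
name_col = toy["name"]
print(type(name_col))

# Series methods can be chained: upper-case the names, then count each one
name_counts = name_col.str.upper().value_counts()
print(name_counts)
```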

There are **two axes** for a dataframe, commonly referred to as axis 0 and axis 1, or the **"index"** (rows) axis and the **"columns"** axis respectively. Note that when an **operation** is applied **along axis 0**, it is applied **down through all the rows of each column**. Likewise, an operation **along axis 1** is applied **across the values in all the columns of each row**.
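The difference between the two axes is easiest to see with a reduction such as `sum`. The sketch below uses a small made-up frame (the names `toy`, `a` and `b` are illustrative only):

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=0 (or "index"): collapse down the rows, giving one result per column
column_totals = toy.sum(axis=0)
print(column_totals)

# axis=1 (or "columns"): collapse across the columns, giving one result per row
row_totals = toy.sum(axis=1)
print(row_totals)
```

Here `column_totals` is indexed by the column labels, while `row_totals` is indexed by the row labels.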

### <a id='toc1_1_'></a>[Import Statements](#toc0_)

In [1]:
# import statements
import numpy as np
import pandas as pd

In [2]:
# view options
pd.set_option("display.max_columns", 14)
pd.set_option("display.max_rows", 8)

### <a id='toc1_2_'></a>[Importing the data](#toc0_)

- The Stack Overflow Developer Survey data from 2019

In [3]:
dev_survey = pd.read_csv("./Data/dev_survey_2019.zip")

In [4]:
dev_survey.head(3)

Unnamed: 0,Respondent,MainBranch,Hobbyist,OpenSourcer,OpenSource,Employment,Country,...,Gender,Trans,Sexuality,Ethnicity,Dependents,SurveyLength,SurveyEase
0,1,I am a student who is learning to code,Yes,Never,The quality of OSS and closed source software ...,"Not employed, and not looking for work",United Kingdom,...,Man,No,Straight / Heterosexual,,No,Appropriate in length,Neither easy nor difficult
1,2,I am a student who is learning to code,No,Less than once per year,The quality of OSS and closed source software ...,"Not employed, but looking for work",Bosnia and Herzegovina,...,Man,No,Straight / Heterosexual,,No,Appropriate in length,Neither easy nor difficult
2,3,"I am not primarily a developer, but I write co...",Yes,Never,The quality of OSS and closed source software ...,Employed full-time,Thailand,...,Man,No,Straight / Heterosexual,,Yes,Appropriate in length,Neither easy nor difficult


In [5]:
# dev_survey.info()

-------------------------------------------

## <a id='toc2_'></a>[Melting, Transposing and Stacking Data](#toc0_)

-------------------------------------------

### <a id='toc2_1_'></a>[*Melting & Unmelting Data*](#toc0_)

To understand melting of dataframes we first need to understand two terms associated with the data in a dataframe.
- **fact:** A fact is a value that is measured and reported on.
- **dimension:** A dimension is a value that describes the conditions of the fact.

For example, in a sales scenario, typical facts would be the number of sales of an item and the cost. The dimensions might include the store where the item was sold, the date, and the customer.

Based on the idea of facts and dimensions, the way data is stored can be categorized as:
- **wide form:** a single row holds multiple facts.
- **long (or tidy) form:** a single row holds only one fact, possibly along with other variables that describe its dimensions.

**Melting** is the process of converting data from wide form to long/tidy form. `pd.melt()` provides a convenient way of melting a dataframe.

In [6]:
# first let's create a dataframe of wide form
wide = pd.DataFrame(
    {
        "Student_name": ["Ashly", "Cole", "Young", "Dave"],
        "Age": [15, 14, 15, 15],
        "Test1": [13, 18, 17, np.nan],
        "Test2": [19, 18, 16, 19],
        "Teacher": ["Abdullah", "Pial", "Hasan", "Arafat"],
    }
)

In [7]:
wide

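|   | Student_name | Age | Test1 | Test2 | Teacher |
|---|--------------|-----|-------|-------|----------|
| 0 | Ashly | 15 | 13.0 | 19 | Abdullah |
| 1 | Cole | 14 | 18.0 | 18 | Pial |
| 2 | Young | 15 | 17.0 | 16 | Hasan |
| 3 | Dave | 15 | NaN | 19 | Arafat |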


This dataframe has two columns (Test1 and Test2) that contain the facts, i.e., the test scores. The other columns are dimensions of those facts.

- #### Melting Data: The `pd.melt()` function

<u> **Parameters** </u>
- frame: the dataframe to melt
- id_vars: identifier variables, i.e., dimension columns
- value_vars: fact columns
- var_name: name to use for the variable column
- value_name: name to use for the value column


In [8]:
# Now, let's convert this dataframe of wide form into long form
long = pd.melt(
    wide,
    id_vars=["Student_name", "Age", "Teacher"],
    value_vars=["Test1", "Test2"],
    var_name="Test",
    value_name="Test_scores",
)

In [9]:
long

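|   | Student_name | Age | Teacher | Test | Test_scores |
|---|--------------|-----|----------|-------|-------------|
| 0 | Ashly | 15 | Abdullah | Test1 | 13.0 |
| 1 | Cole | 14 | Pial | Test1 | 18.0 |
| 2 | Young | 15 | Hasan | Test1 | 17.0 |
| 3 | Dave | 15 | Arafat | Test1 | NaN |
| 4 | Ashly | 15 | Abdullah | Test2 | 19.0 |
| 5 | Cole | 14 | Pial | Test2 | 18.0 |
| 6 | Young | 15 | Hasan | Test2 | 16.0 |
| 7 | Dave | 15 | Arafat | Test2 | 19.0 |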
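Most of the `pd.melt()` parameters are optional. As a small sketch on a cut-down, made-up version of the wide frame (the name `wide_toy` is illustrative only): when `value_vars` is omitted, every column not listed in `id_vars` is melted, and the two new columns fall back to the default names `variable` and `value`.

```python
import pandas as pd

# Hypothetical cut-down version of the wide frame, for illustration only
wide_toy = pd.DataFrame(
    {
        "Student_name": ["Ashly", "Cole"],
        "Test1": [13, 18],
        "Test2": [19, 18],
    }
)

# Only id_vars is given: Test1 and Test2 are melted automatically
long_default = pd.melt(wide_toy, id_vars="Student_name")
print(long_default)
```

Passing `var_name` and `value_name` explicitly, as in the cell above, usually gives more readable column names than the defaults.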


- #### Unmelting data with the `pivot_table()` method

In [10]:
long.pivot_table(
    index=["Student_name", "Age", "Teacher"], columns="Test", values="Test_scores"
).reset_index()

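| Test | Student_name | Age | Teacher | Test1 | Test2 |
|------|--------------|-----|----------|-------|-------|
| 0 | Ashly | 15 | Abdullah | 13.0 | 19.0 |
| 1 | Cole | 14 | Pial | 18.0 | 18.0 |
| 2 | Dave | 15 | Arafat | NaN | 19.0 |
| 3 | Young | 15 | Hasan | 17.0 | 16.0 |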


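**Notes:**
- A list passed to the `columns` parameter creates one hierarchical column level per entry, and a list passed to the `values` parameter adds a further column level on top (even if the list has a single item). So pass in a scalar whenever you can.
- `.reset_index()` was used to move the hierarchical index levels (Student_name, Age, Teacher) back into regular columns.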
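The effect of passing a scalar versus a list can be checked by counting the column levels of the result. The sketch below uses a small made-up long-form frame (`long_toy` is an illustrative name, not a variable from this notebook):

```python
import pandas as pd

# Hypothetical long-form data, for illustration only
long_toy = pd.DataFrame(
    {
        "Student_name": ["Ashly", "Cole", "Ashly", "Cole"],
        "Test": ["Test1", "Test1", "Test2", "Test2"],
        "Test_scores": [13.0, 18.0, 19.0, 18.0],
    }
)

# Scalar for values: the result has a single column level
scalar_values = long_toy.pivot_table(
    index="Student_name", columns="Test", values="Test_scores"
)
print(scalar_values.columns.nlevels)

# List for values: an extra column level is added on top
list_values = long_toy.pivot_table(
    index="Student_name", columns="Test", values=["Test_scores"]
)
print(list_values.columns.nlevels)

# reset_index() moves the index back into an ordinary column
flat = scalar_values.reset_index()
print(flat)
```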

### <a id='toc2_2_'></a>[*Transposing Data*](#toc0_)

Transposing means converting the columns into rows and the rows into columns. This can be done with either the `.transpose()` method or the `.T` property.

Some use cases for transposing data are:
- **Swapping axes for plotting**
- **Viewing more data in Jupyter**: if you use `.transpose()` to fit more data on your screen, avoid transposing the whole data set. Remember that pandas stores and optimizes data by column type, so turning a row that contains different data types (strings, dates, numbers) into a column can be a slow and memory-hungry operation. It is better to take the head, the tail, or a sample of the data and then transpose it.
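A minimal sketch of the "preview, then transpose" idea, using a small made-up frame (the name `toy` and its columns are illustrative only). Because each transposed column mixes strings and numbers, its dtype falls back to `object`, which is the reason transposing a whole mixed-type data set is costly:

```python
import pandas as pd

# Hypothetical toy data with mixed column types, for illustration only
toy = pd.DataFrame(
    {
        "name": ["Ashly", "Cole", "Young", "Dave"],
        "age": [15, 14, 15, 15],
        "score": [13.0, 18.0, 17.0, 19.0],
    }
)

# Take a small slice first, then transpose only that slice
preview = toy.head(2).T
print(preview)

# Each transposed column now holds mixed types
print(preview.dtypes)
```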


### <a id='toc2_3_'></a>[*Stacking and Unstacking Data*](#toc0_)

First, let's create a multi-level dataframe (with both a multi-level index and multi-level columns).

Note that **the position of a level in a multi-level index or in multi-level columns is counted from the outside in, and counting starts from 0**.

In [11]:
ds_sus = dev_survey.pivot_table(
    index=["Country", "Hobbyist"],
    columns="Employment",
    values="Age",
    aggfunc=["size", "mean", "max", "min"],
)

In [12]:
ds_sus

Unnamed: 0_level_0,Unnamed: 1_level_0,size,size,size,size,size,size,mean,...,max,min,min,min,min,min,min
Unnamed: 0_level_1,Employment,Employed full-time,Employed part-time,"Independent contractor, freelancer, or self-employed","Not employed, and not looking for work","Not employed, but looking for work",Retired,Employed full-time,...,Retired,Employed full-time,Employed part-time,"Independent contractor, freelancer, or self-employed","Not employed, and not looking for work","Not employed, but looking for work",Retired
Country,Hobbyist,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2
Afghanistan,No,7.0,2.0,1.0,,1.0,1.0,37.200000,...,,24.0,23.0,,,,
Afghanistan,Yes,14.0,2.0,5.0,3.0,2.0,,26.111111,...,,18.0,23.0,24.0,,18.0,
Albania,No,12.0,1.0,2.0,,1.0,,26.090909,...,,21.0,,25.0,,,
Albania,Yes,52.0,5.0,6.0,2.0,5.0,,25.034884,...,,19.0,18.0,21.0,15.0,18.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Zambia,No,,1.0,,,,,,...,,,26.0,,,,
Zambia,Yes,4.0,,4.0,2.0,1.0,,26.250000,...,,19.0,,23.0,23.0,26.0,
Zimbabwe,No,3.0,,2.0,,2.0,,28.666667,...,,23.0,,21.0,,25.0,
Zimbabwe,Yes,15.0,3.0,6.0,1.0,5.0,1.0,29.214286,...,,23.0,21.0,25.0,25.0,20.0,


- #### The `.stack()` method

The `.stack()` method moves a **column level into the index**. By default it moves the inner-most column level to the inner-most index position. But we can specify which column level to move, either by its position or by its name.

In [13]:
# say we wanted to pull the aggregate functions (size, mean, max, min) level to the inner-most index
ds_sus_stack = ds_sus.stack(0)
ds_sus_stack

Unnamed: 0_level_0,Unnamed: 1_level_0,Employment,Employed full-time,Employed part-time,"Independent contractor, freelancer, or self-employed","Not employed, and not looking for work","Not employed, but looking for work",Retired
Country,Hobbyist,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Afghanistan,No,max,85.000000,25.000000,,,,
Afghanistan,No,mean,37.200000,24.000000,,,,
Afghanistan,No,min,24.000000,23.000000,,,,
Afghanistan,No,size,7.000000,2.000000,1.0,,1.0,1.0
...,...,...,...,...,...,...,...,...
Zimbabwe,Yes,max,41.000000,24.000000,46.0,25.0,26.0,
Zimbabwe,Yes,mean,29.214286,22.666667,32.0,25.0,23.4,
Zimbabwe,Yes,min,23.000000,21.000000,25.0,25.0,20.0,
Zimbabwe,Yes,size,15.000000,3.000000,6.0,1.0,5.0,1.0
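The same behaviour can be reproduced on a frame small enough to read in full. The sketch below builds a made-up frame with two column levels (the level name `agg` and all values are assumptions for illustration; `ds_sus` itself leaves the aggregate level unnamed) and stacks it by default, by position, and by name:

```python
import pandas as pd

# Hypothetical two-level columns: aggregate function (outer), Employment (inner)
columns = pd.MultiIndex.from_product(
    [["mean", "size"], ["full-time", "part-time"]], names=["agg", "Employment"]
)
toy = pd.DataFrame(
    [[30.0, 25.0, 7.0, 2.0], [28.0, 22.0, 14.0, 3.0]],
    index=pd.Index(["No", "Yes"], name="Hobbyist"),
    columns=columns,
)

# Default: the inner-most column level (Employment) moves into the index
default_stack = toy.stack()
print(default_stack)

# By position: level 0 is the outer-most column level (agg)
by_position = toy.stack(0)

# By name: equivalent to the call above
by_name = toy.stack("agg")
print(by_name)
```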


- #### The `.unstack()` method

As we have previously seen with the groupby method, the `.unstack()` method moves an **index level into the columns**. By default it moves the inner-most index level to the inner-most column position. But we can specify which index level to move, either by its position or by its name.

In [14]:
# say we wanted to pull the Hobbyist index level from the ds_sus_stack dataframe into the inner-most column level
ds_sus_stack.unstack("Hobbyist")

Unnamed: 0_level_0,Employment,Employed full-time,Employed full-time,Employed part-time,Employed part-time,"Independent contractor, freelancer, or self-employed","Independent contractor, freelancer, or self-employed","Not employed, and not looking for work","Not employed, and not looking for work","Not employed, but looking for work","Not employed, but looking for work",Retired,Retired
Unnamed: 0_level_1,Hobbyist,No,Yes,No,Yes,No,Yes,No,Yes,No,Yes,No,Yes
Country,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2
Afghanistan,max,85.000000,34.000000,25.0,23.000000,,26.0,,,,21.0,,
Afghanistan,mean,37.200000,26.111111,24.0,23.000000,,25.0,,,,19.5,,
Afghanistan,min,24.000000,18.000000,23.0,23.000000,,24.0,,,,18.0,,
Afghanistan,size,7.000000,14.000000,2.0,2.000000,1.0,5.0,,3.0,1.0,2.0,1.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
Zimbabwe,max,33.000000,41.000000,,24.000000,26.0,46.0,,25.0,25.0,26.0,,
Zimbabwe,mean,28.666667,29.214286,,22.666667,23.5,32.0,,25.0,25.0,23.4,,
Zimbabwe,min,23.000000,23.000000,,21.000000,21.0,25.0,,25.0,25.0,20.0,,
Zimbabwe,size,3.000000,15.000000,,3.000000,2.0,6.0,,1.0,2.0,5.0,,1.0
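A small sketch of `.unstack()` on a made-up frame with a two-level index (the name `toy` and the numbers are illustrative only), showing the default behaviour next to unstacking a named level:

```python
import pandas as pd

# Hypothetical two-level index: Country (outer), Hobbyist (inner)
index = pd.MultiIndex.from_product(
    [["Albania", "Zambia"], ["No", "Yes"]], names=["Country", "Hobbyist"]
)
toy = pd.DataFrame({"size": [12.0, 52.0, 1.0, 4.0]}, index=index)

# Default: the inner-most index level (Hobbyist) moves into the columns
default_unstack = toy.unstack()
print(default_unstack)

# By name: move the Country level instead
by_name = toy.unstack("Country")
print(by_name)
```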


- #### The `.swaplevel()` method: Swapping levels of a multilevel dataframe

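The `.swaplevel()` method swaps two levels of a multi-level index or of multi-level columns (chosen with the `axis` parameter, which defaults to the index). By default it swaps the two inner-most levels, so the inner-most level moves one position outward. Other levels can be swapped by passing their positions or names.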

In [15]:
ds_sus.swaplevel(axis="columns")

Unnamed: 0_level_0,Employment,Employed full-time,Employed part-time,"Independent contractor, freelancer, or self-employed","Not employed, and not looking for work","Not employed, but looking for work",Retired,Employed full-time,...,Retired,Employed full-time,Employed part-time,"Independent contractor, freelancer, or self-employed","Not employed, and not looking for work","Not employed, but looking for work",Retired
Unnamed: 0_level_1,Unnamed: 1_level_1,size,size,size,size,size,size,mean,...,max,min,min,min,min,min,min
Country,Hobbyist,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2
Afghanistan,No,7.0,2.0,1.0,,1.0,1.0,37.200000,...,,24.0,23.0,,,,
Afghanistan,Yes,14.0,2.0,5.0,3.0,2.0,,26.111111,...,,18.0,23.0,24.0,,18.0,
Albania,No,12.0,1.0,2.0,,1.0,,26.090909,...,,21.0,,25.0,,,
Albania,Yes,52.0,5.0,6.0,2.0,5.0,,25.034884,...,,19.0,18.0,21.0,15.0,18.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Zambia,No,,1.0,,,,,,...,,,26.0,,,,
Zambia,Yes,4.0,,4.0,2.0,1.0,,26.250000,...,,19.0,,23.0,23.0,26.0,
Zimbabwe,No,3.0,,2.0,,2.0,,28.666667,...,,23.0,,21.0,,25.0,
Zimbabwe,Yes,15.0,3.0,6.0,1.0,5.0,1.0,29.214286,...,,23.0,21.0,25.0,25.0,20.0,
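The sketch below illustrates `.swaplevel()` on made-up data (the level names `agg`, `outer`, `middle` and `inner` are assumptions for illustration). It first swaps the two column levels of a small frame, then uses a three-level index to show the default swap of the two inner-most levels and an explicit swap chosen by position:

```python
import pandas as pd

# Hypothetical two-level columns, for illustration only
columns = pd.MultiIndex.from_product(
    [["size", "mean"], ["full-time", "part-time"]], names=["agg", "Employment"]
)
toy = pd.DataFrame([[7.0, 2.0, 30.0, 25.0]], index=["Afghanistan"], columns=columns)

# Swap the two column levels: Employment becomes the outer level
swapped = toy.swaplevel(axis="columns")
print(list(swapped.columns.names))

# With three levels, the default swaps only the two inner-most levels
three = pd.MultiIndex.from_tuples(
    [("a", "b", "c")], names=["outer", "middle", "inner"]
)
default_names = list(three.swaplevel().names)
print(default_names)

# Explicit positions: swap the outer-most and inner-most levels
explicit_names = list(three.swaplevel(0, 2).names)
print(explicit_names)
```

Note that swapping levels only reorders the labels. The columns stay in their original left-to-right order, as the `ds_sus` output above also shows.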
