# Long Format and Wide Format

After this encounter you will
- understand what is meant by "long" and "wide" datasets, and when you should use each of them
- know how to use two pandas methods to bring datasets into a long format
- know how to use one pandas method to bring datasets into a wide format

A general rule of thumb is that it is
easier to describe functional relationships between variables/columns (e.g., z is a linear combination
of x and y, density is the ratio of weight to volume) than between rows, and it is easier
to make comparisons between groups of observations (e.g., average of group a vs. average of
group b) than between groups of columns.
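
A minimal sketch of this rule of thumb. The table, its column names and its values are made up for illustration and are not part of the penguin dataset used below.

```python
import pandas as pd

# Hypothetical toy data: one row per observation, one column per variable
samples = pd.DataFrame(
    {
        "group": ["a", "a", "b", "b"],
        "weight": [10.0, 12.0, 30.0, 18.0],
        "volume": [5.0, 4.0, 10.0, 6.0],
    }
)

# Functional relationship between columns: a single vectorised expression
samples["density"] = samples["weight"] / samples["volume"]

# Comparison between groups of observations (rows): group, then aggregate
mean_density = samples.groupby("group")["density"].mean()
print(mean_density)
```

Both operations stay short because variables live in columns and observations live in rows.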

In [2]:
import pandas as pd

In [3]:
df = pd.read_csv("penguins_simple.csv", sep=";")

In [4]:
manual = pd.DataFrame(
    {
        "Nationalities": ["Iranian", "Turkish", "German", "Colombian", "Greek"],
        "PhD_Studies": ["Physics", "Economics", "Mechanical Engineering", "Physics", "Physics"],
        "PhD_grades": [1,2,1,1,1],
        "Masters_grades": [1,2,1,1,1],
        "Age": [19,20,19,19,19],
    }
)

### Long

In [20]:
manual.head()

,Nationalities,PhD_grades,Masters_grades,Age
PhD_Studies,,,,
Physics,Iranian,1,1,19
Economics,Turkish,2,2,20
Mechanical Engineering,German,1,1,19
Physics,Colombian,1,1,19
Physics,Greek,1,1,19


In [19]:
manual.shape

(5, 4)

In [16]:
manual.set_index(['Nationalities','PhD_Studies'],inplace = True)

In [21]:
manual.reset_index(['PhD_Studies'],inplace=True)
manual

Unnamed: 0,PhD_Studies,Nationalities,PhD_grades,Masters_grades,Age
0,Physics,Iranian,1,1,19
1,Economics,Turkish,2,2,20
2,Mechanical Engineering,German,1,1,19
3,Physics,Colombian,1,1,19
4,Physics,Greek,1,1,19


One option to bring a dataset from a "wide" format into a "long" one is using **.stack()**:

In [22]:
manual.stack()

0  PhD_Studies                      Physics
   Nationalities                    Iranian
   PhD_grades                             1
   Masters_grades                         1
   Age                                   19
1  PhD_Studies                    Economics
   Nationalities                    Turkish
   PhD_grades                             2
   Masters_grades                         2
   Age                                   20
2  PhD_Studies       Mechanical Engineering
   Nationalities                     German
   PhD_grades                             1
   Masters_grades                         1
   Age                                   19
3  PhD_Studies                      Physics
   Nationalities                  Colombian
   PhD_grades                             1
   Masters_grades                         1
   Age                                   19
4  PhD_Studies                      Physics
   Nationalities                      Greek
   PhD_grades                   

In [24]:
test = manual.stack()
type(test)

pandas.core.series.Series

In [25]:
manual.shape, manual.stack().shape # note: output after calling .stack() is a Series.

((5, 5), (25,))

In [26]:
df.tail(3)

Unnamed: 0,Species,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex
330,Gentoo,50.4,15.7,222.0,5750.0,MALE
331,Gentoo,45.2,14.8,212.0,5200.0,FEMALE
332,Gentoo,49.9,16.1,213.0,5400.0,MALE


In [9]:
df.stack()

0    Species                Adelie
     Culmen Length (mm)       39.1
     Culmen Depth (mm)        18.7
     Flipper Length (mm)     181.0
     Body Mass (g)          3750.0
                             ...  
332  Culmen Length (mm)       49.9
     Culmen Depth (mm)        16.1
     Flipper Length (mm)     213.0
     Body Mass (g)          5400.0
     Sex                      MALE
Length: 1998, dtype: object

In [27]:
df.shape, df.stack().shape

((333, 6), (1998,))

In [28]:
df['region'] = ['Region A'] * 100 + ['Region B'] * 100 + ['Region C'] * 133 #create new dummy categorical column

In [42]:
df

Unnamed: 0,Species,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex,region
0,Adelie,39.1,18.7,181.0,3750.0,MALE,Region A
1,Adelie,39.5,17.4,186.0,3800.0,FEMALE,Region A
2,Adelie,40.3,18.0,195.0,3250.0,FEMALE,Region A
3,Adelie,36.7,19.3,193.0,3450.0,FEMALE,Region A
4,Adelie,39.3,20.6,190.0,3650.0,MALE,Region A
...,...,...,...,...,...,...,...
328,Gentoo,47.2,13.7,214.0,4925.0,FEMALE,Region C
329,Gentoo,46.8,14.3,215.0,4850.0,FEMALE,Region C
330,Gentoo,50.4,15.7,222.0,5750.0,MALE,Region C
331,Gentoo,45.2,14.8,212.0,5200.0,FEMALE,Region C


Next, we create hierarchical columns so that .stack() can be used to "grab" specific column levels and insert them into the index of the stacked object.

In [43]:
df_hierarchical_cols = df.groupby(['Species', 'Sex', 'region'])['Body Mass (g)'].mean() # mean per group: a Series with a 3-level index (groupby => see encounter on "aggregation")
df_hierarchical_cols = df_hierarchical_cols.unstack((0,2)).round(2) # move "Species" and "region" from the index into the cols
df_hierarchical_cols 


Species,Adelie,Adelie,Chinstrap,Chinstrap,Gentoo
region,Region A,Region B,Region B,Region C,Region C
Sex,,,,,
FEMALE,3379.0,3346.74,3530.56,3514.29,4679.74
MALE,4076.5,3971.74,3955.56,3875.0,5484.84


In [44]:
df_stacked_inner = df_hierarchical_cols.stack() # default: "grab" innermost level of cols (i.e. "region") and insert it into index
df_stacked_inner 

,Species,Adelie,Chinstrap,Gentoo
Sex,region,,,
FEMALE,Region A,3379.0,,
FEMALE,Region B,3346.74,3530.56,
FEMALE,Region C,,3514.29,4679.74
MALE,Region A,4076.5,,
MALE,Region B,3971.74,3955.56,
MALE,Region C,,3875.0,5484.84


In [45]:
df_stacked_inner.shape

(6, 3)

In [49]:
df_stacked_outer = df_hierarchical_cols.stack(1) # note: level 1 is the inner level of cols ("region"), so this equals the default; use .stack(0) to "grab" the outer level ("Species")
df_stacked_outer 

,Species,Adelie,Chinstrap,Gentoo
Sex,region,,,
FEMALE,Region A,3379.0,,
FEMALE,Region B,3346.74,3530.56,
FEMALE,Region C,,3514.29,4679.74
MALE,Region A,4076.5,,
MALE,Region B,3971.74,3955.56,
MALE,Region C,,3875.0,5484.84


So, the **.stack()** method can be used to access specific levels of hierarchical columns and to move them into the index ("flattening the cols").
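
Column levels can be selected by position or by name. The following is a minimal sketch on a small hypothetical frame (values are invented); it assumes two column levels named "Species" (outer) and "region" (inner).

```python
import pandas as pd

# Hypothetical frame with two column levels: "Species" (outer) and "region" (inner)
cols = pd.MultiIndex.from_product(
    [["Adelie", "Gentoo"], ["Region A", "Region B"]],
    names=["Species", "region"],
)
wide = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]],
    index=pd.Index(["FEMALE", "MALE"], name="Sex"),
    columns=cols,
)

inner_default = wide.stack()           # default level=-1: innermost level ("region")
inner_by_pos = wide.stack(1)           # with two levels, 1 and -1 address the same level
outer_by_pos = wide.stack(0)           # level 0: outermost level ("Species")
outer_by_name = wide.stack("Species")  # same level, selected by name

print(list(inner_default.index.names))  # ['Sex', 'region']
print(list(outer_by_pos.index.names))   # ['Sex', 'Species']
```

Selecting levels by name tends to be less error-prone than counting positions, especially once a frame has more than two levels.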

Another option to bring a dataset from a "wide" format into a "long" one is using **.melt()**:

In [52]:
manual = manual.copy()   
manual.head()


Unnamed: 0,PhD_Studies,Nationalities,PhD_grades,Masters_grades,Age
0,Physics,Iranian,1,1,19
1,Economics,Turkish,2,2,20
2,Mechanical Engineering,German,1,1,19
3,Physics,Colombian,1,1,19
4,Physics,Greek,1,1,19


In [51]:
manual.melt() # note: output after calling .melt() is a DataFrame

Unnamed: 0,variable,value
0,PhD_Studies,Physics
1,PhD_Studies,Economics
2,PhD_Studies,Mechanical Engineering
3,PhD_Studies,Physics
4,PhD_Studies,Physics
5,Nationalities,Iranian
6,Nationalities,Turkish
7,Nationalities,German
8,Nationalities,Colombian
9,Nationalities,Greek


In [17]:
manual.shape, manual.melt().shape

((5, 5), (25, 2))

#### "Tidy"

Tidy data is a standard way of mapping the meaning of a dataset to its structure. A dataset is
messy or tidy depending on how rows, columns and tables are matched up with observations,
variables and types. In tidy data:
1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.

In [53]:
manual.melt(
    id_vars = ["Nationalities", "PhD_Studies", "Age"],
    value_vars = ["Masters_grades", "PhD_grades"],
    var_name = "Academic Degrees",
    value_name = "Academic Grades",
)

# intuition of "value_vars"-param: the columns to melt; their names become the values of the new categorical variable "Academic Degrees" (named via "var_name").
# intuition of "value_name"-param: the name of the column holding the corresponding numeric value for each value of the categorical variable (i.e. "Masters_grades" and "PhD_grades").

Unnamed: 0,Nationalities,PhD_Studies,Age,Academic Degrees,Academic Grades
0,Iranian,Physics,19,Masters_grades,1
1,Turkish,Economics,20,Masters_grades,2
2,German,Mechanical Engineering,19,Masters_grades,1
3,Colombian,Physics,19,Masters_grades,1
4,Greek,Physics,19,Masters_grades,1
5,Iranian,Physics,19,PhD_grades,1
6,Turkish,Economics,20,PhD_grades,2
7,German,Mechanical Engineering,19,PhD_grades,1
8,Colombian,Physics,19,PhD_grades,1
9,Greek,Physics,19,PhD_grades,1


### Wide

One option to bring a dataset from a "long" format into a "wide" one is using **.unstack()**:

In [54]:
df_stacked_inner = df_stacked_inner.copy()
df_stacked_inner 

,Species,Adelie,Chinstrap,Gentoo
Sex,region,,,
FEMALE,Region A,3379.0,,
FEMALE,Region B,3346.74,3530.56,
FEMALE,Region C,,3514.29,4679.74
MALE,Region A,4076.5,,
MALE,Region B,3971.74,3955.56,
MALE,Region C,,3875.0,5484.84


In [20]:
df_stacked_inner.index

MultiIndex([('FEMALE', 'Region A'),
            ('FEMALE', 'Region B'),
            ('FEMALE', 'Region C'),
            (  'MALE', 'Region A'),
            (  'MALE', 'Region B'),
            (  'MALE', 'Region C')],
           names=['Sex', 'region'])

In [57]:
df_unstacked = df_stacked_inner.unstack(level = 1) # grabbing the innermost index of df_stacked_inner and inserting it into col-dimension
df_unstacked


Species,Adelie,Adelie,Adelie,Chinstrap,Chinstrap,Chinstrap,Gentoo,Gentoo,Gentoo
region,Region A,Region B,Region C,Region A,Region B,Region C,Region A,Region B,Region C
Sex,,,,,,,,,
FEMALE,3379.0,3346.74,,,3530.56,3514.29,,,4679.74
MALE,4076.5,3971.74,,,3955.56,3875.0,,,5484.84


In [22]:
df_unstacked = df_stacked_inner.unstack(0) # grabbing the outermost index of df_stacked_inner and inserting it into col-dimension
df_unstacked

Species,Adelie,Adelie,Chinstrap,Chinstrap,Gentoo,Gentoo
Sex,FEMALE,MALE,FEMALE,MALE,FEMALE,MALE
region,,,,,,
Region A,3379.0,4076.5,,,,
Region B,3346.74,3971.74,3530.56,3955.56,,
Region C,,,3514.29,3875.0,4679.74,5484.84


In [61]:
triple_stacked = df.groupby(['Species', 'Sex', 'region'])['Body Mass (g)'].mean() # creating an object with a three-level index for the next manipulation

In [62]:
triple_stacked.head(2)

Species  Sex     region  
Adelie   FEMALE  Region A    3379.00000
                 Region B    3346.73913
Name: Body Mass (g), dtype: float64

In [65]:
df_tuple_index = triple_stacked.unstack((0,2)) # grabbing the outermost and innermost index levels and inserting them into columns
df_tuple_index 

Species,Adelie,Adelie,Chinstrap,Chinstrap,Gentoo
region,Region A,Region B,Region B,Region C,Region C
Sex,,,,,
FEMALE,3379.0,3346.73913,3530.555556,3514.285714,4679.741379
MALE,4076.5,3971.73913,3955.555556,3875.0,5484.836066


In [69]:
df_tuple_index.unstack()

Species    region    Sex   
Adelie     Region A  FEMALE    3379.000000
                     MALE      4076.500000
           Region B  FEMALE    3346.739130
                     MALE      3971.739130
Chinstrap  Region B  FEMALE    3530.555556
                     MALE      3955.555556
           Region C  FEMALE    3514.285714
                     MALE      3875.000000
Gentoo     Region C  FEMALE    4679.741379
                     MALE      5484.836066
dtype: float64

In [71]:
df_tuple_index.unstack().unstack()

,Sex,FEMALE,MALE
Species,region,,
Adelie,Region A,3379.0,4076.5
Adelie,Region B,3346.73913,3971.73913
Chinstrap,Region B,3530.555556,3955.555556
Chinstrap,Region C,3514.285714,3875.0
Gentoo,Region C,4679.741379,5484.836066


Hence, .unstack() allows for "accessing" hierarchical indices by moving selected index levels into the cols.
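
Index levels can also be addressed by name instead of by position. A minimal sketch on a hypothetical Series (values are invented), assuming three index levels named "Species", "Sex" and "region":

```python
import pandas as pd

# Hypothetical long Series with a three-level index
idx = pd.MultiIndex.from_product(
    [["Adelie", "Gentoo"], ["FEMALE", "MALE"], ["Region A", "Region B"]],
    names=["Species", "Sex", "region"],
)
long_series = pd.Series(range(8), index=idx, name="value")

by_position = long_series.unstack(1)      # level 1 is "Sex"
by_name = long_series.unstack("Sex")      # same level, selected by name
two_levels = long_series.unstack(["Species", "region"])  # several levels at once

print(by_name)
print(list(two_levels.columns.names))  # ['Species', 'region']
```

Passing a list of names plays the same role as the tuple of positions `(0, 2)` used in the cells above.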

A slightly different way of getting a better view of your (wide) dataset is to swap rows and cols with **.transpose()**.

Transposing flips the two dimensions of the df: rows become cols and cols become rows.

In [72]:
df = df.copy()
df.head(2)

Unnamed: 0,Species,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex,region
0,Adelie,39.1,18.7,181.0,3750.0,MALE,Region A
1,Adelie,39.5,17.4,186.0,3800.0,FEMALE,Region A


In [29]:
df.transpose()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,323,324,325,326,327,328,329,330,331,332
Species,Adelie,Adelie,Adelie,Adelie,Adelie,Adelie,Adelie,Adelie,Adelie,Adelie,...,Gentoo,Gentoo,Gentoo,Gentoo,Gentoo,Gentoo,Gentoo,Gentoo,Gentoo,Gentoo
Culmen Length (mm),39.1,39.5,40.3,36.7,39.3,38.9,39.2,41.1,38.6,34.6,...,43.5,51.5,46.2,55.1,48.8,47.2,46.8,50.4,45.2,49.9
Culmen Depth (mm),18.7,17.4,18.0,19.3,20.6,17.8,19.6,17.6,21.2,21.1,...,15.2,16.3,14.1,16.0,16.2,13.7,14.3,15.7,14.8,16.1
Flipper Length (mm),181.0,186.0,195.0,193.0,190.0,181.0,195.0,182.0,191.0,198.0,...,213.0,230.0,217.0,230.0,222.0,214.0,215.0,222.0,212.0,213.0
Body Mass (g),3750.0,3800.0,3250.0,3450.0,3650.0,3625.0,4675.0,3200.0,3800.0,4400.0,...,4650.0,5500.0,4375.0,5850.0,6000.0,4925.0,4850.0,5750.0,5200.0,5400.0
Sex,MALE,FEMALE,FEMALE,FEMALE,MALE,FEMALE,MALE,FEMALE,MALE,MALE,...,FEMALE,MALE,FEMALE,MALE,MALE,FEMALE,FEMALE,MALE,FEMALE,MALE
region,Region A,Region A,Region A,Region A,Region A,Region A,Region A,Region A,Region A,Region A,...,Region C,Region C,Region C,Region C,Region C,Region C,Region C,Region C,Region C,Region C
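
Transposing the full dataset yields hundreds of columns, so it can help to transpose only a few rows. A minimal sketch, using a small hypothetical stand-in for the penguin data:

```python
import pandas as pd

# Hypothetical stand-in for the penguin data with mixed column types
birds = pd.DataFrame(
    {
        "Species": ["Adelie", "Adelie", "Gentoo"],
        "Body Mass (g)": [3750.0, 3800.0, 5400.0],
    }
)

flipped = birds.head(3).T  # .T is shorthand for .transpose()

print(birds.shape, flipped.shape)  # (3, 2) (2, 3)
print(birds.dtypes)    # one dtype per original column
print(flipped.dtypes)  # mixed column types typically collapse to object
```

Because each transposed column now holds values from several original columns, numeric operations on the transposed frame usually require converting back first.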


## Comments and questions within encounter

What's the difference between .stack() and .melt()?
- .stack() returns a Series if the cols are non-hierarchical (which means you eventually have to call .to_frame() if you want a DataFrame)
- .stack() allows for moving individual levels of hierarchical cols into the index
- .melt() returns a DataFrame
- .melt() has params (id_vars, value_vars, var_name, value_name) for getting data into a "tidy" format
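
The difference in return types can be checked directly. A minimal sketch on a hypothetical two-row frame (the values are invented):

```python
import pandas as pd

# Hypothetical frame with a plain (non-hierarchical) column index
grades = pd.DataFrame({"PhD_grades": [1, 2], "Masters_grades": [1, 3]})

stacked = grades.stack()  # Series with a (row label, column name) MultiIndex
melted = grades.melt()    # DataFrame with "variable" and "value" columns

# Two ways of turning the stacked Series into a DataFrame
as_frame = stacked.to_frame(name="value")     # keeps the MultiIndex
flat = stacked.rename("value").reset_index()  # index levels become regular columns

print(type(stacked).__name__, type(melted).__name__)  # Series DataFrame
print(flat)
```

Note that .stack() keeps the original row labels in the index, whereas .melt() discards them unless the relevant columns are passed as id_vars.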




How is the "innermost" level defined in .stack()?
- innermost: level = -1 (the default). See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.stack.html?highlight=stack#pandas.DataFrame.stack
- outermost: level = 0

When using .melt() for bringing data into a tidy format: does the method identify which cols are (categorical) values of another variable?
- no, this is up to you to define
- as a general rule: can you subsume the given cols under another, more general category/variable? If yes => untidy dataset!
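
A minimal sketch of what happens when .melt() is left to its defaults versus when the identifier column is named explicitly. The frame and the names "name", "degree" and "grade" are hypothetical.

```python
import pandas as pd

# Hypothetical untidy frame: the two grade columns are really values of one variable
students = pd.DataFrame(
    {
        "name": ["Ada", "Bo"],
        "Masters_grades": [1, 2],
        "PhD_grades": [2, 1],
    }
)

# Without arguments, melt() treats every column as a value column, including "name"
everything = students.melt()

# Naming the identifier column is enough: all remaining columns are melted
tidy = students.melt(id_vars="name", var_name="degree", value_name="grade")

print(everything.shape)  # (6, 2)
print(tidy)
```

If only some of the remaining columns should be melted, they can be listed in value_vars as in the cell further above.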

What is a .groupby()-object?
- in essence: it splits your initial DataFrame into sub DataFrames, one per group. This is done "under the hood" for you, so you don't have to bother with creating the sub DataFrames yourself
- as we will see in the "aggregation"-encounter: when calling .groupby() you are preparing your data for applying transformations (like aggregations) and then combining the results into a new dataset
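
The sub DataFrames can be made visible by iterating over the groupby object. A minimal sketch on a hypothetical miniature of the penguin data (values are invented):

```python
import pandas as pd

# Hypothetical miniature of the penguin data
birds = pd.DataFrame(
    {
        "Species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
        "Body Mass (g)": [3700.0, 3900.0, 5000.0, 5400.0],
    }
)

grouped = birds.groupby("Species")  # nothing is aggregated yet

# Iterating yields (group key, sub DataFrame) pairs
for species, sub_df in grouped:
    print(species, sub_df.shape)

# Aggregating applies a function per group and combines the results
mean_mass = grouped["Body Mass (g)"].mean()
print(mean_mass)
```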

When using .unstack(): how will innermost (outermost) index levels be inserted into the new object?
- the unstacked level always becomes the innermost level of the cols, no matter whether it was the innermost or the outermost index level (compare the outputs of .unstack(level = 1) and .unstack(0) above)
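
Where an unstacked level ends up can be checked via the names of the column levels. A minimal sketch on a hypothetical frame (values are invented), assuming index levels "Sex" and "region" and a column level "Species":

```python
import pandas as pd

# Hypothetical frame: two index levels ("Sex", "region"), one column level ("Species")
idx = pd.MultiIndex.from_product(
    [["FEMALE", "MALE"], ["Region A", "Region B"]], names=["Sex", "region"]
)
long_df = pd.DataFrame(
    {"Adelie": [1.0, 2.0, 3.0, 4.0], "Gentoo": [5.0, 6.0, 7.0, 8.0]},
    index=idx,
)
long_df.columns.name = "Species"

inner_moved = long_df.unstack("region")  # innermost index level -> columns
outer_moved = long_df.unstack("Sex")     # outermost index level -> columns

print(list(inner_moved.columns.names))  # ['Species', 'region']
print(list(outer_moved.columns.names))  # ['Species', 'Sex']
```

In both cases the existing column level "Species" stays outermost and the moved level is appended below it.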

Is there a method to swap index levels, while not changing the shape of an object?
- I don't know of any just yet => would need to do some research first :) 
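
One candidate worth looking into is `.swaplevel()`, which exists on both Series and DataFrame. Whether it covers exactly what the question had in mind is an assumption; the sketch below uses a hypothetical frame with invented values.

```python
import pandas as pd

# Hypothetical frame with a two-level index
idx = pd.MultiIndex.from_product(
    [["FEMALE", "MALE"], ["Region A", "Region B"]], names=["Sex", "region"]
)
masses = pd.DataFrame({"Body Mass (g)": [3400.0, 3350.0, 4100.0, 3950.0]}, index=idx)

swapped = masses.swaplevel("Sex", "region")  # same rows, index levels exchanged
swapped_sorted = swapped.sort_index()        # optional: order rows by the new outer level

print(masses.shape, swapped.shape)  # (4, 1) (4, 1)
print(list(swapped.index.names))    # ['region', 'Sex']
```

For more than two levels, `.reorder_levels()` accepts a full ordering of the level names.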