## Additional Learning Resources
Refer to [scikit-learn documentation](https://scikit-learn.org/stable/) and the [Pandas user guide](https://pandas.pydata.org/docs/) for detailed explanations of the functions used in this notebook.
For a quick refresher on splitting data:
```python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```


# Long Format and Wide Format

After this encounter you will
- understand what is meant by "long" and "wide" datasets, as well as when to use each of them
- know how to use two pandas methods to bring datasets into a long format
- know how to use one pandas method to bring datasets into a wide format

A general rule of thumb is that it is
easier to describe functional relationships between variables/columns (e.g., z is a linear combination
of x and y, density is the ratio of weight to volume) than between rows, and it is easier
to make comparisons between groups of observations (e.g., average of group a vs. average of
group b) than between groups of columns.
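
As a minimal sketch of this rule of thumb, the snippet below computes a relationship between columns (density as the ratio of weight to volume) and then compares groups of rows. The frame `measurements`, its column names, and all values are invented for illustration.

```python
import pandas as pd

# Hypothetical measurements, invented for illustration only.
measurements = pd.DataFrame(
    {
        "sample": ["a1", "a2", "b1", "b2"],
        "group": ["a", "a", "b", "b"],
        "weight": [10.0, 12.0, 30.0, 36.0],
        "volume": [5.0, 6.0, 10.0, 12.0],
    }
)

# Relationship between columns: density is the ratio of weight to volume.
measurements["density"] = measurements["weight"] / measurements["volume"]

# Comparison between groups of rows: average density of group a vs. group b.
group_means = measurements.groupby("group")["density"].mean()
print(group_means)
```

Both operations are one-liners because each variable sits in its own column and each observation in its own row.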

In [1]:
import pandas as pd

### Long

One option to bring a dataset from a "wide" format into a "long" one is using **.stack()**:

In [2]:
df = pd.DataFrame(
    {
        "Nationalities":["Ukrainian", "Nigerian", "Indian", "New Zealander", "French"],
        "Studies":["Business", "Physics", "Physics", "Env_Studies", "Marketing"],
        "PhD_Grade":[1,1,1,1,1],
        "Masters_Grade":[1,1,1,1,1],
        "Age":[18,21,21,19,23],
    }
)

In [3]:
df

Unnamed: 0,Nationalities,Studies,PhD_Grade,Masters_Grade,Age
0,Ukrainian,Business,1,1,18
1,Nigerian,Physics,1,1,21
2,Indian,Physics,1,1,21
3,New Zealander,Env_Studies,1,1,19
4,French,Marketing,1,1,23


In [4]:
df.shape

(5, 5)

In [5]:
df.index

RangeIndex(start=0, stop=5, step=1)

So, the **.stack()** method can be used to access specific "hierarchical columns" and to bring them into the index ("flattening the columns").

In [6]:
df_stacked = df.stack()  # default: "grab" innermost level of columns (in this case: we only have one column level) and insert it into index

In [7]:
df_stacked

0  Nationalities        Ukrainian
   Studies               Business
   PhD_Grade                    1
   Masters_Grade                1
   Age                         18
1  Nationalities         Nigerian
   Studies                Physics
   PhD_Grade                    1
   Masters_Grade                1
   Age                         21
2  Nationalities           Indian
   Studies                Physics
   PhD_Grade                    1
   Masters_Grade                1
   Age                         21
3  Nationalities    New Zealander
   Studies            Env_Studies
   PhD_Grade                    1
   Masters_Grade                1
   Age                         19
4  Nationalities           French
   Studies              Marketing
   PhD_Grade                    1
   Masters_Grade                1
   Age                         23
dtype: object

In [8]:
df_stacked.index

MultiIndex([(0, 'Nationalities'),
            (0,       'Studies'),
            (0,     'PhD_Grade'),
            (0, 'Masters_Grade'),
            (0,           'Age'),
            (1, 'Nationalities'),
            (1,       'Studies'),
            (1,     'PhD_Grade'),
            (1, 'Masters_Grade'),
            (1,           'Age'),
            (2, 'Nationalities'),
            (2,       'Studies'),
            (2,     'PhD_Grade'),
            (2, 'Masters_Grade'),
            (2,           'Age'),
            (3, 'Nationalities'),
            (3,       'Studies'),
            (3,     'PhD_Grade'),
            (3, 'Masters_Grade'),
            (3,           'Age'),
            (4, 'Nationalities'),
            (4,       'Studies'),
            (4,     'PhD_Grade'),
            (4, 'Masters_Grade'),
            (4,           'Age')],
           )

In [9]:
type(df_stacked)

pandas.core.series.Series
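
The example above has only one column level, so everything ends up in a Series. As a rough sketch of the hierarchical case, the frame below has two column levels; the names `grades`, `degree`, and `semester` and all values are assumptions made up for this illustration.

```python
import pandas as pd

# Hypothetical grades with two column levels: (degree, semester).
columns = pd.MultiIndex.from_product(
    [["Masters", "PhD"], ["S1", "S2"]], names=["degree", "semester"]
)
grades = pd.DataFrame(
    [[1.0, 1.3, 1.7, 2.0], [2.3, 2.0, 1.0, 1.3]],
    index=["student_a", "student_b"],
    columns=columns,
)

# Default: the innermost column level ("semester") moves into the index.
stacked = grades.stack()
print(type(stacked).__name__)
print(stacked)
```

Because one column level ("degree") remains after stacking, the result here is still a DataFrame rather than a Series. Depending on the installed pandas version, this call may emit a FutureWarning about the stack implementation.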

Another option to bring a dataset from a "wide" format into a "long" one is using **.melt()**:

In [10]:
df = df.copy()

In [11]:
df

Unnamed: 0,Nationalities,Studies,PhD_Grade,Masters_Grade,Age
0,Ukrainian,Business,1,1,18
1,Nigerian,Physics,1,1,21
2,Indian,Physics,1,1,21
3,New Zealander,Env_Studies,1,1,19
4,French,Marketing,1,1,23


In [12]:
# Note: .melt() returns a DataFrame with two new columns:
# "variable" holds the original column names, "value" holds all values of the original DataFrame.
df.melt()

Unnamed: 0,variable,value
0,Nationalities,Ukrainian
1,Nationalities,Nigerian
2,Nationalities,Indian
3,Nationalities,New Zealander
4,Nationalities,French
5,Studies,Business
6,Studies,Physics
7,Studies,Physics
8,Studies,Env_Studies
9,Studies,Marketing


In [13]:
df

Unnamed: 0,Nationalities,Studies,PhD_Grade,Masters_Grade,Age
0,Ukrainian,Business,1,1,18
1,Nigerian,Physics,1,1,21
2,Indian,Physics,1,1,21
3,New Zealander,Env_Studies,1,1,19
4,French,Marketing,1,1,23


In [14]:
df_melted = df.melt(
    value_vars=["PhD_Grade", "Masters_Grade"],
    var_name="Grade",
    id_vars=["Nationalities", "Studies", "Age"],
    value_name="Academic_Performance"
)
df_melted   

Unnamed: 0,Nationalities,Studies,Age,Grade,Academic_Performance
0,Ukrainian,Business,18,PhD_Grade,1
1,Nigerian,Physics,21,PhD_Grade,1
2,Indian,Physics,21,PhD_Grade,1
3,New Zealander,Env_Studies,19,PhD_Grade,1
4,French,Marketing,23,PhD_Grade,1
5,Ukrainian,Business,18,Masters_Grade,1
6,Nigerian,Physics,21,Masters_Grade,1
7,Indian,Physics,21,Masters_Grade,1
8,New Zealander,Env_Studies,19,Masters_Grade,1
9,French,Marketing,23,Masters_Grade,1


#### "Tidy"

Tidy data is a standard way of mapping the meaning of a dataset to its structure. A dataset is
messy or tidy depending on how rows, columns and tables are matched up with observations,
variables and types. In tidy data:
1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.

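.melt() comes in especially handy for bringing your data into a "tidy" format, because its parameters (`id_vars`, `value_vars`, `var_name`, `value_name`) let you choose which columns to keep as identifiers and how to name the resulting columns.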

### Wide

One option to bring a dataset from a "long" format into a "wide" one is using **.unstack()**:

In [15]:
df = df.copy()

In [16]:
df

Unnamed: 0,Nationalities,Studies,PhD_Grade,Masters_Grade,Age
0,Ukrainian,Business,1,1,18
1,Nigerian,Physics,1,1,21
2,Indian,Physics,1,1,21
3,New Zealander,Env_Studies,1,1,19
4,French,Marketing,1,1,23


In [17]:
df_grouped = df.groupby(["Nationalities", "Studies","PhD_Grade", "Masters_Grade"])["Age"].count() # just creating a (Series) with a multi-index
df_grouped

Nationalities  Studies      PhD_Grade  Masters_Grade
French         Marketing    1          1                1
Indian         Physics      1          1                1
New Zealander  Env_Studies  1          1                1
Nigerian       Physics      1          1                1
Ukrainian      Business     1          1                1
Name: Age, dtype: int64

In [18]:
df_grouped.index

MultiIndex([(       'French',   'Marketing', 1, 1),
            (       'Indian',     'Physics', 1, 1),
            ('New Zealander', 'Env_Studies', 1, 1),
            (     'Nigerian',     'Physics', 1, 1),
            (    'Ukrainian',    'Business', 1, 1)],
           names=['Nationalities', 'Studies', 'PhD_Grade', 'Masters_Grade'])

In [20]:
df_grouped_unstacked = df_grouped.unstack(0)  # move the outermost index level, "Nationalities", into the columns
df_grouped_unstacked

Unnamed: 0_level_0,Unnamed: 1_level_0,Nationalities,French,Indian,New Zealander,Nigerian,Ukrainian
Studies,PhD_Grade,Masters_Grade,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Business,1,1,,,,,1.0
Env_Studies,1,1,,,1.0,,
Marketing,1,1,1.0,,,,
Physics,1,1,,1.0,,1.0,


In [19]:
df_grouped_unstacked = df_grouped.unstack((0, 1))  # move the two outermost index levels, "Nationalities" and "Studies", into the columns
df_grouped_unstacked

Unnamed: 0_level_0,Nationalities,French,Indian,New Zealander,Nigerian,Ukrainian
Unnamed: 0_level_1,Studies,Marketing,Physics,Env_Studies,Physics,Business
PhD_Grade,Masters_Grade,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
1,1,1,1,1,1,1


Hence, .unstack() allows for "accessing" hierarchical indices by moving selected index levels into the columns.
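
One way to see how .stack() and .unstack() relate is a round trip. The sketch below uses a small numeric-only frame invented for this purpose (an assumption, chosen so that dtypes survive the trip) and selects the original column order explicitly rather than relying on it.

```python
import pandas as pd

# Hypothetical numeric-only frame, so that dtypes stay intact on the round trip.
df = pd.DataFrame({"PhD_Grade": [1, 1], "Masters_Grade": [1, 1], "Age": [18, 21]})

stacked = df.stack()          # wide -> long: Series with a two-level index
restored = stacked.unstack()  # long -> wide: innermost index level becomes columns

# Select the columns in their original order before comparing values.
restored = restored[list(df.columns)]
print(restored)
```

With mixed column types, as in the students example, the stacked Series has dtype object, so the restored columns would come back as object as well.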

A slightly different approach to getting a better view of your (wide) dataset is to swap rows and columns with **.transpose()**, which flips the dimensions of the DataFrame.

In [None]:
df.transpose()
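
Since the students DataFrame is 5 x 5, its shape does not change when transposed. A small sketch with a non-square frame makes the flip visible; the frame `small` is a reduced, illustrative version of the example data.

```python
import pandas as pd

# Reduced, illustrative version of the students data: 2 rows, 3 columns.
small = pd.DataFrame(
    {"Studies": ["Business", "Physics"], "Age": [18, 21], "PhD_Grade": [1, 1]},
    index=["Ukrainian", "Nigerian"],
)

flipped = small.transpose()  # equivalent to small.T
print(small.shape, flipped.shape)
print(flipped)
```

The former column labels become the index and the former index becomes the column labels. Because each new column now mixes strings and numbers, its dtype is object.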

## Comments and questions within encounter

Can we pass a list to the "var_name" argument of .melt()? => After having shown in class that it can't be done 😉, we also saw in the docs why (see the definition of the accepted argument types).

What's the difference between .stack() and .melt()?
- .stack() returns a Series if you have non-hierarchical columns (which means you would eventually have to call .to_frame() if you want a DataFrame)
- .stack() allows for accessing hierarchical columns
- .melt() returns a DataFrame
- .melt() enables you to use parameters for getting data into a "tidy" format
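
The return types listed above can be checked quickly. This is only a sketch on a two-column frame made up for the purpose; the column name passed to .to_frame() is an arbitrary choice.

```python
import pandas as pd

# Small illustrative frame with a single (non-hierarchical) column level.
df = pd.DataFrame({"Studies": ["Business", "Physics"], "Age": [18, 21]})

stacked = df.stack()
stacked_frame = stacked.to_frame(name="value")  # "value" is an arbitrary name
melted = df.melt()

print(type(stacked).__name__)        # Series
print(type(stacked_frame).__name__)  # DataFrame
print(type(melted).__name__)         # DataFrame
print(list(melted.columns))          # ['variable', 'value']
```

Note that .stack() keeps the original row labels as the outer index level, whereas .melt() discards the original index and numbers the rows anew.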


How is the "innermost" level defined in .stack()?
- innermost: level = -1 (the default). See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.stack.html?highlight=stack#pandas.DataFrame.stack
- outermost: level = 0; levels are counted from the outside in, starting at 0
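
To make the level numbering concrete, here is a short sketch that stacks the innermost and then the outermost column level of a two-level frame. The level names `degree` and `semester` and the values are hypothetical.

```python
import pandas as pd

# Hypothetical two-level columns: level 0 = "degree", level 1 (= -1) = "semester".
columns = pd.MultiIndex.from_product(
    [["Masters", "PhD"], ["S1", "S2"]], names=["degree", "semester"]
)
grades = pd.DataFrame([[1.0, 1.3, 1.7, 2.0]], index=["student_a"], columns=columns)

inner = grades.stack(level=-1)  # innermost level ("semester") moves into the index
outer = grades.stack(level=0)   # outermost level ("degree") moves into the index

print(list(inner.index.names), list(inner.columns))
print(list(outer.index.names), list(outer.columns))
```

A level can also be addressed by its name, for example `grades.stack(level="degree")`, which is often easier to read than a position.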

Oh, and why do we call our classes "encounters"?
=> The "founding" teachers @SPICED hated the term "classes"; I do too, given the level of knowledge and experience that participants already bring to the bootcamp 🙂