# Long Format and Wide Format



After this encounter you will
- understand what is meant by "long" and "wide" datasets, as well as when you should use each of them
- know how to use two pandas methods to bring datasets into a long format
- know how to use one pandas method to bring datasets into a wide format

A general rule of thumb is that it is
easier to describe functional relationships between variables/columns (e.g., z is a linear combination
of x and y, density is the ratio of weight to volume) than between rows, and it is easier
to make comparisons between groups of observations (e.g., average of group a vs. average of
group b) than between groups of columns.
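The rule of thumb can be sketched with a small example. The frame below and its column names (`group`, `weight`, `volume`) are invented purely for illustration; the point is only that a relationship between variables is one expression over columns, while a comparison between groups is one aggregation over rows.

```python
import pandas as pd

# Hypothetical measurements, invented for this illustration.
samples = pd.DataFrame(
    {
        "group": ["a", "a", "b", "b"],
        "weight": [10.0, 12.0, 30.0, 27.0],
        "volume": [2.0, 3.0, 5.0, 3.0],
    }
)

# Functional relationship between columns: density is weight divided by volume.
samples["density"] = samples["weight"] / samples["volume"]

# Comparison between groups of observations (rows): mean density per group.
group_means = samples.groupby("group")["density"].mean()
print(group_means)
```

Both operations stay short because every variable sits in its own column and every observation in its own row.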

In [1]:
import pandas as pd

### Long

One option to bring a dataset from a "wide" format into a "long" one is using **.stack()**:

The **.stack()** method moves column labels into the innermost level of the index ("flattening the columns"). If the columns are hierarchical, it can be applied to specific column levels only.
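As a minimal sketch of the hierarchical case, the frame below uses two column levels. The level names `kind` and `degree`, the row labels and the values are assumptions made up for this example and are not part of the notebook's dataset.

```python
import pandas as pd

# Hypothetical frame with two column levels, invented for this illustration.
columns = pd.MultiIndex.from_product(
    [["Grade"], ["PhD", "Masters"]], names=["kind", "degree"]
)
grades = pd.DataFrame(
    [[1, 2], [3, 4]], index=["student_a", "student_b"], columns=columns
)

# Move only the "degree" level of the columns into the index.
# The "kind" level stays behind as the remaining column level.
stacked = grades.stack(level="degree")
print(stacked)
```

The result has one row per (student, degree) pair and a single `Grade` column, which is the long counterpart of the original two-column layout.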

In [2]:
df = pd.DataFrame(
    {
        "Nationalities":["Ukrainian", "Nigerian", "Indian", "New Zealander", "French"],
        "Studies":["Business", "Physics", "Physics", "Env_Studies", "Marketing"],
        "PhD_Grade":[1,1,1,1,1],
        "Masters_Grade":[1,1,1,1,1],
        "Age":[18,21,21,19,23],
    }
)

Another option to bring a dataset from a "wide" format into a "long" one is using **.melt()**:

In [4]:
# df
df.shape

(5, 5)

In [8]:
df.index

RangeIndex(start=0, stop=5, step=1)

#### "Tidy"

Tidy data is a standard way of mapping the meaning of a dataset to its structure. A dataset is
messy or tidy depending on how rows, columns and tables are matched up with observations,
variables and types. In tidy data:
1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.

In [6]:
df_stacked = df.stack()

In [7]:
df_stacked.index

MultiIndex([(0, 'Nationalities'),
            (0,       'Studies'),
            (0,     'PhD_Grade'),
            (0, 'Masters_Grade'),
            (0,           'Age'),
            (1, 'Nationalities'),
            (1,       'Studies'),
            (1,     'PhD_Grade'),
            (1, 'Masters_Grade'),
            (1,           'Age'),
            (2, 'Nationalities'),
            (2,       'Studies'),
            (2,     'PhD_Grade'),
            (2, 'Masters_Grade'),
            (2,           'Age'),
            (3, 'Nationalities'),
            (3,       'Studies'),
            (3,     'PhD_Grade'),
            (3, 'Masters_Grade'),
            (3,           'Age'),
            (4, 'Nationalities'),
            (4,       'Studies'),
            (4,     'PhD_Grade'),
            (4, 'Masters_Grade'),
            (4,           'Age')],
           )
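The stacked result is a Series whose index pairs each original row label with an original column name. The following sketch, which rebuilds the same `df` so that it runs on its own, shows one possible way to look up a single value and to turn the two index levels back into ordinary columns. The names `row`, `variable` and `value` are arbitrary choices for this example.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Nationalities": ["Ukrainian", "Nigerian", "Indian", "New Zealander", "French"],
        "Studies": ["Business", "Physics", "Physics", "Env_Studies", "Marketing"],
        "PhD_Grade": [1, 1, 1, 1, 1],
        "Masters_Grade": [1, 1, 1, 1, 1],
        "Age": [18, 21, 21, 19, 23],
    }
)
df_stacked = df.stack()

# Each value is addressed by (row label, original column name).
first_age = df_stacked.loc[(0, "Age")]
print(first_age)

# Name the two index levels, then move them into regular columns.
long_df = df_stacked.rename_axis(["row", "variable"]).reset_index(name="value")
print(long_df.head())
```

Five rows times five columns give 25 entries, so `long_df` has 25 rows and three columns.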

In [14]:
df_melted = df.melt(
    id_vars=["Nationalities", "Studies", "Age"], 
    value_vars=["PhD_Grade", "Masters_Grade"], 
    var_name="Grade",
    value_name="Academic_Performance"
)
df_melted

   Nationalities      Studies  Age          Grade  Academic_Performance
0      Ukrainian     Business   18      PhD_Grade                     1
1       Nigerian      Physics   21      PhD_Grade                     1
2         Indian      Physics   21      PhD_Grade                     1
3  New Zealander  Env_Studies   19      PhD_Grade                     1
4         French    Marketing   23      PhD_Grade                     1
5      Ukrainian     Business   18  Masters_Grade                     1
6       Nigerian      Physics   21  Masters_Grade                     1
7         Indian      Physics   21  Masters_Grade                     1
8  New Zealander  Env_Studies   19  Masters_Grade                     1
9         French    Marketing   23  Masters_Grade                     1
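For comparison, here is a sketch of the same call with only `id_vars` given. In that case pandas melts every remaining column and falls back to the default column names `variable` and `value`. The frame is rebuilt so the example runs on its own.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Nationalities": ["Ukrainian", "Nigerian", "Indian", "New Zealander", "French"],
        "Studies": ["Business", "Physics", "Physics", "Env_Studies", "Marketing"],
        "PhD_Grade": [1, 1, 1, 1, 1],
        "Masters_Grade": [1, 1, 1, 1, 1],
        "Age": [18, 21, 21, 19, 23],
    }
)

# Without value_vars, all columns not listed in id_vars are melted.
melted_default = df.melt(id_vars=["Nationalities", "Studies", "Age"])
print(melted_default.columns.tolist())

# One output row per (original row, melted column) pair: 5 rows x 2 columns.
print(len(melted_default))
```

Setting `var_name` and `value_name` explicitly, as in the cell above, mainly makes the long table easier to read.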


### Wide

One option to bring a dataset from a "long" format into a "wide" one is using **.unstack()**:

In [18]:
df_grouped = df.groupby(["Nationalities", "Studies", "Age"])["Age"].count()
df_grouped

Nationalities  Studies      Age
French         Marketing    23     1
Indian         Physics      21     1
New Zealander  Env_Studies  19     1
Nigerian       Physics      21     1
Ukrainian      Business     18     1
Name: Age, dtype: int64

In [19]:
df_grouped.index

MultiIndex([(       'French',   'Marketing', 23),
            (       'Indian',     'Physics', 21),
            ('New Zealander', 'Env_Studies', 19),
            (     'Nigerian',     'Physics', 21),
            (    'Ukrainian',    'Business', 18)],
           names=['Nationalities', 'Studies', 'Age'])

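Hence, **.unstack()** allows for "accessing" hierarchical indices by moving selected index levels into the columns. By default it moves the innermost level (here `Age`); a different level can be chosen with the `level` argument.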

In [21]:
df_grouped_unstacked = df_grouped.unstack()
df_grouped_unstacked.index

MultiIndex([(       'French',   'Marketing'),
            (       'Indian',     'Physics'),
            ('New Zealander', 'Env_Studies'),
            (     'Nigerian',     'Physics'),
            (    'Ukrainian',    'Business')],
           names=['Nationalities', 'Studies'])
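The call above relies on the default, which moves the innermost level (`Age`) into the columns. As a sketch of the alternative, the example below unstacks the `Studies` level instead and uses `fill_value=0` so that combinations that do not occur show up as 0 rather than NaN. It rebuilds `df` and `df_grouped` so that it runs on its own.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Nationalities": ["Ukrainian", "Nigerian", "Indian", "New Zealander", "French"],
        "Studies": ["Business", "Physics", "Physics", "Env_Studies", "Marketing"],
        "PhD_Grade": [1, 1, 1, 1, 1],
        "Masters_Grade": [1, 1, 1, 1, 1],
        "Age": [18, 21, 21, 19, 23],
    }
)
df_grouped = df.groupby(["Nationalities", "Studies", "Age"])["Age"].count()

# Move the "Studies" level into the columns; fill missing combinations with 0.
by_studies = df_grouped.unstack(level="Studies", fill_value=0)
print(by_studies)
```

The remaining index levels are `Nationalities` and `Age`, and each distinct field of study becomes its own column.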

A slightly different approach to getting a better view of your (wide) dataset is to swap rows and columns with **.transpose()**, which flips the dimensions of the DataFrame:

In [24]:
df.transpose()

                       0         1        2              3          4
Nationalities  Ukrainian  Nigerian   Indian  New Zealander     French
Studies         Business   Physics  Physics    Env_Studies  Marketing
PhD_Grade              1         1        1              1          1
Masters_Grade          1         1        1              1          1
Age                   18        21       21             19         23
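One side effect worth knowing: after transposing, each new column mixes text and numbers, so pandas stores all of them with the generic `object` dtype. The sketch below rebuilds `df`, checks this, and shows one possible way to recover the numeric dtypes after transposing back, using `infer_objects()`.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Nationalities": ["Ukrainian", "Nigerian", "Indian", "New Zealander", "French"],
        "Studies": ["Business", "Physics", "Physics", "Env_Studies", "Marketing"],
        "PhD_Grade": [1, 1, 1, 1, 1],
        "Masters_Grade": [1, 1, 1, 1, 1],
        "Age": [18, 21, 21, 19, 23],
    }
)

# After transposing, every column holds a mix of strings and integers.
transposed = df.transpose()
print(transposed.dtypes)

# Transposing back keeps the object dtype; infer_objects() re-detects numbers.
restored = transposed.transpose().infer_objects()
print(restored.dtypes)
```

For this reason **.transpose()** is best treated as a viewing aid here rather than as a step in a processing pipeline.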


In [None]:
# # Tidy Formats Criteria

# 1). 