<a target="_blank" href="https://colab.research.google.com/github/JLDC/Data-Science-Fundamentals/blob/master/notebooks/006_long-wide-data-formats.ipynb">
    <img src="https://i.ibb.co/2P3SLwK/colab.png"  style="padding-bottom:5px;" />Open this notebook in Google Colab
</a>

___

# Long and Wide Data Formats
___
In this notebook, we will talk a little about data formats, in particular the **long** and **wide** formats. While those are not the only formats that exist, they are, without doubt, the most prevalent ones.

Most of the time, you will receive data in a *tabular* format (e.g., like an Excel spreadsheet, with rows and columns), where the data is either in long or wide format. It's important that you are comfortable with both formats and that you are able to transform a dataframe from one format to the other.

In my experience with `pandas`, you will be using **wide format** most of the time. However, data obtained from external sources often comes in **long format**, so it's good to know both. Personally, I also find wide format more intuitive, but others might have different views.

You might not know it yet, but you have already encountered data in both wide and long format. In fact, the *iris* dataset is in wide format, while the *exchange rate* data is in long format.

Let's start with a quick example. Say we have a dataset of the students taking a specific course. This dataset has the following information for each student:  
+ The name
+ The grade point average (GPA)
+ The track (Econ, Business, Law, CS)
+ The current semester




___
## Wide format
If the dataset is in wide format, it will look something like this.

|Name|GPA|Track|Semester|
|:--|--:|:--|--:|
|Johann Friedrich|6.0|Econ|3|
|Florence|5.5|Business|4|
|Gertrude|4.5|Econ|3|
|Ronald|4.0|Law|3|
|Janet|5.0|Econ|5|
|Leonhard|6.0|CS|5|
|Sofya|5.5|Law|3|
|...|...|...|...|

___
## Long format
If, on the other hand, we had received the data in long format, it would look as follows:

|Name|Variable|Value|
|:--|:--|--:|
|Johann Friedrich|GPA|6.0|
|Johann Friedrich|Track|Econ|
|Johann Friedrich|Semester|3|
|Florence|GPA|5.5|
|Florence|Track|Business|
|Florence|Semester|4|
|Gertrude|GPA|4.5|
|Gertrude|Track|Econ|
|Gertrude|Semester|3|
|Ronald|GPA|4.0|
|Ronald|Track|Law|
|Ronald|Semester|3|
|...|...|...|

The names are fairly intuitive now that we see an example. 

The same data in **long format** is generally *longer*, i.e., it has more rows: instead of a single row per observation, there is one row for each variable of each individual. The **wide format**, on the other hand, is, as the name suggests, *wider*, i.e., it has more columns, because all the information on an observation is grouped in a single row, with one column per variable.
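
To make the row and column counts concrete, here is a minimal sketch that rebuilds the first four students from the tables above and constructs the long version by hand, one row per (student, variable) pair. The names `wide`, `records`, and `long` are only illustrative, and the explicit loop is meant for exposition rather than as the usual way to reshape data in `pandas`.

```python
import pandas as pd

# First four students from the wide table above
wide = pd.DataFrame({
    "Name": ["Johann Friedrich", "Florence", "Gertrude", "Ronald"],
    "GPA": [6.0, 5.5, 4.5, 4.0],
    "Track": ["Econ", "Business", "Econ", "Law"],
    "Semester": [3, 4, 3, 3],
})

# Build the long version by hand: one row per (student, variable) pair
records = [
    {"Name": row["Name"], "Variable": var, "Value": row[var]}
    for _, row in wide.iterrows()
    for var in ["GPA", "Track", "Semester"]
]
long = pd.DataFrame(records)

print(wide.shape)  # (4, 4)
print(long.shape)  # (12, 3)
```

With four students and three variables, the long table has 4 × 3 = 12 rows, while the wide table keeps four rows and has one column per variable.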



## From long to wide

Time to get our hands dirty. Let's look at how to reshape data from the long format to the wide format. Recall that our SNB exchange rate data is in long format.

In [None]:
# Import necessary libraries
import numpy as np # Numerical computing
import pandas as pd # Dataframes

# Define the path where the data is stored
DATA_PATH = "https://raw.githubusercontent.com/JLDC/Data-Science-Fundamentals/master/data"

In [None]:
# Take a first look at the raw file (semicolon-separated; the first three lines are skipped)
pd.read_csv(f"{DATA_PATH}/snb-data-devkum-en-selection-20220601_1430.csv", sep=";", skiprows=3)

In [None]:
# Read in data
snb = pd.read_csv(f"{DATA_PATH}/snb-data-devkum-en-selection-20220601_1430.csv", sep=";", skiprows=3)
# Do some cleaning, drop the D0 column
snb.drop(columns=["D0"], inplace=True)
# Create a smaller dataset with only two currencies: EUR1 and USD1
snb_small = snb.loc[snb["D1"].isin(["EUR1", "USD1"])]
snb_small # Display the data

Notice how this data is in long format? Let's go ahead and transform the *two currencies* version of the data into wide format.

In [None]:
# Transform the smaller dataset to wide format
snb_small_wide = snb_small.pivot(
    index="Date", columns="D1", values="Value").reset_index().rename_axis(None, axis=1)
snb_small_wide

Do you see how, going from long to wide, our dataset is now half as long but has twice as many *value* columns? This is because we did the exercise with two currencies. As you can see from the example below, this also works with many more than two identifiers (currencies in this case).

In [None]:
# Transform the full dataset to wide format
snb_wide = snb.pivot(
    index="Date", columns="D1", values="Value").reset_index().rename_axis(None, axis=1)
snb_wide
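
The chained call used above does three things at once. The sketch below splits it into separate steps on a tiny hand-made table. The values are made up for illustration (they are not actual SNB data), and the names `toy_long`, `step1`, `step2`, and `step3` are hypothetical; only the column names `Date`, `D1`, and `Value` mirror the code above.

```python
import pandas as pd

# Hypothetical toy data laid out like the exchange rate data (invented values)
toy_long = pd.DataFrame({
    "Date": ["2022-01", "2022-01", "2022-02", "2022-02"],
    "D1": ["EUR1", "USD1", "EUR1", "USD1"],
    "Value": [1.04, 0.92, 1.05, 0.93],
})

# Step 1: one row per Date, one column per currency.
# "Date" becomes the index and the columns axis is named "D1".
step1 = toy_long.pivot(index="Date", columns="D1", values="Value")

# Step 2: move "Date" from the index back into a regular column.
step2 = step1.reset_index()

# Step 3: drop the leftover "D1" label on the columns axis.
step3 = step2.rename_axis(None, axis=1)

print(step3)
```

Printing `step1` and `step2` as well makes it easier to see what each method contributes to the final result.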

## From wide to long

Going from long to wide was easy: a call to the `.pivot()` method followed by `.reset_index()` (plus `.rename_axis()` to tidy up the column labels) did the trick. Is going from wide to long just as easy?

In [None]:
np.random.seed(72) # Fix the random seed so that the sample is always the same
# This time, let's use a random sample of 10 flowers from the iris dataset, since it's already in wide format
iris = pd.read_csv("../data/iris.csv").sample(n=10)
iris # Display the dataset

In [None]:
# Going wide to long is also called "melting"
iris_long = pd.melt(iris.reset_index(), id_vars="index", value_vars=iris.columns)
iris_long # Display the data
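
`pd.melt` also accepts `var_name` and `value_name` to label the two new columns, instead of the default names `variable` and `value`. The following is a minimal sketch on a hypothetical two-flower table; the column names, index labels, and measurements are assumptions chosen for illustration and need not match the iris file.

```python
import pandas as pd

# Hypothetical two-flower table (names and values are assumptions)
flowers = pd.DataFrame(
    {"sepal_length": [5.1, 6.3], "petal_length": [1.4, 4.9]},
    index=[17, 98],
)

flowers_long = pd.melt(
    flowers.reset_index(),        # turn the index into an "index" column
    id_vars="index",              # column that identifies each observation
    value_vars=["sepal_length", "petal_length"],  # columns to stack
    var_name="measurement",       # label for the column of former column names
    value_name="cm",              # label for the column of values
)
print(flowers_long)
```

Each flower now appears once per measurement, and the `index` column records which original row every value came from.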

___
## Getting used to it

Unfortunately, neither the exchange rate nor the iris dataset is the best example for converting between long and wide format. Do you see why?

Think of our first example with the students. There, it's pretty clear that the identifier is the name of the student, and that each student has multiple variables with different values.

Try to think of this for our iris data. What is the identifier? There is no proper identifier. That's why we had to use `.reset_index()`: it adds an `index` column containing the number of each observation, which creates a fake identifier in a sense.

Perhaps that went a bit fast. Look at what `.reset_index()` does in isolation.

In [None]:
# Adds a column named "index" with the value of the index
iris.reset_index()

So that's why we couldn't just use
```python
iris_long = pd.melt(iris, value_vars=iris.columns)
```
above. Without an identifier column, there is no way to tell which observation each row belongs to, as you can see from the example below.

In [None]:
# 🙀 🤯 how can we find out which row belongs to which observation in the wide format?
pd.melt(iris, value_vars=iris.columns)

The situation is similar with the exchange rate data, although there the date is a natural, more intuitive identifier. The student dataset, however, illustrates the whole procedure much better.
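
As a rough illustration of the date acting as the identifier, the sketch below melts a small wide exchange rate table back to long format and then pivots it to wide again. The numbers are invented and the names `toy_wide`, `toy_back`, and `wide_again` are hypothetical; the column names follow the SNB code above.

```python
import pandas as pd

# Hypothetical wide exchange rate table (invented values)
toy_wide = pd.DataFrame({
    "Date": ["2022-01", "2022-02"],
    "EUR1": [1.04, 1.05],
    "USD1": [0.92, 0.93],
})

# Wide to long: "Date" identifies each observation,
# all remaining columns are stacked into D1/Value pairs.
toy_back = toy_wide.melt(id_vars="Date", var_name="D1", value_name="Value")
toy_back = toy_back.sort_values(["Date", "D1"]).reset_index(drop=True)
print(toy_back)

# Long to wide again, using the same chain as earlier in the notebook
wide_again = toy_back.pivot(
    index="Date", columns="D1", values="Value").reset_index().rename_axis(None, axis=1)
print(wide_again)
```

Because every (date, currency) pair occurs exactly once, the round trip recovers the original table.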

In [None]:
# Load dataframe (wide format)
students_wide = pd.read_csv(f"{DATA_PATH}/data/students_wide.csv")
students_wide # Display the data

In [None]:
# Load a dataframe (long format)
students_long = pd.read_csv(f"{DATA_PATH}/data/students_long.csv")
students_long # Display the data

#### ➡️ ✏️ Task 1: Wide to long

Using the `students_wide` dataframe, create a long format dataframe, where `student` is the id variable and `gpa`, `track`, and `semester` are the three value variables.

*Hint*: Use the code above to help yourself. Your output should be the same as what is currently in `students_long`.

In [None]:
# Enter your code below


#### ➡️ ✏️ Task 2: Long to wide

Using the `students_long` dataframe, create a wide format dataframe, where `student` is the id variable and `gpa`, `track`, and `semester` are the three value variables.

*Hint*: Use the code above to help yourself. Your output should be the same as what is currently in `students_wide`.

In [None]:
# Enter your code below


___
#### 🤔 Pause and ponder
Can you think of a use case where wide data is more useful than long data? What about a case where long data is more useful than wide data?
___