<b><font size="5">Data Wrangling with Pandas. </font></b>
<br><br>
This notebook is an introduction to the Pandas library. Feel free to complement your knowledge with the online documentation:<br>
https://pandas.pydata.org/pandas-docs/stable/user_guide/index.html

### <font color='#BFD72F'>Table of Contents </font> <a class="anchor" id='toc'></a> 

- [1. Wide VS Long](#P1) 
- [2. Transpose](#P2) 
- [3. Wide to Long](#P3)
    - [Melt](#P3.1)
    - [Wide to long](#P3.2)
    - [Stack](#P3.3)
- [4. Long to Wide](#P4)
    - [Pivot](#P4.1)
    - [Pivot table](#P4.2)
    - [Unstack](#P4.3)
- [5. Transform list columns](#P5)  
- [6. Try it out](#P6)

### <font color='#BFD72F'>1. Wide VS Long </font> <a class="anchor" id="P1"></a>
  [Back to TOC](#toc)

| Wide format | Long format |
| ----- | ----- |
| One column per attribute | One column each for the subject, the attribute name and the value |
| One row per subject | One row per subject-attribute pair |
| No repeated subjects, but missing values are possible | Repeated subjects, but rows with missing values can simply be omitted |
| <img src="https://preview.redd.it/reshaping-table-w-tens-of-millions-of-rows-from-long-to-wide-v0-qlpweqqts66a1.png?width=1334&format=png&auto=webp&s=9d7ccfef49690095f13afa0fb45cebbccc091cd1" width=400> | <img src="https://preview.redd.it/reshaping-table-w-tens-of-millions-of-rows-from-long-to-wide-v0-ijzw95ios66a1.png?width=1316&format=png&auto=webp&s=8aa3be9405c66da96e896a7fe6863564a673ebe2" width=450> |
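
As a minimal sketch of the comparison above, the toy DataFrame below (hypothetical students and grades, not part of the course datasets) holds the same information in both layouts. The names `wide`, `long` and `long_no_missing` are illustrative only.

```python
import pandas as pd

# Wide format: one row per subject (student), one column per attribute
wide = pd.DataFrame({
    "student": ["Ana", "Rui"],
    "math": [15, 12],
    "physics": [17, None],   # a missing grade stays as an empty cell
})

# Long format: one row per (student, course) pair
long = wide.melt(id_vars="student", var_name="course", value_name="grade")

# In the long format a missing value can be dropped as a whole row
long_no_missing = long.dropna(subset=["grade"])

print(wide.shape)             # (2, 3)
print(long.shape)             # (4, 3)
print(long_no_missing.shape)  # (3, 3)
```

The wide table keeps one row per student, while the long table repeats each student once per course.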

### <font color='#BFD72F'>2. Transpose </font> <a class="anchor" id="P2"></a>
  [Back to TOC](#toc)

In [3]:
# Import libraries and define the alias
import pandas as pd
import numpy as np

In [1]:
# Confirm that you have Pandas updated
!pip show pandas

Name: pandas
Version: 2.2.2
Summary: Powerful data structures for data analysis, time series, and statistics
Home-page: https://pandas.pydata.org
Author: 
Author-email: The Pandas Development Team <pandas-dev@python.org>
License: BSD 3-Clause License


In [4]:
# Import datasets
netflix = pd.read_csv('datasets_tp/netflix_dataset.csv')
countries = pd.read_excel('datasets_tp/countries.xlsx') 
weather = pd.read_csv('datasets_tp/austin_weather.csv')
countries.head()

Unnamed: 0,place,pop1980,pop2000,pop2010,pop2022,pop2023,pop2030,pop2050,country,area,landAreaKm,cca2,cca3,netChange,growthRate,worldPercentage,density,densityMi,rank
0,356,696828385.0,1059634000.0,1240613620,1417173000.0,1428628000.0,1514994000.0,1670491000.0,India,3287590.0,2973190.0,IN,IND,0.4184,0.0081,0.1785,480.5033,1244.5036,1
1,156,982372466.0,1264099000.0,1348191368,1425887000.0,1425671000.0,1415606000.0,1312636000.0,China,9706961.0,9424702.9,CN,CHN,-0.0113,-0.0002,0.1781,151.2696,391.7884,2
2,840,223140018.0,282398600.0,311182845,338289900.0,339996600.0,352162300.0,375392000.0,United States,9372610.0,9147420.0,US,USA,0.0581,0.005,0.0425,37.1686,96.2666,3
3,360,148177096.0,214072400.0,244016173,275501300.0,277534100.0,292150100.0,317225200.0,Indonesia,1904569.0,1877519.0,ID,IDN,0.0727,0.0074,0.0347,147.8196,382.8528,4
4,586,80624057.0,154369900.0,194454498,235824900.0,240485700.0,274029800.0,367808500.0,Pakistan,881912.0,770880.0,PK,PAK,0.1495,0.0198,0.03,311.9625,807.9829,5


#### Transpose

- Transpose index and columns.

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.transpose.html <br>
*DataFrame.transpose(\*args, copy=False)*
<br><br>[Back to TOC](#toc)
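
A small sketch of what transposing does, using a hypothetical two-country frame (`toy` is an illustrative name, and the numbers are made up): the row labels become column labels and vice versa.

```python
import pandas as pd

# Hypothetical data: two countries (rows) and two population columns
toy = pd.DataFrame(
    {"pop2000": [10, 20], "pop2010": [11, 22]},
    index=pd.Index(["A", "B"], name="country"),
)

transposed = toy.transpose()     # toy.T gives the same result

print(list(transposed.index))    # ['pop2000', 'pop2010']
print(list(transposed.columns))  # ['A', 'B']
print(transposed.loc["pop2010", "B"])  # 22

# Transposing twice brings back the original layout
print(toy.T.T.equals(toy))       # True
```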

In [None]:
# Prepare the dataset to try the transpose method
df = countries.iloc[:20, 1:9].set_index('country')
df

In [None]:
# Reshape the DataFrame switching index with columns...
 #code here

In [None]:
# ...the shorthand attribute 'T' does the same...
df.T

In [None]:
# As with other methods, calls can be chained (transposing twice returns the original layout)
df.T.T

### <font color='#BFD72F'>3. Wide to Long </font> <a class="anchor" id="P3"></a>
  [Back to TOC](#toc)

#### Melt <a class="anchor" id="P3.1"></a>

- Reshape the DataFrame into a format where one or more columns are identifier variables (id_vars), while the other columns (value_vars) are unpivoted into two columns: one holding the original column name and one holding its value.
Note: each row of the original DataFrame produces one output row per column listed in value_vars.

https://pandas.pydata.org/docs/reference/api/pandas.melt.html <br>
*pandas.melt(frame, id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None, ignore_index=True)*
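
The sketch below illustrates the signature above on a hypothetical frame (the `toy` data and the column names `year` and `population` are assumptions for illustration). It also shows that columns listed in neither id_vars nor value_vars are dropped from the result.

```python
import pandas as pd

# Hypothetical wide data: one row per country
toy = pd.DataFrame({
    "country": ["A", "B"],
    "area": [100, 200],
    "pop2000": [10, 20],
    "pop2010": [11, 22],
})

long = pd.melt(
    toy,
    id_vars=["country"],                 # kept as identifier column
    value_vars=["pop2000", "pop2010"],   # unpivoted into rows
    var_name="year",                     # name for the former column labels
    value_name="population",             # name for the former cell values
)

print(list(long.columns))   # ['country', 'year', 'population']
print(len(long))            # 4 rows: 2 countries x 2 value columns
print("area" in long.columns)  # False: 'area' was in neither list
```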

In [None]:
# Transforming countries from wide to long using population values...
df0 = pd.melt(countries, id_vars=['country'], 
              value_vars=['pop1980','pop2000','pop2010','pop2022','pop2023','pop2030','pop2050'])
df0

In [None]:
# Each country also has attributes that do not change across years (e.g. area, density); these can be kept as id_vars
df1 = pd.melt(countries, id_vars=['country','area','density'], 
              value_vars=['pop1980','pop2000','pop2010','pop2022','pop2023','pop2030','pop2050'])
df1

In [None]:
# We can also name the two new columns (var_name and value_name)
 #code here

#### Wide to long <a class="anchor" id="P3.2"></a>

- This function expects columns named in the format 'ColSuffix', where 'Col' is one of the stubnames and 'Suffix' becomes a value of the new j index level. <br>
Note: by default, the suffix argument captures numeric suffixes only, and the sep argument is '' (an empty string, i.e. no separator between the stub and the suffix).

https://pandas.pydata.org/docs/reference/api/pandas.wide_to_long.html <br>
*pandas.wide_to_long(df, stubnames, i, j, sep='', suffix='\\d+')*
<br><br>[Back to TOC](#toc)
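
A minimal sketch of both situations, assuming hypothetical frames `toy` and `toy2`: the default numeric suffix with no separator, and a word suffix with an explicit separator.

```python
import pandas as pd

# Numeric suffix, no separator: 'pop' + '2000', 'pop' + '2010'
toy = pd.DataFrame({
    "country": ["A", "B"],
    "pop2000": [10, 20],
    "pop2010": [11, 22],
})
long = pd.wide_to_long(toy, stubnames="pop", i="country", j="year")

print(list(long.index.names))       # ['country', 'year']
print(long.loc[("B", 2010), "pop"])  # 22 (numeric suffixes become integers)

# Word suffix with a separator: 'Temp' + '_' + 'High' / 'Low'
toy2 = pd.DataFrame({
    "day": [1, 2],
    "Temp_High": [30, 31],
    "Temp_Low": [20, 21],
})
long2 = pd.wide_to_long(
    toy2, stubnames="Temp", i="day", j="measure", sep="_", suffix=r"\w+"
)

print(long2.loc[(2, "Low"), "Temp"])  # 21
```

In both cases the result is indexed by the pair (i, j), and the stub name becomes the name of the value column.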

In [None]:
# Let's preview the data
countries.iloc[:, 1:9]

In [None]:
# Reshape countries to obtain just the population for each country and year
 #code here

In [None]:
# If we pass the whole DataFrame, columns that don't match the stubnames are kept as regular columns...
pd.wide_to_long(countries, i='country', stubnames=['pop'], j='year')

In [None]:
# ...but some of them can be moved into the 'i' index variables instead
pd.wide_to_long(countries, i=['country','rank'], stubnames=['pop'], j='year')

In [None]:
# As with any argument, you can change the suffix...
weather.head()

In [None]:
# ...after 'Temp' and 'DewPoint' we want to capture 'HighF', 'AvgF' and 'LowF' (the regex \w+ matches a word-like suffix)
df2 = pd.wide_to_long(weather.iloc[:10,:7], i='Date', stubnames=['Temp','DewPoint'], j='measure', suffix=r'\w+')
df2

In [None]:
# This function (pd.wide_to_long) always outputs a MultiIndex (i,j)
df2.index

#### Stack <a class="anchor" id="P3.3"></a>

- Stack the prescribed level(s) from columns to index.

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.stack.html <br>
*DataFrame.stack(level=-1, dropna=_NoDefault.no_default, sort=_NoDefault.no_default, future_stack=False)*
<br><br>[Back to TOC](#toc)
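
The following sketch shows stacking on a hypothetical frame with two column levels (the `toy` data and the level names are made up to mirror the example that follows). Depending on the installed pandas version, calling stack this way may emit a FutureWarning about a newer implementation; the values shown here are not affected.

```python
import pandas as pd

# Hypothetical frame with a two-level column index
columns = pd.MultiIndex.from_product(
    [["revenue", "age"], ["min", "max"]], names=["variable", "measure"]
)
toy = pd.DataFrame(
    [[10, 15, 20, 60], [11, 14, 25, 55]],
    index=pd.Index(["Basic", "Premium"], name="plan"),
    columns=columns,
)

# Move the 'measure' column level into the row index
stacked = toy.stack(level="measure")

print(list(stacked.index.names))                 # ['plan', 'measure']
print(stacked.shape)                             # (4, 2)
print(stacked.loc[("Basic", "max"), "revenue"])  # 15
```

Each original row is split into one row per value of the stacked level, so two rows with two measures give four rows.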

In [None]:
# Stack 'df' DataFrame (single index)
 #code here

In [None]:
# Considering a MultiIndex DataFrame...
df3 = netflix.groupby(['Country', 'Subscription Type'])[['Monthly Revenue','Age']].agg(['min', 'mean', 'max'])
df3.columns.names=['variable','measure']
df3

In [None]:
# Note that we have MultiIndex in rows and columns
print(df3.columns)
print(df3.index)

In [None]:
# ...we can stack it with the default arguments (level=-1 moves the innermost column level, 'measure', to the index)...
df3_stacked = df3.stack()
df3_stacked

In [None]:
# ...or we can stack the outermost column level ('variable') instead
df3.stack(level=0) # With two column levels, level=0 is the same as level=-2

In [None]:
# As our column levels have names...
df3.columns.names

In [None]:
# ...we can use it to define the level argument (instead of positive/negative indexing)
df3.stack(level='variable') # same output as df3.stack(level=0)

### <font color='#BFD72F'>4. Long to Wide </font> <a class="anchor" id="P4"></a>
  [Back to TOC](#toc)

#### Pivot <a class="anchor" id="P4.1"></a>

- Reshape data based on column values.

https://pandas.pydata.org/docs/reference/api/pandas.pivot.html <br>
*pandas.pivot(data, columns, index=_NoDefault.no_default, values=_NoDefault.no_default)*
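
A minimal sketch of pivot on hypothetical long data (`long`, `wide` and `dup` are illustrative names). It also shows a limitation worth knowing: pivot does not aggregate, so a repeated (index, columns) pair raises a ValueError.

```python
import pandas as pd

# Hypothetical long data: one row per (country, year) pair
long = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year": ["pop2000", "pop2010", "pop2000", "pop2010"],
    "population": [10, 11, 20, 22],
})

wide = long.pivot(index="country", columns="year", values="population")

print(wide.shape)                # (2, 2)
print(wide.loc["B", "pop2010"])  # 22

# Repeating one (country, year) pair makes the reshape ambiguous
dup = pd.concat([long, long.iloc[[0]]])
raised = False
try:
    dup.pivot(index="country", columns="year", values="population")
except ValueError as err:
    raised = True
    print("pivot failed:", err)
```

When duplicates like this are expected, an aggregating function such as pivot_table is the usual alternative.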

In [None]:
# Reverting the 'melted' DataFrames with pivot...
df0

In [None]:
# ...we have, at least, to define the index and columns...
 #code here

In [None]:
#...but if the dataset has other columns besides the value column (here area and density), pivot spreads every one of them across the new columns, repeating the same area and density for each year! So...
display(df1)

In [None]:
pd.pivot(df1, index='country', columns='variable')


In [None]:
# ...we should define the argument values
pd.pivot(df1, index='country', columns='variable', values='value')  # What about the other (lost) columns?...

In [None]:
# ...we can keep them in the index, and then reset those levels so that only 'country' remains as the index
pd.pivot(df1, index=['country','area','density'], columns='variable', values='value').reset_index(level=[1,2])

#### Pivot table <a class="anchor" id="P4.2"></a>

- Create a pivot table with aggregation of numeric data.

https://pandas.pydata.org/docs/reference/api/pandas.pivot_table.html <br>
*pandas.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All', observed=False, sort=True)*
<br><br>[Back to TOC](#toc)
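
The sketch below uses a hypothetical four-row frame (`toy`, with made-up temperatures) to show how pivot_table aggregates repeated (index, columns) pairs, and what the margins arguments add. The label "Overall" is an arbitrary choice.

```python
import pandas as pd

# Hypothetical daily readings: two rows share Year=2020, Month=1
toy = pd.DataFrame({
    "Year": [2020, 2020, 2020, 2021],
    "Month": [1, 1, 2, 1],
    "TempAvgF": [50.0, 54.0, 60.0, 48.0],
})

table = pd.pivot_table(
    toy, index="Year", columns="Month", values="TempAvgF", aggfunc="mean"
)
print(table.loc[2020, 1])           # 52.0, the mean of 50.0 and 54.0
print(pd.isna(table.loc[2021, 2]))  # True: no data for that combination

# margins=True appends a row and a column aggregated over all the data
with_margins = pd.pivot_table(
    toy, index="Year", columns="Month", values="TempAvgF",
    aggfunc="mean", margins=True, margins_name="Overall",
)
print(with_margins.loc["Overall", "Overall"])  # 53.0, the mean of all four readings
```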

In [None]:
display(weather)

In [None]:
# Pivot weather DataFrame to get the average 'TempAvgF' by year (index) and month (column)
pd.pivot_table(weather, index='Year', columns='Month', values='TempAvgF', aggfunc='mean')

In [None]:
# Same example with extra arguments
pd.pivot_table(weather, index='Year', columns='Month', values='TempAvgF', aggfunc='mean',
               fill_value='UNK', margins=True, margins_name='Global Avg')

In [None]:
# Can also consider several values and aggregation functions
pd.pivot_table(weather, index='Year', columns='Month', values=['TempLowF','TempAvgF','TempHighF'], 
               aggfunc={'TempLowF':'min', 'TempAvgF':'mean', 'TempHighF':'max'}).round(2)

In [None]:
# Although the focus is on transforming from long to wide, we can transpose the result to see the full output...
pd.pivot_table(weather, index='Year', columns='Month', values=['TempLowF','TempAvgF','TempHighF'], 
               aggfunc={'TempLowF':'min', 'TempAvgF':'mean', 'TempHighF':'max'}).round(2).T

In [None]:
# ...or even use a different set of index and columns
pd.pivot_table(weather, index=['Year','Month'], values=['TempLowF','TempAvgF','TempHighF'], 
               aggfunc={'TempLowF':'min', 'TempAvgF':'mean', 'TempHighF':'max'}).round(2)

#### Unstack <a class="anchor" id="P4.3"></a>

- Revert the stacking, reshaping from index to columns.

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.unstack.html <br>
*DataFrame.unstack(level=-1, fill_value=None, sort=True)*
<br><br>[Back to TOC](#toc)
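
A small sketch of unstack on a hypothetical Series with a two-level row index (the `stacked` data and the level name `plan` are assumptions for illustration). By default the innermost level moves to the columns; a level name selects another one.

```python
import pandas as pd

# Hypothetical Series indexed by (Country, plan)
index = pd.MultiIndex.from_product(
    [["Canada", "Spain"], ["Basic", "Premium"]], names=["Country", "plan"]
)
stacked = pd.Series([10, 15, 11, 14], index=index, name="revenue")

by_plan = stacked.unstack()                    # innermost level ('plan') becomes columns
by_country = stacked.unstack(level="Country")  # choose the level by name

print(by_plan.loc["Spain", "Premium"])      # 14
print(by_country.loc["Premium", "Canada"])  # 15
print(by_country.columns.name)              # Country
```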

In [None]:
display(df3_stacked)

In [None]:
# Revert df3_stacked DataFrame
df3_stacked.unstack() # The default level=-1 is equivalent to level=2 here (the 'measure' level)

In [None]:
# As with .stack(), the level can be chosen; here it refers to a level of the row MultiIndex...
df3_stacked.unstack(level=1) # Here level=1 is 'Subscription Type'

In [None]:
# ...the level name also works (try with 'Country', which is equal to level=0)
 #code here

### <font color='#BFD72F'>5. Transform list columns </font> <a class="anchor" id="P5"></a>
  [Back to TOC](#toc)

#### Explode

- Transform each element of a list-like to a row.

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.explode.html <br>
*DataFrame.explode(column, ignore_index=False)*
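
A minimal sketch of explode on a hypothetical frame with one list column (`toy` and its values are made up). Note how the original row label is repeated for every element unless ignore_index=True is passed.

```python
import pandas as pd

# Hypothetical data: each row holds a list of revenues
toy = pd.DataFrame({
    "plan": ["Basic", "Premium"],
    "revenue": [[10, 10, 11], [15]],
})

exploded = toy.explode("revenue")
print(exploded.index.tolist())       # [0, 0, 0, 1]: row labels are repeated
print(exploded["revenue"].tolist())  # [10, 10, 11, 15]
print(exploded["plan"].tolist())     # ['Basic', 'Basic', 'Basic', 'Premium']

# ignore_index=True replaces the repeated labels with 0..n-1
renumbered = toy.explode("revenue", ignore_index=True)
print(renumbered.index.tolist())     # [0, 1, 2, 3]
```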

In [None]:
netflix.head()

In [None]:
# Considering a DataFrame with a column which has a list by row...
df4 = pd.DataFrame(netflix[netflix.Country=='Canada'].groupby('Subscription Type')['Monthly Revenue'].apply(list))
df4

In [None]:
# ...we can use explode() to get one value per row
 #code here

In [None]:
# Beware: the exploded values are ordered by Subscription Type, not by the original row order, so they cannot be assigned back to the original (unsorted) DataFrame directly...
netflix.loc[netflix.Country=='Canada','Monthly Revenue']=df4.explode('Monthly Revenue').values # overwrites 'Monthly Revenue' by position
netflix[netflix.Country=='Canada'].head()
# ...the first two rows now get revenue=10, which belongs to the Basic subscription, instead of the revenue=15 of Premium!
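
The pitfall can be reproduced on a hypothetical four-row frame (the `toy` data below is made up and is not the Netflix dataset): grouping orders the groups by key, so the exploded values no longer follow the original row order.

```python
import pandas as pd

# Hypothetical data in its original (unsorted) row order
toy = pd.DataFrame({
    "plan": ["Premium", "Basic", "Premium", "Basic"],
    "revenue": [15, 10, 15, 10],
})

# groupby sorts the groups by key, so 'Basic' comes before 'Premium'
lists = toy.groupby("plan")["revenue"].apply(list)
exploded = lists.explode()

print(exploded.tolist())        # [10, 10, 15, 15]: ordered by plan
print(toy["revenue"].tolist())  # [15, 10, 15, 10]: original row order

# Assigning exploded.values back by position would mislabel the rows
mismatch = exploded.tolist() != toy["revenue"].tolist()
print(mismatch)                 # True
```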

### <font color='#BFD72F'>6. Try it out </font> <a class="anchor" id="P6"></a>
  [Back to TOC](#toc)

In [None]:
# Melt the weather DataFrame to get each SeaLevelPressure by date. This should only be applied to January and February!
# Label the variable column as "SeaLevelPressure"


In [None]:
# Revert the previous melted DataFrame using .pivot()


#### That's all for today. Feel free to complement your knowledge with the online documentation.
*https://pandas.pydata.org/pandas-docs/stable/user_guide/index.html*