---
# Python Pandas

Data Manipulation, pivoting, groupby

---
Code examples on the most frequently used functions - Collected, Created and Edited by __Pawel Rosikiewicz__ www.SimpleAI.ch

## CONTENT

In [2]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

---
## STACK / UNSTACK 

long <-> wide df, affects all data in df

---

### __Main Function__
* __stack()__: wide to tall; innermost col_label --> innermost row_index
* __unstack()__: tall to wide; innermost row_index --> innermost col_label
* useful when stacking/unstacking on a multi-index:
    * __df.dropna(axis=0/1)__: axis=0 removes rows with NaN, axis=1 removes columns; how="all" drops only those that are entirely NaN
    * __df.swaplevel(axis=0/1)__: swaps two levels of the row (axis=0) or column (axis=1) index
    
    
### __GOLDEN RULE__
* keep your data as stacked as possible!
* stacking allows faster access to the data in the df
* WHY?: The main benefit of a columnar database is faster performance compared to a row-oriented one, because it reads less memory to produce its output. Since it stores data by columns instead of rows, values of the same type sit together and more data fits into a smaller amount of memory.
* Pandas DF: a 2D data structure that stores data of different types (including characters, integers, floating point values, categorical data and more) in columns.
        
### __MORE NOTES__
* stack()     
    - wide to tall
    - innermost column label --> innermost row index
    - generates NaN where no data exists

* unstack()   
    - tall to wide
    - innermost row label --> innermost column index
    - generates NaN where no data exists
        
* Caution - Stack/Unstack on multiple levels
    - the level chosen for stack or unstack is always moved to the last (innermost) level of the other index
    - thus, you may have problems unstacking back to the original layout
    - example: if you previously stacked your df, you may need one of the following to restore it:
        -      df.dropna(axis=1)
        -      df.swaplevel(axis=1)
        -      df.swaplevel(axis=1).dropna(axis=1)
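
The caution above can be illustrated with a small sketch. The frame, the level names `outer`/`inner` and the variable names are made up for this illustration; only `unstack`, `swaplevel`, `sort_index`, `drop` and `dropna` are taken from pandas itself.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two row levels ("outer", "inner") and two plain columns
idx = pd.MultiIndex.from_product([["A", "B"], ["a", "b"]], names=["outer", "inner"])
wide = pd.DataFrame(np.arange(8).reshape(4, 2), index=idx, columns=["x", "y"])

# Unstacking the OUTER row level still places it at the innermost column level
by_outer = wide.unstack(level="outer")
print(by_outer.columns.tolist())   # pairs of (original column, outer label)

# swaplevel(axis=1) moves that level to the front; sort_index regroups the columns
swapped = by_outer.swaplevel(axis=1).sort_index(axis=1)

# A missing row combination produces NaN after unstacking ...
sparse = wide.drop(index=("B", "b"))
unstacked = sparse.unstack()        # innermost row level -> columns

# ... and dropna(axis=1) removes every column that contains a NaN
cleaned = unstacked.dropna(axis=1)
```

Whether `dropna(axis=1)` or `dropna(axis=1, how="all")` is the better choice depends on whether partially filled columns should survive.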

In [3]:
#### stack/unstack simple example  ------------------------------
arr = np.arange(4).reshape(2,2)
df  = pd.DataFrame(arr, index=["A", "B"], columns=["col1", "col2"])
df

Unnamed: 0,col1,col2
A,0,1
B,2,3


In [4]:
# wide to tall
# note: this creates a Series with a 2-level row index
df = df.stack(); df

A  col1    0
   col2    1
B  col1    2
   col2    3
dtype: int64

In [5]:
# tall to wide
df = df.unstack(); df

Unnamed: 0,col1,col2
A,0,1
B,2,3


In [19]:
#### stack/unstack with multiindex - i.e. the multiindex is already there
''' Caution!
    - may generate NaN for missing data
    - because it must create all combinations of the innermost levels
    
    comments:
    - row/col_levels - all possible keys for each level; not all of them are necessarily used
    - row/col_levels_used - integer codes that assign those keys to the rows/columns
    
'''
# create the example
arr         = np.arange(9).reshape(3,3)
col_levels  = [['1', '2', '3', '4'],['one','two','three','four']] # all possible values for each level
col_levels_used = [[0,0,1],[0,1,3]] # which to use
row_levels  = [['A', 'B', 'C', 'D'],['a','b','c','d']] # all possible values for each level
row_levels_used = [[0,0,1],[0,3,0]] # which to use
first_df    = pd.DataFrame(arr, 
                           index  =pd.MultiIndex(levels=row_levels, codes=row_levels_used), 
                           columns=pd.MultiIndex(levels=col_levels, codes=col_levels_used))
first_df

Unnamed: 0_level_0,Unnamed: 1_level_0,1,1,2
Unnamed: 0_level_1,Unnamed: 1_level_1,one,two,four
A,a,0,1,2
A,d,3,4,5
B,a,6,7,8


In [15]:
# (a) stack(), then unstack() .............................
df  = first_df.stack(); df

Unnamed: 0,Unnamed: 1,Unnamed: 2,1,2
A,a,one,0.0,
A,a,two,1.0,
A,a,four,,2.0
A,d,one,3.0,
A,d,two,4.0,
A,d,four,,5.0
B,a,one,6.0,
B,a,two,7.0,
B,a,four,,8.0


In [16]:
df = df.unstack(); df

Unnamed: 0_level_0,Unnamed: 1_level_0,1,1,1,2,2,2
Unnamed: 0_level_1,Unnamed: 1_level_1,one,two,four,one,two,four
A,a,0.0,1.0,,,,2.0
A,d,3.0,4.0,,,,5.0
B,a,6.0,7.0,,,,8.0


In [17]:
# (b) unstack(), then unstack() again -----------------------
df  = first_df.unstack(); df

Unnamed: 0_level_0,1,1,1,1,2,2
Unnamed: 0_level_1,one,one,two,two,four,four
Unnamed: 0_level_2,a,d,a,d,a,d
A,0.0,3.0,1.0,4.0,2.0,5.0
B,6.0,,7.0,,8.0,


In [18]:
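# note: this calls unstack() a second time (not stack()); with only one row level left, the result is a Series
df = df.unstack(); df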

1  one   a  A    0.0
            B    6.0
         d  A    3.0
            B    NaN
   two   a  A    1.0
            B    7.0
         d  A    4.0
            B    NaN
2  four  a  A    2.0
            B    8.0
         d  A    5.0
            B    NaN
dtype: float64


In [None]:
        2. RESHAPING DATA IN ONE DF

          - STACK / UNSTACK (long <-> wide df, affects all data in df)
                - stack(): innermost column label --> innermost row index, wide to tall
                - unstack(): innermost row label --> innermost column index, tall to wide
                - useful when stacking/unstacking on a multi-index:
                    - df.dropna(axis=0/1): axis=0 removes rows with NaN; how="all" drops only all-NaN rows
                    - df.swaplevel(axis=0/1)

          - PIVOTING (creates a derivative df, populated with selected data)
                - pd.pivot()       : simple form, cannot handle duplicate index/column pairs
                - pd.pivot_table() : can calculate a summary value for duplicated entries
                    - aggfunc=    : e.g. np.mean or lambda x: sum(x); accepts a list of several functions
                    - fill_value= : value used to replace missing entries
                    - dropna=     : by default, drops columns whose entries are all NaN
                    - hierarchical index/columns: pass a list of several df columns to index= or columns=
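
A minimal sketch of the difference between the two pivoting functions. The column names `row`, `col`, `val` and the data are hypothetical and chosen so that one index/column pair is duplicated.

```python
import pandas as pd

# Hypothetical long-format data; the pair ("B", "x") appears twice on purpose
long_df = pd.DataFrame({
    "row": ["A", "A", "B", "B", "B"],
    "col": ["x", "y", "x", "x", "y"],
    "val": [1, 2, 3, 5, 4],
})

# pivot() needs unique (index, columns) pairs, so duplicates raise a ValueError
try:
    long_df.pivot(index="row", columns="col", values="val")
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table() aggregates the duplicates instead (here: their mean)
table = long_df.pivot_table(index="row", columns="col", values="val",
                            aggfunc="mean", fill_value=0)

# A list of functions yields hierarchical columns: (function name, col)
multi = long_df.pivot_table(index="row", columns="col", values="val",
                            aggfunc=["mean", "max"])
```

Here `table.loc["B", "x"]` holds the mean of the two duplicated values, while `multi` keeps one block of columns per aggregation function.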
 

        3. DATA MODIFICATIONS
    
            - REPLACING VALUES
                - np.where()    (takes a True/False array and picks a new value for each True/False entry)
                - df.replace()  (old value, new value; no KeyError when the old value is not found!)
        
            - MAP, APPLY, APPLYMAP
                - map           (Series, elementwise) 
                - apply         (Series, elementwise; or df, along an axis, e.g. to create a new row or column) 
                - applymap      (df, elementwise; renamed DataFrame.map in pandas 2.1)
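
The functions listed above could be used as in the following sketch. The frame and its column names are invented for the illustration, and `applymap` is left out because its name depends on the installed pandas version.

```python
import numpy as np
import pandas as pd

# Hypothetical example frame
df = pd.DataFrame({"size": ["S", "M", "L"], "price": [10, 20, 30]})

# np.where(condition, value_if_true, value_if_false)
df["tier"] = np.where(df["price"] > 15, "high", "low")

# replace(): keys that are not found ("XL") are ignored, no KeyError
df["size_full"] = df["size"].replace({"S": "small", "XL": "extra large"})

# Series.map: elementwise, with a function (a dict also works)
df["price_x2"] = df["price"].map(lambda p: p * 2)

# DataFrame.apply with axis=1: the function sees one row at a time,
# and the result becomes a new column
df["label"] = df.apply(lambda row: f"{row['size']}-{row['price']}", axis=1)

# DataFrame.apply with axis=0: the function sees one column at a time
col_max = df[["price", "price_x2"]].apply(lambda col: col.max(), axis=0)
```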
                

        4. DATA GROUPING AND MODIFICATION
            
            - SPLIT-APPLY-COMBINE
                - groupby()
                    - GROUPBY WITH A SINGLE VARIABLE
                    - GROUPBY WITH MULTIINDEX
                    - AGGREGATION: generates summary statistics for each group
                    - TRANSFORMATION: returns a df with transformed values, of the same size as the original, computed within each group separately
                    - FILTERING: removes unwanted data/groups
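
The three split-apply-combine operations can be sketched as follows, assuming a hypothetical frame with one grouping column `group` and one numeric column `value`.

```python
import pandas as pd

# Hypothetical data with two groups
df = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b"],
    "value": [1, 3, 10, 20, 30],
})

grouped = df.groupby("group")["value"]

# Aggregation: one summary row per group
summary = grouped.agg(["mean", "max"])

# Transformation: result has the same length as the original column,
# but each value is computed within its own group
df["centered"] = df["value"] - grouped.transform("mean")

# Filtering: keep only the groups that satisfy a condition
big = df.groupby("group").filter(lambda g: g["value"].sum() > 10)
```

In this sketch `summary` has one row per group, `df["centered"]` has five entries like the input, and `big` keeps only the rows of group `b`.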


