
Going over the sidetable library, which builds frequency tables and other quick summaries from pandas DataFrames.

In [1]:
# !pip install sidetable

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sidetable
  Downloading sidetable-0.9.0-py3-none-any.whl (17 kB)
Installing collected packages: sidetable
Successfully installed sidetable-0.9.0


In [6]:
# import the regulars

import sidetable
import pandas as pd
import numpy as np
import seaborn as sns
from random import random

In [4]:
# get the data

url = "https://github.com/chris1610/pbpython/blob/master/data/school_transform.csv?raw=True"
df = pd.read_csv(url, index_col=0)
df.head(3)

Unnamed: 0,School Name,City,State,District Name,Model Selected,Award_Amount,Region
0,HOGARTH KINGEEKUK MEMORIAL SCHOOL,SAVOONGA,AK,BERING STRAIT SCHOOL DISTRICT,Transformation,471014,West
1,AKIACHAK SCHOOL,AKIACHAK,AK,YUPIIT SCHOOL DISTRICT,Transformation,520579,West
2,GAMBELL SCHOOL,GAMBELL,AK,BERING STRAIT SCHOOL DISTRICT,Transformation,449592,West


With sidetable imported, every DataFrame gets a new accessor, *stb*, for building summary tables.

In [8]:
# for example
# using .stb.freq()

df.stb.freq(["State"])[:5]

Unnamed: 0,State,count,percent,cumulative_count,cumulative_percent
0,CA,92,12.153236,92,12.153236
1,FL,71,9.379128,163,21.532365
2,PA,58,7.661823,221,29.194188
3,OH,35,4.623514,256,33.817701
4,MO,32,4.227213,288,38.044914


Compared to plain old value_counts, which (with normalize=True) returns only the proportion for each state:

In [11]:
df["State"].value_counts(normalize=True)[:5]

CA    0.121532
FL    0.093791
PA    0.076618
OH    0.046235
MO    0.042272
Name: State, dtype: float64
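
To make the extra columns concrete, here is a minimal sketch that rebuilds the same kind of table by hand from value_counts. The `toy` frame and the `freq_table` helper are hypothetical names made up for this illustration; they are not part of sidetable, and this is not necessarily how the library computes its result.

```python
import pandas as pd

# Hypothetical toy data standing in for the real "State" column.
toy = pd.DataFrame({"State": ["CA", "CA", "CA", "FL", "FL", "PA"]})


def freq_table(frame, col):
    """Hand-rolled frequency table: count, percent, and running totals."""
    # value_counts sorts from most to least frequent by default.
    out = frame[col].value_counts().rename_axis(col).reset_index(name="count")
    out["percent"] = out["count"] / out["count"].sum() * 100
    out["cumulative_count"] = out["count"].cumsum()
    out["cumulative_percent"] = out["percent"].cumsum()
    return out


toy_freq = freq_table(toy, "State")
print(toy_freq)
```

The point of the accessor is that all of this collapses into a single call.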

Use the *thresh* argument to set a cumulative-percentage cutoff; everything past the cutoff is lumped into a single row.

In [15]:
df.stb.freq(["State"], thresh=30)

# Returns a frequency table with the states that make up ~ 30% of the total, with the remaining states grouped as "others"

Unnamed: 0,State,count,percent,cumulative_count,cumulative_percent
0,CA,92,12.153236,92,12.153236
1,FL,71,9.379128,163,21.532365
2,PA,58,7.661823,221,29.194188
3,others,536,70.805812,757,100.0


In [17]:
# pass the other_label argument to rename the "others" row

df.stb.freq(["State"], thresh=30, other_label="The rest of em")

Unnamed: 0,State,count,percent,cumulative_count,cumulative_percent
0,CA,92,12.153236,92,12.153236
1,FL,71,9.379128,163,21.532365
2,PA,58,7.661823,221,29.194188
3,The rest of em,536,70.805812,757,100.0
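
A rough sketch of the thresholding idea in plain pandas follows. It assumes the rule is "keep a row while its cumulative percent is at or below the threshold", which matches the output above (OH, at about 33.8%, was folded into the remainder with thresh=30) but is an inference, not something confirmed from the library source. The `freq` frame and `collapse_tail` helper are hypothetical.

```python
import pandas as pd

# Hypothetical frequency table, already sorted by count (largest first).
freq = pd.DataFrame({"State": ["CA", "FL", "PA", "OH"], "count": [5, 3, 1, 1]})
freq["cumulative_percent"] = freq["count"].cumsum() / freq["count"].sum() * 100


def collapse_tail(table, col, thresh, other_label="others"):
    """Keep rows up to the threshold and sum the rest into one labelled row."""
    # Assumed cutoff rule: a row survives while cumulative percent <= thresh.
    keep = table.loc[table["cumulative_percent"] <= thresh, [col, "count"]]
    tail = table.loc[table["cumulative_percent"] > thresh]
    if tail.empty:
        return keep.reset_index(drop=True)
    other = pd.DataFrame({col: [other_label], "count": [tail["count"].sum()]})
    return pd.concat([keep, other], ignore_index=True)


collapsed = collapse_tail(freq, "State", thresh=85, other_label="The rest of em")
print(collapsed)
```

Here CA and FL (80% cumulative) stay, while PA and OH are summed into the relabelled row.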


You can also pass several columns to see how they combine. For instance, how do the various transformation models ("Model Selected") break down across regions ("Region")?

In [19]:
df.stb.freq(["Region", "Model Selected"])[:4]

Unnamed: 0,Region,Model Selected,count,percent,cumulative_count,cumulative_percent
0,South,Transformation,185,24.76573,185,24.76573
1,West,Transformation,142,19.009371,327,43.7751
2,Midwest,Transformation,111,14.859438,438,58.634538
3,Northeast,Transformation,102,13.654618,540,72.289157


In [21]:
df.stb.freq(["State", "Model Selected"])[:4]

Unnamed: 0,State,Model Selected,count,percent,cumulative_count,cumulative_percent
0,CA,Transformation,56,7.397622,56,7.397622
1,FL,Transformation,54,7.133421,110,14.531044
2,PA,Transformation,43,5.680317,153,20.211361
3,CA,Turnaround,29,3.830911,182,24.042272
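
For intuition, a multi-column frequency table is essentially a count of each distinct combination. Below is a small sketch using groupby on invented rows; the `pairs` frame is hypothetical and the column names simply mirror the dataset.

```python
import pandas as pd

# Hypothetical rows; only the column names come from the real data.
pairs = pd.DataFrame({
    "Region": ["South", "South", "West", "South", "West"],
    "Model Selected": [
        "Transformation", "Transformation", "Transformation",
        "Turnaround", "Turnaround",
    ],
})

combo = (
    pairs.groupby(["Region", "Model Selected"])
    .size()                          # rows per (Region, Model Selected) pair
    .reset_index(name="count")
    .sort_values("count", ascending=False, ignore_index=True)
)
combo["percent"] = combo["count"] / combo["count"].sum() * 100
print(combo)
```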


#### Passing a *value* argument so a column is *summed* instead of counting occurrences

In [22]:
df.head(1)

Unnamed: 0,School Name,City,State,District Name,Model Selected,Award_Amount,Region
0,HOGARTH KINGEEKUK MEMORIAL SCHOOL,SAVOONGA,AK,BERING STRAIT SCHOOL DISTRICT,Transformation,471014,West


In [23]:
df.stb.freq(["Region"], value="Award_Amount")

Unnamed: 0,Region,Award_Amount,percent,cumulative_Award_Amount,cumulative_percent
0,South,117467481,37.314735,117467481,37.314735
1,West,74418552,23.639807,191886033,60.954542
2,Midwest,65736175,20.881762,257622208,81.836304
3,Northeast,57179654,18.163696,314801862,100.0
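
The same shape of table can be sketched with a groupby and a sum, which shows what the *value* argument changes: totals of a numeric column replace row counts. The `awards` frame below uses invented amounts; only the column names are taken from the dataset.

```python
import pandas as pd

# Hypothetical awards; the numbers are invented for illustration.
awards = pd.DataFrame({
    "Region": ["South", "West", "South", "West", "Midwest"],
    "Award_Amount": [400, 250, 200, 100, 50],
})

totals = (
    awards.groupby("Region")["Award_Amount"]
    .sum()                              # sum dollars instead of counting rows
    .sort_values(ascending=False)
    .reset_index()
)
totals["percent"] = totals["Award_Amount"] / totals["Award_Amount"].sum() * 100
totals["cumulative_Award_Amount"] = totals["Award_Amount"].cumsum()
totals["cumulative_percent"] = totals["percent"].cumsum()
print(totals)
```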


Look at the models selected within each region and find the 80/20 breakdown of the allocated dollars, i.e. which combinations account for up to 80% of the money.

In [24]:
df.stb.freq(["Region", "Model Selected"],
            value="Award_Amount", thresh=80,
            other_label="Those Which Remain")

Unnamed: 0,Region,Model Selected,Award_Amount,percent,cumulative_Award_Amount,cumulative_percent
0,South,Transformation,88680032,28.17011,88680032,28.17011
1,West,Transformation,56207890,17.855006,144887922,46.025116
2,Midwest,Transformation,48702505,15.470844,193590427,61.49596
3,Northeast,Transformation,41263161,13.107661,234853588,74.603621
4,Those Which Remain,Those Which Remain,79948274,25.396379,314801862,100.0


Adding pandas styling to the sidetable output with *style=True*, which formats the percentage columns:

In [25]:
df.stb.freq(["Region"], value="Award_Amount", style=True)

Unnamed: 0,Region,Award_Amount,percent,cumulative_Award_Amount,cumulative_percent
0,South,117467481,37.31%,117467481,37.31%
1,West,74418552,23.64%,191886033,60.95%
2,Midwest,65736175,20.88%,257622208,81.84%
3,Northeast,57179654,18.16%,314801862,100.00%


In [26]:
# summarize missing values per column

df.stb.missing()

Unnamed: 0,missing,total,percent
Region,10,757,1.321004
School Name,0,757,0.0
City,0,757,0.0
State,0,757,0.0
District Name,0,757,0.0
Model Selected,0,757,0.0
Award_Amount,0,757,0.0
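
A comparable missing-value summary can be sketched with isna in plain pandas. The `gaps` frame is hypothetical, and sorting by the missing count is an assumption based on the ordering in the output above.

```python
import pandas as pd

# Hypothetical frame with a couple of gaps in "Region".
gaps = pd.DataFrame({
    "Region": ["West", None, "South", None],
    "State": ["AK", "CA", "FL", "PA"],
})

missing = pd.DataFrame({
    "missing": gaps.isna().sum(),   # null cells per column
    "total": len(gaps),             # row count, repeated for every column
})
missing["percent"] = missing["missing"] / missing["total"] * 100
# Assumed ordering: columns with the most missing values first.
missing = missing.sort_values("missing", ascending=False)
print(missing)
```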
