<img src="https://pandas.pydata.org/_static/pandas_logo.png"/>


# Advanced Pandas


Pandas has strong data-wrangling and calculation capabilities, including:

- Statistical Functions
- Window Functions
- Aggregations
- Missing Data
- GroupBy
- Merging/Joining
- Concatenations
- Date Functionality
- Timedelta
- Categorical Data
- Visualization
- IO Tools
- Sparse Data

## Grouping

Like SQL's GROUP BY, pandas can split rows into groups and apply aggregation functions to each group.


In [9]:
import pandas as pd

df = pd.read_csv("./nba.csv")

# groupby() is lazy: it returns a GroupBy object and computes nothing yet
grouping_obj = df.groupby('Team')

# now apply an aggregation function like this;
# numeric_only=True skips any non-numeric columns
grouping_obj.mean(numeric_only=True)

Team,Number,Age,Weight,Salary
Atlanta Hawks,19.0,28.2,221.266667,4860197.0
Boston Celtics,31.866667,24.733333,219.466667,4181505.0
Brooklyn Nets,18.266667,25.6,215.6,3501898.0
Charlotte Hornets,17.133333,26.133333,220.4,5222728.0
Chicago Bulls,19.2,27.4,218.933333,5785559.0
Cleveland Cavaliers,14.466667,29.533333,227.866667,7642049.0
Dallas Mavericks,20.0,29.733333,227.0,4746582.0
Denver Nuggets,15.266667,25.733333,217.533333,4294424.0
Detroit Pistons,17.266667,26.2,222.2,4477884.0
Golden State Warriors,20.866667,27.666667,224.6,5924600.0
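
The split-then-aggregate pattern above can be tried without the CSV file. The following is a minimal sketch: the `players` frame and all of its values are made up for illustration and only stand in for `nba.csv`.

```python
import pandas as pd

# Hypothetical toy data standing in for nba.csv (all values are invented).
players = pd.DataFrame({
    "Team": ["A", "A", "B", "B", "B"],
    "Name": ["p1", "p2", "p3", "p4", "p5"],
    "Age": [25, 31, 22, 28, 34],
    "Salary": [1_000_000, 3_000_000, 2_000_000, 4_000_000, 6_000_000],
})

# Split the rows by team, then reduce every numeric column to its mean.
# numeric_only=True leaves out the text column "Name".
team_means = players.groupby("Team").mean(numeric_only=True)
print(team_means)
```

The result is indexed by the grouping key, so a single value can be looked up with `team_means.loc["A", "Age"]`.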


### Grouping using multiple functions and columns

In [18]:
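# dict keys must be unique, so each column appears once;
# a repeated 'Age' key would silently keep only its last value
grouping_obj.agg({'Salary': ['sum', 'max'],
                  'Age': lambda x: x.max() - x.min(),
                  'Weight': 'std'}).head()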

Team,Salary (sum),Salary (max),Age (<lambda>),Weight (std)
Atlanta Hawks,72902950.0,18671659.0,13.0,25.982045
Boston Celtics,58541068.0,12000000.0,9.0,25.606547
Brooklyn Nets,52528475.0,19689000.0,11.0,24.37739
Charlotte Hornets,78340920.0,13500000.0,10.0,29.908909
Chicago Bulls,86783378.0,20093064.0,14.0,29.336634
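
A Python dict keeps only the last value for a repeated key, so two different aggregations of the same column have to be passed together as a list. A minimal sketch of that, using made-up data (the `players` frame and the `age_range` helper are hypothetical names, not part of the notebook):

```python
import pandas as pd

# Hypothetical toy data standing in for nba.csv (all values are invented).
players = pd.DataFrame({
    "Team": ["A", "A", "B", "B", "B"],
    "Age": [25, 31, 22, 28, 34],
    "Salary": [1_000_000, 3_000_000, 2_000_000, 4_000_000, 6_000_000],
})


def age_range(ages):
    """Spread between the oldest and youngest player in one group."""
    return ages.max() - ages.min()


# One dict entry per column; several functions for a column go in a list.
summary = players.groupby("Team").agg({
    "Salary": ["sum", "max"],
    "Age": ["mean", age_range],
})
print(summary)
```

The resulting columns form a two-level index, so a single cell is addressed with a tuple such as `summary.loc["A", ("Age", "age_range")]`.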


### Pro tip: Naming custom agg functions

In [17]:
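def max_min(x):
    return x.max() - x.min()

# pandas labels the output column with the function's __name__
max_min.__name__ = 'Max minus Min'

# dict keys must be unique, so 'Age' appears only once
grouping_obj.agg({'Salary': ['sum', 'max'],
                  'Age': max_min,
                  'Weight': 'std'}).head()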

Team,Salary (sum),Salary (max),Age (Max minus Min),Weight (std)
Atlanta Hawks,72902950.0,18671659.0,13.0,25.982045
Boston Celtics,58541068.0,12000000.0,9.0,25.606547
Brooklyn Nets,52528475.0,19689000.0,11.0,24.37739
Charlotte Hornets,78340920.0,13500000.0,10.0,29.908909
Chicago Bulls,86783378.0,20093064.0,14.0,29.336634
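
Another way to control output names, swapped in here as an alternative to setting `__name__`, is pandas' named aggregation: each keyword argument to `agg` becomes a column name and maps to a `(column, function)` pair. A minimal sketch on made-up data (the `players` frame and the output column names are hypothetical):

```python
import pandas as pd

# Hypothetical toy data standing in for nba.csv (all values are invented).
players = pd.DataFrame({
    "Team": ["A", "A", "B", "B", "B"],
    "Age": [25, 31, 22, 28, 34],
    "Salary": [1_000_000, 3_000_000, 2_000_000, 4_000_000, 6_000_000],
})

# keyword = (source column, aggregation); the keyword is the output column name.
named = players.groupby("Team").agg(
    salary_total=("Salary", "sum"),
    salary_max=("Salary", "max"),
    age_range=("Age", lambda ages: ages.max() - ages.min()),
)
print(named)
```

This form yields flat column names instead of a two-level header, which can be easier to work with downstream.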


### Filtering results
The pandas group filter works similarly to SQL's HAVING clause: it keeps or drops whole groups based on a condition computed per group.

In [33]:
# keep only the teams whose average salary is at least the overall average salary
overall_avg_salary = df['Salary'].mean()
above_avg_group_salary = grouping_obj.filter(lambda g: g['Salary'].mean() >= overall_avg_salary)
above_avg_group_salary.groupby('Team').agg({'Salary': 'sum'})


Team,Salary
Atlanta Hawks,72902950.0
Charlotte Hornets,78340920.0
Chicago Bulls,86783378.0
Cleveland Cavaliers,106988689.0
Golden State Warriors,88868997.0
Houston Rockets,75283021.0
Los Angeles Clippers,94854640.0
Memphis Grizzlies,76550880.0
Miami Heat,82515673.0
Oklahoma City Thunder,93765298.0


### Exercise time 💪
