# Split-Apply-Combine More

### Objectives
After this lesson you should be able to...
+ Create custom aggregation functions to pass to groupby objects
+ Know how to use the primary groupby methods **`agg`**, **`filter`**, **`transform`** and **`apply`**
+ Know the differences between **`agg`**, **`filter`**, **`transform`** and **`apply`**

### Prepare for this lesson by
+ Reading the rest of the [split apply combine](http://pandas.pydata.org/pandas-docs/stable/groupby.html#transformation) documentation from Transformation until the end

### Custom aggregate functions
Pandas groupby objects come with a few dozen aggregate functions that can be applied to the groups. It is also possible to define your own customized aggregate function. These customized functions must return a single value.

Let's suppose you would like to know the difference between the max and min value of a column. Pandas does not have an aggregate function built to do this. You would have to define one yourself. 

Each customized aggregate function is defined as normal with the **`def`** keyword. Each function will accept one argument and that is the aggregating column. It will be passed as a **`Series`**. This means that all Series methods will work on the passed argument.

The **`min_max`** function below takes one argument, **`s`** which is a Series object. It returns the difference between the max and min values of that Series.

In [1]:
import pandas as pd
import numpy as np

pd.options.display.max_columns = 40

In [2]:
# define custom function
# s is a Series

def min_max(s):
    return s.max() - s.min()

In [4]:
# read in college DF
college = pd.read_csv('data/college.csv')

In [5]:
# use new min_max function
college.groupby(['STABBR', 'RELAFFIL'])['UGDS'].agg(['size', 'min', 'max', min_max]).head(12)

Unnamed: 0_level_0,Unnamed: 1_level_0,size,min,max,min_max
STABBR,RELAFFIL,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
AK,0,7,109.0,12865.0,12756.0
AK,1,3,27.0,275.0,248.0
AL,0,72,12.0,29851.0,29839.0
AL,1,24,13.0,3033.0,3020.0
AR,0,68,18.0,21405.0,21387.0
AR,1,18,20.0,4485.0,4465.0
AS,0,1,1276.0,1276.0,0.0
AZ,0,124,1.0,151558.0,151557.0
AZ,1,9,25.0,4102.0,4077.0
CA,0,609,0.0,44744.0,44744.0
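
If you do not have the college dataset at hand, the same idea can be tried on a tiny DataFrame. The sketch below uses made-up data (the `toy` frame and its `state` and `students` columns are assumptions for illustration only) and mixes built-in aggregation names with the custom function.

```python
import pandas as pd

# hypothetical toy data standing in for the college dataset
toy = pd.DataFrame({
    'state': ['TX', 'TX', 'TX', 'CA', 'CA'],
    'students': [100, 250, 400, 50, 75],
})

def min_max(s):
    # s is the Series of 'students' values for one group
    return s.max() - s.min()

# strings name built-in aggregations; min_max is passed as a function object
result = toy.groupby('state')['students'].agg(['size', 'min', 'max', min_max])
print(result)
```

The custom column takes its name from the function, so it shows up as `min_max` next to the built-in aggregations.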


### Filtering out groups

**`groupby`** objects come with a **`filter`** method that...
1. Scans each group independently
2. Applies a function to each group and returns a boolean value
3. Keeps or drops each group based on the boolean value returned from the function
4. The end result is the original DataFrame (same number of columns) with certain groups filtered out

The **`filter`** method accepts a function that returns either True or False for each group. This result is used to filter the original DataFrame.

The function passed to **`filter`** is a **custom** function that implicitly receives each group as a DataFrame.

Anything can happen inside the body of the function passed to **`filter`** but it must return **`True`** or **`False`**. This boolean value determines whether the group is included or is dropped from the final resulting DataFrame.
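
As a minimal sketch of these rules, here is **`filter`** on a small made-up DataFrame (the `toy` data and the `big_enough` function are hypothetical and not part of the lesson's datasets). Whole groups are kept or dropped, and the surviving rows are unchanged.

```python
import pandas as pd

# hypothetical toy data
toy = pd.DataFrame({
    'state': ['TX', 'TX', 'CA', 'CA', 'AK'],
    'students': [100, 250, 50, 75, 10],
})

def big_enough(df):
    # df holds all rows of one state; return a single boolean
    return df['students'].sum() > 100

# AK sums to 10, so its row is dropped; TX and CA rows are kept as they are
kept = toy.groupby('state').filter(big_enough)
print(kept)
```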

### Find states with more than 300,000 undergraduate students
To help provide some context, we will use the filter method to find states that have more than 300,000 undergraduate students. The **`filter`** method is passed a function that sums all the undergraduate students for each state. It compares this sum against the number 300,000 and returns **`True`** or **`False`**. Only states that have more than 300,000 students will remain.

None of the values in the DataFrame are mutated. Only rows are dropped with **`filter`**.

In [6]:
# Define function that accepts a dataframe of the current group
# it must return a boolean

def filter_ugds(df):
    return df['UGDS'].sum() > 300000

In [7]:
# use filter
college_filtered = college.groupby('STABBR').filter(filter_ugds)

In [8]:
# see the difference in size
print(college.shape)
print(college_filtered.shape)

(7535, 27)
(4619, 27)


### Pass additional parameters to the filtering function
At first glance the **`filter`** method from above appears to only allow the passed function to contain a single argument. This is not the case. You can pass any number of arguments to the function inside of **`filter`**.

Let's take a look at the **`filter`** docstrings for a moment. Since **`filter`** is a chained method the docstring intelligence tricks (shift + tab + tab, etc...) are not available to us. We can use the help function to output the docstrings into the notebook.

In [42]:
help(college.groupby('STABBR').filter)

Help on method filter in module pandas.core.groupby:

filter(func, dropna=True, *args, **kwargs) method of pandas.core.groupby.DataFrameGroupBy instance
    Return a copy of a DataFrame excluding elements from groups that
    do not satisfy the boolean criterion specified by func.
    
    Parameters
    ----------
    f : function
        Function to apply to each subframe. Should return True or False.
    dropna : Drop groups that do not pass the filter. True by default;
        if False, groups that evaluate False are filled with NaNs.
    
    Notes
    -----
    Each subframe is endowed the attribute 'name' in case you need to know
    which group you are working on.
    
    Examples
    --------
    >>> grouped = df.groupby(lambda x: mapping[x])
    >>> grouped.filter(lambda x: x['A'].sum() + x['B'].sum() > 0)

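The docstring above mentions the **`dropna`** parameter. The following sketch, on made-up data (the `toy` frame is an assumption for illustration), shows the difference between the default and `dropna=False` as described there.

```python
import pandas as pd

# hypothetical toy data: group 'a' sums to 3, group 'b' sums to 30
toy = pd.DataFrame({'g': ['a', 'a', 'b', 'b'], 'v': [1, 2, 10, 20]})

# default (dropna=True): rows of groups that fail the test are removed
dropped = toy.groupby('g').filter(lambda df: df['v'].sum() > 5)

# dropna=False: rows of failing groups stay in place but are filled with NaN
masked = toy.groupby('g').filter(lambda df: df['v'].sum() > 5, dropna=False)

print(dropped)
print(masked)
```

With `dropna=False` the result keeps the shape of the original DataFrame, which can be handy when the row positions need to line up with other data.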


### \*args and \*\*kwargs
Notice in the method header the two additional parameters, \*args and \*\*kwargs, that are not formally mentioned in the parameter list. These two additional parameters will not be explained here but there is a great [stackoverflow post](http://stackoverflow.com/questions/3394835/args-and-kwargs) where you can learn all about them.
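
For a quick feel of the mechanics without leaving the notebook, here is a plain-Python sketch. The functions `report` and `call_with_extras` are hypothetical names invented for this illustration; they only mimic how a method can forward extra arguments to a function it was given.

```python
def report(first, *args, **kwargs):
    # args collects extra positional arguments as a tuple
    # kwargs collects extra keyword arguments as a dict
    return first, args, kwargs

def call_with_extras(func, *args, **kwargs):
    # forward everything unchanged, after a first argument of our own
    return func('group', *args, **kwargs)

result = call_with_extras(report, 1, 2, num_students=500000)
print(result)
```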

### What it means for `filter`
You can pass additional arguments to the function that **`filter`** accepts. This becomes useful when we want to customize the filter based on some parameter.

Let's rebuild the filtering function to accept a **`num_students`** argument that allows more flexibility when determining which states to filter out.

In [46]:
# define new function with additional arguments
def filter_ugds_param(df, num_students):
    return df['UGDS'].sum() > num_students

In [28]:
college_filtered2 = college.groupby('STABBR').filter(filter_ugds_param, num_students=500000)

In [50]:
print(college_filtered2.shape)
print(college_filtered.shape)
print(college.shape)

(3319, 27)
(4619, 27)
(7535, 27)
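
An alternative to forwarding a keyword argument is to capture the threshold in a lambda. The sketch below uses made-up data (the `toy` frame and the `filter_param` function are assumptions for illustration) and checks that both styles give the same rows.

```python
import pandas as pd

# hypothetical toy data
toy = pd.DataFrame({
    'state': ['TX', 'TX', 'CA', 'CA', 'AK'],
    'students': [100, 250, 50, 75, 10],
})

def filter_param(df, num_students):
    return df['students'].sum() > num_students

# keyword argument forwarded by filter to filter_param
by_keyword = toy.groupby('state').filter(filter_param, num_students=200)

# same threshold captured inside a lambda instead
by_lambda = toy.groupby('state').filter(lambda df: filter_param(df, 200))

print(by_keyword.equals(by_lambda))
```

Forwarding the keyword keeps the filtering function reusable by name, while the lambda keeps everything on one line. Which one reads better is a matter of taste.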


### Groupby with Series

Pandas Series also have a **`groupby`** method. You might be wondering how it's possible to group and aggregate a single column of data. It becomes possible when you consider the **index** and the many levels that an index can have. Multi-level indexes will be discussed in another notebook.

To make the index meaningful set it to be one of the columns of the DataFrame with the **`set_index`** method.

In [51]:
# create a new dataframe with a more interesting index
college_state_index = college.set_index('STABBR')

college_state_index.head()

Unnamed: 0_level_0,INSTNM,CITY,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
STABBR,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1
AL,Alabama A & M University,Normal,1.0,0.0,0.0,0,424.0,420.0,0.0,4206.0,0.0333,0.9353,0.0055,0.0019,0.0024,0.0019,0.0,0.0059,0.0138,0.0656,1,0.7356,0.8284,0.1049,30300,33888.0
AL,University of Alabama at Birmingham,Birmingham,0.0,0.0,0.0,0,570.0,565.0,0.0,11383.0,0.5922,0.26,0.0283,0.0518,0.0022,0.0007,0.0368,0.0179,0.01,0.2607,1,0.346,0.5214,0.2422,39700,21941.5
AL,Amridge University,Montgomery,0.0,0.0,0.0,1,,,1.0,291.0,0.299,0.4192,0.0069,0.0034,0.0,0.0,0.0,0.0,0.2715,0.4536,1,0.6801,0.7795,0.854,40100,23370.0
AL,University of Alabama in Huntsville,Huntsville,0.0,0.0,0.0,0,595.0,590.0,0.0,5451.0,0.6988,0.1255,0.0382,0.0376,0.0143,0.0002,0.0172,0.0332,0.035,0.2146,1,0.3072,0.4596,0.264,45500,24097.0
AL,Alabama State University,Montgomery,1.0,0.0,0.0,0,425.0,430.0,0.0,4811.0,0.0158,0.9208,0.0121,0.0019,0.001,0.0006,0.0098,0.0243,0.0137,0.0892,1,0.7347,0.7554,0.127,26600,33118.5


### Grouping by the Index

The **`groupby`** method accepts a **`level`** argument: the name (or integer location) of the index level you would like to form groups with.

In [52]:
# create a Series by selecting one column
s = college_state_index['UGDS']

s.head()

STABBR
AL     4206.0
AL    11383.0
AL      291.0
AL     5451.0
AL     4811.0
Name: UGDS, dtype: float64

In [53]:
# the name can be retrieved with the name attribute of the index
s.index.name

'STABBR'

In [54]:
# groupby as normal but with the level argument
# get the mean of the undergrad student population

s.groupby(level='STABBR').mean().head(10)

STABBR
AK    2493.200000
AL    2789.865169
AR    1644.146341
AS    1276.000000
AZ    4130.468254
CA    3518.308397
CO    2324.880342
CT    1873.550562
DC    2645.277778
DE    2491.052632
Name: UGDS, dtype: float64

In [55]:
# should give the same result as dataframe
college.groupby('STABBR')['UGDS'].mean().head(10)

STABBR
AK    2493.200000
AL    2789.865169
AR    1644.146341
AS    1276.000000
AZ    4130.468254
CA    3518.308397
CO    2324.880342
CT    1873.550562
DC    2645.277778
DE    2491.052632
Name: UGDS, dtype: float64
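
The level can be given by name or by integer location. A minimal sketch on a made-up Series (the values and the `state` index name are assumptions for illustration) shows that both forms produce the same groups.

```python
import pandas as pd

# hypothetical Series with a named index
s = pd.Series([10.0, 20.0, 30.0, 50.0],
              index=pd.Index(['AL', 'AL', 'AK', 'AK'], name='state'),
              name='students')

by_name = s.groupby(level='state').mean()   # level given by index name
by_position = s.groupby(level=0).mean()     # level given by integer location

print(by_name)
```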

### Transforming and not aggregating groups

The most common operation to apply to a group is some kind of aggregation, since a single-number summary of a group is usually the primary concern. There are, however, instances where you want to transform the entire group and keep every row. The **`transform`** groupby method applies a (usually custom) function to each column of each group and returns data that is the same length as the group.

One of the most common transformations is to standardize data, that is, to transform a numerical group so that its mean is 0 and its standard deviation is 1. This is [done in the documentation](http://pandas.pydata.org/pandas-docs/stable/groupby.html#transformation).
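
A minimal sketch of standardizing within groups, using made-up data (the `toy` frame and the `standardize` function are assumptions for illustration, not taken from the documentation):

```python
import pandas as pd

# hypothetical toy data with two groups on very different scales
toy = pd.DataFrame({
    'group': ['a', 'a', 'a', 'b', 'b', 'b'],
    'value': [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})

def standardize(s):
    # s is one group's values; the result has the same length as s
    return (s - s.mean()) / s.std()

# every row gets a new value, computed relative to its own group
toy['z'] = toy.groupby('group')['value'].transform(standardize)
print(toy)
```

Although the raw values of the two groups differ by a factor of ten, their standardized values are identical because each group is scaled by its own mean and standard deviation.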

# Case Study: Tracking Weight Loss per Month
There are two friends interested in tracking their weight over the course of several months. To provide motivation they decide to wager some money each month. The friend who loses the highest percentage of body weight each month wins that month. Each month is independent from the others so the weight loss percentage resets at the start of the month.

Below is the data that was collected over the course of the bet.

In [848]:
# create fake data

all_weight = np.zeros(32)
all_weight[::2] = 300 * np.cumprod(np.random.rand(16) * .05 + .96)
all_weight[1::2] = 200 * np.cumprod(np.random.rand(16) * .05 + .96)
all_weight = np.round(all_weight, 0).astype(int)


df_weight = pd.DataFrame({'Name':['Bob', 'Amy'] * 16, 
                          'Month': ('Jan ' * 8 + ' Feb' * 8 + ' Mar' * 8 + ' Apr' * 8).split(),
                          'Week' : (['Week 1'] * 2 + ['Week 2'] * 2 + ['Week 3'] * 2 + ['Week 4'] * 2) * 4,
                          'Weight':all_weight},
                        columns=['Name', 'Month', 'Week', 'Weight'])

df_weight

Unnamed: 0,Name,Month,Week,Weight
0,Bob,Jan,Week 1,290
1,Amy,Jan,Week 1,196
2,Bob,Jan,Week 2,292
3,Amy,Jan,Week 2,189
4,Bob,Jan,Week 3,287
5,Amy,Jan,Week 3,183
6,Bob,Jan,Week 4,287
7,Amy,Jan,Week 4,177
8,Bob,Feb,Week 1,282
9,Amy,Feb,Week 1,177


### Transform
A new bet is begun at the start of each month and the winner for each month is declared by percentage weight loss achieved from week 1 to week 4. Percentage weight loss resets to zero at the beginning of each month.

We need to find a way to track the percentage weight loss within each month for each person. Since each week will have a weight loss percentage, no aggregation is performed and instead a new value is needed for each row. This situation calls for **`transform`**.

The default behavior of transform is to apply the same function (given as an argument) to each non-grouped column of the DataFrame. Since columns are Series, it is a Series that is passed to the transformation function.
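
To see this default on a small scale, the sketch below calls **`transform`** without selecting a column. The `toy` data, including the `waist` column, is made up for illustration.

```python
import pandas as pd

# hypothetical toy data with two numeric columns
toy = pd.DataFrame({
    'name': ['Bob', 'Bob', 'Amy', 'Amy'],
    'weight': [300, 290, 200, 190],
    'waist': [40, 39, 30, 29],
})

# no column selected: the grouping column 'name' is excluded
# and every other column is transformed within each group
changes = toy.groupby('name').transform(lambda s: s - s.iloc[0])
print(changes)
```

The result has one row per original row and one column per non-grouped column, which is why selecting a single column first (as done next with `Weight`) is usually clearer.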

In [861]:
# define a custom function that finds the percentage loss relative to the first week
# it is passed a Series

def find_perc_loss(s):
    return (s - s.iloc[0]) / s.iloc[0]

In [863]:
# group by name and month
# apply transformation
# and only transform weight column
df_weight['percent_month_loss'] = df_weight.groupby(['Name', 'Month'])['Weight'].transform(find_perc_loss) 

# view first two months. Notice that percent loss resets to 0
df_weight.head(16)

Unnamed: 0,Name,Month,Week,Weight,percent_month_loss
0,Bob,Jan,Week 1,290,0.0
1,Amy,Jan,Week 1,196,0.0
2,Bob,Jan,Week 2,292,0.006897
3,Amy,Jan,Week 2,189,-0.035714
4,Bob,Jan,Week 3,287,-0.010345
5,Amy,Jan,Week 3,183,-0.066327
6,Bob,Jan,Week 4,287,-0.010345
7,Amy,Jan,Week 4,177,-0.096939
8,Bob,Feb,Week 1,282,0.0
9,Amy,Feb,Week 1,177,0.0


In [864]:
# it might be easier to read if sorted.
# note that Month is sorted lexicographically (alphabetically), not by calendar order
df_weight_final = df_weight.sort_values(['Name', 'Month', 'Week'])

df_weight_final

Unnamed: 0,Name,Month,Week,Weight,percent_month_loss
25,Amy,Apr,Week 1,165,0.0
27,Amy,Apr,Week 2,162,-0.018182
29,Amy,Apr,Week 3,157,-0.048485
31,Amy,Apr,Week 4,152,-0.078788
9,Amy,Feb,Week 1,177,0.0
11,Amy,Feb,Week 2,171,-0.033898
13,Amy,Feb,Week 3,166,-0.062147
15,Amy,Feb,Week 4,166,-0.062147
1,Amy,Jan,Week 1,196,0.0
3,Amy,Jan,Week 2,189,-0.035714
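
If calendar order matters, one option (not used in this lesson) is to turn the month column into an ordered categorical before sorting. The sketch below uses made-up data; the `toy` frame and the `month_order` list are assumptions for illustration.

```python
import pandas as pd

# hypothetical toy data with months out of order
toy = pd.DataFrame({
    'Month': ['Feb', 'Apr', 'Jan', 'Mar'],
    'Weight': [177, 165, 196, 170],
})

# declare the calendar order explicitly
month_order = ['Jan', 'Feb', 'Mar', 'Apr']
toy['Month'] = pd.Categorical(toy['Month'], categories=month_order, ordered=True)

# sorting now follows the category order instead of the alphabet
by_calendar = toy.sort_values('Month')
print(by_calendar)
```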


### Finding a winner

It's possible to manually find a winner of each month by comparing each person's week 4. For instance, Amy lost 7.9% of her body weight in April compared to Bob's 6.6% and won that month.

Doing this by hand is tedious and, since we have Pandas, unnecessary. We need to reshape the data in such a manner that each person's week 4 is easily comparable. There are many ways to reshape data. The **`pivot`** and **`pivot_table`** DataFrame methods allow you to convert **long** formatted data into **wide** formatted data. This converts column values into column names. More will be said on this later.
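
As a small preview of long-to-wide reshaping, here is a sketch with **`pivot`** on made-up data (the `long_df` frame and its `loss` column are assumptions for illustration).

```python
import pandas as pd

# hypothetical long data: one row per (Month, Name) pair
long_df = pd.DataFrame({
    'Month': ['Jan', 'Jan', 'Feb', 'Feb'],
    'Name': ['Amy', 'Bob', 'Amy', 'Bob'],
    'loss': [-0.10, -0.01, -0.06, -0.05],
})

# the values of 'Name' become column names, one row per 'Month'
wide_df = long_df.pivot(index='Month', columns='Name', values='loss')
print(wide_df)
```

**`pivot`** requires each index/column pair to appear only once, while **`pivot_table`** aggregates duplicates (with the mean by default), so with one row per pair the two give the same numbers.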

First the above DataFrame will be filtered for only week 4 and then pivoted to make the comparison easy.

In [866]:
df_weight_week4 = df_weight_final[df_weight_final['Week'] == 'Week 4']

df_weight_week4

Unnamed: 0,Name,Month,Week,Weight,percent_month_loss
31,Amy,Apr,Week 4,152,-0.078788
15,Amy,Feb,Week 4,166,-0.062147
7,Amy,Jan,Week 4,177,-0.096939
23,Amy,Mar,Week 4,164,-0.012048
30,Bob,Apr,Week 4,228,-0.065574
14,Bob,Feb,Week 4,267,-0.053191
6,Bob,Jan,Week 4,287,-0.010345
22,Bob,Mar,Week 4,252,-0.063197


In [869]:
# use pivot_table to move the Name column
df_weight_winner = df_weight_week4.pivot_table(index='Month', columns='Name', values='percent_month_loss')

df_weight_winner

Name,Amy,Bob
Month,Unnamed: 1_level_1,Unnamed: 2_level_1
Apr,-0.078788,-0.065574
Feb,-0.062147,-0.053191
Jan,-0.096939,-0.010345
Mar,-0.012048,-0.063197


### Column for winner with np.where

Now that the reshaped data makes the winner easy to see, a final column with the winner's name can be created with the numpy **`where`** function. **`np.where`** takes an array of boolean values and returns an array containing the second argument wherever the boolean array is True and the third argument wherever it is False.

In [876]:
# a trivial example
# Return 'Yes' when True and 'No' when False

np.where([True, False, False, True, False, True], 'Yes', 'No')

array(['Yes', 'No', 'No', 'Yes', 'No', 'Yes'], 
      dtype='<U3')

In [873]:
# losses are stored as negative numbers, so the smaller (more negative) value is the bigger loss
# make the winner Amy when her weight loss is more than Bob's and vice versa using np.where
df_weight_winner['Winner'] = np.where(df_weight_winner['Amy'] < df_weight_winner['Bob'], 'Amy', 'Bob')

df_weight_winner

Name,Amy,Bob,Winner
Month,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Apr,-0.078788,-0.065574,Amy
Feb,-0.062147,-0.053191,Amy
Jan,-0.096939,-0.010345,Amy
Mar,-0.012048,-0.063197,Bob


In [878]:
# Get the winner
# If there happen to be lots of months of data

df_weight_winner['Winner'].value_counts()

Amy    3
Bob    1
Name: Winner, dtype: int64

# End of Section Summary
1. The function passed to **`filter`** must return a single boolean for each group; **`filter`** then keeps or drops whole groups
2. Group by the index with the **`level`** argument in the **`groupby`** method
3. Know how to define custom aggregation and filter functions
4. Use **`transform`** when you want to return a Series the same length as the original group
5. Map a boolean expression to 2 other values with **`np.where`**
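
The objectives mention the differences between **`agg`**, **`filter`**, **`transform`** and **`apply`**. One way to keep them apart is to compare the shape of what each returns. The sketch below does this on made-up data (the `toy` frame is an assumption for illustration).

```python
import pandas as pd

# hypothetical toy data: group 'a' has 2 rows, group 'b' has 3 rows
toy = pd.DataFrame({
    'g': ['a', 'a', 'b', 'b', 'b'],
    'v': [1, 2, 10, 20, 30],
})
grouped = toy.groupby('g')['v']

# agg: one value per group
aggregated = grouped.agg('sum')

# transform: one value per original row, aligned with the original index
transformed = grouped.transform('sum')

# filter: a subset of the original rows, whole groups kept or dropped
filtered = toy.groupby('g').filter(lambda df: len(df) > 2)

# apply: the result depends on what the function returns (here one scalar per group)
applied = grouped.apply(lambda s: s.max() - s.min())

print(len(aggregated), len(transformed), len(filtered), len(applied))
```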

# Problem Set
We will be working with the city of Houston dataset for the questions in this notebook. Run the following commands before attempting the problems.

In [4]:
import pandas as pd
import numpy as np

pd.options.display.max_columns = 40
hou = pd.read_csv('data/coh_employee.csv')

Houston Information Tech Svcs    9
Planning & Development           7
Mayor's Office                   5
City Controller's Office         5
Convention and Entertainment     1
Name: DEPARTMENT, dtype: int64

### Problem 1
<span  style="color:green; font-size:16px">What are the 5 least common departments?</span>

In [8]:
import pandas as pd
import numpy as np

pd.options.display.max_columns = 40
hou = pd.read_csv('data/coh_employee.csv')

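hou.DEPARTMENT.value_counts().tail(5)

Houston Information Tech Svcs    9
Planning & Development           7
Mayor's Office                   5
City Controller's Office         5
Convention and Entertainment     1
Name: DEPARTMENT, dtype: int64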

### Problem 2
<span  style="color:green; font-size:16px">Filter out departments with fewer than 50 occurrences and save the result to **`hou_filter`**. Then test your code by outputting the frequencies of all the remaining departments. </span>

In [28]:
def min_count(df, threshold):
    return len(df['DEPARTMENT']) >= threshold

hou_filter = hou.groupby(['DEPARTMENT']).filter(min_count, threshold=50)
hou_filter.DEPARTMENT.value_counts()

Houston Police Department-HPD     638
Houston Fire Department (HFD)     384
Public Works & Engineering-PWE    343
Health & Human Services           110
Houston Airport System (HAS)      106
Parks & Recreation                 74
Name: DEPARTMENT, dtype: int64

### Problem 3
<span  style="color:green; font-size:16px">Keep only the departments from the original **`hou`** DataFrame with average salaries less than $70,000 and save the result to **`hou_filter_salary`**. Then test your code by outputting the average salaries for the remaining departments.</span>

In [31]:
hou_filter_salary = hou.groupby(['DEPARTMENT']).filter(lambda df: df.BASE_SALARY.mean() < 70000)

# select the salary column with a list so the result is a one-column DataFrame
hou_filter_salary.groupby('DEPARTMENT')[['BASE_SALARY']].mean()

Unnamed: 0_level_0,BASE_SALARY
DEPARTMENT,Unnamed: 1_level_1
Admn. & Regulatory Affairs,50890.551724
City Controller's Office,55711.6
City Council,59089.222222
Convention and Entertainment,38397.0
Dept of Neighborhoods (DON),47092.882353
Fleet Management Department,43994.305556
General Services Department,51295.818182
Health & Human Services,51305.933962
Housing and Community Devp.,61387.7
Houston Airport System (HAS),53956.066038


### Problem 4
<span  style="color:green; font-size:16px">Filter for those departments from the original **`hou`** DataFrame with average salaries of more than 65,000 or having at least 25 unique position titles. Then create a way to check you got the right answer.</span>

In [63]:
def many_and_high(df, threshold, positions):
    return (df.BASE_SALARY.mean() > threshold) or (df.POSITION_TITLE.nunique() >= positions)

hou_filter2 = hou.groupby(['DEPARTMENT']).filter(many_and_high, threshold=65000, positions = 25)

hou_filter2.shape



(1696, 31)

### Problem 5: Advanced
<span  style="color:green; font-size:16px">Find a way to do problem 4 without using the **`filter`** method. Make clever use of aggregate groupby and boolean logic</span>

In [85]:
# do aggregations for each boolean piece separately
salary_grp = hou.groupby('DEPARTMENT')['BASE_SALARY'].mean()
uniq_grp = hou.groupby('DEPARTMENT')['POSITION_TITLE'].nunique()

salary_grp.head()

# create boolean criteria

deps = (salary_grp > 65000) | (uniq_grp >= 25)

deps

# filter Series with itself and grab index values
deps_true = deps[deps].index.values

deps_true

hou_more_check = hou[hou.DEPARTMENT.isin(deps_true)]

# can check equality of dataframes with equals method
hou_more_check.equals(hou_filter2)

True

### Problem 6: Advanced
<span  style="color:green; font-size:16px">Group by department, gender and race and get the mean, min and max base salary for each group. Also get the number of unique position titles and the most frequent position title for each group. Rename each aggregation to something that makes sense. Then remove the top level of the column index. Hint: This [stackoverflow answer](http://stackoverflow.com/questions/15222754/group-by-pandas-dataframe-and-select-most-common-string-factor) will be useful </span>

In [86]:
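# nested-dict renaming inside agg was removed in pandas 1.0;
# named aggregation (pandas 0.25 and later) gives the same renamed columns
df = hou.groupby(['DEPARTMENT', 'GENDER', 'RACE']).agg(
    salary_mean=('BASE_SALARY', 'mean'),
    salary_min=('BASE_SALARY', 'min'),
    salary_max=('BASE_SALARY', 'max'),
    unique_positions=('POSITION_TITLE', 'nunique'),
    most_frequent_position=('POSITION_TITLE', lambda x: x.value_counts().index[0]))

# the resulting columns are already flat, so there is no top level left to drop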

df.head(12)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,salary_mean,salary_min,salary_max,unique_positions,most_frequent_position
DEPARTMENT,GENDER,RACE,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Admn. & Regulatory Affairs,Female,Asian/Pacific Islander,72293.666667,37710.0,130416.0,3,ADMINISTRATIVE ASSOCIATE
Admn. & Regulatory Affairs,Female,Black or African American,49727.5,33550.0,72741.0,8,CUSTOMER SERVICE REPRESENTATIVE I
Admn. & Regulatory Affairs,Female,Hispanic/Latino,36616.4,28205.0,47341.0,4,ANIMAL CARE TECHNICIAN
Admn. & Regulatory Affairs,Female,White,47664.666667,33280.0,62129.0,3,ADMINISTRATIVE SPECIALIST
Admn. & Regulatory Affairs,Male,Black or African American,29827.5,29557.0,30098.0,2,REGULATORY INVESTIGATOR
Admn. & Regulatory Affairs,Male,Hispanic/Latino,35318.0,35318.0,35318.0,1,CUSTOMER SERVICE REPRESENTATIVE I
Admn. & Regulatory Affairs,Male,White,122096.0,103776.0,140416.0,2,DEPUTY ASSISTANT DIRECTOR (EXECUTIVE LEV
City Controller's Office,Female,Asian/Pacific Islander,59077.0,59077.0,59077.0,1,ASSISTANT CITY CONTROLLER III
City Controller's Office,Female,Black or African American,56295.0,55536.0,57054.0,1,ADMINISTRATIVE ASSISTANT
City Controller's Office,Female,Hispanic/Latino,64251.0,64251.0,64251.0,1,ADMINISTRATIVE ASSISTANT


### Problem 7
<span  style="color:green; font-size:16px"> Create a column **`is_max`** that is equal to 1 if the base salary is currently the max base salary (out of all previous rows) for that department and 0 otherwise. Save the returned DataFrame to **`hou_1`**. Output the first 20 rows. See sample data below.</span>

In [87]:
# Return a dataframe that looks like this
''' 
DEPARTMENT    BASE_SALARY  is_max  
Library           160         1
Police            150         1
Library           170         1
Police             95         0
Police            140         0
Library            80         0
Police            189         1
'''


hou_1 = hou[['DEPARTMENT', 'BASE_SALARY']].copy()
hou_1['is_max'] = hou_1.groupby('DEPARTMENT')['BASE_SALARY'].transform(lambda x: x == x.cummax())
hou_1.head(20)

Unnamed: 0,DEPARTMENT,BASE_SALARY,is_max
0,Municipal Courts Department,121862.0,1.0
1,Library,26125.0,1.0
2,Houston Police Department-HPD,45279.0,1.0
3,Houston Fire Department (HFD),63166.0,1.0
4,General Services Department,56347.0,1.0
5,Houston Police Department-HPD,66614.0,1.0
6,Public Works & Engineering-PWE,71680.0,1.0
7,Houston Airport System (HAS),42390.0,1.0
8,Public Works & Engineering-PWE,107962.0,1.0
9,Houston Airport System (HAS),44616.0,1.0


In [88]:
# for simplicity start with this dataframe
hou_1 = hou[['DEPARTMENT', 'BASE_SALARY']].copy() # copy so that we don't overwrite the original data
# your code here

### Problem 8: Advanced
<span  style="color:green; font-size:16px"> Programmatically find the 10th occurrence of 0 for **`is_max`** and return a DataFrame that ends after the tenth occurrence.</span>

In [90]:
hou_1 = hou[['DEPARTMENT', 'BASE_SALARY']].copy()
hou_1['is_max'] = hou_1.groupby('DEPARTMENT')['BASE_SALARY'].transform(lambda x: x == x.cummax())

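# cumcount numbers rows from 0 within each is_max group,
# so the 10th occurrence is the row where occur == 9
hou_1['occur'] = hou_1.groupby('is_max').cumcount()
hou_1.head(20)
idx_10 = hou_1.index[(hou_1.occur == 9) & (hou_1.is_max == 0)][0]
idx_10
hou_1.loc[:idx_10]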

Unnamed: 0,DEPARTMENT,BASE_SALARY,is_max,occur
0,Municipal Courts Department,121862.0,1.0,0
1,Library,26125.0,1.0,1
2,Houston Police Department-HPD,45279.0,1.0,2
3,Houston Fire Department (HFD),63166.0,1.0,3
4,General Services Department,56347.0,1.0,4
5,Houston Police Department-HPD,66614.0,1.0,5
6,Public Works & Engineering-PWE,71680.0,1.0,6
7,Houston Airport System (HAS),42390.0,1.0,7
8,Public Works & Engineering-PWE,107962.0,1.0,8
9,Houston Airport System (HAS),44616.0,1.0,9
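
The running-maximum and occurrence-counting ideas from Problems 7 and 8 can be checked by hand on the small sample table from Problem 7. This is only a sketch on that sample data; the variable names `toy`, `running_max` and `second_zero_idx` are made up for illustration.

```python
import pandas as pd

# the sample rows shown in Problem 7
toy = pd.DataFrame({
    'DEPARTMENT': ['Library', 'Police', 'Library', 'Police',
                   'Police', 'Library', 'Police'],
    'BASE_SALARY': [160, 150, 170, 95, 140, 80, 189],
})

# running maximum of the salary within each department
running_max = toy.groupby('DEPARTMENT')['BASE_SALARY'].cummax()
toy['is_max'] = (toy['BASE_SALARY'] == running_max).astype(int)

# cumcount starts at 0, so the n-th occurrence has occur == n - 1
toy['occur'] = toy.groupby('is_max').cumcount()

# index label of the 2nd row where is_max is 0
second_zero_idx = toy.index[(toy['occur'] == 1) & (toy['is_max'] == 0)][0]
print(toy.loc[:second_zero_idx])
```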


### Problem 9
<span  style="color:green; font-size:16px"> Write a function that accepts a single argument that will filter **`hou_1`** for a specific department where **`is_max`** is 1. Test your function with departments like 'Library' and 'Public Works & Engineering-PWE'.</span>

In [91]:
def filter_dep(dep):
    criteria = (hou_1.DEPARTMENT == dep) & (hou_1.is_max == 1)
    return hou_1[criteria]

filter_dep('Library')

filter_dep('Public Works & Engineering-PWE')

Unnamed: 0,DEPARTMENT,BASE_SALARY,is_max,occur
6,Public Works & Engineering-PWE,71680.0,1.0,6
8,Public Works & Engineering-PWE,107962.0,1.0,8
186,Public Works & Engineering-PWE,110881.0,1.0,39
297,Public Works & Engineering-PWE,141948.0,1.0,50
1067,Public Works & Engineering-PWE,146141.0,1.0,77
1232,Public Works & Engineering-PWE,178331.0,1.0,80


### Problem 10
<span  style="color:green; font-size:16px">A good skill to have is to pose a difficult question to yourself and then answer it. Ask yourself a question that involves grouping and answer it.</span>

In [None]:
# your code here