# Assumptions

This notebook assumes you are familiar with basic Python syntax and lambda functions.
<br />Lambda function blog: https://www.programiz.com/python-programming/anonymous-function <br /><br />
This blog post may also help you understand the split-apply-combine idea behind groupby:
<br />Link: https://towardsdatascience.com/how-to-use-the-split-apply-combine-strategy-in-pandas-groupby-29e0eb44b62e

# Why/When Groupby?
  During EDA, data analysis, or feature engineering you will often reach a point where you want to split the data into groups or categories and compute summary statistics for each group.

Let's look at some scenarios below, starting by creating some dummy data.


In [None]:
import numpy as np
import pandas as pd

In [None]:
"""
  Let's assume a class of 300 students, each of whom has rated one favourite subject, chosen from 6 unique subjects, on a scale of 1-10.
"""
n_studs = 300
HouseOne = pd.DataFrame({
    "Name":["Name_"+str(i) for i in range(n_studs)],
    "Subject":np.random.choice(["Subject_"+str(i) for i in range(6)], size=n_studs),
    "Rating":np.random.uniform(low=1, high=10, size=n_studs)
})

In [None]:
HouseOne.sample(10) # returns 10 random rows, in no particular order

Unnamed: 0,Name,Subject,Rating
257,Name_257,Subject_4,6.859085
240,Name_240,Subject_3,3.588999
140,Name_140,Subject_0,7.641776
24,Name_24,Subject_2,8.263233
179,Name_179,Subject_4,9.93577
86,Name_86,Subject_3,9.299188
271,Name_271,Subject_3,2.674865
158,Name_158,Subject_5,4.251036
192,Name_192,Subject_4,1.436623
13,Name_13,Subject_5,3.344366
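
Because the data is drawn at random, every re-run of the notebook produces different numbers, so outputs from different cells may not line up. Below is a minimal sketch of a reproducible variant, assuming NumPy's `default_rng` generator with an arbitrary seed of 0. The seed and the name `house_seeded` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

n_studs = 300
rng = np.random.default_rng(0)  # fixed seed (arbitrary choice): same data on every run

# Same layout as HouseOne, but built from the seeded generator.
house_seeded = pd.DataFrame({
    "Name": [f"Name_{i}" for i in range(n_studs)],
    "Subject": rng.choice([f"Subject_{i}" for i in range(6)], size=n_studs),
    "Rating": rng.uniform(low=1, high=10, size=n_studs),
})

print(house_seeded.head())
```

With a fixed seed, the loop-based and built-in results later in the notebook can be compared number for number.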


OK, now that we have some sample data, let's see the average rating of each subject.

In [None]:
"""
Hold on a second: groupby("Subject") splits your data into one group per unique subject and returns a groupby OBJECT.
If you iterate over that object you get tuples of (group name, group data); here the group name is the subject name and the group data is the rows for that subject.
"""
for name, group in HouseOne.groupby("Subject"):
  #print(name)
  #print(group)
  print(f"Group Name is {name}")

Group Name is Subject_0
Group Name is Subject_1
Group Name is Subject_2
Group Name is Subject_3
Group Name is Subject_4
Group Name is Subject_5
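
To make the (name, data) pairs concrete, here is a small sketch on a hand-made frame. The frame `toy` and its values are made up for illustration. It shows that each group is an ordinary DataFrame and that `get_group` pulls out one group without a loop.

```python
import pandas as pd

toy = pd.DataFrame({
    "Subject": ["Math", "Art", "Math", "Art", "Math"],
    "Rating": [8.0, 6.0, 4.0, 7.0, 9.0],
})

grouped = toy.groupby("Subject")

# Iterating yields (group name, sub-DataFrame) tuples.
sizes = {name: len(group) for name, group in grouped}
print(sizes)

# get_group returns the rows of a single group, keeping the original index.
math_rows = grouped.get_group("Math")
print(math_rows)
```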


In [None]:
%%time
for name, group in HouseOne.groupby("Subject"):
  print(f"Subject Name is {name} and Subject Avg. rating is {group['Rating'].mean()}") # each group holds the rows of one subject, so this is that subject's mean rating

Subject Name is Subject_0 and Subject Avg. rating is 5.814835584844783
Subject Name is Subject_1 and Subject Avg. rating is 5.213351144307308
Subject Name is Subject_2 and Subject Avg. rating is 5.267472426569818
Subject Name is Subject_3 and Subject Avg. rating is 6.187580729487596
Subject Name is Subject_4 and Subject Avg. rating is 5.758361704615079
Subject Name is Subject_5 and Subject Avg. rating is 5.980369276602925
CPU times: user 2.62 ms, sys: 3.99 ms, total: 6.61 ms
Wall time: 9.06 ms


In [None]:
for name, group in HouseOne.groupby("Subject"):
  print(f"Now Printing Filtered Data of only : {name}")
  print("*"*50)
  print(group.head(3))
  print("*"*50)

Now Printing Filtered Data of only : Subject_0
**************************************************
     Name    Subject    Rating
2  Name_2  Subject_0  1.510844
3  Name_3  Subject_0  8.516733
5  Name_5  Subject_0  1.197019
**************************************************
Now Printing Filtered Data of only : Subject_1
**************************************************
       Name    Subject    Rating
9    Name_9  Subject_1  6.894412
12  Name_12  Subject_1  9.423082
20  Name_20  Subject_1  8.046732
**************************************************
Now Printing Filtered Data of only : Subject_2
**************************************************
       Name    Subject    Rating
0    Name_0  Subject_2  4.170904
15  Name_15  Subject_2  3.718398
17  Name_17  Subject_2  3.682088
**************************************************
Now Printing Filtered Data of only : Subject_3
**************************************************
       Name    Subject    Rating
4    Name_4  Subject_3  5.204231
8

I hope the idea is clear by now. Looping over the groups to compute a mean (or any other statistic) isn't really idiomatic pandas, so let's try another way.

In [None]:
HouseOne.groupby("Subject") #returns the groupby object 

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f084cc6b320>

In [None]:
%%time
HouseOne.groupby("Subject")['Rating'].mean() # what just happened? One group was created per subject, the Rating column was selected within each group, and the mean of that column was taken. Simple.

CPU times: user 2.1 ms, sys: 22 µs, total: 2.13 ms
Wall time: 12 ms


Subject
Subject_0    5.397969
Subject_1    5.830347
Subject_2    5.400330
Subject_3    4.897711
Subject_4    6.392445
Subject_5    5.928049
Name: Rating, dtype: float64

Compare the timings of the loop and of the shortcut method above. On the same data both give the same values (the two outputs shown here appear to come from different random draws, which is why the numbers differ). If the data is big enough the time difference would be significant; try it once by increasing n_studs from 300 to 3,000,000.
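
A rough way to run that experiment is sketched below. The row count, the seed, and the names `big`, `looped`, and `vectorized` are illustrative assumptions; the count is kept smaller than 3,000,000 so the cell finishes quickly, and you can raise it. Timings depend on your machine, so no expected numbers are shown.

```python
import time

import numpy as np
import pandas as pd

n_rows = 300_000  # illustrative size; increase it to see the gap change
rng = np.random.default_rng(0)
big = pd.DataFrame({
    "Subject": rng.choice([f"Subject_{i}" for i in range(6)], size=n_rows),
    "Rating": rng.uniform(1, 10, size=n_rows),
})

# Approach 1: loop over the groups and compute each mean by hand.
start = time.perf_counter()
looped = {name: group["Rating"].mean() for name, group in big.groupby("Subject")}
loop_seconds = time.perf_counter() - start

# Approach 2: let groupby compute all the means in one call.
start = time.perf_counter()
vectorized = big.groupby("Subject")["Rating"].mean()
vectorized_seconds = time.perf_counter() - start

print(f"loop: {loop_seconds:.4f}s, built-in mean: {vectorized_seconds:.4f}s")

# On the same data, both approaches should agree.
same = np.allclose(pd.Series(looped).sort_index().to_numpy(),
                   vectorized.sort_index().to_numpy())
print(same)
```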

#### Aggregate functions
How do you get multiple statistical inferences from the groups at once?
  Let's try to get the mean, median, mode, and count of the ratings for each group.

In [None]:
"""
  There is one rule for aggregate functions: every function you pass must reduce a group to a single value.
  For example:
    for a column, mean returns a single value, the average of that column.
    A function like value_counts returns multiple values, so it can't be used directly. You can, however, pick the most (or least) frequent element from its result and return that single value; see the getMeMode function.
  You can also pass your own user-defined function. Here I pass one that returns the mode of the series/column.

"""
def getMeMode(x):
  return x.value_counts().index[0] # value of the most frequent element. Note: x is the entire Rating column of one group, and the function returns a *single* value, the mode. Rating is a continuous float, so values rarely repeat and the mode here is mostly illustrative.
  
stats = HouseOne.groupby("Subject")["Rating"].agg({"mean", "median", "count", np.sum, getMeMode}) # instead of np.sum you can also use "sum"; getMeMode is passed as a reference to the function (it is not called here)
stats # the functions are passed as a set, which has no defined order, so the columns come back in arbitrary order; select them explicitly, e.g. stats[[ColA, ColB, ..., ColN]], to get a fixed order

Subject,mean,count,median,getMeMode,sum
Subject_0,5.523832,51,6.053294,1.263153,281.715414
Subject_1,6.486176,44,7.064109,9.108997,285.39173
Subject_2,4.755034,40,4.493698,4.431074,190.20136
Subject_3,6.025273,52,6.583036,2.347427,313.314216
Subject_4,4.911229,56,4.25552,6.459121,275.028835
Subject_5,5.942814,57,6.358712,2.667478,338.74042
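
The "single value" rule is easier to see on a tiny example. The sketch below uses a made-up frame `toy` and a hypothetical helper `rating_range`, neither of which is from the notebook, and passes the functions as a list so the result columns keep the order in which they were written.

```python
import pandas as pd

toy = pd.DataFrame({
    "Subject": ["Math", "Art", "Math", "Art", "Math"],
    "Rating": [8.0, 6.0, 4.0, 7.0, 9.0],
})

def rating_range(x):
    """Reduce one group's ratings to a single number: max minus min."""
    return x.max() - x.min()

# A list (unlike a set) is ordered, so the columns follow this order.
toy_stats = toy.groupby("Subject")["Rating"].agg(["mean", "count", rating_range])
print(toy_stats)
```

The custom function's name becomes its column label, just as `getMeMode` does above.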


#### Apply/ApplyMap/Map

There could be scenarios where you want to apply some logic or function to a DataFrame or Series itself.


In [None]:
"""
  Apply/ApplyMap/Map are functions that let you run your own logic over your data (including groups you created) and manipulate its values.
  Map - applied only to a Series (a column), element-wise
  ApplyMap - applied only to a DataFrame, element-wise (pandas 2.1 renamed it to DataFrame.map and deprecated applymap)
  Apply - can be used on both
  .
  .
  Let's create a new column in Dataframe and convert the rating to some string value.
"""
rating_to_grade = {
    10 : "O",
     9 : "E",
     8 : "A",
     7 : "B",
     6 : "C",
     5 : "D",
     4 : "E",
     3 : "F",
     2 : "Nailed it.!",
     1 : "Legendary"
}

# converting Rating, which is continuous, to discrete values
HouseOne['discrete_rating'] = HouseOne['Rating'].apply(np.ceil) # map would also work here, since we are only manipulating a single Series, i.e. Rating
HouseOne.head()

Unnamed: 0,Name,Subject,Rating,discrete_rating
0,Name_0,Subject_2,4.170904,5.0
1,Name_1,Subject_5,1.373747,2.0
2,Name_2,Subject_0,1.510844,2.0
3,Name_3,Subject_0,8.516733,9.0
4,Name_4,Subject_3,5.204231,6.0


In [None]:
HouseOne['Grade'] = HouseOne['discrete_rating'].map(rating_to_grade)
HouseOne.sample(10)

Unnamed: 0,Name,Subject,Rating,discrete_rating,Grade
127,Name_127,Subject_3,1.348359,2.0,Nailed it.!
141,Name_141,Subject_0,6.255384,7.0,B
177,Name_177,Subject_5,3.784712,4.0,E
152,Name_152,Subject_0,9.452109,10.0,O
133,Name_133,Subject_3,8.224375,9.0,E
67,Name_67,Subject_2,5.396364,6.0,C
290,Name_290,Subject_4,7.758885,8.0,A
52,Name_52,Subject_1,9.861044,10.0,O
270,Name_270,Subject_1,1.402795,2.0,Nailed it.!
50,Name_50,Subject_3,7.630585,8.0,A


In [None]:
"""
  Let's say you want to manipulate every single value of the DataFrame, for example append H1 (for House_one) to every value. Let's try that.
"""
HouseOne.applymap(lambda x : str(x)+ "_H1") # as you can see applymap added _H1 to each element of data.

Unnamed: 0,Name,Subject,Rating
0,Name_0_H1,Subject_1_H1,4.857946967777754_H1
1,Name_1_H1,Subject_0_H1,7.136362708317373_H1
2,Name_2_H1,Subject_2_H1,7.855307560524956_H1
3,Name_3_H1,Subject_0_H1,2.7248876763902956_H1
4,Name_4_H1,Subject_2_H1,3.7787610600439923_H1
...,...,...,...
295,Name_295_H1,Subject_0_H1,6.965027269326243_H1
296,Name_296_H1,Subject_2_H1,3.1023213913339966_H1
297,Name_297_H1,Subject_2_H1,1.6147742874569702_H1
298,Name_298_H1,Subject_3_H1,8.30477808711956_H1
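
The three methods can be compared side by side on a tiny frame. This is a sketch with a made-up frame `toy`. It assumes that newer pandas releases expose the element-wise DataFrame method as `DataFrame.map`, and it falls back to `applymap` when that attribute is missing.

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Series.map: element-wise on one column.
doubled = toy["a"].map(lambda v: v * 2)

# DataFrame.apply: the function receives a whole column (default) or a whole row (axis=1).
col_totals = toy.apply(lambda col: col.sum())
row_totals = toy.apply(lambda row: row.sum(), axis=1)

# Element-wise over the whole frame: DataFrame.map where available, else applymap.
elementwise = toy.map if hasattr(toy, "map") else toy.applymap
tagged = elementwise(lambda v: f"{v}_H1")

print(doubled.tolist())
print(col_totals.to_dict())
print(row_totals.tolist())
print(tagged)
```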


##### Let's say you want to manipulate groups of data or run your function on each group

In [None]:
def myFunction(groupName, groupData):
  """
    @ groupName : the name of the group currently being processed
    @ groupData : the rows of that group, as a DataFrame
  """
  print(f"Currently this group Name is {groupName}")
  print(f"Curious what X contains? :\n {groupData.head(2)}") # X is simply the group's data. groupData is a DataFrame, so you can apply whatever logic you like to it.



HouseOne.groupby("Subject").apply(lambda x: myFunction(x.name, x)) # x.name is passed only for illustration; you will rarely need it. As an experiment, try map instead of apply here and see what happens

Currently this group Name is Subject_0
Curious what X contains? :
      Name    Subject    Rating  discrete_rating        Grade
2  Name_2  Subject_0  1.510844              2.0  Nailed it.!
3  Name_3  Subject_0  8.516733              9.0            E
Currently this group Name is Subject_1
Curious what X contains? :
        Name    Subject    Rating  discrete_rating Grade
9    Name_9  Subject_1  6.894412              7.0     B
12  Name_12  Subject_1  9.423082             10.0     O
Currently this group Name is Subject_2
Curious what X contains? :
        Name    Subject    Rating  discrete_rating Grade
0    Name_0  Subject_2  4.170904              5.0     D
15  Name_15  Subject_2  3.718398              4.0     E
Currently this group Name is Subject_3
Curious what X contains? :
      Name    Subject    Rating  discrete_rating Grade
4  Name_4  Subject_3  5.204231              6.0     C
8  Name_8  Subject_3  7.417015              8.0     A
Currently this group Name is Subject_4
Curious what

Let's see which grade is most frequent for each subject.

In [None]:
HouseOne.groupby("Subject").apply(lambda x: x['Grade'].value_counts().index[0]) # if this is confusing, decode it step by step: take one group, select its Grade column, count each grade, and take the first (most frequent) label.

Subject
Subject_0    E
Subject_1    E
Subject_2    E
Subject_3    E
Subject_4    E
Subject_5    E
dtype: object
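
The one-liner above can be unpacked into named steps. The sketch below uses a made-up frame `toy` and a hypothetical helper `most_frequent`. It selects the Grade column before aggregating, so the function only ever sees one group's grades.

```python
import pandas as pd

toy = pd.DataFrame({
    "Subject": ["Math", "Math", "Math", "Art", "Art", "Art"],
    "Grade": ["A", "B", "A", "C", "C", "B"],
})

def most_frequent(grades):
    counts = grades.value_counts()  # step 1: count each grade, most common first
    return counts.index[0]          # step 2: return the label with the top count

# Group by subject, select the Grade column, then reduce each group to one label.
top_grade = toy.groupby("Subject")["Grade"].agg(most_frequent)
print(top_grade)
```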

## Aggregating Multiple Columns

In [None]:
# here we aggregate individual columns by passing a dictionary of {column: operation}; custom functions work too, e.g. getMeMode created earlier
HouseOne.groupby("Subject").agg({"Rating":"mean", "discrete_rating":"median", "Grade":getMeMode})

Subject,Rating,discrete_rating,Grade
Subject_0,5.397969,6.0,E
Subject_1,5.830347,7.0,E
Subject_2,5.40033,6.0,E
Subject_3,4.897711,5.0,E
Subject_4,6.392445,7.0,E
Subject_5,5.928049,6.0,E
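
The dictionary form allows only one output column per input column, named after it. As an alternative sketch (not the notebook's method), pandas also supports named aggregation, where each keyword names an output column and maps to a (column, function) pair. The frame `toy` and the output names below are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "Subject": ["Math", "Math", "Math", "Art", "Art", "Art"],
    "Rating": [8.0, 4.0, 9.0, 6.0, 7.0, 5.0],
    "Grade": ["A", "B", "A", "C", "C", "B"],
})

# keyword = (input column, aggregation); the keyword becomes the output column name.
summary = toy.groupby("Subject").agg(
    avg_rating=("Rating", "mean"),
    n_students=("Rating", "count"),
    top_grade=("Grade", lambda s: s.value_counts().index[0]),
)
print(summary)
```

This form lets the same input column feed several outputs and keeps the columns in the order written.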


In [6]:
my_list = [1, 5, 4, 6, 8, 11, 3, 12]

new_list = list(filter(lambda x: (x%2 == 0) , my_list))

print(new_list)

[4, 6, 8, 12]


In [9]:
list(filter(lambda x: (x%2 == 0) , my_list))

[4, 6, 8, 12]