# ```Groupby```

Groupby is a simple concept: we split the data into groups based on categories and apply a function to each group. Simple as it is, the technique is widely used in data science. Real projects involve large amounts of data and repeated experimentation, and groupby lets us aggregate that data efficiently, both in runtime and in the amount of code required. Groupby refers to a process involving one or more of the following steps:
* **Splitting**: the data is split into groups based on some criteria.
* **Applying**: a function is applied to each group independently.
* **Combining**: the results of applying the function to each group are combined into a single data structure.

The following images illustrate the process involved in groupby.

1. Group the unique values from the Team column:
![image.png](attachment:image.png)
2. Now there’s a bucket for each group:
![image-2.png](attachment:image-2.png)
3. Toss the other data into the buckets:
![image-3.png](attachment:image-3.png)
4. Apply a function on the weight column of each bucket:
![image-4.png](attachment:image-4.png)
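
The four steps in the images can also be written out by hand. The following is a minimal sketch, assuming a toy table with a `Team` and a `weight` column; the values are invented for illustration and are not taken from the images.

```python
import pandas as pd

# Hypothetical toy data: the column names mirror the images above,
# the values are made up.
players = pd.DataFrame({
    "Team": ["A", "B", "A", "C", "B", "A"],
    "weight": [80, 95, 70, 88, 105, 90],
})

# 1) Split: one bucket (sub-DataFrame) per unique Team value.
buckets = {team: players[players["Team"] == team] for team in players["Team"].unique()}

# 2) Apply: compute the mean weight inside each bucket independently.
means = {team: bucket["weight"].mean() for team, bucket in buckets.items()}

# 3) Combine: collect the per-bucket results into a single Series.
manual = pd.Series(means, name="weight").sort_index()

# groupby performs the same three steps in one call.
grouped = players.groupby("Team")["weight"].mean()
```

Both `manual` and `grouped` hold one mean weight per team; the `groupby` version is shorter and avoids filtering the table once per team.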

### Splitting Data into Groups
To split the data, we use the ```groupby()``` function, which divides a pandas object into groups based on some criteria. In abstract terms, grouping means providing a mapping of labels to group names. Pandas objects can be split along any of their axes, although recent pandas releases (2.1 and later) deprecate ```axis=1``` in ```groupby()```. There are multiple ways to split the data, for example:
* ```obj.groupby(key)```
* ```obj.groupby(key, axis=1)```
* ```obj.groupby([key1, key2])```
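
The idea of "a mapping of labels to group names" can be made concrete: besides column names, `groupby` also accepts a dict or a function that assigns each index label to a group. The sketch below assumes a small invented frame; the names `scores` and `label_to_group` are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame; index labels and values are made up.
scores = pd.DataFrame(
    {"points": [10, 20, 30, 40]},
    index=["ann", "abe", "bob", "bea"],
)

# A dict maps each index label to the name of the group it belongs to.
label_to_group = {"ann": "red", "abe": "blue", "bob": "red", "bea": "blue"}
by_dict = scores.groupby(label_to_group)["points"].sum()

# A function is called on each index label; its return value is the group name.
by_func = scores.groupby(lambda label: label[0])["points"].sum()
```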

In [68]:
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt

df = pd.read_csv("./data/employees_satisfaction_transformed.csv", index_col=0)
df

Unnamed: 0,age,department,education,recruitment_type,job_level,rating,awards,certifications,salary,gender,entry_date,satisfied
0,28,HR,Postgraduate,Referral,5,2.0,1,0,78075.0,Male,2019-02-01,1
1,50,Technology,Postgraduate,Recruitment Agency,3,5.0,2,1,38177.1,Male,2017-01-17,0
2,43,Technology,Undergrad,Referral,4,1.0,2,0,59143.5,Female,2012-08-27,1
3,44,Sales,Postgraduate,On-Campus,2,3.0,0,0,26824.5,Female,2017-07-25,1
4,33,HR,Undergrad,Recruitment Agency,2,1.0,5,0,26824.5,Male,2019-05-17,1
...,...,...,...,...,...,...,...,...,...,...,...,...
495,49,HR,Postgraduate,On-Campus,2,5.0,6,0,26824.5,Male,2014-03-21,1
496,24,Technology,Undergrad,Referral,2,4.0,2,0,26824.5,Female,2018-02-20,0
497,34,Marketing,Postgraduate,On-Campus,1,Unavailable,2,0,21668.4,Male,2020-10-20,1
498,26,Technology,Undergrad,Walk-in,2,1.0,1,1,26824.5,Male,2012-05-18,0


# 1. Grouping with one key: ```groupby(key)```
To group the data by a single key, we pass that key (here a column name) as the argument to ```groupby()```.

In [69]:
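# Group the rows by department; the result is a GroupBy object.
group = df.groupby('department')
# .groups maps each group name to the index labels of its rows.
print(group.groups)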

{'HR': [0, 4, 8, 18, 20, 21, 42, 43, 47, 49, 50, 51, 67, 69, 70, 71, 74, 91, 94, 95, 96, 108, 115, 116, 118, 122, 125, 127, 129, 131, 134, 145, 147, 148, 153, 154, 164, 168, 169, 184, 185, 189, 194, 196, 208, 212, 229, 231, 235, 239, 241, 242, 245, 248, 249, 268, 272, 273, 276, 277, 280, 284, 289, 293, 302, 303, 308, 318, 320, 332, 334, 336, 344, 354, 356, 357, 358, 361, 368, 373, 377, 388, 391, 395, 396, 397, 398, 399, 409, 414, 415, 419, 426, 456, 457, 467, 470, 473, 475, 480, ...], 'Marketing': [10, 16, 26, 29, 32, 33, 44, 60, 64, 65, 73, 75, 93, 103, 106, 112, 119, 123, 124, 130, 137, 144, 149, 160, 161, 167, 173, 174, 180, 181, 188, 204, 205, 206, 207, 209, 213, 214, 215, 220, 221, 238, 243, 250, 258, 259, 260, 261, 266, 267, 283, 291, 298, 299, 307, 309, 310, 316, 319, 322, 324, 330, 339, 342, 343, 359, 360, 367, 371, 372, 376, 380, 384, 385, 393, 402, 403, 404, 405, 411, 418, 424, 438, 439, 440, 444, 445, 448, 451, 459, 471, 474, 482, 493, 497], 'Purchasing': [5, 6, 13, 14, 15, 

#### Get a specific group

In [70]:
group.get_group('HR')

Unnamed: 0,age,department,education,recruitment_type,job_level,rating,awards,certifications,salary,gender,entry_date,satisfied
0,28,HR,Postgraduate,Referral,5,2.0,1,0,78075.0,Male,2019-02-01,1
4,33,HR,Undergrad,Recruitment Agency,2,1.0,5,0,26824.5,Male,2019-05-17,1
8,35,HR,Postgraduate,Referral,3,4.0,0,0,38177.1,Female,2015-04-02,1
18,54,HR,Postgraduate,On-Campus,1,5.0,4,0,21668.4,Female,2014-05-07,1
20,35,HR,Undergrad,On-Campus,2,4.0,4,0,26824.5,Female,2008-01-15,1
...,...,...,...,...,...,...,...,...,...,...,...,...
484,34,HR,Postgraduate,Walk-in,2,5.0,7,1,26824.5,Female,2013-01-14,1
486,54,HR,Postgraduate,Recruitment Agency,4,3.0,2,0,59143.5,Male,2015-08-09,1
489,33,HR,Undergrad,On-Campus,2,1.0,7,0,26824.5,Female,2016-09-03,1
492,33,HR,Undergrad,On-Campus,3,5.0,6,1,38177.1,Male,2017-08-09,1


#### Print first entry in each group

In [71]:
group.first()

department,age,education,recruitment_type,job_level,rating,awards,certifications,salary,gender,entry_date,satisfied
HR,28,Postgraduate,Referral,5,2.0,1,0,78075.0,Male,2019-02-01,1
Marketing,31,Undergrad,Walk-in,4,4.0,6,0,59143.5,Male,2009-01-24,1
Purchasing,40,Undergrad,Walk-in,3,3.0,7,1,38177.1,Male,2004-04-22,1
Sales,44,Postgraduate,On-Campus,2,3.0,0,0,26824.5,Female,2017-07-25,1
Technology,50,Postgraduate,Recruitment Agency,3,5.0,2,1,38177.1,Male,2017-01-17,0
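
One detail worth knowing: `first()` returns the first non-null entry of each column within a group, which is not necessarily the first row. The following sketch illustrates the difference to `head(1)` on an invented toy frame (the names `toy` and `val` are hypothetical).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a missing value in the first row of group "x".
toy = pd.DataFrame({
    "key": ["x", "x", "y"],
    "val": [np.nan, 2.0, 3.0],
})

grouped = toy.groupby("key")

# first() skips missing values: first non-null entry per column and group.
first_non_null = grouped.first()

# head(1) keeps the first row of each group as it is, NaN included.
first_rows = grouped.head(1)
```

In the employee data above the two approaches agree whenever the first row of a group has no missing values.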


# 2. Grouping with multiple keys: ```groupby([key1, key2])```

In [72]:
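# Group by department first, then by gender within each department.
group2 = df.groupby(['department', 'gender'])
# size() counts the rows in each (department, gender) group.
group2.size()

# Each count equals the length of the matching group, e.g. the 59 for
# HR / Female is len(group2.get_group(('HR', 'Female'))).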

department  gender 
HR          Female     59
            Male       46
            Unknown     1
Marketing   Female     40
            Male       54
            Unknown     1
Purchasing  Female     47
            Male       67
Sales       Female     42
            Male       45
Technology  Female     45
            Male       52
            Unknown     1
dtype: int64

In [73]:
group2.groups

{('HR', 'Female'): [8, 18, 20, 47, 51, 67, 70, 74, 91, 94, 95, 96, 115, 116, 118, 125, 127, 129, 147, 153, 169, 185, 196, 212, 229, 231, 235, 249, 272, 276, 277, 280, 284, 289, 302, 303, 308, 320, 336, 357, 361, 373, 388, 391, 396, 397, 398, 399, 414, 415, 419, 426, 457, 467, 473, 480, 481, 484, 489], ('HR', 'Male'): [0, 4, 21, 42, 43, 49, 50, 69, 71, 108, 122, 131, 134, 145, 148, 154, 164, 168, 184, 189, 194, 208, 239, 241, 242, 245, 248, 268, 273, 293, 318, 332, 334, 344, 354, 356, 358, 368, 377, 395, 409, 456, 470, 486, 492, 495], ('HR', 'Unknown'): [475], ('Marketing', 'Female'): [65, 73, 75, 93, 103, 106, 119, 130, 137, 161, 174, 188, 204, 207, 214, 220, 221, 238, 258, 260, 266, 267, 298, 310, 316, 322, 343, 359, 376, 384, 385, 393, 402, 403, 418, 438, 444, 451, 474, 482], ('Marketing', 'Male'): [10, 16, 26, 29, 32, 33, 44, 60, 64, 112, 123, 124, 144, 149, 160, 167, 173, 180, 181, 205, 206, 209, 213, 215, 243, 250, 259, 283, 291, 299, 307, 309, 319, 324, 330, 339, 342, 360, 367, 3

#### Get a specific group

In [74]:
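# With several grouping keys, a group is identified by a tuple.
group2.get_group(list(group2.groups)[0])  # first key, here ('HR', 'Female')
# or, equivalently, name the key directly:
group2.get_group(('HR', 'Female'))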

Unnamed: 0,age,department,education,recruitment_type,job_level,rating,awards,certifications,salary,gender,entry_date,satisfied
8,35,HR,Postgraduate,Referral,3,4.0,0,0,38177.1,Female,2015-04-02,1
18,54,HR,Postgraduate,On-Campus,1,5.0,4,0,21668.4,Female,2014-05-07,1
20,35,HR,Undergrad,On-Campus,2,4.0,4,0,26824.5,Female,2008-01-15,1
47,46,HR,Undergrad,Referral,3,2.0,1,0,38177.1,Female,2014-11-09,1
51,42,HR,Undergrad,On-Campus,1,1.0,5,0,21668.4,Female,2007-03-05,1
67,28,HR,Postgraduate,Referral,1,Unavailable,7,1,21668.4,Female,2020-09-10,0
70,30,HR,Undergrad,Recruitment Agency,1,Unavailable,0,0,21668.4,Female,2020-05-30,0
74,46,HR,Postgraduate,Referral,5,1.0,8,1,78075.0,Female,2007-03-15,1
91,36,HR,Undergrad,Recruitment Agency,3,2.0,8,0,38177.1,Female,2008-04-22,1
94,36,HR,Undergrad,Walk-in,1,1.0,0,1,21668.4,Female,2004-08-26,1


**Example**: Which department has the most satisfied employees?

In [75]:
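# Keep only the satisfied employees, then count them per department.
satisfied = df[df['satisfied'] == 1]
satisfied['department'].value_counts()
# or sum the 0/1 flag per department (same counts, ordered by department name)
satisfied.groupby('department')['satisfied'].sum()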

department
HR            73
Marketing     61
Purchasing    80
Sales         70
Technology    73
Name: satisfied, dtype: int64
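
To read the answer off programmatically rather than by eye, `idxmax()` can be applied to the grouped counts. This is a small sketch on an invented miniature of the employee table (the frame `staff` and its values are hypothetical).

```python
import pandas as pd

# Hypothetical miniature of the employee table; values are invented.
staff = pd.DataFrame({
    "department": ["HR", "HR", "Sales", "Sales", "Sales", "Technology"],
    "satisfied": [1, 0, 1, 1, 0, 1],
})

# Summing the 0/1 flag counts the satisfied employees per department.
satisfied_per_dept = staff.groupby("department")["satisfied"].sum()

# idxmax() returns the index label (department) holding the largest count.
top_department = satisfied_per_dept.idxmax()

# The mean of the flag gives the share of satisfied employees instead.
satisfaction_rate = staff.groupby("department")["satisfied"].mean()
```

Counts favor large departments, so the share from `satisfaction_rate` may be the fairer comparison when department sizes differ.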

# 2. Grouping based on date and time:

## 2.1. Using groupby

**Does recruiting change over the years?**

To find out, create a line chart that shows the number of employees over time (entry_date) by recruitment type.

In [77]:
df['entry_date'] = pd.to_datetime(df['entry_date'], format='%Y-%m-%d')


In [78]:
df.groupby([df["entry_date"].dt.year, df["recruitment_type"]]).size()

entry_date  recruitment_type  
2004        On-Campus              6
            Recruitment Agency     1
            Referral               6
            Walk-in               15
2005        On-Campus              5
                                  ..
2019        Walk-in                1
2020        On-Campus              8
            Recruitment Agency    12
            Referral               8
            Walk-in                1
Length: 68, dtype: int64
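
The grouped counts form a long Series with a two-level index. For a line chart with one line per recruitment type, the counts can be reshaped so that each type becomes a column. The sketch below assumes a small invented sample (`hires`) in place of the employee data.

```python
import pandas as pd

# Hypothetical sample standing in for the employee data; values are invented.
hires = pd.DataFrame({
    "entry_date": pd.to_datetime(
        ["2019-03-01", "2019-07-15", "2019-09-30", "2020-01-10", "2020-05-05"]
    ),
    "recruitment_type": ["Referral", "Walk-in", "Referral", "Referral", "On-Campus"],
})

# Count hires per (year, recruitment type).
counts = hires.groupby([hires["entry_date"].dt.year, "recruitment_type"]).size()

# Move the recruitment types into columns: one row per year, one column per type.
# fill_value=0 records "no hires" for combinations that never occur.
per_year = counts.unstack(fill_value=0)

# With matplotlib installed, per_year.plot(kind="line") draws one line per column.
```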

## 2.2. Using ```pd.Grouper()```

In [79]:
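# Bin entry_date by calendar year; pandas 2.2+ spells this alias 'YE'. axis=0 is the default, so it is omitted.
df.groupby(pd.Grouper(key='entry_date', freq='Y')).size()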

entry_date
2004-12-31    28
2005-12-31    29
2006-12-31    27
2007-12-31    34
2008-12-31    34
2009-12-31    31
2010-12-31    25
2011-12-31    20
2012-12-31    24
2013-12-31    30
2014-12-31    38
2015-12-31    28
2016-12-31    34
2017-12-31    23
2018-12-31    30
2019-12-31    36
2020-12-31    29
Freq: A-DEC, dtype: int64
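
A practical difference between the two approaches: grouping by an extracted date part only returns the values that occur in the data, while `pd.Grouper` bins the time axis at a fixed frequency and therefore also reports empty bins. The following sketch uses invented dates and the month-start alias `"MS"` to show this.

```python
import pandas as pd

# Hypothetical dates with no entries in February 2020.
events = pd.DataFrame({
    "entry_date": pd.to_datetime(["2020-01-05", "2020-01-20", "2020-03-10"]),
})

# Grouping by an extracted date part only yields the months that occur.
by_month_number = events.groupby(events["entry_date"].dt.month).size()

# pd.Grouper bins the time axis at a fixed frequency ("MS" = month start),
# so the empty February bin appears with a count of 0.
by_month_bin = events.groupby(pd.Grouper(key="entry_date", freq="MS")).size()
```

For plots over time the `pd.Grouper` result is often the more convenient one, because gaps show up as zeros instead of being silently skipped.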

## 3. Functions of groupby

### 3.1. ```sum()```

In [36]:
df.groupby(['department'])['awards'].sum()

department
HR            465
Marketing     407
Purchasing    531
Sales         396
Technology    486
Name: awards, dtype: int64

### 3.2. The ```sort``` parameter
By default, ```groupby``` sorts the group keys. Passing ```sort=False``` skips that step, which can speed things up; the groups then appear in the order in which they first occur in the data.

In [38]:
df.groupby(['department'], sort=False)['awards'].sum()

department
HR            465
Technology    486
Sales         396
Purchasing    531
Marketing     407
Name: awards, dtype: int64
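
The effect of `sort=False` on the order of the group keys can be checked on a tiny example. This is a minimal sketch with invented values, not the employee data.

```python
import pandas as pd

# Hypothetical toy data; "Sales" appears before "HR" in the rows.
toy = pd.DataFrame({
    "department": ["Sales", "HR", "Sales", "HR"],
    "awards": [1, 2, 3, 4],
})

# Default (sort=True): group keys come back in sorted order.
sorted_keys = toy.groupby("department")["awards"].sum().index.tolist()

# sort=False: group keys keep the order in which they first appear.
appearance_keys = toy.groupby("department", sort=False)["awards"].sum().index.tolist()
```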

### 3.3. Iterating through groups

In [39]:
grp = df.groupby('department')
# Iterating over a GroupBy yields (group name, sub-DataFrame) pairs.
for name, group_df in grp:
    print(name)
    print(group_df)
    print()

HR
     age department     education    recruitment_type  job_level rating   
0     28         HR  Postgraduate            Referral          5    2.0  \
4     33         HR     Undergrad  Recruitment Agency          2    1.0   
8     35         HR  Postgraduate            Referral          3    4.0   
18    54         HR  Postgraduate           On-Campus          1    5.0   
20    35         HR     Undergrad           On-Campus          2    4.0   
..   ...        ...           ...                 ...        ...    ...   
484   34         HR  Postgraduate             Walk-in          2    5.0   
486   54         HR  Postgraduate  Recruitment Agency          4    3.0   
489   33         HR     Undergrad           On-Campus          2    1.0   
492   33         HR     Undergrad           On-Campus          3    5.0   
495   49         HR  Postgraduate           On-Campus          2    5.0   

     awards  certifications   salary  gender  entry_date  satisfied  
0         1               

## Applying a function to each group

After splitting the data into groups, we apply a function to each group. The operation usually falls into one of the following categories:
* **Aggregation**: compute a summary statistic (or statistics) for each group, for example group sums or means. An aggregation function returns a single value per group.
* **Transformation**: perform a group-specific computation and return a like-indexed object, for example filling NAs within each group with a value derived from that group.
* **Filtration**: discard some groups according to a group-wise computation that evaluates to True or False, for example filtering out data based on the group sum or mean.
  
### 1. Aggregation 

#### Using ```aggregate()```
**Code #1**: Aggregation via the ```aggregate``` method
 

In [46]:
grp1 = df.groupby('department')
# Pass the function name as a string; recent pandas versions warn when the builtin sum or np.sum is passed.
grp1['awards'].aggregate('sum')

department
HR            465
Marketing     407
Purchasing    531
Sales         396
Technology    486
Name: awards, dtype: int64

#### Applying multiple functions to the same column
We can apply multiple functions at once by passing a list of functions to ```agg```; the output is a DataFrame with one column per function.

In [48]:
grp1['awards'].agg(['sum', 'mean', 'std'])


#### Applying different functions to different columns
To apply a different aggregation to each column of a DataFrame, we can pass ```aggregate``` a dictionary that maps column names to functions.

In [49]:
grp1.agg({'age' : 'sum', 'awards' : 'std'})

department,age,awards
HR,4173,2.731024
Marketing,3698,3.030565
Purchasing,4576,2.706364
Sales,3377,2.97571
Technology,4023,3.514091
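
When the output columns should carry descriptive names, pandas also offers named aggregation, where each keyword argument is a `(column, function)` pair. The sketch below uses an invented miniature table; the output names `total_age`, `mean_awards` and `max_awards` are chosen for illustration.

```python
import pandas as pd

# Hypothetical miniature of the employee table; values are invented.
staff = pd.DataFrame({
    "department": ["HR", "HR", "Sales", "Sales"],
    "age": [30, 40, 25, 35],
    "awards": [1, 3, 2, 2],
})

# Named aggregation: keyword = (column, function). The keyword becomes the
# name of the output column, so the same input column can be used twice.
summary = staff.groupby("department").agg(
    total_age=("age", "sum"),
    mean_awards=("awards", "mean"),
    max_awards=("awards", "max"),
)
```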


### 2. Transformation
Transformation performs a group-specific computation and returns a like-indexed object: the ```transform``` method returns an object with the same index (and therefore the same size) as the one being grouped. The function passed to ```transform``` must:
* Return a result that is either the same size as the group chunk or broadcastable to the size of the group chunk (for example, a scalar)
* Operate column-by-column on the group chunk
* Not perform in-place operations on the group chunk.

#### 2.1. Perform group-specific computations and return a like-indexed object

In [55]:
# Standardize awards within each department (z-score), scaled by 10.
grp3 = df.groupby('department')['awards']
sc = lambda x: (x - x.mean()) / x.std() * 10
grp3.transform(sc)

0     -12.401183
1      -8.420909
2      -8.420909
3     -15.296261
4       2.245340
         ...    
495     5.906970
496    -8.420909
497    -7.537244
498   -11.266596
499    -5.575223
Name: awards, Length: 500, dtype: float64
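
Filling missing values with a value derived from each group is another common use of `transform`. The following is a minimal sketch on invented data; the column `rating_filled` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing rating per department.
toy = pd.DataFrame({
    "department": ["HR", "HR", "HR", "Sales", "Sales"],
    "rating": [2.0, np.nan, 4.0, 5.0, np.nan],
})

# transform("mean") broadcasts each group's mean back to every row of the
# group, so the result lines up with the original index.
group_mean = toy.groupby("department")["rating"].transform("mean")

# Fill each missing rating with the mean of its own department.
toy["rating_filled"] = toy["rating"].fillna(group_mean)
```

Because `group_mean` has the same index as `toy`, it can be passed straight to `fillna` without any merging.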

### 3. Filtration
Filtration discards whole groups according to a group-wise computation that evaluates to True or False. To filter, we call the ```filter``` method with a function that returns True for every group to keep.

In [56]:
grp = df.groupby('department')['awards']
# Keep groups with at least 2 rows; every department qualifies, so all 500 rows remain.
grp.filter(lambda x: len(x) >= 2)

0      1
1      2
2      2
3      0
4      5
      ..
495    6
496    2
497    2
498    1
499    3
Name: awards, Length: 500, dtype: int64
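
A condition that actually removes groups makes the behavior easier to see. The sketch below filters on the group sum, using an invented toy table and a threshold of 9 picked purely for illustration.

```python
import pandas as pd

# Hypothetical toy data; "Sales" has fewer total awards than the others.
toy = pd.DataFrame({
    "department": ["HR", "HR", "Sales", "Sales", "Technology"],
    "awards": [4, 5, 1, 2, 9],
})

# Keep only the rows of departments whose awards sum to at least 9.
# The function receives one group at a time and must return True or False.
kept = toy.groupby("department").filter(lambda g: g["awards"].sum() >= 9)
```

The result contains the original rows of the surviving groups, not one row per group, which is what distinguishes filtration from aggregation.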