##### **Pandas GroupBy**

GroupBy is a simple concept: we split the data into groups by category and apply a function to each group. Simple as it is, the technique is extremely valuable and widely used in data science. Real projects involve large amounts of data and repeated experimentation, and GroupBy lets us aggregate that data efficiently, both in runtime and in the amount of code we have to write. A GroupBy operation involves one or more of the following steps:
 

Splitting : We split the data into groups by applying some condition to the dataset.

Applying : We apply a function to each group independently.

Combining : We combine the per-group results into a single data structure.

The following image illustrates the process involved in GroupBy.
1. Group the unique values from the Team column 
 

![](attachment:image.png)
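The three steps can also be spelled out by hand. The following is a minimal sketch, assuming a small made-up DataFrame with `Team` and `Goals` columns (the data is hypothetical and not taken from the image). It is not how pandas implements groupby internally, only an equivalent manual computation.

```python
import pandas as pd

# Hypothetical data, used only for illustration
df = pd.DataFrame({
    "Team": ["A", "B", "A", "B", "C"],
    "Goals": [1, 2, 3, 4, 5],
})

# Split: one sub-DataFrame per unique value of Team
pieces = {team: df[df["Team"] == team] for team in df["Team"].unique()}

# Apply: compute the same summary for each piece independently
sums = {team: piece["Goals"].sum() for team, piece in pieces.items()}

# Combine: collect the per-group results into a single Series
manual = pd.Series(sums, name="Goals").sort_index()

# groupby performs the same three steps in a single call
grouped = df.groupby("Team")["Goals"].sum()

print(manual)
print(grouped)
```

Both printed Series should hold the same totals per team, which is the point of the comparison.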

In [2]:
# importing pandas module
import pandas as pd 

# Define a dictionary containing employee data 
data1 = {'Name':['Jai', 'Anuj', 'Jai', 'Princi', 
				'Gaurav', 'Anuj', 'Princi', 'Abhi'], 
		'Age':[27, 24, 22, 32, 
			33, 36, 27, 32], 
		'Address':['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj',
				'Jaunpur', 'Kanpur', 'Allahabad', 'Aligarh'], 
		'Qualification':['Msc', 'MA', 'MCA', 'Phd',
						'B.Tech', 'B.com', 'Msc', 'MA']} 
	

# Convert the dictionary into DataFrame 
df = pd.DataFrame(data1)

display(df) 


,Name,Age,Address,Qualification
0,Jai,27,Nagpur,Msc
1,Anuj,24,Kanpur,MA
2,Jai,22,Allahabad,MCA
3,Princi,32,Kannuaj,Phd
4,Gaurav,33,Jaunpur,B.Tech
5,Anuj,36,Kanpur,B.com
6,Princi,27,Allahabad,Msc
7,Abhi,32,Aligarh,MA


In [8]:
# group by a single key and show which row labels fall into each group
print(df.groupby('Address').groups)


{'Aligarh': [7], 'Allahabad': [2, 6], 'Jaunpur': [4], 'Kannuaj': [3], 'Kanpur': [1, 5], 'Nagpur': [0]}
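Besides inspecting `.groups`, a GroupBy object can be iterated directly to look at the rows in each group. This is a small sketch on a reduced three-row frame in the style of the example above (the reduced frame is an assumption made to keep the output short).

```python
import pandas as pd

# Reduced, illustrative frame with one repeated address
df = pd.DataFrame({
    "Name": ["Anuj", "Jai", "Princi"],
    "Address": ["Kanpur", "Allahabad", "Allahabad"],
})

# Each iteration yields the group key and the sub-DataFrame for that key
group_sizes = {}
for address, group in df.groupby("Address"):
    group_sizes[address] = len(group)
    print(address)
    print(group)

print(group_sizes)
```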


##### **Now we print the first entries in all the groups formed.**

In [4]:
gk = df.groupby("Name")
gk.first()

Name,Age,Address,Qualification
Abhi,32,Aligarh,MA
Anuj,24,Kanpur,MA
Gaurav,33,Jaunpur,B.Tech
Jai,27,Nagpur,Msc
Princi,32,Kannuaj,Phd


##### **Grouping data with multiple keys :**

To group data by multiple keys, we pass a list of column names to the groupby function.
 

In [19]:
# importing pandas module
import pandas as pd 

# Define a dictionary containing employee data 
data1 = {'Name':['Jai', 'Anuj', 'Jai', 'Princi', 
				'Gaurav', 'Anuj', 'Princi', 'Abhi'], 
		'Age':[27, 24, 22, 32, 
			33, 36, 27, 32], 
		'Address':['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj',
				'Jaunpur', 'Kanpur', 'Allahabad', 'Aligarh'], 
		'Qualification':['Msc', 'MA', 'MCA', 'Phd',
						'B.Tech', 'B.com', 'Msc', 'MA']} 
	

# Convert the dictionary into DataFrame 
df = pd.DataFrame(data1)

display(df) 
print("\nAfter groupby\n")

print(df.groupby(["Name","Qualification"]).groups)

print("\nUsing for loop: \n")
for key, index in df.groupby(["Name","Qualification"]).groups.items():
    print(key, index)

,Name,Age,Address,Qualification
0,Jai,27,Nagpur,Msc
1,Anuj,24,Kanpur,MA
2,Jai,22,Allahabad,MCA
3,Princi,32,Kannuaj,Phd
4,Gaurav,33,Jaunpur,B.Tech
5,Anuj,36,Kanpur,B.com
6,Princi,27,Allahabad,Msc
7,Abhi,32,Aligarh,MA



After groupby

{('Abhi', 'MA'): [7], ('Anuj', 'B.com'): [5], ('Anuj', 'MA'): [1], ('Gaurav', 'B.Tech'): [4], ('Jai', 'MCA'): [2], ('Jai', 'Msc'): [0], ('Princi', 'Msc'): [6], ('Princi', 'Phd'): [3]}

Using for loop: 

('Abhi', 'MA') Index([7], dtype='int64')
('Anuj', 'B.com') Index([5], dtype='int64')
('Anuj', 'MA') Index([1], dtype='int64')
('Gaurav', 'B.Tech') Index([4], dtype='int64')
('Jai', 'MCA') Index([2], dtype='int64')
('Jai', 'Msc') Index([0], dtype='int64')
('Princi', 'Msc') Index([6], dtype='int64')
('Princi', 'Phd') Index([3], dtype='int64')


##### **Sorting of group keys :**

Group keys are sorted by default in a groupby operation. We can pass sort=False to skip the sorting step, which can give a speedup.

In [24]:
# importing pandas module
import pandas as pd 

# Define a dictionary containing employee data 
data1 = {'Name':['Jai', 'Anuj', 'Jai', 'Princi', 
				'Gaurav', 'Anuj', 'Princi', 'Abhi'], 
		'Marks':[27, 24, 22, 32, 
			33, 36, 27, 32], } 
	

# Convert the dictionary into DataFrame 
df = pd.DataFrame(data1)

display(df)

,Name,Marks
0,Jai,27
1,Anuj,24
2,Jai,22
3,Princi,32
4,Gaurav,33
5,Anuj,36
6,Princi,27
7,Abhi,32


In [25]:
# default behaviour: the group keys are sorted (sort=True)
display(df.groupby("Name").sum())

Name,Marks
Abhi,32
Anuj,60
Gaurav,33
Jai,49
Princi,59


In [31]:
# passing sort=True explicitly gives the same result as the default
display(df.groupby("Name", sort=True).sum())

Name,Marks
Abhi,32
Anuj,60
Gaurav,33
Jai,49
Princi,59
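
To see what sort=False changes, the sketch below runs the same aggregation both ways on the Name/Marks data. With sort=False the groups are expected to appear in the order in which each key is first seen in the data, rather than alphabetically.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Jai", "Anuj", "Jai", "Princi", "Gaurav", "Anuj", "Princi", "Abhi"],
    "Marks": [27, 24, 22, 32, 33, 36, 27, 32],
})

# Default: group keys come back sorted
sorted_keys = df.groupby("Name").sum()

# sort=False: group keys keep their order of first appearance
unsorted_keys = df.groupby("Name", sort=False).sum()

print(list(sorted_keys.index))
print(list(unsorted_keys.index))
```

The sums themselves are the same in both cases; only the row order of the result differs.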


To select a single group, we use GroupBy.get_group(). When the data is grouped by several columns, the group key is passed as a tuple.

In [32]:
# importing pandas module
import pandas as pd 

# Define a dictionary containing employee data 
data1 = {'Name':['Jai', 'Anuj', 'Jai', 'Princi', 
				'Gaurav', 'Anuj', 'Princi', 'Abhi'], 
		'Age':[27, 24, 22, 32, 
			33, 36, 27, 32], 
		'Address':['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj',
				'Jaunpur', 'Kanpur', 'Allahabad', 'Aligarh'], 
		'Qualification':['Msc', 'MA', 'MCA', 'Phd',
						'B.Tech', 'B.com', 'Msc', 'MA']} 
	

# Convert the dictionary into DataFrame 
df = pd.DataFrame(data1)

display(df) 


,Name,Age,Address,Qualification
0,Jai,27,Nagpur,Msc
1,Anuj,24,Kanpur,MA
2,Jai,22,Allahabad,MCA
3,Princi,32,Kannuaj,Phd
4,Gaurav,33,Jaunpur,B.Tech
5,Anuj,36,Kanpur,B.com
6,Princi,27,Allahabad,Msc
7,Abhi,32,Aligarh,MA


In [42]:
# select one group from an object grouped on a single column
gb = df.groupby("Address")
print(gb.get_group("Allahabad"))

# select one group from an object grouped on multiple columns;
# the key is a tuple with one value per grouping column
gb = df.groupby(["Name","Qualification"])
print("\n",gb.get_group(("Jai","Msc")))


     Name  Age    Address Qualification
2     Jai   22  Allahabad           MCA
6  Princi   27  Allahabad           Msc

   Name  Age Address Qualification
0  Jai   27  Nagpur           Msc


##### **Applying function to group**
After splitting the data into groups, we apply a function to each group. The operation is usually one of the following:

Aggregation : We compute a summary statistic (or statistics) for each group. For example, compute group sums or means.

Transformation : We perform some group-specific computation and return a like-indexed object. For example, filling NAs within groups with a value derived from each group.

Filtration : We discard some groups according to a group-wise computation that evaluates to True or False. For example, filtering out data based on the group sum or mean.

#### **Aggregation :**

Aggregation is a process in which we compute a summary statistic about each group. An aggregation function returns a single aggregated value for each group. After splitting the data into groups with the groupby function, several aggregation operations can be performed on the grouped data.


##### Code #1: Using aggregation via the aggregate method 

In [44]:
# importing pandas module
import pandas as pd 

# importing numpy as np
import numpy as np

# Define a dictionary containing employee data 
data1 = {'Name':['Jai', 'Anuj', 'Jai', 'Princi', 
				'Gaurav', 'Anuj', 'Princi', 'Abhi'], 
		'Age':[27, 24, 22, 32, 
			33, 36, 27, 32], 
		'Address':['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj',
				'Jaunpur', 'Kanpur', 'Allahabad', 'Aligarh'], 
		'Qualification':['Msc', 'MA', 'MCA', 'Phd',
						'B.Tech', 'B.com', 'Msc', 'MA']} 
	

# Convert the dictionary into DataFrame 
df = pd.DataFrame(data1)

display(df)

,Name,Age,Address,Qualification
0,Jai,27,Nagpur,Msc
1,Anuj,24,Kanpur,MA
2,Jai,22,Allahabad,MCA
3,Princi,32,Kannuaj,Phd
4,Gaurav,33,Jaunpur,B.Tech
5,Anuj,36,Kanpur,B.com
6,Princi,27,Allahabad,Msc
7,Abhi,32,Aligarh,MA


In [59]:
print("Now we perform aggregation using the aggregate method")
gb = df.groupby("Name")
# pass the function name as a string; recent pandas versions warn when the
# builtin sum or np.sum is passed here. Note that "sum" concatenates string columns.
display(gb.aggregate("sum"))

print("\nNow we perform aggregation on a group containing multiple keys\n")
gb = df.groupby(["Name","Qualification"])
display(gb.aggregate("sum"))

print("\nNow we apply multiple functions by passing a list of functions.\n")
gb = df.groupby(["Name"])
display(gb["Age"].agg(["sum","mean","std"]))

Now we perform aggregation using the aggregate method


Name,Age,Address,Qualification
Abhi,32,Aligarh,MA
Anuj,60,KanpurKanpur,MAB.com
Gaurav,33,Jaunpur,B.Tech
Jai,49,NagpurAllahabad,MscMCA
Princi,59,KannuajAllahabad,PhdMsc



Now we perform aggregation on a group containing multiple keys



Name,Qualification,Age,Address
Abhi,MA,32,Aligarh
Anuj,B.com,36,Kanpur
Anuj,MA,24,Kanpur
Gaurav,B.Tech,33,Jaunpur
Jai,MCA,22,Allahabad
Jai,Msc,27,Nagpur
Princi,Msc,27,Allahabad
Princi,Phd,32,Kannuaj



Now we apply multiple functions by passing a list of functions.



Name,sum,mean,std
Abhi,32,32.0,
Anuj,60,30.0,8.485281
Gaurav,33,33.0,
Jai,49,24.5,3.535534
Princi,59,29.5,3.535534


##### **Applying different functions to DataFrame columns :** 
To apply a different aggregation to each column of a DataFrame, we can pass a dictionary that maps column names to aggregation functions to aggregate.

In [60]:
# importing pandas module
import pandas as pd 

# importing numpy as np
import numpy as np

# Define a dictionary containing employee data 
data1 = {'Name':['Jai', 'Anuj', 'Jai', 'Princi', 
				'Gaurav', 'Anuj', 'Princi', 'Abhi'], 
		'Age':[27, 24, 22, 32, 
			33, 36, 27, 32], 
		'Address':['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj',
				'Jaunpur', 'Kanpur', 'Allahabad', 'Aligarh'], 
		'Qualification':['Msc', 'MA', 'MCA', 'Phd',
						'B.Tech', 'B.com', 'Msc', 'MA'],
		'Score': [23, 34, 35, 45, 47, 50, 52, 53]} 
	

# Convert the dictionary into DataFrame 
df = pd.DataFrame(data1)

display(df) 


,Name,Age,Address,Qualification,Score
0,Jai,27,Nagpur,Msc,23
1,Anuj,24,Kanpur,MA,34
2,Jai,22,Allahabad,MCA,35
3,Princi,32,Kannuaj,Phd,45
4,Gaurav,33,Jaunpur,B.Tech,47
5,Anuj,36,Kanpur,B.com,50
6,Princi,27,Allahabad,Msc,52
7,Abhi,32,Aligarh,MA,53


In [61]:
gb = df.groupby("Name")
display(gb.agg({"Age":"sum","Score":"mean"}))

Name,Age,Score
Abhi,32,53.0
Anuj,60,42.0
Gaurav,33,47.0
Jai,49,29.0
Princi,59,48.5
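
A dictionary allows only one output column per input column name. When several statistics of the same column are needed, or the output columns should get their own names, pandas also supports named aggregation with keyword arguments of the form `new_name=(column, function)`. The sketch below is one possible way to use it; the output names `total_age`, `mean_score` and `max_score` are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Jai", "Anuj", "Jai", "Princi", "Gaurav", "Anuj", "Princi", "Abhi"],
    "Age": [27, 24, 22, 32, 33, 36, 27, 32],
    "Score": [23, 34, 35, 45, 47, 50, 52, 53],
})

# Each keyword becomes an output column: name=(input column, aggregation)
summary = df.groupby("Name").agg(
    total_age=("Age", "sum"),
    mean_score=("Score", "mean"),
    max_score=("Score", "max"),
)

print(summary)
```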


#### **Transformation :** 

Transformation is a process in which we perform some group-specific computation and return a like-indexed object. The transform method returns an object that is indexed the same (same size) as the one being grouped. The transform function must:

Return a result that is either the same size as the group chunk or broadcastable to the size of the group chunk

Operate column-by-column on the group chunk

Not perform in-place operations on the group chunk.

In [62]:
# importing pandas module
import pandas as pd 

# importing numpy as np
import numpy as np

# Define a dictionary containing employee data 
data1 = {'Name':['Jai', 'Anuj', 'Jai', 'Princi', 
				'Gaurav', 'Anuj', 'Princi', 'Abhi'], 
		'Age':[27, 24, 22, 32, 
			33, 36, 27, 32], 
		'Address':['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj',
				'Jaunpur', 'Kanpur', 'Allahabad', 'Aligarh'], 
		'Qualification':['Msc', 'MA', 'MCA', 'Phd',
						'B.Tech', 'B.com', 'Msc', 'MA'],
		'Score': [23, 34, 35, 45, 47, 50, 52, 53]} 
	

# Convert the dictionary into DataFrame 
df = pd.DataFrame(data1)

display(df) 


,Name,Age,Address,Qualification,Score
0,Jai,27,Nagpur,Msc,23
1,Anuj,24,Kanpur,MA,34
2,Jai,22,Allahabad,MCA,35
3,Princi,32,Kannuaj,Phd,45
4,Gaurav,33,Jaunpur,B.Tech,47
5,Anuj,36,Kanpur,B.com,50
6,Princi,27,Allahabad,Msc,52
7,Abhi,32,Aligarh,MA,53


In [65]:
gb = df.groupby("Name")
# double every value; for string columns, * 2 repeats the string
fun1 = lambda x: x*2
gb.transform(fun1)

,Age,Address,Qualification,Score
0,54,NagpurNagpur,MscMsc,46
1,48,KanpurKanpur,MAMA,68
2,44,AllahabadAllahabad,MCAMCA,70
3,64,KannuajKannuaj,PhdPhd,90
4,66,JaunpurJaunpur,B.TechB.Tech,94
5,72,KanpurKanpur,B.comB.com,100
6,54,AllahabadAllahabad,MscMsc,104
7,64,AligarhAligarh,MAMA,106
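
Doubling every value does not actually depend on the groups. A more typical use of transform is the one mentioned earlier: filling NAs within each group with a value derived from that group. The sketch below assumes a small made-up frame with missing scores and fills each gap with the mean score of the same name.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing score
df = pd.DataFrame({
    "Name": ["Jai", "Anuj", "Jai", "Anuj", "Jai"],
    "Score": [20.0, 30.0, np.nan, 50.0, 40.0],
})

# transform("mean") returns one value per original row:
# the mean of the group that the row belongs to
group_mean = df.groupby("Name")["Score"].transform("mean")

# Use the group mean only where the score is missing
df["Score_filled"] = df["Score"].fillna(group_mean)

print(df)
```

Because the result of transform has the same index as the original frame, it can be assigned back or combined with the original columns directly.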


#### **Filtration :** 
Filtration is a process in which we discard some groups according to a group-wise computation that evaluates to True or False. To filter groups, we call the filter method with a function that returns True for every group we want to keep.

In [93]:
# keep only the rows whose Name group contains at least two rows
grp = df.groupby('Name')
grp.filter(lambda x: len(x) >= 2)


,Name,Age,Address,Qualification,Score
0,Jai,27,Nagpur,Msc,23
1,Anuj,24,Kanpur,MA,34
2,Jai,22,Allahabad,MCA,35
3,Princi,32,Kannuaj,Phd,45
5,Anuj,36,Kanpur,B.com,50
6,Princi,27,Allahabad,Msc,52


##### **Grouping Rows in pandas**

Example 1:

To group rows in pandas, we first create a DataFrame.

In [94]:
# importing Pandas 
import pandas as pd 

# example dataframe 
example = {'Team':['Arsenal', 'Manchester United', 'Arsenal', 
				'Arsenal', 'Chelsea', 'Manchester United', 
				'Manchester United', 'Chelsea', 'Chelsea', 'Chelsea'], 
					
		'Player':['Ozil', 'Pogba', 'Lucas', 'Aubameyang', 
					'Hazard', 'Mata', 'Lukaku', 'Morata', 
										'Giroud', 'Kante'], 
										
		'Goals':[5, 3, 6, 4, 9, 2, 0, 5, 2, 3] } 

df = pd.DataFrame(example) 

display(df) 


,Team,Player,Goals
0,Arsenal,Ozil,5
1,Manchester United,Pogba,3
2,Arsenal,Lucas,6
3,Arsenal,Aubameyang,4
4,Chelsea,Hazard,9
5,Manchester United,Mata,2
6,Manchester United,Lukaku,0
7,Chelsea,Morata,5
8,Chelsea,Giroud,2
9,Chelsea,Kante,3


In [99]:
# create a GroupBy object: the Goals column grouped by the values of the Team column
goals_by_team = df["Goals"].groupby(df["Team"])
print(goals_by_team.groups)
# average goals per team
goals_by_team.mean()

{'Arsenal': [0, 2, 3], 'Chelsea': [4, 7, 8, 9], 'Manchester United': [1, 5, 6]}


Team
Arsenal              5.000000
Chelsea              4.750000
Manchester United    1.666667
Name: Goals, dtype: float64
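
Grouping a column by another Series, as done here, is one of two common spellings. The sketch below, on a shortened made-up version of the team data, compares it with grouping the DataFrame first and then selecting the column; both are expected to give the same per-team means.

```python
import pandas as pd

# Shortened, illustrative version of the team data
df = pd.DataFrame({
    "Team": ["Arsenal", "Manchester United", "Arsenal", "Chelsea"],
    "Goals": [5, 3, 6, 9],
})

# Group a Series by another Series
by_series = df["Goals"].groupby(df["Team"]).mean()

# Group the DataFrame by column name, then select the column
by_column = df.groupby("Team")["Goals"].mean()

print(by_series)
print(by_column)
```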

In [109]:
import pandas as pd 

# example dataframe 
example = {'Team':['Australia', 'England', 'South Africa', 
				'Australia', 'England', 'India', 'India', 
						'South Africa', 'England', 'India'], 
						
		'Player':['Ricky Ponting', 'Joe Root', 'Hashim Amla', 
					'David Warner', 'Jos Buttler', 'Virat Kohli', 
					'Rohit Sharma', 'David Miller', 'Eoin Morgan', 
												'Dinesh Karthik'], 
												
		'Runs':[345, 336, 689, 490, 989, 672, 560, 455, 342, 376], 
			
		'Salary':[34500, 33600, 68900, 49000, 98899, 
					67562, 56760, 45675, 34542, 31176] } 

df = pd.DataFrame(example) 
df 


,Team,Player,Runs,Salary
0,Australia,Ricky Ponting,345,34500
1,England,Joe Root,336,33600
2,South Africa,Hashim Amla,689,68900
3,Australia,David Warner,490,49000
4,England,Jos Buttler,989,98899
5,India,Virat Kohli,672,67562
6,India,Rohit Sharma,560,56760
7,South Africa,David Miller,455,45675
8,England,Eoin Morgan,342,34542
9,India,Dinesh Karthik,376,31176


In [114]:
# group the Salary column by Team; the mean gives the average salary per team
salary_by_team = df["Salary"].groupby(df["Team"])
display(salary_by_team.groups)
display(salary_by_team.mean())

{'Australia': [0, 3], 'England': [1, 4, 8], 'India': [5, 6, 9], 'South Africa': [2, 7]}

Team
Australia       41750.000000
England         55680.333333
India           51832.666667
South Africa    57287.500000
Name: Salary, dtype: float64

##### **Combining multiple columns in Pandas groupby with dictionary**

Let's see how to combine multiple columns in pandas using groupby with a dictionary, with the help of a few examples.

**Example #1:**

**Explanation**

Here we group Column 1.1, Column 1.2 and Column 1.3 into Column 1, and Column 2.1 and Column 2.2 into Column 2.

Notice that each output value is the maximum of the grouped columns in that row. For example, in Column 1 the value of the first row is the maximum of Column 1.1 Row 1, Column 1.2 Row 1 and Column 1.3 Row 1, i.e. max(14, 10, 1) = 14.


In [120]:
# importing pandas as pd 
import pandas as pd 

# Creating a dictionary 
d = {'id':['1', '2', '3'], 
	'Column 1.1':[14, 15, 16], 
	'Column 1.2':[10, 10, 10], 
	'Column 1.3':[1, 4, 5], 
	'Column 2.1':[1, 2, 3], 
	'Column 2.2':[10, 10, 10], } 

# Converting dictionary into a data-frame 
df = pd.DataFrame(d) 
display(df) 

# Set the index of df as Column 'id' 
df = df.set_index('id')


,id,Column 1.1,Column 1.2,Column 1.3,Column 2.1,Column 2.2
0,1,14,10,1,1,10
1,2,15,10,4,2,10
2,3,16,10,5,3,10


In [121]:
# Creating the groupby dictionary 
groupby_dict = {'Column 1.1':'Column 1', 
				'Column 1.2':'Column 1', 
				'Column 1.3':'Column 1', 
				'Column 2.1':'Column 2', 
				'Column 2.2':'Column 2' } 

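# Group the columns using groupby_dict and take the maximum of each group per row.
# groupby(..., axis=1) is deprecated since pandas 2.1, so we group the rows
# of the transposed frame and transpose the result back.
df = df.T.groupby(groupby_dict).max().T
print(df) 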


    Column 1  Column 2
id                    
1         14        10
2         15        10
3         16        10


**Example #2:**

In [124]:
# importing pandas as pd 
import pandas as pd 

# Create dictionary with data 
data = { 
	"ID":[1, 2, 3], 
	"Movies":["The Godfather", "Fight Club", "Casablanca"], 
	"Week_1_Viewers":[30, 30, 40], 
	"Week_2_Viewers":[60, 40, 80], 
	"Week_3_Viewers":[40, 20, 20] }

# Convert dictionary to dataframe 
# (the variable is named data so that the builtin dict is not shadowed)
df = pd.DataFrame(data)
print(df) 
df = df.set_index('ID') 


   ID         Movies  Week_1_Viewers  Week_2_Viewers  Week_3_Viewers
0   1  The Godfather              30              60              40
1   2     Fight Club              30              40              20
2   3     Casablanca              40              80              20


In [125]:
# Create the groupby_dict  
groupby_dict = {"Week_1_Viewers":"Total_Viewers", 
           "Week_2_Viewers":"Total_Viewers", 
           "Week_3_Viewers":"Total_Viewers", 
           "Movies":"Movies" } 

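# groupby(..., axis=1) is deprecated since pandas 2.1; grouping the rows of the
# transposed frame and transposing back combines the columns in the same way
df = df.T.groupby(groupby_dict).sum().T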
df

ID,Movies,Total_Viewers
1,The Godfather,130
2,Fight Club,90
3,Casablanca,140
