## Aggregating

Aggregation in pandas applies a mathematical or logical operation to a dataset and returns a summary of the result, such as a sum, a minimum, or a mean.

In [6]:
import pandas as pd 

df = pd.DataFrame([[9, 4, 8, 9],
                   [8, 10, 7, 6],
                   [7, 6, 8, 5]],
                  columns=['Maths', 'English',
                           'Science', 'History'])

display(df) 

,Maths,English,Science,History
0,9,4,8,9
1,8,10,7,6
2,7,6,8,5



We use the `agg()` function to calculate the sum, minimum, and maximum of each column in our dataset.

In [62]:
df.agg(['sum', 'min', 'max'])

,Maths,English,Science,History
sum,24,20,23,20
min,7,4,7,5
max,9,10,8,9
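
`agg()` also accepts a dictionary, which lets each column get its own set of functions, and an `axis` argument for aggregating across rows instead of columns. The following is a minimal sketch on the same marks table; the variable names `per_column` and `row_totals` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame(
    [[9, 4, 8, 9], [8, 10, 7, 6], [7, 6, 8, 5]],
    columns=["Maths", "English", "Science", "History"],
)

# Pass a dict to apply different functions to different columns.
# Cells for functions that were not requested for a column are NaN.
per_column = df.agg({"Maths": ["min", "max"], "English": ["sum"]})
print(per_column)

# axis=1 aggregates across each row, giving one total per student.
row_totals = df.agg("sum", axis=1)
print(row_totals)
```

Columns that are left out of the dictionary (here `Science` and `History`) do not appear in the result.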


In [9]:
import pandas as pd

data = {
    'department': ['HR', 'HR', 'IT', 'IT', 'Sales', 'Sales'],
    'employee_id': [1, 2, 3, 4, 5, 6],
    'performance_score': [80, 90, 70, 85, 60, 95],
    'salary': [50000, 55000, 60000, 65000, 70000, 75000]
}
df = pd.DataFrame(data)
df

,department,employee_id,performance_score,salary
0,HR,1,80,50000
1,HR,2,90,55000
2,IT,3,70,60000
3,IT,4,85,65000
4,Sales,5,60,70000
5,Sales,6,95,75000


### **Applying Aggregation Functions:**

1. **`sum()`**: Compute the total sum of values in a column.

In [11]:
total_salary = df['salary'].sum()
print("Total Salary:", total_salary)

Total Salary: 375000


2. **`min()`**: Compute the minimum value in a column.
   
  

In [13]:
min_salary = df['salary'].min()
print("Minimum Salary:", min_salary)

Minimum Salary: 50000


3. **`max()`**: Compute the maximum value in a column.
   
  

In [15]:
max_salary = df['salary'].max()
print("Maximum Salary:", max_salary)

Maximum Salary: 75000


4. **`mean()`**: Compute the mean (average) value of a column.
   
   
   

In [17]:
average_salary = df['salary'].mean()
print("Average Salary:", average_salary)

Average Salary: 62500.0



5. **`size`**: Return the number of elements in a column. On a Series, `size` is an attribute, so it is used without parentheses.
   
   
   
   

In [19]:
total_employees = df['employee_id'].size
print("Total Employees:", total_employees)

Total Employees: 6




6. **`describe()`**: Generate descriptive statistics of a column.
   
   

In [21]:
salary_description = df['salary'].describe()
print("Salary Description:")
salary_description

Salary Description:


count        6.000000
mean     62500.000000
std       9354.143467
min      50000.000000
25%      56250.000000
50%      62500.000000
75%      68750.000000
max      75000.000000
Name: salary, dtype: float64


7. **`first()`**: Return the first non-null value in each group (useful with `groupby`).
  

In [23]:
first_salaries = df.groupby('department')['salary'].first()
print("First Salary in Each Department:")
print(first_salaries)

First Salary in Each Department:
department
HR       50000
IT       60000
Sales    70000
Name: salary, dtype: int64




8. **`last()`**: Return the last non-null value in each group (useful with `groupby`).
   
   
   
   

In [25]:
last_salaries = df.groupby('department')['salary'].last()
print("Last Salary in Each Department:")
print(last_salaries)

Last Salary in Each Department:
department
HR       55000
IT       65000
Sales    75000
Name: salary, dtype: int64


9. **`count()`**: Compute the count of non-null values in a column.
  
   
   

In [27]:
count_salaries = df['salary'].count()
print("Count of Salaries:", count_salaries)

Count of Salaries: 6
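
In the employee data, `size` and `count()` both give 6 because no salary is missing. The two differ once a column contains nulls. The sketch below uses a small hypothetical Series with one missing entry to show the difference.

```python
import numpy as np
import pandas as pd

# Hypothetical salary column with one missing entry.
salaries = pd.Series([50000, np.nan, 60000], name="salary")

n_elements = salaries.size      # every element, including the NaN
n_non_null = salaries.count()   # only the non-null values

print("size:", n_elements)
print("count:", n_non_null)
```

Use `count()` when missing values should be excluded from the tally, and `size` when every row should be counted.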


10. **`std()`**: Compute the standard deviation of a column.
    
    

In [29]:
salary_std = df['salary'].std()
print("Salary Standard Deviation:", salary_std)

Salary Standard Deviation: 9354.143466934853


11. **`var()`**: Compute the variance of a column.
   
    

In [31]:
salary_variance = df['salary'].var()
print("Salary Variance:", salary_variance)

Salary Variance: 87500000.0


12. **`sem()`**: Compute the standard error of the mean.
    
    

In [33]:
salary_sem = df['salary'].sem()
print("Salary Standard Error of the Mean:", salary_sem)

Salary Standard Error of the Mean: 3818.8130791298668
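
The last three statistics are related: the standard deviation is the square root of the variance, and the standard error of the mean is the standard deviation divided by the square root of the number of observations. pandas uses the sample versions (`ddof=1`) by default. A short check on the same salary values, with illustrative variable names:

```python
import math

import pandas as pd

salary = pd.Series([50000, 55000, 60000, 65000, 70000, 75000], name="salary")

n = salary.count()
variance = salary.var()   # sample variance, ddof=1 by default
std = salary.std()        # sample standard deviation, ddof=1 by default
sem = salary.sem()        # standard error of the mean

# Recompute std and sem from the other statistics for comparison.
std_from_var = math.sqrt(variance)
sem_from_std = std / math.sqrt(n)

print(std, std_from_var)
print(sem, sem_from_std)
```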


## Grouping in Pandas

Grouping splits a dataset into groups based on some criteria so that each group can be processed separately. It follows the *split-apply-combine* strategy:

- Splitting the data into groups based on some criteria.
- Applying a function to each group independently.
- Combining the results into a data structure.
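
The three steps above can be written out by hand to see what `groupby` does in one call. This is only a sketch for illustration, using a trimmed copy of the employee data; the names `groups`, `means`, `manual`, and `shortcut` are not from the original notebook.

```python
import pandas as pd

df = pd.DataFrame({
    "department": ["HR", "HR", "IT", "IT", "Sales", "Sales"],
    "salary": [50000, 55000, 60000, 65000, 70000, 75000],
})

# Split: one sub-DataFrame per department.
groups = {name: part for name, part in df.groupby("department")}

# Apply: compute the mean salary of each group independently.
means = {name: part["salary"].mean() for name, part in groups.items()}

# Combine: collect the per-group results into a single Series.
manual = pd.Series(means, name="salary")
manual.index.name = "department"

# The same result in one expression.
shortcut = df.groupby("department")["salary"].mean()

print(manual)
print(shortcut)
```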

In [35]:
import pandas as pd 

df = pd.DataFrame([[9, 4, 8, 9],
                   [8, 10, 7, 6],
                   [7, 6, 8, 5]],
                  columns=['Maths', 'English',
                           'Science', 'History'])

display(df)

,Maths,English,Science,History
0,9,4,8,9
1,8,10,7,6
2,7,6,8,5


In [36]:
a = df.groupby('Maths') 
a.first() 

Maths,English,Science,History
7,6,8,5
8,10,7,6
9,4,8,9


#### Implementation on a Dataset

In [45]:
import numpy as np 
import pandas as pd 

dataset = pd.read_csv("C:\\Users\\aardr\\Downloads\\diamonds.csv\\diamonds.csv") 

dataset.head()

,Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,1,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,2,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,3,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,4,0.29,Premium,I,VS2,62.4,58.0,334,4.2,4.23,2.63
4,5,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75


In [56]:
dataset.groupby('cut').sum()

cut,Unnamed: 0,carat,color,clarity,depth,table,price,x,y,z
Fair,38877246,1684.28,EEFFFHHHGEJGEGIEFGFJIJJFEHHHGFGHDJJHEIGIHGDFFE...,VS2SI2SI2VS2VS2SI2SI2SI2SI1I1SI2VVS1SI2VS1SI1V...,103107.1,95076.6,7017600,10057.5,9954.07,6412.26
Good,121545813,4166.1,EJJJJIFEHDDHHIHEEFHEIGFFEEDGFEGEFEIFFGGFGEHIID...,VS1SI2SI1SI1SI1SI2VS1VS1SI1VS2VS1SI2SI2SI1SI1V...,305967.0,287955.9,19275009,28645.08,28703.75,17855.42
Ideal,626005490,15146.84,EJJIIIJGIIIDDGIEIEGGIGIFEFEFFGEFDEHDHGDEIDFDGF...,SI2VS1SI2SI2SI2SI2SI1VS1SI1SI2VS1SI1SI1VVS2VVS...,1329899.3,1205814.4,74513487,118691.07,118963.24,73304.61
Premium,353052483,12300.95,EIFEEIFEDJDIGEHHGHHHIHEGGIFFHEFGGDDGGGGDEFEEEE...,SI1VS2SI1SI2I1VS1SI1VS2VS2SI2SI1SI2SI1VVS1SI1S...,844901.1,810167.4,63221498,82385.88,81985.82,50297.49
Very Good,315307738,9742.7,JIHHJEHJJGJDFFFEEDDHEHFIIGDHFEDDEEDEGEDGFDDEHF...,VVS2VVS1SI1VS1SI1VS2VS1SI1SI1VVS2VS2VS2VS1VS1V...,746888.4,700226.2,48107623,69359.09,69713.45,43009.52
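
In the output above, `sum()` concatenates the text columns (`color`, `clarity`) into long strings, which is rarely useful. One way to avoid this is to restrict the sum to numeric columns, either with `numeric_only=True` or by selecting columns before aggregating. The sketch below uses a tiny hypothetical frame (`mini`) in place of the full diamonds file.

```python
import pandas as pd

# Hypothetical miniature of the diamonds data, for illustration only.
mini = pd.DataFrame({
    "cut": ["Fair", "Good", "Fair", "Good"],
    "color": ["E", "J", "F", "E"],
    "carat": [0.22, 0.23, 0.25, 0.31],
    "price": [337, 327, 496, 335],
})

# Option 1: let pandas drop the non-numeric columns.
numeric_totals = mini.groupby("cut").sum(numeric_only=True)

# Option 2: choose the columns to aggregate explicitly.
selected = mini.groupby("cut")[["carat", "price"]].sum()

print(numeric_totals)
print(selected)
```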


In [66]:
dataset.groupby(['cut', 'color']).agg('min') 

cut,color,Unnamed: 0,carat,clarity,depth,table,price,x,y,z
Fair,D,677,0.25,I1,52.2,52.0,536,4.09,4.11,2.49
Fair,E,9,0.22,I1,51.0,49.0,337,3.87,3.78,2.33
Fair,F,98,0.25,I1,52.3,50.0,496,4.19,4.15,2.32
Fair,G,228,0.23,I1,43.0,53.0,369,0.0,0.0,0.0
Fair,H,129,0.33,I1,52.7,50.0,659,4.4,4.32,2.84
Fair,I,353,0.41,I1,50.8,49.0,735,4.62,4.66,2.93
Fair,J,256,0.3,I1,55.0,52.0,416,4.24,4.16,2.72
Good,D,43,0.23,I1,54.3,52.0,361,3.83,3.85,2.37
Good,E,3,0.23,I1,56.3,53.0,327,3.83,3.85,2.31
Good,F,36,0.23,I1,56.2,52.0,357,0.0,0.0,0.0


In [60]:
agg_functions = { 
	'price': 
	['sum', 'mean', 'median', 'min', 'max', 'prod'] 
} 

dataset.groupby(['color']).agg(agg_functions) 

,price,price,price,price,price,price
color,sum,mean,median,min,max,prod
D,21476439,3169.954096,1838.0,357,18693,0
E,30142944,3076.752475,1739.0,326,18731,0
F,35542866,3724.886397,2343.5,342,18791,0
G,45158240,3999.135671,2242.0,354,18818,0
H,37257301,4486.669196,3460.0,337,18803,0
I,27608146,5091.874954,3730.0,334,18823,0
J,14949281,5323.81802,4234.0,335,18710,0
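
The `prod` column above shows 0 for every color. A likely explanation is integer overflow: multiplying thousands of prices exceeds the range of `int64`, and the arithmetic wraps around silently. The sketch below reproduces the effect on a small hypothetical Series and shows one workaround, converting to floating point first.

```python
import pandas as pd

# Hypothetical prices: 64 copies of 2, so the exact product is 2**64,
# one past the largest value a 64-bit integer can hold.
prices = pd.Series([2] * 64, dtype="int64")

wrapped = prices.prod()                      # int64 arithmetic wraps around
as_float = prices.astype("float64").prod()   # float64 has a far larger range

print("int64 product:", wrapped)
print("float64 product:", as_float)
```

On the real dataset even `float64` may overflow to `inf`, so a product of prices is probably not a meaningful statistic there.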

