In [3]:
import pandas as pd

## **9. Aggregate Functions**

In [14]:
df = pd.read_csv("datasets/chocolate.csv")
df

Unnamed: 0,Sales Person,Country,Product,Date,Amount,Boxes Shipped
0,Jehu Rudeforth,UK,Mint Chip Choco,04-Jan-22,"$5,320",180
1,Van Tuxwell,India,85% Dark Bars,01-Aug-22,"$7,896",94
2,Gigi Bohling,India,Peanut Butter Cubes,07-Jul-22,"$4,501",91
3,Jan Morforth,Australia,Peanut Butter Cubes,27-Apr-22,"$12,726",342
4,Jehu Rudeforth,UK,Peanut Butter Cubes,24-Feb-22,"$13,685",184
...,...,...,...,...,...,...
1089,Karlen McCaffrey,Australia,Spicy Special Slims,17-May-22,"$4,410",323
1090,Jehu Rudeforth,USA,White Choc,07-Jun-22,"$6,559",119
1091,Ches Bonnell,Canada,Organic Choco Syrup,26-Jul-22,$574,217
1092,Dotty Strutley,India,Eclairs,28-Jul-22,"$2,086",384


In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1094 entries, 0 to 1093
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Sales Person   1094 non-null   object
 1   Country        1094 non-null   object
 2   Product        1094 non-null   object
 3   Date           1094 non-null   object
 4   Amount         1094 non-null   object
 5   Boxes Shipped  1094 non-null   int64 
dtypes: int64(1), object(5)
memory usage: 51.4+ KB


---
#### **df["column_name"].sum()**: Returns the sum of all values in a column.

In [18]:
df['Boxes Shipped'].sum()

np.int64(177007)
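`df.info()` above reports `Amount` as `object`: its values are strings such as `"$5,320"`, so they need converting before `.sum()` returns a numeric total. A minimal sketch, assuming every value follows the `$` plus thousands-comma pattern seen in the table; the small `sales` frame is a stand-in built from three of the rows shown, not the full CSV:

```python
import pandas as pd

# Stand-in for a few rows of chocolate.csv (values copied from the table above).
sales = pd.DataFrame({
    "Amount": ["$5,320", "$7,896", "$574"],
    "Boxes Shipped": [180, 94, 217],
})

# Strip "$" and "," from each string, then convert to integers.
amount_numeric = sales["Amount"].str.replace(r"[$,]", "", regex=True).astype(int)

total_amount = amount_numeric.sum()
total_boxes = sales["Boxes Shipped"].sum()
print(total_amount, total_boxes)
```

If the real column contains other formats (negative amounts, decimals, blanks), the cleaning step would need adjusting.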

#### **df["column_name"].min()**: Returns the minimum value in a column.


In [16]:
df['Boxes Shipped'].min()

np.int64(1)

In [17]:
df['Country'].min()

'Australia'
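On a text column, `.min()` and `.max()` compare strings lexicographically, which is why `'Australia'` comes out above. A small sketch with a made-up Series (the lowercase `"india"` entry is added only to show that case matters):

```python
import pandas as pd

# Made-up values; "india" is lowercase on purpose.
countries = pd.Series(["UK", "India", "Australia", "india"])

smallest = countries.min()  # alphabetically first
largest = countries.max()   # lowercase letters sort after uppercase ones
print(smallest, largest)
```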

#### **df["column_name"].max()**: Returns the maximum value in a column.

In [10]:
df['Boxes Shipped'].max()

np.int64(709)

#### **df["column_name"].count()**: Counts the non-null values in a column.

In [12]:
df['Country'].count()

np.int64(1094)
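In this dataset `.count()` equals the number of rows because no values are missing. The difference shows up once a column contains nulls, as in this sketch with a made-up Series:

```python
import numpy as np
import pandas as pd

# Made-up column with two missing entries.
countries = pd.Series(["UK", np.nan, "India", None])

non_null = countries.count()      # counts only non-null values
total_rows = len(countries)       # counts every row, null or not
missing = countries.isna().sum()  # number of null values
print(non_null, total_rows, missing)
```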

---
## **10. Statistical Functions**

#### **df["column_name"].mean()**: Returns the average value of a column.

In [13]:
df['Boxes Shipped'].mean()

np.float64(161.7979890310786)

#### **df["column_name"].median()**: Returns the middle value of a column.

In [20]:
df['Boxes Shipped'].median()

np.float64(135.0)

#### **df["column_name"].mode()**: Returns the most frequent value(s) in a column.

In [21]:
df['Boxes Shipped'].mode()

0    24
Name: Boxes Shipped, dtype: int64
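`.mode()` returns a Series rather than a single number because several values can tie for most frequent. A short sketch with made-up data where two values tie:

```python
import pandas as pd

# Made-up data: 24 and 7 both appear twice.
boxes = pd.Series([24, 24, 7, 7, 3])

modes = boxes.mode()        # Series holding every tied value, sorted ascending
first_mode = modes.iloc[0]  # pick one value if a scalar is needed
print(modes.tolist(), first_mode)
```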

#### **df["column_name"].std()**: Returns the sample standard deviation of a column (`ddof=1` by default).

In [22]:
df['Boxes Shipped'].std()

np.float64(121.54414540536331)

#### **df["column_name"].var()**: Returns the sample variance of a column (`ddof=1` by default).

In [23]:
df['Boxes Shipped'].var()

np.float64(14772.979282320099)
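By default pandas computes the sample versions of `.std()` and `.var()` (dividing by `n - 1`); passing `ddof=0` gives the population versions. A small sketch on made-up numbers chosen so the population values are round:

```python
import pandas as pd

# Made-up values with mean 5 and squared deviations summing to 32.
values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

sample_var = values.var()            # divides by n - 1 = 7
population_var = values.var(ddof=0)  # divides by n = 8
population_std = values.std(ddof=0)  # square root of the population variance
print(sample_var, population_var, population_std)
```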

#### **df.describe()**: Returns summary statistics (count, mean, std, min, quartiles, max) for the numeric columns.

In [25]:
df.describe()

Unnamed: 0,Boxes Shipped
count,1094.0
mean,161.797989
std,121.544145
min,1.0
25%,70.0
50%,135.0
75%,228.75
max,709.0
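Only `Boxes Shipped` appears above because `describe()` skips non-numeric columns by default. Passing `include="all"` also summarises text columns. A sketch on a made-up three-row frame:

```python
import pandas as pd

# Made-up stand-in for the sales data.
sales = pd.DataFrame({
    "Country": ["UK", "India", "UK"],
    "Boxes Shipped": [180, 94, 184],
})

numeric_summary = sales.describe()           # numeric columns only
full_summary = sales.describe(include="all")  # adds unique/top/freq for text columns
print(full_summary)
```

Rows that do not apply to a column (for example `mean` for `Country`) are filled with `NaN`.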


---
### **<center><span style="color:brown">Grouping Data</span></center>**
#### **df.groupby("column_name").aggregate_function()**: Splits the data into groups based on one or more columns, then applies an aggregation (such as sum, mean, or count) to each group.



In [31]:
df = pd.read_csv('datasets/employees.csv')
df

Unnamed: 0,Employee,Department,Salary,Age,Gender
0,Emp1,Marketing,52662,53,Male
1,Emp2,Sales,38392,54,Female
2,Emp3,Finance,60535,22,Male
3,Emp4,Sales,43067,40,Female
4,Emp5,Sales,78033,23,Male
...,...,...,...,...,...
95,Emp96,Finance,67505,59,Female
96,Emp97,Sales,77323,29,Female
97,Emp98,HR,76645,48,Female
98,Emp99,HR,30854,48,Female


In [41]:
grouped_data = df.groupby('Department')
print(grouped_data)

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x116aa0e60>
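Printing a `GroupBy` object shows only its memory address, because nothing is computed until an aggregation is applied. A few ways to look inside it, sketched on a stand-in frame built from the first five rows shown above:

```python
import pandas as pd

# Stand-in for the first rows of employees.csv.
employees = pd.DataFrame({
    "Employee": ["Emp1", "Emp2", "Emp3", "Emp4", "Emp5"],
    "Department": ["Marketing", "Sales", "Finance", "Sales", "Sales"],
    "Salary": [52662, 38392, 60535, 43067, 78033],
})

grouped = employees.groupby("Department")

group_sizes = grouped.size()             # number of rows per group
sales_rows = grouped.get_group("Sales")  # the rows belonging to one group

# Iterating yields (group name, sub-DataFrame) pairs.
for name, frame in grouped:
    print(name, len(frame))
```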


#### **1. GroupBy with Aggregation on a Single Column**
#### **`df.groupby("column_name")["target_column"].agg_function()`**


In [42]:
df.groupby('Department')["Salary"].mean()

Department
Finance      53235.687500
HR           54252.619048
IT           59009.277778
Marketing    53007.884615
Sales        57079.526316
Name: Salary, dtype: float64

In [43]:
df.groupby("Department")["Salary"].sum()

Department
Finance       851771
HR           1139305
IT           1062167
Marketing    1378205
Sales        1084511
Name: Salary, dtype: int64

In [45]:
df.groupby("Department")["Salary"].max()

Department
Finance      79850
HR           79080
IT           75453
Marketing    78702
Sales        78033
Name: Salary, dtype: int64

#### **2. GroupBy with Multiple Columns**
#### **`df.groupby(["col1", "col2"])["target_col"].agg_function()`**

In [47]:
df.groupby(["Department", "Gender"])["Salary"].sum()

Department  Gender
Finance     Female    329366
            Male      522405
HR          Female    539845
            Male      599460
IT          Female    671432
            Male      390735
Marketing   Female    796972
            Male      581233
Sales       Female    496572
            Male      587939
Name: Salary, dtype: int64
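Grouping by two columns produces a Series with a two-level index (`Department`, `Gender`). The sketch below shows three common ways to work with such a result, using a stand-in frame built from a few of the rows shown above:

```python
import pandas as pd

# Stand-in rows copied from the employees table.
employees = pd.DataFrame({
    "Department": ["Sales", "Sales", "Sales", "Finance", "Finance"],
    "Gender": ["Female", "Female", "Male", "Male", "Female"],
    "Salary": [38392, 43067, 78033, 60535, 67505],
})

by_dept_gender = employees.groupby(["Department", "Gender"])["Salary"].sum()

# Look up one (Department, Gender) pair with a tuple.
sales_female = by_dept_gender.loc[("Sales", "Female")]

# Pivot the inner index level into columns: one row per department.
wide = by_dept_gender.unstack("Gender")

# Or turn both index levels back into ordinary columns.
flat = by_dept_gender.reset_index()
print(wide)
```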

---
### **<center><span style="color:brown">agg() function</span></center>**
#### **Applies one or more functions (such as sum, mean, min, or max) to a DataFrame or Series.**


#### **1. Single Column – One Function**
#### **`df["column"].agg("function")`**

In [49]:
df["Salary"].agg("sum")

np.int64(5515959)

#### **2. Single Column – Multiple Functions**
#### **`df["column"].agg(["func1", "func2", ...])`**

In [52]:
df["Salary"].agg(["sum","min","max"])

sum    5515959
min      30206
max      79850
Name: Salary, dtype: int64

#### **3. Multiple Columns – Different Functions**
```python
df.agg({
    "column1": ["func1", "func2"],
    "column2": ["func3"]
})
```

In [53]:
df.agg({
    "Salary":["min","max"],
    "Age":["max"]
})

Unnamed: 0,Salary,Age
min,30206,NaN
max,79850,59.0
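When columns receive different function lists, the result has one row per function, and any column/function pair that was not requested is left as `NaN`. That is why `Age` has no `min` value and its `max` is shown as a float. A sketch on a stand-in frame built from three of the rows above:

```python
import pandas as pd

# Stand-in rows copied from the employees table.
employees = pd.DataFrame({
    "Salary": [52662, 38392, 60535],
    "Age": [53, 54, 22],
})

summary = employees.agg({
    "Salary": ["min", "max"],
    "Age": ["max"],  # "min" is not requested for Age
})

age_min_missing = pd.isna(summary.loc["min", "Age"])
print(summary)
```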


#### **4. With Custom Function Names**
#### **`df["column"].agg(name1="func1", name2="func2")`**

In [55]:
df["Salary"].agg(max_sal="max", min_sal="min")


max_sal    79850
min_sal    30206
Name: Salary, dtype: int64

---

#### **3. GroupBy with Multiple Aggregations**
#### **`df.groupby("column")["target_column"].agg(["mean", "min", "max"])`**

In [57]:
df.groupby("Department")["Salary"].agg(["min","max","mean"])

Department,min,max,mean
Finance,30206,79850,53235.6875
HR,30854,79080,54252.619048
IT,43545,75453,59009.277778
Marketing,30663,78702,53007.884615
Sales,31678,78033,57079.526316
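Custom output names also work with `groupby`, using pandas' named-aggregation form `new_name=("column", "function")`, which lets a single call summarise several columns. A sketch on a stand-in frame built from the first five rows shown earlier; the output names `min_salary`, `max_salary`, and `avg_age` are arbitrary choices:

```python
import pandas as pd

# Stand-in for the first rows of employees.csv.
employees = pd.DataFrame({
    "Department": ["Marketing", "Sales", "Finance", "Sales", "Sales"],
    "Salary": [52662, 38392, 60535, 43067, 78033],
    "Age": [53, 54, 22, 40, 23],
})

# Each keyword becomes an output column: name=(source column, function).
dept_summary = employees.groupby("Department").agg(
    min_salary=("Salary", "min"),
    max_salary=("Salary", "max"),
    avg_age=("Age", "mean"),
)
print(dept_summary)
```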


#### **4. GroupBy on Whole DataFrame**
#### **`df.groupby("column").agg_function()`**

In [59]:
df.groupby("Department").min()


Department,Employee,Salary,Age,Gender
Finance,Emp12,30206,22,Female
HR,Emp14,30854,23,Female
IT,Emp100,43545,24,Female
Marketing,Emp1,30663,22,Female
Sales,Emp10,31678,22,Female
