# Aggregation and Grouping in Pandas

Aggregation and grouping are central to data analysis: they let us split a dataset into meaningful subsets and summarize each one. This section covers both concepts in the context of pandas, a Python library for data manipulation and analysis.

## Grouping Data with `groupby()`
Grouping splits a dataset into smaller subsets based on specific criteria. The `groupby()` method is the main pandas tool for this task. It groups a DataFrame by one or more columns so that subsequent analysis and computation can be carried out within each group [McKinney, 2022, Pandas Developers, 2023].

### Basic Syntax
The basic syntax for using the `groupby()` function is as follows {cite:p}`PandasDocumentation`:
```python
grouped = df.groupby('column_name')
```

Full syntax can be found [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html).

<font color='Blue'><b>Example:</b></font> This dataset contains daily meteorological observations from the "UNIVERSITY OF CALGARY" weather station, obtained from https://climate.weather.gc.ca/. It includes the following columns:

1. **Station Name:** The name of the weather station, which is "UNIVERSITY OF CALGARY" in this case.

2. **Date/Time:** The date and time of the observation in the format MM/DD/YYYY.

3. **Year:** The year of the observation, extracted from the Date/Time column.

4. **Month:** The month of the observation, extracted from the Date/Time column.

5. **Day:** The day of the observation, extracted from the Date/Time column.

6. **Max Temp (°C):** The maximum temperature recorded on that date in degrees Celsius.

7. **Min Temp (°C):** The minimum temperature recorded on that date in degrees Celsius.

8. **Mean Temp (°C):** The mean temperature recorded on that date in degrees Celsius, typically calculated as the average of the maximum and minimum temperatures.

9. **Total Rain (mm):** The total amount of rainfall in millimeters on that date.

10. **Total Snow (cm):** The total amount of snowfall in centimeters on that date.

11. **Total Precip (mm):** The total precipitation on that date in millimeters, which combines rain and snow.

In [None]:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/HatefDastour/ENGG_680/main/Datasets/UofC_Daily_1990_Q1Q2.csv')

# Display the last few rows of the DataFrame
display(df.tail())

# Grouping by 'Month'
grouped = df.groupby('Month')
print('`grouped` is a DataFrameGroupBy object:')
print(grouped)
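
Printing a GroupBy object only shows its type and memory address, because no computation happens until a result is requested. The following minimal sketch shows a few ways to look inside such an object. It uses a small hypothetical DataFrame (`toy`, with made-up temperatures) instead of the weather station file so that it runs without a network connection.

```python
import pandas as pd

# Hypothetical toy data standing in for the weather station records
toy = pd.DataFrame({
    'Month': [1, 1, 2, 2, 3],
    'Max Temp (°C)': [-5.0, -2.5, 1.0, 3.5, 8.0],
})

toy_grouped = toy.groupby('Month')

# Number of groups, and the row labels that belong to each group
print(toy_grouped.ngroups)
print(toy_grouped.groups)

# Iterating yields (group key, sub-DataFrame) pairs
for month, subset in toy_grouped:
    print(f"Month {month}: {len(subset)} row(s)")

# Extract a single group as a regular DataFrame
february = toy_grouped.get_group(2)
print(february)
```

The same calls should work on the `grouped` object built from the weather data, with the twelve calendar months replaced by whichever months are present in the file.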

## Aggregation: Extracting Insights from Groups

Once the data is grouped with `groupby()`, the next step is aggregation: applying a function to each group so that it is reduced to a summary value. Comparing these summaries across groups reveals patterns and trends in the dataset.

Pandas offers a multitude of aggregation functions that allow you to compute summary statistics and insights for each group. These functions include [McKinney, 2022, Pandas Developers, 2023]:
- **Mean**: Calculates the average value of a specific column within each group.
- **Sum**: Computes the total sum of values in a column for each group.
- **Count**: Determines the number of non-null values within a column for each group.
- **Max**: Identifies the maximum value within a column for each group.
- **Min**: Finds the minimum value within a column for each group.
- ...and many more.
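
To make the "non-null" wording for **Count** concrete, here is a minimal sketch on a hypothetical DataFrame (`toy`) that contains one missing temperature. It compares `count()` with `size()`, which counts every row in the group, and shows that `mean()` skips missing values by default.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing observation in month 1
toy = pd.DataFrame({
    'Month': [1, 1, 1, 2, 2],
    'Max Temp (°C)': [-5.0, np.nan, -2.0, 1.0, 3.0],
})
by_month = toy.groupby('Month')['Max Temp (°C)']

non_null = by_month.count()   # NaN is ignored: month 1 has 2 values
n_rows = by_month.size()      # every row is counted: month 1 has 3 rows
mean_temp = by_month.mean()   # NaN is skipped when averaging

print(pd.DataFrame({'count': non_null, 'size': n_rows, 'mean': mean_temp}))
```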

In [None]:
def Line(n=40):
    print(n * "_")

# Applying aggregation functions on the groups

# Calculating the mean of 'Max Temp (°C)' for each group
average_values = grouped['Max Temp (°C)'].mean().round(2)
print("Average Values:")
print(average_values)
Line()

# Calculating the sum of 'Max Temp (°C)' for each group
sum_values = grouped['Max Temp (°C)'].sum().round(2)
print("Sum Values:")
print(sum_values)
Line()

# Counting the non-null values in each group (counts are integers, so no rounding is needed)
count_values = grouped['Max Temp (°C)'].count()
print("Count Values:")
print(count_values)
Line()

# Finding the maximum value in each group
max_value = grouped['Max Temp (°C)'].max().round(2)
print("Max Values:")
print(max_value)
Line()

# Finding the minimum value in each group
min_value = grouped['Max Temp (°C)'].min().round(2)
print("Min Values:")
print(min_value)
Line()

## Multiple aggregations

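You can apply several aggregation functions in a single call with the `agg()` method, by passing either a list of function names or a dictionary that maps each column to the functions to apply.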

<font color='Blue'><b>Example - Exploring Store Sales Patterns:</b></font> Imagine you're analyzing sales data for a small boutique store. The store sells products from two categories, 'A' and 'B'. You want to understand how the sales values for these categories differ.

You start by generating a random dataset using numpy and pandas to simulate the store's sales records. You set a random seed for reproducibility and create 100 rows of data. Each row has a 'Category' column with either 'A' or 'B' values and a 'Value' column with random sales amounts between 1 and 100.

In [None]:
import numpy as np
import pandas as pd

# Set the random seed for reproducibility
rng = np.random.default_rng(42)

# Define the number of rows in the DataFrame
num_rows = 100

# Generate random alphabet column with only 'A' and 'B'
random_alphabet = [rng.choice(['A', 'B']) for _ in range(num_rows)]

# Generate random numeric column
random_numeric = rng.integers(1, 101, size=num_rows)

# Create a Pandas DataFrame
data = {'Category': random_alphabet, 'Value': random_numeric}
df = pd.DataFrame(data)

# Display the first few rows of the DataFrame
display(df.head())

In [None]:
# Grouping by 'Category'
grouped = df.groupby('Category')

#  Applying multiple aggregation functions
agg_functions = {'Value': ['mean', 'sum', 'count', 'max', 'min']}
result = grouped.agg(agg_functions)
# or simply result = grouped.agg({'Value': ['mean', 'sum', 'count', 'max', 'min']})
display(result)
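
Passing a dictionary of lists to `agg()` produces a result with two-level column labels such as `('Value', 'mean')`. When flat, descriptive column names are preferable, pandas also supports named aggregation, where each keyword argument is the output column name and its value is a `(column, function)` pair. The sketch below uses a small hypothetical `sales` DataFrame, and the output names (`mean_value`, `total_value`, `n_sales`) are chosen for illustration only.

```python
import pandas as pd

# Hypothetical sales records for two categories
sales = pd.DataFrame({
    'Category': ['A', 'A', 'B', 'B', 'B'],
    'Value': [10, 30, 5, 15, 40],
})

# Named aggregation: output_name=(input_column, function)
summary = sales.groupby('Category').agg(
    mean_value=('Value', 'mean'),
    total_value=('Value', 'sum'),
    n_sales=('Value', 'count'),
)
print(summary)
```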

## Grouping by multiple columns
Suppose we have a dataset containing sales data for different products in different regions. We want to group the data by both "Product" and "Region" to analyze the total sales and average price for each combination.


In [None]:
import numpy as np
import pandas as pd

# Set the random seed for reproducibility
rng = np.random.default_rng(42)

# Define the number of rows in the DataFrame
num_rows = 100

# Generate random data using NumPy functions directly
data = {
    'Product': rng.choice(['A', 'B'], num_rows),
    'Region': rng.choice(['North', 'South'], num_rows),
    'Sales': rng.integers(100, 201, size=num_rows),
    'Price': rng.integers(10, 23, size=num_rows)
}

# Create a Pandas DataFrame from the dictionary
df = pd.DataFrame(data)

# Display the resulting DataFrame
display(df.head())

# Grouping by both "Product" and "Region"
result = df.groupby(['Product', 'Region']).agg({'Sales': 'sum', 'Price': 'mean'})

# Display the result
display(result)
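
Grouping by several columns returns a result indexed by a `MultiIndex`, with one level per grouping column. The following sketch, on a hypothetical four-row DataFrame, shows three common ways to work with such a result: selecting one combination with a tuple, flattening the index back into columns with `reset_index()`, and moving one level into the columns with `unstack()`.

```python
import pandas as pd

# Hypothetical sales records; note that product B has no 'South' rows
toy = pd.DataFrame({
    'Product': ['A', 'A', 'B', 'B'],
    'Region': ['North', 'South', 'North', 'North'],
    'Sales': [100, 150, 120, 130],
})

totals = toy.groupby(['Product', 'Region'])['Sales'].sum()

# Select a single (Product, Region) combination with a tuple
a_south = totals.loc[('A', 'South')]

# Turn the index levels back into ordinary columns
flat = totals.reset_index()

# Move 'Region' into the columns; missing combinations become NaN
wide = totals.unstack('Region')

print(totals, flat, wide, sep='\n\n')
```

Only the combinations that actually occur in the data appear in `totals`, which is why `wide` contains a missing value for product B in the South region.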

## Grouping and Resampling Time Series Data

### `resample` Function for Time Series Analysis

The `resample` method in pandas is designed for time series data. It changes the frequency of time-indexed data and lets you aggregate the values that fall into each new time bin [Pandas Developers, 2023].

When applied to a DataFrame or Series with a datetime-like index, `resample` partitions the data into time intervals (such as days, months, or years), and an aggregation function can then be applied within each interval. The general syntax is:

```python
new_resampled_object = df.resample(rule)
```

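Here, `rule` is a string (an offset alias) that defines the desired resampling frequency, such as `'D'` for daily, `'M'` for month-end, or `'A'` for year-end. Since pandas 2.2, the aliases `'M'` and `'A'` are deprecated in favor of `'ME'` and `'YE'`. The alias `'MS'` labels each bin by the first day of the month instead.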

The object returned by `resample` behaves much like a GroupBy object, so the same aggregation methods (`mean()`, `sum()`, `agg()`, and so on) can be used to explore trends and patterns across different time intervals.

You can find a comprehensive description by following this [link](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.resample.html).
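
As a minimal sketch of how `rule` controls the bins, the example below builds a hypothetical daily series (`ts`, with the values 0 to 58 covering January and February 1990) and resamples it to monthly and weekly frequency. The column name `Temp` and the values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series: 59 days from 1990-01-01 to 1990-02-28
days = pd.date_range('1990-01-01', periods=59, freq='D')
ts = pd.DataFrame({'Temp': np.arange(59, dtype=float)}, index=days)

# 'MS': one bin per month, labelled by the first day of the month
monthly_mean = ts.resample('MS').mean()

# 'W': one bin per week, labelled by the Sunday that ends the week
weekly_max = ts.resample('W').max()

print(monthly_mean)
print(weekly_max.head())
```

Downsampling to a coarser frequency, as here, always needs an aggregation step such as `mean()` or `max()` to decide how the values in each bin are combined.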

<font color='Blue'><b>Example:</b></font> Consider again the climate data from the University of Calgary station for the first half of 1990.

In [None]:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/HatefDastour/ENGG_680/main/Datasets/UofC_Daily_1990_Q1Q2.csv',
                parse_dates = ['Date/Time'])
df = df.set_index('Date/Time')

# Resampling to monthly bins (labelled by month start) and calculating the average mean temperature
result = df.resample('MS').agg({'Mean Temp (°C)': 'mean'})

display(result)
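
`resample` bins by time only. When time bins need to be combined with another grouping key, one option is `pd.Grouper(freq=...)` inside `groupby()`. The sketch below assumes a hypothetical DataFrame `obs` with a `Station` column and a datetime index. The station labels and temperatures are invented for illustration.

```python
import pandas as pd

# Hypothetical observations from two stations on four dates
dates = pd.to_datetime(['1990-01-05', '1990-01-20', '1990-01-25', '1990-02-10'])
obs = pd.DataFrame({
    'Station': ['X', 'Y', 'X', 'X'],
    'Temp': [1.0, 2.0, 3.0, 5.0],
}, index=dates)

# Group by station AND by month-start bins of the datetime index
by_station_month = obs.groupby(['Station', pd.Grouper(freq='MS')])['Temp'].mean()
print(by_station_month)
```

With a single key, `obs.groupby(pd.Grouper(freq='MS'))` groups the rows into the same monthly bins that `obs.resample('MS')` would use.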

## Pandas pivot and melt

The `pandas.DataFrame.pivot` method reshapes a DataFrame from long format (one row per observation) to wide format (one column per category). The reverse transformation, from wide to long, is handled by `melt`, described later in this section. Note that `pivot` performs no aggregation: each combination of `index` and `columns` values must be unique, otherwise a `ValueError` is raised. Reshaping is particularly useful when you need to analyze or visualize your data in a different structure.

**Syntax:**
```python
DataFrame.pivot(index=None, columns=None, values=None)
```

- `index`: This parameter specifies the column whose unique values will become the new index (row labels) of the pivoted DataFrame. It can be a column name or a list of column names.

- `columns`: This parameter specifies the column whose unique values will become the new column headers of the pivoted DataFrame. It can be a column name or a list of column names. This argument is required in practice: calling `pivot` without it raises a `TypeError`.

- `values`: This parameter specifies the column(s) containing the data to populate the pivoted DataFrame. It can be a column name or a list of column names. If not provided, all remaining columns not used as `index` or `columns` will be used.

We can see the full syntax [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot.html).

<font color='Blue'><b>Example:</b></font> We can use `pivot` to transform data from a long format (e.g., for time-series data) to a wide format, making it easier to analyze.

In [None]:
import pandas as pd

# Create a DataFrame with air quality data
# This dataset is a fictional dataset:
air_quality_data = pd.DataFrame({
    'Date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
    'Pollutant': ['CO', 'NO2', 'CO', 'NO2'],
    'Value': [2.5, 20.1, 2.7, 22.3]
})

# Print the original air quality data DataFrame
print("Original Air Quality Data:")
display(air_quality_data)

# Pivot the air quality data to a wide format
pivoted_air_quality = air_quality_data.pivot(index='Date', columns='Pollutant', values='Value')

# Print the pivoted air quality data
print("\nPivoted Air Quality Data (Wide Format):")
display(pivoted_air_quality)
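
`pivot` only rearranges values, so it fails when two rows share the same `index` and `columns` combination. The following sketch illustrates this with a hypothetical `readings` DataFrame that contains two CO measurements for the same date, and then uses `pivot_table`, which aggregates duplicates (here with the mean), as one possible alternative.

```python
import pandas as pd

# Hypothetical readings with a duplicated (Date, Pollutant) pair
readings = pd.DataFrame({
    'Date': ['2023-01-01', '2023-01-01', '2023-01-01'],
    'Pollutant': ['CO', 'CO', 'NO2'],
    'Value': [2.0, 3.0, 20.0],
})

# pivot cannot decide which CO value to keep, so it raises ValueError
try:
    readings.pivot(index='Date', columns='Pollutant', values='Value')
    pivot_failed = False
except ValueError as err:
    pivot_failed = True
    print(f"pivot failed: {err}")

# pivot_table aggregates the duplicates instead
averaged = readings.pivot_table(index='Date', columns='Pollutant',
                                values='Value', aggfunc='mean')
print(averaged)
```

Whether averaging is the right way to resolve duplicates depends on the data. Other choices for `aggfunc`, such as `'max'` or `'first'`, may be more appropriate.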

The `pandas.melt` function is a powerful tool for converting a DataFrame from wide format to long format, making it easier to work with certain types of data. It essentially "unpivots" a DataFrame by melting columns into rows. This can be particularly useful when you want to analyze or visualize data that is initially organized with column headers representing different variables.

**Syntax:**
```python
pandas.melt(frame, id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None)
```

- `frame`: This parameter specifies the DataFrame you want to melt.

- `id_vars`: This parameter specifies the column(s) that should remain as identifier variables (columns that won't be melted). It can be a column name or a list of column names.

- `value_vars`: This parameter specifies the column(s) that should be melted (unpivoted). If not specified, all columns not listed in `id_vars` will be melted.

- `var_name`: This parameter specifies the name of the new column that will store the variable names (from the melted columns).

- `value_name`: This parameter specifies the name of the new column that will store the values (from the melted columns).

- `col_level`: This parameter is used when working with MultiIndex columns. It specifies which level of the column index should be melted.

We can see the full syntax [here](https://pandas.pydata.org/docs/reference/api/pandas.melt.html).


<font color='Blue'><b>Example:</b></font> Let's apply the `melt` function to the previous air quality data example:

In [None]:
# Reset the index to convert 'Date' back to a regular column
pivoted_air_quality.reset_index(inplace=True)
display(pivoted_air_quality)

# Use the melt function to transform the data to long format
melted_air_quality = pd.melt(pivoted_air_quality, id_vars=['Date'], value_vars=['CO', 'NO2'],
                              var_name='Pollutant', value_name='Value')

# Print the melted air quality data in long format
print("Melted Air Quality Data (Long Format):")
display(melted_air_quality)
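
As a further sketch of how the `melt` parameters interact, the example below rebuilds a small wide table (the hypothetical `wide` DataFrame, using the same fictional air quality values) and melts it twice: once with the defaults, and once with explicit names. It then pivots the long result to confirm that the two operations are inverses here.

```python
import pandas as pd

# Hypothetical wide table: one column per pollutant
wide = pd.DataFrame({
    'Date': ['2023-01-01', '2023-01-02'],
    'CO': [2.5, 2.7],
    'NO2': [20.1, 22.3],
})

# Defaults: every non-id column is melted; new columns are 'variable' and 'value'
long_default = pd.melt(wide, id_vars='Date')

# Explicit names for the two new columns
long_named = wide.melt(id_vars='Date', var_name='Pollutant', value_name='Value')

# Pivoting the long table recovers the wide layout
round_trip = long_named.pivot(index='Date', columns='Pollutant', values='Value')

print(long_default)
print(round_trip)
```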