# Chapter 5 Grouping and Aggregating

In [None]:
import pandas as pd

# Read car_stocks dataset
car_stks = pd.read_csv("./datasets/car_stocks.csv")

# Print the dataframe
car_stks.head()

### Find the average closing price for each value of column 'Symbol'

In [None]:
# Find the different possible values of Symbol in the dataframe

# The DataFrame.column.value_counts() method lists each distinct value in a column
# together with the number of rows that contain it
car_stks.Symbol.value_counts()

In [None]:
# Find average closing price of RIVN
RIVN = car_stks["Symbol"] == "RIVN"
car_stks[RIVN].Close.mean()

In [None]:
# Find average closing price of LCID
LCID = car_stks["Symbol"] == "LCID"
car_stks[LCID].Close.mean()

In [None]:
# Find average closing price of GM
GM = car_stks["Symbol"] == "GM"
car_stks[GM].Close.mean()

For a large dataset with many distinct values, computing the average for each value one filter at a time is impractical. Is there a better alternative? The answer is **Grouping**.

## Group by column

To group by values in a column we can make use of the following method:

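**DataFrame.groupby(by=Column):** Returns a DataFrameGroupBy object that splits the records into groups, one for each distinct value of the column passed. The original dataframe is not reordered or modified, and no aggregation is computed until a method is called on the groups. We can then apply various aggregation methods to analyze these groups further.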

Ex: df.groupby(by="age") or df.groupby("age")
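
As a minimal, self-contained sketch of the idea, the example below uses a small made-up dataframe (the symbols and prices are illustrative values, not rows from car_stocks.csv) and checks that a single groupby call gives the same result as filtering and averaging one symbol at a time.

```python
import pandas as pd

# Illustrative data only; not taken from car_stocks.csv
toy = pd.DataFrame({
    "Symbol": ["GM", "RIVN", "GM", "LCID", "RIVN", "GM"],
    "Close": [40.0, 100.0, 42.0, 30.0, 110.0, 44.0],
})

# One call: mean Close for every distinct Symbol
mean_by_symbol = toy.groupby("Symbol")["Close"].mean()

# The manual approach: one boolean filter per symbol
manual = {
    symbol: toy[toy["Symbol"] == symbol]["Close"].mean()
    for symbol in toy["Symbol"].unique()
}

print(mean_by_symbol)
print(manual)
```

The grouped result is a Series indexed by the group labels, so a single value can be read with, for example, `mean_by_symbol["GM"]`.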

In [None]:
# Grouping by the 'Symbol' column and finding the mean closing price for each group
grouped_data = car_stks.groupby("Symbol")
grouped_data["Close"].mean()

# This calculates the mean closing price for each unique value in the 'Symbol' column.

In [None]:
# Read titanic dataset
titanic = pd.read_csv("./datasets/titanic.csv")

# Print the dataframe
titanic.head()

In [None]:
# Find the datatype of age
titanic.age.dtype

# dtype('O') is the object dtype; here the column holds Python strings

In [None]:
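# Convert the age column from object dtype (strings) to float64

# Replace the '?' placeholder with NaN and assign the result back to the column.
# Calling replace(..., inplace=True) on titanic.age is chained assignment and
# may leave the dataframe unchanged in recent pandas versions
titanic["age"] = titanic["age"].replace("?", float("nan"))

# Set the datatype from object to float64
titanic["age"] = titanic["age"].astype("float")

# Confirm change in datatype
titanic["age"].dtype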
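
An alternative way to do the same cleanup is `pd.to_numeric` with `errors="coerce"`, which parses the strings and turns anything unparseable into NaN in one step. The sketch below uses a small made-up column (the values are illustrative, not rows from titanic.csv). Note that coercion also hides any other malformed entries, so it is worth checking the number of NaN values afterwards.

```python
import pandas as pd

# Illustrative data only; '?' stands in for a missing age
raw = pd.DataFrame({"age": ["29", "?", "0.92", "35"]})

# errors="coerce" converts values that cannot be parsed as numbers to NaN
raw["age"] = pd.to_numeric(raw["age"], errors="coerce")

print(raw["age"].dtype)
print(raw["age"].isna().sum())
```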

In [None]:
# Create a shorter version of titanic dataframe including only the columns pclass, survived, gender and age
tnc = titanic[["pclass", "survived", "gender", "age"]]

# Print the dataframe
tnc.head()

## Exploring various properties/methods on groups

As discussed above, we can group the records based on a column using the method **DataFrame.groupby(by=Column)** or simply **DataFrame.groupby(Column)**. It returns a DataFrameGroupBy object that can then be used to analyze these groups.

1) **DataFrameGroupBy.ngroups:** Returns the number of groups formed by the groupby() operation, that is, the number of distinct values in the grouping column. By default, rows whose grouping value is missing (NaN) do not form a group.

Ex: df.groupby("age").ngroups

2) **DataFrameGroupBy.groups:** Returns a dictionary where the keys are the unique group labels and the values are the index labels of the records in each group. It provides a convenient way to look up which records belong to which group.

Ex: df.groupby("age").groups

3) **DataFrameGroupBy.get_group(group_name):** Returns a DataFrame containing the records of the specified group.

Ex: df.groupby("age").get_group(14.0)
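
The three members above can be tried on a small made-up dataframe (illustrative values, not rows from titanic.csv). The sketch also shows that a missing grouping value is left out by default and can be kept with `dropna=False`.

```python
import numpy as np
import pandas as pd

# Illustrative data only; not taken from titanic.csv
toy = pd.DataFrame({
    "gender": ["female", "male", "male", "female", "male"],
    "age": [29.0, 35.0, np.nan, 29.0, 4.0],
})

by_gender = toy.groupby("gender")

n_groups = by_gender.ngroups           # number of distinct genders
index_map = by_gender.groups           # group label -> index labels of its rows
males = by_gender.get_group("male")    # dataframe holding only the male rows

# Rows whose grouping key is NaN are dropped unless dropna=False is passed
n_age_groups = toy.groupby("age").ngroups
n_age_groups_with_nan = toy.groupby("age", dropna=False).ngroups

print(n_groups, dict(index_map))
print(males)
print(n_age_groups, n_age_groups_with_nan)
```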

In [None]:
# Create a DataFrameGroupBy object on the column "gender"
gender_gbo = tnc.groupby(by="gender")

# Create a DataFrameGroupBy object on the column "age"
age_gbo = tnc.groupby(by="age")

# Create a DataFrameGroupBy object on the column "survived"
surv_gbo = tnc.groupby(by="survived")

In [None]:
# Print the number of groups (genders)
gender_gbo.ngroups

# Two groups are formed (male and female)

In [None]:
# Print the number of groups (age)
age_gbo.ngroups

# 98 groups are formed

In [None]:
# Print the indexes of each group (genders)
gender_gbo.groups

# Returns a dictionary with the group labels (female, male) as keys and the index labels of the records in each group as values

In [None]:
# Print the indexes of each group (survived)
surv_gbo.groups

# Returns a dictionary with the group labels (0, 1) as keys and the index labels of the records in each group as values

In [None]:
# Print male passenger records
gender_gbo.get_group("male")

In [None]:
# Print records of passengers who survived
surv_gbo.get_group(1)

## Aggregation methods

Aggregation methods reduce each group to a summary value, such as a count, a mean, or a maximum. When called on a DataFrameGroupBy object they return one row per group; selecting a column first restricts the calculation to that column.

| Method Name               | Description                                                                                      |
|:--------------------------|:-------------------------------------------------------------------------------------------------|
| count()                   | Counts the number of non-null values in each group.                                              |
| sum()                     | Computes the sum of values in each group.                                                        |
| mean()                    | Calculates the mean (average) of values in each group.                                           |
| median()                  | Calculates the median (middle value) of values in each group.                                    |
| min()                     | Finds the minimum value in each group.                                                           |
| max()                     | Finds the maximum value in each group.                                                           |
| std()                     | Computes the standard deviation of values in each group.                                         |
| var()                     | Computes the variance of values in each group.                                                   |
| first()                   | Retrieves the first non-null value in each group.                                                |
| last()                    | Retrieves the last non-null value in each group.                                                 |
| agg() or aggregate()      | Allows for applying custom or multiple aggregation functions simultaneously.                     |
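
A few of these methods can be sketched on a small made-up dataframe (illustrative values, not rows from titanic.csv) to show how missing values are handled. `size()` is not listed in the table; it is included here only to contrast with `count()`, since it counts every row in a group regardless of missing values.

```python
import numpy as np
import pandas as pd

# Illustrative data only; one male passenger has a missing age
toy = pd.DataFrame({
    "gender": ["female", "male", "male", "female", "male"],
    "age": [20.0, 30.0, np.nan, 40.0, 50.0],
})

ages = toy.groupby("gender")["age"]

counts = ages.count()   # non-null ages per group
sizes = ages.size()     # all rows per group, including missing ages
means = ages.mean()     # NaN values are skipped
stds = ages.std()       # sample standard deviation (ddof=1 by default)

print(pd.DataFrame({"count": counts, "size": sizes, "mean": means, "std": stds}))
```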

In [None]:
# Print titanic dataframe
tnc.head()

### Count of all columns of passengers grouped by gender

In [None]:
gender_gbo.count()

### Count of ages of passengers grouped by gender

In [None]:
gender_gbo["age"].count()

### Sum of all columns of passengers grouped by gender

In [None]:
gender_gbo.sum()

### Sum of ages of passengers grouped by gender

In [None]:
gender_gbo["age"].sum()

### Mean of all columns of passengers grouped by gender

In [None]:
gender_gbo.mean()

### Mean of ages of passengers grouped by gender

In [None]:
gender_gbo["age"].mean()

### Median of all columns of passengers grouped by gender

In [None]:
gender_gbo.median()

### Median of ages of passengers grouped by gender

In [None]:
gender_gbo["age"].median()

### Min of all columns of passengers grouped by gender

In [None]:
gender_gbo.min()

### Min of ages of passengers grouped by gender

In [None]:
gender_gbo["age"].min()

### Max of all columns of passengers grouped by gender

In [None]:
gender_gbo.max()

### Max of ages of passengers grouped by gender

In [None]:
gender_gbo["age"].max()

### Standard Deviation of all columns of passengers grouped by gender

In [None]:
gender_gbo.std()

### Standard Deviation of ages of passengers grouped by gender

In [None]:
gender_gbo["age"].std()

### Variance of all columns of passengers grouped by gender

In [None]:
gender_gbo.var()

### Variance of ages of passengers grouped by gender

In [None]:
gender_gbo["age"].var()

### First records of all columns of passengers grouped by gender

In [None]:
gender_gbo.first()

### First record of column "age" of passengers grouped by gender

In [None]:
gender_gbo["age"].first()

### Last records of all columns of passengers grouped by gender

In [None]:
gender_gbo.last()

### Last record of column "age" of passengers grouped by gender

In [None]:
gender_gbo["age"].last()

### Multiple aggregate functions combined using agg() method

In [None]:
gender_gbo["age"].agg(["min", "max", "count", "mean", "median", "std", "var", "first", "last"])
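
Besides a list of function names, `agg()` also accepts keyword arguments that name each output column (named aggregation), as well as custom functions. The following is a minimal sketch on a made-up dataframe; the data and the output column names `mean_age`, `oldest`, and `survivors` are chosen here purely for illustration.

```python
import pandas as pd

# Illustrative data only; not taken from titanic.csv
toy = pd.DataFrame({
    "gender": ["female", "male", "male", "female"],
    "survived": [1, 0, 1, 1],
    "age": [20.0, 30.0, 50.0, 40.0],
})

# Named aggregation: output_column=(input_column, function)
summary = toy.groupby("gender").agg(
    mean_age=("age", "mean"),
    oldest=("age", "max"),
    survivors=("survived", "sum"),
)

# A custom function receives the values of each group as a Series
age_range = toy.groupby("gender")["age"].agg(lambda s: s.max() - s.min())

print(summary)
print(age_range)
```

Named aggregation is useful when different columns need different summaries, because the resulting dataframe has flat, descriptive column names.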