# CDCS Summer School
# A Gentle Introduction to Coding for Data Analysis
## Session 12: Birdseye View

---------------

### Learning objectives for this session:

By the end of this notebook you will be able to:

1. Summarise data using descriptive statistics.
2. Group data and perform aggregate operations.
3. Use value counts and unique values.
4. Build cross-tabulations and pivot tables.

--------

## 1. How do we summarise data with pandas?

Pandas provides several methods to generate descriptive statistics that summarise the central tendency, dispersion, and shape of a dataset's distribution.

In [None]:
import pandas as pd

# Load the Palmer Penguins dataset
file_path = 'data/palmer_penguins.csv'
penguins = pd.read_csv(file_path)

The `describe()` function generates summary statistics for numerical columns in the dataset, including count, mean, standard deviation, minimum, and maximum values, as well as the 25th, 50th (median), and 75th percentiles.

In [None]:
# Generate summary statistics for numerical columns
summary_stats = penguins.describe()
summary_stats

This next cell calculates the mean (average) value for each numerical column in the dataset. The mean provides a measure of the central tendency of the data.

In [None]:
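# Get the mean of each numerical column
# (numeric_only=True skips text columns such as species, island and sex,
# which would otherwise raise a TypeError in recent pandas versions)
mean_values = penguins.mean(numeric_only=True)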
mean_values

The `median()` function calculates the median value for each numerical column in the dataset. The median is the middle value when the data is sorted, providing another measure of central tendency that is less sensitive to outliers than the mean.

In [None]:
# Get the median of each numerical column
median_values = penguins.median(numeric_only=True)
median_values
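To see why the median is less sensitive to outliers than the mean, here is a minimal sketch on a tiny made-up series. The numbers are hypothetical and are not taken from the penguins dataset.

```python
import pandas as pd

# Hypothetical body masses in grams; the last value is a deliberate outlier
masses = pd.Series([3500, 3600, 3700, 3800, 9000])

mean_mass = masses.mean()      # pulled upwards by the single large value
median_mass = masses.median()  # middle value of the sorted data

print(f"mean: {mean_mass}, median: {median_mass}")
# mean: 4720.0, median: 3700.0
```

The single extreme value moves the mean by about a thousand grams, while the median stays at the middle observation.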

Next, the cell computes the standard deviation for each numerical column in the dataset. The standard deviation measures the amount of variation or dispersion in the data, indicating how spread out the values are from the mean.

In [None]:
# Get the standard deviation of each numerical column
std_dev = penguins.std(numeric_only=True)
std_dev

You may remember from earlier in the week that the `min()` function calculates the minimum value for each numerical column, while the `max()` function calculates the maximum value. Together, these values identify the range of the data.

In [None]:
# Get the minimum value of each numerical column
min_values = penguins.min(numeric_only=True)
print(min_values)  # a cell only displays its last expression automatically, so print this one

# Get the maximum value of each numerical column
max_values = penguins.max(numeric_only=True)
max_values

This cell calculates the variance for each numerical column in the dataset. Variance measures the average squared deviation of each number from the mean, providing another way to understand data dispersion.

In [None]:
# Calculate the variance of each numerical column
variance_values = penguins.var(numeric_only=True)
variance_values
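The sketch below uses a small made-up series to show how variance and standard deviation relate. It also shows that pandas computes the *sample* variance by default (dividing by n - 1); passing `ddof=0` gives the population variance (dividing by n). The numbers are hypothetical.

```python
import pandas as pd

values = pd.Series([2.0, 4.0, 4.0, 6.0])  # made-up data with mean 4.0

sample_var = values.var()            # default ddof=1: sum of squared deviations / (n - 1)
population_var = values.var(ddof=0)  # sum of squared deviations / n
std_dev = values.std()               # square root of the sample variance

print(sample_var, population_var, std_dev)
```

Here the squared deviations sum to 8, so the sample variance is 8 / 3 and the population variance is 8 / 4.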

We can start to use the parameters built into the `describe` function we saw earlier. Using `describe(include='all')` generates a comprehensive overview of the dataset, including summary statistics for both numerical and categorical columns. This function provides a full picture of the dataset's central tendencies, dispersions, and categorical distributions.

In [None]:
# Use the describe function to get a comprehensive overview
comprehensive_overview = penguins.describe(include='all')
comprehensive_overview

--------

## 2. Grouping data together.

Grouping data allows us to split the dataset into subsets, perform operations on each subset, and then combine the results.

In [None]:
# Group by species and calculate the mean of each numerical column
grouped_by_species = penguins.groupby('species').mean(numeric_only=True)
grouped_by_species

This cell groups the data by the species column and calculates the mean of each numerical column for each species. Grouping by a single column helps us understand how the numerical features vary across different species.
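To make the "split, apply, combine" idea concrete, the sketch below computes group means by hand on a tiny made-up table and then compares them with the `groupby()` result. The `toy` DataFrame and its values are hypothetical.

```python
import pandas as pd

# Tiny made-up dataset, not the real penguins data
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
    "body_mass_g": [3500, 3700, 5000, 5200],
})

# Split and apply by hand: select each subset, then take its mean
adelie_mean = toy[toy["species"] == "Adelie"]["body_mass_g"].mean()
gentoo_mean = toy[toy["species"] == "Gentoo"]["body_mass_g"].mean()

# groupby does the split, the apply and the combine in one step
group_means = toy.groupby("species")["body_mass_g"].mean()

print(adelie_mean, gentoo_mean)
print(group_means)
```

Both routes give the same numbers; `groupby()` simply saves us from writing one line per group.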

In [None]:
# Group by species and island, then calculate the mean
grouped_by_species_island = penguins.groupby(['species', 'island']).mean(numeric_only=True)
grouped_by_species_island

This cell groups the data by both species and island columns, then calculates the mean of each numerical column for each combination of species and island. Grouping by multiple columns allows for more granular analysis of the data.

In [None]:
# Aggregate multiple functions for each group
aggregated_data = penguins.groupby('species').agg({
    'bill_length_mm': ['mean', 'std'],
    'body_mass_g': ['min', 'max']
})
aggregated_data

This cell uses the `agg()` function to apply different aggregate functions to different columns: the mean and standard deviation of `bill_length_mm`, and the minimum and maximum of `body_mass_g`, for each species. This provides a more detailed summary of each group.
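When `agg()` receives a dictionary of lists, the result has two levels of column labels: the original column name and the function name. The sketch below, on a small made-up table, shows how one of those columns can be selected with a tuple. The `toy` data is hypothetical.

```python
import pandas as pd

# Tiny made-up dataset, not the real penguins data
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
    "bill_length_mm": [38.0, 40.0, 46.0, 48.0],
    "body_mass_g": [3500, 3700, 5000, 5200],
})

summary = toy.groupby("species").agg({
    "bill_length_mm": ["mean", "std"],
    "body_mass_g": ["min", "max"],
})

# Each column label is a (column, function) pair
print(summary.columns.tolist())

# Select a single statistic with a tuple
mean_bill = summary[("bill_length_mm", "mean")]
print(mean_bill)
```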

In [None]:
# Apply custom aggregation functions to grouped data
custom_agg = penguins.groupby('species').agg({
    'bill_length_mm': 'mean',
    'bill_depth_mm': 'std',
    'flipper_length_mm': 'max',
    'body_mass_g': lambda x: x.max() - x.min()
})
custom_agg

This cell demonstrates how to apply custom aggregation functions to the grouped data. It calculates the mean of bill_length_mm, standard deviation of bill_depth_mm, maximum of flipper_length_mm, and the range (max - min) of body_mass_g for each species.

In [None]:
# Use transform to apply a function to each group
normalized_body_mass = penguins.groupby('species')['body_mass_g'].transform(lambda x: (x - x.mean()) / x.std())
penguins['normalized_body_mass'] = normalized_body_mass
penguins.head()

The `transform()` function is used to apply a function to each group and return a transformed version of the data. In this cell, we normalize the body_mass_g column for each species by subtracting the mean and dividing by the standard deviation.
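A key difference from `agg()` is that `transform()` returns one value per original row rather than one value per group. The sketch below illustrates this on a tiny made-up table. The `toy` data is hypothetical.

```python
import pandas as pd

# Tiny made-up dataset, not the real penguins data
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo"],
    "body_mass_g": [3400, 3600, 3800, 4800, 5000, 5200],
})

# Standardise body mass within each species
z = toy.groupby("species")["body_mass_g"].transform(
    lambda x: (x - x.mean()) / x.std()
)

# One value per group (agg) versus one value per row (transform)
per_group = toy.groupby("species")["body_mass_g"].mean()

print(len(per_group), len(z))
print(z)
```

Within each species the lightest bird scores about -1, the middle one 0 and the heaviest about 1, even though Gentoo birds are much heavier in absolute terms.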

In [None]:
# Filter groups based on the mean of body mass
filtered_groups = penguins.groupby('species').filter(lambda x: x['body_mass_g'].mean() > 4000)
filtered_groups

This cell uses the `filter()` function to filter groups based on a condition. It keeps only the groups where the mean body mass is greater than 4000 grams.
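Note that `filter()` on a grouped object keeps or drops whole groups and returns the original rows, not a summary. A minimal sketch on made-up data (the `toy` table is hypothetical):

```python
import pandas as pd

# Tiny made-up dataset, not the real penguins data
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
    "body_mass_g": [3500, 3700, 5000, 5200],
})

# Keep only the species whose mean body mass exceeds 4000 g
heavy = toy.groupby("species").filter(lambda g: g["body_mass_g"].mean() > 4000)

print(heavy)
```

The Adelie group has a mean of 3600 g, so both of its rows are dropped; both Gentoo rows survive unchanged.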

In [None]:
# Generate descriptive statistics for groups
group_descriptive_stats = penguins.groupby('species').describe()
group_descriptive_stats

The `describe()` function generates descriptive statistics for each group. This cell provides a comprehensive summary of the data for each species, including count, mean, standard deviation, and quartiles.

In [None]:
# Iterate through groups and perform operations
for name, group in penguins.groupby('species'):
    print(f"Species: {name}")
    print(group.head(), "\n")

This cell demonstrates how to iterate through groups using a for loop. It prints the name of each group (species) and the first few rows of the corresponding subset of the data. Iterating through groups allows for custom operations on each group.

----- 

## 3. Using value counts and unique values.

Understanding the make-up of categorical variables is essential for summarising the dataset.

Below, the cells use the `value_counts()` function to count the occurrences of each unique value in the species and island columns. The result is a Series showing the frequency of each unique value.

In [None]:
# Get the count of unique values in the 'species' column
species_counts = penguins['species'].value_counts()
species_counts

In [None]:
# Get the count of unique values in the 'island' column
island_counts = penguins['island'].value_counts()
island_counts

The `value_counts(normalize=True)` function returns the normalised value counts as proportions of the total count. This helps us understand the relative frequency of each unique value.

In [None]:
# Get normalized value counts as proportions
species_proportions = penguins['species'].value_counts(normalize=True)
species_proportions
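As a quick check of what `normalize=True` does, the sketch below compares raw counts and proportions on a tiny made-up series. The values are hypothetical.

```python
import pandas as pd

# Made-up labels, not the real penguins data
species = pd.Series(["Adelie", "Adelie", "Gentoo", "Chinstrap"])

counts = species.value_counts()
proportions = species.value_counts(normalize=True)

# Each proportion is the count divided by the total number of values
print(counts["Adelie"], proportions["Adelie"])
print(proportions.sum())
```

The proportions always sum to 1, which makes them easier to compare across datasets of different sizes.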

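Next, we can sort the value counts in descending order using the `sort_values()` function. By default `value_counts()` already returns its result sorted from most to least frequent, so the explicit sort here mainly shows how to control the order yourself (for example with `ascending=True`).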

In [None]:
# Sort the value counts in descending order
sorted_species_counts = penguins['species'].value_counts().sort_values(ascending=False)
sorted_species_counts

These next two cells use the `unique()` function to find and display the unique values in the species and island columns. The `unique()` function returns an array of the unique values in the specified column.

In [None]:
# Get the unique values in the 'species' column
unique_species = penguins['species'].unique()
unique_species

In [None]:
# Get the unique values in the 'island' column
unique_islands = penguins['island'].unique()
unique_islands

The `nunique()` function is used to count the number of unique values in the species column. This gives us an idea of the distinct categories present in the column.

In [None]:
# Count the number of unique values in the species column
num_unique_species = penguins['species'].nunique()
num_unique_species

We can also filter the result of `value_counts()`. This next cell keeps only the unique values that occur more than 50 times, which helps us focus on the most common categories.

In [None]:
# Filter value counts to show only values with more than 50 occurrences
species_counts = penguins['species'].value_counts()  # compute the counts once and reuse them
filtered_species_counts = species_counts[species_counts > 50]
filtered_species_counts
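The filtering above relies on a boolean mask: comparing a Series with a number gives a Series of `True`/`False` values, which is then used to select entries. A minimal sketch on made-up data (the labels and the threshold of 1 are hypothetical):

```python
import pandas as pd

# Made-up labels, not the real penguins data
species = pd.Series(["Adelie"] * 3 + ["Gentoo"] * 2 + ["Chinstrap"])

counts = species.value_counts()  # Adelie 3, Gentoo 2, Chinstrap 1
mask = counts > 1                # True where a value occurs more than once
frequent = counts[mask]          # keep only the entries where the mask is True

print(mask)
print(frequent)
```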

Finally, we demonstrate how to handle missing values when using `value_counts()`. By default missing values are left out of the count; setting `dropna=False` includes NaN values. This is useful for understanding how much data is missing in the column.

In [None]:
# Handle missing values when using value counts
penguins_with_nan = penguins.copy()
penguins_with_nan.loc[5:10, 'species'] = None  # Introduce some NaN values for demonstration

species_counts_with_nan = penguins_with_nan['species'].value_counts(dropna=False)
species_counts_with_nan

-----

## 4. Cross-tabs and pivot tables.

Cross-tabulations and pivot tables allow us to summarise the data in a matrix format, providing insights into relationships between variables.

In [None]:
# Create a cross-tabulation of species and island
species_island_crosstab = pd.crosstab(penguins['species'], penguins['island'])
species_island_crosstab

This cell uses the `crosstab()` function to create a cross-tabulation of the species and island columns. The resulting table shows the frequency of each species on each island.

The `margins=True` parameter in the `crosstab()` function adds row and column totals to the cross-tabulation. This helps in understanding the overall distribution and totals for each category.

In [None]:
# Add margins to include row and column totals
species_island_crosstab_margins = pd.crosstab(penguins['species'], penguins['island'], margins=True)
species_island_crosstab_margins

This cell normalizes the cross-tabulation to show proportions instead of raw counts. The `normalize='index'` parameter scales the counts to proportions within each row, making it easier to compare distributions across categories.

In [None]:
# Normalize the cross-tabulation to show proportions
species_island_crosstab_normalized = pd.crosstab(penguins['species'], penguins['island'], normalize='index')
species_island_crosstab_normalized
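The direction of normalisation matters. The sketch below, on a tiny made-up table, contrasts `normalize='index'` (each row sums to 1) with `normalize='columns'` (each column sums to 1). The `toy` data is hypothetical.

```python
import pandas as pd

# Tiny made-up dataset, not the real penguins data
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo"],
    "island": ["Biscoe", "Dream", "Dream", "Torgersen", "Biscoe", "Biscoe"],
})

# Share of each species found on each island (rows sum to 1)
row_props = pd.crosstab(toy["species"], toy["island"], normalize="index")

# Share of each island's birds belonging to each species (columns sum to 1)
col_props = pd.crosstab(toy["species"], toy["island"], normalize="columns")

print(row_props)
print(col_props)
```

In this toy table, half of the Adelie birds live on Dream (row view), while one third of the birds on Biscoe are Adelie (column view).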

The `pivot_table()` function is used to create a basic pivot table with species as the index and sex as the columns. The values in the pivot table are the average body mass (`body_mass_g`) for each combination of species and sex.

In [None]:
# Create a pivot table to summarize the mean body mass by species and sex
pivot_table = penguins.pivot_table(values='body_mass_g', index='species', columns='sex', aggfunc='mean')
pivot_table

The next cell applies multiple aggregation functions (mean and median) in a pivot table. The resulting table shows both the mean and median body mass for each combination of species and island.

In [None]:
# Apply multiple aggregation functions in a pivot table
pivot_table_agg = pd.pivot_table(penguins, values='body_mass_g', index='species', columns='island', aggfunc=['mean', 'median'])
pivot_table_agg
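When a list of functions is passed to `aggfunc`, the resulting columns have two levels: the function name on top and the island below. The sketch below shows this on a tiny made-up table, and also shows that combinations with no observations appear as NaN. The `toy` data is hypothetical.

```python
import pandas as pd

# Tiny made-up dataset, not the real penguins data
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo"],
    "island": ["Biscoe", "Biscoe", "Dream", "Biscoe", "Biscoe"],
    "body_mass_g": [3400, 3600, 3900, 5000, 5400],
})

table = pd.pivot_table(
    toy,
    values="body_mass_g",
    index="species",
    columns="island",
    aggfunc=["mean", "median"],
)

# Select everything computed with one function via the top column level
mean_table = table["mean"]

print(table)
print(mean_table)
```

There are no Gentoo birds on Dream in this toy table, so that cell is NaN rather than zero.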

It is also possible to use custom aggregation functions in a pivot table. The following cell calculates the range (max - min) of body mass for each combination of species and island.

In [None]:
# Use custom aggregation functions in a pivot table
pivot_table_custom_agg = pd.pivot_table(penguins, values='body_mass_g', index='species', columns='island', aggfunc={'body_mass_g': lambda x: x.max() - x.min()})
pivot_table_custom_agg

Below we see that we can also create pivot tables with multiple index columns (here species and sex). The pivot table shows the average body mass for each combination of species, sex, and island.

In [None]:
# Create a pivot table with multiple index columns
pivot_table_multi_index = pd.pivot_table(penguins, values='body_mass_g', index=['species', 'sex'], columns='island')
pivot_table_multi_index

------

## ⭐️⭐️⭐️💥 What you learned in this session: Three stars and a wish.
**In your own words** write in the Markdown cell below:

- 3 things you would like to remember from this notebook.
- 1 thing you wish to understand better in the future or a question you'd like to ask.

*Add your reflections here.*

--------------

## Topic Overview

In [None]:
# Generate summary statistics for numerical columns
summary_stats = penguins.describe()
summary_stats

In [None]:
# Group by species and calculate the mean of each numerical column
grouped_by_species = penguins.groupby('species').mean(numeric_only=True)
grouped_by_species

In [None]:
# Count the occurrences of each unique value in the species column
species_counts = penguins['species'].value_counts()
species_counts

In [None]:
# Create a basic pivot table with species as index and island as columns
pivot_table_basic = pd.pivot_table(penguins, values='body_mass_g', index='species', columns='island')
pivot_table_basic

-----------

# ⛏ Exercise: What's occurring in Palmer penguins?

Analyse the Palmer Penguins dataset to understand the distribution and characteristics of different penguin species. You will use descriptive statistics, grouping, and value counts to perform this analysis.

1. Load the Dataset:
Load the Palmer Penguins dataset from the provided file path.

2. Summary Statistics:
Generate summary statistics for numerical columns in the dataset and interpret the results.

3. Group by Species:
Group the data by the species column and calculate the mean, median, and standard deviation for each numerical column.

4. Value Counts:
Count the occurrences of each unique value in the species column. Additionally, count the occurrences of unique values in the island column.

5. Filtering:
Filter the dataset to include only observations of the species "Adelie" and generate summary statistics for this subset.

6. Write a Short Report:
Write a short report summarising your findings. Include observations about the differences between species, the distribution of penguins across islands, and any other interesting insights.

In [None]:
# try to solve the task here

*write your findings here*

# ⛏ Exercise: And **pivot**

Use pivot tables to analyse the relationship between different variables in the Palmer Penguins dataset. This exercise will help you understand how to create pivot tables and interpret their results.

1. Create a Basic Pivot Table:
Create a pivot table with species as the index and island as the columns. The values should be the average body mass (body_mass_g).

2. Add Aggregations:
Create another pivot table with species as the index and island as the columns, but this time include both the mean and median of body_mass_g.

3. Custom Aggregations:
Create a pivot table that calculates the range (max - min) of body_mass_g for each combination of species and island.

4. Multiple Indexes and Values:
Create a pivot table with species and sex as the index columns and island as the columns. Include both body_mass_g and flipper_length_mm as values and calculate the mean.

5. Write a Short Report:
Write a short report summarising your findings from the pivot tables. Include observations about the relationships between species, islands, body mass, and flipper length.

In [None]:
# try to solve the task here

*write your findings here*