![Imgur](https://i.imgur.com/NvyrPIB.png)

# Data Overview

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
plt.rcParams['figure.dpi'] = 300
plt.rcParams["axes.spines.right"] = False
plt.rcParams["axes.spines.top"] = False
plt.rcParams["figure.figsize"] = (4.5, 3.5)

penguin_dataset = pd.read_csv("/kaggle/input/palmers-penguin-dataset-extended/palmerpenguins_extended.csv")
df = penguin_dataset.copy()

df

# Univariate Analysis

# 　Descriptive Statistics

In [None]:
df[['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']].describe().round(1)

# 　Frequency Table

In [None]:
df[['species','sex','diet','life_stage']].value_counts().reset_index(name='count')
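
The cell above counts each combination of the four columns jointly. As a minimal sketch of how joint counts differ from per-column (marginal) counts, the example below uses a small made-up frame named `toy`; its rows are invented for illustration and are not taken from the penguin dataset.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the penguin dataset.
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Chinstrap", "Gentoo", "Adelie"],
    "sex": ["male", "female", "female", "male", "male", "female"],
})

# Joint counts: one row per (species, sex) combination, as in the cell above.
joint_counts = toy[["species", "sex"]].value_counts().reset_index(name="count")

# Marginal counts: a separate frequency table for a single column.
species_counts = toy["species"].value_counts()

# The same table expressed as proportions of all rows.
species_share = toy["species"].value_counts(normalize=True).round(2)

print(joint_counts)
print(species_counts)
print(species_share)
```

Joint counts are useful for spotting sparse combinations, while marginal counts show how balanced each variable is on its own.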

# Bivariate Analysis

# 　Species-wise

In [None]:
selected_columns = df[['species','bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']]
sns.pairplot(selected_columns, corner=True, hue='species')
plt.show()

## Pivot Table

In [None]:
pivot_table_result = pd.pivot_table(df,
                                    values=['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g'],
                                    index=['species'],
                                    aggfunc='mean')  # You can change 'mean' to other functions like 'sum', 'max', etc.

print(pivot_table_result.round(1))

The pivot table shows distinct characteristics across the Adelie, Chinstrap, and Gentoo species in terms of bill depth, bill length, body mass, and flipper length.

- **Gentoo**: Lead in all categories, having the deepest bills, longest bills, highest body mass, and longest flippers.
  
- **Chinstrap**: Generally heavier than Adelie but similar in other aspects.
  
- **Adelie**: Smallest in all metrics, and very similar to Chinstrap apart from a slightly lower body mass.

In summary, Gentoo penguins are generally larger, while Adelie and Chinstrap are more similar to each other but differ slightly in body mass. These traits could serve as distinguishing features in ecological studies or machine learning models.
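
The comment in the pivot-table cell notes that `aggfunc` can be changed. As a hedged sketch, `aggfunc` also accepts a list, which puts the mean next to the spread and the group size so that differences between species can be judged against within-species variation. The frame `toy` and its values are made up for illustration.

```python
import pandas as pd

# Made-up measurements for illustration only.
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
    "body_mass_g": [3500.0, 3700.0, 5000.0, 5400.0],
    "flipper_length_mm": [185.0, 190.0, 215.0, 220.0],
})

# Passing a list to aggfunc yields one column block per statistic.
summary = pd.pivot_table(
    toy,
    values=["body_mass_g", "flipper_length_mm"],
    index="species",
    aggfunc=["mean", "std", "count"],
)

print(summary.round(1))
```

The resulting columns form a two-level index of (statistic, measurement), so a single value can be read with `summary.loc["Adelie", ("mean", "body_mass_g")]`.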

# 　Sex-wise

In [None]:
selected_columns = df[['sex','bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']]
sns.pairplot(selected_columns, corner=True, hue='sex')
plt.show()

## Pivot Table

In [None]:
pivot_table_result = pd.pivot_table(df,
                                    values=['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g'],
                                    index=['sex'],
                                    aggfunc='mean')  # You can change 'mean' to other functions like 'sum', 'max', etc.

print(pivot_table_result.round(1))

The table shows noticeable differences between female and male penguins in terms of bill depth, bill length, body mass, and flipper length:

- **Male Penguins**: Generally larger in all metrics, including deeper and longer bills, higher body mass, and longer flippers.

- **Female Penguins**: Have smaller dimensions across the board, with shallower and shorter bills, lower body mass, and shorter flippers.

These observations indicate sexual dimorphism in the penguin populations studied. The males are noticeably larger, suggesting that these metrics can be important features to distinguish between sexes in biological or machine learning studies.
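
One way to put a number on "noticeably larger" is a standardized mean difference such as Cohen's d. This is not part of the original analysis; the sketch below assumes a made-up frame `toy` and a hypothetical helper `cohens_d`, and uses the pooled sample standard deviation.

```python
import numpy as np
import pandas as pd

# Made-up body masses for illustration only.
toy = pd.DataFrame({
    "sex": ["male"] * 4 + ["female"] * 4,
    "body_mass_g": [4200.0, 4500.0, 4800.0, 5100.0,
                    3600.0, 3900.0, 4200.0, 4500.0],
})


def cohens_d(a, b):
    """Standardized mean difference using the pooled sample standard deviation."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    pooled_var = (
        (len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)
    ) / (len(a) + len(b) - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)


males = toy.loc[toy["sex"] == "male", "body_mass_g"]
females = toy.loc[toy["sex"] == "female", "body_mass_g"]

d = cohens_d(males, females)
print(round(d, 2))
```

Unlike a raw difference in grams, d is unit-free, so the same helper could be applied to bill and flipper measurements and the results compared directly.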


# 　Life-stage-wise

In [None]:
selected_columns = df[['life_stage','bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']]
sns.pairplot(selected_columns, corner=True, hue='life_stage')
plt.show()

## Pivot Table

In [None]:
pivot_table_result = pd.pivot_table(df,
                                    values=['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g'],
                                    index=['life_stage'],
                                    aggfunc='mean')  # You can change 'mean' to other functions like 'sum', 'max', etc.

print(pivot_table_result.round(1))

The table provides insights into how penguin metrics differ across life stages:

- **Adult Penguins**: Display the highest values for each attribute: deepest bills, longest bills, greatest body mass, and longest flippers.

- **Juvenile Penguins**: Fall in between adults and chicks in terms of all attributes, showing a transitional stage in their development.

- **Chick Penguins**: Possess the lowest measurements across all attributes, reflecting their early stage in life.

The metrics correlate well with the life stage of the penguins, marking a clear progression from chick to juvenile to adult in terms of size and mass. This suggests that these physical attributes could serve as indicators of life stage in biological studies or machine learning models.
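
A pivot table sorts text labels alphabetically, which would list adult before chick. As a sketch, an ordered categorical makes the rows follow the developmental order and lets the progression be checked directly. The labels `chick`, `juvenile`, and `adult` and all values in `toy` are assumptions for illustration; the labels in the real column may be spelled differently.

```python
import pandas as pd

# Made-up rows; the stage labels are assumed, not confirmed from the dataset.
toy = pd.DataFrame({
    "life_stage": ["juvenile", "adult", "chick", "adult", "chick", "juvenile"],
    "body_mass_g": [3800.0, 5200.0, 2500.0, 5000.0, 2700.0, 4000.0],
})

# Declare the developmental order explicitly.
stage_order = ["chick", "juvenile", "adult"]
toy["life_stage"] = pd.Categorical(
    toy["life_stage"], categories=stage_order, ordered=True
)

# Groups now come out in category order rather than alphabetical order.
by_stage = toy.groupby("life_stage", observed=True)["body_mass_g"].mean()
print(by_stage)

# True when the mean rises (or stays level) from each stage to the next.
print(by_stage.is_monotonic_increasing)
```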

# 　Diet-wise

In [None]:
selected_columns = df[['diet','bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']]
sns.pairplot(selected_columns, corner=True, hue='diet')
plt.show()

## Pivot Table

In [None]:
pivot_table_result = pd.pivot_table(df,
                                    values=['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g'],
                                    index=['diet'],
                                    aggfunc='mean')  # You can change 'mean' to other functions like 'sum', 'max', etc.

print(pivot_table_result.round(1))

The table shows how different diets are associated with variations in penguin physical attributes:

- **Fish Diet**: Penguins on a fish diet exhibit the largest metrics across all attributes. They have the deepest and longest bills, highest body mass, and longest flippers.

- **Krill Diet**: Penguins consuming krill have smaller measurements for all attributes than those on a fish diet, but are notably larger than penguins in the parental-diet group.

- **Parental Diet**: Penguins still reliant on parental feeding (likely chicks) show the lowest values across all physical attributes.

- **Squid Diet**: Penguins on a squid diet display attributes that are generally higher than those on a krill diet but lower than those on a fish diet. This may reflect an intermediate nutritional value of squid, although the table alone cannot establish that.

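The data suggest that diet is associated with the physical size of penguins. Because diet is also closely tied to life stage (the parental diet is likely limited to chicks), the pattern should not be read as purely nutritional, but it still offers useful input for ecological studies and predictive modeling.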
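
Since the parental diet is likely limited to chicks, diet differences can partly reflect age. A minimal sketch of one way to separate the two is to compare diets within each life stage by using two index levels in the pivot table. The frame `toy`, its labels, and its values are made up for illustration.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "life_stage": ["adult", "adult", "adult", "adult",
                   "juvenile", "juvenile", "chick", "chick"],
    "diet": ["fish", "fish", "krill", "krill",
             "fish", "krill", "parental", "parental"],
    "body_mass_g": [5200.0, 5000.0, 4600.0, 4400.0,
                    4000.0, 3600.0, 2600.0, 2400.0],
})

# Two index levels: diets are compared only against others in the same stage.
within_stage = pd.pivot_table(
    toy,
    values="body_mass_g",
    index=["life_stage", "diet"],
    aggfunc="mean",
)

print(within_stage)
```

If a diet gap persists among adults alone, it is less likely to be an artifact of age.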


# 　Year-wise

In [None]:
selected_columns = df[['year','bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']]
sns.pairplot(selected_columns, corner=True, hue='year')
plt.show()

## Pivot Table

In [None]:
pivot_table_result = pd.pivot_table(df,
                                    values=['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g'],
                                    index=['year'],
                                    aggfunc='mean')  # You can change 'mean' to other functions like 'sum', 'max', etc.

print(pivot_table_result.round(1))

The table provides insights into the subtle yearly variations in penguin attributes:

- **Bill Depth**: Remains relatively stable across years, ranging from 18.4 to 18.6 mm.
  
- **Bill Length**: Also exhibits minimal fluctuation, with a slightly increased length noted in 2024.

- **Body Mass**: Shows a general increase from 2021 to 2024, followed by a slight decrease in 2025.

- **Flipper Length**: Incrementally increases from 2021 to 2024, but reverts to the 2021 length in 2025.

Overall, while there are slight fluctuations in penguin attributes over the years, the changes are not substantial. This consistency could imply stable environmental conditions or successful adaptation strategies over the observed years.
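
To judge whether a yearly change is "substantial", it can help to express it as a percentage of the previous year's mean. The sketch below does this with `pct_change` on made-up values; the frame `toy` is for illustration only and does not reproduce the table above.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "year": [2021, 2021, 2022, 2022, 2023, 2023],
    "body_mass_g": [4700.0, 4900.0, 4800.0, 4900.0, 4900.0, 4900.0],
})

# Mean per year, then the relative change from one year to the next.
yearly_mean = toy.groupby("year")["body_mass_g"].mean()
yearly_pct = yearly_mean.pct_change().mul(100).round(2)

print(pd.DataFrame({"mean_body_mass_g": yearly_mean, "pct_change": yearly_pct}))
```

The first year has no predecessor, so its percentage change is `NaN`.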



# 　Correlation Heatmaps

In [None]:
selected_columns = df[['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']]
correlation_matrix = selected_columns.corr().round(2)

sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidth=1)
plt.show()

The correlation matrix provides a quantified relationship between the continuous variables in the dataset, ranging from -1 to 1. Positive values indicate a positive correlation, whereas negative values suggest an inverse correlation. A value closer to 1 or -1 signifies a stronger correlation.

1. **Bill Length & Flipper Length (`r=0.67`)**: There is a moderately strong positive correlation between bill length and flipper length, suggesting that penguins with longer bills tend to have longer flippers.

2. **Bill Length & Body Mass (`r=0.66`)**: A similar moderately strong positive correlation is observed between bill length and body mass, indicating that larger penguins generally have longer bills.

3. **Flipper Length & Body Mass (`r=0.79`)**: A strong positive correlation is present, indicating that flipper length is a good predictor of body mass or vice versa.

4. **Bill Depth & Body Mass (`r=0.56`)**: There's a moderate positive correlation between bill depth and body mass, which suggests that penguins with deeper bills are likely to have higher body mass, although not as strongly correlated as flipper length and body mass.

5. **Bill Length & Bill Depth (`r=0.31`)**: There is a weak positive correlation, which implies that the two measurements are only loosely related.

6. **Bill Depth & Flipper Length (`r=0.49`)**: A moderate positive correlation exists, suggesting that deeper bills are somewhat associated with longer flippers, but other factors could be at play.

In summary, flipper length shows the strongest correlations with other variables, particularly with body mass, making it a valuable metric for predicting a penguin's physical characteristics. On the other hand, bill length and bill depth show moderate to weak correlations with other features, suggesting that they capture different aspects of penguin morphology.
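
`DataFrame.corr()` uses the Pearson coefficient by default. As a sketch of what each cell in the heatmap represents, the example below computes r by hand from deviations about the mean and compares it with the pandas result. The five rows in `toy` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up measurements for illustration only.
toy = pd.DataFrame({
    "flipper_length_mm": [180.0, 190.0, 200.0, 210.0, 220.0],
    "body_mass_g": [3400.0, 3900.0, 4100.0, 4800.0, 5300.0],
})

x = toy["flipper_length_mm"].to_numpy()
y = toy["body_mass_g"].to_numpy()

# Pearson r: sum of products of deviations, scaled by both spreads.
x_dev = x - x.mean()
y_dev = y - y.mean()
r_manual = (x_dev * y_dev).sum() / np.sqrt((x_dev ** 2).sum() * (y_dev ** 2).sum())

# pandas computes the same quantity.
r_pandas = toy["flipper_length_mm"].corr(toy["body_mass_g"])

print(round(r_manual, 4), round(r_pandas, 4))
```

Because r only measures linear association, a low value does not rule out a curved relationship; the pairplots above are the place to check for that.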


# 　Cramer's V

In [None]:
from scipy.stats import chi2_contingency
cat_cols = ['species','sex','diet','life_stage', 'health_metrics']
cramers_df = pd.DataFrame(index=cat_cols, columns=cat_cols)

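def cramers_v(x, y):
    # Cramer's V = sqrt(chi2 / (n * (min(rows, cols) - 1)))
    contingency = pd.crosstab(x, y)
    chi2 = chi2_contingency(contingency)[0]
    # Count only the observations in the table (crosstab drops pairs with NaN)
    n = contingency.to_numpy().sum()
    return np.sqrt(chi2 / (n * (min(contingency.shape) - 1)))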

for col1 in cat_cols:
    for col2 in cat_cols:
        cramers_df.loc[col1, col2] = cramers_v(df[col1], df[col2]).round(2)

print(cramers_df)


The Cramer's V table provides us with a measure of association between the different categorical variables in the dataset. The value ranges from 0 to 1, where 0 indicates no association, and 1 indicates a perfect association.

1. **Species & Other Variables**: Species shows little to no association with the other variables; the largest value is with diet (`V=0.14`), which is still a weak association.
   
2. **Sex & Other Variables**: The sex of the penguins also shows negligible association with all other variables, making it relatively independent.

3. **Diet & Life Stage**: A strong association is observed between diet and life stage (`V=0.73`), indicating that the life stage of a penguin is highly dependent on its diet or vice versa.

4. **Diet & Health Metrics**: There is a moderate association between diet and health metrics (`V=0.5`), implying that diet does play a role in determining the health of the penguins but isn't the only factor.

5. **Life Stage & Health Metrics**: A weak association is present between life stage and health metrics (`V=0.18`), suggesting other factors are more crucial in determining the penguins' health metrics.

6. **Health Metrics & Other Variables**: While there is a moderate association with diet, health metrics show a weak association with other variables like species, sex, and life stage.

In summary, the most important insight is the strong association between diet and life stage, implying a crucial biological or ecological link. Most other variables are relatively independent of each other, suggesting that each captures different aspects of penguin biology and ecology.
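
As a sanity check on the 0-to-1 scale, the sketch below evaluates the statistic on two made-up variables: one compared with itself (perfect association) and one compared with an unrelated variable (no association). The helper `cramers_v_uncorrected` is hypothetical and differs from the cell above in one respect: it passes `correction=False`, because `chi2_contingency` applies Yates' continuity correction to 2x2 tables by default, which keeps V below 1 even for identical variables.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def cramers_v_uncorrected(x, y, correction=False):
    """Cramer's V = sqrt(chi2 / (n * (min(rows, cols) - 1)))."""
    # Plain arrays avoid name clashes when a column is compared with itself.
    table = pd.crosstab(np.asarray(x), np.asarray(y))
    chi2 = chi2_contingency(table, correction=correction)[0]
    n = table.to_numpy().sum()
    return np.sqrt(chi2 / (n * (min(table.shape) - 1)))


# Made-up categorical data for illustration only.
toy = pd.DataFrame({
    "a": ["x", "x", "y", "y"] * 5,
    "b": ["p", "q", "p", "q"] * 5,
})

v_same = cramers_v_uncorrected(toy["a"], toy["a"])    # identical variables
v_indep = cramers_v_uncorrected(toy["a"], toy["b"])   # perfectly balanced, unrelated
v_same_yates = cramers_v_uncorrected(toy["a"], toy["a"], correction=True)

print(v_same, v_indep, v_same_yates)
```

The corrected value for identical variables falls below the uncorrected one, which is worth keeping in mind when reading entries for two-level variables such as sex.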

![Imgur](https://i.imgur.com/hljMmQi.png)