![Alt text](https://imgur.com/orZWHly.png=80)
source: @allison_horst https://github.com/allisonhorst/penguins

You have been asked to support a team of researchers who have been collecting data about penguins in Antarctica! The data is available in CSV format as `penguins.csv`.

**Origin of this data:** Data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network.

**The dataset consists of 5 columns.**

Column | Description
--- | ---
culmen_length_mm | culmen length (mm)
culmen_depth_mm | culmen depth (mm)
flipper_length_mm | flipper length (mm)
body_mass_g | body mass (g)
sex | penguin sex

Unfortunately, they have not been able to record the species of penguin, but they know that there are **at least three** species that are native to the region: **Adelie**, **Chinstrap**, and **Gentoo**.  Your task is to apply your data science skills to help them identify groups in the dataset!

In [22]:
# Import Required Packages
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

# Loading and examining the dataset
penguins_df = pd.read_csv("penguins.csv")
penguins_df.head()


   culmen_length_mm  culmen_depth_mm  flipper_length_mm  body_mass_g     sex
0              39.1             18.7              181.0       3750.0    MALE
1              39.5             17.4              186.0       3800.0  FEMALE
2              40.3             18.0              195.0       3250.0  FEMALE
3              36.7             19.3              193.0       3450.0  FEMALE
4              39.3             20.6              190.0       3650.0    MALE


In [23]:
# Separate numeric and categorical columns
num_features = penguins_df.select_dtypes(include=['float64', 'int64']).columns
categorical_features = penguins_df.select_dtypes(include=['object']).columns

# Preprocess with a ColumnTransformer: scale numeric columns, one-hot encode categorical ones
num_scaler = StandardScaler()
# drop='first' omits one indicator column per categorical feature to avoid a redundant column
cat_encoder = OneHotEncoder(drop='first')

preprocessor = ColumnTransformer(
    transformers=[
        ('num', num_scaler, num_features),
        ('cat', cat_encoder, categorical_features)
    ])

# Fit and transform the data
transformed_features = preprocessor.fit_transform(penguins_df)

In [24]:
# Apply KMeans clustering
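k = 3  # Number of clusters; adjust according to your analysis
# n_init is set explicitly because its default differs across scikit-learn versions
kmeans = KMeans(n_clusters=k, n_init=10, random_state=123)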
penguins_df['cluster'] = kmeans.fit_predict(transformed_features)

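# Calculate the mean of each numeric column per cluster
# (numeric_only=True skips the non-numeric sex column, which newer pandas versions would otherwise reject)
stat_penguins = penguins_df.groupby('cluster').mean(numeric_only=True)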

# Display the DataFrame stat_penguins
print(stat_penguins)

         culmen_length_mm  culmen_depth_mm  flipper_length_mm  body_mass_g
cluster                                                                   
0               47.477907        18.787209         197.279070  3918.604651
1               38.356693        18.066929         188.244094  3571.259843
2               47.568067        14.996639         217.235294  5092.436975
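
Since `k` is described as adjustable, one common way to sanity-check the choice is an elbow analysis: fit KMeans for several values of `k` and compare the inertia (within-cluster sum of squares). The sketch below is not part of the original analysis. It runs on synthetic data from `make_blobs` with hand-picked centers (an assumption, used only so the example is self-contained); in the notebook you would pass `transformed_features` in place of `X_scaled`.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption): three well-separated groups in 4 dimensions
centers = [[0, 0, 0, 0], [8, 8, 8, 8], [-8, 8, -8, 8]]
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=1.0, random_state=123)
X_scaled = StandardScaler().fit_transform(X)

# Fit KMeans for k = 1..6 and record the inertia of each fit
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=123)
    model.fit(X_scaled)
    inertias[k] = model.inertia_

for k, inertia in inertias.items():
    print(f"k={k}: inertia={inertia:.1f}")
```

The value of `k` after which the inertia stops dropping sharply is a reasonable candidate. On real data the bend is often less clear than on this synthetic example, so treat it as one input alongside domain knowledge, such as the three known species.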
