<a href="https://colab.research.google.com/github/Reben80/3DPrintCalculus/blob/main/Week7.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

# Load the penguins dataset from seaborn
penguins = sns.load_dataset('penguins')

# Display the first few rows of the dataset
print(penguins.head())


In [None]:
penguins.info()

In [None]:
penguins.describe()


In [None]:
plt.scatter(penguins['bill_length_mm'], penguins['bill_depth_mm'])
plt.xlabel('Bill Length (mm)')
plt.ylabel('Bill Depth (mm)')
plt.title('Bill Length vs. Bill Depth')
plt.show()

In [None]:
plt.scatter(penguins['bill_length_mm'], penguins['flipper_length_mm'])

In [None]:
plt.scatter(penguins['bill_length_mm'], penguins['body_mass_g'])

In [None]:
penguins_adelie = penguins[penguins['species'] == 'Adelie']
penguins['species'].unique()

In [None]:

plt.scatter(penguins_adelie['bill_length_mm'], penguins_adelie['bill_depth_mm'])
plt.xlabel('Bill Length (mm)')
plt.ylabel('Bill Depth (mm)')
plt.title('Bill Length vs. Bill Depth (Adelie)')
plt.show()

In [None]:
# Check for missing values
print(penguins.isnull().sum())


In [None]:
# Filter penguins by species and select bill_length_mm for Adelie
adelie_bill_length = penguins[penguins['species'] == 'Adelie']['bill_length_mm']

# Display the result
print(adelie_bill_length)

In [None]:
# Pairplot will create all scatter plots for the numerical columns
sns.pairplot(penguins)

In [None]:
# Pairplot excluding non-numerical columns and using a hue for species
sns.pairplot(penguins, hue="species", diag_kind="kde")

plt.show()


In [None]:
# Pairplot colored by island (hue), with KDE plots on the diagonal
sns.pairplot(penguins, hue="island", diag_kind="kde")

plt.show()


In [None]:
# Pairplot colored by species, with a different marker for each species
sns.pairplot(penguins, hue="species", markers=["o", "s", "D"])


plt.show()

In [None]:
sns.pairplot(penguins, hue="species", height=3.5)


In [None]:
sns.pairplot(penguins, hue="species", palette="coolwarm")


### Key Features of `sns.pairplot`:

1. **Plots All Pairwise Relationships**:
   - For a dataset with `n` numerical columns, `pairplot` creates an $ n \times n $ grid of subplots.
   - Each off-diagonal plot is a scatter plot showing the relationship between two numerical variables.
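   - For $n$ columns, the grid contains $n(n-1)$ off-diagonal scatter plots and $n$ diagonal plots; each of the $\binom{n}{2}$ distinct variable pairs appears twice, mirrored across the diagonal.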

2. **Diagonal Plots (Histograms or KDEs)**:
   - When no `hue` is set, `pairplot` plots histograms in the diagonal cells to show the distribution of each numerical variable; when `hue` is set, the default switches to kernel density estimates (KDEs).
   - You can choose explicitly with the `diag_kind` argument (`"hist"` or `"kde"`).

3. **Hue (Color Coding by Category)**:
   - Use the `hue` parameter to color-code the scatter plots by a categorical variable (like species, sex, or island).
   - This allows you to see how different categories are distributed across pairs of numerical variables.

4. **Customization**:
   - You can customize marker styles, palettes, plot size, and other visual aspects of the grid.
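
The panel counts in item 1 can be checked with a short sketch. It uses only the standard library and assumes the four numeric columns of the penguins dataset (names taken from the cells above); it counts panels but does not draw anything.

```python
from itertools import combinations, permutations

# Numeric columns of the penguins dataset used throughout this notebook
numeric_cols = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']
n = len(numeric_cols)

# Every ordered (row, column) pair with row != column is an off-diagonal scatter plot
off_diagonal = list(permutations(numeric_cols, 2))

# Each unordered pair appears twice in the grid (once above, once below the diagonal)
unique_pairs = list(combinations(numeric_cols, 2))

print(f"grid size: {n} x {n} = {n * n} panels")
print(f"diagonal panels: {n}")
print(f"off-diagonal scatter plots: {len(off_diagonal)}")
print(f"unique variable pairs: {len(unique_pairs)}")
```

Passing `corner=True` to `sns.pairplot` draws only the lower triangle, which leaves one scatter plot per unique pair.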
   



# Bar Graph on Penguin Dataset (bad!!!)

In [None]:
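# Bad example: plt.bar draws one bar per row of the raw data, so bars for the
# same species overlap and the plot does NOT show the average body mass
#sns.barplot(x="species", y="body_mass_g", data=penguins)

plt.bar(penguins['species'], penguins['body_mass_g'])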
# Add title and labels
plt.title('Average Body Mass for Each Penguin Species')
plt.xlabel('Species')
plt.ylabel('Body Mass (g)')

# Show the plot
plt.show()
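
Why this plot is misleading can be seen on a tiny example. The sketch below uses a made-up five-row table (the values are illustrative, not taken from the penguins dataset) and suggests that `bar` draws one rectangle per row rather than one per species, so the visible height of each bar is the largest value in the group, not the average.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data: two species, five rows
toy = pd.DataFrame({
    'species': ['A', 'A', 'A', 'B', 'B'],
    'body_mass_g': [3000, 3500, 4000, 5000, 6000],
})

fig, ax = plt.subplots()
bars = ax.bar(toy['species'], toy['body_mass_g'])

# One rectangle per row, overlapping at the same x position
print("bars drawn:", len(bars))
print("categories:", toy['species'].nunique())

# The tallest rectangle hides the others, so the plot effectively shows the maximum
group_max = toy.groupby('species')['body_mass_g'].max()
group_mean = toy.groupby('species')['body_mass_g'].mean()
print(group_max)
print(group_mean)
plt.close(fig)
```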

# Corrected Bar Graph using aggregation

In [None]:

# Calculate the mean body mass for each species
species_means = penguins.groupby('species')['body_mass_g'].mean()

# Plot using plt.bar (now using aggregated data)
plt.bar(species_means.index, species_means.values)

# Add title and labels
plt.title('Average Body Mass for Each Penguin Species')
plt.xlabel('Species')
plt.ylabel('Body Mass (g)')

# Show the plot
plt.show()


In [None]:
species_means = penguins.groupby('species')['body_mass_g'].mean()
plt.bar(species_means.index, species_means.values)
plt.title('Average Body Mass for Each Species')
plt.xlabel('Species')
plt.ylabel('Body Mass (g)')
plt.show()


In [None]:
species_sum = penguins.groupby('species')['body_mass_g'].sum()
plt.bar(species_sum.index, species_sum.values)
plt.title('Sum of Body Mass for Each Species')
plt.xlabel('Species')
plt.ylabel('Body Mass (g)')
plt.show()

### Aggregation with the Penguin Dataset

**Aggregation** is the process of summarizing data by grouping it and applying a function (e.g., mean, sum, min, max) to each group.

#### Example: Aggregating Body Mass by Species

If we want to summarize the **body mass** of penguins by their **species**, we can use pandas' `groupby()` function and apply aggregation functions like `mean`, `min`, and `max`.
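
As a rough illustration, several aggregations can be computed in one call with `agg`. The sketch below uses a small made-up table (hypothetical values, so it runs without downloading the dataset); with the real data, `penguins` would take the place of `toy`.

```python
import pandas as pd

# Hypothetical toy data standing in for the penguins table
toy = pd.DataFrame({
    'species': ['Adelie', 'Adelie', 'Gentoo', 'Gentoo', 'Gentoo'],
    'body_mass_g': [3500.0, 3700.0, 5000.0, 5200.0, 5400.0],
})

# One row per species, one column per aggregation function
summary = toy.groupby('species')['body_mass_g'].agg(['mean', 'min', 'max', 'count'])
print(summary)
```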

#### Common Aggregations:

1. **Mean** (Average Body Mass by Species):
  


In [None]:

species_means = penguins.groupby('species')['body_mass_g'].mean()

2. **Min** (Smallest Body Mass by Species):

In [None]:
species_min = penguins.groupby('species')['body_mass_g'].min()


3. **Count** (Number of Penguins with a Recorded Body Mass, by Species):
4. **Sum** (Total Body Mass of All Penguins in Each Species):

In [None]:
species_count = penguins.groupby('species')['body_mass_g'].count()

species_sum = penguins.groupby('species')['body_mass_g'].sum()

In [None]:
species_means.index


In [None]:
species_means.values


### Using `groupby()` in pandas

Calling `groupby()` followed by an aggregation (like `mean`, `min`, or `max`) returns a **pandas Series or DataFrame** where the **index** holds the values of the grouping variable (in this case, `species`) and the **values** hold the aggregated quantity for each group.
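
A minimal sketch of this on a made-up table (hypothetical values): `groupby()` on its own yields a GroupBy object, and the indexed Series appears only once an aggregation such as `mean()` is applied.

```python
import pandas as pd

# Hypothetical toy data standing in for the penguins table
toy = pd.DataFrame({
    'species': ['Chinstrap', 'Adelie', 'Adelie', 'Chinstrap'],
    'body_mass_g': [3800.0, 3500.0, 3700.0, 3600.0],
})

grouped = toy.groupby('species')['body_mass_g']  # GroupBy object, nothing aggregated yet
means = grouped.mean()                           # Series indexed by species

print(type(grouped).__name__)
print(list(means.index))   # group labels, sorted alphabetically by default
print(means.tolist())      # one aggregated value per group

# reset_index() turns the index back into an ordinary column
print(means.reset_index())
```

The `index` and `values` attributes are what get passed to `plt.bar` as the x positions and bar heights.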




## Using Seaborn

One benefit of seaborn is that it performs the aggregation for you: `sns.barplot` plots the mean of each group by default, so no `groupby()` call is needed.
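
To make this concrete, the sketch below (made-up data, hypothetical values) compares the bar heights drawn by `sns.barplot` with group means computed by hand. It assumes the bars are the only rectangles on the axes, which is the case for a plain bar plot without `hue`. The thin lines seaborn adds on top of each bar are error bars.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical toy data: three species, two rows each
toy = pd.DataFrame({
    'species': ['A', 'A', 'B', 'B', 'C', 'C'],
    'body_mass_g': [3000.0, 4000.0, 5000.0, 5500.0, 3600.0, 3800.0],
})

fig, ax = plt.subplots()
sns.barplot(x='species', y='body_mass_g', data=toy, ax=ax)

# Heights of the rectangles seaborn drew, one per species
bar_heights = sorted(patch.get_height() for patch in ax.patches)

# The same numbers computed explicitly with groupby
manual_means = sorted(toy.groupby('species')['body_mass_g'].mean())

print(bar_heights)
print(manual_means)
print(np.allclose(bar_heights, manual_means))
plt.close(fig)
```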

In [None]:
# Bar plot of species vs average body mass
sns.barplot(x="species", y="body_mass_g", data=penguins)

#plt.bar(penguins['species'], penguins['body_mass_g'])
# Add title and labels
plt.title('Average Body Mass for Each Penguin Species')
plt.xlabel('Species')
plt.ylabel('Body Mass (g)')

# Show the plot
plt.show()

In [None]:

sns.barplot(x="species", y="body_mass_g", data=penguins, estimator=np.median)

plt.title('Median Body Mass for Each Penguin Species')
plt.xlabel('Species')
plt.ylabel('Body Mass (g)')
plt.show()


# Combined Bar Graph


In [None]:

# Grouped bar plot: species vs body mass, colored by sex
sns.barplot(x='species', y='body_mass_g', hue='sex', data=penguins)

# Add title and labels
plt.title('Average Body Mass by Species and Sex')
plt.xlabel('Species')
plt.ylabel('Body Mass (g)')

# Show the plot
plt.show()


In [None]:

# Grouped bar plot: species vs body mass, colored by island
sns.barplot(x='species', y='body_mass_g', hue='island', data=penguins)

# Add title and labels
plt.title('Average Body Mass by Species and Island')
plt.xlabel('Species')
plt.ylabel('Body Mass (g)')
plt.show()

In [None]:

# Grouped bar plot: sex vs body mass, colored by species
sns.barplot(x='sex', y='body_mass_g', hue='species', data=penguins)

# Add title and labels
plt.title('Average Body Mass by Sex and Species')
plt.xlabel('Sex')
plt.ylabel('Body Mass (g)')
plt.show()

In [None]:

# Prepare data for stacking
species_sex_grouped = penguins.groupby(['species', 'sex'])['body_mass_g'].mean().unstack()

# Plot stacked bar graph
species_sex_grouped.plot(kind='bar', stacked=True)

# Add title and labels
plt.title('Stacked Bar Plot: Average Body Mass by Species and Sex')
plt.xlabel('Species')
plt.ylabel('Body Mass (g)')

# Show the plot
plt.show()


In [None]:
fig, axes = plt.subplots(1, 3, figsize=(10, 5))

# Bar plot for body mass by species
sns.barplot(x='species', y='body_mass_g', data=penguins, ax=axes[0])


# Bar plot for flipper length by species
sns.barplot(x='species', y='flipper_length_mm', data=penguins, ax=axes[1])

# plt.bar() has no `ax` parameter, so this line raises an error;
# the next cell fixes it by calling axes[2].bar() instead
plt.bar(penguins['species'], penguins['body_mass_g'], ax=axes[2])

# Adjust the layout
plt.tight_layout()

# Show the plot
plt.show()


In [None]:
fig, axes = plt.subplots(1, 3, figsize=(10, 5))

# Bar plot for body mass by species
sns.barplot(x='species', y='body_mass_g', data=penguins, ax=axes[0])


# Bar plot for flipper length by species
sns.barplot(x='species', y='flipper_length_mm', data=penguins, ax=axes[1])

# Use axes[2].bar() instead of plt.bar() to draw on a specific subplot
# (this panel plots the raw, unaggregated data for comparison)
axes[2].bar(penguins['species'], penguins['body_mass_g'])

# Adjust the layout before showing the figure
plt.tight_layout()

# Show the plot
plt.show()

# Simple Example of Pie Graph

In [None]:
# Example pie chart
labels = ['A', 'B', 'C']
sizes = [50, 25, 25]
plt.pie(sizes, labels=labels, autopct='%1.1f%%')
plt.show()

In [None]:


# Count the number of penguins in each species
species_counts = penguins['species'].value_counts()

# Create a pie chart
plt.figure(figsize=(6, 6))
plt.pie(species_counts, labels=species_counts.index, autopct='%1.1f%%', startangle=90, colors=['lightblue', 'lightgreen', 'coral'])
plt.title('Penguin Species Distribution')

# Show the plot
plt.show()


The `autopct='%1.1f%%'` parameter in the `plt.pie()` function is used to **display the percentage value** on each segment of the pie chart.

### Explanation:
- **`'%1.1f%%'`** is a string format for **floating point numbers**.
  - **`1.1f`** means:
    - **1**: The minimum total width of the formatted number, counting the digits, the decimal point, and the decimals; longer numbers are still displayed in full.
    - **1**: The number of digits to display **after** the decimal point.
    - **f**: This specifies that the value should be displayed as a floating point number.
  - The **`%%`** is used to display the percentage symbol (`%`).

### Breakdown:
- **`1.1f`**: Displays the number as a float with one digit after the decimal point.
- **`%%`**: Adds the literal percentage sign after the number.

### Example:
If you have a pie chart slice that represents 33.33% of the data:
- **`autopct='%1.1f%%'`** will display `33.3%` (rounding to one decimal place).
- **`autopct='%1.0f%%'`** will display `33%` (no decimal places).
- **`autopct='%1.2f%%'`** will display `33.33%` (two decimal places).
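
Because `autopct` strings use Python's `%`-style formatting, the examples above can be checked directly in Python without drawing a chart. In the sketch below, `33.3333` is an illustrative value, and the last line is an extra case showing that the number before the dot acts as a minimum field width.

```python
pct = 33.3333  # illustrative slice percentage

print('%1.1f%%' % pct)   # → 33.3%
print('%1.0f%%' % pct)   # → 33%
print('%1.2f%%' % pct)   # → 33.33%

# Width 6: the number is right-aligned and padded with spaces on the left
padded = '%6.1f%%' % pct
print(repr(padded))      # → '  33.3%'
```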



In [None]:
# Example pie chart
explode = (0.1,0, 0)
labels = ['A', 'B', 'C']
sizes = [50, 25, 25]
plt.pie(sizes, labels=labels,explode=explode, hatch=['**O', 'oO', 'O.O', '.||.'])
plt.show()


In [None]:
explode = (0, 0.1, 0)  # only "explode" the 2nd slice (i.e. 'B'); one entry per slice

fig, ax = plt.subplots()
ax.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',
       shadow=True, startangle=90)
plt.show()

In [None]:
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

# Load the penguins dataset
penguins = sns.load_dataset('penguins')

# Preprocessing: drop rows with missing values (copy so a new column can be added safely)
penguins_clean = penguins.dropna().copy()

# Select numerical features for clustering
features = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']

# Standardize the data
scaler = StandardScaler()
penguins_scaled = scaler.fit_transform(penguins_clean[features])

# Apply KMeans clustering (n_init set explicitly so results do not depend on the library default)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
penguins_clean['cluster'] = kmeans.fit_predict(penguins_scaled)

# Visualize the clusters using bill length and flipper length
plt.scatter(penguins_clean['bill_length_mm'], penguins_clean['flipper_length_mm'], c=penguins_clean['cluster'], cmap='viridis')

# Add labels and title
plt.xlabel('Bill Length (mm)')
plt.ylabel('Flipper Length (mm)')
plt.title('K-Means Clustering of Penguins')

plt.show()
