# EDA on Penguins

In this notebook exercise, we will walk through basic exploratory data analysis (EDA) steps on the popular penguins dataset.

### Load the dataset

Dataset source: https://github.com/allisonhorst/palmerpenguins

In [86]:
import seaborn as sns

In [87]:
df = sns.load_dataset('penguins')

In [88]:
df.shape

(344, 7)

# Step 1: Understand the Features

You can find information about this dataset here: https://www.kaggle.com/code/parulpandey/penguin-dataset-the-new-iris

**Question: in your own words**:
1. describe each feature

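  - species: the penguin species (Adelie, Chinstrap, or Gentoo).
  - island: the island where the penguin was observed (Biscoe, Dream, or Torgersen).
  - bill_length_mm: length of the penguin's bill (mm).
  - bill_depth_mm: depth of the penguin's bill (mm).
  - flipper_length_mm: length of the penguin's flipper (mm).
  - body_mass_g: body mass of the penguin (g).
  - sex: sex of the penguin (male or female).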

2. mention its type (numeric or categorical)
  - Numeric (float): [bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g]
  - Categorical: [species, island, sex]
3. write its name in Arabic
 * Species: نوع البطريق
 * Island: اسم الجزيرة
 * bill_length_mm: طول المنقار
 * bill_depth_mm: عمق/ارتفاع المنقار
 * flipper_length_mm: طول الجناح
 * body_mass_g: كتلة الجسم
 * sex: الجنس

Note: use a Markdown cell.

Hint: you can attach an image to illustrate what the features are.

<img src="https://github.com/allisonhorst/palmerpenguins/raw/main/man/figures/culmen_depth.png" width="400">

# Step 2

- Have a look at the columns and their values (`head`, `sample`, `tail`)
- Look at the technical information (`info`)

In [89]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            344 non-null    object 
 1   island             344 non-null    object 
 2   bill_length_mm     342 non-null    float64
 3   bill_depth_mm      342 non-null    float64
 4   flipper_length_mm  342 non-null    float64
 5   body_mass_g        342 non-null    float64
 6   sex                333 non-null    object 
dtypes: float64(4), object(3)
memory usage: 18.9+ KB


In [90]:
df.sample(5)

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
311,Gentoo,Biscoe,52.2,17.1,228.0,5400.0,Male
234,Gentoo,Biscoe,45.8,14.6,210.0,4200.0,Female
73,Adelie,Torgersen,45.8,18.9,197.0,4150.0,Male
156,Chinstrap,Dream,52.7,19.8,197.0,3725.0,Male
216,Chinstrap,Dream,43.5,18.1,202.0,3400.0,Female


# Step 3

1. For each column, check and handle missing values; state your strategy and justify it. Examples:
    - Strategy: drop the column. Justification: ...?
    - Strategy: fill missing values. Justification: ...?
    - Strategy: drop the row. Justification: ...?
1. Calculate count and percentage of missing values before handling them
1. Check and handle duplicated rows
1. Calculate the percentage of data loss after cleaning
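
The count and percentage of missing values per column can be computed together. The following is a minimal sketch on a hypothetical toy frame (`toy` and its values are illustrative assumptions, not the penguins data); the same two expressions should apply unchanged to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the penguins data
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Chinstrap"],
    "bill_length_mm": [39.1, np.nan, 46.2, 52.7],
    "sex": ["Male", None, None, "Female"],
})

# isna() gives a boolean mask; sum() counts the True values per column
missing_count = toy.isna().sum()
# The mean of a boolean mask is the fraction of True values
missing_pct = toy.isna().mean() * 100

missing_summary = pd.DataFrame({
    "missing_count": missing_count,
    "missing_pct": missing_pct,
})
print(missing_summary)
```

Keeping both figures in one table makes it easier to justify a strategy: a column with a tiny missing percentage is a candidate for filling or dropping rows, while a mostly empty column may be a candidate for dropping.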

In [91]:
rows_before = df.shape[0]
columns_before = df.shape[1]
print('rows before cleaning:', rows_before)
print('columns before cleaning:', columns_before)

rows before cleaning: 344
columns before cleaning: 7


In [92]:
df.isna().sum()

Unnamed: 0,0
species,0
island,0
bill_length_mm,2
bill_depth_mm,2
flipper_length_mm,2
body_mass_g,2
sex,11


In [93]:
df_nulls = df.isnull().any(axis=1)
# After inspecting the rows with nulls, we drop rows 3 and 339 because most of their values are missing.
# The remaining missing values in the sex column are filled with the mode.
df[df_nulls]

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
3,Adelie,Torgersen,,,,,
8,Adelie,Torgersen,34.1,18.1,193.0,3475.0,
9,Adelie,Torgersen,42.0,20.2,190.0,4250.0,
10,Adelie,Torgersen,37.8,17.1,186.0,3300.0,
11,Adelie,Torgersen,37.8,17.3,180.0,3700.0,
47,Adelie,Dream,37.5,18.9,179.0,2975.0,
246,Gentoo,Biscoe,44.5,14.3,216.0,4100.0,
286,Gentoo,Biscoe,46.2,14.4,214.0,4650.0,
324,Gentoo,Biscoe,47.3,13.8,216.0,4725.0,
336,Gentoo,Biscoe,44.5,15.7,217.0,4875.0,


In [94]:
df.drop([3, 339], inplace=True)
# Assign the result back instead of calling fillna(inplace=True) on a column selection
df['sex'] = df['sex'].fillna(df['sex'].mode()[0])

In [95]:
df.isna().sum()

Unnamed: 0,0
species,0
island,0
bill_length_mm,0
bill_depth_mm,0
flipper_length_mm,0
body_mass_g,0
sex,0


In [96]:
df.duplicated().sum()  # no duplicates

0
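
This dataset has no duplicated rows, so nothing needs handling here. For reference, the sketch below shows how duplicates could be counted and removed on a hypothetical toy frame (`toy` and its values are assumptions made for illustration).

```python
import pandas as pd

# Hypothetical toy frame in which the second row repeats the first
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo"],
    "island": ["Dream", "Dream", "Biscoe"],
    "body_mass_g": [3500.0, 3500.0, 5000.0],
})

# duplicated() marks rows that are identical to an earlier row
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each repeated row
deduped = toy.drop_duplicates().reset_index(drop=True)
print(n_duplicates, len(deduped))
```

In measurement data, two identical rows may still be two different animals, so it is worth checking before dropping them.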

In [97]:
rows_after = df.shape[0]
columns_after = df.shape[1]
print('rows after cleaning:', rows_after)
print('columns after cleaning:', columns_after)

rows after cleaning: 342
columns after cleaning: 7


In [99]:
print('rows percentage loss:', (rows_before - rows_after) / rows_before * 100, '%')
print('columns percentage loss:', (columns_before - columns_after) / columns_before * 100, '%')

rows percentage loss: 0.5813953488372093 %
columns percentage loss: 0.0 %


# Step 4

#### Data types conversion
- We will convert the string columns to `category` to save memory
- Numeric columns can be stored with lower precision: `float32`
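
To see why these conversions help, here is a small sketch on a hypothetical toy frame (`toy`, its size, and its values are assumptions for illustration). It converts both columns in a single `astype` call and compares the deep memory usage before and after.

```python
import pandas as pd

# Hypothetical toy columns: a repetitive string column and a float column
toy = pd.DataFrame({
    "island": ["Biscoe", "Dream", "Torgersen"] * 1000,
    "body_mass_g": [3750.0, 4200.0, 5400.0] * 1000,
})

before = toy.memory_usage(deep=True).sum()

# astype accepts a mapping of column name to target dtype
converted = toy.astype({"island": "category", "body_mass_g": "float32"})
after = converted.memory_usage(deep=True).sum()

print(f"before: {before} bytes, after: {after} bytes")
```

A `category` column stores each distinct string once plus a small integer code per row, which pays off when there are few distinct values. `float32` keeps roughly 7 significant decimal digits, which is enough for measurements recorded to one decimal place.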

In [None]:
mem_usage_before = df.memory_usage(deep=True)
mem_usage_before

In [None]:
# convert categorical types
df['species'] = df['species'].astype('category')
df['island'] = df['island'].astype('category')
df['sex'] = df['sex'].astype('category')


In [None]:
# convert numerical types
df['bill_depth_mm'] = df['bill_depth_mm'].astype('float32')
df['bill_length_mm'] = df['bill_length_mm'].astype('float32')
df['flipper_length_mm'] = df['flipper_length_mm'].astype('float32')
df['body_mass_g'] = df['body_mass_g'].astype('float32')

Calculate memory saved after type conversion

In [None]:
mem_usage_after = df.memory_usage(deep=True)
mem_usage_after

In [None]:
print('memory saved:', (mem_usage_before - mem_usage_after).sum() // 1024, 'KB')

# Step 5

#### Detect inconsistency in categorical values

The categorical columns should be checked for inconsistencies. For example, in the `sex` column we look for mixed lowercase and uppercase spellings, or codes (e.g., "M", "F") mixed with full labels (e.g., "Male", "Female").

- hint: use `.unique()` to list the unique values in a column (`.nunique()` gives their number)
- you can also use: `.value_counts()` to check the frequency of each value in a column
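
The penguins columns turn out to be consistent, but the sketch below shows what such a check and a possible fix could look like on a hypothetical messy column (`raw_sex` and its values are invented for illustration).

```python
import pandas as pd

# Hypothetical messy column mixing case, stray whitespace, and codes
raw_sex = pd.Series(["Male", "male", "F", " Female", "M", "FEMALE", None])

print(raw_sex.unique())   # the distinct raw values, including the missing one
print(raw_sex.nunique())  # the number of distinct non-missing values

# Normalize whitespace and case, expand the codes, then restore capitalization
cleaned = (
    raw_sex.str.strip()
    .str.lower()
    .replace({"m": "male", "f": "female"})
    .str.capitalize()
)
print(cleaned.value_counts())
```

Here `replace` is given a dictionary, so it only swaps values that match a key exactly and does not touch substrings of longer labels.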

In [None]:


print(df['sex'].value_counts())
print(df['species'].value_counts())
print(df['island'].value_counts())
# After checking, we can say that the categorical values are consistent

# Step 6: Univariate Analysis

- Separate numerical from categorical columns (hint: use `df.select_dtypes()`)
- Look at the statistical information for each:
    - `df_num.describe().T`
    - `df_cat.describe().T`

In [None]:
df_num = df.select_dtypes(include='number')
df_cat = df.select_dtypes(include='category')
df_cat.describe().T


In [None]:
df_num.describe().T

Use charts to plot `value_counts()` categorical variables:
1. plot `species` using bar plot
1. plot `island` using pie chart
1. plot `sex` using horizontal bar plot

In [None]:
df_cat['species'].value_counts()

In [None]:
species_counts = df_cat['species'].value_counts()
sns.barplot(x=species_counts.index, y=species_counts.values)

In [None]:
import matplotlib.pyplot as plt
island_counts = df_cat['island'].value_counts()
plt.pie(island_counts, labels=island_counts.index, autopct='%1.1f%%')
plt.show()

In [None]:
df_cat['sex'].value_counts().index

In [None]:
sex_counts = df_cat['sex'].value_counts()
plt.figure(figsize=(5, 5))
plt.barh(sex_counts.index, sex_counts)
plt.show()

Plot numerical variables:

1. Boxplot: `bill_length_mm`
1. Histogram: `bill_depth_mm`
1. Boxplot: `flipper_length_mm`
1. Histogram: `body_mass_g`
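
A boxplot with default settings draws points beyond 1.5 times the interquartile range (IQR) from the quartiles as outliers. The sketch below reproduces that rule numerically on hypothetical values (`values` is an invented sample, not taken from the dataset), which can help when reading the boxplots.

```python
import pandas as pd

# Hypothetical flipper-length-like values with one extreme point
values = pd.Series([172.0, 181.0, 190.0, 195.0, 197.0,
                    203.0, 210.0, 216.0, 230.0, 310.0])

# First and third quartiles, and the interquartile range
q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1

# Fences of the 1.5 * IQR rule
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences would be drawn as individual points
outliers = values[(values < lower) | (values > upper)]
print(outliers.tolist())
```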

In [None]:
df_num

In [None]:
sns.boxplot(df_num['bill_length_mm'])
plt.show()

In [None]:
plt.hist(df_num['bill_depth_mm'], bins=40)
plt.show()

In [None]:
sns.boxplot(df_num['flipper_length_mm'])
plt.show()

In [None]:
plt.hist(df_num['body_mass_g'], bins=40)
plt.show()

# Step 7: Bivariate Analysis

#### Correlation between numerical features

Let's find out if there is any correlation between numerical features.

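- Hint: you can use `df.corr()` to compute the correlation matrix. In pandas 2.0 and later, call it on the numeric columns only (or pass `numeric_only=True`), otherwise string columns raise an error.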
- Hint: you can use `sns.heatmap()` to plot the correlation matrix
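
By default, `corr()` computes the Pearson correlation coefficient. As a rough sketch of what that number means, the code below computes it by hand for two hypothetical columns (the values in `toy` are invented) and compares the result with pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical toy measurements, not taken from the dataset
toy = pd.DataFrame({
    "flipper_length_mm": [181.0, 186.0, 195.0, 210.0, 217.0, 230.0],
    "body_mass_g": [3750.0, 3800.0, 3250.0, 4500.0, 5000.0, 5700.0],
})

# Center both variables on their means
x = toy["flipper_length_mm"] - toy["flipper_length_mm"].mean()
y = toy["body_mass_g"] - toy["body_mass_g"].mean()

# Pearson r: sum of co-deviations over the product of the deviation norms
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

# The same quantity as computed by pandas
r_pandas = toy["flipper_length_mm"].corr(toy["body_mass_g"])
print(r_manual, r_pandas)
```

The coefficient ranges from -1 to 1 and only captures linear association, so a value near 0 does not rule out a nonlinear relationship.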

In [None]:
corr_matr = df_num.corr()
corr_matr

In [None]:
sns.heatmap(corr_matr, annot=True)
plt.show()

Write down your observations based on the correlation heatmap.

Observations:

- **There is a strong positive correlation between `body_mass_g` and `flipper_length_mm`**
- **There is a fairly strong negative correlation between `flipper_length_mm` and `bill_depth_mm`**
- **There is a weak negative correlation between `bill_length_mm` and `bill_depth_mm`, so one is a poor linear predictor of the other**

### Feature Engineering

- We might try adding the feature `bill_size`, the product of `bill_length_mm` and `bill_depth_mm`, to see whether it is more informative than the raw measurements.
- We might also try `bill_ratio`, the ratio of `bill_length_mm` to `bill_depth_mm`, for the same reason.

In [None]:
df['bill_size'] = df['bill_length_mm'] * df['bill_depth_mm']
df['bill_ratio'] = df['bill_length_mm'] / df['bill_depth_mm']
df.head()

Let's look at the correlation to see whether the newly created features are better.

In [None]:
# Recompute the correlation matrix so that it includes the new features;
# the column of interest is plotted in the next cell
df_num = df.select_dtypes(include='number')
corr = df_num.corr()
corr

**From the correlation matrix, we conclude that the new features are considerably better correlated with the other numerical features.**

In [None]:
corr['body_mass_g'].sort_values().plot.barh()