# EDA on Penguins

In this notebook exercise, we will conduct simple EDA steps on the popular penguins dataset.

### Load the dataset

The following will load the dataset automatically.

In [5]:
import seaborn as sns

In [6]:
# You don't have to download the dataset
# the following will download it for you
df = sns.load_dataset('penguins')

In [None]:
df.shape

(344, 7)

# Step 1: Understand the Features

You can find information about this dataset here: https://www.kaggle.com/code/parulpandey/penguin-dataset-the-new-iris

**Question: in your own words**:
1. describe each feature
2. mention its type (numeric or categorical)
3. write its name in Arabic

Please use a Markdown cell to write your answer:

Hint: you can attach an image to illustrate what the features are.

<img src="https://github.com/allisonhorst/palmerpenguins/raw/main/man/figures/culmen_depth.png" width="400">

# Step 2

- Have a look at the columns and their values (`head`, `sample`, `tail`)
- Look at the technical information (`info`)
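
As a rough sketch of what these calls return, the snippet below runs them on a small hypothetical frame (`toy` and its values are made up for illustration; in the exercise you would call the same methods on `df`).

```python
import io
import pandas as pd

# Hypothetical toy frame standing in for the penguins data
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap", "Adelie"],
    "bill_length_mm": [39.1, 46.1, None, 38.9],
})

first_rows = toy.head(2)                     # first two rows
last_rows = toy.tail(2)                      # last two rows
random_rows = toy.sample(2, random_state=0)  # two random rows, reproducible
print(first_rows, last_rows, random_rows, sep="\n\n")

# info() prints to stdout by default; a buffer lets us keep the text
buf = io.StringIO()
toy.info(buf=buf)
info_text = buf.getvalue()
print(info_text)  # dtypes, non-null counts, memory usage
```

The non-null counts reported by `info` are usually the first sign that a column has missing values.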

# Step 3

1. Calculate count of missing values
1. Calculate the percentage of missing values
1. For each column, check and handle missing values; state your strategy and justify it. Examples:
    - Strategy: drop the column.
        - Why did you choose this strategy?
    - Strategy: fill missing values.
        - Why did you choose this strategy?
    - Strategy: drop the row.
        - Why did you choose this strategy?
1. Check and handle duplicated rows
1. **If** you chose to drop missing values, how much data did you lose?
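
The steps above can be sketched on a small hypothetical frame. The data, the choice to drop rows without a bill measurement, and the `"Unknown"` fill value are all assumptions made for illustration, not the recommended answer for the penguins data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with missing values and one duplicated row
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo"],
    "bill_length_mm": [39.1, np.nan, 46.1, 46.1, 50.0],
    "sex": ["Male", None, "Female", "Female", None],
})

missing_count = toy.isna().sum()       # missing values per column
missing_pct = toy.isna().mean() * 100  # same, as a percentage of rows
n_duplicates = toy.duplicated().sum()  # rows identical to an earlier row

rows_before = len(toy)
cleaned = (
    toy.dropna(subset=["bill_length_mm"])  # example strategy: drop the row
       .fillna({"sex": "Unknown"})         # example strategy: fill the value
       .drop_duplicates()
)
rows_lost = rows_before - len(cleaned)
pct_lost = rows_lost / rows_before * 100

print(missing_count, missing_pct, sep="\n\n")
print("duplicates:", n_duplicates)
print(f"rows lost: {rows_lost} ({pct_lost:.1f}%)")
```

Whichever strategy you pick, comparing `len(df)` before and after is a simple way to report how much data was lost.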

# Step 4

#### Data types conversion
- Convert the string columns to `category` to save memory
- Numeric columns can be stored at lower precision: `float32`

In [None]:
mem_usage_before = df.memory_usage(deep=True)

In [None]:
# convert categorical types
df['species'] = df['species'].astype('category')
# ...?
# ...?

In [None]:
# convert numerical types
df['bill_depth_mm'] = df['bill_depth_mm'].astype('float32')
# ...?
# ...?
# ...?

Calculate memory saved after type conversion

In [None]:
# mem_usage_after = ...?

In [None]:
print('memory saved:', (mem_usage_before - mem_usage_after).sum() // 1024, 'KB')
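
To illustrate the before/after comparison, here is a minimal sketch on a hypothetical frame (the `toy` data is invented; the column names merely mirror the penguins columns). It uses the same `memory_usage(deep=True)` measurement as the cells above.

```python
import pandas as pd

# Hypothetical toy frame: few distinct strings repeated many times
toy = pd.DataFrame({
    "island": ["Biscoe", "Dream", "Torgersen"] * 1000,
    "body_mass_g": [3750.0, 4200.0, 5000.0] * 1000,
})

before = toy.memory_usage(deep=True)  # bytes per column, strings included

toy["island"] = toy["island"].astype("category")           # strings -> integer codes
toy["body_mass_g"] = toy["body_mass_g"].astype("float32")  # 8 bytes -> 4 bytes per value

after = toy.memory_usage(deep=True)
saved_kb = (before - after).sum() / 1024
print(f"memory saved: {saved_kb:.1f} KB")
```

The saving from `category` grows with the number of repeated values, so on a dataset as small as penguins the gain is modest.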

# Step 5

#### Detect inconsistency in categorical values

Inconsistencies can cause issues in data analysis and should be cleaned to ensure uniformity. Some examples:

- Denoting missing values:
    - Use of special characters: "N/A" and "NA"
    - Empty values: "NULL" and empty strings
- Spelling variations: "USA" and "United States"
- Trailing and leading spaces: `"Female "` and `" Female"`
- Different languages: "Red" and "Rojo"
- Use of synonyms: "Doctor" and "Physician"
- Abbreviations: "Dr." and "Doctor"
- Different date formats: "2023-07-30" and "30/07/2023"
- Mixed data types: `10` (integer) and `"Ten"` (string)
- Typos: "Adminstrator" and "Administrator"

- hint: use `.unique()` to list the unique values in a column (`.nunique()` gives their count)
- you can also use: `.value_counts()` to check the frequency of each value in a column
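
As an illustration of how such inconsistencies show up and one way to normalize them, consider the hypothetical column below. The values and the list of placeholder strings are assumptions for the example; the real penguins columns may turn out to be clean already.

```python
import pandas as pd

# Hypothetical messy column: stray spaces, mixed case, placeholder strings
sex = pd.Series(["Female", "Female ", " female", "MALE", "Male", "N/A", ""])

print(sex.unique())        # every distinct raw value, including the variants
print(sex.value_counts())  # frequency of each raw value

# Normalize whitespace and case, then treat placeholders as missing
normalized = sex.str.strip().str.lower()
placeholders = ["n/a", "na", "null", ""]
cleaned = normalized.where(~normalized.isin(placeholders)).str.capitalize()

print(cleaned.value_counts(dropna=False))
```

Comparing `nunique()` before and after cleaning is a quick check that the variants were merged.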

# Step 6: Univariate Analysis

- Separate numerical from categorical columns (hint: use `df.select_dtypes()`)
- Look at the statistical information for each:
    - `df_num.describe().T`
    - `df_cat.describe().T`

Use charts to plot the `value_counts()` of the categorical variables:
1. plot `species` using bar plot
1. plot `island` using pie chart
1. plot `sex` using horizontal bar plot

Plot numerical variables:

1. Boxplot: `bill_length_mm`
1. Histogram: `bill_depth_mm`
1. Boxplot: `flipper_length_mm`
1. Histogram: `body_mass_g`
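
A minimal sketch of the split-and-describe step, using a hypothetical frame (`toy` and its values are invented for illustration). Plotting is left out so the snippet runs without a plotting backend.

```python
import pandas as pd

# Hypothetical toy frame with two categorical and two numerical columns
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Gentoo", "Chinstrap"],
    "island": ["Torgersen", "Biscoe", "Biscoe", "Dream"],
    "bill_length_mm": [39.1, 46.1, 50.0, 48.7],
    "body_mass_g": [3750.0, 4500.0, 5700.0, 3800.0],
})

df_num = toy.select_dtypes(include="number")  # numerical columns
df_cat = toy.select_dtypes(exclude="number")  # everything else

print(df_num.describe().T)  # count, mean, std, min, quartiles, max
print(df_cat.describe().T)  # count, unique, top, freq

species_counts = toy["species"].value_counts()
print(species_counts)
```

With matplotlib installed, the pandas plotting accessor covers the charts listed above: `species_counts.plot.bar()`, `.plot.pie()` and `.plot.barh()` for the counts, and `toy["bill_length_mm"].plot.box()` or `.plot.hist()` for a numerical column.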

# Step 7: Bivariate Analysis

#### Correlation between numerical features

Let's find out if there is any correlation between numerical features.

- Hint: you can use `df.corr(numeric_only=True)` to find the correlation matrix (recent pandas versions raise an error if non-numeric columns are included).
- Hint: you can use `sns.heatmap()` to plot the correlation matrix
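
As a sketch of the correlation step, the snippet below computes the matrix for a hypothetical frame (the values are made up, and `numeric_only=True` assumes pandas 1.5 or newer).

```python
import pandas as pd

# Hypothetical toy frame; the string column is skipped by numeric_only=True
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo"],
    "flipper_length_mm": [181.0, 186.0, 195.0, 210.0, 217.0, 230.0],
    "body_mass_g": [3750.0, 3800.0, 3250.0, 4500.0, 5000.0, 5700.0],
})

corr = toy.corr(numeric_only=True)  # Pearson correlation by default
print(corr.round(2))
```

Passing the result to `sns.heatmap(corr, annot=True)` draws the matrix with the coefficient printed in each cell.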

Question: Write down your observations based on the correlation heatmap.

**Observations:**

### Feature Engineering

- We might try adding the feature `bill_size`, the product of `bill_length_mm` and `bill_depth_mm`, to see whether it adds any signal for a model.
- We might also try `bill_ratio`, the ratio of `bill_length_mm` to `bill_depth_mm`, for the same reason.

Let's look at the correlation to see whether the newly created features are better.

1. Compute the correlation matrix
1. Select the `'body_mass_g'` column, sort it, and plot it using horizontal bar plot

In [None]:
# This plots the correlation values for a specific column
# which is usually what we are interested in

# corr['body_mass_g'].sort_values().plot.barh()
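
The feature-engineering idea can be sketched on a hypothetical frame as follows. The values are invented, and dropping the target's correlation with itself before sorting is a choice made here for readability, not something the exercise requires.

```python
import pandas as pd

# Hypothetical toy frame using the penguins column names
toy = pd.DataFrame({
    "bill_length_mm": [39.1, 39.5, 46.1, 50.0, 48.7, 49.5],
    "bill_depth_mm": [18.7, 17.4, 13.2, 16.3, 14.1, 19.0],
    "body_mass_g": [3750.0, 3800.0, 4500.0, 5700.0, 4450.0, 3800.0],
})

# Engineered features described above
toy["bill_size"] = toy["bill_length_mm"] * toy["bill_depth_mm"]
toy["bill_ratio"] = toy["bill_length_mm"] / toy["bill_depth_mm"]

corr = toy.corr(numeric_only=True)

# Correlation of every feature with the target, weakest to strongest
target_corr = corr["body_mass_g"].drop("body_mass_g").sort_values()
print(target_corr)
```

With matplotlib installed, `target_corr.plot.barh()` turns this into the horizontal bar plot requested above. A new feature is only worth keeping if its correlation with the target is clearly stronger than that of the columns it was built from.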

Dataset source: https://github.com/allisonhorst/palmerpenguins