<a href="https://colab.research.google.com/github/ShabnumBatool/Data-Science-Fundamentals/blob/main/panguin_dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Step 1: Load and Understand the Data**
We'll use the penguins dataset, which is included with the seaborn library. It's a biological dataset containing physical measurements for three penguin species found in the Palmer Archipelago in Antarctica.


First, we load the dataset into a pandas DataFrame named df.



In [1]:
import pandas as pd
import seaborn as sns

# Load the dataset
df = sns.load_dataset("penguins")


In [2]:
print("Shape:", df.shape)
print("\nColumn Data Types:")
print(df.dtypes)


Shape: (344, 7)

Column Data Types:
species               object
island                object
bill_length_mm       float64
bill_depth_mm        float64
flipper_length_mm    float64
body_mass_g          float64
sex                   object
dtype: object


In [3]:
print(df.head())


  species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  \
0  Adelie  Torgersen            39.1           18.7              181.0   
1  Adelie  Torgersen            39.5           17.4              186.0   
2  Adelie  Torgersen            40.3           18.0              195.0   
3  Adelie  Torgersen             NaN            NaN                NaN   
4  Adelie  Torgersen            36.7           19.3              193.0   

   body_mass_g     sex  
0       3750.0    Male  
1       3800.0  Female  
2       3250.0  Female  
3          NaN     NaN  
4       3450.0  Female  


In [4]:
print(df.describe(include='all'))


       species  island  bill_length_mm  bill_depth_mm  flipper_length_mm  \
count      344     344      342.000000     342.000000         342.000000   
unique       3       3             NaN            NaN                NaN   
top     Adelie  Biscoe             NaN            NaN                NaN   
freq       152     168             NaN            NaN                NaN   
mean       NaN     NaN       43.921930      17.151170         200.915205   
std        NaN     NaN        5.459584       1.974793          14.061714   
min        NaN     NaN       32.100000      13.100000         172.000000   
25%        NaN     NaN       39.225000      15.600000         190.000000   
50%        NaN     NaN       44.450000      17.300000         197.000000   
75%        NaN     NaN       48.500000      18.700000         213.000000   
max        NaN     NaN       59.600000      21.500000         231.000000   

        body_mass_g   sex  
count    342.000000   333  
unique          NaN     2  
top

# **Step 2: Handling Missing Values**
Missing data can distort the analysis or break machine learning models, so it’s important to:

*   Detect how much data is missing.
*   Decide whether to drop or impute (fill) it.
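
As a rough illustration of the two decisions above, the sketch below measures the share of missing values per column and then shows both options side by side. The `toy` DataFrame and its values are hypothetical stand-ins, not the penguins data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the penguins dataset)
toy = pd.DataFrame({
    "mass_g": [3750.0, np.nan, 3250.0, 3450.0],
    "sex": ["Male", None, "Female", "Female"],
})

# Detect: percentage of missing values in each column
missing_pct = toy.isna().mean() * 100
print(missing_pct)

# Option 1: drop every row that contains at least one missing value
dropped = toy.dropna()

# Option 2: impute (median for the numeric column, mode for the categorical one)
imputed = toy.copy()
imputed["mass_g"] = imputed["mass_g"].fillna(imputed["mass_g"].median())
imputed["sex"] = imputed["sex"].fillna(imputed["sex"].mode()[0])
print(imputed)
```

Dropping is simplest when only a few rows are affected; imputing keeps every row at the cost of introducing estimated values.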

In [5]:
# Detect missing values: count them per column
print(df.isnull().sum())


species               0
island                0
bill_length_mm        2
bill_depth_mm         2
flipper_length_mm     2
body_mass_g           2
sex                  11
dtype: int64


In [8]:
#Handle Missing Values
#For numerical columns we use median imputation, because the median is less sensitive to outliers than the mean.
#Assigning the result back to df[col] avoids the chained inplace fillna, which pandas warns will stop working in pandas 3.0.
for col in df.select_dtypes(include=['float64', 'int64']).columns:
    df[col] = df[col].fillna(df[col].median())

#For categorical columns we use mode imputation, which fills missing values with the most frequent category.
for col in df.select_dtypes(include=['object', 'category']).columns:
    df[col] = df[col].fillna(df[col].mode()[0])


In [9]:
print(df.isnull().sum())


species              0
island               0
bill_length_mm       0
bill_depth_mm        0
flipper_length_mm    0
body_mass_g          0
sex                  0
dtype: int64
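
To see why the median was preferred over the mean for the numeric columns, here is a small sketch on made-up body masses (the values are hypothetical and include one deliberate outlier):

```python
import pandas as pd

# Hypothetical body masses in grams; the last value is a deliberate outlier
masses = pd.Series([3500.0, 3600.0, 3700.0, 3800.0, 9000.0])

mean_value = masses.mean()      # pulled upward by the single extreme value
median_value = masses.median()  # stays with the bulk of the data

print("mean:", mean_value)
print("median:", median_value)
```

A missing value filled with the mean would land well above four of the five observations, while the median stays close to them.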


# **Step 3: Detecting and Removing Outliers**

Outliers are extreme values that can:

*   Skew statistical summaries
*   Distort machine learning models (especially linear ones)
*   Impact scaling and normalization

We’ll use the IQR (Interquartile Range) method, which is effective and widely used.
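
A minimal worked example of the IQR rule, assuming the usual 1.5 × IQR fences and pandas' default linear interpolation for quantiles. The `values` series is hypothetical.

```python
import pandas as pd

# Hypothetical measurements with one obvious extreme value
values = pd.Series([10.0, 12.0, 13.0, 14.0, 15.0, 16.0, 18.0, 60.0])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Anything outside [lower, upper] is flagged as an outlier
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("fences:", lower, upper)
print("outliers:", outliers.tolist())
```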



In [10]:
#Identify numerical columns
numeric_cols = df.select_dtypes(include=['float64', 'int64']).columns
print(numeric_cols)


Index(['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g'], dtype='object')


In [12]:
#Define IQR Outlier Removal Function
def remove_outliers_iqr(data, column):
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    return data[(data[column] >= lower) & (data[column] <= upper)]


In [13]:
#Apply Outlier Removal to Each Numeric Column
for col in numeric_cols:
    df = remove_outliers_iqr(df, col) #Drops rows outside the IQR fences; the row count only shrinks if such rows exist.

In [14]:
#Confirm the Data Looks Reasonable
print(df.describe())


       bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g
count      344.000000     344.000000         344.000000   344.000000
mean        43.925000      17.152035         200.892442  4200.872093
std          5.443792       1.969060          14.023826   799.696532
min         32.100000      13.100000         172.000000  2700.000000
25%         39.275000      15.600000         190.000000  3550.000000
50%         44.450000      17.300000         197.000000  4050.000000
75%         48.500000      18.700000         213.000000  4750.000000
max         59.600000      21.500000         231.000000  6300.000000


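Comparing these statistics with the original dataset shows that the count is still 344, so no rows were removed: every value already lies inside the 1.5 × IQR fences. For bill_length_mm, for example, Q1 = 39.275 and Q3 = 48.5 give fences of 25.4375 and 62.3375, which contain the observed range of 32.1 to 59.6. The small shifts in mean and standard deviation come from the median imputation in Step 2, not from outlier removal.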
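
One way to check how many rows the IQR rule would flag before filtering anything is to count them per column. The helper `count_outliers_iqr` and the `toy` DataFrame below are hypothetical, written only for this illustration.

```python
import pandas as pd

def count_outliers_iqr(data, columns, k=1.5):
    """Return, per column, how many values fall outside the IQR fences."""
    counts = {}
    for column in columns:
        q1 = data[column].quantile(0.25)
        q3 = data[column].quantile(0.75)
        iqr = q3 - q1
        outside = (data[column] < q1 - k * iqr) | (data[column] > q3 + k * iqr)
        counts[column] = int(outside.sum())
    return pd.Series(counts)

# Hypothetical data: one implausible flipper length, no extreme body mass
toy = pd.DataFrame({
    "flipper_length_mm": [181.0, 186.0, 195.0, 193.0, 190.0, 400.0],
    "body_mass_g": [3750.0, 3800.0, 3250.0, 3450.0, 3650.0, 3600.0],
})

outlier_counts = count_outliers_iqr(toy, toy.columns)
print(outlier_counts)
```

A count of zero for every column means the removal loop leaves the DataFrame unchanged.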



#  **Step 4: Encoding Categorical Variables**
Most machine learning models can't handle text or categories directly, so we need to convert categorical data into a numerical format.

There are two common techniques:

1.  Label Encoding assigns an integer to each category (useful for binary or ordinal data).

2.  One-Hot Encoding creates a separate binary column for each category (useful for nominal data).
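
The difference between the two techniques can be sketched with pandas alone. This is an illustration on a hypothetical `toy` column; it uses category codes and `pd.get_dummies` rather than scikit-learn.

```python
import pandas as pd

# Hypothetical column with three nominal categories
toy = pd.DataFrame({"island": ["Torgersen", "Biscoe", "Dream", "Biscoe"]})

# Label encoding: one integer per category (categories are sorted alphabetically)
label_codes = toy["island"].astype("category").cat.codes

# One-hot encoding: one 0/1 column per category
one_hot = pd.get_dummies(toy["island"], prefix="island", dtype=int)

print(label_codes.tolist())
print(one_hot)
```

Label encoding keeps a single column but implies an order (Biscoe < Dream < Torgersen) that has no real meaning, whereas one-hot encoding avoids that at the cost of extra columns.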



In [15]:
#Identify Categorical Columns
categorical_cols = df.select_dtypes(include=['object', 'category']).columns
print(categorical_cols)


Index(['species', 'island', 'sex'], dtype='object')


In [16]:
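#Apply Label Encoding
from sklearn.preprocessing import LabelEncoder

#Fit one encoder per column and keep it, so the integer codes can be mapped
#back to the original labels later through encoders[col].classes_.
#Note: species and island are nominal, so the codes imply an order that has no real meaning.
encoders = {}
for col in categorical_cols:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])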


In [17]:
#Confirm Encoding
print(df.head())



   species  island  bill_length_mm  bill_depth_mm  flipper_length_mm  \
0        0       2           39.10           18.7              181.0   
1        0       2           39.50           17.4              186.0   
2        0       2           40.30           18.0              195.0   
3        0       2           44.45           17.3              197.0   
4        0       2           36.70           19.3              193.0   

   body_mass_g  sex  
0       3750.0    1  
1       3800.0    0  
2       3250.0    0  
3       4050.0    1  
4       3450.0    0  


# **Step 5: Normalization & Standardization**
Many machine learning algorithms (like KNN, SVM, gradient descent-based models) perform better when features are on the same scale.

We’ll create two versions of the dataset:

1. Normalized: Values scaled between 0 and 1

2. Standardized: Values have mean = 0 and standard deviation = 1
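
The two transformations can be written out by hand as a sketch of what the scalers compute: min-max scaling is (x - min) / (max - min), and standardization is (x - mean) / std. The series `x` is hypothetical, and `ddof=0` is used because StandardScaler divides by the population standard deviation.

```python
import pandas as pd

# Hypothetical feature values
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Normalization (min-max): rescale to the range [0, 1]
normalized = (x - x.min()) / (x.max() - x.min())

# Standardization (z-score): mean 0 and standard deviation 1
standardized = (x - x.mean()) / x.std(ddof=0)

print(normalized.tolist())
print(standardized.round(3).tolist())
```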

In [18]:
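#Identify Numeric Columns
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Select numeric columns (in this run that also picks up the label-encoded species, island, and sex columns, since they now hold integers)
numeric_cols = df.select_dtypes(include=['float64', 'int64']).columns.tolist()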


In [19]:
# Apply Normalization (MinMaxScaler)
scaler_norm = MinMaxScaler()
df_norm = df.copy()
df_norm[numeric_cols] = scaler_norm.fit_transform(df[numeric_cols])


In [20]:
# Apply Standardization (StandardScaler)
scaler_std = StandardScaler()
df_std = df.copy()
df_std[numeric_cols] = scaler_std.fit_transform(df[numeric_cols])


In [21]:
#Confirm the Result
print("Normalized Data Sample:\n", df_norm.head())
print("\nStandardized Data Sample:\n", df_std.head())


Normalized Data Sample:
    species  island  bill_length_mm  bill_depth_mm  flipper_length_mm  \
0      0.0     1.0        0.254545       0.666667           0.152542   
1      0.0     1.0        0.269091       0.511905           0.237288   
2      0.0     1.0        0.298182       0.583333           0.389831   
3      0.0     1.0        0.449091       0.500000           0.423729   
4      0.0     1.0        0.167273       0.738095           0.355932   

   body_mass_g  sex  
0     0.291667  1.0  
1     0.305556  0.0  
2     0.152778  0.0  
3     0.375000  1.0  
4     0.208333  0.0  

Standardized Data Sample:
     species    island  bill_length_mm  bill_depth_mm  flipper_length_mm  \
0 -1.029802  1.844076       -0.887622       0.787289          -1.420541   
1 -1.029802  1.844076       -0.814037       0.126114          -1.063485   
2 -1.029802  1.844076       -0.666866       0.431272          -0.420786   
3 -1.029802  1.844076        0.096581       0.075255          -0.277964   
4 -1.02

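All numeric values in df_norm are in the range [0, 1].

Each column in df_std is centered on 0 with a standard deviation of 1. Because the label-encoded species, island, and sex columns hold integers, they were scaled as well, as the samples above show.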
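
If the encoded category columns should keep their integer codes, one option is to scale only an explicit list of measurement columns and then verify the result. The sketch below is an assumption-laden illustration: `toy`, `measure_cols`, and the values are hypothetical, and the scaling is written by hand with pandas instead of the scikit-learn scalers.

```python
import pandas as pd

# Hypothetical data: one label-encoded column and two measurement columns
toy = pd.DataFrame({
    "species": [0, 0, 1, 2],
    "bill_length_mm": [39.1, 39.5, 46.5, 50.0],
    "body_mass_g": [3750.0, 3800.0, 5000.0, 3700.0],
})
measure_cols = ["bill_length_mm", "body_mass_g"]  # columns chosen for scaling

# Min-max normalization on the measurement columns only
toy_norm = toy.copy()
col_min = toy[measure_cols].min()
col_max = toy[measure_cols].max()
toy_norm[measure_cols] = (toy[measure_cols] - col_min) / (col_max - col_min)

# Standardization on the measurement columns only (population std, ddof=0)
toy_std = toy.copy()
col_mean = toy[measure_cols].mean()
col_sd = toy[measure_cols].std(ddof=0)
toy_std[measure_cols] = (toy[measure_cols] - col_mean) / col_sd

# Verify the two properties claimed for the scaled data
in_unit_range = bool(((toy_norm[measure_cols] >= 0) & (toy_norm[measure_cols] <= 1)).all().all())
means_near_zero = bool((toy_std[measure_cols].mean().abs() < 1e-9).all())
print(in_unit_range, means_near_zero)
```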