# **Case Study 3.1**
In this case study, you will perform essential data preprocessing steps on the Penguins dataset. The dataset contains information about different species of penguins, including their physical characteristics and the region where they were observed. Your goal is to prepare the dataset for machine learning analysis. Follow these steps:
1. Load the penguins dataset using the code snippet provided below.
2. Perform initial data exploration to understand the dataset's structure, features, and any missing values. Summarize the dataset's statistics and gain insights into the data.
3. Address any data quality issues, such as missing values and outliers. Decide on an appropriate strategy for handling missing data, such as imputation or removal of rows/columns.
4. Analyze the relevance of each feature for your machine learning task using the feature selection techniques you have learned.
5. If the dataset contains categorical variables, encode them into a numerical format suitable for machine learning models.
6. Split the dataset into training and testing subsets to evaluate the performance of your machine learning models.
7. Scale or normalize the numerical features to ensure consistent scaling across variables.
8. Apply suitable dimensionality reduction techniques to reduce the size of the data while preserving important information.
9. Validate your preprocessing pipeline by training and evaluating a machine learning model, such as the Random Forest model, on the preprocessed data. Compare the results to the model trained on the raw data (before feature filtering, transformation, and reduction) to check whether preprocessing has improved model performance.

In [17]:
# Import necessary libraries
from seaborn import load_dataset

# Step 1: Load the penguins dataset
df = load_dataset('penguins')

In [18]:
# Step 2: Display the first few rows of the DataFrame to get an initial look at the data
print("First few rows of the dataset:")
df.head()

First few rows of the dataset:


,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female
3,Adelie,Torgersen,,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female


In [19]:
# Display basic information about the dataset
print("\nDataset information:")
print(df.info())

# Display summary statistics for numerical features
print("\nSummary statistics:")
df.describe()


Dataset information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            344 non-null    object 
 1   island             344 non-null    object 
 2   bill_length_mm     342 non-null    float64
 3   bill_depth_mm      342 non-null    float64
 4   flipper_length_mm  342 non-null    float64
 5   body_mass_g        342 non-null    float64
 6   sex                333 non-null    object 
dtypes: float64(4), object(3)
memory usage: 18.9+ KB
None

Summary statistics:


,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g
count,342.0,342.0,342.0,342.0
mean,43.92193,17.15117,200.915205,4201.754386
std,5.459584,1.974793,14.061714,801.954536
min,32.1,13.1,172.0,2700.0
25%,39.225,15.6,190.0,3550.0
50%,44.45,17.3,197.0,4050.0
75%,48.5,18.7,213.0,4750.0
max,59.6,21.5,231.0,6300.0
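
Beyond `info()` and `describe()`, it often helps to look at the share of missing values per column and at the class balance of the target. The following is a minimal sketch on a small hand-made frame (`toy`, with hypothetical values that only mimic the penguins columns), so it runs without downloading anything:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with the same kind of columns as the penguins data
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap"],
    "bill_length_mm": [39.1, np.nan, 46.8, 50.4, 49.0],
    "sex": ["Male", None, "Female", "Male", None],
})

# Fraction of missing values per column, largest first
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)

# Relative frequency of each class in the target column
class_balance = toy["species"].value_counts(normalize=True)
print(class_balance)
```

Applied to `df`, the same two expressions would show at a glance which columns need imputation and whether the species classes are balanced.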


In [20]:
# Step 3: Address data quality issues - Handling missing values
# Impute missing values using the mean for numerical features
from sklearn.impute import SimpleImputer

# Display the number of missing values before handling
print("\nNumber of missing values before handling:")
df.isnull().sum()


Number of missing values before handling:


species               0
island                0
bill_length_mm        2
bill_depth_mm         2
flipper_length_mm     2
body_mass_g           2
sex                  11
dtype: int64

In [21]:
# Impute missing values using the mean for numerical features
imputer = SimpleImputer(strategy='mean')
df[['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']] = imputer.fit_transform(df[['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']])

# Drop rows with missing categorical values (sex)
# df = df.dropna(subset=['sex'])

# Display the number of missing values after handling
print("\nNumber of missing values after handling:")
df.isnull().sum()


Number of missing values after handling:


species               0
island                0
bill_length_mm        0
bill_depth_mm         0
flipper_length_mm     0
body_mass_g           0
sex                  11
dtype: int64

In [22]:
# Impute missing values in the 'sex' column with the most frequent value.
# Assigning the result back avoids the chained in-place fillna that newer pandas versions warn about.
df['sex'] = df['sex'].fillna(df['sex'].mode()[0])

# Display the number of missing values after handling for the 'sex' column
print("\nNumber of missing values after handling for 'sex' column:")
df.isnull().sum()



Number of missing values after handling for 'sex' column:


species              0
island               0
bill_length_mm       0
bill_depth_mm        0
flipper_length_mm    0
body_mass_g          0
sex                  0
dtype: int64
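
The cell above fills `sex` with its mode through pandas. An alternative is `SimpleImputer(strategy='most_frequent')`, which stores the learned fill value and can therefore be fitted on one subset and reused on another. The sketch below assumes a hypothetical toy column standing in for `df['sex']`:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy column standing in for df['sex']
toy = pd.DataFrame(
    {"sex": pd.Series(["Male", "Female", np.nan, "Male", np.nan], dtype=object)}
)

# Learn the most frequent category, then fill the gaps with it
imputer_sex = SimpleImputer(strategy="most_frequent")
toy["sex"] = imputer_sex.fit_transform(toy[["sex"]]).ravel()

print(imputer_sex.statistics_)   # category used as the fill value
print(toy["sex"].isna().sum())   # number of missing values left
```

Whether filling 11 unknown values with the majority class is preferable to dropping those rows depends on the task, so this is one option rather than a recommendation.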

Step 4: Analyze the relevance of each feature for your machine learning task by using feature selection techniques.



In [23]:
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Deep copy the original dataframe to avoid modifying it directly
df_encoded = df.copy()

# Label encode categorical variables and add "_encoded" suffix
label_encoder = LabelEncoder()

# Encode 'sex'
df_encoded['sex_encoded'] = label_encoder.fit_transform(df_encoded['sex'])

# Encode 'species'
df_encoded['species_encoded'] = label_encoder.fit_transform(df_encoded['species'])

# Encode 'island'
df_encoded['island_encoded'] = label_encoder.fit_transform(df_encoded['island'])

# Drop the original categorical columns
df_encoded.drop(['sex', 'species', 'island'], axis=1, inplace=True)

# Step 4: Feature selection using the ANOVA F-statistic (keep the 4 highest-scoring features)
selector = SelectKBest(f_classif, k=4)
X = df_encoded.drop(['species_encoded'], axis=1)
y = df_encoded['species_encoded']
X_selected = selector.fit_transform(X, y)

# Display the selected features
print("\nSelected features:")
selected_features = X.columns[selector.get_support()]
print(selected_features)

# Split the data into features and target variable
X = df_encoded[selected_features]
y = df_encoded['species_encoded']

# Split the data into training and testing sets
X_train_before, X_test_before, y_train_before, y_test_before = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize a Random Forest classifier
clf = RandomForestClassifier()

# Fit the classifier to the training data
clf.fit(X_train_before, y_train_before)

# Get feature importances
feature_importances = clf.feature_importances_
print("\nFeature importances:")
print(feature_importances)



Selected features:
Index(['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g'], dtype='object')

Feature importances:
[0.37835767 0.22747959 0.31478437 0.07937837]
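
The selector above keeps four of the six candidate features but prints only their names. Inspecting `scores_` and `pvalues_` shows how strongly each feature separates the classes and makes the cut-off easier to justify. The sketch below uses the iris dataset bundled with scikit-learn as a stand-in (an assumption made so the example runs offline), and the names `X_demo`, `selector_demo`, and `score_table` are illustrative:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

# Stand-in data: the iris dataset bundled with scikit-learn (not the penguins data)
iris = load_iris(as_frame=True)
X_demo, y_demo = iris.data, iris.target

selector_demo = SelectKBest(f_classif, k=2).fit(X_demo, y_demo)

# Pair every feature with its F-score and p-value, best first
score_table = pd.DataFrame(
    {"f_score": selector_demo.scores_, "p_value": selector_demo.pvalues_},
    index=X_demo.columns,
).sort_values("f_score", ascending=False)
print(score_table)

# Names of the features that the selector keeps
kept = list(X_demo.columns[selector_demo.get_support()])
print(kept)
```

The same table built from `selector` and `X.columns` in the cell above would also show the scores of the two features that were dropped.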


In [24]:
from scipy.stats import zscore

# Calculate Z-scores for each column
z_scores = zscore(df_encoded)

# Set thresholds for Z-scores to identify outliers
threshold_3 = 3
threshold_2_5 = 2.5

# Find the indices of outliers for threshold 3
outliers_3 = (abs(z_scores) > threshold_3).any(axis=1)

# Find the indices of outliers for threshold 2.5
outliers_2_5 = (abs(z_scores) > threshold_2_5).any(axis=1)

# Remove outliers from the dataframe for both thresholds
df_no_outliers_3 = df_encoded[~outliers_3]
df_no_outliers_2_5 = df_encoded[~outliers_2_5]

# Display the shape before and after removing outliers for both thresholds
print(f"Shape before removing outliers: {df_encoded.shape}")
print(f"Shape after removing outliers (threshold 3): {df_no_outliers_3.shape}")
print(f"Shape after removing outliers (threshold 2.5): {df_no_outliers_2_5.shape}")


Shape before removing outliers: (344, 7)
Shape after removing outliers (threshold 3): (344, 7)
Shape after removing outliers (threshold 2.5): (341, 7)
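
The cell above computes z-scores over every column of `df_encoded`, including the label-encoded categorical columns, for which a z-score has no clear meaning. A possible refinement is to screen only the measured columns. The sketch below assumes a hypothetical toy frame with one planted extreme value:

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Hypothetical toy frame: 30 ordinary masses plus one planted extreme value
toy = pd.DataFrame({
    "body_mass_g": np.append(np.tile([3900.0, 4000.0, 4100.0], 10), 9000.0),
    "species_encoded": [0, 1, 2] * 10 + [0],
})

# Only continuous measurements are screened; encoded labels are left out
numeric_cols = ["body_mass_g"]
z = np.abs(zscore(toy[numeric_cols].to_numpy(), axis=0))

# Keep rows whose z-scores all stay within the threshold
threshold = 3
toy_clean = toy[(z <= threshold).all(axis=1)]
print(toy.shape, toy_clean.shape)
```

With very small samples a z-score threshold of 3 can never be exceeded, so this check is only meaningful on datasets of reasonable size.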


Step 5. If the dataset contains categorical variables, encode them into a numerical format suitable for machine learning models.

In [25]:
from sklearn.preprocessing import LabelEncoder

# Initialize the LabelEncoder
label_encoder = LabelEncoder()

# Encode the "species" and "sex" columns
df['species_encoded'] = label_encoder.fit_transform(df['species'])
df['sex_encoded'] = label_encoder.fit_transform(df['sex'])

# Drop the original categorical columns
df.drop(['species', 'sex'], axis=1, inplace=True)

In [26]:
import pandas as pd

# Use the pandas get_dummies function to perform One-Hot Encoding for "island"
df = pd.get_dummies(df, columns=['island'])
df.head()

,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,species_encoded,sex_encoded,island_Biscoe,island_Dream,island_Torgersen
0,39.1,18.7,181.0,3750.0,0,1,False,False,True
1,39.5,17.4,186.0,3800.0,0,0,False,False,True
2,40.3,18.0,195.0,3250.0,0,0,False,False,True
3,43.92193,17.15117,200.915205,4201.754386,0,1,False,False,True
4,36.7,19.3,193.0,3450.0,0,0,False,False,True


In [27]:
df.tail()

,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,species_encoded,sex_encoded,island_Biscoe,island_Dream,island_Torgersen
339,43.92193,17.15117,200.915205,4201.754386,2,1,True,False,False
340,46.8,14.3,215.0,4850.0,2,0,True,False,False
341,50.4,15.7,222.0,5750.0,2,1,True,False,False
342,45.2,14.8,212.0,5200.0,2,0,True,False,False
343,49.9,16.1,213.0,5400.0,2,1,True,False,False
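
Reusing a single `LabelEncoder` instance for several columns, as in the cells above, overwrites its learned mapping on every `fit_transform` call, so only the last column can be decoded afterwards. Keeping one encoder per column preserves each mapping. The sketch below assumes a hypothetical toy frame, and the `encoders` dictionary is an illustrative name:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame with two categorical columns
toy = pd.DataFrame({
    "species": ["Gentoo", "Adelie", "Chinstrap", "Adelie"],
    "sex": ["Male", "Female", "Female", "Male"],
})

# One encoder per column, so every mapping can be inspected and inverted later
encoders = {}
for col in ["species", "sex"]:
    encoders[col] = LabelEncoder()
    toy[f"{col}_encoded"] = encoders[col].fit_transform(toy[col])

# Classes are sorted alphabetically; the position in classes_ is the integer code
print(list(encoders["species"].classes_))
print(list(encoders["species"].inverse_transform([2, 0])))
```

This is why `Adelie` maps to 0 and `Gentoo` to 2 in the tables above: the codes follow alphabetical order, not any biological ordering.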


Step 6. Split the dataset into training and testing subsets to evaluate the performance of your machine learning models.

In [28]:
from sklearn.model_selection import train_test_split

# Define your features (X) and target variable (y)
X = df.drop('species_encoded', axis=1)  # Features
y = df['species_encoded']  # Target variable

# Split the data into training and testing sets (e.g., 80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Number of original features: {X_train.shape[1]}")
print(f"Number of training samples: {X_train.shape[0]}")
print(f"Number of testing samples: {X_test.shape[0]}")


Number of original features: 8
Number of training samples: 275
Number of testing samples: 69
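
When the classes are not equally frequent, a plain random split can leave the test set with a different class mix than the training set. Passing `stratify=y` to `train_test_split` keeps the proportions aligned. The sketch below assumes a hypothetical imbalanced target (`y_demo`) and a placeholder feature column:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 60 / 30 / 10 samples per class
y_demo = np.array([0] * 60 + [1] * 30 + [2] * 10)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

# stratify=y_demo keeps the class proportions equal in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(np.bincount(y_tr))  # class counts in the training subset
print(np.bincount(y_te))  # class counts in the test subset
```

For the penguins data this would amount to adding `stratify=y` to the call in the cell above, which would change the split and therefore the reported numbers.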


Step 7. Scale or normalize the numerical features to ensure consistent scaling across variables.

In [29]:
from sklearn.preprocessing import MinMaxScaler

# Note: this scales the columns of df itself; X_train and X_test from the previous
# cell are separate copies and keep their original, unscaled values

# Initialize the MinMaxScaler
scaler = MinMaxScaler()

df[selected_features] = scaler.fit_transform(df[selected_features])

# Display the first few rows of the DataFrame after scaling
df.head()


,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,species_encoded,sex_encoded,island_Biscoe,island_Dream,island_Torgersen
0,0.254545,0.666667,0.152542,0.291667,0,1,False,False,True
1,0.269091,0.511905,0.237288,0.305556,0,0,False,False,True
2,0.298182,0.583333,0.389831,0.152778,0,0,False,False,True
3,0.429888,0.482282,0.490088,0.417154,0,1,False,False,True
4,0.167273,0.738095,0.355932,0.208333,0,0,False,False,True
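
The cell above fits `MinMaxScaler` on all rows of `df` after the split was made, so the minima and maxima also reflect test rows, and `X_train` and `X_test` are not affected by it. A common alternative is to fit the scaler on the training subset only and reuse it on the test subset. The sketch below assumes hypothetical toy features generated with NumPy:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy features: two columns on very different scales
rng = np.random.default_rng(0)
X_demo = np.column_stack([rng.normal(44, 5, 100), rng.normal(4200, 800, 100)])

X_tr, X_te = train_test_split(X_demo, test_size=0.2, random_state=42)

# Learn min and max from the training rows only, then reuse them on the test rows
scaler_demo = MinMaxScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)

print(X_tr_scaled.min(axis=0), X_tr_scaled.max(axis=0))
print(X_te_scaled.min(axis=0), X_te_scaled.max(axis=0))
```

Scaled test values may fall slightly outside the 0 to 1 range. That is expected, because the test rows played no part in choosing the range.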


Step 8. Apply suitable dimensionality reduction techniques. One commonly used technique is Principal Component Analysis (PCA), which can reduce the dimensionality of the dataset while preserving important information.

In [30]:
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Initialize and train a Random Forest model on the baseline data
# (the four features selected in Step 4, before scaling and PCA)
original_clf = RandomForestClassifier(random_state=42)
original_clf.fit(X_train_before, y_train_before)

# Make predictions on the baseline test data
original_y_pred = original_clf.predict(X_test_before)

# Evaluate the model's performance on the baseline data
original_accuracy = accuracy_score(y_test_before, original_y_pred)

# Apply PCA for dimensionality reduction
pca = PCA(n_components=6)  # Specify the number of components
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

# Train a classifier on the retained PCA components and evaluate
pca_clf = RandomForestClassifier(random_state=42)
pca_clf.fit(X_train_pca, y_train)
pca_y_pred = pca_clf.predict(X_test_pca)

# Evaluate the model's performance after PCA
accuracy_after_pca = accuracy_score(y_test, pca_y_pred)

# Print key metrics
print(f"Explained variance ratio for each PCA component: {pca.explained_variance_ratio_}")
print(f"Number of original features: {X_train.shape[1]}")
print(f"Number of features retained after PCA: {X_train_pca.shape[1]}")

Explained variance ratio for each PCA component: [9.99892104e-01 7.86782314e-05 2.47119002e-05 3.81526547e-06
 3.08365643e-07 2.15068903e-07]
Number of original features: 8
Number of features retained after PCA: 6
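
In the output above the first component accounts for about 99.99% of the variance. PCA behaves this way when one feature has a much larger numeric range than the others, and here it was fitted on the unscaled `X_train`, so a gram-scale column such as `body_mass_g` most likely dominates. Standardizing inside a pipeline before PCA is one way to avoid this. The sketch below assumes hypothetical synthetic features that only imitate the scales involved:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic features: millimetre-scale and gram-scale columns
rng = np.random.default_rng(0)
X_demo = np.column_stack([
    rng.normal(44, 5, 300),      # stands in for a bill length in mm
    rng.normal(17, 2, 300),      # stands in for a bill depth in mm
    rng.normal(4200, 800, 300),  # stands in for a body mass in g
])

# PCA on raw values: the large-scale column dominates the first component
pca_raw = PCA(n_components=2).fit(X_demo)

# PCA after standardization: every column contributes on an equal footing
scaled_pca = make_pipeline(StandardScaler(), PCA(n_components=2)).fit(X_demo)
pca_scaled = scaled_pca.named_steps["pca"]

print(pca_raw.explained_variance_ratio_)
print(pca_scaled.explained_variance_ratio_)
```

Because the pipeline learns its scaling parameters in `fit`, calling it on training data and then `transform` on test data keeps the two subsets separate.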


Step 9. Validate your preprocessing pipeline by training and evaluating a machine learning model, such as the Random Forest model, on the preprocessed data. Compare the results to the model trained on the raw data (before feature filtering, transformation, and reduction) to check whether preprocessing has improved model performance.

In [31]:
# Reload the raw dataset (not used by the baseline below, which relies on the Step 4 split)
df = load_dataset('penguins')

# Train a Random Forest classifier on the baseline split from Step 4
# (imputed and label-encoded data, four selected features, no scaling or PCA)
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train_before, y_train_before)

# Make predictions on the baseline test data
y_pred = clf.predict(X_test_before)

# Evaluate the model's performance on the baseline data
accuracy = accuracy_score(y_test_before, y_pred)
print(f"Accuracy before preprocessing: {accuracy}")
print(f"Accuracy after preprocessing: {accuracy_after_pca}")

Accuracy before preprocessing: 0.9710144927536232
Accuracy after preprocessing: 0.9855072463768116
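
The two accuracies above correspond to 67 and 68 correct predictions out of 69 test samples, so they differ by a single penguin. A single split gives only weak evidence for such a small gap, and cross-validation offers a steadier comparison. The sketch below uses the iris dataset bundled with scikit-learn as a stand-in (an assumption made so the example runs offline), and the names `baseline` and `preprocessed` are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data: the iris dataset bundled with scikit-learn (not the penguins data)
X_demo, y_demo = load_iris(return_X_y=True)

baseline = RandomForestClassifier(random_state=42)
preprocessed = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    RandomForestClassifier(random_state=42),
)

# 5-fold cross-validation; scaler and PCA are refitted inside every training fold
baseline_scores = cross_val_score(baseline, X_demo, y_demo, cv=5)
preprocessed_scores = cross_val_score(preprocessed, X_demo, y_demo, cv=5)

print(f"Baseline:     {baseline_scores.mean():.3f} +/- {baseline_scores.std():.3f}")
print(f"Preprocessed: {preprocessed_scores.mean():.3f} +/- {preprocessed_scores.std():.3f}")
```

If the fold-to-fold spread is larger than the difference between the two means, the comparison does not support a claim that preprocessing helped.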
