# Case Study 3.1

In this case study, you will perform essential data preprocessing steps on the Penguins dataset. The dataset contains information about different species of penguins, including their physical characteristics and the region where they were observed. Your goal is to prepare the dataset for machine learning analysis. Follow these steps:
1. Load the penguins dataset using the code snippet provided below.
2. Perform initial data exploration to understand the dataset's structure, features, and any missing values. Summarize the dataset's statistics and gain insights into the data.
3. Address any data quality issues, such as missing values and outliers. Decide on an appropriate strategy for handling missing data, such as imputation or removal of rows/columns.
4. Analyze the relevance of each feature for your machine learning task using the feature selection techniques you have learned.
5. If the dataset contains categorical variables, encode them into a numerical format suitable for machine learning models.
6. Split the dataset into training and testing subsets to evaluate the performance of your machine learning models.
7. Scale or normalize the numerical features to ensure consistent scaling across variables.
8. Apply suitable dimensionality reduction techniques to reduce the size of the data while preserving important information.
9. Validate your preprocessing pipeline by training and evaluating a machine learning model, such as the Random Forest model, on the preprocessed data. Compare the results to the model trained on the raw data (before feature filtering, transformation, and reduction) to ensure that preprocessing has improved model performance.
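
The steps above can be sketched as a single scikit-learn pipeline. This is a minimal illustration on a small synthetic frame, not the notebook's own method: the frame `toy`, its values, and the names `toy_X`, `toy_model` are assumptions made for the example. Putting the imputers, scaler, and encoder inside the pipeline means their statistics are learned from the training rows only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Small synthetic stand-in for the penguins data (all values are made up)
rng = np.random.default_rng(0)
n = 120
toy = pd.DataFrame({
    "bill_length_mm": rng.normal(44, 5, n),
    "body_mass_g": rng.normal(4200, 800, n),
    "sex": rng.choice(["Male", "Female"], n),
})
toy["species"] = np.where(toy["bill_length_mm"] > 44, "A", "B")
toy.loc[::15, "body_mass_g"] = np.nan  # inject missing numeric values
toy.loc[::20, "sex"] = np.nan          # inject missing categorical values

toy_X = toy.drop(columns="species")
toy_y = toy["species"]
tX_train, tX_test, ty_train, ty_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

numeric = ["bill_length_mm", "body_mass_g"]
categorical = ["sex"]

# Impute and scale numeric columns; impute and one-hot encode categorical ones
preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", MinMaxScaler())]), numeric),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("encode", OneHotEncoder(handle_unknown="ignore"))]), categorical),
])

toy_model = Pipeline([("preprocess", preprocess),
                      ("clf", RandomForestClassifier(random_state=42))])
toy_model.fit(tX_train, ty_train)
toy_accuracy = toy_model.score(tX_test, ty_test)
print(f"Test accuracy on the synthetic data: {toy_accuracy:.3f}")
```

The accuracy printed here says nothing about the penguins data; the point is only the structure, in which every fitted step sees the training split alone.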


# Islam Jihad 1191375

In [18]:
# Import the 'load_dataset' function from seaborn to load the penguins dataset
from seaborn import load_dataset

# Load the penguins dataset and store it in the 'df' DataFrame
df = load_dataset('penguins')

# Display the first few rows of the DataFrame to get an initial look at the data
df.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female
3,Adelie,Torgersen,,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female


In [19]:
# Display basic information about the dataset
print(df.info())
print("++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++")

# Display statistical summary of the dataset
print(df.describe())
print("++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++")

# Check for missing values
print(df.isnull().sum())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            344 non-null    object 
 1   island             344 non-null    object 
 2   bill_length_mm     342 non-null    float64
 3   bill_depth_mm      342 non-null    float64
 4   flipper_length_mm  342 non-null    float64
 5   body_mass_g        342 non-null    float64
 6   sex                333 non-null    object 
dtypes: float64(4), object(3)
memory usage: 18.9+ KB
None
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
       bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g
count      342.000000     342.000000         342.000000   342.000000
mean        43.921930      17.151170         200.915205  4201.754386
std          5.459584       1.974793          14.061714   801.954536
min         32.100000      13.100000         172.000000

In [20]:
df.isnull().sum()

species               0
island                0
bill_length_mm        2
bill_depth_mm         2
flipper_length_mm     2
body_mass_g           2
sex                  11
dtype: int64

In [21]:
print(df.isnull().any(axis=1).sum())
print(100*df.isnull().any(axis=1).sum()/df.shape[0],'%')

11
3.197674418604651 %


In [22]:
df[df.isnull().any(axis=1)]

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
3,Adelie,Torgersen,,,,,
8,Adelie,Torgersen,34.1,18.1,193.0,3475.0,
9,Adelie,Torgersen,42.0,20.2,190.0,4250.0,
10,Adelie,Torgersen,37.8,17.1,186.0,3300.0,
11,Adelie,Torgersen,37.8,17.3,180.0,3700.0,
47,Adelie,Dream,37.5,18.9,179.0,2975.0,
246,Gentoo,Biscoe,44.5,14.3,216.0,4100.0,
286,Gentoo,Biscoe,46.2,14.4,214.0,4650.0,
324,Gentoo,Biscoe,47.3,13.8,216.0,4725.0,
336,Gentoo,Biscoe,44.5,15.7,217.0,4875.0,


In [23]:
print(f"Number of empty records = {df.isnull().all(axis=1).sum()}")
df[df.isnull().all(axis=1)]

Number of empty records = 0


Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex


In [24]:
print(f"The number of records where gender is missing equals {df.isnull()['sex'].sum()}")
print(f"The proportion of records where gender is missing equals {100*df.isnull()['sex'].sum()/df.shape[0]}%")

The number of records where gender is missing equals 11
The proportion of records where gender is missing equals 3.197674418604651%


In [25]:
# Encode the categorical values and compute the correlation matrix
from sklearn.preprocessing import LabelEncoder

# Create a copy so that the original DataFrame stays unchanged
df_corr = df.copy()

# Initialize the LabelEncoder
label_encoder = LabelEncoder()

# Fit and transform each categorical feature
df_corr['sex_encoded'] = label_encoder.fit_transform(df_corr['sex'])
df_corr['island_encoded'] = label_encoder.fit_transform(df_corr['island'])
df_corr['species_encoded'] = label_encoder.fit_transform(df_corr['species'])

# Drop the original string-valued columns (including 'island') so that corr() only sees numeric columns
df_corr.drop(['sex', 'island', 'species'], axis=1, inplace=True)

df_corr.corr()

Unnamed: 0,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex_encoded,island_encoded,species_encoded
bill_length_mm,1.0,-0.235053,0.656181,0.59511,0.27144,-0.353647,0.731369
bill_depth_mm,-0.235053,1.0,-0.583851,-0.471916,0.31146,0.571035,-0.744076
flipper_length_mm,0.656181,-0.583851,1.0,0.871202,0.215992,-0.565825,0.854307
body_mass_g,0.59511,-0.471916,0.871202,1.0,0.361224,-0.561515,0.750491
sex_encoded,0.27144,0.31146,0.215992,0.361224,1.0,0.029246,0.008559
island_encoded,-0.353647,0.571035,-0.565825,-0.561515,0.029246,1.0,-0.635659
species_encoded,0.731369,-0.744076,0.854307,0.750491,0.008559,-0.635659,1.0
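
`LabelEncoder` assigns integer codes in alphabetical order, so the correlations involving `island_encoded` and `species_encoded` depend on an ordering that has no physical meaning. As a hedged alternative (not what the notebook does), a nominal column can be expanded into indicator columns before computing correlations. The frame `toy_islands` below is hand-made for illustration.

```python
import pandas as pd

# Tiny hand-made frame; the rows are illustrative, not taken from the dataset
toy_islands = pd.DataFrame({
    "island": ["Torgersen", "Dream", "Biscoe", "Biscoe", "Dream"],
    "flipper_length_mm": [181.0, 179.0, 216.0, 214.0, 190.0],
})

# One indicator column per island instead of a single 0/1/2 code
encoded = pd.get_dummies(toy_islands, columns=["island"], dtype=float)

# Correlation of each island indicator with the numeric feature
island_corr = encoded.corr()["flipper_length_mm"].drop("flipper_length_mm")
print(island_corr)
```

Each indicator then has its own correlation with the numeric feature, which is easier to interpret than one coefficient for an arbitrary integer code.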


In [26]:
print(df[['body_mass_g']].isnull().any(axis=1).sum())
print(100*df[['body_mass_g']].isnull().any(axis=1).sum()/df.shape[0],'%')

2
0.5813953488372093 %


In [27]:
x= df['body_mass_g'].median()
x

4050.0

In [28]:
# Assign the result back instead of calling fillna(inplace=True) on a column selection
df['body_mass_g'] = df['body_mass_g'].fillna(x)

df[df.isnull().any(axis=1)]

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
3,Adelie,Torgersen,,,,4050.0,
8,Adelie,Torgersen,34.1,18.1,193.0,3475.0,
9,Adelie,Torgersen,42.0,20.2,190.0,4250.0,
10,Adelie,Torgersen,37.8,17.1,186.0,3300.0,
11,Adelie,Torgersen,37.8,17.3,180.0,3700.0,
47,Adelie,Dream,37.5,18.9,179.0,2975.0,
246,Gentoo,Biscoe,44.5,14.3,216.0,4100.0,
286,Gentoo,Biscoe,46.2,14.4,214.0,4650.0,
324,Gentoo,Biscoe,47.3,13.8,216.0,4725.0,
336,Gentoo,Biscoe,44.5,15.7,217.0,4875.0,


In [29]:
y= df['bill_depth_mm'].median()
print(y)
z= df['bill_length_mm'].median()
print(z)

17.3
44.45


In [30]:
# Assign the results back instead of calling fillna(inplace=True) on a column selection
df['bill_depth_mm'] = df['bill_depth_mm'].fillna(y)
df['bill_length_mm'] = df['bill_length_mm'].fillna(z)

df[df.isnull().any(axis=1)]

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
3,Adelie,Torgersen,44.45,17.3,,4050.0,
8,Adelie,Torgersen,34.1,18.1,193.0,3475.0,
9,Adelie,Torgersen,42.0,20.2,190.0,4250.0,
10,Adelie,Torgersen,37.8,17.1,186.0,3300.0,
11,Adelie,Torgersen,37.8,17.3,180.0,3700.0,
47,Adelie,Dream,37.5,18.9,179.0,2975.0,
246,Gentoo,Biscoe,44.5,14.3,216.0,4100.0,
286,Gentoo,Biscoe,46.2,14.4,214.0,4650.0,
324,Gentoo,Biscoe,47.3,13.8,216.0,4725.0,
336,Gentoo,Biscoe,44.5,15.7,217.0,4875.0,


In [31]:
df.isnull().sum()

species               0
island                0
bill_length_mm        0
bill_depth_mm         0
flipper_length_mm     2
body_mass_g           0
sex                  11
dtype: int64
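
The cells above fill each gap with the global median of the column (for example 4050.0 g for `body_mass_g`), which ignores the species of the row being filled. One possible refinement, shown here only as a sketch on made-up values, is to use the median of the row's own species. The frame `toy_mass` is hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up body masses with one gap per species
toy_mass = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo"],
    "body_mass_g": [3750.0, 3800.0, np.nan, 5000.0, 5200.0, np.nan],
})

# Median of each species, broadcast back to the original row order
species_median = toy_mass.groupby("species")["body_mass_g"].transform("median")

# Fill each gap with the median of its own species
toy_mass["body_mass_g"] = toy_mass["body_mass_g"].fillna(species_median)
print(toy_mass)
```

Note that when `species` is the prediction target, as it is later in this notebook, using it for imputation leaks the label into the features, so this variant only fits tasks where the grouping column is a legitimate input.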

In [32]:
# Linear regression to predict flipper_length_mm from body_mass_g, since the two are highly correlated (0.87)
from sklearn.linear_model import LinearRegression
def fill_nan(df_orig, x, y):
    """Fill missing values of column y with predictions from a linear fit on column x."""
    df = df_orig.copy()
    df_missing = df[df[y].isna()]

    # Nothing to impute
    if len(df_missing) == 0:
        return df

    df_not_missing = df[~df[y].isna()]

    # First quartile (25th percentile) of the predictor
    q1 = df_not_missing[x].quantile(0.25)

    # Fit only on rows whose predictor value lies above the first quartile
    train = df_not_missing[(df_not_missing[x] > q1)]

    model = LinearRegression()
    model.fit(train[[x]], train[y])

    # Predict the missing targets from their predictor values
    x_missing = df_missing[[x]]
    df.loc[df[y].isna(), y] = model.predict(x_missing)
    return df



df = fill_nan(df, 'body_mass_g', 'flipper_length_mm')
df.isnull().sum()

species               0
island                0
bill_length_mm        0
bill_depth_mm         0
flipper_length_mm     0
body_mass_g           0
sex                  11
dtype: int64

In [33]:
df[df.isnull().any(axis=1)]

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
3,Adelie,Torgersen,44.45,17.3,197.314363,4050.0,
8,Adelie,Torgersen,34.1,18.1,193.0,3475.0,
9,Adelie,Torgersen,42.0,20.2,190.0,4250.0,
10,Adelie,Torgersen,37.8,17.1,186.0,3300.0,
11,Adelie,Torgersen,37.8,17.3,180.0,3700.0,
47,Adelie,Dream,37.5,18.9,179.0,2975.0,
246,Gentoo,Biscoe,44.5,14.3,216.0,4100.0,
286,Gentoo,Biscoe,46.2,14.4,214.0,4650.0,
324,Gentoo,Biscoe,47.3,13.8,216.0,4725.0,
336,Gentoo,Biscoe,44.5,15.7,217.0,4875.0,


In [34]:
# Create a copy of the DataFrame for imputation
df_imputed = df.copy()

In [36]:
#Imputing Missing Values in 'sex' Column Using SimpleImputer
from sklearn.impute import SimpleImputer

# Define the columns to be imputed
columns_to_impute = ['sex']

# Create a SimpleImputer instance
imputer = SimpleImputer(strategy='most_frequent')

# Fit and transform the imputer on the selected columns
df[columns_to_impute] = imputer.fit_transform(df[columns_to_impute])

# Now, df contains the filled values in the 'sex' column (df_imputed still holds the copy made before this step)
df.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female
3,Adelie,Torgersen,44.45,17.3,197.314363,4050.0,Male
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female


In [38]:
# Display basic information about the dataset
print(df.info())
print("++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++")

# Display statistical summary of the dataset
print(df.describe())
print("++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++")

#check that all values are 0
df.isnull().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            344 non-null    object 
 1   island             344 non-null    object 
 2   bill_length_mm     344 non-null    float64
 3   bill_depth_mm      344 non-null    float64
 4   flipper_length_mm  344 non-null    float64
 5   body_mass_g        344 non-null    float64
 6   sex                344 non-null    object 
dtypes: float64(4), object(3)
memory usage: 18.9+ KB
None
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
       bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g
count      344.000000     344.000000         344.000000   344.000000
mean        43.925000      17.152035         200.894270  4200.872093
std          5.443792       1.969060          14.023338   799.696532
min         32.100000      13.100000         172.000000

species              0
island               0
bill_length_mm       0
bill_depth_mm        0
flipper_length_mm    0
body_mass_g          0
sex                  0
dtype: int64

# Scale or normalize the numerical features to ensure consistent scaling across variables.


In [369]:
from sklearn.preprocessing import MinMaxScaler

# Creating the MinMaxScaler instance
scaler = MinMaxScaler()

# Selecting only the columns that need scaling (numeric columns)
numeric_columns = df.select_dtypes(include=['float64']).columns

# Fitting and transforming the scaler on the selected columns
df[numeric_columns] = scaler.fit_transform(df[numeric_columns])

# Checking the scaled DataFrame
df.describe()

Unnamed: 0,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g
count,344.0,344.0,344.0,344.0
mean,0.43,0.482385,0.489733,0.416909
std,0.197956,0.234412,0.237684,0.222138
min,0.0,0.0,0.0,0.0
25%,0.260909,0.297619,0.305085,0.236111
50%,0.449091,0.5,0.423729,0.375
75%,0.596364,0.666667,0.694915,0.569444
max,1.0,1.0,1.0,1.0
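
In the cell above the scaler is fitted on the whole DataFrame before the train/test split, so the minimum and maximum of the test rows also shape the scaling. A common alternative, sketched here on made-up numbers with hypothetical names (`demo_X`, `demo_scaler`), is to fit the scaler on the training rows and only transform the test rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Made-up single feature, loosely shaped like flipper lengths
rng = np.random.default_rng(0)
demo_X = rng.normal(loc=200.0, scale=14.0, size=(100, 1))

demo_train, demo_test = train_test_split(demo_X, test_size=0.2, random_state=42)

demo_scaler = MinMaxScaler()
demo_train_scaled = demo_scaler.fit_transform(demo_train)  # learn min/max from training rows only
demo_test_scaled = demo_scaler.transform(demo_test)        # reuse the training min/max

print(demo_train_scaled.min(), demo_train_scaled.max())
print(demo_test_scaled.min(), demo_test_scaled.max())  # may fall slightly outside [0, 1]
```

Test values outside the [0, 1] range are expected with this approach and are usually harmless for tree-based models such as Random Forest.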


In [370]:
#encoding the data frame
# Fit and transform the categorical feature
df['sex'] = label_encoder.fit_transform(df['sex'])
df['island'] = label_encoder.fit_transform(df['island'])
df['species'] = label_encoder.fit_transform(df['species'])
df.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,0,2,0.254545,0.666667,0.152542,0.291667,1
1,0,2,0.269091,0.511905,0.237288,0.305556,0
2,0,2,0.298182,0.583333,0.389831,0.152778,0
3,0,2,0.449091,0.5,0.429057,0.375,1
4,0,2,0.167273,0.738095,0.355932,0.208333,0


# Split the dataset into training and testing subsets to evaluate the performance of your machine learning models.

In [388]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.decomposition import PCA
import pandas as pd

# Features (excluding the target variable 'species')
X = df.drop('species', axis=1)

# Target variable 'species'
y = df['species']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
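
The split above samples rows at random, so the species proportions in the 69-row test set can drift from those of the full data. A hedged variant is to pass `stratify`, which keeps the class shares close to equal in both subsets. The label counts below are made up for illustration and the names `demo_labels`, `s_train`, `s_test` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 150 of class 0, 70 of class 1, 120 of class 2
demo_labels = np.repeat([0, 1, 2], [150, 70, 120])
demo_features = np.arange(len(demo_labels)).reshape(-1, 1)

f_train, f_test, s_train, s_test = train_test_split(
    demo_features, demo_labels, test_size=0.2, random_state=42, stratify=demo_labels
)

# Share of each class in the two subsets
train_share = np.bincount(s_train) / len(s_train)
test_share = np.bincount(s_test) / len(s_test)
print(train_share, test_share)
```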


# Apply suitable dimensionality reduction techniques to reduce the size of the data while preserving important information.

# Using PCA only

In [389]:
# Train a classifier on the original features and evaluate
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Create PCA instance and fit to the data
pca = PCA(n_components=5)  # Specify the number of components
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

# Train a classifier on the retained PCA components and evaluate
clf_pca = RandomForestClassifier(random_state=42)
clf_pca.fit(X_train_pca, y_train)
y_pred_pca = clf_pca.predict(X_test_pca)
accuracy_pca = accuracy_score(y_test, y_pred_pca)

# Print the explained variance ratio for each selected component
print(f"Explained variance ratio for each PCA component: {pca.explained_variance_ratio_}")

print(f"Number of original features: {X_train.shape[1]}")
print(f"Number of features retained after PCA: {X_train_pca.shape[1]}")
print(f"Accuracy of Original features (testing accuracy): {accuracy}")
print(f"Accuracy after PCA (testing accuracy): {accuracy_pca}")

# Display additional classification metrics for the model trained on the original features
print(classification_report(y_test, y_pred))

Explained variance ratio for each PCA component: [0.60000898 0.28544661 0.07362431 0.02376811 0.01165683]
Number of original features: 6
Number of features retained after PCA: 5
Accuracy of Original features (testing accuracy): 0.9855072463768116
Accuracy after PCA (testing accuracy): 0.9710144927536232
              precision    recall  f1-score   support

           0       0.97      1.00      0.98        32
           1       1.00      0.94      0.97        16
           2       1.00      1.00      1.00        21

    accuracy                           0.99        69
   macro avg       0.99      0.98      0.98        69
weighted avg       0.99      0.99      0.99        69
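
The explained variance ratios printed above can be accumulated to see how many components are needed for a given share of the variance. The sketch below reuses the printed numbers; the 95% target is an assumption chosen for illustration.

```python
import numpy as np

# Explained variance ratios printed by the PCA cell above
ratios = np.array([0.60000898, 0.28544661, 0.07362431, 0.02376811, 0.01165683])

cumulative = np.cumsum(ratios)

# Smallest number of components whose cumulative share reaches the target
target = 0.95  # assumed threshold, chosen for illustration
n_needed = int(np.argmax(cumulative >= target)) + 1

print(cumulative.round(4))
print(f"Components needed for {target:.0%} of the variance: {n_needed}")
```

With these ratios the first three components already cover about 96% of the variance, so keeping five of six features removes very little. scikit-learn's `PCA` also accepts a float between 0 and 1 as `n_components` to make this choice automatically.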



# Using VarianceThreshold

In [390]:
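# VarianceThreshold keeps every feature whose variance exceeds the threshold
# (it does not select a fixed number of top features)
from sklearn.feature_selection import VarianceThreshold

selector = VarianceThreshold(threshold=0.05)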
df_variance = pd.DataFrame(selector.fit_transform(df), columns=df.columns[selector.get_support()])
df_variance.head()

Unnamed: 0,species,island,bill_depth_mm,flipper_length_mm,sex
0,0.0,2.0,0.666667,0.152542,1.0
1,0.0,2.0,0.511905,0.237288,0.0
2,0.0,2.0,0.583333,0.389831,0.0
3,0.0,2.0,0.5,0.429057,1.0
4,0.0,2.0,0.738095,0.355932,0.0


In [391]:
from sklearn.feature_selection import VarianceThreshold

# Train a classifier on the original features and evaluate
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Apply Variance Threshold
selector = VarianceThreshold(threshold=0.05)
X_train_variance = selector.fit_transform(X_train)
X_test_variance = selector.transform(X_test)

# Train a classifier on the selected features and evaluate
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train_variance, y_train)
y_pred = clf.predict(X_test_variance)
accuracy_variance = accuracy_score(y_test, y_pred)

print(f"The variance of each feature: {selector.variances_}")
print(f"Number of original features: {X_train.shape[1]}")
print(f"Number of features after variance threshold filtering: {X_train_variance.shape[1]}")
print(f"Accuracy of Original features (testing accuracy): {accuracy}")
print(f"Accuracy after variance threshold filtering (testing accuracy): {accuracy_variance}")

The variance of each feature: [0.51300496 0.03990416 0.05529107 0.05828465 0.04954537 0.24991736]
Number of original features: 6
Number of features after variance threshold filtering: 4
Accuracy of Original features (testing accuracy): 0.9855072463768116
Accuracy after variance threshold filtering (testing accuracy): 0.8695652173913043
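
To see which features the threshold removed, the printed variances can be matched to the column order of `X` (the DataFrame columns with `species` dropped). The sketch below only re-reads the numbers printed above.

```python
import numpy as np

# Column order of X (species dropped) and the variances printed above
feature_names = ["island", "bill_length_mm", "bill_depth_mm",
                 "flipper_length_mm", "body_mass_g", "sex"]
printed_variances = np.array([0.51300496, 0.03990416, 0.05529107,
                              0.05828465, 0.04954537, 0.24991736])

threshold = 0.05
kept = [name for name, var in zip(feature_names, printed_variances) if var > threshold]
dropped = [name for name, var in zip(feature_names, printed_variances) if var <= threshold]

print("kept:   ", kept)
print("dropped:", dropped)
```

The filter drops `bill_length_mm` and `body_mass_g`, both of which correlate strongly with the encoded species in the correlation table (0.73 and 0.75). That may explain the fall in accuracy to about 0.87. The large variance of `island` comes from its unscaled 0/1/2 codes, not from any extra information, because variance is compared here across features on different scales.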


# Using KBest

In [392]:
# Import necessary libraries
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Train a classifier on the original features and evaluate
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Apply Information Gain
selector = SelectKBest(mutual_info_classif, k=4)  # keep the top 4 features
X_train_ig = selector.fit_transform(X_train, y_train)
X_test_ig = selector.transform(X_test)

# Train a classifier on the selected features and evaluate
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train_ig, y_train)
y_pred = clf.predict(X_test_ig)
accuracy_ig = accuracy_score(y_test, y_pred)


print(f"Number of original features: {X_train.shape[1]}")
print(f"Number of features after Information Gain filtering: {X_train_ig.shape[1]}")
print(f"Accuracy of Original features (testing accuracy): {accuracy}")
print(f"Accuracy after Information Gain filtering (testing accuracy): {accuracy_ig}")


Number of original features: 6
Number of features after Information Gain filtering: 4
Accuracy of Original features (testing accuracy): 0.9855072463768116
Accuracy after Information Gain filtering (testing accuracy): 0.9855072463768116
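
The cell above reports how many features were kept but not which ones or how they scored. A fitted `SelectKBest` exposes both through `scores_` and `get_support()`. The sketch below uses the bundled Iris data purely as a stand-in, with hypothetical names (`demo_selector`, `mi_scores`), and fixes the seed of `mutual_info_classif` through `functools.partial`.

```python
from functools import partial

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Iris is used only as a small bundled stand-in for the penguins features
iris = load_iris(as_frame=True)
X_demo, y_demo = iris.data, iris.target

# mutual_info_classif adds small random noise to continuous features, so fix the seed
score_func = partial(mutual_info_classif, random_state=42)
demo_selector = SelectKBest(score_func, k=2).fit(X_demo, y_demo)

# Mutual information score per feature, highest first
mi_scores = pd.Series(demo_selector.scores_, index=X_demo.columns).sort_values(ascending=False)
selected = list(X_demo.columns[demo_selector.get_support()])

print(mi_scores)
print("selected:", selected)
```

Applied to the penguins features, the same two attributes would show which four columns survive the `k=4` selection.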


# PCA & VarianceThreshold

In [393]:
from sklearn.feature_selection import VarianceThreshold
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Train a classifier on the original features and evaluate
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Apply Variance Threshold
selector = VarianceThreshold(threshold=0.05)
X_train_variance = selector.fit_transform(X_train)
X_test_variance = selector.transform(X_test)

# Create PCA instance and fit to the data
pca = PCA(n_components=3)  # Specify the number of components
X_train_pca = pca.fit_transform(X_train_variance)
X_test_pca = pca.transform(X_test_variance)

# Train a classifier on the retained PCA components and evaluate
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train_pca, y_train)
y_pred = clf.predict(X_test_pca)
accuracy_PCA_VT = accuracy_score(y_test, y_pred)

print(f"The variance of each feature: {selector.variances_}")
print(f"Number of original features: {X_train.shape[1]}")
print(f"Number of features after Variance Threshold filtering: {X_train_variance.shape[1]}")
print(f"Number of features after PCA filtering: {X_train_pca.shape[1]}")
print(f"Accuracy of Original features (testing accuracy): {accuracy}")
print(f"Accuracy after Variance Threshold filtering and PCA (testing accuracy): {accuracy_PCA_VT}")

The variance of each feature: [0.51300496 0.03990416 0.05529107 0.05828465 0.04954537 0.24991736]
Number of original features: 6
Number of features after Variance Threshold filtering: 4
Number of features after PCA filtering: 3
Accuracy of Original features (testing accuracy): 0.9855072463768116
Accuracy after Variance Threshold filtering and PCA (testing accuracy): 0.855072463768116


# PCA & KBest

In [394]:
from sklearn.decomposition import PCA

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a classifier on the original features and evaluate
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Apply Information Gain
selector_ig = SelectKBest(mutual_info_classif, k=4)  # keep the top 4 features
X_train_ig = selector_ig.fit_transform(X_train, y_train)
X_test_ig = selector_ig.transform(X_test)

# Train a classifier on the selected features and evaluate
clf_ig = RandomForestClassifier(random_state=42)
clf_ig.fit(X_train_ig, y_train)
y_pred_ig = clf_ig.predict(X_test_ig)
accuracy_ig = accuracy_score(y_test, y_pred_ig)

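# Apply PCA (note: it is fitted on all six features in X_train, not on the
# KBest-selected X_train_ig, so the two reductions are evaluated separately)
pca = PCA(n_components=3)  # specify the number of components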
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

# Train a classifier on the retained PCA components and evaluate
clf_pca = RandomForestClassifier(random_state=42)
clf_pca.fit(X_train_pca, y_train)
y_pred_pca = clf_pca.predict(X_test_pca)
accuracy_PCA_Kbest = accuracy_score(y_test, y_pred_pca)

print(f"Number of original features: {X_train.shape[1]}")
print(f"Number of features after Information Gain filtering: {X_train_ig.shape[1]}")
print(f"Accuracy of Original features (testing accuracy): {accuracy}")
print(f"Accuracy after Information Gain filtering (testing accuracy): {accuracy_ig}")
print(f"Number of features after PCA: {X_train_pca.shape[1]}")
print(f"Accuracy after PCA (testing accuracy): {accuracy_PCA_Kbest}")

Number of original features: 6
Number of features after Information Gain filtering: 4
Accuracy of Original features (testing accuracy): 0.9855072463768116
Accuracy after Information Gain filtering (testing accuracy): 0.9855072463768116
Number of features after PCA: 3
Accuracy after PCA (testing accuracy): 0.8985507246376812
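
If the goal is to apply PCA to the features that survive the KBest selection, one option is to chain the two steps in a scikit-learn `Pipeline`, so that PCA only sees the selected columns. This is a sketch under assumptions: it uses the bundled Iris data as a stand-in, and the values `k=3` and `n_components=2` are chosen only to fit that data.

```python
from functools import partial

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Iris is a stand-in here; it is not the penguins data
Xc, yc = load_iris(return_X_y=True)
Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    Xc, yc, test_size=0.2, random_state=42, stratify=yc
)

chained = Pipeline([
    ("kbest", SelectKBest(partial(mutual_info_classif, random_state=42), k=3)),
    ("pca", PCA(n_components=2)),  # PCA is fitted on the 3 selected features only
    ("clf", RandomForestClassifier(random_state=42)),
])
chained.fit(Xc_train, yc_train)

n_after_kbest = int(chained.named_steps["kbest"].get_support().sum())
n_after_pca = int(chained.named_steps["pca"].n_components_)
chained_accuracy = chained.score(Xc_test, yc_test)
print(n_after_kbest, n_after_pca, chained_accuracy)
```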


# Raw Data

In [395]:
# Import the 'load_dataset' function from seaborn to load the penguins dataset
from seaborn import load_dataset

# Load the penguins dataset again and store it in the 'df_raw' DataFrame
df_raw = load_dataset('penguins')

# Display the first few rows of the DataFrame to get an initial look at the data
df_raw.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female
3,Adelie,Torgersen,,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female


In [396]:
from sklearn.preprocessing import MinMaxScaler

# Creating the MinMaxScaler instance
scaler = MinMaxScaler()

# Selecting only the columns that need scaling (numeric columns)
numeric_columns = df_raw.select_dtypes(include=['float64']).columns

# Fitting and transforming the scaler on the selected columns
df_raw[numeric_columns] = scaler.fit_transform(df_raw[numeric_columns])

# Checking the scaled DataFrame
df_raw.describe()

Unnamed: 0,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g
count,342.0,342.0,342.0,342.0
mean,0.429888,0.482282,0.490088,0.417154
std,0.19853,0.235094,0.238334,0.222765
min,0.0,0.0,0.0,0.0
25%,0.259091,0.297619,0.305085,0.236111
50%,0.449091,0.5,0.423729,0.375
75%,0.596364,0.666667,0.694915,0.569444
max,1.0,1.0,1.0,1.0


In [397]:
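# Encode the categorical columns.
# Note: the values are taken from the preprocessed `df` (already imputed and encoded),
# not from `df_raw`, so this "raw" frame differs from `df` only in its numeric columns.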
df_raw['sex'] = label_encoder.fit_transform(df['sex'])
df_raw['island'] = label_encoder.fit_transform(df['island'])
df_raw['species'] = label_encoder.fit_transform(df['species'])
df_raw = df_raw.dropna()
df_raw.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,0,2,0.254545,0.666667,0.152542,0.291667,1
1,0,2,0.269091,0.511905,0.237288,0.305556,0
2,0,2,0.298182,0.583333,0.389831,0.152778,0
4,0,2,0.167273,0.738095,0.355932,0.208333,0
5,0,2,0.261818,0.892857,0.305085,0.263889,1


In [398]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.decomposition import PCA
import pandas as pd

# Features (excluding the target variable 'species')
X = df_raw.drop('species', axis=1)

# Target variable 'species'
y = df_raw['species']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


# VarianceThreshold - raw data 

In [399]:
# VarianceThreshold keeps every feature whose variance exceeds the threshold
# (it does not select a fixed number of top features)
selector = VarianceThreshold(threshold=0.05)
df_variance = pd.DataFrame(selector.fit_transform(df_raw), columns=df_raw.columns[selector.get_support()])
df_variance.head()

Unnamed: 0,species,island,bill_depth_mm,flipper_length_mm,sex
0,0.0,2.0,0.666667,0.152542,1.0
1,0.0,2.0,0.511905,0.237288,0.0
2,0.0,2.0,0.583333,0.389831,0.0
3,0.0,2.0,0.738095,0.355932,0.0
4,0.0,2.0,0.892857,0.305085,1.0


In [400]:
from sklearn.feature_selection import VarianceThreshold

# Train a classifier on the original features and evaluate
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Apply Variance Threshold
selector = VarianceThreshold(threshold=0.05)
X_train_variance = selector.fit_transform(X_train)
X_test_variance = selector.transform(X_test)

# Train a classifier on the selected features and evaluate
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train_variance, y_train)
y_pred = clf.predict(X_test_variance)
accuracy_variance_raw = accuracy_score(y_test, y_pred)

print(f"The variance of each feature: {selector.variances_}")
print(f"Number of original features: {X_train.shape[1]}")
print(f"Number of features after variance threshold filtering: {X_train_variance.shape[1]}")
print(f"Accuracy of Original features (testing accuracy): {accuracy}")
print(f"Accuracy after variance threshold filtering (testing accuracy): {accuracy_variance_raw}")

The variance of each feature: [0.49486777 0.04044968 0.05551308 0.05743487 0.04891159 0.24983563]
Number of original features: 6
Number of features after variance threshold filtering: 4
Accuracy of Original features (testing accuracy): 0.9565217391304348
Accuracy after variance threshold filtering (testing accuracy): 0.8115942028985508


# KBest - raw data

In [401]:
# Import necessary libraries
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Train a classifier on the original features and evaluate
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Apply Information Gain
selector = SelectKBest(mutual_info_classif, k=4)  # keep the top 4 features
X_train_ig = selector.fit_transform(X_train, y_train)
X_test_ig = selector.transform(X_test)

# Train a classifier on the selected features and evaluate
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train_ig, y_train)
y_pred = clf.predict(X_test_ig)
accuracy_ig_raw = accuracy_score(y_test, y_pred)


print(f"Number of original features: {X_train.shape[1]}")
print(f"Number of features after Information Gain filtering: {X_train_ig.shape[1]}")
print(f"Accuracy of Original features (testing accuracy): {accuracy}")
print(f"Accuracy after Information Gain filtering (testing accuracy): {accuracy_ig_raw}")


Number of original features: 6
Number of features after Information Gain filtering: 4
Accuracy of Original features (testing accuracy): 0.9565217391304348
Accuracy after Information Gain filtering (testing accuracy): 0.9565217391304348


# PCA & VarianceThreshold - raw data

In [402]:
from sklearn.feature_selection import VarianceThreshold
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score


# Train a classifier on the original features and evaluate
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Apply Variance Threshold
selector = VarianceThreshold(threshold=0.05)
X_train_variance = selector.fit_transform(X_train)
X_test_variance = selector.transform(X_test)

# Create PCA instance and fit to the data
pca = PCA(n_components=3)  # Specify the number of components
X_train_pca = pca.fit_transform(X_train_variance)
X_test_pca = pca.transform(X_test_variance)

# Train a classifier on the retained PCA components and evaluate
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train_pca, y_train)
y_pred = clf.predict(X_test_pca)
accuracy_PCA_VT_raw = accuracy_score(y_test, y_pred)

print(f"The variance of each feature: {selector.variances_}")
print(f"Number of original features: {X_train.shape[1]}")
print(f"Number of features after Variance Threshold filtering: {X_train_variance.shape[1]}")
print(f"Number of features after PCA filtering: {X_train_pca.shape[1]}")
print(f"Accuracy of Original features (testing accuracy): {accuracy}")
print(f"Accuracy after Variance Threshold filtering and PCA (testing accuracy): {accuracy_PCA_VT_raw}")

The variance of each feature: [0.49486777 0.04044968 0.05551308 0.05743487 0.04891159 0.24983563]
Number of original features: 6
Number of features after Variance Threshold filtering: 4
Number of features after PCA filtering: 3
Accuracy of Original features (testing accuracy): 0.9565217391304348
Accuracy after Variance Threshold filtering and PCA (testing accuracy): 0.8405797101449275


# PCA & KBest - raw data

In [403]:
from sklearn.decomposition import PCA

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a classifier on the original features and evaluate
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Apply Information Gain
selector_ig = SelectKBest(mutual_info_classif, k=4)  # keep the top 4 features
X_train_ig = selector_ig.fit_transform(X_train, y_train)
X_test_ig = selector_ig.transform(X_test)

# Train a classifier on the selected features and evaluate
clf_ig = RandomForestClassifier(random_state=42)
clf_ig.fit(X_train_ig, y_train)
y_pred_ig = clf_ig.predict(X_test_ig)
accuracy_ig = accuracy_score(y_test, y_pred_ig)

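# Apply PCA (note: it is fitted on all six features in X_train, not on the
# KBest-selected X_train_ig, so the two reductions are evaluated separately)
pca = PCA(n_components=3)  # specify the number of components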
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

# Train a classifier on the retained PCA components and evaluate
clf_pca = RandomForestClassifier(random_state=42)
clf_pca.fit(X_train_pca, y_train)
y_pred_pca = clf_pca.predict(X_test_pca)
accuracy_PCA_Kbest_raw = accuracy_score(y_test, y_pred_pca)

print(f"Number of original features: {X_train.shape[1]}")
print(f"Number of features after Information Gain filtering: {X_train_ig.shape[1]}")
print(f"Accuracy of Original features (testing accuracy): {accuracy}")
print(f"Accuracy after Information Gain filtering (testing accuracy): {accuracy_ig}")
print(f"Number of features after PCA: {X_train_pca.shape[1]}")
print(f"Accuracy after PCA (testing accuracy): {accuracy_PCA_Kbest_raw}")

Number of original features: 6
Number of features after Information Gain filtering: 4
Accuracy of Original features (testing accuracy): 0.9565217391304348
Accuracy after Information Gain filtering (testing accuracy): 0.9565217391304348
Number of features after PCA: 3
Accuracy after PCA (testing accuracy): 0.8840579710144928



# Validate your preprocessing pipeline by training and evaluating a machine learning model, such as the Random Forest model, on the preprocessed data. 

# Compare the results to the model trained on the raw data (before feature filtering, transformation, and reduction) to ensure that preprocessing has improved model performance.

In [404]:
# Compare the results (preprocessed accuracy minus raw-data accuracy)
print(f"Improvement in accuracy after preprocessing (VarianceThreshold): {accuracy_variance - accuracy_variance_raw}")

# Note: accuracy_ig was reassigned in the raw-data "PCA & KBest" cell (In [403]),
# so both terms below are raw-data scores and the difference prints 0.0.
# The KBest accuracies reported earlier were 0.9855 (preprocessed) and 0.9565 (raw).
print(f"Improvement in accuracy after preprocessing (KBest): {accuracy_ig - accuracy_ig_raw}")

# Compare the results
print(f"Improvement in accuracy after preprocessing (PCA & VarianceThreshold): {accuracy_PCA_VT - accuracy_PCA_VT_raw}")

# Compare the results
print(f"Improvement in accuracy after preprocessing (PCA & KBest): {accuracy_PCA_Kbest - accuracy_PCA_Kbest_raw}")


Improvement in accuracy after preprocessing (VarianceThreshold): 0.05797101449275355
Improvement in accuracy after preprocessing (KBest): 0.0
Improvement in accuracy after preprocessing (PCA & VarianceThreshold): 0.01449275362318847
Improvement in accuracy after preprocessing (PCA & KBest): 0.01449275362318836
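
Because each test set is small, it helps to translate the accuracies into counts of misclassified penguins. The sketch below reuses accuracies printed earlier in the notebook. The test size of 69 is shown in the classification report for the preprocessed split; for the raw split it is inferred from 0.9565... = 66/69, which is an assumption.

```python
# Test accuracies printed earlier; both test sets are taken to hold 69 rows
n_test = 69
reported = {
    "original features (preprocessed)": 0.9855072463768116,
    "original features (raw)": 0.9565217391304348,
    "VarianceThreshold (preprocessed)": 0.8695652173913043,
    "VarianceThreshold (raw)": 0.8115942028985508,
    "PCA & VarianceThreshold (preprocessed)": 0.855072463768116,
    "PCA & VarianceThreshold (raw)": 0.8405797101449275,
}

# Convert each accuracy into a whole number of misclassified test rows
error_counts = {name: round((1 - acc) * n_test) for name, acc in reported.items()}
for name, count in error_counts.items():
    print(f"{name}: {count} of {n_test} misclassified")
```

An improvement of 0.0145 corresponds to a single test row, so differences of that size may not survive a different `random_state`; cross-validation would give a steadier comparison.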
