# What is Feature Engineering

Feature engineering is not cleaning.

It asks a different question:

How can I convert this cleaned data into a smarter, more meaningful, model-ready format?

This is the step where you add intelligence to the data.

# Task
Perform a data cleanliness assessment on the `diabetes_cleaned.csv` file.

In [2]:
import pandas as pd
df_diabetes = pd.read_csv('/content/drive/MyDrive/diabetes_cleaned.csv')
df_diabetes.head()

Unnamed: 0,pregnancies,glucose,bloodpressure,skinthickness,insulin,bmi,diabetespedigreefunction,age,outcome,flag_glucose,flag_bloodpressure,flag_skinthickness,flag_insulin,flag_bmi,pregnancies_yeojohnson,glucose_yeojohnson,bmi_yeojohnson,diabetespedigreefunction_yeojohnson,age_yeojohnson
0,6,148.0,72.0,35.0,125.0,33.6,0.627,50,1,0,0,0,1,0,0.813399,0.907992,0.260613,0.821764,1.36418
1,1,85.0,66.0,29.0,125.0,26.6,0.351,31,0,0,0,0,1,0,-0.833906,-1.305305,-0.842925,-0.168409,0.126452
2,8,183.0,64.0,29.0,125.0,23.3,0.672,32,1,0,0,1,1,0,1.188996,1.785503,-1.457644,0.935284,0.230161
3,1,89.0,66.0,23.0,112.875,28.1,0.167,21,0,0,0,0,0,0,-0.833906,-1.126118,-0.585976,-1.298725,-1.480075
4,0,137.0,40.0,35.0,135.875,43.1,2.288,33,1,0,0,0,0,0,-1.603317,0.592903,1.46286,2.33668,0.327328


## Check for Missing Values

### Subtask:
Identify and quantify any missing values across all columns to understand data completeness.


**Reasoning**:
To identify and quantify missing values, I will calculate the sum of null values for each column and then compute the percentage of missing values relative to the total number of rows. This provides a clear overview of data completeness.



In [6]:
missing_values = df_diabetes.isnull().sum()
missing_percentage = (missing_values / len(df_diabetes)) * 100

missing_info = pd.DataFrame({
    'Missing Count': missing_values,
    'Missing Percentage': missing_percentage
})

print("Missing Values Information:")
print(missing_info)


Missing Values Information:
                          Missing Count  Missing Percentage
Pregnancies                           0                 0.0
Glucose                               0                 0.0
BloodPressure                         0                 0.0
SkinThickness                         0                 0.0
Insulin                               0                 0.0
BMI                                   0                 0.0
DiabetesPedigreeFunction              0                 0.0
Age                                   0                 0.0
Outcome                               0                 0.0


## Check for Duplicates

### Subtask:
Detect and count any duplicate rows in the dataset, as duplicates can skew analysis.


**Reasoning**:
To detect and count duplicate rows, I will first use the `.duplicated()` method to identify them and then sum the boolean result to get the total count. I will then print this count.



In [7]:
duplicate_rows_count = df_diabetes.duplicated().sum()
print(f"Total number of duplicate rows: {duplicate_rows_count}")

Total number of duplicate rows: 0
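One caveat on the duplicate check: `.duplicated()` compares every column, so a saved index column (such as the `Unnamed: 0` column visible in the `head()` output above) can make otherwise identical rows look unique. The sketch below illustrates this on a small made-up frame. The values and the `measurement_cols` name are illustrative assumptions, not taken from the dataset.

```python
import pandas as pd

# Toy frame (hypothetical values): rows 0 and 1 carry identical measurements,
# but the saved index column makes every full row distinct.
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "glucose": [148.0, 148.0, 85.0],
    "bmi": [33.6, 33.6, 26.6],
})

# Comparing all columns: the index column hides the repeated measurements.
dupes_all_columns = int(toy.duplicated().sum())

# Comparing only the measurement columns exposes the repeat.
measurement_cols = [c for c in toy.columns if c != "Unnamed: 0"]
dupes_measurements = int(toy.duplicated(subset=measurement_cols).sum())

print("All columns:", dupes_all_columns)
print("Measurement columns only:", dupes_measurements)
```

If the real file carries such a column, passing `subset=` (or dropping the column first) is one way to make the check meaningful.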



    
**Overall Cleanliness Status**: The dataset is explicitly clean, with no `NaN` values or duplicate rows. However, it contains significant implicit missing values (represented by zeros) in critical physiological measurement columns. This necessitates a careful approach to handling these zero values before any modeling or in-depth analysis.
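Zeros are not counted by `isnull()`, so they need their own tally. A minimal sketch of such a check follows, using a small made-up frame rather than the real data. The values and the `zero_info` name are assumptions for illustration.

```python
import pandas as pd

# Toy frame (hypothetical values) with zeros standing in for missing readings.
toy = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "Insulin": [0, 0, 0, 94],
    "BMI": [33.6, 26.6, 0.0, 28.1],
})

# Count the zeros per column and express them as a share of all rows.
zero_counts = (toy == 0).sum()
zero_pct = zero_counts / len(toy) * 100

zero_info = pd.DataFrame({
    "Zero Count": zero_counts,
    "Zero Percentage": zero_pct,
})
print(zero_info)
```

Running the same two lines against the physiological columns of `df_diabetes` would quantify how many implicit missing values each column holds.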

# Task
Perform feature engineering on the `df_diabetes` DataFrame to handle implicit missing values (zeros) in the 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', and 'BMI' columns by replacing them with the median of their respective columns. Then, scale the numerical features, separate the data into features (X) and target (y) with 'Outcome' as the target, and finally split the preprocessed data into training and testing sets.

## Handle Implicit Missing Values

### Subtask:
Address the '0' values in columns like 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', and 'BMI' by replacing them with a more appropriate measure (e.g., median or mean) to reflect actual missing data.


**Reasoning**:
To address the implicit missing values represented by '0's in the 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', and 'BMI' columns, I will replace these '0's with the median of each respective column. The median is chosen over the mean because it is less sensitive to outliers and skewed distributions, both of which are common in medical data. It is crucial to calculate the median *excluding* the '0' values themselves: these are the values we consider missing, and they should not influence the estimate of central tendency. After replacement, I will display descriptive statistics for these columns to confirm the imputation and inspect the new data distribution.

**Reasoning**:
First, I will define the list of columns that need '0' values to be treated as implicit missing data. This clearly identifies the target columns for the imputation process.



In [8]:
columns_to_impute = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
print("Columns identified for imputing '0' values:", columns_to_impute)

Columns identified for imputing '0' values: ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']


**Reasoning**:
Now that the columns for imputation are identified, I will iterate through each of these columns, calculate the median value for each, specifically excluding any '0' entries which are considered implicit missing values, and then use these calculated medians to replace all '0's within their respective columns.



In [9]:
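for col in columns_to_impute:
    # Median of the non-zero entries only; zeros are treated as missing
    median_val = df_diabetes.loc[df_diabetes[col] != 0, col].median()
    # Replace the zeros with that median
    df_diabetes[col] = df_diabetes[col].replace(0, median_val)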

print("Descriptive statistics for imputed columns:")
print(df_diabetes[columns_to_impute].describe())

Descriptive statistics for imputed columns:
          Glucose  BloodPressure  SkinThickness     Insulin         BMI
count  768.000000     768.000000     768.000000  768.000000  768.000000
mean   121.656250      72.386719      29.108073  140.671875   32.455208
std     30.438286      12.096642       8.791221   86.383060    6.875177
min     44.000000      24.000000       7.000000   14.000000   18.200000
25%     99.750000      64.000000      25.000000  121.500000   27.500000
50%    117.000000      72.000000      29.000000  125.000000   32.300000
75%    140.250000      80.000000      32.000000  127.250000   36.600000
max    199.000000     122.000000      99.000000  846.000000   67.100000
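An equivalent way to express the same imputation is to mark the zeros as `NaN` first, since pandas skips `NaN` when computing a median. The sketch below shows this on made-up numbers and also shows how the median shifts if the zeros are left in. The toy values and variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values); zeros stand in for missing readings.
toy = pd.DataFrame({
    "Glucose": [0.0, 100.0, 120.0, 140.0],
    "BMI": [30.0, 0.0, 20.0, 40.0],
})
cols = ["Glucose", "BMI"]

# Mark zeros as missing so that median() skips them automatically.
masked = toy[cols].replace(0, np.nan)
medians = masked.median()

# Fill each column's gaps with that column's own median.
imputed = masked.fillna(medians)

# For comparison: a median that still includes the zero is pulled downward.
naive_glucose_median = toy["Glucose"].median()

print("Medians excluding zeros:", medians.to_dict())
print("Glucose median including the zero:", naive_glucose_median)
print(imputed)
```

Keeping the `NaN` mask as an intermediate step also makes it easy to count how many values were imputed per column.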


## Feature Scaling

### Subtask:
Scale numerical features to ensure that they contribute equally to the model's training process and prevent features with larger values from dominating.


**Reasoning**:
Feature scaling is crucial for many machine learning algorithms, especially those that rely on distance calculations (e.g., K-Nearest Neighbors, Support Vector Machines) or gradient descent (e.g., Logistic Regression, Neural Networks). Without scaling, features with larger numerical ranges can dominate the learning process, even if they are not more important functionally. Standardization (using `StandardScaler`) transforms the data to have a mean of 0 and a standard deviation of 1, making all features comparable and preventing any single feature from disproportionately influencing the model.
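Standardization computes `z = (x - mean) / std` for each column, where `std` is the population standard deviation (`ddof=0`). A small sketch on made-up numbers confirms that the manual formula and `StandardScaler` agree. The array `x` is an illustrative assumption.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# One toy feature column (hypothetical values).
x = np.array([[1.0], [2.0], [3.0], [4.0]])

# Manual standardization; np.std uses ddof=0 by default, as StandardScaler does.
manual = (x - x.mean(axis=0)) / x.std(axis=0)

# The same transformation through scikit-learn.
scaled = StandardScaler().fit_transform(x)

print("Manual and StandardScaler agree:", np.allclose(manual, scaled))
print("Mean after scaling:", scaled.mean())
print("Std after scaling:", scaled.std())
```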

**Reasoning**:
To prepare for feature scaling, I will identify all numerical columns in the DataFrame that require scaling, specifically excluding the target variable 'Outcome' and any columns that have already undergone specific transformations (e.g., Yeo-Johnson transformation denoted by '_yeojohnson' suffix) or are flag indicators. This ensures that only appropriate features are scaled.



In [10]:
numerical_cols = df_diabetes.select_dtypes(include=['number']).columns.tolist()

# Exclude 'Outcome' column as it is the target variable
columns_to_scale = [col for col in numerical_cols if col != 'Outcome']

# Exclude columns that are already transformed (e.g., _yeojohnson) or flag indicators
columns_to_scale = [col for col in columns_to_scale if '_yeojohnson' not in col and 'flag_' not in col]

print("Columns identified for scaling:")
print(columns_to_scale)

Columns identified for scaling:
['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']


**Reasoning**:
Now that the columns to be scaled have been identified, the next step is to import the `StandardScaler` from `sklearn.preprocessing` and instantiate it, which is necessary to perform the feature scaling.



In [11]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
print("StandardScaler imported and instantiated.")

StandardScaler imported and instantiated.


**Reasoning**:
Now that the `StandardScaler` is instantiated, I will apply it to the identified numerical features using the `.fit_transform()` method to scale the data. Then, I will update the `df_diabetes` DataFrame with these scaled values, ensuring column names are preserved, and display the head of the DataFrame to verify the scaling.



In [12]:
df_diabetes[columns_to_scale] = scaler.fit_transform(df_diabetes[columns_to_scale])

print("df_diabetes after feature scaling:")
print(df_diabetes.head())

df_diabetes after feature scaling:
   Pregnancies   Glucose  BloodPressure  SkinThickness   Insulin       BMI  \
0     0.639947  0.866045      -0.031990       0.670643 -0.181541  0.166619   
1    -0.844885 -1.205066      -0.528319      -0.012301 -0.181541 -0.852200   
2     1.233880  2.016662      -0.693761      -0.012301 -0.181541 -1.332500   
3    -0.844885 -1.073567      -0.528319      -0.695245 -0.540642 -0.633881   
4    -1.141852  0.504422      -2.679076       0.670643  0.316566  1.549303   

   DiabetesPedigreeFunction       Age  Outcome  
0                  0.468492  1.425995        1  
1                 -0.365061 -0.190672        0  
2                  0.604397 -0.105584        1  
3                 -0.920763 -1.041549        0  
4                  5.484909 -0.020496        1  
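The cell above fits the scaler on every row of the DataFrame. When a train/test split is part of the workflow, a commonly used alternative ordering is to split first and fit the scaler on the training rows only, so that test rows do not influence the learned mean and standard deviation. The sketch below shows that ordering on synthetic data. The frame, column list, and variable names are assumptions for illustration, not the notebook's own pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical, not the diabetes dataset).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Glucose": rng.normal(120, 30, size=50),
    "BMI": rng.normal(32, 7, size=50),
    "Outcome": rng.integers(0, 2, size=50),
})
feature_cols = ["Glucose", "BMI"]

# 1. Split before any statistics are learned.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy[feature_cols], toy["Outcome"], test_size=0.2, random_state=42
)

# 2. Fit the scaler on the training rows only.
toy_scaler = StandardScaler().fit(X_tr)

# 3. Apply the same fitted transformation to both parts.
X_tr_scaled = pd.DataFrame(
    toy_scaler.transform(X_tr), columns=feature_cols, index=X_tr.index
)
X_te_scaled = pd.DataFrame(
    toy_scaler.transform(X_te), columns=feature_cols, index=X_te.index
)

print(X_tr_scaled.shape, X_te_scaled.shape)
```

With this ordering the training columns have mean 0 and standard deviation 1 exactly, while the test columns are only approximately standardized, which is expected.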


**Reasoning**:
Now that the features have been scaled, I need to separate the data into features (X) and the target variable (y) for model training. The 'Outcome' column is specified as the target.



In [13]:
X = df_diabetes.drop('Outcome', axis=1)
y = df_diabetes['Outcome']

print("Shape of X (features):", X.shape)
print("Shape of y (target):", y.shape)
print("First 5 rows of X:")
print(X.head())
print("First 5 rows of y:")
print(y.head())

Shape of X (features): (768, 8)
Shape of y (target): (768,)
First 5 rows of X:
   Pregnancies   Glucose  BloodPressure  SkinThickness   Insulin       BMI  \
0     0.639947  0.866045      -0.031990       0.670643 -0.181541  0.166619   
1    -0.844885 -1.205066      -0.528319      -0.012301 -0.181541 -0.852200   
2     1.233880  2.016662      -0.693761      -0.012301 -0.181541 -1.332500   
3    -0.844885 -1.073567      -0.528319      -0.695245 -0.540642 -0.633881   
4    -1.141852  0.504422      -2.679076       0.670643  0.316566  1.549303   

   DiabetesPedigreeFunction       Age  
0                  0.468492  1.425995  
1                 -0.365061 -0.190672  
2                  0.604397 -0.105584  
3                 -0.920763 -1.041549  
4                  5.484909 -0.020496  
First 5 rows of y:
0    1
1    0
2    1
3    0
4    1
Name: Outcome, dtype: int64


**Reasoning**:
To complete the subtask of splitting the preprocessed data into training and testing sets, I will import the `train_test_split` function from `sklearn.model_selection`.



In [14]:
from sklearn.model_selection import train_test_split

print("train_test_split imported from sklearn.model_selection.")

train_test_split imported from sklearn.model_selection.


**Reasoning**:
Now that `train_test_split` is imported and `X` and `y` are defined, I will split the data into training and testing sets using an 80/20 ratio for training/testing and a `random_state` for reproducibility.



In [15]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

Shape of X_train: (614, 8)
Shape of X_test: (154, 8)
Shape of y_train: (614,)
Shape of y_test: (154,)
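The split above is purely random. For a binary target such as `Outcome`, `train_test_split` also accepts a `stratify` argument that keeps the class proportions the same in both parts. The sketch below demonstrates this on made-up labels. The 35/65 class balance and the variable names are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 100 rows, 35 positives and 65 negatives.
X_toy = pd.DataFrame({"feature": np.arange(100, dtype=float)})
y_toy = pd.Series([1] * 35 + [0] * 65, name="Outcome")

# stratify=y_toy keeps the share of positives equal in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print("Positive share in train:", y_tr.mean())
print("Positive share in test:", y_te.mean())
```

Whether stratification is worth adding depends on how imbalanced the target is; it mainly matters for small test sets.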


## Final Task

### Subtask:
Summarize the feature engineering steps performed and confirm the data is ready for model training.


## Save Preprocessed Data to Google Drive

### Subtask:
Save the preprocessed `X_train`, `X_test`, `y_train`, and `y_test` datasets to Google Drive for future use in model training.

**Reasoning**:
Saving the training and testing sets (`X_train`, `X_test`, `y_train`, `y_test`) after preprocessing ensures that the exact same splits and transformations are reused for model development and evaluation, which avoids discrepancies between sessions. Storing them in Google Drive makes them persistently available across sessions.

In [16]:
from google.colab import drive
drive.mount('/content/drive')

# Define the base path for saving in Google Drive
save_path = '/content/drive/MyDrive/'

# Save X_train
X_train.to_csv(f'{save_path}X_train.csv', index=False)
print(f'X_train saved to {save_path}X_train.csv')

# Save X_test
X_test.to_csv(f'{save_path}X_test.csv', index=False)
print(f'X_test saved to {save_path}X_test.csv')

# Save y_train
y_train.to_csv(f'{save_path}y_train.csv', index=False)
print(f'y_train saved to {save_path}y_train.csv')

# Save y_test
y_test.to_csv(f'{save_path}y_test.csv', index=False)
print(f'y_test saved to {save_path}y_test.csv')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
X_train saved to /content/drive/MyDrive/X_train.csv
X_test saved to /content/drive/MyDrive/X_test.csv
y_train saved to /content/drive/MyDrive/y_train.csv
y_test saved to /content/drive/MyDrive/y_test.csv
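Two details are worth knowing when these files are read back. Because they were written with `index=False`, the original row labels are not stored, and a saved Series comes back from `pd.read_csv` as a one-column DataFrame. The sketch below shows the round trip in a temporary directory instead of Google Drive. The toy values and the temporary path are assumptions for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy target (hypothetical values) with shuffled row labels, as after a split.
y_toy = pd.Series([1, 0, 1], name="Outcome", index=[10, 42, 7])

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "y_train.csv"
    # Same call shape as in the notebook: the index is not written.
    y_toy.to_csv(csv_path, index=False)
    loaded = pd.read_csv(csv_path)

# read_csv returns a DataFrame; select the column to get a Series again.
y_restored = loaded["Outcome"]

print(type(loaded).__name__, list(loaded.columns))
print("Index after reload:", list(y_restored.index))
```

Since `X` and `y` are saved in the same row order, the fresh 0-based index stays aligned between the feature and target files.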


## Save the StandardScaler to Google Drive

### Subtask:
Save the trained `StandardScaler` object to Google Drive for future use.

**Reasoning**:
Saving the `StandardScaler` object keeps preprocessing consistent when new data arrives or the model is deployed: the stored mean and standard deviation can be applied to new rows without refitting. Note that in this notebook the scaler was fit on the full dataset before the train/test split, so its statistics also reflect the test rows; fitting on the training set alone would avoid that leakage.

In [17]:
import joblib
import os

# Define the path for saving in Google Drive
save_path = '/content/drive/MyDrive/'

# Save the scaler object
scaler_filename = 'scaler.pkl'
joblib.dump(scaler, os.path.join(save_path, scaler_filename))

print(f'Scaler saved to {os.path.join(save_path, scaler_filename)}')


Scaler saved to /content/drive/MyDrive/scaler.pkl
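A saved scaler is reused by loading it with `joblib.load` and calling `.transform()` (not `.fit_transform()`) on new rows. The sketch below walks through that round trip in a temporary directory. The toy values, `toy_scaler`, and the temporary path are assumptions for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Fit a scaler on toy "training" values (hypothetical numbers).
train_values = np.array([[100.0], [120.0], [140.0]])
toy_scaler = StandardScaler().fit(train_values)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "scaler.pkl")
    joblib.dump(toy_scaler, path)      # save
    restored = joblib.load(path)       # load in a later session

# transform() reuses the stored mean and std; nothing is refitted.
new_values = np.array([[120.0], [160.0]])
scaled_new = restored.transform(new_values)

print("Stored mean:", restored.mean_)
print("Scaled new values:", scaled_new.ravel())
```

New data must contain the same columns, in the same order, as the data the scaler was fit on.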
