## **MLOps Assignment 2 - Group 42**

<b>Group members</b>
<ol>
    <li>ARUN KUMAR MOHANDAS – 2022AC05190</li>
    <li>DEBAYAN MITRA – 2022AC05222</li>
    <li>ILYAS MOHD – 2022AC05644</li>
    <li>PRITAM MONDAL – 2022AC05090</li>
    <li>RONIT MONDAL – 2022AA05142</li>
 </ol>

# **PROBLEM STATEMENT**

**Task 1**:

**Data Collection and Preprocessing (4 Marks):**

**• Task: Select an appropriate dataset and perform data preprocessing, including data cleaning, feature engineering, and scaling/normalization.**

**• Details: Explain the choices made during preprocessing and how they impact the model. Use KizenML or other tools for AutoEDA if applicable.**

# Task 1: Data Collection and Preprocessing
This notebook demonstrates data preprocessing steps on our diabetes dataset, including data cleaning, feature engineering, and scaling/normalization. These preprocessing steps improve the quality and consistency of the data, which is crucial for building robust machine learning models.

In [1]:
# Install ydata-profiling (formerly pandas-profiling) for automated EDA
!pip install -U ydata-profiling

Collecting ydata-profiling
  Downloading ydata_profiling-4.10.0-py2.py3-none-any.whl.metadata (20 kB)
Collecting visions<0.7.7,>=0.7.5 (from visions[type_image_path]<0.7.7,>=0.7.5->ydata-profiling)
  Downloading visions-0.7.6-py3-none-any.whl.metadata (11 kB)
Collecting htmlmin==0.1.12 (from ydata-profiling)
  Downloading htmlmin-0.1.12.tar.gz (19 kB)
  Preparing metadata (setup.py) ... done
Collecting phik<0.13,>=0.11.1 (from ydata-profiling)
  Downloading phik-0.12.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.6 kB)
Collecting multimethod<2,>=1.4 (from ydata-profiling)
  Downloading multimethod-1.12-py3-none-any.whl.metadata (9.6 kB)
Collecting imagehash==4.3.1 (from ydata-profiling)
  Downloading ImageHash-4.3.1-py2.py3-none-any.whl.metadata (8.0 kB)
Collecting dacite>=1.8 (from ydata-profiling)
  Downloading dacite-1.8.1-py3-none-any.whl.metadata (15 kB)
Collecting PyWavelets (from imagehash==4.3.1->ydata-profiling)
  Downloading pywavelets-1.

In [2]:
# Import necessary libraries
import numpy as np
import pandas as pd
from ydata_profiling import ProfileReport
from sklearn.preprocessing import StandardScaler


### Pima Indians Diabetes Dataset Overview

**Dataset Description**:  
The Pima Indians Diabetes dataset is a popular dataset for binary classification tasks. It is used to predict whether or not a patient has diabetes based on certain diagnostic measurements. This dataset was collected by the National Institute of Diabetes and Digestive and Kidney Diseases and contains data on female patients of Pima Indian heritage who are at least 21 years old.

The primary goal is to use various medical predictors (features) to predict the outcome, which indicates whether a patient has diabetes (`Outcome = 1`) or not (`Outcome = 0`).

**Number of Observations**:  
The dataset contains 768 instances (rows).

**Number of Features**:  
There are 8 input features (independent variables) and 1 target variable (dependent variable), which is the `Outcome`.

**Column Descriptions**:

- **Pregnancies**:  
  Number of times the patient has been pregnant.

- **Glucose**:  
  Plasma glucose concentration (measured after 2 hours in an oral glucose tolerance test).

- **BloodPressure**:  
  Diastolic blood pressure (mm Hg).

- **SkinThickness**:  
  Triceps skinfold thickness (mm).

- **Insulin**:  
  2-hour serum insulin (mu U/ml).

- **BMI**:  
  Body mass index (weight in kg/(height in m)^2), a measure of body fat based on height and weight.

- **DiabetesPedigreeFunction**:  
  A function that represents the likelihood of diabetes based on family history (genetic risk factor).

- **Age**:  
  Age of the patient (years).

- **Outcome**:  
  Class variable (0 or 1). `1` indicates the patient has diabetes, and `0` indicates the patient does not have diabetes.

---

### Objective

The objective is to preprocess the Pima Indians Diabetes dataset to make it suitable for building machine learning models. We will address issues such as missing values, outliers, feature scaling, class imbalance, and feature selection to ensure the dataset is ready for model training and evaluation.


## Data Loading
First, we load the dataset and explore its basic structure, including the columns, data types, and missing values. The target column is **Outcome**, which indicates whether the patient has diabetes (1) or not (0).

In [3]:
# Specify the file path to the dataset
dataset_path = 'diabetes.csv'

In [4]:
# Load the dataset into a pandas DataFrame
data = pd.read_csv(dataset_path)

In [5]:
# Display basic information and column types
data.info()  # prints the summary directly and returns None, so there is nothing to assign
data_head = data.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


## Dataset Information Summary

The Pima Indians Diabetes dataset consists of **768 entries (rows)** and **9 columns** (8 input features plus the target `Outcome`). Below is a summary of the key characteristics of the dataset:

- **Total Rows**: 768
- **Total Columns**: 9

The columns have the following data types:

- **7 columns** are of type `int64` (integer data).
- **2 columns** are of type `float64` (floating-point data).

**Column Breakdown**:
- **Pregnancies**: Integer, no missing values (768 non-null).
- **Glucose**: Integer, no missing values (768 non-null).
- **BloodPressure**: Integer, no missing values (768 non-null).
- **SkinThickness**: Integer, no missing values (768 non-null).
- **Insulin**: Integer, no missing values (768 non-null).
- **BMI**: Float, no missing values (768 non-null).
- **DiabetesPedigreeFunction**: Float, no missing values (768 non-null).
- **Age**: Integer, no missing values (768 non-null).
- **Outcome**: Integer, no missing values (768 non-null) – this is the target variable indicating whether the patient has diabetes.

**Data Types**:
- **Integer (int64)**: Features that contain whole numbers (e.g., Pregnancies, Glucose, BloodPressure, etc.).
- **Float (float64)**: Features that contain decimal values (e.g., BMI, DiabetesPedigreeFunction).

**Observation**:
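- The dataset has no explicit null values, as all columns have 768 non-null entries. Zeros in measurement columns can still stand in for missing readings (the first rows already show `Insulin` and `SkinThickness` values of 0).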
- The target variable, `Outcome`, is binary (0 or 1), indicating the presence or absence of diabetes.
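
A null check alone cannot detect placeholder zeros, so it can help to count zeros per column as well. The sketch below uses a small hypothetical frame (`sample`, with illustrative values rather than the real dataset) to show how the two checks differ.

```python
import pandas as pd

# Hypothetical miniature of the dataset; values are illustrative only.
sample = pd.DataFrame({
    "Glucose": [148, 85, 0, 89],
    "BloodPressure": [72, 66, 64, 0],
    "Insulin": [0, 0, 0, 94],
    "BMI": [33.6, 26.6, 23.3, 28.1],
})

# isnull() reports nothing, because the zeros are ordinary numbers.
explicit_nulls = sample.isnull().sum()

# Counting zeros per column exposes the placeholder values.
zero_counts = (sample == 0).sum()

print(explicit_nulls.to_dict())
print(zero_counts.to_dict())
```

Running the same zero count on the full `data` frame would show which columns need the zero-to-`NaN` treatment.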


In [6]:
# Display the first few rows to confirm the data has been loaded correctly
data_head

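|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |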


### Justification for Using `ydata_profiling` for Automated EDA

**Exploratory Data Analysis (EDA)** is a critical step in the data preprocessing workflow, as it helps in understanding the underlying structure, distribution, and patterns in the dataset. In this notebook, we use the `ydata_profiling` library (previously known as `pandas_profiling`) to automate and streamline the EDA process for the following reasons:

1. **Comprehensive and Detailed Reports**:  
   `ydata_profiling` generates an extensive summary of the dataset, including key statistics such as the number of missing values, distributions, correlations, outliers, and much more. This gives us an immediate and deep insight into the data with minimal manual effort.

2. **Efficiency and Time-Saving**:  
   Manually creating descriptive statistics, visualizations, and reports can be time-consuming. `ydata_profiling` automatically generates these detailed reports in a matter of seconds, allowing us to focus more on data cleaning and modeling tasks.

3. **Visualizations and Correlation Insights**:  
   The profiling report includes visualizations like histograms, correlation heatmaps, and interactions between features. These visual aids help in quickly identifying relationships between variables, potential outliers, and trends.

4. **Missing Values Handling**:  
   The report includes a detailed section on missing data, highlighting columns with missing values and their percentages. This is extremely useful for guiding the imputation or removal process during data preprocessing.

5. **Interactivity**:  
   The report can be rendered as an interactive HTML report that allows us to explore different sections of the analysis in detail. This makes it easy to review key insights without manually scrolling through long outputs.

6. **Ease of Use**:  
   `ydata_profiling` integrates seamlessly with pandas DataFrames, making it a natural extension of the Python data science workflow. By simply passing the dataset, we receive an insightful and interactive report with little configuration needed.

**Conclusion**:  
By using `ydata_profiling`, we can perform a thorough and efficient exploratory data analysis that significantly accelerates the data understanding process. This ensures that we are making informed decisions regarding data cleaning, preprocessing, and feature selection, which ultimately leads to more accurate model building.


## Creating the ydata-profiling Report

In [7]:
# Create a ydata-profiling report
diabetes_profile = ProfileReport(data, title="Pima Indians Diabetes Profiling Report", explorative=True)

In [8]:
# Display the report inline in the Colab notebook
diabetes_profile.to_notebook_iframe()


In [9]:
# Save the report as an HTML file for download
diabetes_profile.to_file("pima_indians_diabetes_profiling_report.html")


### Key Observations from the Profiling Report

1. **Missing/Zero Values**:  
   Several features contain zero values, which are invalid for physiological measurements like `Glucose`, `BloodPressure`, `SkinThickness`, `Insulin`, and `BMI`. These values need to be treated as missing data.

2. **Outliers**:  
   The dataset has outliers in features such as `Insulin`, `BMI`, and `SkinThickness`. These outliers could negatively impact the performance of machine learning models and should be handled.

3. **Imbalance in Target Variable**:  
   The target variable `Outcome` is imbalanced: of the 768 rows, 500 are non-diabetic (0) and 268 are diabetic (1), roughly 65% to 35%. This may affect model performance and generalization.

4. **Skewed Features**:  
   Features like `Insulin`, `SkinThickness`, and `BMI` show significant skewness, which could affect model performance and should be transformed.

5. **High Correlation**:  
   Features such as `Age`, `Glucose`, and `BMI` exhibit higher correlation with the target variable `Outcome`, while `Pregnancies` also shows some correlation. These features may be prioritized during feature selection.
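
The class-imbalance observation can be quantified directly. The sketch below rebuilds a target column with the 500/268 split (a stand-in for `data['Outcome']`, constructed here so the example runs without the CSV) and computes the class shares and the majority-to-minority ratio.

```python
import pandas as pd

# Stand-in for data['Outcome'] with the same 500 / 268 class counts.
outcome = pd.Series([0] * 500 + [1] * 268, name="Outcome")

counts = outcome.value_counts()
class_share = outcome.value_counts(normalize=True)

# Ratio of majority to minority class.
imbalance_ratio = counts.loc[0] / counts.loc[1]

print(class_share.round(3).to_dict())
print(round(imbalance_ratio, 2))
```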


# Data Preprocessing Steps

---

## **1. Handling Zero Values as Missing Data**

**Observation**:  
Several features, such as `Glucose`, `BloodPressure`, `SkinThickness`, `Insulin`, and `BMI`, contain zero values that are not realistic. These values should be treated as missing data (`NaN`).

**Action**:  
Replace zero values with `NaN` in the specified columns to represent missing data.

In [10]:
# Replace zero values with NaN for physiological measurements
columns_with_zeroes = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
data[columns_with_zeroes] = data[columns_with_zeroes].replace(0, np.nan)

# Check for missing values after replacement
data.isnull().sum()

Pregnancies                   0
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0


## **2. Imputing Missing Values**
**Observation**:  
After treating zero values as missing (`NaN`), five columns now contain missing values, ranging from 5 in `Glucose` to 374 in `Insulin` (about 49% of the 768 rows). To ensure completeness in the dataset, these missing values must be filled in.

**Action**:  
Impute the missing values using the median of each column, as it is robust to outliers.

In [11]:
from sklearn.impute import SimpleImputer

# Create an imputer with the strategy of filling missing values with the median
imputer = SimpleImputer(strategy='median')

# Impute missing values in the columns that had zeroes
data[columns_with_zeroes] = imputer.fit_transform(data[columns_with_zeroes])

# Verify that no missing values remain
print(data.isnull().sum())


Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64
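
To illustrate why the median is preferred over the mean here, the sketch below imputes a small hypothetical insulin-like series (values chosen for illustration) that contains one extreme reading. `Series.fillna` with the column median gives the same fill value that `SimpleImputer(strategy='median')` would compute for a single column.

```python
import numpy as np
import pandas as pd

# Illustrative column with one extreme value and two gaps.
insulin = pd.Series([94.0, 168.0, np.nan, 88.0, 846.0, np.nan])

mean_fill = insulin.mean()      # pulled upward by the 846 reading
median_fill = insulin.median()  # stays close to the bulk of the data

# Fill the gaps with the median, as the notebook does per column.
imputed = insulin.fillna(median_fill)

print(mean_fill, median_fill)
print(imputed.tolist())
```

One side effect worth keeping in mind: when a column has many gaps, every gap receives the same value, which concentrates the distribution around the median.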


## **3. Handling Outliers**

**Observation**:  
The dataset contains outliers in features like `Insulin`, `BMI`, and `SkinThickness`, which may negatively impact model performance if left untreated.

**Action**:  
Apply winsorization to cap extreme outlier values, limiting them within the 5th and 95th percentiles.


In [12]:
from scipy.stats import mstats

# Winsorize the data to cap outliers
data['Insulin'] = mstats.winsorize(data['Insulin'], limits=[0.05, 0.05])
data['BMI'] = mstats.winsorize(data['BMI'], limits=[0.05, 0.05])
data['SkinThickness'] = mstats.winsorize(data['SkinThickness'], limits=[0.05, 0.05])

# Check summary statistics after handling outliers
print(data[['Insulin', 'BMI', 'SkinThickness']].describe())

          Insulin         BMI  SkinThickness
count  768.000000  768.000000     768.000000
mean   133.968750   32.332422      28.927083
std     56.190856    6.207864       7.655040
min     50.000000   22.200000      14.000000
25%    121.500000   27.500000      25.000000
50%    125.000000   32.300000      29.000000
75%    127.250000   36.600000      32.000000
max    293.000000   44.500000      44.000000
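
The capping idea can also be sketched with plain pandas. The example below is an approximation of winsorization that uses `Series.quantile` and `Series.clip`; it is not identical to `mstats.winsorize`, which replaces a fixed count of extreme values with the nearest retained value, but it caps at the same 5th and 95th percentile boundaries. The data is synthetic.

```python
import numpy as np
import pandas as pd

# Synthetic column: the values 1..99 plus one extreme outlier.
values = pd.Series(np.arange(1, 101, dtype=float))
values.iloc[-1] = 10_000.0

# Percentile boundaries used for capping.
lower = values.quantile(0.05)
upper = values.quantile(0.95)

# Values outside [lower, upper] are pulled back to the boundary.
capped = values.clip(lower=lower, upper=upper)

print(values.max(), capped.max())
```

The row count is unchanged; only the extreme values move, which is why the `count` in the summary above stays at 768.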


## **4. Scaling the Features**
**Observation**:  
The dataset contains features with varying scales, such as `Age`, `BMI`, and `Glucose`. Unequal scales can let features with large numeric ranges dominate distance-based and gradient-based models.

**Action**:  
Standardize the six columns listed in `columns_to_scale` to zero mean and unit variance. `Pregnancies` and `DiabetesPedigreeFunction` are left on their original scales.


In [13]:
from sklearn.preprocessing import StandardScaler

# Initialize a scaler
scaler = StandardScaler()

# Apply scaling to the necessary columns
columns_to_scale = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'Age']
data[columns_to_scale] = scaler.fit_transform(data[columns_to_scale])

# Preview the standardized data
print(data[columns_to_scale].head())


    Glucose  BloodPressure  SkinThickness   Insulin       BMI       Age
0  0.866045      -0.031990       0.793840 -0.159716  0.204322  1.425995
1 -1.205066      -0.528319       0.009532 -0.159716 -0.924015 -0.190672
2  2.016662      -0.693761       0.009532 -0.159716 -1.455945 -0.105584
3 -1.073567      -0.528319      -0.774777 -0.711767 -0.682228 -1.041549
4  0.504422      -2.679076       0.793840  0.606031  1.735636 -0.020496
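
Standardization replaces each value x with z = (x - mean) / std. The sketch below applies that formula by hand with NumPy to the first five `Age` values from the preview earlier in the notebook. The resulting z-scores differ from the output above because the notebook's scaler is fitted on all 768 rows, not on these five.

```python
import numpy as np

# First five Age values from the data preview.
ages = np.array([50.0, 31.0, 32.0, 21.0, 33.0])

mean = ages.mean()
std = ages.std()  # population standard deviation (ddof=0), as StandardScaler uses

# z-score: centre on the mean, then divide by the standard deviation.
z = (ages - mean) / std

print(z.round(3))
```

After the transform the column has mean 0 and standard deviation 1, which is the property the scaled features above share.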


## **5. Handling Skewness**

**Observation**:  
Several features, such as `Insulin` and `BMI`, exhibit skewness, which could affect the performance of machine learning models.

**Action**:  
Apply a logarithmic transformation to reduce skewness in these features.


In [14]:
# Caution: Insulin and BMI were standardized in step 4, so many of their values are now <= 0.
# Clamp every non-positive value to a small positive number (0.00001) before log1p;
# this maps all below-mean values to the same constant.
data['Insulin'] = data['Insulin'].apply(lambda x: x if x > 0 else 0.00001)
data['BMI'] = data['BMI'].apply(lambda x: x if x > 0 else 0.00001)

# Apply log1p transformation
data['Insulin'] = np.log1p(data['Insulin'])
data['BMI'] = np.log1p(data['BMI'])

# Check the skewness after transformation
print(data[['Insulin', 'BMI']].skew())


Insulin    2.050728
BMI        1.052965
dtype: float64
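
The skewness values above remain high because the log transform ran on already standardized columns. A minimal sketch of the alternative ordering (log transform first, standardize second) is shown below on synthetic right-skewed data; the series `raw` and its lognormal parameters are assumptions for illustration, not the notebook's data.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed, strictly positive column (illustrative only).
rng = np.random.default_rng(0)
raw = pd.Series(rng.lognormal(mean=4.5, sigma=0.6, size=1000))

# Ordering used in the notebook: standardize first, then clamp for the log.
z_raw = (raw - raw.mean()) / raw.std(ddof=0)
share_clamped = (z_raw <= 0).mean()  # fraction that would be set to 0.00001

# Alternative ordering: log1p on the raw positive values, then standardize.
skew_raw = raw.skew()
logged = np.log1p(raw)
skew_logged = logged.skew()
standardized = (logged - logged.mean()) / logged.std(ddof=0)

print(round(share_clamped, 2), round(skew_raw, 2), round(skew_logged, 2))
```

With this ordering no values need clamping, since `log1p` is defined for every non-negative input.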


## **6. Handling Class Imbalance**
**Observation**:  
The target variable `Outcome` is imbalanced, with 500 non-diabetic cases and 268 diabetic cases, which could result in biased model predictions.

**Action**:  
Use SMOTE (Synthetic Minority Oversampling Technique) to balance the classes in the dataset.


In [15]:
from imblearn.over_sampling import SMOTE

# Separate features and target
X = data.drop(columns='Outcome')
y = data['Outcome']

# Apply SMOTE to balance the class distribution
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)

# Check the new class distribution after SMOTE
print(y_resampled.value_counts())


Outcome
1    500
0    500
Name: count, dtype: int64
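
To make the oversampling step less of a black box, the sketch below shows the core idea behind SMOTE: each synthetic point is an interpolation between a minority sample and one of its nearest minority neighbours. The function `smote_like` is a hypothetical, simplified illustration written for this example; it is not the `imblearn` implementation. In the notebook, balancing 268 diabetic cases against 500 non-diabetic cases means 232 synthetic rows are generated.

```python
import numpy as np

def smote_like(minority, n_new, k=3, seed=0):
    """Simplified SMOTE-style oversampling (illustrative, not imblearn's code)."""
    rng = np.random.default_rng(seed)

    # Pairwise Euclidean distances between minority samples.
    diffs = minority[:, None, :] - minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour

    # Indices of the k nearest minority neighbours of each sample.
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Pick a base sample and one of its neighbours for each new point.
    base_idx = rng.integers(0, len(minority), size=n_new)
    nb_idx = neighbours[base_idx, rng.integers(0, k, size=n_new)]

    # Interpolate a random fraction of the way towards the neighbour.
    gaps = rng.random((n_new, 1))
    return minority[base_idx] + gaps * (minority[nb_idx] - minority[base_idx])

minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0], [2.5, 2.5]])
synthetic = smote_like(minority, n_new=6)
print(synthetic.shape)
```

Because synthetic rows are built from existing rows, a common recommendation is to resample only the training split so that no information from held-out rows leaks into training.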


## **7. Feature Selection Based on Correlation**
**Observation**:  
We want to retain as many features as possible while focusing on those that have a reasonable correlation with the target variable `Outcome`.

**Action**:  
Calculate the correlation between all features and the target variable `Outcome`. Retain features with a correlation value greater than a specified threshold (e.g., 0.1).


In [16]:
# Calculate the correlation of each feature with the target variable 'Outcome'
correlation_matrix = data.corr()

# Select features with correlation to 'Outcome' greater than 0.1 (absolute value)
correlation_threshold = 0.1
correlated_features = correlation_matrix['Outcome'][abs(correlation_matrix['Outcome']) > correlation_threshold].index

# Keep only the selected correlated features in X_resampled
X_selected = X_resampled[correlated_features.drop('Outcome')]  # Drop 'Outcome' from the list of correlated features

# Verify the selected features
print(f"Selected features based on correlation threshold of {correlation_threshold}:")
print(X_selected.columns)

Selected features based on correlation threshold of 0.1:
Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age'],
      dtype='object')
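
The thresholding logic can be seen more clearly on a case where a feature is actually dropped. The sketch below uses a hypothetical three-column frame (`strong`, `noise`, and `Outcome`, all invented for illustration) and `DataFrame.corrwith` to compute each feature's correlation with the target before applying the same 0.1 absolute threshold.

```python
import pandas as pd

# Hypothetical frame: one informative feature, one uninformative feature.
frame = pd.DataFrame({
    "strong": [1, 2, 3, 4, 5, 6],
    "noise": [1, 2, 3, 3, 2, 1],
    "Outcome": [0, 0, 0, 1, 1, 1],
})

features = frame.drop(columns="Outcome")

# Pearson correlation of every feature with the target.
target_corr = features.corrwith(frame["Outcome"])

# Keep features whose absolute correlation exceeds the threshold.
threshold = 0.1
selected = target_corr[target_corr.abs() > threshold].index.tolist()

print(target_corr.round(3).to_dict())
print(selected)
```

Computing correlations on the feature columns only, as here, avoids having to drop `Outcome` from the selected index afterwards.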


## Saving Final Preprocessed Data

**Observation**:  
After selecting the relevant features and performing all necessary preprocessing, we need to save the final dataset, which includes the selected features and the target variable `Outcome`. This dataset will be used for the next steps in the modeling process.

**Action**:  
We will concatenate the preprocessed features (`X_selected`) with the target variable (`y_resampled`) and save this combined dataset into a CSV file named `preprocessed_diabetes_data.csv` for future use.


In [17]:
# Concatenate the selected features (X_selected) with the target variable (y_resampled)
final_preprocessed_data = pd.concat([X_selected, y_resampled], axis=1)

# Save the final preprocessed data to a CSV file
final_preprocessed_data.to_csv('preprocessed_diabetes_data.csv', index=False)

# Confirm the file is saved and check the first few rows
print("Preprocessed data saved successfully!")
print(final_preprocessed_data.head())


Preprocessed data saved successfully!
   Pregnancies   Glucose  BloodPressure  SkinThickness   Insulin       BMI  \
0            6  0.866045      -0.031990       0.793840  0.000010  0.185917   
1            1 -1.205066      -0.528319       0.009532  0.000010  0.000010   
2            8  2.016662      -0.693761       0.009532  0.000010  0.000010   
3            1 -1.073567      -0.528319      -0.774777  0.000010  0.000010   
4            0  0.504422      -2.679076       0.793840  0.473766  1.006364   

   DiabetesPedigreeFunction       Age  Outcome  
0                     0.627  1.425995        1  
1                     0.351 -0.190672        0  
2                     0.672 -0.105584        1  
3                     0.167 -1.041549        0  
4                     2.288 -0.020496        1  
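
A quick round-trip check can confirm that a saved file reads back with the same shape and columns. The sketch below writes a tiny illustrative frame to a temporary directory rather than touching `preprocessed_diabetes_data.csv`; the file name `preprocessed_example.csv` is a placeholder.

```python
import os
import tempfile

import pandas as pd

# Tiny illustrative frame standing in for final_preprocessed_data.
final = pd.DataFrame({"Glucose": [0.866045, -1.205066], "Outcome": [1, 0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "preprocessed_example.csv")

    # index=False keeps the row index out of the file, as in the notebook.
    final.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

print(reloaded.shape, list(reloaded.columns))
```

Without `index=False`, the reloaded frame would gain an extra unnamed index column.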


## Conclusion and Final Observations

We have successfully performed a complete data preprocessing workflow for the Pima Indians Diabetes dataset. Below is a summary of the key observations and actions taken:

1. **Handling Zero Values**:  
   Several features such as `Glucose`, `BloodPressure`, `SkinThickness`, `Insulin`, and `BMI` contained zero values, which are not realistic for physiological measurements.  
   **Action**: We replaced these zero values with `NaN` and then imputed the missing values using the median to maintain data integrity without skewing the results.

2. **Imputing Missing Data**:  
   After replacing zero values, we observed missing data in several features.  
   **Action**: Missing values were imputed using the median of each column to ensure robustness against outliers.

3. **Handling Outliers**:  
   The dataset contained outliers in features such as `Insulin`, `BMI`, and `SkinThickness`, which could negatively affect model performance.  
   **Action**: Winsorization was applied to cap extreme values and minimize their effect on the model.

4. **Scaling the Features**:  
   The features had varying scales, which could lead to biased model performance.  
   **Action**: We standardized six of the numeric columns using `StandardScaler` (zero mean, unit variance); `Pregnancies` and `DiabetesPedigreeFunction` were left on their original scales.

5. **Handling Skewness**:  
   Some features, including `Insulin` and `BMI`, exhibited skewness.  
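   **Action**: We applied a logarithmic transformation (`log1p`) to these features. Because the transform ran after standardization, non-positive values were first clamped to a small constant, and the skewness reported after the step remained high (2.05 for `Insulin`, 1.05 for `BMI`).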

6. **Handling Class Imbalance**:  
   The target variable (`Outcome`) was imbalanced, with more non-diabetic cases than diabetic cases, which could result in biased model predictions.  
   **Action**: SMOTE (Synthetic Minority Oversampling Technique) was used to balance the dataset, ensuring that the model would not be biased towards the majority class.

7. **Feature Selection**:  
   Using correlation analysis, we measured how strongly each feature correlates with the target variable; features such as `Glucose`, `BMI`, `Age`, and `Pregnancies` stood out.  
   **Action**: We applied an absolute-correlation threshold of 0.1. All eight features exceeded it, so none were dropped and the dimensionality is unchanged.

8. **Saving Preprocessed Data**:  
   After completing all preprocessing steps, the final dataset, containing the selected features and the target variable, was saved for future model training.  
   **Action**: We saved the preprocessed data to a CSV file (`preprocessed_diabetes_data.csv`) for use in subsequent stages of the machine learning pipeline.

---

### Final Remarks

The data is now prepared for the next steps in the machine learning process, including model training and evaluation. The key preprocessing steps (handling missing values, capping outliers, scaling features, transforming skewed features, balancing the classes, and selecting features by correlation) leave the dataset complete, balanced, and ready for building predictive models. This preprocessing provides the foundation for reliable predictions in the classification task.
