Q1. You are working on a machine learning project where you have a dataset containing numerical and
categorical features. You have identified that some of the features are highly correlated and there are
missing values in some of the columns. You want to build a pipeline that automates the feature
engineering process and handles the missing values.
Design a pipeline that includes the following steps:
- Use an automated feature selection method to identify the important features in the dataset.
- Create a numerical pipeline that includes the following steps:
  - Impute the missing values in the numerical columns using the mean of the column values.
  - Scale the numerical columns using standardisation.
- Create a categorical pipeline that includes the following steps:
  - Impute the missing values in the categorical columns using the most frequent value of the column.
  - One-hot encode the categorical columns.
- Combine the numerical and categorical pipelines using a ColumnTransformer.
- Use a Random Forest Classifier to build the final model.
- Evaluate the accuracy of the model on the test dataset.
Note: Your solution should include code snippets for each step of the pipeline, and a brief explanation of
each step. You should also provide an interpretation of the results and suggest possible improvements for
the pipeline.

The dataset has the following attributes:

1. gender - The biological sex of the individual, which can affect susceptibility to diabetes. There are three categories: Male, Female and Other.

2. age - Age is an important factor, as diabetes is more commonly diagnosed in older adults. Age ranges from 0 to 80 in our dataset.

3. hypertension - Hypertension is a medical condition in which the blood pressure in the arteries is persistently elevated. The value is 0 or 1, where 0 means the person does not have hypertension and 1 means they do.

4. heart_disease - Heart disease is another medical condition associated with an increased risk of developing diabetes. The value is 0 or 1, where 0 means the person does not have heart disease and 1 means they do.

5. smoking_history - Smoking history is also considered a risk factor for diabetes and can exacerbate its complications. The dataset lists six categories: not current, former, No Info, current, never and ever.

6. bmi - BMI (Body Mass Index) is a measure of body fat based on weight and height. Higher BMI values are linked to a higher risk of diabetes. BMI in the dataset ranges from 10.16 to 71.55. A BMI below 18.5 is underweight, 18.5-24.9 is normal, 25-29.9 is overweight, and 30 or more is obese.

7. HbA1c_level - HbA1c (Hemoglobin A1c) level is a measure of a person's average blood sugar level over the past 2-3 months. Higher levels indicate a greater risk of developing diabetes. An HbA1c level above 6.5% usually indicates diabetes.

8. blood_glucose_level - Blood glucose level is the amount of glucose in the bloodstream at a given time. High blood glucose levels are a key indicator of diabetes.

9. diabetes - Diabetes is the target variable being predicted, with 1 indicating the presence of diabetes and 0 indicating its absence.

Dataset used: https://www.kaggle.com/datasets/iammustafatz/diabetes-prediction-dataset

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from warnings import filterwarnings
filterwarnings("ignore")

from sklearn.feature_selection import SelectKBest, mutual_info_classif  # Feature selection
from sklearn.impute import SimpleImputer  # Handling missing values
from sklearn.preprocessing import OneHotEncoder, StandardScaler  # Encoding categorical features & feature scaling
from sklearn.pipeline import Pipeline  # Chaining the steps into a single estimator
from sklearn.compose import ColumnTransformer  # Combining the numerical & categorical pipelines
from sklearn.ensemble import RandomForestClassifier  # Model for our output prediction

In [3]:
df = pd.read_csv("diabetes_prediction_dataset.csv")
df.head()

   gender   age  hypertension  heart_disease  smoking_history    bmi  HbA1c_level  blood_glucose_level  diabetes
0  Female  80.0             0              1            never  25.19          6.6                  140         0
1  Female  54.0             0              0          No Info  27.32          6.6                   80         0
2    Male  28.0             0              0            never  27.32          5.7                  158         0
3  Female  36.0             0              0          current  23.45          5.0                  155         0
4    Male  76.0             1              1          current  20.14          4.8                  155         0


In [5]:
df.shape

(100000, 9)

In [7]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 9 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   gender               100000 non-null  object 
 1   age                  100000 non-null  float64
 2   hypertension         100000 non-null  int64  
 3   heart_disease        100000 non-null  int64  
 4   smoking_history      100000 non-null  object 
 5   bmi                  100000 non-null  float64
 6   HbA1c_level          100000 non-null  float64
 7   blood_glucose_level  100000 non-null  int64  
 8   diabetes             100000 non-null  int64  
dtypes: float64(3), int64(4), object(2)
memory usage: 6.9+ MB


In [9]:
df.isnull().sum()

gender                 0
age                    0
hypertension           0
heart_disease          0
smoking_history        0
bmi                    0
HbA1c_level            0
blood_glucose_level    0
diabetes               0
dtype: int64
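
Since this copy of the dataset has no missing values, the imputation steps in the pipeline below act as a no-op here. To see what they would do on incomplete data, here is a minimal sketch on a toy frame; the column values are hypothetical and not taken from the diabetes dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with one missing value per column (hypothetical values)
toy = pd.DataFrame({
    "bmi": [20.0, np.nan, 30.0],
    "smoking_history": ["never", np.nan, "never"],
})

# Mean imputation for the numerical column
num_imputer = SimpleImputer(strategy="mean")
bmi_filled = num_imputer.fit_transform(toy[["bmi"]])

# Most-frequent imputation for the categorical column
cat_imputer = SimpleImputer(strategy="most_frequent")
smoking_filled = cat_imputer.fit_transform(toy[["smoking_history"]])

print(bmi_filled.ravel())      # missing bmi replaced by the mean of 20 and 30
print(smoking_filled.ravel())  # missing category replaced by "never"
```

The same two strategies are what the numerical and categorical pipelines use, so the pipeline should keep working if a later version of the data does contain gaps.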

In [11]:
#Now separate the dependent and independent variables
X = df.drop('diabetes', axis=1)
y = df['diabetes']

In [13]:
X.shape , y.shape


((100000, 8), (100000,))

In [15]:
cat_cols = list(X.select_dtypes(include='object').columns)
num_cols = list(X.select_dtypes(exclude='object').columns)

In [17]:
#Displaying the list of categorical columns in our dataset
cat_cols

['gender', 'smoking_history']

In [19]:
#Displaying the list of numerical columns in our dataset
num_cols

['age',
 'hypertension',
 'heart_disease',
 'bmi',
 'HbA1c_level',
 'blood_glucose_level']
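
The question mentions highly correlated features, but the notebook does not show how they were identified. One possible check is sketched below on synthetic data: compute the absolute correlation matrix and list the pairs above a threshold. The column names, the data and the 0.9 threshold are all assumptions made for illustration; on the real data the same idea could be applied to `X[num_cols]`.

```python
import numpy as np
import pandas as pd

# Synthetic data: "a_copy" is almost a linear copy of "a", "b" is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "a_copy": a * 2.0 + rng.normal(scale=0.01, size=200),
    "b": rng.normal(size=200),
})

# Absolute Pearson correlation between every pair of columns
corr = toy.corr().abs()

# Keep only the upper triangle so each pair is reported once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.9  # assumed cut-off, adjust to taste
high_pairs = [
    (row, col)
    for row in upper.index
    for col in upper.columns
    if upper.loc[row, col] > threshold
]
print(high_pairs)
```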

In [23]:
#Numerical pipeline: impute missing values with the column mean, then standardise
num_pipeline = Pipeline(
    steps=[
        ('imputer',SimpleImputer(strategy="mean")),
        ('scaler',StandardScaler())
    ]
)

#Categorical pipeline: impute missing values with the most frequent value, then one-hot encode
cat_pipeline = Pipeline(
    steps=[
        ('imputer',SimpleImputer(strategy="most_frequent")),
        ('encoder',OneHotEncoder(handle_unknown="ignore")) #Ignore categories not seen during fit
    ]
)

In [25]:
#Combining numerical and categorical pipeline
preprocessor = ColumnTransformer([
    ('num_pipeline',num_pipeline,num_cols),
    ('cat_pipeline',cat_pipeline,cat_cols)
])

In [27]:
# Full pipeline: preprocessing, then feature selection, then the classifier
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('feature_selector', SelectKBest(score_func=mutual_info_classif, k=7)), #Keep the 7 best-scoring columns of the preprocessed (scaled and one-hot encoded) matrix
    ('classifier', RandomForestClassifier())
])
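
`SelectKBest` scores each column against the target and keeps the `k` highest-scoring ones. The sketch below shows how to inspect which columns survive, using synthetic data from `make_classification` rather than the diabetes dataset. The seed passed through `functools.partial` is an assumption added so that the mutual information estimate is reproducible.

```python
from functools import partial

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic problem: 10 features, only 3 of them informative
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=3, n_redundant=0, random_state=0
)

# mutual_info_classif uses a randomised estimator, so fix its seed
score_func = partial(mutual_info_classif, random_state=0)

selector = SelectKBest(score_func=score_func, k=3)
X_selected = selector.fit_transform(X_demo, y_demo)

# Boolean mask of kept columns, converted to column indices
selected_idx = np.flatnonzero(selector.get_support())

print(X_selected.shape)
print(selected_idx)
print(selector.scores_.round(3))
```

On the fitted diabetes pipeline, the same mask should be reachable through `pipeline.named_steps['feature_selector'].get_support()`, and it can be lined up with `pipeline.named_steps['preprocessor'].get_feature_names_out()` to read off the selected column names.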

In [29]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

X_train.shape , X_test.shape

((75000, 8), (25000, 8))

In [31]:
y_train.shape , y_test.shape

((75000,), (25000,))

In [33]:
pipeline

In [35]:
# Fit the pipeline on the training data
pipeline.fit(X_train, y_train)

# Make predictions on the test data
y_pred = pipeline.predict(X_test)

# Evaluate the accuracy of the model
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")

Accuracy: 0.96864
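
Accuracy on its own can look strong when one class dominates. The sketch below uses synthetic labels with a rare positive class (the 9% rate is an assumption for illustration, not a figure from this dataset) and shows that a classifier which always predicts the majority class still gets high accuracy while finding no positives.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Synthetic labels: 90 positives out of 1000 (hypothetical class balance)
y_true = np.zeros(1000, dtype=int)
y_true[:90] = 1
X_dummy = np.zeros((1000, 1))  # features are irrelevant for this baseline

# Baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_true)
y_base = baseline.predict(X_dummy)

base_accuracy = accuracy_score(y_true, y_base)
base_recall = recall_score(y_true, y_base)
base_cm = confusion_matrix(y_true, y_base)

print(base_accuracy)  # 0.91
print(base_recall)    # 0.0
print(base_cm)
```

It may therefore be worth checking `y.value_counts()` and reporting the confusion matrix or recall for the positive class next to the accuracy above. Passing `stratify=y` to `train_test_split` and setting `random_state` on the classifier are two further changes that would make the reported score easier to reproduce.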


Q2. Build a pipeline that includes a random forest classifier and a logistic regression classifier, and then
use a voting classifier to combine their predictions. Train the pipeline on the iris dataset and evaluate its
accuracy.

In [39]:
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)


In [41]:
X.shape , y.shape

((150, 4), (150,))

In [71]:
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression


In [73]:
rfc = RandomForestClassifier()
lr = LogisticRegression()


In [75]:
vc = VotingClassifier(estimators=[('random_forest_classifier', rfc),
                                  ('logistic_regression', lr)],
                                  voting='soft')
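
With `voting='soft'`, the ensemble averages the class probabilities of its base models and predicts the class with the highest average. The sketch below checks this by hand on the iris data. The short estimator names, the `random_state` and the `max_iter` value are assumptions added to keep the example reproducible and free of convergence warnings.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

X_iris, y_iris = load_iris(return_X_y=True)

soft_vc = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(random_state=0)),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    voting="soft",
)
soft_vc.fit(X_iris, y_iris)

# Average the fitted base models' probabilities by hand
manual_proba = np.mean(
    [est.predict_proba(X_iris) for est in soft_vc.estimators_], axis=0
)
# Pick the class with the highest averaged probability
manual_pred = soft_vc.classes_[np.argmax(manual_proba, axis=1)]

proba_matches = np.allclose(manual_proba, soft_vc.predict_proba(X_iris))
pred_matches = np.array_equal(manual_pred, soft_vc.predict(X_iris))
print(proba_matches, pred_matches)
```

With `voting='hard'` the ensemble would instead take a majority vote over the predicted labels, which is less informative when only two models are combined because ties are possible.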


In [55]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [57]:
X_train.shape , X_test.shape

((112, 4), (38, 4))

In [59]:
y_train.shape , y_test.shape

((112,), (38,))

In [77]:
vc


In [79]:
# Fit the Voting Classifier on the training data 
vc.fit(X_train, y_train)

# Make predictions on the test data
y_pred = vc.predict(X_test)

# Evaluate the accuracy of the model
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")

Accuracy: 1.0
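
A perfect score on 38 test samples is a fairly noisy estimate. The question also asks for a pipeline, so one possible variant is sketched below: the voting classifier is wrapped in a `Pipeline` with a scaling step and evaluated with 5-fold stratified cross-validation. The scaler, the step names, the seeds and the number of folds are assumptions, not part of the original solution.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_iris, y_iris = load_iris(return_X_y=True)

# Scaling mainly helps the logistic regression; the random forest is unaffected
voting_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("voter", VotingClassifier(
        estimators=[
            ("rf", RandomForestClassifier(random_state=42)),
            ("lr", LogisticRegression(max_iter=1000)),
        ],
        voting="soft",
    )),
])

# Stratified folds keep the three classes balanced in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(voting_pipeline, X_iris, y_iris, cv=cv, scoring="accuracy")

print(scores)
print(scores.mean(), scores.std())
```

Reporting the mean and spread across folds gives a more stable picture than a single split, and the scaler is refit inside each fold so no information leaks from the held-out data.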
