Q1. You are working on a machine learning project where you have a dataset containing numerical and
categorical features. You have identified that some of the features are highly correlated and there are
missing values in some of the columns. You want to build a pipeline that automates the feature
engineering process and handles the missing values.

Design a pipeline that includes the following steps:

1. Use an automated feature selection method to identify the important features in the dataset.
2. Create a numerical pipeline that includes the following steps:
3. Impute the missing values in the numerical columns using the mean of the column values.
4. Scale the numerical columns using standardisation.
5. Create a categorical pipeline that includes the following steps:
6. Impute the missing values in the categorical columns using the most frequent value of the column.
7. One-hot encode the categorical columns.
8. Combine the numerical and categorical pipelines using a ColumnTransformer.
9. Use a Random Forest Classifier to build the final model.
10. Evaluate the accuracy of the model on the test dataset.

Note: Your solution should include code snippets for each step of the pipeline, and a brief explanation of
each step. You should also provide an interpretation of the results and suggest possible improvements for
the pipeline.


The pipeline below imputes missing values, scales the numerical features, one-hot encodes the categorical features, selects the important features with a model-based selector, and trains a Random Forest classifier that is then evaluated on a held-out test set. Each step is shown with Python code using pandas and scikit-learn.


In [None]:

## Step 1: Load Necessary Libraries

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score

## Step 2: Load the Dataset

# Load your dataset
data = pd.read_csv('your_dataset.csv')

# Assuming 'target' is the label column
X = data.drop('target', axis=1)
y = data['target']

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## Step 3: Automated Feature Selection

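# SelectFromModel fits a Random Forest and keeps the features whose importance
# is at least the mean importance (the default threshold for this estimator).
# It runs after preprocessing (see Step 7), because the forest needs imputed, numeric input.
feature_selector = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=42))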

## Step 4: Create a Numerical Pipeline

# Pipeline for numerical features

numerical_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),  # Impute missing values with the mean
    ('scaler', StandardScaler())  # Standardize the data (mean=0, std=1)
])

## Step 5: Create a Categorical Pipeline

# Pipeline for categorical features
categorical_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),  # Impute missing values with the most frequent value
    ('onehot', OneHotEncoder(handle_unknown='ignore'))  # One-hot encode categorical features
])

## Step 6: Combine Pipelines Using ColumnTransformer
# Define which columns are numerical and which are categorical
# 'number' covers every integer and float dtype; 'category' covers pandas categoricals
numerical_features = X.select_dtypes(include='number').columns
categorical_features = X.select_dtypes(include=['object', 'category']).columns

# Combine numerical and categorical pipelines
preprocessor = ColumnTransformer(transformers=[
    ('num', numerical_pipeline, numerical_features),
    ('cat', categorical_pipeline, categorical_features)
])

## Step 7: Create the Final Pipeline
# Create the final pipeline
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),  # Preprocessing step
    ('feature_selection', feature_selector),  # Feature selection step
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))  # Random Forest classifier
])

## Step 8: Train the Model
# Train the model
pipeline.fit(X_train, y_train)

## Step 9: Evaluate the Model
# Predict and evaluate the model
y_pred = pipeline.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Model Accuracy: {accuracy:.2f}')


Q2. Build a pipeline that includes a random forest classifier and a logistic regression classifier, and then
use a voting classifier to combine their predictions. Train the pipeline on the iris dataset and evaluate its
accuracy.

In [3]:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize classifiers
random_forest = RandomForestClassifier(n_estimators=100, random_state=42)
logistic_regression = LogisticRegression(max_iter=1000, random_state=42)

# Create a voting classifier
voting_clf = VotingClassifier(estimators=[
    ('rf', random_forest),
    ('lr', logistic_regression)
], voting='hard')  # For majority voting

# For soft voting (averaging probabilities), use `voting='soft'`

# Create a pipeline with the VotingClassifier
pipeline = Pipeline(steps=[
    ('voting_classifier', voting_clf)
])

# Train the model
pipeline.fit(X_train, y_train)

# Make predictions on the test data
y_pred = pipeline.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Model Accuracy: {accuracy:.2f}')


Model Accuracy: 1.00
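
A perfect score here should be read with some caution: with `test_size=0.2` the test set holds only 30 of the 150 iris samples, so a single split says little about how stable the result is. In addition, with just two estimators, hard voting has to break a tie whenever the two models disagree. The following is a minimal sketch of one possible variant, not part of the original solution: it switches to soft voting, scales the inputs of the logistic regression, and reports a 5-fold cross-validated accuracy. The names `X_iris`, `y_iris`, `soft_voting_clf`, and `iris_cv_scores` are introduced only for this example.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_iris, y_iris = load_iris(return_X_y=True)

# Soft voting averages the predicted class probabilities of both models
soft_voting_clf = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
        # Scaling matters for logistic regression but not for the forest,
        # so it is applied inside this estimator's own pipeline
        ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
    ],
    voting="soft",
)

# Stratified folds keep the three classes balanced in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
iris_cv_scores = cross_val_score(soft_voting_clf, X_iris, y_iris, cv=cv, scoring="accuracy")
print(f"CV accuracy: {iris_cv_scores.mean():.2f} +/- {iris_cv_scores.std():.2f}")
```

The spread across folds gives a rough sense of how much the accuracy depends on which samples end up in the test set.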
