# Module70 Ensemble Techniques Ass5

Q1. You are working on a machine learning project where you have a dataset containing numerical and
categorical features.
You have identified that some of the features are highly correlated and there are
missing values in some of the columns.
You want to build a pipeline that automates the feature engineering process and handles the missing values.

### Design a pipeline that includes the following steps:

1.) Use an automated feature selection method to identify the important features in the dataset.

2.) Create a numerical pipeline that includes the following steps:

3.) Impute the missing values in the numerical columns using the mean of the column values.

4.) Scale the numerical columns using standardisation.

5.) Create a categorical pipeline that includes the following steps:

6.) Impute the missing values in the categorical columns using the most frequent value of the column.

7.) One-hot encode the categorical columns.

8.) Combine the numerical and categorical pipelines using a ColumnTransformer.

9.) Use a Random Forest Classifier to build the final model.

10.) Evaluate the accuracy of the model on the test dataset.

Note: Your solution should include code snippets for each step of the pipeline, and a brief explanation of
each step. You should also provide an interpretation of the results and suggest possible improvements for
the pipeline.



A1. Designing a Pipeline

In [3]:
# loading dataset

from google.colab import files

uploaded = files.upload()

Saving dataset.csv to dataset.csv


In [4]:
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
# Replace 'dataset.csv' with the path to your own dataset if it differs
# Ensure that your dataset has both numerical and categorical columns
df = pd.read_csv('dataset.csv')

# Separate features and target variable
X = df.drop("target", axis=1)  # Replace 'target' with the name of the target column
y = df["target"]

# Identify numerical and categorical columns
numerical_features = X.select_dtypes(include=["int64", "float64"]).columns
categorical_features = X.select_dtypes(include=["object", "category"]).columns

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Step 1: Automated Feature Selection
# Drop one feature from every pair of numerical features whose absolute
# correlation exceeds the threshold. The matrix is computed on the training
# split only, so the test split does not influence the selection.
threshold = 0.85  # Set correlation threshold
correlation_matrix = X_train[numerical_features].corr().abs()

# Keep only the strict upper triangle so that each pair is inspected once and
# the diagonal (self-correlation of 1.0) is ignored
upper_triangle = correlation_matrix.where(
    np.triu(np.ones(correlation_matrix.shape, dtype=bool), k=1)
)
correlated_features = [
    col for col in upper_triangle.columns if (upper_triangle[col] > threshold).any()
]

X_train = X_train.drop(columns=correlated_features)
X_test = X_test.drop(columns=correlated_features)
numerical_features = [col for col in numerical_features if col not in correlated_features]

# Step 2: Numerical Pipeline
numerical_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),  # Impute missing values with the mean
    ("scaler", StandardScaler())  # Standardize the data
])

# Step 3: Categorical Pipeline
categorical_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),  # Impute missing values with the mode
    ("onehot", OneHotEncoder(handle_unknown="ignore"))  # One-hot encode categorical variables
])

# Step 4: Column Transformer
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numerical_pipeline, numerical_features),
        ("cat", categorical_pipeline, categorical_features)
    ]
)

# Step 5: Random Forest Classifier Pipeline
model_pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),  # Preprocess the data
    ("classifier", RandomForestClassifier(n_estimators=100, random_state=42))  # Train Random Forest
])

# Train the pipeline
model_pipeline.fit(X_train, y_train)

# Evaluate the pipeline
y_pred = model_pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f"Model Accuracy: {accuracy:.2f}")


Model Accuracy: 0.84
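
The correlation filter from Step 1 can be checked in isolation before it is trusted on real data. The following is a minimal sketch, assuming a hypothetical helper `find_correlated_columns` and a made-up toy frame (neither is part of the original notebook). It inspects only the strict upper triangle of the absolute correlation matrix, so each pair is considered once and the diagonal is skipped.

```python
import numpy as np
import pandas as pd


def find_correlated_columns(frame: pd.DataFrame, threshold: float = 0.85) -> list:
    """Return columns that correlate above `threshold` with an earlier column.

    Hypothetical helper for illustration only.
    """
    corr = frame.corr().abs()
    # Strict upper triangle: each pair appears once, the diagonal is excluded
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    return [col for col in upper.columns if (upper[col] > threshold).any()]


# Made-up toy data: "b" is an exact multiple of "a", "c" is only weakly related
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})

to_drop = find_correlated_columns(toy)
print(to_drop)  # prints ['b']
```

Only the later column of a correlated pair is reported, so `a` is kept and `b` is dropped. Which member of a pair should be kept is a design choice; this sketch simply follows column order.
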


# Explanation of Steps

1.) **Automated Feature Selection:**

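Numerical features whose pairwise absolute correlation exceeds 0.85 on the training split are identified, and one feature from each such pair is dropped. This reduces multicollinearity and removes redundant inputs. The same columns are then dropped from the test split.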


2.) **Numerical Pipeline:**

**Imputation:** Handles missing values by replacing them with the column mean.

**Scaling:** Standardizes numerical features to have a mean of 0 and a standard deviation of 1.


3.) **Categorical Pipeline:**

**Imputation:** Handles missing values by replacing them with the most frequent value.

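**One-Hot Encoding:** Converts each category into its own binary (0/1) column. With `handle_unknown="ignore"`, a category that was not seen during training is encoded as all zeros instead of raising an error.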


4.) **ColumnTransformer:**

Combines the numerical and categorical pipelines into one preprocessing step.


5.) **Random Forest Classifier:**

A Random Forest model with 100 trees is trained on the preprocessed training data. On the held-out 30% test split, the pipeline reaches an accuracy of 0.84.


Q2. Build a pipeline that includes a random forest classifier and a logistic regression classifier, and then use a voting classifier to combine their predictions. Train the pipeline on the iris dataset and evaluate its accuracy.

A2.

In [5]:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler  # needed by the Logistic Regression pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Logistic Regression Pipeline
lr_pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),  # Scale the data
    ("lr", LogisticRegression(random_state=42))  # Logistic Regression
])

# Random Forest Pipeline
rf_pipeline = Pipeline(steps=[
    ("rf", RandomForestClassifier(n_estimators=100, random_state=42))  # Random Forest
])

# Voting Classifier
voting_clf = VotingClassifier(
    estimators=[
        ("logistic", lr_pipeline),
        ("random_forest", rf_pipeline)
    ],
    voting="hard"  # Use hard voting
)

# Train the Voting Classifier
voting_clf.fit(X_train, y_train)

# Evaluate the model
y_pred = voting_clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f"Voting Classifier Accuracy: {accuracy:.2f}")


Voting Classifier Accuracy: 1.00
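
A perfect score on a single 45-sample test split can be optimistic, so a cross-validated estimate is a useful sanity check. The following is a minimal sketch, assuming 5 stratified folds and the same base models as above. The fold count and the use of `make_pipeline` are illustrative choices, not part of the original solution.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

ensemble = VotingClassifier(
    estimators=[
        ("logistic", make_pipeline(StandardScaler(), LogisticRegression(random_state=42))),
        ("random_forest", RandomForestClassifier(n_estimators=100, random_state=42)),
    ],
    voting="hard",
)

# Stratified folds keep the three classes balanced in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(ensemble, X, y, cv=cv, scoring="accuracy")

print(f"Fold accuracies: {scores.round(2)}")
print(f"Mean accuracy: {scores.mean():.2f} (+/- {scores.std():.2f})")
```

Because the scaler sits inside the pipeline, it is refitted on the training part of each fold, so no statistics from the validation part leak into preprocessing.
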


# Explanation of Steps

1.) **Pipelines for Base Models:**

A Logistic Regression model and a Random Forest model are wrapped in separate pipelines.

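Logistic Regression includes scaling because its optimization is sensitive to feature magnitudes. Random Forest works on raw data because tree splits do not depend on the scale of a feature.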


2.) **Voting Classifier:**

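Combines the predictions of Logistic Regression and Random Forest using hard voting: each model casts one vote for a class and the majority wins. With only two models, any disagreement is a tie, which scikit-learn resolves in favor of the class that comes first in ascending label order.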

3.) **Training and Evaluation:**

The Voting Classifier is trained on the Iris dataset and evaluated using accuracy.
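
Hard voting is not the only option. Since both base models expose `predict_proba`, soft voting (averaging predicted class probabilities) is a possible alternative that avoids the two-model tie problem. The following is a minimal sketch comparing the two modes on the same split, assuming a hypothetical helper `build_estimators` that is not part of the original notebook.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def build_estimators():
    """Return fresh, unfitted base models (hypothetical helper for illustration)."""
    lr = Pipeline(steps=[
        ("scaler", StandardScaler()),
        ("lr", LogisticRegression(random_state=42)),
    ])
    rf = RandomForestClassifier(n_estimators=100, random_state=42)
    return [("logistic", lr), ("random_forest", rf)]


X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

results = {}
for mode in ("hard", "soft"):
    # "hard" counts class votes; "soft" averages predicted probabilities
    clf = VotingClassifier(estimators=build_estimators(), voting=mode)
    clf.fit(X_train, y_train)
    results[mode] = accuracy_score(y_test, clf.predict(X_test))

for mode, acc in results.items():
    print(f"{mode} voting accuracy: {acc:.2f}")
```

On a dataset as easy as Iris the two modes are likely to score similarly, so the comparison is more informative on harder problems or with three or more base models.
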
