# Q1

In [None]:
Q1. You are working on a machine learning project where you have a dataset containing numerical and
categorical features. You have identified that some of the features are highly correlated and there are
missing values in some of the columns. You want to build a pipeline that automates the feature
engineering process and handles the missing values.
Design a pipeline that includes the following steps:
- Use an automated feature selection method to identify the important features in the dataset.
- Create a numerical pipeline that includes the following steps:
  - Impute the missing values in the numerical columns using the mean of the column values.
  - Scale the numerical columns using standardisation.
- Create a categorical pipeline that includes the following steps:
  - Impute the missing values in the categorical columns using the most frequent value of the column.
  - One-hot encode the categorical columns.
- Combine the numerical and categorical pipelines using a ColumnTransformer.
- Use a Random Forest Classifier to build the final model.
- Evaluate the accuracy of the model on the test dataset.

Note: Your solution should include code snippets for each step of the pipeline, and a brief explanation of
each step. You should also provide an interpretation of the results and suggest possible improvements for
the pipeline.

Ans:-

To build a pipeline that automates feature engineering and handles missing values, we follow the steps listed in the question. The solution cell covers imputation, scaling, one-hot encoding, the ColumnTransformer, the Random Forest model, and the accuracy evaluation.

Automated Feature Selection:
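We can use feature selection techniques such as SelectKBest or SelectPercentile from scikit-learn to identify the important features in the dataset. SelectKBest scores each feature against the target with a univariate test (for classification, typically f_classif, the ANOVA F-test) and keeps the k highest-scoring features. The solution cell In [5] imports SelectKBest but does not add a selection step to the pipeline, so the reported accuracy is for a model trained on all features.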
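
As a minimal sketch of how SelectKBest behaves on its own, the snippet below applies it to a synthetic dataset. The dataset size, the variable names (`X_fs`, `y_fs`, `selector`) and the choice of `k=5` are assumptions made for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 10 continuous features, 5 of them informative (illustrative values)
X_fs, y_fs = make_classification(
    n_samples=200, n_features=10, n_informative=5, random_state=0
)

# Keep the 5 features with the highest ANOVA F-score against the target
selector = SelectKBest(score_func=f_classif, k=5)
X_selected = selector.fit_transform(X_fs, y_fs)

print("Shape before selection:", X_fs.shape)
print("Shape after selection:", X_selected.shape)
print("Indices of kept columns:", selector.get_support(indices=True))
```

Inside a Pipeline, the selector would normally sit after preprocessing and before the classifier, so that it is fitted on the training split only.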

In [5]:
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Create a synthetic dataset with 100 samples and 10 features
# (make_classification returns continuous features only, with no missing values)
X, y = make_classification(n_samples=100, n_features=10, n_informative=5, random_state=42)

# Assign column indices to the numerical and categorical sub-pipelines
numerical_features = [0, 1, 2, 3, 4]  # The first 5 features are treated as numerical
categorical_features = [5, 6, 7, 8, 9]  # The last 5 are treated as categorical for illustration, although they are continuous

# Numerical Pipeline
numerical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# Categorical Pipeline
categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(drop='first', handle_unknown='ignore'))
])

# Combine Numerical and Categorical Pipelines
preprocessor = ColumnTransformer([
    ('num', numerical_pipeline, numerical_features),
    ('cat', categorical_pipeline, categorical_features)
])

# Create the Random Forest Classifier
clf = RandomForestClassifier(random_state=42)

# Combine Preprocessor and Classifier in a Final Pipeline
final_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', clf)
])

# Split the raw data into training and test sets (70% training, 30% testing);
# preprocessing is fitted inside the pipeline on the training split only
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train the model on the training data
final_pipeline.fit(X_train, y_train)

# Predict on the test data
y_pred = final_pipeline.predict(X_test)

# Evaluate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)


Accuracy: 0.5333333333333333
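
An accuracy of about 0.53 on a roughly balanced two-class problem is close to chance. A likely cause is that columns 5 to 9 are continuous, so one-hot encoding treats almost every distinct value as its own category, and values not seen during training are encoded as all zeros. The sketch below shows one possible improvement rather than the original solution: it bins those columns into three categories, injects missing values at random, and adds SelectKBest after preprocessing. The column names, the bin labels, the 10% missing rate, the sample size and `k=8` are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

X_raw, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=42
)

# Hypothetical column names: 5 numerical columns, 5 binned into categories
num_cols = [f"num_{i}" for i in range(5)]
cat_cols = [f"cat_{i}" for i in range(5)]
df = pd.DataFrame(X_raw[:, :5], columns=num_cols)
labels = np.array(["low", "mid", "high"], dtype=object)
for i, col in enumerate(cat_cols):
    values = X_raw[:, 5 + i]
    edges = np.quantile(values, [1 / 3, 2 / 3])  # tertile boundaries
    df[col] = labels[np.digitize(values, edges)]

# Inject roughly 10% missing values into every column
rng = np.random.default_rng(0)
for col in df.columns:
    df.loc[rng.random(len(df)) < 0.10, col] = np.nan

preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("imputer", SimpleImputer(strategy="mean")),
        ("scaler", StandardScaler()),
    ]), num_cols),
    ("cat", Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("encoder", OneHotEncoder(handle_unknown="ignore")),
    ]), cat_cols),
])

model = Pipeline([
    ("preprocessor", preprocess),
    ("selector", SelectKBest(score_func=f_classif, k=8)),  # keep 8 of the 20 encoded columns
    ("classifier", RandomForestClassifier(random_state=42)),
])

X_tr, X_te, y_tr, y_te = train_test_split(
    df, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)
model.fit(X_tr, y_tr)
demo_accuracy = accuracy_score(y_te, model.predict(X_te))
print("Accuracy:", demo_accuracy)
```

Further improvements worth trying include tuning `k` and the Random Forest hyperparameters with GridSearchCV, and replacing the single train/test split with cross-validation.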


# Q2

In [None]:
Q2. Build a pipeline that includes a random forest classifier and a logistic regression classifier, and then
use a voting classifier to combine their predictions. Train the pipeline on the iris dataset and evaluate its
accuracy.

In [6]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import VotingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the data into training and testing sets (70% training, 30% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create the Random Forest Classifier
rf_clf = RandomForestClassifier(random_state=42)

# Create the Logistic Regression Classifier
lr_clf = LogisticRegression(random_state=42)

# Create the Voting Classifier
voting_clf = VotingClassifier(
    estimators=[('rf', rf_clf), ('lr', lr_clf)],
    voting='hard'  # Use 'hard' voting, where the majority class wins
)

# Build the pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),  # Optionally scale the features
    ('voting', voting_clf)
])

# Train the pipeline on the training data
pipeline.fit(X_train, y_train)

# Predict on the test data
y_pred = pipeline.predict(X_test)

# Evaluate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 1.0


The cell above builds a pipeline that includes both a Random Forest Classifier and a Logistic Regression Classifier and combines their predictions with a Voting Classifier, using scikit-learn's Pipeline and VotingClassifier. The Iris dataset is a well-known dataset that ships with scikit-learn, so it can be loaded directly.

We first load the Iris dataset using load_iris() from scikit-learn. We then split the data into training and testing sets. Next, we create the Random Forest Classifier (rf_clf) and the Logistic Regression Classifier (lr_clf). The Voting Classifier (voting_clf) combines these two classifiers using hard voting: each classifier casts one vote per sample and the class with the most votes is chosen. With only two classifiers, any disagreement is a tie, which scikit-learn resolves in favour of the class that comes first in ascending label order.
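
To make the hard-voting rule concrete, here is a small NumPy sketch. It mirrors the documented behaviour rather than reproducing scikit-learn's internal code, and the three sets of votes and the helper `hard_vote` are hypothetical.

```python
import numpy as np

# Hypothetical predictions: one row per classifier, one column per sample
votes = np.array([
    [0, 1, 2, 1],  # classifier A
    [0, 2, 2, 1],  # classifier B
    [1, 1, 2, 0],  # classifier C
])

def hard_vote(labels):
    """Return the most frequent label; ties go to the lowest label."""
    return int(np.argmax(np.bincount(labels)))

majority = [hard_vote(votes[:, j]) for j in range(votes.shape[1])]
print(majority)  # [0, 1, 2, 1]

# Two classifiers that disagree: the tie falls to the lower label
print(hard_vote(np.array([2, 1])))  # 1
```

Setting `voting='soft'` instead averages the predicted class probabilities, which avoids this kind of tie when only two classifiers are combined.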

We then create a pipeline (pipeline) that applies StandardScaler before the Voting Classifier. Scaling is optional for the Random Forest, but it generally helps Logistic Regression converge.

Finally, we train the pipeline on the training data and evaluate its accuracy on the test data.

Note that the Iris dataset is small: with a 30% split, the test set holds only 45 samples, so a perfect accuracy of 1.0 should not be read as a guarantee of how the model generalises. Larger datasets give more reliable evaluations. To apply the same pipeline to your own data, replace the Iris dataset using pd.read_csv() or another loading method suited to your data format.
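
One way to get a steadier estimate on a small dataset is k-fold cross-validation. The sketch below rebuilds the same scaler-plus-voting pipeline and scores it with 5 stratified folds. The fold count, the shuffling seed and the variable names (`cv_pipeline`, `scores`) are assumptions, not part of the original solution.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_iris, y_iris = load_iris(return_X_y=True)

# Same structure as the pipeline above: scaling followed by hard voting
cv_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("voting", VotingClassifier(
        estimators=[
            ("rf", RandomForestClassifier(random_state=42)),
            ("lr", LogisticRegression(random_state=42)),
        ],
        voting="hard",
    )),
])

# Each fold refits the scaler and both classifiers on its own training part
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(cv_pipeline, X_iris, y_iris, cv=cv, scoring="accuracy")

print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```

The spread of the fold accuracies gives a rough sense of how much a single train/test split can vary.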