## Q1. You are working on a machine learning project where you have a dataset containing numerical and categorical features. You have identified that some of the features are highly correlated and that some of the columns have missing values. You want to build a pipeline that automates the feature engineering process and handles the missing values.

Design a pipeline that includes the following steps:

- Use an automated feature selection method to identify the important features in the dataset.
- Create a numerical pipeline that includes the following steps:
  - Impute the missing values in the numerical columns using the mean of the column values.
  - Scale the numerical columns using StandardScaler.
- Create a categorical pipeline that includes the following steps:
  - Impute the missing values in the categorical columns using the most frequent value of the column.
  - One-hot encode the categorical columns.
- Combine the numerical and categorical pipelines using a ColumnTransformer.
- Use a Random Forest Classifier to build the final model.
- Evaluate the accuracy of the model on the test dataset.

Note: Your solution should include code snippets for each step of the pipeline and a brief explanation of each step. You should also provide an interpretation of the results and suggest possible improvements for the pipeline.

In [None]:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# load data
data = pd.read_csv('data.csv')     ## Defined only for the purpose of code snippet
X = data.drop('target', axis=1)
y = data['target']

# hold out a test set so that accuracy is measured on unseen data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# identify numeric and categorical columns by dtype
numeric_features = X.select_dtypes(include='number').columns
categorical_features = X.select_dtypes(exclude='number').columns

# define numeric pipeline
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# define categorical pipeline
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# combine numeric and categorical pipelines
preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])

# automated feature selection, placed after preprocessing because the
# selector's Random Forest cannot be fitted on raw string categories
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=42))

# chain preprocessing, feature selection and the classifier
pipe = Pipeline(steps=[('preprocessor', preprocessor),
                       ('selector', selector),
                       ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))])

# fit on the training data only, then predict on the test data
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)

# evaluate the model on the test set
acc = accuracy_score(y_test, y_pred)
print('Accuracy:', acc)

## Insights:
This solution first splits the data into training and test sets and then builds a single pipeline. A ColumnTransformer applies different preprocessing pipelines to the numerical and categorical columns, which are identified by their dtype.

For the numerical part, missing values are imputed with the mean of the column values (SimpleImputer), and then the values are scaled to have zero mean and unit variance (StandardScaler). For the categorical part, missing values are imputed with the most frequent value of the column (SimpleImputer), and then the categorical variables are one-hot encoded to represent each level as a binary indicator variable (OneHotEncoder).

An automated feature selection step (SelectFromModel) then keeps the features whose importance scores from a Random Forest model are at least the mean importance, which is the default threshold. It runs after preprocessing because the Random Forest behind the selector cannot be fitted on raw string categories, so it selects among the scaled numerical columns and the one-hot indicator columns.

The selected features are then fed into a Random Forest classifier (RandomForestClassifier) to build the model. Because every step is fitted on the training data only, the accuracy computed from the test-set target values and the predicted values estimates how the model performs on unseen data.
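The numerical and categorical preprocessing described above can be checked on a tiny example. The sketch below is illustrative only: the `toy` frame and its `age` and `city` columns are made-up stand-ins, not columns from the actual dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one numeric and one categorical column, each with a gap
toy = pd.DataFrame({
    'age': [20.0, 30.0, np.nan, 40.0],
    'city': ['Paris', 'Rome', 'Paris', np.nan],
})

numeric = Pipeline(steps=[('imputer', SimpleImputer(strategy='mean')),
                          ('scaler', StandardScaler())])
categorical = Pipeline(steps=[('imputer', SimpleImputer(strategy='most_frequent')),
                              ('onehot', OneHotEncoder(handle_unknown='ignore'))])

toy_preprocessor = ColumnTransformer(transformers=[
    ('num', numeric, ['age']),
    ('cat', categorical, ['city']),
])

transformed = toy_preprocessor.fit_transform(toy)
# ColumnTransformer may return a sparse matrix; convert to a dense array for display
if hasattr(transformed, 'toarray'):
    transformed = transformed.toarray()
transformed = np.asarray(transformed, dtype=float)

# Column 0: scaled age (the missing age is imputed with the mean, 30, so it scales to 0)
# Columns 1-2: one-hot indicators for Paris and Rome (the missing city becomes Paris)
print(transformed.shape)
print(transformed)
```

The missing age lands exactly on the column mean, so it becomes 0 after scaling, and the missing city is filled with the most frequent value before encoding.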

Possible improvements to this pipeline include trying different feature selection methods or hyperparameters for the preprocessing and modeling steps, using different imputation strategies or scaling methods, and testing different classifiers or ensemble methods. Additionally, domain knowledge and data exploration may suggest other ways to engineer the features or handle missing values.
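One way to try different hyperparameters and imputation strategies systematically is a cross-validated grid search over the whole pipeline. The following is a minimal sketch under stated assumptions: the `demo` frame is synthetic data standing in for data.csv, the column names `x1`, `x2` and `color` are hypothetical, and the grid values are arbitrary starting points rather than recommendations.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for data.csv (hypothetical columns, for illustration only)
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    'x1': rng.normal(size=n),
    'x2': rng.normal(size=n),
    'color': rng.choice(['red', 'green', 'blue'], size=n),
})
demo_y = ((demo['x1'] + (demo['color'] == 'red')) > 0.5).astype(int)
# Inject missing values after the target has been computed
demo.loc[rng.choice(n, size=20, replace=False), 'x1'] = np.nan

preprocess = ColumnTransformer(transformers=[
    ('num', Pipeline(steps=[('imputer', SimpleImputer()),
                            ('scaler', StandardScaler())]), ['x1', 'x2']),
    ('cat', Pipeline(steps=[('imputer', SimpleImputer(strategy='most_frequent')),
                            ('onehot', OneHotEncoder(handle_unknown='ignore'))]), ['color']),
])
model = Pipeline(steps=[('preprocessor', preprocess),
                        ('classifier', RandomForestClassifier(random_state=42))])

# Nested parameter names follow the step names: <step>__<substep>__<parameter>
param_grid = {
    'preprocessor__num__imputer__strategy': ['mean', 'median'],
    'classifier__n_estimators': [50, 100],
    'classifier__max_depth': [None, 5],
}

search = GridSearchCV(model, param_grid, cv=5, scoring='accuracy')
search.fit(demo, demo_y)

print('Best parameters:', search.best_params_)
print('Best CV accuracy:', round(search.best_score_, 3))
```

Because the preprocessing lives inside the pipeline, each cross-validation fold re-fits the imputers and the scaler on its own training portion, which avoids leaking information from the validation fold.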

## Q2. Build a pipeline that includes a random forest classifier and a logistic regression classifier, and then use a voting classifier to combine their predictions. Train the pipeline on the Iris dataset and evaluate its accuracy.

In [18]:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Load the iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# All four Iris features are numeric, so preprocessing is just standardization
# (there are no categorical columns to one-hot encode)
preprocessor = StandardScaler()

# Define the random forest classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)

# Define the logistic regression classifier
lr = LogisticRegression(random_state=42)

# Define the voting classifier
voting_clf = VotingClassifier(estimators=[('rf', rf), ('lr', lr)], voting='hard')

# Define the pipeline
pipe = Pipeline(steps=[('preprocessor', preprocessor),
                       ('voting_clf', voting_clf)])

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the pipeline on the training data
pipe.fit(X_train, y_train)

# Evaluate the accuracy of the pipeline on the testing data
y_pred = pipe.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(f"Accuracy: {acc:.2f}")

Accuracy: 1.00
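
A perfect score here comes from a test split of only 30 samples (20% of the 150 Iris rows), so it may overstate how well the ensemble generalizes. As a rough check, the same kind of pipeline can be scored with cross-validation. This is a sketch, not part of the original solution: the 5-fold stratified setup and the names `cv_pipe`, `X_iris` and `y_iris` are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_iris, y_iris = load_iris(return_X_y=True)

# Same structure as the pipeline above: scaling followed by a hard-voting ensemble
cv_pipe = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('voting_clf', VotingClassifier(
        estimators=[('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
                    ('lr', LogisticRegression(random_state=42))],
        voting='hard')),
])

# Stratified folds keep the three species balanced in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(cv_pipe, X_iris, y_iris, cv=cv, scoring='accuracy')

print('Fold accuracies:', scores.round(2))
print(f'Mean accuracy: {scores.mean():.2f} (std {scores.std():.2f})')
```

The spread across folds gives a sense of how much a single train/test split can vary on a dataset this small.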
