### Q1. You are working on a machine learning project where you have a dataset containing numerical and categorical features. You have identified that some of the features are highly correlated and there are missing values in some of the columns. You want to build a pipeline that automates the feature engineering process and handles the missing values.

Design a pipeline that includes the following steps:
- Use an automated feature selection method to identify the important features in the dataset.
- Create a numerical pipeline that includes the following steps:
    - Impute the missing values in the numerical columns using the mean of the column values.
    - Scale the numerical columns using standardisation.
- Create a categorical pipeline that includes the following steps:
    - Impute the missing values in the categorical columns using the most frequent value of the column.
    - One-hot encode the categorical columns.
- Combine the numerical and categorical pipelines using a ColumnTransformer.
- Use a Random Forest Classifier to build the final model.
- Evaluate the accuracy of the model on the test dataset.

**Note: Your solution should include code snippets for each step of the pipeline, and a brief explanation of
each step. You should also provide an interpretation of the results and suggest possible improvements for
the pipeline.**


In [1]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

#### Numerical Pipeline

In [2]:
num_pipe = Pipeline([
    ('imputer',SimpleImputer(strategy='mean')),
    ('scaler',StandardScaler())
])

#### Categorical Pipeline

In [3]:
categ_pipe = Pipeline([
    ('imputer',SimpleImputer(strategy='most_frequent')),
    ('ohe',OneHotEncoder(handle_unknown='ignore'))
])

#### Combining both Pipelines using ColumnTransformer

In [4]:
from sklearn.compose import ColumnTransformer

In [5]:
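num_features = []      # fill in with the names of the numerical columns in your dataset
categ_features = []    # fill in with the names of the categorical columns in your dataset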

In [6]:
preprocessor = ColumnTransformer([
    ('num_pipe', num_pipe, num_features),         # num_features: list of column names with numerical data
    ('categ_pipe', categ_pipe, categ_features)    # categ_features: list of column names with categorical data
])
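The two column lists are left empty in the cell above because no dataset is given. As a minimal sketch, assuming the data lives in a pandas DataFrame, the lists could be derived from the column dtypes. The frame `df` and its column names (`age`, `income`, `city`) are hypothetical and only serve as an illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the column names are illustrative, not from the assignment
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0],
    "income": [50000.0, 62000.0, np.nan],
    "city": ["Pune", None, "Delhi"],
})

# Numeric dtypes go to the numerical pipeline, everything else to the categorical one
num_features = df.select_dtypes(include="number").columns.tolist()
categ_features = df.select_dtypes(exclude="number").columns.tolist()

print(num_features)
print(categ_features)
```

Splitting by dtype is only a heuristic: an integer-coded category (for example a postal code) would land in `num_features` and may need to be moved by hand.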

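**The preprocessor now automates imputation, scaling, and encoding. The remaining steps are to add automated feature selection, define the Random Forest model, and evaluate its accuracy on the test set.**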
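The remaining steps of Q1 (feature selection, model, evaluation) could be sketched as follows. Everything about the data here is an assumption: the assignment supplies no dataset, so the block builds a small synthetic frame with hypothetical columns `num_a`, `num_b`, `num_c`, and `cat_a`, where `num_c` is deliberately correlated with `num_a`. `SelectFromModel` with a Random Forest is one possible choice of automated feature selection; the question does not prescribe a specific method.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data (assumption): stands in for the unspecified project dataset
rng = np.random.default_rng(42)
n = 300
df = pd.DataFrame({
    "num_a": rng.normal(size=n),
    "num_b": rng.normal(size=n),
    "cat_a": rng.choice(["x", "y", "z"], size=n),
})
df["num_c"] = 2 * df["num_a"] + rng.normal(scale=0.01, size=n)  # highly correlated with num_a
y = ((df["num_a"] + (df["cat_a"] == "x")) > 0.5).astype(int)

# Inject missing values into one numerical and one categorical column
df.loc[rng.choice(n, 20, replace=False), "num_b"] = np.nan
df.loc[rng.choice(n, 20, replace=False), "cat_a"] = np.nan

num_features = ["num_a", "num_b", "num_c"]
categ_features = ["cat_a"]

num_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])
categ_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("ohe", OneHotEncoder(handle_unknown="ignore")),
])
preprocessor = ColumnTransformer([
    ("num_pipe", num_pipe, num_features),
    ("categ_pipe", categ_pipe, categ_features),
])

# Preprocess -> keep features with at least median importance -> classify
model = Pipeline([
    ("preprocessor", preprocessor),
    ("selector", SelectFromModel(
        RandomForestClassifier(n_estimators=100, random_state=42),
        threshold="median",
    )),
    ("classifier", RandomForestClassifier(random_state=42)),
])

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.3, random_state=42, stratify=y
)
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
n_selected = int(model.named_steps["selector"].get_support().sum())
print(f"Test accuracy: {accuracy:.3f}")
print(f"Features kept after selection: {n_selected}")
```

Because the selector sits after the `ColumnTransformer`, it operates on the transformed columns (scaled numerics and individual one-hot columns), not on the original features. Correlated features such as `num_a` and `num_c` tend to share importance in a Random Forest, so importance-based selection may keep both. Possible improvements include dropping one of each highly correlated pair beforehand, tuning the selection threshold and forest hyperparameters with `GridSearchCV`, and reporting cross-validated accuracy rather than a single split.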

### Q2. Build a pipeline that includes a random forest classifier and a logistic regression classifier, and then use a voting classifier to combine their predictions. Train the pipeline on the iris dataset and evaluate its accuracy.

In [7]:
from sklearn.datasets import load_iris

In [8]:
iris = load_iris()
X = iris.data
y = iris.target

In [9]:
from sklearn.model_selection import train_test_split

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [11]:
from sklearn.ensemble import RandomForestClassifier,VotingClassifier
from sklearn.linear_model import LogisticRegression

In [12]:
rf_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('rf_classifier', RandomForestClassifier(random_state=42))
])

lr_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('lr_classifier', LogisticRegression(random_state=42))
])

In [13]:
voting_classifier = VotingClassifier(
    estimators=[('rf', rf_pipeline), ('lr', lr_pipeline)],
    voting='hard'
)

In [14]:
voting_classifier.fit(X_train, y_train)

In [15]:
y_pred = voting_classifier.predict(X_test)

In [16]:
from sklearn.metrics import accuracy_score

In [17]:
print(accuracy_score(y_test, y_pred))

1.0
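An accuracy of 1.0 on this split is measured on only 45 test samples (30% of 150), so it may be optimistic. As a rough check, the same voting classifier could be scored with stratified 5-fold cross-validation. The fold count and the shuffle seed below are assumptions, not part of the original exercise.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Same two sub-pipelines and hard-voting ensemble as in the cells above
rf_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("rf_classifier", RandomForestClassifier(random_state=42)),
])
lr_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("lr_classifier", LogisticRegression(random_state=42)),
])
voting_classifier = VotingClassifier(
    estimators=[("rf", rf_pipeline), ("lr", lr_pipeline)],
    voting="hard",
)

# Stratified folds keep the three classes balanced in every split (5 folds is an assumption)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(voting_classifier, X, y, cv=cv, scoring="accuracy")

print(f"Fold accuracies: {scores}")
print(f"Mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```

With only two estimators, hard voting can tie whenever the models disagree; scikit-learn then resolves the tie by the ascending order of the class labels. Switching to `voting='soft'` or adding a third estimator are two ways to avoid relying on that tie-break.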
