#### Assignment
Q1. You are working on a machine learning project where you have a dataset containing numerical and
categorical features. You have identified that some of the features are highly correlated and there are
missing values in some of the columns. You want to build a pipeline that automates the feature
engineering process and handles the missing values.

Design a pipeline that includes the following steps:

- Use an automated feature selection method to identify the important features in the dataset.
- Create a numerical pipeline that includes the following steps:
  - Impute the missing values in the numerical columns using the mean of the column values.
  - Scale the numerical columns using standardisation.
- Create a categorical pipeline that includes the following steps:
  - Impute the missing values in the categorical columns using the most frequent value of the column.
  - One-hot encode the categorical columns.
- Combine the numerical and categorical pipelines using a ColumnTransformer.
- Use a Random Forest Classifier to build the final model.
- Evaluate the accuracy of the model on the test dataset.

#### Answer:

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd

# Assume 'X' is your feature matrix (a pandas DataFrame) and 'y' is the target variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the numerical features and categorical features
numerical_features = X.select_dtypes(include=['int64', 'float64']).columns
categorical_features = X.select_dtypes(include=['object']).columns

# Feature Selection
feature_selector = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=42), threshold='median')

# Numerical Pipeline
numerical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# Categorical Pipeline
categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

# Column Transformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_pipeline, numerical_features),
        ('cat', categorical_pipeline, categorical_features)
    ]
)

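# Final Pipeline
# Preprocessing must run first: SelectFromModel fits a RandomForestClassifier,
# which cannot handle missing values or raw string categories.
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('feature_selector', feature_selector),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])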

# Fit the pipeline on the training data
pipeline.fit(X_train, y_train)

# Make predictions on the test data
y_pred = pipeline.predict(X_test)

# Evaluate the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy on the test set: {accuracy}')
```

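In this pipeline:

- SelectFromModel performs automated feature selection based on the importance scores of a RandomForestClassifier. With threshold='median', it keeps the features whose importance is at or above the median importance.
- Numerical features are imputed with the column mean and then standardised.
- Categorical features are imputed with the most frequent value and then one-hot encoded.
- A ColumnTransformer combines the numerical and categorical pipelines.
- The final pipeline chains three stages: preprocessing, feature selection, and a RandomForestClassifier. SelectFromModel has to come after the preprocessor, because its RandomForestClassifier cannot be fitted on raw string categories or missing values.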

#### Q2. Build a pipeline that includes a random forest classifier and a logistic regression classifier, and then use a voting classifier to combine their predictions. Train the pipeline on the iris dataset and evaluate its accuracy.


#### Answer:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create the individual classifiers
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
lr_classifier = LogisticRegression(max_iter=1000, random_state=42)

# Create the pipeline for the Random Forest Classifier
rf_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', rf_classifier)
])

# Create the pipeline for the Logistic Regression Classifier
lr_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', lr_classifier)
])

# Create the Voting Classifier
voting_classifier = VotingClassifier(
    estimators=[
        ('random_forest', rf_pipeline),
        ('logistic_regression', lr_pipeline)
    ],
    voting='hard'  # 'hard': majority vote on predicted labels; 'soft': average of predicted probabilities
)

# Fit the Voting Classifier on the training data
voting_classifier.fit(X_train, y_train)

# Make predictions on the test data
y_pred = voting_classifier.predict(X_test)

# Evaluate the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy on the test set: {accuracy}')
```

Output:

```
Accuracy on the test set: 1.0
```


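In this example:

- Two classifiers are created: a Random Forest Classifier and a Logistic Regression Classifier.
- Each classifier sits in its own pipeline with standardisation (StandardScaler) as a preprocessing step. Scaling matters for logistic regression; the random forest is largely insensitive to it.
- The VotingClassifier combines the two pipelines with the voting strategy set to 'hard', so the predicted class is the majority vote over the estimators' predicted labels. With only two estimators a tie is possible, in which case scikit-learn picks the class that comes first in ascending sort order.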
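Soft voting is an alternative to hard voting: it averages the class probabilities returned by each estimator's `predict_proba` and predicts the class with the highest mean probability. This avoids ties between two estimators. The following is a minimal sketch of that variant on the same Iris split; the variable name `soft_voter` and the use of `make_pipeline` are choices made here for brevity, not part of the original answer.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Soft voting requires every estimator to implement predict_proba.
soft_voter = VotingClassifier(
    estimators=[
        ("random_forest", RandomForestClassifier(n_estimators=100, random_state=42)),
        ("logistic_regression", make_pipeline(
            StandardScaler(), LogisticRegression(max_iter=1000)
        )),
    ],
    voting="soft",
)
soft_voter.fit(X_train, y_train)

# Averaged class probabilities, one row per test sample.
proba = soft_voter.predict_proba(X_test)

# Compare each fitted base estimator with the ensemble.
for name, estimator in soft_voter.named_estimators_.items():
    print(f"{name}: {estimator.score(X_test, y_test):.3f}")
print(f"soft voting: {soft_voter.score(X_test, y_test):.3f}")
```

On a small, easily separable dataset such as Iris, the ensemble is unlikely to differ much from its base estimators. The per-estimator scores make that visible.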