In [None]:
'''Q1. You are working on a machine learning project where you have a dataset containing numerical and
categorical features. You have identified that some of the features are highly correlated and there are
missing values in some of the columns. You want to build a pipeline that automates the feature
engineering process and handles the missing values.
Design a pipeline that includes the following steps:
Use an automated feature selection method to identify the important features in the dataset.
Create a numerical pipeline that includes the following steps:
Impute the missing values in the numerical columns using the mean of the column values.
Scale the numerical columns using standardisation.
Create a categorical pipeline that includes the following steps:
Impute the missing values in the categorical columns using the most frequent value of the column.
One-hot encode the categorical columns.
Combine the numerical and categorical pipelines using a ColumnTransformer.
Use a Random Forest Classifier to build the final model.
Evaluate the accuracy of the model on the test dataset.

Note: Your solution should include code snippets for each step of the pipeline, and a brief explanation of
each step. You should also provide an interpretation of the results and suggest possible improvements for
the pipeline.'''

In [None]:
'''
Pipeline Steps
1. Import Necessary Libraries

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score

2. Load Dataset

# Replace 'your_dataset.csv' with your actual dataset path
data = pd.read_csv('your_dataset.csv')

3. Split Data into Features and Target

X = data.drop('target_column', axis=1)  # Replace 'target_column' with your target variable
y = data['target_column']

4. Split Data into Training and Testing Sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

5. Create Numerical and Categorical Pipelines

numerical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

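6. Combine Pipelines Using ColumnTransformer

# Infer the column lists from the dtypes of the training data
numerical_features = X_train.select_dtypes(include='number').columns.tolist()
categorical_features = X_train.select_dtypes(exclude='number').columns.tolist()

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_pipeline, numerical_features),
        ('cat', categorical_pipeline, categorical_features)
    ])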

7. Feature Selection

selector = SelectFromModel(RandomForestClassifier(random_state=42))

8. Create Final Pipeline

pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('feature_selection', selector),
    ('classifier', RandomForestClassifier(random_state=42))
])

9. Train the Model

pipeline.fit(X_train, y_train)

10. Evaluate the Model

accuracy = pipeline.score(X_test, y_test)
print("Accuracy:", accuracy)

# Cross-validation for more robust evaluation
cv_scores = cross_val_score(pipeline, X, y, cv=5)
print("Cross-Validation Accuracy:", cv_scores.mean())

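Explanation of Steps:
Preprocessing: The numerical pipeline fills missing values with the column mean and then standardises each column. The categorical pipeline fills missing values with the most frequent category and then one-hot encodes the column. handle_unknown='ignore' encodes categories that were not seen during fitting as all zeros instead of raising an error.
Pipeline Combination: The ColumnTransformer applies each pipeline to its own list of columns and concatenates the results, so the preprocessing fitted on the training data is reused unchanged on the test data.
Feature Selection: The SelectFromModel step fits a Random Forest Classifier and keeps the features whose importance is at least the threshold, which defaults to the mean importance. Because it runs after one-hot encoding, it selects individual encoded columns rather than whole original features.
Model Training: A second Random Forest Classifier is trained on the preprocessed and selected features.
Evaluation: pipeline.score reports accuracy on the held-out test set. cross_val_score refits the whole pipeline on each of 5 folds of the full dataset, which gives an estimate that depends less on one particular split.

Interpretation and Possible Improvements:
The accuracy values depend on the dataset, so none are quoted here. Accuracy alone can be misleading when the classes are imbalanced, so precision, recall, or F1 are worth reporting as well. Importance-based selection does not explicitly remove correlated features, so a correlation filter could be added before the model. The selection threshold and the Random Forest hyperparameters (for example n_estimators and max_depth) could be tuned with GridSearchCV, and passing stratify=y to train_test_split keeps the class proportions the same in both splits.'''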

In [None]:
# Q2. Build a pipeline that includes a random forest classifier and a logistic regression classifier, and then
# use a voting classifier to combine their predictions. Train the pipeline on the iris dataset and evaluate its
# accuracy.

In [None]:
'''
Creating a Pipeline with Random Forest and Logistic Regression

Import necessary libraries:

from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score

Load the Iris dataset:

iris = load_iris()
X = iris.data
y = iris.target

Split data into training and testing sets:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Create individual classifiers:

rf = RandomForestClassifier(random_state=42)
# A higher max_iter helps the solver converge on the unscaled features
lr = LogisticRegression(max_iter=1000, random_state=42)

Create a voting classifier:

voting_clf = VotingClassifier(estimators=[('rf', rf), ('lr', lr)], voting='hard')

Train the pipeline:

voting_clf.fit(X_train, y_train)

Evaluate the pipeline:

accuracy = voting_clf.score(X_test, y_test)
print("Accuracy:", accuracy)

cv_scores = cross_val_score(voting_clf, X, y, cv=5)
print("Cross-Validation Accuracy:", cv_scores.mean())

Explanation:

We import the two classifiers, the VotingClassifier, the Iris loader, and the train/test split and cross-validation helpers.
We load the Iris dataset and split it into training (80%) and testing (20%) sets.
We create individual Random Forest and Logistic Regression classifiers.
We create a VotingClassifier that combines their predictions with 'hard' voting: each classifier casts one vote per sample and the majority class wins. With only two classifiers a disagreement is a tie, which scikit-learn resolves in favour of the class that comes first in sorted label order.
We train the VotingClassifier on the training data; fitting it fits both underlying classifiers.
We evaluate accuracy on the testing data and with 5-fold cross-validation for a more robust estimate.
The VotingClassifier is itself a scikit-learn estimator, so it can be used on its own as here or placed as the final step of a Pipeline after preprocessing steps.'''
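
Since the question asks for a pipeline, one possible variant wraps the voting classifier in an explicit Pipeline. This is a sketch under assumptions rather than the original answer: the StandardScaler step, max_iter=1000, stratify=y, and the switch to 'soft' voting (averaging predicted probabilities, which avoids two-classifier ties) are all choices made here for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

voting_pipeline = Pipeline([
    # Scaling mainly helps Logistic Regression; Random Forest is unaffected by it.
    ("scaler", StandardScaler()),
    ("voting", VotingClassifier(
        estimators=[
            ("rf", RandomForestClassifier(random_state=42)),
            ("lr", LogisticRegression(max_iter=1000)),
        ],
        voting="soft",  # average class probabilities instead of counting votes
    )),
])

voting_pipeline.fit(X_train, y_train)
accuracy = voting_pipeline.score(X_test, y_test)
predictions = voting_pipeline.predict(X_test)
print("Accuracy:", accuracy)
```

Because the scaler sits inside the pipeline, it is fitted on the training data only and then applied to the test data, which keeps test information out of the preprocessing.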