Q1. You are working on a machine learning project where you have a dataset containing numerical and categorical features. You have identified that some of the features are highly correlated and there are missing values in some of the columns. You want to build a pipeline that automates the feature engineering process and handles the missing values.

Design a pipeline that includes the following steps:
1. Use an automated feature selection method to identify the important features in the dataset.
2. Create a numerical pipeline that includes the following steps:
3. Impute the missing values in the numerical columns using the mean of the column values.
4. Scale the numerical columns using standardization.
5. Create a categorical pipeline that includes the following steps:
6. Impute the missing values in the categorical columns using the most frequent value of the column.
7. One-hot encode the categorical columns.
8. Combine the numerical and categorical pipelines using a ColumnTransformer.
9. Use a Random Forest Classifier to build the final model.
10. Evaluate the accuracy of the model on the test dataset.

Note: Your solution should include code snippets for each step of the pipeline and a brief explanation of each step. You should also interpret the results and suggest possible improvements to the pipeline.

### Q1. Designing a Machine Learning Pipeline

**Problem Statement:**
You're tasked with building a machine learning pipeline to handle a dataset containing both numerical and categorical features. The dataset has missing values and some highly correlated features. The pipeline needs to automate feature engineering, handle missing values, and finally train a Random Forest Classifier.

**Steps:**

1. **Automated Feature Selection**: Identify the most important features using an automated method.
2. **Numerical Pipeline**:
   - Impute missing values in numerical columns using the mean.
   - Scale numerical columns using standardization.
3. **Categorical Pipeline**:
   - Impute missing values in categorical columns using the most frequent value.
   - One-hot encode categorical columns.
4. **Combine Pipelines**: Use `ColumnTransformer` to combine the numerical and categorical pipelines.
5. **Modeling**: Use a Random Forest Classifier.
6. **Evaluation**: Assess the model's accuracy on the test dataset.


In [None]:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import SelectKBest, f_classif

# Loading dataset (assuming it's loaded into a DataFrame 'df')
df = pd.read_csv('My_Data.csv')

# Separate features and target
X = df.drop('target_column', axis=1)
y = df['target_column']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Identify numerical and categorical columns
numerical_cols = X_train.select_dtypes(include='number').columns  # all int/float widths, not only 64-bit
categorical_cols = X_train.select_dtypes(include=['object', 'category']).columns

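# 1. Automated Feature Selection
# Note: k='all' keeps every feature (no selection happens);
# set k to an integer to keep only the top-k features by ANOVA F-score
feature_selector = SelectKBest(score_func=f_classif, k='all')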

# 2-4. Numerical Pipeline
numerical_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),  # Impute missing values with mean
    ('scaler', StandardScaler())  # Standardize the numerical features
])

# 5-7. Categorical Pipeline
categorical_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),  # Impute missing values with the most frequent value
    ('onehot', OneHotEncoder(handle_unknown='ignore'))  # One-hot encode the categorical features
])

# 8. Combine Pipelines
preprocessor = ColumnTransformer(transformers=[
    ('num', numerical_pipeline, numerical_cols),
    ('cat', categorical_pipeline, categorical_cols)
])

# 9. Final Pipeline including the Random Forest Classifier
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('feature_selector', feature_selector),  # Select important features
    ('classifier', RandomForestClassifier(random_state=42))  # Use Random Forest Classifier
])

# Train the model
pipeline.fit(X_train, y_train)

# 10. Predict on the test set and evaluate
y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Model Accuracy: {accuracy:.4f}')



**Explanation:**
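- **Feature Selection**: `SelectKBest` scores each feature with the ANOVA F-value (`f_classif`) and keeps the `k` highest-scoring ones. With `k='all'`, as in the code above, every feature is kept, so set `k` to an integer to actually drop low-scoring features. Because it scores features one at a time, it does not remove features that are highly correlated with each other.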
- **Numerical Pipeline**: Handles missing values by imputing the mean and standardizes the numerical features.
- **Categorical Pipeline**: Handles missing values by imputing the most frequent value and applies one-hot encoding to the categorical features.
- **Preprocessing**: Combines the numerical and categorical pipelines into a single preprocessing step.
- **Modeling**: The Random Forest Classifier is applied to the preprocessed data.
- **Evaluation**: The accuracy score is computed on the test dataset to evaluate the model’s performance.
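
The steps above can be tried end to end without `My_Data.csv`. The following is a minimal sketch on a small synthetic DataFrame; the column names (`age`, `income`, `city`), the target rule, and `k=3` are all assumptions made up for illustration, not part of the original dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n_rows = 200

# Hypothetical toy data: two numerical columns and one categorical column
toy_X = pd.DataFrame({
    "age": rng.normal(40, 10, n_rows),
    "income": rng.normal(50_000, 15_000, n_rows),
    "city": pd.Series(rng.choice(["A", "B", "C"], n_rows), dtype=object),
})
# The target depends (noisily) on age; computed before values are removed
toy_y = (toy_X["age"] + rng.normal(0, 5, n_rows) > 40).astype(int)

# Knock out some entries to mimic missing values
toy_X.loc[rng.choice(n_rows, 20, replace=False), "age"] = np.nan
toy_X.loc[rng.choice(n_rows, 20, replace=False), "city"] = np.nan

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, stratify=toy_y, random_state=42
)

toy_preprocessor = ColumnTransformer(transformers=[
    ("num", Pipeline([("imputer", SimpleImputer(strategy="mean")),
                      ("scaler", StandardScaler())]), ["age", "income"]),
    ("cat", Pipeline([("imputer", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), ["city"]),
])

toy_pipeline = Pipeline(steps=[
    ("preprocessor", toy_preprocessor),
    # 2 numerical + 3 one-hot columns = 5 features; keep the 3 best-scoring
    ("feature_selector", SelectKBest(score_func=f_classif, k=3)),
    ("classifier", RandomForestClassifier(random_state=42)),
])

toy_pipeline.fit(X_tr, y_tr)
toy_accuracy = accuracy_score(y_te, toy_pipeline.predict(X_te))
print(f"Toy accuracy: {toy_accuracy:.4f}")

# Boolean mask over the 5 preprocessed columns showing which were kept
print(toy_pipeline.named_steps["feature_selector"].get_support())
```

Because imputation, scaling, encoding, and selection all live inside the pipeline, they are fitted on the training split only and then reused unchanged on the test split.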

**Interpretation of Results:**
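Accuracy is only meaningful relative to a baseline, so compare it with the accuracy of always predicting the most frequent class. If the classes are imbalanced, accuracy can look high while the minority class is predicted poorly; in that case also check precision, recall, or a confusion matrix. If performance falls short, consider tuning hyperparameters or trying different imputation strategies. Scaling mainly matters if you swap in a scale-sensitive model, since Random Forests are largely unaffected by feature scale.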
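
One way to put an accuracy number in context is to compare it against a trivial baseline. The sketch below uses a synthetic, deliberately imbalanced dataset from `make_classification` (an assumption for illustration; the original data is not available here) and scikit-learn's `DummyClassifier`.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 90% of samples in class 0
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Baseline: always predict the most frequent training class
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
model = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)

baseline_acc = accuracy_score(y_te, baseline.predict(X_te))
model_acc = accuracy_score(y_te, model.predict(X_te))
print(f"Baseline accuracy: {baseline_acc:.4f}")
print(f"Model accuracy:    {model_acc:.4f}")

# Per-class precision and recall reveal what accuracy alone can hide
print(classification_report(y_te, model.predict(X_te)))
```

Here the baseline already scores high simply because one class dominates, so the model's accuracy should be judged by how far it exceeds that figure and by its per-class recall.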

**Possible Improvements:**
- **Hyperparameter Tuning**: Use `GridSearchCV` or `RandomizedSearchCV` to fine-tune the model.
- **Additional Feature Engineering**: Create interaction features, or use PCA for dimensionality reduction, which also decorrelates the highly correlated numerical features.
- **Outlier Handling**: Consider implementing techniques to detect and mitigate the impact of outliers.
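
As a rough illustration of the hyperparameter-tuning suggestion, the sketch below runs `GridSearchCV` over a numeric-only pipeline on synthetic data. The dataset and the grid values are assumptions chosen to keep the example small and fast, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real dataset
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=4, random_state=42
)

pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
    ("feature_selector", SelectKBest(score_func=f_classif)),
    ("classifier", RandomForestClassifier(random_state=42)),
])

# Parameters are addressed as <step name>__<parameter name>
param_grid = {
    "feature_selector__k": [5, "all"],
    "classifier__n_estimators": [50, 100],
    "classifier__max_depth": [None, 5],
}

# 3-fold cross-validation over all 8 combinations
search = GridSearchCV(pipe, param_grid, cv=3, scoring="accuracy")
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.4f}")
```

For the pipeline in this answer, nested steps chain the same way, for example `preprocessor__num__imputer__strategy` or `classifier__n_estimators`.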



### Q2. Build a pipeline that includes a random forest classifier and a logistic regression classifier, and then use a voting classifier to combine their predictions. Train the pipeline on the iris dataset and evaluate its accuracy.

A2. We need to build a pipeline that includes a Random Forest Classifier and a Logistic Regression Classifier, and then combine their predictions using a Voting Classifier.

**Steps:**

1. **Load the Iris dataset**.
2. **Create individual pipelines** for Random Forest and Logistic Regression.
3. **Combine the models** using a Voting Classifier.
4. **Train the pipeline** and evaluate its accuracy.



In [1]:

from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

iris = load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

rf_pipeline = Pipeline(steps=[
    ('classifier', RandomForestClassifier(random_state=42))
])

lr_pipeline = Pipeline(steps=[
    ('classifier', LogisticRegression(random_state=42, max_iter=1000))
])

voting_classifier = VotingClassifier(estimators=[
    ('rf', rf_pipeline),
    ('lr', lr_pipeline)
], voting='hard')

voting_classifier.fit(X_train, y_train)

accuracy = voting_classifier.score(X_test, y_test)
print(f'Voting Classifier Accuracy: {accuracy:.4f}')

Voting Classifier Accuracy: 1.0000




**Explanation:**
- **Random Forest Pipeline**: Contains a Random Forest Classifier.
- **Logistic Regression Pipeline**: Contains a Logistic Regression Classifier.
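- **Voting Classifier**: Combines the predictions from both classifiers. `voting='hard'` means the final prediction is the class label that receives the most votes. With only two estimators, any disagreement is a tie, which scikit-learn resolves by choosing the class that comes first in ascending sort order.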
- **Evaluation**: The accuracy of the Voting Classifier is computed on the test dataset.
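
The tie-breaking behaviour of hard voting with two estimators can be seen in isolation. This sketch replaces the real models with two `DummyClassifier` instances that always predict a fixed class (a contrived setup, assumed purely for demonstration), so the two voters disagree on every sample.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import VotingClassifier

X_iris, y_iris = load_iris(return_X_y=True)

# Two voters that always disagree: one says class 2, the other class 0
always_two = DummyClassifier(strategy="constant", constant=2)
always_zero = DummyClassifier(strategy="constant", constant=0)

tie_demo = VotingClassifier(
    estimators=[("two", always_two), ("zero", always_zero)],
    voting="hard",
)
tie_demo.fit(X_iris, y_iris)

# Every sample is a 1-1 tie; the lower class label wins
tie_predictions = tie_demo.predict(X_iris)
print(np.unique(tie_predictions))
```

With an odd number of diverse estimators, or with `voting='soft'`, the result depends less on label order.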

**Interpretation of Results:**
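An accuracy of 1.0000 means all 30 test samples (20% of the 150 Iris samples) were classified correctly. On its own, this does not show that voting helps: compare the ensemble against each base classifier, ideally with cross-validation, because a 30-sample test set gives a noisy estimate. If the ensemble does no better than its members, consider alternative models or improved preprocessing.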
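
To check whether combining the models actually adds anything, each base classifier can be scored alongside the ensemble. The following sketch uses 5-fold cross-validation on the full Iris data; the fold count is an arbitrary choice for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X_iris, y_iris = load_iris(return_X_y=True)

rf = RandomForestClassifier(random_state=42)
lr = LogisticRegression(random_state=42, max_iter=1000)
voting = VotingClassifier(estimators=[("rf", rf), ("lr", lr)], voting="hard")

models = {"Random Forest": rf, "Logistic Regression": lr, "Voting (hard)": voting}

# Mean 5-fold cross-validated accuracy for each model
cv_means = {}
for name, model in models.items():
    scores = cross_val_score(model, X_iris, y_iris, cv=5, scoring="accuracy")
    cv_means[name] = scores.mean()
    print(f"{name}: {scores.mean():.4f} (+/- {scores.std():.4f})")
```

If the three means are within a standard deviation of each other, the ensemble is not clearly better than its members on this dataset.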

**Possible Improvements:**
- **Soft Voting**: Experiment with `voting='soft'` to average predicted probabilities instead of hard majority voting.
- **Additional Models**: Include more diverse classifiers like SVM or KNN in the Voting Classifier.
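- **Feature Engineering**: Add preprocessing inside the individual pipelines, for example a `StandardScaler` step ahead of Logistic Regression, which is sensitive to feature scale (the Random Forest is not).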
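
The suggestions above could be combined roughly as follows. This is a sketch, not a tuned setup: it adds a KNN classifier as the extra model (an SVM is left out for brevity), scales the inputs of the scale-sensitive models, and switches to soft voting. The choice of `n_neighbors=5` is an assumption.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.2, random_state=42
)

# Random Forest needs no scaling
rf_pipe = Pipeline([("classifier", RandomForestClassifier(random_state=42))])

# Logistic Regression and KNN are scale-sensitive, so each gets a scaler
lr_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(random_state=42, max_iter=1000)),
])
knn_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", KNeighborsClassifier(n_neighbors=5)),
])

# Soft voting averages predict_proba outputs, so every estimator must provide it
soft_voter = VotingClassifier(
    estimators=[("rf", rf_pipe), ("lr", lr_pipe), ("knn", knn_pipe)],
    voting="soft",
)
soft_voter.fit(X_tr, y_tr)

soft_accuracy = soft_voter.score(X_te, y_te)
print(f"Soft Voting Accuracy: {soft_accuracy:.4f}")
```

Using three estimators also removes the two-way ties that can occur when only two models vote.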
