### Q1. You are working on a machine learning project where you have a dataset containing numerical and categorical features. You have identified that some of the features are highly correlated and there are missing values in some of the columns. You want to build a pipeline that automates the feature engineering process and handles the missing values.

Design a pipeline that includes the following steps:

- **Use an automated feature selection method** to identify the important features in the dataset.
- **Create a numerical pipeline** that includes the following steps:
  - Impute the missing values in the numerical columns using the mean of the column values.
  - Scale the numerical columns using standardisation.
- **Create a categorical pipeline** that includes the following steps:
  - Impute the missing values in the categorical columns using the most frequent value of the column.
  - One-hot encode the categorical columns.
- **Combine the numerical and categorical pipelines** using a `ColumnTransformer`.
- **Use a Random Forest Classifier** to build the final model.
- **Evaluate the accuracy** of the model on the test dataset.

Note: Your solution should include code snippets for each step of the pipeline, and a brief explanation of each step. You should also provide an interpretation of the results and suggest possible improvements for the pipeline.

---

### Q2. Build a pipeline that includes a random forest classifier and a logistic regression classifier, and then use a voting classifier to combine their predictions. Train the pipeline on the iris dataset and evaluate its accuracy.


## Q1. Design a Pipeline for Feature Engineering and Model Building

In this task, we are working on a machine learning project that involves a dataset with both numerical and categorical features. The goal is to automate the feature engineering process and handle missing values. We will use a pipeline to automate the steps for data preprocessing and model building. Here’s how to do it:

## Step 1: Import Libraries
We will start by importing the necessary libraries for data manipulation, preprocessing, and modeling.

```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import accuracy_score
```
---

## Step 2: Load the Data
Assume the dataset is already loaded into a DataFrame named `df`. For this example, we assume that the target column is named `target` and that the remaining columns are a mix of numerical and categorical features.

```python
# Example dataset
# df = pd.read_csv('your_data.csv')
X = df.drop(columns=['target'])
y = df['target']
```
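To run the walkthrough end to end without a real file, a synthetic stand-in for `df` could be built first and the split into `X` and `y` above re-run afterwards. The sketch below is only an illustration: the column names (`age`, `income`, `city`, `plan`), the value ranges, and the roughly 10% missing rate are all invented assumptions, not part of the original dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows = 200

# Hypothetical columns, invented purely for illustration
df = pd.DataFrame({
    "age": rng.integers(18, 70, size=n_rows).astype(float),
    "income": rng.normal(50_000, 15_000, size=n_rows),
    "city": rng.choice(["Paris", "Berlin", "Madrid"], size=n_rows),
    "plan": rng.choice(["basic", "premium"], size=n_rows),
})

# Binary target loosely tied to income so a model has something to learn
noise = rng.normal(0, 5_000, size=n_rows)
df["target"] = (df["income"] + noise > 50_000).astype(int)

# Blank out roughly 10% of one numerical and one categorical column
df.loc[rng.random(n_rows) < 0.1, "age"] = np.nan
df.loc[rng.random(n_rows) < 0.1, "city"] = np.nan

# Count missing values per column
print(df.isna().sum())
```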

## Step 3: Identify Numerical and Categorical Features
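We separate the features into numerical and categorical columns so that each group gets the appropriate preprocessing. Columns with other dtypes (for example `category` or `bool`) are not captured by these filters and would be dropped by the `ColumnTransformer` in Step 6 unless they are added explicitly.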

```python
numerical_features = X.select_dtypes(include=['int64', 'float64']).columns
categorical_features = X.select_dtypes(include=['object']).columns
```


## Step 4: Create the Numerical Pipeline
The numerical pipeline includes:
- **Imputation:** Fill missing values in numerical columns with the mean of the column values.
- **Scaling:** Standardize the numerical columns using StandardScaler.

```python
numerical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),  # Impute missing values with the mean
    ('scaler', StandardScaler())  # Standardize the numerical features
])
```
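As a quick sanity check, the same two steps can be applied to a tiny hand-made column. This is a minimal sketch on made-up values (`demo_numeric` is a hypothetical name): the missing entry is replaced by the mean of 1 and 3, and the column is then standardised to zero mean and unit variance.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# One numerical column with a missing value in the middle
demo_numeric = np.array([[1.0], [np.nan], [3.0]])

demo_numerical_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])

# After imputation the column is [1, 2, 3]; scaling centres it on 0
demo_scaled = demo_numerical_pipeline.fit_transform(demo_numeric)
print(demo_scaled.ravel())  # approximately [-1.2247, 0.0, 1.2247]
```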

## Step 5: Create the Categorical Pipeline
The categorical pipeline includes:
- **Imputation:** Fill missing values in categorical columns with the most frequent value of the column.
- **One-hot Encoding:** Convert categorical variables into a format that can be provided to ML models.

```python
categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),  # Impute missing values with the most frequent value
    ('onehot', OneHotEncoder(handle_unknown='ignore'))  # One-hot encode the categorical features
])
```
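The behaviour of this pipeline is easiest to see on a toy column. The sketch below uses an invented `color` column (an assumption for illustration only): the missing entry is filled with the most frequent value, and `handle_unknown='ignore'` turns a category unseen during fitting into an all-zero row instead of raising an error.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy categorical column with one missing value; "red" is the most frequent
demo_categories = pd.DataFrame({"color": ["red", "blue", np.nan, "red"]}, dtype=object)

demo_categorical_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# Output columns follow sorted category order: [blue, red]
demo_encoded = demo_categorical_pipeline.fit_transform(demo_categories).toarray()
print(demo_encoded)  # rows: [0, 1], [1, 0], [0, 1], [0, 1]

# A category never seen during fit is encoded as all zeros
unseen = pd.DataFrame({"color": ["green"]}, dtype=object)
unseen_encoded = demo_categorical_pipeline.transform(unseen).toarray()
print(unseen_encoded)  # [[0, 0]]
```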


## Step 6: Combine the Pipelines Using ColumnTransformer
The ColumnTransformer is used to apply the numerical and categorical pipelines to the respective columns.

```python
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_pipeline, numerical_features),
        ('cat', categorical_pipeline, categorical_features)
    ]
)
```

## Step 7: Feature Selection Using SelectKBest
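We apply automated feature selection using `SelectKBest` with the `f_classif` test (ANOVA F-value), which scores each feature individually by its relationship with the target variable. Note that `k='all'` keeps every feature, so no selection actually happens until `k` is set to an integer. Because the scoring is univariate, it also does not by itself remove features that are highly correlated with each other.

```python
selector = SelectKBest(f_classif, k='all')  # k='all' keeps every feature; set k to an integer to keep only the top k
```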
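To see the selector actually discard features, `k` has to be an integer. The following sketch uses synthetic data from `make_classification` (an assumption for illustration; the variable names are hypothetical) and keeps the three features with the highest F-scores.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 8 features, of which 3 carry signal
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

demo_selector = SelectKBest(f_classif, k=3)
X_reduced = demo_selector.fit_transform(X_demo, y_demo)

print(X_reduced.shape)  # (300, 3)
# Column indices of the features that were kept
print(demo_selector.get_support(indices=True))
# Per-feature F-scores used for the ranking
print(demo_selector.scores_.round(2))
```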

## Step 8: Build the Model Pipeline
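We combine the preprocessor, the feature selection step, and the model (Random Forest Classifier) into a final pipeline. Selection runs after preprocessing, so it operates on the scaled numerical columns and the one-hot encoded columns rather than on the raw features.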

```python
model_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('selector', selector),
    ('classifier', RandomForestClassifier(random_state=42))
])
```


## Step 9: Train-Test Split
We split the data into training and test sets.

```python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
```


## Step 10: Train the Model
We train the pipeline on the training dataset.

```python
model_pipeline.fit(X_train, y_train)
```

## Step 11: Evaluate the Model
Finally, we evaluate the accuracy of the model on the test set.

```python
y_pred = model_pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of the Random Forest model: {accuracy}")
```

## Interpretation of Results
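- The Random Forest model uses an ensemble of decision trees, which usually reduces overfitting compared to a single decision tree.
- The printed accuracy is the fraction of test samples classified correctly. It comes from a single 70/30 split, so it can vary with the split and can be misleading if the classes are imbalanced.
- Feature selection can reduce noise by dropping weakly related features, but with `k='all'` as configured in Step 7 no features are removed, so the reported accuracy reflects the full feature set.
- Imputation handles the missing values, and scaling puts the numerical features on the same scale. Tree-based models such as Random Forest are largely insensitive to feature scaling, so scaling matters more if the classifier is later swapped for a scale-sensitive model.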

## Possible Improvements:
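- Feature Selection: Set `k` in `SelectKBest` to an integer and treat it as a hyperparameter to tune. Since univariate selection does not remove correlated features, a correlation-based filter could also be considered.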
- Hyperparameter Tuning: Tune the hyperparameters of the Random Forest model using techniques like GridSearchCV or RandomizedSearchCV.
- Imputation Strategy: Explore other imputation strategies, like using a median or applying a more sophisticated method like KNN imputation.
- Model Selection: Test other models (e.g., XGBoost, SVM) for comparison.
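
The first two improvements can be combined in one search, since `k` and the Random Forest settings are both pipeline parameters. The sketch below is a minimal illustration on synthetic numerical data (an assumption; the grid values are arbitrary). The step names `selector` and `classifier` match those used in Step 8, so the same grid keys would apply to `model_pipeline`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic, purely numerical stand-in data (assumption for illustration)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=4, random_state=42
)

tuning_pipeline = Pipeline([
    ("selector", SelectKBest(f_classif)),
    ("classifier", RandomForestClassifier(random_state=42)),
])

# Keys follow the "<step name>__<parameter>" convention
param_grid = {
    "selector__k": [3, 6, "all"],
    "classifier__n_estimators": [50, 100],
    "classifier__max_depth": [None, 5],
}

search = GridSearchCV(tuning_pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(X_demo, y_demo)

print(search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")
```

Because the selector sits inside the pipeline, it is refit on each training fold, which keeps the cross-validated score free of leakage from the validation fold.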

---

## Q2. Build a Pipeline with Random Forest Classifier and Logistic Regression Classifier Using a Voting Classifier
In this task, we will create a pipeline that includes both a Random Forest Classifier and a Logistic Regression Classifier, and we will combine their predictions using a Voting Classifier. We will train this pipeline on the Iris dataset and evaluate its accuracy.

## Step 1: Import Libraries for the Voting Classifier

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
```

## Step 2: Load the Iris Dataset

```python
# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
```

## Step 3: Split the Data into Train and Test Sets

```python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
```

## Step 4: Create the Random Forest and Logistic Regression Classifiers

```python
# Define the classifiers
rf_classifier = RandomForestClassifier(random_state=42)
lr_classifier = LogisticRegression(max_iter=1000, random_state=42)  # higher max_iter helps the solver converge on unscaled features
```

## Step 5: Create the Voting Classifier
We will create a VotingClassifier that combines the predictions of the Random Forest and Logistic Regression classifiers.

```python
# Create a Voting Classifier
voting_classifier = VotingClassifier(estimators=[
    ('rf', rf_classifier),
    ('lr', lr_classifier)
], voting='hard')  # 'hard' voting means majority rule
```

## Step 6: Create the Pipeline
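Now we will create a pipeline that includes the voting classifier. The pipeline has a single step here; wrapping the classifier this way leaves room to add preprocessing steps (such as scaling) later without changing the training and evaluation code.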

```python
pipeline = Pipeline([
    ('classifier', voting_classifier)
])
```

## Step 7: Train the Pipeline
We will train the pipeline on the training dataset.

```python
pipeline.fit(X_train, y_train)
```

## Step 8: Evaluate the Model
Finally, we evaluate the accuracy of the model on the test dataset.

```python
y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of the Voting Classifier model: {accuracy}")
```

## Interpretation of Results:
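- Voting Classifier: By combining the predictions of two different classifiers (Random Forest and Logistic Regression), the ensemble can benefit when their errors differ.
- Hard Voting: In hard voting, each model casts one vote and the class label with the most votes is chosen as the final prediction. With only two classifiers, a disagreement produces a tie, which scikit-learn resolves by picking the class that comes first in ascending sort order.
- Accuracy: The accuracy score is the fraction of test samples classified correctly by the ensemble. To judge whether voting helps, it should be compared with the accuracy of each individual classifier on the same test set.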
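
A comparison of this kind could look like the following sketch, which fits each classifier and the hard-voting ensemble on the same Iris split and prints their test accuracies. The variable names are hypothetical, and `max_iter=1000` is an assumption made to help Logistic Regression converge.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.3, random_state=42
)

rf = RandomForestClassifier(random_state=42)
lr = LogisticRegression(max_iter=1000, random_state=42)
# VotingClassifier fits its own clones of rf and lr
ensemble = VotingClassifier(estimators=[("rf", rf), ("lr", lr)], voting="hard")

models = {"random_forest": rf, "logistic_regression": lr, "hard_voting": ensemble}

# Fit every model on the same training data and score it on the same test data
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_te, model.predict(X_te))

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```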

## Possible Improvements:
- Hyperparameter Tuning: Tune the hyperparameters of the classifiers to improve performance.
- Soft Voting: Experiment with soft voting, where the predicted class probabilities are averaged across models and the class with the highest average probability is chosen.
- Add More Classifiers: We could add other classifiers (e.g., SVM, KNN) to the voting classifier, which may improve predictions further. An odd number of voters also makes ties less likely under hard voting.
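
As an illustration of the soft-voting suggestion, the sketch below swaps `voting='hard'` for `voting='soft'` on the same Iris split. It is a minimal example under the same assumptions as before (hypothetical variable names, `max_iter=1000` for Logistic Regression). Soft voting requires every estimator to implement `predict_proba`, which both classifiers here do.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.3, random_state=42
)

soft_voting = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(random_state=42)),
        ("lr", LogisticRegression(max_iter=1000, random_state=42)),
    ],
    voting="soft",  # average predicted probabilities instead of counting votes
)
soft_voting.fit(X_tr, y_tr)

# Averaged class probabilities, one row per test sample
avg_proba = soft_voting.predict_proba(X_te)
soft_accuracy = accuracy_score(y_te, soft_voting.predict(X_te))

print(avg_proba[:3].round(3))
print(f"Accuracy of the soft Voting Classifier: {soft_accuracy:.3f}")
```

Averaging probabilities sidesteps the tie problem that hard voting has with an even number of classifiers, since two models rarely produce exactly equal average probabilities for two classes.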

---