**Q1. Design a Feature Engineering Pipeline:**



```python
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Assume 'X' is a pandas DataFrame of features and 'y' is the target variable

# Automated feature selection. SelectKBest with mutual information is one option;
# swap in another selector if it suits your data better.
# k='all' keeps every feature; set an integer to keep only the top-scoring k.
feature_selector = SelectKBest(score_func=mutual_info_classif, k='all')

# Numerical pipeline: impute missing values with the mean, then standardize
numerical_features = ['numerical_feature_1', 'numerical_feature_2']  # ... add your remaining columns
numerical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# Categorical pipeline: impute with the most frequent value, then one-hot encode
categorical_features = ['categorical_feature_1', 'categorical_feature_2']  # ... add your remaining columns
categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    # Ignore categories that appear only in the test set instead of raising an error
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine numerical and categorical pipelines
# sparse_threshold=0 forces dense output so the selector scores numeric columns as continuous
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_pipeline, numerical_features),
        ('cat', categorical_pipeline, categorical_features)
    ],
    sparse_threshold=0)

# Combine preprocessing, feature selection, and model.
# Selection runs after preprocessing because SelectKBest needs numeric input without missing values.
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('feature_selector', feature_selector),
    ('classifier', RandomForestClassifier(random_state=42))
])

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the pipeline on the training data
pipeline.fit(X_train, y_train)

# Evaluate the accuracy on the test set
y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.3f}')
```

**Interpretation and Suggestions:**
- This pipeline automates feature engineering by combining missing-value imputation, scaling of numerical features, one-hot encoding of categorical features, and feature selection.
- The choice of feature selector depends on the specific dataset and problem. `SelectKBest` with a mutual information criterion is used here as one option; replace it with another method where appropriate.
- The final model is a Random Forest Classifier. scikit-learn's implementation requires numeric input, so the categorical features reach it in one-hot encoded form.
- Possible improvements include tuning hyperparameters, exploring different feature selection methods, or trying different imputation strategies.
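
The tuning suggestions above can be sketched with `GridSearchCV`, which addresses parameters of any pipeline step through the `step__parameter` naming convention. This is a minimal sketch on a small synthetic DataFrame: the column names (`age`, `income`, `city`), the generated data, and the parameter grid are illustrative assumptions, not part of the original exercise.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data with hypothetical columns, for illustration only
rng = np.random.default_rng(0)
n = 120
X_demo = pd.DataFrame({
    "age": rng.normal(40, 10, n),
    "income": rng.normal(50_000, 15_000, n),
    "city": rng.choice(["A", "B", "C"], n),
})
y_demo = (X_demo["age"] + rng.normal(0, 5, n) > 40).astype(int)
X_demo.loc[::10, "age"] = np.nan  # inject missing values after building the target

preprocessor = ColumnTransformer([
    ("num", Pipeline([
        ("imputer", SimpleImputer(strategy="mean")),
        ("scaler", StandardScaler()),
    ]), ["age", "income"]),
    ("cat", Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["city"]),
])

demo_pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", RandomForestClassifier(random_state=42)),
])

# Keys follow the nesting: pipeline step -> transformer name -> inner step -> parameter
param_grid = {
    "preprocessor__num__imputer__strategy": ["mean", "median"],
    "classifier__n_estimators": [25, 50],
    "classifier__max_depth": [None, 5],
}

# 3-fold cross-validation over every combination in the grid
search = GridSearchCV(demo_pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")
```

Because the imputer sits inside the pipeline, each cross-validation fold fits it on that fold's training portion only, which avoids leaking statistics from the validation portion.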


**Q2. Build a Pipeline with Ensemble (Voting Classifier):**

```python
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Assume 'X' is a numeric feature matrix with no missing values and 'y' is your target variable

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Build individual classifiers
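rf_classifier = RandomForestClassifier(random_state=42)  # fixed seed for reproducible results
lr_classifier = LogisticRegression(max_iter=1000)  # higher iteration cap helps the solver converge on unscaled data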

# Build the ensemble pipeline with a Voting Classifier
ensemble_pipeline = Pipeline([
    ('ensemble', VotingClassifier(estimators=[('rf', rf_classifier), ('lr', lr_classifier)], voting='hard'))
])

# Fit the pipeline on the training data
ensemble_pipeline.fit(X_train, y_train)

# Evaluate the accuracy on the test set
y_pred = ensemble_pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')
```

**Interpretation:**
- This pipeline combines predictions from a Random Forest Classifier and a Logistic Regression Classifier using a hard voting strategy.
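- The Voting Classifier aggregates the predictions and returns the class with the most votes. With only two estimators, any disagreement is a tie, which scikit-learn resolves in favor of the class that comes first in ascending sort order, so adding a third estimator is a common way to make the vote more meaningful.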
- The accuracy of the ensemble can be compared to individual models to see if combining diverse models improves performance.
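
The comparison described in the last point can be sketched as follows. This is a minimal example on synthetic data from `make_classification`; the dataset, the sample sizes, and the variable names are assumptions made for illustration, and the resulting scores say nothing about how the models would rank on real data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem, for illustration only
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

rf = RandomForestClassifier(n_estimators=50, random_state=42)
lr = LogisticRegression(max_iter=1000)

# VotingClassifier clones its estimators, so rf and lr can also be fitted on their own
models = {
    "random_forest": rf,
    "logistic_regression": lr,
    "hard_voting": VotingClassifier(estimators=[("rf", rf), ("lr", lr)], voting="hard"),
}

# Fit each model on the same split and record its test accuracy
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_te, model.predict(X_te))

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

A single train/test split gives a noisy estimate, so differences of a few percentage points between the three scores should be confirmed with cross-validation before drawing conclusions.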