# Answer.1

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create individual classifiers
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
lr_classifier = LogisticRegression(solver='lbfgs', max_iter=1000, random_state=42)

# Create a Voting Classifier that combines the individual classifiers by majority vote
voting_classifier = VotingClassifier(estimators=[('rf', rf_classifier), ('lr', lr_classifier)], voting='hard')

# Create a pipeline with feature scaling and the Voting Classifier
# (scaling helps Logistic Regression; Random Forest is insensitive to it)
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('voting_classifier', voting_classifier)
])

# Train the pipeline on the training data
pipeline.fit(X_train, y_train)

# Make predictions on the test data
y_pred = pipeline.predict(X_test)

# Evaluate the accuracy of the ensemble model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')
```

Output:

```
Accuracy: 1.00
```
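
A single 30-sample test split gives a noisy accuracy estimate, and with only two estimators a hard vote can tie (scikit-learn then picks the class that sorts first). The sketch below is one way to check the ensemble more carefully: it compares `voting='hard'` with `voting='soft'` (averaged predicted probabilities) under 5-fold stratified cross-validation. The helper name `build_pipeline` and the fold settings are illustrative assumptions, not part of the answer above.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)


def build_pipeline(voting):
    """Hypothetical helper: scaler + two-model voting ensemble for a given voting mode."""
    ensemble = VotingClassifier(
        estimators=[
            ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
            ('lr', LogisticRegression(max_iter=1000, random_state=42)),
        ],
        voting=voting,
    )
    return Pipeline([('scaler', StandardScaler()), ('voting_classifier', ensemble)])


# Assumed evaluation setup: 5 shuffled, stratified folds
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cv_scores = {}
for voting in ('hard', 'soft'):
    scores = cross_val_score(build_pipeline(voting), X, y, cv=cv, scoring='accuracy')
    cv_scores[voting] = scores
    print(f'{voting} voting: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})')
```

Soft voting requires every estimator to implement `predict_proba`, which both models here do.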


# Answer.2

The scikit-learn pipeline below handles feature engineering (imputation, scaling, encoding, and feature selection) and classifies with a Random Forest. The file path and column names are placeholders to replace with your own:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load your dataset (replace 'data.csv' with the actual dataset file)
data = pd.read_csv('data.csv')

# Split the data into features (X) and target (y)
X = data.drop('target_column', axis=1)
y = data['target_column']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Feature Selection: Keep the k features with the highest ANOVA F-statistic.
# k should not exceed the number of features the selector receives.
feature_selector = SelectKBest(score_func=f_classif, k=10)  # Adjust k as needed

# Numerical Pipeline
numerical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# Categorical Pipeline
categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

# Column Transformer: Apply pipelines to respective feature types
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_pipeline, ['numerical_feature_1', 'numerical_feature_2']),
        ('cat', categorical_pipeline, ['categorical_feature_1', 'categorical_feature_2'])
    ])

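# Final Model Pipeline: preprocess first, then select features, then classify.
# SelectKBest with f_classif needs numeric input without missing values,
# so it has to run after imputation and encoding, not on the raw DataFrame.
model_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('features', feature_selector),
    ('classifier', RandomForestClassifier(random_state=42))
])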

# Fit the model
model_pipeline.fit(X_train, y_train)

# Predict on the test set
y_pred = model_pipeline.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')
```

Explanation of Steps:

1. **Load Data**: Load your dataset into a DataFrame.

2. **Split Data**: Split the dataset into feature matrix (X) and target vector (y). Replace `'target_column'` with the actual name of your target column.

3. **Feature Selection**: `SelectKBest` scores each feature against the target with the ANOVA F-statistic (`f_classif`) and keeps the `k` highest-scoring features. Adjust `k` to suit your dataset and problem.

4. **Numerical Pipeline**: Create a pipeline for numerical features. Impute missing values with the mean of the column and scale the features using standardization.

5. **Categorical Pipeline**: Create a pipeline for categorical features. Impute missing values with the most frequent value of the column and perform one-hot encoding.

6. **Column Transformer**: Use `ColumnTransformer` to apply the respective pipelines to numerical and categorical features. Replace the placeholder column names with the columns in your dataset.

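7. **Final Model Pipeline**: Chain preprocessing, feature selection, and the Random Forest Classifier into a final pipeline. Selection has to come after preprocessing, because `f_classif` cannot score raw string categories or missing values.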

8. **Fit and Predict**: Fit the model on the training data and make predictions on the test data.

9. **Evaluate**: Calculate the accuracy of the model on the test set.

Possible Improvements:

- Experiment with different feature selection methods and hyperparameters to find the best subset of features.

- Try different imputation strategies (e.g., median or custom strategies) based on your understanding of the data.

- Explore other classification algorithms or hyperparameter tuning to potentially improve model performance.

- Consider handling class imbalance if it exists in your target variable by using techniques like oversampling or undersampling.

- Use cross-validation to get a more reliable estimate of the model's performance than a single train/test split, which also makes overfitting easier to detect.
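
Several of these improvements can be combined in one cross-validated grid search. The sketch below is illustrative only and rests on assumptions: it uses synthetic, numeric-only data from `make_classification` with an 80/20 class split, so the categorical branch is omitted; it uses `class_weight='balanced'` as a simple stand-in for oversampling or undersampling, which would need an additional library; and it scores with `balanced_accuracy` because plain accuracy can be misleading on imbalanced targets. The grid values are arbitrary starting points.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumed synthetic data: 8 numeric features, imbalanced binary target
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4, n_redundant=2,
    weights=[0.8, 0.2], random_state=42)

# Blank out roughly 5% of the values so imputation matters
rng = np.random.default_rng(42)
X[rng.random(X.shape) < 0.05] = np.nan

pipeline = Pipeline([
    ('imputer', SimpleImputer()),
    ('scaler', StandardScaler()),
    ('features', SelectKBest(score_func=f_classif)),
    ('classifier', RandomForestClassifier(n_estimators=50, random_state=42)),
])

# Step names prefix the parameter they tune: '<step>__<parameter>'
param_grid = {
    'imputer__strategy': ['mean', 'median'],
    'features__k': [3, 5, 'all'],
    'classifier__class_weight': [None, 'balanced'],
}

search = GridSearchCV(
    pipeline,
    param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring='balanced_accuracy',
)
search.fit(X, y)

print('Best parameters:', search.best_params_)
print(f'Best cross-validated balanced accuracy: {search.best_score_:.3f}')
```

Because imputation, scaling, and selection live inside the pipeline, each fold refits them on its own training portion, so no information from the validation fold leaks into preprocessing.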