**Q1. You are working on a machine learning project where you have a dataset containing numerical and
categorical features. You have identified that some of the features are highly correlated and there are
missing values in some of the columns. You want to build a pipeline that automates the feature
engineering process and handles the missing values.**

1. Automated feature selection:
Use SelectKBest with f_classif (the ANOVA F-test) as the scoring function for feature selection.

2. Numerical pipeline:
   * Impute missing values with the mean.
   * Scale the features using standardization.

3. Categorical pipeline:
   * Impute missing values with the most frequent value.
   * One-hot encode the categorical features.

4. Combine pipelines:
Use ColumnTransformer to combine numerical and categorical pipelines.

5. Model training and evaluation:
Use a RandomForestClassifier and evaluate its performance.
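
The steps above do not cover the highly correlated features mentioned in the question. One possible way to handle them, sketched here as an assumption rather than as part of the original solution, is to drop one column from each pair whose absolute Pearson correlation exceeds a threshold. The helper name `drop_highly_correlated` and the 0.9 threshold are hypothetical choices for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris


def drop_highly_correlated(frame: pd.DataFrame, threshold: float = 0.9):
    """Hypothetical helper: drop one column from each pair with |r| > threshold."""
    corr = frame.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    # A column is dropped if it correlates strongly with any earlier column
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return frame.drop(columns=to_drop), to_drop


X_iris = load_iris(as_frame=True).data
X_reduced, dropped = drop_highly_correlated(X_iris, threshold=0.9)
print('Dropped columns:', dropped)
print('Remaining columns:', list(X_reduced.columns))
```

In a real project the columns to drop should be chosen on the training split only and then applied to the test split, so that no information leaks from the test data.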

In [1]:
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load your dataset (replace 'your_dataset.csv' with your actual dataset)
# data = pd.read_csv('your_dataset.csv')

# For illustration, using a sample dataset from scikit-learn
from sklearn.datasets import load_iris
data = load_iris(as_frame=True)
df = data.frame  # with as_frame=True, the frame already includes the 'target' column

# Split the data into features and target
X = df.drop('target', axis=1)
y = df['target']

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the numerical and categorical columns
numerical_cols = X.select_dtypes(include=['float64', 'int64']).columns
categorical_cols = X.select_dtypes(include=['object', 'category']).columns

# Numerical pipeline
numerical_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# Categorical pipeline
categorical_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine numerical and categorical pipelines
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_pipeline, numerical_cols),
        ('cat', categorical_pipeline, categorical_cols)
    ]
)

# Feature selection
# Note: k='all' keeps every feature, so no selection happens; set an integer k to select
feature_selection = SelectKBest(score_func=f_classif, k='all')

# Complete pipeline
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('feature_selection', feature_selection),
    ('classifier', RandomForestClassifier(random_state=42))
])

# Train the model
pipeline.fit(X_train, y_train)

# Predict on the test set
y_pred = pipeline.predict(X_test)

# Evaluate the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Model accuracy: {accuracy:.2f}')


Model accuracy: 1.00
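
The iris dataset has no missing values and no categorical columns, so the imputers and the one-hot encoder in the pipeline above are never exercised. The following minimal sketch runs the same preprocessing design on a small made-up frame; the column names `age` and `city` and all values are hypothetical and exist only to show the transformers at work.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one numerical and one categorical column, each with a gap
toy = pd.DataFrame({
    'age': [25.0, np.nan, 35.0, 40.0],
    'city': ['Paris', 'Rome', np.nan, 'Paris'],
})

toy_preprocessor = ColumnTransformer(transformers=[
    ('num', Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='mean')),
        ('scaler', StandardScaler()),
    ]), ['age']),
    ('cat', Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder(handle_unknown='ignore')),
    ]), ['city']),
])

transformed = toy_preprocessor.fit_transform(toy)
# ColumnTransformer can return a sparse matrix; densify for inspection
if hasattr(transformed, 'toarray'):
    transformed = transformed.toarray()

print(transformed.shape)
print('Any NaN left:', bool(np.isnan(transformed).any()))
```

The result has one scaled `age` column plus one indicator column per city, and no missing values remain after imputation.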


**Q2. Build a pipeline that includes a random forest classifier and a logistic regression classifier, and then
use a voting classifier to combine their predictions. Train the pipeline on the iris dataset and evaluate its
accuracy.**

In [2]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import VotingClassifier

# Create the voting classifier with RandomForest and LogisticRegression
voting_clf = VotingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(random_state=42)),
        ('lr', LogisticRegression(max_iter=1000, random_state=42))
    ],
    voting='hard'  # Use 'soft' if you want to use predicted probabilities
)

# Complete pipeline with voting classifier
voting_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('feature_selection', feature_selection),
    ('classifier', voting_clf)
])

# Train the model
voting_pipeline.fit(X_train, y_train)

# Predict on the test set
y_pred_voting = voting_pipeline.predict(X_test)

# Evaluate the accuracy
accuracy_voting = accuracy_score(y_test, y_pred_voting)
print(f'Voting classifier model accuracy: {accuracy_voting:.2f}')


Voting classifier model accuracy: 1.00
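
The code comment above mentions `voting='soft'` as an alternative. With hard voting and only two estimators, a disagreement is a tie, which scikit-learn resolves by picking the class that comes first in sorted label order. Soft voting averages the predicted class probabilities instead. The sketch below is a self-contained variant, not the original solution: it reloads iris and, as an assumption, scales the inputs only for the logistic regression.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Soft voting averages predict_proba outputs, so every estimator must provide them
soft_voting_clf = VotingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(random_state=42)),
        ('lr', make_pipeline(
            StandardScaler(),
            LogisticRegression(max_iter=1000, random_state=42),
        )),
    ],
    voting='soft',
)
soft_voting_clf.fit(X_train, y_train)

# Averaged class probabilities, one row per test sample
proba = soft_voting_clf.predict_proba(X_test)
soft_accuracy = accuracy_score(y_test, soft_voting_clf.predict(X_test))
print(f'Soft voting accuracy: {soft_accuracy:.2f}')
```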


**Interpretation and Suggestions**
1. Accuracy Interpretation:
   * Both the random forest pipeline and the voting classifier reach an accuracy of 1.00 on the test set. With test_size=0.2 on the 150-sample iris dataset, that test set holds only 30 samples, so the score is a coarse estimate.
   * Because the two scores are identical, this split cannot show whether the ensemble improves on the random forest alone; a harder dataset or a cross-validated comparison is needed to tell them apart.

2. Possible Improvements:
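   * Feature Engineering: Address the highly correlated features raised in Q1, which the pipelines above leave untouched, for example by dropping one feature of each highly correlated pair or by applying PCA.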
   * Hyperparameter Tuning: Use grid search or randomized search to tune the hyperparameters of the classifiers.
   * Cross-Validation: Employ cross-validation to get a more robust estimate of the model performance.
   * Handling Class Imbalance: Iris is balanced (50 samples per class), so this is not needed here; for an imbalanced dataset, consider techniques like SMOTE or class weighting.
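
The cross-validation and hyperparameter tuning suggestions can be sketched as follows. This is a minimal illustration under stated assumptions, not a tuned setup: the parameter grid, the 5-fold stratified split, and the simplified pipeline (scaling, `SelectKBest`, random forest) are all choices made here for brevity.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import (
    GridSearchCV,
    StratifiedKFold,
    cross_val_score,
    train_test_split,
)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

cv_pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('feature_selection', SelectKBest(score_func=f_classif, k='all')),
    ('classifier', RandomForestClassifier(random_state=42)),
])

# Stratified folds keep the class proportions equal in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Cross-validated accuracy on the training split only
cv_scores = cross_val_score(cv_pipeline, X_train, y_train, cv=cv, scoring='accuracy')
print(f'CV accuracy: {cv_scores.mean():.2f} +/- {cv_scores.std():.2f}')

# Small, assumed grid: number of selected features and tree depth
param_grid = {
    'feature_selection__k': [2, 'all'],
    'classifier__max_depth': [None, 3],
}
search = GridSearchCV(cv_pipeline, param_grid, cv=cv, scoring='accuracy')
search.fit(X_train, y_train)

print('Best parameters:', search.best_params_)
# The test split is used once, after tuning, for the final estimate
print(f'Held-out test accuracy: {search.score(X_test, y_test):.2f}')
```

Tuning inside cross-validation on the training split and scoring the test split only once keeps the final accuracy estimate independent of the hyperparameter search.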