Q1: Designing the Pipeline

1. Automated Feature Selection
Use SelectKBest from sklearn.feature_selection to score features and keep the most informative ones. In the pipeline, this step runs after preprocessing, so it sees imputed, scaled, and encoded features.

2. Numerical Pipeline
Impute missing values using the mean.
Scale numerical features using standardization.

3. Categorical Pipeline
Impute missing values using the most frequent value.
One-hot encode categorical features.

4. Combine Pipelines
Combine the numerical and categorical pipelines using ColumnTransformer.

5. Final Model
Use a Random Forest Classifier for the final model.

6. Evaluation
Evaluate the model using accuracy on the test dataset.


Code Implementation

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import SelectKBest, f_classif

# Load dataset (assuming it's a CSV file)
url = 'https://drive.google.com/uc?export=download&id=1bGoIE4Z2kG5nyh-fGZAJ7LH0ki3UfmSJ'
data = pd.read_csv(url)

# Split data into features and target
X = data.drop('target', axis=1)
y = data['target']

# Identify numerical and categorical columns
numerical_cols = X.select_dtypes(include=['int64', 'float64']).columns
categorical_cols = X.select_dtypes(include=['object', 'category']).columns

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Numerical pipeline
numerical_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# Categorical pipeline
categorical_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine pipelines
preprocessor = ColumnTransformer(transformers=[
    ('num', numerical_pipeline, numerical_cols),
    ('cat', categorical_pipeline, categorical_cols)
])

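# Feature selection: score each preprocessed feature with the ANOVA F-test.
# Note: k='all' keeps every feature, so nothing is filtered out yet;
# set k to an integer to keep only the k highest-scoring features.
feature_selection = SelectKBest(score_func=f_classif, k='all')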

# Create final pipeline
model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('feature_selection', feature_selection),
    ('classifier', RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42))
])

# Train model
model_pipeline.fit(X_train, y_train)

# Predict on test set
y_pred = model_pipeline.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')
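
Because k='all' keeps every feature, the selection step above does not filter anything yet. One way to make it automatic is to let cross-validation choose k. The following is a minimal sketch under stated assumptions: it runs on a small synthetic DataFrame (the columns signal, noise_a, noise_b, and group are hypothetical, not taken from the CSV above), and the candidate values of k are picked only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the real dataset (hypothetical columns).
rng = np.random.default_rng(42)
n = 300
toy = pd.DataFrame({
    "signal": rng.normal(size=n),
    "noise_a": rng.normal(size=n),
    "noise_b": rng.normal(size=n),
    "group": rng.choice(["a", "b", "c"], size=n),
})
# Only "signal" carries information about the target.
toy_y = (toy["signal"] + 0.3 * rng.normal(size=n) > 0).astype(int)
# Blank out a few numeric values so the mean imputer has work to do.
toy.loc[toy.index[::25], "noise_a"] = np.nan

toy_num_cols = ["signal", "noise_a", "noise_b"]
toy_cat_cols = ["group"]

# Same preprocessing layout as the pipeline above.
toy_preprocessor = ColumnTransformer(transformers=[
    ("num", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="mean")),
        ("scaler", StandardScaler()),
    ]), toy_num_cols),
    ("cat", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), toy_cat_cols),
])

toy_pipeline = Pipeline(steps=[
    ("preprocessor", toy_preprocessor),
    ("feature_selection", SelectKBest(score_func=f_classif)),
    ("classifier", RandomForestClassifier(n_estimators=50, random_state=42)),
])

X_tr, X_te, y_tr, y_te = train_test_split(
    toy, toy_y, test_size=0.3, random_state=42
)

# Preprocessing yields 3 numeric + 3 one-hot columns, so k=6 keeps everything.
search = GridSearchCV(
    toy_pipeline,
    param_grid={"feature_selection__k": [1, 3, 6]},
    cv=3,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

print("Best k:", search.best_params_["feature_selection__k"])
print(f"Test accuracy: {search.score(X_te, y_te):.2f}")
```

The double-underscore name feature_selection__k addresses the k parameter of the pipeline step named feature_selection, so the same grid could be attached to model_pipeline once the number of preprocessed features in the real data is known.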

Q2: Voting Classifier with Random Forest and Logistic Regression

Steps Involved:

1. Numerical and Categorical Pipelines
Same as in Q1.

2. Combine Pipelines
Same as in Q1.

3. Define Classifiers
Random Forest Classifier
Logistic Regression Classifier

4. Voting Classifier
Combine both classifiers using VotingClassifier.

5. Evaluation
Evaluate the combined model.

Code Implementation
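
The steps listed for Q2 can be sketched as follows. This is a minimal, self-contained illustration, not a definitive implementation: it uses synthetic data (the columns x1, x2, and colour are hypothetical), and it assumes soft voting, which the steps do not specify. With only two classifiers, hard voting would resolve every disagreement by class-label order, so averaging predicted probabilities seemed the more useful choice here.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (hypothetical columns, not the Q1 dataset).
rng = np.random.default_rng(0)
n = 400
vote_X = pd.DataFrame({
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
    "colour": rng.choice(["red", "green", "blue"], size=n),
})
vote_y = ((vote_X["x1"] - vote_X["x2"] + 0.5 * rng.normal(size=n)) > 0).astype(int)
# Blank out a few numeric values so the mean imputer has work to do.
vote_X.loc[vote_X.index[::30], "x2"] = np.nan

vote_num_cols = ["x1", "x2"]
vote_cat_cols = ["colour"]

# Steps 1 and 2: numerical and categorical pipelines, combined as in Q1.
vote_preprocessor = ColumnTransformer(transformers=[
    ("num", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="mean")),
        ("scaler", StandardScaler()),
    ]), vote_num_cols),
    ("cat", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), vote_cat_cols),
])

# Steps 3 and 4: define both classifiers and combine them.
voting_clf = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    voting="soft",  # assumption: average predicted probabilities
)

vote_pipeline = Pipeline(steps=[
    ("preprocessor", vote_preprocessor),
    ("classifier", voting_clf),
])

Xv_train, Xv_test, yv_train, yv_test = train_test_split(
    vote_X, vote_y, test_size=0.3, random_state=42
)

# Step 5: train and evaluate the combined model.
vote_pipeline.fit(Xv_train, yv_train)
vote_pred = vote_pipeline.predict(Xv_test)
vote_accuracy = accuracy_score(yv_test, vote_pred)
print(f"Accuracy: {vote_accuracy:.2f}")
```

Putting the VotingClassifier inside the same Pipeline as the preprocessor means both base classifiers are trained on identically transformed features, and the test set is transformed using statistics learned from the training set only.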