## Beginner-Friendly Machine Learning Pipelines

### Q1: Feature Engineering Pipeline - Handling Missing Data & Feature Selection

#### Steps:
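1. **Select Features**: Decide which columns feed the model. The code in this section keeps every column except `target` and groups the rest by data type; it does not rank features by importance.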
2. **Prepare Numerical Data**:
   - Fill missing values with the mean.
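   - Standardize each numerical feature to zero mean and unit variance. Random forests do not need this, but it keeps the preprocessing reusable with scale-sensitive models such as logistic regression.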
3. **Prepare Categorical Data**:
   - Fill missing values with the most common category.
   - Convert categories into numerical form using one-hot encoding.
4. **Combine Everything**: Use `ColumnTransformer` to apply transformations to both numerical and categorical features.
5. **Train a Model**: Use `RandomForestClassifier` to make predictions.
6. **Evaluate the Model**: Measure accuracy on a held-out test set that the model never saw during training.
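
To make steps 2 to 4 concrete, here is a minimal sketch on a hypothetical four-row table. The column names `age` and `city` and their values are invented for illustration; `sparse_threshold=0` is set only so the result prints as a dense array.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with one missing value in each column
toy = pd.DataFrame({
    "age": [20.0, np.nan, 40.0, 30.0],
    "city": ["Paris", "Paris", np.nan, "Rome"],
})

toy_num = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),  # missing age becomes 30.0
    ("scaler", StandardScaler()),
])
toy_cat = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),  # missing city becomes "Paris"
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

toy_preprocessor = ColumnTransformer(
    [("num", toy_num, ["age"]), ("cat", toy_cat, ["city"])],
    sparse_threshold=0,  # always return a dense array
)

# Columns in the result: scaled age, city == "Paris", city == "Rome"
toy_transformed = toy_preprocessor.fit_transform(toy)
print(toy_transformed.shape)
print(toy_transformed.round(2))
```

The imputed age sits exactly at the column mean, so it scales to 0, and the row with the missing city ends up encoded as the most common category.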

#### Code:
```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the dataset ("data.csv" is a placeholder; the file must contain a "target" column)
data = pd.read_csv("data.csv")
X = data.drop("target", axis=1)
y = data["target"]

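# Identify numerical and categorical features by dtype.
# "number" also matches int32/float32 columns, which ["int64", "float64"] would silently skip.
num_features = X.select_dtypes(include="number").columns
cat_features = X.select_dtypes(include=["object", "category"]).columns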

# Create pipelines for processing
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine both pipelines
preprocessor = ColumnTransformer([
    ('num', num_pipeline, num_features),
    ('cat', cat_pipeline, cat_features)
])

# Final pipeline including the model
model_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model
model_pipeline.fit(X_train, y_train)

# Make predictions and evaluate accuracy
y_pred = model_pipeline.predict(X_test)
print("Model Accuracy:", accuracy_score(y_test, y_pred))
```
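
The pipeline above feeds every column to the model. If you also want an explicit feature-selection step, one possible approach (an assumption on my part, not part of the original pipeline) is scikit-learn's `SelectFromModel`, which keeps the features whose importance in a fitted estimator meets a threshold. The sketch below uses synthetic data from `make_classification` so that it runs on its own; the `threshold="median"` setting is an arbitrary illustrative choice.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic data: 10 features, only 3 of which carry signal
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=3, n_redundant=0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

selection_pipeline = Pipeline([
    # Keep features whose forest importance is at least the median importance
    ("selector", SelectFromModel(
        RandomForestClassifier(n_estimators=100, random_state=42),
        threshold="median",
    )),
    ("classifier", RandomForestClassifier(n_estimators=100, random_state=42)),
])
selection_pipeline.fit(X_tr, y_tr)

kept = selection_pipeline.named_steps["selector"].get_support()
selected_accuracy = selection_pipeline.score(X_te, y_te)
print("Features kept:", int(kept.sum()), "of", kept.size)
print("Accuracy with selection:", selected_accuracy)
```

Because the selector is a pipeline step, it is fitted on the training split only, so the test data does not influence which features are kept. In the Q1 pipeline, such a step would sit between `preprocessor` and `classifier`.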

### Q2: Combining Models with a Voting Classifier

#### Steps:
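1. Define two different models: `RandomForestClassifier` and `LogisticRegression`.
2. Combine them with `VotingClassifier` using hard voting, where each model casts one vote for a class and the majority wins.
3. Measure accuracy on a held-out test split of the Iris dataset.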

#### Code:
```python
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define two models
rf = RandomForestClassifier(n_estimators=100, random_state=42)
lr = LogisticRegression(max_iter=200)

# Combine them in a voting classifier
voting_clf = VotingClassifier(estimators=[('rf', rf), ('lr', lr)], voting='hard')
voting_clf.fit(X_train, y_train)

# Evaluate performance
print("Voting Classifier Accuracy:", voting_clf.score(X_test, y_test))
```
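
With only two estimators, hard voting produces a tie whenever the models disagree, and scikit-learn then picks the class that comes first in ascending sort order. A common alternative is soft voting, which averages the predicted class probabilities instead. The sketch below is one way to compare each model against a soft-voting ensemble on the same split; the variable names, `stratify=y_iris`, and `max_iter=1000` are my own choices rather than part of the original example.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.2, random_state=42, stratify=y_iris
)

candidates = {
    "random forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "logistic regression": LogisticRegression(max_iter=1000),
    # Soft voting averages predict_proba outputs, so both models must support it
    "soft voting": VotingClassifier(
        estimators=[
            ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
            ("lr", LogisticRegression(max_iter=1000)),
        ],
        voting="soft",
    ),
}

# Fit each candidate on the same training split and score it on the same test split
scores = {}
for name, estimator in candidates.items():
    estimator.fit(X_tr, y_tr)
    scores[name] = estimator.score(X_te, y_te)
    print(f"{name}: {scores[name]:.3f}")
```

On a dataset as small and easy as Iris the three scores are often close, so treat the comparison as a template rather than evidence that one approach wins.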

### Summary:
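- We built a pipeline that automates imputation, scaling, one-hot encoding, and model training in a single object.
- We combined two different classifiers with a voting mechanism, which can improve accuracy when the models make different kinds of errors.
- Voting can make predictions more stable, but it does not guarantee higher accuracy or remove bias, so compare the ensemble against each individual model on the same test data.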
