Q1: Design a Machine Learning Pipeline for Feature Engineering and Handling Missing Values

In this machine learning project, you have a dataset that contains both numerical and categorical features. Some features are highly correlated, and there are missing values in some columns. You want to build a pipeline to automate the feature engineering process and handle missing values. Here's a pipeline with the required steps:

1. **Feature Selection**:
   - Use an automated feature selection method to identify the important features in the dataset.

2. **Numerical Pipeline**:
   - Impute missing values in the numerical columns using the mean of the column values.
   - Scale the numerical columns using standardization (e.g., Z-score normalization).

3. **Categorical Pipeline**:
   - Impute missing values in the categorical columns using the most frequent value of the column.
   - Perform one-hot encoding on the categorical columns to convert them into a numerical format.

4. **Column Transformation**:
   - Combine the numerical and categorical pipelines using a `ColumnTransformer` to process both types of features together.

5. **Model Building**:
   - Use a Random Forest Classifier to build the final predictive model.

6. **Model Evaluation**:
   - Evaluate the accuracy of the model on a test dataset to assess its performance.

This pipeline ensures that missing values are handled appropriately, features are transformed for modeling, and the final model is trained and evaluated.

In [1]:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import seaborn as sns


In [12]:
# loading the dataset 
df = sns.load_dataset("tips")
df.head()

   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4


In [3]:

# Encode the target column `time` as integer class labels
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
df["time"] = encoder.fit_transform(df.time)
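To see which integer each label receives, the encoder's `classes_` attribute can be inspected after fitting. The following is a minimal sketch on a hypothetical toy list standing in for the `time` column; `time_values` and `label_to_code` are illustrative names, not part of the notebook.

```python
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the `time` column (hypothetical values, not the real data)
time_values = ["Dinner", "Lunch", "Dinner", "Lunch", "Dinner"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(time_values)

# classes_ stores the original labels in sorted order; each label's
# position in that array is the integer code assigned to it
label_to_code = {str(label): code for code, label in enumerate(encoder.classes_)}

# inverse_transform maps the integer codes back to the original labels
decoded = encoder.inverse_transform(encoded)

print(label_to_code)
```

Keeping the fitted encoder around makes it possible to translate predictions back into readable labels with `inverse_transform`.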

In [4]:
# Separate the features (x) from the target (y)

x = df.drop("time", axis = 1)
y = df.time

In [5]:
# Split the data into training (70%) and test (30%) sets

x_train, x_test, y_train, y_test = train_test_split(x,y,test_size= .30, random_state=42)

In [6]:
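# Automate feature engineering and feature scaling

categorical_cols = ['sex', 'smoker','day']
numerical_cols = ['total_bill', 'tip','size']

# Numerical columns: fill gaps with the column mean, then standardize
num_pipeline = Pipeline(steps = [('imputer', SimpleImputer(strategy='mean')),
                                 ('scaler', StandardScaler())])

# Categorical columns: fill gaps with the most frequent value, then one-hot encode;
# handle_unknown='ignore' avoids errors on categories not seen during fit
cat_pipeline = Pipeline(steps = [('imputer', SimpleImputer(strategy='most_frequent')),
                                ('onehotencoder', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer([('num_pipeline', num_pipeline, numerical_cols),
                                  ('cat_pipeline', cat_pipeline, categorical_cols)])

# Fit on the training data only, then apply the same fitted transformers to the test data
x_train = preprocessor.fit_transform(x_train)
x_test = preprocessor.transform(x_test)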
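Because the cells above never show the imputers actually filling a gap, a small check on data with deliberate holes can help confirm the preprocessing behaves as described. This is a sketch under stated assumptions: `toy` is a hypothetical frame that only mimics the tips column names, and the numerical imputer uses the mean strategy named in the task description.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame with missing entries (not the real tips data)
toy = pd.DataFrame({
    "total_bill": [10.0, np.nan, 30.0, 20.0],
    "tip": [1.0, 2.0, np.nan, 3.0],
    "size": [2, 3, 2, 4],
    "sex": ["Male", "Female", np.nan, "Male"],
    "smoker": ["No", "No", "Yes", np.nan],
    "day": ["Sun", "Sat", "Sun", "Sun"],
})

numerical_cols = ["total_bill", "tip", "size"]
categorical_cols = ["sex", "smoker", "day"]

num_pipeline = Pipeline(steps=[("imputer", SimpleImputer(strategy="mean")),
                               ("scaler", StandardScaler())])
cat_pipeline = Pipeline(steps=[("imputer", SimpleImputer(strategy="most_frequent")),
                               ("onehotencoder", OneHotEncoder(handle_unknown="ignore"))])
preprocessor = ColumnTransformer([("num_pipeline", num_pipeline, numerical_cols),
                                  ("cat_pipeline", cat_pipeline, categorical_cols)])

transformed = preprocessor.fit_transform(toy)

# Fill values learned by the numerical imputer, one per numerical column
num_fill = preprocessor.named_transformers_["num_pipeline"].named_steps["imputer"].statistics_

print("Output shape:", transformed.shape)
print("Numerical fill values:", dict(zip(numerical_cols, num_fill)))
```

The output has three scaled numerical columns plus one column per category level, and no missing values remain after the transform.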

In [7]:
## Model training

# No random_state is set, so the accuracy can vary slightly between runs
rf_clf = RandomForestClassifier()
rf_clf.fit(x_train, y_train)

In [8]:
y_pred = rf_clf.predict(x_test)

In [9]:
## Evaluation 

print("Accuracy score :", accuracy_score(y_test, y_pred))

Accuracy score : 0.972972972972973
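Step 1 of the list, automated feature selection, can be sketched with `SelectFromModel`, which the imports cell already brings in. The example below is one possible approach rather than the only one: it uses synthetic data from `make_classification` as a stand-in, and the `threshold="median"` setting and the step names `select` and `model` are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in data: 10 features, of which only 3 are informative
X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           n_redundant=0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

pipe = Pipeline(steps=[
    # Keep features whose forest importance is at least the median importance
    ("select", SelectFromModel(RandomForestClassifier(random_state=42),
                               threshold="median")),
    # Train the final classifier on the selected features only
    ("model", RandomForestClassifier(random_state=42)),
])
pipe.fit(X_train, y_train)

# Boolean mask over the input features: True means the feature was kept
mask = pipe.named_steps["select"].get_support()
accuracy = pipe.score(X_test, y_test)

print("Features kept:", int(mask.sum()), "of", mask.size)
print("Test accuracy:", accuracy)
```

Placing the selector inside the `Pipeline` means it is fitted on the training data only, so the test set does not influence which features are kept.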


Q2: Build a Pipeline with Random Forest and Logistic Regression Classifiers

Build a pipeline that includes both a Random Forest Classifier and a Logistic Regression Classifier. Use a Voting Classifier to combine their predictions. Train the pipeline on the iris dataset and evaluate its accuracy.


In [11]:

from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import warnings

# Ignore all warnings
warnings.filterwarnings("ignore")

# Load the iris dataset (for demonstration purposes)
iris = load_iris()
X = iris.data
y = iris.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create individual classifiers
rf_classifier = RandomForestClassifier()
logistic_classifier = LogisticRegression()

# Create a voting classifier that combines predictions from both classifiers
voting_classifier = VotingClassifier(estimators=[
    ('random_forest', rf_classifier),
    ('logistic_regression', logistic_classifier)
], voting='soft')  # You can use 'hard' or 'soft' voting

# Fit the voting classifier to the training data
voting_classifier.fit(X_train, y_train)

# Make predictions using the voting classifier
y_pred = voting_classifier.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of the Voting Classifier: {accuracy}")


Accuracy of the Voting Classifier: 1.0
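Since the question asks for a pipeline, the voting classifier can also be wrapped in a `Pipeline` object. The following is a minimal sketch, not the notebook's method: it adds a `StandardScaler` step and sets `max_iter=1000` on the logistic regression, both of which are assumptions made here so the solver converges without suppressing warnings.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Soft voting averages the predicted class probabilities of both models
voting = VotingClassifier(estimators=[
    ("random_forest", RandomForestClassifier(random_state=42)),
    ("logistic_regression", LogisticRegression(max_iter=1000)),
], voting="soft")

# Scale the features first, then pass them to the voting ensemble
pipe = Pipeline(steps=[("scaler", StandardScaler()),
                       ("voting", voting)])
pipe.fit(X_train, y_train)

accuracy = pipe.score(X_test, y_test)
probabilities = pipe.predict_proba(X_test)

print(f"Accuracy of the pipeline: {accuracy}")
```

With the scaler inside the pipeline, a single `fit` call learns the scaling parameters and both classifiers from the training data, and `predict` applies the same scaling to new samples.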
