# Ensemble Techniques Assignment 5

### Q1. You are working on a machine learning project where you have a dataset containing numerical and categorical features. You have identified that some of the features are highly correlated, and there are missing values in some of the columns. You want to build a pipeline that automates the feature engineering process and handles the missing values.

Design a pipeline that includes the following steps:

* Use an automated feature selection method to identify the important features in the dataset.
* Create a numerical pipeline that includes the following steps:
    * Impute the missing values in the numerical columns using the mean of the column values.
    * Scale the numerical columns using standardization.
* Create a categorical pipeline that includes the following steps:
    * Impute the missing values in the categorical columns using the most frequent value of the column.
    * One-hot encode the categorical columns.
* Combine the numerical and categorical pipelines using a ColumnTransformer.
* Use a Random Forest Classifier to build the final model.
* Evaluate the accuracy of the model on the test dataset.

Note: Your solution should include code snippets for each step of the pipeline and a brief explanation of each step. You should also provide an interpretation of the results and suggest possible improvements for the pipeline.


In [1]:
import seaborn as sns
df = sns.load_dataset('tips')

In [2]:
df.dtypes

total_bill     float64
tip            float64
sex           category
smoker        category
day           category
time          category
size             int64
dtype: object

In [3]:
num_cols=['total_bill','tip','size']
cat_cols=['sex','smoker','day']

In [4]:
X = df.drop('time',axis=1)
y = df.time
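
The question assumes missing values and highly correlated features, so it may be worth checking both before building the pipeline. The sketch below shows one way to do that. It runs on a hypothetical synthetic frame (`demo`) that only borrows the column names from `tips`; the values, the injected missing entries, and the 0.8 correlation threshold are all assumptions made for illustration, not properties of the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `df`: same column names as tips, synthetic values.
rng = np.random.default_rng(0)
n = 200
total_bill = rng.uniform(5, 50, n)
demo = pd.DataFrame({
    "total_bill": total_bill,
    # Deliberately correlated with total_bill so the check has something to find
    "tip": total_bill * 0.15 + rng.normal(0, 0.5, n),
    "size": rng.integers(1, 7, n),
    "sex": rng.choice(["Male", "Female"], n),
    "smoker": rng.choice(["Yes", "No"], n),
    "day": rng.choice(["Thur", "Fri", "Sat", "Sun"], n),
})
# Inject missing values into the first 10 rows of `tip` (illustration only)
demo.loc[demo.index[:10], "tip"] = np.nan

# Count missing values per column
missing_counts = demo.isna().sum()

# Absolute pairwise correlation between the numerical columns
corr = demo[["total_bill", "tip", "size"]].corr().abs()

# Collect column pairs whose correlation exceeds the assumed threshold
threshold = 0.8
high_pairs = [
    (a, b, round(corr.loc[a, b], 3))
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if corr.loc[a, b] > threshold
]

print(missing_counts)
print(high_pairs)
```

Running the same two checks on `df` should indicate whether the imputers in the next cell have any work to do. The `tips` data as distributed with seaborn appears to contain no missing values, in which case the imputation steps act as safeguards rather than as active transformations.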

In [5]:

##importing necessary libraries as asked in the question for numerical columns
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

##importing necessary libraries as asked in the question for categorical columns
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

##importing RandomForestClassifier for training the model
from sklearn.ensemble import RandomForestClassifier

## importing Pipeline
from sklearn.pipeline import Pipeline

## Create a Pipeline for numerical columns
# Applying SimpleImputer with mean Strategy and StandardScaler
num_pipe=Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])


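## Create a Pipeline for categorical columns
# Applying SimpleImputer with 'most_frequent' strategy and OneHotEncoder
# handle_unknown='ignore' avoids an error if the test set contains a category unseen during fit
cat_pipe=Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])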

## Combining the numerical and categorical pipelines using the ColumnTransformer
preprocessor = ColumnTransformer([
    ('numeric_columns',num_pipe,num_cols),
    ('categorical_columns',cat_pipe,cat_cols)
])

In [6]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.33, random_state=42)

In [7]:
X_train=preprocessor.fit_transform(X_train)
X_test=preprocessor.transform(X_test)

In [8]:
clf=RandomForestClassifier()
clf.fit(X_train, y_train)

In [9]:
y_pred=clf.predict(X_test)

In [10]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test,y_pred)

0.9753086419753086
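
The cells above cover imputation, scaling, encoding, and the classifier, but not the automated feature selection step that the question asks for. One possible way to add it is `SelectFromModel` inside a single `Pipeline`, so that preprocessing, selection, and the classifier are fitted together on the training split only. The sketch below is a minimal illustration under stated assumptions: it uses hypothetical synthetic data with the same column names as `tips`, a target that is tied to `day` purely for demonstration, and a `"median"` importance threshold chosen arbitrarily.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical synthetic data borrowing the tips column names
rng = np.random.default_rng(42)
n = 300
day = rng.choice(["Thur", "Fri", "Sat", "Sun"], n)
data = pd.DataFrame({
    "total_bill": rng.uniform(5, 50, n),
    "tip": rng.uniform(1, 10, n),
    "size": rng.integers(1, 7, n).astype(float),
    "sex": rng.choice(["Male", "Female"], n),
    "smoker": rng.choice(["Yes", "No"], n),
    "day": day,
})
# Target derived from `day` for illustration only
target = np.where(day == "Thur", "Lunch", "Dinner")
# Inject missing values so the imputers are exercised
data.loc[data.index[::15], "total_bill"] = np.nan
data.loc[data.index[::20], "smoker"] = np.nan

demo_num_cols = ["total_bill", "tip", "size"]
demo_cat_cols = ["sex", "smoker", "day"]

# Same preprocessing as in the notebook: impute + scale, impute + one-hot
demo_preprocessor = ColumnTransformer([
    ("num", Pipeline([("imputer", SimpleImputer(strategy="mean")),
                      ("scaler", StandardScaler())]), demo_num_cols),
    ("cat", Pipeline([("imputer", SimpleImputer(strategy="most_frequent")),
                      ("encoder", OneHotEncoder(handle_unknown="ignore"))]), demo_cat_cols),
])

# Preprocess -> keep features with at least median importance -> classify
full_model = Pipeline([
    ("preprocess", demo_preprocessor),
    ("select", SelectFromModel(
        RandomForestClassifier(n_estimators=100, random_state=42),
        threshold="median")),
    ("clf", RandomForestClassifier(n_estimators=100, random_state=42)),
])

X_tr, X_te, y_tr, y_te = train_test_split(
    data, target, test_size=0.33, random_state=42, stratify=target)
full_model.fit(X_tr, y_tr)

test_accuracy = full_model.score(X_te, y_te)
n_selected = int(full_model.named_steps["select"].get_support().sum())
print(test_accuracy, n_selected)
```

As for interpreting the notebook's result, the accuracy of about 0.975 on `tips` likely reflects how strongly `day` is associated with `time`, rather than a hard classification problem being solved. Possible improvements include fixing `random_state` on the classifier for reproducibility, stratifying the split, reporting cross-validated scores instead of a single split, and tuning hyperparameters with a grid or randomized search.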

### Q2. Build a pipeline that includes a random forest classifier and a logistic regression classifier, and then use a voting classifier to combine their predictions. Train the pipeline on the iris dataset and evaluate its accuracy.

In [11]:
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

# Define the two base estimators directly. A Pipeline cannot chain two
# classifiers, because every step except the last must be a transformer.
rf = RandomForestClassifier()
lr = LogisticRegression(max_iter=1000)

# Create the voting classifier; soft voting averages the predicted class probabilities
voting_classifier = VotingClassifier(estimators=[('rf', rf), ('lr', lr)], voting='soft')

# Train the voting classifier on the iris dataset
iris = sns.load_dataset('iris')
X = iris.drop('species', axis=1)
y = iris['species']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

voting_classifier.fit(X_train, y_train)

# Evaluate the accuracy of the pipeline
accuracy = voting_classifier.score(X_test, y_test)

print('The accuracy of the pipeline is:', accuracy)



The accuracy of the pipeline is: 1.0
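
A perfect score on a single split of a 150-row dataset is a noisy estimate, so cross-validation may give a more trustworthy picture. The sketch below is one way to do that, not the method used above: it loads iris from `sklearn.datasets` instead of seaborn, wraps the logistic regression in a small `Pipeline` with a `StandardScaler`, and uses 5 folds. Those three choices are assumptions made for this illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Load iris as NumPy arrays (features, integer class labels)
X_iris, y_iris = load_iris(return_X_y=True)

# Logistic regression benefits from scaled inputs, so give it its own pipeline
lr_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("lr", LogisticRegression(max_iter=1000)),
])
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)

# Soft voting averages the class probabilities of both base models
voter = VotingClassifier(
    estimators=[("rf", rf_clf), ("lr", lr_pipe)],
    voting="soft",
)

# 5-fold cross-validation (stratified by default for classifiers)
scores = cross_val_score(voter, X_iris, y_iris, cv=5)
mean_accuracy = scores.mean()
print(scores, mean_accuracy)
```

A `Pipeline` is used here only where it fits naturally, namely to attach a transformer to one of the base estimators; the ensemble itself is still built by `VotingClassifier`.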


## The End