Q1. You are working on a machine learning project where you have a dataset containing numerical and
categorical features. You have identified that some of the features are highly correlated and there are
missing values in some of the columns. You want to build a pipeline that automates the feature
engineering process and handles the missing values.
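
The question mentions correlated features and missing values, so it helps to measure both before building the pipeline. The sketch below uses a small synthetic frame, not the notebook's data. The column names `a`, `b`, `c` and the 0.9 threshold are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "b" is almost a scaled copy of "a", "c" is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a * 2.0 + rng.normal(scale=0.01, size=200),
    "c": rng.normal(size=200),
})
toy.loc[[3, 7, 11], "c"] = np.nan  # inject a few missing values

# Number of missing values per column.
missing_counts = toy.isna().sum()

# Absolute pairwise correlations; keep the strict upper triangle so each pair appears once.
corr = toy.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Columns that are highly correlated with an earlier column (assumed threshold).
threshold = 0.9
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]

print(missing_counts.to_dict())
print(to_drop)
```

Dropping one column from each highly correlated pair is only one possible policy; the notebook below keeps all columns.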

In [1]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
import seaborn as sns

In [2]:
df = sns.load_dataset('tips')

In [3]:
df.head()

   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4


In [4]:
X = df.drop(['time'] , axis=1)
y = df.time

In [5]:
X.head()

   total_bill   tip     sex smoker  day  size
0       16.99  1.01  Female     No  Sun     2
1       10.34  1.66    Male     No  Sun     3
2       21.01  3.50    Male     No  Sun     3
3       23.68  3.31    Male     No  Sun     2
4       24.59  3.61  Female     No  Sun     4


In [6]:
y.head()

0    Dinner
1    Dinner
2    Dinner
3    Dinner
4    Dinner
Name: time, dtype: category
Categories (2, object): ['Lunch', 'Dinner']

In [7]:
y.unique()

['Dinner', 'Lunch']
Categories (2, object): ['Lunch', 'Dinner']

In [8]:
# Encode the target in one step: Lunch -> 0, Dinner -> 1
y = y.map({'Lunch': 0, 'Dinner': 1})

In [9]:
y.value_counts()

1    176
0     68
Name: time, dtype: int64

In [10]:
y.head()

0    1
1    1
2    1
3    1
4    1
Name: time, dtype: category
Categories (2, int64): [0, 1]

In [11]:
X.head(2)

   total_bill   tip     sex smoker  day  size
0       16.99  1.01  Female     No  Sun     2
1       10.34  1.66    Male     No  Sun     3


In [12]:
from sklearn.model_selection import train_test_split

In [13]:
X_train , X_test , y_train , y_test = train_test_split(X,y,test_size=0.33 , random_state=0)

In [14]:
numerical_col = ['total_bill' , 'tip' , 'size']
categorical_col = ['sex' , 'smoker' , 'day']

In [15]:
num_pipline = Pipeline(steps=[
    ('impute' , SimpleImputer(strategy='mean')),  # fill missing numeric values with the column mean
    ('scaler' , StandardScaler())  # standardize to zero mean and unit variance
])

cat_pipline = Pipeline(steps=[
    ('impute' , SimpleImputer(strategy='most_frequent')),  # fill missing categories with the most frequent value
    ('OneHotEncoder' , OneHotEncoder(handle_unknown='ignore'))  # one-hot encode; ignore categories not seen during fit
])

In [16]:
processer = ColumnTransformer([
    ('num_pipline' , num_pipline , numerical_col),
    ('cat_pipline' , cat_pipline , categorical_col)
])  # apply each pipeline to its own columns and concatenate the results

In [17]:
X_train = processer.fit_transform(X_train)
X_test = processer.transform(X_test)
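
An alternative to transforming `X_train` and `X_test` by hand is to put the preprocessing and the classifier into one `Pipeline`, so that `fit` and `predict` always apply the same steps. The sketch below is a minimal version on a made-up frame that actually contains missing values. The column names `amount` and `group`, the labels, and `random_state=0` are assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame with one missing value in each kind of column.
demo = pd.DataFrame({
    "amount": [10.0, 12.5, np.nan, 30.0, 28.0, 9.5],
    "group": ["a", "b", "a", np.nan, "b", "a"],
})
labels = [0, 0, 0, 1, 1, 0]

numeric_steps = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])
categorical_steps = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", numeric_steps, ["amount"]),
    ("cat", categorical_steps, ["group"]),
])

# Preprocessing and model in a single estimator.
model = Pipeline(steps=[
    ("preprocess", preprocess),
    ("clf", RandomForestClassifier(random_state=0)),
])
model.fit(demo, labels)

# One scaled numeric column plus one indicator column per category.
transformed = model.named_steps["preprocess"].transform(demo)
predictions = model.predict(demo)
print(transformed.shape)
```

With this layout the imputation and scaling statistics are learned only from the data passed to `fit`, which also makes the whole model usable with cross-validation helpers.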

In [18]:
from sklearn.ensemble import RandomForestClassifier

In [19]:
classifire = RandomForestClassifier()

In [20]:
classifire.fit(X_train , y_train)

In [21]:
y_pred = classifire.predict(X_test)

In [22]:
from sklearn.metrics import accuracy_score , precision_score , recall_score , f1_score

In [23]:
print('accuracy_score' , accuracy_score(y_test , y_pred))
print('precision_score' , precision_score(y_test , y_pred))
print('recall_score' , recall_score(y_test , y_pred))
print('f1_score' , f1_score(y_test , y_pred))

accuracy_score 0.9135802469135802
precision_score 0.9206349206349206
recall_score 0.9666666666666667
f1_score 0.943089430894309
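
The four printed scores are consistent with a single set of confusion-matrix counts for the 81 test rows, with class 1 (Dinner) as the positive class. The counts below are inferred from the printed values, not read from a confusion matrix in the notebook, so treat them as a worked check of the formulas.

```python
# Counts inferred from the printed scores (assumption: positive class is 1 = Dinner).
tp, fp, fn, tn = 58, 5, 2, 16

accuracy = (tp + tn) / (tp + fp + fn + tn)   # correct predictions / all predictions
precision = tp / (tp + fp)                   # how many predicted positives are right
recall = tp / (tp + fn)                      # how many actual positives are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(accuracy, precision, recall, f1)
```

Recall is higher than precision here because the model misses only 2 Dinner rows but labels 5 Lunch rows as Dinner.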


Q2. Build a pipeline that includes a random forest classifier and a logistic regression classifier, and then use a voting classifier to combine their predictions. Train the pipeline on the iris dataset and evaluate its accuracy.

In [64]:
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier , VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

In [65]:
iris = load_iris()

In [70]:
X = iris.data
y = iris.target

In [72]:
X_train , X_test , y_train , y_test = train_test_split(X , y , test_size=0.33 , random_state=42)

In [73]:
rcf = RandomForestClassifier()
lcf = LogisticRegression()

In [76]:
vot_clf = VotingClassifier(estimators=[
    ('rcf' , rcf),
    ('lcf' , lcf)
] , voting='hard')
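
With `voting='hard'`, each fitted estimator predicts a class label and the most frequent label wins. The sketch below reproduces that vote by hand to show what the combination step does. The names `rf` and `lr`, `random_state=0`, and `max_iter=1000` are assumptions made so the example is reproducible; they are not the notebook's settings.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

vote = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(random_state=0)),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    voting="hard",
).fit(X_train, y_train)

# Predictions of each fitted sub-model, shape (n_estimators, n_test).
per_model = np.array([est.predict(X_test) for est in vote.estimators_])

# Majority vote per test row: count the votes for classes 0, 1, 2 and take the argmax.
# argmax returns the first maximum, so a tie goes to the lowest class label.
manual = np.array([np.bincount(votes, minlength=3).argmax() for votes in per_model.T])

disagreements = int((per_model[0] != per_model[1]).sum())
print(disagreements, np.array_equal(manual, vote.predict(X_test)))
```

With only two estimators, every disagreement is a 1-1 tie, so the result in those rows is decided by the tie-breaking rule rather than by a majority. An odd number of estimators, or `voting='soft'`, avoids that.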

In [77]:
pipline = Pipeline(steps=[
    ('vot_clf' , vot_clf)
])

In [79]:
import warnings
warnings.filterwarnings('ignore')  # silence library warnings (e.g. solver convergence messages) in the output
pipline.fit(X_train , y_train)

In [80]:
y_pred = pipline.predict(X_test)

In [81]:
from sklearn.metrics import accuracy_score
print('accuracy_score' , accuracy_score(y_test , y_pred))

accuracy_score 0.98
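
A single train/test split gives one accuracy number that depends on the split. As a hedged variation, the sketch below adds a `StandardScaler` step in front of the voting classifier and scores the pipeline with 5-fold cross-validation. The scaler, `voting='soft'`, `cv=5`, `random_state=0`, and `max_iter=1000` are all assumptions for illustration and differ from the pipeline above.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

scaled_vote = Pipeline(steps=[
    ("scaler", StandardScaler()),  # helps the logistic regression solver converge
    ("vot_clf", VotingClassifier(
        estimators=[
            ("rcf", RandomForestClassifier(random_state=0)),
            ("lcf", LogisticRegression(max_iter=1000)),
        ],
        voting="soft",  # average predicted probabilities instead of counting labels
    )),
])

# The scaler is refit inside every fold, so no test-fold statistics leak into training.
scores = cross_val_score(scaled_vote, X, y, cv=5)
print(scores.mean(), scores.std())
```

Reporting the mean and spread over folds gives a steadier picture than one split, at the cost of fitting the pipeline five times.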
