`Q1. You are working on a machine learning project where you have a dataset containing numerical and
categorical features. You have identified that some of the features are highly correlated and there are
missing values in some of the columns. You want to build a pipeline that automates the feature
engineering process and handles the missing values.`

`Design a pipeline that includes the following steps:`

1. Use an automated feature selection method to identify the important features in the dataset.

`Create a numerical pipeline that includes the following steps:`
1. Impute the missing values in the numerical columns using the mean of the column values.
2. Scale the numerical columns using standardisation.

`Create a categorical pipeline that includes the following steps:`
1. Impute the missing values in the categorical columns using the most frequent value of the column.
2. One-hot encode the categorical columns.

- Combine the numerical and categorical pipelines using a ColumnTransformer.
- Use a Random Forest Classifier to build the final model.
- Evaluate the accuracy of the model on the test dataset.

Note! Your solution should include code snippets for each step of the pipeline, and a brief explanation of each step. You should also provide an interpretation of the results and suggest possible improvements for the pipeline.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd
import numpy as np

# Load the dataset
data = pd.read_csv('your_dataset.csv')

# Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(data.drop('target', axis=1), data['target'], test_size=0.2, random_state=42)

# Define the numerical pipeline
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# Define the categorical pipeline
cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
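    # handle_unknown='ignore' avoids an error if the test set contains unseen categories
    ('onehot', OneHotEncoder(handle_unknown='ignore'))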
])

# Combine the numerical and categorical pipelines using ColumnTransformer
preprocessor = ColumnTransformer([
    ('num', num_pipeline, ['numerical_feature_1', 'numerical_feature_2']),
    ('cat', cat_pipeline, ['categorical_feature_1', 'categorical_feature_2'])
])

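# Use SelectFromModel to perform automated feature selection.
# The selector is fitted inside the final pipeline, on the preprocessed features.
# Fitting it here on the raw X_train would fail on string categories and missing values.
selector = SelectFromModel(RandomForestClassifier(random_state=42))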

# Define the final pipeline
final_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('feature_selection', selector),
    ('classifier', RandomForestClassifier(random_state=42))
])

# Fit the final pipeline on the training data
final_pipeline.fit(X_train, y_train)

# Evaluate the accuracy of the model on the test dataset
y_pred = final_pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy: {:.2f}%'.format(accuracy*100))


In this pipeline, we first load the dataset and split it into train and test sets. Then we define two pipelines, one for the numerical features and one for the categorical features. The numerical pipeline first imputes the missing values in the numerical columns using the mean of the column values and then scales the columns using standardization. The categorical pipeline first imputes the missing values in the categorical columns using the most frequent value of the column and then one-hot encodes the columns.
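
As a small illustration of the two preprocessing branches described above, the sketch below runs them on a hypothetical four-row table. The column names `age` and `city` and their values are made up for this example, and `sparse_threshold=0` is set only so that the result is a dense array that is easy to inspect.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one numerical and one categorical column, each with one gap.
toy = pd.DataFrame({
    "age": [10.0, np.nan, 30.0, 20.0],
    "city": pd.Series(["Paris", "Paris", np.nan, "Rome"], dtype=object),
})

toy_num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),      # NaN in age -> mean of 10, 30, 20 = 20
    ("scaler", StandardScaler()),                      # zero mean, unit variance
])
toy_cat_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),  # NaN in city -> "Paris"
    ("onehot", OneHotEncoder(handle_unknown="ignore")),    # columns: Paris, Rome
])
toy_preprocessor = ColumnTransformer(
    [
        ("num", toy_num_pipeline, ["age"]),
        ("cat", toy_cat_pipeline, ["city"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

transformed = toy_preprocessor.fit_transform(toy)
print(transformed)  # columns: scaled age, city=Paris, city=Rome
```

The first output column is the standardised `age` and the last two are the one-hot indicators for `city`; the rows that had gaps are filled with the column mean and with `"Paris"` respectively.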

We then use `ColumnTransformer` to combine the numerical and categorical pipelines, which lets us apply different preprocessing steps to different types of features. We also use `SelectFromModel` for automated feature selection: inside the final pipeline it fits a random forest on the preprocessed features and keeps only those whose importance reaches a threshold (by default, the mean importance).
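
To see what `SelectFromModel` actually keeps, the following sketch fits it on synthetic data generated with `make_classification`. The data and the `demo_` names are assumptions for illustration only, not the assignment's dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in data: 10 features, of which 3 carry signal.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=3, n_redundant=0, random_state=42
)

demo_selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=42)
)
demo_selector.fit(X_demo, y_demo)

# Boolean mask of kept features and the importances it is based on.
mask = demo_selector.get_support()
importances = demo_selector.estimator_.feature_importances_

print("Threshold (mean importance):", demo_selector.threshold_)
print("Kept feature indices:", np.flatnonzero(mask))
print("Shape after selection:", demo_selector.transform(X_demo).shape)
```

On a fitted pipeline like the one above, the same mask should be available through `final_pipeline.named_steps['feature_selection'].get_support()`.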

Finally, we define the final pipeline that includes the preprocessor, feature selection, and a Random Forest Classifier. We fit this pipeline on the training data and evaluate its accuracy on the test dataset.

Possible improvements for this pipeline include trying different imputation strategies, feature selection methods, and models to find the best combination for the dataset. Because the task mentions highly correlated features, it may also help to drop or combine them explicitly, since a random forest tends to split importance between correlated features. Additionally, further exploratory data analysis may reveal other issues with the dataset that affect model performance.
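
One way to try several of these combinations systematically is a grid search over the pipeline's parameters. The sketch below is a minimal example under stated assumptions: the data frame `demo`, its columns `num_a`, `num_b`, `cat_a`, and the parameter grid are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data with missing numerical values.
rng = np.random.default_rng(42)
n = 200
demo = pd.DataFrame({
    "num_a": rng.normal(size=n),
    "num_b": rng.normal(size=n),
    "cat_a": pd.Series(rng.choice(["x", "y", "z"], size=n), dtype=object),
})
signal = demo["num_a"] + (demo["cat_a"] == "x").astype(float)
target = (signal > 0.5).astype(int)
demo.loc[rng.choice(n, size=20, replace=False), "num_a"] = np.nan

search_pipeline = Pipeline([
    ("preprocessor", ColumnTransformer([
        ("num", Pipeline([
            ("imputer", SimpleImputer(strategy="mean")),
            ("scaler", StandardScaler()),
        ]), ["num_a", "num_b"]),
        ("cat", Pipeline([
            ("imputer", SimpleImputer(strategy="most_frequent")),
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]), ["cat_a"]),
    ])),
    ("classifier", RandomForestClassifier(random_state=42)),
])

# Nested parameter names follow the step names: step__substep__parameter.
param_grid = {
    "preprocessor__num__imputer__strategy": ["mean", "median"],
    "classifier__n_estimators": [50, 100],
}

search = GridSearchCV(search_pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(demo, target)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", search.best_score_)
```

Because the whole pipeline is refitted inside each cross-validation fold, the imputer and scaler never see the held-out fold, which keeps the comparison between settings fair.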

`Q2. Build a pipeline that includes a random forest classifier and a logistic regression classifier, and then
use a voting classifier to combine their predictions. Train the pipeline on the iris dataset and evaluate its
accuracy.`

In [26]:
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
import seaborn as sns

# Load the iris dataset
df = sns.load_dataset("iris")
X = df.drop('species', axis=1)
y = df['species']

# Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

numerical_features = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
# Iris has no categorical features, so the categorical branch below has no effect
categorical_features = []

# Define the numerical pipeline
numerical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# Define the categorical pipeline
categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder())
])

# Combine the numerical and categorical pipelines using a ColumnTransformer
preprocessor = ColumnTransformer([
    ('num', numerical_pipeline, numerical_features),
    ('cat', categorical_pipeline, categorical_features)
])

# Define the Random Forest Classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)

# Define the Logistic Regression Classifier
lr = LogisticRegression(random_state=42)

# Define the Voting Classifier
voting_clf = VotingClassifier(
    estimators=[('rf', rf), ('lr', lr)],
    voting='soft'
)

# Define the final pipeline that includes the preprocessor and the Voting Classifier
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('voting', voting_clf)
])

# Fit the pipeline on the training data
pipeline.fit(X_train, y_train)

# Evaluate the accuracy of the pipeline on the test data
accuracy = pipeline.score(X_test, y_test)
print(f'Accuracy: {accuracy}')


Accuracy: 1.0


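In this pipeline, we first define a numerical pipeline (mean imputation and scaling) and a categorical pipeline (most-frequent imputation and one-hot encoding), and combine them using a `ColumnTransformer`. The iris dataset has four numerical features and no categorical ones, so the categorical branch receives an empty column list and has no effect.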

Next, we define the Random Forest Classifier and the Logistic Regression Classifier. We then define the Voting Classifier that combines the predictions of these classifiers.

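Finally, we define the complete pipeline that includes the preprocessor and the Voting Classifier, and fit it on the training data. We evaluate the accuracy of the pipeline on the test data using the `score` method, which returns 1.0 here. The test set holds only 30 of the 150 samples, so this figure is a coarse estimate.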

Note that we set the `voting` parameter of the `VotingClassifier` to `'soft'`, which means that the predicted class probabilities of the classifiers are averaged and the class with the highest average probability is the final prediction. With `'hard'`, each classifier casts one vote and the class with the most votes is selected.
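
The difference between the two modes can be sketched with plain NumPy. The probabilities below are invented for a single sample and three hypothetical classifiers (the pipeline above uses two); they are chosen so that the two rules disagree.

```python
import numpy as np

# Hypothetical predicted probabilities for one sample and two classes.
# Each row is one classifier: [P(class 0), P(class 1)].
probas = np.array([
    [0.90, 0.10],  # very confident in class 0
    [0.40, 0.60],  # mildly prefers class 1
    [0.45, 0.55],  # mildly prefers class 1
])

# Soft voting: average the probabilities, then take the most probable class.
mean_proba = probas.mean(axis=0)
soft_prediction = int(np.argmax(mean_proba))

# Hard voting: each classifier votes for its own top class, majority wins.
votes = probas.argmax(axis=1)
hard_prediction = int(np.bincount(votes, minlength=probas.shape[1]).argmax())

print("Mean probabilities:", mean_proba)
print("Soft voting predicts class", soft_prediction)  # class 0
print("Hard voting predicts class", hard_prediction)  # class 1
```

Soft voting lets one confident classifier outweigh two hesitant ones, which is why it requires every estimator to provide `predict_proba`.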

The advantage of using a Voting Classifier is that it can improve the accuracy and robustness of the predictions by combining classifiers with different strengths. Its main limitation is that it helps little when the individual classifiers make highly correlated errors, since combining them cannot correct a mistake that they all share.
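
A rough way to check both points is to compare out-of-fold predictions of the individual classifiers and of the ensemble. This is a sketch under a few assumptions: it loads iris through `sklearn.datasets.load_iris` rather than seaborn, uses 5-fold cross-validation, and sets `max_iter=1000` for the logistic regression; none of these come from the original solution.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_iris, y_iris = load_iris(return_X_y=True)

models = {
    "rf": RandomForestClassifier(n_estimators=100, random_state=42),
    "lr": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000, random_state=42)
    ),
}
models["voting"] = VotingClassifier(
    estimators=[("rf", models["rf"]), ("lr", models["lr"])],
    voting="soft",
)

# Out-of-fold predictions: every sample is predicted by a model that never saw it.
oof = {
    name: cross_val_predict(model, X_iris, y_iris, cv=5)
    for name, model in models.items()
}

for name, pred in oof.items():
    print(f"{name}: cross-validated accuracy = {(pred == y_iris).mean():.3f}")

# How often do the two base classifiers fail on the same samples?
rf_wrong = oof["rf"] != y_iris
lr_wrong = oof["lr"] != y_iris
print("Samples misclassified by both:", int((rf_wrong & lr_wrong).sum()))
print("Samples misclassified by at least one:", int((rf_wrong | lr_wrong).sum()))
```

If most errors fall in the "misclassified by both" group, the base classifiers are largely redundant and the ensemble has little room to improve on them.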