**`Q.No-01`    You are working on a machine learning project where you have a dataset containing numerical and categorical features. You have identified that some of the features are highly correlated and there are missing values in some of the columns. You want to build a pipeline that automates the feature engineering process and handles the missing values.**

**`Design a pipeline that includes the following steps` :**

- **Use an automated feature selection method to identify the important features in the dataset.**

- **Create a numerical pipeline that includes the following steps :**

    - **Impute the missing values in the numerical columns using the mean of the column values.**

    - **Scale the numerical columns using standardisation.**

- **Create a categorical pipeline that includes the following steps :**

    - **Impute the missing values in the categorical columns using the most frequent value of the column.**

    - **One-hot encode the categorical columns.**

- **Combine the numerical and categorical pipelines using a ColumnTransformer.**

- **Use a Random Forest Classifier to build the final model.**

- **Evaluate the accuracy of the model on the test dataset.**

`Note`: **Your solution should include code snippets for each step of the pipeline, and a brief explanation of each step. You should also provide an interpretation of the results and suggest possible improvements for the pipeline.**

**Ans :-**

1. **Data Loading and Preprocessing :**

   - The "tips" dataset is loaded with Seaborn's `load_dataset` function. It contains no missing values, so the imputation steps below act as safeguards rather than changing the data.

   - The target variable 'time' is encoded using LabelEncoder to convert categorical labels into numerical labels for modeling purposes.

In [1]:
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import seaborn as sns
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder
import warnings
warnings.filterwarnings('ignore')

In [2]:
df=sns.load_dataset('tips')
df.head()

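|   | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |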


In [None]:
encoder=LabelEncoder()
df['time']=encoder.fit_transform(df['time'])
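
To read the encoded target later, it helps to know which integer stands for which label. The sketch below uses a small hand-written series as a stand-in for `df['time']` (an assumption, so that it runs without downloading the dataset) and builds the mapping from `encoder.classes_`, which `LabelEncoder` stores in sorted order.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for df['time']; the real column holds the same two labels
time_col = pd.Series(["Dinner", "Lunch", "Dinner", "Lunch", "Dinner"])

encoder = LabelEncoder()
encoded = encoder.fit_transform(time_col)

# classes_ is sorted, so the label at position i is encoded as integer i
label_mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(label_mapping)
```

With this mapping, a prediction of `0` or `1` can be translated back to a label, or `encoder.inverse_transform` can be applied to the predictions directly.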

In [3]:
# Splitting the dataset into features and target
X = df.drop('time', axis=1)  # Features
y = df.time  # Target

In [5]:
# Splitting the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

2. **Feature Selection and Preprocessing Pipelines :**

   - Three pipelines are defined:

     - Numerical Pipeline: Imputes missing values with the mean and scales numerical features.

     - Categorical Pipeline: Imputes missing values with the most frequent value and applies one-hot encoding to categorical features.

     - Feature Selection Pipeline: Uses a RandomForestClassifier to select features based on importance scores.

In [6]:
# Instantiate a Random Forest Classifier
rfc = RandomForestClassifier(random_state=42)

In [7]:
# Feature selection pipeline
feature_selection_pipeline = Pipeline([
    ('feature_selection', SelectFromModel(estimator=rfc, threshold="median"))
])

In [8]:
# Numerical pipeline
numerical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

In [9]:
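# Categorical pipeline
categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    # handle_unknown='ignore' avoids an error if the test set contains a category unseen during fit
    ('onehotencoder', OneHotEncoder(handle_unknown='ignore'))
])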

In [10]:
numerical_features = X.select_dtypes(include=['float64', 'int64']).columns.tolist()
categorical_features = X.select_dtypes(include=['category']).columns.tolist()

print(f"Numerical Features: {numerical_features}")
print(f"Categorical Features: {categorical_features}")

Numerical Features: ['total_bill', 'tip', 'size']
Categorical Features: ['sex', 'smoker', 'day']


3. **Column Transformer :**

   - Combines numerical and categorical pipelines to preprocess the data.

   - The 'num' transformer applies the numerical pipeline to numerical features.

   - The 'cat' transformer applies the categorical pipeline to categorical features.

In [11]:
# Column Transformer to combine numerical and categorical pipelines
preprocessor = ColumnTransformer([
    ('num', numerical_pipeline, numerical_features),  # Selecting numerical columns
    ('cat', categorical_pipeline, categorical_features)  # Selecting categorical columns
])
preprocessor
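
Because the tips data has no gaps, the effect of the imputers is not visible in this notebook. As a rough illustration, the sketch below applies the same kind of `ColumnTransformer` to a tiny hand-made frame with missing values. The frame `toy`, the names `num_pipe`, `cat_pipe` and `toy_preprocessor`, and the `sparse_threshold=0` setting (used only to force a dense array for inspection) are assumptions for this example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with gaps; not taken from the tips dataset
toy = pd.DataFrame({
    "total_bill": [10.0, np.nan, 30.0, 20.0],
    "size": [2.0, 3.0, np.nan, 4.0],
    "day": pd.Series(["Sun", "Sat", np.nan, "Sun"], dtype=object),
})

num_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])
cat_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehotencoder", OneHotEncoder(handle_unknown="ignore")),
])

# sparse_threshold=0 always returns a dense array, which is easier to inspect
toy_preprocessor = ColumnTransformer(
    [("num", num_pipe, ["total_bill", "size"]), ("cat", cat_pipe, ["day"])],
    sparse_threshold=0,
)
toy_out = toy_preprocessor.fit_transform(toy)

print(toy_out.shape)  # two scaled numeric columns plus one column per 'day' category
print(np.isnan(toy_out).any())  # no missing values remain after imputation
```

The missing `day` entry is filled with the most frequent value (`Sun`), so the encoder only sees the two categories `Sat` and `Sun`.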

4. **Combined Pipeline :**

   - Combines the preprocessor and feature selection pipeline.

In [12]:
# Combine feature selection pipeline with preprocessing pipelines
combined_pipeline = Pipeline([
    ('preprocessor', preprocessor),  # Preprocessing step
    ('feature_selection', feature_selection_pipeline)  # Feature selection step
])

5. **Model Training and Evaluation :**

   - The combined pipeline is fitted on the training data.

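   - Then the model (Random Forest Classifier) is trained on the preprocessed training data. `SelectFromModel` fits its own clone of `rfc` during feature selection, so this step fits the `rfc` object itself for the first time.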

   - Predictions are made on the preprocessed testing data.

   - Performance metrics like accuracy are calculated and printed.

In [13]:
# Fit the combined pipeline to the training data
X_train_preprocessed = combined_pipeline.fit_transform(X_train, y_train)

In [14]:
# Transform the testing data using the fitted combined pipeline
X_test_preprocessed = combined_pipeline.transform(X_test)

In [15]:
# Fit your model on the preprocessed training data
rfc.fit(X_train_preprocessed, y_train)

In [16]:
# Predict on the testing data
y_pred = rfc.predict(X_test_preprocessed)

In [17]:
# Evaluate the performance of your model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.9795918367346939
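
An alternative arrangement, sketched below, places the classifier inside the pipeline as its last step, so that `fit` and `predict` run preprocessing, feature selection and the model in one call. This is only one possible layout, not the method used above. The data is synthetic (the frame `demo`, its toy target `demo_y` and the injected missing values are assumptions), so the sketch runs without downloading tips and its accuracy says nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (hypothetical), loosely shaped like the tips columns
rng = np.random.default_rng(42)
n = 200
demo = pd.DataFrame({
    "total_bill": rng.normal(20, 8, n),
    "size": rng.integers(1, 6, n).astype(float),
    "day": rng.choice(["Thur", "Fri", "Sat", "Sun"], n),
})
demo_y = demo["day"].isin(["Thur", "Fri"]).astype(int)  # toy target tied to 'day'
demo.loc[::15, "total_bill"] = np.nan  # inject a few missing values

demo_preprocessor = ColumnTransformer([
    ("num", Pipeline([
        ("imputer", SimpleImputer(strategy="mean")),
        ("scaler", StandardScaler()),
    ]), ["total_bill", "size"]),
    ("cat", Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehotencoder", OneHotEncoder(handle_unknown="ignore")),
    ]), ["day"]),
])

# Preprocessing, feature selection and the final model in a single estimator
full_pipeline = Pipeline([
    ("preprocessor", demo_preprocessor),
    ("feature_selection", SelectFromModel(
        RandomForestClassifier(random_state=42), threshold="median")),
    ("classifier", RandomForestClassifier(random_state=42)),
])

X_tr, X_te, y_tr, y_te = train_test_split(
    demo, demo_y, test_size=0.2, random_state=42, stratify=demo_y)
full_pipeline.fit(X_tr, y_tr)
demo_pred = full_pipeline.predict(X_te)
demo_accuracy = accuracy_score(y_te, demo_pred)
print("Accuracy on synthetic data:", demo_accuracy)
```

Keeping every step in one `Pipeline` also means the whole object can be passed to utilities such as `cross_val_score`, which refit the preprocessing on each training fold.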


**`Interpretation of Results` :**

- The pipeline achieves an accuracy of approximately 97.96% (48 of the 49 test rows), which indicates that the model performs well in predicting whether a meal was lunch or dinner from the given features.

- This suggests that the selected features and the model capture the patterns in the data effectively, although a single train/test split with only 49 test rows gives a noisy estimate.

**`Possible Improvements` :**

1. **Hyperparameter Tuning -** Tune hyperparameters of the RandomForestClassifier to potentially improve performance further.

2. **Cross-Validation -** Utilize cross-validation to get a more robust estimate of model performance.

3. **Feature Engineering -** Explore additional feature engineering techniques to enhance the predictive power of the model.

4. **Different Models -** Try different models apart from Random Forest to see if they perform better.

5. **Feature Importance Analysis -** Perform a deeper analysis of feature importance to ensure that the selected features are indeed the most relevant for prediction.

6. **Handling Imbalance -** If there's a class imbalance, techniques like oversampling, undersampling, or using class weights can be employed to address it.
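
As a minimal sketch of the first two suggestions (with class weights from the last one), the block below cross-validates a pipeline and then tunes two Random Forest hyperparameters with `GridSearchCV`. The data comes from `make_classification`, and the grid values, fold count and `class_weight="balanced"` setting are illustrative assumptions rather than recommended settings for tips.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, mildly imbalanced data (hypothetical stand-in for the real features)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4,
    weights=[0.7, 0.3], random_state=42,
)

model = Pipeline([
    ("scaler", StandardScaler()),
    # class_weight="balanced" reweights classes inversely to their frequency
    ("classifier", RandomForestClassifier(class_weight="balanced", random_state=42)),
])

# Stratified folds keep the class ratio similar in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")
print("CV accuracy: %.3f +/- %.3f" % (cv_scores.mean(), cv_scores.std()))

# Pipeline parameters are addressed as <step name>__<parameter name>
param_grid = {
    "classifier__n_estimators": [50, 100],
    "classifier__max_depth": [None, 5],
}
search = GridSearchCV(model, param_grid, cv=cv, scoring="accuracy")
search.fit(X_demo, y_demo)
print("Best parameters:", search.best_params_)
print("Best CV accuracy:", search.best_score_)
```

Reporting the mean and spread over folds gives a steadier picture than a single split, and the tuned estimator remains available as `search.best_estimator_`.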

-------------------------------------------------------------------------------------------------------------------------------------------------------------------

**`Q.No-02`    Build a pipeline that includes a random forest classifier and a logistic regression classifier, and then use a voting classifier to combine their predictions. Train the pipeline on the iris dataset and evaluate its accuracy.**

**Ans :-**

In [46]:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

In [47]:
# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

In [48]:
# Create DataFrame
Iris = pd.DataFrame(data=X, columns=iris.feature_names)
Iris['target'] = y

Iris.head()

|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |


In [49]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [50]:
# Define classifiers
random_forest = RandomForestClassifier(n_estimators=100, random_state=42)
logistic_regression = LogisticRegression(max_iter=1000, random_state=42)

In [57]:
# Define pipeline
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),  # Standardize features
    ('ensemble', VotingClassifier(estimators=[
        ('rf', random_forest),
        ('lr', logistic_regression)
    ], voting='hard'))  # Combine predictions using majority voting
])

In [58]:
# Train the pipeline
pipeline.fit(X_train, y_train)

In [59]:
# Evaluate accuracy
accuracy = accuracy_score(y_test, pipeline.predict(X_test))
print("Accuracy:", accuracy)

Accuracy: 1.0
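
A perfect score on a 30-row test set is worth cross-checking. With only two estimators under hard voting, any disagreement is a tie, which scikit-learn resolves by the ascending sort order of the class labels. Soft voting, which averages predicted probabilities, avoids that tie-break. The sketch below compares both modes with 5-fold cross-validation on iris; the helper `make_voting_pipeline` and the fold count are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_iris, y_iris = load_iris(return_X_y=True)


def make_voting_pipeline(voting):
    """Hypothetical helper: build a scaler + voting ensemble for the given mode."""
    ensemble = VotingClassifier(
        estimators=[
            ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
            ("lr", LogisticRegression(max_iter=1000, random_state=42)),
        ],
        voting=voting,  # "hard" = majority vote, "soft" = averaged probabilities
    )
    return Pipeline([("scaler", StandardScaler()), ("ensemble", ensemble)])


# Mean 5-fold cross-validated accuracy for each voting mode
scores = {
    voting: cross_val_score(make_voting_pipeline(voting), X_iris, y_iris, cv=5).mean()
    for voting in ("hard", "soft")
}
for voting, score in scores.items():
    print(f"{voting} voting: mean CV accuracy = {score:.3f}")
```

Adding a third, different estimator is another common way to make hard voting less dependent on tie-breaking.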
