## 1. You are working on a machine learning project where you have a dataset containing numerical and categorical features. You have identified that some of the features are highly correlated and there are missing values in some of the columns. You want to build a pipeline that automates the feature engineering process and handles the missing values.
Design a pipeline that includes the following steps:

- Use an automated feature selection method to identify the important features in the dataset.
- Create a numerical pipeline that includes the following steps:
  - Impute the missing values in the numerical columns using the mean of the column values.
  - Scale the numerical columns using standardisation.
- Create a categorical pipeline that includes the following steps:
  - Impute the missing values in the categorical columns using the most frequent value of the column.
  - One-hot encode the categorical columns.
- Combine the numerical and categorical pipelines using a `ColumnTransformer`.
- Use a Random Forest Classifier to build the final model.
- Evaluate the accuracy of the model on the test dataset.

Note: Your solution should include code snippets for each step of the pipeline and a brief explanation of each step. You should also provide an interpretation of the results and suggest possible improvements for the pipeline.

In [10]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import pandas as pd
import numpy as np
import seaborn as sns


In [6]:
df = sns.load_dataset('tips')  # treat this as the given dataset

In [5]:
df.head()

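| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |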


In [13]:
# Automate feature engineering
cat_cols = ['sex', 'smoker', 'day']
num_cols = ['total_bill', 'tip', 'size']

num_pipeline = Pipeline(
    steps=[
        # Fill missing values with the column median.
        # The task statement asks for the mean; use strategy='mean' to match it exactly.
        ('impute', SimpleImputer(strategy='median')),
        # Standardise to zero mean and unit variance (this does not remove outliers)
        ('scaler', StandardScaler())
    ]
)

cat_pipeline = Pipeline(
    steps=[
        # Fill missing values with the most frequent category
        ('impute', SimpleImputer(strategy='most_frequent')),
        # One-hot encode; ignore categories not seen during fitting
        ('encoder', OneHotEncoder(handle_unknown='ignore'))
    ]
)

# Apply each pipeline to its own group of columns
preprocessor = ColumnTransformer([
    ('num_pipeline', num_pipeline, num_cols),
    ('cat_pipeline', cat_pipeline, cat_cols)
])

# Encode the target column 'time' as integers
encoder = LabelEncoder()
df['time'] = encoder.fit_transform(df['time'])

X = df.drop(labels=['time'], axis=1)
y = df['time']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

# Fit the preprocessor on the training split only, then reuse it on the test split
X_train = preprocessor.fit_transform(X_train)
X_test = preprocessor.transform(X_test)

# Models to evaluate
models = {
    'rfc': RandomForestClassifier()
}

def evalute_model(X_train, X_test, y_train, y_test, models):
    """Fit each model and return a dict mapping model name to test accuracy."""
    report = {}
    for name, model in models.items():
        model.fit(X_train, y_train)       # train on the training split
        y_pred = model.predict(X_test)    # predict on the held-out split
        report[name] = accuracy_score(y_test, y_pred)
    return report

In [14]:
evalute_model(X_train,X_test,y_train,y_test,models)

{'rfc': 0.972972972972973}

The provided code snippet is a Python script that automates the feature engineering process, trains machine learning models, and evaluates their performance on a given dataset. Let's break down the code step by step:

1. **Feature Engineering Pipelines:**
   - Two separate pipelines are defined, one for numerical features (`num_pipeline`) and one for categorical features (`cat_pipeline`).
   - The numerical pipeline performs two preprocessing steps: imputation of missing values using the median (the task statement asks for the mean, which corresponds to `strategy='mean'`) and scaling of the numerical columns using standardization. Standardization brings all numerical features to zero mean and unit variance; it does not remove or limit outliers.
   - The categorical pipeline handles missing values in categorical columns using the most frequent strategy and then performs one-hot encoding to convert categorical variables into a binary format suitable for machine learning algorithms.
   - The `ColumnTransformer` (`preprocessor`) is used to combine the numerical and categorical pipelines, applying each to the corresponding columns in the dataset.

2. **Label Encoding:**
   - The 'time' column in the DataFrame (`df`) is transformed using `LabelEncoder()` to convert the 'time' labels into numeric format. This is often done to represent categorical labels as numerical values, making them compatible with certain machine learning algorithms.

3. **Data Splitting:**
   - The DataFrame `df` is split into features `X` and target variable `y`. The 'time' column is considered the target variable, and the rest of the columns are treated as features.
   - The data is further split into training and testing sets using `train_test_split`.

4. **Automated Model Evaluation:**
   - The function `evalute_model` automates the process of training and evaluating machine learning models on the dataset.
   - The function takes the training and testing data along with a dictionary of machine learning models (`models`) as input.
   - For each model in the `models` dictionary, it fits the model to the training data and predicts on the testing data.
   - The accuracy score is calculated for each model by comparing the predicted labels with the true labels (`y_test`), and the scores are stored in the `report` dictionary.
   - The function returns the `report` dictionary containing model names as keys and their corresponding accuracy scores as values.

5. **Model Automation and Evaluation:**
   - In the provided code snippet, a single model, `RandomForestClassifier()`, is included in the `models` dictionary.
   - The `evalute_model` function is called with the training and testing data along with the `models` dictionary.
   - The function returns the accuracy score for the `RandomForestClassifier` model on the test data.
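
The walk-through above covers imputation, scaling, encoding and the model, but not the automated feature selection step listed in the task. One possible way to add it is sketched below, assuming `SelectFromModel` with a random forest as the selector and a small synthetic DataFrame in place of the real data. The column names (`amount`, `noise`, `group`) and the variable names are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data standing in for the real dataset (this is not the tips data)
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "amount": rng.normal(20, 5, n),
    "noise": rng.normal(0, 1, n),
    "group": rng.choice(["a", "b", "c"], n),
})
toy.loc[toy.index[::15], "amount"] = np.nan    # inject some missing values
target = (toy["group"] == "a").astype(int)     # label depends only on `group`

num_cols, cat_cols = ["amount", "noise"], ["group"]
preprocessor = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="mean")),
                      ("scale", StandardScaler())]), num_cols),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("encode", OneHotEncoder(handle_unknown="ignore"))]), cat_cols),
])

full_pipeline = Pipeline([
    ("preprocess", preprocessor),
    # Keep features whose forest importance is at least the mean importance
    # (the default threshold SelectFromModel uses for this kind of estimator)
    ("select", SelectFromModel(RandomForestClassifier(n_estimators=50, random_state=0))),
    ("model", RandomForestClassifier(random_state=0)),
])

X_tr, X_te, y_tr, y_te = train_test_split(
    toy, target, test_size=0.30, random_state=42, stratify=target
)
full_pipeline.fit(X_tr, y_tr)                  # every step is fitted on training data only
toy_accuracy = accuracy_score(y_te, full_pipeline.predict(X_te))
kept = full_pipeline.named_steps["select"].get_support()
print("features kept:", int(kept.sum()), "of", kept.size)
print("test accuracy:", round(toy_accuracy, 3))
```

Wrapping everything in one `Pipeline` means the imputer, scaler, encoder and selector are all fitted inside `fit`, so the same object can later be passed to cross-validation without leaking test data into preprocessing.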

The code aims to automate the feature engineering and model evaluation processes, making it easier to try multiple models and assess their performance on the dataset. It demonstrates how to use pipelines for feature engineering, label encoding for the target column, and a function to automate the model evaluation process.

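The reported accuracy of about 0.97 should be read with some caution. It comes from a single 70/30 split of a 244-row dataset, so the test set holds only 74 rows, and `RandomForestClassifier()` is created without a `random_state`, so the score can change from run to run. Most rows in `tips` are dinners, so accuracy alone can also flatter the model. In practice the pipeline can be improved by adding the feature selection step the task asks for, tuning hyperparameters, comparing several models, and using cross-validation to get a more reliable performance estimate. Model selection should follow the problem's requirements and the characteristics of the dataset.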
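
As a rough illustration of cross-validation and hyperparameter tuning, the sketch below scores a scaler-plus-forest pipeline with 5-fold cross-validation and then runs a small grid search. It assumes synthetic data from `make_classification` in place of the real dataset. The grid values and the names `pipe`, `cv_scores` and `search` are illustrative choices, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data: 240 rows, 6 numeric features, binary target
X_demo, y_demo = make_classification(
    n_samples=240, n_features=6, n_informative=3, random_state=0
)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", RandomForestClassifier(random_state=0)),
])

# Five stratified folds give five accuracy estimates instead of one
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring="accuracy")
print(f"CV accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")

# Small illustrative grid; "model__" routes each parameter to the forest step
param_grid = {
    "model__n_estimators": [50, 100],
    "model__max_depth": [None, 5],
}
search = GridSearchCV(pipe, param_grid, cv=cv, scoring="accuracy")
search.fit(X_demo, y_demo)
print("best params:", search.best_params_)
print("best CV accuracy:", round(search.best_score_, 3))
```

The spread of `cv_scores` gives a sense of how much a single train/test split could move the reported number.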

## Q2. Build a pipeline that includes a random forest classifier and a logistic regression classifier, and then use a voting classifier to combine their predictions. Train the pipeline on the iris dataset and evaluate its accuracy.

In [6]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier,VotingClassifier
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
import pandas as pd
import numpy as np
import seaborn as sns

In [2]:
df=sns.load_dataset('iris')
df.head()

| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |


In [15]:
# Automate feature engineering: all four iris features are numerical
num_cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']

num_pipeline = Pipeline(
    steps=[
        ('impute', SimpleImputer(strategy='median')),  # fill missing values with the median
        ('scaler', StandardScaler())                   # standardise to zero mean, unit variance
    ]
)

preprocessor = ColumnTransformer([
    ('num_pipeline', num_pipeline, num_cols)
])

# Encode the target column 'species' as integers
encoder = LabelEncoder()
df['species'] = encoder.fit_transform(df['species'])

X = df.drop(labels=['species'], axis=1)
y = df['species']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

# Fit the preprocessor on the training split only, then reuse it on the test split
X_train = preprocessor.fit_transform(X_train)
X_test = preprocessor.transform(X_test)

# Base estimators combined by the voting classifier
models = {
    'rfc': RandomForestClassifier(),
    'LR': LogisticRegression()
}

def evalute_model(X_train, X_test, y_train, y_test, models):
    """Fit a hard-voting ensemble of `models` once and return its test accuracy."""
    voting_classifier = VotingClassifier(
        estimators=list(models.items()),  # (name, estimator) pairs
        voting='hard'                     # majority vote on predicted labels
    )
    # Fit the voting classifier on the training data
    voting_classifier.fit(X_train, y_train)

    # Predict on the test data and score
    y_pred = voting_classifier.predict(X_test)
    return {'voting_classiferi_score': accuracy_score(y_test, y_pred)}




In [16]:
evalute_model(X_train,X_test,y_train,y_test,models)

{'voting_classiferi_score': 1.0}
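
For comparison, here is a minimal sketch of how the scaler and the voting ensemble could be placed in a single `Pipeline`, as the question asks. It assumes scikit-learn's bundled `load_iris` instead of the seaborn copy of the data, and it uses soft voting as a variation on the hard voting above. The names `voting_pipeline` and `scores` are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.30, random_state=42, stratify=y_iris
)

voting_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("vote", VotingClassifier(
        estimators=[
            ("rf", RandomForestClassifier(random_state=0)),
            ("lr", LogisticRegression(max_iter=1000)),
        ],
        voting="soft",  # average predicted class probabilities
    )),
])
voting_pipeline.fit(X_tr, y_tr)

# Accuracy of the ensemble and of each fitted base estimator on the same test split
scores = {"voting": accuracy_score(y_te, voting_pipeline.predict(X_te))}
X_te_scaled = voting_pipeline.named_steps["scaler"].transform(X_te)
for name, est in voting_pipeline.named_steps["vote"].named_estimators_.items():
    scores[name] = accuracy_score(y_te, est.predict(X_te_scaled))
print(scores)
```

Reporting the base estimators alongside the ensemble shows whether voting actually adds anything. On a dataset as easy as iris, all three scores are often close.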