#### Q1. You are working on a machine learning project where you have a dataset containing numerical and categorical features. You have identified that some of the features are highly correlated and there are missing values in some of the columns. You want to build a pipeline that automates the feature engineering process and handles the missing values.

#### Design a pipeline that includes the following steps.

- Use an automated feature selection method to identify the important features in the dataset.
- Create a numerical pipeline that includes the following steps.
  - Impute the missing values in the numerical columns using the mean of the column values.
  - Scale the numerical columns using standardisation.
- Create a categorical pipeline that includes the following steps.
  - Impute the missing values in the categorical columns using the most frequent value of the column.
  - One-hot encode the categorical columns.
- Combine the numerical and categorical pipelines using a ColumnTransformer.
- Use a Random Forest Classifier to build the final model.
- Evaluate the accuracy of the model on the test dataset.

In [25]:
import pandas as pd
from sklearn.feature_selection import SelectFromModel
import seaborn as sns

In [26]:
data=sns.load_dataset('tips')

In [27]:
data.to_csv('tips.csv', index=False)

In [28]:
data.describe()

       total_bill       tip      size
count       244.0     244.0     244.0
mean    19.785943  2.998279  2.569672
std      8.902412  1.383638    0.9511
min          3.07       1.0       1.0
25%       13.3475       2.0       2.0
50%        17.795       2.9       2.0
75%       24.1275    3.5625       3.0
max         50.81      10.0       6.0


In [29]:
data.isnull().sum()

total_bill    0
tip           0
sex           0
smoker        0
day           0
time          0
size          0
dtype: int64

In [30]:
data.isna().sum()

total_bill    0
tip           0
sex           0
smoker        0
day           0
time          0
size          0
dtype: int64

In [31]:
from sklearn.model_selection import train_test_split

In [32]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler,OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

In [33]:
numerical_pipeline=Pipeline([('imputer',SimpleImputer(strategy='mean')),
                            ('scaler',StandardScaler())])

In [34]:
categorical_pipeline=Pipeline([('imputer',SimpleImputer(strategy='most_frequent')),
                              ('encoder',OneHotEncoder())])

In [35]:
preprocessor=ColumnTransformer([('numerical', numerical_pipeline, ['total_bill','tip','size']),
    ('categorical', categorical_pipeline, ['sex','day','time'])])
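
The column lists above are written out by hand. As a hedged alternative, the columns can be picked by dtype with `make_column_selector`. The sketch below is not part of the original notebook: the four-row `toy` frame is a hypothetical stand-in that borrows a few tips column names, and `handle_unknown='ignore'` is an added assumption.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame (not the real tips data), with one missing value.
toy = pd.DataFrame({
    "total_bill": [16.99, 10.34, np.nan, 23.68],
    "size": [2, 3, 3, 2],
    "sex": ["Female", "Male", "Male", "Male"],
    "time": ["Dinner", "Dinner", "Lunch", "Lunch"],
})

numerical_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])
categorical_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])

# Numeric columns go to the numerical pipeline, everything else to the
# categorical pipeline, so no column names need to be listed by hand.
preprocessor = ColumnTransformer([
    ("numerical", numerical_pipeline,
     make_column_selector(dtype_include="number")),
    ("categorical", categorical_pipeline,
     make_column_selector(dtype_exclude="number")),
])

transformed = preprocessor.fit_transform(toy)
print(transformed.shape)
```

Selecting by dtype is convenient when the column set changes, but it assumes every non-numeric column really is categorical, which is worth checking on real data.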

In [51]:
x=data.drop('smoker',axis=1)
y=data['smoker']

In [52]:
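# Caution: fitting on the full dataset before the split lets test-set statistics leak into the transform.
x=preprocessor.fit_transform(x)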

In [53]:
encoder=OneHotEncoder()

In [54]:
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=42)

In [55]:
rf_classifier=RandomForestClassifier()

In [56]:
rf_classifier.fit(x_train,y_train)

In [57]:
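# Select features whose Random Forest importance is at least the mean importance (the default threshold).
# Note: selected_features is not used by rf_classifier, which was already trained on all features.
feature_selector = SelectFromModel(RandomForestClassifier(n_estimators=100))
selected_features = feature_selector.fit_transform(x_train, y_train)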

In [58]:
y_pred=rf_classifier.predict(x_test)

In [59]:
y_pred

array(['No', 'No', 'No', 'No', 'Yes', 'No', 'No', 'No', 'No', 'Yes', 'No',
       'Yes', 'No', 'No', 'No', 'No', 'No', 'Yes', 'No', 'No', 'No', 'No',
       'No', 'No', 'Yes', 'No', 'No', 'Yes', 'No', 'No', 'Yes', 'No',
       'No', 'No', 'No', 'No', 'No', 'No', 'Yes', 'Yes', 'No', 'No', 'No',
       'No', 'No', 'No', 'Yes', 'No', 'No'], dtype=object)

In [60]:
from sklearn.metrics import accuracy_score

In [61]:
print("Accuracy:",accuracy_score(y_test,y_pred))

Accuracy: 0.673469387755102


Explanation of each step:

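Feature Selection: SelectFromModel wraps a Random Forest Classifier and keeps the features whose importance is at least a threshold (the mean importance by default). In the code above, the selected features are computed but not passed to the final classifier, which is trained on all of the preprocessed features.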

Numerical Pipeline: Impute the missing values in numerical columns using the mean of the column values and scale the numerical columns using standardization.

Categorical Pipeline: Impute the missing values in categorical columns using the most frequent value of the column and perform one-hot encoding on the categorical columns.

ColumnTransformer: Combine the numerical and categorical pipelines using ColumnTransformer to apply the preprocessing steps to the respective column types.

Random Forest Classifier: Build the final model using the Random Forest Classifier.

Evaluate Model: Predict the target variable on the test dataset and evaluate the model's accuracy.
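
The steps above can be sketched as one end-to-end pipeline, so that the imputers, scaler, encoder, and feature selector are all fitted on the training split only and the selected features actually reach the final model. This is a minimal sketch, not the notebook's original code: `make_toy_tips` is a hypothetical helper that builds random data with the tips column names (so the labels carry no signal), and `handle_unknown='ignore'` and the fixed `random_state` values are assumptions added for robustness and reproducibility.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def make_toy_tips(n_rows=200, seed=0):
    """Hypothetical stand-in for the tips data, with a few missing values."""
    rng = np.random.default_rng(seed)
    frame = pd.DataFrame({
        "total_bill": rng.uniform(3, 50, n_rows),
        "tip": rng.uniform(1, 10, n_rows),
        "size": rng.integers(1, 7, n_rows).astype(float),
        "sex": rng.choice(["Male", "Female"], n_rows),
        "day": rng.choice(["Thur", "Fri", "Sat", "Sun"], n_rows),
        "time": rng.choice(["Lunch", "Dinner"], n_rows),
        "smoker": rng.choice(["Yes", "No"], n_rows),
    })
    # Remove a few entries so the imputers have something to do.
    frame.loc[frame.index[::15], "total_bill"] = np.nan
    frame.loc[frame.index[::20], "day"] = np.nan
    return frame


data = make_toy_tips()
X = data.drop(columns="smoker")
y = data["smoker"]

# Split first so that no statistic is learned from the test rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

numerical_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])
categorical_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])
preprocessor = ColumnTransformer([
    ("numerical", numerical_pipeline, ["total_bill", "tip", "size"]),
    ("categorical", categorical_pipeline, ["sex", "day", "time"]),
])

# Preprocessing, feature selection, and the classifier in one estimator.
model = Pipeline([
    ("preprocessor", preprocessor),
    ("selector", SelectFromModel(
        RandomForestClassifier(n_estimators=100, random_state=42))),
    ("classifier", RandomForestClassifier(random_state=42)),
])

model.fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))
n_selected = int(model.named_steps["selector"].get_support().sum())
print("Accuracy:", accuracy)
print("Features kept by the selector:", n_selected)
```

Because the toy labels are random, the printed accuracy is not meaningful; the point is the structure. Swapping `make_toy_tips()` for the real tips DataFrame should work unchanged, since the column names match.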

Interpretation of the results and possible improvements:
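The pipeline automates preprocessing by imputing missing values, scaling the numerical features, and one-hot encoding the categorical features, with a Random Forest Classifier as the final model. Because the tips dataset has no missing values (see the `isnull()` output above), the imputers leave the data unchanged here. The accuracy of about 0.67 comes from a single split with 49 test rows, so it is a rough estimate of the model's performance rather than a precise one.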

Possible improvements for the pipeline could include:

- Trying different imputation strategies, such as median or regression-based imputation, depending on the nature of the missing data.
- Experimenting with different feature selection techniques, or adjusting the hyperparameters of the Random Forest Classifier used for feature selection.
- Trying different models or ensemble methods to improve the classification performance.
- Conducting more thorough exploratory data analysis to identify additional preprocessing steps or feature engineering techniques that could benefit the model's performance.

Which of these improvements helps depends on the specific characteristics of your dataset.
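
The first two suggestions can be explored together with a grid search over the pipeline's parameters. The following is a minimal sketch under stated assumptions: the data is synthetic (`make_classification` with about 5% of entries removed), and the parameter grid is an illustrative choice, not a recommendation from the original notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic numeric data with roughly 5% of the entries removed (assumption).
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan

pipeline = Pipeline([
    ("imputer", SimpleImputer()),
    ("classifier", RandomForestClassifier(random_state=42)),
])

# "<step name>__<parameter>" addresses a parameter inside the pipeline.
param_grid = {
    "imputer__strategy": ["mean", "median"],
    "classifier__n_estimators": [50, 100],
    "classifier__max_depth": [None, 5],
}

# Each candidate is refitted per fold, so imputation statistics come
# from the training folds only.
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", search.best_score_)
```

Searching over the whole pipeline, rather than over the classifier alone, keeps the imputation choice and the model hyperparameters tuned against the same cross-validation folds.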

#### Q2. Build a pipeline that includes a random forest classifier and a logistic regression classifier, and then use a voting classifier to combine their predictions. Train the pipeline on the iris dataset and evaluate its accuracy.

In [62]:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create the individual classifiers
random_forest = RandomForestClassifier(n_estimators=100, random_state=42)
logistic_regression = LogisticRegression(random_state=42)

# Create the pipeline with StandardScaler and the classifiers
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('ensemble', VotingClassifier(
        estimators=[('rf', random_forest), ('lr', logistic_regression)],
        voting='soft'  # Soft voting averages the predicted class probabilities
    ))
])

# Train the pipeline on the training data
pipeline.fit(X_train, y_train)

# Predict on the test data
y_pred = pipeline.predict(X_test)

# Evaluate the accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)


Accuracy: 1.0
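
A perfect score on a 30-sample test split can change with the split, so a cross-validated estimate may be a useful complement. The sketch below re-creates the same pipeline and scores it with 5-fold stratified cross-validation. The choice of five folds, the shuffling seed, and `max_iter=1000` are assumptions, not part of the original cell.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Same structure as the pipeline above: scaling followed by soft voting.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("ensemble", VotingClassifier(
        estimators=[
            ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
            ("lr", LogisticRegression(max_iter=1000, random_state=42)),
        ],
        voting="soft",
    )),
])

# Stratified folds keep the three classes balanced in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring="accuracy")
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```

The spread of the fold accuracies gives a rough sense of how much the single-split result depends on which rows land in the test set.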
