__Q1. You are working on a machine learning project where you have a dataset containing numerical and
categorical features. You have identified that some of the features are highly correlated and that there are
missing values in some of the columns. You want to build a pipeline that automates the feature
engineering process and handles the missing values.__

Design a pipeline that includes the following steps:
1. Use an automated feature selection method to identify the important features in the dataset.
2. Create a numerical pipeline that includes the following steps:<br>
   a. Impute the missing values in the numerical columns using the mean of the column values.<br>
   b. Scale the numerical columns using standardization.
3. Create a categorical pipeline that includes the following steps:<br>
   a. Impute the missing values in the categorical columns using the most frequent value of the column.<br>
   b. One-hot encode the categorical columns.
4. Combine the numerical and categorical pipelines using a ColumnTransformer.
5. Use a Random Forest Classifier to build the final model.
6. Evaluate the accuracy of the model on the test dataset.

Note: Your solution should include code snippets for each step of the pipeline and a brief explanation of
each step. You should also provide an interpretation of the results and suggest possible improvements for
the pipeline.

#### Import Necessary Libraries


In [1]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score,recall_score,f1_score

#### Load the Iris dataset

In [2]:
data = load_iris()
X, y = data.data, data.target

#### Split the dataset into train and test sets


In [3]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

#### 1.  Use an automated feature selection method to identify the important features in the dataset. 
Feature selection helps identify the most relevant features for your model, improving performance and reducing overfitting. You can use methods like correlation analysis, recursive feature elimination, or feature importance from tree-based models. Here's an example using SelectKBest with f_classif scoring:

In [4]:
# Step 1: Use an automated feature selection method
selector = SelectKBest(score_func=f_classif, k=3)
X_train_selected = selector.fit_transform(X_train, y_train)
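To see which columns `SelectKBest` actually kept, the fitted selector can be inspected. The following is a minimal sketch that reloads Iris with the same split so it runs on its own; the names `kept` and `X_test_selected` are illustrative and not part of the original notebook.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

selector = SelectKBest(score_func=f_classif, k=3)
X_train_selected = selector.fit_transform(X_train, y_train)

# Boolean mask over the original columns: True means the feature was kept
mask = selector.get_support()
kept = [name for name, keep in zip(data.feature_names, mask) if keep]
print("Kept features:", kept)

# ANOVA F-score per original feature (higher means more class separation)
f_scores = {name: round(float(s), 1) for name, s in zip(data.feature_names, selector.scores_)}
print("F-scores:", f_scores)

# Apply the already-fitted selector to the test set; do not refit on test data
X_test_selected = selector.transform(X_test)
print(X_train_selected.shape, X_test_selected.shape)
```

Calling `transform` (rather than `fit_transform`) on the test set keeps the selection decision based on the training data only.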

In [13]:
X_train_selected

array([[4.6, 1. , 0.2],
       [5.7, 1.5, 0.4],
       [6.7, 4.4, 1.4],
       [4.8, 1.6, 0.2],
       [4.4, 1.3, 0.2],
       [6.3, 5. , 1.9],
       [6.4, 4.5, 1.5],
       [5.2, 1.5, 0.2],
       [5. , 1.4, 0.2],
       [5.2, 1.5, 0.1],
       [5.8, 5.1, 1.9],
       [6. , 4.5, 1.6],
       [6.7, 4.7, 1.5],
       [5.4, 1.3, 0.4],
       [5.4, 1.5, 0.2],
       [5.5, 3.7, 1. ],
       [6.3, 5.1, 1.5],
       [6.4, 5.5, 1.8],
       [6.6, 4.4, 1.4],
       [7.2, 6.1, 2.5],
       [5.7, 4.2, 1.3],
       [7.6, 6.6, 2.1],
       [5.6, 4.5, 1.5],
       [5.1, 1.4, 0.2],
       [7.7, 6.7, 2. ],
       [5.8, 4.1, 1. ],
       [5.2, 1.4, 0.2],
       [5. , 1.3, 0.3],
       [5.1, 1.9, 0.4],
       [5. , 3.5, 1. ],
       [6.3, 4.9, 1.8],
       [4.8, 1.9, 0.2],
       [5. , 1.6, 0.2],
       [5.1, 1.7, 0.5],
       [5.6, 4.2, 1.3],
       [5.1, 1.5, 0.2],
       [5.7, 4.2, 1.2],
       [7.7, 6.7, 2.2],
       [4.6, 1.4, 0.2],
       [6.2, 4.3, 1.3],
       [5.7, 5. , 2. ],
       [5.5, 1.4

#### 2.    Create a numerical pipeline that includes the following steps:
a. Impute the missing values in the numerical columns using the mean of the column values.<br>
b. Scale the numerical columns using standardization.

Mean imputation fills each gap with the column's average, and standardization rescales every numerical feature to zero mean and unit variance so that features on different scales contribute comparably. Here's an example:

In [5]:
# Step 2: Create a numerical pipeline
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])
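Because Iris has no missing values, the effect of this pipeline is easier to see on a toy column. A small sketch, assuming a made-up three-row array (`X_toy` is illustrative data, not from the dataset):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])

# Toy column with one missing value (illustrative data)
X_toy = np.array([[1.0], [np.nan], [3.0]])
X_out = num_pipeline.fit_transform(X_toy)

# The NaN is replaced by the column mean (2.0), which maps to 0 after scaling
print(X_out.ravel())
```

The imputed row ends up at exactly the column mean, so after standardization it sits at 0 while the other two values are placed symmetrically around it.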

#### 3.  Create a categorical pipeline that includes the following steps:
a. Impute the missing values in the categorical columns using the most frequent value of the column.<br>
b. One-hot encode the categorical columns.

Most-frequent imputation replaces each missing entry with the column's most common category, and one-hot encoding converts each category into its own binary indicator column. Here's an example:

In [6]:
# Step 3: Create a categorical pipeline
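cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    # handle_unknown='ignore' avoids an error if the test set contains a category unseen during fit
    ('encode', OneHotEncoder(handle_unknown='ignore'))
])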
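Iris has no categorical columns, so this pipeline is never exercised in the notebook. The sketch below runs it on a hypothetical single "colour" column (the values are made up for illustration) to show both steps at work:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

cat_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder()),
])

# Hypothetical categorical column with one missing entry
X_toy = np.array([["red"], ["blue"], [np.nan], ["red"]], dtype=object)
encoded = cat_pipeline.fit_transform(X_toy)

# Categories learned by the encoder, in column order
categories = list(cat_pipeline.named_steps["encode"].categories_[0])
print(categories)

# OneHotEncoder returns a sparse matrix by default; densify it for display
print(encoded.toarray())
```

The missing entry is filled with "red" (the most frequent value) before encoding, so the third row is encoded the same way as the first and fourth.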

#### 4.  Combine the numerical and categorical pipelines using a ColumnTransformer.
The ColumnTransformer allows different transformations on different columns, combining them into a single output. Here's an example:

In [7]:
# Step 4: Combine numerical and categorical pipelines
preprocessor = ColumnTransformer([
    ('num', num_pipeline, [0, 1, 2, 3]),  # Assuming all columns are numerical
    ('cat', cat_pipeline, [])  # Assuming no categorical columns in this dataset
])
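On a dataset that really mixes column types, each branch receives its own list of columns. The following is a sketch on a hypothetical DataFrame; the column names `age`, `income`, and `city` and all values are invented for illustration, and `sparse_threshold=0` is set only to force a dense array for easy inspection.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical mixed-type data with missing values in every column
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "income": [50000.0, 62000.0, np.nan, 58000.0],
    "city": pd.Series(["Paris", "Rome", np.nan, "Paris"], dtype=object),
})

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])
cat_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

# Route each group of columns to its own pipeline
preprocessor = ColumnTransformer(
    [
        ("num", num_pipeline, ["age", "income"]),
        ("cat", cat_pipeline, ["city"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

X_out = preprocessor.fit_transform(df)
print(X_out.shape)  # 2 scaled numeric columns + 2 one-hot city columns
```

The output columns follow the order of the transformer list: the scaled numerical columns first, then the one-hot columns.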

#### 5.   Use a Random Forest Classifier to build the final model.
Random Forest is an ensemble learning method that combines the predictions of many decision trees. It is well suited to classification tasks. Note that scikit-learn's implementation expects numerical input, so categorical features must be encoded first, as the categorical pipeline does. Here's an example:

In [8]:
# Step 5: Use a Random Forest Classifier
rf_classifier = RandomForestClassifier(random_state=42)  # fixed seed for reproducible results

In [9]:
# Step 6: Create the final pipeline
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', rf_classifier)
])

In [10]:
# Step 7: Train the pipeline on the training data
pipeline.fit(X_train, y_train)
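The pipeline trained above does not include the `SelectKBest` step from step 1, so the classifier sees all four features. One way to make feature selection part of the automated pipeline is to add it as a step between preprocessing and the classifier. This is a sketch of that variant, not the notebook's original pipeline; the name `pipeline_with_selection` is illustrative.

```python
from sklearn.compose import ColumnTransformer
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])
preprocessor = ColumnTransformer([("num", num_pipeline, [0, 1, 2, 3])])

# Selection is fitted on the training data only, inside the pipeline
pipeline_with_selection = Pipeline([
    ("preprocessor", preprocessor),
    ("selector", SelectKBest(score_func=f_classif, k=3)),
    ("classifier", RandomForestClassifier(random_state=42)),
])
pipeline_with_selection.fit(X_train, y_train)

n_kept = int(pipeline_with_selection.named_steps["selector"].get_support().sum())
test_accuracy = pipeline_with_selection.score(X_test, y_test)
print("Features passed to the classifier:", n_kept)
print("Test accuracy:", test_accuracy)
```

Keeping the selector inside the pipeline also means it is refitted per fold during cross-validation, which avoids leaking information from validation data into the selection.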

#### 8. Evaluate the accuracy of the model on the test dataset.
Once trained, evaluate the model's performance on the test dataset using metrics like accuracy, precision, recall, or F1-score. Here's an example:

In [11]:
# Evaluate the accuracy on the test data
y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')

print("Accuracy:", accuracy)
print("Recall:", recall)
print("F1 Score:", f1)

Accuracy: 1.0
Recall: 1.0
F1 Score: 1.0


__Interpretation of Results and Possible Improvements:__

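- The pipeline automates missing-value handling and preprocessing for numerical and categorical features and feeds the result into a Random Forest model. On Iris, all four columns are numerical and none contain missing values, so the imputers and the categorical branch are not actually exercised here.
- The model scores 1.0 for accuracy, recall, and F1 on the test set. That test set holds only 30 samples (20% of 150), so a perfect score is weak evidence of how well the model generalizes.
- The `SelectKBest` selector from step 1 was fitted separately and is not part of the final pipeline, so the model is trained on all four features.
- If the accuracy on a harder dataset is unsatisfactory, consider the following: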

Hyperparameter Tuning: Experiment with different hyperparameter values for the Random Forest Classifier. Adjusting parameters such as the number of estimators, maximum depth, and minimum samples per leaf can significantly impact the model's performance.
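As a rough illustration of tuning, the sketch below runs a small grid search over the three parameters named above. The grid values are arbitrary choices for demonstration, and the `classifier__` prefix assumes the Random Forest step is named `classifier`, as in the pipeline earlier.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(random_state=42)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {
    "classifier__n_estimators": [50, 100],
    "classifier__max_depth": [None, 3],
    "classifier__min_samples_leaf": [1, 3],
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", search.best_score_)
print("Test accuracy of the refitted best model:", search.score(X_test, y_test))
```

The search uses only the training split; the held-out test set is touched once at the end.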

Feature Engineering: Explore different feature engineering techniques to enhance the model's performance. For example, you can try creating new features based on domain knowledge, applying different transformations to the existing features, or incorporating interaction terms.

Alternative Models: Consider trying alternative models or ensemble methods besides Random Forest. Different algorithms might have different strengths and weaknesses, so exploring models like Gradient Boosting, Support Vector Machines, or Neural Networks could potentially yield better results.

Data Augmentation: If the dataset is small, you can explore techniques like data augmentation to generate additional synthetic samples. This can help improve the model's generalization and performance.

Cross-Validation: Evaluate the model's performance using cross-validation techniques to obtain more reliable estimates of its accuracy. This can help assess whether the initial accuracy score is consistent across different folds of the data.
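A minimal sketch of that check, assuming 5 stratified folds (the fold count and seed are arbitrary choices):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)

# Stratified folds keep the class proportions the same in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    RandomForestClassifier(random_state=42), X, y, cv=cv, scoring="accuracy"
)

print("Fold accuracies:", scores)
print("Mean: %.3f, std: %.3f" % (scores.mean(), scores.std()))
```

The spread across folds gives a sense of how much a single 30-sample test score can vary.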

Handling Class Imbalance: If the dataset has class imbalance, consider addressing it by using techniques such as oversampling, undersampling, or using class weights during model training. This can prevent the model from being biased towards the majority class.

### Q2. Build a pipeline that includes a random forest classifier and a logistic regression classifier, and then use a voting classifier to combine their predictions. Train the pipeline on the Iris dataset and evaluate its accuracy.

In [14]:
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression

# Create individual classifiers
rf_classifier = RandomForestClassifier(random_state=42)
lr_classifier = LogisticRegression(random_state=42)

# Create a voting classifier
voting_classifier = VotingClassifier(
    estimators=[('rf', rf_classifier), ('lr', lr_classifier)],
    voting='hard'  # Use majority voting
)

# Create the final pipeline
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', voting_classifier)
])

# Train the pipeline on the training data
pipeline.fit(X_train, y_train)

# Evaluate the accuracy on the test data
y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')

print("Accuracy:", accuracy)
print("Recall:", recall)
print("F1 Score:", f1)


Accuracy: 1.0
Recall: 1.0
F1 Score: 1.0
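With only two estimators, hard voting ends in a tie whenever the two models disagree; scikit-learn documents that such ties are resolved by ascending class label order. One alternative worth trying is soft voting, which averages the predicted class probabilities instead. The following is a sketch of that variant, not part of the original answer; `max_iter=1000` is an assumption added to give logistic regression room to converge.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Soft voting averages predict_proba outputs, so both estimators must support it
soft_voting = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(random_state=42)),
        ("lr", LogisticRegression(random_state=42, max_iter=1000)),
    ],
    voting="soft",
)

model = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", soft_voting),
])
model.fit(X_train, y_train)

# Averaged class probabilities for the first three test samples
proba = model.predict_proba(X_test[:3])
print(proba.round(3))

soft_accuracy = accuracy_score(y_test, model.predict(X_test))
print("Soft-voting accuracy:", soft_accuracy)
```

Each row of `proba` sums to 1, and the predicted class is the one with the highest averaged probability.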
