### 1. You are working on a machine learning project where you have a dataset containing numerical and categorical features. You have identified that some of the features are highly correlated and there are missing values in some of the columns. You want to build a pipeline that automates the feature engineering process and handles the missing values.

Design a pipeline that includes the following steps:

- Use an automated feature selection method to identify the important features in the dataset.
- Create a numerical pipeline that includes the following steps:
  - Impute the missing values in the numerical columns using the mean of the column values.
  - Scale the numerical columns using standardisation.
- Create a categorical pipeline that includes the following steps:
  - Impute the missing values in the categorical columns using the most frequent value of the column.
  - One-hot encode the categorical columns.
- Combine the numerical and categorical pipelines using a ColumnTransformer.
- Use a Random Forest Classifier to build the final model.
- Evaluate the accuracy of the model on the test dataset.

Note: Your solution should include code snippets for each step of the pipeline and a brief explanation of each step. You should also provide an interpretation of the results and suggest possible improvements for the pipeline.

Here is an example of a pipeline that includes these steps:

```python
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectFromModel
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

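# Assuming X is a pandas DataFrame of features and y is the target variable
# Derive the column lists from the dtypes (adjust if your dtypes differ)
numerical_features = X.select_dtypes(include='number').columns.tolist()
categorical_features = X.select_dtypes(exclude='number').columns.tolist()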

# Step 1: Automated feature selection
feature_selector = SelectFromModel(RandomForestClassifier())

# Step 2: Numerical pipeline
numerical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# Step 3: Categorical pipeline
categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))  # skip categories unseen during fit
])

# Step 4: Combine numerical and categorical pipelines
preprocessor = ColumnTransformer([
    ('numerical', numerical_pipeline, numerical_features),
    ('categorical', categorical_pipeline, categorical_features)
])

# Step 5: Final pipeline. Feature selection runs after preprocessing because
# SelectFromModel needs imputed, numerically encoded input to fit its forest.
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('feature_selector', feature_selector),
    ('classifier', RandomForestClassifier())
])

# Step 6: Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 7: Fit the pipeline to the training data
pipeline.fit(X_train, y_train)

# Step 8: Make predictions on the test data
y_pred = pipeline.predict(X_test)

# Step 9: Evaluate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
```

Explanation of the steps:

1. Automated Feature Selection: SelectFromModel fits a model (in this case, a Random Forest Classifier) and keeps only the features whose importance reaches a threshold, which for a random forest defaults to the mean importance.

2. Numerical Pipeline: Handles missing values in numerical columns by imputing them with the mean value of the column. Scales the numerical columns using standardization.

3. Categorical Pipeline: Handles missing values in categorical columns by imputing them with the most frequent value of the column. One-hot encodes the categorical columns.

4. ColumnTransformer: Combines the numerical and categorical pipelines, applying the appropriate preprocessing steps to each type of column.

5. Final Pipeline: Includes the feature selection, preprocessor, and Random Forest Classifier.

6. Split the Data: Splits the data into training and test sets for model evaluation.

7. Fit the Pipeline: Fits the pipeline to the training data, performing all the preprocessing steps and training the model.

8. Make Predictions: Uses the trained pipeline to make predictions on the test data.

9. Evaluate Accuracy: Compares the predicted labels with the true labels and calculates the accuracy of the model.
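To make the feature selection step less of a black box, the following minimal sketch fits `SelectFromModel` on its own and prints what it keeps. The synthetic dataset from `make_classification` and all names ending in `_demo` are assumptions for illustration, not part of the original exercise.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Hypothetical synthetic data: 10 features, only 3 of them informative
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=3, n_redundant=0,
    random_state=0,
)

selector = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0))
selector.fit(X_demo, y_demo)

# Boolean mask of the features whose importance reaches the threshold
mask = selector.get_support()
importances = selector.estimator_.feature_importances_

print("Threshold (mean importance):", selector.threshold_)
print("Kept feature indices:", np.flatnonzero(mask))
print("Reduced shape:", selector.transform(X_demo).shape)
```

Passing an explicit `threshold` (for example `"median"`) to `SelectFromModel` changes how aggressive the selection is.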

Interpretation and Possible Improvements:
The pipeline automates the feature engineering process and handles missing values in both numerical and categorical features. The Random Forest Classifier is used as the final model for classification.
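As a quick check of that claim, here is a self-contained sketch that runs the same kind of pipeline on a small toy DataFrame with missing values. The column names (`age`, `income`, `city`), the data, and the target are all made up for illustration, and the sketch assumes feature selection is placed after preprocessing so the selector sees imputed, encoded values.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 200

# Hypothetical toy data: two numerical columns and one categorical column
toy = pd.DataFrame({
    "age": rng.normal(40, 10, n),
    "income": rng.normal(50_000, 15_000, n),
    "city": rng.choice(["A", "B", "C"], n),
})
toy_y = (toy["age"] + rng.normal(0, 5, n) > 40).astype(int)

# Blank out roughly 10% of each column to simulate missing values
for col in toy.columns:
    toy.loc[rng.random(n) < 0.1, col] = np.nan

preprocessor = ColumnTransformer([
    ("numerical", Pipeline([
        ("imputer", SimpleImputer(strategy="mean")),
        ("scaler", StandardScaler()),
    ]), ["age", "income"]),
    ("categorical", Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("encoder", OneHotEncoder(handle_unknown="ignore")),
    ]), ["city"]),
])

model = Pipeline([
    ("preprocessor", preprocessor),
    ("feature_selector", SelectFromModel(RandomForestClassifier(random_state=0))),
    ("classifier", RandomForestClassifier(random_state=0)),
])

model.fit(toy, toy_y)
preds = model.predict(toy)

# model[:-1] is the pipeline without the classifier: preprocessing + selection
selected = model[:-1].transform(toy)
print("Missing values per column:", toy.isna().sum().to_dict())
print("Columns reaching the classifier:", selected.shape[1])
```

The pipeline fits and predicts directly on data containing `NaN`, which is the point of keeping the imputers inside it.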

The accuracy score is the fraction of test samples the model classifies correctly, so higher accuracy indicates better performance on the test dataset. Accuracy alone can be misleading, particularly when the classes are imbalanced, so it is worth also examining the confusion matrix and metrics such as precision, recall, and F1-score.
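A small sketch of those additional metrics follows. The two label lists are hypothetical stand-ins for `y_test` and `y_pred`; in practice you would pass the real arrays from the pipeline.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical labels standing in for y_test and y_pred
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred_demo = [0, 0, 0, 1, 1, 0, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_demo, y_pred_demo)
print(cm)
# [[5 1]
#  [1 3]]

print("Accuracy:", accuracy_score(y_true_demo, y_pred_demo))  # 0.8

# Per-class precision, recall, and F1-score in one table
print(classification_report(y_true_demo, y_pred_demo))
```

Here the overall accuracy is 0.8, but precision and recall for class 1 are both 0.75, which the accuracy figure alone would hide.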

Possible improvements to the pipeline depend on the specific characteristics of your dataset and problem:

1. Hyperparameter Tuning: Explore different hyperparameter values for the Random Forest Classifier to optimize its performance. Consider using techniques like grid search or randomized search.

2. Feature Engineering: Experiment with additional feature engineering techniques, such as creating interaction terms, polynomial features, or domain-specific transformations, to capture more complex relationships in the data.

3. Handling Imbalanced Classes: If the dataset is imbalanced, where one class dominates the other, consider using techniques like oversampling, undersampling, or class weighting to address the class imbalance problem.

4. Model Selection: Besides Random Forest, consider trying other classification algorithms to see if they yield better results for your specific dataset, such as Gradient Boosting, Support Vector Machines, or Neural Networks.
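Two of the suggestions above, hyperparameter tuning and class weighting, can be combined in one grid search over a pipeline. This is a minimal sketch on a hypothetical imbalanced dataset from `make_classification`; the grid values and the choice of F1 as the scoring metric are illustrative assumptions, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical imbalanced data: roughly 80% class 0, 20% class 1
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, weights=[0.8, 0.2], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(random_state=0)),
])

# "<step name>__<parameter>" addresses a parameter of a pipeline step
param_grid = {
    "classifier__n_estimators": [50, 100],
    "classifier__max_depth": [None, 5],
    "classifier__class_weight": [None, "balanced"],
}

search = GridSearchCV(pipe, param_grid, cv=3, scoring="f1")
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
test_f1 = search.score(X_te, y_te)  # F1 of the refitted best model
print("Test F1:", test_f1)
```

Searching over the whole pipeline, rather than over the classifier alone, keeps the preprocessing inside each cross-validation fold and avoids leaking test-fold statistics into training.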

It's crucial to adapt and fine-tune the pipeline based on the characteristics of your dataset and the specific requirements of your problem.

### 2. Build a pipeline that includes a random forest classifier and a logistic regression classifier, and then use a voting classifier to combine their predictions. Train the pipeline on the iris dataset and evaluate its accuracy.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import warnings
warnings.filterwarnings("ignore")

# Step 1: Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Step 2: Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 3: Define the individual classifiers
rf_classifier = RandomForestClassifier(random_state=42)
lr_classifier = LogisticRegression(random_state=42)

# Step 4: Define the Voting Classifier combining the individual classifiers
voting_classifier = VotingClassifier(
    estimators=[('rf', rf_classifier), ('lr', lr_classifier)],
    voting='hard'  # Use majority voting
)

# Step 5: Fit the Voting Classifier to the training data
voting_classifier.fit(X_train, y_train)

# Step 6: Make predictions on the test data
y_pred = voting_classifier.predict(X_test)

# Step 7: Evaluate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
```

```
Accuracy: 1.0
```
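Two caveats apply to this result. With only two estimators, hard voting has no real majority: when the classifiers disagree, scikit-learn breaks the tie by choosing the class that comes first in sorted label order. In addition, a single 30-sample test split gives a noisy estimate. The sketch below is one possible variation, not part of the original solution: it switches to soft voting (averaging predicted probabilities) and uses 5-fold cross-validation. The `max_iter=1000` setting is an assumption added so that logistic regression converges without suppressing warnings.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X_iris, y_iris = load_iris(return_X_y=True)

# Soft voting averages the class probabilities of both models
soft_voter = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(random_state=42)),
        ("lr", LogisticRegression(max_iter=1000, random_state=42)),
    ],
    voting="soft",
)

# One accuracy value per fold; the spread shows how stable the estimate is
scores = cross_val_score(soft_voter, X_iris, y_iris, cv=5)
print("Fold accuracies:", scores)
print("Mean accuracy: %.3f (std %.3f)" % (scores.mean(), scores.std()))
```

Reporting the mean and standard deviation across folds gives a more reliable picture than one perfect score on a small test set.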
