# TASK 9: Evaluation Metrics – Pick the Best Performer

## Description of the Task

The objective of this task is to evaluate and compare multiple **pretrained machine learning models**
using a suitable test dataset and identify the best-performing model based on standard evaluation metrics.

Unlike typical machine learning tasks that focus on training models, this task emphasizes
**model evaluation and comparison**. It helps in understanding how different algorithms behave
when tested on the same data and how evaluation metrics guide the selection of the most effective model.

Specifically, the task involves:

- Selecting an appropriate test dataset
- Loading pretrained models saved as `.pkl` files
- Using the models to generate predictions on the test dataset
- Evaluating model performance using classification metrics
- Comparing results and identifying the best-performing model


## Understanding Model Evaluation

Model evaluation is a crucial step in the machine learning workflow.
It helps measure how well a trained model performs on unseen data.

Instead of relying on a single metric, multiple evaluation metrics are used to gain a more
complete understanding of a model’s strengths and weaknesses. This task highlights the importance
of choosing the right metrics based on the problem type.

Since the models in this task are already trained, evaluation becomes the primary tool
for judging their effectiveness.

## Dataset Used as Test Dataset

I used the **Iris dataset** as the **test dataset** for evaluation.

- **Target variable:** `Species`
  - Iris-setosa  
  - Iris-versicolor  
  - Iris-virginica  
- **Features:**  
  - Sepal length  
  - Sepal width  
  - Petal length  
  - Petal width  
- **Missing values:** None  

The dataset was chosen because it is clean, balanced, and commonly used for benchmarking
classification models.

## Pretrained Models Used

The following five pretrained machine learning models were provided and evaluated:

1. Decision Tree Classifier  
2. Logistic Regression  
3. K-Nearest Neighbors (KNN)  
4. Support Vector Machine (SVM)  
5. Random Forest Classifier  

All models were loaded using the **joblib** library and evaluated on the same test dataset
to ensure a fair comparison.
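
The loading step can be sketched as follows. This is a minimal illustration, not the exact code used in the task: the helper name `load_models`, the directory, and the file names are hypothetical placeholders.

```python
from pathlib import Path

import joblib


def load_models(model_dir, filenames):
    """Load each pickled model and return a {name: model} dict.

    `model_dir` and `filenames` are placeholders; substitute the real
    location and names of the provided .pkl files.
    """
    model_dir = Path(model_dir).resolve()  # absolute path avoids cwd surprises
    models = {}
    for filename in filenames:
        path = model_dir / filename
        if not path.is_file():
            raise FileNotFoundError(f"Model file not found: {path}")
        # Use the file stem (e.g. "knn" for "knn.pkl") as the model's name.
        models[path.stem] = joblib.load(path)
    return models
```

A call might look like `models = load_models("models", ["knn.pkl", "svm.pkl"])`, where both the folder and the file names are assumptions. Since `joblib.load` unpickles arbitrary objects, it should only be used on files from a trusted source.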


## Approach Followed to Solve the Task

### 1. Data Preparation

The Iris dataset was loaded into a Pandas DataFrame to inspect its structure.
The `Id` column was removed since it is only an identifier and does not contribute to prediction.
The remaining columns were separated into:

- **Features (X)** – input variables
- **Target (y)** – flower species
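
A minimal sketch of this split, assuming the column names `Id` and `Species` described above. The helper name `split_features_target` is hypothetical, as is any CSV file name passed to `pd.read_csv`.

```python
import pandas as pd


def split_features_target(df, id_col="Id", target_col="Species"):
    """Drop the identifier column and separate features from the target."""
    # errors="ignore" keeps this working if the file has no Id column.
    data = df.drop(columns=[id_col], errors="ignore")
    X = data.drop(columns=[target_col])
    y = data[target_col]
    return X, y
```

With a DataFrame loaded via something like `df = pd.read_csv("Iris.csv")` (file name assumed), `X, y = split_features_target(df)` leaves the original `df` unchanged and returns the four measurement columns in `X` and the species labels in `y`.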


### 2. Handling Label Encoding

While evaluating the models, it was observed that some pretrained models returned
numerical class labels, whereas the dataset contained class labels in string format.

To ensure consistency between the true labels and predicted labels, **LabelEncoder**
was used to encode the target variable into numerical form.
This step was necessary to avoid errors during metric computation.
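
The encoding step can be illustrated with a short sketch. The sample list below is made up for demonstration. `LabelEncoder` assigns integers in sorted label order, so the encoded labels only line up with a model's numeric predictions if that model was trained with the same mapping, which is an assumption here.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative labels only; in the task this would be the Species column.
species = ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-setosa"]

encoder = LabelEncoder()
y_encoded = encoder.fit_transform(species)

# LabelEncoder numbers the classes in sorted order of the label strings.
label_map = {label: int(code) for code, label in enumerate(encoder.classes_)}
```

Keeping `encoder` around allows numeric predictions to be mapped back to species names with `encoder.inverse_transform(...)` when reporting results.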


### 3. Model Evaluation

Each pretrained model was used to generate predictions on the Iris test dataset.
The predicted values were then compared with the true labels using multiple evaluation metrics.

Evaluating all models on the same dataset ensured consistency and allowed for
a meaningful comparison of performance.
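
One way to structure this loop is sketched below. It assumes `models` is a dict mapping a display name to a fitted estimator with a `predict` method. The function name `evaluate_models` and the column names are illustrative choices, not taken from the original notebook.

```python
import pandas as pd
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
)


def evaluate_models(models, X, y_true):
    """Return one row of metrics per model, all scored on the same X and y_true."""
    rows = []
    for name, model in models.items():
        y_pred = model.predict(X)
        rows.append(
            {
                "Model": name,
                "Accuracy": accuracy_score(y_true, y_pred),
                # zero_division=0 scores a never-predicted class as 0
                # instead of emitting a warning.
                "Precision (weighted)": precision_score(
                    y_true, y_pred, average="weighted", zero_division=0
                ),
                "Recall (weighted)": recall_score(
                    y_true, y_pred, average="weighted", zero_division=0
                ),
                "F1 (weighted)": f1_score(
                    y_true, y_pred, average="weighted", zero_division=0
                ),
            }
        )
    return pd.DataFrame(rows).set_index("Model")
```

Because every model sees the same `X` and `y_true`, differences between rows reflect the models rather than the data.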


## Evaluation Metrics Used

Since this is a **multi-class classification problem**, the following metrics were used:

- **Accuracy** – fraction of all predictions that are correct  
- **Precision (weighted)** – of the samples predicted as a given class, the fraction that truly belong to it  
- **Recall (weighted)** – of the samples that truly belong to a given class, the fraction predicted correctly  
- **F1-score (weighted)** – harmonic mean of precision and recall  

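Weighted averaging computes each metric per class and then averages the per-class values,
weighting each class by its number of true samples. Because the dataset is balanced,
this gives the same values as an unweighted (macro) average.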
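
To make the weighting concrete, here is a from-scratch sketch of the weighted F1-score in plain Python. It is for illustration only and is not the code used in the task. The function name `weighted_f1` is hypothetical, and it assumes `y_true` is non-empty.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Weighted F1: per-class F1 averaged with weights equal to class support."""
    support = Counter(y_true)    # number of true samples per class
    predicted = Counter(y_pred)  # number of predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    total = 0.0
    for label, n_true in support.items():
        tp = correct[label]
        precision = tp / predicted[label] if predicted[label] else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        total += f1 * n_true  # weight each class by its support
    return total / len(y_true)
```

When every class has the same support, the weights are equal, so the result matches the plain mean of the per-class F1 scores.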


## Results and Model Comparison

The evaluation results for all five models were compiled into a comparison table.
This made it easy to observe differences in performance across models and metrics.

The model with the **highest weighted F1-score** was identified as the best-performing model,
because the F1-score combines precision and recall into a single number.
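
Selecting the winner from a comparison table can be sketched as below, assuming the results are held in a Pandas DataFrame indexed by model name. The helper name `pick_best` and the column name `"F1 (weighted)"` are assumptions for illustration.

```python
import pandas as pd


def pick_best(results, metric="F1 (weighted)"):
    """Return (model_name, score) for the highest value in `metric`."""
    # idxmax returns the index label of the first maximum, so ties are
    # resolved by row order.
    best_name = results[metric].idxmax()
    return best_name, float(results.loc[best_name, metric])
```

If two models tie on the chosen metric, only the first one is returned, so a secondary metric may be needed to break the tie.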


## Difficulties Faced During the Task

Several challenges were encountered while completing this task:

- File path issues occurred while loading the pretrained model files, as they were stored
  in a different directory. This was resolved by identifying and using absolute file paths.
- A label format mismatch caused errors during metric calculation because the dataset
  contained string labels while some models produced numerical predictions.
  This issue was resolved using label encoding.
- Warnings related to scikit-learn version differences and feature names were encountered.
  After determining that these warnings did not affect the prediction results, the evaluation
  proceeded as planned.
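
Rather than ignoring warnings outright, one option is to capture them during prediction so each can be reviewed. The sketch below uses only the standard library. The helper name `predict_and_collect_warnings` is hypothetical and is not part of scikit-learn.

```python
import warnings


def predict_and_collect_warnings(model, X):
    """Run model.predict(X) and return (predictions, list of warning messages)."""
    with warnings.catch_warnings(record=True) as caught:
        # "always" ensures repeated warnings are recorded, not deduplicated.
        warnings.simplefilter("always")
        y_pred = model.predict(X)
    return y_pred, [str(w.message) for w in caught]
```

The collected messages can then be logged next to each model's metrics, which makes it easier to check whether a version or feature-name warning coincides with unexpected predictions.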

These challenges helped highlight common real-world issues that arise during model evaluation.

## Outcomes and Learnings

Through this task, I gained a strong understanding of how pretrained machine learning models
can be evaluated and compared using appropriate metrics.

I learned the importance of selecting a suitable test dataset, ensuring consistent preprocessing,
and choosing the right evaluation metrics based on the problem type. The task also improved my
debugging skills by exposing me to practical issues such as file handling, label mismatches,
and library warnings.

Overall, this task reinforced the idea that model evaluation is just as important as model training
in building reliable machine learning systems.
