# Waldemar Chang - Assignment 4: Constructing the Inference Service
## EN.705.603.82.FA24 Creating AI-Enabled Systems
### Task 5
#### Using a notebook called model_performance.ipynb, analyze the performance of the two models provided to you. Use the TechTrack Dataset to perform this analysis. Finally, provide a thorough argument on why you favor one model over another.

#### Analysis
To analyze the performance of the two YOLO object detection models, Model 1 and Model 2, several performance metrics were used: mean Average Precision (mAP), precision, recall, F1-score, and specificity. The goal is to determine which model offers superior performance for potential deployment in real-world applications. The efficacy of an object detection model is determined not only by its predictive accuracy but also by its robustness, generalization capability, and performance across different object classes and conditions.

The 20 object classes under evaluation include: barcode, car, cardboard box, fire, forklift, freight container, gloves, helmet, ladder, license plate, person, QR code, road sign, safety vest, smoke, traffic cone, traffic light, truck, van, and wood pallet. Each class presents distinct challenges, such as varying object sizes, occlusions, and class imbalances, making a comprehensive evaluation essential for identifying the best-performing model.

Due to the computational intensity of tracking metrics such as precision, recall, and F1-score for all 20 class labels across the larger 100-image dataset, most of the metrics were gathered using a smaller, more manageable 10-image dataset. However, mAP, which provides a broad measure of detection performance across all classes, was recorded for both the 10-image and 100-image datasets. This approach allowed for a more detailed analysis while keeping the computational demands feasible.

In this analysis, Model 1 and Model 2 will be compared by:

1. Evaluating performance metrics.
2. Analyzing per-class performance.
3. Identifying strengths and weaknesses through error analysis.
4. Discussing the implications of the findings for practical applications.

In [None]:
import os
import random
import cv2 as cv
import numpy as np
import matplotlib.pyplot as plt
from nms import filter
from object_detection import Model, draw_bboxes
from helper import calculate_iou, calculate_pr, calculate_ap, calculate_map, calculate_11pi, calculate_precision_recall_f1, calculate_specificity, get_ground_truth, denormalize

In [None]:
# Define path to folder containing images and labels
folder_path = r"C:\Users\walde\techtrack\notebooks\logistics"

# Initialize models
m1 = Model('model1.cfg', 'model1.weights')
m2 = Model('model2.cfg', 'model2.weights')

#### 10 Images

In [None]:
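# Get all image file names, sorted so that the seeded sample below does not
# depend on the (unspecified) order in which os.listdir returns entries
image_files = sorted(f for f in os.listdir(folder_path) if f.endswith('.jpg'))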

# Randomly select 10 images
random.seed(42)
sample_size = 10
sampled_images = random.sample(image_files, sample_size)

In [None]:
# Get list of class IDs
num_classes = 20
class_ids_list = list(range(num_classes))
class_labels = ['barcode', 'car', 'cardboard box', 'fire', 'forklift', 'freight container', 'gloves', 
                'helmet', 'ladder', 'license plate', 'person', 'qr code', 'road sign', 'safety vest', 
                'smoke', 'traffic cone', 'traffic light', 'truck', 'van', 'wood pallet']

# Initialize performance metrics
aps_model1 = []
aps_model2 = []

# Dictionaries to store per-class metrics
metrics_model1 = {
    'tp': {class_id: 0 for class_id in class_ids_list},
    'fp': {class_id: 0 for class_id in class_ids_list},
    'fn': {class_id: 0 for class_id in class_ids_list},
    'scores': {class_id: [] for class_id in class_ids_list},
    'gt_counts': {class_id: 0 for class_id in class_ids_list}
}

metrics_model2 = {
    'tp': {class_id: 0 for class_id in class_ids_list},
    'fp': {class_id: 0 for class_id in class_ids_list},
    'fn': {class_id: 0 for class_id in class_ids_list},
    'scores': {class_id: [] for class_id in class_ids_list},
    'gt_counts': {class_id: 0 for class_id in class_ids_list}
}

# Loop through the sampled images
for file_name in sampled_images:
    # Get base name without the extension
    base_name = os.path.splitext(file_name)[0]
    
    # Define corresponding text file path
    text_file = os.path.join(folder_path, f"{base_name}.txt")
    
    # Check if corresponding text file exists
    if os.path.exists(text_file):
        # Read the image
        image_path = os.path.join(folder_path, file_name)
        image = cv.imread(image_path)
        
        # Read ground truth data
        ground_truth = get_ground_truth(text_file)
        gt_bboxes = [bbox for _, bbox in ground_truth]
        gt_class_ids = [class_id for class_id, _ in ground_truth]
        
        # Update ground truth counts per class
        for class_id in gt_class_ids:
            metrics_model1['gt_counts'][class_id] += 1
            metrics_model2['gt_counts'][class_id] += 1
        
        # Perform Model 1 prediction and post-processing
        m1_pp_frame, m1_og_frame = m1.preprocess(image)
        m1_pred_bboxes, m1_pred_class_ids, m1_pred_scores = m1.predict(m1_pp_frame)
        m1_post_bboxes, m1_post_class_ids, m1_post_scores = m1.post_process(
            m1_pred_bboxes, m1_pred_class_ids, m1_pred_scores, m1_og_frame)
        m1_nms_bboxes, m1_nms_class_ids, m1_nms_labels, m1_nms_scores = filter(
            m1_post_bboxes, m1_post_class_ids, m1_post_scores, 0.5, 0.4)

        # Perform Model 2 prediction and post-processing
        m2_pp_frame, m2_og_frame = m2.preprocess(image)
        m2_pred_bboxes, m2_pred_class_ids, m2_pred_scores = m2.predict(m2_pp_frame)
        m2_post_bboxes, m2_post_class_ids, m2_post_scores = m2.post_process(
            m2_pred_bboxes, m2_pred_class_ids, m2_pred_scores, m2_og_frame)
        m2_nms_bboxes, m2_nms_class_ids, m2_nms_labels, m2_nms_scores = filter(
            m2_post_bboxes, m2_post_class_ids, m2_post_scores, 0.5, 0.4)
        
        # Calculate performance metrics for Model 1
        precision_m1, recall_m1, tp_m1, fp_m1, fn_m1 = calculate_pr(
            m1_nms_bboxes, m1_nms_class_ids, m1_nms_scores, gt_bboxes, gt_class_ids,
            num_classes=num_classes)
        ap_m1 = calculate_ap(precision_m1, recall_m1)
        aps_model1.append(ap_m1)
        
        # Update per-class metrics for Model 1
        for class_id in class_ids_list:
            metrics_model1['tp'][class_id] += tp_m1[class_id]
            metrics_model1['fp'][class_id] += fp_m1[class_id]
            metrics_model1['fn'][class_id] += fn_m1[class_id]
            class_scores = [score for cid, score in zip(m1_nms_class_ids, m1_nms_scores) if cid == class_id]
            metrics_model1['scores'][class_id].extend(class_scores)

        # Calculate performance metrics for Model 2
        precision_m2, recall_m2, tp_m2, fp_m2, fn_m2 = calculate_pr(
            m2_nms_bboxes, m2_nms_class_ids, m2_nms_scores, gt_bboxes, gt_class_ids,
            num_classes=num_classes)
        ap_m2 = calculate_ap(precision_m2, recall_m2)
        aps_model2.append(ap_m2)
        
        # Update per-class metrics for Model 2
        for class_id in class_ids_list:
            metrics_model2['tp'][class_id] += tp_m2[class_id]
            metrics_model2['fp'][class_id] += fp_m2[class_id]
            metrics_model2['fn'][class_id] += fn_m2[class_id]
            class_scores = [score for cid, score in zip(m2_nms_class_ids, m2_nms_scores) if cid == class_id]
            metrics_model2['scores'][class_id].extend(class_scores)
        
        # Visualize results (once per image, outside the per-class loop above)
        # Draw ground truth and predicted bounding boxes on image for comparison
        image_gt = image.copy()
        image_gt = draw_bboxes(image_gt, gt_bboxes, gt_class_ids)

        # Draw predicted boxes for Model 1
        image_m1 = image.copy()
        image_m1 = draw_bboxes(image_m1, m1_nms_bboxes, m1_nms_class_ids, m1_nms_scores)
        #image_m1 = draw_bboxes(image_m1, m1_post_bboxes, m1_post_class_ids, m1_post_scores)

        # Draw predicted boxes for Model 2
        image_m2 = image.copy()
        image_m2 = draw_bboxes(image_m2, m2_nms_bboxes, m2_nms_class_ids, m2_nms_scores)
        #image_m2 = draw_bboxes(image_m2, m2_post_bboxes, m2_post_class_ids, m2_post_scores)

        # Display images side by side
        plt.figure(figsize=(15, 10))
        plt.subplot(1, 3, 1)
        plt.imshow(cv.cvtColor(image_gt, cv.COLOR_BGR2RGB))
        plt.title("Ground Truth")
        plt.axis('off')

        plt.subplot(1, 3, 2)
        plt.imshow(cv.cvtColor(image_m1, cv.COLOR_BGR2RGB))
        plt.title("Model 1 Predictions")
        plt.axis('off')

        plt.subplot(1, 3, 3)
        plt.imshow(cv.cvtColor(image_m2, cv.COLOR_BGR2RGB))
        plt.title("Model 2 Predictions")
        plt.axis('off')

        plt.suptitle(f"Comparison for {file_name}", fontsize=16)
        plt.show()

    else:
        print(f"No corresponding text file found for {file_name}")

# Calculate mAP for both models
mAP_model1 = calculate_map(aps_model1)
mAP_model2 = calculate_map(aps_model2)

print(f"Model 1 mAP: {mAP_model1:.4f}")
print(f"Model 2 mAP: {mAP_model2:.4f}")

# Get per-class metrics for both models
per_class_metrics_model1 = calculate_precision_recall_f1(metrics_model1)
per_class_metrics_model2 = calculate_precision_recall_f1(metrics_model2)

# Print per-class metrics
print("\nModel 1 Per-Class Metrics:")
for class_id, metrics in per_class_metrics_model1.items():
    print(f"Class {class_id} ({class_labels[class_id]}): Precision: {metrics['precision']:.2f}, "
          f"Recall: {metrics['recall']:.2f}, F1-score: {metrics['f1_score']:.2f}, "
          f"TP: {metrics['tp']}, FP: {metrics['fp']}, FN: {metrics['fn']}, "
          f"GT Count: {metrics['gt_count']}")

print("\nModel 2 Per-Class Metrics:")
for class_id, metrics in per_class_metrics_model2.items():
    print(f"Class {class_id} ({class_labels[class_id]}): Precision: {metrics['precision']:.2f}, "
          f"Recall: {metrics['recall']:.2f}, F1-score: {metrics['f1_score']:.2f}, "
          f"TP: {metrics['tp']}, FP: {metrics['fp']}, FN: {metrics['fn']}, "
          f"GT Count: {metrics['gt_count']}")

specificity_model1 = calculate_specificity(metrics_model1)
specificity_model2 = calculate_specificity(metrics_model2)

# Print specificity per class
print("\nModel 1 Specificity Per Class:")
for class_id, specificity in specificity_model1.items():
    print(f"Class {class_id} ({class_labels[class_id]}): Specificity: {specificity:.2f}")

print("\nModel 2 Specificity Per Class:")
for class_id, specificity in specificity_model2.items():
    print(f"Class {class_id} ({class_labels[class_id]}): Specificity: {specificity:.2f}")

#### Evaluating Performance Metrics
The performance of Model 1 and Model 2 was assessed using several metrics, each providing insight into a different aspect of detection accuracy and reliability. The key metric was mean Average Precision (mAP), which summarizes precision and recall across confidence thresholds. It is widely used in object detection because it gives a single view of how well a model detects and classifies objects across all classes. On the 100-image dataset, Model 1 achieved an mAP of 0.0950 and Model 2 a higher mAP of 0.1149, giving Model 2 the edge on the larger sample. On the 10-image dataset the ordering reversed: Model 1 achieved an mAP of 0.2000 compared to Model 2's 0.1500. Because the 10 images are a random subset of the same data, the most likely explanation is sampling noise: with so few images, a single detection can shift mAP substantially, so the sample may simply have favored Model 1 by chance. The 100-image result is therefore the more reliable estimate, and Model 2's higher mAP there suggests that it is more consistent across diverse images, a critical quality for real-world deployment where the number and variety of images encountered will be greater.
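As a rough illustration of how a single AP value can be obtained from precision-recall points and then averaged into an mAP, the sketch below computes an 11-point interpolated AP. It is a minimal sketch under assumptions: the names `eleven_point_ap` and `mean_ap` and the sample precision-recall points are hypothetical, and the code is not the `calculate_ap`, `calculate_11pi`, or `calculate_map` helpers imported in this notebook, whose implementations may differ.

```python
import numpy as np


def eleven_point_ap(precisions, recalls):
    """11-point interpolated AP (hypothetical helper, for illustration only).

    At each recall level 0.0, 0.1, ..., 1.0 the interpolated precision is the
    highest precision observed at any recall >= that level (0 if none).
    """
    precisions = np.asarray(precisions, dtype=float)
    recalls = np.asarray(recalls, dtype=float)
    total = 0.0
    for i in range(11):
        level = i / 10
        mask = recalls >= level
        total += precisions[mask].max() if mask.any() else 0.0
    return total / 11.0


def mean_ap(ap_values):
    """Plain arithmetic mean of AP values; 0.0 for an empty list."""
    return sum(ap_values) / len(ap_values) if ap_values else 0.0


# Illustrative precision-recall points (not measured values from this notebook)
ap_a = eleven_point_ap([1.0, 0.5], [0.5, 1.0])
ap_b = eleven_point_ap([1.0], [0.5])
print(f"AP a: {ap_a:.4f}, AP b: {ap_b:.4f}, mean: {mean_ap([ap_a, ap_b]):.4f}")
```

With only a handful of AP values in the average, one changed value moves the mean noticeably, which is consistent with the 10-image and 100-image mAP figures disagreeing.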

For the 10-image dataset, nonzero precision and recall were recorded for only two classes: fire and truck. For fire, Model 1 achieved a precision of 1.00 and a recall of 0.50, meaning it correctly identified one instance of "fire" while missing another. Model 2 had the same result, giving both models an F1-score of 0.67. In this class, both models made correct detections but did not capture all instances. In the truck class, Model 1 achieved a perfect precision and recall of 1.00, detecting every instance of "truck" without producing any false positives. Model 2 also achieved perfect recall (1.00) but with a precision of only 0.50: it produced one false positive alongside its one true positive, which lowered its F1-score to 0.67.

For all other classes in the 10-image dataset, both models recorded a precision and recall of 0.00, meaning that neither produced a single true positive in those categories and the corresponding F1-scores were 0.00. The models therefore struggled significantly outside of the "fire" and "truck" classes on this sample. The result also highlights the limitation of evaluating overall performance on such a small dataset, since most classes contribute nothing to distinguishing the two models. The F1-score, the harmonic mean of precision and recall, captures the trade-off between the two metrics. It is particularly useful with uneven class distributions, because it measures how well a model identifies all relevant objects without generating too many false positives. For both models, an F1-score of 0.00 in a class means the model either missed the objects entirely or produced only false positives.
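The per-class figures quoted above follow directly from the TP, FP, and FN counts. The sketch below reproduces them. The function name `precision_recall_f1` is hypothetical and is not the notebook's `calculate_precision_recall_f1` helper. Returning 0.0 when a denominator is zero is an assumption, chosen to match the 0.00 values reported for classes with no detections.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from raw counts (hypothetical helper).

    Assumption: a metric with a zero denominator is reported as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Counts taken from the 10-image results discussed in this notebook
cases = {
    "fire (both models)": (1, 0, 1),
    "truck (Model 1)": (1, 0, 0),
    "truck (Model 2)": (1, 1, 0),
}
for name, (tp, fp, fn) in cases.items():
    p, r, f1 = precision_recall_f1(tp, fp, fn)
    print(f"{name}: precision={p:.2f}, recall={r:.2f}, f1={f1:.2f}")
```

Although "fire" and Model 2's "truck" result share an F1-score of 0.67, they reach it from opposite directions: the first loses recall, the second loses precision.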

Specificity, a model's ability to avoid false positives, is another important metric, especially in applications where false alarms are costly or disruptive. Model 1 exhibited higher specificity than Model 2 in several object classes. The difference is clearest in classes where Model 1 made no detections at all, such as "forklift": Model 1 produced no false positives there, whereas Model 2 produced false positives in both "forklift" and "truck." This trait makes Model 1 attractive in controlled environments, such as warehouses, where false alarms might interrupt operational workflows. Model 2's lower specificity could be problematic in such contexts, but it matters less where a missed detection (false negative) is more damaging than a false alarm (false positive), as in safety-critical applications. These results are based on a very limited dataset, and both models would likely behave differently on larger datasets.
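Specificity is defined as TN / (TN + FP), but true negatives are not uniquely defined for object detection, so any implementation has to pick a convention. The sketch below uses one possible image-level convention, stated here as an assumption: for a given class, an image counts as a negative when it contains no ground-truth instance of that class, and as a false positive when the model nevertheless predicts that class in it. The function name and the sample data are hypothetical, and this is not necessarily how the notebook's `calculate_specificity` helper works.

```python
def image_level_specificity(gt_classes_per_image, pred_classes_per_image, class_id):
    """Image-level specificity for one class (assumed convention, see lead-in).

    Each argument is a list with one set of class IDs per image.
    Returns NaN when no image is a negative for the class.
    """
    tn = fp = 0
    for gt, pred in zip(gt_classes_per_image, pred_classes_per_image):
        if class_id in gt:
            continue  # the image is a positive for this class, so it is skipped
        if class_id in pred:
            fp += 1   # class predicted in an image that does not contain it
        else:
            tn += 1   # class correctly not predicted
    negatives = tn + fp
    return tn / negatives if negatives else float("nan")


# Hypothetical example with four images (class 4 = forklift, 17 = truck)
gt = [{17}, {3}, set(), {10}]
pred = [{17}, {3, 4}, {4}, set()]
print(image_level_specificity(gt, pred, class_id=4))
print(image_level_specificity(gt, pred, class_id=17))
```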

#### Analyzing Per-Class Performance
When performance is broken down by object class for the 10-image dataset, both Model 1 and Model 2 struggled with the majority of object classes, indicating similar weaknesses. In classes such as "barcode," "car," "forklift," "freight container," "gloves," "helmet," and "wood pallet," neither model produced a true positive, yielding precision, recall, and F1-scores of 0.00. This may reflect difficulty with smaller, occluded, or less distinct objects, which would significantly reduce the models' effectiveness in real-world applications where such objects are common. With only 10 images, however, a class can have very few ground-truth instances, so these per-class scores should be read with caution.

For the "fire" class, both models performed similarly well, achieving precision of 1.00 and recall of 0.50, resulting in an F1-score of 0.67. This means that both models were able to correctly detect one instance of "fire" but missed another, indicating some success but not complete reliability in this category. For the "truck" class, Model 1 performed slightly better, achieving a precision and recall of 1.00 (F1-score of 1.00), meaning it correctly identified all instances without any false positives. Model 2, on the other hand, detected all trucks (with a recall of 1.00) but had one false positive, lowering its precision to 0.50 and F1-score to 0.67.

Both models struggled equally in most of the other classes. For example, neither model detected any objects in the "license plate," "road sign," or "QR code" classes. These failures suggest that the models did not generalize well across object types in the small dataset. Model 1's perfect precision on "truck" was not repeated in other classes, and Model 2's result on "truck" (perfect recall with a precision of 0.50) does not significantly distinguish it from Model 1 in overall detection ability. The similar patterns across classes suggest that both models face challenges with smaller, occluded, or less distinctive objects. Their limited success in the "fire" and "truck" classes reflects an ability to detect larger, clearer objects and highlights the need for improvement on more diverse object types and conditions.

#### Identifying Strengths and Weaknesses Through Error Analysis
Examining the errors made by both models reveals their respective strengths and weaknesses. One of the key strengths of Model 1 is its high specificity for many classes, meaning it is less likely to generate false positives. This trait can be beneficial in environments like warehouse management, where false alarms might lead to operational inefficiencies. However, when analyzing the actual performance across object classes, Model 1's weaknesses become apparent. While it performed well in the truck class, with perfect precision and recall (F1-score of 1.00), it struggled significantly with other classes, such as "fire," where it missed one instance, and smaller or less distinct objects like "forklift," "gloves," and "QR code," where it failed to detect any objects, yielding precision, recall, and F1-scores of 0.00. This poor recall in certain classes shows that Model 1 may not be suitable for safety-critical environments, as it could miss important detections in scenarios like fire detection or recognizing protective equipment like helmets and gloves.

Examining the true positives (TP), false positives (FP), and false negatives (FN) reveals the differences between Model 1 and Model 2 more effectively. Both models demonstrated similar performance across most object classes, but there are a few differences in their error profiles. In the truck class, Model 1 performed perfectly with 1 TP, 0 FP, and 0 FN, achieving a precision, recall, and F1-score of 1.00. Model 2, while correctly detecting all truck instances (1 TP, 0 FN), introduced a false positive (1 FP), lowering its precision to 0.50 and F1-score to 0.67. This suggests that Model 1 was slightly better at distinguishing between truck and non-truck objects, while Model 2 misclassified one non-truck object as a truck.

In the fire class, both models achieved identical performance, with 1 TP, 0 FP, and 1 FN, resulting in precision of 1.00 and recall of 0.50. This indicates that while both models correctly identified one instance of fire, they missed another. The fact that both models share these statistics suggests that their capability to detect fire was comparable and that neither model had an advantage in this class. Where the two models diverge more clearly is in their handling of false positives. For example, in the forklift class, Model 1 had 0 TP, 0 FP, and 2 FN, meaning it missed both instances of forklifts. Model 2, on the other hand, also had 0 TP but produced 2 FP, meaning it incorrectly detected forklifts where none existed. This pattern of false positives in Model 2 shows that while both models struggled to detect actual forklifts, Model 2 was more prone to misclassifications, which could be problematic in real-world scenarios where false detections might lead to operational inefficiencies. For most other classes, both models showed near-identical performance, with 0 TP, 0 FP, and high numbers of FN. Neither model managed to detect objects in classes like license plate, road sign, or QR code, showing that both struggled with smaller or more challenging objects in this limited dataset.
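The TP, FP, and FN counts discussed above come from matching predicted boxes to ground-truth boxes. The sketch below shows one common way to do this: greedy matching by descending confidence with an IoU threshold. It rests on assumptions: boxes are taken to be `(x, y, w, h)` tuples, the IoU threshold of 0.5 is illustrative, and the function names are hypothetical. It is not the notebook's `calculate_iou` or `calculate_pr` helper, which may match boxes differently.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x, y, w, h) boxes (assumed format)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def count_matches(pred_boxes, pred_classes, pred_scores,
                  gt_boxes, gt_classes, iou_threshold=0.5):
    """Greedy matching: returns (tp, fp, fn) for one image."""
    # Visit predictions from most to least confident
    order = sorted(range(len(pred_boxes)), key=lambda i: pred_scores[i], reverse=True)
    matched_gt = set()
    tp = fp = 0
    for i in order:
        best_iou, best_j = 0.0, None
        for j, (gt_box, gt_class) in enumerate(zip(gt_boxes, gt_classes)):
            # Each ground-truth box can be matched once, and only by its own class
            if j in matched_gt or gt_class != pred_classes[i]:
                continue
            overlap = iou(pred_boxes[i], gt_box)
            if overlap > best_iou:
                best_iou, best_j = overlap, j
        if best_j is not None and best_iou >= iou_threshold:
            matched_gt.add(best_j)
            tp += 1
        else:
            fp += 1
    fn = len(gt_boxes) - len(matched_gt)  # ground-truth boxes never matched
    return tp, fp, fn


# Hypothetical image: one truck (class 17), two truck predictions
print(count_matches(
    pred_boxes=[(0, 0, 10, 10), (50, 50, 10, 10)],
    pred_classes=[17, 17],
    pred_scores=[0.9, 0.6],
    gt_boxes=[(0, 0, 10, 10)],
    gt_classes=[17],
))
```

The example reproduces the error profile reported for Model 2 in the truck class: one true positive, one false positive, and no false negatives.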

Overall, both models exhibited weaknesses in detecting smaller or less distinct objects such as "license plate," "road sign," and "QR code." These shared failures suggest that neither model generalized well across all object classes, at least in the limited dataset of 10 images. While Model 1 demonstrated strong specificity in avoiding false positives for several classes, it still failed to detect relevant objects in many cases, which reduces its reliability in real-world applications. Model 2, despite producing some false positives, performed similarly to Model 1 across most classes, indicating no significant advantage in per-class detection on this sample. These results are based on only 10 images, and both models would likely behave differently on larger datasets, potentially revealing more about their strengths and limitations. Model 2's higher mAP on the 100-image dataset suggests that it might be more generalizable and better at handling the complexity and variety of real-world environments than Model 1, whose advantage appeared only on the 10-image sample.

#### Implications of the Findings for Practical Applications
The findings from this analysis have certain implications for selecting the appropriate model for specific real-world applications. If the primary concern is avoiding false positives, such as in warehouse environments where operational interruptions are costly, Model 1 may be the better choice due to its generally higher specificity across many object classes. However, neither Model 1 nor Model 2 detected certain key objects, such as “wood pallet” (both had 0 true positives and high false negatives in this class). This lack of detection for such objects may limit both models' applicability in environments where recognizing these items is crucial for inventory management.

Model 1 also showed perfect performance in detecting “truck” (with 1 true positive, 0 false positives, and 0 false negatives), but its weaknesses were clear in other critical classes. For example, Model 1 missed an instance of “fire” (with 1 true positive and 1 false negative) and completely failed to detect smaller or occluded objects like “forklift,” “gloves,” and “QR code”, where it recorded 0 true positives, resulting in precision, recall, and F1-scores of 0.00. This poor recall in these classes makes it less suitable for safety-critical environments where missing detections could have serious consequences, such as in fire detection or monitoring for safety equipment like helmets or gloves.

In contrast, Model 2 struggled with higher false positives in certain classes, such as “forklift” (with 2 false positives and 0 true positives) and “truck” (with 1 false positive and 1 true positive). However, Model 2 was still able to detect all instances of “truck”, achieving a recall of 1.00, though with lower precision (0.50) compared to Model 1. Like Model 1, Model 2 also detected “fire” with 1 true positive and 1 false negative, but it performed poorly in detecting smaller objects like “license plate” and “QR code”, where no objects were detected (yielding 0 true positives, 0 precision, and 0 recall).

For applications where missing a detection could lead to severe consequences, such as fire detection systems or safety equipment recognition, a model that tolerates more false positives in exchange for fewer missed detections is preferable. On the 10-image dataset, the two models had identical recall in the only classes with any detections ("fire" at 0.50 and "truck" at 1.00), so the per-class results alone do not separate them on this criterion. The case for Model 2 in these scenarios rests instead on its higher mAP on the 100-image dataset and on its greater tendency to produce detections, which suggests it is the less likely of the two to miss objects such as fire or workers in safety gear.
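One way to make the "recall matters more than precision" argument concrete is the F-beta score, which weights recall beta times as heavily as precision. It is not one of the metrics computed in this notebook, so the sketch below is offered only as an illustration. The function name `f_beta` is hypothetical, and the inputs are the precision and recall values reported above for the 10-image dataset.

```python
def f_beta(precision, recall, beta=2.0):
    """F-beta score; beta > 1 favors recall, beta = 1 gives the usual F1.

    Assumption: the score is reported as 0.0 when precision and recall are both 0.
    """
    denom = beta ** 2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denom


# Precision/recall pairs reported for the 10-image dataset
results = {
    "truck (Model 1)": (1.0, 1.0),
    "truck (Model 2)": (0.5, 1.0),
    "fire (both models)": (1.0, 0.5),
}
for name, (p, r) in results.items():
    print(f"{name}: F1={f_beta(p, r, beta=1.0):.2f}, F2={f_beta(p, r, beta=2.0):.2f}")
```

Under F2, Model 2's false positive in the truck class costs less (0.83 instead of 0.67), while the missed "fire" instance costs both models more (0.56 instead of 0.67). This matches the intuition that false alarms are the lesser problem in safety-critical settings.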

Moreover, Model 2's higher mAP on the 100-image dataset (0.1149 versus 0.0950) suggests that it may be more generalizable and better at handling the diversity of real-world environments, where the variety and complexity of images are greater. Model 1, while better at avoiding false positives, may not generalize as well across different conditions, as indicated by its lower mAP on the larger dataset.

Although both models have their disadvantages, Model 2 appears better suited for real-world scenarios where comprehensive object detection is needed, especially when missing critical objects is not acceptable. This is supported by its higher mAP on the 100-image dataset (0.1149 versus Model 1's 0.0950), which indicates better overall performance across object classes and images. In contrast, Model 1 is more appropriate for controlled environments where false positives need to be minimized, but its lower mAP on the larger dataset limits its applicability in dynamic, safety-critical environments.

Thus, for tasks requiring robust object detection across a wide variety of scenarios, especially where recall and comprehensive detection performance are critical, Model 2 would likely be the preferred model. Although Model 1 had the higher precision in the "truck" class and the higher mAP on the 10-image dataset, Model 2's higher mAP on the 100-image dataset indicates that it performs better overall across diverse environments.