# Hackathon: The AED Challenge

Our hackathon project focuses on Annotation Error Detection (AED) for machine learning datasets. The goal is to identify and correct errors in the annotations of datasets used for training machine learning models.



## 🎯 Goals

Your goal is to build an automated method to detect incorrectly labeled images in a **training dataset**. You will be evaluated on how many label errors you can correctly identify.



## 📁 Datasets

You will be provided with two main datasets:

1.  **Training Set (`training_set`)**
    *   **Contents**: A collection of images and their corresponding labels.
    *   **The Catch**: This set contains an unknown number of labeling errors. You **do not** have the ground truth for which labels are incorrect.
    *   **Your Goal**: This is the dataset you must analyze. Your final submission will be a list of suspected errors from this set.

2.  **Validation Set (`validation_set`)**
    *   **Contents**: A smaller collection of images, their labels, and a special ground truth column (`is_noisy`).
    *   **Purpose**: This set is your sandbox. You can use it to develop, test, and validate your error detection models and algorithms. The `is_noisy` column will tell you exactly which labels are wrong (`1`) and which are correct (`0`), allowing you to measure your method's performance before applying it to the real challenge.

The original dataset comes from [this hackathon](https://https-deeplearning-ai.github.io/data-centric-comp/) by Andrew Ng and the Deeplearning.ai team.


## 📝 Your Task

1.  **Explore the Data**: Start by analyzing both the `training_set` and `validation_set`. Understand the classes, image characteristics, and potential sources of error.
2.  **Develop Your Method**: Use the `validation_set` to build and refine your strategy. You could train a model and look for low-confidence predictions, analyze feature embeddings to find outliers, or use other clever techniques.
3.  **Identify Errors in the Training Set**: Once you are confident in your method, apply it to the `training_set` to predict which items have incorrect labels.
4.  **Generate a Submission File**: Create a CSV file detailing your findings.
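
One of the ideas in step 2, analyzing feature embeddings, can be sketched as a nearest-neighbour label check: a sample whose closest neighbours mostly carry a different label is a candidate error. The snippet below is only an illustration on toy 2-D points. The function name, the toy labels, and the `0.5` cut-off are assumptions, and it expects at least two samples.

```python
import math

def knn_label_agreement(embeddings, labels, k=3):
    """For each sample, return the fraction of its k nearest neighbours
    (Euclidean distance, excluding itself) that share its label."""
    scores = []
    for i, (vec, label) in enumerate(zip(embeddings, labels)):
        # Distance from sample i to every other sample
        dists = [
            (math.dist(vec, other), j)
            for j, other in enumerate(embeddings)
            if j != i
        ]
        nearest = sorted(dists)[:k]
        same = sum(1 for _, j in nearest if labels[j] == label)
        scores.append(same / len(nearest))
    return scores

# Toy 2-D "embeddings": two tight clusters, with sample 3 carrying the wrong label
embeddings = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
              (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1)]
labels = ["i", "i", "i", "v", "v", "v", "v", "v"]

agreement = knn_label_agreement(embeddings, labels, k=3)
suspects = [i for i, score in enumerate(agreement) if score < 0.5]
print(suspects)  # [3]
```

This brute-force version is quadratic in the number of samples, so on real embeddings a dedicated nearest-neighbour index would likely be needed. Both `k` and the cut-off are best tuned on the `validation_set`, where `is_noisy` is known.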



## 🚀 Submission

Your final output must be a single CSV file named `submission.csv`. This file will be uploaded to the online judging platform.

For grading, you'll also need to upload your work to a GitHub repository. The repository should include:
-   Your code for the error detection method.
-   A README file explaining your approach, how to run your code, and any dependencies.
-   Any additional files or scripts that are necessary to reproduce your results.

## 📅 Timeline
-   **Start Time**: 2025-06-26 - 19:00
-   **End Time**: 2025-06-26 - 21:30
-   **Submission Deadline**: 2025-06-26 - 21:30
-   **SIGAA Submission Deadline**: 2025-06-27 - 23:59



### Submission File Format

The CSV file must contain exactly two columns: `uid` and `is_noisy`.

*   `uid`: The identifier/filename of the image from the `training_set`.
*   `is_noisy`: Your prediction. Use `1` if you believe the label is an error, and `0` if you believe it is correct.

**Example `submission.csv`:**

```csv
uid,is_noisy
12345676890aed,0
12345676890aee,1
12345676890aef,0
12345676890af0,1
...
```
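
Before uploading, it may help to sanity-check the file against the format above. The helper below is a hypothetical sketch, not part of the judging platform: it only verifies the two-column header and that every `is_noisy` value is literally `0` or `1`.

```python
import csv
import io

def check_submission(csv_text):
    """Return a list of problems found in a submission; an empty list means it looks valid."""
    problems = []
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)
    if header != ["uid", "is_noisy"]:
        problems.append(f"unexpected header: {header}")
        return problems
    for line_no, row in enumerate(reader, start=2):
        if len(row) != 2:
            problems.append(f"line {line_no}: expected 2 fields, got {len(row)}")
            continue
        uid, flag = row
        if flag not in ("0", "1"):
            problems.append(f"line {line_no}: is_noisy must be 0 or 1, got {flag!r}")
    return problems

good = "uid,is_noisy\n12345676890aed,0\n12345676890aee,1\n"
bad = "uid,is_noisy\n12345676890aed,True\n"
print(check_submission(good))  # []
print(check_submission(bad))   # one problem: is_noisy must be 0 or 1
```

The `bad` case mirrors a common slip: writing a boolean column to CSV produces `True`/`False` rather than `1`/`0`.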



## 🏆 Evaluation

Your submission will be evaluated using the **Matthews Correlation Coefficient (MCC)**.

*   **Why MCC?**: MCC is a robust metric for binary classification that performs well even when the classes are highly imbalanced (as we expect far more correct labels than incorrect ones). It provides a single, balanced score that accounts for true/false positives and negatives.
*   **The Winner**: The team with the highest MCC score on the hidden ground truth of the `training_set` wins!
*   **Your Grade**: Your grade will be based on the quality of your submission on SIGAA.

The final **leaderboard** on the platform will show your rank and score after each submission.
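
To make the metric concrete, here is a small dependency-free sketch of MCC computed from the confusion counts, `(TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))`. The toy labels are made up for illustration. The example shows why accuracy alone is misleading under imbalance: flagging nothing still scores 80% accuracy but an MCC of 0.

```python
import math

def mcc(y_true, y_pred):
    """Matthews Correlation Coefficient for binary labels encoded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention the score is 0 when any marginal count is zero
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

# 10 samples, 2 of them truly noisy
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
flag_nothing = [0] * 10
one_hit_one_false_alarm = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]

accuracy = sum(t == p for t, p in zip(y_true, flag_nothing)) / len(y_true)
print(accuracy)                             # 0.8
print(mcc(y_true, flag_nothing))            # 0.0
print(mcc(y_true, one_hit_one_false_alarm)) # 0.375
```

In practice, `sklearn.metrics.matthews_corrcoef` computes the same quantity and can be applied to the `validation_set` predictions directly.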


In [None]:
!pip install transformers
!pip install Pillow
!pip install numpy
!pip install joblib
# pathlib ships with the Python standard library, so it does not need to be installed
!pip install tqdm
!pip install cleanlab[datalab]
!pip install matplotlib
!pip install plotly
!pip install xgboost
!pip install cleanvision
!pip install seaborn
!pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu128
!pip install transformers[torch]

In [None]:
import pandas as pd
import requests
import io
# Download the dataset from the provided URL
url_train = 'https://public.jacob.al/imd3011-datacentric_ai/hackathon-2025-2/train_no_labels.json.gz'
url_valid = 'https://public.jacob.al/imd3011-datacentric_ai/hackathon-2025-2/valid.json.gz'

response_train = requests.get(url_train)
response_valid = requests.get(url_valid)

# Read the JSON data into pandas DataFrames
train_data = pd.read_json(io.BytesIO(response_train.content), compression='gzip')
valid_data = pd.read_json(io.BytesIO(response_valid.content), compression='gzip')


In [None]:
train_data

In [None]:
valid_data

In [None]:
len(valid_data[valid_data.is_noisy == True])

In [None]:
from PIL import Image
from matplotlib import pyplot as plt
import numpy as np

def show_image_from_np_array(np_array):
    np_array = np.array(np_array, dtype=np.uint8)  # Ensure the array is in the correct format
    image = Image.fromarray(np_array)
    plt.imshow(image)
    plt.axis('off')  # Hide the axis
    plt.show()

# Display the first image from the training dataset
show_image_from_np_array(train_data['image'][0])

In [None]:
valid_data.query('is_noisy == True').head(5)

In [None]:
train_data.label.value_counts()

In [None]:
row = valid_data.query('is_noisy == True').sample(1).iloc[0]

show_image_from_np_array(row['image'])
print(row['label'])


In [None]:
row = valid_data.query('is_noisy == True').sample(1).iloc[0]

show_image_from_np_array(row['image'])
print(row['label'])


In [None]:
row = valid_data.query('is_noisy == True').sample(1).iloc[0]

show_image_from_np_array(row['image'])
print(row['label'])


In [None]:
row = valid_data.query('is_noisy == True').sample(1).iloc[0]

show_image_from_np_array(row['image'])
print(row['label'])


## Generating the Embeddings

This was the step that caused us the most trouble.

We lacked experience with computer vision problems: we were used to working with a dataset of image files, not with images stored as raw RGB values.

Our strategy was to find the largest image and pad the others to match it. In hindsight, the padding could also have been done more carefully. We padded with 0, but since the RGB value (0, 0, 0) was also used for the Roman numerals themselves, this may have caused confusion when generating the embeddings.
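
A more careful variant of the padding described above would pad spatially (extra rows and columns) with a colour that differs from the numeral strokes, rather than appending zeros to a flattened array. The sketch below is illustrative only. It assumes each image is a nested H x W list of RGB triples no larger than the target size, and the white fill colour is an assumption, not something verified against the dataset.

```python
def pad_image(image, target_h, target_w, fill=(255, 255, 255)):
    """Pad an H x W image (nested lists of RGB triples) on the right and bottom.

    `fill` is the colour used for new pixels; white is an assumption here,
    chosen only because it differs from the black (0, 0, 0) numeral strokes.
    """
    fill = tuple(fill)
    # Extend each existing row to the target width
    padded = [
        [tuple(px) for px in row] + [fill] * (target_w - len(row))
        for row in image
    ]
    # Append full rows of fill until the target height is reached
    padded += [[fill] * target_w for _ in range(target_h - len(image))]
    return padded

# A 2 x 2 all-black toy image padded to 3 rows by 4 columns
toy = [[(0, 0, 0), (0, 0, 0)],
       [(0, 0, 0), (0, 0, 0)]]
padded = pad_image(toy, target_h=3, target_w=4)
print(len(padded), len(padded[0]))  # 3 4
print(padded[0])  # [(0, 0, 0), (0, 0, 0), (255, 255, 255), (255, 255, 255)]
```

With NumPy arrays, the same effect can be obtained with `np.pad` using a per-axis `pad_width` and `constant_values`. Padding on the right and bottom only is a simplification; centring the image is another reasonable choice.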

In [None]:
max_length = max(train_data['image'].apply(len))
print(max_length)

In [None]:
from transformers import AutoImageProcessor, AutoModel
from PIL import Image
import torch
import numpy as np
from tqdm import tqdm

processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
model = AutoModel.from_pretrained("google/vit-base-patch16-224-in21k")

TARGET_SIZE = 648  # 18 * 12 * 3 flattened RGB values
IMG_SHAPE = (18, 12)  # (height, width) used when reshaping

def extract_embedding_from_pixels(pixel_lists):
    embeddings = []
    for i, pixels in enumerate(tqdm(pixel_lists)):
        try:
            # Flatten first so padding and truncation act on a 1-D stream of pixel values
            pixels_array = np.array(pixels, dtype=np.uint8).flatten()

            if pixels_array.size < TARGET_SIZE:
                # Zero-pad at the end up to the target length
                pixels_array = np.pad(pixels_array, (0, TARGET_SIZE - pixels_array.size), 'constant')
            elif pixels_array.size > TARGET_SIZE:
                # Truncate anything beyond the target length
                pixels_array = pixels_array[:TARGET_SIZE]

            reshaped_array = pixels_array.reshape((*IMG_SHAPE, 3))

            img = Image.fromarray(reshaped_array, mode="RGB")

            img = img.resize((224, 224))

            inputs = processor(images=img, return_tensors="pt")
            with torch.no_grad():
                outputs = model(**inputs)
            feature_vector = outputs.last_hidden_state.mean(dim=1).squeeze().numpy()
            embeddings.append(feature_vector)

        except ValueError as ve:
            print(f"ValueError at index {i}: {ve}. Pixels array shape: {pixels_array.shape if 'pixels_array' in locals() else 'N/A'}")
            embeddings.append(np.zeros(model.config.hidden_size))
        except Exception as e:
            print(f"General error at index {i}: {e}")
            embeddings.append(np.zeros(model.config.hidden_size))
    return np.array(embeddings)

In [None]:
train_embeddings = extract_embedding_from_pixels(train_data['image'])

In [None]:
train_embeddings

In [None]:
val_embeddings = extract_embedding_from_pixels(valid_data['image'])

In [None]:
X_train = train_embeddings
y_train = train_data.label

X_valid = val_embeddings
y_valid = valid_data.label

In [None]:
X_valid.shape, y_valid.shape

In [None]:
y_valid

In [None]:
class_names = y_train.unique().tolist()

class_to_idx = {name: i for i, name in enumerate(class_names)}

idx_to_class = {i: name for name, i in class_to_idx.items()}

In [None]:
idx_to_class

In [None]:
train_data.label.value_counts()

In [None]:
train_data['class_idx'] = train_data['label'].map(class_to_idx)

In [None]:
train_data.class_idx.value_counts()

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(16, 22))
ax = sns.countplot(y=train_data['class_idx'], order=train_data['class_idx'].value_counts().index)
plt.xlabel('Count')
plt.ylabel('Class Label')
plt.title('Distribution of Numbers')

for p in ax.patches:
    width = p.get_width()    
    plt.text(width + 1,       
             p.get_y() + p.get_height()/2.,
             f'{int(width)}', 
             ha='left',       
             va='center')

plt.tight_layout()
plt.show()

## Cleanlab

For simplicity, we chose to explore Cleanlab to identify possible outliers.

It flagged about 10% of the set, but every flagged sample had a score of 0.0, so we ended up disregarding the result. We also ran the near-duplicate check, but as with the outliers, all scores were 0.0.

In [None]:
from cleanlab import Datalab

lab = Datalab(data=train_data)

lab.find_issues(features=X_train, issue_types={"near_duplicate": {}, "outlier": {}})

lab.report()

In [None]:
near_duplicates = lab.get_issues("near_duplicate").query("is_near_duplicate_issue == True")

outliers = lab.get_issues("outlier").query("is_outlier_issue == True")

near_duplicates

In [None]:
# Sort the outliers DataFrame by the 'outlier_score' column in ascending order
# This helps in identifying the most significant outliers first
outliers = outliers.sort_values("outlier_score", ascending=True)

# Display the sorted outliers DataFrame
outliers

Even so, we inspected a few examples using the indices Cleanlab had found. As we expected, they were essentially random samples: most were correctly labeled, and only a small share were actual outliers.

In [None]:
NUMBER = 1431

In [None]:
show_image_from_np_array(train_data['image'][NUMBER])

In [None]:
train_data.iloc[NUMBER]

## ReCov

The last algorithm we managed to use was ReCov. We had studied and implemented it before this activity, so we were able to apply it to this dataset.

The goal of this algorithm is to identify noisy samples during cross-validation, based on the model's performance.

This was the strategy we used to filter the training set.

We removed every index found by ReCov. The main reason for this blunt approach was the lack of time at the end.
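
A less blunt alternative would be to compare each sample's count against what chance alone would produce. In the implementation in this notebook, every run adds 1 to all samples of exactly one fold (the worst-scoring one), so if that fold were picked at random each sample would be counted about `n_runs / k_folds` times. The sketch below flags only the samples well above that baseline. The function name and the `margin` parameter are hypothetical and would need tuning on the `validation_set`.

```python
def flag_above_chance(occurrence_counts, n_runs, k_folds, margin=1.5):
    """Return indices whose worst-fold count exceeds `margin` times the chance baseline.

    If the worst fold were picked at random, each sample would land in it
    about n_runs / k_folds times. `margin` is a hypothetical tuning knob.
    """
    baseline = n_runs / k_folds
    return [i for i, count in enumerate(occurrence_counts) if count > margin * baseline]

# Toy counts from 100 runs of 5-fold CV: the chance baseline is 20
counts = [18, 22, 41, 19, 35, 20]
print(flag_above_chance(counts, n_runs=100, k_folds=5))  # [2, 4]
```

Unlike a fixed top-10% cut, this lets the number of flagged samples follow the data instead of being set in advance.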

In [3]:
import numpy as np
from sklearn.model_selection import KFold
from sklearn.base import clone

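def recov(X, y, model, n_runs=100, k_folds=5, random_state=None):
    # Convert to arrays so positional fold indices work even if y is a pandas Series
    X, y = np.asarray(X), np.asarray(y)
    n_samples = len(X)
    occurrence_counts = np.zeros(n_samples)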
    
    for run in range(n_runs):
        kf = KFold(n_splits=k_folds, shuffle=True, random_state=random_state + run if random_state is not None else None)
        fold_val_scores = []
        fold_indices = []
        
        models = []
        for train_idx, val_idx in kf.split(X):
            X_train, X_val = X[train_idx], X[val_idx]
            y_train, y_val = y[train_idx], y[val_idx]
            
            fold_model = clone(model)
            fold_model.fit(X_train, y_train)
            models.append(fold_model)
            fold_indices.append(val_idx)
            
            score = fold_model.score(X_val, y_val)
            fold_val_scores.append(score)
        
        worst_fold_idx = np.argmin(fold_val_scores)
        worst_fold_samples = fold_indices[worst_fold_idx]
        
        occurrence_counts[worst_fold_samples] += 1
    
    return occurrence_counts

In [None]:
from sklearn.linear_model import LogisticRegression


model = LogisticRegression(max_iter=1000)
occurrence_counts = recov(X_train, y_train, model, n_runs=100, k_folds=5)

noisy_indices = np.argsort(occurrence_counts)[-int(0.1 * len(X_train)):]

In [None]:
noisy_indices

In [None]:
show_image_from_np_array(train_data['image'][18])

In [None]:
train_data.iloc[18]

In [None]:
final_submission = train_data.copy()

In [None]:
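# Use 0/1 (not booleans) so the CSV matches the required submission format
final_submission['is_noisy'] = 0

# noisy_indices are positions; .loc works here assuming the default 0..n-1 index
final_submission.loc[noisy_indices, 'is_noisy'] = 1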

In [None]:
print(final_submission.columns)

In [None]:
final_submission = final_submission[['uid', 'is_noisy']]

## Strategies We Did Not Get to Use

One of our mistakes was not training a simple model on the entire training set and using its confidence levels, as we saw in class. We did not expect to spend so much time struggling with the embeddings.

That would have helped a lot, since we could have used the model's confidence as a guide for selecting images, a strategy that is both efficient and easy to implement.

We also missed the opportunity to use an LLM service to classify the images. I believe that would have been a creative and decent way to identify possible label errors, whenever the LLM's labels conflicted with those in the dataset.
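
The confidence-based strategy mentioned above can be sketched as follows: for each sample, take the probability the model assigns to its given label (its "self-confidence") and flag the samples where that probability is lowest. This is a minimal, dependency-free illustration on made-up probabilities. The function name and the `fraction` parameter are assumptions, and labels are assumed to be already mapped to integer class indices (as with `class_to_idx` earlier in this notebook).

```python
def lowest_self_confidence(pred_probs, label_indices, fraction=0.1):
    """Rank samples by the probability the model assigns to their given label
    and return the indices of the lowest `fraction` of them."""
    self_conf = [probs[label] for probs, label in zip(pred_probs, label_indices)]
    n_flag = max(1, int(fraction * len(self_conf)))
    ranked = sorted(range(len(self_conf)), key=lambda i: self_conf[i])
    return ranked[:n_flag], self_conf

# Toy predicted probabilities over 3 classes (ideally out-of-fold predictions)
pred_probs = [
    [0.90, 0.05, 0.05],
    [0.10, 0.80, 0.10],
    [0.85, 0.10, 0.05],  # given label is class 2, but the model favours class 0
    [0.05, 0.15, 0.80],
]
label_indices = [0, 1, 2, 2]

flagged, self_conf = lowest_self_confidence(pred_probs, label_indices, fraction=0.25)
print(flagged)    # [2]
print(self_conf)  # [0.9, 0.8, 0.05, 0.8]
```

With scikit-learn, the probabilities could come from `cross_val_predict(model, X, y, method="predict_proba")`, so that each sample is scored by a model that did not see it during training. The fraction to flag is best chosen on the `validation_set`.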

In [None]:
import time
import pandas as pd
import numpy as np

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score, balanced_accuracy_score, classification_report,
    confusion_matrix, f1_score, matthews_corrcoef
)
from typing import List, Tuple

def calculate_evaluation_metrics(y_true: pd.Series, y_pred: pd.Series) -> Tuple[float, float, float, str, float, np.ndarray]:
    f1 = f1_score(y_true, y_pred, average='micro')
    balanced_accuracy = balanced_accuracy_score(y_true, y_pred)
    accuracy = accuracy_score(y_true, y_pred)
    classification_report_str = classification_report(y_true, y_pred)
    matthews_corr_coeff = matthews_corrcoef(y_true, y_pred)
    confusion_matrix_arr = confusion_matrix(y_true, y_pred)

    return f1, balanced_accuracy, accuracy, classification_report_str, matthews_corr_coeff, confusion_matrix_arr

def train_and_evaluate_logistic_regression(X_train: pd.DataFrame,
                                           y_train: pd.Series,
                                           X_valid: pd.DataFrame,
                                           y_valid: pd.Series) -> Tuple[pd.DataFrame, List[List]]:
    model_name = 'LogisticRegression'
    model = LogisticRegression(random_state=1408, n_jobs=-1, class_weight='balanced', max_iter=1000)

    evaluation_results = []
    classification_reports = []

    start_time = time.time()
    model.fit(X_train, y_train)
    predictions = model.predict(X_valid)

    f1, balanced_accuracy, accuracy, classification_report_str, matthews_corr_coeff, confusion_matrix_arr = calculate_evaluation_metrics(y_valid, predictions)
    classification_reports.append([model_name, classification_report_str, confusion_matrix_arr])

    elapsed_time = time.time() - start_time
    evaluation_results.append([model_name, f1, balanced_accuracy, accuracy, matthews_corr_coeff, elapsed_time, confusion_matrix_arr, classification_report_str])

    print(f'Name: {model_name} - F1: {f1:.4f} - BACC: {balanced_accuracy:.4f} - ACC: {accuracy:.4f} - MCC: {matthews_corr_coeff:.4f} - Elapsed: {elapsed_time:.2f}s')
    print(classification_report_str)
    print(confusion_matrix_arr)
    print('*' * 20, '\n')

    results_df = pd.DataFrame(evaluation_results, columns=['Model', 'F1', 'BACC', 'ACC', 'MCC', 'Total Time', 'Confusion Matrix', 'Classification Report'])
    results_df['Confusion Matrix'] = results_df['Confusion Matrix'].apply(str)

    return results_df, classification_reports


## 🚀 Submission

Your final output must be a single CSV file named `submission.csv`. This file will be uploaded to the online judging platform.


In [None]:
final_submission.head(2)  # Preview the first rows before writing the CSV


In [None]:
final_submission.to_csv('submission.csv', index=False)