
In [None]:
import json
from sklearn.metrics import cohen_kappa_score

# Loading annotations from JSON files
def load_annotations(file_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        data = json.load(f)
    annotations = []
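    # Extract the first label of every labeled span in each task
    for item in data:
        # `or [{}]` also covers an empty "annotations" list, where [0] would raise IndexError
        first_annotation = (item.get("annotations") or [{}])[0]
        results = first_annotation.get("result", [])
        sentence_labels = [
            res["value"]["labels"][0] for res in results
        ]
        annotations.append(sentence_labels)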
    return annotations

# Flattening annotations
def flatten_annotations(annotations):
    return [label for sentence in annotations for label in sentence]

file_a1 = "annotator1.json"
file_a2 = "annotator2.json"

# Loading json files and flattening annotations
annotations_a1 = flatten_annotations(load_annotations(file_a1))
annotations_a2 = flatten_annotations(load_annotations(file_a2))

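# Calculating Cohen's kappa
# The two lists are compared position by position, so this assumes both
# annotators labeled the same tokens in the same order.
if len(annotations_a1) != len(annotations_a2):
    raise ValueError("Annotators labeled a different number of tokens")
kappa_score = cohen_kappa_score(annotations_a1, annotations_a2)
print(f"Cohen's kappa: {kappa_score}")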


Cohen's kappa: 0.8363343684230562


Cohen's kappa is 0.836, which indicates strong agreement between the two annotators (values above 0.8 fall in the "almost perfect" band of the commonly used Landis and Koch scale). The score falls short of 1 because some part-of-speech decisions were ambiguous. It was unclear whether "number" should be tagged as an adjective or a number, and whether "thousand" should be tagged as a noun or a number. The annotators resolved these cases differently, which slightly reduced the agreement.
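To check which tag pairs account for the disagreement described above, the two flattened label lists can be compared position by position. The following is a minimal sketch, not part of the original analysis: `disagreement_counts` is a hypothetical helper, and the example tags are made up for illustration rather than taken from the annotation files.

```python
from collections import Counter

def disagreement_counts(labels_a, labels_b):
    """Count each (annotator 1 label, annotator 2 label) pair that differs."""
    if len(labels_a) != len(labels_b):
        raise ValueError("Both label lists must cover the same tokens")
    return Counter((a, b) for a, b in zip(labels_a, labels_b) if a != b)

# Hypothetical tags for six tokens, NOT the real annotation data
example_a1 = ["NOUN", "ADJ", "NUM", "VERB", "NOUN", "NUM"]
example_a2 = ["NOUN", "NUM", "NUM", "VERB", "NUM", "NUM"]

# Print the confused tag pairs, most frequent first
for (tag_a1, tag_a2), count in disagreement_counts(example_a1, example_a2).most_common():
    print(f"{tag_a1} vs {tag_a2}: {count}")
```

With the real data, passing `annotations_a1` and `annotations_a2` to the helper would show whether the adjective/number and noun/number confusions are in fact the most frequent ones.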

In [None]:
import json
import numpy as np
from statsmodels.stats.inter_rater import fleiss_kappa

# Loading JSON files
def load_json(file_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        return json.load(f)

# Aligning annotations from all annotators
def align_annotations(annotator_files):
    aligned_data = {}
    for file_path in annotator_files:
        for item in load_json(file_path):
            img_id = item['id']
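            # `or` fallbacks cover missing keys as well as empty lists,
            # where indexing [0] would raise IndexError
            annotation = (item.get('annotations') or [{}])[0]
            result = (annotation.get('result') or [{}])[0]
            label = (result.get('value', {}).get('choices') or [None])[0]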
            if label:
                aligned_data.setdefault(img_id, []).append(label)
    return aligned_data

# Preparing the Fleiss' Kappa input matrix
def prepare_fleiss_kappa_matrix(aligned_data, categories):
    matrix = []
    for img_id, annotations in aligned_data.items():
        row = [0] * len(categories)  # Initializing counts for each category
        for label in annotations:
            if label in categories:  # Increment counts for the corresponding category
                row[categories.index(label)] += 1
        matrix.append(row)
    return np.array(matrix)

# Annotator files
annotator_files = ['a1.json', 'a2.json', 'a3.json']

# Aligning annotations
aligned_data = align_annotations(annotator_files)

# Defining categories for labels
categories = ["Trucks", "No Trucks"]

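# Preparing matrix for Fleiss' Kappa calculation
matrix = prepare_fleiss_kappa_matrix(aligned_data, categories)
# Fleiss' kappa assumes every item has the same number of ratings,
# so fail early if any image has fewer usable labels than the others
ratings_per_image = matrix.sum(axis=1)
if ratings_per_image.min() != ratings_per_image.max():
    raise ValueError("Images have unequal numbers of ratings")
kappa = fleiss_kappa(matrix)
print(f"Fleiss' Kappa: {kappa}")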


Fleiss' Kappa: 0.7884841363102228


Fleiss' kappa is 0.788, which indicates substantial agreement among the three annotators. The score falls short of 1 mainly because some trucks blended into the background and could only be identified by zooming in and examining the image closely. At a quick glance, these trucks were easy to miss. In addition, some other vehicles may have been mistaken for trucks. Both kinds of error produced conflicting labels and lowered the overall agreement.
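One way to follow up on these explanations is to list the images on which the annotators did not all choose the same label, so they can be re-examined at full zoom. The sketch below is an illustration under assumptions: `split_items` is a hypothetical helper, and the image ids and labels are invented, although they follow the `{image id: [labels]}` shape returned by `align_annotations`.

```python
from collections import Counter

def split_items(aligned):
    """Return {item_id: label counts} for items whose labels are not unanimous."""
    return {
        item_id: Counter(labels)
        for item_id, labels in aligned.items()
        if len(set(labels)) > 1
    }

# Hypothetical image ids and labels, NOT the real annotation data
example_aligned = {
    101: ["Trucks", "Trucks", "Trucks"],
    102: ["Trucks", "No Trucks", "Trucks"],
    103: ["No Trucks", "No Trucks", "No Trucks"],
    104: ["No Trucks", "No Trucks", "Trucks"],
}

# Show only the images with a split vote and how the votes were distributed
for item_id, votes in split_items(example_aligned).items():
    print(item_id, dict(votes))
```

Calling `split_items(aligned_data)` on the real data would give the subset of images worth reviewing. A split with a single "Trucks" vote would be consistent with either a hard-to-see truck that two annotators missed or another vehicle that one annotator mistook for a truck.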