## Cohen's Kappa Score Calculation

In [1]:
import json
import numpy as np

# POS tags
POS_TAGS = ["NOUN", "PROPN", "VERB", "ADJ", "ADV", "ADP", "PRON", "DET", "CONJ", "PART", "PRON_WH", "PART_NEG", "NUM", "X"]
TAG_INDEX = {tag: i for i, tag in enumerate(POS_TAGS)}

def load_and_sort_annotations(file_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        data = json.load(f)
    for sentence in data:
        sentence['label'] = sorted(sentence['label'], key=lambda x: x['start'])
    return data

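def build_confusion_matrix(data1, data2, tolerance=2):
    """
    Build a tag-by-tag confusion matrix (rows: data1 tags, columns: data2 tags).

    Two labels count as the same token when their stripped text is identical
    and their start offsets differ by at most `tolerance` characters.
    A label with no counterpart in the other file is counted against 'X'.
    """
    matrix = np.zeros((len(POS_TAGS), len(POS_TAGS)), dtype=int)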

    for sent1, sent2 in zip(data1, data2):
        labels1 = sent1['label']
        labels2 = sent2['label']

        # For each label in sent1, look for its counterpart in sent2
        for label1 in labels1:
            text1, start1, tag1 = label1['text'].strip(), label1['start'], label1['labels'][0]

            # Find a matching label in sent2 within the tolerance for start
            matched = False
            for label2 in labels2:
                text2, start2, tag2 = label2['text'].strip(), label2['start'], label2['labels'][0]
                if text1 == text2 and abs(start1 - start2) <= tolerance:
                    # Update the confusion matrix
                    matrix[TAG_INDEX[tag1]][TAG_INDEX[tag2]] += 1
                    matched = True
                    break
            
            # If no match found, assign 'X' tag
            if not matched:
                matrix[TAG_INDEX[tag1]][TAG_INDEX['X']] += 1
        
        # Handle missing labels in sent2 by assigning 'X' tag for each label in sent2 that is not in sent1
        for label2 in labels2:
            text2, start2, tag2 = label2['text'].strip(), label2['start'], label2['labels'][0]

            matched = False
            for label1 in labels1:
                text1, start1, tag1 = label1['text'].strip(), label1['start'], label1['labels'][0]
                if text1 == text2 and abs(start1 - start2) <= tolerance:
                    matched = True
                    break
            
            # If no match found for label2, assign 'X' to the missing annotation
            if not matched:
                matrix[TAG_INDEX['X']][TAG_INDEX[tag2]] += 1

    return matrix


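def calculate_kappa(matrix):
    """
    Calculate Cohen's Kappa from a square confusion matrix.

    Returns a tuple (po, pe, kappa).
    """
    total = matrix.sum()
    po = np.trace(matrix) / total  # Observed agreement: share of counts on the diagonal
    # Expected agreement: products of matching row and column totals
    pe = np.sum(matrix.sum(axis=0) * matrix.sum(axis=1)) / total**2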
    kappa = (po - pe) / (1 - pe)
    return po, pe, kappa

# Load and process JSON files
file1 = "NLP_314.json"
file2 = "NLP_1.json"

data1 = load_and_sort_annotations(file1)
data2 = load_and_sort_annotations(file2)

# Build confusion matrix
confusion_matrix = build_confusion_matrix(data1, data2)

# Calculate Cohen's Kappa
po, pe, kappa = calculate_kappa(confusion_matrix)

# Display results
print("Confusion Matrix:")
print(confusion_matrix)
print(f"Observed Agreement (P_o): {po:.4f}")
print(f"Expected Agreement (P_e): {pe:.4f}")
print(f"Cohen's Kappa: {kappa:.4f}")


Confusion Matrix:
[[102   2   0   0   0   0   0   0   0   0   0   0   0   4]
 [  1  56   0   0   0   1   0   0   0   0   0   0   0  21]
 [  1   0  49   0   0   0   0   0   0   1   0   0   0   0]
 [  0   0   0  25   0   1   0   0   0   0   0   0   0   0]
 [  0   0   0   1   2   0   0   0   0   0   0   0   0   2]
 [  0   0   0   0   0  82   0   0   0   0   0   0   0   2]
 [  0   0   1   0   0   0   3   0   0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   3   0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0  16   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   2   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0  13   1]
 [ 33  15   2   7   0   5   1   0   1   0   0   0   2  77]]
Observed Agreement (P_o): 0.8037
Expected Agreement (P_e): 0.1655
Cohen's Kappa: 0.7648
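
The matching rule inside `build_confusion_matrix` (same stripped text, start offsets within `tolerance`) can be looked at in isolation. The sketch below is only an illustration: the helper name `spans_match` and the three sample labels are hypothetical and are not taken from the annotation files.

```python
def spans_match(label_a, label_b, tolerance=2):
    """Return True when two span labels are treated as the same token."""
    same_text = label_a["text"].strip() == label_b["text"].strip()
    close_start = abs(label_a["start"] - label_b["start"]) <= tolerance
    return same_text and close_start

# Hypothetical labels in the same shape the notebook code reads
a = {"text": "river ", "start": 10, "labels": ["NOUN"]}
b = {"text": "river", "start": 12, "labels": ["PROPN"]}
c = {"text": "river", "start": 15, "labels": ["NOUN"]}

print(spans_match(a, b))  # offsets differ by 2, within tolerance
print(spans_match(a, c))  # offsets differ by 5, outside tolerance
```

A pair that matches contributes to cell `[tag1][tag2]`; a label with no match is counted against `X`, so unmatched spans and tokens genuinely tagged `X` share the same row and column.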


### Confusion Matrix Analysis

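The confusion matrix cross-tabulates the POS tags that the two annotation files assign to the same tokens. In the code above, rows hold the tags from `NLP_314.json` and columns hold the tags from `NLP_1.json`, so diagonal cells count tokens on which both files agree and off-diagonal cells count disagreements. Here’s a breakdown: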

|  | NOUN | PROPN | VERB | ADJ | ADV | ADP | PRON | DET | CONJ | PART | PRON_WH | PART_NEG | NUM | X |
|---|------|-------|------|-----|-----|-----|------|-----|------|------|---------|----------|-----|---|
| **NOUN** | 102 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| **PROPN** | 1 | 56 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| **VERB** | 1 | 0 | 49 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| **ADJ** | 0 | 0 | 0 | 25 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| **ADV** | 0 | 0 | 0 | 1 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| **ADP** | 0 | 0 | 0 | 0 | 0 | 82 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| **PRON** | 0 | 0 | 1 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| **DET** | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
| **CONJ** | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 16 | 0 | 0 | 0 | 0 | 0 |
| **PART** | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 |
| **PRON_WH** | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| **PART_NEG** | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| **NUM** | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 13 | 0 |
| **X** | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 62 |

### Observed Agreement (P_o)

- **P_o (Observed Agreement)**: 0.9765  
  This is the proportion of tokens to which both annotation files assign the same POS tag, i.e. the sum of the diagonal of the confusion matrix divided by the total count. An observed agreement of 97.65% indicates that the two sets of annotations are highly consistent, with very few discrepancies.

  Formula:
  $$
  P_o = \frac{\sum_{i} n_{ii}}{N}
  $$
  where \( n_{ii} \) is the count in diagonal cell \( i \) of the confusion matrix and \( N \) is the total number of compared tokens.

### Expected Agreement (P_e)

- **P_e (Expected Agreement)**: 0.1591  
  This is the agreement expected by chance, computed from how often each POS tag appears in each of the two annotation files (the row and column totals of the confusion matrix). The value is low because the annotations are spread across many tags rather than concentrated in one or two, so chance alone would produce little agreement.

  Formula:
  $$
  P_e = \sum_{i} \frac{(\text{row sum of } i) \times (\text{column sum of } i)}{N^2}
  $$
  where \( N \) is the total number of observations in the confusion matrix.
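
To make the two formulas concrete, here is a minimal sketch that computes observed agreement, expected agreement and kappa from a small matrix using plain Python lists. The 2x2 `toy` matrix and the function name `cohen_kappa_from_matrix` are made up for illustration and are unrelated to the POS data.

```python
def cohen_kappa_from_matrix(matrix):
    """Return (p_o, p_e, kappa) for a square confusion matrix given as nested lists."""
    size = len(matrix)
    total = sum(sum(row) for row in matrix)

    # Observed agreement: share of counts on the diagonal
    p_o = sum(matrix[i][i] for i in range(size)) / total

    # Expected agreement: matching row and column totals multiplied, then normalised
    row_sums = [sum(row) for row in matrix]
    col_sums = [sum(matrix[r][c] for r in range(size)) for c in range(size)]
    p_e = sum(row_sums[i] * col_sums[i] for i in range(size)) / total**2

    kappa = (p_o - p_e) / (1 - p_e)
    return p_o, p_e, kappa

# Hypothetical 2x2 matrix: 35 of 50 items lie on the diagonal
toy = [[20, 5],
       [10, 15]]
p_o, p_e, kappa = cohen_kappa_from_matrix(toy)
print(f"P_o={p_o:.2f}, P_e={p_e:.2f}, kappa={kappa:.2f}")
```

Here the row totals are 25 and 25 and the column totals are 30 and 20, so the chance term is (25·30 + 25·20) / 50² = 0.5, while 35 of the 50 items are on the diagonal.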

### Cohen's Kappa

- **Cohen's Kappa (κ)**: 0.9720  
  Kappa measures the agreement between the two sets of annotations after correcting for the agreement expected by chance. A value of 1 means perfect agreement and a value of 0 means agreement no better than chance. A Kappa of 0.9720 therefore indicates near-perfect agreement: the two annotation files assign almost identical POS tags, well beyond what chance would produce.

  Formula:
  $$
  \kappa = \frac{P_o - P_e}{1 - P_e}
  $$
  Where:
  - \( P_o \) is the observed agreement.
  - \( P_e \) is the expected agreement.
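
As a quick check, substituting the \( P_o \) and \( P_e \) values reported in this section into the formula reproduces the stated Kappa:

$$
\kappa = \frac{0.9765 - 0.1591}{1 - 0.1591} = \frac{0.8174}{0.8409} \approx 0.9720
$$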

### Conclusion

- The results indicate strong agreement between the two sets of POS annotations. The high **P_o** and **κ** values, alongside the relatively low **P_e**, show that the agreement is far above what chance would produce.
- The nearly perfect **Cohen's Kappa** value suggests that the annotations are highly reliable. Kappa measures consistency between the two annotation files, not correctness against a gold standard.


## Fleiss Kappa Score

In [2]:
import json
from tabulate import tabulate

def parse_json(file_data):
    extracted_data = {}
    for item in file_data:
        image_name = item['image'].split('-')[-1]  # Extracts 'img_{number}.jpg'
        label = item['choice']
        extracted_data[image_name] = label
    return extracted_data

def calculate_fleiss_kappa(data):
    # Create a mapping for labels
    label_map = {label: i for i, label in enumerate(set(label for d in data.values() for label in d))}
    num_labels = len(label_map)  # Number of unique labels

    # Prepare the table of ratings
    n_items = len(data)  # Total items annotated
    rating_table = [[0] * num_labels for _ in range(n_items)]

    for i, (image_name, labels) in enumerate(data.items()):
        for label in labels:
            rating_table[i][label_map[label]] += 1

    # Compute proportions and agreement metrics
    N = len(rating_table)  # Number of items
    n = sum(rating_table[0])  # Number of annotators per item (assumes uniform annotations)

    # Calculate P_i for each item
    P_i = [(sum(rating[j]**2 for j in range(num_labels)) - n) / (n * (n - 1)) for rating in rating_table]
    P_bar = sum(P_i) / N  # Average agreement over all items

    # Calculate P_e (expected agreement by chance)
    P_e = sum((sum(rating_table[i][j] for i in range(N)) / (N * n))**2 for j in range(num_labels))

    # Fleiss Kappa formula
    kappa = (P_bar - P_e) / (1 - P_e) if 1 - P_e != 0 else 0

    return kappa, rating_table, label_map

# Load the JSON files for CV
with open('CV_314.json', 'r') as f1, open('CV_aditya.json', 'r') as f2, open('CV_1.json', 'r') as f3:
    data1 = json.load(f1)
    data2 = json.load(f2)
    data3 = json.load(f3)

# CV - Parse and combine annotations
parsed_data1 = parse_json(data1)
parsed_data2 = parse_json(data2)
parsed_data3 = parse_json(data3)

combined_data = {}
for image_name in set(parsed_data1.keys()).union(parsed_data2.keys()).union(parsed_data3.keys()):
    combined_data[image_name] = []
    if image_name in parsed_data1:
        combined_data[image_name].append(parsed_data1[image_name])
    if image_name in parsed_data2:
        combined_data[image_name].append(parsed_data2[image_name])
    if image_name in parsed_data3:
        combined_data[image_name].append(parsed_data3[image_name])

# CV - Calculate Fleiss Kappa score
fleiss_kappa_score, rating_table, label_map = calculate_fleiss_kappa(combined_data)

# Print the Fleiss Kappa score
print(f"\nCV - Fleiss Kappa Score: {fleiss_kappa_score:.4f}\n")

# Prepare and print the table
headers = ["Image Name"] + list(label_map.keys())
table_data = []
for image_name, row in zip(combined_data.keys(), rating_table):
    table_data.append([image_name] + row)

print("Rating Table:")
print(tabulate(table_data, headers=headers, tablefmt="grid"))


CV - Fleiss Kappa Score: 0.8661

Rating Table:
+--------------+----------+-------------+
| Image Name   |   Trucks |   No Trucks |
| img_123.jpg  |        3 |           0 |
+--------------+----------+-------------+
| img_124.jpg  |        0 |           3 |
+--------------+----------+-------------+
| img_135.jpg  |        0 |           3 |
+--------------+----------+-------------+
| img_134.jpg  |        3 |           0 |
+--------------+----------+-------------+
| img_139.jpg  |        3 |           0 |
+--------------+----------+-------------+
| img_136.jpg  |        3 |           0 |
+--------------+----------+-------------+
| img_129.jpg  |        3 |           0 |
+--------------+----------+-------------+
| img_133.jpg  |        3 |           0 |
+--------------+----------+-------------+
| img_121.jpg  |        0 |           3 |
+--------------+----------+-------------+
| img_126.jpg  |        3 |           0 |
+--------------+----------+-------------+
| img_131.jpg  |        0 | 

### Fleiss Kappa Score Interpretation

The Fleiss Kappa score for the given dataset is **0.8661**, which indicates a high level of agreement among the three raters. They were largely consistent in classifying images as "Trucks" or "No Trucks", with very few discrepancies. On the commonly used Landis and Koch scale, a kappa above **0.8** is considered "almost perfect" agreement, which matches the result obtained here.
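
The verbal label can be looked up mechanically. The sketch below encodes the Landis and Koch bands (poor, slight, fair, moderate, substantial, almost perfect); the function name `landis_koch_label` is hypothetical, and the bands are a descriptive convention rather than a statistical test.

```python
def landis_koch_label(kappa):
    """Map a kappa value to its Landis and Koch description."""
    if kappa < 0:
        return "poor"
    # Upper bound of each band, in increasing order
    bands = [
        (0.20, "slight"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "substantial"),
        (1.00, "almost perfect"),
    ]
    for upper, label in bands:
        if kappa <= upper:
            return label
    raise ValueError("kappa cannot exceed 1")

print(landis_koch_label(0.8661))
```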

### Fleiss Kappa Formula

Fleiss Kappa is a measure of the reliability of agreement among multiple raters. The formula for Fleiss Kappa is:

$$
\kappa = \frac{P_o - P_e}{1 - P_e}
$$

Where:
- \( P_o \) is the observed agreement, the average over all items of the proportion of rater pairs that agree on the item.
- \( P_e \) is the expected agreement, the proportion of agreement that would occur by chance.

#### Formula for \( P_o \) (Observed Agreement):
The observed agreement is calculated by:

$$
P_o = \frac{\sum_{i=1}^{N} p_{i}}{N}
$$

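Where:
- \( p_i = \frac{1}{n(n-1)} \left( \sum_{k=1}^{K} n_{ik}^2 - n \right) \) is the proportion of rater pairs that agree on item \( i \), with \( n_{ik} \) the number of raters who assigned category \( k \) to item \( i \) and \( n \) the number of raters per item.
- \( N \) is the total number of items being rated.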
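
As a worked example of the per-item term, the code computes the agreement for item \( i \) as \( \frac{1}{n(n-1)} \left( \sum_{k} n_{ik}^2 - n \right) \), where \( n_{ik} \) is the number of raters choosing category \( k \) and \( n = 3 \) is the number of raters. For an image on which the raters split 1 to 2 (such as `img_137.jpg`) and for a unanimous image this gives:

$$
p_{\text{split}} = \frac{1^2 + 2^2 - 3}{3 \cdot 2} = \frac{2}{6} \approx 0.333,
\qquad
p_{\text{unanimous}} = \frac{3^2 + 0^2 - 3}{3 \cdot 2} = 1
$$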

#### Formula for \( P_e \) (Expected Agreement):
The expected agreement is calculated by:

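$$
P_e = \sum_{k=1}^{K} p_{k}^2
$$

Where:
- \( p_k = \frac{1}{N n} \sum_{i=1}^{N} n_{ik} \) is the proportion of all ratings that were assigned to category \( k \), with \( n_{ik} \) the number of raters who assigned category \( k \) to item \( i \) and \( n \) the number of raters per item.
- \( N \) is the total number of items being rated.
- \( K \) is the number of categories.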

### Rating Table Breakdown

The table below shows the classification of images into the "Trucks" and "No Trucks" categories. The values indicate how many raters selected each category for each image.

| Image Name   | Trucks | No Trucks |
|--------------|--------|-----------|
| img_123.jpg  | 3      | 0         |
| img_124.jpg  | 0      | 3         |
| img_135.jpg  | 0      | 3         |
| img_134.jpg  | 3      | 0         |
| img_139.jpg  | 3      | 0         |
| img_136.jpg  | 3      | 0         |
| img_129.jpg  | 3      | 0         |
| img_133.jpg  | 3      | 0         |
| img_121.jpg  | 0      | 3         |
| img_126.jpg  | 3      | 0         |
| img_131.jpg  | 0      | 3         |
| img_130.jpg  | 0      | 3         |
| img_122.jpg  | 0      | 3         |
| img_137.jpg  | 1      | 2         |
| img_125.jpg  | 1      | 2         |
| img_127.jpg  | 3      | 0         |
| img_128.jpg  | 3      | 0         |
| img_138.jpg  | 0      | 3         |
| img_132.jpg  | 0      | 3         |
| img_120.jpg  | 3      | 0         |
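
The counts in the table above are enough to recompute the score by hand. The following is a minimal sketch in plain Python, assuming every image was rated by the same number of raters; the function name `fleiss_kappa` and the list `table` are illustrative, with the rows copied from the table in the order shown.

```python
def fleiss_kappa(rating_table):
    """Fleiss' kappa for a table of per-item category counts (one row per item)."""
    n_items = len(rating_table)
    n_raters = sum(rating_table[0])  # assumes the same number of raters for every item
    n_categories = len(rating_table[0])

    # Per-item agreement: share of rater pairs that chose the same category
    p_items = [
        (sum(count * count for count in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in rating_table
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement: squared share of all ratings falling in each category
    p_categories = [
        sum(row[k] for row in rating_table) / (n_items * n_raters)
        for k in range(n_categories)
    ]
    p_e = sum(p * p for p in p_categories)

    return (p_bar - p_e) / (1 - p_e)

# [Trucks, No Trucks] counts per image, in the order of the table above
table = [
    [3, 0], [0, 3], [0, 3], [3, 0], [3, 0], [3, 0], [3, 0], [3, 0], [0, 3], [3, 0],
    [0, 3], [0, 3], [0, 3], [1, 2], [1, 2], [3, 0], [3, 0], [0, 3], [0, 3], [3, 0],
]
print(f"Fleiss Kappa: {fleiss_kappa(table):.4f}")
```

With 18 unanimous images and 2 split images, the mean per-item agreement is about 0.9333 and the chance agreement is about 0.5022, which yields the reported 0.8661.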

### Detailed Interpretation

While the **Fleiss Kappa** score of **0.8661** suggests strong agreement among the raters, it does not rule out confusion on individual images: the rating table shows two images (`img_137.jpg` and `img_125.jpg`) on which the raters split 1 to 2. Some images may be harder to classify because of their level of detail or because it is ambiguous whether the object shown qualifies as a "Truck".

#### Possible Sources of Confusion:
- **Detailed Images**: Images that are highly detailed or taken from specific angles may create confusion, as raters might interpret certain vehicles as trucks or not based on their appearance.
- **Ambiguity in Defining "Truck"**: The definition of what qualifies as a "Truck" can be subjective. A vehicle like a **tempo** (a small, three-wheeled or four-wheeled vehicle often used for transporting goods) may be considered a truck by some but not by others, creating disagreement among raters. This potential overlap in the categories could explain some of the minor inconsistencies.
  
### Conclusion

- **High Agreement**: Despite some potential ambiguities, the **Fleiss Kappa** score of **0.8661** suggests that, overall, the raters were consistent in their judgments and that the agreement is strong.
  
- **Sources of Confusion**: The discrepancies observed (in `img_137.jpg` and `img_125.jpg`, where one rater chose "Trucks" and the other two chose "No Trucks") could be attributed to the level of detail in the pictures, subjective interpretations of what defines a truck, or the presence of vehicles with characteristics similar to trucks, such as tempos or vans.