# Dataset  
This notebook uses the **CoNLL-2003** dataset for Named Entity Recognition (NER) using Conditional Random Fields (CRFs).

> **Citation:**  
> Tjong Kim Sang, E. F., & De Meulder, F. (2003). *Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition.* In Proceedings of the Seventh Conference on Natural Language Learning (CoNLL-2003) (pp. 142–147).  
>  
> **Dataset Link:** [https://www.clips.uantwerpen.be/conll2003/ner/](https://www.clips.uantwerpen.be/conll2003/ner/)  

The dataset consists of annotated text for NER, with four entity types:  
- **PER** (Person)  
- **ORG** (Organization)  
- **LOC** (Location)  
- **MISC** (Miscellaneous)

The entity tags use the Beginning-Inside-Outside (BIO) tagging scheme.

In [31]:
from datasets import load_dataset
from sklearn_crfsuite import CRF
from sklearn_crfsuite.metrics import flat_classification_report

In [7]:
# conll2003 is defined by a loading script on the Hub, so it is loaded with trust_remote_code=True
conll2003 = load_dataset("conll2003", trust_remote_code=True)

# Inspect the structure
print(conll2003)

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3453
    })
})


In [8]:
# Inspect first sample
print(conll2003["train"][0])  

{'id': '0', 'tokens': ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.'], 'pos_tags': [22, 42, 16, 21, 35, 37, 16, 21, 7], 'chunk_tags': [11, 21, 11, 12, 21, 22, 11, 12, 0], 'ner_tags': [3, 0, 7, 0, 0, 0, 7, 0, 0]}


## POS Tags  
Part-of-Speech tags identify the grammatical category of each word in a sentence based on its definition and context. These tags follow the Penn Treebank tagset and are fundamental for syntactic analysis. They distinguish between nouns, verbs, adjectives, prepositions, and other linguistic elements that define a word's role in sentence structure.

In [10]:
pos_tag_names = conll2003["train"].features["pos_tags"].feature.names
print(conll2003["train"][0]['tokens']) 
print([pos_tag_names[tag] for tag in conll2003["train"][0]["pos_tags"]]) # Decode the first sentence's POS tags

['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']
['NNP', 'VBZ', 'JJ', 'NN', 'TO', 'VB', 'JJ', 'NN', '.']


| Word     | POS Tag | Meaning                               |
|----------|---------|---------------------------------------|
| EU       | NNP     | Proper noun, singular                 |
| rejects  | VBZ     | Verb, 3rd person singular present     |
| German   | JJ      | Adjective                             |
| call     | NN      | Noun, singular or mass                |
| to       | TO      | "to" (here the infinitive marker)     |
| boycott  | VB      | Verb, base form                       |
| British  | JJ      | Adjective                             |
| lamb     | NN      | Noun, singular or mass                |
| .        | .       | Sentence-final punctuation            |

## Chunk Tags  
Chunk tags, produced by chunking (also called shallow parsing), identify syntactic phrases in text that form coherent grammatical units. Unlike full parsing, chunking identifies non-overlapping phrases without hierarchical structure. Key phrase types include noun phrases (groups of words centered around a noun), verb phrases (groups containing a main verb), and prepositional phrases (groups starting with a preposition).

In [13]:
chunk_tag_names = conll2003["train"].features["chunk_tags"].feature.names
print(conll2003["train"][0]['tokens']) 
print([chunk_tag_names[tag] for tag in conll2003["train"][0]["chunk_tags"]])

['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']
['B-NP', 'B-VP', 'B-NP', 'I-NP', 'B-VP', 'I-VP', 'B-NP', 'I-NP', 'O']


| Word    | Chunk Tag | Meaning                   |
|---------|-----------|---------------------------|
| EU      | B-NP      | Beginning of noun phrase  |
| rejects | B-VP      | Beginning of verb phrase  |
| German  | B-NP      | Beginning of noun phrase  |
| call    | I-NP      | Inside noun phrase        |
| to      | B-VP      | Beginning of verb phrase  |
| boycott | I-VP      | Inside verb phrase        |
| British | B-NP      | Beginning of noun phrase  |
| lamb    | I-NP      | Inside noun phrase        |
| .       | O         | Outside any chunk         |

## NER Tags  
Named Entity Recognition tags classify words or phrases that refer to real-world entities into predefined categories. These tags follow the BIO (Begin-Inside-Outside) scheme to mark entity boundaries. The categories are persons, organizations, locations, and miscellaneous entities (nationalities, events, products, etc.); each entity type has its own B- (beginning) and I- (inside) tag, so a multi-word entity can be marked as a single span.<br>
__These are the target labels we want to predict.__

In [22]:
ner_tag_names = conll2003["train"].features["ner_tags"].feature.names
print(conll2003["train"][0]['tokens']) 
print([ner_tag_names[tag] for tag in conll2003["train"][0]["ner_tags"]])

['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']
['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O']


| Word    | NER Tag | Meaning                              |
|---------|---------|--------------------------------------|
| EU      | B-ORG   | Beginning of organization            |
| rejects | O       | Not an entity                        |
| German  | B-MISC  | Beginning of miscellaneous entity    |
| call    | O       | Not an entity                        |
| to      | O       | Not an entity                        |
| boycott | O       | Not an entity                        |
| British | B-MISC  | Beginning of miscellaneous entity    |
| lamb    | O       | Not an entity                        |
| .       | O       | Not an entity                        |
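
To make the B-/I- boundary marking concrete, here is a minimal sketch that groups BIO-tagged tokens into entity spans. The helper name `bio_to_spans` is hypothetical (it is not part of the notebook or of any library), and it leniently treats a stray `I-` tag as the start of a new span.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) spans."""
    spans, current_type, current_tokens = [], None, []
    for token, tag in zip(tokens, tags):
        # 'B-ORG' -> ('B', 'ORG'); 'O' -> ('O', '')
        prefix, _, label = tag.partition("-")
        continues = prefix == "I" and label == current_type
        if not continues:
            # Close the span that was open, if any
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
            # A B- tag (or a stray I- tag) opens a new span
            if prefix in ("B", "I"):
                current_type = label
        if current_type is not None:
            current_tokens.append(token)
    # Close a span that runs to the end of the sentence
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


tokens = ["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."]
tags = ["B-ORG", "O", "B-MISC", "O", "O", "O", "B-MISC", "O", "O"]
print(bio_to_spans(tokens, tags))
```

For the first training sentence this yields three single-token entities; a sequence such as `B-LOC I-LOC I-LOC` would be merged into one location span.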

In [27]:
#Feature Engineering
def word2features(sent, i, pos_tags, chunk_tags):
    """Feature extraction for each word in sentence"""
    word = sent[i]
    features = {
        # Word features
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:], #extract the suffix
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
        
        # POS features
        'pos': pos_tag_names[pos_tags[i]],  # the word's full POS tag
        'pos[:2]': pos_tag_names[pos_tags[i]][:2],  # coarse POS category, e.g. 'NN' from 'NNP'
        # The coarse category captures broader linguistic patterns: it tells the model
        # that, for example, 'VBZ' and 'VBD' are related (both are verbs)

        # Chunk features
        'chunk': chunk_tag_names[chunk_tags[i]],  # full chunk tag, e.g. 'B-NP'
        'chunk[:2]': chunk_tag_names[chunk_tags[i]][:2],  # BIO prefix only, e.g. 'B-' from 'B-NP'
    }
    
    # Context features
    if i > 0:
        features.update({
            'prev_word': sent[i-1].lower(),
            'prev_pos': pos_tag_names[pos_tags[i-1]],
            'prev_chunk': chunk_tag_names[chunk_tags[i-1]],
        })
    else:
        features['BOS'] = True  # Beginning of sentence
        
    if i < len(sent)-1:
        features.update({
            'next_word': sent[i+1].lower(),
            'next_pos': pos_tag_names[pos_tags[i+1]], 
            'next_chunk': chunk_tag_names[chunk_tags[i+1]],
        })
    else:
        features['EOS'] = True  # End of sentence
        
    return features
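
To see what the word-level and POS features look like for individual tokens, here is a small self-contained sketch. `word_shape_features` is a hypothetical, simplified stand-in for `word2features`: it takes the POS tag as a string and leaves out the chunk and context features.

```python
def word_shape_features(word, pos):
    """Word-level and POS features for one token (simplified illustration)."""
    return {
        "word.lower()": word.lower(),
        "word[-3:]": word[-3:],            # suffix of up to three characters
        "word.isupper()": word.isupper(),  # all caps, e.g. acronyms
        "word.istitle()": word.istitle(),  # capitalized, e.g. names
        "word.isdigit()": word.isdigit(),
        "pos": pos,
        "pos[:2]": pos[:2],                # coarse category, e.g. 'VB' from 'VBZ'
    }


for word, pos in [("EU", "NNP"), ("rejects", "VBZ"), ("German", "JJ")]:
    print(word, word_shape_features(word, pos))
```

Note that an all-caps token such as `EU` sets `word.isupper()` but not `word.istitle()`, so the two capitalization features carry different signals.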

In [28]:
def prepare_data(dataset):
    """Convert dataset to features and labels"""
    X, y = [], []
    for tokens, pos_tags, chunk_tags, ner_tags in zip(
        dataset["tokens"],
        dataset["pos_tags"],
        dataset["chunk_tags"], 
        dataset["ner_tags"]
    ):
        # CRF expects X to be a list of sentences, each a list of per-token feature dicts
        X.append([word2features(tokens, i, pos_tags, chunk_tags) for i in range(len(tokens))])
        # CRF expects y to be a list of sentences, each a list of string labels
        y.append([ner_tag_names[tag] for tag in ner_tags])

        # Tag names are used instead of integer ids for readability:
        # predictions then come back as tag names too, ready for the report
    return X, y

### Given this input:
tokens = [["EU", "rejects"], ["Apple", "launched"]]<br>
pos_tags = [[22, 42], [22, 42]]<br>
chunk_tags = [[11, 21], [11, 21]]<br>
ner_tags = [[3, 0], [3, 0]]<br>

zip(tokens, pos_tags, chunk_tags, ner_tags) → <br>
[(["EU", "rejects"], [22, 42], [11, 21], [3, 0]),  # Sentence 0<br>
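 (["Apple", "launched"], [22, 42], [11, 21], [3, 0])]  # Sentence 1<br>

#### How the for loop unpacks:
for tokens, pos_tags, chunk_tags, ner_tags in zip(...):<br>
On each iteration the loop unpacks one tuple, so `tokens`, `pos_tags`, `chunk_tags`, and `ner_tags` each hold the list for a single sentence: `["EU", "rejects"]`, `[22, 42]`, `[11, 21]`, and `[3, 0]` on the first pass, then the values for sentence 1 on the second.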
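
The unpacking can be checked directly with the toy input. This is a small standalone sketch; the `*_col` variable names are made up here to keep the column lists apart from the per-sentence loop variables.

```python
# One list per column, one inner list per sentence (toy values from the text)
tokens_col = [["EU", "rejects"], ["Apple", "launched"]]
pos_col = [[22, 42], [22, 42]]
chunk_col = [[11, 21], [11, 21]]
ner_col = [[3, 0], [3, 0]]

# zip pairs up the i-th element of every column: one tuple per sentence
rows = list(zip(tokens_col, pos_col, chunk_col, ner_col))

for tokens, pos_tags, chunk_tags, ner_tags in rows:
    # Inside the loop each name refers to a single sentence's list
    print(tokens, pos_tags, chunk_tags, ner_tags)
```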

In [29]:
#Prepare Train/Test Data
X_train, y_train = prepare_data(conll2003["train"])
X_test, y_test = prepare_data(conll2003["test"])

In [32]:
#CRF Model Training
crf = CRF(
    algorithm='lbfgs', #the optimizer
    c1=0.1,  # L1 regularization: Encourages sparsity (drops useless features)
    c2=0.01, # L2 regularization: Prevents large weights (smoother predictions)
    max_iterations=100,
    all_possible_transitions=True  
    # Also generates transition features for tag-to-tag pairs that never occur in the training data
    # The model can then learn weights for them (typically negative, penalizing unlikely transitions)
    # Regularization keeps those weights small when the data gives no evidence either way
)
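
For intuition about `c1` and `c2`, the training objective can be sketched as a penalized negative log-likelihood. This assumes CRFsuite's usual formulation of L-BFGS training with both penalties; the exact scaling of the terms is an assumption, and the symbols $w$ (feature weights), $x^{(n)}$, $y^{(n)}$ (the $n$-th sentence and its tag sequence), and $N$ (number of training sentences) are notation introduced here, not names from the notebook.

$$
\min_{w} \; -\sum_{n=1}^{N} \log p\left(y^{(n)} \mid x^{(n)}; w\right) + c_1 \lVert w \rVert_1 + c_2 \lVert w \rVert_2^2
$$

With the settings above, $c_1 = 0.1$ and $c_2 = 0.01$. Increasing $c_1$ drives more weights to exactly zero, while increasing $c_2$ shrinks all weights toward zero without eliminating them.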

In [33]:
crf.fit(X_train, y_train)

In [34]:
#Evaluation
y_pred = crf.predict(X_test)
labels = list(crf.classes_)
labels.remove('O')  # Filter out 'O' tag

print(flat_classification_report(
    y_test, 
    y_pred, 
    labels=labels,
    digits=3
))

              precision    recall  f1-score   support

       B-ORG      0.770     0.713     0.741      1661
      B-MISC      0.816     0.765     0.790       702
       B-PER      0.826     0.855     0.840      1617
       I-PER      0.864     0.949     0.905      1156
       B-LOC      0.853     0.812     0.832      1668
       I-ORG      0.668     0.735     0.700       835
      I-MISC      0.708     0.662     0.684       216
       I-LOC      0.743     0.607     0.668       257

   micro avg      0.803     0.797     0.800      8112
   macro avg      0.781     0.762     0.770      8112
weighted avg      0.803     0.797     0.799      8112
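
The `macro avg` and `weighted avg` rows can be recomputed from the per-label rows: the macro average is the plain mean of the per-label scores, and the weighted average weights each label by its support. The sketch below does this for the F1 column, using the rounded values copied from the report (so agreement is only expected to three decimals). The `micro avg` row pools token counts across labels and cannot be recovered from these rounded scores alone.

```python
# (f1-score, support) per label, copied from the report above
rows = {
    "B-ORG": (0.741, 1661),
    "B-MISC": (0.790, 702),
    "B-PER": (0.840, 1617),
    "I-PER": (0.905, 1156),
    "B-LOC": (0.832, 1668),
    "I-ORG": (0.700, 835),
    "I-MISC": (0.684, 216),
    "I-LOC": (0.668, 257),
}

total_support = sum(support for _, support in rows.values())

# Macro average: every label counts equally
macro_f1 = sum(f1 for f1, _ in rows.values()) / len(rows)

# Weighted average: labels count in proportion to their support
weighted_f1 = sum(f1 * support for f1, support in rows.values()) / total_support

print(f"support={total_support} macro={macro_f1:.3f} weighted={weighted_f1:.3f}")
```

The macro average sits below the weighted one because the weakest labels (`I-LOC`, `I-MISC`) are also among the rarest.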



In [50]:
#Error Analysis
from collections import defaultdict

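def analyze_errors(y_true, y_pred, dataset):
    """Print error statistics and example sentences for the test split.

    Only tokens whose true tag is an entity tag are counted, so
    false positives on 'O' tokens are not reported.
    """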
    error_counts = defaultdict(int)
    error_examples = defaultdict(list)
    
    for sent_idx in range(len(y_true)):
        tokens = dataset["test"][sent_idx]["tokens"]
        true_tags = y_true[sent_idx]
        pred_tags = y_pred[sent_idx]
        
        errors_in_sent = []
        for i in range(len(true_tags)):
            if true_tags[i] != pred_tags[i] and true_tags[i] != 'O':
                error_key = f"{true_tags[i]}→{pred_tags[i]}"
                error_counts[error_key] += 1
                errors_in_sent.append(
                    f"Word: '{tokens[i]}' (True: {true_tags[i]}, Pred: {pred_tags[i]})"
                )
        
        if errors_in_sent:
            error_examples["sentences"].append({
                "sentence": " ".join(tokens),
                "errors": errors_in_sent
            })
    
    # Print error statistics
    print("\nError Type Counts (True→Predicted):")
    for error, count in sorted(error_counts.items(), key=lambda x: x[1], reverse=True):
        print(f"{error}: {count}")
    
    # Print example sentences with errors
    print("\nExample Errors in Context:")
    for i, example in enumerate(error_examples["sentences"][:10]):  # First 10 examples
        print(f"\nSentence {i+1}: {example['sentence']}")
        for error in example["errors"]:
            print(f"  - {error}")

# Run analysis
analyze_errors(y_test, y_pred, conll2003)


Error Type Counts (True→Predicted):
B-LOC→B-ORG: 151
B-ORG→B-PER: 150
B-ORG→O: 136
B-ORG→B-LOC: 117
B-PER→O: 85
B-MISC→O: 80
I-ORG→I-PER: 80
B-PER→B-ORG: 66
B-LOC→O: 66
I-LOC→I-ORG: 64
I-ORG→O: 64
B-LOC→B-PER: 50
B-ORG→B-MISC: 45
B-PER→B-LOC: 43
I-PER→I-ORG: 43
B-MISC→B-ORG: 30
B-MISC→B-PER: 27
B-LOC→B-MISC: 25
I-ORG→I-LOC: 24
I-MISC→O: 23
B-ORG→I-ORG: 22
I-ORG→B-ORG: 21
I-LOC→I-PER: 19
B-MISC→B-LOC: 19
I-MISC→I-PER: 18
I-ORG→I-MISC: 17
B-PER→I-PER: 16
B-LOC→I-ORG: 16
I-LOC→O: 16
I-MISC→I-ORG: 13
B-PER→B-MISC: 10
I-ORG→B-PER: 9
I-MISC→B-MISC: 8
B-PER→I-ORG: 8
I-ORG→B-LOC: 8
I-PER→O: 7
I-PER→B-PER: 7
B-ORG→I-PER: 5
I-MISC→I-LOC: 5
I-LOC→B-LOC: 4
I-ORG→B-MISC: 3
B-MISC→I-MISC: 3
I-PER→I-LOC: 2
I-PER→I-MISC: 2
B-MISC→I-PER: 2
B-MISC→I-ORG: 2
B-LOC→I-PER: 2
I-MISC→B-LOC: 2
I-MISC→B-PER: 1
B-PER→I-MISC: 1
B-ORG→I-LOC: 1
B-LOC→I-LOC: 1
B-PER→I-LOC: 1
I-LOC→I-MISC: 1
B-ORG→I-MISC: 1
B-LOC→I-MISC: 1

Example Errors in Context:

Sentence 1: SOCCER - JAPAN GET LUCKY WIN , CHINA IN SURPRISE DEFE
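
The long list of error types is easier to read when grouped into a few categories. The sketch below is one possible grouping, not part of the original analysis: `categorize` is a hypothetical helper, and `sample_counts` holds only a handful of rows copied from the counts above, so the totals it prints describe that sample and not the full error list.

```python
from collections import Counter


def categorize(error_key):
    """Classify a 'True→Pred' key as missed entity, boundary error, or type confusion."""
    true_tag, pred_tag = error_key.split("→")
    if pred_tag == "O":
        return "missed entity"
    # Same entity type, different B-/I- prefix (e.g. B-ORG vs I-ORG)
    if true_tag[2:] == pred_tag[2:]:
        return "boundary error"
    # Different entity type, regardless of prefix
    return "type confusion"


# A few rows copied from the error counts above (not the full list)
sample_counts = {
    "B-LOC→B-ORG": 151,
    "B-ORG→B-PER": 150,
    "B-ORG→O": 136,
    "B-PER→O": 85,
    "B-ORG→I-ORG": 22,
    "I-ORG→B-ORG": 21,
}

totals = Counter()
for key, count in sample_counts.items():
    totals[categorize(key)] += count

print(totals.most_common())
```

Feeding the full `error_counts` dictionary from `analyze_errors` into the same loop would give the breakdown for the whole test set.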

# Conclusions
The model achieved a micro-averaged F1-score of 0.800 on the test set, measured per token over the entity tags (the 'O' tag is excluded). While this is strong performance for a classical machine learning technique, the error analysis reveals specific challenges, some of them compounded by known limitations of the CoNLL-2003 dataset. The model frequently confused locations with organizations (e.g., 'United Arab Emirates' and 'Uzbekistan' tagged as organizations) and mixed up organizations and persons (e.g., 'Bitar' tagged as an organization).<br>
Crucially, some observed "errors" by the model, such as predicting 'CHINA' as a location (B-LOC) when __the original dataset annotates it as a person (B-PER)__ in a sentence like "SOCCER - JAPAN GET LUCKY WIN , CHINA IN SURPRISE DEFEAT.", point directly to intrinsic annotation mistakes within the CoNLL-2003 __ground truth__ itself.<br>
Furthermore, a significant number of errors were entities missed entirely (True→O, such as 'CUTTITTA', 'ITALY', and 'ROME'), indicating difficulty in identifying boundaries or recognizing less common entities. Because the gold annotations themselves contain mistakes, part of the gap between model predictions and the gold standard may stem from the dataset's inherent noise rather than solely from model limitations. Future work could evaluate on corrected versions of the dataset or incorporate error detection mechanisms during training to mitigate the impact of such annotation inconsistencies.