# Dataset Exploration for Named Entity Recognition (IOB2)

Before building any models, I analyse the CoNLL-2003 dataset to understand its structural and statistical properties.


Specifically, I examine:

- **Label imbalance: the "O" problem**  
    How dominant is the "O" class relative to the entity labels? A dominant "O" class inflates token-level accuracy and biases models towards predicting "O", so it affects the choice of evaluation metric.

- **Sentence length distribution**  
    Are most sentences long, full sentences from news articles or short fragments? This informs the padding strategy and the maximum sequence length for the BiLSTM.
    
- **Surface cues**  
    Do entities typically start with capital letters or contain specific character patterns? These signals guide feature engineering for the Logistic Regression baseline.

- **Multi-token entity spans**  
    How often do we observe multi-token entities? This helps estimate the importance of sequence modelling.
    
- **Vocabulary coverage and OOV (Out-of-Vocabulary) issues**  
    How large is the vocabulary, and how often do unseen words appear in validation/test splits? This informs embedding and generalisation strategies.
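
As a rough illustration of the statistics listed above, here is a minimal sketch on a few hand-written IOB2 sentences. The toy data and the helper names (`label_distribution`, `sentence_lengths`, `entity_span_lengths`, `oov_rate`) are assumptions made for this example and are not taken from the dataset or the project code.

```python
from collections import Counter

# Toy IOB2-tagged sentences (illustrative only, not taken from the dataset).
train_sentences = [
    [("EU", "B-ORG"), ("rejects", "O"), ("German", "B-MISC"), ("call", "O")],
    [("Peter", "B-PER"), ("Blackburn", "I-PER"), ("visited", "O"),
     ("New", "B-LOC"), ("York", "I-LOC"), (".", "O")],
]
valid_sentences = [
    [("Blackburn", "B-PER"), ("met", "O"), ("Reuters", "B-ORG"), (".", "O")],
]


def label_distribution(sentences):
    """Relative frequency of every tag, e.g. to see how dominant "O" is."""
    counts = Counter(tag for sent in sentences for _, tag in sent)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}


def sentence_lengths(sentences):
    """Number of tokens per sentence (input for padding / max length choices)."""
    return [len(sent) for sent in sentences]


def entity_span_lengths(sentences):
    """Length in tokens of every entity span.

    Assumes well-formed IOB2: a span starts at B-X and continues over the
    following I- tags (the entity type of I- tags is not checked here).
    """
    lengths = []
    for sent in sentences:
        current = 0
        for _, tag in sent:
            if tag.startswith("B-"):
                if current:
                    lengths.append(current)
                current = 1
            elif tag.startswith("I-") and current:
                current += 1
            else:
                if current:
                    lengths.append(current)
                current = 0
        if current:  # span that runs to the end of the sentence
            lengths.append(current)
    return lengths


def oov_rate(train, other):
    """Share of tokens in `other` whose surface form never occurs in `train`."""
    vocab = {tok for sent in train for tok, _ in sent}
    tokens = [tok for sent in other for tok, _ in sent]
    unseen = sum(tok not in vocab for tok in tokens)
    return unseen / len(tokens) if tokens else 0.0


spans = entity_span_lengths(train_sentences)
print("tag shares:", label_distribution(train_sentences))
print("sentence lengths:", sentence_lengths(train_sentences))
print("multi-token span share:", sum(n > 1 for n in spans) / len(spans))
print("validation OOV rate:", oov_rate(train_sentences, valid_sentences))
```

The same functions should work on the real splits once they are parsed into lists of `(token, tag)` pairs. Whether the vocabulary should be case-sensitive is a separate design choice that this sketch leaves open.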

## Objective

The goal is to derive modelling hypotheses that will inform:

- Feature design for the Logistic Regression baseline
- Architectural choices for the BiLSTM sequence model

This notebook follows a data-first, hypothesis-driven approach.
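
To make the link between surface cues and feature design concrete, the sketch below shows one possible shape for per-token features for the Logistic Regression baseline. The function name `token_features` and the specific feature set are hypothetical; the features actually used should follow from the exploration results.

```python
def token_features(tokens, i):
    """Hand-crafted surface features for the token at position i.

    Hypothetical feature set for illustration: capitalisation, digits,
    hyphens, a short suffix, and the neighbouring words as context.
    """
    word = tokens[i]
    return {
        "lower": word.lower(),
        "is_capitalised": word[:1].isupper(),
        "is_all_caps": word.isupper(),
        "has_digit": any(ch.isdigit() for ch in word),
        "has_hyphen": "-" in word,
        "suffix3": word[-3:].lower(),
        "is_first": i == 0,
        # Boundary markers stand in for missing neighbours.
        "prev_lower": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }


sentence = ["EU", "rejects", "German", "call"]
features = [token_features(sentence, i) for i in range(len(sentence))]
print(features[2])
```

Feature dictionaries like these can be turned into a sparse matrix with scikit-learn's `DictVectorizer` before fitting a linear classifier, although that step is not shown here.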

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
%cd /content/drive/MyDrive/projects
!git clone https://github.com/Efemirkan/ner-comparison-project.git

/content/drive/MyDrive/projects
Cloning into 'ner-comparison-project'...
remote: Enumerating objects: 12, done.
remote: Counting objects: 100% (12/12), done.
remote: Compressing objects: 100% (8/8), done.
remote: Total 12 (delta 3), reused 0 (delta 0), pack-reused 0 (from 0)
Receiving objects: 100% (12/12), 4.91 KiB | 386.00 KiB/s, done.
Resolving deltas: 100% (3/3), done.


In [3]:
%cd ner-comparison-project/

/content/drive/MyDrive/projects/ner-comparison-project


In [4]:
# Create the repository folder structure
import os

folders = ["data/raw", "data/processed", "src", "models", "results", "notebooks"]

for folder in folders:
    os.makedirs(folder, exist_ok=True)

In [11]:
# Upload kaggle.json (Kaggle API credentials)
from google.colab import files
# Assign the result so the file contents (including the API key) are not echoed in the cell output
_ = files.upload()

Saving kaggle.json to kaggle.json


In [12]:
!mkdir -p ~/.kaggle
!mv kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

In [13]:
# Download the CoNLL-2003 (English version) dataset
!kaggle datasets download -d alaakhaled/conll003-englishversion -p data/raw --unzip

Dataset URL: https://www.kaggle.com/datasets/alaakhaled/conll003-englishversion
License(s): CC0-1.0
Downloading conll003-englishversion.zip to data/raw
  0% 0.00/960k [00:00<?, ?B/s]
100% 960k/960k [00:00<00:00, 56.3MB/s]
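
Once the archive is unpacked, the splits need to be parsed into sentences. The sketch below assumes the standard CoNLL-2003 column layout (token, POS tag, chunk tag, NER tag, separated by whitespace, with blank lines between sentences and `-DOCSTART-` lines between documents). The helper name `read_conll` and the file name `data/raw/train.txt` are assumptions and should be checked against the files actually present in `data/raw`.

```python
def read_conll(lines):
    """Parse CoNLL-style lines into a list of sentences of (token, ner_tag) pairs.

    Assumes whitespace-separated columns with the token first and the NER tag
    last; blank lines and -DOCSTART- markers end the current sentence.
    """
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split()
        current.append((parts[0], parts[-1]))
    if current:  # file may not end with a blank line
        sentences.append(current)
    return sentences


# Small hand-written sample in the assumed format (illustrative only).
sample = """-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
""".splitlines()

print(read_conll(sample))

# With a real file (the path is an assumption about the unzipped archive):
# with open("data/raw/train.txt", encoding="utf-8") as f:
#     train_sentences = read_conll(f)
```

Taking an iterable of lines rather than a path keeps the parser easy to test on small in-memory samples before pointing it at the downloaded files.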


In [15]:
%pwd

'/content/drive/MyDrive/projects/ner-comparison-project'