<a href="https://colab.research.google.com/github/Roaa27/NER-project/blob/main/NER.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Step 1: Import Libraries  
I installed seqeval and imported the required libraries: spaCy for natural language processing, displacy for visualization, and seqeval for evaluation metrics.

In [10]:
!pip install seqeval

Collecting seqeval
  Downloading seqeval-1.2.2.tar.gz (43 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: seqeval
  Building wheel for seqeval (setup.py) ... done
  Created wheel for seqeval: filename=seqeval-1.2.2-py3-none-any.whl size=16162 sha256=42b264705fa363921663cc508370d0bbb7ce9994a6c902026e3d78d3225af9c6
  Stored in directory: /root/.cache/pip/wheels/5f/b8/73/0b2c1a76b701a677653dd79ece07cfabd7457989dbfbdcd8d7
Successfully built seqeval
Installing collected packages: seqeval
Successfully installed seqeval-1.2.2


In [11]:
import spacy
from spacy import displacy
import pandas as pd
from google.colab import files
from seqeval.metrics import classification_report, f1_score, precision_score, recall_score
print("✅ Libraries imported successfully!")


✅ Libraries imported successfully!


Step 2: Upload Dataset  
I uploaded the CoNLL-2003 data to Colab using `files.upload()`. The dataset is split into train, test, and validation files; in this run only `train.txt` was uploaded. Each file contains tokenized sentences with their ground-truth NER labels.

In [2]:
print("⬆️ Please upload your train/test/valid files")
uploaded = files.upload()

uploaded_files = list(uploaded.keys())
print("Uploaded files:", uploaded_files)

⬆️ Please upload your train/test/valid files


Saving train.txt to train.txt
Uploaded files: ['train.txt']


Step 3: Read and Prepare Data  
I defined a function to read the CoNLL data format.  
- It treats blank lines as sentence boundaries.  
- For each non-blank line, it takes the first column as the token and the last column as the NER label.  
- Finally, it returns a list of sentences (`train_sentences`) and their labels (`train_labels`).

In [7]:
def read_conll_data(filepath):
    sentences = []
    labels = []
    sentence_tokens = []
    sentence_labels = []

    with open(filepath, 'r', encoding='utf-8') as f:
        for line in f:
            if line.strip() == "":
                if sentence_tokens:
                    sentences.append(sentence_tokens)
                    labels.append(sentence_labels)
                    sentence_tokens, sentence_labels = [], []
            else:
                parts = line.strip().split()
                if len(parts) >= 2:
                    token, ner_label = parts[0], parts[-1]
                    sentence_tokens.append(token)
                    sentence_labels.append(ner_label)
        if sentence_tokens:
            sentences.append(sentence_tokens)
            labels.append(sentence_labels)

    return sentences, labels

train_sentences, train_labels = read_conll_data(uploaded_files[0])
print(f"✅ Loaded {len(train_sentences)} training sentences")


✅ Loaded 14987 training sentences
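
The total of 14,987 appears to include the `-DOCSTART-` document-boundary lines found in CoNLL-2003 files: each one is followed by a blank line, so the reader stores it as a one-token sentence. Below is a minimal sketch of filtering them out, assuming the `(sentences, labels)` layout returned by `read_conll_data`. The helper name `drop_docstart` and the small sample are hypothetical.

```python
def drop_docstart(sentences, labels, marker="-DOCSTART-"):
    """Return copies of sentences/labels without document-boundary entries."""
    kept_sentences, kept_labels = [], []
    for tokens, tags in zip(sentences, labels):
        # A boundary entry is a single-token "sentence" holding only the marker.
        if tokens == [marker]:
            continue
        kept_sentences.append(tokens)
        kept_labels.append(tags)
    return kept_sentences, kept_labels


# Small hand-written sample in the same shape as read_conll_data's output.
toy_sentences = [["-DOCSTART-"], ["EU", "rejects", "German", "call"], ["Peter", "Blackburn"]]
toy_labels = [["O"], ["B-ORG", "O", "B-MISC", "O"], ["B-PER", "I-PER"]]

clean_sentences, clean_labels = drop_docstart(toy_sentences, toy_labels)
print(len(clean_sentences))  # 2
```

Filtering after reading, rather than inside the reader, keeps the original function and its reported count unchanged.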


Step 4: Load spaCy Model  
I loaded the spaCy English model en_core_web_sm, which comes with a pre-trained Named Entity Recognition component.

In [4]:
nlp = spacy.load("en_core_web_sm")
print("✅ spaCy model loaded successfully!")

✅ spaCy model loaded successfully!


Step 5: Run NER on a Sample Sentence  
I tested the model on the first entry of the training set.  
The model outputs named entities such as PERSON, ORG, and GPE, and I printed them out to verify the predictions. In this run the first entry is the `-DOCSTART-` document marker, so no entities were detected.

In [8]:
sample_sentence = " ".join(train_sentences[0])
print("\n🔹 Example sentence:\n", sample_sentence)

doc = nlp(sample_sentence)
print("\n🔹 Entities detected by spaCy:")
for ent in doc.ents:
    print(ent.text, "-", ent.label_)


displacy.render(doc, style="ent", jupyter=True)


🔹 Example sentence:
 -DOCSTART-

🔹 Entities detected by spaCy:
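
spaCy's English models use labels such as PERSON, ORG, and GPE, whereas the CoNLL-2003 ground truth uses PER, ORG, LOC, and MISC, so predictions need to be translated before the two can be compared. The mapping below is an assumption made for illustration, not an official correspondence, and the names `SPACY_TO_CONLL` and `to_conll_label` are hypothetical.

```python
# Assumed mapping from spaCy entity labels to CoNLL-2003 entity types.
# Labels with no obvious CoNLL counterpart (DATE, MONEY, ...) are left out.
SPACY_TO_CONLL = {
    "PERSON": "PER",
    "ORG": "ORG",
    "GPE": "LOC",
    "LOC": "LOC",
    "FAC": "LOC",
    "NORP": "MISC",
    "LANGUAGE": "MISC",
    "PRODUCT": "MISC",
    "EVENT": "MISC",
    "WORK_OF_ART": "MISC",
}


def to_conll_label(spacy_label):
    """Map a spaCy label to a CoNLL-2003 type, or None if it has no counterpart."""
    return SPACY_TO_CONLL.get(spacy_label)


print(to_conll_label("PERSON"))  # PER
print(to_conll_label("DATE"))    # None
```

Entities that map to `None` can be treated as `O` when building the predicted tag sequence.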


Step 6: Visualization with displacy  
I used spaCy’s built-in displacy visualizer to highlight named entities in the text with different colors for better understanding.

In [6]:
displacy.render(doc, style="ent", jupyter=True)



Step 7: Evaluate the Model  
I compared spaCy’s predictions with the ground-truth labels for the first 200 training sentences using seqeval metrics.  
The evaluation provides Precision, Recall, and F1-score, along with a detailed classification report for each entity type.
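
seqeval scores whole entities rather than individual tokens: a prediction counts as correct only when both its span and its type match a ground-truth entity. The sketch below illustrates that idea in plain Python for entities given as `(start, end, type)` tuples. It is an illustration of the metric, not seqeval's implementation, and `entity_prf` is a hypothetical name.

```python
def entity_prf(gold, pred):
    """Entity-level precision, recall, and F1 for sets of (start, end, type)."""
    gold, pred = set(gold), set(pred)
    correct = len(gold & pred)  # exact span and type match
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up example: one prediction is correct, one has the right span but wrong type.
gold = {(0, 1, "ORG"), (2, 3, "MISC"), (5, 7, "PER")}
pred = {(0, 1, "ORG"), (5, 7, "LOC")}

p, r, f = entity_prf(gold, pred)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.5 0.333 0.4
```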

In [12]:


y_true = []
y_pred = []

for tokens, true_labels in zip(train_sentences[:200], train_labels[:200]):  # use only 200 sentences for this trial run
    sentence = " ".join(tokens)
    doc = nlp(sentence)

    # Initialize predicted labels for each token
    predicted_labels = ["O"] * len(tokens)

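    # Copy each entity's raw spaCy label onto every token whose text occurs
    # in the entity (no B-/I- prefix, no mapping to the CoNLL label set).
    for ent in doc.ents:
        ent_tokens = ent.text.split()
        for i, token in enumerate(tokens):
            if token in ent_tokens:
                predicted_labels[i] = ent.label_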

    y_true.append(true_labels)
    y_pred.append(predicted_labels)

print("\n🔹 Evaluation Results (on 200 sentences):")
print("Precision:", precision_score(y_true, y_pred))
print("Recall:", recall_score(y_true, y_pred))
print("F1-score:", f1_score(y_true, y_pred))
print("\nDetailed Report:")
print(classification_report(y_true, y_pred))


🔹 Evaluation Results (on 200 sentences):
Precision: 0.0
Recall: 0.0
F1-score: 0.0

Detailed Report:
              precision    recall  f1-score   support

          AC       0.00      0.00      0.00         0
     ANGUAGE       0.00      0.00      0.00         0
     ARDINAL       0.00      0.00      0.00         0
         ATE       0.00      0.00      0.00         0
      ERCENT       0.00      0.00      0.00         0
       ERSON       0.00      0.00      0.00         0
         IME       0.00      0.00      0.00         0
         LOC       0.00      0.00      0.00       169
        MISC       0.00      0.00      0.00       105
          OC       0.00      0.00      0.00         0
        ONEY       0.00      0.00      0.00         0
         ORG       0.00      0.00      0.00        65
         ORP       0.00      0.00      0.00         0
          PE       0.00      0.00      0.00         0
         PER       0.00      0.00      0.00        72
      RODUCT       0.00      0.00 

  _warn_prf(average, modifier, msg_start, len(result))
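
The truncated names in the report (`ERSON`, `ATE`, `PE`) suggest that seqeval read the first character of each bare spaCy label as the tagging prefix, because the predictions were written as `PERSON` rather than `B-PER`/`I-PER`. Together with the different label sets, that would explain why no predicted entity matched the ground truth. One possible fix is sketched below under two assumptions: entity spans are available as token indices (with spaCy these could come from `ent.start` and `ent.end`, provided the `Doc` tokens line up with the CoNLL tokens), and the labels have already been translated to CoNLL types. `spans_to_bio` is a hypothetical helper.

```python
def spans_to_bio(n_tokens, spans):
    """Convert (start, end, label) token spans, end exclusive, to BIO tags."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        if label is None:  # entity type with no CoNLL counterpart
            continue
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


# Made-up example with spans expressed in token indices.
tokens = ["Peter", "Blackburn", "visited", "Brussels", "yesterday"]
spans = [(0, 2, "PER"), (3, 4, "LOC"), (4, 5, None)]

print(spans_to_bio(len(tokens), spans))
# ['B-PER', 'I-PER', 'O', 'B-LOC', 'O']
```

Tag sequences in this form use the same scheme as the ground-truth labels, so they can be passed to `classification_report` as `y_pred`.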
