# <span style="color:darkred; font-weight:bold;">COMP5606 - Natural Language Processing - FALL 2024</span>

---

## <span style="color:black; font-weight:bold;">NLP Project: Named Entity Recognition (NER) or De-identification of PHI</span>

---

### <span style="color:darkred; font-weight:bold;">Team Members</span>  
- <span style="color:darkred;"></span> - <span style="color:darkred; font-style:italic;">Ethar Al Tamimi</span>  
- <span style="color:darkred;"></span> - <span style="color:darkred; font-style:italic;">Iman Al Hajri</span>  
- <span style="color:darkred;"></span> - <span style="color:darkred; font-style:italic;">Malak Al Hinai</span>  

---

### <span style="color:darkred; font-weight:bold;">Task</span>  
Identify the following entities from medical reports:  
- <span style="color:black;">**Person Name**</span>  
- <span style="color:black;">**Location**</span>  
- <span style="color:black;">**Dates**</span>  
- <span style="color:black;">**Age**</span>  

---

### <span style="color:darkred; font-weight:bold;">Goal</span>  
This project aims to <span style="color:black; font-weight:bold;">detect and de-identify Protected Health Information (PHI)</span> by developing a <span style="color:black; font-weight:bold;">Named Entity Recognition (NER)</span> technique for medical records obtained from a hospital's electronic health system.

---


## <span style="color:darkred; font-weight:bold;">Strategy 1: NER with linear SVM  </span>

This strategy uses Support Vector Machines (SVMs) to extract and classify entities in Named Entity Recognition (NER) tasks. The SVM-based approach employs handcrafted features to represent tokens and uses a classifier to assign one of four predefined entity categories to each token in a given text.
### Full Pipeline:
1. **Feature Extraction:** 
    - **`extract_features_from_text`** function extracts features for each token in the text and the features are:
        - Is the word capitalized?
        - Is it numeric?
        - Prefix and suffix analysis (first two and last two letters of the token).
        - Contextual features which are the previous and next word.


2. **Training the SVM:**
    - The model is trained using a **LinearSVC classifier** pipeline that includes:
    - Feature vectorization using **DictVectorizer**.
    - Scaling features with **StandardScaler** to improve SVM performance.
    - Balanced class weights to address potential class imbalances in the training data.
  
3. **Prediction:**
     - During prediction, tokens from new text are processed to extract features, and the trained SVM assigns an entity tag to each token. Tokens with the same entity type and contiguous positions are merged into a single entity span.

4. **Output:**
 - Results are saved as CSV files containing the start and end indices of each entity, the predicted tag, and the extracted text.

In [1]:
import os
import pandas as pd
import nltk
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report
import nltk
import pandas as pd

import nltk
import pandas as pd
nltk.download('punkt_tab')
nltk.download('punkt')

[nltk_data] Downloading package punkt_tab to /home/iman/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package punkt to /home/iman/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### Extract features for the tokens

In [2]:
def extract_features_from_text(text):
    sentences = nltk.sent_tokenize(text)
    all_features = []
    tokens = []
    for sentence in sentences:
        words = nltk.word_tokenize(sentence)
        for i, word in enumerate(words):
            if "," in word or '"' in word or "=" in word:
                continue
            tokens.append(word)
            previous_word = words[i - 1] if i > 0 else None
            next_word = words[i + 1] if i < len(words) - 1 else None
            features = {
                "word": word,
                "is_capitalized": word[0].isupper(),
                "is_numeric": word.isdigit(),
                "prefix_2": word[:2],
                "suffix_2": word[-2:],
                "prev_word": previous_word if previous_word else "",
                "next_word": next_word if next_word else "",
            }
            all_features.append(features)
    return all_features, tokens


### Prepare Train Data using anonymized_txt

In [3]:
directory_path = 'anonymized_txt' # Train data
X_train = []
y_train = []

for filename in os.listdir(directory_path):
    if filename.endswith('.txt'):
        base_name = filename[:-4]
        csv_filename = base_name + '.csv'
        if csv_filename in os.listdir(directory_path):
            txt_file_path = os.path.join(directory_path, filename)
            csv_file_path = os.path.join(directory_path, csv_filename)
            with open(txt_file_path, "r", encoding="utf-8") as file:
                text = file.read()
            features, tokens = extract_features_from_text(text)
            df = pd.read_csv(csv_file_path)
            token_index_mapping = []
            start = 0
            for token in tokens:
                start_index = text.find(token, start)
                end_index = start_index + len(token)
                token_index_mapping.append((token, start_index, end_index))
                start = end_index
            for i, (token, start_idx, end_idx) in enumerate(token_index_mapping):
                label = "O"
                for _, row in df.iterrows():
                    if start_idx >= row["start"] and end_idx <= row["end"]:
                        label = row["tag"]
                        break
                X_train.append(features[i])
                y_train.append(label)


In [4]:
X_train[:10]

[{'word': '``',
  'is_capitalized': False,
  'is_numeric': False,
  'prefix_2': '``',
  'suffix_2': '``',
  'prev_word': '',
  'next_word': 'A'},
 {'word': 'A',
  'is_capitalized': True,
  'is_numeric': False,
  'prefix_2': 'A',
  'suffix_2': 'A',
  'prev_word': '``',
  'next_word': '51'},
 {'word': '51',
  'is_capitalized': False,
  'is_numeric': True,
  'prefix_2': '51',
  'suffix_2': '51',
  'prev_word': 'A',
  'next_word': 'years'},
 {'word': 'years',
  'is_capitalized': False,
  'is_numeric': False,
  'prefix_2': 'ye',
  'suffix_2': 'rs',
  'prev_word': '51',
  'next_word': 'old'},
 {'word': 'old',
  'is_capitalized': False,
  'is_numeric': False,
  'prefix_2': 'ol',
  'suffix_2': 'ld',
  'prev_word': 'years',
  'next_word': 'female'},
 {'word': 'female',
  'is_capitalized': False,
  'is_numeric': False,
  'prefix_2': 'fe',
  'suffix_2': 'le',
  'prev_word': 'old',
  'next_word': 'went'},
 {'word': 'went',
  'is_capitalized': False,
  'is_numeric': False,
  'prefix_2': 'we',
  'su

In [5]:
y_train[:10]

['O', 'O', 'AGE', 'O', 'O', 'O', 'O', 'O', 'LOCATION', 'LOCATION']

### Training Phaze using Linear Support Vector Classifier (LinearSVC)

In [6]:
clf = Pipeline([
    ("vectorizer", DictVectorizer(sparse=False)),
    ("scaler", StandardScaler()),  # Scaling features (important for SVM)
    ("classifier", LinearSVC(
        C=0.1,                        # Regularization parameter (best value)
        loss="squared_hinge",         # Loss function (best value)
        penalty="l2",                 # Regularization type (best value)
        max_iter=1000,                # Maximum number of iterations
        class_weight="balanced"       # Handling class imbalance
    ))
])

clf.fit(X_train, y_train)



### Testing Phaze Using 21 files

In [12]:
import os
import pandas as pd

txt_folder_path = "to test"
predicted_folder = "output_LinearSVC_afterTrain"


if not os.path.exists(predicted_folder):
    os.makedirs(predicted_folder)

for filename in os.listdir(txt_folder_path):
    if filename.endswith(".txt"):
        txt_path = os.path.join(txt_folder_path, filename)
        with open(txt_path, "r", encoding="utf-8") as file:
            text = file.read()
        
        features, tokens = extract_features_from_text(text)
        predictions = clf.predict(features)

        token_index_mapping = []
        start = 0
        for token in tokens:
            start_index = text.find(token, start)
            end_index = start_index + len(token)
            token_index_mapping.append((token, start_index, end_index))
            start = end_index
        
        merged_data = []
        current_tag = None
        current_start = None
        current_end = None
        current_words = []

        for (token, start_idx, end_idx), tag in zip(token_index_mapping, predictions):
            if tag != "O":
                if tag == current_tag:
                    current_end = end_idx
                    current_words.append(token)
                else:
                    if current_tag is not None:
                        merged_data.append([current_start, current_end, current_tag, " ".join(current_words)])
                    current_tag = tag
                    current_start = start_idx
                    current_end = end_idx
                    current_words = [token]
            else:
                if current_tag is not None:
                    merged_data.append([current_start, current_end, current_tag, " ".join(current_words)])
                    current_tag = None
                    current_words = []

        if current_tag is not None:
            merged_data.append([current_start, current_end, current_tag, " ".join(current_words)])
        
        output_csv_path = os.path.join(predicted_folder, filename.replace(".txt", ".csv"))
        df_predicted = pd.DataFrame(merged_data, columns=["start", "end", "tag", "word"])
        df_predicted.to_csv(output_csv_path, index=False)
        print(f"Saved predictions to: {output_csv_path}")


Saved predictions to: output_LinearSVC_afterTrain/p201_prostate_onc_1.csv
Saved predictions to: output_LinearSVC_afterTrain/p102_breast_sur_5.csv
Saved predictions to: output_LinearSVC_afterTrain/p103_breast_onc_7.csv
Saved predictions to: output_LinearSVC_afterTrain/p230_breast_onc_18.csv
Saved predictions to: output_LinearSVC_afterTrain/p204_prostate_onc_8.csv
Saved predictions to: output_LinearSVC_afterTrain/p105_breast_onc_18.csv
Saved predictions to: output_LinearSVC_afterTrain/p215_prostate_onc_6.csv
Saved predictions to: output_LinearSVC_afterTrain/p133_prostate_onc_11.csv
Saved predictions to: output_LinearSVC_afterTrain/p207_prostate_onc_11.csv
Saved predictions to: output_LinearSVC_afterTrain/p135_prostate_onc_15.csv
Saved predictions to: output_LinearSVC_afterTrain/p210_prostate_onc_2.csv
Saved predictions to: output_LinearSVC_afterTrain/p131_prostate_onc_6.csv
Saved predictions to: output_LinearSVC_afterTrain/p107_breast_sur_6.csv
Saved predictions to: output_LinearSVC_afte

### Display The Result in tables

In [13]:
for filename in os.listdir(predicted_folder):
    if filename.endswith('.csv'):
        file_path = os.path.join(predicted_folder, filename)
        
        
        df = pd.read_csv(file_path)
        
        
        print(f"Table from file: {filename}")
        display(df)  


Table from file: p104_breast_sur_10.csv


Unnamed: 0,start,end,tag,word
0,71,73,AGE,61
1,118,131,DATE,November/2023
2,414,423,DATE,20/5/2024
3,494,496,AGE,70
4,564,568,LOCATION,ASTR
5,591,600,DATE,26/7/2024


Table from file: p215_prostate_onc_6.csv


Unnamed: 0,start,end,tag,word
0,23,31,LOCATION,Abdallah
1,195,198,NAME,Has
2,561,567,DATE,5/9/16
3,714,721,DATE,18/5/14
4,957,959,LOCATION,NM


Table from file: p233_breast_onc_7.csv


Unnamed: 0,start,end,tag,word
0,280,281,DATE,2
1,298,307,DATE,23/1/2015
2,340,348,DATE,Dec 2016
3,461,475,LOCATION,Nizwa hospital
4,1197,1210,DATE,december 2011
5,1469,1477,LOCATION,Hormonal
6,1756,1765,DATE,28/5/2021
7,2207,2217,DATE,33/08/2021
8,2890,2894,DATE,2006
9,2984,2986,DATE,23


Table from file: p201_prostate_onc_1.csv


Unnamed: 0,start,end,tag,word
0,0,3,DATE,Dec
1,52,59,LOCATION,Solomon
2,73,85,LOCATION,SQU Hospital
3,118,132,NAME,Abdulhafid Al
4,144,148,LOCATION,Ibri
5,719,732,NAME,Nafaan Hashim
6,1056,1074,LOCATION,Mauritius Hospital
7,1512,1517,LOCATION,FAMCO
8,1989,1995,LOCATION,Muscat


Table from file: p107_breast_sur_6.csv


Unnamed: 0,start,end,tag,word
0,137,145,DATE,dec 2015
1,370,380,LOCATION,sohar hspt
2,504,514,LOCATION,Sohar hspt
3,518,526,DATE,Mar 2016
4,737,745,LOCATION,Pakistan
5,749,760,DATE,22 may 2016
6,2489,2499,DATE,11/07/2016
7,2911,2921,DATE,30/08/2016


Table from file: p207_prostate_onc_11.csv


Unnamed: 0,start,end,tag,word


Table from file: p102_breast_sur_5.csv


Unnamed: 0,start,end,tag,word
0,3,5,AGE,51
1,31,47,LOCATION,Salalah hospital
2,388,396,DATE,8/4/2014
3,613,629,LOCATION,salalah hospital


Table from file: p204_prostate_onc_8.csv


Unnamed: 0,start,end,tag,word
0,37,45,NAME,iftikhar
1,52,67,NAME,rahim Al suqri
2,816,817,DATE,2
3,907,908,DATE,2
4,1613,1621,DATE,apr 2016
5,2232,2241,DATE,27/2/2016
6,2410,2414,DATE,2011
7,3081,3082,DATE,2
8,384,385,DATE,2


Table from file: p236_breast_onc_25.csv


Unnamed: 0,start,end,tag,word
0,318,328,LOCATION,Ibra 2008.
1,473,481,DATE,Jun 2015
2,620,629,DATE,30/4/2012
3,886,894,DATE,Dec 2014
4,1038,1042,DATE,2015
5,1216,1220,NAME,Huda
6,1504,1508,LOCATION,Iron


Table from file: p101_breast_onc_11.csv


Unnamed: 0,start,end,tag,word
0,13,19,NAME,Harmal
1,201,207,LOCATION,France
2,233,239,LOCATION,France
3,2257,2263,LOCATION,France
4,2398,2408,DATE,20/11/2010
5,3491,3497,LOCATION,France
6,3612,3622,DATE,20/11/2010


Table from file: p132_prostate_onc_4.csv


Unnamed: 0,start,end,tag,word
0,35,60,NAME,Sinan Habib Amer Al-Hinai
1,97,103,LOCATION,Muscat
2,115,118,LOCATION,SQU
3,409,414,LOCATION,FAMCO
4,442,456,DATE,September 2012
5,708,720,DATE,October 2012
6,751,765,DATE,March 13 2013
7,774,784,DATE,17/04/2013
8,1114,1123,LOCATION,Indonesia
9,1433,1443,DATE,11/08/2014


Table from file: p133_prostate_onc_11.csv


Unnamed: 0,start,end,tag,word
0,203,209,LOCATION,Kuwait
1,349,359,DATE,20/03/2024


Table from file: p106_breast_onc_19.csv


Unnamed: 0,start,end,tag,word
0,19,22,NAME,Amr
1,121,134,DATE,Decemebr 2011
2,324,328,DATE,2000
3,353,357,DATE,2004
4,361,365,LOCATION,Iran
5,476,480,LOCATION,Seeb
6,521,524,DATE,JAn
7,584,587,DATE,Feb
8,1865,1875,DATE,6 July2021


Table from file: p103_breast_onc_7.csv


Unnamed: 0,start,end,tag,word
0,0,5,NAME,Amira
1,9,11,AGE,51
2,562,584,LOCATION,Shifa Alhayat Hospital
3,1082,1085,LOCATION,NMC
4,2573,2582,DATE,18/6/2006
5,3731,3744,LOCATION,Ibra hospital
6,3834,3843,DATE,11/4/2006


Table from file: p105_breast_onc_18.csv


Unnamed: 0,start,end,tag,word
0,260,262,AGE,47
1,643,652,DATE,12/2/2012
2,677,686,DATE,15/3/2012


Table from file: p135_prostate_onc_15.csv


Unnamed: 0,start,end,tag,word
0,223,228,LOCATION,Qatar
1,250,260,DATE,17/10/2010
2,741,751,DATE,10/07/2011
3,807,813,DATE,August
4,1437,1447,DATE,07/10/2011


Table from file: p220_prostate_onc_3.csv


Unnamed: 0,start,end,tag,word
0,167,177,DATE,30/08/2013


Table from file: p210_prostate_onc_2.csv


Unnamed: 0,start,end,tag,word
0,37,42,NAME,Masiy
1,66,71,LOCATION,Afifa
2,253,261,NAME,Suwaidan
3,389,393,DATE,2012
4,983,987,DATE,2017
5,1635,1645,DATE,16/08/2014
6,1998,2015,LOCATION,Hospita Hospital.


Table from file: p131_prostate_onc_6.csv


Unnamed: 0,start,end,tag,word
0,41,69,NAME,Salman Said Mubarik Al hajri
1,116,124,LOCATION,Khaboura
2,229,234,NAME,Hajri
3,340,342,LOCATION,SH
4,579,589,DATE,21/03/2017
5,727,734,DATE,12/8/16
6,875,885,DATE,23/11/2016
7,1071,1081,DATE,11/12/2016
8,1531,1540,LOCATION,two black
9,1544,1554,DATE,29/01/2017


Table from file: p230_breast_onc_18.csv


Unnamed: 0,start,end,tag,word
0,271,275,DATE,2017
1,296,297,DATE,2
2,371,374,LOCATION,SQU
3,698,700,DATE,Ki
4,1079,1087,DATE,dec 2020
5,1797,1798,DATE,2
6,1921,1925,DATE,30/3
7,2113,2121,DATE,9/6/2016
8,2295,2304,DATE,18/8/2014
9,2377,2386,DATE,33/3/2016


Table from file: p241_breast_onc_17.csv


Unnamed: 0,start,end,tag,word
0,315,326,DATE,August 2011
1,508,521,DATE,February 2014
2,561,564,DATE,May
3,1133,1135,AGE,70
4,1142,1144,DATE,16
5,1657,1660,NAME,Had
6,1740,1748,LOCATION,Mughreeb


### Evaluation

In [14]:
import os
import pandas as pd
from nervaluate import Evaluator, collect_named_entities, summary_report_ent, summary_report_overall

# Read predicted entities
predicted_entities = []
for filename in os.listdir(predicted_folder):
    if filename.endswith(".csv"):
        print(f"Processing predicted file: {filename}")
        predicted_path = os.path.join(predicted_folder, filename)
        pred_data = pd.read_csv(predicted_path)
        
        doc_entities = []
        for _, row in pred_data.iterrows():
            doc_entities.append({
                "label": row['tag'],
                "start": int(row['start']),
                "end": int(row['end'])
            })
        predicted_entities.append(doc_entities)

# Read actual entities
true_entities = []
for filename in os.listdir(txt_folder_path):
    if filename.endswith(".csv"):
        print(f"Processing true file: {filename}")
        true_path = os.path.join(txt_folder_path, filename)
        true_data = pd.read_csv(true_path)
        
        doc_entities = []
        for _, row in true_data.iterrows():
            doc_entities.append({
                "label": row['tag'],
                "start": int(row['start']),
                "end": int(row['end'])
            })
        true_entities.append(doc_entities)

# Run the evaluation

from nervaluate import Evaluator
from nervaluate import collect_named_entities, summary_report_ent, summary_report_overall

evaluator = Evaluator(true_entities,predicted_entities, tags=['LOCATION', 'NAME', 'DATE', 'AGE'])
results, results_per_tag, result_indices, result_indices_by_tag  = evaluator.evaluate()

print("\n\nOverall")
print(summary_report_overall(results))
print("\n\n'Strict'")
print(summary_report_ent(results_per_tag, scenario="strict"))
print("\n\n'Ent_Type'")
print(summary_report_ent(results_per_tag, scenario="ent_type"))
print("\n\n'Partial'")
print(summary_report_ent(results_per_tag, scenario="partial"))
print("\n\n'Exact'")
print(summary_report_ent(results_per_tag, scenario="exact"))


Processing predicted file: p104_breast_sur_10.csv
Processing predicted file: p215_prostate_onc_6.csv
Processing predicted file: p233_breast_onc_7.csv
Processing predicted file: p201_prostate_onc_1.csv
Processing predicted file: p107_breast_sur_6.csv
Processing predicted file: p207_prostate_onc_11.csv
Processing predicted file: p102_breast_sur_5.csv
Processing predicted file: p204_prostate_onc_8.csv
Processing predicted file: p236_breast_onc_25.csv
Processing predicted file: p101_breast_onc_11.csv
Processing predicted file: p132_prostate_onc_4.csv
Processing predicted file: p133_prostate_onc_11.csv
Processing predicted file: p106_breast_onc_19.csv
Processing predicted file: p103_breast_onc_7.csv
Processing predicted file: p105_breast_onc_18.csv
Processing predicted file: p135_prostate_onc_15.csv
Processing predicted file: p220_prostate_onc_3.csv
Processing predicted file: p210_prostate_onc_2.csv
Processing predicted file: p131_prostate_onc_6.csv
Processing predicted file: p230_breast_on

In [10]:
## Prediction new reports   
txt_folder_path = "test"  # Folder containing the .txt files
predicted_folder = "output_csv_SVM"  # Folder to save the CSV files

In [11]:

if not os.path.exists(predicted_folder):
    os.makedirs(predicted_folder)

for filename in os.listdir(txt_folder_path):
    if filename.endswith(".txt"):
        txt_path = os.path.join(txt_folder_path, filename)
        with open(txt_path, "r", encoding="utf-8") as file:
            text = file.read()
        
        features, tokens = extract_features_from_text(text)
        predictions = clf.predict(features)

        token_index_mapping = []
        start = 0
        for token in tokens:
            start_index = text.find(token, start)
            end_index = start_index + len(token)
            token_index_mapping.append((token, start_index, end_index))
            start = end_index
        
        merged_data = []
        current_tag = None
        current_start = None
        current_end = None
        current_words = []

        for (token, start_idx, end_idx), tag in zip(token_index_mapping, predictions):
            if tag != "O":
                if tag == current_tag:
                    current_end = end_idx
                    current_words.append(token)
                else:
                    if current_tag is not None:
                        merged_data.append([current_start, current_end, current_tag, " ".join(current_words)])
                    current_tag = tag
                    current_start = start_idx
                    current_end = end_idx
                    current_words = [token]
            else:
                if current_tag is not None:
                    merged_data.append([current_start, current_end, current_tag, " ".join(current_words)])
                    current_tag = None
                    current_words = []

        if current_tag is not None:
            merged_data.append([current_start, current_end, current_tag, " ".join(current_words)])
        
        output_csv_path = os.path.join(predicted_folder, filename.replace(".txt", ".csv"))
        df_predicted = pd.DataFrame(merged_data, columns=["start", "end", "tag", "word"])
        df_predicted.to_csv(output_csv_path, index=False)
        print(f"Saved predictions to: {output_csv_path}")


Saved predictions to: output_csv_SVM/p237_breast_onc_6.csv
Saved predictions to: output_csv_SVM/p218_prostate_onc_2.csv
Saved predictions to: output_csv_SVM/p253_breast_sur_2.csv
Saved predictions to: output_csv_SVM/p219_prostate_onc_10.csv
Saved predictions to: output_csv_SVM/p232_breast_onc_9.csv
Saved predictions to: output_csv_SVM/p214_prostate_onc_8.csv
Saved predictions to: output_csv_SVM/p217_prostate_onc_1.csv
Saved predictions to: output_csv_SVM/p249_breast_sur_10.csv
Saved predictions to: output_csv_SVM/p220_prostate_sur_6.csv
Saved predictions to: output_csv_SVM/p248_breast_onc_13.csv
Saved predictions to: output_csv_SVM/p256_breast_onc_11.csv
Saved predictions to: output_csv_SVM/p238_breast_sur_4.csv
Saved predictions to: output_csv_SVM/p251_breast_onc_82.csv
Saved predictions to: output_csv_SVM/p235_breast_sur_5.csv
Saved predictions to: output_csv_SVM/p228_prostate_onc_5.csv
Saved predictions to: output_csv_SVM/p246_breast_onc_5.csv
Saved predictions to: output_csv_SVM/p2

### <span style="color:darkred; font-weight:bold;"> Entity Extraction with SVM-Linear Evaluation Summary </span>
**Overall Evaluation**

- **Ent_Type**: 
  - **Precision**: 0.83, **Recall**: 0.64, **F1-score**: 0.72
  - Strong precision but lower recall, indicating missed entities or spurious results.
  
- **Partial**: 
  - **Precision**: 0.78, **Recall**: 0.60, **F1-score**: 0.67
  - Decent performance, accounting for partially correct matches.
  
- **Strict**: 
  - **Precision**: 0.70, **Recall**: 0.54, **F1-score**: 0.61
  - Lower scores due to missed entities and stricter matching criteria.

- **Exact**:
  - **Precision**: 0.70, **Recall**: 0.54, **F1-score**: 0.61
  - Similar to Strict, reflecting the challenges of exact matching.
    
**Summary**
- DATE is one of the best-performing entity types, particularly in less strict evaluations, with balanced precision and recall.
- NAME performs poorly across all evaluations, with low precision and recall.
- AGE shows high precision but low recall, indicating many missed extractions.
- LOCATION achieves relatively strong results, particularly in type-based evaluations, with high precision and recall.
- The Strict and Exact evaluations highlight the limitations of the model under stringent criteria, while Ent_Type - - and Partial provide better overall performance.

## <span style="color:darkred; font-weight:bold;">Strategy 2: NER with Regular Expression  </span>

This strategy uses regular expressions (regex) to extract the four types the of entities from a given text. Each entity type is extracted using specific regex patterns designed to handle variations in how each entity can appear in the text. 

### Full Pipeline:
The **`extract_and_classify_entities`** function extracts entities using the regex patterns defined earlier and classifies them into four categories: **NAME**, **DATE**, **LOCATION**, and **AGE**. It prints the results and returns a dictionary containing the start and end positions along with the extracted text for each entity.

1. **Name (TITLE + NAME)**:  
   - **Covers**:  
     This pattern captures names that start with titles such as "Dr.", "Mr.", "Ms.", "Mrs.", or "Prof." followed by one or more parts of a proper name (e.g., "John Doe").
   - **Handling cases**:  
     The title is excluded from the captured name, and it skips any name starting with stop words like "from", "the", etc.

2. **Date**:  
   - **Covers**:  
     Matches various date formats such as:
     - Year (e.g., 2023)
     - Date in `YYYY-MM-DD` or `MM-DD-YYYY` formats
     - Month-year combinations (e.g., May 2022, March/2023)
     - Month, Day, Year combinations (e.g., May 22, 2023)
   - **Handling cases**:  
    Supports different delimiters (like -, /, .) and formats, handling month names in both full and abbreviated forms, excluding the word 'may' but  accepting 'May' only as a valid date

3. **Location**:  
   - **Covers**:  
     This pattern captures locations in various formats, including:
     - Locations with context phrases like "from" or "located in" (e.g., "from Paris")
     - City names (e.g., "New York city")
     - Street addresses (e.g., "5th Avenue")
     - Hospitals (e.g., "General Hospital")
     - Certain known locations like "SQU" or "FAMCO", due to the source of reports 
   - **Handling cases**:  
     - Ignores locations that are month names (e.g., "January", "Feb") to avoid confusion.
     - Removes stop words from the start of location strings (e.g., "from", "the").

4. **Age**:  
   - **Covers**:  
     Matches age-related phrases in various formats like:
     - "I'm 32 years old"
     - "45-year-old"
     - "32 yo"
   - **Handling cases**:  
     - Accepts age values expressed in several common ways (e.g., "32 years old" or "32 yo").

In [4]:
import re

MONTHS = {
    "January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "Decemebr",
    "Jan", "Feb", "Mar", "Apr", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
}

def extract_entities(text):
    """
    Extract entities using regex patterns.
    """
    stop_words = { "from" ,"the", "and", "in", "on", "at", "with", "for", "of", "by", "a", "an", "is", "are", "was", "were", "to", "it", "he", "she", "they", "this", "that"}

    entities = {"NAME": [], "DATE": [], "LOCATION": [], "AGE": []}

   
    name_pattern = r"(Dr\.?|Mr\.?|Ms\.?|Mrs\.?|Prof\.?)\s([A-Z][a-z]+(?:\s[A-Z][a-z]+)*)"

    
    for match in re.finditer(name_pattern, text):
        start, end = match.span(2) 
        name = match.group(2)
        if name.split()[0] not in stop_words:
            entities["NAME"].append((start, end, name))
            
    dates_pattern = r"""
        \b(?:  # Word boundary and non-capturing group
            \d{4}(?=\D|$)  # Standalone year (e.g., 2023), not followed by a digit
            |
            \d{4}[-/.]\d{1,2}[-/.]\d{1,2}  # YYYY-MM-DD, YYYY/MM/DD
            |
            \d{1,2}[-/]\d{1,2}[-/]\d{2,4}  # MM-DD-YYYY, DD/MM/YYYY
            |
            (?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)
            \s?\d{4}  # Month YYYY or MonthYYYY (e.g., December2011, December 2011)
            |
            # Month/YYYY (e.g., November/2023)
            (?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)
            /?\d{4}
            |
            # Month DD, YYYY (e.g., May 22, 2023)
            (?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)
            \s\d{1,2},\s\d{4}
            |
            \b(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|(?:May(?!\b(?!\s\d)))|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)\b
            |
            # DD Month YYYY (e.g., 22 May 2016)
            \d{1,2}\s(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)
            \s\d{4}
        )
    """
    for match in re.finditer(dates_pattern, text, re.VERBOSE| re.IGNORECASE):
    
        start, end = match.span()
        entities["DATE"].append((start, end, match.group(0)))

    
    locations_pattern = r"""
    (?:from|located in|live in|resides in|based in|in|at|to)\s+([A-Z][a-z]+(?:\s[A-Z][a-z]+)*)  # Contextual locations (e.g., "from Paris")
    | \b(?:in|at)\s+([A-Z]{3,}(?:\s[A-Z]{2,})*)                                         # Lowercase locations preceded by "in" or "at"
    | \b(?:In|At)\s+([A-Z]{3,}(?:\s[A-Z]{2,})*)                                 # Locations preceded by "in" or "at" with uppercase locations >= 3 chars (e.g., "At NYC")
    | ([A-Za-z][a-z]*\s(?:[A-Z][a-z]+|[a-z]+)\s(?:Hospital|hospital))  # Locations followed by "Hospital"/"hospital"
    | (\b[A-Za-z][a-z]+(?:\s[A-Za-z][a-z]+)*\s+city\b)                                  # Locations followed by "city" (e.g., "New York city")
    | (\b(?:Street|street|Avenue|avenue|Boulevard|boulevard|Road|road|Lane|lane|Square|square|Park|park|Hill|hill|Circle|circle|Court|court|Drive|drive|Plaza|plaza|Terrace|terrace)\b\s+[A-Za-z][a-z]+(?:\s[A-Za-z][a-z]+)*)  # Street/road/place names
    | (\b[A-Za-z][a-z]+(?:\s[A-Z][a-z]+)*\s+(?:Park|park)\b)                           # Locations followed by "Park"/"park"
    | (\b[A-Z][a-z]+(?:\s[A-Z][a-z]+)*\s+(?:Region|region)\b)                          # Locations followed by "Region"/"region"
    | (\b[A-Z][a-z]+(?:\s[A-Z][a-z]+)*\s+(?:National|national)\b)                      # Locations followed by "National"/"national"
    | (\b[A-Z][a-z]+(?:\s[A-Z][a-z]+)*\s+(?:State|state)\b)                            # Locations followed by "State"/"state"
    | (\b[A-Z][a-z]+(?:\s[A-Z][a-z]+)*\s+(?:Country|country)\b)                        # Locations followed by "Country"/"country"
    | (\b[A-Z][a-z]+(?:\s[A-Z][a-z]+)*\s+(?:Continent|continent)\b)                    # Locations followed by "Continent"/"continent"
    | \b(SQU|FAMCO)\b  # Locations matching "SQU" or "FAMCO"
    """

    for match in re.finditer(locations_pattern, text, re.VERBOSE):
        location = match.group(0)
        
        len_stop = 1
        location_parts = location.split()
        if location_parts[0].lower() in stop_words:
            location = " ".join(location_parts[1:])
            len_stop += len(location_parts[0])

         
        if location.split()[0] in MONTHS:
             continue  # Skip this location
        # Ensure the cleaned location is not a stop word itself
        if location.lower() not in stop_words:
            start, end = match.span()
            entities["LOCATION"].append((start+len_stop, end, location))

    
    ages_pattern = r"\b(\d{1,3})\s?(?:years?\s?old|yrs?\s?old|-year-old|yo)\b"
    
    for match in re.finditer(ages_pattern, text):
        start, end = match.span(1)  # Get the span of only the numeric part (group 1)
        age_value = match.group(1).strip()  # Capture only the numeric value (e.g., '45')
        entities["AGE"].append((start, end, age_value))

    return entities

# Full workflow: Combine regex extraction and POS tagging
def extract_and_classify_entities(text):
    """
    Full pipeline to extract and classify entities.
    """
    
    regex_entities = extract_entities(text)
    print("\nEntities Extracted Using Regex:")
    for entity_type, values in regex_entities.items():
        print(f"{entity_type}: {values}")

    return regex_entities


In [7]:

folder_path = "to test" 

In [11]:
#Evaluation
import os
import pandas as pd
import re
import os
import csv
from nervaluate import Evaluator, summary_report_overall, summary_report_ent

txt_files = {}
csv_files = {}

for filename in os.listdir(folder_path):
    file_path = os.path.join(folder_path, filename)
    if filename.endswith(".txt"):
        with open(file_path, "r", encoding="utf-8") as file:
            txt_files[filename] = file.read()  
    elif filename.endswith(".csv"):
        data = pd.read_csv(file_path)  
        csv_files[filename] = data  


predicted_entities = []
for d in txt_files.keys():
    txt = txt_files[d]
    ner_entities = extract_entities(txt)
    
    doc_entities = []
    for tag in ner_entities:
        for entity in ner_entities[tag]:
            doc_entities.append({"label": tag, "start": entity[0], "end": entity[1]})
    predicted_entities.append(doc_entities)


true_entities = []
for d in txt_files.keys():
    csv_filename = d.replace('.txt', '.csv')  
    if csv_filename in csv_files:
        data = csv_files[csv_filename]
        doc_entities = []
        for _, row in data.iterrows():
            doc_entities.append({"label": row['tag'], "start": int(row['start']), "end": int(row['end'])})
        true_entities.append(doc_entities)


from nervaluate import Evaluator
from nervaluate import summary_report_overall, summary_report_ent

evaluator = Evaluator(true_entities, predicted_entities, tags=['LOCATION', 'NAME', 'DATE', 'AGE'])

results, results_per_tag, result_indices, result_indices_by_tag = evaluator.evaluate()

print("\n\nOverall")
print(summary_report_overall(results))
print("\n\n'Strict'")
print(summary_report_ent(results_per_tag, scenario="strict"))
print("\n\n'Ent_Type'")
print(summary_report_ent(results_per_tag, scenario="ent_type"))
print("\n\n'Partial'")
print(summary_report_ent(results_per_tag, scenario="partial"))
print("\n\n'Exact'")
print(summary_report_ent(results_per_tag, scenario="exact"))




Overall
              correct   incorrect     partial      missed    spurious   precision      recall    f1-score

ent_type          168           0           0          25          21        0.89        0.87        0.88
 partial          143           0          25          25          21        0.82        0.81        0.81
  strict          143          25           0          25          21        0.76        0.74        0.75
   exact          143          25           0          25          21        0.76        0.74        0.75



'Strict'
              correct   incorrect     partial      missed    spurious   precision      recall    f1-score

     AGE           14           0           0           3           9        0.61        0.82        0.70
    DATE           94           5           0           6           5        0.90        0.90        0.90
LOCATION           29          10           0          11           7        0.63        0.58        0.60
    NAME            6 

In [13]:
input_folder = "test"  
output_folder = "output_csv"  

In [14]:
import os
import csv


def save_entities_to_csv(entities, output_csv_path):
    """
    Save the extracted entities to a CSV file in the specified format.
    """
    # Prepare the data to be written to the CSV file
    rows = []
    for entity_type, values in entities.items():
        for start, end, entity in values:
            rows.append([start, end, entity_type, entity])

    # Write to CSV with the specified column names
    with open(output_csv_path, mode="w", newline="", encoding="utf-8") as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(["start", "end", "tag", "text"])  # CSV header
        writer.writerows(rows)

def process_text_files(input_folder, output_folder):
    """
    Process each .txt file in the folder, extract entities and save to corresponding CSV file.
    """
    
    if not os.path.exists(output_folder):
        os.makedirs(output_folder)

    
    for filename in os.listdir(input_folder):
        if filename.endswith(".txt"):  
            txt_file_path = os.path.join(input_folder, filename)
            print(f"Processing file: {filename}")

            
            with open(txt_file_path, "r", encoding="utf-8") as file:
                text = file.read()

           
            entities = extract_entities(text)

            
            output_csv_path = os.path.join(output_folder, f"{os.path.splitext(filename)[0]}.csv")

            
            save_entities_to_csv(entities, output_csv_path)
            print(f"Saved extracted entities to: {output_csv_path}")

# Process the text files
process_text_files(input_folder, output_folder)

Processing file: p202_prostate_onc_1.txt
Saved extracted entities to: output_csv\p202_prostate_onc_1.csv
Processing file: p203_prostate_onc_10.txt
Saved extracted entities to: output_csv\p203_prostate_onc_10.csv
Processing file: p204_prostate_onc_23.txt
Saved extracted entities to: output_csv\p204_prostate_onc_23.csv
Processing file: p205_prostate_onc_5.txt
Saved extracted entities to: output_csv\p205_prostate_onc_5.csv
Processing file: p206_prostate_onc_4.txt
Saved extracted entities to: output_csv\p206_prostate_onc_4.csv
Processing file: p208_prostate_onc_4.txt
Saved extracted entities to: output_csv\p208_prostate_onc_4.csv
Processing file: p209_prostate_onc_3.txt
Saved extracted entities to: output_csv\p209_prostate_onc_3.csv
Processing file: p210_prostate_onc_7.txt
Saved extracted entities to: output_csv\p210_prostate_onc_7.csv
Processing file: p211_prostate_onc_5.txt
Saved extracted entities to: output_csv\p211_prostate_onc_5.csv
Processing file: p212_prostate_onc_4.txt
Saved extr

### <span style="color:darkred; font-weight:bold;"> Entity Extraction with Regular Expression Evaluation Summary </span>
**Overall Evaluation**

- **Ent_Type**: 
  - **Precision**: 0.89, **Recall**: 0.87, **F1-score**: 0.88
  - Strong performance with most entities correctly identified.
  
- **Partial**: 
  - **Precision**: 0.82, **Recall**: 0.81, **F1-score**: 0.81
  - Decent performance with some missed entities and spurious results.
  
- **Strict**: 
  - **Precision**: 0.76, **Recall**: 0.74, **F1-score**: 0.75
  - Lower performance due to missed entities and spurious results.

- **Exact**:
  - **Precision**: 0.76, **Recall**: 0.74, **F1-score**: 0.75
  - Similar to **Strict**; performance drops due to stricter evaluation.
    
**Summary**
- **DATE** is the best-performing entity type across all scenarios, with very high precision and recall.
- **NAME** consistently performs poorly, with low precision and recall.
- **AGE** and **LOCATION** show moderate performance with relatively high recall but lower precision.
- The **Strict** and **Exact** scenarios yield the lowest performance due to stricter matching criteria, while **Ent_Type** and **Partial** offer better flexibility and higher F1-scores.