
# Named Entity Recognition (NER)

## Dataset and Problem Selection

The dataset used in this project is sourced from Kaggle (https://www.kaggle.com/datasets/naseralqaydeh/named-entity-recognition-ner-corpus/data?select=ner.csv) and is derived from the Groningen Meaning Bank (GMB) corpus. This dataset is structured to facilitate the training and evaluation of Named Entity Recognition (NER) models. It consists of tokenized text sequences, where each token is labeled with a corresponding entity class.

The primary objective of this project is to develop and assess NER models capable of accurately identifying and classifying named entities within textual data. Given its structured annotations, the dataset is well-suited for:

*   Supervised learning approaches, including deep learning models such as Transformers (e.g., BERT) and recurrent architectures (e.g., BiLSTM-CRF), as well as classical machine learning methods like Conditional Random Fields (CRFs).

*   Benchmarking different NER techniques based on evaluation metrics such as precision, recall, and F1-score.

*   Enhancing practical applications in text processing, information retrieval, and automated knowledge extraction.
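
For NER, precision, recall, and F1-score are usually reported at the entity level, where a prediction counts only if both the span boundaries and the entity type match. The following is a minimal sketch of that idea for BIO-tagged sequences, not the evaluation code used in this project. The helper names `extract_spans` and `precision_recall_f1` are hypothetical, and the sketch assumes a strict reading in which an `I-` tag that does not continue a span of the same type is ignored.

```python
from typing import List, Set, Tuple

# (entity type, start index, end index exclusive)
Span = Tuple[str, int, int]


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one BIO-tagged sentence (hypothetical helper)."""
    spans: Set[Span] = set()
    start, ent_type = None, None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and ent_type == tag[2:]
        if continues:
            continue
        if ent_type is not None:
            spans.add((ent_type, start, i))
        if tag.startswith("B-"):
            start, ent_type = i, tag[2:]
        else:
            # "O" or a stray "I-" tag: no span is open (assumed strict reading).
            start, ent_type = None, None
    return spans


def precision_recall_f1(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Entity-level scores for a single sentence; exact span and type match."""
    gold_spans, pred_spans = extract_spans(gold), extract_spans(pred)
    true_pos = len(gold_spans & pred_spans)
    precision = true_pos / len(pred_spans) if pred_spans else 0.0
    recall = true_pos / len(gold_spans) if gold_spans else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Illustrative tags, not taken from the dataset.
gold_tags = ["O", "B-geo", "I-geo", "O", "B-gpe"]
pred_tags = ["O", "B-geo", "I-geo", "O", "O"]
p, r, f1 = precision_recall_f1(gold_tags, pred_tags)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

In this toy case the prediction finds one of the two gold entities and proposes nothing spurious, so precision is perfect while recall is halved. Scoring a whole corpus would need the span sets accumulated across sentences before dividing.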

In [1]:
import os
# Get the current working directory
current_path = os.getcwd()
print(f"Current working directory: {current_path}")


Current working directory: /content


In [2]:
# Clone the GitHub Repository
!git clone https://github.com/Daniel-Rossi-16/NLP_3_NER.git

fatal: destination path 'NLP_3_NER' already exists and is not an empty directory.


In [3]:
# Navigate to the Cloned Repository
%cd NLP_3_NER

/content/NLP_3_NER


In [4]:
# List Files
!ls

LICENSE  ner.csv  README.md  requirements.txt


In [5]:
# Load the Dataset into a DataFrame
import pandas as pd

# Load the dataset from the cloned repository
df = pd.read_csv("ner.csv")

# Display the first few rows
df.head()


Unnamed: 0,Sentence #,Sentence,POS,Tag
0,Sentence: 1,Thousands of demonstrators have marched throug...,"['NNS', 'IN', 'NNS', 'VBP', 'VBN', 'IN', 'NNP'...","['O', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', '..."
1,Sentence: 2,Families of soldiers killed in the conflict jo...,"['NNS', 'IN', 'NNS', 'VBN', 'IN', 'DT', 'NN', ...","['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', ..."
2,Sentence: 3,They marched from the Houses of Parliament to ...,"['PRP', 'VBD', 'IN', 'DT', 'NNS', 'IN', 'NN', ...","['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', ..."
3,Sentence: 4,"Police put the number of marchers at 10,000 wh...","['NNS', 'VBD', 'DT', 'NN', 'IN', 'NNS', 'IN', ...","['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', ..."
4,Sentence: 5,The protest comes on the eve of the annual con...,"['DT', 'NN', 'VBZ', 'IN', 'DT', 'NN', 'IN', 'D...","['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', ..."


## Exploratory Data Analysis (EDA)

## Data Preprocessing


In [6]:
import ast

# Convert POS and Tag columns from string to list
df["POS"] = df["POS"].apply(ast.literal_eval)
df["Tag"] = df["Tag"].apply(ast.literal_eval)

# Check the format
df.head()


Unnamed: 0,Sentence #,Sentence,POS,Tag
0,Sentence: 1,Thousands of demonstrators have marched throug...,"[NNS, IN, NNS, VBP, VBN, IN, NNP, TO, VB, DT, ...","[O, O, O, O, O, O, B-geo, O, O, O, O, O, B-geo..."
1,Sentence: 2,Families of soldiers killed in the conflict jo...,"[NNS, IN, NNS, VBN, IN, DT, NN, VBD, DT, NNS, ...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
2,Sentence: 3,They marched from the Houses of Parliament to ...,"[PRP, VBD, IN, DT, NNS, IN, NN, TO, DT, NN, IN...","[O, O, O, O, O, O, O, O, O, O, O, B-geo, I-geo..."
3,Sentence: 4,"Police put the number of marchers at 10,000 wh...","[NNS, VBD, DT, NN, IN, NNS, IN, CD, IN, NNS, V...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O]"
4,Sentence: 5,The protest comes on the eve of the annual con...,"[DT, NN, VBZ, IN, DT, NN, IN, DT, JJ, NN, IN, ...","[O, O, O, O, O, O, O, O, O, O, O, B-geo, O, O,..."
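
Once the `Tag` column holds real lists, a label frequency count is a cheap first check, since NER corpora tend to be dominated by the `O` class. The sketch below uses only the standard library. `tag_lists` is a small stand-in for something like `df["Tag"].tolist()`, filled with the tags of sentences 1 and 4 as they appear in this notebook's outputs, so the numbers say nothing about the full dataset.

```python
from collections import Counter
from itertools import chain

# Stand-in for df["Tag"].tolist(): tags of sentences 1 and 4 only.
tag_lists = [
    ["O"] * 6 + ["B-geo"] + ["O"] * 5 + ["B-geo"] + ["O"] * 5 + ["B-gpe"] + ["O"] * 5,
    ["O"] * 15,
]

# Flatten the per-sentence lists and count each label.
tag_counts = Counter(chain.from_iterable(tag_lists))
total = sum(tag_counts.values())

for tag, count in tag_counts.most_common():
    print(f"{tag:<6} {count:>3}  {count / total:.1%}")
```

A heavy skew toward `O` is one reason entity-level F1 is usually preferred over token accuracy when comparing models.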


In [7]:
# Build one list of (token, POS, Tag) tuples per sentence as input for NER
data = []
for text, pos_tags, ner_tags in zip(df["Sentence"], df["POS"], df["Tag"]):
    # Whitespace tokenization; assumes it yields exactly one token per tag
    tokens = text.split()

    # zip() stops at the shortest input, so any length mismatch is truncated silently
    data.append(list(zip(tokens, pos_tags, ner_tags)))

# Print an example
for word in data[0]:
    print(word)


('Thousands', 'NNS', 'O')
('of', 'IN', 'O')
('demonstrators', 'NNS', 'O')
('have', 'VBP', 'O')
('marched', 'VBN', 'O')
('through', 'IN', 'O')
('London', 'NNP', 'B-geo')
('to', 'TO', 'O')
('protest', 'VB', 'O')
('the', 'DT', 'O')
('war', 'NN', 'O')
('in', 'IN', 'O')
('Iraq', 'NNP', 'B-geo')
('and', 'CC', 'O')
('demand', 'VB', 'O')
('the', 'DT', 'O')
('withdrawal', 'NN', 'O')
('of', 'IN', 'O')
('British', 'JJ', 'B-gpe')
('troops', 'NNS', 'O')
('from', 'IN', 'O')
('that', 'DT', 'O')
('country', 'NN', 'O')
('.', '.', 'O')
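
Because `zip` stops at the shortest input, a sentence whose whitespace split does not produce exactly one token per tag would be truncated without any warning. The following is a rough sketch of a stricter variant that keeps only rows whose lengths agree and records the rest for inspection. The function name `build_examples` and the sample rows are hypothetical, and whether any row in `ner.csv` is actually misaligned has not been checked here.

```python
from typing import Iterable, List, Sequence, Tuple

Row = Tuple[str, Sequence[str], Sequence[str]]  # (sentence, POS tags, NER tags)


def build_examples(rows: Iterable[Row]):
    """Return (aligned examples, indices of rows skipped for length mismatch)."""
    aligned: List[List[Tuple[str, str, str]]] = []
    skipped: List[int] = []
    for idx, (sentence, pos_tags, ner_tags) in enumerate(rows):
        tokens = sentence.split()  # same whitespace tokenization as above
        if len(tokens) == len(pos_tags) == len(ner_tags):
            aligned.append(list(zip(tokens, pos_tags, ner_tags)))
        else:
            skipped.append(idx)
    return aligned, skipped


# Made-up rows for illustration; the second one is deliberately misaligned.
sample_rows = [
    ("London is big .", ["NNP", "VBZ", "JJ", "."], ["B-geo", "O", "O", "O"]),
    ("Too few tags", ["NN"], ["O"]),
]
examples, bad_rows = build_examples(sample_rows)
print(f"kept {len(examples)} sentence(s), skipped rows: {bad_rows}")
```

With the notebook's DataFrame, the rows could be supplied as `zip(df["Sentence"], df["POS"], df["Tag"])`.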


## Train a Model

## Evaluation and Conclusion
