**In this assignment, you'll practice the concepts and skills covered in Modules 1, 2, and 3. The main objective is to implement and use tools, algorithms, and techniques to represent and clean textual data and to extract named entities.**



**Guidelines**
* Download the `NER.csv` file from D2L.
* Run all the code cells; otherwise, you may get errors such as `NameError` for undefined variables.
* Do not change variable names, delete cells, or disturb other existing code, as this may cause problems during evaluation.
* In some cases, you may need to add some code cells or new statements before or after the line of code containing the `???`.
* Use markdown cells to write your discussions and reflections. 

**Procedure**
* Save your work as an `IPYNB` file named `Lab3.ipynb` and submit it to the D2L dropbox `Lab 3 – Named Entity Recognition (Dropbox)` by the due date.
* As you go through this notebook, you will find the symbol `???` in certain places. To complete this assignment, you must replace all the `???` with appropriate values, expressions or statements to ensure that the notebook runs properly end-to-end.
* Include your responses for `Part 1` and `Part 2` in this notebook.

<div class="alert alert-block alert-info">

# Part 1: Activity 

</div>

# Question 1: Reading the dataset 
<hr style="border:1px solid orange"> </hr>

#### Read the content of the `NER.csv` into a dataframe `ner_df` and perform the following: 

> **Q1.1.** Perform Part-of-Speech tagging on the `Sentence` column in the `ner_df`. Then, add the results as a new column called `pos`.

> **Q1.2.** Implement a function called `np_chunker` that receives each sentence from the `ner_df` dataframe. Then, it defines and applies a chunk parser to chunk all noun phrases from these sentences.

> **Q1.3.** Perform Named Entity Recognition (NER) on the `Sentence` column and add the extracted entities as a new column called `entities`.

In [6]:
import pandas as pd 
import numpy as np 
import nltk
from pprint import pprint
from nltk import pos_tag
from nltk.tokenize import word_tokenize


In [7]:
ner_df = pd.read_csv("NER.csv")  # file name as given in the guidelines (matters on case-sensitive file systems)

In [8]:
ner_df

Unnamed: 0,Sentence #,Sentence
0,Sentence: 1,Thousands of demonstrators have marched throug...
1,Sentence: 2,Families of soldiers killed in the conflict jo...
2,Sentence: 3,They marched from the Houses of Parliament to ...
3,Sentence: 4,"Police put the number of marchers at 10,000 wh..."
4,Sentence: 5,The protest comes on the eve of the annual con...
...,...,...
47954,Sentence: 47955,Indian border security forces are accusing the...
47955,Sentence: 47956,Indian officials said no one was injured in Sa...
47956,Sentence: 47957,Two more landed in fields belonging to a nearb...
47957,Sentence: 47958,They say not all of the rockets exploded upon ...


In [9]:
# Q1.1. Perform Part-of-Speech tagging on the Sentence column in the ner_df. Then, add the results as a new column called pos.

In [10]:
#ner_df['Sentence'] = ner_df['Sentence'].str.lower()

In [11]:
# Function to perform POS tagging
def pos_tagger(sentence):
    token = word_tokenize(sentence)
    return pos_tag(token)

In [12]:
## ner_df["Sentence"] = ner_df["Sentence"].apply(nltk.word_tokenize)

In [13]:
ner_df["pos"] = ner_df["Sentence"].apply(pos_tagger)

In [14]:
ner_df

Unnamed: 0,Sentence #,Sentence,pos
0,Sentence: 1,Thousands of demonstrators have marched throug...,"[(Thousands, NNS), (of, IN), (demonstrators, N..."
1,Sentence: 2,Families of soldiers killed in the conflict jo...,"[(Families, NNS), (of, IN), (soldiers, NNS), (..."
2,Sentence: 3,They marched from the Houses of Parliament to ...,"[(They, PRP), (marched, VBD), (from, IN), (the..."
3,Sentence: 4,"Police put the number of marchers at 10,000 wh...","[(Police, NNP), (put, VBD), (the, DT), (number..."
4,Sentence: 5,The protest comes on the eve of the annual con...,"[(The, DT), (protest, NN), (comes, VBZ), (on, ..."
...,...,...,...
47954,Sentence: 47955,Indian border security forces are accusing the...,"[(Indian, JJ), (border, NN), (security, NN), (..."
47955,Sentence: 47956,Indian officials said no one was injured in Sa...,"[(Indian, JJ), (officials, NNS), (said, VBD), ..."
47956,Sentence: 47957,Two more landed in fields belonging to a nearb...,"[(Two, CD), (more, JJR), (landed, VBN), (in, I..."
47957,Sentence: 47958,They say not all of the rockets exploded upon ...,"[(They, PRP), (say, VBP), (not, RB), (all, DT)..."


In [15]:
# Q1.2. Implement a function called np_chunker that receives each sentence from the ner_df dataframe. Then, it defines and applies a chunk parser to chunk all noun phrases from these sentences.

In [16]:
grammar = r"NP: {<DT>?<JJ.*>*<NN.*>+}"

def np_chunker(sentence):
    """Return the noun phrases found in a single raw sentence string."""
    pos_tags = pos_tagger(sentence)  # tokenize and POS-tag the sentence
    chunk_parser = nltk.RegexpParser(grammar)
    result = chunk_parser.parse(pos_tags)
    noun_phrases = []
    for subtree in result.subtrees():
        if subtree.label() == 'NP':  # Only consider subtrees labeled as NP
            noun_phrases.append(' '.join(word for word, tag in subtree.leaves()))
    return noun_phrases
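
To see what the grammar matches without running the tagger, the sketch below applies the same `NP` pattern to a hand-tagged sentence. The sentence and its tags are made up for illustration and are not taken from `NER.csv`. Only `nltk.RegexpParser` is used, so no NLTK data download should be needed.

```python
import nltk

# Same noun-phrase pattern as above: optional determiner, any adjectives, one or more nouns
grammar = r"NP: {<DT>?<JJ.*>*<NN.*>+}"

# Hand-tagged example sentence (illustrative, not from NER.csv)
tagged = [("The", "DT"), ("annual", "JJ"), ("conference", "NN"),
          ("opens", "VBZ"), ("in", "IN"), ("London", "NNP")]

tree = nltk.RegexpParser(grammar).parse(tagged)

# Collect the text of every subtree labeled NP
phrases = [" ".join(word for word, tag in st.leaves())
           for st in tree.subtrees() if st.label() == "NP"]
print(phrases)
```

The verb and the preposition break the chunk, so the two noun phrases `The annual conference` and `London` come out separately.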

In [17]:
# Apply np_chunker to each raw sentence in ner_df (it tokenizes and tags internally)
ner_df['noun_phrases'] = ner_df['Sentence'].apply(np_chunker)

In [18]:
# Display the updated dataframe
ner_df

Unnamed: 0,Sentence #,Sentence,pos,noun_phrases
0,Sentence: 1,Thousands of demonstrators have marched throug...,"[(Thousands, NNS), (of, IN), (demonstrators, N...","[Thousands, demonstrators, London, the war, Ir..."
1,Sentence: 2,Families of soldiers killed in the conflict jo...,"[(Families, NNS), (of, IN), (soldiers, NNS), (...","[Families, soldiers, the conflict, the protest..."
2,Sentence: 3,They marched from the Houses of Parliament to ...,"[(They, PRP), (marched, VBD), (from, IN), (the...","[the Houses, Parliament, a rally, Hyde Park]"
3,Sentence: 4,"Police put the number of marchers at 10,000 wh...","[(Police, NNP), (put, VBD), (the, DT), (number...","[Police, the number, marchers, organizers]"
4,Sentence: 5,The protest comes on the eve of the annual con...,"[(The, DT), (protest, NN), (comes, VBZ), (on, ...","[The protest, the eve, the annual conference, ..."
...,...,...,...,...
47954,Sentence: 47955,Indian border security forces are accusing the...,"[(Indian, JJ), (border, NN), (security, NN), (...","[Indian border security forces, Pakistani coun..."
47955,Sentence: 47956,Indian officials said no one was injured in Sa...,"[(Indian, JJ), (officials, NNS), (said, VBD), ...","[Indian officials, no one, Saturday, incident,..."
47956,Sentence: 47957,Two more landed in fields belonging to a nearb...,"[(Two, CD), (more, JJR), (landed, VBN), (in, I...","[fields, a nearby village]"
47957,Sentence: 47958,They say not all of the rockets exploded upon ...,"[(They, PRP), (say, VBP), (not, RB), (all, DT)...","[the rockets, impact]"


In [None]:
# Q1.3. Perform Named Entity Recognition (NER) on the Sentence column and add the extracted entities as a new column called entities
# Ensure necessary NLTK resources are downloaded
nltk.download("maxent_ne_chunker")
nltk.download("words")

In [32]:
def ner_extractor(sentence):
    pos_tags = pos_tagger(sentence)
    chunk_ner = nltk.ne_chunk(pos_tags)
    entities = []
    for chunk in chunk_ner:
        if isinstance(chunk, nltk.Tree):  
            entity = " ".join([token for token, pos in chunk.leaves()])
            entity_label = chunk.label() 
            entities.append((entity, entity_label))
    return entities
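
The loop in `ner_extractor` relies on the shape of the tree that `nltk.ne_chunk` returns: plain `(token, tag)` tuples for ordinary words and nested `Tree` nodes for entities. The sketch below walks a hand-built tree of that shape. The tree is a stand-in written for illustration, not real model output, and the names `chunked` and `child` are assumptions of this example.

```python
from nltk import Tree

# Hand-built stand-in for the output of nltk.ne_chunk (illustrative only)
chunked = Tree("S", [
    ("Delegates", "NNS"), ("from", "IN"),
    Tree("GPE", [("Britain", "NNP")]),
    ("joined", "VBD"), ("the", "DT"),
    Tree("ORGANIZATION", [("Labor", "NNP"), ("Party", "NNP")]),
])

entities = []
for child in chunked:
    # Ordinary words are tuples; only named entities are Tree nodes
    if isinstance(child, Tree):
        text = " ".join(token for token, pos in child.leaves())
        entities.append((text, child.label()))
print(entities)
```

Multi-word entities such as `Labor Party` are joined into a single string because all leaves of the entity subtree are concatenated.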

In [36]:
# Apply the ner_extractor function to every row of the 'Sentence' column
ner_df['entities'] = ner_df['Sentence'].apply(ner_extractor)

In [38]:
ner_df

Unnamed: 0,Sentence #,Sentence,pos,noun_phrases,entities
0,Sentence: 1,Thousands of demonstrators have marched throug...,"[(Thousands, NNS), (of, IN), (demonstrators, N...","[Thousands, demonstrators, London, the war, Ir...","[(London, GPE), (Iraq, GPE), (British, GPE)]"
1,Sentence: 2,Families of soldiers killed in the conflict jo...,"[(Families, NNS), (of, IN), (soldiers, NNS), (...","[Families, soldiers, the conflict, the protest...","[(Bush Number One Terrorist, PERSON), (Bombing..."
2,Sentence: 3,They marched from the Houses of Parliament to ...,"[(They, PRP), (marched, VBD), (from, IN), (the...","[the Houses, Parliament, a rally, Hyde Park]","[(Houses, ORGANIZATION), (Parliament, ORGANIZA..."
3,Sentence: 4,"Police put the number of marchers at 10,000 wh...","[(Police, NNP), (put, VBD), (the, DT), (number...","[Police, the number, marchers, organizers]",[]
4,Sentence: 5,The protest comes on the eve of the annual con...,"[(The, DT), (protest, NN), (comes, VBZ), (on, ...","[The protest, the eve, the annual conference, ...","[(Britain, GPE), (Labor Party, ORGANIZATION), ..."
...,...,...,...,...,...
47954,Sentence: 47955,Indian border security forces are accusing the...,"[(Indian, JJ), (border, NN), (security, NN), (...","[Indian border security forces, Pakistani coun...","[(Indian, GPE)]"
47955,Sentence: 47956,Indian officials said no one was injured in Sa...,"[(Indian, JJ), (officials, NNS), (said, VBD), ...","[Indian officials, no one, Saturday, incident,...","[(Indian, GPE)]"
47956,Sentence: 47957,Two more landed in fields belonging to a nearb...,"[(Two, CD), (more, JJR), (landed, VBN), (in, I...","[fields, a nearby village]",[]
47957,Sentence: 47958,They say not all of the rockets exploded upon ...,"[(They, PRP), (say, VBP), (not, RB), (all, DT)...","[the rockets, impact]",[]
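
One way to summarize the `entities` column is to count how often each label occurs and how many sentences yield no entity at all. The sketch below does this in plain Python on three entity lists copied from the output above (rows 0, 3, and 47954). The names `rows`, `label_counts`, and `empty_rows` are assumptions of this example, and the same logic could be applied to `ner_df['entities']`.

```python
from collections import Counter

# Entity lists in the same (text, label) format as the entities column
rows = [
    [("London", "GPE"), ("Iraq", "GPE"), ("British", "GPE")],  # row 0
    [],                                                        # row 3
    [("Indian", "GPE")],                                       # row 47954
]

# Count entity labels across all rows
label_counts = Counter(label for row in rows for _, label in row)

# Count sentences for which no entity was extracted
empty_rows = sum(1 for row in rows if not row)

print(label_counts, empty_rows)
```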


<div class="alert alert-block alert-info">

# Part 2: Reflection
    
</div>

As a second step, after answering the questions, include the following:
1. A reflection on your experience performing the activity. 
2. A reflection on the importance of learning this activity.


**Note:** Include your reflection in this notebook as markdown cells. 

### 1. A reflection on your experience performing the activity.

The activity of implementing noun phrase chunking using Natural Language Processing (NLP) techniques was a valuable learning experience for me. By working through tokenization, part-of-speech tagging, and chunking, I gained a deeper understanding of how sentences are structured linguistically and how these structures can be processed programmatically.

### 2. A reflection on the importance of learning this activity.

This exercise emphasized the foundational importance of basic NLP tasks for more advanced applications, such as information extraction, text summarization, and question-answering systems. By learning how to extract linguistic features like noun phrases, we set the stage for deeper natural language understanding, a core element in modern AI systems. 