
**Submitted By**

**Name: Anoushka Mergoju**

**SUID: 328542442**

**1. Implementing HMM and Viterbi Algorithm**

In [74]:
import ast
from collections import defaultdict

class HiddenMarkovModel:
    def __init__(self, smoothing=0.01):
        self.transition_probs = defaultdict(lambda: defaultdict(lambda: smoothing))
        self.emission_probs = defaultdict(lambda: defaultdict(lambda: smoothing))
        self.initial_probs = defaultdict(lambda: smoothing)  # Ensure smoothing in initial probabilities
        self.states = set()
        self.vocabulary = set()
        self.smoothing = smoothing

    def train(self, data):
        transition_counts = defaultdict(lambda: defaultdict(int))
        emission_counts = defaultdict(lambda: defaultdict(int))
        initial_counts = defaultdict(int)
        state_counts = defaultdict(int)

        for sentence in data:
            previous_state = None
            for word, tag in sentence:
                self.states.add(tag)
                self.vocabulary.add(word)
                emission_counts[tag][word] += 1
                state_counts[tag] += 1

                if previous_state is None:
                    initial_counts[tag] += 1
                else:
                    transition_counts[previous_state][tag] += 1
                previous_state = tag

        total_initials = sum(initial_counts.values())
        for state in self.states:
            self.initial_probs[state] = (initial_counts[state] + self.smoothing) / (total_initials + self.smoothing * len(self.states))

        for prev_state in self.states:
            total_transitions = sum(transition_counts[prev_state].values())
            for state in self.states:
                self.transition_probs[prev_state][state] = (transition_counts[prev_state][state] + self.smoothing) / (total_transitions + self.smoothing * len(self.states))

        for state in self.states:
            total_emissions = sum(emission_counts[state].values())
            for word in self.vocabulary:
                self.emission_probs[state][word] = (emission_counts[state].get(word, 0) + self.smoothing) / (total_emissions + self.smoothing * len(self.vocabulary))

    def viterbi(self, sequence):
        if not sequence:
            return []
        V = [{}]
        path = {}

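        # Note: .get(word, 0) bypasses the defaultdict smoothing default, so a word
        # never seen in training scores 0 for every state.
        for state in self.states:
            V[0][state] = self.initial_probs[state] * self.emission_probs[state].get(sequence[0], 0)
            path[state] = [state]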

        for t in range(1, len(sequence)):
            new_path = {}
            V.append({})
            for curr_state in self.states:
                (max_prob, max_state) = max(
                    ((V[t-1][prev_state] * self.transition_probs[prev_state][curr_state] * self.emission_probs[curr_state].get(sequence[t], 0), prev_state)
                     for prev_state in self.states),
                    key=lambda item: item[0]
                )
                V[t][curr_state] = max_prob
                new_path[curr_state] = path[max_state] + [curr_state]

            path = new_path

        (max_prob, max_state) = max(((V[len(sequence) - 1][state], state) for state in self.states), key=lambda item: item[0])
        return path[max_state]



def load_data(filepath):
    with open(filepath, "r") as file:
        content = file.read()

    training_data = ast.literal_eval(content.split('training_data =')[1].split('# test_data')[0].strip())
    test_data = ast.literal_eval(content.split('test_data =')[1].strip())
    return training_data, test_data

# Load and prepare the data
file_path = '/content/A3-3-Data.txt'
train_data, test_data = load_data(file_path)

# Create and train the model
hmm = HiddenMarkovModel(smoothing=0.01)
hmm.train(train_data)

# Run predictions
predictions = [hmm.viterbi(sentence) for sentence in test_data]

# Display results
for sentence, predicted_tags in zip(test_data, predictions):
    print("Sentence:", ' '.join(sentence))
    print("Predicted Tags:", predicted_tags)
    print("\n")  # Adds a newline for better readability between different sentences




Sentence: Bill Gates founded Microsoft
Predicted Tags: ['PERSON', 'PERSON', 'PERSON', 'PERSON']


Sentence: The Louvre Museum is in Paris
Predicted Tags: ['O', 'PERSON', 'PERSON', 'PERSON', 'PERSON', 'PERSON']


Sentence: Mount Fuji is a famous landmark in Japan
Predicted Tags: ['PERSON', 'PERSON', 'PERSON', 'PERSON', 'PERSON', 'PERSON', 'PERSON', 'PERSON']


Sentence: The United Nations was formed in 1945
Predicted Tags: ['O', 'PERSON', 'PERSON', 'PERSON', 'PERSON', 'PERSON', 'PERSON']


Sentence: Shakira performed at the Super Bowl halftime show
Predicted Tags: ['PERSON', 'PERSON', 'PERSON', 'PERSON', 'PERSON', 'PERSON', 'PERSON', 'PERSON']


Sentence: The Nobel Peace Prize was awarded to Malala Yousafzai
Predicted Tags: ['O', 'PERSON', 'PERSON', 'PERSON', 'PERSON', 'PERSON', 'PERSON', 'PERSON', 'PERSON']


Sentence: The Amazon River flows through Brazil
Predicted Tags: ['O', 'PERSON', 'PERSON', 'PERSON', 'PERSON', 'PERSON']


Sentence: The Pyramids of Giza are in Egypt
Predicted Tag
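
The `viterbi` method above reads emissions with `.get(word, 0)`, which skips the `defaultdict` smoothing default. A word absent from the training vocabulary therefore scores 0 in every state, all later paths tie at 0, and `max` returns whichever state the set happens to iterate first. This may be what produces the long runs of `PERSON`, although that cannot be confirmed without the training data. Below is a minimal sketch of one common workaround: decoding in log space with a small floor probability for unseen words. The names `viterbi_log` and `unk_prob`, the floor value, and the toy tables are assumptions made for illustration, not part of the original assignment.

```python
import math

def viterbi_log(sequence, states, initial, transition, emission, unk_prob=1e-6):
    """Viterbi decoding in log space with a floor probability for unseen words.

    initial[s], transition[p][s] and emission[s][w] are plain probabilities.
    unk_prob is an assumed floor used when a word is missing for a state.
    """
    if not sequence:
        return []
    states = sorted(states)  # fixed order makes tie-breaking deterministic

    def log_emit(state, word):
        return math.log(emission[state].get(word, unk_prob))

    # best[t][s] = best log score of any path ending in state s at position t
    best = [{s: math.log(initial[s]) + log_emit(s, sequence[0]) for s in states}]
    back = []  # back[t-1][s] = predecessor of s on that best path

    for word in sequence[1:]:
        scores, pointers = {}, {}
        for curr in states:
            prev = max(states, key=lambda p: best[-1][p] + math.log(transition[p][curr]))
            scores[curr] = best[-1][prev] + math.log(transition[prev][curr]) + log_emit(curr, word)
            pointers[curr] = prev
        best.append(scores)
        back.append(pointers)

    # Follow the back-pointers from the best final state
    state = max(states, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Toy tables (hypothetical values, not estimated from A3-3-Data.txt)
toy_states = {"O", "PERSON"}
toy_initial = {"O": 0.6, "PERSON": 0.4}
toy_transition = {
    "O": {"O": 0.8, "PERSON": 0.2},
    "PERSON": {"O": 0.5, "PERSON": 0.5},
}
toy_emission = {
    "O": {"founded": 0.5, "the": 0.5},
    "PERSON": {"Bill": 0.5, "Gates": 0.5},
}

# "Zorgcorp" is deliberately missing from every emission table
toy_tags = viterbi_log(["Bill", "Gates", "founded", "Zorgcorp"],
                       toy_states, toy_initial, toy_transition, toy_emission)
print(toy_tags)
```

Because an unseen word receives the same floor in every state, the transition probabilities decide its tag instead of set iteration order. Working with sums of logs also avoids underflow on long sentences.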

**2. Write a Python program that uses the NLTK ne_chunk() function for NER.**

In [10]:
import nltk
from nltk import ne_chunk, pos_tag
nltk.download('maxent_ne_chunker')
nltk.download('words')
nltk.download('averaged_perceptron_tagger')

# Test data as provided in the assignment
test_data = [
    ["Bill", "Gates", "founded", "Microsoft"],
    ["The", "Louvre", "Museum", "is", "in", "Paris"],
    ["Mount", "Fuji", "is", "a", "famous", "landmark", "in", "Japan"],
    ["The", "United", "Nations", "was", "formed", "in", "1945"],
    ["Shakira", "performed", "at", "the", "Super", "Bowl", "halftime", "show"],
    ["The", "Nobel", "Peace", "Prize", "was", "awarded", "to", "Malala", "Yousafzai"],
    ["The", "Amazon", "River", "flows", "through", "Brazil"],
    ["The", "Pyramids", "of", "Giza", "are", "in", "Egypt"],
    ["Rome", "is", "the", "capital", "of", "Italy"],
    ["The", "Great", "Wall", "of", "China", "is", "one", "of", "the", "Seven", "Wonders", "of", "the", "World"]
]

# Function to perform NER using NLTK's ne_chunk
def ner_nltk(test_sentences):
    results = []
    for sentence in test_sentences:
        tagged_sentence = pos_tag(sentence)  # Part-of-speech tagging
        chunked_sentence = ne_chunk(tagged_sentence)  # NER using ne_chunk
        results.append(chunked_sentence)
    return results

# Perform NER on the test data
nltk_results = ner_nltk(test_data)
nltk_results

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


[Tree('S', [Tree('PERSON', [('Bill', 'NNP')]), Tree('PERSON', [('Gates', 'NNP')]), ('founded', 'VBD'), Tree('PERSON', [('Microsoft', 'NNP')])]),
 Tree('S', [('The', 'DT'), Tree('ORGANIZATION', [('Louvre', 'NNP'), ('Museum', 'NNP')]), ('is', 'VBZ'), ('in', 'IN'), Tree('GPE', [('Paris', 'NNP')])]),
 Tree('S', [Tree('PERSON', [('Mount', 'NNP')]), Tree('ORGANIZATION', [('Fuji', 'NNP')]), ('is', 'VBZ'), ('a', 'DT'), ('famous', 'JJ'), ('landmark', 'NN'), ('in', 'IN'), Tree('GPE', [('Japan', 'NNP')])]),
 Tree('S', [('The', 'DT'), Tree('ORGANIZATION', [('United', 'NNP'), ('Nations', 'NNP')]), ('was', 'VBD'), ('formed', 'VBN'), ('in', 'IN'), ('1945', 'CD')]),
 Tree('S', [Tree('PERSON', [('Shakira', 'NNP')]), ('performed', 'VBD'), ('at', 'IN'), ('the', 'DT'), Tree('ORGANIZATION', [('Super', 'NNP'), ('Bowl', 'NNP')]), ('halftime', 'NN'), ('show', 'NN')]),
 Tree('S', [('The', 'DT'), Tree('ORGANIZATION', [('Nobel', 'NNP'), ('Peace', 'NNP'), ('Prize', 'NNP')]), ('was', 'VBD'), ('awarded', 'VBN'), ('

**3. Write a Python program that uses the spaCy nlp() function for NER.**

In [12]:
import spacy

# Load the pre-trained spaCy model
nlp = spacy.load('en_core_web_sm')

# Test data as provided in the assignment
test_data = [
    "Bill Gates founded Microsoft",
    "The Louvre Museum is in Paris",
    "Mount Fuji is a famous landmark in Japan",
    "The United Nations was formed in 1945",
    "Shakira performed at the Super Bowl halftime show",
    "The Nobel Peace Prize was awarded to Malala Yousafzai",
    "The Amazon River flows through Brazil",
    "The Pyramids of Giza are in Egypt",
    "Rome is the capital of Italy",
    "The Great Wall of China is one of the Seven Wonders of the World"
]

def ner_spacy(sentences):
    results = []
    for sentence in sentences:
        doc = nlp(sentence)
        entities = []
        for ent in doc.ents:
            entities.append((ent.text, ent.label_))
        results.append(entities)
    return results

# Perform NER on the test data
spacy_results = ner_spacy(test_data)
for result in spacy_results:
    print(result)


[('Bill Gates', 'PERSON'), ('Microsoft', 'ORG')]
[('The Louvre Museum', 'ORG'), ('Paris', 'GPE')]
[('Mount Fuji', 'LOC'), ('Japan', 'GPE')]
[('The United Nations', 'ORG'), ('1945', 'DATE')]
[('Shakira', 'PERSON'), ('the Super Bowl', 'EVENT')]
[('The Nobel Peace Prize', 'WORK_OF_ART'), ('Malala Yousafzai', 'PERSON')]
[('Amazon River', 'LOC'), ('Brazil', 'GPE')]
[('Giza', 'PERSON'), ('Egypt', 'GPE')]
[('Rome', 'GPE'), ('Italy', 'GPE')]
[('The Great Wall of China', 'FAC'), ('one', 'CARDINAL'), ('Seven', 'CARDINAL')]


**4. Write a Python program that uses the Hugging Face Transformers ner pipeline for NER.**

In [14]:
!pip install transformers



In [16]:
from transformers import pipeline

# Initialize the pipeline for named entity recognition
ner_pipeline = pipeline("ner", model="dbmdz/bert-large-cased-finetuned-conll03-english")

# Test data as provided in the assignment
test_data = [
    "Bill Gates founded Microsoft",
    "The Louvre Museum is in Paris",
    "Mount Fuji is a famous landmark in Japan",
    "The United Nations was formed in 1945",
    "Shakira performed at the Super Bowl halftime show",
    "The Nobel Peace Prize was awarded to Malala Yousafzai",
    "The Amazon River flows through Brazil",
    "The Pyramids of Giza are in Egypt",
    "Rome is the capital of Italy",
    "The Great Wall of China is one of the Seven Wonders of the World"
]

def ner_transformers(sentences):
    results = []
    for sentence in sentences:
        entities = ner_pipeline(sentence)
        formatted_entities = []
        for entity in entities:
            formatted_entities.append({
                'entity_group': entity['entity'],
                'score': entity['score'],
                'word': entity['word'],
                'start': entity['start'],
                'end': entity['end']
            })
        results.append(formatted_entities)
    return results

# Perform NER on the test data
transformers_results = ner_transformers(test_data)
for result in transformers_results:
    print(result)


Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'entity_group': 'I-PER', 'score': 0.9968207, 'word': 'Bill', 'start': 0, 'end': 4}, {'entity_group': 'I-PER', 'score': 0.9972857, 'word': 'Gates', 'start': 5, 'end': 10}, {'entity_group': 'I-ORG', 'score': 0.99925035, 'word': 'Microsoft', 'start': 19, 'end': 28}]
[{'entity_group': 'I-ORG', 'score': 0.80029345, 'word': 'Lou', 'start': 4, 'end': 7}, {'entity_group': 'I-ORG', 'score': 0.921264, 'word': '##vre', 'start': 7, 'end': 10}, {'entity_group': 'I-LOC', 'score': 0.47728893, 'word': 'Museum', 'start': 11, 'end': 17}, {'entity_group': 'I-LOC', 'score': 0.99909127, 'word': 'Paris', 'start': 24, 'end': 29}]
[{'entity_group': 'I-LOC', 'score': 0.5602855, 'word': 'Mount', 'start': 0, 'end': 5}, {'entity_group': 'I-PER', 'score': 0.48292893, 'word': 'Fuji', 'start': 6, 'end': 10}, {'entity_group': 'I-LOC', 'score': 0.9984364, 'word': 'Japan', 'start': 35, 'end': 40}]
[{'entity_group': 'I-ORG', 'score': 0.9990476, 'word': 'United', 'start': 4, 'end': 10}, {'entity_group': 'I-ORG', 'score
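
The raw pipeline output above is reported per word piece: "Louvre" arrives as `Lou` and `##vre`, and multi-word names arrive as separate tokens. The sketch below shows one way to merge adjacent pieces of the same entity type back into spans using the character offsets. `merge_word_pieces` is a hypothetical helper, and the sample tokens are copied from the second result row with the scores left out for brevity. Recent versions of `transformers` also accept an `aggregation_strategy` argument to `pipeline` (for example `"simple"`) that does this kind of grouping inside the library.

```python
def merge_word_pieces(sentence, tokens):
    """Group adjacent tokens of the same entity type into (text, type) spans."""
    spans = []  # each span is [start, end, entity_type]
    for tok in tokens:
        ent_type = tok["entity_group"].split("-")[-1]  # "I-ORG" -> "ORG"
        # A token continues the previous span if it starts right after it
        # (same word) or one character later (next word after a space).
        touches_previous = bool(spans) and tok["start"] <= spans[-1][1] + 1
        if touches_previous and spans[-1][2] == ent_type:
            spans[-1][1] = tok["end"]  # extend the current span
        else:
            spans.append([tok["start"], tok["end"], ent_type])
    return [(sentence[start:end], ent_type) for start, end, ent_type in spans]

sentence = "The Louvre Museum is in Paris"
tokens = [
    {"entity_group": "I-ORG", "word": "Lou", "start": 4, "end": 7},
    {"entity_group": "I-ORG", "word": "##vre", "start": 7, "end": 10},
    {"entity_group": "I-LOC", "word": "Museum", "start": 11, "end": 17},
    {"entity_group": "I-LOC", "word": "Paris", "start": 24, "end": 29},
]
merged = merge_word_pieces(sentence, tokens)
print(merged)
```

Slicing the original sentence by offsets, rather than joining the `word` fields, avoids having to strip the `##` markers by hand.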

**5. Write a Python program that uses Stanza's ner pipeline for NER.**

In [17]:
!pip install stanza



In [18]:
import stanza

# Download and initialize the English neural pipeline
stanza.download('en')  # This downloads the English models for the neural pipeline
nlp = stanza.Pipeline(lang='en', processors='tokenize,ner')

# Test data as provided in the assignment
test_data = [
    "Bill Gates founded Microsoft",
    "The Louvre Museum is in Paris",
    "Mount Fuji is a famous landmark in Japan",
    "The United Nations was formed in 1945",
    "Shakira performed at the Super Bowl halftime show",
    "The Nobel Peace Prize was awarded to Malala Yousafzai",
    "The Amazon River flows through Brazil",
    "The Pyramids of Giza are in Egypt",
    "Rome is the capital of Italy",
    "The Great Wall of China is one of the Seven Wonders of the World"
]

def ner_stanza(sentences):
    results = []
    for sentence in sentences:
        doc = nlp(sentence)
        entities = []
        for ent in doc.entities:
            entities.append((ent.text, ent.type))
        results.append(entities)
    return results

# Perform NER on the test data
stanza_results = ner_stanza(test_data)
for result in stanza_results:
    print(result)


INFO:stanza:Loading these models for language: en (English):
| Processor | Package                   |
-----------------------------------------
| tokenize  | combined                  |
| mwt       | combined                  |
| ner       | ontonotes-ww-multi_charlm |

INFO:stanza:Using device: cpu
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: mwt
INFO:stanza:Loading: ner
INFO:stanza:Done loading processors!


[('Bill Gates', 'PERSON'), ('Microsoft', 'ORG')]
[('The Louvre Museum', 'FAC'), ('Paris', 'GPE')]
[('Mount Fuji', 'LOC'), ('Japan', 'GPE')]
[('The United Nations', 'ORG'), ('1945', 'DATE')]
[('Shakira', 'PERSON'), ('the Super Bowl', 'EVENT')]
[('The Nobel Peace Prize', 'WORK_OF_ART'), ('Malala Yousafzai', 'PERSON')]
[('The Amazon River', 'LOC'), ('Brazil', 'GPE')]
[('The Pyramids of Giza', 'PERSON'), ('Egypt', 'GPE')]
[('Rome', 'GPE'), ('Italy', 'GPE')]
[('China', 'GPE'), ('Seven', 'CARDINAL')]


**6. Evaluate their performance in terms of accuracy, precision and recall, either as a whole (micro-averaging), or category by category (macro-averaging).**
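
No gold annotations for the test sentences appear in this notebook, so the following is only a sketch of how micro- and macro-averaged precision and recall could be computed once they are available. It counts exact `(text, label)` matches sentence by sentence. `precision_recall`, `toy_gold` and `toy_pred` are hypothetical names, and the toy entities are invented for illustration; they are not results from any of the taggers above. Accuracy is usually computed over per-token tags instead and is not covered here.

```python
from collections import Counter

def precision_recall(gold, predicted):
    """Return ((micro_p, micro_r), (macro_p, macro_r)) for entity predictions.

    gold and predicted are lists with one entry per sentence, each entry
    being a list of (text, label) pairs.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold_ents, pred_ents in zip(gold, predicted):
        gold_set, pred_set = set(gold_ents), set(pred_ents)
        for _, label in gold_set & pred_set:
            tp[label] += 1  # predicted and correct
        for _, label in pred_set - gold_set:
            fp[label] += 1  # predicted but not in gold
        for _, label in gold_set - pred_set:
            fn[label] += 1  # in gold but missed

    def ratio(num, den):
        return num / den if den else 0.0

    # Micro: pool the counts of all labels, then take one ratio
    total_tp, total_fp, total_fn = sum(tp.values()), sum(fp.values()), sum(fn.values())
    micro = (ratio(total_tp, total_tp + total_fp), ratio(total_tp, total_tp + total_fn))

    # Macro: take one ratio per label, then average the ratios
    labels = sorted(set(tp) | set(fp) | set(fn))
    per_label_p = [ratio(tp[l], tp[l] + fp[l]) for l in labels]
    per_label_r = [ratio(tp[l], tp[l] + fn[l]) for l in labels]
    macro = (ratio(sum(per_label_p), len(labels)), ratio(sum(per_label_r), len(labels)))
    return micro, macro

# Invented toy annotations, for illustration only
toy_gold = [
    [("Alice", "PERSON"), ("Acme", "ORG")],
    [("Springfield", "GPE"), ("Bob", "PERSON")],
]
toy_pred = [
    [("Alice", "PERSON"), ("Acme", "PERSON")],
    [("Springfield", "GPE"), ("Bob", "PERSON"), ("Monday", "ORG")],
]
micro, macro = precision_recall(toy_gold, toy_pred)
print("micro precision/recall:", micro)
print("macro precision/recall:", macro)
```

Micro-averaging weights every entity equally, so frequent labels dominate the score. Macro-averaging weights every label equally, so a rare label that is always wrong pulls the score down more.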