
In [None]:
# Import necessary libraries
import nltk
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

# Download NLTK resources (if not already done)
nltk.download('punkt')         # Tokenizer
nltk.download('wordnet')       # Lemmatizer
nltk.download('omw-1.4')       # Open Multilingual WordNet data used alongside WordNet
nltk.download('averaged_perceptron_tagger')  # For POS tagging

# Sample text for analysis
text = "The leaves on the trees are falling gracefully."

# Tokenizing the text into words
words = nltk.word_tokenize(text)

# Initialize PorterStemmer and WordNetLemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Function to get part-of-speech tag for lemmatization
def get_wordnet_pos(word):
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ, "N": wordnet.NOUN, "V": wordnet.VERB, "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

# Perform stemming and lemmatization
for word in words:
    stemmed = stemmer.stem(word)
    lemmatized = lemmatizer.lemmatize(word, get_wordnet_pos(word))
    print(f"Original Word: {word} | Stemmed: {stemmed} | Lemmatized: {lemmatized}")



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


Original Word: The | Stemmed: the | Lemmatized: The
Original Word: leaves | Stemmed: leav | Lemmatized: leaf
Original Word: on | Stemmed: on | Lemmatized: on
Original Word: the | Stemmed: the | Lemmatized: the
Original Word: trees | Stemmed: tree | Lemmatized: tree
Original Word: are | Stemmed: are | Lemmatized: be
Original Word: falling | Stemmed: fall | Lemmatized: fall
Original Word: gracefully | Stemmed: grace | Lemmatized: gracefully
Original Word: . | Stemmed: . | Lemmatized: .


In [None]:
import random
import nltk
from collections import defaultdict

# Sample text to build the bigram model
text = """I am learning Python. Python is fun and powerful. I love coding in Python.
          Natural Language Processing with Python is exciting."""

# Tokenize the text into words
tokens = nltk.word_tokenize(text.lower())  # Lowercase for consistency

# Map each word to the list of words that follow it (repeats in the list encode frequency)
bigram_model = defaultdict(list)

# Create bigrams (pairs of words) and populate the model
for i in range(len(tokens) - 1):
    bigram_model[tokens[i]].append(tokens[i+1])

# Function to generate text based on the bigram model
def generate_text(start_word, num_words=10):
    word = start_word
    sentence = [word]

    for _ in range(num_words - 1):
        # Choose the next word based on the current word's bigrams
        if word in bigram_model:
            word = random.choice(bigram_model[word])
            sentence.append(word)
        else:
            break  # Stop if there is no bigram for the current word

    return ' '.join(sentence)

# Starting word for text generation
start_word = 'python'

# Generate text using the bigram model
generated_text = generate_text(start_word, num_words=10)
print("Generated Text: ", generated_text)


Generated Text:  python . i am learning python is exciting . i
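
The model above stores every follower in a list, so `random.choice` samples in proportion to how often each follower occurred. A minimal sketch that makes those conditional probabilities explicit is shown here; the helper name `bigram_probabilities` and the hand-tokenized toy input are assumptions for illustration and are not part of the cell above.

```python
from collections import Counter, defaultdict

def bigram_probabilities(tokens):
    """Return {word: {next_word: P(next_word | word)}} estimated from a token list."""
    counts = defaultdict(Counter)
    for current, following in zip(tokens, tokens[1:]):
        counts[current][following] += 1
    return {
        word: {nxt: c / sum(followers.values()) for nxt, c in followers.items()}
        for word, followers in counts.items()
    }

# Hand-tokenized toy input (an assumption; the cell above uses nltk.word_tokenize)
toy_tokens = "python is fun . python is powerful . i love python .".split()
probs = bigram_probabilities(toy_tokens)

# "python" is followed by "is" twice and by "." once
print(probs["python"])
```

Sampling the next word from these distributions should behave the same way as `random.choice` over the follower lists.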


7.	Write a program using the NLTK library to perform part-of-speech tagging on a text.

In [None]:
import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag

# Download necessary resources
nltk.download('punkt')  # Tokenizer
nltk.download('averaged_perceptron_tagger')  # POS Tagger

# Sample text
text = "Natural Language Processing with Python is fun and educational."

# Tokenize the text into words
tokens = word_tokenize(text)

# Perform part-of-speech tagging
pos_tags = pos_tag(tokens)

# Print the tokens with their corresponding POS tags
for word, tag in pos_tags:
    print(f"{word}: {tag}")


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


Natural: JJ
Language: NNP
Processing: NNP
with: IN
Python: NNP
is: VBZ
fun: NN
and: CC
educational: JJ
.: .


8.	Implement a simple stochastic part-of-speech tagger in Python that uses a basic probabilistic model to assign POS tags.

In [None]:
import nltk
from nltk.corpus import treebank
from nltk.tokenize import word_tokenize

nltk.download('treebank')
nltk.download('punkt')

# Train UnigramTagger on Treebank corpus
train_data = treebank.tagged_sents()
tagger = nltk.UnigramTagger(train_data)

# Sample text
text = "Natural Language Processing is fun."

# Tokenize and tag
tokens = word_tokenize(text)
print(tagger.tag(tokens))


[nltk_data] Downloading package treebank to /root/nltk_data...
[nltk_data]   Unzipping corpora/treebank.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


[('Natural', 'NNP'), ('Language', None), ('Processing', None), ('is', 'VBZ'), ('fun', None), ('.', '.')]
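
The `None` tags in the output above appear because a unigram tagger has nothing to say about words it never saw in training. NLTK taggers accept a `backoff` argument for this case (for example a `DefaultTagger`). The plain-Python sketch below illustrates the same idea without NLTK; the tiny training set and the function names are hypothetical.

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sents):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_with_backoff(tokens, model, default_tag="NN"):
    """Look each token up in the model; fall back to default_tag for unseen words."""
    return [(tok, model.get(tok, default_tag)) for tok in tokens]

# Tiny hand-made training set (hypothetical, for illustration only)
toy_train = [
    [("Natural", "NNP"), ("is", "VBZ"), (".", ".")],
    [("it", "PRP"), ("is", "VBZ"), ("fun", "JJ"), (".", ".")],
    [("the", "DT"), ("fun", "NN"), ("is", "VBZ"), ("over", "RB"), (".", ".")],
    [("great", "JJ"), ("fun", "NN"), (".", ".")],
]
model = train_unigram(toy_train)
tagged = tag_with_backoff(
    ["Natural", "Language", "Processing", "is", "fun", "."], model
)
print(tagged)
```

Unseen words such as "Language" receive the default tag instead of `None`, while "fun" gets `NN` because that tag is the more frequent one in the toy data.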


9.	Implement a rule-based part-of-speech tagging system in Python using regular expressions.

In [None]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import RegexpTagger

# Sample text to tag
text = "The quick brown fox jumps over the lazy dog."

# Tokenize the text
tokens = word_tokenize(text)

# Define rules for regular expression-based POS tagging
patterns = [
    (r'^[Tt]he$', 'DT'),    # Determiner (The, the)
    (r'^[Aa]n?$', 'DT'),    # Determiner (A, An, a, an)
    (r'.*ing$', 'VBG'),     # Gerunds (verbs ending with -ing)
    (r'.*ed$', 'VBD'),      # Past tense verbs (ended, jumped)
    (r'.*es$', 'VBZ'),      # Present tense singular verbs ending in -es (goes, watches); "jumps" does not match
    (r'.*ould$', 'MD'),     # Modals (could, would, should)
    (r'.*ly$', 'RB'),       # Adverbs (quickly, slowly)
    (r'[0-9]+', 'CD'),      # Cardinal numbers
    (r'.*', 'NN')           # Default rule: Noun (every other word)
]

# Create a RegexpTagger
regexp_tagger = RegexpTagger(patterns)

# Apply the tagger to the tokens
tagged_tokens = regexp_tagger.tag(tokens)

# Print the tagged tokens
print(tagged_tokens)


[('The', 'DT'), ('quick', 'NN'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'NN'), ('over', 'NN'), ('the', 'DT'), ('lazy', 'NN'), ('dog', 'NN'), ('.', 'NN')]
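
In the output above, "jumps" and "." both fall through to the default `NN` rule, because no earlier pattern matches them. The sketch below uses only the `re` module to show how rule order and coverage change the result. The `-s` rule and the punctuation rule are additions made for this illustration, and the `-s` rule is deliberately crude (it would also tag plural nouns as verbs).

```python
import re

# Ordered (pattern, tag) rules; the first pattern that matches wins.
RULES = [
    (r'^[Tt]he$', 'DT'),     # determiner
    (r'^[^\w\s]+$', '.'),    # punctuation-only tokens (added for this sketch)
    (r'.*ing$', 'VBG'),      # gerunds
    (r'.*ed$', 'VBD'),       # past tense verbs
    (r'.*s$', 'VBZ'),        # crude -s rule (added for this sketch)
    (r'.*ly$', 'RB'),        # adverbs
    (r'^[0-9]+$', 'CD'),     # cardinal numbers
    (r'.*', 'NN'),           # default: noun
]
COMPILED = [(re.compile(pattern), tag) for pattern, tag in RULES]

def regex_tag(tokens):
    """Tag each token with the tag of the first rule whose pattern matches it."""
    tagged = []
    for tok in tokens:
        for pattern, tag in COMPILED:
            if pattern.match(tok):
                tagged.append((tok, tag))
                break
    return tagged

print(regex_tag("The quick brown fox jumps over the lazy dog .".split()))
```

Because matching stops at the first hit, specific rules have to come before general ones; moving the catch-all `.*` rule to the top would tag every token `NN`.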


10.	Implement transformation-based tagging in Python by applying a simple transformation rule to the output of an initial tagger.

In [None]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import UnigramTagger, RegexpTagger
from nltk.corpus import treebank

# Download necessary NLTK resources
nltk.download('treebank')
nltk.download('punkt')

# Train initial unigram tagger using Treebank corpus
train_data = treebank.tagged_sents()
unigram_tagger = UnigramTagger(train_data)

# Sample text
text = "The dog barked loudly at the cat."

# Tokenize the text
tokens = word_tokenize(text)

# Perform initial POS tagging
initial_tags = unigram_tagger.tag(tokens)

# Define a transformation rule: retag "barked" as VBD when the initial tagger
# labeled it NN or left it untagged (None means the word was unseen in training)
def apply_transformation(tags):
    transformed_tags = []
    for word, tag in tags:
        if word.lower() == "barked" and tag in (None, "NN"):
            tag = "VBD"  # Past tense verb
        transformed_tags.append((word, tag))
    return transformed_tags

# Apply transformation rule
transformed_tags = apply_transformation(initial_tags)

# Print the transformed tags
print("Initial Tags:", initial_tags)
print("Transformed Tags:", transformed_tags)


[nltk_data] Downloading package treebank to /root/nltk_data...
[nltk_data]   Package treebank is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Initial Tags: [('The', 'DT'), ('dog', None), ('barked', None), ('loudly', None), ('at', 'IN'), ('the', 'DT'), ('cat', None), ('.', '.')]
Transformed Tags: [('The', 'DT'), ('dog', None), ('barked', 'VBD'), ('loudly', None), ('at', 'IN'), ('the', 'DT'), ('cat', None), ('.', '.')]
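
The rule in the cell above is keyed to one specific word. Transformation-based (Brill-style) taggers usually state rules in terms of context, such as "change tag A to tag B when the previous tag is C". A minimal sketch of one such contextual rule follows; the function name, the rule, and the initial tagging are hypothetical and only illustrate the idea.

```python
def apply_rule(tagged, from_tag, to_tag, prev_tag):
    """Retag from_tag as to_tag wherever the preceding token's original tag is prev_tag."""
    result = list(tagged)
    for i in range(1, len(tagged)):
        word, tag = tagged[i]
        # Context is read from the input tagging, not from tags changed in this pass
        if tag == from_tag and tagged[i - 1][1] == prev_tag:
            result[i] = (word, to_tag)
    return result

# Hypothetical initial tagging in which "barked" was mis-tagged as a noun
initial = [("The", "DT"), ("dog", "NN"), ("barked", "NN"), ("loudly", "RB")]
fixed = apply_rule(initial, from_tag="NN", to_tag="VBD", prev_tag="NN")
print(fixed)
```

A full Brill tagger would learn an ordered list of such rules from a tagged corpus; this sketch applies a single hand-written one.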


11.	Implement a simple top-down parser for context-free grammars in Python.



In [None]:
# Tokens for the parser
tokens = []
current_token = ''
pos = 0

# Lexer (tokenizer) to break input into tokens
def lexer(input_string):
    global tokens
    tokens = []
    i = 0
    while i < len(input_string):
        if input_string[i].isdigit():
            num = ''
            while i < len(input_string) and input_string[i].isdigit():
                num += input_string[i]
                i += 1
            tokens.append(('id', num))
        elif input_string[i] in '+*()':
            tokens.append((input_string[i], input_string[i]))
            i += 1
        elif input_string[i] == ' ':
            i += 1
        else:
            raise ValueError(f"Unexpected character: {input_string[i]}")
    tokens.append(('$', '$'))  # End of input

# Helper function to get the current token
def current():
    global pos
    return tokens[pos]

# Match the current token and move to the next one
def match(expected_token):
    global pos
    if current()[0] == expected_token:
        pos += 1
    else:
        raise ValueError(f"Expected {expected_token}, found {current()[0]}")

# Parsing functions for each non-terminal

def E():
    T()  # Parse a term
    E_prime()  # Parse the rest of the expression

def E_prime():
    if current()[0] == '+':
        match('+')
        T()
        E_prime()  # Handle recursion

def T():
    F()  # Parse a factor
    T_prime()  # Parse the rest of the term

def T_prime():
    if current()[0] == '*':
        match('*')
        F()
        T_prime()  # Handle recursion

def F():
    if current()[0] == 'id':
        match('id')  # Match identifier (number/variable)
    elif current()[0] == '(':
        match('(')
        E()  # Parse an expression inside parentheses
        match(')')
    else:
        raise ValueError(f"Unexpected token: {current()[0]}")

# Main function to start parsing
def parse(input_string):
    global pos
    pos = 0
    lexer(input_string)  # Tokenize the input
    E()  # Start parsing from the start symbol (E)
    if current()[0] == '$':
        print("Parsing successful!")
    else:
        print("Parsing failed!")

# Example usage
input_string = "(1+2)*3"
parse(input_string)


Parsing successful!
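
The parser above only recognizes whether the input fits the grammar. As a sketch of how the same top-down structure can carry a result, the version below returns the value of the expression. It is an illustrative variant, not the cell's code: the function name `evaluate` is hypothetical, it uses loops in place of the `E'` and `T'` helper rules, and its regex tokenizer silently skips characters it does not recognize.

```python
import re

def evaluate(expr):
    """Evaluate + and * over non-negative integers with a recursive-descent parser."""
    tokens = re.findall(r'\d+|[+*()]', expr) + ['$']  # '$' marks end of input
    pos = 0

    def peek():
        return tokens[pos]

    def advance(expected=None):
        nonlocal pos
        tok = tokens[pos]
        if expected is not None and tok != expected:
            raise ValueError(f"Expected {expected}, found {tok}")
        pos += 1
        return tok

    def expression():          # E -> T ('+' T)*
        value = term()
        while peek() == '+':
            advance('+')
            value += term()
        return value

    def term():                # T -> F ('*' F)*
        value = factor()
        while peek() == '*':
            advance('*')
            value *= factor()
        return value

    def factor():              # F -> number | '(' E ')'
        if peek().isdigit():
            return int(advance())
        if peek() == '(':
            advance('(')
            value = expression()
            advance(')')
            return value
        raise ValueError(f"Unexpected token: {peek()}")

    result = expression()
    advance('$')               # all input must be consumed
    return result

print(evaluate("(1+2)*3"))
```

Because `term` is called from `expression`, multiplication binds tighter than addition, which mirrors how the grammar in the cell above layers `T` under `E`.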


12.	Implement an Earley parser for context-free grammars as a simple Python program.

In [None]:
from collections import defaultdict

# State class to store each Earley parsing state
class State:
    def __init__(self, lhs, rule, dot, origin):
        self.lhs = lhs            # Left-hand side of the rule (non-terminal)
        self.rule = tuple(rule)   # Right-hand side, stored as a tuple so the state is hashable
        self.dot = dot            # Position of the dot (how much of the rule is processed)
        self.origin = origin      # Position in the input where this rule started

    def _key(self):
        return (self.lhs, self.rule, self.dot, self.origin)

    # States must compare by value; with identity comparison the set membership
    # tests below never succeed and the predictor loops forever
    def __eq__(self, other):
        return isinstance(other, State) and self._key() == other._key()

    def __hash__(self):
        return hash(self._key())

    def is_complete(self):
        return self.dot == len(self.rule)  # Rule is complete if dot is at the end

    def next_symbol(self):
        if not self.is_complete():
            return self.rule[self.dot]
        return None

    def __repr__(self):
        rule_str = ' '.join(self.rule[:self.dot] + ('•',) + self.rule[self.dot:])
        return f'{self.lhs} → {rule_str}, ({self.origin})'


# Earley parser class
class EarleyParser:
    def __init__(self, grammar, start_symbol):
        self.grammar = grammar  # Grammar (dict of non-terminal to list of rules)
        self.start_symbol = start_symbol  # Starting symbol of the grammar

    def parse(self, input_string):
        input_tokens = input_string.split()
        n = len(input_tokens)

        # One state set per position between tokens (0..n)
        state_sets = [set() for _ in range(n + 1)]

        # Seed the first set with every rule of the start symbol
        for rule in self.grammar[self.start_symbol]:
            state_sets[0].add(State(self.start_symbol, rule, 0, 0))

        for i in range(n + 1):
            # Predict and complete until set i stops growing, then scan into set i + 1
            changed = True
            while changed:
                changed = self._predictor(state_sets, i)
                changed = self._completer(state_sets, i) or changed
            if i < n:
                self._scanner(state_sets, i, input_tokens)

        # Accept if a start rule is complete and spans the whole input
        accepted = any(
            s.lhs == self.start_symbol and s.is_complete() and s.origin == 0
            for s in state_sets[n]
        )
        print("Input string is accepted." if accepted else "Input string is not accepted.")
        return accepted

    def _predictor(self, state_sets, pos):
        added = False
        for state in list(state_sets[pos]):
            next_symbol = state.next_symbol()
            if next_symbol in self.grammar:  # If non-terminal, predict its expansions
                for rule in self.grammar[next_symbol]:
                    new_state = State(next_symbol, rule, 0, pos)
                    if new_state not in state_sets[pos]:
                        state_sets[pos].add(new_state)
                        added = True
        return added

    def _scanner(self, state_sets, pos, input_tokens):
        for state in state_sets[pos]:
            next_symbol = state.next_symbol()
            # Only terminals are scanned, and only when they match the current token
            if next_symbol not in self.grammar and next_symbol == input_tokens[pos]:
                state_sets[pos + 1].add(State(state.lhs, state.rule, state.dot + 1, state.origin))

    def _completer(self, state_sets, pos):
        added = False
        for state in list(state_sets[pos]):
            if state.is_complete():
                for prev_state in list(state_sets[state.origin]):
                    # Advance every state that was waiting for this non-terminal
                    if prev_state.next_symbol() == state.lhs:
                        new_state = State(prev_state.lhs, prev_state.rule, prev_state.dot + 1, prev_state.origin)
                        if new_state not in state_sets[pos]:
                            state_sets[pos].add(new_state)
                            added = True
        return added


# Example grammar (Simple arithmetic grammar)
grammar = {
    'S': [['E']],         # Start symbol
    'E': [['E', '+', 'T'], ['T']],
    'T': [['T', '*', 'F'], ['F']],
    'F': [['(', 'E', ')'], ['id']]
}

# Example usage
parser = EarleyParser(grammar, 'S')
input_string = 'id + id * id'
parser.parse(input_string)


13.	Generate a parse tree for a given sentence from a context-free grammar using a Python program.

In [None]:
from collections import defaultdict

# Define the tree node for representing the parse tree
class TreeNode:
    def __init__(self, symbol):
        self.symbol = symbol
        self.children = []

    def add_child(self, child):
        self.children.append(child)

    def __repr__(self, level=0):
        ret = "\t" * level + repr(self.symbol) + "\n"
        for child in self.children:
            ret += child.__repr__(level + 1)
        return ret


# Recursive descent parser class for generating parse trees
class RecursiveDescentParser:
    def __init__(self, grammar, start_symbol):
        self.grammar = grammar  # Grammar in dict form {LHS: [RHS1, RHS2, ...]}
        self.start_symbol = start_symbol  # Start symbol of the grammar
        self.tokens = []  # List of tokens from the input
        self.pos = 0  # Pointer to the current token position

    def parse(self, input_string):
        # Tokenize the input string by splitting on spaces
        self.tokens = input_string.split()
        self.pos = 0

        # Start parsing from the start symbol and generate the parse tree
        parse_tree = self._parse_rule(self.start_symbol)

        if parse_tree and self.pos == len(self.tokens):
            print("Parse successful!")
            print("Parse Tree:")
            print(parse_tree)
        else:
            print("Parse failed!")

    def _parse_rule(self, symbol):
        """Attempt to parse a rule for the given symbol."""
        if symbol not in self.grammar:  # If it's a terminal
            if self.pos < len(self.tokens) and self.tokens[self.pos] == symbol:
                # Terminal matches current token, consume it
                node = TreeNode(symbol)
                self.pos += 1
                return node
            else:
                return None

        # Try each rule in the grammar for the given non-terminal symbol
        for rule in self.grammar[symbol]:
            saved_pos = self.pos  # Save current position to backtrack if needed
            node = TreeNode(symbol)
            success = True
            for sym in rule:
                child = self._parse_rule(sym)
                if child:
                    node.add_child(child)
                else:
                    success = False
                    break

            if success:
                return node  # Successfully parsed this rule

            # Backtrack if the rule didn't match
            self.pos = saved_pos

        return None  # No rule matched

# Define the context-free grammar
grammar = {
    'S': [['NP', 'VP']],
    'NP': [['Det', 'N']],
    'VP': [['V', 'NP'], ['V']],
    'Det': [['the'], ['a']],
    'N': [['cat'], ['dog']],
    'V': [['chased'], ['saw']]
}

# Example usage of the Recursive Descent Parser
parser = RecursiveDescentParser(grammar, 'S')
input_sentence = "the cat chased the dog"
parser.parse(input_sentence)


Parse successful!
Parse Tree:
'S'
	'NP'
		'Det'
			'the'
		'N'
			'cat'
	'VP'
		'V'
			'chased'
		'NP'
			'Det'
				'the'
			'N'
				'dog'



14.	Create a Python program that checks for agreement in sentences based on the rules of a context-free grammar.

In [None]:
import nltk
from nltk import CFG
grammar = CFG.fromstring("""
    S -> NP_SG VP_SG | NP_PL VP_PL
    NP_SG -> Det_SG N_SG
    NP_PL -> Det_PL N_PL
    VP_SG -> V_SG
    VP_PL -> V_PL
    Det_SG -> 'the'
    Det_PL -> 'the'
    N_SG -> 'cat' | 'dog'
    N_PL -> 'cats' | 'dogs'
    V_SG -> 'runs' | 'jumps'
    V_PL -> 'run' | 'jump'
""")

def check_agreement(sentence):
    tokens = sentence.split()
    parser = nltk.ChartParser(grammar)

    try:
        next(parser.parse(tokens))
        return True

    except StopIteration:
        return False

sentences = [
    "the cat runs",
    "the dogs run",
    "the cat run",
    "the dog runs"
]
for sentence in sentences:
    if check_agreement(sentence):
        print(f"Agreement satisfied for '{sentence}'")
    else:
        print(f"Agreement NOT satisfied for '{sentence}'")

Agreement satisfied for 'the cat runs'
Agreement satisfied for 'the dogs run'
Agreement NOT satisfied for 'the cat run'
Agreement satisfied for 'the dog runs'
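
The grammar above encodes number by duplicating every category (`NP_SG`, `NP_PL`, and so on). An alternative is to keep one category per part of speech and attach a number feature to each word. The sketch below shows that idea in plain Python for the fixed pattern "Det N V"; the lexicon and the function name are hypothetical and cover only the words used in the cell above.

```python
# Hypothetical lexicon: word -> (category, number); None means unmarked for number
LEXICON = {
    "the": ("Det", None),
    "cat": ("N", "sg"), "dog": ("N", "sg"),
    "cats": ("N", "pl"), "dogs": ("N", "pl"),
    "runs": ("V", "sg"), "jumps": ("V", "sg"),
    "run": ("V", "pl"), "jump": ("V", "pl"),
}

def agrees(sentence):
    """Accept 'Det N V' sentences whose noun and verb share the same number."""
    entries = [LEXICON.get(word) for word in sentence.split()]
    if len(entries) != 3 or None in entries:
        return False  # wrong length or unknown word
    (cat1, _), (cat2, noun_number), (cat3, verb_number) = entries
    return (cat1, cat2, cat3) == ("Det", "N", "V") and noun_number == verb_number

for s in ["the cat runs", "the dogs run", "the cat run", "the dog runs"]:
    print(s, "->", agrees(s))
```

With features, adding a new noun means adding one lexicon entry, whereas the category-splitting grammar needs the word placed under the right `_SG` or `_PL` rule.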


15.	Implement probabilistic context-free grammar (PCFG) parsing for a sentence in Python.

In [None]:
import nltk
from nltk import PCFG, ViterbiParser

grammar = PCFG.fromstring("""
    S -> NP VP [1.0]
    VP -> V NP [0.7] | V [0.3]
    NP -> Det N [0.6] | N [0.4]
    Det -> 'the' [0.8] | 'a' [0.2]
    N -> 'cat' [0.5] | 'dog' [0.5]
    V -> 'chased' [0.9] | 'saw' [0.1]
""")

def parse_sentence_pcfg(sentence):
    tokens = sentence.split()
    parser = ViterbiParser(grammar)

    try:
        trees = list(parser.parse(tokens))
        return trees[0]
    except IndexError:
        return None  # Handle case where no parse tree is found
sentences = [
    "the cat chased the dog",
    "a dog saw the cat"
]
for sentence in sentences:
    parse_tree = parse_sentence_pcfg(sentence)
    if parse_tree:
        print(f"Parse tree for '{sentence}':")
        print(parse_tree)
        parse_tree.pretty_print()
    else:
        print(f"No parse tree found for sentence: '{sentence}'")


Parse tree for 'the cat chased the dog':
(S
  (NP (Det the) (N cat))
  (VP (V chased) (NP (Det the) (N dog)))) (p=0.036288)
              S               
      ________|_____           
     |              VP        
     |         _____|___       
     NP       |         NP    
  ___|___     |      ___|___   
Det      N    V    Det      N 
 |       |    |     |       |  
the     cat chased the     dog

Parse tree for 'a dog saw the cat':
(S
  (NP (Det a) (N dog))
  (VP (V saw) (NP (Det the) (N cat)))) (p=0.001008)
             S             
      _______|___           
     |           VP        
     |        ___|___       
     NP      |       NP    
  ___|___    |    ___|___   
Det      N   V  Det      N 
 |       |   |   |       |  
 a      dog saw the     cat
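
The probability printed with each tree is the product of the probabilities of every rule used in that tree. The sketch below recomputes both values by hand from the grammar in the cell above; the nested-tuple tree encoding and the helper name `tree_probability` are assumptions made for this illustration.

```python
# Rule probabilities copied from the PCFG above (only the rules used by the two trees)
RULE_PROB = {
    ("S", ("NP", "VP")): 1.0,
    ("VP", ("V", "NP")): 0.7,
    ("NP", ("Det", "N")): 0.6,
    ("Det", ("the",)): 0.8,
    ("Det", ("a",)): 0.2,
    ("N", ("cat",)): 0.5,
    ("N", ("dog",)): 0.5,
    ("V", ("chased",)): 0.9,
    ("V", ("saw",)): 0.1,
}

def tree_probability(tree):
    """tree is (label, child, ...) where each child is a subtree or a word string."""
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = RULE_PROB[(label, rhs)]
    for child in children:
        if not isinstance(child, str):
            p *= tree_probability(child)
    return p

t1 = ("S", ("NP", ("Det", "the"), ("N", "cat")),
           ("VP", ("V", "chased"), ("NP", ("Det", "the"), ("N", "dog"))))
t2 = ("S", ("NP", ("Det", "a"), ("N", "dog")),
           ("VP", ("V", "saw"), ("NP", ("Det", "the"), ("N", "cat"))))

print(round(tree_probability(t1), 6))  # 0.036288
print(round(tree_probability(t2), 6))  # 0.001008
```

For the first sentence this is 1.0 × (0.6 × 0.8 × 0.5) × (0.7 × 0.9 × (0.6 × 0.8 × 0.5)) = 0.036288, matching the parser's output.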



16.	Implement a Python program using the spaCy library to perform Named Entity Recognition (NER) on a given text.



In [None]:
import spacy

# Load the pre-trained model (small English model)
nlp = spacy.load("en_core_web_sm")

# Function to perform NER on the given text
def perform_ner(text):
    # Process the text using the SpaCy model
    doc = nlp(text)

    # Extract named entities and their labels
    print(f"{'Entity':<20} | {'Label':<10} | {'Explanation'}")
    print("-" * 50)
    for ent in doc.ents:
        print(f"{ent.text:<20} | {ent.label_:<10} | {spacy.explain(ent.label_)}")

# Example text
text = "Apple is looking at buying U.K. startup for $1 billion. Elon Musk founded SpaceX in 2002 in the United States."

# Perform NER on the example text
perform_ner(text)


Entity               | Label      | Explanation
--------------------------------------------------
Apple                | ORG        | Companies, agencies, institutions, etc.
U.K.                 | GPE        | Countries, cities, states
$1 billion           | MONEY      | Monetary values, including unit
Elon Musk            | PERSON     | People, including fictional
2002                 | DATE       | Absolute or relative dates or periods
the United States    | GPE        | Countries, cities, states


17.	Write a program that demonstrates how to access WordNet, a lexical database, to retrieve synsets and explore word meanings in Python.

In [None]:
# Import necessary libraries
import nltk
from nltk.corpus import wordnet as wn

# Make sure to download the WordNet corpus if you haven't already
nltk.download('wordnet')
nltk.download('omw-1.4')  # Optional: To get word translations in different languages

# Define a word for which you want to explore synsets
word = "dog"

# Retrieve synsets for the word
synsets = wn.synsets(word)

# Print the synsets (each synset represents a specific meaning of the word)
print(f"Synsets for '{word}':")
for synset in synsets:
    print(f"{synset.name()} - {synset.definition()}")

# Explore details of the first synset
if synsets:
    first_synset = synsets[0]
    print(f"\nDetails for the first synset '{first_synset.name()}':")

    # Definition of the first synset
    print(f"Definition: {first_synset.definition()}")

    # Examples of usage for this sense of the word
    print(f"Examples: {first_synset.examples()}")

    # Hypernyms (more general terms) for this synset
    hypernyms = first_synset.hypernyms()
    print(f"Hypernyms: {[hypernym.name() for hypernym in hypernyms]}")

    # Hyponyms (more specific terms) for this synset
    hyponyms = first_synset.hyponyms()
    print(f"Hyponyms: {[hyponym.name() for hyponym in hyponyms]}")

    # Lemmas (different word forms for this synset)
    lemmas = first_synset.lemmas()
    print(f"Lemmas: {[lemma.name() for lemma in lemmas]}")

    # Antonyms (opposite meanings)
    antonyms = []
    for lemma in lemmas:
        antonyms.extend(lemma.antonyms())
    print(f"Antonyms: {[antonym.name() for antonym in antonyms]}")


Synsets for 'dog':
dog.n.01 - a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds
frump.n.01 - a dull unattractive unpleasant girl or woman
dog.n.03 - informal term for a man
cad.n.01 - someone who is morally reprehensible
frank.n.02 - a smooth-textured sausage of minced beef or pork usually smoked; often served on a bread roll
pawl.n.01 - a hinged catch that fits into a notch of a ratchet to move a wheel forward or prevent it from moving backward
andiron.n.01 - metal supports for logs in a fireplace
chase.v.01 - go after with the intent to catch

Details for the first synset 'dog.n.01':
Definition: a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds
Examples: ['the dog barked all night']
Hypernyms: ['canine.n.02', 'domestic_animal.n.01']
Hyponyms: ['basenji.n.01', 'corgi.n.01', 'cur.n.01', 

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


18.	Implement a simple FOPC (first-order predicate calculus) parser for basic logical expressions in Python.

In [None]:
import re

# Token types
AND = 'AND'
OR = 'OR'
NOT = 'NOT'
IMPLIES = 'IMPLIES'
EQUIVALENT = 'EQUIVALENT'
FORALL = 'FORALL'
EXISTS = 'EXISTS'
LPAREN = 'LPAREN'
RPAREN = 'RPAREN'
PREDICATE = 'PREDICATE'
VARIABLE = 'VARIABLE'

# Token regex patterns
TOKEN_REGEX = [
    (r'\(', LPAREN),
    (r'\)', RPAREN),
    (r'¬', NOT),  # Not symbol
    (r'∧', AND),  # And symbol
    (r'∨', OR),   # Or symbol
    (r'→', IMPLIES),  # Implies symbol
    (r'↔', EQUIVALENT),  # Equivalent symbol
    (r'∀', FORALL),  # Universal quantifier
    (r'∃', EXISTS),  # Existential quantifier
    (r'[A-Z][a-zA-Z]*\([a-z]+\)', PREDICATE),  # Predicates like P(x)
    (r'[a-z]', VARIABLE),  # Variables like x, y, z
]

# Tokenizer: breaks down input into tokens
def tokenize(expression):
    pos = 0
    tokens = []
    while pos < len(expression):
        # Skip any whitespace
        if expression[pos].isspace():
            pos += 1
            continue
        match = None
        for regex, token_type in TOKEN_REGEX:
            pattern = re.compile(regex)
            match = pattern.match(expression, pos)
            if match:
                tokens.append((token_type, match.group(0)))
                pos = match.end(0)
                break
        if not match:
            raise SyntaxError(f"Invalid character at position {pos}: {expression[pos]}")
    return tokens

# Simple FOPC parser class
class FOPCParser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def parse(self):
        return self.expr()

    def expr(self):
        # Parsing expressions (AND, OR, IMPLIES, etc.)
        left = self.term()
        while self.pos < len(self.tokens) and self.tokens[self.pos][0] in {AND, OR, IMPLIES, EQUIVALENT}:
            operator = self.tokens[self.pos]
            self.pos += 1
            right = self.term()
            left = (operator, left, right)
        return left

    def term(self):
        token = self.tokens[self.pos]
        if token[0] == LPAREN:
            self.pos += 1
            result = self.expr()
            if self.tokens[self.pos][0] != RPAREN:
                raise SyntaxError("Expected closing parenthesis")
            self.pos += 1
            return result
        elif token[0] == NOT:
            self.pos += 1
            return ('NOT', self.term())
        elif token[0] == FORALL or token[0] == EXISTS:
            quantifier = token
            self.pos += 1
            variable = self.tokens[self.pos]
            self.pos += 1
            result = self.term()
            return (quantifier, variable, result)
        elif token[0] == PREDICATE:
            self.pos += 1
            return ('PREDICATE', token[1])
        else:
            raise SyntaxError(f"Unexpected token: {token}")

# Sample program to test the FOPC parser
if __name__ == "__main__":
    # Example expression: ∀x (P(x) → Q(x))
    expression = "∀x (P(x) → Q(x))"
    print(f"Input FOPC Expression: {expression}")

    # Tokenize the input expression
    tokens = tokenize(expression)
    print(f"Tokens: {tokens}")

    # Parse the tokenized expression
    parser = FOPCParser(tokens)
    parsed_expression = parser.parse()
    print(f"Parsed Expression: {parsed_expression}")


Input FOPC Expression: ∀x (P(x) → Q(x))
Tokens: [('FORALL', '∀'), ('VARIABLE', 'x'), ('LPAREN', '('), ('PREDICATE', 'P(x)'), ('IMPLIES', '→'), ('PREDICATE', 'Q(x)'), ('RPAREN', ')')]
Parsed Expression: (('FORALL', '∀'), ('VARIABLE', 'x'), (('IMPLIES', '→'), ('PREDICATE', 'P(x)'), ('PREDICATE', 'Q(x)')))


19.	Create a Python program for word sense disambiguation using the Lesk algorithm.

In [None]:
import nltk
from nltk.corpus import wordnet as wn
from nltk.tokenize import word_tokenize

# Ensure that the necessary NLTK packages are downloaded
nltk.download('wordnet')
nltk.download('punkt')

def lesk(context_sentence, ambiguous_word):
    # Tokenize the context sentence
    context = set(word_tokenize(context_sentence))

    # Get all senses (synsets) of the ambiguous word
    best_sense = None
    max_overlap = 0

    for sense in wn.synsets(ambiguous_word):
        # Get the definition of the sense and tokenize it
        signature = set(word_tokenize(sense.definition()))

        # Also consider examples of the sense as part of its signature
        for example in sense.examples():
            signature.update(word_tokenize(example))

        # Calculate the overlap between the context and the sense signature
        overlap = len(context.intersection(signature))

        # Choose the sense with the highest overlap
        if overlap > max_overlap:
            max_overlap = overlap
            best_sense = sense

    return best_sense

# Test the Lesk Algorithm with an example
if __name__ == "__main__":
    sentence = "I went to the bank to deposit my money."
    ambiguous_word = "bank"

    sense = lesk(sentence, ambiguous_word)
    if sense:
        print(f"Best Sense: {sense.name()}")
        print(f"Definition: {sense.definition()}")
    else:
        print("No suitable sense found.")


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Best Sense: depository_financial_institution.n.01
Definition: a financial institution that accepts deposits and channels the money into lending activities


20.	Implement a basic information retrieval system in Python using TF-IDF (Term Frequency-Inverse Document Frequency) for document ranking.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
import nltk

# Ensure NLTK tokenizer is available
nltk.download('punkt')

# Sample documents for the information retrieval system
documents = [
    "The sky is blue and beautiful.",
    "Love this blue and beautiful sky!",
    "The quick brown fox jumps over the lazy dog.",
    "A king's breakfast has sausages, ham, bacon, eggs, toast and beans.",
    "I love green eggs, ham, sausages, and bacon!",
    "The brown fox is quick and the blue dog is lazy!",
    "The sky is very blue and the sky is very beautiful today.",
    "Sausages, bacon, ham, eggs, and toast make for a great breakfast!",
    "The lazy dog loves the beautiful sky on a sunny day."
]

# Preprocess the text
def preprocess(text):
    # Lowercase and tokenize the text
    tokens = nltk.word_tokenize(text.lower())
    return ' '.join(tokens)

# Preprocess all documents
preprocessed_docs = [preprocess(doc) for doc in documents]

# Initialize the TF-IDF Vectorizer
vectorizer = TfidfVectorizer()

# Compute TF-IDF matrix for the corpus (document set)
tfidf_matrix = vectorizer.fit_transform(preprocessed_docs)

# Function to perform document ranking based on query
def rank_documents(query, tfidf_matrix, vectorizer):
    # Preprocess the query
    query = preprocess(query)

    # Transform the query into the same TF-IDF space as the documents
    query_vec = vectorizer.transform([query])

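    # Compute cosine similarity between the query and all the documents.
    # TfidfVectorizer L2-normalizes each row by default, so the dot product
    # of two rows is already their cosine similarity.
    cosine_similarities = (query_vec @ tfidf_matrix.T).toarray().flatten()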

    # Rank documents based on similarity score
    ranked_doc_indices = cosine_similarities.argsort()[::-1]

    return ranked_doc_indices, cosine_similarities

# Test the system with a query
if __name__ == "__main__":
    query = "blue sky"
    ranked_indices, similarities = rank_documents(query, tfidf_matrix, vectorizer)

    print(f"Query: {query}")
    print("\nRanked Documents Based on Relevance:")

    for idx in ranked_indices:
        print(f"Document {idx+1} (Similarity: {similarities[idx]:.4f}): {documents[idx]}")


Query: blue sky

Ranked Documents Based on Relevance:
Document 1 (Similarity: 0.5977): The sky is blue and beautiful.
Document 2 (Similarity: 0.5133): Love this blue and beautiful sky!
Document 7 (Similarity: 0.4105): The sky is very blue and the sky is very beautiful today.
Document 9 (Similarity: 0.1703): The lazy dog loves the beautiful sky on a sunny day.
Document 6 (Similarity: 0.1691): The brown fox is quick and the blue dog is lazy!
Document 8 (Similarity: 0.0000): Sausages, bacon, ham, eggs, and toast make for a great breakfast!
Document 5 (Similarity: 0.0000): I love green eggs, ham, sausages, and bacon!
Document 4 (Similarity: 0.0000): A king's breakfast has sausages, ham, bacon, eggs, toast and beans.
Document 3 (Similarity: 0.0000): The quick brown fox jumps over the lazy dog.


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
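
To make the scoring less of a black box, the weights can be computed by hand on a toy corpus. This sketch assumes the scikit-learn defaults (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization) and uses plain whitespace tokenization; `toy_docs`, `build_idf`, `tfidf_vector`, and `dot` are hypothetical names introduced only for this illustration.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, not the documents used in the cell above).
toy_docs = ["the sky is blue", "the fox is quick", "blue sky blue sea"]

def build_idf(tokenized_docs):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    n = len(tokenized_docs)
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}

def tfidf_vector(tokens, idf):
    """L2-normalized TF-IDF weights; terms outside the vocabulary are ignored."""
    tf = Counter(t for t in tokens if t in idf)
    weights = {t: count * idf[t] for t, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}

def dot(u, v):
    """Dot product of two sparse vectors stored as dicts."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

tokenized = [doc.split() for doc in toy_docs]
idf = build_idf(tokenized)
doc_vectors = [tfidf_vector(tokens, idf) for tokens in tokenized]
query_vector = tfidf_vector("blue sky".split(), idf)

# Vectors are unit length, so the dot product equals the cosine similarity.
scores = [dot(query_vector, vec) for vec in doc_vectors]
ranking = sorted(range(len(toy_docs)), key=lambda i: scores[i], reverse=True)
for i in ranking:
    print(f"Document {i + 1} (Similarity: {scores[i]:.4f}): {toy_docs[i]}")
```

The third toy document ranks first because "blue" occurs twice in it, which raises its term frequency; the second shares no term with the query and scores zero.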


21.	Create a Python program that performs syntax-driven semantic analysis by extracting noun phrases and their meanings from a sentence.

In [None]:
import spacy

# Load the English model
nlp = spacy.load('en_core_web_sm')

def extract_noun_phrases_and_meanings(sentence):
    # Process the sentence with spaCy
    doc = nlp(sentence)

    noun_phrases = []
    entity_meanings = {}

    # Extract noun phrases
    for chunk in doc.noun_chunks:
        noun_phrases.append(chunk.text)

    # Extract entities and their meanings (using Named Entity Recognition)
    for ent in doc.ents:
        entity_meanings[ent.text] = ent.label_

    return noun_phrases, entity_meanings

# Test the function
if __name__ == "__main__":
    sentence = "Apple is looking at buying a U.K. startup for $1 billion."

    noun_phrases, entity_meanings = extract_noun_phrases_and_meanings(sentence)

    print(f"Sentence: {sentence}")
    print("\nExtracted Noun Phrases:")
    # Use a name other than `np` so the NumPy alias from earlier cells is not shadowed
    for phrase in noun_phrases:
        print(f"- {phrase}")

    print("\nExtracted Meanings (Named Entities):")
    for entity, meaning in entity_meanings.items():
        print(f"- {entity}: {meaning}")



Sentence: Apple is looking at buying a U.K. startup for $1 billion.

Extracted Noun Phrases:
- Apple
- a U.K. startup

Extracted Meanings (Named Entities):
- Apple: ORG
- U.K.: GPE
- $1 billion: MONEY
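
spaCy derives `noun_chunks` from its dependency parse, but the basic idea of a noun phrase as "optional modifiers followed by nouns" can be sketched over part-of-speech tags alone. The example below is a simplification under stated assumptions: the tags are assigned by hand (hypothetical, in the Universal POS style), and `MODIFIERS`, `HEADS`, and `chunk_noun_phrases` are illustrative names, not part of spaCy.

```python
# Hand-tagged tokens (hypothetical tags in the Universal POS style).
tagged = [
    ("Apple", "PROPN"), ("is", "AUX"), ("looking", "VERB"), ("at", "ADP"),
    ("buying", "VERB"), ("a", "DET"), ("U.K.", "PROPN"), ("startup", "NOUN"),
    ("for", "ADP"), ("$", "SYM"), ("1", "NUM"), ("billion", "NUM"), (".", "PUNCT"),
]

MODIFIERS = {"DET", "ADJ"}
HEADS = {"NOUN", "PROPN"}

def chunk_noun_phrases(tagged_tokens):
    """Greedy chunker: optional DET/ADJ modifiers followed by one or more nouns."""
    phrases, current, has_head = [], [], False
    for word, tag in tagged_tokens:
        if tag in HEADS:
            current.append(word)
            has_head = True
        elif tag in MODIFIERS and not has_head:
            current.append(word)
        else:
            if has_head:
                phrases.append(" ".join(current))
            # Start over; a modifier right after a finished phrase opens a new one.
            current = [word] if tag in MODIFIERS else []
            has_head = False
    if has_head:
        phrases.append(" ".join(current))
    return phrases

print(chunk_noun_phrases(tagged))
```

A tag pattern like this misses constructions that a parser handles, such as prepositional attachments or coordinated nouns, which is one reason to prefer the parser-based chunks in practice.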


22.	Create a Python program that performs reference resolution within a text.

In [None]:
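import re
import stanza  # Requires: pip install stanza

# Download the English NLP model
stanza.download('en')

# Initialize the pipeline with coreference resolution enabled
nlp = stanza.Pipeline('en', processors='tokenize,pos,lemma,depparse,ner,constituency,coref')

# Function to perform coreference resolution
def resolve_coreferences(text):
    doc = nlp(text)
    resolved_text = text
    # Each entry of doc.coref is a chain of mentions that refer to the same entity
    for chain in doc.coref:
        representative = chain.representative_text
        for mention in chain.mentions:
            # A mention is a word span [start_word, end_word) within one sentence
            words = doc.sentences[mention.sentence].words[mention.start_word:mention.end_word]
            mention_text = ' '.join(word.text for word in words)
            if mention_text != representative:
                # Whole-word match so that "he" does not also rewrite "the".
                # Every occurrence of the mention string is replaced, not only this span.
                pattern = rf'\b{re.escape(mention_text)}\b'
                resolved_text = re.sub(pattern, lambda _: representative, resolved_text)
    return resolved_text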

# Sample text
text = """
John went to the store. He bought some apples. Then, he went home to cook dinner.
Mary also bought apples. She loves cooking.
"""

# Get resolved text
resolved_text = resolve_coreferences(text)

print("Original Text:")
print(text)
print("\nResolved Text:")
print(resolved_text)


ModuleNotFoundError: No module named 'stanza'
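
The Stanza cell above depends on the `stanza` package and a model download, so the kind of output that reference resolution produces is easier to see with a deliberately naive heuristic: replace each pronoun with the most recently mentioned name of matching gender. This is only a sketch under strong assumptions, not how a neural coreference model works; `NAME_GENDER`, `PRONOUN_GENDER`, and `resolve_pronouns` are hypothetical names, and the lexicons cover just the sample text.

```python
import re

# Hypothetical lexicons; a real system learns these associations from data.
NAME_GENDER = {"John": "m", "Mary": "f"}
PRONOUN_GENDER = {"he": "m", "she": "f"}

def resolve_pronouns(text):
    """Replace he/she with the most recently mentioned name of matching gender."""
    last_seen = {}  # gender -> most recent name
    resolved = []
    # Split into words and the non-word text between them so spacing is preserved.
    for piece in re.findall(r"\w+|\W+", text):
        if piece in NAME_GENDER:
            last_seen[NAME_GENDER[piece]] = piece
        gender = PRONOUN_GENDER.get(piece.lower())
        # Keep the pronoun unchanged when no antecedent has been seen yet.
        resolved.append(last_seen.get(gender, piece) if gender else piece)
    return "".join(resolved)

sample = ("John went to the store. He bought some apples. "
          "Mary also bought apples. She loves cooking.")
print(resolve_pronouns(sample))
```

The heuristic breaks as soon as two candidates share a gender or the antecedent is a common noun, which is the gap that trained coreference models are meant to close.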

23.	Develop a Python program that evaluates the coherence of a given text.

In [None]:
import nltk
import numpy as np
import torch
from transformers import BertTokenizer, BertModel
from sklearn.metrics.pairwise import cosine_similarity

# Download required NLTK resources
nltk.download('punkt')

# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Function to get BERT embeddings for a sentence
def get_sentence_embedding(sentence):
    inputs = tokenizer(sentence, return_tensors='pt', truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).squeeze()

# Function to evaluate coherence using sentence similarity
def evaluate_coherence(text):
    # Split the text into sentences
    sentences = nltk.sent_tokenize(text)

    # Get embeddings for each sentence
    sentence_embeddings = [get_sentence_embedding(sentence).numpy() for sentence in sentences]

    # Calculate coherence by measuring cosine similarity between consecutive sentences
    similarities = []
    for i in range(len(sentence_embeddings) - 1):
        similarity = cosine_similarity([sentence_embeddings[i]], [sentence_embeddings[i+1]])[0][0]
        similarities.append(similarity)

    # Return the average similarity as a coherence score
    if len(similarities) == 0:
        return 0  # For texts with a single sentence

    coherence_score = np.mean(similarities)
    return coherence_score

# Sample text for coherence evaluation
text = """
John went to the store. He bought some apples. After that, he went home and made dinner.
The sky was cloudy, and it looked like it would rain. Mary decided to carry an umbrella just in case.
"""

# Evaluate coherence
coherence_score = evaluate_coherence(text)

print(f"Coherence Score: {coherence_score:.2f}")


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Coherence Score: 0.68
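
The scoring step above (average cosine similarity of consecutive sentences) does not depend on BERT itself. As a rough illustration, the sketch below swaps the BERT embeddings for plain bag-of-words counts so that the arithmetic is easy to follow; `bow`, `cosine`, and `adjacent_coherence` are hypothetical helper names, and word-count vectors are an assumption made for this example, not the method used in the cell above.

```python
import math
import re
from collections import Counter

def bow(sentence):
    """Bag-of-words term counts for one sentence."""
    return Counter(re.findall(r"[a-z']+", sentence.lower()))

def cosine(u, v):
    """Cosine similarity between two Counter vectors."""
    dot = sum(count * v[term] for term, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def adjacent_coherence(sentences):
    """Mean cosine similarity of consecutive sentence pairs (0.0 if fewer than two)."""
    vectors = [bow(s) for s in sentences]
    pairs = list(zip(vectors, vectors[1:]))
    if not pairs:
        return 0.0
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)

on_topic = ["the cat sat on the mat", "the cat slept on the mat"]
off_topic = ["the cat sat on the mat", "stock prices fell sharply today"]
print(f"On-topic pair:  {adjacent_coherence(on_topic):.3f}")
print(f"Off-topic pair: {adjacent_coherence(off_topic):.3f}")
```

Word overlap only rewards repeated vocabulary, whereas contextual embeddings can also credit paraphrases; the averaging logic is the same in both cases.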


24.	Create a Python program that recognizes dialog acts in a given dialog or conversation.

In [None]:
import nltk
import torch
from transformers import BertTokenizer, BertForSequenceClassification
from transformers import pipeline

# Download NLTK data
nltk.download('punkt')

# Define dialog act categories (example categories)
dialog_act_labels = ["statement", "question", "command", "acknowledgment", "other"]

# Load pre-trained BERT tokenizer and model. The classification head is newly
# initialized, so predictions are arbitrary until the model is fine-tuned on
# labeled dialog-act data.
id2label = dict(enumerate(dialog_act_labels))
label2id = {label: i for i, label in id2label.items()}
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=len(dialog_act_labels),
    id2label=id2label,
    label2id=label2id,
)

# Build the classification pipeline once; by default it returns the top label per input
act_recognizer = pipeline("text-classification", model=model, tokenizer=tokenizer)

# Function to classify dialog acts
def classify_dialog_acts(dialog):
    sentences = nltk.sent_tokenize(dialog)  # Split the dialog into sentences

    dialog_acts = {}
    for sentence in sentences:
        result = act_recognizer(sentence)[0]  # {'label': ..., 'score': ...}
        # The label is already a dialog-act name because id2label is set on the model
        dialog_acts[sentence] = result['label']

    return dialog_acts

# Sample conversation
conversation = """
Hi, can you help me with my order?
Yes, I can assist you.
I would like to return a product.
Sure, I will initiate the return process.
Thank you for your help!
"""

# Classify dialog acts in the conversation
dialog_acts = classify_dialog_acts(conversation)

# Output the dialog acts
for sentence, act in dialog_acts.items():
    print(f"Sentence: {sentence}\nDialog Act: {act}\n")


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


TypeError: list indices must be integers or slices, not str
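
Since the BERT classifier above has an untrained classification head (see the warning in its output), a rule-based baseline is a useful sanity check for the same five example labels. The sketch below is a heuristic under stated assumptions: `classify_dialog_act` and its cue-word lists are hypothetical and chosen only to cover the sample conversation.

```python
import re

def classify_dialog_act(utterance):
    """Assign one of the five example labels using simple surface cues."""
    text = utterance.strip().lower()
    if text.endswith("?"):
        return "question"
    if re.match(r"(yes|sure|okay|ok|thanks|thank you)\b", text):
        return "acknowledgment"
    if re.match(r"(please|go|send|stop|open|close)\b", text):
        return "command"
    if text:
        return "statement"
    return "other"

conversation = [
    "Hi, can you help me with my order?",
    "Yes, I can assist you.",
    "I would like to return a product.",
    "Sure, I will initiate the return process.",
    "Thank you for your help!",
]

for utterance in conversation:
    print(f"Sentence: {utterance}\nDialog Act: {classify_dialog_act(utterance)}\n")
```

Rules like these are brittle (an indirect request such as "I wonder if you could help" is tagged as a statement), but they give a reference point against which a fine-tuned model can be compared.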

25.	Utilize the GPT-3 model to generate text based on a given prompt. Make sure to install the OpenAI Python library before running the implementation.

In [None]:
!pip install openai



Collecting openai
  Downloading openai-1.47.1-py3-none-any.whl.metadata (24 kB)
Collecting httpx<1,>=0.23.0 (from openai)
  Downloading httpx-0.27.2-py3-none-any.whl.metadata (7.1 kB)
Collecting jiter<1,>=0.4.0 (from openai)
  Downloading jiter-0.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.6 kB)
Collecting httpcore==1.* (from httpx<1,>=0.23.0->openai)
  Downloading httpcore-1.0.5-py3-none-any.whl.metadata (20 kB)
Collecting h11<0.15,>=0.13 (from httpcore==1.*->httpx<1,>=0.23.0->openai)
  Downloading h11-0.14.0-py3-none-any.whl.metadata (8.2 kB)
Downloading openai-1.47.1-py3-none-any.whl (375 kB)
Downloading httpx-0.27.2-py3-none-any.whl (76 kB)
Downloading httpcore-1.0.5-py3-none-any.whl (77 kB)

In [None]:
from openai import OpenAI

# Create a client with your OpenAI API key (requires openai>=1.0, where the
# module-level openai.Completion interface no longer exists)
client = OpenAI(api_key='your-api-key-here')

# Function to generate text based on a prompt
def generate_text(prompt, max_tokens=100, temperature=0.7):
    response = client.completions.create(
        model="gpt-3.5-turbo-instruct",  # Completions model; text-davinci-003 has been retired
        prompt=prompt,
        max_tokens=max_tokens,  # Maximum number of tokens to generate
        temperature=temperature,  # Sampling temperature (0.0 to 2.0); higher is more random
        n=1,  # Number of completions to generate
        stop=None  # When to stop generating text (can define custom stopping tokens)
    )

    return response.choices[0].text.strip()

# Example prompt
prompt = "Write a short story about a robot that learns to love."

# Generate text using the completions model
generated_text = generate_text(prompt)

print("Generated Text:")
print(generated_text)


APIRemovedInV1: 

You tried to access openai.Completion, but this is no longer supported in openai>=1.0.0 - see the README at https://github.com/openai/openai-python for the API.

You can run `openai migrate` to automatically upgrade your codebase to use the 1.0.0 interface. 

Alternatively, you can pin your installation to the old version, e.g. `pip install openai==0.28`

A detailed migration guide is available here: https://github.com/openai/openai-python/discussions/742


26.	Implement a machine translation program in Python that uses the Hugging Face Transformers library to translate English text to French.

In [None]:
from transformers import MarianMTModel, MarianTokenizer

# Load the pre-trained model and tokenizer
model_name = "Helsinki-NLP/opus-mt-en-fr"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Function to translate English text to French
def translate_to_french(text):
    # Tokenize the input text by calling the tokenizer directly
    # (prepare_seq2seq_batch is deprecated)
    tokenized_text = tokenizer([text], return_tensors="pt", padding=True)

    # Generate translation
    translation_tokens = model.generate(**tokenized_text)

    # Decode the tokens back to a string
    translated_text = tokenizer.decode(translation_tokens[0], skip_special_tokens=True)

    return translated_text

# Example usage
english_text = "Hello, how are you today?"
french_translation = translate_to_french(english_text)

print(f"English: {english_text}")
print(f"French: {french_translation}")



`prepare_seq2seq_batch` is deprecated and will be removed in version 5 of HuggingFace Transformers. Use the regular
`__call__` method to prepare your inputs and targets.

Here is a short example:

model_inputs = tokenizer(src_texts, text_target=tgt_texts, ...)

If you either need to use different keyword arguments for the source and target texts, you should do two calls like
this:

model_inputs = tokenizer(src_texts, ...)
labels = tokenizer(text_target=tgt_texts, ...)
model_inputs["labels"] = labels["input_ids"]

See the documentation of your specific tokenizer for more details on the specific arguments to the tokenizer of choice.
For a more complete example, see the implementation of `prepare_seq2seq_batch`.



English: Hello, how are you today?
French: Bonjour, comment allez-vous aujourd'hui?
