
Exp 1 - Study preprocessing of text


Theory:
Preprocessing text means bringing it into a form that is predictable and analyzable for your task, where a task is a combination of approach and domain.
Machine learning needs data in numeric form. Encoding techniques such as Bag of Words, n-grams (including bigrams), TF-IDF, and Word2Vec turn text into numeric vectors. Before encoding, the text must be cleaned. This preparation step is called text preprocessing, and it is the first step in solving most NLP problems.
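
To make "encoding text into numeric vectors" concrete, here is a minimal Bag of Words sketch using only the standard library. The function name `bag_of_words` and the two sample documents are made up for illustration; a real pipeline would clean the text first, as described in the rest of this experiment.

```python
from collections import Counter

def bag_of_words(documents):
    """Return (vocabulary, vectors); each vector holds per-word counts."""
    # Lowercase and split on whitespace (no cleaning yet, for brevity).
    tokenized = [doc.lower().split() for doc in documents]
    # A sorted vocabulary gives every word a fixed column position.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

docs = ["the cat sat", "the cat saw the dog"]
vocabulary, vectors = bag_of_words(docs)
print(vocabulary)  # ['cat', 'dog', 'sat', 'saw', 'the']
print(vectors)     # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Each document becomes one row, and each column counts one vocabulary word. Stopwords such as "the" dominate the counts, which is one reason they are usually removed before encoding.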


Tokenization:
Tokenization splits strings of text into smaller pieces, or “tokens”. Paragraphs can be tokenized into sentences, and sentences can be tokenized into words.
Filtration:
Filtration removes material that adds noise to the analysis. For simple word counts or word clouds, stopwords are among the most frequently occurring words but carry little information, so we are often better off removing them from the text.


Algo:

1. **Input Text**: Start with your raw text data.

2. **Tokenization**: Split the text into words or tokens.

3. **Filtration**: Clean the text by removing special characters and lowercasing.

4. **Script Validation**: Ensure the text is in the correct script or language.

5. **Stop Word Removal**: Eliminate common words like "a" and "the."

6. **Stemming**: Reduce words to their root form.

7. **Output Preprocessed Text**: Save the cleaned text for analysis.

8. **Evaluation**: Assess the impact of preprocessing on your analysis.

9. **Iterate and Experiment**: Fine-tune preprocessing based on your specific task and data.

In [None]:
# Import necessary libraries
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

# Sample text
text = "Text preprocessing is an important step in natural language processing. It involves tokenization, filtration, script validation, stop word removal, and stemming."

# Tokenization: Split the text into words or tokens
tokens = word_tokenize(text)

# Filtration: Remove non-alphanumeric characters and convert to lowercase.
# re.sub() replaces every character in the token that is not a letter (a-z, A-Z)
# or a digit (0-9) with an empty string ''. Punctuation-only tokens become ''.
filtered_tokens = [re.sub(r'[^a-zA-Z0-9]', '', token).lower() for token in tokens]

# Script Validation: Keep only tokens made entirely of Latin letters (a-z, A-Z).
# This also drops the empty strings left by filtration and any token that contains digits.
latin_tokens = [token for token in filtered_tokens if re.match(r'^[a-zA-Z]+$', token)]

# Stop Word Removal: Remove common stop words
stop_words = set(stopwords.words('english'))
filtered_tokens_no_stop = [token for token in latin_tokens if token not in stop_words]

# Stemming: Reduce words to their root form using Porter Stemmer
stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(token) for token in filtered_tokens_no_stop]

# Display the results
print("Original Text:")
print(text)
print("\nTokenization:")
print(tokens)
print("\nFiltration:")
print(filtered_tokens)
print("\nScript Validation:")
print(latin_tokens)
print("\nStop Word Removal:")
print(filtered_tokens_no_stop)
print("\nStemming:")
print(stemmed_tokens)


Original Text:
Text preprocessing is an important step in natural language processing. It involves tokenization, filtration, script validation, stop word removal, and stemming.

Tokenization:
['Text', 'preprocessing', 'is', 'an', 'important', 'step', 'in', 'natural', 'language', 'processing', '.', 'It', 'involves', 'tokenization', ',', 'filtration', ',', 'script', 'validation', ',', 'stop', 'word', 'removal', ',', 'and', 'stemming', '.']

Filtration:
['text', 'preprocessing', 'is', 'an', 'important', 'step', 'in', 'natural', 'language', 'processing', '', 'it', 'involves', 'tokenization', '', 'filtration', '', 'script', 'validation', '', 'stop', 'word', 'removal', '', 'and', 'stemming', '']

Script Validation:
['text', 'preprocessing', 'is', 'an', 'important', 'step', 'in', 'natural', 'language', 'processing', 'it', 'involves', 'tokenization', 'filtration', 'script', 'validation', 'stop', 'word', 'removal', 'and', 'stemming']

Stop Word Removal:
['text', 'preprocessing', 'important', 's

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.



Exp 2 - Study Morphological Analysis


Theory:
Morphological analysis examines each word individually. Non-word tokens such as
punctuation are separated out, and the remaining tokens are assigned categories. Consider the
sentence: Ram’s iPhone cannot convert the video from .mkv to .mp4. The sentence is analyzed word by
word: Ram is a proper noun, the ’s in Ram’s is a possessive suffix, and
.mkv and .mp4 are file extensions.
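
The example sentence can be turned into a small rule-based sketch. The function `analyze_token` and its three regular-expression rules are illustrative assumptions, not a real morphological analyzer; for instance, the possessive rule would also fire on contractions such as "it's".

```python
import re

def analyze_token(token):
    """Return a coarse category for one token (toy rules only)."""
    # A leading dot followed by letters or digits is treated as a file extension.
    if re.fullmatch(r"\.[A-Za-z0-9]+", token):
        return {"token": token, "category": "file extension"}
    # A trailing 's (straight or curly apostrophe) is split off as a possessive suffix.
    match = re.fullmatch(r"(.+?)(['’]s)", token)
    if match:
        return {"token": token, "stem": match.group(1),
                "suffix": match.group(2), "category": "possessive"}
    # Tokens made only of punctuation are non-word tokens.
    if re.fullmatch(r"[^\w\s]+", token):
        return {"token": token, "category": "punctuation"}
    return {"token": token, "category": "word"}

for tok in ["Ram's", "iPhone", ".mkv", ".mp4", "."]:
    print(analyze_token(tok))
```

The rules are checked in order, so ".mkv" is recognized as an extension before the punctuation rule is tried.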

Algo:

1. **Input Word**: Start with a word you want to analyze.

2. **Tokenization**: If the input text contains multiple words, split it into individual words.

3. **Lemmatization**: Reduce the word to its base or dictionary form (lemma). This helps to handle inflections and variants of the word.

4. **Part-of-Speech Tagging**: Assign the appropriate part-of-speech tag (e.g., noun, verb, adjective) to each word.

5. **Morphological Analysis**: Analyze the word's morphology, including its gender, number, tense, case, and other linguistic features.

6. **Output Results**: Store or display the results of the morphological analysis, which may include the lemma, part-of-speech, and specific morphological features.

7. **Iterate and Experiment**: Depending on your linguistic analysis goals, you may need to adapt the analysis steps and rules to different languages or specific text corpora.

Morphological analysis is particularly important in natural language processing and computational linguistics to understand the structure and meaning of words in a given language. The specific tools and libraries you use for these tasks will depend on the programming language and NLP frameworks you're working with.

In [None]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

nltk.download('punkt')
nltk.download('wordnet')

# Sample text
text = "Morphological analysis involves breaking down words into their constituent morphemes."

# Tokenization: Split the text into words
tokens = word_tokenize(text)

# Initialize a stemmer and lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Stemming: Reduce words to their root form using Porter Stemmer
stemmed_tokens = [stemmer.stem(token) for token in tokens]

# Lemmatization: Reduce words to their base or dictionary form
lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]

# Display the results
print("Original Text:")
print(text)
print("\nTokenization:")
print(tokens)
print("\nStemming:")
print(stemmed_tokens)
print("\nLemmatization:")
print(lemmatized_tokens)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Original Text:
Morphological analysis involves breaking down words into their constituent morphemes.

Tokenization:
['Morphological', 'analysis', 'involves', 'breaking', 'down', 'words', 'into', 'their', 'constituent', 'morphemes', '.']

Stemming:
['morpholog', 'analysi', 'involv', 'break', 'down', 'word', 'into', 'their', 'constitu', 'morphem', '.']

Lemmatization:
['Morphological', 'analysis', 'involves', 'breaking', 'down', 'word', 'into', 'their', 'constituent', 'morpheme', '.']


Exp 3 - Study N-gram

Theory:

Given a sequence of N-1 words, an N-gram model predicts the most probable word that might follow
this sequence. It's a probabilistic model that's trained on a corpus of text. Such a model is useful in many
NLP applications including speech recognition, machine translation and predictive text input.
An N-gram model is built by counting how often word sequences occur in corpus text and then
estimating the probabilities. Since a simple N-gram model has limitations, improvements are often made
via smoothing, interpolation and backoff. An N-gram model is one type of language model (LM),
which assigns a probability distribution over word sequences.
Consider two sentences: "There was heavy rain" vs. "There was heavy flood". From experience, we know
that the former sentence sounds better. An N-gram model will tell us that "heavy rain" occurs much
more often than "heavy flood" in the training corpus. Thus, the first sentence is more probable and will
be selected by the model.
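
The "heavy rain" vs. "heavy flood" comparison can be sketched as a bigram model built by counting. The four-sentence corpus below is invented for illustration, and add-one (Laplace) smoothing is used as one simple instance of the smoothing mentioned above.

```python
from collections import Counter

# Tiny made-up training corpus.
corpus = [
    "there was heavy rain",
    "heavy rain fell all night",
    "the heavy rain stopped",
    "there was a heavy flood",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    words = sentence.split()
    unigram_counts.update(words)
    bigram_counts.update(zip(words, words[1:]))

vocab_size = len(unigram_counts)

def bigram_prob(prev, word, smoothing=1.0):
    """Estimate P(word | prev) with add-k smoothing (k=1 is Laplace smoothing)."""
    return (bigram_counts[(prev, word)] + smoothing) / (
        unigram_counts[prev] + smoothing * vocab_size
    )

print(bigram_prob("heavy", "rain"))
print(bigram_prob("heavy", "flood"))
print(bigram_prob("heavy", "rain") > bigram_prob("heavy", "flood"))  # True
```

Because of smoothing, a bigram that never occurs in the corpus still receives a small non-zero probability instead of zero.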

Algo

In [None]:
# Import the necessary library
from nltk.util import ngrams
from nltk.tokenize import word_tokenize

# Sample text
text = "N-grams are a sequence of items in a text or speech."

# Tokenize the text into words
tokens = word_tokenize(text)

# Function to generate N-grams from a list of tokens
def generate_ngrams(tokens, n):
    n_grams = ngrams(tokens, n)
    return [' '.join(gram) for gram in n_grams]

# Specify the value of N for N-grams
n = 3  # You can change this value to generate different N-grams (e.g., 2 for bigrams, 4 for 4-grams, etc.)

# Generate N-grams
ngram_list = generate_ngrams(tokens, n)

# Display the results
print(f"Original Text: {text}")
print(f"{n}-grams:")
for ngram in ngram_list:
    print(ngram)


Original Text: N-grams are a sequence of items in a text or speech.
3-grams:
N-grams are a
are a sequence
a sequence of
sequence of items
of items in
items in a
in a text
a text or
text or speech
or speech .


Exp 4 - POS Tagging

Theory: POS tagging converts a sentence from a list of words into a list of tuples, where each tuple has the
form (word, tag). The tag is a part-of-speech tag that signifies whether the word is a noun,
adjective, verb, and so on.
Default tagging is a baseline for part-of-speech tagging. It is performed using NLTK's DefaultTagger
class, which takes a single argument: the tag to assign to every token. NN is the tag for a singular noun.
DefaultTagger is most useful when it is given the most common part-of-speech tag, which is why
a noun tag is recommended.
Tagging is a kind of classification: the automatic assignment of a descriptor to each
token. The descriptor is called a tag, and it may represent a part of speech, semantic information,
and so on.

Algo:

Input:
A sentence or text document.

Tokenization:
Begin by tokenizing the input text into words or terms. This can be done using simple whitespace-based tokenization or more advanced techniques, such as regular expressions.

Preprocessing:
Remove any punctuation, special characters, or unwanted symbols from the tokens to ensure that only words or meaningful tokens are processed.

Initialization:
Initialize an empty list to store the POS tags for each word in the input text.

Tag Dictionary:
Use a pre-built dictionary or lexicon that maps words to their likely POS tags. This dictionary can be based on language-specific rules, training data, or established databases like WordNet.

POS Tagging:
Iterate through the tokens in the input text, one by one.
For each token, consult the tag dictionary to determine its probable POS tag based on its context.
Apply context-based rules, if available, to disambiguate POS tags when a word may have multiple possible tags.
Add the determined POS tag to the list created in the initialization step.

Output:
After processing all tokens in the input text, you will have a list of words with their corresponding POS tags. This list represents the POS-tagged text.
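
The dictionary-lookup steps above, combined with a default noun tag for unknown words, can be sketched without any library. The `TAG_DICTIONARY` contents and the function name `lookup_tag` are hypothetical; the tag names follow the Penn Treebank convention used in the output further down. This sketch ignores context, so it cannot disambiguate words with several possible tags.

```python
import re

# Hypothetical hand-made tag dictionary (Penn Treebank tag names).
TAG_DICTIONARY = {
    "the": "DT", "a": "DT", "an": "DT",
    "is": "VBZ", "in": "IN",
    "quick": "JJ", "lazy": "JJ",
}
DEFAULT_TAG = "NN"  # fall back to the singular-noun tag for unknown words

def lookup_tag(text):
    """Tokenize, drop punctuation, then tag by dictionary lookup with a default."""
    # Keep alphabetic words, allowing internal hyphens (e.g. part-of-speech).
    tokens = re.findall(r"[A-Za-z]+(?:-[A-Za-z]+)*", text)
    return [(tok, TAG_DICTIONARY.get(tok.lower(), DEFAULT_TAG)) for tok in tokens]

print(lookup_tag("The quick fox is in a box."))
# [('The', 'DT'), ('quick', 'JJ'), ('fox', 'NN'), ('is', 'VBZ'), ('in', 'IN'), ('a', 'DT'), ('box', 'NN')]
```

Words missing from the dictionary ("fox", "box") receive the default NN tag, which is the same idea as default tagging described in the theory.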



In [None]:
# Import the necessary library
import nltk
nltk.download('averaged_perceptron_tagger')

# Sample text
text = "Part-of-speech tagging is an essential task in natural language processing."

# Tokenize the text into words
tokens = nltk.word_tokenize(text)

# Perform POS tagging
pos_tags = nltk.pos_tag(tokens)

# Display the results
print("Original Text:")
print(text)
print("\nPOS Tags:")
for word, tag in pos_tags:
    print(f"{word}: {tag}")


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


Original Text:
Part-of-speech tagging is an essential task in natural language processing.

POS Tags:
Part-of-speech: JJ
tagging: NN
is: VBZ
an: DT
essential: JJ
task: NN
in: IN
natural: JJ
language: NN
processing: NN
.: .


Exp 5 - Chunking

Theory: Chunk extraction, or partial parsing, is the process of extracting short, meaningful phrases from a sentence that has been tagged with parts of speech. Chunks are made up of words, and the kinds of words are defined using part-of-speech tags. One can also define patterns for words that must not be part of a chunk; such words are known as chinks. A ChunkRule specifies the tag pattern to include in a chunk, and chinks are excluded by a separate rule.
Defining chunk patterns: Chunk patterns are regular expressions modified to match sequences of part-of-speech tags. Angle brackets specify an individual tag, for example <NN> to match a noun tag, and multiple tags can be listed in the same way. The phrases “chunk up” and “chunk down” belong to a separate, conversational sense of the word rather than to the parsing technique above: chunking up moves toward more abstract language, where agreement is more likely, while chunking down looks for the specific details that were missing at the more abstract level.

Algo:
Input:
A sentence or text document with part-of-speech (POS)-tagged words.

Tokenization and POS Tagging:
Begin by tokenizing the input text into words or terms.
Tag each word with its POS (Part-of-Speech) label. This can be done using an existing POS tagging tool or library.

Initialization:
Initialize an empty list to store the chunks.

Define Chunking Patterns:
Define the patterns or rules for chunking based on POS tags. These patterns can include regular expressions or syntactic rules.
For example, a common pattern to extract noun phrases (NP) is to look for sequences of words with the following structure: (Adjective)*(Noun)+.

Chunking:
Iterate through the list of POS-tagged words.
For each word, check if it matches any of the defined chunking patterns.
If a pattern is matched, create a chunk that includes the words that matched the pattern.
Continue this process for the entire sentence, extracting and storing chunks as they are encountered.

Output:
After processing the entire sentence, you will have a list of chunks, each representing a group of related words. These chunks may correspond to noun phrases, verb phrases, or other syntactic structures in the text.
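
The (Adjective)*(Noun)+ pattern from the steps above can be sketched as a plain loop over tagged tokens. The function name `chunk_noun_phrases` and the hand-tagged input are assumptions for illustration; the tags were written by hand rather than produced by a tagger.

```python
def chunk_noun_phrases(tagged):
    """Group (word, tag) pairs matching (JJ)* (NN|NNS)+ into noun-phrase chunks."""
    chunks = []
    i = 0
    while i < len(tagged):
        start = i
        # Optional run of adjectives.
        while i < len(tagged) and tagged[i][1] == "JJ":
            i += 1
        noun_start = i
        # One or more nouns must follow for the pattern to match.
        while i < len(tagged) and tagged[i][1] in ("NN", "NNS"):
            i += 1
        if i > noun_start:
            chunks.append([word for word, _ in tagged[start:i]])
        elif i == start:
            i += 1  # neither adjective nor noun here: skip this token
        # Otherwise the adjectives were not followed by a noun; i is already past them.
    return chunks

# Hand-tagged example sentence.
tagged = [("the", "DT"), ("quick", "JJ"), ("brown", "JJ"), ("fox", "NN"),
          ("jumps", "VBZ"), ("over", "IN"), ("the", "DT"), ("lazy", "JJ"), ("dog", "NN")]
print(chunk_noun_phrases(tagged))  # [['quick', 'brown', 'fox'], ['lazy', 'dog']]
```

Unlike the grammar in the cell below, this pattern has no determiner slot, so "the" is left outside both chunks.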


In [None]:
import nltk

# Sample text
text = "Natural language processing is a subfield of artificial intelligence that deals with the interaction between computers and human language."

# Tokenize the text into words
tokens = nltk.word_tokenize(text)

# Perform part-of-speech tagging
pos_tags = nltk.pos_tag(tokens)

# Define a chunking grammar
grammar = "NP: {<DT>?<JJ>*<NN>}"  # Define a simple grammar for noun phrases (DT: determiner, JJ: adjective, NN: noun)

# Create a chunk parser
chunk_parser = nltk.RegexpParser(grammar)

# Parse the part-of-speech tagged text
chunked_text = chunk_parser.parse(pos_tags)

# Display the results
print("Original Text:")
print(text)
print("\nChunked Text:")
print(chunked_text)


Original Text:
Natural language processing is a subfield of artificial intelligence that deals with the interaction between computers and human language.

Chunked Text:
(S
  (NP Natural/JJ language/NN)
  (NP processing/NN)
  is/VBZ
  (NP a/DT subfield/NN)
  of/IN
  (NP artificial/JJ intelligence/NN)
  that/IN
  deals/NNS
  with/IN
  (NP the/DT interaction/NN)
  between/IN
  computers/NNS
  and/CC
  (NP human/JJ language/NN)
  ./.)


Exp 6 - Named Entity Recognition


Theory: Named Entity Recognition (NER) is a crucial task in Natural Language Processing (NLP) that involves identifying and classifying named entities within text. Named entities are specific entities in text, such as names of persons, organizations, locations, dates, monetary values, percentages, and more. NER is a fundamental component in various NLP applications, including information retrieval, question answering, text summarization, and language understanding.

Algo:
Input:
A text document or sentence.

Tokenization:
Tokenize the input text into words or terms. This is typically the first step in NER.

Part-of-Speech Tagging:
Perform Part-of-Speech tagging on the tokens to determine the grammatical category of each word (e.g., noun, verb, adjective).

Named Entity Recognition:
Use NER models or libraries that have been trained on large annotated datasets to identify named entities in the text.
These models employ techniques such as rule-based methods, machine learning, and deep learning (e.g., Conditional Random Fields, BiLSTM-CRF, or Transformers) to classify words or phrases into predefined categories, such as:
PERSON: Names of people
ORGANIZATION: Names of companies, institutions, or organizations
LOCATION: Names of places, cities, countries, or regions
DATE: Dates and temporal expressions
MONEY: Monetary values
PERCENT: Percentage values
...and other custom categories depending on the application.

Classification:
Assign a category label to each identified named entity. For example, "John Smith" might be labeled as a PERSON, "Apple Inc." as an ORGANIZATION, "New York" as a LOCATION, and "July 12, 2020" as a DATE.

Output:
Return the text with named entities highlighted or labeled according to their categories.
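
As a rough illustration of the rule-based end of the techniques listed above, the sketch below labels entities with a small gazetteer (name list) and a few regular expressions. The `GAZETTEER` entries, the patterns, and the function name `find_entities` are all hypothetical; a trained model would replace these lookups in practice.

```python
import re

# Hypothetical gazetteers; a trained model would replace these lookups.
GAZETTEER = {
    "PERSON": ["John Smith"],
    "ORGANIZATION": ["Apple Inc."],
    "LOCATION": ["New York"],
}
MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")
PATTERNS = {
    "DATE": re.compile(rf"\b(?:{MONTHS}) \d{{1,2}}, \d{{4}}\b"),
    "MONEY": re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?"),
    "PERCENT": re.compile(r"\b\d+(?:\.\d+)?%"),
}

def find_entities(text):
    """Return (start, end, label, surface) tuples sorted by position."""
    entities = []
    for label, names in GAZETTEER.items():
        for name in names:
            for match in re.finditer(re.escape(name), text):
                entities.append((match.start(), match.end(), label, match.group()))
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            entities.append((match.start(), match.end(), label, match.group()))
    return sorted(entities)

text = "John Smith joined Apple Inc. in New York on July 12, 2020 for $1,500, a 5% raise."
for start, end, label, surface in find_entities(text):
    print(f"{label}: {surface}")
```

A fixed list like this cannot recognize names it has never seen, which is the main reason statistical and neural models are preferred for real NER.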

In [None]:
import nltk

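# Download the NLTK data for tokenization and POS tagging.
# This cell covers only the tokenization and POS-tagging steps; it does not label named entities.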
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

# Sample text
text = "The quick brown fox jumps over the lazy dog."

# Tokenize the text
tokens = nltk.word_tokenize(text)

# Perform POS tagging
pos_tags = nltk.pos_tag(tokens)

# Print the POS tags
for word, pos_tag in pos_tags:
    print(f"{word}: {pos_tag}")


The: DT
quick: JJ
brown: NN
fox: NN
jumps: VBZ
over: IN
the: DT
lazy: JJ
dog: NN
.: .


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


Exp 7 - Create a list, append to it, and remove from it


Theory:

In programming, you often need to work with lists, which are ordered collections of elements. Two fundamental operations are appending elements to a list and removing elements from a list.

Appending to a List:

Appending means adding an element to the end of a list.
This operation takes amortized constant time, because the new element is simply placed after the existing ones.

Removing from a List:

Removing an element from a list can be done based on the index (position) of the element or based on the value.
Removal by index typically involves shifting the elements after the removed one a position to the left to fill the gap.
Removal by value searches for the first matching element and then removes it.


Algorithm for Appending to a List:

Input:

A list lst.
An element element to be appended.

Algorithm:

Add element to the end of the list lst.
The list is now updated with the new element at the end.

Algorithm for Removing from a List by Index:

Input:

A list lst.
An index index representing the position of the element to be removed.
Algorithm:

Check if index is within the valid range (0 to len(lst) - 1). If not, handle the out-of-bounds case if needed.
If the index is valid, remove the element at that index from the list.
Shift the elements to the right of the removed element one position to the left to fill the gap.
The list is now updated with the specified element removed.

Algorithm for Removing from a List by Value:

Input:

A list lst.
A value value to be removed.
Algorithm:

Search the list to find the first occurrence of value.
If found, remove the element from the list.
Shift the elements to the right of the removed element one position to the left to fill the gap.
The list is now updated with the specified value removed.
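
The out-of-bounds and not-found cases mentioned above can also be handled by catching the exceptions Python raises, instead of checking beforehand. The helper names `remove_value` and `remove_index` are made up for this sketch.

```python
def remove_value(lst, value):
    """Remove the first occurrence of value; return True if something was removed."""
    try:
        lst.remove(value)  # raises ValueError when value is absent
        return True
    except ValueError:
        return False

def remove_index(lst, index):
    """Pop and return the element at index, or None when index is out of range."""
    try:
        return lst.pop(index)  # raises IndexError when index is out of range
    except IndexError:
        return None

numbers = [10, 20, 30, 20]
print(remove_value(numbers, 20), numbers)  # True [10, 30, 20]
print(remove_value(numbers, 99), numbers)  # False [10, 30, 20]
print(remove_index(numbers, 5), numbers)   # None [10, 30, 20]
print(remove_index(numbers, 0), numbers)   # 10 [30, 20]
```

Note that `pop` accepts negative indices (counting from the end), and that returning None is ambiguous if the list itself may contain None.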


In [None]:
# Create an empty list
my_list = []

# Append items to the list
my_list.append(10)
my_list.append(20)
my_list.append(30)

# Display the list
print("Initial List:", my_list)

# Remove an item by value
value_to_remove = 20
if value_to_remove in my_list:
    my_list.remove(value_to_remove)

# Display the list after removal
print("List after removing", value_to_remove, ":", my_list)

# Remove an item by index
index_to_remove = 1
if 0 <= index_to_remove < len(my_list):
    removed_item = my_list.pop(index_to_remove)
    print("Removed item at index", index_to_remove, ":", removed_item)

# Display the list after removal by index
print("List after removal by index:", my_list)



# ------ method 2
myList = [10,20,30,40,50]

value_to_remove = 40
for i in range(len(myList)):
    if value_to_remove == myList[i]:
        myList.remove(value_to_remove)
        print(f'{value_to_remove} removed! New list- \n')
        print(myList)
        break

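# Remove by index: use index_to_remove itself, not the leftover loop variable i
index_to_remove = 2
if 0 <= index_to_remove < len(myList):
    myList.pop(index_to_remove)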

print(myList)


Initial List: [10, 20, 30]
List after removing 20 : [10, 30]
Removed item at index 1 : 30
List after removal by index: [10]


Exp 8 - Parsing and Context-Free Grammar

Theory:

Context-Free Grammar (CFG) Theory:
Context-Free Grammar is a formalism used in linguistics and computer science to describe the syntax of a language. It consists of a set of production rules that define how sentences or phrases can be constructed in the language.

Parsing:
Parsing is the process of analyzing a sequence of symbols (usually text or code) to determine its grammatical structure with respect to a given formal grammar, such as a CFG. Parsing is a crucial step in language understanding, as it enables computers to interpret and process human language or programming code.

Algo:

Initialization:
Create an empty stack for parsing.
Push the start symbol S onto the stack.
Initialize a pointer at the beginning of the input string.

Parsing Loop:
Repeat the following until the stack is empty or the input is fully consumed:
If the top of the stack is a non-terminal symbol A:
    Consult the CFG rules to find a production rule A → β that is compatible with the current input symbol.
    Pop A and push the symbols of β in reverse order, so that the leftmost symbol of β ends up on top.
If the top of the stack is a terminal symbol that matches the current input symbol:
    Pop the stack and advance the input pointer.
If the top of the stack is a terminal symbol that does not match, parsing fails (or backtracks to try another rule).

Completion Check:
If the stack is empty and the input is fully consumed, parsing is successful.
If the stack still contains symbols or the input is not fully consumed, parsing fails.

Output:
If parsing is successful, you can construct a parse tree or extract information about the structure of the input. If parsing fails, report an error or provide diagnostic information.
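
The stack-based procedure above can be sketched as a small top-down recognizer with backtracking. The dictionary `GRAMMAR` encodes the same rules as the NLTK grammar in the next cell; the function name `recognize` and the agenda of pending (stack, position) states are choices made for this sketch. It only answers yes or no and does not build a parse tree, and it would loop forever on a left-recursive grammar.

```python
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "N"], ["I"]],
    "VP": [["V", "NP"]],
    "Det": [["the"], ["an"]],
    "N": [["cat"], ["dog"]],
    "V": [["chased"], ["saw"]],
}

def recognize(tokens, start="S"):
    """Return True if tokens can be derived from start (top-down, with backtracking)."""
    # Each agenda entry is (stack, position); the top of the stack is the last item.
    agenda = [([start], 0)]
    while agenda:
        stack, pos = agenda.pop()
        if not stack:
            if pos == len(tokens):
                return True  # stack empty and input fully consumed
            continue
        top, rest = stack[-1], stack[:-1]
        if top in GRAMMAR:
            # Non-terminal: try each production, pushing its symbols in reverse
            # so the leftmost symbol ends up on top of the stack.
            for production in GRAMMAR[top]:
                agenda.append((rest + production[::-1], pos))
        elif pos < len(tokens) and tokens[pos] == top:
            # Terminal matching the current input token: pop it and advance.
            agenda.append((rest, pos + 1))
        # A non-matching terminal drops this branch, which is the backtracking step.
    return False

print(recognize("I saw the cat".split()))  # True
print(recognize("saw the cat".split()))    # False
```

Keeping every untried alternative on the agenda is what lets the recognizer recover after choosing NP -> Det N when the input actually starts with "I".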

In [2]:
import nltk
nltk.download('punkt')
# Define a context-free grammar
cfg = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> Det N | 'I'
    VP -> V NP
    Det -> 'the' | 'an'
    N -> 'cat' | 'dog'
    V -> 'chased' | 'saw'
""")

# Create a parser with the defined CFG
parser = nltk.ChartParser(cfg)

# Define a sentence to parse
sentence = "I saw the cat"

# Tokenize the sentence
tokens = nltk.word_tokenize(sentence)

# Parse the sentence
for tree in parser.parse(tokens):
    tree.pretty_print()


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


         S             
  _______|___           
 |           VP        
 |    _______|___       
 |   |           NP    
 |   |        ___|___   
 NP  V      Det      N 
 |   |       |       |  
 I  saw     the     cat



Exp 9 - Implementation of Named Entity Recognition

Theory:

Named Entity Recognition (NER) Theory:

Named Entity Recognition (NER) is a subtask of information extraction that involves identifying and classifying named entities in text into predefined categories such as names of persons, organizations, locations, dates, monetary values, percentages, and more. NER plays a crucial role in various natural language processing applications, including information retrieval, question answering, and language understanding.

Algo:

Data Preparation:
Collect and preprocess your training data, which includes labeled examples of text with named entities tagged (e.g., with entity types like PERSON, ORGANIZATION, LOCATION).

Feature Extraction:
Convert the text data into numerical features that can be used by a machine learning model. Common features include word embeddings, character-level representations, or subword embeddings (e.g., Word2Vec, GloVe, or BERT embeddings).

Model Selection:
Choose a machine learning model for sequence labeling. Common choices include:
Conditional Random Fields (CRF)
Bi-directional LSTM (Long Short-Term Memory) with CRF
Pre-trained Transformer-based models like BERT or GPT-3 fine-tuned for NER.

Training:
Train the selected model using the preprocessed and labeled training data. The model learns to predict named entities within the text.

Testing:
Apply the trained model to new, unlabeled text data to identify named entities. The model assigns entity labels to spans of text.

Post-processing:
Depending on the model output, you may need to post-process the results to group consecutive tokens into entity spans and assign entity types.

Output:
The output will be the text with identified named entities and their associated categories.
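
The post-processing step can be sketched on its own, without a trained model. The example assumes the model emits one BIO tag per token (B- begins an entity, I- continues it, O is outside any entity); this tagging scheme is a common convention but is an assumption here, as are the function name `bio_to_spans` and the hand-written tags.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_text, entity_type) spans."""
    spans = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_type
        )
        if starts_new:
            # A new entity starts (a stray I- tag is treated as a start too).
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-"):
            current_tokens.append(token)  # continue the open entity
        else:
            # "O" closes any open entity.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans

# Hand-written tags standing in for model predictions.
tokens = ["John", "Smith", "works", "at", "Apple", "Inc.", "in", "New", "York"]
tags = ["B-PERSON", "I-PERSON", "O", "O",
        "B-ORGANIZATION", "I-ORGANIZATION", "O", "B-LOCATION", "I-LOCATION"]
print(bio_to_spans(tokens, tags))
# [('John Smith', 'PERSON'), ('Apple Inc.', 'ORGANIZATION'), ('New York', 'LOCATION')]
```

Whatever model is chosen in the earlier steps, its per-token labels would be passed through a grouping step like this one to obtain entity spans.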

In [None]:
import random
import string

def generate_random_word(length=8):
    letters = string.ascii_lowercase
    return ''.join(random.choice(letters) for _ in range(length))

# Generate a random word of 8 characters
random_word = generate_random_word()
print(random_word)


tybqepor
