# Linguistic Analysis in Natural Language Processing (NLP)

## 1. Lexical Processing

#### Description:
Lexical processing involves analyzing individual words in a text. It focuses on recognizing and understanding words as discrete units of meaning (lexemes), identifying their properties, such as part of speech, base forms, and inflections.

#### Key Tasks:
1. **Tokenization**: Splitting text into words, phrases, or meaningful units (tokens).
2. **Lemmatization and Stemming**: Reducing words to their base or root form.
   - **Example**:
     - **Lemmatization**: "running" → "run", "studies" → "study"
     - **Stemming**: "running" → "run", "studies" → "studi" (a stem need not be a dictionary word)
3. **Part-of-Speech (POS) Tagging**: Assigning a grammatical category to each word (e.g., noun, verb, adjective).
4. **Spell Checking**: Identifying and correcting misspelled words.
5. **Word Recognition**: Differentiating valid words from non-words.

### Example:
Consider the sentence: `"Cats are running."`
- **Lexical Processing identifies**:
  - "Cats" as a plural noun
  - "are" as an auxiliary verb
  - "running" as a verb (present participle)



In [3]:
import nltk
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\Manikanta\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt_tab.zip.


True

In [5]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Manikanta\AppData\Roaming\nltk_data...


True

In [7]:
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     C:\Users\Manikanta\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger_eng.zip.


True

In [8]:
### Code Example (Tokenization, Lemmatization, Stemming, POS Tagging):

# Importing necessary libraries for lexical processing
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Example Sentence
sentence = "Cats are running around the park."

# Tokenization
tokens = word_tokenize(sentence)
print("Tokens:", tokens)

# Lemmatization
# Note: lemmatize() treats every token as a noun (pos='n') by default and is
# case-sensitive, so 'Cats' and 'running' come back unchanged here.
# Lowercasing the token and passing pos='v' for verbs yields 'cat' and 'run'.
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]
print("Lemmatized Tokens:", lemmatized_tokens)

# Stemming (the Porter stemmer lowercases its input before stripping suffixes)
stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(token) for token in tokens]
print("Stemmed Tokens:", stemmed_tokens)

# Part-of-Speech (POS) Tagging
pos_tags = nltk.pos_tag(tokens)
print("POS Tags:", pos_tags)


Tokens: ['Cats', 'are', 'running', 'around', 'the', 'park', '.']
Lemmatized Tokens: ['Cats', 'are', 'running', 'around', 'the', 'park', '.']
Stemmed Tokens: ['cat', 'are', 'run', 'around', 'the', 'park', '.']
POS Tags: [('Cats', 'NNS'), ('are', 'VBP'), ('running', 'VBG'), ('around', 'IN'), ('the', 'DT'), ('park', 'NN'), ('.', '.')]


## 2. Syntactic Processing

#### Description:
Syntactic processing focuses on the structure of sentences, ensuring that the arrangement of words follows the rules of grammar. It involves analyzing the grammatical structure of a sentence to check if it is syntactically valid.

#### Key Tasks:
1. **Parsing**:
    - Analyzing sentences to identify their grammatical structure.
2. **Parse Tree**:
    - A hierarchical tree structure that represents the syntactic structure of a sentence.
3.  **Dependency Parsing**:
    - Identifying dependencies between words in a sentence (e.g., subject-verb relationships).
4. **Syntax Error Detection**:
    - Identifying incorrect grammatical structures.

- **Example**:
    -   Sentence: "He goes to park"
        -   Error: Missing "the" before "park" (should be "the park").


##### Code Example (Parse Tree Generation):

In [23]:
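# Define a simple grammar for parsing
import nltk
from nltk import CFG

# Define the grammar.
# "VP -> V" allows intransitive sentences such as "the cat sat"; with only
# "VP -> V NP" that sentence has no parse and the loop below prints nothing.
grammar = CFG.fromstring("""
  S -> NP VP
  NP -> Det N
  VP -> V NP | V
  Det -> 'the' | 'a'
  N -> 'cat' | 'mat'
  V -> 'sat' | 'ran'
""")

# Create a parser
parser = nltk.ChartParser(grammar)

# Parse a sentence
sentence_parse = 'the cat sat'.split()
print("Executing sentence parse:", sentence_parse)

try:
    for tree in parser.parse(sentence_parse):
        print(tree)
        # tree.pretty_print()  # uncomment for an ASCII-art rendering of the tree
except Exception as e:
    # parse() raises an error if the sentence contains words not covered by the grammar
    print(f"Error: {e}")



Executing sentence parse: ['the', 'cat', 'sat']
(S (NP (Det the) (N cat)) (VP (V sat)))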


## 3. Semantic Processing

#### Description:
Semantic processing deals with the meaning of words and sentences. It extracts and represents the meaning of the text, resolving ambiguities and capturing relationships between concepts.

#### Key Tasks:
1. **Word Sense Disambiguation (WSD)**: Identifying the correct meaning of a word based on context.
    - **Example**:
        - "bank" → Financial institution (in "He went to the bank").
        - "bank" → Riverbank (in "He sat by the bank").
2. **Named Entity Recognition (NER)**: Identifying named entities and their categories (e.g., people, organizations, locations).
    - **Example**:
        - "Apple" → Organization
        - "Paris" → Location
3. **Semantic Role Labeling (SRL)**: Identifying the semantic role each phrase plays relative to the verb (e.g., agent, recipient, theme).
    - **Example**:
        - Sentence: "John gave Mary a gift."
        - Roles: John = agent (giver), Mary = recipient, a gift = theme (the thing given).
4. **Coreference Resolution**: Linking pronouns and phrases to their referents.
    - **Example**:
        - Sentence: "The cat sat on the mat. It was fluffy."
        - "It" refers to "The cat."
5. **Relationship Extraction**: Identifying relationships between entities.
    - **Example**:
        - Sentence: "Barack Obama was born in Hawaii."
        - Extracted relationship: (Barack Obama, born in, Hawaii).


##### Code Example (Named Entity Recognition):

In [11]:
pip install nltk

Note: you may need to restart the kernel to use updated packages.


In [25]:
import nltk
nltk.download('maxent_ne_chunker_tab')

[nltk_data] Downloading package maxent_ne_chunker_tab to
[nltk_data]     C:\Users\Manikanta\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping chunkers\maxent_ne_chunker_tab.zip.


True

In [26]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
from nltk.chunk import ne_chunk

# Ensure nltk_data path is set (machine-specific; adjust or remove on other systems)
nltk.data.path.append("C:\\Users\\Manikanta\\AppData\\Roaming\\nltk_data")

# Download necessary resources
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

# Named Entity Recognition example:
# tokenize, POS-tag, then chunk the tagged tokens into named entities
sentence_ner = "Apple is looking at buying U.K. startup for $1 billion"
ner_tree = ne_chunk(pos_tag(word_tokenize(sentence_ner)))
print("Named Entity Recognition:")
print(ner_tree)
# Note: in the output below, NLTK's default chunker labels "Apple" as GPE
# (geo-political entity) rather than ORGANIZATION and leaves "U.K." untagged,
# so its predictions should be checked rather than trusted blindly.


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Manikanta\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Manikanta\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     C:\Users\Manikanta\AppData\Roaming\nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     C:\Users\Manikanta\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!


Named Entity Recognition:
(S
  (GPE Apple/NNP)
  is/VBZ
  looking/VBG
  at/IN
  buying/VBG
  U.K./NNP
  startup/NN
  for/IN
  $/$
  1/CD
  billion/CD)


## 4. Unicode Standards

#### Description:
Unicode is a universal character encoding standard that represents text from most of the world's writing systems. It allows for the encoding of text in a machine-readable format (binary data) that can be stored, transmitted, or processed.

#### How Encoding Works After Linguistic Processing:
1. **Input Text**: 
    -   Processed text (e.g., tokens, syntactic structures) is prepared for encoding.
    - **Example**:  
        -   "John gave a book to Mary."
2. **Unicode Mapping**: 
    -   Each character is mapped to a unique Unicode code point.
    - **Example**:  
        - "J" → U+004A
        - "o" → U+006F
        - " " (space) → U+0020
3. **Encoding Format**: 
    - The Unicode code points are converted into a specific encoding format, such as:
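        - **UTF-8**: Variable-length encoding (1 to 4 bytes per code point); ASCII-compatible and widely used.
        - **UTF-16**: Variable-length encoding (2 or 4 bytes per code point); characters outside the Basic Multilingual Plane are stored as surrogate pairs.
        - **UTF-32**: Fixed-length encoding; uses 4 bytes for every code point.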
4. **Output**: 
    - Encoded text ready for storage or transmission.
    - **Example**: (UTF-8 encoding for "John")
        - J → 01001010
        - o → 01101111
        - h → 01101000
        - n → 01101110
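
The mapping and encoding steps above can be checked directly in Python. The following is a minimal sketch using only built-ins (`ord`, `str.encode`, `format`); the helper names `code_point` and `utf8_bits` are illustrative and not part of any library.

```python
def code_point(ch):
    """Return the Unicode code point of one character in U+XXXX notation."""
    return f"U+{ord(ch):04X}"


def utf8_bits(ch):
    """Return the UTF-8 bytes of one character as space-separated 8-bit strings."""
    return " ".join(format(byte, "08b") for byte in ch.encode("utf-8"))


# ASCII characters occupy a single byte in UTF-8
for ch in "John":
    print(ch, code_point(ch), utf8_bits(ch))

# A non-ASCII character such as the rupee sign needs several bytes
print("₹", code_point("₹"), utf8_bits("₹"))
```

For ASCII characters the single UTF-8 byte equals the code point, which is why "J" (U+004A) appears as `01001010`.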



##### Code Example (Encoding and Decoding):

In [27]:
# Demonstrating how text encoding works
amount = "₹50"  # Python 3 strings are Unicode by default, so no u"" prefix is needed
print('Default string:', amount, '\n', 'Type of string:', type(amount), '\n')

# Encode to UTF-8 byte format
amount_encoded = amount.encode('utf-8')
print('Encoded to UTF-8:', amount_encoded, '\n', 'Type of string:', type(amount_encoded), '\n')

# Decode from UTF-8 byte format
amount_decoded = amount_encoded.decode('utf-8')
print('Decoded from UTF-8:', amount_decoded, '\n', 'Type of string:', type(amount_decoded), '\n')

# Encoding format examples
# Note: Python's 'utf-16' and 'utf-32' codecs prepend a byte order mark (BOM),
# which is why those results start with b'\xff\xfe'.
print("\nEncoding formats example:")
text = "John"

print(f"UTF-8 encoding for 'John': {text.encode('utf-8')}")
print(f"UTF-16 encoding for 'John': {text.encode('utf-16')}")
print(f"UTF-32 encoding for 'John': {text.encode('utf-32')}")


Default string: ₹50 
 Type of string: <class 'str'> 

Encoded to UTF-8: b'\xe2\x82\xb950' 
 Type of string: <class 'bytes'> 

Decoded from UTF-8: ₹50 
 Type of string: <class 'str'> 


Encoding formats example:
UTF-8 encoding for 'John': b'John'
UTF-16 encoding for 'John': b'\xff\xfeJ\x00o\x00h\x00n\x00'
UTF-32 encoding for 'John': b'\xff\xfe\x00\x00J\x00\x00\x00o\x00\x00\x00h\x00\x00\x00n\x00\x00\x00'


## Word Frequencies
#### Definition:
The count of how often each word appears in a given text.
#### Why It's Important:
-   Helps identify commonly used terms.
-   Forms the basis for techniques like Term Frequency-Inverse Document Frequency (TF-IDF).
#### Applications:
-   Text summarization.
-   Identifying trends in data (e.g., word clouds).
- **Example** :
    -   Text: "The quick brown fox jumps over the lazy dog."
    -   Frequencies:
        - "the" → 2 (counting "The" and "the" together, i.e. after lowercasing)
        - "quick" → 1
        - "brown" → 1
        - ...
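
The counts in this example can be reproduced with the standard library. This is a minimal sketch that assumes lowercasing and a simple regular-expression tokenizer; the function name `word_frequencies` is illustrative.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count case-insensitive word occurrences, ignoring punctuation."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


freqs = word_frequencies("The quick brown fox jumps over the lazy dog.")
print(freqs.most_common(3))
```

Without the `lower()` call, "The" and "the" would be counted as two different words.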

## Stop Words
#### Definition:
Commonly used words (e.g., "the," "is," "and") that often carry little meaning and are filtered out during text analysis.
#### Why It's Important:
-   Reduces noise in text processing.
-   Improves the efficiency of algorithms by focusing on meaningful words.
#### Applications:
-   Search engines.
-   Text summarization and sentiment analysis.
- **Example** :
    - English Stop Words: "a," "an," "the," "in," "on," "and," etc.
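
Stop-word removal amounts to filtering tokens against a set. The sketch below uses a small hand-written set for illustration only; real lists, such as the one in `nltk.corpus.stopwords`, are much longer. The names `STOP_WORDS` and `remove_stop_words` are assumptions made for this example.

```python
# Illustrative subset only; not a complete English stop-word list.
STOP_WORDS = {"a", "an", "the", "in", "on", "and", "is", "are"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens whose lowercase form is not a stop word."""
    return [token for token in tokens if token.lower() not in stop_words]


tokens = "The cat is on the mat".split()
filtered = remove_stop_words(tokens)
print(filtered)  # ['cat', 'mat']
```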

## Bag of Words
#### Definition:
Bag of Words (BoW) is a simple and commonly used natural language processing (NLP) technique for text representation. 
It transforms a text into a fixed-size vector of numbers by focusing on the frequency of words, ignoring their order or grammar. 
#### Bag of words matrix
It converts text into numerical features by creating a matrix where each row represents a document (or text instance) and each column corresponds to a word from the vocabulary. 
The values in the matrix indicate the frequency (or sometimes binary presence) of the word in the respective document.
#### Steps in Creating a Bag of Words Representation
#### Tokenization:

Split the text into individual words (tokens).
- **Example**: "The cat sat on the mat" → ["The", "cat", "sat", "on", "the", "mat"].
#### Normalization (Optional):

Convert all words to lowercase and remove punctuation.
- **Example**: ["The", "cat", "sat", "on", "the", "mat"] → ["the", "cat", "sat", "on", "the", "mat"].
#### Vocabulary Creation:

Create a vocabulary (unique list of words) from all the tokens in the dataset.
- **Example**: ["the", "cat", "sat", "on", "mat"].
#### Vectorization:

Represent each document (or sentence) as a vector based on the frequency of words in the vocabulary.
- **Example**: "The cat sat on the mat":
    - Vocabulary: ["the", "cat", "sat", "on", "mat"]
    - Vector: [2, 1, 1, 1, 1]
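
The four steps can be sketched in plain Python as follows. This is a simplified illustration: it assumes the documents contain no punctuation, and it keeps the vocabulary in order of first appearance to match the example above (libraries such as scikit-learn's `CountVectorizer` sort it alphabetically instead). The second document and all function names are chosen for this example.

```python
from collections import Counter


def tokenize(text):
    """Tokenization + normalization: lowercase and split on whitespace."""
    return text.lower().split()


def build_vocabulary(documents):
    """Vocabulary creation: unique tokens in order of first appearance."""
    vocab = []
    for doc in documents:
        for token in tokenize(doc):
            if token not in vocab:
                vocab.append(token)
    return vocab


def vectorize(document, vocab):
    """Vectorization: count how often each vocabulary word occurs."""
    counts = Counter(tokenize(document))
    return [counts[word] for word in vocab]


documents = ["The cat sat on the mat", "The dog sat on the log"]
vocab = build_vocabulary(documents)
matrix = [vectorize(doc, vocab) for doc in documents]

print(vocab)
for row in matrix:
    print(row)
```

Each row of `matrix` is one document and each column one vocabulary word, which is the Bag of Words matrix described earlier.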



### Stemming
#### **Definition**:
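Stemming is the process of stripping affixes (in practice mostly suffixes) from a word using heuristic rules to reduce it to a stem, without considering the word's meaning or part of speech.
- **Example**:
    - Words: running, runs
    - Stem: run
    - Words: studies, studied
    - Stem: studi
    - Irregular forms such as "ran" are left unchanged, because a stemmer only applies suffix rules.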

#### Popular Algorithms:
1. **Porter Stemmer**:
   - Published by Martin Porter in 1980; it works only on English words.
2. **Snowball Stemmer**:
   - A more versatile stemmer that works on English as well as other languages such as French, German, Italian, Finnish, and Russian.
3. **Lancaster Stemmer**:
   - A more aggressive stemmer (based on the Paice/Husk algorithm) that often produces shorter stems than Porter.

#### **Limitation**:
- The root word produced by stemming may not always be a valid word in the language.
- It focuses on form rather than meaning.
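
To make the limitation concrete, here is a deliberately tiny suffix-stripping sketch. It is a toy written for this illustration and is not the Porter, Snowball, or Lancaster algorithm; the rule table and the name `toy_stem` are assumptions.

```python
# Toy rules for illustration: (suffix to strip, replacement)
SUFFIX_RULES = (("ies", "i"), ("ied", "i"), ("ing", ""), ("ed", ""), ("s", ""))


def toy_stem(word):
    """Strip the first matching suffix; purely form-based, no dictionary."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        # Require at least three characters to remain before the suffix
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: len(word) - len(suffix)] + replacement
            # Collapse a doubled final consonant: "runn" -> "run"
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word


for w in ["running", "studies", "studied", "cats", "ran"]:
    print(w, "->", toy_stem(w))
```

The stem "studi" is not an English word, and "ran" is not mapped to "run": both follow from working on form rather than meaning.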

### Lemmatization
#### **Definition**:
- Lemmatization reduces a word to its base or dictionary form (lemma), ensuring that the root word is meaningful.
- WordNet is a lexical database; NLTK's `WordNetLemmatizer` uses it to lemmatize words in Python.
- Lemmatization is more expensive than stemming, since it relies on dictionary lookups and usually needs the word's part of speech.
- **Example**:
    - Words: running, ran
    - Lemma: run
    - Words: better
    - Lemma: good (considering semantics)
#### **Advantage**:
- It produces valid dictionary words, making it more suitable for NLP tasks like text understanding.

#### Comparison
![image.png](attachment:c29b5f46-eccf-454f-9e7d-20819fe82686.png)

### **TF-IDF Representation**
#### **Definition:**
TF-IDF stands for Term Frequency-Inverse Document Frequency, and it's a statistical measure used in text processing to evaluate the importance of a word in a document relative to a collection of documents (corpus).
#### **Components of TF-IDF**:
1. **Term Frequency (TF)**:
   - Measures how frequently a term occurs in a document.
   - Formula :
         ![image.png](attachment:e0589e88-c3d3-4f8c-9681-032780cd2c54.png)
2. **Inverse Document Frequency (IDF)**:
   - Measures how important a term is by reducing the weight of terms that occur in many documents.
   - Formula:
         ![image.png](attachment:01eaa0e4-58f2-40c1-bf9b-270344ec3b22.png)
3. **TF-IDF Score**:
   - Combines TF and IDF to compute the importance of a term.
   - Formula:
         ![image.png](attachment:628115a8-ad37-4eb9-bbf2-fc6842b0f326.png)
#### **Steps to Compute TF-IDF**
1. Tokenize the documents to split them into words or terms.
2. Compute TF for each term in each document.
3. Compute IDF for each term based on the entire corpus.
4. Multiply TF and IDF to get the TF-IDF score for each term.
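
For reference, one common textbook form of the three quantities is given below. The notation is an assumption made here: $f_{t,d}$ is the number of times term $t$ occurs in document $d$, $N$ is the number of documents in the corpus, and $\mathrm{df}(t)$ is the number of documents containing $t$. Variants differ in the logarithm base and in smoothing, so the formula images above may use a slightly different form.

$$
\mathrm{TF}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{IDF}(t) = \log\frac{N}{\mathrm{df}(t)}, \qquad
\text{TF-IDF}(t, d) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t)
$$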
   

In [30]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample documents
documents = [
    "The cat sat on the mat",
    "The dog sat on the log",
    "The bird flew over the tree"
]

# Create TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer()

# Compute TF-IDF matrix
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)

# Get feature names (terms)
feature_names = tfidf_vectorizer.get_feature_names_out()

# Convert TF-IDF matrix to array
tfidf_array = tfidf_matrix.toarray()

# Print TF-IDF representation
print("TF-IDF Matrix:")
print(tfidf_array)

# Print feature names
print("Feature Names:")
print(feature_names)


TF-IDF Matrix:
[[0.         0.46869865 0.         0.         0.         0.46869865
  0.3564574  0.         0.3564574  0.55364194 0.        ]
 [0.         0.         0.46869865 0.         0.46869865 0.
  0.3564574  0.         0.3564574  0.55364194 0.        ]
 [0.4305185  0.         0.         0.4305185  0.         0.
  0.         0.4305185  0.         0.50854232 0.4305185 ]]
Feature Names:
['bird' 'cat' 'dog' 'flew' 'log' 'mat' 'on' 'over' 'sat' 'the' 'tree']
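
The values above do not come from the plain TF × IDF product. With default settings, `TfidfVectorizer` uses raw term counts for TF, a smoothed IDF of `ln((1 + N) / (1 + df)) + 1`, and then scales each document vector to unit L2 length. The sketch below reproduces the first row by hand under those defaults; splitting on whitespace is an assumption that matches scikit-learn's tokenizer only because every word here has at least two characters.

```python
import math
from collections import Counter

documents = [
    "The cat sat on the mat",
    "The dog sat on the log",
    "The bird flew over the tree",
]

tokenized = [doc.lower().split() for doc in documents]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears
df = Counter(term for tokens in tokenized for term in set(tokens))


def smoothed_idf(term):
    """Smoothed IDF as used by scikit-learn when smooth_idf=True (the default)."""
    return math.log((1 + n_docs) / (1 + df[term])) + 1


def tfidf_vector(tokens):
    """Raw count times smoothed IDF, then L2-normalized."""
    counts = Counter(tokens)
    weights = {term: count * smoothed_idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


first = tfidf_vector(tokenized[0])
for term in sorted(first):
    print(f"{term}: {first[term]:.8f}")
```

"the" occurs in every document, so its IDF is the minimum value of 1; it still receives the largest weight in the first row because it appears twice in that document.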


### **Canonicalisation**
- Canonicalization (or canonicalisation) is the process of converting data or representations into a standardized, normalized, or "canonical" form.
- This ensures consistency, removes ambiguity, and makes it easier to compare, analyze, or process data.

**Applications of Canonicalization**
1. **In Web Development (SEO)**:
    - Ensures that search engines treat different URLs as the same page.
    - **Example**:
        - http://example.com, https://example.com, and http://example.com/index.html might all point to the same page.
        - Canonicalization ensures that search engines recognize these as the same content.
2. **In Natural Language Processing (NLP)**:
    - Text preprocessing step to standardize words.
    - **Example**:
        - Words like "running," "ran," and "runs" might be canonicalized to "run" (through lemmatization or stemming).
3. **In File Systems and Software**:
    - Standardizes file paths or URLs.
    - **Example**:
        - Resolving relative paths like ../folder/file.txt to their absolute form.
4. **In Cryptography**:
    - Ensures that data is in a consistent format before applying cryptographic operations.
    - **Example**:
        - Converting XML documents to a canonical form to ensure digital signatures are consistent.
5. **In Machine Learning (Data Cleaning)**:
    - Helps unify variations in data.
    - **Example**:
        - Unifying "NY," "New York," and "N.Y." into a single representation: "New York."

**Why is Canonicalization Important?**
1. **Consistency**: Makes processing easier by ensuring all inputs are in the same format.
2. **Efficiency**: Reduces duplicate representations in databases or systems.
3. **Security**: Prevents attacks based on variations in input formats (e.g., path traversal attacks).
4. **Improved Accuracy**: Essential in tasks like search indexing, text processing, and data comparison.
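
The data-cleaning case ("NY", "N.Y.", "New York") can be sketched as a normalization step followed by a lookup. The alias table and the function name `canonicalize` are hypothetical and would need to be built for the data at hand; `unicodedata.normalize` and `str.casefold` are standard-library calls.

```python
import unicodedata

# Hypothetical alias table: normalized key -> canonical form
ALIASES = {
    "ny": "New York",
    "n.y.": "New York",
    "new york": "New York",
}


def canonicalize(value):
    """Normalize Unicode form, case, and whitespace, then map known aliases."""
    key = unicodedata.normalize("NFKC", value).casefold()
    key = " ".join(key.split())  # trim and collapse runs of whitespace
    return ALIASES.get(key, value.strip())


for raw in ["NY", " N.Y. ", "new   york", "Boston"]:
    print(repr(raw), "->", canonicalize(raw))
```

Values that are not in the table are returned unchanged apart from trimming, so unknown inputs are not silently rewritten.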


### **Phonetic Hashing**
- Phonetic Hashing is a technique used to encode words or names into a short code based on their pronunciation rather than their exact spelling.
- Similar-sounding words or names are meant to produce the same (or a very similar) code, making the technique useful for tasks like fuzzy matching, spell-checking, or searching in databases.

#### **Common Algorithms for Phonetic Hashing**
1. **Soundex**:
    - One of the earliest phonetic algorithms.
    - Encodes English names to match phonetically similar words.
    - **Example**:
        - Robert and Rupert → R163
        - Ashcraft and Ashcroft → A261
2. **Metaphone**:

    - An improvement over Soundex, focusing on more accurate phonetic encoding.
    - Handles a wider range of English phonetics.
    - **Example**:
        - Smith → SM0
        - Schmidt → XMT
3. **Double Metaphone**:

    - An enhancement of Metaphone that produces two keys: one for the primary pronunciation and another for an alternative pronunciation.
    - Useful for multi-lingual and diverse pronunciation scenarios.
    - **Example**:
        - Smith → SM0
        - Schmidt → SMT
4. **NYSIIS** (New York State Identification and Intelligence System):

    - Focuses on standardizing spelling variations of names.
    - **Example**:
        - MacDonald → MCANAL
        - McDonald → MCANAL
5. **Caverphone**:

    - Developed for matching names in New Zealand.
    - Focuses on better handling of non-English phonetics.
                                                                                                    
#### **How Phonetic Hashing Works**

1. **Preprocessing**:
- Convert input text to uppercase or lowercase for uniformity.
- Remove unwanted characters like punctuation or numbers.

2. **Encoding Rules**:
- Replace specific patterns of letters with standardized phonetic representations.
- Remove silent letters or vowels where appropriate.

3. **Generate Hash**:
- Combine the phonetic codes into a short key (fixed-length for Soundex, variable-length for algorithms such as Metaphone).
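
The three steps can be illustrated with American Soundex, the simplest of the algorithms listed above. This is a minimal sketch that assumes plain ASCII names; the function name `soundex` and the table name `CODES` are chosen for this example.

```python
# Consonant groups that share a Soundex digit; vowels, h, w and y get no digit
CODES = {}
for digit, letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                       "4": "l", "5": "mn", "6": "r"}.items():
    for letter in letters:
        CODES[letter] = digit


def soundex(name):
    """Return the four-character American Soundex code of a name."""
    # Preprocessing: lowercase and keep letters only
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = CODES.get(first, "")  # the first letter's code also blocks repeats
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeated code is emitted again
    # Generate hash: first letter + digits, padded with zeros to length 4
    return (first.upper() + "".join(digits) + "000")[:4]


for name in ["Robert", "Rupert", "Ashcraft", "Ashcroft"]:
    print(name, "->", soundex(name))
```

"Robert" and "Rupert" both map to R163, and "Ashcraft" and "Ashcroft" both map to A261, matching the Soundex examples given earlier.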


#### **Applications of Phonetic Hashing**

1. **Database Searching**:
- Helps search for names or terms with varying spellings (e.g., "Smith" vs. "Smyth").

2. **Spell-Checking**:
- Suggests corrections for misspelled words based on phonetic similarity.

3. **Record Linkage**:
- Identifies matching records in datasets with slight variations in spelling.

4. **Fraud Detection**:
- Matches names with similar pronunciations to identify duplicates.

5. **Linguistics**:
- Analyzes phonetic patterns in large text corpora.



### **Edit Distance**
- Edit Distance, also known as Levenshtein Distance, is a measure of the minimum number of single-character edits required to convert one string into another. 
- These edits include:
1. **Insertion**: Adding a character.
2. **Deletion**: Removing a character.
3. **Substitution**: Replacing one character with another.
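
The definition translates into a short dynamic-programming routine (the Wagner-Fischer approach). The sketch below keeps only two rows of the table at a time; the function name `levenshtein` is chosen for this example.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution (or match)
            ))
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

"kitten" becomes "sitting" in three edits: substitute k→s, substitute e→i, and insert g.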

#### **Applications of Edit Distance**
- **Spell Checkers**: Suggest corrections for misspelled words.
- **Plagiarism Detection**: Compare documents for similarity.
- **DNA Sequence Alignment**: Compare genetic sequences.
- **Natural Language Processing**: Text similarity analysis.


