# Linguistic Analysis in Natural Language Processing (NLP)

## 1. Lexical Processing

#### Description:
Lexical processing analyzes the individual words in a text. It treats words as discrete units of meaning (lexemes) and identifies their properties, such as part of speech, base form, and inflection.

#### Key Tasks:
1. **Tokenization**: Splitting text into words, phrases, or meaningful units (tokens).
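2. **Lemmatization and Stemming**: Reducing words to their base or root form. Lemmatization returns a dictionary form (lemma), while stemming strips suffixes by rule and may return a non-word.
   - **Example**:
     - **Lemmatization**: "running" → "run" (when treated as a verb), "studies" → "study"
     - **Stemming** (Porter): "running" → "run", "studies" → "studi"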
3. **Part-of-Speech (POS) Tagging**: Assigning a grammatical category to each word (e.g., noun, verb, adjective).
4. **Spell Checking**: Identifying and correcting misspelled words.
5. **Word Recognition**: Differentiating valid words from non-words.
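
Word recognition and spell checking from the list above can be sketched with the standard library alone. This is a minimal illustration, not a production spell checker: `VOCABULARY`, `is_known_word`, and `suggest_correction` are hypothetical names introduced here, and the tiny word list stands in for a real dictionary.

```python
import difflib

# Hypothetical toy vocabulary; a real system would load a full dictionary.
VOCABULARY = ["cat", "cats", "are", "run", "running", "park", "the", "around"]

def is_known_word(word, vocabulary=VOCABULARY):
    """Word recognition: True if the lowercased word is in the vocabulary."""
    return word.lower() in vocabulary

def suggest_correction(word, vocabulary=VOCABULARY):
    """Spell checking: return the closest vocabulary word, or None."""
    if is_known_word(word, vocabulary):
        return word.lower()
    # difflib ranks candidates by string similarity (0.0 to 1.0)
    matches = difflib.get_close_matches(word.lower(), vocabulary, n=1, cutoff=0.6)
    return matches[0] if matches else None

for token in ["Cats", "runing", "xyzzy"]:
    print(token, "->", suggest_correction(token))
```

Similarity-based lookup is only one possible design; dictionary-based spell checkers often rank candidates by edit distance and word frequency instead.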

#### Example:
Consider the sentence: `"Cats are running."`
- **Lexical Processing identifies**:
  - "Cats" as a plural noun
  - "are" as an auxiliary verb
  - "running" as a verb (present participle)

#### Code Example (Tokenization, Lemmatization, Stemming, POS Tagging):
```python
# Import the libraries needed for lexical processing
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-time resource downloads; exact resource names vary by NLTK version
for resource in ("punkt", "punkt_tab", "wordnet",
                 "averaged_perceptron_tagger", "averaged_perceptron_tagger_eng"):
    nltk.download(resource, quiet=True)

# Example sentence
sentence = "Cats are running around the park."

# Tokenization
tokens = word_tokenize(sentence)
print("Tokens:", tokens)

# Lemmatization (lemmatize() treats each token as a noun unless pos is given,
# so "running" is only reduced to "run" when tagged as a verb with pos="v")
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]
print("Lemmatized Tokens:", lemmatized_tokens)
print("Verb lemma of 'running':", lemmatizer.lemmatize("running", pos="v"))

# Stemming
stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(token) for token in tokens]
print("Stemmed Tokens:", stemmed_tokens)

# Part-of-Speech (POS) Tagging
pos_tags = nltk.pos_tag(tokens)
print("POS Tags:", pos_tags)
```

## 2. Syntactic Processing

#### Description:
Syntactic processing focuses on the structure of sentences: how words combine into phrases and clauses according to the rules of grammar. It analyzes the grammatical structure of a sentence and can flag sentences that are not syntactically valid.

#### Key Tasks:
1. **Parsing**:
    - Analyzing sentences to identify their grammatical structure.
2. **Parse Tree**:
    - A hierarchical tree structure that represents the syntactic structure of a sentence.
3. **Dependency Parsing**:
    - Identifying dependencies between words in a sentence (e.g., subject-verb relationships).
4. **Syntax Error Detection**:
    - Identifying incorrect grammatical structures.
    - **Example**:
        - Sentence: "He goes to park"
        - Error: Missing determiner "the" before "park" (should be "He goes to the park").
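
The missing-determiner error above can be caught by a very small hand-written recognizer. The sketch below assumes a toy grammar (S → NP VP, NP → Pron | Det N, VP → V | V PP, PP → P NP) and a hypothetical lexicon `LEXICON`; neither comes from a library, and real grammar checkers are far more elaborate.

```python
# Hypothetical toy lexicon mapping each word to a single part of speech
LEXICON = {
    "he": "Pron", "she": "Pron",
    "the": "Det", "a": "Det",
    "park": "N", "cat": "N",
    "goes": "V", "sat": "V",
    "to": "P", "on": "P",
}

def parse_np(tags, i):
    """NP -> Pron | Det N. Return the index just past the NP, or None."""
    if i < len(tags) and tags[i] == "Pron":
        return i + 1
    if i + 1 < len(tags) and tags[i] == "Det" and tags[i + 1] == "N":
        return i + 2
    return None

def parse_vp(tags, i):
    """VP -> V | V PP, where PP -> P NP. Return the index past the VP, or None."""
    if i >= len(tags) or tags[i] != "V":
        return None
    i += 1
    if i < len(tags) and tags[i] == "P":
        return parse_np(tags, i + 1)
    return i

def is_grammatical(sentence):
    """True if the whole sentence matches S -> NP VP under the toy grammar."""
    tags = [LEXICON.get(word) for word in sentence.lower().rstrip(".").split()]
    end = parse_np(tags, 0)
    if end is not None:
        end = parse_vp(tags, end)
    return end == len(tags)

print(is_grammatical("He goes to park"))      # bare noun after "to" is rejected
print(is_grammatical("He goes to the park"))  # determiner present, accepted
```

Words missing from the lexicon are tagged `None`, so any sentence containing them is rejected as well.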


#### Code Example (Parse Tree Generation):

```python
# Define a simple context-free grammar for parsing
import nltk
from nltk import CFG

grammar = CFG.fromstring("""
  S -> NP VP
  NP -> Det N
  VP -> V NP | V
  Det -> 'the' | 'a'
  N -> 'cat' | 'mat'
  V -> 'sat' | 'ran'
""")

# Create a chart parser for the grammar
parser = nltk.ChartParser(grammar)

# Parse a sentence; the "VP -> V" rule lets the intransitive "the cat sat" parse
sentence_parse = 'the cat sat'.split()
for tree in parser.parse(sentence_parse):
    print(tree)
    tree.pretty_print()
```

## 3. Semantic Processing

#### Description:
Semantic processing deals with the meaning of words and sentences. It extracts and represents the meaning of the text, resolving ambiguities and capturing relationships between concepts.

#### Key Tasks:
1. **Word Sense Disambiguation (WSD)**: Identifying the correct meaning of a word based on context.
   - **Example**:
     - "bank" → Financial institution (in "He went to the bank").
     - "bank" → Riverbank (in "He sat by the bank").
2. **Named Entity Recognition (NER)**: Identifying named entities and classifying them into categories (e.g., person, organization, location).
   - **Example**:
     - "Apple" → Organization
     - "Paris" → Location
3. **Semantic Role Labeling (SRL)**: Identifying the semantic role each phrase plays relative to the predicate (e.g., agent, recipient, theme).
   - **Example**:
     - Sentence: "John gave Mary a gift."
     - Roles: John = agent (giver), Mary = recipient, gift = theme (the thing given).
4. **Coreference Resolution**: Linking pronouns and phrases to their referents.
   - **Example**:
     - Sentence: "The cat sat on the mat. It was fluffy."
     - "It" refers to "The cat."
5. **Relationship Extraction**: Identifying relationships between entities.
   - **Example**:
     - Sentence: "Barack Obama was born in Hawaii."
     - Extracted relationship: (Barack Obama, born in, Hawaii).
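
Word sense disambiguation can be illustrated with a simplified, Lesk-style overlap count: pick the sense whose gloss shares the most words with the sentence. The glosses in `SENSES`, the `STOPWORDS` set, and the function names below are hypothetical and hand-written for this sketch; real systems would draw glosses from a lexical resource such as WordNet. The test sentences extend the "bank" examples above with extra context words, since overlap counting needs them.

```python
import re

# Hypothetical hand-written glosses for the two senses of "bank"
SENSES = {
    "bank": {
        "financial institution": "an institution where people deposit money take loans and open an account",
        "riverbank": "the sloping land beside a river stream or lake where water flows",
    }
}
# Hypothetical minimal stopword list
STOPWORDS = {"a", "an", "the", "to", "by", "of", "and", "or", "he", "where"}

def content_words(text):
    """Lowercase the text and return its words minus stopwords."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS

def disambiguate(word, sentence):
    """Return the sense whose gloss overlaps most with the sentence."""
    context = content_words(sentence)
    best_sense, best_overlap = None, -1
    for sense, gloss in SENSES[word].items():
        overlap = len(context & content_words(gloss))
        if overlap > best_overlap:  # ties keep the first-listed sense
            best_sense, best_overlap = sense, overlap
    return best_sense

print(disambiguate("bank", "He went to the bank to deposit money"))
print(disambiguate("bank", "He sat by the bank of the river"))
```

With no overlapping context words at all, this sketch simply falls back to the first-listed sense, which is one reason practical WSD systems rely on richer context models.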


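#### Code Example (Named Entity Recognition):

```python
# Named Entity Recognition (NER) with NLTK (install with: pip install nltk)
import nltk
from nltk import ne_chunk
from nltk.tokenize import word_tokenize

# One-time resource downloads; exact resource names vary by NLTK version
for resource in ("punkt", "punkt_tab", "averaged_perceptron_tagger",
                 "averaged_perceptron_tagger_eng", "maxent_ne_chunker",
                 "maxent_ne_chunker_tab", "words"):
    nltk.download(resource, quiet=True)

sentence_ner = "Apple is looking at buying U.K. startup for $1 billion"
# Tokenize, POS-tag, then chunk the tagged tokens into named entities
ner_tree = ne_chunk(nltk.pos_tag(word_tokenize(sentence_ner)))
print("Named Entity Recognition:")
print(ner_tree)
```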
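
Relationship extraction, listed among the key tasks of this section, can be approximated for a single relation type with a hand-written pattern. The sketch below is an assumption-laden illustration: the `BORN_IN` pattern and `extract_born_in` function are hypothetical, match only sentences of the form "<Name> was born in <Place>", and treat any run of capitalized words as an entity.

```python
import re

# One or more capitalized words, e.g. "Barack Obama" or "Hawaii"
NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"

# Hypothetical pattern covering only the "was born in" relation
BORN_IN = re.compile(rf"(?P<person>{NAME}) was born in (?P<place>{NAME})")

def extract_born_in(text):
    """Return (subject, relation, object) triples found in the text."""
    return [
        (match.group("person"), "born in", match.group("place"))
        for match in BORN_IN.finditer(text)
    ]

print(extract_born_in("Barack Obama was born in Hawaii."))
```

In practice the entity spans would typically come from an NER step rather than from capitalization alone, and the relation would be identified by a parser or a trained classifier.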

## 4. Unicode Standards

#### Description:
Unicode is a universal character standard that assigns a unique code point to each character in most of the world's writing systems. Encoding forms such as UTF-8 turn those code points into machine-readable binary data that can be stored, transmitted, or processed.

#### How Encoding Works After Linguistic Processing:
1. **Input Text**: 
    -   Processed text (e.g., tokens, syntactic structures) is prepared for encoding.
    - **Example**:  
        -   "John gave a book to Mary."
2. **Unicode Mapping**: 
    -   Each character is mapped to a unique Unicode code point.
    - **Example**:  
        - "J" → U+004A
        - "o" → U+006F
        - " " (space) → U+0020
3. **Encoding Format**: 
    - The Unicode code points are converted into a specific encoding format, such as:
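        - **UTF-8**: Variable-length encoding (1 to 4 bytes per code point); backward compatible with ASCII and widely used.
        - **UTF-16**: Variable-length encoding (2 or 4 bytes per code point); code points outside the Basic Multilingual Plane are stored as surrogate pairs.
        - **UTF-32**: Fixed-length encoding; uses 4 bytes for every code point.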
4. **Output**: 
    - Encoded text ready for storage or transmission.
    - **Example** (UTF-8 encoding of "John"; each ASCII character fits in a single byte):
        - J → 01001010
        - o → 01101111
        - h → 01101000
        - n → 01101110
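
The mapping steps above (character → code point → UTF-8 bits) can be reproduced with Python built-ins. This is a small sketch; the helper name `describe` is introduced here for illustration only.

```python
def describe(char):
    """Return (code point label, UTF-8 bytes as bit strings) for one character."""
    code_point = f"U+{ord(char):04X}"             # e.g. "J" -> "U+004A"
    utf8_bits = [f"{byte:08b}" for byte in char.encode("utf-8")]
    return code_point, utf8_bits

for char in "John":
    code_point, bits = describe(char)
    print(char, code_point, " ".join(bits))

# A non-ASCII character needs more than one byte in UTF-8
print("₹", *describe("₹"))
```

ASCII letters such as those in "John" map to a single byte each, while "₹" (U+20B9) expands to three bytes.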



#### Code Example (Encoding and Decoding):

```python
# Demonstrating how text encoding works
amount = "₹50"
print('Default string:', amount, '\n', 'Type of string:', type(amount), '\n')

# Encode to UTF-8 byte format
amount_encoded = amount.encode('utf-8')
print('Encoded to UTF-8:', amount_encoded, '\n', 'Type of string:', type(amount_encoded), '\n')

# Decode from UTF-8 byte format
amount_decoded = amount_encoded.decode('utf-8')
print('Decoded from UTF-8:', amount_decoded, '\n', 'Type of string:', type(amount_decoded), '\n')

# Encoding format examples
# ('utf-16' and 'utf-32' prepend a byte order mark; the bytes shown in the
# output below come from a little-endian machine)
print("\nEncoding formats example:")
name = "John"

print(f"UTF-8 encoding for 'John': {name.encode('utf-8')}")
print(f"UTF-16 encoding for 'John': {name.encode('utf-16')}")
print(f"UTF-32 encoding for 'John': {name.encode('utf-32')}")
```

Output:

```
Default string: ₹50 
 Type of string: <class 'str'> 

Encoded to UTF-8: b'\xe2\x82\xb950' 
 Type of string: <class 'bytes'> 

Decoded from UTF-8: ₹50 
 Type of string: <class 'str'> 


Encoding formats example:
UTF-8 encoding for 'John': b'John'
UTF-16 encoding for 'John': b'\xff\xfeJ\x00o\x00h\x00n\x00'
UTF-32 encoding for 'John': b'\xff\xfe\x00\x00J\x00\x00\x00o\x00\x00\x00h\x00\x00\x00n\x00\x00\x00'
```
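
To see how many bytes each encoding form spends per character, the lengths can be compared directly. The sketch below uses the explicit little-endian codecs (`utf-16-le`, `utf-32-le`) so that no byte order mark is counted; the helper name `encoded_lengths` is introduced here for illustration.

```python
def encoded_lengths(char):
    """Return the number of bytes the character occupies in each encoding."""
    return {
        encoding: len(char.encode(encoding))
        for encoding in ("utf-8", "utf-16-le", "utf-32-le")
    }

# ASCII letter, rupee sign (U+20B9), and an emoji outside the BMP (U+1F600)
for char in ["J", "\u20b9", "\U0001F600"]:
    print(f"U+{ord(char):04X}", encoded_lengths(char))
```

UTF-8 and UTF-16 grow with the code point, whereas UTF-32 stays at 4 bytes throughout.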
