#### Three stages of linguistic analysis in Natural Language Processing (NLP):


1.  Lexical Processing:
    -   Analyzes individual words in a text.
    -   Identifies words as discrete units of meaning (lexemes).
    -   Recognizes their properties, such as part of speech, base form, and inflections.
    -   Key Tasks:
        -   Tokenization: Splitting text into words, phrases, or other meaningful units (tokens).
        -   Lemmatization and Stemming: Reducing words to their base or root form. Lemmatization returns a dictionary form (lemma); stemming strips suffixes by rule, so the result is not always a real word.
            -   Example:
                -   Lemmatization: "running" → "run"; "studies" → "study".
                -   Stemming (Porter stemmer): "running" → "run"; "studies" → "studi".
        -   Part-of-Speech (POS) Tagging: Assigning a grammatical category to each word.
        -   Spell Checking: Identifying and correcting misspelled words.
        -   Word Recognition: Differentiating valid words from non-words.
        -   Example:
            -   Sentence: "Cats are running."
            -   Lexical processing identifies:
                -   "Cats" as a plural noun.
                -   "are" as an auxiliary verb.
                -   "running" as a verb (present participle).


2. Syntactic Processing:
    -   Focuses on the structure of sentences: how words are arranged according to the rules of grammar.
    -   Analyzes the grammatical structure of a sentence.
    -   Checks that sentences are valid according to the grammar rules of the language.
    -   Key Tasks:
        -   Parsing: Analyzing sentences to identify their grammatical structure.
            -   Example: parse tree for "The cat sat on the mat":
            -   ![image.png](attachment:image.png)
        -   Dependency Parsing: Identifying dependencies between words in a sentence.
            -   Example: "cat" → subject of "sat"; "mat" → object of "on".
        -   Syntax Error Detection: Identifying incorrect grammatical structures.
            -   Example: "He goes to park" → Error: missing article "the".



3. Semantic Processing:
    -   Deals with the meaning of words and sentences.
    -   Extracts and represents the meaning of the text.
    -   Resolves ambiguities and captures relationships between concepts.
    -   Key Tasks:
        -   Word Sense Disambiguation (WSD): Identifying the correct meaning of a word based on context.
            -   Example: "bank" → financial institution (in "He deposited money at the bank"); "bank" → riverbank (in "He sat by the bank of the river").
        -   Named Entity Recognition (NER): Identifying named entities (such as people, organizations, and places) and assigning each a category.
            -   Example: "Apple" → Organization; "Paris" → Location.
        -   Semantic Role Labeling (SRL): Identifying the semantic role each phrase plays relative to the verb (e.g., agent, recipient, theme).
            -   Example: "John gave Mary a gift." Roles: John = giver (agent), Mary = receiver (recipient), gift = thing given (theme).
        -   Coreference Resolution: Linking pronouns and phrases to their referents.
            -   Example: "The cat sat on the mat. It was fluffy." Here "It" refers to "The cat".
        -   Relationship Extraction: Identifying relationships between entities.
            -   Example: Sentence: "Barack Obama was born in Hawaii." Extracted relationship: (Barack Obama, born in, Hawaii).
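
As a rough illustration of the lexical and semantic stages above, the sketch below uses only Python's standard library. The helper names (`tokenize`, `toy_stem`, `extract_born_in`), the two-entry suffix list, and the "was born in" pattern are assumptions made for this example. They are not a real stemmer or relation extractor; in practice these tasks are usually handled by a library such as NLTK or spaCy.

```python
import re

def tokenize(text):
    """Lexical stage: split text into word tokens and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

# Toy suffix list (an assumption for this sketch); real stemmers use many more rules.
SUFFIXES = ("ing", "s")

def toy_stem(word):
    """Lexical stage: strip one suffix, then undo a doubled final consonant."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    # "runn" -> "run": drop a repeated final consonant left behind by "-ing".
    if len(word) >= 4 and word[-1] == word[-2] and word[-1] not in "aeiouls":
        word = word[:-1]
    return word

# Hypothetical pattern: capitalized name(s), "was born in", capitalized place name(s).
BORN_IN = re.compile(
    r"(?P<person>[A-Z]\w+(?: [A-Z]\w+)*) was born in (?P<place>[A-Z]\w+(?: [A-Z]\w+)*)"
)

def extract_born_in(sentence):
    """Semantic stage: return a (subject, relation, object) triple, or None."""
    match = BORN_IN.search(sentence)
    if match is None:
        return None
    return (match.group("person"), "born in", match.group("place"))

tokens = tokenize("Cats are running.")
print(tokens)
print([toy_stem(token) for token in tokens])
print(extract_born_in("Barack Obama was born in Hawaii."))
```

A pattern like this only covers one phrasing of one relation, which is why real systems combine syntactic parses and learned models instead of hand-written rules.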

#### Comparison of the Three Levels

![image.png](attachment:image.png)

#### Unicode standards

-   Encoding converts text into a machine-readable format (binary data) that can be stored, transmitted, or processed.
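-   Unicode is a universal character standard that assigns a unique number (a code point) to each character in most of the world's writing systems. Encodings such as UTF-8 then define how those code points are stored as bytes.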
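
A code point is just an integer, and Python exposes the mapping through the built-in `ord()` and `chr()` functions. The small helper below (`code_point` is a name chosen for this sketch) prints code points in the conventional `U+XXXX` notation.

```python
def code_point(ch):
    """Format a single character's Unicode code point in U+XXXX notation."""
    return f"U+{ord(ch):04X}"

for ch in "J₹":
    print(repr(ch), code_point(ch))

# chr() is the inverse of ord(): it maps a code point back to its character.
print(chr(0x20B9))
```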


#### How Encoding Works After Linguistic Processing

1. Input Text
<pre>
Processed text (e.g., tokens, syntactic structures) is prepared for encoding.
Example: "John gave a book to Mary."
</pre>

2. Unicode Mapping
<pre>
Each character is mapped to a unique Unicode code point.
Example:
"J" → U+004A
"o" → U+006F
" " (space) → U+0020
</pre>

3. Encoding Format
<pre>
The Unicode code points are converted into a specific encoding format, such as:
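- UTF-8: Variable-length encoding (1 to 4 bytes per code point); backward compatible with ASCII and widely used.
- UTF-16: Variable-length encoding (2 or 4 bytes per code point); code points outside the Basic Multilingual Plane use a surrogate pair.
- UTF-32: Fixed-length encoding; uses 4 bytes for every code point.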
</pre>

4. Output
<pre>
Encoded text ready for storage or transmission.
Example (UTF-8 for "John"):
J → 01001010
o → 01101111
h → 01101000
n → 01101110
</pre>
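
The steps above can be checked directly in Python. The sketch below compares how many bytes a character needs in each encoding format and reproduces the bit patterns for "John". The function names `byte_lengths` and `to_bits` are made up for this example, and the little-endian codec variants are used only so that Python does not prepend a byte-order mark to the result.

```python
def byte_lengths(ch):
    """Return how many bytes one character occupies in UTF-8, UTF-16, and UTF-32."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),  # "-le" avoids the byte-order mark
        "utf-32": len(ch.encode("utf-32-le")),
    }

def to_bits(text, encoding="utf-8"):
    """Return each encoded byte of text as an 8-character bit string."""
    return [f"{byte:08b}" for byte in text.encode(encoding)]

# "J" is ASCII, "₹" is U+20B9, and "\U0001F600" is an emoji outside the BMP.
for ch in ("J", "₹", "\U0001F600"):
    print(repr(ch), byte_lengths(ch))

print(to_bits("John"))
```

ASCII characters such as those in "John" take one byte each in UTF-8, which is why the output in step 4 shows exactly one 8-bit group per letter.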

```python
# Create a string containing a non-ASCII character (the rupee sign, U+20B9)
amount = "₹50"
print('Default string: ', amount, '\n', 'Type of string', type(amount), '\n')

# Encode the string to UTF-8 bytes
amount_encoded = amount.encode('utf-8')
print('Encoded to UTF-8: ', amount_encoded, '\n', 'Type of encoded value', type(amount_encoded), '\n')

# Sometime later, on another computer...
# Decode the UTF-8 bytes back into a string
amount_decoded = amount_encoded.decode('utf-8')
print('Decoded from UTF-8: ', amount_decoded, '\n', 'Type of string', type(amount_decoded), '\n')
```

Output:

```
Default string:  ₹50 
 Type of string <class 'str'> 

Encoded to UTF-8:  b'\xe2\x82\xb950' 
 Type of encoded value <class 'bytes'> 

Decoded from UTF-8:  ₹50 
 Type of string <class 'str'> 
```
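
The round trip above works because both sides agree on UTF-8. As a hedged illustration of what can go wrong when the receiver assumes a different encoding, the sketch below decodes the same bytes as ASCII and as Latin-1; the variable names are chosen for this example only.

```python
data = "₹50".encode("utf-8")  # b'\xe2\x82\xb950'

# Strict decoding with a codec that cannot represent these bytes raises an error.
try:
    data.decode("ascii")
except UnicodeDecodeError as err:
    print("ascii decode failed:", err.reason)

# errors="replace" substitutes U+FFFD for each undecodable byte instead of raising.
lossy = data.decode("ascii", errors="replace")
print(lossy)

# Latin-1 maps every byte to some character, so it never raises,
# but the result is garbled text (mojibake) rather than the original string.
wrong = data.decode("latin-1")
print(repr(wrong))
```

Neither fallback recovers the rupee sign, so the encoding should be transmitted or agreed on alongside the bytes.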

