![](./lab%20header%20image.png)

<div style="text-align: center;">
    <h3>Experiment No. 02</h3>
</div>

<img src="./Student%20Information.png" style="width: 100%;" alt="Student Information">

<div style="border: 1px solid #ccc; padding: 8px; background-color: #f0f0f0; text-align: center;">
    <strong>AIM</strong>
</div>

**Analysis of natural language using lexical analysis**

<div style="border: 1px solid #ccc; padding: 8px; background-color: #f0f0f0; text-align: center;">
    <strong>Theory/Procedure/Algorithm</strong>
</div>

**Lexical Analysis** is the process of converting a sequence of characters into a sequence of tokens. A token is a string with an assigned meaning and serves as the basic unit for later syntax analysis. In natural language processing (NLP), lexical analysis breaks text into words, punctuation marks, and other symbols and assigns each one to a category such as a part of speech. (In compiler design, the same step sorts tokens into classes such as keywords, operators, and identifiers.)

The key steps in lexical analysis are:

1. `Tokenization`: Splitting the input text into individual tokens (words, punctuation, etc.).
2. `Normalization`: Transforming tokens into a standard form (e.g., converting all text to lowercase).
3. `Classification`: Assigning each token to a category such as noun, verb, adjective, etc.
4. `Handling Punctuation`: Differentiating between words and punctuation marks to ensure accurate analysis.
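
The four steps can be sketched with the standard library alone. This is a minimal illustration, not the NLTK pipeline used in this experiment: the regular expression, the function names `classify` and `simple_lexer`, and the coarse categories (`WORD`, `NUMBER`, `PUNCT`) are assumptions made for the example, and real part-of-speech classification needs a trained tagger.

```python
import re

# One alternative per coarse token class: ASCII words (with an optional
# apostrophe suffix), numbers, and single punctuation characters.
TOKEN_PATTERN = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?|\d+(?:\.\d+)?|[^\w\s]")


def classify(token):
    """Assign a coarse lexical category to a single token."""
    if token[0].isdigit():
        return "NUMBER"
    if token[0].isalpha():
        return "WORD"
    return "PUNCT"


def simple_lexer(text):
    """Tokenize, normalize, and classify; returns (token, category) pairs."""
    raw_tokens = TOKEN_PATTERN.findall(text)              # 1. tokenization
    normalized = [tok.lower() for tok in raw_tokens]      # 2. normalization
    return [(tok, classify(tok)) for tok in normalized]   # 3. classification


pairs = simple_lexer("The fox's 2 cubs ran.")
# 4. punctuation handling: keep only the tokens classified as words
words = [tok for tok, category in pairs if category == "WORD"]
print(pairs)
print(words)
```

Keeping the category next to each token means later stages can filter by class instead of re-inspecting the characters.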

Lexical analysis is a prerequisite for natural language processing tasks such as text parsing, speech recognition, and data extraction. Token-level results such as word frequencies and part-of-speech tags are the input to the syntactic and semantic analysis that enables more sophisticated language understanding.

```python
import string

import nltk
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize


def download_nltk_resources():
    """Fetch the NLTK data packages the analysis depends on."""
    # The last two names are what newer NLTK releases are assumed to look up
    # in place of 'punkt' and 'averaged_perceptron_tagger'.
    resources = ['punkt', 'averaged_perceptron_tagger', 'wordnet', 'stopwords',
                 'punkt_tab', 'averaged_perceptron_tagger_eng']
    for resource in resources:
        try:
            nltk.download(resource, quiet=True)
        except Exception as e:
            print(f"Error downloading {resource}: {e}")
            print("You may need to manually download this resource.")


download_nltk_resources()


def lexical_analysis(text):
    """Run the lexical pipeline on text; returns a dict of results, or None on error."""
    try:
        # Tokenization
        tokens = word_tokenize(text)

        # Lowercasing
        tokens = [token.lower() for token in tokens]

        # Remove numbers and tokens made up entirely of punctuation
        # (this also covers multi-character tokens such as `` and '')
        punctuation = set(string.punctuation)
        tokens = [
            token for token in tokens
            if not all(ch in punctuation for ch in token) and not token.isnumeric()
        ]

        # Remove stopwords
        stop_words = set(stopwords.words('english'))
        tokens = [token for token in tokens if token not in stop_words]

        # Lemmatization: lemmatize() treats each token as a noun unless a POS
        # is passed, so verb forms such as 'amused' are left unchanged
        lemmatizer = WordNetLemmatizer()
        lemmas = [lemmatizer.lemmatize(token) for token in tokens]

        # Part-of-speech tagging: runs on the filtered tokens, so the tagger
        # does not see full sentences and some tags may differ from in-context tags
        pos_tags = nltk.pos_tag(tokens)

        # Word frequency
        freq_dist = FreqDist(lemmas)

        # Lexical diversity (unique lemmas / total lemmas)
        lexical_diversity = len(set(lemmas)) / len(lemmas) if lemmas else 0

        return {
            'tokens': tokens,
            'lemmas': lemmas,
            'pos_tags': pos_tags,
            'word_frequencies': dict(freq_dist.most_common(10)),  # Top 10 most common words
            'lexical_diversity': lexical_diversity
        }
    except Exception as e:
        print(f"An error occurred during analysis: {e}")
        return None


# Example usage
sample_text = "The quick brown fox jumps over the lazy dog. The dog was not amused by the fox's antics."
result = lexical_analysis(sample_text)

if result:
    print("Tokens:", result['tokens'])
    print("Lemmas:", result['lemmas'])
    print("POS Tags:", result['pos_tags'])
    print("Top 10 Word Frequencies:", result['word_frequencies'])
    print("Lexical Diversity:", result['lexical_diversity'])
else:
    print("Analysis could not be completed due to an error.")
```

```text
Tokens: ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog', 'dog', 'amused', 'fox', "'s", 'antics']
Lemmas: ['quick', 'brown', 'fox', 'jump', 'lazy', 'dog', 'dog', 'amused', 'fox', "'s", 'antic']
POS Tags: [('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'NNS'), ('lazy', 'JJ'), ('dog', 'NN'), ('dog', 'NN'), ('amused', 'VBD'), ('fox', 'NN'), ("'s", 'POS'), ('antics', 'NNS')]
Top 10 Word Frequencies: {'fox': 2, 'dog': 2, 'quick': 1, 'brown': 1, 'jump': 1, 'lazy': 1, 'amused': 1, "'s": 1, 'antic': 1}
Lexical Diversity: 0.8181818181818182
```
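
The diversity figure can be checked by hand: the lemma list has 11 entries, 9 of them distinct, so the ratio is 9/11 ≈ 0.818. The sketch below repeats that arithmetic using `collections.Counter` in place of NLTK's `FreqDist` (a substitution made here so the check needs no extra packages; the variable names are assumptions).

```python
from collections import Counter

# Lemma list copied from the output above.
lemmas = ['quick', 'brown', 'fox', 'jump', 'lazy', 'dog', 'dog',
          'amused', 'fox', "'s", 'antic']

counts = Counter(lemmas)
unique_count = len(counts)   # distinct lemmas (types)
total_count = len(lemmas)    # all lemmas (tokens)
type_token_ratio = unique_count / total_count

print(counts.most_common(2))
print(unique_count, total_count, round(type_token_ratio, 4))
```

Only `fox` and `dog` occur more than once, which is why the ratio stays close to 1 for such a short text.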


<div style="border: 1px solid #ccc; padding: 8px; background-color: #f0f0f0; text-align: center;">
    <strong>CONCLUSION</strong>
</div>

Lexical analysis is a foundational step in understanding and processing natural language. In this experiment, a two-sentence sample was tokenized, lowercased, stripped of punctuation and stopwords, lemmatized, and POS-tagged; the resulting word frequencies and lexical diversity (about 0.82) summarize the vocabulary of the text. The output also shows a limit of this simple pipeline: the verb "jumps" was tagged as a plural noun (NNS), probably because the tagger ran on the filtered tokens and no longer saw the full sentence. Breaking text into tokens and analyzing their distribution in this way underlies more advanced NLP tasks such as parsing, sentiment analysis, and machine translation.

<div style="border: 1px solid #ccc; padding: 8px; background-color: #f0f0f0; text-align: center;">
    <strong>ASSESSMENT</strong>
</div>

<img src="./marks_distribution.png" style="width: 100%;" alt="marks_distribution">