Lexical resources, in the context of natural language processing (NLP), refer to various types of linguistic data or databases that provide information about words, their meanings, and their relationships with other words. These resources are essential for NLP tasks such as text analysis, information retrieval, machine translation, and more. Here are some common types of lexical resources:

-   Dictionaries: 

Dictionaries provide definitions, pronunciations, parts of speech, and other lexical information about words. They may also include example sentences, synonyms, antonyms, and usage notes. Examples of dictionaries include WordNet, Wiktionary, and various online dictionaries.

-   Thesauri: 

Thesauri are resources that organize words based on their semantic relationships, such as synonyms (words with similar meanings) and antonyms (words with opposite meanings). They help in finding alternative words and expanding vocabulary. Roget's Thesaurus is a well-known example of a thesaurus.

-   Word Lists: 

Word lists are simple collections of words organized alphabetically or based on specific criteria (e.g., frequency, word length). They serve as basic lexical resources and are used in various NLP tasks such as spell checking, text classification, and word frequency analysis.

-   Lexicons: 

Lexicons are specialized dictionaries or vocabularies that focus on specific domains, languages, or subject areas. They contain terminology, definitions, and linguistic information relevant to their respective domains. For example, a medical lexicon would contain medical terminology and definitions.

-   Ontologies: 

Ontologies are formal representations of knowledge that organize concepts and their relationships in a hierarchical structure. They are used to capture domain-specific knowledge and provide a framework for understanding the semantics of words and concepts. Examples include the Gene Ontology for molecular biology and the Cyc ontology for common-sense reasoning.

-   Semantic Networks: 

Semantic networks represent words and concepts as nodes connected by semantic relationships such as synonymy, hyponymy and its inverse hypernymy ("is-a" relations), meronymy (part-whole relations), and entailment. They facilitate semantic analysis and reasoning by capturing the meaning of words and the relationships between them. WordNet is a widely used semantic network.

-   Corpora: 

While not exclusively lexical resources, corpora (plural of corpus) are large collections of text or speech data that serve as valuable sources of lexical information. They provide real-world examples of word usage, collocations, and linguistic patterns, which are essential for various NLP tasks such as text analysis, language modeling, and corpus linguistics research.
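
To make the idea of a semantic network concrete, here is a minimal sketch that stores a few hypernym ("is-a") links in a plain dictionary and walks them upward. The words and links are a hand-made toy example rather than data taken from WordNet, and `hypernym_chain` is a hypothetical helper name.

```python
# Toy semantic network: each word points to its hypernym (its "is-a" parent).
# These links are illustrative assumptions, not real WordNet data.
HYPERNYMS = {
    "sedan": "car",
    "car": "motor_vehicle",
    "motor_vehicle": "vehicle",
    "bicycle": "vehicle",
}


def hypernym_chain(word):
    """Return the list of increasingly general concepts above `word`."""
    chain = []
    while word in HYPERNYMS:
        word = HYPERNYMS[word]
        chain.append(word)
    return chain


print(hypernym_chain("sedan"))
```

A real resource such as WordNet stores many relation types per node, so a single parent per word is a deliberate simplification here.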

#### Dictionary

-   Import NLTK and WordNet:

We import NLTK and specifically the wordnet module from NLTK's corpus.

-    Specify the Word:

We specify the word we want to look up in the dictionary. In this example, we're using the word "car".

-   Retrieve Synsets:

We use the synsets() function from WordNet to retrieve synsets (sets of synonyms) for the specified word.

-   Display Information:

If synsets are found (i.e., if the word exists in WordNet), we iterate over each synset and display various pieces of information, including the definition, part of speech, examples, and synonyms (lemmas).

-   Handling Absence of Synsets:

If no synsets are found for the word (i.e., if the word is not in WordNet), we display a message indicating that no information was found.

In [14]:
# Dictionary
import nltk
from nltk.corpus import wordnet

# Download the WordNet data if it is not already installed
nltk.download('wordnet', quiet=True)

# Word to look up in the dictionary
word = "car"

# Retrieve synsets (sets of synonyms) for the word
synsets = wordnet.synsets(word)

if synsets:
    # Display information for each synset
    for synset in synsets:
        print(f"Definition: {synset.definition()}")
        print(f"Part of Speech: {synset.pos()}")
        print(f"Examples: {synset.examples()}")
        print(f"Synonyms: {', '.join(synonym.name() for synonym in synset.lemmas())}")
        print()
else:
    print("No information found for the word.")

Definition: a motor vehicle with four wheels; usually propelled by an internal combustion engine
Part of Speech: n
Examples: ['he needs a car to get to work']
Synonyms: car, auto, automobile, machine, motorcar

Definition: a wheeled vehicle adapted to the rails of railroad
Part of Speech: n
Examples: ['three cars had jumped the rails']
Synonyms: car, railcar, railway_car, railroad_car

Definition: the compartment that is suspended from an airship and that carries personnel and the cargo and the power plant
Part of Speech: n
Examples: []
Synonyms: car, gondola

Definition: where passengers ride up and down
Part of Speech: n
Examples: ['the car was on the top floor']
Synonyms: car, elevator_car

Definition: a conveyance for passengers or freight on a cable railway
Part of Speech: n
Examples: ['they took a cable car to the top of the mountain']
Synonyms: cable_car, car



#### Thesauri

-   Import NLTK and WordNet:

We import NLTK and specifically the wordnet module from NLTK's corpus.

-   Specify the Word:

We specify the word for which we want to find synonyms and antonyms. In this example, we're using the word "happy".

-   Retrieve Synsets:

We use the synsets() function from WordNet to retrieve synsets (sets of synonyms) for the specified word.

-   Display Synonyms:

If synsets are found (i.e., if the word exists in WordNet), we iterate over each synset and add the synonyms (lemmas) to a set to remove duplicates. Then, we display the set of synonyms for the word.

-   Display Antonyms:

We also look for antonyms by iterating over each synset, lemma, and antonym. We add the antonyms to a set to remove duplicates and then display the set of antonyms for the word.

-   Handling Absence of Synsets:

If no synsets are found for the word (i.e., if the word is not in WordNet), we display a message indicating that no synsets were found.


In [15]:
import nltk
from nltk.corpus import wordnet

# Word to find synonyms and antonyms for
word = "happy"

# Retrieve synsets (sets of synonyms) for the word
synsets = wordnet.synsets(word)

if synsets:
    # Display synonyms for the word
    synonyms = set()
    for synset in synsets:
        for lemma in synset.lemmas():
            synonyms.add(lemma.name())
    print(f"Synonyms of '{word}': {', '.join(synonyms)}")

    # Display antonyms for the word
    antonyms = set()
    for synset in synsets:
        for lemma in synset.lemmas():
            for antonym in lemma.antonyms():
                antonyms.add(antonym.name())
    print(f"Antonyms of '{word}': {', '.join(antonyms)}")
else:
    print(f"No synsets found for '{word}'.")

Synonyms of 'happy': felicitous, well-chosen, glad, happy
Antonyms of 'happy': unhappy


#### Stopword Lists:

Stopwords are common words that are often filtered out during text preprocessing because they typically do not carry significant meaning for analysis.

In [16]:
import nltk
from nltk.corpus import stopwords

# Download stopwords list
nltk.download('stopwords')

# Get English stopwords
stopwords_list = set(stopwords.words('english'))

# Print the stopwords
print("Stopwords in English:")
print(stopwords_list)

Stopwords in English:
{"it's", 'she', "wouldn't", 'itself', 'down', 'against', 'wasn', 'until', 'just', 'but', 'both', "you've", 'needn', 'off', 'of', 'if', 'an', "won't", 'yourselves', 'its', 'be', 'their', 'he', 'aren', 'yours', 'me', 'then', 'haven', 'through', 'ma', 'again', 'him', 're', 'during', 'll', 'this', 'doing', "hasn't", 'ours', 'they', "aren't", 'for', 'here', 'further', 'from', "you'll", 'himself', 'any', 's', 'mightn', 'isn', "needn't", 'after', 'them', 'y', 'ain', 'up', 'having', 't', 'should', 'you', 'will', "shan't", 'weren', 'most', 'are', 'what', 'so', "hadn't", 'ourselves', 'it', 'those', 'had', 'between', 'each', 'not', 'hadn', 'yourself', 'below', 'doesn', 'very', 'a', 'more', "you'd", 'on', 'why', 'wouldn', 'm', 've', "couldn't", 'didn', 'does', 'into', 'do', 'only', 'where', 'hers', 'how', "shouldn't", 'his', 'some', 'all', 'own', 'have', 'while', 'other', 'nor', 'about', 'o', 'mustn', 'which', "that'll", "weren't", 'same', "mustn't", 'her', 'my', 'themselves'

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\sengu\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
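
As a rough illustration of how a stopword list is applied during preprocessing, the sketch below filters a sentence against a small hand-picked set. The set is an assumption made to keep the example self-contained; in practice the full list from stopwords.words('english') would be used instead.

```python
# Small hand-picked subset of English stopwords, used here only for illustration.
# In practice the full set would come from stopwords.words('english').
STOPWORDS = {"the", "is", "on", "a", "an", "and", "of", "it"}

sentence = "The cat is sleeping on the mat"

# Lowercase each token before the lookup so that "The" matches "the"
tokens = sentence.split()
content_words = [t for t in tokens if t.lower() not in STOPWORDS]

print(content_words)
```

Storing the stopwords in a set keeps each membership check cheap, which matters when filtering large corpora.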


#### Word Frequency Lists:

Word frequency lists contain words along with their frequencies of occurrence in a corpus of text.

In [17]:
from collections import Counter

# Sample text data
text = "This is a sample text. It contains some words, and it repeats some words."

# Tokenize the text
words = text.lower().split()

# Count word frequencies
word_freq = Counter(words)

# Print the most common words and their frequencies
print("Word frequencies:")
print(word_freq)

Word frequencies:
Counter({'it': 2, 'some': 2, 'this': 1, 'is': 1, 'a': 1, 'sample': 1, 'text.': 1, 'contains': 1, 'words,': 1, 'and': 1, 'repeats': 1, 'words.': 1})
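
Note that splitting on whitespace leaves punctuation attached, so 'words,' and 'words.' are counted as different entries above. One possible fix, sketched here with a simple regular expression (an assumption; a proper tokenizer such as nltk.word_tokenize is another option), is to extract only runs of letters before counting.

```python
import re
from collections import Counter

text = "This is a sample text. It contains some words, and it repeats some words."

# Keep only runs of letters and apostrophes, so "words," and "words." both become "words"
words = re.findall(r"[a-z']+", text.lower())

word_freq = Counter(words)

# most_common(n) returns the n highest-frequency (word, count) pairs
print(word_freq.most_common(3))
```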


#### Lexicons

##### Sentiment Lexicon:

A sentiment lexicon contains words along with their associated sentiment polarity (e.g., positive, negative, neutral). It is used in sentiment analysis to determine the sentiment expressed in a piece of text.

In [20]:
# Example: Using the AFINN lexicon for sentiment analysis
from afinn import Afinn

# Instantiate the AFINN lexicon
afinn = Afinn()

# Analyze sentiment of a piece of text
text = "This movie is amazing! I love it."
sentiment_score = afinn.score(text)

# Interpret sentiment score
if sentiment_score > 0:
    sentiment_label = "positive"
elif sentiment_score < 0:
    sentiment_label = "negative"
else:
    sentiment_label = "neutral"

print(f"Sentiment: {sentiment_label} (Score: {sentiment_score})")

Sentiment: positive (Score: 7.0)
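
To show the general idea behind lexicon-based scoring, here is a minimal sketch that sums per-word valences from a tiny dictionary. The words and scores in `TOY_LEXICON` are illustrative assumptions, not the AFINN word list, and `lexicon_score` is a hypothetical helper; the real library may handle tokenization and phrases differently.

```python
import re

# Toy sentiment lexicon: word -> valence score.
# The words and scores are illustrative assumptions, not the AFINN word list.
TOY_LEXICON = {"amazing": 4, "love": 3, "boring": -2, "terrible": -3}


def lexicon_score(text):
    """Sum the valence of every known word; unknown words contribute 0."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(TOY_LEXICON.get(token, 0) for token in tokens)


print(lexicon_score("This movie is amazing! I love it."))
print(lexicon_score("A boring, terrible film."))
```

A plain sum like this ignores negation ("not amazing") and intensity modifiers, which is a known limitation of simple lexicon approaches.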


##### Part-of-Speech (POS) Lexicon:

A POS lexicon contains words along with their corresponding parts of speech (e.g., noun, verb, adjective). It is used in POS tagging to assign parts of speech to words in a sentence.

-   NN: Noun, singular or mass (e.g., "dog", "cat", "house")
-   NNS: Noun, plural (e.g., "dogs", "cats", "houses")
-   NNP: Proper noun, singular (e.g., "John", "London", "January")
-   NNPS: Proper noun, plural (e.g., "Smiths", "Americans", "Alps")
-   VB: Verb, base form (e.g., "eat", "run", "play")
-   VBD: Verb, past tense (e.g., "ate", "ran", "played")
-   VBG: Verb, gerund or present participle (e.g., "eating", "running", "playing")
-   VBN: Verb, past participle (e.g., "eaten", "run", "played")
-   VBP: Verb, non-3rd person singular present (e.g., "eat", "run", "play")
-   VBZ: Verb, 3rd person singular present (e.g., "eats", "runs", "plays")
-   JJ: Adjective (e.g., "big", "happy", "red")
-   RB: Adverb (e.g., "quickly", "very", "well")
-   DT: Determiner (e.g., "the", "a", "this", "those")
-   PRP: Personal pronoun (e.g., "I", "you", "he", "she", "it", "they")
-   CC: Coordinating conjunction (e.g., "and", "but", "or")
-   IN: Preposition or subordinating conjunction (e.g., "in", "on", "at", "since", "because")
-   CD: Cardinal number (e.g., "one", "two", "three", "1", "2", "3")

In [24]:
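# Example: Using the NLTK POS tagger with a sample sentence
import nltk

# Tokenizer models needed by word_tokenize (newer NLTK releases may name this resource 'punkt_tab')
nltk.download('punkt', quiet=True)
nltk.download('averaged_perceptron_tagger')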
sentence = "The cat is sleeping on the mat."

# Tokenize the sentence
tokens = nltk.word_tokenize(sentence)

# Tag parts of speech
pos_tags = nltk.pos_tag(tokens)

# Print POS tags
print("POS tags:")
for word, pos_tag in pos_tags:
    print(f"{word}: {pos_tag}")

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\sengu\AppData\Roaming\nltk_data...


POS tags:
The: DT
cat: NN
is: VBZ
sleeping: VBG
on: IN
the: DT
mat: NN
.: .


[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.
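
The tagger above is a trained statistical model, but the simplest way to use a POS lexicon is direct lookup. The sketch below illustrates that idea with a tiny hand-written lexicon; `POS_LEXICON`, `lookup_tag`, and the default tag are assumptions made for this example and are not part of NLTK.

```python
# Toy POS lexicon: word -> Penn Treebank tag (hand-written for this example)
POS_LEXICON = {
    "the": "DT",
    "cat": "NN",
    "mat": "NN",
    "is": "VBZ",
    "sleeping": "VBG",
    "on": "IN",
    ".": ".",
}


def lookup_tag(tokens, default="NN"):
    """Tag each token by dictionary lookup, falling back to `default`."""
    return [(tok, POS_LEXICON.get(tok.lower(), default)) for tok in tokens]


# "dog" is not in the lexicon, so it receives the default tag
tokens = ["The", "dog", "is", "sleeping", "on", "the", "mat", "."]
print(lookup_tag(tokens))
```

A pure lookup cannot disambiguate words that take several tags depending on context (for example "run" as a noun or a verb), which is one reason context-aware taggers are used in practice.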


#### Accessing Text Corpora

A corpus (plural: corpora) is a large, structured collection of text or speech data used for linguistic analysis, language modeling, and the development and evaluation of NLP algorithms and models. Corpora are the primary source of data for studying language patterns, extracting linguistic features, and training machine learning models.

Here are some key characteristics of corpora:

-   Size and Diversity:

Corpora vary widely in size, from small collections of texts to large-scale datasets containing millions of documents. They can draw on diverse sources such as books, articles, transcripts, social media posts, and websites. The diversity of a corpus influences its representativeness and its applicability to different NLP tasks.

-   Annotation and Metadata:

Corpora may contain metadata and annotations that provide contextual information about the texts, such as author names, publication dates, genre labels, part-of-speech tags, named entity annotations, and sentiment labels. These annotations make the corpus more useful for specific NLP tasks and research objectives.

-   Structured Format:

Corpora are typically organized in a standardized format to facilitate data retrieval, processing, and analysis. They may be stored as plain text files, XML files, JSON files, database tables, or specialized formats designed for linguistic annotation (e.g., the CoNLL format).

-   Representativeness and Bias:

The representativeness of a corpus refers to how well it reflects the linguistic characteristics and usage patterns of a particular language or domain. Corpora may exhibit biases arising from the selection criteria, data sources, and collection methods. Researchers should be aware of these potential biases and account for them in their analyses and interpretations.

-   Usage in NLP:

Corpora are the foundation for many NLP tasks, including text classification, named entity recognition, part-of-speech tagging, sentiment analysis, machine translation, and language modeling. Researchers and practitioners use corpora to train and evaluate NLP models, extract linguistic insights, and develop language resources.

-   Import NLTK and Download Corpus:

    import nltk: Import the NLTK library.
    nltk.download('gutenberg'): Download the Gutenberg corpus. This is one of many corpora available in NLTK.

-   Accessing Text Corpora:

    from nltk.corpus import gutenberg: Import the Gutenberg corpus from NLTK's corpus module.

-   Listing Available Files:

    print(gutenberg.fileids()): Print the list of available files in the Gutenberg corpus. 
    This provides information about the texts included in the corpus.

In [1]:
import nltk
nltk.download('gutenberg')  # Download the Gutenberg corpus (one of many available)

# Accessing text from the Gutenberg corpus
from nltk.corpus import gutenberg

# List available files in the Gutenberg corpus
print(gutenberg.fileids())

[nltk_data] Downloading package gutenberg to
[nltk_data]     C:\Users\sengu\AppData\Roaming\nltk_data...


['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']


[nltk_data]   Unzipping corpora\gutenberg.zip.


-   Accessing Text from a Specific File:

    sense_text = gutenberg.raw('austen-sense.txt'): Access the raw text of a specific file in the Gutenberg corpus (in this case, 'austen-sense.txt', which contains Jane Austen's novel "Sense and Sensibility").
    
-   print(sense_text[:800]): 
    
    Print the first 800 characters of the text to get a glimpse of the content.

In [10]:
# Accessing the text of a specific file in the Gutenberg corpus
sense_text = gutenberg.raw('austen-sense.txt')  # Access the raw text
print(sense_text[:800])  # Print the first 800 characters of the text

[Sense and Sensibility by Jane Austen 1811]

CHAPTER 1


The family of Dashwood had long been settled in Sussex.
Their estate was large, and their residence was at Norland Park,
in the centre of their property, where, for many generations,
they had lived in so respectable a manner as to engage
the general good opinion of their surrounding acquaintance.
The late owner of this estate was a single man, who lived
to a very advanced age, and who for many years of his life,
had a constant companion and housekeeper in his sister.
But her death, which happened ten years before his own,
produced a great alteration in his home; for to supply
her loss, he invited and received into his house the family
of his nephew Mr. Henry Dashwood, the legal inheritor
of the Norland estate, and the person to whom 
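
Because raw() returns an ordinary Python string, standard string tools apply to it directly. As a small sketch, the snippet below computes a few basic statistics on one sentence copied from the excerpt above; the variable names and the letters-only regular expression are assumptions made for illustration.

```python
import re

# First sentence of the excerpt above, standing in for the full raw string
raw = "The family of Dashwood had long been settled in Sussex."

# Word tokens are approximated as runs of letters
words = re.findall(r"[A-Za-z]+", raw)
vocabulary = {w.lower() for w in words}

print(len(raw))                      # number of characters
print(len(words))                    # number of word tokens
print(len(vocabulary) / len(words))  # lexical diversity (distinct words / total words)
```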


##### Conditional Frequency Distribution 

Conditional frequency distribution (CFD) is a concept commonly used in Natural Language Processing (NLP) to analyze the distribution of words or other linguistic units within different contexts. It provides a way to examine how the frequency of one variable (e.g., words) varies with respect to another variable (e.g., contexts or categories).

In Python, the nltk.ConditionalFreqDist class from the NLTK library is often used to create and work with conditional frequency distributions. Here's how it works:

-   Creating a Conditional Frequency Distribution:

    To create a conditional frequency distribution, you typically start with a list of tuples where each tuple contains a pair of values representing the condition and the event. Then, you pass this list of tuples to the ConditionalFreqDist constructor.

-   Accessing Frequencies:

    Once you have created a conditional frequency distribution, you can access the frequencies of specific events conditioned on particular conditions using indexing notation.


To build intuition, imagine you have a stack of storybooks and you want to know not only how many times certain words appear, but also where they appear. That is what a conditional frequency distribution records.

Think of sorting colorful candies into jars. Each jar stands for a condition, and inside each jar you count how many candies of each color there are. For example, you might want to know how many red candies are in the jar labeled "kitchen" and how many are in the jar labeled "living room."

With words, we do the same thing: we count how often specific words (like "the" or "is") occur under different conditions (for example, the part of speech they are used as, such as determiner or verb), and we store the counts separately for each condition.

So, to find out how many times "the" is used as a determiner (as in "the cat"), you look in the jar labeled "determiner" and read off the count for "the". A conditional frequency distribution therefore tells us not only how often words appear but also in which contexts they appear.

In [7]:
import nltk

# Pretend we have a list of toys in different rooms of the house
toys_in_rooms = [
    ('kitchen', 'ball'),
    ('kitchen', 'doll'),
    ('living_room', 'ball'),
    ('living_room', 'car'),
    ('bedroom', 'doll'),
    ('bedroom', 'teddy_bear'),
    ('kitchen', 'doll'),
    ('living_room', 'ball'),
]

# Create a conditional frequency distribution
cfd = nltk.ConditionalFreqDist(toys_in_rooms)

# Now, let's see how many times each toy appears in each room
print("Frequency of toys in different rooms:")
for room in cfd.conditions():  # Get the list of rooms
    print(f"In the {room}:")
    for toy in cfd[room]:  # Get the list of toys in each room
        print(f"- {toy}: {cfd[room][toy]} times")

Frequency of toys in different rooms:
In the kitchen:
- doll: 2 times
- ball: 1 times
In the living_room:
- ball: 2 times
- car: 1 times
In the bedroom:
- doll: 1 times
- teddy_bear: 1 times
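
Conceptually, a conditional frequency distribution is one frequency counter per condition. The sketch below mimics that idea with the standard library, using part-of-speech tags as conditions and words as events. It is a stand-in meant to show the structure, not a description of how NLTK implements ConditionalFreqDist, and the tagged pairs are hand-written for illustration.

```python
from collections import Counter, defaultdict

# (condition, event) pairs: here the condition is a POS tag and the event is a word.
# The pairs are hand-written for illustration.
tagged_pairs = [
    ("DT", "the"), ("NN", "cat"), ("VBZ", "is"), ("VBG", "sleeping"),
    ("IN", "on"), ("DT", "the"), ("NN", "mat"),
]

# One Counter per condition, created on first use
cfd = defaultdict(Counter)
for condition, event in tagged_pairs:
    cfd[condition][event] += 1

print(cfd["DT"]["the"])        # count of "the" under the condition "DT"
print(cfd["NN"].most_common()) # all events seen under the condition "NN"
```

The same cfd[condition][event] indexing used with nltk.ConditionalFreqDist works here, which is why the nested-counter picture is a useful mental model.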

