## NLP_Assignment_2
1. What are Corpora?
2. What are Tokens?
3. What are Unigrams, Bigrams, Trigrams?
4. How to generate n-grams from text?
5. Explain Lemmatization
6. Explain Stemming
7. Explain Part-of-speech (POS) tagging
8. Explain Chunking or shallow parsing
9. Explain Noun Phrase (NP) chunking
10. Explain Named Entity Recognition

In [1]:
'''Ans 1:- Corpora (singular: corpus) are large, structured collections
of written or spoken texts used for linguistic analysis and research.
They serve as valuable resources for studying language patterns. For
example, the "Brown Corpus" contains samples of American English text
published in 1961, drawn from various sources and genres, enabling
researchers to compare language usage across different contexts.

This Python code uses the NLTK library to access the
"Brown Corpus," specifically the "news" genre. It then counts
the tokens in that genre (words and punctuation marks) and
displays the first 10 of them.'''

import nltk
nltk.download('brown')
from nltk.corpus import brown

# Accessing a specific genre from the Brown Corpus
news_text = brown.words(categories='news')

# Getting some basic statistics
print(f"Total words in the 'news' genre: {len(news_text)}")
print(f"First 10 words in the 'news' genre: {news_text[:10]}")

[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\Aditya\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\brown.zip.


Total words in the 'news' genre: 100554
First 10 words in the 'news' genre: ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of']
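To make the idea of comparing genres concrete, here is a minimal sketch that computes two simple statistics per genre on a tiny in-memory corpus. The `mini_corpus` texts and the `genre_stats` helper are made up for illustration and are not part of NLTK or the Brown Corpus.

```python
# Hypothetical two-genre mini corpus; a real corpus such as Brown
# holds far more text per genre.
mini_corpus = {
    "news": "the jury said the election was fair",
    "fiction": "she said the night was long and the road was empty",
}

def genre_stats(text):
    # Split on whitespace and return (total words, distinct words).
    words = text.split()
    return len(words), len(set(words))

for genre, text in mini_corpus.items():
    total, distinct = genre_stats(text)
    # Type-token ratio: distinct words divided by total words.
    print(genre, total, distinct, round(distinct / total, 2))
```

The same loop could, in principle, run over `brown.words(categories=...)` for each genre instead of the toy strings.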


In [4]:
'''Ans 2:- Tokens are the individual units a text is split into, such
as words, numbers, or punctuation marks. They are the building
blocks for natural language processing tasks.
Here's a code example in Python using the NLTK library to
tokenize a sentence. In this example, the word_tokenize function
from NLTK is used to split the sentence into tokens, resulting
in a list of words and punctuation marks. Note that the final
period becomes its own token even though no space precedes it.'''

import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

sentence = "Tokenization is an important NLP task."
tokens = word_tokenize(sentence)

print(tokens)

['Tokenization', 'is', 'an', 'important', 'NLP', 'task', '.']


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Aditya\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
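As a rough illustration of what a tokenizer does, the sketch below splits text with a single regular expression. It is a simplified stand-in, not how word_tokenize works internally, and the name `simple_tokenize` is hypothetical.

```python
import re

def simple_tokenize(text):
    # \w+ matches a run of letters, digits, or underscores (a word);
    # [^\w\s] matches one character that is neither a word character
    # nor whitespace (a punctuation mark).
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("Tokenization is an important NLP task.")
print(tokens)
```

A pattern this simple breaks contractions and hyphenated words apart, which is one reason to prefer a library tokenizer for real text.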


In [5]:
'''Ans 3:- Unigrams, bigrams, and trigrams are different types of
n-grams, which are contiguous sequences of n items (usually words)
in a text.

1. Unigrams: These are single words considered in isolation (n = 1).
2. Bigrams: These are sequences of two consecutive words (n = 2).
3. Trigrams: These are sequences of three consecutive words (n = 3).

Here's a Python code example using NLTK to generate unigrams,
bigrams, and trigrams from a sentence. This code tokenizes the
input sentence and then uses the ngrams function from NLTK to
create unigrams, bigrams, and trigrams from the tokens. Each
n-gram is returned as a tuple, so a unigram is a one-element tuple.'''

import nltk
from nltk.util import ngrams

sentence = "Natural language processing is fascinating."

# Tokenize the sentence
tokens = nltk.word_tokenize(sentence)

# Generate unigrams, bigrams, and trigrams
unigrams = list(ngrams(tokens, 1))
bigrams = list(ngrams(tokens, 2))
trigrams = list(ngrams(tokens, 3))

print("Unigrams:", unigrams)
print("Bigrams:", bigrams)
print("Trigrams:", trigrams)

Unigrams: [('Natural',), ('language',), ('processing',), ('is',), ('fascinating',), ('.',)]
Bigrams: [('Natural', 'language'), ('language', 'processing'), ('processing', 'is'), ('is', 'fascinating'), ('fascinating', '.')]
Trigrams: [('Natural', 'language', 'processing'), ('language', 'processing', 'is'), ('processing', 'is', 'fascinating'), ('is', 'fascinating', '.')]


In [6]:
'''Ans 4:- To generate n-grams from text in Python, we can use the
ngrams function from libraries like NLTK or create a custom
function. Below is an example of generating n-grams using Python's
built-in zip function. The generate_ngrams function takes a text
and the desired n as inputs. It splits the text on whitespace, so
punctuation stays attached to the neighbouring word. It then builds
n shifted copies of the word list (words[0:], words[1:], ...) and
zips them together, so each zipped tuple holds n consecutive words.
The tuples are joined into strings and returned as a list.'''

def generate_ngrams(text, n):
    words = text.split()
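    # Zip n shifted copies of the word list; each tuple holds n consecutive words.
    # The name ngram_tuples avoids confusion with nltk.util.ngrams used earlier.
    ngram_tuples = zip(*[words[i:] for i in range(n)])
    return [' '.join(ngram) for ngram in ngram_tuples]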

text = "This is an example sentence for n-gram generation."
# Change n to generate different n-grams (e.g., 1 for unigrams, 2 for bigrams)
n = 3

result = generate_ngrams(text, n)
print(result)

['This is an', 'is an example', 'an example sentence', 'example sentence for', 'sentence for n-gram', 'for n-gram generation.']
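Once n-grams are generated, a common next step is counting how often each one occurs. The sketch below is one way to do that with an explicit sliding window. The helper name `ngram_counts` and the sample sentence are illustrative assumptions.

```python
from collections import Counter

def ngram_counts(tokens, n):
    # Slide a window of width n over the token list. A list of k tokens
    # yields k - n + 1 windows, or none at all when k < n.
    windows = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return Counter(windows)

tokens = "the cat sat on the mat and the cat slept".split()
bigram_counts = ngram_counts(tokens, 2)
print(bigram_counts[("the", "cat")])
```

Here the ten tokens produce nine bigrams, and ("the", "cat") is the only bigram that occurs twice.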


In [2]:
'''Ans 5:- Lemmatization is a text normalization technique that
reduces words to their base or dictionary form (lemma) while
considering word context. It helps maintain linguistic accuracy
because the result is always a valid word. For example, "running"
becomes "run," and "better" becomes "good" as an adjective or
"well" as an adverb. Here's a Python example using the spaCy
library. In this code, the en_core_web_sm pipeline processes the
text and each token's lemma_ attribute gives its base form. For
this input spaCy treats "better" as an adverb, so the output is
"run well and fast".'''

import spacy

# Download the English language model
!python -m spacy download en_core_web_sm

# Load the English language model
nlp = spacy.load('en_core_web_sm')

# Text to be lemmatized
text = "running better and faster"

# Process the text with spaCy
doc = nlp(text)

# Extract lemmas
lemmas = [token.lemma_ for token in doc]

print("Original text:", text)
print("Lemmatized text:", " ".join(lemmas))

Collecting en-core-web-sm==3.6.0
  Using cached https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.6.0/en_core_web_sm-3.6.0-py3-none-any.whl (12.8 MB)
[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
Original text: running better and faster
Lemmatized text: run well and fast
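The role of context can be sketched with a toy lookup keyed on both the word and its part of speech. The table and the `toy_lemmatize` helper below are hypothetical. Real lemmatizers such as spaCy's rely on much larger lexicons and rules.

```python
# Hypothetical lookup table for illustration only.
LEMMA_TABLE = {
    ("better", "ADJ"): "good",
    ("better", "ADV"): "well",
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
    ("running", "VERB"): "run",
}

def toy_lemmatize(word, pos):
    # Fall back to the lowercased word when the (word, POS) pair is unknown.
    return LEMMA_TABLE.get((word.lower(), pos), word.lower())

print(toy_lemmatize("better", "ADJ"))
print(toy_lemmatize("better", "ADV"))
print(toy_lemmatize("saw", "VERB"))
```

The same surface form maps to different lemmas depending on its tag, which is why a lemmatizer usually runs after (or together with) a POS tagger.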


In [3]:
'''Ans 6:- Stemming is a text normalization technique that reduces
words to their root or stem form by removing suffixes. It's a
more aggressive approach compared to lemmatization and may
produce non-linguistic or partial stems. For example, "jumping"
becomes "jump," and "flies" becomes "fli." Stemming is often used
in information retrieval and text analysis for simplifying
words to their core form.'''

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
word = "jumping"
stem = stemmer.stem(word)

print(f"Original word: {word}")
print(f"Stemmed word: {stem}")

Original word: jumping
Stemmed word: jump
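To show what suffix removal looks like mechanically, here is a deliberately naive suffix stripper. It is not the Porter algorithm, and the rule list in `toy_stem` is an assumption chosen only to reproduce the two examples from the answer.

```python
def toy_stem(word):
    # Try longer suffixes first so "ies" is handled before "s".
    rules = (("ies", "i"), ("ing", ""), ("ed", ""), ("s", ""))
    for suffix, replacement in rules:
        # Require at least two characters to remain before the suffix.
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: len(word) - len(suffix)] + replacement
    return word

for w in ["jumping", "flies", "jumped", "cats", "sing", "running"]:
    print(w, "->", toy_stem(w))
```

Because it applies rules without consulting a dictionary, the sketch turns "running" into "runn", illustrating how a stemmer can produce strings that are not real words.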


In [4]:
'''Ans 7:- Part-of-speech (POS) tagging is the process of labeling
each word in a text with its grammatical category, such as
noun, verb, adjective, etc. It helps analyze the syntactic
structure and meaning of sentences. In Python, libraries like NLTK
and spaCy provide tools for POS tagging. This code tokenizes
the text and assigns a Penn Treebank POS tag to each token,
resulting in a list of tuples like ('John', 'NNP'), where 'NNP'
marks a proper noun.

1. 'NNP' stands for Proper Noun, singular (e.g., 'John').
2. 'VBZ' stands for Verb, 3rd person singular present (e.g., 'likes').
3. 'TO' is the tag for the word 'to', here the infinitive marker before a verb.
4. 'VB' stands for Verb, base form (e.g., 'play').
5. 'NN' stands for Noun, singular or mass (e.g., 'soccer').
6. '.' represents sentence-final punctuation (e.g., '.').'''

import nltk
nltk.download('punkt')
from nltk import word_tokenize, pos_tag

text = "John likes to play soccer."
tokens = word_tokenize(text)
pos_tags = pos_tag(tokens)

print(pos_tags)

[('John', 'NNP'), ('likes', 'VBZ'), ('to', 'TO'), ('play', 'VB'), ('soccer', 'NN'), ('.', '.')]


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Aditya\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
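Tagged output is usually consumed by filtering on the tags. The sketch below works on a hard-coded copy of the list printed above, so it needs no tagger. The variable names are illustrative.

```python
from collections import Counter

# Tagged tokens copied from the printed output of the cell above.
tagged = [("John", "NNP"), ("likes", "VBZ"), ("to", "TO"),
          ("play", "VB"), ("soccer", "NN"), (".", ".")]

# Penn Treebank noun tags start with "NN" and verb tags with "VB".
nouns = [word for word, tag in tagged if tag.startswith("NN")]
verbs = [word for word, tag in tagged if tag.startswith("VB")]
tag_counts = Counter(tag for _, tag in tagged)

print("Nouns:", nouns)
print("Verbs:", verbs)
```

Matching on the tag prefix groups related tags together, so 'NN' and 'NNP' both count as nouns and 'VB' and 'VBZ' both count as verbs.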


In [5]:
'''Ans 8:- Chunking, also known as shallow parsing, is a natural
language processing technique that groups words in a sentence into
meaningful chunks, such as noun phrases or verb phrases, without
building a full parse tree. It aids in identifying higher-level
linguistic structures. In Python, libraries like NLTK are used for
chunking. This code chunks the input sentence into noun phrases
based on a defined grammar: an optional determiner (DT), any number
of adjectives (JJ), then a noun (NN). The result is a tree whose
NP subtrees represent the chunks.'''

import nltk
from nltk import word_tokenize, pos_tag, RegexpParser

text = "The cat chased the mouse."
tokens = word_tokenize(text)
pos_tags = pos_tag(tokens)

# Define a grammar for chunking noun phrases
grammar = "NP: {<DT>?<JJ>*<NN>}"

chunk_parser = RegexpParser(grammar)
chunks = chunk_parser.parse(pos_tags)

print(chunks)

(S (NP The/DT cat/NN) chased/VBD (NP the/DT mouse/NN) ./.)
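The grammar `<DT>?<JJ>*<NN>` can also be read as a small matching procedure over the tag sequence. The pure-Python sketch below follows that reading. It is an illustrative re-implementation under that assumption, not how NLTK's RegexpParser works internally, and `np_chunks` is a hypothetical name.

```python
def np_chunks(tagged):
    """Greedy left-to-right match of the tag pattern <DT>?<JJ>*<NN>."""
    chunks, i = [], 0
    while i < len(tagged):
        j = i
        if tagged[j][1] == "DT":                          # optional determiner
            j += 1
        while j < len(tagged) and tagged[j][1] == "JJ":   # zero or more adjectives
            j += 1
        if j < len(tagged) and tagged[j][1] == "NN":      # required noun
            chunks.append([word for word, _ in tagged[i:j + 1]])
            i = j + 1
        else:
            i += 1  # no chunk starts here; move on one token
    return chunks

# Tagged tokens copied from the tree printed above.
tagged = [("The", "DT"), ("cat", "NN"), ("chased", "VBD"),
          ("the", "DT"), ("mouse", "NN"), (".", ".")]
print(np_chunks(tagged))
```

Tokens that do not fit the pattern, such as the verb and the final period, are simply left outside any chunk.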


In [6]:
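'''Ans 9:- Noun Phrase (NP) chunking is a technique in natural
language processing that identifies and extracts noun phrases from
a text. These phrases consist of a noun and its associated
words, such as adjectives or determiners. NP chunking helps in
extracting meaningful information from sentences. This code identifies
and extracts noun phrases from the input sentence, resulting
in a tree structure representing the NP chunks.

In the output below, the tagger labels "brown" as a noun (NN) rather
than an adjective, so the grammar closes the first chunk at "brown"
and places "fox" in a chunk of its own. Chunk quality therefore
depends on the accuracy of the POS tags.'''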

import nltk
from nltk import word_tokenize, pos_tag, RegexpParser

text = "The quick brown fox jumped over the lazy dog."
tokens = word_tokenize(text)
pos_tags = pos_tag(tokens)

# Define a grammar for NP chunking
grammar = "NP: {<DT>?<JJ>*<NN>}"

chunk_parser = RegexpParser(grammar)
chunks = chunk_parser.parse(pos_tags)

print(chunks)

(S
  (NP The/DT quick/JJ brown/NN)
  (NP fox/NN)
  jumped/VBD
  over/IN
  (NP the/DT lazy/JJ dog/NN)
  ./.)


In [7]:
'''Ans 10:- Named Entity Recognition (NER) is a natural language
processing task that identifies and categorizes named entities in
text, such as names of people, places, organizations, and dates.
It's used to extract structured information from unstructured
text. This code uses spaCy to identify the named entities in the
given text and prints each entity's text and label:

1. "Apple Inc." is identified as an organization (ORG).
2. "Steve Jobs" is identified as a person (PERSON).
3. "Cupertino" is identified as a geopolitical entity (GPE).
4. "California" is identified as a geopolitical entity (GPE).
5. "April 1, 1976" is identified as a date (DATE).'''

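import spacy

# Load the English model (installed by the lemmatization cell above)
nlp = spacy.load('en_core_web_sm')
text = "Apple Inc. was founded by Steve Jobs in Cupertino, California on April 1, 1976."
doc = nlp(text)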

for entity in doc.ents:
    print(f"Entity: {entity.text}, Label: {entity.label_}")

Entity: Apple Inc., Label: ORG
Entity: Steve Jobs, Label: PERSON
Entity: Cupertino, Label: GPE
Entity: California, Label: GPE
Entity: April 1, 1976, Label: DATE
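Turning the recognized entities into structured information can be as simple as grouping them by label. The sketch below uses a hard-coded copy of the pairs printed above, so it runs without spaCy. The variable names are illustrative.

```python
from collections import defaultdict

# (text, label) pairs copied from the printed output of the cell above.
entities = [
    ("Apple Inc.", "ORG"),
    ("Steve Jobs", "PERSON"),
    ("Cupertino", "GPE"),
    ("California", "GPE"),
    ("April 1, 1976", "DATE"),
]

# Group entity texts under their label, preserving the order of appearance.
by_label = defaultdict(list)
for text, label in entities:
    by_label[label].append(text)

for label, texts in by_label.items():
    print(label, texts)
```

With a live spaCy `doc`, the same grouping would presumably iterate over `doc.ents` and read `ent.text` and `ent.label_` instead of the hard-coded list.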
