#	What are Corpora?

A corpus (plural **`corpora`**) is a large, structured collection of machine-readable texts that have been produced in a natural communicative setting. Corpora can be built from several sources, such as text that was originally electronic, transcripts of spoken language, and printed text digitized with optical character recognition (OCR).

#	What are Tokens?

Tokens are the building blocks of natural language text. Tokenization is the process of splitting a piece of text into smaller units called tokens, which can be words, characters, or subwords. Tokenization can therefore be broadly classified into three types: word, character, and subword (character n-gram) tokenization.
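
The three tokenization types can be sketched with the Python standard library alone. This is a minimal illustration, not a production tokenizer: the regular expression and the helper name `char_ngrams` are assumptions made for this example.

```python
import re

text = "Tokenization splits text"

# Word tokenization: runs of word characters, or single punctuation marks
word_tokens = re.findall(r"\w+|[^\w\s]", text)

# Character tokenization: every character of a word becomes a token
char_tokens = list("text")

# Subword tokenization (character n-grams): slide a window of n characters
def char_ngrams(word, n=3):
    return [word[i:i + n] for i in range(len(word) - n + 1)]

subword_tokens = char_ngrams("splits")

print(word_tokens)     # ['Tokenization', 'splits', 'text']
print(char_tokens)     # ['t', 'e', 'x', 't']
print(subword_tokens)  # ['spl', 'pli', 'lit', 'its']
```

Real subword tokenizers usually learn their vocabulary from data; the fixed-width character window here only illustrates the idea of units smaller than a word.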

# What are Unigrams, Bigrams, Trigrams?

In Natural Language Processing, an n-gram is a contiguous sequence of n items taken from a given sample of text, where the items can be characters or words and n can be any positive integer (1, 2, 3, and so on).

An n-gram of size 1 is referred to as a **`unigram`**, size 2 is a **`bigram`**, and size 3 is a **`trigram`**. When **`N>3`**, the n-gram is usually called a four-gram, five-gram, and so on.


#	How to generate n-grams from text?

In [1]:
from nltk.util import ngrams, everygrams

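def ngram_convertor(sentence, n=3):
    # Split on whitespace, then slide a window of n words over the tokens
    ngram_sentence = ngrams(sentence.split(), n)
    for item in ngram_sentence:
        print(item, end=',')  # print every n-gram tuple on one line
    print()
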
sentence = "My Name is Sagar Pahariya"
print('Unigram')
ngram_convertor(sentence,1)
print('\n')
print('Bigram')
ngram_convertor(sentence,2)
print('\n')
print('Trigram')
ngram_convertor(sentence,3)
print('\n')
print('Everygram')
print(list(everygrams(sentence.split())))

Unigram
('My',),('Name',),('is',),('Sagar',),('Pahariya',),


Bigram
('My', 'Name'),('Name', 'is'),('is', 'Sagar'),('Sagar', 'Pahariya'),


Trigram
('My', 'Name', 'is'),('Name', 'is', 'Sagar'),('is', 'Sagar', 'Pahariya'),


Everygram
[('My',), ('My', 'Name'), ('My', 'Name', 'is'), ('My', 'Name', 'is', 'Sagar'), ('My', 'Name', 'is', 'Sagar', 'Pahariya'), ('Name',), ('Name', 'is'), ('Name', 'is', 'Sagar'), ('Name', 'is', 'Sagar', 'Pahariya'), ('is',), ('is', 'Sagar'), ('is', 'Sagar', 'Pahariya'), ('Sagar',), ('Sagar', 'Pahariya'), ('Pahariya',)]
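
To show what the sliding window does, the same word n-grams can be produced without NLTK. This is a minimal sketch; `word_ngrams` is a hypothetical helper name, not part of any library.

```python
def word_ngrams(sentence, n):
    tokens = sentence.split()
    # Zip n staggered views of the token list: tokens[0:], tokens[1:], ...
    # zip stops at the shortest view, so only full windows are produced.
    return list(zip(*(tokens[i:] for i in range(n))))

bigrams = word_ngrams("My Name is Sagar Pahariya", 2)
print(bigrams)
# [('My', 'Name'), ('Name', 'is'), ('is', 'Sagar'), ('Sagar', 'Pahariya')]
```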


#	Explain Lemmatization?

Lemmatization, unlike stemming, reduces inflected words to a root form while ensuring that the root word belongs to the language. In lemmatization, this root word is called the lemma. A lemma (plural lemmas or lemmata) is the canonical form, dictionary form, or citation form of a set of words. For example, *running* and *ran* share the lemma *run*.
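
The idea of mapping inflected forms to a dictionary form can be sketched with a lookup table. This is a toy illustration only: `LEMMA_TABLE` and `lemmatize` are hypothetical names, and real lemmatizers rely on a full vocabulary and morphological analysis rather than a handful of entries.

```python
# Hypothetical lookup table standing in for a real dictionary (illustrative only)
LEMMA_TABLE = {
    "mice": "mouse",
    "running": "run",
    "ran": "run",
    "better": "good",
    "studies": "study",
}

def lemmatize(word):
    """Return the dictionary form of `word`, or the word itself if unknown."""
    return LEMMA_TABLE.get(word.lower(), word)

lemmas = [lemmatize(w) for w in ["mice", "ran", "studies", "corpus"]]
print(lemmas)  # ['mouse', 'run', 'study', 'corpus']
```

Every value in the table is itself a valid word, which is the property that sets a lemma apart from a truncated stem.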

#	Explain Stemming?

Stemming is the process of reducing a word to its word stem by stripping affixes (suffixes and prefixes). Unlike a lemma, the resulting stem does not have to be a valid word in the language. Stemming is widely used in natural language understanding (NLU) and natural language processing (NLP) to treat different forms of the same word as one term.
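
A crude suffix stripper shows the mechanics. This is a sketch under stated assumptions: `SUFFIX_RULES` and `crude_stem` are hypothetical, and the rules are far simpler than an established algorithm such as Porter's.

```python
# Hypothetical (suffix, replacement) rules, tried in order; not the Porter algorithm
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]

def crude_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least `min_stem_len` characters."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[:len(word) - len(suffix)] + replacement
    return word

stems = [crude_stem(w) for w in ["studies", "running", "jumped", "cats", "is"]]
print(stems)  # ['studi', 'runn', 'jump', 'cat', 'is']
```

Outputs such as `studi` and `runn` are not dictionary words, which is the usual trade-off of rule-based stemming: it is fast and needs no vocabulary, but the stems are only approximate.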

# Explain Part-of-speech (POS) tagging?

Part-of-speech (POS) tagging is the process of assigning a part-of-speech label (such as noun, verb, or adjective) to each word in a sentence. In practice, it converts a sentence, given as a list of words, into a list of tuples of the form (word, tag).
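
The (word, tag) output format can be illustrated with a toy lookup tagger. This is only a sketch: `LEXICON` and `lookup_tag` are hypothetical, the tags follow the Penn Treebank convention, and tagging every unknown word as a proper noun is a simplifying assumption. Real taggers use the surrounding context to choose a tag.

```python
# Hypothetical mini-lexicon mapping lowercase words to Penn Treebank tags
LEXICON = {
    "my": "PRP$",  # possessive pronoun
    "name": "NN",  # singular noun
    "is": "VBZ",   # verb, 3rd person singular present
}

def lookup_tag(words, default="NNP"):
    """Tag each word by lexicon lookup, falling back to `default` (proper noun)."""
    return [(word, LEXICON.get(word.lower(), default)) for word in words]

tagged = lookup_tag("My Name is Sagar Pahariya".split())
print(tagged)
# [('My', 'PRP$'), ('Name', 'NN'), ('is', 'VBZ'), ('Sagar', 'NNP'), ('Pahariya', 'NNP')]
```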

#	Explain Chunking or shallow parsing?

Chunking sits somewhere between part-of-speech (POS) tagging and full language parsing, hence the name shallow parsing: it groups POS-tagged words into phrases without building a complete parse tree.

**Chunking:**
![image-3.png](attachment:image-3.png)

#	Explain Noun Phrase (NP) chunking?

Text chunking divides sentences into non-overlapping phrases. Noun phrase chunking deals with extracting the noun phrases from a sentence. While NP chunking is much simpler than full parsing, building an accurate and efficient NP chunker is still a challenging task. NP chunking matters because many applications use it as a preprocessing step.
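
A common way to chunk is to match a pattern over the sequence of POS tags. The sketch below does this with the standard `re` module on a hand-tagged sentence. The grammar (an optional determiner, any number of adjectives, then one or more nouns) and the names `NP_PATTERN` and `np_chunks` are assumptions chosen for illustration.

```python
import re

# Hand-tagged input; in practice the tags would come from a POS tagger
tagged = [("the", "DT"), ("little", "JJ"), ("dog", "NN"), ("barked", "VBD"),
          ("at", "IN"), ("the", "DT"), ("cat", "NN")]

# Hypothetical chunk grammar: optional determiner, adjectives, one or more nouns
NP_PATTERN = re.compile(r"(?:<DT>)?(?:<JJ>)*(?:<NN[A-Z]*>)+")

def np_chunks(tagged_tokens):
    # Encode the tag sequence as "<DT><JJ><NN>..." so a regex can scan it
    tag_string = "".join(f"<{tag}>" for _, tag in tagged_tokens)
    chunks = []
    for match in NP_PATTERN.finditer(tag_string):
        # Convert character offsets in the tag string back to token indices
        start = tag_string[:match.start()].count("<")
        end = tag_string[:match.end()].count("<")
        chunks.append([word for word, _ in tagged_tokens[start:end]])
    return chunks

noun_phrases = np_chunks(tagged)
print(noun_phrases)  # [['the', 'little', 'dog'], ['the', 'cat']]
```

Because `finditer` returns non-overlapping matches, the resulting chunks never share a token, which matches the definition of chunking given above.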

# Explain Named Entity Recognition?

Named entity recognition (NER), sometimes referred to as entity chunking, extraction, or identification, is the task of identifying and categorizing key information (entities) in text. An entity can be any word or series of words that consistently refers to the same thing. Every detected entity is classified into a predetermined category. For example, an NER machine learning (ML) model might detect the word “super.AI” in a text and classify it as a “Company”.

NER is a form of natural language processing (NLP), a subfield of artificial intelligence. NLP is concerned with computers processing and analyzing natural language, i.e., any language that has developed naturally rather than being designed artificially, as programming languages are.
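
The detect-then-classify step can be sketched with a dictionary (gazetteer) lookup. This is a toy illustration, not how an ML-based NER model works: `GAZETTEER`, its entries, and `find_entities` are hypothetical, and a trained model would use context instead of exact string matches.

```python
import re

# Hypothetical gazetteer: entity surface form -> category (illustrative only)
GAZETTEER = {
    "super.AI": "Company",
    "Sagar Pahariya": "Person",
}

def find_entities(text):
    """Return (start, end, entity_text, category) for every gazetteer hit."""
    entities = []
    for name, category in GAZETTEER.items():
        # re.escape keeps characters such as "." literal in the pattern
        for match in re.finditer(re.escape(name), text):
            entities.append((match.start(), match.end(), name, category))
    return sorted(entities)  # order hits by position in the text

text = "Sagar Pahariya read about super.AI"
entities = find_entities(text)
for start, end, name, category in entities:
    print(f"{name} -> {category}")
```

The character offsets are kept so that each detected entity can be traced back to its exact span in the input text.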