<h1> <u> <font color= green > NLP_Assignment_2 </font> </u> </h1>

## 1. What are Corpora?

> A **corpus** is a large and structured collection of text. **Corpora** are used in natural language processing (NLP) to train machine learning models. They can also be used to study the properties of language.

> There are **different types of corpora**, but they typically fall into one of two categories: monolingual or multilingual. **Monolingual corpora** are corpora that contain text in a single language. **Multilingual corpora** contain text in multiple languages.

> Corpora can be collected from a variety of sources, including:
> * **Published text**: This includes books, articles, and other documents that have been published.
> * **Web text**: This includes text that is found on the web.
> * **Speech**: This includes audio recordings of speech.
> * **Code**: This includes source code and other computer-generated text.


## 2. What are Tokens?

> In natural language processing (NLP), a **token** is a unit of text that is used to represent a word, phrase, or other meaningful element of a text. Tokens are typically created by breaking down a text into smaller units, such as words, punctuation marks, and numbers.

> There are several **ways to tokenize text**. One common approach is to use a regular expression to match the character patterns that make up tokens. For example, a regular expression can match every word in a text, as well as punctuation marks and numbers.

> Once a text has been tokenized, the tokens represent the text in a machine-readable format. This is useful for a variety of NLP tasks, such as text classification, machine translation, and question answering.
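
> The regular-expression approach can be sketched as follows. This is a minimal illustration, not a production tokenizer: the pattern and the name `tokenize` are assumptions chosen for this example, and real tokenizers handle many more cases (contractions, URLs, hyphenated words).

```python
import re

# Assumed pattern: a run of word characters (words and numbers),
# or any single character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def tokenize(text):
    """Split text into word, number, and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

tokens = tokenize("The quick brown fox, aged 3, jumps over the lazy dog.")
print(tokens)
```

> Each punctuation mark becomes its own token, so the comma after "fox" and the final period are kept separate from the words next to them.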

> Here are some of the **benefits of using tokens**:
> * They can be used to represent text in a machine-readable format.
> * They can be used to analyze the structure of text.
> * They can be used to represent the meaning of text.


## 3. What are Unigrams, Bigrams, Trigrams?

> In natural language processing, unigrams, bigrams, and trigrams are terms **used to describe sequences of words**. A **unigram** is a single word, a **bigram** is a sequence of two words, and a **trigram** is a sequence of three words.

> These terms are often used in the context of bag-of-words models, which represent text as a **collection of words** without considering the order in which they appear. 
> * For example, the sentence ```"The quick brown fox jumps over the lazy dog"``` would be represented (after lowercasing) as a bag of words with 9 unigrams: "the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", and "dog". Only 8 of them are distinct, because "the" occurs twice.

> Bigrams and trigrams capture the local order of words in a text. 
> * For example, the **bigram** "quick brown" records that the word "quick" is immediately followed by "brown". 
> * Similarly, the **trigram** "brown fox jumps" records that the words "brown", "fox", and "jumps" occur consecutively in that order.

> Bigrams and trigrams can be used to **improve the accuracy** of natural language processing models. For example, a model that is trained on a corpus of text that includes bigrams and trigrams is likely to be more accurate at predicting the next word in a sequence than a model that is only trained on unigrams.
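
> To make the next-word prediction claim concrete, here is a minimal sketch of a bigram model that counts which word follows which. The function names `train_bigram_model` and `predict_next` and the toy corpus are assumptions made for this illustration; a real model would be trained on a much larger corpus and would use smoothing.

```python
from collections import Counter, defaultdict

def train_bigram_model(tokens):
    """Count, for every word, which words follow it and how often."""
    followers = defaultdict(Counter)
    for current_word, next_word in zip(tokens, tokens[1:]):
        followers[current_word][next_word] += 1
    return followers

def predict_next(followers, word):
    """Return the most frequent follower of `word`, or None if none was seen."""
    if word not in followers:
        return None
    return followers[word].most_common(1)[0][0]

# Hypothetical toy corpus, lowercased and already tokenized.
corpus = "the quick brown fox jumps over the lazy dog and the quick cat".split()
model = train_bigram_model(corpus)
print(predict_next(model, "the"))  # "quick" follows "the" twice, "lazy" once
```

> A unigram model has no context to condition on, so it would always predict the single most frequent word regardless of what came before.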

> Here are some examples of unigrams, bigrams, and trigrams:
> * **Unigrams**: the, quick, brown, fox, jumps, over, the, lazy, dog
> * **Bigrams**: the quick, quick brown, brown fox, fox jumps, jumps over, over the, the lazy, lazy dog
> * **Trigrams**: the quick brown, quick brown fox, brown fox jumps, fox jumps over, jumps over the, over the lazy, the lazy dog


## 4. How to generate n-grams from text?

> The steps to generate n-grams from text are:
> * **Tokenize the text**: Break the text down into individual words or tokens.
> * **Create a list of all n-grams**: Slide a window of n tokens across the text one position at a time, starting with the first n tokens and stopping when the window reaches the last token. A text with k tokens (k ≥ n) yields k - n + 1 n-grams.
> * **Remove any n-grams that appear fewer than a certain number of times** (optional): This is called filtering. The threshold an n-gram must reach to be kept is called the minimum frequency.
> * **Sort the n-grams by frequency**: This puts the most common n-grams first.

> Here is an example of how to generate bigrams from the text ```"The quick brown fox jumps over the lazy dog"```:

<b><i> 1. Tokenize the text: </i></b> 
```
tokens = ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
```
<b><i> 2. Create a list of all possible bigrams: </i></b> 
```
bigrams = []
for i in range(len(tokens) - 1):
    bigram = tokens[i] + " " + tokens[i + 1]
    bigrams.append(bigram)
```
<b><i> 3. Remove any bigrams that appear less than a certain number of times: </i></b> 
```
min_frequency = 2
filtered_bigrams = []
for bigram in bigrams:
    # Keep each bigram once, and only if it occurs often enough.
    if bigrams.count(bigram) >= min_frequency and bigram not in filtered_bigrams:
        filtered_bigrams.append(bigram)
```
<b><i> 4. Sort the bigrams by frequency: </i></b> 
```
filtered_bigrams.sort(key=lambda bigram: bigrams.count(bigram), reverse=True)
```


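> The output of this code is a list of the bigrams that meet the minimum frequency, sorted from most to least frequent. In this short example every bigram occurs exactly once, so with `min_frequency = 2` the filtered list is empty. With `min_frequency = 1`, all eight bigrams are kept in their original order, starting with "The quick" and "quick brown".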

<b><i> Python code that can generate n-grams from text: </i></b>

```
import re

def generate_ngrams(text, n):
    # Split on runs of whitespace so repeated spaces do not produce empty tokens.
    tokens = re.split(r"\s+", text.strip())
    ngrams = []
    # Slide a window of n tokens across the token list.
    for i in range(len(tokens) - n + 1):
        ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams


text = "The quick brown fox jumps over the lazy dog"
trigrams = generate_ngrams(text, 3)
print(trigrams)
```
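
> The filtering and sorting steps can also be done in a single pass with `collections.Counter` from the standard library, which avoids calling `list.count` repeatedly. The sketch below is one possible way to do it; the helper names `count_ngrams` and `frequent_ngrams` and the sample text are assumptions made for this example.

```python
from collections import Counter

def count_ngrams(tokens, n):
    """Count every n-gram (as a tuple of tokens) in a token list."""
    # zip over n shifted copies of the list yields each window of n tokens.
    ngrams = zip(*(tokens[i:] for i in range(n)))
    return Counter(ngrams)

def frequent_ngrams(tokens, n, min_frequency=1):
    """Return (n-gram, count) pairs with count >= min_frequency, most common first."""
    counts = count_ngrams(tokens, n)
    return [(gram, c) for gram, c in counts.most_common() if c >= min_frequency]

# Hypothetical sample text in which the bigram "the quick" appears twice.
tokens = "the quick brown fox and the quick dog".split()
print(frequent_ngrams(tokens, 2, min_frequency=2))
# [(('the', 'quick'), 2)]
```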


## 5. Explain Lemmatization

> **Lemmatization** is the process of grouping together different inflected forms of a word so they can be analyzed as a single item. For example, the words "runs," "ran," and "running" would all be lemmatized to the word "run."

> Lemmatization is **similar to stemming**, but it is more sophisticated. **Stemming simply strips** word endings using fixed rules, while **lemmatization uses a vocabulary and the word's part of speech** to return its dictionary form (the lemma). This means that lemmatization can handle irregular forms, such as mapping "was" to "be", that suffix stripping cannot.

> Here are some of the benefits of using lemmatization:
> * It can help to improve the accuracy of NLP tasks.
> * It can help to reduce the ambiguity of words.
> * It can help to make text more consistent.

> Examples of lemmatization:
> * The word "runs" can be lemmatized to "run".
> * The word "was" can be lemmatized to "be".
> * The word "ate" can be lemmatized to "eat".
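
> The idea of dictionary lookup can be sketched with a tiny hand-written table. This is only an illustration under stated assumptions: `LEMMA_TABLE` and `lemmatize` are hypothetical names, the table covers a handful of words, and real lemmatizers rely on large lexical resources and morphological analysis.

```python
# Hypothetical lookup table keyed by (word, part of speech).
LEMMA_TABLE = {
    ("runs", "verb"): "run",
    ("ran", "verb"): "run",
    ("running", "verb"): "run",
    ("was", "verb"): "be",
    ("ate", "verb"): "eat",
    ("saw", "verb"): "see",   # "I saw the dog"
    ("saw", "noun"): "saw",   # "a sharp saw"
}

def lemmatize(word, pos):
    """Return the dictionary form of `word`, or the lowercased word if unknown."""
    key = (word.lower(), pos)
    return LEMMA_TABLE.get(key, word.lower())

for word, pos in [("runs", "verb"), ("was", "verb"), ("ate", "verb"),
                  ("saw", "verb"), ("saw", "noun")]:
    print(word, pos, "->", lemmatize(word, pos))
```

> The two entries for "saw" show why the part of speech matters: the same surface form has a different lemma depending on how it is used.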


## 6. Explain Stemming

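> **Stemming** is a process of reducing inflected words to their word stem, **base or root form**. The stem is the part of the word that carries the basic meaning of the word. For example, the words "running" and "runs" both reduce to the stem "run". Because stemmers work by stripping suffixes, they do not handle irregular forms such as "ran", and the resulting stem is not always a valid word.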

> Here are some of the benefits of using stemming:
> * It can help to improve the accuracy of NLP tasks.
> * It can help to reduce the size of the vocabulary that needs to be stored.
> * It can help to make text more consistent.

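> Examples of stemming:
> * The word "running" can be stemmed to "run".
> * The word "was" is not reduced to "be", because no suffix rule links the two forms.
> * The word "ate" is not reduced to "eat" for the same reason; mapping irregular forms requires lemmatization.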
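
> A toy suffix-stripping stemmer makes the behaviour above easy to see. The suffix list and the function `simple_stem` are assumptions made for this sketch; this is not the Porter algorithm or any library's rule set.

```python
# Hypothetical suffix list, checked in order.
SUFFIXES = ["ing", "ed", "es", "s"]

def simple_stem(word):
    """Strip one known suffix, then undo consonant doubling."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # "running" -> "runn" -> "run"; l, s, z are left doubled ("fall", "pass").
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeioulsz":
                stem = stem[:-1]
            return stem
    return word

for word in ["running", "runs", "jumped", "ran", "ate", "was"]:
    print(word, "->", simple_stem(word))
```

> The regular forms lose their endings, while "ran", "ate", and "was" pass through unchanged because no suffix rule applies to them.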


## 7. Explain Part-of-speech (POS) tagging

> **Part-of-speech (POS) tagging** is the process of assigning a part-of-speech tag to each word in a sentence. **Part-of-speech tags** are labels that indicate the syntactic category of a word, such as noun, verb, adjective, adverb, preposition, conjunction, pronoun, or interjection.

> **POS tagging** is a fundamental task in natural language processing (NLP). It is used in a variety of NLP tasks, such as text classification, machine translation, and question answering.

> There are two main **types of POS tagging**: rule-based and statistical. 
> * **Rule-based POS taggers** use a set of rules to assign part-of-speech tags to words. 
> * **Statistical POS taggers** use a statistical model to learn how to assign part-of-speech tags to words.

> Here are some of the benefits of using POS tagging:
> * It can help to improve the accuracy of NLP tasks.
> * It can help to make text more consistent.
> * It can help to understand the meaning of text.

> Examples of words for each part of speech:
> * **Noun**: dog, cat, table, chair
> * **Verb**: run, jump, eat, sleep
> * **Adjective**: big, small, red, blue
> * **Adverb**: quickly, slowly, loudly, quietly
> * **Preposition**: in, on, under, over
> * **Conjunction**: and, or, but, yet
> * **Pronoun**: I, you, he, she, it
> * **Interjection**: oh, wow, ouch, shh
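
> A very small rule-based tagger can be sketched from the word lists above. Everything here is an assumption made for illustration: the tag names, the added `DET` entries, the "-ly" suffix rule, and the noun fallback. Real taggers resolve ambiguity from context, which this sketch does not attempt.

```python
# Lexicon built from the example words above; the DET entries are an added assumption.
WORDS_BY_TAG = {
    "NOUN": ["dog", "cat", "table", "chair"],
    "VERB": ["run", "jump", "eat", "sleep"],
    "ADJ": ["big", "small", "red", "blue"],
    "ADV": ["quickly", "slowly", "loudly", "quietly"],
    "PREP": ["in", "on", "under", "over"],
    "CONJ": ["and", "or", "but", "yet"],
    "PRON": ["i", "you", "he", "she", "it"],
    "INTJ": ["oh", "wow", "ouch", "shh"],
    "DET": ["the", "a", "an"],
}
LEXICON = {word: tag for tag, words in WORDS_BY_TAG.items() for word in words}

def pos_tag(tokens):
    """Tag each token by lexicon lookup, then a suffix rule, then a default."""
    tagged = []
    for token in tokens:
        word = token.lower()
        if word in LEXICON:
            tagged.append((token, LEXICON[word]))
        elif word.endswith("ly"):
            tagged.append((token, "ADV"))   # rule: "-ly" words are usually adverbs
        else:
            tagged.append((token, "NOUN"))  # fallback guess for unknown words
    return tagged

sentence = "the big dog and the small cat sleep quietly under the table"
print(pos_tag(sentence.split()))
```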


## 8. Explain Chunking or shallow parsing


> **Chunking or shallow parsing** is a natural language processing task that involves grouping words together into non-overlapping phrases or chunks based on their grammatical relationships. For example, the sentence ```"The quick brown fox jumps over the lazy dog"``` could be chunked into the following phrases:

                    The quick brown fox (noun phrase)
                    jumps (verb)
                    over the lazy dog (prepositional phrase)

> Chunking is a less complex task than full parsing, which involves **identifying the syntactic structure** of an entire sentence. However, chunking can still be useful for a variety of natural language processing tasks, such as text classification, machine translation, and question answering.

> There are two main types of chunkers: rule-based and statistical. 
> * **Rule-based chunkers** use a set of rules to identify chunks. 
> * **Statistical chunkers** use a statistical model to learn how to identify chunks.

> Here are some of the benefits of using chunking:
> * It can help to improve the accuracy of NLP tasks.
> * It can help to make text more consistent.
> * It can help to understand the meaning of text.

> Some examples of chunks:
> * **Noun phrase**: The quick brown fox; the lazy dog
> * **Verb phrase**: jumps
> * **Prepositional phrase**: over the lazy dog
> * **Adjective phrase**: lazy (the modifier inside "the lazy dog")

> Chunking can be performed using a variety of tools and libraries. Some popular options include:
> * **Stanford CoreNLP** is a natural language processing toolkit whose parsers can be used to extract phrases from English text.
> * **spaCy** is a natural language processing library that exposes noun phrases through `Doc.noun_chunks`.
> * **NLTK** is a natural language processing library that provides `RegexpParser` for rule-based chunking over POS-tagged tokens.


## 9. Explain Noun Phrase (NP) chunking

> **Noun phrase (NP) chunking** is a type of shallow parsing that involves grouping words together into noun phrases (NPs). NPs are phrases that have a noun as their head word. For example, the sentence ```"The quick brown fox jumps over the lazy dog"``` contains the following NPs, with "jumps over" left outside any chunk:

                        The quick brown fox
                        the lazy dog

> There are two main types of NP chunkers: rule-based and statistical. 
> * **Rule-based NP chunkers** use a set of rules to identify NPs. 
> * **Statistical NP chunkers** use a statistical model to learn how to identify NPs.
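
> A rule-based NP chunker can be sketched as a single pattern over POS-tagged tokens: an optional determiner, any number of adjectives, then one or more nouns. The function name `np_chunks`, the tag names, and the hand-tagged input are assumptions made for this example; in practice the tags would come from a POS tagger.

```python
def np_chunks(tagged_tokens):
    """Group tokens matching DET? ADJ* NOUN+ into noun-phrase chunks."""
    chunks = []
    i = 0
    n = len(tagged_tokens)
    while i < n:
        j = i
        if tagged_tokens[j][1] == "DET":              # optional determiner
            j += 1
        while j < n and tagged_tokens[j][1] == "ADJ":  # any adjectives
            j += 1
        noun_start = j
        while j < n and tagged_tokens[j][1] == "NOUN":  # one or more nouns
            j += 1
        if j > noun_start:
            # At least one noun was found, so tokens i..j-1 form a chunk.
            chunks.append(" ".join(word for word, _ in tagged_tokens[i:j]))
            i = j
        else:
            i += 1  # no NP starts here; move on
    return chunks

# Hand-tagged input (assumed tags).
tagged = [("The", "DET"), ("quick", "ADJ"), ("brown", "ADJ"), ("fox", "NOUN"),
          ("jumps", "VERB"), ("over", "PREP"),
          ("the", "DET"), ("lazy", "ADJ"), ("dog", "NOUN")]
print(np_chunks(tagged))
# ['The quick brown fox', 'the lazy dog']
```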

> Examples of NPs, and of phrases that are not NPs:
> * **Noun phrase**: The quick brown fox
> * **Noun phrase**: the lazy dog
> * **Not an NP (prepositional phrase)**: over the lazy dog
> * **Not an NP (verb phrase)**: jumps over


## 10. Explain Named Entity Recognition

> **Named Entity Recognition (NER)** is a natural language processing task that involves identifying named entities in text. **Named entities** are words or phrases that refer to specific things, such as <b><i> people, organizations, locations, dates, and times </i></b>. The sentence ```"The quick brown fox jumps over the lazy dog"``` contains no named entities, because "fox" and "dog" are common nouns rather than names. By contrast, the sentence ```"John Smith joined Google in New York City on February 25, 2023"``` contains the following named entities:

                        John Smith: A person
                        Google: An organization
                        New York City: A location
                        February 25, 2023: A date

> **NER** supports a variety of natural language processing tasks, such as information extraction, machine translation, and question answering. For example:
> * **In information extraction**, NER can be used to extract information about people, organizations, and locations from text. 
> * **In machine translation**, NER can be used to ensure that named entities are translated correctly. 
> * **In question answering**, NER can be used to identify the entities that are being asked about in a question.

> There are two main types of NER systems: rule-based and statistical. 
> * **Rule-based NER systems** use a set of rules to identify named entities. 
> * **Statistical NER systems** use a statistical model to learn how to identify named entities.

> Examples of named entities:
> * **Person**: John Smith, Jane Doe
> * **Organization**: Google, Microsoft, Amazon
> * **Location**: New York City, Paris, London
> * **Date**: February 25, 2023
> * **Time**: 10:00 AM
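
> A rule-based NER system can be sketched with a name list (a gazetteer) plus regular expressions for dates and times. The gazetteer below reuses the example entities above; the patterns, the function name `find_entities`, and the sample sentence are assumptions made for this illustration, and a real system would need far larger lists and more robust patterns.

```python
import re

# Gazetteer taken from the example entities above.
GAZETTEER = {
    "Person": ["John Smith", "Jane Doe"],
    "Organization": ["Google", "Microsoft", "Amazon"],
    "Location": ["New York City", "Paris", "London"],
}

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
# Assumed formats: "February 25, 2023" and "10:00 AM".
DATE_PATTERN = re.compile(rf"\b(?:{MONTHS}) \d{{1,2}}, \d{{4}}\b")
TIME_PATTERN = re.compile(r"\b\d{1,2}:\d{2} (?:AM|PM)\b")

def find_entities(text):
    """Return (entity text, label) pairs in order of appearance."""
    found = []
    for label, names in GAZETTEER.items():
        for name in names:
            for match in re.finditer(rf"\b{re.escape(name)}\b", text):
                found.append((match.start(), match.group(), label))
    for label, pattern in (("Date", DATE_PATTERN), ("Time", TIME_PATTERN)):
        for match in pattern.finditer(text):
            found.append((match.start(), match.group(), label))
    # Sort by start position, then drop the position from the result.
    return [(entity, label) for _, entity, label in sorted(found)]

text = "John Smith joined Google in New York City on February 25, 2023 at 10:00 AM."
for entity, label in find_entities(text):
    print(f"{entity}: {label}")
```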
