## 1. What are Corpora?
**Answer:** Corpora (plural of corpus) are large and structured sets of texts or documents used for statistical analysis and hypothesis testing in linguistic research. They provide a sample of natural language usage and are used to study language patterns and structures.

**Example:** The Brown Corpus, the British National Corpus, and the Google Books Corpus are all examples of corpora used in linguistic research and NLP.

---

## 2. What are Tokens?
**Answer:** Tokens are the individual units a text is split into for processing, such as words, punctuation marks, or other symbols. Tokenization is the process of splitting text into these units.

**Example:** In the sentence "ChatGPT is awesome!", the tokens are ["ChatGPT", "is", "awesome", "!"].
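
The split above can be reproduced with a small regular-expression tokenizer. This is only a minimal sketch: the function name `simple_tokenize` and the regex are illustrative assumptions, and library tokenizers handle contractions, URLs, and similar cases that this pattern does not.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one non-space, non-word character such as "!" or ",".
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("ChatGPT is awesome!")
print(tokens)  # ['ChatGPT', 'is', 'awesome', '!']
```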

---

## 3. What are Unigrams, Bigrams, Trigrams?
**Answer:**
- **Unigrams:** Single tokens or words in a text. For instance, in the sentence "NLP is fun", the unigrams are ["NLP", "is", "fun"].
- **Bigrams:** Sequences of two consecutive tokens. For the same sentence, the bigrams are ["NLP is", "is fun"].
- **Trigrams:** Sequences of three consecutive tokens. In the sentence, the trigrams are ["NLP is fun"].
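
The three cases above differ only in the window size `n`. As a rough sketch in plain Python (the helper name `make_ngrams` is hypothetical, not part of any library), all of them can be produced by zipping shifted copies of the token list:

```python
def make_ngrams(tokens, n):
    """Return all contiguous n-token sequences as tuples."""
    # zip over n shifted views of the token list; zip stops at the shortest
    # view, so the result has len(tokens) - n + 1 items (none if n > len(tokens)).
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "NLP is fun".split()
print(make_ngrams(tokens, 1))  # [('NLP',), ('is',), ('fun',)]
print(make_ngrams(tokens, 2))  # [('NLP', 'is'), ('is', 'fun')]
print(make_ngrams(tokens, 3))  # [('NLP', 'is', 'fun')]
```

Each n-gram is returned as a tuple of tokens rather than a joined string, which keeps the result usable as a dictionary key when counting frequencies.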

---

## 4. How to generate n-grams from text?
**Answer:** N-grams are contiguous sequences of `n` items from a given text. To generate n-grams, you typically follow these steps:
1. **Tokenize** the text into a list of tokens.
2. **Create n-grams** by taking contiguous sequences of `n` tokens.

**Example Code:**
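```python
from nltk import ngrams

text = "ChatGPT is an amazing tool"
tokens = text.split()  # simple whitespace tokenization
bigrams = list(ngrams(tokens, 2))  # each bigram is a tuple of two tokens
print(bigrams)
# [('ChatGPT', 'is'), ('is', 'an'), ('an', 'amazing'), ('amazing', 'tool')]
```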



---

## 5. Explain Lemmatization
**Answer:** Lemmatization is the process of reducing a word to its dictionary form (lemma). Unlike stemming, it uses a vocabulary and the word's part of speech, so the result is always a valid word.

**Example:** The word "running" (as a verb) is lemmatized to "run", and "better" (as an adjective) is lemmatized to "good".

**Example Code:**
```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')  # WordNet data is needed once per environment

lemmatizer = WordNetLemmatizer()
lemmatized_word = lemmatizer.lemmatize("running", pos='v')  # 'run'
```


---

## 6. Explain Stemming
**Answer:** Stemming is the process of reducing a word to a root form by stripping suffixes with heuristic rules. It is less precise than lemmatization and can produce stems that are not actual words.

**Example:** The words "running" and "runs" are both stemmed to "run" by the Porter stemmer. "runner" is left unchanged by Porter, although a more aggressive stemmer might also reduce it to "run".

**Example Code:**
```python
from nltk import PorterStemmer

stemmer = PorterStemmer()
stemmed_word = stemmer.stem("running")  # 'run'
```
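
To see why rule-based suffix stripping can yield non-words, here is a deliberately naive sketch. It is not the Porter algorithm: the function `naive_stem`, its suffix list, and the minimum stem length of three characters are all assumptions made for illustration.

```python
def naive_stem(word, suffixes=("ing", "ed", "es", "s")):
    """Strip the first matching suffix; a toy rule, not the Porter algorithm."""
    for suffix in suffixes:
        # Keep at least three characters so short words like "is" survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["running", "runs", "studies", "is"]:
    print(w, "->", naive_stem(w))
# running -> runn
# runs -> run
# studies -> studi
# is -> is
```

"runn" and "studi" are not dictionary words, which is the trade-off stemming makes. Real stemmers add further rules (for example, undoubling consonants) to reduce such cases, but do not eliminate them.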


---

## 7. Explain Part-of-Speech (POS) tagging
**Answer:** POS tagging is the process of identifying the part of speech (noun, verb, adjective, etc.) of each token in a text. It helps in understanding the grammatical structure of the sentence.

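**Example Code:**
```python
import nltk

# Resource names vary by NLTK version; newer releases may instead need
# 'punkt_tab' and 'averaged_perceptron_tagger_eng'.
nltk.download('punkt')  # tokenizer data used by word_tokenize
nltk.download('averaged_perceptron_tagger')

tokens = nltk.word_tokenize("ChatGPT is an amazing tool")
pos_tags = nltk.pos_tag(tokens)  # list of (token, tag) pairs
```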




---

## 8. Explain Chunking or Shallow Parsing
**Answer:** Chunking involves grouping tokens into chunks or phrases based on their POS tags. It helps in identifying and extracting meaningful structures from the text, such as noun phrases or verb phrases.

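**Example Code:**
```python
from nltk import RegexpParser

# NP: an optional determiner, any number of adjectives, then a singular noun
grammar = "NP: {<DT>?<JJ>*<NN>}"
parser = RegexpParser(grammar)
chunks = parser.parse(pos_tags)  # pos_tags comes from the POS tagging example
```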


---

## 9. Explain Noun Phrase (NP) chunking
**Answer:** NP chunking is a specific type of chunking that focuses on identifying and grouping noun phrases within a text. Noun phrases typically consist of a noun and its modifiers.

**Example Code:**
```python
from nltk import RegexpParser

grammar = "NP: {<DT>?<JJ>*<NN>}"
parser = RegexpParser(grammar)
np_chunks = parser.parse(pos_tags)  # pos_tags comes from the POS tagging example
```
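
The grammar `<DT>?<JJ>*<NN>` reads as: an optional determiner, zero or more adjectives, then one singular noun. As a hedged illustration of what that pattern matches, the sketch below applies the same rule by hand to a list of `(word, tag)` pairs. The function `np_chunk` is hypothetical, and the tags are hand-written for the example rather than produced by a tagger.

```python
def np_chunk(tagged):
    """Greedily group (word, tag) pairs matching DT? JJ* NN into NP chunks."""
    chunks, i = [], 0
    while i < len(tagged):
        j = i
        if tagged[j][1] == "DT":  # optional determiner
            j += 1
        while j < len(tagged) and tagged[j][1] == "JJ":  # zero or more adjectives
            j += 1
        if j < len(tagged) and tagged[j][1] == "NN":  # required singular noun
            chunks.append(("NP", tagged[i : j + 1]))
            i = j + 1
        else:
            chunks.append(tagged[i])  # token stays outside any chunk
            i += 1
    return chunks

# Hand-written tags for illustration; a real tagger's output may differ.
tagged = [("ChatGPT", "NNP"), ("is", "VBZ"), ("an", "DT"),
          ("amazing", "JJ"), ("tool", "NN")]
print(np_chunk(tagged))
# [('ChatGPT', 'NNP'), ('is', 'VBZ'),
#  ('NP', [('an', 'DT'), ('amazing', 'JJ'), ('tool', 'NN')])]
```

Note that `<NN>` matches only the exact tag `NN`, so a proper noun tagged `NNP` or a plural tagged `NNS` is not chunked by this grammar.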


---

## 10. Explain Named Entity Recognition (NER)
**Answer:** NER is the process of identifying named entities in text and classifying them into predefined categories such as person names, organizations, locations, and dates. It helps in extracting structured information from unstructured text.

**Example Code:**
```python
import spacy

# Requires the model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion.")
for ent in doc.ents:
    print(ent.text, ent.label_)
```

**Output:**
```
Apple ORG
U.K. GPE
$1 billion MONEY
```
