---

## **Notes on the Code**

### **Concept of Tokenization**
Tokenization is the process of breaking down a given text into smaller components, such as sentences or words. It is a fundamental step in Natural Language Processing (NLP) that helps in text analysis and processing. There are two main types of tokenization demonstrated in this code:
1. **Sentence Tokenization**: Splits a paragraph or large text into individual sentences.
2. **Word Tokenization**: Splits sentences or paragraphs into individual words.
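
As a rough illustration of the two ideas, the sketch below splits text with the standard library only. It is a deliberately naive stand-in, not how NLTK works internally: the regular expressions and the sample `text` are assumptions made for this example, and abbreviations such as "Dr." would be split incorrectly, which is one reason NLTK relies on the trained Punkt model instead.

```python
import re

text = "Tokenization is useful. It splits text! Does it handle questions?"

# Naive sentence tokenization: split on whitespace that follows '.', '!' or '?'.
sentences = re.split(r"(?<=[.!?])\s+", text.strip())

# Naive word tokenization: runs of word characters, or single punctuation marks.
words = re.findall(r"\w+|[^\w\s]", sentences[0])

print(sentences)  # ['Tokenization is useful.', 'It splits text!', 'Does it handle questions?']
print(words)      # ['Tokenization', 'is', 'useful', '.']
```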

---

### **Step-by-Step Explanation**

#### **1. Install NLTK Library**
The line `!pip install nltk` installs the **Natural Language Toolkit (NLTK)**, a library widely used for NLP tasks.

#### **2. Input Corpus**
```python
corpus = """Hello Welcome, My name is Armghan.
I'm currently learning NLP to effecrively land a Job in Gen Ai field.
Hopefully i'll get it in this year.
"""
```
The `corpus` variable contains a sample text that will be tokenized into sentences and words.

---

#### **3. Sentence Tokenization**
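```python
import nltk
from nltk.tokenize import sent_tokenize
nltk.download('punkt_tab')  # sentence tokenizer data; older NLTK releases use 'punkt'
documents = sent_tokenize(corpus)
```
- **`sent_tokenize()`**: Breaks the input text into sentences.
- **`nltk.download('punkt_tab')`**: Downloads the Punkt tokenizer data that `sent_tokenize()` needs. Recent NLTK releases load it from the `punkt_tab` package, as the notebook cell below does; older releases used the `punkt` package instead.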
- **Output**: The variable `documents` will store a list of sentences:
  ```python
  ['Hello Welcome, My name is Armghan.',
   "I'm currently learning NLP to effecrively land a Job in Gen Ai field.",
   "Hopefully i'll get it in this year."]
  ```

#### **4. Printing Sentences**
```python
for sentence in documents:
    print(sentence)
```
This loop iterates through the `documents` list and prints each sentence.

---

#### **5. Word Tokenization**
##### **Paragraph to Words**
```python
from nltk.tokenize import word_tokenize
word_tokenize(corpus)
```
- **`word_tokenize()`**: Splits the entire paragraph into words (including punctuation).
- **Output**:
  ```python
  ['Hello', 'Welcome', ',', 'My', 'name', 'is', 'Armghan', '.',
   'I', "'m", 'currently', 'learning', 'NLP', 'to', 'effecrively',
   'land', 'a', 'Job', 'in', 'Gen', 'Ai', 'field', '.',
   'Hopefully', 'i', "'ll", 'get', 'it', 'in', 'this', 'year', '.']
  ```

##### **Sentence to Words**
```python
for sentence in documents:
    print(word_tokenize(sentence))
```
This loop applies `word_tokenize()` to each sentence from the `documents` list and prints the tokenized words for each sentence separately.

---

#### **6. WordPunct Tokenizer**
```python
from nltk.tokenize import wordpunct_tokenize
wordpunct_tokenize(corpus)
```
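- **`wordpunct_tokenize()`**: Splits the text into runs of word characters and runs of punctuation. Unlike `word_tokenize()`, it treats the apostrophe as punctuation, so `I'm` becomes `'I'`, `"'"`, `'m'`.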
- **Output**:
  ```python
  ['Hello', 'Welcome', ',', 'My', 'name', 'is', 'Armghan', '.',
   'I', "'", 'm', 'currently', 'learning', 'NLP', 'to',
   'effecrively', 'land', 'a', 'Job', 'in', 'Gen', 'Ai',
   'field', '.', 'Hopefully', 'i', "'", 'll', 'get',
   'it', 'in', 'this', 'year', '.']
  ```
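
The output above is consistent with a single regular expression that matches either a run of word characters or a run of non-space punctuation. The sketch below reproduces that behaviour with the standard library only. The pattern is believed to mirror the one behind NLTK's `WordPunctTokenizer`, but treat it as an illustration rather than the library's source; `wordpunct_sketch` is a hypothetical helper name introduced for this example.

```python
import re

# Either a run of word characters, or a run of characters that are
# neither word characters nor whitespace (i.e. punctuation).
WORDPUNCT_PATTERN = re.compile(r"\w+|[^\w\s]+")


def wordpunct_sketch(text):
    """Return word and punctuation tokens in the order they appear."""
    return WORDPUNCT_PATTERN.findall(text)


tokens = wordpunct_sketch("I'm learning NLP.")
print(tokens)  # ['I', "'", 'm', 'learning', 'NLP', '.']
```

Because the apostrophe is matched as punctuation, contractions are always broken into three pieces, which is the main practical difference from `word_tokenize()`.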

---

#### **7. Treebank Word Tokenizer**
```python
from nltk.tokenize import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()
tokenizer.tokenize(corpus)
```
- **`TreebankWordTokenizer()`**: A tokenizer based on the Penn Treebank conventions. It splits contractions (e.g., "I'm" → ["I", "'m"]) and most punctuation. It assumes the input is a single sentence, so on a multi-sentence paragraph only the final period becomes its own token; periods inside the text stay attached to the preceding word (`'Armghan.'`, `'field.'`).
- **Output**:
  ```python
  ['Hello', 'Welcome', ',', 'My', 'name', 'is', 'Armghan.',
   'I', "'m", 'currently', 'learning', 'NLP', 'to', 'effecrively',
   'land', 'a', 'Job', 'in', 'Gen', 'Ai', 'field.',
   'Hopefully', 'i', "'ll", 'get', 'it', 'in', 'this', 'year', '.']
  ```
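
Because the Treebank tokenizer expects one sentence at a time, it is usually applied after sentence splitting. The sketch below contrasts the two usages. It assumes NLTK is installed; the sentence list is written out by hand to keep the example self-contained, whereas in the notebook it would come from `sent_tokenize()`.

```python
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()

# Sentences written out by hand here; in the notebook they come from sent_tokenize().
sentences = [
    "Hello Welcome, My name is Armghan.",
    "Hopefully i'll get it in this year.",
]

# Whole paragraph at once: the interior period stays attached ('Armghan.').
paragraph_tokens = tokenizer.tokenize(" ".join(sentences))

# One sentence at a time: each sentence-final period becomes its own token.
sentence_tokens = [tokenizer.tokenize(s) for s in sentences]

print(paragraph_tokens)
print(sentence_tokens)
```

Tokenizing sentence by sentence should give the same tokens that `word_tokenize()` produces for this corpus.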

---

### **Summary**
- **Sentence Tokenization**: Splits a text into sentences using `sent_tokenize()`.
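- **Word Tokenization**:
  - `word_tokenize()` splits text into words and punctuation, keeping contraction suffixes such as `'m` and `'ll` as single tokens.
  - `wordpunct_tokenize()` separates every run of punctuation, so `I'm` becomes `I`, `'`, `m`.
  - `TreebankWordTokenizer` applies Penn Treebank rules to contractions and punctuation. It expects one sentence at a time, so periods inside a paragraph stay attached to the preceding word.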

Each tokenizer has its own purpose and use case. Experimenting with these helps to understand which tokenizer is best suited for specific NLP tasks.

In [None]:
!pip install nltk



In [None]:
corpus = """Hello Welcome, My name is Armghan.
I'm currently learning NLP to effecrively land a Job in Gen Ai field.
Hopefully i'll get it in this year.
 """

In [None]:
print(corpus)

Hello Welcome, My name is Armghan.
I'm currently learning NLP to effecrively land a Job in Gen Ai field.
Hopefully i'll get it in this year.
 


In [None]:
## Tokenization
## Paragraph --> sentences
import nltk  # Import the nltk module
from nltk.tokenize import sent_tokenize
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [None]:
documents=sent_tokenize(corpus)

In [None]:
type(documents)

list

In [None]:
for sentence in documents:
  print(sentence)

Hello Welcome, My name is Armghan.
I'm currently learning NLP to effecrively land a Job in Gen Ai field.
Hopefully i'll get it in this year.


In [None]:
## Tokenization
## Paragraph --> words
## Sentence --> words
from nltk.tokenize import word_tokenize


In [None]:
word_tokenize(corpus)

['Hello',
 'Welcome',
 ',',
 'My',
 'name',
 'is',
 'Armghan',
 '.',
 'I',
 "'m",
 'currently',
 'learning',
 'NLP',
 'to',
 'effecrively',
 'land',
 'a',
 'Job',
 'in',
 'Gen',
 'Ai',
 'field',
 '.',
 'Hopefully',
 'i',
 "'ll",
 'get',
 'it',
 'in',
 'this',
 'year',
 '.']

In [None]:
for sentence in documents:
  print(word_tokenize(sentence))

['Hello', 'Welcome', ',', 'My', 'name', 'is', 'Armghan', '.']
['I', "'m", 'currently', 'learning', 'NLP', 'to', 'effecrively', 'land', 'a', 'Job', 'in', 'Gen', 'Ai', 'field', '.']
['Hopefully', 'i', "'ll", 'get', 'it', 'in', 'this', 'year', '.']


In [None]:
from nltk.tokenize import wordpunct_tokenize

In [None]:
wordpunct_tokenize(corpus)

['Hello',
 'Welcome',
 ',',
 'My',
 'name',
 'is',
 'Armghan',
 '.',
 'I',
 "'",
 'm',
 'currently',
 'learning',
 'NLP',
 'to',
 'effecrively',
 'land',
 'a',
 'Job',
 'in',
 'Gen',
 'Ai',
 'field',
 '.',
 'Hopefully',
 'i',
 "'",
 'll',
 'get',
 'it',
 'in',
 'this',
 'year',
 '.']

In [None]:
## Treebank word tokenizer
from nltk.tokenize import TreebankWordTokenizer


In [None]:
tokenizer = TreebankWordTokenizer()

In [None]:
tokenizer.tokenize(corpus)

['Hello',
 'Welcome',
 ',',
 'My',
 'name',
 'is',
 'Armghan.',
 'I',
 "'m",
 'currently',
 'learning',
 'NLP',
 'to',
 'effecrively',
 'land',
 'a',
 'Job',
 'in',
 'Gen',
 'Ai',
 'field.',
 'Hopefully',
 'i',
 "'ll",
 'get',
 'it',
 'in',
 'this',
 'year',
 '.']