In [1]:
corpus = """Hey there!
Welcome to the Roadmap of the Generative AI Crash Course! 
Here, you’ll explore NLP’s, ML’s, DL’s, and LLM’s concepts.
"""
token = ['eating', 'eats', 'eaten', 'writing', 'write', 'history', 'finally', 'final']

# Tokenization

Tokenization is the process of splitting text into smaller units, such as sentences or words. This is a crucial step in natural language processing.

## Paragraph to Sentences

When working with text, you often need to split a paragraph into individual sentences before further analysis, for example before tokenizing each sentence into words.
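
As a rough illustration of sentence splitting, the sketch below breaks text after `.`, `!`, or `?` using only the standard library. It is a naive approximation and not the algorithm behind NLTK's `sent_tokenize`; the function name `naive_sent_split` and the sample text are made up for this example.

```python
import re

def naive_sent_split(text):
    """Split text into sentences after '.', '!' or '?' followed by whitespace."""
    # The lookbehind keeps the closing punctuation attached to its sentence.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    # Drop empty strings, which appear when the input is empty.
    return [part for part in parts if part]

sample = "Hey there!\nWelcome to the crash course! \nHere, you will explore NLP concepts."
print(naive_sent_split(sample))
# ['Hey there!', 'Welcome to the crash course!', 'Here, you will explore NLP concepts.']
```

A rule this simple also splits after abbreviations: `naive_sent_split("Dr. Smith arrived.")` returns `['Dr.', 'Smith arrived.']`. That weakness is one reason to prefer a trained sentence tokenizer for real text.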

### Useful Resources
- For more information on tokenization, check the [NLTK Tokenization Documentation](https://www.nltk.org/api/nltk.tokenize.html).
- Explore additional resources on natural language processing techniques.

### Key Terms
- **Tokenization**: The process of breaking text into tokens (words, phrases, symbols).
- **Sentence**: A set of words that conveys a complete thought.


In [2]:
## Paragraph/Sentence -> Words

from nltk.tokenize import word_tokenize, wordpunct_tokenize, TreebankWordTokenizer, sent_tokenize
from IPython.display import Markdown, display

print("\033[1mThe Corpus is:-\033[0m \n", corpus)
# Sentence Tokenizer
sent_doc = sent_tokenize(corpus)
# display() renders the Markdown and returns None, so call it on its own line;
# wrapping it in print() would print "None" in front of the sentence list.
display(Markdown('**Sentence Tokenize:-**'))
print(sent_doc)

# Word Tokenizer
word_doc = word_tokenize(corpus)
print('Word Tokenize:- \n',word_doc)

# Word Tokenizer with punctuation
word_pun = wordpunct_tokenize(corpus)
print('Word Punct Tokenize:- \n',word_pun)

# Word Tree Tokenizer
word_tree = TreebankWordTokenizer().tokenize(corpus)
print('Word Tree Tokenize:- \n',word_tree)

The Corpus is:- 
 Hey there!
Welcome to the Roadmap of the Generative AI Crash Course! 
Here, you’ll explore NLP’s, ML’s, DL’s, and LLM’s concepts.



**Sentence Tokenize:-**

['Hey there!', 'Welcome to the Roadmap of the Generative AI Crash Course!', 'Here, you’ll explore NLP’s, ML’s, DL’s, and LLM’s concepts.']
Word Tokenize:- 
 ['Hey', 'there', '!', 'Welcome', 'to', 'the', 'Roadmap', 'of', 'the', 'Generative', 'AI', 'Crash', 'Course', '!', 'Here', ',', 'you', '’', 'll', 'explore', 'NLP', '’', 's', ',', 'ML', '’', 's', ',', 'DL', '’', 's', ',', 'and', 'LLM', '’', 's', 'concepts', '.']
Word Punct Tokenize:- 
 ['Hey', 'there', '!', 'Welcome', 'to', 'the', 'Roadmap', 'of', 'the', 'Generative', 'AI', 'Crash', 'Course', '!', 'Here', ',', 'you', '’', 'll', 'explore', 'NLP', '’', 's', ',', 'ML', '’', 's', ',', 'DL', '’', 's', ',', 'and', 'LLM', '’', 's', 'concepts', '.']
Word Tree Tokenize:- 
 ['Hey', 'there', '!', 'Welcome', 'to', 'the', 'Roadmap', 'of', 'the', 'Generative', 'AI', 'Crash', 'Course', '!', 'Here', ',', 'you’ll', 'explore', 'NLP’s', ',', 'ML’s', ',', 'DL’s', ',', 'and', 'LLM’s', 'concepts', '.']
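
The outputs above differ mainly in how they treat the curly apostrophe (’). `wordpunct_tokenize` is built on the regular expression `\w+|[^\w\s]+`, which alternates between runs of word characters and runs of punctuation, so its split can be reproduced with the standard library alone. The sketch below assumes that pattern; the helper name `wordpunct_split` is hypothetical.

```python
import re

def wordpunct_split(text):
    """Return runs of word characters and runs of punctuation as separate tokens."""
    return re.findall(r"\w+|[^\w\s]+", text)

print(wordpunct_split("Here, you’ll explore NLP’s concepts."))
# ['Here', ',', 'you', '’', 'll', 'explore', 'NLP', '’', 's', 'concepts', '.']
```

Because ’ is neither a word character nor whitespace, it always becomes its own token here, whereas the Treebank tokenizer output above keeps `you’ll` and `NLP’s` intact.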


# Stemming
Stemming is the process of reducing a word to its stem by stripping affixes such as suffixes and prefixes. Unlike a lemma, the resulting stem is not guaranteed to be a valid word (for example, 'history' becomes 'histori').
### RegexpStemmer Class
NLTK provides the RegexpStemmer class for regular-expression-based stemming. It takes a single regular expression and removes any prefix or suffix that matches it.
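
To make the idea concrete, here is a minimal standard-library sketch of regular-expression stemming. It is an approximation for illustration, not NLTK's implementation; the function name `regexp_stem` is hypothetical, and the pattern and minimum length are the values used in the code cell of this notebook.

```python
import re

def regexp_stem(word, pattern=r"ing$|s$|e$|able$", min_length=4):
    """Strip whatever the pattern matches, leaving short words untouched."""
    if len(word) < min_length:
        return word
    return re.sub(pattern, "", word)

tokens = ['eating', 'eats', 'eaten', 'writing', 'write', 'history', 'finally', 'final']
print([regexp_stem(t) for t in tokens])
# ['eat', 'eat', 'eaten', 'writ', 'writ', 'history', 'finally', 'final']
```

The pattern only removes what it is told to remove, so 'eaten' and 'finally' pass through unchanged and 'write' is cut down to 'writ'.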
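### Snowball Stemmer
NLTK's SnowballStemmer class implements the Snowball family of stemmers for several languages. Its English stemmer is a revised version of the Porter algorithm, often called Porter2.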

In [3]:
from nltk.stem import PorterStemmer, RegexpStemmer, SnowballStemmer
print("Token:- ",token)
# Stemming
stemming = PorterStemmer()
stem_tokens = [stemming.stem(i) for i in token]
print("Stem Token:- \n", stem_tokens)

# Regex Stemmer
reg_stem = RegexpStemmer('ing$|s$|e$|able$', min=4)
reg_token = [reg_stem.stem(i) for i in token]
print("Reg Token:- \n",reg_token)

# Snow Stemmer
snow = SnowballStemmer('english')
snow_token = [snow.stem(i) for i in token]
print("Snow Token:- \n",snow_token)


Token:-  ['eating', 'eats', 'eaten', 'writing', 'write', 'history', 'finally', 'final']
Stem Token:- 
 ['eat', 'eat', 'eaten', 'write', 'write', 'histori', 'final', 'final']
Reg Token:- 
 ['eat', 'eat', 'eaten', 'writ', 'writ', 'history', 'finally', 'final']
Snow Token:- 
 ['eat', 'eat', 'eaten', 'write', 'write', 'histori', 'final', 'final']


# Wordnet Lemmatizer
Lemmatization is similar to stemming, but its output, called a 'lemma', is a valid dictionary word rather than a bare stem. The lemma carries the same meaning as the original word.

NLTK provides the WordNetLemmatizer class, which is a thin wrapper around the WordNet corpus. It uses the morphy() function of the WordNet CorpusReader class to find a lemma.
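
The sketch below illustrates the general idea of dictionary-backed lemmatization: look up irregular forms first, then try suffix rules and accept a candidate only if it is a known word. Everything in it is a toy assumption made for this example (the two-word vocabulary, the exception table, the rule list, and the name `toy_lemmatize`); it is not WordNet's data or NLTK's code.

```python
# Hypothetical toy data: a real lemmatizer consults a full lexicon instead.
TOY_VERBS = {"eat", "write"}
TOY_EXCEPTIONS = {"eaten": "eat", "ate": "eat", "written": "write", "wrote": "write"}
TOY_SUFFIX_RULES = [("ing", "e"), ("ing", ""), ("ed", "e"), ("ed", ""), ("es", "e"), ("s", "")]

def toy_lemmatize(word):
    """Return a known base form of word, or word itself if none is found."""
    if word in TOY_VERBS:
        return word
    if word in TOY_EXCEPTIONS:
        return TOY_EXCEPTIONS[word]
    for suffix, replacement in TOY_SUFFIX_RULES:
        if word.endswith(suffix):
            candidate = word[: -len(suffix)] + replacement
            # Accept the candidate only if it is a valid word in the vocabulary.
            if candidate in TOY_VERBS:
                return candidate
    return word

tokens = ['eating', 'eats', 'eaten', 'writing', 'write', 'history', 'finally', 'final']
print([toy_lemmatize(t) for t in tokens])
# ['eat', 'eat', 'eat', 'write', 'write', 'history', 'finally', 'final']
```

The vocabulary check is what separates this from stemming: a candidate such as 'writ' is rejected because it is not a known word, so 'writing' resolves to 'write'.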

In [4]:
"""
POS values accepted by lemmatize():
noun      - 'n'
verb      - 'v'
adjective - 'a'
adverb    - 'r'
"""
from nltk.stem import WordNetLemmatizer
lem = WordNetLemmatizer()
lem_token = [lem.lemmatize(i, pos='v') for i in token]
print('Token:- \n', token)
print('Lem Token:- \n', lem_token)

Token:- 
 ['eating', 'eats', 'eaten', 'writing', 'write', 'history', 'finally', 'final']
Lem Token:- 
 ['eat', 'eat', 'eat', 'write', 'write', 'history', 'finally', 'final']


# Stop Words

In [30]:
paragraph = """
In this paper, we develop DeepSinger, a multi-lingual multi-singer singing voice synthesis (SVS) system, 
which is built from scratch using singing training data mined from music websites. 
The pipeline of DeepSinger consists of several steps, including data crawling, singing and accompaniment separation, 
lyrics-to-singing alignment, data filtration, and singing modeling. Specifically, 
we design a lyrics-to-singing alignment model to automatically extract the duration of each phoneme in 
lyrics starting from coarse-grained sentence level to fine-grained phoneme level, and further design a multi-lingual 
multi-singer singing model based on a feed-forward Transformer to directly generate linear-spectrograms from lyrics, 
and synthesize voices using Griffin-Lim. DeepSinger has several advantages over previous SVS systems: 
1) to the best of our knowledge, it is the first SVS system that directly mines training data from music websites, 
2) the lyrics-to-singing alignment model further avoids any human efforts for alignment labeling and greatly reduces labeling cost,
3) the singing model based on a feed-forward Transformer is simple and efficient, by removing the complicated acoustic feature modeling in parametric synthesis 
and leveraging a reference encoder to capture the timbre of a singer from noisy singing data, and 
4) it can synthesize singing voices in multiple languages and multiple singers. 
We evaluate DeepSinger on our mined singing dataset that consists of about 92 hours data from 89 singers on three languages 
(Chinese, Cantonese and English). The results demonstrate that with the singing data purely mined from the Web, 
DeepSinger can synthesize high-quality singing voices in terms of both pitch accuracy and voice naturalness
"""

In [14]:
from nltk.corpus import stopwords
print(stopwords.words('english'))

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she

In [31]:
import nltk 
sentences = nltk.sent_tokenize(paragraph)

In [None]:
# Remove stop words, then apply stemming
from nltk.stem import PorterStemmer
stem = PorterStemmer()
stop_words = set(stopwords.words('english'))  # build the set once instead of once per word

for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])  # split the sentence into word tokens
    # Keep only the tokens that are not stop words (case-sensitive check) and stem them
    new_words = [stem.stem(word) for word in words if word not in stop_words]
    sentences[i] = ' '.join(new_words)  # rejoin the stemmed tokens and store them at the same index
print(sentences)
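
The stop-word check in the loop above is case-sensitive, and NLTK's English stop words are all lowercase, so capitalized words such as 'In' or 'The' at the start of a sentence are not filtered out. The sketch below shows the effect of lowercasing before the membership test. It uses only the standard library; the tiny stop-word set, the simple regex tokenizer, and the name `remove_stop_words` are assumptions made for this illustration.

```python
import re

# Hypothetical hand-picked subset of stop words, for illustration only.
TOY_STOP_WORDS = {"in", "this", "we", "a", "the", "of", "is", "which", "from"}

def remove_stop_words(sentence, stop_words=TOY_STOP_WORDS, lowercase_first=True):
    """Tokenize a sentence and drop tokens found in stop_words."""
    # Words (optionally hyphenated) or single punctuation characters.
    tokens = re.findall(r"\w+(?:-\w+)*|[^\w\s]", sentence)
    kept = []
    for token in tokens:
        key = token.lower() if lowercase_first else token
        if key not in stop_words:
            kept.append(token)
    return kept

text = "In this paper, we develop DeepSinger"
print(remove_stop_words(text, lowercase_first=False))
# ['In', 'paper', ',', 'develop', 'DeepSinger']
print(remove_stop_words(text))
# ['paper', ',', 'develop', 'DeepSinger']
```

Whether to lowercase first is a design choice: it removes sentence-initial stop words, but it also treats capitalized names that happen to match a stop word as stop words.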

In [33]:
from nltk.stem import WordNetLemmatizer
w_stem = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))  # build the set once instead of once per word

# Remove stop words, then lemmatize

for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])  # split the sentence into word tokens
    # Keep only the tokens that are not stop words (case-sensitive check), then lowercase and lemmatize them as verbs
    new_words = [w_stem.lemmatize(word.lower(), pos='v') for word in words if word not in stop_words]
    sentences[i] = ' '.join(new_words)  # rejoin the lemmatized tokens and store them at the same index
print(sentences)

['in paper , develop deepsinger , multi-lingual multi-singer sing voice synthesis ( svs ) system , build scratch use sing train data mine music websites .', 'the pipeline deepsinger consist several step , include data crawl , sing accompaniment separation , lyrics-to-singing alignment , data filtration , sing model .', 'specifically , design lyrics-to-singing alignment model automatically extract duration phoneme lyric start coarse-grained sentence level fine-grained phoneme level , design multi-lingual multi-singer sing model base feed-forward transformer directly generate linear-spectrograms lyric , synthesize voice use griffin-lim .', 'deepsinger several advantage previous svs systems : 1 ) best knowledge , first svs system directly mine train data music websites , 2 ) lyrics-to-singing alignment model avoid human efforts alignment label greatly reduce label cost , 3 ) sing model base feed-forward transformer simple efficient , remove complicate acoustic feature model parametric syn