# NLTK

### Introduction

NLTK (the Natural Language Toolkit) is a Python library for natural language processing (NLP) tasks such as tokenisation, stop-word removal, stemming, and n-gram generation.

In [2]:
#import library
import nltk
#nltk.download('punkt_tab')

### Tokenisation

Large language models (LLMs) are trained on tokens, which are small units of data. These may be words, subwords, or other units such as punctuation marks.

In [3]:
from nltk.tokenize import word_tokenize

# take a sample text as follows:
sample_text = "Welcome to the Adventurer's Guild!"
# lower-case the sample text, then tokenize it
tokens = word_tokenize(sample_text.lower())
#print the tokens:
print(tokens)

['welcome', 'to', 'the', 'adventurer', "'s", 'guild', '!']
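
To make the idea of splitting text into tokens concrete, here is a minimal sketch of a tokenizer built on a regular expression from the standard library. It is only an illustration and not how *word_tokenize* works internally; the pattern and the name `simple_tokens` are assumptions made for this example. Note that it splits the apostrophe from the `s`, whereas NLTK keeps `'s` together.

```python
import re

sample_text = "Welcome to the Adventurer's Guild!"

# \w+ matches a run of letters or digits; [^\w\s] matches a single punctuation mark
simple_tokens = re.findall(r"\w+|[^\w\s]", sample_text.lower())
print(simple_tokens)
# ['welcome', 'to', 'the', 'adventurer', "'", 's', 'guild', '!']
```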


### Stop Words

Stop words are very common words (such as 'the', 'is' and 'a') that carry little meaning on their own, so they are often removed before further processing.

In [5]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\nagan\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [6]:
from nltk.corpus import stopwords

sentence="This is a sample sentence, showing off the stop words filtration."
stop_words = set(stopwords.words('english'))
print(stop_words)

{"wouldn't", 'when', 've', 'both', 'had', "they've", 'i', 'so', 'shan', 'me', "you'll", "i've", 'too', 'was', 'his', "haven't", 'up', 'most', 'then', 'and', 'shouldn', 'mustn', 'they', 'isn', 'couldn', "wasn't", 'our', 'he', "i'll", 'to', 'being', 'why', "he'd", 'themselves', 'the', 'him', 'against', 'd', 'after', 'of', "shouldn't", "won't", 'wasn', 'in', 'now', 'same', 'wouldn', "didn't", 'll', "you're", 'only', "i'm", 'we', 'am', 'all', 'before', 'own', "isn't", 'herself', 'were', 's', "mustn't", "should've", 'just', "she's", 'doing', 'it', 'under', "you'd", 'y', 'mightn', 'their', 'my', 'hadn', 't', 'a', 'if', 'her', 'ma', "we'll", 'its', 'through', "don't", 'or', "doesn't", 'did', 'off', 'those', 'o', 'has', 'an', 'again', 'ours', 'because', "she'd", 'be', 'once', 'hers', "mightn't", 'below', "weren't", 'above', 'between', 'can', 'what', 'don', "he'll", 'does', 'm', "needn't", 'here', "we'd", 'down', 'them', 'as', 'more', 'how', 'no', 'where', 'myself', 'out', "he's", 'have', 'not'

In [7]:
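words = word_tokenize(sentence)
# keep only the tokens that are not stop words
# (the check is case-sensitive, so the capitalised 'This' is kept)
filtered_sentence = [w for w in words if w not in stop_words]
print(filtered_sentence)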

['This', 'sample', 'sentence', ',', 'showing', 'stop', 'words', 'filtration', '.']
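
In the output above, 'This' survives the filter because the stop list is all lower-case. One possible fix is to lower-case each token before the lookup. The sketch below uses a small hand-picked stop list (an assumption for illustration, standing in for the full NLTK list) so that it runs without any downloads.

```python
# small hand-picked stop list for illustration, not the full NLTK list
stop_words = {'this', 'is', 'a', 'the', 'off'}

# tokens of the sample sentence, written out by hand
words = ['This', 'is', 'a', 'sample', 'sentence', ',', 'showing',
         'off', 'the', 'stop', 'words', 'filtration', '.']

# compare the lower-cased token, but keep the original token in the result
filtered = [w for w in words if w.lower() not in stop_words]
print(filtered)
# ['sample', 'sentence', ',', 'showing', 'stop', 'words', 'filtration', '.']
```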


### Stemming

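Stemming reduces a word to its stem by stripping suffixes such as *-ing* or *-en*. This helps to group different forms of a word under one base meaning, e.g.:

`['ride', 'riding', 'ridden']`

Ideally, all three forms reduce to a common stem such as 'rid', although the exact stems depend on the stemming algorithm used.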
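
A minimal sketch of stemming in NLTK, assuming the Porter algorithm (*PorterStemmer*); the notes above do not name a stemmer, and this is only one of several that NLTK ships. Rule-based stemmers do not always map irregular forms to the same stem, so it is worth inspecting the printed result rather than assuming it.

```python
from nltk.stem import PorterStemmer

# rule-based stemmer; it needs no extra nltk.download() call
stemmer = PorterStemmer()

words = ['ride', 'riding', 'ridden']
# stem each word; the stems are not guaranteed to be dictionary words
stems = [stemmer.stem(w) for w in words]
print(stems)
```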

### N-grams

N-grams are sequences of n consecutive words (or tokens) taken from a given piece of text. They are useful because they help us understand patterns in text and help in predicting the next word in a sequence. N-gram models are named after the value of n:

1. unigram (n = 1)
2. bigram (n = 2)
3. trigram (n = 3)
4. n-gram (the general case, for any n)

N-grams can be created using the *ngrams* function from the *nltk.util* module.

In [28]:
from nltk.util import ngrams

# Create a list of n-grams
unigrams = list(ngrams(tokens, 1)) # unigram
# print the n-grams
print(unigrams)

# bigrams
bigrams = list(ngrams(tokens, 2))
print(bigrams)

# trigrams
trigrams = list(ngrams(tokens, 3)) 
print(trigrams)

[('welcome',), ('to',), ('the',), ('adventurer',), ("'s",), ('guild',), ('!',)]
[('welcome', 'to'), ('to', 'the'), ('the', 'adventurer'), ('adventurer', "'s"), ("'s", 'guild'), ('guild', '!')]
[('welcome', 'to', 'the'), ('to', 'the', 'adventurer'), ('the', 'adventurer', "'s"), ('adventurer', "'s", 'guild'), ("'s", 'guild', '!')]
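
To illustrate how n-grams can help predict the next word, here is a rough sketch of a bigram counter using only the standard library. The toy corpus and the helper `predict_next` are made up for this example; a real model would be trained on far more text and would usually smooth the counts.

```python
from collections import Counter, defaultdict

# toy corpus, made up for illustration
corpus = "the cat sat on the mat and the cat slept".split()

# build bigrams with a sliding window of size 2
pairs = list(zip(corpus, corpus[1:]))

# for each word, count which words follow it
followers = defaultdict(Counter)
for first, second in pairs:
    followers[first][second] += 1

def predict_next(word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    if word not in followers:
        return None
    return followers[word].most_common(1)[0][0]

print(predict_next("the"))  # cat ('cat' follows 'the' twice, 'mat' once)
```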


### Text Classification using Naive Bayes Algorithm

The steps are:

1. Import required libraries.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

2. Prepare the dataset, as well as the labels. Here we shall classify short sentences as technical or non-technical.

In [30]:
texts = [
  'I love programming.', 'Python is amazing.',
  'I enjoy machine learning.', 'The weather is nice today.', 'I like algo.',
  'Machine learning is fascinating.', 'Natural Language Processing is a part of AI.'
]

labels = [
  'tech', 'tech', 'tech', 'non-tech', 'tech', 'tech', 'tech'
]

3. Use *CountVectorizer* to convert the text into a matrix of token counts.

In [31]:
vectorizer = CountVectorizer()
x = vectorizer.fit_transform(texts)

4. Split the given dataset into train and test splits.

In [32]:
x_train, x_test, y_train, y_test = train_test_split(x, labels, test_size=0.2, random_state=42)

5. Train the *MultinomialNB* classifier on the training split.

In [33]:
model = MultinomialNB()
model.fit(x_train, y_train)

6. Now we can use the trained model to make predictions. We make the predictions on the test split of the dataset.

In [34]:
y_pred = model.predict(x_test)

7. Test the accuracy of the model by comparing the predicted labels (y_pred) with the actual labels (y_test).

In [35]:
accuracy_score(y_test, y_pred)

1.0

### Machine Translation

A popular application of NLP is machine translation, in which words or sentences in one language are converted to another. In Python, this may be achieved using the *translate* library.

In [36]:
from translate import Translator

To translate text using this library, we first create a *Translator* object. Using the *translate* method of this object, we can translate a given sample of text.

In [37]:
# create a translator object
translator = Translator(to_lang='fr')  # set the language to French
# define a sample text
text = "Hello, how are you doing today?"
# translate the text
translation = translator.translate(text)
# print the translation
print(translation)

bonjour, comment allez-vous aujourd’hui ?


### TextBlob

Another useful library for NLP is *TextBlob*. This library is built on top of the NLTK library, and is useful for:

1. spell checking
2. text analysis (sentiment analysis etc.)

In [38]:
from textblob import TextBlob

Here is a spell-checking example using *TextBlob*.

In [41]:
# sample text. here 'gold.' is misspelled as 'golb.'
text = 'All that is golb does not glitter.'
# create a TextBlob object
blob = TextBlob(text)
# correct the misspelled word
corrected_text = blob.correct()
# print the corrected text
print(corrected_text)

All that is gold does not glitter.


*SOURCE:*
+ *Codedex* - https://www.codedex.io/gen-ai