# **Text Preprocessing Beyond Tokenization**

In [None]:
import re
# for using NLTK
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords

# for using SpaCy
import spacy

# for HuggingFace
!pip install transformers
# !pip install ftfy

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


This cell imports and installs the libraries used in the notebook.

re - Python's regular expression module, used to search strings for matching patterns.

nltk - the Natural Language Toolkit, a library for processing text data that ships with many sample corpora.

nltk.download('punkt') - downloads the Punkt sentence tokenizer models (the name does not refer to punctuation). Punkt divides text into a list of sentences using an unsupervised algorithm that learns abbreviations, collocations, and words that start sentences.

nltk.download('stopwords') - downloads the stopword lists. Stopwords are very common words such as a, an, the, and is. They are usually removed before search or pattern matching because they carry little meaning.

nltk.download('wordnet') - downloads WordNet, a lexical database of English nouns, verbs, adjectives, and adverbs.

nltk.corpus - the package of corpus reader classes; here it gives access to the stopword lists.

import spacy - spaCy is a library for information extraction and for building natural language understanding systems.

!pip install transformers - installs the Hugging Face Transformers library, which provides pretrained transformer models and their tokenizers.
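To see why a sentence tokenizer such as Punkt needs to know about abbreviations, here is a minimal sketch of a naive splitter. It is not Punkt; the function name and the example sentence are made up for illustration, and only the standard library is used.

```python
import re

def naive_sentence_split(text):
    """Split after '.', '!' or '?' followed by whitespace (no abbreviation handling)."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

example = "Dr. Smith read the report. He agreed with it."
sentences = naive_sentence_split(example)
print(sentences)
```

The text has two sentences, but the naive rule returns three pieces because it also breaks after the abbreviation "Dr.". Avoiding this kind of mistake is what Punkt's abbreviation handling is for.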

In [None]:
# trick to wrap text to the viewing window for this notebook
# Ref: https://stackoverflow.com/questions/58890109/line-wrapping-in-collaboratory-google-results
from IPython.display import HTML, display

def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))
get_ipython().events.register('pre_run_cell', set_css)

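IPython.display - provides HTML and display, which render HTML (here a style sheet) in the notebook output.

def set_css() - injects a CSS (Cascading Style Sheets) rule so that long output lines wrap to the viewing window; it is registered to run before every cell.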

## **(Tutorial) Tokenizing text using Spacy**

Following is a sample of text to demonstrate tokenization in SpaCy.

In [None]:
dummy_text1 = """Here is the First Paragraph and this is the First Sentence. Here is the Second Sentence. Now is the Third Sentence. This is the Fourth Sentence of the first paragaraph. This paragraph is ending now with a Fifth Sentence.
Now, it is the Second Paragraph and its First Sentence. Here is the Second Sentence. Now is the Third Sentence. This is the Fourth Sentence of the second paragraph. This paragraph is ending now with a Fifth Sentence.
Finally, this is the Third Paragraph and is the First Sentence of this paragraph. Here is the Second Sentence. Now is the Third Sentence. This is the Fourth Sentence of the third paragaraph. This paragraph is ending now with a Fifth Sentence.
4th paragraph just has one sentence in it.
"""

print(dummy_text1)

Here is the First Paragraph and this is the First Sentence. Here is the Second Sentence. Now is the Third Sentence. This is the Fourth Sentence of the first paragaraph. This paragraph is ending now with a Fifth Sentence.
Now, it is the Second Paragraph and its First Sentence. Here is the Second Sentence. Now is the Third Sentence. This is the Fourth Sentence of the second paragraph. This paragraph is ending now with a Fifth Sentence.
Finally, this is the Third Paragraph and is the First Sentence of this paragraph. Here is the Second Sentence. Now is the Third Sentence. This is the Fourth Sentence of the third paragaraph. This paragraph is ending now with a Fifth Sentence.
4th paragraph just has one sentence in it.



This cell simply prints the dummy text.

In [None]:
# loads a trained English pipeline with specific preprocessing components
nlp = spacy.load('en_core_web_sm')

# using SpaCy's tokenizer...
doc = nlp(dummy_text1)      # applies the processing pipeline on the text
for token in doc:
  print(token.text)

Here
is
the
First
Paragraph
and
this
is
the
First
Sentence
.
Here
is
the
Second
Sentence
.
Now
is
the
Third
Sentence
.
This
is
the
Fourth
Sentence
of
the
first
paragaraph
.
This
paragraph
is
ending
now
with
a
Fifth
Sentence
.


Now
,
it
is
the
Second
Paragraph
and
its
First
Sentence
.
Here
is
the
Second
Sentence
.
Now
is
the
Third
Sentence
.
This
is
the
Fourth
Sentence
of
the
second
paragraph
.
This
paragraph
is
ending
now
with
a
Fifth
Sentence
.


Finally
,
this
is
the
Third
Paragraph
and
is
the
First
Sentence
of
this
paragraph
.
Here
is
the
Second
Sentence
.
Now
is
the
Third
Sentence
.
This
is
the
Fourth
Sentence
of
the
third
paragaraph
.
This
paragraph
is
ending
now
with
a
Fifth
Sentence
.


4th
paragraph
just
has
one
sentence
in
it
.




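spacy.load() - a convenience wrapper that reads the pipeline's config.cfg, uses it to construct the language object, and loads the model data. Calling nlp on the text then runs the pipeline, and iterating over doc yields one token at a time. Punctuation marks and the newline between paragraphs come out as separate tokens.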
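If only tokenization is needed, a blank pipeline is enough. The sketch below assumes spaCy v3 is installed and uses spacy.blank("en"), which builds an English pipeline containing just the rule-based tokenizer and needs no downloaded model. The sample string is illustrative.

```python
import spacy

# spacy.blank("en") creates an English pipeline with a tokenizer only;
# unlike spacy.load('en_core_web_sm'), no trained model has to be downloaded.
nlp_blank = spacy.blank("en")

sample = "Now, it is the Second Paragraph. Here is the Second Sentence."  # illustrative text
doc = nlp_blank(sample)
tokens = [token.text for token in doc]

print(tokens)
print(nlp_blank.pipe_names)  # empty list: no tagger, parser, or NER attached
```

The trained pipeline loaded with spacy.load adds components such as a tagger and a parser on top of the same tokenization step.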

### **Task 1. Revisiting Tokenization**

Let's start with using the WordPunctTokenizer using NLTK.

WordPunctTokenizer splits the text into runs of alphabetic characters and runs of non-alphabetic characters, and returns each run as a separate token. Punctuation therefore always becomes a token of its own, even when it occurs inside a word.


#### **Question 1a. Implement the WordPunctTokenizer method for the given text in the code block below and write down the observations in your own words. (5 points)**

**Important Note:**
1. DO NOT use any of the existing implementations for tokenization distributed as part of open-source NLP libraries.
2. **If your solution uses readily available implementations of tokenizers, you will receive zero credit for this question.**
3. Your tokenizer implementation need not be the most optimized one. It should just be able to get the job done. You can also ignore punctuation.

In [None]:
sample_text="""Lorem ipsum is a substitute text that is frequently used in publication and graphic arts to display the visual shape of a paper or a typeface without depending on significant content.
Before the final version is available, lorem ipsum can be used as a stand-in. A technique known as greeking, which enables designers to think about the shape of a website or publishing without the significance of the text affecting the layout,
also uses it to try and replace text.The Latin phrase lorem ipsum is usually a mutated version of the first-century BC treatise De finibus bonorum et malorum by the Roman politician and scholar Marcus.Tullius.Cicero,
with phrases changed, added, and deleted to make it sound absurd and indecent."""


from nltk.tokenize import WordPunctTokenizer

# add your code below this comment and execute it once you have written the code
txt = WordPunctTokenizer()
text = txt.tokenize(sample_text)
print(text)


['Lorem', 'ipsum', 'is', 'a', 'substitute', 'text', 'that', 'is', 'frequently', 'used', 'in', 'publication', 'and', 'graphic', 'arts', 'to', 'display', 'the', 'visual', 'shape', 'of', 'a', 'paper', 'or', 'a', 'typeface', 'without', 'depending', 'on', 'significant', 'content', '.', 'Before', 'the', 'final', 'version', 'is', 'available', ',', 'lorem', 'ipsum', 'can', 'be', 'used', 'as', 'a', 'stand', '-', 'in', '.', 'A', 'technique', 'known', 'as', 'greeking', ',', 'which', 'enables', 'designers', 'to', 'think', 'about', 'the', 'shape', 'of', 'a', 'website', 'or', 'publishing', 'without', 'the', 'significance', 'of', 'the', 'text', 'affecting', 'the', 'layout', ',', 'also', 'uses', 'it', 'to', 'try', 'and', 'replace', 'text', '.', 'The', 'Latin', 'phrase', 'lorem', 'ipsum', 'is', 'usually', 'a', 'mutated', 'version', 'of', 'the', 'first', '-', 'century', 'BC', 'treatise', 'De', 'finibus', 'bonorum', 'et', 'malorum', 'by', 'the', 'Roman', 'politician', 'and', 'scholar', 'Marcus', '.', 'Tu

WordPunctTokenizer() - here we use WordPunctTokenizer() to split the text into tokens made of runs of alphabetic and non-alphabetic characters.

output - the tokenizer splits the whole text into tokens and emits periods (".") and commas (",") as tokens of their own. It also splits on '-': 'stand-in' is a single word, but it is divided into three tokens, 'stand', '-', 'in'.
Likewise, 'first-century' is split into three different tokens, "first", "-", "century". In short, the paragraph is divided into word tokens, hyphenated words are broken into their parts, and every punctuation mark is taken as a token as well.
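The note in Question 1a asks for a tokenizer written without library implementations. A minimal from-scratch sketch of the same splitting behaviour, using only the re module, could look like this (the function name is hypothetical):

```python
import re

def wordpunct_tokens(text):
    """Split text into runs of word characters and runs of punctuation."""
    # \w+      matches a run of letters, digits, or underscores
    # [^\w\s]+ matches a run of characters that are neither word characters nor whitespace
    return re.findall(r"\w+|[^\w\s]+", text)

print(wordpunct_tokens("lorem ipsum can be used as a stand-in."))
print(wordpunct_tokens("Marcus.Tullius.Cicero,"))
```

Whitespace is never returned as a token because neither alternative in the pattern can match it.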

#### **Question 1b.** Implement the tokenizers given below. Analyze how the words are being tokenized for both the tokenizers. Differentiate the outputs of the tokenizers and write them down in your own words. (10 points)
1. **NLTK's Word tokenizer**
2. **NLTK's Punctuation-based tokenizer**

**Note:** You are already familiar with using NLTK's tokenization, which was demonstrated in the previous labs. If you do not remember, just revisit them to refresh your memory.

In [None]:
import nltk
nltk.download('punkt')
sample_text="""Lorem ipsum is a substitute text that is frequently used in publication and graphic arts to display the visual shape of a paper or a typeface without depending on significant content.
Before the final version is available, lorem ipsum can be used as a stand-in. A technique known as greeking, which enables designers to think about the shape of a website or publishing without the significance of the text affecting the layout,
also uses it to try and replace text.The Latin phrase lorem ipsum is usually a mutated version of the first-century BC treatise De finibus bonorum et malorum by the Roman politician and scholar Marcus.Tullius.Cicero,
with phrases changed, added, and deleted to make it sound absurd and indecent."""

from nltk.tokenize import word_tokenize, WordPunctTokenizer

# add your code below this comment and execute it once you have written the code.
# you can add additional code cells if need be. make sure to use the text cell provided to answer the question.
# code for NLTK's word tokenizer
print(word_tokenize(sample_text))
print(" ")
# code for NLTK's punctuation-based tokenizer
txt = WordPunctTokenizer()
text = txt.tokenize(sample_text)
print(text)

['Lorem', 'ipsum', 'is', 'a', 'substitute', 'text', 'that', 'is', 'frequently', 'used', 'in', 'publication', 'and', 'graphic', 'arts', 'to', 'display', 'the', 'visual', 'shape', 'of', 'a', 'paper', 'or', 'a', 'typeface', 'without', 'depending', 'on', 'significant', 'content', '.', 'Before', 'the', 'final', 'version', 'is', 'available', ',', 'lorem', 'ipsum', 'can', 'be', 'used', 'as', 'a', 'stand-in', '.', 'A', 'technique', 'known', 'as', 'greeking', ',', 'which', 'enables', 'designers', 'to', 'think', 'about', 'the', 'shape', 'of', 'a', 'website', 'or', 'publishing', 'without', 'the', 'significance', 'of', 'the', 'text', 'affecting', 'the', 'layout', ',', 'also', 'uses', 'it', 'to', 'try', 'and', 'replace', 'text.The', 'Latin', 'phrase', 'lorem', 'ipsum', 'is', 'usually', 'a', 'mutated', 'version', 'of', 'the', 'first-century', 'BC', 'treatise', 'De', 'finibus', 'bonorum', 'et', 'malorum', 'by', 'the', 'Roman', 'politician', 'and', 'scholar', 'Marcus.Tullius.Cicero', ',', 'with', 'phr

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


**Answer for Q1b.**

Here we use two different tokenizers to split the text into tokens.

word_tokenize() - NLTK's recommended word tokenizer; it splits the text into words and punctuation marks.

WordPunctTokenizer() - splits the text into runs of alphabetic and non-alphabetic characters, so every punctuation mark becomes a token of its own.

**Difference between word_tokenize() and WordPunctTokenizer()**

The main difference is how punctuation inside a word is handled. word_tokenize() keeps 'stand-in' as a single token, whereas WordPunctTokenizer() splits it into three tokens: 'stand', '-', 'in'.

Similarly, word_tokenize() keeps 'Marcus.Tullius.Cicero' as a single token, while WordPunctTokenizer() divides it into five tokens: 'Marcus', '.', 'Tullius', '.', 'Cicero'. The same happens with 'first-century' and with 'text.The', where the space after the period is missing.
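The difference can be reproduced with two regular expressions. This is only a sketch: the second pattern is a simplified stand-in that keeps '-' and '.' inside words, not the algorithm behind word_tokenize(), and all names are hypothetical.

```python
import re

WORDPUNCT = re.compile(r"\w+|[^\w\s]+")                 # punctuation is always split off
KEEP_INTERNAL = re.compile(r"\w+(?:[-.]\w+)*|[^\w\s]")  # simplified: keeps '-' and '.' inside words

def differing_tokens(text):
    """Tokens kept whole by the simplified pattern but split by the punctuation-based one."""
    whole = KEEP_INTERNAL.findall(text)
    split = set(WORDPUNCT.findall(text))
    return [tok for tok in whole if tok not in split]

snippet = "used as a stand-in. replace text.The scholar Marcus.Tullius.Cicero, with phrases"
diff = differing_tokens(snippet)
print(diff)
```

Only the tokens with internal punctuation show up in the difference; ordinary words and the free-standing period and comma are tokenized identically by both patterns.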

## **(Tutorial) Stemming and Lemmatization using NLTK**

Execute and understand the following code in order to get the basic understanding of how the porter stemmer algorithm works.

In [None]:
# importing PorterStemmer class from nltk.stem module
from nltk.stem import PorterStemmer
porter = PorterStemmer()    # instantiating an object of the PorterStemmer class

stem = porter.stem('cats')    # calling the stemmer algorithm on the desired word
print(f"'cats' after stemming: {stem}")

'cats' after stemming: cat


**Porter Stemming algorithm:**
The Porter stemming algorithm was published by Martin Porter. Stemming is the process of reducing a word to its root or base form by stripping suffixes. For example, this algorithm reduces 'having' to 'have'.

We first import PorterStemmer from nltk.stem and create a PorterStemmer object. We then call its stem() method with the word we want to stem and print the resulting stem.

**Try executing the porter stemmer on your own examples**

In [None]:
#Enter your code here
# importing PorterStemmer class from nltk.stem module
from nltk.stem import PorterStemmer
porter = PorterStemmer()    # instantiating an object of the PorterStemmer class

stem = porter.stem('having')    # calling the stemmer algorithm on the desired word
print(f"'having' after stemming: {stem}")

'having' after stemming: have


Here we apply the same algorithm to a different word. Stemming removes suffixes such as 'ing', 'ed', 'es', and 's' to reduce a word to its root form. For the word 'having', the stemmer removes 'ing' and returns the base form 'have'.
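The idea of suffix stripping can be sketched in a few lines. This is a deliberately naive toy, not the Porter algorithm; the suffix list, the minimum stem length, and the function name are assumptions chosen for illustration.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # illustrative list, checked in this order

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["cats", "having", "walked", "is"]:
    print(w, "->", naive_stem(w))
```

The toy returns 'hav' for 'having', whereas the Porter stemmer above returns 'have'. Porter applies further rules after removing a suffix, which is why its output is often closer to a real word.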

**Let's see how we can perform stemming and lemmatization using NLTK library.**

The Lancaster stemmer algorithm is a series of rules that specifies the substitution or removal of an ending. Unlike the porter stemmer algorithm it applies heavy stemming on the text. For instance, using LancasterStemmer, destabilized will be stemmed to dest and porter stemmer will stem it to destabl. Due to iterations and over-stemming, LancasterStemmer creates a stem that is even smaller than Porter's.

Below is the implementation of Lancaster stemming algorithm.

In [None]:
# importing LancasterStemmer class from nltk.stem module
from nltk.stem import LancasterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
lancaster=LancasterStemmer()    # instantiating an object of the LancasterStemmer class

for S in ["cats"]: # Used a list as you will be working with a list in the below questions
  print(lancaster.stem(S))

cat


Here we use the Lancaster stemmer, which is an extremely aggressive stemming algorithm. It also allows the stemming rules to be customized.

We first import LancasterStemmer from nltk.stem, and sent_tokenize and word_tokenize from nltk.tokenize.

nltk.stem - the NLTK package that implements stemming.

LancasterStemmer - the class that implements the Lancaster stemming algorithm.

nltk.tokenize - the NLTK package that splits text into tokens such as sentences and words.

sent_tokenize - splits a string into sentences.

word_tokenize - splits a string into words.

**Try executing the Lancaster stemmer on your own examples (2 points)**

In [None]:
#Enter your code here
# importing LancasterStemmer class from nltk.stem module
from nltk.stem import LancasterStemmer

words = ['regarding','ranges','kindly','nonobjectivity']

Lanc = LancasterStemmer()

for wrd in words:
    print(wrd, " : ", Lanc.stem(wrd))

regarding  :  regard
ranges  :  rang
kindly  :  kind
nonobjectivity  :  nonobject


Here we use the LancasterStemmer algorithm to stem words. I define a list of four words, create a LancasterStemmer object, and use a for loop to print each word next to its stem.

output : the stems are regarding -> regard, ranges -> rang, kindly -> kind, nonobjectivity -> nonobject. Lancaster strips the endings, and the result is not always a real word (for example 'rang' for 'ranges').
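To compare the two stemmers directly, the same words can be passed through both. This is a small sketch that assumes NLTK is installed; neither stemmer needs downloaded data. The helper name is hypothetical.

```python
from nltk.stem import LancasterStemmer, PorterStemmer

def compare_stemmers(words):
    """Return (word, Porter stem, Lancaster stem) for each word."""
    porter = PorterStemmer()
    lancaster = LancasterStemmer()
    return [(w, porter.stem(w), lancaster.stem(w)) for w in words]

rows = compare_stemmers(["cats", "having", "regarding", "ranges", "kindly"])
for word, p_stem, l_stem in rows:
    print(f"{word:<10} porter={p_stem:<8} lancaster={l_stem}")
```

Printing the stems side by side makes it easy to spot the words on which the two rule sets disagree.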

Here we will see the implementation of Lemmatization using textblob method.

Python's TextBlob package is used to process textual data. It offers a straightforward API for getting started with typical natural language processing (NLP) activities like part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and others.

**Installing some libraries for performing Lemmatization using textblob method**

In [None]:
import nltk
!pip install textblob
!python -m textblob.download_corpora
nltk.download('omw-1.4')

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package conll2000 to /root/nltk_data...
[nltk_data]   Unzipping corpora/conll2000.zip.
[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.
Finished.


[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

This cell installs TextBlob and downloads the data it needs.

nltk - the Natural Language Toolkit for processing text.

textblob - a library for processing text data; it is used for sentiment analysis, part-of-speech tagging, and similar tasks.

textblob.download_corpora - downloads the NLTK corpora that TextBlob relies on (brown, punkt, wordnet, averaged_perceptron_tagger, conll2000, and movie_reviews, as the log above shows).

nltk.download('omw-1.4') - downloads the Open Multilingual Wordnet (OMW), which extends WordNet with data for other languages. The WordNet reader in recent NLTK versions may require it.

lemmatization - converting a word to its base (dictionary) form, called the lemma.

In [None]:
from textblob import TextBlob, Word

my_word = 'cars'

# create a Word object
w = Word(my_word)

print(w.lemmatize())

car


Here we import TextBlob and Word from textblob. We store the word 'cars', wrap it in a Word object, and print the result of its lemmatize() method.

lemmatize() - converts an inflected form of a word to its base form. Called without a part-of-speech argument on a plain Word, it treats the word as a noun.

**Execute lemmatization with textblob on your own examples (3 points)**

In [None]:
#Enter your code here
from textblob import Word

# create object.
u = Word("bricks")

# apply lemmatization to the word bricks.
print("bricks :", u.lemmatize())

# create object.
v = Word("ultra")

# apply lemmatization to the word ultra.
print("ultra :", v.lemmatize())

# create object.
w = Word("joker")

# apply lemmatization to the word joker.
print("joker :", w.lemmatize())

bricks : brick
ultra : ultra
joker : joker


Here we use TextBlob to lemmatize several words.

First we import the Word class from textblob.
Then we create a Word object for each word.
Finally we print each word together with the result of lemmatize().

output : bricks -> brick, ultra -> ultra, joker -> joker. In conclusion, lemmatize() converts an inflected form to its base form and leaves a word unchanged when it is already in base form.

### **Task 2: Lemmatization or Stemming?**




Following is the text that you will be using for this task (Task 2 only):

In [None]:
# This is the text on which you have to perform stemming; taken from Internet.
text = "In grammar, inflection is the modification of a word to express different grammatical categories such as tense, case, voice, aspect, person, number, gender, and mood. An inflection expresses one or more grammatical categories with a prefix, suffix or infix, or another internal modification such as a vowel change. Stem (root) is the part of the word to which you add inflectional (changing/deriving) affixes such as (-ed,-ize, -s,-de,mis)"
#print("Given text:")
print(text)

In grammar, inflection is the modification of a word to express different grammatical categories such as tense, case, voice, aspect, person, number, gender, and mood. An inflection expresses one or more grammatical categories with a prefix, suffix or infix, or another internal modification such as a vowel change. Stem (root) is the part of the word to which you add inflectional (changing/deriving) affixes such as (-ed,-ize, -s,-de,mis)


**Example of Lemmatization.**

In [None]:
from nltk.stem import WordNetLemmatizer
# Tokenize: split the text into words using word_tokenize()
words = nltk.word_tokenize(text)
print(words)
print(' ')
# Lemmatize each token and join the results back into one string.
lemmatizer = WordNetLemmatizer()
lemmatize_result = ' '.join([lemmatizer.lemmatize(wrd) for wrd in words])
print(lemmatize_result)

['In', 'grammar', ',', 'inflection', 'is', 'the', 'modification', 'of', 'a', 'word', 'to', 'express', 'different', 'grammatical', 'categories', 'such', 'as', 'tense', ',', 'case', ',', 'voice', ',', 'aspect', ',', 'person', ',', 'number', ',', 'gender', ',', 'and', 'mood', '.', 'An', 'inflection', 'expresses', 'one', 'or', 'more', 'grammatical', 'categories', 'with', 'a', 'prefix', ',', 'suffix', 'or', 'infix', ',', 'or', 'another', 'internal', 'modification', 'such', 'as', 'a', 'vowel', 'change', '.', 'Stem', '(', 'root', ')', 'is', 'the', 'part', 'of', 'the', 'word', 'to', 'which', 'you', 'add', 'inflectional', '(', 'changing/deriving', ')', 'affixes', 'such', 'as', '(', '-ed', ',', '-ize', ',', '-s', ',', '-de', ',', 'mis', ')']
 
In grammar , inflection is the modification of a word to express different grammatical category such a tense , case , voice , aspect , person , number , gender , and mood . An inflection express one or more grammatical category with a prefix , suffix or in

Here we lemmatize the text. We import WordNetLemmatizer from nltk.stem.

We tokenize the text with word_tokenize() and print the tokens.

We then create a WordNetLemmatizer object, call lemmatize() on every token, join the results with spaces, and print the lemmatized text. Without a part-of-speech argument the lemmatizer treats every token as a noun, so 'categories' becomes 'category' and 'expresses' becomes 'express', but 'as' also becomes 'a'.

**Example of Stemming.**

In [None]:
# This is the text on which you have to perform stemming; taken from Internet.
text = "In grammar, inflection is the modification of a word to express different grammatical categories such as tense, case, voice, aspect, person, number, gender, and mood. An inflection expresses one or more grammatical categories with a prefix, suffix or infix, or another internal modification such as a vowel change. Stem (root) is the part of the word to which you add inflectional (changing/deriving) affixes such as (-ed,-ize, -s,-de,mis)"
import nltk
from nltk.stem.lancaster import LancasterStemmer
lanca_stemmer = LancasterStemmer()
nltk_tokens = nltk.word_tokenize(text)
stemmer_result = ' '.join([lanca_stemmer.stem(wrd) for wrd in nltk_tokens])
print(stemmer_result)


in gramm , inflect is the mod of a word to express diff gram categ such as tens , cas , voic , aspect , person , numb , gend , and mood . an inflect express on or mor gram categ with a prefix , suffix or infix , or anoth intern mod such as a vowel chang . stem ( root ) is the part of the word to which you ad inflect ( changing/deriv ) affix such as ( -ed , -ize , -s , -de , mis )


Here we use LancasterStemmer to stem the words.

We import LancasterStemmer from nltk.stem.lancaster, create a LancasterStemmer object, and tokenize the text with word_tokenize().

We then stem every token in a list comprehension, join the stems with spaces, and print the result. The stemmer also lowercases the text. For example, 'grammar' is converted to 'gramm' and 'grammatical' to 'gram'.

Performing some preprocessing that we have learnt in previous ICEs...

In [None]:
text = "In grammar, inflection is the modification of a word to express different grammatical categories such as tense, case, voice, aspect, person, number, gender, and mood. An inflection expresses one or more grammatical categories with a prefix, suffix or infix, or another internal modification such as a vowel change. Stem (root) is the part of the word to which you add inflectional (changing/deriving) affixes such as (-ed,-ize, -s,-de,mis)"
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# remove punctuation from the text: lowercase it and replace every
# character that is not a letter, digit, or space with a space
clean_text = re.sub(r"[^a-zA-Z0-9 ]", r" ", text.lower())
print(clean_text)
print(" ")

# remove stopwords from the text: tokenize the cleaned text and keep
# only the tokens that are not in the English stopword list
word_tokens = word_tokenize(clean_text)
final_text = [w for w in word_tokens if w not in stop_words]
print("After removing stopwords")
print(final_text)
final_texts = " ".join(final_text)

#print(word_tokens)
print(" ")
print(final_texts)

in grammar  inflection is the modification of a word to express different grammatical categories such as tense  case  voice  aspect  person  number  gender  and mood  an inflection expresses one or more grammatical categories with a prefix  suffix or infix  or another internal modification such as a vowel change  stem  root  is the part of the word to which you add inflectional  changing deriving  affixes such as   ed  ize   s  de mis 
 
After removing stopwords
['grammar', 'inflection', 'modification', 'word', 'express', 'different', 'grammatical', 'categories', 'tense', 'case', 'voice', 'aspect', 'person', 'number', 'gender', 'mood', 'inflection', 'expresses', 'one', 'grammatical', 'categories', 'prefix', 'suffix', 'infix', 'another', 'internal', 'modification', 'vowel', 'change', 'stem', 'root', 'part', 'word', 'add', 'inflectional', 'changing', 'deriving', 'affixes', 'ed', 'ize', 'de', 'mis']
 
grammar inflection modification word express different grammatical categories tense case vo

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Here we use a regular expression to replace every character that is not a letter, a digit, or a space with a space, after converting the text to lowercase. This removes the punctuation.

We then split the cleaned text into tokens, drop every token that appears in NLTK's English stopword list, and print the remaining tokens both as a list and as a single string.
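The two cleaning steps can be sketched without any downloads. The stopword set below is a tiny hand-written stand-in for NLTK's English list, and the function name is hypothetical.

```python
import re

# Tiny illustrative stopword set; NLTK's English list is much longer.
STOP_WORDS = {"a", "an", "the", "is", "of", "to", "in", "and", "or", "such", "as"}

def clean_and_filter(text):
    """Lowercase, replace non-alphanumeric characters with spaces, drop stopwords."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", text.lower())
    # str.split() with no argument also collapses the repeated spaces
    # left behind by the substitution
    return [tok for tok in cleaned.split() if tok not in STOP_WORDS]

kept = clean_and_filter("In grammar, inflection is the modification of a word.")
print(kept)
```

Using a set for the stopwords keeps each membership test fast, which is also why the cell above wraps stopwords.words('english') in set().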

#### **Question 2. Remove punctuation and stopwords from the text using the functions provided above. Then perform stemming on the cleaned text using the Lancaster Stemmer from NLTK. (10 points)**

In [None]:
# apply Lancaster Stemmer on the cleaned text (after punctuation and stopwords are removed) below this comment
import nltk
from nltk.stem.lancaster import LancasterStemmer
print(final_texts)
print(" ")
lanca_stemmer = LancasterStemmer()
nltk_tokens = nltk.word_tokenize(final_texts)
stemmer_result = ' '.join([lanca_stemmer.stem(wrd) for wrd in nltk_tokens])
print(stemmer_result)

grammar inflection modification word express different grammatical categories tense case voice aspect person number gender mood inflection expresses one grammatical categories prefix suffix infix another internal modification vowel change stem root part word add inflectional changing deriving affixes ed ize de mis
 
gramm inflect mod word express diff gram categ tens cas voic aspect person numb gend mood inflect express on gram categ prefix suffix infix anoth intern mod vowel chang stem root part word ad inflect chang der affix ed iz de mis


Here we apply LancasterStemmer to the cleaned text, after stopwords and punctuation have been removed, and we can see how the Lancaster stemmer behaves on it.

final result : some words are reduced to a recognizable base form, while others simply lose their ending and are no longer real words.
e.g. 'inflection' is converted to 'inflect'
     'tense' is cut to 'tens'.

LancasterStemmer() - an aggressive, iterative stemming algorithm; because it applies its rules repeatedly, it is prone to over-stemming.

stem() - the method that returns the stem of a word.


#### **Question 3. Perform lemmatization on the same cleaned text above using textblob lemmatizer. (10 points)**

In [None]:
# apply TextBlob's lemmatizer on the cleaned text (after punctuation and stopwords are removed) below this comment
from textblob import Word

print(final_texts)
print(" ")
lem = []
for token in final_text:
    # lemmatize successively as a noun, a verb, and an adjective
    word1 = Word(token).lemmatize("n")
    word2 = Word(word1).lemmatize("v")
    word3 = Word(word2).lemmatize("a")
    lem.append(Word(word3).lemmatize())
# join once, after all tokens have been lemmatized
lems = " ".join(lem)
print("Lemmatize using TextBlob")
print(lems)

grammar inflection modification word express different grammatical categories tense case voice aspect person number gender mood inflection expresses one grammatical categories prefix suffix infix another internal modification vowel change stem root part word add inflectional changing deriving affixes ed ize de mis
 
Lemmatize using TextBlob
grammar inflection modification word express different grammatical category tense case voice aspect person number gender mood inflection express one grammatical category prefix suffix infix another internal modification vowel change stem root part word add inflectional change derive affix ed ize de mi


Here we use the TextBlob lemmatizer on the cleaned text.
textblob - a library for processing text data; it provides part-of-speech tagging, sentiment analysis, and more.

final_text - the list of tokens left after removing stopwords and punctuation; these are the tokens we lemmatize.

lem - the list that collects the lemmatized words.

The for loop lemmatizes each word successively as a noun ("n"), a verb ("v"), and an adjective ("a"), and appends the result to lem. The joined text is printed at the end. Trying several parts of speech is why 'changing' becomes 'change' and 'deriving' becomes 'derive'; it also turns 'mis' into 'mi'.

lemmatize() - works on a single word at a time.

#### **Question 4. Can you think of any differences while performing lemmatization and stemming? If yes, write them in your own words. Also write down your observations on performing lemmatization and stemming on text before and after cleaning (removing punctuation and stopwords). (10 points)**

**IMPORTANT NOTE: Your observations should not be based on just Q2 and Q3. Your observations should characterize both methods as a whole. Bring out the differences too, if there are any.**

**Answer for Q4.:**

**Lemmatization:**
Lemmatization reduces a word to its base form (its lemma) using a vocabulary and morphological analysis of the word. It removes inflectional endings and returns the dictionary form.
examples of packages : WordNet lemmatizer (NLTK), spaCy lemmatizer, TextBlob, etc.

**Stemming:**
Stemming reduces a word to its stem by stripping suffixes with fixed rules. Unlike a lemma, the stem does not have to be a real word.
example : 'crying' ends with 'ing'; the stemmer removes 'ing' and returns the base form 'cry'.

**Difference between stemming and lemmatization:**
From the previous examples we can see that stemming simply cuts the word down, while lemmatization takes the vocabulary and the role of the word into account. The main difference is how they reduce word endings.

In conclusion, stemming only removes the end of a word and does not check whether the result exists in an English dictionary. For example, an aggressive stemmer may strip the ending of "cakes" and return "cak", which is not an English word. Lemmatization also removes word endings, but it checks the result against a dictionary and returns a real word. For example, for "companies" it removes "ies", looks up the matching dictionary entry, and returns "company".
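The contrast between blind suffix stripping and a dictionary check can be illustrated with a toy example. Neither function is a real stemmer or lemmatizer; the mini lexicon, the rule lists, and the function names are all hypothetical.

```python
LEXICON = {"company", "cake", "category", "cry"}  # hypothetical mini dictionary

STRIP_SUFFIXES = ("ies", "es", "ing", "s")        # toy stemmer: just cut these off
LEMMA_RULES = [("ies", "y"), ("es", ""), ("s", ""), ("ing", "")]  # (suffix, replacement)

def toy_stem(word):
    """Cut off the first matching suffix without checking the result."""
    for suffix in STRIP_SUFFIXES:
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Try each matching rule and keep the first candidate found in LEXICON."""
    if word in LEXICON:
        return word
    for suffix, replacement in LEMMA_RULES:
        if word.endswith(suffix):
            candidate = word[: -len(suffix)] + replacement
            if candidate in LEXICON:
                return candidate
    return word  # unknown word: leave it unchanged

for w in ["cakes", "companies", "crying"]:
    print(w, "| stem:", toy_stem(w), "| lemma:", toy_lemmatize(w))
```

The toy stemmer produces 'cak' and 'compan', while the dictionary check rejects those candidates and settles on 'cake' and 'company'.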

**Result of stemming and lemmatization after and before cleand text:**
takling about performing stemming and lemmatize before cleaned text.focusing on the first result lemmatize before clean text its print whole text same only change some words into the base word such as "categories" convert to "category" and "affexes" convert to "affex". In contrast, on the another hand while taking about stemming, stemming remove ending of the words and doesn't check whether this type of english word is exist or not and print the result focusing of stemming results "different" word remove ending "erent" and print"diff" and "grammatical" word remove "atical" and print "gramm" which is change the meaning of word but stemming print that word.when we use stemming and lemmatization before cleaned text result prints with all the unnecessary words which is not going to use for search pattern match after stopwords and punctuation remove the results are more specific and more clear.its directly search for the content using each and every tokens. basically, after cleanded text the result must be accurate as well as matches with the pattern.
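
The contrast between the two methods can be sketched with a toy example. This is only a minimal sketch, not NLTK's stemmer or the WordNet lemmatizer: `toy_stem`, `toy_lemmatize`, the suffix list and the mini-dictionary are hypothetical names and data made up for illustration.

```python
# Toy illustration only: neither function reproduces a real NLTK algorithm.
SUFFIXES = ("ies", "es", "ing", "s")  # hypothetical suffix list, longest first


def toy_stem(word):
    """Strip the first matching suffix without consulting any dictionary."""
    for suffix in SUFFIXES:
        # Keep at least three characters so very short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical mini-dictionary standing in for a real lexical database.
LEMMA_DICT = {
    "companies": "company",
    "cakes": "cake",
    "crying": "cry",
    "categories": "category",
}


def toy_lemmatize(word):
    """Return the dictionary base form, or the word itself if it is unknown."""
    return LEMMA_DICT.get(word, word)


for w in ["companies", "cakes", "crying"]:
    print(f"{w:10} stem={toy_stem(w):8} lemma={toy_lemmatize(w)}")
```

The stemmer returns "compan" and "cak", which are not English words, while the dictionary lookup always returns a valid word. The price is that the lemmatizer can only handle words that its dictionary knows.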

---


## **(Tutorial) Sentence Segmentation using Spacy**

Following is a dummy paragraph of text to demonstrate how to use SpaCy to segment text into sentences.

In [None]:
dummy_text3 = """Here is the First Paragraph and this is the First Sentence. Here is the Second Sentence. Now is the Third Sentence. This is the Fourth Sentence of the first paragaraph. This paragraph is ending now with a Fifth Sentence.
Now, it is the Second Paragraph and its First Sentence. Here is the Second Sentence. Now is the Third Sentence. This is the Fourth Sentence of the second paragraph. This paragraph is ending now with a Fifth Sentence.
Finally, this is the Third Paragraph and is the First Sentence of this paragraph. Here is the Second Sentence. Now is the Third Sentence. This is the Fourth Sentence of the third paragaraph. This paragraph is ending now with a Fifth Sentence.
4th paragraph just has one sentence in it.
"""

print(dummy_text3)

Here is the First Paragraph and this is the First Sentence. Here is the Second Sentence. Now is the Third Sentence. This is the Fourth Sentence of the first paragaraph. This paragraph is ending now with a Fifth Sentence.
Now, it is the Second Paragraph and its First Sentence. Here is the Second Sentence. Now is the Third Sentence. This is the Fourth Sentence of the second paragraph. This paragraph is ending now with a Fifth Sentence.
Finally, this is the Third Paragraph and is the First Sentence of this paragraph. Here is the Second Sentence. Now is the Third Sentence. This is the Fourth Sentence of the third paragaraph. This paragraph is ending now with a Fifth Sentence.
4th paragraph just has one sentence in it.



Here, we just print the dummy text.

**Code for sentence segmentation using Spacy**

In [None]:
nlp = spacy.load('en_core_web_sm')

# performing sentence splitting...
doc = nlp(dummy_text3)
for sentence in doc.sents:
  print(sentence)

Here is the First Paragraph and this is the First Sentence.
Here is the Second Sentence.
Now is the Third Sentence.
This is the Fourth Sentence of the first paragaraph.
This paragraph is ending now with a Fifth Sentence.

Now, it is the Second Paragraph and its First Sentence.
Here is the Second Sentence.
Now is the Third Sentence.
This is the Fourth Sentence of the second paragraph.
This paragraph is ending now with a Fifth Sentence.

Finally, this is the Third Paragraph and is the First Sentence of this paragraph.
Here is the Second Sentence.
Now is the Third Sentence.
This is the Fourth Sentence of the third paragaraph.
This paragraph is ending now with a Fifth Sentence.

4th paragraph just has one sentence in it.



Here, we use spaCy for sentence segmentation. spaCy is a library for building information extraction and natural language understanding systems.

import spacy - imports the spaCy library, which is used here for sentence segmentation.

spacy.load() - loads a trained pipeline. 'en_core_web_sm' is a small English pipeline that includes the vocabulary, a part-of-speech tagger, a dependency parser and a named entity recognizer. The parser is what provides the sentence boundaries.

doc - the variable that stores the result of calling nlp() on the dummy text. nlp is the loaded pipeline object, and calling it returns a processed Doc.

After that, a for loop iterates over doc.sents and prints each sentence.

In the output, each sentence ends at its period ("."). The blank lines appear only after the last sentence of each paragraph, most likely because the newline that ends the paragraph is kept as part of that sentence.

**Code for sentence segmentation using NLTK library**

In [None]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

Here, we import the nltk library and download 'punkt', the pretrained Punkt model that NLTK uses for sentence tokenization.

In [None]:
text="This is a very bad situation. Also I am looking good"
sentences=nltk.sent_tokenize(text)
for sentence in sentences:
  print(sentence)
  print()

This is a very bad situation.

Also I am looking good



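nltk (the Natural Language Toolkit) is used here for sentence tokenization: nltk.sent_tokenize() splits the text into two sentences, and the loop prints each one followed by a blank line.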
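
To see why a sentence tokenizer needs more than "split after every period", here is a minimal sketch in plain Python. It is not how Punkt works internally: the abbreviation set and both function names are hypothetical, and Punkt learns its abbreviations from data instead of using a hand-written list.

```python
import re

# Hypothetical hand-written abbreviation list, for illustration only.
ABBREVIATIONS = {"dr.", "mr.", "e.g."}


def naive_split(text):
    """Split after every '.', '!' or '?' that is followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def abbreviation_aware_split(text):
    """Split at sentence-ending punctuation unless the token is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in ".!?" and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # keep any trailing text that has no final punctuation
        sentences.append(" ".join(current))
    return sentences


text = "Dr. Smith arrived late. He apologized."
print(naive_split(text))
print(abbreviation_aware_split(text))
```

The naive version cuts the text after "Dr.", while the abbreviation-aware version keeps "Dr. Smith arrived late." together as one sentence.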


### **Task 3. Segmenting Sentences**

In [None]:
seg_text="""There are numerous different versions of portions from Lorem Ipsum that are accessible, but the most have been changed in some way,
usually by adding humor or randomizing phrases that don't look the slightest bit plausible. If you plan to quote from Lorem Ipsum, make sure there
is nothing unpleasant tucked away in the center of the text. This is the first real generator on the Internet because all other Lorem Ipsum producers
tend to repeat specified pieces as many as needed. It creates Lorem Ipsum that appears plausible by using a vocabulary of more than 200 Latin terms
and a few sample sentence constructions. As a result, the created Lorem Ipsum is never riddled with clichés, humor, or other uncharacteristic language."""
seg_texts = seg_text.replace(".","\\")

print(seg_texts)



There are numerous different versions of portions from Lorem Ipsum that are accessible, but the most have been changed in some way,
usually by adding humor or randomizing phrases that don't look the slightest bit plausible\ If you plan to quote from Lorem Ipsum, make sure there
is nothing unpleasant tucked away in the center of the text\ This is the first real generator on the Internet because all other Lorem Ipsum producers
tend to repeat specified pieces as many as needed\ It creates Lorem Ipsum that appears plausible by using a vocabulary of more than 200 Latin terms
and a few sample sentence constructions\ As a result, the created Lorem Ipsum is never riddled with clichés, humor, or other uncharacteristic language\


#### **Question 5a. Write Python code for sentence segmentation using a backslash as the boundary for each sentence for the text above. Write your observations in your own words for the generated output. (15 points)**

**Note:** You do not need to remove any stopwords, punctuation or apply any kind of other preprocessing techniques. Only perform what's asked to minimize your effort needed to answer this question.

**Hint**: Use print( ) to help you understand how the sentences are being split when analyzing your output to note down your observations.

In [None]:
# write your code below this comment
#here is the text with the backslash for sentence segmentation.
seg_texts = seg_text.replace(".","\\")

import nltk
nltk.download('punkt')

sentences=nltk.sent_tokenize(seg_texts)

for sentence in sentences:
  print(sentence)
  print()

There are numerous different versions of portions from Lorem Ipsum that are accessible, but the most have been changed in some way,
usually by adding humor or randomizing phrases that don't look the slightest bit plausible\ If you plan to quote from Lorem Ipsum, make sure there
is nothing unpleasant tucked away in the center of the text\ This is the first real generator on the Internet because all other Lorem Ipsum producers
tend to repeat specified pieces as many as needed\ It creates Lorem Ipsum that appears plausible by using a vocabulary of more than 200 Latin terms
and a few sample sentence constructions\ As a result, the created Lorem Ipsum is never riddled with clichés, humor, or other uncharacteristic language\



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [None]:
#this is with the regular text ending sentences with the periods(".")
import nltk
nltk.download('punkt')

sentences=nltk.sent_tokenize(seg_text)

for sentence in sentences:
  print(sentence)
  print()

There are numerous different versions of portions from Lorem Ipsum that are accessible, but the most have been changed in some way,
usually by adding humor or randomizing phrases that don't look the slightest bit plausible.

If you plan to quote from Lorem Ipsum, make sure there
is nothing unpleasant tucked away in the center of the text.

This is the first real generator on the Internet because all other Lorem Ipsum producers
tend to repeat specified pieces as many as needed.

It creates Lorem Ipsum that appears plausible by using a vocabulary of more than 200 Latin terms
and a few sample sentence constructions.

As a result, the created Lorem Ipsum is never riddled with clichés, humor, or other uncharacteristic language.



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


As the first output shows, when the period (".") at the end of each sentence is replaced with a backslash, sentence segmentation with nltk no longer works: the whole paragraph is returned as a single sentence, because nltk does not recognize the backslash as the end of a sentence. In conclusion, nltk treats the period (".") as a sentence boundary, but not the backslash ("\").

First, I used the string replace() method to replace each "." with "\".

Then I imported the nltk library and downloaded the punkt sentence tokenizer model.

Next, I tokenized the text into sentences with nltk's sent_tokenize() method.

After that, a for loop prints the resulting sentences one by one. Because no boundary was found, the loop prints the whole paragraph once, with the backslashes still in place.

Overall, the results show that nltk takes the period (".") as the end of a sentence by default. When a backslash is used instead, nltk cannot identify the sentence end and continues into the next sentence without splitting.


In the second example, where the period (".") ends each sentence, nltk easily identifies the end of each sentence and splits there.
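
For comparison, a backslash boundary can be handled directly with plain string operations. The following is a minimal sketch, not part of nltk or spaCy: `split_on_backslash` and the sample text are hypothetical names and data used only for illustration.

```python
def split_on_backslash(text):
    """Split text into sentences at each backslash and drop empty pieces."""
    pieces = text.split("\\")
    # Collapse the line breaks inside each piece and trim surrounding spaces.
    return [" ".join(piece.split()) for piece in pieces if piece.strip()]


# Hypothetical sample that mimics the backslash-terminated text above.
sample = "First sentence here\\ Second one,\nwith a line break\\ Third\\"

for sentence in split_on_backslash(sample):
    print(sentence)
    print()
```

The same function could be called on `seg_texts` from the cell above. Note that this removes the backslash from each sentence, since `str.split` drops the delimiter.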

#### **Question 5b. Perform sentence segmentation using spaCy on the text used in question 5a (seg_text) (5 points)**

In [None]:
import spacy

#load spacy for core english library
nlp = spacy.load("en_core_web_sm")
doc = nlp(seg_texts)
#to print sentences
for sent in doc.sents:
  print(sent)


There are numerous different versions of portions from Lorem Ipsum that are accessible, but the most have been changed in some way,
usually by adding humor or randomizing phrases that don't look the slightest bit plausible\
If you plan to quote from Lorem Ipsum, make sure there
is nothing unpleasant tucked away in the center of the text\
This is the first real generator on the Internet because all other Lorem Ipsum producers
tend to repeat specified pieces as many as needed\ It creates Lorem Ipsum that appears plausible by using a vocabulary of more than 200 Latin terms
and a few sample sentence constructions\
As a result, the created Lorem Ipsum is never riddled with clichés, humor, or other uncharacteristic language\


In [None]:
import spacy

#load spacy for core english library
nlp = spacy.load("en_core_web_sm")
doc = nlp(seg_text)
#to print sentences
for sent in doc.sents:
  print(sent)


There are numerous different versions of portions from Lorem Ipsum that are accessible, but the most have been changed in some way,
usually by adding humor or randomizing phrases that don't look the slightest bit plausible.
If you plan to quote from Lorem Ipsum, make sure there
is nothing unpleasant tucked away in the center of the text.
This is the first real generator on the Internet because all other Lorem Ipsum producers
tend to repeat specified pieces as many as needed.
It creates Lorem Ipsum that appears plausible by using a vocabulary of more than 200 Latin terms
and a few sample sentence constructions.
As a result, the created Lorem Ipsum is never riddled with clichés, humor, or other uncharacteristic language.


Here, we use spaCy for sentence segmentation.
spaCy is an open-source library for building information extraction systems.
import spacy - imports the spaCy library for sentence segmentation.
spacy.load() - loads the trained English pipeline (vocabulary, parser and entity recognizer).
doc - the variable that stores the processed text returned by nlp().
for loop - iterates over doc.sents and prints each sentence.

sent - each item of doc.sents, a span that holds one segmented sentence.

To summarize, the first output shows that spaCy still splits the backslash text into sentences. It found a boundary after three of the backslashes, although it missed the one between "needed\" and "It creates", so it printed four sentences instead of five. In contrast, nltk did not split this text at all in the example above.

With the period (".") at the end of each sentence, spaCy finds all five sentences. So spaCy does not depend on one fixed punctuation mark: its default pipeline takes sentence boundaries from the dependency parser, which looks at the structure of the text. This is also why it can occasionally miss a boundary.

#### **Question 5c. Perform sentence segmentation using NLTK on the text used in question 5a (seg_text) (5 points)**

**Hint**: For implementing NLTK's sentence segmentation, you can refer to the code block above.

In [None]:
# write your code below this comment
import nltk
nltk.download('punkt')
sentences = nltk.sent_tokenize(seg_texts)  #whole paragraph break into sentence.
for sentence in sentences:
	print(sentence)
	print()

There are numerous different versions of portions from Lorem Ipsum that are accessible, but the most have been changed in some way,
usually by adding humor or randomizing phrases that don't look the slightest bit plausible\ If you plan to quote from Lorem Ipsum, make sure there
is nothing unpleasant tucked away in the center of the text\ This is the first real generator on the Internet because all other Lorem Ipsum producers
tend to repeat specified pieces as many as needed\ It creates Lorem Ipsum that appears plausible by using a vocabulary of more than 200 Latin terms
and a few sample sentence constructions\ As a result, the created Lorem Ipsum is never riddled with clichés, humor, or other uncharacteristic language\



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [None]:
# write your code below this comment
import nltk
nltk.download('punkt')
sentences = nltk.sent_tokenize(seg_text)  #whole paragraph break into sentence.
for sentence in sentences:
	print(sentence)
	print()

There are numerous different versions of portions from Lorem Ipsum that are accessible, but the most have been changed in some way,
usually by adding humor or randomizing phrases that don't look the slightest bit plausible.

If you plan to quote from Lorem Ipsum, make sure there
is nothing unpleasant tucked away in the center of the text.

This is the first real generator on the Internet because all other Lorem Ipsum producers
tend to repeat specified pieces as many as needed.

It creates Lorem Ipsum that appears plausible by using a vocabulary of more than 200 Latin terms
and a few sample sentence constructions.

As a result, the created Lorem Ipsum is never riddled with clichés, humor, or other uncharacteristic language.



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Here, we use nltk (the Natural Language Toolkit) to split the paragraph into sentences.

import nltk - imports the nltk library for sentence segmentation.

nltk.sent_tokenize() - the method that tokenizes a text into sentences.

A for loop prints each sentence returned by the tokenizer.

To conclude, the results show that nltk does not split the sentences when a backslash ends each sentence: it does not identify the backslash as a sentence end and prints the text without splitting it. On the other hand, when a period (".") ends each sentence, nltk clearly identifies the sentence ends and splits the text into five sentences.

#### **Question 5d. Analyze the generated output from question 5b and 5c. Provide your observations in your own words. (5 points)**

ANSWER 5d: Here, we use nltk and spaCy for sentence segmentation.

**Sentence segmentation:**
Sentence segmentation is the process of finding where each sentence starts and ends, and dividing a paragraph into sentences.

The main difference we observe is how the choice of end-of-sentence punctuation changes the results.

**nltk sentence segmentation:**

Here, we use the nltk.sent_tokenize() method for sentence tokenization.

nltk's sentence segmentation splits the text at sentence-ending punctuation.
nltk considers ".", "?" and "!" as possible sentence boundaries, so the backslash ("\") is not considered a boundary.

I used two different end-of-sentence marks, (".") and ("\"). The backslash is not taken as a sentence boundary, so the text stays in one piece. With periods ("."), the boundaries are found and the sentences are split and printed.

nltk has its own built-in rules for what counts as a sentence boundary.

**spaCy sentence segmentation:**

spaCy is an open-source library used for tokenizing words and sentences.

spaCy uses the nlp object to convert the text into a Doc, and doc.sents splits the paragraph into sentences.

We provided the same two sets of sentence boundaries, (".") and ("\"). spaCy found all five sentences with periods. With backslashes, it still split the paragraph into four sentences and missed only one boundary.

To sum up, spaCy does not rely only on a fixed list of punctuation marks: its default pipeline decides on boundaries from the whole text, and it also allows rule-based sentence boundaries that we can customize to our needs.


In summary, nltk's default sent_tokenize() is tied to its built-in punctuation rules, while spaCy is more flexible and makes it easier to customize sentence boundaries.
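
The effect of the boundary character set can be sketched without either library. This is a minimal rule-based sketch, not the algorithm used by nltk or spaCy: `segment` and its `boundary_chars` parameter are hypothetical names chosen for illustration.

```python
import re


def segment(text, boundary_chars=".?!"):
    """Split after any character in boundary_chars that is followed by whitespace."""
    pattern = r"(?<=[" + re.escape(boundary_chars) + r"])\s+"
    return [s for s in re.split(pattern, text.strip()) if s]


# Hypothetical sample text that uses a backslash as its sentence terminator.
text_backslash = "One ends here\\ Two ends here\\ Three ends here\\"

# With the default characters nothing is split, as with a fixed rule set.
print(len(segment(text_backslash)))

# Adding the backslash to the boundary characters recovers all three sentences.
for sentence in segment(text_backslash, boundary_chars=".?!\\"):
    print(sentence)
```

With a fixed set of boundary characters the backslash text stays in one piece. Once the set is configurable, the same function handles both kinds of text.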


## **(Tutorial) Subword Tokenization using HuggingFace**

### **Task 4: Subword Tokenization**

NLP models are not as intelligent as humans: on their own, they cannot break a word into sub-words to work out its meaning when they see a word that is not in the corpus yet. This is where subword tokenization comes into the picture.

Subword tokenization is a recent strategy from machine translation that helps us solve this problem by breaking unknown words into “subword units” - strings of characters like ing or eau - that still allow the downstream model to make intelligent decisions on words it doesn't recognize.


**Here we will implement the basic BertTokenizer using a pretrained model**

BERT stands for Bidirectional Encoder Representations from Transformers. BERT is a transformer model that was pretrained in a self-supervised way on a large corpus of English data. This means that it was trained on raw text only, with no human labeling of any kind. As a result, it can use a large amount of publicly available data, with an automatic process to generate inputs and labels from those texts. The uncased model lowercases the text before tokenizing it, and words that are not present in the vocabulary are split into smaller subword pieces.
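
The "uncased" part of the model name refers to a normalization step that runs before any subword splitting: the text is lowercased and accent marks are stripped. The following is a minimal sketch of that idea using only the standard library. It is not the tokenizer's actual code, and `uncased_normalize` is a hypothetical name.

```python
import unicodedata


def uncased_normalize(text):
    """Lowercase the text and remove accent marks (combining characters)."""
    text = text.lower()
    # NFD splits an accented letter into a base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (Unicode category "Mn") and keep the rest.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(uncased_normalize("Clichés in Battle Royal"))
```

After this step "Battle" and "battle" are the same string, so capitalization alone cannot change which tokens an uncased tokenizer produces.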

**Below is the implementation of the Bert Tokenizer:**

In [None]:
!pip install tokenizers
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


tokenization:
Tokenization is the process of splitting text into tokens. Here, we install the tokenizers library, which provides the tokenizer implementations.

Transformers:
transformers is a library that provides many pretrained models for text, vision and audio data. Here, it is used for text tasks, and it gives us the BertTokenizer class.

In [None]:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokenizer.tokenize("Because he was late again, he would be docked a day's pay")

['because',
 'he',
 'was',
 'late',
 'again',
 ',',
 'he',
 'would',
 'be',
 'docked',
 'a',
 'day',
 "'",
 's',
 'pay']

Here, we use BertTokenizer to process text. BERT's tokenizer keeps a word as a single token when the word is in its vocabulary, and otherwise divides the word into smaller tokens. It is also known as a WordPiece tokenizer.

In the example above, we first import BertTokenizer from transformers. Then we load the pretrained tokenizer 'bert-base-uncased', and finally we pass our text to the tokenize() method of the tokenizer object.

In summary, the pretrained tokenizer divides the whole sentence into tokens, one per word, all in lowercase. It also treats punctuation as tokens: ",", "'" and "s" are separate tokens, so "day's" becomes three tokens.

**Question 6:** Examine these two sentences below:

* The job interview was very tough, but I'm glad I made it through.
* The new battleroyal game is super amazing!

Using the Bert tokenizer (created during the tutorial), encode these two sentences. Examine the tokens from the encodings of the two sentences. Are there any interesting observations when you compare the tokens between the two encodings? What do you think is causing what you observe as part of your comparison? **(20 points)**


In [None]:
# use the BertTokenizer that was created during the tutorial to encode the sentences
# write your code below this comment and execute
# type in your answer to the question asked above in the following cell (see below)
#this is for Sentence 1
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokenizer.tokenize("The job interview was very tough, but I'm glad I made it through.")

['the',
 'job',
 'interview',
 'was',
 'very',
 'tough',
 ',',
 'but',
 'i',
 "'",
 'm',
 'glad',
 'i',
 'made',
 'it',
 'through',
 '.']

In [None]:
#this is for sentence two
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokenizer.tokenize("The new battleroyal game is super amazing!")

['the',
 'new',
 'battle',
 '##roy',
 '##al',
 'game',
 'is',
 'super',
 'amazing',
 '!']

In [None]:
#this is sentence two again, with the game name written as two capitalized words
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokenizer.tokenize("The new Battle Royal game is super amazing!")

['the', 'new', 'battle', 'royal', 'game', 'is', 'super', 'amazing', '!']

**Answer to Question 6:**

Let's start with the basics of Hugging Face.

**Hugging Face:**
Hugging Face provides the libraries used here. Its tokenizers divide text into tokens, look each token up in a vocabulary, and can add special tokens too.

**Tokenizer:**
A tokenizer divides a paragraph or text into smaller units, such as sentences, words or subwords.

**Transformers:**
Transformers is a library of pretrained models for text, vision and audio data. It provides pretrained models for tasks such as text classification and sentiment analysis.

**BertTokenizer:**
BertTokenizer tokenizes sentences into words. It works in two ways: a word that is in the vocabulary is kept as one token, and a word that is not in the vocabulary is divided into smaller tokens.

The examples above work as follows:

**from transformers import BertTokenizer**
This imports the BertTokenizer class from the transformers library. We use it to tokenize the sentences.

**tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')**
This loads the pretrained tokenizer. The name of the pretrained model is bert-base-uncased. It lowercases the text and tokenizes it with its WordPiece vocabulary.


**tokenizer.tokenize("The job interview was very tough, but I'm glad I made it through.")**
**tokenizer.tokenize("The new battleroyal game is super amazing!")**
These two calls turn the sentences into tokens, using the tokenize() method of the tokenizer object.

**Difference between the two sentence outputs**

To recap, BertTokenizer splits sentences into words and splits some words into smaller tokens. Let's discuss the main differences in the output.

In the output of the first sentence, every word becomes one token. The punctuation marks "," and "." are also tokens, and "I'm" is divided into three tokens: "i", "'" and "m". In the second sentence, most words also become one token each, but the word "battleroyal" is divided into three tokens: "battle", "##roy" and "##al". The "##" prefix marks a piece that continues the previous token.


The first sentence, "The job interview was very tough, but I'm glad I made it through.", contains only common English words, so each of them is found in the vocabulary as a whole word. The punctuation marks, including the final period ".", are tokens of their own.

In the second sentence, "The new battleroyal game is super amazing!", the word "battleroyal" is not in the tokenizer's vocabulary, so the tokenizer breaks it into pieces that are in the vocabulary: "battle", "##roy" and "##al". The "!" mark is also a token. Capitalization is not the cause, because the uncased model lowercases everything. When the name is written as two words, "Battle Royal", the tokens are different ("battle" and "royal"), because each word is looked up separately and both are in the vocabulary.
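
One way to see how a word such as "battleroyal" ends up as "battle", "##roy" and "##al" is a toy version of greedy longest-match-first subword splitting, which is the strategy WordPiece tokenizers use at inference time. This is only a minimal sketch: `TOY_VOCAB` is a hypothetical eight-entry vocabulary (the real one has tens of thousands of entries), and `wordpiece` is not the library's implementation.

```python
# Hypothetical tiny vocabulary, for illustration only.
TOY_VOCAB = {"the", "new", "battle", "royal", "game", "##roy", "##al", "[UNK]"}


def wordpiece(word, vocab=TOY_VOCAB, unk="[UNK]"):
    """Greedily take the longest vocabulary entry that matches at each position."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # try a shorter substring
        if piece is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


for text in ["the new battleroyal game", "the new battle royal game"]:
    print([p for w in text.split() for p in wordpiece(w)])
```

In this sketch "battleroyal" is split only because the full string is missing from the vocabulary. When the words are separated by a space, "battle" and "royal" are each found whole, which matches the behaviour seen in the outputs above.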

## **References**
* https://spacy.io/usage/spacy-101
* https://spacy.io/models/en
* https://www.geeksforgeeks.org/python-nltk-tokenize-wordpuncttokenizer/
* https://neptune.ai/blog/tokenization-in-nlp
* https://www.datacamp.com/tutorial/stemming-lemmatization-python
* https://www.geeksforgeeks.org/python-lemmatization-approaches-with-examples/
* https://colab.research.google.com/drive/10gwzRY55JqzgeEQOX6nwFs6bQ84-mB9f?usp=sharing#scrollTo=DP1xuStV0fDl
* https://towardsdatascience.com/a-comprehensive-guide-to-subword-tokenisers-4bbd3bad9a7c
* https://www.analyticsvidhya.com/blog/2019/09/demystifying-bert-groundbreaking-nlp-framework/
* https://huggingface.co/transformers/v3.0.2/tokenizer_summary.html