spaCy is a popular natural language processing (NLP) library that provides efficient tools for many NLP tasks,
including tokenization. It can split text into words and sentences, which makes text data easier to process and analyze.
With spaCy, tokenization can be performed using either a blank pipeline or a pre-trained pipeline.
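
As a minimal sketch of word and sentence tokenization, the snippet below uses a blank English pipeline together with the rule-based `sentencizer` component. It assumes spaCy v3, where `add_pipe` accepts a component name as a string. The sample text and the variable names are made up for illustration and are not part of the notebook.

```python
import spacy

# Blank English pipeline plus the rule-based sentencizer (spaCy v3 API assumed)
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# Hypothetical sample text, not taken from sample.txt
sample = "spaCy splits text into tokens. It can also mark sentence boundaries."
doc = nlp(sample)

word_tokens = [token.text for token in doc]    # word-level tokens
sentences = [sent.text for sent in doc.sents]  # sentence-level spans

print(word_tokens)
print(sentences)
```

The sentencizer splits on punctuation only, so text with abbreviations such as "Dr." may need a pipeline with a trained parser for reliable sentence boundaries.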

In [1]:
import spacy 

In [2]:
# Read the sample text; sample.txt is expected in the working directory
with open("./sample.txt", "r", encoding="utf-8") as f:
    content = f.read()
content

'The names "John Doe" for males, "Jane Doe" or "Jane Roe" for females, or "Jonnie Doe" and "Janie Doe" for children, or just "Doe" non-gender-specifically are used as placeholder names for a party whose true identity is unknown or must be withheld in a legal action, case, or discussion. The names are also used to refer to acorpse or hospital patient whose identity is unknown. This practice is widely used in the United States and Canada, but is rarely used in other English-speaking countries including the United Kingdom itself, from where the use of "John Doe" in a legal context originates. The names Joe Bloggs or John Smith are used in the UK instead, as well as in Australia and New Zealand.\n\nJohn Doe is sometimes used to refer to a typical male in other contexts as well, in a similar manner to John Q. Public, known in Great Britain as Joe Public, John Smith or Joe Bloggs. For example, the first name listed on a form is often John Doe, along with a fictional address or other fictiona

Blank Pipeline: Creating a "blank" spaCy pipeline initializes a language object with that language's tokenizer but without any pre-trained components. This is useful when you want to add your own custom processing components, or when you only need basic tokenization without further linguistic analysis.
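
To illustrate what "blank" means in practice, the sketch below checks that a freshly created English pipeline has no components and that tokenization still works. The sample sentence is a hypothetical stand-in for the contents of `sample.txt`, and `nlp_blank`, `sample`, and `rebuilt` are names chosen only for this example.

```python
import spacy

# A blank pipeline ships the language's tokenizer but no trained components
nlp_blank = spacy.blank("en")
print(nlp_blank.pipe_names)  # [] because nothing has been added yet

# Hypothetical sample text, not taken from sample.txt
sample = 'The name "John Doe" is a placeholder.'
doc = nlp_blank(sample)
print([token.text for token in doc])

# Tokenization is non-destructive: tokens plus their trailing whitespace rebuild the input
rebuilt = "".join(token.text_with_ws for token in doc)
print(rebuilt == sample)
```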

### Tokenization

In [3]:
# Blank English pipeline: tokenizer only, no trained components
NLP_blank = spacy.blank("en")
doc = NLP_blank(content)
tokens = list(doc)
tokens

[The,
 names,
 ",
 John,
 Doe,
 ",
 for,
 males,
 ,,
 ",
 Jane,
 Doe,
 ",
 or,
 ",
 Jane,
 Roe,
 ",
 for,
 females,
 ,,
 or,
 ",
 Jonnie,
 Doe,
 ",
 and,
 ",
 Janie,
 Doe,
 ",
 for,
 children,
 ,,
 or,
 just,
 ",
 Doe,
 ",
 non,
 -,
 gender,
 -,
 specifically,
 are,
 used,
 as,
 placeholder,
 names,
 for,
 a,
 party,
 whose,
 true,
 identity,
 is,
 unknown,
 or,
 must,
 be,
 withheld,
 in,
 a,
 legal,
 action,
 ,,
 case,
 ,,
 or,
 discussion,
 .,
 The,
 names,
 are,
 also,
 used,
 to,
 refer,
 to,
 acorpse,
 or,
 hospital,
 patient,
 whose,
 identity,
 is,
 unknown,
 .,
 This,
 practice,
 is,
 widely,
 used,
 in,
 the,
 United,
 States,
 and,
 Canada,
 ,,
 but,
 is,
 rarely,
 used,
 in,
 other,
 English,
 -,
 speaking,
 countries,
 including,
 the,
 United,
 Kingdom,
 itself,
 ,,
 from,
 where,
 the,
 use,
 of,
 ",
 John,
 Doe,
 ",
 in,
 a,
 legal,
 context,
 originates,
 .,
 The,
 names,
 Joe,
 Bloggs,
 or,
 John,
 Smith,
 are,
 used,
 in,
 the,
 UK,
 instead,
 ,,
 as,
 well,
 as,
 in

### Stemming

In [4]:
import nltk
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

In [5]:
# Related word forms chosen to show how the Porter stemmer reduces them
words = ['generous', 'generate', 'generously', 'generation', 'eating', 'eats', 'eaten',
         'puts', 'putting', 'mass', 'was', 'bee', 'computer', 'advisable',
         'friend', 'friendship', 'friends', 'friendships']
for word in words:
    print(word, "=>", stemmer.stem(word))

generous => gener
generate => gener
generously => gener
generation => gener
eating => eat
eats => eat
eaten => eaten
puts => put
putting => put
mass => mass
was => wa
bee => bee
computer => comput
advisable => advis
friend => friend
friendship => friendship
friends => friend
friendships => friendship
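
The output above shows several different words collapsing onto the same stem, and some stems (such as `gener` or `wa`) are not dictionary words. As a rough sketch of how to inspect this, the snippet below groups a subset of the word list by stem. The names `sample_words` and `groups` are introduced here for illustration only.

```python
from collections import defaultdict

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Subset of the word list used above
sample_words = ["generous", "generate", "generously", "generation", "friend", "friends"]

# Map each stem to the words that reduce to it
groups = defaultdict(list)
for word in sample_words:
    groups[stemmer.stem(word)].append(word)

for stem, members in groups.items():
    print(stem, "<=", members)
```

Grouping by stem makes it easy to spot words with different meanings, such as "generous" and "generation", that the stemmer treats as the same term.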


### Lemmatization

Pre-trained Pipeline: In addition to tokenization, a pre-trained spaCy pipeline includes linguistic components such as part-of-speech tagging, named entity recognition, and dependency parsing. A blank pipeline has no lemmatizer, so lemmatization is run with a pre-trained pipeline.
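
To see which components a pre-trained pipeline carries, one option is to print its `pipe_names`. The sketch below assumes the `en_core_web_sm` package may or may not be installed and handles the `OSError` that `spacy.load` raises when it is missing. The exact component names depend on the spaCy and model versions, so none are listed here.

```python
import spacy

try:
    pretrained = spacy.load("en_core_web_sm")
except OSError:
    # The model package is not installed; it can be fetched with
    #   python -m spacy download en_core_web_sm
    pretrained = None

if pretrained is not None:
    # Names of the processing components that run after the tokenizer
    print(pretrained.pipe_names)
else:
    print("en_core_web_sm is not installed")
```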

In [6]:
NLP = spacy.load("en_core_web_sm")
# Join the words into one string; the name `text` avoids shadowing the built-in `str`
text = " ".join(words)
doc = NLP(text)
for word in doc:
    print(word, "=>", word.lemma_)

generous => generous
generate => generate
generously => generously
generation => generation
eating => eat
eats => eat
eaten => eat
puts => put
putting => put
mass => mass
was => be
bee => bee
computer => computer
advisable => advisable
friend => friend
friendship => friendship
friends => friend
friendships => friendship


### POS Tagging

In [7]:
doc = NLP(content)
for token in doc:
    print(token, "=>", token.pos_, f"({spacy.explain(token.pos_)})")

The => DET (determiner)
names => NOUN (noun)
" => PUNCT (punctuation)
John => PROPN (proper noun)
Doe => PROPN (proper noun)
" => PUNCT (punctuation)
for => ADP (adposition)
males => NOUN (noun)
, => PUNCT (punctuation)
" => PUNCT (punctuation)
Jane => PROPN (proper noun)
Doe => PROPN (proper noun)
" => PUNCT (punctuation)
or => CCONJ (coordinating conjunction)
" => PUNCT (punctuation)
Jane => PROPN (proper noun)
Roe => PROPN (proper noun)
" => PUNCT (punctuation)
for => ADP (adposition)
females => NOUN (noun)
, => PUNCT (punctuation)
or => CCONJ (coordinating conjunction)
" => PUNCT (punctuation)
Jonnie => PROPN (proper noun)
Doe => PROPN (proper noun)
" => PUNCT (punctuation)
and => CCONJ (coordinating conjunction)
" => PUNCT (punctuation)
Janie => PROPN (proper noun)
Doe => PROPN (proper noun)
" => PUNCT (punctuation)
for => ADP (adposition)
children => NOUN (noun)
, => PUNCT (punctuation)
or => CCONJ (coordinating conjunction)
just => ADV (adverb)
" => PUNCT (punctuation)
Doe => PR