# Natural Language Processing using spaCy

- Introduction to the spaCy library
- Load the English model
- Find stop words
- Create an nlp object of a given document (sentence)
- Count the frequency of each word using hash values (using count_by(ORTH) and nlp.vocab.strings)
- Print each word count using a dictionary comprehension
- Print the index of each token
- Print various token attributes (e.g. is_alpha, shape_, is_stop, pos_, tag_)
- Stemming (using nltk)
    - using PorterStemmer()
    - using SnowballStemmer()
- Lemmatization
- Display the dependency tree of a sentence using displacy.render()
- How to get the meaning of any tag or label used by spaCy using explain(<word>)
- How to find named entities (NER, Named Entity Recognition) in a given doc
- Display named entities in a doc using displacy.render()
- Remove stop words/punctuation using the is_stop and is_punct attributes
- Create a list of words after removing stop words, then rebuild the sentence
- Sentence and word tokenization
- Pipelining:
    - Get all the available pipeline factories
    - How to disable preloaded pipeline components to speed up processing
    - Adding custom pipeline components
- Reading a file and displaying entities
- Chunking
- Computing word similarity
- n-grams (using nltk and sklearn's CountVectorizer())
    - bi-grams
    - tri-grams
    - n-grams

In [1]:
import spacy as sp
from spacy import displacy # used for data visualization
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.attrs import ORTH # to be used for word count

In [2]:
nlp = sp.load("en_core_web_sm") # ref: https://spacy.io/models/en

###### To download the English model (run once, before calling sp.load)

# !python -m spacy download en_core_web_sm

In [5]:
txt = """Commercial writers know that most people don’t want to read 1,000
words of closely-spaced text in order to see what they are writing about, so 
they also like to keep sentences and paragraphs short. 
They’ll even use lots of sub-headers so you can see what each paragraph is about 
before you read it."""

In [6]:
obj = nlp(txt)

In [7]:
obj

Commercial writers know that most people don’t want to read 1,000 
words of closely-spaced text in order to see what they are writing about, so 
they also like to keep sentences and paragraphs short. 
They’ll even use lots of sub-headers so you can see what each paragraph is about 
before you read it.

In [9]:
print(dir(obj))

['_', '__bytes__', '__class__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__ne__', '__new__', '__pyx_vtable__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__unicode__', '_bulk_merge', '_py_tokens', '_realloc', '_vector', '_vector_norm', 'cats', 'char_span', 'count_by', 'doc', 'ents', 'extend_tensor', 'from_array', 'from_bytes', 'from_disk', 'get_extension', 'get_lca_matrix', 'has_extension', 'has_vector', 'is_nered', 'is_parsed', 'is_sentenced', 'is_tagged', 'lang', 'lang_', 'mem', 'merge', 'noun_chunks', 'noun_chunks_iterator', 'print_tree', 'remove_extension', 'retokenize', 'sentiment', 'sents', 'set_extension', 'similarity', 'tensor', 'text', 'text_with_ws', 'to_array', 'to_bytes', 'to_disk', 'to_json', 'to_utf8_array', 'user_data', 'user_hooks', 

###### How to get all the words from text

In [14]:
for wd in obj:
    print(dir(wd))
    break

['_', '__bytes__', '__class__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__len__', '__lt__', '__ne__', '__new__', '__pyx_vtable__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__unicode__', 'ancestors', 'check_flag', 'children', 'cluster', 'conjuncts', 'dep', 'dep_', 'doc', 'ent_id', 'ent_id_', 'ent_iob', 'ent_iob_', 'ent_kb_id', 'ent_kb_id_', 'ent_type', 'ent_type_', 'get_extension', 'has_extension', 'has_vector', 'head', 'i', 'idx', 'is_alpha', 'is_ancestor', 'is_ascii', 'is_bracket', 'is_currency', 'is_digit', 'is_left_punct', 'is_lower', 'is_oov', 'is_punct', 'is_quote', 'is_right_punct', 'is_sent_start', 'is_space', 'is_stop', 'is_title', 'is_upper', 'lang', 'lang_', 'left_edge', 'lefts', 'lemma', 'lemma_', 'lex_id', 'like_email', 'like_num', 'like_url', 'lower', 'lower_', 'morph', 'n_lefts', 'n_rights', 'nb

In [13]:
for wd in obj:
    print(wd.text.lower())

commercial
writers
know
that
most
people
do
n’t
want
to
read
1,000


words
of
closely
-
spaced
text
in
order
to
see
what
they
are
writing
about
,
so


they
also
like
to
keep
sentences
and
paragraphs
short
.


they
’ll
even
use
lots
of
sub
-
headers
so
you
can
see
what
each
paragraph
is
about


before
you
read
it
.


###### Find out stop words

In [17]:
for wd in obj:
    print((wd.text,wd.is_stop))

('Commercial', False)
('writers', False)
('know', False)
('that', True)
('most', True)
('people', False)
('do', True)
('n’t', True)
('want', False)
('to', True)
('read', False)
('1,000', False)
('\n', False)
('words', False)
('of', True)
('closely', False)
('-', False)
('spaced', False)
('text', False)
('in', True)
('order', False)
('to', True)
('see', True)
('what', True)
('they', True)
('are', True)
('writing', False)
('about', True)
(',', False)
('so', True)
('\n', False)
('they', True)
('also', True)
('like', False)
('to', True)
('keep', True)
('sentences', False)
('and', True)
('paragraphs', False)
('short', False)
('.', False)
('\n', False)
('They', True)
('’ll', True)
('even', True)
('use', False)
('lots', False)
('of', True)
('sub', False)
('-', False)
('headers', False)
('so', True)
('you', True)
('can', True)
('see', True)
('what', True)
('each', True)
('paragraph', False)
('is', True)
('about', True)
('\n', False)
('before', True)
('you', True)
('read', False)
('it', True)
(

###### Create an nlp object of the given document and list its sentences

In [18]:
for sent in obj.sents:
    print(sent)

Commercial writers know that most people don’t want to read 1,000 
words of closely-spaced text in order to see what they are writing about, so 
they also like to keep sentences and paragraphs short. 

They’ll even use lots of sub-headers
so you can see what each paragraph is about 
before you read it.


###### To split a sentence into separate words

###### Count frequency of each word using hash values (using count_by(ORTH) and nlp.vocab.strings)

In [19]:
d = obj.count_by(ORTH)
d

{6679199052911211715: 1,
 357501887436434592: 1,
 7743033266031195906: 1,
 4380130941430378203: 1,
 11104729984170784471: 1,
 7593739049417968140: 1,
 2158845516055552166: 1,
 16712971838599463365: 1,
 7597692042947428029: 1,
 3791531372978436496: 3,
 11792590063656742891: 2,
 18254674181385630108: 1,
 962983613142996970: 4,
 10289140944597012527: 1,
 886050111519832510: 2,
 9696970313201087903: 1,
 9153284864653046197: 2,
 16159022834684645410: 1,
 15099781594404091470: 1,
 3002984154512732771: 1,
 13136985495629980461: 1,
 11925638236994514241: 2,
 5865838185239622912: 2,
 16875582379069451158: 2,
 5012629990875267006: 1,
 9147119992364589469: 1,
 942632335873952620: 2,
 2593208677638477497: 1,
 9781598966686434415: 2,
 12084876542534825196: 1,
 18194338103975822726: 1,
 9099225972875567996: 1,
 5257340109698985342: 1,
 2283656566040971221: 1,
 12626284911390218812: 1,
 3563698965725164461: 1,
 12646065887601541794: 2,
 14947529218328092544: 1,
 17092777669037358890: 1,
 173392260459

In [24]:
for k,v in d.items():
    print((nlp.vocab.strings[k],v))

('Commercial', 1)
('writers', 1)
('know', 1)
('that', 1)
('most', 1)
('people', 1)
('do', 1)
('n’t', 1)
('want', 1)
('to', 3)
('read', 2)
('1,000', 1)
('\n', 4)
('words', 1)
('of', 2)
('closely', 1)
('-', 2)
('spaced', 1)
('text', 1)
('in', 1)
('order', 1)
('see', 2)
('what', 2)
('they', 2)
('are', 1)
('writing', 1)
('about', 2)
(',', 1)
('so', 2)
('also', 1)
('like', 1)
('keep', 1)
('sentences', 1)
('and', 1)
('paragraphs', 1)
('short', 1)
('.', 2)
('They', 1)
('’ll', 1)
('even', 1)
('use', 1)
('lots', 1)
('sub', 1)
('headers', 1)
('you', 2)
('can', 1)
('each', 1)
('paragraph', 1)
('is', 1)
('before', 1)
('it', 1)


###### print each word count, using dictionary comprehension
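A minimal sketch of what this heading describes, assuming `d` is the hash-to-count dictionary returned by `obj.count_by(ORTH)` and that `nlp.vocab.strings[k]` maps a hash back to its text, as in the cells above. The helper name `decode_counts` and the stand-in lookup table are hypothetical, so the block runs without a model:

```python
def decode_counts(counts, strings):
    """Map {hash: count} to {text: count} with a dictionary comprehension."""
    return {strings[key]: value for key, value in counts.items()}

# Stand-in data with made-up hashes; in the notebook you would call
# decode_counts(d, nlp.vocab.strings) instead.
fake_strings = {101: "to", 202: "read", 303: "words"}
fake_counts = {101: 3, 202: 2, 303: 1}

word_counts = decode_counts(fake_counts, fake_strings)
for word, count in word_counts.items():
    print(word, count)
```

The comprehension replaces the explicit `for k,v in d.items()` loop and leaves a regular dictionary that can be sorted or filtered afterwards.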

###### print index of each token

In [26]:
txt

'Commercial writers know that most people don’t want to read 1,000 \nwords of closely-spaced text in order to see what they are writing about, so \nthey also like to keep sentences and paragraphs short. \nThey’ll even use lots of sub-headers so you can see what each paragraph is about \nbefore you read it.'

In [25]:
for wd in obj:
    print((wd,wd.idx))

(Commercial, 0)
(writers, 11)
(know, 19)
(that, 24)
(most, 29)
(people, 34)
(do, 41)
(n’t, 43)
(want, 47)
(to, 52)
(read, 55)
(1,000, 60)
(
, 66)
(words, 67)
(of, 73)
(closely, 76)
(-, 83)
(spaced, 84)
(text, 91)
(in, 96)
(order, 99)
(to, 105)
(see, 108)
(what, 112)
(they, 117)
(are, 122)
(writing, 126)
(about, 134)
(,, 139)
(so, 141)
(
, 144)
(they, 145)
(also, 150)
(like, 155)
(to, 160)
(keep, 163)
(sentences, 168)
(and, 178)
(paragraphs, 182)
(short, 193)
(., 198)
(
, 200)
(They, 201)
(’ll, 205)
(even, 209)
(use, 214)
(lots, 218)
(of, 223)
(sub, 226)
(-, 229)
(headers, 230)
(so, 238)
(you, 241)
(can, 245)
(see, 249)
(what, 253)
(each, 258)
(paragraph, 263)
(is, 273)
(about, 276)
(
, 282)
(before, 283)
(you, 290)
(read, 294)
(it, 299)
(., 301)


In [27]:
for wd in obj:
    print((wd,wd.i))

(Commercial, 0)
(writers, 1)
(know, 2)
(that, 3)
(most, 4)
(people, 5)
(do, 6)
(n’t, 7)
(want, 8)
(to, 9)
(read, 10)
(1,000, 11)
(
, 12)
(words, 13)
(of, 14)
(closely, 15)
(-, 16)
(spaced, 17)
(text, 18)
(in, 19)
(order, 20)
(to, 21)
(see, 22)
(what, 23)
(they, 24)
(are, 25)
(writing, 26)
(about, 27)
(,, 28)
(so, 29)
(
, 30)
(they, 31)
(also, 32)
(like, 33)
(to, 34)
(keep, 35)
(sentences, 36)
(and, 37)
(paragraphs, 38)
(short, 39)
(., 40)
(
, 41)
(They, 42)
(’ll, 43)
(even, 44)
(use, 45)
(lots, 46)
(of, 47)
(sub, 48)
(-, 49)
(headers, 50)
(so, 51)
(you, 52)
(can, 53)
(see, 54)
(what, 55)
(each, 56)
(paragraph, 57)
(is, 58)
(about, 59)
(
, 60)
(before, 61)
(you, 62)
(read, 63)
(it, 64)
(., 65)
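
The two listings differ in what they count: `wd.idx` is the character offset of the token in the original text, while `wd.i` is the position of the token within the Doc. A rough standard-library sketch of the same idea, assuming plain whitespace tokenization (spaCy's tokenizer also splits punctuation and contractions, so its values diverge once those appear). `token_positions` is a hypothetical helper:

```python
import re

def token_positions(text):
    """Return (token, token_index, char_offset) for whitespace-separated tokens."""
    return [(m.group(), i, m.start())
            for i, m in enumerate(re.finditer(r"\S+", text))]

sample = "Commercial writers know that"
positions = token_positions(sample)
for token, i, idx in positions:
    # the character offset always points back at the token in the source text
    assert sample[idx:idx + len(token)] == token
    print((token, i, idx))
```

For these first four words the offsets agree with the `idx` and `i` values printed above.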


###### Print various token attributes (e.g. is_alpha, shape_, is_stop, pos_, tag_)

In [34]:
for wd in obj:
    print((wd.text,wd.is_alpha,wd.shape_,wd.pos_,wd.tag_))

('Commercial', True, 'Xxxxx', 'ADJ', 'JJ')
('writers', True, 'xxxx', 'NOUN', 'NNS')
('know', True, 'xxxx', 'VERB', 'VBP')
('that', True, 'xxxx', 'SCONJ', 'IN')
('most', True, 'xxxx', 'ADJ', 'JJS')
('people', True, 'xxxx', 'NOUN', 'NNS')
('do', True, 'xx', 'AUX', 'VBP')
('n’t', False, 'x’x', 'PART', 'RB')
('want', True, 'xxxx', 'VERB', 'VB')
('to', True, 'xx', 'PART', 'TO')
('read', True, 'xxxx', 'VERB', 'VB')
('1,000', False, 'd,ddd', 'NUM', 'CD')
('\n', False, '\n', 'SPACE', '_SP')
('words', True, 'xxxx', 'NOUN', 'NNS')
('of', True, 'xx', 'ADP', 'IN')
('closely', True, 'xxxx', 'ADV', 'RB')
('-', False, '-', 'PUNCT', 'HYPH')
('spaced', True, 'xxxx', 'VERB', 'VBN')
('text', True, 'xxxx', 'NOUN', 'NN')
('in', True, 'xx', 'ADP', 'IN')
('order', True, 'xxxx', 'NOUN', 'NN')
('to', True, 'xx', 'PART', 'TO')
('see', True, 'xxx', 'VERB', 'VB')
('what', True, 'xxxx', 'PRON', 'WP')
('they', True, 'xxxx', 'PRON', 'PRP')
('are', True, 'xxx', 'AUX', 'VBP')
('writing', True, 'xxxx', 'VERB', 'VBG')
(
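
The `shape_` column is an orthographic pattern of the token: uppercase letters become `X`, lowercase letters `x`, digits `d`, and other characters are kept as they are. A rough sketch of that mapping, assuming (from the output above, e.g. `'Commercial'` giving `'Xxxxx'`) that runs of the same shape character are capped at four. `approx_shape` is a hypothetical helper, not spaCy's own implementation:

```python
def approx_shape(text):
    """Approximate a token's shape: X/x for letters, d for digits, runs capped at 4."""
    out = []
    last = ""
    run = 0
    for ch in text:
        if ch.isalpha():
            shape_char = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            shape_char = "d"
        else:
            shape_char = ch  # punctuation and other symbols are kept unchanged
        if shape_char == last:
            run += 1
        else:
            run = 0
            last = shape_char
        if run < 4:  # drop the fifth and later repeats of the same shape character
            out.append(shape_char)
    return "".join(out)

for word in ["Commercial", "writers", "do", "1,000"]:
    print((word, approx_shape(word)))
```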

In [35]:
sp.explain("VBP")

'verb, non-3rd person singular present'

In [38]:
sp.explain("SP")

'space'

In [36]:
sp.explain("MD")

'verb, modal auxiliary'

###### Stemming (using nltk)
- using PorterStemmer()
- using SnowballStemmer()

In [39]:
from nltk.stem import SnowballStemmer
from nltk.stem import PorterStemmer

In [42]:
sm = SnowballStemmer("english")
ps = PorterStemmer()

In [43]:
text = ['play', 'playing', 'playable', 'played', 'plays']

In [44]:
for wd in text:
    print(sm.stem(wd))

play
play
playabl
play
play


In [45]:
for wd in text:
    print(ps.stem(wd))

play
play
playabl
play
play


###### Lemmatization

In [47]:
text = 'play playing playable played plays'
obj1 = nlp(text)

In [52]:
for wd in obj1:
    print((wd.text,wd.lemma_))

('play', 'play')
('playing', 'play')
('playable', 'playable')
('played', 'play')
('plays', 'play')


In [53]:
sp.__version__

'2.2.1'

###### Display the dependency tree of a sentence using displacy.render()

In [54]:
t1 = "This is Learnbay class11 and we are learning NLP par2"

In [55]:
obj2 = nlp(t1)

In [57]:
displacy.render(obj2,jupyter=True)

###### How to get the meaning of any tag or label used by spaCy using explain()

In [58]:
sp.explain("cc")

'coordinating conjunction'

###### How to find named entities (NER, Named Entity Recognition) in a given doc

In [61]:
t2 = "IBM produces and sells computer hardware, middleware and software, and provides hosting and consulting services in areas ranging from mainframe computers to nanotechnology. IBM is also a major research organization, holding the record for most U.S. patents generated by a business (as of 2020) for 27 consecutive years.[6] Inventions by IBM include the automated teller machine (ATM), the floppy disk, the hard disk drive, the magnetic stripe card, the relational database, the SQL programming language, the UPC barcode, and dynamic random-access memory (DRAM). The IBM mainframe, exemplified by the System/360, was the dominant computing platform during the 1960s and 1970s."

In [62]:
obj3 = nlp(t2)

In [73]:
for ent in obj3.ents: # to get entities
    print((ent,ent.label_))

(IBM, 'ORG')
(IBM, 'ORG')
(U.S., 'GPE')
(2020, 'DATE')
(27, 'CARDINAL')
(IBM, 'ORG')
(SQL, 'ORG')
(UPC, 'ORG')
(IBM, 'ORG')
(the 1960s and 1970s, 'DATE')


In [70]:
for ent in obj3.ents: # to get entities
    print(dir(ent))
    break

['_', '__class__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__ne__', '__new__', '__pyx_vtable__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '_fix_dep_copy', '_recalculate_indices', '_vector', '_vector_norm', 'as_doc', 'conjuncts', 'doc', 'end', 'end_char', 'ent_id', 'ent_id_', 'ents', 'get_extension', 'get_lca_matrix', 'has_extension', 'has_vector', 'kb_id', 'kb_id_', 'label', 'label_', 'lefts', 'lemma_', 'lower_', 'merge', 'n_lefts', 'n_rights', 'noun_chunks', 'orth_', 'remove_extension', 'rights', 'root', 'sent', 'sentiment', 'set_extension', 'similarity', 'start', 'start_char', 'string', 'subtree', 'tensor', 'text', 'text_with_ws', 'to_array', 'upper_', 'vector', 'vector_norm', 'vocab']


###### Display Named Entity in doc using displacy.render

In [66]:
displacy.render(obj3,style="ent")

In [67]:
sp.explain("GPE")

'Countries, cities, states'

# Exercise

In [69]:
t3 = """The International Business Machines Corporation (IBM) is an American multinational information technology company headquartered in Armonk, New York, with operations in over 170 countries. The company began in 1911, founded in Endicott, New York, as the Computing-Tabulating-Recording Company (CTR) and was renamed "International Business Machines" in 1924. IBM is incorporated in New York.[5]

IBM produces and sells computer hardware, middleware and software, and provides hosting and consulting services in areas ranging from mainframe computers to nanotechnology. IBM is also a major research organization, holding the record for most U.S. patents generated by a business (as of 2020) for 27 consecutive years.[6] Inventions by IBM include the automated teller machine (ATM), the floppy disk, the hard disk drive, the magnetic stripe card, the relational database, the SQL programming language, the UPC barcode, and dynamic random-access memory (DRAM). The IBM mainframe, exemplified by the System/360, was the dominant computing platform during the 1960s and 1970s.

IBM has continually shifted business operations by focusing on higher-value, more profitable markets. This includes spinning off printer manufacturer Lexmark in 1991 and the sale of personal computer (ThinkPad/ThinkCentre) and x86-based server businesses to Lenovo (in 2005 and 2014, respectively), and acquiring companies such as PwC Consulting (2002), SPSS (2009), The Weather Company (2016), and Red Hat (2019). Also in 2015, IBM announced that it would go "fabless", continuing to design semiconductors, but offloading manufacturing to GlobalFoundries.

Nicknamed Big Blue, IBM is one of 30 companies included in the Dow Jones Industrial Average and one of the world's largest employers, with (as of 2018) over 350,000 employees, known as "IBMers". At least 70% of IBMers are based outside the United States, and the country with the largest number of IBMers is India.[7] IBM employees have been awarded five Nobel Prizes, six Turing Awards, ten National Medals of Technology (USA) and five National Medals of Science (USA)."""

In [79]:
sp.explain("NORP")

'Nationalities or religious or political groups'

In [74]:
obj4 = nlp(t3)
displacy.render(obj4,style="ent")

###### Find out the total number of occurrences of ORG entities

In [78]:
len([wd for wd in obj4.ents if wd.label_ == "ORG"])

26
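
The same filter can be generalized to count every label at once. A small sketch using `collections.Counter`, with the (text, label) pairs printed for `obj3` earlier copied in as stand-in data so it runs without a model. In the notebook, `Counter(ent.label_ for ent in obj4.ents)` would be the equivalent call:

```python
from collections import Counter

# (text, label) pairs copied from the obj3 entity listing, used as stand-in data
ents = [
    ("IBM", "ORG"), ("IBM", "ORG"), ("U.S.", "GPE"), ("2020", "DATE"),
    ("27", "CARDINAL"), ("IBM", "ORG"), ("SQL", "ORG"), ("UPC", "ORG"),
    ("IBM", "ORG"), ("the 1960s and 1970s", "DATE"),
]

# count how often each entity label occurs
label_counts = Counter(label for _, label in ents)
print(label_counts.most_common())
```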

###### Remove stop_words/punctuation using is_stop & is_punct attribute

###### create a list of words/sentence after removing stop_words then make sentence
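
A minimal sketch for the two headings above, assuming spaCy is installed. It uses `spacy.blank("en")`, which provides the tokenizer and the lexical attributes `is_stop` and `is_punct` without a trained model; the name `nlp_blank` is chosen here so the notebook's `nlp` object is left untouched:

```python
import spacy

# A blank English pipeline: tokenizer plus lexical attributes (is_stop, is_punct),
# but no tagger, parser or ner, so no model download is needed.
nlp_blank = spacy.blank("en")
doc = nlp_blank("They also like to keep sentences and paragraphs short.")

# keep tokens that are neither stop words nor punctuation
kept = [tok.text for tok in doc if not tok.is_stop and not tok.is_punct]
print(kept)

# join the remaining words back into a sentence
print(" ".join(kept))
```

With the notebook's own objects the same comprehension can be run over `obj`; adding `and not tok.is_space` would also drop the `'\n'` tokens seen in the earlier listings.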

###### Sentence and Word Tokenization
try at home

##### Pipelining:
- Get all the factory pipelining options available
- How to disable preloaded pipeline components to speed up processing
- Adding custom pipelines

In [82]:
nlp.pipe_names

['tagger', 'parser', 'ner']

In [83]:
nlp.pipeline

[('tagger', <spacy.pipeline.pipes.Tagger at 0x119647e80>),
 ('parser', <spacy.pipeline.pipes.DependencyParser at 0x11977e0a8>),
 ('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x11977e108>)]

In [84]:
nlp.factories

{'tokenizer': <function spacy.language.Language.<lambda>(nlp)>,
 'tensorizer': <function spacy.language.Language.<lambda>(nlp, **cfg)>,
 'tagger': <function spacy.language.Language.<lambda>(nlp, **cfg)>,
 'morphologizer': <function spacy.language.Language.<lambda>(nlp, **cfg)>,
 'parser': <function spacy.language.Language.<lambda>(nlp, **cfg)>,
 'ner': <function spacy.language.Language.<lambda>(nlp, **cfg)>,
 'entity_linker': <function spacy.language.Language.<lambda>(nlp, **cfg)>,
 'similarity': <function spacy.language.Language.<lambda>(nlp, **cfg)>,
 'textcat': <function spacy.language.Language.<lambda>(nlp, **cfg)>,
 'sentencizer': <function spacy.language.Language.<lambda>(nlp, **cfg)>,
 'merge_noun_chunks': <function spacy.language.Language.<lambda>(nlp, **cfg)>,
 'merge_entities': <function spacy.language.Language.<lambda>(nlp, **cfg)>,
 'merge_subtokens': <function spacy.language.Language.<lambda>(nlp, **cfg)>,
 'entity_ruler': <function spacy.language.Language.<lambda>(nlp, **

In [85]:
nlp.pipe_names

['tagger', 'parser', 'ner']

In [86]:
dis = nlp.disable_pipes("ner")
dis

[('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x11977e108>)]

In [87]:
nlp.pipe_names

['tagger', 'parser']

In [88]:
dis.restore()

In [89]:
nlp.pipe_names

['tagger', 'parser', 'ner']

###### add your own pipeline

In [90]:
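def upperizer(token):
    # spaCy passes the whole Doc here, not a single token (parameter name kept as written)
    # Returning a str instead of a Doc only works because this is the last component
    op = token.text.upper()
    return op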

In [92]:
nlp.add_pipe(upperizer)

In [93]:
nlp.pipe_names

['tagger', 'parser', 'ner', 'upperizer']

###### Check the added pipeline component

In [94]:
t4 = "This is testing of user added pipeline function."

In [95]:
nlp(t4)

'THIS IS TESTING OF USER ADDED PIPELINE FUNCTION.'
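
The result is a plain string because each pipeline component receives the output of the previous one, and whatever the last component returns becomes the result of `nlp(...)`. A toy sketch of that mechanism in plain Python; `MiniPipeline`, `lowercase` and `drop_short` are hypothetical stand-ins and not spaCy's API:

```python
class MiniPipeline:
    """Toy stand-in for a processing pipeline (not spaCy's implementation)."""

    def __init__(self):
        self.components = []   # ordered list of (name, callable)
        self.disabled = set()  # names of components to skip

    def add_pipe(self, func, name=None):
        self.components.append((name or func.__name__, func))

    @property
    def pipe_names(self):
        return [name for name, _ in self.components if name not in self.disabled]

    def __call__(self, doc):
        for name, func in self.components:
            if name in self.disabled:
                continue
            doc = func(doc)  # each component must return the object for the next one
        return doc


def lowercase(tokens):
    return [t.lower() for t in tokens]


def drop_short(tokens):
    return [t for t in tokens if len(t) > 2]


pipe = MiniPipeline()
pipe.add_pipe(lowercase)
pipe.add_pipe(drop_short)

words = "This is testing of user added pipeline function".split()
full_result = pipe(words)
print(pipe.pipe_names, full_result)

# disabling a component skips its work, which is where the time saving comes from
pipe.disabled.add("drop_short")
partial_result = pipe(words)
print(pipe.pipe_names, partial_result)
```

A component meant to sit in the middle of a real pipeline would therefore modify the Doc and return it, so that later components still receive a Doc.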

##### Reading a file and displaying entity

In [100]:
nlp.remove_pipe("upperizer")

('upperizer', <function __main__.upperizer(token)>)

In [101]:
nlp.pipe_names

['tagger', 'parser', 'ner']

In [102]:
with open("IBM.txt") as fh:  # the file is closed automatically after reading
    obj5 = nlp(fh.read())
displacy.render(obj5,style="ent")

##### Chunking

In [103]:
txt

'Commercial writers know that most people don’t want to read 1,000 \nwords of closely-spaced text in order to see what they are writing about, so \nthey also like to keep sentences and paragraphs short. \nThey’ll even use lots of sub-headers so you can see what each paragraph is about \nbefore you read it.'

In [104]:
for chunk in obj.noun_chunks:
    print(chunk)

Commercial writers
most people
1,000 
words
closely-spaced text
order
what
they
they
sentences
They
lots
sub-headers
you
what
each paragraph
you
it


##### Computing word similarity

In [108]:
from nltk.corpus import wordnet as wn

In [110]:
wn.synsets("like")

[Synset('like.n.01'),
 Synset('like.n.02'),
 Synset('wish.v.02'),
 Synset('like.v.02'),
 Synset('like.v.03'),
 Synset('like.v.04'),
 Synset('like.v.05'),
 Synset('like.a.01'),
 Synset('like.a.02'),
 Synset('alike.a.01'),
 Synset('comparable.s.02')]

In [119]:
ws1 = wn.synset("like.v.02")
ws2 = wn.synset("wish.v.02")

In [120]:
wn.wup_similarity(ws1,ws2)

0.4

In [117]:
ws1 = wn.synset("like.v.02")
ws2 = wn.synset("like.v.02")

In [118]:
wn.wup_similarity(ws1,ws2)

1.0
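
`wup_similarity` is the Wu-Palmer measure, which scores two senses by how deep their lowest common ancestor (LCS) sits in the taxonomy: 2 * depth(LCS) / (depth(s1) + depth(s2)). A worked sketch on a made-up three-level taxonomy, with the root at depth 1. The node names and the `PARENT` table are hypothetical, and this is not NLTK's implementation, which also has to deal with senses that have several hypernym paths:

```python
# child -> parent; "entity" is the root of this toy taxonomy
PARENT = {"animal": "entity", "plant": "entity", "dog": "animal", "cat": "animal"}

def path_to_root(node):
    """Return the node followed by all of its ancestors, e.g. ['dog', 'animal', 'entity']."""
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path

def depth(node):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(node))

def wup(a, b):
    """Wu-Palmer similarity on the toy taxonomy."""
    ancestors_b = set(path_to_root(b))
    # the first ancestor of a that is also an ancestor of b is the deepest shared one
    lcs = next(n for n in path_to_root(a) if n in ancestors_b)
    return 2 * depth(lcs) / (depth(a) + depth(b))

print(wup("dog", "cat"))    # share "animal" (depth 2): 2*2 / (3+3)
print(wup("dog", "plant"))  # share only the root (depth 1): 2*1 / (3+2)
print(wup("dog", "dog"))    # identical nodes: 2*3 / (3+3)
```

Identical senses always score 1.0, and the score falls as the shared ancestor moves closer to the root.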

##### n-grams (using nltk and sklearn-CountVectorizer())
- bi-grams
- tri-grams
- n-grams

In [128]:
t4 = "This is testing of user added pipeline function."
low = t4.split(" ")
low

['This', 'is', 'testing', 'of', 'user', 'added', 'pipeline', 'function.']

In [121]:
from nltk import bigrams,trigrams,ngrams

In [126]:
list(bigrams(t4.split(" ")))

[('This', 'is'),
 ('is', 'testing'),
 ('testing', 'of'),
 ('of', 'user'),
 ('user', 'added'),
 ('added', 'pipeline'),
 ('pipeline', 'function.')]

In [129]:
list(trigrams(t4.split(" ")))

[('This', 'is', 'testing'),
 ('is', 'testing', 'of'),
 ('testing', 'of', 'user'),
 ('of', 'user', 'added'),
 ('user', 'added', 'pipeline'),
 ('added', 'pipeline', 'function.')]

In [130]:
list(ngrams(t4.split(" "),6))

[('This', 'is', 'testing', 'of', 'user', 'added'),
 ('is', 'testing', 'of', 'user', 'added', 'pipeline'),
 ('testing', 'of', 'user', 'added', 'pipeline', 'function.')]
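
An n-gram list is a sliding window of size n over the token list, so a sentence with k tokens yields k - n + 1 n-grams. A plain-Python sketch of the same windowing; `simple_ngrams` is a hypothetical helper, not part of nltk:

```python
def simple_ngrams(tokens, n):
    """Slide a window of size n over tokens and return the n-grams as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "This is testing of user added pipeline function.".split(" ")

print(simple_ngrams(tokens, 2))  # bi-grams
print(simple_ngrams(tokens, 6))  # 8 tokens, window of 6, so 3 n-grams
```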

In [131]:
list(ngrams(t4.split(" "),1))

[('This',),
 ('is',),
 ('testing',),
 ('of',),
 ('user',),
 ('added',),
 ('pipeline',),
 ('function.',)]

In [132]:
list(ngrams(t4.split(" "),2))

[('This', 'is'),
 ('is', 'testing'),
 ('testing', 'of'),
 ('of', 'user'),
 ('user', 'added'),
 ('added', 'pipeline'),
 ('pipeline', 'function.')]

In [133]:
list(ngrams(t4.split(" "),3))

[('This', 'is', 'testing'),
 ('is', 'testing', 'of'),
 ('testing', 'of', 'user'),
 ('of', 'user', 'added'),
 ('user', 'added', 'pipeline'),
 ('added', 'pipeline', 'function.')]
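
The heading also mentions scikit-learn's `CountVectorizer`, which the cells above do not show. A minimal sketch, assuming scikit-learn is installed. Unlike the nltk calls, `CountVectorizer` with default settings lowercases the text, drops punctuation, and stores each n-gram as a single space-joined string:

```python
from sklearn.feature_extraction.text import CountVectorizer

t4 = "This is testing of user added pipeline function."

# ngram_range=(2, 2) asks for bi-grams only; (1, 3) would give uni-, bi- and tri-grams
vectorizer = CountVectorizer(ngram_range=(2, 2))
counts = vectorizer.fit_transform([t4])

# vocabulary_ maps each n-gram to its column index in the count matrix
bigram_list = sorted(vectorizer.vocabulary_)
print(bigram_list)
print(counts.toarray())
```

Changing `ngram_range` to `(3, 3)` gives tri-grams in the same way.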