# Natural Language Processing with **spaCy**

## spaCy - what is it?

Natural Language Processing (NLP) is part of Machine Learning (ML), and spaCy is an open-source Python library for processing and understanding large volumes of text data. It handles many languages too (except for Klingon, Bajoran and other Federation languages).

## Install & setup
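- For detailed instructions: https://spacy.io/usage/
- Step 1: in a **terminal**, run <code>pip3 install -U spacy</code> or <code>conda install -c conda-forge spacy</code>
- Step 2: as admin/sudo, run <code>python -m spacy download en</code> (on spaCy v3 and later, use the full model name <code>en_core_web_sm</code> instead of the <code>en</code> shortcut)
<br>
- The download was successful when you see: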


    Linking successful
    // your computer path of download
    You can now load the model via spacy.load('en')

## **Import spaCy** in notebook or Python

- This takes a while, as spaCy has a fairly large library to load

In [6]:
# Import spaCy and load the language library
import spacy

# ---- load spacy
# spacy has en_core_web_ (small: sm, medium: md and large: lg) 
nlp = spacy.load('en_core_web_sm') 

#---- create a Document object
# some documentation will have nlp(u"string example")
# but the u in the (u"s...") is optional, stands for unicode
doc = nlp("Captain Sisko loves Bajor")

# each word (or punctuation mark) in the string is a token
# loop over the doc and print the grammatical annotations of each token
# these are token attributes, e.g. .pos_ and .dep_
for token in doc:
    print(token.text) 
    print(':',token.pos_) # part of speech
    print(token.dep_,'\n') # syntax dependency

Captain
: PROPN
compound 

Sisko
: PROPN
nsubj 

loves
: VERB
ROOT 

Bajor
: NOUN
dobj 



In [7]:
doc2 = nlp(u"Quark is making 45 bars of Latinum or $2.3 million each year") 

for token in doc2:
    print(token.text) 
    print(':',token.pos_) # part of speech
    print(token.dep_,'\n') # syntax dependency

Quark
: PROPN
nsubj 

is
: AUX
aux 

making
: VERB
ROOT 

45
: NUM
nummod 

bars
: NOUN
dobj 

of
: ADP
prep 

Latinum
: PROPN
pobj 

or
: CCONJ
cc 

$
: SYM
quantmod 

2.3
: NUM
compound 

million
: NUM
conj 

each
: DET
det 

year
: NOUN
npadvmod 



If you are wondering what PROPN, AUX, CCONJ, etc. mean, spaCy can explain them:

In [8]:
spacy.explain('PROPN')

'proper noun'

In [9]:
spacy.explain('CCONJ')

'coordinating conjunction'

## Tokens are step one of the **NLP Pipeline** process
![](Pipeline.png)

### Part of Speech (POS) Tagging
- https://spacy.io/api/annotation/#pos-tagging 


- doc2 = nlp(u"Quark is making 45 bars of Latinum or $2.3 million each year")


|Attribute|Description|Value for `doc2[0]`|
|:------|:------:|:------|
|`.text`|The original word text|`Quark`|
|`.lemma_`|The base form of the word|`quark`|
|`.pos_`|The simple part-of-speech tag|`PROPN`/`proper noun`|
|`.tag_`|The detailed part-of-speech tag|`NNP`/`noun, proper singular`|
|`.shape_`|The word shape – capitalization, punctuation, digits|`Xxxxx`|
|`.is_alpha`|Does the token consist of alphabetic characters?|`True`|
|`.is_stop`|Is the token part of a stop list, i.e. the most common words of the language?|`False`|

In [10]:
## NLP pipeline
#  NER = named entity recognizer 

nlp.pipe_names

['tagger', 'parser', 'ner']

In [12]:
doc3 = nlp(u"Odo isn't investigating Quark for the first time, in awhile")

for token in doc3:
    print(token.text,':',token.pos_)

Odo : PROPN
is : AUX
n't : PART
investigating : VERB
Quark : PROPN
for : ADP
the : DET
first : ADJ
time : NOUN
, : PUNCT
in : ADP
awhile : ADJ


In [13]:
# lemmas
print(doc3[0].text)
print(doc3[0].lemma_)  

Odo
Odo


In [14]:
# Simple Parts-of-Speech & Detailed Tags:
print(doc3[4].pos_)
print(doc3[4].tag_ + ' / ' + spacy.explain(doc3[4].tag_)) 


PROPN
NNP / noun, proper singular


In [18]:
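# Word shapes: capital letters become X, lowercase letters become x
print(doc3[0].text +': '+ doc3[0].shape_)
print(doc[3].text +' : '+ doc[3].shape_)  # doc is the "Captain Sisko loves Bajor" Doc from earlier
# Odo = one capital + two lowercase letters = Xxx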

Odo: Xxx
Bajor : Xxxxx
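To make the shape idea concrete, here is a minimal pure-Python sketch of how a shape string could be computed. It is not spaCy's own code: the function name `word_shape` and the rule that a run of the same shape character is cut off after four are assumptions for illustration.

```python
def word_shape(text):
    """Approximate a spaCy-style word shape (illustrative sketch only)."""
    shape = []
    last = ""
    run = 0
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            shape_char = char  # punctuation and symbols are kept as-is
        if shape_char == last:
            run += 1
        else:
            run = 1
            last = shape_char
        # assumption: keep at most four of the same shape character in a row
        if run <= 4:
            shape.append(shape_char)
    return "".join(shape)


for word in ["Odo", "Bajor", "$2.3"]:
    print(word, "->", word_shape(word))
```

So the shape is a summary of capitalization, digits and punctuation rather than a literal word length.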


In [19]:
# Boolean Values:
print('.is_alpha :',doc2[0].is_alpha)
print('.is_stop :',doc2[0].is_stop) 

.is_alpha : True
.is_stop : False


## **Span**

- A span is a slice of a larger document, which makes a long text easier to work with
- <code> doc[start : stop] </code>

In [24]:
TNG= nlp(u"Space: the final frontier. These are the voyages of the starship Enterprise.\
Its five-year mission: to explore strange new worlds. To seek out new life and new civilizations. \
To boldly go where no [humanoid] has gone before!") 

In [25]:
trek_quote = TNG[16:30] 
print(trek_quote) 

five-year mission: to explore strange new worlds. To seek out


In [26]:
type(trek_quote) # slicing our TNG Doc gives back a Span

spacy.tokens.span.Span

If you want to print out just the sentences:

In [27]:
for sent in TNG.sents:
    print(sent) 

Space: the final frontier.
These are the voyages of the starship Enterprise.
Its five-year mission: to explore strange new worlds.
To seek out new life and new civilizations.
To boldly go where no [humanoid] has gone before!


## **Tokenization**
-  **Prefix**:	Character(s) at the beginning &#9656; `$ ( “ ¿`
-  **Suffix**:	Character(s) at the end &#9656; `km ) , . ! ”`
-  **Infix**:	Character(s) in between &#9656; `- -- / ...`
-  **Exception**: Special-case rule to split a string into several tokens or prevent a token from being split when punctuation rules are applied &#9656; `St. U.S.`


### Prefixes, Suffixes and Infixes

spaCy will isolate punctuation that does *not* form an integral part of a word. 
- Quotation marks, commas, and punctuation at the end of a sentence will be assigned their own token. 
- However, punctuation that exists as part of an email address, website or numerical value will be kept as part of the token.
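The prefix, suffix, infix and exception rules can be sketched in plain Python. This is a toy, not spaCy's tokenizer: `toy_tokenize`, the character sets and the two exception entries are all made up for illustration, and multi-character suffixes such as `km` are left out.

```python
PREFIXES = set('$("“¿')
SUFFIXES = set(')",.!”')
# hypothetical special cases: keep "U.S." whole, split "We're" in two
EXCEPTIONS = {"U.S.": ["U.S."], "We're": ["We", "'re"]}


def toy_tokenize(text):
    tokens = []
    for chunk in text.split():
        prefixes, suffixes = [], []
        # peel punctuation off both ends until nothing more comes off
        while chunk and chunk not in EXCEPTIONS:
            if chunk[0] in PREFIXES:
                prefixes.append(chunk[0])
                chunk = chunk[1:]
            elif chunk[-1] in SUFFIXES:
                suffixes.append(chunk[-1])
                chunk = chunk[:-1]
            else:
                break
        tokens.extend(prefixes)
        if chunk in EXCEPTIONS:
            tokens.extend(EXCEPTIONS[chunk])
        elif chunk.startswith(("http://", "https://")):
            tokens.append(chunk)  # URLs stay in one piece
        elif chunk:
            # split on infix hyphens, keeping each hyphen as its own token
            for i, part in enumerate(chunk.split("-")):
                if i > 0:
                    tokens.append("-")
                if part:
                    tokens.append(part)
        tokens.extend(reversed(suffixes))
    return tokens


print(toy_tokenize('"We\'re moving to Vulcan!"'))
print(toy_tokenize("com-channel or https://askpython.com, U.S."))
```

The order matters: exceptions are checked before anything is peeled off, which is why `U.S.` keeps its final period while the comma after the URL becomes its own token.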

In [30]:
# Create a string that includes 
# opening and closing quotation marks

mystring = '"We\'re moving to Vulcan!"'
print(mystring)


# Create a Doc object and explore tokens
doc = nlp(mystring)

for token in doc:
    print(token.text, end=' | ') 

"We're moving to Vulcan!"
" | We | 're | moving | to | Vulcan | ! | " | 

In [31]:
doc4 = nlp("You have reached the Starfleet IT Department,\
if you are experiencing problems, send a message to our comlink,\
com-channel extension is 6433, or visit https://askpython.com for further help.") 

for t in doc4:
    print(t) 

You
have
reached
the
Starfleet
IT
Department
,
if
you
are
experiencing
problems
,
send
a
message
to
our
comlink
,
com
-
channel
extension
is
6433
,
or
visit
https://askpython.com
for
further
help
.


Notice that the URL stays together as a single token. spaCy is pretty cool.

In [32]:
# how many tokens are there in doc4?  
# use len() function
len(doc4) 

35

In [34]:
# how many entries (lexemes) are currently in the loaded model's Vocab?
len(nlp.vocab)

569

### get **token** by index position and slice

- Unlike items in a regular Python list, tokens in a `Doc` CAN'T be reassigned by index

In [35]:
doc4[4] 

Starfleet

In [42]:
doc4[-4:]  # the last 4 tokens: 3 words plus the period

for further help.

Another part of the NLP pipeline:

## Named Entity Recognition (**NER**)

- The language model recognizes that certain words are organizational names while others are locations, and still other combinations relate to money, dates, etc. Named entities are accessible through the `ents` property of a `Doc` object.

In [43]:
named_ents = nlp("Starfleet HQ in San Francisco, and Tokyo costing $1.8 Billion a year")

for token in named_ents:
    print(token.text, end=' | ')

print('\n')

for ent in named_ents.ents:
    print(ent.text + ' - ' + ent.label_ + ' - ' + str(spacy.explain(ent.label_)) ) 
    

# note: len() of a Doc counts tokens, not characters
print('\nLength of string', len(named_ents))
print('number of entities', len(named_ents.ents)) 

Starfleet | HQ | in | San | Francisco | , | and | Tokyo | costing | $ | 1.8 | Billion | a | year | 

Starfleet HQ - ORG - Companies, agencies, institutions, etc.
San Francisco - GPE - Countries, cities, states
Tokyo - GPE - Countries, cities, states
$1.8 Billion - MONEY - Monetary values, including unit

Length of string 14
number of entities 4


Notice that Starfleet HQ is recognized as a named entity: an organization (ORG).

### Noun **Chunks**

- *Noun chunks* are "base noun phrases" – flat phrases that have a noun as their head. 
- You can think of noun chunks as a noun plus the words describing the noun

In [44]:
chunky = nlp("Autonomous shuttles cause \
insurance premiums to be high for the Federation")

for chunk in chunky.noun_chunks:
    print(chunk) 

Autonomous shuttles
insurance premiums
the Federation


In [48]:
chunky2 = nlp("red spacecraft have higher \
insurance rates debunked says Federation News")

for chunk in chunky2.noun_chunks: # noun_chunks 
    print(chunk.text) 

red spacecraft
higher insurance rates
Federation News


In [52]:
chunky3 = nlp("Gul Dukat fired from position with The Dominion,\
reports Jake Sisko from the Federation News")

for chunk in chunky3.noun_chunks:
    print(chunk.text) 

Gul Dukat
position
The Dominion
Jake Sisko
the Federation News


Now for the visualization part of spaCy, which is what makes it so cool.

## Built-in Visualization: **displaCy**

- spaCy includes a built-in visualization tool called **displaCy**. displaCy is able to detect whether you're working in a Jupyter notebook, and will return markup that can be rendered in a cell right away. When you export your notebook, the visualizations will be included as HTML.
- https://spacy.io/usage/visualizers
- The dependency visualizer, **dep**, shows part-of-speech tags and syntactic dependencies.

In [53]:
from spacy import displacy

In [54]:
gamma = nlp("The Gamma quadrant is full of mystery and wonder")

# displacy.render() 
displacy.render(gamma,
                style='dep', # parts of speech & syntax
                jupyter=True, # True, running this code in Jupyter
                options={
                    'distance':110, # distance between words
                    'color':'#ffff66', # Trek yellow
                    'bg': '#000', # background color, black
                    # can set type of font you want used
                    # 'font':'Montserrat'
                    } 
               )

Another visualization for our documents: the named entities.

## Entity Recognizer Visualization

In [58]:
doc201 = nlp("Ferenginar stock exchange has increased profit by 34.7%, that's $2.12 million")

displacy.render(
    doc201,
    style='ent', # entity
    jupyter=True,
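    options={
        # note: 'color' and 'bg' are options of the 'dep' visualizer;
        # as far as I know the 'ent' visualizer ignores them and takes
        # 'ents' (labels to show) and 'colors' (label -> color) instead
        'color':'#000',
        'bg': '#ff3399'
        } 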
    )

In case you have a hard time seeing the printout in a dark notebook:


<font color='#ffff66'> Ferenginar stock exchange ORG </font>

has increased profit by <font color='#66ff99'>34.7% PERCENT</font> , 


that's <font color='#66ff99'> $2.12 million MONEY </font>

More pipeline processing:

## Lemmatization
- Lemmatization looks beyond simple word reduction (stemming): it considers a language's full vocabulary and applies a morphological analysis to each word.
- The lemma of 'was' is 'be' and the lemma of 'mice' is 'mouse'.
- Further, the lemma of 'meeting' might be 'meet' or 'meeting' depending on its use in a sentence.
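The point that a lemma can depend on how the word is used can be sketched with a tiny lookup keyed on both the word and its part of speech. The table and the `toy_lemma` helper below are hand-written for illustration; this is not how spaCy's lemmatizer is implemented.

```python
# (lowercased word, part of speech) -> lemma; a tiny hand-written table
LEMMA_TABLE = {
    ("was", "AUX"): "be",
    ("mice", "NOUN"): "mouse",
    ("meeting", "VERB"): "meet",
    ("meeting", "NOUN"): "meeting",
}


def toy_lemma(word, pos):
    """Look up a lemma by word and part of speech; unknown words are returned unchanged."""
    return LEMMA_TABLE.get((word.lower(), pos), word)


print(toy_lemma("meeting", "VERB"))  # "we are meeting at noon"
print(toy_lemma("meeting", "NOUN"))  # "the meeting ran long"
print(toy_lemma("mice", "NOUN"))
```

Because the part-of-speech tag is part of the key, the same spelling can map to different lemmas, which is why the tagger runs before lemmas are looked up.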

In [59]:
lemmy = nlp("The Chief of Engineering says the warpcore engine needs an update")

for token in lemmy:
    print(token.text, '\t', token.pos_, '\t', token.lemma, '\t', token.lemma_) 

The 	 DET 	 7425985699627899538 	 the
Chief 	 PROPN 	 9615737558093935814 	 Chief
of 	 ADP 	 886050111519832510 	 of
Engineering 	 PROPN 	 17806784801972435463 	 Engineering
says 	 VERB 	 8685289367999165211 	 say
the 	 DET 	 7425985699627899538 	 the
warpcore 	 NOUN 	 15559318848824790373 	 warpcore
engine 	 NOUN 	 10857443142397967886 	 engine
needs 	 VERB 	 478886015463313967 	 need
an 	 DET 	 15099054000809333061 	 an
update 	 NOUN 	 1936357517718432020 	 update


spaCy has not seen `warpcore` before, yet it still labels it as a noun.


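Note: the long numbers (`token.lemma`) are hash IDs that spaCy uses internally to look up the lemma string (`token.lemma_`). The same lemma always gets the same number, but the number says nothing about how common the word is in English.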
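Those long numbers are easier to understand as keys in a string table: each string maps to one integer ID, and the ID maps back to the string. Here is a minimal sketch of that idea. `ToyStringStore` is a made-up class, and the SHA-256-based hash is only a stand-in, so its IDs will not equal the ones spaCy prints.

```python
import hashlib


class ToyStringStore:
    """Two-way lookup between strings and 64-bit integer IDs (illustrative only)."""

    def __init__(self):
        self._by_id = {}

    def add(self, text):
        digest = hashlib.sha256(text.encode("utf8")).digest()
        key = int.from_bytes(digest[:8], "big")  # stand-in hash, not spaCy's
        self._by_id[key] = text
        return key

    def __getitem__(self, key):
        return self._by_id[key]


store = ToyStringStore()
the_id = store.add("the")
print(the_id == store.add("the"))  # the same string always gets the same ID
print(store[the_id])               # and the ID maps back to the string
```

This is why both `the` rows in the output above share the same number: the ID identifies the lemma string, nothing more.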

Okay, let's repeat one word in several forms and see what lemma each form gets
- example: run (base form), running, runs

In [62]:
lemmy2 = nlp("the Chief of Engineering runs around, \
always running software to see if we have any errors in the running program, it's a race")

for token in lemmy2:
    print(token.text, '\t', token.pos_, 
          '\t#', token.lemma, '\t', token.lemma_) 

the 	 DET 	# 7425985699627899538 	 the
Chief 	 PROPN 	# 9615737558093935814 	 Chief
of 	 ADP 	# 886050111519832510 	 of
Engineering 	 PROPN 	# 17806784801972435463 	 Engineering
runs 	 VERB 	# 12767647472892411841 	 run
around 	 ADV 	# 3194226484742107227 	 around
, 	 PUNCT 	# 2593208677638477497 	 ,
always 	 ADV 	# 17471638809377599778 	 always
running 	 VERB 	# 12767647472892411841 	 run
software 	 NOUN 	# 8212201967714533330 	 software
to 	 PART 	# 3791531372978436496 	 to
see 	 VERB 	# 11925638236994514241 	 see
if 	 SCONJ 	# 12446819118446800910 	 if
we 	 PRON 	# 561228191312463089 	 -PRON-
have 	 AUX 	# 14692702688101715474 	 have
any 	 DET 	# 13148361048351484388 	 any
errors 	 NOUN 	# 5141748423479617815 	 error
in 	 ADP 	# 3002984154512732771 	 in
the 	 DET 	# 7425985699627899538 	 the
running 	 NOUN 	# 12212083579121184944 	 running
program 	 NOUN 	# 17812688126189747487 	 program
, 	 PUNCT 	# 2593208677638477497 	 ,
it 	 PRON 	# 561228191312463089 	 -PRON-
's 	 AUX 	# 103825

In [63]:
# display the lemmas nicely
def show_lemmas(text):
    for token in text:
        # using f-string formatting, numbers determine spacing
        print(f'{token.text:{12}} {token.pos_:{6}} {token.lemma:<{22}} {token.lemma_}') 

In [64]:
show_lemmas(lemmy2) 

the          DET    7425985699627899538    the
Chief        PROPN  9615737558093935814    Chief
of           ADP    886050111519832510     of
Engineering  PROPN  17806784801972435463   Engineering
runs         VERB   12767647472892411841   run
around       ADV    3194226484742107227    around
,            PUNCT  2593208677638477497    ,
always       ADV    17471638809377599778   always
running      VERB   12767647472892411841   run
software     NOUN   8212201967714533330    software
to           PART   3791531372978436496    to
see          VERB   11925638236994514241   see
if           SCONJ  12446819118446800910   if
we           PRON   561228191312463089     -PRON-
have         AUX    14692702688101715474   have
any          DET    13148361048351484388   any
errors       NOUN   5141748423479617815    error
in           ADP    3002984154512732771    in
the          DET    7425985699627899538    the
running      NOUN   12212083579121184944   running
program      NOUN   178126881261897

As fun as that is, further analysis usually requires removing words that do not help in finding patterns or specific word associations. These are called *stop words*. spaCy keeps its default stop words in a large Python set of very common words such as "is" and "beyond", and the set can be modified to add or remove words.

## **Stop Words**

In [70]:
# Print the set of spaCy's default stop words 
# (remember that sets are unordered):

#-- uncomment to see the words
# print(nlp.Defaults.stop_words) 

print('number of stop_words:', len(nlp.Defaults.stop_words) ) 

number of stop_words: 326


In [71]:
# check if a word is part of this set
nlp.vocab['is'].is_stop

True

In [72]:
nlp.vocab['space'].is_stop 

False
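Once you can ask whether a word is a stop word, filtering is just a membership test. Here is a minimal pure-Python sketch, assuming a tiny hand-picked stop list (`toy_stop_words`, much smaller than spaCy's default set) and a hypothetical `remove_stop_words` helper.

```python
# a tiny hand-picked stop list for illustration only
toy_stop_words = {"these", "are", "the", "of", "is", "a", "to", "and"}


def remove_stop_words(words, stop_words):
    """Keep only the words that are not in the stop list (case-insensitive)."""
    return [w for w in words if w.lower() not in stop_words]


words = "These are the voyages of the starship Enterprise".split()
print(remove_stop_words(words, toy_stop_words))
```

With a spaCy `Doc`, the same filter would check each token's `.is_stop` attribute instead of looking the word up in a hand-made set.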

## Add a **stop_word**

In [77]:
# 'btw' = by the way
# Add the word to the set of stop words. Use lowercase!
nlp.Defaults.stop_words.add('btw')

# Set the stop_word tag on the lexeme
nlp.vocab['btw'].is_stop = True

# 326 stop_words before
len(nlp.Defaults.stop_words) # added it to the set

327

## Remove a **stop_word**

In [78]:
# word 'beyond' to be removed for example
# Remove the word from the set of stop words
nlp.Defaults.stop_words.remove('beyond')

# Remove the stop_word tag from the lexeme
nlp.vocab['beyond'].is_stop = False

In [79]:
len(nlp.Defaults.stop_words) # dropped the word from the set

326

### Phrases: Vocab & Matching

- spaCy offers a rule-matching tool called **Matcher** that allows you to build a library of token patterns, then match those patterns against a Doc object to return a list of found matches. 
- You can match on any part of the token including text and annotations, and you can add multiple patterns to the same matcher.

Rule-based token Matcher attributes:
<table><tr><th>Attribute</th><th>Description</th></tr>

<tr ><td><span >`ORTH`</span></td><td>The exact verbatim text of a token</td></tr>
<tr ><td><span >`LOWER`</span></td><td>The lowercase form of the token text</td></tr>
<tr ><td><span >`LENGTH`</span></td><td>The length of the token text</td></tr>
<tr ><td><span >`IS_ALPHA`, `IS_ASCII`, `IS_DIGIT`</span></td><td>Token text consists of alphanumeric characters, ASCII characters, digits</td></tr>
<tr ><td><span >`IS_LOWER`, `IS_UPPER`, `IS_TITLE`</span></td><td>Token text is in lowercase, uppercase, titlecase</td></tr>
<tr ><td><span >`IS_PUNCT`, `IS_SPACE`, `IS_STOP`</span></td><td>Token is punctuation, whitespace, stop word</td></tr>
<tr ><td><span >`LIKE_NUM`, `LIKE_URL`, `LIKE_EMAIL`</span></td><td>Token text resembles a number, URL, email</td></tr>
<tr ><td><span >`POS`, `TAG`, `DEP`, `LEMMA`, `SHAPE`</span></td><td>The token's simple and extended part-of-speech tag, dependency label, lemma, shape</td></tr>
<tr ><td><span >`ENT_TYPE`</span></td><td>The token's entity label</td></tr>

</table>

In [80]:
# Import the Matcher library
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab) 

Each pattern is a list of dictionaries, one dictionary per token. Each dictionary maps a token attribute (such as `LOWER`) to the value that token must have for the pattern to match.

In [81]:
# define a pattern to match
pattern1 = [ {'LOWER': 'warpcore'} ]
pattern2 = [ {'LOWER': 'warp'}, {'LOWER': 'core'}, ] # warp core
pattern3 = [ {'LOWER': 'warp'}, {'IS_PUNCT': True}, {'LOWER': 'core'}, ] # warp-core

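# spaCy v2 signature: add(key, on_match_callback, *patterns)
# (in spaCy v3 the call is matcher.add('WarpCore', [pattern1, pattern2, pattern3]))
matcher.add('WarpCore', None, pattern1, pattern2, pattern3) 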

- pattern1 looks for 'warpcore'
- pattern2 looks for 'warp core'
- pattern3 looks for 'warp-core'
- callbacks are set to None
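To see what the three patterns are doing, here is a minimal pure-Python sketch of token-pattern matching over a plain list of token strings. It is not spaCy's Matcher: `find_matches`, `lower` and `is_punct` are hypothetical helpers, and the token list is typed in by hand.

```python
def lower(word):
    """Build a test that compares a token's lowercase form to `word` (like LOWER)."""
    return lambda tok: tok.lower() == word


def is_punct(tok):
    """Crude stand-in for IS_PUNCT: no letters or digits in the token."""
    return all(not ch.isalnum() for ch in tok)


def find_matches(tokens, patterns):
    """Return (start, end) for every place one of the patterns matches."""
    matches = []
    for start in range(len(tokens)):
        for pattern in patterns:
            end = start + len(pattern)
            if end > len(tokens):
                continue
            window = tokens[start:end]
            # one test per token, in order
            if all(test(tok) for test, tok in zip(pattern, window)):
                matches.append((start, end))
    return matches


pattern1 = [lower("warpcore")]
pattern2 = [lower("warp"), lower("core")]
pattern3 = [lower("warp"), is_punct, lower("core")]

tokens = ["There", "is", "a", "warpcore", "breach", "due", "to", "the",
          "warp", "-", "core", "not", "being", "properly", "installed",
          "like", "the", "manufacturer", "of", "the", "warp", "core",
          "states", "in", "the", "manual"]
print(find_matches(tokens, [pattern1, pattern2, pattern3]))
```

Each match is a `(start, end)` pair of token positions, the same shape as the last two numbers in every tuple the real Matcher returns.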

## Apply **matcher** to a doc object

In [89]:
matchy = nlp("There is a warpcore breach due to the warp-core not being properly installed \
like the manufacturer of the warp core states in the manual")

found_matches = matcher(matchy)
print(found_matches) 

[(10517466410332998210, 3, 4), (10517466410332998210, 8, 11), (10517466410332998210, 20, 22)]


In [93]:
for match_id, start, end in found_matches:
    string_id = nlp.vocab.strings[match_id]  # get string representation
    span = matchy[start:end]                 # get the matched span from the doc we searched
    print( string_id, start, end, span.text) 

WarpCore 3 4 warpcore
WarpCore 8 11 warp-core
WarpCore 20 22 warp core


### Optional token rules in patterns

The following quantifiers can be passed to the `'OP'` key:
<table><tr><th>OP</th><th>Description</th></tr>

<tr ><td><span >\!</span></td><td>Negate the pattern, by requiring it to match exactly 0 times</td></tr>
<tr ><td><span >?</span></td><td>Make the pattern optional, by allowing it to match 0 or 1 times</td></tr>
<tr ><td><span >\+</span></td><td>Require the pattern to match 1 or more times</td></tr>
<tr ><td><span >\*</span></td><td>Allow the pattern to match zero or more times</td></tr>
</table>

In [96]:
# define a pattern to match
pattern1 = [ {'LOWER': 'warpcore'} ] 
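pattern3 = [ {'LOWER': 'warp'}, {'IS_PUNCT': True, 'OP':'*'}, {'LOWER': 'core'}, ] # warp core, warp-core, warp--core

# Remove the old patterns to avoid duplication
# (the key must match the one used in add(), including case):
if 'WarpCore' in matcher:
    matcher.remove('WarpCore') 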

matcher.add('WarpCore', None, pattern1, pattern3) 

In [100]:
matchy2 = nlp("it is important to have a maintained warpcore, unless you want a warp--core")

found_matches = matcher(matchy2) 
print(found_matches) 

[(10517466410332998210, 7, 8), (10517466410332998210, 13, 16)]


In [102]:
print(matchy2[7:8] )
print(matchy2[13:16]) 

warpcore
warp--core
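The `'*'` quantifier behaves a lot like `*` in a regular expression, applied to whole tokens instead of characters. Here is a small backtracking sketch in plain Python that handles `?`, `+` and `*`. It is not spaCy's algorithm: `match_here` and `find_matches` are hypothetical helpers, `"1"` is an invented marker for "exactly once", `!` is left out, and only the longest match per start position is reported.

```python
def lower(word):
    return lambda tok: tok.lower() == word


def is_punct(tok):
    return all(not ch.isalnum() for ch in tok)  # crude stand-in for IS_PUNCT


def match_here(tokens, pos, pattern):
    """Return the end index if `pattern` matches starting at `pos`, else None."""
    if not pattern:
        return pos
    (test, op), rest = pattern[0], pattern[1:]
    matched = pos < len(tokens) and test(tokens[pos])
    if matched:
        # after consuming a token, "+" and "*" may consume more of the same
        follow = [(test, "*")] + rest if op in ("+", "*") else rest
        end = match_here(tokens, pos + 1, follow)
        if end is not None:
            return end
    if op in ("?", "*"):
        return match_here(tokens, pos, rest)  # zero occurrences are allowed
    return None


def find_matches(tokens, pattern):
    matches = []
    for start in range(len(tokens)):
        end = match_here(tokens, start, pattern)
        if end is not None and end > start:
            matches.append((start, end))
    return matches


pattern = [(lower("warp"), "1"), (is_punct, "*"), (lower("core"), "1")]
tokens = ["it", "is", "important", "to", "have", "a", "maintained", "warpcore",
          ",", "unless", "you", "want", "a", "warp", "--", "core"]
print(find_matches(tokens, pattern))
```

Because the punctuation rule may match zero times, the same pattern also covers plain `warp core` with nothing in between.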


Now we move on to matching phrases in a document.

## Phrase Matcher

In [103]:
# Import the PhraseMatcher library
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab) 

In [105]:
# https://en.wikipedia.org/wiki/Starfleet

# open the saved file
with open("starfleet.txt", encoding="utf8") as f:
    wiki_doc = nlp(f.read()) 

In [106]:
#  phrases list for matching
phrase_list = ['Starfleet','engineer','Trek','Academy','Medical']

# Next, convert each phrase to a Doc object:
#  list comprehension
phrase_patterns = [nlp(text) for text in phrase_list] 

# Pass each Doc object into matcher (note the use of the asterisk!):
# (spaCy v3 takes a list instead: matcher.add('federation', phrase_patterns))
matcher.add('federation', None, *phrase_patterns)

# Build a list of matches:
found_matches = matcher(wiki_doc) 

In [107]:
phrase_patterns

[Starfleet, engineer, Trek, Academy, Medical]

In [108]:
len(found_matches) 

43

In [110]:
# print every match: the match key ('federation'), the start and end
# token indices, and the matched text

for match_id, start, end in found_matches:
    string_id = nlp.vocab.strings[match_id]  # get string representation
    span = wiki_doc[start:end]               # get the matched span
    # print(match_id, string_id, start, end, span.text)
    print( string_id, start, end, span.text) 

federation 0 1 Starfleet
federation 8 9 Trek
federation 17 18 Starfleet
federation 56 57 Starfleet
federation 75 76 Trek
federation 84 85 Starfleet
federation 112 113 Starfleet
federation 119 120 Starfleet
federation 166 167 Starfleet
federation 178 179 Starfleet
federation 179 180 Academy
federation 187 188 Trek
federation 193 194 Starfleet
federation 194 195 Academy
federation 219 220 Starfleet
federation 231 232 Starfleet
federation 241 242 Starfleet
federation 246 247 Starfleet
federation 280 281 Trek
federation 287 288 Trek
federation 331 332 Trek
federation 340 341 Starfleet
federation 351 352 Starfleet
federation 377 378 Starfleet
federation 384 385 Starfleet
federation 414 415 Trek
federation 437 438 Trek
federation 460 461 Trek
federation 479 480 Starfleet
federation 503 504 Trek
federation 515 516 Starfleet
federation 531 532 Starfleet
federation 532 533 Medical
federation 538 539 Starfleet
federation 551 552 Trek
federation 579 580 Starfleet
federation 580 581 Medical
federa

In [111]:
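# the context of the word in the document: 3 tokens on either side
# note: for a match at the very start, start-3 is negative, so the slice
# comes back empty (see the first line of output); max(start-3, 0) avoids that
for match_id, start, end in found_matches:
    string_id = nlp.vocab.strings[match_id]  # get string representation
    span = wiki_doc[start-3:end+3]           # the matched span plus its context
    print(string_id, start, end, span.text) 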

federation 0 1 
federation 8 9 in the Star Trek media franchise.
federation 17 18 fictional universe, Starfleet is a uniformed
federation 56 57 diplomacy (although Starfleet predates the Federation
federation 75 76 television series Star Trek: Enterprise)
federation 84 85 the majority of Starfleet's members are
federation 112 113 's protagonists are Starfleet commissioned officers.
federation 119 120 

Mission: Starfleet has been shown
federation 166 167 The flagship of Starfleet is often considered
federation 178 179 Enterprise.

Starfleet Academy: As
federation 179 180 .

Starfleet Academy: As early
federation 187 188 the original Star Trek, characters refer
federation 193 194 refer to attending Starfleet Academy. Later
federation 194 195 to attending Starfleet Academy. Later series
federation 219 220 is located near Starfleet Headquarters in what
federation 231 232 California.

Starfleet Command: is
federation 241 242 command center of Starfleet. The term
federation 246 247 The term
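The empty first line in the output above comes from the negative slice start for the match at token 0. Here is a minimal sketch of a clamped context window, shown on a plain Python list; `context_window` is a hypothetical helper, not part of spaCy.

```python
def context_window(tokens, start, end, width=3):
    """Return the matched tokens plus up to `width` tokens on each side."""
    left = max(start - width, 0)             # never let the start go negative
    right = min(end + width, len(tokens))    # and never run past the end
    return tokens[left:right]


tokens = "These are the voyages of the starship Enterprise".split()
print(tokens[0 - 3:1 + 3])            # unclamped: -3 counts from the end, so this is empty
print(context_window(tokens, 0, 1))   # clamped: the match plus three tokens to its right
```

The same `max(start - 3, 0)` clamp could be used when slicing the `Doc` in the loop above.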

In [131]:
# look up matched spans directly by their token indices
# the bigger the slice, the more context around the phrase
x = wiki_doc[0:1] 
v = wiki_doc[8:9] 
m = wiki_doc[17:18]
print("{} | {} | {}".format(x,v,m)) 

Starfleet | Trek | Starfleet


In [132]:
wiki_doc[200:250] 

as an officer training facility with a four-year educational program. The main campus is located near Starfleet Headquarters in what is now Fort Baker, California.

Starfleet Command: is the headquarters/command center of Starfleet. The term "Starfleet Command" is

In [133]:
# Build a list of sentences
sents = [sent for sent in wiki_doc.sents]

# Each sentence is a Span, so it carries start and end token indices:
print(sents[0].start, sents[0].end) 

0 12


In [135]:
sents[2] 

While the majority of Starfleet's members are human and it is headquartered on Earth, hundreds of other species are also represented.

In [138]:
sents[17]  

The Wrath of Khan, the design of the Yellowstone-class Runabout in the alternate timeline in the Star Trek: Voyager episode "Non Sequitur", and devising a defense against the Breen energy-dampening weapon in the Star Trek: