<a href="https://colab.research.google.com/github/1daytotheleft/ENG3810/blob/main/token_tutorial_estherp.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# spaCy: How to Find Titled, Uppercased, and Lowercased Words in a Text

*spaCy is an extensive, open-source library for Natural Language Processing (NLP). It is written in Python and Cython and provides trained models for many natural languages. It offers a range of functions for building applications that process large volumes of text.*

*This tutorial uses spaCy version 2.2.4.*

## spaCy Token Attributes

spaCy `Token` objects are the "storage containers" for the individual pieces of a text, such as words or punctuation marks, together with linguistic information about them. A sequence of `Token` objects makes up a larger spaCy `Doc` object.

spaCy `Token` objects have various attributes that contain linguistic information about the `Token`. These attributes include the following:

|attribute|description|type|
|---------|-----------|----|
|is_alpha|Does the token consist of alphabetic characters? Equivalent to `token.text.isalpha()`.|bool|
|is_ascii|Does the token consist of ASCII characters? Equivalent to `all(ord(c) < 128 for c in token.text)`.|bool|
|is_digit|Does the token consist of digits? Equivalent to `token.text.isdigit()`.|bool|
|is_lower|Is the token in lowercase? Equivalent to `token.text.islower()`.|bool|
|is_upper|Is the token in uppercase? Equivalent to `token.text.isupper()`.|bool|
|is_title|Is the token in titlecase? Equivalent to `token.text.istitle()`.|bool|
|is_punct|Is the token punctuation?|bool|
|is_left_punct|Is the token a left punctuation mark, e.g. "(" ?|bool|
|is_right_punct|Is the token a right punctuation mark, e.g. ")" ?|bool|
|is_sent_start|Does the token start a sentence? Defaults to True for the first token in the Doc.|bool, or None if unknown|
|is_sent_end|Does the token end a sentence?|bool, or None if unknown|
|is_space|Does the token consist of whitespace characters? Equivalent to `token.text.isspace()`.|bool|
|is_bracket|Is the token a bracket?|bool|
|is_quote|Is the token a quotation mark?|bool|
|is_currency|Is the token a currency symbol?|bool|
|like_url|Does the token resemble a URL?|bool|
|like_num|Does the token represent a number? e.g. “10.9”, “10”, “ten”, etc.|bool|
|like_email|Does the token resemble an email address?|bool|
|is_oov|Is the token out-of-vocabulary (i.e. does it not have a word vector)?|bool|
|is_stop|Is the token part of a “stop list”?|bool|

See [spaCy Token API](https://spacy.io/api/token#attributes) for more information and attribute types.
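
Several attributes in the table are described as equivalent to a plain Python string check. The sketch below tries a few of those equivalents on raw strings. It does not use spaCy itself; the helper name `describe` and the sample strings are illustrative assumptions, not part of the library.

```python
# Plain-Python equivalents of a few Token attributes, applied to raw strings.
# `describe` is a hypothetical helper, not a spaCy function.
def describe(text):
    return {
        "is_alpha": text.isalpha(),
        "is_ascii": all(ord(c) < 128 for c in text),
        "is_digit": text.isdigit(),
        "is_space": text.isspace(),
    }

for sample in ["states", "50", "café", " "]:
    print(repr(sample), describe(sample))
```

Note that "café" is alphabetic but not ASCII, which shows why `is_alpha` and `is_ascii` are separate attributes.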

## For this tutorial, we will use the following attributes:

* `.is_title` to identify tokens with a capitalized first letter
* `.is_upper` to identify tokens written entirely in uppercase letters
* `.is_lower` to identify tokens written entirely in lowercase letters
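
Because these three attributes are documented as equivalent to `str.istitle()`, `str.isupper()`, and `str.islower()`, their behavior on edge cases can be previewed with plain strings. This is a minimal sketch that assumes the equivalence holds; the word list is illustrative.

```python
words = ["The", "USA", "has", "50", "states", "."]

for word in words:
    print(f"{word!r:10} title={word.istitle()} upper={word.isupper()} lower={word.islower()}")

# Strings without any cased characters ("50", ".") are False for all three checks.
uncased = [w for w in words if not (w.istitle() or w.isupper() or w.islower())]
print(uncased)
```

Numbers and punctuation are therefore never reported as titled, uppercased, or lowercased.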


## Step 1:
Import the spaCy library and load a language model. Then process the raw text into a `Doc` made up of tokens.

In [None]:
import spacy

# load spaCy language model

nlp = spacy.load('en_core_web_sm')

# process raw text into a spaCy Doc

doc = nlp('The USA has 50 states.')
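
If the `en_core_web_sm` model is not installed, tokenization and the casing attributes can still be explored with a blank English pipeline. This is a sketch under the assumption that only the tokenizer is needed: a pipeline created with `spacy.blank('en')` has no tagger, parser, or entity recognizer, so attributes that depend on those components are not available. The names `blank_nlp` and `blank_doc` are chosen here for illustration.

```python
import spacy

# A blank pipeline contains only the language's tokenizer; no model download is needed.
blank_nlp = spacy.blank('en')
blank_doc = blank_nlp('The USA has 50 states.')

print([token.text for token in blank_doc])
print([token.text for token in blank_doc if token.is_upper])
```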

## Step 2: 
You can access each token with a Python `for` loop, as shown below:

In [None]:
for token in doc:
  print(token)

# Note: print() displays the value passed inside its parentheses, here each token in turn.

The
USA
has
50
states
.


## Step 3: 
Within the `for` loop, you can print each token's text next to the value of an attribute, such as `is_upper`, to see which tokens the attribute flags.

In [None]:
for token in doc:
  print(token.text, token.is_upper)

The False
USA True
has False
50 False
states False
. False
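
Printing `True` or `False` for every token is useful for inspection, but the same attributes can also serve as filters to collect the matching words. The sketch below assumes only that each item has `text`, `is_title`, `is_upper`, and `is_lower` attributes, as spaCy tokens do. `group_by_case` is a hypothetical helper, and the `SimpleNamespace` stand-ins exist only so the example runs without a language model.

```python
from types import SimpleNamespace

def group_by_case(tokens):
    """Sort token texts into title-cased, uppercased, and lowercased lists."""
    groups = {"title": [], "upper": [], "lower": []}
    for token in tokens:
        # A one-letter word such as "A" is both title and upper; title is checked first.
        if token.is_title:
            groups["title"].append(token.text)
        elif token.is_upper:
            groups["upper"].append(token.text)
        elif token.is_lower:
            groups["lower"].append(token.text)
    return groups

# Stand-in tokens built from plain strings (an assumption for illustration).
fake_doc = [
    SimpleNamespace(text=w, is_title=w.istitle(), is_upper=w.isupper(), is_lower=w.islower())
    for w in ["The", "USA", "has", "50", "states", "."]
]
print(group_by_case(fake_doc))
```

With the `doc` from Step 1, calling `group_by_case(doc)` should produce the same grouping, since the helper only reads those four attributes.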


Select an option from the dropdown to see the corresponding code and its output for each attribute.

In [None]:
#@title { display-mode: "form" }

option1 = 'Upper' #@param ["Title", "Upper", "Lower"]
print('# You selected', option1)

if option1 == 'Title':
  print("""
  for token in doc:
   print(token.text, token.is_title)""")
  for token in doc:
    print(token.text, token.is_title)
elif option1 == 'Upper':
  print("""
  for token in doc:
   print(token.text, token.is_upper)""")
  for token in doc:
    print(token.text, token.is_upper)
else:
  print("""
  for token in doc:
   print(token.text, token.is_lower)""")
  for token in doc:
    print(token.text, token.is_lower)

# You selected Upper

  for token in doc:
   print(token.text, token.is_upper)
The False
USA True
has False
50 False
states False
. False
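
The three branches of the form cell differ only in the attribute name, so the lookup can also be done dynamically with Python's built-in `getattr`. This is a hedged sketch rather than the tutorial's own method: `ATTRIBUTES` and `flag_tokens` are hypothetical names, and stand-in tokens replace a real `Doc` so the block is self-contained.

```python
from types import SimpleNamespace

# Maps each dropdown option to the Token attribute it demonstrates.
ATTRIBUTES = {"Title": "is_title", "Upper": "is_upper", "Lower": "is_lower"}

def flag_tokens(tokens, option):
    """Return (text, flag) pairs for the attribute named by `option`."""
    attr = ATTRIBUTES[option]
    return [(token.text, getattr(token, attr)) for token in tokens]

# Stand-in tokens built from plain strings (an assumption for illustration).
sample = [
    SimpleNamespace(text=w, is_title=w.istitle(), is_upper=w.isupper(), is_lower=w.islower())
    for w in ["The", "USA", "has"]
]
for text, flag in flag_tokens(sample, "Upper"):
    print(text, flag)
```

Adding another attribute to the comparison then only requires a new entry in the mapping instead of a new branch.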
