#  <font color = 'dodgerblue'> **Objective**
The first step in most NLP projects is to clean the text. For example, we might want to remove punctuation, extra white space, etc. Further, we want to break our strings into tokens. This step is required because we want to learn vector (numeric) representations of the tokens that we can use in our models. spaCy is a very useful library that can help us with text cleaning and tokenization. In this notebook, you will learn the basics of the spaCy library.

After completing this notebook, you will be able to
- Clean text using spacy
- Create tokens using spacy
- Extract Part of Speech Tags
- Extract Named Entities

#  <font color = 'dodgerblue'>**Install latest version of spaCy**

In [None]:
!pip install -U spacy

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


#  <font color = 'dodgerblue'>**Import Libraries**

In [None]:
import spacy
import pandas as pd

In [None]:
# check spaCy version
print(spacy.__version__)

3.4.1


#  <font color = 'dodgerblue'>**Sample Strings**

In [None]:
# Sample String - Create a sample String

text1 = """China's capital is Beijing. \n\nBeijing is where we'll go. \n\nLet's travel to Hong Kong from Beijing. \
          \n\nA friend is pursuing his M.S from Beijing. \n\nBeijing is a cool place!!! :-P <3 #Awesome \
          \n\nA Rolex watch costs in the range of $3000.0 - $8000.0 in USA and China. \n\n@tompeter I'm \ 
          \n\ngoing to buy a Rolexxxxxxxx watch!!! :-D #happiness #rolex <3 \
          for more info see: http://www.example_beijing.com! Ten is different from 10"""

In [None]:
text2  = """China's capital is Beijing. \n\nA Rolex watch costs in the range of $3000.0 - $8000.0 in USA"""

#  <font color = 'dodgerblue'>**White Space Tokenizers**

In [None]:
# Whitespace Tokenizer splits text across whitespaces
text2.split()

["China's",
 'capital',
 'is',
 'Beijing.',
 'A',
 'Rolex',
 'watch',
 'costs',
 'in',
 'the',
 'range',
 'of',
 '$3000.0',
 '-',
 '$8000.0',
 'in',
 'USA']

#  <font color = 'dodgerblue'>**spaCy Basics**

**spaCy** (https://spacy.io/) is an open-source Python library that parses and "understands" large volumes of text. Separate models are available that cater to specific languages (English, French, German, etc.).

Models in spaCy for English Language as of release 3.0.0
- **en_core_web_sm:** 11MB
- **en_core_web_md:** 48MB
- **en_core_web_lg:** 746MB
<br><br>
![picture](https://spacy.io/pipeline-7a14d4edd18f3edfee8f34393bff2992.svg)

Picture Source : https://spacy.io/pipeline-7a14d4edd18f3edfee8f34393bff2992.svg

The first step in spaCy is to create an `nlp` object. The `nlp` object is an instance of a model and consists of a sequence of operations such as the tokenizer, tagger, parser, NER, etc. (see figure above). When a text is passed through the object, it goes through these operations in order. When creating the object, we can disable the operations that we do not need.
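
The idea of a pipeline whose components can be switched off can be sketched with plain Python. This is a toy stand-in, not spaCy's implementation: the step functions, the `PIPELINE` list and the `run` helper are all hypothetical names chosen for illustration.

```python
# Toy stand-in for a processing pipeline: an ordered list of (name, function) steps
def tokenize(doc):
    doc["tokens"] = doc["text"].split()
    return doc


def count_tokens(doc):
    doc["n_tokens"] = len(doc["tokens"])
    return doc


def uppercase(doc):
    doc["upper"] = [t.upper() for t in doc["tokens"]]
    return doc


PIPELINE = [("tokenizer", tokenize), ("counter", count_tokens), ("upper", uppercase)]


def run(text, disable=()):
    """Pass the text through every step whose name is not in `disable`."""
    doc = {"text": text}
    for name, step in PIPELINE:
        if name in disable:
            continue  # a disabled component is simply skipped
        doc = step(doc)
    return doc


print(run("China's capital is Beijing.", disable=["upper"]))
```

Skipping components that are not needed saves processing time, which is the motivation for disabling parts of the spaCy pipeline later in this notebook.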


 ## <font color = 'dodgerblue'>**Download Model**

In [None]:
!python -m spacy download en_core_web_sm

2022-08-28 13:06:49.949860: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting en-core-web-sm==3.4.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.0/en_core_web_sm-3.4.0-py3-none-any.whl (12.8 MB)
     |████████████████████████████████| 12.8 MB 25.5 MB/s 
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


 ## <font color = 'dodgerblue'>**Load Model**

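When we load the model, we can use the `disable` argument to switch off the components we do not need. Components can also be disabled after loading with `nlp.select_pipes`, which is the approach used below.

In this part of the notebook we only need the tokenizer, so we will disable every other pipeline component.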

In [None]:
nlp = spacy.load('en_core_web_sm')

In [None]:
print(nlp.pipe_names)

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']


In [None]:
disabled = nlp.select_pipes(disable= ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner'])

In [None]:
print(nlp.pipe_names)

[]


# <font color = 'dodgerblue'>**Tokenization in spaCy**
Text is tokenized when it is passed through the language processing pipeline, i.e. when we call `nlp(text)`.

![picture](https://spacy.io/tokenization-9b27c0f6fe98dcb26239eba4d3ba1f3d.svg)

Picture Source: https://spacy.io/tokenization-9b27c0f6fe98dcb26239eba4d3ba1f3d.svg

The algorithm below is taken from Tokenization part from spaCy's Documentation.  
https://spacy.io/usage/linguistic-features#tokenization



1. Iterate over space-separated substrings.
2. Look for a token match. If there is a match, stop processing and keep this token.
3. Check whether we have an explicitly defined special case for this substring. If we do, use it.
4. Otherwise, try to consume one prefix. If we consumed a prefix, go back to #2, so that the token match and special cases always get priority.
5. If we didn’t consume a prefix, try to consume a suffix and then go back to #2.
6. If we can’t consume a prefix or a suffix, look for a URL match.
7. If there’s no URL match, then look for a special case.
8. Look for “infixes” – stuff like hyphens etc. and split the substring into tokens on all infixes.
9. Once we can’t consume any more of the string, handle it as a single token.
10. Make a final pass over the text to check for special cases that include spaces or that were missed due to the incremental processing of affixes.
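
The loop described above can be sketched in plain Python. This is a toy illustration under stated assumptions, not spaCy's implementation: `SPECIAL_CASES`, `PREFIX_RE`, `SUFFIX_RE` and `INFIX_RE` are tiny hypothetical rule sets, and the token-match, URL-match and final-pass steps are left out.

```python
import re

# Hypothetical, heavily simplified rule sets (spaCy's real rules are far larger)
SPECIAL_CASES = {"don't": ["do", "n't"], "U.S.A.": ["U.S.A."]}
PREFIX_RE = re.compile(r'^[$("\']')
SUFFIX_RE = re.compile(r'[.,:;!?)"\']$')
INFIX_RE = re.compile(r'[-~]')


def toy_tokenize(text):
    tokens = []
    for substring in text.split():          # whitespace-separated substrings
        suffixes = []                       # suffixes are peeled off right to left
        while substring:
            if substring in SPECIAL_CASES:  # special cases get priority
                tokens.extend(SPECIAL_CASES[substring])
                break
            prefix = PREFIX_RE.search(substring)
            if prefix:                      # consume one prefix, then re-check
                tokens.append(prefix.group())
                substring = substring[prefix.end():]
                continue
            suffix = SUFFIX_RE.search(substring)
            if suffix:                      # consume one suffix, then re-check
                suffixes.insert(0, suffix.group())
                substring = substring[:suffix.start()]
                continue
            # no affix left: split on infixes and keep the infix itself as a token
            parts = re.split(f"({INFIX_RE.pattern})", substring)
            tokens.extend(p for p in parts if p)
            break
        tokens.extend(suffixes)
    return tokens


print(toy_tokenize('He said: "don\'t pay $3000 for a so-so watch!"'))
```

Re-checking the special cases after every consumed prefix or suffix is what lets a string such as `"don't"` lose its surrounding quotes and still be split by the special-case rule.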


 ## <font color = 'dodgerblue'>**Create Doc Object**

When we call nlp on a string, spaCy first tokenizes the text and creates a document object.

In [None]:
# creating a Doc object
doc2 = nlp(text2)

In [None]:
doc2

China's capital is Beijing. 

A Rolex watch costs in the range of $3000.0 - $8000.0 in USA

In [None]:
# check the type of doc
type(doc2)

spacy.tokens.doc.Doc

  ## <font color = 'dodgerblue'>**Accessing text of the tokens**
`token` is an object. We can access the text of a token using its `text` attribute.

In [None]:
[token.text for token in doc2]

['China',
 "'s",
 'capital',
 'is',
 'Beijing',
 '.',
 '\n\n',
 'A',
 'Rolex',
 'watch',
 'costs',
 'in',
 'the',
 'range',
 'of',
 '$',
 '3000.0',
 '-',
 '$',
 '8000.0',
 'in',
 'USA']

  ## <font color = 'dodgerblue'>**Compare spacy tokenizer with white space tokenizer**
We can create a tokenizer using Python's `split()` method. In the last lecture we created a tokenizer by splitting on non-alphanumeric characters. That gave us tokens separated by non-alphanumeric characters, i.e. our tokens contained only alphanumeric characters (letters, numbers and underscores). We will now create a whitespace tokenizer, i.e. it will split the string on white space to create tokens.

We will compare this tokenizer with spaCy's tokenizer.
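
For reference, the two baseline tokenizers mentioned above can be sketched with the standard library alone. The sample string is a shortened variant of `text2`, and the pattern `\w+` is an assumption about how the non-alphanumeric split was implemented earlier in the course.

```python
import re

sample = "China's capital is Beijing. A Rolex watch costs $3000.0"

# Keep runs of word characters (letters, digits, underscore); everything else separates tokens
alnum_tokens = re.findall(r"\w+", sample)

# Split on runs of whitespace only; punctuation stays attached to the words
whitespace_tokens = sample.split()

print(alnum_tokens)
print(whitespace_tokens)
```

Neither baseline is ideal: the first one breaks `$3000.0` into `3000` and `0`, while the second leaves `Beijing.` with its trailing period.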

  ## <font color = 'dodgerblue'>**Example 1 (more and better tokens)**

The whitespace tokenizer splits the text only at white space.

The spaCy tokenizer splits the text into meaningful segments using the tokenization rules of the language that is loaded.

spaCy produces more useful tokens: punctuation, contractions and symbols become separate tokens instead of staying attached to the neighbouring words.

In [None]:
doc1 = nlp(text1)
# Whitespace Tokenizer
print(text1.split())

# spaCy Tokenizer
print([token.text for token in doc1])

["China's", 'capital', 'is', 'Beijing.', 'Beijing', 'is', 'where', "we'll", 'go.', "Let's", 'travel', 'to', 'Hong', 'Kong', 'from', 'Beijing.', 'A', 'friend', 'is', 'pursuing', 'his', 'M.S', 'from', 'Beijing.', 'Beijing', 'is', 'a', 'cool', 'place!!!', ':-P', '<3', '#Awesome', 'A', 'Rolex', 'watch', 'costs', 'in', 'the', 'range', 'of', '$3000.0', '-', '$8000.0', 'in', 'USA', 'and', 'China.', '@tompeter', "I'm", '\\', 'going', 'to', 'buy', 'a', 'Rolexxxxxxxx', 'watch!!!', ':-D', '#happiness', '#rolex', '<3', 'for', 'more', 'info', 'see:', 'http://www.example_beijing.com!', 'Ten', 'is', 'different', 'from', '10']
['China', "'s", 'capital', 'is', 'Beijing', '.', '\n\n', 'Beijing', 'is', 'where', 'we', "'ll", 'go', '.', '\n\n', 'Let', "'s", 'travel', 'to', 'Hong', 'Kong', 'from', 'Beijing', '.', '          \n\n', 'A', 'friend', 'is', 'pursuing', 'his', 'M.S', 'from', 'Beijing', '.', '\n\n', 'Beijing', 'is', 'a', 'cool', 'place', '!', '!', '!', ':-P', '<3', '#', 'Awesome', '          \n\n',

In [None]:
# No. of Tokens in Whitespace Tokenizer
print(len(text1.split()))
# No. of Tokens in spaCy's Tokenizer
print(len([token.text for token in doc1]))

70
100


  ## <font color = 'dodgerblue'>**Example 2**
You can see that spaCy recognizes the % symbol and creates a separate token for it.

In [None]:
text3 = "There is 20% probability of winning a lottery."
doc3 = nlp(text3)
# Whitespace Tokenizer
print(text3.split())

# spaCy Tokenizer
print([token.text for token in doc3])

['There', 'is', '20%', 'probability', 'of', 'winning', 'a', 'lottery.']
['There', 'is', '20', '%', 'probability', 'of', 'winning', 'a', 'lottery', '.']


  ## <font color = 'dodgerblue'>**Example 3**
spaCy's tokenizer recognizes `m` as a unit of distance when it directly follows a number and creates a separate token for it. It will not split an arbitrary combination of digits and letters.


In [None]:
# Another example measuring difference between Whitespace and spaCy tokenizers
text4="I walk 10m everyday."
doc4 = nlp(text4)
# Whitespace Tokenizer
print(text4.split())

# spaCy Tokenizer
print([token.text for token in doc4])

text5 = " What is 10o8iu"
doc5 = nlp(text5)
# spaCy Tokenizer
print([token.text for token in doc5])

['I', 'walk', '10m', 'everyday.']
['I', 'walk', '10', 'm', 'everyday', '.']
[' ', 'What', 'is', '10o8iu']


  ## <font color = 'dodgerblue'>**Example 4**
It takes into account special cases like C++ and U.S.A.

In [None]:
text5="Some good programming languages to know HTML, CSS, JavaScript, C++, and Node.js."
doc5 = nlp(text5)
# Whitespace Tokenizer
print(text5.split())

# spaCy Tokenizer
print([token.text for token in doc5])


['Some', 'good', 'programming', 'languages', 'to', 'know', 'HTML,', 'CSS,', 'JavaScript,', 'C++,', 'and', 'Node.js.']
['Some', 'good', 'programming', 'languages', 'to', 'know', 'HTML', ',', 'CSS', ',', 'JavaScript', ',', 'C++', ',', 'and', 'Node.js', '.']


  ## <font color = 'dodgerblue'>**Text Processing/Cleaning**
spaCy's tokens have attributes that are very useful for text cleaning (see https://spacy.io/api/token).

In [None]:
# let us check other attributes of token class
att_doc1 ={'token': [token for token in doc1],
          'token.idx': [token.idx for token in doc1],
          'token.text': [token.text for token in doc1],
          'token.is_alpha': [token.is_alpha for token in doc1],
          'token.is_punct': [token.is_punct for token in doc1],
          'token.is_space': [token.is_space for token in doc1],
          'token.is_stop': [token.is_stop for token in doc1],
          'token.like_num': [token.like_num for token in doc1],
          'token.is_digit': [token.is_digit for token in doc1],
          'token.like_url': [token.like_url for token in doc1],
          'token.like_email': [token.like_email for token in doc1],
          
          }


In [None]:
pd.DataFrame(att_doc1)


|  | token | token.idx | token.text | token.is_alpha | token.is_punct | token.is_space | token.is_stop | token.like_num | token.is_digit | token.like_url | token.like_email |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | China | 0 | China | True | False | False | False | False | False | False | False |
| 1 | 's | 5 | 's | False | False | False | True | False | False | False | False |
| 2 | capital | 8 | capital | True | False | False | False | False | False | False | False |
| 3 | is | 16 | is | True | False | False | True | False | False | False | False |
| 4 | Beijing | 19 | Beijing | True | False | False | False | False | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | Ten | 437 | Ten | True | False | False | True | True | False | False | False |
| 96 | is | 441 | is | True | False | False | True | False | False | False | False |
| 97 | different | 444 | different | True | False | False | False | False | False | False | False |
| 98 | from | 454 | from | True | False | False | True | False | False | False | False |


  ## <font color = 'dodgerblue'>**Extract only numbers and alphabets**

In [None]:
# extract only alphabets and numbers
[token.text for token in doc1 if  (token.is_alpha or token.like_num)]

['China',
 'capital',
 'is',
 'Beijing',
 'Beijing',
 'is',
 'where',
 'we',
 'go',
 'Let',
 'travel',
 'to',
 'Hong',
 'Kong',
 'from',
 'Beijing',
 'A',
 'friend',
 'is',
 'pursuing',
 'his',
 'from',
 'Beijing',
 'Beijing',
 'is',
 'a',
 'cool',
 'place',
 'Awesome',
 'A',
 'Rolex',
 'watch',
 'costs',
 'in',
 'the',
 'range',
 'of',
 '3000.0',
 '8000.0',
 'in',
 'USA',
 'and',
 'China',
 'I',
 'going',
 'to',
 'buy',
 'a',
 'Rolexxxxxxxx',
 'watch',
 'happiness',
 'rolex',
 'for',
 'more',
 'info',
 'see',
 'Ten',
 'is',
 'different',
 'from',
 '10']

  ## <font color = 'dodgerblue'>**Remove punctuation**

In [None]:
print(text2)

China's capital is Beijing. 

A Rolex watch costs in the range of $3000.0 - $8000.0 in USA


In [None]:
# remove punctuation
" ".join([token.text for token in doc2 if  not token.is_punct])

"China 's capital is Beijing \n\n A Rolex watch costs in the range of $ 3000.0 $ 8000.0 in USA"

  ## <font color = 'dodgerblue'>**Extract/Remove URLs**

In [None]:
# extract urls
text7 = 'my urls are https://colab.research.google.com/ and utdallas.edu '
doc7 = nlp(text7)
[token.text for token in doc7 if  token.like_url]

['https://colab.research.google.com/', 'utdallas.edu']

In [None]:
# remove urls
" ".join([token.text for token in doc7 if not token.like_url])

'my urls are and'

  ## <font color = 'dodgerblue'>**Extract/Remove emails**

In [None]:
# extract emails 
text8 = 'my email is xyz@utdallas.edu or xyz@gmail.com'
doc8 = nlp(text8)
[token.text for token in doc8 if  token.like_email]

['xyz@utdallas.edu', 'xyz@gmail.com']

In [None]:
" ".join([token.text for token in doc8 if not token.like_email])

'my email is or'

  ## <font color = 'dodgerblue'>**Stopwords**

- Stop words are the most commonly used words in a language, for example 'the', 'a', 'in', 'an', etc.
- Stop words add little contextual meaning to the text and are therefore sometimes removed.
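
A minimal sketch of why stop words are removed, using only the standard library. The `STOP_WORDS` set here is a tiny hypothetical list for illustration, not spaCy's default list, and the sentence is made up.

```python
from collections import Counter

# Hypothetical miniature stop word list (real lists have several hundred entries)
STOP_WORDS = {"the", "a", "an", "in", "of", "is", "and"}

sentence = "the cat sat in the hat and the cat slept in the sun"
tokens = sentence.split()

# Before filtering, the most frequent token is a stop word
print(Counter(tokens).most_common(3))

# After filtering, the most frequent token says something about the content
content_tokens = [t for t in tokens if t.lower() not in STOP_WORDS]
print(Counter(content_tokens).most_common(1))
```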

In [None]:
#  The following paragraph has been taken from https://en.wikipedia.org/wiki/Regular_expression
text9 = """A regular expression (shortened as regex or regexp;[1] also referred to as rational expression[2][3]) is a sequence of characters that define a search pattern. Usually such patterns are used by string-searching algorithms for "find" or "find and replace" operations on strings, or for input validation. It is a technique developed in theoretical computer science and formal language theory.
The concept arose in the 1950s when the American mathematician Stephen Cole Kleene formalized the description of a regular language. The concept came into common use with Unix text-processing utilities. Different syntaxes for writing regular expressions have existed since the 1980s, one being the POSIX standard and another, widely used, being the Perl syntax.
Regular expressions are used in search engines, search and replace dialogs of word processors and text editors, in text processing utilities such as sed and AWK and in lexical analysis. Many programming languages provide regex capabilities either built-in or via libraries."""

In [None]:
doc9 = nlp(text9)

  ## <font color = 'dodgerblue'>**Understanding Stopwords**

In [None]:
# create tokens using spacy
tokens = [token.text for token in doc9 if not token.is_punct]

In [None]:
from collections import Counter

In [None]:
# create a Counter object from the tokens obtained
# Counter is a dict subclass that is used to count hashable objects
# it stores the elements as keys and their counts as the corresponding values

counter = Counter(tokens)

In [None]:
# print counter
print(counter)

Counter({'and': 7, 'in': 6, 'the': 6, 'or': 4, 'a': 4, 'regular': 3, 'as': 3, 'of': 3, 'search': 3, 'used': 3, 'for': 3, 'text': 3, 'regex': 2, 'is': 2, 'such': 2, 'are': 2, 'find': 2, 'replace': 2, 'language': 2, '\n': 2, 'The': 2, 'concept': 2, 'processing': 2, 'utilities': 2, 'expressions': 2, 'being': 2, 'A': 1, 'expression': 1, 'shortened': 1, 'regexp;[1': 1, 'also': 1, 'referred': 1, 'to': 1, 'rational': 1, 'expression[2][3': 1, 'sequence': 1, 'characters': 1, 'that': 1, 'define': 1, 'pattern': 1, 'Usually': 1, 'patterns': 1, 'by': 1, 'string': 1, 'searching': 1, 'algorithms': 1, 'operations': 1, 'on': 1, 'strings': 1, 'input': 1, 'validation': 1, 'It': 1, 'technique': 1, 'developed': 1, 'theoretical': 1, 'computer': 1, 'science': 1, 'formal': 1, 'theory': 1, 'arose': 1, '1950s': 1, 'when': 1, 'American': 1, 'mathematician': 1, 'Stephen': 1, 'Cole': 1, 'Kleene': 1, 'formalized': 1, 'description': 1, 'came': 1, 'into': 1, 'common': 1, 'use': 1, 'with': 1, 'Unix': 1, 'Different': 1

In [None]:
# Counter provides methods, such as most_common(), that summarize the counts of the elements.
counter.most_common(10)

[('and', 7),
 ('in', 6),
 ('the', 6),
 ('or', 4),
 ('a', 4),
 ('regular', 3),
 ('as', 3),
 ('of', 3),
 ('search', 3),
 ('used', 3)]

  ## <font color = 'dodgerblue'>**Stop Words with Spacy**
- Each model in spaCy has a default list of stop words.
- You can inspect it with `nlp.Defaults.stop_words`, where `nlp` is the loaded model.
- You can also check whether a particular word is a stop word.
- Further, you can modify the default list of stop words.



In [None]:
# default stop words from the loaded model in spaCy
# the stop word list differs from library to library
print(nlp.Defaults.stop_words)

{'thru', 'part', 'indeed', 'out', 'alone', 'also', 'even', 'many', 'noone', 'thus', 'everywhere', 'most', 'became', 'cannot', 'whoever', 'there', 'take', 'therein', 'between', 'during', 'become', 'make', 'itself', 'unless', 'must', 'anyway', 'amongst', 'ten', 'again', '’re', 'since', 'nothing', 'hundred', 'of', 'wherein', '’s', 'by', 'would', 'still', 'something', 'among', 'how', 'doing', 'along', 'none', 'back', 'neither', 'why', 'just', '‘s', 'hers', 'which', 'whether', 'besides', 'whereas', 'rather', 'what', 'few', 'however', 'had', 'anyone', 'n’t', 'thereupon', 'though', 'both', 'afterwards', 'ever', 'but', 'him', 'fifteen', 'whole', "'s", 'enough', 'give', 'beforehand', 'wherever', 'hereafter', 'been', 'whose', 'eight', 'to', 'nor', 'when', 'formerly', 'yet', 'about', 'someone', 'such', 'my', 'empty', 'name', 'eleven', 'below', 'while', 'into', "'d", 'go', 'at', 'upon', 'one', 'his', 'five', 'either', 'will', 'yours', 'after', 'beside', 'really', 'for', 'hereupon', 'perhaps', 'wit

In [None]:
len(nlp.Defaults.stop_words)

326

In [None]:
# To check whether word regular is in default stop words
'regular' in nlp.Defaults.stop_words

False

In [None]:
# modify spacy's default stop words; 
# add regular as stopwords
nlp.Defaults.stop_words.add('regular')
'regular' in nlp.Defaults.stop_words

True

In [None]:
# now let us modify the default words again 
# remove regular from default stop words
nlp.Defaults.stop_words.remove('regular')
'regular' in nlp.Defaults.stop_words

False

  ## <font color = 'dodgerblue'>**Remove stop words from text**

In [None]:
tokens = [ token.text for token in doc9  if not (token.is_stop or token.is_punct)]
text9_clean = " ".join(tokens)

In [None]:
print(text9_clean)

regular expression shortened regex regexp;[1 referred rational expression[2][3 sequence characters define search pattern Usually patterns string searching algorithms find find replace operations strings input validation technique developed theoretical computer science formal language theory 
 concept arose 1950s American mathematician Stephen Cole Kleene formalized description regular language concept came common use Unix text processing utilities Different syntaxes writing regular expressions existed 1980s POSIX standard widely Perl syntax 
 Regular expressions search engines search replace dialogs word processors text editors text processing utilities sed AWK lexical analysis programming languages provide regex capabilities built libraries


In [None]:
counter = Counter(tokens)

In [None]:
counter.most_common(10)

[('regular', 3),
 ('search', 3),
 ('text', 3),
 ('regex', 2),
 ('find', 2),
 ('replace', 2),
 ('language', 2),
 ('\n', 2),
 ('concept', 2),
 ('processing', 2)]

  ## <font color = 'dodgerblue'> **Lemmatization**

<img src ="https://drive.google.com/uc?export=view&id=1zk5L9vyg6LlTW8nCZh-YxBOyU-IinShN" width = 500>

image source: https://spacy.io/models

- For lemmatization we need POS tags, and for POS tags we need `['tok2vec', 'tagger', 'attribute_ruler']`.
- Hence for lemmatization we can disable `['ner', 'parser']`.
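
To see why lemmatization depends on POS tags, consider a toy lookup keyed on both the word and its part of speech. The table and the `toy_lemma` helper are hypothetical illustrations, not spaCy's lemmatizer rules.

```python
# Hypothetical (word, POS) -> lemma table; a real lemmatizer uses far larger rule sets
LEMMA_TABLE = {
    ("meeting", "NOUN"): "meeting",
    ("meeting", "VERB"): "meet",
    ("characters", "NOUN"): "character",
    ("referred", "VERB"): "refer",
    ("is", "AUX"): "be",
}


def toy_lemma(word, pos):
    # Fall back to the lowercased word when no rule is known
    return LEMMA_TABLE.get((word.lower(), pos), word.lower())


print(toy_lemma("meeting", "NOUN"))  # the noun keeps its form
print(toy_lemma("meeting", "VERB"))  # the verb is reduced to its base form
```

The same surface form maps to different lemmas depending on its tag, so the components that produce the tags have to stay enabled.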

In [None]:
print(nlp.pipe_names)

[]


In [None]:
disabled.restore()

In [None]:
print(nlp.pipe_names)

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']


In [None]:
disabled = nlp.select_pipes(disable= ['ner', 'parser'])

In [None]:
print(nlp.pipe_names)

['tok2vec', 'tagger', 'attribute_ruler', 'lemmatizer']


In [None]:
# We will look at lemmatizing on a small part of the text
text10 =" A regular expression also referred to as rational expression is a sequence of characters that define a search pattern"

In [None]:
doc10=nlp(text10)

In [None]:
# Lemmatizing the text
lemmas= [token.lemma_ for token in doc10]

In [None]:
print(f'lemmatized : {" ".join(lemmas)}')
print(f'original   : {text10}')

lemmatized :   a regular expression also refer to as rational expression be a sequence of character that define a search pattern
original   :  A regular expression also referred to as rational expression is a sequence of characters that define a search pattern


# <font color = 'dodgerblue'>**Sentence tokenization using spacy**

In [None]:
disabled.restore()

In [None]:
doc2 = nlp(text2)

In [None]:
# We use doc.sents to tokenize sentences
sentences = [sent.text for sent in doc2.sents]
sentences

["China's capital is Beijing.",
 '\n\n',
 'A Rolex watch costs in the range of $3000.0 - $8000.0 in USA']
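
Note that the blank line between the two sentences comes back as a "sentence" of its own. If that is not wanted, whitespace-only spans can be filtered out. The sketch below works on the list of strings shown above rather than on a `Doc`, so it runs without a model.

```python
# Sentence strings as returned above by [sent.text for sent in doc2.sents]
sentences = [
    "China's capital is Beijing.",
    "\n\n",
    "A Rolex watch costs in the range of $3000.0 - $8000.0 in USA",
]

# Drop spans that contain only whitespace and trim the rest
clean_sentences = [s.strip() for s in sentences if s.strip()]
print(clean_sentences)
```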

# <font color = 'dodgerblue'>**Named Entity Recognition (NER)**

In [None]:
disabled = nlp.select_pipes(disable= ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer'])

In [None]:
print(nlp.pipe_names)

['ner']


In [None]:
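# re-create the Doc so that the ner component actually runs on text1
# (the earlier doc1 was built while all pipeline components were disabled)
doc1 = nlp(text1)
print(f'{"Entity":<20}: Tag\n')
for entity in doc1.ents:
  print(f'{entity.text:<20} : {entity.label_}')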

Entity              : Tag

China                : GPE
Beijing              : GPE
Beijing              : GPE
Hong Kong            : GPE
Beijing              : GPE
Beijing              : GPE
Beijing              : GPE
Rolex                : ORG
$3000.0 - $          : MONEY
8000.0               : MONEY
USA                  : GPE
China                : GPE
Rolexxxxxxxx         : PRODUCT
Ten                  : CARDINAL
10                   : CARDINAL


In [None]:
# You can use displacy to visualize the named entities
from spacy import displacy

In [None]:
displacy.render(doc1,style='ent',jupyter=True)

In [None]:
# text taken from https://oilprice.com/Energy/Oil-Prices/Oil-Rally-Continues-On-Bright-US-Economic-Data.html on June 23, 2021.
# Defining String
text11 = """
Oil prices rose early on Wednesday, driven by brighter economic prospects for the United States and continued recovery in oil demand in America and elsewhere in the world.
As of 9:04 a.m. EDT on Wednesday, ahead of the weekly inventory report by the U.S. Energy Information Administration (EIA), WTI Crude was up 1.04 percent at $73.61, 
and Brent Crude traded at $75.54, up by 0.99 percent on the day.Prices found support late on Tuesday after the American Petroleum Institute (API) 
reported a draw in crude oil inventories of 7.199 million barrels for the week ending June 18. If the EIA confirms a draw today, it would be the fifth consecutive week of crude inventory draws in the United States, where demand for fuels continues to grow.
"""

In [None]:
# Creating doc object
doc11 = nlp(text11)

In [None]:
# doc.ents gives us the named entities
# We can use entity.text and entity.label_ to get the entities and their tags
print(f'{"Entity":<45} : Tag\n')
for entity in doc11.ents:
  print(f'{entity.text:<45} : {entity.label_}')

Entity                                        : Tag

Wednesday                                     : DATE
the United States                             : GPE
America                                       : GPE
9:04 a.m. EDT                                 : TIME
Wednesday                                     : DATE
weekly                                        : DATE
the U.S. Energy Information Administration    : ORG
1.04 percent                                  : PERCENT
73.61                                         : MONEY
Brent Crude                                   : ORG
75.54                                         : MONEY
0.99 percent                                  : PERCENT
the day                                       : DATE
Tuesday                                       : DATE
the American Petroleum Institute              : ORG
API                                           : ORG
7.199 million barrels                         : QUANTITY
the week ending June 18                 

In [None]:
# You can use displacy to visualize the named entities
displacy.render(doc11,style='ent',jupyter=True)

# <font color = 'dodgerblue'>**Part of Speech Tagging**

<img src ="https://drive.google.com/uc?export=view&id=1zk5L9vyg6LlTW8nCZh-YxBOyU-IinShN" width = 500>

image source: https://spacy.io/models

- For POS we need `['tok2vec', 'tagger', 'attribute_ruler']`.
- The POS tags come from rules in the `attribute_ruler` component that map `token.tag` to `token.pos` (see the mapping at https://spacy.io/api/annotation).
- If the dependency parse is available, there are more specific rules it can apply related to AUX and VERB.
- The mapping is hard to do perfectly because the `token.tag` values (PTB tags) that come from the tagger do not make an AUX/VERB distinction at all.
- Hence for POS we can disable `['lemmatizer', 'ner']`.

source: https://stackoverflow.com/questions/69313960/does-spacys-version3-1-pos-tagger-depends-on-parser
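
The idea of mapping fine-grained tags to coarse POS labels can be sketched as a dictionary lookup. The subset below is a hypothetical illustration built from tag/POS pairs that appear in this notebook's output, and the word-list rule for auxiliaries is a crude assumption standing in for the more specific rules mentioned above; it is not how the `attribute_ruler` is implemented.

```python
# A few Penn Treebank tags and the coarse POS label they usually map to
TAG_TO_POS = {"DT": "DET", "JJ": "ADJ", "NN": "NOUN", "NNS": "NOUN",
              "CC": "CCONJ", "VBN": "VERB", "VBD": "VERB", "VBZ": "VERB"}

# Hypothetical extra rule: the tag alone cannot separate AUX from VERB
AUX_WORDS = {"is", "are", "was", "were", "be", "been"}


def toy_pos(text, tag):
    if tag.startswith("VB") and text.lower() in AUX_WORDS:
        return "AUX"
    return TAG_TO_POS.get(tag, "X")  # unknown tags fall back to X (other)


for text, tag in [("A", "DT"), ("regular", "JJ"), ("is", "VBZ"), ("runs", "VBZ")]:
    print(f"{text:<10} {tag:<5} {toy_pos(text, tag)}")
```

Both `is` and `runs` carry the tag `VBZ`, yet they need different coarse labels, which is why the tag-to-POS mapping cannot be a pure lookup.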


In [None]:
disabled.restore()
print(nlp.pipe_names)

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']


In [None]:
disabled = nlp.select_pipes(disable= ['ner','lemmatizer'])
print(nlp.pipe_names)

['tok2vec', 'tagger', 'parser', 'attribute_ruler']


In [None]:
# Get Part of Speech (POS) tags 
# print token text, pos and tag
doc9 = nlp(text9)
for token in doc9:
    print(f'{token.text:<15} -> {token.pos_:<10} -> {token.tag_:<10}')

A               -> DET        -> DT        
regular         -> ADJ        -> JJ        
expression      -> NOUN       -> NN        
(               -> PUNCT      -> -LRB-     
shortened       -> VERB       -> VBN       
as              -> ADP        -> IN        
regex           -> NOUN       -> NNS       
or              -> CCONJ      -> CC        
regexp;[1       -> NOUN       -> NN        
]               -> PUNCT      -> -RRB-     
also            -> ADV        -> RB        
referred        -> VERB       -> VBD       
to              -> ADP        -> IN        
as              -> ADV        -> RB        
rational        -> ADJ        -> JJ        
expression[2][3 -> NOUN       -> NN        
]               -> PUNCT      -> -RRB-     
)               -> PUNCT      -> -RRB-     
is              -> AUX        -> VBZ       
a               -> DET        -> DT        
sequence        -> NOUN       -> NN        
of              -> ADP        -> IN        
characters      -> NOUN       ->

The possible values of the `pos_` attribute along with their meanings:

* ADJ: adjective, e.g. old, green, first, etc.
* ADP: adposition, e.g. in, to, during, etc.
* ADV: adverb, e.g. very, tomorrow, down, where, there, etc.
* AUX: auxiliary, e.g. is, has (done), will (do), should (do), etc.
* CONJ: conjunction, e.g. and, or, but, etc.
* CCONJ: coordinating conjunction, e.g. and, or, but, etc.
* DET: determiner, e.g. a, an, the, etc.
* INTJ: interjection, e.g. psst, ouch, bravo, hello, etc.
* NOUN: noun, e.g. girl, cat, tree, air, etc.
* NUM: numeral, e.g. 1, 2017, one, seventy-seven, IV, MMXIV, etc.
* PART: particle, e.g. ’s, not, etc.
* PRON: pronoun, e.g. I, you, he, she, myself, themselves, somebody, etc.
* PROPN: proper noun, e.g. Mary, John, Chicago, NATO, etc.
* PUNCT: punctuation, e.g. ., (, ), ?, etc.
* SCONJ: subordinating conjunction, e.g. if, while, that, etc.
* SYM: symbol, e.g. $, %, §, ©, +, −, ×, ÷, =, :), 😝, etc.
* VERB: verb, e.g. run, runs, running, ate, eating, etc.
* X: other, e.g. sfpksdpsxmsa (some random text).
* SPACE: space.

In [None]:
# We can get any of the above Part of Speech
# For a list of the Parts of Speech click the link here
# https://spacy.io/usage/linguistic-features
# Let us get Verbs , Nouns and Proper Nouns

Verbs = [token.text for token in doc9 if(token.pos_=='VERB')]
Nouns = [token.text for token in doc9 if(token.pos_=='NOUN')]
Proper_Nouns = [token.text for token in doc9 if(token.pos_=='PROPN')]

In [None]:
# Print Verbs
for verb in Verbs[:10]:
  print(verb)

shortened
referred
define
used
searching
find
find
replace
developed
arose


In [None]:
# Print Nouns
for noun in Nouns[:10]:
  print(noun)

expression
regex
regexp;[1
expression[2][3
sequence
characters
search
pattern
patterns
string


# <font color = 'dodgerblue'>**Stemming**

In [None]:
!pip install -U nltk

In [None]:
# Import PorterStemmer from nltk.stem
from nltk.stem import PorterStemmer

## <font color = 'dodgerblue'>**Example**

In [None]:
# Create an object of class PorterStemmer
stemmer = PorterStemmer()

words = ['connection', 'connected', 'connecter', 'connecting', 'connect']

for w in words:
  print(w, " : ", stemmer.stem(w))

connection  :  connect
connected  :  connect
connecter  :  connect
connecting  :  connect
connect  :  connect
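
Stemming chops off suffixes with rules instead of looking words up in a vocabulary. The sketch below is a deliberately naive suffix stripper, not the Porter algorithm; the suffix list and the minimum stem length are assumptions chosen for illustration.

```python
# Candidate suffixes, tried in order; a word loses at most one of them
SUFFIXES = ("ion", "ing", "ed", "er", "s")
MIN_STEM_LENGTH = 3


def naive_stem(word):
    for suffix in SUFFIXES:
        stem = word[: -len(suffix)]
        # Only strip when enough of the word is left to be a plausible stem
        if word.endswith(suffix) and len(stem) >= MIN_STEM_LENGTH:
            return stem
    return word


for w in ["connection", "connected", "connecting", "connects", "connect"]:
    print(w, ":", naive_stem(w))
```

Because the rules know nothing about the vocabulary, a stemmer can return strings that are not real words, which is the main difference from lemmatization.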


# <font color = 'dodgerblue'>**Remove HTML Tags**

In [None]:
text3="""I just can't understand the negative comments about this film. Yes it is a typical
boy-meets-girl romance but it is done with such flair and polish that the time just flies by. 
Henstridge (talk about winning the gene-pool lottery!) is as magnetic and alluring as ever 
(who says the golden age of cinema is dead?) and Vartan holds his own.<br /><br />There is 
simmering chemistry between the two leads; the film is most alive when they share a scene - 
lots! It is done so well that you find yourself willing them to get together...<br /><br />Ignore 
the negative comments - if you are feeling a bit blue, watch this flick, you will feel so much 
better. If you are already happy, then you will be euphoric.<br /><br />(PS: I am 33, Male, 
from the UK and a hopeless romantic still searching for his Princess...)"""

In [None]:
from bs4 import BeautifulSoup
import re

In [None]:
soup = BeautifulSoup(text3, "html.parser")

In [None]:
cleaned_text3 = soup.get_text()

In [None]:
cleaned_text3

"I just can't understand the negative comments about this film. Yes it is a typical\nboy-meets-girl romance but it is done with such flair and polish that the time just flies by. \nHenstridge (talk about winning the gene-pool lottery!) is as magnetic and alluring as ever \n(who says the golden age of cinema is dead?) and Vartan holds his own.There is \nsimmering chemistry between the two leads; the film is most alive when they share a scene - \nlots! It is done so well that you find yourself willing them to get together...Ignore \nthe negative comments - if you are feeling a bit blue, watch this flick, you will feel so much \nbetter. If you are already happy, then you will be euphoric.(PS: I am 33, Male, \nfrom the UK and a hopeless romantic still searching for his Princess...)"

In [None]:
def basic_clean(text):
    '''
    Remove HTML tags from the text and replace
    newlines and carriage returns with spaces.
    '''
    # parse the text once and reuse the result
    soup = BeautifulSoup(text, "html.parser")
    # strip the markup only if the parser found at least one tag
    if soup.find() is not None:
        text = soup.get_text()
    return re.sub(r'[\n\r]', ' ', text)
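
If BeautifulSoup is not available, a similar effect can be approximated with the standard library. This is a swapped-in technique, not the method used in this notebook: `TextExtractor` and `strip_tags` are hypothetical names, and the behaviour on malformed markup may differ from BeautifulSoup.

```python
from html.parser import HTMLParser
import re


class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags(text):
    parser = TextExtractor()
    parser.feed(text)
    parser.close()  # flush any text that is still buffered
    # Mirror basic_clean: replace newlines and carriage returns with spaces
    return re.sub(r"[\n\r]", " ", "".join(parser.chunks))


print(strip_tags("holds his own.<br /><br />There is\nsimmering chemistry"))
```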

In [None]:
cleaned_text = basic_clean(text=text3)

In [None]:
cleaned_text

"I just can't understand the negative comments about this film. Yes it is a typical boy-meets-girl romance but it is done with such flair and polish that the time just flies by.  Henstridge (talk about winning the gene-pool lottery!) is as magnetic and alluring as ever  (who says the golden age of cinema is dead?) and Vartan holds his own.There is  simmering chemistry between the two leads; the film is most alive when they share a scene -  lots! It is done so well that you find yourself willing them to get together...Ignore  the negative comments - if you are feeling a bit blue, watch this flick, you will feel so much  better. If you are already happy, then you will be euphoric.(PS: I am 33, Male,  from the UK and a hopeless romantic still searching for his Princess...)"