# Practicing the spaCy library for NLP. This notebook also serves as my customized reference workbook for the library.

In [94]:
# Installation

#pip install spacy -> Already installed


In [3]:
# importing the library
import spacy

#### In the next cells, I will create a blank 'nlp' object (spaCy is an object-based library) and apply word tokenization to a sample text.

In [11]:
#Creating a blank English object; its pipeline contains only the tokenizer
nlp = spacy.blank("en")
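
A quick way to confirm that a blank pipeline holds only the tokenizer is to inspect its component list. This is a small sketch; the name `blank_nlp` is my own and is used only to avoid overwriting `nlp`.

```python
import spacy

# A blank English pipeline: language-specific tokenizer rules, no trained components.
blank_nlp = spacy.blank("en")

# pipe_names lists the processing components that run after the tokenizer.
print(blank_nlp.pipe_names)  # an empty list for a blank pipeline
```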

In [300]:
#This will be my text sample in a tuple for word tokenization
text = (
"Tesla's", '''gross cost of operating lease vehicles in FY2021 Q1 was $4.85 billion.
"BMW's", gross cost of operating vehicles in FY2021 S1 was $8''', "billion."
)
test = ' '.join(text) # -> convert to a string because the nlp object expects a string.
#Creating a doc variable and printing word tokens using a Python 'for' loop.
doca = nlp(test)
for token in doca:
    print(token)

Tesla
's
gross
cost
of
operating
lease
vehicles
in
FY2021
Q1
was
$
4.85
billion
.


"
BMW
's
"
,
gross
cost
of
operating
vehicles
in
FY2021
S1
was
$
8
billion
.


#### The point of using a library like spaCy for word tokenization is to handle unstructured text. Words stored in a list are already structured: a plain Python 'for' loop visits them one by one in list order. A text sample, by contrast, is a run of words with no reliable separator other than spaces, and spaCy splits (tokenizes) it into words, punctuation, and symbols for us. Organizing the words of a large text into a list by hand before iterating would be tedious work.
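
To make the contrast concrete, here is a minimal sketch comparing plain whitespace splitting with the spaCy tokenizer on one short sentence. The sentence and the names `sample` and `tokenizer_nlp` are my own, chosen only for illustration.

```python
import spacy

sample = "Tesla's cost was $4.85 billion."

# Plain whitespace splitting keeps punctuation and symbols glued to the words.
whitespace_pieces = sample.split()

# The spaCy tokenizer separates the possessive, the currency symbol, and the final period.
tokenizer_nlp = spacy.blank("en")
spacy_tokens = [token.text for token in tokenizer_nlp(sample)]

print(whitespace_pieces)
print(spacy_tokens)
```

The whitespace version returns five pieces such as `$4.85` and `billion.`, while the tokenizer returns eight separate tokens.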

In [98]:
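#This will be my text sample for word tokenization
text = '''
Tesla's gross cost of operating lease vehicles in FY2021 Q1 was $4.85 billion.
BMW's gross cost of operating vehicles in FY2021 S1 was $8 billion.
'''
doc = nlp(text) # -> the plain string is passed to the nlp object directly
for token in doc:
    print(token)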





Tesla
's
gross
cost
of
operating
lease
vehicles
in
FY2021
Q1
was
$
4.85
billion
.


BMW
's
gross
cost
of
operating
vehicles
in
FY2021
S1
was
$
8
billion
.




# Remember that word tokenization in NLP is a preprocessing step: the tokens are later converted to numbers (count vectors, TF-IDF weights, embeddings and other encodings) so that ML models can work with them. (I have not fully learned these techniques yet.)
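
As a rough illustration of "words to numbers", the sketch below builds a tiny vocabulary and a count vector using only the standard library. It is the simplest possible encoding, not TF-IDF or an embedding, and the token list is a made-up example.

```python
from collections import Counter

# Tokens as a tokenizer might return them (a made-up, lower-cased sample).
tokens = ["gross", "cost", "of", "operating", "lease", "vehicles", "gross", "cost"]

# Build a vocabulary: each distinct token gets an integer index.
vocabulary = {word: index for index, word in enumerate(sorted(set(tokens)))}

# Count occurrences and lay them out in vocabulary order.
counts = Counter(tokens)
count_vector = [counts[word] for word in vocabulary]

print(vocabulary)
print(count_vector)
```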

## spaCy tokens have many attributes that can be used to pick out words, digits, currency symbols, etc. from a text.  
#### In the next cells, I will use the spaCy object to extract URLs and emails from text samples that contain them.
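
As a quick illustration of such attributes, the sketch below uses `like_num` and `is_currency` on a blank pipeline. The name `attr_nlp` and the shortened sentence are my own. Note that in English, number words such as "billion" may also count as number-like.

```python
import spacy

attr_nlp = spacy.blank("en")  # hypothetical name; lexical attributes need no trained model
attr_doc = attr_nlp("Tesla's gross cost in FY2021 Q1 was $4.85 billion.")

# like_num flags tokens that look like numbers; is_currency flags currency symbols.
numbers = [token.text for token in attr_doc if token.like_num]
currency_symbols = [token.text for token in attr_doc if token.is_currency]

print(numbers)
print(currency_symbols)
```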

Extracting URLs from a text sample using the spaCy nlp object

In [126]:
#Text sample:
text1 = '''
Follow our leader Elon musk on twitter here: https://twitter.com/elonmusk, more information 
on Tesla's products can be found at https://www.tesla.com/. Also here are leading influencers 
for tesla related news,
https://twitter.com/teslarati
https://twitter.com/dummy_tesla
https://twitter.com/dummy_2_tesla
'''

doc = nlp(text1) # -> convert the text into a document
print(type(doc)) #  -> Shows that the text is now a Doc. Should print <class 'spacy.tokens.doc.Doc'>

url = [] # -> An empty list to collect the extracted URLs
for token in doc:
    if token.like_url: # -> like_url is a token attribute that is True when the token looks like a URL.
        url.append(token)


url

<class 'spacy.tokens.doc.Doc'>


[https://twitter.com/elonmusk,
 https://www.tesla.com/.,
 https://twitter.com/teslarati,
 https://twitter.com/dummy_tesla,
 https://twitter.com/dummy_2_tesla]
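
One extracted item above, `https://www.tesla.com/.`, still carries the sentence-ending period. A possible cleanup, sketched here on plain strings (in the notebook, `token.text` would supply them), strips trailing punctuation. The choice of characters to strip is my assumption and may be wrong for URLs that legitimately end in one of them.

```python
# URL strings as extracted above; one still carries the sentence-ending period.
raw_urls = [
    "https://twitter.com/elonmusk",
    "https://www.tesla.com/.",
    "https://twitter.com/teslarati",
]

# Assumption: a trailing '.', ',' or ';' is sentence punctuation, not part of the URL.
clean_urls = [raw.rstrip(".,;") for raw in raw_urls]

print(clean_urls)
```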

Extracting emails from a text sample using the nlp Spacy object

In [128]:
#Text sample:
text2 = '''
1. **John Doe**  
   - **Email:** john.doe@example.com  
   - **Age:** 34  
   - **Occupation:** Software Developer  
   - **Location:** 123 Maple Street, Springfield  
   - **Phone:** (555) 123-4567  

2. **Jane Smith**  
   - **Email:** jane.smith@example.com  
   - **Age:** 29  
   - **Occupation:** Marketing Specialist  
   - **Location:** 456 Oak Avenue, Rivertown  
   - **Phone:** (555) 987-6543  

3. **Alice Johnson**  
   - **Email:** alice.johnson@example.com  
   - **Age:** 42  
   - **Occupation:** Project Manager  
   - **Location:** 789 Pine Road, Lakeview  
   - **Phone:** (555) 234-5678  

4. **Robert Brown**  
   - **Email:** robert.brown@example.com  
   - **Age:** 37  
   - **Occupation:** Graphic Designer  
   - **Location:** 321 Cedar Lane, Hilltop  
   - **Phone:** (555) 345-6789  

5. **Emily Davis**  
   - **Email:** emily.davis@example.com  
   - **Age:** 31  
   - **Occupation:** Financial Analyst  
   - **Location:** 654 Birch Boulevard, Greenwood  
   - **Phone:** (555) 456-7890
'''

doc = nlp(text2) # -> convert the text into a document
print(type(doc)) #  -> Shows that the text is now a Doc. It should print <class 'spacy.tokens.doc.Doc'>

emails = [] # -> An empty list to collect the extracted emails
for token in doc:
    if token.like_email: # -> like_email is a token attribute that is True when the token looks like an email address.
        emails.append(token)


emails

<class 'spacy.tokens.doc.Doc'>


[john.doe@example.com,
 jane.smith@example.com,
 alice.johnson@example.com,
 robert.brown@example.com,
 emily.davis@example.com]
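
As a cross-check on the result above, the same addresses can be pulled out with a regular expression. This is a different technique from `like_email`, and the pattern below is a deliberately simplified assumption that will not cover every valid address.

```python
import re

sample = """
- **Email:** john.doe@example.com
- **Email:** jane.smith@example.com
"""

# Simplified pattern: local part, '@', then a domain with at least one dot.
email_pattern = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
found = email_pattern.findall(sample)

print(found)
```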

# Sentence tokenization or segmentation

#### spaCy is also a sentence tokenizer. This means it takes a text/doc and smartly splits it into sentences. This differs from the usual delimiter-based splitting in Python or pandas (str.split), where a delimiter is specified. With a delimiter-based split, an abbreviation such as 'U.S.' is broken apart if the delimiter is a period (.), but spaCy is smart enough to know that such periods do not necessarily end a sentence.
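
The problem with delimiter-based splitting can be seen with plain Python. The two-sentence sample below is my own shortened version of the kind of text used in this section.

```python
sample = "Our cash is on deposit in the U.S. These deposits exceed insured limits."

# Splitting on every period also cuts inside the abbreviation "U.S."
pieces = [piece.strip() for piece in sample.split(".") if piece.strip()]

print(pieces)
print(len(pieces))  # three pieces for a text that has only two sentences
```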

#### In the next cells, I'll use Spacy to apply a sentence tokenization on text samples.

In [132]:
# Text sample for sentence tokenization
text3 = '''
Concentration of Risk: Credit Risk
Financial instruments that potentially subject us to a concentration of credit risk consist of cash, cash equivalents, marketable securities,
restricted cash, accounts receivable, convertible note hedges, and interest rate swaps. Our cash balances are primarily invested in money market funds
or on deposit at high credit quality financial institutions in the U.S. These deposits are typically in excess of insured limits. As of September 30, 2021
and December 31, 2020, no entity represented 10% or more of our total accounts receivable balance. The risk of concentration for our convertible note
hedges and interest rate swaps is mitigated by transacting with several highly-rated multinational banks.
Concentration of Risk: Supply Risk
We are dependent on our suppliers, including single source suppliers, and the inability of these suppliers to deliver necessary components of our
products in a timely manner at prices, quality levels and volumes acceptable to us, or our inability to efficiently manage these components from these
suppliers, could have a material adverse effect on our business, prospects, financial condition and operating results.
'''

#### The nlp object we created earlier in this notebook had nothing in its pipeline except word tokenization. To use it with the '.sents' property for sentence tokenization, we should have loaded it from 'en_core_web_sm' in this way: 
* #### nlp = spacy.load("en_core_web_sm"). This loads the object with a full trained pipeline (tagger, dependency parser, named entity recognizer, etc.); the parser is the component that sets the sentence boundaries.

#### Alternatively, we can use the 'add_pipe' method to add a component such as the 'sentencizer' (nlp.add_pipe("sentencizer")), a sentence recognizer, or a dependency parser to the blank object, or set the sentence boundaries ourselves, before using it for sentence tokenization. The first approach is my preferred method, because the rule-based 'sentencizer' splits only on punctuation and isn't smart enough for my needs.

In [141]:
#nlp.add_pipe("sentencizer")
#nlp.pipe_names # -> To check the contents of the pipe
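
For reference, here is a minimal sketch of the `add_pipe` route on a separate blank object, assuming spaCy v3 (where components are added by string name). The name `rule_nlp` and the two-sentence sample are my own.

```python
import spacy

rule_nlp = spacy.blank("en")      # hypothetical name, so the notebook's nlp is left alone
rule_nlp.add_pipe("sentencizer")  # rule-based splitter that looks at punctuation only

rule_doc = rule_nlp(
    "These deposits are typically in excess of insured limits. "
    "No entity represented 10% or more of our balance."
)
sentences = [sentence.text for sentence in rule_doc.sents]

print(rule_nlp.pipe_names)
print(sentences)
```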

In [134]:
nlp = spacy.load("en_core_web_sm") # -> Recreating the 'nlp' object.

doc = nlp(text3)
for sentence in doc.sents:
    print(sentence)


Concentration of Risk: Credit Risk
Financial instruments that potentially subject us to a concentration of credit risk consist of cash, cash equivalents, marketable securities,
restricted cash, accounts receivable, convertible note hedges, and interest rate swaps.
Our cash balances are primarily invested in money market funds
or on deposit at high credit quality financial institutions in the U.S.
These deposits are typically in excess of insured limits.
As of September 30, 2021
and December 31, 2020, no entity represented 10% or more of our total accounts receivable balance.
The risk of concentration for our convertible note
hedges and interest rate swaps is mitigated by transacting with several highly-rated multinational banks.

Concentration of Risk: Supply Risk

We are dependent on our suppliers, including single source suppliers, and the inability of these suppliers to deliver necessary components of our
products in a timely manner at prices, quality levels and volumes acceptable 

# Removing spaces, punctuation and other characters using spaCy. 

### The difference between this method and using string.punctuation is coverage: filtering on spaCy's part-of-speech tags removes all punctuation tokens, which is not appropriate for every task, whereas string.punctuation contains only the 32 ASCII punctuation characters, so Unicode punctuation such as curly quotes (“ ”) and the ellipsis character (…) is left in the text.
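
The standard-library side of this comparison can be checked directly. `string.punctuation` holds only ASCII punctuation, so in the sketch below the parentheses and exclamation mark are removed while the curly quotes and the ellipsis character survive. The "(really!)" ending is my own addition to the sample so that it contains some ASCII punctuation.

```python
import string

sample = "say “i’ll just embrace my demons then…” (really!)"

# str.translate with a deletion table drops every character listed in string.punctuation.
ascii_table = str.maketrans("", "", string.punctuation)
stripped = sample.translate(ascii_table)

print(stripped)
print("“" in string.punctuation)
```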

First, I will show how the part-of-speech component works.

In [281]:
nlp = spacy.load("en_core_web_sm") # -> load the "en_core_web_sm" model into an object
#Below, I pass the text directly as the argument to the object instead of assigning it to a variable first.
doc = nlp("'just when i think i’ve lost you just when i’m so tired i toss away the fight and say “i’ll just embrace my demons then… ‘cause you feel so far away and i’ll never be your angel” —that’s when' ")

for token in doc:
    print(token, " | ", # -> The tokenized word
          token.pos_, " | ", # -> pos_ holds the coarse-grained part of speech.
          spacy.explain(token.pos_), " | ", # -> spacy.explain() returns a short description of the label passed to it. Use it to understand spaCy's tag names.
          token.tag_, " | ", # -> tag_ holds the fine-grained tag (type, tense and other detail of the part of speech).
          spacy.explain(token.tag_)
         )  
    #In spaCy, there are many part-of-speech tags, more than the 8 fundamental parts of speech.
    


'  |  PUNCT  |  punctuation  |  ``  |  opening quotation mark
just  |  ADV  |  adverb  |  RB  |  adverb
when  |  SCONJ  |  subordinating conjunction  |  WRB  |  wh-adverb
i  |  PRON  |  pronoun  |  PRP  |  pronoun, personal
think  |  VERB  |  verb  |  VBP  |  verb, non-3rd person singular present
i  |  PRON  |  pronoun  |  PRP  |  pronoun, personal
’ve  |  AUX  |  auxiliary  |  VBP  |  verb, non-3rd person singular present
lost  |  VERB  |  verb  |  VBN  |  verb, past participle
you  |  PRON  |  pronoun  |  PRP  |  pronoun, personal
just  |  ADV  |  adverb  |  RB  |  adverb
when  |  SCONJ  |  subordinating conjunction  |  WRB  |  wh-adverb
i  |  PRON  |  pronoun  |  PRP  |  pronoun, personal
’m  |  VERB  |  verb  |  VBZ  |  verb, 3rd person singular present
so  |  ADV  |  adverb  |  RB  |  adverb
tired  |  ADJ  |  adjective  |  JJ  |  adjective (English), other noun-modifier (Chinese)
i  |  PRON  |  pronoun  |  PRP  |  pronoun, personal
toss  |  VERB  |  verb  |  VBP  |  verb, non-3rd 

#### Now I will remove spaces, punctuation and other characters from the text, using spaCy.

In [270]:
nlp = spacy.load("en_core_web_sm") # -> load the "en_core_web_sm" model into an object
#Below, I pass the text directly as the argument to the object instead of assigning it to a variable first.
doc = nlp("'just when i think i’ve lost you just when i’m so tired i toss away the fight and say “i’ll just embrace my demons then… ‘cause you feel so far away and i’ll never be your angel” —that’s when' ")

filtered_text = []
for token in doc:
    #pos_ holds the coarse-grained part of speech. spaCy has more of these tags than the 8 fundamental parts of speech.
    #In spaCy, "SPACE", "X", and "PUNCT" mark whitespace, "other" tokens that fit no category, and punctuation in a document.
    if token.pos_ not in ["SPACE", "X", "PUNCT"]:
        filtered_text.append(token)     
filtered_text

[just,
 when,
 i,
 think,
 i,
 ’ve,
 lost,
 you,
 just,
 when,
 i,
 ’m,
 so,
 tired,
 i,
 toss,
 away,
 the,
 fight,
 and,
 say,
 i,
 ’ll,
 just,
 embrace,
 my,
 demons,
 then,
 cause,
 you,
 feel,
 so,
 far,
 away,
 and,
 i,
 ’ll,
 never,
 be,
 your,
 angel,
 that,
 ’s,
 when]
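
A lighter alternative, sketched here under my own naming, filters on the lexical flags `is_punct` and `is_space` instead of part-of-speech tags. These flags work on a blank pipeline, so no trained model is needed, though the result may differ from the POS-based filter on unusual tokens.

```python
import spacy

light_nlp = spacy.blank("en")  # hypothetical name; no trained model is loaded
light_doc = light_nlp("Hello, world!  It works.")

# is_punct and is_space are lexical flags, so they work without a tagger.
kept = [token.text for token in light_doc if not (token.is_punct or token.is_space)]

print(kept)
```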

#### Below is code similar to the cell above, but here we want filtered_text to be a clean list of strings rather than spaCy tokens, so we convert each token to text (token.text) before appending it to the filtered_text list. Now we can rebuild the whole sentence from the list with the join method.

In [289]:
filtered_text = []
for token in doc:
    if token.pos_ not in ["SPACE", "X", "PUNCT"]:
        filtered_text.append(token.text)
filtered_text = ' '.join(filtered_text)
filtered_text

'just when i think i ’ve lost you just when i ’m so tired i toss away the fight and say i ’ll just embrace my demons then cause you feel so far away and i ’ll never be your angel that ’s when'
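
Joining with a single space leaves contractions split (for example `i ’ve`). When no tokens have been filtered out, one possible way to restore the original spacing is `token.text_with_ws`, which keeps each token's trailing whitespace. A minimal sketch on a shortened sample, with names of my own choosing:

```python
import spacy

ws_nlp = spacy.blank("en")  # hypothetical name; only the tokenizer is needed
original = "just when i think i’ve lost you"
ws_doc = ws_nlp(original)

# Joining on a space inserts a gap between every pair of tokens.
spaced = " ".join(token.text for token in ws_doc)

# text_with_ws carries each token's own trailing whitespace, so the text round-trips.
rejoined = "".join(token.text_with_ws for token in ws_doc)

print(spaced)
print(rejoined)
```

This only round-trips cleanly when every token is kept; once tokens are dropped, the whitespace that belonged to them is dropped too.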