<a href="https://colab.research.google.com/github/Rohanrathod7/my-ml-labs/blob/main/18_Natural_Language_Processing_with_spaCy/03_Data_Analysis_with_spaCy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### 3. Data Analysis with spaCy


Get familiar with spaCy pipeline components, learn how to add a component to a pipeline, and analyze an NLP pipeline. You will also learn several approaches to rule-based information extraction using spaCy's EntityRuler, Matcher, and PhraseMatcher classes and Python's re (RegEx) package.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
import datetime as dt
# Import confusion matrix and train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, KFold, cross_val_score, GridSearchCV
from sklearn.linear_model import Ridge, Lasso, LogisticRegression, LinearRegression
from scipy.cluster.hierarchy import linkage, dendrogram
from sklearn.linear_model import SGDClassifier

url = "https://raw.githubusercontent.com/Rohanrathod7/my-ml-labs/main/15_Hyperparameter_Tuning_in_Python/Dataset/results_df.csv"
# Read the CSV file into a DataFrame
results_df = pd.read_csv(url)


display(results_df.head())

Unnamed: 0,max_depth,min_samples_leaf,learn_rate,accuracy
0,4,16,0.624362,95
1,10,14,0.47745,97
2,7,14,0.050067,96
3,5,12,0.023356,96
4,6,12,0.771275,97


In [3]:
!python3 -m pip install spacy
!python3 -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 73.9 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [4]:
# Load en_core_web_sm and create an nlp object
import spacy
nlp = spacy.load("en_core_web_sm")

### spaCy pipelines

**Adding pipes in spaCy**  
You often use an existing spaCy model for different NLP tasks. However, in some cases an off-the-shelf pipeline component, such as the one used for sentence segmentation, can take a long time to produce the expected results. In this exercise, you'll practice adding a pipeline component to a spaCy model (a text processing pipeline).

You will use the first five reviews from the Amazon Fine Food Reviews dataset for this exercise. You can access these reviews by using the texts string.

The spaCy package is already imported for you to use.



In [11]:
texts = 'I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than  most. Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as "Jumbo". This is a confection that has been around a few centuries.  It is a light, pillowy citrus gelatin with nuts - in this case Filberts. And it is cut into tiny squares and then liberally coated with powdered sugar.  And it is a tiny mouthful of heaven.  Not too chewy, and very flavorful.  I highly recommend this yummy treat.  If you are familiar with the story of C.S. Lewis\' "The Lion, The Witch, and the Wardrobe" - this is the treat that seduces Edmund into selling out his Brother and Sisters to the Witch. If you are looking for the secret ingredient in Robitussin I believe I have found it.  I got this in addition to the Root Beer Extract I ordered (which was good) and made some cherry soda.  The flavor is very medicinal. Great taffy at a great price.  There was a wide assortment of yummy taffy.  Delivery was very quick.  If your a taffy lover, this is a deal.'


In [12]:

# Load a blank spaCy English model and add a sentencizer component
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# Create Doc containers, store sentences and print its number of sentences
doc = nlp(texts)
sentences = [s for s in doc.sents]
print("Number of sentences: ", len(sentences), "\n")

# Print the list of tokens in the second sentence
print("Second sentence tokens: ", [token for token in sentences[1]])

# You successfully loaded a blank English model and added a Sentencizer component to segment a given text into its sentences.

Number of sentences:  19 

Second sentence tokens:  [The, product, looks, more, like, a, stew, than, a, processed, meat, and, it, smells, better, .]
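As a smaller illustration of the same idea, the sketch below checks which components a blank pipeline contains before and after adding the sentencizer. The sample string is made up for this example and is not part of the exercise data.

```python
import spacy

# A blank English pipeline starts with a tokenizer only and no components
nlp = spacy.blank("en")
names_before = list(nlp.pipe_names)

# The rule-based sentencizer sets sentence boundaries from punctuation
nlp.add_pipe("sentencizer")
names_after = list(nlp.pipe_names)

# Hypothetical sample text, used only for illustration
sample = "Great taffy at a great price. Delivery was very quick. If you love taffy, this is a deal."
doc = nlp(sample)
sentence_texts = [sent.text for sent in doc.sents]

print(names_before, names_after)
print(sentence_texts)
```

Because the sentencizer relies only on punctuation rules, it needs no trained model, which is why it can be added to a blank pipeline.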


**Analyzing pipelines in spaCy**  
spaCy allows you to analyze a pipeline to check whether any required attributes are not set. In this exercise, you'll practice analyzing a spaCy pipeline. Earlier in the video, the existing en_core_web_sm pipeline was analyzed and the result was "No problems found." In this instance, you will analyze a blank spaCy English model with a few added components and observe the results of the analysis.

The spaCy package is already imported for you to use.

In [13]:
# Load a blank spaCy English model
nlp = spacy.blank("en")

# Add tagger and entity_linker pipeline components
nlp.add_pipe("tagger")
nlp.add_pipe("entity_linker")

# Analyze the pipeline
analysis = nlp.analyze_pipes(pretty=True)


#   Component       Assigns           Requires         Scores        Retokenizes
-   -------------   ---------------   --------------   -----------   -----------
0   tagger          token.tag                          tag_acc       False      
                                                       pos_acc                  
                                                       tag_micro_p              
                                                       tag_micro_r              
                                                       tag_micro_f              
                                                                                
1   entity_linker   token.ent_kb_id   doc.ents         nel_micro_f   False      
                                      doc.sents        nel_micro_r              
                                      token.ent_iob    nel_micro_p              
                                      token.ent_type                            

⚠ 'enti

**Question**  
The output of the analyze_pipes() method showed that the requirements of entity_linker are not met: doc.ents, doc.sents, token.ent_iob, token.ent_type.

Which NLP components should be added before the entity_linker component to ensure the spaCy pipeline has all the attributes required for entity linking?

-> ner, sentencizer

In this instance, the pipeline is missing sentence segmentation and named entity recognition components before the entity_linker component.
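The answer can be checked with a short sketch. It assumes that analyze_pipes(), called without pretty=True, returns a dictionary whose "problems" entry lists the unmet requirements per component. The components are untrained here, so this only verifies the declared attributes, not prediction quality.

```python
import spacy

# Build a blank pipeline with the prerequisites placed before entity_linker
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")    # provides doc.sents
nlp.add_pipe("ner")            # provides doc.ents, token.ent_iob, token.ent_type
nlp.add_pipe("entity_linker")

# Without pretty=True the analysis is returned as a dictionary
analysis = nlp.analyze_pipes()
problems = analysis["problems"]
print(problems)
```

An empty list for entity_linker in problems indicates that every attribute it requires is assigned by an earlier component.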

### spaCy EntityRuler

**EntityRuler with blank spaCy model**  
EntityRuler lets you add entities to doc.ents. It can be combined with EntityRecognizer, a spaCy pipeline component for named-entity recognition, to boost accuracy, or used on its own to implement a purely rule-based entity recognition system. In this exercise, you will practice adding an EntityRuler component to a blank spaCy English model and classifying the named entities of a given text using purely rule-based named-entity recognition.

The spaCy package is already imported and a blank spaCy English model is ready for your use as nlp. A list of patterns to classify lowercased OpenAI and Microsoft as ORG is already created for your use.

In [14]:
nlp = spacy.blank("en")
patterns = [{"label": "ORG", "pattern": [{"LOWER": "openai"}]},
            {"label": "ORG", "pattern": [{"LOWER": "microsoft"}]}]
text = "OpenAI has joined forces with Microsoft."

# Add EntityRuler component to the model
entity_ruler = nlp.add_pipe("entity_ruler")

# Add given patterns to the EntityRuler component
entity_ruler.add_patterns(patterns)

# Run the model on a given text
doc = nlp(text)

# Print entities text and type for all entities in the Doc container
print([(ent.text, ent.label_) for ent in doc.ents])

# You can now define as many patterns and use EntityRuler to perform purely rule-based entity recognition for a given text.
# In this instance, OpenAI and Microsoft are identified correctly as ORG (organization).

[('OpenAI', 'ORG'), ('Microsoft', 'ORG')]
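EntityRuler patterns can be either token patterns (a list of dictionaries, as above) or plain strings for exact phrase matches. The sketch below mixes both forms; the sentence and the GPE pattern are illustrative and not taken from the exercise.

```python
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")

# One phrase pattern (exact string) and one multi-token pattern
patterns = [
    {"label": "ORG", "pattern": "Apple"},
    {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]},
]
ruler.add_patterns(patterns)

doc = nlp("Apple is opening its first big office in San Francisco.")
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```

Note that a multi-word entity needs one dictionary per token in a token pattern, whereas a phrase pattern is tokenized for you.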


**EntityRuler for NER**  
EntityRuler can be combined with the EntityRecognizer of an existing model to boost its accuracy. In this exercise, you will practice combining an EntityRuler component with the existing NER component of the en_core_web_sm model. The model is already loaded as nlp.

When an EntityRuler is added before the NER component, the entity recognizer respects the entity spans the ruler has already set and adjusts its predictions around them.

In [15]:
nlp = spacy.load("en_core_web_sm")
text = "New York Group was built in 1987."

# Add an EntityRuler to the nlp before NER component
ruler = nlp.add_pipe("entity_ruler", before="ner")

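# Define a pattern to classify lowercased new york group as ORG
# (one dictionary per token: a single token can never equal "new york group")
patterns = [{"label": "ORG", "pattern": [{"LOWER": "new"}, {"LOWER": "york"}, {"LOWER": "group"}]}]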

# Add the patterns to the EntityRuler component
ruler.add_patterns(patterns)

# Run the model and print entities text and type for all the entities
doc = nlp(text)
print([(ent.text, ent.label_) for ent in doc.ents])

#  You just defined a token entity pattern and were able to improve the accuracy of an
#  NER model by adding an EntityRuler before NER component of the spaCy model.

[('New York Group', 'ORG'), ('1987', 'DATE')]
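Placing the ruler before NER is one way to give rules priority. Another is to place a ruler after an earlier component and set its overwrite_ents option. The sketch below is a model-free stand-in: a first ruler (named base_ruler here, a hypothetical name) plays the role of the earlier entity component, and the helper function run is likewise introduced only for illustration.

```python
import spacy

text = "New York Group was built in 1987."

def run(overwrite):
    """Return entities when a second ruler may or may not overwrite earlier ones."""
    nlp = spacy.blank("en")
    # Stand-in for an earlier entity component: labels "New York" as GPE
    base = nlp.add_pipe("entity_ruler", name="base_ruler")
    base.add_patterns([{"label": "GPE", "pattern": [{"LOWER": "new"}, {"LOWER": "york"}]}])
    # Second ruler overlaps the first span and differs only in overwrite_ents
    second = nlp.add_pipe("entity_ruler", name="second_ruler",
                          config={"overwrite_ents": overwrite})
    second.add_patterns([{"label": "ORG", "pattern": [
        {"LOWER": "new"}, {"LOWER": "york"}, {"LOWER": "group"}]}])
    return [(ent.text, ent.label_) for ent in nlp(text).ents]

kept = run(False)      # the earlier GPE span is preserved
replaced = run(True)   # the overlapping ORG span wins
print(kept, replaced)
```

With the default overwrite_ents=False, a later ruler skips any match that overlaps an existing entity, which is why component order matters.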


**EntityRuler with multi-patterns in spaCy**  
EntityRuler lets you add entities to doc.ents and can boost a model's named entity recognition performance. In this exercise, you will practice adding an EntityRuler component to an existing nlp pipeline to ensure multiple entities are classified correctly.

The en_core_web_sm model is already loaded and is available for your use as nlp. You can access an example text in example_text and use nlp and doc to access a spaCy model and the Doc container of example_text respectively.



In [18]:
!python3 -m spacy download en_core_web_md

Collecting en-core-web-md==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.8.0/en_core_web_md-3.8.0-py3-none-any.whl (33.5 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 33.5/33.5 MB 54.6 MB/s eta 0:00:00
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [20]:
nlp = spacy.load("en_core_web_md")

example_text = 'I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than  most. Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as "Jumbo". This is a confection that has been around a few centuries.  It is a light, pillowy citrus gelatin with nuts - in this case Filberts. And it is cut into tiny squares and then liberally coated with powdered sugar.  And it is a tiny mouthful of heaven.  Not too chewy, and very flavorful.  I highly recommend this yummy treat.  If you are familiar with the story of C.S. Lewis\' "The Lion, The Witch, and the Wardrobe" - this is the treat that seduces Edmund into selling out his Brother and Sisters to the Witch. If you are looking for the secret ingredient in Robitussin I believe I have found it.  I got this in addition to the Root Beer Extract I ordered (which was good) and made some cherry soda.  The flavor is very medicinal. Great taffy at a great price.  There was a wide assortment of yummy taffy.  Delivery was very quick.  If your a taffy lover, this is a deal.'

# Print the entity spans found in example_text before adding the EntityRuler
print("Before EntityRuler: ", [ent for ent in nlp(example_text).ents], "\n")

# Define pattern to add a label PERSON for lower cased sisters and brother entities
patterns = [{"label": "PERSON", "pattern": [{"lower": "sisters"}]},
            {"label": "PERSON", "pattern": [{"lower": "brother"}]}]

# Add an EntityRuler component and add the patterns to the ruler
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns(patterns)

# Print a list of tuples of entities text and types
print("After EntityRuler: ", [(ent.text, ent.label_) for ent in nlp(example_text).ents])

# Now you are able to handcraft rules in spaCy using EntityRuler and ensure spaCy models are able to correctly identify entities and their corresponding labels.
# In this instance, the EntityRuler helps to predict brother and sisters correctly as PERSON.

Before EntityRuler:  [Jumbo Salted Peanuts, around a few centuries, Filberts, C.S. Lewis', The Lion, The Witch, Edmund, Robitussin, the Root Beer Extract] 

After EntityRuler:  [('Jumbo Salted Peanuts', 'ORG'), ('around a few centuries', 'DATE'), ('Filberts', 'PERSON'), ("C.S. Lewis'", 'PERSON'), ('The Lion, The Witch', 'WORK_OF_ART'), ('Edmund', 'PERSON'), ('Brother', 'PERSON'), ('Sisters', 'PERSON'), ('Robitussin', 'ORG'), ('the Root Beer Extract', 'ORG')]


**RegEx in Python**  
Rule-based information extraction is useful for many NLP tasks. Certain types of entities, such as dates or phone numbers, have distinct formats that can be recognized by a set of rules without training any model. In this exercise, you will practice using the re package for RegEx. The goal is to find phone numbers in a given text.

The re package is already imported for your use. You can use \d, a metacharacter that matches any digit from 0 to 9.

In [22]:
import re

text = "Our phone number is (425)-123-4567."

# Define a pattern to match phone numbers
pattern = r"\(\d{3}\)-\d{3}-\d{4}"

# Find all the matching patterns in the text
phones = re.finditer(pattern, text)

# Print start and end characters and matching section of the text
for match in phones:
    start_char = match.start()
    end_char = match.end()
    print("Start character: ", start_char, "| End character: ", end_char, "| Matching text: ", text[start_char:end_char])

Start character:  20 | End character:  34 | Matching text:  (425)-123-4567
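When the individual parts of a match are needed, named groups can label them. The sketch below is a small variation on the exercise; the second phone number and the group names area, prefix, and line are made up for illustration.

```python
import re

# Hypothetical text with two numbers in the same (ddd)-ddd-dddd format
text = "Call (425)-123-4567 or (206)-555-0100."

# Named groups capture each part of the number separately
pattern = re.compile(r"\((?P<area>\d{3})\)-(?P<prefix>\d{3})-(?P<line>\d{4})")

parsed = [match.groupdict() for match in pattern.finditer(text)]
for parts in parsed:
    print(parts["area"], parts["prefix"], parts["line"])
```

Compiling the pattern once with re.compile is convenient when the same expression is applied to many texts.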


**RegEx with EntityRuler in spaCy**  
Regular expressions, or RegEx, are used for rule-based information extraction with complex string matching patterns. RegEx can be used to retrieve patterns or to replace matching patterns in a string with other patterns. In this exercise, you will practice using EntityRuler in spaCy to find phone numbers in a given text.

The spaCy package is already imported for your use. You can use \d, a metacharacter that matches any digit from 0 to 9.

A spaCy pattern can use REGEX as an attribute. In this case, a pattern will be of shape [{"TEXT": {"REGEX": "<a given pattern>"}}].

In [23]:
text = "Our phone number is 4251234567."

# Define a pattern to match 10-digit phone numbers (a raw string avoids an invalid-escape warning)
patterns = [{"label": "PHONE_NUMBERS", "pattern": [{"TEXT": {"REGEX": r"\d{10}"}}]}]

# Load a blank model and add an EntityRuler
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")

# Add the compiled patterns to the EntityRuler
ruler.add_patterns(patterns)

# Print the tuple of entities texts and types for the given text
doc = nlp(text)
print([(ent.text, ent.label_) for ent in doc.ents])

# You are now able to write a regular expression pattern and use spaCy EntityRuler to align the matched patterns with a given Doc container entities.

[('4251234567', 'PHONE_NUMBERS')]
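A REGEX attribute in a token pattern is applied to one token at a time, so a formatted number such as (425)-123-4567, which the tokenizer splits into several tokens, would not be found by the pattern above. One workaround, sketched here under the assumption that the match boundaries coincide with token boundaries, is to run the regular expression on doc.text and convert character offsets to spans with doc.char_span. The label name PHONE_NUMBER is chosen for this example.

```python
import re
import spacy

nlp = spacy.blank("en")
doc = nlp("Our phone number is (425)-123-4567.")

# Match on the raw text, then map character offsets back to token spans
phone_pattern = r"\(\d{3}\)-\d{3}-\d{4}"
spans = []
for match in re.finditer(phone_pattern, doc.text):
    span = doc.char_span(match.start(), match.end(), label="PHONE_NUMBER")
    # char_span returns None if the offsets do not align with token boundaries
    if span is not None:
        spans.append(span)

doc.ents = spans
print([(ent.text, ent.label_) for ent in doc.ents])
```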


### spaCy Matcher and PhraseMatcher

**Matching a single term in spaCy**  
RegEx patterns are not trivial to read, write, and debug. Fortunately, spaCy provides a readable, production-level alternative: the Matcher class. The Matcher class can match predefined rules to a sequence of tokens in a given Doc container. In this exercise, you will practice using Matcher to find a single word.

You can access the corresponding text in example_text and use nlp and doc to access a spaCy model and the Doc container of example_text respectively.



In [25]:
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
doc = nlp(example_text)

# Initialize a Matcher object
matcher = Matcher(nlp.vocab)

# Define a pattern to match lower cased word witch
pattern = [{"lower" : "witch"}]

# Add the pattern to matcher object and find matches
matcher.add("CustomMatcher", [pattern])
matches = matcher(doc)

# Print start and end token indices and span of the matched text
for match_id, start, end in matches:
    print("Start token: ", start, " | End token: ", end, "| Matched text: ", doc[start:end].text)

# Matcher is one of the commonly used functionalities in spaCy to find given patterns in a text.
# In this case, you identify two occurrences of the word witch in the given context.

Start token:  177  | End token:  178 | Matched text:  Witch
Start token:  200  | End token:  201 | Matched text:  Witch
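Matcher patterns extend naturally from one token to a sequence: each dictionary in the list describes one token. The sketch below, which uses a blank model and a short phrase from the example text, matches the word "the" followed by a title-cased token.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
doc = nlp("The Lion, The Witch, and the Wardrobe")

matcher = Matcher(nlp.vocab)
# Two-token pattern: lowercased "the", then any title-cased token
pattern = [{"LOWER": "the"}, {"IS_TITLE": True}]
matcher.add("TheTitle", [pattern])

# Sort by start index so the results follow the order of the text
matches = sorted(matcher(doc), key=lambda m: m[1])
matched_texts = [doc[start:end].text for _, start, end in matches]
print(matched_texts)
```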


**PhraseMatcher in spaCy**  
While processing unstructured text, you often have long lists and dictionaries that you want to scan and match in given texts. Matcher patterns are handcrafted, and each token needs to be coded individually. If you have a long list of phrases, Matcher is no longer the best option. In this instance, the PhraseMatcher class helps match long dictionaries. In this exercise, you will practice retrieving text whose shape matches that of multiple terms using the PhraseMatcher class.

In [None]:
from spacy.matcher import PhraseMatcher

text = "There are only a few acceptable IP addresses: (1) 127.100.0.1, (2) 123.4.1.0."
terms = ["110.0.0.0", "101.243.0.0"]

# Initialize a PhraseMatcher that compares token shapes (e.g. ddd.d.d.d) rather than exact text
matcher = PhraseMatcher(nlp.vocab, attr="SHAPE")

# Create patterns to add to the PhraseMatcher object
patterns = [nlp.make_doc(term) for term in terms]
matcher.add("IPAddresses", patterns)

# Find matches to the given patterns and print start and end characters and matches texts
doc = nlp(text)
matches = matcher(doc)
for match_id, start, end in matches:
    print("Start token: ", start, " | End token: ", end, "| Matched text: ", doc[start:end].text)

# You can now find matches to the shapes of given patterns using PhraseMatcher.
# In this instance, we identify two IP addresses whose shapes match the terms given to the PhraseMatcher object.
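Besides SHAPE, PhraseMatcher can compare other token attributes. The sketch below uses attr="LOWER" for case-insensitive matching of a term list; the two terms and the shortened sentence are chosen for illustration.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
terms = ["root beer extract", "cherry soda"]

# Compare lowercased token text so "Root Beer Extract" still matches
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("Products", [nlp.make_doc(term) for term in terms])

doc = nlp("I got this in addition to the Root Beer Extract I ordered and made some cherry soda.")
# Sort by start index so the results follow the order of the text
matches = sorted(matcher(doc), key=lambda m: m[1])
matched_texts = [doc[start:end].text for _, start, end in matches]
print(matched_texts)
```

Using nlp.make_doc only runs the tokenizer, which keeps pattern creation fast for long term lists.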

**Matching with extended syntax in spaCy**  
Rule-based information extraction is a useful part of many NLP pipelines. The Matcher class allows patterns to be more expressive by accepting operators inside the curly brackets. These operators provide extended comparisons and look similar to Python's in, not in, and comparison operators. In this exercise, you will practice using spaCy's Matcher to find matches for given terms in an example text.

The Matcher class is already imported from the spacy.matcher module. You will use the Doc container of an example text in this exercise by calling doc. A pre-loaded spaCy model is also accessible as nlp.

In [26]:
nlp = spacy.load("en_core_web_sm")
doc = nlp(example_text)

# Define a matcher object
matcher = Matcher(nlp.vocab)
# Define a pattern to match tiny squares and tiny mouthful
pattern = [{"lower": "tiny"}, {"lower": {"IN": ["squares", "mouthful"]}}]

# Add the pattern to matcher object and find matches
matcher.add("CustomMatcher", [pattern])
matches = matcher(doc)

# Print out start and end token indices and the matched text span per match
for match_id, start, end in matches:
    print("Start token: ", start, " | End token: ", end, "| Matched text: ", doc[start:end].text)

# spaCy's Matcher provides a readable and production-level approach to match a predefined set of patterns to a given text. In this instance,
# you successfully used the IN operator to extend Matcher patterns and find similar phrases that start with tiny.

Start token:  123  | End token:  125 | Matched text:  tiny squares
Start token:  138  | End token:  140 | Matched text:  tiny mouthful
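The extended syntax also covers NOT_IN and numeric comparisons. The sketch below combines both on the second token; the sample sentences are adapted from the example text, with a made-up final sentence added to show tokens that are filtered out.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
# The last sentence is hypothetical and exists only to exercise the filters
doc = nlp("It is cut into tiny squares. It is a tiny mouthful of heaven. "
          "A tiny bit of a tiny portion.")

matcher = Matcher(nlp.vocab)
# "tiny" followed by a token of at least 7 characters that is not "portion"
pattern = [
    {"LOWER": "tiny"},
    {"LOWER": {"NOT_IN": ["portion"]}, "LENGTH": {">=": 7}},
]
matcher.add("TinyLongWord", [pattern])

matches = sorted(matcher(doc), key=lambda m: m[1])
matched_texts = [doc[start:end].text for _, start, end in matches]
print(matched_texts)
```

Here "tiny bit" is rejected by the LENGTH comparison and "tiny portion" by NOT_IN, so only the two phrases from the original exercise remain.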
