# Vocabulary (or "vocab") and string matching

Vocabulary (or "vocab") and string matching are essential concepts in Natural Language Processing (NLP), which is a field of artificial intelligence that focuses on the interaction between computers and human language. These concepts are fundamental for various NLP tasks, such as text processing, information retrieval, and text analysis.

## Vocabulary (Vocab):

In NLP, the vocabulary refers to the set of unique words or tokens that exist in a particular corpus or dataset. A corpus can be a collection of documents, articles, or any text data.
The size and composition of the vocabulary are crucial for many NLP tasks, including text classification, sentiment analysis, and language modeling. A larger vocabulary can capture more nuances in language, but it can also make models more complex and computationally expensive.
A vocabulary also defines the set of words for which word embeddings, such as Word2Vec or GloVe, are learned. These embeddings represent each word as a dense numeric vector and underpin many NLP applications.
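
To make the idea of a vocabulary concrete, here is a minimal sketch that builds one from a tiny corpus using only the standard library. The corpus, the `tokenize` helper, and the regex-based tokenization are illustrative assumptions; real pipelines usually rely on a proper tokenizer such as spaCy's.

```python
from collections import Counter
import re

# A tiny illustrative corpus (hypothetical data, not from any real dataset).
corpus = [
    "Data science is fun.",
    "Science needs data, and data needs science.",
]

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

# Count how often each token occurs across the whole corpus.
token_counts = Counter(token for doc in corpus for token in tokenize(doc))

# The vocabulary is the set of unique tokens; sorting makes the ids reproducible.
vocab = sorted(token_counts)

# Map each token to an integer id, since most models expect numeric input.
token_to_id = {token: idx for idx, token in enumerate(vocab)}

print(vocab)
print(token_counts.most_common(2))
```

The ten tokens in this corpus collapse to a vocabulary of six unique entries, which shows how vocabulary size depends on normalization choices such as lowercasing.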

## String Matching:

String matching in NLP involves finding occurrences of specific strings within a text or comparing strings for similarity.

Exact string matching: Determining whether an exact string or substring exists in a given text. This can be done with simple methods like the `in` operator in Python or with more advanced algorithms like Knuth-Morris-Pratt.

Fuzzy string matching: Identifying strings that are approximately equal, taking into account typos, misspellings, and variations. Measures like Levenshtein distance or Jaccard similarity are often used for this purpose.

Regular expressions: Regular expressions (regex) provide a powerful way to define patterns for string matching. They are widely used for text extraction and manipulation.

Token-based matching: In NLP, matching is often done at the token level, where tokens are words or subwords. Token-based matching can compare tokens using exact or approximate matches.
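
The exact, regex, and set-based approaches listed above can be sketched with the standard library alone. The sample text and the `jaccard` helper are illustrative assumptions; the helper compares whitespace-separated lowercase tokens, which is only one of several ways to define Jaccard similarity for strings.

```python
import re

text = "Data science is a field of data-science."

# Exact matching: substring test and position lookup (both case-sensitive).
has_exact = "data-science" in text
position = text.find("data-science")

# Regular expression: "data", an optional space or hyphen, then "science",
# ignoring case.
regex_hits = re.findall(r"data[ -]?science", text, flags=re.IGNORECASE)

def jaccard(a, b):
    """Jaccard similarity between the sets of lowercase word tokens of a and b."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0  # two empty strings are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)

similarity = jaccard("New York City", "new york")

print(has_exact, position)
print(regex_hits)
print(round(similarity, 3))
```

Exact matching misses the capitalized "Data science", while the case-insensitive regex finds both spellings. That gap is the motivation for the token-level matchers covered next.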

# Rule-based Matching

spaCy is a popular Python library for natural language processing (NLP). Two of its components, the Matcher and the PhraseMatcher, handle rule-based pattern matching in text.

## Matcher:

The Matcher in spaCy performs rule-based pattern matching at the token level. You define a pattern as a sequence of token descriptions, each of which constrains attributes such as the token's text, lowercase form, lemma, or part of speech.

Use cases of the Matcher include:

Entity Recognition: You can create rules to identify and extract specific entities in text, such as dates, addresses, or custom domain-specific terms.

Information Extraction: It is commonly used to extract structured information from unstructured text by defining patterns for specific data like phone numbers, email addresses, or product codes.

Custom Text Analysis: You can define rules to find specific phrases or patterns that are relevant to your NLP task, such as finding mentions of medical conditions in a healthcare context.

In [1]:
# Perform standard imports
import spacy
nlp = spacy.load('en_core_web_sm')

In [2]:
# Import the Matcher and create one that shares the pipeline's vocabulary
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)


# Define the patterns: "data science" and "data<punctuation>science", case-insensitive
pattern_1 = [{"LOWER": "data"}, {"LOWER": "science"}]
pattern_2 = [{"LOWER": "data"}, {"IS_PUNCT": True}, {"LOWER": "science"}]
matcher.add("DataSciencePattern", [pattern_1, pattern_2])


# Text to search
text = "Data science is a field of data-science, and it involves data data data science."


# Run the pipeline, then apply the matcher to the resulting Doc
doc = nlp(text)
matches = matcher(doc)
for match_id, start, end in matches:
    span = doc[start:end]

    # match_id is the hash of the rule name "DataSciencePattern";
    # start and end are the token indices of the match (end is exclusive)
    print(match_id, start, end, span.text)

7702723889527052641 0 2 Data science
7702723889527052641 6 9 data-science
7702723889527052641 15 17 data science
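
The long integer in the output is the hash of the rule name passed to `matcher.add`. As a minimal sketch, the hash can be turned back into the readable name through the vocabulary's string store. The snippet uses `spacy.blank("en")` instead of `en_core_web_sm`, which is an assumption made here to avoid a model download; it suffices because the pattern only relies on lexical attributes.

```python
import spacy
from spacy.matcher import Matcher

# A blank English pipeline provides the tokenizer and vocab, but no trained components.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("DataSciencePattern", [[{"LOWER": "data"}, {"LOWER": "science"}]])

doc = nlp("Data science is fun.")

rule_names = []
for match_id, start, end in matcher(doc):
    # The StringStore maps the integer hash back to the original rule name.
    rule_names.append(nlp.vocab.strings[match_id])

print(rule_names)
```

Looking up the name this way is helpful when several rules are registered on the same matcher and each match needs to be attributed to its rule.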


In [3]:
# Define the patterns
pattern_1 = [{"LOWER": "data"}, {"LOWER": "science"}]

# To match inflected forms of a word (e.g. other forms of a verb),
# use "LEMMA" instead of "LOWER" for that token.
pattern_2 = [{"LOWER": "data"}, {"IS_PUNCT": True}, {"LOWER": "science"}]
matcher.add("DataSciencePattern", [pattern_1, pattern_2])


# Text to search
text = "Data science is a field of data-science, and it involves data science."


# Run the pipeline, then apply the matcher to the resulting Doc
doc = nlp(text)
matches = matcher(doc)
for match_id, start, end in matches:
    span = doc[start:end]

    # match_id is the hash of the rule name "DataSciencePattern";
    # start and end are the token indices of the match (end is exclusive)
    print(match_id, start, end, span.text)

7702723889527052641 0 2 Data science
7702723889527052641 6 9 data-science
7702723889527052641 13 15 data science
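
The two patterns used so far differ only in the optional punctuation token. As a sketch of an alternative, the `OP` key with the quantifier `"?"` marks a token as optional, so a single pattern covers both spellings. The blank pipeline and the sample sentence are assumptions made for illustration.

```python
import spacy
from spacy.matcher import Matcher

# Blank pipeline: only the tokenizer is needed for these lexical attributes.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# "data", then zero or one punctuation token, then "science" (case-insensitive).
pattern = [
    {"LOWER": "data"},
    {"IS_PUNCT": True, "OP": "?"},
    {"LOWER": "science"},
]
matcher.add("DataSciencePattern", [pattern])

doc = nlp("Data science is a field of data-science.")
found = [doc[start:end].text for _, start, end in matcher(doc)]

print(found)
```

One pattern with a quantifier is easier to maintain than two near-duplicates. Keeping separate patterns is still reasonable when the variants need to diverge later.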


# Token wildcard

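An empty dictionary `{}` in a pattern is a wildcard that matches any single token. It is useful for capturing whatever token follows a specific character. In the example below, the pattern matches every `#` together with the token that comes right after it.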

In [4]:
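# Match the literal "#" followed by any single token (the empty dict is a wildcard)
pattern = [{'ORTH': '#'}, {}]

# Remove the earlier rule so that only the new pattern is registered under this name
matcher.remove('DataSciencePattern')
matcher.add("DataSciencePattern", [pattern])

# Text to search
text = "Data science is a field of #data-science, and it #involves data #science."


# Run the pipeline, then apply the matcher to the resulting Doc
doc = nlp(text)
matches = matcher(doc)
for match_id, start, end in matches:
    span = doc[start:end]
    print(span.text)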

#data
#involves
#science


# PhraseMatcher:

The PhraseMatcher in spaCy is designed for matching a list of phrases (or terms) in text. For large lists of phrases it is more efficient than the regular Matcher.

Use cases of the PhraseMatcher include:

Named Entity Recognition (NER): You can match and extract specific entities based on a predefined list of names or terms, such as a list of cities or organization names.

Custom Named Entity Recognition: You can use it to identify and extract custom domain-specific entities using a predefined list of phrases.

In [5]:
from spacy.matcher import PhraseMatcher
phrase_matcher = PhraseMatcher(nlp.vocab, attr='LOWER')

In [6]:
phrases = ["New York", "Los Angeles", "San Francisco"]
patterns = [nlp(text) for text in phrases]
phrase_matcher.add("CityMatcher",patterns)  # Pass the list of patterns as a single argument

text = "New York, Los Angeles, and San Francisco are major cities."
doc = nlp(text)
matches = phrase_matcher(doc)
for match_id, start, end in matches:
    span = doc[start:end]
    print(span.text)

New York
Los Angeles
San Francisco
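
Because the matcher above was created with `attr='LOWER'`, it compares lowercase forms and therefore ignores capitalization. The sketch below illustrates this on differently cased input. It builds the patterns with `nlp.make_doc`, which only runs the tokenizer and is usually faster than calling the full pipeline on every phrase. The blank pipeline and the sample sentence are assumptions for illustration.

```python
import spacy
from spacy.matcher import PhraseMatcher

# Blank pipeline: phrase matching on LOWER needs only the tokenizer.
nlp = spacy.blank("en")
phrase_matcher = PhraseMatcher(nlp.vocab, attr="LOWER")

cities = ["New York", "Los Angeles", "San Francisco"]

# make_doc tokenizes without running any other pipeline components.
phrase_matcher.add("CityMatcher", [nlp.make_doc(city) for city in cities])

doc = nlp("I flew from new york to LOS ANGELES.")
found = [doc[start:end].text for _, start, end in phrase_matcher(doc)]

print(found)
```

The matched spans keep the casing of the input text, so downstream code may want to normalize them before counting or deduplicating.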


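# Fuzzy string matching

Neither matcher above tolerates typos. For approximate matching, the example below uses the Levenshtein (edit) distance from the third-party `Levenshtein` package and keeps every target string that lies within a chosen distance of the input.

In [7]:
import Levenshtein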

# Define a list of target strings
target_strings = ["apple", "banana", "cherry", "date", "elderberry"]

# Input string with a potential typo
input_string = "banama"

# Maximum edit distance that still counts as a match
threshold = 3  # You can adjust this threshold based on your needs

# Initialize a list to store matched strings
matches = []

# Perform fuzzy matching
for target in target_strings:
    distance = Levenshtein.distance(input_string, target)
    if distance <= threshold:
        matches.append(target)

# Print the matched strings
print("Fuzzy Matches:")
for match in matches:
    print(match)

Fuzzy Matches:
banana
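
To show what the distance function computes, here is a minimal pure-Python sketch of the classic dynamic-programming formulation of Levenshtein distance. It is an illustrative reimplementation, not the code of the `Levenshtein` package, and the helper name `levenshtein` is an assumption.

```python
def levenshtein(a, b):
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    # previous[j] is the distance between the already processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]

target_strings = ["apple", "banana", "cherry", "date", "elderberry"]

# Distance from the misspelled input to every candidate.
distances = {target: levenshtein("banama", target) for target in target_strings}
closest = min(distances, key=distances.get)

print(distances)
print(closest)
```

Picking the candidate with the smallest distance avoids having to tune a threshold. A threshold is still useful for rejecting inputs that resemble none of the candidates.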


# Named Entity Recognition

In [8]:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [9]:
doc=nlp('Tesla has lost over $94B in market valuation so far in 2024, and Jeff Bezos is threatening to steal Elon Musk’s crown as the world’s richest person')

for ent in doc.ents:
    print(ent.text,'|',ent.label_,'|',spacy.explain(ent.label_))

Tesla | ORG | Companies, agencies, institutions, etc.
over $94B | MONEY | Monetary values, including unit
2024 | DATE | Absolute or relative dates or periods
Jeff Bezos | PERSON | People, including fictional
Elon Musk’s | WORK_OF_ART | Titles of books, songs, etc.


In [10]:
from spacy import displacy
displacy.render(doc,style='ent')

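In the output above, "Elon Musk’s" is labeled WORK_OF_ART instead of PERSON, so spaCy's predictions are not always correct. The model is not retrained here. Instead, we correct the entity on this doc manually by creating a Span with the right label and assigning it with set_ents, as shown below.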

In [19]:
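from spacy.tokens import Span

# Tokens 21 to 23 of this doc cover "Elon Musk’s"; the end index 24 is exclusive
t1 = Span(doc, 21, 24, label='PERSON')

# Overwrite the entity on this span and leave all other entities unchanged
doc.set_ents([t1], default='unmodified')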

In [20]:
from spacy import displacy
displacy.render(doc,style='ent')
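
Hard-coded token indices such as 21 and 24 break as soon as the text changes. A possible alternative, sketched below, locates the name by character offsets and lets `doc.char_span` convert them into a token span. The blank pipeline, the shortened sentence, and the variable names are assumptions for illustration. A blank pipeline has no trained NER, so `doc.ents` starts out empty.

```python
import spacy

nlp = spacy.blank("en")  # tokenizer only, no trained NER component
text = "Jeff Bezos is threatening to steal Elon Musk's crown."
doc = nlp(text)

# Find the character offsets of the name in the raw text.
name = "Elon Musk"
start_char = text.index(name)
end_char = start_char + len(name)

# char_span returns None when the offsets do not align with token boundaries.
span = doc.char_span(start_char, end_char, label="PERSON")

if span is not None:
    # Add the new entity and keep any existing annotations as they are.
    doc.set_ents([span], default="unmodified")

entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```

Here the span covers only "Elon Musk", without the possessive "'s", which is usually the entity text one wants to extract.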