In [1]:
# spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python.
# It is designed for “understanding” large volumes of text.
# It can be used to build information extraction or natural language understanding systems,
# or to pre-process text for deep learning.
# BERTopic can also use it.
# Some of spaCy’s features work independently; others require statistical models to be loaded.

In [None]:
#!pip install -U spacy

In [None]:
# spaCy offers many pretrained models
# To start with, you can use the small default model
# It has the following components:
# 1. Binary weights for the part-of-speech (POS) tagger, dependency parser and named entity recognizer to predict annotations
#    POS tagger: tool that assigns a specific category/tag to each word in a text
#    Dependency parser: tool that analyzes the grammatical structure by identifying and representing the relationships between words
#     It creates a tree-like structure (dependency tree)
#     Each word is a node, edges are grammatical dependencies
#     Useful for syntactic analysis and understanding hierarchical structure
#    Named entity recognizer: identifies named entities
# 2. Lexical entries: running -> run (lemmatization), word forms, POS, definition, ...
# 3. Word vectors: multidimensional representations of words (the sm model ships without static word vectors; md and lg include them)
# en_core_web_sm: English multi-task CNN trained on OntoNotes. Size – 11 MB
# en_core_web_md: English multi-task CNN trained on OntoNotes, with GloVe vectors trained on Common Crawl. Size – 91 MB
# en_core_web_lg: English multi-task CNN trained on OntoNotes, with GloVe vectors trained on Common Crawl. Size – 789 MB

In [2]:
# !python -m spacy download en_core_web_lg
# !python -m spacy download en_core_web_sm



In [11]:
import spacy
from spacy import displacy

In [3]:
nlp = spacy.load('en_core_web_sm')


In [None]:
# The first step is to pass the text to the nlp object, which runs several processing steps such as tagging, parsing and named entity recognition

![spaCy](./images/spacy_nlp.png)

In [6]:
# Creating a Doc object by passing the text to the nlp pipeline
doc = nlp("Company Microsoft is planning to aquire stake in Tesla for $23 billion")
nlp.pipe_names
# Pipeline components can be disabled
# nlp.disable_pipes('tagger','parser')

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

Explaining spaCy in 10 Steps

Step 1/10: Tokenization: segmenting text into words, punctuation marks, etc.
![Tokenization](./images/tokenization.png)
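
To make the idea concrete, here is a minimal regex-based sketch of tokenization in plain Python. It is not spaCy's tokenizer, which applies language-specific rules and exceptions; the function name `simple_tokenize` and the pattern are assumptions chosen purely for illustration.

```python
import re

def simple_tokenize(text):
    """Split text into word/number tokens and single punctuation or symbol tokens."""
    # \w+ matches a run of letters, digits or underscores;
    # [^\w\s] matches one character that is neither of those nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("Company Microsoft is planning to aquire stake in Tesla for $23 billion")
print(tokens)
```

A pattern this simple would split "U.K." into four pieces, which is one reason real tokenizers need exception rules.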

Step 2/10: Part-of-Speech (POS) Tagging: a POS tag describes how a word functions in the sentence, e.g. noun, verb, adjective.
spaCy also provides displaCy, a visualizer that can be used to display the results:

In [12]:
for token in doc:
    print(token.text,",",token.tag_,",",spacy.explain(token.tag_),",",token.pos_,token.dep_)
displacy.render(doc, style="dep" , jupyter=True)

Company , NNP , noun, proper singular , PROPN compound
Microsoft , NNP , noun, proper singular , PROPN nsubj
is , VBZ , verb, 3rd person singular present , AUX aux
planning , VBG , verb, gerund or present participle , VERB ROOT
to , TO , infinitival "to" , PART aux
aquire , VB , verb, base form , VERB xcomp
stake , NN , noun, singular or mass , NOUN dobj
in , IN , conjunction, subordinating or preposition , ADP prep
Tesla , NNP , noun, proper singular , PROPN pobj
for , IN , conjunction, subordinating or preposition , ADP prep
$ , $ , symbol, currency , SYM quantmod
23 , CD , cardinal number , NUM compound
billion , CD , cardinal number , NUM pobj


Step 3/10: 

Dependency Parsing: extracting the dependency parse of a sentence to represent its grammatical structure.

The parse describes the relationships between headwords and their dependents.

Usually the main verb is the head of the sentence (the root), and every other word depends on it directly or indirectly.
![Alt text](./images/parsing.png)
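
The head/dependent relationship can be sketched as a small data structure in plain Python. The dependency labels follow the notebook's parse of the example sentence, but the shortened sentence, the hand-assigned head indices and the helper `children` are assumptions for illustration; nothing here is read from spaCy.

```python
# Each entry: (token text, dependency label, index of the head token).
# Head indices are hand-assigned for illustration (an assumption).
parse = [
    ("Microsoft", "nsubj", 2),
    ("is", "aux", 2),
    ("planning", "ROOT", 2),  # the root is its own head
    ("to", "aux", 4),
    ("aquire", "xcomp", 2),
    ("stake", "dobj", 4),
]

def children(parse, head_index):
    """Return the texts of all tokens whose head is the token at head_index."""
    return [
        text
        for i, (text, dep, head) in enumerate(parse)
        if head == head_index and i != head_index
    ]

# Find the root, then list the dependents of the root and of "aquire"
root_index = next(i for i, (_, dep, _) in enumerate(parse) if dep == "ROOT")
print(parse[root_index][0], "->", children(parse, root_index))
print(parse[4][0], "->", children(parse, 4))
```

Following the head pointers upward from any token eventually reaches the root, which is what makes the structure a tree.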

In [17]:
for token in doc:
    print(token.text,",",token.dep_,",",spacy.explain(token.dep_))

Company , compound , compound
Microsoft , nsubj , nominal subject
is , aux , auxiliary
planning , ROOT , root
to , aux , auxiliary
aquire , xcomp , open clausal complement
stake , dobj , direct object
in , prep , prepositional modifier
Tesla , pobj , object of preposition
for , prep , prepositional modifier
$ , quantmod , modifier of quantifier
23 , compound , compound
billion , pobj , object of preposition


Step 4/10: Lemmatization: reducing inflected word forms to their base form (lemma), e.g. based -> base, is -> be.

In [18]:
# Iterate over the tokens
for token in doc:
    # Print the token and its lemma
    print(token.text, "-->", token.lemma_)

Company --> Company
Microsoft --> Microsoft
is --> be
planning --> plan
to --> to
aquire --> aquire
stake --> stake
in --> in
Tesla --> Tesla
for --> for
$ --> $
23 --> 23
billion --> billion


Step 5/10: Sentence Boundary Detection (SBD): locating the start and end of each sentence, which gives you meaningful processing units.

In [20]:
# Process a text that contains several sentences
doc = nlp("Reliance is looking at buying U.K. based analytics startup for $7 billion.This is India.India is great")
sentences = list(doc.sents)
for sentence in sentences:
    print(sentence)

Reliance is looking at buying U.K. based analytics startup for $7 billion.
This is India.
India is great
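
Why a dedicated component is needed for this becomes clearer when the same text is split naively on every period. The following plain-Python sketch is a deliberately simplistic baseline (an assumption for illustration, not how spaCy works):

```python
text = ("Reliance is looking at buying U.K. based analytics startup "
        "for $7 billion.This is India.India is great")

# Naive approach: treat every period as a sentence boundary
naive = [part.strip() for part in text.split(".") if part.strip()]

print(len(naive))
for part in naive:
    print(repr(part))
```

The naive split cuts the abbreviation "U.K." apart and yields five fragments, whereas the spaCy output above contains three sentences.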


Step 6/10: Named Entity Recognition (NER): labeling spans of text that name a real-world object, such as a person, a country, a product or a book title.

In [22]:
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_,spacy.explain(ent.label_))

Reliance 0 8 ORG Companies, agencies, institutions, etc.
U.K. 30 34 GPE Countries, cities, states
$7 billion 63 73 MONEY Monetary values, including unit
India 82 87 GPE Countries, cities, states
India 88 93 GPE Countries, cities, states
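
The `start_char` and `end_char` values are character offsets into the original text, with the end offset exclusive. A short plain-Python check, using the offsets and labels copied from the output above, shows how they map back to the text:

```python
text = ("Reliance is looking at buying U.K. based analytics startup "
        "for $7 billion.This is India.India is great")

# (start_char, end_char, label) triples copied from the notebook output
spans = [(0, 8, "ORG"), (30, 34, "GPE"), (63, 73, "MONEY"), (82, 87, "GPE"), (88, 93, "GPE")]

# Slicing the text with the offsets recovers each entity string
for start, end, label in spans:
    print(text[start:end], label)
```

Because the offsets refer to the raw string, they can be used to highlight or replace entities without re-tokenizing the text.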


Step 7/10: Entity Detection: identifying important elements such as places and people in a text, which makes it possible to extract the important topics.

In [24]:
doc= nlp(u"""The Amazon rainforest,[a] alternatively, the Amazon Jungle, also known in English as Amazonia, is a moist broadleaf tropical rainforest in the Amazon biome that covers most of the Amazon basin of South America. This basin encompasses 7,000,000 km2 (2,700,000 sq mi), of which 5,500,000 km2 (2,100,000 sq mi) are covered by the rainforest. This region includes territory belonging to nine nations. The majority of the forest is contained within Brazil, with 60% of the rainforest, followed by Peru with 13%, Colombia with 10%, and with minor amounts in Bolivia, Ecuador, French Guiana, Guyana, Suriname, and Venezuela. Four nations have "Amazonas" as the name of one of their first-level administrative regions and France uses the name "Guiana Amazonian Park" for its rainforest protected area. The Amazon represents over half of the planet's remaining rainforests,[2] and comprises the largest and most biodiverse tract of tropical rainforest in the world, with an estimated 390 billion individual trees divided into 16,000 species.[3]
Etymology.The name Amazon is said to arise from a war Francisco de Orellana fought with the Tapuyas and other tribes. The women of the tribe fought alongside the men, as was their custom.[4] Orellana derived the name Amazonas from the Amazons of Greek mythology, described by Herodotus and Diodorus.[4]
History. See also: History of South America § Amazon, and Amazon River § History. Tribal societies are well capable of escalation to all-out wars between tribes. Thus, in the Amazonas, there was perpetual animosity between the neighboring tribes of the Jivaro. Several tribes of the Jivaroan group, including the Shuar, practised headhunting for trophies and headshrinking.[5] The accounts of missionaries to the area in the borderlands between Brazil and Venezuela have recounted constant infighting in the Yanomami tribes. More than a third of the Yanomamo males, on average, died from warfare.[6]""")
entities=[(i, i.label_, i.label) for i in doc.ents]
entities
displacy.render(doc, style = "ent",jupyter = True)


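Step 8/10: Similarity: word vectors (embeddings), i.e. multidimensional representations of words, are compared with each other.
- Built-in tools or spacword2vec can be used for this.
- Note: en_core_web_sm, which is loaded above, ships without static word vectors, so its similarity scores are less reliable; en_core_web_md and en_core_web_lg include them.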

In [27]:
doc= nlp("dog cat banana")

for token in doc:
    print(token.text, token.has_vector, token.vector_norm, token.is_oov)
    
for token1 in doc:
    for token2 in doc:
        print(token1.text, token2.text, token1.similarity(token2))

dog True 6.7392993 True
cat True 7.6955767 True
banana True 8.330785 True
dog dog 1.0
dog cat 0.5957574844360352
dog banana 0.43743896484375
cat dog 0.5957574844360352
cat cat 1.0
cat banana 0.46431881189346313
banana dog 0.43743896484375
banana cat 0.46431881189346313
banana banana 1.0
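
By default, spaCy's `similarity` is a cosine similarity between vectors. The computation itself can be sketched in plain Python; the three-dimensional vectors below are made-up values (hypothetical, not real embeddings), chosen only so that "dog" and "cat" point in a similar direction.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up vectors for illustration only (an assumption)
dog = [0.9, 0.8, 0.1]
cat = [0.8, 0.9, 0.2]
banana = [0.1, 0.2, 0.9]

print("dog/cat   ", cosine_similarity(dog, cat))
print("dog/banana", cosine_similarity(dog, banana))
```

The measure is symmetric and equals 1 for a vector compared with itself, which matches the pattern in the output above.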




Step 9/10: Text Classification: assigning categories or labels to a whole document or to parts of a document.
![Alt text](./images/classification.png)

In [30]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline
# Loading TSV file
df_amazon = pd.read_csv ("./data/amazon_alexa.tsv", sep="\t")
display(df_amazon.head())
print(df_amazon.shape)
print(df_amazon.info())


Unnamed: 0,rating,date,variation,verified_reviews,feedback
0,5,31-Jul-18,Charcoal Fabric,Love my Echo!,1
1,5,31-Jul-18,Charcoal Fabric,Loved it!,1
2,4,31-Jul-18,Walnut Finish,"Sometimes while playing a game, you can answer...",1
3,5,31-Jul-18,Charcoal Fabric,I have had a lot of fun with this thing. My 4 ...,1
4,5,31-Jul-18,Charcoal Fabric,Music,1


(3150, 5)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3150 entries, 0 to 3149
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   rating            3150 non-null   int64 
 1   date              3150 non-null   object
 2   variation         3150 non-null   object
 3   verified_reviews  3149 non-null   object
 4   feedback          3150 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 123.2+ KB
None


In [32]:
# Custom tokenizer using spaCy: spacy_tokenizer()
import string
from spacy.lang.en.stop_words import STOP_WORDS

# Load the small English pipeline (tokenizer, tagger, parser, lemmatizer, NER)
nlp = spacy.load('en_core_web_sm')

# Set of punctuation marks
punctuations = set(string.punctuation)
# Set of stop words
stop_words = STOP_WORDS

# Tokenizer function
def spacy_tokenizer(sentence):
    # Run the loaded pipeline so that each token carries a lemma
    # (a blank English() pipeline only tokenizes and has no lemmatizer component)
    doc = nlp(sentence)

    # Lemmatizing, lowercasing and stripping each token
    tokens = [token.lemma_.lower().strip() for token in doc]

    # Removing empty strings, stop words and punctuation
    tokens = [word for word in tokens if word and word not in stop_words and word not in punctuations]

    return tokens

In [51]:
# Custom transformer for removing initial and end spaces and converting text to lower case
# Custom class inheriting TransformerMixin

class predictors(TransformerMixin):
    def transform(self,X,**transform_params):
        # Cleaning text
        cleaned_X = [clean_text(text) if isinstance(text, str) else text for text in X]
        return cleaned_X
    
    def fit(self, X, y=None,**fit_params):
        return self
    
    def get_params(self, deep=True):
        return{}

def clean_text(text):
    # Removing spaces and converting text into lowercase
    return text.strip().lower()

In [52]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Classifying text into positive and negative labels is called sentiment analysis
# BoW (Bag of Words): builds a matrix that counts how often each word occurs in each document
# scikit-learn's CountVectorizer produces the BoW matrix
# Here CountVectorizer uses the custom spacy_tokenizer function as its tokenizer; ngram_range=(1,1) keeps single words only
bow_vector = CountVectorizer(tokenizer = spacy_tokenizer, ngram_range=(1,1))

# TF-IDF (Term Frequency-Inverse Document Frequency)
# Weights each word's frequency within a document by how rare the word is across all documents
# The result represents how important a particular term is in the context of a given document
tfidf_vector = TfidfVectorizer(tokenizer = spacy_tokenizer)


# One review is missing (3149 of 3150 non-null), which makes the vectorizer raise
# "ValueError: np.nan is an invalid document"; drop that row before splitting
df_amazon = df_amazon.dropna(subset=['verified_reviews'])

X = df_amazon['verified_reviews'] # the features we want to analyze
ylabels = df_amazon['feedback'] # the labels, or answers, we want to test against

X_train, X_test, y_train, y_test = train_test_split(X, ylabels, test_size=0.3)

classifier = LogisticRegression()

# Create pipeline using Bag of Words
pipe = Pipeline([("cleaner",predictors()),
                 ('vectorizer', bow_vector),
                 ('classifier',classifier)])

# model generation
pipe.fit(X_train,y_train)
display(X_train.head())
display(y_train.head())
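
What the two vectorizers compute can be sketched by hand in plain Python. The three tiny tokenized documents are made up (hypothetical data), and the idf formula is the smoothed variant that scikit-learn documents as its default, ln((1 + n) / (1 + df)) + 1; the final L2 normalization that `TfidfVectorizer` also applies by default is left out to keep the sketch short.

```python
import math
from collections import Counter

# Three tiny made-up documents, already tokenized (hypothetical data)
docs = [["love", "my", "echo"], ["love", "it"], ["music", "love", "music"]]

# Vocabulary: every distinct term, in a fixed (sorted) column order
vocab = sorted({tok for doc in docs for tok in doc})

# Bag of Words: one row per document, one count per vocabulary term
bow = [[Counter(doc)[term] for term in vocab] for doc in docs]

# Document frequency: in how many documents each term occurs
n_docs = len(docs)
df = {term: sum(term in doc for doc in docs) for term in vocab}

# Smoothed inverse document frequency
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

# TF-IDF (unnormalized): raw count times idf
tfidf = [[count * idf[term] for count, term in zip(row, vocab)] for row in bow]

print(vocab)
for row in bow:
    print(row)
print({term: round(value, 3) for term, value in idf.items()})
```

A term such as "love", which occurs in every document, receives the lowest idf weight, while a term concentrated in one document, such as "music", is weighted up.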