**Text Processing with spaCy**


In [5]:
docs = ["Kaggle provides notebooks for python.",
       "Python is an easy language.",
       "Kaggle provides many datasets."]

The list of **stop words and punctuation marks** that will be removed.

In [6]:
import spacy
# Creating the set of stop words
from spacy.lang.en.stop_words import STOP_WORDS
stop_words = STOP_WORDS

# Creating the string of punctuation marks
import string
punctuations = string.punctuation

print(stop_words)
print("\n===================\n")
print(punctuations)

{'the', 'make', 'whatever', '‘m', 'thereafter', 'very', 'whenever', 'you', 'once', 'any', 'which', 'thru', 'did', 'mostly', 'seemed', 'either', 'keep', 'about', 'beside', 'during', 'name', 'their', 'myself', 'whereafter', 'will', 'former', 'been', 'whence', 'whether', 'perhaps', "'d", 'before', 'yourself', 'over', 'yet', 'seeming', 'can', 'serious', 'becomes', 'from', 'using', 'really', 'much', 'ca', 'yours', 'more', 'back', '‘ll', 'then', 'whose', 'one', 'who', 'done', 'be', 'within', 'fifty', 'those', 'please', 'made', 'latterly', 'upon', 'everywhere', 'whole', 'and', 'had', 'none', 'have', 'third', 'whoever', 'meanwhile', 'except', 'is', 'together', 'same', 'though', 'such', 'mine', 'three', 'on', 'ours', 'regarding', 'all', 'somehow', 'almost', 'enough', 'to', 'were', 'her', 'after', 'whereupon', 'whom', 'often', 'six', 'give', 'top', 'into', 'namely', 'moreover', 'nevertheless', 'are', 'ever', 'should', '‘ve', 'that', 'show', 'anything', 'again', 'a', 'however', 'part', 'bottom', 

en_core_web_sm is a small English model for the spaCy library. It provides basic functionality for processing English text, such as tokenization, part-of-speech tagging, and named entity recognition. This model is useful for tasks where you need a lightweight and fast solution, though it may be less accurate compared to larger models. If you're working on NLP tasks with spaCy and need something quick and efficient, en_core_web_sm is a good choice.

Load the English model with spaCy.

In [7]:
nlp = spacy.load('en_core_web_sm')


**spaCy vs NLTK**
* NLTK was built mainly for teaching and research; spaCy is built for production use.
* NLTK offers many ways to do the same thing; spaCy generally offers one.
* spaCy is generally faster than NLTK.
* NLTK supports more human languages than spaCy.

Apply the English NLP model to the given documents. Each document is tokenized, and lexical and syntactic features are assigned to its tokens.

In [8]:
prc_docs = [nlp(doc) for doc in docs]
print(prc_docs)

#prc_docs = []
#for doc in docs:
  #prc_docs.append(nlp(doc))

for tok in prc_docs[0]:
  print(tok)


[Kaggle provides notebooks for python., Python is an easy language., Kaggle provides many datasets.]
Kaggle
provides
notebooks
for
python
.
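To make the "lexical features" more concrete, here is a minimal sketch that prints a few per-token flags. It assumes a tokenizer-only pipeline created with `spacy.blank("en")`, so no model download is needed; the names `nlp_blank` and `content` are illustrative only. Attributes such as `pos_`, `dep_`, and `lemma_` are not filled in by a blank pipeline and need a trained one such as `en_core_web_sm`.

```python
import spacy

# Tokenizer-only English pipeline (an assumption for this sketch;
# the notebook itself uses the trained en_core_web_sm pipeline).
nlp_blank = spacy.blank("en")
doc = nlp_blank("Kaggle provides notebooks for python.")

# Lexical attributes are available without any trained components.
for tok in doc:
    print(f"{tok.text:<10} lower={tok.lower_:<10} "
          f"is_alpha={tok.is_alpha} is_stop={tok.is_stop} is_punct={tok.is_punct}")

# The same flags can be used to keep only content tokens.
content = [tok.lower_ for tok in doc if not tok.is_stop and not tok.is_punct]
print(content)
```

The flags `is_stop` and `is_punct` offer an alternative to comparing token strings against separate stop-word and punctuation collections.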


Lemmatization: transforming each word into its root form (lemma).

In [16]:
print("Before: ", prc_docs)
token_docs = [ [tok.lemma_.lower().strip() for tok in prc_doc] for prc_doc in prc_docs]

#token_docs = []
#for prc_doc in prc_docs:
  #lemmas = []
  #for tok in prc_doc:
    #lemmas.append(tok.lemma_.lower().strip())
  #token_docs.append(lemmas)



print("\nAfter: ", token_docs)

Before:  [Kaggle provides notebooks for python., Python is an easy language., Kaggle provides many datasets.]

After:  [['kaggle', 'provide', 'notebook', 'for', 'python', '.'], ['python', 'be', 'an', 'easy', 'language', '.'], ['kaggle', 'provide', 'many', 'dataset', '.']]


Removing stop words and punctuation.

In [17]:
print("Before: ", token_docs)
token_docs = [ [tok for tok in token_doc if (tok not in stop_words and tok not in punctuations)] for token_doc in token_docs]
print("\nAfter: ",token_docs)

Before:  [['kaggle', 'provide', 'notebook', 'for', 'python', '.'], ['python', 'be', 'an', 'easy', 'language', '.'], ['kaggle', 'provide', 'many', 'dataset', '.']]

After:  [['kaggle', 'provide', 'notebook', 'python'], ['python', 'easy', 'language'], ['kaggle', 'provide', 'dataset']]
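One detail of the filter above: `punctuations` is a plain string, so `tok not in punctuations` is a substring test. That works for single marks such as `.`, but multi-character tokens like `...` slip through. The sketch below compares the two behaviours on a made-up token list; the helper `is_punctuation` is hypothetical and not part of the notebook.

```python
import string

punct_chars = set(string.punctuation)

def is_punctuation(token):
    """True if the token is non-empty and consists only of punctuation characters."""
    return bool(token) and all(ch in punct_chars for ch in token)

# Illustrative tokens, including multi-character punctuation and an empty string.
tokens = ["kaggle", ".", "...", "!?", "", "python"]

# Substring test against the raw string, as in the cell above.
kept_substring = [t for t in tokens if t not in string.punctuation]

# Character-wise test; empty tokens are dropped explicitly.
kept_charwise = [t for t in tokens if t and not is_punctuation(t)]

print(kept_substring)
print(kept_charwise)
```

For the three short documents in this notebook both approaches give the same result, since the only punctuation token is a single period.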


Rebuilding strings from the tokens so that they can be passed to the vectorizers.

In [18]:
docs = [' '.join(token_doc) for token_doc in token_docs]

#docs = []
#for token_doc in token_docs:
  #docs.append(' '.join(token_doc))

print(docs)

['kaggle provide notebook python', 'python easy language', 'kaggle provide dataset']


Transforming the documents into count vectors: each dimension corresponds to one vocabulary term, and each value is the number of times that term occurs in the document.

In [19]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
print("Feature Labels")
print(vectorizer.get_feature_names_out())
print("Sparse Matrix")
print(X)
print("Dense Matrix")
print(X.todense())

#vectorizer.vocabulary_
#print(X.toarray())



Feature Labels
['dataset' 'easy' 'kaggle' 'language' 'notebook' 'provide' 'python']
Sparse Matrix
  (0, 2)	1
  (0, 5)	1
  (0, 4)	1
  (0, 6)	1
  (1, 6)	1
  (1, 1)	1
  (1, 3)	1
  (2, 2)	1
  (2, 5)	1
  (2, 0)	1
Dense Matrix
[[0 0 1 0 1 1 1]
 [0 1 0 1 0 0 1]
 [1 0 1 0 0 1 0]]
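To show what the count matrix contains, the following sketch rebuilds it with the standard library only. It assumes whitespace splitting is enough, which holds here because the strings were already cleaned; `CountVectorizer` itself applies its own tokenization and lowercasing, so this is an illustration of the idea rather than a re-implementation.

```python
from collections import Counter

docs = ["kaggle provide notebook python",
        "python easy language",
        "kaggle provide dataset"]

# Split on whitespace; sufficient for these pre-cleaned strings.
tokenized = [doc.split() for doc in docs]

# The vocabulary is the alphabetically sorted set of distinct terms.
vocabulary = sorted({tok for toks in tokenized for tok in toks})

# One row per document, one column per vocabulary term.
matrix = []
for toks in tokenized:
    counts = Counter(toks)
    matrix.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

The rows match the dense matrix printed above, column for column.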


In [20]:
vectorizer2 = CountVectorizer(max_features=4, ngram_range=(1,2))
X2 = vectorizer2.fit_transform(docs)
print("Feature Labels")
print(vectorizer2.get_feature_names_out())
print("Sparse Matrix")
print(X2)
print("Dense Matrix")
print(X2.todense())

Feature Labels
['kaggle' 'kaggle provide' 'provide' 'python']
Sparse Matrix
  (0, 0)	1
  (0, 2)	1
  (0, 3)	1
  (0, 1)	1
  (1, 3)	1
  (2, 0)	1
  (2, 2)	1
  (2, 1)	1
Dense Matrix
[[1 1 1 1]
 [0 0 0 1]
 [1 1 1 0]]
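In the cell above, `ngram_range=(1,2)` adds two-word sequences (bigrams) to the single words, and `max_features=4` keeps only the four terms with the highest total count across the corpus. A rough standard-library sketch of that selection follows; the helper `ngrams` is hypothetical, and the way ties at the cut-off are broken may differ from scikit-learn (no such tie occurs in this example).

```python
from collections import Counter

docs = ["kaggle provide notebook python",
        "python easy language",
        "kaggle provide dataset"]

def ngrams(tokens, n_min=1, n_max=2):
    """Return all contiguous n-grams with n_min <= n <= n_max, joined by spaces."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

doc_grams = [ngrams(doc.split()) for doc in docs]

# Total frequency of every n-gram across the whole corpus.
totals = Counter(g for grams in doc_grams for g in grams)

# Keep the 4 most frequent n-grams, sorted alphabetically for column order.
top_terms = sorted(term for term, _ in totals.most_common(4))

# Count matrix restricted to the selected terms.
matrix = [[Counter(grams)[t] for t in top_terms] for grams in doc_grams]

print(top_terms)
for row in matrix:
    print(row)
```

Only `kaggle`, `provide`, `python`, and `kaggle provide` occur twice; every other unigram or bigram occurs once, which is why exactly these four survive.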


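TF-IDF multiplies the Term Frequency (TF) by the Inverse Document Frequency (IDF).
IDF reduces the weight of terms that occur in the majority of the documents.
In its basic form, IDF(w) = log(N / df(w)), where N is the number of documents and df(w) is the number of documents in which the word w appears.
By default, scikit-learn's `TfidfVectorizer` uses the smoothed variant IDF(w) = ln((1 + N) / (1 + df(w))) + 1 and then scales each document vector to unit Euclidean (L2) length.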


In [21]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)
print("Feature Labels")
print(vectorizer.get_feature_names_out())
print("Sparse Matrix")
print(X)
print("Dense Matrix")
print(X.todense())

Feature Labels
['dataset' 'easy' 'kaggle' 'language' 'notebook' 'provide' 'python']
Sparse Matrix
  (0, 6)	0.4598535287588349
  (0, 4)	0.6046521283053111
  (0, 5)	0.4598535287588349
  (0, 2)	0.4598535287588349
  (1, 3)	0.6227660078332259
  (1, 1)	0.6227660078332259
  (1, 6)	0.4736296010332684
  (2, 0)	0.680918560398684
  (2, 5)	0.5178561161676974
  (2, 2)	0.5178561161676974
Dense Matrix
[[0.         0.         0.45985353 0.         0.60465213 0.45985353
  0.45985353]
 [0.         0.62276601 0.         0.62276601 0.         0.
  0.4736296 ]
 [0.68091856 0.         0.51785612 0.         0.         0.51785612
  0.        ]]
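The numbers above can be reproduced by hand. The sketch below assumes the default `TfidfVectorizer` settings: raw counts as TF, the smoothed IDF ln((1 + N) / (1 + df)) + 1, and L2 normalization of each row. It uses only the standard library and whitespace tokenization, so it is an illustration for this small corpus rather than a general replacement.

```python
import math
from collections import Counter

docs = ["kaggle provide notebook python",
        "python easy language",
        "kaggle provide dataset"]

tokenized = [doc.split() for doc in docs]
vocabulary = sorted({tok for toks in tokenized for tok in toks})
n_docs = len(docs)

# Document frequency: number of documents containing each term.
df = {term: sum(term in toks for toks in tokenized) for term in vocabulary}

# Smoothed inverse document frequency (assumed scikit-learn default).
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocabulary}

tfidf = []
for toks in tokenized:
    counts = Counter(toks)
    # TF (raw count) times IDF for every vocabulary term.
    row = [counts[term] * idf[term] for term in vocabulary]
    # Scale the row to unit Euclidean length.
    norm = math.sqrt(sum(v * v for v in row))
    tfidf.append([v / norm for v in row])

print(vocabulary)
for row in tfidf:
    print([round(v, 8) for v in row])
```

In the first document, `notebook` appears in only one of the three documents, so it receives a higher weight (about 0.605) than `kaggle`, `provide`, and `python` (about 0.460 each), which each appear in two.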


In [22]:
vectorizer2 = TfidfVectorizer(max_features=4, ngram_range=(1,2))
X2 = vectorizer2.fit_transform(docs)
print("Feature Labels")
print(vectorizer2.get_feature_names_out())
print("Sparse Matrix")
print(X2)
print("Dense Matrix")
print(X2.todense())


Feature Labels
['kaggle' 'kaggle provide' 'provide' 'python']
Sparse Matrix
  (0, 1)	0.5
  (0, 3)	0.5
  (0, 2)	0.5
  (0, 0)	0.5
  (1, 3)	1.0
  (2, 1)	0.5773502691896257
  (2, 2)	0.5773502691896257
  (2, 0)	0.5773502691896257
Dense Matrix
[[0.5        0.5        0.5        0.5       ]
 [0.         0.         0.         1.        ]
 [0.57735027 0.57735027 0.57735027 0.        ]]
