<h2>Natural Language Processing Using spaCy and Python</h2>

Natural Language Processing (NLP) is a field focused on enabling computers to understand and process human language. 

<h4>Tokenization in NLP</h4>

Tokenization is the process of breaking text into pieces called tokens, such as words, punctuation marks, and the parts of contractions. spaCy's tokenizer takes Unicode text as input and outputs a sequence of token objects. Punctuation is not discarded: each mark becomes its own token, as the output of the first cell shows.
Types of tokenization:
  <ul>
      <li>Word Tokenization</li>
      This method divides text into individual words. It's the most common form of tokenization and works well for languages with clear word boundaries. Example: with punctuation dropped, the sentence "Natural language processing is fascinating." becomes ["Natural", "language", "processing", "is", "fascinating"].
      <li>Sentence Tokenization</li>
      This technique splits text into sentences rather than words. It’s useful for tasks that require analyzing the structure of a document.
      <li>Character Tokenization</li>
      This method divides text into individual characters. Example: "NLP" becomes ["N", "L", "P"].
      <li>Subword Tokenization</li>
      This method breaks text into units that are larger than single characters but smaller than full words. It’s useful for handling out-of-vocabulary words in NLP tasks. Example: the word "unhappiness" could be tokenized into ["un", "happiness"].
      <li>Custom Tokenization</li>
      In some cases, custom rules are created based on specific requirements, such as tokenizing hashtags or identifying domain-specific phrases.
  </ul>
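
The character, subword, and custom types can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper names and the tiny `TOY_VOCAB` are hypothetical, and the greedy longest-match rule is one simple way to split a word into subwords, not how spaCy or any particular library does it.

```python
import re


def char_tokenize(text):
    """Character tokenization: every character becomes a token."""
    return list(text)


def subword_tokenize(word, vocab):
    """Greedy longest-match split of `word` into pieces found in `vocab`.

    Falls back to a single character when no vocabulary piece matches.
    """
    pieces, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            pieces.append(word[start])
            start += 1
    return pieces


def hashtag_tokenize(text):
    """Custom rule: keep a hashtag such as #NLP together as one token."""
    return re.findall(r"#\w+|\w+|[^\w\s]", text)


# Hypothetical toy vocabulary, invented for this example.
TOY_VOCAB = {"un", "happiness", "happy"}

chars = char_tokenize("NLP")
subwords = subword_tokenize("unhappiness", TOY_VOCAB)
hashtags = hashtag_tokenize("Learning #NLP is fun!")

print(chars)
print(subwords)
print(hashtags)
```
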
The rest of this section covers only word and sentence tokenization.

In [1]:
# Word tokenization
from spacy.lang.en import English

# Create a blank English pipeline. It contains only the tokenizer
# (no tagger, parser, NER, or word vectors).
nlp = English()
text = """When learning data science, you shouldn't get discouraged!
Challenges and setbacks aren't failures, they're just part of the journey. You've got this!"""
# Calling the "nlp" object on text creates a Doc, a sequence of Token objects.
token_doc = nlp(text)
token_list = [token.text for token in token_doc]
print(token_list)

['When', 'learning', 'data', 'science', ',', 'you', 'should', "n't", 'get', 'discouraged', '!', '\n', 'Challenges', 'and', 'setbacks', 'are', "n't", 'failures', ',', 'they', "'re", 'just', 'part', 'of', 'the', 'journey', '.', 'You', "'ve", 'got', 'this', '!']


- Notice that spaCy produces a list that contains each token as a separate item: words, punctuation marks, the newline character, and the pieces of contractions such as `n't` and `'re`.

In [2]:
# Sentence tokenization
# The blank pipeline has no parser, so add the rule-based "sentencizer" component.
nlp.add_pipe("sentencizer")
text = """When learning data science, you shouldn't get discouraged!
Challenges and setbacks aren't failures, they're just part of the journey. You've got this!"""
doc = nlp(text)
sentences = list(doc.sents)
for i, sentence in enumerate(sentences):
    print(f"Sentence {i + 1}: {sentence.text}")

Sentence 1: When learning data science, you shouldn't get discouraged!
Sentence 2: 
Challenges and setbacks aren't failures, they're just part of the journey.
Sentence 3: You've got this!


Again, spaCy has split the text the way we want, this time into the sentences found in our source text. Sentence 2 starts on a new line because the newline character that separates the two input lines is kept as its first token.



<h4>Cleaning Text Data: Removing Stopwords</h4>

Most text data that we work with is going to contain a lot of words that aren't actually useful to us. These words, called stopwords, are useful in human speech, but they don't contribute much to data analysis. Removing stopwords helps us eliminate noise and distraction from our text data, and also shortens analysis time (since there are fewer words to process).

In [3]:
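import spacy
# Import the stop-word set explicitly: `import spacy` on its own does not
# guarantee that the `spacy.lang.en` submodule has been loaded.
from spacy.lang.en.stop_words import STOP_WORDS

spacy_stopwords = STOP_WORDS
print("Number of stop words: ", len(spacy_stopwords))
# Print ten stop words (a set is unordered, so which ten appear can vary)
print('First 10 stop words:', list(spacy_stopwords)[:10])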

Number of stop words:  326
First 10 stop words: ['seem', '‘re', 'during', 'moreover', 'therein', 'into', 'they', 'were', 'call', 'very']


In [4]:
# Removing stop words from the data
doc = nlp(text)

# Keep only the tokens that are not stop words
filtered_sent = [word for word in doc if not word.is_stop]
print("Filtered sentence:", filtered_sent)

Filtered sentence: [learning, data, science, ,, discouraged, !, 
, Challenges, setbacks, failures, ,, journey, ., got, !]


Removing the stopwords has boiled our original text down to just a few words (plus punctuation) that give us a good idea of what the sentences are discussing: learning data science, and not being discouraged by challenges and setbacks along that journey.
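
The filtered list above still contains commas, periods, and a newline. One way to drop those as well is to combine `is_stop` with the token attributes `is_punct` and `is_space`. The snippet below is a minimal sketch, assuming spaCy v3 is installed; it uses `spacy.blank("en")`, a tokenizer-only pipeline that needs no model download, and a shortened sample sentence.

```python
import spacy

# Tokenizer-only English pipeline; no trained model is required for
# the lexical flags is_stop, is_punct, and is_space.
nlp = spacy.blank("en")

text = "Challenges and setbacks aren't failures, they're just part of the journey."
doc = nlp(text)

# Keep tokens that are neither stop words, punctuation, nor whitespace.
content_words = [
    token.text
    for token in doc
    if not (token.is_stop or token.is_punct or token.is_space)
]
print(content_words)
```
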



<h4>Lexicon Normalization</h4>

Lexicon normalization is another step in the text data cleaning process. In the big picture, normalization maps the different surface forms of a word to a single canonical form, which shrinks the vocabulary (and therefore the number of features) a machine learning model has to handle. For our purposes here, we're only going to look at lemmatization, a way of processing words that reduces them to their lemmas, or dictionary forms.



In [5]:
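# Implementing lemmatization: reducing a word to its base (dictionary) form.
# Note: `nlp` is still the blank English() pipeline created earlier, which has
# no lemmatizer component, so `lemma_` comes back empty in the output below.
# A trained pipeline such as en_core_web_sm (loaded in the POS tagging cell)
# includes a lemmatizer and fills in `lemma_`.
lemma = nlp("run runs running runner")
for word in lemma:
    print(word.text, word.lemma_)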

run 
runs 
running 
runner 
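
To make the idea of a lemma concrete, here is a rough illustration that maps inflected forms to a shared dictionary form. It is a toy sketch only: `TOY_LEMMAS` and `toy_lemmatize` are hypothetical names, and the hand-written lookup table stands in for the far larger tables and rules that a real lemmatizer relies on.

```python
# Hypothetical toy lookup table, invented for illustration only.
TOY_LEMMAS = {"runs": "run", "running": "run", "ran": "run"}


def toy_lemmatize(word):
    """Return the lemma from the toy table, or the lowercased word itself."""
    lowered = word.lower()
    return TOY_LEMMAS.get(lowered, lowered)


words = "run runs running runner".split()
lemmas = [toy_lemmatize(word) for word in words]

for word, lemma in zip(words, lemmas):
    print(word, lemma)
```

Here "runner" is left unchanged because it is a different dictionary word from "run", not an inflected form of it.
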


<h4>POS Tagging</h4>
A word's part of speech defines its function within a sentence. A noun, for example, identifies an object. An adjective describes an object. A verb describes action. Identifying and tagging each word's part of speech in the context of a sentence is called Part-of-Speech Tagging, or POS Tagging.

In [6]:
# POS tagging
import en_core_web_sm
# Load the small trained English pipeline, which includes a tagger, parser, and NER
nlp = en_core_web_sm.load()
docs = nlp("She sells sea shells at the sea shore")
for word in docs:
    print(word.text, word.pos_)

She PRON
sells VERB
sea NOUN
shells NOUN
at ADP
the DET
sea NOUN
shore NOUN
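
Once each word carries a tag, simple counting and filtering become possible. The sketch below works on the (word, tag) pairs copied from the output above, so it runs without a model; with a real `Doc` the same pairs would come from `(token.text, token.pos_)`.

```python
from collections import Counter

# (word, tag) pairs copied from the POS tagging output above.
tagged = [
    ("She", "PRON"),
    ("sells", "VERB"),
    ("sea", "NOUN"),
    ("shells", "NOUN"),
    ("at", "ADP"),
    ("the", "DET"),
    ("sea", "NOUN"),
    ("shore", "NOUN"),
]

# How often each part of speech occurs in the sentence.
tag_counts = Counter(tag for _, tag in tagged)

# Only the words tagged as nouns, in sentence order.
nouns = [word for word, tag in tagged if tag == "NOUN"]

print(tag_counts.most_common())
print(nouns)
```
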


<h4>Entity Detection</h4>
Entity detection, also called entity recognition, is a more advanced form of language processing that identifies important elements like places, people, organizations, and languages within an input string of text. This is really helpful for extracting information from text, since you can quickly pick out important topics or identify key sections of text.

In [7]:
from spacy import displacy
nytimes = nlp("""New York City on Tuesday declared a public health emergency and ordered mandatory measles vaccinations amid an outbreak, becoming the latest national flash point over refusals to inoculate against dangerous diseases.

At least 285 people have contracted measles in the city since September, mostly in Brooklyn’s Williamsburg neighborhood. The order covers four Zip codes there, Mayor Bill de Blasio (D) said Tuesday.

The mandate orders all unvaccinated people in the area, including a concentration of Orthodox Jews, to receive inoculations, including for children as young as 6 months old. Anyone who resists could be fined up to $1,000.""")
# Each entity span with its string label and the label's integer ID
entities = [(ent, ent.label_, ent.label) for ent in nytimes.ents]
entities

[(New York City, 'GPE', 384),
 (Tuesday, 'DATE', 391),
 (At least 285, 'CARDINAL', 397),
 (September, 'DATE', 391),
 (Brooklyn, 'GPE', 384),
 (Williamsburg, 'GPE', 384),
 (four, 'CARDINAL', 397),
 (Zip, 'PERSON', 380),
 (Bill de Blasio, 'PERSON', 380),
 (Tuesday, 'DATE', 391),
 (Orthodox, 'NORP', 381),
 (Jews, 'NORP', 381),
 (as young as 6 months old, 'DATE', 391),
 (up to $1,000, 'MONEY', 394)]

The model is not perfect: in the list above it labels "Zip" as a PERSON. Using displaCy we can also visualize our input text, with each identified entity highlighted by color and labeled. We'll use `style="ent"` to tell displaCy that we want to visualize entities here.



In [8]:
displacy.render(nytimes, style="ent", jupyter=True)
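
Outside a notebook, `displacy.render` can hand back the markup as a string instead of displaying it. The snippet below is a minimal sketch, assuming spaCy v3: it uses a blank pipeline, which has no NER component, so the two entity spans are assigned by hand purely for illustration and are not model predictions.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

# Blank pipeline: tokenizer only, no trained NER model required.
nlp = spacy.blank("en")
doc = nlp("New York City declared a public health emergency on Tuesday.")

# Hand-assigned entities for illustration (token indices 0-2 and 9).
doc.set_ents([
    Span(doc, 0, 3, label="GPE"),
    Span(doc, 9, 10, label="DATE"),
])

# jupyter=False makes render() return the HTML markup as a string.
html = displacy.render(doc, style="ent", jupyter=False)
print(html[:200])
```

The returned string can then be written to an `.html` file or embedded in a web page.
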

<h4>Dependency Parsing</h4>
Dependency parsing is a language processing technique that helps us determine the meaning of a sentence by analyzing how its individual words relate to each other grammatically.

In [9]:
doc = nlp("Lao Zhu sought enlightment, and he achieved it")
for chunk in doc.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep_, chunk.root.head.text)

Lao Zhu Zhu nsubj sought
enlightment enlightment dobj sought
he he nsubj achieved
it it dobj achieved
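
Each output row reads as: noun chunk, the chunk's root word, the root's dependency label, and the head word the root attaches to. Grouping the rows by head word recovers who did what. The sketch below reuses the rows printed above as plain tuples (including the input's spelling "enlightment"), so it runs without a model; the variable names are illustrative.

```python
from collections import defaultdict

# (chunk text, dependency label, head word) copied from the output above.
rows = [
    ("Lao Zhu", "nsubj", "sought"),
    ("enlightment", "dobj", "sought"),
    ("he", "nsubj", "achieved"),
    ("it", "dobj", "achieved"),
]

# For every head verb, record which chunk fills each grammatical role.
by_head = defaultdict(dict)
for chunk, dep, head in rows:
    by_head[head][dep] = chunk

for head, roles in by_head.items():
    print(f"{roles.get('nsubj')} --{head}--> {roles.get('dobj')}")
```
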


This output can be a little bit difficult to follow, but since we've already imported the displaCy visualizer, we can use it to view a dependency diagram with `style="dep"` that's much easier to understand:



In [10]:
displacy.render(doc, style="dep", jupyter=True)

<h4>Word Vector Representation</h4>
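A word vector is a numeric representation of a word that communicates its relationship to other words. Note that the small `en_core_web_sm` pipeline used below does not ship with static word vectors; the 96-dimensional values it returns come from the pipeline's context-sensitive tensors, while the larger `en_core_web_md` and `en_core_web_lg` pipelines include real word vectors.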
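
A common way to read a "relationship" off two vectors is cosine similarity: values near 1 mean the vectors point in nearly the same direction. The sketch below is plain Python, and the three-dimensional `toy_vectors` are hypothetical values invented for illustration, not output from any spaCy model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors, invented for this example.
toy_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

sim_cat_dog = cosine_similarity(toy_vectors["cat"], toy_vectors["dog"])
sim_cat_car = cosine_similarity(toy_vectors["cat"], toy_vectors["car"])

print(f"cat vs dog: {sim_cat_dog:.3f}")
print(f"cat vs car: {sim_cat_car:.3f}")
```

With these toy values, "cat" scores higher against "dog" than against "car", which is the kind of relationship real word vectors are trained to capture.
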

In [11]:
import en_core_web_sm
nlp = en_core_web_sm.load()
dopple = nlp("Doppleganger")
print(dopple.vector.shape)
print(dopple.vector)

(96,)
[-1.1861942e+00 -9.1249442e-01 -1.3897794e-01  7.0208502e-01
  3.7187719e-01 -5.5389905e-01  3.1117851e-01  6.4828175e-01
 -6.4046234e-01 -1.0683224e+00 -5.3949989e-02 -9.6142054e-02
 -3.2191742e-02  7.2896862e-01  1.4663376e-01  1.0728086e+00
 -1.3139668e+00 -4.3733674e-01  5.3877443e-01 -2.2522332e-01
 -1.3220045e-01  1.2192971e+00 -7.5704598e-01 -7.1275020e-01
  4.0038523e-01  1.6996339e-01  1.0006807e+00 -6.9105172e-01
 -1.3702026e-01  1.5841055e+00 -8.1059825e-01  1.5627509e-01
  2.8953835e-01  1.0808635e-01 -8.2277262e-01 -6.5260947e-02
 -7.1238708e-01  1.1303772e-01  8.5052550e-01  4.9783963e-01
  1.5338853e-01 -1.0406020e+00  4.6463126e-01  1.8587688e-01
 -1.5972078e-02 -8.1118721e-01 -1.2280668e+00 -2.6232600e-03
  2.8735816e-02 -1.3073421e+00  8.2931018e-01  1.1444800e+00
 -2.0533168e-01 -4.1878861e-01  6.5064961e-01  1.9696614e-01
  1.2446871e+00 -1.6359596e-01  4.2302164e-01 -4.2493194e-01
 -1.0849035e+00  2.3684630e-01  3.6907226e-02  3.7950069e-01
 -4.1121662e-02  5