# Text classification using spaCy NLP package.

- Author : Manu Nellutla
- Date   : Aug 16,2020

We will use the spaCy package to classify text and also do sentiment analysis.

In [1]:
# install the required packages
!pip install spacy

Collecting spacy
  Downloading spacy-3.2.2-cp39-cp39-macosx_10_9_x86_64.whl (6.3 MB)
Collecting cymem<2.1.0,>=2.0.2
  Downloading cymem-2.0.6-cp39-cp39-macosx_10_9_x86_64.whl (32 kB)
Collecting spacy-loggers<2.0.0,>=1.0.0
  Using cached spacy_loggers-1.0.1-py3-none-any.whl (7.0 kB)
Collecting murmurhash<1.1.0,>=0.28.0
  Downloading murmurhash-1.0.6-cp39-cp39-macosx_10_9_x86_64.whl (18 kB)
Collecting pathy>=0.3.5
  Using cached pathy-0.6.1-py3-none-any.whl (42 kB)
Collecting spacy-legacy<3.1.0,>=3.0.8
  Using cached spacy_legacy-3.0.8-py2.py3-none-any.whl (14 kB)
Collecting typer<0.5.0,>=0.3.0
  Using cached typer-0.4.0-py3-none-any.whl (27 kB)
Collecting srsly<3.0.0,>=2.4.1
  Downloading srsly-2.4.2-cp39-cp39-macosx_10_9_x86_64.whl (452 kB)
Collecting preshed<3.1.0,>=3.0.2
  Downloading preshed-3.0.6-cp39-cp39-macosx_10_9_x86_64.whl (106 k

In [2]:
# Text to analyze; let's split it into words.

text ="Neuro-linguistic programming was developed in the 1970's at the University of California, Santa Cruz. Its primary founders are John Grinder, a linguist, and Richard Bandler, an information scientist and mathematician. Judith DeLozier and Leslie Cameron-Bandler also contributed significantly to the field, as did David Gordon and Robert Dilts."

text.split()

['Neuro-linguistic',
 'programming',
 'was',
 'developed',
 'in',
 'the',
 "1970's",
 'at',
 'the',
 'University',
 'of',
 'California,',
 'Santa',
 'Cruz.',
 'Its',
 'primary',
 'founders',
 'are',
 'John',
 'Grinder,',
 'a',
 'linguist,',
 'and',
 'Richard',
 'Bandler,',
 'an',
 'information',
 'scientist',
 'and',
 'mathematician.',
 'Judith',
 'DeLozier',
 'and',
 'Leslie',
 'Cameron-Bandler',
 'also',
 'contributed',
 'significantly',
 'to',
 'the',
 'field,',
 'as',
 'did',
 'David',
 'Gordon',
 'and',
 'Robert',
 'Dilts.']

## Normal split of text

**Using a programmatic split.**

In the output above, `split()` broke the text only on the whitespace between words, so apostrophes, commas, and periods stay attached to the words (for example `'California,'` and `'Cruz.'`).
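To make the effect of whitespace splitting concrete, here is a minimal sketch using only the Python standard library. The sample sentence is a fragment of the text above; the variable names are illustrative.

```python
import string

sample = "University of California, Santa Cruz."

# str.split() with no arguments splits on runs of whitespace only,
# so punctuation stays glued to the neighbouring word.
words = sample.split()
print(words)

# Words that still end with a punctuation character
with_punct = [w for w in words if w[-1] in string.punctuation]
print(with_punct)
```

Any later step that compares words (counting, stop word lookup) would treat `'California,'` and `'California'` as different strings, which is why a proper tokenizer helps.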

## Let's check how spaCy does it

**Using the spaCy English pipeline to split.**


In [3]:
import spacy
import sys

# Download the small English pipeline. The 'en' shortcut is deprecated as of
# spaCy v3.0, so use the full package name 'en_core_web_sm' instead.
!{sys.executable} -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.2.0/en_core_web_sm-3.2.0-py3-none-any.whl (13.9 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.2.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [8]:
# Import the necessary packages
from spacy.lang.en import English

# Load the small English pipeline downloaded above
nlp_lang = spacy.load("en_core_web_sm")

# spaCy uses its English language rules to split the text into tokens
nlp_split = nlp_lang(text)

# Create a list of word tokens
token_list = [token.text for token in nlp_split]

display(len(token_list))
token_list


61

['Neuro',
 '-',
 'linguistic',
 'programming',
 'was',
 'developed',
 'in',
 'the',
 '1970',
 "'s",
 'at',
 'the',
 'University',
 'of',
 'California',
 ',',
 'Santa',
 'Cruz',
 '.',
 'Its',
 'primary',
 'founders',
 'are',
 'John',
 'Grinder',
 ',',
 'a',
 'linguist',
 ',',
 'and',
 'Richard',
 'Bandler',
 ',',
 'an',
 'information',
 'scientist',
 'and',
 'mathematician',
 '.',
 'Judith',
 'DeLozier',
 'and',
 'Leslie',
 'Cameron',
 '-',
 'Bandler',
 'also',
 'contributed',
 'significantly',
 'to',
 'the',
 'field',
 ',',
 'as',
 'did',
 'David',
 'Gordon',
 'and',
 'Robert',
 'Dilts',
 '.']

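The split is quite different from the plain `split()` result: spaCy separates punctuation, hyphens, and the possessive `'s` into their own tokens, which gives 61 tokens. Before using this...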
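The difference can be roughly imitated with a regular expression that treats each punctuation character as its own token. This is only a sketch of the idea using the standard library, not spaCy's tokenizer, which applies language-specific rules and exceptions (for instance, it keeps `'s` together as one token, which this pattern does not).

```python
import re

sample = "University of California, Santa Cruz."

# \w+ matches a run of word characters; [^\w\s] matches a single
# character that is neither a word character nor whitespace (punctuation).
rough_tokens = re.findall(r"\w+|[^\w\s]", sample)
print(rough_tokens)

# Compare with whitespace splitting: punctuation now forms separate tokens
print(len(sample.split()), len(rough_tokens))
```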

## You can also split by sentences

**Using the `sentencizer` pipeline component.**
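As a point of comparison, a naive sentence splitter can be sketched with a regular expression that breaks after sentence-ending punctuation followed by whitespace. This is an illustrative approximation using the standard library, not spaCy's implementation, and the `sample` string is a shortened version of the text above. It would mis-split abbreviations such as "Dr. Smith".

```python
import re

# Shortened version of the notebook's text, for illustration
sample = (
    "Neuro-linguistic programming was developed in the 1970's. "
    "Its primary founders are John Grinder and Richard Bandler. "
    "Others also contributed to the field."
)

# Split wherever '.', '!' or '?' is followed by whitespace.
# The lookbehind keeps the punctuation attached to its sentence.
sentences = re.split(r"(?<=[.!?])\s+", sample)

for number, sentence in enumerate(sentences, start=1):
    print(number, sentence)
```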


In [9]:
# Use a separate blank English pipeline for sentence splitting, so that
# `nlp_split` from the full pipeline stays available for the later cells.
nlp_sent = English()

# In spaCy v3, add_pipe takes the string name of the registered component
# factory; calling create_pipe first is no longer needed.
nlp_sent.add_pipe("sentencizer")

doc_sents = nlp_sent(text)

# Create a list of sentences
sentence_list = []
for sent in doc_sents.sents:  # doc.sents yields one span per sentence
    sentence_list.append(sent.text)
print(sentence_list)

Now that we can split the text, we need to remove the words that don't provide much context. These are called stop words.

## Stopwords 



In [None]:
from spacy.lang.en.stop_words import STOP_WORDS

spacy_stopwords = STOP_WORDS

# Print the total number of stop words
print(f"Number of stop words: {len(spacy_stopwords)}")

# Print ten stop words (STOP_WORDS is a set, so the order is arbitrary)
print(f"Ten stop words: {list(spacy_stopwords)[:10]}")

Let's filter all the stop words out of our text.

Use **`is_stop == False`** to keep only the tokens that are not stop words.
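The idea behind this filter can be shown without spaCy: keep a set of stop words and drop every token found in it. The tiny `SAMPLE_STOP_WORDS` set below is a hypothetical stand-in chosen for illustration, not spaCy's stop word list.

```python
# Hypothetical, hand-picked stop words for illustration only
SAMPLE_STOP_WORDS = {"was", "in", "the", "at", "of", "a", "an", "and"}

tokens = ["programming", "was", "developed", "in", "the", "1970", "at",
          "the", "University", "of", "California"]

# Keep only tokens whose lowercase form is not a stop word
filtered = [t for t in tokens if t.lower() not in SAMPLE_STOP_WORDS]
print(filtered)
```

Using a set keeps each membership check fast, and lowercasing before the lookup means "The" and "the" are treated the same.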
    

In [5]:
# Filter stop words from our text
display(f"Number of tokens before filtering: {len(nlp_split)}")
words_no_stop = [token for token in nlp_split if not token.is_stop]

display(f"Number of tokens after filtering: {len(words_no_stop)}")

display(words_no_stop)

## Lemmatize the words

**Using `.lemma_`.**
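A lemma is the base (dictionary) form of a word. As a rough illustration of the concept, the sketch below uses a small hand-written lookup table. The table `SAMPLE_LEMMAS` and the helper `simple_lemma` are hypothetical and far simpler than what spaCy's lemmatizer does.

```python
# Hypothetical lookup table mapping inflected forms to their lemmas
SAMPLE_LEMMAS = {
    "was": "be",
    "are": "be",
    "developed": "develop",
    "founders": "founder",
    "contributed": "contribute",
}


def simple_lemma(word: str) -> str:
    """Return the lemma from the table, or the lowercased word if unknown."""
    return SAMPLE_LEMMAS.get(word.lower(), word.lower())


words = ["founders", "are", "Developed", "field"]
lemmas = [simple_lemma(w) for w in words]
print(lemmas)
```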

In [6]:
# Map each remaining token to its lemma, part-of-speech tag, and dependency label
words_no_stop_lemma = {token: [token.lemma_, token.pos_, token.dep_] for token in words_no_stop}

words_no_stop_lemma

## Let's do entity detection

Recognizing entity types such as person, date, etc.

In [7]:
# Identify entities in the text: the span, its label name, and the label's integer ID
entities = [(ent, ent.label_, ent.label) for ent in nlp_split.ents]
entities

### displaCy - a visualizer that helps highlight entities

In [14]:
# Import displaCy
from spacy import displacy

displacy.render(nlp_split, style="ent", jupyter=True)