# Using spaCy
## Preprocessing our data with spaCy to improve our model

Reminder (from spaCy_cheat_sheet):
Processing text with the `nlp` object returns a `Doc` object that holds all information about the tokens, their linguistic features, and their relationships.
```
doc = nlp('this is text to process')
```
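As a minimal sketch of what a `Doc` holds, the snippet below iterates over its tokens and prints a few lexical attributes. It assumes a blank English pipeline (`spacy.blank('en')`, named `nlp_blank` here for illustration) so that it runs without a model download. That pipeline only tokenizes, so attributes that depend on trained components, such as part-of-speech tags, are not populated.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only), used so the example
# runs without downloading a trained model such as en_core_web_md.
nlp_blank = spacy.blank('en')

doc = nlp_blank('this is text to process')

# A Doc behaves like a sequence of Token objects.
tokens = [token.text for token in doc]
print(tokens)

# Lexical attributes are available even without trained components.
for token in doc:
    print(token.text, token.is_alpha, token.is_punct, token.is_stop)
```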

### Here's what we want to do:
1. Use spaCy to write a function that takes one document and returns it processed with `nlp`.

2. Once we have that function, apply it to the whole column of reviews, storing the result in a new column that holds the processed version of each review.

3. Then use this new column to train a model such as multinomial Naive Bayes, KNN, or...
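The first two steps of this plan can be sketched roughly as follows. Everything named here is an assumption for illustration: the blank pipeline `nlp_sketch` stands in for the trained one, the helper `process_review` is hypothetical, and the tiny `reviews` frame with its `review` column is made-up data, not the drug reviews file.

```python
import pandas as pd
import spacy

# Assumption: a blank English pipeline as a stand-in; in the notebook this
# would be the trained pipeline loaded with spacy.load('en_core_web_md').
nlp_sketch = spacy.blank('en')


def process_review(text):
    """Step 1: process one document and return the resulting spaCy Doc."""
    return nlp_sketch(text)


# Hypothetical stand-in data; the column name 'review' is an assumption.
reviews = pd.DataFrame(
    {'review': ['This helped a lot.', 'No change after two weeks.']}
)

# Step 2: apply the function to the whole column and keep the result
# in a new column next to the raw text.
reviews['review_doc'] = reviews['review'].apply(process_review)

print(reviews['review_doc'].iloc[0].text)
```

For a large column, `nlp.pipe` processes texts in batches and may be faster than calling the pipeline once per row.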


In [9]:
import spacy
import re
from spacy.language import Language

In [2]:
import pandas as pd

df = pd.read_csv('../nlp-hackathon/data/drug_reviews_cleaned.csv')

In [3]:
spacy.cli.download('en_core_web_md')

Collecting en-core-web-md==3.7.1
  Using cached https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.7.1/en_core_web_md-3.7.1-py3-none-any.whl (42.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


[citation](https://guides.library.upenn.edu/penntdm/python/spacy)

In [10]:
# load medium pipeline
nlp = spacy.load('en_core_web_md')


In [11]:
type(nlp)

spacy.lang.en.English

In [12]:
# what are our pipeline components?
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
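To see how `pipe_names` reflects what a pipeline actually runs, here is a small sketch that starts from an empty pipeline and adds one component. It assumes a blank English pipeline (called `nlp_min` here) rather than `en_core_web_md`, and uses spaCy's built-in rule-based `sentencizer` in place of the trained parser for sentence splitting.

```python
import spacy

# Assumption: start from a blank English pipeline, which has a tokenizer
# but no components, so no model download is needed.
nlp_min = spacy.blank('en')
print(nlp_min.pipe_names)

# Add spaCy's built-in rule-based sentence splitter.
nlp_min.add_pipe('sentencizer')
print(nlp_min.pipe_names)

# With the sentencizer in place, doc.sents is available.
doc = nlp_min('Here are some words. Here is another sentence.')
sentences = [sent.text for sent in doc.sents]
print(sentences)
```

Going the other way, components of a trained pipeline that are not needed can be skipped at load time, for example `spacy.load('en_core_web_md', disable=['ner'])`, which can speed up processing.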

In [31]:
document = 'hey, there, here are some words and punctuation !!?..;. Here is another sentence with some words in it.'

In [32]:
# create a spaCy document by processing the text with the nlp object
# %time must sit on the same line as the statement it times; on a line of its
# own it times an empty statement and only reports a few microseconds
%time ex = nlp(document)
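Outside IPython the `%time` magic is not available. A rough equivalent, sketched below, wraps the call in `time.perf_counter`. The blank pipeline `nlp_timed` is an assumption that keeps the example runnable without a model download, so the measured time will be much lower than with `en_core_web_md`.

```python
import time

import spacy

# Assumption: a blank English pipeline; the trained pipeline will be slower
# because it also runs the tagger, parser, NER and so on.
nlp_timed = spacy.blank('en')

text = 'hey, there, here are some words and punctuation !!?..;. Here is another sentence with some words in it.'

# Time only the call that builds the Doc.
start = time.perf_counter()
doc = nlp_timed(text)
elapsed = time.perf_counter() - start

print(f'Processed {len(doc)} tokens in {elapsed:.6f} seconds')
```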


In [33]:
type(ex)

spacy.tokens.doc.Doc

In [34]:
ex.text #just the input text

'hey, there, here are some words and punctuation !!?..;. Here is another sentence with some words in it.'

In [35]:
for sent in ex.sents:
    print(sent.text)

hey, there, here are some words and punctuation !!?..;.
Here is another sentence with some words in it.


In [36]:
for chunk in ex.noun_chunks:
    print(chunk.text)

some words
punctuation
another sentence
some words
it


In [37]:
# unique noun chunks
{chunk.text for chunk in ex.noun_chunks}

{'another sentence', 'it', 'punctuation', 'some words'}

In [6]:
# process the example text with the loaded pipeline
# (in the lesson, this was ex = nlp(doc))
ex = nlp(document)

In [8]:
# which components are included in the pipeline downloaded as en_core_web_md?
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [None]:
# define a function that takes in a document and processes it using spaCy
def doc_improver(doc):
    """Run the raw text `doc` through the loaded pipeline and return the Doc."""
    doc_imp = nlp(doc)
    return doc_imp

[citation](https://stackoverflow.com/questions/68166400/how-to-get-spacy-to-read-through-an-entire-column-in-a-data-frame)


In [None]:
doc_improver('Where are we going, Walt Whitman? The doors close in an hour. Which way does your beard point')
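`doc_improver` returns a `Doc`, while most vectorizers and classifiers expect plain strings. One possible next step, sketched here under assumptions, is to turn each `Doc` back into cleaned text. The helper `doc_to_text` is hypothetical, and the blank pipeline `nlp_clean` stands in for the trained one. With `en_core_web_md`, `token.lemma_` could replace `token.lower_`, but the blank pipeline has no lemmatizer.

```python
import spacy

# Assumption: a blank English pipeline so the sketch runs without a model download.
nlp_clean = spacy.blank('en')


def doc_to_text(doc):
    """Join lowercased tokens, skipping punctuation, whitespace and stop words."""
    kept = [
        token.lower_
        for token in doc
        if not (token.is_punct or token.is_space or token.is_stop)
    ]
    return ' '.join(kept)


cleaned = doc_to_text(
    nlp_clean('Where are we going, Walt Whitman? The doors close in an hour.')
)
print(cleaned)
```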