# Chapter 1 - spaCy Basics

## Instructions

- Run the cells containing `assert` statements to check your answers. If a cell runs without error, your output matches the expected output. If your output is different, the assertion message gives you a hint.

In [1]:
import spacy

## Exercise 1

When working with spaCy, the first step is to choose and load a language model into your workspace.  Let's use spaCy's small English language model ("en_core_web_sm").  Use spaCy to load this model and save it as `nlp`.

In [2]:
nlp = spacy.load('en_core_web_sm')

type(nlp)

spacy.lang.en.English

_If you're getting an error about being unable to find `en_core_web_sm`, uncomment and run this line first:_

In [3]:
#!python3 -m spacy download en_core_web_sm

## Exercise 2

Now let's use the language model `nlp` to parse a sentence.  We've included an example sentence for you called `sent`.  
1. Pass `sent` through the language model and save the output as `doc`.  
2. Now use the token properties to create a list of all the adjective tokens of `doc` and call this list `adjectives`.  You may find a list comprehension to be helpful.
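
The list-comprehension pattern in step 2 can be sketched without loading a model. The `FakeToken` records below are hypothetical stand-ins, not spaCy objects: they only mimic the `text` and `pos_` attributes, and their tags are typed by hand rather than predicted. The sketch also uses a different sentence, so it illustrates the pattern rather than the answer.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy token: only `text` and `pos_` are mimicked.
FakeToken = namedtuple("FakeToken", ["text", "pos_"])

# Hand-written tags for illustration; a real model would predict these.
fake_doc = [
    FakeToken("A", "DET"),
    FakeToken("tiny", "ADJ"),
    FakeToken("bird", "NOUN"),
    FakeToken("sang", "VERB"),
    FakeToken(".", "PUNCT"),
]

# Keep the token objects themselves (not their text) when the tag matches.
fake_adjectives = [tok for tok in fake_doc if tok.pos_ == "ADJ"]

# Strings are extracted only afterwards, when needed for display.
fake_adjective_texts = [tok.text for tok in fake_adjectives]
```

Filtering on the attribute while keeping the objects means each element of the result still carries its other properties, which is why the checks in this exercise expect tokens rather than strings.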

In [4]:
sent = "The adorable kittens played with a large ball of blue yarn."

### BEGIN SOLUTION
doc = nlp(sent)
adjectives = [token for token in doc if token.pos_ == 'ADJ']
### END SOLUTION

adjectives

[adorable, large, blue]

In [5]:
### CHECK YOUR OUTPUT WITH THE ANSWER

assert type(doc) == spacy.tokens.doc.Doc, "Be sure that doc is a spacy document.  You should use the language model to parse the sentence provided."
assert doc.text == sent, "Be sure to use the language model to parse the sentence provided as sent."
assert type(adjectives) == list, "adjectives should be a Python list."
assert type(adjectives[0]) == spacy.tokens.token.Token, "Be sure that your adjectives list contains spaCy tokens as its elements.  The elements should not be strings."

In [6]:
### BEGIN HIDDEN TESTS
test_doc = nlp(sent)
test_adjs = [token for token in test_doc if token.pos_ == 'ADJ']
assert len(doc) == len(test_doc)
for token, test_token in zip(doc, test_doc):
    assert token.pos_ == test_token.pos_
for token, test_token in zip(adjectives, test_adjs):
    assert token.pos_ == 'ADJ'
    assert token.text == test_token.text
### END HIDDEN TESTS

## Exercise 3

Now we will use spaCy to do a bit of pre-processing.  Use `doc`, which is the parsed spaCy document for `sent`, to create a Python string that you will save as `sent_cleaned`.  `sent_cleaned` should not have any stop words or punctuation; furthermore, `sent_cleaned` should contain the remaining lemmatized, lowercase text from `sent`.  

You will likely want to do the following to create `sent_cleaned`:
1. Filter the stop words out of `doc`
2. Filter the punctuation out of `doc` (remember that the punctuation part of speech is "PUNCT")
3. Extract the lemma of each remaining token (for this sentence, the lemmas are already lowercase)
4. Save these lemmas together in a Python string (don't forget to join these with a space!)
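
The four steps above amount to a filter followed by a join. Here is a minimal sketch of that shape, again using hypothetical `FakeToken` stand-ins with hand-written `lemma_`, `pos_`, and `is_stop` values instead of real spaCy tokens, and a different sentence from the exercise.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy token; values are written by hand.
FakeToken = namedtuple("FakeToken", ["lemma_", "pos_", "is_stop"])

fake_doc = [
    FakeToken("the", "DET", True),
    FakeToken("dog", "NOUN", False),
    FakeToken("be", "AUX", True),
    FakeToken("bark", "VERB", False),
    FakeToken("loudly", "ADV", False),
    FakeToken("!", "PUNCT", False),
]

# Steps 1-3: drop stop words and punctuation, keep the lowercase lemma.
kept_lemmas = [
    tok.lemma_.lower()
    for tok in fake_doc
    if not tok.is_stop and tok.pos_ != "PUNCT"
]

# Step 4: join with single spaces.
fake_cleaned = " ".join(kept_lemmas)
```

Because `str.join` places the separator only between items, the result has no leading or trailing space to strip afterwards.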

In [7]:
### BEGIN SOLUTION
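# Keep the lemmas of tokens that are neither stop words nor punctuation.
# Iterate over the whole doc: slicing off the last token is unnecessary,
# because the punctuation filter already removes the final period.
lemmas = [token.lemma_ for token in doc
          if not token.is_stop and token.pos_ != 'PUNCT']
# Joining with a space leaves no trailing separator to strip.
sent_cleaned = ' '.join(lemmas)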
### END SOLUTION

sent_cleaned

'adorable kitten play large ball blue yarn'

In [8]:
### CHECK YOUR OUTPUT WITH THE ANSWER

assert type(sent_cleaned) == str, "Be sure that sent_cleaned is a Python string."
assert len(sent_cleaned.split(' ')) == 7, "You should find seven words that are not stop words or punctuation.  Be sure to separate these with spaces and that you do not have any trailing spaces."
assert 'kittens' not in sent_cleaned, "Be sure that your remaining text has been lemmatized.  The lemma for kittens should be kitten."

In [9]:
### BEGIN HIDDEN TESTS

## Tolerate a leading or trailing space left in the student's answer
assert sent_cleaned.strip() == 'adorable kitten play large ball blue yarn'

### END HIDDEN TESTS