# Customizing NLP Pipeline

**Objective**
- Customize spaCy's nlp pipeline

### Customizing the pipeline

We learned how to customize stop words when making word clouds, but we can also customize a spaCy nlp pipeline. In this lesson, we will demonstrate how to:
- Add stopwords
- Remove stopwords
- Tokenize contractions

Once we have demonstrated each technique, we will define a custom function to create the pipeline.


In [1]:
# Imports
import spacy

In [14]:
# Custom functions
def batch_preprocess_texts(
    texts,
    nlp=None,
    remove_stopwords=True,
    remove_punct=True,
    use_lemmas=False,
    disable=["ner"],
    batch_size=50,
    n_process=-1,
):
    """Efficiently preprocess a collection of texts using nlp.pipe()

    Args:
        texts (collection of strings): collection of texts to process (e.g. df['text'])
        nlp (spacy pipe, optional): spaCy nlp pipe. Defaults to None; if None, it creates a default 'en_core_web_sm' pipe.
        remove_stopwords (bool, optional): Controls stopword removal. Defaults to True.
        remove_punct (bool, optional): Controls punctuation removal. Defaults to True.
        use_lemmas (bool, optional): lemmatize tokens. Defaults to False.
        disable (list of strings, optional): named pipeline elements to disable. Defaults to ["ner"]: Used with nlp.pipe(disable=disable)
        batch_size (int, optional): Number of texts to process in a batch. Defaults to 50.
        n_process (int, optional): Number of CPU processors to use. Defaults to -1 (meaning all CPU cores).

    Returns:
        list of tokens
    """
    # from tqdm.notebook import tqdm
    from tqdm import tqdm

    if nlp is None:
        nlp = spacy.load("en_core_web_sm")

    processed_texts = []

    for doc in tqdm(nlp.pipe(texts, disable=disable, batch_size=batch_size, n_process=n_process)):
        tokens = []
        for token in doc:
            # Skip the token if we are removing stopwords and it is a stopword
            if remove_stopwords and token.is_stop:
                # Continue the loop with the next token
                continue

            # Skip the token if we are removing punctuation and it is punctuation
            if remove_punct and token.is_punct:
                continue

            # Skip whitespace tokens (also controlled by remove_punct)
            if remove_punct and token.is_space:
                continue

            
            ## Determine final form of output list of tokens/lemmas
            if use_lemmas:
                tokens.append(token.lemma_.lower())
            else:
                tokens.append(token.text.lower())

        processed_texts.append(tokens)
    return processed_texts
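
To make the filtering logic easier to see in isolation, here is a minimal sketch that applies the same checks to one short sentence. It assumes a tokenizer-only pipeline from `spacy.blank("en")` (so no model download is needed) and a hypothetical helper named `filter_tokens`; it is an illustration, not a replacement for the function above.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only) is enough to show the filters
nlp = spacy.blank("en")


def filter_tokens(doc, remove_stopwords=True, remove_punct=True):
    """Hypothetical helper: return lowercased token texts after filtering."""
    kept = []
    for token in doc:
        # Drop stopwords when requested
        if remove_stopwords and token.is_stop:
            continue
        # Drop punctuation and whitespace tokens when requested
        if remove_punct and (token.is_punct or token.is_space):
            continue
        kept.append(token.text.lower())
    return kept


texts = ["I don't like flies!"]
# nlp.pipe streams Doc objects; n_process=1 keeps the sketch single-process
results = [filter_tokens(doc) for doc in nlp.pipe(texts, batch_size=50, n_process=1)]
print(results)
```

For a handful of short texts, a single process like this is likely to be faster than `n_process=-1`, since starting worker processes has its own overhead.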

### Pipeline

In [7]:
# Data
# Define sample text
sample_text = "While running in Central Park, I noticed that the constant buzzing of flies was annoying. I don't like flies, but I couldn't be too upset as they were likely attracted to the McDonald's food that someone carelessly dropped. I wondered, 'How can they be so uncaring?'"
sample_text

"While running in Central Park, I noticed that the constant buzzing of flies was annoying. I don't like flies, but I couldn't be too upset as they were likely attracted to the McDonald's food that someone carelessly dropped. I wondered, 'How can they be so uncaring?'"

In [17]:
# Default nlp pipeline and keep all stopwords
tokens_keep_all_stop = batch_preprocess_texts([sample_text], remove_stopwords=False)
print(tokens_keep_all_stop)

1it [00:21, 21.77s/it]

[['while', 'running', 'in', 'central', 'park', 'i', 'noticed', 'that', 'the', 'constant', 'buzzing', 'of', 'flies', 'was', 'annoying', 'i', 'do', "n't", 'like', 'flies', 'but', 'i', 'could', "n't", 'be', 'too', 'upset', 'as', 'they', 'were', 'likely', 'attracted', 'to', 'the', 'mcdonald', "'s", 'food', 'that', 'someone', 'carelessly', 'dropped', 'i', 'wondered', 'how', 'can', 'they', 'be', 'so', 'uncaring']]





Now run the default pipeline and remove stopwords.

In [18]:
# Remove default stopwords
tokens_remove_default_stop = batch_preprocess_texts([sample_text])
print(tokens_remove_default_stop)

1it [00:21, 21.53s/it]

[['running', 'central', 'park', 'noticed', 'constant', 'buzzing', 'flies', 'annoying', 'like', 'flies', 'upset', 'likely', 'attracted', 'mcdonald', 'food', 'carelessly', 'dropped', 'wondered', 'uncaring']]





The next cell uses a simple loop to compare the two lists of tokens we just created, so we can easily see which words were removed by default.

In [19]:
# Loop to find words removed
# Each result holds one inner token list per text, so compare the inner lists
removed_tokens = []
for token in tokens_keep_all_stop[0]:
    if token not in tokens_remove_default_stop[0]:
        removed_tokens.append(token)
print(removed_tokens)

['while', 'in', 'i', 'that', 'the', 'of', 'was', 'i', 'do', "n't", 'but', 'i', 'could', "n't", 'be', 'too', 'as', 'they', 'were', 'to', 'the', "'s", 'that', 'someone', 'i', 'how', 'can', 'they', 'be', 'so']

The list above includes all of the words that were removed from our simple sample text.
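
One compact way to summarize the removed words is to compare the inner token lists and count each dropped token. The sketch below uses shortened, illustrative copies of the two lists; the variable names `all_tokens` and `kept_tokens` are assumptions made for this example.

```python
from collections import Counter

# Shortened, illustrative copies of the two inner token lists
all_tokens = ["i", "do", "n't", "like", "flies", "but", "i", "could", "n't", "be", "too", "upset"]
kept_tokens = ["like", "flies", "upset"]

# A set makes each membership check fast
kept = set(kept_tokens)

# Count every token that appears before stopword removal but not after
removed_counts = Counter(token for token in all_tokens if token not in kept)
print(dict(removed_counts))
```

Counting is useful because the same stopword (such as "i") can be removed several times from one text.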

We can customize the stopwords. First we can obtain the entire set of default stopwords to use as the starting point for our custom list.


In [20]:
# Define custom nlp pipeline
custom_nlp = spacy.load('en_core_web_sm')
# Let's start by accessing spaCy's default stopwords
spacy_stopwords = custom_nlp.Defaults.stop_words
spacy_stopwords

{"'d",
 "'ll",
 "'m",
 "'re",
 "'s",
 "'ve",
 'a',
 'about',
 'above',
 'across',
 'after',
 'afterwards',
 'again',
 'against',
 'all',
 'almost',
 'alone',
 'along',
 'already',
 'also',
 'although',
 'always',
 'am',
 'among',
 'amongst',
 'amount',
 'an',
 'and',
 'another',
 'any',
 'anyhow',
 'anyone',
 'anything',
 'anyway',
 'anywhere',
 'are',
 'around',
 'as',
 'at',
 'back',
 'be',
 'became',
 'because',
 'become',
 'becomes',
 'becoming',
 'been',
 'before',
 'beforehand',
 'behind',
 'being',
 'below',
 'beside',
 'besides',
 'between',
 'beyond',
 'both',
 'bottom',
 'but',
 'by',
 'ca',
 'call',
 'can',
 'cannot',
 'could',
 'did',
 'do',
 'does',
 'doing',
 'done',
 'down',
 'due',
 'during',
 'each',
 'eight',
 'either',
 'eleven',
 'else',
 'elsewhere',
 'empty',
 'enough',
 'even',
 'ever',
 'every',
 'everyone',
 'everything',
 'everywhere',
 'except',
 'few',
 'fifteen',
 'fifty',
 'first',
 'five',
 'for',
 'former',
 'formerly',
 'forty',
 'four',
 'from',
 'fron

In [21]:
# How many default stopwords?
len(spacy_stopwords)

326
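
Because the default stopwords are a plain Python set, we can check membership before deciding what to add or remove. This sketch assumes the same set is importable as `STOP_WORDS` from `spacy.lang.en.stop_words`, and the words checked are just examples from our sample text.

```python
from spacy.lang.en.stop_words import STOP_WORDS

# Check a few words from the sample text against the default English stopwords
candidates = ["but", "someone", "food", "n't"]
membership = {word: word in STOP_WORDS for word in candidates}
print(membership)
```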

### Adding Stopwords

For this demo we want to add the words "food", "likely", "upset", and "carelessly" to the list of stopwords. Remember that adding a word to the stopwords list will cause it to be removed from the final list of tokens when stopwords are removed.

In [22]:
# We can include additional stopwords by adding them to the default set
# Add custom stopwords
custom_stopwords = ["food", "likely",'upset','carelessly']
for word in custom_stopwords:
    # Add the word to the list of stopwords (for easily tracking stopwords)
    custom_nlp.Defaults.stop_words.add(word)
    # Set the is_stop attribute for the word's entry in the vocab to True.
    # This is what actually makes spaCy treat the word as a stopword.
    custom_nlp.vocab[word].is_stop = True
updated_spacy_stopwords = custom_nlp.Defaults.stop_words
len(updated_spacy_stopwords)

330

### Removing Stopwords

We can remove words from the stopword set with the set's `.discard()` method. We will remove "but" and "someone" from our custom set of stopwords. Remember that removing a word from the stopwords means it will be included in the final list of tokens.

In [23]:
# Remove stopwords
remove_stopwords = ["but", "someone"]
for word in remove_stopwords:
    custom_nlp.Defaults.stop_words.discard(word)
    # Ensure the words are not recognized as stopwords
    custom_nlp.vocab[word].is_stop = False
updated_spacy_stopwords = custom_nlp.Defaults.stop_words
len(updated_spacy_stopwords)

328
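
To confirm that a change took effect, it can help to inspect `token.is_stop` directly. The sketch below assumes a fresh tokenizer-only pipeline from `spacy.blank("en")` and changes only the vocab flags, so it does not touch the shared default set.

```python
import spacy

# Assumption: a blank English pipeline is enough to inspect stopword flags
nlp = spacy.blank("en")

# Flag a new word as a stopword on this pipeline's vocab
nlp.vocab["food"].is_stop = True
# Unflag a word that is a stopword by default
nlp.vocab["but"].is_stop = False

doc = nlp("but the food was cold")
flags = {token.text: token.is_stop for token in doc}
print(flags)
```

Setting the vocab flag affects only that exact string. That is one reason the lesson code also updates `Defaults.stop_words`, which keeps the tracked set in sync with the flags.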

Now that we have defined the list of stopwords we wish to apply, we can call our custom function with our customized nlp pipeline.

In [25]:
# Process text with custom nlp pipeline
custom_stopwords_removed = batch_preprocess_texts([sample_text], nlp = custom_nlp)
print(custom_stopwords_removed)

1it [00:21, 21.62s/it]

[['running', 'central', 'park', 'noticed', 'constant', 'buzzing', 'flies', 'annoying', 'like', 'flies', 'but', 'attracted', 'mcdonald', 'someone', 'dropped', 'wondered', 'uncaring']]





We see that we have eliminated "food", "likely", "upset", and "carelessly" from our final list of tokens. We have also included "but" and "someone." Remember that this is just a demonstration of the techniques; the choice of which words to add or remove depends on your specific text and goals.

**Contractions**

You may have noticed that the original text contained two contractions: "don't" and "couldn't". When tokenized, "don't" was split into "do" and "n't". "Couldn't" was split into "could" and "n't." Also, note that "do", "could" and "n't" were all on the default stopwords list. Depending on your problem, you may want to keep contractions whole.

Below we will demonstrate how to keep contractions intact by calling `.add_special_case` on the nlp pipeline's tokenizer.

In [27]:
# List of contractions to keep as single tokens
contractions = ["don't", "couldn't"]
# Loop through the contractions list and add special cases
for contraction in contractions:
    special_case = [{"ORTH": contraction}]
    custom_nlp.tokenizer.add_special_case(contraction, special_case)
keep_contractions = batch_preprocess_texts([sample_text], nlp = custom_nlp)
print(keep_contractions)

1it [00:21, 21.64s/it]

[['running', 'central', 'park', 'noticed', 'constant', 'buzzing', 'flies', 'annoying', "don't", 'like', 'flies', 'but', "couldn't", 'attracted', 'mcdonald', 'someone', 'dropped', 'wondered', 'uncaring']]





Now we have "don't" and "couldn't" included in our final list of tokens.
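
The effect of a special case is easiest to see by tokenizing the same sentence before and after adding it. This is a minimal sketch that assumes a tokenizer-only pipeline from `spacy.blank("en")`; the example sentences are made up for illustration.

```python
import spacy

# Assumption: a blank English pipeline uses the same English tokenizer rules
nlp = spacy.blank("en")

# Default behavior: the contraction is split into two tokens
before = [token.text for token in nlp("I don't like flies.")]

# Register the lowercase contraction as a single-token special case
nlp.tokenizer.add_special_case("don't", [{"ORTH": "don't"}])

after = [token.text for token in nlp("I don't like flies.")]
# Special cases match the exact string, so a capitalized form is not covered
capitalized = [token.text for token in nlp("Don't worry.")]

print(before)
print(after)
print(capitalized)
```

If your text contains capitalized contractions at the start of sentences, you may want to add those spellings as separate special cases.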

### Custom nlp pipeline function

To allow us to customize the pipeline efficiently, we will define a custom function that includes the code we demonstrated above. We will add two additional arguments. The `disable` argument will allow us to disable components of the pipeline when calling the function. The `spacy_model` argument will allow us to choose which spaCy language model to use. We will learn more about this in a future lesson.

In [16]:
def make_custom_nlp(
    disable=["ner"],
    contractions=["don't", "can't", "couldn't", "you'd", "I'll"],
    stopwords_to_add=[],
    stopwords_to_remove=[],
    spacy_model = "en_core_web_sm"
):
    """Returns a custom spacy nlp pipeline.
    
    Args:
        disable (list, optional): Names of pipe components to disable. Defaults to ["ner"].
        contractions (list, optional): List of contractions to add as special cases. Defaults to ["don't", "can't", "couldn't", "you'd", "I'll"].
        stopwords_to_add(list, optional): List of words to set as stopwords (word.is_stop=True)
        stopwords_to_remove(list, optional): List of words to remove from stopwords (word.is_stop=False)
        spacy_model(string, optional): String to select a spacy language model. (Defaults to "en_core_web_sm".)
                            Additional Options:  "en_core_web_md", "en_core_web_lg"; 
                            (Must first download the model by name in the terminal:
                            e.g.  "python -m spacy download en_core_web_lg" )
            
    Returns:
        nlp pipeline: spaCy pipeline with special cases and updated nlp.Defaults.stop_words
    """
    # Load the English NLP model
    nlp = spacy.load(spacy_model, disable=disable)
    
    # Adding Special Cases 
    # Loop through the contractions list and add special cases
    for contraction in contractions:
        special_case = [{"ORTH": contraction}]
        nlp.tokenizer.add_special_case(contraction, special_case)
    
    # Adding stopwords
    for word in stopwords_to_add:
        # Set the is_stop attribute for the word in the vocab to True;
        # this determines spaCy's treatment of the word as a stopword
        nlp.vocab[word].is_stop = True
        # Add the word to the set of stopwords (for easily tracking stopwords)
        nlp.Defaults.stop_words.add(word)
    
    # Removing Stopwords
    for word in stopwords_to_remove:
        
        # Ensure the words are not recognized as stopwords
        nlp.vocab[word].is_stop = False
        nlp.Defaults.stop_words.discard(word)
        
    return nlp

**Test the function**

We will now define a custom pipeline and use it in our preprocessing function. Since we are processing a single string with our batch preprocessing function, we will need to pass it into the function as a list.

In [28]:
# Customize the nlp pipeline
function_nlp = make_custom_nlp(    
    disable=['ner', 'parser'],
    contractions=["don't"],
    stopwords_to_add=['park'],
    stopwords_to_remove=['while'],
    spacy_model = "en_core_web_sm"
)
# call preprocessing function with custom nlp pipeline
tokens = batch_preprocess_texts([sample_text], nlp = function_nlp)
print(tokens[0])

1it [00:21, 21.17s/it]

['while', 'running', 'central', 'noticed', 'constant', 'buzzing', 'flies', 'annoying', "don't", 'like', 'flies', 'but', 'attracted', 'mcdonald', 'someone', 'dropped', 'wondered', 'uncaring']





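Check the results to confirm that the outcome was what you expected. "park" is gone and "while" is kept, as requested. Notice that "but" and "someone" are also kept, and that "food", "likely", "upset", and "carelessly" are still removed, even though this call did not request those changes. This appears to be because `nlp.Defaults.stop_words` is shared by every English pipeline loaded in the same session, so the edits we made through `custom_nlp` earlier carried over to `function_nlp`.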
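
One way to make this check explicit is to compare the output against the words we asked to add or remove. The sketch below copies the token list from the output above; the names `requested_removed` and `requested_kept` are hypothetical and exist only for this example.

```python
# Token list copied from the output of the cell above
tokens = ['while', 'running', 'central', 'noticed', 'constant', 'buzzing',
          'flies', 'annoying', "don't", 'like', 'flies', 'but', 'attracted',
          'mcdonald', 'someone', 'dropped', 'wondered', 'uncaring']

# Words we asked the pipeline to treat as stopwords (should be absent)
requested_removed = ["park"]
# Words we asked the pipeline to keep (should be present)
requested_kept = ["while", "don't"]

missing = [word for word in requested_kept if word not in tokens]
leaked = [word for word in requested_removed if word in tokens]

print("missing:", missing)
print("leaked:", leaked)
```

Two empty lists mean that the pipeline honored every requested change.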

### Summary

In this lesson, you learned how to customize your nlp preprocessing pipeline by adding custom stop words specific to a text, how to remove stop words from spaCy's default English stop word set, and how to tell spaCy to keep contractions together. You now have a function that creates a customized nlp object to use with the `batch_preprocess_texts` preprocessing function.