In [1]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
len(STOP_WORDS)


326

In [2]:
nlp = spacy.load('en_core_web_sm')
doc = nlp("We just opened our wings, the flying part is coming soon")

for token in doc:
    if token.is_stop:
        print(token)


We
just
our
the
part
is


In [5]:
def preprocess(text):
    doc = nlp(text)
    # Keep the text of every token that is neither a stop word nor punctuation.
    return [
        token.text
        for token in doc
        if not token.is_stop and not token.is_punct
    ]


In natural language processing (NLP), stop words are very common words that carry little meaning on their own. Examples include articles ("a", "an", "the"), prepositions ("in", "on", "at"), conjunctions ("and", "or", "but"), and pronouns ("he", "she", "it"). They mostly serve to connect the more meaningful words in a sentence.

Stop words are typically removed from text during certain NLP tasks, such as text classification, information retrieval, and topic modeling. The primary reasons for removing stop words are:

Reducing dimensionality: Stop words occur frequently and appear in a large share of documents. Removing them shrinks the vocabulary, and with it the feature space and computational cost of NLP models.

Improving performance: Stop words contribute little to the overall meaning of a text. Removing them shifts the focus to more informative words, which can improve the results of models that rely on word counts, such as bag-of-words classifiers.
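
To make the dimensionality point concrete, here is a minimal sketch in plain Python. It does not use spaCy: `TOY_STOP_WORDS`, the three sample sentences, and the helper names `bag_of_words` and `vocabulary` are all made up for illustration, and tokenization is a simple whitespace split.

```python
from collections import Counter

# Hypothetical, hand-picked stop list for illustration only.
# spaCy's STOP_WORDS is much larger.
TOY_STOP_WORDS = {"the", "is", "a", "on", "in", "and", "of", "to"}

docs = [
    "the cat is on the mat",
    "a dog is in the garden",
    "the cat and the dog play in the garden",
]


def bag_of_words(text, stop_words=frozenset()):
    """Count lowercase whitespace tokens, skipping any in stop_words."""
    return Counter(w for w in text.lower().split() if w not in stop_words)


def vocabulary(texts, stop_words=frozenset()):
    """Collect the set of distinct tokens across all texts."""
    vocab = set()
    for text in texts:
        vocab.update(bag_of_words(text, stop_words))
    return vocab


full_vocab = vocabulary(docs)
reduced_vocab = vocabulary(docs, TOY_STOP_WORDS)
print(len(full_vocab), len(reduced_vocab))  # prints: 11 5
```

In this toy corpus the vocabulary drops from 11 words to 5, and the words that remain ("cat", "dog", "garden", "mat", "play") are the ones that distinguish the sentences from each other.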

However, there are cases where you might not want to remove stop words. In tasks like sentiment analysis or text generation, removing them may alter the sentiment or the intended meaning of the text. For instance, negations such as "not" appear on many stop-word lists, so "not good" would be reduced to "good". Likewise, if you're working with short texts or with specific domains where stop words carry significant meaning, it may be better to retain them.
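
The sentiment caveat can be illustrated with a small sketch, again in plain Python rather than spaCy. The stop list `SENTIMENT_STOP_WORDS`, the `NEGATIONS` set, and the helper `remove_stop_words` are assumptions made for this example, not part of any library.

```python
# Hypothetical stop list for illustration. Many real lists contain "not".
SENTIMENT_STOP_WORDS = {"the", "was", "this", "is", "not", "a"}

# Negations we choose to keep for sentiment tasks (an assumption of this sketch).
NEGATIONS = {"not", "no", "never"}


def remove_stop_words(text, stop_words):
    """Return lowercase whitespace tokens that are not in stop_words."""
    return [w for w in text.lower().split() if w not in stop_words]


review = "the movie was not good"

# Naive removal drops "not" and flips the apparent sentiment.
naive = remove_stop_words(review, SENTIMENT_STOP_WORDS)

# Excluding negations from the stop list preserves the meaning.
careful = remove_stop_words(review, SENTIMENT_STOP_WORDS - NEGATIONS)

print(naive)    # ['movie', 'good']
print(careful)  # ['movie', 'not', 'good']
```

With spaCy, a similar effect could probably be achieved by keeping any token whose `token.lower_` is in a negation set, even when `token.is_stop` is true. Which words to protect depends on the task.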


In [6]:
preprocess("We just opened our wings, the flying part is coming soon")


['opened', 'wings', 'flying', 'coming', 'soon']