## Stop Words
Stop words are words that are often filtered out or ignored in Natural Language Processing (NLP) tasks because they are considered to carry little semantic value. They are usually very common words such as "the", "is", "at", "which", and "on". They are called "stop words" because they are often "stopped", that is, removed, during the preprocessing phase of text analysis and model training.

The rationale behind removing stop words is that they occur frequently in the language and usually don't carry significant meaning on their own. By eliminating them, NLP algorithms can focus on more meaningful words, potentially improving the efficiency and performance of tasks like text classification, sentiment analysis, and keyword extraction.
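To make the frequency argument concrete, here is a minimal pure-Python sketch. The sentence and the `EXAMPLE_STOP_WORDS` set are made up for illustration (the set holds only the five words named above, not spaCy's full list), and tokenization is a plain whitespace split rather than a real tokenizer.

```python
from collections import Counter

# Hypothetical mini stop list: just the five example words from the text.
EXAMPLE_STOP_WORDS = {"the", "is", "at", "which", "on"}

# Made-up sentence, split on whitespace (an assumption; real tokenizers do more).
text = "the cat is on the mat which is at the door"
tokens = text.split()

# Count how often each token occurs; the most frequent ones are stop words.
counts = Counter(tokens)
top_two = counts.most_common(2)

# Keep only the tokens that are not in the stop list.
content_tokens = [tok for tok in tokens if tok not in EXAMPLE_STOP_WORDS]

print(top_two)         # [('the', 3), ('is', 2)]
print(content_tokens)  # ['cat', 'mat', 'door']
```

In this toy case the two most frequent tokens are both stop words, and filtering shrinks eleven tokens down to the three that describe the content.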

However, the removal of stop words is not always appropriate. It depends on the specific task and the nature of the data. In some cases, such as in sentence structure analysis or certain types of semantic analysis, the stop words can provide important contextual cues and should be retained.
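One case where removal can hurt is negation. The sketch below assumes a small hypothetical stop list, `TOY_STOP_WORDS`, that contains "not", as many English stop lists do. The helper `remove_stop_words` is illustrative only and is not part of spaCy.

```python
# Hypothetical mini stop list for illustration; it deliberately includes "not".
TOY_STOP_WORDS = {"the", "is", "was", "not", "a", "this"}


def remove_stop_words(text, stop_words=TOY_STOP_WORDS):
    """Lowercase the text, split on whitespace, and drop tokens in stop_words."""
    return [tok for tok in text.lower().split() if tok not in stop_words]


positive = "The movie was good"
negative = "The movie was not good"

filtered_positive = remove_stop_words(positive)
filtered_negative = remove_stop_words(negative)

print(filtered_positive)  # ['movie', 'good']
print(filtered_negative)  # ['movie', 'good']
```

After filtering, the two sentences are indistinguishable even though they express opposite opinions, so a sentiment model trained on the filtered text would lose that signal. For tasks like this, it may be safer to keep negations or to skip stop word removal.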

In [4]:
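import spacy
# Import the stop word set explicitly instead of relying on attribute access.
from spacy.lang.en.stop_words import STOP_WORDS

nlp = spacy.load("en_core_web_sm")

# STOP_WORDS is a set of strings for English.
spacy_stopwords = STOP_WORDS
print(len(spacy_stopwords))

# Sets are unordered, so the ten words printed can differ between runs.
for stop_word in list(spacy_stopwords)[:10]:
    print(stop_word)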

326
all
again
without
hereafter
various
became
seems
your
bottom
any


In [5]:
# In the example above, I examined the STOP_WORDS set from spacy.lang.en.stop_words.

In [6]:
custom_about_text = (
    "Gus Proto is a Python developer currently"
    " working for a London-based Fintech"
    " company. He is interested in learning"
    " Natural Language Processing."
)
nlp = spacy.load("en_core_web_sm")
about_doc = nlp(custom_about_text)
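# Keep only tokens that spaCy does not flag as stop words.
# Punctuation tokens such as "-" and "." are not stop words, so they remain.
print([token for token in about_doc if not token.is_stop])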

[Gus, Proto, Python, developer, currently, working, London, -, based, Fintech, company, ., interested, learning, Natural, Language, Processing, .]
