<a href="https://colab.research.google.com/github/lucasvx273/NLP_Python/blob/main/NLTK_Stop_Words.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Study source: https://pythonspot.com/nltk-stop-words/

This notebook is used for experiments with NLTK and stop words.

--------------------------------



**Natural Language Processing with Python**

Natural language processing (NLP) is a research field that presents many challenges, such as natural language understanding.

Stop words are common words such as ‘the’, ‘and’, and ‘I’ that occur very frequently in text and therefore convey little about the specific topic of a document. Removing them from a corpus cleans up the data and makes it easier to identify rarer words that are potentially more relevant to what we’re interested in.

Text may contain stop words like ‘the’, ‘is’, ‘are’, and these can be filtered out before the text is processed. There is no universal list of stop words in NLP research; however, the nltk module ships with its own stop word lists.
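To see why such words dominate a text, a quick frequency count helps. The following is a minimal sketch using only the standard library; the sample sentence and the whitespace split (instead of a real tokenizer) are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical sample text, chosen only for illustration.
text = "the cat sat on the mat and the dog sat on the rug"

# Naive whitespace tokenization; a real pipeline would use a tokenizer.
tokens = text.split()

# Count how often each token occurs.
counts = Counter(tokens)

# 'the' tops the list although it says nothing about the topic.
print(counts.most_common(3))
```

Topic words such as "cat" or "rug" each appear only once here, which is the pattern stop word removal tries to exploit.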

In [None]:
# Here's a hand-written list of some of the most commonly used words in English

N = [ 'stop', 'the', 'to', 'and', 'a', 'in', 'it', 'is', 'I', 'that', 'had', 'on', 'for', 'were', 'was']

print(N)
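As a rough sketch of how such a manual list can be used, the snippet below filters a sentence against it. The sample sentence and the names `manual_stops`, `sentence`, and `kept` are made up for illustration, and plain `str.split` stands in for a proper tokenizer.

```python
# Same hand-written list as above, stored as a set for fast lookups.
manual_stops = {'stop', 'the', 'to', 'and', 'a', 'in', 'it', 'is', 'I',
                'that', 'had', 'on', 'for', 'were', 'was'}

# Hypothetical input sentence.
sentence = "I was in the garden and it was sunny"

# Keep only the words that are not in the manual list.
kept = [w for w in sentence.split() if w not in manual_stops]
print(kept)
```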

**With nltk you don’t have to define every stop word manually**. Stop words are frequently used words that carry very little meaning on their own, so they are usually filtered out after tokenization rather than kept for analysis.

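**NLTK** (Natural Language Toolkit) **includes ready-made stop word lists for several languages.** The English list used in this notebook has 179 entries (see the len(stopWords) output further down), including “a”, “an”, “the”, “of”, “in”, etc.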

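The stop words in nltk are words that occur very frequently in ordinary text, so you usually do not want to use them to describe the topic of your content. The lists are predefined, but stopwords.words() returns an ordinary Python list, so you can add or remove entries in your own copy.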
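Because a stop word list is ordinary Python data, a working copy can be tailored to a task. The sketch below assumes a tiny stand-in list (`base_stops`) in place of the result of `stopwords.words('english')`, so that it runs without any download; the added and removed words are arbitrary examples.

```python
# Stand-in for stopwords.words('english'); the real list is much longer.
base_stops = ['a', 'an', 'the', 'of', 'in', 'not', 'no']

# Work on a copy so the original list stays untouched.
custom_stops = set(base_stops)

# Keep negations in the text: they can matter, e.g. for sentiment analysis.
custom_stops -= {'not', 'no'}

# Add a word that carries no signal in this particular corpus (hypothetical).
custom_stops |= {'jack'}

print(sorted(custom_stops))
```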

In [None]:
#!pip install nltk
import nltk
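nltk.download('punkt')  # tokenizer models used by word_tokenize
# Note: recent NLTK releases may ask for 'punkt_tab' instead. If word_tokenize
# raises a LookupError, run nltk.download('punkt_tab') as well.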

In [6]:
from nltk.tokenize import sent_tokenize, word_tokenize
 
data = "All work and no play makes jack dull boy. All work and no play makes jack a dull boy."
words = word_tokenize(data)
print(words)

['All', 'work', 'and', 'no', 'play', 'makes', 'jack', 'dull', 'boy', '.', 'All', 'work', 'and', 'no', 'play', 'makes', 'jack', 'a', 'dull', 'boy', '.']


Removing stop words makes sense for many natural language processing tasks, such as keyword extraction or topic modeling, where very frequent words mostly add noise. The code below shows how to remove them from a text.

In [7]:
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords

**NLTK Stopword List**

Stop words are words that are very common in human language, such as “the”, “of”, and “to”, but are generally not useful for analysis because they appear in almost every text regardless of its topic.

If you get a LookupError saying that the stopwords resource was not found, make sure to download the stop word lists after installing nltk.
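One way to make a notebook robust against this error is to download the resource only when loading fails. The helper below is a hypothetical sketch (`load_with_fallback` is not part of nltk); it relies only on the fact that nltk signals a missing resource with a `LookupError`.

```python
def load_with_fallback(loader, downloader):
    """Call loader(); if the resource is missing, download it and retry once."""
    try:
        return loader()
    except LookupError:
        # The resource is not installed yet: fetch it, then try again.
        downloader()
        return loader()
```

With nltk installed, it could be called as `load_with_fallback(lambda: stopwords.words('english'), lambda: nltk.download('stopwords'))`, which skips the download whenever the lists are already present.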

In [8]:
import nltk
nltk.download('stopwords')

from nltk.corpus import stopwords

stops = set(stopwords.words('english'))
print(stops)

{'of', 'is', 'wouldn', 'very', "she's", 'having', 'did', 'has', 'hadn', 'each', 'himself', 'after', 'there', 'and', 'more', 'own', "you've", 'mustn', 'both', 'down', 'isn', 'just', "mustn't", 'than', "you'll", 'have', 'about', 'against', 'ma', "should've", 'here', 'are', 'how', 'll', 'was', 'does', 'been', 'during', 'above', 'hasn', 'who', 'am', 'we', "hasn't", 'by', 'most', 'being', 'doing', 'the', 'won', 'before', 'a', 'shouldn', 'that', 'an', 'where', "isn't", "it's", 'again', 'theirs', 'but', 'can', 'when', 'whom', "wouldn't", 'those', 'at', 'm', 'or', 'our', 'what', 'it', 'she', 'doesn', 'do', 'through', 'off', "shan't", 't', "wasn't", 'yourself', 'them', 'further', 'aren', 'needn', "needn't", 'so', 'mightn', 'because', 'wasn', 'its', 'up', "you're", 're', 'as', 'not', 'below', 'any', 'yours', 'some', "won't", 'under', 'ourselves', 'his', 'now', "mightn't", 'same', 'over', "that'll", 'while', 'between', 'which', 'hers', 'haven', 'their', 'to', 'nor', 'shan', 'they', 'these', 'my',

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


You can do the same for other languages, so pick the list for the language you need:

stops = set(stopwords.words('german'))

stops = set(stopwords.words('indonesian'))

stops = set(stopwords.words('portuguese'))

stops = set(stopwords.words('spanish'))
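A misspelled language name makes the lookup fail, so it can help to check the name first. In the sketch below, `available` is an assumed, shortened stand-in for the list that `stopwords.fileids()` returns, and `resolve_language` is a hypothetical helper, not an nltk function.

```python
import difflib

# Assumed subset of the names reported by stopwords.fileids().
available = ['english', 'german', 'indonesian', 'portuguese', 'spanish']

def resolve_language(name, choices):
    """Return the normalized name if it is known, else raise with a suggestion."""
    key = name.strip().lower()
    if key in choices:
        return key
    # Look for the closest known name to include in the error message.
    guess = difflib.get_close_matches(key, choices, n=1)
    hint = f" Did you mean '{guess[0]}'?" if guess else ""
    raise ValueError(f"Unknown stop word language '{name}'.{hint}")

print(resolve_language('Portuguese', available))
```

The returned name can then be passed to `stopwords.words(...)`.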

**Filtering stop words with nltk**

We will use a string (data) as the text. You can also use a text file as input: read its contents into a string and process it the same way.

In [11]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

data = "All work and no play makes jack dull boy. All work and no play makes jack a dull boy."
stopWords = set(stopwords.words('english'))  # a set gives fast membership tests
words = word_tokenize(data)

# Keep only the tokens that are not in the stop word list.
# The comparison is case-sensitive, so 'All' is kept even though 'all' is a stop word.
wordsFiltered = [w for w in words if w not in stopWords]

print(wordsFiltered)
print(len(stopWords))
print(stopWords)


['All', 'work', 'play', 'makes', 'jack', 'dull', 'boy', '.', 'All', 'work', 'play', 'makes', 'jack', 'dull', 'boy', '.']
179
{'of', 'is', 'wouldn', 'very', "she's", 'having', 'did', 'has', 'hadn', 'each', 'himself', 'after', 'there', 'and', 'more', 'own', "you've", 'mustn', 'both', 'down', 'isn', 'just', "mustn't", 'than', "you'll", 'have', 'about', 'against', 'ma', "should've", 'here', 'are', 'how', 'll', 'was', 'does', 'been', 'during', 'above', 'hasn', 'who', 'am', 'we', "hasn't", 'by', 'most', 'being', 'doing', 'the', 'won', 'before', 'a', 'shouldn', 'that', 'an', 'where', "isn't", "it's", 'again', 'theirs', 'but', 'can', 'when', 'whom', "wouldn't", 'those', 'at', 'm', 'or', 'our', 'what', 'it', 'she', 'doesn', 'do', 'through', 'off', "shan't", 't', "wasn't", 'yourself', 'them', 'further', 'aren', 'needn', "needn't", 'so', 'mightn', 'because', 'wasn', 'its', 'up', "you're", 're', 'as', 'not', 'below', 'any', 'yours', 'some', "won't", 'under', 'ourselves', 'his', 'now', "mightn't", 
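Note that the filter above keeps 'All' and the periods: the comparison is case-sensitive, and punctuation is not part of the stop word list. A possible refinement, sketched here with a small assumed stop word sample and the token list copied from the tokenizer output earlier in this notebook, lowercases each token for the lookup and drops non-alphabetic tokens.

```python
# Tokens as produced by word_tokenize earlier in this notebook.
words = ['All', 'work', 'and', 'no', 'play', 'makes', 'jack', 'dull', 'boy', '.',
         'All', 'work', 'and', 'no', 'play', 'makes', 'jack', 'a', 'dull', 'boy', '.']

# Small assumed sample; in practice use set(stopwords.words('english')).
stop_sample = {'all', 'and', 'no', 'a'}

# Drop punctuation tokens and compare in lowercase against the stop words.
filtered = [w for w in words
            if w.isalpha() and w.lower() not in stop_sample]
print(filtered)
```

Whether to lowercase or strip punctuation depends on the task, so treat this as one option rather than a required step.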