## Stop Words
Stop words are common words in a language that are often filtered out during natural language processing (NLP) tasks such as text analysis, information retrieval, and text mining. They are usually treated as low-value because they:
1. Carry little semantic meaning on their own
2. Appear frequently in almost every document
3. Rarely help distinguish one document from another
##### Examples of stop words in English include:
1. Generic stop words: “a”, “and”, “the”, “all”, “do”, “so”, etc.
2. Function words: prepositions (e.g., “of”, “in”, “on”), conjunctions (e.g., “and”, “but”), articles (e.g., “the”, “a”), auxiliary verbs (e.g., “is”, “are”), etc.
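
The filtering idea can be sketched without any library. This is only an illustration: `SAMPLE_STOP_WORDS` and `remove_stop_words` are hypothetical names, the stop word set is hand-picked and far smaller than NLTK's English list, and tokens are split on whitespace rather than with a real tokenizer.

```python
# Hypothetical, hand-picked stop word set for illustration only;
# NLTK's English list is much larger.
SAMPLE_STOP_WORDS = {"a", "an", "and", "the", "is", "am", "i", "my", "of", "in", "on", "to"}


def remove_stop_words(text, stop_words=SAMPLE_STOP_WORDS):
    """Return the whitespace-separated tokens of `text` that are not stop words."""
    tokens = text.split()
    # Compare in lowercase so that "My" and "my" are treated alike.
    return [token for token in tokens if token.lower() not in stop_words]


filtered = remove_stop_words("My hobby is to develop software")
print(filtered)
```

Lowercasing only for the comparison keeps the original casing of the tokens that survive the filter.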

In [46]:
## sample paragraph
paragraph = '''
Hi there, My name is Subrat Mishra, I am a Software Developer, and a aspiring Data Scientist. I am more driven towards Data. My hobby is to develop intellect softwares.
'''

In [29]:
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

In [30]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [31]:
stemmer = PorterStemmer()

In [32]:
stop_words = set(stopwords.words('english'))

In [33]:
sentences = nltk.sent_tokenize(paragraph)

In [34]:
sentences

['\nHi there, My name is Subrat Mishra, I am a Software Developer, and a aspiring Data Scientist.',
 'I am more driven towards Data.',
 'My hobby is to develop intellect softwares.']

In [35]:
type(sentences)

list

In [36]:
## remove stop words, then apply Porter stemming to the remaining tokens
porter_stemmed_sentences = []
for sentence in sentences:
    words = nltk.word_tokenize(sentence)  # Tokenize the sentence
    words = [stemmer.stem(word) for word in words if word.lower() not in stop_words]
    porter_stemmed_sentences.append(' '.join(words))  # Convert the list of words back to a sentence


In [37]:
porter_stemmed_sentences

['hi , name subrat mishra , softwar develop , aspir data scientist .',
 'driven toward data .',
 'hobbi develop intellect softwar .']

In [38]:
from nltk.stem import SnowballStemmer

In [39]:
ball = SnowballStemmer(language='english')

In [40]:
## remove stop words, then apply Snowball stemming to the remaining tokens
snowball_stemmed_sentences = []
for sentence in sentences:
    words = nltk.word_tokenize(sentence)  # Tokenize the sentence
    words = [ball.stem(word) for word in words if word.lower() not in stop_words]
    snowball_stemmed_sentences.append(' '.join(words))  # Convert the list of words back to a sentence


In [41]:
snowball_stemmed_sentences

['hi , name subrat mishra , softwar develop , aspir data scientist .',
 'driven toward data .',
 'hobbi develop intellect softwar .']
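
For this paragraph the Porter and Snowball outputs are identical, and the two cells differ only in which stemmer they call. One way to avoid repeating the loop is a helper that accepts any word-normalizing callable. The sketch below is an assumption, not part of the original notebook: `normalize_sentences` and `toy_stem` are hypothetical names, tokenization defaults to `str.split`, and `toy_stem` is a deliberately crude stand-in that is neither Porter nor Snowball.

```python
from typing import Callable, Iterable, List, Set


def normalize_sentences(
    sentences: Iterable[str],
    normalize: Callable[[str], str],
    stop_words: Set[str],
    tokenize: Callable[[str], List[str]] = str.split,
) -> List[str]:
    """Drop stop words from each sentence, then normalize the remaining tokens."""
    result = []
    for sentence in sentences:
        tokens = tokenize(sentence)
        kept = [normalize(tok) for tok in tokens if tok.lower() not in stop_words]
        result.append(" ".join(kept))
    return result


def toy_stem(word: str) -> str:
    """Toy stand-in for a stemmer: lowercase and strip one trailing 's'."""
    word = word.lower()
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


# Small illustrative stop word set (an assumption, not NLTK's list).
demo_stop_words = {"i", "am", "my", "is", "to", "more"}
demo = normalize_sentences(
    ["I am more driven towards Data", "My hobby is to develop intellect softwares"],
    normalize=toy_stem,
    stop_words=demo_stop_words,
)
print(demo)
```

In the notebook itself, the same helper could presumably be called with `normalize=stemmer.stem` or `normalize=ball.stem`, `tokenize=nltk.word_tokenize`, and the `stop_words` set built earlier.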

In [47]:
from nltk.stem import WordNetLemmatizer

In [48]:
# WordNetLemmatizer needs the WordNet corpus; if it is missing, run nltk.download('wordnet') once
lemmatizer = WordNetLemmatizer()

In [51]:
## remove stop words, then lemmatize each remaining token as a verb (pos='v')
lemmatized_sentences = []
for sentence in sentences:
    sentence = sentence.lower()  # Lowercase before tokenizing
    words = nltk.word_tokenize(sentence)  # Tokenize the sentence
    words = [lemmatizer.lemmatize(word, pos='v') for word in words if word not in stop_words]
    lemmatized_sentences.append(' '.join(words))  # Convert the list of words back to a sentence


In [52]:
lemmatized_sentences

['hi , name subrat mishra , software developer , aspire data scientist .',
 'drive towards data .',
 'hobby develop intellect softwares .']
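
To see where stemming and lemmatization diverge, the outputs above can be compared token by token. The sketch below copies the Porter and lemmatizer outputs as literal strings, so it runs without NLTK. The variable names `stemmed`, `lemmatized`, and `differences` are illustrative, and the pairwise `zip` relies on both outputs having the same number of tokens per sentence, which holds here.

```python
# Outputs copied from the Porter stemming and lemmatization cells above.
stemmed = [
    'hi , name subrat mishra , softwar develop , aspir data scientist .',
    'driven toward data .',
    'hobbi develop intellect softwar .',
]
lemmatized = [
    'hi , name subrat mishra , software developer , aspire data scientist .',
    'drive towards data .',
    'hobby develop intellect softwares .',
]

# Collect (stem, lemma) pairs for every position where the two outputs disagree.
differences = []
for stem_sentence, lemma_sentence in zip(stemmed, lemmatized):
    for stem_tok, lemma_tok in zip(stem_sentence.split(), lemma_sentence.split()):
        if stem_tok != lemma_tok:
            differences.append((stem_tok, lemma_tok))

for stem_tok, lemma_tok in differences:
    print(f"{stem_tok:10} -> {lemma_tok}")
```

The stemmers produce truncated forms such as `softwar` and `hobbi` that are not dictionary words, while the lemmatizer returns real words. Because every token is lemmatized with `pos='v'`, verb forms change (`driven` becomes `drive`, `aspiring` becomes `aspire`) but nouns such as `softwares` are left untouched.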