# Stopwords

Stopwords are high-frequency words that carry little meaning on their own but are needed for sentence structure and syntax. In text processing tasks such as Natural Language Processing (NLP) or text mining, stopwords are often removed because they add little to the meaning being analyzed.

Some examples of stopwords in English are:

- I
- you
- and
- that
- is
- not
- it
- to
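
As a minimal sketch of what removal does, the snippet below filters a sentence against a small hand-written stopword set. The set `EXAMPLE_STOPWORDS` and the helper `remove_stopwords` are illustrative assumptions, not NLTK's list, and whitespace splitting stands in for a real tokenizer.

```python
# Hypothetical mini stopword list built from the examples above; real
# lists (such as NLTK's) are much longer.
EXAMPLE_STOPWORDS = {"i", "you", "and", "that", "is", "not", "it", "to"}


def remove_stopwords(text, stop_words=EXAMPLE_STOPWORDS):
    """Return the whitespace-separated tokens of `text` that are not stopwords."""
    # Compare in lowercase so "It" and "it" are treated the same.
    return [token for token in text.split() if token.lower() not in stop_words]


print(remove_stopwords("It is not easy to learn that and you know it"))
# ['easy', 'learn', 'know']
```

Note that dropping `not` also discards the negation, so the filtered tokens no longer say the same thing as the sentence.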

In [49]:
paragraph = "Machine Learning (ML) is a subset of Artificial Intelligence (AI) that enables systems to learn from data and improve their performance over time without being explicitly programmed. It involves algorithms and statistical models that analyze patterns within data to make predictions or decisions. ML is used in various fields, such as healthcare for diagnosing diseases, finance for fraud detection, and e-commerce for personalized recommendations. With advancements in deep learning, a subset of ML, the capabilities of AI systems have grown significantly, allowing them to handle complex tasks like image recognition, natural language processing, and autonomous driving."

In [43]:
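# Assumes the NLTK data packages 'punkt', 'stopwords', and 'wordnet' have been
# fetched with nltk.download(...); newer NLTK releases use 'punkt_tab' for the tokenizer.
from nltk.stem import PorterStemmer, SnowballStemmer, RegexpStemmer, WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize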


In [40]:
# stopwords.words("english")

In [41]:
# 1. Split the paragraph into sentences and each sentence into tokens.
# 2. Drop stopwords (compared case-insensitively).
# 3. Stem the remaining tokens with the Porter stemmer.

sentences = sent_tokenize(paragraph)
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

processed_sentences = []

for sentence in sentences:
    words = word_tokenize(sentence)
    # Keep only the tokens that are not stopwords, and stem each one.
    words = [stemmer.stem(word) for word in words if word.lower() not in stop_words]

    processed_sentences.append(' '.join(words))

processed_sentences


['machin learn ( ml ) subset artifici intellig ( ai ) enabl system learn data improv perform time without explicitli program .',
 'involv algorithm statist model analyz pattern within data make predict decis .',
 'ml use variou field , healthcar diagnos diseas , financ fraud detect , e-commerc person recommend .',
 'advanc deep learn , subset ml , capabl ai system grown significantli , allow handl complex task like imag recognit , natur languag process , autonom drive .']

In [45]:
# 1. Split the paragraph into sentences and each sentence into tokens.
# 2. Drop stopwords (compared case-insensitively).
# 3. Stem the remaining tokens with the Snowball stemmer.

sentences = sent_tokenize(paragraph)
stop_words = set(stopwords.words('english'))
stemmer = SnowballStemmer(language='english')

processed_sentences = []

for sentence in sentences:
    words = word_tokenize(sentence)
    # Keep only the tokens that are not stopwords, and stem each one.
    words = [stemmer.stem(word) for word in words if word.lower() not in stop_words]

    processed_sentences.append(' '.join(words))

processed_sentences


['machin learn ( ml ) subset artifici intellig ( ai ) enabl system learn data improv perform time without explicit program .',
 'involv algorithm statist model analyz pattern within data make predict decis .',
 'ml use various field , healthcar diagnos diseas , financ fraud detect , e-commerc person recommend .',
 'advanc deep learn , subset ml , capabl ai system grown signific , allow handl complex task like imag recognit , natur languag process , autonom drive .']
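
The two stemmers agree on most tokens in this paragraph. To see exactly where they differ, the sketch below compares the two results token by token. The lists are copied from the outputs above; the names `porter_out`, `snowball_out`, and `differing_tokens` are illustrative, and the comparison assumes both outputs contain the same number of tokens per sentence, which holds here.

```python
# Outputs copied from the Porter and Snowball cells above.
porter_out = [
    'machin learn ( ml ) subset artifici intellig ( ai ) enabl system learn data improv perform time without explicitli program .',
    'involv algorithm statist model analyz pattern within data make predict decis .',
    'ml use variou field , healthcar diagnos diseas , financ fraud detect , e-commerc person recommend .',
    'advanc deep learn , subset ml , capabl ai system grown significantli , allow handl complex task like imag recognit , natur languag process , autonom drive .',
]
snowball_out = [
    'machin learn ( ml ) subset artifici intellig ( ai ) enabl system learn data improv perform time without explicit program .',
    'involv algorithm statist model analyz pattern within data make predict decis .',
    'ml use various field , healthcar diagnos diseas , financ fraud detect , e-commerc person recommend .',
    'advanc deep learn , subset ml , capabl ai system grown signific , allow handl complex task like imag recognit , natur languag process , autonom drive .',
]


def differing_tokens(left, right):
    """Return (left_token, right_token) pairs that differ, in sentence order."""
    pairs = []
    for left_sentence, right_sentence in zip(left, right):
        for a, b in zip(left_sentence.split(), right_sentence.split()):
            if a != b:
                pairs.append((a, b))
    return pairs


print(differing_tokens(porter_out, snowball_out))
# [('explicitli', 'explicit'), ('variou', 'various'), ('significantli', 'signific')]
```

In this sample the only differences are in how the two stemmers treat `explicitly`, `various`, and `significantly`.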

In [59]:
# 1. Split the paragraph into sentences and each sentence into tokens.
# 2. Lowercase the tokens and drop stopwords.
# 3. Lemmatize the remaining tokens with WordNet, treating each one as a verb (pos='v').

sentences = sent_tokenize(paragraph)
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

processed_sentences = []

for sentence in sentences:
    words = [word.lower() for word in word_tokenize(sentence)]
    # Keep only the tokens that are not stopwords, and lemmatize each one.
    words = [lemmatizer.lemmatize(word, pos='v') for word in words if word not in stop_words]
    print(words)  # show the tokens kept for this sentence
    processed_sentences.append(' '.join(words))

processed_sentences


['machine', 'learn', '(', 'ml', ')', 'subset', 'artificial', 'intelligence', '(', 'ai', ')', 'enable', 'systems', 'learn', 'data', 'improve', 'performance', 'time', 'without', 'explicitly', 'program', '.']
['involve', 'algorithms', 'statistical', 'model', 'analyze', 'pattern', 'within', 'data', 'make', 'predictions', 'decisions', '.']
['ml', 'use', 'various', 'field', ',', 'healthcare', 'diagnose', 'diseases', ',', 'finance', 'fraud', 'detection', ',', 'e-commerce', 'personalize', 'recommendations', '.']
['advancements', 'deep', 'learn', ',', 'subset', 'ml', ',', 'capabilities', 'ai', 'systems', 'grow', 'significantly', ',', 'allow', 'handle', 'complex', 'task', 'like', 'image', 'recognition', ',', 'natural', 'language', 'process', ',', 'autonomous', 'drive', '.']


['machine learn ( ml ) subset artificial intelligence ( ai ) enable systems learn data improve performance time without explicitly program .',
 'involve algorithms statistical model analyze pattern within data make predictions decisions .',
 'ml use various field , healthcare diagnose diseases , finance fraud detection , e-commerce personalize recommendations .',
 'advancements deep learn , subset ml , capabilities ai systems grow significantly , allow handle complex task like image recognition , natural language process , autonomous drive .']
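
The three cells above share the same tokenize, filter, normalize loop and differ only in the normalizer. One possible refactoring is sketched below, assuming the tokenizers and the normalizer are passed in as callables. The function `preprocess` and its parameter names are hypothetical, and the stand-in callables in the demo exist only so the snippet runs without NLTK data. Unlike the two stemmer cells, this sketch lowercases tokens before normalizing them.

```python
def preprocess(text, sent_tokenizer, word_tokenizer, normalize, stop_words):
    """Tokenize `text`, drop stopwords, and apply `normalize` to each kept token."""
    processed = []
    for sentence in sent_tokenizer(text):
        tokens = [token.lower() for token in word_tokenizer(sentence)]
        kept = [normalize(token) for token in tokens if token not in stop_words]
        processed.append(' '.join(kept))
    return processed


# Stand-ins so the sketch runs on its own; they are far cruder than
# NLTK's sent_tokenize and word_tokenize.
demo = preprocess(
    "It is a test. You and I like tests.",
    sent_tokenizer=lambda text: [s.strip() + "." for s in text.split(".") if s.strip()],
    word_tokenizer=str.split,
    normalize=lambda token: token.rstrip("."),
    stop_words={"it", "is", "a", "you", "and", "i"},
)
print(demo)
# ['test', 'like tests']
```

With NLTK available, the Porter variant would look like `preprocess(paragraph, sent_tokenize, word_tokenize, PorterStemmer().stem, set(stopwords.words('english')))`.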