# $$ Step\ 3\ : Remove\ stopwords\ $$

______________________

## Removing Stop Words in Text Preprocessing 🧹

In text preprocessing, **stop words** are common words (such as "the", "is", "in", and "and") that are typically removed from the text because they carry little meaning of their own in most **Natural Language Processing (NLP)** tasks.

### Why Remove Stop Words?

Removing stop words is useful in many NLP tasks, especially when the goal is to focus on the important words in a sentence that contribute to the meaning. For instance:
- In **sentiment analysis**, words like "the", "a", and "is" do not provide much information about the sentiment of a text.
- In **topic modeling**, stop words don't contribute to discovering the core topics from the text data.

By removing these common, less meaningful words, the model can focus on the more important words that carry the core information.
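
To make this concrete, here is a minimal sketch that counts word frequencies before and after filtering. The `STOP_WORDS` set and the sample text are small hand-written assumptions for illustration, not the NLTK list.

```python
from collections import Counter

# Tiny hand-written stop word set, for illustration only (not the NLTK list).
STOP_WORDS = {"the", "is", "and", "a", "of", "was", "but"}

text = "the plot of the film was thin but the acting was superb and the score was superb"
tokens = text.split()

# Most frequent tokens before filtering: dominated by stop words.
top_before = Counter(tokens).most_common(2)

# Most frequent token after filtering: a content word surfaces.
content_tokens = [t for t in tokens if t not in STOP_WORDS]
top_after = Counter(content_tokens).most_common(1)

print(top_before)  # [('the', 4), ('was', 3)]
print(top_after)   # [('superb', 2)]
```

Before filtering, the most frequent tokens say nothing about the review; after filtering, the most frequent token is the word that carries the opinion.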

### When Are Stop Words Relevant?

Although stop words are often removed, there are scenarios where they might be important:
- In **Named Entity Recognition (NER)**, stop words can be important for understanding the context and relationships between entities. For example, stripping "is" and "the" from the sentence "Barack Obama is the president" leaves "Barack Obama president", which no longer states how the two are related.
- In some cases, stop words can help maintain sentence structure or meaning, especially for tasks that require understanding syntax, like **part-of-speech tagging**.

### Using NLTK for Stop Word Removal

The **nltk** (Natural Language Toolkit) package provides a list of predefined stop words in many languages, which can be used for easy removal during preprocessing. We can load the list of stop words and filter them out of the text data efficiently.

__________________

# Code implementation :

In [2]:
# Download stopwords :
import nltk 
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [3]:
# We will be using the English stopwords :
from nltk.corpus import stopwords
eng_stopwords = stopwords.words('english')

In [4]:
print(eng_stopwords)

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she

#### Add custom stop words :

In [7]:
eng_stopwords.append("go")
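
Because `eng_stopwords` is a plain Python list, re-running the cell above appends "go" a second time. A minimal sketch of adding several words without creating duplicates could look like this. The `stop_words` list is a short stand-in for the NLTK list, and the words in `custom_words` are hypothetical.

```python
# Stand-in for the NLTK list; a short hypothetical sample, not the real contents.
stop_words = ["a", "the", "is", "go"]

# Hypothetical domain-specific words to add.
custom_words = ["go", "product", "Delivery"]

for word in custom_words:
    word = word.lower()          # keep the list lowercase, like the NLTK entries
    if word not in stop_words:   # skip words that are already present
        stop_words.append(word)

print(stop_words)  # ['a', 'the', 'is', 'go', 'product', 'delivery']
```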

#### Remove words from the stop word list :

In [9]:
eng_stopwords.remove("did")
eng_stopwords.remove("not")
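
Keeping negations such as "not" out of the stop word list matters for sentiment tasks. Note also that `list.remove` raises a `ValueError` when the word is absent, for example if the cell is run twice. The sketch below shows one way to handle both points. The helper names `keep_words` and `remove_stop_words` and the sample list are assumptions made for illustration.

```python
# Hypothetical sample of a stop word list (not the full NLTK list).
stop_words = ["the", "was", "not", "but", "did"]

def keep_words(stop_list, words_to_keep):
    """Return a copy of stop_list without words_to_keep; absent words are ignored."""
    keep = set(words_to_keep)
    return [w for w in stop_list if w not in keep]

def remove_stop_words(text, stop_list):
    """Drop every whitespace-separated token that appears in stop_list."""
    return " ".join(w for w in text.split() if w not in stop_list)

review = "the movie was not great"

# With the unmodified list, the negation disappears and the meaning flips.
with_default = remove_stop_words(review, stop_words)

# "never" is not in the sample list; keep_words simply ignores it.
with_negation = remove_stop_words(review, keep_words(stop_words, ["not", "did", "never"]))

print(with_default)   # movie great
print(with_negation)  # movie not great
```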

## Example 1 :

In [5]:
sentence = "The product is really good, but the delivery was slow and the packaging could be improved." 

In [6]:
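# Note: the comparison is case-sensitive and tokens keep their punctuation,
# so "The" (capitalized) is not matched against the lowercase stop word list.
sentence_no_stopwords = " ".join([word for word in sentence.split() if word not in eng_stopwords])
sentence_no_stopwords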

'The product really good, delivery slow packaging could improved.'
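
The leading "The" survives because the stop word list is lowercase. One possible way to handle this while keeping the original casing and punctuation of the remaining words is to compare a normalized copy of each token. This is a sketch under assumptions: `STOP_WORDS` is a small hand-written sample and `is_stop_word` is a hypothetical helper.

```python
import string

# Small hand-written sample, for illustration only (not the NLTK list).
STOP_WORDS = {"the", "is", "but", "was", "and", "be"}

def is_stop_word(token):
    # Compare a lowercased copy with surrounding punctuation stripped;
    # the original token itself is left untouched.
    return token.lower().strip(string.punctuation) in STOP_WORDS

sentence = "The product is really good, but the delivery was slow and the packaging could be improved."
filtered = " ".join(t for t in sentence.split() if not is_stop_word(t))

print(filtered)  # product really good, delivery slow packaging could improved.
```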

## Example 2 :

In [16]:
sentence_list = ['i absolutely love this new phone! the features are incredible!',
                     "i am so frustrated with the slow performance of my laptop. it's terrible!",
                     'the movie was okay, not great but not bad either.',
                     'the service at the restaurant was amazing! i will definitely come back.',
                     "i can’t believe how poorly this software is designed. it's unusable.",
                     'i am really excited about the new game release! can’t wait to try it!',
                     'i am thrilled with how well the project turned out. it’s beyond expectations!']

#### Transform the list into a single string :

In [11]:
list_as_one_string = " ".join(sentence_list)
list_as_one_string

"i absolutely love this new phone! the features are incredible! i am so frustrated with the slow performance of my laptop. it's terrible! the movie was okay, not great but not bad either. the service at the restaurant was amazing! i will definitely come back. i can’t believe how poorly this software is designed. it's unusable. i am really excited about the new game release! can’t wait to try it! i am thrilled with how well the project turned out. it’s beyond expectations!"

In [12]:
list_as_one_string_cleaned = " ".join([x for x in list_as_one_string.split() if x not in (eng_stopwords)])
list_as_one_string_cleaned

'absolutely love new phone! features incredible! frustrated slow performance laptop. terrible! movie okay, not great not bad either. service restaurant amazing! definitely come back. can’t believe poorly software designed. unusable. really excited new game release! can’t wait try it! thrilled well project turned out. it’s beyond expectations!'
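
In the output above, "it’s" is kept while the earlier "it's" is removed. The stop word list spells contractions with a straight apostrophe (`'`), so tokens written with a typographic apostrophe (`’`) do not match. A minimal sketch of normalizing apostrophes before filtering follows. `STOP_WORDS` is a small assumed sample, and `normalize_apostrophes` is a hypothetical helper.

```python
# Small hand-written sample, for illustration only (not the NLTK list).
STOP_WORDS = {"i", "the", "it's"}

def normalize_apostrophes(text):
    # Map typographic single quotes (U+2018, U+2019) to the ASCII apostrophe.
    return text.replace("\u2019", "'").replace("\u2018", "'")

def remove_stop_words(text):
    return " ".join(w for w in text.split() if w not in STOP_WORDS)

raw = "it\u2019s beyond expectations"

without_normalizing = remove_stop_words(raw)
with_normalizing = remove_stop_words(normalize_apostrophes(raw))

print(without_normalizing)  # it’s beyond expectations
print(with_normalizing)     # beyond expectations
```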

## Example 3 :

In [35]:
import string 
import nltk 
from nltk.corpus import stopwords
eng_stopwords = stopwords.words('english')

In [36]:

def text_cleaning(text):
    # Lowercase the whole string :
    text_lower = text.lower()

    # Remove punctuation (every character listed in string.punctuation) :
    text_punc_removed = text_lower.translate(str.maketrans("", "", string.punctuation))

    # Remove stopwords :
    text_cleaned = " ".join([x for x in text_punc_removed.split() if x not in eng_stopwords])

    return text_cleaned

In [37]:
sentence = "The product is REALLY good !!!!, but the delivery was slow and the #packaging could be improved." 

In [38]:
print(text_cleaning(sentence))

product really good delivery slow packaging could improved


In [39]:
sentences = [
    "This is an amazing product, I love it!",
    "The weather is nice today, perfect for a walk.",
    "I cannot believe how beautiful the sunset is!",
    "It was a fantastic event, everyone enjoyed it."
]

cleaned_sentences = [ text_cleaning(sentence) for sentence in sentences ]
print(cleaned_sentences)

['amazing product love', 'weather nice today perfect walk', 'cannot believe beautiful sunset', 'fantastic event everyone enjoyed']
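
One caveat of stripping punctuation before filtering: a contraction such as "don't" becomes "dont", which no longer matches the stop word "don't". A possible workaround, sketched below under assumptions, is to apply the same punctuation stripping to the stop words themselves. `STOP_WORDS` is a small hand-written sample and `clean` is a hypothetical stand-in for `text_cleaning`.

```python
import string

# Small hand-written sample, for illustration only (not the NLTK list).
STOP_WORDS = {"i", "it", "don't", "isn't"}

_strip_punct = str.maketrans("", "", string.punctuation)

# Add the punctuation-free form of each stop word ("don't" -> "dont").
NORMALIZED_STOP_WORDS = STOP_WORDS | {w.translate(_strip_punct) for w in STOP_WORDS}

def clean(text, stop_words):
    # Lowercase, strip punctuation, then drop tokens found in stop_words.
    stripped = text.lower().translate(_strip_punct)
    return " ".join(w for w in stripped.split() if w not in stop_words)

sentence = "I don't like it, it isn't fast."

naive = clean(sentence, STOP_WORDS)
fixed = clean(sentence, NORMALIZED_STOP_WORDS)

print(naive)  # dont like isnt fast
print(fixed)  # like fast
```

Whether the contractions should be dropped at all depends on the task: for sentiment analysis it may be preferable to keep negations, as discussed earlier.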
