
# Stop Words
Stop words are very common words, such as "the", "is", and "our", that carry little meaning on their own. They are usually removed during the NLP preprocessing stage.

## Single Line

In [3]:
import spacy

#load the pretrained English model
nlp = spacy.load("en_core_web_sm")

text = "We just opened our wings, the flying part is coming soon"
doc = nlp(text)

#print the tokens that are neither stop words nor punctuation
for token in doc:
    if not token.is_stop and not token.is_punct:
        print(token)

opened
wings
flying
coming
soon
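
To make the filtering step concrete, here is a minimal pure-Python sketch of the same idea. It assumes a tiny hand-picked stop-word set (`DEMO_STOP_WORDS`, hypothetical and far smaller than spaCy's list) and a naive whitespace tokenizer, so it only approximates what `token.is_stop` and `token.is_punct` do.

```python
import string

# Hypothetical stop-word set chosen for this sentence only; spaCy's real list is much larger.
DEMO_STOP_WORDS = {"we", "just", "our", "the", "part", "is"}


def remove_stop_words(text, stop_words=DEMO_STOP_WORDS):
    """Split on whitespace, strip surrounding punctuation, drop stop words."""
    words = [word.strip(string.punctuation) for word in text.split()]
    return [word for word in words if word and word.lower() not in stop_words]


print(remove_stop_words("We just opened our wings, the flying part is coming soon"))
# ['opened', 'wings', 'flying', 'coming', 'soon']
```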


## Remove Stop Words from Large Dataset

In [1]:
import pandas as pd
import numpy as np

In [4]:
#load dataset
df = pd.read_json("https://raw.githubusercontent.com/codebasics/nlp-tutorials/main/10_stop_words/doj_press.json", lines=True)
df.head()

Unnamed: 0,id,title,contents,date,topics,components
0,,Convicted Bomb Plotter Sentenced to 30 Years,"PORTLAND, Oregon. – Mohamed Osman Mohamud, 23,...",2014-10-01T00:00:00-04:00,[],[National Security Division (NSD)]
1,12-919,$1 Million in Restitution Payments Announced t...,WASHINGTON – North Carolina’s Waccamaw River...,2012-07-25T00:00:00-04:00,[],[Environment and Natural Resources Division]
2,11-1002,$1 Million Settlement Reached for Natural Reso...,BOSTON– A $1-million settlement has been...,2011-08-03T00:00:00-04:00,[],[Environment and Natural Resources Division]
3,10-015,10 Las Vegas Men Indicted \r\nfor Falsifying V...,WASHINGTON—A federal grand jury in Las Vegas...,2010-01-08T00:00:00-05:00,[],[Environment and Natural Resources Division]
4,18-898,$100 Million Settlement Will Speed Cleanup Wor...,"The U.S. Department of Justice, the U.S. Envir...",2018-07-09T00:00:00-04:00,[Environment],[Environment and Natural Resources Division]


In [5]:
#remove the rows that don't have any topic
df = df[ df['topics'].str.len() != 0 ]
df.head()

Unnamed: 0,id,title,contents,date,topics,components
4,18-898,$100 Million Settlement Will Speed Cleanup Wor...,"The U.S. Department of Justice, the U.S. Envir...",2018-07-09T00:00:00-04:00,[Environment],[Environment and Natural Resources Division]
7,14-1412,14 Indicted in Connection with New England Com...,A 131-count criminal indictment was unsealed t...,2014-12-17T00:00:00-05:00,[Consumer Protection],[Civil Division]
19,17-1419,2017 Southeast Regional Animal Cruelty Prosecu...,The United States Attorney’s Office for the Mi...,2017-12-14T00:00:00-05:00,[Environment],"[Environment and Natural Resources Division, U..."
22,15-1562,21st Century Oncology to Pay $19.75 Million to...,"21st Century Oncology LLC, has agreed to pay $...",2015-12-18T00:00:00-05:00,"[False Claims Act, Health Care Fraud]",[Civil Division]
23,17-1404,21st Century Oncology to Pay $26 Million to Se...,21st Century Oncology Inc. and certain of its ...,2017-12-12T00:00:00-05:00,"[Health Care Fraud, False Claims Act]","[Civil Division, USAO - Florida, Middle]"
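
The filter above works because `.str.len()` also returns the length of list values, so a row with an empty `topics` list gets a length of 0. A small sketch on a made-up DataFrame (the rows are illustrative and not taken from the dataset; `demo_df`, `topic_counts`, and `with_topics` are hypothetical names):

```python
import pandas as pd

# Made-up rows for illustration; only the column names mirror the dataset.
demo_df = pd.DataFrame({
    "title": ["Press release A", "Press release B", "Press release C"],
    "topics": [[], ["Environment"], ["False Claims Act", "Health Care Fraud"]],
})

# .str.len() gives the number of items in each list
topic_counts = demo_df["topics"].str.len()

# keep only the rows with at least one topic
with_topics = demo_df[topic_counts != 0]
print(with_topics["title"].tolist())
```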


In [7]:
#this function removes all stop words and punctuation

def preprocess_text(text):
    doc = nlp(text)
    no_stop_words = [token.text for token in doc if not token.is_stop and not token.is_punct]
    return " ".join(no_stop_words)

In [8]:
#remove all stop words from contents column
df['new_contents'] = df['contents'].apply(preprocess_text)
df.head()

Unnamed: 0,id,title,contents,date,topics,components,new_contents
4,18-898,$100 Million Settlement Will Speed Cleanup Wor...,"The U.S. Department of Justice, the U.S. Envir...",2018-07-09T00:00:00-04:00,[Environment],[Environment and Natural Resources Division],U.S. Department Justice U.S. Environmental Pro...
7,14-1412,14 Indicted in Connection with New England Com...,A 131-count criminal indictment was unsealed t...,2014-12-17T00:00:00-05:00,[Consumer Protection],[Civil Division],131 count criminal indictment unsealed today B...
19,17-1419,2017 Southeast Regional Animal Cruelty Prosecu...,The United States Attorney’s Office for the Mi...,2017-12-14T00:00:00-05:00,[Environment],"[Environment and Natural Resources Division, U...",United States Attorney Office Middle District ...
22,15-1562,21st Century Oncology to Pay $19.75 Million to...,"21st Century Oncology LLC, has agreed to pay $...",2015-12-18T00:00:00-05:00,"[False Claims Act, Health Care Fraud]",[Civil Division],21st Century Oncology LLC agreed pay $ 19.75 m...
23,17-1404,21st Century Oncology to Pay $26 Million to Se...,21st Century Oncology Inc. and certain of its ...,2017-12-12T00:00:00-05:00,"[Health Care Fraud, False Claims Act]","[Civil Division, USAO - Florida, Middle]",21st Century Oncology Inc. certain subsidiarie...
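
Calling `nlp` once per row through `apply` can be slow on a large dataset. One possible alternative is `nlp.pipe`, which processes texts in batches. The sketch below is a variant under stated assumptions, not the notebook's method: it uses `spacy.blank("en")` (tokenizer only, no model download) under the hypothetical name `nlp_light`, and `preprocess_texts` and `batch_size=50` are illustrative choices.

```python
import spacy

# Tokenizer-only English pipeline; is_stop and is_punct are lexical attributes,
# so they do not need the tagger, parser, or NER.
nlp_light = spacy.blank("en")


def preprocess_texts(texts, batch_size=50):
    """Remove stop words and punctuation from an iterable of strings, in batches."""
    cleaned = []
    for doc in nlp_light.pipe(texts, batch_size=batch_size):
        kept = [token.text for token in doc if not token.is_stop and not token.is_punct]
        cleaned.append(" ".join(kept))
    return cleaned


print(preprocess_texts(["We just opened our wings, the flying part is coming soon"]))
```

With the DataFrame above, this could be used as `df['new_contents'] = preprocess_texts(df['contents'])`.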


In [14]:
#Check the contents text
print(f"Text Length: {len(df.iloc[0]['contents'])}")
df.iloc[0]['contents']

Text Length: 6286


"The U.S. Department of Justice, the U.S. Environmental Protection Agency (EPA), and the Rhode Island Department of Environmental Management (RIDEM) announced today that two subsidiaries of Stanley Black & Decker Inc.—Emhart Industries Inc. and Black & Decker Inc.—have agreed to clean up dioxin contaminated sediment and soil at the Centredale Manor Restoration Project Superfund Site in North Providence and Johnston, Rhode Island.\xa0 “We are pleased to reach a resolution through collaborative work with the responsible parties, EPA, and other stakeholders,” said\xa0Acting Assistant Attorney General Jeffrey H. Wood for the Justice Department's\xa0Environment and Natural Resources Division . “Today’s settlement ends protracted litigation and allows for important work to get underway to restore a healthy environment for citizens living in and around the Centredale Manor Site and the Woonasquatucket River.” “This settlement demonstrates the tremendous progress we are achieving working with 

In [15]:
#Compare the new contents text
print(f"Text Length: {len(df.iloc[0]['new_contents'])}")
df.iloc[0]['new_contents']

Text Length: 4574


'U.S. Department Justice U.S. Environmental Protection Agency EPA Rhode Island Department Environmental Management RIDEM announced today subsidiaries Stanley Black Decker Inc.—Emhart Industries Inc. Black Decker Inc.—have agreed clean dioxin contaminated sediment soil Centredale Manor Restoration Project Superfund Site North Providence Johnston Rhode Island \xa0  pleased reach resolution collaborative work responsible parties EPA stakeholders said \xa0 Acting Assistant Attorney General Jeffrey H. Wood Justice Department \xa0 Environment Natural Resources Division Today settlement ends protracted litigation allows important work underway restore healthy environment citizens living Centredale Manor Site Woonasquatucket River settlement demonstrates tremendous progress achieving working responsible parties states federal partners expedite sites entire Superfund remediation process said EPA Acting Administrator Andrew Wheeler Centredale Manor Site National Priorities List 18 years taking c

## Problems

In [17]:
text = '''
Thor: Love and Thunder is a 2022 American superhero film based on Marvel Comics featuring the character Thor, produced by Marvel Studios and
distributed by Walt Disney Studios Motion Pictures. It is the sequel to Thor: Ragnarok (2017) and the 29th film in the Marvel Cinematic Universe (MCU).
The film is directed by Taika Waititi, who co-wrote the script with Jennifer Kaytin Robinson, and stars Chris Hemsworth as Thor alongside Christian Bale, Tessa Thompson,
Jaimie Alexander, Waititi, Russell Crowe, and Natalie Portman. In the film, Thor attempts to find inner peace, but must return to action and recruit Valkyrie (Thompson),
Korg (Waititi), and Jane Foster (Portman)—who is now the Mighty Thor—to stop Gorr the God Butcher (Bale) from eliminating all gods.
'''

doc = nlp(text)

#### Count the total number of stop words

In [18]:
#count word tokens, not characters: len(text) would return the number of characters
total_number_of_words = len([token for token in doc if not token.is_punct and not token.is_space])

#find the stop words in the text
stop_words = [token.text for token in doc if token.is_stop]
total_number_of_stop_words = len(stop_words)

print("Total Number of Stop Words: ", total_number_of_stop_words)

Total Number of Stop Words:  40


#### Print the percentage of stop words

In [19]:
print("Percentage of Stop Words: ", (total_number_of_stop_words / total_number_of_words) * 100)
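
When computing this ratio, the denominator should be a word count. In Python, `len()` on a string returns the number of characters, which is much larger than the number of words and makes the percentage look far too small. A small sketch with a made-up sentence and a hand-counted number of stop words (both are illustrative assumptions):

```python
sentence = "this is not a good movie"
stop_word_count = 4  # "this", "is", "not", "a" (counted by hand for this sketch)

char_count = len(sentence)          # characters, including spaces
word_count = len(sentence.split())  # whitespace-separated words

print(f"Relative to characters: {stop_word_count / char_count * 100:.1f}%")
print(f"Relative to words: {stop_word_count / word_count * 100:.1f}%")
```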


#### Keep the word "not" by removing it from spaCy's stop words

In [22]:
#check whether "not" is removed as a stop word
texts = ["this is a good movie", "this is not a good movie"]

#display the results
for text in texts:
    doc = nlp(text)
    for token in doc:
        if not token.is_stop and not token.is_punct:
            print(token)

good
movie
good
movie


In [21]:
#mark 'not' as a non-stop word so that it is kept
nlp.vocab['not'].is_stop = False

#check whether "not" is still removed
texts = ["this is a good movie", "this is not a good movie"]

#display the results
for text in texts:
    doc = nlp(text)
    for token in doc:
        if not token.is_stop and not token.is_punct:
            print(token)

good
movie
not
good
movie
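
Setting `is_stop` on a vocabulary entry changes the shared pipeline object for every later call. As one possible alternative, the sketch below filters against a copy of spaCy's `STOP_WORDS` set with a few negation words removed. The names `NEGATIONS`, `custom_stop_words`, and `keep_content_words` are hypothetical, the choice of negation words is an assumption, and `spacy.blank("en")` is used so the example runs without downloading a model.

```python
import spacy
from spacy.lang.en.stop_words import STOP_WORDS

# Assumed set of negation words to keep; adjust for the task at hand.
NEGATIONS = {"not", "no", "never"}
custom_stop_words = STOP_WORDS - NEGATIONS

nlp_demo = spacy.blank("en")  # tokenizer-only pipeline


def keep_content_words(text):
    """Return tokens that are neither punctuation nor in the custom stop-word set."""
    doc = nlp_demo(text)
    return [
        token.text
        for token in doc
        if not token.is_punct and token.lower_ not in custom_stop_words
    ]


for sentence in ["this is a good movie", "this is not a good movie"]:
    print(keep_content_words(sentence))
```

Because the comparison uses `token.lower_`, capitalized forms such as "Not" are kept as well.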
