### Stop words tutorial

In [20]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS

print(len(STOP_WORDS))

string = ' '.join(STOP_WORDS)
string

326


"did in most already he we everyone them cannot at or everywhere bottom per they which various with will ‘re ’ve noone otherwise should mostly into own elsewhere even used becomes amount by perhaps twenty show hereupon until nevertheless rather some then regarding throughout formerly once empty him its seem within hereby everything yourself three hereafter least thereafter an five other often front been i has made so thru towards 's might such across via was eleven ever ‘d ‘m be very herself take below go from part whether between more n't make ’re my who up ‘ll since ten although here of myself whoever no n‘t there for whom re anywhere are give us one themselves due hundred and she would 'm may ca against sixty say sometime name seems ’s move does former whereupon side along ’d serious that but whither afterwards their much another either doing me any mine her those two again anyone using thereby your nine now next than therefore when 're someone full never further am anything itself 

In [34]:
nlp = spacy.load('en_core_web_sm')
doc = nlp('We just opened our wings, the flying part is coming soon')

result_tokens = []

for token in doc:
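    # Test membership in the STOP_WORDS set. A substring test against the joined
    # string would also drop "coming", because it occurs inside "becoming".
    if token.text.lower() not in STOP_WORDS:
        result_tokens.append(token.text)

result = ' '.join(result_tokens)

result

'opened wings , flying coming soon'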
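
One detail worth checking in the loop above is how membership is tested: `in` on a string is a substring test, while `in` on a set is an exact match. The sketch below shows the difference using a small illustrative stop-word set (an assumption, not spaCy's full list) and a hand-written token list instead of a spaCy `Doc`.

```python
# Small illustrative stop-word set (assumption; spaCy's STOP_WORDS is much larger)
stop_words = {"we", "just", "our", "the", "part", "is", "becoming"}
joined = " ".join(sorted(stop_words))

# Hand-written tokens standing in for a tokenized sentence
tokens = ["We", "just", "opened", "our", "wings", ",", "the",
          "flying", "part", "is", "coming", "soon"]

# Substring test: "coming" is dropped because it occurs inside "becoming"
by_substring = [t for t in tokens if t.lower() not in joined]

# Set membership: only exact stop words are dropped
by_set = [t for t in tokens if t.lower() not in stop_words]

print(" ".join(by_substring))  # opened wings , flying soon
print(" ".join(by_set))        # opened wings , flying coming soon
```

With a set, a lookup also stays fast no matter how many stop words there are.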

In [29]:
nlp = spacy.load("en_core_web_sm")

doc = nlp("We just opened our wings, the flying part is coming soon")

for token in doc:
    if token.is_stop:
        print(token)

We
just
our
the
part
is


In [35]:
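def preprocess(text):
    """Return text with spaCy stop-word tokens removed, joined by single spaces."""
    doc = nlp(text)

    # Keep only the tokens that spaCy does not flag as stop words
    no_stop_words = [token.text for token in doc if not token.is_stop]
    return " ".join(no_stop_words)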

In [36]:
preprocess("Musk wants time to prepare for a trial over his")

'Musk wants time prepare trial'

In [37]:
preprocess("The other is not other but your divine brother")

'divine brother'

##### Remove stop words from a pandas DataFrame text column

The dataset is downloaded from: https://www.kaggle.com/datasets/jbencina/department-of-justice-20092018-press-releases
It contains press releases about court cases from the Department of Justice (DOJ). The releases contain information such as outcomes of criminal cases, notable actions taken against felons, or other updates about the current administration.

In [39]:
import pandas as pd

df = pd.read_json("doj_press.json", lines=True)

df.shape

(13087, 6)

In [98]:
df.head(5)

| | id | title | contents | date | topics | components |
|---|---|---|---|---|---|---|
| 0 | | Convicted Bomb Plotter Sentenced to 30 Years | PORTLAND, Oregon. – Mohamed Osman Mohamud, 23,... | 2014-10-01T00:00:00-04:00 | [] | [National Security Division (NSD)] |
| 1 | 12-919 | $1 Million in Restitution Payments Announced t... | WASHINGTON – North Carolina’s Waccamaw River... | 2012-07-25T00:00:00-04:00 | [] | [Environment and Natural Resources Division] |
| 2 | 11-1002 | $1 Million Settlement Reached for Natural Reso... | BOSTON– A $1-million settlement has been... | 2011-08-03T00:00:00-04:00 | [] | [Environment and Natural Resources Division] |
| 3 | 10-015 | 10 Las Vegas Men Indicted \r\nfor Falsifying V... | WASHINGTON—A federal grand jury in Las Vegas... | 2010-01-08T00:00:00-05:00 | [] | [Environment and Natural Resources Division] |
| 4 | 18-898 | $100 Million Settlement Will Speed Cleanup Wor... | The U.S. Department of Justice, the U.S. Envir... | 2018-07-09T00:00:00-04:00 | [Environment] | [Environment and Natural Resources Division] |


Filter out the rows that have no topics associated with the case.

In [82]:
df = df[df["topics"].str.len() != 0]
df.head()

| | id | title | contents | date | topics | components |
|---|---|---|---|---|---|---|
| 4 | 18-898 | $100 Million Settlement Will Speed Cleanup Wor... | The U.S. Department of Justice, the U.S. Envir... | 2018-07-09T00:00:00-04:00 | [Environment] | [Environment and Natural Resources Division] |
| 7 | 14-1412 | 14 Indicted in Connection with New England Com... | A 131-count criminal indictment was unsealed t... | 2014-12-17T00:00:00-05:00 | [Consumer Protection] | [Civil Division] |
| 19 | 17-1419 | 2017 Southeast Regional Animal Cruelty Prosecu... | The United States Attorney’s Office for the Mi... | 2017-12-14T00:00:00-05:00 | [Environment] | [Environment and Natural Resources Division, U... |
| 22 | 15-1562 | 21st Century Oncology to Pay $19.75 Million to... | 21st Century Oncology LLC, has agreed to pay $... | 2015-12-18T00:00:00-05:00 | [False Claims Act, Health Care Fraud] | [Civil Division] |
| 23 | 17-1404 | 21st Century Oncology to Pay $26 Million to Se... | 21st Century Oncology Inc. and certain of its ... | 2017-12-12T00:00:00-05:00 | [Health Care Fraud, False Claims Act] | [Civil Division, USAO - Florida, Middle] |


In [99]:
df.shape

(13087, 6)

In [48]:
# Take an explicit copy so the new column below is not assigned to a slice of the original DataFrame
df = df.head(100).copy()
df.shape

(100, 6)

In [49]:
df["contents_new"] = df.contents.apply(preprocess)


In [50]:
df.head()

| | id | title | contents | date | topics | components | contents_new |
|---|---|---|---|---|---|---|---|
| 0 | | Convicted Bomb Plotter Sentenced to 30 Years | PORTLAND, Oregon. – Mohamed Osman Mohamud, 23,... | 2014-10-01T00:00:00-04:00 | [] | [National Security Division (NSD)] | PORTLAND , Oregon . – Mohamed Osman Mohamud , ... |
| 1 | 12-919 | $1 Million in Restitution Payments Announced t... | WASHINGTON – North Carolina’s Waccamaw River... | 2012-07-25T00:00:00-04:00 | [] | [Environment and Natural Resources Division] | WASHINGTON – North Carolina Waccamaw River ... |
| 2 | 11-1002 | $1 Million Settlement Reached for Natural Reso... | BOSTON– A $1-million settlement has been... | 2011-08-03T00:00:00-04:00 | [] | [Environment and Natural Resources Division] | BOSTON – $ 1 - million settlement reach... |
| 3 | 10-015 | 10 Las Vegas Men Indicted \r\nfor Falsifying V... | WASHINGTON—A federal grand jury in Las Vegas... | 2010-01-08T00:00:00-05:00 | [] | [Environment and Natural Resources Division] | WASHINGTON — federal grand jury Las Vegas t... |
| 4 | 18-898 | $100 Million Settlement Will Speed Cleanup Wor... | The U.S. Department of Justice, the U.S. Envir... | 2018-07-09T00:00:00-04:00 | [Environment] | [Environment and Natural Resources Division] | U.S. Department Justice , U.S. Environmental P... |


In [51]:
len(df.contents[4])

6286

In [52]:
len(df.contents_new[4])

4810

In [53]:
df.contents[4][:300]

'The U.S. Department of Justice, the U.S. Environmental Protection Agency (EPA), and the Rhode Island Department of Environmental Management (RIDEM) announced today that two subsidiaries of Stanley Black & Decker Inc.—Emhart Industries Inc. and Black & Decker Inc.—have agreed to clean up dioxin conta'

In [54]:
df.contents_new[4][:300]

'U.S. Department Justice , U.S. Environmental Protection Agency ( EPA ) , Rhode Island Department Environmental Management ( RIDEM ) announced today subsidiaries Stanley Black & Decker Inc.—Emhart Industries Inc. Black & Decker Inc.—have agreed clean dioxin contaminated sediment soil Centredale Manor'
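
The two lengths reported above for row 4 (6286 and 4810 characters) can be turned into a relative reduction. This is only a small arithmetic sketch with the two numbers copied in by hand; the variable names are illustrative.

```python
# Character counts reported above for row 4
original_len = 6286   # len(df.contents[4])
cleaned_len = 4810    # len(df.contents_new[4])

removed = original_len - cleaned_len
reduction_pct = 100 * removed / original_len

print(f"{removed} characters removed ({reduction_pct:.1f}% shorter)")
# 1476 characters removed (23.5% shorter)
```

Joining tokens with spaces also puts a space before punctuation (for example `Justice ,`), so the character count slightly understates how much text was dropped.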

##### Examples where removing stop words can create a problem

**(1) Sentiment detection: removing stop words can change the sentiment of a sentence. This does not always happen; it depends on your dataset.**

In [55]:
preprocess("this is a good movie")

'good movie'

In [56]:
preprocess("this is not a good movie")

'good movie'
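
One possible workaround is to exempt negation words from removal. The sketch below is written without spaCy so that it runs on its own. The tiny `STOP_WORDS_DEMO` set, the `NEGATIONS` set, the function name, and the whitespace tokenization are all assumptions for illustration. With spaCy, the same idea would be an extra condition next to `token.is_stop`.

```python
# Illustrative subset of stop words (assumption; not spaCy's full list)
STOP_WORDS_DEMO = {"this", "is", "a", "not", "no", "never"}
# Words that flip sentiment and should survive stop-word removal (assumption)
NEGATIONS = {"not", "no", "never"}

def preprocess_keep_negations(text):
    """Drop stop words from whitespace-split text, but keep negation words."""
    tokens = text.split()
    kept = [t for t in tokens
            if t.lower() not in STOP_WORDS_DEMO or t.lower() in NEGATIONS]
    return " ".join(kept)

print(preprocess_keep_negations("this is a good movie"))      # good movie
print(preprocess_keep_negations("this is not a good movie"))  # not good movie
```

The two sentences now produce different outputs, so a downstream sentiment model can still tell them apart.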

**(2) Language translation: say you want to translate the following sentence from English to Telugu. If you remove stop words before translating, the result will be poor, because most of the sentence is gone.**

In [57]:
preprocess("how are you doing dhaval?")

'dhaval ?'

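**(3) Chatbot or any Q&A system: words such as "I", "don't", and "can you" carry the user's intent, so removing them can obscure what is being asked.**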

In [58]:
preprocess("I don't find yoga mat on your website. Can you help?")

'find yoga mat website . help ?'