# Text Input

In [1]:
paragraph="""My dear young friends, dream, dream, dream. Dreams transform into thoughts and thoughts result in action. You have to dream before your dreams can come true. You should have a goal and a constant quest to acquire knowledge. Hard work and perseverance are essential. Use technology for the benefit of humankind and not for its destruction. The ignited mind of the youth is the most powerful resource on the earth, above the earth, and under the earth. When the student is ready, the teacher will appear. Aim high, dream big, and work hard to achieve those dreams. The future belongs to the young who have the courage to dream and the determination to realize those dreams. Remember, small aim is a crime; have great aim and pursue it with all your heart."""

# StopWords:
Stopwords are common words that are often removed from text in natural language processing (NLP) tasks because they carry little meaningful information for certain applications. Examples include "a," "an," "the," "and," "but," "or," "on," "in," and "with."

**Why Remove Stopwords?**


1. **Reduce Data Size:** Removing stopwords can reduce the size of the dataset, making processing more efficient.
2. **Improve Model Performance:** For tasks like text classification or clustering, removing stopwords can help improve the performance of the model by reducing noise.
3. **Focus on Meaningful Words:** Stopwords are generally considered to add little value in understanding the context or meaning of a document.
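
As a rough illustration of the size reduction, the sketch below filters one sentence against a tiny hand-written stopword set. The name `tiny_stopwords` and its contents are hypothetical and far smaller than the lists shipped with NLTK, spaCy, or scikit-learn.

```python
# A hypothetical, hand-picked stopword set used only for illustration.
tiny_stopwords = {"a", "an", "the", "and", "but", "or", "on", "in",
                  "with", "of", "for", "not", "its"}

sentence = "Use technology for the benefit of humankind and not for its destruction"
tokens = sentence.lower().split()

# Keep only the tokens that are not in the stopword set.
kept = [t for t in tokens if t not in tiny_stopwords]

print(len(tokens), "tokens before,", len(kept), "after")
print(kept)
```

Here more than half of the tokens are dropped, which is typical of short English sentences but will vary with the text and the stopword list used.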



## StopWords Techniques:

### 1. StopWords using NLTK

In [2]:
!pip install nltk



In [3]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

Check the list of English stopwords that ships with NLTK:

In [4]:
nltk.corpus.stopwords.words('english')

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [5]:
# Tokenize the paragraph into words
from nltk.tokenize import word_tokenize
nltk.download('punkt')

words = word_tokenize(paragraph)
print(words)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


['My', 'dear', 'young', 'friends', ',', 'dream', ',', 'dream', ',', 'dream', '.', 'Dreams', 'transform', 'into', 'thoughts', 'and', 'thoughts', 'result', 'in', 'action', '.', 'You', 'have', 'to', 'dream', 'before', 'your', 'dreams', 'can', 'come', 'true', '.', 'You', 'should', 'have', 'a', 'goal', 'and', 'a', 'constant', 'quest', 'to', 'acquire', 'knowledge', '.', 'Hard', 'work', 'and', 'perseverance', 'are', 'essential', '.', 'Use', 'technology', 'for', 'the', 'benefit', 'of', 'humankind', 'and', 'not', 'for', 'its', 'destruction', '.', 'The', 'ignited', 'mind', 'of', 'the', 'youth', 'is', 'the', 'most', 'powerful', 'resource', 'on', 'the', 'earth', ',', 'above', 'the', 'earth', ',', 'and', 'under', 'the', 'earth', '.', 'When', 'the', 'student', 'is', 'ready', ',', 'the', 'teacher', 'will', 'appear', '.', 'Aim', 'high', ',', 'dream', 'big', ',', 'and', 'work', 'hard', 'to', 'achieve', 'those', 'dreams', '.', 'The', 'future', 'belongs', 'to', 'the', 'young', 'who', 'have', 'the', 'cour

In [6]:
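# Build the stopword set once instead of on every loop iteration
nltk_stop_words = set(nltk.corpus.stopwords.words('english'))

words_after_stopwords = []
word_as_stopwords = []
for word in words:
  # The comparison is case-sensitive, so capitalised tokens such as 'My' are kept
  if word not in nltk_stop_words:
    words_after_stopwords.append(word)
  else:
    word_as_stopwords.append(word)

print("Words after removing stopwords:", words_after_stopwords)
print("Stopwords:", word_as_stopwords)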



Words after removing stopwords: ['My', 'dear', 'young', 'friends', ',', 'dream', ',', 'dream', ',', 'dream', '.', 'Dreams', 'transform', 'thoughts', 'thoughts', 'result', 'action', '.', 'You', 'dream', 'dreams', 'come', 'true', '.', 'You', 'goal', 'constant', 'quest', 'acquire', 'knowledge', '.', 'Hard', 'work', 'perseverance', 'essential', '.', 'Use', 'technology', 'benefit', 'humankind', 'destruction', '.', 'The', 'ignited', 'mind', 'youth', 'powerful', 'resource', 'earth', ',', 'earth', ',', 'earth', '.', 'When', 'student', 'ready', ',', 'teacher', 'appear', '.', 'Aim', 'high', ',', 'dream', 'big', ',', 'work', 'hard', 'achieve', 'dreams', '.', 'The', 'future', 'belongs', 'young', 'courage', 'dream', 'determination', 'realize', 'dreams', '.', 'Remember', ',', 'small', 'aim', 'crime', ';', 'great', 'aim', 'pursue', 'heart', '.']
Stopwords: ['into', 'and', 'in', 'have', 'to', 'before', 'your', 'can', 'should', 'have', 'a', 'and', 'a', 'to', 'and', 'are', 'for', 'the', 'of', 'and', 'no
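
The output above still contains 'My', 'You', 'The', and 'When' because the NLTK list is all lowercase and the membership test is case-sensitive. One possible fix is to lower-case each token before the lookup. In the sketch below, `stop_set` is a small hard-coded stand-in for `set(nltk.corpus.stopwords.words('english'))`, an assumption made so that the snippet runs without any download.

```python
# Stand-in for set(nltk.corpus.stopwords.words('english')); an assumption for this sketch.
stop_set = {"my", "you", "the", "when", "have", "to", "before", "your", "can", "is"}

words = ["You", "have", "to", "dream", "before", "your", "dreams", "can", "come", "true", "."]

# Compare the lower-cased token so capitalised stopwords such as "You" are caught.
kept = [w for w in words if w.lower() not in stop_set]
removed = [w for w in words if w.lower() in stop_set]

print("Kept:", kept)
print("Removed:", removed)
```

The original casing of the kept tokens is preserved; only the comparison is done in lowercase.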

### 2. StopWords using spaCy


In [7]:
!pip install spacy
!python -m spacy download en_core_web_sm


Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [8]:
import spacy
# Load the English model
nlp = spacy.load('en_core_web_sm')

# Access the stopwords
stop_words = nlp.Defaults.stop_words

print(stop_words)




{'too', 'used', 'twenty', 'front', 'ten', 'former', '’re', 'hence', 'four', 'meanwhile', 'nobody', 'at', 'in', 'so', 'hereupon', 'please', 'back', 'their', 'until', 'is', 'once', 'being', 'seems', 'does', 'because', 'various', 'top', 'under', 'will', 'call', 'keep', 'ca', 'either', '’ll', 'two', 'yourselves', 'anyhow', 'noone', "'re", 'after', 'she', 'name', 'toward', 'around', 'other', 'first', 'before', 'which', 'few', 'less', 'beforehand', 'otherwise', 'no', 'several', 'namely', 'they', 'not', 'bottom', 'really', '‘m', 'enough', 'made', 'very', 'whereafter', 'thus', 'six', 'do', 'over', 'has', 'against', 'anyway', 'though', 'whereupon', '‘ve', 'afterwards', 'now', 'be', 'make', 'say', 'neither', 'empty', 'together', 'some', 'last', 'seeming', 'put', 'are', 'would', 'myself', 'anywhere', 'became', 'him', 'next', 'becoming', 'hundred', 'whom', 'our', 'was', 'done', 'get', 'another', 'besides', 'anything', 'those', 'except', 'serious', 'such', '‘s', 'on', 'but', 'herein', 'hers', 'move

In [9]:
# Process the text
doc = nlp(paragraph)

# Remove stopwords
filtered_tokens = [token.text for token in doc if not token.is_stop]

print(filtered_tokens)

['dear', 'young', 'friends', ',', 'dream', ',', 'dream', ',', 'dream', '.', 'Dreams', 'transform', 'thoughts', 'thoughts', 'result', 'action', '.', 'dream', 'dreams', 'come', 'true', '.', 'goal', 'constant', 'quest', 'acquire', 'knowledge', '.', 'Hard', 'work', 'perseverance', 'essential', '.', 'Use', 'technology', 'benefit', 'humankind', 'destruction', '.', 'ignited', 'mind', 'youth', 'powerful', 'resource', 'earth', ',', 'earth', ',', 'earth', '.', 'student', 'ready', ',', 'teacher', 'appear', '.', 'Aim', 'high', ',', 'dream', 'big', ',', 'work', 'hard', 'achieve', 'dreams', '.', 'future', 'belongs', 'young', 'courage', 'dream', 'determination', 'realize', 'dreams', '.', 'Remember', ',', 'small', 'aim', 'crime', ';', 'great', 'aim', 'pursue', 'heart', '.']


### 3. StopWords using scikit-learn (CountVectorizer)


In [10]:
!pip install scikit-learn



In [11]:
# Sentence tokenization
# Split the paragraph into sentences on full stops.
paragraph="""My dear young friends, dream, dream, dream. Dreams transform into thoughts and thoughts result in action. You have to dream before your dreams can come true. You should have a goal and a constant quest to acquire knowledge. Hard work and perseverance are essential. Use technology for the benefit of humankind and not for its destruction. The ignited mind of the youth is the most powerful resource on the earth, above the earth, and under the earth. When the student is ready, the teacher will appear. Aim high, dream big, and work hard to achieve those dreams. The future belongs to the young who have the courage to dream and the determination to realize those dreams. Remember, small aim is a crime; have great aim and pursue it with all your heart."""
tokens= paragraph.split('.')
tokens

['My dear young friends, dream, dream, dream',
 ' Dreams transform into thoughts and thoughts result in action',
 ' You have to dream before your dreams can come true',
 ' You should have a goal and a constant quest to acquire knowledge',
 ' Hard work and perseverance are essential',
 ' Use technology for the benefit of humankind and not for its destruction',
 ' The ignited mind of the youth is the most powerful resource on the earth, above the earth, and under the earth',
 ' When the student is ready, the teacher will appear',
 ' Aim high, dream big, and work hard to achieve those dreams',
 ' The future belongs to the young who have the courage to dream and the determination to realize those dreams',
 ' Remember, small aim is a crime; have great aim and pursue it with all your heart',
 '']
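
Splitting on `'.'` leaves a leading space on most sentences and an empty string at the end, as the output above shows. A minimal sketch of one way to tidy this up, using a shortened sample text, is:

```python
text = "Hard work and perseverance are essential. When the student is ready, the teacher will appear."

# Split on full stops, trim surrounding whitespace, and skip empty pieces.
sentences = [part.strip() for part in text.split(".") if part.strip()]

print(sentences)
```

This is still a naive split: it would break on abbreviations or decimal numbers, which a dedicated sentence tokenizer such as NLTK's `sent_tokenize` is designed to handle.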

In [12]:
from sklearn.feature_extraction.text import CountVectorizer

# Initialize the CountVectorizer with English stopwords
vectorizer = CountVectorizer(stop_words='english')

# Fit and transform the data
X = vectorizer.fit_transform(tokens)

# Get the feature names (i.e., the words that are not stopwords)
features = vectorizer.get_feature_names_out()

print(features)


['achieve' 'acquire' 'action' 'aim' 'appear' 'belongs' 'benefit' 'big'
 'come' 'constant' 'courage' 'crime' 'dear' 'destruction' 'determination'
 'dream' 'dreams' 'earth' 'essential' 'friends' 'future' 'goal' 'great'
 'hard' 'heart' 'high' 'humankind' 'ignited' 'knowledge' 'mind'
 'perseverance' 'powerful' 'pursue' 'quest' 'ready' 'realize' 'remember'
 'resource' 'result' 'small' 'student' 'teacher' 'technology' 'thoughts'
 'transform' 'true' 'use' 'work' 'young' 'youth']


### 4. StopWords using scikit-learn (TfidfVectorizer)

TfidfVectorizer removes stopwords in the same way as CountVectorizer, but it stores TF-IDF weights instead of raw counts, so the feature names match the CountVectorizer output.

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize the TfidfVectorizer with English stopwords
vectorizer = TfidfVectorizer(stop_words='english')

# Fit and transform the data
X = vectorizer.fit_transform(tokens)

# Get the feature names (i.e., the words that are not stopwords)
features = vectorizer.get_feature_names_out()

print(features)


['achieve' 'acquire' 'action' 'aim' 'appear' 'belongs' 'benefit' 'big'
 'come' 'constant' 'courage' 'crime' 'dear' 'destruction' 'determination'
 'dream' 'dreams' 'earth' 'essential' 'friends' 'future' 'goal' 'great'
 'hard' 'heart' 'high' 'humankind' 'ignited' 'knowledge' 'mind'
 'perseverance' 'powerful' 'pursue' 'quest' 'ready' 'realize' 'remember'
 'resource' 'result' 'small' 'student' 'teacher' 'technology' 'thoughts'
 'transform' 'true' 'use' 'work' 'young' 'youth']
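
The feature names are identical to the CountVectorizer result; the difference lies in the values stored in `X`. The sketch below computes such weights by hand on hypothetical toy documents, assuming scikit-learn's documented defaults (`smooth_idf=True`, `norm='l2'`), where idf(t) = ln((1 + n) / (1 + df(t))) + 1. It illustrates the formula and is not the library's implementation.

```python
import math
from collections import Counter

# Toy documents, already tokenised and stripped of stopwords (hypothetical data).
docs = [["dream", "dream", "big"], ["dream", "work"], ["work", "hard"]]

n_docs = len(docs)
# Document frequency: in how many documents each term occurs.
df = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Return L2-normalised TF-IDF weights for one document."""
    tf = Counter(doc)
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    raw = {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in tf}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

weights = tfidf(docs[0])
print(weights)
```

In the first document, "dream" occurs twice, which outweighs its lower IDF, so it ends up with a larger weight than "big".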


## Customizing Your Own StopWords:

Customizing stopwords in scikit-learn is straightforward. We can either extend the existing list of stopwords provided by scikit-learn or create an entirely new list.


### 1. Using Custom Stopwords with CountVectorizer

In [19]:
from sklearn.feature_extraction.text import CountVectorizer

# Sample text data
texts = "I don’t even know where to begin with you. It all started about a year ago when I first stumbled upon a photo of you somewhere on the internet. I was sure you were located somewhere super close, already preparing myself for a trip to Czechia. How wrong I was!"
# Split the text into sentences on full stops
texts = texts.split('.')
# Define custom stopwords
custom_stop_words = ['this', 'to', 'it', 'is', 'an', 'of', 'the', 'with']
# Initialize the CountVectorizer with custom stopwords
vectorizer = CountVectorizer(stop_words=custom_stop_words)

# Fit and transform the data
X = vectorizer.fit_transform(texts)

# Get the feature names (i.e., the words that are not stopwords)
features = vectorizer.get_feature_names_out()

print(features)


['about' 'ago' 'all' 'already' 'begin' 'close' 'czechia' 'don' 'even'
 'first' 'for' 'how' 'internet' 'know' 'located' 'myself' 'on' 'photo'
 'preparing' 'somewhere' 'started' 'stumbled' 'super' 'sure' 'trip' 'upon'
 'was' 'were' 'when' 'where' 'wrong' 'year' 'you']


### 2. Combining Default Stopwords with Custom Stopwords

In [29]:
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Sample text data
texts = "I don’t even know where to begin with you. It all started about a year ago when I first stumbled upon a photo of you somewhere on the internet. I was sure you were located somewhere super close, already preparing myself for a trip to Czechia. How wrong I was!"

# Extend the default stopwords with custom ones
custom_stop_words = ENGLISH_STOP_WORDS.union(['dont', 'with'])
custom_stop_words = list(custom_stop_words)  # stop_words must be a list; without this conversion, recent scikit-learn versions raise a parameter error for the frozenset

# Initialize the CountVectorizer with combined stopwords
vectorizer = CountVectorizer(stop_words=custom_stop_words)

# Fit and transform the data
X = vectorizer.fit_transform([texts])

# Get the feature names (i.e., the words that are not stopwords)
features = vectorizer.get_feature_names_out()

print(features)


['ago' 'begin' 'close' 'czechia' 'don' 'internet' 'know' 'located' 'photo'
 'preparing' 'started' 'stumbled' 'super' 'sure' 'trip' 'wrong' 'year']
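
Note that 'don' still appears in the output even though 'dont' was added to the stopwords. The default `token_pattern` of CountVectorizer is `(?u)\b\w\w+\b`, which keeps only runs of two or more word characters. "don’t" is therefore split at the apostrophe into "don" and "t", and the single-character "t" is discarded, so no token ever equals 'dont'. The sketch below applies the same regular expression with Python's `re` module to show the effect.

```python
import re

# Default token_pattern used by scikit-learn's CountVectorizer.
token_pattern = re.compile(r"(?u)\b\w\w+\b")

text = "I don’t even know where to begin with you."
# CountVectorizer lower-cases the text by default before tokenising.
tokens = token_pattern.findall(text.lower())

print(tokens)
```

To actually filter this word, the custom entry would need to be 'don' rather than 'dont'.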
