In [1]:
paragraph = """
The antiquated, rust-eaten ship, a relic of a bygone era, bobbed ominously in the tempestuous, inky waters. Its weathered hull, scarred by countless storms and treacherous seas, bore silent witness to a century of maritime adventures and mishaps. As the relentless gale howled, whipping the waves into a frenzy of foam, the ship creaked and groaned, its timbers protesting the relentless onslaught. A lone figure, clad in a tattered oilskin coat, stood at the helm, his face etched with a mixture of fear and defiance. His weathered hands gripped the wheel, his eyes fixed on the horizon, where the storm clouds raged like monstrous, celestial beasts.

The ship had once been a majestic vessel, its sails billowing proudly in the wind as it traversed vast oceans, carrying precious cargo and bold explorers to distant shores. Now, it was a mere ghost of its former self, a forlorn sentinel of a forgotten age. The figure at the helm, a grizzled old sailor named Elias, had spent most of his life at sea, braving countless dangers and witnessing the wrath of nature firsthand. He had seen ships sink beneath the waves, their crews lost to the depths; he had weathered hurricanes that had stripped vessels bare; and he had faced pirates, smugglers, and other nefarious characters who prowled the seas in search of plunder. But nothing had prepared him for the storm that was now raging around him.

As the night wore on, the storm grew even more ferocious, the wind howling like a banshee and the waves crashing against the ship with terrifying force. The figure at the helm clung to the wheel, his knuckles white with the effort. He could feel the ship shudder and groan, its timbers straining under the immense pressure. A wave larger than any he had ever seen reared up before him, its crest towering above the ship. Elias braced himself, his heart pounding in his chest. The wave crashed down, engulfing the ship in a torrent of seawater. The figure at the helm was swept overboard, his body disappearing beneath the churning waves.

The storm raged on for days, the sea a chaotic, churning mass of foam and spray. The ship, battered and bruised, drifted helplessly at the mercy of the elements. It was eventually discovered by a passing merchant vessel, its crew astounded to find it still afloat. The ship was towed to a nearby port, where it was hauled ashore and repaired. But the memory of the storm and the loss of its captain would linger long after the ship had been restored to its former glory.


"""

In [2]:
#step-1 tokenisation

import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize, word_tokenize

#split the para to sentences

sentences = sent_tokenize(paragraph)

#Tokenize each sentence into words

tokenized_sentences = [word_tokenize(sentences) for sentences in sentences]

print(tokenized_sentences[:2])

[['The', 'antiquated', ',', 'rust-eaten', 'ship', ',', 'a', 'relic', 'of', 'a', 'bygone', 'era', ',', 'bobbed', 'ominously', 'in', 'the', 'tempestuous', ',', 'inky', 'waters', '.'], ['Its', 'weathered', 'hull', ',', 'scarred', 'by', 'countless', 'storms', 'and', 'treacherous', 'seas', ',', 'bore', 'silent', 'witness', 'to', 'a', 'century', 'of', 'maritime', 'adventures', 'and', 'mishaps', '.']]


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\princ\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [3]:
# step-2 lowercasing

lowercase_sentences = [[word.lower() for word in sentence] for sentence in tokenized_sentences]
print(lowercase_sentences[:2])

[['the', 'antiquated', ',', 'rust-eaten', 'ship', ',', 'a', 'relic', 'of', 'a', 'bygone', 'era', ',', 'bobbed', 'ominously', 'in', 'the', 'tempestuous', ',', 'inky', 'waters', '.'], ['its', 'weathered', 'hull', ',', 'scarred', 'by', 'countless', 'storms', 'and', 'treacherous', 'seas', ',', 'bore', 'silent', 'witness', 'to', 'a', 'century', 'of', 'maritime', 'adventures', 'and', 'mishaps', '.']]


In [4]:
#step-3 removing punctuations
import string

punctuation_table = str.maketrans('', '', string.punctuation)
cleaned_sentences = [[word.translate(punctuation_table) for word in sentence] for sentence in lowercase_sentences]

# Remove empty words resulting from punctuation removal
cleaned_sentences = [[word for word in sentence if word] for sentence in cleaned_sentences]

print(cleaned_sentences[:2]) 

[['the', 'antiquated', 'rusteaten', 'ship', 'a', 'relic', 'of', 'a', 'bygone', 'era', 'bobbed', 'ominously', 'in', 'the', 'tempestuous', 'inky', 'waters'], ['its', 'weathered', 'hull', 'scarred', 'by', 'countless', 'storms', 'and', 'treacherous', 'seas', 'bore', 'silent', 'witness', 'to', 'a', 'century', 'of', 'maritime', 'adventures', 'and', 'mishaps']]


In [5]:
# step-4 stopwords removal

from nltk.corpus import stopwords
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

#remove stopwords
filtered_sentence = [[word for word in sentence if word not in stop_words] for sentence in cleaned_sentences]

print(filtered_sentence[:2])

[['antiquated', 'rusteaten', 'ship', 'relic', 'bygone', 'era', 'bobbed', 'ominously', 'tempestuous', 'inky', 'waters'], ['weathered', 'hull', 'scarred', 'countless', 'storms', 'treacherous', 'seas', 'bore', 'silent', 'witness', 'century', 'maritime', 'adventures', 'mishaps']]


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\princ\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [6]:
# STEP-5 LEMMATISATION
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
nltk.download('omw-1.4')

lemmatizer = WordNetLemmatizer()

#lemmatize

lemmatized_sentences = [[lemmatizer.lemmatize(word) for word in sentence] for sentence in filtered_sentence]

print(lemmatized_sentences)

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\princ\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\princ\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


[['antiquated', 'rusteaten', 'ship', 'relic', 'bygone', 'era', 'bobbed', 'ominously', 'tempestuous', 'inky', 'water'], ['weathered', 'hull', 'scarred', 'countless', 'storm', 'treacherous', 'sea', 'bore', 'silent', 'witness', 'century', 'maritime', 'adventure', 'mishap'], ['relentless', 'gale', 'howled', 'whipping', 'wave', 'frenzy', 'foam', 'ship', 'creaked', 'groaned', 'timber', 'protesting', 'relentless', 'onslaught'], ['lone', 'figure', 'clad', 'tattered', 'oilskin', 'coat', 'stood', 'helm', 'face', 'etched', 'mixture', 'fear', 'defiance'], ['weathered', 'hand', 'gripped', 'wheel', 'eye', 'fixed', 'horizon', 'storm', 'cloud', 'raged', 'like', 'monstrous', 'celestial', 'beast'], ['ship', 'majestic', 'vessel', 'sail', 'billowing', 'proudly', 'wind', 'traversed', 'vast', 'ocean', 'carrying', 'precious', 'cargo', 'bold', 'explorer', 'distant', 'shore'], ['mere', 'ghost', 'former', 'self', 'forlorn', 'sentinel', 'forgotten', 'age'], ['figure', 'helm', 'grizzled', 'old', 'sailor', 'named'

In [7]:
#step-6 Bag of words

from sklearn.feature_extraction.text import CountVectorizer

#join words back into sentece for vectorisation
joined_sentences = [''.join(sentence) for sentence in lemmatized_sentences]\

#initialise countvectoriser

vectorizer = CountVectorizer()
X_bow = vectorizer.fit_transform(joined_sentences)

# Display Bag of Words feature representation for the first 5 sentences
print(X_bow.toarray()[:5])  # Bag of Words representation
print(vectorizer.get_feature_names_out()[:10])  # First 10 feature names

[[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1]
 [0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0]]
['antiquatedrusteatenshiprelicbygoneerabobbedominouslytempestuousinkywater'
 'couldfeelshipshuddergroantimberstrainingimmensepressure'
 'eliabracedheartpoundingchest'
 'eventuallydiscoveredpassingmerchantvesselcrewastoundedfindstillafloat'
 'figurehelmclungwheelknucklewhiteeffort'
 'figurehelmgrizzledoldsailornamedeliaspentlifeseabravingcountlessdangerwitnessingwrathnaturefirsthand'
 'figurehelmsweptoverboardbodydisappearingbeneathchurningwave'
 'lonefigurecladtatteredoilskincoatstoodhelmfaceetchedmixturefeardefiance'
 'memorystormlosscaptainwouldlingerlongshiprestoredformerglory'
 'mereghostformerselfforlornsentinelforgottenage']


In [8]:
# step-7 tf-idf
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(joined_sentences)

# Display TF-IDF feature representation for the first 5 sentences
print(X_tfidf.toarray()[:5])  # TF-IDF representation
print(tfidf.get_feature_names_out()[:10])

[[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]]
['antiquatedrusteatenshiprelicbygoneerabobbedominouslytempestuousinkywater'
 'couldfeelshipshuddergroantimberstrainingimmensepressure'
 'eliabracedheartpoundingchest'
 'eventuallydiscoveredpassingmerchantvesselcrewastoundedfindstillafloat'
 'figurehelmclungwheelknucklewhiteeffort'
 'figurehelmgrizzledoldsailornamedeliaspentlifeseabravingcountlessdangerwitnessingwrathnaturefirsthand'
 'figurehelmsweptoverboardbodydisappearingbeneathchurningwave'
 'lonefigurecladtatteredoilskincoatstoodhelmfaceetchedmixturefeardefiance'
 'memorystormlosscaptainwouldlingerlongshiprestoredformerglory'
 'mereghostformerselfforlornsentinelforgottenage']
