In [1]:
text='''Nepal is a beautiful landlocked country located in South Asia, nestled 
between China to the north and India to the south, east, and west. Known for its 
stunning natural landscapes, Nepal is home to eight of the world’s ten highest peaks,
including Mount Everest, the tallest mountain on Earth. The country boasts a rich cultural 
and historical heritage, with a diverse population made up of various ethnic groups and languages.
Kathmandu, the capital city, is famous for its ancient temples, vibrant festivals, 
and UNESCO World Heritage Sites. Nepal is also the birthplace of Lord Buddha, 
making it a significant pilgrimage site for Buddhists around the world. Despite being a
developing country, Nepal's warm hospitality,
unique traditions, and bre
athtaking scenery attract travelers from all corners of the globe.'''

In [2]:
with open("spacy_lib.txt", 'w', encoding='utf-8') as file:
    file.write(text)


In [3]:
print(text)

Nepal is a beautiful landlocked country located in South Asia, nestled 
between China to the north and India to the south, east, and west. Known for its 
stunning natural landscapes, Nepal is home to eight of the world’s ten highest peaks,
including Mount Everest, the tallest mountain on Earth. The country boasts a rich cultural 
and historical heritage, with a diverse population made up of various ethnic groups and languages.
Kathmandu, the capital city, is famous for its ancient temples, vibrant festivals, 
and UNESCO World Heritage Sites. Nepal is also the birthplace of Lord Buddha, 
making it a significant pilgrimage site for Buddhists around the world. Despite being a
developing country, Nepal's warm hospitality,
unique traditions, and bre
athtaking scenery attract travelers from all corners of the globe.


In [4]:
import spacy

In [5]:
nlp = spacy.load("en_core_web_sm")  # Load spaCy's small English pipeline (en_core_web_sm).

In [7]:
# tokenization
doc = nlp(text)
tokens = [token.text for token in doc]
print("Tokens:", tokens)

Tokens: ['Nepal', 'is', 'a', 'beautiful', 'landlocked', 'country', 'located', 'in', 'South', 'Asia', ',', 'nestled', '\n', 'between', 'China', 'to', 'the', 'north', 'and', 'India', 'to', 'the', 'south', ',', 'east', ',', 'and', 'west', '.', 'Known', 'for', 'its', '\n', 'stunning', 'natural', 'landscapes', ',', 'Nepal', 'is', 'home', 'to', 'eight', 'of', 'the', 'world', '’s', 'ten', 'highest', 'peaks', ',', '\n', 'including', 'Mount', 'Everest', ',', 'the', 'tallest', 'mountain', 'on', 'Earth', '.', 'The', 'country', 'boasts', 'a', 'rich', 'cultural', '\n', 'and', 'historical', 'heritage', ',', 'with', 'a', 'diverse', 'population', 'made', 'up', 'of', 'various', 'ethnic', 'groups', 'and', 'languages', '.', '\n', 'Kathmandu', ',', 'the', 'capital', 'city', ',', 'is', 'famous', 'for', 'its', 'ancient', 'temples', ',', 'vibrant', 'festivals', ',', '\n', 'and', 'UNESCO', 'World', 'Heritage', 'Sites', '.', 'Nepal', 'is', 'also', 'the', 'birthplace', 'of', 'Lord', 'Buddha', ',', '\n', 'making

doc = nlp(text): Passes your text to spaCy's NLP pipeline, producing a Doc object.

[token.text for token in doc]: Extracts the string value of each token.

print(...): Displays the list of tokens (words, punctuation, etc.).
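Tokenization is rule-based and does not depend on the trained model. A minimal sketch, assuming a tokenizer-only pipeline from spacy.blank("en") (not used in the notebook above) and a shortened sample sentence, shows that a newline becomes a token of its own and that the original string can be recovered from the Doc:

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only) stands in for en_core_web_sm.
nlp_blank = spacy.blank("en")

# Shortened sample adapted from the text above; the "\n" is kept on purpose.
sample = "Nepal is home to eight of the highest peaks,\nincluding Mount Everest."
sample_doc = nlp_blank(sample)

sample_tokens = [token.text for token in sample_doc]
print(sample_tokens)

# Each token records the character offset where it starts in the original string.
offsets = [(token.text, token.idx) for token in sample_doc]
print(offsets[:3])

# Tokenization is non-destructive: the Doc reproduces the input exactly.
print(sample_doc.text == sample)
```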

In [8]:
# Count token frequencies (punctuation and newline tokens included)
from collections import Counter
freq = Counter(tokens)

most_common = freq.most_common(5)
print("Most common:", most_common)



Most common: [(',', 15), ('\n', 10), ('the', 8), ('and', 6), ('.', 6)]


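Counter(tokens): Counts occurrences of each token, including punctuation and newline tokens, which is why ',' and '\n' top the list.

most_common(5): Returns the 5 most frequent tokens as a list of (token, count) tuples.

token.is_alpha: True only for alphabetic tokens (no punctuation or numbers). The cell above does not apply this filter; adding it would restrict the count to words.

not token.is_stop: Excludes common stop words like "is", "at", "the", etc. This filter is also not applied in the cell above.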
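The two filters described above can be combined with Counter to count words only. A minimal sketch, assuming a tokenizer-only pipeline from spacy.blank("en") and a short two-sentence sample; lowercasing before counting is an added choice, not something the notebook does:

```python
from collections import Counter

import spacy

# Assumption: is_alpha and is_stop are lexical flags, so a blank pipeline is enough.
nlp_blank = spacy.blank("en")
sample_doc = nlp_blank(
    "Nepal is home to Mount Everest. Nepal is also the birthplace of Lord Buddha."
)

# Keep alphabetic, non-stopword tokens; lowercase so "Nepal" and "nepal" count together.
words = [
    token.text.lower()
    for token in sample_doc
    if token.is_alpha and not token.is_stop
]

word_freq = Counter(words)
print(word_freq.most_common(3))
```

With punctuation and stop words removed, the most frequent entries are content words rather than commas and "the".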

In [9]:
# Remove punctuation
tokens_no_punct = [token.text for token in doc if not token.is_punct]
print("Without punctuation:", tokens_no_punct)


Without punctuation: ['Nepal', 'is', 'a', 'beautiful', 'landlocked', 'country', 'located', 'in', 'South', 'Asia', 'nestled', '\n', 'between', 'China', 'to', 'the', 'north', 'and', 'India', 'to', 'the', 'south', 'east', 'and', 'west', 'Known', 'for', 'its', '\n', 'stunning', 'natural', 'landscapes', 'Nepal', 'is', 'home', 'to', 'eight', 'of', 'the', 'world', '’s', 'ten', 'highest', 'peaks', '\n', 'including', 'Mount', 'Everest', 'the', 'tallest', 'mountain', 'on', 'Earth', 'The', 'country', 'boasts', 'a', 'rich', 'cultural', '\n', 'and', 'historical', 'heritage', 'with', 'a', 'diverse', 'population', 'made', 'up', 'of', 'various', 'ethnic', 'groups', 'and', 'languages', '\n', 'Kathmandu', 'the', 'capital', 'city', 'is', 'famous', 'for', 'its', 'ancient', 'temples', 'vibrant', 'festivals', '\n', 'and', 'UNESCO', 'World', 'Heritage', 'Sites', 'Nepal', 'is', 'also', 'the', 'birthplace', 'of', 'Lord', 'Buddha', '\n', 'making', 'it', 'a', 'significant', 'pilgrimage', 'site', 'for', 'Buddhist

token.is_punct: Returns True if the token is a punctuation mark (such as ",", ".", or "!").
if not token.is_punct: Filters out all punctuation tokens. Newline tokens ('\n') are whitespace, not punctuation, so they remain in the output.
token.text: Extracts the text of each token.
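Newline tokens survive the punctuation filter because they are whitespace. A minimal sketch, assuming a tokenizer-only pipeline from spacy.blank("en") and a shortened sample, adds token.is_space to drop them as well:

```python
import spacy

# Assumption: is_punct and is_space are lexical flags, so no trained model is needed.
nlp_blank = spacy.blank("en")
sample_doc = nlp_blank("South Asia, nestled\nbetween China and India.")

# Punctuation removed, but the newline token is still present.
no_punct = [t.text for t in sample_doc if not t.is_punct]

# Punctuation and whitespace tokens both removed.
no_punct_or_space = [
    t.text for t in sample_doc if not t.is_punct and not t.is_space
]

print(no_punct)
print(no_punct_or_space)
```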

In [10]:
#Remove stopwords
tokens_no_stop = [token.text for token in doc if not token.is_stop and not token.is_punct]
print("Without stopwords:", tokens_no_stop)

Without stopwords: ['Nepal', 'beautiful', 'landlocked', 'country', 'located', 'South', 'Asia', 'nestled', '\n', 'China', 'north', 'India', 'south', 'east', 'west', 'Known', '\n', 'stunning', 'natural', 'landscapes', 'Nepal', 'home', 'world', 'highest', 'peaks', '\n', 'including', 'Mount', 'Everest', 'tallest', 'mountain', 'Earth', 'country', 'boasts', 'rich', 'cultural', '\n', 'historical', 'heritage', 'diverse', 'population', 'ethnic', 'groups', 'languages', '\n', 'Kathmandu', 'capital', 'city', 'famous', 'ancient', 'temples', 'vibrant', 'festivals', '\n', 'UNESCO', 'World', 'Heritage', 'Sites', 'Nepal', 'birthplace', 'Lord', 'Buddha', '\n', 'making', 'significant', 'pilgrimage', 'site', 'Buddhists', 'world', 'Despite', '\n', 'developing', 'country', 'Nepal', 'warm', 'hospitality', '\n', 'unique', 'traditions', 'bre', '\n', 'athtaking', 'scenery', 'attract', 'travelers', 'corners', 'globe']


token.is_stop: Checks if the token is a stopword.
token.is_punct: Checks if the token is punctuation.
not token.is_stop and not token.is_punct: Keeps only tokens that are neither stopwords nor punctuation. Newline tokens ('\n') pass both checks, so they still appear in the output.
token.text: Extracts the raw text of the token.
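The stop word list behind token.is_stop can be inspected directly. A minimal sketch, assuming a tokenizer-only pipeline from spacy.blank("en") and a sample sentence taken from the text above; the extra is_space check is an addition to the notebook's filter:

```python
import spacy

nlp_blank = spacy.blank("en")

# The default English stop words are stored as a set of lowercase strings.
stop_words = nlp_blank.Defaults.stop_words
print(len(stop_words), "the" in stop_words, "country" in stop_words)

sample_doc = nlp_blank("The country boasts a rich cultural\nand historical heritage.")

# Drop stop words, punctuation, and whitespace tokens in one pass.
content = [
    t.text
    for t in sample_doc
    if not (t.is_stop or t.is_punct or t.is_space)
]
print(content)
```

The capitalized "The" is dropped as well, since the stop word check ignores case.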

In [11]:
# Lemmatization
lemmas = [token.lemma_ for token in doc if not token.is_punct]
print("Lemmas:", lemmas)


Lemmas: ['Nepal', 'be', 'a', 'beautiful', 'landlocked', 'country', 'locate', 'in', 'South', 'Asia', 'nestle', '\n', 'between', 'China', 'to', 'the', 'north', 'and', 'India', 'to', 'the', 'south', 'east', 'and', 'west', 'know', 'for', 'its', '\n', 'stunning', 'natural', 'landscape', 'Nepal', 'be', 'home', 'to', 'eight', 'of', 'the', 'world', '’s', 'ten', 'high', 'peak', '\n', 'include', 'Mount', 'Everest', 'the', 'tall', 'mountain', 'on', 'Earth', 'the', 'country', 'boast', 'a', 'rich', 'cultural', '\n', 'and', 'historical', 'heritage', 'with', 'a', 'diverse', 'population', 'make', 'up', 'of', 'various', 'ethnic', 'group', 'and', 'language', '\n', 'Kathmandu', 'the', 'capital', 'city', 'be', 'famous', 'for', 'its', 'ancient', 'temple', 'vibrant', 'festival', '\n', 'and', 'UNESCO', 'World', 'Heritage', 'Sites', 'Nepal', 'be', 'also', 'the', 'birthplace', 'of', 'Lord', 'Buddha', '\n', 'make', 'it', 'a', 'significant', 'pilgrimage', 'site', 'for', 'Buddhists', 'around', 'the', 'world', 'de

🧠 Explanation:
token.lemma_: Gives the lemma (base form) of the token. In the output above, "is" → "be", "peaks" → "peak", and "highest" → "high".
not token.is_punct: Skips punctuation.



In [13]:

# POS tagging
for token in doc:
    print(f"{token.text}: {token.pos_}")


Nepal: PROPN
is: AUX
a: DET
beautiful: ADJ
landlocked: ADJ
country: NOUN
located: VERB
in: ADP
South: PROPN
Asia: PROPN
,: PUNCT
nestled: VERB

: SPACE
between: ADP
China: PROPN
to: ADP
the: DET
north: NOUN
and: CCONJ
India: PROPN
to: ADP
the: DET
south: NOUN
,: PUNCT
east: NOUN
,: PUNCT
and: CCONJ
west: NOUN
.: PUNCT
Known: VERB
for: ADP
its: PRON

: SPACE
stunning: ADJ
natural: ADJ
landscapes: NOUN
,: PUNCT
Nepal: PROPN
is: AUX
home: ADV
to: ADP
eight: NUM
of: ADP
the: DET
world: NOUN
’s: PART
ten: NUM
highest: ADJ
peaks: NOUN
,: PUNCT

: SPACE
including: VERB
Mount: PROPN
Everest: PROPN
,: PUNCT
the: DET
tallest: ADJ
mountain: NOUN
on: ADP
Earth: PROPN
.: PUNCT
The: DET
country: NOUN
boasts: VERB
a: DET
rich: ADJ
cultural: ADJ

: SPACE
and: CCONJ
historical: ADJ
heritage: NOUN
,: PUNCT
with: ADP
a: DET
diverse: ADJ
population: NOUN
made: VERB
up: ADP
of: ADP
various: ADJ
ethnic: ADJ
groups: NOUN
and: CCONJ
languages: NOUN
.: PUNCT

: SPACE
Kathmandu: PROPN
,: PUNCT
the: DET
capital:

🧠 Explanation:
token.text: The actual word/token in the text.
token.pos_: The coarse-grained part of speech from the Universal POS tag set (e.g., NOUN, VERB, ADJ). Newline tokens are tagged SPACE.