# **Sentence Tokenization**

### 🧩 What Is Sentence Tokenization?

**Sentence tokenization** is a process in **Natural Language Processing (NLP)** where a large chunk of text is split into individual **sentences**. It's one of the first steps in analyzing or processing text data.


### 🔍 Why Is It Useful?

- Helps machines understand text structure
- Enables further analysis like sentiment detection, translation, or summarization
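
To make the idea concrete, here is a minimal sketch of a naive rule-based splitter that uses only Python's standard library. The helper name `naive_sent_split` and the sample text are illustrative assumptions, not part of NLTK. It breaks after every `.`, `!`, or `?` that is followed by whitespace, so it wrongly cuts after the abbreviation `Dr.`, which is the kind of case a trained tokenizer is meant to handle.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace (illustration only)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

sample = "Dr. Smith arrived late. He apologized!"
print(naive_sent_split(sample))
# ['Dr.', 'Smith arrived late.', 'He apologized!']
```

The first "sentence" is just `Dr.`, which shows why simple punctuation rules are rarely enough on real text.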


In [52]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
from nltk.tokenize import sent_tokenize

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [53]:
def tokenize_sentence(text):
    sentence = sent_tokenize(text)
    return sentence

In [54]:
text = """He shaved the peach to prove a point.
Italy is my favorite country; in fact,
I plan to spend two weeks there next year.
It's not possible to convince a monkey to give you
a banana by promising it infinite bananas when they die.
I made myself a peanut butter sandwich as
I didn't want to subsist on veggie crackers.
If you don't like toenails, you probably shouldn't look at your feet."""

sentences = tokenize_sentence(text)
print(sentences)

['He shaved the peach to prove a point.', 'Italy is my favorite country; in fact,\nI plan to spend two weeks there next year.', "It's not possible to convince a monkey to give you\na banana by promising it infinite bananas when they die.", "I made myself a peanut butter sandwich as\nI didn't want to subsist on veggie crackers.", "If you don't like toenails, you probably shouldn't look at your feet."]


In [55]:
for i, sentence in enumerate(sentences, start=1):
    print(f"Sentence {i}: {sentence}")

Sentence 1: He shaved the peach to prove a point.
Sentence 2: Italy is my favorite country; in fact,
I plan to spend two weeks there next year.
Sentence 3: It's not possible to convince a monkey to give you
a banana by promising it infinite bananas when they die.
Sentence 4: I made myself a peanut butter sandwich as
I didn't want to subsist on veggie crackers.
Sentence 5: If you don't like toenails, you probably shouldn't look at your feet.


# **Word Tokenization**
Word tokenization is like teaching a computer to read by breaking text into its most basic building blocks—**words** 🧱.

### 🧠 What Is Word Tokenization?

In **Natural Language Processing (NLP)**, **word tokenization** is the process of splitting a sentence or paragraph into individual words, called **tokens**. These tokens are the units that algorithms use to understand and analyze language.




### 🔍 Why Is It Useful?

- **Enables Text Analysis:** Breaks down text so algorithms can analyze word frequency, patterns, and meaning.
- **Simplifies Preprocessing:** Prepares raw text for further steps like stemming, lemmatization, and part-of-speech tagging.
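
As a rough illustration of both points, the sketch below tokenizes a sentence with a single regular expression and then counts word frequencies. `simple_word_tokenize` is a hypothetical stand-in written for this example; it is not how NLTK's `word_tokenize` works, and it will split contractions such as `don't` differently.

```python
import re
from collections import Counter

def simple_word_tokenize(text):
    """Return runs of word characters and single punctuation marks as tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_word_tokenize("The cat saw the other cat.")
print(tokens)
# ['The', 'cat', 'saw', 'the', 'other', 'cat', '.']

# Count lowercase word tokens only, skipping punctuation.
freq = Counter(token.lower() for token in tokens if token.isalnum())
print(freq.most_common(2))
# [('the', 2), ('cat', 2)]
```

Once text is a list of tokens, frequency counts like this take one line, which is why tokenization usually comes first in a pipeline.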

In [56]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
from nltk.tokenize import word_tokenize

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [57]:
def tokenize_words(text):
    words = word_tokenize(text)
    return words

In [58]:
text = """Trash covered the landscape like sprinkles do a birthday cake.
I cheated while playing the darts tournament by using a longbow."""

words = tokenize_words(text)
print(words)

['Trash', 'covered', 'the', 'landscape', 'like', 'sprinkles', 'do', 'a', 'birthday', 'cake', '.', 'I', 'cheated', 'while', 'playing', 'the', 'darts', 'tournament', 'by', 'using', 'a', 'longbow', '.']


# **Parts of Speech Tagging**

### 🧠 What Is POS Tagging?

Parts of Speech tagging is the process of labeling each word in a sentence with its **grammatical role**. It helps machines understand sentence structure and meaning.

### 🏷️ Common POS Tags
* NN: Noun, singular
* VB: Verb, base form
* JJ: Adjective
* RB: Adverb
* DT: Determiner
* IN: Preposition
* PRP: Pronoun
* CC: Coordinating conjunction
* VBG: Verb, gerund/present participle
* NNS: Noun, plural
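
The tag list above can be turned into a small lookup table for reading tagger output. This is a sketch only: `TAG_DESCRIPTIONS` and `describe_tags` are hypothetical names introduced here, and the sample pairs are hand-written in the `(word, tag)` shape that a POS tagger returns.

```python
# Descriptions copied from the tag list above.
TAG_DESCRIPTIONS = {
    "NN": "Noun, singular",
    "VB": "Verb, base form",
    "JJ": "Adjective",
    "RB": "Adverb",
    "DT": "Determiner",
    "IN": "Preposition",
    "PRP": "Pronoun",
    "CC": "Coordinating conjunction",
    "VBG": "Verb, gerund/present participle",
    "NNS": "Noun, plural",
}

def describe_tags(tagged_words):
    """Attach a readable description to each (word, tag) pair."""
    return [
        (word, tag, TAG_DESCRIPTIONS.get(tag, "not in the short list"))
        for word, tag in tagged_words
    ]

# Hand-written sample pairs (an assumption for illustration).
sample = [("the", "DT"), ("danger", "NN"), ("lurking", "VBG"), ("nearby", "RB")]
for word, tag, description in describe_tags(sample):
    print(f"{word:<8} {tag:<4} {description}")
```

Tags that are missing from the short list, such as `VBD`, fall back to a placeholder description instead of raising an error.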








In [59]:
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


True

In [60]:
def pos_tagging(text):
  words = word_tokenize(text)
  tagged_words = nltk.pos_tag(words)
  return tagged_words

In [61]:
text = """He stepped gingerly onto the bridge knowing that enchantment awaited on the other side.
The shooter says goodbye to his love.
The beauty of the African sunset disguised the danger lurking nearby."""

tagged_text = pos_tagging(text)

print(tagged_text)

[('He', 'PRP'), ('stepped', 'VBD'), ('gingerly', 'RB'), ('onto', 'IN'), ('the', 'DT'), ('bridge', 'NN'), ('knowing', 'NN'), ('that', 'WDT'), ('enchantment', 'NN'), ('awaited', 'VBD'), ('on', 'IN'), ('the', 'DT'), ('other', 'JJ'), ('side', 'NN'), ('.', '.'), ('The', 'DT'), ('shooter', 'NN'), ('says', 'VBZ'), ('goodbye', 'NN'), ('to', 'TO'), ('his', 'PRP$'), ('love', 'NN'), ('.', '.'), ('The', 'DT'), ('beauty', 'NN'), ('of', 'IN'), ('the', 'DT'), ('African', 'NNP'), ('sunset', 'NN'), ('disguised', 'VBD'), ('the', 'DT'), ('danger', 'NN'), ('lurking', 'VBG'), ('nearby', 'RB'), ('.', '.')]


# **Lemmatization**

### 🧠 What Is Lemmatization?

Lemmatization is a key NLP technique that reduces words **to their base** or dictionary form, called a **lemma**, while considering the word’s context and part of speech. Think of it as smart **word simplification**.

### 📚 Lemmatization vs. Stemming

| Feature | Lemmatization 🧠 | Stemming 🔪 |
|---|---|---|
| Uses context | ✅ Yes | ❌ No |
| Produces real words | ✅ Yes | ❌ Often not |
| Example: “running” | → “run” | → “run” or “runn” |
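
The contrast in the table can be sketched with two toy functions. Both are hypothetical and written only for illustration: `toy_stem` strips a few suffixes blindly, while `toy_lemmatize` looks words up in a tiny hand-made table that stands in for a real dictionary such as WordNet. Neither is NLTK's stemmer or lemmatizer.

```python
# Toy lookup table standing in for a real dictionary (an assumption).
TOY_LEMMAS = {
    ("running", "v"): "run",
    ("chasing", "v"): "chase",
    ("mice", "n"): "mouse",
}

def toy_stem(word):
    """Crude suffix stripping: no dictionary, no context."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word, pos):
    """Dictionary lookup keyed on the word and its part of speech."""
    return TOY_LEMMAS.get((word, pos), word)

for word, pos in [("running", "v"), ("chasing", "v"), ("mice", "n")]:
    print(word, "->", toy_stem(word), "|", toy_lemmatize(word, pos))
# running -> runn | run
# chasing -> chas | chase
# mice -> mice | mouse
```

The stemmer's results (`runn`, `chas`) are not real words, and it cannot handle the irregular plural `mice`, while the lookup returns dictionary forms in each case.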

In [62]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('wordnet')
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [63]:
def lemmatize_text(text):
  lemmatizer = WordNetLemmatizer()
  tokens = word_tokenize(text)
  lemmatized_text = ' '.join(lemmatizer.lemmatize(word) for word in tokens)
  return lemmatized_text

In [64]:
text = "The cats are chasing mice and playing in the garden"

lemmatized_text = lemmatize_text(text)
print("Original Text:", text)
print("Lemmatize Text:", lemmatized_text)

Original Text: The cats are chasing mice and playing in the garden
Lemmatize Text: The cat are chasing mouse and playing in the garden
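
In the output above, `are` and `chasing` come back unchanged. `WordNetLemmatizer.lemmatize` accepts an optional `pos` argument and treats a word as a noun when it is omitted, so verbs are left alone. One possible way to supply that argument is to map Penn Treebank tags to WordNet's single-letter codes. The helper `penn_to_wordnet` and the hand-written tags below are assumptions for illustration, not part of the notebook.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet part-of-speech letter."""
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # noun, which is also the lemmatizer's default

# Hand-written (word, tag) pairs; a real tagger's output may differ.
tagged = [("cats", "NNS"), ("are", "VBP"), ("chasing", "VBG"), ("mice", "NNS")]
print([(word, penn_to_wordnet(tag)) for word, tag in tagged])
# [('cats', 'n'), ('are', 'v'), ('chasing', 'v'), ('mice', 'n')]

# Assumed usage with the notebook's lemmatizer (not executed here):
#   lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))
```

Combining this mapping with the POS tagging step from the earlier section would let the lemmatizer treat verbs as verbs instead of defaulting to nouns.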
