<a href="https://colab.research.google.com/github/SaishWarule1116/Natural-Language-Processing-NLP-/blob/main/TextPreprocessing_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Key Steps in Text Preprocessing

1. **Lowercasing:** Convert all text to lowercase.
2. **Removing Punctuation:** Strip special characters (.,!? etc.).
3. **Tokenization:** Split text into words or sentences.
4. **Stopwords Removal:** Remove common words that add little meaning (for example Hindi "hai" and "aur", i.e. "is" and "and").
5. **Stemming:** Reduce words to their root form, which may not be a real word (for example Hindi "sikhte" -> "sikh").
6. **Lemmatization:** Reduce words to a meaningful base form (for example Hindi "sikhte" -> "sikhna", "to learn").
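
The first four steps above can be chained into a single function. The following is only a minimal sketch using the standard library: the `STOPWORDS` set is a tiny hypothetical stand-in for a real stopword list, and `preprocess` is an illustrative name, not something defined elsewhere in this notebook.

```python
import string

# Hypothetical mini stopword list, used only for illustration
STOPWORDS = {"the", "are", "and", "is", "in", "it"}

def preprocess(text, stopwords=STOPWORDS):
    """Lowercase, strip ASCII punctuation, split on whitespace, drop stopwords."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    return [t for t in tokens if t not in stopwords]

print(preprocess("The cats are running and jumping happily."))
# ['cats', 'running', 'jumping', 'happily']
```

Stemming and lemmatization would normally follow as further steps on the returned token list.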

In [2]:
import nltk
import string
import pandas as pd

In [3]:
# Download Data
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

# **Text Cleaning**

In [4]:
import pandas as pd
import string

In [5]:
data = {
    'text': [
        "I love programming in Python! It's amazing.",
        "The cats are running and jumping happily.",
        "Data Science is FUN, isn't it?",
        "Remove stopwords, punctuation, and extra spaces!!"
    ],
    'category': ['positive', 'animals', 'data', 'instructions']
}

# Create DataFrame
df = pd.DataFrame(data)
df

Unnamed: 0,text,category
0,I love programming in Python! It's amazing.,positive
1,The cats are running and jumping happily.,animals
2,"Data Science is FUN, isn't it?",data
3,"Remove stopwords, punctuation, and extra spaces!!",instructions


**1.Lowercasing**

In [6]:

# .str.lower() returns a new Series; df['text'] stays unchanged unless the result is assigned back
df['text'].str.lower()

Unnamed: 0,text
0,i love programming in python! it's amazing.
1,the cats are running and jumping happily.
2,"data science is fun, isn't it?"
3,"remove stopwords, punctuation, and extra spaces!!"
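
Lowercasing never modifies a string in place; it always returns a new value that has to be stored somewhere. The sketch below shows this on a plain Python list (the second sample string is an assumption added for illustration) and also contrasts `str.lower()` with `str.casefold()`, which is more aggressive for some non-English characters.

```python
texts = ["Data Science is FUN, isn't it?", "Straße"]

lowered = [t.lower() for t in texts]      # new strings, originals untouched
folded = [t.casefold() for t in texts]    # casefold also maps 'ß' to 'ss'

print(texts[0])    # Data Science is FUN, isn't it?
print(lowered[0])  # data science is fun, isn't it?
print(lowered[1])  # straße
print(folded[1])   # strasse
```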


**2.Remove Punctuation**

In [7]:
import string
punctuation = string.punctuation  # all ASCII punctuation characters

def remove_punctuation(text):
  # Delete every ASCII punctuation character from the text
  for char in punctuation:
    text = text.replace(char,'')
  return text

df['text'] = df['text'].apply(remove_punctuation)
df


Unnamed: 0,text,category
0,I love programming in Python Its amazing,positive
1,The cats are running and jumping happily,animals
2,Data Science is FUN isnt it,data
3,Remove stopwords punctuation and extra spaces,instructions
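
Deleting punctuation outright glues neighbouring pieces together ("isn't" becomes "isnt", and "well-known,fast" would become "wellknownfast"). One possible alternative, sketched here under the assumption that word boundaries matter more than keeping contractions intact, is to map each punctuation character to a space and then collapse the extra whitespace. The function name `punctuation_to_spaces` is illustrative.

```python
import string

# Translation table: every ASCII punctuation character becomes a space
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def punctuation_to_spaces(text):
    """Replace punctuation with spaces, then collapse runs of whitespace."""
    return " ".join(text.translate(_PUNCT_TO_SPACE).split())

print(punctuation_to_spaces("well-known,fast"))
# well known fast
print(punctuation_to_spaces("Remove stopwords, punctuation, and extra spaces!!"))
# Remove stopwords punctuation and extra spaces
```

`str.translate` makes a single pass over the text instead of one `replace` call per punctuation character.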


**3.Remove Stop Words**

In [8]:
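import nltk
from nltk.corpus import stopwords

words = stopwords.words('english')  # lowercase English stopwords

def remove_stopwords(text):
  # Caution: str.replace() removes every occurrence of the substring, not just
  # whole words, so "in" is also cut out of "programming" (see the output below).
  # The match is case-sensitive too, which is why "I" and "The" survive.
  for word in text.split():
    if word in words:
      text = text.replace(word,'')
  return text

df['text'] = df['text'].apply(remove_stopwords)
df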


Unnamed: 0,text,category
0,I love programmg Python Its amazg,positive
1,The cats running jumping happily,animals
2,Data Science FUN nt,data
3,Remove stopwords punctuation extra spaces,instructions
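
The output above shows "programmg" and "amazg": the stopword "in" was removed from inside longer words. A token-based filter avoids this by comparing whole words only. The sketch below uses a small hypothetical `STOPWORDS` set so that it runs without downloads; in the notebook the set would presumably be built from `stopwords.words('english')`.

```python
# Hypothetical subset of a stopword list, for illustration only
STOPWORDS = {"i", "in", "is", "it", "its", "the", "are", "and"}

def remove_stopwords_by_token(text, stopwords=STOPWORDS):
    """Drop whole tokens whose lowercase form is a stopword."""
    kept = [w for w in text.split() if w.lower() not in stopwords]
    return " ".join(kept)

print(remove_stopwords_by_token("I love programming in Python Its amazing"))
# love programming Python amazing
```

Comparing `w.lower()` makes the filter case-insensitive while keeping the original casing of the words that remain, and a `set` gives constant-time membership checks.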


**4.Tokenization**

In [9]:
text = "I love NLP!"
tokens = text.split()
print("Whitespace Tokenization:", tokens)
# Output: ['I', 'love', 'NLP!']

Whitespace Tokenization: ['I', 'love', 'NLP!']


In [10]:
import nltk
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [11]:
text = "I love NLP! It's amazing."
sentences = nltk.sent_tokenize(text)
print("Sentence Tokenization:", sentences)

Sentence Tokenization: ['I love NLP!', "It's amazing."]
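
For comparison, a naive sentence splitter can be written with a single regular expression. This is only a rough sketch (the name `split_sentences` is illustrative): it splits after `.`, `!` or `?` followed by whitespace and, unlike a trained tokenizer, it has no notion of abbreviations.

```python
import re

def split_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

print(split_sentences("I love NLP! It's amazing."))
# ['I love NLP!', "It's amazing."]
print(split_sentences("Dr. Smith arrived."))
# ['Dr.', 'Smith arrived.']
```

The second call shows the limitation: the abbreviation "Dr." is treated as the end of a sentence.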


In [12]:
import re

text = "user123@email.com"
tokens = re.split(r'(\d+|@|\.)', text)
tokens = [t for t in tokens if t]  # Remove empty strings
print("Regex-Based Tokenization:", tokens)


Regex-Based Tokenization: ['user', '123', '@', 'email', '.', 'com']


In [13]:
import spacy

nlp = spacy.load("en_core_web_sm")  # Load English model
text = "Hello, world!"
doc = nlp(text)
tokens = [token.text for token in doc]
print("Punctuation-Based Tokenization:", tokens)
# Output: ['Hello', ',', 'world', '!']

Punctuation-Based Tokenization: ['Hello', ',', 'world', '!']
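
When loading a spaCy model is not an option, a similar split of words and punctuation can be approximated with `re.findall`. This is a simplified sketch, not a reproduction of spaCy's tokenizer rules; `regex_tokenize` is an illustrative name.

```python
import re

# A token is either a run of word characters or one non-space, non-word character
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def regex_tokenize(text):
    return TOKEN_PATTERN.findall(text)

print(regex_tokenize("Hello, world!"))
# ['Hello', ',', 'world', '!']
print(regex_tokenize("It's amazing."))
# ['It', "'", 's', 'amazing', '.']
```

Note how the contraction is split into three pieces here; dedicated tokenizers usually handle such cases with special rules.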


**5.Remove HTML Tags**

In [14]:
paragraph = '''The quick brown <b>fox</b> jumps over the lazy dog on a sunny day.Rainbows appear <i>after</i> the storm clears in the late afternoon.She <p>decided</p> to take a long walk through the vibrant forest.Colors of <span>nature</span> burst forth in every direction she looked.A <div>gentle</div> breeze rustled the leaves high above her head.Birds sang <h3>sweetly</h3> from their perches in the tall trees.The path wound <strong>steeply</strong> up the hill toward a clearing.A stream flowed <em>quietly</em> beside her as she continued on.By evening, <ul>stars</ul> began to twinkle in the darkening sky
She smiled, feeling <a href="#">peaceful</a> after her journey ended.'''

In [15]:
import re
def remove_html_tags(text):
  # Non-greedy match, so each '<...>' tag is removed separately
  pattern = re.compile(r'<.*?>')
  return pattern.sub('', text)
remove_html_tags(paragraph)


'The quick brown fox jumps over the lazy dog on a sunny day.Rainbows appear after the storm clears in the late afternoon.She decided to take a long walk through the vibrant forest.Colors of nature burst forth in every direction she looked.A gentle breeze rustled the leaves high above her head.Birds sang sweetly from their perches in the tall trees.The path wound steeply up the hill toward a clearing.A stream flowed quietly beside her as she continued on.By evening, stars began to twinkle in the darkening sky\nShe smiled, feeling peaceful after her journey ended.'
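
The regex approach works for simple markup like the paragraph above, but it leaves character references such as `&amp;` untouched. A possible alternative, sketched here with the standard library's `html.parser` module, collects only the text nodes. The class and function names are illustrative.

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects the text content of an HTML fragment and ignores the tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode &amp;, &lt;, ...
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_tags(html_text):
    parser = _TextExtractor()
    parser.feed(html_text)
    parser.close()  # flush any text still buffered
    return "".join(parser.parts)

print(strip_tags("The quick brown <b>fox</b> jumps"))
# The quick brown fox jumps
print(strip_tags("Tom &amp; Jerry"))
# Tom & Jerry
```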

**6.Stemming**

In [16]:
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()

def stemming(text):
  # PorterStemmer.stem() lowercases each word by default before stemming
  return " ".join([ps.stem(word) for word in text.split()])

df['text']= df['text'].apply(stemming)
df

Unnamed: 0,text,category
0,i love programmg python it amazg,positive
1,the cat run jump happili,animals
2,data scienc fun nt,data
3,remov stopword punctuat extra space,instructions


**7.Lemmatization**

In [17]:
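import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

def convert_lemmatizer(text):
  # lemmatize() treats each word as a noun unless a pos argument is given
  return " ".join([lemmatizer.lemmatize(word) for word in text.split()])

# The text was already stemmed in the previous step, so the output below is unchanged
df['text'] = df['text'].apply(convert_lemmatizer)
df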

[nltk_data] Downloading package wordnet to /root/nltk_data...


Unnamed: 0,text,category
0,i love programmg python it amazg,positive
1,the cat run jump happili,animals
2,data scienc fun nt,data
3,remov stopword punctuat extra space,instructions
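
To make the difference between the two techniques concrete, here is a toy comparison. Both functions are deliberately crude assumptions for illustration: `crude_stem` strips a few suffixes by rule (it is not the Porter algorithm), and `lookup_lemma` uses a tiny hand-written dictionary in place of WordNet.

```python
SUFFIXES = ("ing", "ly", "es", "s")  # checked in this order

def crude_stem(word):
    """Rule-based: chop off a known suffix if at least 3 letters remain."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w

# Hypothetical lemma dictionary, standing in for a real lexical resource
LEMMAS = {"cats": "cat", "running": "run", "jumping": "jump", "are": "be"}

def lookup_lemma(word):
    """Dictionary-based: return the known base form, else the word itself."""
    w = word.lower()
    return LEMMAS.get(w, w)

for w in ["running", "cats", "happily"]:
    print(w, "->", crude_stem(w), "|", lookup_lemma(w))
# running -> runn | run
# cats -> cat | cat
# happily -> happi | happily
```

The stemmer can produce non-words such as "runn" and "happi", while the lookup only returns real words but knows nothing outside its dictionary.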


**8. Handle Emojis**

In [18]:
!pip install emoji

Collecting emoji
  Downloading emoji-2.14.1-py3-none-any.whl.metadata (5.7 kB)
Downloading emoji-2.14.1-py3-none-any.whl (590 kB)
Installing collected packages: emoji
Successfully installed emoji-2.14.1


In [19]:
import emoji
text = "The emoji 🥰, which depicts"
print(emoji.demojize(text))

The emoji :smiling_face_with_hearts:, which depicts
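
If installing the `emoji` package is not possible, a rough approximation can be built on the standard library's `unicodedata` module. This sketch is an assumption-laden stand-in, not the package's algorithm: it replaces every character in Unicode category "So" (other symbols, which covers most single-codepoint emoji but also signs like "©") with its Unicode name, and it does not handle multi-codepoint sequences. The resulting names may also differ from those that `emoji.demojize` produces.

```python
import unicodedata

def describe_emoji(text):
    """Replace 'So'-category symbols with :unicode_name: placeholders."""
    out = []
    for ch in text:
        if unicodedata.category(ch) == "So":
            name = unicodedata.name(ch, "unknown").lower().replace(" ", "_")
            out.append(f":{name}:")
        else:
            out.append(ch)
    return "".join(out)

print(describe_emoji("Nice 😀"))
# Nice :grinning_face:
```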
