# 📘 NLP Preprocessing Techniques

- Tokenization
- Stemming
- Lemmatization
- Stopword Removal
- Part-of-Speech (POS) Tagging
- Word Embeddings (TF-IDF, Word2Vec, spaCy vectors, Transformer embeddings)


## 🔹 Tokenization
Tokenization is the process of splitting text into smaller units (tokens), such as words, subwords, or sentences.

In [None]:

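import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

nltk.download('punkt')      # Punkt models used by older NLTK releases
nltk.download('punkt_tab')  # resource name that newer NLTK releases look up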

text = "Natural Language Processing (NLP) is fun to learn. It helps machines understand human language!"

# Word Tokenization
print("Word Tokenization:", word_tokenize(text))

# Sentence Tokenization
print("Sentence Tokenization:", sent_tokenize(text))
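
To make the idea concrete without any downloads, here is a minimal regex-based sketch of word and sentence splitting. The helper names `simple_word_tokenize` and `simple_sent_tokenize` are hypothetical, and this is not how NLTK's tokenizers work internally; it only illustrates what "splitting text into tokens" means.

```python
import re

# Hypothetical helpers for illustration; NLTK's tokenizers handle many more cases.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def simple_word_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return TOKEN_PATTERN.findall(text)


def simple_sent_tokenize(text):
    """Split after '.', '!' or '?' when the mark is followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


sample = "NLP is fun to learn. It helps machines understand human language!"
print(simple_word_tokenize(sample))
print(simple_sent_tokenize(sample))
```

A rule this simple would also split after the period in "Dr. Smith", which is one reason to prefer a trained sentence tokenizer such as Punkt for real text.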


## 🔹 Stemming
Stemming reduces words to a root form by stripping suffixes with fixed rules. The result is not always a valid word: the Porter stemmer, for example, turns *easily* into *easili*.

In [None]:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["running", "flies", "easily", "flying"]

print([stemmer.stem(w) for w in words])
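
The sketch below shows why rule-based suffix stripping can produce non-words. It is a toy with a hypothetical four-rule suffix list, not the Porter algorithm, which applies several ordered rule phases with extra conditions.

```python
# Toy suffix stripper for illustration only; this is NOT the Porter algorithm.
SUFFIXES = ("ing", "ly", "es", "s")  # hypothetical rule list, longest first


def toy_stem(word, min_stem=3):
    """Remove the first matching suffix if at least min_stem characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


for w in ["running", "flies", "easily", "flying"]:
    print(w, "->", toy_stem(w))
```

Because the rules know nothing about vocabulary, "running" becomes "runn" and "flies" becomes "fli". Real stemmers add repair rules for some of these cases but still make no promise of returning dictionary words.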


## 🔹 Lemmatization
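Lemmatization reduces words to their base or dictionary form (lemma) using vocabulary and part-of-speech information. NLTK's `WordNetLemmatizer` treats every word as a noun unless a `pos` argument is passed.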

In [None]:

from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
nltk.download('omw-1.4')

lemmatizer = WordNetLemmatizer()
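words = ["running", "flies", "easily", "better"]

# Without a pos argument every word is lemmatized as a noun
print([lemmatizer.lemmatize(w) for w in words])

# Passing the part of speech ("v" = verb, "a" = adjective) changes the result
print(lemmatizer.lemmatize("running", pos="v"), lemmatizer.lemmatize("better", pos="a"))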
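
Since the lemmatizer needs a part of speech to do more than noun lookups, a common pattern is to translate Penn Treebank tags (the kind `nltk.pos_tag` returns) into the single-letter codes WordNet uses. The helper name `penn_to_wordnet` is hypothetical; this is a minimal sketch of that mapping, not an NLTK function.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter: 'n', 'v', 'a' or 'r'."""
    if tag.startswith("J"):
        return "a"  # adjectives: JJ, JJR, JJS
    if tag.startswith("V"):
        return "v"  # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith("RB"):
        return "r"  # adverbs: RB, RBR, RBS
    return "n"      # everything else falls back to noun


for tag in ["VBG", "JJR", "RB", "NNS"]:
    print(tag, "->", penn_to_wordnet(tag))
```

With a lemmatizer and tagged tokens at hand, the call would look like `lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))`.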


## 🔹 Stopword Removal
Stopwords are common words (like *is, the, a, an*) that usually carry little meaning on their own and are often filtered out before counting-based analysis.

In [None]:

from nltk.corpus import stopwords
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))
words = word_tokenize("This is an example showing off stopword filtration.")
filtered = [w for w in words if w.lower() not in stop_words]

print("Original:", words)
print("After Stopword Removal:", filtered)
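
Removing stopwords is not always harmless: NLTK's English list includes negations such as "not" and "no", and dropping them can flip the meaning of a sentence. The sketch below uses a small hand-picked stopword set and a hypothetical `keep` parameter to show one way of protecting such words.

```python
# Small hand-picked stopword set for illustration; the NLTK list is much longer.
STOP_WORDS = {"this", "is", "a", "an", "the", "not", "no"}
KEEP = {"not", "no"}  # negations to preserve (a hypothetical choice)


def remove_stopwords(tokens, stop_words=STOP_WORDS, keep=KEEP):
    """Drop tokens found in stop_words, except those explicitly listed in keep."""
    effective = stop_words - keep
    return [t for t in tokens if t.lower() not in effective]


tokens = "This is not a good example".split()
print(remove_stopwords(tokens))              # negation preserved
print(remove_stopwords(tokens, keep=set()))  # negation removed as well
```

Whether to keep negations depends on the task: sentiment analysis usually needs them, while topic modeling often does not.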


## 🔹 Part-of-Speech (POS) Tagging
POS tagging assigns grammatical tags (noun, verb, adjective, etc.) to words. NLTK's default English tagger returns Penn Treebank tags such as `NNP` or `VBG`.

In [None]:

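nltk.download('averaged_perceptron_tagger')      # tagger model for older NLTK releases
nltk.download('averaged_perceptron_tagger_eng')  # resource name that newer NLTK releases look up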

text = word_tokenize("John is learning NLP using Python.")
print(nltk.pos_tag(text))
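
Tagger output is a list of `(token, tag)` pairs, which is easy to post-process. The sketch below groups tokens by tag and looks up what a few Penn Treebank tags mean. The pairs are hand-written in that format for illustration and are not captured tagger output; `tokens_by_tag` is a hypothetical helper.

```python
from collections import defaultdict

# Hand-written (token, tag) pairs in the (token, tag) format; illustrative only.
tagged = [("John", "NNP"), ("is", "VBZ"), ("learning", "VBG"),
          ("Python", "NNP"), (".", ".")]

# Meanings of the Penn Treebank tags used above.
TAG_MEANINGS = {
    "NNP": "proper noun, singular",
    "VBZ": "verb, 3rd person singular present",
    "VBG": "verb, gerund or present participle",
    ".": "sentence-final punctuation",
}


def tokens_by_tag(pairs):
    """Group tokens under their tag, preserving the original order."""
    grouped = defaultdict(list)
    for token, tag in pairs:
        grouped[tag].append(token)
    return dict(grouped)


for tag, tokens_for_tag in tokens_by_tag(tagged).items():
    print(f"{tag} ({TAG_MEANINGS[tag]}): {tokens_for_tag}")
```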


## 🔹 Word Embeddings
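Embeddings represent words as dense vectors of real numbers that capture semantic meaning. TF-IDF, shown first, is a sparse, count-based weighting of terms per document rather than a learned embedding, but it is a common baseline.

### 1. TF-IDF Embedding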

In [None]:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "Natural Language Processing with Python is fun",
    "Deep learning advances NLP significantly",
    "Python makes machine learning easier"
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

print("Vocabulary:", vectorizer.get_feature_names_out())
print("TF-IDF Matrix:", X.toarray())
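
To see where the numbers in the matrix come from, the weights can be recomputed by hand. This sketch assumes the vectorizer's default settings (lowercasing, tokens of two or more word characters, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization per document); with other settings the values would differ. The function names are hypothetical.

```python
import math
import re
from collections import Counter

corpus = [
    "Natural Language Processing with Python is fun",
    "Deep learning advances NLP significantly",
    "Python makes machine learning easier",
]


def tokenize(doc):
    # Lowercase and keep tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", doc.lower())


def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenized for term in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        raw = {t: tf[t] * idf[t] for t in tf}
        norm = math.sqrt(sum(v * v for v in raw.values()))
        rows.append({t: v / norm for t, v in raw.items()})
    return rows


weights = tfidf(corpus)
for term, value in sorted(weights[0].items()):
    print(f"{term:12s} {value:.3f}")
```

In the first document, "python" receives a lower weight than "fun" because it also appears in the third document, which lowers its IDF.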


### 2. Word2Vec Embedding

In [None]:

from gensim.models import Word2Vec

# Toy corpus: with this little data the learned vectors and similarities are not meaningful
sentences = [["nlp", "is", "fun"], ["python", "makes", "nlp", "easy"]]
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, workers=2)

print("Vector for 'nlp':", model.wv['nlp'])
print("Most similar to 'nlp':", model.wv.most_similar('nlp'))
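
Similarity between word vectors is usually measured with cosine similarity, which is also what `most_similar` ranks by. The sketch below computes it by hand on made-up three-dimensional vectors; the numbers and the helper `toy_most_similar` are purely illustrative and do not come from a trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up vectors for illustration; real embeddings have tens to hundreds of dimensions.
toy_vectors = {
    "nlp": [0.9, 0.1, 0.0],
    "language": [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}


def toy_most_similar(word, vectors):
    """Rank all other words by cosine similarity to the query word."""
    query = vectors[word]
    scores = [(other, cosine_similarity(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


print(toy_most_similar("nlp", toy_vectors))
```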


### 3. spaCy Word Embeddings

In [None]:

import spacy

# Requires the model to be installed first: python -m spacy download en_core_web_md
nlp = spacy.load("en_core_web_md")  # medium model with vectors
doc = nlp("NLP is amazing with embeddings")

for token in doc:
    print(token.text, token.vector[:5])  # print first 5 dimensions


### 4. Transformer-based Embeddings (BERT)

In [None]:

from transformers import AutoTokenizer, AutoModel
import torch

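tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Separate name so the Word2Vec `model` defined earlier is not overwritten
bert_model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("NLP with Transformers is powerful", return_tensors="pt")
with torch.no_grad():  # no gradients needed for feature extraction
    outputs = bert_model(**inputs)

# One vector per token, including [CLS] and [SEP]: (batch, tokens, hidden_size)
print("Embedding shape:", outputs.last_hidden_state.shape)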
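
The model returns one vector per token, so a single sentence vector requires a pooling step. Mean pooling over non-padding tokens is one common choice (an assumption here, not something this notebook prescribes). The sketch below shows the arithmetic on plain Python lists so it runs without downloading a model; `mean_pool` is a hypothetical helper.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the vectors whose mask value is 1, ignoring padding positions."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, x in enumerate(vec):
                total[i] += x
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [x / count for x in total]


# Three toy 2-dimensional token vectors; the last position is padding.
token_vectors = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
print(mean_pool(token_vectors, mask))
```

With real model output, the same idea applies to `outputs.last_hidden_state` and the `attention_mask` entry of the tokenizer output.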
