# Natural Language Processing (NLP) Fundamentals

This notebook walks through runnable Python examples of core Natural Language Processing techniques: text preprocessing, feature extraction, classification, topic modeling, named entity recognition, search, and summarization.

## Prerequisites
```bash
pip install scikit-learn spacy gensim pandas numpy
python -m spacy download en_core_web_sm
```

## Table of Contents
1. [Text Preprocessing and Feature Extraction](#preprocessing)
2. [Classification and Machine Learning](#classification)
3. [Advanced NLP Techniques](#advanced)
4. [Text Analysis and Search](#analysis)

## 1. Text Preprocessing and Feature Extraction <a id="preprocessing"></a>

### Text Preprocessing Pipeline
Clean and normalize raw text data for NLP tasks.

In [4]:
# file: 1_preprocess.py
import re
from typing import List
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
import spacy

nlp = spacy.load("en_core_web_sm")  # small, fast English model

def basic_clean(text: str) -> str:
    # lower, strip urls/emails/@mentions/hashtags, keep letters/numbers/space/apostrophe
    text = text.lower()
    text = re.sub(r"(http\S+|www\.\S+)", " ", text)
    text = re.sub(r"\S+@\S+", " ", text)
    text = re.sub(r"[@#]\w+", " ", text)
    text = re.sub(r"[^a-z0-9\s']", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

def tokenize_stop_lemma(text: str) -> List[str]:
    doc = nlp(text)
    out = []
    for tok in doc:
        if tok.is_space or tok.is_punct:
            continue
        lemma = tok.lemma_.lower().strip()
        if len(lemma) < 3:           # drop very short tokens
            continue
        if lemma in ENGLISH_STOP_WORDS:  # sklearn's built-in stoplist
            continue
        out.append(lemma)
    return out

def preprocess(text: str) -> List[str]:
    return tokenize_stop_lemma(basic_clean(text))

if __name__ == "__main__":
    s = "Emails like help@site.com are filtered. I'm LOVING NLP!!! Visit https://x.y."
    print(preprocess(s))

['email', 'like', 'filter', 'love', 'nlp', 'visit']
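
To see what each substitution in `basic_clean` contributes, the sketch below applies the same regular expressions one stage at a time and records the intermediate strings. The helper name `trace_clean` and the stage labels are illustrative and not part of the original script.

```python
import re

# The same substitutions as basic_clean, applied one at a time.
STAGES = [
    ("urls",        r"(http\S+|www\.\S+)"),
    ("emails",      r"\S+@\S+"),
    ("mentions",    r"[@#]\w+"),
    ("punctuation", r"[^a-z0-9\s']"),
]

def trace_clean(text: str):
    """Return [(stage_name, text_after_stage), ...] for each cleaning stage."""
    text = text.lower()
    trace = [("lower", text)]
    for name, pattern in STAGES:
        text = re.sub(pattern, " ", text)
        trace.append((name, text))
    # Collapse the runs of spaces left behind by the substitutions
    text = re.sub(r"\s+", " ", text).strip()
    trace.append(("whitespace", text))
    return trace

sample = "Emails like help@site.com are filtered. I'm LOVING NLP!!! Visit https://x.y."
for name, value in trace_clean(sample):
    print(f"{name:<12} {value!r}")
```

The order of the stages matters: if the mention pattern ran before the email pattern, `help@site.com` would be reduced to `help .com` and leave stray fragments behind.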


### Bag of Words (BoW) Representation
Convert text documents to numerical feature vectors using word counts.

In [5]:
# file: 3_bow_demo.py
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Sample corpus
corpus = [
    "I love natural language processing",
    "Language processing is fun",
    "I love coding in Python",
    "Python and NLP are powerful"
]

# Create Bag of Words model
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# Convert to dataframe for readability
bow_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())

print("Vocabulary:", vectorizer.get_feature_names_out())
print("\nBag of Words Matrix:")
print(bow_df)

Vocabulary: ['and' 'are' 'coding' 'fun' 'in' 'is' 'language' 'love' 'natural' 'nlp'
 'powerful' 'processing' 'python']

Bag of Words Matrix:
   and  are  coding  fun  in  is  language  love  natural  nlp  powerful  \
0    0    0       0    0   0   0         1     1        1    0         0   
1    0    0       0    1   0   1         1     0        0    0         0   
2    0    0       1    0   1   0         0     1        0    0         0   
3    1    1       0    0   0   0         0     0        0    1         1   

   processing  python  
0           1       0  
1           1       0  
2           0       1  
3           0       1  
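
The counts above can be reproduced without scikit-learn, which makes the mechanics explicit. This is a minimal sketch that assumes the vectorizer's defaults (lowercasing, tokens of two or more word characters, alphabetically sorted vocabulary); the `tokenize` helper is hypothetical.

```python
import re
from collections import Counter

corpus = [
    "I love natural language processing",
    "Language processing is fun",
    "I love coding in Python",
    "Python and NLP are powerful",
]

def tokenize(text):
    # Lowercase, then keep tokens of two or more word characters
    # (this is why the single-letter token "i" never reaches the vocabulary)
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(doc) for doc in corpus]

# Vocabulary: every distinct token, sorted alphabetically
vocabulary = sorted({tok for doc in tokenized for tok in doc})

# One row per document, one column per vocabulary term
matrix = []
for doc in tokenized:
    counts = Counter(doc)
    matrix.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```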


### TF-IDF Vectorization
Weight words by their importance using Term Frequency-Inverse Document Frequency.

In [6]:
# file: 4_tfidf_demo.py
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Sample documents
docs = [
    "machine learning is fun",
    "deep learning advances machine intelligence",
    "artificial intelligence and machine learning"
]

# Create TF-IDF model
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)

# Convert to DataFrame
tfidf_df = pd.DataFrame(X.toarray(), columns=tfidf.get_feature_names_out())

print("Vocabulary:", tfidf.get_feature_names_out())
print("\nTF-IDF Matrix:")
print(tfidf_df.round(3))

Vocabulary: ['advances' 'and' 'artificial' 'deep' 'fun' 'intelligence' 'is' 'learning'
 'machine']

TF-IDF Matrix:
   advances    and  artificial   deep    fun  intelligence     is  learning  \
0     0.000  0.000       0.000  0.000  0.609          0.00  0.609     0.360   
1     0.552  0.000       0.000  0.552  0.000          0.42  0.000     0.326   
2     0.000  0.552       0.552  0.000  0.000          0.42  0.000     0.326   

   machine  
0    0.360  
1    0.326  
2    0.326  
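
The weights in the matrix can be derived by hand. The sketch below assumes scikit-learn's default settings: raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The function name `tfidf_row` is illustrative.

```python
import math
from collections import Counter

docs = [
    "machine learning is fun",
    "deep learning advances machine intelligence",
    "artificial intelligence and machine learning",
]

tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)
vocabulary = sorted({tok for doc in tokenized for tok in doc})

# Document frequency: number of documents that contain each term
df = Counter(tok for doc in tokenized for tok in set(doc))

# Smoothed inverse document frequency
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}

def tfidf_row(tokens):
    """Return {term: weight} for one document, scaled to unit L2 length."""
    tf = Counter(tokens)
    raw = {t: tf[t] * idf[t] for t in tf}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()}

rows = [tfidf_row(doc) for doc in tokenized]
for row in rows:
    print({t: round(w, 3) for t, w in sorted(row.items())})
```

Terms that occur in every document ("learning", "machine") get the minimum IDF of 1, which is why they carry the smallest weight in each row.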


### Word2Vec Embeddings
Create dense vector representations of words that capture semantic relationships.

In [7]:
# file: 5_word2vec_demo.py
from gensim.models import Word2Vec

# Sample sentences (tokenized)
sentences = [
    ["i", "love", "natural", "language", "processing"],
    ["language", "processing", "is", "fun"],
    ["deep", "learning", "advances", "artificial", "intelligence"],
    ["python", "is", "great", "for", "machine", "learning"]
]

# Train a skip-gram (sg=1) Word2Vec model; a corpus this small gives only illustrative vectors
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, workers=2, sg=1)

# Explore embeddings
print("Vector for 'language':\n", model.wv["language"])
print("\nMost similar to 'learning':", model.wv.most_similar("learning"))
print("\nSimilarity between 'python' and 'language':", model.wv.similarity("python", "language"))

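Note: this cell did not run because gensim was not installed (`ModuleNotFoundError: No module named 'gensim'`). Install it as shown in Prerequisites and re-run the cell to see the embeddings.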
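
With `sg=1` the model is trained in skip-gram mode, where each word is used to predict the words around it. The sketch below only shows how (center, context) training pairs could be enumerated for a fixed window. It is a simplification: real implementations typically also vary the window size and sub-sample frequent words, and the function name `skipgram_pairs` is hypothetical.

```python
def skipgram_pairs(tokens, window):
    """Yield (center, context) pairs within `window` positions of each token."""
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                yield center, tokens[j]

sentence = ["language", "processing", "is", "fun"]
for pair in skipgram_pairs(sentence, window=1):
    print(pair)
```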

## 2. Classification and Machine Learning <a id="classification"></a>

### Text Classification with TF-IDF and Logistic Regression
Perform sentiment analysis using a machine learning pipeline.

In [None]:
# file: 2_classify_tfidf.py
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
import numpy as np

# tiny demo dataset (positive/negative sentiment)
texts = [
    "I loved this movie, fantastic acting and great story",
    "This film was terrible and boring",
    "Absolutely wonderful experience, highly recommend",
    "Worst acting ever, do not watch",
    "It was okay, some parts were fun",
    "I hated the plot, very disappointing",
    "Brilliant direction and superb cast",
    "Not good, waste of time",
    "Enjoyable and engaging from start to finish",
    "Awful soundtrack and weak story"
]
labels = np.array([1,0,1,0,1,0,1,0,1,0])  # 1=pos, 0=neg

X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.3, random_state=42, stratify=labels)

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1,2), min_df=1)),
    ("clf",   LogisticRegression(max_iter=1000))
])

pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)

print("Classification report:\n", classification_report(y_test, y_pred, digits=4))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))

# try the model
samples = ["pretty good but slow in places", "utterly awful, I want my time back"]
print("Predictions:", pipe.predict(samples))
print("Class probabilities:", pipe.predict_proba(samples))
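
The numbers in a classification report follow directly from the confusion matrix. As a rough sketch, the function below computes precision, recall, and F1 for the positive class from two label lists. The labels are made up for illustration and are not the output of the pipeline above; `binary_metrics` is a hypothetical helper.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Confusion matrix plus precision/recall/F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        # rows = true class, columns = predicted class, negative class first
        "confusion": [[tn, fp], [fn, tp]],
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Made-up labels for illustration only
y_true = [1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 1]
print(binary_metrics(y_true, y_pred))
```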

### Hyperparameter Tuning with Grid Search
Optimize model performance by systematically testing different parameter combinations.

In [None]:
# file: 3_tune_grid.py
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
import numpy as np

texts = [
    "excellent movie with great acting",
    "terrible plot and awful pacing",
    "loved every moment, fantastic!",
    "boring and predictable",
    "superb cinematography and direction",
    "weak script and bad acting",
    "what a masterpiece",
    "not good at all",
    "brilliant experience overall",
    "do not recommend"
]
y = np.array([1,0,1,0,1,0,1,0,1,0])

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000))
])

param_grid = {
    "tfidf__ngram_range": [(1,1),(1,2)],
    "tfidf__min_df": [1,2],
    "tfidf__analyzer": ["word", "char_wb"],
    "clf__C": [0.25, 1.0, 4.0]  # regularization strength
}

search = GridSearchCV(pipe, param_grid, cv=3, n_jobs=-1, scoring="f1")
search.fit(texts, y)

print("Best params:", search.best_params_)
print("Best CV score (f1):", search.best_score_)
best_model = search.best_estimator_
print("Sample prediction:", best_model.predict(["not a great movie but had moments"]))
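
It helps to know how much work a grid implies before launching a search. The sketch below expands the same parameter grid with `itertools.product` and counts the candidates; the variable names are illustrative.

```python
from itertools import product

param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "tfidf__min_df": [1, 2],
    "tfidf__analyzer": ["word", "char_wb"],
    "clf__C": [0.25, 1.0, 4.0],
}

# Every combination of one value per parameter is one candidate
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

cv_folds = 3
print(len(candidates), "candidates,", len(candidates) * cv_folds, "cross-validation fits")
print("First candidate:", candidates[0])
```

Each added parameter multiplies the number of fits, so grids grow quickly. With the default `refit=True`, the search also trains the best candidate once more on the full data.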

### Naive Bayes Classification
Simple probabilistic classifier that works well with text data.

In [None]:
# file: 6_nb_classify.py
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

# Tiny sentiment dataset
texts = [
    "I love this movie",
    "This film was awful",
    "Amazing performance and great story",
    "Boring and too long",
    "Fantastic acting",
    "Terrible direction"
]
labels = [1, 0, 1, 0, 1, 0]  # 1=positive, 0=negative

# Split data (stratify so both classes appear in the train and test sets)
X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.3, random_state=42, stratify=labels)

# Bag of Words + Naive Bayes
vectorizer = CountVectorizer()
X_train_bow = vectorizer.fit_transform(X_train)
X_test_bow = vectorizer.transform(X_test)

clf = MultinomialNB()
clf.fit(X_train_bow, y_train)

y_pred = clf.predict(X_test_bow)
print("Classification Report:\n", classification_report(y_test, y_pred))
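
To make the "probabilistic" part concrete, here is a from-scratch sketch of multinomial Naive Bayes with add-one (Laplace) smoothing, trained on the same six sentences. It is a simplified illustration, not scikit-learn's implementation; `tokenize`, `train_nb`, and `predict_nb` are hypothetical helpers.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    return re.findall(r"\b\w\w+\b", text.lower())

def train_nb(texts, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods per class."""
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for text, label in zip(texts, labels):
        word_counts[label].update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    model = {}
    for label, counts in word_counts.items():
        total = sum(counts.values()) + alpha * len(vocab)
        model[label] = {
            "log_prior": math.log(class_docs[label] / len(labels)),
            "log_likelihood": {w: math.log((counts[w] + alpha) / total) for w in vocab},
        }
    return model

def predict_nb(model, text):
    """Return the class with the highest log posterior score."""
    scores = {}
    for label, params in model.items():
        score = params["log_prior"]
        for tok in tokenize(text):
            # Words never seen in training are skipped
            score += params["log_likelihood"].get(tok, 0.0)
        scores[label] = score
    return max(scores, key=scores.get)

texts = [
    "I love this movie", "This film was awful",
    "Amazing performance and great story", "Boring and too long",
    "Fantastic acting", "Terrible direction",
]
labels = [1, 0, 1, 0, 1, 0]

model = train_nb(texts, labels)
print(predict_nb(model, "great acting"))
print(predict_nb(model, "awful and boring direction"))
```

Smoothing keeps a word that appears in only one class from driving the other class's probability to zero.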

## 3. Advanced NLP Techniques <a id="advanced"></a>

### Topic Modeling with Latent Dirichlet Allocation (LDA)
Discover hidden topics in document collections.

In [None]:
# file: 4_topic_lda.py
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "cats purr and sleep on the sofa",
    "dogs bark and love to play fetch",
    "kittens and puppies are adorable",
    "stocks rallied as the market rose",
    "investors expect inflation to ease",
    "central bank raised interest rates"
]

# bag-of-words for LDA (LDA models raw term counts)
cv = CountVectorizer(stop_words="english")
X = cv.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=42)
lda.fit(X)

words = cv.get_feature_names_out()

def show_topics(model, feature_names, topn=6):
    # print the topn highest-weighted terms for each topic
    for i, comp in enumerate(model.components_):
        terms = comp.argsort()[::-1][:topn]
        print(f"Topic {i}:", " ".join(feature_names[t] for t in terms))

show_topics(lda, words)

# infer topics for a new doc; CountVectorizer does not lemmatize, so the doc
# must reuse exact vocabulary forms ("puppies", "sleep", "dogs", "play")
new = cv.transform(["the puppies sleep while the dogs play"])
print("Topic distribution:", np.round(lda.transform(new), 3))
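
The rows of `components_` are topic-word pseudo-counts rather than probabilities. Dividing each row by its sum gives a per-topic word distribution, as sketched below. The numbers and the four-word vocabulary are made up for illustration and are not output from the model above.

```python
# Hypothetical pseudo-counts for 2 topics over a 4-word vocabulary (made-up numbers)
feature_names = ["bank", "cats", "dogs", "market"]
components = [
    [0.5, 3.5, 2.5, 0.5],
    [3.0, 0.5, 0.5, 4.0],
]

def topic_word_distribution(rows):
    """Normalize each row so its entries sum to 1."""
    return [[v / sum(row) for v in row] for row in rows]

distribution = topic_word_distribution(components)
for i, probs in enumerate(distribution):
    ranked = sorted(zip(feature_names, probs), key=lambda kv: -kv[1])
    print(f"Topic {i}:", [(w, round(p, 3)) for w, p in ranked])
```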

### Topic Modeling with Gensim
An alternative LDA implementation using the Gensim library, which is designed to stream large corpora.

In [None]:
# file: 8_topic_modeling.py
from gensim import corpora, models

# Example dataset
docs = [
    "I love deep learning and natural language processing",
    "Artificial intelligence is the future",
    "Cooking and baking are my hobbies",
    "I enjoy trying new recipes in the kitchen",
    "Machine learning and AI are closely related"
]

# Tokenize
texts = [doc.lower().split() for doc in docs]

# Create dictionary & corpus
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Train LDA model (2 topics); random_state fixes the seed so the topics are reproducible
lda_model = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10, random_state=42)

# Show topics
for idx, topic in lda_model.print_topics(-1):
    print(f"Topic {idx}: {topic}")
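
The dictionary and corpus steps turn each tokenized document into a sparse list of `(token_id, count)` pairs. The pure-Python sketch below mimics that idea. The id assignment (first-seen order) and the helper names `build_dictionary` and `to_bow` are assumptions for illustration, not Gensim's internals.

```python
from collections import Counter

def build_dictionary(tokenized_docs):
    """Map each distinct token to an integer id (first-seen order; an assumption)."""
    token2id = {}
    for doc in tokenized_docs:
        for tok in doc:
            token2id.setdefault(tok, len(token2id))
    return token2id

def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs; tokens not in the dictionary are dropped."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

texts = [
    "machine learning and ai are closely related".split(),
    "cooking and baking are my hobbies".split(),
]
token2id = build_dictionary(texts)

# "robots" is not in the dictionary, so it does not appear in the result
print(to_bow("ai and machine learning and robots".split(), token2id))
```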


### Named Entity Recognition (NER) and Part-of-Speech Tagging
Extract entities and analyze grammatical structure using spaCy.

In [None]:
# file: 5_spacy_ner_pos.py
import spacy
from pprint import pprint

nlp = spacy.load("en_core_web_sm")

text = ("Apple is opening a new office in Bengaluru next quarter. "
        "Tim Cook met Karnataka officials on September 3, 2025 to discuss expansion.")

doc = nlp(text)

print("\nNamed Entities (text, label):")
for ent in doc.ents:
    print(f"{ent.text:<25}  -> {ent.label_}")

print("\nPart-of-Speech & Lemmas:")
for token in doc:
    if not token.is_space:
        print(f"{token.text:<15} POS={token.pos_:<5}  Lemma={token.lemma_}")

print("\nNoun chunks (base NPs):")
pprint([chunk.text for chunk in doc.noun_chunks])

## 4. Text Analysis and Search <a id="analysis"></a>

### Semantic Search
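Rank documents against a query by cosine similarity between TF-IDF vectors. This matches on shared terms rather than on meaning, so it behaves more like keyword search than embedding-based semantic search.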

In [None]:
# file: 6_semantic_search.py
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

corpus = [
    "deep learning methods for image classification",
    "convolutional neural networks for vision",
    "natural language processing with transformers",
    "classical machine learning with SVM and logistic regression",
    "transfer learning for NLP tasks"
]

vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1,2))
X = vectorizer.fit_transform(corpus)

def search(query: str, topk=3):
    q = vectorizer.transform([query])
    sims = cosine_similarity(q, X).ravel()
    top_idx = sims.argsort()[::-1][:topk]
    results = [(corpus[i], float(sims[i])) for i in top_idx]
    return results

if __name__ == "__main__":
    for item, score in search("best models for text classification", topk=3):
        print(f"{score:.3f}  {item}")
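
Because TF-IDF vectors only overlap where terms are shared, a query scores above zero only against documents that reuse its words. The sketch below lists the overlapping terms for the example query. It uses a tiny illustrative stop list rather than scikit-learn's and ignores bigrams; `terms` is a hypothetical helper.

```python
STOP_WORDS = {"for", "with", "and", "the", "of"}  # tiny illustrative list, not sklearn's

def terms(text):
    """Lowercased whitespace tokens with stop words removed."""
    return {w for w in text.lower().split() if w not in STOP_WORDS}

corpus = [
    "deep learning methods for image classification",
    "convolutional neural networks for vision",
    "natural language processing with transformers",
    "classical machine learning with SVM and logistic regression",
    "transfer learning for NLP tasks",
]

query = "best models for text classification"
overlaps = [terms(query) & terms(doc) for doc in corpus]
for doc, shared in zip(corpus, overlaps):
    print(sorted(shared), "<-", doc)
```

In this sketch only the first document shares a term with the query ("classification"), even though the NLP documents are arguably closer in meaning to "text classification".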

### Document Similarity Analysis
Compute similarity between documents using cosine similarity.

In [None]:
# file: 7_similarity_demo.py
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Example documents
docs = [
    "I love machine learning and NLP",
    "NLP and machine learning are amazing",
    "Cooking recipes are fun to try",
]

# TF-IDF representation
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# Compute cosine similarity
sim_matrix = cosine_similarity(X)

print("Cosine Similarity Matrix:\n", sim_matrix)
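
Cosine similarity is the dot product of two vectors divided by the product of their lengths. A minimal pure-Python sketch on made-up vectors follows. Returning 0.0 for a zero vector is a convention chosen here, and `cosine` is a hypothetical helper.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention: a zero vector is similar to nothing
    return dot / (norm_u * norm_v)

print(cosine([1, 1, 0], [1, 1, 0]))  # same direction
print(cosine([1, 0], [0, 1]))        # no shared components
print(cosine([1, 1], [1, 0]))        # partial overlap
```

Because the score depends only on direction, a long document and a short one with the same term proportions are treated as identical.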

### Extractive Text Summarization
Automatically summarize text by selecting the most important sentences.

In [None]:
# file: 7_summarize_extractive.py
import re
from collections import Counter
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

def sent_split(text: str):
    # lightweight splitter; for production use nltk or spacy
    sents = re.split(r'(?<=[.!?])\s+', text.strip())
    return [s for s in sents if s]

def summarize(text: str, max_sentences=3):
    sents = sent_split(text)
    words = re.findall(r"[a-zA-Z']+", text.lower())
    words = [w for w in words if w not in ENGLISH_STOP_WORDS and len(w) > 2]
    freqs = Counter(words)
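    # no scorable words: fall back to the leading sentences
    if not freqs:
        return " ".join(sents[:max_sentences])
    # normalize frequencies to [0, 1] by the most frequent word
    maxf = max(freqs.values())
    for w in freqs:
        freqs[w] /= maxf

    # sentence score = mean normalized frequency of its words (stop words score 0)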
    scored = []
    for i, s in enumerate(sents):
        ws = re.findall(r"[a-zA-Z']+", s.lower())
        score = sum(freqs.get(w, 0.0) for w in ws) / (len(ws) + 1e-9)
        scored.append((score, i, s))

    # keep top sentences in original order
    top = sorted(sorted(scored, key=lambda x: -x[0])[:max_sentences], key=lambda x: x[1])
    return " ".join(s for _, _, s in top)

if __name__ == "__main__":
    text = (
        "Transformers have revolutionized natural language processing. "
        "By leveraging self-attention, they capture long-range dependencies effectively. "
        "Pretraining on large corpora enables strong performance on many tasks. "
        "However, transformers can be computationally expensive. "
        "Researchers explore efficient architectures and distillation to reduce cost."
    )
    print(summarize(text, max_sentences=2))
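
The comment in `sent_split` recommends nltk or spaCy for production use. The short example below shows why, running the same regex on a sentence that contains an abbreviation.

```python
import re

def sent_split(text: str):
    # Same lightweight splitter: break on whitespace that follows ., ! or ?
    sents = re.split(r'(?<=[.!?])\s+', text.strip())
    return [s for s in sents if s]

# The period in "Dr." is treated as a sentence boundary
print(sent_split("Dr. Smith arrived. He left."))
```

The regex cannot tell an abbreviation from the end of a sentence, so "Dr." becomes a sentence of its own.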

## Conclusion

This notebook covered fundamental Natural Language Processing techniques:

### Key Concepts Covered:
1. **Text Preprocessing**: Cleaning, tokenization, lemmatization, stop word removal
2. **Feature Extraction**: Bag of Words, TF-IDF, Word2Vec embeddings
3. **Classification**: Logistic Regression, Naive Bayes, hyperparameter tuning
4. **Advanced NLP**: Topic modeling (LDA), Named Entity Recognition, POS tagging
5. **Text Analysis**: Semantic search, document similarity, text summarization

### Libraries and Tools:
- **spaCy**: Industrial-strength NLP with pre-trained models
- **scikit-learn**: Machine learning algorithms and text vectorization
- **Gensim**: Topic modeling and word embeddings
- **Regular Expressions**: Text cleaning and pattern matching

### Best Practices:
1. Always preprocess text before applying ML algorithms
2. Use appropriate feature extraction methods for your task
3. Evaluate models with proper train/test splits
4. Consider using pre-trained models for better performance
5. Handle class imbalance in classification tasks
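
One common way to handle class imbalance is to weight each class inversely to its frequency. The sketch below uses the formula `n_samples / (n_classes * class_count)`, which is how scikit-learn's `class_weight="balanced"` option is defined. The labels are made up and `balanced_class_weights` is a hypothetical helper.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}

# Made-up imbalanced labels: 8 negatives, 2 positives
labels = [0] * 8 + [1] * 2
print(balanced_class_weights(labels))
```

The rare class receives the larger weight, so errors on it cost more during training.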

### Next Steps:
- Explore transformer models (BERT, GPT) with Hugging Face
- Learn about neural language models and deep learning for NLP
- Practice with larger, real-world datasets
- Experiment with multilingual NLP models
- Study recent advances like attention mechanisms and transfer learning

### Applications:
- **Sentiment Analysis**: Customer feedback, social media monitoring
- **Document Classification**: Email filtering, content categorization
- **Information Extraction**: Named entity recognition, relation extraction
- **Search and Recommendation**: Semantic search, content recommendation
- **Text Generation**: Chatbots, automated writing assistance