# Natural Language Processing (NLP) Basics

## Preprocessing Steps
1. **Tokenization**: Splitting text into words, sentences, or subwords.
2. **Lowercasing**: Converting all text to lowercase for consistency.
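3. **Removing punctuation and stopwords**: Dropping punctuation and very common words (e.g., "the", "is") that carry little signal for many tasks.
4. **Stemming**: Stripping suffixes with heuristic rules to reach a stem, which need not be a real word (e.g., "running" → "run", "studies" → "studi").
5. **Lemmatization**: Mapping words to their dictionary form (lemma) using vocabulary and part of speech (e.g., "better" → "good").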
6. **Normalization**: Handling variations like spelling, contractions, or numbers.
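
The first four steps above can be sketched with the standard library alone. This is a minimal illustration, not a production pipeline: the `STOPWORDS` set and the `toy_stem` function are hypothetical stand-ins for a real stopword list and a real stemmer.

```python
import re

# Hypothetical, deliberately tiny stopword list; real lists are much longer.
STOPWORDS = {"the", "is", "a", "an", "and", "of", "to", "in"}

def tokenize(text):
    """Lowercase the text and keep only alphabetic word tokens (drops punctuation)."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]

def toy_stem(token):
    """Strip one common suffix; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Tokenize, lowercase, remove stopwords, then stem."""
    return [toy_stem(t) for t in remove_stopwords(tokenize(text))]

result = preprocess("The cats jumped and the dog is barking.")
print(result)  # ['cat', 'jump', 'dog', 'bark']
```

Note that `toy_stem("running")` returns `"runn"`, not `"run"`: real stemmers carry extra rules (such as undoubling consonants) that this sketch omits.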

## Importance of Text Normalization
- Ensures consistency across text data.
- Reduces vocabulary size, and with it the dimensionality of the feature space.
- Can improve model accuracy by mapping surface variants of the same word to a single token.

Examples:
- Converting “U.S.A.” → “usa”
- Expanding contractions: “don’t” → “do not”
- Handling numbers: “1000” → “one thousand” or “NUM”
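
The three examples above can be sketched as a string-level function applied before tokenization. The `CONTRACTIONS` table is a hypothetical, incomplete mapping, and expanding "it's" to "it is" is an assumption (it can also mean "it has").

```python
import re

# Hypothetical contraction table; extend it for real data.
# Assumption: "it's" is expanded to "it is", although it can also mean "it has".
CONTRACTIONS = {"don't": "do not", "it's": "it is", "can't": "cannot"}

def normalize_text(text):
    """Lowercase, collapse dotted acronyms, expand contractions, mask numbers."""
    text = text.lower().replace("’", "'")  # unify curly and straight apostrophes
    # Collapse dotted acronyms such as "u.s.a." into "usa".
    text = re.sub(r"\b(?:[a-z]\.){2,}", lambda m: m.group(0).replace(".", ""), text)
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Replace each run of digits with a placeholder token.
    return re.sub(r"\d+", "NUM", text)

normalized_text = normalize_text("U.S.A. is running fast! Don't worry, it's 1000 times better.")
print(normalized_text)
# usa is running fast! do not worry, it is NUM times better.
```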


In [9]:
import spacy
import re

# Load spaCy English model (make sure it's installed: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

text = "U.S.A. is running fast! Don't worry, it's 1000 times better."

# 1. Tokenization + Lowercasing
doc = nlp(text)
tokens = [token.text.lower() for token in doc]
print("Tokens:", tokens)

# 2. Remove punctuation and stopwords
tokens_clean = [token.text.lower() for token in doc if not token.is_stop and token.is_alpha]
print("Clean tokens:", tokens_clean)

# 3. Lemmatization (spaCy built-in)
lemmas = [token.lemma_ for token in doc if not token.is_stop and token.is_alpha]
print("Lemmas:", lemmas)

# 4. Normalization examples (applied per token)
# spaCy splits "Don't" into "do" + "n't", so the contraction is expanded at the token level.
normalized = []
for token in doc:
    t = token.text.lower()
    if re.fullmatch(r"(?:[a-z]\.)+", t):
        t = t.replace(".", "")  # dotted acronym: "u.s.a." -> "usa"
    elif t == "n't":
        t = "not"  # together with the preceding "do", this yields "do not"
    t = re.sub(r"\d+", "NUM", t)  # replace numbers with NUM
    normalized.append(t)
print("Normalized:", normalized)


Tokens: ['u.s.a.', 'is', 'running', 'fast', '!', 'do', "n't", 'worry', ',', 'it', "'s", '1000', 'times', 'better', '.']
Clean tokens: ['running', 'fast', 'worry', 'times', 'better']
Lemmas: ['run', 'fast', 'worry', 'time', 'well']
Normalized: ['usa', 'is', 'running', 'fast', '!', 'do', 'not', 'worry', ',', 'it', "'s", 'NUM', 'times', 'better', '.']


# Importance of Text Normalization
Normalization ensures that variations in text do not mislead models.

Examples:
- "USA", "U.S.A.", "United States" → standardized to "usa" (the last one needs an explicit alias table, not just character rules)
- "running", "ran", "runs" → lemmatized to "run"
- "don't" → "do not"

Without normalization, a model treats these variants as unrelated tokens, which increases sparsity and can reduce accuracy.
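
The effect on sparsity can be made concrete by counting distinct tokens before and after normalization. This is a small sketch; the `CANONICAL` alias table and the token list are hypothetical.

```python
from collections import Counter

# Hypothetical token stream containing surface variants of two words.
raw_tokens = ["USA", "U.S.A.", "usa", "running", "ran", "runs", "run"]

# Hypothetical alias table mapping lowercased variants to one canonical token.
CANONICAL = {"u.s.a.": "usa", "running": "run", "ran": "run", "runs": "run"}

def normalize_token(token):
    """Lowercase a token and map it to its canonical form if one is listed."""
    lowered = token.lower()
    return CANONICAL.get(lowered, lowered)

raw_counts = Counter(raw_tokens)
norm_counts = Counter(normalize_token(t) for t in raw_tokens)

print(len(raw_counts), "distinct tokens before normalization")  # 7
print(len(norm_counts), "distinct tokens after normalization")  # 2
print(norm_counts)
```

After normalization the counts for each word are pooled into one entry, so a model sees seven observations of two features instead of one observation each of seven features.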


# Naive Bayes Classifier

## Principle
- Based on Bayes’ theorem: P(class|features) ∝ P(features|class) * P(class).
- Assumes **conditional independence** of features given the class, so P(features|class) factors into the product of the per-feature probabilities P(feature_i|class).
- Works well for text classification where features are word counts.
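
The principle above can be sketched from scratch for word-count features. This is an illustrative multinomial variant with add-one (Laplace) smoothing, computed in log space to avoid underflow. The function names `train_nb` and `predict_nb` and the toy data are assumptions made for this example, not part of any library.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from lists of tokens."""
    vocab = {tok for doc in docs for tok in doc}
    class_counts = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)
    # log P(class)
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    # log P(token | class) with add-alpha smoothing over the vocabulary
    log_lik = {}
    for c in class_counts:
        total = sum(token_counts[c].values()) + alpha * len(vocab)
        log_lik[c] = {t: math.log((token_counts[c][t] + alpha) / total) for t in vocab}
    return log_prior, log_lik

def predict_nb(doc, log_prior, log_lik):
    """Return the class with the highest log posterior; unseen tokens are skipped."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(log_lik[c][t] for t in doc if t in log_lik[c])
    return max(scores, key=scores.get)

# Toy training data (hypothetical).
docs = [["baseball", "game"], ["team", "won", "baseball"],
        ["space", "launch"], ["planet", "space"]]
labels = ["sport", "sport", "space", "space"]

log_prior, log_lik = train_nb(docs, labels)
prediction = predict_nb(["baseball", "team"], log_prior, log_lik)
print(prediction)  # sport
```

Summing log likelihoods is the log-space form of multiplying per-feature probabilities, which is exactly where the conditional independence assumption enters.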

## Strengths
- Simple, fast, efficient for high-dimensional text data.
- Often performs well even with small datasets.
- Robust to irrelevant features.

## Weaknesses
- Independence assumption often unrealistic.
- Struggles with correlated features.
- May underperform compared to more complex models on large datasets.


In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, f1_score, classification_report

# Synthetic text dataset
texts = [
    "Baseball is a great sport",
    "I love watching space launches",
    "Graphics design is my passion",
    "The team won the baseball game",
    "NASA discovered a new planet",
    "3D graphics are amazing"
]
labels = [0, 1, 2, 0, 1, 2]  # 0=baseball, 1=space, 2=graphics
categories = ['rec.sport.baseball', 'sci.space', 'comp.graphics']

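# Split into train/test manually (tiny, illustrative split only:
# training has two baseball examples and one each of space and graphics; the test set has no baseball example)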
X_train = texts[:4]
y_train = labels[:4]
X_test = texts[4:]
y_test = labels[4:]

# Vectorize text
vectorizer = CountVectorizer(stop_words='english')
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Train Naive Bayes
nb = MultinomialNB()
nb.fit(X_train_vec, y_train)

# Predict
y_pred = nb.predict(X_test_vec)

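# Evaluate
# labels= keeps all three classes in the report even if one is absent from y_test and y_pred;
# zero_division=0 reports 0.0 without warnings when precision or recall is undefined.
print("Accuracy:", accuracy_score(y_test, y_pred))
print("F1-score:", f1_score(y_test, y_pred, average='weighted', zero_division=0))
print(classification_report(y_test, y_pred, labels=[0, 1, 2], target_names=categories, zero_division=0))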


Accuracy: 0.5
F1-score: 0.5
                    precision    recall  f1-score   support

rec.sport.baseball       0.00      0.00      0.00         0
         sci.space       0.00      0.00      0.00         1
     comp.graphics       1.00      1.00      1.00         1

          accuracy                           0.50         2
         macro avg       0.33      0.33      0.33         2
      weighted avg       0.50      0.50      0.50         2
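
One way to read the report above is to check how many words each test sentence shares with the training sentences. The sketch below uses only the standard library. Its `tokens` helper approximates CountVectorizer's default token pattern and ignores stop-word filtering, which could only shrink the overlap further.

```python
import re

train_texts = [
    "Baseball is a great sport",
    "I love watching space launches",
    "Graphics design is my passion",
    "The team won the baseball game",
]
test_texts = ["NASA discovered a new planet", "3D graphics are amazing"]

def tokens(text):
    """Lowercased tokens of two or more word characters (approximation)."""
    return set(re.findall(r"\b\w\w+\b", text.lower()))

train_vocab = set().union(*(tokens(t) for t in train_texts))

# Words of each test sentence that also occur in the training vocabulary.
overlap = {t: sorted(tokens(t) & train_vocab) for t in test_texts}
for text, shared in overlap.items():
    print(f"{text!r}: shared words = {shared}")
```

The space sentence shares no words with the training data, so its count vector is all zeros and the prediction falls back on the class prior, which favors baseball (two of the four training examples). The graphics sentence shares "graphics" and is classified correctly.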




# Strengths and Weaknesses of Naive Bayes

**Strengths:**
- Fast and efficient for text classification.
- Works well with high-dimensional sparse data.
- Typically needs less training data than more complex models.

**Weaknesses:**
- Assumes independence of features, which is rarely true in language.
- Struggles with correlated words (e.g., "New York").
- May be outperformed by more complex models such as SVMs or deep neural networks.
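
The correlated-words weakness can be illustrated with a small calculation. Words like "New" and "York" almost always appear together, so the second adds little new information, yet Naive Bayes multiplies in the likelihood of each as if it were independent evidence. The likelihood values below are hypothetical.

```python
def posterior_a(likelihood_a, likelihood_b, copies, prior_a=0.5):
    """Posterior of class A when the same evidence is counted `copies` times."""
    score_a = prior_a * likelihood_a ** copies
    score_b = (1 - prior_a) * likelihood_b ** copies
    return score_a / (score_a + score_b)

# Hypothetical likelihoods: the evidence is slightly more probable under class A.
once = posterior_a(0.6, 0.4, copies=1)   # evidence counted once
twice = posterior_a(0.6, 0.4, copies=2)  # same evidence counted as two independent features

print(round(once, 3))   # 0.6
print(round(twice, 3))  # 0.692
```

Counting the same evidence twice pushes the posterior from 0.6 to about 0.69, which is why Naive Bayes probability estimates tend to be overconfident when features are strongly correlated.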
