# Natural Language Processing Pipeline

## 1️⃣ What is the raw text?

Raw text is the original human language data before any processing.
It can contain:
- Upper/lowercase variations
- Punctuation
- Noise (typos, stray symbols, leftover markup)
- Grammatical variations (e.g. "enjoy" vs. "enjoyed")

**Example task:** Sentiment classification

In [7]:
# Sample text data for sentiment classification
texts = [
    "I love this movie!",
    "This movie was terrible...",
    "Amazing acting and great story",
    "Worst movie ever",
    "I enjoyed the film a lot",
    "Not good, very boring"
]

labels = [1, 0, 1, 0, 1, 0]  # 1 = positive, 0 = negative

print("Sample texts:")
for i, text in enumerate(texts):
    print(f"{i+1}. {text} (label: {labels[i]})")

Sample texts:
1. I love this movie! (label: 1)
2. This movie was terrible... (label: 0)
3. Amazing acting and great story (label: 1)
4. Worst movie ever (label: 0)
5. I enjoyed the film a lot (label: 1)
6. Not good, very boring (label: 0)


## 2️⃣ How is the text cleaned?

**Cleaning** = removing noise so patterns are easier to learn

**Typical steps:**
- Lowercasing
- Removing punctuation
- Removing stopwords

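**Goal:** Reduce variation without losing meaning. This is a trade-off: NLTK's English stopword list includes "not", so the last sample in the output below loses its negation.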

In [9]:
import re
import nltk
from nltk.corpus import stopwords

# Fetch the stopword list; ignore download errors (e.g. offline) and rely on a cached copy
try:
    nltk.download('stopwords', quiet=True)
except Exception:
    pass

stop_words = set(stopwords.words('english'))

def clean_text(text):
    """Clean text by lowercasing, removing punctuation, and removing stopwords."""
    text = text.lower()                              # lowercase
    text = re.sub(r"[^a-z\s]", "", text)             # keep only letters and whitespace (drops punctuation and digits)
    words = text.split()
    words = [w for w in words if w not in stop_words] # remove stopwords
    return " ".join(words)

cleaned_texts = [clean_text(t) for t in texts]

print("Original vs Cleaned texts:")
print("-" * 50)
for orig, cleaned in zip(texts, cleaned_texts):
    print(f"Original: {orig}")
    print(f"Cleaned:  {cleaned}\n")

Original vs Cleaned texts:
--------------------------------------------------
Original: I love this movie!
Cleaned:  love movie

Original: This movie was terrible...
Cleaned:  movie terrible

Original: Amazing acting and great story
Cleaned:  amazing acting great story

Original: Worst movie ever
Cleaned:  worst movie ever

Original: I enjoyed the film a lot
Cleaned:  enjoyed film lot

Original: Not good, very boring
Cleaned:  good boring
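
The last sample shows a side effect of stopword removal: "Not good, very boring" is reduced to "good boring", which no longer carries the negation. One possible variation, sketched below, keeps negation words even when they appear in the stopword set. The small `BASIC_STOPWORDS` set and the function name `clean_text_keep_negation` are illustrative assumptions, not NLTK's list and not part of the notebook above.

```python
import re

# Hypothetical, hand-written stopword set for illustration only;
# NLTK's English list is much longer.
BASIC_STOPWORDS = {"i", "this", "was", "and", "the", "a", "very", "not", "no"}

# Negation words we choose to keep because they flip sentiment.
NEGATIONS = {"not", "no", "never"}

def clean_text_keep_negation(text, stop_words=BASIC_STOPWORDS):
    """Lowercase, strip non-letters, and drop stopwords except negations."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", "", text)
    kept = [w for w in text.split() if w not in stop_words or w in NEGATIONS]
    return " ".join(kept)

print(clean_text_keep_negation("Not good, very boring"))  # not good boring
```

With single-word features, "not" is still a separate token. Passing `ngram_range=(1, 2)` to `TfidfVectorizer` would additionally expose "not good" as its own feature.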



## 3️⃣ How is text turned into numbers?

Machine learning models operate on numbers, not raw strings.

**We use vectorization:**
- Each sentence → numerical vector
- Each dimension → one vocabulary word, valued by that word's weight in the sentence

**We'll use TF-IDF:**
- Words that appear in many sentences → lower weight
- Words that appear in few sentences → higher weight
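
To make the weighting concrete, here is a minimal sketch that recomputes TF-IDF by hand for the six cleaned sentences. It assumes scikit-learn's default settings for `TfidfVectorizer`: raw term counts, smoothed IDF of the form ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row. The helper names `idf` and `tfidf` are illustrative.

```python
import math
from collections import Counter

# The six cleaned sentences produced by the cleaning step.
docs = [
    "love movie",
    "movie terrible",
    "amazing acting great story",
    "worst movie ever",
    "enjoyed film lot",
    "good boring",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many sentences does each word appear?
df = Counter(word for tokens in tokenized for word in set(tokens))

def idf(word):
    # Smoothed IDF (assumed scikit-learn default): ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + df[word])) + 1

def tfidf(tokens):
    # Raw term count times IDF, then L2-normalize the vector.
    counts = Counter(tokens)
    weights = {w: c * idf(w) for w, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in weights.values()))
    return {w: v / norm for w, v in weights.items()}

first = tfidf(tokenized[0])
print({w: round(v, 4) for w, v in first.items()})
# {'love': 0.8222, 'movie': 0.5692}
```

"movie" occurs in three of the six sentences, so it receives a lower weight than "love", which occurs in only one. These values agree with the first row of the TF-IDF matrix printed by the notebook.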

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Convert cleaned text to numerical vectors using TF-IDF
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(cleaned_texts)

# Display the feature matrix
print("TF-IDF Feature Matrix:")
print(f"Shape: {X.shape} (samples x features)")
print(f"\nFeature names: {vectorizer.get_feature_names_out()}")
print(f"\nTF-IDF Matrix:\n{X.toarray()}")

TF-IDF Feature Matrix:
Shape: (6, 14) (samples x features)

Feature names: ['acting' 'amazing' 'boring' 'enjoyed' 'ever' 'film' 'good' 'great' 'lot'
 'love' 'movie' 'story' 'terrible' 'worst']

TF-IDF Matrix:
[[0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.82219037 0.56921261 0.
  0.         0.        ]
 [0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.56921261 0.
  0.82219037 0.        ]
 [0.5        0.5        0.         0.         0.         0.
  0.         0.5        0.         0.         0.         0.5
  0.         0.        ]
 [0.         0.         0.         0.         0.63509072 0.
  0.         0.         0.         0.         0.4396812  0.
  0.         0.63509072]
 [0.         0.         0.         0.57735027 0.         0.57735027
  0.         0.         0.57735027 0.         0.         0.
  0.         0.        ]
 [0.         0.         0.70710678 0.         0.         0.
  0.70710678 0.         0.         0.         0.         0.
  0.         0.        ]]

## 4️⃣ What model is used?

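With the text converted to a numeric feature matrix `X` and a list of labels, this is now a standard supervised binary classification problem.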

**We choose Logistic Regression because:**
- Simple
- Works well for text
- Easy to interpret

**The model learns:** one weight per word, indicating whether that word pushes the prediction toward positive or negative
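
As a rough illustration of how per-word weights turn into a prediction, the sketch below applies the logistic (sigmoid) function to a weighted sum of features. The weights are hypothetical numbers chosen for the example; a fitted scikit-learn model stores its learned values in `model.coef_` and `model.intercept_`.

```python
import math

# Hypothetical weights for illustration; not taken from a trained model.
weights = {"love": 1.2, "great": 0.9, "terrible": -1.4, "boring": -1.1}
bias = 0.0

def predict_proba_positive(features):
    """Return P(positive) = sigmoid(w . x + b) for a {word: value} vector."""
    score = bias + sum(weights.get(w, 0.0) * v for w, v in features.items())
    return 1.0 / (1.0 + math.exp(-score))

print(predict_proba_positive({"love": 0.82, "movie": 0.57}) > 0.5)      # True
print(predict_proba_positive({"movie": 0.57, "terrible": 0.82}) > 0.5)  # False
```

Words with positive weights raise the score and the probability of the positive class; words with negative weights lower it. Words with no weight contribute nothing.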

In [12]:
from sklearn.linear_model import LogisticRegression

# Train the model
model = LogisticRegression()
model.fit(X, labels)

print("Model trained successfully!")
print(f"Number of features: {len(model.coef_[0])}")

Model trained successfully!
Number of features: 14


## 5️⃣ How is performance measured?

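Ideally, we check how well the model predicts unseen text. Note that the cell below scores the model on the same six sentences it was trained on, so the result measures fit to the training data, not generalization.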

**Common metrics:**
- Accuracy
- Precision / Recall
- F1-score

Here we use accuracy for simplicity.
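
The metrics listed above can all be derived from counts of true positives, false positives, and false negatives. The following is a minimal sketch; the function name `binary_metrics` and the predictions in `y_pred` are made up for illustration, chosen so that the four metrics differ.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

y_true = [1, 0, 1, 0, 1, 0]   # same label layout as the sample data
y_pred = [1, 0, 1, 1, 1, 1]   # made-up predictions that over-predict "positive"
print(binary_metrics(y_true, y_pred))
```

Here every positive sentence is found (recall 1.0), but two negatives are also flagged as positive, which lowers precision to 0.6 and gives an F1 of 0.75. Accuracy alone (4 of 6 correct) would hide that imbalance.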

In [13]:
from sklearn.metrics import accuracy_score, classification_report

# Predict on the training data itself (there is no held-out set in this tiny example)
predictions = model.predict(X)
accuracy = accuracy_score(labels, predictions)

print("=" * 50)
print("MODEL PERFORMANCE")
print("=" * 50)
print(f"\nAccuracy: {accuracy:.2%}")
print("\nDetailed Classification Report:")
print(classification_report(labels, predictions, target_names=['Negative', 'Positive']))
print("\nPredictions vs Actual:")
print("-" * 50)
for text, pred, actual in zip(texts, predictions, labels):
    status = "✓" if pred == actual else "✗"
    print(f"{status} Text: {text[:40]:<40} | Predicted: {pred} | Actual: {actual}")

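==================================================
MODEL PERFORMANCE
==================================================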

Accuracy: 100.00%

Detailed Classification Report:
              precision    recall  f1-score   support

    Negative       1.00      1.00      1.00         3
    Positive       1.00      1.00      1.00         3

    accuracy                           1.00         6
   macro avg       1.00      1.00      1.00         6
weighted avg       1.00      1.00      1.00         6


Predictions vs Actual:
--------------------------------------------------
✓ Text: I love this movie!                       | Predicted: 1 | Actual: 1
✓ Text: This movie was terrible...               | Predicted: 0 | Actual: 0
✓ Text: Amazing acting and great story           | Predicted: 1 | Actual: 1
✓ Text: Worst movie ever                         | Predicted: 0 | Actual: 0
✓ Text: I enjoyed the film a lot                 | Predicted: 1 | Actual: 1
✓ Text: Not good, very boring                    | Predicted: 0 | Actual: 0


In [14]:
# Optional: Test with new text
test_texts = ["This film is fantastic!", "I hated every minute"]
test_cleaned = [clean_text(t) for t in test_texts]
test_X = vectorizer.transform(test_cleaned)
test_predictions = model.predict(test_X)

print("\n" + "=" * 50)
print("TESTING WITH NEW TEXTS")
print("=" * 50)
for text, pred in zip(test_texts, test_predictions):
    sentiment = "Positive" if pred == 1 else "Negative"
    print(f"Text: '{text}' → Prediction: {sentiment}")


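==================================================
TESTING WITH NEW TEXTS
==================================================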
Text: 'This film is fantastic!' → Prediction: Positive
Text: 'I hated every minute' → Prediction: Positive
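
The second prediction is wrong, and the vocabulary explains why. `vectorizer.transform` ignores any word it did not see during `fit`, so a sentence made up entirely of unseen words becomes an all-zero vector and the prediction rests on the model's intercept alone. The sketch below checks which words of each test sentence are in the 14-word vocabulary; the helper `known_and_unknown` is an illustrative name, and it splits on whitespace rather than reusing `clean_text`.

```python
# Vocabulary copied from the "Feature names" output of the TF-IDF step.
vocabulary = {
    "acting", "amazing", "boring", "enjoyed", "ever", "film", "good",
    "great", "lot", "love", "movie", "story", "terrible", "worst",
}

def known_and_unknown(text):
    """Split a sentence's lowercase words into in- and out-of-vocabulary."""
    words = [w.strip(".,!?'\"") for w in text.lower().split()]
    known = [w for w in words if w in vocabulary]
    unknown = [w for w in words if w and w not in vocabulary]
    return known, unknown

for sentence in ["This film is fantastic!", "I hated every minute"]:
    known, unknown = known_and_unknown(sentence)
    print(f"{sentence!r}: known={known}, unknown={unknown}")
```

Only "film" from the first sentence is known to the model, and it appeared in a positive training sample. Nothing in "I hated every minute" is known, so the model has no evidence for either class. A larger training set is the usual remedy.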
