# Magshimim Project

## Practical NLP with Classical ML + Pretrained Transformers

This notebook is a **portfolio-style NLP project** that demonstrates a complete workflow:

- Build a strong **baseline** sentiment classifier (TF‑IDF + Logistic Regression)
- Compare it with a **pretrained Transformer** sentiment model
- Use pretrained models for **translation** and **summarization**
- Combine multiple pretrained models into a small **customer-support assistant** (language detection + summarization + QA)

Everything is written to be **readable, reproducible, and GitHub-friendly** (clear structure, comments, and minimal noise).

## 0. Setup

Run the next cell once (Colab-friendly). If you're not on Colab and already have the packages installed, you can skip it.

In [None]:
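# Optional installs (recommended for Google Colab)
# If you're running locally and already have these packages, you can comment this out.
# sentencepiece is needed by the Marian translation tokenizer used in Section 5.

!pip -q install -U transformers evaluate scikit-learn nltk datasets sentencepiece

import nltk
nltk.download('movie_reviews')
nltk.download('punkt')
nltk.download('punkt_tab')  # needed by word_tokenize in recent NLTK releases
nltk.download('stopwords')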


## 1. Imports & Utilities

Small helper functions used throughout the project.

In [None]:
from __future__ import annotations

import re
from typing import Dict, List

import numpy as np
import pandas as pd

from nltk.corpus import movie_reviews, stopwords
from nltk.tokenize import word_tokenize

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, classification_report
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression


In [None]:
# Text preprocessing
STOP_WORDS = set(stopwords.words("english"))

def clean_text(text: str) -> str:
    """Lightweight text cleaning suitable for TF‑IDF baselines.

    Notes:
      - For Transformer pipelines we typically do *less* manual cleaning,
        because the tokenizer expects natural text.
    """
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)     # keep letters/spaces only
    text = re.sub(r"\s+", " ", text).strip()  # normalize whitespace
    return text

def tokenize_and_filter(text: str) -> List[str]:
    """Tokenize and remove stop-words. Used as TF‑IDF tokenizer."""
    tokens = word_tokenize(text)
    return [t for t in tokens if t not in STOP_WORDS and len(t) > 1]
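
To see what these two helpers do to a review, the sketch below applies the same regular expressions to a sample sentence. To stay self-contained it uses a tiny hypothetical stop-word set (`DEMO_STOP_WORDS`) and `str.split` in place of NLTK's stop-word list and `word_tokenize`, so the exact tokens may differ slightly from the real pipeline.

```python
import re

# Hypothetical stand-in for stopwords.words("english"); the real list is much longer.
DEMO_STOP_WORDS = {"i", "it", "the", "a", "was"}

def demo_clean(text: str) -> str:
    """Same cleaning steps as clean_text: lowercase, keep letters, squeeze spaces."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def demo_tokens(text: str) -> list:
    """Whitespace split (an assumption; the notebook uses word_tokenize)."""
    return [t for t in text.split() if t not in DEMO_STOP_WORDS and len(t) > 1]

sample = "I didn't LOVE it... 10/10!"
cleaned = demo_clean(sample)
print(cleaned)               # i didn t love it
print(demo_tokens(cleaned))  # ['didn', 'love']
```

Note that the negation in "didn't" is reduced to `didn` and the rating "10/10" disappears entirely. Aggressive cleaning like this trades some signal for a smaller vocabulary.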


## 2. Dataset: NLTK Movie Reviews

We use the classic **movie_reviews** dataset (binary sentiment: positive/negative). With 2,000 reviews split evenly between the two classes, it's small but well suited to demonstrating end-to-end modeling and evaluation.

In [None]:
def load_movie_reviews() -> pd.DataFrame:
    """Load NLTK movie_reviews into a DataFrame with columns: text, label."""
    rows = []
    for file_id in movie_reviews.fileids():
        label = movie_reviews.categories(file_id)[0]  # 'pos' or 'neg'
        words = movie_reviews.words(file_id)
        text = " ".join(words)
        rows.append((text, label))
    return pd.DataFrame(rows, columns=["text", "label"])

df = load_movie_reviews()
df.head()


In [None]:
# Quick sanity checks
print("Rows:", len(df))
print(df["label"].value_counts())


## 3. Baseline Model: TF‑IDF + Logistic Regression

This is a strong and interpretable baseline for text classification.

Why this baseline?
- TF‑IDF works well on small/medium datasets
- Logistic Regression is fast, stable, and provides a solid reference point
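
As a rough illustration of the TF‑IDF weighting, the sketch below scores terms in a three-document toy corpus. It uses the smoothed formula that scikit-learn's `TfidfVectorizer` applies by default, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization. The corpus and helper names are made up for the example.

```python
import math
from collections import Counter

# Toy corpus (hypothetical), already tokenized.
docs = [
    ["great", "movie", "great", "cast"],
    ["boring", "movie"],
    ["great", "fun"],
]

n_docs = len(docs)
# Document frequency: in how many documents each term appears.
df = Counter(term for doc in docs for term in set(doc))

def idf(term: str) -> float:
    """Smoothed inverse document frequency (scikit-learn's default form)."""
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf(doc: list) -> dict:
    """Raw term counts times idf, then L2-normalized to unit length."""
    counts = Counter(doc)
    weights = {t: c * idf(t) for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

for doc in docs:
    print({t: round(w, 3) for t, w in tfidf(doc).items()})
```

Terms that appear in fewer documents (here `boring`, `cast`, `fun`) receive a higher idf than terms spread across the corpus (`great`, `movie`).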

In [None]:
# Train/test split (stratified to preserve class balance)
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=42, stratify=df["label"]
)

baseline_clf: Pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(
        preprocessor=clean_text,
        tokenizer=tokenize_and_filter,
        token_pattern=None,         # unused with a custom tokenizer; None avoids a warning
        ngram_range=(1, 2),         # unigrams + bigrams often help sentiment
        min_df=2
    )),
    ("clf", LogisticRegression(max_iter=2000))
])

baseline_clf


In [None]:
# Train
baseline_clf.fit(X_train, y_train)

# Predict
y_pred = baseline_clf.predict(X_test)

# Evaluate
acc = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, pos_label="pos")

print(f"Baseline accuracy: {acc:.3f}")
print(f"Baseline F1 (pos): {f1:.3f}")
print("\nClassification report:\n")
print(classification_report(y_test, y_pred))
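
To make the reported numbers concrete, the sketch below recomputes accuracy and the positive-class F1 by hand on a small made-up set of labels. The labels are illustrative only, not results from the model above.

```python
# Hypothetical gold labels and predictions, for illustration only.
y_true = ["pos", "pos", "pos", "neg", "neg", "neg", "neg", "pos"]
y_hat = ["pos", "neg", "pos", "neg", "pos", "neg", "neg", "neg"]

def accuracy(gold: list, pred: list) -> float:
    """Fraction of predictions that match the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def f1_for(label: str, gold: list, pred: list) -> float:
    """F1 for one class: harmonic mean of its precision and recall."""
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)

print(f"accuracy: {accuracy(y_true, y_hat):.3f}")       # accuracy: 0.625
print(f"F1 (pos): {f1_for('pos', y_true, y_hat):.3f}")  # F1 (pos): 0.571
```

Here two of the four positive reviews are missed, so the positive-class F1 falls below the overall accuracy.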


## 4. Pretrained Transformer: Sentiment Pipeline

Next, we compare the baseline to a pretrained Transformer sentiment model via `transformers.pipeline("sentiment-analysis")`.

Notes:
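- The checkpoint (DistilBERT fine-tuned on SST-2, the pipeline default at the time of writing) was **not trained** on NLTK movie_reviews. SST-2 consists of short movie-review sentences, so full-length reviews are somewhat out of distribution.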
- It outputs labels like `POSITIVE/NEGATIVE`, so we map them to `pos/neg`.
- In a real project, you'd likely fine-tune a model on your dataset, but here we keep it lightweight.

In [None]:
from transformers import pipeline

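# Pin the checkpoint explicitly so results do not change if the library's default changes.
sentiment_pipe = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)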

def map_hf_sentiment_label(label: str) -> str:
    """Map HF pipeline labels to the dataset labels."""
    label = label.upper()
    if "POS" in label:
        return "pos"
    return "neg"
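
The substring check above silently maps anything without `POS` in it to `neg`, including labels such as `NEUTRAL` or `LABEL_1` that other checkpoints may emit. A stricter variant, sketched below, fails loudly on unexpected labels. `strict_map_label` and the `LABEL_MAP` table are hypothetical names, not part of the `transformers` API.

```python
# Explicit table of the labels we expect from the sentiment pipeline.
LABEL_MAP = {"POSITIVE": "pos", "NEGATIVE": "neg"}

def strict_map_label(label: str) -> str:
    """Map a pipeline label to 'pos'/'neg', raising on anything unexpected."""
    key = label.strip().upper()
    if key not in LABEL_MAP:
        raise ValueError(f"Unexpected sentiment label: {label!r}")
    return LABEL_MAP[key]

print(strict_map_label("POSITIVE"))  # pos
print(strict_map_label("negative"))  # neg

try:
    strict_map_label("NEUTRAL")
except ValueError as err:
    print(err)  # Unexpected sentiment label: 'NEUTRAL'
```

Failing early is usually preferable when swapping in a different checkpoint, since a three-class model would otherwise have its neutral predictions counted as negative.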


In [None]:
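# Run inference on a subset to keep runtime reasonable.
# You can increase 'max_samples' if you want a more complete comparison.
max_samples = 200

X_small = list(X_test[:max_samples])
y_small = list(y_test[:max_samples])

# Inputs longer than the model's maximum length are truncated,
# so only the beginning of a long review is scored.
hf_outputs = sentiment_pipe(X_small, truncation=True)
hf_pred = [map_hf_sentiment_label(o["label"]) for o in hf_outputs]

acc_hf = accuracy_score(y_small, hf_pred)
f1_hf = f1_score(y_small, hf_pred, pos_label="pos")

# Score the baseline on the same subset so the two numbers are directly comparable.
acc_base_small = accuracy_score(y_small, baseline_clf.predict(X_small))

print(f"Transformer (pipeline) accuracy on {len(X_small)} samples: {acc_hf:.3f}")
print(f"Transformer (pipeline) F1 (pos) on {len(X_small)} samples: {f1_hf:.3f}")
print(f"Baseline accuracy on the same {len(X_small)} samples: {acc_base_small:.3f}")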
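
With only a few hundred test samples, small accuracy differences may not be meaningful. As a rough guide, the sketch below computes a 95% normal-approximation confidence interval for an accuracy estimate. The accuracy value used is a placeholder, not a measured result, and `accuracy_interval` is a hypothetical helper.

```python
import math

def accuracy_interval(acc: float, n: int, z: float = 1.96) -> tuple:
    """Normal-approximation (Wald) interval for a proportion measured on n samples."""
    half_width = z * math.sqrt(acc * (1 - acc) / n)
    return max(0.0, acc - half_width), min(1.0, acc + half_width)

# Placeholder accuracy of 0.85, evaluated on subsets of different sizes.
for n in (200, 400, 2000):
    low, high = accuracy_interval(0.85, n)
    print(f"n={n}: [{low:.3f}, {high:.3f}]")
```

For an accuracy near 0.85 measured on 200 samples, the interval is roughly ±0.05, so a gap of a few points between two models on that subset should be read with caution.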


## 5. Translation with Pretrained Models

Here we demonstrate machine translation using a pretrained model.

Tip: translation quality varies a lot by domain; for serious evaluation you would compute BLEU or COMET against reference translations on a held-out test set.

In [None]:
from transformers import pipeline

# English → Hebrew translation (commonly used checkpoint)
en_he_translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-he")

examples = [
    "What time is it?",
    "This is a small demo of machine translation using Transformers.",
]

for s in examples:
    out = en_he_translator(s, max_length=128)[0]["translation_text"]
    print("EN:", s)
    print("HE:", out)
    print("-" * 60)
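
BLEU is built on clipped n-gram precision: the fraction of n-grams in the system output that also occur in a reference translation. The sketch below computes only that ingredient for a made-up English example. It is not a full BLEU implementation (no brevity penalty, no combination across n-gram orders), and for real evaluation an established library would be the safer choice.

```python
from collections import Counter

def ngrams(tokens: list, n: int) -> Counter:
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(candidate: list, reference: list, n: int) -> float:
    """Share of candidate n-grams found in the reference, with counts clipped."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total

# Hypothetical system output and reference translation.
candidate = "the cat sat on the mat".split()
reference = "the cat is on the mat".split()

print(clipped_precision(candidate, reference, 1))  # 5 of 6 unigrams match
print(clipped_precision(candidate, reference, 2))  # 3 of 5 bigrams match
```

Clipping keeps a candidate from being rewarded for repeating a matching word more often than the reference contains it.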


## 6. Summarization with Pretrained Models

We use a summarization model to compress long text into a short summary. This is useful as a building block in larger systems (e.g., triaging customer messages).

In [None]:
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

long_text = (
    "Transformers are a class of deep learning models that have become the standard approach "
    "for many natural language processing tasks. They rely on attention mechanisms to model "
    "relationships between tokens in a sequence. This allows them to capture long-range "
    "dependencies more effectively than earlier architectures, such as RNNs. Today, Transformers "
    "power applications like translation, summarization, question answering, and conversational assistants."
)

summary = summarizer(long_text, max_length=60, min_length=20, do_sample=False)[0]["summary_text"]
print(summary)
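
The summary above is only checked by eye. One common automatic check is ROUGE, which compares a generated summary with a human-written reference. The sketch below computes ROUGE-1 (unigram overlap) precision, recall, and F1 on whitespace tokens. The example strings are invented, and published ROUGE scores usually come from a dedicated package with its own tokenization and stemming.

```python
from collections import Counter

def rouge1(candidate: str, reference: str) -> dict:
    """Unigram-overlap precision, recall, and F1 on lowercased whitespace tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # multiset intersection
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

# Hypothetical generated summary and reference summary.
generated = "transformers use attention to model token relationships"
reference = "transformers rely on attention to model relationships between tokens"

print(rouge1(generated, reference))
```

In this example five unigrams are shared, giving a precision of 5/7 and a recall of 5/9. Because matching is on exact tokens, `token` and `tokens` do not count as a match.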


## 7. Mini Project: Customer-Support Assistant (Multi‑Model Pipeline)

This section shows how to combine multiple pretrained models into a small end-to-end feature.

Goal:
1. Detect language
2. (Optional) Summarize long questions
3. Answer questions using a QA model given an FAQ context

This is intentionally simple, but demonstrates **system design thinking** and model orchestration.

In [None]:
from transformers import pipeline

# Language detection
lang_detector = pipeline(
    "text-classification",
    model="papluca/xlm-roberta-base-language-detection"
)

# Extractive question answering (SQuAD-style)
qa_model = pipeline(
    "question-answering",
    model="deepset/roberta-base-squad2"
)

def detect_language(text: str) -> str:
    """Return the detector's language code (ISO 639-1 style, e.g. 'en')."""
    out = lang_detector(text, truncation=True)[0]
    return out["label"]

def summarize_if_needed(text: str, max_chars: int = 400) -> str:
    """Summarize long text to keep QA inputs manageable (uses the `summarizer` from Section 6)."""
    if len(text) <= max_chars:
        return text
    return summarizer(text, max_length=60, min_length=20, do_sample=False)[0]["summary_text"]

def answer_from_faq(question: str, faq_context: str) -> Dict[str, object]:
    """Answer a question using extractive QA over the FAQ context."""
    out = qa_model(question=question, context=faq_context)
    return {"answer": out.get("answer", ""), "score": float(out.get("score", 0.0))}


In [None]:
# Example FAQ data (toy dataset for demo)
faq_data = [
    {
        "question": "Where is my order?",
        "answer": "You can check the status of your order by visiting the 'My Orders' section of your account."
    },
    {
        "question": "How can I return a product?",
        "answer": "To return a product, please visit our Returns page, generate a return label, and ship the product back within 30 days."
    },
    {
        "question": "Do you ship internationally?",
        "answer": "Yes, we ship internationally. Shipping fees and delivery times vary by destination."
    },
    {
        "question": "How do I reset my password?",
        "answer": "Click 'Forgot password' on the login page and follow the instructions sent to your email."
    }
]

# Build a single context string for extractive QA
faq_context = "\n".join([f"Q: {x['question']}\nA: {x['answer']}" for x in faq_data])
print(faq_context[:400] + "...")


In [None]:
def customer_support_assistant(user_question: str) -> Dict[str, str]:
    """Simple end-to-end assistant.

    Strategy:
      - Only answer English questions in this demo (the QA model and FAQ are English-only).
      - Summarize long questions (optional).
      - Use extractive QA over an FAQ context.
    """
    lang = detect_language(user_question)

    if lang != "en":
        return {
            "language": lang,
            "answer": "Sorry — this demo currently supports English questions only.",
            "score": "N/A"
        }

    q = summarize_if_needed(user_question)
    out = answer_from_faq(q, faq_context)

    return {
        "language": lang,
        "answer": out["answer"],
        "score": f"{out['score']:.3f}"
    }


In [None]:
# Try it
tests = [
    "I forgot my password. How do I reset it?",
    "Do you ship outside the US?",
    "Where can I track my order?"
]

for t in tests:
    result = customer_support_assistant(t)
    print("Q:", t)
    print("->", result)
    print("-" * 60)
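
An extractive QA model may return a low-confidence span even when the context does not really contain an answer, so an assistant like this would usually gate on the score. A minimal sketch of that gate follows. The 0.30 threshold, the fallback message, and `gate_answer` are illustrative assumptions, and the input dicts mimic the `answer`/`score` shape returned by `answer_from_faq`.

```python
from typing import Dict

# Hypothetical fallback message for low-confidence answers.
FALLBACK = "I'm not sure. Let me connect you with a human agent."

def gate_answer(qa_result: Dict[str, object], min_score: float = 0.30) -> str:
    """Return the QA answer only when its score clears the threshold."""
    answer = str(qa_result.get("answer", "")).strip()
    score = float(qa_result.get("score", 0.0))
    if not answer or score < min_score:
        return FALLBACK
    return answer

# Hand-written results in the same shape as answer_from_faq's output.
print(gate_answer({"answer": "within 30 days", "score": 0.82}))  # within 30 days
print(gate_answer({"answer": "My Orders", "score": 0.04}))       # falls back
```

A sensible threshold depends on the model and the FAQ, so it would need tuning on real questions rather than being fixed in advance.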


## 8. Notes, Limitations, and Next Steps

**What this project shows well:**
- Clean baseline modeling (TF‑IDF + Logistic Regression)
- Practical use of pretrained Transformers (sentiment, translation, summarization, QA)
- A small but realistic multi-model pipeline

**Limitations (intentional for a lightweight demo):**
- No fine-tuning (would improve results on the movie_reviews dataset)
- Translation and summarization are not scored against reference outputs here

**Easy upgrades:**
- Fine-tune a Transformer (e.g., DistilBERT) for sentiment on movie_reviews
- Add proper evaluation for translation (BLEU/COMET) and summarization (ROUGE)
- Support multilingual QA by translating non-English questions into English before QA
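
The last upgrade can be sketched as a thin routing layer in front of the existing assistant. To keep the example runnable without downloading models, the detector, translator, and QA step are passed in as plain callables and replaced with toy stand-ins. In the notebook they would wrap `detect_language`, a to-English translation pipeline, and `answer_from_faq`. All names below are hypothetical.

```python
from typing import Callable, Dict

def multilingual_answer(
    question: str,
    detect: Callable[[str], str],
    translate_to_en: Callable[[str, str], str],
    answer_en: Callable[[str], str],
) -> Dict[str, str]:
    """Detect the language, translate to English if needed, then run English QA."""
    lang = detect(question)
    question_en = question if lang == "en" else translate_to_en(question, lang)
    return {"language": lang, "question_en": question_en, "answer": answer_en(question_en)}

# Toy stand-ins so the sketch runs offline (not real models).
def fake_detect(text: str) -> str:
    return "es" if text.startswith("¿") else "en"

def fake_translate(text: str, lang: str) -> str:
    lookup = {"¿Hacen envíos internacionales?": "Do you ship internationally?"}
    return lookup.get(text, text)

def fake_answer(question_en: str) -> str:
    return "Yes, we ship internationally." if "ship" in question_en else "unknown"

result = multilingual_answer(
    "¿Hacen envíos internacionales?", fake_detect, fake_translate, fake_answer
)
print(result)
```

Passing the models in as callables keeps the routing logic testable on its own, and the answer could be translated back into the user's language as a further step.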
