### 📝 Homework: Building Your First NLP Pipeline

**Theory & Elaboration**
This homework is your capstone for the week. You'll combine the concepts we've discussed into a single, functional **NLP Pipeline**. Preprocessing text this way is one of the most common first steps in Natural Language Processing.

* **The NLP Pipeline (An Assembly Line for Text) 🏭**: Think of a pipeline as an assembly line for raw text. Messy, unstructured sentences go in one end, and each step on the line performs a specific, targeted transformation. The output of one step becomes the input for the next, until clean, structured, and useful data emerges at the end. The quality of your pipeline directly impacts the quality of your final analysis.

* **The Goal: Preparing for a "Bag of Words"**: The ultimate goal of this cleaning process is to prepare our text for a machine learning model. The output of your pipeline, a clean list of normalized tokens (stems or lemmas), is exactly the input expected by a foundational model called the **Bag of Words (BoW)**.
    * **Analogy**: Imagine you take a book, tear out all the pages, cut out every single word, and throw them all into a giant bag. You then shake the bag, disregarding all grammar and sentence order. Finally, you create a frequency count of every unique word. That's a Bag of Words!
    * This simple frequency model is surprisingly powerful and is the basis for tasks like **document classification** (e.g., spam vs. not spam) and **sentiment analysis**. Your pipeline is the essential first step to creating this model.

* **Why Order Matters**: The sequence of steps in your pipeline is a critical design choice. A logical order is: `Lowercase -> Tokenize -> Filter -> Normalize`. Lowercasing *before* filtering ensures, for example, that "The" matches the entry "the" in an all-lowercase stop word list.
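
To make the ordering point and the Bag of Words idea concrete, here is a minimal sketch in plain Python. The tiny stop word set, the `split()`-based tokenizer, and the function names are illustrative assumptions, not the NLTK tools the homework asks for.

```python
from collections import Counter

# Hypothetical, tiny stop word list (all lowercase), for illustration only.
TOY_STOP_WORDS = {"the", "is", "a", "on"}

def filter_then_lowercase(text):
    """Wrong order: stop words are filtered before lowercasing."""
    tokens = text.split()
    kept = [t for t in tokens if t not in TOY_STOP_WORDS]
    return [t.lower() for t in kept]

def lowercase_then_filter(text):
    """Suggested order: lowercase first, then filter stop words."""
    tokens = text.lower().split()
    return [t for t in tokens if t not in TOY_STOP_WORDS]

sentence = "The cat sat on the mat and The dog sat too"

wrong = filter_then_lowercase(sentence)   # capitalized "The" slips past the filter
right = lowercase_then_filter(sentence)   # every form of "the" is removed

# A Bag of Words is a frequency count that ignores word order.
bag = Counter(right)

print("Filter first:   ", wrong)
print("Lowercase first:", right)
print("Bag of Words:   ", dict(bag))
```

In the first function the capitalized "The" never matches the lowercase stop list, so it survives filtering and only becomes "the" afterwards. The `Counter` at the end is the whole Bag of Words: a mapping from each remaining token to how often it occurs.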

**Your Challenge**
Your task is to build a function, `preprocess_text`, that accepts a raw string of text. It must perform a series of normalization steps and return a list of cleaned, stemmed tokens, ready for a Bag of Words model.

**Requirements & Hints:**
* Your final list of tokens should be **lowercased**.
* It should **not** contain any **punctuation**.
* It should **not** contain any **stop words**.
* Each word in the final list should be **stemmed** to its root form.
* **Tools**: You will need `word_tokenize` (from `nltk.tokenize`), `stopwords` (from `nltk.corpus`), and Python's built-in `string` module. You'll also need to create an instance of `PorterStemmer` (from `nltk.stem`).
* **Advanced Tip**: For a more elegant solution, see if you can perform the filtering and stemming steps inside a single list comprehension.
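
The advanced tip can be illustrated on a toy version of the problem. This sketch assumes a hypothetical `toy_stem` helper and a made-up stop word set standing in for NLTK's `PorterStemmer` and stop word list; it only shows how a loop that filters and then transforms collapses into one list comprehension.

```python
import string

# Hypothetical stand-ins for the NLTK tools, so the sketch runs on its own.
TOY_STOP_WORDS = {"the", "are", "over"}
PUNCTUATION = set(string.punctuation)

def toy_stem(word):
    """Crude suffix stripper for illustration; this is NOT the Porter algorithm."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

tokens = ["the", "foxes", "are", "jumping", "over", "logs", "."]

# Version 1: explicit loop that filters, then transforms.
loop_result = []
for tok in tokens:
    if tok not in TOY_STOP_WORDS and tok not in PUNCTUATION:
        loop_result.append(toy_stem(tok))

# Version 2: the same logic as a single list comprehension.
comp_result = [
    toy_stem(tok)
    for tok in tokens
    if tok not in TOY_STOP_WORDS and tok not in PUNCTUATION
]

print(loop_result)
print(comp_result)
```

Both versions produce the same list. The comprehension reads as "transform each token that passes the filter", which is the shape your real solution can take once the toy helpers are replaced by the NLTK objects.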

**Bonus Challenge 🌟**
Create a second function called `preprocess_with_lemma` that performs the same cleaning steps but uses **spaCy's lemmatization** instead of NLTK's stemming for the final normalization step.
* **Hint**: You'll need to process the text with a spaCy `nlp` object to access the `.lemma_` attribute of each token. How can you integrate this with your existing filtering logic for stop words and punctuation?

In [7]:
# TODO: Import all necessary libraries and functions.
# You'll need tools for tokenization, stop words, and stemming from NLTK.
# You'll also need the built-in 'string' library and 'spacy'.

import nltk
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
import spacy

nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')

# --- Setup your tools ---

# TODO: Get the set of English stop words.
stop_words = set(stopwords.words('english'))

# TODO: Get the set of all punctuation characters.
punctuations = set(string.punctuation)

# TODO: Create an instance of the PorterStemmer.
ps = PorterStemmer()

# TODO: Load a spaCy model for the bonus challenge (e.g., 'en_core_web_sm').
nlp = spacy.load("en_core_web_sm")


# Main Task Function
def preprocess_text(text):
    """
    This function takes raw text and returns a list of cleaned, stemmed tokens.
    """
    # TODO: 1. Convert the text to lowercase and tokenize it.
    text = text.lower()
    tokens = word_tokenize(text)

    # TODO: 2. Filter out stop words and punctuation, then stem the remaining words.
    # isalpha() drops punctuation tokens, and also numbers and contraction pieces such as "'ll".
    cleaned_tokens = [
        ps.stem(word) for word in tokens
        if word.isalpha() and word not in stop_words
    ]

    return cleaned_tokens



# Bonus Challenge Function
def preprocess_with_lemma(text):
    """
    This function takes raw text and returns a list of cleaned, lemmatized tokens.
    """
    # TODO: 1. Process the text with the loaded spaCy 'nlp' object.
    doc = nlp(text.lower())

    # TODO: 2. Loop through the tokens in the 'doc'. For each token, if it's not a stop word or punctuation, get its lowercase lemma.
    cleaned_tokens_lemma = [
        token.lemma_ for token in doc
        if token.is_alpha and not token.is_stop
    ]

    return cleaned_tokens_lemma


# --- Testing your functions ---
test_text = "Data Science is an amazing field! But, you'll need to clean your data first before analysis."

# Test the main function
processed_tokens = preprocess_text(test_text)

print("\nOriginal Text:", test_text)
print("\n--- Main Task (Stemming) ---")
print("Processed Tokens:", processed_tokens)

# Test the bonus function
processed_lemma_tokens = preprocess_with_lemma(test_text)
print("\n--- Bonus Challenge (Lemmatization) ---")
print("Processed Tokens:", processed_lemma_tokens)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!



Original Text: Data Science is an amazing field! But, you'll need to clean your data first before analysis.

--- Main Task (Stemming) ---
Processed Tokens: ['data', 'scienc', 'amaz', 'field', 'need', 'clean', 'data', 'first', 'analysi']

--- Bonus Challenge (Lemmatization) ---
Processed Tokens: ['data', 'science', 'amazing', 'field', 'need', 'clean', 'datum', 'analysis']


### ✅ Self-Assessment

Run the cell below to check your work. This script will evaluate the key exercises and the final homework to provide a score and detailed feedback. Make sure you have run all the cells above this one first.

In [9]:
#@title Run this cell to check your work
from IPython.display import display, Markdown

def check_week5_nlp_tasks():
    """Checks the student's work for Week 5 and provides feedback."""
    score = 0
    bonus_score = 0
    # Only the core task counts toward the total; the bonus is scored separately.
    total_points = 1
    feedback = []

    # --- Check 1: Homework (Stemming Pipeline) ---
    try:
        func_exists = 'preprocess_text' in globals()
        if func_exists:
            test_sentence = "The quick brown foxes are jumping over the lazy dogs."
            correct_output = ['quick', 'brown', 'fox', 'jump', 'lazi', 'dog']
            student_output = preprocess_text(test_sentence)
            if student_output == correct_output:
                score += 1
                feedback.append("- ✅ **Homework (Stemming):** Passed. The preprocessing pipeline function works correctly.")
            else:
                feedback.append(f"- ❌ **Homework (Stemming):** Needs Revision. Your function did not produce the correct output. For the test sentence, expected `{correct_output}` but your function returned `{student_output}`.")
        else:
            feedback.append("- ❌ **Homework (Stemming):** Failed. The function `preprocess_text` was not found.")
    except Exception as e:
        feedback.append(f"- ❌ **Homework (Stemming):** An error occurred: {e}")

    # --- Bonus Check: (Lemmatization Pipeline) ---
    try:
        func_exists = 'preprocess_with_lemma' in globals()
        if func_exists:
            test_sentence = "The quick brown foxes are jumping over the lazy dogs."
            correct_output = ['quick', 'brown', 'fox', 'jump', 'lazy', 'dog']
            student_output = preprocess_with_lemma(test_sentence)
            if student_output == correct_output:
                bonus_score += 1
                feedback.append("- 🌟 **Bonus (Lemmatization):** Passed! The lemmatization pipeline works correctly. Excellent work!")
            else:
                feedback.append(f"- ⚠️ **Bonus (Lemmatization):** Needs Revision. Found the function, but the output is incorrect. Expected `{correct_output}` but got `{student_output}`.")
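    except Exception as e:
        # An unattempted bonus never reaches this point, so an exception here
        # means the function exists but failed; report it without affecting the core score.
        feedback.append(f"- ⚠️ **Bonus (Lemmatization):** An error occurred: {e}")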

    # --- Final Feedback ---
    final_message = "## **Homework Self-Assessment Feedback**\n\n" + "\n\n".join(feedback)
    final_message += f"\n\n### **Final Score: {score}/{total_points}**"
    if bonus_score > 0:
        final_message += f" (plus **{bonus_score}** bonus point! 🎉)"

    if score == total_points:
        final_message += "\n\nGreat job! All core tasks passed."
    else:
        final_message += "\n\nSome tasks need revision. Please review the feedback above."

    display(Markdown(final_message))

# Run the self-assessment tool
check_week5_nlp_tasks()

## **Homework Self-Assessment Feedback**

- ✅ **Homework (Stemming):** Passed. The preprocessing pipeline function works correctly.

- 🌟 **Bonus (Lemmatization):** Passed! The lemmatization pipeline works correctly. Excellent work!

### **Final Score: 1/1** (plus **1** bonus point! 🎉)

Great job! All core tasks passed.