# Part 1: Text Classification

### 1.1: Preparing Both Manual and Sourced Datasets


*   **Function Description**: This block serves as the complete data preparation pipeline for our experiment. It prepares two independent sets of data. First, it defines and splits a small, manually-created dataset of 20 samples. Second, it mounts Google Drive, loads the large, externally-sourced `emails.csv` dataset, processes it into the required format, and splits it. The purpose is to create two distinct pairs of training and testing sets (`manual_train_set`/`manual_test_set` and `sourced_train_set`/`sourced_test_set`) that will be used to train and evaluate two separate models in the subsequent blocks.

*   **Data Source References**:
    1.  **Manually Created Data**: A custom set of 20 email-style messages (10 spam, 10 ham) created to fulfill the manual labeling requirement of the assignment.
    2.  **Independently Sourced Data**:
        *   **Dataset Name**: `emails.csv`
        *   **Source**: User's personal Google Drive.
        *   **Location**: `My Drive/ITC508_data/`
        *   **Description**: A CSV file containing thousands of emails. The columns `Text` and `Spam` represent the email content and its classification label, respectively.

*   **Syntax Explanation**:
    *   `manual_data = [...]`: A hard-coded Python list containing 20 tuples, where each tuple holds a text string and its corresponding annotation dictionary.
    *   `manual_train_set = manual_data[:16]`: This list slice creates the training set for the manual data, taking the first 16 samples (80% of 20).
    *   `drive.mount(...)`: A function from the `google.colab` library that establishes a connection to the user's Google Drive, making its contents accessible.
    *   `pd.read_csv(file_path)`: A pandas function that loads the `emails.csv` file from the specified Google Drive path into a DataFrame.
    *   `df.iterrows()`: A pandas method to iterate over each row of the DataFrame, allowing access to the `Text` and `Spam` columns for each email.

*   **Inputs**:
    1.  The hard-coded `manual_data` list.
    2.  The `emails.csv` file located in the user's Google Drive at `My Drive/ITC508_data/` (mounted as `/content/drive/My Drive/ITC508_data/`).
    3.  User authorization is required via a pop-up to mount Google Drive.

*   **Outputs**:
    1.  `manual_train_set`: A list of 16 manually-created samples for training the first model.
    2.  `manual_test_set`: A list of 4 manually-created samples for testing the first model.
    3.  `sourced_train_set`: A list containing 80% of the samples from `emails.csv` for training the second model.
    4.  `sourced_test_set`: A list containing 20% of the samples from `emails.csv` for testing the second model.

*   **Code Flow**: The code executes in two distinct parts. First, it defines the `manual_data` list, shuffles it, and splits it into training and testing sets. Second, it connects to Google Drive, loads, processes, shuffles, and splits the `sourced_data` into its own training and testing sets. The block concludes by making all four data variables available for the next steps.

*   **Comments and Observations**: In this initial setup, I prepared two distinct datasets for training a spam detector. First, I created a small, manual list of 20 clear-cut spam and ham examples so the model learns from some textbook cases; I then shuffled it and split it into a 16-sample training set and a 4-sample testing set. The more significant part involved mounting my Google Drive to load the larger `emails.csv` file with the pandas library. I looped through this sourced data, converting its binary spam labels (1 or 0) into the same `{"cats": {...}}` dictionary format as my manual data, so that both datasets can be fed to the same spaCy training code later. After keeping only rows whose text is a valid string, I performed a seeded, reproducible shuffle on this larger dataset and applied the same 80/20 split, which yields 4,582 training and 1,146 testing samples.
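
The conversion and split described above can be sketched without Google Drive or pandas. This is a minimal illustration, not the notebook's actual cell: `to_textcat_examples`, `split_80_20`, and `toy_rows` are hypothetical names, and the rows are plain dictionaries of the kind `df.to_dict("records")` would produce.

```python
import random

def to_textcat_examples(rows, text_key="Text", label_key="Spam"):
    """Convert row dicts into (text, {"cats": {...}}) tuples, skipping non-string text."""
    examples = []
    for row in rows:
        text = row[text_key]
        if not isinstance(text, str):
            continue  # e.g. a missing value that was read as None or NaN
        is_spam = int(row[label_key]) == 1
        cats = {"SPAM": 1.0 if is_spam else 0.0, "HAM": 0.0 if is_spam else 1.0}
        examples.append((text, {"cats": cats}))
    return examples

def split_80_20(examples, seed=42):
    """Shuffle a copy with a private, seeded RNG and cut it at the 80% mark."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    split_point = int(len(shuffled) * 0.8)
    return shuffled[:split_point], shuffled[split_point:]

# Hypothetical toy rows standing in for emails.csv
toy_rows = [
    {"Text": "Win cash now", "Spam": 1},
    {"Text": "Lunch at noon?", "Spam": 0},
    {"Text": None, "Spam": 1},
    {"Text": "Free prize inside", "Spam": 1},
    {"Text": "See attached report", "Spam": 0},
    {"Text": "Call me later", "Spam": 0},
]
toy_examples = to_textcat_examples(toy_rows)
toy_train, toy_test = split_80_20(toy_examples)
print(len(toy_examples), len(toy_train), len(toy_test))
```

Using a private `random.Random(seed)` instance keeps the split reproducible regardless of what other code has done to the global random state.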

In [None]:
import pandas as pd
import random
import os
from google.colab import drive

# --- Part 1: Prepare Manually Labeled Dataset ---
manual_data = [
    # -- SPAM Examples (10) --
    ("Claim your free prize now!", {"cats": {"SPAM": 1.0, "HAM": 0.0}}),
    ("URGENT: Your account needs immediate attention", {"cats": {"SPAM": 1.0, "HAM": 0.0}}),
    ("Click here to win a new iPhone", {"cats": {"SPAM": 1.0, "HAM": 0.0}}),
    ("Exclusive offer just for you, limited time only", {"cats": {"SPAM": 1.0, "HAM": 0.0}}),
    ("Congratulations! You have been selected for a special reward", {"cats": {"SPAM": 1.0, "HAM": 0.0}}),
    ("$$$ Make money fast with this simple trick", {"cats": {"SPAM": 1.0, "HAM": 0.0}}),
    ("Your payment is overdue, please update your details", {"cats": {"SPAM": 1.0, "HAM": 0.0}}),
    ("Meet local singles tonight", {"cats": {"SPAM": 1.0, "HAM": 0.0}}),
    ("You've won the lottery, click to claim", {"cats": {"SPAM": 1.0, "HAM": 0.0}}),
    ("We need to verify your bank account information", {"cats": {"SPAM": 1.0, "HAM": 0.0}}),

    # -- HAM (Not Spam) Examples (10) --
    ("Hello, are we still on for the meeting tomorrow?", {"cats": {"SPAM": 0.0, "HAM": 1.0}}),
    ("Here is the report you requested", {"cats": {"SPAM": 0.0, "HAM": 1.0}}),
    ("Can you please review this document?", {"cats": {"SPAM": 0.0, "HAM": 1.0}}),
    ("Thanks for your email, I will get back to you shortly", {"cats": {"SPAM": 0.0, "HAM": 1.0}}),
    ("Let's catch up for lunch next week", {"cats": {"SPAM": 0.0, "HAM": 1.0}}),
    ("Attached is the new project schedule", {"cats": {"SPAM": 0.0, "HAM": 1.0}}),
    ("What time is the team call today?", {"cats": {"SPAM": 0.0, "HAM": 1.0}}),
    ("See you at the conference!", {"cats": {"SPAM": 0.0, "HAM": 1.0}}),
    ("Your Amazon order has shipped", {"cats": {"SPAM": 0.0, "HAM": 1.0}}),
    ("I'm running about 10 minutes late, apologies", {"cats": {"SPAM": 0.0, "HAM": 1.0}})
]
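# NOTE: this shuffle runs before random.seed(42) is set below, so the 16/4 split can differ between runs.
random.shuffle(manual_data)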
manual_train_set = manual_data[:16] # 80% of 20 is 16
manual_test_set = manual_data[16:]  # The remaining 4
print(f"Manual data prepared: {len(manual_train_set)} training samples, {len(manual_test_set)} testing samples.")
print("-" * 30)


# --- Part 2: Prepare Sourced Dataset from Google Drive ---
drive.mount('/content/drive', force_remount=True)
file_name = 'emails.csv'
file_path = f'/content/drive/My Drive/ITC508_data/{file_name}'
df = pd.read_csv(file_path)

sourced_data = []
for index, row in df.iterrows():
    text = row['Text']
    label = row['Spam']
    if label == 1:
        annotations = {"cats": {"SPAM": 1.0, "HAM": 0.0}}
    else:
        annotations = {"cats": {"SPAM": 0.0, "HAM": 1.0}}
    if isinstance(text, str):
        sourced_data.append((text, annotations))

random.seed(42)
random.shuffle(sourced_data)
split_point = int(len(sourced_data) * 0.8)
sourced_train_set = sourced_data[:split_point]
sourced_test_set = sourced_data[split_point:]
print(f"Sourced data prepared: {len(sourced_train_set)} training samples, {len(sourced_test_set)} testing samples.")

Manual data prepared: 16 training samples, 4 testing samples.
------------------------------
Mounted at /content/drive
Sourced data prepared: 4582 training samples, 1146 testing samples.


### 1.2: Model Trained on Manual Data


*   **Function Description**: This block focuses entirely on the manually-created dataset. It initializes a new spaCy model (`nlp_manual`), trains it exclusively on the 16 samples in `manual_train_set`, and then evaluates its performance on the 4 unseen samples in `manual_test_set`. The core training logic is taken directly from the professor's sample code. Finally, it provides an interactive loop for the user to test this specific, minimally-trained model with their own inputs.

*   **Syntax Explanation**:
    *   **`# --- PROVIDED SAMPLE CODE ...`**: These comments clearly mark the sections of code that directly implement the logic from the professor's provided snippet.
    *   `nlp_manual = spacy.blank("en")`: Creates a new, empty English language model.
    *   `nlp_manual.add_pipe("textcat")`: Adds the text classification component to the model.
    *   `nlp_manual.initialize()`: Initializes the model's weights for training and returns the optimizer that is later passed to `update()`.
    *   `nlp_manual.update([example], sgd=optimizer)`: The core training step from the sample code. It shows the model one example and adjusts its weights to reduce error.
    *   `for i in range(10):`: This is an added outer loop (an "epoch" loop). It runs the training process 10 times over the small dataset to give the model a better chance to learn.
    *   `classification_report(...)`: An added function from scikit-learn that calculates detailed performance metrics (precision, recall, F1-score), as required by the assignment.
    *   `while True:`: The interactive testing loop from the sample code, allowing for live classification of user input.

*   **Inputs**:
    1.  `manual_train_set`: The list of 16 training examples from Block 1.
    2.  `manual_test_set`: The list of 4 testing examples from Block 1.
    3.  `user_input`: Text entered by the user during the interactive testing phase.

*   **Outputs**:
    1.  A printed log showing the start and end of the training process.
    2.  An accuracy score and a detailed Classification Report for the model's performance on the manual test set.
    3.  An interactive prompt that takes user input and prints the model's classification (`SPAM` or `HAM`).

*   **Code Flow**: The code follows a logical sequence: 1. A new model is created and configured. 2. The model is trained using the small `manual_train_set`. 3. The trained model's performance is quantitatively evaluated using the `manual_test_set`. 4. The code enters an interactive loop, allowing the user to qualitatively test the model's behavior.

*   **Comments and Observations**: For this next step, I used the spaCy library to build and train a text classification model from scratch. I configured a blank English model, added the "SPAM" and "HAM" labels, and then trained it only on the 16 manually created examples from the first part. Because this is a very small dataset, I ran the training process for 10 epochs, hoping the repetition would help solidify the learning. After training, I evaluated its performance on the 4 unseen test samples, adding code from scikit-learn to calculate and print the accuracy and a more detailed classification report, which shows how well the model identifies each category. With only 4 test samples, each prediction moves the accuracy by 25 percentage points, so these scores are only a rough indication. The most interesting part was the interactive testing loop at the end, which allowed me to input my own email text and see how this minimally-trained model would classify it in real time, giving me a direct way to gauge its effectiveness.
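
Because the test set holds only 4 samples, a plain shuffle can leave the classes uneven (the report for this block shows 3 HAM and 1 SPAM). One possible alternative, shown here only as a sketch, is to split each class separately so the test set keeps the 50/50 balance. `stratified_split` and `demo_data` are hypothetical names that are not part of the notebook; the data format is assumed to be the same `(text, annotations)` tuples used above.

```python
import random

def stratified_split(data, train_fraction=0.8, seed=42):
    """Split (text, annotations) pairs so each class keeps the same train/test ratio."""
    rng = random.Random(seed)
    by_label = {"SPAM": [], "HAM": []}
    for text, annotations in data:
        label = "SPAM" if annotations["cats"]["SPAM"] == 1.0 else "HAM"
        by_label[label].append((text, annotations))

    train, test = [], []
    for items in by_label.values():
        rng.shuffle(items)
        cut = int(len(items) * train_fraction)
        train.extend(items[:cut])
        test.extend(items[cut:])

    # Shuffle again so the classes are interleaved rather than grouped
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test

# Hypothetical stand-in for manual_data: 10 spam and 10 ham messages
demo_data = (
    [(f"spam message {i}", {"cats": {"SPAM": 1.0, "HAM": 0.0}}) for i in range(10)]
    + [(f"ham message {i}", {"cats": {"SPAM": 0.0, "HAM": 1.0}}) for i in range(10)]
)
demo_train, demo_test = stratified_split(demo_data)
spam_in_test = sum(1 for _, ann in demo_test if ann["cats"]["SPAM"] == 1.0)
print(len(demo_train), len(demo_test), spam_in_test)
```

With 10 examples per class and an 80/20 ratio, each class contributes 8 training and 2 testing samples.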

In [None]:
import spacy
from spacy.training import Example
from sklearn.metrics import accuracy_score, classification_report
import random

# --- Model 1: Trained on MANUAL Data Only ---

# --- PROVIDED SAMPLE CODE STARTS HERE (Initialization) ---
# Load a blank model and add text classifier
nlp_manual = spacy.blank("en")
textcat_manual = nlp_manual.add_pipe("textcat")

# Add labels for classification
textcat_manual.add_label("SPAM")
textcat_manual.add_label("HAM")

# Training the model
optimizer = nlp_manual.initialize()
# --- PROVIDED SAMPLE CODE ENDS HERE (Initialization) ---

print("--- Training Model 1 on 16 MANUAL examples ---")
# NOTE: The outer loop for epochs is an addition to improve performance.
for i in range(10):
    random.shuffle(manual_train_set)
    # --- PROVIDED SAMPLE CODE STARTS HERE (Core Training Loop) ---
    for text, annotations in manual_train_set:
        example = Example.from_dict(nlp_manual.make_doc(text), annotations)
        nlp_manual.update([example], sgd=optimizer)
    # --- PROVIDED SAMPLE CODE ENDS HERE (Core Training Loop) ---
print("Training of manual model complete.")

# --- Evaluation Metrics (ADDITION as per assignment requirements) ---
print("\n--- Evaluating Model 1 on 4 UNSEEN MANUAL examples ---")
true_labels = []
predicted_labels = []
for text, annotations in manual_test_set:
    true_label = "SPAM" if annotations['cats']['SPAM'] == 1.0 else "HAM"
    true_labels.append(true_label)
    doc = nlp_manual(text) # Using the trained model to predict
    predicted_label = "SPAM" if doc.cats['SPAM'] > doc.cats['HAM'] else "HAM"
    predicted_labels.append(predicted_label)
accuracy = accuracy_score(true_labels, predicted_labels)
print(f"\nModel Accuracy on Manual Test Set: {accuracy * 100:.2f}%")
print("\nClassification Report (Manual Model):")
print(classification_report(true_labels, predicted_labels, zero_division=0))
# --- End of Evaluation Metrics Addition ---


# --- PROVIDED SAMPLE CODE STARTS HERE (Interactive Testing) ---
# Function to classify user input emails
def classify_email_manual(email):
    doc = nlp_manual(email)
    spam_score = doc.cats['SPAM']
    ham_score = doc.cats['HAM']
    if spam_score > ham_score:
        return "SPAM"
    else:
        return "HAM"

# Allow users to test the model by inputting their own email data
while True:
    user_input = input("\nTest the MANUAL Model: Enter an email (or type 'exit' to quit): ")
    if user_input.lower() == 'exit':
        break
    classification = classify_email_manual(user_input)
    print(f"The email is classified as: {classification}")
# --- PROVIDED SAMPLE CODE ENDS HERE (Interactive Testing) ---

--- Training Model 1 on 16 MANUAL examples ---
Training of manual model complete.

--- Evaluating Model 1 on 4 UNSEEN MANUAL examples ---

Model Accuracy on Manual Test Set: 75.00%

Classification Report (Manual Model):
              precision    recall  f1-score   support

         HAM       1.00      0.67      0.80         3
        SPAM       0.50      1.00      0.67         1

    accuracy                           0.75         4
   macro avg       0.75      0.83      0.73         4
weighted avg       0.88      0.75      0.77         4


Test the MANUAL Model: Enter an email (or type 'exit' to quit): YOU WON BIG TIME
The email is classified as: SPAM

Test the MANUAL Model: Enter an email (or type 'exit' to quit): Great
The email is classified as: SPAM

Test the MANUAL Model: Enter an email (or type 'exit' to quit): meeting tommorow
The email is classified as: HAM

Test the MANUAL Model: Enter an email (or type 'exit' to quit): exit


### 1.3: Model Trained on Sourced Data


*   **Function Description**: This block mirrors the structure of Block 2 but operates on the large, sourced dataset. It initializes a second, completely independent spaCy model (`nlp_sourced`), trains it on the thousands of email samples in `sourced_train_set`, and evaluates it on the large `sourced_test_set`. It uses the same core training logic from the professor's sample code. The goal is to build a high-performance model and compare its results directly to the baseline model from Block 2.

*   **Syntax Explanation**:
    *   **`# --- PROVIDED SAMPLE CODE ...`**: As in the previous block, these comments highlight the use of the professor's core implementation for initialization, training, and interactive testing.
    *   `nlp_sourced = spacy.blank("en")`: Creates a *new, separate* model object to ensure it does not have any "memory" of the manual data.
    *   `nlp_sourced.update(...)`: The exact same training method as before, but this time it is being fed thousands of examples from the `sourced_train_set`.
    *   `classification_report(...)`: The same scikit-learn function is used here to provide a robust evaluation of this more powerful model.
    *   `classify_email_sourced(email)`: A distinct classification function for the interactive loop to ensure we are testing the correct model (`nlp_sourced`).

*   **Inputs**:
    1.  `sourced_train_set`: The large list of training examples (80% of `emails.csv`) from Block 1.
    2.  `sourced_test_set`: The large list of testing examples (20% of `emails.csv`) from Block 1.
    3.  `user_input`: Text entered by the user during this block's interactive testing phase.

*   **Outputs**:
    1.  A printed log showing the training process on the large dataset.
    2.  A final accuracy score and Classification Report detailing the high performance of the model on the sourced test set.
    3.  A final interactive prompt for testing this superior model.

*   **Code Flow**: The flow is identical to Block 2, emphasizing the experimental nature: 1. A new model is created. 2. The model is trained, but this time on the large `sourced_train_set`. 3. The model is evaluated on the large `sourced_test_set`. 4. An interactive loop is started for this specific model.

*   **Comments and Observations**: In this cell, I set up a second, completely separate spaCy model to see the impact of using a much larger dataset. This time, I'm training the model exclusively on the thousands of email examples I loaded from the CSV file. A key change I made here was to the training loop: because there are so many examples, updating the model one example at a time would be very slow, so I implemented batching to process 32 emails per update, which makes the training process much more efficient. After running the training for 10 epochs, I evaluated this new model against its corresponding large test set, and I anticipate the accuracy will be significantly higher than the first model's. The final interactive prompt is useful because it lets me directly compare the classification ability of this well-trained model against the previous one that was only trained on 16 examples.
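
To make the effect of batching concrete, the sketch below counts how many update calls one epoch needs. `fixed_size_batches` is a hypothetical stand-in that approximates what `spacy.util.minibatch` does when given a constant integer size; it is not spaCy's implementation.

```python
import math

def fixed_size_batches(items, size):
    """Yield consecutive slices of at most `size` items (the last one may be shorter)."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

n_examples = 4582   # size of sourced_train_set reported by the data preparation cell
batch_size = 32

batch_lengths = [len(batch) for batch in fixed_size_batches(list(range(n_examples)), batch_size)]
updates_per_epoch = len(batch_lengths)

# The number of batches is the example count divided by the batch size, rounded up
assert updates_per_epoch == math.ceil(n_examples / batch_size)
print(updates_per_epoch, batch_lengths[-1])
```

Under these assumptions, one epoch takes 144 update calls instead of 4,582, and the final batch holds the 6 leftover examples.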

In [None]:
import spacy
from spacy.training import Example
from sklearn.metrics import accuracy_score, classification_report
import random

# --- Model 2: Trained on SOURCED Data Only ---

# --- PROVIDED SAMPLE CODE STARTS HERE (Initialization) ---
nlp_sourced = spacy.blank("en")
textcat_sourced = nlp_sourced.add_pipe("textcat")
textcat_sourced.add_label("SPAM")
textcat_sourced.add_label("HAM")
optimizer = nlp_sourced.initialize()
# --- PROVIDED SAMPLE CODE ENDS HERE (Initialization) ---

print(f"--- Training Model 2 on {len(sourced_train_set)} SOURCED examples (using efficient batching) ---")

# --- MODIFICATION FOR SPEED: BATCHING ---
# This loop structure is an optimization to handle the large dataset efficiently.
# It does not change the core logic of the professor's nlp.update() function.
n_epochs = 10
batch_size = 32
for i in range(n_epochs):
    random.shuffle(sourced_train_set)
    # Use spaCy's minibatch utility to create batches
    batches = spacy.util.minibatch(sourced_train_set, size=batch_size)
    for batch in batches:
        # Create Example objects for the whole batch
        examples = []
        for text, annotations in batch:
            examples.append(Example.from_dict(nlp_sourced.make_doc(text), annotations))

        # --- PROVIDED SAMPLE CODE STARTS HERE (Core Update Call) ---
        # Update the model with the batch of examples
        nlp_sourced.update(examples, sgd=optimizer)
        # --- PROVIDED SAMPLE CODE ENDS HERE (Core Update Call) ---
    print(f"Completed Epoch {i+1}/{n_epochs}")
# --- END OF BATCHING MODIFICATION ---

print("Training of sourced model complete.")

# --- Evaluation Metrics ---
print(f"\n--- Evaluating Model 2 on {len(sourced_test_set)} UNSEEN SOURCED examples ---")
true_labels = []
predicted_labels = []
for text, annotations in sourced_test_set:
    true_label = "SPAM" if annotations['cats']['SPAM'] == 1.0 else "HAM"
    true_labels.append(true_label)
    doc = nlp_sourced(text)
    predicted_label = "SPAM" if doc.cats['SPAM'] > doc.cats['HAM'] else "HAM"
    predicted_labels.append(predicted_label)
accuracy = accuracy_score(true_labels, predicted_labels)
print(f"\nModel Accuracy on Sourced Test Set: {accuracy * 100:.2f}%")
print("\nClassification Report (Sourced Model):")
print(classification_report(true_labels, predicted_labels, zero_division=0))
# --- End of Evaluation Metrics ---


# --- PROVIDED SAMPLE CODE STARTS HERE (Interactive Testing) ---
def classify_email_sourced(email):
    doc = nlp_sourced(email)
    spam_score = doc.cats['SPAM']
    ham_score = doc.cats['HAM']
    if spam_score > ham_score:
        return "SPAM"
    else:
        return "HAM"

while True:
    user_input = input("\nTest the SOURCED Model: Enter an email (or type 'exit' to quit): ")
    if user_input.lower() == 'exit':
        break
    classification = classify_email_sourced(user_input)
    print(f"The email is classified as: {classification}")
# --- PROVIDED SAMPLE CODE ENDS HERE (Interactive Testing) ---

--- Training Model 2 on 4582 SOURCED examples (using efficient batching) ---
Completed Epoch 1/10
Completed Epoch 2/10
Completed Epoch 3/10
Completed Epoch 4/10
Completed Epoch 5/10
Completed Epoch 6/10
Completed Epoch 7/10
Completed Epoch 8/10
Completed Epoch 9/10
Completed Epoch 10/10
Training of sourced model complete.

--- Evaluating Model 2 on 1146 UNSEEN SOURCED examples ---

Model Accuracy on Sourced Test Set: 98.08%

Classification Report (Sourced Model):
              precision    recall  f1-score   support

         HAM       0.98      1.00      0.99       874
        SPAM       0.99      0.93      0.96       272

    accuracy                           0.98      1146
   macro avg       0.98      0.96      0.97      1146
weighted avg       0.98      0.98      0.98      1146


Test the SOURCED Model: Enter an email (or type 'exit' to quit): Hey. We have a meetingn tommorow
The email is classified as: HAM

Test the SOURCED Model: Enter an email (or type 'exit' to quit): Subject:

# Part 2: Part-of-Speech (POS) Tagging



**Objective:**
This section implements a Part-of-Speech (POS) tagger using a pre-trained spaCy model. The goal is to take a sentence, process it, and identify the grammatical category for each word (e.g., noun, verb, adjective). This task demonstrates the use of pre-built NLP pipelines for linguistic feature extraction and introduces methods for evaluating the model's performance.

---

### 2.1. POS Tagger Implementation

**Code Cell Description:**
This first code block loads a pre-trained spaCy model and uses it to perform POS tagging on a text string provided by the user. Unlike a task that requires training, this process leverages a model that has already learned the patterns of the English language.

*   **Function Description:**
    The `analyze_text(text)` function serves as the core of this tool. It accepts a string of text, processes it using the loaded spaCy pipeline, and extracts the POS tag for each token in the text.

*   **Syntax Explanation:**
    *   `nlp = spacy.load("en_core_web_sm")`: This command loads a small, pre-trained English language model provided by spaCy. "Pre-trained" means the model has already been trained by spaCy's developers on a massive corpus of text. This allows us to use it for tasks like POS tagging out-of-the-box.
    *   `doc = nlp(text)`: The input text is passed to the `nlp` object. This returns a `doc` object, which is a rich, processed container holding the original text along with all the linguistic annotations (like tokens, POS tags, dependencies, etc.) discovered by the model.
    *   `[(token.text, token.pos_) for token in doc]`: This is a list comprehension, a concise way to create a list. It iterates through every token in the `doc` object and, for each token, creates a tuple containing the original word (`token.text`) and its universal POS tag (`token.pos_`).

*   **Inputs:**
    *   The primary input is a text string (`user_input`) provided by the user via the keyboard prompt.
    *   The implicit input is the `en_core_web_sm` model, which contains all the necessary data and weights to perform the analysis.

*   **Outputs:**
    *   The code prints a list of tuples to the console.
    *   Each tuple represents a word from the input sentence and its corresponding POS tag. For example, the input "She is reading a book." would produce the output `[('She', 'PRON'), ('is', 'AUX'), ('reading', 'VERB'), ('a', 'DET'), ('book', 'NOUN'), ('.', 'PUNCT')]`.

*   **Code Flow:**
    1.  The `spacy.load()` command initializes the pre-trained NLP pipeline.
    2.  The program prompts the user to enter a sentence.
    3.  The user's input is passed to the `analyze_text` function.
    4.  Inside the function, the text is processed by the `nlp` pipeline into a `doc` object.
    5.  A list of (token, POS tag) tuples is generated and returned.
    6.  The program prints the final list to the console.

*   **Comments and Observations:**
    This task highlights the value of pre-trained models in NLP. By using a model that has already been trained, we can perform complex linguistic analysis with very little code and without the need for a custom dataset or a lengthy training process. POS tagging is a foundational NLP task that is often a preprocessing step for more complex applications like Named Entity Recognition (NER), information extraction, and sentiment analysis.
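
As a small illustration of POS tags serving as a preprocessing step, the sketch below takes a list in the same `(token, tag)` format that `analyze_text` returns and derives a tag frequency table and a list of content words. The input list is the example from the Outputs description above, hard-coded so the snippet runs without loading a model; the choice of which tags count as "content" tags is an assumption made for illustration.

```python
from collections import Counter

# Example (token, POS tag) pairs in the format returned by analyze_text
pos_tags = [("She", "PRON"), ("is", "AUX"), ("reading", "VERB"),
            ("a", "DET"), ("book", "NOUN"), (".", "PUNCT")]

# How often each tag occurs in the sentence
tag_counts = Counter(tag for _, tag in pos_tags)

# Keep only tokens whose tag is in an (assumed) set of content-bearing tags
CONTENT_TAGS = {"NOUN", "PROPN", "VERB", "ADJ", "ADV"}
content_words = [word for word, tag in pos_tags if tag in CONTENT_TAGS]

print(tag_counts.most_common())
print(content_words)
```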

---

### 2.2. Model Evaluation and Performance Metrics

**Code Cell Description:**
This second code block is dedicated to evaluating the performance of our spaCy POS tagging model. To assess its accuracy, we compare its predictions against a manually-labeled "gold standard" test dataset. We then calculate several standard classification metrics using the `scikit-learn` library to quantify its performance.

*   **Code Flow:**
    1.  **Create a Ground Truth Dataset**: A list named `ground_truth_data` is defined. Each item in the list is a tuple containing a sentence and a corresponding list of `(word, correct_POS_tag)` tuples. This serves as our testing data.
    2.  **Generate Model Predictions**: The code loops through the ground truth dataset. For each sentence, it uses our `analyze_text` function to get the model's predicted POS tags.
    3.  **Prepare Data for Evaluation**: Two flat lists are created: `all_true_tags` (from our ground truth data) and `all_predicted_tags` (from the model). Flattening the lists is necessary to use the `scikit-learn` metrics functions.
    4.  **Calculate Metrics**: Using the prepared lists, the code calculates Accuracy, Precision, Recall, and the F1-Score.
    5.  **Display Results**: The calculated metrics and a detailed `classification_report` are printed to show the model's performance, both overall and for each specific POS tag.

*   **Inputs:**
    *   `ground_truth_data`: The manually created test set containing sentences and their correct POS tags.
    *   `all_true_tags`: A Python list containing the correct, manually-verified POS tags from our ground truth dataset. This is derived from `ground_truth_data`.
    *   `all_predicted_tags`: A Python list containing the POS tags predicted by the spaCy model for the same set of text.

*   **Outputs:**
    *   A printout of four key performance scores (Accuracy, Precision, Recall, F1-Score).
    *   A detailed `Classification Report` that breaks down the performance for each individual POS tag (e.g., 'NOUN', 'VERB', 'ADJ'), showing its precision, recall, and F1-score.
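
Comparing flattened tag lists is only meaningful when the model's tokens line up one-to-one with the hand-labeled tokens. The sketch below shows one way to check this per sentence before collecting tags. `aligned_tag_pairs` is a hypothetical helper, and both "predictions" are made-up examples rather than real model output.

```python
def aligned_tag_pairs(gold, predicted):
    """Return (true_tag, predicted_tag) pairs, or None if the token sequences differ."""
    gold_tokens = [word for word, _ in gold]
    predicted_tokens = [word for word, _ in predicted]
    if gold_tokens != predicted_tokens:
        return None  # tokenization mismatch: tags cannot be compared position by position
    return [(g_tag, p_tag) for (_, g_tag), (_, p_tag) in zip(gold, predicted)]

gold = [("She", "PRON"), ("quickly", "ADV"), ("reads", "VERB"),
        ("the", "DET"), ("book", "NOUN"), (".", "PUNCT")]

# Hypothetical prediction with identical tokens but one wrong tag
same_tokens = [("She", "PRON"), ("quickly", "ADV"), ("reads", "NOUN"),
               ("the", "DET"), ("book", "NOUN"), (".", "PUNCT")]

# Hypothetical prediction where the final token is missing
missing_token = same_tokens[:-1]

pairs = aligned_tag_pairs(gold, same_tokens)
matches = sum(g == p for g, p in pairs)
print(matches, "of", len(pairs), "tags match")
print(aligned_tag_pairs(gold, missing_token))
```

A sentence that returns `None` could be reported and skipped, so that a single tokenization difference does not shift every later tag out of position.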

---

### Explanation of Chosen Metrics

#### 1. **Accuracy**
*   **What it is:** Accuracy is the most intuitive metric. It measures the proportion of tokens that were tagged correctly out of all the tokens.
*   **Formula:** `Accuracy = (Number of Correctly Predicted Tags) / (Total Number of Tags)`
*   **Why it's useful:** It gives a quick, overall summary of the model's performance. An accuracy of 0.95 means that 95% of the words in the test set were assigned the correct POS tag. However, it can be misleading if the dataset is imbalanced.

#### 2. **Precision**
*   **What it is:** Precision answers the question: "Of all the tokens that the model labeled as a specific tag (e.g., NOUN), how many were actually NOUNs?"
*   **Formula:** `Precision = (True Positives) / (True Positives + False Positives)`
*   **Why it's useful:** High precision for a specific tag means that when the model identifies that tag, it is very likely to be correct. It is a measure of a classifier's exactness and helps gauge the rate of false positives.

#### 3. **Recall**
*   **What it is:** Recall answers the question: "Of all the tokens that were actually a specific tag (e.g., NOUN), how many did the model correctly identify?"
*   **Formula:** `Recall = (True Positives) / (True Positives + False Negatives)`
*   **Why it's useful:** High recall for a specific tag means that the model is good at finding all instances of that tag in the text. It is a measure of a classifier's completeness and helps gauge the rate of false negatives.

#### 4. **F1-Score**
*   **What it is:** The F1-Score is the harmonic mean of Precision and Recall. It provides a single score that balances both concerns.
*   **Formula:** `F1-Score = 2 * (Precision * Recall) / (Precision + Recall)`
*   **Why it's useful:** It is a robust metric, especially when there's an uneven class distribution (some tags are much more common than others). A high F1-score indicates that the model has both low false positives and low false negatives.

**Note on "Weighted" Average:**
We use the `average='weighted'` parameter when calculating precision, recall, and F1-score. Tags occur with different frequencies in our dataset (for example, there are more `NOUN` tokens than `SYM` tokens). This method computes each metric for every tag independently and then averages the results, weighting each tag by its number of instances in the true data. The overall score therefore reflects performance on the tokens actually present, whereas an unweighted (macro) average would give a rare tag the same influence as a frequent one.
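
The formulas above can be checked by hand with a few lines of plain Python. This sketch is not the scikit-learn implementation, and `per_label_scores` and `weighted_average` are hypothetical helpers. The labels are a pattern consistent with the classification report in Section 1.2 (one of three HAM messages predicted as SPAM), chosen so the results can be compared with that report.

```python
from collections import Counter

def per_label_scores(true_labels, predicted_labels):
    """Return {label: (precision, recall, f1)} computed from raw counts."""
    pairs = list(zip(true_labels, predicted_labels))
    scores = {}
    for label in sorted(set(true_labels) | set(predicted_labels)):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[label] = (precision, recall, f1)
    return scores

def weighted_average(scores, true_labels):
    """Average each metric across labels, weighted by the label's true count (support)."""
    support = Counter(true_labels)
    total = len(true_labels)
    return tuple(
        sum(scores[label][i] * support[label] for label in scores) / total
        for i in range(3)
    )

true_labels = ["HAM", "HAM", "HAM", "SPAM"]
predicted_labels = ["HAM", "HAM", "SPAM", "SPAM"]

scores = per_label_scores(true_labels, predicted_labels)
weighted_precision, weighted_recall, weighted_f1 = weighted_average(scores, true_labels)
print(scores)
print(round(weighted_precision, 2), round(weighted_recall, 2), round(weighted_f1, 2))
```

The weighted recall equals the plain accuracy (0.75 here), because weighting each tag's recall by its support simply counts the correctly labeled items over the total.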


**Comments and Observations:**
I loaded one of spaCy's pre-trained English models, `en_core_web_sm`, to evaluate its built-in capability for Part-of-Speech (POS) tagging. My main objective here was to measure the performance of this professional-grade model. To do this, I created a small "gold standard" dataset by manually labeling the correct grammatical tags for a few sentences. I then wrote code to get the model's predictions for these same sentences and compared them against my ground truth labels. The core of this exercise was using `sklearn.metrics` to calculate not just a simple accuracy score, but also the weighted precision, recall, and F1-score, which provide a more nuanced view of performance, especially if some tags are more common than others. The detailed classification report at the end was the most valuable part, as it broke down the model's performance for every single POS tag, showing exactly where it is strong and where it might be less reliable.

In [None]:
import spacy
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report

# Load a pre-trained English spaCy model.
# 'en_core_web_sm' is a small, efficient model for English.
try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    print('Downloading language model for the first time. This may take a few minutes...')
    from spacy.cli import download
    download('en_core_web_sm')
    nlp = spacy.load("en_core_web_sm")

# --- START OF PROVIDED SAMPLE CODE ---

# Function to analyze user input text and return tokens with POS tags as a list
def analyze_text(text):
    # Process the text with the spaCy pipeline
    doc = nlp(text)

    # Create a list of tuples, where each tuple contains the token's text and its POS tag
    pos_list = [(token.text, token.pos_) for token in doc]
    return pos_list

# Allow user input and analyze
user_input = input("Enter a text for POS tagging analysis: ")
pos_tags = analyze_text(user_input)

# Display the result as a list
print("\nTokens and POS Tags:")
print(pos_tags)

# --- END OF PROVIDED SAMPLE CODE ---


# --- START OF METRICS IMPLEMENTATION ---

# Step 1: Create a manually labeled "gold standard" dataset for evaluation.
# This dataset contains sentences and their correctly hand-labeled POS tags.
# NOTE: The tagset used (e.g., 'PROPN', 'VERB') must match spaCy's Universal Dependencies tagset.
ground_truth_data = [
    ("Apple is looking at buying U.K. startup for $1 billion.",
     [('Apple', 'PROPN'), ('is', 'AUX'), ('looking', 'VERB'), ('at', 'ADP'), ('buying', 'VERB'), ('U.K.', 'PROPN'), ('startup', 'NOUN'), ('for', 'ADP'), ('$', 'SYM'), ('1', 'NUM'), ('billion', 'NUM'), ('.', 'PUNCT')]),
    ("The quick brown fox jumps over the lazy dog.",
     [('The', 'DET'), ('quick', 'ADJ'), ('brown', 'ADJ'), ('fox', 'NOUN'), ('jumps', 'VERB'), ('over', 'ADP'), ('the', 'DET'), ('lazy', 'ADJ'), ('dog', 'NOUN'), ('.', 'PUNCT')]),
    ("I love to write code in Python because it is versatile.",
     [('I', 'PRON'), ('love', 'VERB'), ('to', 'PART'), ('write', 'VERB'), ('code', 'NOUN'), ('in', 'ADP'), ('Python', 'PROPN'), ('because', 'SCONJ'), ('it', 'PRON'), ('is', 'AUX'), ('versatile', 'ADJ'), ('.', 'PUNCT')]),
    ("She quickly reads the book.",
     [('She', 'PRON'), ('quickly', 'ADV'), ('reads', 'VERB'), ('the', 'DET'), ('book', 'NOUN'), ('.', 'PUNCT')]),
]

# Step 2: Get model's predictions and flatten lists for comparison
all_true_tags = []
all_predicted_tags = []

print("\n--- Model Performance Evaluation ---")
for sentence, true_tags in ground_truth_data:
    # Extract just the tags from the ground truth data
    true_pos = [tag for word, tag in true_tags]
    all_true_tags.extend(true_pos)

    # Get predictions from our spaCy model
    predicted_pos_tuples = analyze_text(sentence)
    predicted_pos = [tag for word, tag in predicted_pos_tuples]
    all_predicted_tags.extend(predicted_pos)

    print(f"\nSentence: '{sentence}'")
    print(f"  > True Tags:     {true_pos}")
    print(f"  > Predicted Tags: {predicted_pos}")


# Step 3: Calculate Performance Metrics
# Ensure that we have the same number of tags to compare
if len(all_true_tags) == len(all_predicted_tags):
    # Overall Accuracy
    accuracy = accuracy_score(all_true_tags, all_predicted_tags)

    # Precision, Recall, and F1-Score
    # 'weighted' average calculates metrics for each label, and finds their average,
    # weighted by the number of true instances for each label. This accounts for label imbalance.
    precision = precision_score(all_true_tags, all_predicted_tags, average='weighted', zero_division=0)
    recall = recall_score(all_true_tags, all_predicted_tags, average='weighted', zero_division=0)
    f1 = f1_score(all_true_tags, all_predicted_tags, average='weighted', zero_division=0)

    print("\n--- Overall Model Metrics ---")
    print(f"Accuracy: {accuracy:.4f}")
    print(f"Weighted Precision: {precision:.4f}")
    print(f"Weighted Recall: {recall:.4f}")
    print(f"Weighted F1-Score: {f1:.4f}")

    # Detailed report showing metrics for each POS tag
    print("\n--- Classification Report (Metrics Per POS Tag) ---")
    # Get the unique labels from both true and predicted lists to include all tags in the report
    labels = sorted(set(all_true_tags + all_predicted_tags))
    print(classification_report(all_true_tags, all_predicted_tags, labels=labels, zero_division=0))
else:
    print("\nError: Mismatch between the number of true and predicted tags. Cannot calculate metrics.")

Enter a text for POS tagging analysis: The quick brown fox jumps over the lazy dog

Tokens and POS Tags:
[('The', 'DET'), ('quick', 'ADJ'), ('brown', 'ADJ'), ('fox', 'NOUN'), ('jumps', 'VERB'), ('over', 'ADP'), ('the', 'DET'), ('lazy', 'ADJ'), ('dog', 'NOUN')]

--- Model Performance Evaluation ---

Sentence: 'Apple is looking at buying U.K. startup for $1 billion.'
  > True Tags:     ['PROPN', 'AUX', 'VERB', 'ADP', 'VERB', 'PROPN', 'NOUN', 'ADP', 'SYM', 'NUM', 'NUM', 'PUNCT']
  > Predicted Tags: ['PROPN', 'AUX', 'VERB', 'ADP', 'VERB', 'PROPN', 'VERB', 'ADP', 'SYM', 'NUM', 'NUM', 'PUNCT']

Sentence: 'The quick brown fox jumps over the lazy dog.'
  > True Tags:     ['DET', 'ADJ', 'ADJ', 'NOUN', 'VERB', 'ADP', 'DET', 'ADJ', 'NOUN', 'PUNCT']
  > Predicted Tags: ['DET', 'ADJ', 'ADJ', 'NOUN', 'VERB', 'ADP', 'DET', 'ADJ', 'NOUN', 'PUNCT']

Sentence: 'I love to write code in Python because it is versatile.'
  > True Tags:     ['PRON', 'VERB', 'PART', 'VERB', 'NOUN', 'ADP', 'PROPN', 'SCONJ', 'P

# Part 3: Sentiment Analysis



### 3.1. Data Loading and Preparation for Sentiment Analysis

**Objective:** To load the raw IMDb review data and prepare it for model training and evaluation by splitting it into distinct training and testing sets.

**Dataset:** The IMDb Large Movie Review Dataset is used for this task. It contains 50,000 movie reviews, each labeled with a corresponding 'positive' or 'negative' sentiment.

**Methodology:**

The project requires a robust evaluation of the model's performance on unseen data so that the reported accuracy reflects how well the model generalizes. Since the dataset was provided as a single file (`IMDB Dataset.csv`), it was divided into two subsets:

1.  **Training Set (80% of the data):** This portion is used exclusively to train the model. The model learns the patterns, vocabulary, and sentence structures associated with positive and negative reviews from this data.
2.  **Testing Set (20% of the data):** This portion is held back and is **never shown to the model during training**. It serves as a final, unbiased exam to evaluate the model's ability to generalize to new, unseen reviews.

The split was performed using the `train_test_split` function from the `scikit-learn` library with the following critical parameters:
*   `test_size=0.2`: This parameter allocates 20% of the total data to the testing set, leaving 80% for training, which is a standard practice in machine learning.
*   `random_state=42`: This ensures the split is **reproducible**. Anyone running this notebook will get the exact same training and testing sets, ensuring the results can be verified.
*   `stratify=full_df['label']`: This is a crucial step that ensures the proportion of positive and negative reviews is identical in both the training and testing sets. This prevents an imbalanced split that could bias the model's training or its evaluation.
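To illustrate what stratification and a fixed seed do, here is a dependency-free sketch of a stratified split. It is not scikit-learn's implementation; the helper `stratified_split` and the toy labels are hypothetical and only demonstrate the idea behind `stratify` and `random_state`.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with each label's share preserved."""
    rng = random.Random(seed)             # fixed seed -> reproducible split
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)       # group row indices by class

    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])     # same fraction from every class
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Hypothetical balanced toy labels: 50 positive (1) and 50 negative (0)
labels = [1] * 50 + [0] * 50
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))          # 80 20
print(sum(labels[i] for i in test_idx))       # 10 positives in the test set
```

Because every class contributes the same fraction to the test set, the class balance of the full data carries over to both subsets, which is consistent with the 5,000/5,000 support shown in the classification report later in this part.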

**Label Encoding:**
The original sentiment labels ('positive', 'negative') were converted into numerical format (`1` for positive, `0` for negative) as machine learning models require numeric inputs.

**Reference:**
- The dataset is a processed version of the one introduced by Maas, A. L., et al. (2011) in *Learning Word Vectors for Sentiment Analysis*.
- The `IMDB Dataset.csv` file was loaded from Google Drive on September 10, 2025.

**Comments and observations:** This cell sets up the data for a new sentiment analysis model. I initially tried a manual dataset that I created myself, but the accuracy stayed very low no matter how large I made it. Reaching good accuracy would require a very large, high-quality dataset, which would take too long to build by hand. I could have copied or paraphrased real reviews, but that would be academically dishonest, so I did not. Instead, I used a sourced dataset to see how the model performs on it. The first step was mounting my Google Drive to get access to `IMDB Dataset.csv`. I loaded it with pandas inside a try-except block, which is good practice because a wrong file path then produces a clear error message instead of an unexplained crash. Once the data was loaded, I performed a key preprocessing step: converting the text labels 'positive' and 'negative' into the numerical values 1 and 0, which is what machine learning models typically need. The final and most important step was using `train_test_split` from scikit-learn to divide the dataset into a training set (80%) and a testing set (20%). I used the `stratify` option so that both splits keep the same proportion of positive and negative reviews, which prevents bias in the later evaluation.

In [None]:
import pandas as pd
from google.colab import drive
from sklearn.model_selection import train_test_split

# --- 1. Mount Google Drive ---
drive.mount('/content/drive')

# --- 2. Define the file path ---
# Use the path you copied from the file browser. This is the most likely path.
file_path = '/content/IMDB Dataset/IMDB Dataset.csv'

# --- 3. Load the dataset ---
try:
    full_df = pd.read_csv(file_path)
    print("Dataset loaded successfully!")
    print(f"Total reviews: {len(full_df)}")
    print(full_df.head()) # Display the first few rows
except FileNotFoundError:
    print(f"Error: File not found at the path: {file_path}")
    print("Please double-check the path by right-clicking the file in the Colab file browser and selecting 'Copy path'.")
    raise  # Stop here: the steps below need full_df to exist

# --- 4. Preprocess and split the data ---
# This part runs only if the dataset loaded successfully.

# Map text labels to numbers
full_df['label'] = full_df['sentiment'].map({'positive': 1, 'negative': 0})

# Split into training and testing sets
train_df, test_df = train_test_split(
    full_df,
    test_size=0.2,
    random_state=42,
    stratify=full_df['label']
)

# Verify the split
print(f"\nTraining set size: {len(train_df)}")
print(f"Testing set size: {len(test_df)}")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Dataset loaded successfully!
Total reviews: 50000
                                              review sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive

Training set size: 40000
Testing set size: 10000


### 3.2. Data Formatting for spaCy

**Code Cell Description:** This code cell is responsible for a critical data transformation step. It converts the data from the `train_df` and `test_df` pandas DataFrames into the specific list-of-tuples format required by spaCy's training loop. This acts as a bridge between our raw data and the machine learning model.

*   **Function Description:**
    The `format_data_for_spacy(df)` function iterates through a DataFrame, taking the 'review' text and its corresponding numeric 'label' (0 or 1) and converting them into a tuple containing the text and a structured dictionary for the annotations.

*   **Syntax Explanation:**
    *   `for index, row in df.iterrows()`: This is a standard pandas method to loop through each row of a DataFrame.
    *   `annotations = {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}`: This creates the dictionary in the specific "gold-standard" format that spaCy's `textcat` component expects for training. The key `"cats"` refers to the categories. The nested dictionary specifies the probability of each label for this text. For training, the correct label is given a probability of 1.0 and all others are 0.0.

*   **Inputs:**
    *   The function takes a pandas DataFrame (`df`) as input.
    *   This DataFrame is expected to have a `review` column containing the text and a `label` column containing the numeric sentiment (1 for positive, 0 for negative).
    *   The function is called twice: once with `train_df` and once with `test_df`.

*   **Outputs:**
    *   The function returns a Python list (e.g., `train_data_spacy`).
    *   Each element in the list is a tuple `(text, annotations)`, where `text` is the movie review string and `annotations` is the structured dictionary.

*   **Code Flow:**
    The code first defines the conversion function. It then calls this function on the training DataFrame to create `train_data_spacy` and on the testing DataFrame to create `test_data_spacy`. This ensures that both datasets are in the correct format for their respective roles in training and evaluation.

*   **Comments and Observations:**
  This cell is a pure data transformation step, and an important one. I learned from the previous spam filter exercise that spaCy's training process needs data in a specific tuple format: `(text, annotations)`. So here I created a reusable function to convert my pandas DataFrames into that structure. The function iterates through every movie review, checks whether its label is 1 (positive) or 0 (negative), and builds the corresponding dictionary, for example `{"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}` for a positive review. I then applied this function to both the training and testing sets I created earlier. Printing a sample at the end lets me quickly verify that the conversion worked and that the data is ready for the spaCy training pipeline in the next step.
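The label-to-annotation mapping described above can also be factored into a small helper. The sketch below is one possible variation, not the notebook's code: `make_cats` and the two toy rows are hypothetical, and the label names follow the `POSITIVE`/`NEGATIVE` convention used in this part.

```python
def make_cats(label, labels=("POSITIVE", "NEGATIVE")):
    """Build a {"cats": {...}} dict: 1.0 for the gold label, 0.0 otherwise."""
    gold = labels[0] if label == 1 else labels[1]
    return {"cats": {name: 1.0 if name == gold else 0.0 for name in labels}}

# Hypothetical rows standing in for (review, label) pairs from the DataFrame
rows = [("Loved it.", 1), ("Dull and far too long.", 0)]
formatted = [(text, make_cats(label)) for text, label in rows]
print(formatted[0])
# ('Loved it.', {'cats': {'POSITIVE': 1.0, 'NEGATIVE': 0.0}})
```

On a real DataFrame, the same comprehension could iterate over `zip(df['review'], df['label'])`, which is usually faster than `df.iterrows()` on tens of thousands of rows.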

In [None]:
# Format the IMDb data for spaCy

print("Formatting data for spaCy...")

# The spaCy model needs the data in a specific format.
# We'll create a function to convert our DataFrame rows into this format.
def format_data_for_spacy(df):
    formatted_data = []
    for index, row in df.iterrows():
        # Get the text from the 'review' column
        text = row['review']
        # Get the label (0 for negative, 1 for positive)
        label = row['label']

        # Create the annotations dictionary
        if label == 1: # Positive review
            annotations = {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}
        else: # Negative review
            annotations = {"cats": {"POSITIVE": 0.0, "NEGATIVE": 1.0}}

        # Add the text and its annotations to our list
        formatted_data.append((text, annotations))

    return formatted_data

# Convert both our training and testing data
train_data_spacy = format_data_for_spacy(train_df)
test_data_spacy = format_data_for_spacy(test_df)

# Print a sample to verify the format is correct
print("\nSample of formatted training data:")
print(train_data_spacy[0])

print(f"\nSuccessfully converted {len(train_data_spacy)} training examples and {len(test_data_spacy)} testing examples.")

Formatting data for spaCy...

Sample of formatted training data:

Successfully converted 40000 training examples and 10000 testing examples.


### 3.4. Model Initialization and Training

**Code Cell Description:** This cell uses the exact training function and logic provided in the course materials. It initializes a blank spaCy model, adds the text classification component, and then trains it using the `train_data_spacy` list we created in the previous step.

*   **Function Description:**
    The `train_model(data, n_iter)` function orchestrates the training process. It iterates through the training data multiple times (epochs), showing the examples to the model and updating the model's internal weights to minimize prediction errors.

*   **Syntax Explanation:**
    *   `nlp = spacy.blank("en")`: Creates a new, empty English language model. It has no pre-trained components, making it a clean slate for training.
    *   `textcat = nlp.add_pipe("textcat")`: Adds the Text Categorizer pipeline component. This is the part of the model that will learn to perform the classification.
    *   `nlp.begin_training()`: Initializes the model's weights and returns the optimizer, the algorithm that adjusts those weights based on the errors the model makes. In spaCy v3 this method is kept as a deprecated alias of `nlp.initialize()`.
    *   `random.shuffle(data)`: At the start of each epoch, the training data is shuffled in place. This keeps the model from learning patterns tied to the data's original order, which improves its ability to generalize.
    *   `nlp.update(...)`: This is the core learning command. It shows the model a batch of examples (here a single example per call), calculates the prediction error (loss), and updates the model's weights accordingly.
    *   `nlp.to_disk("sentiment_model")`: After training is complete, this command saves the trained model's pipeline, weights, and configuration to a directory.

*   **Inputs:**
    *   The `train_model` function is called with `train_data_spacy`, our list of 40,000 formatted training reviews.
    *   The `n_iter` parameter is set to `2`, meaning the model will see the entire training dataset two times.

*   **Outputs:**
    *   During execution, the cell prints the "Losses" for each epoch. This value is the accumulated training loss for that epoch. A loss that decreases from one epoch to the next suggests that the model is fitting the training data; it does not by itself show how well the model generalizes.
    *   The final output of this cell is a folder named `sentiment_model` containing all the data for the trained model.

*   **Comments and Observations:**
  I moved on to the core task of actually training the sentiment analysis model. Following the provided structure, I first initialized a blank English spaCy model and then configured its text classification pipeline by adding the "POSITIVE" and "NEGATIVE" labels it needs to learn. I then used the provided training function, which iterates through the data for a set number of epochs, shuffles the examples, and uses the crucial nlp.update method to teach the model. I decided to run the training on my large IMDb dataset for only two epochs for this initial run, mainly because training on so much text can be very time-consuming, and I wanted to get a result quickly. After the training loop completed, I made sure to save the finished model to disk, which is a really important step so I can load and use it for making predictions later without needing to retrain it from scratch every single time.
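Training on one example per update is simple but slow on 40,000 reviews. A common variation groups examples into batches before each update. The sketch below shows only the batching and shuffling step, using the standard library; the `batches` helper and the toy data are hypothetical, and spaCy ships its own `spacy.util.minibatch` utility for the same purpose.

```python
import random

def batches(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Hypothetical stand-in for train_data_spacy
data = [(f"review {i}", {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}})
        for i in range(10)]

shuffled = data[:]                  # copy, so the caller's list keeps its order
random.Random(0).shuffle(shuffled)

for batch in batches(shuffled, size=4):
    # In a real training loop, each batch would be converted to Example
    # objects and passed to nlp.update(...) in a single call.
    print(len(batch))               # prints 4, 4, 2
```

Shuffling a copy is a design choice: `random.shuffle` reorders the list it is given, so shuffling `train_data_spacy` directly also changes its order for any later cell that uses it.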

In [None]:
# Step 2: Set up and train the spaCy model using the provided code structure

import spacy
from spacy.training import Example
import random

# --- START OF PROVIDED SAMPLE CODE ---

# Load a blank spaCy model
nlp = spacy.blank("en")

# Add the text classification pipeline
textcat = nlp.add_pipe("textcat")

# Add labels for the text classification
textcat.add_label("POSITIVE")
textcat.add_label("NEGATIVE")

# Training the model (using the exact function provided)
def train_model(data, n_iter=10):
    optimizer = nlp.begin_training()
    for epoch in range(n_iter):
        random.shuffle(data)
        losses = {}
        for text, annotations in data:
            doc = nlp.make_doc(text)
            example = Example.from_dict(doc, annotations)
            nlp.update([example], drop=0.5, sgd=optimizer, losses=losses)
        print(f"Epoch {epoch+1}/{n_iter} - Losses: {losses}")

# --- END OF PROVIDED SAMPLE CODE ---

# Now, we call the training function with OUR large dataset
# I will use only 2 epochs (n_iter=2) for this example to make it run faster.
print("Starting model training...")
train_model(train_data_spacy, n_iter=2)
print("Training complete.")

# Save the trained model to a directory
nlp.to_disk("sentiment_model")
print("Model saved to disk.")

Starting model training...
Epoch 1/2 - Losses: {'textcat': 8226.893926706605}
Epoch 2/2 - Losses: {'textcat': 6181.865608817177}
Training complete.
Model saved to disk.


### 3.5. Model Performance and Evaluation

**Objective:** To quantitatively assess the performance of the trained sentiment analysis model on the unseen test dataset using standard classification metrics.

**Results:**
The model achieved an overall **accuracy of 85.38%**, successfully surpassing the project's minimum performance requirement of 80%.

**Analysis of Metrics:**

The following metrics provide a detailed understanding of the model's performance:

*   **Accuracy (85.38%):** This is the percentage of correct predictions out of all predictions made. It gives a general measure of the model's effectiveness. Accuracy can be misleading on imbalanced datasets; the test set here is balanced (5,000 reviews per class), but the per-class metrics below still show where the errors are concentrated.

*   **Precision:** This metric answers the question: "Of all the predictions I made for a certain class, how many were correct?"
    *   For **NEGATIVE** reviews, the precision was **0.92**, indicating a high level of reliability when the model identifies a review as negative.
    *   For **POSITIVE** reviews, the precision was **0.81**.

*   **Recall:** This metric answers the question: "Of all the actual instances of a class, how many did I correctly identify?"
    *   For **NEGATIVE** reviews, the recall was **0.78**. This means the model found 78% of all the negative reviews in the test set.
    *   For **POSITIVE** reviews, the recall was **0.93**, indicating the model is very effective at identifying the majority of positive reviews.

*   **F1-Score:** This is the harmonic mean of Precision and Recall, providing a single score that balances both. It is particularly useful when you need to balance finding all cases of a class against avoiding wrong predictions. The F1-scores of **0.84 (Negative)** and **0.86 (Positive)** are close to each other, although the precision and recall figures above show that the model leans toward predicting POSITIVE: it recovers 93% of positive reviews but only 78% of negative ones.

* **Conclusion:**
In this last cell for sentiment analysis, I evaluate how well the sentiment model performs on data it has never seen before. I looped through the entire test set and, for each movie review, recorded the correct label and then used my trained `nlp` model to make a prediction. After collecting all the true labels and the model's corresponding predictions, I used scikit-learn to generate the key performance metrics. This gives a final, objective score: not just the overall accuracy percentage but a full classification report that breaks down precision and recall, showing how well the model learned to distinguish positive from negative sentiment. The model performed better than my initial attempt with manual data.
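The three per-class metrics discussed above all derive from the same confusion counts. As a minimal sketch, the function below computes them for one class; `precision_recall_f1` and the counts passed to it are hypothetical and are not taken from the run in this notebook.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 for one class from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0   # correct / all predicted
    recall = tp / (tp + fn) if tp + fn else 0.0      # correct / all actual
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0  # harmonic mean
    return precision, recall, f1

# Hypothetical counts for illustration only
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=10)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
# precision=0.80 recall=0.89 f1=0.84
```

The same harmonic-mean formula links the reported figures: for the NEGATIVE class, 2 × 0.92 × 0.78 / (0.92 + 0.78) ≈ 0.84, which matches the F1-score in the report.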

In [None]:
# Step 3: Evaluate the model on the unseen test data

from sklearn.metrics import accuracy_score, classification_report

print("Evaluating model...")

# We will store the true labels and our model's predictions
true_labels = []
predicted_labels = []

# Loop through our formatted test data
for text, annotations in test_data_spacy:
    # Get the true label ('POSITIVE' or 'NEGATIVE')
    true_label = "POSITIVE" if annotations['cats']['POSITIVE'] == 1.0 else "NEGATIVE"
    true_labels.append(true_label)

    # Use the trained model to predict the sentiment of the text
    # The 'nlp' object is our trained model in memory
    doc = nlp(text)

    # Get the predicted label by choosing the one with the higher score
    if doc.cats['POSITIVE'] > doc.cats['NEGATIVE']:
        predicted_labels.append("POSITIVE")
    else:
        predicted_labels.append("NEGATIVE")

# Calculate accuracy
accuracy = accuracy_score(true_labels, predicted_labels)
print(f"\nModel Accuracy: {accuracy * 100:.2f}%")

# Print a detailed classification report (includes precision, recall, f1-score)
print("\nClassification Report:")
print(classification_report(true_labels, predicted_labels))

Evaluating model...

Model Accuracy: 85.38%

Classification Report:
              precision    recall  f1-score   support

    NEGATIVE       0.92      0.78      0.84      5000
    POSITIVE       0.81      0.93      0.86      5000

    accuracy                           0.85     10000
   macro avg       0.86      0.85      0.85     10000
weighted avg       0.86      0.85      0.85     10000



# Part 4: Text Summarizer



### 4.1. Text Summarizing

**Objective:**
To implement an extractive text summarizer using spaCy. This program takes a block of text and reduces it to a few key sentences by scoring each sentence based on the frequency of its most important words. This demonstrates a fundamental, non-ML approach to text summarization.
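As a rough illustration of frequency-based sentence scoring, here is a standard-library sketch that also returns the chosen sentences in their original document order. It is a simplified stand-in for the spaCy-based approach: the regex sentence splitter, the tiny `STOP_WORDS` set, and the name `summarize_in_order` are all assumptions made for this example.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; spaCy's own list is much larger
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "it"}

def summarize_in_order(text, n_sentences=2):
    """Pick the n highest-scoring sentences, then restore document order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    tokenized = [re.findall(r"[a-z']+", s.lower()) for s in sentences]
    freqs = Counter(w for words in tokenized for w in words
                    if w not in STOP_WORDS)

    # A sentence's score is the sum of the frequencies of its words
    scores = [sum(freqs[w] for w in words) for words in tokenized]
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:n_sentences])       # back to original order
    return " ".join(sentences[i] for i in keep)

text = "Cats sleep a lot. Dogs bark. Cats and dogs are pets, and cats purr."
print(summarize_in_order(text))
# Cats sleep a lot. Cats and dogs are pets, and cats purr.
```

Summing frequencies favors long sentences, since every extra content word adds to the score. Dividing by sentence length is one common adjustment, though it is not used in this notebook.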

---

### Summarizer Implementation

**Code Cell Description:**
This code block defines and executes a text summarization pipeline. It loads a pre-trained spaCy model to help with sentence and word tokenization, and then applies a scoring algorithm to identify the most significant sentences in the text.

*   **Function Description:**
    The `summarize(text, n_sentences=2)` function is the core of this task. It takes a string of text and an integer `n_sentences` as input. It then processes the text to find the `n` most important sentences and returns them as a single, combined string.

*   **Syntax Explanation:**
    *   `doc = nlp(text)`: Processes the input text, automatically segmenting it into sentences (`doc.sents`) and tokens.
    *   `if not token.is_stop and not token.is_punct`: This condition filters out common "stop words" (like 'the', 'is', 'a') and punctuation, as they typically do not carry significant meaning for summarization.
    *   `Counter()`: A specialized dictionary from Python's `collections` module, used here to count word frequencies and to store sentence scores.
    *   `sentence_scores.most_common(n_sentences)`: A method of the `Counter` object that returns the `n` items with the highest counts (in this case, the `n` sentences with the highest scores), ordered from highest to lowest score. As a result, the summary sentences may not appear in their original document order.
    *   `" ".join(top_sentences)`: This joins the list of top sentences into a single string, with each sentence separated by a space.

*   **Inputs:**
    *   `user_text`: A multi-sentence block of text provided by the user via a keyboard prompt.
    *   `n_sentences`: An optional integer parameter specifying how many sentences the final summary should contain (default is 2).

*   **Outputs:**
    *   The code prints a string (`summary`) to the console, which is the final summarized version of the input text.

*   **Code Flow:**
    1.  The pre-trained spaCy model is loaded.
    2.  The program prompts the user to enter a block of text.
    3.  The `summarize` function is called with the user's text.
    4.  Inside the function, the text is processed into a `doc`.
    5.  The frequency of each non-stop, non-punctuation word is calculated.
    6.  Each sentence is then scored based on the sum of the frequencies of the words it contains.
    7.  The sentences with the highest scores are selected.
    8.  These top sentences are joined together and returned as the final summary.
    9.  The program prints the summary to the console.

*   **Comments and observations:**
    For this final task, I explored and compared two different methods for extractive text summarization. The first approach, provided in the sample code, was a frequency-based method that works by identifying the most common important words in a text and then scoring sentences based on how many of those words they contain. To improve upon this, I implemented a second summarization function using a more advanced TF-IDF (Term Frequency-Inverse Document Frequency) algorithm, which not only considers word frequency but also how unique a word is to a particular sentence, theoretically allowing it to pinpoint more significant sentences. The most crucial part of this exercise was setting up a quantitative evaluation; I generated summaries from both methods for a sample text and then compared them against a "gold standard" reference summary using industry-standard metrics like ROUGE and BLEU scores. This provided an objective way to measure which summarizer was more effective, and the final interactive prompt allows me to test both algorithms on my own text to see how they perform in real-world scenarios.
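To make the ROUGE-1 number less of a black box, here is a simplified sketch of a unigram-overlap F1 score. It assumes lowercased whitespace tokens and no stemming, so its values will generally differ from those of the `rouge_score` package used in the notebook; `rouge1_f1` and the two sentences are hypothetical.

```python
from collections import Counter

def rouge1_f1(reference, candidate):
    """Unigram-overlap F1 with lowercased whitespace tokens (no stemming)."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count per word (clipped matches)
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

# Hypothetical sentences for illustration
reference = "the telescope observes distant galaxies"
candidate = "the telescope observes faint stars"
print(round(rouge1_f1(reference, candidate), 2))  # 0.6
```

Because both summarizers here are extractive and the reference summary is itself made of sentences from the source text, overlap-based scores mostly reflect whether a method picked the same sentences as the reference.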

In [None]:
# You may need to install the required libraries first:
# pip install rouge-score nltk scikit-learn

import spacy
from collections import Counter
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

# Load the same pre-trained English model we used for POS tagging
try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    print('Downloading language model for the first time. This may take a few minutes...')
    from spacy.cli import download
    download('en_core_web_sm')
    nlp = spacy.load("en_core_web_sm")

# --- START OF PROVIDED SAMPLE CODE ---

# Function to summarize text
def summarize(text, n_sentences=2):
    # Process the full text using the spaCy pipeline
    doc = nlp(text)

    # Calculate the frequency of important words (not stop words or punctuation)
    word_frequencies = Counter()
    for token in doc:
        if not token.is_stop and not token.is_punct:
            word_frequencies[token.text.lower()] += 1

    # Score sentences based on the frequency of the words they contain
    sentence_scores = Counter()
    for sent in doc.sents:
        for token in sent:
            if token.text.lower() in word_frequencies:
                sentence_scores[sent] += word_frequencies[token.text.lower()]

    # Select the top N highest-scoring sentences
    top_sentences = [sent.text for sent, score in sentence_scores.most_common(n_sentences)]

    # Join the top sentences to form the final summary
    return " ".join(top_sentences)

# --- END OF PROVIDED SAMPLE CODE ---


# --- NEW FUNCTION FOR HIGH-SCORE SUMMARIZATION (TF-IDF Method) ---
# This function is separate and does not modify the original.
def summarize_for_high_scores(text, n_sentences=2):
    doc = nlp(text)
    sentences = [sent.text for sent in doc.sents]

    if len(sentences) <= n_sentences:
        return " ".join(sentences)

    tfidf_vectorizer = TfidfVectorizer(stop_words='english')
    tfidf_matrix = tfidf_vectorizer.fit_transform(sentences)
    sentence_scores = np.array(tfidf_matrix.sum(axis=1)).ravel()

    # Take the indices of the n highest-scoring sentences,
    # then sort them so the summary keeps the original sentence order
    top_sentence_indices = np.sort(sentence_scores.argsort()[-n_sentences:])

    top_sentences = [sentences[i] for i in top_sentence_indices]
    return " ".join(top_sentences)


# --- High-Score Example and Comparison ---
example_text = (
    "The James Webb Space Telescope (JWST) is a space telescope designed primarily to conduct infrared astronomy. "
    "As the largest optical telescope in space, its high resolution and sensitivity allow it to view objects too old, "
    "distant, or faint for the Hubble Space Telescope. This has enabled a broad range of investigations across many "
    "fields of astronomy and cosmology, such as observation of the first stars and the formation of the first galaxies, "
    "and detailed atmospheric characterization of potentially habitable exoplanets."
)

example_reference_summary = (
    "The James Webb Space Telescope (JWST) is a space telescope designed primarily to conduct infrared astronomy. "
    "This has enabled a broad range of investigations across many fields of astronomy and cosmology, such as observation of the first stars and the formation of the first galaxies, and detailed atmospheric characterization of potentially habitable exoplanets."
)

print("--- Comparing Summarization Methods on a Pre-defined Example ---")
print("\n--- Original Text ---")
print(example_text)

# Generate summaries from BOTH methods
summary_from_original_code = summarize(example_text, n_sentences=2)
summary_for_high_score = summarize_for_high_scores(example_text, n_sentences=2)


# --- Evaluation of the Original Sample Code's Summary ---
print("\n\n--- Summary from ORIGINAL Sample Code ---")
print(summary_from_original_code)
print("\n--- Evaluation Metrics for ORIGINAL Summary ---")
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
scores_original = scorer.score(example_reference_summary, summary_from_original_code)
print("ROUGE Scores:")
for key, score in scores_original.items():
    print(f"{key}: F1-Score={score.fmeasure:.4f}")
bleu_original = sentence_bleu([example_reference_summary.split()], summary_from_original_code.split())
print(f"BLEU Score: {bleu_original:.4f}")


# --- Evaluation of the New High-Score Function's Summary ---
print("\n\n--- Summary from NEW High-Score Function ---")
print(summary_for_high_score)
print("\n--- Evaluation Metrics for High-Score Summary ---")
scores_new = scorer.score(example_reference_summary, summary_for_high_score)
print("ROUGE Scores:")
for key, score in scores_new.items():
    print(f"{key}: F1-Score={score.fmeasure:.4f}")
bleu_new = sentence_bleu([example_reference_summary.split()], summary_for_high_score.split())
print(f"BLEU Score: {bleu_new:.4f}")


# --- User Input Section for Your Own Testing ---
print("\n" + "="*50)
print("--- Try It Yourself ---")
print("="*50)

user_text = input("\nEnter the text you want to summarize: ")
if user_text.strip():
    print("\n--- Summary from ORIGINAL Sample Code ---")
    print(summarize(user_text, n_sentences=2))

    print("\n--- Summary from NEW High-Score Function ---")
    print(summarize_for_high_scores(user_text, n_sentences=2))
else:
    print("No text provided.")

--- Comparing Summarization Methods on a Pre-defined Example ---

--- Original Text ---
The James Webb Space Telescope (JWST) is a space telescope designed primarily to conduct infrared astronomy. As the largest optical telescope in space, its high resolution and sensitivity allow it to view objects too old, distant, or faint for the Hubble Space Telescope. This has enabled a broad range of investigations across many fields of astronomy and cosmology, such as observation of the first stars and the formation of the first galaxies, and detailed atmospheric characterization of potentially habitable exoplanets.


--- Summary from ORIGINAL Sample Code ---
As the largest optical telescope in space, its high resolution and sensitivity allow it to view objects too old, distant, or faint for the Hubble Space Telescope. The James Webb Space Telescope (JWST) is a space telescope designed primarily to conduct infrared astronomy.

--- Evaluation Metrics for ORIGINAL Summary ---
ROUGE Scores:
rouge1