# PART A: Task: Extract and Categorize Tasks from Unannotated Text (70 marks)

## NLP Pipeline for Task Extraction and Categorization

This notebook demonstrates an NLP pipeline that extracts tasks from unstructured text and categorizes them using clustering and topic modeling. The pipeline includes:

1. **Preprocessing:** Clean and segment input text into sentences.
2. **Task Extraction:** Identify task sentences using heuristics and extract details such as performer and deadline.
3. **Clustering:** Compute sentence embeddings, choose the number of clusters with either the _Elbow Method_ or the _Silhouette Score_, then group the tasks with KMeans.
4. **Categorization:** Use LDA topic modeling on each cluster to derive a category label.

In [24]:
# Install required packages
!pip install spacy scikit-learn gensim matplotlib
!python -m spacy download en_core_web_md

Collecting en-core-web-md==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.7.1/en_core_web_md-3.7.1-py3-none-any.whl (42.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [42]:
# Import required libraries and load the spaCy model
import json
import numpy as np
import matplotlib.pyplot as plt
from pprint import pprint

import spacy
from spacy.lang.en.stop_words import STOP_WORDS

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from gensim import corpora, models

# Load spaCy model with medium-sized vectors
nlp = spacy.load('en_core_web_md')



### Utility Functions

In [43]:
def load_text_from_file(filepath):
    """
    Read text from the provided file path.
    """
    try:
        with open(filepath, "r", encoding="utf-8") as file:
            text = file.read()
        return text.strip()
    except Exception as e:
        # Chain the original exception so the root cause stays in the traceback
        raise IOError(f"Error reading file '{filepath}': {e}") from e

def pretty_print_tasks(tasks, title="Tasks:"):
    """
    Pretty-print the list of task dictionaries in JSON format.
    """
    print(f"\n{title}")
    print(json.dumps(tasks, indent=4))

### Preprocessing Functions

In [44]:
def clean_text(text):
    """
    Clean the input text by stripping whitespace.
    """
    return text.strip()

### Task Identification and Extraction

In [45]:
def is_task_sentence(sentence):
    """
    Determine if a sentence likely represents a task.
    Uses two heuristics:
      1. Check if the sentence starts with a base form verb.
      2. Check for common task-related keywords (e.g., "has to", "please", "don't forget").
    """
    sent_text = sentence.text.strip()
    sent_lower = sent_text.lower()

    # Keywords that may indicate a task
    task_keywords = ['has to', 'need to', 'needs to', 'should', 'must', 'please', "don't forget"]

    # Heuristic 1: Check if the first token is a base form verb
    first_token = sentence[0]
    if first_token.pos_ == 'VERB' and first_token.tag_ == 'VB':
        return True

    # Heuristic 2: Look for task-related keywords
    for keyword in task_keywords:
        if keyword in sent_lower:
            return True

    return False

def extract_deadline(sentence):
    """
    Extract deadline information from a sentence by joining the text of
    all entities labeled TIME or DATE, in order of appearance.
    Returns None if the sentence has no such entity.
    """
    deadline_tokens = []
    for ent in sentence.ents:
        if ent.label_ in ["TIME", "DATE"]:
            deadline_tokens.append(ent.text)
    if deadline_tokens:
        return " ".join(deadline_tokens)
    return None

def extract_task_details(sentence):
    """
    Extract details from a task sentence:
      - The full task text
      - The performer (first PERSON entity, if any)
      - The deadline (if any)
    """
    task_text = sentence.text.strip()
    performer = None

    # Find the first PERSON entity
    for ent in sentence.ents:
        if ent.label_ == "PERSON":
            performer = ent.text
            break

    deadline = extract_deadline(sentence)
    return task_text, performer, deadline

def process_text(text):
    """
    Process the input text:
      - Clean the text
      - Segment it into sentences
      - Extract sentences that likely represent tasks
    Returns a list of dictionaries containing task details.
    """
    cleaned_text = clean_text(text)
    doc = nlp(cleaned_text)
    tasks = []

    for sent in doc.sents:
        if is_task_sentence(sent):
            task_text, performer, deadline = extract_task_details(sent)
            tasks.append({
                "task": task_text,
                "performer": performer,
                "deadline": deadline
            })
    return tasks
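
Heuristic 2 in `is_task_sentence` uses plain substring matching, so a keyword can also fire inside a longer word (for example `must` inside `mustard`). The sketch below shows one possible tightening with word-boundary regular expressions. The name `has_task_keyword` is hypothetical and not part of the pipeline above; the keyword list is copied from `is_task_sentence`.

```python
import re

# Same keywords as in is_task_sentence above
TASK_KEYWORDS = ['has to', 'need to', 'needs to', 'should', 'must', 'please', "don't forget"]

# One alternation wrapped in word boundaries, so 'must' no longer matches 'mustard'
TASK_KEYWORD_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in TASK_KEYWORDS) + r")\b",
    re.IGNORECASE,
)

def has_task_keyword(text):
    """Return True if any task keyword occurs as a whole word or phrase (hypothetical helper)."""
    return TASK_KEYWORD_RE.search(text) is not None

print(has_task_keyword("We must finalize the budget by Friday."))  # True
print(has_task_keyword("Pass the mustard."))                        # False
print("must" in "pass the mustard.")                                # True (plain substring match)
```

Whether the stricter match is worth it depends on the input; for short office memos like the sample file the substring check may already be adequate.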


### Clustering and Categorization Functions

In [46]:
def cluster_tasks(tasks, num_clusters):
    """
    Cluster tasks based on their sentence embeddings using KMeans.
    Adds a 'cluster' field to each task.
    Returns the updated tasks, the fitted KMeans model, and the task vectors.
    """
    task_vectors = []
    for task in tasks:
        doc = nlp(task["task"])
        task_vectors.append(doc.vector)
    X = np.array(task_vectors)

    kmeans = KMeans(n_clusters=num_clusters, random_state=42)
    kmeans.fit(X)
    labels = kmeans.labels_

    for i, task in enumerate(tasks):
        task["cluster"] = int(labels[i])
    return tasks, kmeans, X

def label_clusters_with_lda(tasks, num_topics=1):
    """
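    For each cluster, fit an LDA model on the task texts and use the top three
    terms of topic 0 as the category label (with the default num_topics=1 this
    is the only topic). Adds a 'category' field to each task.
    Returns the updated tasks and a dict mapping cluster id to label.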
    """
    # Group tasks by cluster
    cluster_tasks_dict = {}
    for task in tasks:
        cluster = task["cluster"]
        cluster_tasks_dict.setdefault(cluster, []).append(task["task"])

    cluster_labels = {}
    for cluster, sentences in cluster_tasks_dict.items():
        texts = []
        for sentence in sentences:
            doc = nlp(sentence)
            tokens = [
                token.lemma_.lower()
                for token in doc
                if token.is_alpha and token.text.lower() not in STOP_WORDS
            ]
            texts.append(tokens)

        dictionary = corpora.Dictionary(texts)
        corpus = [dictionary.doc2bow(text) for text in texts]

        if len(dictionary) == 0:
            cluster_labels[cluster] = "General"
            continue

        lda_model = models.LdaModel(corpus, num_topics=num_topics, id2word=dictionary, passes=10, random_state=42)
        topic_terms = lda_model.show_topic(0, topn=3)
        label = " ".join([word for word, prob in topic_terms])
        cluster_labels[cluster] = label

    for task in tasks:
        task["category"] = cluster_labels[task["cluster"]]
    return tasks, cluster_labels

def determine_optimal_clusters_elbow(task_vectors, min_clusters=2, max_clusters=10):
    """
    Use the Elbow Method to display a plot of inertia for different cluster counts.
    Adjusts max_clusters if there are fewer samples.
    """
    n_samples = task_vectors.shape[0]

    if n_samples < min_clusters:
        print(f"Warning: Number of samples ({n_samples}) is less than the minimum clusters ({min_clusters}).")
        min_clusters = n_samples
    if n_samples < max_clusters:
        max_clusters = n_samples

    inertias = []
    cluster_range = range(min_clusters, max_clusters + 1)
    for k in cluster_range:
        kmeans = KMeans(n_clusters=k, random_state=42)
        kmeans.fit(task_vectors)
        inertias.append(kmeans.inertia_)

    plt.figure(figsize=(8, 4))
    plt.plot(list(cluster_range), inertias, marker="o")
    plt.title("Elbow Method For Optimal Clusters")
    plt.xlabel("Number of clusters")
    plt.ylabel("Inertia")
    plt.xticks(list(cluster_range))
    plt.grid(True)
    plt.show()

    print("Elbow Method Inertia Values:")
    for k, inertia in zip(cluster_range, inertias):
        print(f"Clusters: {k}, Inertia: {inertia}")

def determine_optimal_clusters_silhouette(task_vectors, min_clusters=2, max_clusters=10):
    """
    Compute the average silhouette score for different numbers of clusters
    and return the optimal number of clusters (with the highest score).
    The silhouette score needs 2 <= k <= n_samples - 1, so the range is
    clamped accordingly; if no valid k remains, 1 is returned.
    """
    n_samples = task_vectors.shape[0]

    # Silhouette is undefined for k < 2 and for k > n_samples - 1
    min_clusters = max(min_clusters, 2)
    max_clusters = min(max_clusters, n_samples - 1)
    if max_clusters < min_clusters:
        print(f"Too few samples ({n_samples}) for silhouette analysis; falling back to 1 cluster.")
        return 1

    silhouette_scores = []
    cluster_range = range(min_clusters, max_clusters + 1)
    for k in cluster_range:
        kmeans = KMeans(n_clusters=k, random_state=42)
        labels = kmeans.fit_predict(task_vectors)
        score = silhouette_score(task_vectors, labels)
        silhouette_scores.append(score)
        print(f"Clusters: {k}, Silhouette Score: {score:.3f}")

    optimal_k = cluster_range[int(np.argmax(silhouette_scores))]
    print(f"\nOptimal number of clusters based on silhouette score: {optimal_k}")
    return optimal_k
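
To show what the silhouette score measures, here is a small pure-Python sketch of the average silhouette coefficient. It is an illustration only, not scikit-learn's implementation; `mean_silhouette` and the toy points are made up for this example, and the function assumes at least two clusters and no set of fully identical points.

```python
import math

def mean_silhouette(points, labels):
    """Average silhouette coefficient over all points (illustrative, O(n^2))."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            # Convention: a point alone in its cluster scores 0
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != label
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Two tight, well-separated groups of toy 2-D points (made-up data)
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
good_labels = [0, 0, 0, 1, 1, 1]
mixed_labels = [0, 1, 0, 1, 0, 1]

print(f"good grouping:  {mean_silhouette(points, good_labels):.3f}")
print(f"mixed grouping: {mean_silhouette(points, mixed_labels):.3f}")
```

A labeling that respects the two groups scores close to 1, while one that mixes them scores much lower, which is why the function above keeps the k with the highest average.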


## Running the Pipeline

In the cell below, we load the input text from a file and run the complete pipeline:

1. **Task Extraction:** Process the input text to extract tasks.
2. **Vectorization:** Compute sentence embeddings for each extracted task.
3. **Optimal Cluster Determination:** Use the Elbow Method (the default in the cell below, which asks for the cluster count after showing the plot) or the Silhouette Score to choose the number of clusters.
4. **Clustering and Categorization:** Cluster the tasks and assign category labels using LDA.
5. **Output:** Display the extracted tasks along with their cluster and category information.
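
With the Elbow Method, the cluster count in step 3 is read off the plot by eye. One common rule of thumb, sketched below, is to pick the k where the inertia curve bends most sharply, measured by the largest second difference. This is an assumption for illustration and not what the notebook does; `pick_elbow` and the inertia values are hypothetical, and the rule presumes evenly spaced k values.

```python
def pick_elbow(ks, inertias):
    """Return the k with the largest second difference of inertia (hypothetical helper).

    Needs at least three (k, inertia) pairs; the first and last k can never be chosen.
    """
    if len(ks) != len(inertias) or len(ks) < 3:
        raise ValueError("need at least three matching (k, inertia) pairs")
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(ks) - 1):
        # Drop before k minus drop after k: large when the curve flattens at k
        bend = (inertias[i - 1] - inertias[i]) - (inertias[i] - inertias[i + 1])
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k

# Made-up inertia values that fall steeply until k=4 and then level off
ks = [2, 3, 4, 5, 6]
inertias = [100.0, 60.0, 30.0, 27.0, 25.0]
print(pick_elbow(ks, inertias))  # 4
```

On small or noisy inputs the bend can be ambiguous, so checking the plot is still advisable.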


In [48]:
# Load input text from file "input.txt"
# (Ensure that input.txt is already uploaded to the Colab working directory; a sample file is listed under "Sample input.txt used" below.)
input_text = load_text_from_file("/content/input.txt")

# Process the input text to extract tasks
tasks = process_text(input_text)
pretty_print_tasks(tasks, title="Extracted Tasks:")

if not tasks:
    print("No tasks found in the input.")
else:
    # Compute task vectors for clustering
    task_vectors = []
    for task in tasks:
        doc = nlp(task["task"])
        task_vectors.append(doc.vector)
    task_vectors = np.array(task_vectors)

    # Choose the method to determine the number of clusters: 'elbow' or 'silhouette'
    method = "elbow"

    if method == "elbow":
        print("\nDetermining the optimal number of clusters using the Elbow Method...")
        determine_optimal_clusters_elbow(task_vectors, min_clusters=2, max_clusters=10)
        # Manually enter the desired number of clusters after reviewing the plot
        num_clusters = int(input("Based on the elbow plot, enter the desired number of clusters: "))
    else:
        print("\nDetermining the optimal number of clusters using the Silhouette Score...")
        num_clusters = determine_optimal_clusters_silhouette(task_vectors, min_clusters=2, max_clusters=10)

    # Cluster the tasks and assign category labels using LDA
    tasks, kmeans, _ = cluster_tasks(tasks, num_clusters)
    tasks, cluster_labels = label_clusters_with_lda(tasks, num_topics=1)

    pretty_print_tasks(tasks, title="Tasks with Categories:")



Extracted Tasks:
[
    {
        "task": "Meanwhile, we must finalize the budget by Friday.",
        "performer": null,
        "deadline": "Friday"
    },
    {
        "task": "Rahul and Priya need to schedule the client call by Monday.",
        "performer": "Priya",
        "deadline": "Monday"
    },
    {
        "task": "Remember to check the new version of the software for bugs.",
        "performer": null,
        "deadline": null
    },
    {
        "task": "The CEO insisted that marketing should distribute the new brochures next week.",
        "performer": null,
        "deadline": "next week"
    },
    {
        "task": "Bob must complete the security review by 5 PM today.",
        "performer": "Bob",
        "deadline": "5 PM today"
    },
    {
        "task": "Also, Michael should deliver the updated contract by email.",
        "performer": "Michael",
        "deadline": null
    },
    {
        "task": "The CFO said that the financial forecast has to be updated 

### Sample input.txt used
___

Team 3 will handle the documentation for the new project next week. John is traveling to New York tomorrow.
Meanwhile, we must finalize the budget by Friday. Lucy said she would handle the monthly summary.
Rahul and Priya need to schedule the client call by Monday. Everyone is excited for the upcoming holiday.
Remember to check the new version of the software for bugs. The CEO insisted that marketing should distribute
the new brochures next week. Bob must complete the security review by 5 PM today. This is critical for the next release.

I want you to vacuum the office space before the guests arrive. The project manager asked us to prepare the slides
for tomorrow's presentation. Also, Michael should deliver the updated contract by email. The developers are working
on bug fixes. Let's not forget to restock the pantry tomorrow morning. The CFO said that the financial forecast has
to be updated by the end of this quarter. Susan is expecting a call from the vendor at 2 PM. We have to decide on
a new coffee machine within the next few days. Harriet wants to plan a team outing next month. Could you please
forward that email to HR? The security team has recommended that we patch the servers by the end of the day.
Everyone is required to sign the new policy documents. Move the old chairs to the storage room.
We need to keep track of all the software licenses before they expire. Make sure to call the electrician
to fix the broken light. I want you to archive the 2022 records. Maria must write the blog post by Tuesday next week.

Also, the design team should finalize the logo. Prepare a backup copy of all website assets.
The cleaning crew must reorder supplies. Travis will handle the catering. If anyone can pick up the interns
from the train station, let me know. The CEO has to review the partnership agreement. We expect the new interns
to fill out their paperwork immediately. The marketing team should post the social media updates every morning.
Meanwhile, Rahul goes for coffee every day. There's a possibility that tomorrow might be a holiday.
We must still complete the daily report. The manager wants to measure the team's performance.
We need to deliver these tasks on time. Write the release notes for the software patch.
Timmy must update the user guide with the new features by next Wednesday. The design team is not working
on that today. The finance team has to finalize the Q1 reports by the end of the month. We should not forget
to track all deliverables in the project management tool. John is hosting a lunch session. Anna might join him.
It's crucial that the legal team reviews the compliance guidelines ASAP. We need to send them our feedback.

# PART B: Build a machine learning model to classify customer reviews as positive or negative. (30 marks)

### Step 1: Setup and Library Imports

In [17]:
import nltk
import re, string, random
import numpy as np
import pandas as pd

# Download required NLTK resources
nltk.download('movie_reviews')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

from nltk.corpus import movie_reviews, stopwords
from nltk.stem import WordNetLemmatizer

from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, classification_report

[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


### Step 2: Data Collection

In [18]:
# We use the NLTK movie_reviews dataset as a proxy for customer reviews.
# Dataset link: http://www.nltk.org/book/ch02.html
# Each review is labeled as either 'pos' (positive) or 'neg' (negative).

# Load the dataset and shuffle it for randomness.
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.seed(42)  # seed the shuffle so repeated runs produce the same document order
random.shuffle(documents)

# Combine token lists back into strings and create labels:
texts = [' '.join(words) for words, label in documents]
labels = [1 if label == 'pos' else 0 for words, label in documents]  # 1: positive, 0: negative

### Step 3: Data Preprocessing

In [19]:
# Create a function to clean the text:
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def clean_text(text):
    """
    Clean the input text by:
      - Converting to lowercase.
      - Removing digits.
      - Removing punctuation.
      - Removing extra whitespace.
      - Tokenizing, removing stop words, and lemmatizing tokens.
    """
    # Convert text to lowercase
    text = text.lower()

    # Remove digits
    text = re.sub(r'\d+', '', text)

    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))

    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()

    # Tokenize and remove stop words; then lemmatize each token
    tokens = text.split()
    tokens = [lemmatizer.lemmatize(token) for token in tokens if token not in stop_words]

    return ' '.join(tokens)

# Apply the cleaning function to all texts
clean_texts = [clean_text(text) for text in texts]
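
To make the order of the cleaning steps concrete, here is a dependency-free sketch of the same sequence on one sentence. It assumes a tiny made-up stop-word set (`DEMO_STOP_WORDS`) instead of the NLTK list and skips lemmatization, so its output will differ from `clean_text` on real reviews; `basic_clean` is a hypothetical name.

```python
import re
import string

# Tiny illustrative stop-word set (assumption; the notebook uses NLTK's English list)
DEMO_STOP_WORDS = {"the", "were", "them"}

def basic_clean(text):
    """Lowercase, drop digits and punctuation, collapse whitespace, remove stop words."""
    text = text.lower()
    text = re.sub(r"\d+", "", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\s+", " ", text).strip()
    return " ".join(t for t in text.split() if t not in DEMO_STOP_WORDS)

print(basic_clean("The 2 movies were GREAT!!!  Loved   them."))  # movies great loved
```

Note that punctuation is stripped before stop words are checked, so contractions are compared without their apostrophes.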

### Step 4: Train/Test Split

In [20]:
# Split the dataset into training (80%) and testing (20%) sets.
X_train, X_test, y_train, y_test = train_test_split(clean_texts, labels, test_size=0.2, random_state=42)


### Step 5: Model Selection, Training, and Evaluation

In [21]:
# We will build three pipelines, one for each model.
# Each pipeline uses TF-IDF to convert text to numerical features and then applies a classifier.
# We use a small grid search to tune hyperparameters.

# Create a list to store evaluation results for each model.
results = []

###########################################
# Model 1: Logistic Regression
###########################################
pipeline_lr = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1,2), min_df=5, max_df=0.9, max_features=5000)),
    ('clf', LogisticRegression(max_iter=1000, random_state=42))
])
# Hyperparameter grid for Logistic Regression (tuning regularization strength C)
param_grid_lr = {
    'clf__C': [0.1, 1]
}
grid_lr = GridSearchCV(pipeline_lr, param_grid_lr, cv=3, scoring='accuracy', n_jobs=-1)
grid_lr.fit(X_train, y_train)
y_pred_lr = grid_lr.predict(X_test)

# Calculate metrics
acc_lr = accuracy_score(y_test, y_pred_lr)
prec_lr = precision_score(y_test, y_pred_lr)
rec_lr = recall_score(y_test, y_pred_lr)

results.append({
    'Model': 'Logistic Regression',
    'Accuracy': acc_lr,
    'Precision': prec_lr,
    'Recall': rec_lr,
    'Best Params': grid_lr.best_params_
})

print("Logistic Regression Best Params:", grid_lr.best_params_)

###########################################
# Model 2: Multinomial Naive Bayes
###########################################
pipeline_nb = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1,2), min_df=5, max_df=0.9, max_features=5000)),
    ('clf', MultinomialNB())
])
# Hyperparameter grid for Naive Bayes (tuning the smoothing parameter alpha)
param_grid_nb = {
    'clf__alpha': [0.5, 1.0]
}
grid_nb = GridSearchCV(pipeline_nb, param_grid_nb, cv=3, scoring='accuracy', n_jobs=-1)
grid_nb.fit(X_train, y_train)
y_pred_nb = grid_nb.predict(X_test)

# Calculate metrics
acc_nb = accuracy_score(y_test, y_pred_nb)
prec_nb = precision_score(y_test, y_pred_nb)
rec_nb = recall_score(y_test, y_pred_nb)

results.append({
    'Model': 'Multinomial NB',
    'Accuracy': acc_nb,
    'Precision': prec_nb,
    'Recall': rec_nb,
    'Best Params': grid_nb.best_params_
})

print("Multinomial NB Best Params:", grid_nb.best_params_)

###########################################
# Model 3: Support Vector Machine (SVM)
###########################################
pipeline_svm = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1,2), min_df=5, max_df=0.9, max_features=5000)),
    # probability=True is omitted: predict_proba is never called and enabling it slows training
    ('clf', SVC(random_state=42))
])
# Hyperparameter grid for SVM (tuning regularization parameter C and kernel)
param_grid_svm = {
    'clf__C': [0.1, 1],
    'clf__kernel': ['linear']  # Using only the linear kernel for speed
}
grid_svm = GridSearchCV(pipeline_svm, param_grid_svm, cv=3, scoring='accuracy', n_jobs=-1)
grid_svm.fit(X_train, y_train)
y_pred_svm = grid_svm.predict(X_test)

# Calculate metrics
acc_svm = accuracy_score(y_test, y_pred_svm)
prec_svm = precision_score(y_test, y_pred_svm)
rec_svm = recall_score(y_test, y_pred_svm)

results.append({
    'Model': 'SVM',
    'Accuracy': acc_svm,
    'Precision': prec_svm,
    'Recall': rec_svm,
    'Best Params': grid_svm.best_params_
})

print("SVM Best Params:", grid_svm.best_params_)

Logistic Regression Best Params: {'clf__C': 1}
Multinomial NB Best Params: {'clf__alpha': 1.0}
SVM Best Params: {'clf__C': 1, 'clf__kernel': 'linear'}


### Step 6: Results Table and Best Model Selection

In [22]:
# Create a results DataFrame and sort by Accuracy.
results_df = pd.DataFrame(results)
results_df = results_df[['Model', 'Accuracy', 'Precision', 'Recall', 'Best Params']]
results_df = results_df.sort_values(by='Accuracy', ascending=False)
print("\n=== Comparison of Models ===")
print(results_df.to_string(index=False))

# Determine the best performing model (by highest Accuracy)
best_model_name = results_df.iloc[0]['Model']
print("\nBest performing model:", best_model_name)

# Print detailed classification report for the best model
print("\n=== Detailed Classification Report ===")
# Map each model name to its test-set predictions
predictions_by_model = {
    'Logistic Regression': y_pred_lr,
    'Multinomial NB': y_pred_nb,
    'SVM': y_pred_svm,
}
best_preds = predictions_by_model[best_model_name]

print(classification_report(y_test, best_preds))


=== Comparison of Models ===
              Model  Accuracy  Precision   Recall                            Best Params
                SVM    0.8425   0.821782 0.860104 {'clf__C': 1, 'clf__kernel': 'linear'}
Logistic Regression    0.8300   0.817259 0.834197                          {'clf__C': 1}
     Multinomial NB    0.8075   0.808511 0.787565                    {'clf__alpha': 1.0}

Best performing model: SVM

=== Detailed Classification Report ===
              precision    recall  f1-score   support

           0       0.86      0.83      0.84       207
           1       0.82      0.86      0.84       193

    accuracy                           0.84       400
   macro avg       0.84      0.84      0.84       400
weighted avg       0.84      0.84      0.84       400
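
The report's numbers can be tied back to raw counts. The counts below are not printed by the notebook; they are inferred from the reported SVM precision and recall together with the class supports (207 negative, 193 positive), so treat this as a worked consistency check rather than logged output.

```python
# Confusion-matrix counts inferred from the reported SVM metrics (not notebook output)
tp, fp = 166, 36   # predicted positive: correct / incorrect
fn, tn = 27, 171   # predicted negative: incorrect / correct

precision = tp / (tp + fp)                   # share of predicted positives that are right
recall = tp / (tp + fn)                      # share of actual positives that are found
accuracy = (tp + tn) / (tp + fp + fn + tn)   # share of all reviews classified correctly

print(f"precision={precision:.6f} recall={recall:.6f} accuracy={accuracy:.4f}")
# precision=0.821782 recall=0.860104 accuracy=0.8425
```

Under these counts the model misses 27 positive reviews and wrongly flags 36 negative ones, which is the trade-off that precision and recall summarize.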



## Discussion on Model Improvement:
___
- The TF-IDF representation was chosen because it gives higher weight to words that are
  frequent in a document but rare across the corpus, and lower weight to words that occur
  in most documents, which makes the features more discriminative than raw Bag-of-Words counts.
- Although all three models are fairly simple, further improvements could be made by:
  - Experimenting with more hyperparameter combinations or using more folds in cross-validation.
  - Extending the text preprocessing: lemmatization and bigrams are already applied (`ngram_range=(1,2)`), so trigrams or domain-specific stop word removal are the remaining options.
  - Trying ensemble methods that combine multiple classifiers.
  - Collecting and incorporating more data to improve generalization.
- For this assignment, the best model is chosen based on Accuracy, but precision and recall are also important depending on the business need (e.g., minimizing false negatives).
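
The TF-IDF point in the first bullet can be illustrated with a small worked example. The sketch below computes inverse document frequency by hand with the smoothed formula that scikit-learn's `TfidfVectorizer` applies by default, idf(t) = ln((1 + n) / (1 + df(t))) + 1. The three mini reviews and the name `smoothed_idf` are made up for illustration.

```python
import math

# Three made-up mini reviews, already tokenized
docs = [
    ["movie", "great", "acting"],
    ["movie", "boring", "plot"],
    ["movie", "great", "plot"],
]

def smoothed_idf(term, docs):
    """idf(t) = ln((1 + n) / (1 + df(t))) + 1, where df(t) counts documents containing t."""
    n = len(docs)
    df = sum(1 for doc in docs if term in doc)
    return math.log((1 + n) / (1 + df)) + 1

for term in ["movie", "great", "boring"]:
    print(f"{term}: idf={smoothed_idf(term, docs):.3f}")
# movie: idf=1.000
# great: idf=1.288
# boring: idf=1.693
```

A word found in every review ("movie") gets the minimum weight, while a word found in only one ("boring") is weighted most heavily. The vectorizer then multiplies these values by the term counts and, by default, L2-normalizes each document vector.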