<a href="https://colab.research.google.com/github/Surya2004-janardhan/colab/blob/main/taskactionpipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Task: Extract and Categorize Tasks from Unannotated Text**


Importing the necessary modules and packages

In [None]:
import nltk
from nltk.corpus import stopwords
import re
import numpy as np  # needed by np.mean in the Step 3 clustering function
from sklearn.cluster import KMeans
from gensim.models import Word2Vec
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Download NLTK tokenizer resources
nltk.download('punkt')
nltk.download('punkt_tab')



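This section cleans the input text: it converts the text to lowercase, tokenizes it into words with `nltk.word_tokenize`, and keeps only alphanumeric tokens, which drops punctuation. The result is a list of clean tokens ready for further processing. Note that `generate_output` (Step 5) does not call this function; task identification runs on the raw sentences.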

In [None]:
# Step 1: Preprocessing
def preprocess_text(text):
    text = text.lower()
    tokens = nltk.word_tokenize(text)
    tokens = [word for word in tokens if word.isalnum()]
    return tokens
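
To make the effect of these steps concrete, here is a minimal, dependency-free sketch that approximates them with a regular expression instead of `nltk.word_tokenize`. The name `preprocess_text_regex` is hypothetical and not part of the notebook. It only keeps ASCII letters and digits, so its handling of hyphenated words, contractions, and accented characters may differ from the NLTK-based function above.

```python
import re

def preprocess_text_regex(text):
    """Approximate preprocess_text without NLTK (illustrative assumption)."""
    # Lowercase first so tokens compare consistently.
    text = text.lower()
    # Keep runs of ASCII letters and digits; everything else acts as a separator.
    return re.findall(r"[a-z0-9]+", text)

tokens = preprocess_text_regex("Rahul must submit the report by 6 pm!")
print(tokens)
```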

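This function identifies actionable sentences as tasks using heuristic rules. Each sentence earns one point for starting with an action verb, for containing an assignment phrase (such as "has to" or "must"), for containing a deadline keyword (such as "by" or "before"), and for mentioning the name "Rahul". Sentences scoring 2 or more are treated as tasks. Apart from the name check, these checks are case-sensitive substring matches.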

In [None]:
# Step 2: Task Identification
def identify_tasks(sentences):
    tasks = []
    deadline_keywords = ["by", "before", "after", "tomorrow", "today", "this evening"]
    assignment_keywords = ["has to", "needs to", "is responsible for", "should", "must", "need to"]

    for sentence in sentences:
        score = 0
        if any(sentence.startswith(verb) for verb in ["buy", "clean", "submit", "send"]):
            score += 1
        if any(word in sentence for word in assignment_keywords):
            score += 1
        if any(word in sentence for word in deadline_keywords):
            score += 1
        if "rahul" in sentence.lower():
            score += 1
        if score >= 2:
            tasks.append(sentence.strip())
    return tasks
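
The scoring rules above can be traced sentence by sentence. The sketch below repeats the same four checks but returns which rules fired; `score_sentence` and the upper-case constant names are hypothetical helpers introduced only for this illustration, and the sample sentences come from the notebook's second example text.

```python
DEADLINE_KEYWORDS = ["by", "before", "after", "tomorrow", "today", "this evening"]
ASSIGNMENT_KEYWORDS = ["has to", "needs to", "is responsible for", "should", "must", "need to"]
ACTION_VERBS = ["buy", "clean", "submit", "send"]

def score_sentence(sentence):
    """Return (rules, score): which heuristic rules fire and how many."""
    rules = {
        "starts_with_verb": any(sentence.startswith(v) for v in ACTION_VERBS),
        "assignment": any(k in sentence for k in ASSIGNMENT_KEYWORDS),
        "deadline": any(k in sentence for k in DEADLINE_KEYWORDS),
        "mentions_rahul": "rahul" in sentence.lower(),
    }
    # Each rule that fires contributes one point (True counts as 1).
    return rules, sum(rules.values())

samples = [
    "Later, Rahul has to call the plumber to fix the leaking tap.",
    "By 2 pm, he must finish drafting the project proposal.",
]
for s in samples:
    rules, score = score_sentence(s)
    fired = [name for name, hit in rules.items() if hit]
    print(score, fired, "task" if score >= 2 else "not a task")
```

Because the keyword checks are case-sensitive, the capitalized "By" in the second sentence does not count as a deadline keyword, so that sentence scores only 1 and is not reported as a task.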

This section trains a Word2Vec model on the tasks themselves, averages each task's word vectors into a single embedding, and clusters the embeddings with K-Means (at most three clusters). The numerical cluster IDs are then renamed to generic labels such as "Category A".

In [None]:
# Step 3: Categorization with Word Embeddings
def cluster_tasks_with_word_embeddings(tasks):
    if len(tasks) <= 1:
        return {"Uncategorized": tasks}

    tokenized_tasks = [task.split() for task in tasks]
    model = Word2Vec(sentences=tokenized_tasks, vector_size=100, window=5, min_count=1, workers=4)
    embeddings = []
    for task in tokenized_tasks:
        vec = np.mean([model.wv[word] for word in task if word in model.wv], axis=0)
        embeddings.append(vec)

    n_clusters = min(len(tasks), 3)
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    clusters = kmeans.fit_predict(embeddings)

    categories = {}
    for i, task in enumerate(tasks):
        cluster_id = clusters[i]
        if cluster_id not in categories:
            categories[cluster_id] = []
        categories[cluster_id].append(task)

    renamed_categories = {}
    category_names = ["Category A", "Category B", "Category C"][:len(categories)]
    for i, (key, value) in enumerate(categories.items()):
        renamed_categories[category_names[i]] = value
    return renamed_categories
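
The averaging step (the line that computes `vec`) can be illustrated without training a model. The sketch below uses hand-written two-dimensional vectors (invented values, not real Word2Vec output) and a hypothetical helper `average_vectors` to show that a task's embedding is the element-wise mean of its word vectors.

```python
# Toy word vectors; the values are invented purely for illustration.
toy_vectors = {
    "send": [1.0, 0.0],
    "email": [0.0, 1.0],
    "manager": [1.0, 1.0],
}

def average_vectors(words, vectors):
    """Element-wise mean of the vectors for the words found in `vectors`."""
    found = [vectors[w] for w in words if w in vectors]
    if not found:
        return None  # no known words: the caller must decide how to handle this
    dim = len(found[0])
    return [sum(v[i] for v in found) / len(found) for i in range(dim)]

# "to" has no vector here, so it is skipped and three vectors are averaged.
task_vec = average_vectors("send email to manager".split(), toy_vectors)
print(task_vec)
```

In the function above, the model is trained on the same tasks with `min_count=1`, so every word of a task is in the vocabulary and no word is skipped.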

This function uses Latent Dirichlet Allocation (LDA) to discover up to three topics in the tasks. Each task is assigned to its most probable topic, and the numerical topic indices are renamed to generic labels such as "Topic A".

In [None]:
# Step 4: Categorization with LDA
def categorize_tasks_with_lda(tasks):
    if not tasks:
        return {"Uncategorized": []}

    if len(tasks) <= 1:
        return {"Uncategorized": tasks}

    vectorizer = CountVectorizer(max_df=1.0, min_df=1, stop_words='english')
    X = vectorizer.fit_transform(tasks)

    n_topics = min(len(tasks), 3)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=42)
    lda.fit(X)

    topic_probabilities = lda.transform(X)
    categories = {}
    for i, task in enumerate(tasks):
        topic = topic_probabilities[i].argmax()
        if topic not in categories:
            categories[topic] = []
        categories[topic].append(task)

    renamed_categories = {}
    category_names = ["Topic A", "Topic B", "Topic C"][:len(categories)]
    for i, (key, value) in enumerate(categories.items()):
        renamed_categories[category_names[i]] = value
    return renamed_categories
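
The assignment and renaming logic can be followed on a small, hand-written example. The probabilities below are invented for illustration (they are not output from `lda.transform`), and the task strings are shortened placeholders.

```python
# Invented topic probabilities for three tasks over two topics (rows sum to 1).
topic_probabilities = [
    [0.8, 0.2],
    [0.7, 0.3],
    [0.1, 0.9],
]
tasks = ["send an email", "pick up dry cleaning", "prepare a presentation"]

# Group each task under its most probable topic index.
categories = {}
for task, probs in zip(tasks, topic_probabilities):
    topic = max(range(len(probs)), key=probs.__getitem__)  # plain-Python argmax
    categories.setdefault(topic, []).append(task)

# Relabel topic indices in order of first appearance, as the function above does.
labels = ["Topic A", "Topic B", "Topic C"]
renamed = {labels[i]: group for i, group in enumerate(categories.values())}
print(renamed)
```

Because labels follow the order in which topics first appear, "Topic A" is whichever topic the first task falls into, not necessarily topic index 0.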

This function combines all previous steps to generate the final output. It extracts tasks, categorizes them using both word embeddings (K-Means) and LDA, and formats the results into a structured dictionary.

In [None]:
# Step 5: Output Generation
def generate_output(text):
    sentences = nltk.sent_tokenize(text)
    tasks = identify_tasks(sentences)

    if not tasks:
        return {
            "Tasks": [],
            "Categories_Word_Embeddings": {},
            "Categories_LDA": {}
        }

    categories_word_embeddings = cluster_tasks_with_word_embeddings(tasks)
    categories_lda = categorize_tasks_with_lda(tasks)

    output = {
        "Tasks": [{"Task": task} for task in tasks],
        "Categories_Word_Embeddings": categories_word_embeddings,
        "Categories_LDA": categories_lda
    }
    return output
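
The shape of the returned dictionary is easiest to see written out. The sketch below is a hand-written dictionary in that shape, with values copied from the notebook's recorded output for the first example text; serializing it with `json.dumps` is an assumption about how a caller might consume it, not something the notebook does.

```python
import json

# Hand-written dictionary in the shape generate_output returns.
example_output = {
    "Tasks": [
        {"Task": "He needs to send an email to his manager by noon."},
        {"Task": "Also, Rahul has to pick up his dry cleaning by 6 pm."},
        {"Task": "Tomorrow, Rahul must prepare a presentation for the team meeting."},
    ],
    "Categories_Word_Embeddings": {
        "Category A": ["He needs to send an email to his manager by noon."],
        "Category B": ["Also, Rahul has to pick up his dry cleaning by 6 pm."],
        "Category C": ["Tomorrow, Rahul must prepare a presentation for the team meeting."],
    },
    "Categories_LDA": {
        "Topic A": [
            "He needs to send an email to his manager by noon.",
            "Also, Rahul has to pick up his dry cleaning by 6 pm.",
        ],
        "Topic B": ["Tomorrow, Rahul must prepare a presentation for the team meeting."],
    },
}

# All keys are strings, so the structure serializes to JSON directly.
serialized = json.dumps(example_output, indent=2)
print(serialized)
```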

This section demonstrates the pipeline's functionality with an example input text. It prints the identified tasks and their categories derived from word embeddings and LDA in a readable format.

In [None]:
# Example Usage
if __name__ == "__main__":
    text = """
    Rahul bought groceries yesterday. Today, he plans to organize his study room.
    He needs to send an email to his manager by noon. Also, Rahul has to pick up his dry cleaning by 6 pm.
    Tomorrow, Rahul must prepare a presentation for the team meeting. In the evening, he will meet friends at the café."""
    result = generate_output(text)
    print("Output:")
    print("------")
    print("Tasks:")
    for task in result["Tasks"]:
        print(f"- {task['Task']}")
    print("\nCategories (Word Embeddings):")
    for category, tasks in result["Categories_Word_Embeddings"].items():
        print(f"- {category}: {', '.join(tasks)}")
    print("\nCategories (LDA):")
    for category, tasks in result["Categories_LDA"].items():
        print(f"- {category}: {', '.join(tasks)}")

**Summary of Key Sections**

1. **Preprocessing**: Cleans and tokenizes the input text.
2. **Task Identification**: Extracts actionable sentences as tasks using heuristic rules.
3. **Word Embeddings Clustering**: Groups tasks into categories based on semantic similarity using K-Means.
4. **LDA Topic Modeling**: Dynamically discovers topics in tasks and assigns them to categories.
5. **Output Generation**: Combines all results into a structured dictionary and formats the output for readability.
6. **Example Usage**: Demonstrates the pipeline with a sample input and prints the results.

This modular approach ensures clarity, flexibility, and robustness in extracting and categorizing tasks from unstructured text.


**Overall code:**

In [6]:
import nltk
from nltk.corpus import stopwords
import re
import numpy as np
from sklearn.cluster import KMeans
from gensim.models import Word2Vec
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Download NLTK resources
nltk.download('punkt')
nltk.download('punkt_tab')


# Step 1: Preprocessing
def preprocess_text(text):
    text = text.lower()
    tokens = nltk.word_tokenize(text)
    tokens = [word for word in tokens if word.isalnum()]
    return tokens

# Step 2: Task Identification
def identify_tasks(sentences):
    tasks = []
    deadline_keywords = ["by", "before", "after", "tomorrow", "today", "this evening"]
    assignment_keywords = ["has to", "needs to", "is responsible for", "should", "must", "need to"]

    for sentence in sentences:
        score = 0
        if any(sentence.startswith(verb) for verb in ["buy", "clean", "submit", "send"]):
            score += 1
        if any(word in sentence for word in assignment_keywords):
            score += 1
        if any(word in sentence for word in deadline_keywords):
            score += 1
        if "rahul" in sentence.lower():
            score += 1
        if score >= 2:
            tasks.append(sentence.strip())  # Remove extra whitespace
    return tasks

# Step 3: Categorization with Word Embeddings
def cluster_tasks_with_word_embeddings(tasks):
    if len(tasks) <= 1:  # Skip clustering if there's only one task
        return {"Uncategorized": tasks}

    tokenized_tasks = [task.split() for task in tasks]
    model = Word2Vec(sentences=tokenized_tasks, vector_size=100, window=5, min_count=1, workers=4)
    embeddings = []
    for task in tokenized_tasks:
        vec = np.mean([model.wv[word] for word in task if word in model.wv], axis=0)
        embeddings.append(vec)

    # Dynamically adjust n_clusters
    n_clusters = min(len(tasks), 3)  # Limit to 3 clusters or the number of tasks
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    clusters = kmeans.fit_predict(embeddings)

    categories = {}
    for i, task in enumerate(tasks):
        cluster_id = clusters[i]
        if cluster_id not in categories:
            categories[cluster_id] = []
        categories[cluster_id].append(task)

    # Rename numerical keys to descriptive category names
    renamed_categories = {}
    category_names = ["Category A", "Category B", "Category C"][:len(categories)]
    for i, (key, value) in enumerate(categories.items()):
        renamed_categories[category_names[i]] = value
    return renamed_categories

# Step 4: Categorization with LDA
def categorize_tasks_with_lda(tasks):
    if not tasks:
        return {"Uncategorized": []}

    if len(tasks) <= 1:  # Skip LDA if there's only one task
        return {"Uncategorized": tasks}

    vectorizer = CountVectorizer(max_df=1.0, min_df=1, stop_words='english')
    X = vectorizer.fit_transform(tasks)

    # Dynamically adjust n_components
    n_topics = min(len(tasks), 3)  # Limit to 3 topics or the number of tasks
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=42)
    lda.fit(X)

    topic_probabilities = lda.transform(X)
    categories = {}
    for i, task in enumerate(tasks):
        topic = topic_probabilities[i].argmax()
        if topic not in categories:
            categories[topic] = []
        categories[topic].append(task)

    # Rename numerical keys to descriptive category names
    renamed_categories = {}
    category_names = ["Topic A", "Topic B", "Topic C"][:len(categories)]
    for i, (key, value) in enumerate(categories.items()):
        renamed_categories[category_names[i]] = value
    return renamed_categories

# Step 5: Output Generation
def generate_output(text):
    sentences = nltk.sent_tokenize(text)
    tasks = identify_tasks(sentences)

    if not tasks:  # Handle case where no tasks are identified
        return {
            "Tasks": [],
            "Categories_Word_Embeddings": {},
            "Categories_LDA": {}
        }

    categories_word_embeddings = cluster_tasks_with_word_embeddings(tasks)
    categories_lda = categorize_tasks_with_lda(tasks)

    output = {
        "Tasks": [{"Task": task} for task in tasks],
        "Categories_Word_Embeddings": categories_word_embeddings,
        "Categories_LDA": categories_lda
    }
    return output

# Example Usage
if __name__ == "__main__":
    text = """Rahul finished his homework last night. In the morning, he needs to drop off the package at the post office.
Later, Rahul has to call the plumber to fix the leaking tap. By 2 pm, he must finish drafting the project proposal.
In the evening, Rahul plans to cook dinner for his family. Also, he should water the plants before leaving home.."""
    result = generate_output(text)
    print("Output:")
    print("------")
    print("Tasks:")
    for task in result["Tasks"]:
        print(f"- {task['Task']}")
    print("\nCategories (Word Embeddings):")
    for category, tasks in result["Categories_Word_Embeddings"].items():
        print(f"- {category}: {', '.join(tasks)}")
    print("\nCategories (LDA):")
    for category, tasks in result["Categories_LDA"].items():
        print(f"- {category}: {', '.join(tasks)}")

Output:
------
Tasks:
- Later, Rahul has to call the plumber to fix the leaking tap.
- Also, he should water the plants before leaving home..

Categories (Word Embeddings):
- Category A: Later, Rahul has to call the plumber to fix the leaking tap.
- Category B: Also, he should water the plants before leaving home..

Categories (LDA):
- Topic A: Later, Rahul has to call the plumber to fix the leaking tap.
- Topic B: Also, he should water the plants before leaving home..


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


Output 1 (for the first example text, beginning "Rahul bought groceries yesterday."):

Tasks:
- He needs to send an email to his manager by noon.
- Also, Rahul has to pick up his dry cleaning by 6 pm.
- Tomorrow, Rahul must prepare a presentation for the team meeting.

Categories (Word Embeddings):
- Category A: He needs to send an email to his manager by noon.
- Category B: Also, Rahul has to pick up his dry cleaning by 6 pm.
- Category C: Tomorrow, Rahul must prepare a presentation for the team meeting.

Categories (LDA):
- Topic A: He needs to send an email to his manager by noon., Also, Rahul has to pick up his dry cleaning by 6 pm.
- Topic B: Tomorrow, Rahul must prepare a presentation for the team meeting.

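Output 2 (for the example text in the overall code):

"""Rahul finished his homework last night. In the morning, he needs to drop off the package at the post office.
Later, Rahul has to call the plumber to fix the leaking tap. By 2 pm, he must finish drafting the project proposal.
In the evening, Rahul plans to cook dinner for his family. Also, he should water the plants before leaving home.."""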

Output:
------
Tasks:
- Later, Rahul has to call the plumber to fix the leaking tap.
- Also, he should water the plants before leaving home..

Categories (Word Embeddings):
- Category A: Later, Rahul has to call the plumber to fix the leaking tap.
- Category B: Also, he should water the plants before leaving home..

Categories (LDA):
- Topic A: Later, Rahul has to call the plumber to fix the leaking tap.
- Topic B: Also, he should water the plants before leaving home..

**Challenges:**

1. **Task Identification**: Developing heuristics to accurately detect tasks amidst varied sentence structures and ambiguous language.
2. **Small Dataset Issues**: Clustering algorithms struggled with limited data, requiring dynamic adjustments to parameters like `n_clusters` and `n_topics`.
3. **Sparse Embeddings**: With only a handful of short tasks, Word2Vec had a very limited vocabulary and little data to learn from, which weakened the word embeddings and reduced clustering effectiveness.
4. **Interpretability**: Ensuring clusters and topics from LDA and K-Means were meaningful and user-friendly required additional renaming logic.
5. **Edge Cases & Dependencies**: Handling edge cases (e.g., no tasks or one task) and managing external library dependencies added complexity to the pipeline.
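
One possible refinement for the first challenge, assuming whole-word, case-insensitive keyword matching is the desired behavior, is to compile the keywords into a single regular expression. The helper name `has_deadline` is hypothetical; the keyword list is the one used in `identify_tasks`.

```python
import re

DEADLINE_KEYWORDS = ["by", "before", "after", "tomorrow", "today", "this evening"]

# One pattern matching any keyword as a whole word, ignoring case.
deadline_pattern = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in DEADLINE_KEYWORDS) + r")\b",
    re.IGNORECASE,
)

def has_deadline(sentence):
    """True if the sentence contains a deadline keyword as a whole word."""
    return deadline_pattern.search(sentence) is not None

print(has_deadline("By 2 pm, he must finish the proposal."))  # True
print(has_deadline("The shop is nearby."))                    # False
```

Adopting this check would change the recorded outputs: for example, "By 2 pm, he must finish drafting the project proposal." would gain the deadline point, reach a score of 2, and be reported as a task.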