# Jira Comments Analysis – Sentiment & Themes

This notebook loads Jira comments from a CSV file and walks through:

1. **Loading & inspecting the data**
2. **Cleaning / preprocessing the comment text**
3. **Running sentiment analysis with a HuggingFace model**
4. **Clustering comments into themes (reasons for delay) using TF-IDF + KMeans**
5. **Summarizing the top reasons issues weren't completed**

> 🔧 Before you start: update the `csv_path` variable in **Cell 2** so it points to the `JiraComments_FromJql.csv` file on your machine (e.g. `C:\Users\Andy\Desktop\JiraComments_FromJql.csv`).

In [None]:
# Cell 1 – Imports and (optional) installs
# Run this cell first.

# If you don't have these installed yet, uncomment the pip commands below
# (remove the leading '#') and run once.

# !pip install pandas matplotlib scikit-learn transformers sentencepiece accelerate -q

import os
import math
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

plt.style.use("ggplot")

In [None]:
# Cell 2 – Load CSV (update csv_path for your machine)

# 👉 Change this to the actual path of your CSV file.
# Example:
# csv_path = r"C:\Users\Andy\Desktop\JiraComments_FromJql.csv"

csv_path = r"C:\Users\Andy\Desktop\Text Analysis\JiraComments_FromJql.csv"  # <-- update this line

if not os.path.exists(csv_path):
    raise FileNotFoundError(f"CSV not found at: {csv_path}\nPlease update csv_path to the correct location.")

df = pd.read_csv(csv_path)

print("Rows:", len(df))
print("Columns:", list(df.columns))
df.head()

## Basic cleaning & selection

The most relevant columns in the export are:

- **IssueKey** – which Jira issue the comment belongs to
- **Author** (if present) – who wrote the comment
- **Created / Updated** – timestamps
- **Body** – the actual comment text

The cell below keeps `IssueKey` and `Body` as the working set and adds a cleaned-up `text_clean` column for analysis.
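As a rough illustration of what the cleaning step does, the sketch below applies the same kind of regex passes to a single made-up comment. The sample text and the name `clean_comment` are hypothetical and exist only for this example; the notebook's own function is defined in Cell 3.

```python
import re

def clean_comment(text: str) -> str:
    """Illustrative cleaner: strip URLs, simple [~mentions], and unusual symbols, then lowercase."""
    text = re.sub(r"http\S+|www\.\S+", " ", text)       # URLs
    text = re.sub(r"\[~?\w+\]", " ", text)              # simple Jira-style mentions
    text = re.sub(r"[^a-zA-Z0-9\s,.!?'-]", " ", text)   # keep letters, digits, basic punctuation
    text = re.sub(r"\s+", " ", text)                    # collapse runs of whitespace
    return text.strip().lower()

# Made-up comment, not taken from any real Jira export
sample = "[~jsmith] Blocked by QA: see https://example.com/build/42 (still failing!)"
cleaned = clean_comment(sample)
print(cleaned)  # blocked by qa see still failing!
```

Note that the mention, the URL, the colon, and the parentheses are all dropped, while the exclamation mark survives because it is in the allowed punctuation set.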

In [None]:
# Cell 3 – Select relevant columns and clean text

# Try common column names; adjust if your CSV uses different ones
possible_issue_cols = ["IssueKey", "issueKey", "Key", "key"]
possible_body_cols = ["Body", "body", "Comment", "comment"]

issue_col = next((c for c in possible_issue_cols if c in df.columns), None)
body_col  = next((c for c in possible_body_cols  if c in df.columns), None)

if issue_col is None or body_col is None:
    raise ValueError(f"Could not find issue or body column.\n"
                     f"Available columns: {list(df.columns)}\n"
                     f"Expected something like {possible_issue_cols} and {possible_body_cols}")

print(f"Using issue column: {issue_col}")
print(f"Using body column : {body_col}")

# Keep a working copy
data = df[[issue_col, body_col]].copy()
data.rename(columns={issue_col: "IssueKey", body_col: "Body"}, inplace=True)

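# Drop empty comments. Fill missing values first: astype(str) alone would turn
# NaN into the literal string "nan", which would survive the length filter.
data["Body"] = data["Body"].fillna("").astype(str).str.strip()
data = data[data["Body"].str.len() > 0].reset_index(drop=True)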

print("After dropping empty comments:", len(data))

# Simple text cleaning function
import re

def clean_text(text: str) -> str:
    text = str(text)
    # Remove URLs
    text = re.sub(r"http\S+|www\.\S+", " ", text)
    # Remove Jira-like markup (basic)
    text = re.sub(r"\[~?\w+\]", " ", text)  # mentions
    # Remove non-alphanumeric except basic punctuation
    text = re.sub(r"[^a-zA-Z0-9\s,.!?'-]", " ", text)
    # Collapse whitespace
    text = re.sub(r"\s+", " ", text)
    return text.strip().lower()

data["text_clean"] = data["Body"].apply(clean_text)

data.head()

## Sentiment analysis with HuggingFace

We'll use a pretrained sentiment model from HuggingFace to label each comment as **positive** or **negative**, along with a confidence score for that label.  
Feel free to change the model name if you prefer another sentiment model.
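The pipeline returns one dictionary per comment with a `label` and a `score`. If a single number per comment is more convenient for sorting or averaging, one option is to fold the two into a signed score. The sketch below assumes the `POSITIVE` / `NEGATIVE` label names used by the default model in this notebook; the helper `signed_score` and the mock results are hypothetical and are not real model predictions.

```python
def signed_score(result: dict) -> float:
    """Map a {'label', 'score'} result to [-1, 1]; NEGATIVE labels get a negative sign."""
    sign = -1.0 if result["label"].upper() == "NEGATIVE" else 1.0
    return sign * float(result["score"])

# Hand-written results shaped like the pipeline's output (not real predictions)
mock_results = [
    {"label": "NEGATIVE", "score": 0.98},
    {"label": "POSITIVE", "score": 0.75},
]
signed = [signed_score(r) for r in mock_results]
print(signed)  # [-0.98, 0.75]
```

If you switch to a model with different label names (for example a three-class model), the sign logic would need to be adapted.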

In [None]:
# Cell 4 – Setup HuggingFace sentiment pipeline

# You can change this to another English sentiment model if you like
model_name = "distilbert-base-uncased-finetuned-sst-2-english"

sentiment_model = AutoModelForSequenceClassification.from_pretrained(model_name)
sentiment_tokenizer = AutoTokenizer.from_pretrained(model_name)

sentiment_pipe = pipeline(
    "sentiment-analysis",
    model=sentiment_model,
    tokenizer=sentiment_tokenizer,
    device=-1  # use CPU; set to 0 to use GPU if available
)

sentiment_pipe("This is a quick test to see if the model works.")

In [None]:
# Cell 5 – Run sentiment on all comments (batched)

texts = data["text_clean"].tolist()
results = []

batch_size = 32

for i in range(0, len(texts), batch_size):
    batch = texts[i:i+batch_size]
    batch_results = sentiment_pipe(batch, truncation=True)
    results.extend(batch_results)
    print(f"Processed {min(i+batch_size, len(texts))}/{len(texts)} comments", end="\r")

print("\nDone.")

data["sentiment_label"] = [r["label"] for r in results]
data["sentiment_score"] = [r["score"] for r in results]

data.head()

In [None]:
# Cell 6 – Sentiment summary

print(data["sentiment_label"].value_counts())

# Plot distribution
data["sentiment_label"].value_counts().plot(kind="bar")
plt.title("Sentiment label distribution")
plt.xlabel("Label")
plt.ylabel("Count")
plt.show()

# Summary statistics (count, mean, spread) of the confidence score by label
data.groupby("sentiment_label")["sentiment_score"].describe()

## Thematic analysis (reasons / topics)

We'll use a simple unsupervised approach:

1. Convert comments into TF-IDF vectors.  
2. Cluster them using KMeans.  
3. Inspect the **top terms per cluster** and some **example comments** to interpret themes.
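To make the first step more concrete, here is a minimal from-scratch sketch of TF-IDF weighting on three made-up comments. It uses the smoothed IDF formula that scikit-learn applies by default, followed by L2 normalization, but it skips the stop-word removal and bigrams that the notebook's vectorizer uses, so the numbers will not match Cell 7 exactly. All names here (`docs`, `doc_freq`, `tfidf`) are illustrative.

```python
import math
from collections import Counter

# Made-up comments, whitespace-tokenized for simplicity
docs = ["waiting on qa team", "waiting on vendor", "requirements unclear"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many comments each term appears
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1
idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in doc_freq.items()}

def tfidf(tokens):
    """TF-IDF weights for one comment, scaled to unit L2 norm."""
    tf = Counter(tokens)
    weights = {t: tf[t] * idf[t] for t in tf}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

weights = tfidf(tokenized[0])
for term, w in weights.items():
    print(f"{term:8s} {w:.3f}")
```

Terms that appear in only one comment ("qa", "team") end up with larger weights than terms shared across comments ("waiting", "on"), which is what lets the clustering step separate themes.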

You can adjust the number of clusters (`n_clusters`) depending on how fine-grained you want the themes to be.
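If you would rather not pick `n_clusters` purely by eye, one common heuristic is to compare silhouette scores across a few candidate values. The sketch below does this on a tiny made-up corpus; the comments, the candidate range, and the variable names are assumptions for illustration, and on real data the score is only a rough guide that should be checked against the top terms and example comments.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Made-up comments standing in for data["text_clean"]
toy_comments = [
    "waiting on vendor response",
    "still waiting on vendor",
    "vendor has not replied yet",
    "tests failing in staging",
    "regression tests failing again",
    "staging tests are flaky",
    "requirements unclear need clarification",
    "need clarification on requirements",
    "requirements changed again",
]

X_toy = TfidfVectorizer(stop_words="english").fit_transform(toy_comments)

# Fit KMeans for each candidate k and record the mean silhouette score
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_toy)
    scores[k] = silhouette_score(X_toy, labels)

best_k = max(scores, key=scores.get)
print({k: round(float(s), 3) for k, s in scores.items()}, "best k:", best_k)
```

To apply the same idea to the notebook's data, you would swap `X_toy` for the `X` matrix built in Cell 7 and widen the candidate range as needed.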

In [None]:
# Cell 7 – TF-IDF vectorization

# You can tweak max_features or stop_words if needed
vectorizer = TfidfVectorizer(
    max_features=5000,
    ngram_range=(1,2),
    stop_words="english"
)

X = vectorizer.fit_transform(data["text_clean"])
X.shape

In [None]:
import threadpoolctl

class _NoOpThreadpoolLimit:
    def __init__(self, *args, **kwargs):
        pass
    def __enter__(self):
        return self
    def __exit__(self, exc_type, exc_val, exc_tb):
        pass

# Monkey-patch threadpoolctl so scikit-learn's KMeans doesn't try
# to introspect BLAS libraries (which is what's crashing on Windows).
threadpoolctl.threadpool_limits = _NoOpThreadpoolLimit

print("Patched threadpoolctl.threadpool_limits – KMeans should work now.")


In [None]:
# Cell 8 – KMeans clustering into themes

n_clusters = 6  # adjust this number as needed

kmeans = KMeans(
    n_clusters=n_clusters,
    random_state=42,
    n_init=10
)

cluster_labels = kmeans.fit_predict(X)
data["cluster"] = cluster_labels

data["cluster"].value_counts().sort_index()

In [None]:
# Cell 9 – Inspect top terms per cluster

def print_top_terms_per_cluster(kmeans_model, vectorizer, n_terms=15):
    terms = np.array(vectorizer.get_feature_names_out())
    order_centroids = kmeans_model.cluster_centers_.argsort()[:, ::-1]
    for i in range(kmeans_model.n_clusters):
        top_terms = terms[order_centroids[i, :n_terms]]
        print(f"\nCluster {i} – top terms:")
        print(", ".join(top_terms))

print_top_terms_per_cluster(kmeans, vectorizer, n_terms=15)

In [None]:
# Cell 10 – Example comments per cluster

for c in range(n_clusters):
    print(f"\n==================== Cluster {c} ====================")
    cluster_subset = data[data["cluster"] == c].head(5)  # show up to 5 examples
    for _, row in cluster_subset.iterrows():
        print(f"[{row['IssueKey']}] ({row['sentiment_label']}, {row['sentiment_score']:.2f})")
        print(row["Body"][:500])
        print("----")

## Putting it together: top themes & blocker-style comments

At this point you can:

- Manually label each cluster with a **theme name** (e.g., "Waiting on other team", "Testing issues", "Requirements unclear").  
- Filter for **negative** comments within each cluster to see which themes are associated with delays or frustration.  
- Export the enriched dataset back to CSV for reporting or dashboards.
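The first two steps above can be sketched as follows, using a small made-up frame shaped like the notebook's `data` after clustering. The theme names and the cluster-to-theme mapping are hypothetical; in practice you would assign them only after reading each cluster's top terms and example comments.

```python
import pandas as pd

# Made-up rows shaped like `data` after Cells 5 and 8
demo = pd.DataFrame({
    "cluster": [0, 0, 1, 1, 1, 2],
    "sentiment_label": ["NEGATIVE", "POSITIVE", "NEGATIVE",
                        "NEGATIVE", "POSITIVE", "POSITIVE"],
})

# Hypothetical manual labels for each cluster id
theme_names = {0: "Waiting on other team", 1: "Testing issues", 2: "Requirements unclear"}
demo["theme"] = demo["cluster"].map(theme_names)

# Comment count and share of negative comments per theme, highest share first
demo["is_negative"] = demo["sentiment_label"].eq("NEGATIVE")
theme_summary = (
    demo.groupby("theme")["is_negative"]
    .agg(n_comments="size", negative_share="mean")
    .sort_values("negative_share", ascending=False)
)
print(theme_summary)
```

Looking at the share of negative comments rather than the raw count keeps large clusters from dominating the ranking just because they contain more comments.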

In [None]:
# Cell 11 – Example: negative comments by cluster

neg = data[data["sentiment_label"] == "NEGATIVE"]

summary = neg.groupby("cluster").agg(
    n_comments=("Body", "count")
).reset_index().sort_values("n_comments", ascending=False)

summary

In [None]:
# Cell 12 – Export enriched data (optional)

output_path = os.path.splitext(csv_path)[0] + "_enriched_with_sentiment_clusters.csv"
data.to_csv(output_path, index=False)
print(f"Saved enriched data to: {output_path}")