# Support Ticket Triage (Colab Demo)

This notebook demonstrates **Support Ticket Triage & Automated Routing** using both:
- A **baseline model** (TF-IDF + Logistic Regression)
- A **Transformer model** (DistilBERT fine-tuning)

**Limitations in Colab**:
- Runtime resets clear all local files, so mount Google Drive for anything that must persist.
- The free tier typically provides about 12 GB of RAM and sessions of up to roughly 12 hours; neither is guaranteed.
- FastAPI/serving is for local demo only.
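
Because a runtime reset wipes `/content`, one way to keep models and processed data is to copy them into a mounted Drive folder at the end of a run. The sketch below is a minimal, assumed helper (`persist_artifacts` and both example paths are hypothetical, not part of the original notebook); it uses only the standard library and expects Drive to be mounted already.

```python
import shutil
from pathlib import Path


def persist_artifacts(src_dir: str, dest_dir: str) -> list[str]:
    """Copy src_dir into dest_dir and list the files now present there (relative paths)."""
    src, dest = Path(src_dir), Path(dest_dir)
    if not src.is_dir():
        raise FileNotFoundError(f"Nothing to persist: {src} does not exist")
    # dirs_exist_ok=True lets repeated runs overwrite earlier copies
    shutil.copytree(src, dest, dirs_exist_ok=True)
    return sorted(p.relative_to(dest).as_posix() for p in dest.rglob("*") if p.is_file())


# Example (paths are assumptions; Drive must be mounted at /content/drive first):
# persist_artifacts("support_triage/models", "/content/drive/MyDrive/support_triage/models")
```

Calling it once after training is usually enough; calling it again simply refreshes the copies.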


## 1. Install requirements

In [1]:
# Install requirements
!pip install -q scikit-learn transformers datasets kaggle


## 2. Kaggle API & Directory Setup

In [2]:
# upload kaggle.json from desktop download directory
from google.colab import files
files.upload()

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"emmanuel007","key":"<redacted>"}'}

In [3]:
import os
# Create the .kaggle directory
os.makedirs('/root/.kaggle', exist_ok=True)

# Move kaggle.json to this directory
!mv kaggle.json /root/.kaggle/

# Set permissions
!chmod 600 /root/.kaggle/kaggle.json
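
Note that calling `files.upload()` as the last expression of a cell echoes the returned dictionary, file contents included, into the notebook output. A minimal sketch of a quieter variant follows; `install_kaggle_token` is a hypothetical helper, and it assumes the dictionary returned by `files.upload()` (file name mapped to bytes) is assigned to a variable and passed in rather than displayed.

```python
import json
import os
from pathlib import Path


def install_kaggle_token(uploaded: dict, kaggle_dir: str = "/root/.kaggle") -> str:
    """Write the uploaded kaggle.json with owner-only permissions; return the username."""
    raw = uploaded["kaggle.json"]
    creds = json.loads(raw)  # fails early if the file is not valid JSON
    if not {"username", "key"} <= creds.keys():
        raise ValueError("kaggle.json must contain 'username' and 'key'")
    target = Path(kaggle_dir) / "kaggle.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(raw)
    os.chmod(target, 0o600)  # same effect as `chmod 600`
    return creds["username"]  # the key itself is never returned or printed


# In Colab (assumed usage): assign the upload so the dict is not echoed as cell output
# uploaded = files.upload()
# print("Token installed for", install_kaggle_token(uploaded))
```

If a key has already appeared in a shared notebook, regenerating the token from the Kaggle account page is the safer course.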


In [4]:
!ls -ltr /content/


total 4
drwxr-xr-x 1 root root 4096 Oct  8 13:53 sample_data


In [5]:
!pwd


/content


In [6]:
zip_path = "/content/"


## 3. Download dataset from Kaggle & Unzip to Colab
The dataset slug is `thoughtvector/customer-support-on-twitter`.

In [8]:
!kaggle datasets download -d thoughtvector/customer-support-on-twitter --force

Dataset URL: https://www.kaggle.com/datasets/thoughtvector/customer-support-on-twitter
License(s): CC-BY-NC-SA-4.0
Downloading customer-support-on-twitter.zip to /content
 99% 167M/169M [00:00<00:00, 1.74GB/s]
100% 169M/169M [00:00<00:00, 1.74GB/s]


In [9]:
!unzip -q customer-support-on-twitter.zip -d data/

In [None]:
!ls -ltr

total 172640
-rw-r--r-- 1 root root 176772673 Sep 21  2019 customer-support-on-twitter.zip
drwxr-xr-x 1 root root      4096 Oct  2 13:36 sample_data
drwxr-xr-x 3 root root      4096 Oct  4 20:35 data


In [None]:
!ls -ltr data/twcs/

total 504404
-rw-r--r-- 1 root root 516508641 Sep 21  2019 twcs.csv


In [None]:
!ls -ltr data/

total 24
-rw-r--r-- 1 root root 17357 Sep 21  2019 sample.csv
drwxr-xr-x 2 root root  4096 Oct  4 20:35 twcs


## 4. Project structure

In [None]:
!mkdir -p support_triage/src support_triage/data/raw support_triage/data/processed support_triage/models support_triage/models/transformer
!mv data/* support_triage/data/raw/ || true

In [None]:
!ls -ltr support_triage/data/raw/twcs

total 504404
-rw-r--r-- 1 root root 516508641 Sep 21  2019 twcs.csv


In [None]:
!apt-get install tree


Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following NEW packages will be installed:
  tree
0 upgraded, 1 newly installed, 0 to remove and 38 not upgraded.
Need to get 47.9 kB of archives.
After this operation, 116 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/universe amd64 tree amd64 2.0.2-1 [47.9 kB]
Fetched 47.9 kB in 0s (123 kB/s)
Selecting previously unselected package tree.
(Reading database ... 126675 files and directories currently installed.)
Preparing to unpack .../tree_2.0.2-1_amd64.deb ...
Unpacking tree (2.0.2-1) ...
Setting up tree (2.0.2-1) ...
Processing triggers for man-db (2.10.2-1) ...


In [None]:
!tree support_triage -L 3


support_triage
├── data
│   ├── processed
│   └── raw
│       ├── sample.csv
│       └── twcs
├── models
│   └── transformer
└── src

7 directories, 1 file


## 4. Dataset preprocessing (`src/datasets.py`)

In [None]:
%%writefile support_triage/src/datasets.py
import re
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt


# --------------------------------------------------
# Text cleaning / anonymization
# --------------------------------------------------

def clean_text(text: str) -> str:
    """
    Basic preprocessing of ticket text:
    - Lowercase
    - Remove emails, URLs, numbers
    - Strip extra whitespace
    """
    if not isinstance(text, str):
        return ""

    text = text.lower()
    text = re.sub(r"\S+@\S+", " [email] ", text)  # anonymize emails
    text = re.sub(r"http\S+|www\S+", " [url] ", text)  # anonymize urls
    text = re.sub(r"\d+", " [num] ", text)  # replace numbers
    text = re.sub(r"\s+", " ", text).strip()
    return text


def anonymize_text(text: str) -> str:
    """
    Additional anonymization (names, phone numbers, etc.)
    Extendable as needed.
    """
    if not isinstance(text, str):
        return ""

    text = re.sub(r"\b[A-Z][a-z]+(?:\s[A-Z][a-z]+)+", " [name] ", text)  # names
    text = re.sub(r"\+?\d[\d -]{8,}\d", " [phone] ", text)  # phone numbers
    return text


# --------------------------------------------------
# Load and preprocess dataset
# --------------------------------------------------

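def load_and_preprocess(path: str, text_col: str = "text", label_col: str = "label_team"):
    """
    Load a CSV of tickets, anonymize and clean the text, and encode labels.

    The default column names are inferred from the checks below and from
    the twcs.csv schema; pass different names if your file differs.

    Returns:
        (df, label_map): df has columns ["text", "class", "author_id", "inbound"].
    """
    df = pd.read_csv(path)

    # --- Create the label column if missing ---
    if label_col not in df.columns:
        if "inbound" in df.columns:
            print(f"[INFO] '{label_col}' not found. Creating from 'inbound' column...")
            df[label_col] = df["inbound"].apply(lambda x: "customer" if x else "support")
        else:
            raise ValueError(
                f"Expected column '{label_col}' not found in dataset "
                f"and no 'inbound' column to derive it from."
            )

    if text_col not in df.columns:
        raise ValueError(f"Expected column '{text_col}' in dataset")

    # --- Anonymize, then clean ---
    # anonymize_text relies on capital letters and raw digits, so it must run
    # before clean_text lowercases the text and masks numbers.
    df["text"] = df[text_col].apply(anonymize_text).apply(clean_text)
    df = df[df["text"].str.strip() != ""].copy()

    # --- Encode labels ---
    le = LabelEncoder()
    df["class"] = le.fit_transform(df[label_col])
    # Plain str/int values so the map can later be written with json.dump
    label_map = {str(name): int(idx) for idx, name in enumerate(le.classes_)}

    print(f"[INFO] Loaded {len(df)} rows from {path}")
    print(f"[INFO] Label map: {label_map}")

    return df[["text", "class", "author_id", "inbound"]], label_map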

def load_sample_data(path: str = "support_triage/data/raw/sample.csv"):
    """Loads the sample dataset."""
    return pd.read_csv(path)


# --------------------------------------------------
#   EDA
# --------------------------------------------------

def perform_eda(df: pd.DataFrame):
    """
    Perform basic EDA on the dataset.

    Args:
        df (DataFrame): dataset with ["text", "class", "author_id", "inbound"]
    """
    # ---- Basic summary ----
    print("=== Dataset Summary ===")
    print(f"Total rows: {len(df):,}")
    print(f"Unique authors: {df['author_id'].nunique():,}")
    print("Inbound distribution (customer=True, support=False):")
    print(df["inbound"].value_counts(normalize=True))

    # ---- Text length stats ----
    df["text_length"] = df["text"].str.len()
    print("\n=== Text Length Stats ===")
    print(df["text_length"].describe())

    # ---- Plots ----
    plt.figure(figsize=(12, 8))

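    # 1. Inbound distribution
    plt.subplot(2, 2, 1)
    # Fix the bar order explicitly so the tick labels always match the bars
    inbound_counts = df["inbound"].value_counts().reindex([True, False], fill_value=0)
    inbound_counts.plot(kind="bar")
    plt.title("Inbound Distribution (Customer vs Support)")
    plt.xticks([0, 1], ["Customer (inbound=True)", "Support (inbound=False)"], rotation=15)
    plt.ylabel("Count")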

    # 2. Text length distribution
    plt.subplot(2, 2, 2)
    df["text_length"].hist(bins=50)
    plt.title("Text Length Distribution")
    plt.xlabel("Characters")
    plt.ylabel("Count")

    # 3. Top 10 support accounts
    plt.subplot(2, 2, 3)
    df.loc[df["inbound"]==False, "author_id"].value_counts().head(10).plot(kind="bar")
    plt.title("Top 10 Support Accounts")
    plt.ylabel("Replies sent")

    # 4. Top 10 customer accounts
    plt.subplot(2, 2, 4)
    df.loc[df["inbound"]==True, "author_id"].value_counts().head(10).plot(kind="bar")
    plt.title("Top 10 Customers")
    plt.ylabel("Messages sent")

    plt.tight_layout()
    plt.show()


# --------------------------------------------------
# Train/Val/Test split
# --------------------------------------------------

from sklearn.model_selection import train_test_split

def train_val_test_split(data, test_size=0.2, val_size=0.1, random_state=42):
    """
    Split dataset into train/val/test DataFrames and preserve label_map.

    Args:
        data (tuple): (df, label_map) from load_and_preprocess()
        test_size (float): fraction for test
        val_size (float): fraction for validation (of remaining)
        random_state (int): random seed

    Returns:
        train_df, val_df, test_df, label_map
    """
    df, label_map = data

    # train vs test split
    train_df, test_df = train_test_split(
        df, test_size=test_size, stratify=df["class"], random_state=random_state
    )

    # train vs val split
    train_df, val_df = train_test_split(
        train_df, test_size=val_size, stratify=train_df["class"], random_state=random_state
    )

    print(f"[INFO] Split: train={len(train_df)}, val={len(val_df)}, test={len(test_df)}")
    return train_df, val_df, test_df, label_map


Writing support_triage/src/datasets.py
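
To make the cleaning rules concrete, here is a small self-contained sketch that applies the same regular expressions to one made-up ticket. The two helpers mirror the functions in `datasets.py`, while the sample text and the `anonymize_then_clean` wrapper are illustrative assumptions. It also shows why order matters: the name pattern depends on capital letters and the phone pattern on raw digits, so both need to run before lowercasing and number masking.

```python
import re


def anonymize_text(text: str) -> str:
    # Two or more consecutive capitalized words are treated as a name (a rough heuristic)
    text = re.sub(r"\b[A-Z][a-z]+(?:\s[A-Z][a-z]+)+", " [name] ", text)
    text = re.sub(r"\+?\d[\d -]{8,}\d", " [phone] ", text)
    return text


def clean_text(text: str) -> str:
    text = text.lower()
    text = re.sub(r"\S+@\S+", " [email] ", text)
    text = re.sub(r"http\S+|www\S+", " [url] ", text)
    text = re.sub(r"\d+", " [num] ", text)
    return re.sub(r"\s+", " ", text).strip()


def anonymize_then_clean(text: str) -> str:
    return clean_text(anonymize_text(text))


sample = "please call John Smith on +1 555 123 4567 or mail js@example.com about order 42"
print(anonymize_then_clean(sample))
# please call [name] on [phone] or mail [email] about order [num]
print(clean_text(sample))  # cleaning alone leaves the name in place
```

The name heuristic is deliberately crude and will also mask ordinary capitalized phrases, so treat it as a starting point.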


In [None]:
# --------------------------------------------------
# Transformer Dataset + Tokenizer
# --------------------------------------------------
import torch
from torch.utils.data import Dataset


class SupportDataset(Dataset):
    """
    PyTorch Dataset for transformer models.
    """

    def __init__(self, texts, labels, tokenizer, max_length=128):
        # Materialize as lists so positional indexing works even when a
        # pandas Series with a shuffled index is passed in
        self.texts = list(texts)
        self.labels = list(labels)
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = str(self.texts[idx])
        label = int(self.labels[idx])
        encoding = self.tokenizer(
            text,
            add_special_tokens=True,
            truncation=True,
            padding="max_length",
            max_length=self.max_length,
            return_tensors="pt"
        )
        return {
            "input_ids": encoding["input_ids"].squeeze(),
            "attention_mask": encoding["attention_mask"].squeeze(),
            "labels": torch.tensor(label, dtype=torch.long)
        }



In [None]:
!tree support_triage -L 3


support_triage
├── data
│   ├── processed
│   └── raw
│       ├── sample.csv
│       └── twcs
├── models
│   └── transformer
└── src
    └── datasets.py

7 directories, 2 files


## 5. Baseline model (`src/train_baseline.py`)

In [None]:
%%writefile support_triage/src/train_baseline.py
import joblib
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

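def train_baseline(X_train, y_train, X_val, y_val, model_path='support_triage/models/baseline.pkl'):
    """Fit a TF-IDF + Logistic Regression pipeline, report validation metrics, and save it."""
    pipe = Pipeline([
        # Unigrams and bigrams, capped at the 20,000 most frequent terms
        ('tfidf', TfidfVectorizer(max_features=20000, ngram_range=(1,2))),
        # class_weight='balanced' compensates for uneven class frequencies
        ('clf', LogisticRegression(max_iter=2000, class_weight='balanced'))
    ])
    pipe.fit(X_train, y_train)
    preds = pipe.predict(X_val)
    print(classification_report(y_val, preds))
    joblib.dump(pipe, model_path)
    return pipe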

Writing support_triage/src/train_baseline.py
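
For readers who want to see what the `tfidf` step feeds into the classifier, the following standard-library sketch computes unigram and bigram TF-IDF weights for a toy corpus. It follows the smoothed IDF and L2 normalization that scikit-learn documents as its defaults, but it is a simplified illustration (whitespace tokenization, no `max_features` cap, made-up sentences), not a reimplementation of `TfidfVectorizer`.

```python
import math
from collections import Counter


def ngrams(tokens, n_max=2):
    """Return all n-grams of length 1..n_max as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]


def tfidf(corpus):
    docs = [Counter(ngrams(text.lower().split())) for text in corpus]
    n_docs = len(docs)
    # Number of documents that contain each term
    doc_freq = Counter(term for doc in docs for term in doc)
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = []
    for doc in docs:
        weights = {t: count * idf[t] for t, count in doc.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors


toy_corpus = ["my order is late", "my refund is late", "thanks for the help"]
vectors = tfidf(toy_corpus)
# Terms shared across tickets ("my", "late") weigh less than distinctive ones ("refund")
print(sorted(vectors[1].items(), key=lambda kv: -kv[1])[:3])
```

Logistic regression then learns one coefficient per term and class from these sparse vectors.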


## 6. Transformer model (`src/train_transformer.py`)

In [None]:
%%writefile support_triage/src/train_transformer.py
import os
import json
import numpy as np
import torch
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.utils.class_weight import compute_class_weight
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
    EarlyStoppingCallback,
)
from datasets import Dataset


def train_transformer(
    train_df,
    val_df,
    label_map,
    model_name="distilbert-base-uncased",
    out_dir="./transformer_model",
    use_class_weights=True,
):
    """
    Train a Hugging Face Transformer for sequence classification using a label_map.
    Automatically chooses eval/save strategy based on dataset size and uses class weights if specified.
    """

    os.makedirs(out_dir, exist_ok=True)
    device = "cuda" if torch.cuda.is_available() else "cpu"
    print(f"[INFO] Using device: {device}")

    # -------------------------------------------------------
    # 1. Align labels with label_map correctly
    # -------------------------------------------------------
    train_df = train_df.copy()
    val_df = val_df.copy()

    # If class column already numeric and starts from 0, keep it as is
    if np.issubdtype(train_df["class"].dtype, np.number) and train_df["class"].min() == 0:
        train_df["labels"] = train_df["class"]
        val_df["labels"] = val_df["class"]
    else:
        # Otherwise, map using label_map
        train_df["labels"] = train_df["class"].map(label_map)
        val_df["labels"] = val_df["class"].map(label_map)

    num_labels = len(set(train_df["labels"].unique()))
    print(f"[INFO] Number of labels: {num_labels}")

    # -------------------------------------------------------
    # 2. Tokenizer & encoding
    # -------------------------------------------------------
    tokenizer = AutoTokenizer.from_pretrained(model_name)

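    def tokenize(batch):
        # Pad to a fixed length: with padding=True each map() batch is padded to
        # its own longest sequence, and the default collator cannot stack
        # examples of different lengths.
        return tokenizer(
            batch["text"],
            padding="max_length",
            truncation=True,
            max_length=256,
        )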

    train_dataset = Dataset.from_pandas(train_df).map(tokenize, batched=True)
    val_dataset = Dataset.from_pandas(val_df).map(tokenize, batched=True)

    train_dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])
    val_dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])

    # -------------------------------------------------------
    # 3. Model setup
    # -------------------------------------------------------
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_labels).to(device)

    # Compute class weights if requested
    class_weights = None
    if use_class_weights:
        print("[INFO] Computing class weights...")
        labels = train_df["labels"].values
        classes = np.unique(labels)
        weights = compute_class_weight("balanced", classes=classes, y=labels)
        weights = torch.tensor(weights, dtype=torch.float).to(device)
        class_weights = weights
        print(f"[INFO] Class weights: {class_weights}")

        # Wrap model forward pass for weighted loss
        import functools

        orig_forward = model.forward

        # functools.wraps keeps the original signature visible to Trainer, which
        # inspects model.forward to decide which dataset columns to keep.
        @functools.wraps(orig_forward)
        def weighted_forward(**kwargs):
            labels = kwargs.pop("labels", None)
            outputs = orig_forward(**kwargs)
            logits = outputs.logits
            if labels is not None:
                loss_fn = torch.nn.CrossEntropyLoss(weight=class_weights)
                loss = loss_fn(logits.view(-1, num_labels), labels.view(-1))
                outputs.loss = loss
            return outputs

        model.forward = weighted_forward

    # -------------------------------------------------------
    # 4. Choose training strategy
    # -------------------------------------------------------
    dataset_size = len(train_dataset)
    if dataset_size < 50_000:
        eval_strategy = "epoch"
        save_strategy = "epoch"
        eval_steps = None
    else:
        eval_strategy = "steps"
        save_strategy = "steps"
        eval_steps = 500

    # -------------------------------------------------------
    # 5. Training arguments
    # -------------------------------------------------------
    args = TrainingArguments(
        output_dir=out_dir,
        eval_strategy=eval_strategy,  # called evaluation_strategy before transformers 4.41
        save_strategy=save_strategy,
        eval_steps=eval_steps,
        learning_rate=2e-5,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=16,
        num_train_epochs=3,
        weight_decay=0.01,
        logging_dir=os.path.join(out_dir, "logs"),
        logging_steps=50,
        load_best_model_at_end=True,
        metric_for_best_model="f1",
        save_total_limit=2,
        fp16=torch.cuda.is_available(),
        report_to="none",  # Disable W&B unless configured
    )

    # -------------------------------------------------------
    # 6. Metrics
    # -------------------------------------------------------
    def compute_metrics(pred):
        labels_true = pred.label_ids
        labels_pred = pred.predictions.argmax(-1)
        return {
            "accuracy": accuracy_score(labels_true, labels_pred),
            "f1": f1_score(labels_true, labels_pred, average="weighted"),
            "precision": precision_score(labels_true, labels_pred, average="weighted"),
            "recall": recall_score(labels_true, labels_pred, average="weighted"),
        }

    # -------------------------------------------------------
    # 7. Trainer setup
    # -------------------------------------------------------
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
        compute_metrics=compute_metrics,
        callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
    )

    # -------------------------------------------------------
    # 8. Train & save
    # -------------------------------------------------------
    trainer.train()

    print("[INFO] Training complete. Saving model, tokenizer, and label map...")
    trainer.save_model(out_dir)
    tokenizer.save_pretrained(out_dir)

    with open(os.path.join(out_dir, "label_map.json"), "w") as f:
        json.dump(label_map, f, indent=2)

    print(f"[INFO] Model saved to {out_dir}")
    return trainer


Overwriting support_triage/src/train_transformer.py
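
The `"balanced"` mode passed to `compute_class_weight` weights each class by `n_samples / (n_classes * class_count)`, so rarer classes contribute more to the loss. A standard-library sketch of that formula, with made-up label counts, is shown below; `balanced_class_weights` is a hypothetical name used only for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: n_samples / (n_classes * class_count)}."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in sorted(counts.items())}


# Toy example: 75% of the rows belong to class 0
toy_labels = [0] * 6 + [1] * 2
print(balanced_class_weights(toy_labels))  # class 0 -> 8/12, class 1 -> 8/4
```

With perfectly even classes every weight is 1.0, and the weighted loss reduces to the ordinary cross-entropy.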


In [None]:
"""%%writefile support_triage/src/train_transformer.py
import numpy as np
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
from sklearn.metrics import f1_score, accuracy_score, classification_report

from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
    EarlyStoppingCallback
)
from datasets import Dataset
import torch
import os
import json
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def train_transformer(train_df, val_df, label_map, model_name="distilbert-base-uncased", out_dir="./transformer_model"):
    """
    Train a Hugging Face Transformer for sequence classification using a label_map.
    Automatically chooses eval/save strategy based on dataset size.
    """

    os.makedirs(out_dir, exist_ok=True)

    # ---- Map string labels to int IDs ----
    train_df = train_df.copy()
    val_df = val_df.copy()
    train_df["class_id"] = train_df["class"].map(label_map)
    val_df["class_id"] = val_df["class"].map(label_map)

    # ---- Tokenizer ----
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    def tokenize(batch):
        return tokenizer(batch["text"], padding=True, truncation=True, max_length=256)

    # ---- Dataset objects ----
    train_dataset = Dataset.from_pandas(train_df).map(tokenize, batched=True)
    val_dataset = Dataset.from_pandas(val_df).map(tokenize, batched=True)

    train_dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "class_id"])
    val_dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "class_id"])

    # ---- Model ----
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name,
        num_labels=len(label_map)
    )

    # ---- Choose strategy automatically ----
    dataset_size = len(train_dataset)
    if dataset_size < 50_000:
        eval_strategy = "epoch"
        save_strategy = "epoch"
        eval_steps = None
    else:
        eval_strategy = "steps"
        save_strategy = "steps"
        eval_steps = 500  # every 500 steps

    # ---- Training Arguments ----
    args = TrainingArguments(
        output_dir=out_dir,
        evaluation_strategy=eval_strategy,
        save_strategy=save_strategy,
        eval_steps=eval_steps,
        learning_rate=2e-5,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=16,
        num_train_epochs=3,
        weight_decay=0.01,
        logging_dir=os.path.join(out_dir, "logs"),
        logging_steps=50,
        load_best_model_at_end=True,
        metric_for_best_model="accuracy",
        save_total_limit=2
    )

    # ---- Metrics ----
    def compute_metrics(pred):
        labels_true = pred.label_ids
        labels_pred = pred.predictions.argmax(-1)
        return {
            "accuracy": accuracy_score(labels_true, labels_pred),
            "f1": f1_score(labels_true, labels_pred, average="weighted"),
            "precision": precision_score(labels_true, labels_pred, average="weighted"),
            "recall": recall_score(labels_true, labels_pred, average="weighted")
        }

    # ---- Trainer ----
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
        compute_metrics=compute_metrics,
        callbacks=[EarlyStoppingCallback(early_stopping_patience=2)]
    )

    # ---- Train ----
    trainer.train()

    # ---- Save final model + tokenizer + label_map ----
    trainer.save_model(out_dir)
    tokenizer.save_pretrained(out_dir)

    with open(os.path.join(out_dir, "label_map.json"), "w") as f:
        json.dump(label_map, f, indent=2)

    return trainer"""


Writing support_triage/src/train_transformer.py


In [None]:
!tree support_triage -L 4

support_triage
├── data
│   ├── processed
│   └── raw
│       ├── sample.csv
│       └── twcs
│           └── twcs.csv
├── models
│   └── transformer
└── src
    ├── datasets.py
    ├── train_baseline.py
    └── train_transformer.py

7 directories, 5 files


## 7. Run experiments

In [None]:
import pandas as pd
from support_triage.src.datasets import load_and_preprocess, train_val_test_split
from support_triage.src.train_baseline import train_baseline
from support_triage.src.train_transformer import train_transformer

# Load and preprocess the dataset
df, label_map = load_and_preprocess('support_triage/data/raw/twcs/twcs.csv')

# train_val_test_split expects the (df, label_map) tuple, not the bare DataFrame;
# passing df alone raises "ValueError: too many values to unpack (expected 2)"
train, val, test, label_map = train_val_test_split((df, label_map))

print("=== Baseline ===")
train_baseline(train['text'], train['class'], val['text'], val['class'])

print("=== Transformer ===")
trainer = train_transformer(train, val, label_map)

In [None]:
from support_triage.src.datasets import load_sample_data, load_and_preprocess

# sample.csv sits directly under data/raw, not under data/raw/twcs
sample_path = "support_triage/data/raw/sample.csv"

# Load the raw sample data
sample_df = load_sample_data(sample_path)

# Preprocess the sample data
processed_sample_df, sample_label_map = load_and_preprocess(sample_path)

# Display the first few rows of the processed sample data
print("--- Processed Sample Data ---")
display(processed_sample_df.head())

# Display the label map
print("\n--- Sample Label Map ---")
print(sample_label_map)
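
One detail of `train_val_test_split` worth spelling out: `val_size` is applied to what remains after the test split, so the defaults produce roughly 72% train, 8% validation, and 20% test rather than 70/10/20. The sketch below only does the arithmetic (scikit-learn's rounding of the actual row counts may differ slightly); `split_fractions` is an illustrative name, not a function from the notebook.

```python
def split_fractions(test_size: float = 0.2, val_size: float = 0.1):
    """Return the effective (train, val, test) fractions of the full dataset."""
    remaining = 1.0 - test_size      # share left after the test split
    val = remaining * val_size       # validation is carved out of that share
    return remaining - val, val, test_size


train_frac, val_frac, test_frac = split_fractions()
print(f"train={train_frac:.2f}, val={val_frac:.2f}, test={test_frac:.2f}")
# train=0.72, val=0.08, test=0.20
```

To get an exact 10% validation share of the full dataset with a 20% test split, `val_size` would need to be 0.125.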

In [None]:
import nbformat, datetime, subprocess, os
from getpass import getpass
from IPython.display import Javascript, display

def push_notebook_to_github(
    nb_path="support-ticket.ipynb",
    repo_owner="your-username",
    repo_name="support-triage",
    branch="main",
    commit_message=None
):
    """
    Save, bump version, commit, and push notebook to GitHub from Colab.
    Uses HTTPS + GitHub Personal Access Token (PAT).
    """
    # 1. Try saving the notebook
    try:
        display(Javascript('IPython.notebook.save_checkpoint();'))
    except Exception:
        print("⚠️ Could not auto-save notebook. Please manually save before pushing.")

    # 2. Bump version metadata
    if not os.path.exists(nb_path):
        raise FileNotFoundError(f"Notebook not found at {nb_path}")
    nb = nbformat.read(nb_path, as_version=4)
    meta = nb.metadata.get("support_ticket", {})
    meta["version"] = meta.get("version", 0) + 1
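    # datetime.utcnow() is deprecated in recent Python versions; use an aware timestamp
    meta["last_modified_utc"] = datetime.datetime.now(datetime.timezone.utc).isoformat().replace("+00:00", "Z")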
    nb.metadata["support_ticket"] = meta
    nbformat.write(nb, nb_path)
    new_version = meta["version"]

    # 3. Stage and commit (argument lists avoid shell quoting problems in the message)
    if commit_message is None:
        commit_message = f"Colab update: {os.path.basename(nb_path)} v{new_version}"
    subprocess.run(["git", "add", nb_path], check=False)
    subprocess.run(["git", "commit", "-m", commit_message], check=False)

    # 4. Push with PAT
    token = getpass("🔑 Enter your GitHub PAT (hidden): ")
    repo_url = f"https://{token}@github.com/{repo_owner}/{repo_name}.git"
    print(f"Pushing {nb_path} → {repo_owner}/{repo_name}@{branch} ...")
    result = subprocess.run(["git", "push", repo_url, f"HEAD:{branch}"], check=False)
    if result.returncode != 0:
        # Raise without echoing the command, which contains the token
        raise RuntimeError("git push failed; see the git output above")

    print(f"✅ Notebook pushed. Current version: {new_version}")
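
The version bump in step 2 is just a small metadata update. As a rough standalone sketch (operating on a plain dict instead of an `nbformat` notebook, and using a timezone-aware timestamp), it might look like this; `bump_version` is a hypothetical helper and the `support_ticket` key is taken from the function above.

```python
from datetime import datetime, timezone


def bump_version(metadata: dict, key: str = "support_ticket") -> int:
    """Increment metadata[key]['version'], stamp the UTC time, and return the new version."""
    meta = dict(metadata.get(key, {}))
    meta["version"] = int(meta.get("version", 0)) + 1
    # Timezone-aware timestamp rendered with a trailing "Z"
    meta["last_modified_utc"] = datetime.now(timezone.utc).isoformat().replace("+00:00", "Z")
    metadata[key] = meta
    return meta["version"]


nb_metadata = {}
print(bump_version(nb_metadata), bump_version(nb_metadata))  # 1 2
```

Keeping this step separate from the git calls makes it easy to test without touching a repository.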


# Support-Ticket

# Support Ticket Classifier — Colab-ready Notebook (Python script format)
#
# UPDATED: Fixes SyntaxError caused by notebook shell-magic (`!pip install`) when the
# file is executed as a regular Python script. This version uses a safe, portable
# auto-install approach (via subprocess) when `AUTO_INSTALL = True`, and falls back
# to helpful error messages otherwise.
#
# This file is still organized with `# %%` cell markers so you can paste it into
# Colab, VSCode/Run as a Notebook, or run as a plain Python script.
#
# IMPORTANT:
#  - If you don't want the script to auto-install packages, set AUTO_INSTALL = False.
#  - Auto-install can be slow. For production or CI, prefer a requirements.txt / Dockerfile.
#  - Replace placeholders: dataset_slug, csv_path, SLACK_WEBHOOK before production use.




# 0 — Environment setup (safe & portable)
# Configuration: set to True to attempt to install missing Python packages automatically.
# If running in a locked environment (CI, production image, enterprise laptop), set to False
# and install packages beforehand (recommended).

# Google Colab MVP Runtime + Processor Setup

In [None]:
# ============================================================
# MVP Runtime Setup for Twitter Support Ticket Analyzer
# ============================================================

# 1️⃣ — Check Runtime Info
!nvidia-smi || echo "⚠️ No GPU detected, falling back to CPU runtime."
!cat /proc/cpuinfo | grep 'model name' | uniq
!free -h

# 2️⃣ — Mount Drive for persistence
from google.colab import drive
drive.mount('/content/drive')

# 3️⃣ — Install dependencies (balanced for ML + NLP + dashboard)
!pip install -q \
    torch torchvision torchaudio accelerate \
    transformers datasets sentencepiece \
    scikit-learn pandas numpy matplotlib seaborn \
    spacy nltk emoji tweepy \
    fastapi uvicorn streamlit \
    mlflow pinecone-client \
    langdetect python-dotenv tqdm

# 4️⃣ — Load SpaCy model for preprocessing
!python -m spacy download en_core_web_sm

# 5️⃣ — Setup Folder Structure
import os
base_dir = "/content/drive/MyDrive/twitter_support_mvp"
subdirs = [
    "data/raw", "data/processed", "data/labeled",
    "models", "outputs", "dashboard", "logs"
]
for d in subdirs:
    os.makedirs(os.path.join(base_dir, d), exist_ok=True)
print("✅ Folder structure initialized at:", base_dir)

# 6️⃣ — Check Device
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"🧠 Using device: {device}")


Wed Oct  8 14:53:37 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   51C    P8              9W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [None]:
AUTO_INSTALL = True

# Packages required for the notebook. You may edit this list depending on your needs.
REQUIRED_PACKAGES = [
    "evaluate",
    "accelerate",
    "kaggle",
    "nest-asyncio",
    "aiohttp",
    "sqlalchemy",
    "emoji",
    "langdetect",
    "streamlit",
    "pandas",
    "requests"
]


In [None]:
if AUTO_INSTALL:
    import sys, subprocess
    try:
        print("Attempting to install required packages (this may take a while)...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", "-q"] + REQUIRED_PACKAGES)
        print("Package installation completed (or packages already present).")
    except Exception as e:
        print("Auto-install failed or was interrupted. Please install the following packages manually:")
        print(", ".join(REQUIRED_PACKAGES))
        print("Error:", e)
        # don't raise — allow user to inspect and install manually if desired

Attempting to install required packages (this may take a while)...
Package installation completed (or packages already present).
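
Installing every listed package on each run can be slow. A possible refinement, sketched here under the assumption that import names can usually be derived from pip names (with a small hypothetical override table for the exceptions), is to check which packages are actually missing first.

```python
import importlib.util

# pip name -> import name, only where they differ (illustrative, not exhaustive)
IMPORT_NAME_OVERRIDES = {"nest-asyncio": "nest_asyncio", "scikit-learn": "sklearn"}


def missing_packages(pip_names):
    """Return the pip names whose top-level module cannot be found."""
    missing = []
    for name in pip_names:
        module = IMPORT_NAME_OVERRIDES.get(name, name.replace("-", "_"))
        if importlib.util.find_spec(module) is None:
            missing.append(name)
    return missing


print(missing_packages(["json", "surely-not-installed-pkg-xyz"]))
# ['surely-not-installed-pkg-xyz']
```

The result could then be passed to the `pip install` call in place of the full `REQUIRED_PACKAGES` list.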


In [None]:
!pip install -U transformers

Collecting transformers
  Downloading transformers-4.57.0-py3-none-any.whl.metadata (41 kB)
Downloading transformers-4.57.0-py3-none-any.whl (12.0 MB)
Installing collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.56.2
    Uninstalling transformers-4.56.2:
      Successfully uninstalled transformers-4.56.2
Successfully installed transformers-4.57.0


In [None]:
# Try imports with helpful error messages if something is missing
try:
    import transformers
    import datasets
    import evaluate
    import pandas as pd
    import numpy as np
    import requests
    import os
    import re
    import emoji
    from langdetect import detect, DetectorFactory
    DetectorFactory.seed = 0
    from datasets import Dataset, DatasetDict
    from transformers import (
        AutoTokenizer, DataCollatorWithPadding,
        AutoModelForSequenceClassification, TrainingArguments, Trainer, pipeline
    )
    import sqlalchemy
    from sqlalchemy import create_engine, Column, Integer, String, Float, DateTime, MetaData, Table
    from sqlalchemy.orm import sessionmaker
    from pydantic import BaseModel
    from fastapi import FastAPI
    import nest_asyncio
    from datetime import datetime
    import subprocess
    print("Imports OK.\ntransformers:", transformers.__version__, "datasets:", datasets.__version__)
except Exception as e:
    print("One or more imports failed. If AUTO_INSTALL=False you must install packages manually.")
    print("Missing import error:", e)
    raise


Imports OK.
transformers: 4.57.0 datasets: 4.0.0


# 1 — Download / load dataset (Kaggle or local CSV)


In [None]:
from google.colab import files
files.upload()

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"emmanuel007","key":"<redacted>"}'}

In [None]:
import os
# Create the .kaggle directory
os.makedirs('/root/.kaggle', exist_ok=True)

# Move kaggle.json to this directory
!mv kaggle.json /root/.kaggle/

# Set permissions
!chmod 600 /root/.kaggle/kaggle.json

In [None]:
# If using Kaggle: upload your kaggle.json to /root/.kaggle/kaggle.json before running.
# Replace dataset_slug with the actual Kaggle dataset slug if you want the notebook to
# attempt a Kaggle download.

import glob

kaggle_token_path = "/root/.kaggle/kaggle.json"

# CHANGE this to your dataset slug if you want automatic kaggle download (optional)
dataset_slug = "thoughtvector/customer-support-on-twitter" # e.g. "your-kaggle-username/support-ticket-twitter-dataset"

# Default CSV path (if you upload a CSV to Colab or put it in /content)
csv_path = "/content/support_tweets.csv"


In [None]:
!ls -ltr /content/

total 8
drwxr-xr-x 1 root root 4096 Oct  6 13:38 sample_data
drwx------ 5 root root 4096 Oct  8 14:54 drive


In [None]:
# --- Download and unzip Kaggle dataset ---
import os
import zipfile

# Path to Kaggle API token
kaggle_token_path = "/root/.kaggle/kaggle.json"

# Make sure the Kaggle directory exists
os.makedirs(os.path.dirname(kaggle_token_path), exist_ok=True)

# If running in Colab, you must upload kaggle.json first
# Example:
# from google.colab import files
# files.upload()  # <-- upload your kaggle.json here

# Move kaggle.json to the correct directory (skipped if an earlier cell
# already moved it) and restrict its permissions
import shutil
if os.path.exists("kaggle.json"):
    shutil.move("kaggle.json", kaggle_token_path)
os.chmod(kaggle_token_path, 0o600)

# Set the dataset slug (change if using a different one)
dataset_slug = "thoughtvector/customer-support-on-twitter"

# Create a folder to store the dataset
download_dir = "/content/kaggle_data"
os.makedirs(download_dir, exist_ok=True)

# Download the dataset using Kaggle CLI
!kaggle datasets download -d {dataset_slug} -p {download_dir}

# Unzip the dataset
zip_files = [f for f in os.listdir(download_dir) if f.endswith(".zip")]
for zip_file in zip_files:
    with zipfile.ZipFile(os.path.join(download_dir, zip_file), "r") as zip_ref:
        zip_ref.extractall(download_dir)

print("✅ Dataset downloaded and extracted to:", download_dir)


Dataset URL: https://www.kaggle.com/datasets/thoughtvector/customer-support-on-twitter
License(s): CC-BY-NC-SA-4.0
Downloading customer-support-on-twitter.zip to /content/kaggle_data
 99% 167M/169M [00:00<00:00, 1.74GB/s]
100% 169M/169M [00:00<00:00, 1.74GB/s]
✅ Dataset downloaded and extracted to: /content/kaggle_data


In [None]:
# Now find the main dataset CSV file
csv_candidates = [
    os.path.join(root, file)
    for root, _, files in os.walk(download_dir)
    for file in files
    if file.endswith(".csv")
]

print("📄 Found CSV files:")
for c in csv_candidates:
    print("  ", c)

# Pick the main file (the largest CSV is usually the full dataset)
csv_path = max(csv_candidates, key=os.path.getsize)
print(f"\n✅ Loading main CSV file: {csv_path}")

# Load it
df = pd.read_csv(csv_path, low_memory=False)
print(f"Loaded {len(df):,} rows and {len(df.columns)} columns.")
df.head()

📄 Found CSV files:
   /content/kaggle_data/sample.csv
   /content/kaggle_data/twcs/twcs.csv

✅ Loading main CSV file: /content/kaggle_data/twcs/twcs.csv
Loaded 2,811,774 rows and 7 columns.


Unnamed: 0,tweet_id,author_id,inbound,created_at,text,response_tweet_id,in_response_to_tweet_id
0,1,sprintcare,False,Tue Oct 31 22:10:47 +0000 2017,@115712 I understand. I would like to assist y...,2.0,3.0
1,2,115712,True,Tue Oct 31 22:11:45 +0000 2017,@sprintcare and how do you propose we do that,,1.0
2,3,115712,True,Tue Oct 31 22:08:27 +0000 2017,@sprintcare I have sent several private messag...,1.0,4.0
3,4,sprintcare,False,Tue Oct 31 21:54:49 +0000 2017,@115712 Please send us a Private Message so th...,3.0,5.0
4,5,115712,True,Tue Oct 31 21:49:35 +0000 2017,@sprintcare I did.,4.0,6.0


In [None]:
!ls -ltr /content/

total 12
drwxr-xr-x 1 root root 4096 Oct  6 13:38 sample_data
drwx------ 5 root root 4096 Oct  8 14:54 drive
drwxr-xr-x 3 root root 4096 Oct  8 15:06 kaggle_data


In [None]:
!ls -ltr kaggle_data/

total 172656
-rw-r--r-- 1 root root 176772673 Sep 21  2019 customer-support-on-twitter.zip
-rw-r--r-- 1 root root     17357 Oct  8 15:06 sample.csv
drwxr-xr-x 2 root root      4096 Oct  8 15:06 twcs


In [None]:
!ls -ltr /content/kaggle_data/twcs/

total 504408
-rw-r--r-- 1 root root 516508641 Oct  8 15:06 twcs.csv



# 2 — Quick EDA & label harmonization

In [None]:
# 2 — Quick EDA & label harmonization
print("Columns:", df.columns.tolist())

# Ensure text column exists
if 'text' not in df.columns:
    for alt in ['tweet', 'content', 'message']:
        if alt in df.columns:
            df['text'] = df[alt].astype(str)
            break
    if 'text' not in df.columns:
        raise ValueError("No text column found. Please rename your tweet/text column to 'text'")

# Create binary 'is_support' label if not present
if 'label' in df.columns:
    # Map common label spellings (string or numeric) to 1; everything else to 0
    positive = {'support', '1', 'yes', 'true', '1.0'}
    df['is_support'] = df['label'].map(lambda x: 1 if str(x).lower().strip() in positive else 0)
else:
    # Heuristic labeling for the demo; replace with ground truth if available.
    # Note: this is substring matching on the raw text, so a handle such as
    # @AppleSupport also counts as a match.
    keywords = ['help', 'support', 'issue', 'problem', 'error', 'failed', 'please', 'unable', 'won\'t', 'cannot', "can't"]
    df['is_support'] = df['text'].apply(lambda t: 1 if any(k in str(t).lower() for k in keywords) else 0)

print("Support tweets count:", int(df['is_support'].sum()), "of", len(df))


Columns: ['tweet_id', 'author_id', 'inbound', 'created_at', 'text', 'response_tweet_id', 'in_response_to_tweet_id']
Support tweets count: 1263590 of 2811774


In [None]:
df.head()

Unnamed: 0,tweet_id,author_id,inbound,created_at,text,response_tweet_id,in_response_to_tweet_id,is_support
0,1,sprintcare,False,Tue Oct 31 22:10:47 +0000 2017,@115712 I understand. I would like to assist y...,2.0,3.0,0
1,2,115712,True,Tue Oct 31 22:11:45 +0000 2017,@sprintcare and how do you propose we do that,,1.0,0
2,3,115712,True,Tue Oct 31 22:08:27 +0000 2017,@sprintcare I have sent several private messag...,1.0,4.0,0
3,4,sprintcare,False,Tue Oct 31 21:54:49 +0000 2017,@115712 Please send us a Private Message so th...,3.0,5.0,1
4,5,115712,True,Tue Oct 31 21:49:35 +0000 2017,@sprintcare I did.,4.0,6.0,0
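
Because the heuristic matches substrings of the raw `text`, a company handle such as `@AppleSupport` is enough to set `is_support` to 1. The sketch below is one possible tightening, not the labeling used for the results in this notebook: it removes handles first and then matches keywords as whole words. The helper name `heuristic_label` is hypothetical.

```python
import re

# Same keyword list as the heuristic cell above
KEYWORDS = ["help", "support", "issue", "problem", "error", "failed",
            "please", "unable", "won't", "cannot", "can't"]

# Hypothetical helpers for illustration; they are not part of the notebook
_HANDLE_RE = re.compile(r"@\w+")
_KEYWORD_RE = re.compile(
    r"(?<!\w)(?:" + "|".join(re.escape(k) for k in KEYWORDS) + r")(?!\w)",
    re.IGNORECASE,
)

def heuristic_label(text):
    """Return 1 if a keyword appears as a whole word outside of @handles."""
    if text is None:
        return 0
    without_handles = _HANDLE_RE.sub(" ", str(text))
    return 1 if _KEYWORD_RE.search(without_handles) else 0
```

Whole-word matching is stricter than the original rule: "helpful" or "issues" no longer match, so the positive count would drop. Whether that is desirable depends on how the labels are used downstream.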


# 3 — Preprocessing (cleaning, tokenization, language detection)

In [None]:
# Utilities
import html

def clean_text(text):
    """Lightweight tweet cleaning:
    - remove URLs
    - remove mentions (@username)
    - demojize emojis
    - unescape HTML
    - collapse whitespace
    """
    if text is None:
        return ""
    text = str(text)
    # remove URLs
    text = re.sub(r'http\S+|www\.[^\s]+', '', text)
    # remove handles (but keep the rest)
    text = re.sub(r'@\w+', '', text)
    # demojize
    try:
        text = emoji.demojize(text)
    except Exception:
        pass
    # unescape html entities
    text = html.unescape(text)
    # collapse whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text

def safe_lang(text):
    try:
        lang = detect(text)
        return lang
    except Exception:
        return 'unknown'

# apply
print("Applying cleaning + language detection (fast)...")
df['clean_text'] = df['text'].apply(clean_text)
# language detection can be flaky on short text, so keep results but don't drop unknowns aggressively
try:
    df['lang'] = df['clean_text'].apply(safe_lang)
except Exception as e:
    print("Language detection failed on some entries — proceeding and marking as 'unknown'", e)
    df['lang'] = 'unknown'

# keep English or unknown for this demo
df = df[df['lang'].str.startswith('en') | (df['lang'] == 'unknown')]
print("After language filter rows:", len(df))




Applying cleaning + language detection (fast)...
After language filter rows: 2623248
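
Since language detection is unreliable on very short strings, one option is to skip the detector for them and return `'unknown'` directly, which the filter above already keeps. The following is a minimal sketch under that assumption; `guarded_lang` and the `min_letters=20` cutoff are illustrative choices, not values taken from this notebook.

```python
def guarded_lang(text, detector, min_letters=20):
    """Call `detector` only when the text has enough letters to be meaningful.

    `detector` is any callable that maps a string to a language code,
    for example langdetect's `detect`.
    """
    letters = sum(ch.isalpha() for ch in text)
    if letters < min_letters:
        return "unknown"
    try:
        return detector(text)
    except Exception:
        # Mirror safe_lang: any detector failure becomes 'unknown'
        return "unknown"
```

In the notebook this could be applied as `df['clean_text'].apply(lambda t: guarded_lang(t, detect))`. It also avoids running the detector on the shortest rows.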


In [None]:
df_samp = df.sample(frac=0.2, random_state=42)
df_samp.head()

Unnamed: 0,tweet_id,author_id,inbound,created_at,text,response_tweet_id,in_response_to_tweet_id,is_support,clean_text,lang
2095237,2255361,299048,True,Fri Nov 10 22:03:49 +0000 2017,@AppleSupport Current version is 11.1,2255360.0,2255362.0,1,Current version is 11.1,en
2765963,2942276,663570,True,Wed Nov 29 13:02:18 +0000 2017,Shoutout to @JetBlue and #flight2324. I love y...,29422742942277.0,,0,Shoutout to and #flight2324. I love you big bl...,en
2546903,2717462,LondonMidland,False,Mon Nov 20 07:12:23 +0000 2017,"@431047 soory the service is so busy, we are u...",2717463.0,2717464.0,0,"soory the service is so busy, we are utilising...",en
1128818,1247068,sprintcare,False,Wed Oct 25 23:34:44 +0000 2017,@412581 Hi there! This not the way we want you...,1247066.0,1247069.0,1,Hi there! This not the way we want you to feel...,en
1831776,1987428,TMobileHelp,False,Wed Oct 18 18:55:44 +0000 2017,"@506239 This is right up my alley, shoot me a ...",,1987429.0,0,"This is right up my alley, shoot me a quick DM...",en



# 4 — Prepare Hugging Face datasets


In [None]:
sel = df_samp[['clean_text','is_support']].rename(columns={'clean_text':'text','is_support':'label'})
# Remove any possible nulls
sel = sel.fillna({'text': ''})
# HF dataset
hf_ds = Dataset.from_pandas(sel.reset_index(drop=True))
# small train/test split (with a single row, reuse it for both splits)
if len(hf_ds) > 1:
    tokenizable = hf_ds.train_test_split(test_size=0.15, seed=42)
else:
    tokenizable = DatasetDict({'train': hf_ds, 'test': hf_ds})
print(tokenizable)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 445952
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 78698
    })
})
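
The split above is random, so the share of support tweets can differ slightly between train and test. A stratified split keeps that share roughly equal. The sketch below shows the idea in plain Python on a list of labels; `stratified_indices` is a hypothetical helper, and the resulting index lists could then be passed to `hf_ds.select(...)`.

```python
import random
from collections import defaultdict

def stratified_indices(labels, test_size=0.15, seed=42):
    """Split row indices so each label keeps roughly the same share in both parts."""
    by_label = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_label[lab].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for lab in sorted(by_label):
        idxs = by_label[lab]
        rng.shuffle(idxs)
        # Take the same fraction of every class for the test split
        n_test = round(len(idxs) * test_size)
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)
```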


In [None]:
# Check your current Transformers version
import transformers
print(transformers.__version__)

4.57.0


# 5 — Tokenization


In [None]:
model_name = 'distilbert-base-uncased'  # change for larger/better models
print(f"Loading tokenizer for {model_name} (this downloads model files)...")

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

def preprocess_fn(examples):
    return tokenizer(examples['text'], truncation=True, max_length=128)

# Tokenize (if dataset is tiny this will be fast)
try:
    tokenized = tokenizable.map(preprocess_fn, batched=True)
    data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
except Exception as e:
    print("Tokenization failed — ensure transformers and datasets are installed.", e)
    raise

Loading tokenizer for distilbert-base-uncased (this downloads model files)...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

Map:   0%|          | 0/445952 [00:00<?, ? examples/s]

Map:   0%|          | 0/78698 [00:00<?, ? examples/s]
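
Note that `preprocess_fn` truncates to 128 tokens but does not pad; padding is left to `DataCollatorWithPadding`, which pads each batch to its own longest sequence. The sketch below imitates that behavior in plain Python to make the idea concrete. `pad_batch` is a hypothetical helper, and the pad id of 0 is an assumption about the DistilBERT vocabulary.

```python
def pad_batch(batch_input_ids, pad_token_id=0):
    """Pad each sequence to the longest one in this batch and build attention masks."""
    longest = max(len(ids) for ids in batch_input_ids)
    padded, masks = [], []
    for ids in batch_input_ids:
        n_pad = longest - len(ids)
        padded.append(list(ids) + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding the model should ignore
        masks.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": padded, "attention_mask": masks}
```

Padding per batch rather than to a fixed length of 128 keeps batches of short tweets small, which is why the tokenizer call above does not pad.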

# 6 — Fine-tune (Trainer) — quick demo training

In [None]:
# NOTE: Fine-tuning downloads the model weights and is slow on CPU. This demo trains for 3 epochs (see num_train_epochs below).
from transformers import TrainingArguments

model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
metric_acc = evaluate.load('accuracy')
metric_f1 = evaluate.load('f1')

from transformers import EvalPrediction

def compute_metrics(p: EvalPrediction):
    preds = np.argmax(p.predictions, axis=1)
    acc = metric_acc.compute(predictions=preds, references=p.label_ids)['accuracy']
    f1 = metric_f1.compute(predictions=preds, references=p.label_ids, average='weighted')['f1']
    return {'accuracy': acc, 'f1': f1}

training_args = TrainingArguments(
    output_dir='./outputs',
    eval_strategy="epoch",   # or "steps"
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=100,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized['train'],
    eval_dataset=tokenized['test'],
    processing_class=tokenizer,  # replaces the deprecated 'tokenizer=' argument in recent transformers releases
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

print("Starting training (this may take time, consider using GPU runtime)...")
trainer.train()
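print("Eval:")
# evaluate() returns a metrics dict; it is not displayed below because it is
# not the last expression in the cell. Wrap it in print() to see the values.
trainer.evaluate()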
trainer.save_model('./support_classifier')
tokenizer.save_pretrained('./support_classifier')

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Starting training (this may take time, consider using GPU runtime)...


wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize?ref=models
wandb: Paste an API key from your profile and hit enter:
wandb: No netrc file found, creating one.
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
wandb: Currently logged in as: ayeniemmanuel93 (ayeniemmanuel93-lifetime-realty) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,0.2154,0.206671,0.90137,0.90043
2,0.1861,0.21675,0.904267,0.903793
3,0.1551,0.262688,0.902475,0.902242


Eval:


('./support_classifier/tokenizer_config.json',
 './support_classifier/special_tokens_map.json',
 './support_classifier/vocab.txt',
 './support_classifier/added_tokens.json',
 './support_classifier/tokenizer.json')
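
In the training table above, training loss keeps falling while validation loss rises after epoch 1, a pattern that often indicates overfitting. As configured, the model saved to `./support_classifier` holds the weights from the end of epoch 3. The short sketch below, using the numbers copied from the table, shows which epoch would be chosen under two common criteria.

```python
# Per-epoch metrics copied from the training table above
history = [
    {"epoch": 1, "train_loss": 0.2154, "val_loss": 0.206671, "accuracy": 0.90137, "f1": 0.90043},
    {"epoch": 2, "train_loss": 0.1861, "val_loss": 0.21675, "accuracy": 0.904267, "f1": 0.903793},
    {"epoch": 3, "train_loss": 0.1551, "val_loss": 0.262688, "accuracy": 0.902475, "f1": 0.902242},
]

# Lowest validation loss and highest F1 can point at different epochs
best_by_loss = min(history, key=lambda r: r["val_loss"])
best_by_f1 = max(history, key=lambda r: r["f1"])

print("best by validation loss: epoch", best_by_loss["epoch"])
print("best by F1: epoch", best_by_f1["epoch"])
```

`TrainingArguments` offers `load_best_model_at_end=True` together with `metric_for_best_model` to keep the preferred checkpoint automatically; this notebook does not set them.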

# 7 — Inference pipeline & Slack forwarding helper

In [None]:
# Simple inference pipeline using HF pipeline. Ensure you set SLACK_WEBHOOK if you want forwarding.
model_path = './support_classifier'
clf = pipeline('text-classification', model=model_path, tokenizer=model_path)  # returns only the top label by default

# Replace with your Slack webhook URL (or keep empty to disable forwarding)
SLACK_WEBHOOK = ''  # <-- REPLACE with your Slack Incoming Webhook, e.g. 'https://hooks.slack.com/services/XXX/YYY/ZZZ'

# helper functions

def classify_text(text):
    out = clf(text, truncation=True, max_length=128)
    label = out[0]['label']
    score = out[0]['score']
    return label, float(score)


def post_to_slack(text, meta=None):
    if not SLACK_WEBHOOK:
        print("SLACK_WEBHOOK not set — skipping forwarding.")
        return None
    payload = {
        'text': f"*Support tweet detected*: {text}\n{meta if meta else ''}"
    }
    resp = requests.post(SLACK_WEBHOOK, json=payload, timeout=10)  # avoid hanging if Slack is unreachable
    return resp.status_code, resp.text

# Quick demo inference on first few rows
for i, row in df.head(3).iterrows():
    lab, sc = classify_text(row['clean_text'])
    print(f"Row {i} -> label={lab}, score={sc}")


Device set to use cuda:0


Row 0 -> label=LABEL_0, score=0.9955766201019287
Row 1 -> label=LABEL_0, score=0.9781672954559326
Row 2 -> label=LABEL_1, score=0.6505919694900513
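
The pipeline reports generic names: since the model was trained on `is_support` values 0 and 1, `LABEL_1` corresponds to a support tweet. The sketch below maps those names to readable ones and applies the same 0.7 score threshold that the FastAPI endpoint in section 8 uses before forwarding to Slack. The names `ID2NAME` and `route`, and the strings `"support"` / `"not_support"`, are hypothetical.

```python
# Assumed readable names for the two classes (not stored in the model config)
ID2NAME = {"LABEL_0": "not_support", "LABEL_1": "support"}

def route(label, score, threshold=0.7):
    """Decide whether a prediction is confident enough to forward."""
    name = ID2NAME.get(label, label)
    forward = name == "support" and score > threshold
    return {"label": name, "score": score, "forward": forward}
```

With the demo rows above, row 2 is predicted as `LABEL_1` with a score of about 0.65, so it would be classified as support but not forwarded.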




# 8 — FastAPI microservice (demo)


In [None]:
app = FastAPI(title='Support Tweet Classifier')

DATABASE_URL = 'sqlite:///./support_tickets.db'
engine = create_engine(DATABASE_URL, connect_args={'check_same_thread': False})
metadata = MetaData()

tickets = Table(
    'tickets', metadata,
    Column('id', Integer, primary_key=True),
    Column('tweet_text', String),
    Column('label', String),
    Column('score', Float),
    Column('created_at', DateTime),
)
metadata.create_all(engine)
SessionLocal = sessionmaker(bind=engine)

from typing import Optional

class TweetIn(BaseModel):
    text: str
    username: Optional[str] = None
    tweet_id: Optional[str] = None

@app.post('/predict')
async def predict_item(item: TweetIn):
    label, score = classify_text(item.text)
    # The context manager closes the session even if the insert raises
    with SessionLocal() as session:
        ins = tickets.insert().values(tweet_text=item.text, label=label, score=score, created_at=datetime.utcnow())
        session.execute(ins)
        session.commit()
    # Forward only confident positive predictions (LABEL_1 = support)
    if label == 'LABEL_1' and score > 0.7:
        post_to_slack(item.text, {'label': label, 'score': score, 'username': item.username, 'tweet_id': item.tweet_id})
    return {'label': label, 'score': score}

# If you want to run uvicorn inside a notebook/script, nest_asyncio helps. For production run uvicorn outside.
nest_asyncio.apply()
print("FastAPI app defined. To run locally: uvicorn <this_file_name>:app --reload")

FastAPI app defined. To run locally: uvicorn <this_file_name>:app --reload
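
Once the app is served with uvicorn, the `/predict` endpoint accepts a JSON body matching `TweetIn`. The sketch below builds such a request with the standard library only. The base URL `http://127.0.0.1:8000` assumes uvicorn's default host and port, and `build_predict_request` is a hypothetical helper.

```python
import json
import urllib.request

def build_predict_request(text, username=None, tweet_id=None,
                          base_url="http://127.0.0.1:8000"):
    """Build (but do not send) a POST request for the /predict endpoint."""
    body = {"text": text}
    # Leave unset optional fields out of the body instead of sending null
    if username is not None:
        body["username"] = username
    if tweet_id is not None:
        body["tweet_id"] = tweet_id
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

With the server running, the request can be sent with `urllib.request.urlopen(req, timeout=10)` and the response parsed with `json.load`; it should contain the `label` and `score` keys returned by `predict_item`.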


# 9 — Streamlit dashboard (written to file)

In [None]:

streamlit_code = """
import streamlit as st
import pandas as pd
from sqlalchemy import create_engine

st.title('Support Ticket Dashboard (MVP)')
engine = create_engine('sqlite:///./support_tickets.db')
df = pd.read_sql_table('tickets', con=engine)
st.write('Total tickets:', len(df))
if not df.empty:
    st.dataframe(df.sort_values('created_at', ascending=False).assign(created_at=lambda d: pd.to_datetime(d['created_at'])))
"""
with open('streamlit_app.py', 'w') as f:
    f.write(streamlit_code)
print('Saved streamlit_app.py. Run with: streamlit run streamlit_app.py')


Saved streamlit_app.py. Run with: streamlit run streamlit_app.py
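
The dashboard lists raw rows from the `tickets` table. A per-label summary is a natural addition; the sketch below shows the kind of query it could run, using an in-memory SQLite database with the same columns as the table defined in section 8. The rows are made up for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tickets (id INTEGER PRIMARY KEY, tweet_text TEXT, "
    "label TEXT, score REAL, created_at TEXT)"
)

# Made-up rows for illustration only; real rows come from the /predict endpoint
rows = [
    ("my phone won't connect", "LABEL_1", 0.93, "2017-10-31 22:10:47"),
    ("thanks for the quick reply", "LABEL_0", 0.98, "2017-10-31 22:11:45"),
    ("still waiting on a refund", "LABEL_1", 0.71, "2017-10-31 22:12:30"),
]
conn.executemany(
    "INSERT INTO tickets (tweet_text, label, score, created_at) VALUES (?, ?, ?, ?)",
    rows,
)

# Ticket count and mean confidence per predicted label
summary = conn.execute(
    "SELECT label, COUNT(*), ROUND(AVG(score), 2) FROM tickets "
    "GROUP BY label ORDER BY label"
).fetchall()
conn.close()
print(summary)
```

In `streamlit_app.py` the same SQL could be run with `pd.read_sql_query` against the existing engine and shown with `st.dataframe`.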


# 10 — Lightweight tests for preprocessing functions

In [None]:

# These tests are small and quick; they don't require model downloads.

def _test_clean_text_and_lang():
    examples = [
        ("Check this out! http://example.com @user 😂", ['check this out', 'face_with_tears_of_joy']),
        ("No URL here, just text.", ['no url here, just text']),
        (None, [''])
    ]
    for i, (inp, expected_subs) in enumerate(examples):
        out = clean_text(inp)
        out_lower = out.lower()
        ok = all(sub in out_lower for sub in expected_subs)
        print(f"Test clean_text {i}:", "OK" if ok else f"FAIL -> '{out}' expected to contain {expected_subs}")
        print("safe_lang:", safe_lang(out))

print("Running quick preprocessing tests...")
_test_clean_text_and_lang()


Running quick preprocessing tests...
Test clean_text 0: OK
safe_lang: en
Test clean_text 1: OK
safe_lang: en
Test clean_text 2: OK
safe_lang: unknown
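
The checks above print OK or FAIL, so a regression would be easy to miss. An assert-based variant fails loudly instead. The sketch below uses a standard-library subset of `clean_text` (the emoji step is omitted so the block has no third-party dependencies); `clean_text_basic` is a hypothetical name.

```python
import html
import re

def clean_text_basic(text):
    """Stdlib-only subset of clean_text: URLs, handles, HTML entities, whitespace."""
    if text is None:
        return ""
    text = re.sub(r'http\S+|www\.[^\s]+', '', str(text))
    text = re.sub(r'@\w+', '', text)
    text = html.unescape(text)
    return re.sub(r'\s+', ' ', text).strip()

def test_clean_text_basic():
    # URL and handle are removed, leftover whitespace is collapsed
    assert clean_text_basic("Check this out! http://example.com @user") == "Check this out!"
    # HTML entities are unescaped
    assert clean_text_basic("Tom &amp; Jerry   rock") == "Tom & Jerry rock"
    # None becomes an empty string
    assert clean_text_basic(None) == ""

test_clean_text_basic()
```

The same assertions can be pointed at the notebook's own `clean_text` once `emoji` is installed.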


# 11 — Notes & next steps


In [None]:


print('Updated notebook script ready — edit placeholders (dataset_slug, csv_path, SLACK_WEBHOOK) and re-run.')

Updated notebook script ready — edit placeholders (dataset_slug, csv_path, SLACK_WEBHOOK) and re-run.
