# AI Solutions Engineer Assignment – Palindrome Data

**Name:** Emmanuel Sifiso Nkosi

This notebook:
1. Loads the provided dataset of synthetic conversations between an AI chatbot and healthcare clients.
2. Parses each conversation into a structured format.
3. Applies simple rule-based models to produce:
   - HIV acquisition risk score.
   - Mental health risk score.
4. Generates high-level recommendations and a treatment plan aligned with generic South African public-sector (NDOH-style) guidance.

**Time spent on assignment:** 1 hr 25 min



Imports and data loading

In [1]:
import re
from pathlib import Path
from dataclasses import dataclass, asdict
from typing import List

import pandas as pd


In [2]:
DATA_PATH = Path("health_ai_whatsapp_100_conversations_long.txt")

raw_text = DATA_PATH.read_text(encoding="utf-8")

# Conversations are separated by a marker line
CONV_SEPARATOR = "========== Conversation =========="
raw_conversations = [c.strip() for c in raw_text.split(CONV_SEPARATOR) if c.strip()]

print(f"Loaded {len(raw_conversations)} conversations")


Loaded 100 conversations
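
The split-and-filter step can be checked on a small inline sample. The sketch below uses a made-up two-conversation string (not taken from the data file) and assumes, as a guess about the file layout, that every conversation is preceded by the marker line. The names `SEPARATOR`, `sample_text` and `chunks` are illustrative only.

```python
# Hypothetical two-conversation sample in the same layout as the data file.
SEPARATOR = "========== Conversation =========="
sample_text = (
    f"{SEPARATOR}\n"
    "[01/01/2025, 08:00] User: Hi\n"
    "[01/01/2025, 08:03] AI: Hello\n"
    f"{SEPARATOR}\n"
    "[02/01/2025, 09:00] User: Hi again\n"
)

# str.split returns an empty first chunk when the text starts with the
# separator; the `if chunk.strip()` filter drops it.
chunks = [chunk.strip() for chunk in sample_text.split(SEPARATOR) if chunk.strip()]

print(len(chunks))
print(chunks[0])
```

Without the filter, the leading separator would produce an extra empty "conversation" and shift every conversation index by one.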


Parsing conversations into a DataFrame

In [3]:
# Pattern: [timestamp] User/AI: message
message_pattern = re.compile(
    r"\[(?P<timestamp>[^\]]+)\]\s+(?P<speaker>User|AI):\s+(?P<text>.+)"
)

def parse_conversation(conv_text: str, conv_id: int) -> pd.DataFrame:
    rows = []
    for match in message_pattern.finditer(conv_text):
        rows.append(
            {
                "conversation_id": conv_id,
                "timestamp": match.group("timestamp"),
                "speaker": match.group("speaker"),
                "text": match.group("text"),
            }
        )
    return pd.DataFrame(rows)

dfs = [parse_conversation(c, i) for i, c in enumerate(raw_conversations)]
messages_df = pd.concat(dfs, ignore_index=True)

messages_df.head()


,conversation_id,timestamp,speaker,text
0,0,"01/01/2025, 08:00",User,"Hi, I need help about something sensitive."
1,0,"01/01/2025, 08:03",AI,I'm here with you. Tell me what's going on.
2,0,"01/01/2025, 08:06",User,I'm not sure if what I'm feeling is normal.
3,0,"01/01/2025, 08:09",AI,Thanks for sharing that. Symptoms can have man...
4,0,"01/01/2025, 08:12",User,It started a few days ago.
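
To see what the pattern captures, it can be applied to a single line. The sketch below repeats the regular expression from the cell above and additionally converts the timestamp to a `datetime`. The day-first format string is an assumption: the sample dates (01/01/2025) do not show whether the file uses day/month or month/day order.

```python
import re
from datetime import datetime

message_pattern = re.compile(
    r"\[(?P<timestamp>[^\]]+)\]\s+(?P<speaker>User|AI):\s+(?P<text>.+)"
)

line = "[01/01/2025, 08:00] User: Hi, I need help about something sensitive."
match = message_pattern.match(line)
parsed = match.groupdict()  # {'timestamp': ..., 'speaker': ..., 'text': ...}

# Assumption: dates are day-first (dd/mm/yyyy).
ts = datetime.strptime(parsed["timestamp"], "%d/%m/%Y, %H:%M")

print(parsed["speaker"], "|", parsed["text"])
print(ts.isoformat())
```

Because `.` does not match a newline by default, `text` stops at the end of the line, so a message that wraps onto a second line would lose its continuation.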


Aggregate user text per conversation

Risk is scored using only the user messages.

In [4]:
user_conv_df = (
    messages_df[messages_df["speaker"] == "User"]
    .groupby("conversation_id")["text"]
    .apply(lambda texts: " ".join(texts))
    .reset_index()
    .rename(columns={"text": "user_text"})
)

user_conv_df.head()

,conversation_id,user_text
0,0,"Hi, I need help about something sensitive. I'm..."
1,1,"Hi, I need help about something sensitive. I'm..."
2,2,"Hi, I need help about something sensitive. I'm..."
3,3,"Hi, I need help about something sensitive. I'm..."
4,4,"Hi, I need help about something sensitive. I'm..."


Risk model helpers and dataclass

In [5]:
@dataclass
class RiskOutput:
    hiv_score: float
    hiv_level: str
    mental_score: float
    mental_level: str
    hiv_recommendation: str
    mental_recommendation: str
    general_note: str

In [6]:
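def keyword_count(text: str, keywords: List[str]) -> int:
    """Count case-insensitive, non-overlapping substring occurrences of all keywords in text."""
    # Substring matching means short keywords also match inside longer words.
    text_lower = text.lower()
    return sum(text_lower.count(k.lower()) for k in keywords)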
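
Substring counting has a side effect worth checking: a short keyword such as "sti" is also found inside "still" and "testing". The sketch below compares the substring counter with a word-boundary variant. `keyword_count_substring` and `keyword_count_words` are hypothetical names introduced for this comparison, and the second function is an alternative, not the method used in this notebook.

```python
import re
from typing import List


def keyword_count_substring(text: str, keywords: List[str]) -> int:
    """Same logic as keyword_count: plain substring counting."""
    text_lower = text.lower()
    return sum(text_lower.count(k.lower()) for k in keywords)


def keyword_count_words(text: str, keywords: List[str]) -> int:
    """Count only occurrences that start and end on a word boundary."""
    text_lower = text.lower()
    total = 0
    for keyword in keywords:
        pattern = r"\b" + re.escape(keyword.lower()) + r"\b"
        total += len(re.findall(pattern, text_lower))
    return total


sample = "I am still thinking about testing."
substring_hits = keyword_count_substring(sample, ["sti"])  # matches "still" and "testing"
word_hits = keyword_count_words(sample, ["sti"])           # no standalone "sti"

print(substring_hits, word_hits)
print(keyword_count_words("I have an STI.", ["sti"]))
```

The trade-off is that strict word boundaries also stop "inject" from matching "injecting", so stems like that would need to be listed in each form or given a looser pattern.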


Rule-based HIV acquisition risk model

In [None]:
def hiv_risk_model(user_text: str):
    """
    Very simple, rule-based HIV acquisition risk model.
    Returns (score_0_to_1, level, recommendation_text).
    This is for demonstration only and NOT for clinical use.
    """
    text = user_text.lower()
    score = 0.0

    # Example keyword buckets – these would be tuned with real data/experts
    high_risk_terms = [
        "no condom",
        "without a condom",
        "unprotected",
        "didn't use protection",
        "raw sex",
        "multiple partners",
        "new partner",
        "sex worker",
        "needle",
        "inject",
    ]
    medium_risk_terms = [
        "partner tested positive",
        "partner has hiv",
        "sti",
        "discharge",
        "sores",
        "bleeding",
        "condom broke",
        "condom slipped",
        "one-night stand",
    ]
    low_risk_terms = [
        "kiss",
        "hug",
        "sharing utensils",
    ]

    score += 0.25 * keyword_count(text, low_risk_terms)
    score += 0.5 * keyword_count(text, medium_risk_terms)
    score += 1.0 * keyword_count(text, high_risk_terms)

    score = min(score, 3.0) / 3.0

    # Any mention of testing sets a floor of 0.2.
    # The substring "test" also covers "testing" and "tested".
    if "test" in text:
        score = max(score, 0.2)

    if score < 0.25:
        level = "Low"
    elif score < 0.6:
        level = "Moderate"
    else:
        level = "High"

    rec_lines = [
        "This is not a medical diagnosis. A healthcare professional should always make clinical decisions.",
        f"Based on what you've shared, your HIV acquisition risk is assessed as **{level}** within this simple rule-based model.",
        "General HIV-prevention guidance (aligned with public-sector messaging in South Africa) includes:",
        "- Everyone who is sexually active should know their HIV status. Testing at a clinic or mobile site is strongly encouraged.",
        "- If there has been any possibility of exposure, get an HIV test and ask about tests for other sexually transmitted infections (STIs).",
        "- Consistent condom use and reducing the number of sexual partners help lower future risk.",
        "- If you have ongoing risk (for example, regular unprotected sex with a partner of unknown status), ask a nurse or doctor about HIV pre-exposure prophylaxis (PrEP).",
    ]
    rec = "\n".join(rec_lines)

    return score, level, rec
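
The scoring arithmetic is easier to follow with a few worked cases. The sketch below reproduces only the cap, the rescaling to 0-1, the testing floor and the level thresholds from `hiv_risk_model`. `normalise` and `level_for` are hypothetical helper names, and the raw sums are hand-picked illustrations rather than values from the dataset.

```python
def normalise(raw_score: float, cap: float = 3.0) -> float:
    """Cap the weighted keyword sum at `cap` and rescale it to the 0-1 range."""
    return min(raw_score, cap) / cap


def level_for(score: float) -> str:
    """Map a 0-1 score to the level names used by the model."""
    if score < 0.25:
        return "Low"
    if score < 0.6:
        return "Moderate"
    return "High"


# (description, raw weighted sum, does the text mention "test"?)
cases = [
    ("one low-risk term, mentions testing", 0.25, True),
    ("one high-risk + one medium-risk term", 1.5, False),
    ("two high-risk terms", 2.0, False),
    ("five high-risk terms (capped)", 5.0, False),
]

hiv_examples = {}
for description, raw, mentions_test in cases:
    score = normalise(raw)
    if mentions_test:
        score = max(score, 0.2)  # floor applied after normalisation
    hiv_examples[description] = (round(score, 2), level_for(score))
    print(f"{description}: score={score:.2f}, level={level_for(score)}")
```

With these weights a single high-risk term (1.0 / 3.0, about 0.33) lands in "Moderate", and it takes a raw sum of at least 1.8 to reach "High".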


Rule-based mental health risk model

In [8]:
def mental_health_risk_model(user_text: str):
    """
    Very simple, rule-based mental health risk model.
    Returns (score_0_to_1, level, recommendation_text).
    This is for demonstration only and NOT for clinical use.
    """
    text = user_text.lower()
    score = 0.0

    stress_terms = ["stressed", "overwhelmed", "worried", "anxious", "can't sleep", "insomnia"]
    depression_terms = [
        "hopeless",
        "no hope",
        "empty",
        "can't go on",
        "no energy",
        "tired all the time",
        "lost interest",
        "don't enjoy anything",
        "crying a lot",
        "worthless",
    ]
    crisis_terms = [
        "hurt myself",
        "kill myself",
        "suicidal",
        "end my life",
        "self-harm",
        "cutting",
    ]

    score += 0.3 * keyword_count(text, stress_terms)
    score += 0.6 * keyword_count(text, depression_terms)
    score += 1.5 * keyword_count(text, crisis_terms)

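    # Floor for explicit mentions of being worried/stressed. It is applied before
    # normalisation, and each of these terms already adds 0.3 above, so it does not
    # change the score in practice (the HIV model applies its floor after normalisation).
    if "stressed" in text or "worried" in text:
        score = max(score, 0.2)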

    score = min(score, 3.0) / 3.0

    if score < 0.25:
        level = "Low"
    elif score < 0.6:
        level = "Moderate"
    else:
        level = "High"

    crisis_flag = any(term in text for term in crisis_terms)

    rec_lines = [
        "This is not a mental health diagnosis. Only a qualified professional can assess and diagnose mental health conditions.",
        f"Based on your messages, emotional/mental health risk is assessed as **{level}** within this simple rule-based model.",
        "General guidance aligned with mental health support approaches in South Africa:",
        "- If distress is affecting your sleep, work, studies, or relationships, it is important to speak to a healthcare worker or counsellor.",
        "- Public clinics can provide an initial assessment and refer you to counselling or mental health services where needed.",
        "- Trusted helplines and NGOs can offer free phone or WhatsApp support if in-person help is hard to access.",
    ]
    if crisis_flag:
        rec_lines.append(
            "- If you ever feel at risk of harming yourself or others, please seek emergency help immediately at your nearest hospital emergency unit or call an emergency helpline."
        )
    rec = "\n".join(rec_lines)

    return score, level, rec
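
A few worked cases show how the weights translate into levels. The sketch below recomputes the score from hit counts alone. `mental_score` and `level_for` are hypothetical helpers, and the hit counts are illustrative. The pre-normalisation floor is omitted because it only applies when "stressed" or "worried" is present, in which case the raw sum is already at least 0.3.

```python
STRESS_WEIGHT, DEPRESSION_WEIGHT, CRISIS_WEIGHT = 0.3, 0.6, 1.5
CAP = 3.0


def mental_score(stress_hits: int, depression_hits: int, crisis_hits: int) -> float:
    """Weighted keyword sum, capped at CAP and rescaled to 0-1."""
    raw = (
        STRESS_WEIGHT * stress_hits
        + DEPRESSION_WEIGHT * depression_hits
        + CRISIS_WEIGHT * crisis_hits
    )
    return min(raw, CAP) / CAP


def level_for(score: float) -> str:
    """Same thresholds as the model: <0.25 Low, <0.6 Moderate, else High."""
    if score < 0.25:
        return "Low"
    if score < 0.6:
        return "Moderate"
    return "High"


one_stress = mental_score(1, 0, 0)  # e.g. a single "stressed"
one_crisis = mental_score(0, 0, 1)  # e.g. a single "suicidal"
two_crisis = mental_score(0, 0, 2)

for label, value in [("one stress term", one_stress),
                     ("one crisis term", one_crisis),
                     ("two crisis terms", two_crisis)]:
    print(f"{label}: score={value:.2f}, level={level_for(value)}")
```

The first case is consistent with the 0.1 shown for conversation 0 in the results table. The second case suggests that a single crisis term scores 0.5, which is "Moderate" rather than "High"; the extra emergency line in the recommendation is triggered by `crisis_flag`, not by the level.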


Wrapper: assess one conversation

In [9]:
def assess_conversation(conv_id: int) -> RiskOutput:
    """
    Run both HIV and mental-health rule-based models on one conversation.
    """
    row = user_conv_df.loc[user_conv_df["conversation_id"] == conv_id]
    if row.empty:
        raise ValueError(f"No conversation with id {conv_id}")

    user_text = row.iloc[0]["user_text"]

    hiv_score, hiv_level, hiv_rec = hiv_risk_model(user_text)
    mh_score, mh_level, mh_rec = mental_health_risk_model(user_text)

    general_note = (
        "This tool is a technical prototype for an interview exercise. "
        "It should NOT be used for real clinical triage or decision-making."
    )

    return RiskOutput(
        hiv_score=hiv_score,
        hiv_level=hiv_level,
        mental_score=mh_score,
        mental_level=mh_level,
        hiv_recommendation=hiv_rec,
        mental_recommendation=mh_rec,
        general_note=general_note,
    )


Structured “treatment plan” generator

This addresses Task 3 (recommendation and treatment plan). The wording follows generic NDOH-style guidance but stays high-level and avoids patient-specific clinical instructions.

In [11]:
def build_treatment_plan(risk: RiskOutput) -> str:
    """
    Build a human-readable treatment and follow-up plan string.
    This is high-level and for demonstration only – not real medical advice.
    """
    lines = []

    # HIV section
    lines.append(f"HIV risk level: {risk.hiv_level}")
    if risk.hiv_level == "Low":
        lines.append(
            "- Encourage routine HIV testing according to local guidelines "
            "(for example, at least once a year if sexually active)."
        )
        lines.append(
            "- Reinforce safer-sex practices such as consistent condom use "
            "and open communication about HIV status with partners."
        )
    elif risk.hiv_level == "Moderate":
        lines.append(
            "- Recommend HIV testing as soon as possible at a public clinic, hospital, "
            "or community testing site."
        )
        lines.append(
            "- Discuss additional STI screening and education about condom use and "
            "reduction of risk behaviours."
        )
        lines.append(
            "- Encourage discussion with a nurse or doctor about PrEP if ongoing risk is likely."
        )
    else:  # High
        lines.append(
            "- Treat this as a priority for same-week clinical assessment and HIV testing, "
            "following national testing and linkage-to-care guidance."
        )
        lines.append(
            "- Assess for recent high-risk exposure and follow local protocols for urgent management, "
            "which may include immediate testing and further medical evaluation."
        )
        lines.append(
            "- Provide strong counselling support and schedule follow-up to ensure linkage to "
            "ongoing HIV prevention or treatment services."
        )

    # Mental-health section
    lines.append("")
    lines.append(f"Mental health risk level: {risk.mental_level}")
    if risk.mental_level == "Low":
        lines.append(
            "- Offer basic psychoeducation about stress and normal emotional reactions to health concerns."
        )
        lines.append(
            "- Share simple coping strategies (sleep hygiene, physical activity, talking to trusted people) "
            "and encourage the user to return if symptoms worsen."
        )
    elif risk.mental_level == "Moderate":
        lines.append(
            "- Recommend a full assessment with a nurse, doctor, or counsellor at a clinic."
        )
        lines.append(
            "- Suggest referral options for counselling or support groups where available."
        )
        lines.append(
            "- Encourage the user to monitor mood, sleep, appetite, and functioning, and to seek earlier help "
            "if these deteriorate."
        )
    else:  # High
        lines.append(
            "- Prioritise urgent mental health assessment, ideally within the same day or as soon as possible."
        )
        lines.append(
            "- Screen carefully for any thoughts of self-harm or suicide and follow emergency pathways if present."
        )
        lines.append(
            "- Arrange close follow-up and, where possible, involvement of family or support persons, "
            "in line with the user's consent and local practice."
        )

    lines.append("")
    lines.append(
        "Important: This plan is generated by a prototype model for demonstration. "
        "Final clinical decisions must always be made by qualified healthcare professionals "
        "using full clinical information."
    )

    return "\n".join(lines)


Run the pipeline on one example conversation

In [None]:
example_id = 0  # pick any conversation index from 0 to 99

print("===== RAW CONVERSATION =====")
print(raw_conversations[example_id])

risk = assess_conversation(example_id)
risk_dict = asdict(risk)

print("\n===== NUMERIC RISK SCORES =====")
print(f"HIV score (0–1):      {risk_dict['hiv_score']:.2f}  | level: {risk_dict['hiv_level']}")
print(f"Mental health score:  {risk_dict['mental_score']:.2f}  | level: {risk_dict['mental_level']}")

print("\n===== HIV RECOMMENDATION =====")
print(risk_dict["hiv_recommendation"])

print("\n===== MENTAL HEALTH RECOMMENDATION =====")
print(risk_dict["mental_recommendation"])

print("\n===== TREATMENT PLAN (SUMMARY) =====")
print(build_treatment_plan(risk))
print("\n===== GENERAL NOTE =====")
print(risk_dict["general_note"])


===== RAW CONVERSATION =====
[01/01/2025, 08:00] User: Hi, I need help about something sensitive.
[01/01/2025, 08:03] AI: I'm here with you. Tell me what's going on.
[01/01/2025, 08:06] User: I'm not sure if what I'm feeling is normal.
[01/01/2025, 08:09] AI: Thanks for sharing that. Symptoms can have many causes.
[01/01/2025, 08:12] User: It started a few days ago.
[01/01/2025, 08:15] AI: How intense is it and has anything changed over time?
[01/01/2025, 08:18] User: I thought it would go away but it hasn't.
[01/01/2025, 08:21] AI: It's understandable to feel worried. We'll figure it out.
[01/01/2025, 08:24] User: I'm feeling stressed about it.
[01/01/2025, 08:27] AI: Some issues settle naturally, others need more care.
[01/01/2025, 08:30] User: I'm unsure if I should see someone in person.
[01/01/2025, 08:33] AI: Seeing someone in person is helpful sometimes, but not always required.
[01/01/2025, 08:36] User: Could this be serious?
[01/01/2025, 08:39] AI: I'll help you understand if 

Run on all conversations and summarise

In [13]:
results = [asdict(assess_conversation(i)) for i in range(len(raw_conversations))]
results_df = pd.DataFrame(results)

results_df.head()

,hiv_score,hiv_level,mental_score,mental_level,hiv_recommendation,mental_recommendation,general_note
0,0.2,Low,0.1,Low,This is not a medical diagnosis. A healthcare ...,This is not a mental health diagnosis. Only a ...,This tool is a technical prototype for an inte...
1,0.2,Low,0.1,Low,This is not a medical diagnosis. A healthcare ...,This is not a mental health diagnosis. Only a ...,This tool is a technical prototype for an inte...
2,0.2,Low,0.1,Low,This is not a medical diagnosis. A healthcare ...,This is not a mental health diagnosis. Only a ...,This tool is a technical prototype for an inte...
3,0.2,Low,0.1,Low,This is not a medical diagnosis. A healthcare ...,This is not a mental health diagnosis. Only a ...,This tool is a technical prototype for an inte...
4,0.2,Low,0.1,Low,This is not a medical diagnosis. A healthcare ...,This is not a mental health diagnosis. Only a ...,This tool is a technical prototype for an inte...
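
The first five rows alone do not show how levels are distributed across all conversations. One way to tally them is sketched below with `collections.Counter`. `sample_results` is a made-up stand-in for the `results` list of dicts, so the counts are illustrative and are not the dataset's actual distribution.

```python
from collections import Counter

# Hypothetical per-conversation results, standing in for the `results` list.
sample_results = [
    {"hiv_level": "Low", "mental_level": "Low"},
    {"hiv_level": "Low", "mental_level": "Moderate"},
    {"hiv_level": "Moderate", "mental_level": "Low"},
    {"hiv_level": "Low", "mental_level": "Low"},
]

hiv_counts = Counter(r["hiv_level"] for r in sample_results)
mental_counts = Counter(r["mental_level"] for r in sample_results)
# Joint tally: how often each (HIV level, mental health level) pair occurs.
joint_counts = Counter((r["hiv_level"], r["mental_level"]) for r in sample_results)

print(dict(hiv_counts))
print(dict(mental_counts))
print(dict(joint_counts))
```

On the notebook's DataFrame, `results_df["hiv_level"].value_counts()` and `pd.crosstab(results_df["hiv_level"], results_df["mental_level"])` should give the same tallies.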


Most conversations in this synthetic dataset score low for HIV acquisition risk and low to moderate for mental health risk.