## Общие сведения об используемых адаптерах

В рамках эксперимента было обучено два типа PEFT-адаптеров (Parameter-Efficient Fine-Tuning) с использованием технологии **QLoRA** для базовой модели **Qwen-3B**. Основная цель заключалась в проверке гипотезы научить LLM быть врачом. В ходе работы была принята еще одна гипотеза о влиянии качества предобработки медицинских данных на итоговую безопасность и точность ответов. (Сравнение двух обученных адаптеров)

1. **RAW_DATA_QLoRA (Адаптер на сырых данных Doctor-HealthCare-100k.csv):**
* **Датасет:** Обучен на оригинальном корпусе вопросов и ответов (QA) без предварительной фильтрации. 1000 сэмплов
* **Характеристики:** Адаптер перенял специфический стиль платформы Chat Doctor, включая вводные клише («Hello, Welcome to Chat Doctor...») и технический шум.


2. **Cleaned_DATA_QLoRA (Адаптер на очищенных данных из Doctor-HealthCare-100k.csv):**
* **Датасет:** Прошел двухэтапную очистку (удаление дублей, сокращение избыточных вежливостей, фильтрация неинформативных назначений). 1000 сэмплов
* **Характеристики:** Фокусируется на структурированном клиническом подходе: «Вероятная причина — Рекомендуемое обследование — Специалист».

---

## Оценка эффективности (Alignment с эталонными ответами врачей)

Оценка проводилась путем сопоставления ответов моделей с полем `doctor_answer`, выполнена LLM-as-a-Judge

### Соответствие логике врача

* **Base Model (Qwen3B):** Демонстрирует общую эрудицию, но ответы часто слишком размыты. Модель дает много «бытовых» советов, что снижает профессиональный тон консультации.
* **RAW_DATA_QLoRA:** Имеет серьезные проблемы с «галлюцинациями» и безопасностью.
* *Пример (Кейс №2):* Врач рекомендует простые манипуляции (отрыгивание) для младенца. Адаптер RAW назначает пантопразол и цетиризин, что является недопустимым для самолечения в таком возрасте.
* *Пример (Кейс №4, №26):* При нехватке данных модель уходит в бесконечную циклическую генерацию списков витаминов (B1...B1250) или анализов крови, что делает ответ нечитаемым и бесполезным.


* **Cleaned_DATA_QLoRA:** Наиболее близко соответствует медицинской логике эталона. Она не пытается угадать конкретное лекарство, если это рискованно, а корректно описывает тактику дифференциальной диагностики.

### Сравнительный анализ по критериям качества

| Критерий / Модель          | Base Model (Qwen3B) | RAW_DATA_QLoRA | Cleaned_DATA_QLoRA | Комментарий                                                                 |
|----------------------------|----------------------|----------------|---------------------|-----------------------------------------------------------------------------|
| **Точность**              | 3.9                 | 2.1           | **4.4**            | Cleaned ближе всего к фактам doctor_answer; RAW часто галлюцинирует         |
| **Полнота**               | 4.2                 | 2.5           | **4.3**            | Base избыточен, Cleaned — сбалансирован, RAW неполный из-за обрезок         |
| **Безопасность**          | 4.5                 | 3.0           | **4.8**            | Все рекомендуют врача, но RAW иногда даёт вредные списки лекарств           |
| **Краткость**             | 3.0                 | 1.5           | **4.5**            | RAW страдает от сильных повторений, Cleaned — лаконичен                     |
| **Релевантность**         | 3.8                 | 1.8           | **4.6**            | Cleaned фокусирован на вопросе, RAW часто уходит в нерелевантные списки     |
| **Эмпатия и Тон**         | **4.1**             | 2.8           | 3.9                | Base наиболее эмпатичен, Cleaned — профессиональный, RAW — механистичный    |
| **Общая Полезность**      | 3.8                 | 2.4           | **4.2**            | Cleaned даёт наиболее полезные для пациента ответы                          |
| **Средние токены (gen)**  | 280                 | 320           | **90**             | Cleaned самый экономичный по количеству генерируемых токенов                |
| **Средний балл (1–5)**    | 3.8                 | 2.4           | **4.3**            | Cleaned заметно выигрывает по большинству критериев                         |


| Критерий | RAW_DATA_QLoRA | Cleaned_DATA_QLoRA |
| --- | --- | --- |
| **Точность (Accuracy)** | Низкая (часто ставит ошибочные диагнозы) | **Высокая** (соответствует логике врача) |
| **Безопасность (Safety)** | **Критически низкая** (рискованные назначения) | **Высокая** (маршрутизация к врачу) |
| **Структура (Structure)** | Хаотичная, часто зацикленная | **Тезисная, медицинская** |
| **Зашумленность (Noise)** | Высокая (мусорные токены, клише) | Минимальная |

---

## Финальная сравнительная таблица моделей

Данная таблица суммирует поведение всех участников эксперимента на базе 30 кейсов.

| Характеристика | Base Model (Qwen3B) | RAW_DATA_QLoRA | Cleaned_DATA_QLoRA | Doctor Answer (Эталон) |
| --- | --- | --- | --- | --- |
| **Стиль ответа** | Информационный / Общий | Избыточно вежливый / Шумный | Лаконичный / Клинический | Экспертный / Консультативный |
| **Риск галлюцинаций** | Средний (общие факты) | **Критический (циклы, списки)** | Минимальный | Отсутствует |
| **Эффективность токенов** | Избыточная (много «воды») | Крайне низкая (мусорный текст) | **Высокая (строго по делу)** | Оптимальная |
| **Рекомендация** | Подходит для общей справки | **Не пригодна для медицины** | **Рекомендована для мед-интерфейсов** | Золотой стандарт |

---

### Заключение по этапу дообучения:

Опыт показал, что **RAW_DATA_QLoRA** впитал не только знания, но и все дефекты исходного корпуса: излишнюю вежливость платформы, неинформативные куски и структурные ошибки. Очистка корпуса в два этапа позволила создать **Cleaned_DATA_QLoRA**, которая не просто копирует текст, а выстраивает логику ответа, максимально приближенную к эталонным ответам врачей, исключая опасные назначения и технические сбои генерации.

# Очистка тектса корпуса алгоритмами

In [1]:
import pandas as pd
import re
import unicodedata
from difflib import SequenceMatcher
import pandas as pd
import re
import nltk
from nltk.tokenize import sent_tokenize


# =========================
# Pandas display settings
# =========================
pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)
pd.set_option("display.width", None)
pd.set_option("display.max_colwidth", None)
pd.set_option("display.expand_frame_repr", False)


# Загрузка токенизатора
try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt')

# Загрузка датасета
df = pd.read_csv('Doctor-HealthCare-100k.csv')

# df = df.sample(10000)

def update_instruction(df):
    df['instruction'] = """
You are a licensed medical doctor.
Respond in a professional, neutral, and explanatory tone.
"""
    return df

print(f"Dataset loaded. Rows: {len(df)}")


Dataset loaded. Rows: 112156


In [2]:
DISCLAIMER_TAIL = (
    " This response is generated by MedAI. "
    "Be sure to consult with a specialist."
)

STRUCTURAL_GUARD_PATTERNS = (
    r"\bfirst\b",
    r"\bbefore\b",
    r"\brule out\b",
    r"\bif\b",
    r"\bunless\b",
    r"\bin case\b",
    r"\bshould\b",
    r"\brecommended\b",
    r"\bimportant\b"
)

In [3]:
import re
import spacy

try:
    nlp = spacy.load("en_core_web_sm", disable=["ner", "parser"])
except OSError:
    import spacy.cli
    spacy.cli.download("en_core_web_sm")
    nlp = spacy.load("en_core_web_sm", disable=["ner", "parser"])


def has_main_verb(text: str) -> bool:
    doc = nlp(text)
    return any(tok.pos_ == "VERB" for tok in doc)


def is_structurally_important(sentence: str) -> bool:
    s = sentence.lower()
    return any(re.search(p, s) for p in STRUCTURAL_GUARD_PATTERNS)


def looks_like_fragment_v21(sentence: str) -> bool:
    s = sentence.strip().lower()

    if len(s) < 50:
        return True

    if not has_main_verb(sentence):
        return True

    if s.startswith((
        "i am.", "for asking", "thanks", "thank you",
        "hope i", "wish you", "dear", "hi"
    )):
        return True

    return False


def sentence_repair_layer_v21(text: str) -> str:
    if not isinstance(text, str):
        return text

    sentences = re.split(r'(?<=[.!?])\s+', text)
    repaired = []

    for sent in sentences:
        s = sent.strip()
        if not s:
            continue

        if is_structurally_important(s):
            repaired.append(s)
            continue

        if looks_like_fragment_v21(s):
            continue

        repaired.append(s)

    return " ".join(repaired)

In [None]:
STYLE_NORMALIZATION_MAP_V21 = [
    # --- 1. Smart Rewriter (Hedging с защитой грамматики) ---
    # Заменяем "you have" только если дальше нет отрицания
    (r"\b(you have|you are suffering from) (a|an)?\b(?!\s+(not|no|unlikely))", "the clinical picture may suggest "),
    
    # Заменяем "it is/this is" ТОЛЬКО если дальше нет слов-исключений (not, impossible, hard, clear)
    # Было: (r"\b(it is|this is) (a|an)?\b", "this could be consistent with "), 
    # Стало (Smart Regex):
    (r"\b(it is|this is)\s+(a|an)?\b(?!\s+(not|no|unlikely|impossible|hard|difficult|clear|evident))", 
     "this could be consistent with "),

    # Смягчение категоричности
    (r"\bdefinitely (is|indicates)\b", "may indicate"),
    
    # --- 2. Tail Trimmer (Удаление обрывистых концовок) ---
    # Удаляет ситуации, когда предложение обрывается на союзе или предлоге перед точкой
    # Пример: "severe vomiting or." -> "severe vomiting."
    (r"[,;\s]+(and|or|but|so|because|with|for|as)\s*[.?!]?\s*$", "."),
    
    # --- 3. Fluency & Professional Tone ---
    (r"\bi (think|believe|guess|feel)\b", "it is suggested"),
    (r"\b(really|very|just|actually)\b", ""), # Удаление разговорных усилителей
    
    # --- 4. Эмпатия → Клиническая формулировка ---
    (r"\bi (can\s+)?understand your (concern|problem|situation|anxiety)\b", 
     "Based on the information provided"),

    # --- 5. Очистка мусора (Legacy rules) ---
    (r"\bhope i have answered.*?$", ""),
    (r"\bwish you.*?$", ""),
    (r"\bthanks for.*?$", ""),
    (r"\bthank you.*?$", ""),
    (r"\bif you have additional questions.*?$", ""),
    (r"\bplease rate.*?$", ""),
    (r"\bpost a direct question.*?$", "")
]

def style_normalization_layer_v21(text: str) -> str:
    if not isinstance(text, str):
        return text

    normalized = text
    for pattern, repl in STYLE_NORMALIZATION_MAP_V21:
        normalized = re.sub(pattern, repl, normalized, flags=re.IGNORECASE)

    # Капитализация и чистка пунктуации
    # Разделяем по точке, но сохраняем структуру
    sentences = re.split(r'(?<=[.!?])\s+', normalized)
    normalized = " ".join(
        s.strip().capitalize() 
        for s in sentences 
        if s.strip() and len(s.strip()) > 1 # Удаляем артефакты из 1 буквы
    )
    
    # Финальная зачистка пробелов перед знаками препинания
    normalized = re.sub(r'\s+([,.!?])', r'\1', normalized)
    normalized = re.sub(r'\s{2,}', ' ', normalized).strip()
    
    return normalized

In [6]:
MEDICAL_FIXES = {
    r"\bherd\b": "GERD",
    r"\bit d3\b": "vitamin D3",
    r"\bit b12\b": "vitamin B12",
    r"\bhy\b": "hydration",
    r"\bmetonymic acid\b": "mefenamic acid",
    r"\bholder ecg\b": "Holter ECG"
}


def medical_light_normalization(text: str) -> str:
    if not isinstance(text, str):
        return text

    fixed = text
    for pattern, repl in MEDICAL_FIXES.items():
        fixed = re.sub(pattern, repl, fixed, flags=re.IGNORECASE)

    return fixed

In [7]:
def append_disclaimer(text: str) -> str:
    if not isinstance(text, str):
        return text

    if DISCLAIMER_TAIL.lower() in text.lower():
        return text

    return text.rstrip(". ") + "." + DISCLAIMER_TAIL

In [None]:
def sota_medical_clean_v2(text, additional_noise):
    if not isinstance(text, str) or len(text.strip()) == 0:
        return text

    # --- LAYER 0: Хирургическое удаление артефактов и исправление OCR ---
    # 1. Удаляем упоминания платформы, вклинившиеся в середину текста. 
    # Это критично, иначе "Avoid painkiller Chat Doctor." превратится в бессмыслицу.
    # Заменяем на пробел или точку, если это конец фразы.
    text = re.sub(r'(?i)\bChat\s*Doctor\s*(\.com)?', '', text) 
    
    # 2. Исправление специфических OCR/ASR ошибок из вашего датасета
    text = re.sub(r'(?i)\bCellophane\s+you\b', 'Thank you', text) # Cellophane -> Thank
    text = re.sub(r'(?i)\bdairy\b', 'Dear', text) # dairy -> Dear
    text = re.sub(r'(?i)\bHallow\s+Mummy\b', 'Hello', text) # Hallow Mummy -> Hello
    text = re.sub(r'(?i)\bWel\s+come\b', 'Welcome', text) # Wel come -> Welcome
    text = re.sub(r'(?i)\bInmeantime\b', 'In the meantime', text) # Склейка

    # 3. Нормализация склеек (как в прошлой итерации)
    text = re.sub(r'([a-z])\.([A-Z])', r'\1. \2', text)
    
    # --- LAYER 1: Определение паттернов шума ---
    base_noise = [
        # Приветствия и прощания
        r"My pleasure to help you", r"Thanx", r"Always welcome for further query",
        r"Ly/", r"Hope it will help you", r"I will be happy to answer",
        r"I hope I have answered to your satisfaction", r"Thank you",
        r"I can understand your anxiety", r"I can understand your concern",
        r"I've gone through your query", r"I have gone through your question",
        r"Read and understood your history", r"Hope your concern solved",
        r"Wish you a very good health", r"Wish you a speedy recovery",
        r"Good luck", r"Get well soon", r"Take care", r"Regards", r"Good day",
        r"Don't worry", r"Feel free to ask", r"Happy to help", r"Greetings",
        r"Welcome to", r"Best wishes", r"I remain at your disposal",
        r"Hope this helps", r"Let me know if I can assist", 
        r"Thanks for your question", r"Thanks for consulting",
        r"read carefully your query", r"Hope my answer was helpful",
        r"I hope I was able to answer", r"Rest assured"
    ]
    
    # Добавляем фразы пользователя
    noise_patterns = base_noise + [re.escape(p.strip()) for p in additional_noise]
    # Добавляем паттерны для ссылок
    noise_patterns.append(r"https?://\S+|www\.\S+|bit\.\S+")
    
    combined_pattern = re.compile("|".join(noise_patterns), re.IGNORECASE)

    paragraphs = text.split('\n')
    cleaned_paragraphs = []

    for para in paragraphs:
        if not para.strip():
            continue
        
        sentences = sent_tokenize(para)
        cleaned_sentences = []

        for i, sent in enumerate(sentences):
            s_clean = sent.strip()
            
            # А) Удаление приветствий (расширенное regex для имен)
            if i == 0:
                # Удаляет "Hi Natasha,", "Hello,", "Dear Friend,"
                s_clean = re.sub(r'^(Hi|Hello|Greetings|Dear|Welcome)(\s+\w+){0,3}[\s,.]*', '', s_clean, flags=re.IGNORECASE)
            
            if not s_clean: continue

            # Б) Удаление "висячих" начал (например, если предложение начиналось с "and")
            if i == 0 and re.match(r'^(and|but|so|also)[,\s]', s_clean, re.IGNORECASE):
                 s_clean = re.sub(r'^(and|but|so|also)[,\s]+', '', s_clean, flags=re.IGNORECASE).capitalize()

            # В) Проверка на шум
            if combined_pattern.search(s_clean):
                # Если предложение короткое (< 12 слов) и содержит шум -> удаляем полностью
                if len(s_clean.split()) < 12:
                    continue
                else:
                    # Если длинное -> вырезаем фразу
                    s_clean = combined_pattern.sub('', s_clean).strip()
            
            # Г) Удаление "хвостов" (запятые, "and" в конце после вырезки)
            s_clean = re.sub(r'[,.\s]+$', '.', s_clean) # Заменить висячие запятые на точку
            s_clean = re.sub(r'\s+(and|or)\.$', '.', s_clean) # Удалить " and." в конце

            # Финальная проверка на длину (чтобы не осталось "Dr." или "So.")
            if len(s_clean) > 3:
                cleaned_sentences.append(s_clean)

        if cleaned_sentences:
            cleaned_paragraphs.append(" ".join(cleaned_sentences))

    return "\n\n".join(cleaned_paragraphs)

# --- ПРИМЕНЕНИЕ (Updated Noise List) ---
new_noise_iteration_2 = [
    "welcome to Chat Doctor", "keep your private parts Chat Doctor", 
    "for consulting, and", "Avoid painkiller Chat Doctor", 
    "Thanks for consulting ChatDoctorRead your query",
    "especially when deny Chat Doctor", "Tab. Medicare H is a very good Chat Doctor",
    "Cellophane you for contacting Chat Doctor", "write to me directly on below link",
    "I hope I can help you today", "dairy have gone through your question",
    "I remain at your disposal", "Hallow Mummy", "Rest assured",
    "Hope my answer was helpful", "I hope it helps.", "Hi, "
]

In [9]:
def sota_medical_clean_v21(text, additional_noise):
    # Stage 1 — твой существующий clean
    cleaned = sota_medical_clean_v2(text, additional_noise)

    # Stage 2 — sentence repair
    cleaned = sentence_repair_layer_v21(cleaned)

    # Stage 3 — style normalization
    cleaned = style_normalization_layer_v21(cleaned)

    # Stage 4 — medical fixes
    cleaned = medical_light_normalization(cleaned)

    # Stage 5 — disclaimer
    cleaned = append_disclaimer(cleaned)

    return cleaned


df["output_v21"] = df["output"].apply(
    lambda x: sota_medical_clean_v21(x, new_noise_iteration_2)
)

In [10]:
def hard_quality_filter(text: str) -> bool:
    """
    Жёсткий финальный фильтр качества перед SFT / QLoRA.
    Отсекает короткие, неинформативные, небезопасные и немедицинские ответы.
    """
    if not isinstance(text, str):
        return False

    # Минимальная длина (защита от обрывков)
    if len(text) < 120:
        return False

    # Должно быть несколько предложений (reasoning + recommendation)
    if text.count('.') < 2:
        return False

    lower = text.lower()

    # Обязательное наличие медицинского контекста
    MEDICAL_KEYWORDS = (
        "diagnosis", "treatment", "therapy", "symptom", "condition",
        "infection", "disease", "pain", "management", "medication",
        "antibiotic", "dose", "investigation", "examination", "scan",
        "ultrasound", "mri", "ct", "ecg", "blood test", "biopsy",
        "consult", "specialist", "doctor"
    )

    if not any(k in lower for k in MEDICAL_KEYWORDS):
        return False

    # Не должно быть явных форумных хвостов
    FORUM_TAIL_PATTERNS = (
        "rate this answer",
        "thanks for using",
        "post a direct question",
        "use this link",
        "close the discussion"
    )

    if any(p in lower for p in FORUM_TAIL_PATTERNS):
        return False

    return True

In [11]:
df_final = df[df["output_v21"].apply(hard_quality_filter)].copy()

print(f"Rows before V2.1 filter: {len(df)}")
print(f"Rows after  V2.1 filter: {len(df_final)}")

Rows before V2.1 filter: 112156
Rows after  V2.1 filter: 108462


In [12]:
df_final[['output', 'output_v21']].sample(5)

Unnamed: 0,output,output_v21
59754,"Hi, thanks for writing to Chat Doctor sharing your boys' health problem with us! Well, If I were his family physician for this case of a slightly tender bluish lump on the groin, I would come up with three possibilities, these include","Well, if i were his family physician for this case of a slightly tender bluish lump on the groin, i would come up with three possibilities, these include. This response is generated by MedAI. Be sure to consult with a specialist."
49869,"Hello and welcome to Chat Doctor. As a Urologist, i can understand your concern. Tetkurna is given, to control B.P.(blood pressure). Care has to be taken, if your creatinine is elevated. Your B.P. will be checked frequently. Your blood sugar must be normal. Statin is routine treatment for elevated cholesterol levels. Avoid taking alcohol and fatty foods, while on this treatment. So, theres no harm in taking the two together. If use any other doubts, send it as a direct question to me. Chat Doctor. Manga.","Care has to be taken, if your creatinine is elevated. Avoid taking alcohol and fatty foods, while on this treatment. If use any other doubts, send it as a direct question to me. This response is generated by MedAI. Be sure to consult with a specialist."
15221,Hello! Thanks for query. Renal cyst that you have to not need any treatment if it is single and not infected. Single cyst in kidney at the age of 56 is normal finding. Get your kidney functions done to rule out any underlying kidney disease. Hope this answers your query. If you have additional questions or follow-up queries then please do not hesitate in writing to us. I will be happy to answer your queries. Wishing you good health.,Renal cyst that you have to not need any treatment if it is single and not infected. Get your kidney functions done to rule out any underlying kidney disease. This response is generated by MedAI. Be sure to consult with a specialist.
33498,"Hi, dairy have gone through your question. I can understand your concern. She has multiple loci of brain tumor. Surgery will not work in that case. Radiotherapy and chemotherapy can be given, but it will not cure the disease. Despite all treatment chances of survival is very low. Prognosis and life expectancy is not good at all. Sorry to say but it is fact. One should not expect much from this. Hope I have answered your question, if you have any doubts then contact me at bit.ly/ Chat Doctor. Thanks for using Chat Doctor. Wish you a very good health.","Radiotherapy and chemotherapy can be given, but it will not cure the disease. One should not expect much from this. This response is generated by MedAI. Be sure to consult with a specialist."
108478,"Hello, Thanks for writing to Chat Doctor, I am Chat Doctor. This word \""tight\"" is not a scientific term as such but doctors in general use it after stress testing when there is an artery which has some blockage, usually this blockage is not significant to ask for a stenting, but it is significant to cause some ECG changes in the stress testing. You and your husband do not have to panic, this kind of arteries just need care, by care I mean taking regular medication, physical activity, keeping check at your diet and staying in contact with doctor and pay regular follow-up visits as advised by the doctor. This slight tightness will only get worse if care is not taken otherwise it will regress and things will keep getting better with time. I hope this answered your question, If you have more queries I am happy to answer you. Regards.","This word \""tight\"" is not a scientific term as such but doctors in general use it after stress testing when there is an artery which has some blockage, usually this blockage is not significant to ask for a stenting, but it is significant to cause some ecg changes in the stress testing. You and your husband do not have to panic, this kind of arteries just need care, by care i mean taking regular medication, physical activity, keeping check at your diet and staying in contact with doctor and pay regular follow-up visits as advised by the doctor. This slight tightness will only get worse if care is not taken otherwise it will regress and things will keep getting better with time. I hope this answered your question, if you have more queries i am happy to answer you. This response is generated by MedAI. Be sure to consult with a specialist."


In [13]:
ABSOLUTE_PATTERNS = (
    r"\babsolutely no\b",
    r"\bno chance\b",
    r"\b99%\b",
    r"\b100%\b",
    r"\bdefinitely\b",
    r"\bguaranteed\b"
)

def hard_drop_v21(text):
    if not isinstance(text, str):
        return True

    t = text.lower()

    if any(re.search(p, t) for p in ABSOLUTE_PATTERNS):
        return True

    if text.count('.') < 2:
        return True

    return False

In [None]:
# Паттерны для удаления (строгий фильтр)
ABSOLUTE_PATTERNS = (
    # Категоричность
    r"\babsolutely no\b", r"\bno chance\b", r"\b99%\b", r"\b100%\b",
    r"\bdefinitely\b", r"\bguaranteed\b",
    
    # 1. Запрещенный ролевой контекст (Расширенный список)
    # Ловит: "if i were your physician", "as a gp", "clinician"
    r"if i (were|was) (your|the) (doctor|physician|clinician|gp|pediatrician)",
    r"as (your|a|the) (doctor|physician|clinician|gp)",
    r"my (clinic|office|patient)", # Убирает личные истории врачей
    
    # 2. Дозировки (защита от галлюцинаций чисел)
    r"\d+\s*(mg|mcg|ml|units|mcg/kg|mg/kg|tablet|pill|capsule|ampule|sachet|drops)\b",
    r"\d+\s*(milligrams|milliliters|grams)\b",
    r"\d+\s*times (a|per) day",
    
    # 3. Конкретные препараты (Strict Drug Filter)
    # Включает найденные в аудите: Clamp Kid, Augmentin, Ciprofloxacin и др.
    r"\b(ibuprofen|paracetamol|acetaminophen|aspirin|diclofenac|tramadol)\b", # Painkillers
    r"\b(amoxicillin|metronidazole|ciprofloxacin|azithromycin|augmentin|clamp kid|cefixime)\b", # Antibiotics
    r"\b(omeprazole|pantoprazole|metformin|prednisone|cetirizine|montelukast)\b" # Common chronic meds
)

def hard_drop_v21(text):
    if not isinstance(text, str):
        return True

    t = text.lower().strip()

    # 1. Проверка на стоп-слова из ABSOLUTE_PATTERNS
    if any(re.search(p, t) for p in ABSOLUTE_PATTERNS):
        return True

    # 2. Tail Trimmer Check (Safety Net)
    # Если нормализация пропустила, а текст всё равно заканчивается на союз - дропаем
    if re.search(r'\b(and|or|but|so|with|for)\.?$', t):
        return True

    # 3. Минимальная структурная проверка
    if text.count('.') < 2:
        return True

    return False

# --- ЛОГИКА СКОРИНГА ---
REASONING_MARKERS = (
    "if", "however", "because", "therefore",
    "rule out", "suggestive of", "depends on",
    "in case", "based on", "consider", "due to",
    "possibility", "consistent with"
)

REFERRAL_MARKERS = (
    "consult", "specialist", "evaluation",
    "investigation", "examination", "clinical assessment"
)

def quality_score(text):
    score = 0
    t = text.lower()
    
    # Бонусы за объяснения и направления
    score += sum(m in t for m in REASONING_MARKERS)
    score += sum(m in t for m in REFERRAL_MARKERS)
    
    # Бонус за структуру (до 2.5 баллов)
    score += min(text.count('.'), 5) * 0.5 
    
    # Бонус за оптимальную длину (capped)
    score += min(len(text) / 400, 2)
    
    # --- ШТРАФЫ (Penalty Layer) ---
    # Штраф, если текст не заканчивается знаком препинания (признак обрыва)
    if not t.endswith(('.', '!', '?')):
        score -= 5
        
    return score

In [15]:
df_q = df_final.copy()
df_q = df_q[~df_q["output_v21"].apply(hard_drop_v21)]

df_q["score"] = df_q["output_v21"].apply(quality_score)

df_top10k = (
    df_q
    .sort_values("score", ascending=False)
    .head(10_000)
    .copy()
)

In [19]:
df_top10k[['input', 'output', 'output_v21', 'score']].sample(2)

Unnamed: 0,input,output,output_v21,score
23982,"My husband had oral surgery(wisdom tooth and molar extraction, has stitches) this morning at 11","Hi and thanks for the query, I actually think with the Chat Doctor. However, this pain should normally decrease much within the next 48 hours after surgery. If it actually persists, you might to call the surgeon come over to change the class of the painkiller or increase the dosage if need be. It s but normal to have the scratching sensation at the level of the uvula. The uvula was certainly over manipulated during the surgical process to obtain enough access to the tooth in question. As I mentioned, these symptoms should fade out within the next 48 to 72 hours and are not that rare. Suggest you remain watchful and reassure your husband. IN case of persistence of pain, you might need to call in the physician. Thanks and regards, Bain LE, MD.","However, this pain should normally decrease much within the next 48 hours after surgery. If it actually persists, you might to call the surgeon come over to change the class of the painkiller or increase the dosage if need be. It s but normal to have the scratching sensation at the level of the uvula. The uvula was certainly over manipulated during the surgical process to obtain enough access to the tooth in question. As i mentioned, these symptoms should fade out within the next 48 to 72 hours and are not that rare. Suggest you remain watchful and reassure your husband. In case of persistence of pain, you might need to call in the physician. This response is generated by MedAI. Be sure to consult with a specialist.",9.3125
79771,"There is a small lump about 2 inches left of my belly. When you press on it, it feels as though something is protruding underneath. No other symptoms blood in stool, fever, loss of weight etc. with the exception of possible bloating at times. I am menopausal and have had irregular period patterns. Thats it. Any ideas.","Thank you for asking Chat Doctor. My name is Chat Doctor. Keeping in view your presentation of Symptoms of this belly Lump, it seems to be either Para umbilical Hernia umbilical Hernia ...and another Hernia which present in this region is Epigastric but that's a little higher and above between the belly button and mid-chest... Such lumps need evaluation by the General Surgeon and clinical examination whether the lump is involving skin/under skin tissue or It's something protruding from inside the Abdominal wall.to role out these, Ultrasound evolution is done for the Nature of the lump /contents and for the defect in the Abdominal wall... Treatment is then Cause directed accordingly. Hope this may help you. Let me know if anything is not clear. Thanks.","Keeping in view your presentation of symptoms of this belly lump, it seems to be either para umbilical hernia umbilical hernia ...and another hernia which present in this region is epigastric but that's a little higher and above between the belly button and mid-chest. Such lumps need evaluation by the general surgeon and clinical examination whether the lump is involving skin/under skin tissue or it's something protruding from inside the abdominal wall.to role out these, ultrasound evolution is done for the nature of the lump /contents and for the defect in the abdominal wall. Let me know if anything is not clear. This response is generated by MedAI. Be sure to consult with a specialist.",9.24


In [23]:
df_top10k.to_csv("Doctor_QLoRA_Ready_10k.csv")

In [52]:
data = pd.read_csv("Doctor_QLoRA_Ready_10k.csv")

In [53]:
data[['input', 'output_v21']].head()

Unnamed: 0,input,output_v21
0,"my friend is currently in the hospital because of her stomach pains . she has lost a ton of weight because of stress. she has also been having sex latley, the doctor assume its her appendix but i know she didnt tell the whole story in front of her parents. can it be something else?","I must tell you in the beginning that the treating doctor of your friend would be the best judge to tell you what she has got, because the differential diagnosis though can be guessed from the partial history, confirmation could be done only by a detailed physical examination and related investigations. If the doctors assume that it is the appendix, then they would go for a surgery and the tissue for histopathology. However, history of loss of weight is perplexing in that simple appendix might not give rise to significant weight loss as you described. Mostly chronic inflammatory diseases or some other kind of pathological process could land the patient up in gross weight loss. She would have to be investigated more to know the cause of weight loss, which could also be due to the history of stress you wrote. Therefore, the management would vary depending on the clinical findings and investigation reports by the treating doctor. I hope this information would help you in discussing with your family physician/treating doctor in further management of your problem. Please do not hesitate to ask in case of any further doubts. This response is generated by MedAI. Be sure to consult with a specialist."
1,"Hi, I have a very sharp stabbing pain in the very top of my abd very time I cough to the point I am trying not to cough anymore. The pain is located in a line accross my abd and is tender to the touch, could it just be a pulled muscle or should i worry?","Although it needs a physical examination of your abdomen with related investigations like a cbc and an ultrasound abdomen, with the history, it could be a simple abdominal muscle spasm due to the chronic cough of yours. However, i would also go for an x-ray chest to rule out any chances of a pleurisy (inflammation of the covering lining of the lungs) due to a lung infection. In this case it should get alright with an anti-inflammatory along with a proton pump inhibitor. If the pain and tenderness increases or persists, then i would advise you to take an opinion from a general surgeon in your area for further management. I hope this information would help you in discussing with your family physician/treating doctor in further management of your problem. Please do not hesitate to ask in case of any further doubts. This response is generated by MedAI. Be sure to consult with a specialist."
2,Hello I have been diagnosed with RA for the past 22 years; along with that I have a frozen shoulder. My specialist has advised that I should do the surgery in order to gain mobility in both of my shoulders. I am hesistant to do this procedure. Is there an alternate route that I can take?What natural supplement(s) would you recommend.,"Frozen shoulder [adhesive capsulitis] can occur independently or due to autoimmune disease [ra in your case], and since diagnosis is usually clinical, and it is often self-limited, observation and reassurance may be considered, but because of the painful and debilitating nature of the disease, it may not be acceptable to some patients.2. Since your specialist advised for surgery, it might be possible that your range of motion/movement might be/is compromised.3. As problem of ra is chronic, thus seek opinion of rheumatologist regarding specific examination and investigation to know the progress of disease, and if ra comes out to be provoking factor for frozen shoulder, alleviate it first.4. Because there is inflammation resulting in thickening of the capsule thus shot ha heir nikita [anti-inflammatory measures]. Diabetic myopathy although occurs in lower limb, however, when symptoms involve the shoulder girdle, it must be considered in the differential diagnosis of shoulder painful conditions [because it is a predisposing factor. In ayurveda, we treat frozen shoulder on the line of avabahuka. This response is generated by MedAI. Be sure to consult with a specialist."
3,"Is it possible that having titanium rods in both of my femurs can affect fatigue. I am chronically cold but one website said that rods aren t affected by weather, however Im not sure why I am so incredibly freezing all the time since the placement of my rods, but I am wondering then if there is any correlation of metal in your body and fatigue.","There is no correlation between insertion of titanium rods in femur and the sensation of fatigue from what we know from current available literature. However, it is possible that you might feel week in your legs after such an injury and surgery due to the lack of use of the muscles of the lower limb. It is advisable to do strengthening exercises of all the muscles around your lower limb under the supervision of your physician and a trained physiotherapist. It is possible that over a time frame of 3 to 6 months your will start feeling better strength which such exercises. On the other hand if you are feeling general fatigue including your whole body general factors like anemia should be considered. It is possible that you would have lost considerable amount of blood during your injury and during your surgery, which can result in anemia. Based on certain blood tests you might require supplements to help you recover from such a situation. Regarding feeling cold, low grade smoldering fever can sometimes make you feel cold. Since you have an implant in your body it would be better to see your treating physical to rule out the same. Examination and some blood tests can provide evidence regarding the samendhope this answers your question. This response is generated by MedAI. Be sure to consult with a specialist."
4,"Ive been on birth control for over eight months now a d take them perfectly. Although,my boyfriend and I started having sex last week. Almost a week ago the condom broke and he came inside me. Me, being stupid, I grabbed two birth control pills and swallowed them to be sure. The next day I accidentally took two pills. I know taking more than one pill is dangerous but I wasnt thinking. Anyway, I am on my 6th day of placebo pills and still havent started my period. Is this because of e extra pills? Or a I pregnant this early?","The emergency birth control pills are considered effective when taken within 72 hours of unprotected sexual intercourse. And since you have taken 4 pills accidentally, it can cause hormonal imbalance in the body, resulting in delay of periods. However, to relieve your apprehension, you can get a pregnancy test done using a pregnancy kit. In case of pregnancy, it will give positive result within one week. The reasons for delay in periods and spotting can be many such as hormonal imbalance(as already explained), ovulatory cycles, stress, any infection or pathology in the uterus, ovary, etc. So, if pregnancy test is negative & periods are further delayed, kindly consult a gynecologist & get a proper clinical examination done. Investigations like complete blood count, ultrasound scan of abdomen and pelvis will be helpful to rule out any pathological cause for your symptoms. This response is generated by MedAI. Be sure to consult with a specialist."


In [54]:
TAIL = "This response is generated by MedAI. Be sure to consult with a specialist."

data["output_v21"] = (
    data["output_v21"]
    # remove MedAI tail (exact match only)
    .str.replace(TAIL, "", regex=False)
    # replace & with and
    .str.replace("&", "and", regex=False)
    # clean trailing spaces/newlines
    .str.rstrip()
)

In [55]:
data.head()

Unnamed: 0.1,Unnamed: 0,instruction,input,output,output_v21,score
0,53582,"If you are a doctor, please answer the medical questions based on the patient's description.","my friend is currently in the hospital because of her stomach pains . she has lost a ton of weight because of stress. she has also been having sex latley, the doctor assume its her appendix but i know she didnt tell the whole story in front of her parents. can it be something else?","Hi ! Good morning. I am Chat Doctor answering your query. I must tell you in the beginning that the treating doctor of your friend would be the best judge to tell you what she has got, because the differential diagnosis though can be guessed from the partial history, confirmation could be done only by a detailed physical examination and related investigations. If the doctors assume that it is the appendix, then they would go for a surgery and the tissue for histopathology. However, history of loss of weight is perplexing in that simple appendix might not give rise to significant weight loss as you described. Mostly chronic inflammatory diseases or some other kind of pathological process could land the patient up in gross weight loss. She would have to be investigated more to know the cause of weight loss, which could also be due to the history of stress you wrote. Therefore, the management would vary depending on the clinical findings and investigation reports by the treating doctor. I hope this information would help you in discussing with your family physician/treating doctor in further management of your problem. Please do not hesitate to ask in case of any further doubts. Thanks for choosing Chat Doctor to clear doubts on your health problems. Wishing you an early recovery. Chat Doctor.","I must tell you in the beginning that the treating doctor of your friend would be the best judge to tell you what she has got, because the differential diagnosis though can be guessed from the partial history, confirmation could be done only by a detailed physical examination and related investigations. If the doctors assume that it is the appendix, then they would go for a surgery and the tissue for histopathology. However, history of loss of weight is perplexing in that simple appendix might not give rise to significant weight loss as you described. Mostly chronic inflammatory diseases or some other kind of pathological process could land the patient up in gross weight loss. She would have to be investigated more to know the cause of weight loss, which could also be due to the history of stress you wrote. Therefore, the management would vary depending on the clinical findings and investigation reports by the treating doctor. I hope this information would help you in discussing with your family physician/treating doctor in further management of your problem. Please do not hesitate to ask in case of any further doubts.",14.5
1,84348,"If you are a doctor, please answer the medical questions based on the patient's description.","Hi, I have a very sharp stabbing pain in the very top of my abd very time I cough to the point I am trying not to cough anymore. The pain is located in a line accross my abd and is tender to the touch, could it just be a pulled muscle or should i worry?","Hi! Good afternoon. I am Chat Doctor answering your query. Although it needs a physical examination of your abdomen with related investigations like a CBC and an ultrasound abdomen, with the history, it could be a simple abdominal muscle spasm due to the chronic cough of yours. However, I would also go for an X-ray chest to rule out any chances of a pleurisy (inflammation of the covering lining of the lungs) due to a lung infection. In this case it should get alright with an anti-inflammatory along with a proton pump inhibitor Chat Doctor. If the pain and tenderness increases or persists, then I would advise you to take an opinion from a general surgeon in your area for further management. I hope this information would help you in discussing with your family physician/treating doctor in further management of your problem. Please do not hesitate to ask in case of any further doubts. Thanks for choosing Chat Doctor to clear doubts on your health problems. I wish you an early recovery. Chat Doctor.","Although it needs a physical examination of your abdomen with related investigations like a cbc and an ultrasound abdomen, with the history, it could be a simple abdominal muscle spasm due to the chronic cough of yours. However, i would also go for an x-ray chest to rule out any chances of a pleurisy (inflammation of the covering lining of the lungs) due to a lung infection. In this case it should get alright with an anti-inflammatory along with a proton pump inhibitor. If the pain and tenderness increases or persists, then i would advise you to take an opinion from a general surgeon in your area for further management. I hope this information would help you in discussing with your family physician/treating doctor in further management of your problem. Please do not hesitate to ask in case of any further doubts.",13.5
2,69572,"If you are a doctor, please answer the medical questions based on the patient's description.",Hello I have been diagnosed with RA for the past 22 years; along with that I have a frozen shoulder. My specialist has advised that I should do the surgery in order to gain mobility in both of my shoulders. I am hesistant to do this procedure. Is there an alternate route that I can take?What natural supplement(s) would you recommend.,"**1. Frozen shoulder [adhesive capsulitis] can occur independently or due to autoimmune disease [RA in your case], and since diagnosis is usually clinical, and it is often self-limited, observation and reassurance may be considered, but because of the painful and debilitating nature of the disease, it may not be acceptable to some patients.2. Since your specialist advised for surgery, it might be possible that your range of motion/movement might be/is compromised.3. As problem of RA is chronic, thus seek opinion of Rheumatologist regarding specific examination and investigation to know the progress of disease, and if RA comes out to be provoking factor for frozen shoulder, alleviate it first.4. Because there is inflammation resulting in thickening of the capsule thus shot ha heir Nikita [anti-inflammatory measures]. Diabetic myopathy although occurs in lower limb, however, when symptoms involve the shoulder girdle, it must be considered in the differential diagnosis of shoulder painful conditions [because it is a Predisposing factor. In Ayurveda, we treat frozen shoulder on the line of Avabahuka","Frozen shoulder [adhesive capsulitis] can occur independently or due to autoimmune disease [ra in your case], and since diagnosis is usually clinical, and it is often self-limited, observation and reassurance may be considered, but because of the painful and debilitating nature of the disease, it may not be acceptable to some patients.2. Since your specialist advised for surgery, it might be possible that your range of motion/movement might be/is compromised.3. As problem of ra is chronic, thus seek opinion of rheumatologist regarding specific examination and investigation to know the progress of disease, and if ra comes out to be provoking factor for frozen shoulder, alleviate it first.4. Because there is inflammation resulting in thickening of the capsule thus shot ha heir nikita [anti-inflammatory measures]. Diabetic myopathy although occurs in lower limb, however, when symptoms involve the shoulder girdle, it must be considered in the differential diagnosis of shoulder painful conditions [because it is a predisposing factor. In ayurveda, we treat frozen shoulder on the line of avabahuka.",13.5
3,71491,"If you are a doctor, please answer the medical questions based on the patient's description.","Is it possible that having titanium rods in both of my femurs can affect fatigue. I am chronically cold but one website said that rods aren t affected by weather, however Im not sure why I am so incredibly freezing all the time since the placement of my rods, but I am wondering then if there is any correlation of metal in your body and fatigue.","Hi there, welcome to Chat Doctor, thanks for your question. There is no correlation between insertion of titanium rods in femur and the sensation of fatigue from what we know from current available literature. However, it is possible that you might feel week in your legs after such an injury and surgery due to the lack of use of the muscles of the lower limb. This is something similar to disuse atrophy. It is advisable to do strengthening exercises of all the muscles around your lower limb under the supervision of your physician and a trained physiotherapist. It is possible that over a time frame of 3 to 6 months your will start feeling better strength which such exercises. On the other hand if you are feeling general fatigue including your whole body general factors like anemia should be considered. It is possible that you would have lost considerable amount of blood during your injury and during your surgery, which can result in anemia. Based on certain blood tests you might require supplements to help you recover from such a situation. Regarding feeling cold, low grade smoldering fever can sometimes make you feel cold. Since you have an implant in your body it would be better to see your treating physical to rule out the same. Examination and some blood tests can provide evidence regarding the samendHope this answers your question. All the best, Thank you. With best regards Chat Doctor.","There is no correlation between insertion of titanium rods in femur and the sensation of fatigue from what we know from current available literature. However, it is possible that you might feel week in your legs after such an injury and surgery due to the lack of use of the muscles of the lower limb. It is advisable to do strengthening exercises of all the muscles around your lower limb under the supervision of your physician and a trained physiotherapist. It is possible that over a time frame of 3 to 6 months your will start feeling better strength which such exercises. On the other hand if you are feeling general fatigue including your whole body general factors like anemia should be considered. It is possible that you would have lost considerable amount of blood during your injury and during your surgery, which can result in anemia. Based on certain blood tests you might require supplements to help you recover from such a situation. Regarding feeling cold, low grade smoldering fever can sometimes make you feel cold. Since you have an implant in your body it would be better to see your treating physical to rule out the same. Examination and some blood tests can provide evidence regarding the samendhope this answers your question.",13.5
4,44926,"If you are a doctor, please answer the medical questions based on the patient's description.","Ive been on birth control for over eight months now a d take them perfectly. Although,my boyfriend and I started having sex last week. Almost a week ago the condom broke and he came inside me. Me, being stupid, I grabbed two birth control pills and swallowed them to be sure. The next day I accidentally took two pills. I know taking more than one pill is dangerous but I wasnt thinking. Anyway, I am on my 6th day of placebo pills and still havent started my period. Is this because of e extra pills? Or a I pregnant this early?","Hello dear, The emergency birth control pills are considered effective when taken within 72 hours of unprotected sexual intercourse. These pills have high hormonal content. And since you have taken 4 pills accidentally, it can cause hormonal imbalance in the body, resulting in delay of periods. However, to relieve your apprehension, you can get a Pregnancy test done using a Pregnancy kit. In case of pregnancy, it will give positive result within one week. The reasons for delay in periods and spotting can be many such as hormonal imbalance(as already explained), ovulatory cycles, stress, any infection or pathology in the uterus, ovary, etc. So, if pregnancy test is negative & periods are further delayed, kindly consult a Gynecologist & get a proper clinical examination done. Investigations like complete blood count, Ultrasound scan of abdomen and pelvis will be helpful to rule out any pathological cause for your symptoms. Wishing you a good health. Take care.","The emergency birth control pills are considered effective when taken within 72 hours of unprotected sexual intercourse. And since you have taken 4 pills accidentally, it can cause hormonal imbalance in the body, resulting in delay of periods. However, to relieve your apprehension, you can get a pregnancy test done using a pregnancy kit. In case of pregnancy, it will give positive result within one week. The reasons for delay in periods and spotting can be many such as hormonal imbalance(as already explained), ovulatory cycles, stress, any infection or pathology in the uterus, ovary, etc. So, if pregnancy test is negative and periods are further delayed, kindly consult a gynecologist and get a proper clinical examination done. Investigations like complete blood count, ultrasound scan of abdomen and pelvis will be helpful to rule out any pathological cause for your symptoms.",13.5


In [57]:
import pandas as pd
import nltk
import re

nltk.download("punkt")

def normalize_sentences(text: str) -> str:
    if not isinstance(text, str) or not text.strip():
        return text

    # normalize spacing around punctuation
    text = re.sub(r"\s+([.,!?])", r"\1", text)
    text = re.sub(r"([.,!?])([A-Za-z])", r"\1 \2", text)
    text = re.sub(r"\s+", " ", text).strip()

    sentences = nltk.sent_tokenize(text)

    sentences = [
        s[0].upper() + s[1:] if s else s
        for s in sentences
    ]

    return " ".join(sentences)


data["output_v21"] = data["output_v21"].apply(normalize_sentences)


[nltk_data] Downloading package punkt to
[nltk_data]     /Users/ivannemcenko/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [59]:
data.sample(2)

Unnamed: 0.1,Unnamed: 0,instruction,input,output,output_v21,score
6449,85724,"If you are a doctor, please answer the medical questions based on the patient's description.","Hi, I am a non vegetarian male and working with a bank. Since last 2 years I have been suffering from severe tail bone pain. My uric acid is also 6.5. I have to travel in uncomfortable positions at least thrice in a week and I also have to sit on chair for prolonged hours.Please advice.","Hi, thanks for posting your concern in the Chat Doctor. Coccydynia may occur due to abrupt (fall on coccyx) or nonabrupt (prolonged sitting) trauma. Degenerative changes in the sacrococcygeal or intracoccygeal junction may also cause coccyx pain. Rarely hyperglycemia may present with this. A lateral view X-ray of the coccyx both sitting and standing position should be done. Hemorrhoids, lumbar degenerative disc disease, lumbar facet arthropathy, performs syn Chat Doctor. Physiotherapy, surgery or sympathetic nerve block may be needed in case of coccyx pathology. My recommendation is to visit your local doctor for further evaluation and treatment. For any further questions please write back to us. Regards.","Coccydynia may occur due to abrupt (fall on coccyx) or nonabrupt (prolonged sitting) trauma. Degenerative changes in the sacrococcygeal or intracoccygeal junction may also cause coccyx pain. A lateral view x-ray of the coccyx both sitting and standing position should be done. Hemorrhoids, lumbar degenerative disc disease, lumbar facet arthropathy, performs syn. Physiotherapy, surgery or sympathetic nerve block may be needed in case of coccyx pathology. My recommendation is to visit your local doctor for further evaluation and treatment. For any further questions please write back to us.",9.17
7418,6603,"If you are a doctor, please answer the medical questions based on the patient's description.",Hi I carried my 2 year daughter and shes pretty heavy to me. She weighs 26lbs and shes pretty solid. when I picked her up my lower stomach started hurting real bad. Felt like harsh cramping I could hardly walk nor sit comfortably. After awhile pain went away but comes back off and on and at times when my bladder gets really full. What can it be???,"Hi. Thanks for your query. Lifting and carrying a heavy baby of 26 lbs can cause the lower abdominal muscles to go into spam, sprain and strain. This is the reason when your bladder is full there is pressure from within onto the muscles that cause pain in you. Yet, to confirm that there is no other problem like Hernia or any pelvic pathology I would advise you to consult a General Surgeon for clinical evaluation, physical examination, ultrasonography and necessary tests, and also to get a proper prescription for anti-inflammatory and necessary medicines.","Lifting and carrying a heavy baby of 26 lbs can cause the lower abdominal muscles to go into spam, sprain and strain. This is the reason when your bladder is full there is pressure from within onto the muscles that cause pain in you. Yet, to confirm that there is no other problem like hernia or any pelvic pathology i would advise you to consult a general surgeon for clinical evaluation, physical examination, ultrasonography and necessary tests, and also to get a proper prescription for anti-inflammatory and necessary medicines.",9.02


In [65]:
data["length_words"].describe()

count    10000.000000
mean       110.581200
std         45.581957
min         30.000000
25%         80.000000
50%         97.000000
75%        128.000000
max        555.000000
Name: length_words, dtype: float64

In [73]:
data = data.drop(columns=['score', 'length_words'])
data.to_csv('Doctor_QLoRA_Ready_without_instruction_10k.csv', index=False)

# Доочистка корпуса с помощью LLM

**Буду использовать ChatGPT**
- Загружать один из двух промптов и выборку из 20 случайных QA
- На выходе будет составлен jsonl файл с обработанными QA, которые пойдут на обучение QLoRA

In [181]:
data = pd.read_csv('Doctor_QLoRA_Ready_without_instruction_10k.csv', index_col=0)

**Промпт для LLM Based доочистки корпуса с ролью врача**

You are preparing medical QA answers for supervised fine-tuning of a language model
that must demonstrate the behavior of a qualified physician responding to a patient.

The goal is NOT stylistic abstraction, but preservation of clinical reasoning,
patient-oriented explanation, and appropriate medical guidance.

CRITICAL CONSTRAINT:
Do NOT add any new medical facts. Do NOT invent diagnoses, tests, or treatments.

Input fields:
- Unnamed: 0
- output_v21 (partially cleaned answer)

Task:
Rewrite output_v21 so that it represents a concise, clinically appropriate
physician response to a patient, following these rules:

STYLE AND ROLE:
- The answer MUST read as if written by a physician addressing a patient.
- Calm, professional, medically grounded tone.
- Allow direct patient addressing (e.g. "this may indicate", "it is recommended").
- Allow limited clinical reassurance or escalation when appropriate.

CONTENT RULES:
- Preserve medical reasoning and causal explanations.
- Preserve advice, recommendations, and next steps if present.
- Remove only:
  - greetings, farewells, apologies
  - platform references
  - emotional filler, religious or philosophical language
- Do NOT remove clinically meaningful explanations.
- Do NOT remove risk framing or practical advice.
- Do NOT introduce disclaimers unrelated to medicine.

STRUCTURE:
- Short paragraphs or bullet points allowed.
- Logical flow: symptom → interpretation → recommendation.
- No unnecessary verbosity.

PROHIBITIONS:
- Do NOT add new medical facts.
- Do NOT speculate beyond the original content.
- Do NOT turn the answer into a generic medical summary.
- Do NOT convert the answer into third-person encyclopedic text.

Return ONLY a JSONL object in the format:
{"Unnamed: 0": <index>, "output_v21": "<final physician-style medical answer>"}
All output_v21
One object per input.
No extra text.

**Второй промпт для LLM Based доочистки корпуса с вариацией ответа в роли врача**

You are preparing medical QA answers for supervised fine-tuning of a language model
that must demonstrate the behavior of a qualified physician responding to a patient.

The goal is NOT stylistic abstraction, but preservation of clinical reasoning,
patient-oriented explanation, and appropriate medical guidance.

CRITICAL CONSTRAINT:
Do NOT add any new medical facts.
Do NOT invent diagnoses, tests, or treatments.
Do NOT contradict the original medical content.

Input fields:
- Unnamed: 0
- output_v21 (partially cleaned answer)

GLOBAL BATCH CONSTRAINT (IMPORTANT):
You are processing a batch of 20 QA answers.
Across the entire batch:
- NO MORE THAN 20–25% of answers may primarily consist of clarification questions.
- AT LEAST 50% of answers MUST contain an explicit clinical interpretation
  (e.g. "most likely explanation", "suggestive of", "consistent with").
- AT LEAST 50% of answers MUST include a clear next-step decision or recommendation.

If an answer is dominated by questions or requests for more information,
you MUST refactor it to include a provisional clinical interpretation
based on the available information, while acknowledging uncertainty.

Task:
Rewrite output_v21 so that it represents a concise, clinically appropriate
physician response to a patient, following the rules below.

ANSWER TYPE CLASSIFICATION (MANDATORY):
First, determine which type the original answer belongs to:

Type A — Clinical guidance answer:
- Contains diagnosis, differential diagnosis, risk assessment, or management advice.

Type B — Information-seeking answer:
- Primarily asks for missing details or clarification.

REFORMATTING RULES BY TYPE:

For Type A (Clinical guidance answers):
- Preserve and emphasize the clinical conclusion.
- Make the "most likely explanation" explicit if it is implied.
- Ensure there is a clear next-step decision (e.g. reassurance, referral, test, monitoring).
- Do NOT add questions unless they are minimal and clinically necessary.

For Type B (Information-seeking answers):
- Reduce the number of clarification questions.
- Convert the answer into a provisional clinical assessment using available data.
- Use language such as:
  "Based on the information provided, the most likely explanation is…"
- Include a clear next-step recommendation even if further details are needed.
- Clarification questions, if present, must be secondary, not dominant.

STYLE AND ROLE:
- The answer MUST read as if written by a physician addressing a patient.
- Calm, professional, medically grounded tone.
- Direct patient addressing is allowed and encouraged.
- Appropriate reassurance or escalation should be included when relevant.

CONTENT RULES:
- Preserve medical reasoning and causal explanations.
- Preserve advice, recommendations, and next steps if present.
- Remove only:
  - greetings, farewells, apologies
  - platform references
  - emotional filler, religious or philosophical language
- Do NOT remove clinically meaningful explanations.
- Do NOT remove risk framing or practical advice.
- Do NOT introduce non-medical disclaimers.

STRUCTURE:
- Short paragraphs or bullet points are allowed.
- Preferred flow:
  symptoms → clinical interpretation → most likely explanation → next step.
- Avoid vague endings such as "further evaluation is needed" without specifying why or how.

PROHIBITIONS:
- Do NOT add new medical facts.
- Do NOT speculate beyond the original content.
- Do NOT turn the answer into a generic medical summary.
- Do NOT convert the answer into third-person encyclopedic text.

Return ONLY a JSONL object in the format:
{"Unnamed: 0": <index>, "output_v21": "<final physician-style medical answer>"}
All output_v21
One object per input.
No extra text.

итерируюсь по корпусу QA, очищая в сумме 1000 сэмплов

In [236]:
data[['output_v21']].iloc[4180:4181]

Unnamed: 0_level_0,output_v21
Unnamed: 0,Unnamed: 1_level_1
3664,"The visible paining vein that you are seeing at the back of the leg and the intensity of pain which increases when you stand or walk for longer time indicate towards the possibility of varicose vein. Veins are the blood tubes that carry deoxygenated blood from various parts of our body to heart and lungs. If these veins become abnormally thick, twisted or tortuous, they are called varicose veins. Usually these happen in elderly people, but if there is some abnormality in the vein wall, you are obese, your routine requires you to stand for very long periods of time then younger people like you may suffer from it, too. To get relief from this, you need to wear compression stockings that are available in medical stores. Elevate your legs while sleeping and if possible, when you are sitting. If you do not find relief in symptoms or then worsen, you should see a physician for physical examination."


In [122]:
data.head(1)

Unnamed: 0_level_0,instruction,input,output,output_v21
Unnamed: 0,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
53582,"If you are a doctor, please answer the medical questions based on the patient's description.","my friend is currently in the hospital because of her stomach pains . she has lost a ton of weight because of stress. she has also been having sex latley, the doctor assume its her appendix but i know she didnt tell the whole story in front of her parents. can it be something else?","Hi ! Good morning. I am Chat Doctor answering your query. I must tell you in the beginning that the treating doctor of your friend would be the best judge to tell you what she has got, because the differential diagnosis though can be guessed from the partial history, confirmation could be done only by a detailed physical examination and related investigations. If the doctors assume that it is the appendix, then they would go for a surgery and the tissue for histopathology. However, history of loss of weight is perplexing in that simple appendix might not give rise to significant weight loss as you described. Mostly chronic inflammatory diseases or some other kind of pathological process could land the patient up in gross weight loss. She would have to be investigated more to know the cause of weight loss, which could also be due to the history of stress you wrote. Therefore, the management would vary depending on the clinical findings and investigation reports by the treating doctor. I hope this information would help you in discussing with your family physician/treating doctor in further management of your problem. Please do not hesitate to ask in case of any further doubts. Thanks for choosing Chat Doctor to clear doubts on your health problems. Wishing you an early recovery. Chat Doctor.","I must tell you in the beginning that the treating doctor of your friend would be the best judge to tell you what she has got, because the differential diagnosis though can be guessed from the partial history, confirmation could be done only by a detailed physical examination and related investigations. If the doctors assume that it is the appendix, then they would go for a surgery and the tissue for histopathology. However, history of loss of weight is perplexing in that simple appendix might not give rise to significant weight loss as you described. Mostly chronic inflammatory diseases or some other kind of pathological process could land the patient up in gross weight loss. She would have to be investigated more to know the cause of weight loss, which could also be due to the history of stress you wrote. Therefore, the management would vary depending on the clinical findings and investigation reports by the treating doctor. I hope this information would help you in discussing with your family physician/treating doctor in further management of your problem. Please do not hesitate to ask in case of any further doubts."


**По итогам доочистки с помощбю LLM был составлен jsonl файл с очищенными output**
- Далее составляю новый корпус только с готовыми QA парами

In [16]:
import json
import pandas as pd

REQUIRED_KEYS = {"Unnamed: 0", "output_v21"}

bad_lines = []
missing_keys = []
clean_objects = []

with open("v2_data_ready.jsonl", "r", encoding="utf-8") as f:
    for i, line in enumerate(f, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            obj = json.loads(line)

            # нормализация ключей
            obj = {k.strip(): v for k, v in obj.items()}

            if not REQUIRED_KEYS.issubset(obj):
                missing_keys.append((i, obj.keys()))
            else:
                clean_objects.append(obj)

        except json.JSONDecodeError as e:
            bad_lines.append((i, str(e)))

print(f"❌ Строк с JSON-ошибками: {len(bad_lines)}")
print(f"⚠️ Строк без обязательных ключей: {len(missing_keys)}")
print(f"✅ Валидных строк: {len(clean_objects)}")

# загрузка в DataFrame уже с нормализованными ключами
jsonl_df = pd.DataFrame(clean_objects)

❌ Строк с JSON-ошибками: 0
⚠️ Строк без обязательных ключей: 0
✅ Валидных строк: 1000


In [17]:
import pandas as pd

data = pd.read_csv('Doctor_QLoRA_Ready_without_instruction_10k.csv')

# оставляем только нужные колонки из jsonl
jsonl_df = jsonl_df[["Unnamed: 0", "output_v21"]]

# merge по Unnamed: 0
merged_df = data.merge(
    jsonl_df,
    on="Unnamed: 0",
    how="inner"
)

# финальный датасет с нужными колонками
final_df = merged_df[
    ["Unnamed: 0", "instruction", "input", "output_v21_y"]
].rename(columns={"output_v21_y": "output_v21"})


In [18]:
instruction_text = (
    "You are a licensed medical doctor. Respond in a professional, neutral, and explanatory tone."
)

final_df["instruction"] = instruction_text

In [19]:
final_df[['instruction', 'input', 'output_v21']].head()

Unnamed: 0,instruction,input,output_v21
0,"You are a licensed medical doctor. Respond in a professional, neutral, and explanatory tone.","my friend is currently in the hospital because of her stomach pains . she has lost a ton of weight because of stress. she has also been having sex latley, the doctor assume its her appendix but i know she didnt tell the whole story in front of her parents. can it be something else?","Based on the limited history provided, a definite diagnosis cannot be made without physical examination and appropriate investigations by the treating physician. If appendicitis is suspected, surgical evaluation and histopathology after removal would clarify the diagnosis. However, the history of significant weight loss is not typical for uncomplicated appendicitis and raises concern for a chronic inflammatory condition or another pathological process. Further evaluation is required to determine the cause of weight loss, which may also be influenced by stress as mentioned. Management will depend on clinical findings and investigation results, and this should be discussed in detail with the treating doctor."
1,"You are a licensed medical doctor. Respond in a professional, neutral, and explanatory tone.","Hi, I have a very sharp stabbing pain in the very top of my abd very time I cough to the point I am trying not to cough anymore. The pain is located in a line accross my abd and is tender to the touch, could it just be a pulled muscle or should i worry?","Based on your symptoms, abdominal examination and investigations such as a complete blood count and abdominal ultrasound are needed. Given the history, the pain may be due to abdominal muscle spasm related to chronic coughing. A chest X-ray is also reasonable to rule out pleurisy related to lung infection. If this is the cause, symptoms usually improve with anti-inflammatory treatment along with a proton pump inhibitor. If pain or tenderness increases or persists, consultation with a general surgeon is recommended for further evaluation."
2,"You are a licensed medical doctor. Respond in a professional, neutral, and explanatory tone.",Hello I have been diagnosed with RA for the past 22 years; along with that I have a frozen shoulder. My specialist has advised that I should do the surgery in order to gain mobility in both of my shoulders. I am hesistant to do this procedure. Is there an alternate route that I can take?What natural supplement(s) would you recommend.,"Frozen shoulder (adhesive capsulitis) can occur on its own or in association with autoimmune conditions such as rheumatoid arthritis. Diagnosis is usually clinical, and the condition is often self-limiting, but pain and restricted movement can be significant. Surgical advice may suggest that shoulder range of motion is markedly limited. Given the chronic nature of rheumatoid arthritis, evaluation by a rheumatologist is important to assess disease activity and determine whether it is contributing to the shoulder problem. Management focuses on controlling inflammation, as capsular thickening is inflammatory in nature. Other contributing conditions should be considered based on clinical assessment."
3,"You are a licensed medical doctor. Respond in a professional, neutral, and explanatory tone.","Is it possible that having titanium rods in both of my femurs can affect fatigue. I am chronically cold but one website said that rods aren t affected by weather, however Im not sure why I am so incredibly freezing all the time since the placement of my rods, but I am wondering then if there is any correlation of metal in your body and fatigue.","There is no established association between titanium rods in the femur and generalized fatigue. However, weakness in the legs is common after injury and surgery due to reduced muscle use. Strengthening exercises for the lower limbs under medical and physiotherapy supervision are recommended, and improvement is usually noted over 3 to 6 months. If you are experiencing generalized fatigue, causes such as anemia should be considered, especially if there was significant blood loss during injury or surgery. Blood tests may indicate the need for supplements. Feeling cold can sometimes be associated with low-grade fever, and given the presence of an implant, evaluation by your treating physician with examination and blood tests is advisable."
4,"You are a licensed medical doctor. Respond in a professional, neutral, and explanatory tone.","Ive been on birth control for over eight months now a d take them perfectly. Although,my boyfriend and I started having sex last week. Almost a week ago the condom broke and he came inside me. Me, being stupid, I grabbed two birth control pills and swallowed them to be sure. The next day I accidentally took two pills. I know taking more than one pill is dangerous but I wasnt thinking. Anyway, I am on my 6th day of placebo pills and still havent started my period. Is this because of e extra pills? Or a I pregnant this early?","Emergency contraceptive pills are effective when taken within 72 hours of unprotected intercourse. Taking multiple pills can lead to hormonal imbalance, which may cause delayed periods and spotting. To rule out pregnancy, a home pregnancy test can be done, which usually becomes positive within one week if pregnancy has occurred. If the test is negative and periods remain delayed, further evaluation by a gynecologist is recommended, as menstrual delay can also result from hormonal imbalance, stress, ovulatory variations, or uterine or ovarian conditions. Investigations such as blood tests and ultrasound may be required."


In [20]:
final_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Unnamed: 0   1000 non-null   int64 
 1   instruction  1000 non-null   object
 2   input        1000 non-null   object
 3   output_v21   1000 non-null   object
dtypes: int64(1), object(3)
memory usage: 31.4+ KB


In [21]:
final_df.to_csv("Doctor_V2_QLoRA_1000.csv", index=False)

**Решил изменить стиль инференса с "в одну строку" на "по строке"**

In [22]:
import re
import pandas as pd

def split_sentences(text: str) -> str:
    if not isinstance(text, str):
        return text
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    return "\n".join(sentences)

df = pd.read_csv("Doctor_V2_QLoRA_1000.csv")
df["output_v21"] = df["output_v21"].apply(split_sentences)

df.to_csv("Doctor_V2_QLoRA_1000_split.csv", index=False)

# Пример готового корпуса для QLoRA

In [23]:
df.sample(20)

Unnamed: 0.1,Unnamed: 0,instruction,input,output_v21
44,96173,"You are a licensed medical doctor. Respond in a professional, neutral, and explanatory tone.","my sister is 76 old ,She been having stomach problem and have lost 50 Lb and have no straingth at all. she has gone to the doctor and give her ulcer medication.It is very sad to see canadas health system how it works.Today she was feling very bed and end up some how in the hospital.The specialist dr. tels her to go home and to see her family doctor.Ho sick is that!!!thank you and god bles","In view of significant weight loss at this age, further evaluation is warranted.\nInitial assessment should include a general physical examination and blood tests such as a complete blood count, ESR, blood sugar, and liver and thyroid function tests, along with an abdominal ultrasound to look for any intra-abdominal pathology.\nIf these tests are inconclusive, referral to a gastroenterologist may be needed, and endoscopic evaluation with possible biopsy could be considered.\nSymptomatic treatment with an antispasmodic and a proton pump inhibitor may be used initially, with further management guided by investigation results."
484,54089,"You are a licensed medical doctor. Respond in a professional, neutral, and explanatory tone.","My 3 year old has fever of 100.8 since last night. tylenol is helping, but then it spikes again. When the fever is spiking, she says she has a tummy/head ache. Early today she started saying her teeth hurt and I could feel a couple moalers comming in. Should I be concered or is it just the teeth? Ohh n she has had a slight runny nose.","Teething commonly causes gum discomfort, swelling, and mild temperature elevation.\nHowever, the presence of runny nose and fever makes a viral upper respiratory infection the more likely explanation for those symptoms, with gum pain possibly related to teething.\nThe appropriate next steps are supportive care: continue acetaminophen for pain or fever relief, ensure adequate fluid intake, offer a soft diet, provide rest, and consider steam inhalation or use of a humidifier for nasal symptoms."
142,87745,"You are a licensed medical doctor. Respond in a professional, neutral, and explanatory tone.","hello, doctor iam facing a sever back pain in the mid of my spinal bone for several days and i cunsulted a doctor and took my x-ray and MRI scan in which it shows some L1, L2, D11,D12, and D10 had some prob which is causing a back pain and says it could be an Spine TB so just advice me where to get the good treatment in hyderabad, India.","Based on the report you describe, the radiologist’s impression suggests spinal tuberculosis as a possible cause.\nThis condition is treatable, though bone tuberculosis typically requires a longer course of therapy.\nConfirmation and treatment planning require direct review of imaging and clinical evaluation.\nA physician at a tertiary care center can initiate therapy and arrange ongoing monitoring.\nRegular follow-up is necessary to assess response and adjust treatment as needed."
810,59416,"You are a licensed medical doctor. Respond in a professional, neutral, and explanatory tone.","Hi, Im almost 5months pregnant and lately im suffering from itching at my private part, this started after having contact with my hubby.. When do pregnant women stop from having contact (sex?) I also have a history of UTI. Hope you could help me. Thank you.","The first three months of pregnancy are a sensitive period, and sexual abstinence is generally advised.\nAfter this, if there are no risk factors, intercourse may be allowed with caution and comfortable positions.\nIn later pregnancy, sex may become uncomfortable, and abdominal pressure can trigger contractions.\nYou also describe urinary and genital symptoms suggestive of infection.\nUrine examination with culture is needed to guide antibiotics.\nGenital itching and discharge after intercourse suggest possible infection, and intercourse should be avoided until both partners are fully treated.\nYour obstetrician should be informed so appropriate treatment can be given."
234,81326,"You are a licensed medical doctor. Respond in a professional, neutral, and explanatory tone.","My daughter is 9 years old and a fairly healthy child. But for the last month or so she has had fever thrice coupled with loose stomach and once a throat infection. She recovers quickly. I have attributed this to changed weather , heat, swimming, school food. However today when I showed her to a doctor , he diagnosed her with swollen cervical glands and said it wud need treatment for 6 months or so and can lead to retardation in growth and tuberculosis . Pls advise.","The history you provided suggests recurrent infections, most likely viral, involving the gastrointestinal and upper respiratory tracts.\nEnlarged cervical lymph nodes require proper clinical examination to determine the cause.\nIt is important to know whether these lymph nodes are a new finding or have been present previously.\nShort-term enlargement is often due to viral or bacterial infection and may resolve with appropriate treatment.\nTuberculosis should only be considered if confirmed by investigations.\nEvaluation will guide further management.\nPrevious febrile episodes do not appear directly related to lymph node enlargement."
577,49162,"You are a licensed medical doctor. Respond in a professional, neutral, and explanatory tone.",for past 6 months i have thick phelm stuck in my throat which makes my tounge coated in yellow stuff my mouth is swallen inside and it affects my speach i went to see few doctors they say is acid reflux but i know i havnt got a heart burn the phelm is scoming out from the back of my nose. plz help,"Thick phlegm in the throat may originate from postnasal drip due to sinusitis or from laryngopharyngeal reflux, which can occur without typical heartburn.\nA yellow coating in the mouth suggests possible oral thrush, and swelling may reflect associated inflammation.\nDefinitive diagnosis requires clinical evaluation and appropriate investigations.\nSupportive measures such as steam inhalation, gargling, and antacids may provide relief until further assessment guides targeted treatment."
256,40822,"You are a licensed medical doctor. Respond in a professional, neutral, and explanatory tone.",Today I had some light pink vaginal discharge. I ve never had that before. Before my periods I just get light bleeding in clear discharge. This was kinda thick and pink. My periods are irregular and I use a diaphragm or condoms when having sex. I ve only ever been with my husband. Does that sound like a sign a of pregnancy?,"If you have been consistently using contraception, including barrier methods, the likelihood of pregnancy is very low.\nOccasional abnormal vaginal discharge is more suggestive of a vaginal infection than pregnancy.\nMild bleeding can occur if the vaginal lining is irritated, and thick discharge raises the possibility of a yeast infection causing local inflammation.\nPregnancy symptoms vary widely, but the most common sign is a missed period.\nOther symptoms such as nausea, fatigue, bloating, breast tenderness, and mood changes can also be related to hormonal fluctuations.\nIf your period is not delayed, the symptoms are more likely due to vaginal infection rather than pregnancy.\nEvaluation by a gynecologist or women’s health provider is recommended to examine the vagina and determine the cause of discharge."
908,67570,"You are a licensed medical doctor. Respond in a professional, neutral, and explanatory tone.","Hi, I am 23 yrs old n now I am trying to conceive but My periods are regular but my last period has continue for 8 days...I getting fear... I dont know why I it happen...please can u help me on it....I am trying to get conceive so can u suggest me what I do to get pregnent as soon as possible..","Repeated or prolonged menstrual bleeding requires proper evaluation.\nFor conception, frequent unprotected intercourse is advised around the ovulation period, typically between days 10 and 17 of the cycle.\nIf bleeding between periods is excessive, investigations such as thyroid testing and pelvic ultrasound are needed to rule out organic causes.\nMedications like tranexamic acid or hormonal therapy may be prescribed if indicated.\nIf pregnancy does not occur after one year of regular intercourse, further investigations including blood tests, imaging, and possibly endometrial biopsy or hysteroscopy may be required.\nConsultation with a gynecologist is recommended for examination and planning."
293,83397,"You are a licensed medical doctor. Respond in a professional, neutral, and explanatory tone.","Hi I fell fully forward onto my hands/wrists about 10 weeks ago. Xray shows no fracture of my right wrist (the left is fine) but it it painfull to use, especially in a twisting action ie. open a lid of a jar etc. 4 weeks ago the tip of my thumb (outer aspect only) went numb suddenly and is still numb now. My GP said it will just go away by itself but i am worried it might be nerve damage? Do I need to see a physio?","The hand and wrist are supplied by nerves originating from the neck, and pain or numbness can result from nerve compression along their course.\nWhile the absence of fracture reduces the likelihood of bony injury, nerve involvement cannot be excluded without examination.\nTwisting movements place stress on wrist ligaments and can cause pain.\nPersistent numbness of the thumb after several weeks warrants evaluation by a neurologist, with coordination with a physiotherapist.\nNerve conduction studies may be required following clinical assessment."
788,73991,"You are a licensed medical doctor. Respond in a professional, neutral, and explanatory tone.","hi, i am 4 months pregnant, and i have been experiencing heart palpitations, chest pain and pain my left arm. Mostly just pain in my left arm. I have been through the routine check for heart disease, which included a heart echo, and ekg s and have been seen by the top cardilogists in toronto, ontario. I can t seem to figure out what is wrong with me. Other doctors that i have been to said that there can possibly be a problem with anxiety, like an anxiety attack but I really and truly feel that this is not the problem. I am worried for my unborn child, which is not my first. I have never had any problems with my previous pregnancy, so I am wondering what is wrong with me. please help","Since the ECG and echocardiogram are normal, the palpitations are most likely intermittent rhythm disturbances that may not be captured on routine testing.\nThe next appropriate step is Holter monitoring to record heart rhythm during symptomatic periods.\nContributing factors such as anemia, stress, or infection should also be evaluated.\nManagement will depend on Holter findings.\nIn the meantime, adequate rest, stress reduction, and maintaining a balanced diet are advised."


# Конвертация модели с адаптером в MLX для совместимости с MAC платформой

In [1]:
from pathlib import Path
from mlx_lm import load, generate
import time

  from .autonotebook import tqdm as notebook_tqdm


In [1]:
from mlx_lm import load, generate

model_path = "./Qwen3B_MLX_FP16"

# Загружаем модель
model, tokenizer = load(model_path)

# Формируем промпт правильно
messages = [
    {"role": "system", "content": """
You are an AI doctor. Your goal is to write answers that look very similar
in style and format to the reference answers in this dataset.

EXAMPLE OF STYLE:
"Hi, Thanks for the query. Usually posterior subaerial fibroid may not affect the chances of pregnancy.
But it can grow during pregnancy and can lead to complications like red degeneration, difficulty in labor,
increasing the chances of surgical procedures etc. So, better to get treated for the fibroid before planning
for pregnancy. The exact time taken for the fibroid to grow cannot be told as that can differ from person to
person and depends on hormonal levels. Treatment options of fibroid depends on size, symptoms etc. Consult
local gynecologist once and take her opinion. You can go for hormonal pills for few months, that can decrease
the size of the fibroid. If fibroid is not decreasing or increases in size, you may need surgical procedure.
Mastectomy can be done either by laparoscopic method or by laparotomy. After the recovery with your doctors'
advice you can plan for pregnancy. Take care."

INSTRUCTIONS FOR YOUR ANSWER:
- Follow the same tone: polite, explanatory, slightly repetitive, focused on reassurance and guidance.
- Use similar sentence length and paragraph style (continuous prose, no bullet points).
- Mention common phrases like "depends on", "consult local doctor", "take care" when appropriate.
- Adapt the medical content to the new question, but keep the overall style very close to the example.
- Do NOT write lists or headings; use only plain sentences.
- Do NOT refer to this example explicitly in your answer.
    """},
    {"role": "user", "content": "Hi! Please help,I have a small, slightly bumpy white splotch on the inside of both my cheeks. It is surrounded by small prominent purple veins. Is this cancerous? Im a 23 year old non-smoking male. Should I wait a few days or see a dr immediately? Thank you for your kindness."}
]

# Создаем финальную строку промпта
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Генерируем
response = generate(
    model, 
    tokenizer, 
    prompt=prompt, 
    max_tokens=512, 
    verbose=True
)

  from .autonotebook import tqdm as notebook_tqdm
The tokenizer you are loading from 'Qwen3B_MLX_FP16' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.


Hi, Thanks for the query. The splotch on the inside of the cheeks can be due to a benign condition like sebaceous cyst or a small ulcer.
It is important to consult a doctor for clinical examination and investigations like fine needle aspiration cytology or biopsy if required.
This will help in confirming the diagnosis and planning further management.
In the meantime, avoid touching or scratching the area as it can lead to infection.
Take care.
Prompt: 419 tokens, 213.930 tokens-per-sec
Generation: 90 tokens, 24.715 tokens-per-sec
Peak memory: 6.431 GB


=Query=
Hi! Please help,I have a small, slightly bumpy white splotch on the inside of both my cheeks. 
It is surrounded by small prominent purple veins. Is this cancerous? Im a 23 year old non-smoking male. 
Should I wait a few days or see a dr immediately? Thank you for your kindness.

=doctor_answer=
Thanks for posting your query to Chat Doctor. After going through your history, I want to assure you not to get worried about it. White spots in oral cavity can occur due to many reasons which can only be told after proper visual examination. Do you have any other skin problem as sometimes these spots may be related to other systemic conditions? I would suggest you to visit a dentist and if needed a dermatologist for proper examination and treatment. Hope my answer will help you.

=RAW_DATA_QLoRA=
Hello, Welcome to Chat Doctor, As you have a white patch on the cheek which is surrounded by small prominent purple veins. 
This could be due to a condition called as oral candidiasis. 
It is a fungal infection of the mouth. The fungus grows in moist areas of the mouth. 
It is commonly seen in people who wear dentures, people with diabetes, people who take broad spectrum antibiotics, people who smoke, people who drink alcohol, people who have HIV/AIDS, people who have had radiation therapy to the head and neck area. 
You need to consult an oral physician for proper diagnosis and treatment. You can take antifungal medications like fluconazole, itraconazole, etc. Take care.

=Cleaned_DATA_QLoRA=
The appearance you describe is most consistent with a benign lesion such as a sebaceous cyst or a papilloma.
These are common and usually harmless.
If the lesion is painful or growing, or if you have any concerns, evaluation by a physician is appropriate.
If the lesion is asymptomatic and stable, observation is reasonable.
If you have any other symptoms or concerns, a clinical examination by a physician is recommended.

=Cleaned_DATA_QLoRA+System_PROMPT=
Hi, Thanks for the query. The splotch on the inside of the cheeks can be due to a benign condition like sebaceous cyst or a small ulcer.
It is important to consult a doctor for clinical examination and investigations like fine needle aspiration cytology or biopsy if required.
This will help in confirming the diagnosis and planning further management.
In the meantime, avoid touching or scratching the area as it can lead to infection.
Take care.

In [None]:
# ============================================================
# QLoRA CLEANED DATA REGENERATION
# ============================================================

import json
import time
import io
import re
from contextlib import redirect_stdout
from pathlib import Path

from mlx_lm import load, generate

# ----------------------------
# CONFIG
# ----------------------------
INPUT_JSON_PATH = "./QLoRA.json"
OUTPUT_JSON_PATH = "./QLoRA_Final.json"
MODEL_PATH = "./Qwen3B_MLX_FP16"

SYSTEM_PROMPT = (
    "You are a licensed medical doctor. Respond in a professional, neutral, and explanatory tone."
)

MAX_TOKENS = 1024

# ----------------------------
# LOAD MODEL
# ----------------------------
print("Loading MLX model...")
model, tokenizer = load(MODEL_PATH)
print("Model loaded.\n")

# ----------------------------
# LOAD INPUT JSON
# ----------------------------
with open(INPUT_JSON_PATH, "r", encoding="utf-8") as f:
    data = json.load(f)

# ----------------------------
# ITERATE & INFERENCE
# ----------------------------
for key, item in data.items():

    query_text = item.get("query", "").strip()
    if not query_text:
        print(f"[{key}] SKIPPED (empty query)\n")
        continue

    print(f"[{key}] Inference started")

    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": query_text},
    ]

    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

    stdout_buffer = io.StringIO()
    start_time = time.time()

    with redirect_stdout(stdout_buffer):
        response = generate(
            model=model,
            tokenizer=tokenizer,
            prompt=prompt,
            max_tokens=MAX_TOKENS,
            verbose=True
        )

    latency_sec = round(time.time() - start_time, 4)
    stdout_text = stdout_buffer.getvalue()

    # ----------------------------
    # PARSE VERBOSE OUTPUT
    # ----------------------------
    generated_tokens = None
    total_tokens = None

    prompt_match = re.search(r"Prompt:\s+(\d+)\s+tokens", stdout_text)
    gen_match = re.search(r"Generation:\s+(\d+)\s+tokens", stdout_text)



    if gen_match:
        generated_tokens = int(gen_match.group(1))


    # ----------------------------
    # LOG PER-ITEM METRICS
    # ----------------------------
    print(
        f"[{key}] latency_sec={latency_sec}, "
        f"generated_tokens={generated_tokens}, "
        f"total_tokens={total_tokens}\n"
    )

    # ----------------------------
    # OVERWRITE Cleaned_DATA_QLoRA
    # ----------------------------
    item["Cleaned_DATA_QLoRA"] = {
        "text": response.strip(),
        "latency_sec": latency_sec,
        "generated_tokens": generated_tokens,
        "total_tokens": total_tokens
    }

    # ----------------------------
    # REMOVE UNWANTED FIELDS
    # ----------------------------
    for block in ["base_model_Qwen3B", "RAW_DATA_QLoRA", "Cleaned_DATA_QLoRA"]:
        if block in item:
            item[block].pop("timestamp_utc", None)

    if "meta" in item:
        item["meta"].pop("created_at_utc", None)
        if not item["meta"]:
            item.pop("meta")

# ----------------------------
# SAVE FINAL JSON
# ----------------------------
with open(OUTPUT_JSON_PATH, "w", encoding="utf-8") as f:
    json.dump(data, f, indent=2, ensure_ascii=False)

print(f"All done. Saved to {OUTPUT_JSON_PATH}")

  from .autonotebook import tqdm as notebook_tqdm


Loading MLX model...


The tokenizer you are loading from 'Qwen3B_MLX_FP16' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.


Model loaded.

[1] Inference started
[1] latency_sec=4.6923, prompt_tokens=99, generated_tokens=85, total_tokens=184

[2] Inference started
[2] latency_sec=4.0908, prompt_tokens=147, generated_tokens=93, total_tokens=240

[3] Inference started
[3] latency_sec=3.0502, prompt_tokens=116, generated_tokens=69, total_tokens=185

[4] Inference started
[4] latency_sec=3.2895, prompt_tokens=103, generated_tokens=75, total_tokens=178

[5] Inference started
[5] latency_sec=3.5544, prompt_tokens=114, generated_tokens=82, total_tokens=196

[6] Inference started
[6] latency_sec=4.5659, prompt_tokens=151, generated_tokens=103, total_tokens=254

[7] Inference started
[7] latency_sec=3.2378, prompt_tokens=136, generated_tokens=71, total_tokens=207

[8] Inference started
[8] latency_sec=5.758, prompt_tokens=142, generated_tokens=129, total_tokens=271

[9] Inference started
[9] latency_sec=5.447, prompt_tokens=95, generated_tokens=120, total_tokens=215

[10] Inference started
[10] latency_sec=3.5834, pr