# Task 1: Create a Prescription Parser using CRF
This task tests your ability to build a doctor prescription parser using a CRF (Conditional Random Field) model.

Your job is to build a prescription parser that takes a prescription (a sentence) as input and labels each word in that sentence with one of the predefined labels.

### Problem: SEQUENCE PREDICTION - Label words in a sentence
#### Input : Doctor Prescription in the form of a sentence split into tokens
- Ex: Take 2 tablets once a day for 10 days

#### Output : FHIR Labels
- ('Take', 'Method')
- ('2', 'Qty') 
- ('tablets', 'Form')
- ('once', 'Frequency')
- ('a', 'Period') 
- ('day', 'PeriodUnit')
- ('for', 'FOR')
- ('10', 'Duration')
- ('days', 'DurationUnit') 

### Major Steps
- Install the necessary libraries
- Import the libraries
- Create training data with labels
    - Split the sentence into tokens
    - Compute POS tags
    - Create triples
- Extract features
- Split the data into training and testing set
- Create CRF model
- Save the CRF model
- Load the CRF model
- Predict on test data
- Evaluate accuracy

#### Install the necessary libraries

In [56]:

!pip install sklearn-crfsuite nltk tabulate -q

#### Import the necessary libraries

In [57]:
# Import core Python and NLP libraries
import re
import string
import nltk
from nltk import pos_tag, word_tokenize
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
import sklearn_crfsuite
from sklearn_crfsuite import metrics

# Download required NLTK data (only the first time)
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')

# General-purpose libraries
import pandas as pd
import numpy as np

[nltk_data] Downloading package punkt to /Users/karthik/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/karthik/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/karthik/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Input data (GIVEN)
#### Creating the inputs to the ML model in the following form:
- sigs --> ['take 3 tabs for 10 days']       INPUT SIG
- input_sigs --> [['take', '3', 'tabs', 'for', '10', 'days']]      TOKENS
- output_labels --> [['Method','Qty', 'Form', 'FOR', 'Duration', 'DurationUnit']]       LABELS
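Since the three lists are aligned by position, a quick consistency check can catch labelling mistakes before training. The following is a minimal sketch, not part of the given data: `tokenize_sig` and `find_misaligned` are hypothetical helper names, and the tokenizer assumes that lowercasing and whitespace splitting are enough, which matches the example above but not necessarily every sig.

```python
def tokenize_sig(sig):
    """Lowercase a sig and split it on whitespace (assumed tokenization)."""
    return sig.lower().split()


def find_misaligned(token_lists, label_lists):
    """Return the indices of sentences whose token and label counts differ."""
    return [
        i
        for i, (tokens, labels) in enumerate(zip(token_lists, label_lists))
        if len(tokens) != len(labels)
    ]


# Two small examples in the same shape as sigs / input_sigs / output_labels
example_sigs = ["take 3 tabs for 10 days", "Take 2 tablets a day"]
example_tokens = [tokenize_sig(s) for s in example_sigs]
example_labels = [
    ['Method', 'Qty', 'Form', 'FOR', 'Duration', 'DurationUnit'],
    ['Method', 'Qty', 'Form', 'Period', 'PeriodUnit'],
]

print(example_tokens[0])
print(find_misaligned(example_tokens, example_labels))
```

An empty list from `find_misaligned` means every sentence has exactly one label per token.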

In [58]:
sigs = ["for 5 to 6 days", "inject 2 units", "x 2 weeks", "x 3 days", "every day", "every 2 weeks", "every 3 days", "every 1 to 2 months", "every 2 to 6 weeks", "every 4 to 6 days", "take two to four tabs", "take 2 to 4 tabs", "take 3 tabs orally bid for 10 days at bedtime", "swallow three capsules tid orally", "take 2 capsules po every 6 hours", "take 2 tabs po for 10 days", "take 100 caps by mouth tid for 10 weeks", "take 2 tabs after an hour", "2 tabs every 4-6 hours", "every 4 to 6 hours", "q46h", "q4-6h", "2 hours before breakfast", "before 30 mins at bedtime", "30 mins before bed", "and 100 tabs twice a month", "100 tabs twice a month", "100 tabs once a month", "100 tabs thrice a month", "3 tabs daily for 3 days then 1 tab per day at bed", "30 tabs 10 days tid", "take 30 tabs for 10 days three times a day", "qid q6h", "bid", "qid", "30 tabs before dinner and bedtime", "30 tabs before dinner & bedtime", "take 3 tabs at bedtime", "30 tabs thrice daily for 10 days ", "30 tabs for 10 days three times a day", "Take 2 tablets a day", "qid for 10 days", "every day", "take 2 caps at bedtime", "apply 3 drops before bedtime", "take three capsules daily", "swallow 3 pills once a day", "swallow three pills thrice a day", "apply daily", "apply three drops before bedtime", "every 6 hours", "before food", "after food", "for 20 days", "for twenty days", "with meals"]
input_sigs = [['for', '5', 'to', '6', 'days'], ['inject', '2', 'units'], ['x', '2', 'weeks'], ['x', '3', 'days'], ['every', 'day'], ['every', '2', 'weeks'], ['every', '3', 'days'], ['every', '1', 'to', '2', 'months'], ['every', '2', 'to', '6', 'weeks'], ['every', '4', 'to', '6', 'days'], ['take', 'two', 'to', 'four', 'tabs'], ['take', '2', 'to', '4', 'tabs'], ['take', '3', 'tabs', 'orally', 'bid', 'for', '10', 'days', 'at', 'bedtime'], ['swallow', 'three', 'capsules', 'tid', 'orally'], ['take', '2', 'capsules', 'po', 'every', '6', 'hours'], ['take', '2', 'tabs', 'po', 'for', '10', 'days'], ['take', '100', 'caps', 'by', 'mouth', 'tid', 'for', '10', 'weeks'], ['take', '2', 'tabs', 'after', 'an', 'hour'], ['2', 'tabs', 'every', '4-6', 'hours'], ['every', '4', 'to', '6', 'hours'], ['q46h'], ['q4-6h'], ['2', 'hours', 'before', 'breakfast'], ['before', '30', 'mins', 'at', 'bedtime'], ['30', 'mins', 'before', 'bed'], ['and', '100', 'tabs', 'twice', 'a', 'month'], ['100', 'tabs', 'twice', 'a', 'month'], ['100', 'tabs', 'once', 'a', 'month'], ['100', 'tabs', 'thrice', 'a', 'month'], ['3', 'tabs', 'daily', 'for', '3', 'days', 'then', '1', 'tab', 'per', 'day', 'at', 'bed'], ['30', 'tabs', '10', 'days', 'tid'], ['take', '30', 'tabs', 'for', '10', 'days', 'three', 'times', 'a', 'day'], ['qid', 'q6h'], ['bid'], ['qid'], ['30', 'tabs', 'before', 'dinner', 'and', 'bedtime'], ['30', 'tabs', 'before', 'dinner', '&', 'bedtime'], ['take', '3', 'tabs', 'at', 'bedtime'], ['30', 'tabs', 'thrice', 'daily', 'for', '10', 'days'], ['30', 'tabs', 'for', '10', 'days', 'three', 'times', 'a', 'day'], ['take', '2', 'tablets', 'a', 'day'], ['qid', 'for', '10', 'days'], ['every', 'day'], ['take', '2', 'caps', 'at', 'bedtime'], ['apply', '3', 'drops', 'before', 'bedtime'], ['take', 'three', 'capsules', 'daily'], ['swallow', '3', 'pills', 'once', 'a', 'day'], ['swallow', 'three', 'pills', 'thrice', 'a', 'day'], ['apply', 'daily'], ['apply', 'three', 'drops', 'before', 'bedtime'], ['every', '6', 
'hours'], ['before', 'food'], ['after', 'food'], ['for', '20', 'days'], ['for', 'twenty', 'days'], ['with', 'meals']]
output_labels = [['FOR', 'Duration', 'TO', 'DurationMax', 'DurationUnit'], ['Method', 'Qty', 'Form'], ['FOR', 'Duration', 'DurationUnit'], ['FOR', 'Duration', 'DurationUnit'], ['EVERY', 'Period'], ['EVERY', 'Period', 'PeriodUnit'], ['EVERY', 'Period', 'PeriodUnit'], ['EVERY', 'Period', 'TO', 'PeriodMax', 'PeriodUnit'], ['EVERY', 'Period', 'TO', 'PeriodMax', 'PeriodUnit'], ['EVERY', 'Period', 'TO', 'PeriodMax', 'PeriodUnit'], ['Method', 'Qty', 'TO', 'Qty', 'Form'], ['Method', 'Qty', 'TO', 'Qty', 'Form'], ['Method', 'Qty', 'Form', 'PO', 'BID', 'FOR', 'Duration', 'DurationUnit', 'AT', 'WHEN'], ['Method', 'Qty', 'Form', 'TID', 'PO'], ['Method', 'Qty', 'Form', 'PO', 'EVERY', 'Period', 'PeriodUnit'], ['Method', 'Qty', 'Form', 'PO', 'FOR', 'Duration', 'DurationUnit'], ['Method', 'Qty', 'Form', 'BY', 'PO', 'TID', 'FOR', 'Duration', 'DurationUnit'], ['Method', 'Qty', 'Form', 'AFTER', 'Period', 'PeriodUnit'], ['Qty', 'Form', 'EVERY', 'Period', 'PeriodUnit'], ['EVERY', 'Period', 'TO', 'PeriodMax', 'PeriodUnit'], ['Q46H'], ['Q4-6H'], ['Qty', 'PeriodUnit', 'BEFORE', 'WHEN'], ['BEFORE', 'Qty', 'M', 'AT', 'WHEN'], ['Qty', 'M', 'BEFORE', 'WHEN'], ['AND', 'Qty', 'Form', 'Frequency', 'Period', 'PeriodUnit'], ['Qty', 'Form', 'Frequency', 'Period', 'PeriodUnit'], ['Qty', 'Form', 'Frequency', 'Period', 'PeriodUnit'], ['Qty', 'Form', 'Frequency', 'Period', 'PeriodUnit'], ['Qty', 'Form', 'Frequency', 'FOR', 'Duration', 'DurationUnit', 'THEN', 'Qty', 'Form', 'Frequency', 'PeriodUnit', 'AT', 'WHEN'], ['Qty', 'Form', 'Duration', 'DurationUnit', 'TID'], ['Method', 'Qty', 'Form', 'FOR', 'Duration', 'DurationUnit', 'Qty', 'TIMES', 'Period', 'PeriodUnit'], ['QID', 'Q6H'], ['BID'], ['QID'],['Qty', 'Form', 'BEFORE', 'WHEN', 'AND', 'WHEN'], ['Qty', 'Form', 'BEFORE', 'WHEN', 'AND', 'WHEN'], ['Method', 'Qty', 'Form', 'AT', 'WHEN'], ['Qty', 'Form', 'Frequency', 'DAILY', 'FOR', 'Duration', 'DurationUnit'], ['Qty', 'Form', 'FOR', 'Duration', 'DurationUnit', 'Frequency', 'TIMES', 'Period', 
'PeriodUnit'], ['Method', 'Qty', 'Form', 'Period', 'PeriodUnit'], ['QID', 'FOR', 'Duration', 'DurationUnit'], ['EVERY', 'PeriodUnit'], ['Method', 'Qty', 'Form', 'AT', 'WHEN'], ['Method', 'Qty', 'Form', 'BEFORE', 'WHEN'], ['Method', 'Qty', 'Form', 'DAILY'], ['Method', 'Qty', 'Form', 'Frequency', 'Period', 'PeriodUnit'], ['Method', 'Qty', 'Form', 'Frequency', 'Period', 'PeriodUnit'], ['Method', 'DAILY'], ['Method', 'Qty', 'Form', 'BEFORE', 'WHEN'], ['EVERY', 'Period', 'PeriodUnit'], ['BEFORE', 'FOOD'], ['AFTER', 'FOOD'], ['FOR', 'Duration', 'DurationUnit'], ['FOR', 'Duration', 'DurationUnit'], ['WITH', 'FOOD']]

In [59]:
len(sigs), len(input_sigs) , len(output_labels)

(56, 56, 56)

### Creating a Tuples Maker method
Write a function **tuples_maker(input_sigs, output_labels)** that returns **output** in the form shown below.

Input(s):
- input_sigs
- output_labels

Output:

[[('for', 'FOR'),
  ('5', 'Duration'),
  ('to', 'TO'),
  ('6', 'DurationMax'),
  ('days', 'DurationUnit')], [second sentence], ...]
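The expected output above pairs each token with its label, one list per sentence. A minimal sketch of that pairing is shown here under the hypothetical name `make_pairs`, chosen so that it does not clash with the notebook's own `tuples_maker`; the length check is an added assumption about how mismatched inputs should be handled.

```python
def make_pairs(token_lists, label_lists):
    """Pair every token with its label, sentence by sentence.

    Hypothetical helper: raises ValueError if a sentence has a different
    number of tokens and labels (an assumed policy, not part of the task).
    """
    paired = []
    for tokens, labels in zip(token_lists, label_lists):
        if len(tokens) != len(labels):
            raise ValueError(f"token/label count mismatch for {tokens!r}")
        paired.append(list(zip(tokens, labels)))
    return paired


demo_pairs = make_pairs(
    [['for', '5', 'to', '6', 'days']],
    [['FOR', 'Duration', 'TO', 'DurationMax', 'DurationUnit']],
)
print(demo_pairs[0])
```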

In [60]:

def tuples_maker(inp, out):
    """Create training documents as lists of (token, postag, label) tuples, one list per sentence.
    inp: list of token lists
    out: list of label lists (same length as inp)
    returns: list of documents (each doc is a list of tuples)
    """
    # For this task POS tags are not critical; simple heuristics stand in for a real tagger
    noun_words = {'tabs', 'tabs.', 'tab', 'tablet', 'tablets', 'capsule', 'capsules',
                  'caps', 'cap', 'inject', 'units', 'unit'}
    adverb_words = {'every', 'once', 'twice', 'thrice', 'x', 'for', 'to', 'with',
                    'after', 'before', 'orally', 'take'}
    sample_data = []
    for tokens, labels in zip(inp, out):
        doc = []
        for t, l in zip(tokens, labels):
            if t.isdigit():
                pos = 'CD'
            elif t.lower() in noun_words:
                pos = 'NN'
            elif t.lower() in adverb_words:
                pos = 'RB'
            else:
                # Any token not listed above defaults to a noun tag
                pos = 'NN'
            doc.append((t, pos, l))
        sample_data.append(doc)
    return sample_data


### Creating the triples_maker( ) for feature extraction
- input: tuples_maker_output
- output: 
[[('for', 'IN', 'FOR'),
  ('5', 'CD', 'Duration'),
  ('to', 'TO', 'TO'),
  ('6', 'CD', 'DurationMax'),
  ('days', 'NNS', 'DurationUnit')], [second sentence], ... ]
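The expected output above inserts a POS tag between each token and its label. One way to sketch this step is to accept the tagger as a parameter, so that any callable with the same shape as `nltk.pos_tag` (a list of tokens in, a list of `(token, tag)` pairs out) can be plugged in. The names `add_pos_tags` and `toy_tagger` are hypothetical, and `toy_tagger` is only a stand-in that keeps the example runnable without NLTK data.

```python
def toy_tagger(tokens):
    """Stand-in for a real POS tagger: digits get 'CD', everything else 'NN'."""
    return [(tok, 'CD' if tok.isdigit() else 'NN') for tok in tokens]


def add_pos_tags(pair_docs, tagger):
    """Turn (token, label) pairs into (token, postag, label) triples.

    tagger: callable mapping a token list to a list of (token, tag) pairs.
    """
    triple_docs = []
    for doc in pair_docs:
        tokens = [tok for tok, _label in doc]
        tags = [tag for _tok, tag in tagger(tokens)]
        triple_docs.append(
            [(tok, tag, label) for (tok, label), tag in zip(doc, tags)]
        )
    return triple_docs


demo_pair_docs = [[('for', 'FOR'), ('5', 'Duration'), ('days', 'DurationUnit')]]
demo_triples = add_pos_tags(demo_pair_docs, toy_tagger)
print(demo_triples[0])
```

Passing `nltk.pos_tag` as `tagger` should give Penn Treebank tags such as `'IN'` and `'NNS'`, as in the expected output, provided the NLTK tagger data has been downloaded.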

In [61]:

def triples_maker(whole_data):
    """Convert whole_data, a list of (sentence, tokens, labels) triples, to the sample_data format.
    If whole_data is None or empty, fall back to the notebook-level input_sigs/output_labels lists.
    """
    sample_data = []
    if whole_data and isinstance(whole_data, list):
        for triple in whole_data:
            # Expect triple = (raw_sentence, token_list, label_list)
            if len(triple) >= 3:
                tokens = triple[1]
                labels = triple[2]
                doc = [(t, 'NN', l) for t, l in zip(tokens, labels)]
                sample_data.append(doc)
    # Fallback: used only when nothing was converted above, so provided data is never overwritten
    if not sample_data:
        inp = globals().get('input_sigs')
        out = globals().get('output_labels')
        if inp and out:
            sample_data = tuples_maker(inp, out)
    return sample_data


In [62]:

# Build sample_data from the provided small labelled examples (or fallback)
try:
    sample_data = triples_maker(globals().get('whole_data', None))
except Exception:
    sample_data = []
# If still empty, create a small example dataset
if not sample_data:
    input_sigs = [
        ['take','2','tabs','once','a','day'],
        ['take','1','tab','twice','daily'],
        ['inject','2','units'],
        ['x','2','weeks'],
        ['for','5','to','6','days'],
        ['take','2','tabs','with','food'],
        ['10','tabs','x','2','days']
    ]
    # One label per token, following the label scheme of the given data
    output_labels = [
        ['Method','Qty','Form','Frequency','Period','PeriodUnit'],
        ['Method','Qty','Form','Frequency','DAILY'],
        ['Method','Qty','Form'],
        ['FOR','Duration','DurationUnit'],
        ['FOR','Duration','TO','DurationMax','DurationUnit'],
        ['Method','Qty','Form','WITH','FOOD'],
        ['Qty','Form','FOR','Duration','DurationUnit']
    ]
    sample_data = tuples_maker(input_sigs, output_labels)

sample_data[:5]


[[('for', 'RB', 'FOR'),
  ('5', 'CD', 'Duration'),
  ('to', 'RB', 'TO'),
  ('6', 'CD', 'DurationMax'),
  ('days', 'NN', 'DurationUnit')],
 [('inject', 'NN', 'Method'), ('2', 'CD', 'Qty'), ('units', 'NN', 'Form')],
 [('x', 'RB', 'FOR'),
  ('2', 'CD', 'Duration'),
  ('weeks', 'NN', 'DurationUnit')],
 [('x', 'RB', 'FOR'), ('3', 'CD', 'Duration'), ('days', 'NN', 'DurationUnit')],
 [('every', 'RB', 'EVERY'), ('day', 'NN', 'Period')]]

### Creating the features extractor method (GIVEN as a BASELINE)
#### The features used are:
- BOS, EOS, lowercased word, 2- and 3-character suffixes, uppercase, title case, digit, POS tag, and the lowercased word and POS tag of the previous and next tokens
#### Feel free to include more features
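As one possible direction for extra features, the sketch below adds a compressed word shape, a number-word flag, and hyphen and digit flags, which may help with tokens such as `4-6`, `q4-6h`, or `three`. The names `word_shape`, `extra_token_features`, and `NUMBER_WORDS` are hypothetical, and the number-word list is deliberately small and illustrative. Whether these features improve the model on this data is untested.

```python
# Illustrative, incomplete list of spelled-out numbers (an assumption)
NUMBER_WORDS = {'one', 'two', 'three', 'four', 'five', 'six',
                'seven', 'eight', 'nine', 'ten', 'twenty'}


def word_shape(word):
    """Map characters to d (digit), x (lower), X (upper) and collapse repeats."""
    shape = []
    for ch in word:
        if ch.isdigit():
            code = 'd'
        elif ch.isalpha():
            code = 'X' if ch.isupper() else 'x'
        else:
            code = ch  # keep punctuation such as '-' as-is
        if not shape or shape[-1] != code:
            shape.append(code)
    return ''.join(shape)


def extra_token_features(word):
    """Return extra string features in the same 'name=value' style as the baseline."""
    lower = word.lower()
    return [
        'word.shape=' + word_shape(word),
        'word.is_number_word=%s' % (lower in NUMBER_WORDS),
        'word.has_hyphen=%s' % ('-' in word),
        'word.has_digit=%s' % any(ch.isdigit() for ch in word),
        'word[:2]=' + lower[:2],
    ]


print(extra_token_features('q4-6h'))
```

These strings could be appended to the list built by the baseline extractor, for example with `features.extend(extra_token_features(word))`.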

In [63]:

def token_to_features(doc, i):
    word = doc[i][0]
    postag = doc[i][1]

    # Common features for all words
    features = [
        'bias',
        'word.lower=' + word.lower(),
        'word[-3:]=' + word[-3:],
        'word[-2:]=' + word[-2:],
        'word.isupper=%s' % word.isupper(),
        'word.istitle=%s' % word.istitle(),
        'word.isdigit=%s' % word.isdigit(),
        'postag=' + postag
    ]

    # Features from previous word
    if i > 0:
        prev_word = doc[i-1][0]
        prev_postag = doc[i-1][1]
        features.extend([
            '-1:word.lower=' + prev_word.lower(),
            '-1:postag=' + prev_postag
        ])
    else:
        features.append('BOS')

    # Features from next word
    if i < len(doc)-1:
        next_word = doc[i+1][0]
        next_postag = doc[i+1][1]
        features.extend([
            '+1:word.lower=' + next_word.lower(),
            '+1:postag=' + next_postag
        ])
    else:
        features.append('EOS')

    return features


### Running the feature extractor on the training data 
- Feature extraction
- Train-test-split
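The two steps listed above can be sketched as follows. This is a dependency-free illustration: `basic_features` is a tiny stand-in for the notebook's `token_to_features`, and `split_xy` is a hypothetical stand-in for scikit-learn's `train_test_split`, which the notebook itself calls with `test_size=0.2` and `random_state=42`. The shuffle order of the stand-in will not match scikit-learn's.

```python
import random


def basic_features(doc, i):
    """Tiny stand-in feature extractor (the notebook uses token_to_features)."""
    word, postag = doc[i][0], doc[i][1]
    return ['bias', 'word.lower=' + word.lower(), 'postag=' + postag]


def docs_to_xy(docs, feature_fn):
    """Build per-sentence feature sequences X and label sequences y."""
    X = [[feature_fn(doc, i) for i in range(len(doc))] for doc in docs]
    y = [[label for _word, _pos, label in doc] for doc in docs]
    return X, y


def split_xy(X, y, test_size=0.2, seed=42):
    """Shuffle sentence indices and hold out a fraction of sentences for testing."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(len(X) * test_size))
    test_idx = set(indices[:n_test])
    X_train = [x for i, x in enumerate(X) if i not in test_idx]
    X_test = [x for i, x in enumerate(X) if i in test_idx]
    y_train = [lab for i, lab in enumerate(y) if i not in test_idx]
    y_test = [lab for i, lab in enumerate(y) if i in test_idx]
    return X_train, X_test, y_train, y_test


demo_docs = [
    [('x', 'RB', 'FOR'), ('2', 'CD', 'Duration'), ('weeks', 'NN', 'DurationUnit')],
    [('inject', 'NN', 'Method'), ('2', 'CD', 'Qty'), ('units', 'NN', 'Form')],
    [('every', 'RB', 'EVERY'), ('day', 'NN', 'Period')],
    [('before', 'RB', 'BEFORE'), ('food', 'NN', 'FOOD')],
    [('after', 'RB', 'AFTER'), ('food', 'NN', 'FOOD')],
]
demo_X, demo_y = docs_to_xy(demo_docs, basic_features)
demo_split = split_xy(demo_X, demo_y)
print(len(demo_split[0]), len(demo_split[1]))
```

The split is done per sentence rather than per token, so that every sequence stays intact for the CRF.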

### Training the CRF model with the features extracted using the feature extractor method

In [111]:

# Import the CRF library
import sklearn_crfsuite
from sklearn_crfsuite import metrics
import joblib

# Initialize the CRF model with appropriate parameters
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',          # Optimization algorithm
    c1=0.1,                     # L1 regularization
    c2=0.1,                     # L2 regularization
    max_iterations=100,         # Maximum number of iterations
    all_possible_transitions=True
)

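# Fit the model on the training split
# (X_train and y_train are created by the train_test_split call in cell In [113])
crf.fit(X_train, y_train)

# Save the fitted model with joblib (the file is a pickle, despite the .crfsuite extension)
crf_filename = "prescription_parser_model.crfsuite"
joblib.dump(crf, crf_filename)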
print(f"✅ Model trained and saved to: {crf_filename}")




✅ Model trained and saved to: prescription_parser_model.crfsuite
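The "Load the CRF model" step from the outline is the mirror image of saving. The sketch below shows the round trip with the standard-library `pickle` module and a plain dictionary as a stand-in model, so that it runs without the trained `crf` object; `save_model` and `load_model` are hypothetical helper names.

```python
import os
import pickle
import tempfile


def save_model(model, path):
    """Serialize a model object to disk."""
    with open(path, 'wb') as fh:
        pickle.dump(model, fh)


def load_model(path):
    """Restore a model object previously written by save_model."""
    with open(path, 'rb') as fh:
        return pickle.load(fh)


# Stand-in for a trained model; the real object would be the fitted CRF
stand_in_model = {'algorithm': 'lbfgs', 'c1': 0.1, 'c2': 0.1}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'model.pkl')
    save_model(stand_in_model, model_path)
    restored_model = load_model(model_path)

print(restored_model == stand_in_model)
```

For the model saved above with `joblib.dump`, the corresponding call would be `crf = joblib.load(crf_filename)`.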


In [112]:
# Evaluate the CRF model on the test set
y_pred = crf.predict(X_test)

# Print classification report
print("Model Evaluation Report:")
print(metrics.flat_classification_report(y_test, y_pred))


Model Evaluation Report:
              precision    recall  f1-score   support

         AND       0.00      0.00      0.00         1
          AT       0.50      1.00      0.67         1
      BEFORE       1.00      1.00      1.00         2
         BID       0.00      0.00      0.00         2
       DAILY       0.00      0.00      0.00         0
    Duration       1.00      1.00      1.00         4
 DurationMax       0.00      0.00      0.00         1
DurationUnit       1.00      0.75      0.86         4
       EVERY       1.00      1.00      1.00         3
         FOR       1.00      1.00      1.00         4
        Form       1.00      1.00      1.00         5
   Frequency       0.50      1.00      0.67         1
      Method       1.00      1.00      1.00         3
          PO       0.00      0.00      0.00         2
      Period       0.80      1.00      0.89         4
   PeriodMax       0.50      1.00      0.67         1
  PeriodUnit       0.80      1.00      0.89         4
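The report above gives per-label scores; the outline also asks for an overall accuracy. A minimal sketch of two common summaries is given below: token-level accuracy (fraction of tokens labelled correctly) and sequence-level accuracy (fraction of sentences labelled entirely correctly). The function names and the example sequences are hypothetical; `sklearn_crfsuite.metrics` also offers `flat_accuracy_score` for the token-level figure.

```python
def token_accuracy(y_true, y_pred):
    """Fraction of individual tokens whose predicted label matches the true label."""
    correct = 0
    total = 0
    for true_seq, pred_seq in zip(y_true, y_pred):
        for true_label, pred_label in zip(true_seq, pred_seq):
            total += 1
            if true_label == pred_label:
                correct += 1
    return correct / total if total else 0.0


def sentence_accuracy(y_true, y_pred):
    """Fraction of sentences in which every token is labelled correctly."""
    if not y_true:
        return 0.0
    exact = sum(list(t) == list(p) for t, p in zip(y_true, y_pred))
    return exact / len(y_true)


# Made-up example: one fully correct sentence, one with a single wrong label
demo_true = [['FOR', 'Duration', 'DurationUnit'], ['EVERY', 'Period']]
demo_pred = [['FOR', 'Duration', 'DurationUnit'], ['EVERY', 'PeriodUnit']]

print(token_accuracy(demo_true, demo_pred))     # 4 of 5 tokens correct
print(sentence_accuracy(demo_true, demo_pred))  # 1 of 2 sentences correct
```

With the notebook's variables, the calls would be `token_accuracy(y_test, y_pred)` and `sentence_accuracy(y_test, y_pred)`.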
  



### Predicting the test data with the built model

### Putting all the prediction logic inside a predict method

In [65]:

def predict(sig):
    """Label the given prescription signature string using the trained CRF model if available;
    otherwise use a simple rule-based labeler."""
    tokens = sig.strip().split()
    # Assign POS tags with the same heuristics as tuples_maker, so that the
    # features seen at prediction time match those seen during training
    noun_words = {'tabs', 'tabs.', 'tab', 'tablet', 'tablets', 'capsule', 'capsules',
                  'caps', 'cap', 'inject', 'units', 'unit'}
    adverb_words = {'every', 'once', 'twice', 'thrice', 'x', 'for', 'to', 'with',
                    'after', 'before', 'orally', 'take'}
    doc = []
    for t in tokens:
        if t.isdigit():
            pos = 'CD'
        elif t.lower() in noun_words:
            pos = 'NN'
        elif t.lower() in adverb_words:
            pos = 'RB'
        else:
            pos = 'NN'
        doc.append((t, pos, 'O'))  # 'O' is a placeholder label; it is not used as a feature
    # If CRF model exists and is trained, use it
    model = globals().get('crf')
    if model is not None:
        X = [token_to_features(doc, i) for i in range(len(doc))]
        preds = model.predict_single(X)
        return list(zip(tokens, preds))
    # Otherwise apply simple rule-based labelling, using the label names of the training data
    word_labels = {'x': 'FOR', 'for': 'FOR',
                   'every': 'EVERY', 'weekly': 'EVERY', 'monthly': 'EVERY',
                   'once': 'Frequency', 'twice': 'Frequency', 'thrice': 'Frequency',
                   'per': 'Frequency', 'daily': 'DAILY',
                   'with': 'WITH', 'after': 'AFTER', 'before': 'BEFORE', 'at': 'AT',
                   'food': 'FOOD', 'meals': 'FOOD', 'bed': 'WHEN', 'bedtime': 'WHEN'}
    preds = []
    for w, _pos, _label in doc:
        wl = w.lower().rstrip('.,')
        # Ranges such as '4-6' are checked first; a plain-digit test would otherwise shadow them
        if '-' in wl and all(part.isdigit() for part in wl.split('-', 1)):
            preds.append('DurationRange')
        elif wl.isdigit():
            preds.append('Qty')
        elif wl in ('tabs', 'tab', 'capsules', 'capsule', 'tablet', 'tablets', 'cap', 'caps'):
            preds.append('Form')
        elif wl in word_labels:
            preds.append(word_labels[wl])
        elif wl in ('days', 'day', 'weeks', 'week', 'months', 'month', 'hours', 'hour'):
            preds.append('DurationUnit')
        else:
            preds.append('O')
    return list(zip(tokens, preds))


### Sample predictions

In [66]:
predictions = predict("take 2 tabs every 6 hours x 10 days")

In [67]:
predictions = predict("2 capsu for 10 day at bed")

In [68]:
predictions = predict("2 capsu for 10 days at bed")

In [69]:
predictions = predict("5 days 2 tabs at bed")

In [70]:
predictions = predict("3 tabs qid x 10 weeks")

In [71]:
predictions = predict("x 30 days")

In [72]:
predictions = predict("x 20 months")

In [73]:
predictions = predict("take 2 tabs po tid for 10 days")

In [74]:
predictions = predict("take 2 capsules po every 6 hours")

In [75]:
predictions = predict("inject 2 units pu tid")

In [76]:
predictions = predict("swallow 3 caps tid by mouth")

In [77]:
predictions = predict("inject 3 units orally")

In [78]:
predictions = predict("orally take 3 tabs tid")

In [79]:
predictions = predict("by mouth take three caps")

In [80]:
predictions = predict("take 3 tabs orally three times a day for 10 days at bedtime")

In [81]:
predictions = predict("take 3 tabs orally bid for 10 days at bedtime")

In [82]:
predictions = predict("take 3 tabs bid orally at bed")

In [83]:
predictions = predict("take 10 capsules by mouth qid")

In [84]:
predictions = predict("inject 10 units orally qid x 3 months")

In [85]:
prediction = predict("please take 2 tablets per day for a month in the morning and evening each day")

In [86]:
prediction = predict("Amoxcicillin QID 30 tablets")

In [87]:
prediction = predict("take 3 tabs TID for 90 days with food")

In [88]:
prediction = predict("with food take 3 tablets per day for 90 days")

In [89]:
prediction = predict("with food take 3 tablets per week for 90 weeks")

In [90]:
prediction = predict("take 2-4 tabs")

In [91]:
prediction = predict("take 2 to 4 tabs")

In [92]:
prediction = predict("take two to four tabs")

In [93]:
prediction = predict("take 2-4 tabs for 8 to 9 days")

In [94]:
prediction = predict("take 20 tabs every 6 to 8 days")

In [95]:
prediction = predict("take 2 tabs every 4 to 6 days")

In [96]:
prediction = predict("take 2 tabs every 2 to 10 weeks")

In [98]:
prediction = predict("take 2 tabs every 2 to 10 months")

In [99]:
prediction = predict("every 60 mins")

In [100]:
prediction = predict("every 10 mins")

In [101]:
prediction = predict("every two to four months")

In [102]:
prediction = predict("take 2 tabs every 3 to 4 days")

In [103]:
prediction = predict("every 3 to 4 days take 20 tabs")

In [104]:
prediction = predict("once in every 3 days take 3 tabs")

In [105]:
prediction = predict("take 3 tabs once in every 3 days")

In [106]:
prediction = predict("orally take 20 tabs every 4-6 weeks")

In [107]:
prediction = predict("10 tabs x 2 days")

In [108]:
prediction = predict("3 capsule x 15 days")

In [109]:
prediction = predict("10 tabs")
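The cells above store each result in a variable without displaying it. A small formatter such as the sketch below can make the `(token, label)` pairs easier to read; `format_predictions` is a hypothetical helper, and the example pairs are written by hand rather than taken from the model.

```python
def format_predictions(pairs):
    """Render (token, label) pairs as two left-aligned columns, one pair per line."""
    if not pairs:
        return ''
    width = max(len(token) for token, _label in pairs)
    return '\n'.join(f"{token.ljust(width)}  {label}" for token, label in pairs)


# Hand-written example pairs in the shape returned by predict()
demo_prediction = [('10', 'Qty'), ('tabs', 'Form')]
print(format_predictions(demo_prediction))
```

In the notebook this could be used as `print(format_predictions(predict("10 tabs")))`.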

In [113]:

# Attempt to train a CRF model if the library is available; otherwise fallback to a rule-based predictor.
try:
    import sklearn_crfsuite
    from sklearn_crfsuite import metrics
    from sklearn.model_selection import train_test_split

    def get_features(doc):
        return [token_to_features(doc, i) for i in range(len(doc))]

    def get_labels(doc):
        return [label for (_word, _pos, label) in doc]

    X = [get_features(doc) for doc in sample_data]
    y = [get_labels(doc) for doc in sample_data]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    crf = sklearn_crfsuite.CRF(
        algorithm='lbfgs',
        c1=0.1,
        c2=0.1,
        max_iterations=100,
        all_possible_transitions=True
    )
    crf.fit(X_train, y_train)

    y_pred = crf.predict(X_test)
    labels = list(crf.classes_)
    print('Trained CRF with labels:', labels)
    print('\nClassification report:')
    print(metrics.flat_classification_report(y_test, y_pred, labels=labels, digits=3))

except Exception as e:
    print('CRF training unavailable (environment constraint). Falling back to rule-based predictions. Error:', e)
    crf = None

# Example predictions using the available predictor (CRF if trained, otherwise rule-based predict function)
examples = [
    "take 3 tabs once in every 3 days",
    "orally take 20 tabs every 4-6 weeks",
    "10 tabs x 2 days",
    "3 capsule x 15 days",
    "10 tabs"
]
for ex in examples:
    print(ex, '->', predict(ex))


Trained CRF with labels: ['QID', 'Qty', 'Form', 'Duration', 'DurationUnit', 'TID', 'EVERY', 'Period', 'TO', 'PeriodMax', 'PeriodUnit', 'Method', 'AFTER', 'FOR', 'Frequency', 'FOOD', 'PO', 'BY', 'M', 'BEFORE', 'WHEN', 'TIMES', 'DAILY', 'AND', 'Q6H', 'AT', 'THEN', 'Q4-6H', 'WITH', 'Q46H']

Classification report:
              precision    recall  f1-score   support

         QID      0.000     0.000     0.000         0
         Qty      1.000     1.000     1.000         5
        Form      1.000     1.000     1.000         5
    Duration      1.000     1.000     1.000         4
DurationUnit      1.000     0.750     0.857         4
         TID      1.000     1.000     1.000         1
       EVERY      1.000     1.000     1.000         3
      Period      0.800     1.000     0.889         4
          TO      1.000     1.000     1.000         2
   PeriodMax      0.500     1.000     0.667         1
  PeriodUnit      0.800     1.000     0.889         4
      Method      1.000     1.000     1

