<a href="https://colab.research.google.com/github/EmiljaB/NLP_Projects/blob/Prescription_Parser/Prescription_Parser.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task 1: Create a Prescription Parser using CRF
### Emilja Beneja
This task tests your ability to build a doctor's prescription parser using a Conditional Random Field (CRF) model.

Your job is to build a prescription parser that takes a prescription (a sentence) as input and labels each word in that sentence with one of the pre-defined labels.

### Problem: SEQUENCE PREDICTION - Label words in a sentence
#### Input : Doctor Prescription in the form of a sentence split into tokens
- Ex: Take 2 tablets once a day for 10 days

#### Output : FHIR Labels
- ('Take', 'Method')
- ('2', 'Qty')
- ('tablets', 'Form')
- ('once', 'Frequency')
- ('a', 'Period')
- ('day', 'PeriodUnit')
- ('for', 'FOR')
- ('10', 'Duration')
- ('days', 'DurationUnit')

### Major Steps
- Install necessary library
- Import the libraries
- Create training data with labels
    - Split the sentence into tokens
    - Compute POS tags
    - Create triples
- Extract features
- Split the data into training and testing set
- Create CRF model
- Save the CRF model
- Load the CRF model
- Predict on test data
- Accuracy

#### Install necessary library

In [None]:
!pip install sklearn_crfsuite

Collecting sklearn_crfsuite
  Downloading sklearn_crfsuite-0.5.0-py2.py3-none-any.whl.metadata (4.9 kB)
Collecting python-crfsuite>=0.9.7 (from sklearn_crfsuite)
  Downloading python_crfsuite-0.9.11-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.3 kB)
Downloading sklearn_crfsuite-0.5.0-py2.py3-none-any.whl (10 kB)
Downloading python_crfsuite-0.9.11-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Installing collected packages: python-crfsuite, sklearn_crfsuite
Successfully installed python-crfsuite-0.9.11 sklearn_crfsuite-0.5.0


### Import libraries

In [None]:
import pandas as pd
from sklearn_crfsuite import CRF, metrics
from sklearn.model_selection import train_test_split
import joblib

### Input data (GIVEN)
#### Creating the inputs to the ML model in the following form:
- sigs --> ['take 3 tabs for 10 days']       INPUT SIG
- input_sigs --> [['take', '3', 'tabs', 'for', '10', 'days']]      TOKENS
- output_labels --> [['Method','Qty', 'Form', 'FOR', 'Duration', 'DurationUnit']]       LABELS

In [None]:
sigs = ["for 5 to 6 days", "inject 2 units", "x 2 weeks", "x 3 days", "every day", "every 2 weeks", "every 3 days", "every 1 to 2 months", "every 2 to 6 weeks", "every 4 to 6 days", "take two to four tabs", "take 2 to 4 tabs", "take 3 tabs orally bid for 10 days at bedtime", "swallow three capsules tid orally", "take 2 capsules po every 6 hours", "take 2 tabs po for 10 days", "take 100 caps by mouth tid for 10 weeks", "take 2 tabs after an hour", "2 tabs every 4-6 hours", "every 4 to 6 hours", "q46h", "q4-6h", "2 hours before breakfast", "before 30 mins at bedtime", "30 mins before bed", "and 100 tabs twice a month", "100 tabs twice a month", "100 tabs once a month", "100 tabs thrice a month", "3 tabs daily for 3 days then 1 tab per day at bed", "30 tabs 10 days tid", "take 30 tabs for 10 days three times a day", "qid q6h", "bid", "qid", "30 tabs before dinner and bedtime", "30 tabs before dinner & bedtime", "take 3 tabs at bedtime", "30 tabs thrice daily for 10 days ", "30 tabs for 10 days three times a day", "Take 2 tablets a day", "qid for 10 days", "every day", "take 2 caps at bedtime", "apply 3 drops before bedtime", "take three capsules daily", "swallow 3 pills once a day", "swallow three pills thrice a day", "apply daily", "apply three drops before bedtime", "every 6 hours", "before food", "after food", "for 20 days", "for twenty days", "with meals"]
input_sigs = [['for', '5', 'to', '6', 'days'], ['inject', '2', 'units'], ['x', '2', 'weeks'], ['x', '3', 'days'], ['every', 'day'], ['every', '2', 'weeks'], ['every', '3', 'days'], ['every', '1', 'to', '2', 'months'], ['every', '2', 'to', '6', 'weeks'], ['every', '4', 'to', '6', 'days'], ['take', 'two', 'to', 'four', 'tabs'], ['take', '2', 'to', '4', 'tabs'], ['take', '3', 'tabs', 'orally', 'bid', 'for', '10', 'days', 'at', 'bedtime'], ['swallow', 'three', 'capsules', 'tid', 'orally'], ['take', '2', 'capsules', 'po', 'every', '6', 'hours'], ['take', '2', 'tabs', 'po', 'for', '10', 'days'], ['take', '100', 'caps', 'by', 'mouth', 'tid', 'for', '10', 'weeks'], ['take', '2', 'tabs', 'after', 'an', 'hour'], ['2', 'tabs', 'every', '4-6', 'hours'], ['every', '4', 'to', '6', 'hours'], ['q46h'], ['q4-6h'], ['2', 'hours', 'before', 'breakfast'], ['before', '30', 'mins', 'at', 'bedtime'], ['30', 'mins', 'before', 'bed'], ['and', '100', 'tabs', 'twice', 'a', 'month'], ['100', 'tabs', 'twice', 'a', 'month'], ['100', 'tabs', 'once', 'a', 'month'], ['100', 'tabs', 'thrice', 'a', 'month'], ['3', 'tabs', 'daily', 'for', '3', 'days', 'then', '1', 'tab', 'per', 'day', 'at', 'bed'], ['30', 'tabs', '10', 'days', 'tid'], ['take', '30', 'tabs', 'for', '10', 'days', 'three', 'times', 'a', 'day'], ['qid', 'q6h'], ['bid'], ['qid'], ['30', 'tabs', 'before', 'dinner', 'and', 'bedtime'], ['30', 'tabs', 'before', 'dinner', '&', 'bedtime'], ['take', '3', 'tabs', 'at', 'bedtime'], ['30', 'tabs', 'thrice', 'daily', 'for', '10', 'days'], ['30', 'tabs', 'for', '10', 'days', 'three', 'times', 'a', 'day'], ['take', '2', 'tablets', 'a', 'day'], ['qid', 'for', '10', 'days'], ['every', 'day'], ['take', '2', 'caps', 'at', 'bedtime'], ['apply', '3', 'drops', 'before', 'bedtime'], ['take', 'three', 'capsules', 'daily'], ['swallow', '3', 'pills', 'once', 'a', 'day'], ['swallow', 'three', 'pills', 'thrice', 'a', 'day'], ['apply', 'daily'], ['apply', 'three', 'drops', 'before', 'bedtime'], ['every', '6', 
'hours'], ['before', 'food'], ['after', 'food'], ['for', '20', 'days'], ['for', 'twenty', 'days'], ['with', 'meals']]
output_labels = [['FOR', 'Duration', 'TO', 'DurationMax', 'DurationUnit'], ['Method', 'Qty', 'Form'], ['FOR', 'Duration', 'DurationUnit'], ['FOR', 'Duration', 'DurationUnit'], ['EVERY', 'Period'], ['EVERY', 'Period', 'PeriodUnit'], ['EVERY', 'Period', 'PeriodUnit'], ['EVERY', 'Period', 'TO', 'PeriodMax', 'PeriodUnit'], ['EVERY', 'Period', 'TO', 'PeriodMax', 'PeriodUnit'], ['EVERY', 'Period', 'TO', 'PeriodMax', 'PeriodUnit'], ['Method', 'Qty', 'TO', 'Qty', 'Form'], ['Method', 'Qty', 'TO', 'Qty', 'Form'], ['Method', 'Qty', 'Form', 'PO', 'BID', 'FOR', 'Duration', 'DurationUnit', 'AT', 'WHEN'], ['Method', 'Qty', 'Form', 'TID', 'PO'], ['Method', 'Qty', 'Form', 'PO', 'EVERY', 'Period', 'PeriodUnit'], ['Method', 'Qty', 'Form', 'PO', 'FOR', 'Duration', 'DurationUnit'], ['Method', 'Qty', 'Form', 'BY', 'PO', 'TID', 'FOR', 'Duration', 'DurationUnit'], ['Method', 'Qty', 'Form', 'AFTER', 'Period', 'PeriodUnit'], ['Qty', 'Form', 'EVERY', 'Period', 'PeriodUnit'], ['EVERY', 'Period', 'TO', 'PeriodMax', 'PeriodUnit'], ['Q46H'], ['Q4-6H'], ['Qty', 'PeriodUnit', 'BEFORE', 'WHEN'], ['BEFORE', 'Qty', 'M', 'AT', 'WHEN'], ['Qty', 'M', 'BEFORE', 'WHEN'], ['AND', 'Qty', 'Form', 'Frequency', 'Period', 'PeriodUnit'], ['Qty', 'Form', 'Frequency', 'Period', 'PeriodUnit'], ['Qty', 'Form', 'Frequency', 'Period', 'PeriodUnit'], ['Qty', 'Form', 'Frequency', 'Period', 'PeriodUnit'], ['Qty', 'Form', 'Frequency', 'FOR', 'Duration', 'DurationUnit', 'THEN', 'Qty', 'Form', 'Frequency', 'PeriodUnit', 'AT', 'WHEN'], ['Qty', 'Form', 'Duration', 'DurationUnit', 'TID'], ['Method', 'Qty', 'Form', 'FOR', 'Duration', 'DurationUnit', 'Qty', 'TIMES', 'Period', 'PeriodUnit'], ['QID', 'Q6H'], ['BID'], ['QID'],['Qty', 'Form', 'BEFORE', 'WHEN', 'AND', 'WHEN'], ['Qty', 'Form', 'BEFORE', 'WHEN', 'AND', 'WHEN'], ['Method', 'Qty', 'Form', 'AT', 'WHEN'], ['Qty', 'Form', 'Frequency', 'DAILY', 'FOR', 'Duration', 'DurationUnit'], ['Qty', 'Form', 'FOR', 'Duration', 'DurationUnit', 'Frequency', 'TIMES', 'Period', 
'PeriodUnit'], ['Method', 'Qty', 'Form', 'Period', 'PeriodUnit'], ['QID', 'FOR', 'Duration', 'DurationUnit'], ['EVERY', 'PeriodUnit'], ['Method', 'Qty', 'Form', 'AT', 'WHEN'], ['Method', 'Qty', 'Form', 'BEFORE', 'WHEN'], ['Method', 'Qty', 'Form', 'DAILY'], ['Method', 'Qty', 'Form', 'Frequency', 'Period', 'PeriodUnit'], ['Method', 'Qty', 'Form', 'Frequency', 'Period', 'PeriodUnit'], ['Method', 'DAILY'], ['Method', 'Qty', 'Form', 'BEFORE', 'WHEN'], ['EVERY', 'Period', 'PeriodUnit'], ['BEFORE', 'FOOD'], ['AFTER', 'FOOD'], ['FOR', 'Duration', 'DurationUnit'], ['FOR', 'Duration', 'DurationUnit'], ['WITH', 'FOOD']]

In [None]:
len(sigs), len(input_sigs) , len(output_labels)

(56, 56, 56)
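The length check above only confirms that the three outer lists have the same number of entries. A per-sentence check is also useful, because `zip` silently drops extra items when a token list and its label list differ in length. The sketch below is one way to do that; `find_misaligned` is a hypothetical helper name, and the third sample pair is deliberately broken for illustration.

```python
def find_misaligned(token_lists, label_lists):
    """Return indices of sentences whose token and label counts differ."""
    return [
        idx
        for idx, (tokens, labels) in enumerate(zip(token_lists, label_lists))
        if len(tokens) != len(labels)
    ]

# Two sigs copied from the data above, plus one deliberately broken pair.
sample_tokens = [
    ['for', '5', 'to', '6', 'days'],
    ['inject', '2', 'units'],
    ['every', 'day'],
]
sample_labels = [
    ['FOR', 'Duration', 'TO', 'DurationMax', 'DurationUnit'],
    ['Method', 'Qty', 'Form'],
    ['EVERY'],  # one label missing on purpose
]

print(find_misaligned(sample_tokens, sample_labels))
```

Running the same helper as `find_misaligned(input_sigs, output_labels)` should return an empty list if every sig is fully labelled.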

### Creating a Tuples Maker method
Write a function **tuples_maker(input_sigs, output_labels)** that pairs each token with its label and returns **output** in the form shown below.

Input(s):
- input_sigs
- output_labels

Output:

[[('for', 'FOR'),
  ('5', 'Duration'),
  ('to', 'TO'),
  ('6', 'DurationMax'),
  ('days', 'DurationUnit')], [second sentence], ...]

In [None]:
def tuples_maker(input_sigs, output_labels):
    """
    Create tuples that pair each word with its corresponding label.

    Parameters:
    input_sigs (list of lists): Tokenized input sequences (e.g., [["for", "5", "to", "6", "days"], ...]).
    output_labels (list of lists): Label sequences (e.g., [["FOR", "Duration", "TO", "DurationMax", "DurationUnit"], ...]).

    Returns:
    list of lists: Each inner list contains tuples of (word, label) for each word in a sentence.
    """
    output = []

    for words, labels in zip(input_sigs, output_labels):
        sentence_tuples = list(zip(words, labels))
        output.append(sentence_tuples)

    return output


In [None]:
# Call the function using your existing data
output = tuples_maker(input_sigs, output_labels)

# Print each sentence's tuples in a readable format
for sentence in output:
    print(sentence)


[('for', 'FOR'), ('5', 'Duration'), ('to', 'TO'), ('6', 'DurationMax'), ('days', 'DurationUnit')]
[('inject', 'Method'), ('2', 'Qty'), ('units', 'Form')]
[('x', 'FOR'), ('2', 'Duration'), ('weeks', 'DurationUnit')]
[('x', 'FOR'), ('3', 'Duration'), ('days', 'DurationUnit')]
[('every', 'EVERY'), ('day', 'Period')]
[('every', 'EVERY'), ('2', 'Period'), ('weeks', 'PeriodUnit')]
[('every', 'EVERY'), ('3', 'Period'), ('days', 'PeriodUnit')]
[('every', 'EVERY'), ('1', 'Period'), ('to', 'TO'), ('2', 'PeriodMax'), ('months', 'PeriodUnit')]
[('every', 'EVERY'), ('2', 'Period'), ('to', 'TO'), ('6', 'PeriodMax'), ('weeks', 'PeriodUnit')]
[('every', 'EVERY'), ('4', 'Period'), ('to', 'TO'), ('6', 'PeriodMax'), ('days', 'PeriodUnit')]
[('take', 'Method'), ('two', 'Qty'), ('to', 'TO'), ('four', 'Qty'), ('tabs', 'Form')]
[('take', 'Method'), ('2', 'Qty'), ('to', 'TO'), ('4', 'Qty'), ('tabs', 'Form')]
[('take', 'Method'), ('3', 'Qty'), ('tabs', 'Form'), ('orally', 'PO'), ('bid', 'BID'), ('for', 'FOR'),
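With only a few dozen sigs, some labels appear very rarely, which affects how well the model can learn them. A minimal sketch for counting label frequencies in the `(word, label)` tuples follows; `label_counts` is a hypothetical helper, and the three sentences are copied from the output above.

```python
from collections import Counter

def label_counts(tagged_sentences):
    """Count how often each label occurs across all (word, label) pairs."""
    return Counter(label for sentence in tagged_sentences for _, label in sentence)

sample = [
    [('for', 'FOR'), ('5', 'Duration'), ('to', 'TO'), ('6', 'DurationMax'), ('days', 'DurationUnit')],
    [('x', 'FOR'), ('2', 'Duration'), ('weeks', 'DurationUnit')],
    [('inject', 'Method'), ('2', 'Qty'), ('units', 'Form')],
]

counts = label_counts(sample)
for label, n in counts.most_common():
    print(label, n)
```

Calling `label_counts(output)` on the full data would show which labels have too few examples to be split reliably between training and test sets.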

### Creating the triples_maker( ) for feature extraction
- input: tuples_maker_output
- output:
[[('for', 'IN', 'FOR'),
  ('5', 'CD', 'Duration'),
  ('to', 'TO', 'TO'),
  ('6', 'CD', 'DurationMax'),
  ('days', 'NNS', 'DurationUnit')], [second sentence], ... ]

In [None]:
import nltk

# Ensure the POS tagger data is downloaded (only needs to be done once)
nltk.download('averaged_perceptron_tagger')

def triples_maker(whole_data):
    """
    Create triples (word, POS, label) for feature extraction.

    Parameters:
    whole_data (list of lists): Output from tuples_maker, containing (word, label) pairs.

    Returns:
    list of lists: Each inner list contains triples (word, POS, label) for each word in a sentence.
    """
    sample_data = []

    for sentence in whole_data:
        words, labels = zip(*sentence)
        pos_tags = nltk.pos_tag(words)

        # Combine word, POS, and label into a triple for each word
        sentence_triples = [(word, pos, label) for (word, pos), label in zip(pos_tags, labels)]
        sample_data.append(sentence_triples)

    return sample_data

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [None]:
# Example call
sample_data = triples_maker(output)
sample_data


[[('for', 'IN', 'FOR'),
  ('5', 'CD', 'Duration'),
  ('to', 'TO', 'TO'),
  ('6', 'CD', 'DurationMax'),
  ('days', 'NNS', 'DurationUnit')],
 [('inject', 'JJ', 'Method'), ('2', 'CD', 'Qty'), ('units', 'NNS', 'Form')],
 [('x', 'RB', 'FOR'),
  ('2', 'CD', 'Duration'),
  ('weeks', 'NNS', 'DurationUnit')],
 [('x', 'RB', 'FOR'),
  ('3', 'CD', 'Duration'),
  ('days', 'NNS', 'DurationUnit')],
 [('every', 'DT', 'EVERY'), ('day', 'NN', 'Period')],
 [('every', 'DT', 'EVERY'),
  ('2', 'CD', 'Period'),
  ('weeks', 'NNS', 'PeriodUnit')],
 [('every', 'DT', 'EVERY'),
  ('3', 'CD', 'Period'),
  ('days', 'NNS', 'PeriodUnit')],
 [('every', 'DT', 'EVERY'),
  ('1', 'CD', 'Period'),
  ('to', 'TO', 'TO'),
  ('2', 'CD', 'PeriodMax'),
  ('months', 'NNS', 'PeriodUnit')],
 [('every', 'DT', 'EVERY'),
  ('2', 'CD', 'Period'),
  ('to', 'TO', 'TO'),
  ('6', 'CD', 'PeriodMax'),
  ('weeks', 'NNS', 'PeriodUnit')],
 [('every', 'DT', 'EVERY'),
  ('4', 'CD', 'Period'),
  ('to', 'TO', 'TO'),
  ('6', 'CD', 'PeriodMax'),
  ('

### Creating the features extractor method (GIVEN as a BASELINE)
#### The features used are:
- BOS, EOS, lowercase, suffixes (last 2 and 3 characters), uppercase, title, digit, postag, and the same word and postag features for the previous and next tokens
#### Feel free to include more features

In [None]:
def token_to_features(doc, i):
    word = doc[i][0]
    postag = doc[i][1]

    # Common features for all words
    features = [
        'bias',
        'word.lower=' + word.lower(),
        'word[-3:]=' + word[-3:],
        'word[-2:]=' + word[-2:],
        'word.isupper=%s' % word.isupper(),
        'word.istitle=%s' % word.istitle(),
        'word.isdigit=%s' % word.isdigit(),
        'postag=' + postag
    ]

    # Features for words that are not at the beginning of a document
    if i > 0:
        word1 = doc[i-1][0]
        postag1 = doc[i-1][1]
        features.extend([
            '-1:word.lower=' + word1.lower(),
            '-1:word.istitle=%s' % word1.istitle(),
            '-1:word.isupper=%s' % word1.isupper(),
            '-1:word.isdigit=%s' % word1.isdigit(),
            '-1:postag=' + postag1
        ])
    else:
        # Indicate that it is the 'beginning of a document'
        features.append('BOS')

    # Features for words that are not at the end of a document
    if i < len(doc)-1:
        word1 = doc[i+1][0]
        postag1 = doc[i+1][1]
        features.extend([
            '+1:word.lower=' + word1.lower(),
            '+1:word.istitle=%s' % word1.istitle(),
            '+1:word.isupper=%s' % word1.isupper(),
            '+1:word.isdigit=%s' % word1.isdigit(),
            '+1:postag=' + postag1
        ])
    else:
        # Indicate that it is the 'end of a document'
        features.append('EOS')

    return features

In [None]:
# Example document
doc = [('for', 'IN'), ('5', 'CD'), ('days', 'NNS')]
# Extract features for each token in the document
features = [token_to_features(doc, i) for i in range(len(doc))]
features


[['bias',
  'word.lower=for',
  'word[-3:]=for',
  'word[-2:]=or',
  'word.isupper=False',
  'word.istitle=False',
  'word.isdigit=False',
  'postag=IN',
  'BOS',
  '+1:word.lower=5',
  '+1:word.istitle=False',
  '+1:word.isupper=False',
  '+1:word.isdigit=True',
  '+1:postag=CD'],
 ['bias',
  'word.lower=5',
  'word[-3:]=5',
  'word[-2:]=5',
  'word.isupper=False',
  'word.istitle=False',
  'word.isdigit=True',
  'postag=CD',
  '-1:word.lower=for',
  '-1:word.istitle=False',
  '-1:word.isupper=False',
  '-1:word.isdigit=False',
  '-1:postag=IN',
  '+1:word.lower=days',
  '+1:word.istitle=False',
  '+1:word.isupper=False',
  '+1:word.isdigit=False',
  '+1:postag=NNS'],
 ['bias',
  'word.lower=days',
  'word[-3:]=ays',
  'word[-2:]=ys',
  'word.isupper=False',
  'word.istitle=False',
  'word.isdigit=False',
  'postag=NNS',
  '-1:word.lower=5',
  '-1:word.istitle=False',
  '-1:word.isupper=False',
  '-1:word.isdigit=True',
  '-1:postag=CD',
  'EOS']]
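The baseline invites additional features. One gap is that `word.isdigit` is `False` for tokens such as `4-6`, `q4-6h`, or `twenty`, even though they carry numeric information. The sketch below shows a few extra per-token features that could be appended to the baseline list. All names here (`NUMBER_WORDS`, `word_shape`, `extra_token_features`) are hypothetical and not part of the given baseline, and the number-word set is deliberately small.

```python
import re

# Hypothetical, deliberately small set of spelled-out numbers.
NUMBER_WORDS = {
    "one", "two", "three", "four", "five", "six",
    "seven", "eight", "nine", "ten", "twenty",
}

def word_shape(word):
    """Map uppercase letters to 'X', lowercase letters to 'x', digits to 'd'."""
    shape = re.sub(r"[A-Z]", "X", word)
    shape = re.sub(r"[a-z]", "x", shape)
    return re.sub(r"\d", "d", shape)

def extra_token_features(word):
    """Return extra string features in the same style as the baseline."""
    lower = word.lower()
    return [
        'word.shape=' + word_shape(word),
        'word.is_number_word=%s' % (lower in NUMBER_WORDS),
        'word.has_hyphen=%s' % ('-' in word),
        'word.has_digit=%s' % any(ch.isdigit() for ch in word),
        'word[:2]=' + lower[:2],
    ]

print(extra_token_features('q4-6h'))
print(extra_token_features('twenty'))
```

To try them, the baseline could call `features.extend(extra_token_features(word))` before returning. Whether they help on this data would need to be checked by retraining.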

### Running the feature extractor on the training data
- Feature extraction
- Train-test-split

In [None]:
#Extract features and labels
def extract_features_and_labels(data):
    X = []  # Features
    y = []  # Labels

    for sentence in data:
        # Extract features for each word in the sentence
        X.append([token_to_features(sentence, i) for i in range(len(sentence))])
        # Collect labels for each word in the sentence
        y.append([label for _, _, label in sentence])

    return X, y

# Example call to extract features and labels
X, y = extract_features_and_labels(sample_data)

# Split into training and testing data (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Displaying a small sample of training data features and labels
print("Sample X_train:", X_train[:1])
print("Sample y_train:", y_train[:1])


Sample X_train: [[['bias', 'word.lower=qid', 'word[-3:]=qid', 'word[-2:]=id', 'word.isupper=False', 'word.istitle=False', 'word.isdigit=False', 'postag=NN', 'BOS', 'EOS']]]
Sample y_train: [['QID']]
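With 56 sigs, the 80/20 split above leaves only about a dozen sentences for testing, so scores can swing noticeably with the random seed. One alternative is k-fold cross-validation, where every sig is held out exactly once. The sketch below generates fold indices in plain Python; `k_fold_indices` is a hypothetical helper, not something the notebook defines.

```python
import random

def k_fold_indices(n_items, k=5, seed=42):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds, round-robin.
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        test_idx = folds[held_out]
        train_idx = [
            i for j, fold in enumerate(folds) if j != held_out for i in fold
        ]
        yield train_idx, test_idx

for fold_no, (train_idx, test_idx) in enumerate(k_fold_indices(56, k=5), start=1):
    print(f"fold {fold_no}: {len(train_idx)} train / {len(test_idx)} test")
```

For each fold, `[X[i] for i in train_idx]` and `[y[i] for i in train_idx]` would give the training portion, and averaging the per-fold scores gives a steadier estimate than a single split.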


### Training the CRF model with the features extracted using the feature extractor method

In [None]:
# Define the CRF model with specified parameters
crf = CRF(
    algorithm='lbfgs',
    c1=0.1,
    c2=0.01,
    max_iterations=1000,
    all_possible_transitions=True
)

# Submit training data to the trainer
crf.fit(X_train, y_train)

# Save the trained model to a file
model_filename = "trained_crf_model.joblib"
joblib.dump(crf, model_filename)
print(f"Model saved as {model_filename}")

# Predict on the test set and evaluate the model
y_pred = crf.predict(X_test)

# Print the classification report for evaluation
print("Classification Report:")
print(metrics.flat_classification_report(y_test, y_pred, digits=3))

# Summarise the fitted model: learned feature counts and training hyperparameters
print("\nModel summary")
print("Number of state features:", len(crf.state_features_))
print("Number of transition features:", len(crf.transition_features_))
print("Optimization algorithm: L-BFGS")
print(f"c1: {crf.c1}")
print(f"c2: {crf.c2}")
print("Max iterations:", crf.max_iterations)


Model saved as trained_crf_model.joblib
Classification Report:
              precision    recall  f1-score   support

         AND      0.000     0.000     0.000         1
          AT      0.500     1.000     0.667         1
      BEFORE      1.000     1.000     1.000         2
         BID      0.000     0.000     0.000         2
       DAILY      0.000     0.000     0.000         0
    Duration      1.000     1.000     1.000         4
 DurationMax      0.000     0.000     0.000         1
DurationUnit      1.000     0.750     0.857         4
       EVERY      1.000     1.000     1.000         3
         FOR      1.000     1.000     1.000         4
        Form      1.000     1.000     1.000         5
   Frequency      0.500     1.000     0.667         1
      Method      1.000     1.000     1.000         3
          PO      0.000     0.000     0.000         2
      Period      1.000     1.000     1.000         4
   PeriodMax      0.500     1.000     0.667         1
  PeriodUnit      

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
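The step list at the top ends with "Accuracy", which the per-label report does not state as a single number. A minimal sketch of two overall scores follows: token-level accuracy and exact-match sentence accuracy. Both function names are hypothetical, and the gold and predicted sequences are a small hand-written example.

```python
def token_accuracy(y_true, y_pred):
    """Fraction of individual tokens whose predicted label matches the gold label."""
    pairs = [
        (gold, pred)
        for gold_seq, pred_seq in zip(y_true, y_pred)
        for gold, pred in zip(gold_seq, pred_seq)
    ]
    if not pairs:
        return 0.0
    return sum(1 for gold, pred in pairs if gold == pred) / len(pairs)

def sentence_accuracy(y_true, y_pred):
    """Fraction of sentences in which every token is labelled correctly."""
    if not y_true:
        return 0.0
    exact = sum(
        1 for gold_seq, pred_seq in zip(y_true, y_pred)
        if list(gold_seq) == list(pred_seq)
    )
    return exact / len(y_true)

gold = [
    ['FOR', 'Duration', 'TO', 'DurationMax', 'DurationUnit'],
    ['EVERY', 'Period', 'PeriodUnit'],
]
pred = [
    ['FOR', 'Duration', 'TO', 'PeriodMax', 'PeriodUnit'],
    ['EVERY', 'Period', 'PeriodUnit'],
]

print(token_accuracy(gold, pred))     # 6 of 8 tokens match -> 0.75
print(sentence_accuracy(gold, pred))  # 1 of 2 sentences match -> 0.5
```

The same functions can be called as `token_accuracy(y_test, y_pred)` and `sentence_accuracy(y_test, y_pred)` on the notebook's own variables.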


### Predicting the test data with the built model

In [None]:
# Load the saved model (if needed)
import joblib

model_filename = "trained_crf_model.joblib"
crf = joblib.load(model_filename)

# Predict on the test data
y_pred = crf.predict(X_test)

for i, (sentence_features, prediction) in enumerate(zip(X_test, y_pred)):
    print(f"Sentence {i+1} Predictions:")
    for feature, label in zip(sentence_features, prediction):
        # Find the feature that contains 'word.lower=' and extract the word
        word_feature = next((f for f in feature if f.startswith('word.lower=')), None)
        if word_feature:
            word = word_feature.split('=', 1)[1]  # Split once so a word containing '=' stays intact
            print(f"{word}: {label}")
    print("\n")


Sentence 1 Predictions:
for: FOR
5: Duration
to: TO
6: PeriodMax
days: PeriodUnit


Sentence 2 Predictions:
every: EVERY
2: Period
weeks: PeriodUnit


Sentence 3 Predictions:
bid: QID


Sentence 4 Predictions:
swallow: Method
three: Qty
capsules: Form
tid: TID
orally: DAILY


Sentence 5 Predictions:
every: EVERY
4: Period
to: TO
6: PeriodMax
hours: PeriodUnit


Sentence 6 Predictions:
every: EVERY
6: Period
hours: PeriodUnit


Sentence 7 Predictions:
30: Qty
tabs: Form
before: BEFORE
dinner: WHEN
&: AT
bedtime: WHEN


Sentence 8 Predictions:
100: Qty
tabs: Form
twice: Frequency
a: Period
month: PeriodUnit


Sentence 9 Predictions:
apply: Method
3: Qty
drops: Form
before: BEFORE
bedtime: WHEN


Sentence 10 Predictions:
take: Method
3: Qty
tabs: Form
orally: Frequency
bid: PeriodUnit
for: FOR
10: Duration
days: DurationUnit
at: AT
bedtime: WHEN


Sentence 11 Predictions:
for: FOR
twenty: Duration
days: DurationUnit


Sentence 12 Predictions:
x: FOR
3: Duration
days: DurationUnit
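The listing above shows only predicted labels, so errors have to be spotted by eye. A sketch that lines predictions up against gold labels and tallies the confused label pairs is given below. `confusion_pairs` is a hypothetical helper; the two example sentences use the gold labels from the input data and the predictions printed above for sentences 1 and 10.

```python
from collections import Counter

def confusion_pairs(tokens, y_true, y_pred):
    """Return a Counter of (gold, predicted) mismatches and the offending tokens."""
    errors = Counter()
    examples = []
    for words, gold_seq, pred_seq in zip(tokens, y_true, y_pred):
        for word, gold, pred in zip(words, gold_seq, pred_seq):
            if gold != pred:
                errors[(gold, pred)] += 1
                examples.append((word, gold, pred))
    return errors, examples

tokens = [
    ['for', '5', 'to', '6', 'days'],
    ['take', '3', 'tabs', 'orally', 'bid', 'for', '10', 'days', 'at', 'bedtime'],
]
gold = [
    ['FOR', 'Duration', 'TO', 'DurationMax', 'DurationUnit'],
    ['Method', 'Qty', 'Form', 'PO', 'BID', 'FOR', 'Duration', 'DurationUnit', 'AT', 'WHEN'],
]
pred = [
    ['FOR', 'Duration', 'TO', 'PeriodMax', 'PeriodUnit'],
    ['Method', 'Qty', 'Form', 'Frequency', 'PeriodUnit', 'FOR', 'Duration', 'DurationUnit', 'AT', 'WHEN'],
]

errors, examples = confusion_pairs(tokens, gold, pred)
for word, gold_label, pred_label in examples:
    print(f"{word}: expected {gold_label}, got {pred_label}")
```

In these two sentences the model confuses the Duration and Period families after `to`, and mislabels `orally` and `bid`, which suggests those patterns are under-represented in the training split.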




### Putting all the prediction logic inside a predict method

In [None]:
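# Load the trained CRF model
crf = joblib.load("trained_crf_model.joblib")

# nltk.word_tokenize (used in predict below) needs NLTK's 'punkt' tokenizer data;
# newer NLTK releases may ask for 'punkt_tab' instead.
nltk.download('punkt', quiet=True)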

def predict(sig):
    """
    Label each token of a sig with its predicted tag.

    @param sig: A string representing a medical prescription sig written by a doctor.
    @return: A single-element list holding the sequence of predicted labels, one per token.
    """
    # Tokenize the sentence and tag POS
    tokens = nltk.word_tokenize(sig)
    pos_tags = nltk.pos_tag(tokens)

    # Convert tokens and POS tags to the format required by the model
    doc = [(token, pos) for token, pos in pos_tags]

    # Extract features for each token in the sig
    features = [token_to_features(doc, i) for i in range(len(doc))]

    # Predict labels using the CRF model
    predictions = crf.predict([features])[0]

    return [predictions]
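`predict` returns labels only, so the caller has to re-align them with the tokens. For downstream use it may be easier to collect tokens under their labels. The sketch below does that with a hypothetical `group_by_label` helper; the tokens and labels are written out by hand for illustration rather than taken from the model.

```python
from collections import defaultdict

def group_by_label(tokens, labels):
    """Collect tokens under their label, preserving token order within each label."""
    grouped = defaultdict(list)
    for token, label in zip(tokens, labels):
        grouped[label].append(token)
    return dict(grouped)

tokens = ['take', '2', 'tabs', 'every', '6', 'hours', 'x', '10', 'days']
labels = ['Method', 'Qty', 'Form', 'EVERY', 'Period', 'PeriodUnit',
          'FOR', 'Duration', 'DurationUnit']

parsed = group_by_label(tokens, labels)
print(parsed)
```

Values are kept as lists because a label such as `Qty` or `WHEN` can occur more than once in a sig.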




### Sample predictions

In [None]:
predictions = predict("take 2 tabs every 6 hours x 10 days")
print(predictions)

[array(['Method', 'Qty', 'Form', 'EVERY', 'Period', 'PeriodUnit', 'FOR',
       'Duration', 'DurationUnit'], dtype=object)]


In [None]:
predictions = predict("2 capsu for 10 day at bed")
print(predictions)

[array(['Qty', 'Form', 'FOR', 'Duration', 'DurationUnit', 'AT', 'WHEN'],
      dtype=object)]


In [None]:
predictions = predict("2 capsu for 10 days at bed")
print(predictions)

[array(['Qty', 'Form', 'FOR', 'Duration', 'DurationUnit', 'AT', 'WHEN'],
      dtype=object)]


In [None]:
predictions = predict("5 days 2 tabs at bed")
print(predictions)

[array(['Duration', 'DurationUnit', 'Qty', 'Form', 'AT', 'WHEN'],
      dtype=object)]


In [None]:
predictions = predict("3 tabs qid x 10 weeks")
print(predictions)

[array(['Qty', 'Form', 'QID', 'FOR', 'Duration', 'DurationUnit'],
      dtype=object)]


In [None]:
predictions = predict("x 30 days")
print(predictions)

[array(['FOR', 'Duration', 'DurationUnit'], dtype=object)]


In [None]:
predictions = predict("x 20 months")
print(predictions)

[array(['FOR', 'Duration', 'DurationUnit'], dtype=object)]


In [None]:
predictions = predict("take 2 tabs po tid for 10 days")
print(predictions)

[array(['Method', 'Qty', 'Form', 'PO', 'TID', 'FOR', 'Duration',
       'DurationUnit'], dtype=object)]


In [None]:
predictions = predict("take 2 capsules po every 6 hours")
print(predictions)

[array(['Method', 'Qty', 'Form', 'PO', 'EVERY', 'Period', 'PeriodUnit'],
      dtype=object)]


In [None]:
predictions = predict("inject 2 units pu tid")
print(predictions)

[array(['Method', 'Qty', 'Form', 'Frequency', 'TID'], dtype=object)]


In [None]:
predictions = predict("swallow 3 caps tid by mouth")
print(predictions)

[array(['Method', 'Qty', 'Form', 'TID', 'BY', 'PO'], dtype=object)]


In [None]:
predictions = predict("inject 3 units orally")
print(predictions)

[array(['Method', 'Qty', 'Form', 'Frequency'], dtype=object)]


In [None]:
predictions = predict("orally take 3 tabs tid")
print(predictions)

[array(['Method', 'Method', 'Qty', 'Form', 'TID'], dtype=object)]


In [None]:
predictions = predict("by mouth take three caps")
print(predictions)

[array(['BY', 'PO', 'Method', 'Qty', 'Form'], dtype=object)]


In [None]:
predictions = predict("take 3 tabs orally three times a day for 10 days at bedtime")
print(predictions)

[array(['Method', 'Qty', 'Form', 'Frequency', 'Qty', 'TIMES', 'Period',
       'PeriodUnit', 'FOR', 'Duration', 'DurationUnit', 'AT', 'WHEN'],
      dtype=object)]


In [None]:
predictions = predict("take 3 tabs orally bid for 10 days at bedtime")
print(predictions)

[array(['Method', 'Qty', 'Form', 'Frequency', 'PeriodUnit', 'FOR',
       'Duration', 'DurationUnit', 'AT', 'WHEN'], dtype=object)]


In [None]:
predictions = predict("take 3 tabs bid orally at bed")
print(predictions)

[array(['Method', 'Qty', 'Form', 'Frequency', 'PeriodUnit', 'AT', 'WHEN'],
      dtype=object)]


In [None]:
predictions = predict("take 10 capsules by mouth qid")
print(predictions)

[array(['Method', 'Qty', 'Form', 'BY', 'PO', 'QID'], dtype=object)]


In [None]:
predictions = predict("inject 10 units orally qid x 3 months")
print(predictions)

[array(['Method', 'Qty', 'Form', 'Frequency', 'QID', 'FOR', 'Duration',
       'DurationUnit'], dtype=object)]


In [None]:
predictions = predict("please take 2 tablets per day for a month in the morning and evening each day")
print(predictions)

[array(['Method', 'Method', 'Qty', 'Form', 'Frequency', 'PeriodUnit',
       'FOR', 'Period', 'PeriodUnit', 'EVERY', 'Period', 'PeriodUnit',
       'AND', 'EVERY', 'Period', 'PeriodUnit'], dtype=object)]


In [None]:
predictions = predict("Amoxcicillin QID 30 tablets")
print(predictions)

[array(['Duration', 'DurationUnit', 'Qty', 'Form'], dtype=object)]


In [None]:
predictions = predict("take 3 tabs TID for 90 days with food")
print(predictions)


In [None]:
predictions = predict("with food take 3 tablets per day for 90 days")
print(predictions)


In [None]:
predictions = predict("with food take 3 tablets per week for 90 weeks")
print(predictions)


In [None]:
predictions = predict("take 2-4 tabs")
print(predictions)


In [None]:
predictions = predict("take 2 to 4 tabs")
print(predictions)


In [None]:
predictions = predict("take two to four tabs")
print(predictions)


In [None]:
predictions = predict("take 2-4 tabs for 8 to 9 days")
print(predictions)


In [None]:
predictions = predict("take 20 tabs every 6 to 8 days")
print(predictions)


In [None]:
predictions = predict("take 2 tabs every 4 to 6 days")
print(predictions)


In [None]:
predictions = predict("take 2 tabs every 2 to 10 weeks")
print(predictions)

[array(['Method', 'Qty', 'Form', 'EVERY', 'Period', 'TO', 'PeriodMax',
       'PeriodUnit'], dtype=object)]
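Running one cell per sig makes it easy to print a variable left over from an earlier cell. A loop that keeps each sig next to its own result avoids that. The sketch below uses a hypothetical `label_sigs` helper and a stand-in `dummy_predict` so that it runs without the trained model; in the notebook, the real `predict` would be passed instead.

```python
def label_sigs(sigs, predict_fn):
    """Run predict_fn on every sig and keep each result next to its input."""
    results = []
    for sig in sigs:
        results.append((sig, predict_fn(sig)))
    return results

def dummy_predict(sig):
    # Stand-in for the notebook's predict(): tags every whitespace token as 'O'.
    return [['O'] * len(sig.split())]

results = label_sigs(["take 2 to 4 tabs", "x 30 days"], dummy_predict)
for sig, labels in results:
    print(f"{sig!r} -> {labels}")
```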


