
# Task 1: Create a Prescription Parser using CRF
This task tests your ability to build a doctor's prescription parser with a Conditional Random Field (CRF) model.

Your job is to build a prescription parser that takes a prescription (a sentence) as input and labels each word in that sentence with one of the predefined labels.

### Problem: SEQUENCE PREDICTION - Label words in a sentence
#### Input : Doctor Prescription in the form of a sentence split into tokens
- Ex: Take 2 tablets once a day for 10 days

#### Output : FHIR Labels
- ('Take', 'Method')
- ('2', 'Qty')
- ('tablets', 'Form')
- ('once', 'Frequency')
- ('a', 'Period')
- ('day', 'PeriodUnit')
- ('for', 'FOR')
- ('10', 'Duration')
- ('days', 'DurationUnit')

### Major Steps
- Install the necessary libraries
- Import the libraries
- Create training data with labels
    - Split the sentence into tokens
    - Compute POS tags
    - Create triples
- Extract features
- Split the data into training and test sets
- Create CRF model
- Save the CRF model
- Load the CRF model
- Predict on test data
- Accuracy

#### Install the necessary libraries

In [32]:
!pip install nltk
!pip install sklearn-crfsuite
!pip uninstall -y textacy spacy thinc networkx
!pip install -U spacy==3.7.2 textacy==0.12.0 "networkx<3.0"
!python -m spacy download en_core_web_sm




#### Import the necessary libraries

In [48]:
import nltk
import joblib
import sklearn_crfsuite
from sklearn.model_selection import train_test_split
from sklearn_crfsuite import metrics
from nltk import pos_tag

# Download necessary NLTK data (if not already downloaded)
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

### Input data (GIVEN)
#### Creating the inputs to the ML model in the following form:
- sigs --> ['take 3 tabs for 10 days']       INPUT SIG
- input_sigs --> [['take', '3', 'tabs', 'for', '10', 'days']]      TOKENS
- output_labels --> [['Method','Qty', 'Form', 'FOR', 'Duration', 'DurationUnit']]       LABELS

In [36]:
sigs = ["for 5 to 6 days", "inject 2 units", "x 2 weeks", "x 3 days", "every day", "every 2 weeks", "every 3 days", "every 1 to 2 months", "every 2 to 6 weeks", "every 4 to 6 days", "take two to four tabs", "take 2 to 4 tabs", "take 3 tabs orally bid for 10 days at bedtime", "swallow three capsules tid orally", "take 2 capsules po every 6 hours", "take 2 tabs po for 10 days", "take 100 caps by mouth tid for 10 weeks", "take 2 tabs after an hour", "2 tabs every 4-6 hours", "every 4 to 6 hours", "q46h", "q4-6h", "2 hours before breakfast", "before 30 mins at bedtime", "30 mins before bed", "and 100 tabs twice a month", "100 tabs twice a month", "100 tabs once a month", "100 tabs thrice a month", "3 tabs daily for 3 days then 1 tab per day at bed", "30 tabs 10 days tid", "take 30 tabs for 10 days three times a day", "qid q6h", "bid", "qid", "30 tabs before dinner and bedtime", "30 tabs before dinner & bedtime", "take 3 tabs at bedtime", "30 tabs thrice daily for 10 days ", "30 tabs for 10 days three times a day", "Take 2 tablets a day", "qid for 10 days", "every day", "take 2 caps at bedtime", "apply 3 drops before bedtime", "take three capsules daily", "swallow 3 pills once a day", "swallow three pills thrice a day", "apply daily", "apply three drops before bedtime", "every 6 hours", "before food", "after food", "for 20 days", "for twenty days", "with meals"]
input_sigs = [['for', '5', 'to', '6', 'days'], ['inject', '2', 'units'], ['x', '2', 'weeks'], ['x', '3', 'days'], ['every', 'day'], ['every', '2', 'weeks'], ['every', '3', 'days'], ['every', '1', 'to', '2', 'months'], ['every', '2', 'to', '6', 'weeks'], ['every', '4', 'to', '6', 'days'], ['take', 'two', 'to', 'four', 'tabs'], ['take', '2', 'to', '4', 'tabs'], ['take', '3', 'tabs', 'orally', 'bid', 'for', '10', 'days', 'at', 'bedtime'], ['swallow', 'three', 'capsules', 'tid', 'orally'], ['take', '2', 'capsules', 'po', 'every', '6', 'hours'], ['take', '2', 'tabs', 'po', 'for', '10', 'days'], ['take', '100', 'caps', 'by', 'mouth', 'tid', 'for', '10', 'weeks'], ['take', '2', 'tabs', 'after', 'an', 'hour'], ['2', 'tabs', 'every', '4-6', 'hours'], ['every', '4', 'to', '6', 'hours'], ['q46h'], ['q4-6h'], ['2', 'hours', 'before', 'breakfast'], ['before', '30', 'mins', 'at', 'bedtime'], ['30', 'mins', 'before', 'bed'], ['and', '100', 'tabs', 'twice', 'a', 'month'], ['100', 'tabs', 'twice', 'a', 'month'], ['100', 'tabs', 'once', 'a', 'month'], ['100', 'tabs', 'thrice', 'a', 'month'], ['3', 'tabs', 'daily', 'for', '3', 'days', 'then', '1', 'tab', 'per', 'day', 'at', 'bed'], ['30', 'tabs', '10', 'days', 'tid'], ['take', '30', 'tabs', 'for', '10', 'days', 'three', 'times', 'a', 'day'], ['qid', 'q6h'], ['bid'], ['qid'], ['30', 'tabs', 'before', 'dinner', 'and', 'bedtime'], ['30', 'tabs', 'before', 'dinner', '&', 'bedtime'], ['take', '3', 'tabs', 'at', 'bedtime'], ['30', 'tabs', 'thrice', 'daily', 'for', '10', 'days'], ['30', 'tabs', 'for', '10', 'days', 'three', 'times', 'a', 'day'], ['take', '2', 'tablets', 'a', 'day'], ['qid', 'for', '10', 'days'], ['every', 'day'], ['take', '2', 'caps', 'at', 'bedtime'], ['apply', '3', 'drops', 'before', 'bedtime'], ['take', 'three', 'capsules', 'daily'], ['swallow', '3', 'pills', 'once', 'a', 'day'], ['swallow', 'three', 'pills', 'thrice', 'a', 'day'], ['apply', 'daily'], ['apply', 'three', 'drops', 'before', 'bedtime'], ['every', '6', 
'hours'], ['before', 'food'], ['after', 'food'], ['for', '20', 'days'], ['for', 'twenty', 'days'], ['with', 'meals']]
output_labels = [['FOR', 'Duration', 'TO', 'DurationMax', 'DurationUnit'], ['Method', 'Qty', 'Form'], ['FOR', 'Duration', 'DurationUnit'], ['FOR', 'Duration', 'DurationUnit'], ['EVERY', 'Period'], ['EVERY', 'Period', 'PeriodUnit'], ['EVERY', 'Period', 'PeriodUnit'], ['EVERY', 'Period', 'TO', 'PeriodMax', 'PeriodUnit'], ['EVERY', 'Period', 'TO', 'PeriodMax', 'PeriodUnit'], ['EVERY', 'Period', 'TO', 'PeriodMax', 'PeriodUnit'], ['Method', 'Qty', 'TO', 'Qty', 'Form'], ['Method', 'Qty', 'TO', 'Qty', 'Form'], ['Method', 'Qty', 'Form', 'PO', 'BID', 'FOR', 'Duration', 'DurationUnit', 'AT', 'WHEN'], ['Method', 'Qty', 'Form', 'TID', 'PO'], ['Method', 'Qty', 'Form', 'PO', 'EVERY', 'Period', 'PeriodUnit'], ['Method', 'Qty', 'Form', 'PO', 'FOR', 'Duration', 'DurationUnit'], ['Method', 'Qty', 'Form', 'BY', 'PO', 'TID', 'FOR', 'Duration', 'DurationUnit'], ['Method', 'Qty', 'Form', 'AFTER', 'Period', 'PeriodUnit'], ['Qty', 'Form', 'EVERY', 'Period', 'PeriodUnit'], ['EVERY', 'Period', 'TO', 'PeriodMax', 'PeriodUnit'], ['Q46H'], ['Q4-6H'], ['Qty', 'PeriodUnit', 'BEFORE', 'WHEN'], ['BEFORE', 'Qty', 'M', 'AT', 'WHEN'], ['Qty', 'M', 'BEFORE', 'WHEN'], ['AND', 'Qty', 'Form', 'Frequency', 'Period', 'PeriodUnit'], ['Qty', 'Form', 'Frequency', 'Period', 'PeriodUnit'], ['Qty', 'Form', 'Frequency', 'Period', 'PeriodUnit'], ['Qty', 'Form', 'Frequency', 'Period', 'PeriodUnit'], ['Qty', 'Form', 'Frequency', 'FOR', 'Duration', 'DurationUnit', 'THEN', 'Qty', 'Form', 'Frequency', 'PeriodUnit', 'AT', 'WHEN'], ['Qty', 'Form', 'Duration', 'DurationUnit', 'TID'], ['Method', 'Qty', 'Form', 'FOR', 'Duration', 'DurationUnit', 'Qty', 'TIMES', 'Period', 'PeriodUnit'], ['QID', 'Q6H'], ['BID'], ['QID'],['Qty', 'Form', 'BEFORE', 'WHEN', 'AND', 'WHEN'], ['Qty', 'Form', 'BEFORE', 'WHEN', 'AND', 'WHEN'], ['Method', 'Qty', 'Form', 'AT', 'WHEN'], ['Qty', 'Form', 'Frequency', 'DAILY', 'FOR', 'Duration', 'DurationUnit'], ['Qty', 'Form', 'FOR', 'Duration', 'DurationUnit', 'Frequency', 'TIMES', 'Period', 
'PeriodUnit'], ['Method', 'Qty', 'Form', 'Period', 'PeriodUnit'], ['QID', 'FOR', 'Duration', 'DurationUnit'], ['EVERY', 'PeriodUnit'], ['Method', 'Qty', 'Form', 'AT', 'WHEN'], ['Method', 'Qty', 'Form', 'BEFORE', 'WHEN'], ['Method', 'Qty', 'Form', 'DAILY'], ['Method', 'Qty', 'Form', 'Frequency', 'Period', 'PeriodUnit'], ['Method', 'Qty', 'Form', 'Frequency', 'Period', 'PeriodUnit'], ['Method', 'DAILY'], ['Method', 'Qty', 'Form', 'BEFORE', 'WHEN'], ['EVERY', 'Period', 'PeriodUnit'], ['BEFORE', 'FOOD'], ['AFTER', 'FOOD'], ['FOR', 'Duration', 'DurationUnit'], ['FOR', 'Duration', 'DurationUnit'], ['WITH', 'FOOD']]

In [49]:
len(sigs), len(input_sigs), len(output_labels)

(56, 56, 56)
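
The length check above only confirms that the three lists contain the same number of sentences. A per-sentence check that every token has exactly one label can catch misaligned annotations before training. The sketch below uses a hypothetical helper, `find_misaligned`, and a tiny made-up sample rather than the full lists.

```python
def find_misaligned(token_lists, label_lists):
    """Return indices of sentences whose token and label counts differ.

    `find_misaligned` is a hypothetical helper, not part of the notebook.
    """
    if len(token_lists) != len(label_lists):
        raise ValueError("token_lists and label_lists differ in length")
    return [
        i
        for i, (tokens, labels) in enumerate(zip(token_lists, label_lists))
        if len(tokens) != len(labels)
    ]


# Small illustrative sample; the last sentence is deliberately missing a label.
sample_tokens = [["for", "5", "to", "6", "days"], ["inject", "2", "units"], ["every", "day"]]
sample_labels = [["FOR", "Duration", "TO", "DurationMax", "DurationUnit"], ["Method", "Qty", "Form"], ["EVERY"]]

bad = find_misaligned(sample_tokens, sample_labels)
print(bad)  # [2]
```

In the notebook, calling `find_misaligned(input_sigs, output_labels)` would be expected to return an empty list if the given data is aligned.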

### Creating a Tuples Maker method
Write a function **tuples_maker(input_sigs, output_labels)** that pairs each token with its label and returns **output** in the form shown below.

Input(s):
- input_sigs
- output_labels

Output:

[[('for', 'FOR'),
  ('5', 'Duration'),
  ('to', 'TO'),
  ('6', 'DurationMax'),
  ('days', 'DurationUnit')], [second sentence], ...]

In [37]:
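def tuples_maker(inp, out):
    """Pair each token with its label, one list of (token, label) tuples per sentence."""
    sample_data = []

    for tokens, labels in zip(inp, out):
        # zip stops at the shorter sequence, so tokens and labels should align
        sentence_tuples = list(zip(tokens, labels))
        sample_data.append(sentence_tuples)

    return sample_data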


### Creating the triples_maker( ) for feature extraction
- input: tuples_maker_output
- output:
[[('for', 'IN', 'FOR'),
  ('5', 'CD', 'Duration'),
  ('to', 'TO', 'TO'),
  ('6', 'CD', 'DurationMax'),
  ('days', 'NNS', 'DurationUnit')], [second sentence], ... ]

In [38]:
def triples_maker(whole_data):
    """Add a POS tag to each (token, label) pair, giving (token, pos, label) triples."""
    sample_data = []

    for sentence in whole_data:
        tokens = [token for token, _ in sentence]
        labels = [label for _, label in sentence]

        # Tag the whole sentence at once so the tagger sees each token in context
        pos_tags = [pos for _, pos in pos_tag(tokens)]

        triples = list(zip(tokens, pos_tags, labels))
        sample_data.append(triples)

    return sample_data


### Creating the feature extractor method (GIVEN as a BASELINE)
#### The features used are:
- BOS, EOS, lowercase, prefix/suffix, uppercase, title, digit, length, has-digit, has-dash, postag, and previous/next word and tag features
#### Feel free to include more features

In [39]:
def token_to_features(doc, i):
    word = doc[i][0]
    postag = doc[i][1]

    # Common features for all words
    features = [
        'bias',
        'word.lower=' + word.lower(),
        'word[-3:]=' + word[-3:],
        'word[-2:]=' + word[-2:],
        'word[:2]=' + word[:2],
        'word[:3]=' + word[:3],
        'word.isupper=%s' % word.isupper(),
        'word.istitle=%s' % word.istitle(),
        'word.isdigit=%s' % word.isdigit(),
        'word.length=%s' % len(word),
        'word.hasdigit=%s' % any(ch.isdigit() for ch in word),
        'word.hasdash=%s' % ('-' in word),
        'postag=' + postag,
        'postag[:2]=' + postag[:2]

    ]

    # Features for words that are not
    # at the beginning of a document
    if i > 0:
        word1 = doc[i-1][0]
        postag1 = doc[i-1][1]
        features.extend([
            '-1:word.lower=' + word1.lower(),
            '-1:word.istitle=%s' % word1.istitle(),
            '-1:word.isupper=%s' % word1.isupper(),
            '-1:word.isdigit=%s' % word1.isdigit(),
            '-1:postag=' + postag1,
        ])
    else:
        # Indicate that it is the 'beginning of a document'
        features.append('BOS')

    # Features for words that are not
    # at the end of a document
    if i < len(doc)-1:
        word1 = doc[i+1][0]
        postag1 = doc[i+1][1]
        features.extend([
            '+1:word.lower=' + word1.lower(),
            '+1:word.istitle=%s' % word1.istitle(),
            '+1:word.isupper=%s' % word1.isupper(),
            '+1:word.isdigit=%s' % word1.isdigit(),
            '+1:postag=' + postag1,
            '+1:postag[:2]=' + postag1[:2],
        ])
    else:
        # Indicate that it is the 'end of a document'
        features.append('EOS')

    return features
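
As one way to extend the baseline, the sketch below adds a few domain-specific features. The helper name `extra_word_features`, the abbreviation and number-word sets, and the regular expressions are illustrative assumptions, not part of the given baseline; the returned strings could be appended to the `features` list in `token_to_features`.

```python
import re

# Assumed sets for illustration only; extend them to match your data.
FREQ_ABBREVIATIONS = {"bid", "tid", "qid"}
NUMBER_WORDS = {"one", "two", "three", "four", "five", "six", "ten", "twenty"}
# Matches interval shorthand such as q6h, q46h, or q4-6h.
INTERVAL_PATTERN = re.compile(r"^q\d+(-\d+)?h$")
# Matches numeric ranges such as 4-6.
RANGE_PATTERN = re.compile(r"^\d+-\d+$")


def extra_word_features(word):
    """Return extra string features for one token (hypothetical helper)."""
    lower = word.lower()
    return [
        "word.isfreqabbr=%s" % (lower in FREQ_ABBREVIATIONS),
        "word.isnumberword=%s" % (lower in NUMBER_WORDS),
        "word.isinterval=%s" % bool(INTERVAL_PATTERN.match(lower)),
        "word.isrange=%s" % bool(RANGE_PATTERN.match(lower)),
    ]


print(extra_word_features("q4-6h"))
print(extra_word_features("three"))
```

Whether such features help depends on the data; with only 56 training sentences, any gain should be checked on held-out sigs.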

### Running the feature extractor on the training data
- Feature extraction
- Train-test-split

In [40]:
tuples_data = tuples_maker(input_sigs, output_labels)
train_data_triples = triples_maker(tuples_data)


# Helper functions
def sent2features(sent):
    return [token_to_features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, postag, label in sent]

X = [sent2features(s) for s in train_data_triples]
y = [sent2labels(s) for s in train_data_triples]

# 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Training samples: {len(X_train)}")
print(f"Testing samples: {len(X_test)}")

print("Sample features for first sentence:")
print(X_train[0][0])  # Features for first word
print("Labels:", y_train[0])  # Labels for full sentence



Training samples: 44
Testing samples: 12
Sample features for first sentence:
['bias', 'word.lower=qid', 'word[-3:]=qid', 'word[-2:]=id', 'word[:2]=qi', 'word[:3]=qid', 'word.isupper=False', 'word.istitle=False', 'word.isdigit=False', 'word.length=3', 'word.hasdigit=False', 'word.hasdash=False', 'postag=NN', 'postag[:2]=NN', 'BOS', 'EOS']
Labels: ['QID']

### Creating, training, and saving the CRF model


In [41]:
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.1,
    c2=0.1,
    max_iterations=100,
    all_possible_transitions=True
)


print("Training CRF model... ")
crf.fit(X_train, y_train)
print("Training completed!")


model_filename = "prescription_parser_crf_model.pkl"
joblib.dump(crf, model_filename)
print("Model saved successfully")





Training CRF model... 
Training completed!
Model saved successfully


### Predicting the test data with the built model

In [42]:

crf_loaded = joblib.load("prescription_parser_crf_model.pkl")


y_pred = crf_loaded.predict(X_test)


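# flat_f1_score flattens all sentences and reports a weighted F1-score over
# tokens; it is related to, but not the same as, plain accuracy.
print("Model accuracy (F1-score):",
      metrics.flat_f1_score(y_test, y_pred, average='weighted'))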


Model accuracy (F1-score): 0.8731628453850676
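
The printed value is a weighted F1-score computed over all tokens, which is related to but not the same as accuracy. The pure-Python sketch below, using made-up labels and hypothetical helper names, shows how both token-level numbers can be derived from flattened label sequences. It is meant to illustrate the definitions rather than replace `sklearn_crfsuite.metrics`.

```python
from collections import Counter


def flatten(sequences):
    """Concatenate per-sentence label lists into one flat list."""
    return [label for seq in sequences for label in seq]


def flat_accuracy(y_true, y_pred):
    """Fraction of tokens whose predicted label equals the true label."""
    true, pred = flatten(y_true), flatten(y_pred)
    return sum(t == p for t, p in zip(true, pred)) / len(true)


def flat_weighted_f1(y_true, y_pred):
    """Per-label F1 averaged with weights equal to each label's true count."""
    true, pred = flatten(y_true), flatten(y_pred)
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(true, pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # p was predicted where it should not have been
            fn[t] += 1  # t was missed
    support = Counter(true)
    total = 0.0
    for label, count in support.items():
        denom = 2 * tp[label] + fp[label] + fn[label]
        f1 = 2 * tp[label] / denom if denom else 0.0
        total += f1 * count
    return total / len(true)


# Made-up labels for illustration; not the notebook's test split.
toy_true = [["Qty", "Form"], ["FOR", "Duration"]]
toy_pred = [["Qty", "Form"], ["FOR", "Qty"]]

print(flat_accuracy(toy_true, toy_pred))  # 0.75
print(flat_weighted_f1(toy_true, toy_pred))
```

In this toy case three of four tokens are correct, while the weighted F1 is lower because the missed `Duration` label scores zero and the spurious `Qty` prediction reduces the `Qty` score.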


### Putting all the prediction logic inside a predict method

In [43]:
def predict(sig):
    """
    predict(sig)
    Purpose: Assigns a label to each token of the given sig
    @param sig: A sentence (medical prescription sig)
    @return: A list containing one sequence of predicted labels for the sig
    >>> predict('2 tabs every 4 hours')
    [['Qty', 'Form', 'EVERY', 'Period', 'PeriodUnit']]
    >>> predict('2 tabs with food')
    [['Qty', 'Form', 'WITH', 'FOOD']]
    >>> predict('2 tabs qid x 30 days')
    [['Qty', 'Form', 'QID', 'FOR', 'Duration', 'DurationUnit']]
    """

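    # Tokenize the sig and tag each token with its part of speech
    tokens = nltk.word_tokenize(sig)
    pos_tagged_tokens = nltk.pos_tag(tokens)

    # token_to_features only reads the token and POS tag, so (token, pos) pairs suffice
    features_for_sig = [
        token_to_features(pos_tagged_tokens, i)
        for i in range(len(pos_tagged_tokens))
    ]

    # The CRF expects a list of sentences, so wrap the single sig in a list
    predictions = crf.predict([features_for_sig])

    return predictions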
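
`predict` returns only the label sequence. To get output in the (token, label) form shown at the top of this notebook, the labels can be zipped back onto the tokens. The sketch below uses a hypothetical helper, `pair_tokens_with_labels`, and hard-coded labels in place of a real model call.

```python
def pair_tokens_with_labels(tokens, labels):
    """Zip tokens with their predicted labels (hypothetical helper)."""
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    return list(zip(tokens, labels))


# Whitespace splitting stands in for nltk.word_tokenize in this sketch, and
# the labels are hard-coded rather than produced by the trained CRF.
tokens = "take 2 tabs for 10 days".split()
labels = ["Method", "Qty", "Form", "FOR", "Duration", "DurationUnit"]

for token, label in pair_tokens_with_labels(tokens, labels):
    print((token, label))
```

With the notebook's model, `labels` would be `predict(sig)[0]` and `tokens` would be `nltk.word_tokenize(sig)`.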


### Sample predictions

In [50]:
predictions = predict("take 2 tabs every 6 hours x 10 days")
print(predictions)

[['Method' 'Qty' 'Form' 'EVERY' 'Period' 'PeriodUnit' 'FOR' 'Duration'
  'DurationUnit']]


In [52]:
predictions = predict("2 capsu for 10 day at bed")
print(predictions)

[['Qty' 'Form' 'FOR' 'Duration' 'DurationUnit' 'AT' 'WHEN']]


In [57]:
predictions = predict("inject 3 units orally")

In [58]:
predictions = predict("by mouth take three caps")

In [59]:
predictions = predict("take 3 tabs orally three times a day for 10 days at bedtime")

In [60]:
predictions = predict("take 3 tabs orally bid for 10 days at bedtime")

In [61]:
predictions = predict("take 10 capsules by mouth qid")

In [62]:
prediction = predict("Amoxcicillin QID 30 tablets")

In [63]:
prediction = predict("take two to four tabs")

In [64]:
prediction = predict("take 2 tabs every 4 to 6 days")

In [65]:
prediction = predict("every 10 mins")

In [66]:
prediction = predict("once in every 3 days take 3 tabs")

# Task
Train the `sklearn_crfsuite.CRF` model using `X_train` and `y_train`, then predict labels for `X_test` and store them in `y_pred`.

## Train CRF Model

### Subtask:
Initialize and train the `sklearn_crfsuite.CRF` model using the prepared `X_train` (features) and `y_train` (labels) data.


**Reasoning**:
Initialize the CRF model with the specified parameters and then train it using the `X_train` and `y_train` data.



In [None]:
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.1,
    c2=0.1,
    max_iterations=100,
    all_possible_transitions=True
)
crf.fit(X_train, y_train)
print("CRF model trained successfully.")

**Reasoning**:
The `NameError` indicates that `X_train` and `y_train` were not defined before attempting to train the CRF model. This means the preceding steps for data preparation, feature extraction, and train-test split were not executed or their results were lost. I need to re-execute these steps to properly define `X_train` and `y_train` before training the model.



In [55]:
from sklearn.model_selection import train_test_split


processed_data = tuples_maker(input_sigs, output_labels)


train_data_triples = triples_maker(processed_data)


def sent2features(sent):
    return [token_to_features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, postag, label in sent]


X = [sent2features(s) for s in train_data_triples]
y = [sent2labels(s) for s in train_data_triples]


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.1,
    c2=0.1,
    max_iterations=100,
    all_possible_transitions=True
)
crf.fit(X_train, y_train)
print("CRF model trained successfully.")

CRF model trained successfully.


## Predict on Test Data

### Subtask:
Use the trained CRF model to make predictions on the `X_test` dataset and store the results in `y_pred`.


**Reasoning**:
Use the trained CRF model to predict labels for the `X_test` dataset and store them in `y_pred`.



In [56]:
y_pred = crf.predict(X_test)
print("Predictions generated for X_test.")

Predictions generated for X_test.


## Final Task

### Subtask:
Complete the task of predicting on the test data.


## Summary:

### Data Analysis Key Findings
*   An `sklearn_crfsuite.CRF` model was successfully initialized with `algorithm='lbfgs'`, `c1=0.1`, `c2=0.1`, `max_iterations=100`, and `all_possible_transitions=True`.
*   The CRF model was trained using the prepared `X_train` (features) and `y_train` (labels) datasets.
*   Predictions were successfully generated for `X_test` using the trained CRF model.
*   The predicted labels for the test data were stored in the `y_pred` variable.
*   Initially, data preparation and feature extraction steps needed to be re-executed to define `X_train` and `y_train`, resolving a `NameError`.

### Insights or Next Steps
*   The next logical step is to evaluate the performance of the trained CRF model by comparing the `y_pred` results against the true labels `y_test` using appropriate metrics (e.g., accuracy, precision, recall, F1-score).
*   Investigate the features used for training and consider engineering additional features that might improve the model's predictive capabilities, especially if performance metrics are not optimal.
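
As a starting point for the evaluation suggested above, the sketch below tallies which true labels are confused with which predicted labels. The helper name `confusion_pairs` and the sample labels are hypothetical; in the notebook the inputs would be `y_test` and `y_pred`.

```python
from collections import Counter


def confusion_pairs(y_true, y_pred):
    """Count (true_label, predicted_label) pairs for mislabeled tokens."""
    errors = Counter()
    for true_seq, pred_seq in zip(y_true, y_pred):
        for t, p in zip(true_seq, pred_seq):
            if t != p:
                errors[(t, p)] += 1
    return errors


# Made-up sequences for illustration only.
sample_true = [["EVERY", "Period", "PeriodUnit"], ["FOR", "Duration", "DurationUnit"]]
sample_pred = [["EVERY", "Period", "PeriodUnit"], ["FOR", "Period", "PeriodUnit"]]

for (true_label, pred_label), count in confusion_pairs(sample_true, sample_pred).most_common():
    print(f"{true_label} -> {pred_label}: {count}")
```

Frequent pairs in such a tally can point to where additional features might help, for example separating duration tokens from period tokens.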
