<div style="color:white;display:fill;border-radius:5px;background-color:#CCCCFF;
       font-size:150%;font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 10px;color:white;"><b> 1- ABOUT THE ANALYSIS:</b></p>
</div>

This competition is sponsored by the National Board of Medical Examiners® (NBME®). The goal of this competition is to develop an automated way of identifying the relevant features within each patient note, with a special focus on the patient history portions of the notes where the information from the interview with the standardized patient is documented.

In this notebook I walk through my train of thought while exploring the data.

<div style="color:white;display:fill;border-radius:5px;background-color:#CCCCFF;
       font-size:150%;font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 10px;color:white;"><b> 2- ABOUT THE DATA:</b></p>
</div>


The data provided for the competition consists of the following components:

1. Training data:

> 1.1- train.csv - Feature annotations for 1000 of the patient notes, 100 for each of ten cases.

> 1.2- patient_notes.csv - A collection of about 40,000 Patient Note history portions. 

> 1.3- features.csv - The rubric of features (or key concepts) for each clinical case.

2. Test data: Example instances selected from the training set.

3. sample_submission.csv - A sample submission file in the correct format.


<div style="color:white;display:fill;border-radius:5px;background-color:#CCCCFF;
       font-size:150%;font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 10px;color:white;"><b> 3- EXPLORE THE DATA:</b></p>
</div>

In [None]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
import warnings
warnings.filterwarnings("ignore")

In [None]:
train_df = pd.read_csv('/kaggle/input/nbme-score-clinical-patient-notes/train.csv')
test_df = pd.read_csv('/kaggle/input/nbme-score-clinical-patient-notes/test.csv')
notes = pd.read_csv('/kaggle/input/nbme-score-clinical-patient-notes/patient_notes.csv')
features = pd.read_csv('/kaggle/input/nbme-score-clinical-patient-notes/features.csv')
sample_submission = pd.read_csv('/kaggle/input/nbme-score-clinical-patient-notes/sample_submission.csv')

#### Exploring the train data:

(Taken from the project description)

The train data consists of 6 columns with the following details:

* id - Unique identifier for each patient note / feature pair.
* pn_num - The patient note annotated in this row.
* feature_num - The feature annotated in this row.
* case_num - The case to which this patient note belongs.
* annotation - The text(s) within a patient note indicating a feature. A feature may be indicated multiple times within a single note.
* location - Character spans indicating the location of each annotation within the note. Multiple spans may be needed to represent an annotation, in which case the spans are delimited by a semicolon ;.
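Because `location` arrives as a string rather than a list, it helps to convert it into integer pairs before using it. The sketch below assumes the values look like `['696 724']` or `['70 91;176 183']`, that is, a Python-style list literal whose items hold a space-separated start and end offset, with `;` joining the pieces of a split annotation. `parse_location` is a hypothetical helper name, not something provided by the competition.

```python
import ast

def parse_location(location: str) -> list:
    """Turn a raw `location` string into one list of (start, end) pairs per annotation."""
    annotations = []
    for item in ast.literal_eval(location):      # e.g. "70 91;176 183"
        spans = []
        for piece in item.split(";"):            # one piece per contiguous span
            start, end = piece.split()
            spans.append((int(start), int(end)))
        annotations.append(spans)
    return annotations

# Made-up inputs in the assumed format
print(parse_location("['696 724']"))
print(parse_location("['70 91;176 183', '200 210']"))
print(parse_location("[]"))  # a row without any annotation
```

Keeping one inner list per annotation preserves the difference between a feature mentioned twice and a single mention that is split across two spans.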

In [None]:
train_df

The train data info below shows that there are no NA values. However, as mentioned in the description, only a fraction of the data is annotated: rows without an annotation hold the string '[]' in the annotation and location columns instead of NA (see the table above).

In [None]:
train_df.info()

In [None]:
train_df.nunique()

In [None]:
test_df.head()

In [None]:
# Share of training rows per case; multiply by 100 so the y-axis really is a percentage
train_df['case_num'].value_counts(normalize=True).mul(100).sort_values().plot(kind='bar', figsize=(10, 4), color='gold', rot=0)

plt.xlabel("case_num", labelpad=10, fontsize=20)
plt.ylabel("Percent of data", labelpad=10, fontsize=20)
plt.xticks(size = 12)
plt.yticks(size = 12)
plt.title("Percent of data belonging to each case_num in the train set", y=1.02, fontsize=15)

#### Explore the features data:

features.csv - The rubric of features (or key concepts) for each clinical case.
* feature_num - A unique identifier for each feature.
* case_num - A unique identifier for each case.
* feature_text - A description of the feature.


In [None]:
features.head()

In [None]:
features.info()

In [None]:
features.nunique()

In [None]:
# How many unique features per case number?

feat_count = features.groupby('case_num')['feature_num'].count().reset_index()
print(feat_count)
plt.figure(figsize=(10, 4))
sns.barplot(x = feat_count['case_num'].astype(str), y= feat_count['feature_num'].astype(int))
plt.xlabel("case number", labelpad=10, fontsize=12)
plt.ylabel("number of features", labelpad=10, fontsize=12)
plt.xticks(size = 15)
plt.yticks(size = 15)
plt.title("total number of features per case number", y=1.02, fontsize=15)

#### Explore the notes data:

patient_notes.csv - A collection of about 40,000 Patient Note history portions. Only a subset of these have features annotated. 
* pn_num - A unique identifier for each patient note.
* case_num - A unique identifier for the clinical case a patient note represents.
* pn_history - The text of the encounter as recorded by the test taker.

In [None]:
notes.head()

In [None]:
notes.info()

In [None]:
notes.nunique()

In [None]:
# How many unique patient notes per case number?

notes_count = notes.groupby('case_num')['pn_num'].count().reset_index()
plt.figure(figsize=(10, 4))
sns.barplot(x = notes_count['case_num'].astype(str), y= notes_count['pn_num'].astype(int))
plt.xlabel("case number", labelpad=10, fontsize=12)
plt.ylabel("number of patient notes", labelpad=10, fontsize=12)
plt.xticks(size = 15)
plt.yticks(size = 15)
plt.grid()
plt.title("total number of patient notes per case number", y=1.02, fontsize=15)

To sum up, there are 10 case numbers, each with its own rubric of features that has to be checked for every interaction between a USMLE candidate and the standardized patient.

For example, case number 5 (a standardized clinical case) has a total of 18 rubric requirements (the features), and the total number of patient history records (which I think is simply the number of students who interacted with that case) is 7000. We have to find the location in the patient history notes where each rubric feature is expressed.

Some annotations and locations are missing from the train data, so we have to find the annotation using the patient history. Let's bring the history of each pn_num together with its annotations.

In [None]:
# There are many ways to match the patient history with feature text. I will go with our good old pandas merge. 
df = pd.merge(train_df, notes, on = ['pn_num', 'case_num'])
df1 = pd.merge(df, features, on = ['feature_num', 'case_num'])

# check if we have unique values as per the test data or not
df1.nunique()
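One way to guard the two merges against silent row duplication or dropped rows is to state the expected key relationship explicitly. The sketch below uses tiny made-up stand-ins for `train_df`, `notes` and `features` (the names `train_toy`, `notes_toy` and `features_toy` and all their values are illustrative assumptions), together with the `how` and `validate` arguments of `DataFrame.merge`.

```python
import pandas as pd

# Tiny stand-ins for train_df, notes and features (all values are made up)
train_toy = pd.DataFrame({
    "id": ["00016_000", "00016_001"],
    "case_num": [0, 0],
    "pn_num": [16, 16],
    "feature_num": [0, 1],
})
notes_toy = pd.DataFrame({
    "pn_num": [16],
    "case_num": [0],
    "pn_history": ["17 yo m with palpitations"],
})
features_toy = pd.DataFrame({
    "feature_num": [0, 1],
    "case_num": [0, 0],
    "feature_text": ["feature a", "feature b"],
})

# how="left" keeps every training row; validate raises a MergeError
# if the right-hand keys turn out not to be unique
merged = (
    train_toy
    .merge(notes_toy, on=["pn_num", "case_num"], how="left", validate="many_to_one")
    .merge(features_toy, on=["feature_num", "case_num"], how="left", validate="many_to_one")
)

assert len(merged) == len(train_toy)  # no rows gained or lost
print(merged.isna().sum())            # non-zero counts would flag unmatched keys
```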

A few annotations and locations in the training data still need to be filled in. For example, pn_num 95333 doesn't have an annotation or location for feature numbers 912 and 913.

In [None]:
final_df = df1[['id','case_num', 'pn_num', 'pn_history','feature_num', 'feature_text', 'annotation', 'location']].sort_values(by ='id')
final_df.head()

Next, I try to find matching words and phrases between the patient notes and the feature text using spaCy's PhraseMatcher, which should help in getting the location in the patient notes. Let's see how this works. Note that the matcher reports token indices, so its output does not directly match the character offsets stored in the location column.

In [None]:
# Let's take a look at the pn_history, with feature_text and annotation
print(f'**** 📜 patient history*****\n{final_df.pn_history.iloc[91]}')

print(f'****🧮 feature_text ***** \n {final_df.feature_text.iloc[91]}')

print(f'****📌 annotation ***** \n {final_df.annotation.iloc[91]}')

print(f'****📍 location ***** \n {final_df.location.iloc[91]}')
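A quick way to read a `location` value is to slice the note with it and compare the result with the annotation. The sketch below assumes the offsets behave like Python slice bounds (start inclusive, end exclusive); `spans_to_text`, the note and the offsets are all made up for illustration.

```python
def spans_to_text(note: str, spans: list) -> str:
    """Join the note slices selected by a list of (start, end) character offsets."""
    return " ".join(note[start:end] for start, end in spans)

# Made-up note; offsets are derived with str.find so they are easy to verify
note = "17 yo male with chest pain, father had heart attack"

start = note.find("chest pain")
single = [(start, start + len("chest pain"))]
print(spans_to_text(note, single))

# An annotation split over two spans, as a ';'-delimited location would describe
first = note.find("father")
second = note.find("heart attack")
split = [(first, first + len("father")), (second, second + len("heart attack"))]
print(spans_to_text(note, split))
```

If the sliced text and the annotation disagree for a real row, that usually points to an off-by-one assumption or to whitespace differences in the note.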


In [None]:
import spacy
from spacy import displacy
from spacy.matcher import PhraseMatcher
from textblob import TextBlob
from nltk.corpus import stopwords

# small English pipeline and the NLTK English stop word list
nlp = spacy.load("en_core_web_sm")
stop = stopwords.words('english')

In [None]:
# parse the feature text
def parse_feature_text(feature_text:str):
    # remove extra characters and make the text lower
    text = feature_text.replace("-", " ").lower()
    # remove stop words
    clean_text = [x for x in text.split() if x not in stop]
    return clean_text

In [None]:
parsed_feature = parse_feature_text(final_df.feature_text.iloc[91])
print(parsed_feature)

In [None]:
# parse the patient notes; for now this only lower-cases the text
def parse_patient_notes(note_text:str):
    # just make the text lower
    text = note_text.lower()
    return text

In [None]:
print('******Patient Notes')
parsed_notes = parse_patient_notes(final_df.pn_history.iloc[91])
doc = nlp(final_df.pn_history.iloc[91])
sentence_spans = list(doc.sents)
displacy.render(sentence_spans, style="ent", jupyter = True)

In [None]:
print('******Related feature')
doc = nlp(final_df.feature_text.iloc[91])
sentence_spans = list(doc.sents)
displacy.render(sentence_spans, style="dep", jupyter = True)

In [None]:
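import re
# Match the feature keywords against the patient note
def match_kw(parsed_feature, parsed_notes):
    matcher = PhraseMatcher(nlp.vocab)

    # one pattern per keyword; make_doc only tokenizes, which is all the matcher needs
    patterns = [nlp.make_doc(text) for text in parsed_feature]

    matcher.add("TerminologyList", patterns)

    doc = nlp(parsed_notes)
    matches = matcher(doc)

    for match_id, start, end in matches:
        span = doc[start:end]
        # start/end are token indices; start_char/end_char are character offsets,
        # which is the unit used by the location column
        print([span.text], [start, end], [span.start_char, span.end_char])
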

In [None]:
match_kw(parsed_feature, parsed_notes)
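Since the `location` column is expressed in character offsets, it can be useful to have a matcher that reports characters directly. The sketch below swaps spaCy's PhraseMatcher for plain regular expressions from the standard library, purely as a dependency-free illustration; `find_char_spans`, the note and the keywords are made up, and whole-word, case-insensitive matching is an assumption about what counts as a hit.

```python
import re

def find_char_spans(keywords, note):
    """Return (keyword, start, end) for every whole-word, case-insensitive hit."""
    hits = []
    for kw in keywords:
        pattern = r"\b" + re.escape(kw) + r"\b"
        for m in re.finditer(pattern, note, flags=re.IGNORECASE):
            hits.append((kw, m.start(), m.end()))
    # order hits by where they occur in the note
    return sorted(hits, key=lambda h: h[1])

# Made-up note and keywords
note = "Pt reports chest pain. Father had a heart attack at 50."
for kw, start, end in find_char_spans(["father", "heart", "attack"], note):
    print(kw, (start, end), repr(note[start:end]))
```

Matching on the original note rather than a cleaned copy keeps the offsets valid; lower-casing is safe here because it does not change string length for this text, but removing characters would shift every offset.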

In [None]:
# pip install git+https://github.com/LIAAD/yake

In [None]:
# # let's extract the keywords from the patient notes and match it with the features.
# import yake
# text1 = final_df.pn_history.iloc[0]
# language = "en"
# max_ngram_size = 5
# deduplication_thresold = 0.9
# deduplication_algo = 'leve'
# windowSize = 3
# numOfKeywords = 20

# custom_kw_extractor = yake.KeywordExtractor(lan=language, n=max_ngram_size, dedupLim=deduplication_thresold, dedupFunc=deduplication_algo, windowsSize=windowSize, top=numOfKeywords, features=None)
# keywords = custom_kw_extractor.extract_keywords(text1)

# for kw in keywords:
#     print(kw)