# Data Description

The text data presented here is from the USMLE® Step 2 Clinical Skills examination, a medical licensure exam. This exam measures a trainee's ability to recognize pertinent clinical facts during encounters with standardized patients.

During this exam, each test taker sees a Standardized Patient, a person trained to portray a clinical case. After interacting with the patient, the test taker documents the relevant facts of the encounter in a patient note. Each patient note is scored by a trained physician who looks for the presence of certain key concepts or features relevant to the case as described in a rubric. The goal of this competition is to develop an automated way of identifying the relevant features within each patient note, with a special focus on the patient history portions of the notes where the information from the interview with the standardized patient is documented.

Important Terms

Clinical Case: The scenario (e.g., symptoms, complaints, concerns) the Standardized Patient presents to the test taker (medical student, resident or physician). Ten clinical cases are represented in this dataset.

Patient Note: Text detailing important information related by the patient during the encounter (physical exam and interview).

Feature: A clinically relevant concept. A rubric describes the key concepts relevant to each case.

# Training Data

patient_notes.csv - A collection of about 40,000 Patient Note history portions. Only a subset of these have features annotated. You may wish to apply unsupervised learning techniques on the notes without annotations. The patient notes in the test set are not included in the public version of this file.

pn_num - A unique identifier for each patient note.

case_num - A unique identifier for the clinical case a patient note represents.

pn_history - The text of the encounter as recorded by the test taker.

features.csv - The rubric of features (or key concepts) for each clinical case.

feature_num - A unique identifier for each feature.

case_num - A unique identifier for each case.

feature_text - A description of the feature.

train.csv - Feature annotations for 1000 of the patient notes, 100 for each of ten cases.

id - Unique identifier for each patient note / feature pair.

pn_num - The patient note annotated in this row.

feature_num - The feature annotated in this row.

case_num - The case to which this patient note belongs.

annotation - The text(s) within a patient note indicating a feature. A feature may be indicated multiple times within a single note.

location - Character spans indicating the location of each annotation within the note. Multiple spans may be needed to represent an annotation, in which case the spans are delimited by a semicolon ;.
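The location format described above can be parsed into integer pairs. The sketch below assumes the column is stored as the string form of a Python list (as the `'[]'` comparison later in this notebook suggests); `parse_location` is a hypothetical helper name, not part of the dataset or any library.

```python
import ast

def parse_location(location):
    """Turn a train.csv-style location string into a list of (start, end) pairs.

    Assumption: `location` looks like "['85 99;126 138', '200 210']".
    """
    spans = []
    for item in ast.literal_eval(location):   # one list entry per annotation
        for piece in item.split(";"):         # an annotation may need several spans
            start, end = piece.split()
            spans.append((int(start), int(end)))
    return spans

print(parse_location("['85 99;126 138', '200 210']"))
print(parse_location("[]"))
```

An empty list string yields an empty list of spans, which is how rows without an annotated feature would come out.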



# Example Test Data

test.csv - Example instances selected from the training set.

sample_submission.csv - A sample submission file in the correct format.

# Importing Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb

# Creating DataFrame

In [None]:
df_train = pd.read_csv("../input/nbme-score-clinical-patient-notes/train.csv")
df_train.head()

In [None]:
df_train.describe()

In [None]:
df_train.shape


In [None]:
df_train.isnull().sum()


In [None]:
df=df_train.groupby('pn_num').size()
df

In [None]:
plt.figure(figsize=(15, 8))
sb.countplot(x='case_num',data=df_train,palette='rainbow')
plt.xlabel('case numbers')
plt.ylabel('Count')
plt.show()

case_num - The case to which this patient note belongs.

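Each bar counts rows of train.csv (patient note / feature pairs), not patient notes. The counts differ across the ten cases because each case's rubric contains a different number of features; the number of annotated notes is 100 for every case.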
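Because train.csv has one row per patient note / feature pair, row counts and note counts per case are different quantities. The sketch below uses a small made-up frame (`toy` is illustrative data, not taken from the competition files) to show how both can be tabulated side by side.

```python
import pandas as pd

# Toy stand-in for train.csv: case 0 has 3 features, case 1 has 2 features,
# and each case has 2 annotated notes.
toy = pd.DataFrame({
    "case_num":    [0] * 6 + [1] * 4,
    "pn_num":      [10, 10, 10, 11, 11, 11, 20, 20, 21, 21],
    "feature_num": [0, 1, 2, 0, 1, 2, 100, 101, 100, 101],
})

summary = toy.groupby("case_num").agg(
    rows=("pn_num", "size"),             # what a count plot of case_num shows
    notes=("pn_num", "nunique"),         # distinct patient notes
    features=("feature_num", "nunique"), # distinct rubric features
)
print(summary)
```

The same `groupby(...).agg(...)` call could be applied to `df_train` to check the per-case note counts directly.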

In [None]:
print(f"There are {len(df_train[df_train.location == '[]'])} empty features")

In [None]:
# Drop rows whose location is empty (the feature was not annotated in that note)

df_train = df_train[df_train.location != '[]'].copy().reset_index(drop=True)

# Creating df_patient

pn_num - A unique identifier for each patient note.

case_num - A unique identifier for the clinical case a patient note represents.

pn_history - The text of the encounter as recorded by the test taker.

In [None]:
df_patient=pd.read_csv("../input/nbme-score-clinical-patient-notes/patient_notes.csv")
df_patient.head()

In [None]:
df_patient.shape


In [None]:
df_patient.isnull().sum()


In [None]:
print(df_patient.pn_history[0])


In [None]:
print(df_patient.pn_history[1])


In [None]:
test = pd.read_csv("../input/nbme-score-clinical-patient-notes/test.csv")
ftr = pd.read_csv("../input/nbme-score-clinical-patient-notes/features.csv")
submit = pd.read_csv("../input/nbme-score-clinical-patient-notes/sample_submission.csv")

In [None]:
ftr.shape


In [None]:
ftr.info()


In [None]:
ftr[ftr.case_num==0]


In [None]:
plt.figure(figsize=(15, 8))
sb.countplot(x='case_num',data=ftr,palette='rainbow')
plt.xlabel('case numbers')
plt.ylabel('Count')
plt.show()

# Count annotations per row

In [None]:
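import ast

# annotation is a stringified list; parse it, since splitting on "," miscounts text that contains commas
df_train["n_annotation"] = df_train.annotation.apply(lambda x: len(ast.literal_eval(x)))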


In [None]:
df_train[df_train.pn_num == 16]
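To relate a row's location to its annotation text, the spans can be sliced out of the note. This is a minimal sketch on a made-up note; it assumes the end offset is exclusive (Python slice convention), and `extract_spans` is a hypothetical helper.

```python
def extract_spans(text, spans):
    """Return the substring for each (start, end) pair; end is assumed exclusive."""
    return [text[start:end] for start, end in spans]

# Made-up note text, not taken from patient_notes.csv.
note = "Pt reports chest pressure and intermittent episodes of palpitations."
spans = [(11, 25), (30, 51)]

print(extract_spans(note, spans))
```

On the real data, comparing the extracted strings with the annotation column is a quick way to confirm the offset convention before relying on it.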


# Distribution of annotations across different features

In [None]:
train_ftr = pd.DataFrame(df_train.groupby("feature_num")["n_annotation"].sum()).reset_index()
train_ftr = pd.merge(train_ftr, ftr, how="left", on="feature_num")
train_ftr.head()

In [None]:
sb.histplot(train_ftr.n_annotation, bins=30)
plt.title("Histogram of annotations by feature");

The histogram shows how many features (y-axis) fall into each range of total annotation counts (x-axis).

In [None]:
test.shape


In [None]:
test.head()


In [None]:
submit.shape


In [None]:
submit

In [None]:
# Merge on the pn_num column; DataFrame.join(on=...) would match against df_patient's row index instead
test = test.merge(df_patient[["pn_num", "pn_history"]], on="pn_num", how="left")
test
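When attaching note text by `pn_num`, it can help to confirm that every row found exactly one note. The sketch below uses toy frames (`notes` and `rows` are illustrative, not the competition files) with the `validate` and `indicator` options of `DataFrame.merge`.

```python
import pandas as pd

# Toy frames: the row index of `notes` deliberately does not equal pn_num.
notes = pd.DataFrame({"pn_num": [16, 41],
                      "pn_history": ["note sixteen", "note forty-one"]})
rows = pd.DataFrame({"id": ["00016_000", "00016_001"], "pn_num": [16, 16]})

merged = rows.merge(
    notes,
    on="pn_num",
    how="left",
    validate="many_to_one",  # raises if pn_num is duplicated in `notes`
    indicator=True,          # adds a _merge column: "both" or "left_only"
)
missing = merged[merged["_merge"] == "left_only"]
print(f"{len(missing)} rows without a matching note")
merged = merged.drop(columns="_merge")
```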

In [None]:
# Next, attach the location column from the training annotations

In [None]:
test = pd.merge(test, df_train.drop(["annotation", "n_annotation"], axis=1), how="left")
test

In [None]:
# Rows without a matching training annotation come back as NaN; treat them as empty
test["location"] = test.location.fillna("[]").apply(lambda x: x.replace("[","").replace("'","").replace("]","").replace(", ",";"))

In [None]:
test

In [None]:
test = test[submit.columns]
test

In [None]:
test.to_csv('submission.csv', index=False)
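Before relying on the written file, the location strings can be sanity-checked. The sketch below assumes the expected shape is `start end` pairs joined by semicolons, with an empty string meaning no spans; both the pattern and the `is_valid_location` helper are assumptions for illustration, not an official specification.

```python
import re

# Assumed shape: "70 91" or "70 91;176 183"
SPAN_PATTERN = re.compile(r"^\d+ \d+(;\d+ \d+)*$")

def is_valid_location(value):
    """True for an empty string or semicolon-delimited 'start end' pairs."""
    return value == "" or bool(SPAN_PATTERN.match(value))

for candidate in ["696 724", "70 91;176 183", "", "['696 724']", "70 91, 176 183"]:
    print(repr(candidate), is_valid_location(candidate))
```

Applied to the notebook's frame, something like `test.location.apply(is_valid_location).all()` would flag leftover brackets, quotes, or comma separators.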