# A dataset by [National Board of Medical Examiners](https://www.nbme.org) (NBME)

- Score Clinical Patient Notes
- Identify Key Phrases in Patient Notes from Medical Licensing Exams

For any project, it is extremely important to understand the dataset. In this notebook, let's build an understanding of the provided data files.


## Exploratory Data Analysis (EDA)

# Table of Contents

<a id="toc"></a>
- [1. Introduction](#1)
- [2. Imports](#2)
- [3. Exploring given datasets for training](#3)
    - [3.1 train.csv](#3.1)
        - [3.1.1 Missing annotations in train.csv](#3.1.1)
        - [3.1.2 Annotated vs unannotated cases in `train.csv`](#3.1.2)
        - [3.1.3 Annotation analysis](#3.1.3)
            - [3.1.3.1: Annotation count distribution](#3.1.3.1)
            - [3.1.3.2: Annotation length distribution](#3.1.3.2)
    - [3.2 patient_notes.csv](#3.2)
        - [3.2.1 Patient notes per case](#3.2.1)
        - [3.2.2 Patient Notes Length Distribution ](#3.2.2)
    - [3.3 features.csv](#3.3)
        - [3.3.1 Feature Distribution (per Case)](#3.3.1)
        - [3.3.2 Feature Length Distribution](#3.3.2)
- [4. A sample patient note and its annotations](#4)
- [5. Marking and visualizing annotations using spaCy](#5)
- [6. Word clouds](#6)
    - [6.1 Word cloud of patient history notes](#6.1)
    - [6.2 Word cloud of two-character words in patient history notes](#6.2)
    - [6.3 Word cloud of features](#6.3)
    - [6.4 Word cloud of annotations](#6.4)
    - [6.5 Another way to generate word clouds using stylecloud](#6.5)
- [7. Final train and test datasets](#7)
    - [7.1: Final train dataset](#7.1)
    - [7.2: Final test dataset](#7.2)
- [8. The submission sample](#8)

<a id="1"></a>
# 1. Introduction

### [NBME - Score Clinical Patient Notes](https://www.kaggle.com/c/nbme-score-clinical-patient-notes/overview)

The text data presented here is from the USMLE® Step 2 Clinical Skills examination, a medical licensure exam. This exam measures a trainee's ability to recognize pertinent clinical facts during encounters with standardized patients.

During this exam, each test taker sees a Standardized Patient, a person trained to portray a clinical case. After interacting with the patient, the test taker documents the relevant facts of the encounter in a patient note. Each patient note is scored by a trained physician who looks for the presence of certain key concepts or features relevant to the case as described in a rubric. The goal of this competition is to develop an automated way of identifying the relevant features within each patient note, with a special focus on the patient history portions of the notes where the information from the interview with the standardized patient is documented.

<a href="#toc" role="button" aria-pressed="true" >👆 Table of contents 👆</a>

<a id="2"></a>
## 2. Imports 

In [None]:
# Imports we need 
import os, glob, random, re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import plotly.express as px
import plotly.graph_objects as go
#import plotly.graph_objs as go

import spacy
import wordcloud

import warnings
warnings.filterwarnings('ignore')

RANDOM_INDEX=5

In [None]:
# Some custom utility functions
def map_df(df):
    """Plot a heatmap of missing values (missing cells show up in yellow)."""
    plt.figure(figsize=(18,6), dpi=200)
    sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
    plt.title("The raw data..., yellow showing the missing observations!", fontsize=18);

#*****************************************************************#
def missing_df(df, rows=None):
    """Print the percentage of missing values for every column that has any."""
    pct_missing = (df.isnull().mean() * 100).round(2)
    # keep only the columns that actually contain missing data
    pct_missing = pct_missing[df.isnull().any()]
    print("Number of columns with missing data:", len(pct_missing))
    df2 = (pct_missing.sort_values(ascending=False)
           .rename_axis('Column Name')
           .reset_index(name='%Missing'))
    # show everything unless a row limit was requested
    print(df2 if rows is None else df2.head(rows))


In [None]:
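# data directory and the available files
# Note: after the loop, data_dir holds the last directory visited by os.walk;
# the cells below reuse it to build the file paths.
for data_dir, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(data_dir, filename))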

In [None]:
# Training data files
train=pd.read_csv(data_dir+"/train.csv")
patient_notes=pd.read_csv(data_dir+"/patient_notes.csv")
features=pd.read_csv(data_dir+"/features.csv")

# Test data file/s
test=pd.read_csv(data_dir+"/test.csv")

# submission sample 
submission=pd.read_csv(data_dir+"/sample_submission.csv")

<a href="#toc" role="button" aria-pressed="true" >👆 Table of contents 👆</a>

<a id="3"></a>
# 3. Exploring given datasets for training 

Training data includes the following given `.csv` file:
- `train.csv`
- `patient_notes.csv`
- `features.csv`

<a id="3.1"></a>

## 3.1: `train.csv`

In [None]:
train.tail(5)

**Column Description :**
* `id` - **Unique identifier** for each **patient note / feature pair**.
* `case_num` - The case to which this patient note belongs.
* `pn_num` - The patient note annotated in this row.
* `feature_num` - The feature annotated in this row.
* `annotation` - The text(s) within a patient note indicating a feature. A feature **may be indicated multiple times within a single note**.
* `location` - Character spans indicating the location of each annotation within the note. Multiple spans may be needed to represent an annotation, in which case the spans are delimited by a semicolon (`;`).
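
To make the `location` format concrete, here is a minimal sketch of how a raw cell could be turned into integer spans. It assumes the cell is a stringified Python list (with `[]` for rows without annotations) and that each span is written as `start end`. The function name `parse_location` and the example values are made up for illustration and are not part of the dataset or of this notebook's pipeline.

```python
import ast

def parse_location(raw):
    """Turn a raw `location` cell such as "['85 99', '126 131;143 151']"
    into a flat list of (start, end) integer tuples."""
    spans = []
    for item in ast.literal_eval(raw):      # the cell is a stringified list
        for piece in item.split(";"):       # ';' separates spans of one annotation
            start, end = piece.split()
            spans.append((int(start), int(end)))
    return spans

# Illustrative values only, not taken from the real data
print(parse_location("['85 99', '126 131;143 151']"))
```

An empty cell (`"[]"`) simply yields an empty list, so rows without annotations need no special handling in this sketch.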

In [None]:
print('\033[92mNumber of rows in train data: {}'.format(train.shape[0]))
print('\033[92mNumber of columns in train data: {}'.format(train.shape[1]))
print()
print('\033[94mNumber of unique cases: {}'.format(train.case_num.nunique()))
print('\033[94mNumber of unique patients: {}'.format(train.pn_num.nunique()))

<a id="3.1.1"></a>
### 3.1.1: Missing annotations in train.csv

Notice that some rows have `[]` in the `annotation` and `location` columns. These are effectively missing data: those entries have no annotation and no location.

Let's replace them with `None` and draw a heatmap to visualize the missing data.

In [None]:
#train.head(2)

In [None]:
train['annotation']=train.annotation.apply(lambda x: None if x=='[]' else x)
train['location']=train.location.apply(lambda x: None if x=='[]' else x)
#train[train.feature_num==915]

In [None]:
#train.head(2)

In [None]:
map_df(train)

In [None]:
# percentage of missing data
missing_df(train)
print()
print('\033[91mTotal number of missing values in train.csv data: {}'.format(sum(train.isna().sum())))

<a id="3.1.2"></a>
### 3.1.2: Annotated vs unannotated cases in `train.csv`

**Let's get a bar plot for annotated and unannotated cases in `train.csv`.**
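
As a numeric complement to the bar plot, the sketch below computes the share of annotated rows per case on a toy dataframe. It relies on the fact that `size` counts all rows in a group while `count` skips missing values. The toy data and the column name `annotated_share` are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for train.csv after '[]' has been replaced by None
toy = pd.DataFrame({
    "case_num":   [0, 0, 0, 1, 1],
    "annotation": ["['chest pain']", None, "['17 yo']", None, None],
})

# size -> all rows per case, count -> rows with a non-missing annotation
summary = toy.groupby("case_num")["annotation"].agg(total="size", annotated="count")
summary["annotated_share"] = summary["annotated"] / summary["total"]
print(summary)
```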

In [None]:
temp=train.groupby("case_num").count()
temp=temp.reset_index()
temp=temp[['case_num','id','annotation']]
temp.columns=['case_number','total_case_count','annotated_case_count']

In [None]:
# Setting some parameters 
tickmode='array'
tickvals=[0,1,2,3,4,5,6,7,8,9]
ticktext=['Case_0','Case_1','Case_2','Case_3','Case_4','Case_5','Case_6','Case_7','Case_8','Case_9']
template="plotly_white"

In [None]:
# Getting plot
fig=go.Figure()
fig.add_trace(go.Bar(x=temp['case_number'], y=temp["total_case_count"], name="Total cases"))
fig.add_trace(go.Bar(x=temp['case_number'], y=temp["annotated_case_count"], name="Annotated cases"))

# Updating layout
fig.update_layout(
    title={'text':'Distribution of cases in train.csv','y':0.95,'x':0.5,'xanchor':'center','yanchor':'top'},
    xaxis_title="Case number",yaxis_title="Number of cases",
    xaxis=dict(tickmode=tickmode,tickvals=tickvals,ticktext=ticktext),
    template=template)

<a href="#toc" role="button" aria-pressed="true" >👆 Table of contents 👆</a>

<a id="3.1.3"></a>
### 3.1.3: Annotation Analysis 

We can start by looking at the number of annotations in each row of `train.csv`. To do so, let's create a new column, `annot_count`, and draw a bar plot.

<a id="3.1.3.1"></a>
### 3.1.3.1: Annotation Count Distribution

In [None]:
# new column with annotation count in each row
import ast

def annot_count(annot):
    # annotation is a stringified list, e.g. "['a', 'b']"; parse it without eval
    if annot is not None:
        return len(ast.literal_eval(annot))
    return 0
train['annot_count']=train.annotation.apply(annot_count)

# getting annotation count and number of rows for each count 
temp=pd.DataFrame(train.annot_count.value_counts())
temp=temp.reset_index()
temp.columns=['n_annotations','n_rows']

# Figure
fig = px.bar(x=temp['n_annotations'],
             y=temp['n_rows']) 

# Updating layout
fig.update_layout(title={'text': 'Annotations distribution in train.csv',
                         'y':0.95,'x':0.48,'xanchor': 'center','yanchor': 'top'},
                  xaxis_title="Number of annotations",yaxis_title="Row count in train.csv",
                  xaxis=dict(tickmode=tickmode,tickvals=tickvals),#,ticktext=ticktext),
                  template=template)

<a id="3.1.3.2"></a>
### 3.1.3.2: Annotation Length Distribution 

In [None]:
#train

In [None]:
# Getting length of each annotation in a new column
# (annotation is still the raw string here, so brackets and quotes are counted)
train['annot_len']=train['annotation'].apply(lambda x: len(x) if x is not None else None)
# the rows with no annotations will not be considered

# mean length
print("Mean length of the annotation in train.csv: {} (chars)".format(round(train['annot_len'].mean(),2)))

# Distribution plot along with box plot (try violin)
fig = px.histogram(x = train['annot_len'],marginal="box",nbins = 200)
fig.update_layout(template=template)
fig.update_xaxes(title = "Length of annotation text")

In [None]:
#train[train.annot_len>200]

**********
**********

<a id="3.2"></a>
## 3.2: `patient_notes.csv`

* Patient Notes Data

In [None]:
patient_notes.info(memory_usage='deep')
print()
print('\033[92mNumber of rows in patient_notes data: {}'.format(patient_notes.shape[0]))
print('\033[94mNumber of columns in patient_notes data: {}'.format(patient_notes.shape[1]))

In [None]:
#map_df(patient_notes)

In [None]:
patient_notes.head(2)

In [None]:
#patient_notes.pn_num.nunique()
print("Number of unique cases in patient_notes.csv: {}".format(patient_notes.case_num.nunique()))

**Column Description :**
* `pn_num` - A unique identifier for each patient note.
* `case_num` - A unique identifier for the clinical case a patient note represents.
* `pn_history` - The text of the encounter as recorded by the test taker.

Let's see what a **Sample Patient Note `(pn_history)`** looks like.

In [None]:
print(patient_notes["pn_history"].iloc[RANDOM_INDEX])

<a id="3.2.1"></a>
### 3.2.1: Patient notes  per case 

In [None]:
# Grouped data and some rearrangements 
pat_notes_counts=patient_notes.groupby("case_num").count()
pat_notes_counts=pat_notes_counts.reset_index()
pat_notes_counts=pat_notes_counts[['case_num','pn_num']]
pat_notes_counts.columns=['Case number','Number of patient notes']

# Figure
fig = px.bar(x=pat_notes_counts['Case number'],
             y=pat_notes_counts['Number of patient notes']) 

# Updating layout
fig.update_layout(title={'text': 'Distribution of patient notes for each case',
                         'y':0.95,'x':0.48,'xanchor': 'center','yanchor': 'top'},
                  xaxis_title="Case number",yaxis_title="Notes count per case",
                  xaxis=dict(tickmode=tickmode,tickvals=tickvals,ticktext=ticktext),
                  template=template)

<a id="3.2.2"></a>
### 3.2.2: Patient Notes Length Distribution 

Let's check the length of each note and see how it is distributed across the notes dataset!

In [None]:
# Getting length of each note in a new column 
patient_notes['note_len']=patient_notes['pn_history'].apply(lambda x: len(x))

# mean length
print("Mean length of the patient history notes is: {}".format(round(patient_notes['note_len'].mean(),2)))

# Distribution plot along with box plot (try violin)
fig = px.histogram(x = patient_notes['note_len'],marginal="box",nbins = 100)
fig.update_layout(template=template)
fig.update_xaxes(title = "Length of patient notes")

### Mean patient note length per case 

In [None]:
temp=pd.DataFrame(patient_notes.groupby('case_num')['note_len'].mean())
temp=temp.reset_index()
fig=px.bar(x=temp['case_num'], y=temp['note_len'])
fig.update_layout(template=template)
# Updating layout
fig.update_layout(title={'text': 'Mean length of patient notes for each case',
                         'y':0.95,'x':0.48,'xanchor': 'center','yanchor': 'top'},
                  xaxis_title="Case number",yaxis_title="Mean length of notes per case",
                  xaxis=dict(tickmode=tickmode,tickvals=tickvals,ticktext=ticktext),
                  template=template)

* **All cases have a similar mean note length. It might be a good idea to look at the note length per patient as well (we have 1000 unique patients in the data). (Later)**
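
Similar means can still hide different spreads. As a first step before any per-patient analysis, a sketch like the one below summarises the minimum, median and maximum note length per case. It runs on a toy dataframe with made-up notes; on the real data the same `groupby` would be applied to `patient_notes`.

```python
import pandas as pd

# Toy stand-in for patient_notes.csv (note texts are made up)
toy_notes = pd.DataFrame({
    "case_num":   [0, 0, 0, 1, 1, 1],
    "pn_history": ["short note", "a somewhat longer note", "mid note",
                   "x" * 40, "y" * 5, "z" * 90],
})
toy_notes["note_len"] = toy_notes["pn_history"].str.len()

# Spread of note length within each case
spread = toy_notes.groupby("case_num")["note_len"].agg(["min", "median", "max"])
print(spread)
```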

In [None]:
patient_notes.columns

<a href="#toc" role="button" aria-pressed="true" >👆 Table of contents 👆</a>

<a id="3.3"></a>
## 3.3: `features.csv`

In [None]:
features.head(2)

In [None]:
print(features.info())
print()
print(f'\033[92mNumber of rows in features data: {features.shape[0]}')
print(f'\033[94mNumber of columns in features data: {features.shape[1]}')


**Column Description :**
* `feature_num` - A unique identifier for each feature.
* `case_num` - A unique identifier for each case.
* `feature_text` - A description of the feature.

In [None]:
#features.case_num.value_counts()

### Sample Feature text

In [None]:
features["feature_text"].iloc[RANDOM_INDEX]

<a id="3.3.1"></a>
### 3.3.1: Feature Distribution (per Case) 

In [None]:
# Grouped data and some re-arrangements 
feature_counts = features.groupby("case_num").count()
feature_counts=feature_counts.reset_index()
feature_counts=feature_counts[['case_num','feature_num']]

# Figure
fig = px.bar(x=feature_counts['case_num'],
             y=feature_counts['feature_num']) 

# Updating layout
fig.update_layout(title={'text': 'Distribution of features per case',
                         'y':0.95,'x':0.48,'xanchor': 'center','yanchor': 'top'},
                  xaxis_title="Case number",yaxis_title="Feature count per case",
                  xaxis=dict(tickmode=tickmode,tickvals=tickvals,ticktext=ticktext),
                  template=template)

<a id="3.3.2"></a>
### 3.3.2: Feature Length Distribution 

In [None]:
features.head(2)

In [None]:
# Getting length of each feature_text in a new column 
features['text_len']=features['feature_text'].apply(lambda x: len(x))

# mean length
print("Mean length of the feature text is: {}".format(round(features['text_len'].mean(),2)))

# Distribution plot along with box plot (try violin)
fig = px.histogram(x = features['text_len'],marginal="box",nbins = 100)
fig.update_layout(template=template)
fig.update_xaxes(title = "Length of feature text")

<a href="#toc" role="button" aria-pressed="true" >👆 Table of contents 👆</a>

So far, we have explored all the files provided for training.

Let's move on and pick a single patient to see what the patient note and its annotations look like.

<a id="4"></a>
## 4. A sample patient note and its annotations

We have annotations in `train.csv` and the history notes in `patient_notes.csv`, so we need both dataframes.

**We already know that we have 1000 unique patients. Let's pick one at random and separate the relevant rows from `train.csv`.**

In [None]:
print("Unique Patient Count in train data : ",len(train["pn_num"].value_counts()))

**Separating data for a particular patient**

In [None]:
# List of unique patients 
unique_pt_ids=list(train.pn_num.unique())

# unique patient 
PATIENT_NUMBER=random.choice(unique_pt_ids)
print("Selected patient id is: {}".format(PATIENT_NUMBER))

# Annotated dataframe of the selected patient
patient_df = train[train["pn_num"] == PATIENT_NUMBER]
print("The dataframe for the selected patient is separated in 'patient_df'")

**Let's see what the patient note and its annotations look like**

In [None]:
#patient_notes[patient_notes["pn_num"] == PATIENT_NUMBER]

In [None]:
print(f"\033[94mPatient Notes - ")
print(f'\033[94m',patient_notes[patient_notes["pn_num"] == PATIENT_NUMBER]["pn_history"].iloc[0])
print("------------")
print(f'\033[92mAnnotations:')
for i in range(len(patient_df)):
    print(f'\033[92m',patient_df["annotation"].iloc[i])

<a href="#toc" role="button" aria-pressed="true" >👆 Table of contents 👆</a>

The notes and annotations above look fine. However, it would be friendlier to locate each annotation and mark it in a different colour within the history notes, which we can do with a little help from spaCy.

Of course, we first need to locate the annotations in the text using the `location` column in `train.csv`, and then highlight the relevant text in the history notes from `patient_notes.csv`.
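
Before highlighting anything, it can help to check that the spans really point at the annotated text. The sketch below slices a note with `(start, end)` pairs. It assumes the offsets are half-open character positions, as in Python slicing; the helper name `spans_to_text`, the note and the spans are made up for illustration.

```python
def spans_to_text(note, spans):
    """Return the substring of `note` covered by each (start, end) span."""
    return [note[start:end] for start, end in spans]

# Made-up note and spans, only to show the mechanics
note = "17 yo male with chest pain for 2 days"
spans = [(0, 5), (16, 26)]
print(spans_to_text(note, spans))
```

On the real data, comparing these slices with the `annotation` column is a cheap sanity check that the text and the offsets still line up.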

<a id="5"></a>
# 5. Marking and visualizing annotations using [spaCy](https://spacy.io/)

In [None]:
import ast

def eval_(annot):
    # Parse the stringified list safely; rows without annotations get a sentinel
    if annot is not None:
        return ast.literal_eval(annot)
    return 'no-annot'
train["location"] = train["location"].apply(eval_)
train['annotation'] = train['annotation'].apply(eval_)

# Grabbing the data for a single patients from train.csv
patient_df = train[train["pn_num"] == PATIENT_NUMBER]
patient_df=patient_df[patient_df.annotation!='no-annot']

The location data needs a small fix: when an annotation consists of several spans, they are packed into one string separated by `;`. The `correction` function below splits such strings so that every span becomes its own list item for the selected patient.

In [None]:
def correction(list_):
    temp=[]
    for item in list_:
        temp.append(item.split(';'))
    flat_list = [item for sublist in temp for item in sublist]
    return flat_list
patient_df['location']=patient_df.location.apply(correction)

So, we have location for the annotation of selected patient in `patient_df` which is a sub dataframe from train.csv. We also have the complete notes for the same patient in `patient_notes` dataframe. 

Let's highlight the annotations in the patient notes.

In [None]:
# Grabbing location and annotation columns of the selected patient
location  = patient_df["location"]
annotation = patient_df["annotation"]

# empty lists for start and end points
start_pos = []
end_pos = []
for i in location:
    for j in i:
        start_pos.append(j.split()[0])
        end_pos.append(j.split()[1])

# Marking annotations in the selected notes 
ents = []
for i in range(len(start_pos)):
    ents.append({'start': int(start_pos[i]),
                 'end' : int(end_pos[i]),
                 "label" : "(Annotation)"})

# Patient notes 
doc={'text':patient_notes[patient_notes["pn_num"]==PATIENT_NUMBER]["pn_history"].iloc[0],
     "ents" : ents}

# Colour that we want to highlight the annotation 
colors = {"(Annotation)" :"linear-gradient(0deg,#FFA500,#FFFF00)" } 
options = {"colors": colors}
spacy.displacy.render(doc, style="ent", options = options , manual=True, jupyter=True);

<a href="#toc" role="button" aria-pressed="true" >👆 Table of contents 👆</a>

<a id="6"></a>
# 6. Word clouds

It is a good idea to visualize the most common words in the patient history notes, features and annotations using word clouds.
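
A word cloud shows relative frequency only visually. If exact counts are also of interest, a plain frequency table can be built with `collections.Counter`, as in the sketch below. The helper `top_words`, its tiny stopword set and the sample sentences are illustrative assumptions, not part of this notebook's pipeline.

```python
import re
from collections import Counter

def top_words(texts, n=3, stopwords=frozenset({"the", "a", "of", "and", "for"})):
    """Count lowercase alphabetic words across texts, ignoring stopwords."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z]+", text.lower())
        counts.update(w for w in words if w not in stopwords)
    return counts.most_common(n)

# Made-up sentences standing in for patient notes
sample = ["Chest pain for 2 days", "Pain in the chest and cough", "Cough for a week"]
print(top_words(sample))
```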

<a id="6.1"></a>

## 6.1: Word cloud of patient history notes

In [None]:
wordcloud_notes = wordcloud.WordCloud(
    stopwords=wordcloud.STOPWORDS,
    max_font_size=120, #max_words=5000,
    width = 600, height = 150,
    background_color='white').generate(" ".join(list(patient_notes['pn_history'])))

fig, ax = plt.subplots(figsize=(18,6))
ax.imshow(wordcloud_notes, interpolation='bilinear')
ax.set_axis_off()
plt.imshow(wordcloud_notes);

<a id="6.2"></a>
## 6.2: Word cloud of two-character words in patient history notes

In [None]:
two_letter_words=[]
for note in list(patient_notes['pn_history']):
    for word in note.split():
        if len(word)==2:
            two_letter_words.append(word)
wordcloud_two_chars = wordcloud.WordCloud(
    stopwords=wordcloud.STOPWORDS, 
    max_font_size=120, max_words=len(set(two_letter_words)),
    width = 600, height = 150,
    background_color='white').generate(" ".join(two_letter_words))

fig, ax = plt.subplots(figsize=(18,6))
ax.imshow(wordcloud_two_chars, interpolation='bilinear')
ax.set_axis_off()
plt.imshow(wordcloud_two_chars);

<a href="#toc" role="button" aria-pressed="true" >👆 Table of contents 👆</a>

<a id="6.3"></a>
## 6.3: Word cloud of features

In [None]:
wordcloud_feat = wordcloud.WordCloud(stopwords=wordcloud.STOPWORDS, max_font_size=120, max_words=5000,
                      width = 600, height = 150,
                      background_color='white').generate(" ".join(list(features['feature_text'])))#all_feat))
fig, ax = plt.subplots(figsize=(18,6))
ax.imshow(wordcloud_feat, interpolation='bilinear')
ax.set_axis_off()
plt.imshow(wordcloud_feat);

<a id="6.4"></a>
## 6.4: Word cloud of annotations

In [None]:
train_annot=train[train["annotation"]!='no-annot']
wordcloud_annot = wordcloud.WordCloud(stopwords=wordcloud.STOPWORDS, max_font_size=120, max_words=5000,
                      width = 600, height = 150,
                      background_color='white').generate(" ".join(list(np.hstack(train_annot["annotation"]))))

fig, ax = plt.subplots(figsize=(18,6))
ax.imshow(wordcloud_annot, interpolation='bilinear')
ax.set_axis_off()
plt.imshow(wordcloud_annot);

<a href="#toc" role="button" aria-pressed="true" >👆 Table of contents 👆</a>

<a id="6.5"></a>
## 6.5: Another way to generate word clouds using [stylecloud](https://pypi.org/project/stylecloud/)

In [None]:
!pip install stylecloud

In [None]:
# Another way to get the word cloud for the data -- patient history notes in this image 
import stylecloud
from IPython.display import Image
concat_data = ' '.join([i for i in patient_notes.pn_history.astype(str)])
stylecloud.gen_stylecloud(text=concat_data,
                          icon_name='fas fa-tree',
                          palette='cartocolors.qualitative.Bold_6',
                          background_color='black',
                          gradient='horizontal',
                          size=1024)


Image(filename="./stylecloud.png", width=1024, height=1024)

In [None]:
#train[train[train.case_num==0]['annotation']!='no-annot']
df=train[train.annotation!='no-annot']
concat_data = ' '.join([i for i in df[df.case_num==1]['annotation'].astype(str)])
stylecloud.gen_stylecloud(text=concat_data,
                          icon_name='fas fa-tree', #'fas fa-eye'
                          palette='cartocolors.qualitative.Bold_6',
                          background_color='black',
                          gradient='horizontal',
                          size=1024)


Image(filename="./stylecloud.png", width=1024, height=1024)

<a href="#toc" role="button" aria-pressed="true" >👆 Table of contents 👆</a>

<a id="7"></a>
## 7. Final `train` and `test` datasets

As we know, the training data is spread across `train.csv`, `patient_notes.csv` and `features.csv`. Let's merge these files to get our final training dataset.
We can do a little preprocessing as well. Since we created some new columns during the EDA, let's read all the raw data files again.
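
The merge itself can be sketched on toy dataframes. The example below uses the same join keys as this notebook and adds pandas' `validate="many_to_one"` option, which raises an error if the right-hand table has duplicate keys. Using `validate` is a suggestion rather than something the notebook does, and all values in the toy frames are made up.

```python
import pandas as pd

# Toy stand-ins for the three training files
toy_train = pd.DataFrame({"case_num": [0, 0], "pn_num": [16, 16], "feature_num": [0, 1]})
toy_features = pd.DataFrame({"case_num": [0, 0], "feature_num": [0, 1],
                             "feature_text": ["feature A", "feature B"]})
toy_notes = pd.DataFrame({"case_num": [0], "pn_num": [16], "pn_history": ["note text"]})

# Left joins keep every training row; validate guards against duplicated keys
merged = (toy_train
          .merge(toy_features, how="left", on=["case_num", "feature_num"], validate="many_to_one")
          .merge(toy_notes, how="left", on=["case_num", "pn_num"], validate="many_to_one"))

print(merged.columns.tolist())
```

Checking that the merged frame has the same number of rows as the original `train` frame is a quick way to confirm that no rows were duplicated by the joins.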

In [None]:
# Training data files
train=pd.read_csv(data_dir+"/train.csv")
patient_notes=pd.read_csv(data_dir+"/patient_notes.csv")
features=pd.read_csv(data_dir+"/features.csv")

# Test data file/s
test=pd.read_csv(data_dir+"/test.csv")

# submission sample 
submission=pd.read_csv(data_dir+"/sample_submission.csv")

In [None]:
# Functions for a little preprocessing of the data
def process_feature_text(text):
    text = re.sub('I-year', '1-year', text)
    text = re.sub('-OR-', " or ", text)
    text = re.sub('-', ' ', text)
    return text


def clean_spaces(txt):
    txt = re.sub('\n', ' ', txt)
    txt = re.sub('\t', ' ', txt)
    txt = re.sub('\r', ' ', txt)
#     txt = re.sub(r'\s+', ' ', txt)
    return txt

<a id="7.1"></a>
### 7.1: Final train dataset
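
Before applying the helper functions, here is a standalone sketch of what they do to sample strings. The two functions are re-declared so the block runs on its own (`clean_spaces` is condensed into one regex with the same effect), and the input strings are illustrative.

```python
import re

def process_feature_text(text):
    text = re.sub('I-year', '1-year', text)   # fix the 'I' typed instead of '1'
    text = re.sub('-OR-', ' or ', text)
    text = re.sub('-', ' ', text)             # remaining hyphens become spaces
    return text

def clean_spaces(txt):
    # Replace each newline, tab or carriage return with exactly one space
    return re.sub(r'[\n\t\r]', ' ', txt)

raw_feature = "Last-Pap-smear-I-year-ago"   # illustrative feature text
raw_note = "HPI:\r\n17 yo\tmale"            # made-up note fragment

print(process_feature_text(raw_feature))
print(repr(clean_spaces(raw_note)))
print(len(raw_note) == len(clean_spaces(raw_note)))
```

Because `clean_spaces` swaps every whitespace character for a single space, the text keeps its length. That is presumably why the `\s+` collapse is left commented out: the character offsets in `location` would otherwise stop matching the note text.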

In [None]:
# Merging the data files to get train
train = train.merge(features, how="left", on=["case_num", "feature_num"])
train = train.merge(patient_notes, how="left", on=['case_num', 'pn_num'])

# a little preprocessing of the patient history and feature columns
train['pn_history'] = train['pn_history'].apply(lambda x: x.strip())
train['feature_text'] = train['feature_text'].apply(process_feature_text)

train['feature_text'] = train['feature_text'].apply(clean_spaces)
train['pn_history'] = train['pn_history'].apply(clean_spaces)


In [None]:
train.head(1)

<a href="#toc" role="button" aria-pressed="true" >👆 Table of contents 👆</a>

<a id="7.2"></a>
### 7.2: Final test dataset

In [None]:
# Merging the data files to get test
test = test.merge(features, how="left", on=["case_num", "feature_num"])
test = test.merge(patient_notes, how="left", on=['case_num', 'pn_num'])

# a little preprocessing of the patient history and feature columns
test['pn_history'] = test['pn_history'].apply(lambda x: x.strip())
test['feature_text'] = test['feature_text'].apply(process_feature_text)

test['feature_text'] = test['feature_text'].apply(clean_spaces)
test['pn_history'] = test['pn_history'].apply(clean_spaces)

In [None]:
test.head(1)

<a href="#toc" role="button" aria-pressed="true" >👆 Table of contents 👆</a>

<a id="8"></a>
## 8. The submission sample

This is what the submission should look like!

In [None]:
submission

All good, try your luck now. I hope this notebook will be helpful. 

**Good luck!**

The [following notebook](https://www.kaggle.com/code/odins0n/nbme-detailed-eda) was consulted while creating this notebook.