<a href="https://colab.research.google.com/github/AmarjitMahadik007/Syntactic-Processing/blob/master/Assignment_II.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment Task - Identify Entities in Healthcare Data

## Background:
BeHealthy is a health tech company that aims to connect medical communities with millions of patients across the country. Its web platform lets doctors list their services and manage patient interactions, and lets patients book interactions with doctors and order medicines online. On the platform, doctors can easily organise appointments, track past medical records and provide e-prescriptions.

### Problem Statement:

BeHealthy requires a predictive model that can identify diseases and treatments from patients' interactions with doctors or from their online medicine orders.

### Task Summary:
In this assignment, we need to perform the following broad steps:

- Process and modify the data into sentence format. This step has to be done for the 'train_sent' and 'train_label' datasets, and for the test datasets as well.
- Define the features to build the CRF model.
- Apply these features in each sentence of the train and the test dataset to get the feature values.
- Define the target variable and then build the CRF model.
- Evaluate using a test data set.
- Finally create a dictionary in which diseases are keys and treatments are values.
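
The last step can be sketched as follows. This is a minimal illustration rather than the notebook's final implementation: it assumes the label scheme uses `D` for disease tokens and `T` for treatment tokens (an assumption; adjust to the actual label set), and the helper name `build_disease_treatment_map` is hypothetical. Consecutive tokens with the same label are joined into one phrase, and every treatment phrase in a sentence is attached to every disease phrase in that sentence.

```python
from collections import defaultdict
from itertools import groupby


def build_disease_treatment_map(tagged_sentences, disease_tag="D", treatment_tag="T"):
    """Map each disease phrase to the treatment phrases seen in the same sentence.

    tagged_sentences: list of sentences, each a list of (word, label) pairs.
    The tag names are assumptions; adjust them to the actual label set.
    """
    mapping = defaultdict(list)
    for sentence in tagged_sentences:
        diseases, treatments = [], []
        # Join runs of consecutive tokens that share a label into one phrase
        for label, run in groupby(sentence, key=lambda pair: pair[1]):
            phrase = " ".join(word for word, _ in run)
            if label == disease_tag:
                diseases.append(phrase)
            elif label == treatment_tag:
                treatments.append(phrase)
        # Only sentences that mention both a disease and a treatment contribute
        for disease in diseases:
            for treatment in treatments:
                if treatment not in mapping[disease]:
                    mapping[disease].append(treatment)
    return dict(mapping)


# Toy input, invented for illustration only
example = [
    [("lung", "D"), ("cancer", "D"), ("treated", "O"), ("with", "O"), ("chemotherapy", "T")],
    [("no", "O"), ("entities", "O"), ("here", "O")],
]
result = build_disease_treatment_map(example)
print(result)
```

Pairing by sentence co-occurrence is the simplest possible rule; it will over-link when one sentence mentions several unrelated diseases and treatments.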

### PyCRF:
pycrf is a Python package name associated with Conditional Random Fields (CRFs), a class of statistical models often applied in pattern recognition and machine learning for structured prediction. The model in this notebook is imported from sklearn-crfsuite, described next.

### sklearn-crfsuite:
sklearn-crfsuite is a library that provides a simple interface for training and using CRFs. It's built on top of python-crfsuite, a Python binding for CRFsuite, which is an implementation of Conditional Random Fields (CRFs). The sklearn-crfsuite library offers compatibility with the scikit-learn library, making it easy to use CRFs within the scikit-learn framework.

In [1]:
!pip install pycrf
!pip install sklearn-crfsuite

Collecting pycrf
  Downloading pycrf-0.0.1.tar.gz (1.1 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pycrf
  Building wheel for pycrf (setup.py) ... done
  Created wheel for pycrf: filename=pycrf-0.0.1-py3-none-any.whl size=1870 sha256=5fab08b012a45845aaa6aeeb37e7c97fac8cd51d33aaf52befd1f857780cd6db
  Stored in directory: /root/.cache/pip/wheels/fd/3a/fb/e4d15c9c2b169f43811b23a863ee9717ff3eda5d2301789043
Successfully built pycrf
Installing collected packages: pycrf
Successfully installed pycrf-0.0.1
Collecting sklearn-crfsuite
  Downloading sklearn_crfsuite-0.5.0-py2.py3-none-any.whl (10 kB)
Collecting python-crfsuite>=0.9.7 (from sklearn-crfsuite)
  Downloading python_crfsuite-0.9.10-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.1/1.1 MB 6.7 MB/s eta 0:00:00
Installing collected packages: python-crfsuite, sklearn-c

In [2]:
# Library Import
import pandas as pd
import re
import spacy
import warnings
warnings.filterwarnings('ignore')

# Import model and metrics
from sklearn_crfsuite import CRF, scorers, metrics

In [3]:
with open('train_sent', 'r', encoding='utf-8') as train_sent_file:
  train_sentences = train_sent_file.readlines()

with open('train_label', 'r', encoding='utf-8') as train_labels_file:
  train_labels = train_labels_file.readlines()

with open('test_sent', 'r', encoding='utf-8') as test_sent_file:
  test_sentences = test_sent_file.readlines()

with open('test_label', 'r', encoding='utf-8') as test_labels_file:
  test_labels = test_labels_file.readlines()


In [4]:
Sent_merged = ''
train_sentences_merged = []
for i in range(len(train_sentences)):
  if train_sentences[i] != '\n':
    Sent_merged += train_sentences[i].strip('\n')+' '
  else:
    train_sentences_merged.append(Sent_merged)
    Sent_merged = ''

In [5]:
label_merged = ''
train_labels_merged = []
for i in range(len(train_labels)):
  if train_labels[i] != '\n':
    label_merged += train_labels[i].strip('\n')+' '
  else:
    train_labels_merged.append(label_merged)
    label_merged = ''

In [6]:
Sent_merged = ''
test_sentences_merged = []
for i in range(len(test_sentences)):
  if test_sentences[i] != '\n':
    Sent_merged += test_sentences[i].strip('\n')+' '
  else:
    test_sentences_merged.append(Sent_merged)
    Sent_merged = ''

In [7]:
label_merged = ''
test_labels_merged = []
for i in range(len(test_labels)):
  if test_labels[i] != '\n':
    label_merged += test_labels[i].strip('\n')+' '
  else:
    test_labels_merged.append(label_merged)
    label_merged = ''
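
The four cells above repeat the same loop. A minimal sketch of a shared helper is shown below; the name `merge_token_lines` is hypothetical. Unlike the loops above, it also keeps a final sentence that is not followed by a blank line, skips repeated blank lines, treats whitespace-only lines as blank, and leaves no trailing space, so counts could differ from the notebook's if the files contain any of those cases.

```python
def merge_token_lines(lines):
    """Join one-token-per-line input into space-separated sentences.

    A blank line marks the end of a sentence.
    """
    merged, current = [], []
    for line in lines:
        token = line.strip()
        if token:
            current.append(token)
        elif current:
            merged.append(" ".join(current))
            current = []
    if current:  # flush a final sentence that has no trailing blank line
        merged.append(" ".join(current))
    return merged


# Toy input in the same one-token-per-line layout
demo_lines = ["All\n", "live\n", "births\n", "\n", "The\n", "study\n"]
demo_sentences = merge_token_lines(demo_lines)
print(demo_sentences)
```

With this helper, each of the four datasets could be processed with a single call such as `merge_token_lines(train_sentences)`.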

Count the number of sentences in the processed train and test datasets.

In [8]:
print(len(train_sentences_merged))
print(len(test_sentences_merged))

2599
1056
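
Because the later cells pair sentences with labels using `zip`, which silently stops at the shorter sequence, it can help to confirm that the two line up. The sketch below is one way to do that; `check_alignment` is a hypothetical helper name, and the inputs are the merged lists built above.

```python
def check_alignment(sentences, labels):
    """Return indices of sentences whose token count differs from the label count."""
    if len(sentences) != len(labels):
        raise ValueError(f"{len(sentences)} sentences but {len(labels)} label rows")
    return [
        i
        for i, (sent, lab) in enumerate(zip(sentences, labels))
        if len(sent.split()) != len(lab.split())
    ]


# Toy data: the second sentence has two tokens but only one label
mismatches = check_alignment(["All live births", "The study"], ["O O O", "O"])
print(mismatches)
```

An empty list from `check_alignment(train_sentences_merged, train_labels_merged)` would indicate that every token has a label.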


In [9]:
# Load spaCy's small English model, used below for POS tags and lemmas
nlp = spacy.load("en_core_web_sm")

In [10]:
# DataFrames holding the POS tag, lemma and label of each word in the train and test sentences
train_df = pd.DataFrame(columns=['sentence','word','lemma','pos','label'])
test_df = pd.DataFrame(columns=['sentence','word','lemma','pos','label'])

In [11]:
# Train dataframe

i = 0  # Sentence counter
j = 0  # Row counter

for sent,label in zip(train_sentences_merged,train_labels_merged):
    i+=1
    for s,l in zip(sent.split(),label.split()):
        doc = nlp(s)
        for tok in doc:
            train_df.loc[j,['sentence','word','lemma','pos','label']] = [i,tok.text,tok.lemma_,tok.pos_,l]
            j+=1

In [12]:
# Test dataframe

i = 0  # Sentence counter
j = 0  # Row counter

for sent,label in zip(test_sentences_merged,test_labels_merged):
    i+=1
    for s,l in zip(sent.split(),label.split()):
        doc = nlp(s)
        for tok in doc:
            test_df.loc[j,['sentence','word','lemma','pos','label']] = [i,tok.text,tok.lemma_,tok.pos_,l]
            j+=1
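
Assigning one row at a time with `.loc` grows the dataframe on every token, which tends to be slow on a corpus of this size. A common alternative, sketched below under assumptions, is to collect rows in a list and build the dataframe once. The helper `build_token_frame` and the stand-in `toy_analyse` are hypothetical; the stand-in only exists so the example runs without a spaCy model.

```python
import pandas as pd

COLUMNS = ["sentence", "word", "lemma", "pos", "label"]


def build_token_frame(sentences, labels, analyse):
    """Build the token-level dataframe in one pass.

    analyse(word) must return an iterable of (text, lemma, pos) tuples,
    one per token found in the word.
    """
    rows = []
    for sent_id, (sent, label) in enumerate(zip(sentences, labels), start=1):
        for word, tag in zip(sent.split(), label.split()):
            for text, lemma, pos in analyse(word):
                rows.append((sent_id, text, lemma, pos, tag))
    return pd.DataFrame(rows, columns=COLUMNS)


def toy_analyse(word):
    # Stand-in analyser (assumption): lowercases the word and uses a dummy POS tag
    return [(word, word.lower(), "X")]


demo_df = build_token_frame(["All live births", "The study"], ["O O O", "O O"], toy_analyse)
print(demo_df)
```

In the notebook, the analyser could be `lambda w: [(t.text, t.lemma_, t.pos_) for t in nlp(w)]`, which reproduces the per-word tagging used in the cells above.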

In [13]:
# Combine train and test tokens to count word frequencies for the NOUN and PROPN tags
freq_df = pd.concat((train_df, test_df), axis=0)

In [14]:
# Resetting index
freq_df.reset_index(inplace=True,drop=True)

In [15]:
# Top 25 most frequent NOUN/PROPN words across the train and test datasets
freq_df[(freq_df['pos'] == 'NOUN') | ((freq_df['pos'] == 'PROPN'))]['word'].value_counts()[:25]

word
patients        492
treatment       281
cancer          200
therapy         175
disease         143
cell            140
lung            116
group            94
gene             88
chemotherapy     88
effects          85
results          79
women            77
patient          75
TO_SEE           75
surgery          71
risk             71
cases            71
analysis         70
human            67
rate             67
response         66
survival         65
children         64
effect           64
Name: count, dtype: int64

In [16]:
# Top 25 most frequent NOUN/PROPN lemmas across the train and test datasets
freq_df[(freq_df['pos'] == 'NOUN') | ((freq_df['pos'] == 'PROPN'))]['lemma'].value_counts()[:25]

lemma
patient         587
treatment       316
cancer          226
cell            203
therapy         182
disease         172
effect          163
case            132
group           128
lung            120
result          118
gene            112
year            105
rate            102
trial            91
chemotherapy     91
woman            89
analysis         86
protein          82
response         81
risk             78
child            78
human            77
TO_SEE           75
mutation         75
Name: count, dtype: int64
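
The two frequency cells apply the same filter to different columns. A small sketch of how that could be factored out is given below; `top_nouns` is a hypothetical helper, and the toy dataframe is invented for illustration. It also shows why lemma counts can exceed word counts: inflected forms such as "patients" and "patient" are counted together under one lemma.

```python
import pandas as pd


def top_nouns(df, column, n=25):
    """Most frequent values of `column` among NOUN and PROPN tokens."""
    is_noun = df["pos"].isin(["NOUN", "PROPN"])
    return df.loc[is_noun, column].value_counts().head(n)


# Toy data, invented for illustration only
demo = pd.DataFrame({
    "word": ["patients", "patient", "treated", "patients"],
    "lemma": ["patient", "patient", "treat", "patient"],
    "pos": ["NOUN", "NOUN", "VERB", "NOUN"],
})
word_counts = top_nouns(demo, "word")
lemma_counts = top_nouns(demo, "lemma")
print(word_counts)
print(lemma_counts)
```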

In [23]:
# Dataframe (Sentence, word, POS) visualisation
train_df.head(90)

Unnamed: 0,sentence,word,lemma,pos,label
0,1,All,all,PRON,O
1,1,live,live,VERB,O
2,1,births,birth,NOUN,O
3,1,>,>,PUNCT,O
4,1,or,or,CCONJ,O
...,...,...,...,...,...
85,4,The,the,PRON,O
86,4,`,`,PUNCT,O
87,4,`,`,PUNCT,O
88,4,corrected,correct,VERB,O


In [None]:
test_df.head(5)

In [None]:
# Sentence-wise detail preparation
# Note: `sentencedetail` is defined in the next cell; run that cell first
# Fetch the detailed view of each sentence for the train set
train_sent_obj = sentencedetail(train_df)
train_sent_detail = train_sent_obj.sentences
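
Each element of `train_sent_detail` is a list of `(word, pos, label)` tuples, one per token. The sketch below shows one way such a sentence could be turned into the per-token feature dictionaries and label lists that a CRF is trained on. The feature set and the helper names (`word2features`, `sent2features`, `sent2labels`) are illustrative assumptions, not the notebook's final choice.

```python
def word2features(sentence, i):
    """Features for the token at position i of a list of (word, pos, label) tuples."""
    word, pos, _ = sentence[i]
    features = {
        "bias": 1.0,
        "word.lower": word.lower(),
        "word.suffix3": word[-3:],
        "word.isupper": word.isupper(),
        "word.isdigit": word.isdigit(),
        "pos": pos,
    }
    if i > 0:
        # Context from the previous token
        prev_word, prev_pos, _ = sentence[i - 1]
        features["prev.word.lower"] = prev_word.lower()
        features["prev.pos"] = prev_pos
    else:
        features["BOS"] = True  # beginning of sentence
    if i == len(sentence) - 1:
        features["EOS"] = True  # end of sentence
    return features


def sent2features(sentence):
    return [word2features(sentence, i) for i in range(len(sentence))]


def sent2labels(sentence):
    return [label for _, _, label in sentence]


# Toy sentence in the same (word, pos, label) layout
demo_sentence = [("All", "PRON", "O"), ("live", "VERB", "O"), ("births", "NOUN", "O")]
X_demo = sent2features(demo_sentence)
y_demo = sent2labels(demo_sentence)
print(X_demo[1])
```

Applied to every sentence, this yields a list of feature-dictionary lists and a parallel list of label lists for the train and test sets.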

In [None]:
# A class to retrieve sentence details from the dataframe
class sentencedetail(object):
    def __init__(self, data):
        self.data = data
        self.empty = False
        # Turn each sentence group into a list of (word, pos, label) tuples
        agg_func = lambda s: list(zip(s["word"].values.tolist(),
                                      s["pos"].values.tolist(),
                                      s["label"].values.tolist()))
        # One entry per sentence id, in ascending order of the "sentence" column
        self.grouped = self.data.groupby("sentence").apply(agg_func)
        self.sentences = [s for s in self.grouped]