In [1]:
!pip install pycrf
!pip install sklearn-crfsuite

import spacy
import sklearn_crfsuite
from sklearn_crfsuite import metrics

model = spacy.load("en_core_web_sm")



ValueError: 'in' is not a valid parameter name

In [2]:
import pandas as pd

### Data Preprocessing

The dataset is provided with one word per line. The format is as follows:

1. A sentence of x words occupies x consecutive lines, with one word on each line.

2. Consecutive sentences are separated by an empty line. The label files follow the same format.
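To make the format concrete, here is a minimal sketch of the reconstruction step on a made-up input. The sample lines and the helper name `sentences_from_lines` are hypothetical and only illustrate the idea; they are not taken from the dataset.

```python
def sentences_from_lines(lines):
    """Join one-token-per-line input into sentences; an empty line ends a sentence."""
    sentences, current = [], []
    for line in lines:
        token = line.strip()
        if token:
            current.append(token)
        elif current:
            sentences.append(" ".join(current))
            current = []
    if current:  # flush the last sentence when there is no trailing empty line
        sentences.append(" ".join(current))
    return sentences


# Hypothetical miniature input: two sentences separated by an empty line
sample_lines = ["Aspirin\n", "relieves\n", "headache\n", "\n", "Rest\n", "helps\n"]
sample_sentences = sentences_from_lines(sample_lines)
print(sample_sentences)
```

The same helper would apply unchanged to a label file, since labels use the identical one-item-per-line layout.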

##### We need to pre-process the data to recover the complete sentences and their labels.

##### Construct the proper sentences from individual words and print the 5 sentences.

![image.png](attachment:image.png)

![image.png](attachment:image.png)

In [3]:
# Create a function to process the file and return a sentence list
def preprocess_inputfile(input_file):
    # Read all lines; the context manager closes the file even if reading fails
    with open(input_file, 'r') as i_file:
        file_lines = i_file.readlines()

    output_list = []

    full_sentence = ""

    for each_word in file_lines:
        each_word = each_word.strip()
        if each_word == "":
            output_list.append(full_sentence) # To append the complete sentence to the output list
            full_sentence = "" # For new sentence start
        else:
            if full_sentence:
                full_sentence += " " + each_word
            else:
                full_sentence = each_word

    # Keep the last sentence when the file does not end with an empty line
    if full_sentence:
        output_list.append(full_sentence)

    return output_list

In [5]:
train_sentences = preprocess_inputfile('E:/ML and AI/NLP/Syntactic Processing/Assignment/train_sent')
train_labels = preprocess_inputfile('E:/ML and AI/NLP/Syntactic Processing/Assignment/train_label')
test_sentences = preprocess_inputfile('E:/ML and AI/NLP/Syntactic Processing/Assignment/test_sent')
test_labels = preprocess_inputfile('E:/ML and AI/NLP/Syntactic Processing/Assignment/test_label')

In [6]:
# Print first five sentences from the processed dataset
for each_item in range(5):
    print(f"Sentence {each_item+1} is: {train_sentences[each_item]}")
    print(f"Label {each_item+1} is: {train_labels[each_item]}")
    print("*"*100)

Sentence 1 is: All live births > or = 23 weeks at the University of Vermont in 1995 ( n = 2395 ) were retrospectively analyzed for delivery route , indication for cesarean , gestational age , parity , and practice group ( to reflect risk status )
Label 1 is: O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O
****************************************************************************************************
Sentence 2 is: The total cesarean rate was 14.4 % ( 344 of 2395 ) , and the primary rate was 11.4 % ( 244 of 2144 )
Label 2 is: O O O O O O O O O O O O O O O O O O O O O O O O O
****************************************************************************************************
Sentence 3 is: Abnormal presentation was the most common indication ( 25.6 % , 88 of 344 )
Label 3 is: O O O O O O O O O O O O O O O
****************************************************************************************************
Sentence 4 is: The `` corrected '' ce

#### Count the number of sentences in the processed train and test dataset

In [7]:
print(f"Number of sentences in processed train dataset is: {len(train_sentences)}")
print(f"Number of sentences in processed test dataset is: {len(test_sentences)}")

Number of sentences in processed train dataset is: 2599
Number of sentences in processed test dataset is: 1056


#### Count the number of lines of labels in the processed train and test dataset.

In [8]:
print(f"Number of lines of labels in processed train dataset is: {len(train_labels)}")
print(f"Number of lines of labels in processed test dataset is: {len(test_labels)}")

Number of lines of labels in processed train dataset is: 2599
Number of lines of labels in processed test dataset is: 1056


### Concept Identification

We will first explore the concepts present in the dataset. For this, we will use PoS tagging.

We will identify all the words in the corpus that have a tag of NOUN or PROPN (nouns) and prepare a dictionary of their counts. We will then output the 25 most frequently discussed concepts in the entire corpus.

Note that we use both the train and test sentences here. This is acceptable because we apply a pre-trained model directly to our data as an exploratory analysis of the complete dataset. Since we are not training anything, there is no point in discarding the information in the test data.
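The counting step can be sketched independently of spaCy. The example below assumes the tokens have already been tagged and uses a small hand-written list of `(text, pos)` pairs; both the data and the helper name `top_concepts` are hypothetical.

```python
from collections import Counter


def top_concepts(tagged_tokens, n=25, keep=("NOUN", "PROPN")):
    """Count tokens whose PoS tag is in `keep` and return the n most frequent."""
    counts = Counter(text for text, pos in tagged_tokens if pos in keep)
    return counts.most_common(n)


# Hand-written (text, pos) pairs standing in for real tagger output
tagged = [
    ("patients", "NOUN"),
    ("received", "VERB"),
    ("chemotherapy", "NOUN"),
    ("patients", "NOUN"),
    ("Vermont", "PROPN"),
]
top = top_concepts(tagged, n=2)
print(top)
```

A `Counter` avoids building an intermediate pandas Series when only the frequency table is needed.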

##### Extract those tokens which have NOUN or PROPN as their PoS tag and find their frequency

In [9]:
# Creating a list to hold all the tokens which are either NOUN or PROPER NOUN
noun_propn_tokens_list = []

In [14]:
# Each token which is a NOUN or PROPN will be appended to the list "noun_propn_tokens_list"
for sentences in (train_sentences, test_sentences):
    for sent in sentences:
        processed_sent = model(sent)
        for each_token in processed_sent:
            if each_token.pos_ == "NOUN" or each_token.pos_ == "PROPN":
                noun_propn_tokens_list.append(each_token.text)

TypeError: Inputs to a layer should be tensors. Got 'All live births > or = 23 weeks at the University of Vermont in 1995 ( n = 2395 ) were retrospectively analyzed for delivery route , indication for cesarean , gestational age , parity , and practice group ( to reflect risk status )' (of type <class 'str'>) as input for layer 'model'.

In [15]:
# Creating a Series to hold the tokens which are either NOUN or PROPER NOUN
df_noun_propn = pd.Series(noun_propn_tokens_list)

#### Print the top 25 most common tokens with NOUN or PROPN PoS tags

In [16]:
# Get the count of each token and show the 25 most frequent ones
# (value_counts already sorts in descending order)
df_noun_propn.value_counts().head(25)

Series([], Name: count, dtype: int64)

### Defining features for CRF

We will train a custom CRF to identify diseases (D) and treatments (T) from the data. For this, we will use the training data to train the model and evaluate it on the test set.

Things to check:

1. All features need to be correctly defined.

2. Only the previous word should be used, in addition to the current word, when defining additional features.

3. The BEG and END words have been correctly marked.

4. PoS tags (pos_tags) have been correctly passed to the method and used.


In [17]:
# Let's define the features to get the feature value for one word.

def getFeaturesForOneWord(sentence, pos, pos_tags):
  word = sentence[pos]

  features = [
    'word.lower=' + word.lower(), # serves as word id
    'word[-3:]=' + word[-3:],     # last three characters
    'word[-2:]=' + word[-2:],     # last two characters
    'word.isupper=%s' % word.isupper(),  # is the word in all uppercase
    'word.isdigit=%s' % word.isdigit(),  # is the word a number
    'word.startsWithCapital=%s' % word[0].isupper(), # is the word starting with a capital letter
    'word.pos=' + pos_tags[pos]
  ]

  #Use the previous word also while defining features
  if(pos > 0):
    prev_word = sentence[pos-1]
    features.extend([
    'prev_word.lower=' + prev_word.lower(), 
    'prev_word.isupper=%s' % prev_word.isupper(),
    'prev_word.isdigit=%s' % prev_word.isdigit(),
    'prev_word.startsWithCapital=%s' % prev_word[0].isupper(),
    'prev_word.pos=' + pos_tags[pos-1]
  ])
  # Mark the beginning and the end words of a sentence correctly in the form of features.
  else:
    features.append('BEG') # feature to track begin of sentence 

  if(pos == len(sentence)-1):
    features.append('END') # feature to track end of sentence

  return features
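As a quick check of the boundary logic, the sketch below isolates only the BEG and END marking and applies it to a made-up sentence. The helper `boundary_markers` is hypothetical and mirrors the position tests used in the function above; it is not a replacement for it.

```python
def boundary_markers(words, pos):
    """Return the sentence-boundary features for the word at index `pos`."""
    markers = []
    if pos == 0:
        markers.append("BEG")  # first word of the sentence
    if pos == len(words) - 1:
        markers.append("END")  # last word of the sentence
    return markers


# Made-up sentence, already split into words
words = ["Aspirin", "relieves", "headache"]
markers = [boundary_markers(words, i) for i in range(len(words))]
print(markers)
```

Note that a one-word sentence receives both markers, because its only word is simultaneously the first and the last.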

### Getting the features

#### Write a code/function to get the features for a sentence

In [18]:
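from spacy.tokens import Doc

# Function to get features for a sentence.
def getFeaturesForOneSentence(sentence):
    sentence_list = sentence.split()
    if not sentence_list:
        return []

    # Build the Doc from the whitespace-split words so that spaCy's tokens
    # (and therefore the PoS tags) line up one-to-one with sentence_list.
    # Calling model(sentence) may re-tokenize and misalign the tags.
    processed_sent = Doc(model.vocab, words=sentence_list)
    for _, component in model.pipeline:
        processed_sent = component(processed_sent)

    postags = [each_token.pos_ for each_token in processed_sent]

    return [getFeaturesForOneWord(sentence_list, pos, postags) for pos in range(len(sentence_list))]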

#### Write a code/function to get the labels of a sentence

In [19]:
# Function to get the labels for a sentence.
def getLabelsInListForOneSentence(labels):
  return labels.split()

### Define input and target variables

Compute the X and Y sequences for the training and test data, and check that both the sentences and the labels have been processed.
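One simple sanity check is that every sentence has exactly as many feature vectors as labels. The sketch below runs that check on toy data; the helper name `check_alignment` and the toy sequences are hypothetical.

```python
def check_alignment(X, Y):
    """Return the indices of sentences whose feature and label counts differ."""
    if len(X) != len(Y):
        raise ValueError(f"{len(X)} feature sequences but {len(Y)} label sequences")
    return [i for i, (x, y) in enumerate(zip(X, Y)) if len(x) != len(y)]


# Toy data: the second sentence has one feature vector but two labels
X_toy = [[["word.lower=a"], ["word.lower=b"]], [["word.lower=c"]]]
Y_toy = [["O", "D"], ["O", "T"]]
mismatched = check_alignment(X_toy, Y_toy)
print(mismatched)
```

An empty result means every token has a label, which the CRF needs before fitting.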

##### Define the features' values for each sentence as input variable for CRF model in test and the train dataset

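X_train = [getFeaturesForOneSentence(sentence) for sentence in train_sentences]
X_test = [getFeaturesForOneSentence(sentence) for sentence in test_sentences]

##### Define the labels as the target variable for the test and the train dataset

Y_train = [getLabelsInListForOneSentence(labels) for labels in train_labels]
Y_test = [getLabelsInListForOneSentence(labels) for labels in test_labels]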

### Build the CRF Model

In [21]:
# This is needed to not get the error AttributeError: 'CRF' object has no attribute 'keep_tempfiles'
# pip install scikit-learn==0.22.2 --user

In [22]:
# Build the CRF model.
crf = sklearn_crfsuite.CRF(max_iterations=100)
crf.fit(X_train, Y_train)

NameError: name 'sklearn_crfsuite' is not defined

### Evaluation

#### Predict the label of each token in each sentence of the preprocessed test dataset.

Y_pred = crf.predict(X_test)

#### Calculate the f1 score using the actual labels and the predicted labels of the test dataset

f1_score = metrics.flat_f1_score(Y_test, Y_pred, average='weighted')
print(f"F1 score is: {round(f1_score,4)}")
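To show what a flat, weighted F1 score measures, here is a pure-Python sketch on toy labels. It is an illustration under the usual definition (per-label F1 averaged with weights proportional to each label's true count) and is not the library's implementation; the function name `flat_weighted_f1` and the toy data are hypothetical.

```python
from collections import Counter


def flat_weighted_f1(y_true, y_pred):
    """Flatten per-sentence label lists and compute the support-weighted F1."""
    true_flat = [label for sent in y_true for label in sent]
    pred_flat = [label for sent in y_pred for label in sent]
    if not true_flat:
        return 0.0

    support = Counter(true_flat)    # true occurrences per label
    predicted = Counter(pred_flat)  # predicted occurrences per label
    correct = Counter(t for t, p in zip(true_flat, pred_flat) if t == p)

    total = 0.0
    for label, n_true in support.items():
        precision = correct[label] / predicted[label] if predicted[label] else 0.0
        recall = correct[label] / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        total += f1 * n_true        # weight each label by its support
    return total / len(true_flat)


# Toy labels: the single treatment token is missed by the prediction
y_true_toy = [["O", "D", "O"], ["T", "O"]]
y_pred_toy = [["O", "D", "O"], ["O", "O"]]
score = flat_weighted_f1(y_true_toy, y_pred_toy)
print(round(score, 4))
```

Because the O label dominates the data, the weighted score can stay high even when D and T tokens are missed, so inspecting per-label scores alongside it is worthwhile.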

### Identifying Diseases and Treatments using Custom NER

We now use the CRF model's prediction to prepare a record of diseases identified in the corpus and treatments used for the diseases.

Create the logic to get all the predicted treatments (T) labels corresponding to each disease (D) label in the test dataset.
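One possible way to build this record is sketched below. It assumes that consecutive tokens with the same D or T label form a single phrase and that every treatment found in a sentence applies to every disease in that sentence. The helper names and the toy sentences are hypothetical.

```python
from collections import defaultdict


def extract_phrases(tokens, labels, target):
    """Group consecutive tokens carrying the `target` label into phrases."""
    phrases, current = [], []
    for token, label in zip(tokens, labels):
        if label == target:
            current.append(token)
        elif current:
            phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases


def map_diseases_to_treatments(sentences, predicted_labels):
    """Map each predicted disease phrase to the treatments in the same sentence."""
    mapping = defaultdict(set)
    for sentence, labels in zip(sentences, predicted_labels):
        tokens = sentence.split()
        treatments = extract_phrases(tokens, labels, "T")
        for disease in extract_phrases(tokens, labels, "D"):
            mapping[disease].update(treatments)
    # Drop diseases for which no treatment was predicted
    return {d: sorted(t) for d, t in mapping.items() if t}


# Toy sentences with hand-written "predictions"
toy_sentences = ["chemotherapy is used for lung cancer", "rest is advised"]
toy_labels = [["T", "O", "O", "O", "D", "D"], ["O", "O", "O"]]
record = map_diseases_to_treatments(toy_sentences, toy_labels)
print(record)
```

With the real model, `toy_sentences` and `toy_labels` would be replaced by `test_sentences` and `Y_pred`.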

![image.png](attachment:image.png)