# Expanding the existing intents

In order to find the right disease, we need to get the right symptoms from the user. Hence, in our case, every symptom is one intent.
Keep in mind that the description of one symptom can be similar to the descriptions of other symptoms. It is therefore better to use only as many keywords as needed, so that the different intents/symptoms overlap as little as possible.
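
One way to make the overlap between intents visible is to compare the words their patterns share. The following is only a minimal sketch: the toy intents and the helper names (`token_set`, `jaccard`, `pairwise_overlap`) are assumptions made for illustration and are not part of this notebook or of `intents.json`.

```python
from itertools import combinations

# Hypothetical toy intents, only for illustration (not taken from intents.json)
toy_intents = {
    "headache": ["headache", "head pain", "my head hurts"],
    "neck_pain": ["neck pain", "my neck hurts"],
    "fatigue": ["tired", "exhausted"],
}

def token_set(patterns):
    """Collect the lowercase words used across all patterns of one intent."""
    return {word for pattern in patterns for word in pattern.lower().split()}

def jaccard(a, b):
    """Share of words two intents have in common (0 = disjoint, 1 = identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def pairwise_overlap(intents):
    """Compute the Jaccard overlap for every pair of intents."""
    tokens = {tag: token_set(patterns) for tag, patterns in intents.items()}
    return {
        (t1, t2): jaccard(tokens[t1], tokens[t2])
        for t1, t2 in combinations(tokens, 2)
    }

overlaps = pairwise_overlap(toy_intents)

# Print the pairs with the largest overlap first
for pair, score in sorted(overlaps.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
```

Pairs with a high score share many keywords and are candidates for trimming their patterns. Generic words such as "my" or "hurts" inflate the score, so a stop-word filter may be worth adding in practice.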

### Imports

In [1]:
#import standard libs
import pandas as pd
import numpy as np

#import support libs
import itertools
from itertools import chain
import json

#import nlp libs
from nltk.corpus import wordnet
import spacy
nlp = spacy.load('en_core_sci_lg')

#open hand-written intents
with open('intents/intents.json', 'r') as f: #read-only access is enough here
    intents = json.load(f)

In [2]:
#create new intents as dictionary
data = dict()
data['intents'] = []

### Data Transformation

The following script extends the existing patterns of every intent with synonyms of its tag, provided the tag is simple.
Simple means that the tag consists of one or two words. Such tags are considered simple because synonyms for them can be composed easily.
Some tags, like "prominent_veins_on_calf", are more complex. Synonyms for them are harder to find and are not covered here.
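
The idea for two-word tags can be sketched without WordNet or spaCy: look up synonyms per word, drop empty entries, and combine the two lists with `itertools.product`. The helper names `is_simple_tag` and `combine_synonyms` and the synonym lists below are hypothetical stand-ins for illustration, not the output of the actual lookup.

```python
import itertools

def is_simple_tag(tag, max_words=2):
    """A tag counts as simple if it has at most `max_words` underscore-separated words."""
    return len(tag.split('_')) <= max_words

def combine_synonyms(first_words, second_words):
    """Build every 'first second' pattern from two synonym lists, skipping empty strings."""
    first = [word for word in first_words if word]
    second = [word for word in second_words if word]
    return [' '.join(pair) for pair in itertools.product(first, second)]

# Hypothetical synonym lists; the empty string mimics a synonym that was filtered out
neck_synonyms = ['neck', 'cervix', '']
pain_synonyms = ['pain', 'ache']

print(is_simple_tag('neck_pain'))
print(is_simple_tag('prominent_veins_on_calf'))
print(combine_synonyms(neck_synonyms, pain_synonyms))
```

Because the product grows with the size of both lists, filtering the synonyms before combining them keeps the number of generated patterns manageable.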

In [3]:
for i in range(len(intents['intents'])):
        
    intent = intents['intents'][i]
    tag = intent['tag']
    pattern = intent['patterns']

    
    #single words like headache
    
    #get synonyms for the tag
    synonyms = wordnet.synsets(tag)
    new_words = set(chain.from_iterable([word.lemma_names() for word in synonyms]))
    new_words = [word.replace('_', ' ') for word in new_words] #this is a list with every synonym found by the wordnet from nltk
    
    docs = []
    for item in new_words:
        
        #filter out synonyms that have no scientific background
        #note: this step is needed because WordNet returns odd synonyms such as "make out" for "neck" (the body part)
        #we could only use a scientific model from scispaCy here but would have preferred a medical one (more on that at the end)
        doc = list(nlp(item).ents)

        X = []
        for x in doc:
            x = str(x)
            X.append(x)

        X = ' '.join(X)
        docs.append(X)
    
    
    
    pattern.extend([d for d in docs if d]) #drop empty strings left by synonyms without any recognized entity
    
    #two words like neck_pain
    #same process as above, but synonyms are looked up for each of the two words separately, filtered, and then combined into one pattern
    sentence = tag.split('_')
    if len(sentence) == 2:
        liste=[]
        for part in sentence:
            synonyms = wordnet.synsets(part)
            part = set(chain.from_iterable([word.lemma_names() for word in synonyms]))
            part = [word.replace('_', ' ') for word in part]

            docs= []
            for item in part:
                doc = list(nlp(item).ents)

                X = []
                for x in doc:
                    x = str(x)

                    X.append(x)
                    
                X = ' '.join(X)
                docs.append(X)

            liste.append(docs)

        liste2=[]
        for part in liste:
            part = [x for x in part if x]
            liste2.append(part)

        liste2 = list(itertools.product(liste2[0], liste2[1]))
        
        for sentence in liste2:
            string = ' '.join(sentence) 

            pattern.append(string)
            
    ###########################################################        
    
    #add the patterns and their tag (the intent) to the new intents
    
    pattern = list(set(pattern))
    
    dictionary = dict()
    dictionary['tag']=tag
    dictionary['patterns']=pattern
    
    data['intents'].append(dictionary)

In [4]:
#save intents as new intents
json_object = json.dumps(data, indent = 4)

with open("intents_new2.json", "w") as outfile:
    outfile.write(json_object)

Note that this is only our first idea for expanding the patterns of our intents.
Other approaches would involve extensive text and data mining to find descriptions. Laypeople tend to describe their symptoms with simple words but long explanations, so such descriptions would have to be collected. This could, however, lead to more problems like the overlap between intents described at the top of this notebook.

## Extra

The following code snippet is a first step towards using the UMLS from the NIH (a huge medical database).
It was meant to become the alternative to the scientific dataset used above. There would be other uses for it as well: the database could become the foundation of all the intents, although that would require a large amount of data management and preprocessing.
To use this dataset of medical terminologies, you need to register on the site and apply for access. For that reason, this approach has not been covered so far.

In [None]:
#import pymedtermino as med
#med.LANGUAGE = 'en'
#med.REMOVE_SUPRESSED_CONCEPTS = True
#from pymedtermino import *
#from pymedtermino.snomedct import *