# Medical Specialty Recommender
This notebook trains a TF-IDF model that takes a patient's free-text query and recommends which medical specialty to consult.

In [3]:
!pip install nltk

Collecting nltk
  Downloading nltk-3.8.1-py3-none-any.whl (1.5 MB)
     ---------------------------------------- 1.5/1.5 MB 578.4 kB/s eta 0:00:00
Collecting tqdm
  Downloading tqdm-4.66.4-py3-none-any.whl (78 kB)
     -------------------------------------- 78.3/78.3 kB 872.2 kB/s eta 0:00:00
Collecting regex>=2021.8.3
  Downloading regex-2024.5.10-cp310-cp310-win_amd64.whl (268 kB)
     ------------------------------------ 269.0/269.0 kB 871.1 kB/s eta 0:00:00
Installing collected packages: tqdm, regex, nltk
Successfully installed nltk-3.8.1 regex-2024.5.10 tqdm-4.66.4





In [4]:
# Some necessary NLP resources
import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Yasmine\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Yasmine\AppData\Roaming\nltk_data...
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\Yasmine\AppData\Roaming\nltk_data...
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Yasmine\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [1]:
import pandas as pd



# Loading and Preparing the Dataset
We need a dataset of medical specialties and their descriptions, so we compile a description of what each specialty deals with.

### Now we can work with the new CSV

In [2]:
pd.options.display.max_colwidth = 250
df = pd.read_csv("SpecialtyDescriptions.csv", usecols=["Specialty", "sentence"])

In [7]:
df.sample(2)

Unnamed: 0,Specialty,sentence
14,Nephrology,"Nephrologists are specialized physicians who diagnose, treat, and manage acute and chronic kidney conditions. Living with kidney conditions can be very difficult and affect one’s quality of life, so it’s always important to prioritize preventive ..."
5,Hematology,"A hematologist is a doctor who specializes in researching, diagnosing, treating, and preventing blood disorders and disorders of the lymphatic system (lymph nodes and vessels).\nIf your primary care physician has recommended that you see a hemato..."


## Preparing the NLP Functions Needed

In [3]:
# Drop specialties without a description sentence
df.dropna(inplace=True)

In [4]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import re

STOPWORDS = set(stopwords.words('english'))
MIN_WORDS = 4
MAX_WORDS = 200

PATTERN_S = re.compile(r"'s")  # matches the possessive `'s` (ASCII apostrophe only)
PATTERN_RN = re.compile(r"\r\n")  # matches the two-character sequence `\r\n`; a lone `\n` is not matched
PATTERN_PUNC = re.compile(r"[^\w\s]")  # matches any character that is neither a word character nor whitespace

def clean_text(text):
    """
    Lowercase the text, then replace possessive 's, CR LF line breaks and
    punctuation with spaces. Digits and underscores are kept.
    text (str): input text
    return (str): cleaned text
    """
    text = text.lower()  # lowercase text
    text = re.sub(PATTERN_S, ' ', text)
    text = re.sub(PATTERN_RN, ' ', text)
    text = re.sub(PATTERN_PUNC, ' ', text)
    return text

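def tokenizer(sentence, min_words=MIN_WORDS, max_words=MAX_WORDS, stopwords=STOPWORDS, lemmatize=True):
    """
    Tokenize the sentence and, if `lemmatize` is set, lemmatize each token.
    """
    if lemmatize:
        lemmatizer = WordNetLemmatizer()
        tokens = [lemmatizer.lemmatize(w) for w in word_tokenize(sentence)]
    else:
        tokens = word_tokenize(sentence)
    # NOTE: this length and stop-word filter is stored in `token`, but the function
    # returns `tokens`, so the filter has no effect on the results in this notebook.
    # It also compares token length in characters, despite the `*_words` names.
    token = [w for w in tokens if (len(w) > min_words and len(w) < max_words and w not in stopwords)]
    return tokens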


def clean_sentences(df):
    """
    Remove irrelevant characters (in new column clean_sentence).
    Lemmatize, tokenize words into list of words (in new column tok_lem_sentence).
    """
    print('Cleaning sentences...')
    df['clean_sentence'] = df['sentence'].apply(clean_text)
    df['tok_lem_sentence'] = df['clean_sentence'].apply(
        lambda x: tokenizer(x, min_words=MIN_WORDS, max_words=MAX_WORDS, stopwords=STOPWORDS, lemmatize=True))
    return df

df = clean_sentences(df)

Cleaning sentences...
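
The descriptions in this dataset use bare `\n` line breaks (visible in the `df.sample` output earlier), while `PATTERN_RN` matches only the two-character sequence `\r\n`. The sketch below illustrates the difference on a made-up string. `PATTERN_ANY_BREAK` is a hypothetical alternative, not part of the notebook, that would replace any run of carriage returns or newlines.

```python
import re

PATTERN_CRLF = re.compile(r"\r\n")          # same regex as PATTERN_RN in the notebook
PATTERN_ANY_BREAK = re.compile(r"[\r\n]+")  # hypothetical alternative: any run of CR/LF

# Made-up text containing one bare "\n" and one "\r\n"
text = "what does a gynecologist do?\na gynecologist is a doctor\r\nwho specializes"

crlf_only = PATTERN_CRLF.sub(" ", text)
any_break = PATTERN_ANY_BREAK.sub(" ", text)

print(repr(crlf_only))  # the bare "\n" survives
print(repr(any_break))  # every line break becomes a space
```

Whether the remaining `\n` characters matter depends on the tokenizer; `word_tokenize` treats them as whitespace, so the effect is mostly cosmetic in the `clean_sentence` column.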


In [5]:
print(len(df))
df[['sentence', 'clean_sentence', 'tok_lem_sentence']]

18


Unnamed: 0,sentence,clean_sentence,tok_lem_sentence
0,"What does a gynecologist do?\nA gynecologist is a doctor who specializes in women's reproductive health.\nThat is, a gynecologist is a specialist with expertise treating issues or conditions affecting a woman's reproductive system, including:\nVa...",what does a gynecologist do \na gynecologist is a doctor who specializes in women reproductive health \nthat is a gynecologist is a specialist with expertise treating issues or conditions affecting a woman reproductive system including \nvagi...,"[what, doe, a, gynecologist, do, a, gynecologist, is, a, doctor, who, specializes, in, woman, reproductive, health, that, is, a, gynecologist, is, a, specialist, with, expertise, treating, issue, or, condition, affecting, a, woman, reproductive, ..."
1,"An otolaryngologist, or ENT, is a healthcare specialist who treats conditions affecting your ears, nose and throat. They can also perform head and neck surgeries, including surgeries on your ears, mouth, throat, nose, neck and face.\nAnother name...",an otolaryngologist or ent is a healthcare specialist who treats conditions affecting your ears nose and throat they can also perform head and neck surgeries including surgeries on your ears mouth throat nose neck and face \nanother name...,"[an, otolaryngologist, or, ent, is, a, healthcare, specialist, who, treat, condition, affecting, your, ear, nose, and, throat, they, can, also, perform, head, and, neck, surgery, including, surgery, on, your, ear, mouth, throat, nose, neck, and, ..."
2,You don’t want to take any chances of losing your vision. This is why it’s smart to know when you need to visit an eye doctor. Here are a few signs that you might need to book an appointment with a specialist:\nBlurry Vision\nSudden changes in vi...,you don t want to take any chances of losing your vision this is why it s smart to know when you need to visit an eye doctor here are a few signs that you might need to book an appointment with a specialist \nblurry vision\nsudden changes in vi...,"[you, don, t, want, to, take, any, chance, of, losing, your, vision, this, is, why, it, s, smart, to, know, when, you, need, to, visit, an, eye, doctor, here, are, a, few, sign, that, you, might, need, to, book, an, appointment, with, a, speciali..."
3,"A pulmonologist is a physician who specializes in the respiratory system. From the windpipe to the lungs, if your complaint involves the lungs or any part of the respiratory system, a pulmonologist is the doc you want to solve the problem.\nA sim...",a pulmonologist is a physician who specializes in the respiratory system from the windpipe to the lungs if your complaint involves the lungs or any part of the respiratory system a pulmonologist is the doc you want to solve the problem \na sim...,"[a, pulmonologist, is, a, physician, who, specializes, in, the, respiratory, system, from, the, windpipe, to, the, lung, if, your, complaint, involves, the, lung, or, any, part, of, the, respiratory, system, a, pulmonologist, is, the, doc, you, w..."
4,Reasons to See an Internist\nSigns that a patient needs to see an internal medicine doctor\nThe list below includes common signs that may make it necessary for patients to make an appointment with an internist.\nThey have chronic pain\nWhen a pat...,reasons to see an internist\nsigns that a patient needs to see an internal medicine doctor\nthe list below includes common signs that may make it necessary for patients to make an appointment with an internist \nthey have chronic pain\nwhen a pat...,"[reason, to, see, an, internist, sign, that, a, patient, need, to, see, an, internal, medicine, doctor, the, list, below, includes, common, sign, that, may, make, it, necessary, for, patient, to, make, an, appointment, with, an, internist, they, ..."
5,"A hematologist is a doctor who specializes in researching, diagnosing, treating, and preventing blood disorders and disorders of the lymphatic system (lymph nodes and vessels).\nIf your primary care physician has recommended that you see a hemato...",a hematologist is a doctor who specializes in researching diagnosing treating and preventing blood disorders and disorders of the lymphatic system lymph nodes and vessels \nif your primary care physician has recommended that you see a hemato...,"[a, hematologist, is, a, doctor, who, specializes, in, researching, diagnosing, treating, and, preventing, blood, disorder, and, disorder, of, the, lymphatic, system, lymph, node, and, vessel, if, your, primary, care, physician, ha, recommended, ..."
6,"Not everyone who has an infectious disease needs an infectious disease specialist. Your general internist or Primary Care Physician can take care of most infections, but sometimes specialized expertise is needed to either diagnose or manage speci...",not everyone who has an infectious disease needs an infectious disease specialist your general internist or primary care physician can take care of most infections but sometimes specialized expertise is needed to either diagnose or manage speci...,"[not, everyone, who, ha, an, infectious, disease, need, an, infectious, disease, specialist, your, general, internist, or, primary, care, physician, can, take, care, of, most, infection, but, sometimes, specialized, expertise, is, needed, to, eit..."
7,"What Does a Cardiologist Do?\nYour cardiologist, or heart doctor, helps prevent heart disease through screenings and checkups. They treat symptoms of heart conditions or heart diseases. These diseases can include:\nheart attacks, when blood flow ...",what does a cardiologist do \nyour cardiologist or heart doctor helps prevent heart disease through screenings and checkups they treat symptoms of heart conditions or heart diseases these diseases can include \nheart attacks when blood flow ...,"[what, doe, a, cardiologist, do, your, cardiologist, or, heart, doctor, help, prevent, heart, disease, through, screening, and, checkup, they, treat, symptom, of, heart, condition, or, heart, disease, these, disease, can, include, heart, attack, ..."
8,"A psychiatrist is a medical doctor who’s an expert in the field of psychiatry — the branch of medicine focused on the diagnosis, treatment and prevention of mental, emotional and behavioral disorders.\nPsychiatrists can diagnose and treat several...",a psychiatrist is a medical doctor who s an expert in the field of psychiatry the branch of medicine focused on the diagnosis treatment and prevention of mental emotional and behavioral disorders \npsychiatrists can diagnose and treat several...,"[a, psychiatrist, is, a, medical, doctor, who, s, an, expert, in, the, field, of, psychiatry, the, branch, of, medicine, focused, on, the, diagnosis, treatment, and, prevention, of, mental, emotional, and, behavioral, disorder, psychiatrist, can,..."
9,"Scars, Acne, Moles?\nAs your body’s first line of defense, your skin takes a lot of hits. Not only is it the largest organ in your body, but your skin also protects you from germs; repels water; and covers your blood vessels, nerves, and organs....",scars acne moles \nas your body s first line of defense your skin takes a lot of hits not only is it the largest organ in your body but your skin also protects you from germs repels water and covers your blood vessels nerves and organs ...,"[scar, acne, mole, a, your, body, s, first, line, of, defense, your, skin, take, a, lot, of, hit, not, only, is, it, the, largest, organ, in, your, body, but, your, skin, also, protects, you, from, germ, repels, water, and, cover, your, blood, ve..."
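
The `tok_lem_sentence` column above still contains stop words and one-letter tokens such as `a` and `s`, because `tokenizer` returns its unfiltered token list. The following is a minimal sketch of what returning the filtered list could look like. To run without NLTK data it uses `str.split()` in place of `word_tokenize` and a tiny made-up stop-word set; `filter_tokens`, `STOPWORDS_DEMO`, `MIN_CHARS` and `MAX_CHARS` are hypothetical names, and the bounds are assumptions rather than the notebook's settings.

```python
# Hypothetical stand-ins so the sketch runs without NLTK data
STOPWORDS_DEMO = {"a", "is", "who", "in", "what", "do", "the", "of"}
MIN_CHARS = 2    # assumed lower bound on token length, in characters
MAX_CHARS = 200  # assumed upper bound on token length, in characters

def filter_tokens(tokens, min_chars=MIN_CHARS, max_chars=MAX_CHARS, stopwords=STOPWORDS_DEMO):
    """Keep tokens whose length lies strictly between the bounds and that are not stop words."""
    return [w for w in tokens if min_chars < len(w) < max_chars and w not in stopwords]

# Tokens taken from the first row of the table above
tokens = ("what doe a gynecologist do a gynecologist is a doctor "
          "who specializes in woman reproductive health").split()
filtered = filter_tokens(tokens)
print(filtered)

# With a lower bound of 4 characters, short symptom words are dropped as well
print(filter_tokens(["pain", "skin", "joint"], min_chars=4))
```

The second call shows why the bound deserves care: with `min_chars=4`, words like `pain` and `skin` would disappear from both the descriptions and the patient's query.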


## Utility Function

In [19]:
## Util function

from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def extract_best_indices(m, topk, mask=None):
    """
    Average the cosine similarity over all query tokens and return the best matches.
    m (np.array): cosine-similarity matrix of shape (nb_query_tokens, nb_sentences)
    topk (int): number of indices to return (ordered from highest to lowest score)
    mask (np.array, optional): one entry per dataset sentence
    """
    # Mean over all query tokens of the cosine similarity with each dataset sentence
    if len(m.shape) > 1:
        cos_sim = np.mean(m, axis=0)
        print(cos_sim)
    else:
        cos_sim = m
        print(cos_sim)
    index = np.argsort(cos_sim)[::-1]  # sentence indices ordered from highest to lowest score
    if mask is not None:
        assert mask.shape == cos_sim.shape
        mask = mask[index]
    else:
        mask = np.ones(len(cos_sim))
        print(mask)
    # NOTE: with the default all-ones mask, logical_or is True everywhere,
    # so zero-similarity sentences are not filtered out here
    mask = np.logical_or(cos_sim[index] != 0, mask)
    print(mask)
    best_index = index[mask][:topk]
    print(best_index)
    return best_index
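
The selection step above averages the similarity over query tokens, sorts, and keeps the top `topk` indices. As a sketch of how sentences with zero similarity could be excluded from that ranking, here is a plain-Python restatement using lists instead of NumPy arrays. `top_nonzero_indices` and the small matrix are hypothetical and only illustrate the idea.

```python
from statistics import fmean

def top_nonzero_indices(sim_rows, topk):
    """
    sim_rows: list of rows, one per query token, each holding one similarity
    per dataset sentence. Returns up to `topk` sentence indices ordered by
    mean similarity, skipping sentences whose mean similarity is zero.
    """
    n_sentences = len(sim_rows[0])
    mean_sim = [fmean(row[j] for row in sim_rows) for j in range(n_sentences)]
    order = sorted(range(n_sentences), key=lambda j: mean_sim[j], reverse=True)
    return [j for j in order if mean_sim[j] != 0][:topk]

# Hypothetical similarity matrix: 2 query tokens x 4 dataset sentences
sim_rows = [
    [0.0, 0.2, 0.0, 0.1],
    [0.0, 0.4, 0.0, 0.0],
]
print(top_nonzero_indices(sim_rows, topk=3))  # only two sentences have non-zero similarity
```

With this variant a query that matches fewer than `topk` descriptions returns a shorter list instead of padding it with unrelated specialties.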

# TF-IDF

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Tokenize the stop words the same way as the text so the two stay consistent
token_stop = tokenizer(' '.join(STOPWORDS), lemmatize=False)

# Fit TF-IDF on the raw `sentence` column (not `clean_sentence`)
vectorizer = TfidfVectorizer(stop_words=token_stop, tokenizer=tokenizer)
tfidf_mat = vectorizer.fit_transform(df['sentence'].values) # -> (num_sentences, num_vocabulary)
tfidf_mat.shape



(18, 2968)

In [9]:
display(tfidf_mat)

<18x2968 sparse matrix of type '<class 'numpy.float64'>'
	with 7002 stored elements in Compressed Sparse Row format>
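
To make the matrix above less opaque, the sketch below computes TF-IDF weights and cosine similarity by hand for three made-up descriptions. It assumes the weighting that scikit-learn's `TfidfVectorizer` applies with its default settings (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2 normalization). The documents, `tfidf_vector` and `cosine` are illustrative only and are not taken from the dataset.

```python
import math
from collections import Counter

docs = [
    ["joint", "bone", "pain"],   # hypothetical orthopedics-like description
    ["heart", "chest", "pain"],  # hypothetical cardiology-like description
    ["skin", "acne", "scar"],    # hypothetical dermatology-like description
]
vocab = sorted({w for d in docs for w in d})
n_docs = len(docs)
doc_freq = Counter(w for d in docs for w in set(d))
# Smoothed IDF: ln((1 + n) / (1 + df)) + 1
idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}

def tfidf_vector(tokens):
    """Raw term counts times IDF, scaled to unit L2 norm; unknown words are ignored."""
    counts = Counter(t for t in tokens if t in idf)
    vec = [counts[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec

def cosine(u, v):
    """Dot product; equals cosine similarity when both vectors have unit length."""
    return sum(a * b for a, b in zip(u, v))

doc_vecs = [tfidf_vector(d) for d in docs]
query_vec = tfidf_vector(["joint", "pain"])
scores = [cosine(query_vec, dv) for dv in doc_vecs]
best = max(range(n_docs), key=scores.__getitem__)
print(scores, best)
```

The word `pain` appears in two descriptions and so gets a lower IDF than `joint`, which is why the first description scores well above the second even though both contain `pain`.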

## Prediction

In [22]:
def get_recommendations_tfidf(sentence, tfidf_mat):
    """
    Return the indices of the dataset sentences with the highest cosine similarity
    to the query, averaged over the query's tokens.
    """
    # Embed each token of the query sentence as its own TF-IDF vector
    tokens = [str(tok) for tok in tokenizer(sentence)]
    vec = vectorizer.transform(tokens)
    # Cosine similarity between every query token and every dataset sentence
    mat = cosine_similarity(vec, tfidf_mat)
    # Shape is (number of query tokens, number of dataset sentences)
    print("asgsg",mat.shape)
    best_index = extract_best_indices(mat, topk=3)
    return best_index

query_sentence = "My joints hurt when I move"
best_index = get_recommendations_tfidf(query_sentence, tfidf_mat)

display(df['Specialty'].iloc[best_index])

asgsg (6, 18)
[0.         0.00189569 0.0047403  0.         0.         0.0014646
 0.         0.00343768 0.         0.         0.         0.01402338
 0.04638382 0.00472076 0.         0.         0.00380896 0.00132208]
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
[ True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True]
[12 11  2]


12      Orthopedics
11         Oncology
2     Ophthalmology
Name: Specialty, dtype: object
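
The ranking above can be checked directly against the printed mean similarities. The sketch below copies those 18 values from the output and re-derives the top three with plain Python; it also looks at how far apart the scores are, which gives a rough sense of how much weight the second and third recommendations deserve.

```python
# Mean cosine similarities copied from the printed output above (one per dataset row)
scores = [
    0.0, 0.00189569, 0.0047403, 0.0, 0.0, 0.0014646,
    0.0, 0.00343768, 0.0, 0.0, 0.0, 0.01402338,
    0.04638382, 0.00472076, 0.0, 0.0, 0.00380896, 0.00132208,
]

# Positions ordered from highest to lowest score
ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
top3 = ranked[:3]
print(top3)  # [12, 11, 2]

# How much stronger the best match is than the third-best
print(scores[top3[0]] / scores[top3[2]])

# Gap between third and fourth place
print(scores[ranked[2]] - scores[ranked[3]])

# Number of descriptions with any overlap at all
print(sum(1 for s in scores if s != 0))
```

Position 12 (Orthopedics) scores roughly ten times higher than position 2 (Ophthalmology), and position 2 beats position 13 by a very small margin, so only the first recommendation looks like a confident match for this query.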

In [16]:
query2 = "'sore throat', 'fatigue', 'nausea', 'joint pain', 'chest pain']"
best_index = get_recommendations_tfidf(query2, tfidf_mat)
display(df['Specialty'].iloc[best_index])

(18, 18)


12         Orthopedics
18           Neurology
10    Gastroenterology
Name: Specialty, dtype: object