# **Disease Symptom Recommendation**

For this study, we'll create a recommender function that takes a disease name as input and returns the four diseases whose descriptions are most similar to it. The data was obtained from [Kaggle](https://www.kaggle.com/datasets/itachi9604/disease-symptom-description-dataset?tags=13302-Classification&page=8). The dataset gives students a source for building healthcare-related systems. A project on the same data using double Decision Tree classification is available [here](https://github.com/itachi9604/healthcare-chatbot). A `get_dummies`-processed file is available [here](https://www.kaggle.com/rabisingh/symptom-checker?select=Training.csv).

## Content

There are two columns. The first column contains the disease names and the second contains a description of each disease. The dataset can be cleaned with basic file handling in any language.

- The user only needs to understand how the rows and columns are arranged.

In [1]:
import pandas as pd
disease_description = pd.read_csv('data/symptom_Description.csv')
print(disease_description.head())

          Disease                                        Description
0   Drug Reaction  An adverse drug reaction (ADR) is an injury ca...
1         Malaria  An infectious disease caused by protozoan para...
2         Allergy  An allergy is an immune system response to a f...
3  Hypothyroidism  Hypothyroidism, also called underactive thyroi...
4       Psoriasis  Psoriasis is a common skin disorder that forms...


## Building the Recommender Function

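- First we process the data: each description is lemmatized and lowercased, and stop words and common punctuation tokens are removed.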

In [2]:
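from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
import spacy
from spacy.lang.en.stop_words import STOP_WORDS

nlp = spacy.load('en_core_web_lg')

# spaCy's English stop words plus a few punctuation tokens
stopwords = STOP_WORDS.union({'.', '..', ',', '!', ':', '...', '?'})

def lemma_func(row):
    """Lemmatize and lowercase a description, dropping stop words."""
    lemmas = (token.lemma_.lower() for token in nlp(row))
    return ' '.join(lemma for lemma in lemmas if lemma not in stopwords)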

disease_description['Description'] = disease_description.Description.apply(lemma_func)
print(disease_description.head(5))

          Disease                                        Description
0   Drug Reaction  adverse drug reaction ( adr ) injury cause med...
1         Malaria  infectious disease cause protozoan parasite pl...
2         Allergy  allergy immune system response foreign substan...
3  Hypothyroidism  hypothyroidism underactive thyroid low thyroid...
4       Psoriasis  psoriasis common skin disorder form thick red ...


- It is also important that we create a `Series` object that maps each disease name to its row index, so the recommender can look up a disease by name.

In [3]:
disease_indices = pd.Series(disease_description.index, index=disease_description.Disease)
disease_indices.head()

Disease
Drug Reaction     0
Malaria           1
Allergy           2
Hypothyroidism    3
Psoriasis         4
dtype: int64

- Next we create tf-idf vectors from the unigrams, bigrams, and trigrams of each description.

In [4]:
vectorizer = TfidfVectorizer(ngram_range=(1, 3))
vectorized_matrix = vectorizer.fit_transform(disease_description.Description).toarray()
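
To make the weighting concrete, here is a minimal, dependency-free sketch of n-gram extraction followed by tf-idf with smoothed idf, `ln((1 + n) / (1 + df)) + 1`, and L2 normalization, which are the documented `TfidfVectorizer` defaults. The helper names `ngrams` and `tfidf_vectors` and the toy descriptions are made up for illustration. The sketch splits on whitespace, whereas scikit-learn's default tokenizer also drops single-character tokens, so the numbers will not match the notebook's matrix exactly.

```python
import math
from collections import Counter

def ngrams(tokens, n_min=1, n_max=3):
    """Return all contiguous n-grams (as strings) for n_min <= n <= n_max."""
    return [' '.join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def tfidf_vectors(docs):
    """Map each document to a dict of L2-normalized tf-idf weights."""
    term_counts = [Counter(ngrams(doc.split())) for doc in docs]
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for counts in term_counts for term in counts)
    vectors = []
    for counts in term_counts:
        # Smoothed idf, as in scikit-learn's default (smooth_idf=True).
        weights = {term: tf * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
                   for term, tf in counts.items()}
        # Guard against an empty document, whose norm would be zero.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

# Toy descriptions (hypothetical, not taken from the dataset).
toy_docs = ['infectious disease cause parasite',
            'infectious disease cause virus',
            'skin disorder form red patch']
toy_vectors = tfidf_vectors(toy_docs)
```

In this toy corpus, `parasite` appears in one document and `infectious` in two, so `parasite` receives the larger weight in the first vector. Rare terms are what distinguish one description from another.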

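- Next we generate a cosine similarity matrix. Because `TfidfVectorizer` L2-normalizes each row by default, the dot product computed by `linear_kernel` equals the cosine similarity.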

In [5]:
similarity_matrix = linear_kernel(vectorized_matrix, vectorized_matrix)
similarity_matrix

array([[1.        , 0.00281687, 0.        , ..., 0.        , 0.        ,
        0.00232351],
       [0.00281687, 1.        , 0.        , ..., 0.01123695, 0.        ,
        0.03143219],
       [0.        , 0.        , 1.        , ..., 0.01101901, 0.        ,
        0.00503926],
       ...,
       [0.        , 0.01123695, 0.01101901, ..., 1.        , 0.00808688,
        0.00738219],
       [0.        , 0.        , 0.        , ..., 0.00808688, 1.        ,
        0.01187517],
       [0.00232351, 0.03143219, 0.00503926, ..., 0.00738219, 0.01187517,
        1.        ]])
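
The ones on the diagonal follow from the tf-idf rows having unit length: for unit vectors, a plain dot product is already the cosine similarity. The sketch below checks this on two arbitrary vectors in pure Python. The helper names are hypothetical and the vectors are not taken from the dataset.

```python
import math

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine_similarity(u, v):
    """Cosine of the angle between u and v."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def l2_normalize(v):
    """Scale v to unit Euclidean length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

u = [1.0, 2.0, 0.0]
v = [2.0, 1.0, 1.0]

# After normalization, the dot product alone gives the cosine similarity.
dot_of_unit_vectors = dot(l2_normalize(u), l2_normalize(v))
cosine = cosine_similarity(u, v)
```

Here `dot_of_unit_vectors` and `cosine` agree up to floating-point error, and the similarity of any non-zero vector with itself is 1.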

- Finally, we define the recommender function.

In [7]:
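def get_recommendation(disease, similarity_matrix, indices):
    # Look up the row of the requested disease through the `indices` argument
    index = indices[disease]
    # Rank all diseases by similarity, highest first. The top entry is
    # normally the disease itself (similarity 1), so skip it and keep four.
    scores = sorted(enumerate(similarity_matrix[index]),
                    key=lambda x: x[1],
                    reverse=True)[1:5]
    similar_diseases = [i[0] for i in scores]
    return disease_description.iloc[similar_diseases]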

get_recommendation('Malaria', similarity_matrix, disease_indices)

         Disease                                        Description
34        Dengue  acute infectious disease cause flavivirus ( sp...
28   Hepatitis C  inflammation liver hepatitis c virus ( hcv ) u...
40  Tuberculosis  tuberculosis ( tb ) infectious disease usually...
18   Chicken pox  chickenpox highly contagious disease cause var...
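
As a possible variation, the ranking step could take the number of results as a parameter and exclude the query by its index instead of slicing off the first sorted element. The sketch below is dependency-free and runs on a small made-up similarity matrix. The function name `top_similar`, the `top_n` parameter, and the labels `A` to `D` are hypothetical and not part of the notebook.

```python
def top_similar(name, names, similarity, top_n=4):
    """Return up to top_n (name, score) pairs most similar to `name`."""
    if name not in names:
        raise KeyError(f'unknown disease: {name!r}')
    query = names.index(name)
    # Drop the query row explicitly, then sort the rest by score, highest first.
    ranked = sorted(
        ((i, score) for i, score in enumerate(similarity[query]) if i != query),
        key=lambda pair: pair[1],
        reverse=True)
    return [(names[i], score) for i, score in ranked[:top_n]]

# Toy symmetric similarity matrix (made-up values for illustration).
toy_names = ['A', 'B', 'C', 'D']
toy_similarity = [[1.0, 0.2, 0.7, 0.0],
                  [0.2, 1.0, 0.1, 0.5],
                  [0.7, 0.1, 1.0, 0.3],
                  [0.0, 0.5, 0.3, 1.0]]

closest_to_a = top_similar('A', toy_names, toy_similarity, top_n=2)
```

Filtering by index keeps the result correct even if another row ties with the query at a similarity of 1, for example when two entries share the same description. In that case, skipping the first sorted element could drop the wrong row.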
