# 0. Problem Statement

The data contains two fields: the disease and its symptoms described in free text.
<br>It covers 24 diseases.

The problem: given a textual description of symptoms, predict the patient's disease.

To solve it, we use a sentence-embedding model from the `sentence_transformers` library, which extracts a 384-dimensional semantic embedding from each sentence.

We classify the embeddings with a neural network classifier (`MLPClassifier`) from sklearn, which reaches about 98% accuracy on the held-out test split.

# 1. Installing dependencies

In [1]:
!pip install sentence_transformers

Collecting sentence_transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
     86.0/86.0 kB 4.8 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: sentence_transformers
  Building wheel for sentence_transformers (setup.py) ... done
  Created wheel for sentence_transformers: filename=sentence_transformers-2.2.2-py3-none-any.whl size=125938 sha256=4621badb0362b9ed5f73e0dd94d617b729ab2edef9aed1bea59705ca2264a1d1
  Stored in directory: /root/.cache/pip/wheels/83/71/2b/40d17d21937fed496fb99145227eca8f20b4891240ff60c86f
Successfully built sentence_transformers
Installing collected packages: sentence_transformers
Successfully installed sentence_transformers-2.2.2

# 2. Importing Libraries

In [2]:
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer
import sklearn
from sklearn import metrics
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
import pickle

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# 3. Importing Data

In [3]:
r_data = pd.read_csv('/kaggle/input/symptom2disease/Symptom2Disease.csv').iloc[:,1:]

# 4. Exploratory Analysis

In [4]:
r_data.nunique()

label      24
text     1153
dtype: int64
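
The `text` column has 1153 unique values, while the feature matrix built further down has 1200 rows, so some symptom descriptions occur more than once. The following is a minimal sketch of how such repeats could be counted and dropped before splitting. The DataFrame `toy` and its rows are invented for illustration; only the column names mirror `r_data`.

```python
import pandas as pd

# Hypothetical stand-in for r_data: same column names, invented rows.
toy = pd.DataFrame({
    'label': ['Psoriasis', 'Psoriasis', 'Other', 'Other'],
    'text': ['itchy scaly rash', 'itchy scaly rash', 'first example', 'second example'],
})

# Count rows whose text repeats an earlier row.
n_duplicates = int(toy.duplicated(subset='text').sum())

# Keep the first occurrence of each text so that a repeated description
# cannot land in both the training and the test split.
deduplicated = toy.drop_duplicates(subset='text').reset_index(drop=True)

print(n_duplicates)
print(len(deduplicated))
```

Whether deduplication changes the reported accuracy on the real data is not something this sketch can show; it only illustrates the check.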

In [5]:
r_data.head(20)

Unnamed: 0,label,text
0,Psoriasis,I have been experiencing a skin rash on my arm...
1,Psoriasis,"My skin has been peeling, especially on my kne..."
2,Psoriasis,I have been experiencing joint pain in my fing...
3,Psoriasis,"There is a silver like dusting on my skin, esp..."
4,Psoriasis,"My nails have small dents or pits in them, and..."
5,Psoriasis,The skin on my palms and soles is thickened an...
6,Psoriasis,"The skin around my mouth, nose, and eyes is re..."
7,Psoriasis,My skin is very sensitive and reacts easily to...
8,Psoriasis,I have noticed a sudden peeling of skin at dif...
9,Psoriasis,The skin on my genitals is red and inflamed. I...


In [6]:
r_data.iloc[0,1]

'I have been experiencing a skin rash on my arms, legs, and torso for the past few weeks. It is red, itchy, and covered in dry, scaly patches.'

# 4. Feature engineering / Sentence Embeddings (for features)

In [7]:
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

In [8]:
features = model.encode(r_data.iloc[:,1])

In [9]:
with open('sentence_encoding.sav', 'wb') as f:
    pickle.dump(model, f)

___

## Checking whether cosine similarity alone could work

In [10]:
test_0 = model.encode(r_data.iloc[0,1]).reshape(1,-1)
test_0.shape

(1, 384)

In [11]:
test_1 = model.encode(r_data.iloc[1,1]).reshape(1,-1)
test_1.shape

(1, 384)

In [12]:
metrics.pairwise.cosine_similarity(test_0, test_1)

array([[0.569144]], dtype=float32)
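
For reference, cosine similarity is the dot product of two vectors divided by the product of their Euclidean norms. The sketch below computes it by hand with NumPy and uses it for a simple nearest-neighbour lookup. The 3-dimensional vectors, the label names other than `Psoriasis`, and the helper `cosine_sim` are assumptions made for illustration; they stand in for the 384-dimensional embeddings and are not part of the notebook.

```python
import numpy as np

def cosine_sim(a, b):
    # Dot product divided by the product of the Euclidean norms.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical reference embeddings with known labels.
reference = np.array([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0],
                      [0.0, 0.0, 1.0]])
reference_labels = ['Psoriasis', 'label_b', 'label_c']

# Hypothetical embedding of a new symptom description.
query = np.array([0.9, 0.1, 0.0])

# Score the query against every reference vector and pick the closest one.
scores = [cosine_sim(query, ref) for ref in reference]
nearest_label = reference_labels[int(np.argmax(scores))]
print(nearest_label)
```

A lookup of this kind needs no training, but it depends entirely on how well same-disease descriptions cluster in embedding space.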

___

# 5. Creating labels / target for classification

In [13]:
r_targets = r_data.iloc[:,0].values

In [14]:
le = LabelEncoder()
targets = le.fit_transform(r_targets)

In [15]:
with open('label_encoder.sav', 'wb') as f:
    pickle.dump(le, f)
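
To make the encoding step concrete, here is a small sketch of what `LabelEncoder` does: it sorts the unique labels and maps each one to its position in that sorted list, and `inverse_transform` maps the integers back. The list `toy_labels` is invented for illustration; `label_x` and `label_y` are placeholders, not diseases from the dataset.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels; only 'Psoriasis' appears in the notebook output.
toy_labels = ['Psoriasis', 'label_x', 'Psoriasis', 'label_y']

encoder = LabelEncoder()
encoded = encoder.fit_transform(toy_labels)

# classes_ holds the sorted unique labels; the index of a label is its integer code.
classes = list(encoder.classes_)
codes = [int(c) for c in encoded]
decoded = list(encoder.inverse_transform([0, 2]))

print(classes)
print(codes)
print(decoded)
```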

In [16]:
features.shape

(1200, 384)

# 6. Train-test splitting of the dataset

In [17]:
X_train, X_test, y_train, y_test = train_test_split(features, targets, train_size=0.75)

In [18]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(900, 384)
(300, 384)
(900,)
(300,)
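
The split above is random and unseeded, so it changes from run to run and class proportions may differ between the two parts. A possible variant, sketched here on synthetic data, fixes the seed and stratifies by class. The arrays `toy_features` and `toy_targets`, the assumption of 50 rows per class, and the seed value 42 are all illustrative choices, not taken from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-ins: 1200 rows of 384 features, 24 classes (assumed 50 rows each).
toy_features = rng.normal(size=(1200, 384)).astype(np.float32)
toy_targets = np.repeat(np.arange(24), 50)

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_features, toy_targets,
    train_size=0.75,
    stratify=toy_targets,  # keep class proportions similar in both splits
    random_state=42,       # arbitrary seed so the split is repeatable
)

print(X_tr.shape, X_te.shape)
print(np.bincount(y_te))   # test rows per class
```

Using this variant on the real data would produce a different split, so the scores reported below would not carry over unchanged.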


# 7. Classification model

In [19]:
clf = MLPClassifier(max_iter=1000)
clf.fit(X_train, y_train)

MLPClassifier(max_iter=1000)

In [20]:
with open('classification_model.sav', 'wb') as f:
    pickle.dump(clf, f)

# 8. Model Evaluation

In [21]:
clf.score(X_train,y_train)

1.0

In [22]:
clf.score(X_test, y_test)

0.9833333333333333
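
A single accuracy figure can hide a class that is predicted poorly. As a sketch of a per-class view, the block below builds a confusion matrix and per-class recall with `sklearn.metrics`. The arrays `y_true` and `y_pred` are made-up examples with three classes; with the notebook's objects, `y_test` and `clf.predict(X_test)` would presumably take their place.

```python
import numpy as np
from sklearn import metrics

# Hypothetical true and predicted class indices for six samples.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 1, 2, 2, 2])

# Rows are true classes, columns are predicted classes.
cm = metrics.confusion_matrix(y_true, y_pred)

# Overall accuracy, and recall computed separately for each class.
accuracy = metrics.accuracy_score(y_true, y_pred)
recall_per_class = metrics.recall_score(y_true, y_pred, average=None)

print(cm)
print(accuracy)
print(recall_per_class)
```

In this toy case the overall accuracy is 5 out of 6, yet class 1 is recalled only half of the time, which is the kind of gap a single score does not reveal.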

# 9. Creating a pipeline

In [23]:
import pickle

def disease_classification(symptom_text):
    # Load the pickled sentence encoder, classifier, and label encoder from disk.
    with open('sentence_encoding.sav', 'rb') as f:
        model = pickle.load(f)
    with open('classification_model.sav', 'rb') as f:
        class_model = pickle.load(f)
    with open('label_encoder.sav', 'rb') as f:
        label_encoder = pickle.load(f)

    # Embed the text, predict the class index, and map it back to the disease name.
    temp_encoding = model.encode(symptom_text)
    temp_prediction = class_model.predict([temp_encoding])
    temp_label = label_encoder.inverse_transform(temp_prediction)

    return temp_label[0]

In [24]:
symptom_text = 'Dry, thick, and raised patches on the skin are the most common sign of psoriasis. These patches are often covered with a silvery-white coating called scale, and they tend to itch.'

In [25]:
disease_classification(symptom_text)

'Psoriasis'
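
The function above returns only the single most likely disease. One possible extension, sketched here, ranks several candidates by probability. The helper `top_k_labels`, the array `toy_probs`, and the placeholder names in `toy_names` are hypothetical; with the fitted objects, one row of `class_model.predict_proba(...)` and `label_encoder.classes_` would be the likely inputs, assuming every class was present in the training split.

```python
import numpy as np

def top_k_labels(probabilities, class_names, k=3):
    # probabilities: 1-D array with one entry per class,
    # for example a single row of a classifier's predict_proba output.
    order = np.argsort(probabilities)[::-1][:k]  # indices of the k largest values
    return [(class_names[i], float(probabilities[i])) for i in order]

# Hypothetical class probabilities for three classes.
toy_probs = np.array([0.05, 0.70, 0.25])
toy_names = ['label_a', 'Psoriasis', 'label_c']

ranked = top_k_labels(toy_probs, toy_names, k=2)
print(ranked)
```

Returning a short ranked list makes it easier to see when the classifier is unsure between two diseases.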