## In this notebook, we test the models that take a new input and predict (classify) the disease

In [2]:
import pickle
import re

import numpy
import pandas as pd
import spacy
from nltk.tokenize import word_tokenize
from numpy import dot
from numpy.linalg import norm
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression

nlp = spacy.load('en_core_web_sm')

### Read the Data & pickle files

In [4]:
# Load the domain stop-word list with pickle
with open('stop_words.ob', 'rb') as fp:
    domain_stop_word = pickle.load(fp)

In [5]:
file_path = "diseases_with_description.csv"
all_chapters = pd.read_csv(file_path)

### Text Vectorization (Bag-of-Words and TF-IDF)

In [7]:
# Bag-of-words counts and TF-IDF weights, both with English stop words removed
cv = CountVectorizer(stop_words="english")
cv_tfidf = TfidfVectorizer(stop_words="english")

# Fit each vectorizer on the chapter descriptions
X = cv.fit_transform(all_chapters['Description'])
X_tfidf = cv_tfidf.fit_transform(all_chapters['Description'])

# Dense DataFrames with one row per chapter and one column per term
df_cv = pd.DataFrame(X.toarray(), columns=cv.get_feature_names_out())
df_tfidf = pd.DataFrame(X_tfidf.toarray(), columns=cv_tfidf.get_feature_names_out())

### First Step: Cosine Similarity

In [9]:
cosine = lambda v1 , v2 : dot(v1 , v2) / (norm(v1) * norm(v2))

new_text = ["Severe chest pain radiating to the left arm, rapid heartbeat, and shortness of breath. The patient also reports dizziness and fainting spells. Tests reveal arrhythmia, aortic valve stenosis, and elevated troponin levels, indicating myocardial infarction. Leg swelling suggests possible heart failure." 
           ]
new_text_cv = cv.transform(new_text).toarray()[0]
new_text_tfidf = cv_tfidf.transform(new_text).toarray()[0]

for chapter_number in range(all_chapters.shape[0]):
    print(f"This is chapter number : {chapter_number} ")
    print(f"Cosine cv :    {cosine(df_cv.iloc[chapter_number], new_text_cv)} ")
    print(f"Cosine TFIDF : {cosine(df_tfidf.iloc[chapter_number], new_text_tfidf)} ")


This is chapter number : 0 
Cosine cv :    0.18832944617230332 
Cosine TFIDF : 0.19105642120034919 
This is chapter number : 1 
Cosine cv :    0.03128056235449272 
Cosine TFIDF : 0.0357145598197733 
This is chapter number : 2 
Cosine cv :    0.06256112470898544 
Cosine TFIDF : 0.040389095636997295 


The output above was produced for the following test text, which describes the symptoms of a disease: **(Severe chest pain radiating to the left arm, rapid heartbeat, and shortness of breath. The patient also reports dizziness and fainting spells. Tests reveal arrhythmia, aortic valve stenosis, and elevated troponin levels, indicating myocardial infarction. Leg swelling suggests possible heart failure.)**

Both similarity scores are highest for chapter 0 in the output above, that is, chapter **One**, which is **Cardio**. We therefore pass the new input to the chapter one classification model to get the disease name.
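
The chapter selection can also be done programmatically instead of reading the printed scores. The sketch below is one possible way to do it, assuming a dense matrix with one row per chapter; `cosine_scores`, the toy vectors, and every chapter name except Cardio are hypothetical.

```python
import numpy as np

def cosine_scores(matrix, query):
    """Cosine similarity between each row of `matrix` and the vector `query`."""
    matrix = np.asarray(matrix, dtype=float)
    query = np.asarray(query, dtype=float)
    denom = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    # Guard against all-zero vectors, which would otherwise divide by zero
    safe = np.where(denom == 0, 1.0, denom)
    return np.where(denom == 0, 0.0, (matrix @ query) / safe)

# Placeholder chapter labels and toy term-count vectors (not the real data)
chapter_names = ["Cardio", "Chapter 2", "Chapter 3"]
chapter_vectors = [[2, 1, 0, 0], [0, 0, 3, 1], [0, 1, 0, 2]]
query_vector = [1, 1, 0, 0]

scores = cosine_scores(chapter_vectors, query_vector)
best = int(np.argmax(scores))  # index of the most similar chapter
print(chapter_names[best], scores.round(3))
```

With the notebook's own objects, the same idea would apply to `df_cv.to_numpy()` and `new_text_cv`. The zero-norm guard matters when a new input shares no vocabulary with the fitted vectorizer.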

### Second Step: Classify a new input with the trained **cardio_model**

In [12]:
file_path = "extracted_diseases_descriptions.csv"
df = pd.read_csv(file_path, encoding='latin1')
df.columns

Index(['Disease', 'Description', 'Unnamed: 2'], dtype='object')

In [13]:
def clean_text_func(text):
    """Lowercase the text, strip punctuation and digits, and drop domain stop words."""
    text = str(text).lower()
    # Replace every character outside the allowed set with a space
    text = re.sub(r"[^A-Za-z0-9^,!?.\/'+]", " ", text)
    # Replace the remaining punctuation marks with spaces
    text = re.sub(r"\+", " ", text)
    text = re.sub(r",", " ", text)
    text = re.sub(r"\.", " ", text)
    text = re.sub(r"!", " ", text)
    text = re.sub(r"\?", " ", text)
    text = re.sub(r"'", " ", text)
    text = re.sub(r":", " : ", text)
    # Collapse repeated whitespace, then blank out digits
    text = re.sub(r"\s{2,}", " ", text)
    text = re.sub(r"[0-9]", " ", text)
    # Keep only the tokens that are not domain stop words
    final_text = ""
    for x in text.split():
        if x not in domain_stop_word:
            final_text = final_text + x + " "
    return final_text

df['Description'] = df['Description'].apply(clean_text_func)
# df = df.drop('Unnamed: 2', axis =1)
df.iloc[1,1]

'asd goes preschoolers children complain feeling exertion tract infections otherwise appear children shunts growth retardation children asd seldom endocarditis complications adults manifest symptoms fatigability dyspnea exertion point limitation children auscultation reveals quality heard ics patients shunts valve pitched heard border becomes inspiration intensity indicator size shunt pitch sometimes makes hear relatively signs widely caused valve click apex valve prolapse mvp occasionally affects children asd patients defects artery auscultation reveals ejection click clubbing cyanosis become syncope hemoptysis '

In [53]:
X_train = df.Description
y_train = df.Disease

cv1 = TfidfVectorizer()
X_train_cv1 = cv1.fit_transform(X_train)
pd_cv1 = pd.DataFrame(X_train_cv1.toarray(), columns=cv1.get_feature_names_out())

cardio_model_lr = LogisticRegression()
cardio_model_lr.fit(X_train_cv1, y_train)

print(cardio_model_lr.coef_)

pd_cv1.head()

[[-0.01191309 -0.01191309 -0.00993777 ... -0.01497913 -0.01497913
  -0.01191309]
 [-0.01080966 -0.01080966 -0.00989951 ... -0.01327997 -0.01327997
  -0.01080966]
 [-0.01090343 -0.01090343  0.06922689 ... -0.01349918 -0.01349918
  -0.01090343]
 ...
 [-0.01073634 -0.01073634 -0.0098498  ... -0.01326124 -0.01326124
  -0.01073634]
 [ 0.07743677  0.07743677 -0.00985335 ... -0.01368715 -0.01368715
   0.07743677]
 [-0.0111391  -0.0111391  -0.00974006 ... -0.01352467 -0.01352467
  -0.0111391 ]]


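|   | abortion | abscesses | abusers | addition | adolescence | adults | affects | ages | along | among | ... | vasculature | vegetating | vsd | vsds | weakness | weeks | widely | widened | workload | years |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0838 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0838 | 0.0 | ... | 0.0838 | 0.0 | 0.670404 | 0.167601 | 0.0 | 0.0 | 0.070231 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.088654 | 0.105782 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.088654 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.287551 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.104927 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.125199 | 0.125199 | 0.0 |
| 4 | 0.102258 | 0.102258 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.102258 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.102258 |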
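
As an optional variation, the vectorizer and the classifier can be bundled with scikit-learn's `make_pipeline`, so that raw text goes in and a label comes out in one call. This is a sketch on a made-up four-row training set, not the notebook's data; `toy_texts`, `toy_labels`, and `toy_pipeline` are illustrative names, and `max_iter=1000` is an assumed setting.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical miniature training set; texts and labels are placeholders
toy_texts = [
    "murmur shunt exertion fatigue",
    "shunt murmur dyspnea",
    "fever chest pain inflammation",
    "chest pain fever fatigue",
]
toy_labels = ["ASD", "ASD", "MYOCARDITIS", "MYOCARDITIS"]

# The pipeline fits the TF-IDF vocabulary and the classifier together
toy_pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
toy_pipeline.fit(toy_texts, toy_labels)

# New text is vectorised with the fitted vocabulary before classification
toy_prediction = toy_pipeline.predict(["murmur and shunt on exertion"])
print(toy_prediction)
```

Keeping both steps in one object avoids accidentally transforming new text with a different vectorizer than the one the model was trained on.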


In [63]:
X_test = """Fatigue, especially during physical activity
Shortness of breath on exertion
Occasional chest discomfort
Palpitations
History of frequent respiratory infections in childhood
Audible heart murmur during examination
."""
# "Sharp chest pain that worsens with deep breaths or lying down, easing when sitting upright. Symptoms also include fever and fatigue."
cleaned_text = clean_text_func(X_test)

X_test_cv3  = cv1.transform([cleaned_text])
y_pred_cv3 = cardio_model_lr.predict(X_test_cv3)

In [65]:
print(y_pred_cv3)
disease_name = y_pred_cv3


['MYOCARDITIS']
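
A single predicted label hides how confident the model is. One way to inspect this, sketched here under the assumption that a ranked list of candidates is useful, is `predict_proba`. The tiny training set, its labels, and the helper `top_k_labels` are hypothetical and only stand in for `cv1` and `cardio_model_lr`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-in data: one short description per label
toy_texts = [
    "murmur shunt exertion fatigue",
    "cyanosis clubbing squatting fainting",
    "fever chest pain fatigue",
]
toy_labels = ["ASD", "TOF", "MYOCARDITIS"]

toy_vec = TfidfVectorizer()
toy_model = LogisticRegression(max_iter=1000)
toy_model.fit(toy_vec.fit_transform(toy_texts), toy_labels)

def top_k_labels(model, vectorizer, text, k=2):
    """Return the k most probable (label, probability) pairs for one text."""
    probs = model.predict_proba(vectorizer.transform([text]))[0]
    order = np.argsort(probs)[::-1][:k]  # indices sorted by descending probability
    return [(model.classes_[i], float(probs[i])) for i in order]

ranked = top_k_labels(toy_model, toy_vec, "cyanosis and clubbing with fainting")
print(ranked)
```

With the notebook's own objects, the equivalent call would be `top_k_labels(cardio_model_lr, cv1, cleaned_text)`. A small gap between the first and second probabilities suggests the prediction should be treated with caution.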


- TOF: Symptoms include cyanosis, difficulty breathing, fainting spells, and growth retardation. Additional signs are clubbing, squatting posture, and developmental delays.
- PDA: Signs include heart murmurs, poor weight gain, respiratory distress, and frequent infections. Severe cases show signs of heart failure.

In [18]:
from transformers import pipeline
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

text = "asd goes undetected preschoolers children complain feeling tired exertion tract infections otherwise appear healthy children shunts growth retardation children asd seldom endocarditis complications adults manifest pronounced symptoms fatigability dyspnea exertion point limitation children auscultation reveals midsystolic murmur quality heard third ics patients shunts tricuspid valve pitched diastolic murmur heard sternal border becomes pronounced inspiration murmur intensity indicator size shunt pitch sometimes makes hear gradient relatively murmur detectable signs fixed widely split caused delayed closure pulmonic valve systolic click systolic murmur apex valve prolapse mvp occasionally affects children asd patients uncorrected defects fixed artery auscultation reveals accentuated ejection click audible clubbing cyanosis become syncope hemoptysis"
# Only `max_new_tokens` is set: it takes precedence over `max_length`, and passing both triggers a warning
summary = summarizer(text, max_new_tokens=100, min_length=40, do_sample=True)
print(summary[0]['summary_text'])



 asd goes undetected preschoolers children complain feeling tired exertion tract infections otherwise appear healthy children shunts growth retardation children asd seldom endocarditis complications adults manifest pronounced symptoms fatigability dyspnea exertion point limitation.
