# Prognosis AI: Predicting and Managing Your Health


A conversational AI for healthcare that assists users with symptom checking, health management, and disease prediction.

Key Features:
- Symptom Checker: Users can describe their symptoms, and the AI uses NLP to identify potential health issues.
- Medical Advice: Offers general medical advice and suggests next steps, such as visiting a doctor or trying home remedies.
- Medication Reminders: Sends reminders to users to take their medication on time.
- Health Journal: Allows users to track their health metrics, such as blood pressure, weight, and exercise routines.
- Doctor Appointments: Helps users schedule and manage appointments with healthcare providers.
- Symptom Prediction: Users input their symptoms, and the AI provides a list of possible diseases ranked by likelihood.
- Emergency Assistance: Provides emergency contact information or connects the user to emergency services in case of serious symptoms.

Dataset Requirements:
- Symptom Checker: Requires a dataset of symptoms mapped to possible diseases. Classification systems such as the International Classification of Diseases (ICD) can supply standardized disease names for the labels.
- Symptom Prediction: Requires a dataset of symptoms and their corresponding diseases, along with treatment information. The UCI Machine Learning Repository and Kaggle may have suitable datasets.
- Health Journal: Requires data on health metrics such as blood pressure, weight, and exercise routines. This can be collected from users or sourced from health databases.

In [None]:
import pandas as pd

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
faq_df = pd.read_csv('/content/drive/MyDrive/Health_Care/Mental_Health_FAQ.csv')
train_df = pd.read_csv('/content/drive/MyDrive/Health_Care/Training_.csv')
dataset_df = pd.read_csv('/content/drive/MyDrive/Health_Care/dataset.csv')
intents_df = pd.read_json('/content/drive/MyDrive/Health_Care/intents_genral.json')
symptom_desc_df = pd.read_csv('/content/drive/MyDrive/Health_Care/symptom_Description.csv')
symptom_precaution = pd.read_csv('/content/drive/MyDrive/Health_Care/symptom_precaution.csv')
chat_df = pd.read_csv('/content/drive/MyDrive/Health_Care/train_data_chatbot.csv')

In [None]:
faq_df.head(5)

Unnamed: 0,Question_ID,Questions,Answers
0,1590140,What does it mean to have a mental illness?,Mental illnesses are health conditions that di...
1,2110618,Who does mental illness affect?,It is estimated that mental illness affects 1 ...
2,6361820,What causes mental illness?,It is estimated that mental illness affects 1 ...
3,9434130,What are some of the warning signs of mental i...,Symptoms of mental health disorders vary depen...
4,7657263,Can people with mental illness recover?,"When healing from mental illness, early identi..."


In [None]:
chat_df.head(5)

Unnamed: 0,short_question,short_answer,tags,label
0,can an antibiotic through an iv give you a ras...,yes it can even after you have finished the pr...,['rash' 'antibiotic'],1.0
1,can you test positive from having the hep b va...,test positive for what if you had a hep b vacc...,['hepatitis b'],1.0
2,what are the dietary restrictions for celiac d...,omitting gluten from the diet is the key to co...,['celiac disease'],1.0
3,can i transmit genital warts seventeen years a...,famotidine pepcid products is in a drug class ...,['wart'],-1.0
4,is all vitamin d the same,hi this means you do not have hepatitis b and ...,['vitamin d'],-1.0


In [None]:
train_df.head(5)

Unnamed: 0,itching,skin_rash,nodal_skin_eruptions,continuous_sneezing,shivering,chills,joint_pain,stomach_pain,acidity,ulcers_on_tongue,...,blackheads,scurring,skin_peeling,silver_like_dusting,small_dents_in_nails,inflammatory_nails,blister,red_sore_around_nose,yellow_crust_ooze,prognosis
0,1,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Fungal infection
1,0,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Fungal infection
2,1,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Fungal infection
3,1,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Fungal infection
4,1,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Fungal infection


In [None]:
train_df.columns

Index(['itching', 'skin_rash', 'nodal_skin_eruptions', 'continuous_sneezing',
       'shivering', 'chills', 'joint_pain', 'stomach_pain', 'acidity',
       'ulcers_on_tongue',
       ...
       'blackheads', 'scurring', 'skin_peeling', 'silver_like_dusting',
       'small_dents_in_nails', 'inflammatory_nails', 'blister',
       'red_sore_around_nose', 'yellow_crust_ooze', 'prognosis'],
      dtype='object', length=133)

In [None]:
dataset_df.head(5)

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17
0,Fungal infection,itching,skin_rash,nodal_skin_eruptions,dischromic _patches,,,,,,,,,,,,,
1,Fungal infection,skin_rash,nodal_skin_eruptions,dischromic _patches,,,,,,,,,,,,,,
2,Fungal infection,itching,nodal_skin_eruptions,dischromic _patches,,,,,,,,,,,,,,
3,Fungal infection,itching,skin_rash,dischromic _patches,,,,,,,,,,,,,,
4,Fungal infection,itching,skin_rash,nodal_skin_eruptions,,,,,,,,,,,,,,


In [None]:
intents_df.head(5)

Unnamed: 0,intents
0,"{'tag': 'greeting', 'patterns': ['Hi', 'Hey', ..."
1,"{'tag': 'morning', 'patterns': ['Good morning'..."
2,"{'tag': 'afternoon', 'patterns': ['Good aftern..."
3,"{'tag': 'evening', 'patterns': ['Good evening'..."
4,"{'tag': 'night', 'patterns': ['Good night'], '..."


In [None]:
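# symptom_Description.csv appears to have no header row, so pandas used its
# first record ("Drug Reaction", ...) as the column names. Rename them here.
# Note: that first record is then no longer a data row; reading the file with
# header=None and names=['Disease', 'Desc'] would keep it.
new_names = {
    'Drug Reaction': 'Disease',
    'An adverse drug reaction (ADR) is an injury caused by taking medication. ADRs may occur following a single dose or prolonged administration of a drug or result from the combination of two or more drugs.': 'Desc',
    }
# Rename the columns using the dictionary
symptom_desc_df = symptom_desc_df.rename(columns=new_names)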

symptom_desc_df.head(5)

Unnamed: 0,Disease,Desc
0,Malaria,An infectious disease caused by protozoan para...
1,Allergy,An allergy is an immune system response to a f...
2,Hypothyroidism,"Hypothyroidism, also called underactive thyroi..."
3,Psoriasis,Psoriasis is a common skin disorder that forms...
4,GERD,"Gastroesophageal reflux disease, or GERD, is a..."


In [None]:
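# symptom_precaution.csv also appears to lack a header row: the column names
# below are the values of its first record, which is therefore not a data row.
new_names = {
    'Drug Reaction': 'Disease',
    'stop irritation': 'advice_1','consult nearest hospital': 'advice_2','stop taking drug': 'advice_3','follow up':'advice_4'
    }
# Rename the columns using the dictionary
symptom_precaution = symptom_precaution.rename(columns=new_names)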

symptom_precaution.head(5)

Unnamed: 0,Disease,advice_1,advice_2,advice_3,advice_4
0,Malaria,Consult nearest hospital,avoid oily food,avoid non veg food,keep mosquitos out
1,Allergy,apply calamine,cover area with bandage,,use ice to compress itching
2,Hypothyroidism,reduce stress,exercise,eat healthy,get proper sleep
3,Psoriasis,wash hands with warm soapy water,stop bleeding using pressure,consult doctor,salt baths
4,GERD,avoid fatty spicy food,avoid lying down after eating,maintain healthy weight,exercise


# Pre-Processing

In [None]:
df = dataset_df
column_to_drop = 'Unnamed: 0'
df = df.drop(column_to_drop, axis=1)

df['Combined'] = df.select_dtypes(include=[object]).apply(lambda x: x.str.cat(sep=' '), axis=1)

In [None]:
# dict() keeps one entry per disease (the last row seen for it), so the result has one row per disease
data = dict(zip(dataset_df['Unnamed: 0'], df['Combined']))
new_dataset_df = pd.DataFrame(data.items(), columns=['Disease', 'symptoms'])
new_dataset_df.head(5)

Unnamed: 0,Disease,symptoms
0,Fungal infection,itching skin_rash nodal_skin_eruptions disc...
1,Allergy,continuous_sneezing shivering chills water...
2,GERD,stomach_pain acidity ulcers_on_tongue vomi...
3,Chronic cholestasis,itching vomiting yellowish_skin nausea los...
4,Drug Reaction,itching skin_rash stomach_pain burning_mict...


In [None]:
def split_text(text):
  return text.split()

new_dataset_df['Common_symptoms'] = new_dataset_df['symptoms'].apply(split_text)

In [None]:
new_dataset_df.head(5)

Unnamed: 0,Disease,symptoms,Common_symptoms
0,Fungal infection,itching skin_rash nodal_skin_eruptions disc...,"[itching, skin_rash, nodal_skin_eruptions, dis..."
1,Allergy,continuous_sneezing shivering chills water...,"[continuous_sneezing, shivering, chills, water..."
2,GERD,stomach_pain acidity ulcers_on_tongue vomi...,"[stomach_pain, acidity, ulcers_on_tongue, vomi..."
3,Chronic cholestasis,itching vomiting yellowish_skin nausea los...,"[itching, vomiting, yellowish_skin, nausea, lo..."
4,Drug Reaction,itching skin_rash stomach_pain burning_mict...,"[itching, skin_rash, stomach_pain, burning_mic..."
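
Because `dict(zip(...))` keeps only the last row seen for each disease, `Common_symptoms` reflects a single record rather than every symptom recorded for that disease. The following is a minimal sketch of one alternative that aggregates all rows per disease. The frame `toy` and its column names `s1` and `s2` are made up for illustration and do not reproduce the real `dataset.csv` layout.

```python
import pandas as pd

# Toy stand-in for dataset.csv: a disease column plus symptom slots (hypothetical names).
toy = pd.DataFrame({
    "Disease": ["Fungal infection", "Fungal infection", "Allergy"],
    "s1": ["itching", "skin_rash", "chills"],
    "s2": ["skin_rash", "nodal_skin_eruptions", None],
})

# Reshape to one (Disease, symptom) pair per row and drop the empty slots.
long_form = toy.melt(id_vars="Disease", value_name="symptom").dropna(subset=["symptom"])
long_form["symptom"] = long_form["symptom"].str.strip()

# Collect the distinct symptoms seen for each disease, in sorted order.
symptoms_by_disease = (
    long_form.groupby("Disease")["symptom"]
    .apply(lambda s: sorted(set(s)))
    .to_dict()
)
print(symptoms_by_disease)
```

Whether the union of symptoms or a single representative row is the better input depends on how the downstream matcher is meant to behave; this sketch only shows how the union could be built.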


In [None]:
dfs = [new_dataset_df,symptom_desc_df,symptom_precaution,intents_df,train_df,chat_df,faq_df]
for i in dfs:
  print(i.describe(include='all'))

                 Disease                                           symptoms  \
count                 41                                                 41   
unique                41                                                 41   
top     Fungal infection  itching  skin_rash  nodal_skin_eruptions  disc...   
freq                   1                                                  1   

                                          Common_symptoms  
count                                                  41  
unique                                                 41  
top     [itching, skin_rash, nodal_skin_eruptions, dis...  
freq                                                    1  
        Disease                                               Desc
count        40                                                 40
unique       40                                                 40
top     Malaria  An infectious disease caused by protozoan para...
freq          1                     

In [None]:
def analyze_and_clean_dfs(dfs, drop_nulls=True):
  cleaned_dfs = []
  for df in dfs:
    # Check for null values
    null_counts = df.isnull().sum()
    print(f"\n* Null value counts:\n{null_counts}")

    if drop_nulls:
      # Optionally create a copy to avoid modifying the original DataFrame
      cleaned_df = df.copy()
      cleaned_df.dropna(inplace=True)
      print(f"\n* After dropping nulls:\n{cleaned_df.describe(include='all')}")
      cleaned_dfs.append(cleaned_df)
    else:
      print("\n* Null values not dropped (optional).")
      cleaned_dfs.append(df.copy())
  return cleaned_dfs

cleaned_dfs = analyze_and_clean_dfs(dfs)


* Null value counts:
Disease            0
symptoms           0
Common_symptoms    0
dtype: int64

* After dropping nulls:
                 Disease                                           symptoms  \
count                 41                                                 41   
unique                41                                                 41   
top     Fungal infection  itching  skin_rash  nodal_skin_eruptions  disc...   
freq                   1                                                  1   

                                          Common_symptoms  
count                                                  41  
unique                                                 41  
top     [itching, skin_rash, nodal_skin_eruptions, dis...  
freq                                                    1  

* Null value counts:
Disease    0
Desc       0
dtype: int64

* After dropping nulls:
        Disease                                               Desc
count        40             

# Progress Building

In [None]:
#Symptom finder

import pandas as pd
import nltk
nltk.download('punkt')
nltk.download('stopwords')
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

df = new_dataset_df

# Build the stop-word set once instead of reloading the list for every token
STOP_WORDS = set(stopwords.words('english'))

# Function to process user input
def process_input(user_input):
    tokens = word_tokenize(user_input.lower())
    # Remove stop words
    tokens = [token for token in tokens if token not in STOP_WORDS]
    return ' '.join(tokens)

# Function to find disease based on symptoms
def find_disease(user_input):
    processed_input = process_input(user_input)
    tfidf_vectorizer = TfidfVectorizer()
    # Fit TF-IDF on one space-joined symptom string per disease
    tfidf_matrix = tfidf_vectorizer.fit_transform(df['Common_symptoms'].apply(' '.join))
    user_tfidf = tfidf_vectorizer.transform([processed_input])
    similarities = cosine_similarity(user_tfidf, tfidf_matrix)[0]
    # argmax returns a position, so look the disease up positionally
    index = similarities.argmax()
    return df['Disease'].iloc[index]

# Sample user input
user_input = "I have itching and skin rash,shivering, chills, joint_pain,stomach_pain"

# Find disease based on user input
disease_f = find_disease(user_input)
print(f"The most likely disease based on your symptoms is: {disease_f}")

The most likely disease based on your symptoms is: Allergy


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
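
The symptom finder above returns only the single best match, and the default `TfidfVectorizer` tokenizer treats `skin_rash` as one token, so a user who types "skin rash" does not match it. The following is a minimal sketch of a ranked variant that replaces underscores with spaces before vectorizing. The dictionary `toy_symptoms` and the helpers `normalize` and `rank_diseases` are illustrative names, and scikit-learn's built-in English stop-word list is swapped in for the NLTK one used in the notebook. The scores are cosine similarities, not calibrated likelihoods.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy symptom lists (hypothetical subset, same shape as Common_symptoms).
toy_symptoms = {
    "Fungal infection": ["itching", "skin_rash", "nodal_skin_eruptions"],
    "Allergy": ["continuous_sneezing", "shivering", "chills"],
    "GERD": ["stomach_pain", "acidity", "vomiting"],
}

def normalize(text):
    # Treat "skin_rash" and "skin rash" alike by turning underscores into spaces.
    return text.lower().replace("_", " ")

diseases = list(toy_symptoms)
corpus = [normalize(" ".join(symptoms)) for symptoms in toy_symptoms.values()]

# Fit once; the same vectorizer is reused for every query.
vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(corpus)

def rank_diseases(user_input, top_k=3):
    """Return (disease, similarity) pairs sorted from most to least similar."""
    query = vectorizer.transform([normalize(user_input)])
    scores = cosine_similarity(query, matrix)[0]
    order = scores.argsort()[::-1][:top_k]
    return [(diseases[i], float(scores[i])) for i in order]

ranking = rank_diseases("I have itching and skin rash")
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```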


# Medical Advice
Use keyword matching or a similar technique to match the user's input against the dataset and return the corresponding answer.

```
import pandas as pd
!pip install fuzzywuzzy
from fuzzywuzzy import process
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

df_disease = symptom_desc_df
df_advice = symptom_precaution

# Function to process user input
def process_input(user_input):
    tokens = word_tokenize(user_input.lower())
    # Remove stop words
    tokens = [token for token in tokens if token not in stopwords.words('english')]
    return ' '.join(tokens)

# Function to find disease based on symptoms
def find_disease(user_input):
    processed_input = process_input(user_input)
    ratios = df_disease['Desc'].apply(lambda x: process.extractOne(processed_input, word_tokenize(x.lower()))[1])
    index = ratios.idxmax()
    return df_disease.loc[index]

# Sample user input
user_input = "I have itching and skin rash , stomach pain , shivering"

# Find disease based on user input
disease = find_disease(user_input)

# Print disease description
print(f"Disease: {disease['Disease']}\nDescription: {disease['Desc']}\n")

# Print medical advice
advice = df_advice[df_advice['Disease'] == disease['Disease']].iloc[0]
print("Medical Advice:")
for i in range(1, 5):
    # Skip missing advice slots (a missing value is NaN, not the string 'NaN')
    if pd.notna(advice[f'advice_{i}']):
        print(f"- {advice[f'advice_{i}']}")
```

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

df_disease = symptom_desc_df

df_advice = symptom_precaution

# Build the stop-word set once instead of reloading the list for every token
STOP_WORDS = set(stopwords.words('english'))

# Function to process user input
def process_input(user_input):
    tokens = word_tokenize(user_input.lower())
    # Remove stop words
    tokens = [token for token in tokens if token not in STOP_WORDS]
    return ' '.join(tokens)

# Function to find disease based on symptoms
def find_disease(user_input):
    processed_input = process_input(user_input)
    tfidf_vectorizer = TfidfVectorizer()
    tfidf_matrix = tfidf_vectorizer.fit_transform(df_disease['Desc'])
    user_tfidf = tfidf_vectorizer.transform([processed_input])
    similarities = cosine_similarity(user_tfidf, tfidf_matrix)[0]
    index = similarities.argmax()
    return df_disease.loc[index]

# Find disease based on user input
# (user_input is the sample string defined in the symptom-finder cell above)
disease = find_disease(user_input)

# Print disease description
print(f"Disease: {disease['Disease']}\nDescription: {disease['Desc']}\n")

# Print medical advice
advice = df_advice[df_advice['Disease'] == disease['Disease']].iloc[0]
print("Medical Advice:")
for i in range(1, 5):
    # Skip missing advice slots (a missing value is NaN, not the string 'NaN')
    if pd.notna(advice[f'advice_{i}']):
        print(f"- {advice[f'advice_{i}']}")

Disease: Chicken pox
Description: Chickenpox is a highly contagious disease caused by the varicella-zoster virus (VZV). It can cause an itchy, blister-like rash. The rash first appears on the chest, back, and face, and then spreads over the entire body, causing between 250 and 500 itchy blisters.

Medical Advice:
- use neem in bathing 
- consume neem leaves
- take vaccine
- avoid public places


# Model Building with Conversational AI

In [None]:
# Outer join on the 'Questions' (df1) and 'short_question' (df2) columns
df1 = faq_df
df2 = chat_df
merged_df = pd.merge(df1, df2, how='outer', left_on='Questions', right_on='short_question')


In [None]:
merged_df = merged_df.drop(['Question_ID'], axis=1)

In [None]:
merged_df

Unnamed: 0,Questions,Answers,short_question,short_answer,tags,label
0,What does it mean to have a mental illness?,Mental illnesses are health conditions that di...,,,,
1,Who does mental illness affect?,It is estimated that mental illness affects 1 ...,,,,
2,What causes mental illness?,It is estimated that mental illness affects 1 ...,,,,
3,What are some of the warning signs of mental i...,Symptoms of mental health disorders vary depen...,,,,
4,Can people with mental illness recover?,"When healing from mental illness, early identi...",,,,
...,...,...,...,...,...,...
47696,,,what is an icsi intracytoplasmic sperm injection,intracytoplasmic sperm injection icsi is a lab...,['injection' 'sperm'],1.0
47697,,,can b12 vitamins raise lab results b12 lab res...,i am sorry you are going through this and it i...,['b12' 'vitamin'],-1.0
47698,,,what can you give a 1 month old baby that has ...,there are no over the counter medications appr...,['coldness' 'baby' 'runny nose'],1.0
47699,,,when should the doctor be called for diarrhea,okay so type 1 is common to oral area and type...,['diarrhea'],-1.0


In [None]:
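# Caution: fillna() with a Series aligns on row index, so row i of df1 is
# paired with row i of df2 regardless of the question text.
df1['Question/short_question'] = df1['Questions'].fillna(df2['short_question'])
df1['Answer/short_answer'] = df1['Answers'].fillna(df2['short_answer'])
print(df1)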

    Question_ID                                          Questions  \
0       1590140        What does it mean to have a mental illness?   
1       2110618                    Who does mental illness affect?   
2       6361820                        What causes mental illness?   
4       7657263            Can people with mental illness recover?   
..          ...                                                ...   
93      4373204            How do I know if I'm drinking too much?   
94      7807643  If cannabis is dangerous, why are we legalizin...   
95      4352464       How can I convince my kids not to use drugs?   
96      6521784  What is the legal status (and evidence) of CBD...   
97      3221856                    What is the evidence on vaping?   

                                              Answers  \
0   Mental illnesses are health conditions that di...   
1   It is estimated that mental illness affects 1 ...   
2   It is estimated that mental illness affects 1 ...   
3

### Classifier Model

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

data = train_df

X = data.drop("prognosis", axis=1)
y = data["prognosis"]

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  # Adjust test_size as needed

# Create a Random Forest Classifier model
model = RandomForestClassifier(n_estimators=100, random_state=42)  # Adjust n_estimators (number of trees)

# Train the model on the training data
model.fit(X_train, y_train)

# Make predictions on the testing data
y_pred = model.predict(X_test)

# Evaluate model performance
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")
print(classification_report(y_test, y_pred))

Accuracy: 1.0000
                                         precision    recall  f1-score   support

(vertigo) Paroymsal  Positional Vertigo       1.00      1.00      1.00        18
                                   AIDS       1.00      1.00      1.00        30
                                   Acne       1.00      1.00      1.00        24
                    Alcoholic hepatitis       1.00      1.00      1.00        25
                                Allergy       1.00      1.00      1.00        24
                              Arthritis       1.00      1.00      1.00        23
                       Bronchial Asthma       1.00      1.00      1.00        33
                   Cervical spondylosis       1.00      1.00      1.00        23
                            Chicken pox       1.00      1.00      1.00        21
                    Chronic cholestasis       1.00      1.00      1.00        15
                            Common Cold       1.00      1.00      1.00        23
          

In [None]:
import random

# Note: the row is drawn from the full dataset, so it may be one the model was trained on
random_index = random.randint(0, len(data)-1)
random_row = X.iloc[random_index].values.reshape(1, -1)
prediction = model.predict(random_row)
actual_prognosis = y.iloc[random_index]

print(f"Random Row Index: {random_index}")
print(f"Features: {random_row}")
print(f"Predicted Prognosis: {prediction[0]}")
print(f"Actual Prognosis: {actual_prognosis}")

Random Row Index: 497
Features: [[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0
  0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]
Predicted Prognosis: Gastroenteritis
Actual Prognosis: Gastroenteritis
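
An accuracy of 1.0000 deserves a second look. One possible explanation, an assumption that is not verified against `Training_.csv` here, is that identical feature rows occur many times, so a random split places copies of the same row in both the training and the test set. The following is a minimal sketch of how that could be checked, using made-up data in place of `train_df`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up one-hot symptom table with repeated rows (not the real training data).
toy = pd.DataFrame({
    "itching":   [1, 1, 1, 1, 0, 0, 0, 0],
    "chills":    [0, 0, 0, 0, 1, 1, 1, 1],
    "prognosis": ["Fungal infection"] * 4 + ["Allergy"] * 4,
})

# How many rows are exact repeats of an earlier row?
n_duplicates = int(toy.duplicated().sum())
print(f"Duplicate rows: {n_duplicates} of {len(toy)}")

X = toy.drop("prognosis", axis=1)
y = toy["prognosis"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Count test rows whose feature pattern also appears in the training split.
train_patterns = set(map(tuple, X_train.to_numpy()))
leaked = sum(tuple(row) in train_patterns for row in X_test.to_numpy())
print(f"Test rows also present in training data: {leaked} of {len(X_test)}")
```

If the real data behaves like this toy table, calling `drop_duplicates()` before the split would give a more honest estimate of how the model handles unseen symptom combinations.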




# Transformer Model

In [None]:
advice_cols = ['advice_1', 'advice_2', 'advice_3', 'advice_4']
symptom_precaution['advices'] = symptom_precaution[advice_cols].apply(lambda row: ', '.join(row.dropna()), axis=1)
symptom_precaution['advices'] = symptom_precaution['advices'].apply(lambda x: x.split(', '))
symptom_precaution

Unnamed: 0,Disease,advice_1,advice_2,advice_3,advice_4,advices
0,Malaria,Consult nearest hospital,avoid oily food,avoid non veg food,keep mosquitos out,"[Consult nearest hospital, avoid oily food, av..."
1,Allergy,apply calamine,cover area with bandage,,use ice to compress itching,"[apply calamine, cover area with bandage, use ..."
2,Hypothyroidism,reduce stress,exercise,eat healthy,get proper sleep,"[reduce stress, exercise, eat healthy, get pro..."
3,Psoriasis,wash hands with warm soapy water,stop bleeding using pressure,consult doctor,salt baths,"[wash hands with warm soapy water, stop bleedi..."
4,GERD,avoid fatty spicy food,avoid lying down after eating,maintain healthy weight,exercise,"[avoid fatty spicy food, avoid lying down afte..."
5,Chronic cholestasis,cold baths,anti itch medicine,consult doctor,eat healthy,"[cold baths, anti itch medicine, consult docto..."
6,hepatitis A,Consult nearest hospital,wash hands through,avoid fatty spicy food,medication,"[Consult nearest hospital, wash hands through,..."
7,Osteoarthristis,acetaminophen,consult nearest hospital,follow up,salt baths,"[acetaminophen, consult nearest hospital, foll..."
8,(vertigo) Paroymsal Positional Vertigo,lie down,avoid sudden change in body,avoid abrupt head movment,relax,"[lie down, avoid sudden change in body, avoid ..."
9,Hypoglycemia,lie down on side,check in pulse,drink sugary drinks,consult doctor,"[lie down on side, check in pulse, drink sugar..."


In [None]:
df = symptom_desc_df
merged_df = pd.merge(symptom_precaution[['Disease', 'advices']], symptom_desc_df[['Disease', 'Desc']], how='inner', on='Disease')
merged_df = pd.merge(merged_df[['Disease', 'advices','Desc']], new_dataset_df[['Disease', 'Common_symptoms']], how='inner', on='Disease')
merged_df

Unnamed: 0,Disease,advices,Desc,Common_symptoms
0,Malaria,"[Consult nearest hospital, avoid oily food, av...",An infectious disease caused by protozoan para...,"[chills, vomiting, high_fever, sweating, heada..."
1,Allergy,"[apply calamine, cover area with bandage, use ...",An allergy is an immune system response to a f...,"[continuous_sneezing, shivering, chills, water..."
2,Hypothyroidism,"[reduce stress, exercise, eat healthy, get pro...","Hypothyroidism, also called underactive thyroi...","[fatigue, weight_gain, cold_hands_and_feets, m..."
3,Psoriasis,"[wash hands with warm soapy water, stop bleedi...",Psoriasis is a common skin disorder that forms...,"[skin_rash, joint_pain, skin_peeling, silver_l..."
4,GERD,"[avoid fatty spicy food, avoid lying down afte...","Gastroesophageal reflux disease, or GERD, is a...","[stomach_pain, acidity, ulcers_on_tongue, vomi..."
5,Chronic cholestasis,"[cold baths, anti itch medicine, consult docto...","Chronic cholestatic diseases, whether occurrin...","[itching, vomiting, yellowish_skin, nausea, lo..."
6,hepatitis A,"[Consult nearest hospital, wash hands through,...",Hepatitis A is a highly contagious liver infec...,"[joint_pain, vomiting, yellowish_skin, dark_ur..."
7,Osteoarthristis,"[acetaminophen, consult nearest hospital, foll...",Osteoarthritis is the most common form of arth...,"[joint_pain, neck_pain, knee_pain, hip_joint_p..."
8,(vertigo) Paroymsal Positional Vertigo,"[lie down, avoid sudden change in body, avoid ...",Benign paroxysmal positional vertigo (BPPV) is...,"[vomiting, headache, nausea, spinning_movement..."
9,Hypoglycemia,"[lie down on side, check in pulse, drink sugar...",Hypoglycemia is a condition in which your blo...,"[vomiting, fatigue, anxiety, sweating, headach..."


In [None]:
import random
import pandas as pd

greetings = ["Hi,", "Hey,", "Hi there,", "Hello,", "Hey there,"]
continuation = ['i have been suffereing from ', 'for the past few days ','past few days ','facing some symptoms like ']

df = merged_df

# # Assuming your data is stored in a DataFrame called df
# df['user_input'] = ""

# for index, row in df.iterrows():
#     selected_greeting = random.choice(greetings)
#     selected_continuation = random.choice(continuation)
#     symptoms = [col.replace('_', ' ') for col in df.columns[:-1] if row[col] == 1]
#     user_input_text = f"{selected_greeting} {selected_continuation} {', '.join(symptoms)}"
#     df.at[index, 'user_input'] = user_input_text

# # Display the DataFrame with the new 'user_input' column
# print(df.head())

# Note: df is merged_df here, which has no one-hot symptom columns, so the
# row[col] == 1 filter matches nothing and the symptom list comes out empty
df['user_input'] = df.apply(lambda row: f"{random.choice(greetings)} {random.choice(continuation)} {', '.join([col.replace('_', ' ') for col in df.columns[:-1] if row[col] == 1])}", axis=1)

# Display the DataFrame with the new 'user_input' column
print(df.head())

          Disease                                            advices  \
0         Malaria  [Consult nearest hospital, avoid oily food, av...   
1         Allergy  [apply calamine, cover area with bandage, use ...   
2  Hypothyroidism  [reduce stress, exercise, eat healthy, get pro...   
3       Psoriasis  [wash hands with warm soapy water, stop bleedi...   
4            GERD  [avoid fatty spicy food, avoid lying down afte...   

                                                Desc  \
0  An infectious disease caused by protozoan para...   
1  An allergy is an immune system response to a f...   
2  Hypothyroidism, also called underactive thyroi...   
3  Psoriasis is a common skin disorder that forms...   
4  Gastroesophageal reflux disease, or GERD, is a...   

                                     Common_symptoms  \
0  [chills, vomiting, high_fever, sweating, heada...   
1  [continuous_sneezing, shivering, chills, water...   
2  [fatigue, weight_gain, cold_hands_and_feets, m...   
3  [sk

In [None]:
df

Unnamed: 0,Disease,advices,Desc,Common_symptoms,user_input
0,Malaria,"[Consult nearest hospital, avoid oily food, av...",An infectious disease caused by protozoan para...,"[chills, vomiting, high_fever, sweating, heada...","Hey, past few days"
1,Allergy,"[apply calamine, cover area with bandage, use ...",An allergy is an immune system response to a f...,"[continuous_sneezing, shivering, chills, water...","Hey, facing some symptoms like"
2,Hypothyroidism,"[reduce stress, exercise, eat healthy, get pro...","Hypothyroidism, also called underactive thyroi...","[fatigue, weight_gain, cold_hands_and_feets, m...","Hi there, for the past few days"
3,Psoriasis,"[wash hands with warm soapy water, stop bleedi...",Psoriasis is a common skin disorder that forms...,"[skin_rash, joint_pain, skin_peeling, silver_l...","Hi, for the past few days"
4,GERD,"[avoid fatty spicy food, avoid lying down afte...","Gastroesophageal reflux disease, or GERD, is a...","[stomach_pain, acidity, ulcers_on_tongue, vomi...","Hey, for the past few days"
5,Chronic cholestasis,"[cold baths, anti itch medicine, consult docto...","Chronic cholestatic diseases, whether occurrin...","[itching, vomiting, yellowish_skin, nausea, lo...","Hi there, past few days"
6,hepatitis A,"[Consult nearest hospital, wash hands through,...",Hepatitis A is a highly contagious liver infec...,"[joint_pain, vomiting, yellowish_skin, dark_ur...","Hey there, facing some symptoms like"
7,Osteoarthristis,"[acetaminophen, consult nearest hospital, foll...",Osteoarthritis is the most common form of arth...,"[joint_pain, neck_pain, knee_pain, hip_joint_p...","Hi there, for the past few days"
8,(vertigo) Paroymsal Positional Vertigo,"[lie down, avoid sudden change in body, avoid ...",Benign paroxysmal positional vertigo (BPPV) is...,"[vomiting, headache, nausea, spinning_movement...","Hello, i have been suffereing from"
9,Hypoglycemia,"[lie down on side, check in pulse, drink sugar...",Hypoglycemia is a condition in which your blo...,"[vomiting, fatigue, anxiety, sweating, headach...","Hey there, past few days"
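
In the table above, every `user_input` stops right after the opening phrase, because this frame stores symptoms as a list in `Common_symptoms` rather than as one-hot columns. The following is a minimal sketch that builds the sentence from that list instead, on a two-row stand-in frame. The helper name `make_user_input`, the fixed seed, and the shortened phrase lists are illustrative choices, not part of the notebook.

```python
import random
import pandas as pd

random.seed(0)  # fixed seed only so that repeated runs pick the same phrases

greetings = ["Hi,", "Hey,", "Hello,"]
continuation = ["i have been suffering from", "facing some symptoms like"]

# Two-row stand-in for merged_df (values abbreviated).
toy = pd.DataFrame({
    "Disease": ["Malaria", "GERD"],
    "Common_symptoms": [["chills", "high_fever"], ["stomach_pain", "acidity"]],
})

def make_user_input(symptoms):
    # Turn identifiers such as "high_fever" into plain words and join them.
    readable = ", ".join(s.replace("_", " ") for s in symptoms)
    return f"{random.choice(greetings)} {random.choice(continuation)} {readable}"

toy["user_input"] = toy["Common_symptoms"].apply(make_user_input)
print(toy[["Disease", "user_input"]])
```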


In [None]:
import random

conclude = ['This is may be ', 'Analysing the Symptoms ', 'It may be due to ']

# Assuming your DataFrame is called df
df['AI output'] = ""

for index, row in df.iterrows():
    selected_conclude = random.choice(conclude)
    # df is merged_df here; its label column is 'Disease' (it has no 'prognosis' column)
    ai_output_text = f"{selected_conclude} {row['Disease']}"
    df.at[index, 'AI output'] = ai_output_text

# Display the DataFrame with the new 'AI output' column
print(df.head())

In [None]:
new_df = merged_df
import pandas as pd

new_df['merged'] = new_df.apply(lambda x: ', '.join(map(str, x['advices'])) + ', ' + x['Desc'] + ', ' + ', '.join(map(str, x['Common_symptoms'])), axis=1)
# Display the DataFrame with the new 'merged' column
print(new_df.head())

new_names = {
    'prognosis': 'Disease' }
# Rename the columns using the dictionary
df = df.rename(columns=new_names)

merged_df = pd.merge(new_df[['Disease', 'merged']], df[['Disease', 'user_input' ,'AI output' ]], how='inner', on='Disease')
merged_df['joined'] = merged_df['AI output'] + " "+merged_df['merged']

In [None]:
merged_df = merged_df.drop(['merged','AI output'],axis = 1)

In [None]:
symptoms_question_df = merged_df

In [None]:
# Outer join on the 'Questions' (df1) and 'short_question' (df2) columns
faq_df = faq_df.drop(['Question/short_question','Answer/short_answer'],axis=1)
df1 = faq_df
df2 = chat_df
merged_df = pd.merge(df1, df2, how='outer', left_on='Questions', right_on='short_question')

In [None]:
merged_df.fillna(" ", inplace=True)
merged_df['user_input_ques'] = merged_df['Questions'] + merged_df['short_question']
merged_df['user_input_ans'] = merged_df['Answers'] + merged_df['short_answer']
merged_df = merged_df.drop(['Question_ID','Questions','Answers','short_question','short_answer','tags','label'],axis=1)
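
After an outer join on columns that share no values, each row has text on exactly one side. Filling the gaps with a space and concatenating works, but it leaves a stray leading or trailing space in every value. The following is a minimal sketch of an alternative that coalesces the two columns with `fillna`, shown on made-up rows rather than the real FAQ and chatbot files.

```python
import pandas as pd

# Made-up single-row stand-ins for faq_df and chat_df.
faq = pd.DataFrame({"Questions": ["What causes mental illness?"],
                    "Answers": ["Many factors can contribute."]})
chat = pd.DataFrame({"short_question": ["is all vitamin d the same"],
                     "short_answer": ["no, there are several forms"]})

merged = pd.merge(faq, chat, how="outer", left_on="Questions", right_on="short_question")

# Take the FAQ text where present, otherwise fall back to the chatbot text.
merged["user_input_ques"] = merged["Questions"].fillna(merged["short_question"])
merged["user_input_ans"] = merged["Answers"].fillna(merged["short_answer"])

result = merged[["user_input_ques", "user_input_ans"]]
print(result)
```

Both columns in the same frame share one index, so here `fillna` pairs each value with the fallback from its own row.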

In [None]:
general_med_df = merged_df

In [None]:
symptoms_question_df

Unnamed: 0,Disease,user_input,joined
0,Malaria,"Hey, i have been suffereing from",It may be due to Malaria Consult nearest hosp...
1,Allergy,"Hi, i have been suffereing from","It may be due to Allergy apply calamine, cove..."
2,Hypothyroidism,"Hello, facing some symptoms like",It may be due to Hypothyroidism reduce stress...
3,Psoriasis,"Hi, past few days",This is may be Psoriasis wash hands with warm...
4,GERD,"Hey there, i have been suffereing from","It may be due to GERD avoid fatty spicy food,..."
5,Chronic cholestasis,"Hey there, for the past few days",Analysing the Symptoms Chronic cholestasis co...
6,hepatitis A,"Hi there, for the past few days",Analysing the Symptoms hepatitis A Consult ne...
7,Osteoarthristis,"Hi there, facing some symptoms like",It may be due to Osteoarthristis acetaminophe...
8,(vertigo) Paroymsal Positional Vertigo,"Hey there, facing some symptoms like",Analysing the Symptoms (vertigo) Paroymsal P...
9,Hypoglycemia,"Hey there, facing some symptoms like","This is may be Hypoglycemia lie down on side,..."


In [None]:
general_med_df

Unnamed: 0,user_input_ques,user_input_ans
0,What does it mean to have a mental illness?,Mental illnesses are health conditions that di...
1,Who does mental illness affect?,It is estimated that mental illness affects 1 ...
2,What causes mental illness?,It is estimated that mental illness affects 1 ...
3,What are some of the warning signs of mental i...,Symptoms of mental health disorders vary depen...
4,Can people with mental illness recover?,"When healing from mental illness, early identi..."
...,...,...
47696,what is an icsi intracytoplasmic sperm injection,intracytoplasmic sperm injection icsi is a la...
47697,can b12 vitamins raise lab results b12 lab re...,i am sorry you are going through this and it ...
47698,what can you give a 1 month old baby that has...,there are no over the counter medications app...
47699,when should the doctor be called for diarrhea,okay so type 1 is common to oral area and typ...


In [None]:
df1 = general_med_df
df2 = symptoms_question_df
final_df = pd.concat([df2.rename(columns={'user_input': 'user_input_ques', 'joined': 'user_input_ans'}), df1])

In [None]:
final_df.to_csv('final_data.csv')

In [None]:
final_df.describe(include = 'all')

Unnamed: 0,Disease,user_input_ques,user_input_ans
count,37,47738,47738
unique,37,22613,26872
top,Malaria,i have a itchy rash that comes and goes what ...,yes
freq,1,30,106


#Training The Model

In [None]:
import pandas as pd
from google.colab import drive

drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import os
import time
import math
import random
import datetime
from pathlib import Path

os.environ["TF_CPP_MIN_LOG_LEVEL"] = "1"  # reduce the amount of console output from TF
import tensorflow as tf

!pip install -q datasets # install HF datasets library
from datasets import load_dataset
from transformers import *
from transformers import logging as hf_logging

# Only show errors from the transformers library
hf_logging.set_verbosity_error()

import logging

print('TF version',tf.__version__)
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU'))) # check GPU available

TF version 2.15.0
Num GPUs Available:  1


In [None]:
final_df = pd.read_csv('/content/drive/MyDrive/Health_Care/final_data.csv')

In [None]:
final_df = final_df.drop(['Disease','Unnamed: 0'],axis=1)

In [None]:
final_df.head(3)

Unnamed: 0,user_input_ques,user_input_ans
0,"Hey there, for the past few days chills, vomi...",Analysing the Symptoms Malaria Consult neares...
1,"Hey, facing some symptoms like chills, vomiti...",This is may be Malaria Consult nearest hospit...
2,"Hello, past few days chills, vomiting, high f...",It may be due to Malaria Consult nearest hosp...


In [None]:
import os
import re
import numpy as np
import pandas as pd
from time import time
import matplotlib.pyplot as plt
import tensorflow as tf

#!pip install tensorflow-datasets==1.2.0
import tensorflow_datasets as tfds

In [None]:
try:
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    print('Running on TPU {}'.format(tpu.cluster_spec().as_dict()['worker']))
except ValueError:
    tpu = None

if tpu:
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.TPUStrategy(tpu)
else:
    strategy = tf.distribute.get_strategy()

print("REPLICAS: {}".format(strategy.num_replicas_in_sync))

REPLICAS: 1


In [None]:
def textPreprocess(input_text):

  def removeAccents(input_text):
      strange='ąćęłńóśżź'
      ascii_replacements='acelnoszz'
      translator=str.maketrans(strange,ascii_replacements)
      return input_text.translate(translator)

  def removeSpecial(input_text):
      special='[^A-Za-z0-9 ]+'
      return re.sub(special, '', input_text)

  def removeTriplicated(input_text):
      return re.compile(r'(.)\1{2,}', re.IGNORECASE).sub(r'\1', input_text)

  return removeTriplicated(removeSpecial(removeAccents(input_text.lower())))


df = final_df

# Shuffle the rows and keep 80% of them. The .copy() makes df_selected an
# independent DataFrame, so the column assignments below do not trigger
# pandas' SettingWithCopyWarning.
df_shuffled = df.sample(frac=1, random_state=42)
num_rows_to_keep = int(len(df_shuffled) * 0.8)
df_selected = df_shuffled.iloc[:num_rows_to_keep].copy()
df = df_selected
df['user_input_ques'] = df['user_input_ques'].apply(lambda x: textPreprocess(str(x)))
df['user_input_ans'] = df['user_input_ans'].apply(lambda x: textPreprocess(str(x)))
questions, answers = df['user_input_ques'].tolist(), df['user_input_ans'].tolist()


In [None]:
# Build tokenizer using tfds for both questions and answers
tokenizer = tfds.deprecated.text.SubwordTextEncoder.build_from_corpus(
    questions + answers, target_vocab_size=2**13)

# Define start and end token to indicate the start and end of a sentence
START_TOKEN, END_TOKEN = [tokenizer.vocab_size], [tokenizer.vocab_size + 1]

# Vocabulary size plus start and end token
VOCAB_SIZE = tokenizer.vocab_size + 2

In [None]:
# Tokenize, filter and pad sentences
def tokenize_and_filter(inputs, outputs):
  tokenized_inputs, tokenized_outputs = [], []

  for (sentence1, sentence2) in zip(inputs, outputs):
    # tokenize sentence
    sentence1 = START_TOKEN + tokenizer.encode(sentence1) + END_TOKEN
    sentence2 = START_TOKEN + tokenizer.encode(sentence2) + END_TOKEN
    # check tokenized sentence max length
    if len(sentence1) <= MAX_LENGTH and len(sentence2) <= MAX_LENGTH:
      tokenized_inputs.append(sentence1)
      tokenized_outputs.append(sentence2)

  # pad tokenized sentences
  tokenized_inputs = tf.keras.preprocessing.sequence.pad_sequences(
      tokenized_inputs, maxlen=MAX_LENGTH, padding='post')
  tokenized_outputs = tf.keras.preprocessing.sequence.pad_sequences(
      tokenized_outputs, maxlen=MAX_LENGTH, padding='post')

  return tokenized_inputs, tokenized_outputs


questions, answers = tokenize_and_filter(questions, answers)

In [None]:
print('Vocab size: {}'.format(VOCAB_SIZE))
print('Number of samples: {}'.format(len(questions)))

Vocab size: 8257
Number of samples: 33196
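
The cells in this section reference `MAX_LENGTH`, `BUFFER_SIZE`, `BATCH_SIZE`, `NUM_LAYERS`, `D_MODEL`, `NUM_HEADS`, `UNITS` and `DROPOUT`, none of which are defined in the cells shown. The sketch below lists one set of values that is consistent with the printed outputs: the encoder's 5276416 parameters and 256-wide output in the model summary, and the 260 steps per epoch in the training log. `NUM_HEADS`, `DROPOUT`, `MAX_LENGTH` and `BUFFER_SIZE` cannot be recovered from those outputs and are placeholder assumptions, as is the helper `encoder_param_count`.

```python
import math

# Values consistent with the outputs printed in this notebook
D_MODEL = 256      # encoder output shape is (None, None, 256)
NUM_LAYERS = 6     # reproduces the encoder parameter count (checked below)
UNITS = 512        # reproduces the encoder parameter count (checked below)
BATCH_SIZE = 128   # 33196 samples in batches of 128 give 260 steps per epoch

# Assumptions only: these cannot be recovered from the printed outputs
NUM_HEADS = 8
DROPOUT = 0.1
MAX_LENGTH = 40
BUFFER_SIZE = 20000


def encoder_param_count(vocab_size, num_layers, units, d_model):
    """Trainable parameters of the encoder as it is built in this notebook."""
    embedding = vocab_size * d_model
    # query, key, value and output Dense layers, each with a bias
    attention = 4 * (d_model * d_model + d_model)
    # two Dense layers of the feed-forward block, each with a bias
    feed_forward = (d_model * units + units) + (units * d_model + d_model)
    # two LayerNormalization layers, each with a scale and an offset vector
    layer_norms = 2 * (2 * d_model)
    return embedding + num_layers * (attention + feed_forward + layer_norms)


assert D_MODEL % NUM_HEADS == 0  # required by MultiHeadAttention
print(encoder_param_count(8257, NUM_LAYERS, UNITS, D_MODEL))  # 5276416
print(math.ceil(33196 / BATCH_SIZE))                          # 260
```

Changing `MAX_LENGTH` changes how many question/answer pairs survive `tokenize_and_filter`, so the sample count of 33196 would only be reproduced with the value used in the original run.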


In [None]:
# decoder inputs use the previous target as input
# remove START_TOKEN from targets
dataset = tf.data.Dataset.from_tensor_slices((
    {
        'inputs': questions,
        'dec_inputs': answers[:, :-1]
    },
    {
        'outputs': answers[:, 1:]
    },
))

dataset = dataset.cache()
dataset = dataset.shuffle(BUFFER_SIZE)
dataset = dataset.batch(BATCH_SIZE)
dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)
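
The two slices in the cell above set up teacher forcing: the decoder receives the answer without its last position and is trained to predict the same answer shifted left by one. A minimal sketch with plain lists follows. The ids 17 and 42 are made-up subword ids, while the start and end ids follow from the vocabulary size of 8257 printed earlier (`tokenizer.vocab_size` is 8255).

```python
START, END, PAD = 8255, 8256, 0

# One padded answer, as produced by tokenize_and_filter
answer = [START, 17, 42, END, PAD, PAD]

dec_inputs = answer[:-1]   # what the decoder is fed
targets = answer[1:]       # what the decoder must predict

# At step t the decoder has seen dec_inputs[:t + 1] and predicts targets[t]
for t, target in enumerate(targets):
    print(dec_inputs[:t + 1], '->', target)
```

The padded positions still appear as targets here; `loss_function` is where they are masked out.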

In [None]:
def scaled_dot_product_attention(query, key, value, mask):
  """Calculate the attention weights. """
  matmul_qk = tf.matmul(query, key, transpose_b=True)

  # scale matmul_qk
  depth = tf.cast(tf.shape(key)[-1], tf.float32)
  logits = matmul_qk / tf.math.sqrt(depth)

  # add the mask to zero out padding tokens
  if mask is not None:
    logits += (mask * -1e9)

  # softmax is normalized on the last axis (seq_len_k)
  attention_weights = tf.nn.softmax(logits, axis=-1)

  output = tf.matmul(attention_weights, value)

  return output
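
To see what the mask does, the sketch below repeats the same computation for a single query in plain Python. The vectors and the helper names `softmax` and `attend` are illustrative assumptions. The point is that adding `-1e9` to a logit drives its softmax weight to zero, so a padded position contributes nothing to the output.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values, mask):
    """Scaled dot-product attention for one query; mask[j] == 1 hides key j."""
    depth = len(query)
    logits = [sum(q * k for q, k in zip(query, key)) / math.sqrt(depth)
              for key in keys]
    logits = [logit + m * -1e9 for logit, m in zip(logits, mask)]
    weights = softmax(logits)
    output = [sum(w * v[d] for w, v in zip(weights, values))
              for d in range(len(values[0]))]
    return weights, output


query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
mask = [0, 0, 1]   # the third position is padding

weights, output = attend(query, keys, values, mask)
print([round(w, 3) for w in weights])
print([round(o, 3) for o in output])
```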

In [None]:
class MultiHeadAttention(tf.keras.layers.Layer):

  def __init__(self, d_model, num_heads, name="multi_head_attention"):
    super(MultiHeadAttention, self).__init__(name=name)
    self.num_heads = num_heads
    self.d_model = d_model

    assert d_model % self.num_heads == 0

    self.depth = d_model // self.num_heads

    self.query_dense = tf.keras.layers.Dense(units=d_model)
    self.key_dense = tf.keras.layers.Dense(units=d_model)
    self.value_dense = tf.keras.layers.Dense(units=d_model)

    self.dense = tf.keras.layers.Dense(units=d_model)

  def split_heads(self, inputs, batch_size):
    inputs = tf.reshape(
        inputs, shape=(batch_size, -1, self.num_heads, self.depth))
    return tf.transpose(inputs, perm=[0, 2, 1, 3])

  def call(self, inputs):
    query, key, value, mask = inputs['query'], inputs['key'], inputs[
        'value'], inputs['mask']
    batch_size = tf.shape(query)[0]

    # linear layers
    query = self.query_dense(query)
    key = self.key_dense(key)
    value = self.value_dense(value)

    # split heads
    query = self.split_heads(query, batch_size)
    key = self.split_heads(key, batch_size)
    value = self.split_heads(value, batch_size)

    # scaled dot-product attention
    scaled_attention = scaled_dot_product_attention(query, key, value, mask)

    scaled_attention = tf.transpose(scaled_attention, perm=[0, 2, 1, 3])

    # concatenation of heads
    concat_attention = tf.reshape(scaled_attention,
                                  (batch_size, -1, self.d_model))

    # final linear layer
    outputs = self.dense(concat_attention)

    return outputs

In [None]:
def create_padding_mask(x):
  mask = tf.cast(tf.math.equal(x, 0), tf.float32)
  # (batch_size, 1, 1, sequence length)
  return mask[:, tf.newaxis, tf.newaxis, :]


def create_look_ahead_mask(x):
  seq_len = tf.shape(x)[1]
  look_ahead_mask = 1 - tf.linalg.band_part(tf.ones((seq_len, seq_len)), -1, 0)
  padding_mask = create_padding_mask(x)
  return tf.maximum(look_ahead_mask, padding_mask)

class PositionalEncoding(tf.keras.layers.Layer):

  def __init__(self, position, d_model):
    super(PositionalEncoding, self).__init__()
    self.pos_encoding = self.positional_encoding(position, d_model)

  def get_angles(self, position, i, d_model):
    angles = 1 / tf.pow(10000, (2 * (i // 2)) / tf.cast(d_model, tf.float32))
    return position * angles

  def positional_encoding(self, position, d_model):
    angle_rads = self.get_angles(
        position=tf.range(position, dtype=tf.float32)[:, tf.newaxis],
        i=tf.range(d_model, dtype=tf.float32)[tf.newaxis, :],
        d_model=d_model)
    # apply sin to even index in the array
    sines = tf.math.sin(angle_rads[:, 0::2])
    # apply cos to odd index in the array
    cosines = tf.math.cos(angle_rads[:, 1::2])

    pos_encoding = tf.concat([sines, cosines], axis=-1)
    pos_encoding = pos_encoding[tf.newaxis, ...]
    return tf.cast(pos_encoding, tf.float32)

  def call(self, inputs):
    return inputs + self.pos_encoding[:, :tf.shape(inputs)[1], :]
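
The encoding can be checked by hand on a small case. The sketch below mirrors `get_angles` and `positional_encoding` in plain Python; the function name and the tiny sizes are assumptions for illustration. Like the class above, it concatenates the sine half and the cosine half instead of interleaving them.

```python
import math


def positional_encoding(num_positions, d_model):
    rows = []
    for pos in range(num_positions):
        # same angle rate as get_angles: 1 / 10000^(2 * (i // 2) / d_model)
        angles = [pos / (10000 ** ((2 * (i // 2)) / d_model))
                  for i in range(d_model)]
        sines = [math.sin(a) for a in angles[0::2]]    # even feature indices
        cosines = [math.cos(a) for a in angles[1::2]]  # odd feature indices
        rows.append(sines + cosines)
    return rows


pe = positional_encoding(num_positions=3, d_model=4)
for row in pe:
    print([round(x, 4) for x in row])
```

Position 0 always encodes to zeros in the sine half and ones in the cosine half, which is a quick sanity check on any implementation.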

In [None]:
def encoder_layer(units, d_model, num_heads, dropout, name="encoder_layer"):
  inputs = tf.keras.Input(shape=(None, d_model), name="inputs")
  padding_mask = tf.keras.Input(shape=(1, 1, None), name="padding_mask")

  attention = MultiHeadAttention(
      d_model, num_heads, name="attention")({
          'query': inputs,
          'key': inputs,
          'value': inputs,
          'mask': padding_mask
      })
  attention = tf.keras.layers.Dropout(rate=dropout)(attention)
  attention = tf.keras.layers.LayerNormalization(
      epsilon=1e-6)(inputs + attention)

  outputs = tf.keras.layers.Dense(units=units, activation='relu')(attention)
  outputs = tf.keras.layers.Dense(units=d_model)(outputs)
  outputs = tf.keras.layers.Dropout(rate=dropout)(outputs)
  outputs = tf.keras.layers.LayerNormalization(
      epsilon=1e-6)(attention + outputs)

  return tf.keras.Model(
      inputs=[inputs, padding_mask], outputs=outputs, name=name)

In [None]:
def encoder(vocab_size,
            num_layers,
            units,
            d_model,
            num_heads,
            dropout,
            name="encoder"):
  inputs = tf.keras.Input(shape=(None,), name="inputs")
  padding_mask = tf.keras.Input(shape=(1, 1, None), name="padding_mask")

  embeddings = tf.keras.layers.Embedding(vocab_size, d_model)(inputs)
  embeddings *= tf.math.sqrt(tf.cast(d_model, tf.float32))
  embeddings = PositionalEncoding(vocab_size, d_model)(embeddings)

  outputs = tf.keras.layers.Dropout(rate=dropout)(embeddings)
  for i in range(int(num_layers)):
    outputs = encoder_layer(
        units=units,
        d_model=d_model,
        num_heads=num_heads,
        dropout=dropout,
        name="encoder_layer_{}".format(i),
    )([outputs, padding_mask])

  return tf.keras.Model(
      inputs=[inputs, padding_mask], outputs=outputs, name=name)

In [None]:
def decoder_layer(units, d_model, num_heads, dropout, name="decoder_layer"):
  inputs = tf.keras.Input(shape=(None, d_model), name="inputs")
  enc_outputs = tf.keras.Input(shape=(None, d_model), name="encoder_outputs")
  look_ahead_mask = tf.keras.Input(
      shape=(1, None, None), name="look_ahead_mask")
  padding_mask = tf.keras.Input(shape=(1, 1, None), name='padding_mask')

  attention1 = MultiHeadAttention(
      d_model, num_heads, name="attention_1")(inputs={
          'query': inputs,
          'key': inputs,
          'value': inputs,
          'mask': look_ahead_mask
      })
  attention1 = tf.keras.layers.LayerNormalization(
      epsilon=1e-6)(attention1 + inputs)

  attention2 = MultiHeadAttention(
      d_model, num_heads, name="attention_2")(inputs={
          'query': attention1,
          'key': enc_outputs,
          'value': enc_outputs,
          'mask': padding_mask
      })
  attention2 = tf.keras.layers.Dropout(rate=dropout)(attention2)
  attention2 = tf.keras.layers.LayerNormalization(
      epsilon=1e-6)(attention2 + attention1)

  outputs = tf.keras.layers.Dense(units=units, activation='relu')(attention2)
  outputs = tf.keras.layers.Dense(units=d_model)(outputs)
  outputs = tf.keras.layers.Dropout(rate=dropout)(outputs)
  outputs = tf.keras.layers.LayerNormalization(
      epsilon=1e-6)(outputs + attention2)

  return tf.keras.Model(
      inputs=[inputs, enc_outputs, look_ahead_mask, padding_mask],
      outputs=outputs,
      name=name)

In [None]:
def decoder(vocab_size,
            num_layers,
            units,
            d_model,
            num_heads,
            dropout,
            name='decoder'):
  inputs = tf.keras.Input(shape=(None,), name='inputs')
  enc_outputs = tf.keras.Input(shape=(None, d_model), name='encoder_outputs')
  look_ahead_mask = tf.keras.Input(
      shape=(1, None, None), name='look_ahead_mask')
  padding_mask = tf.keras.Input(shape=(1, 1, None), name='padding_mask')

  embeddings = tf.keras.layers.Embedding(vocab_size, d_model)(inputs)
  embeddings *= tf.math.sqrt(tf.cast(d_model, tf.float32))
  embeddings = PositionalEncoding(vocab_size, d_model)(embeddings)

  outputs = tf.keras.layers.Dropout(rate=dropout)(embeddings)

  for i in range(int(num_layers)):
    outputs = decoder_layer(
        units=units,
        d_model=d_model,
        num_heads=num_heads,
        dropout=dropout,
        name='decoder_layer_{}'.format(i),
    )(inputs=[outputs, enc_outputs, look_ahead_mask, padding_mask])

  return tf.keras.Model(
      inputs=[inputs, enc_outputs, look_ahead_mask, padding_mask],
      outputs=outputs,
      name=name)

In [None]:
def transformer(vocab_size,
                num_layers,
                units,
                d_model,
                num_heads,
                dropout,
                name="transformer"):
  inputs = tf.keras.Input(shape=(None,), name="inputs")
  dec_inputs = tf.keras.Input(shape=(None,), name="dec_inputs")

  enc_padding_mask = tf.keras.layers.Lambda(
      create_padding_mask, output_shape=(1, 1, None),
      name='enc_padding_mask')(inputs)
  # mask the future tokens for decoder inputs at the 1st attention block
  look_ahead_mask = tf.keras.layers.Lambda(
      create_look_ahead_mask,
      output_shape=(1, None, None),
      name='look_ahead_mask')(dec_inputs)
  # mask the encoder outputs for the 2nd attention block
  dec_padding_mask = tf.keras.layers.Lambda(
      create_padding_mask, output_shape=(1, 1, None),
      name='dec_padding_mask')(inputs)

  enc_outputs = encoder(
      vocab_size=vocab_size,
      num_layers=num_layers,
      units=units,
      d_model=d_model,
      num_heads=num_heads,
      dropout=dropout,
  )(inputs=[inputs, enc_padding_mask])

  dec_outputs = decoder(
      vocab_size=vocab_size,
      num_layers=num_layers,
      units=units,
      d_model=d_model,
      num_heads=num_heads,
      dropout=dropout,
  )(inputs=[dec_inputs, enc_outputs, look_ahead_mask, dec_padding_mask])

  outputs = tf.keras.layers.Dense(units=vocab_size, name="outputs")(dec_outputs)

  return tf.keras.Model(inputs=[inputs, dec_inputs], outputs=outputs, name=name)

def loss_function(y_true, y_pred):
  y_true = tf.reshape(y_true, shape=(-1, MAX_LENGTH - 1))

  loss = tf.keras.losses.SparseCategoricalCrossentropy(
      from_logits=True, reduction='none')(y_true, y_pred)

  mask = tf.cast(tf.not_equal(y_true, 0), tf.float32)
  loss = tf.multiply(loss, mask)

  return tf.reduce_mean(loss)


class CustomSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):

  def __init__(self, d_model, warmup_steps=4000):
    super(CustomSchedule, self).__init__()

    self.d_model = d_model
    self.d_model = tf.cast(self.d_model, tf.float32)

    self.warmup_steps = warmup_steps

  def __call__(self, step):
    # the optimizer passes an integer step; rsqrt needs a float
    step = tf.cast(step, tf.float32)
    arg1 = tf.math.rsqrt(step)
    arg2 = step * (self.warmup_steps**-1.5)

    return tf.math.rsqrt(self.d_model) * tf.math.minimum(arg1, arg2)
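
`CustomSchedule` computes `d_model**-0.5 * min(step**-0.5, step * warmup_steps**-1.5)`: the rate grows linearly for `warmup_steps` steps and then decays with the inverse square root of the step. A plain-Python sketch of the same formula follows; the function name `transformer_lr` is an assumption, and `d_model=256` is taken from the encoder output width in the model summary.

```python
def transformer_lr(step, d_model=256, warmup_steps=4000):
    """Learning rate at a given step (step >= 1)."""
    step = float(step)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


for step in (1, 1000, 4000, 16000):
    print(step, transformer_lr(step))
```

The two arguments of `min` are equal at `step == warmup_steps`, which is where the rate peaks.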

In [None]:
# clear backend
tf.keras.backend.clear_session()

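# Warmup schedule. It is not passed to the optimizer created below, so
# training runs with Adam's default constant learning rate.
learning_rate = CustomSchedule(D_MODEL)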

#optimizer = tf.keras.optimizers.Adam(learning_rate, beta_1=0.9, beta_2=0.98, epsilon=1e-9)

optimizer = tf.keras.optimizers.Adam()

def accuracy(y_true, y_pred):
  # ensure labels have shape (batch_size, MAX_LENGTH - 1)
  y_true = tf.reshape(y_true, shape=(-1, MAX_LENGTH - 1))
  return tf.keras.metrics.sparse_categorical_accuracy(y_true, y_pred)

In [None]:
# initialize and compile model within strategy scope
with strategy.scope():
  model = transformer(
      vocab_size=VOCAB_SIZE,
      num_layers=NUM_LAYERS,
      units=UNITS,
      d_model=D_MODEL,
      num_heads=NUM_HEADS,
      dropout=DROPOUT)

  model.compile(optimizer=optimizer, loss=loss_function, metrics=[accuracy])

model.summary()

Model: "transformer"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 inputs (InputLayer)         [(None, None)]               0         []                            
                                                                                                  
 dec_inputs (InputLayer)     [(None, None)]               0         []                            
                                                                                                  
 enc_padding_mask (Lambda)   (None, 1, 1, None)           0         ['inputs[0][0]']              
                                                                                                  
 encoder (Functional)        (None, None, 256)            5276416   ['inputs[0][0]',              
                                                                     'enc_padding_mask[0

In [None]:
import datetime

# 32% - 80 epochs, after 30 epochs - 29

logdir = os.path.join("logs", datetime.datetime.now().strftime("%Y%m%d-%H%M%S"))
tensorboard_callback = tf.keras.callbacks.TensorBoard(logdir, histogram_freq=1)

model.fit(dataset, epochs=50, callbacks = [tensorboard_callback])

Epoch 1/50
...
Epoch 48/50
 57/260 [=====>........................] - ETA: 2:38 - loss: 2.8662 - accuracy: 0.0217

In [None]:
def evaluate(sentence, model):
  # apply the same normalization that was applied to the training data
  sentence = textPreprocess(sentence)

  sentence = tf.expand_dims(
      START_TOKEN + tokenizer.encode(sentence) + END_TOKEN, axis=0)

  output = tf.expand_dims(START_TOKEN, 0)

  for i in range(MAX_LENGTH):
    predictions = model(inputs=[sentence, output], training=False)

    # select the last word from the seq_len dimension
    predictions = predictions[:, -1:, :]
    predicted_id = tf.cast(tf.argmax(predictions, axis=-1), tf.int32)


    # return the result if the predicted_id is equal to the end token
    if tf.equal(predicted_id, END_TOKEN[0]):
      break

    # concatenated the predicted_id to the output which is given to the decoder
    # as its input.
    output = tf.concat([output, predicted_id], axis=-1)

  return tf.squeeze(output, axis=0)


def predict(sentence,model):
  prediction = evaluate(sentence,model)

  predicted_sentence = tokenizer.decode(
      [i for i in prediction if i < tokenizer.vocab_size])

  print('Input: {}'.format(sentence))
  print('Output: {}'.format(predicted_sentence))

  return predicted_sentence

# Save the trained weights locally and to Google Drive
model.save_weights('saved_weights.h5')
model.save_weights('/content/drive/MyDrive/Health_Care/saved_weights.h5')

# Rebuild the same architecture and load the saved weights into it
loaded_model = transformer(
      vocab_size=VOCAB_SIZE,
      num_layers=NUM_LAYERS,
      units=UNITS,
      d_model=D_MODEL,
      num_heads=NUM_HEADS,
      dropout=DROPOUT)

loaded_model.compile(optimizer=optimizer, loss=loss_function, metrics=[accuracy])
loaded_model.load_weights('saved_weights.h5')

In [None]:
output = predict('hey, malaria is bad?', loaded_model)
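
`evaluate` decodes greedily: it feeds the tokens generated so far back into the decoder, appends the highest-scoring next token, and stops at the end token or after `MAX_LENGTH` steps. The sketch below isolates that loop from the model. `greedy_decode`, `toy_step` and the five-token vocabulary are hypothetical stand-ins, with `toy_step` playing the role of one forward pass.

```python
def greedy_decode(step_fn, start_id, end_id, max_length):
    """Grow the output one token at a time, always taking the arg-max."""
    output = [start_id]
    for _ in range(max_length):
        logits = step_fn(output)  # scores for the next token only
        predicted_id = max(range(len(logits)), key=lambda i: logits[i])
        if predicted_id == end_id:
            break
        output.append(predicted_id)
    return output


START_ID, END_ID = 3, 4          # toy ids, not the notebook's token ids
SCRIPT = [0, 2, 1, END_ID]       # what the toy "model" wants to say


def toy_step(prefix):
    target = SCRIPT[min(len(prefix) - 1, len(SCRIPT) - 1)]
    return [1.0 if i == target else 0.0 for i in range(5)]


decoded = greedy_decode(toy_step, START_ID, END_ID, max_length=10)
print(decoded)  # [3, 0, 2, 1]
```

As in `evaluate`, the returned sequence still starts with the start token; `predict` drops it when it filters out ids that are not below `tokenizer.vocab_size`.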

In [None]:
final_df

Unnamed: 0,user_input_ques,user_input_ans
0,"Hey there, for the past few days chills, vomi...",Analysing the Symptoms Malaria Consult neares...
1,"Hey, facing some symptoms like chills, vomiti...",This is may be Malaria Consult nearest hospit...
2,"Hello, past few days chills, vomiting, high f...",It may be due to Malaria Consult nearest hosp...
3,"Hi, past few days vomiting, high fever, sweat...",Analysing the Symptoms Malaria Consult neares...
4,"Hi there, facing some symptoms like chills, h...",This is may be Malaria Consult nearest hospit...
...,...,...
52136,what is an icsi intracytoplasmic sperm injection,intracytoplasmic sperm injection icsi is a la...
52137,can b12 vitamins raise lab results b12 lab re...,i am sorry you are going through this and it ...
52138,what can you give a 1 month old baby that has...,there are no over the counter medications app...
52139,when should the doctor be called for diarrhea,okay so type 1 is common to oral area and typ...


In [None]:
from transformers import BertTokenizer, BertForQuestionAnswering
from torch.optim import AdamW  # the AdamW class in transformers is deprecated
import torch

# Load the pre-trained model and tokenizer
model_name = "bert-large-uncased-whole-word-masking-finetuned-squad"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForQuestionAnswering.from_pretrained(model_name)

# Define your dataset
dataset = [{'question': q, 'answer': a} for q, a in zip(final_df['user_input_ques'], final_df['user_input_ans'])]

# Define training parameters
num_train_epochs = 3
learning_rate = 5e-5

# Set the model to training mode
model.train()

# Use the AdamW optimizer
optimizer = AdamW(model.parameters(), lr=learning_rate)

# Custom training loop
for epoch in range(num_train_epochs):
    total_loss = 0
    for example in dataset:
        # Tokenize the input
        inputs = tokenizer(example['question'], example['answer'], return_tensors='pt', truncation=True, padding='max_length', max_length=512)

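        # Forward pass. No start_positions/end_positions are passed, so
        # outputs.loss is None and the update below never runs.
        outputs = model(**inputs, return_dict=True)
        loss = outputs.loss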
        if loss is not None:
            total_loss += loss.item()

            # Backward pass
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    # Print average loss for each epoch
    print(f"Epoch {epoch+1}, Average Loss: {total_loss / len(dataset)}")

# Set the model to evaluation mode
model.eval()

# Use the fine-tuned model to answer new questions
def answer_question(question, context):
    inputs = tokenizer(question, context, return_tensors='pt', truncation=True, padding='max_length', max_length=512)
    outputs = model(**inputs, return_dict=True)
    answer_start = torch.argmax(outputs.start_logits)
    answer_end = torch.argmax(outputs.end_logits) + 1
    answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0][answer_start:answer_end]))
    return answer

# New question
new_question = "What is the treatment for COVID-19?"

# Context containing information related to the new question
new_context = "The treatment for COVID-19 includes supportive care to relieve symptoms, antiviral medications, and in severe cases, hospitalization and respiratory support."

# Use the model to answer the new question
answer = answer_question(new_question, new_context)

# Print the answer
print("Answer:", answer)


Some weights of the model checkpoint at bert-large-uncased-whole-word-masking-finetuned-squad were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


KeyboardInterrupt: 
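
`BertForQuestionAnswering` only returns a loss when `start_positions` and `end_positions` are supplied, so a training loop over question/answer pairs needs span labels. One possible labeling, offered here as an assumption and not as the notebook's method, treats the whole answer segment as the target span. The hypothetical helper `answer_span` derives the two indices from `token_type_ids`, relying on BERT's pair encoding, in which the second segment (the answer plus its closing `[SEP]`) is marked with 1 and everything else, including padding, with 0.

```python
def answer_span(token_type_ids):
    """Return (start, end) token indices of the second segment.

    The closing [SEP] of the segment is excluded. Returns None when the
    segment holds no tokens besides that [SEP].
    """
    positions = [i for i, segment in enumerate(token_type_ids) if segment == 1]
    if len(positions) < 2:
        return None
    return positions[0], positions[-2]


# Layout: [CLS] q q [SEP] a a a [SEP] [PAD] [PAD]
toy_token_type_ids = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0]
span = answer_span(toy_token_type_ids)
print(span)  # (4, 6)
```

The two indices would then be passed to the model as one-element tensors through its `start_positions` and `end_positions` arguments. Whether a whole-answer span is a useful training signal for an extractive model is a separate question that this sketch does not settle.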

In [None]:
from transformers import BertTokenizer, BertForQuestionAnswering
from torch.optim import AdamW  # the AdamW class in transformers is deprecated
import torch
from torch.utils.data import DataLoader, Dataset

# Load the pre-trained model and tokenizer
model_name = "bert-large-uncased-whole-word-masking-finetuned-squad"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForQuestionAnswering.from_pretrained(model_name)

# Define a custom dataset class
class QADataset(Dataset):
    def __init__(self, dataset):
        self.dataset = dataset

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        item = self.dataset[idx]
        inputs = tokenizer(item['question'], item['answer'], return_tensors='pt', truncation=True, padding='max_length', max_length=512)
        return inputs

# Define your dataset
dataset = [{'question': q, 'answer': a} for q, a in zip(final_df['user_input_ques'], final_df['user_input_ans'])]
qa_dataset = QADataset(dataset)

# Define training parameters
num_train_epochs = 3
learning_rate = 5e-5
batch_size = 8

# Use DataLoader for batch processing
train_loader = DataLoader(qa_dataset, batch_size=batch_size, shuffle=True)

# Set the model to training mode
model.train()

# Use the AdamW optimizer
optimizer = AdamW(model.parameters(), lr=learning_rate)

# Custom training loop with batch processing
for epoch in range(num_train_epochs):
    total_loss = 0
    for batch in train_loader:
        optimizer.zero_grad()
        input_ids = batch['input_ids'].squeeze(1)
        attention_mask = batch['attention_mask'].squeeze(1)
        token_type_ids = batch['token_type_ids'].squeeze(1)
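        # No start_positions/end_positions are passed, so outputs.loss is
        # None and the update below never runs.
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids, return_dict=True)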
        loss = outputs.loss
        if loss is not None:
            total_loss += loss.item()
            loss.backward()
            optimizer.step()

    # Print average loss for each epoch
    print(f"Epoch {epoch+1}, Average Loss: {total_loss / len(train_loader)}")

# Set the model to evaluation mode
model.eval()

# Use the fine-tuned model to answer a new user question
def answer_user_question(user_question):
    # With no context passage, the extracted span can only come from the question itself
    inputs = tokenizer(user_question, return_tensors='pt', truncation=True, padding='max_length', max_length=512)
    # Keep the batch dimension: the model expects (batch_size, seq_len) inputs
    with torch.no_grad():
        outputs = model(**inputs, return_dict=True)
    answer_start = torch.argmax(outputs.start_logits)
    answer_end = torch.argmax(outputs.end_logits) + 1
    input_ids = inputs['input_ids'][0]
    answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))
    return answer

# New question from the user
user_question = "What are the symptoms of COVID-19?"

# Use the model to answer the user question
answer = answer_user_question(user_question)

# Print the answer
print("Answer:", answer)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Some weights of the model checkpoint at bert-large-uncased-whole-word-masking-finetuned-squad were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. 