# Let's go!

https://www.kaggle.com/datasets/itachi9604/disease-symptom-description-dataset

## About dataset

About the Dataset Context is a dataset that will provide students with a source for creating a healthcare-related system.
  
Content  
There are columns containing diseases, their symptoms, precautions to be taken, and their weight. This data set can be easily cleaned up using file processing in any language. The user only needs to understand how the rows and columns are arranged.
   
Confession   
I created this dataset with the help of a friend of the Practitioner Rathod. Because there was an existing dataset like this that was difficult to clean up.
О наборе данных Контекст
Набор данных, который предоставит учащимся источник для создания системы, связанной со здравоохранением.
   
## О датасете

Содержание  
Есть столбцы, содержащие болезни, их симптомы, меры предосторожности, которые необходимо предпринять, и их вес. Этот набор данных можно легко очистить, используя обработку файлов на любом языке. Пользователю нужно только понять, как расположены строки и столбцы.  
   
Признание  
Я создал этот набор данных с помощью друга Пратика Ратода. Поскольку существовал существующий набор данных, подобный этому, который было трудно очистить.  

### note / замечено
- There is no Disease not found option in the dataset.
- В датасете отсутствует вариант Болезнь не найдена.

## importing libraries

In [1]:
# imports
from copy import deepcopy
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import pickle
from sklearn.feature_extraction.text import CountVectorizer

from typing import Iterable

PATH = 'data/'           # for reading
MY_PATH = 'my_data/'     # for writing

## Data processing.
- - in the dataset, the data has the form: disease is a set of symptoms, it is necessary to bring them to a form that is convenient to read machine learning algorithms.

## Обработка данных.
- в датасете данные имеют вид : болезнь - набор симптомов, необходимо привести их к виду который удобно читать алгоритму машинного обучения.

In [2]:
# parse the dataset with CountVectorizer
# since CountVectorizer works with a list of strings, we need to turn df into a list of strings
# # there is a problem the symptom may consist of two words 
# let's turn them into one
# function takes a string (the name of the symptom) and replaces and translates it into snake style

# парсить датасет будем CountVectorizer
# т.к. CountVectorizer работает со списком строк, то нам необходимо df превратить в список строк
# возникает проблемка симптом может состоять из двух слов 
# превратим их в одно
# ф-я принимает строку(название симптома) и заменяет и переводит его в змеиный стиль
def string_preparation(st:str) -> str:
    return  st.replace('(',' ')\
            .replace(')',' ')\
            .strip()\
            .replace(' ','_')\
            .replace('__','_')
# for processing strings
# для обработки строк
def row_to_string(row: Iterable['str'] ) -> str:
    return ' '.join(row.values)
# # reading the raw dataset
# читаем сырой датасет
df = pd.read_csv(f'{PATH}dataset.csv',delimiter=",")
# convert each value of df into one "word"
# каждое значение df  преобразуем в одно "слово"
df_new = (df.drop('Disease', axis = 1).fillna('')).applymap(string_preparation)
# turning df into an array of strings
# превращаем df в массив строк
simptoms_list = df_new.apply(row_to_string, axis=1).values.tolist()
#  vectorize
# ну и векторизуем 
vectorizer = CountVectorizer()
# we get a matrix of symptoms
# получаем матрицу симптомов
X = vectorizer.fit_transform(simptoms_list)
# X - sparse matrix
# making df
# мастерим df
my_df = pd.DataFrame(X.toarray(), columns= vectorizer.get_feature_names_out())
symptoms_list = vectorizer.get_feature_names_out()
# there is also the problem that the names of diseases 
# may contain spaces at the edges, we will process...
# так же есть проблема того, что названия болезней 
# могут содержать пробелы по краям, обработаем...
my_df["Disease"] = df["Disease"].map(str.strip)
# we have received data that we can feed to the ML algorithm
# получили данные которые можем скормить алгоритму ML
my_df.shape

(4920, 132)

**We noticed that there may be grammatical errors and differences in the names of diseases (and hence in the names of symptoms).**
# Error search
- let's look for them
- using the penchant library

**Было замечено, что в названиях болезней ( а значит и в названиях симптомов) могут присутствовать грамматические ошибки и различия.**
# Поиск ошибок
- поищем их
- используем библиотеку pyenchant

In [3]:
import enchant  # when using an improt, we write enchant (not pyenchant)  
import re

dictionary = enchant.Dict("en_US")
for dis in my_df["Disease"].unique():
    for word in re.split(pattern = r'[\s()]+', string = dis, maxsplit=0, flags=0):
        if word not in ["", "GERD", 'cholestasis', 'spondylosis']and not dictionary.check(word):
            print (word)  # подозрительное слово , suspicious word
            print(dictionary.suggest(word))  # варианты замены,  replacement options
            print ("dis*   ", dis) # строка с подозрительным словом, # string with a suspicious word

diseae
['disease', 'diseuse']
dis*    Peptic ulcer diseae
hemmorhoids
['hemorrhoids', 'hemorrhoid']
dis*    Dimorphic hemmorhoids(piles)
Osteoarthristis
['Osteoarthritis', 'Osteoarthritic']
dis*    Osteoarthristis
Paroymsal
['Paroxysmal']
dis*    (vertigo) Paroymsal  Positional Vertigo


In [4]:
# После прверки составляем словарь замены  ошибок
# compiling a dictionary of error replacement
errors_dict = {
    'Peptic ulcer diseae': 'Peptic ulcer disease',
    'Dimorphic hemmorhoids(piles)': 'Dimorphic hemorrhoids (piles)',
    'Osteoarthristis': 'Osteoarthritis',
    '(vertigo) Paroymsal  Positional Vertigo': '(vertigo) Paroxysmal  Positional Vertigo'
}
# обрабатываем
def disease_prep(disease):
    if disease in errors_dict:
         return errors_dict[disease]
    return disease
my_df['Disease'] =my_df['Disease'].map(disease_prep)

In [5]:
# проверяем, checking
dictionary = enchant.Dict("en_US")
for dis in my_df["Disease"].unique():
    for word in re.split(pattern = r'[\s()]+', string = dis, maxsplit=0, flags=0):
        if word not in ["", "GERD", 'cholestasis', 'spondylosis']and not dictionary.check(word):
            print (word)
            print(dictionary.suggest(word))
            print ("dis*   ", dis)
# всё чисто, it's cleaned

# We have cleaned out the diseases, we will clean up the symptoms

# С болезнями разобрались почистим симптомы

In [6]:
# ищем ошибки в названиях симптомов, looking for errors in the names of symptoms
for sym in my_df.columns:
    for word in sym.split('_'):
        if word not in ["", "diarrhoea", 'polyuria'] and not dictionary.check(word):
            print (word)
            print(dictionary.suggest(word))
            print ("sym*   ", sym)

feets
['fits', 'fetes', 'frets', 'feet', 'fees', 'fleets', 'fests', 'feats', 'feels', 'felts', 'feeds', 'meets', 'beets', "fee's", 'fee ts']
sym*    cold_hands_and_feets
dischromic
['dichromic', 'dis chromic', 'dis-chromic', 'dichromatism', 'dichromatic', 'dichroism', 'dichroic']
sym*    dischromic_patches
scurring
['scurrying', 'scarring', 'slurring', 'spurring', 'incurring', 'recurring', 'occurring', 'concurring']
sym*    scurring
extremeties
['extremities', 'extreme ties', 'extreme-ties', 'extremeness', 'extremes', 'extremist', 'extremism']
sym*    swollen_extremeties
typhos
['typos', 'typhous', 'typhus', 'typ hos', 'typ-hos', 'typhoons']
sym*    toxic_look_typhos


In [7]:
# It 's not clear what scurrying is
# let's take a closer look
# Непонятно , что такое scurring
# посмотрим внимательнее
my_df[my_df['scurring']==1]['Disease'].unique()
# ага это scarring 100500

array(['Acne'], dtype=object)

In [8]:
# It 's not clear what toxic_look_typhos is
# let's take a closer look
# Непонятно , что такое toxic_look_typhos
# посмотрим внимательнее
my_df[my_df['toxic_look_typhos']==1]['Disease'].unique()
# ага а это toxic_look_typhus

array(['Typhoid'], dtype=object)

In [9]:
# compiling a dictionary of error replacement
# После прверки составляем словарь замены  ошибок
symptoms_errors_dict = {
    'cold_hands_and_feets': 'cold_hands_and_feet',
    'dischromic_patches': 'dyschromic_patches',
    'scurring': 'scarring',
    'swollen_extremeties': 'swollen_extremities',
    'toxic_look_typhos':'toxic_look_typhus'
}
# обрабатываем колонки
def symptom_prep(symptom):
    if symptom in symptoms_errors_dict:
         return symptoms_errors_dict[symptom]
    return symptom
my_df.columns=my_df.columns.map(symptom_prep)

In [10]:
# проверка, check
for sym in my_df.columns:
    for word in sym.split('_'):
        if word not in ["", "diarrhoea", 'polyuria', 'dyschromic'] and not dictionary.check(word):
            print (word)
            print(dictionary.suggest(word))
            print ("sym*   ", sym)

## I hope we have corrected all the errors in the names of diseases and symptoms
- let's look at the number of examples for each disease
  
## Надеюсь, мы исправили все ошибки в названиях болезней и симптомов
- посмотрим на кол-во примеров на каждую болезнь

In [11]:
my_df.groupby('Disease').Disease.count()

Disease
(vertigo) Paroxysmal  Positional Vertigo    120
AIDS                                        120
Acne                                        120
Alcoholic hepatitis                         120
Allergy                                     120
Arthritis                                   120
Bronchial Asthma                            120
Cervical spondylosis                        120
Chicken pox                                 120
Chronic cholestasis                         120
Common Cold                                 120
Dengue                                      120
Diabetes                                    120
Dimorphic hemorrhoids (piles)               120
Drug Reaction                               120
Fungal infection                            120
GERD                                        120
Gastroenteritis                             120
Heart attack                                120
Hepatitis B                                 120
Hepatitis C                     

## it's charming ...
- however, such beauty arouses suspicion ...
- but the dataset is processed - we save it for future use

##  Красивое...
- такая красота вызывает подозрение...
- однако датасет обработан - сохраним его для дальнейшего использования

In [12]:
# сохранение , saving
my_df.to_csv(f'{MY_PATH}my_dataset.csv', index=False)

In [13]:
# Поищем дубликаты
# Let's look for duplicates
my_df[my_df.duplicated ()]

Unnamed: 0,abdominal_pain,abnormal_menstruation,acidity,acute_liver_failure,altered_sensorium,anxiety,back_pain,belly_pain,blackheads,bladder_discomfort,...,watering_from_eyes,weakness_in_limbs,weakness_of_one_body_side,weight_gain,weight_loss,yellow_crust_ooze,yellow_urine,yellowing_of_eyes,yellowish_skin,Disease
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Fungal infection
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Fungal infection
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Fungal infection
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Fungal infection
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Fungal infection
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4915,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,(vertigo) Paroxysmal Positional Vertigo
4916,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,Acne
4917,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,Urinary tract infection
4918,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Psoriasis


## Oops!   there are 4616 duplicates here...)
- let's reset the duplicates and look at what's left...

## Опа! 4616 дубликатов...)
- сбросим дубликаты и посмотрим на то что осталось...

In [14]:
df_drop_dup = my_df.drop_duplicates() 
print(f' without duplicated shape  {df_drop_dup.shape}')
df_drop_dup.groupby('Disease').Disease.count().sort_values()

 without duplicated shape  (304, 132)


Disease
Gastroenteritis                              5
AIDS                                         5
Acne                                         5
Urinary tract infection                      5
Allergy                                      5
Fungal infection                             5
Heart attack                                 5
Paralysis (brain hemorrhage)                 5
Arthritis                                    6
Cervical spondylosis                         6
Drug Reaction                                6
Impetigo                                     6
Hypertension                                 6
Dimorphic hemorrhoids (piles)                6
(vertigo) Paroxysmal  Positional Vertigo     7
Psoriasis                                    7
Osteoarthritis                               7
Peptic ulcer disease                         7
Hepatitis C                                  7
Bronchial Asthma                             7
GERD                                         7
Malar

## That's it!
- - it turns out there are very few unique descriptions of diseases.... sad....
- - I suspect that on a stripped-down dataset without duplicates, the accuracy of the models should decrease.

   
## О как! 
- получаеится уникальных описаний болезней очень даже немного.... печалька....
- подозреваю, что на урезаном датасете без дубликатов точность моделей должна снизиться.

In [15]:
# Импортируем разные модели и посмотрим как они справятся с данной задачей
# import models
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold

import warnings
warnings.filterwarnings("ignore")

X = df_drop_dup.drop("Disease", axis=1)
y = df_drop_dup["Disease"]


n_fold = 7

# определим  параметры кросс-валидации (стратифицированная 7-фолдовая с перемешиванием)
# let's define the parameters of cross-validation (stratified 7-stock with mixing)
skf = StratifiedKFold(n_splits=n_fold, shuffle=True, random_state= 11)

## Let's use different models of classifiers.
- we use models with default parameters, if we see something worthy, we will select hyperparameters.
   

## Попользуем разные модели класссификаторов.
- используем модели с параметрами по умолчанию , если увидим что-нить достойное будем подбирать гиперпараметры.

In [104]:
clf  = DecisionTreeClassifier(random_state=11)

scores: np.array = cross_val_score(clf, X, y, cv=skf)
print ('all scores ', scores)
print ('mean scores ', round(scores.mean(), 2))

all scores  [0.56818182 0.45454545 0.54545455 0.55813953 0.69767442 0.53488372
 0.65116279]
mean scores  0.57


In [105]:
clf  = LogisticRegression(random_state=11)

scores: np.array = cross_val_score(clf, X, y, cv=skf)
print ('all scores ', scores)
print ('mean scores ', round(scores.mean(), 2))

all scores  [1. 1. 1. 1. 1. 1. 1.]
mean scores  1.0


In [106]:
clf  = KNeighborsClassifier()
 
scores: np.array = cross_val_score(clf, X, y, cv=skf)
print ('all scores ', *scores)
print ('mean scores ', round(scores.mean(), 2))

all scores  1.0 1.0 1.0 1.0 1.0 1.0 1.0
mean scores  1.0


In [16]:
clf  = GaussianNB()

scores: np.array = cross_val_score(clf, X, y, cv=skf)
print ('all scores ', scores)
print ('mean scores ', round(scores.mean(), 2))

all scores  [1.         1.         0.97727273 0.97674419 1.         1.
 1.        ]
mean scores  0.99


In [17]:
clf  = SVC(random_state=11)
 
scores: np.array = cross_val_score(clf, X, y, cv=skf)
print ('all scores ', scores)
print ('mean scores ', round(scores.mean(), 2))

all scores  [1. 1. 1. 1. 1. 1. 1.]
mean scores  1.0


In [18]:
clf  = RandomForestClassifier(random_state=11)

scores: np.array = cross_val_score(clf, X, y, cv=skf)
print ('all scores ', scores)
print ('mean scores ', round(scores.mean(), 2))

all scores  [1. 1. 1. 1. 1. 1. 1.]
mean scores  1.0


# We see that all models are good (except for the tree ...)
- let's look at recall this characteristic, cmc, in medicine it is important

# В принципе, все модели  хороши (ну, кроме дерева...)
- посмотрим на recall  эта характеристика, кмк , в медицине это  важно

In [19]:
from sklearn.metrics import classification_report, accuracy_score, make_scorer

def classification_report_with_accuracy_score(y_true, y_pred):

    print (classification_report(y_true, y_pred)) # print classification report
    return accuracy_score(y_true, y_pred) # return accuracy score

clf = LogisticRegression(random_state=11)

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state= 11)

scores: np.array = cross_val_score(clf, X, y, cv= skf,
                                   scoring=make_scorer(classification_report_with_accuracy_score))

                                          precision    recall  f1-score   support

(vertigo) Paroxysmal  Positional Vertigo       1.00      1.00      1.00         2
                                    AIDS       1.00      1.00      1.00         2
                                    Acne       1.00      1.00      1.00         2
                     Alcoholic hepatitis       1.00      1.00      1.00         2
                                 Allergy       1.00      1.00      1.00         2
                               Arthritis       1.00      1.00      1.00         2
                        Bronchial Asthma       1.00      1.00      1.00         3
                    Cervical spondylosis       1.00      1.00      1.00         2
                             Chicken pox       1.00      1.00      1.00         4
                     Chronic cholestasis       1.00      1.00      1.00         3
                             Common Cold       1.00      1.00      1.00         3
               

In [20]:
clf = SVC(random_state=11)
 
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state= 11)

scores: np.array = cross_val_score(clf, X, y, cv= skf,
                                   scoring=make_scorer(classification_report_with_accuracy_score))

                                          precision    recall  f1-score   support

(vertigo) Paroxysmal  Positional Vertigo       1.00      1.00      1.00         2
                                    AIDS       1.00      1.00      1.00         2
                                    Acne       1.00      1.00      1.00         2
                     Alcoholic hepatitis       1.00      1.00      1.00         2
                                 Allergy       1.00      1.00      1.00         2
                               Arthritis       1.00      1.00      1.00         2
                        Bronchial Asthma       1.00      1.00      1.00         3
                    Cervical spondylosis       1.00      1.00      1.00         2
                             Chicken pox       1.00      1.00      1.00         4
                     Chronic cholestasis       1.00      1.00      1.00         3
                             Common Cold       1.00      1.00      1.00         3
               

In [21]:
clf = KNeighborsClassifier()
 
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state= 11)

scores: np.array = cross_val_score(clf, X, y, cv= skf,
                                   scoring=make_scorer(classification_report_with_accuracy_score))

                                          precision    recall  f1-score   support

(vertigo) Paroxysmal  Positional Vertigo       1.00      1.00      1.00         2
                                    AIDS       1.00      1.00      1.00         2
                                    Acne       1.00      1.00      1.00         2
                     Alcoholic hepatitis       1.00      1.00      1.00         2
                                 Allergy       1.00      1.00      1.00         2
                               Arthritis       1.00      1.00      1.00         2
                        Bronchial Asthma       1.00      1.00      1.00         3
                    Cervical spondylosis       1.00      1.00      1.00         2
                             Chicken pox       1.00      1.00      1.00         4
                     Chronic cholestasis       1.00      1.00      1.00         3
                             Common Cold       1.00      1.00      1.00         3
               

 ### Conclusions: both SVC, and Logistic Regression, KNeighborsClassifier work perfectly.
  - the small size of the dataset is not good, but as it is... we will sculpt the application in a modular form, and it will not be difficult to replace or add a dataset.  
    
### Вывод и SVC, и LogisticRegression, KNeighborsClassifier отрабатывают отлично .
  - напрягает небольшой размер датасета, но как есть... приложение будем ваять в модульном виде, и заменить или добавить датасет труда не составит.  

## I decided not to add the absence of diseases to the training array.
- Because among the diseases there is no option "the disease is not defined" and in the Telegram bot, especially smart users can search for a disease without symptoms, then we will start a function right in the bot that will give an answer to such users ...)
- there are at least 5 lines for each disease

## Решено не добавлять отсутствие  болезней в тренировочный массив.
- Т.к. среди болезней  нет варианта "болезнь не определена" а в Телеграмм боте особо умные юзеры могут поискать болезнь без наличия симптомов , то прям в боте заведём функцию , которая выдаст ответ таким юзерам...)
- на каждую болезнь  приходится минимум 5 строк 

In [114]:
# сохранение чистого датасета
# saving a clean dataset
df_drop_dup.to_csv(f'{MY_PATH}my_dataset_small.csv', index=False)

In [115]:
# Посмотрим на другие данные
# тяжесть симптомов
# Let's look at other data
# severity of symptoms
severity = pd.read_csv(f'{PATH}Symptom-severity.csv').sort_values('Symptom', ascending=True)
# обработка
# processing
severity["Symptom"] = severity["Symptom"].map(string_preparation)
severity["Symptom"] = severity["Symptom"].map(symptom_prep)
severity

Unnamed: 0,Symptom,weight
39,abdominal_pain,4
101,abnormal_menstruation,6
8,acidity,3
44,acute_liver_failure,6
98,altered_sensorium,2
...,...,...
19,weight_loss,3
131,yellow_crust_ooze,3
42,yellow_urine,4
43,yellowing_of_eyes,4


In [117]:
# поиск ошибок
# error search
sym_list = df_drop_dup.columns
for sym in sym_list:
    if sym not in severity.Symptom.values:
        print (sym)

foul_smell_of_urine
Disease


In [118]:
for sym in severity.Symptom.values:
    if sym not in sym_list:
        print (sym)
#Что такое prognosis?
# What is prognosis?

foul_smell_ofurine
prognosis


In [119]:
severity.loc[severity['Symptom']=='foul_smell_ofurine', 'Symptom']='foul_smell_of_urine'
for sym in sym_list:
    if sym not in severity.Symptom.values:
        print (sym)

Disease


In [120]:
severity.to_csv(f'{MY_PATH}my_Symptom_severity.csv', index=False)

In [125]:
# описание болезни
descriptions = pd.read_csv(f'{PATH}symptom_Description.csv')
descriptions["Disease"] = descriptions["Disease"].map(str.strip)
descriptions["Disease"] = descriptions["Disease"].map(disease_prep)
descriptions.head()

Unnamed: 0,Disease,Description
0,Drug Reaction,An adverse drug reaction (ADR) is an injury ca...
1,Malaria,An infectious disease caused by protozoan para...
2,Allergy,An allergy is an immune system response to a f...
3,Hypothyroidism,"Hypothyroidism, also called underactive thyroi..."
4,Psoriasis,Psoriasis is a common skin disorder that forms...


In [127]:
#  проверка, check
for dis in df_drop_dup['Disease'].unique():
    if dis not in descriptions['Disease'].values:
        print (dis)

Dimorphic hemorrhoids (piles)


In [129]:
# поиск соответствия в descriptions
# # matching search in descriptions
for dis in descriptions['Disease'].unique():
    if dis not in df_drop_dup['Disease'].values:
        print (dis)

Dimorphic hemorrhoids(piles)


In [131]:
# приведение к одному виду
# reduction to one view
descriptions.loc[descriptions["Disease"] == 'Dimorphic hemorrhoids(piles)', "Disease"] = 'Dimorphic hemorrhoids (piles)'
for dis in df_drop_dup['Disease'].unique():
    if dis not in descriptions['Disease'].values:
        print (dis)

In [132]:
# в descriptions не помешает добавить строку  нет болезни
# it needs to add the line no illness to descriptions
new_row = {"Disease":'Not identified. Perhaps you are healthy.',
           'Description': "I can't identify the disease by these symptoms..."\
            " Perhaps you are healthy... or you are still underexamined...)"} 
#descriptionsappend row to the dataframe 
descriptions = descriptions.append(new_row, ignore_index=True)
descriptions.tail()

Unnamed: 0,Disease,Description
37,Pneumonia,Pneumonia is an infection in one or both lungs...
38,Arthritis,Arthritis is the swelling and tenderness of on...
39,Gastroenteritis,Gastroenteritis is an inflammation of the dige...
40,Tuberculosis,Tuberculosis (TB) is an infectious disease usu...
41,Not identified. Perhaps you are healthy.,I can't identify the disease by these symptoms...


In [133]:
# сохранение, save
descriptions.to_csv(f'{MY_PATH}my_symptom_Description.csv', index=False)

In [137]:
# порядок действий и меры предосторожности при заболевании
# precautions for the disease
precautions = pd.read_csv(f'{PATH}symptom_precaution.csv')
precautions["Disease"] = precautions["Disease"].map(str.strip)
precautions["Disease"] = precautions["Disease"].map(disease_prep)
for dis in df_drop_dup['Disease'].unique():
    if dis not in precautions['Disease'].values:
        print (dis)

In [138]:
# в precautions не помешает добавить строку  нет болезни
# it needs to add the line no illness to the precautions
new_row = {"Disease":'Not identified. Perhaps you are healthy.',
           'Precaution_1': "Enjoy your life!",
           'Precaution_2': "Enjoy your life!!",
           'Precaution_3': "Enjoy your life!!!",
           'Precaution_4': "But do not forget about regular medical examination",} 
#descriptionsappend row to the dataframe 
precautions = precautions.append(new_row, ignore_index=True)
precautions.tail()

Unnamed: 0,Disease,Precaution_1,Precaution_2,Precaution_3,Precaution_4
37,Pneumonia,consult doctor,medication,rest,follow up
38,Arthritis,exercise,use hot and cold therapy,try acupuncture,massage
39,Gastroenteritis,stop eating solid food for while,try taking small sips of water,rest,ease back into eating
40,Tuberculosis,cover mouth,consult doctor,medication,rest
41,Not identified. Perhaps you are healthy.,Enjoy your life!,Enjoy your life!!,Enjoy your life!!!,But do not forget about regular medical examin...


In [139]:
# сохранение, save
precautions.to_csv(f'{MY_PATH}my_symptom_precaution.csv', index=False)

In [142]:
#  в groupedData мы получаем все возможные симптомы по каждой
# болезни (не варианты наборов а максимальный набор )
# # in grouped Data we get all possible symptoms for each
# diseases (not variants of sets, but the maximum set)
groupedData = df_drop_dup.groupby(df_drop_dup['Disease']).max()
#  и тож сохраним , save
groupedData.to_csv(f'{MY_PATH}my_symptoms_of_diseases.csv')

In [143]:
groupedData

Unnamed: 0_level_0,abdominal_pain,abnormal_menstruation,acidity,acute_liver_failure,altered_sensorium,anxiety,back_pain,belly_pain,blackheads,bladder_discomfort,...,vomiting,watering_from_eyes,weakness_in_limbs,weakness_of_one_body_side,weight_gain,weight_loss,yellow_crust_ooze,yellow_urine,yellowing_of_eyes,yellowish_skin
Disease,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
(vertigo) Paroxysmal Positional Vertigo,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
AIDS,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Acne,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
Alcoholic hepatitis,1,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,1
Allergy,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
Arthritis,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Bronchial Asthma,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Cervical spondylosis,0,0,0,0,0,0,1,0,0,0,...,0,0,1,0,0,0,0,0,0,0
Chicken pox,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Chronic cholestasis,1,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,1,1


## The data for Telegram_bota is ready.
### in advance!!!

## Данные для Telegramm_bota  готовы.
### наперад!!!

### Logic of work
- interviewing the user 
- making a list of symptoms 
- - we issue a diagnosis using a pre-trained and saved model
--------------------------------
** pros:**
- any model can be screwed
- the list of symptoms and diseases can be changed depending on which the model is trained on


### Логика работы
- опрашиваем пользователя 
- составляем список симптомов 
- выдаём диагноз с помощью предобученой и сохранённой модели
--------------------------------
**плюсы:**
- модель можно прикрутить любую
- список симптомов и болезней можно менять в зависимости от того на котором обучена модель

## Let's train models.
## Обучим модели.

In [148]:
clf = SVC(probability=True, random_state=11)
clf.fit(my_df.drop("Disease", axis=1), my_df["Disease"])
y=clf.predict_proba(X=my_df.loc[[25]].drop('Disease', axis=1))
# посмотрим на предсказание по 27 ряду
print(my_df.loc[27, "Disease"])
df_diagnoses = pd.DataFrame({
    'Diagnosis': clf.classes_,
    'Probability': y[0]
}).sort_values('Probability', ascending=False )
df_diagnoses=df_diagnoses.reset_index(drop=True)
df_diagnoses

GERD


Unnamed: 0,Diagnosis,Probability
0,GERD,0.725608
1,Paralysis (brain hemorrhage),0.018596
2,Gastroenteritis,0.018464
3,Heart attack,0.016671
4,Drug Reaction,0.013326
5,Allergy,0.009918
6,Fungal infection,0.009785
7,AIDS,0.00975
8,Acne,0.009699
9,Urinary tract infection,0.009648


In [151]:
# сохраняем модель
import pickle

# save
with open(f'{MY_PATH}trained_model_SVC.pkl','wb') as f:
    pickle.dump(clf, f)

In [155]:
clf = LogisticRegression(random_state=11)       
clf.fit(my_df.drop("Disease", axis=1), my_df["Disease"])
y=clf.predict_proba(X=my_df.loc[[27]].drop('Disease', axis=1))

df_diagnoses = pd.DataFrame({
    'Diagnosis': clf.classes_,
    'Probability': y[0]
}).sort_values('Probability', ascending=False )
df_diagnoses=df_diagnoses.reset_index(drop=True)
df_diagnoses

Unnamed: 0,Diagnosis,Probability
0,GERD,0.96526
1,Heart attack,0.008904
2,Drug Reaction,0.002578
3,Gastroenteritis,0.002521
4,Paralysis (brain hemorrhage),0.002266
5,Hypertension,0.001637
6,Peptic ulcer disease,0.001193
7,Bronchial Asthma,0.001027
8,(vertigo) Paroxysmal Positional Vertigo,0.000928
9,Alcoholic hepatitis,0.000911


In [153]:
# сохраняем модель
import pickle

# save
with open(f'{MY_PATH}trained_model_LR.pkl','wb') as f:
    pickle.dump(clf, f)

In [156]:
clf = KNeighborsClassifier()       
clf.fit(my_df.drop("Disease", axis=1), my_df["Disease"])
y=clf.predict_proba(X=my_df.loc[[27]].drop('Disease', axis=1))

df_diagnoses = pd.DataFrame({
    'Diagnosis': clf.classes_,
    'Probability': y[0]
}).sort_values('Probability', ascending=False )
df_diagnoses=df_diagnoses.reset_index(drop=True)
df_diagnoses

Unnamed: 0,Diagnosis,Probability
0,GERD,1.0
1,(vertigo) Paroxysmal Positional Vertigo,0.0
2,Migraine,0.0
3,Hypertension,0.0
4,Hyperthyroidism,0.0
5,Hypoglycemia,0.0
6,Hypothyroidism,0.0
7,Impetigo,0.0
8,Jaundice,0.0
9,Malaria,0.0


### KNeighborsClassifier is a hefty self-confident classifier...  I don't trust such doctors...)
### KNeighborsClassifier - дюже уверенный в себе классификатор... чёт я таким врачам не доверяю...)