# <div style="display:fill;border-radius:5px;background-color:#676F9F;letter-spacing:0.5px;overflow:hidden"><p style="padding:20px;color:#F1EFF5;overflow:hidden;margin:0;font-size:150%;font-style: Helvetica;text-align:center"><b></b>WHAT IF DOCTOR🧑‍⚕️ PREPARES AN AI MODEL??🤔</p></div>

In the previous notebook, we came to know the overview of a project and how it looks like. In this notebook, lets discuss about the dataset we are working in..

## THE_DATASET_CREATION........

I searched a lot of websites to get the true LFT(Liver Function Test) dataset with the very features I wanted but I failed to get the same. After so many attempts of searching, I came to a conclusion that as these are the healthcare related datasets, its very difficult to obtain the same as patients should give consent to use their health related information. Since, I am a doctor, I thought to seek Director's permission(Head of the Institution I'm working in) to use the blood reports of many patients to do a project, I stepped back not to do so because there are many policies which I thought will interfere in my way. So, I made my mind to create a one which looks like the real-world dataset.

Case scenario.....If a person comes with the symptoms of itching, yellowish discoloration of eyes or the whole body, dark colored urine, weight loss,fever,abdominal pain and if I suspect jaundice in that person, I prescribe him some of the tests to see what is wrong in his internal biological harmony(However,in some of the countries, even before patients consult doctors, all the routine blood tests might be ordered). Among many tests, I prescribe LFT(Liver Function Tests) and if I find a person's Bilirubin level to be above the normal range, then I prescribe him serial tests which will be described later. 

## <font color='#4287f5'>Importing necessary libraries...</font>

In [1]:
import pandas as pd                       # library for data manipulation (data cleaning,analysing...)
import numpy as np                        # library for numerical computation
import matplotlib.pyplot as plt           # library for data visualization which is built on NumPy arrays
%matplotlib inline
import seaborn as sns                     # library for data visualization built on top of matplotlib and 
                                          # closely integrated with pandas data structures in Python.

import random

import warnings                           # to avoid warning flash
warnings.filterwarnings('ignore')

# Other libraries will be imported in the following steps as and when required.... 

Lets create the list of unique "ID's" as it will be present in the realworld dataset. But in real world, as a part of HIPAA(Health Insurance Portability and Accountability Acts), this column should be either converted to dummy variable or it should be dropped.

In [2]:
ID = list(np.linspace(0,1000,1000).astype(int))

'age' and 'sex' column is must in medical datasets

In [3]:
age = np.random.normal(50, 15, 1000).astype(int).tolist()
age = [(x + 10) if x < 10 else x for x in age]

In [4]:
sex = []
for i in range(0,1000):
    x = random.randint(0, 1)
    sex.append(x)

In [5]:
# create a DataFrame
Report = {'ID': ID,'age': age,'sex': sex}
df = pd.DataFrame(Report)

In [6]:
df

Unnamed: 0,ID,age,sex
0,0,49,0
1,1,42,1
2,2,50,0
3,3,26,0
4,4,52,0
...,...,...,...
995,995,55,0
996,996,43,1
997,997,42,0
998,998,54,1


## <font color='#4287f5'>Lets label the target variable...</font>

In [7]:
df.insert(3,"Diagnosis",True)

In [8]:
# 1 ---> Pre-hepatic jaundice   (Hemolytic jaundice)
# 2 ---> Hepatic jaundice       (Hepatocellular jaundice)
# 3 ---> Post-hepatic jaundice  (Obstructive jaundice)

Diagnosis = []
for i in range(0,1000):
    x = random.randint(1,3)
    Diagnosis.append(x)
    

In [9]:
df['Diagnosis'] = Diagnosis

In [10]:
df

Unnamed: 0,ID,age,sex,Diagnosis
0,0,49,0,2
1,1,42,1,2
2,2,50,0,3
3,3,26,0,2
4,4,52,0,2
...,...,...,...,...
995,995,55,0,3
996,996,43,1,3
997,997,42,0,1
998,998,54,1,2


In [11]:
df['Diagnosis'].value_counts()

1    349
2    327
3    324
Name: Diagnosis, dtype: int64

In [12]:
df.insert(4,"Conjugated_Bilirubin_count",True)

<!DOCTYPE html>
<html>
<head>
    <meta charset=“utf-8”>
    <title></title>
</head>
<body>

<h1 style= "text-align:center;color:#052261;font-size:120%;"><u> BILIRUBIN TEST  </u></h2>
</body>
</html>

**I.)Serum bilirubin estimation** is based on van den Bergh diazo reaction by spectrophotometric method. Diazo reagent
consists of diazotised sulfanilic acid. 
* Water-soluble conjugatedbilirubin gives direct van den Bergh reaction with diazo reagent within one minute, whereas alcohol-solubleunconjugated bilirubin is determined by indirect van denBergh reaction.
* Addition of alcohol to the reaction mixture gives positive test for both conjugated and unconjugated bilirubin pigment.
* The unconjugated bilirubin level is then estimated by subtracting direct bilirubin value from this total value.
* The serum of normal adults contains less than 1 mg/dl of total bilirubin, out of which less than 0.25 mg/dl is conjugated bilirubin. 
* Bilirubin level rises in diseases of hepatocytes, obstruction to biliary excretion into the duodenum, in haemolysis, and defects of hepatic uptake and conjugation of bilirubin pigment such as in Gilbert’s disease. 

**II.) In faeces**, excretion of bilirubin is assessed by inspection of stools. Clay-coloured stool due to absence of faecal
excretion of the pigment indicates obstructive jaundice.

**III.) In urine**, conjugated bilirubin can be detected bycommercially available ‘dipsticks’, Fouchet’s test, foam test or ictotest tablet method.
* Bilirubinuria does not occur in normal subjects nor is unconjugated bilirubin excreted in the urine.
* Bilirubinuria occurs only when there is raised level of conjugated bilirubin (filterable). Its excretion depends upon the level of conjugated bilirubin in plasma that is not protein bound and is therefore available for glomerular filtration.
* Bilirubinuria appears in patients of hepatitis before the patient becomes jaundiced

In [13]:
def get_Conjugated_Bilirubin_count(x):
    if x == 1:
        return round(random.uniform(0.1,0.4),2)
    elif x == 2:
        return round(random.uniform(0.4,3.0),2)
    else :
        return round(random.uniform(3,15),2)

In [14]:
df['Conjugated_Bilirubin_count'] = df['Diagnosis'].apply(lambda x : round(get_Conjugated_Bilirubin_count(x),2))
                

In [15]:
df

Unnamed: 0,ID,age,sex,Diagnosis,Conjugated_Bilirubin_count
0,0,49,0,2,2.72
1,1,42,1,2,1.62
2,2,50,0,3,9.67
3,3,26,0,2,0.64
4,4,52,0,2,1.98
...,...,...,...,...,...
995,995,55,0,3,8.96
996,996,43,1,3,9.44
997,997,42,0,1,0.31
998,998,54,1,2,2.09


In [16]:
df.insert(5,"Unconjugated_Bilirubin_count",True)

In [17]:
df.insert(6,"Total_Bilirubin_count",True)

In [18]:
def get_Total_Bilirubin_count(x):
    if x == 1:
        return round(random.uniform(2.5,12),2)
    elif x == 2:
        return round(random.uniform(5,12),2)
    else :
        return round(random.uniform(16,18),2)

In [19]:
df['Total_Bilirubin_count'] = df['Diagnosis'].apply(lambda x : round(get_Total_Bilirubin_count(x),2))
                

In [20]:
df['Unconjugated_Bilirubin_count'] = df['Total_Bilirubin_count'] - df['Conjugated_Bilirubin_count']

In [21]:
df

Unnamed: 0,ID,age,sex,Diagnosis,Conjugated_Bilirubin_count,Unconjugated_Bilirubin_count,Total_Bilirubin_count
0,0,49,0,2,2.72,4.01,6.73
1,1,42,1,2,1.62,8.55,10.17
2,2,50,0,3,9.67,6.43,16.10
3,3,26,0,2,0.64,10.22,10.86
4,4,52,0,2,1.98,6.09,8.07
...,...,...,...,...,...,...,...
995,995,55,0,3,8.96,7.27,16.23
996,996,43,1,3,9.44,7.80,17.24
997,997,42,0,1,0.31,2.50,2.81
998,998,54,1,2,2.09,4.83,6.92


In [22]:
df.insert(7,"Conjugated/Total_bilirubin_Ratio",True)

In [23]:
df['Conjugated/Total_bilirubin_Ratio'] = round(df['Conjugated_Bilirubin_count']/df['Total_Bilirubin_count'] * 100,1)

In [24]:
df

Unnamed: 0,ID,age,sex,Diagnosis,Conjugated_Bilirubin_count,Unconjugated_Bilirubin_count,Total_Bilirubin_count,Conjugated/Total_bilirubin_Ratio
0,0,49,0,2,2.72,4.01,6.73,40.4
1,1,42,1,2,1.62,8.55,10.17,15.9
2,2,50,0,3,9.67,6.43,16.10,60.1
3,3,26,0,2,0.64,10.22,10.86,5.9
4,4,52,0,2,1.98,6.09,8.07,24.5
...,...,...,...,...,...,...,...,...
995,995,55,0,3,8.96,7.27,16.23,55.2
996,996,43,1,3,9.44,7.80,17.24,54.8
997,997,42,0,1,0.31,2.50,2.81,11.0
998,998,54,1,2,2.09,4.83,6.92,30.2


In [25]:
df.insert(8,"urine_bilirubin",True)

In [26]:
def get_urine_bilirubin(x):
    if x == 1:
        return 0               # absent
    elif x == 2:
        return 1               # minimally present
    else :
        return 2               # heavily present

In [27]:
df['urine_bilirubin'] = df['Diagnosis'].apply(lambda x : get_urine_bilirubin(x))

In [28]:
df

Unnamed: 0,ID,age,sex,Diagnosis,Conjugated_Bilirubin_count,Unconjugated_Bilirubin_count,Total_Bilirubin_count,Conjugated/Total_bilirubin_Ratio,urine_bilirubin
0,0,49,0,2,2.72,4.01,6.73,40.4,1
1,1,42,1,2,1.62,8.55,10.17,15.9,1
2,2,50,0,3,9.67,6.43,16.10,60.1,2
3,3,26,0,2,0.64,10.22,10.86,5.9,1
4,4,52,0,2,1.98,6.09,8.07,24.5,1
...,...,...,...,...,...,...,...,...,...
995,995,55,0,3,8.96,7.27,16.23,55.2,2
996,996,43,1,3,9.44,7.80,17.24,54.8,2
997,997,42,0,1,0.31,2.50,2.81,11.0,0
998,998,54,1,2,2.09,4.83,6.92,30.2,1


In [29]:
df.insert(9,"urine_urobilinogen",True)

<u>**UROBILINOGEN:**</u> 
* Urobilinogen is normally excreted in the urine.
* Its semiquantitative estimation in the urine can be done by preparing dilutions with Ehrlich’s aldehyde reagent or by ‘dipstick’ method. 
* An increase in urobilinogen in the urine is found in hepatocellular dysfunctions such as in alcoholic liver disease, cirrhosis and malignancy of the liver.
* It is also raised in haemolytic disease and in pyrexia. 
* In cholestatic jaundice due to complete biliary obstruction, urobilinogen disappears from the urine.

In [30]:
def get_urine_urobilinogen(x):
    if x == 1:
        return 2               # increased
    elif x == 2:
        return 1               # minimally decreased
    else :
        return 0               # heavily decreased

In [31]:
df['urine_urobilinogen'] = df['Diagnosis'].apply(lambda x : get_urine_bilirubin(x))

In [32]:
df

Unnamed: 0,ID,age,sex,Diagnosis,Conjugated_Bilirubin_count,Unconjugated_Bilirubin_count,Total_Bilirubin_count,Conjugated/Total_bilirubin_Ratio,urine_bilirubin,urine_urobilinogen
0,0,49,0,2,2.72,4.01,6.73,40.4,1,1
1,1,42,1,2,1.62,8.55,10.17,15.9,1,1
2,2,50,0,3,9.67,6.43,16.10,60.1,2,2
3,3,26,0,2,0.64,10.22,10.86,5.9,1,1
4,4,52,0,2,1.98,6.09,8.07,24.5,1,1
...,...,...,...,...,...,...,...,...,...,...
995,995,55,0,3,8.96,7.27,16.23,55.2,2,2
996,996,43,1,3,9.44,7.80,17.24,54.8,2,2
997,997,42,0,1,0.31,2.50,2.81,11.0,0,0
998,998,54,1,2,2.09,4.83,6.92,30.2,1,1


In [33]:
df['urine_urobilinogen'].value_counts()

0    349
1    327
2    324
Name: urine_urobilinogen, dtype: int64

In [34]:
df.insert(10,"fecal_stercobilinogen",True)

In [35]:
def get_fecal_stercobilinogen(x):
    if x == 1:
        return 2               # increased
    elif x == 2:
        return 1               # minimally decreased
    else :
        return 0               # heavily decreased

In [36]:
df['fecal_stercobilinogen'] = df['Diagnosis'].apply(lambda x : get_fecal_stercobilinogen(x))

In [37]:
df

Unnamed: 0,ID,age,sex,Diagnosis,Conjugated_Bilirubin_count,Unconjugated_Bilirubin_count,Total_Bilirubin_count,Conjugated/Total_bilirubin_Ratio,urine_bilirubin,urine_urobilinogen,fecal_stercobilinogen
0,0,49,0,2,2.72,4.01,6.73,40.4,1,1,1
1,1,42,1,2,1.62,8.55,10.17,15.9,1,1,1
2,2,50,0,3,9.67,6.43,16.10,60.1,2,2,0
3,3,26,0,2,0.64,10.22,10.86,5.9,1,1,1
4,4,52,0,2,1.98,6.09,8.07,24.5,1,1,1
...,...,...,...,...,...,...,...,...,...,...,...
995,995,55,0,3,8.96,7.27,16.23,55.2,2,2,0
996,996,43,1,3,9.44,7.80,17.24,54.8,2,2,0
997,997,42,0,1,0.31,2.50,2.81,11.0,0,0,2
998,998,54,1,2,2.09,4.83,6.92,30.2,1,1,1


In [38]:
df.insert(11,"urine_bile_salts",True)

**<u>BILE ACIDS (BILE SALTS):</u>**
* The primary bile acids (cholic acid and cheno-deoxycholic acid) are formed from cholesterol in the hepatocytes.
* These bile acids on secretion into the gut come in contact with colonic bacteria and undergo deconjugation with the production of secondary bile acids (deoxycholic acid and lithocholic acid).
* Most of these bile acids are reabsorbed through enterohepatic circulation and reach the liver. 
* Only about 10% of the total bile acids are excreted in the faeces normally as unabsorbable toxic lithocholic acid.
* Hepatobiliary diseases with cholestasis are associated with raised levels of serum bile acids which are responsible for producing itching (pruritus).
* These acids are excreted in the urine by active transport and passive diffusion and can be detected by simple methods as Hay’s test and ‘dipsticks’.



In [39]:
def get_urine_bile_salts(x):
    if x == 1:
        return 0               # absent
    elif x == 2:
        return 1               # minimally present
    else :
        return 2               # heavily present

In [40]:
df['urine_bile_salts'] = df['Diagnosis'].apply(lambda x : get_urine_bile_salts(x))

In [41]:
df['urine_bile_salts'].value_counts()

0    349
1    327
2    324
Name: urine_bile_salts, dtype: int64

In [42]:
df.insert(12,"AST",True)

**<U>Aspartate aminotransferase (AST):</U>**
* AST exists as two different isoenzymes: mitochondrial and cytoplasmic form.
* AST is found in highest concentration in heart compared with other tissues 'of the body such as liver, skeletal muscle and kidney.
* Elevated mitochondrial AST is seen in extensive tissue necrosis during myocardial infarction and also in acute and chronic liver diseases.
* About 80% of AST activity of liver is contributed by mitochondrial isoenzyme, whereas most of circulating AST activity in normal people is derived from cytosolic isoenzyme.
* A ratio of AST/ALT is >5, especially if ALT is normal or slightly elevated, is suggestive of injury to extrahepatic tissues, such as skeletal muscle in the case of rhabdomyolysis or strenuous exercise.
* ALT is present in highest concentration in periportal hepatocytes and in lowest concentration in hepatocytes surrounding the central vein. AST is present in hepatocytes at more constant levels.
* Hepatocytes around the central vein have the lowest oxygen concentration and thus are more prone to damage in the setting of acute hepatic ischaemia which results in AST value greater than ALT.
* After there is no further injury to hepatocytes, rate of decline of AST and ALT depends on their rate of clearance from the circulation. Plasma half-life of ALT is nearly three times that of AST. Hence AST declines more rapidly than ALT, and ALT may be higher than AST in the recovery phase of injury.


In [43]:
def get_AST_value(x):
    if x == 1:
        return random.randint(5,40)
    elif x == 2:
        return random.randint(120,200)
    else:
        return random.randint(40,120)

In [44]:
df['AST'] = df['Diagnosis'].apply(lambda x : get_AST_value(x))

In [45]:
df.insert(13,"ALT",True)

**<u>Alanine aminotransferase (ALT):</u>**
* Concentration of ALT is much higher in the liver than in other tissues (e.g. kidney, heart muscle). Hence, a raised ALT is a very sensitive index of hepatic damage.
- 100-500 times rise occurs in paracetamol-induced liver damage.
- 10-100 times rise occurs in:
    - Viral hepatitis (increase in ALT associated with hepatitis C infection tends to be more than that associated with hepatitis A or B)
    - Acute drug-induced hepatitis
    - Acute circulatory failure (ischaemic liver injury)
    - Exacerbations of chronic hepatitis
- 2-10 times rise occurs in:
    - Infectious mononucleosis
    - Cytomegalovirus infections
- Less than five fold rise occurs in: 
    - Alcoholic hepatitis (AST:ALT ratio >2) 
    - Obstructive jaundice 
    - Cirrhosis of liver 
    - Non-alcoholic fatty liver 
    - Chronic hepatitis 
* Pyridoxal 5' -phosphate is a coenzyme required for synthesis of transaminases especially ALT. This enzyme is deficient in alcoholic patients; hence, AST:ALT ratio is more than 2 in patients with alcoholic liver disease

In [46]:
def get_ALT_value(x):
    if x == 1:
        return random.randint(0,25)
    elif x == 2:
        return random.randint(80,200)
    else:
        return random.randint(25,80)

In [47]:
df['ALT'] = df['Diagnosis'].apply(lambda x : get_ALT_value(x))

In [48]:
df

Unnamed: 0,ID,age,sex,Diagnosis,Conjugated_Bilirubin_count,Unconjugated_Bilirubin_count,Total_Bilirubin_count,Conjugated/Total_bilirubin_Ratio,urine_bilirubin,urine_urobilinogen,fecal_stercobilinogen,urine_bile_salts,AST,ALT
0,0,49,0,2,2.72,4.01,6.73,40.4,1,1,1,1,141,193
1,1,42,1,2,1.62,8.55,10.17,15.9,1,1,1,1,133,191
2,2,50,0,3,9.67,6.43,16.10,60.1,2,2,0,2,75,73
3,3,26,0,2,0.64,10.22,10.86,5.9,1,1,1,1,174,114
4,4,52,0,2,1.98,6.09,8.07,24.5,1,1,1,1,153,114
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,995,55,0,3,8.96,7.27,16.23,55.2,2,2,0,2,109,42
996,996,43,1,3,9.44,7.80,17.24,54.8,2,2,0,2,58,45
997,997,42,0,1,0.31,2.50,2.81,11.0,0,0,2,0,23,19
998,998,54,1,2,2.09,4.83,6.92,30.2,1,1,1,1,120,94


In [49]:
df.insert(14,"ALP",True)

<U>**Alkaline Phosphatase (ALP):**</U>
* Serum contains ALP activity derived from liver, bone, intestines, proximal tubules of kidneys and placenta. Normal serum level is 3-13 KA units (80-240 IU/L).
* In hepatocellular jaundice, only very small amount of ALP is liberated from the cells and the rise in ALP is less than 2.5 folds.
* In obstructive jaundice, due to obstruction of biliary tract at any level, new ALP is synthesised that escapes into blood. Hence, ALP levels are markedly raised in obstructive jaundice.
* If the source of ALP is not clear, determine the levels of two enzymes, -y-glutamyl transpeptidase and 5 '-nucleotidase. These are more specific for liver. The associated elevation of these two enzymes confirms the source of ALP as liver.


In [50]:
def get_ALP_value(x):
    if x == 1:
        return random.randint(30,115)
    elif x == 2:
        return random.randint(115,300)
    else:
        return random.randint(300,1100)

In [51]:
df['ALP'] = df['Diagnosis'].apply(lambda x : get_ALP_value(x))

In [52]:
df.insert(15,"Serum_albumin",True)

<U>**Serum proteins:**</U>
* Liver cells synthesise albumin,fibrinogen, prothrombin, alpha-1-antitrypsin, haptoglobin, ceruloplasmin, transferrin, alpha fetoproteins and acute phase reactant proteins.
* The blood levels of these plasma proteins are decreased in extensive liver damage.
* Routinely estimated are total concentration of serum proteins (normal 6.7 to 8.6 gm/dl), serum albumin (normal 3.5 to 5.5 gm/dl),serum globulin (normal 2 to 3.5 gm/dl) and albumin/globulin (A/G) ratio (normal 1.5-3:1). 
* Electrophoresis is used to determine the proportions of α1, α2, β and γ globulins.
* Due to the availability of protein electrophoresis, thymol turbidity and flocculation tests based on altered plasma protein components have been discontinued.
* Hypoalbuminaemia may occur in liver diseases having significant destruction of hepatocytes. Hyperglobulinaemia may be present in chronic inflammatory disorders such as in cirrhosis and chronic hepatitis.

In [53]:
def get_Serum_albumin(x):
    if x == 2:
        return random.uniform(1.5,3.5)    #g/dl
    else:
        return random.uniform(3.5,5.5)

In [54]:
df['Serum_albumin'] = df['Diagnosis'].apply(lambda x : round(get_Serum_albumin(x),2))

In [55]:
df

Unnamed: 0,ID,age,sex,Diagnosis,Conjugated_Bilirubin_count,Unconjugated_Bilirubin_count,Total_Bilirubin_count,Conjugated/Total_bilirubin_Ratio,urine_bilirubin,urine_urobilinogen,fecal_stercobilinogen,urine_bile_salts,AST,ALT,ALP,Serum_albumin
0,0,49,0,2,2.72,4.01,6.73,40.4,1,1,1,1,141,193,183,1.55
1,1,42,1,2,1.62,8.55,10.17,15.9,1,1,1,1,133,191,257,3.01
2,2,50,0,3,9.67,6.43,16.10,60.1,2,2,0,2,75,73,902,4.95
3,3,26,0,2,0.64,10.22,10.86,5.9,1,1,1,1,174,114,186,2.83
4,4,52,0,2,1.98,6.09,8.07,24.5,1,1,1,1,153,114,290,2.07
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,995,55,0,3,8.96,7.27,16.23,55.2,2,2,0,2,109,42,647,4.28
996,996,43,1,3,9.44,7.80,17.24,54.8,2,2,0,2,58,45,869,3.86
997,997,42,0,1,0.31,2.50,2.81,11.0,0,0,2,0,23,19,42,4.40
998,998,54,1,2,2.09,4.83,6.92,30.2,1,1,1,1,120,94,256,3.43


The very next two features are subjective/clinical findings rather than like other features which are laboratory findings. So,undue prominence should not be given to these features unless prompted...

In [56]:
df.insert(16,"urine_color",True)

In [57]:
def get_urine_color(x):
    if x == 1:
        return 0               #normal
    elif x == 2:
        return 1               #mildly dark
    else :
        return 2               #heavily dark

In [58]:
df['urine_color'] = df['Diagnosis'].apply(lambda x : get_urine_color(x))

In [59]:
df.insert(17,"stool_color",True)

In [60]:
def get_stool_color(x):
    if x == 1:
        return 2                       # dark brown
    elif x == 2:
        return 0                       # normal
    else:
        return 1                       # clay

In [61]:
df['stool_color'] = df['Diagnosis'].apply(lambda x : get_stool_color(x))

In [62]:
df

Unnamed: 0,ID,age,sex,Diagnosis,Conjugated_Bilirubin_count,Unconjugated_Bilirubin_count,Total_Bilirubin_count,Conjugated/Total_bilirubin_Ratio,urine_bilirubin,urine_urobilinogen,fecal_stercobilinogen,urine_bile_salts,AST,ALT,ALP,Serum_albumin,urine_color,stool_color
0,0,49,0,2,2.72,4.01,6.73,40.4,1,1,1,1,141,193,183,1.55,1,0
1,1,42,1,2,1.62,8.55,10.17,15.9,1,1,1,1,133,191,257,3.01,1,0
2,2,50,0,3,9.67,6.43,16.10,60.1,2,2,0,2,75,73,902,4.95,2,1
3,3,26,0,2,0.64,10.22,10.86,5.9,1,1,1,1,174,114,186,2.83,1,0
4,4,52,0,2,1.98,6.09,8.07,24.5,1,1,1,1,153,114,290,2.07,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,995,55,0,3,8.96,7.27,16.23,55.2,2,2,0,2,109,42,647,4.28,2,1
996,996,43,1,3,9.44,7.80,17.24,54.8,2,2,0,2,58,45,869,3.86,2,1
997,997,42,0,1,0.31,2.50,2.81,11.0,0,0,2,0,23,19,42,4.40,0,2
998,998,54,1,2,2.09,4.83,6.92,30.2,1,1,1,1,120,94,256,3.43,1,0


In [63]:
df.insert(18,"Hb%",True)

The percentage of hemoglobin will be decreased in case of hemolytic disease which in turn results in jaundice. In case of hepatocellular disease, Hb% may or may not be decreased. In the below code, I had sligtly decreased Hb% in case of hepatocellular disease.

In [64]:
def get_Hb_level(x):
    if x == 3:
        return random.uniform(11.0,15.0)                      
    elif x == 2:
        return random.uniform(8.0,11.0)                      
    else:
        return random.uniform(4.0,9.0)          

In [65]:
df['Hb%'] = df['Diagnosis'].apply(lambda x : get_Hb_level(x))

In [66]:
df

Unnamed: 0,ID,age,sex,Diagnosis,Conjugated_Bilirubin_count,Unconjugated_Bilirubin_count,Total_Bilirubin_count,Conjugated/Total_bilirubin_Ratio,urine_bilirubin,urine_urobilinogen,fecal_stercobilinogen,urine_bile_salts,AST,ALT,ALP,Serum_albumin,urine_color,stool_color,Hb%
0,0,49,0,2,2.72,4.01,6.73,40.4,1,1,1,1,141,193,183,1.55,1,0,9.776846
1,1,42,1,2,1.62,8.55,10.17,15.9,1,1,1,1,133,191,257,3.01,1,0,10.419697
2,2,50,0,3,9.67,6.43,16.10,60.1,2,2,0,2,75,73,902,4.95,2,1,13.307444
3,3,26,0,2,0.64,10.22,10.86,5.9,1,1,1,1,174,114,186,2.83,1,0,9.250229
4,4,52,0,2,1.98,6.09,8.07,24.5,1,1,1,1,153,114,290,2.07,1,0,9.737639
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,995,55,0,3,8.96,7.27,16.23,55.2,2,2,0,2,109,42,647,4.28,2,1,12.129707
996,996,43,1,3,9.44,7.80,17.24,54.8,2,2,0,2,58,45,869,3.86,2,1,11.641941
997,997,42,0,1,0.31,2.50,2.81,11.0,0,0,2,0,23,19,42,4.40,0,2,5.579198
998,998,54,1,2,2.09,4.83,6.92,30.2,1,1,1,1,120,94,256,3.43,1,0,8.199968


In [67]:
df.insert(19,"Heart_rate",True)

Bradycardia (low heart rate) is the feature (not always) in obstructive jaundice(post-hepatic disease) whereas heart rate will be usually normal in other two cases...

In [68]:
def get_Heart_rate(x):
    if x == 3:
        return random.randint(40,60)                      
    elif x == 2:
        return random.randint(60,75)                      
    else:
        return random.randint(75,100)          

In [69]:
df['Heart_rate'] = df['Diagnosis'].apply(lambda x : get_Heart_rate(x))

In [70]:
df

Unnamed: 0,ID,age,sex,Diagnosis,Conjugated_Bilirubin_count,Unconjugated_Bilirubin_count,Total_Bilirubin_count,Conjugated/Total_bilirubin_Ratio,urine_bilirubin,urine_urobilinogen,fecal_stercobilinogen,urine_bile_salts,AST,ALT,ALP,Serum_albumin,urine_color,stool_color,Hb%,Heart_rate
0,0,49,0,2,2.72,4.01,6.73,40.4,1,1,1,1,141,193,183,1.55,1,0,9.776846,62
1,1,42,1,2,1.62,8.55,10.17,15.9,1,1,1,1,133,191,257,3.01,1,0,10.419697,70
2,2,50,0,3,9.67,6.43,16.10,60.1,2,2,0,2,75,73,902,4.95,2,1,13.307444,57
3,3,26,0,2,0.64,10.22,10.86,5.9,1,1,1,1,174,114,186,2.83,1,0,9.250229,72
4,4,52,0,2,1.98,6.09,8.07,24.5,1,1,1,1,153,114,290,2.07,1,0,9.737639,66
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,995,55,0,3,8.96,7.27,16.23,55.2,2,2,0,2,109,42,647,4.28,2,1,12.129707,56
996,996,43,1,3,9.44,7.80,17.24,54.8,2,2,0,2,58,45,869,3.86,2,1,11.641941,57
997,997,42,0,1,0.31,2.50,2.81,11.0,0,0,2,0,23,19,42,4.40,0,2,5.579198,99
998,998,54,1,2,2.09,4.83,6.92,30.2,1,1,1,1,120,94,256,3.43,1,0,8.199968,69


In [71]:
cols = df.columns.tolist()

In [72]:
cols

['ID',
 'age',
 'sex',
 'Diagnosis',
 'Conjugated_Bilirubin_count',
 'Unconjugated_Bilirubin_count',
 'Total_Bilirubin_count',
 'Conjugated/Total_bilirubin_Ratio',
 'urine_bilirubin',
 'urine_urobilinogen',
 'fecal_stercobilinogen',
 'urine_bile_salts',
 'AST',
 'ALT',
 'ALP',
 'Serum_albumin',
 'urine_color',
 'stool_color',
 'Hb%',
 'Heart_rate']

In [73]:
#interchange the column index
cols = ['ID','age','sex','Conjugated_Bilirubin_count','Unconjugated_Bilirubin_count','Total_Bilirubin_count','Conjugated/Total_bilirubin_Ratio','urine_bilirubin','urine_urobilinogen','fecal_stercobilinogen','urine_bile_salts','AST','ALT','ALP','Serum_albumin','urine_color','stool_color','Hb%','Heart_rate','Diagnosis',]

In [74]:
df = df[cols]

In [75]:
df

Unnamed: 0,ID,age,sex,Conjugated_Bilirubin_count,Unconjugated_Bilirubin_count,Total_Bilirubin_count,Conjugated/Total_bilirubin_Ratio,urine_bilirubin,urine_urobilinogen,fecal_stercobilinogen,urine_bile_salts,AST,ALT,ALP,Serum_albumin,urine_color,stool_color,Hb%,Heart_rate,Diagnosis
0,0,49,0,2.72,4.01,6.73,40.4,1,1,1,1,141,193,183,1.55,1,0,9.776846,62,2
1,1,42,1,1.62,8.55,10.17,15.9,1,1,1,1,133,191,257,3.01,1,0,10.419697,70,2
2,2,50,0,9.67,6.43,16.10,60.1,2,2,0,2,75,73,902,4.95,2,1,13.307444,57,3
3,3,26,0,0.64,10.22,10.86,5.9,1,1,1,1,174,114,186,2.83,1,0,9.250229,72,2
4,4,52,0,1.98,6.09,8.07,24.5,1,1,1,1,153,114,290,2.07,1,0,9.737639,66,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,995,55,0,8.96,7.27,16.23,55.2,2,2,0,2,109,42,647,4.28,2,1,12.129707,56,3
996,996,43,1,9.44,7.80,17.24,54.8,2,2,0,2,58,45,869,3.86,2,1,11.641941,57,3
997,997,42,0,0.31,2.50,2.81,11.0,0,0,2,0,23,19,42,4.40,0,2,5.579198,99,1
998,998,54,1,2.09,4.83,6.92,30.2,1,1,1,1,120,94,256,3.43,1,0,8.199968,69,2


## <font color='#4287f5'>Lets save the dataset...</font>

In [81]:
df.to_csv(r"D:\DSF\PROJECTPRO\MEDICAL_PROJECTS\3.HEPATIC_DISEASES\jaundice.csv",index = False)