# **Data Overview** for Exploratory Data Analysis of **Healthcare Datasets**

- Exploratory Data Analysis (EDA) is the process of understanding datasets by summarizing their main characteristics, often by plotting them visually.
- Through this exploratory data analysis, we want to get an overview of each dataset required for this project.

## Different Datasets present for Exploratory Data Analysis

We are going to analyze the following datasets for this **Exploratory Data Analysis**:
- Original_Dataset.csv
- Disease_Description.csv
- Doctor_Specialist.csv
- Doctor_Versus_Disease.csv
- Symptom_Weights.csv
- Specialist.xlsx

## Importing Dependencies

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Used to show all columns of the dataset
pd.set_option('display.max_columns', None)

## 1. Data Overview of `Original_Dataset.csv`

### Loading Dataset using Pandas Library

In [2]:
disease_vs_symptoms_df = pd.read_csv('datasets/Original_Dataset.csv')
disease_vs_symptoms_df.head()

Unnamed: 0,Disease,Symptom_1,Symptom_2,Symptom_3,Symptom_4,Symptom_5,Symptom_6,Symptom_7,Symptom_8,Symptom_9,Symptom_10,Symptom_11,Symptom_12,Symptom_13,Symptom_14,Symptom_15,Symptom_16,Symptom_17
0,Fungal infection,itching,skin_rash,nodal_skin_eruptions,dischromic _patches,,,,,,,,,,,,,
1,Fungal infection,skin_rash,nodal_skin_eruptions,dischromic _patches,,,,,,,,,,,,,,
2,Fungal infection,itching,nodal_skin_eruptions,dischromic _patches,,,,,,,,,,,,,,
3,Fungal infection,itching,skin_rash,dischromic _patches,,,,,,,,,,,,,,
4,Fungal infection,itching,skin_rash,nodal_skin_eruptions,,,,,,,,,,,,,,
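The output above contains the value `dischromic _patches`, with a stray space before the underscore. A minimal sketch of one way such strings could be normalised before counting distinct symptoms is given here. The helper `normalise_symptom` and the small `sample_df` are hypothetical and only mimic the rows shown; replacing remaining spaces with underscores is an assumption, not something the notebook does.

```python
import re
import pandas as pd

# Hypothetical miniature of the symptom table, mimicking the head() output above.
sample_df = pd.DataFrame({
    "Disease": ["Fungal infection", "Fungal infection"],
    "Symptom_1": ["itching", "skin_rash"],
    "Symptom_2": ["skin_rash", "nodal_skin_eruptions"],
    "Symptom_3": ["dischromic _patches", "dischromic _patches"],
    "Symptom_4": [None, None],
})

def normalise_symptom(value):
    """Tidy whitespace in a symptom string; missing values pass through unchanged."""
    if pd.isna(value):
        return value
    text = str(value).strip()
    text = re.sub(r"\s*_\s*", "_", text)  # "dischromic _patches" -> "dischromic_patches"
    return re.sub(r"\s+", "_", text)      # assumption: remaining spaces become underscores

symptom_cols = [c for c in sample_df.columns if c.startswith("Symptom_")]
cleaned_df = sample_df.copy()
for col in symptom_cols:
    cleaned_df[col] = cleaned_df[col].map(normalise_symptom)

# Distinct symptoms across all symptom columns, ignoring missing values.
all_values = cleaned_df[symptom_cols].to_numpy().ravel()
distinct_symptoms = sorted({v for v in all_values if pd.notna(v)})
print(distinct_symptoms)
```

Whether the real data should be normalised this way depends on how the symptom names are matched against the other files, so treat this as an option rather than a required step.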


### Checking `shape` of dataset

In [3]:
disease_vs_symptoms_df.shape

(4920, 18)

### Checking `data types` of the dataset

In [4]:
disease_vs_symptoms_df.dtypes

Disease       object
Symptom_1     object
Symptom_2     object
Symptom_3     object
Symptom_4     object
Symptom_5     object
Symptom_6     object
Symptom_7     object
Symptom_8     object
Symptom_9     object
Symptom_10    object
Symptom_11    object
Symptom_12    object
Symptom_13    object
Symptom_14    object
Symptom_15    object
Symptom_16    object
Symptom_17    object
dtype: object

### Getting an `overview` of the dataset

In [5]:
disease_vs_symptoms_df.describe(include="all")

Unnamed: 0,Disease,Symptom_1,Symptom_2,Symptom_3,Symptom_4,Symptom_5,Symptom_6,Symptom_7,Symptom_8,Symptom_9,Symptom_10,Symptom_11,Symptom_12,Symptom_13,Symptom_14,Symptom_15,Symptom_16,Symptom_17
count,4920,4920,4920,4920,4572,3714,2934,2268,1944,1692,1512,1194,744,504,306,240,192,72
unique,41,34,48,54,50,38,32,26,21,22,21,18,11,8,4,3,3,1
top,Fungal infection,vomiting,vomiting,fatigue,high_fever,headache,nausea,abdominal_pain,abdominal_pain,yellowing_of_eyes,yellowing_of_eyes,irritability,malaise,muscle_pain,chest_pain,chest_pain,blood_in_sputum,muscle_pain
freq,120,822,870,726,378,348,390,264,276,228,198,120,126,72,96,144,72,72
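The `Disease` column of the summary above reports 4920 rows, 41 unique values and a top frequency of 120. Since 41 × 120 = 4920 and no disease can occur more often than the top one, every disease must appear exactly 120 times, so the classes are balanced. The sketch below shows how the same check could be run directly with `value_counts()`; `balance_df` is a hypothetical stand-in with three made-up labels, not the real data.

```python
import pandas as pd

# Arithmetic taken from the describe() output above.
n_rows, n_diseases, top_freq = 4920, 41, 120
classes_must_be_balanced = n_diseases * top_freq == n_rows

# Hypothetical stand-in frame: 3 labels with 4 rows each.
balance_df = pd.DataFrame({"Disease": ["A"] * 4 + ["B"] * 4 + ["C"] * 4})

rows_per_disease = balance_df["Disease"].value_counts()
is_balanced = rows_per_disease.nunique() == 1  # True when every label has the same count

print(classes_must_be_balanced, is_balanced)
```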


### Checking `null values` of the dataset

In [32]:
disease_vs_symptoms_df.isnull().sum()

Disease          0
Symptom_1        0
Symptom_2        0
Symptom_3        0
Symptom_4      348
Symptom_5     1206
Symptom_6     1986
Symptom_7     2652
Symptom_8     2976
Symptom_9     3228
Symptom_10    3408
Symptom_11    3726
Symptom_12    4176
Symptom_13    4416
Symptom_14    4614
Symptom_15    4680
Symptom_16    4728
Symptom_17    4848
dtype: int64
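The null counts above are easier to compare as a share of the 4920 rows, and counting non-null symptoms per row shows how many symptoms each record carries. The following is a small sketch under those assumptions: `null_counts` copies three values from the output above, while `toy_df` is a hypothetical two-row frame built from the symptom names seen earlier.

```python
import pandas as pd

total_rows = 4920

# Three of the null counts reported above, expressed as a percentage of all rows.
null_counts = pd.Series({"Symptom_3": 0, "Symptom_4": 348, "Symptom_17": 4848})
null_pct = (null_counts / total_rows * 100).round(2)
print(null_pct)

# Hypothetical two-row frame: count how many symptoms each record lists.
toy_df = pd.DataFrame({
    "Symptom_1": ["itching", "skin_rash"],
    "Symptom_2": ["skin_rash", "nodal_skin_eruptions"],
    "Symptom_3": ["nodal_skin_eruptions", "dischromic _patches"],
    "Symptom_4": ["dischromic _patches", None],
})
symptoms_per_row = toy_df.notna().sum(axis=1)
print(symptoms_per_row.tolist())
```

On the real frame, the same `notna().sum(axis=1)` call restricted to the symptom columns would give the distribution of symptoms per record.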

## 2. Data Overview of `Disease_Description.csv`

### Loading Dataset using Pandas Library

In [7]:
disease_description_df = pd.read_csv('datasets/Disease_Description.csv')
disease_description_df.head()

Unnamed: 0,Disease,Description
0,Drug Reaction,An adverse drug reaction (ADR) is an injury ca...
1,Malaria,An infectious disease caused by protozoan para...
2,Allergy,An allergy is an immune system response to a f...
3,Hypothyroidism,"Hypothyroidism, also called underactive thyroi..."
4,Psoriasis,Psoriasis is a common skin disorder that forms...


### Checking `shape` of dataset

In [8]:
disease_description_df.shape

(41, 2)

### Checking `data types` of the dataset

In [9]:
disease_description_df.dtypes

Disease        object
Description    object
dtype: object

### Getting an `overview` of the dataset

In [10]:
disease_description_df.describe(include="all").T

Unnamed: 0,count,unique,top,freq
Disease,41,41,Drug Reaction,1
Description,41,41,An adverse drug reaction (ADR) is an injury ca...,1


### Checking `null values` of the dataset

In [11]:
disease_description_df.isnull().sum()

Disease        0
Description    0
dtype: int64
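The same four checks (`shape`, `dtypes`, `describe`, `isnull().sum()`) are repeated for every dataset in this notebook. As a sketch, they could be bundled into one helper; the function name `data_overview` and the placeholder rows in `demo_df` are hypothetical and not part of the original notebook.

```python
import pandas as pd

def data_overview(df):
    """Collect the four checks used throughout this notebook into one dictionary."""
    return {
        "shape": df.shape,
        "dtypes": df.dtypes,
        "describe": df.describe(include="all").T,
        "nulls": df.isnull().sum(),
    }

# Hypothetical two-row frame with placeholder descriptions.
demo_df = pd.DataFrame({
    "Disease": ["Malaria", "Allergy"],
    "Description": ["placeholder text A", "placeholder text B"],
})

overview = data_overview(demo_df)
for name, result in overview.items():
    print(f"--- {name} ---")
    print(result)
```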

## 3. Data Overview of `Doctor_Specialist.csv`

### Loading Dataset using Pandas Library

In [12]:
doctor_specialist_df = pd.read_csv('datasets/Doctor_Specialist.csv')
doctor_specialist_df.head()

Unnamed: 0,Doctor Specialist
0,Dermatologist
1,Allergist
2,Gastroenterologist
3,Hepatologist
4,Osteopathic


### Checking `shape` of dataset

In [13]:
doctor_specialist_df.shape

(19, 1)

### Checking `data types` of the dataset

In [14]:
doctor_specialist_df.dtypes

Doctor Specialist    object
dtype: object

### Getting an `overview` of the dataset

In [15]:
doctor_specialist_df.describe(include="all").T

Unnamed: 0,count,unique,top,freq
Doctor Specialist,19,19,Dermatologist,1


### Checking `null values` of the dataset

In [16]:
doctor_specialist_df.isnull().sum()

Doctor Specialist    0
dtype: int64

## 4. Data Overview of `Doctor_Versus_Disease.csv`

### Loading Dataset using Pandas Library

In [17]:
doctor_versus_disease_df = pd.read_csv('datasets/Doctor_Versus_Disease.csv', header=None, index_col=None, encoding='ISO-8859-1')
doctor_versus_disease_df.head()

Unnamed: 0,0,1
0,Drug Reaction,Allergist
1,Allergy,Allergist
2,Hypertension,Cardiologist
3,Heart attack,Cardiologist
4,Psoriasis,Dermatologist
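Because the file is read with `header=None`, the columns are labelled `0` and `1`. One option is to pass explicit labels through the `names` parameter of `pd.read_csv`. The sketch below reads an in-memory CSV built from the first rows shown above; the labels `Disease` and `Specialist` are assumed names chosen for illustration.

```python
import io
import pandas as pd

# In-memory stand-in for the first rows of the file shown above.
csv_text = (
    "Drug Reaction,Allergist\n"
    "Allergy,Allergist\n"
    "Hypertension,Cardiologist\n"
)

# `names` supplies column labels for a file without a header row (labels are assumptions).
named_df = pd.read_csv(
    io.StringIO(csv_text),
    header=None,
    names=["Disease", "Specialist"],
)
print(named_df)
```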


### Checking `shape` of dataset

In [18]:
doctor_versus_disease_df.shape

(41, 2)

### Checking `data types` of the dataset

In [19]:
doctor_versus_disease_df.dtypes

0    object
1    object
dtype: object

### Getting an `overview` of the dataset

In [20]:
doctor_versus_disease_df.describe().T

Unnamed: 0,count,unique,top,freq
0,41,41,Drug Reaction,1
1,41,18,Hepatologist,6


### Checking `null values` of the dataset

In [21]:
doctor_versus_disease_df.isnull().sum()

0    0
1    0
dtype: int64
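The overview above reports 18 unique specialists in the mapping, while `Doctor_Specialist.csv` lists 19. A set difference is one way to find which names differ between the two files. The sketch below uses only the `head()` rows printed earlier, so its result on this fragment says nothing about the full files; the variable names are hypothetical.

```python
import pandas as pd

# First five rows of each file, as printed by head() above (fragments only).
listed_specialists = pd.Series(
    ["Dermatologist", "Allergist", "Gastroenterologist", "Hepatologist", "Osteopathic"]
)
mapped_specialists = pd.Series(
    ["Allergist", "Allergist", "Cardiologist", "Cardiologist", "Dermatologist"]
)

# Names in the specialist list that no disease in the fragment maps to.
unused = sorted(set(listed_specialists) - set(mapped_specialists))
# Names used in the mapping fragment that are absent from the list fragment.
unlisted = sorted(set(mapped_specialists) - set(listed_specialists))

print("listed but unused:", unused)
print("used but unlisted:", unlisted)
```

Run on the complete columns, the first set would show which of the 19 specialists is not assigned to any disease.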

## 5. Data Overview of `Symptom_Weights.csv`

### Loading Dataset using Pandas Library

In [41]:
symptom_weights_df = pd.read_csv('datasets/Symptom_Weights.csv', header=None, index_col=None)
symptom_weights_df.head()

Unnamed: 0,0,1
0,abdominal_pain,1
1,abnormal_menstruation,2
2,acidity,3
3,acute_liver_failure,4
4,altered_sensorium,5


### Checking `shape` of dataset

In [35]:
symptom_weights_df.shape

(131, 2)

### Checking `data types` of the dataset

In [36]:
symptom_weights_df.dtypes

0    object
1     int64
dtype: object

### Getting an `overview` of the dataset

In [37]:
symptom_weights_df.describe(include="all").T

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
0,131.0,131.0,abdominal_pain,1.0,,,,,,,
1,131.0,,,,66.0,37.960506,1.0,33.5,66.0,98.5,131.0


### Checking `null values` of the dataset

In [38]:
symptom_weights_df.isnull().sum()

0    0
1    0
dtype: int64
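The statistics reported for column `1` (mean 66.0, std 37.960506, quartiles 33.5 / 66.0 / 98.5, min 1, max 131) are exactly what the integers 1 to 131 produce, which is consistent with the column being a serial ranking. For the integers 1..n, the sample standard deviation is sqrt(n(n+1)/12). The sketch below reproduces the numbers; it uses a generated series, not the file itself.

```python
import math
import pandas as pd

n = 131
serial = pd.Series(range(1, n + 1))  # generated stand-in for column 1

mean_value = serial.mean()
std_value = serial.std()             # sample standard deviation (ddof=1), as in describe()
q25, q50, q75 = serial.quantile([0.25, 0.5, 0.75])

# Closed form for the sample standard deviation of 1..n.
closed_form_std = math.sqrt(n * (n + 1) / 12)

print(mean_value, round(std_value, 6), q25, q50, q75)
```

A direct check on the loaded frame would be to compare the sorted values of column `1` with `range(1, 132)`.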

## 6. Data Overview of `Specialist.xlsx`

### Loading Dataset using Pandas Library

In [27]:
specialist_df = pd.read_excel('datasets/Specialist.xlsx')
specialist_df.head()

Unnamed: 0.1,Unnamed: 0,itching,skin_rash,nodal_skin_eruptions,dischromic _patches,continuous_sneezing,shivering,chills,watering_from_eyes,stomach_pain,acidity,ulcers_on_tongue,vomiting,cough,chest_pain,yellowish_skin,nausea,loss_of_appetite,abdominal_pain,yellowing_of_eyes,burning_micturition,spotting_ urination,passage_of_gases,internal_itching,indigestion,muscle_wasting,patches_in_throat,high_fever,extra_marital_contacts,fatigue,weight_loss,restlessness,lethargy,irregular_sugar_level,blurred_and_distorted_vision,obesity,excessive_hunger,increased_appetite,polyuria,sunken_eyes,dehydration,diarrhoea,breathlessness,family_history,mucoid_sputum,headache,dizziness,loss_of_balance,lack_of_concentration,stiff_neck,depression,irritability,visual_disturbances,back_pain,weakness_in_limbs,neck_pain,weakness_of_one_body_side,altered_sensorium,dark_urine,sweating,muscle_pain,mild_fever,swelled_lymph_nodes,malaise,red_spots_over_body,joint_pain,pain_behind_the_eyes,constipation,toxic_look_(typhos),belly_pain,yellow_urine,receiving_blood_transfusion,receiving_unsterile_injections,coma,stomach_bleeding,acute_liver_failure,swelling_of_stomach,distention_of_abdomen,history_of_alcohol_consumption,fluid_overload,phlegm,blood_in_sputum,throat_irritation,redness_of_eyes,sinus_pressure,runny_nose,congestion,loss_of_smell,fast_heart_rate,rusty_sputum,pain_during_bowel_movements,pain_in_anal_region,bloody_stool,irritation_in_anus,cramps,bruising,swollen_legs,swollen_blood_vessels,prominent_veins_on_calf,weight_gain,cold_hands_and_feets,mood_swings,puffy_face_and_eyes,enlarged_thyroid,brittle_nails,swollen_extremeties,abnormal_menstruation,muscle_weakness,anxiety,slurred_speech,palpitations,drying_and_tingling_lips,knee_pain,hip_joint_pain,swelling_joints,painful_walking,movement_stiffness,spinning_movements,unsteadiness,pus_filled_pimples,blackheads,scurring,bladder_discomfort,foul_smell_of urine,continuous_feel_of_urine,skin_peeling,silver_like_dusting,small_dents_in_nails,inflammatory_nails,blister,red_sore_around_nose,yellow_crust_ooze,Disease
0,0,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Dermatologist
1,1,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Dermatologist
2,2,1,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Dermatologist
3,3,1,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Dermatologist
4,4,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Dermatologist


### Checking `shape` of dataset

In [28]:
specialist_df.shape

(4920, 133)

### Checking `data types` of the dataset

In [29]:
specialist_df.dtypes

Unnamed: 0                int64
itching                   int64
 skin_rash                int64
 nodal_skin_eruptions     int64
 dischromic _patches      int64
                          ...  
 inflammatory_nails       int64
 blister                  int64
 red_sore_around_nose     int64
 yellow_crust_ooze        int64
Disease                  object
Length: 133, dtype: object
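The `dtypes` listing above shows an `Unnamed: 0` column (most likely an index saved into the file) and column labels that start with a space, such as ` skin_rash`. A minimal sketch of one way to tidy both is shown here; `raw_df` is a hypothetical frame that copies a few of those labels, and dropping the `Unnamed` column is an assumption about what is wanted.

```python
import pandas as pd

# Hypothetical frame copying a few column labels from the listing above.
raw_df = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "itching": [1, 0],
    " skin_rash": [1, 1],
    " dischromic _patches": [1, 1],
    "Disease": ["Dermatologist", "Dermatologist"],
})

# Drop saved-index columns (assumption: they carry no information).
tidy_df = raw_df.loc[:, ~raw_df.columns.str.startswith("Unnamed")].copy()
# Remove leading and trailing whitespace from the remaining labels.
tidy_df.columns = tidy_df.columns.str.strip()

print(list(tidy_df.columns))
```

Only the outer whitespace is removed here; inner spaces such as the one in `dischromic _patches` are left untouched so the labels still match the values in `Original_Dataset.csv`.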

### Getting an `overview` of the dataset

In [30]:
specialist_df.describe(include="all").T

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
Unnamed: 0,4920.0,,,,2459.5,1420.425992,0.0,1229.75,2459.5,3689.25,4919.0
itching,4920.0,,,,0.137805,0.34473,0.0,0.0,0.0,0.0,1.0
skin_rash,4920.0,,,,0.159756,0.366417,0.0,0.0,0.0,0.0,1.0
nodal_skin_eruptions,4920.0,,,,0.021951,0.146539,0.0,0.0,0.0,0.0,1.0
dischromic _patches,4920.0,,,,0.021951,0.146539,0.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...
inflammatory_nails,4920.0,,,,0.023171,0.150461,0.0,0.0,0.0,0.0,1.0
blister,4920.0,,,,0.023171,0.150461,0.0,0.0,0.0,0.0,1.0
red_sore_around_nose,4920.0,,,,0.023171,0.150461,0.0,0.0,0.0,0.0,1.0
yellow_crust_ooze,4920.0,,,,0.023171,0.150461,0.0,0.0,0.0,0.0,1.0


### Checking `null values` of the dataset

In [31]:
specialist_df.isnull().sum()

Unnamed: 0               0
itching                  0
 skin_rash               0
 nodal_skin_eruptions    0
 dischromic _patches     0
                        ..
 inflammatory_nails      0
 blister                 0
 red_sore_around_nose    0
 yellow_crust_ooze       0
Disease                  0
Length: 133, dtype: int64

## Conclusion:

- **Original_Dataset.csv**: It contains **18 columns**: a `Disease` column and **17 symptom** columns (`Symptom_1` to `Symptom_17`). Every record lists at least 3 symptoms, since `Symptom_1` to `Symptom_3` have no null values. The data is **categorical** and is used to classify different types of diseases.
- **Disease_Description.csv**: It contains descriptions of **41 distinct diseases**, which will be used when preparing the final cleaned data.
- **Doctor_Specialist.csv**: It contains **19 distinct** doctor specializations to which a patient can be referred, depending on the disease.
- **Doctor_Versus_Disease.csv**: It maps each of the 41 diseases to the **doctor it should be referred to**; 18 distinct specialists appear in this mapping.
- **Symptom_Weights.csv**: It contains the **weightage** of all 131 symptoms, **ranked in serial order** from 1 to 131.
- **Specialist.xlsx**: It consists of **131 binary (0/1) symptom fields** plus a `Disease` column, which holds the doctor specialist (e.g. `Dermatologist`) in the rows shown above. It is the encoded form of **Original_Dataset.csv**.
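To illustrate what "encoded form" means in the last point, the sketch below turns wide `Symptom_n` columns into one 0/1 column per symptom using `melt` and `pd.crosstab`. This is one possible way to build such an encoding, not necessarily how **Specialist.xlsx** was produced; `wide_df` is a hypothetical two-row frame.

```python
import pandas as pd

# Hypothetical two-row frame in the layout of Original_Dataset.csv.
wide_df = pd.DataFrame({
    "Disease": ["Fungal infection", "Fungal infection"],
    "Symptom_1": ["itching", "skin_rash"],
    "Symptom_2": ["skin_rash", "nodal_skin_eruptions"],
    "Symptom_3": ["nodal_skin_eruptions", None],
})

# Long format: one row per (record, symptom); missing symptoms are dropped.
long_df = (
    wide_df.reset_index()
    .melt(id_vars=["index", "Disease"], value_name="symptom")
    .dropna(subset=["symptom"])
)

# One column per symptom, 1 if the record lists it and 0 otherwise.
encoded = pd.crosstab(long_df["index"], long_df["symptom"]).clip(upper=1)
print(encoded)
```

The label column (the disease or its specialist) can then be joined back onto `encoded` by the record index.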