# Understand all datasets

**Date**: 2025-11-15

**Author**: Raquel Marques

<br>

To keep this project clear and organized, the documentation uses the following legend:

Legend:
* <span style="color:green">Explanation</span>: Provides detailed reasoning or context for concepts and processes.
* <span style="color:purple">Tips</span>: Offers practical advice or best practices to improve efficiency or outcomes.
* <span style="color:red">Practice</span>: Highlights actionable steps or exercises to apply the concepts.
* <span style="color:blue">Business Context</span>: Connects the technical work to relevant business objectives or scenarios.

## <span style="color:green"> Libraries </span>

Libraries used in this code.

In [3]:
## LIBRARY
import os
import pandas as pd

## <span style="color:green"> Import Data </span>

In [4]:
## PATH & OTHERS
# Project Directory
project_dir = os.path.join(os.path.expanduser("~"), "OneDrive", "Project_Code", "Project-DiseaseSymptom-Kaggle")


## IMPORT DATA
df_dataset = pd.read_csv(os.path.join(project_dir, "data", "raw", "dataset.csv"))
df_sympDesc = pd.read_csv(os.path.join(project_dir, "data", "raw", "symptom_Description.csv"))
df_sympPrec = pd.read_csv(os.path.join(project_dir, "data", "raw", "symptom_precaution.csv"))
df_sympSev = pd.read_csv(os.path.join(project_dir, "data", "raw", "Symptom-severity.csv"))
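
The import step above can also be wrapped in a small helper that fails early when a file is missing. This is only a sketch: `RAW_FILES`, its dictionary keys, and `load_raw_tables` are hypothetical names introduced for illustration, and the file names are copied from the import cell.

```python
from pathlib import Path

import pandas as pd

# File names as they appear in the import cell; the keys are illustrative.
RAW_FILES = {
    "dataset": "dataset.csv",
    "symptom_description": "symptom_Description.csv",
    "symptom_precaution": "symptom_precaution.csv",
    "symptom_severity": "Symptom-severity.csv",
}


def load_raw_tables(raw_dir):
    """Read each expected CSV from raw_dir into a dict of DataFrames."""
    raw_dir = Path(raw_dir)
    # Collect every missing file first so the error lists all of them at once.
    missing = [name for name in RAW_FILES.values() if not (raw_dir / name).is_file()]
    if missing:
        raise FileNotFoundError(f"Missing files in {raw_dir}: {missing}")
    return {key: pd.read_csv(raw_dir / name) for key, name in RAW_FILES.items()}
```

Assuming the same folder layout as above, `load_raw_tables(os.path.join(project_dir, "data", "raw"))` would return the four tables keyed by name.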

## <span style="color:green"> Function </span>

Some functions used in the code.

## <span style="color:green"> Exploratory Data Analysis (EDA) 1 </span>

Let's understand our data:



### <span style="color:green"> Table: Dataset </span>


- dataset.csv
    - Contains 4,920 records of disease and symptom combinations.
    - Every record has at least three symptoms: Symptom_1 through Symptom_3 have no missing values.
    - The table allows up to 17 symptoms per record, but most records use fewer: non-null counts fall from 4,572 for Symptom_4 to 72 for Symptom_17. A missing value therefore means that the record lists fewer than 17 symptoms.

In [5]:
## EDA - Univariate

### Dataset
df_dataset.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4920 entries, 0 to 4919
Data columns (total 18 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Disease     4920 non-null   object
 1   Symptom_1   4920 non-null   object
 2   Symptom_2   4920 non-null   object
 3   Symptom_3   4920 non-null   object
 4   Symptom_4   4572 non-null   object
 5   Symptom_5   3714 non-null   object
 6   Symptom_6   2934 non-null   object
 7   Symptom_7   2268 non-null   object
 8   Symptom_8   1944 non-null   object
 9   Symptom_9   1692 non-null   object
 10  Symptom_10  1512 non-null   object
 11  Symptom_11  1194 non-null   object
 12  Symptom_12  744 non-null    object
 13  Symptom_13  504 non-null    object
 14  Symptom_14  306 non-null    object
 15  Symptom_15  240 non-null    object
 16  Symptom_16  192 non-null    object
 17  Symptom_17  72 non-null     object
dtypes: object(18)
memory usage: 692.0+ KB
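
The non-null counts above describe each column. To see how many symptoms each record carries, the same information can be counted row by row. A minimal sketch, assuming the symptom columns all start with `Symptom_`; `symptoms_per_record` is a hypothetical helper and the `toy` frame is made-up data used only to show the shape of the result.

```python
import pandas as pd


def symptoms_per_record(df):
    """Count the non-null Symptom_* columns in each row."""
    symptom_cols = [c for c in df.columns if c.startswith("Symptom_")]
    return df[symptom_cols].notna().sum(axis=1)


# Made-up rows for illustration, not taken from dataset.csv.
toy = pd.DataFrame({
    "Disease": ["A", "B", "C"],
    "Symptom_1": ["itching", "cough", "fever"],
    "Symptom_2": ["rash", "fever", "chills"],
    "Symptom_3": ["fatigue", None, "nausea"],
    "Symptom_4": [None, None, "headache"],
})

counts = symptoms_per_record(toy)
# Number of records for each symptom count, ordered by count.
print(counts.value_counts().sort_index())
```

On the real table, `symptoms_per_record(df_dataset).describe()` would give the minimum, median, and maximum number of symptoms per record.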


### <span style="color:green"> Table: Symptom Description </span>


- symptom_Description.csv
    - Contains 41 rows, each representing a unique disease and its corresponding description.
    - The table has complete data, with no missing values in either column.


In [6]:
### symptom_Description
df_sympDesc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41 entries, 0 to 40
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Disease      41 non-null     object
 1   Description  41 non-null     object
dtypes: object(2)
memory usage: 788.0+ bytes
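
`info()` confirms that there are no missing values, but it does not show whether the 41 diseases are distinct. A quick check could look like the sketch below; `key_summary` is a hypothetical helper, and stripping whitespace before comparing is an assumption about how the names should be matched.

```python
import pandas as pd


def key_summary(df, column):
    """Report row count, distinct values, and duplicates for a key column."""
    # Strip surrounding whitespace so "Acne" and "Acne " count as the same key.
    cleaned = df[column].astype("string").str.strip()
    return {
        "rows": len(df),
        "distinct": int(cleaned.nunique()),
        "duplicated_values": sorted(cleaned[cleaned.duplicated()].dropna().unique()),
    }


# Made-up rows for illustration, not taken from symptom_Description.csv.
toy_desc = pd.DataFrame({
    "Disease": ["Acne", "Acne ", "GERD"],
    "Description": ["text a", "text b", "text c"],
})
summary = key_summary(toy_desc, "Disease")
print(summary)
```

If `key_summary(df_sympDesc, "Disease")` reports as many distinct values as rows and an empty duplicate list, the "one row per disease" reading holds.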


### <span style="color:green"> Table: Symptom Precaution </span>

- symptom_precaution.csv
    - Each row represents a unique disease and its associated precautions.
    - The table structure allows for a maximum of four precautions per disease entry.
    - All 41 entries have at least two precautions. Precaution_3 and Precaution_4 each have one missing value, so at least one disease lists fewer than four precautions.

In [7]:
### symptom_precaution
df_sympPrec.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41 entries, 0 to 40
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Disease       41 non-null     object
 1   Precaution_1  41 non-null     object
 2   Precaution_2  41 non-null     object
 3   Precaution_3  40 non-null     object
 4   Precaution_4  40 non-null     object
dtypes: object(5)
memory usage: 1.7+ KB
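
The counts above show one missing value in each of Precaution_3 and Precaution_4, but not whether they belong to the same disease. The sketch below lists the affected rows; `rows_with_missing_precautions` is a hypothetical helper, it assumes the precaution columns start with `Precaution_`, and the `toy_prec` frame is made-up data.

```python
import pandas as pd


def rows_with_missing_precautions(df):
    """Return the diseases that have at least one empty Precaution_* column."""
    prec_cols = [c for c in df.columns if c.startswith("Precaution_")]
    mask = df[prec_cols].isna().any(axis=1)
    out = df.loc[mask, ["Disease"]].copy()
    # Number of empty precaution slots for each affected disease.
    out["n_missing"] = df.loc[mask, prec_cols].isna().sum(axis=1)
    return out


# Made-up rows for illustration, not taken from symptom_precaution.csv.
toy_prec = pd.DataFrame({
    "Disease": ["A", "B", "C"],
    "Precaution_1": ["rest", "hydrate", "see doctor"],
    "Precaution_2": ["hydrate", "rest", "rest"],
    "Precaution_3": ["see doctor", None, "hydrate"],
    "Precaution_4": ["follow up", None, "follow up"],
})
affected = rows_with_missing_precautions(toy_prec)
print(affected)
```

Running `rows_with_missing_precautions(df_sympPrec)` would return either one row with `n_missing` equal to 2 or two rows with `n_missing` equal to 1.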


### <span style="color:green"> Table: Symptom Severity </span>

- Symptom-severity.csv
    - Each row represents a unique symptom.
    - The weight column is a measure of a symptom's impact or importance, quantified for analysis.
    - According to the context provided with the dataset, the values in the weight column are assessed every 2 days.

In [8]:
### Symptom-severity
df_sympSev.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 133 entries, 0 to 132
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Symptom  133 non-null    object
 1   weight   133 non-null    int64 
dtypes: int64(1), object(1)
memory usage: 2.2+ KB
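
Since `weight` is an integer column, its distribution is easy to inspect. A minimal sketch, assuming the column is named `weight` as in the output above; `weight_distribution` is a hypothetical helper and the `toy_sev` frame is made-up data.

```python
import pandas as pd


def weight_distribution(df):
    """Count how many symptoms carry each weight value, ordered by weight."""
    return df["weight"].value_counts().sort_index()


# Made-up rows for illustration, not taken from Symptom-severity.csv.
toy_sev = pd.DataFrame({
    "Symptom": ["itching", "cough", "fever", "chest_pain"],
    "weight": [1, 3, 3, 5],
})
dist = weight_distribution(toy_sev)
print(dist)
```

Applied to the real table, `weight_distribution(df_sympSev)` shows the range of weights and how many of the 133 rows fall on each value.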
