## The Goal of the Dataset
- **Dataset URL:** [Disease Symptom Description Dataset on Kaggle](https://www.kaggle.com/datasets/itachi9604/disease-symptom-description-dataset)

### General Information
This dataset is intended to help researchers and students build healthcare-related systems. It covers a range of illnesses and provides, for each one, a description of the condition, its symptoms, and precautionary measures. Because it is distributed in CSV format, it can be loaded and cleaned with standard data-handling tools in most programming languages.

### Attributes (Columns):
1. **Disease:** (Categorical) - The name of the disease.
2. **Symptoms:** (Text - list of symptoms) - Symptoms commonly associated with the disease.
3. **Description:** (Text) - A brief medical summary of the disease.
4. **Precautionary Steps:** (Text - multiple columns: Precaution_1, Precaution_2, Precaution_3, Precaution_4) - Suggested measures to prevent the condition from worsening.

### Note
No specific therapy recommendations are included in the dataset. The preventative measures offer broad recommendations for handling medical issues.



### Summary of the dataset

### Dataset Overview: Disease Symptom Prediction

The `datasetDiseaseSymptomPrediction.csv` file is the basis for predicting diseases from the symptoms reported by patients. Each row pairs one disease label with the symptoms recorded for that instance, which gives a structured view of the relationship between symptoms and potential diseases.

### Key Aspects:
- **Disease Identification**: Each row contains the name of a specific disease, allowing for straightforward classification.
- **Symptom Representation**: The dataset has 17 symptom columns (`Symptom_1` to `Symptom_17`). Each row fills in only the symptoms recorded for that instance, so a row holds up to 17 symptoms.

### Structure:
- **Rows**: Each entry corresponds to a unique instance, detailing one disease and its related symptoms.
- **Columns**:
  - **1 Column**: Disease name.
  - **17 Columns**: Symptoms, which may include conditions like itching, vomiting, and fatigue. Some symptom columns may be empty if a symptom does not apply.
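
The wide layout described above can be inspected with a few lines of pandas. The following is a minimal sketch on a hand-made toy table (two rows copied from the sample data, with only six of the 17 symptom columns); the names `toy`, `symptoms_per_row`, and `missing_share` are illustrative and not part of the project code.

```python
import pandas as pd

# Toy rows in the same wide layout as the raw file (only 6 symptom columns shown).
toy = pd.DataFrame(
    {
        "Disease": ["Fungal infection", "GERD"],
        "Symptom_1": ["itching", "stomach_pain"],
        "Symptom_2": ["skin_rash", "acidity"],
        "Symptom_3": ["nodal_skin_eruptions", "ulcers_on_tongue"],
        "Symptom_4": ["dischromic_patches", "vomiting"],
        "Symptom_5": [None, "cough"],
        "Symptom_6": [None, "chest_pain"],
    }
)

symptom_cols = [c for c in toy.columns if c.startswith("Symptom_")]

# Number of symptoms actually recorded in each row (empty cells are not counted).
symptoms_per_row = toy[symptom_cols].notna().sum(axis=1)

# Share of empty cells in each symptom column.
missing_share = toy[symptom_cols].isna().mean()

print(symptoms_per_row.tolist())  # [4, 6]
print(missing_share.to_dict())
```

Applied to the full file, the same two expressions would show how many rows use the later symptom columns at all.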

### Purpose:
The primary aim of this dataset is to support the creation of predictive models that assist healthcare professionals in diagnosing diseases by analyzing reported symptoms. This can improve diagnostic accuracy and patient care by leveraging data-driven insights.

### Sample table before cleaning the dataset:

| Disease               | Symptom_1            | Symptom_2              | Symptom_3          | Symptom_4           | Symptom_5          | Symptom_6          | Symptom_7          | Symptom_8          | Symptom_9          | Symptom_10         | Symptom_11         | Symptom_12         | Symptom_13         | Symptom_14         | Symptom_15         | Symptom_16         | Symptom_17         |
|----------------------|----------------------|------------------------|--------------------|---------------------|--------------------|--------------------|--------------------|--------------------|--------------------|---------------------|---------------------|---------------------|---------------------|---------------------|---------------------|---------------------|---------------------|
| Fungal infection      | itching              | skin_rash              | nodal_skin_eruptions| dischromic_patches   |                     |                    |                    |                    |                    |                     |                     |                     |                     |                     |                     |                     |                     |
| Allergy               | continuous_sneezing  | shivering              | chills             | watering_from_eyes   |                     |                    |                    |                    |                    |                     |                     |                     |                     |                     |                     |                     |                     |
| GERD                 | stomach_pain         | acidity                | ulcers_on_tongue   | vomiting             | cough              | chest_pain         |                    |                    |                    |                     |                     |                     |                     |                     |                     |                     |                     |
| Diabetes              | fatigue              | weight_loss            | restlessness       | lethargy            | irregular_sugar_level| blurred_and_distorted_vision | obesity         | excessive_hunger    | increased_appetite   | polyuria            |                     |                     |                     |                     |                     |                     |                     |
| Jaundice             | itching              | vomiting               | fatigue            | weight_loss          | high_fever         | yellowish_skin     | dark_urine         | abdominal_pain      |                     |                     |                     |                     |                     |                     |                     |                     |                     |


### Key Observations:

1. **Disease Identification**: Each row clearly identifies a specific disease.
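2. **Symptom Variety**: Each row lists up to 17 symptoms, and the number varies by disease (four for Fungal infection and ten for Diabetes in the sample above), indicating diverse clinical presentations.
3. **Missing Data**: Some symptom columns are empty because a row lists fewer than 17 symptoms; these cells mark the absence of a further symptom, not an unknown value.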
4. **Predictive Potential**: The dataset is structured for machine learning, aiding in disease prediction based on symptoms.
5. **Common Symptoms**: Certain symptoms, such as **itching**, appear across multiple diseases (e.g., **Fungal infection** and **Jaundice**), indicating overlapping clinical presentations.
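
The overlap noted in the last observation can be checked by reshaping the wide table into one row per (disease, symptom) pair. This is a sketch on toy data taken from the sample table (Jaundice and GERD are truncated to four symptoms), not the project's own analysis; `toy`, `long_form`, and `shared` are illustrative names.

```python
import pandas as pd

# Three diseases from the sample table, truncated to four symptom columns.
toy = pd.DataFrame(
    {
        "Disease": ["Fungal infection", "Jaundice", "GERD"],
        "Symptom_1": ["itching", "itching", "stomach_pain"],
        "Symptom_2": ["skin_rash", "vomiting", "acidity"],
        "Symptom_3": ["nodal_skin_eruptions", "fatigue", "ulcers_on_tongue"],
        "Symptom_4": ["dischromic_patches", "weight_loss", "vomiting"],
    }
)

# Wide -> long: one row per (disease, symptom) pair, dropping empty cells.
long_form = toy.melt(id_vars=["Disease"], value_name="Symptom").dropna(subset=["Symptom"])

# Count how many distinct diseases each symptom occurs in.
diseases_per_symptom = long_form.groupby("Symptom")["Disease"].nunique()

# Keep only symptoms shared by more than one disease.
shared = diseases_per_symptom[diseases_per_symptom > 1].sort_index()
print(shared.to_dict())  # {'itching': 2, 'vomiting': 2}
```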






In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Load the raw and cleaned datasets.
# NOTE: these absolute paths are specific to one machine; adjust them to your own checkout.
raw_data = pd.read_csv(r'C:\Users\Asus\SWE485ProjectGroup2-main\Dataset\datasetDiseaseSymptomPrediction.csv')  # Original dataset
cleaned_data = pd.read_csv(r'C:\Users\Asus\SWE485ProjectGroup2-main\Code\cleaned_dataset.csv')  # Cleaned dataset

# Summary statistics before cleaning
raw_summary = {
    'Dataset': 'Raw Data',
    'Rows': raw_data.shape[0],
    'Columns': raw_data.shape[1],
    'Missing Values': raw_data.isnull().sum().sum()
}

# Summary statistics after cleaning
cleaned_summary = {
    'Dataset': 'Cleaned Data',
    'Rows': cleaned_data.shape[0],
    'Columns': cleaned_data.shape[1],
    'Missing Values': cleaned_data.isnull().sum().sum()
}

# Create a DataFrame for the summaries
summary_df = pd.DataFrame([raw_summary, cleaned_summary])

# Plotting the summary as a table
plt.figure(figsize=(8, 4))
plt.axis('tight')
plt.axis('off')
plt.table(cellText=summary_df.values, colLabels=summary_df.columns, cellLoc='center', loc='center')
plt.title('Summary of Raw vs Cleaned Data')
plt.show()

# Plotting the comparison as a bar chart
summary_df.set_index('Dataset').plot(kind='bar', figsize=(10, 6), alpha=0.7)
plt.title('Comparison of Raw and Cleaned Data')
plt.ylabel('Count')
plt.xlabel('Dataset')
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()

### Explanation of the Code and Recurring Symptoms

This code analyzes and compares a raw dataset with a cleaned version, focusing on key summary statistics.

1. **Summary Statistics**:
   - **Raw Data**: Displays the number of rows, columns, and missing values.
   - **Cleaned Data**: Shows the same metrics after cleaning, highlighting improvements in data quality.

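2. **Recurring Symptoms**:
   - The summary statistics above do not measure recurring symptoms directly; they only count rows, columns, and missing values.
   - In the **raw dataset**, the same symptom can sit in different `Symptom_N` columns from row to row (for example, vomiting is `Symptom_4` for GERD but `Symptom_2` for Jaundice), and recurring symptoms may appear with duplications and inconsistencies.
   - After cleaning, the **cleaned dataset** gives each symptom a single column, which presents a clearer view of these symptoms and allows more accurate analysis of the patterns and relationships relevant to diagnosis.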
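
One way to look at recurring symptoms in a one-hot table is to sum each symptom column. The sketch below uses a small hand-made frame in place of `cleaned_dataset.csv`; the values are invented for illustration and the variable names are hypothetical.

```python
import pandas as pd

# Toy stand-in for the cleaned (one-hot) dataset; values are illustrative only.
encoded = pd.DataFrame(
    {
        "Disease": ["Fungal infection", "Fungal infection", "Jaundice", "GERD"],
        "itching": [1, 1, 1, 0],
        "vomiting": [0, 0, 1, 1],
        "skin_rash": [1, 1, 0, 0],
    }
)

# How often each symptom occurs across all rows.
symptom_counts = encoded.drop(columns="Disease").sum().sort_values(ascending=False)

# Rows that are exact copies of an earlier row (same disease, same symptoms).
n_duplicate_rows = int(encoded.duplicated().sum())

print(symptom_counts.to_dict())
print(n_duplicate_rows)  # 1
```

Whether exact duplicate rows should be dropped depends on the modelling goal, so the sketch only counts them.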

### Text Summary

This analysis compares the raw and cleaned datasets using summary statistics (rows, columns, and missing values) to show how cleaning changes data quality. Giving each symptom a single column in the cleaned dataset reduces duplication and inconsistency in how recurring symptoms are recorded, which supports the pattern recognition needed for healthcare decisions.

### Preprocessing techniques:

We used One-Hot Encoding to convert the symptoms into numerical values (0 and 1), because most machine learning models require numeric input and cannot work with raw text. In this method, each distinct symptom becomes its own column: a row gets 1 if the symptom is present and 0 if it is not. Because a single row can contain several symptoms, more than one column can be 1 at a time.
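
To make the idea concrete, here is a plain-Python sketch of the encoding on two short symptom lists. The helper `multi_hot` is hypothetical and written only for illustration; the project itself relies on scikit-learn for this step.

```python
def multi_hot(symptom_lists):
    """Return (sorted vocabulary, 0/1 matrix) for a list of symptom lists."""
    # Every distinct symptom becomes one column, in alphabetical order.
    vocabulary = sorted({s for symptoms in symptom_lists for s in symptoms})
    matrix = []
    for symptoms in symptom_lists:
        present = set(symptoms)
        # 1 if the row has the symptom, 0 otherwise.
        matrix.append([int(s in present) for s in vocabulary])
    return vocabulary, matrix


vocabulary, matrix = multi_hot(
    [
        ["itching", "skin_rash"],
        ["itching", "vomiting", "fatigue"],
    ]
)

print(vocabulary)  # ['fatigue', 'itching', 'skin_rash', 'vomiting']
print(matrix)      # [[0, 1, 1, 0], [1, 1, 0, 1]]
```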

In [None]:
import pandas as pd

# pd.read_csv() loads the CSV file into a DataFrame so we can work with it in Python.
df = pd.read_csv("Dataset/datasetDiseaseSymptomPrediction.csv")

# The first column (Disease) is the target variable, so we exclude it when processing symptoms.
# The remaining columns hold the symptoms that will be One-Hot Encoded.
symptom_columns = df.columns[1:]

# Combine all symptom cells of each row into a single list, which is the input format needed for the encoding step later.
df["Symptoms"] = df[symptom_columns].values.tolist()

# Empty cells are read by pandas as NaN, and absent symptoms may also be stored as the string "None".
# Neither is a real symptom, so we drop both; otherwise NaN would later be encoded as a spurious "nan" column.
df["Symptoms"] = df["Symptoms"].apply(lambda x: list({s for s in x if pd.notna(s) and s != "None"}))

# MultiLabelBinarizer converts multi-label data (here, the list of symptoms in each row) into a
# One-Hot representation: 1 means the symptom is present and 0 means it is not.
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()

# Convert each symptom to a string and strip surrounding whitespace, so that values such as
# " itching" and "itching" (if both occur) are not encoded as two separate columns.
df["Symptoms"] = df["Symptoms"].apply(lambda x: [str(s).strip() for s in x])

# This is the core step where One-Hot Encoding is applied.
symptom_encoded = mlb.fit_transform(df["Symptoms"])

# The result of the encoding is a matrix of 0s and 1s. We convert it into a DataFrame whose column names are the symptoms themselves, which makes the data easier to interpret and analyze.
df_encoded = pd.DataFrame(symptom_encoded, columns=mlb.classes_)

# Combine the encoded symptoms with the target variable (Disease) so the final dataset contains both the disease label and the encoded symptoms.
df_cleaned = pd.concat([df[["Disease"]], df_encoded], axis=1)

# Show every column and avoid wrapping wide tables when printing.
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

# Display the first rows of the final processed dataset.
print(df_cleaned.head())

# Once the data is preprocessed, we save it in a new CSV file for future use.
df_cleaned.to_csv("cleaned_dataset.csv", index=False)

