## The goal of the dataset
- **Dataset URL:** [Disease Symptom Description Dataset on Kaggle](https://www.kaggle.com/datasets/itachi9604/disease-symptom-description-dataset)

### General Information
This dataset is intended to help researchers and students build healthcare-related systems. It covers a range of illnesses and, for each one, provides a description, the associated symptoms, and precautionary measures. Because it is provided in CSV format, it can be cleaned and processed with standard data-handling tools in any programming language.

### Attributes (Columns):
1. **Disease:** (Categorical) - The name of the disease.
2. **Symptoms:** (Text - list of symptoms) - Symptoms commonly associated with the disease.
3. **Description:** (Text) - A brief medical summary of the disease.
4. **Precautionary Steps:** (Text - multiple columns: Precaution_1, Precaution_2, Precaution_3, Precaution_4) - Suggested measures to prevent the condition from worsening.

### Note
The dataset does not include specific treatment recommendations. The precautionary steps are general guidance for managing a condition.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

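# Load the raw and cleaned datasets.
# Note: 'Code/cleaned_dataset.csv' is written by the preprocessing cell further down,
# so that cell must have been run at least once before this one.
raw_data = pd.read_csv('Dataset/datasetDiseaseSymptomPrediction.csv')
cleaned_data = pd.read_csv('Code/cleaned_dataset.csv')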

# Summary statistics before cleaning
raw_summary = {
    'Dataset': 'Raw Data',
    'Rows': raw_data.shape[0],
    'Columns': raw_data.shape[1],
    'Missing Values': raw_data.isnull().sum().sum()
}

# Summary statistics after cleaning
cleaned_summary = {
    'Dataset': 'Cleaned Data',
    'Rows': cleaned_data.shape[0],
    'Columns': cleaned_data.shape[1],
    'Missing Values': cleaned_data.isnull().sum().sum()
}

# Create a DataFrame for the summaries
summary_df = pd.DataFrame([raw_summary, cleaned_summary])

# Plotting the summary as a table
plt.figure(figsize=(8, 4))
plt.axis('tight')
plt.axis('off')
plt.table(cellText=summary_df.values, colLabels=summary_df.columns, cellLoc='center', loc='center')
plt.title('Summary of Raw vs Cleaned Data')
plt.show()

# Plotting the comparison as a bar chart
summary_df.set_index('Dataset')[['Rows', 'Columns', 'Missing Values']].plot(kind='bar', figsize=(10, 6), alpha=0.7)
plt.title('Comparison of Raw and Cleaned Data')
plt.ylabel('Count')
plt.xlabel('Dataset')
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()

### Preprocessing Techniques
Most machine learning models require numeric input, so we converted the symptom text into numbers with **One-Hot Encoding**. Each distinct symptom becomes its own column, set to `1` if the symptom is present in a row and `0` otherwise. Because a row can list several symptoms, more than one column can be `1` in the same row.
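
To make the encoding concrete, here is a minimal plain-Python sketch on three made-up rows. The symptom values and the helper name `multi_hot` are illustrative assumptions, not part of the project code. It builds a sorted list of distinct symptoms and one 0/1 column per symptom, which is the same layout `MultiLabelBinarizer` produces.

```python
# Toy rows: each row lists the symptoms recorded for one case (made-up values).
rows = [
    ["itching", "skin_rash"],
    ["cough", "high_fever"],
    ["skin_rash", "high_fever", "cough"],
]

def multi_hot(symptom_lists):
    """Return (columns, matrix) with one 0/1 column per distinct symptom."""
    # Sorted vocabulary of every symptom that appears in any row
    columns = sorted({s for symptoms in symptom_lists for s in symptoms})
    # 1 where the row contains the symptom, 0 otherwise
    matrix = [[1 if col in symptoms else 0 for col in columns]
              for symptoms in symptom_lists]
    return columns, matrix

columns, matrix = multi_hot(rows)
print(columns)   # ['cough', 'high_fever', 'itching', 'skin_rash']
for row in matrix:
    print(row)   # [0, 0, 1, 1] then [1, 1, 0, 0] then [1, 1, 0, 1]
```

The third row has three `1` values, which shows why this is a multi-label encoding rather than a strict one-hot vector per row.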

In [None]:
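import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Read the raw dataset; the first column is the target (Disease),
# and every remaining column holds one symptom slot
df = pd.read_csv('Dataset/datasetDiseaseSymptomPrediction.csv')
symptom_columns = df.columns[1:]

# Combine all symptom columns into one list per row
df['Symptoms'] = df[symptom_columns].values.tolist()

# Drop NaN and 'None' placeholders, and strip stray whitespace so that
# the same symptom is not encoded as two different columns
df['Symptoms'] = df['Symptoms'].apply(
    lambda x: [str(s).strip() for s in x if pd.notnull(s) and str(s).strip() != 'None']
)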

# Apply One-Hot Encoding
mlb = MultiLabelBinarizer()
symptom_encoded = mlb.fit_transform(df['Symptoms'])
df_encoded = pd.DataFrame(symptom_encoded, columns=mlb.classes_)

# Combine with the target variable (Disease)
df_cleaned = pd.concat([df[['Disease']], df_encoded], axis=1)

# Save the cleaned dataset
df_cleaned.to_csv('Code/cleaned_dataset.csv', index=False)