<hr>

# General Disease Diagnosis AI Notebook

<hr>

![imagelogo.jpg](attachment:67847aea-4ae0-4627-8970-edf98ebfc332.jpg)

<hr>

**Problem Statement:**
<br>
<p>AI-driven general disease diagnosis has the potential to revolutionize healthcare by enabling quicker, more accurate, and data-informed diagnostics. In this hackathon, participants are encouraged to create AI tools that assist in identifying a wide range of diseases, using data sources such as medical images, lab results, and patient records. By leveraging machine learning for predictive modeling and pattern recognition, these solutions aim to support healthcare providers in making precise diagnoses more efficiently. The outcome can help reduce diagnostic errors, optimize clinician workflows, and improve access to timely care, especially in resource-constrained settings.
</p>

<hr>

## About Project

<p>This is an AI-driven general disease diagnosis project in which we train an ML model that predicts a person's general disease from input values provided by the user.</p>
<hr>

## Disclaimer

<p>This general disease prediction ML model is intended for informational and educational purposes only and should not be considered a substitute for professional medical advice, diagnosis, or treatment. The predictions generated by this tool are based on statistical models and are not definitive diagnoses. Users should not rely on the results of this app for making medical decisions.
Always consult a qualified healthcare provider for medical advice, diagnosis, or treatment, especially if you have concerns about any health condition. The creators of this app are not liable for any medical decisions made based on the predictions or insights provided by this tool.</p>
<hr>

## About This Notebook
<p>This Jupyter Notebook covers importing, processing, cleaning, and visualizing the given data, and then training, testing, and evaluating an AI model for general disease prediction.</p>
<hr>

## Index (Steps Covered In This Notebook)

1. Importing Data
    * Importing `general_disease_diagnosis.csv` data from the `data` folder.
    * Tools/Library used Pandas <a href="https://pandas.pydata.org/docs/index.html#">Link</a>.
2. Evaluate Data
    * Evaluate `data-type` in data.
    * Evaluate quantity and quality of data.
    * Tools/Library used Pandas <a href="https://pandas.pydata.org/docs/index.html#">Link</a>.
3. Processing Data
    * Cleaning and Scaling data.
    * Checking for any `NULL` values
    * Visualizing Data
    * Tools/Library used Pandas <a href="https://pandas.pydata.org/docs/index.html#">Link</a>.
4. Feature Engineering
    * Selecting features from data.
    * Scaling data.
    * Tools/Library used Pandas <a href="https://pandas.pydata.org/docs/index.html#">Link</a>.
5. Train Test Data Splitting
    * Splitting processed `general_disease_diagnosis.csv` data into train and test datasets
    * Tools/Library used Scikit-Learn <a href="https://scikit-learn.org/stable/index.html">Link</a>
6. Model Selection and Training
    * Selecting models
    * Training and evaluating trained models
    * Hyperparameter Tuning
    * Tools/Library used Scikit-Learn <a href="https://scikit-learn.org/stable/index.html">Link</a>
7. Model Testing
    * Using test data for prediction
    * F1 Score 
    * Tools/Library used Scikit-Learn <a href="https://scikit-learn.org/stable/index.html">Link</a>
8. Saving Trained Model
    * Save trained model using joblib, pickle
    * Tools/Library used Joblib <a href="https://joblib.readthedocs.io/en/stable/">Link</a>, Pickle <a href="https://docs.python.org/3/library/pickle.html">Link</a>
    
<hr>

# 1. Loading Dataset
<p>Loading data from the `data` folder provided by the hackathon team. The data is provided as a `.csv` file, which stores tabular data as plain text. We use the pandas library to load and evaluate the dataset.</p>

### Loading Tools 

In [2]:
import pandas as pd

### Reading Dataset
<p>Creating a variable to store the pandas dataframe produced by loading the dataset.</p>

`health_df` - pandas dataframe

In [4]:
health_df = pd.read_csv("../data/general_disease_diagnosis.csv", low_memory=False)

In [5]:
health_df

| | Patient_Name | Age | Weight_kg | Height_cm | Blood_Pressure_mmHg | Disease |
|---|---|---|---|---|---|---|
| 0 | Ramesh Patel | 10 | 29 | 93 | 102 | Kidney Disease |
| 1 | Sunita Pandey | 12 | 21 | 103 | 152 | Hypertension |
| 2 | Santosh Kulkarni | 11 | 19 | 112 | 154 | Thyroid Disorder |
| 3 | Swati Verma | 32 | 80 | 152 | 95 | Tuberculosis |
| 4 | Sudha Pandey | 30 | 57 | 177 | 95 | Hypertension |
| ... | ... | ... | ... | ... | ... | ... |
| 995 | Priya Das | 80 | 73 | 159 | 110 | |
| 996 | Swati Kohli | 63 | 48 | 164 | 111 | |
| 997 | Abhinav Sharma | 20 | 74 | 158 | 104 | |
| 998 | Sudha Reddy | 9 | 23 | 95 | 126 | |
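
The empty `Disease` cells in the last rows above are worth a closer look: when pandas reads a CSV, empty fields are parsed as missing values (`NaN`). The following is a minimal sketch of that behaviour, assuming an in-memory sample instead of the real file. The names `sample_csv` and `sample_df` are hypothetical, and the rows are copied from the output above.

```python
import io

import pandas as pd

# Hypothetical in-memory sample; the rows are copied from the output above.
sample_csv = """Patient_Name,Age,Weight_kg,Height_cm,Blood_Pressure_mmHg,Disease
Ramesh Patel,10,29,93,102,Kidney Disease
Sunita Pandey,12,21,103,152,Hypertension
Priya Das,80,73,159,110,
Swati Kohli,63,48,164,111,
"""

# read_csv accepts any file-like object, so StringIO stands in for the file path.
sample_df = pd.read_csv(io.StringIO(sample_csv))

# Empty fields in the Disease column are parsed as NaN (missing values).
print(sample_df["Disease"].isna().tolist())
```

In this sample the last two rows are reported as missing, which matches the blank `Disease` cells shown in the notebook output.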


# 2. Evaluating Dataset

View the first 5 rows of the `health_df` pandas dataframe.

In [6]:
health_df.head(5)

| | Patient_Name | Age | Weight_kg | Height_cm | Blood_Pressure_mmHg | Disease |
|---|---|---|---|---|---|---|
| 0 | Ramesh Patel | 10 | 29 | 93 | 102 | Kidney Disease |
| 1 | Sunita Pandey | 12 | 21 | 103 | 152 | Hypertension |
| 2 | Santosh Kulkarni | 11 | 19 | 112 | 154 | Thyroid Disorder |
| 3 | Swati Verma | 32 | 80 | 152 | 95 | Tuberculosis |
| 4 | Sudha Pandey | 30 | 57 | 177 | 95 | Hypertension |


### Know About Data
Getting familiar with the given data.

Listing the columns in the `health_df` dataframe and counting them.

In [13]:
list(health_df.columns)

['Patient_Name',
 'Age',
 'Weight_kg',
 'Height_cm',
 'Blood_Pressure_mmHg',
 'Disease']

In [19]:
len(health_df.columns)

6

The `health_df` dataframe has 6 columns:
* `Patient_Name`
* `Age`
* `Weight_kg`
* `Height_cm`
* `Blood_Pressure_mmHg`
* `Disease`

Getting the number of rows in our `health_df` dataframe.

In [22]:
len(health_df)

1000

`health_df` has exactly 1000 rows of data.
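
The row and column counts can also be read in a single step from the dataframe's `shape` attribute. Here is a small sketch, assuming a two-row stand-in for `health_df` (the name `sample_df` is hypothetical; the rows are copied from the output above):

```python
import pandas as pd

# Hypothetical two-row stand-in for health_df; values copied from the output above.
sample_df = pd.DataFrame(
    {
        "Patient_Name": ["Ramesh Patel", "Sunita Pandey"],
        "Age": [10, 12],
        "Weight_kg": [29, 21],
        "Height_cm": [93, 103],
        "Blood_Pressure_mmHg": [102, 152],
        "Disease": ["Kidney Disease", "Hypertension"],
    }
)

# shape is a (rows, columns) tuple, equivalent to (len(df), len(df.columns)).
n_rows, n_cols = sample_df.shape
print(n_rows, n_cols)
```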

Getting the `info` and `describe` summaries of our `health_df` dataframe.

In [24]:
health_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Patient_Name         1000 non-null   object
 1   Age                  1000 non-null   int64 
 2   Weight_kg            1000 non-null   int64 
 3   Height_cm            1000 non-null   int64 
 4   Blood_Pressure_mmHg  1000 non-null   int64 
 5   Disease              750 non-null    object
dtypes: int64(4), object(2)
memory usage: 47.0+ KB


Using the `info` method in pandas we learn that:
* `Patient_Name` column has `1000` non-null values of `object` data type.
* `Age` column has `1000` non-null values of `int64` data type.
* `Weight_kg` column has `1000` non-null values of `int64` data type.
* `Height_cm` column has `1000` non-null values of `int64` data type.
* `Blood_Pressure_mmHg` column has `1000` non-null values of `int64` data type.
* `Disease` column has `750` non-null values of `object` data type. The remaining 250 rows have `NULL` values; these are the rows whose disease we will predict.
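
The missing-value counts reported by `info` can be checked directly, and the labeled and unlabeled rows can be separated with a boolean mask. The sketch below assumes a small stand-in dataframe. The names `sample_df`, `labeled_df`, and `unlabeled_df` are hypothetical and do not come from the notebook.

```python
import pandas as pd

# Hypothetical stand-in for health_df with two labeled and two unlabeled rows.
sample_df = pd.DataFrame(
    {
        "Patient_Name": ["Ramesh Patel", "Sunita Pandey", "Priya Das", "Swati Kohli"],
        "Age": [10, 12, 80, 63],
        "Disease": ["Kidney Disease", "Hypertension", None, None],
    }
)

# Count missing values per column.
missing_per_column = sample_df.isna().sum()

# Rows with a known Disease can be used for training;
# rows without one are the rows to predict.
labeled_df = sample_df[sample_df["Disease"].notna()]
unlabeled_df = sample_df[sample_df["Disease"].isna()]

print(missing_per_column["Disease"], len(labeled_df), len(unlabeled_df))
```

Applied to the real `health_df`, the same pattern should give 750 labeled rows and 250 unlabeled rows, in line with the `info` output.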

In [32]:
health_df.describe(exclude=[object])

| | Age | Weight_kg | Height_cm | Blood_Pressure_mmHg |
|---|---|---|---|---|
| count | 1000.0 | 1000.0 | 1000.0 | 1000.0 |
| mean | 44.992 | 60.686 | 154.076 | 124.785 |
| std | 25.781517 | 16.148194 | 18.384347 | 20.443403 |
| min | 1.0 | 15.0 | 90.0 | 90.0 |
| 25% | 22.0 | 51.0 | 148.0 | 107.0 |
| 50% | 44.0 | 63.0 | 158.0 | 125.0 |
| 75% | 67.0 | 73.0 | 167.0 | 142.0 |
| max | 90.0 | 85.0 | 180.0 | 160.0 |


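Here we describe the `int64` columns while excluding the `object` columns. For example, `Age` ranges from 1 to 90 with a mean of about 45, and `Blood_Pressure_mmHg` ranges from 90 to 160.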
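
An alternative way to get the same kind of summary is to select the numeric columns first and then call `describe`. This is a sketch of that alternative, not the approach used in this notebook. The names `sample_df` and `summary` are hypothetical, and the values are the first five rows shown earlier.

```python
import pandas as pd

# Hypothetical stand-in for health_df; values copied from the first five rows.
sample_df = pd.DataFrame(
    {
        "Patient_Name": ["Ramesh Patel", "Sunita Pandey", "Santosh Kulkarni",
                         "Swati Verma", "Sudha Pandey"],
        "Age": [10, 12, 11, 32, 30],
        "Weight_kg": [29, 21, 19, 80, 57],
    }
)

# Keep only numeric columns, then summarize them.
summary = sample_df.select_dtypes(include="number").describe()

# The "50%" row is the median of each column.
print(summary.loc[["mean", "50%"], "Age"])
```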

In [33]:
health_df.describe(include=[object])

| | Patient_Name | Disease |
|---|---|---|
| count | 1000 | 750 |
| unique | 693 | 12 |
| top | Rajesh Verma | Malaria |
| freq | 4 | 76 |


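Here we describe the `object` columns. `Patient_Name` has 693 unique values across 1000 rows, so some names repeat, and `Disease` has 12 unique classes, the most frequent being `Malaria` with 76 rows.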
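
`describe` reports only the single most frequent value, whereas `value_counts` gives the count for every class. A minimal sketch follows, assuming a short stand-in for the `Disease` column. The names `disease` and `class_counts` are hypothetical, and the labels are copied from the first five rows.

```python
import pandas as pd

# Hypothetical stand-in for health_df["Disease"]; labels copied from the first
# five rows, plus one missing value.
disease = pd.Series(
    ["Kidney Disease", "Hypertension", "Thyroid Disorder",
     "Tuberculosis", "Hypertension", None],
    name="Disease",
)

# value_counts ignores missing values by default (dropna=True)
# and sorts classes from most to least frequent.
class_counts = disease.value_counts()

# The top class and its frequency correspond to describe's "top" and "freq".
print(class_counts.idxmax(), class_counts.max())
```

On the real data, a strongly uneven class distribution would be one reason to prefer the F1 score over plain accuracy in the model testing step.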