This dataset is originally from the National Institute of Diabetes and Digestive and Kidney
Diseases. Source: https://www.kaggle.com/datasets/akshaydattatraykhare/diabetes-dataset

### Importing the required packages

In [1]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, confusion_matrix
import os
import joblib

# 1. Data collection

In [2]:
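# Build the path with os.path.join so it works on any OS and avoids the invalid '\d' escape
df = pd.read_csv(os.path.join('dataset', 'diabetes.csv'))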
df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


### Check the data types of features

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


In [5]:
print(f"Number of rows: {df.shape[0]}")
print(f"Number of columns: {df.shape[1]}")
print(f"Column names: {list(df.columns)}")

Number of rows: 768
Number of columns: 9
Column names: ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']


# 2. Data preparation
* ### 2.1 Remove duplicate records
* ### 2.2 Split the numerical and categorical columns
* ### 2.3 Count the values of each categorical column
* ### 2.4 Remove or replace outliers (numerical columns only)
* ### 2.5 Remove or replace null values

The dataset is already preprocessed, so the data preparation steps are skipped here.
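
Even when the preparation steps are skipped, a quick sanity check is cheap. The sketch below is only an illustration of steps 2.1 and 2.5: it uses a handful of values copied from the `df.head()` output plus one artificially duplicated row (an assumption added for the example), and it also counts zeros in measurement columns, since a zero there may stand in for a value that was not recorded.

```python
import pandas as pd

# A few values copied from the df.head() output, plus one duplicated row
# (the last row) added purely for illustration.
sample = pd.DataFrame(
    {
        "Glucose": [148, 85, 183, 89, 89],
        "SkinThickness": [35, 29, 0, 23, 23],
        "Insulin": [0, 0, 0, 94, 94],
        "BMI": [33.6, 26.6, 23.3, 28.1, 28.1],
        "Outcome": [1, 0, 1, 0, 0],
    }
)

# 2.1: rows that are exact copies of an earlier row
n_duplicates = int(sample.duplicated().sum())

# 2.5: missing cells across the whole frame
n_nulls = int(sample.isnull().sum().sum())

# Zeros in measurement columns may be placeholders rather than real readings
measurement_cols = ["Glucose", "SkinThickness", "Insulin", "BMI"]
zero_counts = (sample[measurement_cols] == 0).sum()

print(f"duplicates: {n_duplicates}, nulls: {n_nulls}")
print(zero_counts)
```

Running the same three checks on the full `df` would show whether skipping the preparation steps is justified.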

In [6]:
df.describe()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,120.894531,69.105469,20.536458,79.799479,31.992578,0.471876,33.240885,0.348958
std,3.369578,31.972618,19.355807,15.952218,115.244002,7.88416,0.331329,11.760232,0.476951
min,0.0,0.0,0.0,0.0,0.0,0.0,0.078,21.0,0.0
25%,1.0,99.0,62.0,0.0,0.0,27.3,0.24375,24.0,0.0
50%,3.0,117.0,72.0,23.0,30.5,32.0,0.3725,29.0,0.0
75%,6.0,140.25,80.0,32.0,127.25,36.6,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0


# 3. Feature engineering
* ### 3.1 Replace categorical data with numerical data
* ### 3.2 Split the data into train and test sets
* ### 3.3 Feature scaling

The dataset already contains only numerical values, so step 3.1 is skipped.

## Feature scaling
#### Standardization:
Standardization (also known as z-score normalization) transforms the data such that it has a mean of 0 and a standard deviation of 1. It involves subtracting the mean from each data point and then dividing by the standard deviation. The formula for standardization is:

$$z = \frac{x - \mu}{\sigma}$$

#### Normalization:
Normalization scales the data between 0 and 1. It's useful when the data has varying scales and you want to bring them all to a similar scale. The formula for normalization is:

$$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$$
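
As a small worked example of the two formulas, the sketch below applies them by hand with NumPy to the five `Glucose` values from the `df.head()` output. It uses the population standard deviation (NumPy's default), which is an assumption about which definition of the standard deviation is intended.

```python
import numpy as np

# Glucose values from the first five rows of the dataset
x = np.array([148.0, 85.0, 183.0, 89.0, 137.0])

# Standardization: subtract the mean, then divide by the standard deviation
z = (x - x.mean()) / x.std()

# Normalization: the minimum maps to 0 and the maximum maps to 1
x_norm = (x - x.min()) / (x.max() - x.min())

print("standardized:", z.round(3))
print("normalized:  ", x_norm.round(3))
```

After standardization the values have mean 0 and standard deviation 1; after normalization they all lie in the interval [0, 1].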

In [7]:
x, y = df.iloc[:, :-1], df.iloc[:, -1]

In [10]:
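# Split the data into train and test sets (default test_size=0.25).
# No random_state is set, so the split and all later scores change between runs.
x_train, x_test, y_train, y_test = train_test_split(x, y)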

In [11]:
x_train

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
125,1,88,30,42,99,55.0,0.496,26
607,1,92,62,25,41,19.5,0.482,25
590,11,111,84,40,0,46.8,0.925,45
279,2,108,62,10,278,25.3,0.881,22
24,11,143,94,33,146,36.6,0.254,51
...,...,...,...,...,...,...,...,...
713,0,134,58,20,291,26.4,0.352,21
718,1,108,60,46,178,35.5,0.415,24
105,1,126,56,29,152,28.7,0.801,21
66,0,109,88,30,0,32.5,0.855,38


### Let's create the pipeline
A pipeline refers to a sequence of data processing components arranged together in a specific order, where the output of one component becomes the input of the next. 

In [12]:
model_with_preprocess = Pipeline([
    ('scaler', MinMaxScaler()),
    ('classifier', LogisticRegression())
])
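
To make the "output of one component becomes the input of the next" idea concrete, here is a minimal sketch on synthetic data (the arrays and the `rng` seed are made up for illustration). It compares a fitted pipeline with the equivalent manual steps, where the scaler is fitted on the training data only and then reused on the test data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data; not the diabetes dataset
rng = np.random.default_rng(0)
X_train = rng.normal(size=(80, 3))
y_train = (X_train[:, 0] > 0).astype(int)
X_test = rng.normal(size=(20, 3))

# Pipeline: scaler and classifier chained together
pipe = Pipeline([
    ("scaler", MinMaxScaler()),
    ("classifier", LogisticRegression()),
])
pipe.fit(X_train, y_train)
pipe_pred = pipe.predict(X_test)

# Manual equivalent: fit the scaler on the training data only,
# then apply the same fitted scaler to the test data
scaler = MinMaxScaler().fit(X_train)
clf = LogisticRegression().fit(scaler.transform(X_train), y_train)
manual_pred = clf.predict(scaler.transform(X_test))

print("same predictions:", np.array_equal(pipe_pred, manual_pred))
```

Bundling the scaler into the pipeline means that, during cross-validation, the scaling statistics are recomputed on each training fold rather than on the full data.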


`GridSearchCV` (grid search with cross-validation) is a technique for tuning the hyperparameters of a machine learning model: it fits the model for every combination of values in a parameter grid and scores each combination with cross-validation. Hyperparameters are parameters that are not directly learned from the data but affect the learning process. Examples include the regularization parameter in linear models or the number of trees in a random forest.

In [14]:
# Define parameters for grid search
param_grid = {
    'classifier__C': [0.001, 0.01, 0.1, 1, 10, 100],  # Inverse of regularization strength (smaller = stronger regularization)
    'classifier__solver': ['liblinear', 'newton-cg', 'lbfgs', 'sag', 'saga']
}

# Grid search with cross-validation
grid_search = GridSearchCV(model_with_preprocess, param_grid, cv=5, scoring='accuracy')
grid_search.fit(x_train, y_train)

In [15]:
# Best parameters and best score
print("Best Parameters:", grid_search.best_params_)
print("Best Score:", grid_search.best_score_)

Best Parameters: {'classifier__C': 10, 'classifier__solver': 'newton-cg'}
Best Score: 0.7725187406296852
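
Beyond the single best combination, a fitted `GridSearchCV` keeps the cross-validated score of every combination in `cv_results_`. The sketch below shows one way to inspect it; it runs on synthetic data from `make_classification` with a reduced grid and `max_iter=1000`, all of which are assumptions for illustration rather than the settings used above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data; not the diabetes dataset
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

pipe = Pipeline([
    ("scaler", MinMaxScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])
param_grid = {"classifier__C": [0.1, 1, 10]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

# One row per parameter combination, best-ranked first
results = pd.DataFrame(search.cv_results_)[
    ["param_classifier__C", "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(results)
```

Comparing `mean_test_score` against `std_test_score` helps judge whether the winning combination is meaningfully better than the runners-up or within fold-to-fold noise.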


### Let's get the best model with the best parameters

In [20]:
best_pipeline = grid_search.best_estimator_
best_pipeline

### Check the test accuracy

In [23]:
prd_val = best_pipeline.predict(x_test)

In [26]:
print(f"test accuracy score: {accuracy_score(y_test, prd_val)}")
print(f"confusion matrix: \n{confusion_matrix(y_test, prd_val)}")

test accuracy score: 0.7447916666666666
confusion matrix: 
[[107  27]
 [ 22  36]]
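
The confusion matrix carries more information than the accuracy alone. As a worked example, the sketch below recomputes accuracy from the matrix reported above and derives precision and recall for the positive class, relying on scikit-learn's `[[TN, FP], [FN, TP]]` layout for labels 0 and 1.

```python
import numpy as np

# Confusion matrix reported above: rows are true labels, columns are predictions
cm = np.array([[107, 27],
               [22, 36]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all predictions that are correct
precision = tp / (tp + fp)        # share of predicted positives that are correct
recall = tp / (tp + fn)           # share of actual positives that are found

print(f"accuracy:  {accuracy:.4f}")   # 0.7448
print(f"precision: {precision:.4f}")  # 0.5714
print(f"recall:    {recall:.4f}")     # 0.6207
```

The accuracy matches the reported test score, while the recall shows that 22 of the 58 positive cases in this test split were missed.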


### Save the best model

In [28]:
os.makedirs('model', exist_ok=True)
joblib.dump(best_pipeline, os.path.join('model', 'model_with_preprocess.pkl'))

['model\\model_with_preprocess.pkl']
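
A saved pipeline can be restored later with `joblib.load` and used directly for prediction, since the scaler is stored together with the classifier. The following sketch demonstrates the round trip on a small synthetic pipeline written to a temporary directory; the data and the temporary path are assumptions for illustration, not the project's actual model.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data with 8 features, like the diabetes dataset
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 8))
y = (X[:, 1] > 0).astype(int)

pipeline = Pipeline([
    ("scaler", MinMaxScaler()),
    ("classifier", LogisticRegression()),
]).fit(X, y)

# Save and reload inside a temporary directory to keep the example self-contained
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model_with_preprocess.pkl")
    joblib.dump(pipeline, model_path)
    loaded = joblib.load(model_path)

# The reloaded pipeline should reproduce the original predictions
same_predictions = np.array_equal(pipeline.predict(X), loaded.predict(X))
print("same predictions:", same_predictions)
```

Because joblib files are pickle-based, they should only be loaded from trusted sources, and ideally with the same scikit-learn version that wrote them.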

<h1 align="center">
    MEDICATION REMAINDER WITH PREDICTION FEATURES
</h1>

<h3 align="center">
Streamline medicine management, add users, seamlessly check for obesity, pneumonia, diabetes and so on, and assess and improve your health. <br>
Alert your parents/guardian about missed medicine.
</h3>

# About

The `MEDICATION REMAINDER WITH PREDICTION FEATURES` is a web-based application built using MySQL, Express.js, React.js, Node.js and Python. It aims to alert users to take their medicine at the correct time and to check for conditions such as pneumonia, diabetes and obesity level using its prediction features.


***1. Commands to run the front-end (React) server***
- Go to the folder `medicaton-remainder/front-end`
- Install all the required React packages: `npm install`
- Start the React server: `npm start`

***2. Commands to run the back-end (Node) server***
- Go to the folder `medicaton-remainder/back-end`
- Install all the required Node packages: `npm install`
- Start the Node server: `npm run dev`

***3. Commands to run the features-ML (Python) server***
- Go to the folder `medicaton-remainder/features-ML`
- Using a virtual environment is recommended. If one is already available, just activate it
- If the dataset is not available, download it using the link given at the top of each Jupyter notebook
- Some models are not in the git repo because of their `large size`, so run the corresponding .ipynb notebook for any model that is missing
- Install all the required Python packages: `pip install -r requirements.txt`
- Start the features-ML server: `npm run dev`

# Software Requirements
