# Neural Network Model for Traditional Chinese Medicine Prescription Training


## About this Notebook

This notebook is a template for training a neural network model for traditional Chinese medicine prescription.

It provides a convenient way to adjust model hyperparameters, and it saves the performance evaluation metrics of each medicine as well as the trained model in .pb format.

## Result Saving

The following performance metrics are saved in CSV format:

- f1-score
- precision
- recall
- True Positive counts (TP)
- False Positive counts (FP)
- True Negative counts (TN)
- False Negative counts (FN)
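
The three ratio metrics are derived from the counts. As a minimal sketch (the function name `prf_from_counts` is illustrative and not part of the notebook), the definitions used in section 7, including the fallback to 0 when a denominator is zero, can be written as:

```python
def prf_from_counts(tp, fp, fn):
    """Return (precision, recall, f1) from confusion counts.

    A ratio with a zero denominator is reported as 0, matching the
    convention used in section 7 of this notebook.
    """
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0
    return precision, recall, f1


# Counts for 麻黃 on the validation set, taken from the output in section 7.2
precision, recall, f1 = prf_from_counts(tp=5, fp=15, fn=19)
print(precision, recall, f1)
```

The printed values should agree with the 麻黃 row of the validation output in section 7.2.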

## Model Saving

This notebook also saves the trained model in .pb format using `tensorflow.keras.Model.save`.

## How to Use

### Package Requirements
To run this notebook, first install the packages listed in `requirements.txt`.

### Prepare Data

Prepare the training data in CSV format, following the layout of `simplified_data/simplified_data.csv`.

Set the path to the data in `file_name` in section **[0. Global Variables]**.

The default data path is `simplified_data/simplified_data.csv`.

An alternative source file, `data_grouped_symptom.csv`, contains only symptoms as features; features that are not symptoms, such as body status, are excluded.

Run the notebook by pressing "Run All".

### Adjusting Parameters

Parameters can be adjusted in section **[0. Global Variables]**.

To tune the dense layers, optimizer, activation function, and loss function of the model, edit section **[5. Build Model]**.



# 0. Global Variables

In [None]:
# Remove the entire column related to a specific medicine if its occurrence is below the defined threshold.
# In the given dataset, setting this threshold to 250 would retain columns for only the top 10 most frequently used medicines.
DeleteMedThreshold = 0

# Determine the number of medicines to be trained as output
# The total number of medicine in the default data set is 102
NumMedTrain = 102

# Decide whether to enhance accuracy by using class weights
# If set to True, inversely proportional class weights are applied to the loss function
UseClassWeight = True

# Learning rate used by the Adam optimizer
LearningRate = 0.005

# Model num, this string will be the folder name where the result of the model will be saved
ModelNum = "[16]_0_UseWeight"

# Path to source data for training (must be in csv format)
file_name = "./simplified_data/simplified_data.csv"
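
How `DeleteMedThreshold` is applied is decided inside `load_data`, whose source is not shown here. As a rough sketch of the idea only, assuming the label columns are 0/1 indicators in a pandas DataFrame (the function name `drop_rare_medicines` and the toy data are made up for illustration):

```python
import pandas as pd


def drop_rare_medicines(y, threshold):
    """Keep only label columns whose number of 1s is at least `threshold`."""
    occurrences = y.sum(axis=0)  # number of prescriptions containing each medicine
    keep = occurrences[occurrences >= threshold].index
    return y.loc[:, keep]


# Toy label matrix: rows are prescriptions, columns are medicines
toy_y = pd.DataFrame({
    "med_a": [1, 1, 1, 0],
    "med_b": [0, 1, 0, 0],
    "med_c": [1, 1, 0, 1],
})

filtered = drop_rare_medicines(toy_y, threshold=2)
print(list(filtered.columns))  # med_b occurs only once and is dropped
```

With a threshold of 0, as in the default setting above, no column is removed.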

# 1. Import module


In [98]:

# Importing necessary libraries
import numpy as np
import pandas as pd
import statistics
import csv
from tabulate import tabulate
import matplotlib.pyplot as plt

# Importing TensorFlow for deep learning
import tensorflow
from tensorflow import keras
from keras.models import Sequential
from keras.layers import Activation, Dense, Dropout
from keras.callbacks import EarlyStopping, LambdaCallback
from keras.optimizers import Adam
from keras.metrics import categorical_crossentropy

# Importing scikit-learn for data preprocessing and utilities
from sklearn.utils import shuffle
from sklearn.preprocessing import MinMaxScaler
from sklearn.utils.class_weight import compute_sample_weight

# Importing custom utility functions
from utility_file import my_utilities as myutil
from utility_file import load_data

# 2. Load Data
Load the data using the custom module `load_data`.

In [99]:
# Load data for training a model, including debugging information
(X_np, X_val_np, train_y, val_y, 
 num_col_x, num_1_valy, num_0_valy) = load_data.load_data_for_1_med_with_debug(del_med_thres=DeleteMedThreshold, 
                                                                               random_seed=1, 
                                                                               n=NumMedTrain,
                                                                               file_name=file_name)

# Ensure the correct data types for loaded variables
assert isinstance(X_np, np.ndarray)
assert isinstance(X_val_np, np.ndarray)
assert isinstance(train_y, pd.DataFrame)
assert isinstance(val_y, pd.DataFrame)
assert isinstance(num_col_x, int)


--------------------------------------------------------------------------------
ReadData:
Type of data: <class 'pandas.core.frame.DataFrame'>
Shape of data = (797 rows, 215 cols).
End of ReadData
--------------------------------------------------------------------------------
SplitXY:
Shape of X = (796 rows, 111 cols).
Shape of y = (796 rows, 102 cols).
End of SplitXY
--------------------------------------------------------------------------------
In load_data_for_1_med_with_debug of load_data.py, random_seed= 1
After SplitXY, total number of 0, 1 in y:
Number of 0s: 72318
Number of 1s: 8874
save med num done
Train_X.shape:  (637, 111)
Train_y.shape:  (637, 102)

Split Training Validation
Number of 0s in train_y: 57978
Number of 1s train_y: 6996
Number of 0s in val_y: 14340
Number of 1s val_y: 1878
--------------------------------------------------------------------------------


# 3. Data Type Checking

In [100]:
# Uncomment the line below to display the DataFrame content and structure
# myutil.print_df(val_y, "---- y ----")

# Uncomment the line below to print the DataFrame directly
# print(val_y)

# Checking:

# Counting NA values in y
na_count = val_y.isna().sum().sum()

# Counting str values in y
str_count = val_y[val_y.map(type) == str].count().sum()

# Counting int values in y
int_count = val_y[val_y.map(type) == int].count().sum()

# Counting float values in y
float_count = val_y[val_y.map(type) == float].count().sum()

# Display the debugging information
print(f"Number of NA values in y: {na_count}")
print(f"Number of str values in y: {str_count}")
print(f"Number of int values in y: {int_count}")
print(f"Number of float values in y: {float_count}")


Number of NA values in y: 0
Number of str values in y: 0
Number of int values in y: 16218
Number of float values in y: 0


# 4. Compute Class Weight

In [101]:
# Convert the 'train_y' DataFrame to a NumPy array
train_y_np = np.array(train_y)

# Determine the number of labels (columns) in the array
num_labels = train_y_np.shape[1]

# Initialize an empty dictionary to store class weights for each label
class_weight_dic = {}

# Iterate over each label column
for i in range(num_labels):
    # Count the occurrences of each class (0 and 1) in the current label column
    unique_values, counts = np.unique(train_y_np[:, i], return_counts=True)
    
    # Create a dictionary mapping class values to their frequencies
    value_frequency_dict = dict(zip(unique_values, counts))
    
    # Calculate the total number of occurrences for normalization
    total = value_frequency_dict.get(0, 0) + value_frequency_dict.get(1, 0)
    
    # Calculate class weights and store them in the dictionary
    class_weight_dic[i] = {0: (value_frequency_dict.get(1, 0) / total), 1: (value_frequency_dict.get(0, 0) / total)}

# Print the computed class weights for debugging
# print(class_weight_dic)

{0: {0: 0.1695447409733124, 1: 0.8304552590266876}, 1: {0: 0.42543171114599687, 1: 0.5745682888540031}, 2: {0: 0.03139717425431711, 1: 0.9686028257456829}, 3: {0: 0.058084772370486655, 1: 0.9419152276295133}, 4: {0: 0.28728414442700156, 1: 0.7127158555729984}, 5: {0: 0.01098901098901099, 1: 0.989010989010989}, 6: {0: 0.2857142857142857, 1: 0.7142857142857143}, 7: {0: 0.04709576138147567, 1: 0.9529042386185244}, 8: {0: 0.02040816326530612, 1: 0.9795918367346939}, 9: {0: 0.012558869701726845, 1: 0.9874411302982732}, 10: {0: 0.3218210361067504, 1: 0.6781789638932496}, 11: {0: 0.023547880690737835, 1: 0.9764521193092621}, 12: {0: 0.13186813186813187, 1: 0.8681318681318682}, 13: {0: 0.16326530612244897, 1: 0.8367346938775511}, 14: {0: 0.07535321821036106, 1: 0.9246467817896389}, 15: {0: 0.02040816326530612, 1: 0.9795918367346939}, 16: {0: 0.012558869701726845, 1: 0.9874411302982732}, 17: {0: 0.0141287284144427, 1: 0.9858712715855573}, 18: {0: 0.11145996860282574, 1: 0.8885400313971743}, 19:
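
For each medicine, the loop above sets the weight of class 1 to the fraction of negative samples and the weight of class 0 to the fraction of positive samples, so the rarer class receives the larger weight. The same values (up to floating-point rounding) can be computed without an explicit loop. The following is a minimal sketch on made-up data; the name `inverse_frequency_weights` is illustrative and not part of the notebook:

```python
import numpy as np


def inverse_frequency_weights(y):
    """Per-column class weights: {0: fraction of 1s, 1: fraction of 0s}.

    `y` is a 2-D array of 0/1 labels with shape (n_samples, n_labels).
    """
    pos_frac = y.mean(axis=0)  # fraction of 1s in each column
    return {i: {0: float(p), 1: float(1.0 - p)} for i, p in enumerate(pos_frac)}


# Toy labels: column 0 has one positive in four rows, column 1 has two
toy = np.array([[1, 0],
                [0, 1],
                [0, 1],
                [0, 0]])

weights = inverse_frequency_weights(toy)
print(weights[0])  # class 1 is rarer in column 0, so it gets the larger weight
```

For the first medicine in the output above, 108 of the 637 training rows are positive (TP + FN in section 7.1), which gives 108/637 ≈ 0.1695 for class 0 and ≈ 0.8305 for class 1.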

# 5. Build Model

Adjust the dense layers, optimizer, activation function, loss function, and metrics for the model here if required.

In [102]:
# Define a Sequential model
model = Sequential([
    Dense(units=16, input_shape=(num_col_x,), activation='sigmoid'), 
    Dense(units=2, activation='sigmoid')
])

# Display a summary of the model architecture
model.summary()

# Compile the model with specified optimizer, loss function, and metrics
model.compile(optimizer=Adam(learning_rate=LearningRate),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])


Model: "sequential_8"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_16 (Dense)            (None, 16)                1792      
                                                                 
 dense_17 (Dense)            (None, 2)                 34        
                                                                 
Total params: 1,826
Trainable params: 1,826
Non-trainable params: 0
_________________________________________________________________
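
The parameter counts in the summary follow from the layer sizes: a dense layer with `n_in` inputs and `units` outputs has `n_in * units` weights plus `units` biases. A quick check, using the 111 feature columns reported in section 2 (the helper name `dense_params` is illustrative):

```python
def dense_params(n_in, units):
    """Number of trainable parameters in a fully connected layer with bias."""
    return n_in * units + units


hidden = dense_params(n_in=111, units=16)  # input features -> 16 hidden units
output = dense_params(n_in=16, units=2)    # 16 hidden units -> 2 outputs
print(hidden, output, hidden + output)     # 1792 34 1826
```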


# 6. Train Model
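The loop below fits the `model` object built in section **[5. Build Model]** once for each medicine, using that medicine's label column as the target.

Every fit uses the same configuration. Because the same `model` object is reused, its weights are not re-initialized between medicines, so each fit starts from the weights left by the previous one.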

In [103]:

# Initialize dictionaries to store results and training history
result_df_dict = {}        # Dictionary of DataFrames of each medicine in training set 
accuracy_dict = {}         # Dictionary of accuracy for each medicine
prediction_train_dict = {}  # Dictionary of raw predictions for the training set
prediction_val_dict = {}    # Dictionary of raw predictions for the validation set

# Iterate over each medicine
for i in range(train_y.shape[1]):
    chosen_col = train_y.iloc[:, i].copy()
    
    # Ensure that the chosen column is a pandas Series
    assert(isinstance(chosen_col, pd.Series))
    assert(len(chosen_col) == len(train_y))
    
    print(f"Processing medicine {i + 1} of {train_y.shape[1]}: {chosen_col.name}")

    # Convert the chosen column to NumPy array
    chosen_y_np = chosen_col.values.astype('float64')

    # Copy the corresponding validation set column
    y_val_chosen_col = val_y.iloc[:, i].copy()
    
    # Ensure that the validation set column is a pandas Series
    assert(isinstance(y_val_chosen_col, pd.Series))
    assert(len(y_val_chosen_col) == len(val_y))

    # Early stopping callback
    early_stopping = EarlyStopping(monitor='loss', patience=30, restore_best_weights=True)

    # Fit the model for the current medicine; fit() returns a History object
    history = model.fit(
        x=X_np,
        y=chosen_y_np,
        class_weight=class_weight_dic[i] if UseClassWeight else None,
        epochs=2000,
        shuffle=True,
        verbose=0,
        callbacks=[early_stopping]
    )

    # Print the last epoch that ran (zero-based index)
    print(f"Training stopped at epoch {history.epoch[-1]}")

    # Predict against the training set for diagnosing overfitting or underfitting
    predictions_train_set = model.predict(X_np)

    # Save raw result numpy array of training set to the dictionary
    prediction_train_dict[chosen_col.name] = predictions_train_set

    # Make predictions for the validation set
    predictions_val_set = model.predict(X_val_np)

    # Save raw result numpy array of validation set to the dictionary
    prediction_val_dict[chosen_col.name] = predictions_val_set

    # Uncomment to plot model history

    # Plotting loss vs. epoch
    # plt.plot(history.history['loss'], label='Training Loss')
    # plt.title('Loss vs. Epoch')
    # plt.xlabel('Epoch')
    # plt.ylabel('Loss')
    # plt.legend()
    # plt.show()

print("Training done.")


Processing medicine 1 of 102: 麻黃


Training stopped at epoch 664
Processing medicine 2 of 102: 桂枝
Training stopped at epoch 544
Processing medicine 3 of 102: 荊芥
Training stopped at epoch 213
Processing medicine 4 of 102: 防風
Training stopped at epoch 144
Processing medicine 5 of 102: 細辛
Training stopped at epoch 525
Processing medicine 6 of 102: 白芷
Training stopped at epoch 216
Processing medicine 7 of 102: 生薑
Training stopped at epoch 470
Processing medicine 8 of 102: 辛夷
Training stopped at epoch 264
Processing medicine 9 of 102: 葛根
Training stopped at epoch 255
Processing medicine 10 of 102: 升麻
Training stopped at epoch 1999
Processing medicine 11 of 102: 柴胡
Training stopped at epoch 766
Processing medicine 12 of 102: 蟬蛻
Training stopped at epoch 530
Processing medicine 13 of 102: 石膏
Training stopped at epoch 717
Processing medicine 14 of 102: 知母
Training stopped at epoch 836
Processing medicine 15 of 102: 梔子
Training stopped at epoch 1141
Processing medicine 16 of 102: 天花粉
Training stopped at epoch 1999
Processing med
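
The epoch numbers in this log are zero-based, so `1999` means the run reached the 2000-epoch limit. To illustrate the stopping rule configured above, here is a simplified pure-Python sketch of patience-based early stopping on a list of loss values. It is an approximation for explanation only, not the Keras implementation, and the function name `early_stop_epoch` is made up:

```python
def early_stop_epoch(losses, patience):
    """Return (last_epoch_run, best_epoch), both zero-based.

    Training stops once the loss has failed to improve on its best value
    for `patience` consecutive epochs; `best_epoch` is the epoch with the
    lowest loss seen so far.
    """
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(losses) - 1, best_epoch


# Loss improves for three epochs, then plateaus
losses = [0.9, 0.7, 0.6, 0.61, 0.62, 0.60, 0.63]
print(early_stop_epoch(losses, patience=3))
```

In this toy run the loss stops improving after epoch 2, so training ends three epochs later. With `restore_best_weights=True`, Keras rolls the model back to the best epoch's weights when stopping is triggered.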

# 7. Handle result


7.1 Calculate the F1 score, precision, recall, TP, FP, TN, and FN on the training set and store the values in `TrainMedicineDictioanry`.

In [104]:
# Running totals of True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN) over all medicines for the training set
total_tp_train = 0
total_fp_train = 0
total_tn_train = 0
total_fn_train = 0

# Create a dictionary to store the metrics of each medicine
TrainMedicineDictioanry = {}

# Iterate through each medicine's raw prediction array
for key, arr in prediction_train_dict.items():
    
    # Create a DataFrame from the raw prediction array
    df_tmp = pd.DataFrame(arr, columns=["predicted as 0", "predicted as 1"])

    # Determine the predicted value based on probabilities
    df_tmp["predicted value"] = np.where(df_tmp["predicted as 0"] > df_tmp["predicted as 1"], 0, 1)
    
    # Get the column number of the current medicine in the training labels
    col_num = train_y.columns.get_loc(key)
    
    # Add ground truth values to the DataFrame
    df_tmp["ground truth"] = train_y.iloc[:, col_num].copy().values
    
    
    TP = ((df_tmp['ground truth'] == 1) & (df_tmp['predicted value'] == 1)).sum()
    FP = ((df_tmp['ground truth'] == 0) & (df_tmp['predicted value'] == 1)).sum()
    FN = ((df_tmp['ground truth'] == 1) & (df_tmp['predicted value'] == 0)).sum()
    TN = ((df_tmp['ground truth'] == 0) & (df_tmp['predicted value'] == 0)).sum()
    precision = TP / (TP + FP) if (TP + FP) > 0 else 0
    recall = TP / (TP + FN) if (TP + FN) > 0 else 0
    f1score = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0
    
    # Add the current medicine's TP, FP, FN, TN to the running totals
    total_tp_train += TP
    total_fp_train += FP
    total_fn_train += FN
    total_tn_train += TN
    
    TrainMedicineDictioanry[key] = {
        "TP" : TP,
        "FP" : FP,
        "FN" : FN,
        "TN" : TN,
        "precision" : precision,
        "recall" : recall,
        "f1-score" : f1score
    }

precision = total_tp_train / (total_tp_train + total_fp_train) if (total_tp_train + total_fp_train) > 0 else 0
recall = total_tp_train / (total_tp_train + total_fn_train) if (total_tp_train + total_fn_train) > 0 else 0

TrainMedicineDictioanry["overall"] = {
        "TP" : total_tp_train,
        "FP" : total_fp_train,
        "FN" : total_fn_train,
        "TN" : total_tn_train,
        "precision" : precision,
        "recall" : recall,
        "f1-score" : 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0
}

print("TrainMedicineDictioanry:")
for key in TrainMedicineDictioanry:
    print(key, TrainMedicineDictioanry[key])
# print("overall", TrainMedicineDictioanry["overall"])

TrainMedicineDictioanry:
麻黃 {'TP': 102, 'FP': 11, 'FN': 6, 'TN': 518, 'precision': 0.9026548672566371, 'recall': 0.9444444444444444, 'f1-score': 0.9230769230769231}
桂枝 {'TP': 262, 'FP': 6, 'FN': 9, 'TN': 360, 'precision': 0.9776119402985075, 'recall': 0.966789667896679, 'f1-score': 0.9721706864564007}
荊芥 {'TP': 20, 'FP': 35, 'FN': 0, 'TN': 582, 'precision': 0.36363636363636365, 'recall': 1.0, 'f1-score': 0.5333333333333333}
防風 {'TP': 37, 'FP': 35, 'FN': 0, 'TN': 565, 'precision': 0.5138888888888888, 'recall': 1.0, 'f1-score': 0.6788990825688073}
細辛 {'TP': 178, 'FP': 6, 'FN': 5, 'TN': 448, 'precision': 0.967391304347826, 'recall': 0.9726775956284153, 'f1-score': 0.9700272479564033}
白芷 {'TP': 7, 'FP': 31, 'FN': 0, 'TN': 599, 'precision': 0.18421052631578946, 'recall': 1.0, 'f1-score': 0.3111111111111111}
生薑 {'TP': 181, 'FP': 31, 'FN': 1, 'TN': 424, 'precision': 0.8537735849056604, 'recall': 0.9945054945054945, 'f1-score': 0.9187817258883249}
辛夷 {'TP': 30, 'FP': 34, 'FN': 0, 'TN': 573, 'p
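
The per-medicine counts can also be computed directly on NumPy arrays, without building an intermediate DataFrame. The sketch below mirrors the decision rule used above (predict 0 only when the first score is strictly greater, so ties go to class 1); the function name `confusion_counts` and the toy data are illustrative assumptions:

```python
import numpy as np


def confusion_counts(scores, truth):
    """Return (TP, FP, FN, TN) for two-column scores and 0/1 ground truth."""
    predicted = np.where(scores[:, 0] > scores[:, 1], 0, 1)
    tp = int(np.sum((truth == 1) & (predicted == 1)))
    fp = int(np.sum((truth == 0) & (predicted == 1)))
    fn = int(np.sum((truth == 1) & (predicted == 0)))
    tn = int(np.sum((truth == 0) & (predicted == 0)))
    return tp, fp, fn, tn


# Toy scores: columns are the model outputs for class 0 and class 1
scores = np.array([[0.9, 0.1],   # predicted 0
                   [0.2, 0.8],   # predicted 1
                   [0.5, 0.5],   # tie -> predicted 1
                   [0.3, 0.6]])  # predicted 1
truth = np.array([0, 1, 0, 1])

print(confusion_counts(scores, truth))
```

Note that `np.argmax(scores, axis=1)` would resolve the tie in the third row to class 0 instead, so the two rules are not interchangeable when scores can be equal.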

7.2 Calculate the F1 score, precision, recall, TP, FP, TN, and FN on the validation set and store the values in `ValMedicineDictioanry`.

In [105]:
# Running totals of TP, FP, TN, and FN over all medicines for the validation set
# (the *_train variable names are reused from section 7.1)
total_tp_train = 0
total_fp_train = 0
total_tn_train = 0
total_fn_train = 0

# Create a dictionary to store the metrics of each medicine
ValMedicineDictioanry = {}

# Iterate through each medicine's raw prediction array
for key, arr in prediction_val_dict.items():
    
    # Create a DataFrame from the raw prediction array
    df_tmp = pd.DataFrame(arr, columns=["predicted as 0", "predicted as 1"])

    # Determine the predicted value based on probabilities
    df_tmp["predicted value"] = np.where(df_tmp["predicted as 0"] > df_tmp["predicted as 1"], 0, 1)
    
    # Get the column number of the current medicine in the validation labels
    col_num = val_y.columns.get_loc(key)
    
    # Add ground truth values to the DataFrame
    df_tmp["ground truth"] = val_y.iloc[:, col_num].copy().values
    result_df_dict[key] = df_tmp
    
    TP = ((df_tmp['ground truth'] == 1) & (df_tmp['predicted value'] == 1)).sum()
    FP = ((df_tmp['ground truth'] == 0) & (df_tmp['predicted value'] == 1)).sum()
    FN = ((df_tmp['ground truth'] == 1) & (df_tmp['predicted value'] == 0)).sum()
    TN = ((df_tmp['ground truth'] == 0) & (df_tmp['predicted value'] == 0)).sum()
    precision = TP / (TP + FP) if (TP + FP) > 0 else 0
    recall = TP / (TP + FN) if (TP + FN) > 0 else 0
    f1score = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0
    
    # Add the current medicine's TP, FP, FN, TN to the running totals
    total_tp_train += TP
    total_fp_train += FP
    total_fn_train += FN
    total_tn_train += TN
    
    ValMedicineDictioanry[key] = {
        "TP" : TP,
        "FP" : FP,
        "FN" : FN,
        "TN" : TN,
        "precision" : precision,
        "recall" : recall,
        "f1-score" : f1score
    }

precision = total_tp_train / (total_tp_train + total_fp_train) if (total_tp_train + total_fp_train) > 0 else 0
recall = total_tp_train / (total_tp_train + total_fn_train) if (total_tp_train + total_fn_train) > 0 else 0

ValMedicineDictioanry["overall"] = {
        "TP" : total_tp_train,
        "FP" : total_fp_train,
        "FN" : total_fn_train,
        "TN" : total_tn_train,
        "precision" : precision,
        "recall" : recall,
        "f1-score" : 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0
}

print("ValMedicineDictioanry:")
for key in ValMedicineDictioanry:
    print(key, ValMedicineDictioanry[key])
# print("overall", ValMedicineDictioanry["overall"])

ValMedicineDictioanry:
麻黃 {'TP': 5, 'FP': 15, 'FN': 19, 'TN': 120, 'precision': 0.25, 'recall': 0.20833333333333334, 'f1-score': 0.22727272727272727}
桂枝 {'TP': 28, 'FP': 31, 'FN': 38, 'TN': 62, 'precision': 0.4745762711864407, 'recall': 0.42424242424242425, 'f1-score': 0.448}
荊芥 {'TP': 0, 'FP': 8, 'FN': 0, 'TN': 151, 'precision': 0.0, 'recall': 0, 'f1-score': 0}
防風 {'TP': 1, 'FP': 11, 'FN': 4, 'TN': 143, 'precision': 0.08333333333333333, 'recall': 0.2, 'f1-score': 0.11764705882352941}
細辛 {'TP': 30, 'FP': 23, 'FN': 36, 'TN': 70, 'precision': 0.5660377358490566, 'recall': 0.45454545454545453, 'f1-score': 0.5042016806722689}
白芷 {'TP': 0, 'FP': 9, 'FN': 2, 'TN': 148, 'precision': 0.0, 'recall': 0.0, 'f1-score': 0}
生薑 {'TP': 13, 'FP': 39, 'FN': 20, 'TN': 87, 'precision': 0.25, 'recall': 0.3939393939393939, 'f1-score': 0.30588235294117644}
辛夷 {'TP': 1, 'FP': 16, 'FN': 6, 'TN': 136, 'precision': 0.058823529411764705, 'recall': 0.14285714285714285, 'f1-score': 0.08333333333333333}
葛根 {'TP': 1,
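
The `overall` entry pools the TP, FP, and FN counts of all medicines before computing precision, recall, and F1, which corresponds to micro-averaging. Averaging the per-medicine F1 scores (macro-averaging) generally gives a different number, because rarely prescribed medicines then count as much as frequent ones. A small sketch using the first four validation rows printed above (only a subset, so the results are not the notebook's overall scores; the helper name `f1_from_counts` is illustrative):

```python
def f1_from_counts(tp, fp, fn):
    """F1 score from confusion counts, with 0 as the zero-denominator fallback."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    total = precision + recall
    return 2 * precision * recall / total if total > 0 else 0


# (TP, FP, FN) for 麻黃, 桂枝, 荊芥 and 防風 from the validation output
counts = [(5, 15, 19), (28, 31, 38), (0, 8, 0), (1, 11, 4)]

# Micro-average: sum the counts first, then compute one F1 score
micro_f1 = f1_from_counts(*(sum(col) for col in zip(*counts)))

# Macro-average: compute F1 per medicine, then take the plain mean
macro_f1 = sum(f1_from_counts(*c) for c in counts) / len(counts)

print(f"micro F1 = {micro_f1:.3f}, macro F1 = {macro_f1:.3f}")
```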

# 8. Result Saving

Export the evaluation metrics of the training and validation sets in CSV format.


In [106]:
# Create a DataFrame recording the F1 score, precision, recall, and TP/FP/TN/FN of each medicine for the training set
train_f1_df = pd.DataFrame([(key, val['f1-score'], val['precision'], val['recall'], 
                           val['TP'], val['FP'], val['TN'], val['FN']) for key, val in TrainMedicineDictioanry.items()], 
                         columns=['medicine', 'f1-score','precision', 'recall', 'TP', 'FP', 'TN', 'FN']
                        )

# Setup the path to save the result
file_path = "./result/" + ModelNum

# Exporting the DataFrame to csv file
myutil.df_to_csv(train_f1_df, save_path=file_path, file_prefix='train_f1')

# Create a DataFrame recording the F1 score, precision, recall, and TP/FP/TN/FN of each medicine for the validation set
val_f1_df = pd.DataFrame([(key, val['f1-score'], val['precision'], val['recall'], 
                           val['TP'], val['FP'], val['TN'], val['FN']) for key, val in ValMedicineDictioanry.items()], 
                         columns=['medicine', 'f1-score','precision', 'recall', 'TP', 'FP', 'TN', 'FN']
                        )
# Setup the path to save the result
file_path = "./result/" + ModelNum

# Exporting the DataFrame to csv file
myutil.df_to_csv(val_f1_df, save_path=file_path, file_prefix='val_f1')

train_f1 saved to ./result/[16]_0_UseWeight/train_f1.csv
val_f1 saved to ./result/[16]_0_UseWeight/val_f1.csv
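
`myutil.df_to_csv` is a project helper whose source is not shown in this notebook. Judging only from the call and the log lines above, a minimal stand-in could look like the following; the body, the `index=False` choice, and the directory creation are assumptions, not the project's actual implementation:

```python
import os
import tempfile

import pandas as pd


def df_to_csv(df, save_path, file_prefix):
    """Write `df` to `<save_path>/<file_prefix>.csv`, creating the folder if needed."""
    os.makedirs(save_path, exist_ok=True)
    target = os.path.join(save_path, f"{file_prefix}.csv")
    df.to_csv(target, index=False)
    print(f"{file_prefix} saved to {target}")
    return target


# Demo with made-up values in a temporary folder, so nothing is written to ./result
demo_df = pd.DataFrame({"medicine": ["overall"], "f1-score": [0.5]})
with tempfile.TemporaryDirectory() as tmp_dir:
    written = df_to_csv(demo_df, save_path=tmp_dir, file_prefix="val_f1")
    reloaded = pd.read_csv(written)

print(reloaded.shape)
```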



# 9. Model Saving

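Save the model in .pb format using `tensorflow.keras.Model.save`. Because section 6 reuses a single `model` object, the saved weights are those left after fitting the last medicine.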

In [108]:
model.save(file_path)

INFO:tensorflow:Assets written to: ./result/[16]_0_UseWeight\assets
