# Tutorial
This tutorial demonstrates how to use MOGT functions on a demo dataset, with SCZ as the example. Once you are familiar with the MOGT workflow, replace the demo data with your own data to begin your analysis.

## How to prepare input data

We recommend getting started with MOGT using the provided demo dataset. When you want to apply MOGT to your own multi-omics dataset, refer to the following steps to learn how to prepare the input data.

Overall, the input data consists of two parts: the graph, constructed from SNP-SNP interactions, and the node features, which include DE, EPI, and gene expression in five brain regions (Parietal Lobe, Frontal Lobe, Temporal Lobe, Cerebellum, and Occipital Lobe) in adolescents and adults.
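
Because the graph and the node features must describe the same set of nodes, a quick consistency check before building the dataset can catch misaligned inputs early. The sketch below is only an illustration in plain Python: `check_inputs` is a hypothetical helper, not part of MOGT, and the toy values are made up. The width of 16 matches the number of columns listed in `feat_cols` in the demo.

```python
def check_inputs(adjacency, features, n_features):
    """Hypothetical helper: verify that graph and features describe the same nodes."""
    n_nodes = len(adjacency)
    if any(len(row) != n_nodes for row in adjacency):
        raise ValueError("interaction matrix must be square")
    if len(features) != n_nodes:
        raise ValueError("need exactly one feature row per graph node")
    if any(len(row) != n_features for row in features):
        raise ValueError(f"every feature row must have {n_features} values")
    return n_nodes


# Toy inputs: 3 nodes, 16 feature columns per node (all values are made up).
adjacency = [
    [0.0, 0.8, 0.1],
    [0.8, 0.0, 0.5],
    [0.1, 0.5, 0.0],
]
features = [[0.0] * 16 for _ in range(3)]

n_nodes = check_inputs(adjacency, features, n_features=16)
print(f"{n_nodes} nodes, {len(features[0])} features per node")
```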

If you are unfamiliar with MOGT, you can save time by starting with the data used in the paper. For SCZ, the input data and the label information are available [here](https://github.com/NBStarry/CGMega/tree/main/data). If you start with this data, you can skip _step 1_ of _How to prepare input data_.

> If you choose to analyze your own data, you need to collect the labels yourself.

### Import default params and set constants

### Load omics data
The demo input data and the label information are available [here](https://github.com/JiafangLi/MOGT/tree/main/data). Put them into `DATA_DIR` and run:

In [None]:
node_feat, pos = get_node_feat(disease=disease)
feat_cols = ["de","brain_cp","brain_gz","neuron_count","oligo_count","micro_count","adolescence_parietal.lobe","adolescence_frontal.lobe","adolescence_temporal.lobe","adolescence_cerebellum","adolescence_occipital.lobe","adulthood_parietal.lobe","adulthood_frontal.lobe","adulthood_temporal.lobe","adulthood_cerebellum","adulthood_occipital.lobe"]
feat_df = pd.DataFrame(data=node_feat, columns=feat_cols)
feat_df = pd.concat([GENE_LIST, feat_df], axis=1)
feat_df[:10]

### SNP-SNP interaction construction
Next, we read the SNP-SNP interaction data (available from [Zenodo]()) and transform it into a graph with the following commands:

In [None]:
snp_mat = get_snp_mat(disease=disease)
snp_mat[20:30, 20:30]

### Load gene labels

In [None]:
node_lab, labeled_idx = get_label(disease=disease)
labeled_lab = [node_lab[i][1] for i in labeled_idx]
print(f"Positive samples: {sum(node_lab)[1]}  Negative samples: {sum(node_lab)[0]}  Unlabeled samples: {len(node_lab) - len(labeled_idx)}")
print(node_lab)
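
The counting in the `print` call relies on each label being a two-element row, with the negative class in column 0 and the positive class in column 1, so that a column-wise sum yields the class counts. The toy example below illustrates that idea in plain Python. Encoding unlabeled nodes as `[0, 0]` is an assumption made for this sketch; the actual encoding returned by `get_label` is not shown here.

```python
# Toy label table: one [negative, positive] row per node (values are made up).
# Assumption for this sketch: unlabeled nodes are encoded as [0, 0].
node_lab = [
    [0, 1],  # positive
    [1, 0],  # negative
    [0, 0],  # unlabeled
    [1, 0],  # negative
    [0, 0],  # unlabeled
]

# A node counts as labeled if either class flag is set.
labeled_idx = [i for i, row in enumerate(node_lab) if sum(row) > 0]

n_negative = sum(row[0] for row in node_lab)  # column 0 total
n_positive = sum(row[1] for row in node_lab)  # column 1 total
n_unlabeled = len(node_lab) - len(labeled_idx)

print(f"Positive: {n_positive}  Negative: {n_negative}  Unlabeled: {n_unlabeled}")
```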

### Build PyG Data instance

In [None]:
from data_preprocess_cv import build_pyg_data
edge_threshhold = configs["edge_threshhold"]
random = configs["random"]
data = build_pyg_data(GENE_LIST, node_feat, node_lab, snp_mat, pos, edge_threshhold, random)
data
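
How `build_pyg_data` uses `edge_threshhold` internally is not shown in this tutorial. One plausible reading, and it is only an assumption, is that matrix entries above the threshold become graph edges. PyG stores edges in `edge_index` as two parallel rows of source and target node indices (COO format), so the conversion could look roughly like the following plain-Python sketch. `matrix_to_edges` is a hypothetical name and the toy matrix is made up.

```python
def matrix_to_edges(mat, threshold):
    """Hypothetical sketch: keep off-diagonal entries strictly above the threshold."""
    sources, targets = [], []
    for i, row in enumerate(mat):
        for j, weight in enumerate(row):
            if i != j and weight > threshold:  # skip self-loops and weak pairs
                sources.append(i)
                targets.append(j)
    return sources, targets


# Toy symmetric interaction matrix for 3 nodes (values are made up).
toy_mat = [
    [1.0, 0.9, 0.2],
    [0.9, 1.0, 0.6],
    [0.2, 0.6, 1.0],
]

edge_index = matrix_to_edges(toy_mat, threshold=0.5)
print(edge_index)
```

Because the toy matrix is symmetric, every retained pair appears in both directions, which is how an undirected graph is usually represented in COO form.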

### Do train-test split

25% of the labeled samples are held out as the test set; the remaining 75% are used for a 10-fold train-validation split.

In [None]:
from data_preprocess_cv import split_data
train_idx_list,valid_idx_list,test_idx = split_data(CV_FOLDS,labeled_idx,labeled_lab)
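
To make the shape of the three return values concrete, here is a minimal stratified split written with the standard library only. It is a sketch of the general technique, not the implementation of `split_data`: the function name `stratified_split`, its parameters, and the toy labels are all assumptions made for illustration.

```python
import random


def stratified_split(indices, labels, test_frac, n_folds, seed=0):
    """Hypothetical sketch: hold out a test set, then build K train/valid folds."""
    rng = random.Random(seed)

    # Group sample indices by class so each split keeps the class ratio.
    by_class = {}
    for idx, lab in zip(indices, labels):
        by_class.setdefault(lab, []).append(idx)

    test_idx = []
    folds = [[] for _ in range(n_folds)]
    for lab in sorted(by_class):
        members = by_class[lab][:]
        rng.shuffle(members)
        n_test = round(len(members) * test_frac)
        test_idx.extend(members[:n_test])
        # Deal the remaining samples round-robin into the validation folds.
        for k, idx in enumerate(members[n_test:]):
            folds[k % n_folds].append(idx)

    train_idx_list, valid_idx_list = [], []
    for k in range(n_folds):
        valid_idx_list.append(sorted(folds[k]))
        train_idx_list.append(
            sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        )
    return train_idx_list, valid_idx_list, sorted(test_idx)


# Toy data: 80 labeled samples, 20 positive and 60 negative.
indices = list(range(80))
labels = [1] * 20 + [0] * 60
train_idx_list, valid_idx_list, test_idx = stratified_split(
    indices, labels, test_frac=0.25, n_folds=10
)
print(len(test_idx), len(train_idx_list), len(valid_idx_list))
```

Each fold uses a different slice of the non-test samples for validation and the rest for training, while the test indices stay fixed across folds.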

### Create CancerDataset and Save

In [None]:
import pickle
cv_dataset = create_cv_dataset(train_idx_list.copy(), valid_idx_list.copy(), test_idx.copy(), data=data)
print(cv_dataset[0])
dataset_dir = "data/" + configs["disease"] + "/"+ configs["disease"]+ "_dataset.pkl"
print(f'Finished! Saving dataset to {dataset_dir} ......')
with open(dataset_dir, 'wb') as f:
    pickle.dump(cv_dataset, f)
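
The saved file can be read back later with `pickle.load`. The round trip below uses a small placeholder dictionary and a temporary directory so that it runs on its own; `toy_dataset` and `dataset_path` are illustrative names, not objects from the MOGT code.

```python
import os
import pickle
import tempfile

# Placeholder standing in for the real dataset object (values are made up).
toy_dataset = {"train_idx": [0, 1, 2], "valid_idx": [3], "test_idx": [4, 5]}

with tempfile.TemporaryDirectory() as tmp_dir:
    dataset_path = os.path.join(tmp_dir, "toy_dataset.pkl")

    # Save in binary mode, as in the cell above.
    with open(dataset_path, "wb") as f:
        pickle.dump(toy_dataset, f)

    # Load it back, also in binary mode.
    with open(dataset_path, "rb") as f:
        restored = pickle.load(f)

print(restored == toy_dataset)
```

When loading a real dataset, the class of the pickled object must be importable in the loading session, and pickle files should only be loaded from sources you trust.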

### Load data and Train model <div id="load-data-and-train-model"></div>

5-fold training: one model is trained and tested per fold, and the per-fold metrics are averaged at the end.

In [None]:
from main import get_training_modules, train_model, predict, calculate_metrics, pred_to_df,test
from data_preprocess_cv import scale_data

configs['stable'] = True

# set to 'cuda' to use GPU.
configs['device'] = 'cuda'

# Set log_name and logfile to suit your needs, and you will see training details in logfile.
configs['log_name'] = configs["disease"]
configs['logfile'] = os.path.join(configs["log_dir"], configs["log_name"] + ".txt")

dataset = get_data(configs,disease)


def calculate_confidence_interval(scores):
    """Return the half-width of the 95% confidence interval of the mean score."""
    std = np.std(scores)
    n = len(scores)
    z = 1.96  # Z-value at 95% confidence level
    return z * std / np.sqrt(n)

sum_auprc, sum_auc, sum_acc, sum_f1, sum_tp, train_result = [], [], [], [], [], []
for i in range(CV_FOLDS):
    head_info = (i == 0)
    configs['fold'] = i
    data = dataset
    data = scale_data(data,i)
    modules = get_training_modules(configs, data)
    auprc, auc, acc, f1, tp, new_ckpt,cutoff = train_model(modules, configs, configs['log_name'], i, head_info,)
    y_score, y_pred, y_true, y_index, genes = predict(modules['model'], modules['test_loader_list'], configs, new_ckpt)
    acc, cf_matrix, auprc, f1, auc, test_cutoff = calculate_metrics(y_true, y_pred, y_score)
    # save test result
    y = pd.DataFrame({'y_true': y_true, 'y_score': y_score,"y_pred":y_pred,"cutoff":cutoff})
    result = pd.concat([genes,y], axis=1)
    result.to_csv(configs["out_dir"]+"/"+disease +"/"+disease+"_"+"test"+"_"+str(i)+".csv",index_label=False)


    tp = cf_matrix[1, 1]
    sum_auprc.append(auprc)
    sum_auc.append(auc)
    sum_acc.append(acc)
    sum_f1.append(f1)
    sum_tp.append(tp)
    with open(configs['logfile'], 'a') as f:
        print("Test AUPRC:{:.4f}, AUROC:{:.4f}, ACC:{:.4f}, F1:{:.4f}, TP:{:.1f},cutoff:{:.4f}"
            .format(auprc, auc, acc, f1, tp,test_cutoff), file=f, flush=True)
    train_result =  pred_to_df(i, train_result, y_index, y_true, y_score)
# Summarize the per-fold test metrics: mean and 95% CI half-width.
auc_ci = calculate_confidence_interval(sum_auc)
auprc_ci = calculate_confidence_interval(sum_auprc)
avg_auc = sum(sum_auc) / len(sum_auc)
avg_auprc = sum(sum_auprc) / len(sum_auprc)
avg_f1 = sum(sum_f1) / len(sum_f1)
with open(configs['logfile'], 'a') as f:
    print(f"{CV_FOLDS}-folds AUPRC:{avg_auprc:.4f}+/-{auprc_ci:.4f}, AUROC:{avg_auc:.4f}+/-{auc_ci:.4f}, F1:{avg_f1:.4f}",
          file=f, flush=True)
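
The interval reported for AUROC and AUPRC is the mean across folds plus or minus `z * std / sqrt(n)`, with `z = 1.96` for a 95% level under a normal approximation. The standalone example below reproduces that arithmetic with the standard library so it can be checked by hand. The per-fold scores are made up, and `ci_half_width` is an illustrative name rather than a function from the repository.

```python
import math
import statistics


def ci_half_width(scores, z=1.96):
    """Half-width of the normal-approximation confidence interval of the mean."""
    # pstdev is the population standard deviation, matching np.std's default (ddof=0).
    std = statistics.pstdev(scores)
    return z * std / math.sqrt(len(scores))


# Made-up AUROC values for five folds.
fold_auc = [0.80, 0.82, 0.78, 0.81, 0.79]

mean_auc = statistics.fmean(fold_auc)
half_width = ci_half_width(fold_auc)
print(f"AUROC: {mean_auc:.4f} +/- {half_width:.4f}")
```

With only a handful of folds, the normal approximation tends to give a narrower interval than a t-distribution critical value would, so the reported bounds are best read as a rough indication of fold-to-fold variability.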

