# Cluster analysis - 2

This file is part of the Verifying explainability of a deep learning tissue classifier trained on RNA-seq data project.

Verifying explainability of a deep learning tissue classifier trained on RNA-seq data project is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.


Verifying explainability of a deep learning tissue classifier trained on RNA-seq data project is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with the Verifying explainability of a deep learning tissue classifier trained on RNA-seq data project.  If not, see <http://www.gnu.org/licenses/>.



### Input files:
1. *gtex_filtered_tmm_intersect_test.pkl*
2. *shap_kmeans.pkl*

### Output files:
1. *suppfig8.svg*
2. *rand_gauss.pkl*
 
### Table of contents:
1. [Import Modules](#1.-Import-Modules)  
2. [Set static paths](#2.-Set-static-paths)  
3. [Load files](#3.-Load-files)  
    3.1 [Load test data](#3.1-Load-test-data)  
    3.2 [Load kmeans](#3.2-Load-kmeans)  
4. [Process data](#4.-Process-data)  
    4.1 [Transform data](#4.1-Transform-data)  
5. [Measure clustering](#5.-Measure-clustering)  
    5.1 [Calculate kmeans](#5.1-Calculate-kmeans)  
6. [Create gaussian](#6.-Create-gaussian)  
    6.1 [Calculate mean](#6.1-Calculate-mean)  
    6.2 [Calculate variance](#6.2-Calculate-variance)  
    6.3 [Build gaussian](#6.3-Build-gaussian)  
    6.4 [Plot gaussian](#6.4-Plot-gaussian)  
7. [Save out results](#7.-Save-out-results)  

## 1. Import Modules

In [None]:
import os

In [None]:
util_path = '../src'
os.chdir(util_path)

In [None]:
import pandas as pd
import pickle
from tqdm import tqdm
from cluster import get_random_gene_df, get_kmeans_dict, get_p_value
from vis import plot_umap
from modelling.cnn import log_transform
import statistics 
import math
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

%load_ext autoreload
%autoreload 2

## 2. Set static paths

In [None]:
data_type = "imbalanced"
data_dir = "../data/"

In [None]:
input_dir = data_dir + "processed/"
gene_dir = data_dir + "gene_lists/"
fig_dir = "../figures/"
output_dir = data_dir + "proc/"

## 3. Load files

#### 3.1 Load test data

In [None]:
## Test data
with open(input_dir + "gtex_filtered_tmm_intersect_test.pkl", "rb") as f:
    test_data = pickle.load(f)

#### 3.2 Load kmeans

In [None]:
## SHAP k-means results
with open(input_dir + "shap_kmeans.pkl", "rb") as f:
    shap_kmeans = pickle.load(f)

## 4. Process data

#### 4.1 Transform data

In [None]:
test_data = log_transform(test_data, label=True)

## 5. Measure clustering

#### 5.1 Calculate kmeans

In [None]:
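# Draw 10 random gene sets of 2423 genes each and compute a UMAP embedding
# for every set. No plot is written to disk (save_plot=False).
random_list = []
for _ in tqdm(range(10)):
    random_df = get_random_gene_df(test_data, 2423)
    rand_shap_umap_df = plot_umap(
        random_df,
        "supp_fig7d",
        fig_dir,
        label_col="type",
        seed=42,
        save_plot=False
    )
    random_list.append(rand_shap_umap_df)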

In [None]:
random_shap_dict = {}
kmeans_dict = {}
for i in range(10):
    random_shap_dict[i]=[]

In [None]:
# Score each of the 10 random embeddings with 10 repeated k-means runs.
for _ in tqdm(range(10)):
    for i in range(10):
        random_shap_dict[i].append(get_kmeans_dict(random_list[i], "type"))

In [None]:
for i in range(10):
    kmeans_dict[f"Random SHAP {i}"] = random_shap_dict[i]

In [None]:
random_shap_results = []
for i in range(10):
    random_shap_results.append(pd.DataFrame.from_dict(kmeans_dict[f"Random SHAP {i}"]))

In [None]:
rand_mean = []
for i in range(10):
    rand_mean.append(random_shap_results[i]["V-Measure"].mean())
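
The `V-Measure` column averaged above comes from `get_kmeans_dict`, whose implementation is not shown in this notebook. Assuming it reports the standard V-measure (the harmonic mean of homogeneity and completeness), the metric can be sketched with the standard library alone. The function names below are illustrative and are not part of the project's `cluster` module.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def conditional_entropy(target, given):
    """H(target | given), computed from joint and marginal counts."""
    n = len(target)
    joint = Counter(zip(given, target))
    given_counts = Counter(given)
    return -sum(
        (c / n) * math.log(c / given_counts[g]) for (g, _), c in joint.items()
    )


def v_measure(true_labels, cluster_labels):
    """Harmonic mean of homogeneity and completeness (beta = 1)."""
    h_true = entropy(true_labels)
    h_clust = entropy(cluster_labels)
    # By convention both terms are 1 when the corresponding entropy is zero.
    homogeneity = (
        1.0 if h_true == 0
        else 1.0 - conditional_entropy(true_labels, cluster_labels) / h_true
    )
    completeness = (
        1.0 if h_clust == 0
        else 1.0 - conditional_entropy(cluster_labels, true_labels) / h_clust
    )
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)


# Toy labels for illustration only: clusters match the tissue types exactly.
perfect = v_measure(["a", "a", "b", "b"], [1, 1, 0, 0])
# Toy labels for illustration only: every sample lands in one cluster.
single = v_measure(["a", "a", "b", "b"], [0, 0, 0, 0])
```

A score near 1 means the k-means clusters line up with the tissue labels, while a score near 0 means the clusters carry no label information.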

## 6. Create gaussian

#### 6.1 Calculate mean

In [None]:
overall_mean = statistics.mean(rand_mean)

#### 6.2 Calculate variance

In [None]:
var_list = []
for i in range(10):
    var_list.append(random_shap_results[i]["V-Measure"].var())

In [None]:
mean_var = statistics.mean(var_list)

In [None]:
std_dev = math.sqrt(mean_var)
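
The cells above summarise the random baseline as the mean of the per-set means and the square root of the mean per-set variance. The same computation can be sketched as a single helper; the name `pooled_mean_and_std` and the toy scores are assumptions made for illustration. `statistics.variance` is the sample variance, which matches the default of the pandas `.var()` call used above.

```python
import math
import statistics


def pooled_mean_and_std(score_sets):
    """Return (mean of set means, sqrt of mean set variance)."""
    set_means = [statistics.mean(scores) for scores in score_sets]
    set_vars = [statistics.variance(scores) for scores in score_sets]
    overall_mean = statistics.mean(set_means)
    std_dev = math.sqrt(statistics.mean(set_vars))
    return overall_mean, std_dev


# Toy V-measure scores for two random gene sets (illustrative values only).
toy_scores = [[0.50, 0.52, 0.54], [0.60, 0.62, 0.64]]
toy_mean, toy_std = pooled_mean_and_std(toy_scores)
```

Averaging the within-set variances captures the spread across repeated k-means runs on the same gene set; it does not include the spread between the set means themselves.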

#### 6.3 Build gaussian

In [None]:
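# Draw 10,000 samples from a normal distribution parameterised by the
# random-baseline mean and standard deviation. No seed is set, so the
# samples differ between runs.
rand_gauss = np.random.normal(loc=overall_mean, scale=std_dev, size=10000)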

In [None]:
rand_gauss = pd.DataFrame(rand_gauss, columns=["V-Measure"])
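
If a reproducible baseline is wanted, one option is to draw the samples from a seeded NumPy `Generator`. This is a sketch rather than the notebook's own method; the helper name `sample_null` and the seed value are assumptions, and the mean and standard deviation passed in below are placeholder values.

```python
import numpy as np


def sample_null(mean, std, size=10000, seed=0):
    """Draw `size` samples from N(mean, std**2) with a fixed seed."""
    rng = np.random.default_rng(seed)
    return rng.normal(loc=mean, scale=std, size=size)


# Placeholder parameters; in the notebook these would be overall_mean and std_dev.
draws_a = sample_null(0.5, 0.02)
draws_b = sample_null(0.5, 0.02)
```

With the same seed, both calls return identical arrays, so the saved `rand_gauss.pkl` and the plotted density would not change between runs.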

#### 6.4 Plot gaussian

In [None]:
metric = "V-Measure"
sns.kdeplot(rand_gauss[metric], label="Random", color="gray").set_title(metric)
plt.axvline(shap_kmeans[metric].mean(), label="SHAP")
plt.legend()
sns.despine()
file_path = fig_dir+"suppfig8.svg"
plt.savefig(file_path)
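
The plot shows where the mean SHAP V-measure sits relative to the random baseline. One way to put a number on that position is the upper-tail probability under the fitted normal distribution. The sketch below uses `statistics.NormalDist` from the standard library; it is not the project's `get_p_value` (imported above but not shown), and the function name and inputs are illustrative assumptions.

```python
from statistics import NormalDist


def upper_tail_p(observed, null_mean, null_std):
    """P(X >= observed) for X ~ N(null_mean, null_std**2)."""
    return 1.0 - NormalDist(mu=null_mean, sigma=null_std).cdf(observed)


# Placeholder values: an observed score two standard deviations above the mean.
p_two_sd = upper_tail_p(0.7, 0.5, 0.1)
# An observed score equal to the baseline mean gives a probability of 0.5.
p_at_mean = upper_tail_p(0.5, 0.5, 0.1)
```

In the notebook, `observed` would be `shap_kmeans[metric].mean()`, with `overall_mean` and `std_dev` supplying the baseline parameters.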

## 7. Save out results

In [None]:
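# Save to the output directory defined in section 2 rather than the working directory.
rand_gauss.to_pickle(output_dir + "rand_gauss.pkl")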