# Setup

## Colab setup

Run the following cells in order (by pressing `Shift-Enter` or clicking on the "play" button at the top-left of a cell when mousing over it). When a warning pops up, choose "Run anyway".

In [None]:
!rm -r sample_data
!git clone https://github.com/SimoneBarbaro/data_science_lab_project.git

In [None]:
import os
os.chdir("./data_science_lab_project/data")
!wget -O TWOSIDES_medDRA.csv.gz https://polybox.ethz.ch/index.php/s/Uemf21AIiZ7ooNi/download
os.chdir("..")

While waiting for the download above to complete, open the file browser on the left by clicking on the folder icon.

**Upload the results archive `results_2020_11_27.zip` ([Polybox link](https://polybox.ethz.ch/index.php/f/2160436597)) to the `data_science_lab_project` folder** by hovering over the folder and choosing "Upload" from the three-dots menu that appears on the right. The archive should then sit inside the `data_science_lab_project` folder, next to the `src` and `data` folders. To check, expand the folder by clicking the triangle to the left of its name.

Additionally, **upload the paired SPiDER data `matrix_spider_full.pkl.gz` ([Polybox link](https://polybox.ethz.ch/index.php/f/2160073828)) and `matrix_spider_names_full.pkl.gz` ([Polybox link](https://polybox.ethz.ch/index.php/f/2160073720)) into the `data` folder** (within `data_science_lab_project`) using the three-dots menu of the `data` folder. Expand the `data` folder and check that it contains `matrix_spider_full.pkl.gz`, `matrix_spider_names_full.pkl.gz`, and the `TWOSIDES_medDRA.csv.gz` file downloaded automatically above.
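
The manual check described above can also be done in a cell. The following is a minimal sketch, assuming the current working directory is `data_science_lab_project` (as it is after the download cell above); the helper name `find_missing_uploads` is hypothetical and not part of the project.

```python
import os

# File names taken from the instructions above; they are expected in `data/`.
EXPECTED_DATA_FILES = [
    "TWOSIDES_medDRA.csv.gz",
    "matrix_spider_full.pkl.gz",
    "matrix_spider_names_full.pkl.gz",
]

def find_missing_uploads(project_dir="."):
    """Return the expected files that are not present in `project_dir` yet."""
    missing = []
    data_dir = os.path.join(project_dir, "data")
    for name in EXPECTED_DATA_FILES:
        if not os.path.isfile(os.path.join(data_dir, name)):
            missing.append(os.path.join("data", name))
    # The results archive is matched loosely, like `results_*.zip` in the unzip cell.
    has_archive = any(
        f.startswith("results_") and f.endswith(".zip")
        for f in os.listdir(project_dir)
    )
    if not has_archive:
        missing.append("results_*.zip")
    return missing

print(find_missing_uploads() or "All expected files are present.")
```

A file can show up in the listing while its upload or download is still in progress, so an empty result only confirms that the files exist, not that they are complete.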

In [None]:
!unzip results_*.zip

In [None]:
!pip install -r src/requirements.txt
import pandas as pd
os.chdir("src")

Running the following cell should now list the runs that can be analyzed: each is a clustering method followed by the number of clusters and other relevant parameters.

In [None]:
cwd = os.getcwd()
results_directory = "../results"
results_directory = os.path.join(cwd, results_directory)
for f in sorted(os.listdir(results_directory)):
    path = os.path.join(results_directory, f)
    if os.path.isdir(path):
        print(f)

## Choose one of the clustering methods listed above

In [None]:
name_analysis = 'aggl15'

In [None]:
result_path = os.path.join(results_directory, name_analysis)
analysis_path = os.path.join(result_path, "analysis")
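
A mistyped run name only shows up later as a missing-file error when reading `results.csv`. As a rough sketch, the choice could be validated right away against the folders listed earlier; the helpers `available_runs` and `check_run` are hypothetical names introduced here for illustration.

```python
import os

def available_runs(results_directory):
    """Return the sorted names of the run folders inside `results_directory`."""
    return sorted(
        f for f in os.listdir(results_directory)
        if os.path.isdir(os.path.join(results_directory, f))
    )

def check_run(results_directory, name_analysis):
    """Return the path of the chosen run, or raise ValueError listing valid choices."""
    runs = available_runs(results_directory)
    if name_analysis not in runs:
        raise ValueError(f"Unknown run {name_analysis!r}; choose one of {runs}")
    return os.path.join(results_directory, name_analysis)
```

With these helpers, `result_path = check_run(results_directory, name_analysis)` would give the same path as the cell above, or fail with the list of valid run names.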

# Clusters

In [None]:
pth = os.path.join(result_path, 'results.csv')
clustering_results = pd.read_csv(pth)
clustering_results.head()

To query a specific drug pair, provide the names of both drugs in the cell below:

In [None]:
name1 = 'tamoxifen'
name2 = 'bupropion'
clustering_results[(clustering_results['name1'] == name1) & (clustering_results['name2'] == name2)]
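
The filter above matches a pair only when the drugs are given in the same order as they are stored in the `name1` and `name2` columns. Whether the results file uses a fixed order is not stated here, so an empty result may just mean the names are swapped. A minimal sketch of an order-independent lookup follows; the helper `find_pair` and the toy table are illustrative and not part of the project.

```python
import pandas as pd

def find_pair(df, drug_a, drug_b):
    """Return the rows for a drug pair regardless of which column holds which drug."""
    forward = (df["name1"] == drug_a) & (df["name2"] == drug_b)
    reverse = (df["name1"] == drug_b) & (df["name2"] == drug_a)
    return df[forward | reverse]

# Toy table standing in for `clustering_results`; the `cluster` values are made up.
toy = pd.DataFrame({
    "name1": ["bupropion", "aspirin"],
    "name2": ["tamoxifen", "ibuprofen"],
    "cluster": [3, 7],
})
print(find_pair(toy, "tamoxifen", "bupropion"))
```

In the notebook, the call would be `find_pair(clustering_results, name1, name2)`.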

# Side effects analysis

## Choose a level of side effects

In [None]:
analysis_level = 'soc'

In [None]:
pth = os.path.join(analysis_path, 'scores_' + analysis_level + '_term.csv')
sideeffect_results = pd.read_csv(pth)
sideeffect_results

## Cluster numbers

In [None]:
sideeffect_results["cluster"].drop_duplicates().values

In [None]:
num_clusters = len(sideeffect_results["cluster"].drop_duplicates().values)

## Choose a specific cluster to inspect

In [None]:
cluster_no = 0

In [None]:
sideeffect_results[sideeffect_results['cluster'] == cluster_no]

In [None]:
# Significance level, kept as a string because it is used to build a file name below
alpha = '0.005'

# Significance analysis

## Choose a level of side effects

In [None]:
analysis_level = 'soc'

In [None]:
pth = os.path.join(analysis_path, 'significant_' + analysis_level + '_ranks_' + alpha + '.csv')
statistical_results = pd.read_csv(pth)
statistical_results
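
Because `alpha` is part of the file name, only values for which a file was generated can be loaded. The following sketch lists the values available for a given level by scanning the analysis folder for names of the form `significant_<level>_ranks_<alpha>.csv`, the pattern used in the cell above. The helper name `available_alphas` is hypothetical.

```python
import os
import re

def available_alphas(analysis_path, analysis_level):
    """List the alpha strings for which a significance file exists for this level."""
    pattern = re.compile(
        r"^significant_" + re.escape(analysis_level) + r"_ranks_(.+)\.csv$"
    )
    alphas = []
    for filename in os.listdir(analysis_path):
        match = pattern.match(filename)
        if match:
            alphas.append(match.group(1))
    # Sorted as strings, since that is how they appear in the file names.
    return sorted(alphas)
```

In the notebook, `available_alphas(analysis_path, analysis_level)` would return the strings that can be assigned to `alpha`.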

Look at the summary of significant results:

In [None]:
pth = os.path.join(analysis_path, 'significant_summary.csv')
summary_results = pd.read_csv(pth)
summary_results

Query for a specific drug pair:

In [None]:
name1 = 'tamoxifen'
name2 = 'bupropion'
summary_results[(summary_results['name1'] == name1) & (summary_results['name2'] == name2)]

# Target distribution for significant clusters

In [None]:
from experiment.interactive_analysis import InteractiveAnalyzer

## Choose the level, the number of clusters to see, and the number of targets per cluster to show

In [None]:
analysis_level = 'hlgt'
cluster_number = num_clusters
targets_per_cluster = 5

This may take some time because the TWOSIDES data must be loaded for further analysis.

In [None]:
analyzer = InteractiveAnalyzer(result_path)

In [None]:
significant_clusters, important_targets = analyzer.get_important_data(analysis_level, cluster_number, targets_per_cluster)

In [None]:
significant_clusters

## Choose a specific cluster to inspect from the ones in the table above

In [None]:
cluster_no = 11

In [None]:
important_targets[cluster_no].describe()

## Choose a target to visualize

In [None]:
target = 'Serine Threonine Kinase'

In [None]:
important_targets[cluster_no][target].plot.hist(xlim=[0,2])