# Grid searching parameters for TSNE

### Aim: 
Given *n* parameters, each tried at a low, middle, and high value, narrow the 
range of each parameter until you converge on a set of values that gives 
**distinct clusters** when the dimensionally reduced data is plotted.
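
One way to picture the narrowing step is sketched below. This is only an illustration and not part of DIBS: `narrow_range` is a hypothetical helper that takes the current low/high bounds plus the value that gave the most distinct clusters, and returns a new range at most half as wide, centred on that value and clipped to the old bounds.

```python
def narrow_range(low, high, best):
    """Return a (low, high) range at most half as wide as the original, centred on `best`.

    Hypothetical helper for illustration only. The result is clipped so that it
    never extends beyond the original bounds.
    """
    if not low <= best <= high:
        raise ValueError("best must lie within [low, high]")
    half_width = (high - low) / 4  # half of the new, halved range
    new_low = max(low, best - half_width)
    new_high = min(high, best + half_width)
    return new_low, new_high


# Example: perplexity was searched over 100..350 and the mid value (225) looked best
print(narrow_range(100, 350, 225))
```

The returned bounds would then be entered as the new low and high values for the next run of the notebook.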

### General usage guide:

- Set a low value and a high value for each parameter (the midpoint of the range is generated automatically)
- Then run the rest of the notebook to:
  - Compute the product of the low/mid/high values across all parameters
  - Build one pipeline per parameter combination, in parallel
  - Plot the resulting clusters for each combination

Note: whenever you change the parameters, you must rebuild the pipelines list. Re-running the parameter cell and every cell below it, in sequence, does this correctly.
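
The low/mid/high product described above can be sketched in isolation as follows. This is a minimal standalone illustration using only the standard library; `low_mid_high`, `ranges`, and `grid` are names chosen for this sketch, and the example ranges mirror the default values set in the parameter tuning section.

```python
import itertools


def low_mid_high(low, high):
    """Return the three values tried for one parameter: low, midpoint, high."""
    return [low, (low + high) / 2, high]


# Example ranges (illustrative; same numbers as the notebook defaults)
ranges = {
    'tsne_perplexity': (100, 350),
    'tsne_early_exaggeration': (10, 300),
    'tsne_learning_rate': (50, 450),
}

names = list(ranges)
grid = [
    dict(zip(names, values))
    for values in itertools.product(*(low_mid_high(*ranges[n]) for n in names))
]

print(len(grid))  # 3 values for each of 3 parameters: 3 ** 3 = 27
print(grid[0])
```

Because the grid grows as 3 to the power of the number of parameters, adding a fourth tuned parameter would triple the number of pipelines to build.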

In [None]:
# Import necessary libraries
import dibs_notebook_header
import dibs
import itertools
import multiprocessing
import numpy as np
import os
import pandas as pd
import time

### Set global runtime variables below

In [None]:
# Set global runtime variables. This cell should only need to be executed once.
num_cpu_processors = 4  # Set the maximum number of processors you are willing to use at a given time
# Fraction (between 0 and 1) of the training files to cluster on. Larger values
# increase training time but give the clustering process more data.
percent_epm_train_files_to_cluster_on = 0.3

## -- Parameter tuning section --

This is where the user sets the inputs that affect the clustering outcome


In [None]:
num_gmm_clusters_aka_num_colours = 8  # Sets the number of clusters that GMM will try to label
perplexity_low  = 100
perplexity_high = 350
early_exaggeration_low  = 10
early_exaggeration_high = 300
learning_rate_low  = 50
learning_rate_high = 450

In [None]:
# Do not edit this section: the midpoints are derived from the low/high values above
perplexity_mid = (perplexity_low + perplexity_high) / 2
early_exaggeration_mid = (early_exaggeration_low + early_exaggeration_high) / 2
learning_rate_mid = (learning_rate_low + learning_rate_high) / 2

## -- End of parameter tuning section --

---

Everything below must be run from top to bottom, in sequence, to generate 
new outputs for the parameters set above

In [None]:
# Auto-generate the product between all possible parameters
kwargs_product = [
    {'tsne_perplexity': perplexity_i,
     'tsne_early_exaggeration': early_exaggeration_j,
     'tsne_learning_rate': learning_rate_k,
     
     'gmm_n_components': num_gmm_clusters_aka_num_colours,
     'tsne_n_components': 2, # 2D dimensionality reduction
     'cross_validation_n_jobs': 1,
     'classifier_n_jobs':1, 
     'tsne_n_jobs': 1,
    } for perplexity_i, early_exaggeration_j, learning_rate_k in itertools.product(
        [
            perplexity_low,
            perplexity_mid,
            perplexity_high,
        ],
        [
            early_exaggeration_low,
            early_exaggeration_mid,
            early_exaggeration_high,
        ],
        [
            learning_rate_low,
            learning_rate_mid,
            learning_rate_high,
        ])]
pipeline_names_by_index = [f'Pipeline_{i}' for i in range(len(kwargs_product))]
print('Number of parameter combinations:', len(kwargs_product))

In [None]:
# Queue up which data files will be added to each Pipeline
# os.listdir returns entries in arbitrary order, so sort for a reproducible selection
train_data_dir = dibs.config.DEFAULT_TRAIN_DATA_DIR
all_files = [os.path.join(train_data_dir, file) for file in sorted(os.listdir(train_data_dir))]
train_data = all_files[:int(len(all_files) * percent_epm_train_files_to_cluster_on)]
train_data
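
The file selection above truncates with `int()`, so a small directory combined with a small fraction can produce an empty list. The sketch below shows one way to guard against that. `take_fraction` and `example_files` are hypothetical names used only for this illustration and are not part of DIBS.

```python
def take_fraction(files, fraction):
    """Return the first `fraction` of `files` in sorted order, but at least one file.

    Hypothetical helper for illustration only.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in the interval (0, 1]")
    ordered = sorted(files)  # fixed order so the same files are picked every run
    count = max(1, int(len(ordered) * fraction))
    return ordered[:count]


example_files = [f'video_{i}.csv' for i in range(10)]
print(take_fraction(example_files, 0.3))
```

With only two files and a fraction of 0.3, plain truncation would select nothing, whereas this helper still returns one file.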

In [None]:
# Create list of pipelines with all of the different combinations of parameters inserted
# Do not be alarmed if you see DEBUG output as an output in the cell
# To turn off debug output, change DIBS/config.ini [LOGGING] from DEBUG to ERROR
pipelines_ready_for_building = [
    dibs.pipeline.PipelineMimic(name, **kwargs).add_train_data_source(*train_data)
    for name, kwargs in zip(pipeline_names_by_index, kwargs_product)
]

### Next step: leveraging multiprocessing to get as much work done in as short a time as possible

In [None]:
# The heavy lifting/processing is done here
start_time = time.perf_counter()
with multiprocessing.Pool(num_cpu_processors) as pool:
    pipelines_queued = [pool.apply_async(pipe_i.build) for pipe_i in pipelines_ready_for_building]
    pipelines_results = [res.get() for res in pipelines_queued]
end_time = time.perf_counter()
print(f'Total compute time: {round((end_time-start_time)/60, 2)} minutes')
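
The submit-then-collect pattern used in the cell above can be tried on a toy task first. The sketch below swaps in `multiprocessing.pool.ThreadPool`, which exposes the same `apply_async`/`get` interface as `multiprocessing.Pool`, purely so that the example runs anywhere without process start-up concerns. `slow_square` is a hypothetical stand-in for an expensive build step.

```python
import time
from multiprocessing.pool import ThreadPool  # same interface as multiprocessing.Pool


def slow_square(x):
    """Stand-in for an expensive build step (illustration only)."""
    time.sleep(0.1)
    return x * x


start_time = time.perf_counter()
with ThreadPool(4) as pool:
    # apply_async returns immediately with a handle for each queued job
    queued = [pool.apply_async(slow_square, (x,)) for x in range(8)]
    # get() blocks until that particular job has finished
    results = [job.get() for job in queued]
elapsed = time.perf_counter() - start_time

print(results)  # results follow submission order, not completion order
print(f'Elapsed: {elapsed:.2f} s')
```

With a real process pool, each pipeline is pickled, built in a worker process, and sent back as a copy. Assuming `build()` returns the pipeline, as the plotting cell implies, the built objects are the ones in `pipelines_results`; the objects in `pipelines_ready_for_building` remain unbuilt in the parent process.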

In [None]:
# Note: evaluating "goodness" of a set of parameters is based on the distinctness of clusters. More distinct = better parameters set.
for i, pipeline_i in enumerate(pipelines_results):
    perplexity_i = pipeline_i.tsne_perplexity
    learning_rate_i = pipeline_i.tsne_learning_rate
    early_exaggeration_i = pipeline_i.tsne_early_exaggeration
    print(f"perplexity: {perplexity_i} / learning rate: {learning_rate_i} / early_exaggeration: {early_exaggeration_i}")
    pipeline_i.plot_clusters_by_assignments(
        fig_file_prefix=f'{time.strftime("%Y-%m-%d_%HH%MM")}__{pipeline_i.name}__',
        show_now=True,
        save_to_file=True,
        figsize=(20, 15),
        s=1.5,
    )
    print('-----------------------------------')

In [None]:
print('All done!')