# Star Galaxy Classification

### In this notebook we run the un-normalized and normalized datasets through the MuyGPyS classifier (a Python classification function that uses the MuyGPs Gaussian process hyperparameter estimation method) and compare the resulting accuracies.

**Note:** You must run `data_normalization.ipynb` before continuing.

In [None]:
import numpy as np
from MuyGPyS.examples.classify import do_classify
import pandas as pd
from sklearn.model_selection import train_test_split
import random
from tqdm import tqdm
import matplotlib.pyplot as plt

### Read in all flattened data (normalized and un-normalized):

In [None]:
gal_star = pd.read_csv('raw_image_data.csv')
gal_star_norm_1 = pd.read_csv('norm_1_image_data.csv')
gal_star_norm_2 = pd.read_csv('norm_2_image_data.csv')
gal_star_norm_3 = pd.read_csv('norm_3_image_data.csv')
gal_star_norm_4 = pd.read_csv('norm_4_image_data.csv')
gal_star_norm_5 = pd.read_csv('norm_5_image_data.csv')

# Create a list with the name of the variable holding the data, 
# and the name you want associated with the data, for each dataset
data_files = [[gal_star, 'Raw data'], 
              [gal_star_norm_1, 'Normalized data 1'], 
              [gal_star_norm_2, 'Normalized data 2'], 
              [gal_star_norm_3, 'Normalized data 3'],
              [gal_star_norm_4, 'Normalized data 4'],
              [gal_star_norm_5, 'Normalized data 5']]

### Define a function that generates "one-hot" values.

This takes our truth labels of 0 and 1 and converts them as follows for use in the classifier:
- 0 to [1., -1.]
- 1 to [-1., 1.]
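
The same mapping can also be written as a single vectorized NumPy operation. The following is a minimal sketch, not part of the original workflow: the name `onehot_pm1` is hypothetical, and it assumes every label is exactly 0 or 1.

```python
import numpy as np

def onehot_pm1(labels):
    """Map integer labels 0/1 to rows [1., -1.] / [-1., 1.] (hypothetical helper)."""
    labels = np.asarray(labels, dtype=int)
    # Start with every entry at -1, then set the column of the true class to +1
    onehot = -np.ones((labels.size, 2))
    onehot[np.arange(labels.size), labels] = 1.0
    return onehot

print(onehot_pm1([0, 1, 1]))
```

This version returns a NumPy array rather than a list of lists, so wrapping the result in `np.array(...)` later would not be needed.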

In [None]:
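def generate_onehot_value(values):
    """Convert 0/1 truth labels to the [1., -1.] / [-1., 1.] encoding."""
    onehot = []
    for val in values:
        if val == 0:
            onehot.append([1., -1.])
        elif val == 1:
            onehot.append([-1., 1.])
        else:
            # Fail loudly rather than silently dropping a label,
            # which would misalign the features and the labels
            raise ValueError(f"Unexpected label: {val!r}")
    return onehot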

### Run the classifier on each dataset

For each dataset (un-normalized and normalized) in `data_files`, this for loop does the following:
- Separates the labels from the data
- Splits the data into training and testing sets
    - `test_size` is the fraction of the data you want to use for testing, where 0.5 means half of the data is used for testing and half for training.
    - `random_state` fixes the seed of the shuffle, so each dataset gets the same split and is trained and tested on the same number of stars and galaxies.
- Gets the one-hot values for the testing and training labels
- Puts `train` and `test` into the proper format for the classifier, a dictionary with the keys:
    - 'input': the flattened image data
    - 'output': the one-hot labels
    - 'lookup': the original 0/1 truth labels
- Does the classification (`do_classify`)
- Computes the accuracy of the classifier for the given dataset by comparing predicted labels to truth labels.
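
The final accuracy step can be illustrated on a toy example. This is a sketch with made-up numbers: `toy_predictions` stands in for the array the loop calls `surrogate_predictions` (assumed here to have one column per class), and `toy_onehot` stands in for the one-hot test labels. Neither comes from real data.

```python
import numpy as np

# Hypothetical prediction scores for four test objects, one column per class
toy_predictions = np.array([[0.8, -0.7],
                            [-0.2, 0.3],
                            [0.1, -0.4],
                            [-0.9, 0.6]])
# Hypothetical one-hot truth labels: star, galaxy, galaxy, galaxy
toy_onehot = np.array([[1., -1.], [-1., 1.], [-1., 1.], [-1., 1.]])

# The column with the larger value is taken as the predicted class
predicted = np.argmax(toy_predictions, axis=1)
truth = np.argmax(toy_onehot, axis=1)

# Percentage of test objects whose predicted class matches the truth
accuracy = np.mean(predicted == truth) * 100
print(f"Accuracy: {accuracy:.1f} %")
```

Taking `np.argmax` of the one-hot labels recovers the original 0/1 labels, so the comparison is between two integer arrays of equal length.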

In [None]:
for data, data_label in data_files:
    truth_labels = data.iloc[:, 0].values
    image_data = data.iloc[:, 1:].values

    X_train, X_test, y_train, y_test = train_test_split(image_data, truth_labels, test_size=0.5, random_state=42)

    print("=============== ", data_label, " ===============")
    print('Training data:', len(y_train[y_train==0]), 'stars and', len(y_train[y_train==1]), 'galaxies')
    print('Testing data:', len(y_test[y_test==0]), 'stars and', len(y_test[y_test==1]), 'galaxies')

    onehot_train, onehot_test = generate_onehot_value(y_train), generate_onehot_value(y_test)

    train = {'input': X_train, 'output': onehot_train, 'lookup': y_train}
    test = {'input': X_test, 'output': onehot_test, 'lookup': y_test}

    print("Running Classifier on", data_label)
    # Switch verbose to True for more output
    muygps, nbrs_lookup, surrogate_predictions = do_classify(test_features=np.array(test['input']), 
                                                             train_features=np.array(train['input']), 
                                                             train_labels=np.array(train['output']), 
                                                             nn_count=30, verbose=False) 
    predicted_labels = np.argmax(surrogate_predictions, axis=1)
    print("Total accuracy for", data_label, ":", np.around((np.sum(predicted_labels == np.argmax(test["output"], axis=1))/len(predicted_labels))*100, 3), '%')

<u>***Note:*** Each run of the classifier will give different accuracies.</u>

### As you can see, all 5 normalization techniques do much better than the un-normalized data, with some performing better than others.

### Things you can try, to see how they affect the classifier accuracy:
- Play around with different values of `test_size`. What does testing on more or less data do?
- Play around with different parameters that are passed to `do_classify`. Start with `nn_count` and `embed_dim`. (For what those arguments are, and a full list of the arguments you can pass to `do_classify`, look at the function `do_classify` in `/MuyGPyS/examples/classify.py`.)
- Try generating more cutouts using `generating_ZTF_cutouts_from_ra_dec.ipynb`. How does having more testing and training data affect the classifier?
- Play around with the parameters used to make the cutouts. What happens if you remove blend cuts? Can the classifier classify blends? What if you increase the seeing limit? Can the classifier classify images with bad atmospheric quality?

<hr style="border:2px solid gray"> </hr>

## <u>**Optional Step:**</u>
### Running each dataset through the classifier multiple times, testing and training on varying amounts of data, different random states, and plotting the accuracy outcomes

- Each time you run the following steps, you change:
    - `test_size`: This is used in `train_test_split`, and changes the size of the testing and training datasets, which affects the accuracy of the classifier.
    - `random_state`: This is used in `train_test_split`, and changes which objects land in the testing set, and therefore the star-to-galaxy ratio that gets tested on.
- You can set how many times to run the classifier with varying test sizes and random states by setting `num_runs`, and you can manually change the `test_size` values by editing `test_size_values`.
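
To see why the random state changes the star-to-galaxy ratio, consider the sketch below. It uses a made-up label array and a plain NumPy shuffle in place of `train_test_split`, so the counts will not match what the notebook produces. `count_test_classes` is a hypothetical helper.

```python
import numpy as np

def count_test_classes(labels, test_size, seed):
    """Shuffle labels with the given seed and count each class in the test part."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(labels)
    n_test = int(round(test_size * len(labels)))
    test_labels = shuffled[:n_test]
    n_stars = int(np.sum(test_labels == 0))
    n_galaxies = int(np.sum(test_labels == 1))
    return n_stars, n_galaxies

# Made-up labels: 60 stars (0) and 40 galaxies (1)
labels = np.array([0] * 60 + [1] * 40)

for seed in (0, 1, 2):
    stars, galaxies = count_test_classes(labels, test_size=0.5, seed=seed)
    print(f"seed={seed}: {stars} stars, {galaxies} galaxies in the test set")
```

If you would rather hold the ratio constant across runs, `train_test_split` accepts a `stratify` argument (for example `stratify=truth_labels`) that preserves the class proportions in both sets.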

In [None]:
test_size_values = [.2, .25, .33, .4, .5, .75]
num_runs = 3

In [None]:
def run_classifier(image_data, truth_labels, test_size, state):
    X_train, X_test, y_train, y_test = train_test_split(image_data, truth_labels, test_size=test_size, random_state=state)
    onehot_train, onehot_test = generate_onehot_value(y_train), generate_onehot_value(y_test)
    train = {'input': X_train, 'output': onehot_train, 'lookup': y_train}
    test = {'input': X_test, 'output': onehot_test, 'lookup': y_test}
    # Switch verbose to True for more output
    muygps, nbrs_lookup, surrogate_predictions = do_classify(test_features=np.array(test['input']),
                                                            train_features=np.array(train['input']), 
                                                            train_labels=np.array(train['output']), 
                                                            nn_count=30, verbose=False) 
    predicted_labels = np.argmax(surrogate_predictions, axis=1)
    accuracy = (np.sum(predicted_labels == np.argmax(test["output"], axis=1))/len(predicted_labels))*100
    return accuracy

In [None]:
accuracies = pd.DataFrame({'test_size': test_size_values})

# Setting progress bar for each time the classifier will be run during this step
pbar = tqdm(total=len(data_files)*num_runs*len(test_size_values), desc='Running classifier', leave=True)

for data, data_label in data_files:
    truth_labels = data.iloc[:, 0].values
    image_data = data.iloc[:, 1:].values
    all_acc_dataset = []
    for test_size in test_size_values:
        acc = []
        # Repeat the run with a fresh random split each time
        for _ in range(num_runs):
            accuracy = run_classifier(image_data, truth_labels, test_size, state=random.randint(0, 10000))
            acc.append(accuracy)
            pbar.update(1)
        avg_acc = np.average(acc)
        all_acc_dataset.append(avg_acc)
    temp_df = pd.DataFrame({str(data_label): all_acc_dataset})
    accuracies = pd.concat([accuracies, temp_df], axis=1)
pbar.close()  # Release the progress bar once all runs are done
display(accuracies)

In [None]:
plt.figure(figsize=(12,8))

for _, data_label in data_files:
    plt.plot(accuracies['test_size'].values, accuracies[data_label].values, label=data_label)
    
plt.legend(fontsize=12)   
plt.tick_params(labelsize=14)
plt.xlabel("Test size (as a ratio to full data size)", fontsize=18)
plt.ylabel("Accuracy [%]", fontsize=18)
plt.show()