# Star Galaxy Classification

### In this notebook we run the MuyGPyS classifier on the un-normalized and normalized image data, and compare the accuracy of each.

***Note:*** You must run 'data_normalization.ipynb' before continuing.

In [None]:
import numpy as np
from MuyGPyS.examples.classify import do_classify
import pandas as pd
from sklearn.model_selection import train_test_split

### Read in all flattened data (normalized and un-normalized):

In [None]:
gal_star = pd.read_csv('raw_image_data.csv')
gal_star_norm_1 = pd.read_csv('norm_1_image_data.csv')
gal_star_norm_2 = pd.read_csv('norm_2_image_data.csv')
gal_star_norm_3 = pd.read_csv('norm_3_image_data.csv')
gal_star_norm_4 = pd.read_csv('norm_4_image_data.csv')

# Create a list with the name of the variable holding the data, 
# and the name you want associated with the data, for each dataset
data_files = [[gal_star, 'Raw data'], 
              [gal_star_norm_1, 'Normalized data 1'], 
              [gal_star_norm_2, 'Normalized data 2'], 
              [gal_star_norm_3, 'Normalized data 3'], 
              [gal_star_norm_4, 'Normalized data 4']]

### Define a function that generates "one-hot" values.

This takes our truth labels of 0 and 1 and converts them as follows for use in the classifier:
- 0 to [1., -1.]
- 1 to [-1., 1.]

In [None]:
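def generate_onehot_value(values):
    """Convert truth labels (0 or 1) into the two-column form used by the classifier."""
    onehot = []
    for val in values:
        if val == 0:
            onehot.append([1., -1.])
        elif val == 1:
            onehot.append([-1., 1.])
        else:
            # Skipping an unexpected label would misalign labels and images
            raise ValueError(f"Unexpected label: {val}")
    return onehot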
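
As a rough alternative to the loop, the same conversion can be sketched with NumPy indexing. The name `onehot_vectorized` is hypothetical and not part of the notebook or of MuyGPyS; it assumes the labels are a flat sequence containing only 0 and 1, and it returns a NumPy array directly rather than a list.

```python
import numpy as np

def onehot_vectorized(values):
    """Map integer labels 0/1 to rows [1., -1.] / [-1., 1.] without a Python loop."""
    values = np.asarray(values)
    # Check that every label is one of the two expected classes.
    if not np.isin(values, (0, 1)).all():
        raise ValueError("labels must be 0 or 1")
    # Start from all -1, then set the column matching each label to +1.
    onehot = -np.ones((values.size, 2))
    onehot[np.arange(values.size), values.astype(int)] = 1.0
    return onehot

print(onehot_vectorized([0, 1, 1]))
```

Because the result is already an array, the `np.array(...)` wrapper used when building the `train` and `test` dictionaries would not be needed with this variant.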

### Run the classifier on each dataset

For each dataset (un-normalized and normalized) in `data_files`, this for loop does the following:
- Separates the labels from the data
- Splits the data between training and testing
    - `test_size` is the fraction of the data you want to use for testing, where 0.5 means half of the data is used for testing and half for training.
    - `random_state` fixes the random seed used for the split, so each dataset is trained and tested on the same number of stars and galaxies, for a better comparison of the algorithms. Feel free to change the `random_state` value, or remove it altogether to get a different split on each run!
- Gets the one-hot values for the testing and training labels
- Puts `train` and `test` into the proper form for the classifier, a dictionary with the keys:
    - 'input': the flattened image data
    - 'output': the one-hot values of the labels, as a NumPy array
    - 'lookup': the original truth labels (0 for stars, 1 for galaxies)
- Does the classification (`do_classify`)
- Computes the accuracy of the classifier for the given dataset by comparing the predictions to the truth labels.
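
The accuracy step in the list above can be sketched on toy numbers. This is only an illustration: `accuracy_percent` is a hypothetical helper, the scores are made up, and it assumes the classifier returns one score column per class in the same order as the one-hot values.

```python
import numpy as np

def accuracy_percent(predictions, lookup):
    """Percentage of rows whose largest score is in the column of the true label."""
    # The column with the largest score is taken as the predicted class.
    predicted_labels = np.argmax(predictions, axis=1)
    return 100.0 * np.mean(predicted_labels == np.asarray(lookup))

# Made-up scores standing in for the classifier's output.
toy_predictions = np.array([[0.9, -0.8],
                            [-0.2, 0.3],
                            [0.4, -0.1],
                            [-0.7, 0.6]])
toy_lookup = np.array([0, 1, 1, 1])  # 0 = star, 1 = galaxy

print(accuracy_percent(toy_predictions, toy_lookup))
```

Here the third object is a galaxy whose larger score falls in the star column, so three of the four toy predictions match the truth labels.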
  


In [None]:
for data, data_label in data_files:
    truth_labels = data.iloc[:, 0].values
    image_data = data.iloc[:, 1:].values
    
    X_train, X_test, y_train, y_test = train_test_split(image_data, truth_labels, test_size=0.5, random_state=28)
    
    print("=============== ", data_label, " ===============")
    print('Training data:', len(y_train[y_train==0]), 'stars and', len(y_train[y_train==1]), 'galaxies')
    print('Testing data:', len(y_test[y_test==0]), 'stars and', len(y_test[y_test==1]), 'galaxies')
    
    onehot_train, onehot_test = generate_onehot_value(y_train), generate_onehot_value(y_test)
    
    train = {'input': X_train, 'output': np.array(onehot_train), 'lookup': y_train}
    test = {'input': X_test, 'output': np.array(onehot_test), 'lookup': y_test}
    
    print("Running Classifier on", data_label)
    # Switch verbose to True for more output
    surrogate_predictions = do_classify(train, test, nn_count=50, verbose=False)
    
    predicted_labels = np.argmax(surrogate_predictions, axis=1)
    true_labels = np.argmax(test["output"], axis=1)
    
    accuracy = np.sum(predicted_labels == true_labels) / len(predicted_labels) * 100
    print("Total accuracy for", data_label, ":", np.around(accuracy, 3), '%')

<u>***Note:*** The accuracies will differ each time you run the classifier.</u>
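
Because the accuracy changes from run to run, one way to compare datasets more fairly is to repeat the run several times and report the mean and spread. The sketch below is an assumption about how that could be organized, not something the notebook does: `summarize_runs` and `fake_run` are hypothetical names, and `fake_run` is a placeholder for a function that would perform one split and one classification and return the accuracy in percent.

```python
import numpy as np

def summarize_runs(run_once, n_runs=5):
    """Call run_once(seed) for seeds 0..n_runs-1 and return (mean, std) of the results."""
    accuracies = np.array([run_once(seed) for seed in range(n_runs)], dtype=float)
    return accuracies.mean(), accuracies.std()

def fake_run(seed):
    # Placeholder only: a real version would split the data with
    # random_state=seed, run the classifier, and return the accuracy.
    return 90.0 + seed

mean_acc, std_acc = summarize_runs(fake_run, n_runs=3)
print("Mean accuracy:", mean_acc, "+/-", np.around(std_acc, 3), "%")
```

Passing the seed into each run keeps the individual runs reproducible while still sampling different splits.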

### As you can see, all 4 normalization techniques do much better than the un-normalized data, with techniques 3 and 4 performing the best in most cases.

### Things you can try, to see how they affect the classifier accuracy:
- Play around with different values of `test_size`. What does testing on more or less data do?
- Try generating more cutouts using `generating_ZTF_cutouts_from_ra_dec.ipynb`. How does having more testing and training data affect the classifier?
- Play around with the parameters used to make the cutouts. What happens if you remove blend cuts? Can the classifier classify blends? What if you increase the seeing limit? Can the classifier classify images with bad atmospheric quality?