# Module 07_01: Classify Stars in Colliding Galaxies

Using ~5000 labeled stars out of ~80,000 in each galaxy (170,000 stars total), build a model to classify every star in the combined galaxy collision.

![KNN Classifier results for colliding galaxies](Assets/KNNPredictionResult.png "Classifying Stars in Colliding Galaxies")


As an optional assignment: 
- Find all neighboring pairs of points in two approaching galaxies.

Follow the creation of two synthetic galaxies whose star locations are given in Euclidean coordinates. The galaxies are set to collide with each other. The techniques described are one approach to classifying which stars belong to which galaxy after the collision. Later, we also identify the coordinates of pairs of points that lie within a radius R of each other. Pairs already lying within radius R inside a single galaxy are ignored; the focus is narrowed to points in one galaxy that are in close proximity to points in the other galaxy.
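The cross-galaxy neighbor search described above can be sketched with a KD-tree. This is a minimal illustration rather than the notebook's own solution: the helper name `cross_galaxy_pairs`, the random stand-in galaxies, and the radius of 3 units are all assumptions made for the example.

```python
import numpy as np
from sklearn.neighbors import KDTree


def cross_galaxy_pairs(stars_a, stars_b, radius):
    """Return (i, j) pairs where stars_a[i] lies within `radius` of stars_b[j].

    Hypothetical helper for illustration; only pairs that span the two
    galaxies are reported, so same-galaxy neighbors are ignored by design.
    """
    tree = KDTree(stars_b)                        # index one galaxy once
    hits = tree.query_radius(stars_a, r=radius)   # one index array per query star
    return [(i, int(j)) for i, js in enumerate(hits) for j in js]


# Stand-in data (assumption): two overlapping 3D Gaussian clouds
rng = np.random.default_rng(0)
galaxy_a = rng.normal(loc=0.0, scale=50.0, size=(1000, 3))
galaxy_b = rng.normal(loc=20.0, scale=50.0, size=(1000, 3))

pairs = cross_galaxy_pairs(galaxy_a, galaxy_b, radius=3.0)
print(len(pairs), "cross-galaxy pairs within 3 units")
```

Building the tree on one galaxy and querying with the other keeps the search from ever comparing two stars of the same galaxy.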

We synthesize ~80,000 stars in each of two galaxies: a fictitious GFFA ("galaxy far, far away") and a fictitious "Xeno", the backdrop for Superman's Krypton. Combined, this is over 100,000 stars, among which we search for neighboring pairs on the order of a few light years apart.

Play the following video, which plots and rotates the combined galaxies as a 3D scatter plot of the nearly 200,000 stars. It gradually reduces the opacity of each star to ultimately reveal the handful of stars that are within a given radius of each other as a result of the collision. The center bulge in each galaxy is so dense that the red zone stars are not visible until the opacity of the stars is turned down significantly.


<video controls src="Videos/CollidingGalaxiesAnimation.mp4" />



# Learning Objectives:
 
1. Apply multiple classification algorithms with a GPU to classify the stars belonging to each galaxy within a combined super galaxy, and determine the most accurate model.
1. Apply the Intel® Extension for Scikit-learn* patch and a SYCL context to compute on an available GPU resource.
1. **Synthesize** your comprehension by searching for opportunities in each cell to maximize performance. Investigate adding pairwise distance as a means of finding all the stars within 3 light years of each other.
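As a rough sketch of the patch-and-context pattern named in the objectives, the snippet below patches scikit-learn when `sklearnex` is installed and otherwise falls back to stock scikit-learn. The helper `device_context`, the toy data, and the `"gpu:0"` target string mentioned in the comments are assumptions for illustration; whether offload works depends on the hardware and runtime available.

```python
from contextlib import nullcontext

import numpy as np

try:
    # Patch before importing the estimators so the accelerated versions are used
    from sklearnex import patch_sklearn, config_context
    patch_sklearn()
    SKLEARNEX_AVAILABLE = True
except ImportError:
    SKLEARNEX_AVAILABLE = False

from sklearn.neighbors import KNeighborsClassifier


def device_context(target=None):
    """Hypothetical helper: offload to `target` if sklearnex is present, else do nothing."""
    if SKLEARNEX_AVAILABLE and target is not None:
        return config_context(target_offload=target)
    return nullcontext()


# Toy data (assumption): label is 1 when the first coordinate is positive
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)

# target=None stays on the host; try target="gpu:0" where a supported GPU exists
with device_context(target=None):
    model = KNeighborsClassifier(n_neighbors=3).fit(X, y)
    labels = model.predict(X)

print(labels.shape)
```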


To synthesize the galaxy data, we use the parametric equation of a spiral arm described in the paper [A New Formula Describing the Scaffold Structure of Spiral Galaxies](https://arxiv.org/ftp/arxiv/papers/0908/0908.0892.pdf):


$$ r(\phi) = \frac{A}{\log\left(B \tan\left(\frac{\phi}{2N}\right)\right)} $$
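A minimal sketch of evaluating this arm equation in NumPy follows. The parameter values are arbitrary illustrations, and the natural logarithm is an assumption (a different base only rescales A). The angle range is chosen so that the argument of the logarithm stays above 1, which keeps the radius positive and finite.

```python
import numpy as np


def arm_radius(phi, A, B, N):
    """Radius of the spiral arm at angle phi: A / log(B * tan(phi / (2N)))."""
    return A / np.log(B * np.tan(phi / (2.0 * N)))


# Illustrative parameter values (assumptions, not taken from the notebook)
A, B, N = 1.0, 2.0, 4.0

# r is positive and finite only where B * tan(phi / 2N) > 1
phi_lo = 2.0 * N * np.arctan(1.0 / B)   # radius diverges here
phi_hi = N * np.pi                      # radius shrinks to zero here
phi = np.linspace(phi_lo, phi_hi, 202)[1:-1]   # drop both singular endpoints

r = arm_radius(phi, A, B, N)
x, y = r * np.cos(phi), r * np.sin(phi)   # Cartesian coordinates of the arm
print(r.min(), r.max())
```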

The synthesizer used here generates an arbitrary number of arms, generates a 3D Gaussian distribution of stars around a galactic center, then distributes a Gaussian distribution of stars along the length of each arm. In addition, it generates an arbitrary number of "globular clusters" of stars according to a 3D Gaussian distribution, sprinkled randomly along the arm curves.
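The three ingredients named above (central bulge, stars scattered along an arm, globular clusters) might be sketched as follows. This is not the notebook's synthesizer: the placeholder arm curve, the star counts, the spreads, and the helper `gaussian_blob` are all assumptions chosen to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(42)


def gaussian_blob(center, sigma, n):
    """Hypothetical helper: n stars drawn from a 3D Gaussian around `center`."""
    return rng.normal(loc=center, scale=sigma, size=(n, 3))


# Central bulge, flattened along z (assumed spreads)
bulge = gaussian_blob(center=(0.0, 0.0, 0.0), sigma=(1.0, 1.0, 0.3), n=500)

# Placeholder arm curve (assumption): a simple planar spiral stands in for the arm equation
phi = np.linspace(0.5, 4.0, 50)
arm_curve = np.column_stack((phi * np.cos(phi), phi * np.sin(phi), np.zeros_like(phi)))

# Arm stars: pick random points on the curve, then add Gaussian scatter around them
anchors = arm_curve[rng.integers(0, len(arm_curve), size=1000)]
arm_stars = anchors + rng.normal(scale=0.15, size=anchors.shape)

# Globular clusters: tight Gaussian blobs centered on a few random curve points
cluster_centers = arm_curve[rng.choice(len(arm_curve), size=3, replace=False)]
clusters = np.vstack([gaussian_blob(c, 0.05, 100) for c in cluster_centers])

galaxy = np.vstack((bulge, arm_stars, clusters))
print(galaxy.shape)
```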

We also used rotation matrices from the blog post [3D Rotations and Euler angles in Python](https://www.meccanismocomplesso.org/en/3d-rotations-and-euler-angles-in-python/).
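For reference, a small sketch of tilting a set of star coordinates with elementary rotation matrices is shown below. The composition order (x, then y, then z) and the function names are assumptions for illustration; the notebook's own convention may differ.

```python
import numpy as np


def rot_x(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])


def rot_y(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])


def rot_z(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])


def rotate(points, alpha, beta, gamma):
    """Rotate (n, 3) points about x by alpha, then y by beta, then z by gamma (radians)."""
    R = rot_z(gamma) @ rot_y(beta) @ rot_x(alpha)
    return points @ R.T   # row vectors, so multiply by the transpose


stars = np.array([[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
tilted = rotate(stars, alpha=0.0, beta=0.0, gamma=np.pi / 2)
print(np.round(tilted, 6))
```

A rotation leaves every star's distance from the origin unchanged, which is a quick sanity check when orienting one galaxy relative to the other.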

# Practicum:

Work through each cell, looking for places to patch or unpatch as needed to maximize the performance of each cell.

# Fictitious Galaxies: 

We create two fictitious galaxies: GFFA ("galaxy far, far away") and Xeno (the purported galaxy of Superman's planet Krypton). We intersect the galaxies and use various classification algorithms to identify the stars in each galaxy.

# k Nearest Neighbors

kNN classification follows the general workflow described in the oneAPI GitHub repository [Classification Usage Model](https://oneapi-src.github.io/oneDAL/daal/usage/training-and-prediction/classification.html#classification-usage-model).

k-Nearest Neighbors (kNN) is a classification (or regression) algorithm. The kNN classifier model is built from the feature vectors and class labels of the training data set, and it uses the distances between points as the key element for classifying similar points.
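To make the role of distances concrete, here is a brute-force kNN sketch in plain NumPy for binary labels. It is an illustration under assumed toy data (two well-separated Gaussian clouds, 5% labeled), not the optimized implementation the notebook calls through scikit-learn; the function name `knn_predict` is hypothetical.

```python
import numpy as np


def knn_predict(x_train, y_train, x_query, k=3):
    """Brute-force kNN for 0/1 labels: majority vote among the k closest training points."""
    # Squared Euclidean distance from every query point to every training point
    d2 = ((x_query[:, None, :] - x_train[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]     # indices of the k nearest neighbors
    votes = y_train[nearest]                    # labels of those neighbors
    return (votes.mean(axis=1) > 0.5).astype(int)


# Toy stand-ins for the two galaxies (assumption)
rng = np.random.default_rng(0)
galaxy_a = rng.normal(loc=0.0, scale=1.0, size=(1000, 3))
galaxy_b = rng.normal(loc=4.0, scale=1.0, size=(1000, 3))
X = np.vstack((galaxy_a, galaxy_b))
y = np.array([0] * 1000 + [1] * 1000)

# Label only 5% of the stars and predict the rest
labeled = rng.choice(len(X), size=100, replace=False)
unlabeled = np.setdiff1d(np.arange(len(X)), labeled)

y_pred = knn_predict(X[labeled], y[labeled], X[unlabeled], k=3)
accuracy = (y_pred == y[unlabeled]).mean()
print("accuracy:", accuracy)
```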

# Random Forest

Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time. For classification tasks, the output of the random forest is the class selected by most trees.
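The voting idea can be inspected directly by collecting each tree's prediction and taking the majority, as in the sketch below. Note that scikit-learn's `RandomForestClassifier` combines trees by averaging their class probabilities rather than counting hard votes; with fully grown trees, an odd number of trees, and two classes, the two rules are expected to agree. The toy data and parameter values are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy stand-ins for the two galaxies (assumption)
rng = np.random.default_rng(1)
galaxy_a = rng.normal(loc=0.0, scale=1.0, size=(500, 3))
galaxy_b = rng.normal(loc=1.5, scale=1.0, size=(500, 3))
X = np.vstack((galaxy_a, galaxy_b))
y = np.array([0] * 500 + [1] * 500)
X_query = rng.normal(loc=0.75, scale=1.5, size=(400, 3))

forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# One row of 0/1 predictions per tree, shape (n_trees, n_query)
tree_votes = np.array([tree.predict(X_query) for tree in forest.estimators_])

# Majority vote across the trees
majority = (tree_votes.mean(axis=0) > 0.5).astype(int)

agreement = (majority == forest.predict(X_query)).mean()
print("agreement with forest.predict:", agreement)
```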

# Define GPU Context and Fit Multiple Sklearn Models Using GPU

We try multiple models and multiple training size values to determine which combination of model and training size yields the optimum Receiver Operating Characteristic (ROC) AUC score.
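When comparing ROC scores, it helps to know what `roc_auc_score` measures for each kind of input. The sketch below, on assumed toy data, contrasts passing hard class predictions with passing class probabilities from `predict_proba`. For binary hard predictions the ROC curve has a single interior point, so the AUC reduces to balanced accuracy; probability scores use the full ranking instead.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Toy stand-ins for two overlapping galaxies (assumption)
rng = np.random.default_rng(0)
X = np.vstack((rng.normal(0.0, 1.0, size=(1000, 3)),
               rng.normal(1.0, 1.0, size=(1000, 3))))
y = np.array([0] * 1000 + [1] * 1000)

x_train, x_test, y_train, y_test = train_test_split(
    X, y, train_size=0.05, random_state=0, stratify=y)

model = RandomForestClassifier(random_state=0).fit(x_train, y_train)

hard = model.predict(x_test)                 # 0/1 labels
scores = model.predict_proba(x_test)[:, 1]   # probability of class 1

auc_hard = roc_auc_score(y_test, hard)
auc_scores = roc_auc_score(y_test, scores)
bal_acc = balanced_accuracy_score(y_test, hard)

print("AUC from hard labels :", round(auc_hard, 3))
print("balanced accuracy    :", round(bal_acc, 3))
print("AUC from probabilities:", round(auc_scores, 3))
```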

In [None]:
# import pandas as pd
# import numpy as np
# XenoSupermanGalaxy = read_dictionary('XenoSupermanGalaxy.pkl')
# GFFA = read_dictionary('GFFA.pkl')

# np.save('XenoSupermanGalaxy_Arms.npy', XenoSupermanGalaxy['Arms'], allow_pickle=False)
# np.save('XenoSupermanGalaxy_CenterGlob.npy', XenoSupermanGalaxy['CenterGlob'], allow_pickle=False)
# np.save('XenoSupermanGalaxy_Stars.npy', XenoSupermanGalaxy['Stars'], allow_pickle=False)
# np.save('XenoSupermanGalaxy_tooClose.npy', XenoSupermanGalaxy['tooClose'], allow_pickle=False)

# np.save('GFFA_Arms.npy', GFFA['Arms'], allow_pickle=False)
# np.save('GFFA_CenterGlob.npy', GFFA['CenterGlob'], allow_pickle=False)
# np.save('GFFA_Stars.npy', GFFA['Stars'], allow_pickle=False)
# np.save('GFFA_tooClose.npy', GFFA['tooClose'], allow_pickle=False)

In [None]:
import pandas as pd
import numpy as np

from sklearnex import patch_sklearn
patch_sklearn()

XenoSupermanGalaxy_Arms = np.load('Data/XenoSupermanGalaxy_Arms.npy')
XenoSupermanGalaxy_CenterGlob = np.load('Data/XenoSupermanGalaxy_CenterGlob.npy')
XenoSupermanGalaxy_Stars = np.load('Data/XenoSupermanGalaxy_Stars.npy')
XenoSupermanGalaxy_tooClose = np.load('Data/XenoSupermanGalaxy_tooClose.npy')

XenoSupermanGalaxy = {}
XenoSupermanGalaxy['Arms'] = XenoSupermanGalaxy_Arms
XenoSupermanGalaxy['CenterGlob'] = XenoSupermanGalaxy_CenterGlob
XenoSupermanGalaxy['Stars'] = XenoSupermanGalaxy_Stars
XenoSupermanGalaxy['tooClose'] = XenoSupermanGalaxy_tooClose

GFFA_Arms = np.load('Data/GFFA_Arms.npy')
GFFA_CenterGlob = np.load('Data/GFFA_CenterGlob.npy')
GFFA_Stars = np.load('Data/GFFA_Stars.npy')
GFFA_tooClose = np.load('Data/GFFA_tooClose.npy')

GFFA = {}
GFFA['Arms'] = GFFA_Arms
GFFA['CenterGlob'] = GFFA_CenterGlob
GFFA['Stars'] = GFFA_Stars
GFFA['tooClose'] = GFFA_tooClose

# Set the stage for computing on a different device

- If we use **%%writefile**, the code changes are written to a **Python file for later runs** via qsub on a different device.

- If we **comment out** the **%%writefile** line, **this code executes immediately**, so you can experiment faster.

In [None]:
%%writefile Practicum_analyzeGalaxyBatch.py
# Comment out the %%writefile line above to run this cell immediately; leave it in place to write the file for targeted execution on another device

import warnings
warnings.filterwarnings("ignore", message="A column-vector y was passed when a 1d array was expected")


from sklearn import datasets
from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.svm import NuSVC
from sklearn.ensemble import RandomForestClassifier
import numpy as np
import time
from random import sample
from sklearn.neighbors import KDTree
from sklearn.metrics import confusion_matrix
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import platform

import pandas as pd
import numpy as np

XenoSupermanGalaxy_Arms = np.load('Data/XenoSupermanGalaxy_Arms.npy')
XenoSupermanGalaxy_CenterGlob = np.load('Data/XenoSupermanGalaxy_CenterGlob.npy')
XenoSupermanGalaxy_Stars = np.load('Data/XenoSupermanGalaxy_Stars.npy')
XenoSupermanGalaxy_tooClose = np.load('Data/XenoSupermanGalaxy_tooClose.npy')

XenoSupermanGalaxy = {}
XenoSupermanGalaxy['Arms'] = XenoSupermanGalaxy_Arms
XenoSupermanGalaxy['CenterGlob'] = XenoSupermanGalaxy_CenterGlob
XenoSupermanGalaxy['Stars'] = XenoSupermanGalaxy_Stars
XenoSupermanGalaxy['tooClose'] = XenoSupermanGalaxy_tooClose

GFFA_Arms = np.load('Data/GFFA_Arms.npy')
GFFA_CenterGlob = np.load('Data/GFFA_CenterGlob.npy')
GFFA_Stars = np.load('Data/GFFA_Stars.npy')
GFFA_tooClose = np.load('Data/GFFA_tooClose.npy')

GFFA = {}
GFFA['Arms'] = GFFA_Arms
GFFA['CenterGlob'] = GFFA_CenterGlob
GFFA['Stars'] = GFFA_Stars
GFFA['tooClose'] = GFFA_tooClose

plt.style.use('dark_background')
    
# dataset is subset of stars from each galaxy
TrainingSize = min(len(GFFA['Stars']), len(XenoSupermanGalaxy['Stars'] ) ) 

collision = dict()
collision['Arms'] = np.vstack((GFFA['Arms'].copy(), XenoSupermanGalaxy['Arms'].copy()))
collision['CenterGlob'] = np.vstack((GFFA['CenterGlob'].copy(), XenoSupermanGalaxy['CenterGlob'].copy()))
collision['Stars'] = np.vstack((GFFA['Stars'].copy(), XenoSupermanGalaxy['Stars'].copy()))
collision['Stars'].shape

# get the index of the stars to use from XenoSupermanGalaxy
XenoIndex = np.random.choice(len(XenoSupermanGalaxy['Stars']), TrainingSize, replace=False)
# get the index of the stars to use from GFFA
GFFAIndex = np.random.choice(len(GFFA['Stars']), TrainingSize, replace=False)

# create a list with a label for each item in the combined training set
# the first half of the list is class 0 (GFFA), the second half is class 1 (XenoSupermanGalaxy)
y = [0]*TrainingSize + [1]*TrainingSize
# Stack the stars subset in the same order as the labels, GFFA first, XenoSupermanGalaxy second
trainGalaxy = np.vstack((GFFA['Stars'][GFFAIndex], XenoSupermanGalaxy['Stars'][XenoIndex]))  

x_train, x_test, y_train, y_test = train_test_split(trainGalaxy, np.array(y), train_size=0.05)


K = 3
myModels = {'KNeighborsClassifier':KNeighborsClassifier(n_neighbors = K) , 
            'RandomForestClassifier': RandomForestClassifier(n_jobs=2, random_state=0), 
           }
# sweep through various training split fractions
TrainingSize = [.001, .01, .03, .05, .1, .2, .5, .8]
bestScore = {}
hi = 0
      
for tsz in TrainingSize:
    x_train, x_test, y_train, y_test = train_test_split( \
                trainGalaxy, np.array(y), train_size=tsz)
    y_train = y_train.ravel()
    y_test = y_test.ravel()
        
    for name, modelFunc in myModels.items():   
        print("Compute Device: ", platform.processor())
        start = time.time()
        model = modelFunc

        model.fit(x_train, y_train)
        y_pred = model.predict(x_test) 
        
        print('Results of {} classification'.format(name))
        print('  K: ',K)
        print('  Training size: ', tsz)
        print('  y_train.shape: ',y_train.shape)
        roc = roc_auc_score(y_test, y_pred)
        print('  roc_auc_score: {:4.1f}'.format(100*roc))
        print('  Time: {:5.1f} sec\n'.format( time.time() - start))
        if roc > hi:
            hi = roc
            bestScore = {'name': name,
                    'roc':roc, 
                    'trainingSize':tsz, 
                    'confusionMatrix': confusion_matrix(y_test, y_pred), 
                    'precision': 100*precision_score(y_test, y_pred, average='binary'),
                    'recall': 100*recall_score(y_test, y_pred, average='binary') }  # percent, same scale as precision
print('bestScore: name', bestScore['name'])
print('bestScore: confusion Matrix', bestScore['confusionMatrix'])
print('bestScore: precision', bestScore['precision'])
print('bestScore: recall', bestScore['recall'])
print('bestScore: roc', bestScore['roc'])

# Notices & Disclaimers

# Intel technologies may require enabled hardware, software or service activation.

# No product or component can be absolutely secure.
# Your costs and results may vary.

# © Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. 
# *Other names and brands may be claimed as the property of others.


In [None]:
# %load Practicum_analyzeGalaxyBatch.py
# to run on the login node, uncomment the line above

# Set the stage for our GPU experiments later

We will use the LSF qsub mechanism to target the Python code above on a different compute node.

The q.sh script is a wrapper around the qsub process.

In [None]:
#!python 04_analyzeGalaxyBatch.py   # works on Windows
! chmod 755 q; chmod 755 run_ModelCompare.sh; if [ -x "$(command -v qsub)" ]; then ./q run_ModelCompare.sh; else ./run_ModelCompare.sh; fi

# Questions:
1) Were you able to compute on both CPU and GPU with minor changes to the code?
1) Did your results at any time include the phrase **"Intel(R) UHD Graphics P630"**?
1) Did your results at any time include the phrase **"Intel(R) Xeon(R) Gold 6128 CPU"**?
1) Describe the **Compute Follows Data** method as applied here: did you have to cast to or from NumPy for any part of this exercise? Explain.


# Notices & Disclaimers

Intel technologies may require enabled hardware, software or service activation.

No product or component can be absolutely secure.
Your costs and results may vary.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. 
*Other names and brands may be claimed as the property of others.
