3rd ASTERICS-OBELICS International School - Annecy, France - 8-12 April 2019

### Machine Learning Tutorial

# Section 1.b - Supervised learning: classification
by [Emille Ishida](https://www.emilleishida.com/)

### *Take home message 2: It is crucial to understand the algorithm you are using*

**Goal:** 1. Get acquainted with a basic machine learning algorithm    
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;2. Discuss hyperparameters  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;3. Write your classifier  

**Task**: Star-Galaxy Classification  

**Data**: Clean data resulting from [Notebook 1](EDA_SDSS_answers.ipynb)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;~9100 objects (rows; the exact number depends on how you removed the outliers)  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;11 features (columns)  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Features we are interested in:  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
$ug$: u-g SDSS color  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
$gr$: g-r SDSS color   
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
$class$: source classification

In [None]:
# import some basic libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

### Step 1: Pre-processing

We start by loading the data and checking if our variables of interest are given.
 
<div class="alert alert-info"> 
PS1: As we have already cleaned this data, we will not perform an extensive EDA here, but remember that you should always run tests similar to those shown in <a href='EDA_SDSS.ipynb'>Notebook 1</a> every single time... and again, just to be sure.
</div>

In [None]:
# load data
data = pd.read_csv('../data/SDSS_star_galaxy_clean.csv')

# check columns
data.keys()

In [None]:
data.shape

We see that the data we want is there, but there is a lot of other stuff as well.  
Select only the columns that are interesting to you. 
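
As a hedged illustration of this step, the sketch below selects columns from a small made-up table. The frame `data_demo`, its values, and the extra `redshift` column are assumptions for demonstration only; the column names `ug`, `gr`, and `class` are the ones used later in this notebook.

```python
import pandas as pd

# Toy stand-in for the cleaned table; values and the 'redshift' column are made up
data_demo = pd.DataFrame({
    'ug': [1.2, 0.9, 1.8],
    'gr': [0.5, 0.3, 0.8],
    'redshift': [0.0, 0.1, 0.0],
    'class': ['STAR', 'GALAXY', 'STAR'],
})

# keep only the two colours and the label;
# .copy() gives an independent frame, so later changes do not touch data_demo
data_use_demo = data_demo[['ug', 'gr', 'class']].copy()

# confirm remaining columns and take a quick look
print(data_use_demo.keys())
print(data_use_demo.head())
```

The same pattern, applied to `data`, would produce the `data_use` frame expected by the cells that follow.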

In [None]:
# make a new data frame with the columns you need


# confirm remaining columns
data_use.keys()

In [None]:
# take a look at the data you just separated


### Step 2: Separate your data samples

All supervised learning tasks are made of (at least) 4 phases:  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;1) **train**: training samples (requires features and labels)  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;2) **optimize**: validation sample (requires features and labels)  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**Repeat 1-2 until you are happy with the results**  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;3) **evaluate results**: test sample (requires features and labels)  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;4) **predict**: target sample (only features are available; at this point you have to trust your code)  
      
        
        
This means that your labelled sample needs to be divided into at least 3 samples: training, validation and test.  
So, in order to continue we must construct these samples.
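
One possible way to build the three samples is sketched here, under assumptions: the helper name `three_way_split` is hypothetical, the data are random toy values, and the `stratify` option (which keeps the class proportions similar in every sample) is an optional extra rather than something this tutorial requires.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def three_way_split(features, labels, frac_val=0.2, frac_test=0.2, seed=1):
    """Hypothetical helper: split into train / validation / test samples."""
    # first cut: hold out validation + test together
    frac_hold = frac_val + frac_test
    X_tr, X_hold, y_tr, y_hold = train_test_split(
        features, labels, test_size=frac_hold, random_state=seed, stratify=labels)
    # second cut: divide the held-out part between validation and test
    X_val, X_te, y_val, y_te = train_test_split(
        X_hold, y_hold, test_size=frac_test / frac_hold,
        random_state=seed, stratify=y_hold)
    return X_tr, X_val, X_te, y_tr, y_val, y_te

# toy data: 100 objects, 2 features, 2 balanced classes (made up)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = np.array(['STAR'] * 50 + ['GALAXY'] * 50)

parts = three_way_split(X_demo, y_demo)
print([p.shape for p in parts])
```

The two-step cut is needed because `train_test_split` only produces two samples per call.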


In [None]:
from sklearn.model_selection import train_test_split

# separate 60% for training and 20% for validation and 20% test
# WARNING: there is probably a smarter way to do this
X_train, X, y_train, y = train_test_split(data_use[['ug', 'gr']], data_use['class'], test_size=0.4, random_state=1)
X_test, X_validation, y_test, y_validation = train_test_split(X, y, test_size=0.5, random_state=1)

# check your samples (size, features, etc.)
print('training sample:    ', X_train.shape, y_train.shape) 
print('validation sample:  ', X_validation.shape, y_validation.shape)
print('test sample:        ', X_test.shape, y_test.shape)

### Step 3: Build a Nearest Neighbor classifier

As a start, we will construct a k-nearest neighbor algorithm (kNN) with k = 1.  
This means that the class of a test point will be given by the class of its nearest neighbor.  
We can describe this strategy as:  

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; For all objects in the unlabelled sample:  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;1) Calculate the distance to all points in the training sample  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;2) Identify its closest neighbor  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;3) Assign the class of its nearest neighbor  
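
The three steps above can be sketched with NumPy as follows. This is only one possible implementation, assuming Euclidean distance; the function name `nearest_neighbor_sketch` and the toy arrays are made up for illustration.

```python
import numpy as np

def nearest_neighbor_sketch(train_features, train_labels, unlabelled_features):
    """Illustrative 1-NN: return the label of the closest training point."""
    train_features = np.asarray(train_features, dtype=float)
    train_labels = np.asarray(train_labels)
    unlabelled_features = np.asarray(unlabelled_features, dtype=float)

    # 1) pairwise differences via broadcasting,
    #    shape (n_unlabelled, n_train, n_features)
    diff = unlabelled_features[:, np.newaxis, :] - train_features[np.newaxis, :, :]
    # squared Euclidean distances, shape (n_unlabelled, n_train)
    dist2 = np.sum(diff ** 2, axis=2)

    # 2) index of the closest training point for each unlabelled object
    nearest = np.argmin(dist2, axis=1)

    # 3) assign the class of that neighbor
    return train_labels[nearest]

# toy example with made-up points
train_X = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
train_y = np.array(['STAR', 'STAR', 'GALAXY'])
new_X = np.array([[0.2, 0.1], [4.0, 4.5]])

print(nearest_neighbor_sketch(train_X, train_y, new_X))
```

The squared distance is enough here because taking the square root does not change which neighbor is closest. Note that the broadcasting step holds all pairwise differences in memory at once, which may become heavy for much larger samples.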

In [None]:
# Build a 1-nearest neighbor classifier

def my_NearestNeighbor(train_features, train_labels, unlabelled_features):
    """
    Classify an unlabelled sample using a k=1 nearest neighbor algorithm.
    
    input: train_features - array, dim=(number of objects, number of features)
           train_labels - array, dim=(number of objects, 1)
           unlabelled_features - array, dim=(number of objects, number of features)
           
    output: estimated classes for all lines in unlabelled_features
            array, dim=(number of objects, 1)       
    """

    # calculate distances
   
    
    # assign to each element in the unlabelled sample the class of its nearest neighbor

    
    return 


Estimate the classes of all objects in the validation sample

In [None]:
class_estimate = my_NearestNeighbor( )

# quick look at the first 10 estimated classes
class_estimate[:10]

### Step 4:  Evaluate your results

Now that you have a classifier and have already applied it, let's quantify the results.  
Create a metric function that calculates the fraction of correct classifications. 
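
A minimal sketch of such a metric is given below. The name `accuracy_sketch` and the toy labels are assumptions for illustration; converting both inputs with `np.asarray` is a design choice so that pandas Series are compared by position rather than by index.

```python
import numpy as np

def accuracy_sketch(estimated_classes, true_classes):
    """Illustrative metric: fraction of correct classifications."""
    estimated = np.asarray(estimated_classes)
    true = np.asarray(true_classes)

    # number of objects in the sample
    n_objects = true.shape[0]

    # number of correct classifications (element-wise comparison)
    n_correct = np.sum(estimated == true)

    return n_correct / n_objects

# toy example: 3 of the 4 made-up labels agree
print(accuracy_sketch(['STAR', 'GALAXY', 'STAR', 'STAR'],
                      ['STAR', 'GALAXY', 'GALAXY', 'STAR']))
```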

In [None]:
def metric(estimated_classes, true_classes):
    """
    Calculate the fraction of correct classification results. 
    
    input: estimated_classes, array - dim=(number of objects,)
           true_classes, array - dim=(number of objects, )
           
    output: fraction of correct classifications
    """
    
    # get the number of objects in the sample

    
    # get the number of correct classifications

    
    return

In [None]:
# Calculate metrics
accuracy = metric(class_estimate, y_validation)

print('accuracy: ', accuracy)

### Step 5: Optimize the classifier

The procedure used above aimed to illustrate how you can construct your own classifier, but it is not very practical. Whenever possible, we should make use of available libraries.  

Let's reproduce what we did above using [scikit-learn](https://scikit-learn.org/stable/).

In [None]:
from sklearn import neighbors 
from sklearn.metrics import accuracy_score

# determine number of neighbors
nn = 1
weights = 'uniform'              # this can be 'uniform' or 'distance'

# create an instance of the classifier
classifier = neighbors.KNeighborsClassifier(nn, weights=weights)

# train (or fit) the classifier
classifier.fit(X_train, y_train)

# predict the classes of the validation sample
class_estimate_sklearn = classifier.predict(X_validation)

# calculate metrics
accuracy_sklearn = accuracy_score(y_validation, class_estimate_sklearn, normalize=True)

print('accuracy given by sklearn: ', accuracy_sklearn)

Using scikit-learn allows us to easily manipulate the parameters of our algorithm.  
Before we can move forward, we need to optimize these results to the best of our abilities.  
Ask yourself: **Can the results above be improved?**  

*Try changing the number of neighbors and weights and see how the results change*

*Only go to the next step once you are satisfied with the results!!!*
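
One way to organize this exploration is a small loop over the hyperparameters, scoring each combination on the validation sample. The sketch below runs on made-up toy data; the grid of values for the number of neighbors is an arbitrary assumption, not a recommendation.

```python
import numpy as np
from sklearn import neighbors
from sklearn.metrics import accuracy_score

# toy two-class data (made up): two overlapping Gaussian blobs
rng = np.random.default_rng(1)
X_a = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
X_b = rng.normal(loc=2.0, scale=1.0, size=(200, 2))
X_toy = np.vstack([X_a, X_b])
y_toy = np.array(['STAR'] * 200 + ['GALAXY'] * 200)

# shuffle, then use 300 objects for training and 100 for validation
idx = rng.permutation(len(y_toy))
train_idx, val_idx = idx[:300], idx[300:]

# try every combination and store the validation accuracy
results = {}
for k in [1, 3, 5, 11, 21]:
    for w in ['uniform', 'distance']:
        clf = neighbors.KNeighborsClassifier(k, weights=w)
        clf.fit(X_toy[train_idx], y_toy[train_idx])
        pred = clf.predict(X_toy[val_idx])
        results[(k, w)] = accuracy_score(y_toy[val_idx], pred)

best = max(results, key=results.get)
print('best (k, weights):', best, 'validation accuracy:', results[best])
```

Only the validation sample is used to compare the combinations, so the test sample stays untouched for the final evaluation.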

### Step 6: Calculate final results

In order to ensure some generality to the results of the trained classifier, you must always report its performance on an independent data set, which was used neither for training nor for optimization. The aim of the `test` sample is to mimic the results which you would find in the target sample.

In [None]:
# predict the classes of the test sample


# calculate metrics
accuracy_test = accuracy_score(y_test, class_estimate_test, normalize=True)

print('accuracy calculated on test sample : ', accuracy_test)

And this is the accuracy of your optimized classifier, and the number you should report when quoting any output from it!

However, this number is not the reason your classifier was built!  
Now that you have a working classifier your desired output is a list of classes for each object in a completely new sample for which no labels exist.  

So let's load yet another data set, for which no labels are known.

In [None]:
# load data
data_target = pd.read_csv('../data/SDSS_star_galaxy_target.csv')

# check features
data_target.keys()

**Notice that, to save time, this data set has already been cleaned. In a realistic situation, all the pre-processing performed on the labelled data which does not involve the labels themselves should also be applied to the target sample!**

Now, let's use our classifier:

In [None]:
# Use the classifier you optimized above to estimate the classes for the target sample
class_estimate_target = 

# quick look into the first classes


#### You can now say that you have a new catalog of stars and galaxies, given by:

In [None]:
class_estimate_target

#### whose classes can be trusted to the level of:

In [None]:
print(round(100*accuracy_test), '%')

### You can now use this catalog to do your science!

<div class="alert alert-info">
PS2: In this notebook we used only 2 features due to time constraints, but feel free to try it with all columns. Try assessing, for example, how your results change when you add more information. 
</div>

------------------------------------------------------------------------------------------------------------------
Summary:

## Machine Learning Model    

**Task**:  Identify stars and galaxies based on their photometric measurements.  

       input: 2 different colours in SDSS broad-band filters: u-g and g-r
       output: estimated classes
       
**Task Category**:   Classification

**Data**: Extract from SDSS-DR14 as available through [Kaggle](https://www.kaggle.com/lucidlenn/sloan-digital-sky-survey#Skyserver_SQL2_27_2018%206_51_39%20PM.csv)  
        
        2 Features, x 
              u-g       u - g colour in SDSS system
              g-r       g - r colour in SDSS system
    
        1 response variable (label), y    
            class
            
**Machine Learning category**:  

        Supervised Learning

**Set of possible samples**:  

        ~9100 objects
        
**Set of possible labels**:

        2 classes: STAR or GALAXY
        
**Learner**:
    
        1 - Nearest neighbor    
        
**Loss function**:

        1 - Distance-based