# Classical Machine Learning Algorithms Applied to the Lens Classification Problem

### Authors: Jenny Kim (jennykim1016), Ji Won Park (jiwoncpark)

In this notebook, we apply classical machine learning algorithms such as linear SVC, nearest neighbors, and random forest to the problem of classifying lenses vs. non-lenses.

In [None]:
from __future__ import print_function
import sys, os
realizer_path = os.path.join(os.environ['SLREALIZERDIR'], 'slrealizer')
sys.path.insert(0, realizer_path)
from utils.utils import *
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
%load_ext autoreload
%autoreload 2

In [None]:
data_path = os.path.join(os.environ['SLREALIZERDIR'], 'data')

lens_object_f = os.path.join(data_path, 'lens_object_table.csv')
nonlens_object_f = os.path.join(data_path, 'nonlens_object_table.csv')

lens_obj = pd.read_csv(lens_object_f)
num_data = len(lens_obj)
nonlens_obj = pd.read_csv(nonlens_object_f).query('(u_trace < 5.12)').sample(num_data, random_state=123).reset_index(drop=True)
assert len(lens_obj) == len(nonlens_obj)

##  Make the feature set

Based on the corner plot that we drew of SDSS and OM10 (see notebook `Comparing+OM10+vs+SDSS+Objects`), we identify the features that seem to differ most strongly between lenses and non-lenses. We will hand-engineer the following six features:

- Difference in sizes between u and z bands
- Difference in ellipticities between u and z bands (e)
- Difference in rotation angles of the systems between u and z bands (ϕ)
- Difference in angles (ω) between centroid positions and galactic shears 
- Difference in magnitudes between u and z bands
- Difference in positions of the centroid between u and z bands (x)

In [None]:
for df in [lens_obj, nonlens_obj]:
    for b in 'uz':
        df[b + '_e'], df[b + '_phi'] = e1e2_to_ephi(e1=df[b + '_e1'], e2=df[b + '_e2'])
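        # Use the current table's fluxes (not always lens_obj) so non-lenses get their own magnitudes
        df[b + '_mag'] = from_flux_to_mag(df[b + '_apFlux'], from_unit='nMgy')
        # Replace non-finite magnitudes with a sentinel value, avoiding chained assignment
        df.loc[~np.isfinite(df[b + '_mag']), b + '_mag'] = 100.0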
        df[b + '_posmod'] = np.power(np.power(df[b + '_x'], 2.0) + np.power(df[b + '_y'], 2.0), 0.5)
        df[b + '_omega'] = (df[b + '_e1']*df[b + '_x'] + df[b + '_e2']*df[b + '_y'])/(df[b + '_e']*df[b + '_posmod'])

for df in [lens_obj, nonlens_obj]:
    df['delta_pos'] = np.power(np.power(df['u_x'] - df['z_x'], 2.0) + np.power(df['u_y'] - df['z_y'], 2.0), 0.5)
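
The helpers `e1e2_to_ephi` and `from_flux_to_mag` come from `utils.utils` and are not shown in this notebook. The sketch below is a hypothetical stand-in for each, assuming the common spin-2 ellipticity convention and the standard SDSS nanomaggy zero point of 22.5 mag; the real helpers may differ (for example, by using asinh magnitudes). It also shows why non-finite magnitudes can arise and need a sentinel value.

```python
import numpy as np

def e1e2_to_ephi_sketch(e1, e2):
    """Hypothetical stand-in for `e1e2_to_ephi` (assumed convention)."""
    e = np.hypot(e1, e2)             # ellipticity modulus
    phi = 0.5 * np.arctan2(e2, e1)   # position angle in radians
    return e, phi

def nmgy_to_mag_sketch(flux_nmgy):
    """Hypothetical stand-in for `from_flux_to_mag(..., from_unit='nMgy')`."""
    flux = np.asarray(flux_nmgy, dtype=float)
    # Zero or negative fluxes give inf or nan; silence the warnings here
    with np.errstate(divide='ignore', invalid='ignore'):
        return 22.5 - 2.5 * np.log10(flux)

e, phi = e1e2_to_ephi_sketch(np.array([0.3, 0.0]), np.array([0.0, 0.4]))
mag = nmgy_to_mag_sketch([1.0, 100.0, 0.0])
print(e, phi)
print(mag)  # the zero-flux entry is not finite
```

Under these assumptions a flux of exactly zero maps to an infinite magnitude, which is the kind of entry the cell above replaces with 100.0.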

In [None]:
def make_truth_table(df, attributes, truth_value, save_file=None):
    """Build a table of u-minus-z band differences for `attributes`, plus a constant label column."""
    num_data = len(df)
    features_dict = {}
    col_names = ['delta_' + a for a in attributes] + ['label']
    
    for a in attributes:
        if a == 'pos':
            features_dict['delta_' + a] = df['delta_' + a]
        else:
            features_dict['delta_' + a] = df['u_' + a] - df['z_' + a]
        
    features_dict['label'] = np.ones((num_data, ))*truth_value
    data = pd.DataFrame.from_dict(features_dict)
    data = data[col_names]
    if save_file is not None:
        data.to_csv(save_file)
    return data

In [None]:
attributes = ['trace', 'e', 'phi', 'mag', 'pos', 'omega']
lens_data = make_truth_table(df=lens_obj, attributes=attributes, truth_value=1)
nonlens_data = make_truth_table(df=nonlens_obj, attributes=attributes, truth_value=0)

In [None]:
print(lens_data.shape, nonlens_data.shape)
total_data = pd.concat([lens_data, nonlens_data], axis=0)
print(total_data.shape)

# Machine Learning + Precision Recall Curve for various methods

We are going to use three different algorithms: linear SVC, k-nearest neighbors, and random forest. For k-nearest neighbors and random forest, we vary the number of neighbors and the number of trees (`n_estimators`), respectively. We then compare the classifiers to see which performs best.

In order to do so, we import the necessary packages:

In [None]:
from sklearn import svm
from sklearn.calibration import CalibratedClassifierCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import roc_curve
from sklearn.metrics import auc
from sklearn import model_selection

We divide the entire dataset into training and test sets.

In [None]:
total_data = total_data.values
# The last column holds the label; the remaining columns are the features
y = total_data[:, -1]
X = total_data[:, :-1]
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.33, random_state=123, shuffle=True)

We check that the training and test sets each contain a roughly balanced mix of positive and negative examples.

In [None]:
print("Fraction of positive examples in training set: %0.2f" 
      %(len(y_train[y_train==1])/float(len(y_train))),
      "\n ... in test set: %0.2f" 
      %(len(y_test[y_test==1])/float(len(y_test))))
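
A plain shuffled split only balances the classes approximately. If an exact balance is wanted, `train_test_split` accepts a `stratify` argument. The sketch below uses synthetic stand-in arrays (`X_demo`, `y_demo` are illustrative names, not the notebook's data) to show the effect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(123)
X_demo = rng.normal(size=(200, 6))       # 200 fake objects, 6 fake features
y_demo = np.repeat([0.0, 1.0], 100)      # balanced labels, as in this notebook

# stratify=y_demo preserves the class proportions in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=123, stratify=y_demo)

print("Fraction of positives in training set: %0.2f" % y_tr.mean())
print("Fraction of positives in test set: %0.2f" % y_te.mean())
```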

## 2. Precision-recall curves (PRCs) for various methods

First, we define `models_dict`, a dictionary of models to use. The `colors_dict` assigns a color to each model for plotting purposes.

In [None]:
models_dict = {'svm': CalibratedClassifierCV(svm.LinearSVC()),
               'nn3': KNeighborsClassifier(n_neighbors=3),
               'nn5': KNeighborsClassifier(n_neighbors=5),
               'rf3': RandomForestClassifier(n_estimators=3),
               'rf5': RandomForestClassifier(n_estimators=5),
               'rf10': RandomForestClassifier(n_estimators=10),
              }

colors_dict = {'svm': 'red',
               'nn3': 'orange',
               'nn5': 'green',
               'rf3': 'blue',
               'rf5': 'purple',
               'rf10': 'black'
               }
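
`LinearSVC` is wrapped in `CalibratedClassifierCV` because the curves below need `predict_proba`, which `LinearSVC` does not provide on its own. A minimal sketch on synthetic data (the `*_demo` names are illustrative only):

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification

X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=123)

# The bare linear SVC exposes decision_function but no predict_proba
has_proba = hasattr(LinearSVC(), 'predict_proba')

# The calibration wrapper fits the SVC with internal cross-validation
# and maps its decision scores to class probabilities
calibrated = CalibratedClassifierCV(LinearSVC())
calibrated.fit(X_demo, y_demo)
proba = calibrated.predict_proba(X_demo)

print(has_proba, proba.shape)
```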

In [None]:
for label, model in models_dict.items():  # .items() works in both Python 2 and 3
    model.fit(X_train, y_train)
    y_score = model.predict_proba(X_test)[:, 1]
    precision, recall, _ = precision_recall_curve(y_test, y_score)
    plt.plot(recall, precision, label=label, color=colors_dict[label])

plt.xlabel('Recall')
plt.ylabel('Precision')
plt.ylim([-0.1, 1.1])
plt.xlim([-0.1, 1.1])
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.title('Precision-recall curves (PRCs) for various methods')
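
To compare PRCs with a single number, one option is the average precision, available as `average_precision_score` in scikit-learn. This is a suggested addition rather than part of the original analysis; the toy labels and scores below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, average_precision_score

y_true_demo = np.array([0, 0, 1, 1])
y_score_demo = np.array([0.1, 0.4, 0.35, 0.8])

precision_demo, recall_demo, _ = precision_recall_curve(y_true_demo, y_score_demo)

# Average precision: precision at each threshold, weighted by the gain in recall
ap_demo = average_precision_score(y_true_demo, y_score_demo)
print("Average precision: %0.2f" % ap_demo)  # 0.5*1 + 0.5*(2/3) = 0.83
```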

## 3. Receiver operating characteristic (ROC) curve for various methods

We refit the same models from `models_dict` and compare their ROC curves, reporting the area under each curve (AUC) in the legend. The dashed diagonal marks the expected performance of a random classifier.

In [None]:
for label, model in models_dict.items():  # .items() works in both Python 2 and 3
    model.fit(X_train, y_train)
    y_score = model.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, y_score)
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, label=label + " area = %0.2f" %roc_auc, color=colors_dict[label])

plt.xlabel('False positive rate (FPR)')
plt.ylabel('True positive rate (TPR)')
plt.ylim([-0.1, 1.1])
plt.plot([-0.1, 1.1], [-0.1, 1.1], 'k--')
plt.xlim([-0.1, 1.1])
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.title('Receiver operating characteristic (ROC) curves for various methods')
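
The area under the ROC curve can be read as the probability that a randomly chosen positive example is scored above a randomly chosen negative one. The toy example below (made-up labels and scores) checks this by counting pairs, and confirms that `auc(fpr, tpr)` agrees with `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc, roc_auc_score

y_true_demo = np.array([0, 0, 1, 1])
y_score_demo = np.array([0.1, 0.4, 0.35, 0.8])

fpr_demo, tpr_demo, _ = roc_curve(y_true_demo, y_score_demo)
auc_from_curve = auc(fpr_demo, tpr_demo)
auc_direct = roc_auc_score(y_true_demo, y_score_demo)

# Fraction of (positive, negative) pairs ranked correctly
pos = y_score_demo[y_true_demo == 1]
neg = y_score_demo[y_true_demo == 0]
pair_fraction = np.mean([p > n for p in pos for n in neg])

print(auc_from_curve, auc_direct, pair_fraction)  # all 0.75
```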

## 4. Feature selection

We could see that the random forest classifier with `N=10` trees and above performed the best. Because tree ensembles can also measure how useful each feature was, we draw a bar chart of feature importances. For this purpose, we will use the `ExtraTreesClassifier` model instead of the original random forest with 10 trees. Plotting instructions are taken from this [page](http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html) of the scikit-learn documentation.

In [None]:
from sklearn.ensemble import ExtraTreesClassifier
forest = ExtraTreesClassifier(n_estimators=250,
                              random_state=123)

forest.fit(X, y)

In [None]:
col_names = ['delta_' + a for a in attributes]
importances = forest.feature_importances_
std = np.std([tree.feature_importances_ for tree in forest.estimators_],
             axis=0)
indices = np.argsort(importances)[::-1]
col_names_sorted = [col_names[o] for o in indices]

# Print the feature ranking
print("Feature ranking:")

for f in range(X.shape[1]):
    print("%d. feature %s (%f)" % (f + 1, col_names_sorted[f], importances[indices[f]]))

In [None]:
# Plot the feature importances of the forest
# Black vertical lines show the tree-to-tree variability (standard deviation)
plt.figure()
plt.title("Feature importances")
plt.bar(range(X.shape[1]), importances[indices],
       color="r", yerr=std[indices], align="center")
plt.xticks(range(X.shape[1]), col_names_sorted, rotation='vertical')
plt.xlim([-1, X.shape[1]])

plt.show()
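
As a sanity check on how to read such a ranking, the sketch below fits an `ExtraTreesClassifier` to synthetic data in which the label depends on a single feature. The data and the `*_demo` names are made up for illustration and are unrelated to the lens tables.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.RandomState(123)
X_demo = rng.normal(size=(500, 6))
y_demo = (X_demo[:, 0] > 0).astype(float)   # label depends on feature 0 only

forest_demo = ExtraTreesClassifier(n_estimators=50, random_state=123)
forest_demo.fit(X_demo, y_demo)

# Impurity-based importances are normalized to sum to one
importances_demo = forest_demo.feature_importances_
top_feature = int(np.argmax(importances_demo))
print(top_feature, importances_demo.round(2))
```

The informative feature should come out on top, with the remaining importance spread over the noise features. Note that the importances above are computed on the same data used for fitting, as in this notebook.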