## Overview
A model of stellar magnitude was presented in *[Dysonian SETI With Machine Learning](https://www.kaggle.com/solorzano/dysonian-seti-with-machine-learning)*. Some changes and improvements are introduced in this kernel:

* We're using a [dataset of 257K Gaia DR2 stars](https://www.kaggle.com/solorzano/257k-gaia-dr2-stars) from the Northern and Southern hemispheres, with photometry from a number of other databases.
* Some systematics/artifacts are removed from the data.
* Multiple runs of k-fold cross-validation are averaged out.
* A position-based magnitude bias correction is applied to the residuals of the blend.
* Model RMSE is ~0.115 magnitudes.
* Giants are removed post-training, and RMSE is subsequently ~0.10. It's a sensitive model.

Some outliers and a sample of ordinary stars are shown in <a href="#space_distribution_section">3D scatter charts</a>. Model results are made available in the output tab. The *anomaly* metric is the key result. It has a standard deviation of ~1.0.

**Follow-ups:**
* [Multi-Stellar SETI Candidate Selection, Part 1](https://www.kaggle.com/solorzano/multi-stellar-seti-candidate-selection-part-1)
* [Multi-Stellar SETI Candidate Selection, Part 2](https://www.kaggle.com/solorzano/multi-stellar-seti-candidate-selection-part-2)
* [Multi-Stellar SETI Candidate Selection, Part 3](https://www.kaggle.com/solorzano/multi-stellar-seti-candidate-selection-part-3)

**Updates:**
* 9/1/2018: Removed region with apparent Gaia magnitude artifacts. Additionally, AllWISE magnitudes other than *allwise_w2* are no longer used because of line-of-sight artifacts.

## Data

Data was obtained from the [Gaia Archive](https://gea.esac.esa.int/archive/), using its ADQL query tool, and made available in a [Kaggle dataset](https://www.kaggle.com/solorzano/257k-gaia-dr2-stars). In addition to Gaia DR2 parallax and photometry, the dataset includes magnitude observations from GSC 2.3, PPMXL, 2MASS, AllWISE, and Tycho2.

In [None]:
import pandas as pd

data_temp = pd.read_csv('../input/257k-gaiadr2-sources-with-photometry.csv', dtype={'source_id': str})

In [None]:
len(data_temp)

Columns found in the data frame are:

In [None]:
data_temp.columns

## Removal of duplicates and region with artifact
There are rows with duplicate *source_id* values (multiple Tycho2 matches) and some apparent line-of-sight artifacts, as explained in [*Removal of Gaia DR2 Stars With Apparent Systematic*](https://www.kaggle.com/solorzano/removal-of-gaia-dr2-stars-with-apparent-systematic). We'll just remove all of the rows that are potentially problematic.

In [None]:
should_remove_set = set(pd.read_csv('../input/257k-gaiadr2-should-remove.csv', dtype={'source_id': str})['source_id'])

In [None]:
data_temp = data_temp[~data_temp['source_id'].isin(should_remove_set)]
data_temp.reset_index(inplace=True, drop=True)

In [None]:
len(data_temp)

In [None]:
assert len(data_temp) == len(set(data_temp['source_id']))

## Note about AllWISE artifact
In an earlier version of the kernel we had removed the region around an apparent line-of-sight artifact. Anomalously dim stars would cluster near a plane given by:

0.64X + 0.64Y - Z = 0

Upon further analysis, we narrowed the issue down to the *allwise_w1* magnitude. Magnitudes *allwise_w3* and *allwise_w4* also seem to have their own systematics, based on cross-database analysis. We're only using *allwise_w2* at this point.
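
As a rough illustration of how proximity to that plane can be measured, the sketch below computes the perpendicular distance of a point from the plane 0.64X + 0.64Y - Z = 0. The helper name `distance_from_artifact_plane` and the sample coordinates are hypothetical and not part of the original kernel.

```python
import numpy as np

# Normal vector of the plane 0.64X + 0.64Y - Z = 0 (coefficients from the text).
PLANE_NORMAL = np.array([0.64, 0.64, -1.0])

def distance_from_artifact_plane(x, y, z):
    # Hypothetical helper: perpendicular distance from (x, y, z) to the plane.
    points = np.stack([np.asarray(x, dtype=float),
                       np.asarray(y, dtype=float),
                       np.asarray(z, dtype=float)], axis=-1)
    return np.abs(points @ PLANE_NORMAL) / np.linalg.norm(PLANE_NORMAL)

# Made-up coordinates: the first point lies on the plane, the second does not.
plane_distances = distance_from_artifact_plane([10.0, 10.0], [10.0, 10.0], [12.8, 0.0])
print(plane_distances)
```

A cut on this distance is one way a region around the plane could be selected, although the kernel no longer needs it now that only *allwise_w2* is used.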

## Train/Test split

We'll use 90% of the data (*work_data*) for cross-validation and hyperparameter optimization. The remaining 10% (*test_data*) will be set aside for final confirmation and validation.

In [None]:
import numpy as np

np.random.seed(2018080028)

train_mask = np.random.rand(len(data_temp)) < 0.9
work_data = data_temp[train_mask]
work_data.reset_index(inplace=True, drop=True)
test_data = data_temp[~train_mask]
test_data.reset_index(inplace=True, drop=True)
data_temp = None # Get rid of big frame

In [None]:
len(work_data)

## Helper functions used in modeling
Some boilerplate:

In [None]:
import inspect

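pd_concat_argspec = inspect.getfullargspec(pd.concat)
# 'sort' may be a keyword-only argument in newer Pandas versions, so check both lists.
pd_concat_has_sort = 'sort' in (pd_concat_argspec.args + pd_concat_argspec.kwonlyargs)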

def pd_concat(frames):
    # Due to Pandas versioning issue
    new_frame = pd.concat(frames, sort=False) if pd_concat_has_sort else pd.concat(frames)
    new_frame.reset_index(inplace=True, drop=True)
    return new_frame
    
def plt_hist(x, bins=30):
    # plt.hist() can be very slow.
    histo, edges = np.histogram(x, bins=bins)
    plt.bar(0.5 * edges[1:] + 0.5 * edges[:-1], histo, width=(edges[-1] - edges[0])/(len(edges) + 1))

The following function is similar to the one we used to train the original model, except this one averages out multiple runs of k-fold cross-validation. It also has an optional *trim_fraction* parameter that allows outliers of training subsets to be removed. The concept is similar to that of a trimmed average. We will use it in residual and squared error modeling.

In [None]:
import types
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler 

def get_cv_model_transform(data_frame, label_extractor, var_extractor, trainer_factory, response_column='response', 
                           id_column='source_id', n_runs=2, n_splits=2, max_n_training=None, scale = False,
                           trim_fraction=None):
    '''
    Creates a transform function that results from training a regression model with cross-validation.
    The transform function takes a frame and adds a response column to it.
    '''
    default_model_list = []
    sum_series = pd.Series([0] * len(data_frame))
    for r in range(n_runs):
        shuffled_frame = data_frame.sample(frac=1)
        shuffled_frame.reset_index(inplace=True, drop=True)
        response_frame = pd.DataFrame(columns=[id_column, 'response'])
        kf = KFold(n_splits=n_splits)
        first_fold = True
        for train_idx, test_idx in kf.split(shuffled_frame):
            train_frame = shuffled_frame.iloc[train_idx]
            if trim_fraction is not None:
                helper_labels = label_extractor(train_frame) if isinstance(label_extractor, types.FunctionType) else train_frame[label_extractor] 
                train_label_ordering = np.argsort(helper_labels)
                orig_train_len = len(train_label_ordering)
                head_tail_len_to_trim = int(round(orig_train_len * trim_fraction * 0.5))
                assert head_tail_len_to_trim > 0
                trimmed_ordering = train_label_ordering[head_tail_len_to_trim:-head_tail_len_to_trim]
                train_frame = train_frame.iloc[trimmed_ordering]
            if max_n_training is not None:
                train_frame = train_frame.sample(max_n_training)
            train_labels = label_extractor(train_frame) if isinstance(label_extractor, types.FunctionType) else train_frame[label_extractor]
            test_frame = shuffled_frame.iloc[test_idx]
            train_vars = var_extractor(train_frame)
            test_vars = var_extractor(test_frame)
            scaler = None
            if scale:
                scaler = StandardScaler()  
                scaler.fit(train_vars)
                train_vars = scaler.transform(train_vars)  
                test_vars = scaler.transform(test_vars) 
            trainer = trainer_factory()
            fold_model = trainer.fit(train_vars, train_labels)
            test_responses = fold_model.predict(test_vars)
            test_id = test_frame[id_column]
            assert len(test_id) == len(test_responses)
            fold_frame = pd.DataFrame({id_column: test_id, 'response': test_responses})
            response_frame = pd_concat([response_frame, fold_frame])
            if first_fold:
                first_fold = False
                default_model_list.append((scaler, fold_model,))
        response_frame.sort_values(id_column, inplace=True)
        response_frame.reset_index(inplace=True, drop=True)
        assert len(response_frame) == len(data_frame), 'len(response_frame)=%d' % len(response_frame)
        sum_series += response_frame['response']
    cv_response = sum_series / n_runs
    assert len(cv_response) == len(data_frame)
    assert len(default_model_list) == n_runs
    response_map = dict()
    sorted_id = np.sort(data_frame[id_column].values) 
    for i in range(len(cv_response)):
        response_map[str(sorted_id[i])] = cv_response[i]
    response_id_set = set(response_map)
    
    def _transform(_frame):
        _in_trained_set = _frame[id_column].astype(str).isin(response_id_set)
        _trained_frame = _frame[_in_trained_set].copy()
        _trained_frame.reset_index(inplace=True, drop=True)
        if len(_trained_frame) > 0:
            _trained_id = _trained_frame[id_column]
            _tn = len(_trained_id)
            _response = pd.Series([None] * _tn)
            for i in range(_tn):
                _response[i] = response_map[str(_trained_id[i])]
            _trained_frame[response_column] = _response
        _remain_frame = _frame[~_in_trained_set].copy()
        _remain_frame.reset_index(inplace=True, drop=True)
        if len(_remain_frame) > 0:
            _unscaled_vars = var_extractor(_remain_frame)
            _response_sum = pd.Series([0] * len(_remain_frame))
            for _model_tuple in default_model_list:
                _scaler = _model_tuple[0]
                _model = _model_tuple[1]
                _vars = _unscaled_vars if _scaler is None else _scaler.transform(_unscaled_vars)
                _response = _model.predict(_vars)
                _response_sum += _response
            _remain_frame[response_column] = _response_sum / len(default_model_list)
        _frames_list = [_trained_frame, _remain_frame]
        return pd_concat(_frames_list)
    return _transform
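
The core idea of the function above, averaging out-of-fold predictions over several shuffled k-fold runs, can be sketched in a few lines. This is a simplified illustration on synthetic data: the name `averaged_oof_predictions` is hypothetical, a plain linear regression stands in for the real trainers, and trimming, scaling and the fallback models for unseen rows are left out.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

def averaged_oof_predictions(X, y, n_runs=3, n_splits=2, seed=0):
    # Hypothetical helper: average out-of-fold predictions over several shuffled runs.
    oof_sum = np.zeros(len(y))
    for run in range(n_runs):
        kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed + run)
        for train_idx, test_idx in kf.split(X):
            model = LinearRegression().fit(X[train_idx], y[train_idx])
            # Within a run, each row is predicted once, by a model that never saw it.
            oof_sum[test_idx] += model.predict(X[test_idx])
    return oof_sum / n_runs

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)
oof = averaged_oof_predictions(X, y)
oof_rmse = np.sqrt(np.mean((y - oof) ** 2))
print('Out-of-fold RMSE: %.3f' % oof_rmse)
```

Averaging several runs reduces the dependence of each star's response on one particular random split.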

In [None]:
import scipy.stats as stats

def print_evaluation(data_frame, label_column, response_column):
    response = response_column(data_frame) if isinstance(response_column, types.FunctionType) else data_frame[response_column]
    label = label_column(data_frame) if isinstance(label_column, types.FunctionType) else data_frame[label_column]
    residual = label - response
    rmse = np.sqrt(np.sum(residual ** 2) / len(data_frame))
    correl = stats.pearsonr(response, label)[0]
    print('RMSE: %.4f | Correlation: %.4f' % (rmse, correl,), flush=True)

## Extra columns
We'll add some columns for convenience and informational purposes.

In [None]:
def transform_init(data_frame):    
    new_frame = data_frame.copy()
    new_frame.reset_index(inplace=True, drop=True)
    distance = 1000.0 / new_frame['parallax']
    new_frame['distance'] = distance
    new_frame['abs_mag_ne'] = new_frame['phot_g_mean_mag'] - 5 * (np.log10(distance) - 1)
    new_frame['color_index'] = new_frame['phot_bp_mean_mag'] - new_frame['phot_rp_mean_mag']
    return new_frame
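
As a quick sanity check of the formulas in *transform_init*, here is a worked example with made-up numbers. It assumes parallax in milliarcseconds (so that `1000.0 / parallax` is a distance in parsecs) and uses the standard distance modulus with no extinction term. The helper name `absolute_magnitude` is ours.

```python
import numpy as np

def absolute_magnitude(apparent_mag, parallax_mas):
    # Distance in parsecs from a parallax in milliarcseconds.
    distance_pc = 1000.0 / parallax_mas
    # Distance modulus with no extinction term, as in transform_init.
    return apparent_mag - 5 * (np.log10(distance_pc) - 1)

# A made-up star at 100 pc (parallax of 10 mas) with apparent magnitude 10.
print(absolute_magnitude(10.0, 10.0))
# At exactly 10 pc (parallax of 100 mas) absolute and apparent magnitude coincide.
print(absolute_magnitude(7.5, 100.0))
```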

In [None]:
work_data = transform_init(work_data)

## Model features
Primarily, we want to use differences between magnitude observations made with different photometric filters. We've found it's better to separate them into groups first.

In [None]:
mag_column_groups = [
    ['phot_g_mean_mag', 'phot_bp_mean_mag', 'phot_rp_mean_mag'],
    ['gsc23_v_mag', 'gsc23_b_mag'],
    ['ppmxl_b1mag', 'ppmxl_b2mag', 'ppmxl_r1mag', 'ppmxl_imag'],
    ['tmass_j_m', 'tmass_h_m', 'tmass_ks_m'],
    ['tycho2_bt_mag', 'tycho2_vt_mag'],
]

The following function gets the within-group differences.

In [None]:
def populate_mag_columns(data_frame, feature_list):
    for group in mag_column_groups:
        len_group = len(group)
        assert len_group >= 2
        for i in range(1, len_group):
            mag_diff = data_frame[group[i]] - data_frame[group[i - 1]]
            feature_list.append(mag_diff)
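
To make the within-group differencing concrete, here is the same loop applied to a toy frame. The column names `band_a`, `band_b` and `band_c` are hypothetical stand-ins for a real magnitude group.

```python
import pandas as pd

# Toy frame with one hypothetical magnitude group of three bands.
toy_mags = pd.DataFrame({'band_a': [10.0, 12.0],
                         'band_b': [10.5, 11.0],
                         'band_c': [11.5, 10.0]})
toy_group = ['band_a', 'band_b', 'band_c']

# Differences between adjacent bands: a group of n bands yields n - 1 features.
toy_diffs = [toy_mags[toy_group[i]] - toy_mags[toy_group[i - 1]]
             for i in range(1, len(toy_group))]
print([d.tolist() for d in toy_diffs])
```

With the five groups defined above (sizes 3, 2, 4, 3 and 2), this scheme produces 2 + 1 + 3 + 2 + 1 = 9 difference features.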

Next, we define a function that extracts the features used to train regression models. Note that in addition to calling *populate_mag_columns*, we've added some differences between Gaia magnitudes and specific magnitudes from other databases. We've found these additional differences improve model performance.

A star's position in the sky is also a feature. We could presumably lose some information by including position, but it greatly improves model performance.

In addition to distance, we're including distance in the galactic plane and a couple of Gaussian transformations of the distance *from* the galactic plane, which seem to help the model resolve variations in extinction due to proximity to galactic latitude 0.

In [None]:
# Hyperparameters for gaussian transformations of distance-from-plane.
PLANE_DENSITY_2T_VAR1 = 2 * 15 ** 2
PLANE_DENSITY_2T_VAR2 = 2 * 50 ** 2
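
To get a feel for these two constants, the sketch below evaluates both Gaussian transformations at a few illustrative heights above the plane, in the same units as *distance*. The variable names and the sample heights are ours.

```python
import numpy as np

# Same values as the constants above: 2 * sigma^2 for sigma = 15 and sigma = 50.
TWO_VAR_NARROW = 2 * 15 ** 2
TWO_VAR_WIDE = 2 * 50 ** 2

# Illustrative heights above (or below) the galactic plane.
height = np.array([0.0, 15.0, 50.0, 200.0])
narrow = np.exp(-height ** 2 / TWO_VAR_NARROW)
wide = np.exp(-height ** 2 / TWO_VAR_WIDE)
for h, n, w in zip(height, narrow, wide):
    print('height=%6.1f  narrow=%.4f  wide=%.4f' % (h, n, w))
```

Both features equal 1 in the plane and decay toward 0 away from it. The narrow one drops to exp(-0.5) at a height of 15, the wide one at a height of 50.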

In [None]:
def extract_model_vars(data_frame):
    distance = data_frame['distance'].values
    log_distance = np.log(distance)
    latitude_rad = np.deg2rad(data_frame['b'].values)
    longitude_rad = np.deg2rad(data_frame['l'].values)
    sin_lat = np.sin(latitude_rad)
    cos_lat = np.cos(latitude_rad)
    sin_long = np.sin(longitude_rad)
    cos_long = np.cos(longitude_rad)
    
    distance_in_plane = np.abs(distance * cos_lat)
    distance_from_plane_sq = (distance * sin_lat) ** 2
    plane_density_feature1 = np.exp(-distance_from_plane_sq / PLANE_DENSITY_2T_VAR1)
    plane_density_feature2 = np.exp(-distance_from_plane_sq / PLANE_DENSITY_2T_VAR2)
    feature_list = [log_distance, distance, 
                    distance_in_plane, 
                    plane_density_feature1, plane_density_feature2,
                    sin_lat, cos_lat, sin_long, cos_long
                   ]
    
    populate_mag_columns(data_frame, feature_list)
    mag_g = data_frame['phot_g_mean_mag']
    mag_rp = data_frame['phot_rp_mean_mag']
    mag_bp = data_frame['phot_bp_mean_mag']
    feature_list.append(mag_g - data_frame['allwise_w2'])
    feature_list.append(mag_bp - data_frame['tmass_j_m'])
    feature_list.append(mag_bp - data_frame['gsc23_b_mag'])
    feature_list.append(mag_g - data_frame['ppmxl_r1mag'])
    feature_list.append(mag_rp - data_frame['ppmxl_imag'])
    feature_list.append(mag_rp - data_frame['tycho2_bt_mag'])
    feature_list.append(data_frame['tmass_j_m'] - data_frame['allwise_w2'])
    feature_list.append(data_frame['gsc23_b_mag'] - data_frame['ppmxl_imag'])
    
    return np.transpose(feature_list)    

The regression label will be defined by the following variable.

In [None]:
LABEL_COLUMN = 'phot_g_mean_mag'

Regression training is done over a sample of available data. At most we use this many records in each pass:

In [None]:
MAX_N_TRAINING = 50000

## Neural Network

In [None]:
from sklearn.neural_network import MLPRegressor

def get_nn_trainer():
    return MLPRegressor(hidden_layer_sizes=(60, 60), max_iter=400, alpha=0.1, random_state=np.random.randint(1,10000))

In [None]:
def get_nn_transform(label_column):
    return get_cv_model_transform(work_data, label_column, extract_model_vars, get_nn_trainer, 
        n_runs=3, n_splits=2, max_n_training=MAX_N_TRAINING, response_column='nn_' + label_column, scale=True)

In [None]:
transform_nn = get_nn_transform(LABEL_COLUMN)
work_data = transform_nn(work_data)

In [None]:
print_evaluation(work_data, LABEL_COLUMN, 'nn_' + LABEL_COLUMN)

This is a little surprising when compared with the LightGBM result in the next section. It's unusual to find that a neural network does better than a GBM with a relatively small number of features.

## LightGBM model
We're using [LightGBM](https://github.com/Microsoft/LightGBM) this time. It's fast and accurate.

In [None]:
import lightgbm

def get_lgbm_trainer():
    return lightgbm.LGBMRegressor(num_leaves=80, max_depth=-1, learning_rate=0.1, n_estimators=1000, 
        subsample_for_bin=50000, reg_alpha=0.03, reg_lambda=0.0,
        random_state=np.random.randint(1,10000))

In [None]:
def get_lgbm_transform(label_column):
     return get_cv_model_transform(work_data, label_column, extract_model_vars, get_lgbm_trainer, 
        n_runs=2, n_splits=2, max_n_training=MAX_N_TRAINING, response_column='lgbm_' + label_column, scale=False)

In [None]:
transform_lgbm = get_lgbm_transform(LABEL_COLUMN)
work_data = transform_lgbm(work_data)

In [None]:
print_evaluation(work_data, LABEL_COLUMN, 'lgbm_' + LABEL_COLUMN)

## Blend
The blend only has two models this time.

In [None]:
def extract_blend_vars(data_frame):
    lgbm_responses = data_frame['lgbm_' + LABEL_COLUMN].values
    nn_responses = data_frame['nn_' + LABEL_COLUMN].values
    return np.transpose([lgbm_responses, nn_responses])

In [None]:
from sklearn import linear_model

def get_blend_trainer():
    return linear_model.LinearRegression()

In [None]:
def get_blend_transform(label_column):
    return get_cv_model_transform(work_data, label_column, extract_blend_vars, get_blend_trainer, 
        n_runs=3, n_splits=3, max_n_training=None, response_column='blend_' + label_column, scale=False)

In [None]:
transform_blend = get_blend_transform(LABEL_COLUMN)
work_data = transform_blend(work_data)

In [None]:
print_evaluation(work_data, LABEL_COLUMN, 'blend_' + LABEL_COLUMN)
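
To show why a linear blend can help, here is a minimal sketch on synthetic data. The two "model responses" are made up, with independent errors, and the blend is fit in-sample for brevity, whereas the kernel uses cross-validated responses.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
truth = rng.normal(size=2000)
# Two hypothetical base-model responses with independent errors.
model_a = truth + rng.normal(scale=0.3, size=2000)
model_b = truth + rng.normal(scale=0.4, size=2000)

def toy_rmse(label, response):
    return np.sqrt(np.mean((label - response) ** 2))

toy_blend_vars = np.transpose([model_a, model_b])
toy_blend = LinearRegression().fit(toy_blend_vars, truth)
blended = toy_blend.predict(toy_blend_vars)
print('A: %.3f  B: %.3f  blend: %.3f' % (
    toy_rmse(truth, model_a), toy_rmse(truth, model_b), toy_rmse(truth, blended)))
```

When the base models make partly independent errors, the weighted combination cancels some of them out.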

## Adjustment for position-based distortions
The base model already included position features, as they greatly improve model performance. It's usually possible to squeeze a bit more performance by modeling residuals.

More importantly, there's an issue that can be corrected here. Because the model includes position features, an extreme outlier can affect its neighbors. A concrete example is Gaia DR2 1596779097312755328, an extremely bright outlier. Its neighbors will be unusually *dim* according to the model. So we'll take advantage of the *trim_fraction* parameter of the cross-validation modeling transform to ignore outliers when training this position-based correction.

In [None]:
def get_bres_label(data_frame):
    return data_frame[LABEL_COLUMN] - data_frame['blend_' + LABEL_COLUMN]

Note that we're including *color_index* as a feature here, in case there are color-based dependencies to the distortions. It seems to help slightly.

In [None]:
def extract_bres_vars(data_frame):
    distance = data_frame['distance'].values
    latitude = np.deg2rad(data_frame['b'].values)
    longitude = np.deg2rad(data_frame['l'].values)
    position_z = distance * np.sin(latitude)
    projection = distance * np.cos(latitude)
    position_x = projection * np.cos(longitude)
    position_y = projection * np.sin(longitude)
    color_index = data_frame['color_index']
    return np.transpose([position_x, position_y, position_z,
                        color_index])    

A random forest works well for this, because of the nature of the feature space.

In [None]:
from sklearn.ensemble import RandomForestRegressor

def get_bres_trainer():
    return RandomForestRegressor(n_estimators=60, max_depth=18, min_samples_split=30, random_state=np.random.randint(1,10000))

In [None]:
transform_bres = get_cv_model_transform(work_data, get_bres_label, extract_bres_vars, get_bres_trainer, 
        n_runs=3, n_splits=2, max_n_training=MAX_N_TRAINING, response_column='modeled_bres', scale=False,
        trim_fraction=0.003)
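
To make the effect of `trim_fraction=0.003` concrete, the sketch below reproduces the trimming step of *get_cv_model_transform* in isolation on made-up residuals. The helper name `trim_by_label` is ours.

```python
import numpy as np

def trim_by_label(labels, trim_fraction):
    # Mirrors the trimming step in get_cv_model_transform: drop an equal number
    # of rows from the low and high ends of the label ordering.
    ordering = np.argsort(labels)
    n_each_end = int(round(len(ordering) * trim_fraction * 0.5))
    assert n_each_end > 0
    return ordering[n_each_end:-n_each_end]

rng = np.random.RandomState(0)
toy_residuals = rng.normal(scale=0.1, size=10000)
toy_residuals[0] = -5.0  # one made-up extreme (bright) outlier
kept = trim_by_label(toy_residuals, trim_fraction=0.003)
print(len(kept), 0 in kept)
```

With 10,000 rows, 15 are dropped from each end, so the extreme value at index 0 never reaches the training set.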

In [None]:
work_data = transform_bres(work_data)
print_evaluation(work_data, get_bres_label, 'modeled_bres')

The *model_response* column will contain the response of the blend plus the response of the residual model we just trained. 

In [None]:
def transform_final_model(data_frame):
    new_frame = data_frame.copy()
    new_frame['model_response'] = new_frame['blend_' + LABEL_COLUMN] + new_frame['modeled_bres']
    return new_frame

In [None]:
work_data = transform_final_model(work_data)
print_evaluation(work_data, LABEL_COLUMN, 'model_response')

It should be noted that excluding position features from the base model and applying a correction on the residuals wouldn't work this well. This means there's likely a complex interaction between position and spectrophotometric features.

## Giant removal
In prior kernels, we found that the model struggles with giant stars. We do want the model to learn about the spectrophotometric characteristics of giant stars, but we'll remove them from further consideration at this stage of the analysis.

In [None]:
def color_index(data_frame):
    return data_frame['phot_bp_mean_mag'] - data_frame['phot_rp_mean_mag']

The separation function we will use to remove giants is visually derived.

In [None]:
def giant_separation_y(x):
    return x * 40.0 - 25

In [None]:
import matplotlib.pyplot as plt

In [None]:
work_data_sample = work_data.sample(2000)
plt.rcParams['figure.figsize'] = (10, 5)
_color_index = color_index(work_data_sample)
plt.scatter(_color_index, work_data_sample['abs_mag_ne'] ** 2, s=2)
plt.plot(_color_index, giant_separation_y(_color_index), '--', color='orange')
plt.gca().invert_yaxis()
plt.title('Pseudo H-R diagram for giant removal')
plt.xlabel('BP - RP color index')
plt.ylabel('Absolute magnitude squared')
plt.show()

In [None]:
def transform_rm_giants(data_frame):
    new_frame = data_frame[data_frame['abs_mag_ne'] ** 2 >= giant_separation_y(color_index(data_frame))]
    new_frame.reset_index(inplace=True, drop=True)
    return new_frame
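
As a quick check of the separation rule, here it is applied to two made-up stars. The magnitudes and color indices are rough, illustrative values, not taken from the dataset, and the helper name `is_kept` is ours.

```python
def is_kept(abs_mag, bp_rp):
    # Same rule as transform_rm_giants: keep a star when its squared absolute
    # magnitude is at least the value of the separation line at its color index.
    return abs_mag ** 2 >= bp_rp * 40.0 - 25

# Dwarf-like star: 4.7 ** 2 = 22.09 versus a line value of 7.0.
print(is_kept(4.7, 0.8))
# Giant-like star: 0.5 ** 2 = 0.25 versus a line value of 23.0.
print(is_kept(0.5, 1.2))
```

The first star is kept and the second is removed.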

In [None]:
work_data = transform_rm_giants(work_data)
len(work_data)

## Squared residual modeling
We corrected for position-based bias, but error also seems to depend on position in the sky.

In [None]:
RESPONSE_COLUMN = 'model_response'

In [None]:
def transform_residual(data_frame):
    new_frame = data_frame.copy()
    new_frame['model_residual'] = data_frame[LABEL_COLUMN] - data_frame[RESPONSE_COLUMN]
    return new_frame

In [None]:
work_data = transform_residual(work_data)

In [None]:
mean_model_residual = np.mean(work_data['model_residual'].values)

def get_squared_res_label(data_frame):
    return (data_frame['model_residual'] - mean_model_residual) ** 2

Note that in addition to position features, we're including a *parallax_error*-derived feature. It seems reasonable to do that, even though it's not clear that it helps a whole lot.

The distance-from-plane transformations improve meta-model performance considerably.

In [None]:
def extract_residual_vars(data_frame):
    parallax = data_frame['parallax']
    parallax_error = data_frame['parallax_error']
    parallax_high = parallax + parallax_error
    parallax_low = parallax - parallax_error
    var_error_diff = np.log(parallax_high) - np.log(parallax_low)
    
    flux_error = data_frame['phot_g_mean_flux_error']
    
    latitude_rad = np.deg2rad(data_frame['b'].values)
    longitude_rad = np.deg2rad(data_frame['l'].values)
    sin_lat = np.sin(latitude_rad)
    cos_lat = np.cos(latitude_rad)
    sin_long = np.sin(longitude_rad)
    cos_long = np.cos(longitude_rad)

    distance = data_frame['distance']
    distance_in_plane = np.abs(distance * cos_lat)
    distance_from_plane_sq = (distance * sin_lat) ** 2
    plane_density_feature1 = np.exp(-distance_from_plane_sq / PLANE_DENSITY_2T_VAR1)
    plane_density_feature2 = np.exp(-distance_from_plane_sq / PLANE_DENSITY_2T_VAR2)
    return np.transpose([
        distance,
        sin_lat, cos_lat, sin_long, cos_long,
        plane_density_feature1, plane_density_feature2,
        var_error_diff,
        flux_error
    ])

In [None]:
def get_residual_trainer():
    return RandomForestRegressor(n_estimators=60, max_depth=9, min_samples_split=10, random_state=np.random.randint(1,10000))

Note that, once again, we're using the *trim_fraction* parameter so that extreme outliers don't distort neighbor results.

In [None]:
transform_expected_res_sq = get_cv_model_transform(work_data, get_squared_res_label, extract_residual_vars, get_residual_trainer, 
        n_runs=4, n_splits=2, max_n_training=MAX_N_TRAINING, response_column='expected_res_sq', scale=False,
        trim_fraction=0.003)

In [None]:
work_data = transform_expected_res_sq(work_data)

In [None]:
print_evaluation(work_data, get_squared_res_label, 'expected_res_sq')

These results are not that bad, if you consider we're modeling noise. We're basically using a random forest to estimate the regional variance (i.e. mean squared error) of magnitude residuals.
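
The link between squared residuals and variance can be seen in a toy example. Here a per-region mean stands in for the random forest, and the two regions with their scatter values are made up.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
# Two made-up sky regions with different residual scatter.
toy_regions = pd.DataFrame({
    'region': ['quiet'] * 5000 + ['noisy'] * 5000,
    'residual': np.concatenate([rng.normal(scale=0.1, size=5000),
                                rng.normal(scale=0.3, size=5000)]),
})
toy_regions['res_sq'] = (toy_regions['residual'] - toy_regions['residual'].mean()) ** 2
# The per-region mean of squared residuals estimates the per-region variance.
regional_variance = toy_regions.groupby('region')['res_sq'].mean()
print(regional_variance)
```

The estimates should land close to 0.1² = 0.01 and 0.3² = 0.09. A regression on squared residuals does the same thing, with the regions learned from the features rather than given.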

## Anomaly metric
The *anomaly* metric is defined as the locally standardized model residual.

In [None]:
def transform_anomaly(data_frame):
    new_frame = data_frame.copy()
    new_frame_residual = new_frame['model_residual'].values
    new_frame['anomaly'] = (new_frame_residual - mean_model_residual) / np.sqrt(new_frame['expected_res_sq'].astype(float))
    return new_frame

In [None]:
work_data = transform_anomaly(work_data)

The standard deviation of *anomaly* should be ~1.0, absent extreme outliers.

In [None]:
np.std(work_data['anomaly'])
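
The reason a locally standardized residual ends up with a standard deviation near 1 can be illustrated with synthetic data. In this sketch the local scale is known exactly, which is an idealization; in the kernel it is estimated by the squared-residual model.

```python
import numpy as np

rng = np.random.RandomState(0)
# Made-up local standard deviations that vary from star to star.
local_sigma = rng.uniform(0.05, 0.3, size=10000)
toy_residual = rng.normal(scale=local_sigma)
# Dividing by the local scale (the square root of the expected squared
# residual) puts every star on a common footing.
toy_anomaly = (toy_residual - toy_residual.mean()) / local_sigma
print('std of raw residual: %.3f' % np.std(toy_residual))
print('std of anomaly:      %.3f' % np.std(toy_anomaly))
```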

## Final transformation and validation

In [None]:
transform_list = [transform_init,                          # extra info columns
                  transform_lgbm, transform_nn,            # individual models
                  transform_blend,                         # the blend
                  transform_bres, transform_final_model,   # position-based residual correction
                  transform_rm_giants,                     # removal of giants
                  transform_residual,                      # add the residual column
                  transform_expected_res_sq,               # regional residual variance
                  transform_anomaly                        # anomaly metric
                 ]

In [None]:
def combined_transform(data_frame):
    _frame = data_frame
    for t in transform_list:
        _frame = t(_frame)
    return _frame

In [None]:
test_data = combined_transform(test_data)

The 'test' data we set aside has the following approximate RMSE:

In [None]:
np.std(test_data['model_residual'])

This is lower than we saw previously, but that's because we removed giants. It actually is consistent with what we see in the *work_data* frame now:

In [None]:
np.std(work_data['model_residual'])

By concatenating *work_data* and *test_data* we end up with results for the whole dataset (minus giants).

In [None]:
data = pd_concat([work_data, test_data])
work_data = None
test_data = None

In [None]:
len(data)

## KIC 8462852
KIC 8462852 is an enigmatic star (Boyajian et al. 2015) that happens to be in the dataset. Its model results follow.

In [None]:
data[data['source_id'] == '2081900940499099136'][
    ['source_id', 'distance', 'abs_mag_ne', 'model_residual', 'anomaly']]

It's an ordinary star according to the model.

## Anomalous and control group selection
For visualization purposes, let's get a list of dim "outliers" at a 3-sigma cut-off. We'll also get some bright and ordinary controls.

In [None]:
CAND_SD_THRESHOLD = 3.0

In [None]:
data_anomalies = data['anomaly']

In [None]:
anomaly_std = np.std(data_anomalies)

In [None]:
cand_threshold = anomaly_std * CAND_SD_THRESHOLD
candidates = data[data_anomalies >= cand_threshold]
len(candidates)

In [None]:
bright_control_group = data.sort_values('anomaly', ascending=True).head(len(candidates))

In [None]:
normal_control_group = data[(data_anomalies < anomaly_std) & (data_anomalies > -anomaly_std)].sample(len(candidates))

In [None]:
data_anomalies = None # Discard big array
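
The three selections can be tried out on synthetic data. The sketch below assumes a normally distributed *anomaly* column, which is an idealization, and uses hypothetical variable names.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
toy_stars = pd.DataFrame({'anomaly': rng.normal(size=100000)})
toy_std = np.std(toy_stars['anomaly'])

# Dimmer than expected: at or above the 3-sigma cut-off.
toy_dim = toy_stars[toy_stars['anomaly'] >= 3.0 * toy_std]
# Brighter than expected: the same number of stars with the lowest anomaly values.
toy_bright = toy_stars.sort_values('anomaly', ascending=True).head(len(toy_dim))
# Ordinary: a random sample of stars within one standard deviation.
toy_ordinary = toy_stars[toy_stars['anomaly'].abs() < toy_std].sample(len(toy_dim), random_state=0)
print(len(toy_dim), len(toy_bright), len(toy_ordinary))
```

For a normal distribution, roughly 0.13% of values fall above a one-sided 3-sigma cut-off, so all three groups stay small enough to plot.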

<a id='space_distribution_section'></a>
## Space distribution of star groups
The following function calculates rectangular (x-y-z) coordinates for each star. *X* points to the galactic center and *Z* is perpendicular to the galactic plane.

In [None]:
def get_position_frame(data_frame):
    new_frame = pd.DataFrame(columns=['source_id', 'x', 'y', 'z'])
    new_frame['source_id'] = data_frame['source_id'].values
    distance = data_frame['distance'].values
    latitude = np.deg2rad(data_frame['b'].values)
    longitude = np.deg2rad(data_frame['l'].values)
    new_frame['z'] = distance * np.sin(latitude)
    projection = distance * np.cos(latitude)
    new_frame['x'] = projection * np.cos(longitude)
    new_frame['y'] = projection * np.sin(longitude)
    return new_frame
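
A few special cases make the axis conventions easy to verify. The sketch below uses the same trigonometry as *get_position_frame*; the helper name `galactic_to_xyz` is ours.

```python
import numpy as np

def galactic_to_xyz(l_deg, b_deg, distance):
    # Same trigonometry as get_position_frame, for scalar or array inputs.
    lat = np.deg2rad(b_deg)
    lon = np.deg2rad(l_deg)
    projection = distance * np.cos(lat)
    return projection * np.cos(lon), projection * np.sin(lon), distance * np.sin(lat)

# l = 0, b = 0 points along +X; l = 90, b = 0 along +Y; b = 90 along +Z.
for l_deg, b_deg in [(0, 0), (90, 0), (0, 90)]:
    print((l_deg, b_deg), np.round(galactic_to_xyz(l_deg, b_deg, 100.0), 6))
```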

We'll include our sun in visualizations. Look for the blue dot.

In [None]:
def get_sun():
    new_frame = pd.DataFrame(columns=['source_id', 'x', 'y', 'z'])
    new_frame.loc[0] = ['sun', 0.0, 0.0, 0.0]
    return new_frame

We're also adding KIC 8462852 to the 3D scatter charts. Look for the black dot.

In [None]:
candidates_wbstar = pd_concat([candidates, data[data['source_id'] == '2081900940499099136']])
candidates_pos_frame = pd_concat([get_position_frame(candidates_wbstar), get_sun()])

In [None]:
# Only the offline module is needed; a second import bound to the same name would be overwritten anyway.
import plotly.offline as py
import plotly.graph_objs as go

py.init_notebook_mode(connected=False)

In [None]:
def plot_pos_frame(pos_frame, star_color, sun_color = 'blue', bstar_color = 'black'):    
    star_color = [(bstar_color if row['source_id'] == '2081900940499099136' else (sun_color if row['source_id'] == 'sun' else star_color)) for _, row in pos_frame.iterrows()]
    trace1 = go.Scatter3d(
        x=pos_frame['x'],
        y=pos_frame['y'],
        z=pos_frame['z'],
        mode='markers',
        text=pos_frame['source_id'],
        marker=dict(
            size=3,
            color=star_color,
            opacity=0.67
        )
    )
    scatter_data = [trace1]
    layout = go.Layout(
        margin=dict(
            l=0,
            r=0,
            b=0,
            t=0
        )
    )
    fig = go.Figure(data=scatter_data, layout=layout)
    py.iplot(fig)

In [None]:
%%html
<!-- Allow bigger output cells -->
<style>
.output_wrapper, .output {
    height:auto !important;
    max-height: 1500px;
}
</style>

A sample of ordinary stars is shown in gray.

In [None]:
normal_control_group_wbstar = pd_concat([normal_control_group, data[data['source_id'] == '2081900940499099136']])
normal_control_group_pos_frame = pd_concat([get_position_frame(normal_control_group_wbstar), get_sun()])
plot_pos_frame(normal_control_group_pos_frame, 'gray')

Anomalously dim stars are shown in green.

In [None]:
plot_pos_frame(candidates_pos_frame, 'green')

What we're looking for here is that there are no beams of candidates aligned with Earth, which would indicate possible line-of-sight artifacts. Those kinds of artifacts seem to be largely removed.

The anomalously bright star control group is shown in red below. Since giants have been removed, these brighter-than-expected stars are largely in the main sequence.

In [None]:
bright_control_group_wbstar = pd_concat([bright_control_group, data[data['source_id'] == '2081900940499099136']])
bright_control_group_pos_frame = pd_concat([get_position_frame(bright_control_group_wbstar), get_sun()])
plot_pos_frame(bright_control_group_pos_frame, 'red')

## Output
Model results and other dataset columns are made available in the output tab of this kernel. The *anomaly* column is the key result. Its standard deviation is roughly 1.0. Positive *anomaly* values indicate a star is dimmer than expected. 

In [None]:
SAVED_COLUMNS = ['source_id', 'tycho2_id', 'ra', 'dec', 'pmra', 'pmdec', 'l', 'b', 'distance', 'color_index',
                 LABEL_COLUMN, 'blend_' + LABEL_COLUMN, 'model_residual', 'anomaly']

In [None]:
data[SAVED_COLUMNS].to_csv('mag-modeling-results.csv', index=False)

## Acknowledgments

This work has made use of data from the European Space Agency (ESA) mission Gaia (https://www.cosmos.esa.int/gaia), processed by the Gaia Data Processing and Analysis Consortium (DPAC, https://www.cosmos.esa.int/web/gaia/dpac/consortium). Funding for the DPAC has been provided by national institutions, in particular the institutions participating in the Gaia Multilateral Agreement.

## References

Boyajian et al. (2015). _Planet Hunters X. KIC 8462852 - Where's the Flux?_ arXiv:1509.03622

Bradbury et al. (2011). _Dysonian Approach to SETI: A Fruitful Middle Ground?_ Journal of the British Interplanetary Society, vol. 64, p. 156-165

Zackrisson et al. (2018). _SETI with Gaia: The observational signatures of nearly complete Dyson spheres_. arXiv:1804.08351 