# **Techsoc Astro-Analytics Hackathon 2020**
### Notebook by **Nishant Prabhu** (Team MechBoisDoingAnalytics)

In this notebook, I describe my approach to predicting the trajectories and velocities of satellites given simulation and timestamp data. With the model below, I placed 10th on the public leaderboard (SMAPE: 28.4637) and 11th on the private leaderboard (SMAPE: 35.3282). Of all the models I tried, the one shown below gave the best results. Other attempts are described in the relevant sections below.
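For readers unfamiliar with the leaderboard metric, here is a minimal sketch of SMAPE (symmetric mean absolute percentage error). The exact variant used by the competition is not stated in this notebook; the version below, which divides by `|y_true| + |y_pred|` and scales to a 0-100 range, is an assumption, and the function name `smape` is illustrative.

```python
import numpy as np

def smape(y_true, y_pred):
    """SMAPE in percent, assuming the 0-100 variant (not confirmed by the competition)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denom = np.abs(y_true) + np.abs(y_pred)
    # Where both values are zero the error is defined as zero to avoid 0/0
    ratio = np.divide(np.abs(y_true - y_pred), denom,
                      out=np.zeros_like(denom), where=denom != 0)
    return 100.0 * ratio.mean()
```

Another common variant divides by the mean of the two magnitudes instead of their sum, which doubles every score, so scores are only comparable when the same definition is used.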

In [1]:
import numpy as np
import pandas as pd
import datetime as dt

from lightgbm import LGBMRegressor
from xgboost import XGBRegressor
from sklearn.preprocessing import PolynomialFeatures
from sklearn.ensemble import ExtraTreesRegressor

from warnings import simplefilter
simplefilter(action='ignore', category=FutureWarning)
from IPython.display import clear_output

### **Training and testing Data**
The data given to us was structured as follows:
1. `train.csv`: Position and velocity of 600 unique satellites in a Cartesian coordinate system, recorded at different times (timestamp provided). There are 6 simulation columns (3 for position, 3 for velocity) and 6 actual data columns, all of which must be predicted for the records in the test data.
2. `test.csv`: Simulated position and velocity only, for 300 unique satellites (all of which are present in the training data), recorded at different times. For every satellite, the test data starts after the end of its training data (with a gap for some satellites).

In [2]:
train = pd.read_csv('.././mod_data/train_lag2.csv', parse_dates=['epoch'])
test = pd.read_csv('.././mod_data/test_lag2.csv', parse_dates=['epoch'])

### **Feature Engineering**
Features describing each satellite in a spherical coordinate system were derived from the Cartesian coordinates through suitable transformations. Because tree-based models treat each record independently, lag features (values from 2 time steps before and after each record) were also generated to supply temporal context (see `LagFeaturesGenerator.ipynb`). We also generated second-degree polynomial features of the simulated data using `PolynomialFeatures` from `scikit-learn`. Some other features that were tested (but did not help the model) were:
1. Error between simulated and actual data over time. This error was first regressed for the test data, then used as a feature alongside the simulated data to predict the actual data.
2. Hour, day and month of each recording, extracted from the timestamp in an attempt to capture seasonal variations.
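The lag-feature notebook is not reproduced here, so the following is only a sketch of how such features could be built with `pandas`. The function name `add_lag_features` and the column suffixes `_b1`, `_f1`, ... are assumptions, chosen so that they would match the `'_b'` / `'_f'` substring filter used later in this notebook; the original generator may differ.

```python
import pandas as pd

def add_lag_features(df, cols, n_lags=2, group_col='sat_id', time_col='epoch'):
    """Add backward (_b) and forward (_f) shifted copies of `cols` per satellite."""
    # Shifts are only meaningful when each satellite's rows are in time order
    df = df.sort_values([group_col, time_col]).copy()
    for col in cols:
        for k in range(1, n_lags + 1):
            # Value k steps earlier / later within the same satellite
            df[f'{col}_b{k}'] = df.groupby(group_col)[col].shift(k)
            df[f'{col}_f{k}'] = df.groupby(group_col)[col].shift(-k)
    return df
```

The first and last `n_lags` rows of each satellite have no neighbour on one side and come out as `NaN`. How the original notebook treated those rows is not shown; they would need to be filled or dropped before being passed to an estimator that rejects missing values.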

In [3]:
def add_new_features(df):
    """Add spherical-coordinate style features from the simulated position.

    Modifies `df` in place and returns it.
    """
    # Distance from the origin and lengths of the projections on each plane
    df['r_sim'] = np.sqrt(df['x_sim']**2 + df['y_sim']**2 + df['z_sim']**2)
    df['xy_r_sim'] = np.sqrt(df['x_sim']**2 + df['y_sim']**2)
    df['yz_r_sim'] = np.sqrt(df['y_sim']**2 + df['z_sim']**2)
    df['zx_r_sim'] = np.sqrt(df['x_sim']**2 + df['z_sim']**2)
    # Angles between the position vector and the x, y and z axes
    df['alpha_sim'] = np.arccos(df['x_sim']/df['r_sim'])
    df['beta_sim'] = np.arccos(df['y_sim']/df['r_sim'])
    df['gamma_sim'] = np.arccos(df['z_sim']/df['r_sim'])
    # Angles within each coordinate plane
    df['phi_xy_sim'] = np.arctan(df['x_sim']/df['y_sim'])
    df['phi_yz_sim'] = np.arctan(df['y_sim']/df['z_sim'])
    df['phi_zx_sim'] = np.arctan(df['z_sim']/df['x_sim'])
    return df

In [4]:
train = add_new_features(train)
test = add_new_features(test)

In [5]:
pred_cols = ['x', 'y', 'z', 'Vx', 'Vy', 'Vz']
train_cols = train.drop(['id', 'epoch', 'sat_id'] + pred_cols, axis=1).columns.tolist()
lag_cols = [col for col in train_cols if ('_b' in col) or ('_f' in col)]

### **Model Building and Training**
We used `ExtraTreesRegressor` from `scikit-learn` (500 estimators, random state = 123) as our regressor. Other models that we tried (in descending order of performance) were:
1. XGBoost Regressor `xgboost.XGBRegressor()`
2. K Nearest Neighbors Regressor `sklearn.neighbors.KNeighborsRegressor()`
3. LightGBM Regressor `lightgbm.LGBMRegressor()`
4. CatBoost Regressor `catboost.CatBoostRegressor()`
5. ARIMA `statsmodels.tsa.arima_model.ARIMA` (we did not tune it much, but top performers seem to have used some variant of it)
6. Support Vector Regressor `sklearn.svm.SVR`
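This notebook does not show how the candidate models were compared. Since the test data for each satellite lies after its training data, one plausible way to rank them offline is a time-ordered holdout per satellite. The helper below is a sketch under that assumption; the name `time_ordered_split` and the `holdout_frac` parameter are hypothetical and not part of the original workflow.

```python
import pandas as pd

def time_ordered_split(df, holdout_frac=0.2, group_col='sat_id', time_col='epoch'):
    """Hold out the latest `holdout_frac` of each satellite's records."""
    df = df.sort_values([group_col, time_col])
    # Position of each row within its satellite (0 = earliest record)
    rank = df.groupby(group_col).cumcount()
    # Number of records per satellite (assumes no missing timestamps)
    size = df.groupby(group_col)[time_col].transform('count')
    is_valid = rank >= size * (1 - holdout_frac)
    return df[~is_valid], df[is_valid]
```

Each candidate model could then be fitted on the first part and scored on the second, which mimics the train/test layout of the competition more closely than a random split.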

In [6]:
# Models 
reg = ExtraTreesRegressor(n_estimators=500, random_state=123)

In [8]:
poly_cols = ['x_sim', 'y_sim', 'z_sim', 'Vx_sim', 'Vy_sim', 'Vz_sim']
pred_cols = ['x', 'y', 'z', 'Vx', 'Vy', 'Vz']
poly = PolynomialFeatures(degree=2)

# Main algorithm: one model per satellite and per target column
sat_ids = test['sat_id'].unique()
for i, idx in enumerate(sat_ids):

    clear_output()
    print("Now processing {} of {} IDs".format(i+1, len(sat_ids)))

    # Extract data for that satellite
    d_train = train[train['sat_id'] == idx]
    d_test = test[test['sat_id'] == idx]

    # Features do not depend on the target, so build them once per satellite
    X_train = np.hstack((poly.fit_transform(d_train[poly_cols]), d_train[lag_cols].values))
    X_test = np.hstack((poly.transform(d_test[poly_cols]), d_test[lag_cols].values))

    # Fit and predict each position and velocity component in turn
    for cdn in pred_cols:
        reg.fit(X_train, d_train[cdn].values)
        test.loc[test['sat_id'] == idx, cdn] = reg.predict(X_test)

Now processing 300 of 300 IDs


In [10]:
# Get appropriate columns and create submission
sub = test[['id', 'x', 'y', 'z', 'Vx', 'Vy', 'Vz']]
sub.to_csv(".././submissions/sub18_ETR.csv", index=False)