# Purpose of Notebook
The goal of this code is to:
- Create models for predicting mass using a set of parameters:
    - Effective Temperature
    - Log g
    - Iron Abundance [Fe/H]
    - Alpha Abundance [alpha/Fe]
    - Nitrogen Abundance [N/Fe]
    - Oxygen Abundance [O/Fe]
- The models will be trained on two datasets, K2 and APOKSAC

- Use the K2 model to predict masses for APOGEE and GALAH
- Use the APOKSAC model to predict masses for APOGEE

A good question is: why do all of this?
- Although the datasets used have their own mass prediction models, these are general-purpose
- Since we are interested specifically in low-mass stars (up to 2.5 solar masses), we want mass predictions tailored to that range
- After applying the models, we should also compare the mass distributions they predict to see whether the two models agree
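
One way to make the comparison of the two predicted mass distributions quantitative is a two-sample Kolmogorov-Smirnov statistic, the largest gap between the two empirical cumulative distributions. The notebook does not say how the comparison is to be done, so this is only a sketch of one option: the function name `ks_statistic` and the synthetic arrays `mass_k2` and `mass_apoksac` are assumptions for illustration, not real predictions.

```python
import numpy as np


def ks_statistic(masses_a, masses_b):
    """Largest gap between the empirical CDFs of two samples.

    0 means the samples are distributed identically, 1 means they do not overlap.
    """
    a = np.sort(np.asarray(masses_a, dtype=float))
    b = np.sort(np.asarray(masses_b, dtype=float))
    # The empirical CDFs only change at observed values, so evaluate them there
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))


# Hypothetical predicted masses (in solar masses) from the two models
rng = np.random.default_rng(0)
mass_k2 = rng.normal(1.2, 0.3, size=500)
mass_apoksac = rng.normal(1.3, 0.3, size=500)

ks_value = ks_statistic(mass_k2, mass_apoksac)
print(f"KS statistic between the two predicted distributions: {ks_value:.3f}")
```

A value near 0 would suggest the two models predict similar mass distributions; overlaid histograms are a useful visual complement.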

In [None]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# K2 Model
First, we separate the data into one set used for training and one used for verifying the fit of the model

In [None]:
# X = feature matrix, y = target (mass); placeholders to be filled from the K2 data
X = []
y = []
# The parameters we will be using to create the model
features = ["teff", "logg", "fe_h", "al_fe", "c_fe", "n_fe", "o_fe"]
# Use 80% of the data to train the model and the remaining 20% to verify it
# (random_state is fixed so that the split is reproducible)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=0)
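
As a rough sketch of how `X` and `y` might be filled before splitting, the example below builds them from a pandas DataFrame using the `features` list. Everything about the table is assumed: the DataFrame `k2` is synthetic, the column name `"mass"` is hypothetical, and applying the 2.5 solar mass cut before training is one possible reading of the stated goal rather than a confirmed step.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

features = ["teff", "logg", "fe_h", "al_fe", "c_fe", "n_fe", "o_fe"]

# Synthetic stand-in for the K2 table; real values would come from the catalogue
rng = np.random.default_rng(0)
k2 = pd.DataFrame(rng.normal(size=(200, len(features))), columns=features)
k2["mass"] = rng.uniform(0.5, 3.5, size=len(k2))  # "mass" is a hypothetical column name

# Drop rows with missing values, then keep only stars up to 2.5 solar masses
k2 = k2.dropna(subset=features + ["mass"])
k2 = k2[k2["mass"] <= 2.5]

X = k2[features]  # feature matrix
y = k2["mass"]    # target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=0
)
print(X_train.shape, X_test.shape)
```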


## Random Forest Regression (RFR)
The first model we will use to predict the mass is Random Forest Regression.
We will need to assess the fit of our model.
To do this, we will use the root-mean-squared error (RMSE) and the correlation coefficient.
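
For reference, the two fit metrics can be written out directly with NumPy. This is a minimal sketch on a toy example; the helper names `rmse` and `pearson_r` and the four sample values are made up for illustration.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root-mean-squared error: square root of the mean squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def pearson_r(y_true, y_pred):
    """Pearson correlation: the off-diagonal entry of the 2x2 matrix from np.corrcoef."""
    return float(np.corrcoef(y_true, y_pred)[0, 1])


# Toy values: three exact predictions and one that is off by 2
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 6.0]

print(rmse(y_true, y_pred))       # 1.0, since sqrt((0 + 0 + 0 + 4) / 4) = 1
print(pearson_r(y_true, y_pred))
```

Note that `np.corrcoef` returns a full correlation matrix, so the coefficient itself has to be picked out with `[0, 1]`.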

In [None]:
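from sklearn.metrics import mean_squared_error  # needed for the RMSE below

# Create the model
seed = 0  # Fixed seed so that results are reproducible
k2_model_RFR = RandomForestRegressor(n_estimators=200, random_state=seed)
k2_model_RFR.fit(X_train, y_train)
y_pred_RFR = k2_model_RFR.predict(X_test)
# Assess the fit: root-mean-squared error and Pearson correlation coefficient
score_RFR = np.sqrt(mean_squared_error(y_test, y_pred_RFR))
corr_RFR = np.corrcoef(y_test, y_pred_RFR)[0, 1]  # off-diagonal entry of the 2x2 matrix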

Initial parameters to create the model:
- `n_estimators = 200`
- `train_size = 0.8`
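
Since these are only initial values, one simple check is to refit the forest with a few different numbers of trees and see how the test RMSE changes. The sketch below does this on synthetic data; the data, the candidate values `(10, 50, 200)`, and the dictionary name `rmse_by_n` are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression problem standing in for the stellar data
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=300)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=0
)

# Refit with different numbers of trees and record the test RMSE for each
rmse_by_n = {}
for n in (10, 50, 200):
    model = RandomForestRegressor(n_estimators=n, random_state=0)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    rmse_by_n[n] = float(np.sqrt(mean_squared_error(y_test, y_pred)))

print(rmse_by_n)
```

If the RMSE stops improving well before 200 trees, a smaller forest would train faster at little cost in accuracy.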

## Polynomial Regression (PR)
The second model we will use to predict the mass is polynomial regression.
This model rests on assumptions that are not fully satisfied here: the predictors should be independent (mass and log g are typically correlated with quantities such as effective temperature), and their effects should be additive. It is nonetheless a standard model worth using.

The code is largely inspired by the [Data36 guide](https://data36.com/polynomial-regression-python-scikit-learn/).
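
One detail that is easy to miss with polynomial regression is that the test features need the same polynomial expansion as the training features before prediction. A scikit-learn pipeline applies it automatically. This is a minimal sketch on synthetic data, with a degree of 2 and a made-up `target` function chosen so the result is easy to check; it is not the notebook's actual model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures


def target(X):
    """Made-up quadratic relation used only to check the fit."""
    return 1.0 + 2.0 * X[:, 0] - X[:, 1] ** 2 + 0.5 * X[:, 0] * X[:, 1]


rng = np.random.default_rng(0)
X_train = rng.uniform(-1, 1, size=(100, 2))
X_test = rng.uniform(-1, 1, size=(20, 2))

# The pipeline expands the features and fits the linear model in one object,
# so predict() applies the same expansion to whatever it is given
model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
)
model.fit(X_train, target(X_train))
y_pred = model.predict(X_test)

print(np.max(np.abs(y_pred - target(X_test))))  # close to zero for this noise-free case
```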

In [None]:
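from sklearn.metrics import mean_squared_error  # needed for the RMSE below

# Create the model
poly = PolynomialFeatures(degree=5, include_bias=False)
poly_features = poly.fit_transform(X_train)
k2_model_PR = LinearRegression()
# Train the model
k2_model_PR.fit(poly_features, y_train)
# Test the model: apply the same polynomial expansion to the test features
y_pred_PR = k2_model_PR.predict(poly.transform(X_test))
score_PR = np.sqrt(mean_squared_error(y_test, y_pred_PR))
corr_PR = np.corrcoef(y_test, y_pred_PR)[0, 1]  # off-diagonal entry of the 2x2 matrix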

Initial parameters to create the model:
- `degree = 5`
    - The choice of degree is arbitrary
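
Because the degree is arbitrary, it may be worth seeing how quickly the number of fitted terms grows with it, and comparing a few degrees on held-out data. The sketch below does both on synthetic data; the helper `n_poly_terms`, the hold-out split, and the data itself are assumptions for illustration, and cross-validation would be a more thorough alternative.

```python
import math

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures


def n_poly_terms(n_features, degree):
    """Number of polynomial terms up to `degree`, excluding the bias column."""
    return math.comb(n_features + degree, degree) - 1


print(n_poly_terms(7, 5))  # 791 terms for 7 input features at degree 5

# Synthetic data with a known quadratic relation plus noise
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(400, 3))
y = X[:, 0] ** 2 - X[:, 1] * X[:, 2] + rng.normal(scale=0.05, size=400)
X_fit, X_val, y_fit, y_val = train_test_split(X, y, train_size=0.8, random_state=0)

# Fit one model per degree and record its RMSE on the held-out part
val_rmse = {}
for degree in range(1, 6):
    model = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        LinearRegression(),
    )
    model.fit(X_fit, y_fit)
    val_rmse[degree] = float(np.sqrt(mean_squared_error(y_val, model.predict(X_val))))

best_degree = min(val_rmse, key=val_rmse.get)
print(val_rmse, best_degree)
```

With seven input features, degree 5 already means fitting 791 coefficients, so a lower degree may generalise better unless the training set is large.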

# K2 and GALAH

# K2 and APOGEE

# APOKSAC Model

# APOKSAC and