# Most general form of cross-validation
---

Splitting the folds by participant provides no personalization, but it still avoids the issue of using a subject's future data for prediction.
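
As a minimal sketch of why group-wise folds avoid this leakage, the snippet below runs `GroupKFold` on a hypothetical toy dataset (6 made-up participants with 4 cycles each, not the real data) and counts how many participant IDs appear in both the training and test side of each split. The names `groups`, `X_toy`, and `y_toy` are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical toy data: 6 participants, 4 cycles each (not the real dataset).
groups = np.repeat(np.arange(6), 4)
X_toy = np.arange(len(groups), dtype=float).reshape(-1, 1)
y_toy = np.zeros(len(groups))

# For each split, count the participants present in both train and test.
overlaps = []
for train_idx, test_idx in GroupKFold(n_splits=3).split(X_toy, y_toy, groups=groups):
    shared = set(groups[train_idx]) & set(groups[test_idx])
    overlaps.append(len(shared))

print(overlaps)  # [0, 0, 0]
```

Because no participant is ever on both sides of a split, none of a subject's cycles, past or future, can inform predictions for that same subject.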

In [1]:
%pylab --no-import-all inline

from os import path
import pandas as pd
import seaborn as sns

Populating the interactive namespace from numpy and matplotlib


In [2]:
file = path.join("..", "data", "processed", "df.csv")
df = pd.read_csv(file, index_col=0)

In [3]:
X = df[["AGE", "TEMP1", "TEMP2", "TEMP3", "TEMP4", "TEMP5", "TEMP6"]]
y = df.L_PREOVULATION
grouping = df.ID

In [4]:
# Imputer was removed from sklearn.preprocessing in scikit-learn 0.22;
# SimpleImputer is its replacement.
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
rfr = RandomForestRegressor(n_estimators=200, random_state=1337)
imp = SimpleImputer(strategy='median')
pipeline = Pipeline([('imp', imp), ('rfr', rfr)])

In [None]:
from sklearn.model_selection import GroupKFold, cross_val_predict
cv = GroupKFold(n_splits=10)

In [None]:
from sklearn.model_selection import cross_val_predict
# Out-of-fold predictions: each row is predicted by a model that never saw its participant.
y_pred = cross_val_predict(pipeline, X, y,
                           cv=cv, groups=grouping,
                           verbose=True, n_jobs=-1)

In [None]:
from sklearn.metrics import mean_squared_error
mean_squared_error(y_pred=y_pred, y_true=y)

## Discussion
---

This model is extremely simple: its only features are the first six temperatures of the cycle and the participant's age. With it, we achieve an MSE of about 18, which is not far from the Bortot paper's 15. In terms of use case, the two are about equal.
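
To put the two figures on the scale of the target, a quick sketch converts them to root-mean-squared errors. This assumes that both numbers are MSEs on the same target and that the target is measured in days; the values 18 and 15 are the approximate figures quoted above, not recomputed results.

```python
import math

mse_ours = 18.0    # approximate MSE quoted above for this model
mse_bortot = 15.0  # figure quoted above for the Bortot paper

# RMSE is in the units of the target (assumed here to be days).
rmse_ours = math.sqrt(mse_ours)
rmse_bortot = math.sqrt(mse_bortot)

print(round(rmse_ours, 2), round(rmse_bortot, 2), round(rmse_ours - rmse_bortot, 2))
# 4.24 3.87 0.37
```

Under those assumptions, the typical error differs by well under half a day, which is consistent with treating the two models as roughly equivalent in practice.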

In [None]:
df.L_PERIOD.median()

The median period length is 5 days, which means that we are essentially using BBT measurements taken during the period to predict the day of ovulation.