# 1. Introduction
[NASA HI-SEAS](https://hi-seas.org/) missions act as a testbed and training ground for humans as we develop the capability to explore Mars. 
A recent NASA Space Apps Challenge hackathon asked participants to use data collected from the HI-SEAS site to predict solar radiation given a set of measurable meteorological conditions.
Knowing when conditions are most favorable for incident solar radiation is crucial for deciding when and where to deploy solar energy harvesting equipment, especially for colonists or astronauts on the surface of Mars.

The original Kaggle dataset & competition can be found here: https://www.kaggle.com/dronio/SolarEnergy

## 1.1. Scenario
We are participants in a NASA HI-SEAS (Hawai’i Space Exploration Analog and Simulation) mission, simulating a human settlement on Mars. 

A large solar array and battery bank are installed at the settlement and are the only power source available.
On sunny days, the array collects enough energy to power the entire settlement and recharge the battery bank.
The battery bank is used (sparingly) at night and on overcast days.
There is a strict power budget for operations each day to make sure vital equipment stays online, and we also have a number of experiments to run.

We have been collecting data at our settlement since the end of the last HI-SEAS mission in September, 2016.
It is now January, 2017, and our mission is about to begin.

Can we model solar radiation as a function of the information our sensors can gather, based on previously collected data?

## 1.2. About this dataset
These datasets are meteorological data from the HI-SEAS weather station from four months (September through December 2016) between Mission IV and Mission V.

For each dataset, the fields are:

* A row number (1-n), useful in sorting this export's results
* The UNIX time_t date (seconds since Jan 1, 1970), useful in sorting this export's results with other exports' results
* The date in `yyyy-mm-dd` format
* The local time of day in `hh:mm:ss` 24-hour format
* The numeric data, if any (may be an empty string)
* The text data, if any (may be an empty string)

The units of each dataset are:

* Solar radiation: watts per meter^2
* Temperature: degrees Fahrenheit
* Humidity: percent
* Barometric pressure: inches of mercury (in Hg)
* Wind direction: degrees
* Wind speed: miles per hour
* Sunrise/sunset: Hawaii time

## 1.3. About this kernel
The purpose of this kernel is to explore this dataset and apply basic machine learning techniques in order to predict solar radiation given a set of weather conditions.

__Assumptions__
* In an application of this model, predicted meteorological data (hourly temperature and humidity, for example) would be used as inputs (rather than using the model as a true predictor).
* Effects of rain and cloud cover are neglected, except those which are indirectly measured by temperature, pressure, and humidity.
* The sun angle with regard to the solar array is neglected. While it could be derived from the location of HI-SEAS and date-time data, this adds too much complexity to the problem for now.

This is my first foray into the world of data science and machine learning (ML). 
As such, it is expected that the following code will not be optimized for memory, runtime, or readability.
However, the result is still expected to be at worst interesting, at most useful.

I have decided to use Python 3 since it is widely used for data science, a versatile and useful language (even outside of data processing and ML), and a new language to me -- I'd like the practice.

## 1.4. Thanks
I would like to thank the following groups and individuals for providing the tutorials, resources, and inspiration to conduct this study.
* [NASA & Kaggle](https://www.kaggle.com/dronio/SolarEnergy), for providing this dataset.
* [sentdex on YouTube](https://www.youtube.com/playlist?list=PLQVvvaa0QuDfKTOs3Keq_kaG2P55YRn5v), for his incredible series on machine learning and everything else Python.
* Sarah Linden, for encouraging me to do engineering projects in my free time.

# 2. Preprocessing the data
Before applying any machine learning techniques, the input data must be ingested and conditioned.
Not all the data provided is useful!

## 2.1. Defining Features and Labels
Machine learning algorithms operate on _features_ to predict _labels_.

* A __feature__ is an attribute of the system that affects the output. 
Features act as "inputs" to the model.
Ideally, features are _independent_ variables.
* A __label__ is the value being predicted. 
Labels act as "outputs" of the model.

Now, let's consider our scenario. 
Recall the available data:
* Date
* Time of Day 
* Solar Radiation
* Temperature
* Pressure
* Humidity
* Wind Direction
* Wind Speed
* Time at Sunrise
* Time at Sunset

### 2.1.1. Features
At every timestamp within each day, there are values for all other variables.
No other variables impact the values of time or date.
Therefore __date__ and __time of day__ are _independent_ variables.

For each date, there is _one_ value for `Time at Sunrise` and `Time at Sunset`. 
The difference of these values yields the length of a given day, which is directly related to the date.
More exploration of the dataset is needed to determine if the length of a given date supersedes `date` in the amount of useful information it provides.
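
As a quick illustration of the day-length idea, here is a minimal sketch using only the standard library. The sunrise and sunset strings are made-up values in the dataset's `hh:mm:ss` format, not rows from the actual file.

```python
from datetime import datetime

# Hypothetical sunrise/sunset strings in the dataset's hh:mm:ss format.
sunrise = datetime.strptime("06:13:00", "%H:%M:%S")
sunset = datetime.strptime("18:13:00", "%H:%M:%S")

# Subtracting two datetimes gives a timedelta; total_seconds() is the day length.
day_length_s = (sunset - sunrise).total_seconds()
print(day_length_s)  # 43200.0, i.e. exactly 12 hours
```

Because sunrise and sunset change slowly from day to day, this single number carries much of the same seasonal information as the date itself.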

Temperature, pressure, and humidity do not directly affect one another significantly, but since they are all properties which describe the local atmosphere, they do not vary independently from one another.
Similarly, all three of these variables have a strong relationship to time of day.

Therefore we consider the following variables to be _features_ to the machine learning algorithm:
* Date (or Length of Day)
* Time of day
* Temperature
* Pressure
* Humidity
* Wind Speed
* Wind Direction

Further exploration of the dataset may modify this list, but for now this is our best guess.

### 2.1.2. Labels
The goal is to model solar radiation based on the available features, so it makes sense for `radiation` to be the algorithm's label. 
Recorded radiation measurements serve as the truth values to train and test the supervised machine learning algorithm.

## 2.2. Importing the Data
First, all of the data is loaded in as the appropriate data types.
The column `Data` contains a single, unchanging timestamp. This appears to be the date the dataset was published, but that is unclear.
For our purposes, this is not useful information and the column is removed from the dataset.

Time of day, sunrise, and sunset values are converted to `datetime` objects which are stored as timezone naive UNIX time values (we can always translate it back later).

In [None]:
## IMPORT LIBRARIES
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import pytz # timezones

def ingest_data(filename):
    '''Read data from a CSV file and construct a pandas DataFrame
    Inputs:
        filename as string
    Outputs:
        df as DataFrame (indexed by timezone-aware UNIX time)
        units as dict mapping column names to unit labels
    '''
    # read csv file
    df = pd.read_csv(filename)

    # 'Data' column is unused. All elements contain the same value.
    # 'Time' is redundant and superseded by UNIXTime.
    df.drop(['Data','Time'],axis=1,inplace=True)

    # interpret columns as appropriate data types to ensure compatibility
    df['UNIXTime']      = pd.to_datetime(df['UNIXTime'],unit='s')
    df['Radiation']     = df['Radiation'].astype(float)
    df['Temperature']   = df['Temperature'].astype(float) # or int
    df['Pressure']      = df['Pressure'].astype(float)
    df['Humidity']      = df['Humidity'].astype(int) # or float
    df['WindDirection(Degrees)'] = df['WindDirection(Degrees)'].astype(float)
    df['Speed']         = df['Speed'].astype(float)
    df['TimeSunRise']   = pd.to_datetime(df['TimeSunRise'],format='%H:%M:%S')
    df['TimeSunSet']    = pd.to_datetime(df['TimeSunSet'],format='%H:%M:%S')
    df.rename(columns={'WindDirection(Degrees)': 'WindDirection', 'Speed': 'WindSpeed'}, inplace=True)

    # compute length of each day
    df['DayLength'] = (df['TimeSunSet']-df['TimeSunRise'])/np.timedelta64(1, 's')

    # we don't need sunrise or sunset times anymore, so drop them
    df.drop(['TimeSunRise','TimeSunSet'],axis=1,inplace=True)

    # index by UNIX time
    df.sort_values('UNIXTime', inplace=True) # sort by UNIXTime
    df.set_index('UNIXTime',inplace=True) # index by UNIXTime

    # Localize the index (using tz_localize) to UTC (to make the Timestamps timezone-aware) and then convert to Hawaii time (using tz_convert)
    hawaii=pytz.timezone('Pacific/Honolulu')
    df.index=df.index.tz_localize(pytz.utc).tz_convert(hawaii)

    # assign unit labels to data keys
    units={'Radiation':'W/m^2','Temperature':'F','Pressure':'in Hg','Humidity':'%','DayLength':'sec'}
    return df, units

In [None]:
df, units = ingest_data('../input/SolarPrediction.csv')
print(df.head())

Note that the `Time` column is dropped in favor of the `UNIXTime` timestamp. 
UNIX time encodes both date and time, so the `Time` column is redundant.
The data is sorted by UNIX time, and the index is then converted from UTC to Hawaii Standard Time.
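
To make the timezone handling concrete, here is a small sketch of the same conversion on a single timestamp. The value `1475280000` is an illustrative choice (midnight UTC on 2016-10-01), not a row from the dataset, and passing the timezone name as a string to `tz_convert` is an alternative to the `pytz` object used above.

```python
import pandas as pd

# 1475280000 is 2016-10-01 00:00:00 UTC (illustrative value, not a row from the dataset).
stamps = pd.to_datetime([1475280000], unit="s", utc=True)

# Hawaii is UTC-10 year-round (no daylight saving time).
local = stamps.tz_convert("Pacific/Honolulu")
print(local[0])  # 2016-09-30 14:00:00-10:00
```

Note that the local date is a day earlier than the UTC date, which is why the conversion matters before grouping by hour of day.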

## 2.3. Exploring the Data
Plotting libraries are imported to visualize data.
Then each measurement is visualized and Pearson correlations are calculated to determine which parameters have the most impact on one another.
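
As a reminder of what the Pearson coefficient measures, the sketch below computes it by hand as the covariance of two variables divided by the product of their spreads. The helper name `pearson_r` and the temperature and radiation values are made up for illustration.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance normalized by both standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the mean
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

# Made-up readings: radiation rises as temperature rises.
temperature = [48, 50, 55, 61, 63]
radiation = [2, 40, 310, 820, 900]
r = pearson_r(temperature, radiation)
print(round(r, 3))
```

A value near +1 or -1 indicates a strong linear relationship, while a value near 0 indicates little linear relationship (a nonlinear one may still exist).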

In [None]:
## IMPORT LIBRARIES
import numpy as np # linear algebra
from scipy import stats # statistics
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # plotting tools
import seaborn as sns # advanced plotting tools
sns.set(style="white")

# make IPython render plots inline
%matplotlib inline 

First, a basic correlation matrix is generated to weed out irrelevant data and identify the most significant features in the set.

In [None]:
def corrPairs(df):
    '''Pairwise correlation matrix'''
    corr = df.corr() # Compute the correlation matrix
    mask = np.zeros_like(corr, dtype=bool) # make mask (np.bool was removed from recent NumPy releases)
    mask[np.triu_indices_from(mask)] = True # mask upper triangle
    sns.heatmap(corr, mask=mask, cmap='coolwarm', center=0, square=True, linewidths=.5, annot=True, cbar=False)

# add week to view correlation (DatetimeIndex.week was removed in pandas 2.0)
df['WeekOfYear'] = df.index.isocalendar().week.astype(int).to_numpy()
plt.figure(figsize=(7,7))
corrPairs(df)

Examining the stronger correlations more closely:

In [None]:
def corrfunc(x, y, **kws):
    '''add pearsonr correlation to plots'''
    r, _ = stats.pearsonr(x, y)
    ax = plt.gca()
    ax.annotate("r = {:.2f}".format(r),xy=(.1, .9), xycoords=ax.transAxes, color='white')
    return

def corrMap(df,features):
    '''plot bivariate correlations'''
    g = sns.PairGrid(df, vars=features)
    g.map_upper(plt.scatter, s=10)
    g.map_diag(sns.distplot, kde=False)
    g.map_lower(sns.kdeplot, cmap="coolwarm", shade=True, n_levels=30)
    g.map_lower(corrfunc)

In [None]:
feature_list=['Radiation','Temperature','Humidity','Pressure']
# bivariate density matrix
corrMap(df,feature_list)
plt.show()

There are several timescales to consider in this dataset:
* Monthly
* Daily
* Hourly

Minute-by-minute data is too granular to draw broad conclusions from at this stage, but it should be considered when constructing the prediction algorithm.
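
These timescales map onto pandas grouping operations. The sketch below uses two weeks of made-up hourly readings (the value is simply the hour of day) to show grouping by hour of day and by calendar week; the variable names are hypothetical.

```python
import pandas as pd

# Two weeks of made-up hourly readings, localized to Hawaii time.
index = pd.date_range("2016-09-01", periods=14 * 24,
                      freq=pd.Timedelta(hours=1), tz="Pacific/Honolulu")
readings = pd.Series(index.hour.to_numpy(dtype=float), index=index, name="Radiation")

# Hourly timescale: average all readings that share the same hour of day.
hourly_mean = readings.groupby(readings.index.hour).mean()

# Weekly timescale: average readings within each calendar week (weeks end on Sunday).
weekly_mean = readings.groupby(pd.Grouper(freq="W")).mean()

print(hourly_mean.head())
print(weekly_mean)
```

Grouping by hour collapses every day onto a single 24-point profile, while grouping by week keeps the chronological order and smooths out day-to-day noise.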

Radiation is expected to vary with the date due to seasonal weather changes.
The dataset only contains data from autumn and winter, so the model developed from this data may be less capable of predicting radiation during the summer. 
Fortunately, the seasonal climate at the HI-SEAS facility in Hawai'i is fairly consistent year-round.

Recalling the most correlated features, we plot Radiation as a function of Temperature, Humidity, and Pressure on the various timescales.

In [None]:
def color_y_axis(ax, color):
    '''Color y axis on two-axis plots'''
    for t in ax.get_yticklabels():
        t.set_color(color)
    ax.yaxis.label.set_color(color)
    return None

def plotVs(df,timescale,feature1,feature2,ax1,units):
    '''Plot feature vs radiation'''
    ax2=ax1.twinx()
    df_grouped= df.groupby(timescale)

    df_feature1 = df_grouped[feature1].mean()
    df_feature1_errorpos =  df_feature1+df_grouped[feature1].std()/2
    df_feature1_errorneg =  df_feature1-df_grouped[feature1].std()/2
    ax1.plot(df_feature1)
    ax1.fill_between(df_feature1.index, df_feature1_errorpos.values, df_feature1_errorneg.values, alpha=0.3, antialiased=True)
    ax1.set_ylabel(feature1+' '+units[feature1])
    color_y_axis(ax1, 'b')

    if feature2 == 'Radiation':
        rad = df_grouped['Radiation'].mean()
        ax2.plot(rad,'r')
        ax2.fill_between(df_feature1.index, 0, rad, alpha=0.3, antialiased=True, color='red')
        ax2.set_ylabel('Radiation'+' '+units['Radiation'])
        color_y_axis(ax2, 'r')
    else:
        df_feature2 = df_grouped[feature2].mean()
        df_feature2_errorpos =  df_feature2+df_grouped[feature2].std()/2
        df_feature2_errorneg =  df_feature2-df_grouped[feature2].std()/2
        ax1.plot(df_feature2)
        ax1.fill_between(df_feature2.index, df_feature2_errorpos.values, df_feature2_errorneg.values, alpha=0.3, antialiased=True)
        ax1.set_ylabel(feature2+' '+units[feature2])
        color_y_axis(ax1, 'g')
    return ax1, ax2

def HourlyWeeklyVs(df,feature1,feature2,units):
    '''Plot a feature vs radiation for time of day and week of year'''
    plt.figure(figsize=(18, 6))
    ax=plt.subplot(121) # hourly
    ax1,ax2 = plotVs(df,df.index.hour,feature1,feature2,ax,units)
    lines1, labels1 = ax1.get_legend_handles_labels()
    lines2, labels2 = ax2.get_legend_handles_labels()
    ax2.legend(lines1 + lines2, labels1 + labels2)
    plt.xlabel('Hour of Day (Local Time)')
    plt.title('Mean Hourly {0} vs. Mean Hourly {1}'.format(feature1,feature2))

    ax=plt.subplot(122) # weekly
    ax1, ax2 = plotVs(df,pd.Grouper(freq='W'),feature1,feature2,ax,units)
    lines1, labels1 = ax1.get_legend_handles_labels()
    lines2, labels2 = ax2.get_legend_handles_labels()
    ax2.legend(lines1 + lines2, labels1 + labels2)
    plt.xlabel('Week of Year')
    plt.title('Mean Weekly {0} vs. Mean Weekly {1}'.format(feature1,feature2))
    return


In [None]:
for feature in feature_list[1:]: # radiation vs feature
    HourlyWeeklyVs(df,feature,feature_list[0],units)
plt.show()

## 2.4. Thoughts So Far
From this exploration of the data, we see the following patterns in this dataset:
* __Higher temperatures correlate to more radiation throughput.__ This is confirmed by a Pearson R-value of 0.73 and the observed behavior of radiation "following" temperature on the daily and weekly time scales.
* __Humidity has a lesser (but potentially significant) impact on radiation throughput.__ With a Pearson R-value of magnitude above 0.20, humidity cannot be ignored as a potential driver of the system. Evidence for the negative correlation between the two features is found on the weekly time scale.
* __Pressure doesn't correlate much to radiation, but does correlate to temperature and humidity.__ Weather, basically. Since temperature, pressure, and humidity are all characteristics of the atmosphere it is not surprising that they are correlated.
* __Wind speed and direction are not relevant in this analysis.__ Though both are characteristics of local weather, they do not make sense as predictors of radiation. Wind direction has a moderate correlation to temperature (-0.26), pressure (-0.23) and radiation (-0.23) but through engineering judgement we know that this is only _correlation_ and not _causation_.
* __Seasonal changes are significant.__ Even though Hawai'i does not see seasons as drastic as the northern continental United States, seasonal changes in temperature and humidity are still severe enough to be taken into account, as shown by the weekly measurement comparisons.
* __Weekly timescales are the best predictors.__ Month-to-month variation is too broad to capture seasonal changes within a single year. Daily and hourly measurements have quite a bit of noise when looking for seasonal changes. Since day-to-day weather is dominated by temperature, pressure, and humidity (rather than seasonal changes), week of the year is the best indicator of seasonal trends.

In [None]:
df.drop(['WindDirection','WindSpeed'], axis=1, inplace=True) # drop irrelevant features

Also, though it is obvious, solar radiation has a strong correlation to time of day. 
We add time of day as a feature so the algorithm can tell day from night. 
As a side effect, this implicitly accounts for sun angle.

In [None]:
df['TimeOfDay'] = df.index.hour # add time of day to correlation

# 3. Training & Testing
We want an algorithm that predicts a value (radiation for a given set of inputs). We have plenty of data to train with, and we have "unlimited" time.
There are many models to choose from, and more than one may be appropriate.

In this analysis, we will try several models and compare their performance to evaluate the best algorithm to predict solar radiation.
* Linear Regression
* Random Forest Regression
* Neural Network Regression
* Support Vector Regression

In [None]:
# IMPORT ML REGRESSORS
from sklearn.linear_model import LinearRegression # Linear regression
from sklearn.ensemble import RandomForestRegressor # random forest regression
from sklearn.neural_network import MLPRegressor # neural network regression
from sklearn.svm import SVR # support vector regression

## 3.1. Preparing the algorithm
Even before we downselect to a specific model, we can prepare a prediction algorithm that takes in our data and makes a prediction.
Using [`scikit-learn`](http://scikit-learn.org/stable/supervised_learning.html#supervised-learning), it is easy to swap out different models and maintain the same higher-level structure to the program.

To train the algorithm, we implement a split train/test methodology to prevent bias in the learning.
The dataset is split into a randomly sampled pool of datapoints. 
80% of those points are used for training; the remaining 20% are held out to validate the trained model.
So the test data is not necessarily continuous time, but rather a random selection of points from the set.

For demonstration purposes, we use the entire dataset (including training and test points) to visualize algorithm performance over time.
This is inherently biased, since some of the points we will see will have been points that the algorithm has already trained on and potentially optimized to.
However, we validate the algorithm accuracy against the subset of testing points (which were not used for training), so we can still be confident in evaluating the performance using the accuracy metric and by keeping this potential bias in mind.
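
A minimal sketch of the 80/20 split on made-up data is shown here. The arrays are placeholders, and `random_state=0` is an added assumption so that the split is repeatable; the kernel's own code below leaves it unset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)  # 50 made-up samples with 2 features each
y = np.arange(50)                  # one made-up label per sample

# test_size=0.2 holds out 20% of the rows, chosen at random (not a contiguous block).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print(X_train.shape, X_test.shape)  # (40, 2) (10, 2)
```

Because rows are shuffled before splitting, the test set samples the whole time range rather than only the final weeks.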

In [None]:
x = df.drop('Radiation',axis=1).to_numpy() # as_matrix() was removed in pandas 1.0
y = df['Radiation'].to_numpy()

In [None]:
from sklearn import preprocessing # ML tools
from sklearn.model_selection import train_test_split # split data

from bokeh.plotting import figure, show, output_notebook

def plot_test(clf,X_test,y_test):
    y_predicted = clf.predict(X_test)

    p = figure(tools='pan,box_zoom,reset',x_range=[0, 100], title='Model validation',y_axis_label='radiation')
    p.grid.minor_grid_line_color = '#eeeeee'

    # recent Bokeh releases use legend_label in place of the old legend keyword
    p.line(range(len(y_test)),y_test,legend_label='actual',line_color='blue')
    p.line(range(len(y_test)),y_predicted,legend_label='prediction',line_color='red')
    output_notebook()
    show(p)
    return

def plot_real(clf,x,y_actual,index):
    ''' Plot predictions for actual measurements.
    inputs:
        clf         as classifier   the trained algorithm
        x           as array        timeseries of measurement inputs
        y_actual    as array        corresponding timeseries of actual results
    '''
    y_predicted = clf.predict(x)

    p = figure(toolbar_location='right', title='Predicted vs Actual',y_axis_label='radiation',x_axis_type="datetime")
    p.grid.minor_grid_line_color = '#eeeeee'

    # recent Bokeh releases use legend_label in place of the old legend keyword
    p.line(index,y_actual,legend_label='actual',line_color='blue')
    p.line(index,y_predicted,legend_label='prediction',line_color='red')
    output_notebook()
    show(p)
    return

def train_model(X,y,clf,debug=False):
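    ''' Train algorithm on a random 80/20 train/test split.
    inputs:
        X       as array        features
        y       as array        label(s)
        clf     as scikit-learn regressor (untrained)
        debug   as bool         currently unused
    returns:
        clf       as trained regressor
        model     as the object returned by clf.fit (the same fitted regressor)
        accuracy  as float      R-squared score on the test split
        X_test, y_test as arrays  the held-out test split
    '''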
    X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)
    model = clf.fit(X_train,y_train)
    accuracy = clf.score(X_test,y_test)
    return clf, model, accuracy, X_test, y_test

def go(x,y,algorithm,debug=True):
    ''' Easy model train and test. '''
    clf, model, accuracy, X_test, y_test=train_model(x,y,algorithm,debug=debug)
    print('Accuracy: %s percent'%str(accuracy*100))

    if debug:
        plot_test(clf,X_test,y_test)
        plot_real(clf,x,y,df.index.values)
    return

## 3.2. Linear Regression
Let's implement the first ML algorithm: __Linear regression.__

Linear regression is probably the simplest fit, but weather characteristics are probably quite nonlinear.
Regardless, let's see how it performs -- it might be good enough.

In [None]:
go(x,y,LinearRegression())

Not horrible, but the ~60% accuracy is definitely apparent.

Here's a good example of where the algorithm roughly succeeds and miserably fails.

![figure: linear regression](http://github.com/runphilrun/kaggle-radiation-prediction/blob/master/figs/linreg_plot.png?raw=true)

## 3.3. Random Forest Regression
Another algorithm to try is __random forest regression__. 
This works in a fundamentally different way to linear regression, so maybe we'll have more success.
Most importantly, this algorithm can handle nonlinear inputs.

In [None]:
go(x,y,RandomForestRegressor())

Wow, 92%! Random forest regression is much more flexible at handling this case where the underlying relationship of the data is nowhere near linear.
There are still areas where the algorithm fails, but the accuracy is dramatically improved overall.
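
To illustrate why a tree ensemble copes with nonlinearity where a straight line cannot, the sketch below fits both models to a synthetic, radiation-like curve: zero at night and a half-sine bump during the day. The data, noise level, and `random_state` values are assumptions for illustration only and are unrelated to the HI-SEAS measurements.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
hour = rng.uniform(0, 24, size=600)

# Synthetic "radiation": zero before 06:00 and after 18:00, half-sine in between.
signal = np.clip(np.sin(np.pi * (hour - 6) / 12), 0, None)
y = signal + rng.normal(0, 0.05, size=hour.size)  # small measurement noise
X = hour.reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

linear_r2 = LinearRegression().fit(X_train, y_train).score(X_test, y_test)
forest_r2 = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train).score(X_test, y_test)
print("linear R^2:", round(linear_r2, 3))
print("forest R^2:", round(forest_r2, 3))
```

The bump is symmetric about noon, so no single slope can describe it, whereas the forest approximates it with many small piecewise-constant steps.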

![figure: random forest regression](https://github.com/runphilrun/kaggle-radiation-prediction/blob/master/figs/rforest_plot.png?raw=true)

## 3.4. Neural Network Regression
Neural Networks are very tunable to suit a wide variety of problems.
In this case, a neural network will be used to minimize squared error.
Since this is just an exploration, we use default parameters knowing that performance may be much different if these values are tuned to suit our problem.

In [None]:
go(x,y,MLPRegressor())

Wow, worse than linear regression! 
Although better results are probably possible with this algorithm, we already have random forest regression performing north of 90% accuracy. 
Tuning the neural network is not really worth the trouble at this point.

## 3.5. Support Vector Regression
This is another algorithm that comes packaged with scikit-learn.
Let's implement it without digging into the theory, just to see how it performs out of the box.

In [None]:
go(x,y,SVR())

The runtime for training this algorithm is far longer than for the others.
As for accuracy... -32%? 
Accuracy is measured as the R-squared value, and a negative result indicates that the mean is a better predictor than this trained result!
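
The R-squared definition can be worked through by hand: it is one minus the ratio of the model's squared error to the squared error of always predicting the mean. The helper `r_squared` and the numbers below are made up for illustration.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = ((y_true - y_pred) ** 2).sum()         # error of the model
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()  # error of predicting the mean
    return float(1.0 - ss_res / ss_tot)

y_true = [0.0, 100.0, 400.0, 100.0]  # made-up radiation values, mean = 150
mean_only = [150.0] * 4              # always predict the mean
poor = [400.0, 0.0, 0.0, 400.0]      # predictions that are worse than the mean

print(r_squared(y_true, y_true))     # 1.0 for a perfect prediction
print(r_squared(y_true, mean_only))  # 0.0 for the mean predictor
print(r_squared(y_true, poor))       # about -3.67, worse than the mean
```

So a score of -32% means the model's squared error is roughly a third larger than that of simply guessing the average radiation every time.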

Support vector regression (SVR) supports different _kernels_, and defaults to [Radial Basis Function (RBF)](http://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html).
I attempted to try linear and polynomial kernels before passing judgement on this algorithm, but they took too long to run on my personal machine (more than 15 minutes).
SVR is worth returning to, but right now we already have better models. 
Tinkering with this is not getting us very far.

In [None]:
# go(x,y,SVR(kernel='linear'))
# go(x,y,SVR(kernel='poly'))

To recap, recall the accuracy of each algorithm attempted so far:
* _Linear Regression:_ ~60%
* **_Random Forest Regression:_ >90%**
* _Neural Network Regression:_ ~50%
* _Support Vector Regression:_ <0% (about -32%)

Thus we select __Random Forest Regression__ as our algorithm.

# 4. Tuning the Algorithm
Now let's consider how we can improve the accuracy of our model. (I watched [this video](https://www.youtube.com/watch?v=YkVscKsV_qk) and read the [scikit-learn documentation](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html) to better understand this algorithm and you should too.)

Here's what [the docs](http://scikit-learn.org/stable/modules/ensemble.html#random-forests) say:
> In random forests (see RandomForestClassifier and RandomForestRegressor classes), each tree in the ensemble is built from a sample drawn with replacement (i.e., a bootstrap sample) from the training set. In addition, when splitting a node during the construction of the tree, the split that is chosen is no longer the best split among all features. Instead, the split that is picked is the best split among a random subset of the features. As a result of this randomness, the bias of the forest usually slightly increases (with respect to the bias of a single non-random tree) but, due to averaging, its variance also decreases, usually more than compensating for the increase in bias, hence yielding an overall better model.



On a high level, regression derived from decision trees often results in low bias, high variance models, and is prone to overfitting.
While the random forest method (which is built upon many decision trees) is more robust against bias and variance, overfitting is still a potential pitfall. 

For random forests, there are three main tuning parameters:
* __Number of trees.__ (`n_estimators`) More is better, with diminishing returns. More trees also mean longer compute times, so the goal is the smallest number of trees beyond which accuracy no longer improves meaningfully.
* __Number of features to consider at each split.__ (`max_features`) If some trees consider a different subset of features than others, the correlation between those two groups is minimal. This is desirable because it teases out the influence of each individual feature.
* __Depth of trees.__ (`max_depth`) Having trees go too deep can lead to overfitting. There is a critical depth where the trees split enough to result in useful fit without being too influenced by single values. Depth may instead be constrained by `min_samples_split`, `min_samples_leaf`, `min_weight_fraction_leaf`, or `max_leaf_nodes` rather than specifying tree depth outright.

```python
# DEFAULT VALUES (as of the scikit-learn release used for this kernel;
# newer releases differ, e.g. n_estimators now defaults to 100)
RandomForestRegressor(n_estimators=10,
                      criterion='mse',
                      max_depth=None,
                      min_samples_split=2,
                      min_samples_leaf=1,
                      min_weight_fraction_leaf=0.0,
                      max_features='auto',
                      max_leaf_nodes=None,
                      min_impurity_decrease=0.0,
                      min_impurity_split=None,
                      bootstrap=True,
                      oob_score=False,
                      n_jobs=1,
                      random_state=None,
                      verbose=0,
                      warm_start=False)
```

Start by seeing if performance improves by simply increasing the number of trees (and letting scikit-learn use all available cores on our PC).

In [None]:
# default algorithm for reference
print('Default random forest regressor:')
go(x,y,RandomForestRegressor(),debug=False)

# tuning round 1
print('Tuned regressor:')
go(x,y,RandomForestRegressor(n_estimators=100, n_jobs=-1),debug=False)

No real improvement. Trying every possible combination of values for `n_estimators`, `max_features`, and so on would take forever. 
Using [Gregory Saunders's talk at PyCon Australia](https://youtu.be/YkVscKsV_qk?t=1940) as an example, we can write a function to let Python search for optimal values for us using the `oob_score` (out-of-bag score) to evaluate the accuracy of each combination, and return its best guess.

In [None]:
def optimize_randomforest(x,y,try_n=(10,),try_f=(None,),try_s=(1,)):
    ''' Find best combo of tunable params for random forest regressor.
    Each try_* argument is an iterable of candidate values; the combination
    with the highest out-of-bag score wins.
    '''
    best_score = float('-inf') # initialize score
    for n in try_n:
        for f in try_f:
            for s in try_s:
                clf = RandomForestRegressor(oob_score=True,n_estimators=n,max_features=f,min_samples_leaf=s,n_jobs=-1)
                clf.fit(x,y)
                if clf.oob_score_ > best_score:
                    best_score, best_clf, best_n, best_f, best_s = clf.oob_score_, clf, n, f, s
    return best_clf, best_n, best_f, best_s # return the best model, not the last one fitted

x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.2)
n=[100,200,300,500]
f=[2,4,6]
s=[1,2,4,8,16]
clf, n, f, s = optimize_randomforest(x_train,y_train,try_n=n,try_f=f,try_s=s)
print('n_estimators: '+str(n))
print('max_features: '+str(f))
print('min_samples_leaf: '+str(s))
go(x,y,RandomForestRegressor(n_estimators=n,max_features=f,min_samples_leaf=s,n_jobs=-1))

Some improvement, but it looks like the default values weren't too bad either.
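
For intuition about the out-of-bag (OOB) score used in the search above: each tree is trained on a bootstrap sample, so the rows it never saw act as a built-in validation set. The sketch below, on synthetic data with assumed `random_state` values, compares the OOB estimate with a score on a separate held-out split.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.uniform(0, 24, size=(600, 1))  # synthetic hour of day
y = np.clip(np.sin(np.pi * (X[:, 0] - 6) / 12), 0, None) + rng.normal(0, 0.05, size=600)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# oob_score=True asks each tree to be scored on the training rows it did not see.
forest = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=1)
forest.fit(X_train, y_train)

oob_r2 = forest.oob_score_               # R^2 estimated from out-of-bag rows
test_r2 = forest.score(X_test, y_test)   # R^2 on the held-out split
print("OOB R^2: ", round(oob_r2, 3))
print("test R^2:", round(test_r2, 3))
```

When the two numbers agree closely, the OOB score is a reasonable stand-in for a held-out test during tuning, which avoids spending data on a separate validation split.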

We settle for the following optimized values (accuracy: 93.460329455 percent):

| parameter | value |
| --- | --- |
| n_estimators | 500 |
| max_features | 2 |
| min_samples_leaf | 2 |

![figure: tuned random forest](https://github.com/runphilrun/kaggle-radiation-prediction/blob/master/figs/tuned-rforest_plot.png?raw=true)

# 5. Extending this idea
This analysis is by no means a complete approach to predicting solar radiation for a HI-SEAS mission, but we still developed a model through machine learning that yields more than 90% accurate predictions, and is thus a valid proof-of-concept.
A more accurate model could be developed by spending more time in several areas:
* __Choosing the right algorithm.__ Random forests worked well, and were certainly the best performer out of the four algorithms we tried with default values. But the selection of those four algorithms, including random forests, was purely arbitrary. Spending more time studying regression methods and choosing the most suitable one is surely worthwhile.
* __Tweaking the algorithm (intelligently).__ Iterating through a list of arbitrary values and choosing the one that yielded the best results is not the best approach to optimizing tunable parameters. More time spent tuning, including perhaps an analytical method to determine optimal values, would be beneficial.
* __Considering feature relationships.__ We know that some features (temperature, pressure, humidity) are not completely independent. If we somehow work their relationships and influences on one another into the algorithm, that could potentially create a more realistic model.
* __Considering more features.__ Cloud cover and precipitation, for starters. More (relevant) data might lead to a better model. Desirable features are ones that we know truly impact the transmission of light through the atmosphere, especially at the wavelengths where the HI-SEAS solar arrays are most sensitive.