## Imports

In [2]:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler

## Prep the Data

First, we read in the datasets.

In [3]:
#| label: raw-dataset-preview
wind_df = pd.read_csv("../data/wind.csv")
solar_df = pd.read_csv("../data/solar.csv")

print("Previews of the datasets:")
print(wind_df.head(10))
print("_________________________________")
print(solar_df.head(10))

Previews of the datasets:
   id        lat        long  wind_speed farm_type  capacity  capacity_factor  \
0   0  23.510410 -117.147260        6.07  offshore        16            0.169   
1   1  24.007446  -93.946777        7.43  offshore        16            0.302   
2   2  25.069138  -97.482483        8.19  offshore        16            0.375   
3   3  25.069443  -97.463135        8.19  offshore        16            0.375   
4   4  25.069763  -97.443756        8.19  offshore        16            0.376   
5   5  25.070091  -97.424377        8.19  offshore        16            0.375   
6   6  25.070404  -97.404999        8.19  offshore        16            0.375   
7   7  25.086678  -97.482849        8.18  offshore        16            0.375   
8   8  25.087006  -97.463470        8.19  offshore        16            0.376   
9   9  25.087318  -97.444092        8.19  offshore        16            0.376   

   power_generation  estimated_cost  
0          23687.04        20800000  
1     

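Next, we shuffle the rows of each dataset. The raw files appear to be ordered (the wind preview above runs in order of increasing latitude and shows only offshore farms), so the positional train/test split used later would otherwise place systematically different rows in each set.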

In [4]:
wind_df = wind_df.sample(frac=1)
solar_df = solar_df.sample(frac=1)

print("Previews of the shuffled datasets:")
print(wind_df.head(10))
print("_________________________________")
print(solar_df.head(10))

Previews of the shuffled datasets:
            id        lat        long  wind_speed farm_type  capacity  \
68627    68627  41.921593  -92.838684        7.59   onshore        16   
25972    25972  36.406723  -99.578583        8.07   onshore        16   
117318  117318  44.622005  -70.790405        7.73   onshore        10   
48875    48875  39.382561  -83.549896        6.85   onshore        16   
69534    69534  41.455593 -107.204559       10.09   onshore        16   
81785    81785  43.028027 -100.520996        8.15   onshore        16   
65898    65898  41.775879  -94.841034        8.03   onshore        16   
15177    15177  34.647572  -94.876343        8.65   onshore        14   
41287    41287  36.667110  -75.807404        8.17  offshore        16   
38991    38991  35.806236 -117.794632        5.85   onshore        16   

        capacity_factor  power_generation  estimated_cost  
68627             0.420          58867.20        20800000  
25972             0.445          62371.20
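
The shuffle above is not seeded, so each run produces a different row order. A minimal sketch of a repeatable shuffle follows; the toy frame and the `random_state` value are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for wind_df; the values are made up for illustration.
toy_df = pd.DataFrame({"id": range(6), "wind_speed": [6.1, 7.4, 8.2, 8.2, 7.6, 5.9]})

# frac=1 returns every row in random order; random_state (an assumption,
# not used in the notebook) makes that order repeatable across runs.
shuffled_a = toy_df.sample(frac=1, random_state=0)
shuffled_b = toy_df.sample(frac=1, random_state=0)

# reset_index(drop=True) discards the old row labels so labels match positions.
shuffled_a = shuffled_a.reset_index(drop=True)
shuffled_b = shuffled_b.reset_index(drop=True)

print(shuffled_a.equals(shuffled_b))
```

Seeding the shuffle would also make the train/test split below identical from run to run.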

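Looking at each dataset, we choose the input features (`X`) and the targets (`y`). For the wind data, the boolean masks below select `lat`, `long`, `wind_speed`, `capacity`, and `capacity_factor` as features, and `power_generation` and `estimated_cost` as targets. The same column positions are used for the solar data.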

In [5]:
# Each mask has one entry per column, in column order; True keeps the column.

# Wind data
wind_X = wind_df.loc[:, [False, True, True, True, False, True, True, False, False]]  # features
wind_y = wind_df.loc[:, [False, False, False, False, False, False, False, True, True]]  # targets

# Solar data
solar_X = solar_df.loc[:, [False, True, True, True, False, True, True, False, False]]  # features
solar_y = solar_df.loc[:, [False, False, False, False, False, False, False, True, True]]  # targets
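
Positional masks break silently if the column order changes, so selecting by name may be easier to read and maintain. The sketch below uses the wind column names shown in the preview; the single row is a made-up stand-in loosely based on the first preview row, and `feature_cols` and `target_cols` are hypothetical names.

```python
import pandas as pd

# One illustrative row with the wind column layout from the preview.
toy_wind = pd.DataFrame({
    "id": [0], "lat": [23.51], "long": [-117.15], "wind_speed": [6.07],
    "farm_type": ["offshore"], "capacity": [16], "capacity_factor": [0.169],
    "power_generation": [23687.04], "estimated_cost": [20800000],
})

# Hypothetical helper lists naming the features and targets explicitly.
feature_cols = ["lat", "long", "wind_speed", "capacity", "capacity_factor"]
target_cols = ["power_generation", "estimated_cost"]

X_by_name = toy_wind[feature_cols]
y_by_name = toy_wind[target_cols]

# The same feature selection expressed with the notebook's boolean mask.
X_by_mask = toy_wind.loc[:, [False, True, True, True, False, True, True, False, False]]

print(X_by_name.equals(X_by_mask))
```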

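Now we split each dataset into training and testing sets by position. The first 100,000 wind rows and the first 9,500 solar rows are used for training (about 80%), and the remaining rows (about 20%) are held out for testing. Because the rows were shuffled, slicing by position yields a random split.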

In [6]:
# Wind data
wind_X_train = wind_X[:100000]
wind_X_test = wind_X[100000:]
wind_y_train = wind_y[:100000]
wind_y_test = wind_y[100000:]

# Solar data
solar_X_train = solar_X[:9500]
solar_X_test = solar_X[9500:]
solar_y_train = solar_y[:9500]
solar_y_test = solar_y[9500:]
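
As an alternative to hard-coded row counts, scikit-learn's `train_test_split` can shuffle and split in one call. This is a minimal sketch on synthetic arrays, not the notebook's method; the array sizes and the `random_state` value are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: 100 rows, 5 features, 2 targets (made-up values).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = rng.normal(size=(100, 2))

# test_size=0.2 holds out 20% of the rows. Shuffling is on by default,
# and random_state fixes the shuffle so the split is repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(X_train.shape, X_test.shape)
```

Expressing the split as a fraction keeps the 80/20 ratio correct even if the number of rows in the CSV files changes.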

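Some models perform better when their inputs are on a comparable scale. `StandardScaler` standardizes each feature to zero mean and unit variance; it does not bound values to a fixed range such as [-1, 1]. The scaler is fit on the training set only and then applied to both the training and testing sets, so no information from the test set leaks into preprocessing.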

In [7]:
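# Use one scaler per dataset so the fitted statistics of each remain available.
wind_scaler = StandardScaler()
solar_scaler = StandardScaler()

# Wind data: fit on the training set only, then transform both sets
wind_scaler.fit(wind_X_train)
wind_X_train = wind_scaler.transform(wind_X_train)
wind_X_test = wind_scaler.transform(wind_X_test)

# Solar data: fit on the training set only, then transform both sets
solar_scaler.fit(solar_X_train)
solar_X_train = solar_scaler.transform(solar_X_train)
solar_X_test = solar_scaler.transform(solar_X_test)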
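
To illustrate what standardization does, here is a small sketch on a made-up matrix (the values are assumptions, chosen so that one column contains an outlier). It checks the column means and standard deviations after scaling and shows that the result is not confined to [-1, 1].

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature matrix with one outlier (50.0) in the first column.
X_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [50.0, 40.0]])

demo_scaler = StandardScaler()
X_scaled = demo_scaler.fit_transform(X_demo)

print(X_scaled.mean(axis=0))  # approximately 0 for each column
print(X_scaled.std(axis=0))   # approximately 1 for each column
print(X_scaled.max())         # greater than 1: standardized values are unbounded
```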

## Training the Models

Now that the data is pre-processed, the models can be fit. We set `random_state=0` so that the random forests are built the same way for both data sets and across re-runs of the training.

In [8]:
# Wind regression
wind_reg = RandomForestRegressor(random_state=0)
wind_reg.fit(wind_X_train, wind_y_train)

# Solar regression
solar_reg = RandomForestRegressor(random_state=0)
solar_reg.fit(solar_X_train, solar_y_train)
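
The effect of `random_state` can be checked on synthetic data. The sketch below, which assumes made-up features and two made-up targets, fits two forests with the same seed and compares their predictions. `n_estimators=10` is chosen only to keep the example fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Two made-up targets, mirroring the two-column y used in the notebook.
y = np.column_stack([X[:, 0] + X[:, 1], 2.0 * X[:, 2]])

# Same data and same random_state, so both forests are built identically.
model_a = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)
model_b = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

pred_a = model_a.predict(X[:5])
pred_b = model_b.predict(X[:5])

print(pred_a.shape)                 # one column per target
print(np.allclose(pred_a, pred_b))  # identical predictions with a shared seed
```

Note that a fixed `random_state` only controls the randomness inside the forest; the training data must also be the same for results to match between runs.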

## Testing the Models

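With the models trained, we can now generate predictions on the held-out test sets. Each row of the output holds the two predicted targets, in the same order as the columns of `wind_y` and `solar_y`.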

In [9]:
# Wind
wind_test = wind_reg.predict(wind_X_test)
print("Predicted outputs for wind data:")
print(wind_test)
print()

# Solar
solar_test = solar_reg.predict(solar_X_test)
print("Predicted outputs for solar data:")
print(solar_test)

Predicted outputs for wind data:
[[   36161.28   20800000.    ]
 [   58586.88   20800000.    ]
 [   76248.4416 20800000.    ]
 ...
 [   38684.16   15600000.    ]
 [   66996.48   20800000.    ]
 [   60829.44   20800000.    ]]

Predicted outputs for solar data:
[[1.01526034e+01 1.33000000e+04]
 [9.62149950e+03 1.33000000e+07]
 [1.18088103e+01 1.33000000e+04]
 ...
 [2.29657039e+01 2.66000000e+04]
 [1.16820008e+04 1.33000000e+07]
 [2.68208152e+02 3.99000000e+05]]
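
Printing predictions does not show how accurate they are. One possible next step, sketched here on synthetic data rather than the notebook's datasets, is to compare predictions against the held-out targets with `r2_score` and `mean_absolute_error`. The data, split sizes, and `n_estimators` value are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
# Two made-up targets standing in for power_generation and estimated_cost.
y = np.column_stack([X[:, 0] + X[:, 1], 2.0 * X[:, 2]])

# Positional 80/20 split, as in the notebook.
X_train, X_test = X[:240], X[240:]
y_train, y_test = y[:240], y[240:]

model = RandomForestRegressor(n_estimators=20, random_state=0).fit(X_train, y_train)
y_pred = model.predict(X_test)

# multioutput="raw_values" returns one score per target column
# instead of a single average over both targets.
r2_per_target = r2_score(y_test, y_pred, multioutput="raw_values")
mae_per_target = mean_absolute_error(y_test, y_pred, multioutput="raw_values")

print("R^2 per target:", r2_per_target)
print("MAE per target:", mae_per_target)
```

Reporting the metrics per target matters here because the two targets sit on very different scales, and an averaged error would be dominated by the larger one.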
