### Initial Baseline Model

To give later models a reference point, we start with a simple baseline. For each output measurement (temperature and relative humidity), it takes the mean value across all training examples and predicts that value for every test example.

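We report two metrics: the Root Mean Squared Error (RMSE), which measures the difference between true and predicted values, and the Relative RMSE (RRMSE), which expresses the RMSE as a percentage of the mean test value. For temperature, which can be negative, the mean is measured from the minimum test value.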
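
The two metrics can be sketched as small helper functions. This is a minimal illustration rather than the notebook's own code: the names `rmse`, `rrmse`, and the `offset` parameter are hypothetical, the sample values are made up, and unlike the cells below the helpers do not round intermediate results.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def rrmse(y_true, y_pred, offset=0.0):
    """RMSE as a percentage of (mean(y_true) - offset).

    `offset` is a hypothetical parameter: 0 gives the humidity-style
    normalisation, min(y_true) gives the temperature-style one.
    """
    y_true = np.asarray(y_true, dtype=float)
    return 100.0 * rmse(y_true, y_pred) / (np.mean(y_true) - offset)

# Toy values, not taken from the dataset
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.full(y_true.shape, 2.5)  # constant prediction, like a mean baseline

toy_rmse = rmse(y_true, y_pred)
toy_rrmse_mean = rrmse(y_true, y_pred)                          # divide by the mean
toy_rrmse_shifted = rrmse(y_true, y_pred, offset=y_true.min())  # divide by mean - min
```

Subtracting the minimum makes the denominator smaller, so the shifted RRMSE is always at least as large as the unshifted one when the values are non-negative. The humidity and temperature RRMSE figures are therefore not directly comparable with each other.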

In [1]:
import numpy as np
import pandas as pd

In [2]:
traindf = pd.read_csv("train.csv")
testdf = pd.read_csv("test.csv")

In [3]:
testdf.head()

Unnamed: 0,stationId,imageDate,imgPath,meteoDate,dew,humidity,precipitation,roadCondition,temperature C bellow 5cm,temperature C 0cm,temperature C 20cm,temperature C 2m,warnings,wind direction,speed[m/s],imageDate_t,imageDate_ts,cluster,time
0,1093-0,16/01/2019 13:07,/save/1093-0/59/1547644059_1093-0.jpg,16/01/2019 13:00,2.5,94.9,none,wet,0,3.6,3.3,3.3,none,349,1.9,16/01/2019 13:07,5836059,2,13:07:39
1,1242-0,21/02/2019 11:50,/save/1242-0/21/1550749821_1242-0.jpg,21/02/2019 11:50,5.0,92.1,shower,wet,0,8.6,0.0,6.2,none,57,0.5,21/02/2019 11:50,8941821,0,11:50:21
2,67-0,24/12/2018 12:10,/save/67-0/20/1545653420_67-0.jpg,24/12/2018 13:06,-1.8,87.7,unknown,saline,0,1.2,0.0,0.0,low warning,136,5.6,24/12/2018 12:10,3845420,1,12:10:20
3,1881-0,23/02/2019 11:14,/save/1881-0/44/1550920444_1881-0.jpg,23/02/2019 11:10,-11.7,47.7,none,dry,0,6.2,0.2,-1.9,none,19,0.0,23/02/2019 11:14,9112444,0,11:14:04
4,1215-0,24/02/2019 13:50,/save/1215-0/46/1551016246_1215-0.jpg,24/02/2019 13:50,-5.4,39.3,none,dry,0,12.4,0.0,7.7,none,178,0.0,24/02/2019 13:50,9208246,0,13:50:46


In [4]:
# Extract the training values of the two output measurements
humidityFeature = traindf["humidity"].values
temperatureFeature = traindf["temperature C 2m"].values

In [5]:
testHumidity = testdf["humidity"].values
testTemperature = testdf["temperature C 2m"].values

In [6]:
meanHumidity = np.round(np.mean(humidityFeature), 4)
print("Mean Humidity: {}".format(meanHumidity))
baseHumPrediction = np.full(testdf.shape[0], meanHumidity)  # same constant for every test example

Mean Humidity: 76.4434


In [7]:
meanTemperature = np.round(np.mean(temperatureFeature), 4)
print("Mean Temperature: {}".format(meanTemperature))
baseTempPrediction = np.full(testdf.shape[0], meanTemperature)  # same constant for every test example

Mean Temperature: 3.7677


In [8]:
humidityRMSE = np.round(np.sqrt(np.mean((testHumidity - baseHumPrediction) ** 2)), 4)
print("Baseline Humidity RMSE (%): {}".format(humidityRMSE))
avgHumidity = np.mean(testHumidity)
humidityRRMSE = np.round(np.sqrt(np.mean((testHumidity - baseHumPrediction) ** 2)) / avgHumidity, 4) * 100
print("Baseline Humidity RRMSE (%): {}".format(humidityRRMSE))

Baseline Humidity RMSE (%): 21.3469
Baseline Humidity RRMSE (%): 27.91


In [9]:
temperatureRMSE = np.round(np.sqrt(np.mean((testTemperature - baseTempPrediction) ** 2)), 4)
print("Baseline Temperature RMSE (degrees Celsius): {}".format(temperatureRMSE))
scaledAvgTemperature = np.mean(testTemperature) - min(testTemperature)
temperatureRRMSE = np.round(np.sqrt(np.mean((testTemperature - baseTempPrediction) ** 2)) / scaledAvgTemperature, 4) * 100
print("Baseline Temperature RRMSE (%): {}".format(temperatureRRMSE))

Baseline Temperature RMSE (degrees Celsius): 4.7704
Baseline Temperature RRMSE (%): 40.9


### Improved Baseline Model

As we've noted, the trained models rely on the features specific to each station to learn and predict. Therefore, we also consider a baseline model that takes the mean value across all training examples for a particular station and predicts this value on any testing examples for that station.
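
A vectorised sketch of the same idea, assuming the station identifier lives in a `stationId` column as in the table above. The toy frames and their values are made up for illustration, and the `groupby`/`map` approach is an alternative formulation, not the code used in the cells below.

```python
import numpy as np
import pandas as pd

# Toy frames standing in for train.csv / test.csv (values are made up)
toy_train = pd.DataFrame({
    "stationId": ["A", "A", "B", "B", "B"],
    "humidity": [80.0, 90.0, 40.0, 50.0, 60.0],
})
toy_test = pd.DataFrame({
    "stationId": ["A", "B", "B"],
    "humidity": [88.0, 45.0, 55.0],
})

# Mean training humidity for each station
station_means = toy_train.groupby("stationId")["humidity"].mean()

# Look up each test row's station mean; stations unseen in training become NaN
station_pred = toy_test["stationId"].map(station_means).to_numpy()

# Global-mean baseline for comparison
global_pred = np.full(len(toy_test), toy_train["humidity"].mean())

y_true = toy_test["humidity"].to_numpy()
station_rmse = np.sqrt(np.mean((y_true - station_pred) ** 2))
global_rmse = np.sqrt(np.mean((y_true - global_pred) ** 2))
```

In this toy example the stations have very different humidity levels, so the per-station prediction gives a much lower RMSE than the global mean. Note that `map` silently yields `NaN` for a station with no training data, whereas a dictionary lookup would raise a `KeyError`.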

In [10]:
# Collect the training humidities for each station (columns addressed by name, not position)
stationHums = {}
for station, humidity in zip(traindf["stationId"], traindf["humidity"]):
    stationHums.setdefault(station, []).append(humidity)

# Replace each station's list with its mean, rounded to 4 decimals
stationHums = {station: np.round(np.mean(humidities), 4)
               for station, humidities in stationHums.items()}

# Predict the station mean for every test example
# (raises KeyError if a test station has no training data)
baseHumPrediction = np.asarray([stationHums[station] for station in testdf["stationId"]])

humidityRMSE = np.round(np.sqrt(np.mean((testHumidity - baseHumPrediction) ** 2)), 4)
print("Improved Baseline Humidity RMSE (%): {}".format(humidityRMSE))
avgHumidity = np.mean(testHumidity)
humidityRRMSE = np.round(np.sqrt(np.mean((testHumidity - baseHumPrediction) ** 2)) / avgHumidity, 4) * 100
print("Improved Baseline Humidity RRMSE (%): {}".format(humidityRRMSE))

Improved Baseline Humidity RMSE (%): 14.269
Improved Baseline Humidity RRMSE (%): 18.66


In [11]:
# Collect the training temperatures for each station (columns addressed by name, not position)
stationTemps = {}
for station, temperature in zip(traindf["stationId"], traindf["temperature C 2m"]):
    stationTemps.setdefault(station, []).append(temperature)

# Replace each station's list with its mean, rounded to 4 decimals
stationTemps = {station: np.round(np.mean(temps), 4)
                for station, temps in stationTemps.items()}

# Predict the station mean for every test example
# (raises KeyError if a test station has no training data)
baseTempPrediction = np.asarray([stationTemps[station] for station in testdf["stationId"]])

temperatureRMSE = np.round(np.sqrt(np.mean((testTemperature - baseTempPrediction) ** 2)), 4)
print("Improved Baseline Temperature RMSE (degrees Celsius): {}".format(temperatureRMSE))
scaledAvgTemperature = np.mean(testTemperature) - min(testTemperature)
temperatureRRMSE = np.round(np.sqrt(np.mean((testTemperature - baseTempPrediction) ** 2)) / scaledAvgTemperature, 4) * 100
print("Improved Baseline Temperature RRMSE (%): {}".format(temperatureRRMSE))

Improved Baseline Temperature RMSE (degrees Celsius): 3.898
Improved Baseline Temperature RRMSE (%): 33.42


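The goal for our trained CNNs is to beat both baselines. The per-station baseline is the stronger of the two, with an RMSE of 14.269 for humidity and 3.898 degrees Celsius for temperature, so it is the more meaningful bar to clear.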