# Prediction of temperature using machine learning and comparison of various regression models for Yarra Valley

The dataset that we have chosen for this project was obtained by the nasa portal https://power.larc.nasa.gov/data-access-viewer/¶<br>
Latitude:  -37.6327  Longitude:  145.7981<br>
Time Extent:   09/30/1981  -  12/31/2020<br>
Elevation:  393.26 meters<br>
Yarra Valley

##### We must first import the necessary libraries and modules for data preprocessing and model development.

In [1]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

We must first determine if the data is suitable before feeding it to the model. There may be some missing or null values in the data, as well as some non-required values that must be handled properly.
We should make an effort to comprehend the data thoroughly; this will aid in data processing.
As a result, our data consists of 15 columns and a large number of rows. 

In [2]:
yarra_df = pd.read_csv('data/YarraValley.csv')

In [3]:
yarra_df.head()

Unnamed: 0,LAT,LON,YEAR,MO,DY,T2MDEW,T2MWET,T2M_MAX,T2M_MIN,T2M,PRECTOT,WS10M,PS,QV2M,ALLSKY_SFC_SW_DWN
0,-37.63269,145.79811,1981,10,1,6.8,6.8,14.85,7.13,11.42,4.54,1.75,96.19,6.43,-999.0
1,-37.63269,145.79811,1981,10,2,10.37,10.37,17.09,11.62,13.98,15.97,1.81,95.45,8.33,-999.0
2,-37.63269,145.79811,1981,10,3,5.53,5.53,12.16,4.72,7.49,12.64,1.91,95.82,6.01,-999.0
3,-37.63269,145.79811,1981,10,4,5.97,5.98,11.61,4.57,8.28,2.77,1.83,97.14,6.03,-999.0
4,-37.63269,145.79811,1981,10,5,7.24,7.24,16.0,8.31,11.68,0.1,1.98,97.46,6.53,-999.0


The columns in the dataset are :

In [4]:
yarra_df.columns

Index(['LAT', 'LON', 'YEAR', 'MO', 'DY', 'T2MDEW', 'T2MWET', 'T2M_MAX',
       'T2M_MIN', 'T2M', 'PRECTOT', 'WS10M', 'PS', 'QV2M',
       'ALLSKY_SFC_SW_DWN'],
      dtype='object')

The shape of the dataset :

In [5]:
yarra_df.shape

(14337, 15)

Lets see the description of the datasset:

In [6]:
yarra_df.describe()

Unnamed: 0,LAT,LON,YEAR,MO,DY,T2MDEW,T2MWET,T2M_MAX,T2M_MIN,T2M,PRECTOT,WS10M,PS,QV2M,ALLSKY_SFC_SW_DWN
count,14337.0,14337.0,14337.0,14337.0,14337.0,14337.0,14337.0,14337.0,14337.0,14337.0,14337.0,14337.0,14337.0,14337.0,14337.0
mean,-37.63269,145.7981,2000.872358,6.55158,15.730557,7.287119,7.284904,17.842395,8.133656,12.666312,2.540024,1.447113,97.075005,6.763635,-40.793936
std,1.399818e-12,3.586945e-11,11.332154,3.456927,8.800849,3.205294,3.199686,7.032904,3.974373,5.321336,4.801685,0.479854,0.677898,1.605244,207.653537
min,-37.63269,145.7981,1981.0,1.0,1.0,-3.89,-3.79,3.3,-2.67,1.42,0.0,0.43,94.01,2.91,-999.0
25%,-37.63269,145.7981,1991.0,4.0,8.0,5.04,5.04,12.13,5.21,8.33,0.04,1.09,96.63,5.65,2.12
50%,-37.63269,145.7981,2001.0,7.0,16.0,6.87,6.87,16.76,7.61,11.84,0.45,1.36,97.1,6.42,3.54
75%,-37.63269,145.7981,2011.0,10.0,23.0,9.11,9.11,22.83,10.53,16.28,2.85,1.72,97.54,7.49,5.84
max,-37.63269,145.7981,2020.0,12.0,31.0,21.83,21.83,42.35,25.75,32.65,70.66,3.91,99.16,17.12,9.69


The mean Maximo temperature is approximately 18 degrees, with a huge variance of 7 degrees. We have entries for all of the columns for a given day, and the data is reliable.

# Data Mining

 ##### prints a concise sumary of a DataFrame

In [7]:
yarra_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14337 entries, 0 to 14336
Data columns (total 15 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   LAT                14337 non-null  float64
 1   LON                14337 non-null  float64
 2   YEAR               14337 non-null  int64  
 3   MO                 14337 non-null  int64  
 4   DY                 14337 non-null  int64  
 5   T2MDEW             14337 non-null  float64
 6   T2MWET             14337 non-null  float64
 7   T2M_MAX            14337 non-null  float64
 8   T2M_MIN            14337 non-null  float64
 9   T2M                14337 non-null  float64
 10  PRECTOT            14337 non-null  float64
 11  WS10M              14337 non-null  float64
 12  PS                 14337 non-null  float64
 13  QV2M               14337 non-null  float64
 14  ALLSKY_SFC_SW_DWN  14337 non-null  float64
dtypes: float64(12), int64(3)
memory usage: 1.6 MB


Let's see if the dataset has any missing values.

In [8]:
yarra_df.isnull().any() # checks if there is any null value in a particular column

LAT                  False
LON                  False
YEAR                 False
MO                   False
DY                   False
T2MDEW               False
T2MWET               False
T2M_MAX              False
T2M_MIN              False
T2M                  False
PRECTOT              False
WS10M                False
PS                   False
QV2M                 False
ALLSKY_SFC_SW_DWN    False
dtype: bool

There are no values in any of the columns, as can be seen above.

Renaming few columns

In [9]:
yarra_df = yarra_df.rename(columns={'LAT':'latitude','LON':'longitude','YEAR':'year', 'MO':'month', 'DY':'day','PRECTOT':'precipitation', 'T2M_MAX':'max_temp', 'T2M_MIN':'min_temp', 'T2MWET':'Wet Bulb', 'T2MDEW':'Dew/Frost', 'QV2M':'Specific_Humidity' })
print(yarra_df.shape)
yarra_df

(14337, 15)


Unnamed: 0,latitude,longitude,year,month,day,Dew/Frost,Wet Bulb,max_temp,min_temp,T2M,precipitation,WS10M,PS,Specific_Humidity,ALLSKY_SFC_SW_DWN
0,-37.63269,145.79811,1981,10,1,6.80,6.80,14.85,7.13,11.42,4.54,1.75,96.19,6.43,-999.00
1,-37.63269,145.79811,1981,10,2,10.37,10.37,17.09,11.62,13.98,15.97,1.81,95.45,8.33,-999.00
2,-37.63269,145.79811,1981,10,3,5.53,5.53,12.16,4.72,7.49,12.64,1.91,95.82,6.01,-999.00
3,-37.63269,145.79811,1981,10,4,5.97,5.98,11.61,4.57,8.28,2.77,1.83,97.14,6.03,-999.00
4,-37.63269,145.79811,1981,10,5,7.24,7.24,16.00,8.31,11.68,0.10,1.98,97.46,6.53,-999.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14332,-37.63269,145.79811,2020,12,27,8.52,8.29,31.43,9.98,21.17,2.82,2.41,96.12,7.24,2.67
14333,-37.63269,145.79811,2020,12,28,2.41,2.35,19.15,7.34,12.41,0.04,1.81,96.62,4.70,8.71
14334,-37.63269,145.79811,2020,12,29,4.57,4.40,22.22,6.50,14.49,0.03,1.55,97.01,5.45,9.22
14335,-37.63269,145.79811,2020,12,30,9.01,8.91,24.82,8.33,16.81,0.00,1.22,97.19,7.36,8.19


In [10]:
yarra_df.describe()

Unnamed: 0,latitude,longitude,year,month,day,Dew/Frost,Wet Bulb,max_temp,min_temp,T2M,precipitation,WS10M,PS,Specific_Humidity,ALLSKY_SFC_SW_DWN
count,14337.0,14337.0,14337.0,14337.0,14337.0,14337.0,14337.0,14337.0,14337.0,14337.0,14337.0,14337.0,14337.0,14337.0,14337.0
mean,-37.63269,145.7981,2000.872358,6.55158,15.730557,7.287119,7.284904,17.842395,8.133656,12.666312,2.540024,1.447113,97.075005,6.763635,-40.793936
std,1.399818e-12,3.586945e-11,11.332154,3.456927,8.800849,3.205294,3.199686,7.032904,3.974373,5.321336,4.801685,0.479854,0.677898,1.605244,207.653537
min,-37.63269,145.7981,1981.0,1.0,1.0,-3.89,-3.79,3.3,-2.67,1.42,0.0,0.43,94.01,2.91,-999.0
25%,-37.63269,145.7981,1991.0,4.0,8.0,5.04,5.04,12.13,5.21,8.33,0.04,1.09,96.63,5.65,2.12
50%,-37.63269,145.7981,2001.0,7.0,16.0,6.87,6.87,16.76,7.61,11.84,0.45,1.36,97.1,6.42,3.54
75%,-37.63269,145.7981,2011.0,10.0,23.0,9.11,9.11,22.83,10.53,16.28,2.85,1.72,97.54,7.49,5.84
max,-37.63269,145.7981,2020.0,12.0,31.0,21.83,21.83,42.35,25.75,32.65,70.66,3.91,99.16,17.12,9.69


In [11]:
yarra_df['date'] = pd.to_datetime(yarra_df[['year', 'month', 'day']])
cols = list(yarra_df.columns)
cols = [cols[-1]] + cols[:-1]

yarra_df = yarra_df[cols]

yarra_df

Unnamed: 0,date,latitude,longitude,year,month,day,Dew/Frost,Wet Bulb,max_temp,min_temp,T2M,precipitation,WS10M,PS,Specific_Humidity,ALLSKY_SFC_SW_DWN
0,1981-10-01,-37.63269,145.79811,1981,10,1,6.80,6.80,14.85,7.13,11.42,4.54,1.75,96.19,6.43,-999.00
1,1981-10-02,-37.63269,145.79811,1981,10,2,10.37,10.37,17.09,11.62,13.98,15.97,1.81,95.45,8.33,-999.00
2,1981-10-03,-37.63269,145.79811,1981,10,3,5.53,5.53,12.16,4.72,7.49,12.64,1.91,95.82,6.01,-999.00
3,1981-10-04,-37.63269,145.79811,1981,10,4,5.97,5.98,11.61,4.57,8.28,2.77,1.83,97.14,6.03,-999.00
4,1981-10-05,-37.63269,145.79811,1981,10,5,7.24,7.24,16.00,8.31,11.68,0.10,1.98,97.46,6.53,-999.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14332,2020-12-27,-37.63269,145.79811,2020,12,27,8.52,8.29,31.43,9.98,21.17,2.82,2.41,96.12,7.24,2.67
14333,2020-12-28,-37.63269,145.79811,2020,12,28,2.41,2.35,19.15,7.34,12.41,0.04,1.81,96.62,4.70,8.71
14334,2020-12-29,-37.63269,145.79811,2020,12,29,4.57,4.40,22.22,6.50,14.49,0.03,1.55,97.01,5.45,9.22
14335,2020-12-30,-37.63269,145.79811,2020,12,30,9.01,8.91,24.82,8.33,16.81,0.00,1.22,97.19,7.36,8.19


Setting the date column as index

In [12]:
yarra_df.set_index('date', inplace=True)



#### Getting the average tempereture from minimum and maximum temperature

In [13]:
col = yarra_df.loc[:, ('max_temp','min_temp')]
yarra_df['avg_temp'] = col.mean(axis=1)
yarra_df

Unnamed: 0_level_0,latitude,longitude,year,month,day,Dew/Frost,Wet Bulb,max_temp,min_temp,T2M,precipitation,WS10M,PS,Specific_Humidity,ALLSKY_SFC_SW_DWN,avg_temp
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
1981-10-01,-37.63269,145.79811,1981,10,1,6.80,6.80,14.85,7.13,11.42,4.54,1.75,96.19,6.43,-999.00,10.990
1981-10-02,-37.63269,145.79811,1981,10,2,10.37,10.37,17.09,11.62,13.98,15.97,1.81,95.45,8.33,-999.00,14.355
1981-10-03,-37.63269,145.79811,1981,10,3,5.53,5.53,12.16,4.72,7.49,12.64,1.91,95.82,6.01,-999.00,8.440
1981-10-04,-37.63269,145.79811,1981,10,4,5.97,5.98,11.61,4.57,8.28,2.77,1.83,97.14,6.03,-999.00,8.090
1981-10-05,-37.63269,145.79811,1981,10,5,7.24,7.24,16.00,8.31,11.68,0.10,1.98,97.46,6.53,-999.00,12.155
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2020-12-27,-37.63269,145.79811,2020,12,27,8.52,8.29,31.43,9.98,21.17,2.82,2.41,96.12,7.24,2.67,20.705
2020-12-28,-37.63269,145.79811,2020,12,28,2.41,2.35,19.15,7.34,12.41,0.04,1.81,96.62,4.70,8.71,13.245
2020-12-29,-37.63269,145.79811,2020,12,29,4.57,4.40,22.22,6.50,14.49,0.03,1.55,97.01,5.45,9.22,14.360
2020-12-30,-37.63269,145.79811,2020,12,30,9.01,8.91,24.82,8.33,16.81,0.00,1.22,97.19,7.36,8.19,16.575


##### Dropping the columns latitude and longitude  and adding the column region with Yarra Valley region.

In [14]:
yarra_df['region'] = yarra_df.latitude + yarra_df['longitude']
yarra_df.drop(['latitude', 'longitude'], axis=1, inplace=True)
yarra_df = yarra_df.replace(yarra_df['region'].values, 'YarraValley', regex=True,)
yarra_df

Unnamed: 0_level_0,year,month,day,Dew/Frost,Wet Bulb,max_temp,min_temp,T2M,precipitation,WS10M,PS,Specific_Humidity,ALLSKY_SFC_SW_DWN,avg_temp,region
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
1981-10-01,1981,10,1,6.80,6.80,14.85,7.13,11.42,4.54,1.75,96.19,6.43,-999.00,10.990,YarraValley
1981-10-02,1981,10,2,10.37,10.37,17.09,11.62,13.98,15.97,1.81,95.45,8.33,-999.00,14.355,YarraValley
1981-10-03,1981,10,3,5.53,5.53,12.16,4.72,7.49,12.64,1.91,95.82,6.01,-999.00,8.440,YarraValley
1981-10-04,1981,10,4,5.97,5.98,11.61,4.57,8.28,2.77,1.83,97.14,6.03,-999.00,8.090,YarraValley
1981-10-05,1981,10,5,7.24,7.24,16.00,8.31,11.68,0.10,1.98,97.46,6.53,-999.00,12.155,YarraValley
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2020-12-27,2020,12,27,8.52,8.29,31.43,9.98,21.17,2.82,2.41,96.12,7.24,2.67,20.705,YarraValley
2020-12-28,2020,12,28,2.41,2.35,19.15,7.34,12.41,0.04,1.81,96.62,4.70,8.71,13.245,YarraValley
2020-12-29,2020,12,29,4.57,4.40,22.22,6.50,14.49,0.03,1.55,97.01,5.45,9.22,14.360,YarraValley
2020-12-30,2020,12,30,9.01,8.91,24.82,8.33,16.81,0.00,1.22,97.19,7.36,8.19,16.575,YarraValley


### Dropping unnecesary columns
We will drop year, month and day columns as we have extracted that info to create the "date" column. Also, we will drop ALLSKY_SFC_SW_DWN, precipitation, latitud and longitude as they are always the same values because we are talking about ame place, Yarra Valley.

In [15]:
yarra_df.drop(['year','month','day','ALLSKY_SFC_SW_DWN', 'precipitation','min_temp','max_temp'], axis='columns', inplace=True)
yarra_df

Unnamed: 0_level_0,Dew/Frost,Wet Bulb,T2M,WS10M,PS,Specific_Humidity,avg_temp,region
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
1981-10-01,6.80,6.80,11.42,1.75,96.19,6.43,10.990,YarraValley
1981-10-02,10.37,10.37,13.98,1.81,95.45,8.33,14.355,YarraValley
1981-10-03,5.53,5.53,7.49,1.91,95.82,6.01,8.440,YarraValley
1981-10-04,5.97,5.98,8.28,1.83,97.14,6.03,8.090,YarraValley
1981-10-05,7.24,7.24,11.68,1.98,97.46,6.53,12.155,YarraValley
...,...,...,...,...,...,...,...,...
2020-12-27,8.52,8.29,21.17,2.41,96.12,7.24,20.705,YarraValley
2020-12-28,2.41,2.35,12.41,1.81,96.62,4.70,13.245,YarraValley
2020-12-29,4.57,4.40,14.49,1.55,97.01,5.45,14.360,YarraValley
2020-12-30,9.01,8.91,16.81,1.22,97.19,7.36,16.575,YarraValley


In [16]:
yarra_df['avg_temp'].value_counts()

9.285     14
8.240     13
9.230     13
8.790     12
5.365     12
          ..
7.850      1
7.265      1
28.620     1
17.310     1
17.525     1
Name: avg_temp, Length: 5361, dtype: int64

Let's now isolate the function that needs to be expected from the rest of the features. The remainder of the dataset is stored in yarra x, while the temperature column is stored in weather y.

In [18]:
yarra_df_num = yarra_df[list(yarra_df.dtypes[yarra_df.dtypes!='object'].index)]

Now that we've prepared our dataset, it's time to feed it to the model for training.

In [19]:
yarra_y = yarra_df_num.pop('avg_temp')
yarra_x = yarra_df_num

In [20]:
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

from sklearn.linear_model import LinearRegression

import tensorflow as tf
from tensorflow import keras

from sklearn import preprocessing

Now is the time to divide the dataset into training and testing parts.

In [21]:
train_x,test_x,train_y,test_y = train_test_split(yarra_x, yarra_y,test_size = 0.2, random_state=4)

In [22]:
train_x.head()

Unnamed: 0_level_0,Dew/Frost,Wet Bulb,T2M,WS10M,PS,Specific_Humidity
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2007-09-09,6.45,6.46,10.23,1.51,97.13,6.23
2004-04-05,10.45,10.45,17.88,1.48,97.86,8.09
2002-11-24,8.43,8.46,22.79,1.66,96.5,7.69
2001-12-02,8.16,8.17,16.37,1.29,96.29,7.17
1997-03-18,4.75,4.77,18.45,1.52,96.96,5.57


Train x has all of the features except the temperature average, and train y has the temperature range for those features. In supervised machine learning, we feed the model input and associated output first, then test it with new data.

## Implementation of the Models 

### Linear Regression

In [23]:
model = LinearRegression()
model.fit(train_x,train_y)

LinearRegression()

In [24]:
prediction = model.predict(test_x)

In [25]:
from sklearn import metrics
from math import sqrt

def mean_absolute_percentage_error(y_true, y_pred): 
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

mape = mean_absolute_percentage_error(test_y, prediction)
print(mape)

3.3101321425181887


For this model the was Error — 3.31013 

In [26]:
pd.DataFrame({'actual':test_y,
             'prediction':prediction,
             'diff':(test_y-prediction)})

Unnamed: 0_level_0,actual,prediction,diff
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2017-05-01,11.825,11.929435,-0.104435
1998-11-01,9.080,9.297473,-0.217473
2000-06-16,9.380,9.897217,-0.517217
2008-09-16,6.655,6.454141,0.200859
1997-01-23,16.650,16.349373,0.300627
...,...,...,...
2020-01-04,20.185,21.114300,-0.929300
1991-12-10,11.550,12.807768,-1.257768
1997-04-02,12.455,12.031206,0.423794
2006-06-15,7.520,7.825107,-0.305107


As we can see the accuracy of the temperature is very similar to what was predicted and as an exemple of the first row  the temperature in 01/05/17 was **11.825** and the prediction was **11.92** with a difference of only 0.10 less than it was.

# Random Forest with maximum Dept — 10

In [61]:
from sklearn.ensemble import RandomForestRegressor
regr = RandomForestRegressor(max_depth=10,random_state=0, n_estimators=100)
regr.fit(train_x,train_y)

RandomForestRegressor(max_depth=10, random_state=0)

In [51]:
prediction2 = regressor.predict(test_x)

In [57]:
def mean_absolute_percentage_error(y_true, y_pred): 
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

mape = mean_absolute_percentage_error(test_y, prediction2)
print(mape)

4.470773544888394


For this model the was Error — 4.47077 

In [53]:
pd.DataFrame({'actual':test_y,
             'prediction':prediction2,
             'diff':(test_y-prediction2)})

Unnamed: 0_level_0,actual,prediction,diff
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2017-05-01,11.825,12.015,-0.190
1998-11-01,9.080,8.605,0.475
2000-06-16,9.380,9.570,-0.190
2008-09-16,6.655,6.780,-0.125
1997-01-23,16.650,16.275,0.375
...,...,...,...
2020-01-04,20.185,20.990,-0.805
1991-12-10,11.550,12.935,-1.385
1997-04-02,12.455,12.175,0.280
2006-06-15,7.520,7.930,-0.410


As we can see the accuracy of the temperature is very similar to what was predicted and as an exemple of the first row  the temperature on 01/05/17 was **11.825** and the prediction was **12.015** with a difference of only 0.19 less than it was.

# Comparison:

Linear Regression = 3.31013 <br>

Random Forest Dept — 10 => 4.470


We can see that Linear Regression is the best model to predict the weather on this datset.