# LA Weather Prediction

In this project, I predict Los Angeles weather using pandas and scikit-learn. First I download the LA weather data from NOAA, then run some analysis and plots on the downloaded data. Next I create a model that predicts the weather from previous data, and finally I build a backtesting engine on the data.

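I'd also like to explain the meaning of some of the columns we will be working with. The definitions below follow NOAA's GHCN-Daily documentation, which states metric units. The CSV used here appears to be in US standard units instead (TMAX values around 70 suggest degrees Fahrenheit, and PRCP values suggest inches), so treat the units as a guide:
- STATION = 11-character station identification code (see the ghcnd-stations section of NOAA's GHCN-Daily documentation for an explanation)
- DATE = 8-character date in YYYYMMDD format (e.g. 19860529 = May 29, 1986)
- ELEMENT = 4-character indicator of element type
- PRCP = Precipitation (tenths of mm)
- SNOW = Snowfall (mm)
- SNWD = Snow depth (mm)
- TMAX = Maximum temperature (tenths of degrees C)
- TMIN = Minimum temperature (tenths of degrees C)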

### Method
- We'll download the LA weather data
- Then we'll explore the dataset
- Then we'll prepare the dataset for predicting the weather
- We'll train and test a machine learning model
- We'll set up a backtesting engine
- We'll improve the accuracy of the model

## Downloading and loading the dataset

We'll download our data from __[NOAA](https://www.noaa.gov/)__: first we search for the LA weather data on the website and download it as a CSV file. After downloading, we use the pandas `read_csv` function to load the file.

In [495]:
#Importing the required libraries

import pandas as pd
import seaborn as sns


#reading our weather dataset
data = pd.read_csv('LAWeather.csv', index_col = "DATE")
data

DATE,STATION,NAME,ACMH,ACSH,AWND,FMTM,PGTM,PRCP,SNOW,SNWD,...,WT10,WT11,WT13,WT14,WT16,WT18,WT21,WV01,WV03,WV20
1980-01-01,USW00023174,"LOS ANGELES INTERNATIONAL AIRPORT, CA US",50.0,80.0,,,1406.0,0.00,0.0,0.0,...,,,,,,,,,,
1980-01-02,USW00023174,"LOS ANGELES INTERNATIONAL AIRPORT, CA US",10.0,0.0,,,618.0,0.00,0.0,0.0,...,,,,,,,,,,
1980-01-03,USW00023174,"LOS ANGELES INTERNATIONAL AIRPORT, CA US",70.0,80.0,,,1454.0,0.00,0.0,0.0,...,,,,,,,,,,
1980-01-04,USW00023174,"LOS ANGELES INTERNATIONAL AIRPORT, CA US",40.0,90.0,,,1600.0,0.00,0.0,0.0,...,,,,,,,,,,
1980-01-05,USW00023174,"LOS ANGELES INTERNATIONAL AIRPORT, CA US",60.0,80.0,,,1312.0,0.00,0.0,0.0,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2023-02-04,USW00023174,"LOS ANGELES INTERNATIONAL AIRPORT, CA US",,,5.82,,,0.00,,,...,,,,,,,,,,
2023-02-05,USW00023174,"LOS ANGELES INTERNATIONAL AIRPORT, CA US",,,11.18,,,0.01,,,...,,,,,,,,,,
2023-02-06,USW00023174,"LOS ANGELES INTERNATIONAL AIRPORT, CA US",,,10.07,,,0.00,,,...,,,,,,,,,,
2023-02-07,USW00023174,"LOS ANGELES INTERNATIONAL AIRPORT, CA US",,,4.47,,,0.00,,,...,,,,,,,,,,


## Exploring our Data

Now we'll use the pandas `describe()` method to compute summary statistics for our DataFrame, such as the count, mean, standard deviation, and percentiles of the numerical columns. It can analyze both numeric and object Series, as well as DataFrames with columns of mixed data types.

In [496]:
#Checking our data

data.describe()

,ACMH,ACSH,AWND,FMTM,PGTM,PRCP,SNOW,SNWD,TAVG,TMAX,...,WT10,WT11,WT13,WT14,WT16,WT18,WT21,WV01,WV03,WV20
count,6268.0,6269.0,14280.0,6069.0,11047.0,15745.0,6390.0,7059.0,6266.0,15745.0,...,3.0,2.0,2968.0,323.0,1805.0,1.0,98.0,24.0,2.0,7.0
mean,45.8403,45.731536,7.435854,1591.157028,1464.353218,0.033396,0.0,0.0,63.564635,70.609717,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
std,31.680664,34.118125,2.129677,874.808413,353.581407,0.180189,0.0,0.0,6.143643,7.293328,...,0.0,0.0,0.0,0.0,0.0,,0.0,0.0,0.0,0.0
min,0.0,0.0,1.79,0.0,0.0,0.0,0.0,0.0,0.0,50.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
25%,20.0,10.0,6.04,1410.0,1331.0,0.0,0.0,0.0,59.0,65.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
50%,50.0,40.0,7.16,1527.0,1455.0,0.0,0.0,0.0,64.0,70.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
75%,70.0,80.0,8.5,1646.0,1618.0,0.0,0.0,0.0,68.0,75.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
max,100.0,100.0,22.59,9999.0,2359.0,4.53,0.0,0.0,87.0,106.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


### Checking for missing values

In our DataFrame, you can see that some entries are numerical while others show NaN. This means that some values are missing. To check how many values are missing in each column, we use the pandas `isna().sum()` method, which returns the number of missing values in each column of the DataFrame.

In [497]:
#Checking for missing values

data.isna().sum()

STATION        0
NAME           0
ACMH        9477
ACSH        9476
AWND        1465
FMTM        9676
PGTM        4698
PRCP           0
SNOW        9355
SNWD        8686
TAVG        9479
TMAX           0
TMIN           0
TSUN       14599
WDF1       15287
WDF2        6035
WDF5        6439
WDFG        9500
WESD        9901
WSF1       15287
WSF2        6035
WSF5        6439
WSFG        9496
WT01        9391
WT02       14842
WT03       15577
WT04       15743
WT05       15628
WT07       15387
WT08        8530
WT09       15672
WT10       15742
WT11       15743
WT13       12777
WT14       15422
WT16       13940
WT18       15744
WT21       15647
WV01       15721
WV03       15743
WV20       15738
dtype: int64

We'll then check the fraction of missing values in each column (0 means nothing is missing, 1 means everything is missing) so we can decide which columns to use as predictors.


In [498]:
#Checking the percentage of missing values to determine what elements we'll use for predictions

missing_pct = data.apply(pd.isnull).sum()/data.shape[0]
missing_pct

STATION    0.000000
NAME       0.000000
ACMH       0.601905
ACSH       0.601842
AWND       0.093045
FMTM       0.614544
PGTM       0.298380
PRCP       0.000000
SNOW       0.594157
SNWD       0.551667
TAVG       0.602032
TMAX       0.000000
TMIN       0.000000
TSUN       0.927215
WDF1       0.970911
WDF2       0.383296
WDF5       0.408955
WDFG       0.603366
WESD       0.628835
WSF1       0.970911
WSF2       0.383296
WSF5       0.408955
WSFG       0.603112
WT01       0.596443
WT02       0.942648
WT03       0.989330
WT04       0.999873
WT05       0.992569
WT07       0.977263
WT08       0.541759
WT09       0.995364
WT10       0.999809
WT11       0.999873
WT13       0.811496
WT14       0.979486
WT16       0.885360
WT18       0.999936
WT21       0.993776
WV01       0.998476
WV03       0.999873
WV20       0.999555
dtype: float64
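
The missing fractions above can also drive the column choice directly. The following is a minimal sketch on a made-up toy frame, assuming a hypothetical cutoff `max_missing`; the notebook itself picks its columns by hand in a later cell, so this is an alternative illustration rather than the method used here.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the weather data (values are made up).
toy = pd.DataFrame({
    "TMAX": [71, 79, 78, 72],
    "PRCP": [0.0, 0.0, np.nan, 0.1],
    "WT04": [np.nan, np.nan, np.nan, 1.0],
})

# Fraction of missing values per column, equivalent to isna().sum() / number of rows.
toy_missing_pct = toy.isna().mean()

# Hypothetical cutoff (an assumption): keep columns with at most 50% missing values.
max_missing = 0.5
kept_columns = toy_missing_pct[toy_missing_pct <= max_missing].index.tolist()
print(kept_columns)
```

With this toy data, `WT04` is dropped because three of its four values are missing, while `TMAX` and `PRCP` are kept.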

In [499]:
print(data.columns)

Index(['STATION', 'NAME', 'ACMH', 'ACSH', 'AWND', 'FMTM', 'PGTM', 'PRCP',
       'SNOW', 'SNWD', 'TAVG', 'TMAX', 'TMIN', 'TSUN', 'WDF1', 'WDF2', 'WDF5',
       'WDFG', 'WESD', 'WSF1', 'WSF2', 'WSF5', 'WSFG', 'WT01', 'WT02', 'WT03',
       'WT04', 'WT05', 'WT07', 'WT08', 'WT09', 'WT10', 'WT11', 'WT13', 'WT14',
       'WT16', 'WT18', 'WT21', 'WV01', 'WV03', 'WV20'],
      dtype='object')


In [500]:
weather = data[['STATION', 'NAME', 'ACMH', 'ACSH', 'AWND', 'FMTM', 'PGTM', 'PRCP',
       'SNOW', 'SNWD', 'TAVG', 'TMAX', 'TMIN']].copy()

We copied the selected columns into a new DataFrame, `weather`, which makes it easy to choose the columns we want to use as predictors.

Next we convert all column names to lowercase with the `str.lower()` method. Lowercase names are easier to type, so we don't have to switch to upper case whenever we reference a column.

In [501]:
weather.columns = weather.columns.str.lower()

In [502]:
weather

DATE,station,name,acmh,acsh,awnd,fmtm,pgtm,prcp,snow,snwd,tavg,tmax,tmin
1980-01-01,USW00023174,"LOS ANGELES INTERNATIONAL AIRPORT, CA US",50.0,80.0,,,1406.0,0.00,0.0,0.0,,71,54
1980-01-02,USW00023174,"LOS ANGELES INTERNATIONAL AIRPORT, CA US",10.0,0.0,,,618.0,0.00,0.0,0.0,,79,50
1980-01-03,USW00023174,"LOS ANGELES INTERNATIONAL AIRPORT, CA US",70.0,80.0,,,1454.0,0.00,0.0,0.0,,78,55
1980-01-04,USW00023174,"LOS ANGELES INTERNATIONAL AIRPORT, CA US",40.0,90.0,,,1600.0,0.00,0.0,0.0,,72,47
1980-01-05,USW00023174,"LOS ANGELES INTERNATIONAL AIRPORT, CA US",60.0,80.0,,,1312.0,0.00,0.0,0.0,,68,47
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2023-02-04,USW00023174,"LOS ANGELES INTERNATIONAL AIRPORT, CA US",,,5.82,,,0.00,,,56.0,66,46
2023-02-05,USW00023174,"LOS ANGELES INTERNATIONAL AIRPORT, CA US",,,11.18,,,0.01,,,57.0,63,50
2023-02-06,USW00023174,"LOS ANGELES INTERNATIONAL AIRPORT, CA US",,,10.07,,,0.00,,,58.0,71,53
2023-02-07,USW00023174,"LOS ANGELES INTERNATIONAL AIRPORT, CA US",,,4.47,,,0.00,,,59.0,67,47


#### Filling up missing values

There are several ways to fill missing values in pandas. We'll use the `ffill()` method, which replaces each missing value with the last valid value before it. Values at the start of a column have nothing before them, so some values are still missing after the forward fill.

We'll fill those remaining values with the column mean, computed with the pandas `mean()` method.

In [503]:
#Forward filling through the dataset.

weather = weather.ffill()
weather.apply(pd.isnull).sum()

station       0
name          0
acmh          0
acsh          0
awnd       1461
fmtm       5569
pgtm          0
prcp          0
snow          0
snwd          0
tavg       6665
tmax          0
tmin          0
dtype: int64

In [504]:
#Using the mean() method to fill the dataset (numeric columns only)

weather = weather.fillna(weather.mean(numeric_only=True))



In [505]:
weather

DATE,station,name,acmh,acsh,awnd,fmtm,pgtm,prcp,snow,snwd,tavg,tmax,tmin
1980-01-01,USW00023174,"LOS ANGELES INTERNATIONAL AIRPORT, CA US",50.0,80.0,7.437264,1567.728282,1406.0,0.00,0.0,0.0,65.244053,71,54
1980-01-02,USW00023174,"LOS ANGELES INTERNATIONAL AIRPORT, CA US",10.0,0.0,7.437264,1567.728282,618.0,0.00,0.0,0.0,65.244053,79,50
1980-01-03,USW00023174,"LOS ANGELES INTERNATIONAL AIRPORT, CA US",70.0,80.0,7.437264,1567.728282,1454.0,0.00,0.0,0.0,65.244053,78,55
1980-01-04,USW00023174,"LOS ANGELES INTERNATIONAL AIRPORT, CA US",40.0,90.0,7.437264,1567.728282,1600.0,0.00,0.0,0.0,65.244053,72,47
1980-01-05,USW00023174,"LOS ANGELES INTERNATIONAL AIRPORT, CA US",60.0,80.0,7.437264,1567.728282,1312.0,0.00,0.0,0.0,65.244053,68,47
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2023-02-04,USW00023174,"LOS ANGELES INTERNATIONAL AIRPORT, CA US",13.0,13.0,5.820000,1510.000000,1629.0,0.00,0.0,0.0,56.000000,66,46
2023-02-05,USW00023174,"LOS ANGELES INTERNATIONAL AIRPORT, CA US",13.0,13.0,11.180000,1510.000000,1629.0,0.01,0.0,0.0,57.000000,63,50
2023-02-06,USW00023174,"LOS ANGELES INTERNATIONAL AIRPORT, CA US",13.0,13.0,10.070000,1510.000000,1629.0,0.00,0.0,0.0,58.000000,71,53
2023-02-07,USW00023174,"LOS ANGELES INTERNATIONAL AIRPORT, CA US",13.0,13.0,4.470000,1510.000000,1629.0,0.00,0.0,0.0,59.000000,67,47
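
The two-step fill above can be seen on a tiny example. This is a minimal sketch with a made-up Series: the forward fill cannot repair the leading gaps, and the mean of the forward-filled values then fills what is left.

```python
import numpy as np
import pandas as pd

# Made-up values: two leading gaps and one gap in the middle.
s = pd.Series([np.nan, np.nan, 5.0, np.nan, 7.0])

# Step 1: forward fill. The middle gap takes the value 5.0,
# but the two leading gaps have no earlier value to copy.
forward_filled = s.ffill()

# Step 2: fill the remaining gaps with the mean of the forward-filled values.
filled = forward_filled.fillna(forward_filled.mean())
print(filled.tolist())
```

Here the mean is taken over the forward-filled values 5.0, 5.0, and 7.0, mirroring the order of operations used in the notebook.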


In [506]:
weather.dtypes

station     object
name        object
acmh       float64
acsh       float64
awnd       float64
fmtm       float64
pgtm       float64
prcp       float64
snow       float64
snwd       float64
tavg       float64
tmax         int64
tmin         int64
dtype: object

### Converting to time series data

After checking the types of our columns, we convert the DATE index from strings to datetimes, which lets us group and count the rows by year, month, week, or day later on. We'll use the `pd.to_datetime` function for the conversion.

In [507]:
weather.index

Index(['1980-01-01', '1980-01-02', '1980-01-03', '1980-01-04', '1980-01-05',
       '1980-01-06', '1980-01-07', '1980-01-08', '1980-01-09', '1980-01-10',
       ...
       '2023-01-30', '2023-01-31', '2023-02-01', '2023-02-02', '2023-02-03',
       '2023-02-04', '2023-02-05', '2023-02-06', '2023-02-07', '2023-02-08'],
      dtype='object', name='DATE', length=15745)

In [508]:
weather.index = pd.to_datetime(weather.index)
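
As a small illustration of what the conversion buys us, the sketch below converts a made-up string index and reads calendar fields from it. The three dates are chosen only for the example.

```python
import pandas as pd

# A string index like the one shown above (only three made-up entries).
string_index = pd.Index(["1980-01-01", "1980-01-02", "2023-02-08"], name="DATE")

# After conversion the index is a DatetimeIndex with calendar accessors.
datetime_index = pd.to_datetime(string_index)

print(datetime_index.year.tolist())         # calendar year of each row
print(datetime_index.month.tolist())        # calendar month of each row
print(datetime_index.day_of_year.tolist())  # day number within the year
```

These accessors (`year`, `month`, `day_of_year`) are not available on a plain string index.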

In [509]:
weather

DATE,station,name,acmh,acsh,awnd,fmtm,pgtm,prcp,snow,snwd,tavg,tmax,tmin
1980-01-01,USW00023174,"LOS ANGELES INTERNATIONAL AIRPORT, CA US",50.0,80.0,7.437264,1567.728282,1406.0,0.00,0.0,0.0,65.244053,71,54
1980-01-02,USW00023174,"LOS ANGELES INTERNATIONAL AIRPORT, CA US",10.0,0.0,7.437264,1567.728282,618.0,0.00,0.0,0.0,65.244053,79,50
1980-01-03,USW00023174,"LOS ANGELES INTERNATIONAL AIRPORT, CA US",70.0,80.0,7.437264,1567.728282,1454.0,0.00,0.0,0.0,65.244053,78,55
1980-01-04,USW00023174,"LOS ANGELES INTERNATIONAL AIRPORT, CA US",40.0,90.0,7.437264,1567.728282,1600.0,0.00,0.0,0.0,65.244053,72,47
1980-01-05,USW00023174,"LOS ANGELES INTERNATIONAL AIRPORT, CA US",60.0,80.0,7.437264,1567.728282,1312.0,0.00,0.0,0.0,65.244053,68,47
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2023-02-04,USW00023174,"LOS ANGELES INTERNATIONAL AIRPORT, CA US",13.0,13.0,5.820000,1510.000000,1629.0,0.00,0.0,0.0,56.000000,66,46
2023-02-05,USW00023174,"LOS ANGELES INTERNATIONAL AIRPORT, CA US",13.0,13.0,11.180000,1510.000000,1629.0,0.01,0.0,0.0,57.000000,63,50
2023-02-06,USW00023174,"LOS ANGELES INTERNATIONAL AIRPORT, CA US",13.0,13.0,10.070000,1510.000000,1629.0,0.00,0.0,0.0,58.000000,71,53
2023-02-07,USW00023174,"LOS ANGELES INTERNATIONAL AIRPORT, CA US",13.0,13.0,4.470000,1510.000000,1629.0,0.00,0.0,0.0,59.000000,67,47


In [510]:
weather.index.year.value_counts().sort_index()

1980    366
1981    365
1982    365
1983    365
1984    366
1985    365
1986    365
1987    365
1988    366
1989    365
1990    365
1991    365
1992    366
1993    365
1994    365
1995    365
1996    366
1997    365
1998    365
1999    365
2000    366
2001    365
2002    365
2003    365
2004    366
2005    365
2006    365
2007    365
2008    366
2009    365
2010    365
2011    365
2012    366
2013    365
2014    365
2015    365
2016    366
2017    365
2018    365
2019    365
2020    366
2021    365
2022    365
2023     39
Name: DATE, dtype: int64

In [511]:
weather["station"].unique()

array(['USW00023174'], dtype=object)

## Setting our Target variable

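To set our target variable, that is, the value we want to predict, we use the `shift` method. `shift(-1)` moves the `tmax` column up by one row, so each row's `target_x` holds the maximum temperature of the following day. This is the value the model learns to predict.

Because the last day has no following day, its target is missing after the shift, and the forward fill is meant to fill it with the previous day's value. Note that `weather.ffill()` returns a new DataFrame: the call below only displays the filled result and does not assign it back to `weather`, so the last row's `target_x` is still missing in `weather` itself.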

In [512]:
# Setting our target

weather["target_x"] = weather.shift(-1)["tmax"]

In [513]:
# Forward filling to clear missing values

weather.ffill()

DATE,station,name,acmh,acsh,awnd,fmtm,pgtm,prcp,snow,snwd,tavg,tmax,tmin,target_x
1980-01-01,USW00023174,"LOS ANGELES INTERNATIONAL AIRPORT, CA US",50.0,80.0,7.437264,1567.728282,1406.0,0.00,0.0,0.0,65.244053,71,54,79.0
1980-01-02,USW00023174,"LOS ANGELES INTERNATIONAL AIRPORT, CA US",10.0,0.0,7.437264,1567.728282,618.0,0.00,0.0,0.0,65.244053,79,50,78.0
1980-01-03,USW00023174,"LOS ANGELES INTERNATIONAL AIRPORT, CA US",70.0,80.0,7.437264,1567.728282,1454.0,0.00,0.0,0.0,65.244053,78,55,72.0
1980-01-04,USW00023174,"LOS ANGELES INTERNATIONAL AIRPORT, CA US",40.0,90.0,7.437264,1567.728282,1600.0,0.00,0.0,0.0,65.244053,72,47,68.0
1980-01-05,USW00023174,"LOS ANGELES INTERNATIONAL AIRPORT, CA US",60.0,80.0,7.437264,1567.728282,1312.0,0.00,0.0,0.0,65.244053,68,47,64.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2023-02-04,USW00023174,"LOS ANGELES INTERNATIONAL AIRPORT, CA US",13.0,13.0,5.820000,1510.000000,1629.0,0.00,0.0,0.0,56.000000,66,46,63.0
2023-02-05,USW00023174,"LOS ANGELES INTERNATIONAL AIRPORT, CA US",13.0,13.0,11.180000,1510.000000,1629.0,0.01,0.0,0.0,57.000000,63,50,71.0
2023-02-06,USW00023174,"LOS ANGELES INTERNATIONAL AIRPORT, CA US",13.0,13.0,10.070000,1510.000000,1629.0,0.00,0.0,0.0,58.000000,71,53,67.0
2023-02-07,USW00023174,"LOS ANGELES INTERNATIONAL AIRPORT, CA US",13.0,13.0,4.470000,1510.000000,1629.0,0.00,0.0,0.0,59.000000,67,47,71.0
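
To make the shift concrete, here is a minimal sketch on a four-day toy frame (the `tmax` values are taken from the first rows shown above). It also shows that `ffill()` returns a new object and leaves the original unchanged unless the result is assigned back.

```python
import pandas as pd

toy = pd.DataFrame(
    {"tmax": [71, 79, 78, 72]},
    index=pd.to_datetime(["1980-01-01", "1980-01-02", "1980-01-03", "1980-01-04"]),
)

# shift(-1) pulls the next day's tmax into the current row.
toy["target_x"] = toy["tmax"].shift(-1)

# The last row has no next day, so its target is NaN.
print(toy)

# ffill() returns a filled copy; `toy` itself is unchanged unless we reassign.
toy_filled = toy.ffill()
print(toy_filled)
```

In `toy_filled` the last target repeats the previous day's target, while `toy` still contains the missing value.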


If you look at the data from the beginning, you'll see that some columns take much larger values than others. For example, `fmtm` and `pgtm` are HHMM times that run into the thousands. Predictors on very different scales can lead to a worse model, so we standardize the columns with the largest values. To standardize a column means to scale its values so that the mean is 0 and the standard deviation is 1.

The formula for standardization is $x_{\text{new}} = (x_i - \bar{x}) / s$, where:
- $x_i$: the $i$th value in the dataset
- $\bar{x}$: the sample mean
- $s$: the sample standard deviation

In [514]:
weather['fmtm'] = (weather['fmtm'] - weather['fmtm'].mean()) / weather['fmtm'].std()
weather['pgtm'] = (weather['pgtm'] - weather['pgtm'].mean()) / weather['pgtm'].std()
weather

DATE,station,name,acmh,acsh,awnd,fmtm,pgtm,prcp,snow,snwd,tavg,tmax,tmin,target_x
1980-01-01,USW00023174,"LOS ANGELES INTERNATIONAL AIRPORT, CA US",50.0,80.0,7.437264,-1.124728e-14,-0.093511,0.00,0.0,0.0,65.244053,71,54,79.0
1980-01-02,USW00023174,"LOS ANGELES INTERNATIONAL AIRPORT, CA US",10.0,0.0,7.437264,-1.124728e-14,-2.566994,0.00,0.0,0.0,65.244053,79,50,78.0
1980-01-03,USW00023174,"LOS ANGELES INTERNATIONAL AIRPORT, CA US",70.0,80.0,7.437264,-1.124728e-14,0.057158,0.00,0.0,0.0,65.244053,78,55,72.0
1980-01-04,USW00023174,"LOS ANGELES INTERNATIONAL AIRPORT, CA US",40.0,90.0,7.437264,-1.124728e-14,0.515443,0.00,0.0,0.0,65.244053,72,47,68.0
1980-01-05,USW00023174,"LOS ANGELES INTERNATIONAL AIRPORT, CA US",60.0,80.0,7.437264,-1.124728e-14,-0.388571,0.00,0.0,0.0,65.244053,68,47,64.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2023-02-04,USW00023174,"LOS ANGELES INTERNATIONAL AIRPORT, CA US",13.0,13.0,5.820000,-1.057626e-01,0.606472,0.00,0.0,0.0,56.000000,66,46,63.0
2023-02-05,USW00023174,"LOS ANGELES INTERNATIONAL AIRPORT, CA US",13.0,13.0,11.180000,-1.057626e-01,0.606472,0.01,0.0,0.0,57.000000,63,50,71.0
2023-02-06,USW00023174,"LOS ANGELES INTERNATIONAL AIRPORT, CA US",13.0,13.0,10.070000,-1.057626e-01,0.606472,0.00,0.0,0.0,58.000000,71,53,67.0
2023-02-07,USW00023174,"LOS ANGELES INTERNATIONAL AIRPORT, CA US",13.0,13.0,4.470000,-1.057626e-01,0.606472,0.00,0.0,0.0,59.000000,67,47,71.0
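
The effect of the standardization formula can be checked on a few made-up HHMM-style values. This is a minimal sketch; note that pandas `std()` uses the sample standard deviation (`ddof=1`), which matches the $s$ in the formula.

```python
import pandas as pd

# Made-up values on a large scale, similar in spirit to HHMM times.
values = pd.Series([600.0, 1400.0, 1500.0, 1600.0, 2300.0])

# Same expression as used for fmtm and pgtm above.
standardized = (values - values.mean()) / values.std()

print(standardized.round(3).tolist())
print(standardized.mean(), standardized.std())
```

After the transformation the mean is 0 and the standard deviation is 1, up to floating-point rounding.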


## Setting up our Model

Since we're predicting a future value of the daily maximum temperature, which is continuous, this is a regression problem rather than a classification problem. We'll use ridge regression for this model: we import `Ridge` from `sklearn.linear_model` and set `alpha` to 0.1.

- alpha : {float, array-like}, shape = [n_targets]. Small positive values of alpha improve the conditioning of the problem and reduce the variance of the estimates; larger values mean stronger regularization. Alpha corresponds to (2*C)^-1 in other linear models such as LogisticRegression or LinearSVC. If an array is passed, penalties are assumed to be specific to the targets, so the array must have one entry per target.

In [515]:
from sklearn.linear_model import Ridge

rr = Ridge(alpha=.1)
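
To get a feel for what `alpha` does, the sketch below fits `Ridge` on synthetic data with three different penalties. The data, the true coefficients, and the alpha values are all made up for illustration; only the qualitative effect (coefficients shrink as alpha grows) carries over.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic regression data (an assumption for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Fit the same model with increasing regularization strength.
coef_norms = {}
for alpha in (0.1, 10.0, 1000.0):
    model = Ridge(alpha=alpha).fit(X, y)
    coef_norms[alpha] = float(np.linalg.norm(model.coef_))
    print(alpha, np.round(model.coef_, 3))
```

With `alpha=0.1` the penalty is weak and the fitted coefficients stay close to an ordinary least-squares fit; with `alpha=1000.0` they are pulled noticeably toward zero.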

We'll then set our predictors to be the following:
- ACMH = Average cloudiness midnight to midnight from manual observations (percent)
- ACSH = Average cloudiness sunrise to sunset from manual observations (percent)
- AWND = Average daily wind speed (tenths of meters per second)
- FMTM = Time of fastest mile or fastest 1-minute wind (hours and minutes, i.e., HHMM)
- PGTM = Peak gust time (hours and minutes, i.e., HHMM)
- PRCP = Precipitation (tenths of mm)
- SNOW = Snowfall (mm)
- SNWD = Snow depth (mm)
- TAVG = Average temperature (tenths of degrees C)
- TMAX = Maximum temperature (tenths of degrees C)
- TMIN = Minimum temperature (tenths of degrees C)

We'll use the above predictors to create a backtest function

In [516]:
predictors = weather.columns[~weather.columns.isin(["target_x", "name", "station"])]

In [517]:
predictors

Index(['acmh', 'acsh', 'awnd', 'fmtm', 'pgtm', 'prcp', 'snow', 'snwd', 'tavg',
       'tmax', 'tmin'],
      dtype='object')

## Creating a Backtesting Function

Our backtesting function will loop over the dataset and train a new model every 90 rows, each time using all earlier rows as training data. We'll make it a function so we can avoid rewriting the code if we want to backtest again. Ideally, we'd retrain the model more often than every 90 rows.

Before we look at the full backtesting loop, here is what a single iteration does. For the first iteration with the default settings:

- We'll take the first 3650 rows of the data as our training set
- We'll take the next 90 rows as our testing set
- We'll fit our machine learning model to the training set
- We'll make predictions on the test set
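
The four steps above can be sketched as a single iteration. The frame below is synthetic (made-up seasonal temperatures), and the names `toy` and `toy_predictors` are hypothetical stand-ins for the real weather data and predictor list.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge

# Synthetic stand-in for the weather frame: 4000 daily rows of made-up values.
rng = np.random.default_rng(0)
n_rows = 4000
dates = pd.date_range("1980-01-01", periods=n_rows, freq="D")
seasonal = 70 + 10 * np.sin(np.arange(n_rows) * 2 * np.pi / 365.25)
toy = pd.DataFrame({"tmax": seasonal + rng.normal(scale=3, size=n_rows)}, index=dates)
toy["tmin"] = toy["tmax"] - 15 + rng.normal(scale=2, size=n_rows)
toy["target_x"] = toy["tmax"].shift(-1)
toy = toy.ffill()

toy_predictors = ["tmax", "tmin"]
start, step = 3650, 90

# Step 1 and 2: first 3650 rows for training, next 90 rows for testing.
train = toy.iloc[:start]
test = toy.iloc[start:start + step]

# Step 3: fit the model on the training set.
model = Ridge(alpha=0.1)
model.fit(train[toy_predictors], train["target_x"])

# Step 4: predict on the test set and compare with the actual values.
preds = pd.Series(model.predict(test[toy_predictors]), index=test.index)
combined = pd.concat([test["target_x"], preds], axis=1)
combined.columns = ["actual", "prediction"]
combined["diff"] = (combined["prediction"] - combined["actual"]).abs()
print(combined.head())
```

The full backtest repeats this with a growing training window, moving the start of the test window forward by `step` rows each time.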

The backtest function below takes the weather data, our model, our predictors, and a start and step value used to loop through the dataset. In each iteration we fit the model on the training rows, predict on the test rows, and convert the predictions to a pandas Series. We then create a DataFrame combining the targets and the predictions and add a new column, `diff`, holding the absolute difference between the prediction and the actual value, so we can see how good our predictions are.

In [518]:
def backtest(weather, model, predictors, start = 3650, step = 90):
    all_predictions = []
    for i in range(start, weather.shape[0], step):
        # Train on every row before position i, test on the next `step` rows
        train = weather.iloc[:i,:]
        test = weather.iloc[i:(i + step), :]
        model.fit(train[predictors], train["target_x"])
        # Keep the test dates as the index of the predictions
        preds = model.predict(test[predictors])
        preds = pd.Series(preds, index=test.index)
        combined = pd.concat([test["target_x"], preds], axis =1)
        combined.columns = ["actual", "prediction"]
        # Absolute error of each prediction
        combined["diff"] = (combined["prediction"] - combined["actual"]).abs()
        all_predictions.append(combined)
    return pd.concat(all_predictions)

In [519]:
# Getting our predictions

predictions = backtest(weather, rr, predictors)

In [520]:
# Drop the final row: its target is missing because the last day has no following day
predictions = predictions.iloc[:-1 , :]
predictions

DATE,actual,prediction,diff
1989-12-29,70.0,67.443171,2.556829
1989-12-30,70.0,69.591147,0.408853
1989-12-31,61.0,69.330496,8.330496
1990-01-01,62.0,61.557471,0.442529
1990-01-02,64.0,62.366628,1.633372
...,...,...,...
2023-02-03,66.0,67.784136,1.784136
2023-02-04,63.0,66.441951,3.441951
2023-02-05,71.0,63.981955,7.018045
2023-02-06,67.0,70.165085,3.165085


### Measuring Error

By measuring the error of our model, we will know how well our model did. We'll be importing mean_absolute_error and mean_squared_error from sklearn.  

The mean absolute error (MAE) of a model with respect to a test set is the mean of the absolute values of the individual prediction errors over all instances in the test set. Each prediction error is the difference between the true value and the predicted value for the instance.

The mean squared error (MSE) measures the amount of error in statistical models. It assesses the average squared difference between the observed and predicted values. When a model has no error, the MSE equals zero. As model error increases, its value increases.
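
Both definitions can be checked by hand on a tiny example. The four actual and predicted values below are made up; the point is that one large miss (8 degrees) dominates the MSE much more than the MAE.

```python
import numpy as np

# Made-up actual and predicted maximum temperatures.
actual = np.array([70.0, 61.0, 62.0, 64.0])
predicted = np.array([68.0, 69.0, 62.0, 63.0])

errors = predicted - actual       # [-2, 8, 0, -1]

mae = np.abs(errors).mean()       # (2 + 8 + 0 + 1) / 4
mse = (errors ** 2).mean()        # (4 + 64 + 0 + 1) / 4
rmse = np.sqrt(mse)               # back in the units of the target

print(mae, mse, rmse)
```

The MAE here is 2.75 and the MSE is 17.25; the single 8-degree error contributes 64 of the 69 squared-error units.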

In [521]:
from sklearn.metrics import mean_absolute_error, mean_squared_error

mean_absolute_error(predictions["actual"],predictions["prediction"])

3.1601777220021363

In [522]:
predictions["diff"].mean()

3.16017772200213

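As we can see, the MAE is about 3.16, meaning our predictions are off by roughly 3 degrees on average. For comparison, the standard deviation of TMAX in the `describe()` output above is about 7.29, so the typical error is less than half of the overall spread in maximum temperature, which is a reasonable result for a simple linear model. The lower the MAE, the better the model.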

In [523]:
predictions.sort_values("diff", ascending=False)

DATE,actual,prediction,diff
2002-02-20,89.0,63.884505,25.115495
2004-09-04,101.0,77.522754,23.477246
1996-04-28,94.0,70.682679,23.317321
2008-10-21,96.0,73.401671,22.598329
2021-11-20,88.0,65.818500,22.181500
...,...,...,...
2022-10-06,75.0,74.998929,0.001071
2002-03-19,68.0,68.000722,0.000722
2013-05-25,70.0,70.000491,0.000491
2010-06-24,70.0,69.999871,0.000129


In [524]:
pd.Series(rr.coef_, index=predictors)

acmh   -0.038677
acsh    0.024455
awnd   -0.190437
fmtm   -0.039802
pgtm    0.003162
prcp   -0.868845
snow    0.000000
snwd    0.000000
tavg   -0.018614
tmax    0.686891
tmin    0.162180
dtype: float64

## Feature Engineering

Now that we're done with our backtesting, we'll create new predictors that we can pass to the backtest function. We start with a function that returns the relative difference `(new - old) / old`. Then we create a function that computes rolling averages over a week, two weeks, and a month (7, 14, and 30 days) for some of the predictors, together with the relative difference between the current value and each rolling average.

In [525]:
def pct_diff(old, new):
    return (new - old) / old

def compute_rol(weather, horizon, col):
    label = f"rolling_{horizon}_{col}"
    weather[label] = weather[col].rolling(horizon).mean()
    weather[f"{label}_pct"] = pct_diff(weather[label], weather[col])
    return weather

rolling_horizons = [7,14,30]
for horizon in rolling_horizons:
    for col in ["tmax", "tmin", "awnd", "prcp"]:
        weather = compute_rol(weather, horizon, col)
        
weather

DATE,station,name,acmh,acsh,awnd,fmtm,pgtm,prcp,snow,snwd,...,rolling_14_prcp,rolling_14_prcp_pct,rolling_30_tmax,rolling_30_tmax_pct,rolling_30_tmin,rolling_30_tmin_pct,rolling_30_awnd,rolling_30_awnd_pct,rolling_30_prcp,rolling_30_prcp_pct
1980-01-01,USW00023174,"LOS ANGELES INTERNATIONAL AIRPORT, CA US",50.0,80.0,7.437264,-1.124728e-14,-0.093511,0.00,0.0,0.0,...,,,,,,,,,,
1980-01-02,USW00023174,"LOS ANGELES INTERNATIONAL AIRPORT, CA US",10.0,0.0,7.437264,-1.124728e-14,-2.566994,0.00,0.0,0.0,...,,,,,,,,,,
1980-01-03,USW00023174,"LOS ANGELES INTERNATIONAL AIRPORT, CA US",70.0,80.0,7.437264,-1.124728e-14,0.057158,0.00,0.0,0.0,...,,,,,,,,,,
1980-01-04,USW00023174,"LOS ANGELES INTERNATIONAL AIRPORT, CA US",40.0,90.0,7.437264,-1.124728e-14,0.515443,0.00,0.0,0.0,...,,,,,,,,,,
1980-01-05,USW00023174,"LOS ANGELES INTERNATIONAL AIRPORT, CA US",60.0,80.0,7.437264,-1.124728e-14,-0.388571,0.00,0.0,0.0,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2023-02-04,USW00023174,"LOS ANGELES INTERNATIONAL AIRPORT, CA US",13.0,13.0,5.820000,-1.057626e-01,0.606472,0.00,0.0,0.0,...,0.048571,-1.000000,63.100000,0.045959,46.666667,-0.014286,6.650667,-0.124900,0.162333,-1.000000
2023-02-05,USW00023174,"LOS ANGELES INTERNATIONAL AIRPORT, CA US",13.0,13.0,11.180000,-1.057626e-01,0.606472,0.01,0.0,0.0,...,0.049286,-0.797101,63.133333,-0.002112,46.800000,0.068376,6.866667,0.628155,0.162667,-0.938525
2023-02-06,USW00023174,"LOS ANGELES INTERNATIONAL AIRPORT, CA US",13.0,13.0,10.070000,-1.057626e-01,0.606472,0.00,0.0,0.0,...,0.049286,-1.000000,63.233333,0.122826,46.966667,0.128460,7.038333,0.430736,0.162667,-1.000000
2023-02-07,USW00023174,"LOS ANGELES INTERNATIONAL AIRPORT, CA US",13.0,13.0,4.470000,-1.057626e-01,0.606472,0.00,0.0,0.0,...,0.049286,-1.000000,63.366667,0.057338,46.933333,0.001420,7.030667,-0.364214,0.162667,-1.000000
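
The missing values at the top of the table come from the rolling window itself. A minimal sketch with a made-up four-day series and a hypothetical 3-day window shows both new feature types:

```python
import pandas as pd

# Made-up maximum temperatures for four consecutive days.
tmax = pd.Series([70.0, 72.0, 74.0, 80.0])

# A 3-day rolling mean needs three values, so the first two rows are NaN.
rolling_3 = tmax.rolling(3).mean()

# Relative difference between today's value and the rolling mean.
rolling_3_pct = (tmax - rolling_3) / rolling_3

print(rolling_3.tolist())
print(rolling_3_pct.round(4).tolist())
```

On the third day the rolling mean is 72.0 and the value is 74.0, so the relative difference is 2 / 72, or about 0.028. With a 30-day window, the first 29 rows are missing in the same way.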


Next we'll create a function that returns the expanding mean, that is, the mean of all values seen so far. We then loop through some of the predictors to compute monthly, weekly, and day-of-year averages. These averages are computed per group using the pandas `groupby` method, which lets us group the rows by month, week of the year, and day of the year before taking the expanding mean within each group.
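
To see what an expanding mean per group produces, consider this minimal sketch on five made-up observations spread over two calendar months. The frame `toy` and its values are hypothetical.

```python
import pandas as pd

dates = pd.to_datetime(
    ["1980-01-15", "1980-02-15", "1981-01-15", "1981-02-15", "1982-01-15"]
)
toy = pd.DataFrame({"tmax": [60.0, 65.0, 70.0, 75.0, 80.0]}, index=dates)

def expanding_mean(series):
    # Mean of all values up to and including the current row.
    return series.expanding(1).mean()

# Group by calendar month, then take the expanding mean inside each group.
# Assigning to a column aligns the result on the date index.
toy["month_avg_tmax"] = (
    toy["tmax"].groupby(toy.index.month, group_keys=False).apply(expanding_mean)
)
print(toy)
```

Each January row averages only the January values seen so far (60, then 65, then 70), and each February row averages only February values (65, then 70), so the feature never uses data from later dates.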

In [526]:
def expand_mean(df):
    return df.expanding(1).mean()

for col in ["tmax", "tmin", "awnd", "prcp"]:
    weather[f"month_avg{col}"] = weather[col].groupby(weather.index.month,group_keys=False).apply(expand_mean)
    # DatetimeIndex.week is deprecated; isocalendar().week gives the same week numbers
    weather[f"week_avg{col}"] = weather[col].groupby(weather.index.isocalendar().week,group_keys=False).apply(expand_mean)
    weather[f"day_avg{col}"] = weather[col].groupby(weather.index.day_of_year,group_keys=False).apply(expand_mean)



In [527]:
print(weather.columns)

Index(['station', 'name', 'acmh', 'acsh', 'awnd', 'fmtm', 'pgtm', 'prcp',
       'snow', 'snwd', 'tavg', 'tmax', 'tmin', 'target_x', 'rolling_7_tmax',
       'rolling_7_tmax_pct', 'rolling_7_tmin', 'rolling_7_tmin_pct',
       'rolling_7_awnd', 'rolling_7_awnd_pct', 'rolling_7_prcp',
       'rolling_7_prcp_pct', 'rolling_14_tmax', 'rolling_14_tmax_pct',
       'rolling_14_tmin', 'rolling_14_tmin_pct', 'rolling_14_awnd',
       'rolling_14_awnd_pct', 'rolling_14_prcp', 'rolling_14_prcp_pct',
       'rolling_30_tmax', 'rolling_30_tmax_pct', 'rolling_30_tmin',
       'rolling_30_tmin_pct', 'rolling_30_awnd', 'rolling_30_awnd_pct',
       'rolling_30_prcp', 'rolling_30_prcp_pct', 'month_avgtmax',
       'week_avgtmax', 'day_avgtmax', 'month_avgtmin', 'week_avgtmin',
       'day_avgtmin', 'month_avgawnd', 'week_avgawnd', 'day_avgawnd',
       'month_avgprcp', 'week_avgprcp', 'day_avgprcp'],
      dtype='object')


As we can see above, the new predictors have been added to the DataFrame.

The rolling features need up to 30 days of history, so they are missing in the first rows of the dataset. We drop the first 30 rows; since the dataset is large, losing them will not noticeably affect our predictions. Any missing values that remain are then filled with 0.

In [528]:
weather = weather.iloc[30:,:]
weather = weather.fillna(0)

In [529]:
weather

DATE,station,name,acmh,acsh,awnd,fmtm,pgtm,prcp,snow,snwd,...,day_avgtmax,month_avgtmin,week_avgtmin,day_avgtmin,month_avgawnd,week_avgawnd,day_avgawnd,month_avgprcp,week_avgprcp,day_avgprcp
1980-01-31,USW00023174,"LOS ANGELES INTERNATIONAL AIRPORT, CA US",50.0,60.0,7.437264,-1.124728e-14,0.848170,0.00,0.0,0.0,...,74.000000,52.903226,54.000000,57.000000,7.437264,7.437264,7.437264,0.224839,0.482500,0.000000
1980-02-01,USW00023174,"LOS ANGELES INTERNATIONAL AIRPORT, CA US",10.0,10.0,7.437264,-1.124728e-14,-0.294403,0.00,0.0,0.0,...,76.000000,55.000000,54.200000,55.000000,7.437264,7.437264,7.437264,0.000000,0.386000,0.000000
1980-02-02,USW00023174,"LOS ANGELES INTERNATIONAL AIRPORT, CA US",30.0,50.0,7.437264,-1.124728e-14,0.038324,0.00,0.0,0.0,...,73.000000,55.000000,54.333333,55.000000,7.437264,7.437264,7.437264,0.000000,0.321667,0.000000
1980-02-03,USW00023174,"LOS ANGELES INTERNATIONAL AIRPORT, CA US",70.0,100.0,7.437264,-1.124728e-14,0.684946,0.00,0.0,0.0,...,75.000000,54.333333,54.142857,53.000000,7.437264,7.437264,7.437264,0.000000,0.275714,0.000000
1980-02-04,USW00023174,"LOS ANGELES INTERNATIONAL AIRPORT, CA US",70.0,90.0,7.437264,-1.124728e-14,0.515443,0.00,0.0,0.0,...,74.000000,54.000000,53.000000,53.000000,7.437264,7.437264,7.437264,0.000000,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2023-02-04,USW00023174,"LOS ANGELES INTERNATIONAL AIRPORT, CA US",13.0,13.0,5.820000,-1.057626e-01,0.606472,0.00,0.0,0.0,...,66.204545,50.085316,49.498371,49.568182,7.415440,6.552421,7.377024,0.099114,0.070749,0.030909
2023-02-05,USW00023174,"LOS ANGELES INTERNATIONAL AIRPORT, CA US",13.0,13.0,11.180000,-1.057626e-01,0.606472,0.01,0.0,0.0,...,65.886364,50.085246,49.500000,49.068182,7.418525,6.567446,6.629524,0.099041,0.070552,0.053409
2023-02-06,USW00023174,"LOS ANGELES INTERNATIONAL AIRPORT, CA US",13.0,13.0,10.070000,-1.057626e-01,0.606472,0.00,0.0,0.0,...,65.363636,50.087633,49.814570,49.386364,7.420697,7.164117,7.031569,0.098960,0.078675,0.118409
2023-02-07,USW00023174,"LOS ANGELES INTERNATIONAL AIRPORT, CA US",13.0,13.0,4.470000,-1.057626e-01,0.606472,0.00,0.0,0.0,...,65.954545,50.085106,49.805281,49.568182,7.418282,7.155226,6.979524,0.098879,0.078416,0.145227


Next we update our predictors. The code below selects every column in `weather` except `target_x`, `name`, and `station`. We then call our backtest function again.

In [530]:
predictors = weather.columns[~weather.columns.isin(["target_x", "name", "station"])]

In [535]:
predictions = backtest(weather, rr, predictors)
mean_absolute_error(predictions["actual"],predictions["prediction"])

3.090102396200059

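From the MAE above, you can see that our model improved: the error dropped from about 3.16 to about 3.09. Note that this run includes the final row (2023-02-08), whose missing target was replaced by 0 in the `fillna(0)` step. It appears as the largest error in the sorted table below and slightly inflates both error metrics.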

In [536]:
# Checking for the mean squared error

print(mean_squared_error(predictions["actual"],predictions["prediction"]))

18.612415211553937


In [537]:
predictions.sort_values("diff", ascending=False)

DATE,actual,prediction,diff
2023-02-08,0.0,70.082072,70.082072
2002-02-20,89.0,64.146051,24.853949
2004-09-04,101.0,77.986995,23.013005
1996-04-28,94.0,71.246534,22.753466
2010-09-26,105.0,82.840536,22.159464
...,...,...,...
2014-06-28,76.0,75.999101,0.000899
2003-08-22,76.0,75.999731,0.000269
1995-01-03,59.0,59.000261,0.000261
2004-11-14,76.0,75.999905,0.000095
