# Steps:

## Step 1: Create a folder which has all the ERA.csv data ready to be read
## Step 2: Create a folder called [Feature_Arrays] to hold the converted time-series shaped data in numpy-array format

The purpose of this file is to show my process for converting the regularly changing hourly weather features into a time series, and saving the output as numpy-array files. These will later be fed to a time-series recurrent neural network via a data loader, in order to prevent memory problems caused by the large size of the files.

### Note: Possible problem of normalization
Normalizing data relies on the idea that the population parameters (mean, standard deviation) are stable over time. The problem is that we might get misleading values if we normalize feature values without taking into account possible changes over time due to global warming, or the fact that different latitudes and longitudes might in turn have different population parameters for those features (i.e., for a long state like California you would expect the typical temperature in the north to differ from that in the south of the state).

We also have the problem that, given the sheer amount of data, normalization requires all of the data to be loaded at once in order to determine the sample mean and standard deviation used to normalize it. Since this is difficult given the size of the data, some workaround must be considered. For this, I have 3 ideas:

#### Idea 1: No normalization.
This is admittedly a very risky proposition, since temperatures (which seem to be in Kelvin, if a range of 199 -> 300 is to be explained) are vastly higher than humidity (0.0 to 0.013966 is very low indeed).

#### Idea 2: Yearly Normalization
In order to disregard monthly changes and save room, we could instead open the data files one at a time and keep a running average of the mean and standard deviation over the course of a single year for the original data. Then we use these values to normalize the files one at a time.

This would require running through the ERA files twice: once to generate the population parameters for our normalization, then again to sort, reshape, and normalize our array data for the time series.

**I will be doing this one for now, then might try idea 3 later on if needed.**
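
As a hedged sketch of the first pass: instead of averaging per-file means and standard deviations, one could accumulate per-feature counts, sums, and sums of squares, which recovers the pooled statistics of all rows without holding every file in memory. The class name `RunningStats` and the random toy chunks are hypothetical stand-ins for the ERA files; this is not the method used further down in this notebook.

```python
import numpy as np

class RunningStats:
    """Accumulate per-feature mean and std chunk by chunk (hypothetical helper)."""

    def __init__(self, n_features):
        self.count = 0
        self.total = np.zeros(n_features)
        self.total_sq = np.zeros(n_features)

    def update(self, chunk):
        # chunk has shape (rows, n_features); only sums are kept, not the rows
        chunk = np.asarray(chunk, dtype=np.float64)
        self.count += chunk.shape[0]
        self.total += chunk.sum(axis=0)
        self.total_sq += (chunk ** 2).sum(axis=0)

    def mean(self):
        return self.total / self.count

    def std(self):
        # Population standard deviation (ddof=0), from E[x^2] - E[x]^2
        var = self.total_sq / self.count - self.mean() ** 2
        return np.sqrt(np.maximum(var, 0.0))

# Toy data: three "files" with different row counts and 3 features each
rng = np.random.default_rng(0)
chunks = [rng.normal(250.0, 10.0, size=(n, 3)) for n in (100, 80, 120)]

stats = RunningStats(3)
for chunk in chunks:
    stats.update(chunk)

# Compare against the statistics of all rows loaded at once
full = np.concatenate(chunks)
print(np.allclose(stats.mean(), full.mean(axis=0)))
print(np.allclose(stats.std(), full.std(axis=0)))
```

Note that this sketch uses the population standard deviation (`ddof=0`), whereas pandas `DataFrame.std()` defaults to `ddof=1`. The sum-of-squares form can also lose precision when the variance is tiny relative to the mean, in which case a Welford-style update would be the safer choice.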

##### Possible Alteration: Do per-month normalization, and apply the normalization at each month.



#### Idea 3: Yearly-Local Normalization
An even more complex idea is a normalization that accounts for local averages of these features differing across the state. It would work as above, but also take latitude/longitude location into account by using the average over the nearest local cells. So the top-left corner cell would be normalized using the averages across it and its surrounding 3 cells, the next cell along the top using its surrounding 5, a cell in the center using its surrounding 8 cells, etc. This process would be extremely difficult, and might lead to problems.

However, it would control for local variations which we might not be able to see, as some areas might have less wind/humidity at certain altitudes thanks to mountains/valleys acting as buffer zones. These might not factor into the upper atmosphere, however.
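
A minimal sketch of the neighbourhood averaging described above, assuming a single 2D grid of one feature at one time step. The function name `local_mean` and the 4 x 4 toy grid are hypothetical; the grid is padded with NaN so that corner cells average over 4 values, edge cells over 6, and interior cells over 9.

```python
import numpy as np

def local_mean(grid):
    """Mean of each cell together with its existing neighbours (3x3 window)."""
    rows, cols = grid.shape
    # Pad with NaN so that neighbours outside the grid are ignored by nanmean
    padded = np.pad(grid.astype(float), 1, constant_values=np.nan)
    # Stack the nine shifted copies of the grid, one per window position
    shifted = np.stack([
        padded[r:r + rows, c:c + cols]
        for r in range(3) for c in range(3)
    ])
    return np.nanmean(shifted, axis=0)

grid = np.arange(16, dtype=float).reshape(4, 4)
means = local_mean(grid)
print(means[0, 0])  # corner: mean of 0, 1, 4, 5 -> 2.5
print(means[0, 1])  # top edge: mean of 0, 1, 2, 4, 5, 6 -> 3.0
print(means[1, 1])  # interior: mean of the 3x3 block around 5 -> 5.0
```

A local standard deviation could be built the same way with `np.nanstd`, at the cost of storing one mean and one standard deviation per cell and feature instead of one pair per feature.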

## Running through files over the course of a year to generate normalization parameters.
Each file will be opened, and the atmospheric levels will be combined with the four features to create 37 x 4 = 148 features. The mean and standard deviation of each will then be saved. After running through all the files, these per-file values for each feature will be averaged. This assumes that every year is typical and that, overall, these features do not change.
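
To make the 37 x 4 = 148 reshaping concrete, here is a small sketch on toy data with 3 levels instead of 37. The feature column names `q`, `t`, `u`, `v` and all values are hypothetical placeholders; only the index column names (`time`, `latitude`, `longitude`, `level`) are taken from the code in this notebook.

```python
import itertools
import pandas as pd

times = pd.to_datetime(['2011-01-01 00:00', '2011-01-01 01:00'])
lats = [41.75, 41.5]
lons = [-88.75, -88.5]
levels = [1, 2, 3]  # stand-in for the 37 atmospheric levels

# Long format: one row per (time, latitude, longitude, level)
rows = [
    {'time': t, 'latitude': la, 'longitude': lo, 'level': lv,
     'q': 0.001 * lv, 't': 200.0 + lv, 'u': 1.0 * lv, 'v': -1.0 * lv}
    for t, la, lo, lv in itertools.product(times, lats, lons, levels)
]
long_df = pd.DataFrame(rows)

# Wide format: level moves into the columns, one column per (feature, level)
wide = long_df.set_index(['time', 'latitude', 'longitude', 'level']).unstack('level')
# Sort columns by level first, then feature, so each level's 4 features sit together
wide = wide.reindex(columns=sorted(wide.columns, key=lambda x: x[::-1]))

print(wide.shape)              # (8, 12): 2*2*2 rows, 4 features * 3 levels
print(list(wide.columns[:4]))  # the four features of level 1
```

With 37 levels the same steps give 4 x 37 = 148 columns, and the column order explains why the printed means further down repeat in groups of four.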

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')
#drive.mount('/content/gdrive', force_remount=True)

Mounted at /content/gdrive


In [None]:
import numpy as np
import pandas as pd
import tensorflow as tf
import datetime
import os
import sklearn.preprocessing as skp
import sklearn as sk
import json

In [None]:
print(os.getcwd())
os.chdir('gdrive/MyDrive/Deep_Learning_Project')
print(os.getcwd())

/content
/content/gdrive/.shortcut-targets-by-id/1DL5QaBuUUKXzdFvZyHGum9yxdUfsShCE/Deep_Learning_Project


Everything below, up to the next text cell, only needs to be run once; after that, just open the produced JSON file to save time.

In [None]:
for filename in sorted(os.listdir('Bobbie_ERA5')):
    for file in sorted(os.listdir('Bobbie_ERA5/'+filename)):
        print(file)

ERA5_201101.csv
ERA5_201102.csv
ERA5_201103.csv
ERA5_201104.csv
ERA5_201105.csv
ERA5_201106.csv
ERA5_201107.csv
ERA5_201108.csv
ERA5_201109.csv
ERA5_201110.csv
ERA5_201111.csv
ERA5_201112.csv
ERA5_201201.csv
ERA5_201202.csv
ERA5_201203.csv
ERA5_201204.csv
ERA5_201205.csv
ERA5_201206.csv
ERA5_201207.csv
ERA5_201208.csv
ERA5_201209.csv
ERA5_201210.csv
ERA5_201211.csv
ERA5_201212.csv
ERA5_201301.csv
ERA5_201302.csv
ERA5_201303.csv
ERA5_201304.csv
ERA5_201305.csv
ERA5_201306.csv
ERA5_201307.csv
ERA5_201308.csv
ERA5_201309.csv
ERA5_201310.csv
ERA5_201311.csv
ERA5_201312.csv
ERA5_201401.csv
ERA5_201402.csv
ERA5_201403.csv
ERA5_201404.csv
ERA5_201405.csv
ERA5_201406.csv
ERA5_201407.csv
ERA5_201408.csv
ERA5_201409.csv
ERA5_201410.csv
ERA5_201411.csv
ERA5_201412.csv
ERA5_201501.csv
ERA5_201502.csv
ERA5_201503.csv
ERA5_201504.csv
ERA5_201505.csv
ERA5_201506.csv
ERA5_201507.csv
ERA5_201508.csv
ERA5_201509.csv
ERA5_201510.csv
ERA5_201511.csv
ERA5_201512.csv
ERA5_201601.csv
ERA5_201602.csv
ERA5_201

In [None]:
total_means = []
total_std = []
#Note: Could change to min/max scaling (might be better idea)
for filename in sorted(os.listdir('Bobbie_ERA5')):
    for file in sorted(os.listdir('Bobbie_ERA5/'+filename)):
        era = pd.read_csv('Bobbie_ERA5/' + filename + '/' + file)
        #Drop the boundary rows at latitude 42 and longitude -89
        era = era[era['latitude'] < 42]
        era = era[era['longitude'] > -89]
        era['time'] = pd.to_datetime(era['time'])
        #Move level into the columns: one column per (feature, level) pair, 4 * 37 = 148 in total
        era = era.set_index(['time', 'latitude', 'longitude', 'level']).unstack('level')
        #Order columns by level first, then by feature
        era = era.reindex(columns=sorted(era.columns, key=lambda x: x[::-1]))
        #Order rows by time, then latitude (descending), then longitude
        era = era.sort_index(level = ['time', 'latitude', 'longitude'], ascending = [True, False, True])
        #Per-file mean and standard deviation of each of the 148 columns
        total_means.append(list(era.mean()))
        total_std.append(list(era.std()))
era

In [None]:
print(len(total_means))
print(len(total_means[0]))
#Worked

144
148


In [None]:
total_means = np.array(total_means)
total_std = np.array(total_std)

total_means = np.mean(total_means, axis = 0)
total_std = np.mean(total_std, axis = 0)

print(total_means)
print(total_std)

[ 3.89506860e-06  2.59765973e+02  2.05512754e+01  4.11997429e+00
  3.73806519e-06  2.56624863e+02  1.64801349e+01  2.87709858e+00
  3.63788541e-06  2.48899536e+02  1.35538529e+01  2.11001294e+00
  3.50575674e-06  2.38322037e+02  1.01508158e+01  1.10120914e+00
  3.40584091e-06  2.32197918e+02  7.73748721e+00  6.17399836e-01
  3.29680269e-06  2.27220304e+02  5.30066468e+00  3.09923269e-01
  3.06227972e-06  2.20116052e+02  1.91831078e+00 -1.25144914e-01
  2.95408422e-06  2.16799683e+02  1.66457871e+00 -2.70203074e-01
  2.82552277e-06  2.13068177e+02  4.59069095e+00 -3.69248694e-01
  2.76094392e-06  2.10882481e+02  9.47733688e+00 -3.86563168e-01
  3.06931553e-06  2.10529377e+02  1.71108801e+01 -4.78370501e-01
  3.79517711e-06  2.12142810e+02  2.19826932e+01 -4.44150774e-01
  6.02050339e-06  2.14158354e+02  2.55062341e+01 -4.63093372e-01
  1.18317120e-05  2.16018576e+02  2.79544492e+01 -1.54978957e-01
  2.30947232e-05  2.17996312e+02  2.89707300e+01  2.90700286e-01
  4.12702726e-05  2.20783

In [None]:
total_means = list(total_means)
total_std = list(total_std)
len(total_means)
len(total_std)

148

In [None]:
normalizer = {'feature_means':total_means, 'feature_std':total_std}
#Write the per-feature means and standard deviations to disk; the with-block closes the file
with open("normalizer.json", "w") as out_file:
    json.dump(normalizer, out_file)

The cells above only have to be run once, in order to produce the normalizer file. The process takes nearly 2 hours, so repeating it should be avoided.

In [None]:
with open('normalizer.json') as f:
  normalizer_2 = json.load(f)
total_means = np.array(normalizer_2['feature_means'])
total_std = np.array(normalizer_2['feature_std'])

In [None]:
print(total_means.shape)
print(total_means)

(148,)
[ 3.89506860e-06  2.59765973e+02  2.05512754e+01  4.11997429e+00
  3.73806519e-06  2.56624863e+02  1.64801349e+01  2.87709858e+00
  3.63788541e-06  2.48899536e+02  1.35538529e+01  2.11001294e+00
  3.50575674e-06  2.38322037e+02  1.01508158e+01  1.10120914e+00
  3.40584091e-06  2.32197918e+02  7.73748721e+00  6.17399836e-01
  3.29680269e-06  2.27220304e+02  5.30066468e+00  3.09923269e-01
  3.06227972e-06  2.20116052e+02  1.91831078e+00 -1.25144914e-01
  2.95408422e-06  2.16799683e+02  1.66457871e+00 -2.70203074e-01
  2.82552277e-06  2.13068177e+02  4.59069095e+00 -3.69248694e-01
  2.76094392e-06  2.10882481e+02  9.47733688e+00 -3.86563168e-01
  3.06931553e-06  2.10529377e+02  1.71108801e+01 -4.78370501e-01
  3.79517711e-06  2.12142810e+02  2.19826932e+01 -4.44150774e-01
  6.02050339e-06  2.14158354e+02  2.55062341e+01 -4.63093372e-01
  1.18317120e-05  2.16018576e+02  2.79544492e+01 -1.54978957e-01
  2.30947232e-05  2.17996312e+02  2.89707300e+01  2.90700286e-01
  4.12702726e-05  

## Now create numpy files.
Process works like this:

1) Re-open the ERA files two at a time, but cut the second file off after its first few time steps (originally planned as the fifth day; see the correction note further down). This will help to prevent gaps in the time-series array between the files.

2) Reshape the ERA files to have 148 variables (4 at each atmospheric level)

3) Normalize each data frame's variables using the variables **total_means, total_std**

4) Convert the data frames to arrays, then reshape to the form (int(x.shape[0]/(20*20)), 20, 20, 148). Effectively this is (..., lat, lon, features), with 20 being the number of ranges between the latitudes/longitudes, of the form [37, 37.25) ... [41.75, 42.00). So instances at latitude 42 will be deleted before the array is saved.

5) Create duplicates of the arrays in a loop to create an overlapping time-series array of shape (..., 5, 20, 20, 148).

6) Save array as a **.npz file** or array file in an appropriate directory.
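
Steps 4 and 5 can be sketched on a small dummy array as follows. This is an illustration under assumptions, not the code used below: it builds the overlapping windows with NumPy's `sliding_window_view` instead of a Python loop, and the 10 time steps and the time-index tagging in column 0 are made up for checking purposes. It also assumes the rows are already sorted by time, then latitude, then longitude.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

n_hours, n_lat, n_lon, n_feat = 10, 20, 20, 148

# Dummy flat table: one row per (time, lat, lon), one column per feature
flat = np.zeros((n_hours * n_lat * n_lon, n_feat), dtype=np.float32)
# Tag every row with its time index so the windows can be checked afterwards
flat[:, 0] = np.repeat(np.arange(n_hours), n_lat * n_lon)

# Step 4: (rows, features) -> (time, lat, lon, features)
grid = flat.reshape(n_hours, n_lat, n_lon, n_feat)

# Step 5: overlapping windows of 5 consecutive time steps.
# sliding_window_view appends the window axis last, so move it next to time.
windows = np.moveaxis(sliding_window_view(grid, 5, axis=0), -1, 1)

print(grid.shape)     # (10, 20, 20, 148)
print(windows.shape)  # (6, 5, 20, 20, 148)
```

The result of `sliding_window_view` is a read-only view that shares memory with `grid`, so it does not multiply memory use by five until it is copied, for example when it is written out with `np.save`.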

#### Note: Size of files after all data stored may be too large to fit on computer unless compressed. If so, edit code as needed to create folders which can be compressed for more efficient storage. Code may be subject to change to include opening zip-files to make process space-viable

#### Reason for removing latitude 42, longitude -89
In order to label our latitudes and longitudes, we think of them in terms of ranges of the form (A, B], meaning that the range can include B but not A. Since latitude decreases across its ranges, they take the form [(42, 41.75], (41.75, 41.5], ..., (37.25, 37.0]]. This removal is therefore necessary in order to have labels that fit these same ranges in another code file.
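
As a quick check of the counts involved, the sketch below lists the latitude grid points under the assumption of a 0.25-degree step between 37 and 42 (taken from the ranges above) and applies the same `< 42` filter as the code. The variable names are illustrative only.

```python
import numpy as np

step = 0.25
# Grid points 37.0, 37.25, ..., 42.0 (0.25 is exact in binary, so no rounding issues)
lats = np.arange(37.0, 42.0 + step, step)
kept = lats[lats < 42]

print(len(lats))          # 21 grid points before filtering
print(len(kept))          # 20 grid points after dropping latitude 42
print(kept[0], kept[-1])  # 37.0 41.75
```

Dropping one boundary value per axis is what turns the grid into the 20 x 20 shape used when reshaping.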

#### Note: Reason for opening 2-files at a time.
The reason we open two files at a time, but shave off most of the second file's instances, is to make sure there aren't gaps in the pattern of [t-2, t-1, t, t+1, t+2]. By its nature, since we need the past 2 days to make predictions for the 3rd, we cannot predict for the first 2 days of the month. If this happens while opening a single file at a time, we end up with a 4-day gap made up of the last 2 days of the previous month and the first 2 days of the next month.

Opening two files at a time, we get to the end of the first month with the time-series window of days [26, 27, 28, 29, 30]. To get the 29th (the 2nd-to-last day of the month) we need [27, 28, 29, 30, 1 (next month)], to predict the last day we need [28, 29, 30, 1, 2], and so on, until to predict the 2nd day of the next month we need [30, 1, 2, 3, 4]. This collection of windows is saved before moving on. We close both files, then reopen with the 2nd month of the first subset and the month after it. So the pattern is [A, B], [B, C], [C, D], [D, E], etc. For the last iteration (November and December of 2022), do not throw out the instances in December after the 5th day; keep all of them instead. This ensures that we will have no gaps and only end up missing 4 days in total: the first 2 days of the year 2011 and the last 2 days of the year 2022.
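
The pairing pattern and the no-gap argument can be sketched as below, counting in hourly time steps with 4 extra steps borrowed from the following month. The helper `window_centres` and the shortened file list are hypothetical and only model the time indices, not the actual data.

```python
# Pair each file with its successor: [A, B], [B, C], [C, D], ...
files = ['ERA5_201101.csv', 'ERA5_201102.csv', 'ERA5_201103.csv']
pairs = list(zip(files[:-1], files[1:]))
print(pairs)

def window_centres(start, n_steps, extra=4, width=5):
    """Centre time index t of every [t-2, ..., t+2] window in one output file.

    start:   global index of the month's first time step
    n_steps: number of time steps in the month
    extra:   time steps borrowed from the start of the next month
    """
    steps = list(range(start, start + n_steps + extra))
    half = width // 2
    return steps[half:len(steps) - half]

jan = window_centres(0, 31 * 24)        # January 2011: 744 hourly steps
feb = window_centres(31 * 24, 28 * 24)  # February 2011: 672 hourly steps

print(len(jan), len(feb))  # 744 672
print(jan[-1], feb[0])     # 745 746: consecutive, so no gap and no duplicate
```

The window counts of 744 and 672 agree with the first axis of the array shapes printed by the conversion loop for records 1 and 2.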

#### Note: Correction, the data is by hour, not by day. The cutoff should instead be the first 4 hours, i.e. the first 65268 instances
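
The figure of 65268 rows can be reproduced with a little arithmetic, assuming the unfiltered files hold a 21 x 21 latitude/longitude grid at 37 levels per hour. The 21 longitude points are an assumption that is consistent with the row count, not something stated in the files themselves.

```python
n_levels = 37
n_lat = 21  # 37.0 to 42.0 in 0.25-degree steps, before dropping latitude 42
n_lon = 21  # assumed to match the latitude count

rows_per_hour = n_lat * n_lon * n_levels
print(rows_per_hour)      # 16317
print(4 * rows_per_hour)  # 65268 rows = first 4 hours of the second file

# After filtering to the 20 x 20 grid and unstacking the levels:
hours = 31 * 24 + 4       # January plus 4 borrowed hours
print(hours * 20 * 20)    # 299200 rows, as printed for record 1
print(hours - 4)          # 744 windows of 5 consecutive hours
```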

#### Note: This relies on B always being the month after A when opening files A and B. It therefore requires the data to be stored in the directory in sequential month order, which may require specific naming conventions for the files. This process also depends on each ERA file covering one entire month of a single year, to prevent issues with days.

In [None]:
eras = os.listdir('Bobbie_ERA5')
for i in range(len(eras) - 1):
    print(i, i+ 1)

0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
9 10
10 11
11 12
12 13
13 14
14 15
15 16
16 17
17 18
18 19
19 20
20 21
21 22
22 23


In [None]:
eras = []
for filename in sorted(os.listdir('Bobbie_ERA5')):
    for file in sorted(os.listdir('Bobbie_ERA5/'+filename)):
        eras.append('Bobbie_ERA5/'+filename + '/'+file)
eras
print(len(eras))
print(eras)

144
['Bobbie_ERA5/2011/ERA5_201101.csv', 'Bobbie_ERA5/2011/ERA5_201102.csv', 'Bobbie_ERA5/2011/ERA5_201103.csv', 'Bobbie_ERA5/2011/ERA5_201104.csv', 'Bobbie_ERA5/2011/ERA5_201105.csv', 'Bobbie_ERA5/2011/ERA5_201106.csv', 'Bobbie_ERA5/2011/ERA5_201107.csv', 'Bobbie_ERA5/2011/ERA5_201108.csv', 'Bobbie_ERA5/2011/ERA5_201109.csv', 'Bobbie_ERA5/2011/ERA5_201110.csv', 'Bobbie_ERA5/2011/ERA5_201111.csv', 'Bobbie_ERA5/2011/ERA5_201112.csv', 'Bobbie_ERA5/2012/ERA5_201201.csv', 'Bobbie_ERA5/2012/ERA5_201202.csv', 'Bobbie_ERA5/2012/ERA5_201203.csv', 'Bobbie_ERA5/2012/ERA5_201204.csv', 'Bobbie_ERA5/2012/ERA5_201205.csv', 'Bobbie_ERA5/2012/ERA5_201206.csv', 'Bobbie_ERA5/2012/ERA5_201207.csv', 'Bobbie_ERA5/2012/ERA5_201208.csv', 'Bobbie_ERA5/2012/ERA5_201209.csv', 'Bobbie_ERA5/2012/ERA5_201210.csv', 'Bobbie_ERA5/2012/ERA5_201211.csv', 'Bobbie_ERA5/2012/ERA5_201212.csv', 'Bobbie_ERA5/2013/ERA5_201301.csv', 'Bobbie_ERA5/2013/ERA5_201302.csv', 'Bobbie_ERA5/2013/ERA5_201303.csv', 'Bobbie_ERA5/2013/ERA5_

In [None]:
#Zero-pad the record numbers to 3 digits so that they sort in numeric order
records = [f'Record #{i:03d}' for i in range(1, len(eras))]
sorted(records)

['Record #001',
 'Record #002',
 'Record #003',
 'Record #004',
 'Record #005',
 'Record #006',
 'Record #007',
 'Record #008',
 'Record #009',
 'Record #010',
 'Record #011',
 'Record #012',
 'Record #013',
 'Record #014',
 'Record #015',
 'Record #016',
 'Record #017',
 'Record #018',
 'Record #019',
 'Record #020',
 'Record #021',
 'Record #022',
 'Record #023',
 'Record #024',
 'Record #025',
 'Record #026',
 'Record #027',
 'Record #028',
 'Record #029',
 'Record #030',
 'Record #031',
 'Record #032',
 'Record #033',
 'Record #034',
 'Record #035',
 'Record #036',
 'Record #037',
 'Record #038',
 'Record #039',
 'Record #040',
 'Record #041',
 'Record #042',
 'Record #043',
 'Record #044',
 'Record #045',
 'Record #046',
 'Record #047',
 'Record #048',
 'Record #049',
 'Record #050',
 'Record #051',
 'Record #052',
 'Record #053',
 'Record #054',
 'Record #055',
 'Record #056',
 'Record #057',
 'Record #058',
 'Record #059',
 'Record #060',
 'Record #061',
 'Record #062',
 'Record

In [None]:
for i in range(1, len(eras)):
    #Open two consecutive months at a time to prevent gaps between files
    #restart at 51
    print(i)
    a = pd.read_csv(eras[i-1])
    b = pd.read_csv(eras[i])
    a['time'] = pd.to_datetime(a['time'])
    b['time'] = pd.to_datetime(b['time'])
    #Unless this is the last pair of months, keep only the first 4 hours of b (65268 rows) to prevent duplicates
    #This assumes the rows of each file are ordered by time
    if i != (len(eras) - 1):
        b = b[:65268]
        #b = b[b['time'].dt.day < 5]
    #Concatenate the two files together
    period = pd.concat([a,b], axis = 0)
    #Drop the boundary rows at latitude 42 and longitude -89
    period = period[period['latitude'] < 42]
    period = period[period['longitude'] > -89]
    #Turn time/lat/long/level into index levels, then unstack level to make it a column
    #This reshapes the data to (original-rows/37, 4*37)
    period = period.set_index(['time', 'latitude', 'longitude', 'level']).unstack('level')
    period = period.reindex(columns=sorted(period.columns, key=lambda x: x[::-1]))
    period = period.sort_index(level = ['time', 'latitude', 'longitude'], ascending = [True, False, True])
    print(period.shape)
    #Normalize each of the 148 columns with the saved per-feature mean and standard deviation
    period = (period - total_means)/total_std
    period = period.to_numpy()
    #Reshape to (time, lat, lon, features)
    period = period.reshape((period.shape[0] // (20*20), 20, 20, 148))
    print(period.shape)
    #Stack overlapping 5-step windows of the form [t-2, t-1, t, t+1, t+2]
    z = np.array([period[j:j+5] for j in range(len(period) - 4)])
    print(z.shape, end='\n\n')
    #Zero-pad the record number to 3 digits without rebinding the loop variable
    np.save(f'ERA_Numpy_Files/timeseries_array_#{i:03d}', z)

1
(299200, 148)
(748, 20, 20, 148)
(744, 5, 20, 20, 148)

2
(270400, 148)
(676, 20, 20, 148)
(672, 5, 20, 20, 148)

3
(299200, 148)
(748, 20, 20, 148)
(744, 5, 20, 20, 148)

4
(289600, 148)
(724, 20, 20, 148)
(720, 5, 20, 20, 148)

5
(299200, 148)
(748, 20, 20, 148)
(744, 5, 20, 20, 148)

6
(289600, 148)
(724, 20, 20, 148)
(720, 5, 20, 20, 148)

7
(299200, 148)
(748, 20, 20, 148)
(744, 5, 20, 20, 148)

8
(299200, 148)
(748, 20, 20, 148)
(744, 5, 20, 20, 148)

9
(289600, 148)
(724, 20, 20, 148)
(720, 5, 20, 20, 148)

10
(299200, 148)
(748, 20, 20, 148)
(744, 5, 20, 20, 148)

11
(289600, 148)
(724, 20, 20, 148)
(720, 5, 20, 20, 148)

12
(299200, 148)
(748, 20, 20, 148)
(744, 5, 20, 20, 148)

13
(299200, 148)
(748, 20, 20, 148)
(744, 5, 20, 20, 148)

14
(280000, 148)
(700, 20, 20, 148)
(696, 5, 20, 20, 148)

15
(299200, 148)
(748, 20, 20, 148)
(744, 5, 20, 20, 148)

16
(289600, 148)
(724, 20, 20, 148)
(720, 5, 20, 20, 148)

17
(299200, 148)
(748, 20, 20, 148)
(744, 5, 20, 20, 148)

18
(28

In [None]:
#This code instead generates more, but simpler, files: it does not quintuple the size, and splits each month into thirds (12 days each)
#Thus the output for January would be 3 files of size (288, 20, 20, 148), (288, 20, 20, 148), (172, 20, 20, 148)