## Creating Dataset From NetCDF Data


The dataset, taken from https://www.copernicus.eu/en, is in NetCDF format. The sea surface temperature variable is stored along three dimensions: time, latitude and longitude.

Every cell of this 3D array holds one sea surface temperature value. To simplify the analysis, we aggregate over the whole sea surface area for each day and keep five statistics: the mean, standard deviation, variance, minimum and maximum temperature.
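
As a rough illustration of this aggregation, the sketch below reduces a small made-up grid to the same five statistics. The grid values and the `FILL` marker for land cells are assumptions for the example, not values from the Copernicus files; the point is only that masked cells are left out of every statistic.

```python
import numpy as np

# Hypothetical 3x4 grid of temperatures; FILL marks cells with no sea surface (assumed value).
FILL = -32768.0
grid = np.array([
    [288.0, 289.0, FILL, 290.0],
    [287.0, FILL, FILL, 291.0],
    [286.0, 288.0, 289.0, 292.0],
])

# Mask the fill values so they are excluded from every reduction below.
sst = np.ma.masked_equal(grid, FILL)

summary = {
    'mean_temp': float(sst.mean()),
    'var_temp': float(sst.var()),
    'stdv_temp': float(sst.std()),
    'min_temp': float(sst.min()),
    'max_temp': float(sst.max()),
}
print(summary)
```

Arrays read through `netCDF4` are typically already masked where the variable has a fill value, so the explicit `masked_equal` step is only needed here because the grid is built by hand.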

I have taken the dataset from January 1st 1991 up to March 24th 2024.
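
One way to check the date coverage of a collection of daily files is to list the calendar days that are missing between the first and last date. The sketch below does this on a made-up sample; `missing_days` is a hypothetical helper, not part of the notebook.

```python
from datetime import date, timedelta

def missing_days(dates):
    """Return the calendar days between min(dates) and max(dates) that are absent."""
    present = set(dates)
    start, end = min(present), max(present)
    n_days = (end - start).days + 1
    candidates = (start + timedelta(days=i) for i in range(n_days))
    return [day for day in candidates if day not in present]

# Hypothetical sample in which 3 January is missing.
sample = [date(1991, 1, 1), date(1991, 1, 2), date(1991, 1, 4)]
print(min(sample), max(sample), missing_days(sample))
```

Applied to the `date_object` column (converted to plain dates), the same helper would report the first day, the last day and any gaps.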

In [6]:
import netCDF4 as nc
import numpy as np
import pandas as pd
import os
from datetime import datetime

In [2]:
def data_creation(path):
    """Aggregate every daily NetCDF file under `path` into one row of summary statistics."""
    data_list = []
    for root, _dirs, files in os.walk(path):
        for file in files:
            # Skip macOS metadata and the CSV this notebook writes into the same folder
            if file == '.DS_Store' or file.endswith('.csv'):
                continue
            file_path = os.path.join(root, file)
            # File names are expected to start with the date as YYYYMMDD
            date_object = datetime.strptime(file[:8], '%Y%m%d')
            # Read the first time step once and close the file afterwards
            with nc.Dataset(file_path, 'r') as dataset:
                sst = dataset.variables['analysed_sst'][0, :, :]
            data_list.append({
                'date_object': date_object,
                'mean_temp': np.mean(sst),
                'var_temp': np.var(sst),
                'stdv_temp': np.std(sst),
                'min_temp': np.min(sst),
                'max_temp': np.max(sst),
            })
    columns = ['date_object', 'mean_temp', 'var_temp', 'stdv_temp', 'min_temp', 'max_temp']
    df = pd.DataFrame.from_records(data_list, columns=columns)
    return df
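
The function relies on each file name starting with an eight-digit date. A minimal sketch of that step in isolation is shown below; `parse_file_date` is a hypothetical helper and the `.nc` file name is made up. It returns `None` instead of raising when the name does not begin with a valid `YYYYMMDD` date.

```python
from datetime import datetime

def parse_file_date(file_name):
    """Return the date encoded in the first 8 characters, or None if they are not YYYYMMDD."""
    try:
        return datetime.strptime(file_name[:8], '%Y%m%d')
    except ValueError:
        return None

# Hypothetical file names: one data file and two files that should be skipped.
for name in ['20130316-hypothetical.nc', '.DS_Store', 'curated_data.csv']:
    print(name, parse_file_date(name))
```

A check like this could replace a name-based filter, since any file without a parsable date is simply skipped.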


In [3]:
data = data_creation('./data')
data.to_csv('./data/curated_data.csv')

In [4]:
data.head()

| | date_object | mean_temp | var_temp | stdv_temp | min_temp | max_temp |
|---|---|---|---|---|---|---|
| 0 | 2013-03-16 | 288.111629 | 3.265959 | 1.807197 | 281.959991 | 292.529999 |
| 1 | 2013-01-28 | 288.593712 | 3.419196 | 1.849107 | 282.320007 | 293.22998 |
| 2 | 2013-03-01 | 288.022179 | 3.369696 | 1.835673 | 280.519989 | 292.109985 |
| 3 | 2013-02-23 | 288.091876 | 2.960083 | 1.720489 | 280.959991 | 292.299988 |
| 4 | 2013-08-29 | 297.181748 | 9.474313 | 3.078037 | 286.919983 | 302.949982 |
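
Because the standard deviation is the square root of the variance, `stdv_temp` and `var_temp` carry the same information. The short check below confirms this on the five rows shown above, with the values copied from that output; the tolerance is an assumption that allows for display rounding.

```python
import math

# (var_temp, stdv_temp) pairs copied from the first five rows of the output.
rows = [
    (3.265959, 1.807197),
    (3.419196, 1.849107),
    (3.369696, 1.835673),
    (2.960083, 1.720489),
    (9.474313, 3.078037),
]

# Largest gap between sqrt(variance) and the reported standard deviation.
max_error = max(abs(math.sqrt(var_temp) - stdv_temp) for var_temp, stdv_temp in rows)
print(max_error < 1e-5)
```

For modelling, keeping only one of the two columns avoids feeding in a redundant feature.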


In [5]:
data.describe()

| | date_object | mean_temp | var_temp | stdv_temp | min_temp | max_temp |
|---|---|---|---|---|---|---|
| count | 45120 | 45120.0 | 45120.0 | 45120.0 | 45120.0 | 45120.0 |
| mean | 2004-02-22 02:33:45.957446656 | 292.177055 | 5.43848 | 2.273008 | 284.958923 | 297.535339 |
| min | 1981-08-25 00:00:00 | 286.772977 | 1.45625 | 1.206752 | 277.199982 | 290.600006 |
| 25% | 1994-05-27 00:00:00 | 288.879731 | 3.362481 | 1.833707 | 282.48999 | 293.459991 |
| 50% | 2004-04-02 00:00:00 | 291.669028 | 4.709435 | 2.170123 | 285.559998 | 297.359985 |
| 75% | 2014-05-23 00:00:00 | 295.4689 | 7.02319 | 2.65013 | 287.470001 | 301.509979 |
| max | 2024-03-24 00:00:00 | 298.932261 | 16.454772 | 4.056448 | 291.519989 | 305.839996 |
| std | | 3.373437 | 2.546698 | 0.52146 | 2.960095 | 4.167403 |
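
The magnitudes in this summary (roughly 277 to 306) suggest that the temperatures are in kelvin, although the notebook does not state the unit. Under that assumption, the sketch below converts a few of the reported values to degrees Celsius for easier reading; `kelvin_to_celsius` is a hypothetical helper.

```python
KELVIN_OFFSET = 273.15  # 0 degrees Celsius expressed in kelvin

def kelvin_to_celsius(kelvin):
    """Convert a temperature from kelvin to degrees Celsius."""
    return kelvin - KELVIN_OFFSET

# Values copied from the summary table; the unit (kelvin) is an assumption.
stats_k = {
    'mean of mean_temp': 292.177055,
    'min of min_temp': 277.199982,
    'max of max_temp': 305.839996,
}
for label, value in stats_k.items():
    print(f'{label}: {kelvin_to_celsius(value):.2f} °C')
```

Only the mean, minimum and maximum shift under this conversion; the variance and standard deviation are unchanged by a constant offset.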
