# Mission

Our dataset contains daily observations of station-based measurements worldwide. 

We have chosen to look into stations in Italy, which has the oldest records, and to compare them with stations in another country to identify contrasts.
By looking into the precipitation data, we aim to gain insight into climate change and to understand the fluctuations in precipitation.

__________

# Imports

In [5]:
import os
import pandas as pd

# AWS Bucket interaction
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Visualization
import matplotlib

___________

# Data Download

The data is located in an AWS S3 bucket. An overview of the bucket can be found here: http://noaa-ghcn-pds.s3.amazonaws.com/

This part of the notebook connects to the bucket, checks if the desired files are present in the local path, and downloads them if they are not.

### Setup Bucket connection

In [6]:
s3_client = boto3.client('s3', config=Config(signature_version=UNSIGNED))
s3_resource = boto3.resource("s3", config=Config(signature_version=UNSIGNED))
bucket = s3_resource.Bucket('noaa-ghcn-pds')

### Download files

There are two types of files in the bucket: one gzip-compressed `.csv` per year between 1763 and 2022 (e.g. `csv.gz/2019.csv.gz`), and one per station (e.g. `csv.gz/by_station/AGE00147713.csv.gz`). The prefix of the station ID (such as "AGE00147713") indicates the country, although one country can have multiple prefixes (e.g. "IT", "ITE", "ITM" and "ITW" for Italy).
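To make the naming scheme concrete, here is a minimal sketch that pulls the station ID out of an object key and checks whether it belongs to Italy. The helper names `station_id_from_key` and `is_italian_station` are hypothetical and not part of the notebook; the check simply assumes that every Italian prefix starts with "IT", as listed above.

```python
def station_id_from_key(key):
    """Return the station ID embedded in a by_station object key."""
    filename = key.rsplit('/', 1)[-1]   # e.g. 'AGE00147713.csv.gz'
    return filename.split('.', 1)[0]    # e.g. 'AGE00147713'


def is_italian_station(station_id):
    """True when the ID starts with 'IT', which also covers ITE, ITM and ITW."""
    return station_id.startswith('IT')


example_id = station_id_from_key('csv.gz/by_station/AGE00147713.csv.gz')
print(example_id, is_italian_station(example_id))
print(is_italian_station('ITE00100550'))
```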

The yearly files combine the data of all stations that have data points for that year, while each per-station file combines all the years recorded for that station. Since we are only interested in the stations in Italy, we download only the files whose key contains `by_station/IT`. Because Italy's other prefixes also start with "IT", they are included as well.

As the keys in the bucket are sorted (yearly files first, then the per-station files in alphabetical order), the iteration stops at the first key that contains `by_station/IV`, which is the next country prefix after Italy.
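The stopping rule can be illustrated on a plain list of keys, without touching the bucket. This is only a sketch: `select_station_keys` is a hypothetical helper, and the "IV" and "JA" keys in the sample list are made up to show where the loop ends.

```python
def select_station_keys(sorted_keys, wanted='by_station/IT', stop='by_station/IV'):
    """Yield keys containing `wanted`; stop at the first key containing `stop`."""
    for key in sorted_keys:
        if stop in key:
            break  # keys are sorted, so no Italian station can follow
        if wanted in key:
            yield key


sample_keys = [
    'csv.gz/2019.csv.gz',
    'csv.gz/by_station/AGE00147713.csv.gz',
    'csv.gz/by_station/IT000016090.csv.gz',
    'csv.gz/by_station/ITE00100550.csv.gz',
    'csv.gz/by_station/IV000000001.csv.gz',  # made-up key, marks the stop
    'csv.gz/by_station/JA000000001.csv.gz',  # made-up key, never reached
]

selected = list(select_station_keys(sample_keys))
print(selected)
```

boto3 resource collections also offer `filter(Prefix=...)`, so something like `bucket.objects.filter(Prefix='csv.gz/by_station/IT')` would presumably let S3 do this narrowing on the server side. The notebook does not use it, so treat that as an untested alternative.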

In [7]:
# Change local path depending on where data will be downloaded to.
# Keep the trailing path separator: file names are appended to this string directly.
data_root = 'LOCAL_PATH'

In [8]:
for obj in bucket.objects.all():
    # Download the files for Italian stations.
    if 'by_station/IT' in obj.key:
        _, filename = os.path.split(obj.key)
        local_path = data_root + filename
        if not os.path.isfile(local_path):
            s3_client.download_file('noaa-ghcn-pds', obj.key, local_path)
            print(f'{filename} downloaded')
        else:
            print(f'{filename} already exists')
    # Stop the iteration after stations in Italy.
    if 'by_station/IV' in obj.key:
        break

IT000016090.csv.gz downloaded
IT000016134.csv.gz downloaded
IT000016232.csv.gz downloaded
IT000016239.csv.gz downloaded
IT000016320.csv.gz downloaded
IT000016550.csv.gz downloaded
IT000016560.csv.gz downloaded
IT000160220.csv.gz downloaded
IT000162240.csv.gz downloaded
IT000162580.csv.gz downloaded
ITE00100550.csv.gz downloaded
ITE00100551.csv.gz downloaded
ITE00100552.csv.gz downloaded
ITE00100553.csv.gz downloaded
ITE00100554.csv.gz downloaded
ITE00105250.csv.gz downloaded
ITE00115584.csv.gz downloaded
ITE00115588.csv.gz downloaded
ITE00155336.csv.gz downloaded
ITE00155337.csv.gz downloaded
ITE00155338.csv.gz downloaded
ITE00155339.csv.gz downloaded
ITE00155340.csv.gz downloaded
ITE00155341.csv.gz downloaded
ITE00155342.csv.gz downloaded
ITE00155343.csv.gz downloaded
ITE00155344.csv.gz downloaded
ITE00155345.csv.gz downloaded
ITE00155346.csv.gz downloaded
ITE00155347.csv.gz downloaded
ITE00155348.csv.gz downloaded
ITE00155349.csv.gz downloaded
ITE00155350.csv.gz downloaded
ITE0015535

_________

# Data Analysis

In [9]:
column_names = ['ID', 'DATE', 'ELEMENT', 'DATA VALUE', 'M-FLAG', 'Q-FLAG', 'S-FLAG', 'OBS-TIME']

### Load file

In [10]:
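def load_df(station):
    # Read one station's gzip-compressed CSV. With header=0 the first line is
    # treated as a header and its labels are replaced by column_names.
    df = pd.read_csv(data_root + station + '.csv.gz', header=0, names=column_names)
    return df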
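The combination of `header=0` and `names=` is easy to misread, so here is a small in-memory sketch of what it does. The header line in `csv_text` is an assumption about the file layout (it is not shown in this notebook); the two data rows are taken from the station output further down.

```python
import io

import pandas as pd

column_names = ['ID', 'DATE', 'ELEMENT', 'DATA VALUE', 'M-FLAG', 'Q-FLAG', 'S-FLAG', 'OBS-TIME']

# Assumed layout: one header line followed by the observations.
csv_text = (
    'ID,DATE,ELEMENT,DATA_VALUE,M_FLAG,Q_FLAG,S_FLAG,OBS_TIME\n'
    'IT000016090,19450519,TAVG,252,H,,S,\n'
    'IT000016090,19450520,TAVG,255,H,,S,\n'
)

# header=0 consumes the first line; names= supplies the replacement labels.
demo = pd.read_csv(io.StringIO(csv_text), header=0, names=column_names)
print(list(demo.columns))
print(len(demo))
```

If a file had no header line, `header=0` would drop its first observation instead, so it is worth checking one raw file before relying on this.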

In [11]:
df = load_df('IT000016090')

In [12]:
df

Unnamed: 0,ID,DATE,ELEMENT,DATA VALUE,M-FLAG,Q-FLAG,S-FLAG,OBS-TIME
0,IT000016090,19450519,TAVG,252,H,,S,
1,IT000016090,19450520,TAVG,255,H,,S,
2,IT000016090,19450521,TAVG,249,H,,S,
3,IT000016090,19450522,TAVG,218,H,,S,
4,IT000016090,19450523,TAVG,185,H,,S,
...,...,...,...,...,...,...,...,...
93123,IT000016090,20240216,TAVG,75,H,,S,
93124,IT000016090,20240217,TAVG,84,H,,S,
93125,IT000016090,20240218,TAVG,94,H,,S,
93126,IT000016090,20240219,TAVG,101,H,,S,
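As a possible next step towards the precipitation analysis, the sketch below selects one `ELEMENT`, parses `DATE` and rescales `DATA VALUE`. It rests on two assumptions that should be checked against the GHCN-Daily documentation: `PRCP` is the element code for precipitation, and values are stored in tenths of a unit (tenths of °C for `TAVG`, tenths of mm for `PRCP`). `select_element` and the `VALUE` column are hypothetical names; the sample rows are the first five shown above.

```python
import pandas as pd

# First five rows of the station output shown above.
sample = pd.DataFrame({
    'ID': ['IT000016090'] * 5,
    'DATE': [19450519, 19450520, 19450521, 19450522, 19450523],
    'ELEMENT': ['TAVG'] * 5,
    'DATA VALUE': [252, 255, 249, 218, 185],
})


def select_element(df, element):
    """Return the rows for one ELEMENT with parsed dates and rescaled values."""
    out = df[df['ELEMENT'] == element].copy()
    # DATE is stored as an integer such as 19450519 (YYYYMMDD).
    out['DATE'] = pd.to_datetime(out['DATE'].astype(str), format='%Y%m%d')
    # Assumption: values are recorded in tenths of the physical unit.
    out['VALUE'] = out['DATA VALUE'] / 10
    return out


tavg = select_element(sample, 'TAVG')
print(tavg[['DATE', 'VALUE']])
```

On the full station frame, `select_element(df, 'PRCP')` would be the analogous call for precipitation, provided the station reports that element.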
