<div style="text-align: center;">
    <a href="https://www.dataia.eu/">
        <img border="0" src="https://github.com/ramp-kits/template-kit/raw/main/img/DATAIA-h.png" width="90%"></a>
</div>

# Template Kit for RAMP challenge

<i> Thomas Moreau (Inria) </i>

## Introduction

Describe the challenge, in particular:

- Where does the data come from?
- What is the task this challenge aims to solve?
- Why does it matter?

# Exploratory data analysis

The goal of this section is to show what's in the data, and how to play with it.
This is the first step in any data science project; here, you should give a sense of the data the participants will be working with.

You can first load and describe the data, and then show some interesting properties of it.

In [17]:
import xarray as xr
import pandas as pd

# Open the NetCDF file
dataset = xr.open_dataset('/home/leo/Programmation/Python/DataCamp/template-kit/data/pr_Amon_CNRM-CM6-1_historical_r1i1p1f2_gr_20000116-20141216.nc')

# Print the dataset information
print(dataset)

# Convert the dataset to a pandas DataFrame
df = dataset.to_dataframe().reset_index()

# Print the DataFrame
print(df)

<xarray.Dataset> Size: 941kB
Dimensions:      (lat: 26, lon: 50, time: 180, axis_nbounds: 2)
Coordinates:
  * lat          (lat) float64 208B 35.72 37.12 38.52 ... 67.94 69.34 70.74
  * lon          (lon) float64 400B -23.91 -22.5 -21.09 ... 42.19 43.59 45.0
  * time         (time) datetime64[ns] 1kB 2000-01-16T12:00:00 ... 2014-12-16...
Dimensions without coordinates: axis_nbounds
Data variables:
    time_bounds  (time, axis_nbounds) datetime64[ns] 3kB ...
    pr           (time, lat, lon) float32 936kB ...
Attributes: (12/54)
    name:                   /scratch/work/voldoire/outputs/CMIP6/DECK/CNRM-CM...
    Conventions:            CF-1.7 CMIP-6.2
    creation_date:          2018-06-20T08:40:01Z
    description:            CMIP6 historical
    title:                  CNRM-CM6-1 model output prepared for CMIP6 / CMIP...
    activity_id:            CMIP
    ...                     ...
    xios_commit:            1442-shuffle
    nemo_gelato_commit:     49095b3accd5d4c_6524fe19b00467a


In [23]:
df.head()

         lat       lon                time  axis_nbounds time_bounds        pr
0  35.719532 -23.90625 2000-01-16 12:00:00             0  2000-01-01  0.000029
1  35.719532 -23.90625 2000-01-16 12:00:00             1  2000-02-01  0.000029
2  35.719532 -23.90625 2000-02-15 12:00:00             0  2000-02-01  0.000042
3  35.719532 -23.90625 2000-02-15 12:00:00             1  2000-03-01  0.000042
4  35.719532 -23.90625 2000-03-16 12:00:00             0  2000-03-01  0.000023
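
The `df.head()` output above shows each `(lat, lon, time)` cell twice, once per `axis_nbounds` value, because `time_bounds` is flattened together with `pr`. The sketch below rebuilds those five rows by hand, drops the bounds columns, removes the duplicates, and adds a precipitation column in mm/day. It assumes that `pr` is a precipitation flux in kg m-2 s-1 (the usual CMIP6 convention, not visible in the printed attributes), so that multiplying by 86 400 seconds gives mm/day. The names `tidy_precipitation` and `pr_mm_per_day` are illustrative only.

```python
import pandas as pd

# The five rows printed by df.head() above, rebuilt by hand for illustration
sample = pd.DataFrame({
    "lat": [35.719532] * 5,
    "lon": [-23.90625] * 5,
    "time": pd.to_datetime([
        "2000-01-16 12:00:00", "2000-01-16 12:00:00",
        "2000-02-15 12:00:00", "2000-02-15 12:00:00",
        "2000-03-16 12:00:00",
    ]),
    "axis_nbounds": [0, 1, 0, 1, 0],
    "time_bounds": pd.to_datetime([
        "2000-01-01", "2000-02-01", "2000-02-01", "2000-03-01", "2000-03-01",
    ]),
    "pr": [2.9e-05, 2.9e-05, 4.2e-05, 4.2e-05, 2.3e-05],
})

SECONDS_PER_DAY = 86_400


def tidy_precipitation(frame):
    """Keep one row per (lat, lon, time) and add precipitation in mm/day.

    Assumption: `pr` is in kg m-2 s-1, and 1 kg m-2 of water is 1 mm.
    """
    tidy = (
        frame.drop(columns=["axis_nbounds", "time_bounds"])
        .drop_duplicates(subset=["lat", "lon", "time"])
        .reset_index(drop=True)
    )
    tidy["pr_mm_per_day"] = tidy["pr"] * SECONDS_PER_DAY
    return tidy


tidy = tidy_precipitation(sample)
print(tidy)
```

Applied to the full dataframe, the same function should halve the number of rows, since every grid cell and month appears exactly twice in the flattened table.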


In [19]:
import os
import tempfile
import zipfile

import xarray as xr

# Path to the ZIP file
zip_path = 'data/historical_precipitation_cmip6.zip'

with tempfile.TemporaryDirectory() as tmp_dir:
    # Extract only the .nc files inside the ZIP to the temporary directory
    with zipfile.ZipFile(zip_path, 'r') as z:
        nc_names = [name for name in z.namelist() if name.endswith('.nc')]
        z.extractall(tmp_dir, members=nc_names)

    # Load the NetCDF file with xarray (assumes the archive holds a single
    # .nc file) and convert it to a DataFrame while the file still exists
    nc_path = os.path.join(tmp_dir, nc_names[0])
    with xr.open_dataset(nc_path, engine='netcdf4') as ds:
        dataset = ds.to_dataframe()

# Print the DataFrame, indexed by (lat, lon, time, axis_nbounds)
print(dataset)

# Turn the index levels into regular columns
df = dataset.reset_index()

# Print the flattened DataFrame
print(df)

                                                     time_bounds        pr
lat       lon       time                axis_nbounds                      
35.719532 -23.90625 2000-01-16 12:00:00 0             2000-01-01  0.000029
                                        1             2000-02-01  0.000029
                    2000-02-15 12:00:00 0             2000-02-01  0.000042
                                        1             2000-03-01  0.000042
                    2000-03-16 12:00:00 0             2000-03-01  0.000023
...                                                          ...       ...
70.738059  45.00000 2014-10-16 12:00:00 1             2014-11-01  0.000022
                    2014-11-16 00:00:00 0             2014-11-01  0.000021
                                        1             2014-12-01  0.000021
                    2014-12-16 12:00:00 0             2014-12-01  0.000017
                                        1             2015-01-01  0.000017

[468000 rows x 2 columns]

In [14]:
tmp_dir

'/tmp/tmp_1ykbkzg'

In [None]:
df.columns = ['wet_days']
df.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,cwd
time,latitude,longitude,Unnamed: 3_level_1
2000-01-31,35.04986,-24.95014,
2000-01-31,35.04986,-24.85014,
2000-01-31,35.04986,-24.75014,
2000-01-31,35.04986,-24.65014,
2000-01-31,35.04986,-24.55014,


In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
pd.set_option('display.max_columns', None)

# Load the data

import problem
X_df, y = problem.get_train_data()

# Challenge evaluation

A particularly important point in a challenge is to describe how it is evaluated. This is the section where you should describe the metric that will be used to evaluate the participants' submissions, as well as your evaluation strategy, in particular if there is some complexity in the way the data should be split to ensure valid results.
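
As an illustration of a split that needs some care, the sketch below uses scikit-learn's `TimeSeriesSplit` on a monthly time axis, so that every validation fold comes strictly after its training fold. This is only one possible strategy, shown under the assumption that the samples are ordered in time; it is not the evaluation scheme of this challenge, which still has to be specified. The value `n_months = 180` matches the `time` dimension printed earlier, and the number of splits is arbitrary.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# One index per month; 180 matches the `time` dimension of the dataset above
n_months = 180
month_index = np.arange(n_months)

# Arbitrary choice of 5 folds, for illustration only
cv = TimeSeriesSplit(n_splits=5)
folds = list(cv.split(month_index))

for fold, (train_idx, test_idx) in enumerate(folds):
    # Training months always precede validation months
    print(
        f"fold {fold}: train months {train_idx.min()}-{train_idx.max()}, "
        f"validation months {test_idx.min()}-{test_idx.max()}"
    )
```

Splitting this way avoids evaluating a model on months that precede the ones it was trained on, which a shuffled K-fold split would allow.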

# Submission format

Here, you should describe the submission format. This is the format the participants should follow to submit their predictions on the RAMP platform.

This section also shows how to use the `ramp-workflow` library to test the submission locally.

## The pipeline workflow

The input data are stored in a dataframe. A scikit-learn column transformer can convert a dataframe to a NumPy array, for instance by selecting the subset of columns we want to work with. The starting kit estimator below is a simpler pipeline that standardizes the features and fits a logistic regression.
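
A minimal sketch of the column-selection step mentioned above, using `make_column_transformer` with the `"passthrough"` option. The column names in `SELECTED_COLUMNS`, the toy dataframe, and the function name `get_estimator_with_column_selection` are hypothetical and only serve the illustration; they should be replaced with the columns of the actual challenge data.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical column names, chosen for illustration only
SELECTED_COLUMNS = ["lat", "lon"]


def get_estimator_with_column_selection():
    # Keep only the selected columns and drop everything else
    preprocessor = make_column_transformer(
        ("passthrough", SELECTED_COLUMNS),
        remainder="drop",
    )
    return make_pipeline(preprocessor, StandardScaler(), LogisticRegression())


# Toy data, not taken from the challenge dataset
toy_X = pd.DataFrame({
    "lat": [35.7, 37.1, 38.5, 39.9, 41.3, 42.7],
    "lon": [-23.9, -22.5, -21.1, -19.7, -18.3, -16.9],
    "ignored": ["a", "b", "c", "d", "e", "f"],
})
toy_y = [0, 0, 0, 1, 1, 1]

model = get_estimator_with_column_selection().fit(toy_X, toy_y)

# The first pipeline step turns the dataframe into a two-column array
selected = model[0].transform(toy_X)
print(selected.shape)
```

Because the non-numeric `ignored` column is dropped before scaling, the rest of the pipeline only ever sees numeric features.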

In [2]:
# %load submissions/starting_kit/estimator.py

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression


def get_estimator():
    pipe = make_pipeline(
        StandardScaler(),
        LogisticRegression()
    )

    return pipe


## Testing using a scikit-learn pipeline

In [None]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(get_estimator(), X_df, y, cv=5, scoring='accuracy')
print(scores)

[0.97222222 0.96527778 0.97212544 0.95121951 0.96167247]


## Submission

To submit your code, you can refer to the [online documentation](https://paris-saclay-cds.github.io/ramp-docs/ramp-workflow/stable/using_kits.html).