In [1]:
import pandas as pd

## Data Loading

#### Train Data

In [5]:
train_df = pd.read_csv('./data/UH_2023_TRAIN.txt', sep='|')
train_df.head(3)

Unnamed: 0,CAMPAÑA,ID_FINCA,ID_ZONA,ID_ESTACION,ALTITUD,VARIEDAD,MODO,TIPO,COLOR,SUPERFICIE,PRODUCCION
0,14,76953,515,4,660,26,2,0,1,0.0,22215.0
1,14,84318,515,4,660,26,2,0,1,0.0,22215.0
2,14,85579,340,4,520,32,2,0,1,0.0,20978.0


Based on this information, we will apply the following processing:
- **CAMPAÑA**: convert it to a year so the value is easier to interpret
- **SUPERFICIE**: assume that the surface of a finca does not change over the years
- **MODO/TIPO/VARIEDAD/COLOR**: treat as categorical variables

A priori, ID_FINCA and ID_ZONA do not seem relevant to the analysis, except that ID_FINCA will be used to infer SUPERFICIE for the rows where it is 0.0.
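
The three processing steps above could be sketched roughly as follows. This runs on a made-up toy frame that only borrows the column names from `train_df`. The mapping from campaign 14 to the year 2014 is an assumption, and so is the choice to fill a 0.0 surface with the largest known surface of the same finca.

```python
import numpy as np
import pandas as pd

# Toy frame with the same column names as train_df (values are made up).
toy = pd.DataFrame({
    'CAMPAÑA': [14, 15, 16, 14],
    'ID_FINCA': [1, 1, 1, 2],
    'VARIEDAD': [26, 26, 26, 32],
    'MODO': [2, 2, 2, 1],
    'TIPO': [0, 0, 0, 0],
    'COLOR': [1, 1, 1, 0],
    'SUPERFICIE': [0.0, 0.0, 1.5, 0.0],
})

# Assumption: campaign 14 corresponds to the year 2014.
toy['year'] = 2000 + toy['CAMPAÑA']

# Treat 0.0 as "unknown" and borrow the largest known surface of the same finca.
surface = toy['SUPERFICIE'].replace(0.0, np.nan)
known_surface = surface.groupby(toy['ID_FINCA']).transform('max')
toy['SUPERFICIE'] = surface.fillna(known_surface)

# Mark the coded columns as categorical.
cat_cols = ['VARIEDAD', 'MODO', 'TIPO', 'COLOR']
toy = toy.astype({col: 'category' for col in cat_cols})
```

A finca with no known surface in any campaign (finca 2 in the toy data) stays NaN and still needs a separate strategy.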

#### Meteo Data

In [6]:
meteo_df = pd.read_csv('./data/DATOS_METEO.TXT', sep='|')
meteo_df.head(3)

Unnamed: 0,validTimeUtc,precip1Hour,precip6Hour,precip24Hour,precip2Day,precip3Day,precip7Day,precipMtd,precipYtd,pressureChange,...,temperatureMax24Hour,temperatureMin24Hour,temperatureDewPoint,temperatureFeelsLike,uvIndex,visibility,windDirection,windGust,windSpeed,ID_ESTACION
0,2015-06-29 16:20:00,0.0,0.0,0.0,,,,,,-1.4,...,36.3,17.9,12.8,34.5,2.0,16.09,,,18.7,13
1,2015-06-29 17:20:00,0.0,0.0,0.0,,,,,,-1.0,...,35.0,17.9,12.3,34.3,1.0,16.09,,,18.0,13
2,2015-06-29 18:20:00,0.0,0.0,0.0,,,,,,-0.3,...,34.7,17.9,12.4,32.8,0.0,16.09,,,16.6,13


Based on this information, we will do the following:
- **Group** the data by year and extract features: mean, median, ...

It could also matter that, for example, a certain amount of precipitation in December influences the final outcome. So rather than aggregating over the whole year, we will aggregate per month within each year, and the monthly means will be added as features to the train df.

- **Merge** the result with the training data based on ID_ESTACION
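
A minimal sketch of the monthly grouping and the merge, using toy data with the `meteo_df` column names. The `year` column on the training side is hypothetical (it would come from CAMPAÑA), and merging on both station and year rather than on ID_ESTACION alone is an assumption. The feature names are also illustrative.

```python
import pandas as pd

# Toy hourly readings; column names mirror meteo_df, values are made up.
toy_meteo = pd.DataFrame({
    'validTimeUtc': ['2015-06-29 16:20:00', '2015-06-29 17:20:00',
                     '2015-12-01 10:20:00', '2015-12-02 10:20:00'],
    'precip1Hour': [0.0, 2.0, 1.0, 3.0],
    'ID_ESTACION': [13, 13, 13, 13],
})
toy_meteo['validTimeUtc'] = pd.to_datetime(toy_meteo['validTimeUtc'])
toy_meteo['year'] = toy_meteo['validTimeUtc'].dt.year
toy_meteo['month'] = toy_meteo['validTimeUtc'].dt.month

# Mean per station, year and month, then one column per month.
monthly = (
    toy_meteo
    .groupby(['ID_ESTACION', 'year', 'month'])['precip1Hour']
    .mean()
    .unstack('month')
    .add_prefix('precip1Hour_mean_m')
    .reset_index()
)

# Toy training rows; 'year' is a hypothetical column derived from CAMPAÑA.
toy_train = pd.DataFrame({'ID_ESTACION': [13, 4], 'year': [2015, 2015]})

# Left merge keeps every training row, even without matching weather data.
merged = toy_train.merge(monthly, on=['ID_ESTACION', 'year'], how='left')
```

The left merge leaves NaN in the monthly columns for stations without readings (station 4 here), which then has to be handled in the processing step.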

#### ETO Data

In [8]:
eteo_df = pd.read_csv('./data/DATOS_ETO.TXT', sep='|')
eteo_df.head(3)

Unnamed: 0,date,DewpointLocalAfternoonAvg,DewpointLocalAfternoonMax,DewpointLocalAfternoonMin,DewpointLocalDayAvg,DewpointLocalDayMax,DewpointLocalDayMin,DewpointLocalDaytimeAvg,DewpointLocalDaytimeMax,DewpointLocalDaytimeMin,...,WindSpeedLocalMorningAvg,WindSpeedLocalMorningMax,WindSpeedLocalMorningMin,WindSpeedLocalNighttimeAvg,WindSpeedLocalNighttimeMax,WindSpeedLocalNighttimeMin,WindSpeedLocalOvernightAvg,WindSpeedLocalOvernightMax,WindSpeedLocalOvernightMin,ID_ESTACION
0,20150629,285.9,285.9,285.9,286.0,287.0,285.4,285.9,285.9,285.9,...,,,,2.6,5.0,1.1,1.7,2.1,1.1,13
1,20150630,283.0,283.6,282.5,284.3,286.5,282.5,283.2,283.9,282.5,...,2.2,3.8,1.4,2.7,5.2,1.4,1.5,1.9,1.4,13
2,20150701,286.1,286.5,285.5,285.8,288.0,283.8,285.4,286.5,283.8,...,2.7,4.3,1.2,3.0,5.7,1.4,1.9,2.8,1.4,13


We need to do the following on this dataframe:
- **date**: parse the column to extract the day/month/year
- **group**: same as for the meteo data
- **merge**: merge into train_df based on ID_ESTACION
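
The date parsing could look like the sketch below. It assumes the `date` column is read as an integer in `YYYYMMDD` form, as the preview suggests. The toy frame and the new column names `year`, `month` and `day` are illustrative.

```python
import pandas as pd

# Toy rows shaped like eteo_df (only one measurement column kept).
toy_eto = pd.DataFrame({
    'date': [20150629, 20150630, 20150701],
    'DewpointLocalDayAvg': [286.0, 284.3, 285.8],
    'ID_ESTACION': [13, 13, 13],
})

# Convert the integers to strings first so the explicit format can be applied.
parsed = pd.to_datetime(toy_eto['date'].astype(str), format='%Y%m%d')
toy_eto['year'] = parsed.dt.year
toy_eto['month'] = parsed.dt.month
toy_eto['day'] = parsed.dt.day
```

With `year` and `month` available, the same per-month grouping planned for the meteo data can be applied here.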

## Data Pre-Processing

- Formatting columns
- Setting column names
- Setting correct dtypes

In [12]:
train_df_pre_proc = train_df.copy()

#### Train Data

In [14]:
train_df_pre_proc = train_df_pre_proc.rename(columns={
    'CAMPAÑA': 'campaing',
    'ID_FINCA': 'id_finca',
    'ID_ZONA': 'id_zone',
    'ID_ESTACION': 'id_estation',
    'VARIEDAD': 'variety',
    'ALTITUD': 'altitude',
    'MODO': 'mode',
    'TIPO': 'type',
    'COLOR': 'color',
    'SUPERFICIE': 'surface',
    'PRODUCCION': 'production'
})

train_df_pre_proc.head(3)

Unnamed: 0,campaing,id_finca,id_zone,id_estation,altitude,variety,mode,type,color,surface,production
0,14,76953,515,4,660,26,2,0,1,0.0,22215.0
1,14,84318,515,4,660,26,2,0,1,0.0,22215.0
2,14,85579,340,4,520,32,2,0,1,0.0,20978.0


In [None]:
# TODO: sklearn mixin for the processing

Before starting the exploration, we will first do a **train/valid** split. This way the validation data cannot bias the decisions we make during exploration.

In [None]:
# TODO: Train/val split
# TODO: Get also the dataframe to predict: produccion = NaN
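
One possible way to fill in these TODOs, shown on toy data with the renamed `production` column. The 25% validation fraction, the fixed seed and the plain random split are assumptions. A split by campaign might fit the problem better.

```python
import numpy as np
import pandas as pd

# Toy frame; the last two rows have no target and stand for the rows to predict.
toy = pd.DataFrame({
    'id_finca': range(10),
    'production': [10.0, 12.0, 9.0, 11.0, 8.0, 13.0, 7.0, 14.0, np.nan, np.nan],
})

# Rows without a target are the ones to predict later.
predict_df = toy[toy['production'].isna()]
labelled_df = toy[toy['production'].notna()]

# Hold out 25% of the labelled rows for validation (fraction and seed are arbitrary).
valid_df = labelled_df.sample(frac=0.25, random_state=0)
train_split_df = labelled_df.drop(valid_df.index)
```

Splitting only the labelled rows keeps the rows to predict out of both the training and the validation sets.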

## Data Exploration

This is where we will group the data and explore it.

## Data Processing

- Handling the NaNs
- Creating the mean features

## Data Engineering