<div style="text-align: center;">
    <a href="https://www.data.gouv.fr/fr/datasets/historique-detaille-des-surfaces-cheptels-et-nombre-doperateurs-par-commune">
        <img border="0" src="image/agence_bio_logo.jpg" width="40%"></a>
</div>

# Farming activity RAMP challenge

AVAKIAN Alexia, DEROUX Alexandre, ERRAJI Kenza, GUILLAUME Constantin, MARIN-BERTIN Guillaume, TOZZA Jean

## Introduction

The [dataset](https://www.data.gouv.fr/fr/datasets/historique-detaille-des-surfaces-cheptels-et-nombre-doperateurs-par-commune) is published by [Agence Bio](https://www.agencebio.org), the French agency for the development and promotion of organic farming, and tracks farming activity across French municipalities from 2008 to 2023. It describes where the farming parcels are located (region, department, etc.) and how they are used (surface, type of production, etc.).

The data cover metropolitan France and the overseas departments and regions (DROM). They are anonymized: no information about the natural or legal person who farms the plots is included.

The challenge is to design an algorithm that predicts the activity type (`code_activites`) from the other characteristics of the farming parcels.

In recent years, there has been a push toward more sustainable agriculture and farming practices, with significant environmental and health implications.

Understanding farming activity patterns can help policy makers design initiatives that support farmers and farming organizations, encourage the transition to organic production, and improve the sustainability of practices. The same patterns can also be used to evaluate past policies.

### Exploratory data analysis

In [2]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
pd.set_option("display.max_columns", None)

import problem

X_df, y = problem.get_train_data()

In [3]:
print("Dataset shape:", X_df.shape)

Dataset shape: (3338636, 25)


In [4]:
X_df.head()

| | annee | code_region | region | code_departement | departement | code_epci | epci | code_insee_commune | code_postal_commune | commune | nombre_operateurs | code_groupe_surface | nombre_exploitations_surface | surface_terme_conversion | surface_conversion_annee_1 | surface_conversion_annee_2 | surface_conversion_annee_3 | surface_totale_conversion | surface_totale_bio | code_groupe_cheptel | nombre_exploitations_cheptel | cheptel_terme_conversion | cheptel_conversion_simultanee | cheptel_conversion_non_simultanee | cheptel_total_bio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2022 | 52.0 | 15 | 72 | 82 | 698 | 612 | 20135 | 72800 | 896 | 1 | 6 | 4.0 | 164.85 | 0.0 | 0.0 | 0.0 | 0.0 | 164.85 | 8 | 1.0 | 48000.0 | 0.0 | 0.0 | 48000.0 |
| 1 | 2013 | 75.0 | 13 | 33 | 33 | 890 | 445 | 9257 | 33620 | 18416 | 1 | 0 | 1.0 | 6.27 | 0.0 | 0.0 | 0.0 | 0.0 | 6.27 | 13 | NaN | NaN | NaN | NaN | NaN |
| 2 | 2010 | 84.0 | 0 | 6 | 6 | 660 | 9 | 1463 | 7430 | 17434 | 1 | 1 | 1.0 | 0.0 | 0.41 | 0.0 | 0.0 | 0.41 | 0.41 | 13 | NaN | NaN | NaN | NaN | NaN |
| 3 | 2015 | 27.0 | 1 | 19 | 21 | 588 | 704 | 4745 | 21220 | 2897 | 1 | 2 | 1.0 | 33.59 | 13.58 | 0.0 | 0.0 | 13.58 | 47.17 | 2 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 4 | 2012 | 75.0 | 13 | 18 | 18 | 334 | 679 | 4475 | 19370 | 3937 | 1 | 6 | 2.0 | 96.25 | 0.0 | 0.0 | 0.0 | 0.0 | 96.25 | 4 | 1.0 | 130.0 | 0.0 | 0.0 | 130.0 |
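The preview shows that the livestock (`cheptel`) columns can be empty for some rows. As a first exploratory step, one might quantify missing values per column and count rows per year. The sketch below runs on a small toy frame (`toy_df`, a hypothetical name) that reuses a few column names and values from the preview; in the notebook the same calls would be applied to `X_df`.

```python
import numpy as np
import pandas as pd

# Toy frame reusing a few column names and values from the preview above;
# in the notebook you would run the same calls on X_df instead.
toy_df = pd.DataFrame(
    {
        "annee": [2022, 2013, 2010, 2015, 2012],
        "surface_totale_bio": [164.85, 6.27, 0.41, 47.17, 96.25],
        "nombre_exploitations_cheptel": [1.0, np.nan, np.nan, 1.0, 1.0],
        "cheptel_total_bio": [48000.0, np.nan, np.nan, 1.0, 130.0],
    }
)

# Share of missing values per column, largest first.
missing_share = toy_df.isna().mean().sort_values(ascending=False)
print(missing_share)

# Number of rows per year, in chronological order.
rows_per_year = toy_df["annee"].value_counts().sort_index()
print(rows_per_year)
```

On the full training set, the same two summaries give a quick view of which columns need imputation and whether some years are under-represented.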


### Challenge evaluation

In this challenge, the evaluation is based on prediction accuracy.

The dataset is split into a training set (80%) and a test set (20%). The split is stratified on the target through the `stratify` parameter, so both sets keep the same class distribution.

The best models are those that achieve the highest accuracy on the test set.
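To illustrate what stratification does, here is a minimal sketch on synthetic data. It assumes the split is done with scikit-learn's `train_test_split`, which the text does not state explicitly; the arrays `X_toy` and `y_toy` and the `random_state` value are made up for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))
# Imbalanced synthetic labels: 70% class 0, 20% class 1, 10% class 2.
y_toy = np.repeat([0, 1, 2], [700, 200, 100])

# 80/20 split that preserves the class proportions of y_toy in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

# Class shares in each part; both should match the 70/20/10 proportions.
train_share = np.bincount(y_train) / len(y_train)
test_share = np.bincount(y_test) / len(y_test)
print(train_share, test_share)
```

Without `stratify`, a rare class could end up over- or under-represented in the test set purely by chance, which would make accuracy less comparable between the two sets.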
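Accuracy is simply the fraction of samples whose predicted class equals the true class. The short sketch below checks this on made-up labels, comparing a manual computation with scikit-learn's `accuracy_score`; the label arrays are illustrative and unrelated to the challenge data.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up true and predicted labels for eight samples.
y_true = np.array([0, 1, 2, 2, 1, 0, 2, 1])
y_pred = np.array([0, 1, 2, 1, 1, 0, 2, 0])

# Fraction of exact matches, computed by hand and with scikit-learn.
manual_accuracy = np.mean(y_true == y_pred)
sklearn_accuracy = accuracy_score(y_true, y_pred)
print(manual_accuracy, sklearn_accuracy)  # 0.75 0.75
```

Because accuracy weights every sample equally, frequent classes dominate the score when the target is imbalanced.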

### The pipeline workflow

The input data are stored in a dataframe. A scikit-learn column transformer can convert the dataframe to a numpy array and select the subset of columns to work with. The starting kit below keeps every column: it standardizes the features with `StandardScaler` and fits an `XGBClassifier`.

In [5]:
# %load submissions/starting_kit/estimator.py

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier

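def get_estimator():
    """Return the starting-kit pipeline: feature scaling followed by XGBoost."""
    pipe = make_pipeline(
        # Standardize every column to zero mean and unit variance.
        StandardScaler(),
        # Gradient-boosted trees; mlogloss is the multiclass log loss.
        XGBClassifier(eval_metric="mlogloss", random_state=42),
    )
    return pipe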

### Testing using a scikit-learn pipeline

In [6]:
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation; returns one accuracy score per fold.
scores = cross_val_score(get_estimator(), X_df, y, cv=5, scoring="accuracy")
print(scores)

[0.80273405 0.80246418 0.80247616 0.80260945 0.80284158]
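The per-fold scores are easier to compare across submissions when summarized as a mean and a standard deviation. The sketch below does this for the five values printed above, copied by hand into an array; in the notebook the `scores` variable returned by `cross_val_score` could be used directly.

```python
import numpy as np

# Fold accuracies copied from the cross-validation output above.
scores = np.array([0.80273405, 0.80246418, 0.80247616, 0.80260945, 0.80284158])

mean_accuracy = scores.mean()
std_accuracy = scores.std()
print(f"accuracy: {mean_accuracy:.4f} +/- {std_accuracy:.4f}")
```

A small spread between folds, as here, suggests that the estimate is stable with respect to the choice of training subset.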


### Submission

To submit your code, refer to the [online documentation](https://paris-saclay-cds.github.io/ramp-docs/ramp-workflow/stable/using_kits.html).