# RAMP on predicting 2022 French presidential elections

## Introduction

The datasets used in this project are all freely accessible through either the French Government open data platform or the INSEE website. For details on how to download them, see the README file. <br>
The main dataset contains, for each city and each polling station, the score of each candidate in the first round of the 2022 French presidential election. <br>
In addition to this data, we also collected the latest INSEE datasets on employment and income statistics, as well as the 2017 election results. Both INSEE datasets come with a documentation file, which can be found in the data/documentation folder or online (see the README for details).

The challenge focuses only on French departments (mainland and overseas). Territories such as Mayotte and overseas polling stations are excluded for practical reasons (missing data, no financial data, etc.).

The goal is to predict the score of each candidate for each French city. Prediction quality is measured with the root mean squared error (RMSE).
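As a rough illustration of the metric, the sketch below computes the RMSE on a made-up matrix of scores (rows are cities, columns are candidates). How the challenge aggregates errors across candidates, and whether scores are expressed as shares or as vote counts, is not stated here, so the function name `rmse` and the toy numbers are assumptions.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error over all entries of two same-shaped arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Toy example: 2 cities x 3 candidates, scores as vote shares (made-up numbers).
y_true = np.array([[0.30, 0.25, 0.45],
                   [0.10, 0.60, 0.30]])
y_pred = np.array([[0.28, 0.27, 0.45],
                   [0.15, 0.55, 0.30]])

score = rmse(y_true, y_pred)
print(f"RMSE: {score:.4f}")
```

Because the errors are squared before averaging, a few badly predicted cities weigh more heavily than many small errors.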

Users are not restricted to the data provided with this challenge, and we encourage looking for other potentially useful datasets.
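One possible way to attach an external dataset is a left join on the commune code. The sketch below uses toy data; the `external` frame and its `median_income` column are hypothetical names for illustration, not part of the provided files.

```python
import pandas as pd

# Toy stand-in for the election data (column names taken from the notebook).
votes = pd.DataFrame({
    "Code de la commune": ["01001", "01002", "97833"],
    "Inscrits": [645, 210, 572],
})

# Hypothetical external dataset; `median_income` is an illustrative name.
external = pd.DataFrame({
    "Code de la commune": ["01001", "01002"],
    "median_income": [24000.0, 22500.0],
})

# Left join keeps every election row; `validate` raises an error if the
# external table has several rows for the same commune.
merged = votes.merge(
    external,
    on="Code de la commune",
    how="left",
    validate="many_to_one",
    indicator=True,
)

# Communes with no match in the external data end up with missing values.
unmatched = merged.loc[merged["_merge"] == "left_only", "Code de la commune"].tolist()
print(unmatched)
```

Keeping commune codes as strings preserves leading zeros such as in 01001, which would otherwise be lost and break the join.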

In [1]:
%matplotlib inline
import os
import importlib
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%load_ext autoreload
%autoreload 2

## Load the dataset using pandas

The training and testing sets are located in the data folder. The training set can be loaded as a pandas DataFrame with the get_train_data helper:

In [17]:
from problem import get_train_data

data_train, labels_train = get_train_data()

  


In [19]:
data_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 836376 entries, 227884 to 99937
Data columns (total 10 columns):
 #   Column                         Non-Null Count   Dtype 
---  ------                         --------------   ----- 
 0   Code du département            836376 non-null  object
 1   Libellé du département         836376 non-null  object
 2   Code de la circonscription     836376 non-null  int64 
 3   Libellé de la circonscription  836376 non-null  object
 4   Code de la commune             836376 non-null  object
 5   Libellé de la commune          836376 non-null  object
 6   Code du b.vote                 836376 non-null  object
 7   Inscrits                       836376 non-null  int64 
 8   Votants                        836376 non-null  int64 
 9   location                       791880 non-null  object
dtypes: int64(3), object(7)
memory usage: 70.2+ MB


Some values are missing for location. Let us take a first look at how large the affected polling stations are.

In [28]:
data_train[data_train['location'].isnull()].drop_duplicates(['Code de la commune']).agg({"Inscrits": ["mean", "sum"]}).round(0)

| | Inscrits |
|---|---|
| mean | 1478.0 |
| sum | 1334892.0 |


These stations account for a non-negligible number of registered voters, so it is worth checking which departments they belong to.

In [33]:
data_train[data_train['location'].isnull()].drop_duplicates(['Code de la commune'])['Libellé du département'].value_counts().head(10)

Français établis hors de France    152
Hérault                            149
Saône-et-Loire                      53
Creuse                              52
Polynésie française                 48
Haute-Corse                         39
Nouvelle-Calédonie                  32
Aude                                27
Alpes-de-Haute-Provence             26
Meuse                               23
Name: Libellé du département, dtype: int64
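The counts above are absolute, so departments with many communes can dominate the ranking. A possible complement is the share of communes with a missing location per department. The sketch below runs on a small made-up frame that only reuses the column names from the notebook; the values are not real.

```python
import pandas as pd

# Toy data reusing the notebook's column names (values are made up).
toy = pd.DataFrame({
    "Libellé du département": ["Ain", "Ain", "Hérault", "Hérault", "Hérault"],
    "Code de la commune": ["01001", "01002", "34001", "34002", "34003"],
    "location": ["46.14943, 4.924647", "46.0, 5.0", None, None, "43.6, 3.9"],
})

# One row per commune, as in the cell above.
communes = toy.drop_duplicates("Code de la commune")

# The mean of a boolean mask is the fraction of True values per group.
missing_share = (
    communes["location"].isna()
    .groupby(communes["Libellé du département"])
    .mean()
    .sort_values(ascending=False)
)
print(missing_share)
```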

In [30]:
data_train

| | Code du département | Libellé du département | Code de la circonscription | Libellé de la circonscription | Code de la commune | Libellé de la commune | Code du b.vote | Inscrits | Votants | location |
|---|---|---|---|---|---|---|---|---|---|---|
| 227884 | 01 | Ain | 4 | 4ème circonscription | 01001 | L'Abergement-Clémenciat | 1 | 645 | 537 | 46.14943, 4.924647 |
| 633784 | 01 | Ain | 4 | 4ème circonscription | 01001 | L'Abergement-Clémenciat | 1 | 645 | 537 | 46.14943, 4.924647 |
| 259862 | 01 | Ain | 4 | 4ème circonscription | 01001 | L'Abergement-Clémenciat | 1 | 645 | 537 | 46.14943, 4.924647 |
| 259861 | 01 | Ain | 4 | 4ème circonscription | 01001 | L'Abergement-Clémenciat | 1 | 645 | 537 | 46.14943, 4.924647 |
| 548851 | 01 | Ain | 4 | 4ème circonscription | 01001 | L'Abergement-Clémenciat | 1.0 | 645 | 537 | 46.14943, 4.924647 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 783606 | 97 | Nouvelle-Calédonie | 2 | 2ème circonscription | 97833 | Kouaoua | 1 | 572 | 104 | |
| 783605 | 97 | Nouvelle-Calédonie | 2 | 2ème circonscription | 97833 | Kouaoua | 1 | 572 | 104 | |
| 13748 | 97 | Nouvelle-Calédonie | 2 | 2ème circonscription | 97833 | Kouaoua | 1 | 572 | 104 | |
| 125161 | 97 | Nouvelle-Calédonie | 2 | 2ème circonscription | 97833 | Kouaoua | 2 | 285 | 46 | |
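The location column holds coordinates as a single string such as "46.14943, 4.924647", which most models cannot use directly. A minimal sketch of splitting it into two numeric columns follows, assuming the order is latitude then longitude (consistent with the Ain rows, but not documented here). The `lat` and `lon` column names are illustrative.

```python
import pandas as pd

# Toy frame mimicking the `location` column, including a missing value.
toy = pd.DataFrame({"location": ["46.14943, 4.924647", None, "43.6, 3.9"]})

# Split on the comma into two string columns; missing rows stay missing.
coords = toy["location"].str.split(",", expand=True)

# Assumed order: latitude first, longitude second.
toy["lat"] = pd.to_numeric(coords[0].str.strip(), errors="coerce")
toy["lon"] = pd.to_numeric(coords[1].str.strip(), errors="coerce")
print(toy)
```

Rows without a location keep NaN in both new columns, so they can still be imputed or dropped later.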
