# NoteBook To create all sorts of DataSets

Save all new datasets in `data/new datasets/`

All datasets should have their own subFolder with files 

`x_train` columns: All variable names with index 0-XXX in numbers (no arrays)

`y_train` column: y column of 0/1 boolean values

`x_pred` same as x_train

### IMPORTS

In [1]:
using CSV, DataFrames, Statistics, Dates, Gadfly, LinearAlgebra, Distributions, Random, ScikitLearn, GLM

┌ Info: Loading DataFrames support into Gadfly.jl
└ @ Gadfly /home/williamglazer/.julia/packages/Gadfly/09PWZ/src/mapping.jl:228


# Chargement des données et nettoyage préliminaire

## Chargement des surverses

In [2]:
data = CSV.read("./data/surverses.csv", missingstring="-99999")
first(data,5)

Unnamed: 0_level_0,NO_OUVRAGE,DATE,SURVERSE,RAISON
Unnamed: 0_level_1,String,Date,Int64⍰,String⍰
1,0642-01D,2013-05-01,0,missing
2,0642-01D,2013-05-02,0,missing
3,0642-01D,2013-05-03,0,missing
4,0642-01D,2013-05-04,0,missing
5,0642-01D,2013-05-05,0,missing


## Nettoyage des données sur les surverses

#### Extraction des surverses pour les mois de mai à octobre inclusivement

In [3]:
data = filter(row -> month(row.DATE) > 4, data) 
data = filter(row -> month(row.DATE) < 11, data) 
first(data,5)

Unnamed: 0_level_0,NO_OUVRAGE,DATE,SURVERSE,RAISON
Unnamed: 0_level_1,String,Date,Int64⍰,String⍰
1,0642-01D,2013-05-01,0,missing
2,0642-01D,2013-05-02,0,missing
3,0642-01D,2013-05-03,0,missing
4,0642-01D,2013-05-04,0,missing
5,0642-01D,2013-05-05,0,missing


#### Remplacement des valeurs *missing* dans la colonne :RAISON par "Inconnue"

In [4]:
raison = coalesce.(data[:,:RAISON],"Inconnue")
data[!,:RAISON] = raison
first(data,5)

Unnamed: 0_level_0,NO_OUVRAGE,DATE,SURVERSE,RAISON
Unnamed: 0_level_1,String,Date,Int64⍰,String
1,0642-01D,2013-05-01,0,Inconnue
2,0642-01D,2013-05-02,0,Inconnue
3,0642-01D,2013-05-03,0,Inconnue
4,0642-01D,2013-05-04,0,Inconnue
5,0642-01D,2013-05-05,0,Inconnue


#### Exlusion des surverses coccasionnées par d'autres facteurs que les précipitations liquides

Ces facteurs correspondent à : 
- la fonte de neige (F), 
- les travaux planifiés et entretien (TPL)
- urgence (U)
- autre (AUT)

In [5]:
data = filter(row -> row.RAISON ∈ ["P","Inconnue","TS"], data) 
select!(data, [:NO_OUVRAGE, :DATE, :SURVERSE])
first(data,5)

Unnamed: 0_level_0,NO_OUVRAGE,DATE,SURVERSE
Unnamed: 0_level_1,String,Date,Int64⍰
1,0642-01D,2013-05-01,0
2,0642-01D,2013-05-02,0
3,0642-01D,2013-05-03,0
4,0642-01D,2013-05-04,0
5,0642-01D,2013-05-05,0


#### Exclusion des lignes où :SURVERSE est manquante

In [6]:
surverse_df = dropmissing(data, disallowmissing=true)
first(surverse_df,5)

Unnamed: 0_level_0,NO_OUVRAGE,DATE,SURVERSE
Unnamed: 0_level_1,String,Date,Int64
1,0642-01D,2013-05-01,0
2,0642-01D,2013-05-02,0
3,0642-01D,2013-05-03,0
4,0642-01D,2013-05-04,0
5,0642-01D,2013-05-05,0


In [7]:
CSV.write("./data/new datasets/surverse list.csv", surverse_df)

"./data/new datasets/surverse list.csv"

## Chargement des précipitations

In [8]:
data = CSV.read("data/precipitations.csv",missingstring="-99999")
rename!(data, Symbol("St-Hubert")=>:StHubert)
first(data,5)

Unnamed: 0_level_0,date,heure,McTavish,Bellevue,Assomption,Trudeau,StHubert
Unnamed: 0_level_1,Date,Int64,Int64⍰,Int64⍰,Int64⍰,Int64⍰,Int64⍰
1,2013-01-01,0,0,0,0,0,missing
2,2013-01-01,1,0,0,0,0,missing
3,2013-01-01,2,0,0,0,0,missing
4,2013-01-01,3,0,0,0,0,missing
5,2013-01-01,4,0,0,0,0,missing


## Nettoyage des données sur les précipitations

#### Extraction des précipitations des mois de mai à octobre inclusivement

In [9]:
data = filter(row -> month(row.date) > 4, data) 
data = filter(row -> month(row.date) < 11, data) 
first(data,5)

Unnamed: 0_level_0,date,heure,McTavish,Bellevue,Assomption,Trudeau,StHubert
Unnamed: 0_level_1,Date,Int64,Int64⍰,Int64⍰,Int64⍰,Int64⍰,Int64⍰
1,2013-05-01,0,0,0,0,0,missing
2,2013-05-01,1,0,0,0,0,missing
3,2013-05-01,2,0,0,0,0,missing
4,2013-05-01,3,0,0,0,0,missing
5,2013-05-01,4,0,0,0,0,missing


 ### Remplissage des données manquantes
Nous allons tenter de remplir les données manquantes par des moyennes de précipitations lorsque les données sont inconnues pour 2 stations ou plus

# LIST OF TECHNIQUES TO FILL MISSING DATA

- use ridge regression to fill missing values (*TODO*)
- use mean of line to fill values

## Filling precipitation rows by doing Mean of Stations per Hour

In [10]:
include("datasets/countMissing.jl")
include("datasets/meanLine.jl")
include("datasets/replaceMissing.jl")
precipitation_df = data[:,Not(:date)][:,Not(:heure)]
for row in eachrow(precipitation_df)
    nbMissing, ind = countMissing(row)
    if(nbMissing<5)
        replaceMissing(row,round(meanLine(row)))
    end
end
precipitation_df.heure = data[:,:heure]
precipitation_df.date = data[:,:date]
CSV.write("data/new datasets/precipitaion_filed_mean_per_hour.csv",precipitation_df)
first(precipitation_df,10)

Unnamed: 0_level_0,McTavish,Bellevue,Assomption,Trudeau,StHubert,heure,date
Unnamed: 0_level_1,Int64⍰,Int64⍰,Int64⍰,Int64⍰,Int64⍰,Int64,Date
1,0,0,0,0,0,0,2013-05-01
2,0,0,0,0,0,1,2013-05-01
3,0,0,0,0,0,2,2013-05-01
4,0,0,0,0,0,3,2013-05-01
5,0,0,0,0,0,4,2013-05-01
6,0,0,0,0,0,5,2013-05-01
7,0,0,0,0,0,6,2013-05-01
8,0,0,0,0,0,7,2013-05-01
9,0,0,0,0,0,8,2013-05-01
10,0,0,0,0,0,9,2013-05-01


# Daily sum as 5 Explicative Variables

In [11]:
precipitation_daily_sum = by(precipitation_df, :date,  McTavish = :McTavish=>sum, Bellevue = :Bellevue=>sum, 
   Assomption = :Assomption=>sum, Trudeau = :Trudeau=>sum, StHubert = :StHubert=>sum)
first(precipitation_df ,10)

Unnamed: 0_level_0,McTavish,Bellevue,Assomption,Trudeau,StHubert,heure,date
Unnamed: 0_level_1,Int64⍰,Int64⍰,Int64⍰,Int64⍰,Int64⍰,Int64,Date
1,0,0,0,0,0,0,2013-05-01
2,0,0,0,0,0,1,2013-05-01
3,0,0,0,0,0,2,2013-05-01
4,0,0,0,0,0,3,2013-05-01
5,0,0,0,0,0,4,2013-05-01
6,0,0,0,0,0,5,2013-05-01
7,0,0,0,0,0,6,2013-05-01
8,0,0,0,0,0,7,2013-05-01
9,0,0,0,0,0,8,2013-05-01
10,0,0,0,0,0,9,2013-05-01


filter out 2019 year fore prediction

In [12]:
precipitation_daily_sum_train = filter(row -> row[:date] != Year(2019), precipitation_daily_sum)
precipitation_daily_sum_pred  = filter(row -> row[:date] == Year(2019), precipitation_daily_sum)

Unnamed: 0_level_0,date,McTavish,Bellevue,Assomption,Trudeau,StHubert
Unnamed: 0_level_1,Date,Int64⍰,Int64⍰,Int64⍰,Int64⍰,Int64⍰


send to csv

In [13]:
CSV.write("data/new datasets/precipitation_daily_sum/x_train.csv", precipitation_daily_sum_train[:,Not(:date)])
CSV.write("data/new datasets/precipitation_daily_sum/y_train.csv", surverse_df)
CSV.write("data/new datasets/precipitation_daily_sum/x_pred.csv", precipitation_daily_sum_train[:,Not(:date)])

"data/new datasets/precipitation_daily_sum/x_pred.csv"

# Daily Maximum as 5 Explicative Variables

#### Extraction du taux horaire journalier maximum des précipitations pour chacune des stations météorologiques

In [14]:
precipitation_daily_max = by(precipitation_df, :date,  McTavish = :McTavish=>maximum, Bellevue = :Bellevue=>maximum, 
   Assomption = :Assomption=>maximum, Trudeau = :Trudeau=>maximum, StHubert = :StHubert=>maximum)
first(precipitation_daily_max,10)

Unnamed: 0_level_0,date,McTavish,Bellevue,Assomption,Trudeau,StHubert
Unnamed: 0_level_1,Date,Int64⍰,Int64⍰,Int64⍰,Int64⍰,Int64⍰
1,2013-05-01,0,0,0,0,0
2,2013-05-02,0,0,0,0,0
3,2013-05-03,0,0,0,0,0
4,2013-05-04,0,0,0,0,0
5,2013-05-05,0,0,0,0,0
6,2013-05-06,0,0,0,0,0
7,2013-05-07,0,0,0,0,0
8,2013-05-08,0,0,0,0,0
9,2013-05-09,10,0,19,0,5
10,2013-05-10,0,4,20,0,5


filter out 2019 year fore prediction

In [15]:
precipitation_daily_max_train = filter(row -> row[:date] != Year(2019), precipitation_daily_max)
precipitation_daily_max_pred  = filter(row -> row[:date] == Year(2019), precipitation_daily_max)

Unnamed: 0_level_0,date,McTavish,Bellevue,Assomption,Trudeau,StHubert
Unnamed: 0_level_1,Date,Int64⍰,Int64⍰,Int64⍰,Int64⍰,Int64⍰


send to csv

In [16]:
CSV.write("./data/new datasets/precipitation_daily_max/x_train.csv", precipitation_daily_max_train[:,Not(:date)])
CSV.write("./data/new datasets/precipitation_daily_max/y_train.csv", surverse_df)
CSV.write("./data/new datasets/precipitation_daily_max/x_pred.csv", precipitation_daily_max_train[:,Not(:date)])

"./data/new datasets/precipitation_daily_max/x_pred.csv"