# Project log

In this notebook you will find all the steps we took to accurately predict the temperature of a nuclear waste canister.

### **Imports** 

In [70]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.impute import KNNImputer

### **Loading of the datasets** 

In [71]:
coordinates_test = pd.read_csv("data/Coordinates_Test.csv")
coordinates_train = pd.read_csv("data/Coordinates_Training.csv")
humidity_test = pd.read_csv("data/Test_Time_humidity.csv")
humidity_train = pd.read_csv("data/Training_data_humidity.csv")
pressure_test = pd.read_csv("data/Test_Time_pressure.csv")
pressure_train = pd.read_csv("data/Training_data_pressure.csv")
temperature_train = pd.read_csv("data/Training_data_temperature.csv")

### **Visualizing the datasets**

In [72]:
display(coordinates_train.sample(10))
coordinates_train.info()
coordinates_train["Material"].unique()

Unnamed: 0.1,Unnamed: 0,Sensor ID,Index,Material,Coor X [m],Coor Y [m],Coor Z [m],R [m]
155,155,N_156,156,OPA,-17.655549,6.763699,9.051354,19.8405
898,898,N_899,899,CAN,-0.484575,28.63374,-0.182562,0.517824
776,776,N_777,777,OPA,3.035954,34.792792,-1.023535,3.203848
451,451,N_452,452,OPA,9.551454,36.549077,-9.087492,13.183808
419,419,N_420,420,OPA,12.552277,4.26346,11.584986,17.081322
833,833,N_834,834,OPA,1.051175,32.005121,-2.513081,2.724067
164,164,N_165,165,OPA,-16.977145,48.040845,-4.561215,17.579196
769,769,N_770,770,OPA,-0.817869,16.239954,2.779201,2.897045
61,61,N_62,62,OPA,3.29853,16.785288,10.970838,11.455984
739,739,N_740,740,OPA,1.85323,19.289364,0.817955,2.025713


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 900 entries, 0 to 899
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  900 non-null    int64  
 1   Sensor ID   900 non-null    object 
 2   Index       900 non-null    int64  
 3   Material    900 non-null    object 
 4   Coor X [m]  900 non-null    float64
 5   Coor Y [m]  900 non-null    float64
 6   Coor Z [m]  900 non-null    float64
 7   R [m]       900 non-null    float64
dtypes: float64(4), int64(2), object(2)
memory usage: 56.4+ KB


array(['OPA', 'SHCR', 'GBM', 'EDZ', 'VOID', 'CAN', 'BBLOCK'], dtype=object)

There is no missing data and apparently no false measurements (outliers) in the positions, but some columns are useless: the row numbering, the Sensor ID, and the Index. Because the rows are in ascending order, the sensor's name and number don't matter as long as the indices match between the files. We also rename the columns to make them easier to type later.

We will also one-hot encode the Material column so that it can be fed to the model later, since it is a categorical feature.

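**Open question: do we have missing data in the Material column? If so, we might not be able to use KNN imputation.** The `info()` output above reports 900 non-null values for `Material` in the training set; the test set still needs to be checked.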

In [73]:
coordinates_train = coordinates_train[["Material", "Coor X [m]", "Coor Y [m]", "Coor Z [m]", "R [m]"]].copy()
coordinates_test = coordinates_test[["Material", "Coor X [m]", "Coor Y [m]", "Coor Z [m]", "R [m]"]].copy()

# changing the column names for faster typing later
new_col_names: dict = {
    "Coor X [m]": "x",
    "Coor Y [m]": "y",
    "Coor Z [m]": "z",
    "R [m]": "r"
}
coordinates_train.rename(columns = new_col_names, inplace=True)
coordinates_test.rename(columns = new_col_names, inplace=True)

material_mapping: dict = {
    'OPA': 0,
    'SHCR': 1,
    'GBM': 2,
    'EDZ': 3,
    'VOID': 4,
    'CAN': 5,
    'BBLOCK': 6
}

## cf intro to pandas notebook from the weekly exercises
coordinates_test = pd.get_dummies(coordinates_test) 
coordinates_train = pd.get_dummies(coordinates_train)
display(coordinates_train.head(5))

Unnamed: 0,x,y,z,r,Material_BBLOCK,Material_CAN,Material_EDZ,Material_GBM,Material_OPA,Material_SHCR,Material_VOID
0,0.208042,14.436936,-2.875503,2.883019,False,False,False,False,True,False,False
1,-8.970832,28.229841,-0.134437,8.971839,False,False,False,False,True,False,False
2,-14.289501,6.685726,-10.399048,17.672862,False,False,False,False,True,False,False
3,6.114855,2.685645,-3.189981,6.896914,False,False,False,False,True,False,False
4,4.048845,48.70859,11.260503,11.966289,False,False,False,False,True,False,False
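
One caveat with calling `pd.get_dummies` separately on each file is that the two results only share the same columns if every material appears in both files. A minimal sketch of one way to guard against that is to reindex the test columns on the training ones. It runs on toy data rather than the real files: the frames `train` and `test` and their values are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the coordinate tables (hypothetical values).
train = pd.DataFrame({"Material": ["OPA", "CAN", "GBM"], "r": [19.8, 0.5, 3.2]})
test = pd.DataFrame({"Material": ["OPA", "OPA"], "r": [2.7, 11.4]})

train_encoded = pd.get_dummies(train)
test_encoded = pd.get_dummies(test)

# Give the test set exactly the training columns, in the same order.
# Materials absent from the test set become all-False columns.
test_encoded = test_encoded.reindex(columns=train_encoded.columns, fill_value=False)

print(list(test_encoded.columns))
```

Note that with this approach a material present only in the test set would be dropped silently, so it is worth comparing the two sets of unique values first.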


In [74]:

mean_humidity = humidity_train.mean(axis=0).iloc[1:] ## not keeping the time for the mean
mean_humidity.dropna(inplace=True)
print(f"Global mean: {np.mean(mean_humidity, axis=0)}, standard deviation: {np.std(mean_humidity, axis=0)}")

## starting the time at 0, as explained further below:
humidity_test["M.Time[d]"] = (humidity_test["M.Time[d]"] - 1554).astype('int32')
humidity_train["M.Time[d]"] = (humidity_train["M.Time[d]"] - 1554).astype('int32')

display(humidity_train.iloc[:,0:13])

Global mean: 98.19472180013585, standard deviation: 5.56673083885869


Unnamed: 0,M.Time[d],N_1,N_2,N_3,N_4,N_5,N_6,N_7,N_8,N_9,N_10,N_11,N_12
0,0,100.0,100,100,100,100,100,100,100,100,100,100,100
1,2,100.0,100,100,100,100,100,100,100,100,100,100,100
2,4,100.0,100,100,100,100,100,100,100,100,100,100,100
3,6,100.0,100,100,100,100,100,100,100,100,100,100,100
4,9,100.0,100,100,100,100,100,100,100,100,100,100,100
5,13,100.0,100,100,100,100,100,100,100,100,100,100,100
6,18,100.0,100,100,100,100,100,100,100,100,100,100,100
7,24,100.0,100,100,100,100,100,100,100,100,100,100,100
8,31,100.0,100,100,100,100,100,100,100,100,100,100,100
9,41,100.0,100,100,100,100,100,100,100,100,100,100,100


The humidity seems to carry little information: the sensors displayed above stay at 100 and the global mean is about 98.2. We will therefore discard it for our first model.
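
One way to back up this impression with numbers is to look at how much each sensor varies over time. The sketch below does this on a small made-up frame: the values and the three sensor columns are hypothetical, and only the `M.Time[d]` column name comes from the data above.

```python
import pandas as pd

# Toy humidity table: one time column, three sensors (made-up values).
humidity = pd.DataFrame({
    "M.Time[d]": [0, 2, 4, 6],
    "N_1": [100.0, 100.0, 100.0, 100.0],
    "N_2": [100.0, 100.0, 100.0, 100.0],
    "N_3": [80.0, 82.0, 85.0, 90.0],
})

sensors = humidity.drop(columns="M.Time[d]")

# Standard deviation over time for each sensor; 0 means the sensor never changes.
std_per_sensor = sensors.std(axis=0)
constant_fraction = (std_per_sensor == 0).mean()

print(f"{constant_fraction:.0%} of the sensors are constant over time")
```

A high fraction of constant sensors on the real table would support dropping the humidity, while a sizeable group of varying sensors might be worth a second look.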

In [75]:
display(pressure_train.head(10))
pressure_train.isnull().sum().sum() # See how many missing values there are

Unnamed: 0,M.Time[d],N_1,N_2,N_3,N_4,N_5,N_6,N_7,N_8,N_9,...,N_891,N_892,N_893,N_894,N_895,N_896,N_897,N_898,N_899,N_900
0,1554,281.143955,1462.382827,1656.041967,1322.584393,1473.656845,1480.325846,1650.794639,1511.445756,1397.850463,...,-57423.47996,202.296319,395.883971,651.026983,-405.358355,-60428.6899,-768.38951,336.655683,-63100.53626,831.154054
1,1556,279.9688,1461.806069,1656.00744,1322.251688,1473.611215,1480.290558,1650.692513,1511.327468,1397.594172,...,-57398.67576,201.782481,395.55376,649.949956,-405.782281,-60422.76359,-769.736168,335.53192,-63004.35183,830.963631
2,1558,278.79731,1461.224531,1655.972023,1321.91935,1473.56518,1480.255784,1650.589484,1511.207764,1397.33862,...,-57373.82834,201.4704,395.458253,648.902434,-405.531701,-60416.87545,-771.05471,334.539419,-62924.41589,830.870655
3,1560,277.631157,1460.639147,1655.93575,1321.587287,1473.518663,1480.221507,1650.485576,1511.086587,1397.083769,...,-57348.80938,201.459983,395.687942,647.909216,-404.343841,-60411.01323,-772.329622,333.773806,-62855.8855,830.915078
4,1563,275.898544,1459.753528,1655.879443,1321.089332,1473.447601,1480.170896,1650.327726,1510.901301,1396.702138,...,-57310.61967,202.456185,397.004063,646.629125,-400.066605,-60402.23717,-774.086664,333.414898,-62770.54127,831.441631
5,1567,273.638131,1458.561177,1655.80089,1320.424636,1473.350137,1480.104114,1650.113538,1510.647552,1396.19224,...,-57258.35772,206.006803,400.775274,645.494662,-390.148591,-60390.48679,-776.043415,334.725681,-62678.04907,833.235986
6,1572,270.952332,1457.058588,1655.697105,1319.591329,1473.223592,1480.020813,1649.83973,1510.319456,1395.550471,...,-57191.21953,213.937626,408.653531,645.27185,-372.715269,-60375.63042,-777.774434,339.314345,-62585.29747,837.420404
7,1578,268.052957,1455.254033,1655.564414,1318.58674,1473.064656,1479.920316,1649.50224,1509.909966,1394.771852,...,-57108.95397,227.660269,422.065061,647.016759,-347.410031,-60357.50067,-778.805126,348.607103,-62496.78004,845.165059
8,1585,265.274994,1453.181008,1655.398671,1317.40782,1472.869729,1479.801887,1649.09648,1509.411524,1393.851404,...,-57012.03405,247.67348,441.757632,651.895122,-315.44352,-60335.95127,-778.741243,363.366767,-62415.09885,857.330082
9,1595,262.778601,1450.403277,1655.140854,1315.709849,1472.573579,1479.63029,1648.493616,1508.661613,1392.516777,...,-56874.1784,280.108536,474.431535,664.079941,-270.929194,-60304.55821,-776.751639,388.901327,-62326.9024,878.947658


96

There are 96 missing values in the pressure data. We don't want to delete the affected rows or columns, as that would remove quite a lot of data, so we impute the missing values instead. We could use constant or mean imputation, but we will use KNN imputation with 2 neighbors, as it is more likely to preserve continuity in time.

We'll also start the time at 0 days because it is convenient.

In [76]:
imputer = KNNImputer(missing_values = np.nan, n_neighbors = 2).set_output(transform="pandas")
pressure_train = imputer.fit_transform(pressure_train)
pressure_test = imputer.fit_transform(pressure_test)
pressure_train.isnull().sum().sum() 

pressure_test["M.Time[d]"] = (pressure_test["M.Time[d]"] - 1554).astype('int32')
pressure_train["M.Time[d]"] = (pressure_train["M.Time[d]"] - 1554).astype('int32')

display(pressure_train.head(5))

Unnamed: 0,M.Time[d],N_1,N_2,N_3,N_4,N_5,N_6,N_7,N_8,N_9,...,N_891,N_892,N_893,N_894,N_895,N_896,N_897,N_898,N_899,N_900
0,0,281.143955,1462.382827,1656.041967,1322.584393,1473.656845,1480.325846,1650.794639,1511.445756,1397.850463,...,-57423.47996,202.296319,395.883971,651.026983,-405.358355,-60428.6899,-768.38951,336.655683,-63100.53626,831.154054
1,2,279.9688,1461.806069,1656.00744,1322.251688,1473.611215,1480.290558,1650.692513,1511.327468,1397.594172,...,-57398.67576,201.782481,395.55376,649.949956,-405.782281,-60422.76359,-769.736168,335.53192,-63004.35183,830.963631
2,4,278.79731,1461.224531,1655.972023,1321.91935,1473.56518,1480.255784,1650.589484,1511.207764,1397.33862,...,-57373.82834,201.4704,395.458253,648.902434,-405.531701,-60416.87545,-771.05471,334.539419,-62924.41589,830.870655
3,6,277.631157,1460.639147,1655.93575,1321.587287,1473.518663,1480.221507,1650.485576,1511.086587,1397.083769,...,-57348.80938,201.459983,395.687942,647.909216,-404.343841,-60411.01323,-772.329622,333.773806,-62855.8855,830.915078
4,9,275.898544,1459.753528,1655.879443,1321.089332,1473.447601,1480.170896,1650.327726,1510.901301,1396.702138,...,-57310.61967,202.456185,397.004063,646.629125,-400.066605,-60402.23717,-774.086664,333.414898,-62770.54127,831.441631
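
To illustrate what the KNN imputation does, here is a minimal sketch on a toy table with a single gap. The values are made up and only one sensor column is used, so the nearest rows are decided by time alone.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy pressure table with one gap (made-up values).
toy = pd.DataFrame({
    "M.Time[d]": [0.0, 2.0, 4.0, 6.0],
    "N_1": [10.0, 12.0, np.nan, 16.0],
})

imputer = KNNImputer(missing_values=np.nan, n_neighbors=2)
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

# The rows closest in time to day 4 are days 2 and 6, so the gap
# is filled with the mean of their values.
print(filled.loc[2, "N_1"])
```

On the real table, `KNNImputer` compares rows on every column they share, not only on time, so the two nearest rows are the time steps whose readings are most similar overall. For smoothly varying data these are usually adjacent time steps, but that is not guaranteed.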


In [77]:
display(temperature_train.head(10))

Unnamed: 0,M.Time[d],N_1,N_2,N_3,N_4,N_5,N_6,N_7,N_8,N_9,...,N_891,N_892,N_893,N_894,N_895,N_896,N_897,N_898,N_899,N_900
0,1554,17.623059,17.15422,17.641578,17.455701,,16.604935,17.662407,16.503001,16.943823,...,17.503931,17.225297,17.498277,17.268529,17.573474,17.412215,17.526257,17.36494,24.026562,17.538194
1,1556,17.62086,17.154263,17.641672,17.45585,16.415312,16.605042,17.662519,16.503121,16.943985,...,17.510776,17.22329,17.498581,17.267488,17.578925,17.409841,17.52286,17.363663,33.729552,17.53746
2,1558,17.618608,17.154303,17.641766,17.455998,16.415377,16.605148,17.662632,16.50324,16.944146,...,17.534085,17.223733,17.501874,2872.837827,17.599256,17.407913,17.520157,17.36385,41.602481,17.537433
3,1560,17.616334,2717.706176,17.641859,17.456146,16.41544,16.605254,17.662744,16.503357,16.944307,...,17.58161,17.228355,17.50967,17.266326,17.640317,17.40677,17.51875,17.366504,48.21898,17.538652
4,1563,17.612991,17.154388,17.642,17.456367,16.415531,16.605414,17.662912,16.50353,16.944544,...,17.723547,17.249726,17.535358,17.267759,17.757592,17.408069,17.521699,17.379102,56.258743,17.545154
5,1567,17.609008,17.154454,17.642187,17.456661,16.415646,16.605626,17.663136,16.503753,16.944855,...,2766.426947,17.310388,17.598978,17.275449,18.010005,17.417672,17.538472,17.415806,64.775395,17.565561
6,1572,17.605614,17.154568,17.64242,17.457028,16.415779,16.60589,17.663415,16.504023,16.945233,...,18.634191,17.438013,17.72479,17.297646,18.444537,17.444923,17.583032,17.495517,73.147489,17.612783
7,1578,17.605318,17.154841,17.6427,17.457465,16.415925,16.606206,17.663749,16.504334,16.94567,...,19.53027,17.656072,17.934088,17.346024,19.076916,17.501008,17.670392,17.636533,80.996009,17.701098
8,1585,17.612025,17.15553,17.643025,17.457972,16.416077,16.606574,17.664138,16.504678,16.946154,...,20.726138,17.976554,18.239215,17.433777,19.889638,17.596132,17.812325,17.851145,88.108493,17.842181
9,1595,17.638763,17.157931,17.643489,17.458689,16.416256,16.607098,17.664689,16.505137,16.946813,...,22.524169,18.511739,18.750719,17.618164,21.073668,17.779369,18.073014,18.223945,95.570527,18.100318


Here, however, we notice sudden spikes in the temperature, going from about 20 to more than 2,000 (the mineral would most likely melt at that temperature) and back down to about 20 a few days later. We can consider those values to be measurement errors. We mark every value at or above a threshold of 500 as missing and then replace it by the mean of the 2 nearest neighbors, as done for the pressure (the box plot in the next cell suggests this threshold should work well). Strictly speaking, this is outlier masking followed by imputation rather than value clipping, which would cap the values at the threshold instead.

In [78]:
# We need to be careful not to take into account the first column, which holds the high time values
columns = temperature_train.columns.difference(["M.Time[d]"])

## Commented out as it takes 15 s to run
# sns.boxplot(temperature_train.iloc[:,1:])

temperature_train[columns] = temperature_train[columns].mask(temperature_train[columns] >= 500, np.nan)
# temperature_train.iloc[:, :10]

## We impute the missing data with the 2 nearest neighbors as done with the pressure
temperature_train = imputer.fit_transform(temperature_train)
temperature_train.isnull().sum().sum() # check that it worked

## Start the time at 0: 
temperature_train["M.Time[d]"] = (temperature_train["M.Time[d]"] - 1554).astype('int32')
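
The masking step can be wrapped in a small helper that also makes it easy to check how many values were flagged before they are imputed. This is only a sketch: the function name `mask_outliers` and the toy values are assumptions made for illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd

def mask_outliers(df, threshold, time_col="M.Time[d]"):
    """Return a copy of df where sensor values >= threshold are set to NaN."""
    masked = df.copy()
    # Leave the time column alone: its values are legitimately large.
    sensor_cols = masked.columns.difference([time_col])
    masked[sensor_cols] = masked[sensor_cols].mask(masked[sensor_cols] >= threshold)
    return masked

# Toy temperature table with two spikes (made-up values).
toy = pd.DataFrame({
    "M.Time[d]": [1554, 1556, 1558, 1560],
    "N_1": [17.15, 2717.71, 17.15, 17.16],
    "N_2": [16.41, 16.42, 16.42, 2872.84],
})

masked = mask_outliers(toy, threshold=500)
n_masked = int(masked.isnull().sum().sum())
print(f"{n_masked} values masked")
```

Comparing the count before and after masking on the real table shows how many values came from the spikes rather than from gaps that were already present.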

### **Feature engineering**

#### **Data normalization**

Data normalization is essential for faster convergence of the descent methods and for an appropriate penalization of the weights. We will use z-score scaling, because min-max scaling would be stretched by the outliers and would compress most of the values into a narrow range. We will then make sure the data is not too heavy-tailed, so that the scaling works best.

In [None]:
## Mean and standard deviation are always computed on the training data
## axis=None computes the mean over the entire DataFrame with our current version of pandas (2.0.3),
## but it doesn't work with std, so we go through the underlying NumPy array instead:
#  https://stackoverflow.com/questions/25140998/pandas-compute-mean-or-std-standard-deviation-over-entire-dataframe
temperature_mean = temperature_train.iloc[:,1:].mean(axis=None) 
temperature_std = temperature_train.iloc[:,1:].values.std()
print(temperature_mean)
print(temperature_std)

24.636442673404822
15.459417098229599
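
With these two statistics, the z-score scaling subtracts the training mean from each value and divides by the training standard deviation. A minimal sketch on toy data follows: the frames, their values, and the helper `standardize` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy temperature tables without the time column (made-up values).
train = pd.DataFrame({"N_1": [17.0, 18.0, 19.0], "N_2": [20.0, 30.0, 40.0]})
test = pd.DataFrame({"N_1": [18.5], "N_2": [35.0]})

# Global statistics, computed on the training data only.
mean = train.values.mean()
std = train.values.std()

def standardize(df, mean, std):
    """Apply z-score scaling with precomputed statistics."""
    return (df - mean) / std

train_scaled = standardize(train, mean, std)
test_scaled = standardize(test, mean, std)  # reuses the training statistics

print(train_scaled.values.mean(), train_scaled.values.std())
```

Predictions made in the scaled space can be mapped back to temperatures with `z * std + mean`.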


**Model:** recurrent neural network? (cf. lecture 7.3)