# Project log

In this notebook you will find all the steps we took to accurately predict the temperature of a nuclear waste canister.

### **Imports** 

In [31]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.impute import KNNImputer

### **Loading of the datasets** 

In [32]:
coordinates_test = pd.read_csv("data/Coordinates_Test.csv")
coordinates_train = pd.read_csv("data/Coordinates_Training.csv")
humidity_test = pd.read_csv("data/Test_Time_humidity.csv")
pressure_test = pd.read_csv("data/Test_Time_pressure.csv")
humidity_train = pd.read_csv("data/Training_data_humidity.csv")
pressure_train = pd.read_csv("data/Training_data_pressure.csv")
temperature_train = pd.read_csv("data/Training_data_temperature.csv")

### **Visualizing the datasets**

In [33]:
display(coordinates_train.sample(10))
coordinates_train.info()
coordinates_train["Material"].unique()

Unnamed: 0.1,Unnamed: 0,Sensor ID,Index,Material,Coor X [m],Coor Y [m],Coor Z [m],R [m]
637,637,N_638,638,OPA,2.863192,32.174717,0.01766,2.863246
453,453,N_454,454,OPA,-17.994576,3.978569,6.979155,19.300606
861,861,N_862,862,OPA,-3.483447,34.895662,0.1844,3.488324
227,227,N_228,228,OPA,18.372444,22.069052,-6.224668,19.398278
868,868,N_869,869,OPA,-1.779008,28.141155,0.312831,1.806303
441,441,N_442,442,VOID,-0.458902,6.606527,-0.309136,0.553314
869,869,N_870,870,OPA,-1.43523,28.970785,3.256784,3.559007
807,807,N_808,808,OPA,3.56259,30.423877,-1.708266,3.950977
17,17,N_18,18,OPA,16.968222,33.114548,9.603087,19.497175
88,88,N_89,89,OPA,3.790601,28.972202,12.908659,13.453703


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 900 entries, 0 to 899
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  900 non-null    int64  
 1   Sensor ID   900 non-null    object 
 2   Index       900 non-null    int64  
 3   Material    900 non-null    object 
 4   Coor X [m]  900 non-null    float64
 5   Coor Y [m]  900 non-null    float64
 6   Coor Z [m]  900 non-null    float64
 7   R [m]       900 non-null    float64
dtypes: float64(4), int64(2), object(2)
memory usage: 56.4+ KB


array(['OPA', 'SHCR', 'GBM', 'EDZ', 'VOID', 'CAN', 'BBLOCK'], dtype=object)

There are no missing values and no apparent outliers in the positions. Some columns are redundant, though: the row numbering, the Sensor ID, and the Index. The rows are in ascending order, so as long as the indices match between the files, the sensor's name and number do not matter. We also rename the columns to make them easier to type later.

We also one-hot encode the Material column so that it can be fed to the model later, since it is a categorical feature.

In [None]:
coordinates_train = coordinates_train[["Material", "Coor X [m]", "Coor Y [m]", "Coor Z [m]", "R [m]"]].copy()
coordinates_test = coordinates_test[["Material", "Coor X [m]", "Coor Y [m]", "Coor Z [m]", "R [m]"]].copy()

# changing the column names for faster typing later
new_col_names: dict = {
    "Coor X [m]": "x",
    "Coor Y [m]": "y",
    "Coor Z [m]": "z",
    "R [m]": "r"
}
coordinates_train.rename(columns = new_col_names, inplace=True)
coordinates_test.rename(columns = new_col_names, inplace=True)

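# integer codes for the materials (an alternative to one-hot encoding; not used by the cells shown here)
material_mapping: dict = {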
    'OPA': 0,
    'SHCR': 1,
    'GBM': 2,
    'EDZ': 3,
    'VOID': 4,
    'CAN': 5,
    'BBLOCK': 6
}

## one-hot encode the Material column (cf. the intro to pandas notebook from the weekly exercises)
coordinates_test = pd.get_dummies(coordinates_test) 
coordinates_train = pd.get_dummies(coordinates_train)
display(coordinates_train.head(5))

Unnamed: 0,x,y,z,r,Material_BBLOCK,Material_CAN,Material_EDZ,Material_GBM,Material_OPA,Material_SHCR,Material_VOID
0,0.208042,14.436936,-2.875503,2.883019,False,False,False,False,True,False,False
1,-8.970832,28.229841,-0.134437,8.971839,False,False,False,False,True,False,False
2,-14.289501,6.685726,-10.399048,17.672862,False,False,False,False,True,False,False
3,6.114855,2.685645,-3.189981,6.896914,False,False,False,False,True,False,False
4,4.048845,48.70859,11.260503,11.966289,False,False,False,False,True,False,False
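One caveat when the training and test tables are encoded separately: if a material is missing from one of them, the two encoded tables end up with different columns. The sketch below uses small made-up frames (`toy_train` and `toy_test` are hypothetical, not the project data) to show one possible way of aligning the test columns on the training columns with `reindex`.

```python
import pandas as pd

# Hypothetical toy data: the test set happens to contain only one material.
toy_train = pd.DataFrame({"Material": ["OPA", "CAN", "VOID"], "r": [2.9, 0.4, 0.6]})
toy_test = pd.DataFrame({"Material": ["OPA", "OPA"], "r": [3.5, 19.3]})

# One-hot encode each table on its own, as in the cell above.
train_encoded = pd.get_dummies(toy_train, columns=["Material"], dtype=int)
test_encoded = pd.get_dummies(toy_test, columns=["Material"], dtype=int)

# Give the test table exactly the training columns; materials that never
# appear in the test set become all-zero columns.
test_encoded = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)

print(list(test_encoded.columns))
```

If every material appears in both files, the `reindex` call changes nothing, so it is a cheap safeguard.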


In [35]:
display(humidity_train.iloc[:,100:130])
mean_humidity = humidity_train.mean(axis=0).iloc[1:] ## not keeping the time for the mean
mean_humidity.dropna(inplace=True)
print(f"Global mean: {np.mean(mean_humidity, axis=0)}, standard deviation: {np.std(mean_humidity, axis=0)}")

Unnamed: 0,N_100,N_101,N_102,N_103,N_104,N_105,N_106,N_107,N_108,N_109,...,N_120,N_121,N_122,N_123,N_124,N_125,N_126,N_127,N_128,N_129
0,100,100,100,100,100,100,100,100,100,100,...,100,100,100,100,100,100,100,100,100,100
1,100,100,100,100,100,100,100,100,100,100,...,100,100,100,100,100,100,100,100,100,100
2,100,100,100,100,100,100,100,100,100,100,...,100,100,100,100,100,100,100,100,100,100
3,100,100,100,100,100,100,100,100,100,100,...,100,100,100,100,100,100,100,100,100,100
4,100,100,100,100,100,100,100,100,100,100,...,100,100,100,100,100,100,100,100,100,100
5,100,100,100,100,100,100,100,100,100,100,...,100,100,100,100,100,100,100,100,100,100
6,100,100,100,100,100,100,100,100,100,100,...,100,100,100,100,100,100,100,100,100,100
7,100,100,100,100,100,100,100,100,100,100,...,100,100,100,100,100,100,100,100,100,100
8,100,100,100,100,100,100,100,100,100,100,...,100,100,100,100,100,100,100,100,100,100
9,100,100,100,100,100,100,100,100,100,100,...,100,100,100,100,100,100,100,100,100,100


Global mean: 98.19472180013585, standard deviation: 5.56673083885869


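The humidity is almost constant: the per-sensor means average about 98.2 and spread by only about 5.6 (this is what `np.std` returns in the cell above), and the sensors displayed read 100 at every time step. It therefore does not seem to carry much information, and we discard it for our first model.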
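One way to back up this impression is to count how many sensors never change at all. The sketch below does so on a small made-up frame (`toy_humidity` and its values are hypothetical); the same two lines could be pointed at `humidity_train`, assuming its time column is dropped first.

```python
import pandas as pd

# Hypothetical toy data: two sensors stuck at 100, one that actually varies.
toy_humidity = pd.DataFrame({
    "M.Time[d]": [1554, 1556, 1558, 1560],
    "N_1": [100.0, 100.0, 100.0, 100.0],
    "N_2": [100.0, 100.0, 100.0, 100.0],
    "N_3": [80.0, 85.0, 90.0, 95.0],
})

# Drop the time column so that only sensor readings are analysed.
sensors = toy_humidity.drop(columns="M.Time[d]")

# A sensor is constant when it takes a single distinct value over time.
is_constant = sensors.nunique() == 1
constant_fraction = float(is_constant.mean())

print(f"{constant_fraction:.0%} of the sensors are constant")
```

A high fraction of constant sensors would support dropping the feature; a low one would suggest looking at the varying sensors more closely before discarding them.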

In [36]:
display(pressure_train.sample(10))
pressure_train.isnull().sum().sum() # See how many missing values there are

Unnamed: 0,M.Time[d],N_1,N_2,N_3,N_4,N_5,N_6,N_7,N_8,N_9,...,N_891,N_892,N_893,N_894,N_895,N_896,N_897,N_898,N_899,N_900
27,4133,35.483143,1613.366784,1625.964221,1111.704517,1572.137598,1551.249501,1560.455896,1647.797984,1379.885119,...,-37960.37188,102.145498,407.715011,627.24363,-590.170315,-53835.06275,-1159.198459,165.223822,-59641.98455,936.737819
29,5587,-126.443948,1427.4156,1599.832439,1044.165672,1582.56119,1568.662633,1518.937457,1609.704476,1287.358849,...,-32874.04125,-92.627052,206.471809,410.454022,-794.210744,-50782.56296,-1354.899393,-47.678273,-58459.67072,736.076259
26,3616,106.274939,1684.808222,1628.652425,1137.270914,1555.532477,1534.928239,1571.155251,1643.03845,1405.974281,...,-40500.998,195.13012,500.025021,731.59732,-495.292741,-55013.86874,-1075.59736,259.967716,-60047.50393,1023.587342
7,1578,268.052957,1455.254033,1655.564414,1318.58674,1473.064656,1479.920316,1649.50224,1509.909966,1394.771852,...,-57108.95397,227.660269,422.065061,647.016759,-347.410031,-60357.50067,-778.805126,348.607103,-62496.78004,845.165059
13,1662,276.089676,1445.439966,1652.937807,1304.070726,1470.281668,1478.459543,1643.928649,1503.114566,1384.002387,...,-56020.85926,460.86974,674.76562,808.421271,-96.448798,-60092.11262,-751.045993,541.729133,-62043.75043,1033.392931
30,6597,-214.380046,1321.5461,1572.359127,1001.267602,1573.175157,1564.077945,1485.179496,1567.71744,1223.577043,...,-30535.99632,-191.773443,100.892032,301.510391,-904.287405,-48851.81512,-1466.118921,-161.202948,-57515.23771,628.138609
11,1621,263.300379,1445.110747,1654.37501,1311.236446,1471.731106,1479.176256,1646.820627,1506.566481,1389.031124,...,-56528.221,363.489506,562.737272,713.673752,-177.863324,-60221.66068,-766.980365,458.395808,-62177.76093,942.758921
20,2090,327.870118,1670.042287,1635.025242,1239.640727,1468.399228,1476.23904,1611.33012,1509.888862,1390.031486,...,-51569.22808,561.512961,843.901837,1110.664107,-96.492185,-58863.88557,-772.727303,605.028927,-61440.53123,1263.655853
28,4779,-42.559165,1526.163722,1617.176252,1080.92698,1581.852995,1563.519452,1543.682097,1636.850272,1340.123077,...,-35383.29199,5.817968,309.437067,519.659696,-689.66761,-52434.06393,-1252.216675,62.01498,-59132.59251,840.084431
17,1826,322.93674,1519.891773,1646.087585,1277.042803,1465.769942,1476.096189,1631.435437,1495.30359,1376.628487,...,-54210.97965,581.441547,840.182744,1040.206752,-40.479411,-59603.86903,-735.299719,635.896303,-61761.55134,1205.466548


96

There are 96 missing pressure values. Dropping them would remove quite a lot of data, so we impute them instead. Constant or mean imputation would be possible, but we use KNN imputation with 2 neighbors, as it is more likely to preserve the continuity of the time series.

In [37]:
imputer = KNNImputer(missing_values = np.nan, n_neighbors = 2).set_output(transform="pandas")
pressure_train = imputer.fit_transform(pressure_train)
# the imputer is refitted here, so the test set is imputed from its own rows only
pressure_test = imputer.fit_transform(pressure_test)
pressure_train.isnull().sum().sum() 


0
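To see what a 2-neighbor imputation does, here is a minimal sketch on a made-up frame (`toy_pressure` and its values are hypothetical). Each row is one time step. `KNNImputer` picks neighbors by distance over the columns that are present, time column included, so they are usually, but not necessarily, the adjacent time steps.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy data: sensor N_1 has a gap at the third time step.
toy_pressure = pd.DataFrame({
    "M.Time[d]": [0.0, 1.0, 2.0, 3.0, 4.0],
    "N_1": [10.0, 11.0, np.nan, 13.0, 14.0],
    "N_2": [20.0, 21.0, 22.0, 23.0, 24.0],
})

# Fill each gap with the mean of the 2 most similar rows.
toy_imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(
    toy_imputer.fit_transform(toy_pressure),
    columns=toy_pressure.columns,
)

print(filled.loc[2, "N_1"])  # mean of the rows at time 1 and time 3
```

Here the two closest rows are the time steps just before and after the gap, so the gap is filled with the average of 11 and 13.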

In [38]:
display(temperature_train.head(10))

Unnamed: 0,M.Time[d],N_1,N_2,N_3,N_4,N_5,N_6,N_7,N_8,N_9,...,N_891,N_892,N_893,N_894,N_895,N_896,N_897,N_898,N_899,N_900
0,1554,17.623059,17.15422,17.641578,17.455701,,16.604935,17.662407,16.503001,16.943823,...,17.503931,17.225297,17.498277,17.268529,17.573474,17.412215,17.526257,17.36494,24.026562,17.538194
1,1556,17.62086,17.154263,17.641672,17.45585,16.415312,16.605042,17.662519,16.503121,16.943985,...,17.510776,17.22329,17.498581,17.267488,17.578925,17.409841,17.52286,17.363663,33.729552,17.53746
2,1558,17.618608,17.154303,17.641766,17.455998,16.415377,16.605148,17.662632,16.50324,16.944146,...,17.534085,17.223733,17.501874,2872.837827,17.599256,17.407913,17.520157,17.36385,41.602481,17.537433
3,1560,17.616334,2717.706176,17.641859,17.456146,16.41544,16.605254,17.662744,16.503357,16.944307,...,17.58161,17.228355,17.50967,17.266326,17.640317,17.40677,17.51875,17.366504,48.21898,17.538652
4,1563,17.612991,17.154388,17.642,17.456367,16.415531,16.605414,17.662912,16.50353,16.944544,...,17.723547,17.249726,17.535358,17.267759,17.757592,17.408069,17.521699,17.379102,56.258743,17.545154
5,1567,17.609008,17.154454,17.642187,17.456661,16.415646,16.605626,17.663136,16.503753,16.944855,...,2766.426947,17.310388,17.598978,17.275449,18.010005,17.417672,17.538472,17.415806,64.775395,17.565561
6,1572,17.605614,17.154568,17.64242,17.457028,16.415779,16.60589,17.663415,16.504023,16.945233,...,18.634191,17.438013,17.72479,17.297646,18.444537,17.444923,17.583032,17.495517,73.147489,17.612783
7,1578,17.605318,17.154841,17.6427,17.457465,16.415925,16.606206,17.663749,16.504334,16.94567,...,19.53027,17.656072,17.934088,17.346024,19.076916,17.501008,17.670392,17.636533,80.996009,17.701098
8,1585,17.612025,17.15553,17.643025,17.457972,16.416077,16.606574,17.664138,16.504678,16.946154,...,20.726138,17.976554,18.239215,17.433777,19.889638,17.596132,17.812325,17.851145,88.108493,17.842181
9,1595,17.638763,17.157931,17.643489,17.458689,16.416256,16.607098,17.664689,16.505137,16.946813,...,22.524169,18.511739,18.750719,17.618164,21.073668,17.779369,18.073014,18.223945,95.570527,18.100318


Here, however, we notice sudden spikes where the temperature jumps from about 20 to more than 2,000 (the mineral would most likely melt at that temperature) and drops back to about 20 a few days later. We treat those values as measurement errors: every reading at or above a threshold of 500 is replaced by the mean of its 2 nearest neighbors (the box plot in the cell below, commented out because it is slow to draw, suggests that this threshold should work well).

In [39]:
# Be careful to exclude the time column, whose values are large enough to exceed the threshold
columns = temperature_train.columns.difference(["M.Time[d]"])

## Commented out as it takes 15s to run
# sns.boxplot(temperature_train.iloc[:,1:])

temperature_train[columns] = temperature_train[columns].mask(temperature_train[columns] >= 500, np.nan)
# temperature_train.iloc[:, :10]

## We impute the missing data with the 2 nearest neighbors as done with the pressure
temperature_train = imputer.fit_transform(temperature_train)
temperature_train.isnull().sum().sum() # check that it worked

0
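The masking step can be checked on a small made-up frame (`toy` and its values are hypothetical). To keep the sketch free of extra dependencies, the gap is filled by linear interpolation along the rows rather than by the KNN imputer used above, so this illustrates the idea and is not the exact procedure of the notebook.

```python
import pandas as pd

# Hypothetical toy data: one spike in N_1, and a time column whose
# values are all above the threshold and must not be masked.
toy = pd.DataFrame({
    "M.Time[d]": [1554, 1556, 1558, 1560, 1563],
    "N_1": [17.6, 17.7, 2717.7, 17.9, 18.0],
    "N_2": [16.4, 16.4, 16.4, 16.4, 16.4],
})

# Only the sensor columns are compared against the threshold.
sensor_cols = toy.columns.difference(["M.Time[d]"])

cleaned = toy.copy()
cleaned[sensor_cols] = cleaned[sensor_cols].mask(cleaned[sensor_cols] >= 500)
n_masked = int(cleaned[sensor_cols].isna().sum().sum())

# Swapped-in fill: linear interpolation between the neighboring rows.
cleaned[sensor_cols] = cleaned[sensor_cols].interpolate(method="linear")

print(n_masked, cleaned.loc[2, "N_1"])
```

Without the `columns.difference` step, every value of `M.Time[d]` would have been masked as well, since all of them exceed 500.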

### **Feature engineering**

Model: Recurrent neural net ?? 
(cf lecture 7.3)