## Interpolating data and first alignment and aggregation

We can now interpolate to approximate the measurements for each day. The approximation is:
$$\frac{\left[\left(6-(d-1)\right)\,M_\mathrm{before}\right]+\left[(d-1)\,M_\mathrm{after}\right]}{6}$$
where $d$ is the number of days since the last maintenance day; it runs from $1$ to $6$ for operation days and is $7$ on a maintenance day. $M_\mathrm{before}$ is the value of the measurement being approximated ($M_1$, $M_2$ or $M_3$) at the previous maintenance day: `"M1 (before)"`, `"M2 (before)"` or `"M3 (before)"` if there was no repair and recalibration on that day, and `"M1 (after)"`, `"M2 (after)"` or `"M3 (after)"` if there was. $M_\mathrm{after}$ is the same measurement at the next maintenance day, which is always `"M1 (before)"`, `"M2 (before)"` or `"M3 (before)"` on that day.

For the first few days, when there is no previous maintenance day, we use the value from the first maintenance day, which gives those days fixed values equal to the measurements at the first maintenance day. Symmetrically, for the last few days, when there is no next maintenance day, we use the value from the last maintenance day, which again gives fixed values equal to the measurements at the last maintenance day.
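As a quick check of the formula, here is a minimal sketch that evaluates it for one interval between two maintenance days. The function name `interpolate_measurement` and the two measurement values are hypothetical and chosen only for illustration.

```python
def interpolate_measurement(m_before, m_after, d):
    """Linear interpolation between two maintenance-day measurements.

    d = 1 is the first operation day after the previous maintenance day,
    d = 7 is the next maintenance day itself.
    """
    return ((6 - (d - 1)) * m_before + (d - 1) * m_after) / 6


# Hypothetical measurements, for illustration only.
m_before, m_after = 1.0, 4.0
daily = [interpolate_measurement(m_before, m_after, d) for d in range(1, 8)]
print(daily)  # [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
```

At $d = 1$ the result equals $M_\mathrm{before}$ and at $d = 7$ it equals $M_\mathrm{after}$, with equal steps in between.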

In [1]:
import numpy as np

In [2]:
%store -r maintenance_df_7
%store -r is_repair
%store -r weather_df_3
%store -r dataframe_dropped

In [3]:
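# *_prev: value left behind by each maintenance day ("after" if the machine
# was repaired and recalibrated that day, otherwise "before").
# *_next: value found on arrival at each maintenance day (always "before").
m1_prev = maintenance_df_7["M1 (before)"].values.astype("float")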
m1_prev[is_repair] = maintenance_df_7.loc[is_repair, "M1 (after)"].values.astype("float")
m1_next = maintenance_df_7["M1 (before)"].values.astype("float")
m2_prev = maintenance_df_7["M2 (before)"].values.astype("float")
m2_prev[is_repair] = maintenance_df_7.loc[is_repair, "M2 (after)"].values.astype("float")
m2_next = maintenance_df_7["M2 (before)"].values.astype("float")
m3_prev = maintenance_df_7["M3 (before)"].values.astype("float")
m3_prev[is_repair] = maintenance_df_7.loc[is_repair, "M3 (after)"].values.astype("float")
m3_next = maintenance_df_7["M3 (before)"].values.astype("float")

weather_maintenance_df = weather_df_3.copy()
num_days = weather_maintenance_df.shape[0]

m1 = np.zeros((num_days,))
m2 = np.zeros((num_days,))
m3 = np.zeros((num_days,))

maintenance_dates = maintenance_df_7["Maintenance date"].values.astype("str")
num_maintenance_days = maintenance_dates.shape[0]

previous_maintenance_index = 0
next_maintenance_index = 0
d = 1
for i, date in enumerate(weather_maintenance_df["Date"].values):
    if np.datetime64(maintenance_dates[next_maintenance_index]) < np.datetime64(date):
        # This day falls after the current "next" maintenance day:
        # move on to the following interval and restart the day counter.
        next_maintenance_index += 1
        previous_maintenance_index = next_maintenance_index - 1
        d = 1
    else:
        d += 1
    if next_maintenance_index == num_maintenance_days:
        # No maintenance day is left: reuse the last one.
        next_maintenance_index -= 1
    # Weighted average of the previous and next maintenance-day measurements.
    m1[i] = (((7 - d) * m1_prev[previous_maintenance_index]) + ((d - 1) * m1_next[next_maintenance_index])) / 6
    m2[i] = (((7 - d) * m2_prev[previous_maintenance_index]) + ((d - 1) * m2_next[next_maintenance_index])) / 6
    m3[i] = (((7 - d) * m3_prev[previous_maintenance_index]) + ((d - 1) * m3_next[next_maintenance_index])) / 6
weather_maintenance_df["M1"] = m1
weather_maintenance_df["M2"] = m2
weather_maintenance_df["M3"] = m3

In [4]:
display(weather_maintenance_df)

Unnamed: 0,Date,Temperature,Humidity,Wind level > 0,Wind level > 1,Wind level > 2,M1,M2,M3
0,2015-03-15,1.242275,1.283167,1,0,0,-1.451276,2.184456,-0.265251
1,2015-03-16,1.306493,1.448742,1,0,0,-1.451276,2.184456,-0.265251
2,2015-03-17,1.434929,1.582010,1,0,0,-1.451276,2.184456,-0.265251
3,2015-03-18,1.434929,1.666817,1,0,0,-1.451276,2.184456,-0.265251
4,2015-03-19,1.434929,1.682971,1,0,0,-1.451276,2.184456,-0.265251
...,...,...,...,...,...,...,...,...,...
777,2017-04-30,1.481633,1.194321,1,0,0,0.677127,-1.371244,1.665619
778,2017-05-01,1.382387,0.964131,1,0,0,0.677127,-1.371244,1.665619
779,2017-05-02,1.329845,1.186244,1,0,0,0.677127,-1.371244,1.665619
780,2017-05-03,1.259789,1.303359,1,0,0,0.677127,-1.371244,1.665619


The date and time of an event are not useful features for our classification, so in the end we will not need them. However, we do need the dates to align the two datasets for aggregation. The times of day are not used in the alignment and can be dropped. As a first step, we reformat the timestamps of the events dataset to contain only the dates and not the times:

In [6]:
# Keep only the date part (first 10 characters, "YYYY-MM-DD") of each timestamp.
dates = []
for timestamp in dataframe_dropped['Date And Time'].values.astype("str"):
    dates.append(timestamp[:10])
new_df = dataframe_dropped.drop(['Date And Time'], axis=1)
new_df["Date"] = dates
display(new_df)

Unnamed: 0,Mode_1,Mode_2,Mode_3,Events,Sensor1_new,Sensor_3_new,Sensor4_new,Sensor7_new,Sensor_9_new,Date
0,0.0,1.0,0.0,P,-1.008362,1.431407,1.036396,1.356761,0.949536,2015-03-17
1,1.0,0.0,0.0,N,-2.526479,1.339257,0.674007,1.043972,0.033504,2015-03-17
2,0.0,0.0,1.0,N,-0.920593,-0.225143,-0.503194,0.337810,0.365361,2015-03-17
3,1.0,0.0,0.0,N,-0.142516,-1.235099,-1.281430,-0.512888,0.937044,2015-03-17
4,1.0,0.0,0.0,N,1.081931,-0.527013,-0.242260,-1.719536,-1.516428,2015-03-17
...,...,...,...,...,...,...,...,...,...,...
3885,1.0,0.0,0.0,P,0.235177,-0.901188,0.069539,0.452761,-0.892809,2017-05-03
3886,0.0,1.0,0.0,P,-0.337261,0.747981,0.587963,0.171028,-0.924738,2017-05-03
3887,0.0,0.0,1.0,N,1.073390,-0.494030,-0.247538,-1.276431,0.282477,2017-05-03
3888,0.0,0.0,1.0,P,-0.682168,0.406822,0.171139,0.313067,-0.427293,2017-05-03
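The same truncation can also be written without an explicit loop. The sketch below uses the pandas `.str` accessor on a toy DataFrame; the `events` name, the row values, and the exact time-of-day format are assumptions made for illustration, not taken from the real dataset.

```python
import pandas as pd

# Toy stand-in for the events DataFrame; the values are made up.
events = pd.DataFrame({
    "Date And Time": [
        "2015-03-17 08:15:00",
        "2015-03-17 21:40:00",
        "2015-03-18 02:05:00",
    ],
    "Events": ["P", "N", "N"],
})

# Keep the first 10 characters ("YYYY-MM-DD") of every timestamp string.
events["Date"] = events["Date And Time"].astype(str).str[:10]
events = events.drop(columns="Date And Time")
print(events)
```

This relies on every timestamp starting with a 10-character date, which is the same assumption the slicing loop makes.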


### Aligning and aggregating the data

We can now align and aggregate. For each row in the events dataset, we will store the matching readings from the combined daily weather and maintenance dataset in lists:

In [7]:
final_temperature = []
final_humidity = []
final_wind_level_lt_0 = []
final_wind_level_lt_1 = []
final_wind_level_lt_2 = []
final_m1 = []
final_m2 = []
final_m3 = []
final_df = new_df.copy()

- Write a `for` loop over the dates in `final_df`, converted to a string-valued NumPy array, storing each date in the variable `date`. (If `"Name"` is a column of a DataFrame `df`, a string-valued NumPy array of that column's values can be obtained with `df["Name"].values.astype("str")`.) Inside the loop:
    - Build a Boolean series, `weather_maintenance_index`, marking where the `"Date"` column of the `weather_maintenance_df` DataFrame equals `date`. It will contain a single `True` value, since that DataFrame has exactly one entry for each operation day;
    - Append to the `final_temperature` list the portion of `weather_maintenance_df` at the intersection of the `weather_maintenance_index` Boolean series (which acts as a filter) and the `"Temperature"` column (using `.loc`), converted to a NumPy array with its first element extracted (chain `.values[0]` to the end of your call; since there is always a single `True` in `weather_maintenance_index`, that first element is unique and exactly what we want). You can append `a` to a list `l` with `l.append(a)`;
    - Repeat this for `final_humidity`, `final_wind_level_lt_0`, `final_wind_level_lt_1`, `final_wind_level_lt_2`, `final_m1`, `final_m2` and `final_m3` with the columns `"Humidity"`, `"Wind level > 0"`, `"Wind level > 1"`, `"Wind level > 2"`, `"M1"`, `"M2"` and `"M3"` of `weather_maintenance_df` (in conjunction with `weather_maintenance_index`).
- Add a new column `"Temperature"` to `final_df` holding the values in `final_temperature`;
- Repeat the above for `"Humidity"`, `"Wind level > 0"`, `"Wind level > 1"`, `"Wind level > 2"`, `"M1"`, `"M2"` and `"M3"` in `final_df` with the values of `final_humidity`, `final_wind_level_lt_0`, `final_wind_level_lt_1`, `final_wind_level_lt_2`, `final_m1`, `final_m2` and `final_m3`, respectively.

In [8]:
### START CODE BELOW THIS LINE ###
for date in final_df["Date"].values.astype("str"):
    weather_maintenance_index = (weather_maintenance_df["Date"] == date)
    final_temperature.append(weather_maintenance_df.loc[weather_maintenance_index, "Temperature"].values[0])
    final_humidity.append(weather_maintenance_df.loc[weather_maintenance_index, "Humidity"].values[0])
    final_wind_level_lt_0.append(weather_maintenance_df.loc[weather_maintenance_index, "Wind level > 0"].values[0])
    final_wind_level_lt_1.append(weather_maintenance_df.loc[weather_maintenance_index, "Wind level > 1"].values[0])
    final_wind_level_lt_2.append(weather_maintenance_df.loc[weather_maintenance_index, "Wind level > 2"].values[0])
    final_m1.append(weather_maintenance_df.loc[weather_maintenance_index, "M1"].values[0])
    final_m2.append(weather_maintenance_df.loc[weather_maintenance_index, "M2"].values[0])
    final_m3.append(weather_maintenance_df.loc[weather_maintenance_index, "M3"].values[0])
final_df["Temperature"] = final_temperature
final_df["Humidity"] = final_humidity
final_df["Wind level > 0"] = final_wind_level_lt_0
final_df["Wind level > 1"] = final_wind_level_lt_1
final_df["Wind level > 2"] = final_wind_level_lt_2
final_df["M1"] = final_m1
final_df["M2"] = final_m2
final_df["M3"] = final_m3
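The row-by-row lookup above is one way to align the two datasets. As an alternative sketch (not the approach the exercise asks for), a left join on `"Date"` with `DataFrame.merge` attaches the daily readings to every event row in one call. The toy DataFrames `events` and `daily` and their values are hypothetical stand-ins for `new_df` and `weather_maintenance_df`.

```python
import pandas as pd

# Toy stand-ins for new_df and weather_maintenance_df; values are made up.
events = pd.DataFrame({
    "Events": ["P", "N", "N"],
    "Date": ["2015-03-17", "2015-03-17", "2015-03-18"],
})
daily = pd.DataFrame({
    "Date": ["2015-03-17", "2015-03-18"],
    "Temperature": [1.4, 1.5],
    "M1": [-1.45, -1.40],
})

# Left join: every event row keeps its position and picks up that day's readings.
# validate="many_to_one" raises an error if a date occurs more than once in `daily`.
merged = events.merge(daily, on="Date", how="left", validate="many_to_one")
print(merged)
```

The join requires the `"Date"` columns of both DataFrames to hold comparable values (here, strings in the same format). An event whose date is missing from the daily table would get `NaN` readings rather than raising an `IndexError` as `.values[0]` would.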

We have successfully aligned and aggregated the dataset. We need two final touches and we are done.

## Removing extra columns

In [9]:
final_df = final_df.drop(columns="Date")

## Rearranging columns

In [None]:
final_column_order = list(final_df.columns)
final_column_order.remove("Events")
final_column_order.append("Events")
final_df = final_df[final_column_order]

In [10]:
final_df

Unnamed: 0,Mode_1,Mode_2,Mode_3,Events,Sensor1_new,Sensor_3_new,Sensor4_new,Sensor7_new,Sensor_9_new,Temperature,Humidity,Wind level > 0,Wind level > 1,Wind level > 2,M1,M2,M3
0,0.0,1.0,0.0,P,-1.008362,1.431407,1.036396,1.356761,0.949536,1.434929,1.582010,1,0,0,-1.451276,2.184456,-0.265251
1,1.0,0.0,0.0,N,-2.526479,1.339257,0.674007,1.043972,0.033504,1.434929,1.582010,1,0,0,-1.451276,2.184456,-0.265251
2,0.0,0.0,1.0,N,-0.920593,-0.225143,-0.503194,0.337810,0.365361,1.434929,1.582010,1,0,0,-1.451276,2.184456,-0.265251
3,1.0,0.0,0.0,N,-0.142516,-1.235099,-1.281430,-0.512888,0.937044,1.434929,1.582010,1,0,0,-1.451276,2.184456,-0.265251
4,1.0,0.0,0.0,N,1.081931,-0.527013,-0.242260,-1.719536,-1.516428,1.434929,1.582010,1,0,0,-1.451276,2.184456,-0.265251
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3885,1.0,0.0,0.0,P,0.235177,-0.901188,0.069539,0.452761,-0.892809,1.259789,1.303359,1,0,0,0.677127,-1.371244,1.665619
3886,0.0,1.0,0.0,P,-0.337261,0.747981,0.587963,0.171028,-0.924738,1.259789,1.303359,1,0,0,0.677127,-1.371244,1.665619
3887,0.0,0.0,1.0,N,1.073390,-0.494030,-0.247538,-1.276431,0.282477,1.259789,1.303359,1,0,0,0.677127,-1.371244,1.665619
3888,0.0,0.0,1.0,P,-0.682168,0.406822,0.171139,0.313067,-0.427293,1.259789,1.303359,1,0,0,0.677127,-1.371244,1.665619


## Saving File

In [11]:
final_df.to_csv("final_data.csv", index=False)