# tb.lx Data Science Challenge - Part II
----
----
## Introduction

Dear applicant,

Congratulations on passing the first screening! We’re excited to get to know you better and to get a better sense of your skills. In this round, we will test your problem-solving skills and data science experience by giving you a case to solve.

After you hand in your solution, we will review it and give you our feedback. If you pass, you will be invited to an on-site interview. During the interview, you’ll get the opportunity to explain your solution and the steps that you took to get there. We've prepared this notebook to help you walk us through your ideas and decisions.

If you're not able to fully solve the case, please elaborate as precisely as you can:

- Which next steps you'd take;
- Which problems you foresee there and how you'd solve them.

If you have any questions, feel free to contact ana.cunha@daimler.com or sara.gorjao@daimler.com.

Best of luck!

## Context:

Predictive Maintenance is one of the hottest topics in the heavy-industry field. The ability to detect failures before they happen is of utmost importance: it enables the full utilization of materials by avoiding unnecessary early replacements, and it enables optimizations in maintenance planning that reduce downtime.


## Data:

One of the challenges in the auto-tech industry is to detect failures before they happen. For this, we provide a dataset consisting of:
* `telemetry.csv`: sensor values for each machine over time.
* `faults.csv`: faults for each machine over time.
* `errors.csv`: errors for each machine over time.
* `machines.csv`: static features for each machine.


## Task:

In the second part of the challenge, we would like to know that a failure is going to happen before it actually happens. The choice of prediction horizon is entirely up to you, **but the goal is to predict failures before they happen**.


## Questions:

The following is a set of theoretical questions:

1. How can you create a machine learning model that leverages all the data that we provided whilst adapting to the specificities of each turbine (e.g., operating in different weather conditions)?
2. Modeling the normal behaviour of such machines can prove to be a good feature. After training a model that captures the normal turbine dynamics, we need to decide whether the displayed behaviour should be considered an anomaly. How can one design a framework that creates alerts for abnormality without overloading the end-user with too many false positives?
3. How would you measure aleatoric uncertainty of the predictions of your model?

## Requirements:

- Solution implemented in Python 3.6+;
- Provide requirements.txt to test the solution in the same environment;
- Write well-structured, documented, maintainable code;
- Write sanity checks to test the different steps of the pipeline.

In [None]:
# This will be much like the TTF (time-to-failure) material I have already seen. Load the datasets, look at the RUL,
# see how many time steps remain until the RUL, and so on
#https://www.kaggle.com/nafisur/predictive-maintenance-using-lstm-on-sensor-data
#https://www.kaggle.com/billstuart/predictive-maintenance-ml-iiot
#https://www.kaggle.com/hanwsf8/lstm-lgb-catb-for-predictive-maintenance-upper
#https://www.kaggle.com/juhumbertaf/tutorial
#https://iopscience.iop.org/article/10.1088/1742-6596/1037/6/062003/pdf
#https://www.kaggle.com/c/equipfails/overview
#https://www.kaggle.com/uciml/aps-failure-at-scania-trucks-data-set
#https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/predictive-maintenance-playbook#data-science-for-predictive-maintenance
#https://github.com/Azure-Samples/MachineLearningSamples-DeepLearningforPredictiveMaintenance
#https://gallery.azure.ai/Notebook/Predictive-Maintenance-Implementation-Guide-R-Notebook-2
#https://gallery.azure.ai/Collection/Predictive-Maintenance-Template-3

# For the first question: say something like making sure the model is not overfitting, so that it can
# adapt to new turbines (I can also say "make sure the data is representative of what we want")

# For the second question: do one-class classification

# For the third question: something like Bayesian estimation

In [49]:
import bisect
import numpy as np
import pandas as pd

# Loading the datasets
telemetry = pd.read_csv("../data/sensor/telemetry.csv", index_col=0)
failures = pd.read_csv("../data/sensor/failures.csv")
errors = pd.read_csv("../data/sensor/errors.csv")
machines = pd.read_csv("../data/sensor/machines.csv")

Converting datetime strings to datetime objects

In [50]:
telemetry["datetime"] = pd.to_datetime(telemetry["datetime"])
failures["datetime"] = pd.to_datetime(failures["datetime"])
errors["datetime"] = pd.to_datetime(errors["datetime"])

### Joining the additional information

Since other relevant information is scattered across different dataframes, it is useful to join them into a single dataframe containing everything we need.

In [51]:
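# Flag every row of the failures dataframe, then join that flag onto the telemetry dataframe
failures["failures"] = 1
telemetry = telemetry.merge(failures, on=["machineID", "datetime"], how="left")
# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which is not guaranteed to update the dataframe in recent pandas versions
telemetry["failures"] = telemetry["failures"].fillna(0)

# Joining the errors columns to the telemetry dataframe
telemetry = telemetry.merge(errors, on=["machineID", "datetime"], how="left")
telemetry["errorID"] = telemetry["errorID"].fillna("no_errors")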

# Joining the static machine information to the telemetry dataframe
telemetry = telemetry.merge(machines, on="machineID", how="left")
telemetry.head()

|   | datetime | machineID | volt | rotate | pressure | vibration | failures | errorID | model | age |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2015-01-01 06:00:00 | 1 | 176.217853 | NaN | 113.077935 | 45.087686 | 0.0 | no_errors | model3 | 18 |
| 1 | 2015-01-01 07:00:00 | 1 | 162.879223 | NaN | 95.460525 | NaN | 0.0 | no_errors | model3 | 18 |
| 2 | 2015-01-01 08:00:00 | 1 | NaN | 527.349825 | 75.237905 | 34.178847 | 0.0 | no_errors | model3 | 18 |
| 3 | 2015-01-01 09:00:00 | 1 | 162.462833 | 346.149335 | 109.248561 | 41.122144 | 0.0 | no_errors | model3 | 18 |
| 4 | 2015-01-01 10:00:00 | 1 | 157.610021 | NaN | 111.886648 | 25.990511 | 0.0 | no_errors | model3 | 18 |
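A left merge repeats a telemetry row whenever the right-hand dataframe has more than one row for the same `machineID` and `datetime` (for example, two errors logged in the same hour). Whether that happens in this dataset is not verified here, so the following is only a minimal sanity-check sketch on toy data. The toy frames and the `error1`/`error2` values are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the telemetry and errors dataframes (values are illustrative only)
telemetry_toy = pd.DataFrame({
    "machineID": [1, 1],
    "datetime": pd.to_datetime(["2015-01-01 06:00", "2015-01-01 07:00"]),
    "volt": [176.2, 162.9],
})
errors_toy = pd.DataFrame({
    "machineID": [1, 1],
    "datetime": pd.to_datetime(["2015-01-01 07:00", "2015-01-01 07:00"]),
    "errorID": ["error1", "error2"],
})

keys = ["machineID", "datetime"]

# Two errors in the same hour turn one telemetry row into two
merged = telemetry_toy.merge(errors_toy, on=keys, how="left")
n_duplicated = merged.duplicated(subset=keys).sum()

# validate="many_to_one" makes pandas raise if the right-hand keys are not unique
try:
    telemetry_toy.merge(errors_toy, on=keys, how="left", validate="many_to_one")
    merge_is_safe = True
except pd.errors.MergeError:
    merge_is_safe = False

print(len(merged), n_duplicated, merge_is_safe)
```

If such duplicates exist in the real data, one option would be to aggregate the errors per machine and timestamp before merging, so that every telemetry row keeps exactly one observation per hour.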


We now have a complete dataset that combines the time-series data with the static machine data. The preview also shows that some sensor values are missing, so we count the missing values per column.

In [52]:
telemetry.isna().sum()

datetime          0
machineID         0
volt         139548
rotate       139597
pressure     139461
vibration    139558
failures          0
errorID           0
model             0
age               0
dtype: int64

### Handling missing data

Nearly 16% of the values in each of the `volt`, `rotate`, `pressure` and `vibration` columns are missing, which is a sizeable portion of the data. With this much missing, simply dropping the affected rows is not an option, as it would have too much of an impact on the quality of our data.

What we can do instead is impute the missing values with a value that makes sense, such as the mean, mode or median of the column. However, different machines might have different mean/median/mode values for these columns, so it is important to impute each value using only the data of the machine it belongs to.

Since the missing values are in the dynamic data columns, we can go one step further than imputing with, for example, just the mean. Because these values change over time, we can impute each missing value by linear interpolation between the neighbouring observations of the same machine, which minimizes the disruption in the data and gives a reasonable estimate of the missing value.

In [53]:
def fill_missing_values(df, group_column, column_name, interpolate=True):
    """
        Fills the missing values in column_name with either the linearly interpolated value
        (if interpolate is True) or the mean of column_name taken from each group in group_column
        
        Arguments:
            df: The dataframe to update
            group_column: The name of the column by which to group the data
            column_name: The name of the column in which to replace the missing values
            interpolate: Boolean flag indicating if the missing values should be imputed with the interpolated
                value or by the mean
                
        Returns:
            A pd.Series object with missing values replaced by a meaningful value, to replace the original column
            in the dataframe
    """
    
    grouped = df.groupby(group_column)[column_name]
    
    # transform keeps the original row index, so the result can be assigned straight back to the dataframe
    if interpolate:
        return grouped.transform(lambda x: x.interpolate(method='linear'))
    else:
        return grouped.transform(lambda x: x.fillna(x.mean()))


for column in ["volt", "rotate", "pressure", "vibration"]:
    
    # Impute missing values with the interpolated values
    telemetry[column] = fill_missing_values(telemetry, "machineID", column)
    
    # Some missing values may still persist (e.g. at the start of a machine's series, where there is
    # no earlier point to interpolate from); for those we'll impute with the mean
    telemetry[column] = fill_missing_values(telemetry, "machineID", column, interpolate=False)
    
# Checking the missing values
telemetry.isna().sum()

Great, no more missing values in our data.
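To illustrate why the mean fallback is needed after interpolating, here is a small sketch on a made-up two-machine frame (the values are illustrative, not taken from the dataset). With pandas' default settings, linear interpolation fills gaps between known points but leaves a missing value at the very start of a machine's series untouched.

```python
import numpy as np
import pandas as pd

# Toy data: machine 1 starts with a missing value, machine 2 has a gap in the middle
toy = pd.DataFrame({
    "machineID": [1, 1, 1, 1, 2, 2, 2],
    "volt": [np.nan, 10.0, np.nan, 14.0, 100.0, np.nan, 104.0],
})

# Step 1: linear interpolation within each machine
interpolated = toy.groupby("machineID")["volt"].transform(
    lambda s: s.interpolate(method="linear")
)
# The leading NaN of machine 1 is still missing at this point
n_still_missing = interpolated.isna().sum()

# Step 2: fall back to the per-machine mean for whatever is left
machine_mean = interpolated.groupby(toy["machineID"]).transform("mean")
filled = interpolated.fillna(machine_mean)

print(n_still_missing)
print(filled.tolist())
```

Note that the fallback mean is computed after interpolation, so it includes the interpolated points, just as in the two-pass loop above.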

In [56]:
# Adding information about the cycle of each machine
telemetry["cycle"] = telemetry.groupby("machineID").cumcount()

In [None]:
# Adding RUL (Remaining Useful Life)
def add_rul(df):
    """Adds the remaining useful life (cycles left until the machine's last recorded cycle) of each observation"""

    # Broadcast each machine's maximum cycle to all of its rows
    max_cycle = df.groupby("machineID")["cycle"].transform("max")

    # Return a new dataframe with the RUL column appended
    return df.assign(RUL=max_cycle - df["cycle"])



In [107]:
def get_failure_cycles(df):
    """
        Gets a hash-map (a python dictionary) of all the cycles where a failure occurred for each machine,
        for faster lookup times
        Hash-Map format {machineID: <Sorted array of cycles where a failure occurred>}
        The maximum cycle of each machine is also added in the last position of each array
        
        Arguments:
            df: The telemetry dataframe, with "machineID", "cycle" and "failures" columns
        
        Returns:
            A hash-map containing the sorted cycles where each machine failed
    """
    
    hash_map = {}
    
    # Iterate over every machineID
    for machine_id in df["machineID"].unique():
        
        machine_rows = df.loc[df["machineID"] == machine_id]
        
        # Get the cycles where a failure occurred for this machine
        failure_cycles = machine_rows.loc[machine_rows["failures"] == 1, "cycle"].to_numpy()
        
        # Also get the maximum cycle for this machine
        max_cycle = machine_rows["cycle"].max()
        
        # np.unique sorts the values and removes a duplicated maximum cycle
        hash_map[machine_id] = np.unique(np.append(failure_cycles, max_cycle))
        
    return hash_map

failure_cycles = get_failure_cycles(telemetry)

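def add_time_to_failure(row, failure_cycles):
    """
        Computes the number of cycles between this observation and the next failure of its machine
        
        Arguments:
            row: A row of the telemetry dataframe
            failure_cycles: The hash-map returned by get_failure_cycles
            
        Returns:
            0 if the machine fails at this cycle, otherwise the number of cycles until its next failure
    """
    
    machine_failure_cycles = failure_cycles[row["machineID"]]
    
    if row["cycle"] in machine_failure_cycles:
        return 0
    
    else:
        # Get the index at which the current cycle of this machine would be inserted
        # in the sorted list of failures for this machine. This index holds the value of the next failure cycle
        # Example: 
        #    arr = [96, 150, 300]
        #    current_cycle = 50
        #    bisect.bisect(arr, current_cycle) 
        #    >> 0
        #
        #    current_cycle = 200
        #    bisect.bisect(arr, current_cycle) 
        #    >> 2
        #
        # Note: if the current cycle is larger than every value in the list, the returned index equals
        # the length of the list and the lookup below raises an IndexError
        next_failure_index = bisect.bisect(machine_failure_cycles, row["cycle"])

        return machine_failure_cycles[next_failure_index] - row["cycle"]

telemetry.apply(lambda row: add_time_to_failure(row, failure_cycles), axis=1)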

IndexError: ('index 8 is out of bounds for axis 0 with size 8', 'occurred at index 8762')
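The exact cause of this `IndexError` is not established here. It is what `bisect` produces when a row's cycle is larger than every value in that machine's list, which could happen, for instance, if the lookup table and the `cycle` column come from different runs of the notebook. A possible alternative, sketched below on toy data, avoids the index lookup altogether by back-filling the cycle of the next failure within each machine. The function name `time_to_next_failure` is hypothetical, and the sketch assumes rows are sorted by cycle within each machine.

```python
import numpy as np
import pandas as pd

def time_to_next_failure(df):
    """Cycles until the machine's next failure; NaN when no later failure is recorded."""
    # Keep the cycle number only on failure rows, NaN everywhere else
    failure_cycle = df["cycle"].where(df["failures"] == 1)
    # Back-fill within each machine so every row sees the cycle of its next failure
    next_failure = failure_cycle.groupby(df["machineID"]).bfill()
    return next_failure - df["cycle"]

# Toy data (illustrative values only)
toy = pd.DataFrame({
    "machineID": [1, 1, 1, 1, 2, 2, 2],
    "cycle":     [0, 1, 2, 3, 0, 1, 2],
    "failures":  [0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0],
})
toy["time_to_failure"] = time_to_next_failure(toy)
print(toy)
```

Unlike the dictionary-based version, this sketch does not treat a machine's last recorded cycle as a failure: rows after the last observed failure get `NaN`, which keeps them distinguishable from rows that truly precede a failure.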

In [106]:
arr = [96, 1536, 2617, 4057, 5857, 6938, 8378, 8762]
cycles = 8782

{1: array([  96, 1536, 2617, 4057, 5857, 6938, 8378, 8763]),
 2: array([1850, 2571, 8692, 8765]),
 3: array([ 145,  865, 4826, 6627, 8068, 8765]),
 4: array([ 385, 1105, 2186, 4707, 5787, 6868, 8765]),
 5: array([ 193, 1273, 2353, 4154, 5954, 6675, 7755, 8764]),
 6: array([8761]),
 7: array([ 554,  914, 2714, 3074, 3435, 4516, 6316, 7037, 8477, 8766]),
 8: array([1561, 1921, 2642, 5162, 6962, 8763]),
 9: array([1488, 4009, 4369, 4730, 5811, 6171, 6892, 7612, 8333, 8766]),
 10: array([ 433, 2234, 3315, 3675, 4036, 8765]),
 11: array([ 457, 1177, 2619, 6579, 8379, 8764]),
 12: array([ 146, 1947, 4467, 5907, 6628, 8765]),
 13: array([2401, 3841, 4201, 4562, 5283, 5644, 6364, 7084, 8164, 8526, 8767]),
 14: array([ 721, 1441, 4681, 5042, 8763]),
 15: array([ 458, 3698, 5138, 7300, 8740, 8765]),
 16: array([  21,  384, 1464, 2184, 3264, 3625, 3985, 4347, 5427, 6867, 7948,
        8765]),
 17: array([  21,  360, 1081, 1801, 2163, 3243, 3963, 4684, 5044, 5764, 7205,
        7926, 8646, 8767]),

## Creating the labels

In order to predict a failure before it happens, we need to lag the information about the failure back a few cycles, so that our model can learn the pattern that precedes a failure.

Looking at the data, we see that each cycle (an observation of each machine) represents one hour. Given this, we might lag our label column 15 cycles back in order to predict failures up to 15 hours in advance. (__Perhaps I should also remove the observation at the moment of failure here, because I am not interested in predicting that it has already failed.__)
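The labelling idea can be sketched as follows: a row gets label 1 if its machine fails within the next `horizon` cycles. This is only one possible reading of the approach. The function name `add_failure_label` and the toy data are hypothetical, rows are assumed to be sorted by time within each machine, and the failure row itself is deliberately not counted as a positive.

```python
import numpy as np
import pandas as pd

def add_failure_label(df, horizon=15):
    """Adds a 'label' column: 1 if the machine fails within the next `horizon` cycles."""
    grouped = df.groupby("machineID")["failures"]
    label = np.zeros(len(df))
    for k in range(1, horizon + 1):
        # Failure flag k cycles ahead, looking only at rows of the same machine
        upcoming = grouped.shift(-k).fillna(0).to_numpy()
        label = np.maximum(label, upcoming)
    return df.assign(label=label.astype(int))

# Toy data (illustrative values only), with a short horizon to keep it readable
toy = pd.DataFrame({
    "machineID": [1, 1, 1, 1, 1, 2, 2],
    "failures":  [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0],
})
labelled = add_failure_label(toy, horizon=2)
print(labelled)
```

Because the shift is done per machine, a failure at the start of one machine's history never leaks into the last rows of the previous machine. Whether to keep or drop the rows where `failures == 1` is left open, as in the note above.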