# Unit 1 Capstone Project: Narrative analytics and experimentation

##### Task description:
First, dive in and explore the data set. Include your code and visuals from this process in your final write up. While doing this, look for something that provokes a question; specifically one that can be answered with an experiment.

The main component of this capstone is a research proposal. Using the data set you selected, propose and outline an experiment plan. The plan should consist of three key components:

1. Analysis that highlights your experimental hypothesis.
2. A rollout plan showing how you would implement and roll out the experiment.
3. An evaluation plan showing what constitutes success in this experiment.

Your experiment should be as real as possible. Though you obviously will not have access to the full production environment to deploy your experiment, it should be feasible and of interest to the parties involved with your actual data source.

The target size of your research proposal should be 3-5 pages.

## The Dataset:
Levin Vehicle Telematics: Vehicle and Driving Data
* https://www.kaggle.com/yunlevin/levin-vehicle-telematics/version/3#v2.csv
* https://github.com/YunSolutions/levin-openData


#### Context
The dataset provided here is a sample of the data we collect in real time. It was collected over a 4-month period from 23 four-wheelers. We collect OBD data at 1 Hz (1 record per second), while accelerometer data is collected at 25 Hz (25 data points per second).

This metadata includes – Device Id, timestamp, trip id, accelerometer data, speed gathered from GPS, battery voltage, coolant temperature, diagnostic trouble codes, engine load, intake air temperature, manifold absolute pressure, calculated mileage, mass airflow, engine RPM, speed collected from OBD, timing advance, throttle positions

1. Device Id – Each device has a unique identifier. Devices and cars map one to one. 
2. Time Stamp – The time at which the record was collected. Each value corresponds to the data collected in that very second. Format – Year – Month – Day Hrs:Min:Sec 
3. Trip ID – The trip id corresponds to one trip. A trip begins when the engine is switched on and ends when the engine is switched off. 
4. accData – Refers to accelerometer and magnetometer sensor data. The data is collected from the OBD device, and values are in terms of G-force. The data covers the X, Y and Z axes, where the X-axis is horizontal, the Y-axis is vertical and the Z-axis is the direction of movement of the car. The data is provided in raw format. Extracting values requires a conversion formula, which is not reproduced here. 
5. gps_speed – The speed in kmph as noted from GPS sensor 
6. battery – The battery voltage corresponds to voltage of the battery installed in Car, which supplies electrical energy to a motor vehicle. 
7. cTemp – The Temperature of the engine coolant of an internal combustion engine. The normal operating temperature for most engines is in a range of 90 to 104 degree Celsius (195 to 220 degrees Fahrenheit) 
8. dtc – Number of diagnostic trouble codes. DTC's, or Diagnostic Trouble Codes, are used by automobile manufacturers to diagnose problems related to the vehicle. 
9. eLoad - Engine load measures how much air (and fuel) you're sucking into the engine and then compares that value to the theoretical maximum. 
10. iat - The Intake Air Temperature sensor (IAT) has been utilised as an Engine Control Unit (ECU) input signal, as a requirement for calculating the Air Mass volume for the incoming air charge. This is, to assist in determining the correct engine fuel requirement to suit the operating air temperature. 
11. imap - The manifold absolute pressure sensor (MAP sensor) is one of the sensors used in an internal combustion engine's electronic control system. The manifold absolute pressure sensor provides instantaneous manifold pressure information to the engine's electronic control unit (ECU). The data is used to calculate air density and determine the engine's air mass flow rate, which in turn determines the required fuel metering for optimum combustion (see stoichiometry) and influence the advance or retard of ignition timing. 
12. kpl – KMPL is mileage in kilometres per litre. It is a metric derived from speed and the fuel to air mass flow ratio. This ratio is constant for petrol cars but varies for other fuel types. Hence, the KMPL value is accurate for petrol cars and contains some error for other fuel types. 
13. maf - A mass (air) flow sensor (MAF) is used to find out the mass flow rate of air entering a fuel-injected internal combustion engine. The air mass information is necessary for the engine control unit (ECU) to balance and deliver the correct fuel mass to the engine. 
14. rpm – RPM here means engine RPM. 
15. speed – Speed data as collected from OBD device mounted in the car. 
16. tAdv – Timing advance refers to the number of degrees before top dead center (BTDC) that the spark will ignite the air-fuel mixture in the combustion chamber during the compression stroke. 
17. tPos – Refers to throttle position  

In [22]:
import numpy as np
import pandas as pd
pd.set_option('float_format', '{:.2f}'.format)
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()

In [17]:
"""
AWS Sagemaker notebook config
#############################
import boto3
bucket='thinkful-rk'
data_file = 'telematic_v2.csv'
data_location = 's3://{}/{}'.format(bucket, data_file)
df = pd.read_csv(data_location)
"""

df = pd.read_csv("Datasets/telematic_v2.csv")
# Drop rows that repeat the CSV header; they were introduced by simply concatenating multiple CSV files
df = df[df['gps_speed'] != 'gps_speed']

In [40]:
df['timeStamp'] = pd.to_datetime(df['timeStamp'])
# These columns were read as strings because of the embedded header rows; cast them to numbers
numeric_cols = ['battery', 'gps_speed', 'cTemp', 'eLoad', 'iat', 'imap',
                'kpl', 'maf', 'rpm', 'speed', 'tAdv', 'tPos']
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric)
# print(df.info())
df.describe(include='all')

Unnamed: 0,tripID,deviceID,timeStamp,accData,gps_speed,battery,cTemp,dtc,eLoad,iat,imap,kpl,maf,rpm,speed,tAdv,tPos
count,3120240.0,3120240.0,3120240,3120240.0,3120240.0,3120240.0,3120240.0,3120240.0,3120240.0,3120240.0,3120240.0,3120240.0,3120240.0,3120240.0,3120240.0,3120240.0,3120240.0
unique,,23.0,1740022,1811224.0,,,,5.0,,,,,,,,,
top,,12.0,2017-12-14 18:51:22,0.0,,,,0.0,,,,,,,,,
freq,,655360.0,10,425984.0,,,,2399305.0,,,,,,,,,
first,,,2017-11-18 16:23:30,,,,,,,,,,,,,,
last,,,2018-01-31 23:18:50,,,,,,,,,,,,,,
mean,117.21,,,,24.1,2.4,60.95,,30.93,26.15,71.91,5.48,7.92,968.86,22.64,1.88,12.44
std,101.14,,,,26.19,5.22,34.46,,27.7,15.55,49.33,9.14,8.63,672.83,25.31,7.36,25.13
min,1.0,,,,0.0,0.0,-40.0,,0.0,-40.0,0.0,0.0,0.0,0.0,0.0,-25.0,0.0
25%,35.0,,,,0.0,0.0,42.0,,0.0,19.0,10.0,0.0,0.0,702.0,0.0,0.0,0.0


In [43]:
df.head(5)

Unnamed: 0,tripID,deviceID,timeStamp,accData,gps_speed,battery,cTemp,dtc,eLoad,iat,imap,kpl,maf,rpm,speed,tAdv,tPos
0,1,0.0,2017-12-22 18:43:05,10c0f8e00448fa18c80515d30000000000000000000000...,24.26,0.0,66.0,0.0,28.63,40.0,97.0,0.0,0.0,1010.75,23.0,0.0,0.0
1,1,0.0,2017-12-22 18:43:06,1138f8c804780a1ebdf718bcf919d10617c8e301b31017...,23.15,0.0,66.0,0.0,33.73,40.0,98.0,0.0,0.0,815.5,21.0,0.0,0.0
2,1,0.0,2017-12-22 18:43:07,10f0f89804480612c30010c30714ce0520b7f41dbdf118...,18.71,0.0,66.0,0.0,43.14,40.0,98.0,0.0,0.0,862.25,17.0,0.0,0.0
3,1,0.0,2017-12-22 18:43:08,10d0f84804480d15bd0210c9f822c80017caf81ccd0517...,16.48,0.0,66.0,0.0,41.57,40.0,97.0,0.0,0.0,817.0,17.0,0.0,0.0
4,1,0.0,2017-12-22 18:43:09,1090f8c80480041dc9081cc50815c60511c60112c40514...,17.41,0.0,66.0,0.0,43.14,40.0,97.0,0.0,0.0,804.25,15.0,0.0,0.0
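
Since the proposal below centres on DTCs, one exploratory step is to check how often each vehicle reports a trouble code at all. The following is a minimal sketch on a tiny made-up frame, not on the real data; it assumes the `dtc` column has first been converted to numbers (for example with `pd.to_numeric`), and the name `dtc_share` is only illustrative.

```python
import pandas as pd

# Tiny synthetic stand-in for the telematics frame (values are made up).
sample = pd.DataFrame({
    "deviceID": [0, 0, 0, 1, 1, 2],
    "dtc":      [0, 0, 1, 0, 0, 2],
})

# Share of records per device that report at least one trouble code.
dtc_share = (
    sample.assign(has_dtc=sample["dtc"] > 0)
          .groupby("deviceID")["has_dtc"]
          .mean()
)
print(dtc_share)
```

On the real frame, the same expression with `df` in place of `sample` would show whether DTC records are spread across the fleet or concentrated in a few vehicles.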


## Experimental hypothesis
### 1. "Is it possible to diagnose technical problems of vehicles in advance?"
### 2. "Does this predictive diagnostic help the vehicle driver to minimize maintenance cost and vehicle downtime?"
First null hypothesis: It is not possible to diagnose technical problems of vehicles in advance.  
Second null hypothesis: Maintenance cost and vehicle downtime stay at the same level.

#### The problem
For an IoT startup, data-driven features that deliver value to the user base are important product developments that help secure business success. This experiment outlines a first step toward providing more value from users' vehicle telematics data. Drivers do not know when their car will break down, so they have no chance to act in advance or plan a solution accordingly. This experiment tries to address that problem. The telematics data we already have is a good starting point, and within it we take a closer look at DTCs. 

Automobile manufacturers use DTCs (Diagnostic Trouble Codes) to diagnose problems related to the vehicle. In this proposal, DTCs are the concrete measure of the "technical problems of vehicles" named in the hypothesis. 

#### The potential solution
In this experiment we would like to find out whether we can build a model that is able to predict DTCs from the telematics data.
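
Because the hypothesis is about diagnosing problems in advance, the training label has to look ahead in time rather than describe the current record. A minimal sketch of one way to build such a label is shown here. The helper `add_lookahead_label`, the column `dtc_ahead` and the `horizon` parameter are hypothetical names, and the sketch assumes a numeric `dtc` column and gap-free 1 Hz records, so that `horizon` rows correspond to roughly `horizon` seconds.

```python
import pandas as pd

def add_lookahead_label(frame, horizon):
    """Flag rows that are followed by a DTC within `horizon` later rows of the same trip."""
    frame = frame.sort_values(["deviceID", "tripID", "timeStamp"]).copy()
    has_dtc = (frame["dtc"] > 0).astype(int)
    # Group per device and trip so the look-ahead never crosses a trip boundary.
    per_trip = has_dtc.groupby([frame["deviceID"], frame["tripID"]])
    upcoming = pd.Series(0, index=frame.index)
    for step in range(1, horizon + 1):
        shifted = per_trip.shift(-step).fillna(0).astype(int)
        upcoming = (upcoming + shifted).clip(upper=1)
    frame["dtc_ahead"] = upcoming
    return frame

# Made-up example: trip 1 raises a DTC in its fourth record, trip 2 in its first.
demo = pd.DataFrame({
    "deviceID": [0] * 7,
    "tripID": [1, 1, 1, 1, 1, 2, 2],
    "timeStamp": pd.Timestamp("2018-01-01") + pd.to_timedelta(list(range(7)), unit="s"),
    "dtc": [0, 0, 0, 1, 0, 1, 0],
})
labelled = add_lookahead_label(demo, horizon=2)
print(labelled[["tripID", "dtc", "dtc_ahead"]])
```

The last record of trip 1 stays unflagged even though trip 2 starts with a DTC, because the shift is applied per trip.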

## Rollout plan and design of the experiment
First, we use the telematics data available in the CSV file above. We should also obtain any data that has already been collected after 2018-01-31. While the model is being built, data collection should continue, so that we have fresh data for future model evaluation. We also need to increase the number of vehicles in our dataset. The current count of 23 vehicles is too small to expect a statistically significant result. 

There are two ways to think about how to build the model:
1. A multiclass classification problem. The inputs are the raw telematics data, with the DTC value as the label (supervised learning). The output would be the predicted probability of each DTC value. The dtc column in this dataset takes five distinct values, where the value "0" means that no DTC was recorded. 
2. A time series approach, treated as an anomaly detection problem.

#### Notes on the first approach:
Here we assume that the data points are independent of each other. This makes it easier to choose a suitable algorithm and to split the dataset.
In this approach we split the data randomly into a training set (80%), a validation set (10%) and a test set (10%). A first educated guess for the algorithm would be multinomial logistic regression.
The downside of this approach is that the data points do in fact depend on each other over time. The telematics data is time series data: each sensor collects data at fixed, repeating intervals.
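
The first approach can be sketched as follows. This is an illustration under stated assumptions, not the final pipeline: the features and labels are synthetic placeholders, scikit-learn is assumed to be available, and stratified splitting is an addition that seems sensible because most records in the real data carry the value 0 for `dtc`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic placeholder data: 3 features, 3 classes driven by the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = np.digitize(X[:, 0], bins=[-0.5, 0.5])  # class labels 0, 1, 2

# 80% training, then split the remaining 20% evenly into validation and test.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0, stratify=y_rest)

# With the default solver, a multiclass target is fitted as a multinomial model.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

proba = model.predict_proba(X_val)        # one probability per class and row
val_accuracy = model.score(X_val, y_val)  # plain accuracy on the validation set
print(proba[:3], val_accuracy)
```

The test set stays untouched until a model has been chosen on the validation set.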

#### Notes on the second approach:
Because this is time series data, it is more difficult to split. The data can be split by deviceID or by tripID, as long as consecutive data points stay together. This approach has the characteristics of an anomaly detection problem, which should be a good starting point for picking a suitable algorithm. 
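
A grouped split that keeps consecutive data points together could look like the sketch below. It assumes that tripID values are only unique within a device, so the pair (deviceID, tripID) is used as the grouping key; the helper name `split_by_trip` and the 80/10/10 fractions are illustrative choices, and the demo frame is made up.

```python
import numpy as np
import pandas as pd

def split_by_trip(frame, train_frac=0.8, val_frac=0.1, seed=0):
    """Assign whole trips to train/val/test so no trip is cut apart."""
    keys = frame[["deviceID", "tripID"]].drop_duplicates().reset_index(drop=True)
    order = np.random.default_rng(seed).permutation(len(keys))
    n_train = int(round(train_frac * len(keys)))
    n_val = int(round(val_frac * len(keys)))
    labels = np.empty(len(keys), dtype=object)
    labels[order[:n_train]] = "train"
    labels[order[n_train:n_train + n_val]] = "val"
    labels[order[n_train + n_val:]] = "test"
    keys["split"] = labels
    # Every record inherits the split of its trip.
    merged = frame.merge(keys, on=["deviceID", "tripID"], how="left")
    return {name: part.drop(columns="split") for name, part in merged.groupby("split")}

# Made-up frame: 2 devices, 5 trips each, 4 records per trip.
demo = pd.DataFrame({
    "deviceID": np.repeat([0, 1], 20),
    "tripID": np.tile(np.repeat(np.arange(1, 6), 4), 2),
    "speed": np.arange(40),
})
parts = split_by_trip(demo)
print({name: len(part) for name, part in parts.items()})
```

Splitting by deviceID instead would be the stricter variant, since it tests whether the model generalises to vehicles it has never seen.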

If we cannot reject the first null hypothesis, we cannot advance to the second hypothesis. If we do find a good model for predicting DTCs, and can therefore reject the first null hypothesis, we will continue with the second hypothesis and test whether the predictive diagnostic helps the vehicle driver to minimize maintenance cost and vehicle downtime.

### Kick off the A/B-Test:
We build two versions of the LEVIN app. LEVIN is a mobile phone app that collects the data and presents the results to the driver. 

#### The two versions
One version has no DTC prediction model deployed; the other version does. For both versions we additionally need to track the maintenance cost and the vehicle downtime. Both variables are new data points that we will use in our evaluation plan. 
#### Sample
We take the whole LEVIN app user base and split it randomly into two halves. 
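
The random split into two halves can be made reproducible with a seeded shuffle. This is a minimal sketch; the function name `assign_variants`, the variant names and the user IDs are hypothetical.

```python
import random

def assign_variants(user_ids, seed=42):
    """Shuffle users with a fixed seed and put one half into each app version."""
    shuffled = sorted(user_ids)  # fixed starting order makes the result reproducible
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return {uid: ("control" if position < half else "treatment")
            for position, uid in enumerate(shuffled)}

# Hypothetical user IDs.
assignment = assign_variants([f"user-{i}" for i in range(10)])
print(assignment)
```

Here "control" stands for the version without the DTC prediction model and "treatment" for the version with it. Storing the assignment keeps each user in the same version for the whole experiment.
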
## Evaluation plan
#### Outcome
Our key metrics are the cumulative maintenance cost in $ and the vehicle downtime as a duration in minutes. 
Benchmark: The new version is successful if both metrics improve by 10% plus the threshold established by an A/A test.

To test for statistical significance, we will check whether the results from the two app versions differ enough, using a t-test and its p-value. 
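
A sketch of this check for one metric is given here, using Welch's t-test from SciPy on made-up maintenance costs. Several things are assumptions for illustration only: the simulated numbers, the 5% significance level, the A/A threshold of 2%, and the reading of "improve" as a relative reduction of the group mean.

```python
import numpy as np
from scipy import stats

# Made-up cumulative maintenance cost per user in $ for both app versions.
rng = np.random.default_rng(1)
control = rng.normal(loc=500, scale=80, size=400)
treatment = rng.normal(loc=400, scale=80, size=400)

# Welch's t-test does not assume equal variances in the two groups.
result = stats.ttest_ind(treatment, control, equal_var=False)

relative_change = (treatment.mean() - control.mean()) / control.mean()
aa_threshold = 0.02  # hypothetical noise level measured in an A/A test
significant = bool(result.pvalue < 0.05)
meets_benchmark = significant and bool(relative_change <= -(0.10 + aa_threshold))
print(result.statistic, result.pvalue, relative_change, meets_benchmark)
```

The same check would be repeated for vehicle downtime, and the version counts as successful only if both metrics pass.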

To get an idea of the true and false positives and negatives of the prediction model, we are going to plot the confusion matrix for each DTC.
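
The counts behind such a plot can be computed as in the sketch below. The labels and predictions are made up, and the helper names `confusion_matrix` and `one_vs_rest_counts` are only illustrative; the resulting matrix could then be drawn, for example, with `sns.heatmap`.

```python
import numpy as np

def confusion_matrix(actual, predicted, labels):
    """Rows are actual classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = np.zeros((len(labels), len(labels)), dtype=int)
    for a, p in zip(actual, predicted):
        matrix[index[a], index[p]] += 1
    return matrix

def one_vs_rest_counts(matrix, position):
    """True/false positives and negatives for one class against all others."""
    tp = matrix[position, position]
    fn = matrix[position].sum() - tp
    fp = matrix[:, position].sum() - tp
    tn = matrix.sum() - tp - fn - fp
    return {"tp": int(tp), "fp": int(fp), "fn": int(fn), "tn": int(tn)}

# Made-up DTC values and model predictions.
actual = [0, 0, 0, 1, 1, 2]
predicted = [0, 0, 1, 1, 0, 2]
matrix = confusion_matrix(actual, predicted, labels=[0, 1, 2])
counts_dtc_1 = one_vs_rest_counts(matrix, position=1)
print(matrix)
print(counts_dtc_1)
```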
## Wrap up
If we achieve our benchmark, it would be a big step toward driving customer demand, collecting even more data, and using that data to improve our model and provide even more value. The virtuous data cycle would be complete and gain momentum. 
