# Digital For Industrial Summative - Part 1
 
## Creating A Digital Thread


A Digital Thread is the one unifying theme or characteristic that connects every aspect of an asset or unit, from its inception and design, through manufacture, deployment, operations and maintenance, to eventual retirement.

In analysis, a digital thread is the logic with which we bind and merge the various data sources into one whole, so that it lends itself easily to quantitative approaches.

<img src = 'images/Digital_Thread.JPG' width=500>


A digital thread is a technique to 'stitch' the data that comes in disjoint tables, such that they can be put together logically. That is a task for this exercise.

Data sets provided:

We have been given 5 data sets, all of which relate to one month's worth of readings taken at a live volcano site. The volcano was instrumented with multiple sensors at 10 different geographical points (nodes). Our goal is to combine and merge all of this into one digital thread, making it amenable to analysis.
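The stitching idea can be illustrated on a toy example. The sketch below assumes, based on the column names that appear in the files loaded in this notebook, that each reading points to a sensor (`sensor_id`) and each sensor points to a data type (`data_type_id`) and a node (`node_id`). The miniature tables and all their values are made up for illustration.

```python
import pandas as pd

# Made-up miniature versions of three of the tables (values are illustrative only)
toy_types = pd.DataFrame({"id": ["t1", "t2"], "type": ["temperature", "humidity"]})
toy_sensors = pd.DataFrame({
    "id": ["s1", "s2"],
    "data_type_id": ["t1", "t2"],
    "node_id": ["n1", "n1"],
})
toy_readings = pd.DataFrame({
    "sensor_id": ["s1", "s1", "s2"],
    "value": [29.3, 29.5, 98.0],
})

# Follow the keys: reading -> sensor -> data type
toy_thread = (
    toy_readings
    .merge(toy_sensors, left_on="sensor_id", right_on="id")
    .merge(toy_types, left_on="data_type_id", right_on="id",
           suffixes=("_sensor", "_type"))
)
print(toy_thread[["sensor_id", "type", "node_id", "value"]])
```

Each reading now carries its sensor's type and node, which is what makes the combined table usable for analysis.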

Tasks:

    1.0 Read all the needed input files
    2.0 Plotting Sensor Time Series
    3.0 Descriptive Analysis One data frame at a time 
    4.0 Creating a Digital Thread from the data sets
    5.0 Time Series based analysis
    6.0 Correlations Analysis
    7.0 Data Manipulations to Merge multiple data sets
    8.0 Building A Battery Remaining-Life prediction model

In [1]:
from datetime import datetime, timedelta
import datetime as dt
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os

In [2]:
%matplotlib inline

### 1.0 Read all the needed input files

Create multiple data frames, one to hold each data table.

Convert the timestamps in every data frame to datetime format, so that time-based indexing is possible.
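As a minimal sketch of what datetime conversion buys us, the example below parses a few made-up timestamp strings and then selects one day's rows by date. The table and its values are hypothetical; only `pd.to_datetime`, `set_index` and `.loc` are standard pandas.

```python
import pandas as pd

# Made-up readings with timestamps stored as plain strings
toy = pd.DataFrame({
    "timestamp": [
        "2016-08-03 04:39:25+00:00",
        "2016-08-03 05:08:52+00:00",
        "2016-08-04 01:00:00+00:00",
    ],
    "value": [98.0, 98.0, 96.0],
})

# Parse the strings into timezone-aware datetimes
toy["timestamp"] = pd.to_datetime(toy["timestamp"], utc=True)

# With a DatetimeIndex, rows can be selected with a date string
toy_indexed = toy.set_index("timestamp").sort_index()
one_day = toy_indexed.loc["2016-08-03"]
print(one_day)
```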

In [3]:
data_dir = 'This PC/Documents/Digital4industry'

In [4]:
data_dir


'This PC/Documents/Digital4industry'

In [5]:
datapoints = pd.read_csv('datapoints.csv', sep=",")
# Parse the timestamp strings so that time-based indexing is possible
datapoints['timestamp']=pd.to_datetime(datapoints['timestamp'])
datapoints.info()
# dtat['Day'] = dtat['timestamp'].dt.day 
# datapoints['Month'] = datapoints['timestamp'].dt.month
# datapoints['Year'] = datapoints['timestamp'].dt.year
# datapoints['Time']=datapoints['timestamp'].dt.time
# data1['timestamp'] = pd.to_datetime(data1['timestamp'], format="%m/%d/%Y, %H:%M:%S",errors='raise')
# data10 = pd.to_datetime(data22, format='%H:%M:%S')

# datapoints

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 176534 entries, 0 to 176533
Data columns (total 4 columns):
id           176534 non-null object
value        63456 non-null float64
timestamp    176534 non-null datetime64[ns, UTC]
sensor_id    176534 non-null object
dtypes: datetime64[ns, UTC](1), float64(1), object(2)
memory usage: 5.4+ MB


In [6]:
datatypes = pd.read_csv('datatypes.csv', sep=",")
datatypes.head()




Unnamed: 0,id,type,si_unit,type_id
0,35dcb3c0-8679-11e6-bda3-ef77801087ee,temperature,celcius,TCA
1,35dcb3c1-8679-11e6-bda3-ef77801087ee,pressure,pascal,PA
2,35dcb3c2-8679-11e6-bda3-ef77801087ee,humidity,relative humidity,HUMA
3,35dcdad0-8679-11e6-bda3-ef77801087ee,carbon dioxide concentration,ppm,GP_CO2
4,35dcdad1-8679-11e6-bda3-ef77801087ee,hydrogen sulfide concentration,ppm,GP_H2S


In [7]:
nodes = pd.read_csv('nodes.csv', sep=",")
#data3 = pd.to_datetime(data3, format='%M:%S.%f')
nodes.head()




Unnamed: 0,id,name,description,location,status,created_at,updated_at,volcano_id
0,c5e39fa0-867a-11e6-a353-2f6c041e2491,N1,\N,\N,OFFLINE,2016-08-17 01:06:49+00,2016-09-29 19:28:05.932+00,35dc3e90-8679-11e6-bda3-ef77801087ee
1,c147ece0-8679-11e6-a353-2f6c041e2491,N9,\N,\N,OFFLINE,2016-08-07 00:51:40+00,2016-09-29 19:41:07.065+00,35dc3e90-8679-11e6-bda3-ef77801087ee
2,76309900-8679-11e6-a353-2f6c041e2491,N8,\N,\N,OFFLINE,2016-08-03 22:09:40+00,2016-09-29 19:20:17.417+00,35dc3e90-8679-11e6-bda3-ef77801087ee
3,762b8ff0-8679-11e6-a353-2f6c041e2491,N10,\N,\N,OFFLINE,2016-08-03 04:39:25+00,2016-09-29 19:20:17.431+00,35dc3e90-8679-11e6-bda3-ef77801087ee
4,c14ccee0-8679-11e6-a353-2f6c041e2491,N2,\N,\N,OFFLINE,2016-08-10 04:20:22+00,2016-09-29 19:20:17.523+00,35dc3e90-8679-11e6-bda3-ef77801087ee


In [8]:
sensors = pd.read_csv('sensors.csv', sep=",")
sensors.head()




Unnamed: 0,id,description,data_frequency,status,created_at,updated_at,data_type_id,node_id
0,c15a6370-8679-11e6-a353-2f6c041e2491,\N,1800000,OFFLINE,2016-08-07 00:51:40+00,2016-09-29 19:41:07.066+00,35dcdad0-8679-11e6-bda3-ef77801087ee,c147ece0-8679-11e6-a353-2f6c041e2491
1,7635c920-8679-11e6-a353-2f6c041e2491,\N,1800000,OFFLINE,2016-08-03 04:39:25+00,2016-09-29 20:13:52.793+00,35dcdad2-8679-11e6-bda3-ef77801087ee,762b8ff0-8679-11e6-a353-2f6c041e2491
2,763b9580-8679-11e6-a353-2f6c041e2491,\N,1800000,OFFLINE,2016-08-03 04:39:25+00,2016-09-29 20:13:52.797+00,35dcdad1-8679-11e6-bda3-ef77801087ee,762b8ff0-8679-11e6-a353-2f6c041e2491
3,763ca6f0-8679-11e6-a353-2f6c041e2491,\N,1800000,OFFLINE,2016-08-03 04:39:25+00,2016-09-29 20:13:52.803+00,35dcb3c0-8679-11e6-bda3-ef77801087ee,762b8ff0-8679-11e6-a353-2f6c041e2491
4,763d9150-8679-11e6-a353-2f6c041e2491,\N,1800000,OFFLINE,2016-08-03 04:39:25+00,2016-09-29 20:13:52.811+00,35dcb3c1-8679-11e6-bda3-ef77801087ee,762b8ff0-8679-11e6-a353-2f6c041e2491


In [9]:
volcanos = pd.read_csv('volcanos.csv', sep=",")
volcanos.head()




Unnamed: 0,id,name,description,location,status,created_at,updated_at
0,35dc3e90-8679-11e6-bda3-ef77801087ee,Masaya,"This is the world's biggest, baddest, most evi...","{11.985318299999999,-86.178342900000004}",OFFLINE,2016-09-29 19:16:23.419+00,2016-10-07 07:43:05.015+00


In [None]:
# data1=pd.merge(datapoints,datatypes,left_on='id')
# right_on='data_type_id',how='left'
# data1.head()

# outer_merged = pd.merge(datapoints,datatypes, how="outer", on=["id"])
# outer_merged#.head()
# outer_merged.shape
# outer_merged.drop(764c5e60-8679-11e6-a353-2f6c041e2491)

In [None]:
# plt.plot(datapoints['value'])
# plt.xticks(range(len(datapoints['0'])) , datapoints['4'])
# #plt.figure();


# matplotlib.style.use('ggplot')
# ax = datapoints.transpose().plot(kind='line', title ="sensor data", figsize=(15, 10), legend=True, fontsize=12)
# ax.set_xlabel("timestamp", fontsize=12)
# ax.set_ylabel("value", fontsize=12)
# plt.show()



# Plot the raw sensor values against time
datapoints.plot(x="timestamp",y="value")
plt.xlabel("timestamp",size=16)
plt.ylabel("value",size=16)
plt.title("Sensor Values Vs timestamp", size=18)



In [11]:
datapoints['Day'] = datapoints['timestamp'].dt.day 
datapoints['Month'] = datapoints['timestamp'].dt.month
datapoints['Year'] = datapoints['timestamp'].dt.year
datapoints['Time']=datapoints['timestamp'].dt.time
#plots = datapoints.groupby('sensor_id')



### 2.0 Plotting Sensor Time Series

Create separate plots for each sensor in the dataset. Store each one in a separate file.

**What (if anything) can we tell about the various sensors from the plots?**
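One possible way to produce a plot per sensor and save each to its own file is sketched below. It uses made-up readings for two hypothetical sensors and writes to a temporary folder; the file naming scheme is an assumption, not something the exercise prescribes.

```python
import os
import tempfile

import matplotlib.pyplot as plt
import pandas as pd

# Made-up readings for two hypothetical sensors
toy = pd.DataFrame({
    "timestamp": pd.to_datetime(["2016-08-03", "2016-08-04", "2016-08-03", "2016-08-04"]),
    "sensor_id": ["sensor-a", "sensor-a", "sensor-b", "sensor-b"],
    "value": [29.3, 29.9, 98.0, 97.0],
})

out_dir = tempfile.mkdtemp()  # assumption: any writable folder would do
saved_files = []
for sensor_id, group in toy.groupby("sensor_id"):
    fig, ax = plt.subplots(figsize=(10, 3))
    ax.plot(group["timestamp"], group["value"], ".-")
    ax.set_xlabel("timestamp")
    ax.set_ylabel("value")
    ax.set_title(f"Sensor {sensor_id}")
    path = os.path.join(out_dir, f"sensor_{sensor_id}.png")
    fig.savefig(path)
    plt.close(fig)  # free the figure before drawing the next one
    saved_files.append(path)

print(saved_files)
```

Giving every sensor its own axes avoids the scale problem that arises when, for example, pressure values in the tens of thousands share a plot with temperatures in the tens.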




In [12]:
Grp_dp = datapoints.groupby('sensor_id')
Grp_dp.head(30)

Unnamed: 0,id,value,timestamp,sensor_id,Day,Month,Year,Time
0,764c5e60-8679-11e6-a353-2f6c041e2491,98.000,2016-08-03 04:39:25+00:00,7635c920-8679-11e6-a353-2f6c041e2491,3,8,2016,04:39:25
1,764c5e61-8679-11e6-a353-2f6c041e2491,33.198,2016-08-03 04:39:25+00:00,763b9580-8679-11e6-a353-2f6c041e2491,3,8,2016,04:39:25
2,764dbdf0-8679-11e6-a353-2f6c041e2491,29.300,2016-08-03 04:39:25+00:00,763ca6f0-8679-11e6-a353-2f6c041e2491,3,8,2016,04:39:25
3,764dbdf1-8679-11e6-a353-2f6c041e2491,96893.110,2016-08-03 04:39:25+00:00,763d9150-8679-11e6-a353-2f6c041e2491,3,8,2016,04:39:25
4,764dbdf2-8679-11e6-a353-2f6c041e2491,459.736,2016-08-03 04:39:25+00:00,763701a1-8679-11e6-a353-2f6c041e2491,3,8,2016,04:39:25
...,...,...,...,...,...,...,...,...
167743,12cd98b0-e707-11e6-89c8-314aa4f67f8c,218.000,2016-07-01 04:00:01+00:00,1248ff60-e707-11e6-89c8-314aa4f67f8c,1,7,2016,04:00:01
167744,12cf4660-e707-11e6-89c8-314aa4f67f8c,201.000,2016-07-01 04:20:02+00:00,1248ff60-e707-11e6-89c8-314aa4f67f8c,1,7,2016,04:20:02
167745,12d07ee0-e707-11e6-89c8-314aa4f67f8c,225.000,2016-07-01 04:30:02+00:00,1248ff60-e707-11e6-89c8-314aa4f67f8c,1,7,2016,04:30:02
167746,12d0f410-e707-11e6-89c8-314aa4f67f8c,182.000,2016-07-01 04:40:02+00:00,1248ff60-e707-11e6-89c8-314aa4f67f8c,1,7,2016,04:40:02


In [61]:
# datapoints['sensor_id'].nunique()

In [15]:
sensor1 = Grp_dp.get_group("7635c920-8679-11e6-a353-2f6c041e2491")
sensor1
# sensor1

Unnamed: 0,id,value,timestamp,sensor_id,Day,Month,Year,Time
0,764c5e60-8679-11e6-a353-2f6c041e2491,98.0,2016-08-03 04:39:25+00:00,7635c920-8679-11e6-a353-2f6c041e2491,3,8,2016,04:39:25
9,765251d0-8679-11e6-a353-2f6c041e2491,98.0,2016-08-03 04:47:41+00:00,7635c920-8679-11e6-a353-2f6c041e2491,3,8,2016,04:47:41
11,7655fb50-8679-11e6-a353-2f6c041e2491,98.0,2016-08-03 05:08:52+00:00,7635c920-8679-11e6-a353-2f6c041e2491,3,8,2016,05:08:52
16,7659cbe0-8679-11e6-a353-2f6c041e2491,98.0,2016-08-03 05:14:08+00:00,7635c920-8679-11e6-a353-2f6c041e2491,3,8,2016,05:14:08
21,765dea90-8679-11e6-a353-2f6c041e2491,96.0,2016-08-03 05:28:55+00:00,7635c920-8679-11e6-a353-2f6c041e2491,3,8,2016,05:28:55
...,...,...,...,...,...,...,...,...
113243,d17e84e0-86ae-11e6-b9eb-2b0883ebdaeb,100.0,2016-09-07 15:11:11+00:00,7635c920-8679-11e6-a353-2f6c041e2491,7,9,2016,15:11:11
113923,e4893e91-86ae-11e6-b9eb-2b0883ebdaeb,100.0,2016-09-07 15:41:10+00:00,7635c920-8679-11e6-a353-2f6c041e2491,7,9,2016,15:41:10
114008,1cfc89d0-86af-11e6-b9eb-2b0883ebdaeb,99.0,2016-09-07 16:41:09+00:00,7635c920-8679-11e6-a353-2f6c041e2491,7,9,2016,16:41:09
114072,09f0beb0-86af-11e6-b9eb-2b0883ebdaeb,99.0,2016-09-07 16:11:09+00:00,7635c920-8679-11e6-a353-2f6c041e2491,7,9,2016,16:11:09


In [17]:




# # Using plotly.express
# import plotly.express as px

# datapoints = px.data.stocks()
# fig = px.line(datapoints, x='datetime', y="value")
# fig.show()
# 
# pd.options.plotting.backend = "plotly"
# df.plot(x='date', y=['sessions', 'cost'])



# import matplotlib.pyplot as plt
# import itertools
# import warnings


# # fontsizes = itertools.cycle([8, 16, 24, 32])

# def datapoints(ax):
#     ax.plot([1, 2])
#     ax.set_xlabel('timestamp', fontsize=(12))
#     ax.set_ylabel('value', fontsize=(12))
#     ax.set_title('Sensor value VS Timeframe', fontsize=(12))




In [None]:
# fig, ax = plt.subplots()
# datapoints(ax)
# plt.tight_layout()



In [None]:
# fig, (ax1, ax2) = plt.subplots(nrows=2, ncols=1)
# example_plot(ax1)
# example_plot(ax2)
# plt.tight_layout()

In [None]:
# fig, axes = plt.subplots(nrows=3, ncols=3)
# for row in axes:
#     for ax in row:
#         datapoints(ax)
# plt.tight_layout()

# isin() expects a list-like of values, so the single id is wrapped in a list
data_test=datapoints.loc[datapoints['id'].isin(['39e82000-e707-11e6-89c8-314aa4f67f8c'])]

# data_test = datapoints.loc[datapoints['id'] == '39e16940-e707-11e6-89c8-314aa4f67f8c']

data_test


In [None]:

def getgroup(sensor_id):
    # Select one sensor's readings from the grouped data (Grp_dp, defined above)
    group = Grp_dp.get_group(sensor_id)
    # Maximum reading for each value of the 'Day' column
    groupfinal = group.groupby("Day", as_index=False)["value"].max()
    return groupfinal


    



In [None]:








# # sensor1 = getgroup('764c5e60-8679-11e6-a353-2f6c041e2491')

# datapoints.plot(x="timestamp",y="value")
# plt.xlabel("timestamp",size=10)
# plt.ylabel("value",size=10)
# plt.title("Sensor Vaues Vs timestamp", size=10)

# # plt.figure(figure(figsize=(25,5))
# # plt.figure(figsize=(25,5))
# # plt.subplot(1,1,1)
# # custom_plot(x=sensor1.index, y=sensor1.value)
# # plt.title('Maximum Daily Values for c15q6370 Sensor')



In [None]:
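# Optional: switch pandas plotting to the interactive plotly backend (requires plotly to be installed)
pd.options.plotting.backend = "plotly"
sensor1.plot(x='timestamp', y='value')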

### 3.0 Descriptive Analysis One data frame at a time
3.1: How many sensors of each type are there?
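Counting sensors per type comes down to a `value_counts()` on the type column once sensors and types are joined. The sketch below uses a made-up, already joined table; the type codes follow the style of `datatypes.csv`, but the counts are illustrative only.

```python
import pandas as pd

# Made-up sensors table, already joined with its type codes
toy_sensors = pd.DataFrame({
    "id": ["s1", "s2", "s3", "s4"],
    "type_id": ["TCA", "TCA", "HUMA", "PA"],
})

# Number of sensors per type code
counts = toy_sensors["type_id"].value_counts()
print(counts)
```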






In [None]:
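s2 = pd.merge(sensors, datatypes, left_on='data_type_id', right_on='id')  # each sensor joined with its type
s2.type_id.value_counts()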

----

### 4.0 Creating a Digital Thread from the data sets

**Goal: Merge everything into one wide data table**

You have two data frames: `sensors` and `datatypes`. Merge each sensor with its type. (Hint: Use `pd.merge()`)

Question: What does this achieve? Why should we do this?

Now take the raw `datapoints` data frame and merge it with all the `sensors` and their types.
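When merging, it can help to check that the join behaved as expected. The sketch below, on made-up tables, uses two standard `merge` options: `validate` to assert that each sensor matches at most one type, and `indicator` to flag readings whose sensor was not found. The sensor id `s9` is a deliberately unmatched, hypothetical value.

```python
import pandas as pd

toy_sensors = pd.DataFrame({"id": ["s1", "s2"], "data_type_id": ["t1", "t2"]})
toy_types = pd.DataFrame({"id": ["t1", "t2"], "type": ["temperature", "humidity"]})
toy_readings = pd.DataFrame({"sensor_id": ["s1", "s2", "s9"], "value": [29.3, 98.0, 1.0]})

# validate= raises an error if a type id appears more than once in toy_types
sensors_typed = toy_sensors.merge(
    toy_types, left_on="data_type_id", right_on="id",
    suffixes=("_sensor", "_type"), validate="many_to_one",
)

# indicator= adds a _merge column showing which rows found a partner
checked = toy_readings.merge(
    sensors_typed, left_on="sensor_id", right_on="id_sensor",
    how="left", indicator=True,
)
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched[["sensor_id", "value"]])
```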

In [16]:
#merging data frame
full=pd.merge(datatypes,sensors,left_on='id',right_on='data_type_id', how='left')
# right_on='data_type_id'
full.head()



Unnamed: 0,id_x,type,si_unit,type_id,id_y,description,data_frequency,status,created_at,updated_at,data_type_id,node_id
0,35dcb3c0-8679-11e6-bda3-ef77801087ee,temperature,celcius,TCA,763ca6f0-8679-11e6-a353-2f6c041e2491,\N,1800000,OFFLINE,2016-08-03 04:39:25+00,2016-09-29 20:13:52.803+00,35dcb3c0-8679-11e6-bda3-ef77801087ee,762b8ff0-8679-11e6-a353-2f6c041e2491
1,35dcb3c0-8679-11e6-bda3-ef77801087ee,temperature,celcius,TCA,c15a6371-8679-11e6-a353-2f6c041e2491,\N,1800000,OFFLINE,2016-08-07 00:51:40+00,2016-09-29 20:13:52.836+00,35dcb3c0-8679-11e6-bda3-ef77801087ee,c147ece0-8679-11e6-a353-2f6c041e2491
2,35dcb3c0-8679-11e6-bda3-ef77801087ee,temperature,celcius,TCA,7644bd40-8679-11e6-a353-2f6c041e2491,\N,1800000,OFFLINE,2016-08-03 22:09:40+00,2016-09-29 19:20:17.417+00,35dcb3c0-8679-11e6-bda3-ef77801087ee,76309900-8679-11e6-a353-2f6c041e2491
3,35dcb3c0-8679-11e6-bda3-ef77801087ee,temperature,celcius,TCA,c15f9391-8679-11e6-a353-2f6c041e2491,\N,1800000,OFFLINE,2016-08-10 04:20:22+00,2016-09-29 19:20:17.516+00,35dcb3c0-8679-11e6-bda3-ef77801087ee,c14ccee0-8679-11e6-a353-2f6c041e2491
4,35dcb3c0-8679-11e6-bda3-ef77801087ee,temperature,celcius,TCA,763bbc90-8679-11e6-a353-2f6c041e2491,\N,1800000,OFFLINE,2016-08-03 21:50:32+00,2016-09-30 18:12:04.578+00,35dcb3c0-8679-11e6-bda3-ef77801087ee,762c5340-8679-11e6-a353-2f6c041e2491


In [None]:
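s2 = pd.merge(sensors, datatypes, left_on='data_type_id', right_on='id')  # id_x = sensor id, id_y = type id
full = pd.merge(datapoints, s2, left_on='sensor_id', right_on='id_x')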

This joins the sensor data and the data types: every sensor id and its readings now carry that sensor's data type and unit.

-----

### 5.0 Time Series based analysis

Now, we are going to take this stitched data frame and use it for our analysis. Specifically, we are going to perform time-based analysis on this data.

    Step 1: Take the `full` data frame and make the Time Stamp datetime format
    Step 2: Resample the dataset to the daily level. (One observation per date)
    Step 3: For the resampled data, calculate the daily mean, min and max values for each sensor.

Resample to get daily averages.
Then subset to select the rows you need.
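The three steps can be sketched on made-up data as follows. This is one possible approach: it groups by sensor and by calendar day using `pd.Grouper(freq="D")`, which bins timestamps into the same daily buckets that `resample('D')` uses, and then computes the mean, min and max in one pass.

```python
import pandas as pd

# Made-up readings for two hypothetical sensors
toy = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2016-08-03 04:00", "2016-08-03 16:00", "2016-08-04 04:00",
        "2016-08-03 04:00", "2016-08-04 04:00",
    ]),
    "sensor_id": ["a", "a", "a", "b", "b"],
    "value": [10.0, 20.0, 30.0, 5.0, 7.0],
})

# Step 1: timestamps become the DatetimeIndex
toy = toy.set_index("timestamp")

# Steps 2 and 3: one row per sensor per day, with mean, min and max
daily_stats = (
    toy.groupby(["sensor_id", pd.Grouper(freq="D")])["value"]
    .agg(["mean", "min", "max"])
)
print(daily_stats)
```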

In [None]:
# 'HUMA' is a type_id code; resampling needs the timestamp as the index
daily = full[full.type_id=='HUMA'].set_index('timestamp')['value'].resample('D').mean()

### 6.0 Correlations Analysis

The next task is to perform a correlation analysis. 

1. Goal: We want to find all the sensors that are strongly correlated with each other.
2. One of the reasons for doing this is that if two (or more) sensors are very highly correlated, we only need to keep one out of each correlated set. (This reduces the problem size and also takes care of collinearity-related instability in certain calculations.)
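A minimal sketch of keeping one sensor out of each highly correlated set is shown below. The data is synthetic, the column names are hypothetical, and the 0.95 cut-off is an assumption; the right threshold is a judgment call for the data at hand.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)

# Made-up wide table: s2 is nearly a copy of s1, s3 is unrelated
wide = pd.DataFrame({
    "s1": base,
    "s2": base + rng.normal(scale=0.01, size=200),
    "s3": rng.normal(size=200),
})

corr = wide.corr().abs()
# Keep only the upper triangle so that each pair is examined once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.95  # assumption: the cut-off is a judgment call
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = wide.drop(columns=to_drop)
print(to_drop)
```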

The Correlation heatmap that we are looking for should be along the following lines:


<img src = "images/Corr_plot.JPG" width=450>

First, look for correlations visually. 

**Subtask: Plot all the sensors values (of one type) over time.**

**Subtask: Identify all the temperature sensors in the data set.** Hint: These are the ones whose `type_id` is 'TCA'.

In [None]:
criteria = full.type_id == 'TCA'
plt.figure(figsize=(20,10))
# Plot against the timestamp rather than the row number
full[criteria].set_index('timestamp')['value'].plot()

In [None]:
temp_sensors = [s for s in s2[s2.type_id =='TCA']['id_x']]
temperature_sensors = [s for t,s in zip(s2['type_id'], s2['id_x']) if t=='TCA'] # alternative way

**Using pandas to plot all the temperature sensors on a single plot.**

- Use pandas to loop over each temperature sensor, and plot them one at a time.
- Hint: The trick is to plot one line at a time, over and over in a loop.

In [None]:
plt.figure(figsize=(20,5))

for s in temp_sensors:
    #print(s)
    sub_df = full[full['sensor_id']==s]
    plt.plot(sub_df['timestamp'], sub_df['value'], '.')
    

This plot is fine, but all the lines are too close together. We cannot see how each sensor is behaving. For that, we can try drawing "Subplots." In these plots, each sensor gets its own plot (called a 'panel').

**Creating Subplots - Each sensor gets its own panel**

In [None]:
fig, axarr = plt.subplots(10, sharex=True)
fig.set_size_inches(20,30) 

for i,s in enumerate(temp_sensors):
    sub_df = full[full['sensor_id']==s]
    # Draw each sensor on its own panel, against time
    axarr[i].plot(sub_df['timestamp'], sub_df['value'], '.')
    

**Task: Creating a reshaped Data Frame of just Temperature sensors**

For this, we are going to have each column be 1 sensor... from 1 to 10. The rows will be timestamps, as before.

Hint: pd.pivot() is perfect for this task.
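As a small illustration of the reshaping, the sketch below pivots a made-up long table (one row per reading) into a wide one (one column per sensor). The sensor ids and values are hypothetical.

```python
import pandas as pd

# Made-up long-format readings: one row per (timestamp, sensor)
long_df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2016-08-03", "2016-08-03", "2016-08-04", "2016-08-04"]),
    "sensor_id": ["a", "b", "a", "b"],
    "value": [29.3, 30.1, 29.9, 30.4],
})

# One row per timestamp, one column per sensor
wide_df = long_df.pivot(index="timestamp", columns="sensor_id", values="value")
print(wide_df)
```

Note that `pivot` raises a `ValueError` when the same (index, column) pair occurs more than once, so repeated readings have to be dealt with first.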

In [None]:
# df.pivot(index='patient', columns='obs', values='score')

In [None]:
temp_df = full[full['sensor_id'].isin(temp_sensors)]
# Keep timestamp as an ordinary column; the duplicate checks need it
temp_df = temp_df.reset_index(drop=True)
temp_df.pivot(columns='sensor_id', values='value')



Since this is a real sensor data set, there are some time stamps and sensor_id's that are repeating. (Unfortunately, this happens often in real data sets.)

**Task: Find all rows with the same [Timestamp, Sensor_id] and delete them**

**Here's a clever way to find out all the duplicated rows.**

Some Timestamp and sensor_id are repeating. That causes Indexing problems.

In [None]:
#pd.concat(g for _, g in df.groupby("ID") if len(g) > 1)
pd.concat(g for _, g in temp_df.groupby(['timestamp', 'sensor_id']) if len(g) > 1)

**Task: Drop all the rows where timestamp and sensor_id are duplicated**
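On a made-up table, flagging and dropping repeated (timestamp, sensor_id) pairs might look like the sketch below. `duplicated(keep=False)` marks every member of a repeated group, while `drop_duplicates(keep='first')` keeps the first row of each group. Keeping the first row is an assumption here; averaging the repeats would be another reasonable choice.

```python
import pandas as pd

# Made-up readings where one (timestamp, sensor_id) pair occurs twice
toy = pd.DataFrame({
    "timestamp": pd.to_datetime(["2016-08-03", "2016-08-03", "2016-08-04"]),
    "sensor_id": ["a", "a", "a"],
    "value": [29.3, 29.4, 29.9],
})

keys = ["timestamp", "sensor_id"]

# Every row that shares its key with at least one other row
dupes = toy[toy.duplicated(subset=keys, keep=False)]

# Keep only the first row of each repeated key
deduped = toy.drop_duplicates(subset=keys, keep="first")
print(dupes)
print(deduped)
```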

In [None]:
tdf = temp_df.drop_duplicates(subset = ['timestamp', 'sensor_id'], keep='first')
# Hint: Look at https://segment.com/blog/5-advanced-testing-techniques-in-go/

In [None]:
temp_df_cor = tdf.pivot(index='timestamp', columns='sensor_id', values='value')


In [None]:
daily_temp_cor_df = temp_df_cor.resample('D').mean()

Now, we are finally ready to calculate the correlations across sensor values. Hint: Use `corr()`

In [None]:
corr_df = daily_temp_cor_df.corr()

In [None]:
# Create a mask to display only the lower triangle of the matrix (since it's mirrored around its 
# top-left to bottom-right diagonal).
mask = np.zeros_like(corr_df)
mask[np.triu_indices_from(mask)] = True


In [None]:
# Create the heatmap using the seaborn library.
# A list of colormaps (parameter 'cmap') is available here: http://matplotlib.org/examples/color/colormaps_reference.html
sns.heatmap(corr_df, cmap='RdYlGn_r', vmax=1.0, vmin=-1.0 , mask = mask, linewidths=2.5)
 
# Reorient the labels for each column and row to make them easier to read, then show the plot.
plt.yticks(rotation=0) 
plt.xticks(rotation=90) 
plt.show()

In [None]:
full2 = full.drop_duplicates(subset = ['timestamp', 'sensor_id'], keep='first')

In [None]:
full_wide = full2.pivot(index='timestamp', columns='sensor_id', values='value')


In [None]:
daily_all_sensors = full_wide.resample('D').mean()

In [None]:
daily_all_sensors

**Trying to build a Linear Model**

In order to do that, we first need to create a data frame with the columns representing only those sensors for ONE NODE.

Try to see if pd.pivot() can help with grouping Nodes together

In [None]:
nodes.name # so we have 10 "nodes" with 6 sensors each. [T, Pr, HUMA, PPM , PPM2, BATT]

In [None]:
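# Explicit suffixes keep the node columns from clashing with the id_x/id_y columns already in full
fullnode = pd.merge(full, nodes, left_on='node_id', right_on='id', suffixes=('', '_node'))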

In [None]:
fullnode.columns, fullnode.index

In [None]:
fullnode.set_index('timestamp', inplace=True)

In [None]:
# Average only the numeric columns; the text columns cannot be averaged
fullnode.resample('D').mean(numeric_only=True)

In [None]:
# fullnode.pivot_table(index=['timestamp', 'name'], columns='type', values='value')
fn_wide = fullnode.pivot_table(index=['timestamp','name'], columns=['type'], values='value')


In [None]:
#Now, let's make the Node ('name') into its own column. We do this by reset_index() for that level (=1)
fn_wide.reset_index(level=1, inplace=True)

In [None]:
fn_wide.head()

Before we can perform Linear Regression, we have one last step remaining. We'd like to "resample" all the data, aggregating it down to 'Daily' Levels.

In [None]:
# 'name' is a text column, so restrict the daily mean to numeric columns
lmfn = fn_wide.resample('D').mean(numeric_only=True)

In [None]:
lmfn.shape

### End of Stitching. 

**The Digital Thread for this dataset has been created**

This "digital Thread" has been used to 'stitch' the data frame with all the values we wish to analyze.

---
Now we finally have the data frame in the shape we wanted to enable Linear Regression.

### 8.0 Sample Modeling

**8.1 Building a Battery Remaining-Life prediction model**

Build a machine learning model (linear regression, tree-based or any other) to try to predict the remaining battery life as a function of any of the other sensor characteristics.

* Which variable (sensor) is a good predictor of battery life?
* Is your linear regression a "good fit?"
* What is the RMSE of your predicted values?
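For reference, RMSE and the goodness of fit can be computed as in the sketch below. The data is synthetic: a made-up "battery" reading that falls roughly linearly with a made-up "temperature", so the numbers say nothing about the volcano data. The 15-observation hold-out is likewise an arbitrary choice.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
# Made-up data: battery level falling roughly linearly with temperature
temperature = rng.uniform(20, 40, size=60).reshape(-1, 1)
battery = 120 - 1.5 * temperature.ravel() + rng.normal(scale=2.0, size=60)

# Hold out the last 15 observations for testing
x_train, x_test = temperature[:-15], temperature[-15:]
y_train, y_test = battery[:-15], battery[-15:]

model = LinearRegression().fit(x_train, y_train)
predicted = model.predict(x_test)

# RMSE: square root of the mean squared prediction error, in the target's units
rmse = np.sqrt(np.mean((predicted - y_test) ** 2))
# R^2 on the held-out data: 1.0 would be a perfect fit
r2 = model.score(x_test, y_test)
print(f"RMSE = {rmse:.2f}, R^2 = {r2:.2f}")
```

Reporting the error on held-out data rather than on the training set gives a more honest picture of how well the model would predict unseen readings.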

In [None]:
#==============================================================================
# Supervised learning linear regression
#==============================================================================

from sklearn import linear_model

# Split the data into training/testing sets
train = lmfn[:-30]
test = lmfn[-30:]

In [None]:
train.columns

In [None]:
target, predictors = 'battery', 'temperature'

x_train=train[predictors].to_frame() # to_frame() turns the Series into a one-column DataFrame, the 2-D shape scikit-learn expects
y_train=train[target].to_frame()
x_test=test[predictors].to_frame()
y_test=test[target].to_frame()

In [None]:
# 2.- Create linear regression object
regr = linear_model.LinearRegression()

# 3.- Train the model using the training sets
regr.fit(x_train,y_train)

# The fitted slope (coef_ has shape (1, 1) because y_train is a one-column frame)
print("Coefficient: ", regr.coef_.item())
# The mean squared error on the training set
print("Mean squared error: %.2f"
      % np.mean((regr.predict(x_train) - y_train.to_numpy()) ** 2))