# Outreachy Datascience
### Issue 6
#### Device Failure: Modeling Dataset - Feature Generation
The current telemetry payload is too large. We need a good representation that approximates the dataset, with a reduced number of columns.
-  What are some methods for reducing the size of the dataset?
-  What is the trade-off between dataset size and fidelity to the original dataset?


In [41]:
# import the libraries used for data handling and plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [57]:
#load csv and assess data
df = pd.read_csv('../device-failure/device_failure.csv')
df.head()

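|   | date | device | attribute1 | attribute2 | attribute3 | attribute4 | attribute5 | attribute6 | attribute7 | attribute8 | attribute9 | failure |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15001 | S1F01085 | 215630672 | 56 | 0 | 52 | 6 | 407438 | 0 | 0 | 7 | 0 |
| 1 | 15001 | S1F0166B | 61370680 | 0 | 3 | 0 | 6 | 403174 | 0 | 0 | 0 | 0 |
| 2 | 15001 | S1F01E6Y | 173295968 | 0 | 0 | 0 | 12 | 237394 | 0 | 0 | 0 | 0 |
| 3 | 15001 | S1F01JE0 | 79694024 | 0 | 0 | 0 | 6 | 410186 | 0 | 0 | 0 | 0 |
| 4 | 15001 | S1F01R2B | 135970480 | 0 | 0 | 0 | 15 | 313173 | 0 | 0 | 3 | 0 |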


## Assess Dataset 
The following assessment checks the dataset for issues such as abnormal values, empty cells, and duplicated rows.


In [43]:
#length
len(df)

124494

In [44]:
#number of columns
len(df.columns)

12

In [45]:
#check for null columns
df.isnull().any()

date          False
device        False
attribute1    False
attribute2    False
attribute3    False
attribute4    False
attribute5    False
attribute6    False
attribute7    False
attribute8    False
attribute9    False
failure       False
dtype: bool

In [46]:
# count fully duplicated rows
df.duplicated().sum()

0

This data is relatively clean, with no duplicated records and no null cells. The next step is to check whether any of the attributes correlate strongly with failure.

I considered splitting the date column into separate components. However, keeping it as a single integer has advantages:
-  Grouping is easy because the values are plain integers.
-  Sorting is easy for the same reason.
-  It keeps the size of the dataset small.

The drawback is that anyone working with the data must know how the date is encoded in order to interpret it.
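
The trade-off described above can be sketched on a few made-up rows. The encoding used here is an assumption, not something the dataset documents: the sketch treats `15001` as two-digit year plus day of year (`%y%j`), which would make it 1 January 2015. The frame `toy` and its values are hypothetical.

```python
import pandas as pd

# Toy frame mimicking the integer date column; the values are made up.
toy = pd.DataFrame({"date": [15001, 15001, 15032, 15365]})

# Integer dates can be grouped and sorted directly, with no parsing step.
rows_per_day = toy.groupby("date").size()

# ASSUMPTION: the integer is YYDDD (two-digit year + day of year).
# If the real encoding differs, this format string must change.
toy["parsed"] = pd.to_datetime(toy["date"].astype(str), format="%y%j")

# Once parsed, calendar fields such as the month become available,
# at the cost of an extra (and larger) column.
toy["month"] = toy["parsed"].dt.month
```

Keeping the integer column and parsing only when calendar fields are needed preserves the small footprint while still allowing date arithmetic on demand.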

In [56]:
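# compute the correlation matrix for the attribute and failure columns
# (the first two columns, date and device, are skipped)

corr = df[df.columns[2:]].corr()
# Styler.set_precision was removed in pandas 2.0; format(precision=...) replaces it
corr.style.background_gradient().format(precision=2)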

|   | attribute1 | attribute2 | attribute3 | attribute4 | attribute5 | attribute6 | attribute7 | attribute8 | attribute9 | failure |
|---|---|---|---|---|---|---|---|---|---|---|
| attribute1 | 1.0 | -0.0042 | 0.0037 | 0.0018 | -0.0034 | -0.0015 | 0.00015 | 0.00015 | 0.0011 | 0.002 |
| attribute2 | -0.0042 | 1.0 | -0.0026 | 0.15 | -0.014 | -0.026 | 0.14 | 0.14 | -0.0027 | 0.053 |
| attribute3 | 0.0037 | -0.0026 | 1.0 | 0.097 | -0.0067 | 0.009 | -0.0019 | -0.0019 | 0.53 | -0.00095 |
| attribute4 | 0.0018 | 0.15 | 0.097 | 1.0 | -0.0098 | 0.025 | 0.046 | 0.046 | 0.036 | 0.067 |
| attribute5 | -0.0034 | -0.014 | -0.0067 | -0.0098 | 1.0 | -0.017 | -0.0094 | -0.0094 | 0.0059 | 0.0023 |
| attribute6 | -0.0015 | -0.026 | 0.009 | 0.025 | -0.017 | 1.0 | -0.012 | -0.012 | 0.021 | -0.00055 |
| attribute7 | 0.00015 | 0.14 | -0.0019 | 0.046 | -0.0094 | -0.012 | 1.0 | 1.0 | 0.0069 | 0.12 |
| attribute8 | 0.00015 | 0.14 | -0.0019 | 0.046 | -0.0094 | -0.012 | 1.0 | 1.0 | 0.0069 | 0.12 |
| attribute9 | 0.0011 | -0.0027 | 0.53 | 0.036 | 0.0059 | 0.021 | 0.0069 | 0.0069 | 1.0 | 0.0016 |
| failure | 0.002 | 0.053 | -0.00095 | 0.067 | 0.0023 | -0.00055 | 0.12 | 0.12 | 0.0016 | 1.0 |


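We can reduce the size of the dataset by dropping columns, and one way to choose which ones is to check how each variable correlates with the failure variable. From the matrix above, attribute2 (0.053), attribute4 (0.067), attribute7 (0.12) and attribute8 (0.12) have the strongest correlation with failure, although all of these values are weak in absolute terms. The other columns show almost no linear relationship with failure and are candidates for removal. Note also that attribute7 and attribute8 are perfectly correlated with each other (1.0), so one of them carries no additional linear information. The advantage of using correlation over covariance is that it is normalised and so does not depend on the scale of the column values.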
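
The selection step described above could be automated along these lines. This is a minimal sketch on synthetic data, not the notebook's actual procedure: the column names `attr_a`, `attr_b`, `attr_c` and the `threshold` value are hypothetical, and the cut-off is chosen for the toy data rather than for the device-failure dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the telemetry data (all values are made up).
rng = np.random.default_rng(0)
n = 1000
failure = rng.integers(0, 2, size=n)
toy = pd.DataFrame({
    "attr_a": rng.normal(size=n),                        # pure noise
    "attr_b": failure + rng.normal(scale=0.5, size=n),   # related to failure
    "failure": failure,
})
toy["attr_c"] = toy["attr_b"]  # exact copy, like attribute7 / attribute8

corr = toy.corr()

# Rank the attributes by absolute correlation with the target.
ranked = corr["failure"].drop("failure").abs().sort_values(ascending=False)

# Hypothetical cut-off for this toy data; pick one that suits the real data.
threshold = 0.3
selected = ranked[ranked >= threshold].index.tolist()

# Among the selected columns, flag pairs that are perfectly correlated,
# since keeping both adds no linear information.
redundant = [
    (a, b)
    for i, a in enumerate(selected)
    for b in selected[i + 1:]
    if np.isclose(corr.loc[a, b], 1.0)
]
```

Correlation only captures linear relationships, so a column that falls below the threshold may still be informative in a non-linear way.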

Other ways to check for relationships:
-  plot the variables against each other
-  examine the joint, marginal, and conditional distributions
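
As an illustration of the distribution checks listed above, the following sketch compares one attribute across the two failure classes. The data is synthetic, and the names `attribute_x` and `high_x` as well as the cut-off of 3 are hypothetical choices made for the example.

```python
import numpy as np
import pandas as pd

# Synthetic data: roughly 10% failures, with a higher attribute_x on failure.
rng = np.random.default_rng(1)
n = 500
failure = (rng.random(n) < 0.1).astype(int)
toy = pd.DataFrame({
    "failure": failure,
    "attribute_x": rng.poisson(lam=np.where(failure == 1, 8.0, 1.0)),
})

# Marginal distribution: summary of the attribute on its own.
marginal = toy["attribute_x"].describe()

# Conditional distribution: the same summary, split by failure class.
conditional = toy.groupby("failure")["attribute_x"].describe()

# Joint distribution: share of rows in each (high_x, failure) combination.
high_x = (toy["attribute_x"] > 3).rename("high_x")
joint = pd.crosstab(high_x, toy["failure"], normalize="all")
```

A clear gap between the rows of `conditional` suggests the attribute separates failures from non-failures, even when the overall correlation coefficient is small.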


In [72]:
# create a reduced dataframe to demonstrate the timing improvements
df_small = df.drop(['device','attribute3','attribute5','attribute6','attribute9'], axis=1)

In [75]:
%%timeit
df.describe()

80.7 ms ± 1.77 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [76]:
%%timeit
df_small.describe()

49.9 ms ± 1.21 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [79]:
%%timeit 
new_df = df[df['failure'] == 1]['attribute7']

664 µs ± 21.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [81]:
%%timeit
new_df = df_small[df_small['failure'] == 1]['attribute7']

556 µs ± 10.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [84]:
%%timeit
df.groupby(['failure']).size()

1.91 ms ± 42.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [85]:
%%timeit
df_small.groupby(['failure']).size()

1.85 ms ± 2.42 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


For describe(), the full dataframe takes about 1.6 times as long as the smaller one (80.7 ms versus 49.9 ms).
For the row filter and the group-by, the smaller dataframe is still faster, but by a much narrower margin (664 µs versus 556 µs, and 1.91 ms versus 1.85 ms). In a time where every microsecond counts in system performance, even the roughly 30 millisecond difference on describe() is very large.

However, it should be noted that the smaller dataframe loses data. You can never be too sure that the attributes left out carry no useful signal (apart from the device attribute, which is an identifier rather than a measurement). Thus, a better solution might be to make two smaller dataframes: one for the important attributes and one for the unimportant attributes.
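
The two-dataframe idea can be sketched as follows. This is a minimal illustration on a made-up frame, assuming both halves keep the original row index so that they can be rejoined without loss. The names `important_cols`, `df_important` and `df_rest` are hypothetical.

```python
import pandas as pd

# Small made-up frame standing in for the full dataset.
toy = pd.DataFrame({
    "date": [15001, 15001, 15002],
    "attribute2": [56, 0, 0],
    "attribute3": [0, 3, 0],
    "attribute7": [0, 0, 8],
    "failure": [0, 0, 1],
})

# Hypothetical split: columns kept for day-to-day analysis versus the rest.
important_cols = ["date", "attribute2", "attribute7", "failure"]
df_important = toy[important_cols]
df_rest = toy.drop(columns=important_cols)

# Both halves share the original index, so joining on it restores every
# column; reindexing the columns puts them back in their original order.
restored = df_important.join(df_rest)[toy.columns]
lossless = restored.equals(toy)

# The working frame is smaller in memory than the full one.
bytes_full = toy.memory_usage(deep=True).sum()
bytes_important = df_important.memory_usage(deep=True).sum()
```

With this layout the fast path uses only `df_important`, while `df_rest` stays available for the cases where the left-out attributes turn out to matter.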