# Hand-in 3, Part 1: Data handling and exploration

In this first notebook you will show us how you handle data that is split across several files, and how you explore the quality and properties of your data.

## Section 1: Bash scripting

You have downloaded a zip file containing 5 CSV files, each containing part of the data you need. First, use your bash tools to look at the headers and sizes of the files. What do the different files contain?

Write a bash script that concatenates the 4 data files (all except flow_criticality_data.csv). Explain in the markdown cell below what each part of your script does.

**Q#1** *Explain your script here (by double clicking on this text).*

* Define that we are using bash
```bash
#! /bin/bash
```

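* Quick explanation of what the bash script does
```bash
# Joining the files
#
# Usage: ./collect_data.sh
```
* Make temp1.csv containing half the data. `-t,` means the files are comma-separated, `-a1` and `-a2` mean that unpairable lines from the first and second file are printed as well, and `-oauto` means that the output format is derived from the first line of each file, so that every line gets the same number of fields and missing fields are left empty.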
```bash
join -t, -a1 -a2 -oauto energy_demand_data.csv exchange_data.csv > temp1.csv
```
* Make temp2.csv containing the other half of data
```bash
join -t, -a1 -a2 -oauto renewable_production_data.csv generator_production_data.csv > temp2.csv
```
* Combine the temp files, to one whole file
```bash
join -t, -a1 -a2 -oauto temp1.csv temp2.csv > joined_data.csv
```
* Clean up by removing the temp files
```bash
rm temp1.csv
rm temp2.csv
```
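
The same outer join can also be sketched in pandas, which may help when checking the result of the bash script. The two small frames and their column names below are made up for illustration, and the sketch assumes that `time` is the shared key column.

```python
import pandas as pd

# Hypothetical miniature versions of two of the CSV files, keyed on 'time'
toy_demand = pd.DataFrame({"time": ["t1", "t2", "t3"],
                           "load_node_1": [10.0, 11.0, 12.0]})
toy_exchange = pd.DataFrame({"time": ["t2", "t3", "t4"],
                             "export_node_1": [-1.0, -2.0, -3.0]})

# how='outer' keeps unpairable rows from both sides, similar to join -a1 -a2
toy_joined = (pd.merge(toy_demand, toy_exchange, on="time", how="outer")
                .sort_values("time")
                .reset_index(drop=True))
print(toy_joined)
```

Rows that exist in only one file end up with NaN in the columns of the other file, which is where the missing values explored in the next section can come from.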


## Section 2: Visualizing the data
Here you will plot the data file produced in the previous section in order to identify missing data and to see whether you can already draw some conclusions from the data.

* *Hint: remember the hint given in Exc.13.3, on how to find out if your data contains NaN values*
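
One common way to check for NaN values in pandas is sketched below on a made-up frame. The column names are hypothetical, and this is not necessarily the method the exercise hint refers to.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few missing readings
toy_nan = pd.DataFrame({
    "time": [0, 1, 2, 3],
    "sensor_a": [1.0, np.nan, 3.0, 4.0],
    "sensor_b": [np.nan, np.nan, 0.5, 0.7],
})

# Number of NaN values in each column
nan_per_column = toy_nan.isnull().sum()

# Number of rows that contain at least one NaN
rows_with_nan = int(toy_nan.isnull().any(axis=1).sum())

print(nan_per_column)
print("Rows with at least one NaN:", rows_with_nan)
```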

In [23]:
# Importing the packages we need
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib notebook
import numpy as np
import scipy

In [24]:
# Import of the data created with the bash script
data = pd.read_csv('joined_data.csv')

In [25]:
# Simple plots showing a small selection of our data

data.plot(x='time',y='prod_gen_1',title='prod_gen_1')
data.plot(x='time',y='load_node_130',title='load_node_130')
data.plot(x='time',y='renew_node_24',title='renew_node_24')




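**Would it be practical to plot all possible scatter plots (scatter matrix)?**

*Not really. The number of panels in a scatter matrix grows with the square of the number of columns, and every panel has to draw all rows. With this many data points and columns, the plot would be too expensive to compute without something like a cluster, and too crowded to read.*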
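
A rough count makes the scaling concrete. The helper below is a hypothetical illustration, and the value 137 is the sensor count mentioned in the clustering task further down.

```python
def scatter_matrix_panels(n_columns):
    """Return (total panels, distinct column pairs) for an n x n scatter matrix."""
    total = n_columns ** 2
    distinct_pairs = n_columns * (n_columns - 1) // 2
    return total, distinct_pairs

for n in (5, 20, 137):
    total, pairs = scatter_matrix_panels(n)
    print(n, "columns:", total, "panels,", pairs, "distinct pairs")
```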

**Q#2** For this data, what is the reasonable approach to dealing with the NaN values? Why?

*As there is no directly observable pattern, interpolation does not seem like a good approach. Therefore we will drop the rows containing NaNs.*
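
The difference between the two options can be sketched on a made-up series. The values are hypothetical and only show what each call does, not which one suits the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical sensor readings with gaps
toy_series = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, 6.0])

# Option 1: drop the missing readings (fewer rows, no invented values)
toy_dropped = toy_series.dropna()

# Option 2: linear interpolation (same number of rows, gaps filled with a straight line)
toy_filled = toy_series.interpolate()

print(toy_dropped.tolist())
print(toy_filled.tolist())
```

Interpolation only gives sensible values if the signal varies smoothly between the known points, which is the assumption rejected in the answer above.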


In [26]:
# Removal of NaN's


print("Total number of NaN's before removal: " +str(data.isnull().sum().sum()))
# Drop any row containing any NaN's
data_nona = data.dropna(how='any')
print("Total number of NaN's after removal: " +str(data_nona.isnull().sum().sum()))

Total number of NaN's before removal: 1711
Total number of NaN's after removal: 0


### Feature reduction
Since you must reduce the number of sensors, you need to find out which ones you can get rid of.

**Q#3** Why would PCA be useful for this?

*PCA finds the directions (principal components) that carry most of the variance in the data. The loadings of each component show how strongly each sensor contributes to it, and thereby which sensors shape the overall data the most.*

*With that in mind, you can narrow down the number of sensors used to generate data. You will lose some information, but by choosing the right number of components you can get close to the full data.*
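
A minimal sketch of this idea on synthetic data: two made-up sensors follow the same signal and a third is independent noise. The sensor names `s1` to `s3` and all values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
base = rng.normal(size=500)

# Hypothetical sensors: s1 and s2 share one signal, s3 is independent
toy = pd.DataFrame({
    "s1": base + 0.05 * rng.normal(size=500),
    "s2": base + 0.05 * rng.normal(size=500),
    "s3": rng.normal(size=500),
})
toy_stand = (toy - toy.mean()) / toy.std()

toy_pca = PCA(n_components=3).fit(toy_stand)

# Rows are principal components, columns are sensors
toy_loadings = pd.DataFrame(toy_pca.components_, columns=toy.columns,
                            index=["PC1", "PC2", "PC3"])
print(toy_pca.explained_variance_ratio_.round(3))
print(toy_loadings.round(2))
```

In this toy case the first component is shared almost equally by `s1` and `s2`, which suggests that one of the two could be dropped with little loss of information.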

In [27]:
from sklearn.decomposition import PCA

# Removal of the time column, as it cannot be used in the numerical analysis
datanotime = data_nona.drop('time', axis=1)


# Removing any column where more than 90% of the data is zero
data_notime2 = datanotime.loc[:, (datanotime != 0).sum()>len(datanotime)*.1]

# First we standardize the data
data_stand = (data_notime2 - data_notime2.mean()) / data_notime2.std()

# The standardization created a lot of NaN's
# These came from dividing by zero, as a lot of the prod_gen sensors had only zero's
data_stand = data_stand.dropna(axis='columns',how='all')



print("NaN's in data set: " + str(data_stand.isnull().sum().sum()))
    
# Then instantiate the PCA, and fit it to the standardized data
comp=data_stand.shape[1]
pca = PCA(n_components=comp)
pca_data=pca.fit(data_stand)

# Plotting the explained variance
fig = plt.figure(figsize=(8,8))
sing_vals = np.arange(comp) + 1
plt.plot(sing_vals, pca.explained_variance_ratio_, 'o-', linewidth=2)
plt.title('Scree Plot')
plt.xlabel('Principal Component')
plt.ylabel('Explained variance ratio')
plt.plot(sing_vals, pca.explained_variance_ratio_.cumsum(), 'o-', linewidth=2)

plt.legend(['Individual','Cumulative'], loc='best', borderpad=0.3, 
            shadow=False,
            markerscale=0.4)

# Saving the pca components
data_components = pca.components_

#with pd.option_context('display.max_rows', None, 'display.max_columns', 3):
print((datanotime != 0).sum())

NaN's in data set: 0



load_node_130    8311
load_node_133    8311
load_node_134    8311
load_node_143    8311
load_node_16     8311
load_node_145    8311
load_node_21     8311
load_node_153    8311
load_node_154    8311
load_node_156    8311
load_node_159    8311
load_node_160    8311
load_node_162    8311
load_node_167    8311
load_node_43     8311
load_node_52     8311
load_node_181    8311
load_node_182    8311
load_node_184    8311
load_node_186    8311
load_node_60     8311
load_node_63     8311
load_node_64     8311
load_node_66     8311
load_node_68     8311
load_node_69     8311
load_node_197    8311
load_node_73     8311
load_node_77     8311
load_node_208    8311
                 ... 
prod_gen_22      3227
prod_gen_23      3304
prod_gen_24         0
prod_gen_25      3328
prod_gen_26         0
prod_gen_27      3103
prod_gen_28         0
prod_gen_29      3035
prod_gen_30      2238
prod_gen_31      3091
prod_gen_32      3210
prod_gen_33      2344
prod_gen_34      3083
prod_gen_35       750
prod_gen_3

In [7]:
# Find the number of components needed to explain 90% of the variance
cum_var = pca.explained_variance_ratio_.cumsum()
for i in range(comp):
    if cum_var[i] >= 0.9:
        print("with " + str(i+1) + " components you explain " + str(cum_var[i]*100) + "%")
        new_comp = i + 1
        break


with 21 components you explain 90.2140550348%


### Scree plot
**Q#4** How many principal components do you need to explain 90 % of the variance?

26 principal components are needed to explain more than 90% of the variance; together they explain 90.35%.
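
The threshold search can also be written without an explicit loop. The sketch below uses made-up explained-variance ratios as a stand-in for `pca.explained_variance_ratio_`.

```python
import numpy as np

# Hypothetical explained-variance ratios, standing in for pca.explained_variance_ratio_
toy_ratios = np.array([0.5, 0.3, 0.15, 0.05])
toy_cumulative = toy_ratios.cumsum()

# np.argmax returns the first index where the condition is True
n_needed = int(np.argmax(toy_cumulative >= 0.9)) + 1
print(n_needed, "components reach", toy_cumulative[n_needed - 1])
```

scikit-learn can also pick the number itself: passing a fraction such as `PCA(n_components=0.9, svd_solver='full')` keeps as many components as are needed to exceed that share of the variance.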

### Clustering
You want to reduce the number of field sensors to 20. From the previous question you should now have an array with all your loading vectors (pca.components\_), one vector per principal component, with 137 elements (one per sensor). Use clustering to group sensors that behave the same.

**Q#5** How would you choose which sensors in each cluster you should keep?
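
One possible criterion, and the one used in the cells below, is to keep the sensor that lies closest to its cluster center. The sketch here shows that selection on made-up 2-D points; the sensor names, coordinates and cluster labels are all hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical loading coordinates for six sensors in two clusters
toy_points = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0],
                       [5.0, 5.0], [5.5, 5.0], [5.2, 5.1]])
toy_labels = np.array([0, 0, 0, 1, 1, 1])
toy_names = ["a", "b", "c", "d", "e", "f"]

# Cluster centers as the mean of the member points (as k-means does on convergence)
toy_centers = np.vstack([toy_points[toy_labels == k].mean(axis=0) for k in range(2)])

toy_clust = pd.DataFrame({"sensors": toy_names, "cluster": toy_labels})
# Euclidean distance from every sensor to the center of its own cluster
toy_clust["dist_to_center"] = np.linalg.norm(toy_points - toy_centers[toy_labels], axis=1)

# Index of the closest sensor within each cluster
closest = toy_clust.groupby("cluster")["dist_to_center"].idxmin()
toy_keep = toy_clust.loc[closest, "sensors"].tolist()
print(toy_keep)
```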

In [28]:
from sklearn.cluster import KMeans

# Slice the components we need: the first new_comp components, which explain about 90% of the variance
slice=data_components[0:new_comp]

# Instantiate and fit the clusters
kmeans = KMeans(n_clusters=20, n_init=200, max_iter=300, tol=0.0001, verbose=0, random_state=None, copy_x=True)
kmeans.fit(slice.T)
cluster_pred = kmeans.predict(slice.T)

# Prints the predicted clusters
print(cluster_pred)

[ 0  0  0  0 19  0 19  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0 14 13 17  0  4  0 14 18 14 10  0  3 11  0
  3 10  0  5 14 15 13  7  0  8  6 18  8  7  6  7 16 18  6  5  6 18 18  2  2
  2  9 11  9 11 10 11 13 10  0  1 15 12 14 14  3 15  1  3 15  4  1  1  1  3
  4  4  3 12  1 15 15 15  3 14]


In [29]:
from scipy.spatial import distance

# Make a pandas dataframe of our sensors, and the clusters they belong to
clust = pd.DataFrame(data_stand.columns, columns=['sensors'])
clust['cluster'] = cluster_pred

# Calculate the distance from each sensor to its cluster center
temp = []
for i in range(slice.T.shape[0]):
    A=slice.T[i]
    B=kmeans.cluster_centers_[clust.cluster[i]]
    temp.append(distance.euclidean(A, B))
clust['dist_to_center'] = temp

# Find the sensor in each cluster that is closest to the cluster center
closest_idx = clust.groupby('cluster')['dist_to_center'].idxmin()

# Make a new pandas dataframe with the reduced sensor list
data_reduced = pd.DataFrame(clust.loc[closest_idx, 'sensors'].values, columns=['sensors'])

# Sort the sensors by their distance to the cluster center, for manual check up
sorted_clust = clust.sort_values(by=['dist_to_center'])

# For manual check
#print(data_reduced.to_string(index=False))
#with pd.option_context('display.max_rows', None, 'display.max_columns', 3):
#    print(sorted_clust)

In [30]:
# Keep only the chosen sensor columns, in their original column order
chosen = set(data_reduced.sensors)
selected = [col for col in data_nona.columns if col in chosen]

# Take an explicit copy so that columns can be added later without warnings
data_reduced = data_nona[selected].copy()
print(data_reduced)

      load_node_21  load_node_98  export_node_130  export_node_68  \
0         215.2788       28.1763       -13.551973      -42.135706   
1         206.9927       26.2146       -11.235127      -39.463754   
2         197.3522       24.4661       -11.064903      -36.113954   
3         182.1743       23.3269        -8.057087      -33.309821   
4         170.3825       22.8990        -6.710398      -33.277878   
5         163.4111       22.8241       -11.968168      -33.750251   
6         159.7063       22.9061       -14.185416      -33.848466   
7         157.6746       22.5140        -7.571923      -32.970572   
8         147.4763       22.7774        -6.455184      -32.062082   
9         149.3885       24.1631        -4.704244      -28.566337   
10        163.1721       25.8698        -8.997888      -24.867395   
11        181.8954       26.7083        -9.447832      -12.307933   
12        198.4277       27.2846        -9.721543       -5.917242   
13        201.8139       27.2777  

### Save your chosen sensors

Now that you have chosen 20 sensors which are representative of your data, create a DataFrame that contains these sensors. You can save them to a CSV file using the code in the following cell.

In [32]:
# Assuming of course that your reduced data set is called data_reduced

# assign returns a new DataFrame, so no value is set on a slice of data_nona
data_reduced = data_reduced.assign(time=data_nona['time'])
data_reduced.to_csv('reduced_field_data.csv')


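
Adding a column to a DataFrame that was produced by selecting columns from another DataFrame can trigger pandas' SettingWithCopyWarning. A minimal sketch of one way to avoid it, using made-up frames and column names, is to take an explicit copy before assigning.

```python
import pandas as pd

# Hypothetical full data set
toy_full = pd.DataFrame({"time": [0, 1, 2],
                         "a": [1.0, 2.0, 3.0],
                         "b": [4.0, 5.0, 6.0]})

# Explicit copy: the subset is independent of toy_full
toy_subset = toy_full[["a"]].copy()

# Adding a column now only affects the copy
toy_subset["time"] = toy_full["time"]
print(toy_subset)
```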
