# Assignment 1

### Authors
* Jordi Mellado Romagosa 
* Jordi Adan Domínguez

## 3 Step: Data acquisition
**How to load the data?**

To load the data, we use web crawling with the Scrapy library. Starting from an initial URL, the crawler downloads all the files to the local machine. Once the download finishes, the downloaded files are processed again to obtain the actual contents of the raw data.
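
The second stage described above (processing the downloaded files to obtain the raw contents) could look roughly like the sketch below. It assumes the downloaded files are gzip-compressed and end in `.gz`, which the text does not state; `decompress_tree` is a hypothetical helper name, and the Scrapy download step itself is not shown.

```python
import gzip
import shutil
from pathlib import Path


def decompress_tree(root):
    """Decompress every *.gz file under root, writing the result next to it.

    Assumption: the crawler stored gzip-compressed files; the folder layout
    and file names are left untouched apart from dropping the ".gz" suffix.
    """
    written = []
    # sorted() materialises the list before any new file is created
    for gz_path in sorted(Path(root).rglob("*.gz")):
        out_path = gz_path.with_suffix("")  # drop the trailing ".gz"
        with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        written.append(out_path)
    return written
```

Keeping the decompressed file next to the original preserves the per-user folder names, which the following answers rely on.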

**Another important concern: how do we save the data once it has been read?**

Once the extraction is done, we obtain one folder per individual, and inside each folder are the files that contain the readings for each repetition. No names are changed, so it is easy to tell alcoholic and non-alcoholic users apart by looking at the folder name. This is useful if we later want to use the data from a specific user or user type.

**In which format does the data need to be kept?**

The data is extracted in the same format as the original data. After the extraction, we process the files in each folder into another folder called “results”. In this folder, one file is prepared for every user; it contains the same data as the downloaded files but is ready for processing. Each line in this file holds the following fields, in this strict order: user id, alcoholic/non-alcoholic, experiment type, experiment repetition, channel, and finally the 256 readings.
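
As an illustration of the line layout just described, the following sketch parses one line of a results file back into named fields. `Reading` and `parse_result_line` are hypothetical names, and the sketch assumes the fields are separated by whitespace; the metadata fields are kept as strings because their exact encoding is not specified here.

```python
from typing import List, NamedTuple


class Reading(NamedTuple):
    user_id: str
    group: str        # alcoholic / non-alcoholic marker
    experiment: str   # experiment type
    repetition: str   # experiment repetition
    channel: str
    samples: List[float]


def parse_result_line(line, n_samples=256):
    """Split one results line into its 5 metadata fields and the readings."""
    fields = line.split()
    if len(fields) != 5 + n_samples:
        raise ValueError(
            "expected %d fields, got %d" % (5 + n_samples, len(fields))
        )
    user_id, group, experiment, repetition, channel = fields[:5]
    samples = [float(value) for value in fields[5:]]
    return Reading(user_id, group, experiment, repetition, channel, samples)
```

Checking the field count up front makes truncated lines fail loudly instead of silently producing shorter signals.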

**How much data do we want to deal with?**

We prepare the full data set for later use at download time. Because of the large volume of data, however, it is unrealistic to use all of it at once. We therefore plan to use random files from the alcoholic and non-alcoholic groups to see whether there is a difference between them, and also to examine some individuals’ readings in isolation to see whether there is a difference between experiments and between repetitions of the same experiment.
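
A minimal sketch of the random selection described above. It assumes the group can be read from the fourth character of the folder name (`a` for alcoholic, `c` for control, as in names like `co2a0000364`); this naming rule is an assumption and should be checked against the dataset description. `group_of` and `sample_per_group` are hypothetical helpers.

```python
import random

# Assumption: fourth character of the name encodes the group
GROUP_BY_LETTER = {"a": "alcoholic", "c": "control"}


def group_of(name):
    """Return the group encoded in a folder or file name, or 'unknown'."""
    if len(name) < 4:
        return "unknown"
    return GROUP_BY_LETTER.get(name[3], "unknown")


def sample_per_group(names, k, seed=0):
    """Pick up to k names at random from every group, reproducibly."""
    rng = random.Random(seed)
    groups = {}
    for name in names:
        groups.setdefault(group_of(name), []).append(name)
    return {
        group: rng.sample(sorted(members), min(k, len(members)))
        for group, members in groups.items()
    }
```

Fixing the seed keeps the selection repeatable, so a later comparison between the two groups can be rerun on the same files.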

## 4 Step: Data Exploration
**Do you detect any problem with your data?**

Yes. The first problem we detect is that some users share the same ID: either the entries differ only in the alcoholic/non-alcoholic label, or they are completely identical. We therefore cannot trust any data obtained from these files.
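
One way to list such clashing IDs is sketched below. It assumes names follow the pattern of `co2a0000364`, that is, a three-character prefix, one group letter, and then the numeric ID; `find_conflicting_ids` is a hypothetical helper, not part of the original processing.

```python
from collections import defaultdict


def find_conflicting_ids(names):
    """Map each numeric ID that occurs more than once to the names using it.

    Assumption: the numeric ID starts at the fifth character of the name.
    """
    seen = defaultdict(list)
    for name in names:
        seen[name[4:]].append(name)
    return {
        numeric_id: sorted(users)
        for numeric_id, users in seen.items()
        if len(users) > 1
    }
```

The result covers both cases mentioned above: the same number under two group letters, and fully identical names.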

**Are there any outliers?**

**Do the users have the same number of samples?**

No, some users don’t have the full set of samples. We will have to take this into account, but having a different number of samples doesn’t mean that the data is wrong.
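
A short sketch of how the incomplete users could be listed. It assumes one `.txt` file per user in the “results” folder, with one row per line, and it treats the largest row count found as the full set, which is only a heuristic; `rows_per_file` and `incomplete_files` are hypothetical names.

```python
from pathlib import Path


def rows_per_file(folder):
    """Count the non-empty lines of every .txt file in folder."""
    counts = {}
    for path in sorted(Path(folder).glob("*.txt")):
        with open(path) as handle:
            counts[path.stem] = sum(1 for line in handle if line.strip())
    return counts


def incomplete_files(counts):
    """Return the files that have fewer rows than the largest file."""
    full = max(counts.values(), default=0)
    return {name: n for name, n in counts.items() if n < full}
```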

### 4.1. Exercise 1 Represent the 'FP1' channel (first one). Be sure to correctly specify axis.

In [43]:
import plotly.graph_objs as go
from plotly.graph_objs import *

# Layout with common style for the graphs
def getLayout(exercise_name):
    layout = go.Layout(
        title=exercise_name,
        xaxis=dict(
            title='Time (ms)'
        ),
        yaxis=dict(
            title='Voltage (µV)'
        )   
    )
    return layout

In [44]:
import plotly.plotly as py
import plotly
import pandas as pd
import numpy as np

# This line is needed to plot results on plot.ly
plotly.tools.set_credentials_file(username='x2799830', api_key='R0ETv9zOSa7Hkc3t2Q0p')
# Read data file
df = pd.read_csv("results/co2a0000364.txt", sep=" ", lineterminator="\n", header=None)
# Remove the first five columns, like subject id, channel number...
df.drop(df.columns[[0, 1, 2, 3, 4]], axis=1, inplace=True)

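# Readings of the first row (channel FP1)
voltage = np.array(df.values)[0]
# The remaining column labels run from 5 to 260; shift them to sample
# indices 1..256 and convert to milliseconds (3.906 ms per sample)
time = (np.array(df.columns.values, dtype=float) - 4) * 3.906

df.columns = time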

# Create a bar graph
data = [go.Bar(
    x = time,
    y = voltage
)]

# Plot the graph
fig = go.Figure(data=data, layout=getLayout("Exercise 1"))
py.iplot(fig, filename='Exercise1')
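
The constant 3.906 used for the time axis can be derived rather than hard-coded. The sketch below assumes that the 256 readings of a row span exactly one second, which matches the constant but is not stated in this notebook; the variable names are illustrative.

```python
SAMPLES_PER_ROW = 256         # from the file layout: 256 readings per line
RECORDING_LENGTH_MS = 1000.0  # assumption: each row covers one second

# Time between two consecutive readings
sample_period_ms = RECORDING_LENGTH_MS / SAMPLES_PER_ROW

# Time stamp of every reading, starting one period after zero
time_ms = [(i + 1) * sample_period_ms for i in range(SAMPLES_PER_ROW)]

print(sample_period_ms)  # 3.90625
```

With the rounded constant 3.906 the last sample lands at 999.936 ms instead of 1000 ms, a difference that is not visible in the plots.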

### 4.2. Exercise 2 Represent the 'FP1' channel (first one) as well as the next 3

In [45]:
voltage = np.array(df.values)

# Create traces
trace0 = go.Scatter(
    x=time,
    y=voltage[0],
    mode='lines',
    name='FP1 - Channel 0'
)

trace1 = go.Scatter(
    x=time,
    y=voltage[1],
    mode='lines',
    name='FP2 - Channel 1'
)

trace2 = go.Scatter(
    x=time,
    y=voltage[2],
    mode='lines',
    name='F7 - Channel 2'
)

trace3 = go.Scatter(
    x=time,
    y=voltage[3],
    mode='lines',
    name='F8 - Channel 3'
)

data = [trace0, trace1, trace2, trace3]

# Plot the graph
fig = go.Figure(data=data, layout=getLayout("Exercise 2"))
py.iplot(fig, filename='Exercise2')

### 4.3. Exercise 3 (optional) Represent all 64 channels

In [46]:
data = [
    go.Surface(
        z=df.values  # DataFrame.as_matrix() was removed in pandas 1.0
    )
]

layout = go.Layout(
    title='Exercise 3',
    autosize=True,
    scene=Scene(
        xaxis=XAxis(title='Sample index'),
        yaxis=YAxis(title='Channel'),
        zaxis=ZAxis(title='Voltage (µV)'),
    )

)

fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='Exercise3')

### 4.4. Are there any outliers? First you need to define what an outlier is.
An outlier is a data point that is much larger or smaller than its nearest data points. In the data we have explored, there are no outliers.
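
The definition above can be turned into a simple check, sketched here. How large a jump counts as “much larger or smaller” is not specified, so `threshold` is a free parameter chosen by the analyst, and `neighbour_outliers` is a hypothetical helper.

```python
def neighbour_outliers(values, threshold):
    """Return the indices whose value differs from every adjacent sample
    by more than threshold (same unit as the values)."""
    flagged = []
    last = len(values) - 1
    for i, value in enumerate(values):
        neighbours = []
        if i > 0:
            neighbours.append(values[i - 1])
        if i < last:
            neighbours.append(values[i + 1])
        # A lone sample has no neighbours and is never flagged
        if neighbours and all(abs(value - n) > threshold for n in neighbours):
            flagged.append(i)
    return flagged


print(neighbour_outliers([1.0, 1.2, 9.0, 1.1, 0.9], threshold=5.0))  # [2]
```

Requiring a large jump on both sides flags isolated spikes while leaving genuine steps in the signal alone.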

## 5 Step: Data Cleaning, Data Transformation and Reporting
**Why can’t data be clean at first?**

Because there are always invalid values, unanswered values, and some values that are outliers and must be ignored.

**Is our data in a tidy format?**

Originally, no. But after the processing we mentioned earlier, the data is ready for interpretation. The only thing missing is the removal of outliers and noise.



**Exercise: Clean the Data Set to Avoid Excessive Noise**

In [47]:
from numpy.linalg import norm

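# Scale every row by its largest absolute value (L-infinity norm)
X = df.values  # DataFrame.as_matrix() was removed in pandas 1.0
linfnorm = norm(X, axis=1, ord=np.inf)
X.astype(float) / linfnorm[:,None]  # np.float was removed in NumPy 1.24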

array([[-0.4485845 , -0.42404586, -0.12943129, ...,  0.21431086,
         0.28797707,  0.41077086],
       [ 0.03368336,  0.13231018,  0.23089661, ...,  0.11256058,
         0.38865105,  0.52669628],
       [-0.63160742, -0.39849792,  0.03656557, ...,  0.22305318,
         0.36288706,  0.40950896],
       ..., 
       [-0.08883107, -0.17952054, -0.17952054, ...,  0.81862107,
         0.72774577,  0.54636685],
       [-0.44710669, -0.39803094, -0.15280289, ...,  0.2396524 ,
         0.31324091,  0.43585493],
       [-0.23337474, -0.11204969,  0.06993789, ...,  0.65627329,
         0.67648033,  0.61581781]])
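
To make the effect of this scaling concrete, here is a small worked example on made-up numbers (the matrix `X` below is illustrative, not taken from the data set). Each row is divided by its largest absolute value, so every row ends up in the range [-1, 1].

```python
import numpy as np
from numpy.linalg import norm

# Two toy "channels" with very different amplitudes
X = np.array([[2.0, -4.0, 1.0],
              [0.5, 0.25, -0.1]])

# Row-wise L-infinity norm: the largest absolute value of each row
linf = norm(X, axis=1, ord=np.inf)

# Broadcasting divides every row by its own norm
scaled = X / linf[:, None]

print(linf)
print(scaled)
```

Note that this rescales amplitudes only; it does not by itself filter noise, and a row consisting entirely of zeros would cause a division by zero.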

**Exercise: Transform the Data Set to get additional Insights**

    Obtain the quantiles


In [48]:
first = df.quantile(.25)
second = df.quantile(.5)
third = df.quantile(.75)

In [49]:
first

3.906     -6.36025
7.812     -7.09000
11.718    -4.54175
15.624    -3.83025
19.530    -1.32225
23.436     0.09675
27.342     0.69700
31.248     0.41200
35.154    -1.22350
39.060    -2.27575
42.966    -3.62375
46.872    -3.39550
50.778    -1.73450
54.684    -0.88000
58.590    -0.97950
62.496    -2.44375
66.402    -3.90900
70.308    -3.25750
74.214    -3.62425
78.120    -2.99075
82.026    -2.01450
85.932    -2.28850
89.838    -3.11250
93.744    -2.52300
97.650    -0.59025
101.556    0.07350
105.462    0.55950
109.368    0.79350
113.274   -2.51300
117.180   -4.50100
            ...   
886.662   -2.91675
890.568   -1.52600
894.474    0.61825
898.380    2.06525
902.286    1.15725
906.192   -0.86700
910.098   -2.43650
914.004   -2.39575
917.910   -1.92025
921.816   -1.32725
925.722   -0.97950
929.628   -0.90525
933.534   -1.00225
937.440   -0.09150
941.346   -0.17825
945.252   -0.22925
949.158    0.80875
953.064    0.42700
956.970    0.54650
960.876    1.34050
964.782    1.41925
968.688    0

In [50]:
second

3.906     -2.3805
7.812     -2.7975
11.718    -1.6275
15.624    -0.7530
19.530     1.4545
23.436     2.9045
27.342     3.3110
31.248     2.6040
35.154     1.4035
39.060    -0.5340
42.966    -2.1970
46.872    -1.0070
50.778     0.4730
54.684     1.5310
58.590     0.5745
62.496    -0.9310
66.402    -0.4425
70.308    -0.4980
74.214    -0.0255
78.120    -0.3410
82.026     0.2545
85.932    -0.0560
89.838    -0.5850
93.744     2.4360
97.650     3.3825
101.556    3.9625
105.462    3.3165
109.368    2.7870
113.274    0.5035
117.180   -1.2920
            ...  
886.662    2.5585
890.568    2.6905
894.474    3.7235
898.380    3.9930
902.286    2.5480
906.192    1.5155
910.098    0.1885
914.004    1.2205
917.910    1.4140
921.816    0.8240
925.722    2.1110
929.628    1.5510
933.534    2.1210
937.440    3.1535
941.346    2.9905
945.252    3.5150
949.158    2.6805
953.064    3.1585
956.970    3.0365
960.876    3.7180
964.782    3.9770
968.688    3.5905
972.594    2.9805
976.500    2.1210
980.406   

In [51]:
third

3.906      0.49825
7.812      0.14975
11.718     0.37150
15.624     1.68900
19.530     4.66150
23.436     7.52000
27.342     8.41000
31.248     5.75775
35.154     3.78950
39.060     2.26850
42.966     0.99950
46.872     1.49775
50.778     1.72150
54.684     3.23500
58.590     2.94275
62.496     2.29400
66.402     2.53300
70.308     1.38375
74.214     1.36800
78.120     1.58700
82.026     3.01125
85.932     2.50225
89.838     2.95275
93.744     5.66375
97.650     6.93275
101.556    6.82050
105.462    6.72900
109.368    4.89825
113.274    2.72150
117.180    1.23625
            ...   
886.662    6.84875
890.568    7.33175
894.474    8.29050
898.380    7.83250
902.286    6.03750
906.192    5.38350
910.098    6.28900
914.004    8.65700
917.910    8.01075
921.816    7.89125
925.722    7.85075
929.628    5.68125
933.534    5.37875
937.440    4.53975
941.346    6.58950
945.252    6.74950
949.158    7.65725
953.064    8.32400
956.970    9.71950
960.876    9.69700
964.782    8.96225
968.688    9

    Obtain AR coefficients (Optional) 
    Obtain DCT coefficients (Optional) 
    Obtain PCA components (Optional) 

Optional: Try to figure out what dirty data you can find in a general data set.


