# Loading Module Demonstration
This Jupyter Notebook explains and demonstrates the structure and function of the loading module. The module consists of five basic classes:

#### DataContainer()
- The DataContainer class implements a data structure to store features, targets and covariates. 
- The constructor builds these three data objects (their classes are described below).
- Additionally, the DataContainer class implements basic methods for adding new data to the data container, for printing summaries and so on.

#### BaseObject()
- The BaseObject class is a base class that serves as a blueprint for the features, targets and covariates classes. It exists only to be inherited from and is not used directly.
- It implements the basic structure of the data objects and all methods that these three data object classes share.

#### FeaturesObject(BaseObject), TargetsObject(BaseObject), CovariatesObject(BaseObject)
- FeaturesObject, TargetsObject and CovariatesObject each inherit from BaseObject.
- Each one builds the data object for its kind of data and overrides some of the inherited methods to accommodate the characteristics of that kind (for example, covariates need variable names, features do not).
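
The relationship between these classes can be pictured roughly as follows. This is a minimal sketch, not the module's actual code: the `Sketch` class names, the `label` attribute and the method bodies are assumptions made for illustration. Only the inheritance relationship, the `data` dictionary and the `get_names()` method name are taken from this notebook.

```python
import pandas as pd


class BaseObjectSketch:
    """Hypothetical stand-in for BaseObject: shared storage and methods."""

    label = "Base"

    def __init__(self):
        # Each variable is stored under its own name (assumed layout).
        self.data = {}

    def add(self, values, name):
        # Default behaviour: keep the positional column labels.
        self.data[name] = pd.DataFrame(values)

    def get_names(self):
        return self.data.keys()


class FeaturesObjectSketch(BaseObjectSketch):
    # Features do not need variable names, so nothing is overridden.
    label = "Features"


class CovariatesObjectSketch(BaseObjectSketch):
    label = "Covariates"

    def add(self, values, name):
        # Override: covariates carry their variable name as column label.
        self.data[name] = pd.DataFrame({name: values})


features = FeaturesObjectSketch()
features.add([[0.1, 0.2], [0.3, 0.4]], name="my_data")

covariates = CovariatesObjectSketch()
covariates.add([25, 31, 47], name="age")
```

The subclass overrides only the method that differs, which keeps the shared behaviour in one place.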



### Load modules

In [1]:
from loading import DataContainer
import numpy as np
import pandas as pd
%matplotlib inline

### Create some fake data

In [2]:
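# Create some random data: 100 samples with 2000 features each
data = np.random.rand(100,2000)
# One value per sample for each variable (randint excludes the upper bound)
IQ = np.random.randint(70,130,100)
age = np.random.randint(18,80,100)
gender = np.random.randint(0,2,100)
weight = np.random.randint(40,90,100)
height = np.random.randint(150,200,100)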

### Create a data container
Create a data container object using DataContainer(). Use the .add() method to add new data to the container. Given the input type (input_type) and the role of the data (use_as), the container loads the data from the given format and stores it in the matching data object. This builds three separate data objects (for features, targets and covariates).

In [3]:
# Create data object
dc = DataContainer()

# add features
dc.add(data, input_type='numpy', name='my_data', use_as='features')

# add targets
dc.add(IQ, input_type='numpy', name='IQ', use_as='targets')

# add covariates
dc.add(age, input_type='numpy', name='age', use_as='covariate')
dc.add(gender, input_type='numpy', name='gender', use_as='covariate')
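
One way to picture what .add() does with input_type and use_as is a pair of lookup tables: one that picks a loader, one that picks the destination. The sketch below is an assumption about the general idea, not the DataContainer implementation. `ContainerSketch`, `_loaders` and `_stores` are hypothetical names, and only the 'numpy' and 'csv' input types are covered.

```python
import numpy as np
import pandas as pd


class ContainerSketch:
    """Hypothetical illustration of routing data by input_type and use_as."""

    def __init__(self):
        self.features, self.targets, self.covariates = {}, {}, {}
        # Keys mirror the use_as values used in this notebook (assumed table).
        self._stores = {
            "features": self.features,
            "targets": self.targets,
            "covariate": self.covariates,
        }
        # One loader per input_type; each returns a DataFrame.
        self._loaders = {
            "numpy": lambda source: pd.DataFrame(np.asarray(source)),
            "csv": lambda source: pd.read_csv(source),
        }

    def add(self, source, input_type, name, use_as):
        if input_type not in self._loaders:
            raise ValueError(f"unknown input_type: {input_type!r}")
        if use_as not in self._stores:
            raise ValueError(f"unknown use_as: {use_as!r}")
        # Load with the matching loader, store in the matching data object.
        self._stores[use_as][name] = self._loaders[input_type](source)


sketch = ContainerSketch()
sketch.add(np.random.rand(10, 3), input_type="numpy", name="my_data", use_as="features")
sketch.add(np.random.randint(18, 80, 10), input_type="numpy", name="age", use_as="covariate")
```

With this layout, supporting a new format would only require registering another loader.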

### Summaries
To get summaries of the data, use the .summary() method either on the whole data container or on one of the three data objects.

TO DO: Maybe implement a summary function that gives a simpler overview for the data container, i.e. number of variables, shape, and so on...

In [4]:
dc.summary()

Features :

             0           1           2           3           4           5     \
count  100.000000  100.000000  100.000000  100.000000  100.000000  100.000000   
mean     0.484726    0.508056    0.477522    0.492798    0.535564    0.506962   
std      0.285823    0.289320    0.280862    0.322702    0.314045    0.287873   
min      0.003120    0.020154    0.003853    0.003619    0.005878    0.004665   
25%      0.263248    0.238528    0.230539    0.187344    0.275598    0.276210   
50%      0.485645    0.555010    0.456459    0.486866    0.540756    0.527414   
75%      0.721184    0.731863    0.727972    0.800129    0.841326    0.737768   
max      0.993380    0.983685    0.987809    0.997724    0.997470    0.994423   

             6           7           8           9        ...            1990  \
count  100.000000  100.000000  100.000000  100.000000     ...      100.000000   
mean     0.492073    0.525814    0.449631    0.495598     ...        0.482784   
std      0.2949

In [5]:
dc.targets.summary()

Targets :

               IQ
count  100.000000
mean    98.430000
std     18.357975
min     70.000000
25%     82.500000
50%     98.000000
75%    115.250000
max    129.000000
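
The layout of these summaries (count, mean, std, min, quartiles, max) matches what pandas' DataFrame.describe() prints. Whether .summary() calls describe() internally is an assumption; the sketch below only shows how a similar table can be produced and queried with plain pandas.

```python
import numpy as np
import pandas as pd

# Fixed seed so the example is reproducible.
rng = np.random.RandomState(0)
iq = pd.DataFrame({"IQ": rng.randint(70, 130, 100)})

# describe() returns a DataFrame with one row per statistic.
summary = iq.describe()
print(summary)

# Individual statistics can be read from the table by row label.
mean_iq = summary.loc["mean", "IQ"]
print(mean_iq)
```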




### Access data
To access data, select a specific variable (each data object keeps its variables in a dictionary called 'data', keyed by variable name) and use pandas .iloc indexing. To get the data as a NumPy array rather than a pandas DataFrame, use the .values attribute as shown in this example (older pandas versions also offered .as_matrix(), which was removed in pandas 1.0).

In the future accessing data and variables might need to be simplified a bit...

In [6]:
# Access only part of data (rows 0, 4, 5 and 9, all columns)
pandas_array = dc.features.data['my_data'].iloc[[0,4,5,9],:]
# .values returns a NumPy array; .as_matrix() was removed in pandas 1.0
matrix = dc.features.data['my_data'].iloc[[0,4,5,9],:].values
print(pandas_array)
print(matrix)

       0         1         2         3         4         5         6     \
0  0.556597  0.210305  0.940698  0.013733  0.352790  0.536781  0.873278   
4  0.949723  0.477074  0.642115  0.535315  0.452941  0.452752  0.803612   
5  0.148588  0.290260  0.751093  0.648882  0.866956  0.804609  0.825707   
9  0.650029  0.957175  0.061037  0.482761  0.594936  0.263470  0.573244   

       7         8         9       ...         1990      1991      1992  \
0  0.039877  0.874892  0.861021    ...     0.376978  0.108445  0.622759   
4  0.473780  0.954272  0.110748    ...     0.330800  0.798805  0.318103   
5  0.456094  0.070324  0.528958    ...     0.989841  0.421299  0.094478   
9  0.936150  0.017362  0.465676    ...     0.363650  0.396733  0.244623   

       1993      1994      1995      1996      1997      1998      1999  
0  0.354296  0.060893  0.440575  0.118024  0.524774  0.353858  0.769785  
4  0.312299  0.524718  0.012004  0.458731  0.431186  0.049555  0.827264  
5  0.896740  0.207057  0.9
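
The same access pattern can be tried on a plain DataFrame without the data container. This is a small standalone sketch; the frame and its contents are made up for illustration.

```python
import numpy as np
import pandas as pd

# 10 rows, 4 columns, filled with 0..39 so the selected rows are easy to check.
frame = pd.DataFrame(np.arange(40).reshape(10, 4))

# Positional selection of rows 0, 4, 5 and 9, all columns.
subset = frame.iloc[[0, 4, 5, 9], :]

# .values returns the underlying NumPy array; the row index is dropped.
array = subset.values

print(type(subset).__name__, subset.shape)
print(type(array).__name__, array.shape)
```

Note that the DataFrame keeps the original row labels (0, 4, 5, 9), while the NumPy array is indexed from 0.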

### Get names of variables
Maybe you've forgotten the variable names within your data container. This is how to get them.

In [7]:
dc.covariates.get_names()

dict_keys(['gender', 'age'])

## Loading excel and csv
To load Excel and CSV files, the loading module uses the built-in readers of pandas (read_csv() and read_excel()). Currently the files need a header row, from which the variable names are inferred.

In [8]:
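# Create a csv file with two covariates (one column each)
cov = np.concatenate((weight[:,None],height[:,None]), axis=1)
file = 'covariates.csv'
header = 'Weight,Height'
# Note: np.savetxt prefixes the header with '# ' by default, so the first
# column is read back as '# Weight'; pass comments='' to write a clean header
np.savetxt(file, cov, delimiter=",", header=header)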

# again use the .add() method and specify the input_type
# for csv input the name is irrelevant (variable names come from the file header); this needs to change in the future
dc.add(file, input_type='csv', name='not important', use_as='covariate')
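
Because the variable names come from the file header, the exact header text matters. The sketch below uses only NumPy and pandas and assumes the file is read with read_csv() and default arguments. It shows that np.savetxt writes the header with a leading '# ' unless comments='' is passed, and that this prefix ends up in the first column name.

```python
import os
import tempfile

import numpy as np
import pandas as pd

values = np.array([[70, 180], [55, 165], [82, 191]])

with tempfile.TemporaryDirectory() as tmp:
    default_path = os.path.join(tmp, "default.csv")
    clean_path = os.path.join(tmp, "clean.csv")

    # Default: np.savetxt prefixes the header line with "# ".
    np.savetxt(default_path, values, delimiter=",", header="Weight,Height")
    # comments="" writes the header without a prefix.
    np.savetxt(clean_path, values, delimiter=",", header="Weight,Height", comments="")

    # read_csv takes the first line as column names.
    default_columns = list(pd.read_csv(default_path).columns)
    clean_columns = list(pd.read_csv(clean_path).columns)

print(default_columns)
print(clean_columns)
```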

In [9]:
dc.covariates.summary()

Covariates :

       gender
count  100.00
mean     0.55
std      0.50
min      0.00
25%      0.00
50%      1.00
75%      1.00
max      1.00


             age
count  100.00000
mean    52.97000
std     18.51096
min     19.00000
25%     37.75000
50%     54.50000
75%     69.00000
max     79.00000


           Height
count  100.000000
mean   174.540000
std     14.513261
min    150.000000
25%    161.750000
50%    175.000000
75%    188.000000
max    198.000000


         # Weight
count  100.000000
mean    63.100000
std     14.016224
min     40.000000
25%     51.000000
50%     63.000000
75%     75.000000
max     87.000000




## Load .mat-files
PHOTON uses the scipy.io module to load mat-files. Currently this only works with simple variables within the mat-file; nested structures are not supported yet. To load a mat-file, use the standard .add() method and specify the file and input_type='mat'. If the mat-file contains multiple variables, pass the parameter var_name with the name of the variable you want to load.

### Create mat-file first

In [10]:
import scipy.io as spio
_dict = {}
_dict['eye_colour'] = np.random.randint(0,3,100)
spio.savemat('eyes.mat',_dict)

### Now load and add it to our data container 

In [11]:
dc.add('eyes.mat', input_type='mat', name='eye_colour', use_as='targets', var_name='eye_colour')
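
To see why a variable name is needed, it helps to look at what scipy.io.loadmat() returns for a file with more than one variable. The sketch below uses scipy directly; how the loading module processes the result is not shown here, and the second variable `hair_colour` is made up for illustration.

```python
import os
import tempfile

import numpy as np
import scipy.io as spio

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.mat")
    spio.savemat(path, {
        "eye_colour": np.random.randint(0, 3, 100),
        "hair_colour": np.random.randint(0, 4, 100),
    })
    contents = spio.loadmat(path)

# loadmat returns a dict; keys starting with "__" are file metadata.
variables = [key for key in contents if not key.startswith("__")]
print(variables)

# 1-D arrays come back as 2-D row vectors, so flatten before use.
eye_colour = contents["eye_colour"].ravel()
print(contents["eye_colour"].shape, eye_colour.shape)
```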

In [12]:
dc.targets.summary()

Targets :

               IQ
count  100.000000
mean    98.430000
std     18.357975
min     70.000000
25%     82.500000
50%     98.000000
75%    115.250000
max    129.000000


       eye_colour
count  100.000000
mean     1.050000
std      0.808728
min      0.000000
25%      0.000000
50%      1.000000
75%      2.000000
max      2.000000


