# Loading Module Demonstration
This Jupyter notebook explains and demonstrates the structure and function of the loading module. The module consists of five classes: 

#### DataContainer()
- The DataContainer class implements a data structure to store features, targets and covariates. 
- The constructor builds these three data objects (see other classes later). 
- Additionally, the DataContainer class implements basic methods for adding new data to the data container, for printing summaries and so on.

#### BaseObject()
- The BaseObject class is a base class that serves as the blueprint for the features, targets and covariates classes. It exists only to be inherited from and is not used directly. 
- It implements the basic structure of the data objects and all methods that these three data object classes share.

#### FeaturesObject(BaseObject), TargetsObject(BaseObject), CovariatesObject(BaseObject)
- FeaturesObject, TargetsObject and CovariatesObject each inherit from BaseObject. 
- Each one builds the data object for its kind of data and overrides some of the inherited methods to fit that kind's characteristics (for example, covariates need variable names, features do not). 
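
The inheritance pattern described above can be sketched as follows. This is a minimal illustration and not the loading module's actual code: the `label` attribute, the `_to_frame` helper and the method bodies are assumptions made for this example; only the class names and `get_names()` appear in this notebook.

```python
import pandas as pd


class BaseObject:
    """Hypothetical base class: stores named variables as pandas DataFrames."""

    label = "Data"  # assumed attribute, overridden by each subclass

    def __init__(self):
        self.data = {}  # variable name -> DataFrame

    def add(self, values, name):
        self.data[name] = self._to_frame(values, name)

    def _to_frame(self, values, name):
        # Default behaviour: one column, labelled with the variable name.
        return pd.DataFrame({name: values})

    def get_names(self):
        return list(self.data.keys())


class FeaturesObject(BaseObject):
    label = "Features"

    def _to_frame(self, values, name):
        # Override: feature matrices keep positional column labels (0, 1, ...).
        return pd.DataFrame(values)


class TargetsObject(BaseObject):
    label = "Targets"


class CovariatesObject(BaseObject):
    label = "Covariates"


features = FeaturesObject()
features.add([[0.1, 0.2], [0.3, 0.4]], name="my_data")
covariates = CovariatesObject()
covariates.add([25, 40], name="age")
print(features.get_names(), list(features.data["my_data"].columns))
print(covariates.get_names(), list(covariates.data["age"].columns))
```

Shared behaviour lives once in the base class, and each subclass overrides only what differs, here the way column labels are assigned.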



### Load modules

In [2]:
from loading import DataContainer
import numpy as np
import pandas as pd
%matplotlib inline

### Create some fake data

In [16]:
# Create some data
data = np.random.rand(100,2000)
IQ = np.random.randint(70,130,100)
age = np.random.randint(18,80,100)
gender = np.random.randint(0,2,100)
weight = np.random.randint(40,90,100)
height = np.random.randint(150,200,100)

### Create a data container
Create a data container object using DataContainer(). Use the .add() method to add new data to the container. The input_type argument states the format of the input and use_as states what the data is used for ('features', 'targets' or 'covariate', as in the cell below). From these two arguments the data container handles loading from different formats and stores the data in the correct data structure. This builds three separate data objects (for features, targets and covariates). 

In [4]:
# Create data object
dc = DataContainer()

# add features
dc.add(data, input_type='numpy', name='my_data', use_as='features')

# add targets
dc.add(IQ, input_type='numpy', name='IQ', use_as='targets')

# add covariates
dc.add(age, input_type='numpy', name='age', use_as='covariate')
dc.add(gender, input_type='numpy', name='gender', use_as='covariate')
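
How a single .add() call could route data by input_type and use_as can be sketched like this. The class `SketchContainer`, its `stores` and `loaders` dictionaries and the error messages are hypothetical names chosen for this illustration; the real DataContainer may work differently.

```python
import numpy as np
import pandas as pd


class SketchContainer:
    """Hypothetical stand-in for DataContainer; not the loading module's code."""

    def __init__(self):
        # One store per role; the keys mirror the use_as values used above.
        self.stores = {"features": {}, "targets": {}, "covariate": {}}
        # One loader per input format; each returns a pandas DataFrame.
        self.loaders = {
            "numpy": lambda source: pd.DataFrame(np.asarray(source)),
            "csv": lambda source: pd.read_csv(source),
        }

    def add(self, source, input_type, name, use_as):
        if input_type not in self.loaders:
            raise ValueError(f"unknown input_type: {input_type!r}")
        if use_as not in self.stores:
            raise ValueError(f"unknown use_as: {use_as!r}")
        frame = self.loaders[input_type](source)
        # A single-column array takes the given name as its column label.
        if input_type == "numpy" and frame.shape[1] == 1:
            frame.columns = [name]
        self.stores[use_as][name] = frame


rng = np.random.default_rng(0)
sketch = SketchContainer()
sketch.add(rng.random((100, 20)), input_type="numpy", name="my_data", use_as="features")
sketch.add(rng.integers(18, 80, 100), input_type="numpy", name="age", use_as="covariate")
print(sketch.stores["features"]["my_data"].shape)
print(list(sketch.stores["covariate"]["age"].columns))
```

Keeping the loaders in a dictionary means a new format only needs a new entry, and unknown values fail early with a clear error.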

### Summaries
To get summaries of the data, use the .summary() function either for the whole data container or one of the three data objects.

TO DO: Maybe implement a summary function that gives a simpler overview for the data container, i.e. number of variables, shape, and so on...

In [5]:
dc.summary()

Features :

             0           1           2           3           4           5     \
count  100.000000  100.000000  100.000000  100.000000  100.000000  100.000000   
mean     0.500701    0.507193    0.500982    0.499879    0.510912    0.476950   
std      0.294265    0.299845    0.299243    0.318515    0.307352    0.280546   
min      0.004979    0.002660    0.000491    0.013647    0.015312    0.008887   
25%      0.242743    0.243073    0.226596    0.220033    0.218217    0.261960   
50%      0.471637    0.529359    0.543967    0.472527    0.475626    0.407187   
75%      0.761327    0.758898    0.732873    0.813078    0.801826    0.730163   
max      0.997244    0.998911    0.999402    0.991931    0.997855    0.998637   

             6           7           8           9        ...            1990  \
count  100.000000  100.000000  100.000000  100.000000     ...      100.000000   
mean     0.516416    0.498919    0.541940    0.514799     ...        0.460371   
std      0.3017

In [6]:
dc.targets.summary()

Targets :

               IQ
count  100.000000
mean    99.960000
std     16.563877
min     70.000000
25%     86.750000
50%    101.000000
75%    113.000000
max    129.000000




### Access data
To access data, select a specific variable (each data object keeps its variables in a dictionary called 'data', keyed by variable name) and use pandas .iloc indexing. This returns a pandas DataFrame. To get the data as a plain NumPy array instead, use the .values attribute as shown in this example. (Older pandas versions also offered .as_matrix(), which was removed in pandas 1.0.)

In the future, accessing data and variables might need to be simplified a bit...

In [11]:
# Select rows 0, 4, 5 and 9 (all columns) as a pandas DataFrame
pandas_array = dc.features.data['my_data'].iloc[[0,4,5,9],:]
# Convert the same selection to a plain NumPy array
matrix = pandas_array.values
print(pandas_array)
print(matrix)

       0         1         2         3         4         5         6     \
0  0.098796  0.118214  0.864925  0.558561  0.183244  0.138694  0.566432   
4  0.345449  0.842837  0.371190  0.360023  0.192961  0.528998  0.263611   
5  0.249701  0.554303  0.706898  0.016218  0.850104  0.164721  0.931956   
9  0.176355  0.485849  0.262210  0.812424  0.908076  0.405098  0.755206   

       7         8         9       ...         1990      1991      1992  \
0  0.005187  0.140163  0.324155    ...     0.670060  0.861376  0.770222   
4  0.577394  0.425395  0.612820    ...     0.534889  0.530402  0.815281   
5  0.769586  0.368978  0.311375    ...     0.794888  0.560696  0.397087   
9  0.733677  0.522029  0.565354    ...     0.333332  0.884283  0.293187   

       1993      1994      1995      1996      1997      1998      1999  
0  0.090955  0.963288  0.798535  0.885771  0.984932  0.973733  0.560248  
4  0.881087  0.992161  0.266993  0.381911  0.669649  0.606405  0.775730  
5  0.407701  0.825603  0.2
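
The same row selection and array conversion can be tried on a small standalone DataFrame, independent of the loading module. The variable names `frame`, `subset` and `matrix` are chosen for this sketch only.

```python
import numpy as np
import pandas as pd

# Small stand-in for a feature matrix: 10 rows, 4 columns, values 0..39.
frame = pd.DataFrame(np.arange(40).reshape(10, 4))

# .iloc selects by position: rows 0, 4, 5 and 9, all columns.
subset = frame.iloc[[0, 4, 5, 9], :]

# .to_numpy() (pandas 0.24 and later) returns a plain NumPy array;
# the .values attribute gives the same result.
matrix = subset.to_numpy()

print(type(subset).__name__, subset.index.tolist())
print(type(matrix).__name__, matrix.shape)
```

The DataFrame keeps the original row labels (0, 4, 5, 9), while the NumPy array is indexed purely by position.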

### Get names of variables
Maybe you've forgotten the variable names within your data container. This is how to get them.

In [12]:
dc.covariates.get_names()

dict_keys(['gender', 'age'])

## Loading excel and csv
To load Excel and CSV files, the loading module uses the reader functions of pandas (read_csv() and read_excel()). Currently the files need a header row so that variable names can be inferred. 

In [22]:
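# Create a csv file with two covariate columns
cov = np.concatenate((weight[:,None],height[:,None]), axis=1)
file = 'covariates.csv'
header = 'Weight,Height'
# Note: np.savetxt prefixes the header line with '# ' by default, so the first
# column is read back as '# Weight'; pass comments='' to write a plain header
np.savetxt(file, cov, delimiter=",", header=header)

# Again use the .add() method and specify the input_type
# In this case the name is irrelevant because variable names come from the
# file header; this needs to change in the future
dc.add(file, input_type='csv', name='not important', use_as='covariate')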
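
The effect of the header prefix on the inferred variable names can be checked with NumPy and pandas alone. This sketch writes two tiny files into a temporary directory; the helper `write_and_read_columns` and the sample values are made up for the illustration.

```python
import os
import tempfile

import numpy as np
import pandas as pd

values = np.array([[60, 170], [72, 192]])


def write_and_read_columns(directory, filename, comments):
    """Write values with np.savetxt and return the column names pandas infers."""
    path = os.path.join(directory, filename)
    np.savetxt(path, values, delimiter=",", header="Weight,Height",
               fmt="%d", comments=comments)
    return list(pd.read_csv(path).columns)


with tempfile.TemporaryDirectory() as tmp:
    # Default prefix '# ' ends up inside the first column name.
    default_columns = write_and_read_columns(tmp, "default.csv", comments="# ")
    # An empty prefix writes the header exactly as given.
    clean_columns = write_and_read_columns(tmp, "clean.csv", comments="")

print(default_columns)
print(clean_columns)
```

With the default prefix the first variable is named '# Weight', which matches the covariate summary in this notebook; with `comments=""` the names are 'Weight' and 'Height'.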

In [24]:
dc.covariates.summary()

Covariates :

           Height
count  100.000000
mean   175.880000
std     14.444096
min    150.000000
25%    164.000000
50%    177.000000
75%    187.000000
max    199.000000


           height
count   16.000000
mean   180.250000
std      8.489209
min    170.000000
25%    171.750000
50%    180.500000
75%    190.000000
max    192.000000


              age
count  100.000000
mean    46.850000
std     18.478694
min     18.000000
25%     32.000000
50%     46.000000
75%     64.000000
max     79.000000


       gender
count  100.00
mean     0.45
std      0.50
min      0.00
25%      0.00
50%      0.00
75%      1.00
max      1.00


         # Weight
count  100.000000
mean    65.020000
std     13.398311
min     40.000000
25%     54.750000
50%     65.000000
75%     77.000000
max     89.000000


          weight
count  16.000000
mean   66.000000
std     4.226898
min    60.000000
25%    62.000000
50%    66.000000
75%    70.250000
max    72.000000


