# Quafing example workflow

This notebook demonstrates the use of quafing through an example workflow that analyses questionnaire data about subaks. The answers to the individual questions are assumed to be entirely independent.


This notebook assumes that it is being executed from `quafing/notebooks/`, i.e. that `quafing/quafing/` sits at the same level of the directory hierarchy, with a shared parent directory. Before beginning, we change the working directory to this common parent directory.

In [None]:
import os
os.chdir('../')
print(os.getcwd())

Start by importing quafing. (If quafing was installed via a package manager, the previous step can be skipped.)

In [None]:
import quafing as q

### Data ingestion

Specify the file name of the questionnaire data (needs to be adapted by the user).


In [None]:
filepath = '/Users/eslt0101/Projects/SABM/FINE_Code/code/data/omri_subak_data.xlsx'


In general, for spreadsheet-type files (.xlsx, .xls, .odf, .ods), quafing assumes columnar data with the metadata on the columns located on the same sheet. The standard format is as follows (all columns and rows are 0-indexed):

- Data and metadata are located on sheet 0.
- Row 0 contains the column type (see below)
- Row 1 contains the number of the associated question
- Row 2 (header row) contains the column names
- Data starts on row 3
- No rows (read 0) are skipped at the end

Standard column types (denoted by single-character strings) are:

    e: excluded
    g: group by this column
    c: continuous variable
    u: unordered discrete
    o: ordered discrete
    b: binary
    
It should be emphasized that the user can depart from this standard. As long as the basic format of columnar data with metadata for each column is maintained, the actual indices of the rows can be changed. Similarly, a different column type schema can be used, preferably a string-based one. However, such alterations require additional arguments to quafing's functions, whereas the default values are configured to support the standard schema.
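To make the standard layout concrete, the following minimal sketch separates the three metadata rows from the data rows using plain Python lists. The sheet contents, the constant names, and the metadata keys (`type`, `question`, `name`) are invented for illustration and do not reflect the structure that `q.load` actually returns.

```python
# Hypothetical rows mimicking the standard sheet layout (not real survey data).
sheet = [
    ["g", "u", "o", "e"],                   # row 0: column types
    [0, 1, 2, 3],                           # row 1: associated question numbers
    ["subak", "crop", "trust", "comment"],  # row 2: column names (header row)
    ["A", "rice", 3, "n/a"],                # row 3 onwards: data
    ["B", "maize", 1, "n/a"],
]

# Row indices of the standard format; a non-standard sheet would change these.
TYPE_ROW, QUESTION_ROW, HEADER_ROW, DATA_START = 0, 1, 2, 3

# One metadata entry per column, collected from the three metadata rows.
metadata = [
    {"type": ctype, "question": qnum, "name": name}
    for ctype, qnum, name in zip(
        sheet[TYPE_ROW], sheet[QUESTION_ROW], sheet[HEADER_ROW]
    )
]
data = sheet[DATA_START:]

print(metadata[1])
print(len(data))
```

Keeping the row indices as named constants mirrors the idea that a non-standard layout only needs different indices, not a different parsing approach.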


In [None]:
rawmetadata,rawdata = q.load(filepath) #no further arguments necessary in this case

quafing's `load` function loads the data into a pandas DataFrame and creates a metadata dictionary.

### Preprocessing

Further processing, however, requires additional selection, specification, and preprocessing of the data. `quafing` supplies a `PreProcessor` class for this purpose, which takes in the data and metadata and exposes functions to select, split, and prepare the data for processing.

In [None]:
prep = q.PreProcessor(rawdata,rawmetadata)

We start by selecting the columns to be analyzed. Below, we create a selection by deselecting all columns of type 'e'. Direct selection by type, column name, or index is also possible.

In [None]:
prep.select_columns(cols=['e'],deselect=True)

Next, we specify which columns contain continuous and discrete data, respectively. Quafing maintains an internal representation of this distinction, and this method is how user-defined column type schemes are supported.
The default values, however, correspond to the standard defined above.

In [None]:
prep.set_cont_disc()

With the data columns selected and the type of data specified, the penultimate preprocessing step is defining which density estimation methods are to be used in constructing the pdfs for each variable. This, again, can be done by column type, column name, or column index.

In our example, all columns contain discrete data and the answers/variables are assumed to be independent. Accordingly, a discrete 1D pdf will be estimated for each column (selected by type).

In [None]:
prep.set_density_method(method='Discrete1D',cols=['o','u','b'])
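As an illustration of what a discrete 1D pdf over one column amounts to, the sketch below computes relative answer frequencies. This is a generic empirical estimate, not quafing's `Discrete1D` implementation, and the function name `discrete_pdf` and the sample answers are hypothetical.

```python
from collections import Counter


def discrete_pdf(answers):
    """Return the relative frequency of each distinct answer."""
    counts = Counter(answers)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


answers = ["yes", "no", "yes", "yes"]  # made-up answers to a single question
pdf = discrete_pdf(answers)
print(pdf)
```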

Finally, the data is split into groups. This is based on grouping information supplied by the user (e.g. the column of type `g` in the standard format). To avoid ambiguity or mismatches with user-defined type schemes, quafing supports selecting the column to group by only via column name or index.
Here we group and split based on the column with index 0.

In [None]:
prep.split_to_groups(0)
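Conceptually, splitting into groups partitions the rows by the value found in the chosen column. The sketch below shows this on plain lists. The helper `split_by_column` and the rows are hypothetical, and whether quafing keeps the grouping column inside each group is not specified here.

```python
from collections import defaultdict


def split_by_column(rows, index):
    """Partition rows into lists keyed by the value in column `index`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[index]].append(row)
    return dict(groups)


rows = [["A", 1], ["B", 2], ["A", 3]]  # made-up rows; column 0 is the group label
groups = split_by_column(rows, 0)
print(sorted(groups))
```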

### A collection of multi-dimensional pdfs

Having preprocessed and split the data, we can create a collection of multi-dimensional pdfs, one for each group. As the answers to each question are assumed to be independent, the full joint multi-dimensional pdf factorizes, so we can create a factorized multi-dimensional pdf for each group and combine them into a collection. `quafing` provides a convenience function for this operation.


In [None]:
mdpdfcol = q.create_mdpdf_collection('factorized',prep._groups,prep._grouplabels,prep._groupcolmetadata,)
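The factorization means that the probability of a full set of answers is the product of the per-question probabilities. A minimal sketch, with made-up marginals and a hypothetical helper name:

```python
from math import prod


def factorized_probability(marginals, answers):
    """P(a_1, ..., a_N) as the product of P_i(a_i), assuming independence."""
    return prod(pdf.get(answer, 0.0) for pdf, answer in zip(marginals, answers))


# One discrete pdf per question (illustrative values only).
marginals = [{"yes": 0.75, "no": 0.25}, {1: 0.5, 2: 0.5}]
p = factorized_probability(marginals, ["yes", 2])
print(p)
```

Because of this factorization, only the 1D pdfs need to be stored per group, rather than a table over all answer combinations.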

Having created the collection, we can calculate the Fisher information (FI) distance matrix, i.e. the matrix of pairwise FI distances between groups.
Several algorithmic approximations of the FI distance are supported (here we use the Hellinger distance). Distances are computed for each constituent pdf of the factorized multi-dimensional pdfs and aggregated into a combined distance using their root mean square.

In [None]:
mdpdfcol.calculate_distance_matrix(method='hellinger',pwdist='rms')
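The following sketch shows the textbook Hellinger distance between two discrete pdfs and a root-mean-square aggregation over questions. It assumes the standard definition with values in [0, 1]; quafing's own normalization or scaling may differ, and the group pdfs below are invented.

```python
from math import sqrt


def hellinger(p, q):
    """Textbook Hellinger distance between two discrete pdfs given as dicts."""
    support = set(p) | set(q)
    total = sum((sqrt(p.get(k, 0.0)) - sqrt(q.get(k, 0.0))) ** 2 for k in support)
    return sqrt(total / 2.0)


def rms(values):
    """Root mean square of a non-empty sequence of numbers."""
    return sqrt(sum(v * v for v in values) / len(values))


# One pdf per question for each of two groups (illustrative values only).
group_a = [{"yes": 1.0}, {1: 0.5, 2: 0.5}]
group_b = [{"no": 1.0}, {1: 0.5, 2: 0.5}]

per_question = [hellinger(p, q) for p, q in zip(group_a, group_b)]
distance = rms(per_question)
print(per_question, distance)
```

Here the groups disagree completely on the first question and agree exactly on the second, so the per-question distances are 1 and 0.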


Given the distance matrix, it is straightforward to determine the shortest path matrix.

In [None]:
mdpdfcol.calculate_shortest_path_matrix()
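One common way to obtain all-pairs shortest paths from a distance matrix is the Floyd-Warshall algorithm, sketched below. Which algorithm quafing uses internally is not stated here, so this is only an illustration of the concept on a made-up matrix.

```python
def shortest_path_matrix(dist):
    """All-pairs shortest path lengths via Floyd-Warshall."""
    n = len(dist)
    sp = [row[:] for row in dist]  # copy so the input matrix is left untouched
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via_k = sp[i][k] + sp[k][j]
                if via_k < sp[i][j]:
                    sp[i][j] = via_k
    return sp


pairwise = [
    [0.0, 1.0, 5.0],
    [1.0, 0.0, 2.0],
    [5.0, 2.0, 0.0],
]
sp = shortest_path_matrix(pairwise)
print(sp[0][2])
```

In this example the direct distance between items 0 and 2 is longer than the route via item 1, so the shortest path entry is smaller than the direct distance.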

### Embedding

Given N questions on the questionnaire, the FI distances and shortest paths are defined on the (N-1)-dimensional hypersphere, which makes investigating and understanding the structure of the data difficult. To address this, the collection can be embedded in a lower-dimensional space using the previously calculated information distances.

`quafing` provides an `Embedder` class with support for a range of embedding algorithms (MDS; further options are under development).

In [None]:
embedder = q.get_embedder('mds',mdpdfcol)

For example, the multi-dimensional pdf collection can be embedded in 2 dimensions.

In [None]:
embedding = embedder.embed(dimension=2,return_all=True)

An embedding consists of the actual embedding and a dictionary with relevant data about the settings used. It is up to the user to ensure that the data/multi-dimensional pdf collection and the embedding stay together.

In [None]:
embedding


Specifically for the MDS embedder, evaluating the stress of the embedding as a function of its dimensionality (with graphical representation) is supported.

In [None]:
embedder.eval_stress_v_dimension(plot=True)
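Stress measures how well the distances between embedded points reproduce the original distances. The sketch below uses Kruskal's stress-1 formula as an assumption; the exact stress definition reported by quafing's MDS embedder may differ. The distance matrix and point coordinates are invented.

```python
import math


def kruskal_stress(distances, points):
    """Kruskal stress-1 between target distances and embedded coordinates."""
    n = len(points)
    numerator = 0.0
    denominator = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            embedded = math.dist(points[i], points[j])
            numerator += (distances[i][j] - embedded) ** 2
            denominator += distances[i][j] ** 2
    return math.sqrt(numerator / denominator)


# Target distances of a 3-4-5 right triangle (illustrative values only).
distances = [
    [0.0, 3.0, 4.0],
    [3.0, 0.0, 5.0],
    [4.0, 5.0, 0.0],
]
points_2d = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)]  # reproduces all distances
points_1d = [(0.0,), (3.0,), (-4.0,)]             # distorts one distance

stress_2d = kruskal_stress(distances, points_2d)
stress_1d = kruskal_stress(distances, points_1d)
print(stress_2d, stress_1d)
```

The triangle fits exactly in 2 dimensions but not in 1, which is the kind of drop in stress with increasing dimension that a stress-versus-dimension plot makes visible.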

### Visualization

Finally, for embeddings in 2 or 3 dimensions, `quafing` also provides a convenience function for visualizing the embedding, which takes the calculated embedding and the multi-dimensional pdf collection object as inputs.

In [None]:
q.plot_embedding(embedding,mdpdfcol)

### Full capabilities

This notebook is only meant to demonstrate a questionnaire analysis workflow, and DOES NOT showcase all calling options or functionalities of `quafing`. Please consult the (inline) documentation for full details.