In [None]:
#pip install git+https://github.com/NextBrain-ml/nbsynthetic.git

## 1. Imports

In [None]:
import numpy as np
import pandas as pd
from nbsynthetic.data import input_data
from nbsynthetic.data_preparation import SmartBrain
from nbsynthetic.wgan import WGAN
from nbsynthetic.w_synthetic import synthetic_data
from nbsynthetic.statistics import mmd_rbf, Wilcoxon, Student_t, Kolmogorov_Smirnov
from nbsynthetic.statistics import plot_histograms
from nbsynthetic.tda import Topology
from nbsynthetic.geometry import concentration

## 2. Load and prepare data

In [None]:
df = input_data('ROI', decimal='.')
SB = SmartBrain() 
df = SB.nbEncode(df)
df['Year'] = df['Year'].astype('category')
#check the data types
df.dtypes

We have not found empty cells in the dataset, but we have
        removed the columns ['Unnamed: 0'] because are likely id columns or will have
        a very poor predictive performance.


Google ads                          float64
MRR (Monthly Recurring Revenue)     float64
Month                                 int64
Year                               category
dtype: object

## 3. Create synthetic dataset with WGAN

In [None]:
newdf = synthetic_data(
    df,
    WGAN, 
    samples= 2000,
    n_features = len(df.columns),
    initial_lr = 0.0001, 
    dropout = 0.4, 
    epochs = 10
    )

Epoch (1/10) | D. loss: -0.17 | G. loss: 15.42 |: 100%|##########| 5/5 [00:05<00:00,  1.13s/it]
Epoch (2/10) | D. loss: -1.79 | G. loss: 15.42 |: 100%|##########| 5/5 [00:01<00:00,  2.84it/s]
Epoch (3/10) | D. loss: -7.33 | G. loss: 15.42 |: 100%|##########| 5/5 [00:02<00:00,  2.33it/s]
Epoch (4/10) | D. loss: -24.80 | G. loss: 15.42 |: 100%|##########| 5/5 [00:01<00:00,  2.87it/s]
Epoch (5/10) | D. loss: -45.13 | G. loss: 15.42 |: 100%|##########| 5/5 [00:02<00:00,  2.26it/s]
Epoch (6/10) | D. loss: -101.89 | G. loss: 15.42 |: 100%|##########| 5/5 [00:01<00:00,  3.14it/s]
Epoch (7/10) | D. loss: -203.85 | G. loss: 15.42 |: 100%|##########| 5/5 [00:00<00:00,  5.27it/s]
Epoch (8/10) | D. loss: -288.04 | G. loss: 15.42 |: 100%|##########| 5/5 [00:00<00:00,  5.28it/s]
Epoch (9/10) | D. loss: -458.21 | G. loss: 15.42 |: 100%|##########| 5/5 [00:00<00:00,  5.36it/s]
Epoch (10/10) | D. loss: -712.81 | G. loss: 15.42 |: 100%|##########| 5/5 [00:00<00:00,  5.22it/s]


## 4. Statistical tests

In [None]:
mmd_rbf(df, newdf, gamma=None)

Maximum Mean Discrepance = 0.05313


In [None]:
plot_histograms(df, newdf)

## 5. Test with topological data analysis

Topological Data Analysis (TDA) is an emerging area for analyzing complex data. TDA refers to a class of methods that extract information from the topological structure of data. The reasons behind its rise were described by Carlsson in his 2009 article *'Topology and data'* (https://www.ams.org/journals/bull/2009-46-02/S0273-0979-09-01249-X/S0273-0979-09-01249-X.pdf):<p>
- Qualitative information is needed to obtain knowledge about the data. <p>
- Metrics are not theoretically justified. In physical systems, the phenomena studied often support clean explanatory theories that tell one exactly which metric to use; in other kinds of systems it is not clear how much significance actual distances have. <p>
- Coordinates are not natural. Instead of restricting ourselves to studying properties of the data that depend on a particular choice of coordinates, we should look for properties that do not depend on them, because the coordinates themselves often carry no intrinsic meaning. <p>
- Summaries are more valuable than individual parameter choices. Since real things don’t depend on how you choose to describe them, the parts of a theory corresponding to real things should be invariants. It is therefore productive to develop other mechanisms in which the behavior of invariants can be effectively summarized. <p>

### Persistent homology

Persistent homology refers to a class of methods for measuring topological features of shapes and functions. It converts the data into simplicial complexes and describes the topological structure of a space at different spatial resolutions. Topological features that are more persistent are detected over a wide range of spatial scales and are deemed more likely to represent true features of the underlying space rather than sampling variations, noise, etc. The input data for the computation of persistent homology is represented as a point cloud, a collection of data points (objects or space) defined by a given coordinate system.

<big>Definitions</big></br>
- Point Clouds. Point clouds are finite sets of points equipped with a distance function. They are finite samples taken from a geometric object.<p> 
- Simplicial complex. Simplicial complexes provide a particularly simple combinatorial way to describe certain topological spaces. A simplicial complex structure on a space is an expression of the space as a union of points, intervals, triangles, and higher-dimensional analogues. It can be defined as a pair $(V,Σ)$, where $V$ is a finite set and $Σ$ is a family of non-empty subsets of $V$ such that $σ∈Σ$ and $τ⊆σ$ (with $τ$ non-empty) imply $τ∈Σ$. Associated to a simplicial complex is a topological space $|(V,Σ)|$. The building blocks of a simplicial complex are called simplices (plural of simplex). For example, a 0-simplex is just a point, a 1-simplex is two points connected by a line segment, and a 2-simplex is a filled triangle. In general, a $k$-simplex $σ$ satisfies #$(σ)=k+1$, that is, $σ$ has $k+1$ elements.
        
- Filtration. A filtration on a simplicial complex $Σ$ is a collection of subcomplexes $\{Σ(t) \mid t ∈ \mathbb{R}\}$ of $Σ$ such that $Σ(t)⊂Σ(t')$ whenever $t ≤ t'$. The filtration value of a simplex $σ∈Σ$ is the smallest $t$ such that $σ∈Σ(t)$. A filtered simplicial complex is a simplicial complex equipped with a filtration. The filtration controls how data points are converted into simplicial complexes and thus how a point set is turned into a topological space.

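- Persistence diagrams. Families of spaces parametrized by a single parameter $\epsilon$ are a way of extracting connectivity information from a point cloud or finite metric space. A persistence diagram plots, for each topological feature, the value of $\epsilon$ at which it appears (birth) against the value at which it disappears (death). <p>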
- Vietoris-Rips filtration. Let $X$ denote a metric space with metric $d$. The Vietoris-Rips complex for $X$ attached to the parameter $\epsilon$ (a threshold value), denoted by $VR(X,\epsilon)$, is the simplicial complex whose vertex set is $X$ and where $\{x_0,x_1,...,x_k\}$ spans a $k$-simplex if and only if $d(x_i,x_j) ≤ \epsilon$ for all $0 ≤ i, j ≤ k$. $VR(X,\epsilon)$ encodes useful information about the topology of the underlying metric space.<p>
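
The Vietoris-Rips definition can be made concrete with a small sketch. It is not the implementation behind `tp.vietory_rips`; the function name `vietoris_rips`, the `max_dim` argument and the toy point cloud are assumptions made for illustration. It enumerates every subset of points whose pairwise distances are all at most $\epsilon$.

```python
from itertools import combinations

import numpy as np


def vietoris_rips(points, epsilon, max_dim=2):
    """Return the simplices of VR(X, epsilon) up to dimension max_dim.

    A set of k+1 points spans a k-simplex if and only if all of its
    pairwise distances are <= epsilon (Euclidean metric assumed here).
    """
    points = np.asarray(points, dtype=float)
    n = len(points)
    # Pairwise Euclidean distance matrix.
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    simplices = {0: [(i,) for i in range(n)]}
    for k in range(1, max_dim + 1):
        simplices[k] = [
            s for s in combinations(range(n), k + 1)
            if all(dist[i, j] <= epsilon for i, j in combinations(s, 2))
        ]
    return simplices


# Hypothetical toy data: the four corners of a unit square.
square = [[0, 0], [1, 0], [1, 1], [0, 1]]
for eps in (0.5, 1.0, 1.5):
    vr = vietoris_rips(square, eps)
    print(f"eps={eps}: {len(vr[0])} vertices, {len(vr[1])} edges, {len(vr[2])} triangles")
```

Increasing `eps` only ever adds simplices, which is exactly the filtration property. For this square, the four sides appear first and form an unfilled loop; once the diagonals are admitted, triangles fill the loop in.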

References:</br>
- Topology and data. Author: Gunnar Carlsson Journal: Bull. Amer. Math. Soc. 46 (2009), 255–308. DOI: https://doi.org/10.1090/S0273-0979-09-01249-X.

- Cohen-Steiner, D., Edelsbrunner, H. & Harer, J. (2007). Stability of Persistence Diagrams. Discrete Comput Geom 37, 103–120. https://doi.org/10.1007/s00454-006-1276-5

- H. Edelsbrunner , D. Letscher , A. Zomorodian (2000). Topological Persistence and Simplification.

- G. Singh, F. Memoli and G. Carlsson (2007). Topological Methods for the Analysis of High Dimensional Data Sets and 3D Object Recognition, Point Based Graphics 2007, Prague.

- E. Riehl (2017). Category theory in context, Courier Dover Publications.

In [None]:
tp = Topology()
# Vietoris-Rips filtration
dgms1 = tp.vietory_rips(np.array(df))
dgms2 = tp.vietory_rips(np.array(newdf))

In [None]:
# Persistence diagram original data
tp.plot_diagram(dgms1)

In [None]:
# Persistence diagram synthetic data
tp.plot_diagram(dgms2)

In [None]:
# Test whether the geometry of the two point clouds differs.
# If the p-value is smaller than the significance level α
# (usually 0.05), the difference is statistically significant.
# Higher p-values mean that no significant difference between
# the geometry of the two point clouds was detected.
tp.mann_whitney(dgms1, dgms2)


      Mann Whitney U test p-value for dimension HO: 1.0
      Mann Whitney U test p-value for dimension H1: 1.1738352721245308e-12
      


- Persistent entropy (PE) is a measure of graph complexity, acting as a global topological feature for topological classification. PE, or $H$, is a Shannon-like entropy that measures the information discovered during the filtration process in TDA. The maximum persistent entropy corresponds to the situation in which all the intervals in the barcode have equal length, giving $H = \log n$, where $n$ is the number of intervals. In Shannon's terms, maximum entropy corresponds to a uniform distribution. The value of $H$ decreases as intervals of more varied lengths appear, moving away from a uniform distribution.<p>
- Bottleneck distance. The bottleneck distance between two persistence diagrams $D_1$ and $D_2$ is the limit $p \to \infty$ of the $p$-Wasserstein distance. More explicitly, it is the infimum, over all bijections $\gamma : D_1 \cup \Delta \to D_2 \cup \Delta$ (where $\Delta$ denotes the diagonal), of the value</br>
<center>$\sup_{x \in D_1 \cup \Delta} ||x - \gamma(x)||_{\infty}.$</center></br>
The set of persistence diagrams together with the bottleneck distance is a metric space.
Note: A metric space is a set $X$ with a function $d : X \times X \to \mathbb R$ such that, for all $x,y,z \in X$, $d(x,y) \geq 0$ with equality if and only if $x = y$, $d(x,y) = d(y,x)$, and $d(x,z) \leq d(x,y) + d(y,z)$.
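
As a rough illustration of the persistent entropy described above, the sketch below computes a Shannon entropy over normalised interval lengths. The function name `persistent_entropy`, the natural logarithm and the toy barcodes are assumptions for this example; `tp.persistent_entropy` may use a different base or normalisation.

```python
import numpy as np


def persistent_entropy(diagram):
    """Shannon entropy of the normalised interval lengths of a barcode.

    `diagram` is an array of (birth, death) pairs with finite deaths.
    """
    diagram = np.asarray(diagram, dtype=float)
    lengths = diagram[:, 1] - diagram[:, 0]
    lengths = lengths[lengths > 0]      # zero-length bars carry no mass
    p = lengths / lengths.sum()         # each bar's share of the total length
    return float(-(p * np.log(p)).sum())


# Hypothetical barcodes used only for illustration.
equal = [[0, 1], [0, 1], [0, 1], [0, 1]]      # four bars of equal length
uneven = [[0, 10], [0, 1], [0, 1], [0, 1]]    # one dominant bar

print(persistent_entropy(equal), np.log(4))   # the two values should agree
print(persistent_entropy(uneven))             # lower than the equal-length case
```

The equal-length barcode attains the maximum $\log n$, while the barcode dominated by one long interval yields a smaller value.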

In [None]:
# Persistence entropy
E1 = tp.persistent_entropy(dgms1, normalize=False)
E2 = tp.persistent_entropy(dgms2, normalize=False)
print(f"""
Persistence entropy
-------------------
Persistence entropy original data = {list(E1)}
Persistence entropy synthetic data = {list(E2)}

Bottleneck distance
-------------------

W = {tp.bottleneck(dgms1, dgms2):.2f}
""")


Persistence entropy
-------------------
Persistence entropy original data = [2.588572016128281, -0.0]
Persistence entropy synthetic data = [7.328259842735767, 5.439703443334857]

Bottleneck distance
-------------------

W = 6778.90



Above we can see the persistence diagrams of both the original and the synthetic data. The main difference is that the persistence diagram of the original data has points only in dimension H0 (corresponding to connected components). For the synthetic data, the persistence diagram also shows several points in dimension H1 (corresponding to loops). Some of these H1 points are noise, because they fall very near the diagonal, but several represent 'real' loops. This is a consequence of increasing the data size and generating new relationships between features.

## 6. Test with Machine Learning model

In [None]:
from sklearn.model_selection import train_test_split, cross_validate, KFold, cross_val_score, cross_val_predict
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.metrics import accuracy_score, mean_absolute_error, mean_squared_error, explained_variance_score
target = 'MRR (Monthly Recurring Revenue)'

"""Algorithm"""
clf = RandomForestRegressor(max_depth=4, random_state=0)
#clf = RandomForestClassifier(max_depth=4)

"""Original data"""
X0, y0 = df.drop(columns=[target]), df[target]
X_train0, X_test0, y_train0, y_test0 = train_test_split(
    X0, y0, test_size=0.33)
clf.fit(X_train0, y_train0)
y_pred0 = clf.predict(X_test0)

"""Synthetic data"""
X, y = newdf.drop(columns=[target]), newdf[target]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
#accuracy_score(y_test, y_pred)

"""Check algorithm with original data"""
X1, y1 = df.drop(columns=[target]), df[target]
predictions = cross_val_predict(clf, X1, y1, cv=5)

print(f"""
Original data
-------------
Score without cross validation = {explained_variance_score(y_test0, y_pred0):.2f}
Scores with cross validation = {cross_val_score(clf, X0, y0, cv=5, scoring='explained_variance')}


Synthetic data
--------------
Score without cross validation = {explained_variance_score(y_test, y_pred):.2f}
Scores with cross validation = {cross_val_score(clf, X, y, cv=5, scoring='explained_variance')}


Check algorithm with original data
----------------------------------
Score with cross validation prediction = {explained_variance_score(y1, predictions):.2f}
""")


Original data
-------------
Score without cross validation = 0.71
Scores with cross validation = [ 0.19254948 -7.0973158   0.1455913   0.18710539 -0.14113018]


Synthetic data
--------------
Score without cross validation = 0.80
Scores with cross validation = [0.8009446  0.81271862 0.79139598 0.81252436 0.83137774]


Check algorithm with original data
----------------------------------
Score with cross validation prediction = 0.71



In [None]:
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import plotly.figure_factory as ff

In [None]:
limit = max(np.maximum(np.array(y1), predictions))*1.1
fig = make_subplots(
    rows=1, 
    cols=2, 
    subplot_titles=("Original vs. Predicted", "Residuals")
    )

fig.add_trace(
    go.Scatter(
    x=y1, 
    y=predictions,
    mode='markers',
    showlegend=False,), 
    row=1, col=1
    )
fig.add_trace(
    go.Scatter(
    x=[0, limit], 
    y=[0, limit],
    mode='lines',
    showlegend=False,),
    row=1, col=1
    )

fig.update_xaxes(
    title_text="Original", 
    range=[0, limit], 
    row=1, col=1
    )
fig.update_yaxes(
    title_text="Predicted", 
    range=[0, limit], 
    row=1, col=1
    )

fig.add_trace(
    go.Scatter(
    x=predictions, 
    y=predictions-y1,
    mode='markers',
    marker_color='grey',
    showlegend=False,), 
    row=1, col=2
    )

fig.add_trace(
    go.Scatter(
    x=[0, limit], 
    y=[0, 0],
    mode='lines',
    line_color='red',
    showlegend=False,),
    row=1, col=2
    )
fig.update_xaxes(
    title_text="Predicted",  
    row=1, col=2
    )
fig.update_yaxes(
    title_text="Predicted - Original", 
    row=1, col=2
    )

fig.update_layout(plot_bgcolor='white')
fig.show()

In [None]:
fig = make_subplots(
    rows=1, 
    cols=2, 
    subplot_titles=("Original", "Synthetic"),
    shared_xaxes=True,
    shared_yaxes=True
    )

fig.add_trace(
    go.Scatter(
        x=df['Google ads'], 
        y=df['MRR (Monthly Recurring Revenue)'],
        mode='markers',
        marker_color=df['Month'],
        showlegend=False,),
    row=1, col=1
)

fig.add_trace(
    go.Scatter(
        x=newdf['Google ads'], 
        y=newdf['MRR (Monthly Recurring Revenue)'],
        mode='markers',
        marker_color=newdf['Month'],
        marker_colorbar_title='Month',
        showlegend=False,),
    row=1, col=2
)
fig.update_xaxes(title_text='Google ads spend', row=1, col=1)
fig.update_yaxes(title_text='MRR (Monthly Recurring Revenue)', row=1, col=1)
fig.update_xaxes(title_text='Google ads spend', row=1, col=2)
fig.update_yaxes(title_text='MRR (Monthly Recurring Revenue)', row=1, col=2)
fig.update_layout(title_text="Original data versus synthetic data",plot_bgcolor='white')
fig.show()

In [None]:
colors = ['Blue',  'Red', 'lightblue','salmon']
fig = ff.create_distplot(
    [df['MRR (Monthly Recurring Revenue)'], df['Google ads'], newdf['MRR (Monthly Recurring Revenue)'], newdf['Google ads']], 
    ['Original/MRR (Monthly Recurring Revenue)', 'Original/Google ads spend', 'Synthetic/MRR (Monthly Recurring Revenue)', 'Synthetic/Google ads spend'],
    show_hist=False,
    colors = colors
  )
fig.update_layout(title_text="Distribution plots", plot_bgcolor='white')
fig.show()

## 7. Test with differential geometry analysis

**Theorem 2** (from Hinneburg et al.):</br>
Let $\mathcal{F}$ be an arbitrary distribution of two points and
the distance function $\lVert \mathbf{.} \rVert$ be an $\mathcal{L}_k$ metric. Then, <br/>

<center>$\lim_{d \to +\infty} E\left[ \frac{Dmax_k^{d}-Dmin_k^{d}}{d^{1/k-1/2}}\right] = C_k$</center><br/>

where $C_k$ is some constant dependent on $k$ and $\frac{Dmax_k^{d}-Dmin_k^{d}}{d^{1/k-1/2}}$ is the relative contrast $\zeta_{\mathcal{L}_k}$. For the Euclidean norm ($k = 2$) the denominator equals 1, so the quantity $Dmax-Dmin$, also called the *contrast* (Hinneburg et al.), converges to $C_k$ as the dimensionality $d \to +\infty$. This illustrates the concentration phenomenon (Beyer et al.). </br>
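
A small numerical sketch can make the behaviour of the contrast visible. It assumes the query point is the origin and the data are i.i.d. uniform samples; the helper name `contrast` is hypothetical and this is not the code behind `C.concentration_distances`.

```python
import numpy as np


def contrast(points, k=2.0):
    """Dmax - Dmin of the L_k distances from the origin to each point."""
    points = np.asarray(points, dtype=float)
    norms = (np.abs(points) ** k).sum(axis=1) ** (1.0 / k)
    return float(norms.max() - norms.min())


rng = np.random.default_rng(0)
for d in (2, 20, 200, 2000):
    sample = rng.uniform(size=(500, d))   # hypothetical i.i.d. uniform data
    print(d, round(contrast(sample, k=2), 3), round(contrast(sample, k=1), 3))
```

According to the theorem, the contrast scales like $d^{1/k-1/2}$, so the Euclidean column should stay roughly flat as $d$ grows while the $\mathcal{L}_1$ column should keep increasing.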



### Measuring the concentration phenomenon
**Theorem 3** (from François et al.):</br>
<center> If $ \lim_{d\to\infty}\frac{ \sqrt{Var(\lVert \mathbf{X}\rVert_p)}}{E(\lVert \mathbf{X}\rVert_p)} = 0$ </center></br>

where the term  $\frac{ \sqrt{Var(\lVert \mathbf{X}\rVert_p)}}{E(\lVert \mathbf{X}\rVert_p)}$ is called Relative Variance and written as $RV_{\mathcal{F},p}$.</br>
Intuitively, we can see that RV measures the concentration 
by relating a measure of spread (standard deviation) to a 
measure of location (expectation). In that sense, it is 
similar to the relative contrast that also relates a 
measure of spread (range) to a measure of location (minimum). 
As a consequence, high-dimensional data that present a lot of correlation or dependencies between variables will be much less concentrated than if all variables are independent.</br>
In [Beyer et al., 1999; Hinneburg et al., 2000; Aggarwal et al., 2002], it is implicitly assumed that stability of nearest-neighbour search and concentration (i.e. small relative contrast) are linked. The rationale is that a highly concentrated distance measure brings very little relevant discriminative information (and consequently makes the search for nearest neighbours unstable).
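
The relative variance is simple to estimate from a sample. The sketch below is an illustration under stated assumptions: the name `relative_variance` is hypothetical, the population standard deviation is used, and the two Gaussian datasets are synthetic toy data chosen to contrast independent and fully dependent coordinates.

```python
import numpy as np


def relative_variance(X, p=2.0):
    """Empirical RV: std of the L_p norms divided by their mean."""
    X = np.asarray(X, dtype=float)
    norms = (np.abs(X) ** p).sum(axis=1) ** (1.0 / p)
    return float(norms.std() / norms.mean())


rng = np.random.default_rng(0)
d = 100
# Independent coordinates: every column is its own Gaussian sample.
independent = rng.normal(size=(1000, d))
# Fully dependent coordinates: one Gaussian value repeated d times per row.
dependent = np.repeat(rng.normal(size=(1000, 1)), d, axis=1)

print(relative_variance(independent), relative_variance(dependent))
```

The independent sample should give a much smaller RV than the dependent one, which matches the statement that correlated data are less concentrated.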


### Does the concentration of the norm have an impact on the distance-based learning tools?


**References:**

- Aggarwal, Charu & Hinneburg, Alexander & Keim, Daniel. (2002). On the Surprising Behavior of Distance Metric in High-Dimensional Space. First publ. in: Database theory, ICDT 2001, 8th International Conference, London, UK, January 4 - 6, 2001 / Jan Van den Bussche ... (eds.). Berlin: Springer, 2001, pp. 420-434 (=Lecture notes in computer science ; 1973).
- K. S. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft. (1999). When is "nearest neighbor" meaningful? in Proc. 7th Int. Conf. Database
Theory, pp. 217–235.
- Alexander Hinneburg, Charu C. Aggarwal, and Daniel A. Keim. (2000). What Is the Nearest Neighbor in High Dimensional Spaces? In Proceedings of the 26th International Conference on Very Large Data Bases (VLDB '00). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 506–515.
- François, D., Wertz, V., & Verleysen, M. (2007). The Concentration of Fractional Distances. IEEE Transactions on Knowledge and Data Engineering, 19, 873-886.

In [None]:
C = concentration()

In [None]:
C.concentration_distances(df, newdf)


            Average distance of points in data with euclidean distance (L2 norm)
            ---------------------------------------------------------------------
            Avg. distance in original data: 16164.25 
            Avg. distance in synthetic data : 16912.08
            Avg. distance in random data : 17048.30

            Distances Variance in original data: 123433417.54 
            Distances Variance in synthetic data : 131881286.03
            Distances Variance in random data : 109261221.86


            Contrast (Hinnenburg et al.)
            ----------------------------
            Contrast converges to a constant when the dimension
            increases and when the euclidean distance is used.
            If this contrast decreases, as it is the case for 
            Minkowski norms with p > 2,precision can be lost.

            Contrast, (L norm 2), original data = 47798.75
            Contrast, (L norm 2), synthetic data = 48055.80
            Contrast, (L norm 

In [None]:
C.plot_distances(df, newdf)

Singular values and singular value decompositions are important in analyzing data.
One simple example of this is 'rank estimation'. Suppose that we have $n$ data points $v_1, \dots, v_n$, all of which live in $\mathbb{R}^m$, where $n$ is much larger than $m$. Let $A$ be the $m \times n$ matrix with columns $v_1, \dots, v_n$. Suppose the data points satisfy some linear relations, so that $v_1, \dots, v_n$ all lie in an $r$-dimensional subspace of $\mathbb{R}^m$. Then we would expect the matrix $A$ to have rank $r$. However, if the data points are obtained from measurements with errors, then the matrix $A$ will probably have full rank $m$. But only $r$ of the singular values of $A$ will be large, and the other singular values will be close to zero. Thus one can compute an 'approximate rank' of $A$ by counting the number of singular values which are much larger than the others, and one expects the measured matrix $A$ to be close to a matrix $A'$ such that the rank of $A'$ is the 'approximate rank' of $A$.
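
The rank-estimation argument can be sketched numerically. The relative threshold `rel_tol`, the helper name `approximate_rank` and the simulated measurements are assumptions made for this example, not part of `nbsynthetic`.

```python
import numpy as np


def approximate_rank(A, rel_tol=1e-3):
    """Count singular values larger than rel_tol times the largest one."""
    s = np.linalg.svd(np.asarray(A, dtype=float), compute_uv=False)
    return int((s > rel_tol * s[0]).sum())


rng = np.random.default_rng(0)
m, n, r = 5, 200, 2
# Points lying exactly in an r-dimensional subspace of R^m.
exact = rng.normal(size=(m, r)) @ rng.normal(size=(r, n))
# The same points observed with small measurement errors.
noisy = exact + 1e-6 * rng.normal(size=(m, n))

print("numerical rank:  ", np.linalg.matrix_rank(noisy))
print("approximate rank:", approximate_rank(noisy))
print("singular values: ", np.linalg.svd(noisy, compute_uv=False))
```

The noise pushes the numerical rank up to $m$, but only $r$ singular values stand out from the rest, so the thresholded count recovers the dimension of the underlying subspace.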

The variance concentration ratio (VCR) is a rigorous and explainable metric to quantify data. To better examine the explainability of manifold learning in terms of DLPs on high-dimensional and low-dimensional data, we adopt the VCR, which was initially proposed by Han et al. (2021) to measure high-frequency trading data. The VCR is defined as the ratio between the largest singular value of the dataset and the sum of all singular values. It answers the question:<br>
**What percentage of the data variance is concentrated along the direction of the first singular value?**<br>
Given a dataset $X$ with $n$ observations and $p$ variables, $X ∈ R^{n×p}$, with singular values $s_1 \geq \dots \geq s_{\min(n,p)}$, its variance concentration ratio (VCR) is defined as:<br/>
<center>
$\beta (X) = \frac{s_1}{\sum_{i=1}^{\min(n,p)} s_i} $ 
</center></br>

The singular value decomposition (SVD) provides another way to factorize a matrix, into singular vectors and singular values. The SVD allows us to discover some of the same kind of information as the eigendecomposition. However, the SVD is more generally applicable.
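
A minimal sketch of the VCR computed directly from the singular values is given below. The function name is hypothetical, and whether `C.variance_concentration` centres or scales the data beforehand is not stated here, so this version applies no preprocessing.

```python
import numpy as np


def variance_concentration_ratio(X):
    """Largest singular value divided by the sum of all singular values."""
    s = np.linalg.svd(np.asarray(X, dtype=float), compute_uv=False)
    return float(s[0] / s.sum())


# Toy matrix whose singular values are 3 and 1, so analytically
# the ratio is 3 / (3 + 1).
X = np.diag([3.0, 1.0])
print(variance_concentration_ratio(X))
```

A value close to 1 means a single direction dominates the data, while a value close to $1/\min(n,p)$ means the singular values are nearly equal.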

References:</br>
- Han, Henry & Teng, Jie & Xia, Junruo & Wang, Yunhan & Guo, Zihao & Li, Deqing. (2021). Predict high-frequency trading marker via manifold learning. Knowledge-Based Systems. 213. 106662. 10.1016/j.knosys.2020.106662.

In [None]:
C.variance_concentration(df, newdf)


            Singular values
            ---------------
            Singular values for original dataset : [1.0, 0.114, 0.051, 0.0]
            Singular values for synthetic dataset : [1.0, 0.11, 0.046, 0.0]
            Singular values for random dataset : [1.0, 0.184, 0.027, 0.0]


            Variance concentration ratio (VCR)
            ----------------------------------
            Variance concentration ratio original data = 85.85%
            Variance concentration ratio synthetic data = 86.49%
            Variance concentration ratio random data = 82.56%
            


Differential geometry is very well known in high-dimensional data analysis. The analysis of high-dimensional spaces and their low-dimensional embeddings is commonly used in machine learning problems. The idea of using differential geometry to generate data is not new. Our idea is that, in the same direction, we can use the same theorems to validate synthetic data generated with a GAN. Our experimental results show the following:
- The average distance between points (Euclidean distance, L2 norm) and the contrast are very similar for the original and the synthetic data.  
- When generating a synthetic dataset with a larger number of points (but with the same dimensionality), the relative contrast of the synthetic data has to be much higher than that of the original data. If we generate random data of the same size as the synthetic data, the relative contrast of the synthetic data has to be higher than that of the random data.  
- Singular values for both original and synthetic data are very close. The relative variance of the synthetic data has to be much higher than that of the original data. 
