![raptorQube](./images/raptorqube.jpg)

# Andreas Francois Vermeulen

![Vermeulen](./images/vermeulen.png)

### Supervisors: Dr Juliana Kuster Filipe Bowles / Dr Vladimir Janjic - University of St Andrews

![St Andrews](./images/standrews.jpg)

![raptorQube](./images/raptorqube2.jpg)

## Swampiness Coefficient

### Data Lake Zones and RAPTOR engine

![RIF Core](./images/RIF-Core.JPG)

### Data Vault: Time-Person-Organisation-Location-Event

![TPOLE](./images/TPOLE.JPG)

# Example: Swampiness Coefficient Calculation

In [1]:
import numpy as np
import pandas as pd
import itertools

In [2]:
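nrSet = 0                # scenario: 0 = show case, 1 = good, 2 = average, 3 = bad
nrSize = 1000000         # scale factor for the number of records
nrUser = 1000000         # number of users of the data lake
nrValue = 0.015          # business value (£) per block of records
nrRecordBlock = 1000     # number of records in one value block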

### Show Case Data Lake

In [3]:
if nrSet == 0:
    nrRecords = 100 * nrSize
    nrRecordswithProvenance = nrRecords * 1.00 # 100%
    nrRecordsValid  = nrRecords * 1.00         # 100%

    ProcessFitness = 1.00                      # 100%
    ValidProcess = 1.00                        # 100%

    businessValue = float(nrValue / nrRecordBlock)
    nrUsers = nrUser

### Good Data Lake

In [4]:
if nrSet == 1:
    nrRecords = 100 * nrSize
    nrRecordswithProvenance = nrRecords * 0.99999 # 99.999%
    nrRecordsValid  = nrRecords * 0.99999         # 99.999%

    ProcessFitness = 0.99999                      # 99.999%
    ValidProcess = 1.00                        # 100%

    businessValue = float(nrValue / nrRecordBlock)
    nrUsers = nrUser

### Average Data Lake

In [5]:
if nrSet == 2:
    nrRecords = 100 * nrSize
    nrRecordswithProvenance = nrRecords * 0.95 # 95%
    nrRecordsValid  = nrRecords * 0.85         # 85%

    ProcessFitness = 0.98                      # 98%
    
    ValidProcess = 0.95                        # 95%

    businessValue = float(nrValue / nrRecordBlock)
    nrUsers = nrUser

### Bad Data Lake

In [6]:
if nrSet == 3:
    nrRecords = 100 * nrSize
    nrRecordswithProvenance = nrRecords * 0.33 # 33%
    nrRecordsValid  = nrRecords * 0.33         # 33%

    ProcessFitness = 0.33                      # 33%
    
    ValidProcess = 0.33                        # 33%

    businessValue = float(nrValue / nrRecordBlock)
    nrUsers = nrUser
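
The four scenario cells above differ only in four ratios, so they can be read as a single lookup table. The sketch below is one way to express that. `SCENARIOS` and `load_scenario` are hypothetical names that do not appear in the notebook; the ratios are copied from the cells above.

```python
# Hypothetical helper, not part of the original notebook.
# Each entry: share of records with provenance, share of valid records,
# process fitness, and share of valid processes.
SCENARIOS = {
    0: {"name": "Show Case", "provenance": 1.00, "valid": 1.00,
        "fitness": 1.00, "valid_process": 1.00},
    1: {"name": "Good", "provenance": 0.99999, "valid": 0.99999,
        "fitness": 0.99999, "valid_process": 1.00},
    2: {"name": "Average", "provenance": 0.95, "valid": 0.85,
        "fitness": 0.98, "valid_process": 0.95},
    3: {"name": "Bad", "provenance": 0.33, "valid": 0.33,
        "fitness": 0.33, "valid_process": 0.33},
}


def load_scenario(nr_set, nr_size=1_000_000):
    """Return the notebook's scenario variables for one value of nrSet."""
    s = SCENARIOS[nr_set]
    nr_records = 100 * nr_size
    return {
        "nrRecords": nr_records,
        "nrRecordswithProvenance": nr_records * s["provenance"],
        "nrRecordsValid": nr_records * s["valid"],
        "ProcessFitness": s["fitness"],
        "ValidProcess": s["valid_process"],
    }


params = load_scenario(2)
print(SCENARIOS[2]["name"], params["nrRecords"], params["nrRecordsValid"])
```

Keeping the ratios in one table makes it harder for a scenario to drift out of step with the others when a value is edited.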

### Alpha Weight

In [7]:
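alphaOne   = float(1/nrRecords)   # Variety: scales the weighted item count by the record count
alphaTwo   = float(1.0)           # Veracity
alphaThree = float(1.0)           # Validity
alphaFour  = float(1.0)           # Variability
alphaFive  = float(1/nrRecords)   # Value: scales the business value lost by the record count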

# epsilon 1: Coefficient of Variety

In [8]:
nrRecordsInSection = nrRecords / 5

DataItemsComplexityData = pd.DataFrame(
    data=np.array([
        [nrRecordsInSection, 1],
        [nrRecordsInSection, 2],
        [nrRecordsInSection, 3],
        [nrRecordsInSection, 4],
        [nrRecordsInSection, 5]
    ]
    ),
    columns=['nrDataItems','complexityFactor'],
    dtype=np.float64
)

print(DataItemsComplexityData)

   nrDataItems  complexityFactor
0   20000000.0               1.0
1   20000000.0               2.0
2   20000000.0               3.0
3   20000000.0               4.0
4   20000000.0               5.0


In [9]:
# sigma = number of data items weighted by their complexity factor (vectorised)
DataItemsComplexityData['sigma'] = (
    DataItemsComplexityData['nrDataItems'] * DataItemsComplexityData['complexityFactor']
)

In [10]:
print(DataItemsComplexityData)

   nrDataItems  complexityFactor        sigma
0   20000000.0               1.0   20000000.0
1   20000000.0               2.0   40000000.0
2   20000000.0               3.0   60000000.0
3   20000000.0               4.0   80000000.0
4   20000000.0               5.0  100000000.0


In [11]:
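# epsilon 1 is the sum of the complexity-weighted item counts.
# Summing the column directly yields a scalar; calling float() on a
# one-element Series is deprecated in recent pandas releases.
epsilonOne = float(DataItemsComplexityData['sigma'].sum())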
print('epsilon(1) = %21.9f' % epsilonOne)

epsilon(1) =   300000000.000000000
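
The three cells above amount to a dot product of item counts and complexity factors. A minimal sketch of the same calculation without the intermediate DataFrame follows; `coefficient_of_variety` is a hypothetical name, and the five equal sections mirror the notebook's example data.

```python
import numpy as np


def coefficient_of_variety(nr_data_items, complexity_factors):
    """Sum of item counts weighted by their complexity factor (epsilon 1)."""
    items = np.asarray(nr_data_items, dtype=np.float64)
    factors = np.asarray(complexity_factors, dtype=np.float64)
    return float(np.dot(items, factors))


nr_records = 100_000_000
section = nr_records / 5          # five equally sized sections, as in the notebook
eps1 = coefficient_of_variety([section] * 5, [1, 2, 3, 4, 5])
print('epsilon(1) = %21.9f' % eps1)

# Multiplied by alpha(1) = 1/nrRecords, epsilon 1 becomes the
# record-weighted mean complexity factor.
print('alpha(1) * epsilon(1) = %.9f' % (eps1 / nr_records))
```

Because the sections are equal in size, the weighted term is simply the mean of the factors 1 to 5.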


# epsilon 2: Coefficient of Veracity

In [12]:
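if nrRecordswithProvenance == 0:
    # No record has provenance: avoid dividing by zero and use nrRecords instead
    epsilonTwo = nrRecords
else:
    # Ratio of all records to records with provenance (1.0 when every record has it)
    epsilonTwo = float(nrRecords/nrRecordswithProvenance)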
    
print('epsilon(2) = %21.9f' % epsilonTwo)

epsilon(2) =           1.000000000


# epsilon 3: Coefficient of Validity

In [13]:
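if nrRecordsValid == 0:
    # No record is valid: avoid dividing by zero and use nrRecords instead
    epsilonThree = nrRecords
else:
    # Ratio of all records to valid records (1.0 when every record is valid)
    epsilonThree = float(nrRecords/nrRecordsValid)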
        
print('epsilon(3) = %21.9f' % epsilonThree)

epsilon(3) =           1.000000000
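
Epsilon 2 and epsilon 3 follow the same pattern: total records divided by the records that pass a check, with a fallback when none pass. The sketch below factors that pattern into one function; `inverse_ratio` is a hypothetical name, and the 95% and 85% shares are taken from the Average Data Lake cell.

```python
def inverse_ratio(nr_total, nr_good):
    """Return nr_total / nr_good, or nr_total itself when nr_good is zero."""
    if nr_good == 0:
        return float(nr_total)
    return float(nr_total / nr_good)


nr_records = 100_000_000
eps2 = inverse_ratio(nr_records, nr_records * 0.95)   # veracity, 95% with provenance
eps3 = inverse_ratio(nr_records, nr_records * 0.85)   # validity, 85% valid
worst = inverse_ratio(nr_records, 0)                  # nothing passes the check

print('epsilon(2) = %21.9f' % eps2)
print('epsilon(3) = %21.9f' % eps3)
print('worst case = %21.9f' % worst)
```

The coefficient is 1.0 for a perfect data lake and grows as the share of trusted or valid records shrinks.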


# epsilon 4: Coefficient of Variability

![Data Vault](./images/the_data_vault.png)

In [14]:
tpole=['Time','Person','Organisation','Location','Event']

print('\nHub\n%s' % ('='*20))
hubs = np.array(tuple(itertools.combinations(tpole, 1)))
for i in range(hubs.shape[0]):
    print('Hub: %s' % ('-'.join(hubs[i])))

nrProcessesHub = hubs.shape[0]
print(nrProcessesHub)

print('\nLink\n%s' % ('='*20))
links=np.array(tuple(itertools.combinations_with_replacement(tpole, 2)))
for i in range(links.shape[0]):
    print('Link: %s' % ('-'.join(links[i])))

nrProcessesLink = links.shape[0]
print(nrProcessesLink)
    
print('\nSatellite\n%s' % ('='*20))
satType = ['Alpha','Beta','Charlie','Delta','Echo']

satellites=np.array(tuple(itertools.product(tpole,satType)))
for i in range(satellites.shape[0]):
    print('Satellite: %s' % ('-'.join(satellites[i])))

nrProcessesSatellite = satellites.shape[0]
print(nrProcessesSatellite)

nrProcesses = nrProcessesHub + nrProcessesLink + nrProcessesSatellite
print('nrProcesses: %0d' % nrProcesses)

nrValidProcesses = nrProcesses * ValidProcess
print('nrValidProcesses: %0d' % nrValidProcesses)


Hub
Hub: Time
Hub: Person
Hub: Organisation
Hub: Location
Hub: Event
5

Link
Link: Time-Time
Link: Time-Person
Link: Time-Organisation
Link: Time-Location
Link: Time-Event
Link: Person-Person
Link: Person-Organisation
Link: Person-Location
Link: Person-Event
Link: Organisation-Organisation
Link: Organisation-Location
Link: Organisation-Event
Link: Location-Location
Link: Location-Event
Link: Event-Event
15

Satellite
Satellite: Time-Alpha
Satellite: Time-Beta
Satellite: Time-Charlie
Satellite: Time-Delta
Satellite: Time-Echo
Satellite: Person-Alpha
Satellite: Person-Beta
Satellite: Person-Charlie
Satellite: Person-Delta
Satellite: Person-Echo
Satellite: Organisation-Alpha
Satellite: Organisation-Beta
Satellite: Organisation-Charlie
Satellite: Organisation-Delta
Satellite: Organisation-Echo
Satellite: Location-Alpha
Satellite: Location-Beta
Satellite: Location-Charlie
Satellite: Location-Delta
Satellite: Location-Echo
Satellite: Event-Alpha
Satellite: Event-Beta
Satellite: Event-Charlie
Satellite: Event-Delta
Satellite: Event-Echo
25
nrProcesses: 45
nrValidProcesses: 45
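
The process counts follow from simple combinatorics, so they can be checked without listing every hub, link and satellite. The sketch below compares closed-form counts with explicit enumeration; the variable names are illustrative and `math.comb` requires Python 3.8 or later.

```python
import itertools
from math import comb

tpole = ['Time', 'Person', 'Organisation', 'Location', 'Event']
sat_type = ['Alpha', 'Beta', 'Charlie', 'Delta', 'Echo']
n = len(tpole)

nr_hubs = comb(n, 1)                 # one hub per T-P-O-L-E entity
nr_links = comb(n + 2 - 1, 2)        # unordered pairs with self-links allowed
nr_satellites = n * len(sat_type)    # every entity paired with every satellite type

# Cross-check the closed forms against explicit enumeration.
assert nr_hubs == len(list(itertools.combinations(tpole, 1)))
assert nr_links == len(list(itertools.combinations_with_replacement(tpole, 2)))
assert nr_satellites == len(list(itertools.product(tpole, sat_type)))

nr_processes = nr_hubs + nr_links + nr_satellites
print(nr_hubs, nr_links, nr_satellites, nr_processes)
```

Links use combinations with replacement because an entity may link to itself, for example Person-Person.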

In [15]:
totalProcessFitness = nrValidProcesses * ProcessFitness

averageProcessFitness = totalProcessFitness / nrProcesses

epsilonFour = float((nrProcesses/nrValidProcesses) * averageProcessFitness)
        
print('epsilon(4) = %21.9f' % epsilonFour)

epsilon(4) =           1.000000000
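
In the cell above, `nrValidProcesses` appears once in a numerator and once in a denominator, so it cancels and epsilon 4 reduces to `ProcessFitness` whenever `ValidProcess` is non-zero. The sketch below repeats the calculation as a function to make that visible. `coefficient_of_variability` is a hypothetical name, and raising an error for zero valid processes is an assumption, since the notebook cell has no guard for that case.

```python
def coefficient_of_variability(nr_processes, valid_process, process_fitness):
    """epsilon 4 as computed in the notebook, with an explicit zero check."""
    nr_valid_processes = nr_processes * valid_process
    if nr_valid_processes == 0:
        # Assumption: the original cell would divide by zero here.
        raise ValueError('no valid processes: epsilon 4 is undefined')
    total_fitness = nr_valid_processes * process_fitness
    average_fitness = total_fitness / nr_processes
    return float((nr_processes / nr_valid_processes) * average_fitness)


# (ValidProcess, ProcessFitness) pairs from the Show Case, Average and Bad cells
for valid_process, fitness in [(1.00, 1.00), (0.95, 0.98), (0.33, 0.33)]:
    eps4 = coefficient_of_variability(45, valid_process, fitness)
    print('ValidProcess=%.2f ProcessFitness=%.2f -> epsilon(4) = %.9f'
          % (valid_process, fitness, eps4))
```

A consequence is that, as written, `ValidProcess` has no influence on the final coefficient.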


# epsilon 5: Coefficient of Value

In [16]:
epsilonFive = float((nrRecords - nrRecordsValid) * businessValue * nrUsers)
        
print('epsilon(5) = %21.9f' % epsilonFive)

epsilon(5) =           0.000000000
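
Epsilon 5 is zero in the show case because every record is valid. To see a non-zero value, the sketch below applies the same formula to the Average Data Lake ratios (85% valid records). `coefficient_of_value` is a hypothetical name; the parameter values are those set in the notebook's second cell.

```python
def coefficient_of_value(nr_records, nr_records_valid, nr_value,
                         nr_record_block, nr_users):
    """Business value lost on invalid records, summed over all users (epsilon 5)."""
    business_value = nr_value / nr_record_block      # value of a single record
    return float((nr_records - nr_records_valid) * business_value * nr_users)


nr_records = 100_000_000
eps5 = coefficient_of_value(
    nr_records=nr_records,
    nr_records_valid=nr_records * 0.85,   # Average Data Lake: 85% valid
    nr_value=0.015,
    nr_record_block=1000,
    nr_users=1_000_000,
)
print('epsilon(5)            = %21.9f' % eps5)
print('alpha(5) * epsilon(5) = %21.9f' % (eps5 / nr_records))
```

The weight alpha(5) = 1/nrRecords brings the monetary amount back to the same order of magnitude as the other four terms.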


# Coefficient of Swampiness SC

In [17]:
print('epsilon(1) = %29.9f - alpha(1) = %3.9f => %15.9f (Variety - Multi-Formats)' % (epsilonOne, alphaOne,(epsilonOne * alphaOne)))
print('epsilon(2) = %29.9f - alpha(2) = %3.9f => %15.9f (Veracity - Trust Source of Data)' % (epsilonTwo, alphaTwo, (epsilonTwo * alphaTwo)))
print('epsilon(3) = %29.9f - alpha(3) = %3.9f => %15.9f (Validity - Data correctness)' % (epsilonThree, alphaThree, (epsilonThree * alphaThree)))
print('epsilon(4) = %29.9f - alpha(4) = %3.9f => %15.9f (Variability - Ease of Processing)' % (epsilonFour, alphaFour, (epsilonFour * alphaFour)))
print('epsilon(5) = %29.9f - alpha(5) = %3.9f => %15.9f (Value - Business Value Lost)' % (epsilonFive, alphaFive, (epsilonFive * alphaFive)))


SC = (epsilonOne * alphaOne) + (epsilonTwo * alphaTwo) + (epsilonThree * alphaThree) + (epsilonFour * alphaFour) + (epsilonFive * alphaFive)

print('\nSet: (S-%04d) User: (U-%09d) Value: (£ %0.2f) => Coefficient of Swampiness (SC) = %21.9f' % (nrSet, nrUser, nrValue, SC))

epsilon(1) =           300000000.000000000 - alpha(1) = 0.000000010 =>     3.000000000 (Variety - Multi-Formats)
epsilon(2) =                   1.000000000 - alpha(2) = 1.000000000 =>     1.000000000 (Veracity - Trust Source of Data)
epsilon(3) =                   1.000000000 - alpha(3) = 1.000000000 =>     1.000000000 (Validity - Data correctness)
epsilon(4) =                   1.000000000 - alpha(4) = 1.000000000 =>     1.000000000 (Variability - Ease of Processing)
epsilon(5) =                   0.000000000 - alpha(5) = 0.000000010 =>     0.000000000 (Value - Business Value Lost)

Set: (S-0000) User: (U-001000000) Value: (£ 0.01) => Coefficient of Swampiness (SC) =           6.000000000
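
Written as a formula, the cell above computes a weighted sum of the five coefficients. The notation below, with $N$ standing for `nrRecords`, restates the code and is not a definition taken from elsewhere:

```latex
SC = \sum_{i=1}^{5} \alpha_i \, \varepsilon_i ,
\qquad \alpha_1 = \alpha_5 = \frac{1}{N},
\qquad \alpha_2 = \alpha_3 = \alpha_4 = 1

% Show case (nrSet = 0, N = 10^8):
SC = \frac{3 \times 10^{8}}{10^{8}} + 1 + 1 + 1 + \frac{0}{10^{8}} = 6
```

With these weights and the example complexity factors, 6 is the lowest value the coefficient can take; any loss of provenance, validity or value raises it.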


In [18]:
businessInsightsValueExpected = nrRecords * businessValue * nrUsers
businessInsightsValueGain = nrRecordsValid * businessValue * nrUsers
businessInsightsValueLost = businessInsightsValueExpected - businessInsightsValueGain

print('\nBusiness Insights Value')
print('='*50)
print('> Expected = £{:20,.2f}'.format(businessInsightsValueExpected))
print('+ Gain     = £{:20,.2f}'.format(businessInsightsValueGain))
print('%s%s' % (' '*10,'='*25))
print('- Lost     = £{:20,.2f}'.format(businessInsightsValueLost))
print('='*50)


print('\nBusiness Value per {:0,d} records: £{:0,.6f}'.format(nrRecordBlock, nrValue))
print('\nUser(s): {:0,d}'.format(nrUsers))

print('\nCoefficient of Swampiness (SC) = {:15.9f}'.format(SC))


Business Insights Value
> Expected = £    1,500,000,000.00
+ Gain     = £    1,500,000,000.00
- Lost     = £                0.00

Business Value per 1,000 records: £0.015000

User(s): 1,000,000

Coefficient of Swampiness (SC) =     6.000000000


In [19]:
# SC and business value lost for nrSet = 0, 1, 2, 3 with the parameters defined above
print('R 0: {:15.9f} > £{:20,.2f}'.format(6.0, 0.00))
print('R 1: {:15.9f} > £{:20,.2f}'.format(6.00016, 15000.00))
print('R 2: {:15.9f} > £{:20,.2f}'.format(8.459102167, 225000000.00))
print('R 3: {:15.9f} > £{:20,.2f}'.format(19.440606061, 1005000000.00))

R 0:     6.000000000 > £                0.00
R 1:     6.000160000 > £           15,000.00
R 2:     8.459102167 > £      225,000,000.00
R 3:    19.440606061 > £    1,005,000,000.00
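
The summary above is typed in by hand, so it must be updated whenever a scenario changes. The sketch below shows one way to regenerate it from the scenario ratios. `swampiness_coefficient` is a hypothetical helper that folds the notebook's calculation cells into one function; the defaults are the notebook's parameters and the process count of 45 comes from the Data Vault cell.

```python
def swampiness_coefficient(provenance, valid, fitness, valid_process,
                           nr_records=100_000_000, nr_users=1_000_000,
                           nr_value=0.015, nr_record_block=1000,
                           nr_processes=45):
    """Return (SC, business value lost) for one scenario. Hypothetical helper."""
    alpha_1 = alpha_5 = 1.0 / nr_records

    # epsilon 1: five equal sections with complexity factors 1 to 5
    eps_1 = sum((nr_records / 5) * factor for factor in range(1, 6))

    # epsilon 2 and 3: all records over records with provenance / valid records
    nr_with_provenance = nr_records * provenance
    nr_valid = nr_records * valid
    eps_2 = nr_records if nr_with_provenance == 0 else nr_records / nr_with_provenance
    eps_3 = nr_records if nr_valid == 0 else nr_records / nr_valid

    # epsilon 4: process fitness over the valid processes (no zero guard, as in the notebook)
    nr_valid_processes = nr_processes * valid_process
    average_fitness = nr_valid_processes * fitness / nr_processes
    eps_4 = (nr_processes / nr_valid_processes) * average_fitness

    # epsilon 5: business value lost on invalid records
    eps_5 = (nr_records - nr_valid) * (nr_value / nr_record_block) * nr_users

    sc = alpha_1 * eps_1 + eps_2 + eps_3 + eps_4 + alpha_5 * eps_5
    return sc, eps_5


# (provenance, valid, fitness, valid_process) copied from the scenario cells
scenarios = {
    0: (1.00, 1.00, 1.00, 1.00),
    1: (0.99999, 0.99999, 0.99999, 1.00),
    2: (0.95, 0.85, 0.98, 0.95),
    3: (0.33, 0.33, 0.33, 0.33),
}

results = {nr_set: swampiness_coefficient(*ratios) for nr_set, ratios in scenarios.items()}
for nr_set, (sc, lost) in results.items():
    print('R {:d}: {:15.9f} > £{:20,.2f}'.format(nr_set, sc, lost))
```

Computing the rows in a loop keeps the summary tied to the parameter cells, so a changed ratio shows up in the table on the next run.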


In [20]:
print('Done!')

Done!
