# <center><b>Oil Data Quality Index</b></center>

### <center>First Draft<br><br>January 2015<br>Author: James L. Makela</center>

## <u>1. Background</u>

The Adios3 Oil Library will accept data on crude oils and refined products provided they contain a minimum required set of data.  Additional missing data will be generated by approximation formulas according to the <b><i>Oil Properties Estimation</i></b> document.

It is reasonable to propose that the more measured data an oil record has, the better we can estimate its missing oil properties. <i>(Ideally, we would not need to estimate anything, and would simply use measured values.)</i>

So in addition to requiring a minimum set of measured data, we will try to assess an oil record's <b>quality index</b>.  The quality index is a numeric score that we will use to represent an oil's "fitness", or to put it another way, how well we expect to be able to calculate reasonable estimates of the missing oil properties.

In [35]:
%pylab inline
import numpy as np

import oil_library
from oil_library.models import Oil, ImportedRecord, KVis

session = oil_library._get_db_session()

# these are some reasonable samples of oil records in the oil library
ans_mp = session.query(ImportedRecord).filter(ImportedRecord.oil_name == 'ALASKA NORTH SLOPE (MIDDLE PIPELINE)').one()
ans_2002 = session.query(ImportedRecord).filter(ImportedRecord.oil_name == 'ALASKA NORTH SLOPE (2002)').one()
ans_ps9 = session.query(ImportedRecord).filter(ImportedRecord.oil_name == 'ALASKA NORTH SLOPE-PUMP STATION #9, BP').one()
arabian_hvy = session.query(ImportedRecord).filter(ImportedRecord.oil_name == 'ARABIAN HEAVY, AMOCO').one()
borholla = session.query(ImportedRecord).filter(ImportedRecord.oil_name == 'BORHOLLA').one()
lls_bp = session.query(ImportedRecord).filter(ImportedRecord.oil_name == 'LIGHT LOUISIANNA SWEET, BP').one()
benin = session.query(ImportedRecord).filter(ImportedRecord.oil_name == 'BENIN RIVER, CHEVRON').one()
empire = session.query(ImportedRecord).filter(ImportedRecord.oil_name == 'EMPIRE ISLAND, AMOCO').one()
sajaa = session.query(ImportedRecord).filter(ImportedRecord.oil_name == 'SAJAA CONDENSATE, BP').one()


Populating the interactive namespace from numpy and matplotlib


## <u>2. Minimum Data Required</u>

As previously mentioned, we only accept oil records that contain a minimum required set of data.  Otherwise, we will not process the record.

| Type | Density (or API) | Viscosity | Distillation Cuts |
| ---- |:----------------:|:---------:|:-----------------:|
| Crude | Yes | Yes | No |
| Refined Product | Yes | Yes | Yes (at least 3) |
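
As a rough illustration of the table above, the minimum-data check could be sketched as follows.  The `has_minimum_data` function and its parameters are hypothetical and are not part of `oil_library`; the record is reduced to a few plain values, and "Yes" for viscosity is read here as "at least one measurement", which is an assumption.

```python
def has_minimum_data(is_refined, has_density_or_api, num_viscosities, num_cuts):
    """Hypothetical check mirroring the minimum-data table.

    All parameter names are assumptions made for this sketch.
    """
    # Every record needs a density (or API) and at least one viscosity.
    if not has_density_or_api or num_viscosities < 1:
        return False

    # Refined products additionally need at least 3 distillation cuts.
    if is_refined and num_cuts < 3:
        return False

    return True


print(has_minimum_data(False, True, 1, 0))  # crude without cuts: True
print(has_minimum_data(True, True, 1, 2))   # refined, only 2 cuts: False
```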


## <u>3. Calculating the Quality Index</u>

### <u>3.1 The General Scoring Method</u>

The total score of the oil will be an aggregation of 1 or more tests which will result in a quality index.  A quality index $Q$ is defined with a range:

$$
\begin{align}
0 &\leq Q \leq 1
\end{align}
$$

This score can (naively) be considered a single test, but is more likely to be a compilation of scores of multiple sub-tests.  In turn, each sub-test could be considered in a similar fashion, making a tree structure of tests of which the leaf-level tests contain only a single test.  So we will define the terminology of our testing processes as either <b>'Aggregate'</b> or <b>'Atomic'</b>.
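
The tree of aggregate and atomic tests can be pictured with a small recursive sketch.  The nested-list representation and the `tree_score` name are assumptions made only for illustration: a leaf is an atomic score, and each inner node is taken to be an equally weighted average of its children.

```python
def tree_score(node):
    """Score a hypothetical test tree.

    A leaf is a number (an atomic score of 0.0 or 1.0); an inner node is a
    non-empty list of child nodes whose scores are averaged with equal weights.
    """
    if isinstance(node, list):
        child_scores = [tree_score(child) for child in node]
        return sum(child_scores) / len(child_scores)
    return float(node)


# one atomic test plus an aggregate made of two atomic tests
tree = [1.0, [1.0, 0.0]]
print(tree_score(tree))  # (1.0 + 0.5) / 2 = 0.75
```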


### <u>3.2 Atomic Test Score</u>

The assumption for an atomic test method is that it is testing only one thing.  As such, the results of an atomic test method will be:

$$
\begin{align}
Q &= 0 \qquad \qquad \text{(the test failed)} \cr
Q &= 1 \qquad \qquad \text{(the test passed)} \cr
\end{align}
$$


### <u>3.3 Aggregate Test Score</u>

The assumption for an aggregate test method is that it is testing a number of things.  These things are assumed to be a collection of sub-tests which will each return a score $Q_i$.  An aggregate test will return an ordinary or weighted average of the collection of sub-test scores $\bar{Q}$, with the same numeric range as $Q$. 

$$
\begin{align}
n &= \text{the number of sub-tests} \cr
Q_i &= \text{a sub-test result indexed by } i \cr
w_i &= \text{the weight of a sub-test indexed by } i \cr
&\quad \text{for an ordinary average, the weights will be } [1, 1, \ldots] \cr
\cr
\bar{Q} &= {\sum\limits_{i=1}^{n}{w_i \cdot Q_i} \over \sum\limits_{i=1}^n{w_i}} \cr
\end{align}
$$

In [36]:
def aggregate_score(Q_i, w_i=None):
    Q_i = np.array(Q_i)
    
    if w_i is None:
        w_i = np.ones(Q_i.shape)
    else:
        w_i = np.array(w_i)
    
    return np.sum(w_i * Q_i) / np.sum(w_i)

Q_i = [1.0, 1.0, 1.0, 0.0]
w_i = [1.0, 1.0, 1.0, 1.0]

print aggregate_score(Q_i)  # simple average (0.75)
print aggregate_score(Q_i, w_i) # explicit simple average (0.75)

Q_i = [1.0, 1.0, 1.0, 1.0]
w_i = [3.0, 1.0, 1.0, 1.0]
print aggregate_score(Q_i, w_i)  # weighted average (1.0)

Q_i = [1.0, 0.0, 0.0, 0.0]
w_i = [3.0, 1.0, 1.0, 1.0]
print aggregate_score(Q_i, w_i)  # weighted average (0.5)

Q_i = [0.0, 1.0, 0.0, 0.0]
w_i = [3.0, 1.0, 1.0, 1.0]
print aggregate_score(Q_i, w_i)  # weighted average (1/6 or 0.1666)


0.75
0.75
1.0
0.5
0.166666666667


## <u>4. The Imported Oil Record Tests</u>

### <u>4.1 Oil Demographics</u>

We would like to gauge the richness of the demographic data in the record.  These are the text fields that describe the oil record, and if the need arises, they can help us to investigate the source of the oil data found in the record.

The demographic fields will be tested simply for their presence.  For now, we will not place special importance on any particular demographic field, so a simple average will be used for scoring the multiple fields.

The demographic fields to be tested are:
- reference

<i>(<b>Note</b>: in future versions we may want to add source quality flag, and age.)</i>

In [37]:
def score_demographics(imported_rec):
    fields = ('reference',)
    scores = []

    for f in fields:
        if getattr(imported_rec, f) is not None:
            scores.append(1.0)
        else:
            scores.append(0.0)

    return aggregate_score(scores)

for ir in (ans_mp, ans_ps9, arabian_hvy, borholla,
           lls_bp, benin, empire, sajaa):
    print ('Oil: {}, Demographics Score: {}'
           .format(ir.oil_name, score_demographics(ir)))


Oil: ALASKA NORTH SLOPE (MIDDLE PIPELINE), Demographics Score: 1.0
Oil: ALASKA NORTH SLOPE-PUMP STATION #9, BP, Demographics Score: 1.0
Oil: ARABIAN HEAVY, AMOCO, Demographics Score: 1.0
Oil: BORHOLLA, Demographics Score: 1.0
Oil: LIGHT LOUISIANNA SWEET, BP, Demographics Score: 1.0
Oil: BENIN RIVER, CHEVRON, Demographics Score: 1.0
Oil: EMPIRE ISLAND, AMOCO, Demographics Score: 1.0
Oil: SAJAA CONDENSATE, BP, Demographics Score: 1.0


### <u>4.2 Oil API</u>

An imported oil record is required to have either an API density value or a set of densities measured at reference temperatures, so this may seem like an unnecessary test.  However, we would surmise that a record that has both types of density information is of better quality than a record that contains only one or the other.  For this reason, we test the oil for an API value and return an atomic score.

In [38]:
def score_api(imported_rec):
    if imported_rec.api is None:
        return 0.0
    else:
        return 1.0

for ir in (ans_mp, ans_ps9, arabian_hvy, borholla,
           lls_bp, benin, empire, sajaa):
    print ('Oil: {}, API Score: {}'
           .format(ir.oil_name, score_api(ir)))


Oil: ALASKA NORTH SLOPE (MIDDLE PIPELINE), API Score: 1.0
Oil: ALASKA NORTH SLOPE-PUMP STATION #9, BP, API Score: 0.0
Oil: ARABIAN HEAVY, AMOCO, API Score: 0.0
Oil: BORHOLLA, API Score: 1.0
Oil: LIGHT LOUISIANNA SWEET, BP, API Score: 1.0
Oil: BENIN RIVER, CHEVRON, API Score: 1.0
Oil: EMPIRE ISLAND, AMOCO, API Score: 0.0
Oil: SAJAA CONDENSATE, BP, API Score: 1.0


### <u>4.3 Oil Densities</u>

An imported oil record can contain 0 to 4 density values that are measured at reference temperatures.  In addition, we will consider the oil's API as a density at $15^\circ C$ when the record has no measured density at that temperature.  We will give an atomic pass/fail score to each density measurement set that is found in the oil record.

Right now, we are testing that the oil's existing density attributes:
- contain a valid numeric density value and...
- contain a valid numeric temperature

<i>(<b>Note</b>: In the future, we could test that the density is in a reasonable range for oils, that the temperature is in a reasonable range in kelvin, and perhaps a few other things.)</i>

If any density attribute fails any one of the density testing criteria, it is given a score of 0.<br>
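
The range checks mentioned in the note above could look something like the following sketch.  The bounds are placeholders chosen only for illustration; they are assumptions, not values taken from the oil library, and the function names are hypothetical.

```python
# Placeholder bounds, chosen only for illustration; they are assumptions,
# not values taken from the oil library.
DENSITY_RANGE_KG_M3 = (500.0, 1200.0)
TEMP_RANGE_K = (200.0, 400.0)


def in_range(value, bounds):
    """Return True if value is present and inside the closed interval."""
    low, high = bounds
    return value is not None and low <= value <= high


def score_density_values(kg_m_3, ref_temp_k):
    """Atomic score: 1.0 only if both values are present and plausible."""
    if (in_range(kg_m_3, DENSITY_RANGE_KG_M3) and
            in_range(ref_temp_k, TEMP_RANGE_K)):
        return 1.0
    return 0.0


print(score_density_values(876.1, 288.15))  # plausible values: 1.0
print(score_density_values(876.1, 15.0))    # temperature not in kelvin: 0.0
print(score_density_values(None, 288.15))   # missing density: 0.0
```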

We would surmise that the more distinct measurements we have, the better we would be able to estimate oil density at an arbitrary temperature, and so the oil record would have a better quality.  I believe that we should place the biggest weight on the first density, and the weights of all successive densities should diminish exponentially as in the following series:

$$
\begin{align}
w_i &= \left[0.5, 0.25, 0.125, \ldots \frac{1}{2^n}\right] \cr
\end{align}
$$

We would like to see 4 valid density measurements present in our oil record, so if it has fewer than that, we would assign a score $Q = 0$ for any missing density up to that number.
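
To make the weighting concrete, here is a minimal pure-Python sketch of the series.  It truncates the series at the number of scores and duplicates the last term so that the weights sum to 1, as the density scoring code in this section does.  The `halving_weights` name is hypothetical.

```python
def halving_weights(n):
    """Return n weights 1/2, 1/4, ... with the last one duplicated.

    Requires n >= 2.  Duplicating the final term makes the weights sum to 1.
    """
    weights = [1.0 / 2 ** (i + 1) for i in range(n)]
    weights[-1] = weights[-2]
    return weights


weights = halving_weights(4)
print(weights)       # [0.5, 0.25, 0.125, 0.125]
print(sum(weights))  # 1.0

# two valid densities and two missing ones
scores = [1.0, 1.0, 0.0, 0.0]
print(sum(w * s for w, s in zip(weights, scores)) / sum(weights))  # 0.75
```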

In [39]:
def score_density_rec(density_rec):
    if (density_rec.kg_m_3 is not None and
            density_rec.ref_temp_k is not None):
        return 1.0
    else:
        return 0.0

def score_densities(imported_rec):
    scores = []

    for d in imported_rec.densities:
        scores.append(score_density_rec(d))

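    # Count the API as a 15C density only if no measured density exists
    # near 15C; densities without a reference temperature are skipped
    if not any([d.ref_temp_k is not None and
                np.isclose(d.ref_temp_k, [288.0, 288.15]).any()
                for d in imported_rec.densities]):
        scores.append(score_api(imported_rec))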

    # We have a maximum number of 4 density field sets in our flat file
    # We can set a lower acceptable number later
    if len(scores) < 4:
        scores += [0.0] * (4 - len(scores))

    # compute our weights 
    w_i = 1.0 / (2.0 ** (np.arange(len(scores)) + 1))
    w_i[-1] = w_i[-2]  # duplicate the last weight so we sum to 1.0

    return aggregate_score(scores, w_i)

for ir in (ans_mp, ans_2002, ans_ps9, arabian_hvy, borholla,
           lls_bp, benin, empire, sajaa):
    print 'Oil: {}'.format(ir.oil_name)
    print '\tAPI: {}'.format(ir.api)
    print '\tDensities: {}'.format(ir.densities)
    print ('\tScore: {}'
           .format(score_densities(ir)))


Oil: ALASKA NORTH SLOPE (MIDDLE PIPELINE)
	API: 29.9
	Densities: [<Density(886.9 kg/m^3 at 273.15K)>, <Density(876.1 kg/m^3 at 288.15K)>]
	Score: 0.75
Oil: ALASKA NORTH SLOPE (2002)
	API: 30.9
	Densities: [<Density(878.0 kg/m^3 at 273.0K)>, <Density(905.0 kg/m^3 at 273.0K)>, <Density(930.0 kg/m^3 at 273.0K)>, <Density(946.0 kg/m^3 at 273.0K)>]
	Score: 1.0
Oil: ALASKA NORTH SLOPE-PUMP STATION #9, BP
	API: None
	Densities: [<Density(888.0 kg/m^3 at 274.15K)>, <Density(875.0 kg/m^3 at 288.15K)>]
	Score: 0.75
Oil: ARABIAN HEAVY, AMOCO
	API: None
	Densities: [<Density(888.0 kg/m^3 at 288.15K)>]
	Score: 0.5
Oil: BORHOLLA
	API: 31.31
	Densities: []
	Score: 0.5
Oil: LIGHT LOUISIANNA SWEET, BP
	API: 35.6
	Densities: []
	Score: 0.5
Oil: BENIN RIVER, CHEVRON
	API: 39.3
	Densities: []
	Score: 0.5
Oil: EMPIRE ISLAND, AMOCO
	API: None
	Densities: [<Density(858.0 kg/m^3 at 288.15K)>]
	Score: 0.5
Oil: SAJAA CONDENSATE, BP
	API: 57.5
	Densities: [<Density(748.0 kg/m^3 at 288.15K)>]
	Score: 0.5


### <u>4.4 Oil Pour Point</u>

An oil has two pour point values.  The pour point is defined simply as the temperature at which the oil enters its solid phase or "stops pouring".  However, the paraffins in the oil have a tendency to form crystalline structures over time, which in turn raises the pour point temperature.  So a maximum pour point is measured for the case in which the oil has stayed at a constant temperature for a while, and a minimum pour point is often measured for the case in which the oil was recently heated, breaking down the crystalline structures.  The "freshly heated" pour point is expected to be lower than the pour point of older oil kept at a constant temperature.

So in the interest of evaluating the quality of an oil record, we would definitely like to see at least one valid pour point value.  But the presence of a second (min) value may simply indicate an oil with a lot of paraffins.  It's unclear whether we should assign extra credit for having two values.

At this point in time, we are assigning a score for both a minimum and maximum value, and we will apply the following weights:
- pour_point_max_k: 2
- pour_point_min_k: 1

These weights indicate that we want to see at least the maximum pour point temperature, which accounts for two thirds of the score for this test.  The minimum pour point carries half the weight of the maximum, so it contributes the remaining third.

In [40]:
def score_pour_point_min(imported_rec):
    return (1.0 if imported_rec.pour_point_min_k is not None else 0.0)
    
def score_pour_point_max(imported_rec):
    return (1.0 if imported_rec.pour_point_max_k is not None else 0.0)
    
def score_pour_point(imported_rec):
    scores = []

    scores.append(score_pour_point_max(imported_rec))
    scores.append(score_pour_point_min(imported_rec))
    weights = [2.0, 1.0]

    return aggregate_score(scores, weights)

print score_pour_point(borholla)  # no pour point data
print score_pour_point(arabian_hvy)  # max, but not min
print score_pour_point(ans_mp)  # both max and min


0.0
0.666666666667
1.0


### <u>4.5 Oil Flash Point</u>

An oil record contains a flash point minimum and maximum value.  It is a bit unclear to me what the distinction between the minimum and maximum is.  However, there are two possibilities, which I will explain.

A <b>minimum flash point</b> is defined as the minimum temperature at which an oil or fuel product will ignite on application of an ignition source under specified conditions.

The <b>fire point</b> of an oil or fuel product is the temperature at which the vapor produced by that product will continue to burn for at least 5 seconds after ignition by an open flame. So at the flash point, which would be a lower temperature, a substance will ignite briefly, but vapor might not be produced at a high enough rate to sustain the fire.  The fire point can be estimated to be roughly $10^\circ C$ higher than flash point.

Based on this assessment of flash point and fire point, I looked at the source data for the oil records.  For the vast majority of records that contain both values, the values were nearly identical, which indicates to me that we are not dealing with a fire point value stored as a maximum flash point.

There was a small handful of records that had a maximum flash point significantly higher than the minimum, which could indicate that it is a fire point.

I believe that we can consider both the minimum and maximum values as being a flash point, and that if we have at least one value, we probably have sufficient data quality.  So the rules are:

- if we have no flash point, min or max, then $Q = 0$
- if we have a minimum flash point, then $Q = 1$
- if we have a maximum flash point, then $Q = 1$
- if we have both a maximum and minimum flash point, then $Q = 1$


In [41]:
def score_flash_point(imported_rec):
    if (imported_rec.flash_point_min_k is not None or
            imported_rec.flash_point_max_k is not None):
        return 1.0
    else:
        return 0.0

for ir in (ans_mp, ans_ps9, arabian_hvy, borholla,
           lls_bp, benin, empire, sajaa):
    print ('Oil: {}, \tFlash Point Score: {}'
           .format(ir.oil_name, score_flash_point(ir)))


Oil: ALASKA NORTH SLOPE (MIDDLE PIPELINE), 	Flash Point Score: 1.0
Oil: ALASKA NORTH SLOPE-PUMP STATION #9, BP, 	Flash Point Score: 1.0
Oil: ARABIAN HEAVY, AMOCO, 	Flash Point Score: 1.0
Oil: BORHOLLA, 	Flash Point Score: 0.0
Oil: LIGHT LOUISIANNA SWEET, BP, 	Flash Point Score: 0.0
Oil: BENIN RIVER, CHEVRON, 	Flash Point Score: 0.0
Oil: EMPIRE ISLAND, AMOCO, 	Flash Point Score: 1.0
Oil: SAJAA CONDENSATE, BP, 	Flash Point Score: 0.0


### <u>4.6 Oil SARA Fractions</u>

The sub-compounds that make up an oil have been categorized, at least traditionally, by organic chemists as saturates, aromatics, resins, and asphaltenes.  This group of four chemical categories is known as SARA.  And an imported oil record may (or may not) contain measured fractional values for them.  If it does, then we would say that the record has better data quality, since we have a reasonable reference to double check the veracity of the SARA component estimations that we perform.

No particular SARA value is perceived to have a more important role than any other, so we will give them equal weights when computing the score for the existence of SARA fractions.

In [42]:
def score_sara_saturates(imported_rec):
    return (1.0 if imported_rec.saturates is not None else 0.0)

def score_sara_aromatics(imported_rec):
    return (1.0 if imported_rec.aromatics is not None else 0.0)

def score_sara_resins(imported_rec):
    return (1.0 if imported_rec.resins is not None else 0.0)

def score_sara_asphaltenes(imported_rec):
    return (1.0 if imported_rec.asphaltenes is not None else 0.0)

def score_sara_fractions(imported_rec):
    scores = []

    scores.append(score_sara_saturates(imported_rec))
    scores.append(score_sara_aromatics(imported_rec))
    scores.append(score_sara_resins(imported_rec))
    scores.append(score_sara_asphaltenes(imported_rec))

    return aggregate_score(scores)

print score_sara_fractions(ans_mp)  # no SARA fractions
print score_sara_fractions(lls_bp)  # Asphaltenes only
print score_sara_fractions(benin)  # Saturates, Aromatics, Asphaltenes, no Resins


0.0
0.25
0.75


### <u>4.7 Oil Emulsion Constants</u>

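An imported oil record contains a minimum and maximum value for the emulsion constant, along with an emulsion water content value.
After a discussion with Bill & Chris, we have decided to go with the following weights for our emulsion properties.  A weight of 0 means that the maximum emulsion constant does not contribute to the score.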

- water_content_emulsion: weight = 2
- emuls_constant_min: weight = 3
- emuls_constant_max: weight = 0

In [43]:
def score_water_content_emulsion(imported_rec):
    return (1.0 if imported_rec.water_content_emulsion is not None else 0.0)

def score_emulsion_constant_min(imported_rec):
    return (1.0 if imported_rec.emuls_constant_min is not None else 0.0)

def score_emulsion_constant_max(imported_rec):
    return (1.0 if imported_rec.emuls_constant_max is not None else 0.0)

def score_emulsion_constants(imported_rec):
    scores = []

    scores.append(score_water_content_emulsion(imported_rec))
    scores.append(score_emulsion_constant_min(imported_rec))
    # scores.append(score_emulsion_constant_max(imported_rec))
    w_i = [2.0, 3.0]

    return aggregate_score(scores, w_i)

print score_emulsion_constants(ans_mp)  # no emulsion constant
print score_emulsion_constants(empire)  # both min & max


0.0
0.6
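
The 0.6 printed above follows from the weights if the record has a minimum emulsion constant but no emulsion water content, which is the only combination of pass/fail scores that yields this value:

$$
\begin{align}
\bar{Q} &= {2 \cdot 0 + 3 \cdot 1 \over 2 + 3} = {3 \over 5} = 0.6 \cr
\end{align}
$$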


### <u>4.8 Interfacial Tensions</u>

An oil record contains values for oil/water and oil/seawater interfacial tensions measured at a reference temperature.  So the check we need to perform is an atomic score of each measured value and its associated reference temperature.

We will score each measurement set as such:
- if the measurement and temperature are valid numeric values, then $Q = 1$
- else $Q = 0$

No particular interfacial tension value is perceived to have a more important role than the other, so they will be evaluated with an equally weighted score.

In [44]:
def score_oil_water_tension(imported_rec):
    if (imported_rec.oil_water_interfacial_tension_n_m is not None and
            imported_rec.oil_water_interfacial_tension_ref_temp_k is not None):
        return 1.0
    else:
        return 0.0

def score_oil_seawater_tension(imported_rec):
    if (imported_rec.oil_seawater_interfacial_tension_n_m is not None and
            imported_rec.oil_seawater_interfacial_tension_ref_temp_k is not None):
        return 1.0
    else:
        return 0.0

def score_interfacial_tensions(imported_rec):
    scores = []

    scores.append(score_oil_water_tension(imported_rec))
    scores.append(score_oil_seawater_tension(imported_rec))

    return aggregate_score(scores)

print score_interfacial_tensions(lls_bp)  # no interfacial tensions
print score_interfacial_tensions(empire)  # only oil/seawater
print score_interfacial_tensions(ans_mp)  # both oil/water and oil/seawater


0.0
0.5
1.0


### <u>4.9 Oil Viscosities</u>

An oil record can contain measurement data for up to 6 kinematic viscosities and 6 dynamic viscosities, each with an associated measurement reference temperature.  So we need to perform an atomic score of each measured value and its associated reference temperature.

We will score each measurement set as such:
- if the measurement and temperature are valid numeric values, then $Q = 1$
- else $Q = 0$

No particular viscosity measurement is perceived to be more important than the other.
But it is often the case that a dynamic viscosity exists with a redundant reference temperature to that of a kinematic viscosity measurement.  In that case, we will count the kinematic viscosity as a unique measurement and ignore the dynamic measurement.

We would surmise that the more distinct measurements we have, the better we would be able to estimate oil viscosity at an arbitrary temperature, and so the oil record would have a better quality.  So I believe that we should place the biggest weight on the first viscosity, and the weights of all successive viscosities should diminish exponentially as in the following series:

$$
\begin{align}
w_i &= \left[0.5, 0.25, 0.125, \ldots \frac{1}{2^n}\right] \cr
\end{align}
$$

We would also like to see at least 4 valid viscosity measurements present in our oil record, so if it has fewer than that, we would assign a score $Q = 0$ for any missing viscosity up to that number.<br>
In addition, any viscosity measurement that exists for the record, but does not have a passing score, should be counted even if the total number of viscosities exceeds 4.  The reasoning for this is that bad data is just as relevant as missing data.
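
The counting rules above can be illustrated with a small self-contained sketch.  It uses plain `(value, ref_temp_k)` tuples in place of the viscosity records, and the `score_measurements` name is hypothetical.  Kinematic measurements are assumed to be listed first, so a dynamic measurement at a repeated temperature is the one that gets ignored.

```python
def score_measurements(measurements, min_count=4):
    """Score (value, ref_temp_k) tuples with halving weights.

    Measurements at an already-seen reference temperature are ignored,
    missing entries up to min_count score 0, and failing entries are
    always counted.
    """
    seen_temps = set()
    scores = []

    for value, temp in measurements:
        if temp in seen_temps:
            continue  # redundant reference temperature
        seen_temps.add(temp)
        scores.append(1.0 if value is not None and temp is not None else 0.0)

    # pad with failing scores up to the desired number of measurements
    scores += [0.0] * max(0, min_count - len(scores))

    # weights 1/2, 1/4, ... with the last one duplicated so they sum to 1
    weights = [1.0 / 2 ** (i + 1) for i in range(len(scores))]
    weights[-1] = weights[-2]

    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


# two kinematic values plus two dynamic values at the same temperatures
print(score_measurements([(4.71e-05, 288.15), (3.54e-05, 295.15),
                          (0.0418, 288.15), (0.0313, 295.15)]))  # 0.75

# four valid measurements plus a fifth one with a missing value
print(score_measurements([(1.0, 273.15), (1.0, 288.15), (1.0, 298.15),
                          (1.0, 313.15), (None, 323.15)]))  # 0.9375
```

In the second case the fifth, failing measurement lowers a score that would otherwise be 1.0, which is the intent of counting bad data.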


In [45]:
def score_single_viscosity(viscosity_rec):
    temp = viscosity_rec.ref_temp_k

    try:
        value = viscosity_rec.m_2_s
    except AttributeError:
        value = viscosity_rec.kg_ms

    if (value is not None and temp is not None):
        return 1.0
    else:
        return 0.0

def score_viscosities(imported_rec):
    scores = []
    all_temps = set()
    all_viscosities = []

    for v in imported_rec.kvis + imported_rec.dvis:
        if v.ref_temp_k not in all_temps:
            all_viscosities.append(v)
            all_temps.add(v.ref_temp_k)

    for v in all_viscosities:
        scores.append(score_single_viscosity(v))

    # We require a minimum number of 4 viscosity field sets
    if len(scores) < 4:
        scores += [0.0] * (4 - len(scores))

    # compute our weights 
    w_i = 1.0 / (2.0 ** (np.arange(len(scores)) + 1))
    w_i[-1] = w_i[-2]  # duplicate the last weight so we sum to 1.0

    return aggregate_score(scores, w_i)

for ir in (ans_mp, arabian_hvy, borholla, lls_bp, benin, empire):
    print ir.kvis, ir.dvis
    print score_viscosities(ir)
    print


[] [<DVis(0.034 kg/ms at 273.15K)>, <DVis(0.016 kg/ms at 288.15K)>]
0.75

[<KVis(4.71e-05 m^2/s at 288.15K)>, <KVis(3.54e-05 m^2/s at 295.15K)>] [<DVis(0.0418 kg/ms at 288.15K)>, <DVis(0.0313 kg/ms at 295.15K)>]
0.75

[<KVis(7.36e-06 m^2/s at 323.15K)>, <KVis(3.26e-06 m^2/s at 333.15K)>] []
0.75

[<KVis(5.3e-06 m^2/s at 310.9278K)>] []
0.5

[] [<DVis(0.0951 kg/ms at 303.15K)>]
0.5

[<KVis(1.41e-05 m^2/s at 285.15K)>, <KVis(8.8e-06 m^2/s at 300.15K)>] [<DVis(0.0121 kg/ms at 285.15K)>, <DVis(0.0075 kg/ms at 300.15K)>]
0.75



### <u>4.10 Oil Distillation Cuts</u>

An oil record can contain measurement data for up to 15 distillation cuts, each with an associated vapor temperature, liquid temperature, and a cumulative fractional value representing the portion of oil that is evaporated at that temperature. So we need to perform an aggregate score of each cut.

For each individual cut, it is essential that it have at least a distilled fraction; otherwise it is not valid.
We would prefer a vapor temperature to be present, but we could still make use of a liquid temperature if the vapor temperature is missing.  So we will determine a cut to be valid if it has either of those temperatures.

The score for each individual cut will be computed as follows:
- if there is no evaporated fraction then $Q = 0$
- otherwise:
    - if there is a vapor temperature, then  $Q = 1$
    - otherwise, if there is a liquid temperature only then $Q = 0.8$
    - otherwise, $Q = 0$

We would surmise that the more distinct measurements we have, the better we would be able to estimate our oil distillation curve, and so the oil record would have a better quality.  I believe that we should place the biggest weight on the first cut, and the weights of all successive cuts should diminish exponentially as in the following series:

$$
\begin{align}
w_i &= \left[0.5, 0.25, 0.125, \ldots \frac{1}{2^n}\right] \cr
\end{align}
$$

We would also like to see at least 10 valid distillation cuts present in our oil record, so if it has fewer than that, we would assign a score $Q = 0$ for any missing distillation cut up to that number.<br>
In addition, any cut that exists for the record, but does not have a passing score, should be counted even if the total number of cuts exceeds 10.  The reasoning for this is that bad data is just as relevant as missing data.

<i>
(<b>Note</b>: in the future we could be a bit more discerning of this data.  We could, for instance, exclude any cuts for which the distillation fraction does not increase with an increasing temperature.)
</i>
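
One possible form of the filter suggested in the note is sketched below.  It works on plain `(vapor_temp_k, fraction)` tuples rather than the cut records, the `increasing_cuts` name is hypothetical, and the sample values are made up for illustration.

```python
def increasing_cuts(cuts):
    """Keep cuts whose fraction increases with increasing temperature.

    cuts: list of (vapor_temp_k, fraction) tuples, a simplified stand-in
    for the cut records.  A cut is dropped if its fraction is not greater
    than that of the last cut kept.
    """
    kept = []

    for temp, fraction in sorted(cuts):
        if not kept or fraction > kept[-1][1]:
            kept.append((temp, fraction))

    return kept


# made-up cuts; the third one has a fraction lower than the second
cuts = [(313.15, 0.03), (353.15, 0.07), (393.15, 0.06), (433.15, 0.19)]
print(increasing_cuts(cuts))
# [(313.15, 0.03), (353.15, 0.07), (433.15, 0.19)]
```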

In [46]:
def cut_has_vapor_temp(cut_rec):
    return (0.0 if cut_rec.vapor_temp_k is None else 1.0)

def cut_has_liquid_temp(cut_rec):
    return (0.0 if cut_rec.liquid_temp_k is None else 1.0)

def cut_has_fraction(cut_rec):
    return (0.0 if cut_rec.fraction is None else 1.0)

def score_cut(cut_rec):
    if cut_has_fraction(cut_rec) == 1.0:
        if cut_has_vapor_temp(cut_rec) == 1.0:
            return 1.0
        elif cut_has_liquid_temp(cut_rec) == 1.0:
            return 0.8
        else:
            return 0.0
    else:
        return 0.0

def score_cuts(imported_rec):
    scores = []

    for c in imported_rec.cuts:
        scores.append(score_cut(c))

    # We would like a minimum number of 10 distillation cuts
    if len(scores) < 10:
        scores += [0.0] * (10 - len(scores))

    # compute our weights
    w_i = 1.0 / (2.0 ** (np.arange(len(scores)) + 1))
    w_i[-1] = w_i[-2]  # duplicate the last weight so we sum to 1.0

    return aggregate_score(scores, w_i)

for ir in (ans_mp, benin, ans_ps9, arabian_hvy, sajaa, borholla):
    print 'name = ', ir.oil_name
    print '\tnum_cuts: {}'.format(len(ir.cuts))
    print ('\tCuts that have vapor temp: {}'
           .format(np.sum([(c.vapor_temp_k is not None)
                           for c in ir.cuts])))
    print '\tCuts Score: {}'.format(score_cuts(ir))
    print


name =  ALASKA NORTH SLOPE (MIDDLE PIPELINE)
	num_cuts: 10
	Cuts that have vapor temp: 10
	Cuts Score: 1.0

name =  BENIN RIVER, CHEVRON
	num_cuts: 13
	Cuts that have vapor temp: 13
	Cuts Score: 1.0

name =  ALASKA NORTH SLOPE-PUMP STATION #9, BP
	num_cuts: 8
	Cuts that have vapor temp: 8
	Cuts Score: 0.99609375

name =  ARABIAN HEAVY, AMOCO
	num_cuts: 6
	Cuts that have vapor temp: 6
	Cuts Score: 0.984375

name =  SAJAA CONDENSATE, BP
	num_cuts: 3
	Cuts that have vapor temp: 3
	Cuts Score: 0.875

name =  BORHOLLA
	num_cuts: 0
	Cuts that have vapor temp: 0.0
	Cuts Score: 0.0



### <u>4.11 Oil Toxicities</u>

An oil record can contain up to 6 sets of toxicity information.  These are separated into two groups of three items each: Effective Concentration (EC) and Lethal Concentration (LC).<br>

We don't currently use this information in our models, but it is conceivable that it might be useful in the future.  So we will describe a scoring method for this information, but the bar for success will be low.

The effective concentration data set will include the name of the species of animal, and a number of concentrations necessary for immobilization of 50% of the population of that animal after a period of exposure.  The exposure times are 24, 48, and 96 hours.

Similarly, the lethal concentration data set will include the name of the species of animal, and a number of concentrations necessary to cause death of 50% of the population of that animal after a period of exposure.  The exposure times are 24, 48, and 96 hours.

Our test of an individual toxicity set will simply be the presence of a species and at least one concentration value.  If a set satisfies that requirement, it will get a score of $Q = 1$.  And we will only need to see one passing toxicity set for the record to get a total score of $Q = 1$.

<i>
(<b>Note</b>: I can't find any oils with toxicities anymore.  Either the FileMaker export process is broken, or we have decided not to include this information anymore.)
</i>

In [47]:
def score_single_toxicity(tox_rec):
    if (tox_rec.species is not None and
        (tox_rec.after_24h is not None or
         tox_rec.after_48h is not None or
         tox_rec.after_96h is not None)):
        return 1.0
    else:
        return 0.0

def score_toxicities(imported_rec):
    scores = []

    for t in imported_rec.toxicities:
        scores.append(score_single_toxicity(t))

    if any([(s == 1.0) for s in scores]):
        return 1.0
    else:
        return 0.0

for ir in (ans_mp, benin):
    print 'name = ', ir.oil_name
    print 'toxicities = ', ir.toxicities
    print score_toxicities(ir)
    print


name =  ALASKA NORTH SLOPE (MIDDLE PIPELINE)
toxicities =  []
0.0

name =  BENIN RIVER, CHEVRON
toxicities =  []
0.0



## <u>5. The Final Score of an Imported Oil Record</u>

The final score of an imported oil record will be an aggregation of the resulting scores of the individual tests described above.  We will use a weighted average, and the weights will be tailored to the perceived importance of each test.

The perceived importance of the individual tests is certainly debatable.  For now, here is the current list of the individual tests with their weighted importance.

| Test                 | Weight | Cumulative |
| ----                 |:------:|:----------:|
| Densities            | 5      | 5          |
| Viscosities          | 5      | 10         |
| SARA Fractions       | 5      | 15         |
| Distillation Cuts    | 10     | 25         |
| Interfacial Tensions | 3      | 28         |
| Pour Point           | 2      | 30         |
| Demographics         | 1      | 31         |
| Flash Point          | 1      | 32         |
| Emulsion Constants   | 1      | 33         |
| Toxicities           | 0      | 33         |

Note: the API and the densities are scored together, since the API is treated as a density at $15^\circ C$.  Their combined weight is 5.

In [48]:
def score_imported_oil(imported_rec):
    scores = [(score_densities(imported_rec), 5.0),
              (score_viscosities(imported_rec), 5.0),
              (score_sara_fractions(imported_rec), 5.0),
              (score_cuts(imported_rec), 10.0),
              (score_interfacial_tensions(imported_rec), 3.0),
              (score_pour_point(imported_rec), 2.0),
              (score_demographics(imported_rec), 1.0),
              (score_flash_point(imported_rec), 1.0),
              (score_emulsion_constants(imported_rec), 1.0)]

    return aggregate_score(*zip(*scores))

for ir in (ans_mp, ans_ps9, arabian_hvy, borholla,
           lls_bp, benin, empire, sajaa):
    print 'Oil: {}, Score: {}'.format(ir.oil_name,
                                      score_imported_oil(ir))


Oil: ALASKA NORTH SLOPE (MIDDLE PIPELINE), Score: 0.742424242424
Oil: ALASKA NORTH SLOPE-PUMP STATION #9, BP, Score: 0.741240530303
Oil: ARABIAN HEAVY, AMOCO, Score: 0.652335858586
Oil: BORHOLLA, Score: 0.219696969697
Oil: LIGHT LOUISIANNA SWEET, BP, Score: 0.583333333333
Oil: BENIN RIVER, CHEVRON, Score: 0.659090909091
Oil: EMPIRE ISLAND, AMOCO, Score: 0.654703282828
Oil: SAJAA CONDENSATE, BP, Score: 0.602272727273
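
As a cross-check, the final score for ALASKA NORTH SLOPE (MIDDLE PIPELINE) can be reproduced by hand from the sub-scores printed in the earlier sections and the weights in the table above.  The snippet below is only a sanity check of the arithmetic and does not touch the oil library.

```python
# (sub-score, weight) pairs for ALASKA NORTH SLOPE (MIDDLE PIPELINE),
# copied from the outputs printed in the sections above
sub_scores = [
    (0.75, 5.0),   # densities
    (0.75, 5.0),   # viscosities
    (0.0, 5.0),    # SARA fractions
    (1.0, 10.0),   # distillation cuts
    (1.0, 3.0),    # interfacial tensions
    (1.0, 2.0),    # pour point
    (1.0, 1.0),    # demographics
    (1.0, 1.0),    # flash point
    (0.0, 1.0),    # emulsion constants
]

total_weight = sum(w for _, w in sub_scores)
final_score = sum(q * w for q, w in sub_scores) / total_weight

print(total_weight)           # 33.0
print(round(final_score, 6))  # 0.742424
```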
