# PyQAlloy Basic Usage Examples (over the ULTERA Database)

**Welcome to this minimal Jupyter notebook tutorial that shows how to use the PyQAlloy package functions over an established dataset following the expected schema (of the ULTERA Database). It showcases PyQAlloy's deployment over an older snapshot (`CURATED_Dec2022` collection in `ULTERA_internal` sub-database) to curate the dataset with all core methods:**
- [Single Composition Analyzer](#Single-Composition-Analyzer) which operates within scope of a single composition, i.e., does not require any study reference, property data, or other compositions from the database.
- [Single DOI Analyzer](#Single-DOI-Analyzer) which operates within scope of a single study / reference, which should ideally be a DOI, putting many compositions and properties together.
- [All Data (Entire Database) Analyzer](#All-Data-(Entire-Database)-Analyzer) which operates within the scope of the entire database, leveraging multi-study patterns and relationships to curate the data.

**Please note that to run this notebook, you need to have API credentials either stored in the `credentials.json` file, set using `pyqalloy.setCredentials`, or set using `pyqalloy.setCredentialsFromURI`. If you don't have them, you can still follow along by looking at the pre-computed, explained results to learn the usage of `PyQAlloy`.** Tutorials 2 and 3 will later show you how to use a static snapshot of a database (in-memory) and how to use the PyQAlloy package to curate your data conforming to the ULTERA schema standards.

Let's start by:
- Setting up the ULTERA API credentials. 
- Importing the `analysis` submodule from `pyqalloy.curation`.

In [None]:
from pyqalloy import setCredentials

# Once again, if you don't have them, you can still follow along looking at the pre-computed results in this notebook - just don't run the cells.
setCredentials(
    name='myname',
    dbKey='mydbkey',
    dataServer='mycluster.h20x21.mongodb.net'
)

In [1]:
# All relevant classes for this tutorial are in the analysis submodule
from pyqalloy.curation import analysis

## Single Composition Analyzer

Methods of the `SingleCompositionAnalyzer` class explicitly operate within the scope of a single composition, i.e., the interpretation of a chemical formula stored in the database and then parsed into several formats (raw, fractional, percentile, relational) defined in ULTERA.

In the context of compositionally complex materials (CCMs), like multi-principal element alloys (MPEAs), high-entropy alloys (HEAs), high-entropy ceramics (HECs), Ni-based superalloys, or advanced steels, the most powerful (yet simple) method is `scanCompositionsAround100`, which, as the name suggests, scans for compositions that sum to around `100%` (or `1` in fractional form), yet not exactly `100%`, thus suggesting either (a) parsing errors or (b) normalization errors.

Start by setting up a `SingleCompositionAnalyzer` object called, e.g. `sC`. By default, it will connect to the `'ULTERA_internal'` database's `'CURATED_Dec2022'` collection we use for demonstration purposes in PyQAlloy as of `v0.4.0`.

In [2]:
sC = analysis.SingleCompositionAnalyzer()

Loading the database credentials from default location: /Users/adam/VSCode Projects/PyQAlloy-1/pyqalloy/credentials.json
Connected to the CURATED_Dec2022 in ULTERA_internal with 6073 data points detected.


However, you can easily point it at other data, either by passing a MongoDB-compatible Collection object directly (`collectionManualOverride`) or by providing custom `database`, `collection`, and `credentialsFile` arguments, as shown in the commented-out lines below.

In [3]:
# sC = analysis.SingleCompositionAnalyzer(collectionManualOverride=my_mongo_collection_object)
# sC = analysis.SingleCompositionAnalyzer(database='mydatabase', collection='mycollection', credentialsFile='../mycredentials.json')

Once connected, the use is quite straightforward and the default values tend to work well for most cases. Start by scanning through all the compositions with `scanCompositionsAround100`, described earlier, looking for fraction sums that are close to `100%` but not exactly `100%`. To keep things simple, you can limit the output to `10` results and print them to the console on the fly, so that you can visually inspect them as you go.

In [4]:
sC.scanCompositionsAround100(resultLimit=10, printOnFly=True)

DOI: 10.1016/j.msea.2017.04.111
F:   Cr19 Fe19 Co19 Ni37 Cu4 Al4
PF:  Cr18.6 Fe18.6 Co18.6 Ni36.3 Cu3.9 Al3.9
Raw:  Al4Co19Cr19Cu4Fe19Ni37
RF:  Cr4.75 Fe4.75 Co4.75 Ni9.25 Cu1 Al1
[19.0, 19.0, 19.0, 37.0, 4.0, 4.0]
-->  102.0

DOI: 10.3390/ma12071136
F:   Li38 Ca1 Mg48 Al15 Si1
PF:  Li36.9 Ca1 Mg46.6 Al14.6 Si1
Raw:  Al15Li38Mg48Ca1 Si1
RF:  Li38 Ca1 Mg48 Al15 Si1
[38.0, 1.0, 48.0, 15.0, 1.0]
-->  103.0

DOI: 10.1016/j.msea.2012.04.067  --> F9
F:   Hf1.4 Zr0.007 Ti0.4 Ta3.3 W9.4 Mo0.5 Cr8.1 Co9.3 Ni61.5 Al5.7 B0.017 C0.07
PF:  Hf1.4 Zr0 Ti0.4 Ta3.3 W9.4 Mo0.5 Cr8.1 Co9.3 Ni61.7 Al5.7 B0 C0.1
Raw:  Ni61.5 W9.4 Co9.3 Cr8.1 Al5.7 Ta3.3 Hf1.4 Ti0.4 Mo0.5 C0.07 B0.017 Zr0.007
RF:  Hf2.8 Zr0.01 Ti0.8 Ta6.6 W18.8 Mo1 Cr16.2 Co18.6 Ni123 Al11.4 B0.03 C0.14
[1.4, 0.007, 0.4, 3.3, 9.4, 0.5, 8.1, 9.3, 61.5, 5.7, 0.017, 0.07]
-->  99.694

DOI: 10.1016/j.ijfatigue.2018.08.029  --> T6
F:   Ti86.2 V3.15 Al10.2
PF:  Ti86.6 V3.2 Al10.2
Raw:  Ti86.2 Al10.2 V3.15
RF:  Ti27.37 V1 Al3.24
[86.2, 3.15, 10.2]

You should now see a number of suspicious compositions! For each of them, we get:
- `DOI` of the study where the composition was found.
- `F` is the interpreted formula created by parsing the chemical formula string provided in the uploaded data.
- `PF` is the percentile formula, where all the elements have been normalized to sum to `100%`.
- `Raw` is the raw formula provided in the uploaded data, from which the `F`/`PF`/`RF` were derived.
- `RF` is the relational formula, where the element ratios are normalized to the least common one, conveniently expressing relative ratios.
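To make the relation between these forms concrete, here is a minimal sketch that derives `PF` and `RF` from the `F` of the first printout. The helper names `to_percentile` and `to_relational` are hypothetical and not part of PyQAlloy, and the relational form is simplified to "divide by the smallest amount"; the package's own handling of trace elements may differ (the third printout above appears to be expressed relative to `Mo` rather than `Zr`).

```python
def to_percentile(formula):
    """Normalize element amounts so that they sum to 100 (the PF form)."""
    total = sum(formula.values())
    return {el: 100.0 * x / total for el, x in formula.items()}


def to_relational(formula):
    """Divide every amount by the smallest one (a simplified RF form)."""
    smallest = min(formula.values())
    return {el: x / smallest for el, x in formula.items()}


# F of the first printout above (10.1016/j.msea.2017.04.111)
F = {"Cr": 19, "Fe": 19, "Co": 19, "Ni": 37, "Cu": 4, "Al": 4}

PF = {el: round(x, 1) for el, x in to_percentile(F).items()}
RF = {el: round(x, 2) for el, x in to_relational(F).items()}
print("PF:", PF)
print("RF:", RF)
```

The rounded values agree with the `PF` and `RF` lines printed by `scanCompositionsAround100` for this entry.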

Let's have a quick look at three of them!

1. The first, coming from `10.1016/j.msea.2017.04.111`, sums to `102%`. A quick comparison between `F` and `PF` shows that the formula has probably been rounded too aggressively in order to match `1%` quantization, while it was originally designed to be relational. A quick look at the abstract of the paper confirms that - it was `Al0.2CrFeCoNi2Cu0.2`, or `AlCr5Fe5Co5Ni10Cu1` in the relational form. Comparing that to the `RF` of the database entry, we can quickly see that the fractions of Cu and Al are overestimated, while the rest are underestimated, with Ni being underestimated the most.

2. The next one, coming from `10.3390/ma12071136`, may seem to suffer from a similar problem at first, but it is more trivial. A quick look at the abstract reveals that the studied formula was `Al15Li35Mg48Ca1Si1` and someone parsing the publication typed `Li38` instead of `Li35`, likely because the next fraction, `Mg48`, ended with `8` and human brains are prone to repetition.

3. Lastly, the one from `10.1016/j.msea.2012.04.067` adds up to `99.694`, which tipped off the detection algorithm under the `0.2%` uncertainty threshold, but could be accounted for as a rounding error without significant effect on the ratio of the elements in the composition and could be considered a false positive.
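The check behind these three detections can be pictured with a few lines of plain Python. This is only a sketch of the idea, not PyQAlloy's implementation: the function name `check_sum_near_100` and the outer `window` (how far from `100` a sum may be and still count as "around 100") are assumptions made for illustration, while `uncertainty` mirrors the `0.2%` threshold mentioned above.

```python
import math


def check_sum_near_100(amounts, uncertainty=0.2, window=10.0):
    """Flag sums that are close to 100 but not within `uncertainty` of it.

    `window` is a hypothetical outer bound used only in this sketch.
    """
    total = math.fsum(amounts)
    deviation = abs(total - 100.0)
    return uncertainty < deviation <= window, round(total, 3)


# Amounts copied from the three printouts discussed above
msea_2017 = [19.0, 19.0, 19.0, 37.0, 4.0, 4.0]
ma_2019 = [38.0, 1.0, 48.0, 15.0, 1.0]
msea_2012 = [1.4, 0.007, 0.4, 3.3, 9.4, 0.5, 8.1, 9.3, 61.5, 5.7, 0.017, 0.07]

for amounts in (msea_2017, ma_2019, msea_2012):
    print(check_sum_near_100(amounts))

# A looser threshold lets the likely rounding error through
print(check_sum_near_100(msea_2012, uncertainty=1))
```

With the default threshold all three entries are flagged, while with `uncertainty=1` the `99.694` case is no longer reported.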

Now, if you saw a lot of anomalies detected which seem like rounding errors, you can re-initialize the `sC` object and run again with custom settings (e.g., `uncertainty=1`, i.e., `+/-1%` passed as close enough to `100%`). There are quite a few settings you can modify, so feel free to explore the documentation or ask for help if you need it.

In [5]:
sC = analysis.SingleCompositionAnalyzer()
sC.scanCompositionsAround100(resultLimit=10, printOnFly=True, uncertainty=1)

Loading the database credentials from default location: /Users/adam/VSCode Projects/PyQAlloy-1/pyqalloy/credentials.json
Connected to the CURATED_Dec2022 in ULTERA_internal with 6073 data points detected.
DOI: 10.1016/j.msea.2017.04.111
F:   Cr19 Fe19 Co19 Ni37 Cu4 Al4
PF:  Cr18.6 Fe18.6 Co18.6 Ni36.3 Cu3.9 Al3.9
Raw:  Al4Co19Cr19Cu4Fe19Ni37
RF:  Cr4.75 Fe4.75 Co4.75 Ni9.25 Cu1 Al1
[19.0, 19.0, 19.0, 37.0, 4.0, 4.0]
-->  102.0

DOI: 10.3390/ma12071136
F:   Li38 Ca1 Mg48 Al15 Si1
PF:  Li36.9 Ca1 Mg46.6 Al14.6 Si1
Raw:  Al15Li38Mg48Ca1 Si1
RF:  Li38 Ca1 Mg48 Al15 Si1
[38.0, 1.0, 48.0, 15.0, 1.0]
-->  103.0

DOI: 10.1016/j.actamat.2016.11.016
F:   Cr16 Fe16 Co16 Ni34.4 Al16
PF:  Cr16.3 Fe16.3 Co16.3 Ni35 Al16.3
Raw:  Al16Co16Cr16Fe16Ni34.4
RF:  Cr1 Fe1 Co1 Ni2.15 Al1
[16.0, 16.0, 16.0, 34.4, 16.0]
-->  98.4

DOI: Liu_1999_ProcessingAndHighTemperature  --> F2
F:   Hf0.9 Mo91
PF:  Hf1 Mo99
Raw:  Mo91 Hf0.9
RF:  Hf1 Mo101.11
[0.9, 91.0]
-->  91.9

DOI: 10.1134/S2070205120040231  --> T2
F:   

Now, you will more frequently see some "far off" cases that may require deeper interpretation. For instance:

1. The `Ta1 Nb0.103` from `10.1134/S2070205120040231` sums to `1.103`, and a quick look at its `RF` of `TaNb0.103` suggests that it was parsed from a formula like `Ta-Nb10`, where the fraction of `Ta` was implicit (being in fact `0.897`) and was not included when the formula was parsed, thus overestimating the fraction of `Ta`. Where did the `.3` come from then? A quick look at the abstract confirms the conjecture about the implicit `Ta` fraction, but complicates things because the formulas were mass-based! Somewhat surprisingly, if we then run the composition `Ta-Nb5`, i.e., `Ta95 Nb5 (wt%)`, through `pymatgen` with `from pymatgen.core import Composition; print(Composition.from_weight_dict({'Ta': 0.95, 'Nb': 0.05}))`, we get the same `Ta90.7 Nb9.3` percentile formula that `PyQAlloy` just presented to us based on `Ta1 Nb0.103`; hence, the data point was actually correct all along because someone went to great lengths to convert an implicit weight-fraction-based formula into a molar-fraction relational formula!

2. Going forward, the same study, `10.1134/S2070205120040231`, appears again, and one may be quick to dismiss it as another peculiar false positive, but just like in an M. Night Shyamalan movie, there is yet another plot twist coming. Unlike in the last case, here the `RF` is integer-based, so a conversion from an implicit weight-fraction-based formula to a molar-fraction relational formula is unlikely, which is confirmed when the numbers are run. In the case of the `Ta-W` system, the person parsing the paper forgot to apply the same method, and the data contains the error we suspected in the first place!
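If `pymatgen` is not at hand, the weight-to-atomic conversion used in the first case can be reproduced with the standard library alone. The sketch below assumes rounded standard atomic masses for `Ta` and `Nb`; the helper name `weight_to_atomic_percent` is hypothetical.

```python
# Rounded standard atomic masses in g/mol (only the elements needed here)
ATOMIC_MASS = {"Ta": 180.948, "Nb": 92.906}


def weight_to_atomic_percent(weight_percent):
    """Convert a {element: wt%} mapping into a {element: at%} mapping."""
    moles = {el: w / ATOMIC_MASS[el] for el, w in weight_percent.items()}
    total = sum(moles.values())
    return {el: 100.0 * n / total for el, n in moles.items()}


# Ta-Nb5 read as 95 wt% Ta and 5 wt% Nb
atomic = weight_to_atomic_percent({"Ta": 95.0, "Nb": 5.0})
nb_per_ta = atomic["Nb"] / atomic["Ta"]

print({el: round(x, 1) for el, x in atomic.items()})
print(round(nb_per_ta, 3))
```

This gives roughly `Ta90.7 Nb9.3` in atomic percent and an `Nb`-to-`Ta` ratio of about `0.103`, consistent with the `Ta1 Nb0.103` entry being a correctly converted weight-based formula.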

For more granular analysis, especially over larger collaborative databases, you may want to only look at compositions that a specific researcher uploaded by initializing the `sC` with the `name` field specified. This time `printOnFly` is set to `False`, so that the results are not printed on the fly, but rather stored inside the `sC` object for further analysis.

In [6]:
sC = analysis.SingleCompositionAnalyzer(name='Adam Krajewski')
sC.scanCompositionsAround100(
    printOnFly=False, 
    resultLimit=100, 
    uncertainty=0.21)

Loading the database credentials from default location: /Users/adam/VSCode Projects/PyQAlloy-1/pyqalloy/credentials.json
Connected to the CURATED_Dec2022 in ULTERA_internal with 6073 data points detected.


Now, you can save that list into a file for later analysis!

In [7]:
sC.writeResultsToFile('singleComp_Adam.txt')

Or, you can also see them by accessing the internal `list` of `str` with result printouts. Let's have a look at the first three of them.

In [8]:
sC.printOuts[:3]

['DOI: 10.1016/j.msea.2017.04.111\nF:   Cr19 Fe19 Co19 Ni37 Cu4 Al4\nPF:  Cr18.6 Fe18.6 Co18.6 Ni36.3 Cu3.9 Al3.9\nRaw:  Al4Co19Cr19Cu4Fe19Ni37\nRF:  Cr4.75 Fe4.75 Co4.75 Ni9.25 Cu1 Al1\n[19.0, 19.0, 19.0, 37.0, 4.0, 4.0]\n-->  102.0\n',
 'DOI: 10.3390/ma12071136\nF:   Li38 Ca1 Mg48 Al15 Si1\nPF:  Li36.9 Ca1 Mg46.6 Al14.6 Si1\nRaw:  Al15Li38Mg48Ca1 Si1\nRF:  Li38 Ca1 Mg48 Al15 Si1\n[38.0, 1.0, 48.0, 15.0, 1.0]\n-->  103.0\n',
 'DOI: 10.1016/j.actamat.2016.06.063\nF:   Mo7 Cr23 Fe23 Co23 Ni23\nPF:  Mo7.1 Cr23.2 Fe23.2 Co23.2 Ni23.2\nRaw:  Co23Cr23Fe23Ni23Mo7\nRF:  Mo1 Cr3.29 Fe3.29 Co3.29 Ni3.29\n[7.0, 23.0, 23.0, 23.0, 23.0]\n-->  99.0\n']

## Single DOI Analyzer

The next context in which we can look at the alloy data in a database is the `SingleDOIAnalyzer` class, which operates within the scope of a single study / reference, which should ideally be a DOI, putting many data points together. 

It also enables us to deploy more robust methods when it comes to detecting anomalies, including investigation of the distance between nearest neighbors in the compositional space and investigation of patterns they form once projected into lower-dimensional spaces.

Let's start by picking some DOI present in the ULTERA database, like `10.1016/j.jallcom.2008.11.059`, and looking at the data coming from it. To do so, we will initialize the `sDOI` object with the DOI of interest.

In [9]:
doi = '10.1016/j.jallcom.2008.11.059'
sDOI = analysis.SingleDOIAnalyzer(doi=doi)

Loading the database credentials from default location: /Users/adam/VSCode Projects/PyQAlloy-1/pyqalloy/credentials.json
Connected to the CURATED_Dec2022 in ULTERA_internal with 6073 data points detected.
********  Analyzer Initialized  ********


In the later part of the tutorial, scanning all DOIs will be covered. To get a list of all of them, you can call the `get_allDOIs` function on any `SingleDOIAnalyzer` object (even one initialized without the `doi` field) and store this list for later use.

In [10]:
doiList = sDOI.get_allDOIs()

### NN Distance Analysis

First, let's analyze distances between all the compositions in the specified publication and print them out. 

You will see the L1 distances in the compositional space displayed in the form of (left) absolute value and (right) normalized to maximum value.

In [11]:
sDOI.analyze_nnDistances()
sDOI.print_nnDistances()


--->  10.1016/j.jallcom.2008.11.059 uploaded by Adam Krajewski (based on MPEA)
0.0866    |  0.9524     <-- F: Ti0.5 Cr1 Fe1 Co1 Ni1 Cu0.5 Al0.5   | PF: Ti9.1 Cr18.2 Fe18.2 Co18.2 Ni18.2 Cu9.1 Al9.1  | Raw: Al0.5CoCrCu0.5FeNiTi0.5   | RF: Ti1 Cr2 Fe2 Co2 Ni2 Cu1 Al1
0.0909    |  1.0        <-- F: Ti0.5 Cr1 Fe1 Co1 Ni1 Cu0.25 Al0.75 | PF: Ti9.1 Cr18.2 Fe18.2 Co18.2 Ni18.2 Cu4.5 Al13.6 | Raw: Al0.75CoCrCu0.25FeNiTi0.5 | RF: Ti2 Cr4 Fe4 Co4 Ni4 Cu1 Al3
0.0823    |  0.9048     <-- F: Ti0.5 Cr1 Fe1 Co1 Ni1 Cu0.75 Al0.25 | PF: Ti9.1 Cr18.2 Fe18.2 Co18.2 Ni18.2 Cu13.6 Al4.5 | Raw: Al0.25CoCrCu0.75FeNiTi0.5 | RF: Ti2 Cr4 Fe4 Co4 Ni4 Cu3 Al1
0.0909    |  1.0        <-- F: Ti0.5 Cr1 Fe1 Co1 Ni1 Cu1           | PF: Ti9.1 Cr18.2 Fe18.2 Co18.2 Ni18.2 Cu18.2       | Raw: CoCrCuFeNiTi0.5           | RF: Ti1 Cr2 Fe2 Co2 Ni2 Cu2    
0.0823    |  0.9048     <-- F: Ti0.5 Cr1 Fe1 Co1 Ni1 Cu0.5 Al0.25  | PF: Ti9.5 Cr19 Fe19 Co19 Ni19 Cu9.5 Al4.8          | Raw: Al0.25CoCrCu0.5FeNiTi0.5  | RF: Ti2 Cr4 Fe4 C

The left one is primarily useful for understanding whether the distances follow expectations, i.e., are not too low (e.g., 0.001%) or too high, while the right one is useful for gauging the consistency of the data, i.e., whether the points are roughly uniformly spaced or some appear to be missing or far out of order.
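The two columns can be reproduced by hand from the `F` formulas printed above. The following sketch normalizes each formula to fractions, takes the L1 distance to every other composition in the study, and keeps the smallest one per composition. It is a plain-Python illustration, not PyQAlloy's own code, and the helper names are hypothetical.

```python
ELEMENTS = ["Ti", "Cr", "Fe", "Co", "Ni", "Cu", "Al"]

# F formulas printed above for 10.1016/j.jallcom.2008.11.059
base = {"Ti": 0.5, "Cr": 1, "Fe": 1, "Co": 1, "Ni": 1}
cu_al = [(0.5, 0.5), (0.25, 0.75), (0.75, 0.25), (1, 0), (0.5, 0.25)]
formulas = [{**base, "Cu": cu, "Al": al} for cu, al in cu_al]


def to_fractions(formula):
    """Turn element amounts into a fraction vector ordered like ELEMENTS."""
    total = sum(formula.values())
    return [formula.get(el, 0.0) / total for el in ELEMENTS]


def l1_distance(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


vectors = [to_fractions(f) for f in formulas]
nn_distances = [
    min(l1_distance(v, w) for j, w in enumerate(vectors) if j != i)
    for i, v in enumerate(vectors)
]
largest = max(nn_distances)

for d in nn_distances:
    print(f"{d:.4f}    |  {d / largest:.4f}")
```

The printed pairs match the left (absolute) and right (normalized to maximum) columns of the output above.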

In the case of the first DOI we looked at, the `'10.1016/j.jallcom.2008.11.059'`, we see that all of them are consistently spaced by around `9%`, what matches our

You can also set the name of the researcher to get the same results, but only if they have contributed to the data reported for the study of interest, i.e., they are points of contact if the data appears to be abnormal or needs corrections.

In [12]:
sDOI.setName('Zi-Kui')

In [13]:
sDOI.analyze_nnDistances()
sDOI.print_nnDistances()

Skipping 10.1016/j.jallcom.2008.11.059. Specified researcher (Zi-Kui) not present in the group ({'Adam Krajewski'})



If you are iterating over many DOIs, you may not want to see all the verbose output if there is nothing to analyze. In that case, you can set `skipFailed=True` to mute the output.

In [14]:
sDOI.analyze_nnDistances()
sDOI.print_nnDistances(skipFailed=True)

We can now set the name to match any of the researchers by setting it to `None` and iterate over the list of all DOIs to get the distances present in each individual study. Let's go over 20 studies between `50`th and `70`th DOI in the database snapshot, which were tested before preparing this tutorial and are known to contain some interesting patterns.

To skip over some of the DOIs for which the data is less likely to be anomalous, we can apply a couple of additional options that filter for certain patterns. For instance, `skipNearEquidistant` lets us focus on studies in which the spacing is not uniform by skipping those where every distance is over the `nearEquidistantThreshold` fraction of the maximum one.
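As a rough mental model of this filter, consider the sketch below. It is an assumption based on the description above rather than PyQAlloy's actual code: the helper name `is_near_equidistant` is hypothetical, and whether the real comparison is strict or inclusive has not been verified. The first list holds the absolute distances printed earlier for `10.1016/j.jallcom.2008.11.059`; the second holds the first rows printed for `10.3390/ma14071660` in the output of the next cell.

```python
def is_near_equidistant(nn_distances, threshold=0.5):
    """True when every NN distance is at least `threshold` of the largest one."""
    largest = max(nn_distances)
    if largest == 0:
        return True  # all points coincide, nothing to compare
    return all(d / largest >= threshold for d in nn_distances)


uniform_study = [0.0866, 0.0909, 0.0823, 0.0909, 0.0823]
uneven_study = [0.0488, 0.5392, 0.625, 0.0714, 0.0682, 0.0465, 0.0465]

print(is_near_equidistant(uniform_study))  # would be skipped
print(is_near_equidistant(uneven_study))   # would be reported
```

Under this reading, the uniformly spaced study is skipped, while the one with two distant alloys is kept for inspection.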

In [15]:
sDOI.setName(None)
for doi in doiList[50:70]:
    sDOI.setDOI(doi)
    sDOI.analyze_nnDistances()
    sDOI.print_nnDistances(
        minSamples=3,
        skipNearEquidistant=True,
        nearEquidistantThreshold=0.5,
        skipFailed=True)


--->  10.3390/ma14071660 data from Table 2 uploaded by Shuang Lin
0.0488    |  0.078      <-- F: W1 Fe2.1 Ni4.9       | PF: W12.5 Fe26.2 Ni61.3       | Raw: WNi4.9Fe2.1      | RF: W1 Fe2.1 Ni4.9         
0.5392    |  0.8627     <-- F: W1 Re6 Fe2 Ni8       | PF: W5.9 Re35.3 Fe11.8 Ni47.1 | Raw: WNi8Fe2Re6       | RF: W1 Re6 Fe2 Ni8         
0.625     |  1.0        <-- F: Ta5 W1 Fe3 Ni7       | PF: Ta31.2 W6.2 Fe18.8 Ni43.8 | Raw: WNi7Fe3Ta5       | RF: Ta5 W1 Fe3 Ni7         
0.0714    |  0.1143     <-- F: W1 Re1 Fe3 Ni7       | PF: W8.3 Re8.3 Fe25 Ni58.3    | Raw: WNi7Fe3Re        | RF: W1 Re1 Fe3 Ni7         
0.0682    |  0.1091     <-- F: W1 Fe3 Ni7           | PF: W9.1 Fe27.3 Ni63.6        | Raw: WNi7Fe3          | RF: W1 Fe3 Ni7             
0.0465    |  0.0743     <-- F: W1 Re0.2 Fe2.1 Ni4.9 | PF: W12.2 Re2.4 Fe25.6 Ni59.8 | Raw: WNi4.9Fe2.1Re0.2 | RF: W5 Re1 Fe10.5 Ni24.5   
0.0465    |  0.0743     <-- F: W1 Re0.4 Fe2.1 Ni4.9 | PF: W11.9 Re4.8 Fe25 Ni58.3   | Raw: WNi4.9Fe2.1Re0

While the method captures many correct compositions and requires careful manual inspection by an expert to detect anomalies, it can quickly capture some patterns that may be missed otherwise. The first study we get to look at, `10.3390/ma14071660`, may seem ordinary at first, but a close inspection reveals that two of the alloys are quite dissimilar from the rest in a consistent fashion while their `Raw` formulas are oddly similar. A quick look into the methods section of the manuscript reveals that the alloys were reported using notation like `W–8Ni–2Fe–6Re` and `W–4.9Ni–2.1Fe`, which got incorrectly parsed as `WNi8Fe2Re6` and `WNi4.9Fe2.1`, so a change in the minor alloying elements present was represented as a major, unwarranted compositional change.

It is worth noting, however, that such cases of much higher errors can also be attributed to many other patterns, such as authors reporting on one or two alloys used as a reference in the study, such as is the case in the next report from `10.1016/j.ijrmhm.2020.105451` being detected by the method.
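Returning to the misparse found in `10.3390/ma14071660`, the notation `W–8Ni–2Fe–6Re` is commonly read as "balance `W` with 8, 2, and 6 percent of the additions". A minimal parser for that reading could look like the sketch below. The function name `parse_balance_notation` is hypothetical, the percent unit (weight or atomic) is not stated in this tutorial and is left as reported in the paper, and the parser is not part of PyQAlloy.

```python
import re


def parse_balance_notation(notation):
    """Parse 'W–8Ni–2Fe–6Re' style strings.

    The element written without a number is treated as the balance,
    i.e., 100 minus the sum of all explicitly given amounts.
    """
    amounts, balance_element = {}, None
    for token in re.split(r"[–-]", notation):
        match = re.fullmatch(r"(\d+(?:\.\d+)?)?([A-Z][a-z]?)", token.strip())
        if match is None:
            raise ValueError(f"Cannot parse token {token!r}")
        number, element = match.groups()
        if number is None:
            balance_element = element
        else:
            amounts[element] = float(number)
    if balance_element is not None:
        remainder = 100.0 - sum(amounts.values())
        amounts[balance_element] = round(remainder, 6)
    return amounts


print(parse_balance_notation("W–8Ni–2Fe–6Re"))
print(parse_balance_notation("W–4.9Ni–2.1Fe"))
```

Read this way, `W` makes up `84` and `93` percent of the two alloys, whereas the database entries `W1 Re6 Fe2 Ni8` and `W1 Fe2.1 Ni4.9` put it at only `5.9%` and `12.5%` (see their `PF` above).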

### Projection Pattern Analysis

As shown above, investigating absolute and relative distances between compositions can be quite useful; however, it does not capture the full potential of the high dimensional compositional data we have at our disposal. Now, we will use the same `SingleDOIAnalyzer` but look at how patterns in these high-dimensional spaces project into lower-dimensional ones through linear transformations (like PCA) and how low dimensional patterns can be used to detect anomalies.

To get started, let's have a look at the study we already looked at before, `'10.1016/j.jallcom.2008.11.059'`, for which we concluded that since the distances between the compositions were consistent, the data is likely correct.

In [16]:
doi = '10.1016/j.jallcom.2008.11.059'
sDOI = analysis.SingleDOIAnalyzer(doi=doi)

Loading the database credentials from default location: /Users/adam/VSCode Projects/PyQAlloy-1/pyqalloy/credentials.json
Connected to the CURATED_Dec2022 in ULTERA_internal with 6073 data points detected.
********  Analyzer Initialized  ********


We need to call two functions. First, `get_compVecs_2DPCA`, which will obtain composition vectors if `get_compVecs` was not called before and then project them into 2D space using PCA. Second, `analyze_compVecs_2DPCA`, which will perform several checks to determine if the data is potentially anomalous and then plot it in an interactive fashion with `plotly`. Later on, we will also show how to save these plots as static images in a spreadsheet.

In [17]:
sDOI.get_compVecs_2DPCA()
sDOI.analyze_compVecs_2DPCA()

------>  10.1016/j.jallcom.2008.11.059 - non-linear trends detected (minRangeInDim: 0.033>0.001)



<_io.BytesIO at 0x30175b380>

Now, that's interesting! While the data parsed from `10.1016/j.jallcom.2008.11.059` is indeed spaced consistently, 4 of the points form a clear line, while a single point is breaking that pattern. Let's break down the analysis to understand what is happening here:

1. Looking at the `Raw` formulas in the table, we see that the **red** point looks suspiciously different from the rest - it is shorter and has only a single decimal fraction; however, it actually follows the pattern exactly (leftmost in the line).

2. The `Raw` formula of the **magenta** point, on the other hand, looks perfectly unassuming and would almost certainly be missed by an expert looking through the data unarmed with `PyQAlloy`. However, a close look, combined with `PF`, reveals that all the other formulas follow a trend of gradually replacing `Cu` with `Al`, but this one does not keep their sum constant.

3. Based on the trend we see, we can quickly see that the **magenta** point should have `Al1` and `Cu0` in order to be the next point in the line (on the right).

4. With a quick look at the `10.1016/j.jallcom.2008.11.059` paper, we immediately see that the formula was indeed `Ti0.5CrFeCoNiAl` and the person parsing the composition into the Citrine/UCSB dataset probably started writing the green composition and then finished with the blue one, but missed the error because 5 unique compositions went into the table as expected.
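The "four points on a line, one off it" observation can also be checked without any plotting. The sketch below is not PCA and not PyQAlloy's method; it is a simpler collinearity test in the same spirit, measuring how far each composition lies from the straight line through the first composition and the one farthest from it. The helper name `max_off_line_distance` is hypothetical, and the formulas are the `F` values printed in the NN distance section.

```python
import math

ELEMENTS = ["Ti", "Cr", "Fe", "Co", "Ni", "Cu", "Al"]
base = {"Ti": 0.5, "Cr": 1, "Fe": 1, "Co": 1, "Ni": 1}
cu_al = [(0.5, 0.5), (0.25, 0.75), (0.75, 0.25), (1, 0), (0.5, 0.25)]
formulas = [{**base, "Cu": cu, "Al": al} for cu, al in cu_al]


def to_fractions(formula):
    total = sum(formula.values())
    return [formula.get(el, 0.0) / total for el in ELEMENTS]


def max_off_line_distance(points):
    """Largest Euclidean distance of any point from the line through the
    first point and the point farthest from it."""
    origin = points[0]
    diffs = [[x - o for x, o in zip(p, origin)] for p in points]
    far = max(diffs, key=lambda v: math.fsum(c * c for c in v))
    norm = math.sqrt(math.fsum(c * c for c in far))
    if norm == 0:
        return 0.0
    direction = [c / norm for c in far]
    worst = 0.0
    for v in diffs:
        along = math.fsum(a * b for a, b in zip(v, direction))
        residual = [a - along * b for a, b in zip(v, direction)]
        worst = max(worst, math.sqrt(math.fsum(r * r for r in residual)))
    return worst


vectors = [to_fractions(f) for f in formulas]
on_line = max_off_line_distance(vectors[:4])  # without the Cu0.5 Al0.25 entry
with_anomaly = max_off_line_distance(vectors)
print(on_line, with_anomaly)
```

The first four compositions are collinear to within floating-point error, while including the `Cu0.5 Al0.25` entry produces a clear departure from the line.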

Now, let's scan through the same 20 DOIs we looked at before, but this time we will look at the patterns in the 2D PCA space. The `analyze_compVecs_2DPCA` method allows us to set the `minSamples` parameter to filter out studies with too few data points, so that we can focus on the ones that are more likely to contain interpretable patterns.

In general, this loop will iterate over the DOIs and generate, for each, either (1) a message that the study was skipped because it contained fewer than `minSamples` data points, (2) a message that the study was skipped because its data points lie on a single line in the high-dimensional compositional space, or (3) a nice plot like the one we saw before!

In [18]:
doiList = sDOI.get_allDOIs()

for doi in doiList[50:70]:
    sDOI.setDOI(doi)
    sDOI.get_compVecs_2DPCA()
    sDOI.analyze_compVecs_2DPCA(
        showFigure=True,
        minSamples=4
        )

Skipping 10.1039/d0mh01341b  . 1 samples are below the minimum requirement set (minSamples=4).
Skipping 10.1080-02670836.2018.1446267. 1 samples are below the minimum requirement set (minSamples=4).
Skipping 10.1126/sciadv.aba7802. 3 samples are below the minimum requirement set (minSamples=4).
------>  10.2320/matertrans.MT-MK2019003 - non-linear trends detected (minRangeInDim: 0.3058>0.001)



Skipping 10.3389/fmats.2020.589052 Nearly 1D linear trand detected.
------>  10.3390/ma14071660 - non-linear trends detected (minRangeInDim: 0.3657>0.001)



Skipping 10.3390/met9121351  . 1 samples are below the minimum requirement set (minSamples=4).
Skipping 10.1007/s11665-020-04744-7 Nearly 1D linear trand detected.
Skipping 10.1007/s11837-019-03861-6. 2 samples are below the minimum requirement set (minSamples=4).
Skipping 10.1016/j.apsusc.2021.149338. 2 samples are below the minimum requirement set (minSamples=4).
Skipping 10.1016/j.apt.2020.10.008. 1 samples are below the minimum requirement set (minSamples=4).
------>  10.1016/j.ijrmhm.2020.105451 - non-linear trends detected (minRangeInDim: 0.2282>0.001)



------>  10.1016/j.ijrmhm.2021.105568 - non-linear trends detected (minRangeInDim: 1.3341>0.001)



Skipping 10.1016/j.intermet.2020.106935. 1 samples are below the minimum requirement set (minSamples=4).
Skipping 10.1016/j.jallcom.2019.153352 Nearly 1D linear trand detected.
Skipping 10.1016/j.jallcom.2020.153963. 1 samples are below the minimum requirement set (minSamples=4).
Skipping 10.1016/j.jallcom.2021.158975. 1 samples are below the minimum requirement set (minSamples=4).
Skipping 10.1016/j.jallcom.2021.159193. 1 samples are below the minimum requirement set (minSamples=4).
------>  10.1016/j.jallcom.2021.159505 - non-linear trends detected (minRangeInDim: 0.1999>0.001)



------>  10.1016/j.jallcom.2021.159740 - non-linear trends detected (minRangeInDim: 0.0087>0.001)



We see quite a few interesting patterns, let's go through them one by one:

1. The first one, `10.2320/matertrans.MT-MK2019003`, looks like a broken pattern, but the formulas quickly reveal that it was just a low-dimensional combinatorial screening (4 out of 5 elements) and the pattern occurred by chance - false positive.

2. The second one, `10.3390/ma14071660` we already looked at before, but now we can see the anomalous point even more clearly in the 2D space.

3. The third one, `10.1016/j.ijrmhm.2020.105451`, is another typical false positive, already discussed before, where the authors added a simpler alloy to the study as a reference.

4. The fourth one, `10.1016/j.ijrmhm.2021.105568`, is detected because it included pure element references (skipped by the ULTERA production pipeline), which results in no clear patterns - false positive.

5. This one is particularly interesting! The **red** point formula from `10.1016/j.jallcom.2021.159505` looks perfectly in line with the others and even in the percentile (normalized) form looks exactly as expected; yet it lands in a completely different place in the projection. One needs to look really carefully to spot that the alloy has `17.8%` of `N` rather than the expected `17.8%` of `Ni` because a single `i` keystroke was not registered when parsing.

6. Lastly, the `10.1016/j.jallcom.2021.159740` reveals another false positive pattern - authors just tried a bunch of different compositions and no pattern was actually expected.

To focus on a single researcher's uploads, just like before, simply set the `name` field. Now, only DOIs from studies that were (at least partially) uploaded by the specified researcher will be returned by `get_allDOIs`.

In [19]:
sDOI = analysis.SingleDOIAnalyzer(doi='', name='Adam Krajewski')
doiList = sDOI.get_allDOIs() # Only considers DOIs with Adam's contributions
print(f"{len(doiList)} DOIs to process")
for doi in doiList[0:10]:
    sDOI.setDOI(doi)
    sDOI.get_compVecs_2DPCA()
    sDOI.analyze_compVecs_2DPCA(
        minSamples=4,
        showFigure=True
        )

Loading the database credentials from default location: /Users/adam/VSCode Projects/PyQAlloy-1/pyqalloy/credentials.json
Connected to the CURATED_Dec2022 in ULTERA_internal with 6073 data points detected.
********  Analyzer Initialized  ********
276 DOIs to process
Skipping 10.1007-s10854-020-04470-9. 3 samples are below the minimum requirement set (minSamples=4).
Skipping 10.1016-j.apt.2020.12.019. 1 samples are below the minimum requirement set (minSamples=4).
Skipping 10.1016-j.jallcom.2016.04.320. 3 samples are below the minimum requirement set (minSamples=4).
------>  10.1016-j.jallcom.2018.02.251 - non-linear trends detected (minRangeInDim: 0.5174>0.001)



------>  10.1016-j.matchemphys.2021.124907 - non-linear trends detected (minRangeInDim: 0.0913>0.001)



Skipping 10.1016/j.jallcom.2014.11.061. 1 samples are below the minimum requirement set (minSamples=4).
Skipping 10.1016/j.jallcom.2021.162154. 1 samples are below the minimum requirement set (minSamples=4).
Skipping 10.1016/j.mseb.2009.05.024 Nearly 1D linear trand detected.
Skipping 10.1063/1.4985724   . 1 samples are below the minimum requirement set (minSamples=4).
------>  10.1088-2515-7655-ac6f7e - non-linear trends detected (minRangeInDim: 0.8044>0.001)



And lastly, if you want to persist the results for later use, you can either work with the (very flexible) raw PNG bytestreams of the plots returned by `analyze_compVecs_2DPCA` or collect them and save them into an Excel spreadsheet using the `writeManyPlots` function - that way you will be able to easily annotate them and share them with your team as a single file. You can also set the `skipFailed` flag to `True` to skip over the studies that will not be persisted in the results.

In [20]:
sDOI = analysis.SingleDOIAnalyzer(name='Hui Sun')
doiList = sDOI.get_allDOIs()
print(f'{len(doiList)} DOIs to process')
toPrintList = []
for doi in doiList:
    sDOI.setDOI(doi)
    sDOI.get_compVecs_2DPCA()
    out = sDOI.analyze_compVecs_2DPCA(
        skipFailed=True,
        showFigure=False
        )
    if out:
        toPrintList.append(out)
sDOI.writeManyPlots(toPlotList=toPrintList, workbookPath='SingleDOI_ResultPCA_Hui.xlsx')

Loading the database credentials from default location: /Users/adam/VSCode Projects/PyQAlloy-1/pyqalloy/credentials.json
Connected to the CURATED_Dec2022 in ULTERA_internal with 6073 data points detected.
********  Analyzer Initialized  ********
43 DOIs to process
------>  10.1007/s11837-015-1563-9 - non-linear trends detected (minRangeInDim: 0.3362>0.001)

------>  10.1016/S0921-5093(97)00555-8 - non-linear trends detected (minRangeInDim: 0.1804>0.001)

------>  10.1016/j.actamat.2020.01.012 - non-linear trends detected (minRangeInDim: 0.0013>0.001)

------>  10.1016/j.intermet.2018.05.013 - non-linear trends detected (minRangeInDim: 0.4375>0.001)

------>  10.1016/j.scriptamat.2013.05.020 - non-linear trends detected (minRangeInDim: 0.5556>0.001)

------>  Cahn_1999_High-temperature Structural Materials - non-linear trends detected (minRangeInDim: 0.1736>0.001)

------>  10.1007/s40195-015-0254-4 - non-linear trends detected (minRangeInDim: 0.0872>0.001)

------>  10.1016/j.actamat.2

You should now be able to see the results in the generated Excel (`.xlsx`) file!
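If you would rather keep the plots as individual image files, the raw bytestreams can be written to disk with the standard library. The sketch below assumes that each collected object is an `io.BytesIO` holding PNG data, as the `<_io.BytesIO ...>` output shown earlier suggests; the helper name `save_png_streams` is hypothetical, and a stand-in stream is used here instead of a real plot.

```python
import io
import tempfile
from pathlib import Path


def save_png_streams(streams, folder):
    """Write each in-memory PNG stream to its own numbered file."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, stream in enumerate(streams):
        path = folder / f"plot_{index:03d}.png"
        path.write_bytes(stream.getvalue())
        paths.append(path)
    return paths


# Stand-in for a real plot: only the 8-byte PNG signature, not a valid image
fake_plot = io.BytesIO(b"\x89PNG\r\n\x1a\n")

with tempfile.TemporaryDirectory() as tmp:
    saved = save_png_streams([fake_plot], tmp)
    saved_names = [p.name for p in saved]
    saved_sizes = [p.stat().st_size for p in saved]

print(saved_names, saved_sizes)
```

In a real session you would pass the list collected in the loop above and a permanent folder instead of the temporary one.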

## All Data (Entire Database) Analyzer

So far, we covered looking at a single data point or a collection of them within a single study, utilizing very different approaches of looking at individual entities or relations between them. The latter required at least a few data points to work properly. However, many studies in the database did not contain such a number of data points (e.g., 1 or 2) and were skipped by the methods we used before. Now, we will look at the patterns formed by all the data in the database, leveraging multi-study information to detect anomalies relative to work by the *community* of researchers.

In [21]:
from pyqalloy.curation import analysis

In [22]:
allD = analysis.AllDataAnalyzer()

Loading the database credentials from default location: /Users/adam/VSCode Projects/PyQAlloy-1/pyqalloy/credentials.json
Connected to the CURATED_Dec2022 in ULTERA_internal with 6073 data points detected.
Updating the list of all unique composition points...
Number of unique formulas found: 1311
Elements Found: {'C', 'Ru', 'Sc', 'V', 'Cu', 'W', 'Re', 'Cr', 'Y', 'Pd', 'Si', 'Ga', 'O', 'Ag', 'Mg', 'Nb', 'Ta', 'Ca', 'Zn', 'Ni', 'Ir', 'S', 'Fe', 'Mn', 'Al', 'Mo', 'Ge', 'Sn', 'B', 'N', 'Hf', 'Zr', 'Be', 'Co', 'Li', 'Ti', 'Nd'}
Done!


In [23]:
allD.getTSNE(perplexity=5)

array([[-2.5952484e+01,  3.5279499e+01],
       [ 4.9784188e+00,  6.7044334e+01],
       [-4.6604988e+01,  2.6536434e+00],
       ...,
       [-1.4140003e+01,  4.9580617e+00],
       [ 3.4433388e+01,  1.2834120e-02],
       [-3.9706459e+01, -2.3695835e+01]], dtype=float32)

In [24]:
allD.showTSNE()

In [25]:
allD.getDBSCAN(eps=0.075, min_samples=2)

Found 115 clusters and 427 outliers.
Outlier ratio: 32.6%


(array([-1,  0,  1, ...,  1, 10, -1]), 427)

In [26]:
allD.showClustersDBSCAN()

In [27]:
allD.showOutliersDBSCAN()

In [28]:
allD.getDBSCANautoEpsilon(outlierTargetN=17)

Running DBSCAN with eps=1.0...
Found 1 clusters and 0 outliers.
Outlier ratio: 0.0%
Running DBSCAN with eps=0.975...
Found 1 clusters and 0 outliers.
Outlier ratio: 0.0%
Running DBSCAN with eps=0.95...
Found 1 clusters and 0 outliers.
Outlier ratio: 0.0%
Running DBSCAN with eps=0.925...
Found 1 clusters and 0 outliers.
Outlier ratio: 0.0%
Running DBSCAN with eps=0.9...
Found 1 clusters and 0 outliers.
Outlier ratio: 0.0%
Running DBSCAN with eps=0.875...
Found 1 clusters and 0 outliers.
Outlier ratio: 0.0%
Running DBSCAN with eps=0.85...
Found 1 clusters and 0 outliers.
Outlier ratio: 0.0%
Running DBSCAN with eps=0.825...
Found 1 clusters and 0 outliers.
Outlier ratio: 0.0%
Running DBSCAN with eps=0.8...
Found 1 clusters and 0 outliers.
Outlier ratio: 0.0%
Running DBSCAN with eps=0.775...
Found 1 clusters and 0 outliers.
Outlier ratio: 0.0%
Running DBSCAN with eps=0.75...
Found 1 clusters and 0 outliers.
Outlier ratio: 0.0%
Running DBSCAN with eps=0.725...
Found 1 clusters and 0 outlier

(array([0, 1, 0, ..., 0, 0, 0]), 17)
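The log above suggests a simple search: start from a large `eps`, lower it in steps of `0.025`, and stop once the requested number of outliers appears. The sketch below illustrates that idea on made-up 2D points and is not PyQAlloy's implementation. The names `count_isolated` and `auto_epsilon`, the stopping rule (`>=`), the Euclidean metric, and the toy data are all assumptions. It relies on the fact that with `min_samples=2` (counting the point itself, as scikit-learn does) DBSCAN's noise points are exactly the points with no other point within `eps`.

```python
import math


def count_isolated(points, eps):
    """Count points that have no other point within distance eps."""
    isolated = 0
    for i, p in enumerate(points):
        has_neighbour = any(
            math.dist(p, q) <= eps for j, q in enumerate(points) if j != i
        )
        if not has_neighbour:
            isolated += 1
    return isolated


def auto_epsilon(points, outlier_target, start=1.0, step=0.025):
    """Lower eps step by step until at least outlier_target points are isolated."""
    eps = start
    n_outliers = 0
    while eps > 0:
        n_outliers = count_isolated(points, eps)
        if n_outliers >= outlier_target:
            return eps, n_outliers
        eps = round(eps - step, 3)
    return None, n_outliers


# Hypothetical toy data: two tight groups, one loosely attached point, one far point
toy_points = [(0, 0), (0.1, 0), (0, 0.1), (5, 5), (5.1, 5), (0.6, 0.6), (2, 2)]

print(auto_epsilon(toy_points, outlier_target=1))
print(auto_epsilon(toy_points, outlier_target=2))
```

Asking for more outliers forces a smaller `eps`, which is the same trade-off visible in the step-by-step log printed by `getDBSCANautoEpsilon`.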

In [29]:
allD.showOutliersDBSCAN()

In [30]:
allD.updateOutliersList()

In [31]:
allD.findOutlierDataSources();

Outlier W1.5 Mo6.33 Co1 Ni1.83 Al6 | W9 Mo38 Co6 Ni11 Al36     | Co6W9Al36Mo38Ni11
matched to:  Adam Krajewski       upload from DOI 10.1038/s41467-019-10533-1 

Outlier Ti1 Nb1 Ag1 Zn1 Al1       | Ti20 Nb20 Ag20 Zn20 Al20  | AgAlNbTiZn
matched to:  Adam Krajewski       upload from DOI 10.1016/j.msea.2018.12.020 

Outlier Ti1 V23 Cr1               | Ti4 V92 Cr4               | V92 Cr4 Ti4
matched to:  Marcia Ahn           upload from DOI Natesan_2002_UniaxialCreepBehavior at position F2 

Outlier Nb8 W2.33 Cr1 Co11 Al11   | Nb24 W7 Cr3 Co33 Al33     | Co33W07Al33Nb24Cr03
matched to:  Adam Krajewski       upload from DOI 10.1038/s41467-019-10533-1 

Outlier Be2.25 Zr4.12 Ti1.38 Ni1 Cu1.25 | Be22.5 Zr41.2 Ti13.8 Ni10 Cu12.5 | Zr41.2 Ti13.8 Cu12.5 Ni10 Be22.5
matched to:  Hui Sun              upload from DOI 10.1016/j.scriptamat.2013.05.020 at position T1 

Outlier Zr30.5 Ti1 Cu12.5 Al6     | Zr61 Ti2 Cu25 Al12        | Zr61Ti2Cu25Al12
matched to:  Hui Sun              upload from DOI 10.