### Getting Information from your data:

We now have large dataframes that we acquired from an API and from our local machines. We then made some changes to the columns to extract location data. The last thing we did was combine the dataframes and export them to an Excel file.

But now we need to get some information from the data. Some things you may want to know:
* The percentage of samples that have metadata
* How many samples can be uploaded to your API
* How many samples still need metadata

Within this notebook, I will walk through how I found some of this basic info about our data. For this we will be using another common Python library, [NumPy](https://numpy.org/).

In [2]:
import pandas as pd
from pandas import isnull
import numpy as np
%store -r dfmerged

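First I created two dataframes, one for the data fetched from the API and the other for the data from my local Excel sheet. Each dataframe has only three columns: longitude, latitude, and `Names`. Note that the API columns are capitalized (`Longitude`, `Latitude`) while the local ones are lowercase (`longitude`, `latitude`).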

In [3]:
# create 2 new data frames
# (.copy() makes them independent of dfmerged, so insert() below does not act on a view)
dfOnline = dfmerged[['Longitude', 'Latitude']].copy()
dfMeta = dfmerged[['longitude', 'latitude']].copy()

# adding a new column Names from sample names of online or metadata
# (single brackets select a Series, which is what insert() expects)
dfOnline.insert(0, "Names", dfmerged['name'])
dfMeta.insert(0, "Names", dfmerged['sample_name'])

# truncating the empty rows made from the merge
dfOnline = dfOnline[~dfOnline.Names.isna()]

Here I made two empty lists and, in the `for` loops, appended every value from the `Names` column of each of the two new dataframes. I then turned each list into a NumPy array so I could pass it to `np.unique()`, which returns a sorted array with no repeated values. So now I have two arrays of sample names, one for the data fetched online and one for the local data, and I know that no name is repeated within either of them.

In [4]:
Metalist = []
SparrowList = []
for i, series in dfMeta.iterrows():
    Metalist.append(series.Names)

for i, series in dfOnline.iterrows():
    SparrowList.append(series.Names)

# convert each list to an array, then keep only the unique names
Metadatasamples = np.asarray(Metalist)
Metadatasamples = np.unique(Metadatasamples)
SparrowSamples = np.asarray(SparrowList)
SparrowSamples = np.unique(SparrowSamples)
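
The loops above can also be replaced by selecting the column directly. The following is only a minimal sketch under stated assumptions: `df_demo` is a hypothetical toy dataframe standing in for `dfMeta`, and the sample names in it are used purely for illustration. It also drops missing names before de-duplicating, so an empty row cannot end up among the names.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for dfMeta: one repeated name and one empty row.
df_demo = pd.DataFrame({"Names": ["LDM6", "LDM500", "LDM6", np.nan]})

# Drop missing names first, then let NumPy de-duplicate (and sort) them.
names = df_demo["Names"].dropna().to_numpy()
unique_names = np.unique(names)

print(unique_names)
```

The result is the same kind of array that the loop-based cell produces, just without building an intermediate Python list.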

NumPy's `np.concatenate()` function is simpler than its Pandas counterpart: here it just joins the two arrays I pass to it. I then use `np.unique()` again, this time with an additional argument. When `return_counts` (a boolean, `False` by default) is set to `True`, the function also returns an array of integers giving the number of times each sample name appears in the input. Because each of the two arrays was already de-duplicated, a name that appears in both has a count of 2, and a name that appears in only one has a count of 1. I can use these counts to pick out which of the samples on my local machine have not been uploaded to the API.

Assuming that all the API data is already on my spreadsheet, which I knew to be true at the time, it follows that any name with `Counts == 1` is a sample from my local machine that can be uploaded to the API. The list `Uploadable_total` will then hold the names of all the samples from my local machine that can be uploaded to the API.

In [5]:
totalsamples = np.concatenate((Metadatasamples, SparrowSamples))

totalsamples = np.unique(totalsamples, return_counts=True)

dfTotal = pd.DataFrame(totalsamples)
dfTotal = dfTotal.transpose()
dfTotal.columns = ['Names', 'Counts']

Uploadable_total = []
for i, series in dfTotal.iterrows():
    if series.Counts == 1:
        Uploadable_total.append(series.Names)
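
To make the counting logic easier to follow, here is a small sketch on made-up data. The arrays `local_names` and `api_names` are hypothetical and are not taken from the dataset. The sketch also shows `np.setdiff1d`, a NumPy function that returns the values found in the first array but not in the second, as one possible alternative that does not rely on every API name also being in the spreadsheet.

```python
import numpy as np

# Hypothetical sample names; "B" exists both locally and on the API.
local_names = np.array(["A", "B", "C"])
api_names = np.array(["B"])

# Approach from the text: count how often each name occurs overall.
names, counts = np.unique(np.concatenate((local_names, api_names)),
                          return_counts=True)
uploadable_by_count = names[counts == 1]

# Alternative: names in the local array that are absent from the API array.
uploadable_by_diff = np.setdiff1d(local_names, api_names)

print(uploadable_by_count)
print(uploadable_by_diff)
```

Both approaches agree here. They would differ if a name existed only on the API: the count-based version would include it, while `np.setdiff1d` would not.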

In [6]:
%store Uploadable_total

Stored 'Uploadable_total' (list)


Let's now find the percentage of online samples that have location data. A percentage is a calculated value, and there are several ways we could work it out. What we basically want to know is how many samples have a `NaN` or `null` value in the `Latitude` or `Longitude` column. In a scenario like this it is easier to convert the values to Booleans, which can then be converted to 1's and 0's, as we will see.

The first line of the next cell replaces every cell that has a value with `True` and every cell that doesn't have a value with `False`. We only need one of the two location columns, because I expect them to be either both true or both false. Lastly, when you multiply a `True` value by 1 you get back 1, and when you multiply a `False` value by 1 you get 0. You can now see how we will be able to calculate the percentage.

In [7]:
dfOnSt = ~dfOnline.isna()

dfOnSt = dfOnSt.drop(columns=['Names', 'Longitude'])

dfOnSt = dfOnSt * 1
dfOnSt

Unnamed: 0,Latitude
0,1
1,1
2,1
3,0
4,1
...,...
1814,1
1815,1
1816,1
1817,1
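
Keeping only `Latitude` relies on the two coordinate columns being missing together. That can be checked before dropping a column. The sketch below uses a hypothetical toy dataframe `df_demo` in place of `dfOnline`, with one row deliberately made inconsistent.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for dfOnline; the row at index 2 has a latitude but no longitude.
df_demo = pd.DataFrame({
    "Names": ["s1", "s2", "s3", "s4"],
    "Longitude": [-149.66, np.nan, np.nan, -70.59],
    "Latitude": [-17.66, np.nan, 20.64, -36.01],
})

# True where exactly one of the two coordinates is missing.
mismatch = df_demo["Latitude"].isna() != df_demo["Longitude"].isna()

print(mismatch.sum())
print(df_demo.loc[mismatch, "Names"].tolist())
```

If the count is zero on the real data, using a single column is safe. Otherwise the listed names are the samples worth inspecting by hand.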


Now it is simply math. First I converted the dataframe to a NumPy array for easy math. Then I counted all of the 1's in the array, divided by the length of the array, and multiplied by 100, rounding the result to the nearest whole number. This represents the percentage of online samples with location data, because every sample with a 1 had a `True` value, which meant it had location data.

In [8]:
Online = dfOnSt.to_numpy()

Percent_Loc_Found_Online = np.round(np.count_nonzero(Online) / len(Online) * 100)
Percent_Loc_Found_Online

69.0
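
The same percentage can also be computed without leaving Pandas, because the mean of a Boolean column is the fraction of `True` values. This is a sketch on a hypothetical four-value column, not on the real data, so the number it prints is unrelated to the result above.

```python
import numpy as np
import pandas as pd

# Hypothetical latitude column: 3 of 4 samples have a value.
lat = pd.Series([-17.66, 20.64, np.nan, -36.01], name="Latitude")

# notna() gives booleans; their mean is the fraction of True values.
percent_with_location = lat.notna().mean() * 100

# The same number, computed the way the text describes.
flags = lat.notna().to_numpy() * 1
percent_numpy = np.count_nonzero(flags) / len(flags) * 100

print(percent_with_location)
print(percent_numpy)
```

Both lines print the same value, which is a quick way to confirm that the two routes are equivalent.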

### Recap:
Pandas and NumPy can be powerful tools for accessing information about your data and the state of your metadata search. Once you have some analytics, you may want to print them in a report. Creating a `docx` file directly in Python is helpful because the report can be regenerated automatically as your progress changes. A good, straightforward library to use is [Python-docx](https://python-docx.readthedocs.io/en/latest/).

In [9]:
%store dfOnline

Stored 'dfOnline' (DataFrame)


In [10]:
dfmerged

Unnamed: 0,name,material,location_name,location_name_autoset,is_public,Longitude,Latitude,sample_name,lithology,latitude,...,year,journal,Title,doi_link,Where_to_Find,Unit/Formation,Unpublished,Where_to_find_it,From_PI,Unnamed: 24
0,M2C,Lava Flows,,,True,-149.66,-17.66,SEG 03 32,rhyolite dome,52.351166,...,2005.0,Earth and Planetary Science Letters,Contrasting timescales of crystallization and ...,https://doi.org/10.1016/j.epsl.2005.05.002,,,,,,
1,90T151A,Baslt,,,True,-156.2311,20.6368,SEG 03 44,dacitic ash flow,52.351166,...,2005.0,Earth and Planetary Science Letters,Contrasting timescales of crystallization and ...,https://doi.org/10.1016/j.epsl.2005.05.002,,,,,,
2,90T050B,Baslt,,,True,-156.2311,20.6368,SEG 03 66,andesitic lava flow,52.283300,...,2005.0,Earth and Planetary Science Letters,Contrasting timescales of crystallization and ...,https://doi.org/10.1016/j.epsl.2005.05.002,,,,,,
3,84C207AB,,,,True,,,SEG 03 03,dacitic lava flow,52.375000,...,2005.0,Earth and Planetary Science Letters,Contrasting timescales of crystallization and ...,https://doi.org/10.1016/j.epsl.2005.05.002,,,,,,
4,LDMEB-13-21,Dacite,,,True,-70.5921,-36.00909,SB87–56,rhyolitic lava flow,52.266833,...,2005.0,Earth and Planetary Science Letters,Contrasting timescales of crystallization and ...,https://doi.org/10.1016/j.epsl.2005.05.002,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2295,,,,,,,,LDM249A,,-35.986620,...,,,,,,,,Andersen Paper if not ask brad,,
2296,,,,,,,,LDM500,,-36.174940,...,,,,,,,,Andersen Paper if not ask brad,,
2297,,,,,,,,LDM6,,,...,,,,,,,,Andersen Paper if not ask brad,,
2298,,,,,,,,LDM6,,,...,,,,,,,,Andersen Paper if not ask brad,,
