### Combining and Exporting

In this notebook we will learn how to combine Pandas DataFrames using the [`concat()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html) function and the [`.merge()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html) method from Pandas.

The main difference is that `.merge()` combines dataframes by matching rows on a common column or index, whereas `concat()` either stacks the dataframes on top of each other or places them side by side, depending on the `axis` parameter.
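To make the difference concrete, here is a minimal sketch using two toy dataframes. The column names and values are invented for illustration and are not taken from the notebook's data.

```python
import pandas as pd

# Toy frames; the column names and values are invented for this example.
left = pd.DataFrame({"sample": ["A", "B"], "lithology": ["basalt", "dacite"]})
right = pd.DataFrame({"sample": ["B", "C"], "year": [2005, 2010]})

# axis=0 (the default) stacks rows; columns missing from one frame become NaN.
stacked = pd.concat([left, right], axis=0)

# axis=1 places the frames side by side, aligning rows on the index labels.
side_by_side = pd.concat([left, right], axis=1)

# merge matches rows on the values of a shared key column.
matched = left.merge(right, on="sample", how="inner")

print(stacked.shape)       # (4, 3)
print(side_by_side.shape)  # (2, 4)
print(matched)
```

Only sample `B` appears in both toy frames, so the inner merge keeps a single row, while both `concat()` calls keep every row they were given.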

At the end we will also learn how to export them to an Excel file. __NOTE__: for this you will have to install the libraries `xlrd` and `xlsxwriter`.

In [1]:
import pandas as pd
%store -r df1
%store -r df2
%store -r df3

Okay, let's take a look at the `concat()` function. The most important argument is the list of dataframes you would like to concatenate. I have passed `[df3, df2]` as the first argument, which, if you remember, were the Excel sheets with metadata from my local machine. The second argument, `ignore_index` (a boolean), discards the existing index labels when `True`. This is helpful when the index is not meaningful. For us the index was set automatically and appears as the column of numbers on the left-hand side of the dataframe. Setting `ignore_index` to `True` renumbers the rows of the combined dataframe from 0 to n-1, where n is the total number of rows.
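As a small illustration of what `ignore_index` changes, the sketch below concatenates two toy dataframes with and without it. The frames `a` and `b` are invented stand-ins, not the notebook's `df2` and `df3`.

```python
import pandas as pd

# Toy frames standing in for the two Excel sheets; values are invented.
a = pd.DataFrame({"sample_name": ["S1", "S2"]})
b = pd.DataFrame({"sample_name": ["S3", "S4", "S5"]})

# Without ignore_index each frame keeps its own labels, so labels repeat.
kept = pd.concat([a, b])

# With ignore_index=True the combined rows are renumbered from 0 to n-1.
renumbered = pd.concat([a, b], ignore_index=True)

print(list(kept.index))        # [0, 1, 0, 1, 2]
print(list(renumbered.index))  # [0, 1, 2, 3, 4]
```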

I also dropped some columns to get down to the bare-bones metadata that I was looking at.

In [2]:
dfm = pd.concat([df3, df2], ignore_index=True)
dfm = dfm.drop(['sparrow order', 'id', 'technique', 'Irradiation', 'phase', 'sample_name.1', 'GeoDeepDive ping?',
                'Comments', 'Notes'], axis=1)
dfm

Unnamed: 0,sample_name,lithology,latitude,longitude,elevation_m,depth_m,Formation,Member,author,year,journal,Title,doi_link,Where_to_Find,Unit/Formation,Unpublished,Where_to_find_it,From_PI,Unnamed: 24
0,SEG 03 32,rhyolite dome,52.351166,-175.417000,,,Seguam Island Volcanic Complex,Rhyolite flow in crater valley (7.5 ka),Jicha et al.,2005.0,Earth and Planetary Science Letters,Contrasting timescales of crystallization and ...,https://doi.org/10.1016/j.epsl.2005.05.002,,,,,,
1,SEG 03 44,dacitic ash flow,52.351166,-175.417000,,,Seguam Island Volcanic Complex,Dacitic Ignimbrite,Jicha et al.,2005.0,Earth and Planetary Science Letters,Contrasting timescales of crystallization and ...,https://doi.org/10.1016/j.epsl.2005.05.002,,,,,,
2,SEG 03 66,andesitic lava flow,52.283300,-172.403300,,,Seguam Island Volcanic Complex,Lava Point Dacite,Jicha et al.,2005.0,Earth and Planetary Science Letters,Contrasting timescales of crystallization and ...,https://doi.org/10.1016/j.epsl.2005.05.002,,,,,,
3,SEG 03 03,dacitic lava flow,52.375000,-172.389166,,,Seguam Island Volcanic Complex,Finch Cove Rhyodacite,Jicha et al.,2005.0,Earth and Planetary Science Letters,Contrasting timescales of crystallization and ...,https://doi.org/10.1016/j.epsl.2005.05.002,,,,,,
4,SB87–56,rhyolitic lava flow,52.266833,-172.522166,,,Seguam Island Volcanic Complex,Basaltic to Rhyolitic South Shore Lavas,Jicha et al.,2005.0,Earth and Planetary Science Letters,Contrasting timescales of crystallization and ...,https://doi.org/10.1016/j.epsl.2005.05.002,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2295,LDM249A,,-35.986620,-70.494220,2351,,,,,,,,,,,,Andersen Paper if not ask brad,,
2296,LDM500,,-36.174940,-70.395310,,,,,,,,,,,,,Andersen Paper if not ask brad,,
2297,LDM6,,,,,,,,,,,,,,,,Andersen Paper if not ask brad,,
2298,LDM6,,,,,,,,,,,,,,,,Andersen Paper if not ask brad,,


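Next, the `.merge()` function. It is called as a method on the first dataframe, with the dataframe we are merging with as the first argument. I have written out a lot of other arguments, but most are left at their defaults; `left_index=True` and `right_index=True` tell the function to use both indexes as join keys, and `how='outer'` keeps the rows from both dataframes. Note that `set_index()` returns a new dataframe rather than modifying the original, and the cell below does not assign the results, so the merge actually aligns rows on the default integer index (0, 1, 2, ...), as the output shows.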

The result is a dataframe with all of df1 on the left and dfm on the right. This is helpful because, as we scroll through the output, we can see that dfm, the dataframe from the local machine, has many more samples than df1, which was fetched from the API. This right away suggests that there is new data we can upload to the API.

In [3]:
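# Note: set_index returns a new DataFrame; these results are not assigned,
# so df1 and dfm keep their default integer index.
df1.set_index('name')
dfm.set_index('sample_name')

# Outer join on the index of both frames (here, the row position).
dfmerged = df1.merge(dfm, how='outer', on=None, left_on=None, right_on=None, left_index=True, right_index=True)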
dfmerged

Unnamed: 0,name,material,location_name,location_name_autoset,is_public,Longitude,Latitude,sample_name,lithology,latitude,...,year,journal,Title,doi_link,Where_to_Find,Unit/Formation,Unpublished,Where_to_find_it,From_PI,Unnamed: 24
0,M2C,Lava Flows,,,True,-149.66,-17.66,SEG 03 32,rhyolite dome,52.351166,...,2005.0,Earth and Planetary Science Letters,Contrasting timescales of crystallization and ...,https://doi.org/10.1016/j.epsl.2005.05.002,,,,,,
1,90T151A,Baslt,,,True,-156.2311,20.6368,SEG 03 44,dacitic ash flow,52.351166,...,2005.0,Earth and Planetary Science Letters,Contrasting timescales of crystallization and ...,https://doi.org/10.1016/j.epsl.2005.05.002,,,,,,
2,90T050B,Baslt,,,True,-156.2311,20.6368,SEG 03 66,andesitic lava flow,52.283300,...,2005.0,Earth and Planetary Science Letters,Contrasting timescales of crystallization and ...,https://doi.org/10.1016/j.epsl.2005.05.002,,,,,,
3,84C207AB,,,,True,,,SEG 03 03,dacitic lava flow,52.375000,...,2005.0,Earth and Planetary Science Letters,Contrasting timescales of crystallization and ...,https://doi.org/10.1016/j.epsl.2005.05.002,,,,,,
4,LDMEB-13-21,Dacite,,,True,-70.5921,-36.00909,SB87–56,rhyolitic lava flow,52.266833,...,2005.0,Earth and Planetary Science Letters,Contrasting timescales of crystallization and ...,https://doi.org/10.1016/j.epsl.2005.05.002,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2295,,,,,,,,LDM249A,,-35.986620,...,,,,,,,,Andersen Paper if not ask brad,,
2296,,,,,,,,LDM500,,-36.174940,...,,,,,,,,Andersen Paper if not ask brad,,
2297,,,,,,,,LDM6,,,...,,,,,,,,Andersen Paper if not ask brad,,
2298,,,,,,,,LDM6,,,...,,,,,,,,Andersen Paper if not ask brad,,
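In the output above the rows are paired by position (row 0 with row 0, and so on), because the index of each dataframe is still the default row number. If the goal is to match samples by name instead, one possible approach is sketched below with toy data. The frames `api` and `local` are invented stand-ins for `df1` and `dfm`, and the use of `indicator=True` is an addition that is not in the original notebook.

```python
import pandas as pd

# Toy stand-ins for df1 (API) and dfm (local sheets); values are invented.
api = pd.DataFrame({"name": ["S1", "S2"], "material": ["basalt", "dacite"]})
local = pd.DataFrame({"sample_name": ["S2", "S3"],
                      "lithology": ["dacite", "rhyolite"]})

# set_index returns a new DataFrame, so the result has to be assigned.
api_by_name = api.set_index("name")
local_by_name = local.set_index("sample_name")

# Outer join on the index labels; indicator=True adds a "_merge" column
# recording whether each row came from the left frame, the right frame, or both.
merged = api_by_name.merge(local_by_name, how="outer",
                           left_index=True, right_index=True, indicator=True)

# Rows found only in the local sheets are candidates for upload.
only_local = merged[merged["_merge"] == "right_only"]
print(only_local.index.tolist())  # ['S3']
```

If sample names repeat within a dataframe (as `LDM6` does in the output above), a name-based merge produces one row for every matching pair, so it may be worth checking for duplicates first.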


We now have a couple of large dataframes that show a lot of data together. It may be helpful to have these as Excel sheets, which may be easier to navigate for people unfamiliar with Python and can easily be shared with teams.

To export these dataframes to Excel we will use `xlsxwriter` as the writing engine. The first thing we need is a `pd.ExcelWriter` object that names `xlsxwriter` as its engine and gives the name of our new Excel file. We then call `.to_excel()` on a dataframe, passing the writer as the first argument; we can also pass `sheet_name` to name the sheet. Finally we write the file to disk by calling `.close()` on the writer (older pandas versions used `.save()`, which was removed in pandas 2.0).

`writer2 = pd.ExcelWriter('Comparison_Sheet.xlsx', engine='xlsxwriter')
dfmerged.to_excel(writer2, sheet_name='Sheet_1')
writer2.close()`
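An alternative that avoids the explicit final call is to use the writer as a context manager, which closes (and therefore saves) the file when the block ends. The sketch below wraps this in a small helper; `export_sheets` and the toy `demo` frame are hypothetical names introduced for illustration, and the helper assumes `xlsxwriter` is installed.

```python
import pandas as pd


def export_sheets(frames, path, engine="xlsxwriter"):
    """Write each DataFrame in `frames` (sheet name -> DataFrame) to one workbook."""
    # The with-block closes the writer on exit, which is what saves the file.
    with pd.ExcelWriter(path, engine=engine) as writer:
        for sheet_name, frame in frames.items():
            frame.to_excel(writer, sheet_name=sheet_name)


# Toy frame for trying the helper out; the values are invented.
demo = pd.DataFrame({"sample_name": ["S1", "S2"], "year": [2005, 2010]})
```

In this notebook it could be called as `export_sheets({'Merged': dfmerged, 'Local': dfm}, 'Comparison_Sheet.xlsx')`, which would put each dataframe on its own sheet of the same file.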

In [4]:
%store dfmerged
%store dfm

Stored 'dfmerged' (DataFrame)
Stored 'dfm' (DataFrame)
