# Optional Supplemental Exercise: Creating Fancy Indexes and Merging by Index

If you are done with the class early, you can attempt this optional exercise.

### Index Objects and Reindexing

We have already seen that DataFrames have an index object. If you don't specify a Series as the index, the default is a range index running from 0 to the number of items minus 1. This is what we have been using up to now, and it works fine for most applications.

We can obtain the index of a DataFrame by simply accessing its `index` attribute:

```python
df.index
```

And we can assign a new index very easily too:

```python
df.index = new_index
```

This works as long as `new_index` has the same length as the original index. Because a DataFrame can be reindexed this way, we can assign a more meaningful index. We have seen that an ODP sample can be uniquely identified by a string that incorporates the entire description of the sample location, such as **198-1207-A-2-H-2-65**. So, if we can create a Series that contains this name, we can then use it as a unique index for the samples. The benefit is that we can then compare different datasets by index and, for instance, find the same sample in two different DataFrames very easily.
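
As a minimal sketch of why a string index is convenient, the toy example below builds two small DataFrames, assigns string labels as their index, and finds the labels they share. The sample names, column names, and values are invented for illustration and are not taken from the ODP files.

```python
import pandas as pd

# Toy data: names and values are made up for illustration only
df_a = pd.DataFrame({'value_a': [1.0, 2.0, 3.0]})
df_b = pd.DataFrame({'value_b': [10.0, 20.0]})

print(df_a.index)  # the default range index

# The new index must provide exactly one label per row
df_a.index = ['sample-1', 'sample-2', 'sample-3']
df_b.index = ['sample-3', 'sample-1']

# Labels present in both DataFrames
common = df_a.index.intersection(df_b.index)

# Look up the same sample by label in each DataFrame
row_a = df_a.loc['sample-3']
row_b = df_b.loc['sample-3']

# Assigning an index of the wrong length raises a ValueError
try:
    df_b.index = ['only-one-label']
except ValueError as err:
    length_error = str(err)
```

Note that the row order differs between the two toy DataFrames, yet the lookup by label still finds the matching sample in each.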

In this more open-ended exercise, I want you to redo exercise 4 from the previous notebook, but with the following differences:

1. Add a function that creates an index of type string that contains the following information: **Leg-Site-H-Cor-T-Sec** for the two dataframes of interest
2. Instead of rounding the `Depth (mbsf)`, this time we will effectively treat all data within the same 1.5 meter interval (a section) as belonging to the same sample.
3. Drop the columns that you use to create your index as well as `Top (cm)`: we don't need them anymore, as all the info is contained in the index itself.
4. You will end up with two `Depth (mbsf)` columns per sample: the depth where the NGR measurement was taken, and the depth where the geochemical data was taken. This is useful, but the default names of these columns (`Depth (mbsf)_x` and `Depth (mbsf)_y`) are not. Give these columns more meaningful names.
5. Once you have done this, you can rewrite your `merge_by_depth(df1,df2)` to merge by index instead: it should be easier.
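
To illustrate where the `_x` and `_y` names in step 4 come from, here is a small sketch on two toy DataFrames that share a column name. The index labels, values, and the replacement column names are hypothetical; it shows two possible ways to get meaningful names, not the required solution.

```python
import pandas as pd

# Toy frames sharing a column name; labels and values are invented
left = pd.DataFrame({'Depth (mbsf)': [1.0, 2.0]}, index=['s1', 's2'])
right = pd.DataFrame({'Depth (mbsf)': [1.1, 2.1]}, index=['s1', 's2'])

# Default behaviour: pandas appends _x and _y to the clashing names
default = pd.merge(left, right, left_index=True, right_index=True)

# Option 1: rename the columns after the merge
renamed = default.rename(columns={'Depth (mbsf)_x': 'Depth left (mbsf)',
                                  'Depth (mbsf)_y': 'Depth right (mbsf)'})

# Option 2: choose the suffixes at merge time
suffixed = pd.merge(left, right, left_index=True, right_index=True,
                    suffixes=(' left', ' right'))
```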

Here are a few things that will help you:
* Try not to copy things from the previous notebook, but rather to remember how to do it or google the documentation. This will help you become more independent in your coding, and is important for future projects.
* Remember that this is a new notebook: don't forget to do all the necessary imports and to read the data into the DataFrames.
* The easiest way is to create a Pandas Series that contains the strings, and then set this as your new index.
* There are several ways to go about this exercise. You can, for instance, simply concatenate the columns of interest, separated by the `-` character. This approach is likely to result at first in a `UFuncTypeError`. Google what this means and how to get around it (*hint:* you probably need to cast some types! Read up on how to do this for an entire DataFrame column).
* Another approach is to use a loop and the Pandas function `.iterrows()` for the same job. If you have time, try both approaches to expand your knowledge by reading about `iterrows()`, which is a handy function.
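
The two approaches in the last hints can be sketched on a toy DataFrame as follows. The column names `number` and `letter` are hypothetical and unrelated to the exercise data, and the exact error class reported depends on the pandas and NumPy versions installed; this is a sketch under those assumptions rather than the expected answer.

```python
import pandas as pd

# Toy frame with one numeric and one text column (hypothetical names)
toy = pd.DataFrame({'number': [1, 2], 'letter': ['A', 'B']})

# Adding a string to a numeric column fails with a TypeError
# (reported as UFuncTypeError in some pandas/NumPy versions)
try:
    toy['number'] + '-' + toy['letter']
    failed = False
except TypeError:
    failed = True

# Casting the numeric column to str makes the concatenation work
labels = toy['number'].astype(str) + '-' + toy['letter']

# The same labels built row by row with iterrows()
labels_loop = [f'{row["number"]}-{row["letter"]}' for _, row in toy.iterrows()]
```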

# 💃 Have fun!!! 


In [None]:
# Your Code:



In [77]:
# Solution

import pandas as pd

geochem_df = pd.read_csv('data/1207_Geochemistry.csv')
ngr_df = pd.read_excel('data/1207_NGR.xls')


In [78]:
ngr_df

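| | Leg | Site | H | Cor | T | Sc | Top(cm) | Depth (mbsf) | Corr. Counts |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 198 | 1207 | A | 1 | H | 1 | 30 | 0.3 | 20.08 |
| 1 | 198 | 1207 | A | 1 | H | 1 | 60 | 0.6 | 17.48 |
| 2 | 198 | 1207 | A | 1 | H | 1 | 90 | 0.9 | 17.75 |
| 3 | 198 | 1207 | A | 1 | H | 1 | 120 | 1.2 | 20.22 |
| 4 | 198 | 1207 | A | 1 | H | 2 | 30 | 1.8 | 19.22 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 741 | 198 | 1207 | B | 49 | R | 1 | 50 | 613.7 | 24.12 |
| 742 | 198 | 1207 | B | 49 | R | 1 | 60 | 613.8 | 21.45 |
| 743 | 198 | 1207 | B | 49 | R | 1 | 70 | 613.9 | 15.22 |
| 744 | 198 | 1207 | B | 49 | R | 1 | 80 | 614.0 | 11.92 |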


In [106]:
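# This is one way to do it:
def reindex_adding_series(df):
    # Columns that together identify a section: Leg-Site-H-Cor-T-Sc
    id_cols = ['Leg', 'Site', 'H', 'Cor', 'T', 'Sc']

    # Cast to str first: adding '-' to a numeric column raises a UFuncTypeError
    new_index = df[id_cols[0]].astype(str)
    for col in id_cols[1:]:
        new_index = new_index + '-' + df[col].astype(str)

    # Drop the identifier columns and Top(cm) first, then set the index on the
    # result, so the caller's DataFrame is left untouched
    out = df.drop(columns=id_cols + ['Top(cm)'])
    out.index = new_index
    return out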

In [107]:
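# This is another way to do it:

def reindex(df):
    cols = ['Leg', 'Site', 'H', 'Cor', 'T', 'Sc', 'Top(cm)']
    new_index = []

    for _, row in df.iterrows():
        # Use row["T"] rather than row.T: .T is the transpose attribute, not the column
        new_index.append(f'{row.Leg}-{row.Site}-{row.H}-{row.Cor}-{row["T"]}-{row.Sc}')

    # Set the index on the returned copy so the caller's DataFrame is left untouched
    out = df.drop(cols, axis=1)
    out.index = new_index
    return out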

In [112]:
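# Now merge by index:
def merge_by_depth(df1, df2):
    # Build the Leg-Site-H-Cor-T-Sc string index on both DataFrames
    df1 = reindex(df1)
    df2 = reindex(df2)
    # Inner join on the index: sections present in only one DataFrame are dropped
    new_df = pd.merge(df1, df2, left_index=True, right_index=True)
    # Replace the default _x/_y names with meaningful ones
    # (assumes df1 is the geochemistry data and df2 the NGR data)
    new_df.columns = ['Depth (mbsf) Carb.', 'CaCO3 (wt %)', 'Depth (mbsf) NGR', 'Corr. Counts']
    return new_df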

In [113]:
merge_by_depth(geochem_df, ngr_df)

| | Depth (mbsf) Carb. | CaCO3 (wt %) | Depth (mbsf) NGR | Corr. Counts |
|---|---|---|---|---|
| 198-1207-A-10-H-1 | 80.96 | 44.149 | 81.1 | 5.88 |
| 198-1207-A-10-H-1 | 80.96 | 44.149 | 81.4 | 6.02 |
| 198-1207-A-10-H-1 | 80.96 | 44.149 | 81.7 | 3.42 |
| 198-1207-A-10-H-1 | 80.96 | 44.149 | 82.0 | 3.25 |
| 198-1207-A-10-H-1 | 81.52 | 72.320 | 81.1 | 5.88 |
| ... | ... | ... | ... | ... |
| 198-1207-B-5-R-2 | 197.74 | 96.910 | 198.2 | |
| 198-1207-B-6-R-1 | 205.75 | 96.600 | 205.4 | |
| 198-1207-B-6-R-1 | 205.75 | 96.600 | 205.7 | |
| 198-1207-B-6-R-1 | 205.75 | 96.600 | 206.0 | |
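
The repeated index labels in the output above arise because several measurements fall in the same section: the index is not unique, so the merge pairs every matching row from one DataFrame with every matching row from the other. A small sketch with invented labels, column names, and values shows the effect, along with the `validate` argument of `pd.merge`, which can be used to check whether the keys are unique.

```python
import pandas as pd

# Invented data: two geochemistry rows and three NGR rows in one section
chem = pd.DataFrame({'CaCO3': [44.1, 72.3]}, index=['sec-1', 'sec-1'])
ngr = pd.DataFrame({'Counts': [5.9, 6.0, 3.4]},
                   index=['sec-1', 'sec-1', 'sec-1'])

# Non-unique labels: each left row is paired with each right row (2 x 3 = 6 rows)
merged = pd.merge(chem, ngr, left_index=True, right_index=True)

# validate= makes pandas raise a MergeError if the keys are not unique as expected
try:
    pd.merge(chem, ngr, left_index=True, right_index=True,
             validate='one_to_one')
    unique_keys = True
except pd.errors.MergeError:
    unique_keys = False
```

Whether this many-to-many pairing is what you want depends on the analysis; here it follows directly from treating all data within a section as the same sample.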
