In [4]:
import requests
from IPython.display import Markdown

url = 'https://kata.geosci.ai/challenge/fossil-hunting'

r = requests.get(url)
print('Status', r.status_code)

Markdown(r.text)

Status 200


# Fossil hunting

We have some fossil abundance data. Each record contains a number, which represents a geological age, and zero or more fossil symbols. One symbol represents one example of that fossil. For example, we might have a record like this:

    349.8🦐🐚🐚🐟🐟🦐
    
The number is the age of the sample in units of 'millions of years before the present', to 4 significant figures, and is unique (there are no duplicate records). It is immediately followed by the fossil counts for that sample. For example, this record indicates that the samples collected from rocks with age = 349.8 Ma contained two shrimps, two gastropod shells, and two fish specimens.

Your actual dataset will be much larger than this. It's also less organized: the records are not in order.

There are four questions to answer about your data:

1. How many samples are there of the most abundant organism?
2. What is the age of the oldest record with maximum diversity?
3. What is the span (first to last appearance) of the most abundant organism?
4. At what age is the latest appearance of the last fossil to appear?

When the answer is an age, give 1 decimal place of precision.


## Example

    345.1🐟
    346.2🐚🐚🐟
    348.7🐚🦐
    349.8🦐🐚🐚🐟🐟🦐
    350.0🐚🦐🦐🐚🦐🦐
    351.7🦐🐟🦐
    353.8🦐
    354.9

We'd answer the questions this way:

1. The most abundant organism is the shrimp, with **10** specimens.
2. The oldest record with the maximum diversity (3 fossil types) is **349.8**
3. The span of the most abundant organism (the shrimp) is 353.8 - 348.7 = **5.1**
4. The last fossil to appear (the shell) last appears at an age of **346.2**


## A quick reminder how this works

You can retrieve your data by choosing any Python string as a **`<KEY>`** and substituting here:

    https://kata.geosci.ai/challenge/fossil-hunting?key=<KEY>
                                                        ^^^^^
                                                        any old string you like

To answer question 1, make a request like:

    https://kata.geosci.ai/challenge/fossil-hunting?key=<KEY>&question=1&answer=123
                                                        ^^^^^          ^        ^^^
                                                        your key       Q        your answer

[Complete instructions at kata.geosci.ai](https://kata.geosci.ai/challenge)

----

© 2020 Agile Scientific, licensed CC-BY
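
Before touching the real data, it helps to check my reading of the four questions against the worked example in the challenge text. This is a minimal sketch in plain Python, not the approach used in the rest of this notebook. The `example` dictionary is typed in by hand from the challenge text, and names like `first_seen` and `latecomer` are just illustrative.

```python
from collections import Counter

# The worked example from the challenge text, typed in by hand: age -> fossil symbols
example = {
    345.1: '🐟',
    346.2: '🐚🐚🐟',
    348.7: '🐚🦐',
    349.8: '🦐🐚🐚🐟🐟🦐',
    350.0: '🐚🦐🦐🐚🦐🦐',
    351.7: '🦐🐟🦐',
    353.8: '🦐',
    354.9: '',
}

# Q1: total specimens of the most abundant organism
totals = Counter(''.join(example.values()))
top_fossil, q1 = totals.most_common(1)[0]

# Q2: oldest (largest age) record with the maximum number of distinct fossils
max_diversity = max(len(set(symbols)) for symbols in example.values())
q2 = max(age for age, symbols in example.items() if len(set(symbols)) == max_diversity)

# First appearance = oldest age, last appearance = youngest age, per fossil
first_seen = {f: max(a for a, s in example.items() if f in s) for f in totals}
last_seen = {f: min(a for a, s in example.items() if f in s) for f in totals}

# Q3: span of the most abundant organism
q3 = round(first_seen[top_fossil] - last_seen[top_fossil], 1)

# Q4: the fossil whose first appearance is youngest, and when it was last seen
latecomer = min(first_seen, key=first_seen.get)
q4 = last_seen[latecomer]

print(q1, q2, q3, q4)
```

The four values agree with the answers given in the challenge's example (10, 349.8, 5.1 and 346.2), so "oldest" means largest age and "last to appear" means the smallest first-appearance age.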

## Load the input data

In [5]:
my_key = 'scibbatical'

params = {'key': my_key}

r = requests.get(url, params)

# Look at the first bit of the input:
#Markdown(r.text[:200])
r.text[:200]

'127.7🐟🐟🌟🐟🐟🐟🐟🐟🐟🌟🐟🐟🌟🐟🐟🌟322.0\U0001f9a0\U0001f9a0\U0001f9a0\U0001f9a0\U0001f9a0\U0001f9a0🐠🐠🐠\U0001f9a0179.2🐟🐟🐟🐟🐟🌿🐟🌟🌟🌿🌟🐟🌟🐟🐟🐟🐟🐟🌿🐟🌟🐟🐟🌟🌟🌟🐟🐟\U0001f990🐟🌟🐟🌟🐟🌟🦄🌟\U0001f990\U0001f990🐟🐟🌟🌟🌟🌟🦄🌿🌟🦄🌟🐟🐟🌟🦄\U0001f990🐟🐟🌿🌟🐟258.4\U0001f990🐚🐠🌿🐚🌿🐚🐠🐚🐚🐠🐚\U0001f990🌿🌿🐠🌿🐚\U0001f990\U0001f990🌿🐚\U0001f990🐚🐚🐠🐠🐚\U0001f990🐚\U0001f990🌿🐚🐚🐚🌿🌿🌿🐠🐠🐠🌿\U0001f990\U0001f990\U0001f990🐚🐚273.4\U0001f9a0🐚🐚🌿\U0001f990\U0001f9a0🐚\U0001f990🐠\U0001f990🐚🐚🐚🐠\U0001f990🐚🐠🐠\U0001f990\U0001f9a0🐠\U0001f990🐠🐚🐚🐚\U0001f990🐠\U0001f990🐠🐚🐚🐚🐚🐚🐚🐚\U0001f990🐠\U0001f990🐚\U0001f9a0'

Okay, I can see the cute emojis, but I think some of them are shown as Unicode escapes (such as `\U0001f990`) instead of being drawn, maybe because they are unrecognized by my browser or something. Hope it's not a big deal!
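
For what it's worth, the `\U0001f990` form is just an escaped spelling of the same character, so counting and comparing should behave the same either way. A quick sketch to check that (my guess, not something the challenge states, is that the escapes are only a display matter):

```python
# An escaped code point and the literal emoji are the same one-character string
shrimp_escaped = '\U0001f990'
shrimp_literal = '🦐'

print(shrimp_escaped == shrimp_literal)   # compare the two spellings
print(len(shrimp_escaped))                # number of characters in the string
print(hex(ord(shrimp_literal)))           # code point of the emoji
```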

### 'Parse' the data

I think I want to turn this data into a DataFrame where each column is the count of a given organism. I therefore need to discover all types of organisms.

In [6]:
orig_input = r.text[:]

# Find unique characters in input data, then sort them
chars = list(set(r.text))
chars.sort()

print(chars)

['.', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '🌟', '🌿', '🐚', '🐟', '🐠', '🦄', '\U0001f990', '\U0001f9a0']


Okay, I see the first 11 characters are related to the ages and the others are the fossils. Create a list of fossils:

In [7]:
fossils = chars[11:]
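
Slicing at index 11 works because the digits and '.' sort before the emojis, and because every digit happens to occur in my input. A slightly more defensive sketch, assuming anything that is not a digit or a '.' is a fossil symbol (the `sample` string is a made-up snippet, not my real input):

```python
# Made-up snippet in the same format as the challenge input
sample = '127.7🐟🐟🌟322.0🦠🐠179.2🌿🦄'

# Treat every character that is neither a digit nor '.' as a fossil symbol
fossil_symbols = sorted(c for c in set(sample) if not (c.isdigit() or c == '.'))

print(fossil_symbols)
```

This snippet is missing several digits, so a fixed slice like `[11:]` would cut in the wrong place here, while the filter still picks out only the emojis.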

Okay, there doesn't appear to be a convenient character that separates one entry from another, so we'll have to get more manual with this parsing.

The ages that I've seen appear to be given with 5 characters (four digits and a '.'); however, that may not always be the case.

In [147]:
# So, define a function that finds the index of the next fossil character...
def next_fossil(string):
    for i, char in enumerate(string):
        if char in fossils:
            return i

    return -1

# ... and another that finds the next decimal point ...
def next_decimal(string):
    for i, char in enumerate(string):
        if char == '.':
            return i

    return -1

# ... and another that finds the next number (any non-fossil character).
def next_number(string):
    for i, char in enumerate(string):
        if char not in fossils:
            return i

    return -1

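# And finally, one that takes the input string and
# returns the next record (the age plus its fossils).
def tuna(string):

    # Everything after the first age (the '.' and the one digit that follows it)
    scout = string[next_decimal(string)+2:]

    # No further number: the rest of the string is the last record
    if next_number(scout) == -1:
        return string

    if next_number(scout) == 0: # if there is a record with no fossils
        record = string[:next_decimal(string)+2]
    else:
        # The age ends where the fossils start; add the length of the fossil run
        record = string[:next_fossil(string) + next_number(scout)]

    return record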
        

Now, parse the input string into a list:

In [148]:
# Make a copy of the input data
interim = orig_input

# Create an empty list that will hold the entries
entries = []

# Loop until tuna() returns an empty string, which means the input is used up
while True:
    
    # find new record
    new = tuna(interim)
    #print(new)
    
    # break if there isn't a new record
    if new == '':
        break
    
    # append the next entry
    entries.append(new)
    
    # remove the entry from interim
    interim = interim[len(new):]
    
print('Input parsed into', len(entries), 'entries.')

Input parsed into 191 entries.
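
A regular expression might do the same job in one pass. This is only a sketch, and it leans on the same assumption as `tuna`, namely that every age has exactly one digit after the decimal point. The `sample` string and the pattern are mine, not part of the challenge.

```python
import re

# Made-up snippet in the challenge format, including a record with no fossils
sample = '345.1🐟346.2🐚🐚🐟354.9348.7🐚🦐'

# Group 1: digits, a '.', then one digit (the age).
# Group 2: any run of characters that are neither digits nor '.' (the fossils).
record_pattern = re.compile(r'(\d+\.\d)([^\d.]*)')

records = [(float(age), symbols) for age, symbols in record_pattern.findall(sample)]

for age, symbols in records:
    print(age, symbols)
```

The record at 354.9 comes back with an empty fossil string, which is the case the `next_number(scout) == 0` branch handles in `tuna`.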


Okay, now we can populate a DataFrame:

In [89]:
import pandas as pd

In [175]:
# Build one row per entry, then create the DataFrame in one go.
# (DataFrame.append was removed in pandas 2.0.)
rows = []

for entry in entries:
    row = {'age': float(entry[:next_decimal(entry)+2])}
    # Count how many times each fossil symbol occurs in this entry
    row.update({fossil: entry.count(fossil) for fossil in fossils})
    rows.append(row)

# Cast to float so the counts match the output shown below
fossil_data = pd.DataFrame(rows, columns=['age'] + fossils).astype(float)
    


In [176]:
fossil_data.head()

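|   | age | 🌟 | 🌿 | 🐚 | 🐟 | 🐠 | 🦄 | 🦐 | 🦠 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 127.7 | 4.0 | 0.0 | 0.0 | 12.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 322.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 | 7.0 |
| 2 | 179.2 | 20.0 | 5.0 | 0.0 | 27.0 | 0.0 | 4.0 | 4.0 | 0.0 |
| 3 | 258.4 | 0.0 | 11.0 | 17.0 | 0.0 | 9.0 | 0.0 | 10.0 | 0.0 |
| 4 | 273.4 | 0.0 | 1.0 | 18.0 | 0.0 | 9.0 | 0.0 | 10.0 | 5.0 |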


Boo-yeah! Alright. Begin analysis.

## Question 1

_How many samples are there of the most abundant organism?_

Determine which is most prevalent and then find the amount.

In [183]:
# Use .sum()
answer1 = int(max(fossil_data[fossils].sum()))

print('There are', answer1, 'observations of the most abundant fossil.')

There are 1625 observations of the most abundant fossil.


Submit answer 1:

In [184]:
params = {'key': my_key,   # <--- must be the same key as before
          'question': 1,   # <--- which question you're answering
          'answer': answer1,  # <--- your answer to that question
         }

r = requests.get(url, params)

r.text

'Correct'

Boo-yeah!
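
Under the hood, `requests` turns the `params` dictionary into the query string shown in the challenge instructions. A small sketch with the standard library shows the URL that should result. `answer_url` is a hypothetical helper of mine, and it doesn't contact the server.

```python
from urllib.parse import urlencode

BASE_URL = 'https://kata.geosci.ai/challenge/fossil-hunting'

def answer_url(key, question, answer):
    """Build the submission URL for one question (hypothetical helper)."""
    query = urlencode({'key': key, 'question': question, 'answer': answer})
    return f'{BASE_URL}?{query}'

print(answer_url('scibbatical', 1, 1625))
```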

## Question 2

_What is the age of the oldest record with maximum diversity?_

I will take "diversity" to mean number of unique organisms in a sample.

It might be useful to have a version of the DataFrame with NaNs instead of zeros.

In [194]:
import numpy as np

In [195]:
fossil_data_nan = fossil_data.replace(0, np.nan)

fossil_data_nan.head()

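|   | age | 🌟 | 🌿 | 🐚 | 🐟 | 🐠 | 🦄 | 🦐 | 🦠 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 127.7 | 4.0 | NaN | NaN | 12.0 | NaN | NaN | NaN | NaN |
| 1 | 322.0 | NaN | NaN | NaN | NaN | 3.0 | NaN | NaN | 7.0 |
| 2 | 179.2 | 20.0 | 5.0 | NaN | 27.0 | NaN | 4.0 | 4.0 | NaN |
| 3 | 258.4 | NaN | 11.0 | 17.0 | NaN | 9.0 | NaN | 10.0 | NaN |
| 4 | 273.4 | NaN | 1.0 | 18.0 | NaN | 9.0 | NaN | 10.0 | 5.0 |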


Okay, now count the non-NaN values for each entry and put them in a 'diversity' column in the fossil_data DataFrame.

In [202]:
fossil_data['diversity'] = fossil_data_nan[fossils].count(axis=1)

fossil_data.head()

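|   | age | 🌟 | 🌿 | 🐚 | 🐟 | 🐠 | 🦄 | 🦐 | 🦠 | diversity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 127.7 | 4.0 | 0.0 | 0.0 | 12.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2 |
| 1 | 322.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 | 7.0 | 2 |
| 2 | 179.2 | 20.0 | 5.0 | 0.0 | 27.0 | 0.0 | 4.0 | 4.0 | 0.0 | 5 |
| 3 | 258.4 | 0.0 | 11.0 | 17.0 | 0.0 | 9.0 | 0.0 | 10.0 | 0.0 | 4 |
| 4 | 273.4 | 0.0 | 1.0 | 18.0 | 0.0 | 9.0 | 0.0 | 10.0 | 5.0 | 5 |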
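
The NaN copy isn't strictly needed: comparing the counts with zero gives a boolean table that can be summed directly. A sketch on a toy DataFrame, with plain-text column names standing in for the emoji columns (my substitution, for readability):

```python
import pandas as pd

# Toy counts taken from four records of the challenge's worked example
toy = pd.DataFrame({
    'age':    [346.2, 349.8, 351.7, 354.9],
    'shell':  [2, 2, 0, 0],
    'fish':   [1, 2, 1, 0],
    'shrimp': [0, 2, 2, 0],
})
taxa = ['shell', 'fish', 'shrimp']

# A count greater than zero means the organism is present in that sample
diversity = (toy[taxa] > 0).sum(axis=1)

# Oldest (largest age) record among those with the maximum diversity
oldest = toy.loc[diversity == diversity.max(), 'age'].max()

print(diversity.tolist(), oldest)
```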


Now find the max age of all entries with the maximum diversity value:

In [210]:
answer2 = max(fossil_data.loc[fossil_data.diversity == fossil_data.diversity.max()].age)

print('The oldest record with max biodiversity has an age of', answer2)

The oldest record with max biodiversity has an age of 216.8


Submit answer 2:

In [211]:
params = {'key': my_key,   # <--- must be the same key as before
          'question': 2,   # <--- which question you're answering
          'answer': answer2,  # <--- your answer to that question
         }

r = requests.get(url, params)

r.text

'Correct'

Boo-yeah!

## Question 3

_What is the span (first to last appearance) of the most abundant organism?_

I will take the span as the difference between the first and last appearance of that organism.

First I need to find the most abundant organism:

In [221]:
abund_fossil = fossils[fossil_data[fossils].sum().values.argmax()]

print(abund_fossil, 'is the most abundant.')

🐟 is the most abundant.


Now find the difference between the first and last instance of the most abundant:

In [229]:
answer3 = round(max(fossil_data.loc[fossil_data[abund_fossil] > 0].age) -
      min(fossil_data.loc[fossil_data[abund_fossil] > 0].age),
      1)

print('The span of the most abundant fossil is', answer3)

The span of the most abundant fossil is 104.3
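
The `round(..., 1)` matters here because the ages are floats, and subtracting two floats does not always give a tidy one-decimal result. A sketch using the shrimp ages from the worked example in the challenge text:

```python
# First and last appearance of the shrimp in the challenge's worked example
first_appearance = 353.8
last_appearance = 348.7

raw_span = first_appearance - last_appearance
span = round(raw_span, 1)

print(repr(raw_span))    # may show floating-point noise in the last digits
print(f'{span:.1f}')     # formatted to one decimal place for submission
```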


Submit answer 3:

In [230]:
params = {'key': my_key,   # <--- must be the same key as before
          'question': 3,   # <--- which question you're answering
          'answer': answer3,  # <--- your answer to that question
         }

r = requests.get(url, params)

r.text

'Correct'

Boo-yeah!

## Question 4

_At what age is the latest appearance of the last fossil to appear?_

Okay, a little bit of a mind-twister here...

1. The last fossil to appear is the one whose maximum age (its first appearance) is the smallest.
2. Its latest appearance is its minimum age.

So, if I create an array that contains the min and max ages of each fossil, I should be able to find the answer to this question.

In [236]:
# Create an array where each row represents a fossil type 
age_range = np.ones((len(fossils),2))

for i,fossil in enumerate(fossils):
    # Put the max age in column 0
    age_range[i,0] = max(fossil_data.loc[fossil_data[fossil] > 0].age)
    # Put the min age in column 1
    age_range[i,1] = min(fossil_data.loc[fossil_data[fossil] > 0].age)

Now analyze the age ranges to discover the answer to question 4:

In [238]:
age_range

array([[229.3, 121.5],
       [273.4, 159.2],
       [281.8, 234.2],
       [216.8, 112.5],
       [324.7, 212.6],
       [224.2, 144.7],
       [302.5, 144.7],
       [332.3, 267.2]])

In [242]:
answer4 = age_range[np.argmin(age_range[:,0]),1]

print('The last fossil to appear was',
      fossils[np.argmin(age_range[:,0])],
      ', and they last appeared at',
      answer4)

The last fossil to appear was 🐟 , and they last appeared at 112.5
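
Tracking which row of `age_range` belongs to which fossil is a bit fiddly. One alternative, sketched here on the worked example from the challenge text with plain-text names standing in for the emoji columns (my substitution), keeps the first and last appearances in a table labelled by fossil:

```python
import pandas as pd

# Counts from the challenge's worked example, one row per sample
toy = pd.DataFrame({
    'age':    [345.1, 346.2, 348.7, 349.8, 350.0, 351.7, 353.8, 354.9],
    'fish':   [1, 1, 0, 2, 0, 1, 0, 0],
    'shell':  [0, 2, 1, 2, 2, 0, 0, 0],
    'shrimp': [0, 0, 1, 2, 4, 2, 1, 0],
})

# Long format: one row per (sample, fossil), keeping only fossils that are present
tidy = toy.melt(id_vars='age', var_name='fossil', value_name='count')
tidy = tidy[tidy['count'] > 0]

# First appearance is the oldest (max) age, last appearance the youngest (min)
ranges = tidy.groupby('fossil')['age'].agg(['max', 'min'])

latecomer = ranges['max'].idxmin()       # fossil with the youngest first appearance
answer = ranges.loc[latecomer, 'min']    # when that fossil was last seen

print(latecomer, answer)
```

Because the table is indexed by fossil name, `idxmin` returns the fossil itself rather than a row number that has to be looked up in a separate list.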


Submit answer 4:

In [243]:
params = {'key': my_key,   # <--- must be the same key as before
          'question': 4,   # <--- which question you're answering
          'answer': answer4,  # <--- your answer to that question
         }

r = requests.get(url, params)

r.text

'Correct! The next challenge is not ready yet. Check this URL again later.'