In [1]:
import requests
from IPython.display import Markdown

url = 'https://kata.geosci.ai/challenge/sample-names'

r = requests.get(url)
print('Status', r.status_code)

Markdown(r.text)

Status 200


# Sample names

You have a set of sample names. They look like this:

    001235_Ainsa_Sobrarbe_C_2016-04-20_PCx
    ^^^^^^ ^^^^^ ^^^^^^^^ ^ ^^^^^^^^^^ ^^^
      1      2      3     4      5      6

A **valid name** consists of 6 parts separated by underscores. The parts are underlined, above. Having 6 such parts is enough to be called 'valid' (though there may be other problems, for example with the spelling or formatting of individual parts).

The 6 parts are:

- **Unique identifier** consisting of 6 characters.
- **Basin name.** Note that spellings are not guaranteed to be correct.
- **Unit or Formation name.** Note that spellings are not guaranteed to be correct.
- **Specimen type**, either H or C (hand or core).
- **Date**, which must be in ISO 8601 YYYY-MM-DD format to be considered correct.
- **Preparation codes** of at least one character.

We need to extract some information from this dataset.
        
1. How many valid sample names are there?
2. How many valid samples were taken in the Ainsa basin? Include records with misspelt basin names.
3. What's the longest period of days with no valid samples taken in Ainsa?

If looking for misspellings, we'll assume that any word starting and ending in the same letters, but with the middle letters scrambled, is the same word. So 'Anisa' is a misspelling of 'Ainsa', but 'Aimsa' is not. We'll also assume that the spelling with the most occurrences is the correct spelling.


## Example

Here's some sample data:

    001235_Ainsa_Sobrarbe_C_2016-04-20_PCx
    001236_Ainsa_Sobrarbe_H_2016-04-21_P
    001237_Anisa_Sobrarbe_H_2016-04-29_TCx
    001238_Sorbas_Gochar_2017-06-03_PxM
    001238_Sorbas_Gochar_C_2017-06-03_PxM
    001240_SORBAS_Gochar_C_2017-06-03_PxM

Let's answer the 3 questions for this sample dataset:

- There are **5** valid names (and 1 invalid one, with no specimen type).
- The Ainsa Basin appears in **3** sample names (including 1 misspelling).
- There is a **7** day period with no samples taken, between 21 April and 29 April.


## Hints

It's likely that the `datetime` library will be useful in answering question 3. In particular, this code is useful:

    from datetime import datetime
    datetime.fromisoformat('2016-07-03')
    
If that command fails on a date, then you should consider the date format incorrect and ignore that record.


## A quick reminder how this works

You can retrieve your data by choosing any Python string as a **`<KEY>`** and substituting here:
    
    https://kata.geosci.ai/challenge/sample-names?key=<KEY>
                                                      ^^^^^
                                                      use your own string here

To answer question 1, make a request like:

    https://kata.geosci.ai/challenge/sample-names?key=<KEY>&question=1&answer=1234
                                                      ^^^^^          ^        ^^^^
                                                      your key       Q        your answer

[Complete instructions at kata.geosci.ai](https://kata.geosci.ai/challenge)

----

© 2020 Agile Scientific, licensed CC-BY

## Load my input

In [2]:
my_key = "scibbatical"

params = {'key': my_key}

r = requests.get(url, params)

# Look at the first bit of the input:
r.text[:100]

'000055_Sorbas_Yesaers_C_2000-01-01_xM\n000057_Sorbas_Yesares_H_2000-01-01_PTM\n000058_Sorbas_Yesares_H'

### Parse the input 

We need to parse the input into something useful.

The plan: turn the string into something like a CSV, then read it into a DataFrame.

In [4]:
import pandas as pd

In [3]:
# Replace '_' with ', '

interim = r.text.replace('_',', ')

print(interim[:100])

000055, Sorbas, Yesaers, C, 2000-01-01, xM
000057, Sorbas, Yesares, H, 2000-01-01, PTM
000058, Sorba


In [23]:
# Read the new string line-by-line, split each line at the ', ' and use the pieces to define a DataFrame
samps = pd.DataFrame([x.split(', ') for x in interim.split('\n')])

# Just for clarity, rename the columns:
samps.rename(columns={0:'id', 1:'basin', 2:'formation', 3:'type', 4:'date', 5:'prep'}, inplace=True)

# Great!
samps.head()

   id   basin  formation        type        date  prep
0  55  Sorbas    Yesaers           C  2000-01-01    xM
1  57  Sorbas    Yesares           H  2000-01-01   PTM
2  58  Sorbas    Yesares           H  2000-01-01   PTx
3  59  Sorbas    Yesraes           H    01-01-00    Cx
4  60  Sorbas    Yesares  2000-01-02         CxM  None


## Question 1

_How many valid sample names are there?_

The problem states that to be valid, an entry simply has to have six parts. Pandas didn't detect seven columns, so I think we can assume that the invalid samples will have fewer than six parts.

For entries with fewer than six parts, Pandas fills the trailing column(s) with None (roughly equivalent to NaN). Taking the string length of a cell containing None returns NaN, and a comparison of NaN with 0 is False, so a filter on the length of the last column drops those rows.

In [210]:
samps_valid = samps.loc[samps['prep'].str.len() > 0]

valid_samples = samps_valid.shape[0]

valid_samples

9190
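As a cross-check that does not rely on how Pandas pads short rows, here is a minimal sketch that counts the parts directly. It uses the six example names from the challenge text, and `is_valid_name` is a hypothetical helper, not part of the original solution. Note that it also rejects names with more than six parts, which the DataFrame approach above assumes never occur.

```python
# Example data copied from the challenge text
sample_text = """001235_Ainsa_Sobrarbe_C_2016-04-20_PCx
001236_Ainsa_Sobrarbe_H_2016-04-21_P
001237_Anisa_Sobrarbe_H_2016-04-29_TCx
001238_Sorbas_Gochar_2017-06-03_PxM
001238_Sorbas_Gochar_C_2017-06-03_PxM
001240_SORBAS_Gochar_C_2017-06-03_PxM"""


def is_valid_name(name):
    """Return True if the name has exactly six underscore-separated parts."""
    return len(name.split('_')) == 6


names = sample_text.splitlines()
n_valid = sum(is_valid_name(name) for name in names)

print(n_valid)  # 5, matching the challenge's worked example
```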

Submit answer 1:

In [179]:
params = {'key': my_key,   # <--- must be the same key as before
          'question': 1,   # <--- which question you're answering
          'answer': valid_samples,  # <--- your answer to that question
         }

r = requests.get(url, params)

r.text

'Correct'
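Every submission sends the same three fields (`key`, `question`, `answer`), as described in the challenge text. The sketch below only builds the request URL so the pattern is easy to see; `answer_url` is a hypothetical helper and nothing is sent over the network.

```python
from urllib.parse import urlencode

BASE_URL = 'https://kata.geosci.ai/challenge/sample-names'


def answer_url(key, question, answer, base_url=BASE_URL):
    """Build the URL that submits `answer` to `question` for the given key."""
    query = urlencode({'key': key, 'question': question, 'answer': answer})
    return f'{base_url}?{query}'


print(answer_url('mykey', 1, 1234))
# https://kata.geosci.ai/challenge/sample-names?key=mykey&question=1&answer=1234
```

Passing the same dictionary to `requests.get(url, params)` produces an equivalent request, which is what the cells in this notebook do.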

## Question 2

_How many valid samples were taken in the Ainsa basin? Include records with misspelt basin names._

_If looking for misspellings, we'll assume that any word starting and ending in the same letters, but with the middle letters scrambled, is the same word. So 'Anisa' is a misspelling of 'Ainsa', but 'Aimsa' is not._

Okay, let's write a function that returns true if the basin name is Ainsa or something like it:

In [202]:
from itertools import permutations

def check_name(target, name):
    
    # put the target into lowercase
    targetl = target.lower()
    
    # put the name into lowercase
    namel = name.lower()
    
    # Easy if it's spelt correctly
    if namel == targetl:
        return True
    
    # Otherwise, check whether it is a misspelling of the target
    
    # if the lengths are the same
    if len(namel) == len(targetl):
        # if the first and last letters match
        if namel[0] == targetl[0] and namel[-1] == targetl[-1]:
            # if the middle letters are a permutation of the real middle letters
            perms = [''.join(p) for p in permutations(targetl[1:-1])] # creates a list of permutations
            if namel[1:-1] in perms:
                return True
            
    # Return False otherwise
    
    return False
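Listing every permutation works for a five-letter name, but the list grows factorially with the length of the word. A possible alternative, sketched here under the same rule (same first and last letters, middle letters scrambled), compares the sorted middle letters instead. `is_scrambled` is a hypothetical name and was not used to produce the answers in this notebook.

```python
def is_scrambled(target, name):
    """Return True if `name` matches `target` up to a shuffle of the middle letters."""
    target, name = target.lower(), name.lower()

    # Different lengths (or empty strings) can never match
    if not target or len(name) != len(target):
        return False

    return (name[0] == target[0]
            and name[-1] == target[-1]
            and sorted(name[1:-1]) == sorted(target[1:-1]))


print(is_scrambled('Ainsa', 'Anisa'))  # True
print(is_scrambled('Ainsa', 'Aimsa'))  # False
```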

In [214]:
# Select all samples from Ainsa basin
ainsa_samples = samps_valid.loc[[check_name('Ainsa', name) for name in samps_valid['basin']]]

print('There are', str(ainsa_samples.shape[0]),'samples in the Ainsa basin.')

There are 1618 samples in the Ainsa basin.
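The challenge also says that the spelling with the most occurrences should be treated as the correct one, whereas the cell above hard-codes 'Ainsa' as the target. A minimal sketch of how the canonical spelling could be found from the data, assuming case is ignored (as `check_name` does); `signature` and `most_common_spellings` are hypothetical helpers, and the basin list is taken from the challenge's example.

```python
from collections import Counter


def signature(word):
    """Key shared by all scrambles of a word: first letter, sorted middle, last letter."""
    w = word.lower()
    return (w[:1], ''.join(sorted(w[1:-1])), w[-1:])


def most_common_spellings(names):
    """Map each scramble signature to its most frequent (lowercased) spelling."""
    counts = Counter(name.lower() for name in names)
    best = {}
    for spelling, n in counts.items():
        key = signature(spelling)
        if key not in best or n > counts[best[key]]:
            best[key] = spelling
    return best


# Basin names from the five valid rows of the challenge's example, plus the invalid row
basins = ['Ainsa', 'Ainsa', 'Anisa', 'Sorbas', 'Sorbas', 'SORBAS']
canonical = most_common_spellings(basins)

print(canonical[signature('Anisa')])  # ainsa
```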


Submit answer 2:

In [215]:
params = {'key': my_key,   # <--- must be the same key as before
          'question': 2,   # <--- which question you're answering
          'answer': ainsa_samples.shape[0],  # <--- your answer to that question
         }

r = requests.get(url, params)

r.text

'Correct'

## Question 3

_What's the longest period of days with no valid samples taken in Ainsa?_

Now, the hint for this question suggests using `datetime.fromisoformat()`, but it was only added in Python 3.7, so I don't have access to it. I can write a function that I think will replicate the functionality, though.

In [None]:
# 5. Date must be in ISO8601 YYYY-MM-DD format to be considered correct.

from datetime import datetime

# Here's a function that returns True when date strings fit this format; False otherwise
def testdate(date):
    try:
        datetime.strptime(date, '%Y-%m-%d')
        return True
    except (ValueError, TypeError):  # badly formatted string, or not a string at all
        return False
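One caveat: `strptime` with `'%Y-%m-%d'` is more lenient than `fromisoformat`, because it also accepts months and days without zero padding, such as `'2016-4-2'`. The answer above was accepted, so that case may simply not occur in this dataset, but a stricter check could be sketched as follows. `is_iso_date` and the regular expression are assumptions of this sketch, not part of the original solution.

```python
import re
from datetime import datetime

# Exactly four digits, two digits, two digits, separated by hyphens
ISO_DATE = re.compile(r'[0-9]{4}-[0-9]{2}-[0-9]{2}')


def is_iso_date(text):
    """Return True only for real calendar dates written as YYYY-MM-DD."""
    if not isinstance(text, str) or not ISO_DATE.fullmatch(text):
        return False
    try:
        datetime.strptime(text, '%Y-%m-%d')
    except ValueError:
        # Right shape, but not a real date (for example 2016-02-30)
        return False
    return True


print(is_iso_date('2016-04-20'))  # True
print(is_iso_date('01-01-00'))    # False
print(is_iso_date('2016-4-2'))    # False
```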
 

In [217]:
# Bring forward only those Ainsa samples with valid dates
# (also create this as a copy to help with future operations)
ainsa_samples_dated = ainsa_samples.loc[[testdate(date) for date in ainsa_samples['date']]].copy()

print(str(len(ainsa_samples)-len(ainsa_samples_dated)), 'samples removed.')

78 samples removed.


All of the remaining dates are now correctly formatted.

We'll want to work with these dates as datetime objects:

In [222]:
# Change the type of the date column (effectively)
ainsa_samples_dated['date'] = pd.to_datetime(ainsa_samples_dated['date'])

Now, we'll sort by date and then find the max change in time (in days) between consecutive samples.

In [255]:
# Sort
ainsa_samples_dated.sort_values('date', inplace=True)

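# Now find the max difference between consecutive dates
# (start at 1, otherwise i-1 wraps around and compares the first row with the last)
max_delta_t = max(ainsa_samples_dated['date'].iloc[i] -
                  ainsa_samples_dated['date'].iloc[i-1] for i in range(1, ainsa_samples_dated.shape[0]))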

# Now, the max period of days with no samples is one less than the max difference.
print('Max time without samples is', str(max_delta_t.days-1))

Max time without samples is 299
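To see why one day is subtracted, it helps to run the same calculation on the Ainsa dates from the challenge's example, where the expected answer is 7. This is a minimal standard-library sketch; `longest_gap_days` is a hypothetical helper, not the code used for the submitted answer.

```python
from datetime import date


def longest_gap_days(dates):
    """Longest run of whole days with no sample between consecutive sample dates."""
    ordered = sorted(dates)
    # Days strictly between two sample dates: the difference minus one
    gaps = [(later - earlier).days - 1
            for earlier, later in zip(ordered, ordered[1:])]
    # Fall back to 0 when there are fewer than two dates or only repeated dates
    return max(gaps + [0])


# Dates of the three Ainsa samples in the challenge's example
ainsa_dates = [date(2016, 4, 29), date(2016, 4, 20), date(2016, 4, 21)]

print(longest_gap_days(ainsa_dates))  # 7
```

21 April and 29 April are 8 days apart, which leaves 7 sample-free days (22 to 28 April) between them.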


Submit answer 3:

In [256]:
params = {'key': my_key,   # <--- must be the same key as before
          'question': 3,   # <--- which question you're answering
          'answer': max_delta_t.days-1,  # <--- your answer to that question
         }

r = requests.get(url, params)

r.text

'Correct! The next challenge is: https://kata.geosci.ai/challenge/prospecting - good luck!'