In [3]:
import requests
from IPython.display import Markdown

url = 'https://kata.geosci.ai/challenge/boreholes'

r = requests.get(url)
print('Status', r.status_code)

Markdown(r.text)

Status 200


# Boreholes

You have a list of boreholes. Each one has an (x, y) location. The locations are given as a Python string, and look like this:

    ..., (12.1, 34.3), (56.5, 78.7), (90.9, 12.1),...
    
Your data, when you receive it, will be longer than this.
    
We're going to analyse these locations. We need the answers to the following questions:
        
1. How many boreholes are there? We'll call this number _n_.
2. What's the distance, **to the nearest metre** between the first two boreholes in the list?
3. What is the mean straight-line distance between all pairs of boreholes **to the nearest metre**? Call this _m_.
4. There is a clump of boreholes. How many boreholes are in the clump? (A borehole is defined to be in a clump if the mean distance to its nearest _n_ / 5 neighbours is _m_ / 4 or less.)

Please note that all your answers must be integers. If you get a float for an answer, round it.


## Example

Here are the locations of some boreholes:

      (1, 4), (5, 4), (9, 3), (2, 8), (6, 4), (9, 9), (5, 5), (4, 3), (4, 5), (2, 1)
      
If we plot them, they look like this:

    y
    ^
    9 - - - - - - - - - 0
    8 - - 0 - - - - - - -
    7 - - - - - - - - - -
    6 - - - - - - - - - -
    5 - - - - 0 0 - - - -
    4 - 0 - - - 0 0 - - -
    3 - - - - 0 - - - - 0
    2 - - - - - - - - - -
    1 - - 0 - - - - - - -
    0 - - - - - - - - - -
      0 1 2 3 4 5 6 7 8 9 > x
    
Here's how we'd answer the questions for this small dataset:

- In this example, there are **10** wells (marked `0` on the plot above).
- The distance between the first two boreholes in the list, (1, 4) and (5, 4), is **4**.
- The mean distance between boreholes is 4.58... which to the nearest metre is **5**.
- There are **4** wells in the clump. See below.

Wells in the clump are marked `X` here (the borehole marked `O` does not meet the criterion):

    y
    ^
    9 - - - - - - - - - 0
    8 - - 0 - - - - - - -
    7 - - - - - - - - - -
    6 - - - - - - - - - -
    5 - - - - X X - - - -
    4 - 0 - - - X X - - -
    3 - - - - O - - - - 0
    2 - - - - - - - - - -
    1 - - 0 - - - - - - -
    0 - - - - - - - - - -
      0 1 2 3 4 5 6 7 8 9 > x


## A quick reminder how this works

You can retrieve your data by choosing any Python string as a **`<KEY>`** and substituting here:
    
    https://kata.geosci.ai/challenge/boreholes?key=<KEY>
                                                   ^^^^^
                                                   use your own string here

To answer question 1, make a request like:

    https://kata.geosci.ai/challenge/boreholes?key=<KEY>&question=1&answer=1234
                                                   ^^^^^          ^        ^^^^
                                                   your key       Q        your answer

[Complete instructions at kata.geosci.ai](https://kata.geosci.ai/challenge)

----

© 2020 Agile Scientific, licensed CC-BY

## Load my input

Okay, let's get an input for our problem. I'll use a "key":

In [54]:
my_key = "scibbatical"

params = {'key': my_key}

r = requests.get(url, params)

# Look at the first bit of the input:
r.text[:100]

'(16736.45, 10471.65), (18443.86, 47.41), (10702.08, 18033.28), (15015.61, 22452.92), (2537.61, 12254'

### Parse the input 

The input is one long string of `(x, y)` pairs, so we need to parse it into something numeric.

The plan: rewrite the string as CSV-like text with one `x, y` pair per line, then read that into a pandas DataFrame.

In [5]:
import pandas as pd

In [60]:
# Drop every '(', turn each '), ' separator into a newline, then drop the final ')':
interim = r.text.replace('(','').replace('), ','\n').replace(')','')

print(interim[:100])

16736.45, 10471.65
18443.86, 47.41
10702.08, 18033.28
15015.61, 22452.92
2537.61, 12254.59
451.38, 7


In [61]:
# Split the text into lines, split each line at ', ', and build a DataFrame from the pairs
locs = pd.DataFrame([x.split(', ') for x in interim.split('\n')])

# Great!

# But the cells are filled with strings, see?
locs.iloc[:-5].values

array([['16736.45', '10471.65'],
       ['18443.86', '47.41'],
       ['10702.08', '18033.28'],
       ...,
       ['24203.54', '2405.98'],
       ['15373.42', '12704.85'],
       ['13172.82', '12181.38']], dtype=object)

In [67]:
# So convert to float...
locs = locs.astype(float)

locs.iloc[:-5].values

array([[16736.45, 10471.65],
       [18443.86,    47.41],
       [10702.08, 18033.28],
       ...,
       [24203.54,  2405.98],
       [15373.42, 12704.85],
       [13172.82, 12181.38]])

In [68]:
locs.head()

          0         1
0  16736.45  10471.65
1  18443.86     47.41
2  10702.08  18033.28
3  15015.61  22452.92
4   2537.61  12254.59
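
As an alternative to the string replacements above, the raw text already looks like a sequence of Python tuple literals, so the standard library's `ast.literal_eval` can parse it directly. This is only a sketch: `sample` is the first three pairs from the input shown earlier, standing in for `r.text`, and the name `coords` is made up for the illustration.

```python
import ast

# Stand-in for r.text: the first three pairs from the input shown earlier
sample = "(16736.45, 10471.65), (18443.86, 47.41), (10702.08, 18033.28)"

# Wrapping the text in [] makes it a list literal, so the result is always a list of tuples
coords = ast.literal_eval("[" + sample + "]")

print(len(coords))  # number of boreholes parsed
print(coords[1])    # second borehole as an (x, y) tuple of floats
```

Passing `coords` to `pd.DataFrame` would then give float columns directly, with no separate `astype(float)` step.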


## Question 1

_How many boreholes are there? We'll call this number n._

The number of wells is the number of rows in the DataFrame.

In [20]:
n = locs.shape[0]

print('There are', n, 'wells.')

There are 600 wells.


Submit the answer:

In [21]:
params = {'key': my_key,   # <--- must be the same key as before
          'question': 1,   # <--- which question you're answering
          'answer': n,  # <--- your answer to that question
         }

r = requests.get(url, params)

r.text

'Correct'

Bingo!
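
The same three parameters get sent for every answer, so it can help to see the URL they produce. The sketch below builds that URL with the standard library only and sends nothing. `answer_url` is a hypothetical helper, not part of the challenge API, and the URL layout is taken from the reminder in the challenge text.

```python
from urllib.parse import urlencode

BASE_URL = "https://kata.geosci.ai/challenge/boreholes"

def answer_url(key, question, answer):
    """Build the URL for submitting an answer (hypothetical helper; no request is sent)."""
    query = urlencode({"key": key, "question": question, "answer": answer})
    return f"{BASE_URL}?{query}"

print(answer_url("scibbatical", 1, 600))
```

`requests.get(url, params)` assembles the same kind of query string from the `params` dict.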

## Question 2

_What's the distance, to the nearest metre between the first two boreholes in the list?_

NumPy can compute this: `np.linalg.norm(a - b)` returns the Euclidean length of the difference vector, which is the straight-line distance between points `a` and `b`.

In [22]:
import numpy as np

In [69]:
dist = np.linalg.norm(locs.iloc[0]-locs.iloc[1])
dist

10563.14481987727

Okay, so that works, but I think I'll package this as a function so I can just pass index numbers instead of .iloc[]:

In [70]:
def idist(a, b):
    """Distance between the boreholes at row positions a and b of locs."""
    return np.linalg.norm(locs.iloc[a] - locs.iloc[b])

In [71]:
idist(0,1)

10563.14481987727

Great, it works.
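
As a sanity check on the approach, the worked example in the challenge text gives a known answer: boreholes (1, 4) and (5, 4) are 4 apart. A minimal sketch using only the standard library (the variable names are illustrative):

```python
import math

# First two boreholes from the worked example in the challenge text
first, second = (1, 4), (5, 4)

# Straight-line (Euclidean) distance from the coordinate differences
d = math.hypot(second[0] - first[0], second[1] - first[1])

print(d)         # 4.0
print(round(d))  # 4
```

One thing to keep in mind: Python's built-in `round` rounds exact halves to the nearest even integer, which would only matter if a distance landed exactly on .5.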

In [76]:
answer2 = int(round(idist(0,1),0))

Submit answer 2:

In [78]:
params = {'key': my_key,   # <--- must be the same key as before
          'question': 2,   # <--- which question you're answering
          'answer': answer2,  # <--- your answer to that question
         }

r = requests.get(url, params)

r.text

'Correct'

Bingo!

## Question 3

_What is the mean straight-line distance between all pairs of boreholes to the nearest metre? Call this m._

Okay, so I could populate a 600 × 600 matrix with the pairwise distances. Each distance would appear twice (once as `[i, j]` and once as `[j, i]`), but that doesn't change the mean. The diagonal holds each well's distance to itself, which is zero and must be excluded from the mean. The easiest way to do that is to leave the diagonal as NaN...
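
To see why the duplicated entries are harmless, write $d_{ij}$ for the distance between boreholes $i$ and $j$ and $\bar{d}$ for the unrounded mean (notation introduced here for illustration). Because $d_{ij} = d_{ji}$, averaging the $n(n-1)$ off-diagonal entries is the same as averaging over the $n(n-1)/2$ unique pairs:

$$
\bar{d} = \frac{1}{n(n-1)} \sum_{i \neq j} d_{ij} = \frac{2}{n(n-1)} \sum_{i < j} d_{ij}
$$

For $n = 600$ that is 179,700 unique pairs, or 359,400 off-diagonal entries. The answer _m_ is $\bar{d}$ rounded to the nearest integer.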

In [89]:
# Define an n-by-n matrix of NaNs to hold the distance values
dist = np.full((n, n), np.nan)

dist.shape

(600, 600)

Fill dist with norm values. Leave self-distances as nan...

In [90]:
for i in range(n):
    for j in range(n):
        if i != j:
            dist[i, j] = idist(i, j)

Ugh, that was slow. Still workable for our 600 wells though!
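
The loop makes roughly 360,000 separate `idist` calls, each with its own pandas indexing, which is likely where the time goes. A vectorized alternative is to let NumPy broadcasting compute every coordinate difference at once. The sketch below is not the approach used in this notebook. It runs on the ten boreholes from the worked example in the challenge text, so the result can be checked against the stated mean of 4.58..., and the names `xy`, `diff`, `example_dist` and `example_m` are illustrative.

```python
import numpy as np

# The ten boreholes from the worked example in the challenge text
xy = np.array([(1, 4), (5, 4), (9, 3), (2, 8), (6, 4),
               (9, 9), (5, 5), (4, 3), (4, 5), (2, 1)], dtype=float)

# diff[i, j] is the vector from borehole j to borehole i; shape (10, 10, 2)
diff = xy[:, None, :] - xy[None, :, :]

# Euclidean length of every difference vector; shape (10, 10)
example_dist = np.linalg.norm(diff, axis=-1)

# Blank out the self-distances so that nanmean ignores them
np.fill_diagonal(example_dist, np.nan)

example_m = int(round(np.nanmean(example_dist)))
print(example_m)  # 5, matching the worked example
```

On the real data, `xy` would be the two coordinate columns of `locs` converted to a NumPy array.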

In [91]:
# Check some values:
dist[:5,:5]

array([[           nan, 10563.14481988,  9674.28909811, 12104.21914534,
        14310.34354546],
       [10563.14481988,            nan, 19581.28385028, 22666.26957447,
        20050.53696575],
       [ 9674.28909811, 19581.28385028,            nan,  6175.73953389,
        10002.59108916],
       [12104.21914534, 22666.26957447,  6175.73953389,            nan,
        16115.4093584 ],
       [14310.34354546, 20050.53696575, 10002.59108916, 16115.4093584 ,
                   nan]])

The diagonal NaNs are in place, and the matrix is symmetric, as expected.

The mean distance can be calculated while ignoring the nans:

In [93]:
m = int(round(np.nanmean(dist),0))

m

11707

Submit answer 3:

In [95]:
params = {'key': my_key,   # <--- must be the same key as before
          'question': 3,   # <--- which question you're answering
          'answer': m,  # <--- your answer to that question
         }

r = requests.get(url, params)

r.text

'Correct'

Bingo!

## Question 4

_There is a clump of boreholes. How many boreholes are in the clump? (A borehole is defined to be in a clump if the mean distance to its nearest n / 5 neighbours is m / 4 or less.)_

Okay, I think I'll just loop through all wells and test the criterion given. Any well that meets it gets flagged.

In [126]:
# Let's add a column to the locs DataFrame that acts as a binary flag for clump wells:
locs['clump'] = 0

for i in range(n):
    # np.sort puts NaN (the self-distance) last, so the first n//5 values are the nearest neighbours
    if np.mean(np.sort(dist[i])[:n//5]) <= m/4:
        # Set the flag with a single .loc call (avoids pandas' chained-assignment warning)
        locs.loc[i, 'clump'] = 1


That also took a little too long. Meh.
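
The per-row loop can also be replaced by sorting the whole distance matrix at once. The sketch below again uses the ten boreholes from the worked example, where the expected clump size is 4 with n // 5 = 2 neighbours and a threshold of m / 4 = 1.25. It is an alternative to the loop above, not the method used in this notebook, and all variable names are illustrative.

```python
import numpy as np

# The ten boreholes from the worked example in the challenge text
xy = np.array([(1, 4), (5, 4), (9, 3), (2, 8), (6, 4),
               (9, 9), (5, 5), (4, 3), (4, 5), (2, 1)], dtype=float)
n_ex = len(xy)

# Pairwise distance matrix with NaN on the diagonal (self-distances)
d = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=-1)
np.fill_diagonal(d, np.nan)
m_ex = int(round(np.nanmean(d)))  # rounded mean pairwise distance

# Sort each row; np.sort places NaN last, so the first k columns are the k nearest neighbours
k = n_ex // 5
nearest_mean = np.sort(d, axis=1)[:, :k].mean(axis=1)

in_clump = nearest_mean <= m_ex / 4
print(int(in_clump.sum()))  # 4, matching the worked example
print(xy[in_clump])         # the four boreholes marked X in the challenge text
```

Note that the threshold uses the rounded mean (5, not 4.58...); with the unrounded value the example would not give 4.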

Since each clump well is flagged with a 1, the sum of the flags is the number of wells in the clump:

In [128]:
answer4 = int(locs['clump'].sum())

answer4

137

Submit answer 4:

In [129]:

params = {'key': my_key,   # <--- must be the same key as before
          'question': 4,   # <--- which question you're answering
          'answer': answer4,  # <--- your answer to that question
         }

r = requests.get(url, params)

r.text

'Correct! The next challenge is: https://kata.geosci.ai/challenge/sample-names - good luck!'