# Project description

The OilyGiant mining company needs a model that will find the best place for a new well.

Steps to choose the location:
1. Collect the oil well parameters in the selected region: oil quality and volume of reserves.
2. Build a model for predicting the volume of reserves in the new wells.
3. Pick the oil wells with the highest estimated values.
4. Pick the region with the highest total profit for the selected oil wells.
5. Build a model that will help to pick the region with the highest profit margin. Analyze potential profit and risks using the Bootstrapping technique.


Project conclusion: suggest a region for oil well development and justify the choice.

Data description:
- Geological exploration data for the three regions are stored in three separate CSV files.
- id - unique oil well identifier
- f0, f1, f2 - three features of points (their specific meaning is unimportant, but the features themselves are significant)
- product - volume of reserves in the oil well (thousand barrels).

Conditions:
- When exploring a region, 500 points are studied and the best 200 of them are picked for the profit calculation.
- The budget for developing 200 oil wells is 100 million USD.
- One unit of product is worth 4,500 USD (volume of reserves is in thousand barrels).
- After the risk evaluation, keep only the regions with the risk of losses lower than 2.5%. From the ones that fit the criteria, the region with the highest average profit should be selected.

In [1]:
import warnings

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Silence library warnings to keep the notebook output readable.
warnings.simplefilter(action='ignore')

### Open and prepare the data:

In [2]:
df0 = pd.read_csv('/datasets/geo_data_0.csv')
df1 = pd.read_csv('/datasets/geo_data_1.csv')
df2 = pd.read_csv('/datasets/geo_data_2.csv')

geo0 = df0[['f0','f1','f2','product']]
geo1 = df1[['f0','f1','f2','product']]
geo2 = df2[['f0','f1','f2','product']]

print(geo0.head(3))
print("------------------------------------------------------------------------------")
print(geo0.info())

         f0        f1        f2     product
0  0.705745 -0.497823  1.221170  105.280062
1  1.334711 -0.340164  4.365080   73.037750
2  1.022732  0.151990  1.419926   85.265647
------------------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 4 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   f0       100000 non-null  float64
 1   f1       100000 non-null  float64
 2   f2       100000 non-null  float64
 3   product  100000 non-null  float64
dtypes: float64(4)
memory usage: 3.1 MB
None


### Split the source data into a training set and a validation set (75:25):

In [3]:
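# train_test_split returns (first part, second part); with test_size=0.75 the
# second part holds 75% of the rows, so it is used as the training set here.
geo0_valid, geo0_train = train_test_split(geo0, test_size=0.75, random_state=12345)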
features_train_0 = geo0_train.drop(['product'], axis=1)
target_train_0 = geo0_train['product']
features_valid_0 = geo0_valid.drop(['product'], axis=1)
target_valid_0 = geo0_valid['product']

geo1_valid, geo1_train = train_test_split(geo1, test_size=0.75, random_state=12345)
features_train_1 = geo1_train.drop(['product'], axis=1)
target_train_1 = geo1_train['product']
features_valid_1 = geo1_valid.drop(['product'], axis=1)
target_valid_1 = geo1_valid['product']

geo2_valid, geo2_train = train_test_split(geo2, test_size=0.75, random_state=12345)
features_train_2 = geo2_train.drop(['product'], axis=1)
target_train_2 = geo2_train['product']
features_valid_2 = geo2_valid.drop(['product'], axis=1)
target_valid_2 = geo2_valid['product']

#### Train the model and make predictions for the validation set.
#### Print the average volume of predicted reserves and model RMSE.

In [4]:
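def geocomp(features_train, target_train, features_valid, target_valid):
    """Fit a linear regression and report results on the validation set.

    Prints the mean predicted reserves (GAVG) and the RMSE, then returns
    the validation predictions.
    """
    model = LinearRegression()
    model.fit(features_train, target_train)
    prediction = model.predict(features_valid)
    print("GAVG:",prediction.mean())
    print("RMSE:",(mean_squared_error(target_valid, prediction))**0.5,"\n")
    return prediction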

print("Geo0")
prediction_0 = geocomp(features_train_0, target_train_0, features_valid_0, target_valid_0)
print("Geo1")
prediction_1 = geocomp(features_train_1, target_train_1, features_valid_1, target_valid_1)
print("Geo2")
prediction_2 = geocomp(features_train_2, target_train_2, features_valid_2, target_valid_2)

Geo0
GAVG: 92.68042230714772
RMSE: 37.68019646464799 

Geo1
GAVG: 68.8598208262229
RMSE: 0.8873287354658539 

Geo2
GAVG: 95.06093851120131
RMSE: 40.11167877627781 



#### Analysis of results:
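- Geo-2 and Geo-0 have much higher mean predicted reserves (95.06 and 92.68 thousand barrels).
- Geo-1 has the smallest mean predicted reserves (68.86 thousand barrels).
- Geo-1 has by far the smallest root-mean-square error (0.89 versus 37.68 and 40.11), so its model is the most accurate of the three.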
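
RMSE is expressed in the same units as `product` (thousand barrels), which is what makes the three values directly comparable. As a reminder of what the metric computes, here is a minimal sketch on toy numbers; the helper name `rmse` and the sample values are illustrative assumptions, not part of the notebook.

```python
import numpy as np


def rmse(actual, predicted):
    """Root-mean-square error between two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


# Toy values for illustration only (not taken from the datasets).
actual = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 9.0]
print(rmse(actual, predicted))  # sqrt((1 + 0 + 4) / 3)
```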

### Profit calculation:
* Store all key values for calculations in separate variables.
* Calculate the volume of reserves sufficient for developing a new well without losses. 
* Compare the obtained value with the average volume of reserves in each region.


In [5]:
OIL_UNIT_DOLLARS = 4500
DEVELOPMENT_COST = 100000000
WELL_COST = DEVELOPMENT_COST/200
MIN_RESERVES = WELL_COST/OIL_UNIT_DOLLARS
print("Minimal well size for profitability:",round(MIN_RESERVES,2))
print("Well mean size Geo 0, 1, 2:  ",round(prediction_0.mean(),2),",", 
      round(prediction_1.mean(),2),',', round(prediction_2.mean(),2))

Minimal well size for profitability: 111.11
Well mean size Geo 0, 1, 2:   92.68 , 68.86 , 95.06


* The mean predicted well size in every region is below the break-even volume of 111.11 thousand barrels.
* We need an accurate model to pick the best wells in each region and keep the mean size of the selected wells above the break-even point.

### Function to calculate profit from a set of selected oil wells and model predictions:

In [6]:
results_0 = pd.DataFrame (target_valid_0)
results_0['prediction'] = prediction_0

results_1 = pd.DataFrame (target_valid_1)
results_1['prediction'] = prediction_1

results_2 = pd.DataFrame (target_valid_2)
results_2['prediction'] = prediction_2

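def profit(target, probabilities, count):
    # `probabilities` holds the predicted reserves (regression outputs);
    # `target` holds the actual reserves under the same index.
    probs_sorted = probabilities.sort_values(ascending=False)
    # Actual reserves of the `count` wells with the highest predictions.
    selected = target[probs_sorted.index][:count]
    return (OIL_UNIT_DOLLARS * selected.sum() - DEVELOPMENT_COST)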

state = np.random.RandomState(12345)          
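
One caveat with the label-based lookup in `profit`: when `target` comes from sampling with replacement, its index contains repeated labels, and `target[probs_sorted.index]` returns every matching row for each repeated label. A well drawn twice can therefore appear four times before the `[:count]` cut. The sketch below is a position-based variant that avoids index labels altogether. It is an alternative, not the function used for the results in this notebook; the name `profit_by_position` is hypothetical, and it assumes the two inputs are aligned row by row.

```python
import numpy as np

OIL_UNIT_DOLLARS = 4500          # value of one unit of product, in USD
DEVELOPMENT_COST = 100_000_000   # budget for 200 wells, in USD


def profit_by_position(target, predictions, count):
    """Profit from the `count` wells with the highest predictions.

    Hypothetical helper: `target` and `predictions` are aligned array-likes,
    so row i of one corresponds to row i of the other.
    """
    target = np.asarray(target, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    # Positions of the `count` largest predictions, best first.
    top = np.argsort(predictions)[::-1][:count]
    return OIL_UNIT_DOLLARS * target[top].sum() - DEVELOPMENT_COST
```

With pandas objects, passing `series.values` (or `series.to_numpy()`) for both arguments keeps the rows aligned by position.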

### Calculate risks and profit for each region:
* After the risk evaluation, keep only the regions with the risk of losses lower than 2.5%. From the ones that fit the criteria, the region with the highest average profit should be selected.

In [7]:
def bootpredict(results):
    """Bootstrap the profit distribution for one region and print a summary."""
    target = results['product']
    probabilities = results['prediction']
    values = []
    for i in range(1000):
        # Draw 500 explored points with replacement, then develop the best 200.
        target_subsample = target.sample(n=500, replace=True, random_state=state)
        probs_subsample = probabilities[target_subsample.index]
        values.append(profit(target_subsample, probs_subsample, 200))

    values = pd.Series(values)
    print("Average Profit: ", round(values.mean()))
    print("2.5% quantile:  ",  round(values.quantile(0.025)))
    print("97.5% quantile: ",  round(values.quantile(0.975)))
    # Risk of losses: share of bootstrap samples with negative profit.
    print("Risk of losses: ", 100*values.lt(0).sum()/len(values),"%\n")

print("GEO-0:")
bootpredict(results_0)

print("GEO-1:")
bootpredict(results_1)

print("GEO-2:")
bootpredict(results_2)

GEO-0:
Average Profit:  4687572
2.5% quantile:   -855133
97.5% quantile:  10257666
Risk of losses:  4.6 %

GEO-1:
Average Profit:  4953020
2.5% quantile:   1095333
97.5% quantile:  9365531
Risk of losses:  0.9 %

GEO-2:
Average Profit:  3654378
2.5% quantile:   -1850614
97.5% quantile:  8931665
Risk of losses:  10.0 %
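
The bootstrap function above prints its summary. A variant that returns the numbers makes it easier to apply the 2.5% risk threshold in code. The sketch below is one possible way to do that, under stated assumptions: the name `bootstrap_profit`, its parameters, and the synthetic demo data are all hypothetical, and it samples row positions rather than index labels, so it will not reproduce the figures printed above.

```python
import numpy as np

OIL_UNIT_DOLLARS = 4500          # value of one unit of product, in USD
DEVELOPMENT_COST = 100_000_000   # budget for 200 wells, in USD


def bootstrap_profit(target, predictions, n_points=500, n_best=200,
                     n_iter=1000, seed=12345):
    """Return bootstrap profit statistics for one region as a dict."""
    target = np.asarray(target, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    rng = np.random.RandomState(seed)
    profits = np.empty(n_iter)
    for i in range(n_iter):
        # Draw explored points with replacement (row positions, not labels).
        pos = rng.randint(0, len(target), size=n_points)
        # Keep the points with the highest predicted reserves.
        best = pos[np.argsort(predictions[pos])[::-1][:n_best]]
        profits[i] = OIL_UNIT_DOLLARS * target[best].sum() - DEVELOPMENT_COST
    return {
        "mean_profit": profits.mean(),
        "q025": np.quantile(profits, 0.025),
        "q975": np.quantile(profits, 0.975),
        "risk_of_loss_pct": 100 * (profits < 0).mean(),
    }


# Synthetic stand-in data, NOT one of the notebook's regions.
demo_rng = np.random.RandomState(0)
demo_target = demo_rng.uniform(0, 190, size=25_000)
demo_pred = demo_target + demo_rng.normal(0, 40, size=25_000)
demo_stats = bootstrap_profit(demo_target, demo_pred, n_iter=200)
print(demo_stats)
```

Returning a dict lets the caller compare `risk_of_loss_pct` against the threshold directly instead of reading it off the printed output.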



#### Conclusions
- GEO-1 has the highest average bootstrapped profit (about 4.95 million USD) for the wells chosen by our model.
- GEO-1 is the only region with a positive profit at the lower 2.5% quantile of the bootstrap distribution.
- GEO-1 has the lowest risk of losses (0.9%) and is the only region below the 2.5% threshold.
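
The selection rule from the project conditions (drop regions with a risk of losses of 2.5% or more, then take the highest average profit) can be written out explicitly. This is a small sketch using the bootstrap figures printed above; the function name `pick_region` and the dictionary layout are illustrative assumptions.

```python
# Bootstrap summary per region, copied from the notebook output:
# (average profit in USD, risk of losses in %).
summary = {
    "GEO-0": (4_687_572, 4.6),
    "GEO-1": (4_953_020, 0.9),
    "GEO-2": (3_654_378, 10.0),
}
MAX_RISK_PCT = 2.5  # threshold from the project conditions


def pick_region(summary, max_risk_pct):
    """Return the eligible region with the highest average profit, or None."""
    eligible = {name: avg for name, (avg, risk) in summary.items()
                if risk < max_risk_pct}
    if not eligible:
        return None
    return max(eligible, key=eligible.get)


print(pick_region(summary, MAX_RISK_PCT))  # prints GEO-1
```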

#### Recommendations
- I recommend developing the GEO-1 region, using our model to select the wells.