# Project description

You work for the OilyGiant mining company. Your task is to find the best place
for a new well.


Steps to choose the location:
* Collect the oil well parameters in the selected region: oil quality and
volume of reserves;
* Build a model for predicting the volume of reserves in the new wells;
* Pick the oil wells with the highest estimated values;
* Pick the region with the highest total profit for the selected oil wells.

## Download and prepare the data

In [1]:
# Loading all the libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
import matplotlib.pyplot as plt
from numpy.random import RandomState
import seaborn as sns

In [2]:
# Loading data

reg1 = pd.read_csv('geo_data_0.csv')
reg2 = pd.read_csv('geo_data_1.csv')
reg3 = pd.read_csv('geo_data_2.csv')

### Data description
Geological exploration data for the three regions are stored in the files
`geo_data_0.csv`, `geo_data_1.csv`, and `geo_data_2.csv`.
* id — unique oil well identifier
* f0, f1, f2 — three features of points (their specific meaning is unimportant, but the features themselves are significant)
* product — volume of reserves in the oil well (thousand barrels).

### First region data inspection:

In [3]:
# Showing data samples
reg1.sample(5)

Unnamed: 0,id,f0,f1,f2,product
54775,duoC6,1.902902,-0.049188,4.409264,87.782301
56666,FzdcS,-0.911548,0.406071,7.100214,121.467503
34170,EasUV,0.477511,-0.69848,9.247731,155.571552
69604,Y5nMk,1.030943,0.459317,0.067644,15.332403
73829,4ZfC2,1.932281,0.51281,5.507542,82.964073


In [4]:
# Showing basic data information
reg1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


No missing data!

In [5]:
# Checking for duplicated rows
dup = reg1.duplicated()
print(f"There are {dup.sum()} duplicated rows in the data frame")

There are 0 duplicated rows in the data frame
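`duplicated()` with no arguments compares whole rows, so it would not flag two wells that share an `id` but differ in their features. A minimal sketch of checking the identifier column on its own, using a made-up toy frame (the values and the name `toy` are illustrative, not taken from the project data):

```python
import pandas as pd

# Toy frame standing in for a region's data (values are made up for illustration).
toy = pd.DataFrame({
    "id": ["a1", "b2", "a1", "c3"],
    "f0": [0.1, 0.2, 0.3, 0.4],
    "product": [10.0, 20.0, 30.0, 40.0],
})

# Whole-row comparison: no two rows are identical in every column.
full_row_dups = int(toy.duplicated().sum())

# Comparison on the id column only: the second "a1" is flagged.
id_dups = int(toy["id"].duplicated().sum())

print(f"Fully duplicated rows: {full_row_dups}")
print(f"Repeated ids: {id_dups}")
```

The same `df["id"].duplicated().sum()` call could be run on each region before the `id` column is dropped.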


* The `id` column is not numerical and therefore not fit for our model. I could encode it, but that would be pointless because the column carries no information for the model.

* I'll just get rid of it.

In [6]:
# Making a copy of the data set.
reg1_raw = reg1.copy()

# Getting rid of id column.
reg1 = reg1.drop('id', axis = 1)
reg1.head(2)

Unnamed: 0,f0,f1,f2,product
0,0.705745,-0.497823,1.22117,105.280062
1,1.334711,-0.340164,4.36508,73.03775


In [7]:
reg1.describe()

Unnamed: 0,f0,f1,f2,product
count,100000.0,100000.0,100000.0,100000.0
mean,0.500419,0.250143,2.502647,92.5
std,0.871832,0.504433,3.248248,44.288691
min,-1.408605,-0.848218,-12.088328,0.0
25%,-0.07258,-0.200881,0.287748,56.497507
50%,0.50236,0.250252,2.515969,91.849972
75%,1.073581,0.700646,4.715088,128.564089
max,2.362331,1.343769,16.00379,185.364347


### Second region data inspection:

In [8]:
# Showing data samples
reg2.sample(5)

Unnamed: 0,id,f0,f1,f2,product
29620,GUtpo,-2.876635,-6.823078,3.002195,84.038886
49835,VrZkw,-6.623307,-1.240629,1.000324,30.132364
88766,rMiAS,15.511561,-0.937813,-0.010283,0.0
7869,Hd3HU,0.886667,-7.015408,4.009139,110.992147
56154,JWI90,11.721453,6.411519,2.003406,53.906522


In [9]:
# Showing basic data information
reg2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


No missing data!

In [10]:
# Checking for duplicated rows
dup = reg2.duplicated()
print(f"There are {dup.sum()} duplicated rows in the data frame")

There are 0 duplicated rows in the data frame


* The `id` column is not numerical and therefore not fit for our model. I could encode it, but that would be pointless because the column carries no information for the model.

* I'll just get rid of it.

In [11]:
# Making a copy of the data set.
reg2_raw = reg2.copy()

# Getting rid of id column.
reg2 = reg2.drop('id', axis = 1)
reg2.head(2)

Unnamed: 0,f0,f1,f2,product
0,-15.001348,-8.276,-0.005876,3.179103
1,14.272088,-3.475083,0.999183,26.953261


### Third region data inspection:

In [12]:
# Showing data samples
reg3.sample(5)

Unnamed: 0,id,f0,f1,f2,product
53687,KMR24,-2.276841,-0.790219,-0.474706,75.996201
21202,M2kLR,2.287307,1.59909,4.565334,160.05655
71506,NDWM5,-2.720173,2.092016,0.554109,152.713995
68223,kqCK2,0.273945,-0.120453,5.823647,54.600286
91036,TDQtJ,-0.340096,1.162068,12.00414,155.182628


In [13]:
# Showing basic data information
reg3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


No missing data!

In [14]:
# Checking for duplicated rows
dup = reg3.duplicated()
print(f"There are {dup.sum()} duplicated rows in the data frame")

There are 0 duplicated rows in the data frame


In [15]:
# Making a copy of the data set.
reg3_raw = reg3.copy()

# Getting rid of id column.
reg3 = reg3.drop('id', axis = 1)
reg3.head(2)

Unnamed: 0,f0,f1,f2,product
0,-1.146987,0.963328,-0.828965,27.758673
1,0.262778,0.269839,-2.530187,56.069697


In [16]:
reg3.describe()

Unnamed: 0,f0,f1,f2,product
count,100000.0,100000.0,100000.0,100000.0
mean,0.002023,-0.002081,2.495128,95.0
std,1.732045,1.730417,3.473445,44.749921
min,-8.760004,-7.08402,-11.970335,0.0
25%,-1.162288,-1.17482,0.130359,59.450441
50%,0.009424,-0.009482,2.484236,94.925613
75%,1.158535,1.163678,4.858794,130.595027
max,7.238262,7.844801,16.739402,190.029838


## Summary:
* No missing data.
* No duplicated rows.
* Got rid of the `id` columns.


The data is ready for training.
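The same three checks were repeated by hand for each region. As a rough sketch, they could be gathered in one helper and run in a loop. The function name `inspect` and the toy frames below are hypothetical and only stand in for `reg1`, `reg2`, and `reg3`:

```python
import pandas as pd

def inspect(df):
    """Return row count, number of missing values, and number of duplicated rows."""
    return {
        "rows": len(df),
        "missing": int(df.isna().sum().sum()),
        "duplicated_rows": int(df.duplicated().sum()),
    }

# Toy stand-ins for the regional data frames (made-up values).
toy_regions = {
    "first": pd.DataFrame({"id": ["a", "b"], "f0": [0.1, 0.2], "product": [1.0, 2.0]}),
    "second": pd.DataFrame({"id": ["c", "c"], "f0": [0.3, 0.3], "product": [3.0, 3.0]}),
}

# One row per region, one column per check.
report = pd.DataFrame({name: inspect(df) for name, df in toy_regions.items()}).T
print(report)
```

With the real data, the dictionary would map region names to the loaded frames.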

# Train and test the model for each region:

## Models training

In [17]:
# Function to split the data
def split_data(region):
    # Splitting the data into features and target
    features = region.drop('product', axis=1)
    target = region['product']
    # Splitting the data into training set and validation set:
    features_train, features_valid, target_train, target_valid = train_test_split(
        features, target, test_size=0.25, random_state=12345)
    
    return features, features_train, features_valid, target_train, target_valid

In [18]:
# Splitting the first region data into training and validation features and target
features_1, features_train_1, features_valid_1, target_train_1, target_valid_1 =\
split_data(reg1)

# Splitting the second region data into training and validation features and target
features_2, features_train_2, features_valid_2, target_train_2, target_valid_2 =\
split_data(reg2)

# Splitting the third region data into training and validation features and target
features_3, features_train_3, features_valid_3, target_train_3, target_valid_3 =\
split_data(reg3)

In [19]:
# Lists for loops
features = [features_1, features_2, features_3]
X_trains = [features_train_1, features_train_2, features_train_3]
y_trains = [target_train_1, target_train_2, target_train_3]
X_vals = [features_valid_1, features_valid_2, features_valid_3]
y_vals = [target_valid_1, target_valid_2, target_valid_3]
regions = ['First region', 'Second region', 'Third region']

In [20]:
# Function to check data proportions
def proportions(region, features_train,features_valid, features):
    print(region + " Proportions out of original data")
    print('----------------------------------')
    print(f" TRAINING set:     {features_train.shape[0]/features.shape[0]:.0%}")
    print(f" VALIDATION set:   {features_valid.shape[0]/features.shape[0]:.0%}\n")

In [21]:
# Checking data proportions
for reg, X_train, X_val, X in zip(regions, X_trains, X_vals, features):
    proportions(reg, X_train, X_val, X)

First region Proportions out of original data
----------------------------------
 TRAINING set:     75%
 VALIDATION set:   25%

Second region Proportions out of original data
----------------------------------
 TRAINING set:     75%
 VALIDATION set:   25%

Third region Proportions out of original data
----------------------------------
 TRAINING set:     75%
 VALIDATION set:   25%



In [22]:
# Function to train and validate
def train_pred(region, features_train, target_train, features_valid, target_valid):

    # Training the model
    model = LinearRegression()
    model.fit(features_train, target_train)

    # Evaluating the model on the validation set
    r2_reg = model.score(features_valid, target_valid)
    predicted_reg = model.predict(features_valid)
    # RMSE as the square root of MSE (the `squared` argument is not available in newer scikit-learn versions)
    RMSE_reg = np.sqrt(mean_squared_error(target_valid, predicted_reg))
    
    return r2_reg, RMSE_reg, predicted_reg

In [23]:
# Function for printing results
def print_results(region, r2_reg, RMSE_reg, predicted_reg, target_valid):    
    print(region + ' model')
    print('------------------')
    print(f"R2 = {r2_reg:.4f}")
    print(f"RMSE = {RMSE_reg:.3f}")
    print(f"Average predicted volume: {predicted_reg.mean():.3f}")
    print(f"Average real volume: {target_valid.mean():.3f}\n")

In [24]:
# First region training
r2_reg_1, RMSE_reg_1, predicted_reg1 = train_pred\
(reg1, features_train_1, target_train_1, features_valid_1, target_valid_1 ) 

# Second region training
r2_reg_2, RMSE_reg_2, predicted_reg2 = train_pred\
(reg2, features_train_2, target_train_2, features_valid_2, target_valid_2 )

# Third region training
r2_reg_3, RMSE_reg_3, predicted_reg3 = train_pred\
(reg3, features_train_3, target_train_3, features_valid_3, target_valid_3 )

In [25]:
# lists for loops
r2s = [r2_reg_1, r2_reg_2, r2_reg_3]
rmses = [RMSE_reg_1, RMSE_reg_2, RMSE_reg_3]
predicteds = [predicted_reg1, predicted_reg2, predicted_reg3]

In [26]:
# printing results
for region, r2_reg, RMSE_reg, predicted_reg, target_valid in zip(regions, r2s, rmses, predicteds, y_vals):
    print_results(region, r2_reg, RMSE_reg, predicted_reg, target_valid)

First region model
------------------
R2 = 0.2799
RMSE = 37.579
Average predicted volume: 92.593
Average real volume: 92.079

Second region model
------------------
R2 = 0.9996
RMSE = 0.893
Average predicted volume: 68.729
Average real volume: 68.723

Third region model
------------------
R2 = 0.2052
RMSE = 40.030
Average predicted volume: 94.965
Average real volume: 94.884



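## Conclusions:

* The most accurate model is the model of the second region, with a very high R2 (0.9996) and a very low RMSE (0.893). The models of the first and third regions are much weaker (R2 of 0.2799 and 0.2052, RMSE of 37.579 and 40.030).


* The richest (in oil) region is the third one, with an average of about 94.9 thousand barrels per well, and the poorest is the second, with an average of about 68.7 thousand barrels per well.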
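R2 and RMSE are two views of the same error: on a given set, R2 = 1 - MSE / Var(target), so a model that always predicts the mean has R2 = 0 and an RMSE equal to the target's standard deviation. The sketch below illustrates this on synthetic data (not the project data) with a plain NumPy least-squares fit. All names in it are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (not the project data): one informative feature plus noise.
x = rng.normal(size=1000)
y = 3.0 * x + rng.normal(scale=2.0, size=1000)

# Least-squares line fitted with NumPy.
slope, intercept = np.polyfit(x, y, deg=1)
pred = slope * x + intercept

mse = np.mean((y - pred) ** 2)
rmse = np.sqrt(mse)
r2 = 1.0 - mse / np.var(y)

# Baseline: always predict the mean. Its RMSE equals the standard deviation of y.
baseline_rmse = np.sqrt(np.mean((y - y.mean()) ** 2))

print(f"RMSE = {rmse:.3f}, baseline RMSE = {baseline_rmse:.3f}, R2 = {r2:.3f}")
```

Read this way, the first region's RMSE of 37.579 is only modestly better than the spread of `product` reported by `describe()` (a standard deviation of about 44.3), which is roughly what an R2 near 0.28 implies. The comparison is approximate because that standard deviation is computed over the whole region rather than the validation set.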


# Prepare for profit calculation:

## Store all key values for calculations in separate variables.

In [27]:
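budget = 100000000             # total budget for developing the wells in a region, dollars
wells = 200                    # number of wells developed with that budget
cost_per_well = budget/wells   # dollars per well
unit_revenue = 4500            # revenue per unit of product (one unit = thousand barrels), dollars
print(f"Cost per well = {cost_per_well:.0f}")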

Cost per well = 500000


## Calculate the volume of reserves sufficient for developing a new well without losses. Compare the obtained value with the average volume of reserves in each region.

In [28]:
min_reserves_for_well = cost_per_well / unit_revenue
print(f"Minimum reserves for well without losing money (in thousand barrels): {min_reserves_for_well:.3f}")

Minimum reserves for well without losing money (in thousand barrels): 111.111


In [29]:
print('Difference between average reserves and minimum reserves')
print('--------------------------------------------------------\n')

print(f"First Region: {target_valid_1.mean() - min_reserves_for_well:.3f}")
print(f"Second Region: {target_valid_2.mean() - min_reserves_for_well:.3f}")
print(f"Third Region: {target_valid_3.mean() - min_reserves_for_well:.3f}")

Difference between average reserves and minimum reserves
--------------------------------------------------------

First Region: -19.033
Second Region: -42.388
Third Region: -16.227
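The break-even figure can also be read at the level of the whole project: the 200 wells together must deliver enough product for revenue to cover the full budget. A small sketch of that arithmetic, using the constants defined above (the helper name `break_even_volume` is hypothetical):

```python
def break_even_volume(budget, wells, unit_revenue):
    """Volume (in product units) one well needs so that its revenue covers its share of the budget."""
    return (budget / wells) / unit_revenue

budget = 100_000_000   # dollars
wells = 200
unit_revenue = 4500    # dollars per unit of product (thousand barrels)

per_well = break_even_volume(budget, wells, unit_revenue)
total_needed = budget / unit_revenue   # combined volume required from all wells

print(f"Per well: {per_well:.3f} thousand barrels")
print(f"All {wells} wells together: {total_needed:.3f} thousand barrels")
```

Individual wells may fall below the per-well figure as long as the selected set as a whole clears the combined volume.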


## Conclusions:

* In all the regions the average reserves are below the 111.111 thousand barrels a new well needs to break even.
* For that reason we need to choose the richest oil wells.
* As mentioned before, the richest region is the third one and the poorest is the second.

# Write a function to calculate profit from a set of selected oil wells and model predictions:

## Pick the wells with the highest values of predictions.

In [30]:
#A function for creating predictions and target data frames
def tar_pred(target, predictions):
    df = pd.DataFrame({'target': target,'predictions':predictions})
    return df

In [31]:
# Creating predictions and target data frames

tar_and_pred1 = tar_pred(target_valid_1,predicted_reg1)
tar_and_pred2 = tar_pred(target_valid_2,predicted_reg2)
tar_and_pred3 = tar_pred(target_valid_3,predicted_reg3)

In [32]:
# Profit function

def revenue(df, count):
    
    # Selecting the `count` wells with the highest predictions
    selected = df.sort_values(by='predictions', ascending=False).head(count)
    selected = selected['target']
    # Calculating revenue and profit from the actual (target) volumes
    total_revenue = unit_revenue * selected.sum()
    profit = total_revenue - budget
    
    return selected, profit

In [33]:
# First region
selected1, profit1  = revenue(tar_and_pred1, 200)

In [34]:
# Second region
selected2, profit2= revenue(tar_and_pred2, 200)

In [35]:
# Third region
selected3, profit3= revenue(tar_and_pred3, 200)
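The key point of the profit calculation is that wells are ranked by predicted volume but paid for by actual volume. A toy sketch with five made-up wells and a made-up budget shows the effect. The function `profit_of_top` and all values below are hypothetical:

```python
import pandas as pd

def profit_of_top(df, count, unit_revenue, budget):
    """Pick `count` wells by prediction, then compute profit from their actual volumes."""
    chosen = df.sort_values(by="predictions", ascending=False).head(count)
    return unit_revenue * chosen["target"].sum() - budget

# Five made-up wells: the well with the largest actual volume (200) has a modest prediction (110).
toy = pd.DataFrame({
    "target":      [100.0, 150.0, 50.0, 200.0, 120.0],
    "predictions": [90.0, 160.0, 60.0, 110.0, 170.0],
})

# The two highest predictions are 170 and 160, whose actual volumes are 120 and 150.
profit = profit_of_top(toy, count=2, unit_revenue=4500, budget=1_000_000)
print(f"Toy profit: {profit:.0f}")
```

The richest well is missed because its prediction was low, which is why model accuracy matters for the realized profit.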

### First region

In [36]:
# Selected wells
selected1

93073    162.810993
46784    153.639837
78948    162.153488
43388     96.893581
6496     178.879516
            ...    
30488    179.683422
98799     95.396917
53840    160.361464
4638     102.186603
75908    119.890261
Name: target, Length: 200, dtype: float64

In [37]:
# Calculating the proportion of unprofitable wells in the selected 200
print(f"The proportions of unprofitable wells out of the selected wells is:\n{selected1[selected1<=min_reserves_for_well].count()/selected1.shape[0]:.1%}")

The proportions of unprofitable wells out of the selected wells is:
9.5%


### Second region

In [38]:
# Selected wells
selected2

38665    137.945408
20191    137.945408
14041    137.945408
24274    137.945408
92782    137.945408
            ...    
13370    137.945408
45823    137.945408
86987    137.945408
72313    137.945408
59892    137.945408
Name: target, Length: 200, dtype: float64

In [39]:
# Calculating the proportion of unprofitable wells in the selected 200
print(f"The proportions of unprofitable wells out of the selected wells is:\n{selected2[selected2<=min_reserves_for_well].count()/selected2.shape[0]:.1%}")

The proportions of unprofitable wells out of the selected wells is:
0.0%


### Third region

In [40]:
# Selected wells
selected3

98619    175.103291
46649    131.627481
82661    141.160070
53151    159.676082
18747    142.135203
            ...    
66244    104.949568
34285     89.492500
36778    184.895101
7806     137.480469
62558    134.507140
Name: target, Length: 200, dtype: float64

In [41]:
# Calculating the proportion of unprofitable wells in the selected 200
print(f"The proportions of unprofitable wells out of the selected wells is:\n{selected3[selected3<=min_reserves_for_well].count()/selected3.shape[0]:.1%}")

The proportions of unprofitable wells out of the selected wells is:
14.0%
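The share of wells at or below the break-even volume can be written more compactly as the mean of a boolean mask. A short sketch on made-up volumes (the Series `volumes` is illustrative, not one of the selections above):

```python
import pandas as pd

# Made-up volumes standing in for a selection of wells.
volumes = pd.Series([162.8, 96.9, 178.9, 102.2, 119.9])
threshold = 111.111

# Count-based form, as used in the cells above.
share_count = volumes[volumes <= threshold].count() / volumes.shape[0]

# Equivalent form: the mean of a boolean mask is the fraction of True values.
share_mean = (volumes <= threshold).mean()

print(f"{share_count:.1%} vs {share_mean:.1%}")
```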


## Summarize the target volume of reserves in accordance with these predictions

* The target (break-even) volume is 111.111 thousand barrels per well.


* In the first region, 9.5% of the 200 selected wells are below the target volume.


* In the second region, none of the 200 selected wells is below the target volume.


* In the third region, 14% of the 200 selected wells are below the target volume.


* The 200 wells with the highest predictions yield a profit in every region!

## Provide findings: suggest a region for oil wells' development and justify the choice. Calculate the profit for the obtained volume of reserves.

In [42]:
# First region profit
print(f"First region profit: {profit1:.2f}, Dollars")

First region profit: 33208260.43, Dollars


In [43]:
# Second region profit
print(f"Second region profit: {profit2:.2f}, Dollars")

Second region profit: 24150866.97, Dollars


In [44]:
# Third region profit
print(f"Third region profit: {profit3:.2f}, Dollars")

Third region profit: 27103499.64, Dollars


## Conclusions:

* The most profitable is the first region, but its model is far less accurate than the second region's (R2 of 0.2799 versus 0.9996); only the third region's model is weaker.
* The least profitable is the second region, which has the most accurate model.

# Calculate risks and profit for each region:


## Use the bootstrapping technique with 1000 samples to find the distribution of profit.

In [45]:
def bootstrap(df):
    state = np.random.RandomState(12345)
    values =[]
    
    for i in range(1000):
        subsample = df.sample(n=500, replace=True, random_state=state).reset_index(drop= True)
        selected, profit = revenue(subsample,200)
        values.append(profit)
    
    values = pd.Series(values)

    return values

In [46]:
values1 = bootstrap(tar_and_pred1)

In [47]:
values2 = bootstrap(tar_and_pred2)

In [48]:
values3 = bootstrap(tar_and_pred3)
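For readers who want to see the resampling loop in isolation, here is a minimal NumPy sketch of the same idea on synthetic data: draw 500 wells with replacement, keep the 200 with the highest predictions, and record the profit from their actual volumes. The synthetic distributions and the name `bootstrap_profits` are assumptions made for illustration, so the numbers it produces say nothing about the real regions.

```python
import numpy as np

rng = np.random.default_rng(12345)

# Synthetic "validation set" (made-up distributions, not the project data).
n_wells = 5000
target = rng.gamma(shape=4.0, scale=25.0, size=n_wells)
predictions = target + rng.normal(scale=30.0, size=n_wells)

UNIT_REVENUE = 4500        # dollars per thousand barrels
BUDGET = 100_000_000       # dollars

def bootstrap_profits(target, predictions, n_iter=1000, sample_size=500, top=200):
    """Profit of the `top` best-predicted wells in each of `n_iter` resamples."""
    profits = np.empty(n_iter)
    for i in range(n_iter):
        # Positions drawn with replacement, so the same well may appear more than once.
        idx = rng.integers(0, len(target), size=sample_size)
        # Positions (within the subsample) of the `top` highest predictions.
        best = np.argsort(predictions[idx])[-top:]
        profits[i] = UNIT_REVENUE * target[idx][best].sum() - BUDGET
    return profits

profits = bootstrap_profits(target, predictions)
print(f"{len(profits)} simulated profits, mean = {profits.mean():.2f}")
```

Because the sketch indexes by position, repeated wells in a subsample are kept as separate draws, which matches sampling with replacement.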

## Find average profit, 95% confidence interval and risk of losses. Loss is negative profit, calculate it as a probability and then express as a percentage.

### First region

In [49]:
print(f"Region 1 average profit is: {values1.mean():.2f} Dollars")

Region 1 average profit is: 3961649.85 Dollars


In [50]:
lower1 = values1.quantile(0.025)
upper1 = values1.quantile(0.975)
print(f"First region's confidence_interval of 95%: {lower1:.2f} - {upper1:.2f}\n")
print(f"Margin error of: +- {(upper1 - lower1) /2:.2f}")

First region's confidence_interval of 95%: -1112155.46 - 9097669.42

Margin error of: +- 5104912.44


In [51]:
losess1 = values1[values1<0]
losess1
print(f"First region's risk of losses:  {len(losess1)/ len(values1):.1%}")

First region's risk of losses:  6.9%
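The three statistics reported per region (average profit, the 2.5% and 97.5% quantiles of the bootstrap distribution, and the share of negative profits) could be gathered in a single helper. A sketch under that assumption; the name `summarize` and the toy array are hypothetical:

```python
import numpy as np

def summarize(profits):
    """Mean, 95% percentile interval, and share of negative values of a profit sample."""
    profits = np.asarray(profits, dtype=float)
    lower, upper = np.quantile(profits, [0.025, 0.975])
    return {
        "mean": float(profits.mean()),
        "ci_lower": float(lower),
        "ci_upper": float(upper),
        "risk_of_loss": float((profits < 0).mean()),
    }

# Toy profits (made-up values): two of the ten are losses.
toy_profits = np.array([-2.0, -1.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
summary = summarize(toy_profits)
print(summary)
```

Calling it on `values1`, `values2`, and `values3` would produce the same quantities as the cells in this section, one dictionary per region.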


### Second region

In [52]:
print(f"Region 2 average profit is: {values2.mean():.2f} Dollars")

Region 2 average profit is: 4560451.06 Dollars


In [53]:
lower2 = values2.quantile(0.025)
upper2 = values2.quantile(0.975)
print(f"Second region's confidence_interval of 95%: {lower2:.2f} - {upper2:.2f}")
print(f"Margin error of: +- {(upper2 - lower2) /2:.2f}")

Second region's confidence_interval of 95%: 338205.09 - 8522894.54
Margin error of: +- 4092344.72


In [54]:
losess2 = values2[values2<0]
losess2
print(f"Second region's risk of losses:  {len(losess2)/ len(values2):.1%}")

Second region's risk of losses:  1.5%


### Third region

In [55]:
print(f"Region 3 average profit is: {values3.mean():.2f} Dollars")

Region 3 average profit is: 4044038.67 Dollars


In [56]:
lower3 = values3.quantile(0.025)
upper3 = values3.quantile(0.975)
print(f"Third region's confidence_interval of 95%: {lower3:.2f} - {upper3:.2f}")
print(f"Margin error of: +- {(upper3 - lower3) /2:.2f}")

Third region's confidence_interval of 95%: -1633504.13 - 9503595.75
Margin error of: +- 5568549.94


In [57]:
losess3 = values3[values3<0]
losess3
print(f"Third region's risk of losses:  {len(losess3)/ len(values3):.1%}")

Third region's risk of losses:  7.6%


# Overall conclusion:

After training a linear regression model to predict the volume of reserves in oil wells,
I estimated the possible profit in every region with bootstrapping: I drew 1000 samples (n=500) from the validation data, selected the 200 wells with the highest predictions in each sample, and calculated the average profit across the samples for each region.


* The best region is the `Second region`: it has the highest average profit (about 4.56 million dollars), the lowest risk of losses (1.5%), and the most accurate predictions. I suggest that region for oil well development!


