# PROJECT

# Selecting the most promising region for oil production

**Task description:**

The customer, the oil producer GlavRosGosNeft, needs to expand production in one of three existing regions and wants an objective justification for choosing the region where new wells will be drilled.

Data on existing sites is available for three regions: each region has 100,000 samples for which the quality of the oil and the volume of its reserves were measured.

**Objective:**

Build a machine learning model (a regression task, supervised learning) that will help determine the region where production will bring the greatest profit.

**Project execution process:**

The project will be completed in 4 steps:

- loading, reviewing and preparing the source data;
- preparing samples for the model, training the model and evaluating the results;
- estimating the minimum production volume at which production becomes profitable;
- calculating profits and risks, and drawing a conclusion about the most promising region.

## Loading and preparing data

In [None]:
import numpy as np
import pandas as pd
from math import sqrt

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error as mse
from sklearn.model_selection import train_test_split

We will load the data into one dataframe with the addition of the "region" column.

In [None]:
data_reg1 = pd.read_csv('/datasets/geo_data_0.csv')
data_reg2 = pd.read_csv('/datasets/geo_data_1.csv')
data_reg3 = pd.read_csv('/datasets/geo_data_2.csv')

# Tag each frame with its region number before concatenating
data_reg1['region'] = 1
data_reg2['region'] = 2
data_reg3['region'] = 3

data = pd.concat([data_reg1, data_reg2, data_reg3])

Let's create a function to review the source data.

In [None]:
def info_df(df):
    # Print an overview of each region's subset of df
    for i in range(1, 4):
        data_i = df.loc[df['region'] == i]
        print('DataFrame information for Region ', i, '\n')
        print('General information:', '\n')
        print(data_i.info(), '\n')
        print('Number of fully duplicated rows ', data_i.duplicated().sum(), '\n')
        print('First rows:', '\n', data_i.head(), '\n')
        print('Descriptive statistics:', '\n', data_i.describe(), '\n')
        # Correlations are computed on numeric columns only (the id column is a string)
        print('Correlation coefficients between columns:', '\n',
              data_i.select_dtypes('number').corr(), '\n')
        print('Number of unique values in the column with index:', '\n')
        for c in range(0, 5):
            data_i_c = data_i.iloc[:, c]
            print(c, ' = ', data_i_c.nunique())
        print()

In [None]:
info_df(data)

DataFrame information for Region  1

General information:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 0 to 99999
Data columns (total 6 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
 5   region   100000 non-null  int64  
dtypes: float64(4), int64(1), object(1)
memory usage: 5.3+ MB
None 

Number of fully duplicated rows  0

First rows:
       id        f0        f1        f2     product  region
0  txEyH  0.705745 -0.497823  1.221170  105.280062       1
1  2acmU  1.334711 -0.340164  4.365080   73.037750       1
2  409Wp  1.022732  0.151990  1.419926   85.265647       1
3  iJLyR -0.032172  0.139033  2.978566  168.620776       1
4  Xdl7t  1.988431  0.155413  4.751769  154.036647       1 

Descriptive statistics:

The data contains no fully duplicated rows and no missing values. Judging by the output of describe, there are no significant outliers either.

However, a small number of values in the id column are repeated. To keep the data clean, we exclude the rows with repeated ids from further analysis.

In [None]:
data = data.drop_duplicates(subset=['id'])
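
Before dropping rows, it can be useful to look at which ids are repeated and how many there are. The following is a minimal sketch on a made-up toy frame (the ids and values are invented for illustration, not taken from the project data):

```python
import pandas as pd

# Toy frame; the ids and values are made up for illustration.
toy = pd.DataFrame({
    "id": ["a1", "b2", "a1", "c3", "b2"],
    "product": [10.0, 20.0, 11.0, 30.0, 21.0],
})

# keep=False marks every row whose id occurs more than once.
repeated = toy[toy.duplicated(subset=["id"], keep=False)].sort_values("id")
n_repeated_ids = repeated["id"].nunique()

# drop_duplicates keeps the first occurrence of each id by default.
deduplicated = toy.drop_duplicates(subset=["id"])

print(repeated)
print("repeated ids:", n_repeated_ids, "rows kept:", len(deduplicated))
```

Note that `drop_duplicates` keeps the first row for each id, so the surviving row depends on the row order of the frame.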

In [None]:
info_df(data)

DataFrame information for Region  1

General information:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 99990 entries, 0 to 99999
Data columns (total 6 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   id       99990 non-null  object 
 1   f0       99990 non-null  float64
 2   f1       99990 non-null  float64
 3   f2       99990 non-null  float64
 4   product  99990 non-null  float64
 5   region   99990 non-null  int64  
dtypes: float64(4), int64(1), object(1)
memory usage: 5.3+ MB
None 

Number of fully duplicated rows  0

First rows:
       id        f0        f1        f2     product  region
0  txEyH  0.705745 -0.497823  1.221170  105.280062       1
1  2acmU  1.334711 -0.340164  4.365080   73.037750       1
2  409Wp  1.022732  0.151990  1.419926   85.265647       1
3  iJLyR -0.032172  0.139033  2.978566  168.620776       1
4  Xdl7t  1.988431  0.155413  4.751769  154.036647       1 

Descriptive statistics:

After preparing the data, we note the following:

- the f2 feature is positively correlated with the volume of reserves in every region, and in Region 2 the correlation is strong;
- in Region 2, the “product” column contains only 12 unique values (compared to almost 100 thousand unique values in Regions 1 and 3).
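
Both observations can be checked per region with a short loop. The sketch below uses synthetic data as a stand-in for the real files (the distributions, the coefficient 27 and the name `summary` are assumptions made for illustration); only the column names `region`, `f2` and `product` come from the project.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Synthetic stand-in: region 1 has a noisy continuous target,
# region 2 a target that takes a few distinct values driven by f2.
f2_a = rng.normal(size=n)
f2_b = rng.integers(0, 6, size=n).astype(float)
toy = pd.DataFrame({
    "region": [1] * n + [2] * n,
    "f2": np.concatenate([f2_a, f2_b]),
    "product": np.concatenate([
        50 + 5 * f2_a + rng.normal(scale=20, size=n),
        27 * f2_b,
    ]),
})

# Collect the number of unique targets and the f2/product correlation per region.
rows = {}
for region, part in toy.groupby("region"):
    rows[region] = {
        "unique_product": part["product"].nunique(),
        "corr_f2_product": part["f2"].corr(part["product"]),
    }
summary = pd.DataFrame.from_dict(rows, orient="index")
print(summary)
```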

## Model training and validation

In [None]:
target = data[['product', 'region']]
features = data.drop(['product', 'id'], axis=1)

We have built the feature table and the target table.

Let's split the samples into training and validation sets in a ratio of 75 to 25.

As a check, we print the shapes of the resulting samples.
The per-region variables are created dynamically inside the loop via `locals()`, which at the top level of a notebook is the global namespace.

In [None]:
for i in range(1,4):
    locals()[f"features_{i}"] = features.loc[features['region']==i]
    locals()[f"features_{i}"] = locals()[f"features_{i}"].drop(['region'], axis=1)
    locals()[f"target_{i}"] = target.loc[target['region']==i]
    locals()[f"target_{i}"] = locals()[f"target_{i}"].drop(['region'], axis=1)

    locals()[f"features_train_{i}"], locals()[f"features_valid_{i}"], locals()[f"target_train_{i}"], \
    locals()[f"target_valid_{i}"] = train_test_split(
    locals()[f"features_{i}"], locals()[f"target_{i}"], train_size=0.75, random_state=12345)

    print(locals()[f"features_train_{i}"].shape, '\n', locals()[f"features_valid_{i}"].shape, '\n', \
          locals()[f"target_train_{i}"].shape, '\n', locals()[f"target_valid_{i}"].shape)

(74992, 3) 
 (24998, 3) 
 (74992, 1) 
 (24998, 1)
(74989, 3) 
 (24997, 3) 
 (74989, 1) 
 (24997, 1)
(74981, 3) 
 (24994, 3) 
 (74981, 1) 
 (24994, 1)
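
An alternative to creating variables dynamically is to keep the per-region samples in a dictionary keyed by region number. The following is a minimal sketch on synthetic data, not the notebook's actual code: the name `splits` and the toy frame are assumptions, and `DataFrame.sample` stands in for `train_test_split` to keep the example dependent on pandas and NumPy only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(12345)

# Synthetic stand-in for the combined frame: 20 rows for each of two regions.
toy = pd.DataFrame({
    "f0": rng.normal(size=40),
    "f1": rng.normal(size=40),
    "f2": rng.normal(size=40),
    "product": rng.uniform(0, 150, size=40),
    "region": np.repeat([1, 2], 20),
})

splits = {}
for region, part in toy.groupby("region"):
    part = part.drop(columns="region")
    # 75% of the rows go to training, the remaining 25% to validation.
    train = part.sample(frac=0.75, random_state=12345)
    valid = part.drop(train.index)
    splits[region] = {
        "features_train": train.drop(columns="product"),
        "target_train": train["product"],
        "features_valid": valid.drop(columns="product"),
        "target_valid": valid["product"],
    }

for region, s in splits.items():
    print(region, s["features_train"].shape, s["features_valid"].shape)
```

With this layout, later steps can iterate over `splits.items()` instead of rebuilding variable names from strings.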


Let's train a linear regression model for each region, then calculate the reserve volumes predicted by the model and the root mean square error (RMSE).
All variables are again created directly in the loop.

In [None]:
target_and_pred = pd.DataFrame()

for i in range(1,4):
    # Fit one model per region and predict on that region's validation set
    locals()[f"model_{i}"] = LinearRegression().fit(locals()[f"features_train_{i}"], locals()[f"target_train_{i}"])
    locals()[f"predictions_{i}"] = locals()[f"model_{i}"].predict(locals()[f"features_valid_{i}"])
    locals()[f"RMSE_{i}"] = sqrt(mse(locals()[f"target_valid_{i}"], locals()[f"predictions_{i}"]))
    print('Region', i, '\n')
    print(' Predictions: ', locals()[f"predictions_{i}"].mean().round(4), '\n',
          'Target: ', locals()[f"target_valid_{i}"]['product'].mean().round(4), '\n',
          'RMSE: ', round(locals()[f"RMSE_{i}"],4), '\n')

    # Collect actual and predicted values of all regions in one table
    locals()[f"target_and_pred_{i}"] = pd.DataFrame()
    locals()[f"target_and_pred_{i}"]['target'] = locals()[f"target_valid_{i}"]['product']
    locals()[f"target_and_pred_{i}"]['predictions'] = locals()[f"predictions_{i}"]
    locals()[f"target_and_pred_{i}"]['region'] = i
    target_and_pred = pd.concat([locals()[f"target_and_pred_{i}"], target_and_pred])

print(target_and_pred.info())

Region 1 

 Predictions:  92.7892 
 Target:  92.1582 
 RMSE:  37.8535 

Region 2 

 Predictions:  68.8852 
 Target:  68.8863 
 RMSE:  0.8903 

Region 3 

 Predictions:  95.0967 
 Target:  94.8519 
 RMSE:  39.9442 

<class 'pandas.core.frame.DataFrame'>
Int64Index: 74989 entries, 20482 to 17774
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   target       74989 non-null  float64
 1   predictions  74989 non-null  float64
 2   region       74989 non-null  int64  
dtypes: float64(2), int64(1)
memory usage: 2.3 MB
None


**CONCLUSIONS for the section:**

The mean reserve volume is largest in Region 3, with Region 1 slightly behind; in Region 2 it is considerably lower. At the same time, the model for Region 2 has by far the smallest RMSE: about 0.89, or roughly 1% of the mean target, compared with 37.85 and 39.94 (roughly 41-42% of the mean) for Regions 1 and 3. This can be explained both by the strong correlation between f2 and product noted above and by the fact that product takes only 12 unique values in Region 2.
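
To put the RMSE values on a common scale, one option is to express each as a share of the region's mean validation target. A quick arithmetic sketch using the figures printed above (the dictionary names are introduced here for illustration):

```python
# Validation-set means and RMSE values copied from the printed results.
reported = {
    1: {"mean_target": 92.1582, "rmse": 37.8535},
    2: {"mean_target": 68.8863, "rmse": 0.8903},
    3: {"mean_target": 94.8519, "rmse": 39.9442},
}

# RMSE as a percentage of the mean target for each region.
relative_rmse = {
    region: 100 * values["rmse"] / values["mean_target"]
    for region, values in reported.items()
}

for region, pct in relative_rmse.items():
    print(f"Region {region}: RMSE is {pct:.1f}% of the mean target")
```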

## Preparation for profit calculation

Let's fix all the initial parameters needed to calculate the profit from production: the development budget for one region, the income per unit of product, the number of wells examined during exploration, and the number of best wells selected for development.

We then calculate the minimum production volume per well at which, given these parameters, production breaks even.

In [None]:
POINTS_IN_REGION = 500
BEST_POINTS = 200
REGION_BUDGET = 10e9
PRODUCT_PROFIT = 450_000

print('Break-even volume per well: ', round((REGION_BUDGET / PRODUCT_PROFIT / BEST_POINTS),4))

Break-even volume per well:  111.1111


The break-even volume is about 111.11 per well, which is higher than the mean reserve volume found in the previous section for every region, and far higher than the mean for Region 2.

Thus, the choice of region cannot rely on regional averages. It is necessary to select the best wells in each region and evaluate those.
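
The break-even figure follows from dividing the budget by the income per unit of product and by the number of wells developed. The sketch below repeats that arithmetic and compares it with the mean validation targets printed in the previous section; the `shortfall` dictionary is a name introduced here for illustration.

```python
REGION_BUDGET = 10e9      # development budget for one region
PRODUCT_PROFIT = 450_000  # income per unit of product
BEST_POINTS = 200         # wells selected for development

# Volume each selected well must deliver on average to cover the budget.
break_even = REGION_BUDGET / (PRODUCT_PROFIT * BEST_POINTS)

# Mean validation targets per region, copied from the printed results.
mean_targets = {1: 92.1582, 2: 68.8863, 3: 94.8519}

# How far the regional mean falls short of the break-even volume.
shortfall = {region: break_even - mean for region, mean in mean_targets.items()}

print(round(break_even, 4))
for region, gap in shortfall.items():
    print(f"Region {region}: mean is {gap:.2f} below break-even")
```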

## Calculation of profits and risks

Let's set the random number generator parameter.

In [None]:
state = np.random.RandomState(12345)

Let's create a function that uses the bootstrap technique with 1,000 samples to build the distribution of profit from the best wells in a region.

In [None]:
def profit_region_d(target_and_pred, i, BEST_POINTS, PRODUCT_PROFIT, REGION_BUDGET, state, POINTS_IN_REGION):
    # Bootstrap distribution of profit (in millions) for region i
    profit_region = []

    target_and_pred_i = target_and_pred.loc[target_and_pred['region']==i]

    for g in range(1000):
        # Sampling with replacement repeats index labels, so the index is reset
        # to make the label-based selection below pick exactly one row per label
        subsample = target_and_pred_i.sample(n=POINTS_IN_REGION, replace=True,
                                             random_state=state).reset_index(drop=True)
        target_subsample = subsample['target']
        predictions_subsample = subsample['predictions']

        # Actual reserves of the wells with the highest predicted reserves
        best_targets = target_subsample.loc[predictions_subsample.sort_values(ascending=False).index][:BEST_POINTS]
        profit_mln = (PRODUCT_PROFIT * best_targets.sum() - REGION_BUDGET) / 1000000

        profit_region.append(profit_mln)

    profit_region = pd.Series(profit_region)

    return profit_region
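
Sampling with replacement produces repeated index labels, and label-based selection on such a frame returns every row that shares a label. A small sketch of the effect and of a positional alternative, on toy data (the frame, its index labels and the variable names are assumptions made for illustration):

```python
import pandas as pd

frame = pd.DataFrame(
    {"target": [5.0, 50.0, 20.0], "predictions": [1.0, 3.0, 2.0]},
    index=[10, 11, 12],
)

# Emulate a bootstrap draw in which the row labelled 11 was picked twice.
subsample = frame.loc[[11, 12, 11, 10]]

# Label-based selection: each repeated label matches all rows carrying it,
# so the result has more rows than the subsample itself.
order = subsample["predictions"].sort_values(ascending=False).index
by_label = subsample["target"].loc[order]
print(len(subsample), len(by_label))

# Positional alternative: give every drawn row a unique label first,
# then sort by prediction and take the top rows.
ranked = subsample.reset_index(drop=True).sort_values("predictions", ascending=False)
top = ranked["target"].head(3)
print(top.sum())
```

With unique labels, the top three wells are the two copies of the best well plus the second-best one, which is what a bootstrap draw of that composition should yield.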

Let's apply this function and calculate the indicators needed to assess each region, including the 95% confidence interval of profit and the risk of loss.

In [None]:
for i in range(1,4):

    # Run the bootstrap once so that every indicator describes the same distribution
    profits = profit_region_d(target_and_pred, i, BEST_POINTS, PRODUCT_PROFIT, REGION_BUDGET, state,
                              POINTS_IN_REGION)
    lower = profits.quantile(0.025)
    upper = profits.quantile(0.975)
    mean_v = profits.mean()
    # Share of bootstrap samples with negative profit, in percent
    risks = (profits < 0).mean() * 100

    print('Region', i, '\n')
    print('Average profit, mln: ', mean_v.round(2))
    print('2.5% quantile: ', lower.round(2))
    print('95% confidence interval:', lower.round(2), ':', upper.round(2))
    print('Risk of loss, %:', risks.round(2), '\n')

Region 1 

Average profit, mln:  425.43
2.5% quantile:  -131.54
95% confidence interval: -131.54 : 953.6
Risk of loss, %: 7.1 

Region 2 

Average profit, mln:  510.06
2.5% quantile:  99.56
95% confidence interval: 99.56 : 943.79
Risk of loss, %: 0.5 

Region 3 

Average profit, mln:  391.04
2.5% quantile:  -183.67
95% confidence interval: -183.67 : 947.64
Risk of loss, %: 10.0 
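
All four indicators are summaries of a single bootstrap distribution of profit. The sketch below shows how they relate, using synthetic profits drawn from a normal distribution as a stand-in for a real bootstrap run (the distribution parameters and variable names are assumptions, not project results):

```python
import numpy as np

rng = np.random.default_rng(12345)

# Synthetic profits in millions, standing in for one bootstrap run of 1,000 samples.
profits = rng.normal(loc=400.0, scale=250.0, size=1000)

mean_profit = profits.mean()
# The 2.5% and 97.5% quantiles bound the central 95% of the distribution.
lower, upper = np.quantile(profits, [0.025, 0.975])
# Risk of loss: share of samples with negative profit, in percent.
risk_pct = (profits < 0).mean() * 100

print(f"mean: {mean_profit:.2f}")
print(f"95% interval: {lower:.2f} : {upper:.2f}")
print(f"risk of loss: {risk_pct:.1f}%")
```

A negative lower bound means that more than 2.5% of the bootstrap samples ended in a loss.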



**CONCLUSIONS for the section:**

Region 2 has the best profit and risk indicators: its average profit is the highest, losses are very unlikely (risk below 1%), and its model also had the lowest RMSE.
Region 1 ranks second: its average profit is higher and its risk lower than those of Region 3.

**Conclusion on the project as a whole:**

In this project, data for 3 oil production regions was analyzed: 100,000 rows and 5 columns per region. The data contained no fully duplicated rows, although a small number of id values were repeated.

Based on these data, a linear regression model (supervised learning) was trained for each region, and the bootstrap procedure was then used to estimate the distribution of possible profit from production in each region.

According to the calculation results, **Region 2** was recognized as the best region.

However, despite Region 2's good results, its source data should be checked further, perhaps by requesting it again or by cross-checking it against other indicators, since it differs markedly from the source data for Regions 1 and 3.