# Data Understanding
* Explain the hierarchy here
* Describe real, simulated and hand-drawn instances
* Summarise it from Data Characterisation and Exploratory Data Analysis

## Data Characterisation
Data characterisation is a preprocessing step that summarises the features of a dataset using statistical measures and visualisation techniques such as bar charts and scatter plots. After this stage, it should be possible to identify biases, patterns, trends, and any missing or irrelevant data that may need to be addressed.

The data consists of over 50 million observations, each with 12 columns. The first column, `label`, indicates the event type of the observation. The second column, `well`, contains the name of the well the observation was taken from: hand-drawn and simulated instances have fixed names in this column, while real instances have names masked with an incremental id. The third column, `id`, identifies the instance an observation belongs to: it is incremental for hand-drawn and simulated instances, while each real instance has an id generated from its first timestamp. The remaining columns contain various readings, including:

* timestamp: observation timestamps, loaded as the index of the pandas DataFrame;
* P-PDG: pressure variable at the Permanent Downhole Gauge (PDG);
* P-TPT: pressure variable at the Temperature and Pressure Transducer (TPT);
* T-TPT: temperature variable at the Temperature and Pressure Transducer (TPT);
* P-MON-CKP: pressure variable upstream of the production choke (CKP);
* T-JUS-CKP: temperature variable downstream of the production choke (CKP);
* P-JUS-CKGL: pressure variable downstream of the gas lift choke (CKGL);
* T-JUS-CKGL: temperature variable downstream of the gas lift choke (CKGL);
* QGL: gas lift flow rate;
* class: observation labels associated with three types of periods (normal, fault transient, and faulty steady state).

The pressure features are measured in pascals (Pa), the gas lift flow rate in standard cubic metres per second (SCM/s), and the temperature features in degrees Celsius (°C). The `class` labels distinguish periods of normal operation, fault transients, and faulty steady states, which can help with diagnosis and maintenance.

```python
import pandas as pd

# Load the full dataset and list its columns and dtypes.
df = pd.read_csv('3Wdataset.csv')
df.info()
```

```text
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50822124 entries, 0 to 50822123
Data columns (total 12 columns):
 #   Column      Dtype  
---  ------      -----  
 0   label       int64  
 1   well        object 
 2   id          int64  
 3   P-PDG       float64
 4   P-TPT       float64
 5   T-TPT       float64
 6   P-MON-CKP   float64
 7   T-JUS-CKP   float64
 8   P-JUS-CKGL  float64
 9   T-JUS-CKGL  float64
 10  QGL         float64
 11  class       float64
dtypes: float64(9), int64(2), object(1)
memory usage: 4.5+ GB
```


```python
df.head()
```

|   | label | well | id | P-PDG | P-TPT | T-TPT | P-MON-CKP | T-JUS-CKP | P-JUS-CKGL | T-JUS-CKGL | QGL | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | WELL-00001 | 20170201020207 | 0.0 | 10092110.0 | 119.0944 | 1609800.0 | 84.59782 | 1564147.0 | NaN | 0.0 | 0.0 |
| 1 | 0 | WELL-00001 | 20170201020207 | 0.0 | 10092000.0 | 119.0944 | 1618206.0 | 84.58997 | 1564148.0 | NaN | 0.0 | 0.0 |
| 2 | 0 | WELL-00001 | 20170201020207 | 0.0 | 10091890.0 | 119.0944 | 1626612.0 | 84.58213 | 1564148.0 | NaN | 0.0 | 0.0 |
| 3 | 0 | WELL-00001 | 20170201020207 | 0.0 | 10091780.0 | 119.0944 | 1635018.0 | 84.57429 | 1564148.0 | NaN | 0.0 | 0.0 |
| 4 | 0 | WELL-00001 | 20170201020207 | 0.0 | 10091670.0 | 119.0944 | 1643424.0 | 84.56644 | 1564148.0 | NaN | 0.0 | 0.0 |
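The `id` shown for `WELL-00001`, 20170201020207, reads as a timestamp, which matches the statement that real instances take their id from their first timestamp. The following is a minimal sketch of how that convention could be used to separate real instances from hand-drawn and simulated ones. The 14-digit `YYYYMMDDhhmmss` layout is an assumption inferred from that single value, and the toy ids are illustrative only.

```python
import pandas as pd

# Toy ids: three incremental ones and one timestamp-like one (illustrative values).
ids = pd.Series([1, 43, 129, 20170201020207], name="id")

# Assumption: ids of real instances are 14-digit strings of the form YYYYMMDDhhmmss.
id_text = ids.astype(str)
is_real = id_text.str.len() == 14

# Parse only the timestamp-like ids; all other entries become NaT.
first_timestamp = pd.to_datetime(
    id_text.where(is_real), format="%Y%m%d%H%M%S", errors="coerce"
)

print(pd.DataFrame({"id": ids, "is_real": is_real, "first_timestamp": first_timestamp}))
```

On the full DataFrame, the same mask could be built from `df["id"]` and cross-checked against the `well` column, since the two conventions should agree.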


```python
df.describe().T
```

|   | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| label | 50822124.0 | 3.387997 | 2.526282 | 0.0 | 1.0 | 4.0 | 5.0 | 8.0 |
| id | 50822124.0 | 5536792000000.0 | 9000317000000.0 | 1.0 | 43.0 | 129.0 | 20140320000000.0 | 20180620000000.0 |
| P-PDG | 50816249.0 | -4.923179e+39 | 7.606369e+40 | -1.180116e+42 | 11616160.0 | 21892680.0 | 26055640.0 | 44858050.0 |
| P-TPT | 50815940.0 | 14672570.0 | 43791320.0 | 0.0 | 10998300.0 | 14524390.0 | 17558300.0 | 2941990000.0 |
| T-TPT | 45011148.0 | 104.4371 | 27.27372 | 0.0 | 96.97665 | 116.7546 | 121.7072 | 127.7401 |
| P-MON-CKP | 49700496.0 | 3587708.0 | 3354934.0 | -8317.492 | 1186575.0 | 1963778.0 | 5116738.0 | 13037170.0 |
| T-JUS-CKP | 49210426.0 | 75.59043 | 21.47815 | -2.02 | 67.02149 | 77.35565 | 84.78074 | 173.0961 |
| P-JUS-CKGL | 10007488.0 | 4052477.0 | 5221975.0 | -497671.7 | 2312238.0 | 2332198.0 | 3430441.0 | 21069820.0 |
| T-JUS-CKGL | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| QGL | 10691260.0 | 0.1694197 | 0.4591983 | 0.0 | 0.0 | 0.0 | 0.0 | 4.146513 |


To preserve the realistic aspects of the data, the dataset was published without preprocessing. It therefore retains NaN values, frozen variables caused by sensor or communication issues, instances of varying sizes, and outliers (Vargas et al., 2019).
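Two of the issues named above, frozen variables and outliers, can be flagged with simple rules. The sketch below is one possible approach rather than the method used by the dataset authors. It treats a variable as frozen when it never changes within an instance, and it flags outliers with Tukey's 1.5 × IQR rule. The toy values and the choice of `P-TPT` are illustrative assumptions.

```python
import pandas as pd

# Toy data: two instances (ids 1 and 2) and one sensor column (illustrative values).
toy = pd.DataFrame({
    "id": [1, 1, 1, 1, 2, 2, 2, 2],
    "P-TPT": [5.0, 5.0, 5.0, 5.0, 1.0, 2.0, 3.0, 1000.0],
})

# Frozen check: at most one distinct non-null value within an instance.
# Note that an instance whose values are all NaN is also flagged by this rule.
frozen = toy.groupby("id")["P-TPT"].nunique() <= 1

# Tukey's rule: flag values more than 1.5 * IQR outside the quartiles.
q1, q3 = toy["P-TPT"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (toy["P-TPT"] < q1 - 1.5 * iqr) | (toy["P-TPT"] > q3 + 1.5 * iqr)

print(frozen)
print(toy[is_outlier])
```

Quartile-based rules are a reasonable starting point here because, unlike the mean and standard deviation, quartiles are barely affected by a few extreme readings.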
 
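The `count` column of the `describe()` output confirms that the data has not been pre-processed. Every sensor column has fewer non-null values than the 50,822,124 rows: `T-JUS-CKGL` is entirely empty, `P-JUS-CKGL` and `QGL` are populated for only about 20% of the observations, and `T-TPT` is missing for about 11% of them. The statistics also expose extreme outliers. `P-PDG` has a minimum of about -1.18e42, which drags its mean to about -4.9e39 even though its quartiles lie between roughly 1.2e7 and 2.6e7 Pa, and the maximum of `P-TPT` (about 2.9e9 Pa) is two orders of magnitude above its 75th percentile. Missing values per column can be counted with `df.isnull().sum()`, and duplicated rows with `df.duplicated().sum()`.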
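The missing-value and duplicate checks mentioned above can be sketched on a small toy frame. The values are invented for illustration, and only the column names are taken from the dataset. On the real data, the same calls would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Toy frame: one partly missing column and one entirely empty column.
toy = pd.DataFrame({
    "P-PDG": [0.0, 0.0, 1.5, np.nan],
    "T-JUS-CKGL": [np.nan, np.nan, np.nan, np.nan],
})

# Number and share of missing values per column.
missing_count = toy.isnull().sum()
missing_share = toy.isnull().mean()

# Number of rows that repeat an earlier row (pandas treats NaN as equal here).
duplicate_rows = int(toy.duplicated().sum())

print(missing_count)
print(missing_share)
print(duplicate_rows)
```

Reporting the share alongside the raw count makes columns of very different completeness easier to compare.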

## Exploratory Data Analysis