### Introduction
The goal is to build a model that predicts the compressor decay state coefficient:
* Compressor decay state coefficient (CD)

The dataset contains 16 known features:
1. Lever position (lp) [ ]
2. Ship speed (v) [knots]
3. Gas Turbine shaft torque (GTT) [kN m]
4. Gas Turbine rate of revolutions (GTn) [rpm]
5. Gas Generator rate of revolutions (GGn) [rpm]
6. Starboard Propeller Torque (Ts) [kN]
7. Port Propeller Torque (Tp) [kN]
8. HP Turbine exit temperature (T48) [C]
9. GT Compressor inlet air temperature (T1) [C]
10. GT Compressor outlet air temperature (T2) [C]
11. HP Turbine exit pressure (P48) [bar]
12. GT Compressor inlet air pressure (P1) [bar]
13. GT Compressor outlet air pressure (P2) [bar]
14. Gas Turbine exhaust gas pressure (Pexh) [bar]
15. Turbine Injection Control (TIC) [%]
16. Fuel flow (mf) [kg/s]

In [99]:
import pandas as pd
import numpy as np
import seaborn as sns
import re

In [100]:
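# Read the feature descriptions, one per line
with open('../models/Features.txt') as f:
    headers = f.readlines()
# Keep only the short symbol in parentheses, e.g. "Lever position (lp) [ ]" -> "lp"
headers = [re.sub(r"(.*\(|\).*\s?)", '', line) for line in headers]
print(headers)
# Columns are whitespace-separated and the file has no header row
ds = pd.read_csv('../models/data.txt', sep=r'\s+', engine='python', header=None, names=headers)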
ds

['lp', 'v', 'GTT', 'GTn', 'GGn', 'Ts', 'Tp', 'T48', 'T1', 'T2', 'P48', 'P1', 'P2', 'Pexh', 'TIC', 'mf', 'CD', 'TD']


Unnamed: 0,lp,v,GTT,GTn,GGn,Ts,Tp,T48,T1,T2,P48,P1,P2,Pexh,TIC,mf,CD,TD
0,1.138,3.0,289.964,1349.489,6677.380,7.584,7.584,464.006,288.0,550.563,1.096,0.998,5.947,1.019,7.137,0.082,0.95,0.975
1,2.088,6.0,6960.180,1376.166,6828.469,28.204,28.204,635.401,288.0,581.658,1.331,0.998,7.282,1.019,10.655,0.287,0.95,0.975
2,3.144,9.0,8379.229,1386.757,7111.811,60.358,60.358,606.002,288.0,587.587,1.389,0.998,7.574,1.020,13.086,0.259,0.95,0.975
3,4.161,12.0,14724.395,1547.465,7792.630,113.774,113.774,661.471,288.0,613.851,1.658,0.998,9.007,1.022,18.109,0.358,0.95,0.975
4,5.140,15.0,21636.432,1924.313,8494.777,175.306,175.306,731.494,288.0,645.642,2.078,0.998,11.197,1.026,26.373,0.522,0.95,0.975
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11929,5.140,15.0,21624.934,1924.342,8470.013,175.239,175.239,681.658,288.0,628.950,2.087,0.998,10.990,1.027,23.803,0.471,1.00,1.000
11930,6.175,18.0,29763.213,2306.745,8800.352,245.954,245.954,747.405,288.0,658.853,2.512,0.998,13.109,1.031,32.671,0.647,1.00,1.000
11931,7.148,21.0,39003.867,2678.052,9120.889,332.389,332.389,796.457,288.0,680.393,2.982,0.998,15.420,1.036,42.104,0.834,1.00,1.000
11932,8.206,24.0,50992.579,3087.434,9300.274,438.024,438.024,892.945,288.0,722.029,3.594,0.998,18.293,1.043,58.064,1.149,1.00,1.000
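
The regular expression used to build `headers` can be hard to read at a glance. The sketch below applies the same pattern to a couple of sample lines; the helper name `extract_symbol` and the sample strings are assumptions made for illustration, modelled on the feature list in the introduction rather than on the actual contents of `Features.txt`.

```python
import re

# Same pattern as in the notebook: the first alternative removes everything up to
# and including "(", the second removes ")" and everything after it.
SYMBOL_PATTERN = re.compile(r"(.*\(|\).*\s?)")

def extract_symbol(line: str) -> str:
    """Return the short symbol found between parentheses in a description line."""
    return SYMBOL_PATTERN.sub('', line)

# Hypothetical sample lines, formatted like the feature list in the introduction
samples = [
    "1. Lever position (lp) [ ]\n",
    "3. Gas Turbine shaft torque (GTT) [kN m]\n",
]
print([extract_symbol(s) for s in samples])  # ['lp', 'GTT']
```

Because `.*` is greedy and does not cross newlines, the pattern relies on each description holding exactly one parenthesised symbol per line.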


In [101]:
ds.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11934 entries, 0 to 11933
Data columns (total 18 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   lp      11934 non-null  float64
 1   v       11934 non-null  float64
 2   GTT     11934 non-null  float64
 3   GTn     11934 non-null  float64
 4   GGn     11934 non-null  float64
 5   Ts      11934 non-null  float64
 6   Tp      11934 non-null  float64
 7   T48     11934 non-null  float64
 8   T1      11934 non-null  float64
 9   T2      11934 non-null  float64
 10  P48     11934 non-null  float64
 11  P1      11934 non-null  float64
 12  P2      11934 non-null  float64
 13  Pexh    11934 non-null  float64
 14  TIC     11934 non-null  float64
 15  mf      11934 non-null  float64
 16  CD      11934 non-null  float64
 17  TD      11934 non-null  float64
dtypes: float64(18)
memory usage: 1.6 MB


In [102]:
ds.describe()

Unnamed: 0,lp,v,GTT,GTn,GGn,Ts,Tp,T48,T1,T2,P48,P1,P2,Pexh,TIC,mf,CD,TD
count,11934.0,11934.0,11934.0,11934.0,11934.0,11934.0,11934.0,11934.0,11934.0,11934.0,11934.0,11934.0,11934.0,11934.0,11934.0,11934.0,11934.0,11934.0
mean,5.166667,15.0,27247.498685,2136.289256,8200.947312,227.335768,227.335768,735.495446,288.0,646.215331,2.352963,0.998,12.297123,1.029474,33.641261,0.66244,0.975,0.9875
std,2.626388,7.746291,22148.613155,774.083881,1091.315507,200.495889,200.495889,173.680552,0.0,72.675882,1.08477,2.220539e-16,5.337448,0.01039,25.841363,0.507132,0.01472,0.0075
min,1.138,3.0,253.547,1307.675,6589.002,5.304,5.304,442.364,288.0,540.442,1.093,0.998,5.828,1.019,0.0,0.068,0.95,0.975
25%,3.144,9.0,8375.88375,1386.758,7058.324,60.317,60.317,589.87275,288.0,578.09225,1.389,0.998,7.44725,1.02,13.6775,0.246,0.962,0.981
50%,5.14,15.0,21630.659,1924.326,8482.0815,175.268,175.268,706.038,288.0,637.1415,2.083,0.998,11.092,1.026,25.2765,0.496,0.975,0.9875
75%,7.148,21.0,39001.42675,2678.079,9132.606,332.36475,332.36475,834.06625,288.0,693.9245,2.981,0.998,15.658,1.036,44.5525,0.882,0.988,0.994
max,9.3,27.0,72784.872,3560.741,9797.103,645.249,645.249,1115.797,288.0,789.094,4.56,0.998,23.14,1.052,92.556,1.832,1.0,1.0


### Exploring and preparing the data
All columns hold float values and contain no null values.
No preprocessing is required.
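
One detail visible in the `describe()` output is that `T1` and `P1` have a standard deviation of effectively zero, so they are constant. A minimal sketch of how such columns could be detected is shown below; the helper `constant_columns` and the toy frame are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

def constant_columns(df: pd.DataFrame) -> list[str]:
    """Return the names of columns that hold a single distinct value."""
    return [col for col in df.columns if df[col].nunique(dropna=False) <= 1]

# Toy frame that mimics three columns of the dataset (values taken from the first rows)
demo = pd.DataFrame({
    "T1": [288.0, 288.0, 288.0],
    "P1": [0.998, 0.998, 0.998],
    "T2": [550.563, 581.658, 587.587],
})
print(constant_columns(demo))  # ['T1', 'P1']
```

For plain linear regression a constant column is harmless, since it is absorbed by the intercept, so leaving such columns in place does not contradict the statement that no preprocessing is needed.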


### Creating the test set
The test set holds 30% of the data.

In [103]:
from sklearn.model_selection import train_test_split
y = ds.CD  # target: compressor decay state coefficient
x = ds.drop(columns=['CD', 'TD'])  # drop the target and TD, keeping the 16 features
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.30)
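
The split above is unseeded, so each run produces a different partition and slightly different scores. If reproducibility matters, one option is to pass `random_state`; the sketch below illustrates this on toy arrays (`x_demo`, `y_demo` and the seed 42 are arbitrary choices for the example, not values from the notebook).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for the feature matrix and target (not the real dataset)
x_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# The same seed yields the same partition on every call
split_a = train_test_split(x_demo, y_demo, test_size=0.30, random_state=42)
split_b = train_test_split(x_demo, y_demo, test_size=0.30, random_state=42)

x_tr, x_te, y_tr, y_te = split_a
print(len(x_tr), len(x_te))  # 7 3
print(np.array_equal(split_a[1], split_b[1]))  # True
```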

#### Linear regression
We start with ordinary linear regression.

In [104]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(x_train, y_train)
trainScore = lr.score(x_train, y_train)
testScore = lr.score(x_test, y_test)
print('Train score ', trainScore)
print('Test score ', testScore)

Train score  0.8428540228936565
Test score  0.8445401151465483
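
The values returned by `score` for a scikit-learn regressor are the coefficient of determination, R². As a small illustration of what that number measures, the sketch below computes it by hand and compares it with `sklearn.metrics.r2_score`; the arrays are made-up values in the range of `CD`, not predictions from the model above.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up targets and predictions, for illustration only
y_true = np.array([0.95, 0.96, 0.97, 0.98, 0.99, 1.00])
y_pred = np.array([0.952, 0.958, 0.973, 0.979, 0.991, 0.997])

# R^2 = 1 - (residual sum of squares) / (total sum of squares around the mean)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_pred))
```

An R² of 1 means the predictions match the targets exactly, while 0 means the model does no better than always predicting the mean.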


Linear regression gives a decent result: R² is about 0.84 on both the training and test sets.
Let's try to improve it with polynomial regression.
### Polynomial regression and grid search
We compare polynomial regression with degrees from 1 to 3 inclusive.
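
To get a feel for how quickly the model grows with the degree, it helps to count the generated features. With the default `PolynomialFeatures` settings (bias column included, all interaction terms kept), n inputs expanded to degree d produce C(n + d, d) columns. The helper `n_poly_features` below is a hypothetical name introduced for this sketch.

```python
from math import comb

def n_poly_features(n_inputs: int, degree: int) -> int:
    """Count monomials of total degree <= degree in n_inputs variables,
    including the constant term."""
    return comb(n_inputs + degree, degree)

# 16 input features, degrees 1 to 3 as in the grid
for d in range(1, 4):
    print(d, n_poly_features(16, d))
# 1 17
# 2 153
# 3 969
```

At degree 3 the linear model is therefore fitted on 969 columns, which is still far fewer than the number of training rows.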

In [105]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import PolynomialFeatures

In [106]:
# The degree set here is only a placeholder; the grid search overrides it
poly = Pipeline([('pf', PolynomialFeatures(degree=4)), ('lr', LinearRegression())])

In [107]:
params = {'pf__degree': range(1, 4)}
params

{'pf__degree': range(1, 4)}

In [108]:
gs = GridSearchCV(poly, params, cv=5)

In [109]:
gs.fit(x_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('pf', PolynomialFeatures(degree=4)),
                                       ('lr', LinearRegression())]),
             param_grid={'pf__degree': range(1, 4)})

In [110]:
gs.best_params_

{'pf__degree': 3}

In [111]:
gs.score(x_test, y_test)

0.9999987678610616
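
`best_params_` reports only the winning degree. To see how the other degrees compared, the per-candidate cross-validation scores are available in `cv_results_`. The sketch below shows this on synthetic data with a known cubic relationship; the data, the variable names and the seed are assumptions for the example and do not reproduce the notebook's results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data: the target is an exact cubic polynomial of two inputs
rng = np.random.default_rng(0)
X_demo = rng.uniform(-1.0, 1.0, size=(200, 2))
y_demo = X_demo[:, 0] ** 3 + X_demo[:, 0] * X_demo[:, 1]

pipe = Pipeline([('pf', PolynomialFeatures()), ('lr', LinearRegression())])
search = GridSearchCV(pipe, {'pf__degree': range(1, 4)}, cv=5)
search.fit(X_demo, y_demo)

# Mean cross-validated R^2 for every degree in the grid
for candidate, mean_score in zip(search.cv_results_['params'],
                                 search.cv_results_['mean_test_score']):
    print(candidate['pf__degree'], mean_score)

best_degree = search.best_params_['pf__degree']
print('best degree:', best_degree)
```

Comparing the mean scores side by side shows whether the jump in quality happens at degree 2 or only at degree 3, which `best_params_` alone does not reveal.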

### Conclusion
The target is easy to predict: the best model reaches R² ≈ 0.999999 on the test set.
Polynomial regression of degree 3 performed best.
Searching degrees above 3 is unnecessary, since the current model is already accurate enough.