## Challenge Activities

#### 1. What steps would you take to solve this problem? Please describe as completely and clearly as possible all the steps that you see as essential for solving the problem.
**A:** I would take the following steps:
- **Data Exploration**: Analyse the data with visualizations and inspect the outliers, to understand what can be done with it.
- **Data Treatment**: Impute the missing values in each column with the most frequent value or the mean.
- **Create a correlation map**: Build a correlation map to identify the most relevant variables/columns.
- **Reduce the dimensionality**: Use PCA to reduce the dimensionality of the problem.
- **Train the models**: Train several models, such as SGD, Linear Regression and Logistic Regression, and collect their results.
    - **Create the train and test sets**: Following the Pareto principle, hold out 20% of the dataset as the test set.
    - **Use GridSearchCV to select the best parameters**: Use grid search with cross-validation to train the models with different parameters.
- **Measure the results and choose the best model**: Measure the MSE, RMSE and MAE of the models and choose the best one.
- **Analyse the impact on the problem**: Estimate the financial impact on future billings.
- **Create a presentation**: Present the results to the executive board.
- ***If the solution is approved***:
    - **Deploy the model in production**: Deploy the model for use.
    - **Maintain and monitor**: Create a system to monitor the model and a pipeline to retrain it.

#### 2. Which technical data science metric would you use to solve this challenge? Ex: absolute error, rmse, etc.
**A:** For classification models I would use recall and F1-score, to keep false negatives as low as possible. For regression models I would use RMSE, because it penalizes large errors more heavily. The choice between classification and regression depends on the nature of the target variable and the business objectives. If the target is a discrete class label indicating a state (e.g., whether a truck needs maintenance), classification metrics such as F1-score and recall are more appropriate. If the target is a continuous variable (e.g., the cost of maintenance or time until failure), regression metrics such as RMSE would be used to evaluate the models.
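
As a minimal sketch of how these metrics are computed with scikit-learn, the example below uses small made-up arrays (they are not taken from the air system dataset). The variable names are illustrative only.

```python
import numpy as np
from sklearn.metrics import f1_score, recall_score, mean_squared_error

# Toy labels (made up): 1 = truck needs maintenance, 0 = it does not.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 0, 1])
y_pred = np.array([0, 0, 1, 0, 1, 0, 1, 0, 0, 1])

# Recall: share of the real failures that the model caught.
recall = recall_score(y_true, y_pred)
# F1: harmonic mean of precision and recall.
f1 = f1_score(y_true, y_pred)

# Toy continuous target (made up), e.g. a maintenance cost.
cost_true = np.array([25.0, 10.0, 500.0, 25.0])
cost_pred = np.array([30.0, 10.0, 400.0, 20.0])
# RMSE: the single large miss (500 vs 400) dominates the score.
rmse = np.sqrt(mean_squared_error(cost_true, cost_pred))

print(f"recall={recall:.2f}, f1={f1:.2f}, rmse={rmse:.2f}")
```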

#### 3. Which business metric would you use to solve the challenge?
**A:** I would use the reduction in air system maintenance costs as the primary business metric. This metric directly reflects the financial impact of the model by measuring how much the model can reduce unnecessary maintenance costs and prevent costly breakdowns. 

#### 4. How do technical metrics relate to the business metrics?
**A:** The technical metrics indicate how accurate the models are. A more accurate model (lower error) gives better predictions of whether a truck needs to go to maintenance, which reduces maintenance costs by avoiding cases where a defective truck is not sent for maintenance.
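
One way to make this link concrete is to price each cell of the confusion matrix. The sketch below assumes the unit costs used in the cost calculation later in this notebook (10 for a truck sent without need, 25 for a defective truck sent, 500 for a defective truck not sent). The function name `maintenance_cost` and the toy labels are hypothetical.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Unit costs taken from the cost section of this notebook.
COST_FALSE_POSITIVE = 10    # truck sent to maintenance without need
COST_TRUE_POSITIVE = 25     # defective truck sent to maintenance
COST_FALSE_NEGATIVE = 500   # defective truck not sent to maintenance

def maintenance_cost(y_true, y_pred):
    """Translate a set of predictions into a total maintenance cost."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return int(fp * COST_FALSE_POSITIVE
               + tp * COST_TRUE_POSITIVE
               + fn * COST_FALSE_NEGATIVE)

# Toy labels (made up): model B has higher recall than model A.
y_true = np.array([0, 0, 0, 1, 1, 1])
pred_a = np.array([0, 1, 0, 1, 1, 0])   # misses one defective truck
pred_b = np.array([0, 1, 1, 1, 1, 1])   # catches all, one extra false alarm

cost_a = maintenance_cost(y_true, pred_a)
cost_b = maintenance_cost(y_true, pred_b)
print(cost_a, cost_b)
```

Because a missed defect costs far more than a false alarm under these prices, the model with higher recall ends up cheaper even though it raises more false alarms.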

#### 5. What types of analyses would you like to perform on the customer database?
**A:** Correlation analysis and, if possible, time series analysis, to gain insights that improve decisions when treating the data.

#### 6. What techniques would you use to reduce the dimensionality of the problem?
**A:** I would use PCA to reduce dimensionality, or a feature selector such as Random Forest importance to keep the best features.

#### 7. What techniques would you use to select variables for your predictive model?
**A:** Analyse the correlation of the variables with the target and keep the most correlated ones, or use a Random Forest to get each variable's importance and select the best ones.
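
A minimal sketch of the Random Forest option, using a synthetic dataset from `make_classification` rather than the real data. The column names `f_0` to `f_9` and the choice of keeping the top 3 features are assumptions made for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real data: 10 features, only 3 informative.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=3,
    n_redundant=0, random_state=0,
)
feature_names = [f"f_{i}" for i in range(10)]

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_demo, y_demo)

# Impurity-based importances sum to 1; sort them to rank the features.
importances = pd.Series(
    forest.feature_importances_, index=feature_names
).sort_values(ascending=False)

# Keep the highest-ranked features (the cut-off of 3 is arbitrary here).
top_features = importances.head(3).index.tolist()
print(importances)
print(top_features)
```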

#### 8. What predictive models would you use or test for this problem? Please indicate at least 3.
**A:** I would use Linear Regression, SGD and Logistic Regression, plus Random Forest to understand feature importance.

#### 9. How would you rate which of the trained models is the best?
**A:** I would rate them with a technical metric such as RMSE, measured with cross-validation, together with the expected cost reduction for the problem.

#### 10. How would you explain the result of your model? Is it possible to know which variables are most important?
**A:** To explain the results of my model, I would use feature importances for models like Random Forest and coefficients for linear models such as Linear Regression and Logistic Regression. I would create visualizations such as bar charts to make these insights accessible to all stakeholders, and provide example predictions. Yes, it is possible to identify the most important variables, since Random Forest and linear models both offer ways to evaluate and rank each variable's importance.

#### 11. How would you assess the financial impact of the proposed model?
**A:** I would assess the financial impact of the proposed model by comparing maintenance costs before and after the model implementation. This would include cost reductions due to the prevention of undetected failures and fewer unnecessary maintenance activities.

#### 12. What techniques would you use to perform the hyperparameter optimization of the chosen model?
**A:** I would use Grid Search to explore different parameter combinations, with cross-validation to ensure the results are robust and generalizable, and select the hyperparameters that give the best value of the chosen performance metric.

#### 13. What risks or precautions would you present to the customer before putting this model into production?
**A:** Before putting the model into production, I would present the risk of performance degradation over time, the importance of data quality, and the need for continuous monitoring and for retraining the model when new data patterns emerge.

#### 14. If your predictive model is approved, how would you put it into production?
**A:** If the predictive model is approved, I would deploy it to the production environment through an automated pipeline, set up monitoring routines to track real-time performance, and create processes for the continuous collection of new data for future retraining.

#### 15. If the model is in production, how would you monitor it?
**A:** To monitor the model in production, I would establish key performance metrics, such as accuracy and recall, implement dashboards for real-time visualization, and audit the model periodically to ensure it continues to provide valid and useful results.

#### 16. If the model is in production, how would you know when to retrain it?
**A:** To determine when to retrain the model in production, I would continuously monitor its performance metrics. If they show a drop in accuracy or an increase in false positives and false negatives, I would recommend retraining it.
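
A possible form for such a check is sketched below. The function `needs_retraining`, the baseline recall of 0.9 and the tolerated drop of 0.05 are all hypothetical choices for illustration. In practice the thresholds would be agreed with the business, and the check needs recently labelled data to compare against.

```python
import numpy as np
from sklearn.metrics import recall_score

def needs_retraining(y_true, y_pred, baseline_recall, max_drop=0.05):
    """Return True when recall on recent labelled data has fallen
    more than `max_drop` below the recall measured at deployment."""
    current_recall = recall_score(y_true, y_pred)
    return bool((baseline_recall - current_recall) > max_drop)

# Toy batch of recent labelled data (made up).
recent_labels = np.array([1, 1, 1, 1, 0, 0])
degraded_preds = np.array([1, 1, 0, 0, 0, 0])   # recall 0.5
healthy_preds = np.array([1, 1, 1, 1, 0, 1])    # recall 1.0

flag_degraded = needs_retraining(recent_labels, degraded_preds, baseline_recall=0.9)
flag_healthy = needs_retraining(recent_labels, healthy_preds, baseline_recall=0.9)
print(flag_degraded, flag_healthy)
```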


# Start of the code

Import used libraries

In [86]:
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.linear_model import LinearRegression, LogisticRegression, SGDRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from tqdm import tqdm
from sklearn.decomposition import PCA
import joblib

## Data reading

In [2]:
df = pd.read_csv('air_system_previous_years.csv')
df.head()

Unnamed: 0,class,aa_000,ab_000,ac_000,ad_000,ae_000,af_000,ag_000,ag_001,ag_002,...,ee_002,ee_003,ee_004,ee_005,ee_006,ee_007,ee_008,ee_009,ef_000,eg_000
0,neg,76698,na,2130706438,280,0,0,0,0,0,...,1240520,493384,721044,469792,339156,157956,73224,0,0,0
1,neg,33058,na,0,na,0,0,0,0,0,...,421400,178064,293306,245416,133654,81140,97576,1500,0,0
2,neg,41040,na,228,100,0,0,0,0,0,...,277378,159812,423992,409564,320746,158022,95128,514,0,0
3,neg,12,0,70,66,0,10,0,0,0,...,240,46,58,44,10,0,0,0,4,32
4,neg,60874,na,1368,458,0,0,0,0,0,...,622012,229790,405298,347188,286954,311560,433954,1218,0,0


## Data Analysis Start

Check the number of missing values in the data.

In [3]:
na_counts = (df == 'na').sum()
print(na_counts)

class         0
aa_000        0
ab_000    46329
ac_000     3335
ad_000    14861
          ...  
ee_007      671
ee_008      671
ee_009      671
ef_000     2724
eg_000     2723
Length: 171, dtype: int64


Replace 'na' values with NaN

In [4]:
df = df.replace('na', np.nan)

Replace 'pos' and 'neg' with 1 and 0.

Impute the missing values with the most frequent value of each column.

In [84]:
df.replace('neg', 0, inplace=True)
df.replace('pos', 1, inplace=True)

imputer  = SimpleImputer(strategy='most_frequent')
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns, index=df.index)

In [6]:
df_imputed.head()

Unnamed: 0,class,aa_000,ab_000,ac_000,ad_000,ae_000,af_000,ag_000,ag_001,ag_002,...,ee_002,ee_003,ee_004,ee_005,ee_006,ee_007,ee_008,ee_009,ef_000,eg_000
0,0,76698,0,2130706438,280,0,0,0,0,0,...,1240520,493384,721044,469792,339156,157956,73224,0,0,0
1,0,33058,0,0,0,0,0,0,0,0,...,421400,178064,293306,245416,133654,81140,97576,1500,0,0
2,0,41040,0,228,100,0,0,0,0,0,...,277378,159812,423992,409564,320746,158022,95128,514,0,0
3,0,12,0,70,66,0,10,0,0,0,...,240,46,58,44,10,0,0,0,4,32
4,0,60874,0,1368,458,0,0,0,0,0,...,622012,229790,405298,347188,286954,311560,433954,1218,0,0


Calculate each column's correlation with the 'class' column

In [7]:
correlation = df_imputed.corr()['class']

In [8]:
correlation = correlation.abs().sort_values(ascending=False)
for index, value in zip(correlation.index, correlation.values):
    print(f'{index}: {value}')

class: 1.0
ci_000: 0.5500485913599936
aa_000: 0.5369783925131034
bt_000: 0.533963753708639
bb_000: 0.5295008100476839
bv_000: 0.5280560333237196
bu_000: 0.5280560090906001
cq_000: 0.5280559924814421
aq_000: 0.5188414105610624
bj_000: 0.5134648529917623
cc_000: 0.5118862493439442
ah_000: 0.5116309351809286
an_000: 0.5106844611754808
bg_000: 0.5092986651899168
ao_000: 0.5072961420712778
bx_000: 0.5046089291123149
ap_000: 0.5029961600752952
by_000: 0.5000872377036355
ee_005: 0.4857230942244313
bh_000: 0.48472307666818365
dn_000: 0.481136717358398
ba_004: 0.4776901910078724
cn_004: 0.47371835477868135
ck_000: 0.4640517182087593
ba_003: 0.45934014514753435
ba_005: 0.4514152236466959
ag_005: 0.44817851080983434
ee_002: 0.44396007810139426
cs_005: 0.4419286525191903
ba_001: 0.43765859465632967
cs_004: 0.4375193685279318
ag_003: 0.43305030605979666
az_005: 0.4320135698926534
ba_000: 0.4306455709457857
ee_003: 0.42889850211051156
ba_002: 0.4240190193280069
bi_000: 0.41949249200635635
ee_004: 0.

In [9]:
most_correlated_columns = correlation[correlation >= 0.35].index
most_correlated_columns = [col for col in most_correlated_columns if col != 'class']
most_correlated_columns

['ci_000',
 'aa_000',
 'bt_000',
 'bb_000',
 'bv_000',
 'bu_000',
 'cq_000',
 'aq_000',
 'bj_000',
 'cc_000',
 'ah_000',
 'an_000',
 'bg_000',
 'ao_000',
 'bx_000',
 'ap_000',
 'by_000',
 'ee_005',
 'bh_000',
 'dn_000',
 'ba_004',
 'cn_004',
 'ck_000',
 'ba_003',
 'ba_005',
 'ag_005',
 'ee_002',
 'cs_005',
 'ba_001',
 'cs_004',
 'ag_003',
 'az_005',
 'ba_000',
 'ee_003',
 'ba_002',
 'bi_000',
 'ee_004',
 'ee_006',
 'ee_000',
 'ay_008',
 'cn_003',
 'ba_006',
 'cs_003',
 'cn_001',
 'cs_002',
 'ag_004',
 'am_0',
 'cn_005',
 'ag_006',
 'cn_002',
 'al_000',
 'ee_001',
 'az_004',
 'cs_000']

Separating labels from attributes and applying StandardScaler to standardize the data

In [85]:
X = df_imputed.drop('class', axis=1)[most_correlated_columns].astype(float)
y = df_imputed['class'].astype(int)

scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns, index=X.index)

X_scaled.head()

Unnamed: 0,ci_000,aa_000,bt_000,bb_000,bv_000,bu_000,cq_000,aq_000,bj_000,cc_000,...,cs_002,ag_004,am_0,cn_005,ag_006,cn_002,al_000,ee_001,az_004,cs_000
0,0.21402,0.119381,0.120095,0.205083,0.206969,0.206969,0.206969,0.55209,0.162485,0.273471,...,0.026344,-0.167256,-0.109228,0.613924,0.520355,-0.14909,-0.109015,0.36441,-0.202987,0.492316
1,-0.140409,-0.180697,-0.180302,-0.076662,-0.075562,-0.075562,-0.075562,-0.07917,-0.062298,-0.069599,...,-0.058965,-0.175301,-0.109228,-0.01245,0.02095,-0.149055,-0.109015,0.019181,-0.108156,0.072159
2,-0.136617,-0.125811,-0.125354,-0.166468,-0.165619,-0.165619,-0.165619,-0.226223,-0.201648,-0.110265,...,-0.148707,-0.182333,-0.109228,-0.07122,0.062728,-0.14909,-0.109015,-0.125823,0.084359,-0.178611
3,-0.414981,-0.407928,-0.407764,-0.411136,-0.410971,-0.410971,-0.410971,-0.34769,-0.277063,-0.381834,...,-0.195306,-0.182094,-0.10462,-0.423308,-0.420782,-0.146708,-0.107819,-0.302519,-0.350741,-0.427045
4,0.012486,0.010572,0.011171,-0.01737,-0.016105,-0.016105,-0.016105,0.089865,-0.058323,0.037215,...,-0.12837,-0.164503,-0.109228,-0.106301,0.04139,-0.148757,-0.109015,0.050497,-0.343507,-0.14376


In [83]:
pca = PCA(n_components=5)
X_pca = pd.DataFrame(pca.fit_transform(X_scaled), columns=[f'pca_{i}' for i in range(5)], index=X_scaled.index)
X_pca.head()

Unnamed: 0,pca_0,pca_1,pca_2,pca_3,pca_4
0,1.628922,-1.345564,-0.213654,-0.403473,0.384583
1,-0.58539,-0.329233,0.078444,0.164999,0.075628
2,-0.900251,-0.097231,-0.217269,-0.209744,0.003336
3,-2.51693,0.284735,-0.070657,0.158977,0.069612
4,-0.032869,-0.273508,0.079855,-0.214371,-0.219891
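
The choice of 5 components can be checked against the share of variance they retain. The sketch below runs on synthetic data, not on `X_scaled`, so its numbers say nothing about the real dataset. It only shows the `explained_variance_ratio_` attribute and the option of passing a variance fraction as `n_components`.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in: 10 correlated columns driven by 3 latent factors.
latent = rng.normal(size=(200, 3))
X_demo = latent @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(200, 10))
X_demo_scaled = StandardScaler().fit_transform(X_demo)

# Cumulative share of variance kept by the first 5 components.
pca_demo = PCA(n_components=5).fit(X_demo_scaled)
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)
print(cumulative)

# Alternative: ask for enough components to keep 95% of the variance.
pca_95 = PCA(n_components=0.95, svd_solver='full').fit(X_demo_scaled)
print(pca_95.n_components_)
```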


## Training Basic Models for Initial Evaluation

Separating training and test data with an 80/20 split (Pareto principle).

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X_pca, y, test_size=0.2, random_state=42)
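
If the positive class is rare (an assumption here, since the class balance is not printed in this notebook), a plain random split can leave the test set with very few positives. A hedged alternative is the `stratify` argument of `train_test_split`, shown below on made-up data with 2% positives.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
# Made-up data: 1000 rows, 20 positives (2%).
X_demo = rng.normal(size=(1000, 5))
y_demo = np.array([1] * 20 + [0] * 980)

# stratify keeps the same positive ratio in the train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
print(y_tr.mean(), y_te.mean())
```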

Using some basic models to get a first look at the MSE and RMSE on the test set.

In [None]:
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
log_reg = LogisticRegression(max_iter=200, solver='lbfgs')
log_reg.fit(X_train, y_train)
lin_reg, log_reg

In [14]:
lin_reg_pred = lin_reg.predict(X_test)
lin_reg_mse = mean_squared_error(y_test, lin_reg_pred)
print(f'Linear Regression MSE: {lin_reg_mse}')
print(f'Linear Regression RMSE: {np.sqrt(lin_reg_mse)}')
log_reg_pred = log_reg.predict(X_test)
log_reg_mse = mean_squared_error(y_test, log_reg_pred)
print(f'Logistic Regression MSE: {log_reg_mse}')
print(f'Logistic Regression RMSE: {np.sqrt(log_reg_mse)}')


Linear Regression MSE: 0.011394879760092437
Linear Regression RMSE: 0.10674680210710032
Logistic Regression MSE: 0.01525
Logistic Regression RMSE: 0.1234908903522847


Creating the validation set from '*air_system_present_year.csv*'

In [None]:
new_df = pd.read_csv('air_system_present_year.csv')

new_df = new_df.replace('na', np.nan)
new_df = new_df.replace('neg', 0)
new_df = new_df.replace('pos', 1)

new_df_imputed = pd.DataFrame(imputer.transform(new_df), columns=new_df.columns, index=new_df.index)

new_X = new_df_imputed.drop('class', axis=1)[most_correlated_columns].astype(float)
new_y = new_df_imputed['class'].astype(int)
new_X_scaled = pd.DataFrame(scaler.transform(new_X), columns=new_X.columns, index=new_X.index)
new_X_pca = pd.DataFrame(pca.transform(new_X_scaled), columns=[f'pca_{i}' for i in range(5)], index=new_X_scaled.index)

Validating with '*air_system_present_year.csv*'

In [16]:
log_reg_pred = log_reg.predict(new_X_pca)
log_reg_mse = mean_squared_error(new_y, log_reg_pred)
log_reg_rmse = np.sqrt(log_reg_mse)
print(f'Logistic Regression MSE: {log_reg_mse}')
print(f'Logistic Regression RMSE: {log_reg_rmse}')


Logistic Regression MSE: 0.01825
Logistic Regression RMSE: 0.13509256086106294


## Start of training

Creating the pipeline to treat the data

In [17]:
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin

most_correlated_columns = ['ci_000','aa_000','bt_000','bb_000','bv_000','bu_000','cq_000','aq_000','bj_000','cc_000','ah_000','an_000','bg_000','ao_000','bx_000','ap_000','by_000','ee_005','bh_000','dn_000','ba_004','cn_004','ck_000','ba_003','ba_005','ag_005','ee_002','cs_005','ba_001','cs_004','ag_003','az_005','ba_000','ee_003','ba_002','bi_000','ee_004','ee_006','ee_000','ay_008','cn_003','ba_006','cs_003','cn_001','cs_002','ag_004','am_0','cn_005','ag_006','cn_002','al_000','ee_001','az_004','cs_000']

class DataFrameMissingValuesReplacer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.replace('na', np.nan)
        return X
    
class DataFrameColumnSelector(BaseEstimator, TransformerMixin):
    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.columns]


pipeline = Pipeline([
    ('selector', DataFrameColumnSelector(most_correlated_columns)),
    ('replacer', DataFrameMissingValuesReplacer()),
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=5)),
])

In [None]:
dff = pd.read_csv('air_system_previous_years.csv')
X = dff.drop('class', axis=1)
y = dff['class'].replace('neg', 0).replace('pos', 1)

Fitting and transforming the columns with the created pipeline, and saving the pipeline

In [19]:
pipeline.fit(X)
X_transformed = pd.DataFrame(pipeline.transform(X), columns=['PC1', 'PC2', 'PC3', 'PC4', 'PC5'], index=X.index)
joblib.dump(pipeline, 'pipeline.pkl')

['pipeline.pkl']
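
As a sketch of how a saved pipeline can be reused, the example below dumps and reloads a small stand-in pipeline on random data and checks that both versions produce the same output. The file name `demo_pipeline.pkl` and the data are made up. For the real `pipeline.pkl`, the custom classes DataFrameColumnSelector and DataFrameMissingValuesReplacer would need to be importable in the session that calls `joblib.load`, because pickling stores references to the classes, not their code.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 6))
X_demo[::10, 0] = np.nan   # inject some missing values

# Stand-in for the notebook's pipeline, without the custom transformers.
demo_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=2)),
])
demo_pipeline.fit(X_demo)

# Save to a temporary file and load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo_pipeline.pkl')
    joblib.dump(demo_pipeline, path)
    restored = joblib.load(path)

# The restored pipeline transforms new rows exactly like the original.
X_new = rng.normal(size=(5, 6))
same_output = np.allclose(demo_pipeline.transform(X_new), restored.transform(X_new))
print(same_output)
```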

The functions below select and evaluate the models. They take as parameters the data, the number of parallel jobs, the number of cross-validation folds and the model class to be trained.

select_best_model runs GridSearchCV over the given parameter grid, scored by negative RMSE, and returns the best estimator with its parameters and score. do_cv wraps it in an outer stratified K-fold loop: for each fold it tunes the model on the training part and computes the RMSE on the held-out part. The model with the lowest RMSE is selected as the best. train_regressor is a standalone helper that fits a single configuration and returns its validation RMSE; it is not called by the other two functions.

In [20]:
def train_regressor(regressor, X_train, X_val, y_train, y_val, params):
    reg = regressor(**params)
    reg.fit(X_train, y_train)
    predictions = reg.predict(X_val)
    mse = mean_squared_error(y_val, predictions)
    rmse = np.sqrt(mse)
    return rmse

def select_best_model(regressor, X_train, y_train, params ={}, cv_folds=10, n_jobs=4):
    clf = GridSearchCV(regressor(), params, cv=cv_folds, scoring='neg_root_mean_squared_error', n_jobs=n_jobs)
    clf.fit(X_train, y_train)
    best_params = clf.best_params_
    best_score = -clf.best_score_  # Convert to positive as scoring is negative RMSE
    best_estimator = clf.best_estimator_
    return best_estimator, best_params, best_score

def do_cv(regressor, X, y, cv_splits=5, param_cv_folds=None, n_jobs=8, params={}):
    # Outer loop: stratified folds keep the class ratio in every split
    skf = StratifiedKFold(n_splits=cv_splits, shuffle=True, random_state=1)
    rmses = []
    trained_regressors = []
    best_parameters = []
    pgb = tqdm(total=cv_splits, desc='Folds evaluated')

    for train_idx, test_idx in skf.split(X, y):
        X_train = X.iloc[train_idx]
        X_test = X.iloc[test_idx]
        y_train = y.iloc[train_idx]
        y_test = y.iloc[test_idx]

        # Inner loop: grid search on the training part only
        reg, best_params, _ = select_best_model(regressor, X_train, y_train, n_jobs=n_jobs, cv_folds=param_cv_folds, params=params)
        # Evaluate the tuned model on the held-out fold
        predictions = reg.predict(X_test)
        rmse = np.sqrt(mean_squared_error(y_test, predictions))
        rmses.append(rmse)
        best_parameters.append(best_params)
        trained_regressors.append(reg)
        pgb.update(1)

    pgb.close()
    return rmses, trained_regressors, best_parameters


Defining the parameters for training each model

In [21]:
params_lr = {}  # Linear Regression has no parameters for tuning in this simple form
params_log_reg = {
    'C': [0.1, 1, 10, 100],  # Regularization strengths
    'solver': ['liblinear', 'lbfgs']  # Solvers
}
params_sgd = {
    'penalty': ['l2', 'l1', 'elasticnet'],  # Types of penalties for regularization
    'alpha': [0.0001, 0.001, 0.01],  # Regularization strength
    'max_iter': [1000],  # Maximum number of passes over the data
    'tol': [1e-3],  # The stopping criterion
    'eta0': [0.01, 0.1],  # Initial learning rate for the 'constant' or 'adaptive' schedules
    'learning_rate': ['constant', 'optimal', 'invscaling', 'adaptive'],  # Learning rate schedule
}


Perform cross-validation and calculate the average RMSE for Linear Regression, Logistic Regression and the SGD regressor

In [22]:
linear_rmse, linear_trained_models, _ = do_cv(LinearRegression, X_transformed, y, cv_splits=10, params=params_lr, n_jobs=16)
log_reg_rmse, log_reg_trained_models, log_reg_params = do_cv(LogisticRegression, X_transformed, y, cv_splits=10, params=params_log_reg, n_jobs=8)
sgd_rmse, sgd_trained_models, sgd_params = do_cv( SGDRegressor, X_transformed, y, cv_splits=10, params=params_sgd, n_jobs=8)

linear_avg_rmse = np.mean(linear_rmse)
log_reg_avg_rmse = np.mean(log_reg_rmse)
sgd_avg_rmse = np.mean(sgd_rmse)

Folds evaluated: 100%|██████████| 10/10 [00:07<00:00,  1.31it/s]
Folds evaluated: 100%|██████████| 10/10 [00:13<00:00,  1.31s/it]
Folds evaluated: 100%|██████████| 10/10 [03:07<00:00, 18.77s/it]


In [23]:
print(f"Average RMSE for Linear Regression: {linear_avg_rmse}")
print(f"Average RMSE for Logistic Regression: {log_reg_avg_rmse}")
print(f"Average RMSE for SGD Regressor: {sgd_avg_rmse}")

Average RMSE for Linear Regression: 0.10522179176922435
Average RMSE for Logistic Regression: 0.12233264194155857
Average RMSE for SGD Regressor: 0.10521052481812418


### Getting the validation data and testing

In [None]:
df_validate = pd.read_csv('air_system_present_year.csv')
X_validate = df_validate.drop('class', axis=1)
y_validate = df_validate['class'].replace('neg', 0).replace('pos', 1)
X_validate_transformed = pd.DataFrame(pipeline.transform(X_validate), columns=['PC1', 'PC2', 'PC3', 'PC4', 'PC5'], index=X_validate.index)

In [27]:
linear_rmse = []
log_reg_rmse = []
sgd_rmse = []

for i in range(10):
    linear_predictions = linear_trained_models[i].predict(X_validate_transformed)
    log_reg_predictions = log_reg_trained_models[i].predict(X_validate_transformed)
    sgd_predictions = sgd_trained_models[i].predict(X_validate_transformed)
    
    linear_rmse.append(np.sqrt(mean_squared_error(y_validate, linear_predictions)))
    log_reg_rmse.append(np.sqrt(mean_squared_error(y_validate, log_reg_predictions)))
    sgd_rmse.append(np.sqrt(mean_squared_error(y_validate, sgd_predictions)))

linear_avg_rmse = np.mean(linear_rmse)
log_reg_avg_rmse = np.mean(log_reg_rmse)
sgd_avg_rmse = np.mean(sgd_rmse)

print(f"Average RMSE for Linear Regression: {linear_avg_rmse}")
print(f"Average RMSE for Logistic Regression: {log_reg_avg_rmse}")
print(f"Average RMSE for SGD Regressor: {sgd_avg_rmse}")

Average RMSE for Linear Regression: 0.11794454036393962
Average RMSE for Logistic Regression: 0.13420710713040307
Average RMSE for SGD Regressor: 0.11796058367163602


Getting the best trained model for each type

In [None]:
best_linear_model = linear_trained_models[np.argmin(linear_rmse)]
best_log_reg_model = log_reg_trained_models[np.argmin(log_reg_rmse)]
best_sgd_model = sgd_trained_models[np.argmin(sgd_rmse)]
joblib.dump(best_linear_model, 'models/best_linear_model.pkl')
joblib.dump(best_log_reg_model, 'models/best_log_reg_model.pkl')
joblib.dump(best_sgd_model, 'models/best_sgd_model.pkl')

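In cross-validation, the SGD regressor had the lowest average RMSE (0.10521, against 0.10522 for Linear Regression). On the present-year data Linear Regression was marginally lower (0.11794 against 0.11796), so the two are practically tied. We take the SGD model for the cost analysis of the present year.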

In [39]:
predictions = best_sgd_model.predict(X_validate_transformed)

Testing the thresholds 0.1, 0.2, 0.3, 0.4 and 0.5 on the predictions

In [66]:
results = pd.DataFrame({
    'threshold': [],
    'false_negatives': [],
    'false_positives': [],
    'true_positives': []
})

for threshold in [0.1, 0.2, 0.3, 0.4, 0.5]:
    pred = (predictions > threshold).astype(int)
    false_negatives = (y_validate == 1) & (pred == 0)
    false_positives = (y_validate == 0) & (pred == 1)
    true_positives = (y_validate == 1) & (pred == 1)
    new_row = pd.DataFrame({
        'threshold': [threshold],
        'false_negatives': [false_negatives.sum()],
        'false_positives': [false_positives.sum()],
        'true_positives': [true_positives.sum()]
    })
    results = pd.concat([results, new_row], ignore_index=True)

results.head()

Unnamed: 0,threshold,false_negatives,false_positives,true_positives
0,0.1,37.0,455.0,338.0
1,0.2,103.0,204.0,272.0
2,0.3,187.0,109.0,188.0
3,0.4,245.0,40.0,130.0
4,0.5,300.0,20.0,75.0


To reduce cost, we want the fewest false negatives, balanced against the false positives and true positives. Among the tested values, the best threshold is 0.1, which gives the fewest false negatives (37) and is also close to the model's RMSE (about 0.12).
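
The comparison can also be made directly in cost terms. The sketch below copies the counts from the table above into a new DataFrame (named `results_demo` here so it does not overwrite `results`) and applies the unit costs used in the cost calculation of this notebook: 500 per false negative, 25 per true positive and 10 per false positive.

```python
import pandas as pd

# Counts copied from the threshold table above.
results_demo = pd.DataFrame({
    'threshold': [0.1, 0.2, 0.3, 0.4, 0.5],
    'false_negatives': [37, 103, 187, 245, 300],
    'false_positives': [455, 204, 109, 40, 20],
    'true_positives': [338, 272, 188, 130, 75],
})

# Total cost per threshold with the notebook's unit costs.
results_demo['cost'] = (
    results_demo['false_negatives'] * 500
    + results_demo['true_positives'] * 25
    + results_demo['false_positives'] * 10
)

# Row with the lowest total cost among the tested thresholds.
best_row = results_demo.loc[results_demo['cost'].idxmin()]
print(results_demo[['threshold', 'cost']])
print(best_row['threshold'])
```

The cost keeps falling as the threshold is lowered, so 0.1 is only the best of the values tested; thresholds below 0.1 were not evaluated here.
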

### Calculating the cost reduction

Baseline scenario: calculating the cost if 25% of the trucks that need maintenance are not sent to maintenance

In [64]:
not_sended_trucks = y_validate.sum() *0.25
sended_trucks = y_validate.sum() - not_sended_trucks
full_cost_not_sended_trucks = not_sended_trucks * 500
full_cost_sended_trucks = sended_trucks * 25
total_cost = full_cost_not_sended_trucks + full_cost_sended_trucks

print(f'Total cost: {total_cost}')

Total cost: 53906.25


Calculating the predicted cost when using the model

In [None]:
# Pick the row for the chosen threshold and read the counts as plain integers
best_row = results[results['threshold'] == 0.1].iloc[0]
fn = int(best_row['false_negatives'])
fp = int(best_row['false_positives'])
tp = int(best_row['true_positives'])

In [68]:
full_predicted_cost_not_sended_trucks = fn * 500
full_predicted_cost_sended_trucks = tp * 25
full_predicted_cost_sended_trucks_with_no_need = fp * 10
total_predicted_cost = full_predicted_cost_not_sended_trucks + full_predicted_cost_sended_trucks + full_predicted_cost_sended_trucks_with_no_need

print(f'Total predicted cost: {total_predicted_cost}')

Total predicted cost: 31500


### Total cost reduction

In [75]:
print(f'${int(total_cost - total_predicted_cost)} reduced dollars and ', f'{int((total_cost - total_predicted_cost) / total_cost * 100)}% reduced cost')

$22406 reduced dollars and  41% reduced cost


So, if 25% of the trucks with a defective air system are not sent to maintenance, the model could help reduce the total cost by 41%.

And since we saved the data pipeline and the best models, we can reuse them later to retrain the model with new data.