# <h2> 4. Modeling </h2>

In the modeling phase, I will start with a base model of linear regression. Since this is a regression task (the target, concrete compressive strength, is continuous), I will use the root mean squared error (RMSE) on the held-out test set as the evaluation metric, with lower RMSE indicating a better model. For hyperparameter tuning, I will use **Cross-Validation** with **10** folds and pick the configuration with the best mean cross-validated RMSE.

There are two versions of the data set: one with PCA-derived variables and one without. The file loaded in this notebook contains only the original input variables.

In [1]:
#Importing required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import time

#Define a timer
def print_time(start_time, end_time):
    """Print the time elapsed between two time.perf_counter() readings."""
    elapsed_time = end_time - start_time
    hours = elapsed_time // 3600
    mins = (elapsed_time - hours*3600)//60
    secs = (elapsed_time - hours*3600 - mins*60) // 1
    
    print("\nTime elapsed: {} hours {} minutes and {} seconds".format(hours, mins, secs))


import warnings
warnings.filterwarnings("ignore")

In [2]:
#Load data set
df = pd.read_csv("saved_files/concrete.csv")
df.head()

,cement,slag,ash,water,superplasticizer,coarse_agg,fine_agg,age,strength
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28,79.986111
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28,61.887366
2,332.5,142.5,0.0,228.0,0.0,932.0,594.0,270,40.269535
3,332.5,142.5,0.0,228.0,0.0,932.0,594.0,365,41.05278
4,198.6,132.4,0.0,192.0,0.0,978.4,825.5,360,44.296075


In [3]:
#separate inputs from target
X = df.drop('strength', axis=1)
y = df['strength']
#check X and y
print(y.head())
print(X.head())

0    79.986111
1    61.887366
2    40.269535
3    41.052780
4    44.296075
Name: strength, dtype: float64
   cement   slag  ash  water  superplasticizer  coarse_agg  fine_agg  age
0   540.0    0.0  0.0  162.0               2.5      1040.0     676.0   28
1   540.0    0.0  0.0  162.0               2.5      1055.0     676.0   28
2   332.5  142.5  0.0  228.0               0.0       932.0     594.0  270
3   332.5  142.5  0.0  228.0               0.0       932.0     594.0  365
4   198.6  132.4  0.0  192.0               0.0       978.4     825.5  360


In [4]:
#split test and train sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.8, random_state = 10)
print("Dimensions of X_train: ", X_train.shape)
print("Dimensions of X_test: ", X_test.shape)
print("Dimensions of y_train: ", y_train.shape)
print("Dimensions of y_test: ", y_test.shape)

Dimensions of X_train:  (824, 8)
Dimensions of X_test:  (206, 8)
Dimensions of y_train:  (824,)
Dimensions of y_test:  (206,)


In [5]:
#Standardize input variables
X_train_scaled = (X_train - X_train.mean(axis = 0))/X_train.std(axis = 0)
print(X_train_scaled.head())

       cement      slag       ash     water  superplasticizer  coarse_agg  \
34  -0.865837  1.347685 -0.843242  2.135043         -1.022928   -0.505706   
124  1.017125  0.269264 -0.843242 -1.099486          0.973077   -1.534705   
712 -0.846875  2.485310 -0.843242  0.473944         -1.022928   -0.534039   
859 -1.112348 -0.857913  2.253563 -0.033614          1.121540   -1.574628   
176  0.930846  0.897279 -0.843242 -1.284053          1.599921    2.099629   

     fine_agg       age  
34  -1.278181  5.025699  
124  1.874127 -0.282935  
712 -0.709853 -0.613741  
859  0.891573 -0.282935  
176 -2.079511  0.709480  


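Although I've standardized the training data by hand here, I won't use `X_train_scaled` directly: each model below is wrapped in a pipeline whose *StandardScaler* step learns the scaling from the training data and applies it to whatever data is passed in for prediction.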
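One detail worth knowing when comparing the manual standardization with the pipeline: pandas' `std` defaults to the sample standard deviation (`ddof=1`), while `StandardScaler` divides by the population standard deviation (`ddof=0`), so the two give slightly different values. The sketch below illustrates this on a small made-up frame (`toy_train` and `toy_test` are hypothetical, not the concrete data).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data, not the concrete data set
toy_train = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, 30.0, 60.0]})
toy_test = pd.DataFrame({"a": [2.5], "b": [25.0]})

# Manual standardization as in the cell above: pandas uses the sample std (ddof=1)
manual = (toy_train - toy_train.mean(axis=0)) / toy_train.std(axis=0)

# StandardScaler uses the population std (ddof=0)
scaler = StandardScaler().fit(toy_train)
scaled_train = scaler.transform(toy_train)

# Matching StandardScaler by hand requires ddof=0
manual_ddof0 = (toy_train - toy_train.mean(axis=0)) / toy_train.std(axis=0, ddof=0)

# New rows are transformed with the training mean and std, never their own
scaled_test = scaler.transform(toy_test)

print(np.allclose(manual.to_numpy(), scaled_train))        # differs because of ddof
print(np.allclose(manual_ddof0.to_numpy(), scaled_train))  # matches
```

With about 800 training rows the difference between the two conventions is tiny, but keeping the scaler inside the pipeline also guarantees that the test set is scaled with training statistics only.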

In [6]:
#import models & other required packages
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error

from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV

## <h3> 4.1. Linear Regression </h3>



In [7]:
#define Linear Regression pipe
pipe_linreg = Pipeline([("scaler", StandardScaler()),
                        ("model", LinearRegression())
                        
])

In [8]:
#fit Linear Regression pipe
start_time = time.perf_counter()

pipe_linreg.fit(X_train, y_train)

end_time = time.perf_counter()
print_time(start_time, end_time)



Time elapsed: 0.0 hours 0.0 minutes and 0.0 seconds


In [9]:
#Predict using fitted Linear Regression model
linreg_preds = pipe_linreg.predict(X_test)

#take the square root explicitly; the squared argument is deprecated in recent scikit-learn releases
linreg_rmse = np.sqrt(mean_squared_error(y_test, linreg_preds))
print("Linear Regression RMSE:",linreg_rmse)

Linear Regression RMSE: 10.659515359643427


In [10]:
#Store results
results = pd.DataFrame({"RMSE" : linreg_rmse}, index = ["Linear Regression"])
results

,RMSE
Linear Regression,10.659515


In [11]:
#For reference: the mean compressive strength over the full data set
df["strength"].mean()

35.81783582611362

The base Linear Regression model gives a test RMSE of **10.66** MPa, roughly 30% of the mean compressive strength of 35.82 MPa. Now, I will look at other models to reduce the RMSE.
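To make the metric concrete, here is a minimal sketch of what an RMSE figure measures, computed by hand with NumPy. The helper `rmse` and the toy numbers are hypothetical and only illustrate the formula; they are not taken from the concrete data.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: square the errors, average them, take the root."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical strengths and predictions
toy_true = [30.0, 40.0, 50.0]
toy_pred = [33.0, 36.0, 50.0]

toy_rmse = rmse(toy_true, toy_pred)  # sqrt((9 + 16 + 0) / 3)

# Dividing by the mean of the target puts the error on a relative scale
relative_error = toy_rmse / np.mean(toy_true)
print(toy_rmse, relative_error)
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones, and the result is in the same unit as the target.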

## <h3> 4.2. K-Nearest Neighbors (KNN)</h3>

Next, I will fit a KNN model with 3 neighbors and uniform weighting. I will define a pipe that standardizes the data before fitting the model, which matters for KNN because its distance calculations are sensitive to the scale of each feature.


In [12]:
#define KNN pipe
pipe_knn = Pipeline([("scaler", StandardScaler()),
                         ("model", KNeighborsRegressor(n_neighbors = 3)) #weights = 'uniform' is the default
                        
])
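For intuition about what this model does at prediction time, the following sketch reproduces a 3-neighbor, uniform-weight prediction by hand on a one-feature toy set. The arrays `X_toy`, `y_toy`, and `query` are hypothetical and unrelated to the concrete data.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical one-feature toy data
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [10.0]])
y_toy = np.array([10.0, 20.0, 30.0, 40.0, 100.0])
query = np.array([[1.2]])

# scikit-learn prediction with 3 neighbors (weights="uniform" is the default)
knn = KNeighborsRegressor(n_neighbors=3).fit(X_toy, y_toy)
sk_pred = knn.predict(query)[0]

# The same prediction by hand: find the 3 closest points and average their targets
distances = np.abs(X_toy[:, 0] - query[0, 0])
nearest = np.argsort(distances)[:3]
manual_pred = y_toy[nearest].mean()

print(sk_pred, manual_pred)
```

With uniform weighting every selected neighbor counts equally, regardless of how close it is to the query point.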

In [13]:
#fit KNN pipe
start_time = time.perf_counter()

pipe_knn.fit(X_train, y_train)

end_time = time.perf_counter()
print_time(start_time, end_time)



Time elapsed: 0.0 hours 0.0 minutes and 0.0 seconds


In [14]:
#Predict using fitted KNN model
knn_preds = pipe_knn.predict(X_test)

#take the square root explicitly; the squared argument is deprecated in recent scikit-learn releases
knn_rmse = np.sqrt(mean_squared_error(y_test, knn_preds))
print("KNN RMSE:",knn_rmse)

KNN RMSE: 8.557281810590855


In [15]:
#Store results
results.loc["KNN"] = knn_rmse
results

,RMSE
Linear Regression,10.659515
KNN,8.557282


The KNN model gives a test RMSE of **8.56** MPa, an improvement of just over 2 MPa on the base linear regression model.

## <h3> 4.3. K-Nearest Neighbors (KNN) with GridSearchCV</h3>

Now, I will use cross-validation to tune the hyperparameters of the KNN model and see whether the RMSE improves.
The hyperparameters I will optimize are the number of neighbors and the weighting function.
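The two weighting options differ in how neighbors are combined: `"uniform"` takes a plain average, while `"distance"` weights each neighbor by the inverse of its distance to the query point. A minimal sketch on hypothetical one-feature data (`X_toy`, `y_toy`, and `query` are made up for illustration):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical one-feature toy data
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [10.0]])
y_toy = np.array([10.0, 20.0, 30.0, 40.0, 100.0])
query = np.array([[1.2]])

uniform_knn = KNeighborsRegressor(n_neighbors=3, weights="uniform").fit(X_toy, y_toy)
distance_knn = KNeighborsRegressor(n_neighbors=3, weights="distance").fit(X_toy, y_toy)

uniform_pred = uniform_knn.predict(query)[0]
distance_pred = distance_knn.predict(query)[0]

# Distance weighting by hand: weights are 1 / distance for the 3 nearest points
d = np.abs(X_toy[:, 0] - query[0, 0])
idx = np.argsort(d)[:3]
w = 1.0 / d[idx]
manual_distance_pred = np.sum(w * y_toy[idx]) / np.sum(w)

print(uniform_pred, distance_pred, manual_distance_pred)
```

Distance weighting pulls the prediction toward the closest neighbor, which can help when nearby mixes behave more alike than distant ones.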


In [16]:
#Define KNN + GridSearchCV pipe
pipe_knn_grid = Pipeline([("scaler", StandardScaler()),
                     ("model", KNeighborsRegressor())
                        
])
#Define parameter space
params_knn = {'model__n_neighbors': [3, 5, 7, 9, 11, 13, 15, 17, 19],
            'model__weights': ["uniform","distance"]
           }


#use n_jobs = -1 to use all processors
knn_grid = GridSearchCV(pipe_knn_grid, param_grid = params_knn, n_jobs = -1, cv=10, scoring='neg_root_mean_squared_error')

In [17]:
#Fit KNN + GridSearchCV
start_time = time.perf_counter()

knn_grid.fit(X_train, y_train)

end_time = time.perf_counter()
print_time(start_time, end_time)


Time elapsed: 0.0 hours 0.0 minutes and 2.0 seconds


In [18]:
#Print best parameters
print(knn_grid.best_params_)

{'model__n_neighbors': 5, 'model__weights': 'distance'}
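Besides the best parameters, the fitted search also exposes the cross-validated score that led to this choice. Because the scorer is `neg_root_mean_squared_error`, `best_score_` is the negative of the mean RMSE across folds. The sketch below shows how to read it, using synthetic data from `make_regression` and a smaller grid with 5 folds as stand-ins (assumptions for illustration, not the settings or data used above).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, not the concrete data set
X_syn, y_syn = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

pipe = Pipeline([("scaler", StandardScaler()),
                 ("model", KNeighborsRegressor())])
param_grid = {"model__n_neighbors": [3, 5, 7],
              "model__weights": ["uniform", "distance"]}

grid = GridSearchCV(pipe, param_grid=param_grid, cv=5,
                    scoring="neg_root_mean_squared_error")
grid.fit(X_syn, y_syn)

# Flip the sign to get the mean cross-validated RMSE of the best configuration
cv_rmse = -grid.best_score_

# Mean cross-validated RMSE for every configuration in the grid
mean_scores = -grid.cv_results_["mean_test_score"]
print(grid.best_params_, cv_rmse)
```

The cross-validated RMSE is computed on training folds only, so it is a separate estimate from the test RMSE reported in the results table.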


In [19]:
#Predict using fitted KNN + GridSearchCV model
knn_grid_preds = knn_grid.best_estimator_.predict(X_test)

#take the square root explicitly; the squared argument is deprecated in recent scikit-learn releases
knn_grid_rmse = np.sqrt(mean_squared_error(y_test, knn_grid_preds))
print("KNN + GridSearchCV RMSE:",knn_grid_rmse)

KNN + GridSearchCV RMSE: 7.384882740854421


In [20]:
#Store results
results.loc["KNN + GridSearchCV"] = knn_grid_rmse
results

,RMSE
Linear Regression,10.659515
KNN,8.557282
KNN + GridSearchCV,7.384883


With GridSearchCV, the KNN model improves to a test RMSE of **7.38** MPa, about 14% lower than the 8.56 MPa of the untuned KNN model.

Let's save the X_train, X_test, y_train, y_test, and results sets so that the second Modeling part can load them directly.

In [21]:
#Save X_train set as a csv file
X_train.to_csv('saved_files/concrete_x_train.csv', index=False)
#Save X_test set as a csv file
X_test.to_csv('saved_files/concrete_x_test.csv', index=False)
#Save y_train set as a csv file
y_train.to_csv('saved_files/concrete_y_train.csv', index=False)
#Save y_test set as a csv file
y_test.to_csv('saved_files/concrete_y_test.csv', index=False)
#Save model results as a csv file
results.to_csv('saved_files/concrete_model_results.csv', index=True)
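A follow-up notebook could read these files back roughly as sketched below. The example writes to a temporary directory with made-up values (`X_demo`, `y_demo`, and the file names are hypothetical) so that it runs on its own; the real paths would be the ones used in the cell above.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-ins for X_train and y_train
X_demo = pd.DataFrame({"cement": [540.0, 332.5], "age": [28, 270]})
y_demo = pd.Series([79.99, 40.27], name="strength")

with tempfile.TemporaryDirectory() as tmp:
    x_path = Path(tmp) / "x_demo.csv"
    y_path = Path(tmp) / "y_demo.csv"

    # Same calls as above: index=False drops the shuffled row labels
    X_demo.to_csv(x_path, index=False)
    y_demo.to_csv(y_path, index=False)

    X_loaded = pd.read_csv(x_path)
    # read_csv returns a one-column DataFrame; squeeze turns it back into a Series
    y_loaded = pd.read_csv(y_path).squeeze("columns")

print(X_loaded.shape, type(y_loaded).__name__, y_loaded.name)
```

Since the index is not saved, inputs and targets line up by row position after reloading, which works here because each pair is written in the same row order.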

The next notebook will focus on Random Forest, XGBoost, their optimization, and some ensemble models.