## Model Training

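With the dataset prepared, we split it into training (85%) and test (15%) samples and fit a linear regression model. The final `score` call returns the coefficient of determination (R²) on the test set.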

In [13]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import pandas as pd

total_data_norm = pd.read_csv("../data/interim/factorised_eda_results.csv")

# Separate the predictors (X) from the target (y), then split into training and test samples.
X = total_data_norm.drop(columns=["charges"])
y = total_data_norm["charges"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)
model.score(X_test, y_test)

0.7970561040137825

In [14]:
print(f"Intercept (a): {model.intercept_}")
print(f"Coefficients (b1, ..., b6): {model.coef_}")

Intercept (a): 11374.760308981617
Coefficients (b1, ..., b6): [   260.27696192   -129.81301457    323.58181576    422.49541425
 -23771.47159663    225.60529155]
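The raw coefficient array is hard to read because it does not say which value belongs to which feature. A minimal sketch of one way to label it follows. It uses a small synthetic dataset, and the column names `age` and `bmi` as well as the `demo_` variables are assumptions made for illustration, not the notebook's actual data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data (illustrative only): y = 2*age - 3*bmi + 5
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "age": rng.uniform(18, 65, 100),
    "bmi": rng.uniform(18, 40, 100),
})
y_demo = 2.0 * X_demo["age"] - 3.0 * X_demo["bmi"] + 5.0

demo_model = LinearRegression().fit(X_demo, y_demo)

# coef_ follows the column order of the training frame,
# so the column names can be used directly as the index.
coef_table = pd.Series(demo_model.coef_, index=X_demo.columns, name="coefficient")
print(coef_table)
print(f"Intercept: {demo_model.intercept_}")
```

In the notebook itself, the same pattern would presumably be `pd.Series(model.coef_, index=X.columns)`, since `X` is the frame the model was trained on.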


In [15]:
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

y_pred = model.predict(X_test)

print(f"Root mean squared error: {np.sqrt(mean_squared_error(y_test, y_pred))}")
print(f"Coefficient of determination: {r2_score(y_test, y_pred)}")

Root mean squared error: 5558.9393229312855
Coefficient of determination: 0.7970561040137825
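To make the two metrics above less of a black box, here is a rough sketch that computes them by hand with NumPy. The helper names `rmse` and `r_squared` and the toy arrays are hypothetical and exist only for this example. RMSE is in the same units as the target, while R² is the share of the target's variance that the predictions explain.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return float(1.0 - ss_res / ss_tot)

# Toy values, not taken from the notebook's dataset.
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_demo = np.array([2.0, 5.0, 8.0, 9.0])

demo_rmse = rmse(y_true_demo, y_pred_demo)
demo_r2 = r_squared(y_true_demo, y_pred_demo)
print(demo_rmse, demo_r2)
```

For a regressor, `model.score` and `r2_score` apply this same R² definition, which is why the two outputs in the notebook are identical.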


Finally, we serialize the fitted model to disk with `pickle` so it can be reused without retraining.

In [16]:
import pickle

with open("../models/linear_regression_model.pkl", "wb") as f:
    pickle.dump(model, f)
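A saved model is only useful if it can be loaded back, so the sketch below shows a save-and-restore round trip. It is an illustration under assumptions: it trains a throwaway model on synthetic data and writes to a temporary directory instead of `../models/`, and the `demo_` and `restored_` names are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (illustrative only) to get a fitted model to serialize.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + 4.0
demo_model = LinearRegression().fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "linear_regression_model.pkl"

    # Save the fitted model, as in the cell above.
    with open(model_path, "wb") as f:
        pickle.dump(demo_model, f)

    # Load it back; "rb" mirrors the "wb" used when writing.
    with open(model_path, "rb") as f:
        restored_model = pickle.load(f)

# The restored model should reproduce the original predictions.
predictions_match = np.allclose(
    demo_model.predict(X_demo), restored_model.predict(X_demo)
)
print(predictions_match)
```

Two caveats apply to this approach: `pickle.load` can execute arbitrary code, so only load files from a trusted source, and a pickled scikit-learn model is best loaded with the same scikit-learn version that produced it.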
