## Phase: Model Building

Now that I have a basic sense of the visualization, I want to start modeling the data.

However, because I didn't save my work from yesterday, I'll go back to the notebook and save the result to disk, so I can use it here and maybe somewhere else.

In [None]:
import lineapy
import numpy as np
import pandas as pd

In [None]:
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

In [None]:
cleaned_data = pd.read_csv("outputs/cleaned_data_housing.csv")

In [None]:
len(cleaned_data)

In [None]:
cleaned_data = cleaned_data.dropna()

In [None]:
len(cleaned_data)

In [None]:
train, val = train_test_split(cleaned_data, test_size=0.3, random_state=42)
X_train = train.drop(['SalePrice'], axis = 1)
y_train = train.loc[:, 'SalePrice']
X_val = val.drop(['SalePrice'], axis = 1)
y_val = val.loc[:, 'SalePrice']

In [None]:
X_train

In [None]:
y_train

In [None]:
linear_model = LinearRegression(fit_intercept=True)

In [None]:
linear_model.fit(X_train, y_train)
y_fitted = linear_model.predict(X_train)
y_predicted = linear_model.predict(X_val)
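
As a sanity check on what `LinearRegression(fit_intercept=True)` is doing, here is a minimal sketch of an ordinary least-squares fit using `np.linalg.lstsq` on made-up data. The toy arrays and the helper name `fit_ols` are assumptions for illustration only; they are not part of the housing pipeline.

```python
import numpy as np

def fit_ols(X, y):
    """Hypothetical helper: least-squares fit with an intercept term."""
    # Prepend a column of ones so the first coefficient acts as the intercept.
    X_design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(X_design, y, rcond=None)
    return coef[0], coef[1:]

# Toy data generated exactly from y = 1 + 2*x1 + 3*x2 (no noise).
X_toy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y_toy = 1.0 + 2.0 * X_toy[:, 0] + 3.0 * X_toy[:, 1]

intercept, slopes = fit_ols(X_toy, y_toy)
```

On noise-free data like this, the fit should recover the intercept and both slopes up to floating-point error.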

In [None]:
X_val["Predicted Sales Price"] = y_predicted

## Use the prediction on split test data

Now that we have built the model, we want to take a look at how accurate it is.

In [None]:
def rmse(predicted, actual):
    """
    Calculates RMSE from actual and predicted values
    Input:
      predicted (1D array): vector of predicted/fitted values
      actual (1D array): vector of actual values
    Output:
      a float, the root-mean square error
    """
    return np.sqrt(np.mean((actual - predicted)**2))
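
To make the formula concrete, here is a small worked example with made-up numbers (the toy arrays are assumptions, not housing data). The function is repeated so the snippet runs on its own.

```python
import numpy as np

def rmse(predicted, actual):
    """Root-mean-square error between two equal-length 1D arrays."""
    return np.sqrt(np.mean((actual - predicted) ** 2))

predicted_toy = np.array([3.0, 5.0, 7.0])
actual_toy = np.array([1.0, 5.0, 9.0])

# Errors are (-2, 0, 2), squared errors are (4, 0, 4), so the mean is 8/3.
toy_error = rmse(predicted_toy, actual_toy)
```

Because the error is square-rooted at the end, the result is in the same units as the target, which here means dollars of `SalePrice`.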

In [None]:
# NBVAL_IGNORE_OUTPUT
rmse(y_predicted, y_val)

## Use the prediction on new test data

In [None]:
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

test_data = pd.read_csv("data/ames_test_cleaned.csv")

In [None]:
# NBVAL_IGNORE_OUTPUT
vec_enc = DictVectorizer()
vec_enc.fit(test_data[['Neighborhood']].to_dict(orient='records'))
Neighborhood_data = vec_enc.transform(test_data[['Neighborhood']].to_dict(orient='records')).toarray()
Neighborhood_cats = vec_enc.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
Neighborhood = pd.DataFrame(Neighborhood_data, columns=Neighborhood_cats)
test_data = pd.concat([test_data, Neighborhood], axis=1)
test_data = test_data.drop(columns=Neighborhood_cats[0])
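
One caveat with the cell above: the encoder is fit on the test file, so the set and order of the `Neighborhood=...` columns depend on which neighborhoods happen to appear there. The following is a plain-Python sketch of encoding against a fixed column order instead. The helper `one_hot_row` and the neighborhood names are hypothetical, chosen only to illustrate the idea.

```python
def one_hot_row(neighborhood, train_columns):
    """Hypothetical helper: encode one value against a fixed column order."""
    target = f"Neighborhood={neighborhood}"
    # A category that was not seen during training maps to an all-zero row.
    return [1.0 if col == target else 0.0 for col in train_columns]

# Assumed column order, standing in for the columns the model was trained on.
train_columns = ["Neighborhood=Gilbert", "Neighborhood=NAmes", "Neighborhood=OldTown"]

rows = [one_hot_row(n, train_columns) for n in ["NAmes", "OldTown", "Somerst"]]
```

Keeping the column order fixed this way means the model sees features in the same positions it was trained on, even when the new data contains different neighborhoods.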

In [None]:
# new_res = lineapy.run(after_load, {input_data: pd.read_csv("../ames_other_cleaned.csv")})

In [None]:
relevant_table = test_data.filter(regex=("Neighborhood=.|Gr_Liv_Area|Garage_Area|SalePrice")).dropna()
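
To see which columns the regex above keeps, here is a small sketch using the standard `re` module on a made-up list of column names (the names are assumptions). `DataFrame.filter(regex=...)` keeps the labels for which `re.search` finds a match.

```python
import re

pattern = re.compile(r"Neighborhood=.|Gr_Liv_Area|Garage_Area|SalePrice")

# Made-up column names standing in for the encoded test table.
columns = [
    "Neighborhood",
    "Neighborhood=NAmes",
    "Gr_Liv_Area",
    "Garage_Area",
    "Lot_Area",
    "SalePrice",
]

# Keep a label when the pattern matches anywhere inside it.
kept = [c for c in columns if pattern.search(c)]
```

The raw `Neighborhood` column is dropped because it has no `=` in its name, so only the one-hot columns and the numeric features survive.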

In [None]:
relevant_table

In [None]:
# NBVAL_IGNORE_OUTPUT
y_test_predicted = linear_model.predict(relevant_table.drop(["SalePrice"], axis=1))

In [None]:
rmse(y_test_predicted, relevant_table['SalePrice'])

I've verified that the test results are still within the expected range.

In [None]:
from joblib import dump
dump(linear_model, "outputs/linea_model_housing.joblib")

## Task 2: App API

I want to deploy this model so the business folks can come in and take my "suggested" values.

I would either have to learn Flask and AWS to put a mini web app up, or make the business folks use a notebook (which involves setting up Python).

In [None]:
!rm outputs/linea_model_housing.joblib

In [None]:
artifact = lineapy.save(lineapy.file_system, "linea_model_housing")
artifact.visualize()

In [None]:
artifact.to_airflow();

In [None]:
print(artifact.code)