## Pipeline for House Price Prediction

### Problem Statement
Buying property is common in our society today. However, prospective buyers lack a guideline to help them get good value for their money, and they are left to gamble their hard-earned money on different options. This project provides a model that helps buyers predict the price of a house from the features they choose.

### Data:
The data was collected within a three-month period. The features include city, number of bedrooms, number of bathrooms, square footage of the living area, square footage of the basement, number of floors, waterfront, number of views, year built, year renovated, etc.

In [1]:
# You may need to restart your notebook kernel after updating the kfp sdk
! pip3 show kfp

Name: kfp
Version: 1.4.0
Summary: KubeFlow Pipelines SDK
Home-page: UNKNOWN
Author: google
Author-email: None
License: UNKNOWN
Location: /home/jovyan/.local/lib/python3.6/site-packages
Requires: click, google-auth, docstring-parser, tabulate, kfp-server-api, strip-hints, google-cloud-storage, fire, kfp-pipeline-spec, kubernetes, requests-toolbelt, cloudpickle, PyYAML, Deprecated, jsonschema
Required-by: kfp-notebook


In [2]:
! wget https://dl.min.io/client/mc/release/linux-amd64/mc
! chmod +x mc
! ./mc --help

--2021-03-19 22:54:54--  https://dl.min.io/client/mc/release/linux-amd64/mc
Resolving dl.min.io (dl.min.io)... 178.128.69.202
Connecting to dl.min.io (dl.min.io)|178.128.69.202|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 20287488 (19M) [application/octet-stream]
Saving to: ‘mc.3’


2021-03-19 22:55:12 (1.13 MB/s) - ‘mc.3’ saved [20287488/20287488]

NAME:
  mc - MinIO Client for cloud storage and filesystems.

USAGE:
  mc [FLAGS] COMMAND [COMMAND FLAGS | -h] [ARGUMENTS...]

COMMANDS:
  alias      set, remove and list aliases in configuration file
  ls         list buckets and objects
  mb         make a bucket
  rb         remove a bucket
  cp         copy objects
  mirror     synchronize object(s) to a remote site
  cat        display object contents
  head       display first 'n' lines of an object
  pipe       stream STDIN to an object
  share      generate URL for temporary access to an object
  find       search for objects
  sql        run sql queries

In [3]:
! ./mc alias set minio http://minio-service.kubeflow:9000 minio minio123

[m[32mAdded `minio` successfully.[0m
[0m

In [4]:
# ! ./mc mb minio/price

In [5]:
! ./mc ls minio price

[m[32m[2021-03-19 22:52:35 UTC][0m[33m     0B[0m[36;1m loan/[0m
[0m[m[32m[2021-03-19 22:43:33 UTC][0m[33m     0B[0m[36;1m mlpipeline/[0m
[0m[m[32m[2021-03-19 04:09:14 UTC][0m[33m     0B[0m[36;1m mnist/[0m
[0m[m[32m[2021-03-19 22:54:41 UTC][0m[33m     0B[0m[36;1m price/[0m
[0m[m[32m[2021-03-19 22:45:52 UTC][0m[33m  35KiB[0m[1m price_prediction.yaml[0m
[0m

In [6]:
! tar --dereference -czf datasets.tar.gz ./datasets
! ./mc cp datasets.tar.gz minio/price/datasets.tar.gz
! ./mc policy set download minio/price

...ts.tar.gz:  252.84 KiB / 252.84 KiB  13.07 MiB/s 0s
Access permission for `minio/price` is set to `download`

In [7]:
from typing import NamedTuple
import kfp
import kfp.components as components
import kfp.dsl as dsl
from kfp.components import InputPath, OutputPath #helps define the input & output between the components

## Create a pipeline Function
## Preprocessing Function

In [8]:
def download_dataset(data_dir: OutputPath(str)):
    """Download the data set to the KFP volume to share it among all steps"""
    import urllib.request
    import tarfile
    import os
    import subprocess

    if not os.path.exists(data_dir):
        os.makedirs(data_dir)

    # Must point at the bucket the archive was uploaded to above (minio/price)
    url = "http://minio-service.kubeflow:9000/price/datasets.tar.gz"
    stream = urllib.request.urlopen(url)
    tar = tarfile.open(fileobj=stream, mode="r|gz")
    tar.extractall(path=data_dir)
    
    subprocess.call(["ls", "-lha", data_dir])
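
The `r|gz` mode used in `download_dataset` reads the archive as a forward-only stream, which is why the HTTP response can be handed to `tarfile.open` without saving it to disk first. The following is a minimal offline sketch of that idea, with an in-memory buffer standing in for the MinIO response. The helper names and the sample file contents are made up for illustration.

```python
import io
import os
import tarfile
import tempfile


def make_sample_archive() -> bytes:
    """Build a tiny gzip-compressed tarball in memory (hypothetical contents)."""
    payload = b"price,bedrooms\n313000,3\n"
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        info = tarfile.TarInfo(name="datasets/data.csv")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def extract_stream(stream, target_dir: str) -> list:
    """Extract a gzip tar stream sequentially, without seeking."""
    os.makedirs(target_dir, exist_ok=True)
    with tarfile.open(fileobj=stream, mode="r|gz") as tar:
        tar.extractall(path=target_dir)
    # Collect the extracted files as paths relative to target_dir
    return sorted(
        os.path.relpath(os.path.join(root, name), target_dir)
        for root, _dirs, files in os.walk(target_dir)
        for name in files
    )


with tempfile.TemporaryDirectory() as tmp:
    extracted = extract_stream(io.BytesIO(make_sample_archive()), tmp)
    print(extracted)
```

Any file-like object with a `read` method can replace the buffer, including the object returned by `urllib.request.urlopen`.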

In [9]:
def preprocess(data_dir: InputPath(str), clean_data_dir: OutputPath(str)):
    
    import os
    import pickle
    import subprocess
    import sys

    # Install the dependencies this step needs inside the container
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'pandas'])
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'scikit-learn'])

    import pandas as pd
    from sklearn.model_selection import train_test_split  # splitting the data
    # Get data
    
    data = pd.read_csv(f"{data_dir}/datasets/data.csv")
    
    print(data)
    
    # Drop unnecessary columns
    data.drop(columns=['date','country', 'statezip', 'street'], inplace=True)
    
    #Filtering for prices that are not zero.
    data = data.query('price != 0')
    
    # Keep only houses with a non-zero number of bedrooms and bathrooms.
    # Both conditions go in one query string: `'a' or 'b'` in Python evaluates to 'a' alone.
    data = data.query('bedrooms != 0 and bathrooms != 0')
    
    # Converting the city variable to numerical values.
    from sklearn.preprocessing import LabelEncoder
    le = LabelEncoder()
    data['city'] = le.fit_transform(data['city'])
    
    # Return the renovation year, falling back to the build year when it is 0
    def yr_col(yr_renovated, yr_built):
        return yr_built if yr_renovated == 0 else yr_renovated
    
    # Replace zero entries in the year-renovated column with the year built
    data['yr_renovated'] = data.apply(lambda x: yr_col(x['yr_renovated'], x['yr_built']), axis=1)
    
    #Filtering the outliers
    data = data[(
                (data['price'] <= 2000000) & 
                (data['price'] > 150000) & 
                (data['bathrooms'] <= 4.5) & 
                (data['condition'] > 2) & 
                (data['sqft_living'] > 700) & 
                (data['sqft_living'] <= 5000) & 
                (data['sqft_lot'] <= 50000) & 
                (data['sqft_above'] <= 5000) &
                (data['sqft_basement'] <= 5000) &
                (data['bedrooms'] <= 6) 
                )]
    
    #Filtering for multicollinearity
    data = data.drop(columns=['sqft_living', 'sqft_above'])
    
    # We normalise our dataset to a common scale using the min max scaler
    from sklearn.preprocessing import MinMaxScaler
    scaler = MinMaxScaler()
    data = pd.DataFrame(scaler.fit_transform(data), columns=data.columns)
    
    # split the data into X and y
    X = data.drop(['price'], axis=1)  # predictor
    y = data['price'] # target
    
    # Split the data into training and testing set
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

    
    # Keys must match the lowercase names read by train_model and prediction
    data = {"X_train": X_train, "X_test": X_test, "y_train": y_train, "y_test": y_test}
    
    os.makedirs(clean_data_dir, exist_ok=True)

    with open(os.path.join(clean_data_dir,'clean_data.pickle'), 'wb') as f:
        pickle.dump(data, f)
    
    print(f"clean_data.pickle {clean_data_dir}")
    
    print(os.listdir(clean_data_dir))
    
    print("Preprocessing Done")
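
Because the scaler in `preprocess` is fitted on the whole frame, the `price` column is scaled to [0, 1] along with the features, so the model is trained on, and predicts, scaled prices. The following is a minimal pure-Python sketch of the min-max formula and its inverse. The helper names and the prices are made up for illustration, and it is not a replacement for `MinMaxScaler`.

```python
def min_max_scale(values):
    """Scale values to [0, 1] with (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # A constant column has no spread; map every value to 0.0
        return [0.0 for _ in values], lo, hi
    return [(v - lo) / span for v in values], lo, hi


def inverse_scale(scaled, lo, hi):
    """Map scaled values back to the original units."""
    return [s * (hi - lo) + lo for s in scaled]


prices = [200_000, 650_000, 1_100_000, 2_000_000]  # made-up prices
scaled, lo, hi = min_max_scale(prices)
restored = inverse_scale(scaled, lo, hi)
print(scaled)
print(restored)
```

Converting a prediction back to a price requires the minimum and maximum of the original `price` column. The pickle written above does not store them or the fitted scaler, so they would have to be saved as well if prices in the original units are needed.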

## Training Function
## Training the Model with the CatBoost Regressor

In [10]:
def train_model(clean_data_dir: InputPath(str), model_dir: OutputPath(str)):
    
    # Install all the dependencies inside the function
    import numpy as np
    import pickle
    import os
    import sys, subprocess;
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'catboost==0.24.2'])
    import pandas as pd
    # import libraries for training
    from catboost import CatBoostRegressor
    
    #load the preprocessed data
    
    print(clean_data_dir)
    with open(os.path.join(clean_data_dir,'clean_data.pickle'), 'rb') as f:
        data = pickle.load(f)
        
    print(data)
    
    X_train = data['X_train']
    y_train = data['y_train']
    
    # Instantiating the model 
    model = CatBoostRegressor(verbose=1, n_estimators=10)
    
    # Fit the model to the training data
    model.fit(X_train,y_train)
    
    # Save the model to the designated directory
    os.makedirs(model_dir, exist_ok=True)
    
    with open(os.path.join(model_dir,'model.pickle'), 'wb') as f:
        pickle.dump(model, f)
    
    print(f"model.pickle {model_dir}")
    
    print(os.listdir(model_dir))

In [11]:
def prediction(
    clean_data_dir: InputPath(str), model_dir: InputPath(str), metrics_path: OutputPath(str)
) -> NamedTuple("EvaluationOutput", [("mlpipeline_metrics", "Metrics")]):   
    import json  # needed for json.dump / json.dumps below
    import os
    import pickle
    import subprocess
    import sys
    import numpy as np
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'scikit-learn'])
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'catboost==0.24.2'])
    # Evaluation metrics
    from sklearn.metrics import mean_squared_error, r2_score
    from collections import namedtuple
      
    print(model_dir)
    with open(os.path.join(model_dir,'model.pickle'), 'rb') as f:
        model = pickle.load(f)        
    print(model)
    
    print(clean_data_dir)
    with open(os.path.join(clean_data_dir,'clean_data.pickle'), 'rb') as f:
        data = pickle.load(f)       
    print(data)
 
    X_test = data['X_test']
    y_test = data['y_test']
    X_train = data['X_train']
    y_train = data['y_train']
    
    #Evaluate the model and print the results
    model_pred = model.predict(X_test)
    
    
    # Use a separate name so the imported r2_score function is not shadowed
    r2 = r2_score(y_test, model_pred)
    rmse_test = np.sqrt(mean_squared_error(y_test, model_pred))
    rmse_train = np.sqrt(mean_squared_error(y_train, model.predict(X_train)))
    
    # Report numeric values; RMSE is a raw value rather than a percentage
    metrics = {
        "metrics": [
            {"name": "r2_score", "numberValue": float(r2), "format": "PERCENTAGE"},
            {"name": "rmse_test", "numberValue": float(rmse_test), "format": "RAW"},
            {"name": "rmse_train", "numberValue": float(rmse_train), "format": "RAW"},
        ]
    }

    with open(metrics_path, "w") as f:
        json.dump(metrics, f)

    out_tuple = namedtuple("EvaluationOutput", ["mlpipeline_metrics"])

    return out_tuple(json.dumps(metrics))
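
To make the reported numbers concrete, the following is a minimal standard-library sketch of how R² and RMSE are defined and how they fit into a metrics payload of the shape used above. The sample values are made up, and the helper names are illustrative. The pipeline itself relies on the `sklearn.metrics` functions.

```python
import json
import math


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


# Made-up scaled prices and predictions
y_true = [0.0, 0.5, 1.0]
y_pred = [0.0, 0.5, 0.5]

payload = {
    "metrics": [
        {"name": "r2_score", "numberValue": r2(y_true, y_pred), "format": "PERCENTAGE"},
        {"name": "rmse_test", "numberValue": rmse(y_true, y_pred), "format": "RAW"},
    ]
}
serialized = json.dumps(payload)
print(serialized)
```

Since the target was min-max scaled during preprocessing, the RMSE values are expressed in scaled units rather than in the original price units.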

In [12]:
def export_model(
    model_dir: InputPath(str),
    metrics: InputPath(str),
    export_bucket: str,
    model_name: str,
    model_version: int,
):
    import os
    import boto3
    from botocore.client import Config

    s3 = boto3.client(
        "s3",
        endpoint_url="http://minio-service.kubeflow:9000",
        aws_access_key_id="minio",
        aws_secret_access_key="minio123",
        config=Config(signature_version="s3v4"),
    )

    # Create export bucket if it does not yet exist
    response = s3.list_buckets()
    export_bucket_exists = False

    for bucket in response["Buckets"]:
        if bucket["Name"] == export_bucket:
            export_bucket_exists = True

    if not export_bucket_exists:
        s3.create_bucket(ACL="public-read-write", Bucket=export_bucket)

    # Save model files to S3
    for root, dirs, files in os.walk(model_dir):
        for filename in files:
            local_path = os.path.join(root, filename)
            s3_path = os.path.relpath(local_path, model_dir)

            s3.upload_file(
                local_path,
                export_bucket,
                f"{model_name}/{model_version}/{s3_path}",
                ExtraArgs={"ACL": "public-read"},
            )

    response = s3.list_objects(Bucket=export_bucket)
    print(f"All objects in {export_bucket}:")
    for file in response["Contents"]:
        print("{}/{}".format(export_bucket, file["Key"]))
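
The object keys that `export_model` uploads follow the pattern `model_name/model_version/relative_path`. The following is a small offline sketch that walks a temporary directory and builds the same keys without contacting MinIO. The helper name `object_keys` is made up for illustration.

```python
import os
import tempfile


def object_keys(model_dir, model_name, model_version):
    """List the object keys the export step would use for files in model_dir."""
    keys = []
    for root, _dirs, files in os.walk(model_dir):
        for filename in files:
            local_path = os.path.join(root, filename)
            rel_path = os.path.relpath(local_path, model_dir)
            keys.append(f"{model_name}/{model_version}/{rel_path}")
    return sorted(keys)


with tempfile.TemporaryDirectory() as tmp:
    # Create an empty stand-in for the pickled model
    with open(os.path.join(tmp, "model.pickle"), "wb"):
        pass
    keys = object_keys(tmp, "price", 1)
    print(keys)
```

Keeping the version in the key lets several model versions sit side by side in the same bucket.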

## Run the Pipeline
Kubeflow Pipelines lets you group pipeline runs by Experiments.

In [13]:
def train_and_serve(
    data_dir: str,
    clean_data_dir: str,
    model_dir: str,
    export_bucket: str,
    model_name: str,
    model_version: int,
):
    # For GPU support, please add the "-gpu" suffix to the base image
    BASE_IMAGE = "mesosphere/kubeflow:1.0.1-0.5.0-tensorflow-2.2.0"

    downloadOp = components.func_to_container_op(
        download_dataset, base_image=BASE_IMAGE
    )()
    
    preprocessOp = components.func_to_container_op(preprocess, base_image=BASE_IMAGE)(
        downloadOp.output
    )
    trainOp = components.func_to_container_op(train_model, base_image=BASE_IMAGE)(
        preprocessOp.output
    )

    predictionOp = components.func_to_container_op(prediction, base_image=BASE_IMAGE)(
        preprocessOp.output, trainOp.output
    )

    exportOp = components.func_to_container_op(export_model, base_image=BASE_IMAGE)(
        trainOp.output, predictionOp.output, export_bucket, model_name, model_version
    )

In [14]:

def op_transformer(op):
    op.add_pod_annotation(name="sidecar.istio.io/inject", value="false")
    return op


@dsl.pipeline(
    name="End-to-End House Price Prediction",
    description="A sample pipeline to demonstrate multi-step model training, evaluation and export",
)
def price_prediction_pipeline(
    model_dir: str = "/train/model",
    data_dir: str = "/train/data",
    clean_data_dir: str= "/train/data",
    export_bucket: str = "price",
    model_name: str = "price",
    model_version: int = 1,
):
    train_and_serve(
        data_dir=data_dir,
        clean_data_dir=clean_data_dir,
        model_dir=model_dir,
        export_bucket=export_bucket,
        model_name=model_name,
        model_version=model_version,
    )
    dsl.get_pipeline_conf().add_op_transformer(op_transformer)

In [15]:
pipeline_func = price_prediction_pipeline
run_name = pipeline_func.__name__ + " run"
experiment_name = "End-to-End-Demo"

arguments = {
    "model_dir": "/train/model",
    "data_dir": "/train/data",
    "clean_data_dir": "/train/data",
    "export_bucket": "price",
    "model_name": "price",
    "model_version": "1",
}

kfp.compiler.Compiler().compile(pipeline_func,  'price_prediction.yaml')
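
The cell above compiles the pipeline but does not submit a run. The following is a hedged sketch of how a run could be submitted with the SDK client, assuming the notebook can reach the Kubeflow Pipelines API with default in-cluster settings. The helper names `build_run_arguments` and `submit_run` are made up for illustration, and `model_version` is passed as an integer to match the pipeline signature.

```python
def build_run_arguments(model_version: int = 1) -> dict:
    """Arguments matching the defaults of price_prediction_pipeline."""
    return {
        "model_dir": "/train/model",
        "data_dir": "/train/data",
        "clean_data_dir": "/train/data",
        "export_bucket": "price",
        "model_name": "price",
        "model_version": model_version,
    }


def submit_run(pipeline_func, arguments, experiment_name="End-to-End-Demo"):
    """Submit the pipeline; needs the kfp SDK and a reachable KFP endpoint."""
    import kfp  # imported lazily so build_run_arguments works without the SDK

    client = kfp.Client()  # assumption: in-cluster defaults; pass host=... otherwise
    return client.create_run_from_pipeline_func(
        pipeline_func,
        arguments=arguments,
        run_name=pipeline_func.__name__ + " run",
        experiment_name=experiment_name,
    )


arguments = build_run_arguments()
print(arguments)
```

In the notebook, `submit_run(price_prediction_pipeline, arguments)` would then create the run under the given experiment.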
    