## In this notebook, we build a machine learning model to predict housing market values.

## 1. Importing libraries

First, we import the libraries used to load and augment the data, as well as the library used to train the model.

In [4]:
import pandas as pd
from pycaret.time_series import *

## 2. Loading the data

Next, we load and display the raw data file to check what kind of data we are dealing with.

In [5]:
# load the data
data = pd.read_csv(".\\Datasets\\price_paid_records.csv")
data.columns = data.columns.str.lower()
data

| | transaction unique identifier | price | date of transfer | property type | old/new | duration | town/city | district | county | ppdcategory type | record status - monthly file only |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | {81B82214-7FBC-4129-9F6B-4956B4A663AD} | 25000 | 1995-08-18 00:00 | T | N | F | OLDHAM | OLDHAM | GREATER MANCHESTER | A | A |
| 1 | {8046EC72-1466-42D6-A753-4956BF7CD8A2} | 42500 | 1995-08-09 00:00 | S | N | F | GRAYS | THURROCK | THURROCK | A | A |
| 2 | {278D581A-5BF3-4FCE-AF62-4956D87691E6} | 45000 | 1995-06-30 00:00 | T | N | F | HIGHBRIDGE | SEDGEMOOR | SOMERSET | A | A |
| 3 | {1D861C06-A416-4865-973C-4956DB12CD12} | 43150 | 1995-11-24 00:00 | T | N | F | BEDFORD | NORTH BEDFORDSHIRE | BEDFORDSHIRE | A | A |
| 4 | {DD8645FD-A815-43A6-A7BA-4956E58F1874} | 18899 | 1995-06-23 00:00 | S | N | F | WAKEFIELD | LEEDS | WEST YORKSHIRE | A | A |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 22489343 | {4C4EE000-291A-1854-E050-A8C063054F34} | 175000 | 2017-02-20 00:00 | S | N | F | LEEDS | LEEDS | WEST YORKSHIRE | A | A |
| 22489344 | {4C4EE000-291B-1854-E050-A8C063054F34} | 586945 | 2017-02-15 00:00 | D | N | F | WETHERBY | LEEDS | WEST YORKSHIRE | A | A |
| 22489345 | {4C4EE000-291C-1854-E050-A8C063054F34} | 274000 | 2017-02-24 00:00 | D | N | L | HUDDERSFIELD | KIRKLEES | WEST YORKSHIRE | A | A |
| 22489346 | {4C4EE000-291D-1854-E050-A8C063054F34} | 36000 | 2017-02-22 00:00 | T | N | F | HALIFAX | CALDERDALE | WEST YORKSHIRE | A | A |


## 3. Data augmentation

**3.1 Dropping `transaction unique identifier`, `property type`, `old/new`, `duration`, `town/city`, `district`, `county`, `ppdcategory type` and `record status - monthly file only`**

For this model we dropped the columns `transaction unique identifier`, `property type`, `old/new`, `duration`, `town/city`, `district`, `county`, `ppdcategory type` and `record status - monthly file only`.

These columns were dropped because the model bases its predictions on the date alone and uses no other features.

We also dropped every row containing a null value. After the columns above are removed, the dataset has only two columns: the date the model bases its predictions on and the price it has to predict. If either value is missing, the row cannot be used for training.

Lastly, we sorted the rows by `date of transfer` so that the model receives the data in chronological order.

In [6]:
data = data.drop(columns=['transaction unique identifier', 'property type', 'old/new', 'duration', 'town/city', 'district', 'county', 'ppdcategory type', 'record status - monthly file only'])
data = data.dropna()  # Handle missing values, if any
data = data.sort_values(by='date of transfer')
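
To illustrate what `dropna()` and `sort_values()` do here, the following minimal sketch runs the same two calls on a tiny, made-up frame. The three rows are hypothetical and are not taken from the real dataset.

```python
import pandas as pd

# Hypothetical toy data: the second row has a missing price
toy = pd.DataFrame({
    "price": [250000, None, 90000],
    "date of transfer": ["1996-03-01 00:00", "1995-07-15 00:00", "1995-01-10 00:00"],
})

# dropna() removes the incomplete row (both columns are kept),
# sort_values() then puts the remaining rows in chronological order
cleaned = toy.dropna().sort_values(by="date of transfer")
print(cleaned)
```

Both columns survive; only the row with the missing price is removed.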

**3.2 Grouping the dataset by month and year**

Next, we converted the values of the `date of transfer` column to datetime values so that we could extract the month and year of each transaction. We then created a new column called `yearmonth`, which holds the year and the zero-padded month as a string, separated by a `'-'` (for example `1995-08`).

Lastly, we grouped the data by the `yearmonth` column and summed the prices within each group. The result has one row per month, containing the total price of all transactions in that month.

In [7]:
# Sum of prices for each month
data['date of transfer'] = pd.to_datetime(data['date of transfer'])
data['yearmonth'] = data["date of transfer"].dt.year.astype(str) + '-' + data["date of transfer"].dt.month.astype(str).str.zfill(2)
data = data.groupby('yearmonth')['price'].sum().reset_index()
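
As a small worked example of this grouping, the sketch below applies the same idea to four made-up transactions. The toy values are hypothetical, and `dt.strftime("%Y-%m")` is used here as a shorter way to build the same zero-padded `YYYY-MM` label.

```python
import pandas as pd

# Hypothetical toy transactions (not taken from the real dataset)
toy = pd.DataFrame({
    "price": [100, 200, 300, 400],
    "date of transfer": ["1995-01-05 00:00", "1995-01-20 00:00",
                         "1995-02-03 00:00", "1995-02-28 00:00"],
})

toy["date of transfer"] = pd.to_datetime(toy["date of transfer"])
# Build the "YYYY-MM" label for every transaction
toy["yearmonth"] = toy["date of transfer"].dt.strftime("%Y-%m")
# One row per month, holding the sum of all prices in that month
monthly = toy.groupby("yearmonth")["price"].sum().reset_index()
print(monthly)
```

The two January prices collapse into one row and the two February prices into another.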

**3.3 Splitting the dataset into train and test data**

**3.3.1 Creating the train dataset**

We then split this data into training and test data. We used the first 215 rows as training data, which is ~80% of the complete dataset. Because the rows are in chronological order, taking the first 215 rows means the model is trained on the earlier months and can use those prices as a reference for the later ones.

We then saved this data to the `subset_80.csv` file.

In [8]:
# train
subset_data = data.head(215)
subset_file_path = '.\\Datasets\\subset_80.csv'
subset_data.to_csv(subset_file_path, index=False)

**3.3.2 Creating the testing dataset**

We use the rows from index 215 to the end as test data. This is the remaining ~20% of the complete dataset.

We then save this data to the `test_20.csv` file.

In [9]:
# test
test_data = data.iloc[215:]
test_file_path = '.\\Datasets\\test_20.csv'
test_data.to_csv(test_file_path, index=False)
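
The split above hard-codes the row count. As a minimal sketch of the same chronological split driven by a fraction instead, the helper below (`chronological_split` is a hypothetical name, not part of the notebook) cuts an already sorted frame at a given proportion. The ten-row frame is made-up data.

```python
import pandas as pd

def chronological_split(frame, train_fraction=0.8):
    """Split an already chronologically sorted frame into (train, test)."""
    cutoff = int(len(frame) * train_fraction)
    # Earlier rows go to training, later rows to testing; no shuffling
    return frame.iloc[:cutoff], frame.iloc[cutoff:]

# Hypothetical monthly totals for ten consecutive months
toy = pd.DataFrame({
    "yearmonth": [f"2000-{month:02d}" for month in range(1, 11)],
    "price": list(range(10)),
})

train_part, test_part = chronological_split(toy)
print(len(train_part), len(test_part))
```

Keeping the rows in order matters for time series data: a shuffled split would let the model see months that come after the ones it is tested on.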

## 4. Training the SageMaker model

The following code was run in AWS SageMaker.

**4.1 Importing the libraries**

First we import the libraries used for loading the data, processing the data and training the model.

In [None]:
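import boto3
import pandas as pd
import sagemaker
from sagemaker import get_execution_role
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split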

**4.2 Loading the data**

First we define the bucket and the file to load, and read the file through a presigned URL. Then we convert all column names to lowercase.

In [None]:
#load the data
bucket_name = "ukhouseholding"
file_key = "subset_80.csv"
s3 = boto3.client("s3")
url = s3.generate_presigned_url(
    ClientMethod="get_object",
    Params={"Bucket": bucket_name, "Key": file_key},
    ExpiresIn=3600  # URL expires in 1 hour
)
data = pd.read_csv(url)
data.columns = data.columns.str.lower()
data

In [None]:
data = data.drop(columns=['transaction unique identifier', 'property type', 'old/new', 'duration', 'town/city', 'district', 'county', 'ppdcategory type', 'record status - monthly file only'], errors='ignore')  # skip columns already removed in section 3
data = data.dropna()  # Handle missing values, if any
data

In [None]:
date_col = 'date of transfer' if 'date of transfer' in data.columns else 'yearmonth'  # raw file vs. aggregated subset
data[date_col] = pd.to_datetime(data[date_col])

In [None]:
data["price"] = data["price"].astype(float)

# Split the data into features (X) and target (y)
X = data.drop(columns=["price"])  # Features
y = data["price"]                 # Target

# Combine the features and target into a single DataFrame for training
train_data = pd.concat([y, X], axis=1)

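# Save the training data to a local CSV file. SageMaker's built-in XGBoost expects
# CSV input without a header row and with the target in the first column.
train_data.to_csv("train_data.csv", index=False, header=False)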
s3_client = boto3.client("s3")
s3_prefix = "train/train_data.csv"
s3_client.upload_file("train_data.csv", bucket_name, s3_prefix)
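
SageMaker's built-in XGBoost reads CSV training data as numeric values, with the target in the first column and no header row. The sketch below shows one possible way to turn a `yearmonth` string into numeric features before writing such a file. Splitting the date into `year` and `month` columns is an assumption made for illustration, not necessarily the encoding used for this model, and the three rows are made-up values.

```python
import io
import pandas as pd

# Hypothetical monthly totals in the same shape as subset_80.csv
monthly = pd.DataFrame({
    "yearmonth": ["1995-01", "1995-02", "1996-01"],
    "price": [300.0, 700.0, 500.0],
})

dates = pd.to_datetime(monthly["yearmonth"], format="%Y-%m")
# Target first, followed by numeric features only (assumed encoding)
train = pd.DataFrame({
    "price": monthly["price"],
    "year": dates.dt.year,
    "month": dates.dt.month,
})

# Write without index and without header; a buffer stands in for the local file
buffer = io.StringIO()
train.to_csv(buffer, index=False, header=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Each output line then starts with the price, followed by the year and the month.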

In [None]:
session = sagemaker.Session()
role = get_execution_role()

In [None]:
bucket_name = "ukhouseholding"  # Replace with your bucket name

In [None]:
container = sagemaker.image_uris.retrieve("xgboost", session.boto_region_name, version="1.5-1")

In [None]:
xgboost = Estimator(
    image_uri=container,
    role=role,
    instance_count=1,
    instance_type="ml.p3.2xlarge",  # GPU instance type
    output_path=f"s3://{bucket_name}/ukhouseholding-xgboost",  # Output path in S3
    sagemaker_session=session,
    base_job_name="ukhouseholding-xgboost-job"
)

In [None]:
xgboost.set_hyperparameters(
    objective="reg:squarederror",  # Regression with squared-error loss
    num_round=100,            # Number of training rounds
    max_depth=5,              # Maximum tree depth
    eta=0.2,                  # Learning rate
    gamma=4,                  # Minimum loss reduction
    subsample=0.8,            # Subsample ratio of training instances
    colsample_bytree=0.8      # Subsample ratio of columns
)

In [None]:
train_input = TrainingInput(
    s3_data=f"s3://{bucket_name}/{s3_prefix}",  # S3 path to training data
    content_type="text/csv"     # Data format
)
try:
    xgboost.fit({"train": train_input})
except Exception as e:
    print(f"Error: {e}")

In [None]:
sm_client = boto3.client("sagemaker")

# Check if the job exists
job_name = xgboost.latest_training_job.name
response = sm_client.describe_training_job(TrainingJobName=job_name)
print(response)
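
The full response is a large dictionary. As a minimal sketch, the helper below (`summarize_training_job` is a hypothetical name) pulls out the fields that usually matter: the job status, the model artifact location and, if present, the failure reason. The sample dictionary is hand-written for illustration, with placeholder bucket and job names; it is not real output from this training job.

```python
def summarize_training_job(response):
    """Return (status, artifact_uri, failure_reason) from a describe_training_job response."""
    status = response["TrainingJobStatus"]
    # The artifact location is only meaningful once the job has completed
    artifact_uri = response.get("ModelArtifacts", {}).get("S3ModelArtifacts")
    # FailureReason is only present when the job failed
    failure_reason = response.get("FailureReason")
    return status, artifact_uri, failure_reason

# Hand-written stand-in for a real response (placeholder names)
sample_response = {
    "TrainingJobName": "example-job",
    "TrainingJobStatus": "Completed",
    "ModelArtifacts": {
        "S3ModelArtifacts": "s3://example-bucket/example-job/output/model.tar.gz"
    },
}

status, artifact_uri, failure_reason = summarize_training_job(sample_response)
print(status, artifact_uri, failure_reason)
```

Reading the artifact location from the response avoids having to copy the job name by hand.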

In [None]:
model_artifact_s3_uri = "s3://ukhouseholding/ukhouseholding-xgboost/ukhouseholding-xgboost-job-2024-11-27-08-27-55-012/output/model.tar.gz"

# Parse bucket and key from the URI
parsed_uri = model_artifact_s3_uri.replace("s3://", "").split("/")
bucket_name = parsed_uri[0]
key = "/".join(parsed_uri[1:])

# Initialize S3 client
s3 = boto3.client("s3")

# Download the model artifact locally
local_file_path = "model.tar.gz"  # Specify the desired local file name
s3.download_file(bucket_name, key, local_file_path)

print(f"Model saved locally as {local_file_path}")
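
The bucket and key can also be extracted with the standard library instead of manual string splitting. The sketch below uses `urllib.parse.urlparse` for this; `split_s3_uri` is a hypothetical helper name, and it performs no network access.

```python
from urllib.parse import urlparse

def split_s3_uri(uri):
    """Split an s3://bucket/key URI into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not a valid S3 URI: {uri}")
    # netloc holds the bucket name, path holds the key with a leading slash
    return parsed.netloc, parsed.path.lstrip("/")

bucket, key = split_s3_uri(
    "s3://ukhouseholding/ukhouseholding-xgboost/"
    "ukhouseholding-xgboost-job-2024-11-27-08-27-55-012/output/model.tar.gz"
)
print(bucket)
print(key)
```

The explicit scheme check makes a mistyped URI fail early rather than producing a wrong bucket name.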