### Part 0: Setup
First, we need to import some libraries that are necessary to complete the assignment.

In [None]:
import pandas as pd
import numpy as np

from PIL import Image, ImageOps
import os

import torch
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms
import torch.nn as nn
import torch.optim as optim

from collections import defaultdict

Add additional modules/libraries to import here (rather than wherever you first use them below):

In [None]:
# additional modules/libraries to import
# (numpy and torch are already imported above, so they are not repeated here)
import torchvision
from sklearn.decomposition import PCA
from plotnine import *
from sklearn.svm import OneClassSVM
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.impute import KNNImputer, SimpleImputer
import xgboost
!pip install tensorflow
from torch.utils.tensorboard import SummaryWriter
%load_ext tensorboard
writer = SummaryWriter()

Packages such as [TorchMetrics](https://torchmetrics.readthedocs.io/en/stable/) provide more options for evaluation metrics than what PyTorch natively provides. Here, we install the package and load their implementation of MSE, since it natively supports RMSE (which we will use in Part 2 instead of implementing it from scratch).
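
For reference, RMSE is the square root of the mean squared difference between predictions and targets. The following is a minimal NumPy sketch of that formula, not the TorchMetrics implementation; the function name `rmse` and the example prices are made up for illustration.

```python
import numpy as np

def rmse(pred, target):
    """Root mean squared error between two equal-length arrays."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    return float(np.sqrt(np.mean((pred - target) ** 2)))

# hypothetical list prices in dollars, for illustration only
example_pred = [300_000, 450_000, 500_000]
example_true = [310_000, 440_000, 520_000]
print(rmse(example_pred, example_true))  # roughly 14142.14
```

Because of the square root, the value is in the same units as the target (dollars here), which makes it easier to interpret than plain MSE.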

In [None]:
!pip install torchmetrics

In [None]:
from torchmetrics import MeanSquaredError

In [None]:
device = "cpu"
if (torch.cuda.is_available()):
  device = "cuda"
print("device: " + device)

In [None]:
# note that this command will trigger a request from google to allow colab
# to access your files: you will need to accept the terms in order to access
# the files this way
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

# if you followed the instructions above exactly, your zipped data file should
# be located at the file path below; if your files are in a different directory
# on your Google Drive, you will need to change the path below accordingly
ZIPPATH = '/content/drive/My Drive/comp341/comp341-hw7.zip'

In [None]:
!cp "{ZIPPATH}" .
!unzip -q "comp341-hw7.zip"
!rm "comp341-hw7.zip"

In your local Colab instance, you should now have a `house_imgs/` directory containing images of homes (from both the training and test sets), as well as two CSV files: `home_data_train.csv` and `home_data_test.csv`.

### Part 1: Exploring the Home Images
We have explored various tabular data extensively, especially in the context of dimensionality reduction when we have many features. One way to think about images is to consider each individual pixel (and channel) as an individual feature. Even with relatively small images like we have in our dataset, the dimensionality explodes pretty quickly, so let's explore if the dimensionality reduction methods we covered early on in the course can help us make sense of the data before we do any type of more sophisticated feature extraction.
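
To get a feel for how quickly the dimensionality grows, here is a small sketch that counts the features produced by flattening a single RGB image. The 128 x 128 size is an assumption for illustration; the actual house images may have a different size.

```python
import numpy as np

# hypothetical image size, chosen only for illustration
height, width, channels = 128, 128, 3

# a blank stand-in for one RGB image
fake_image = np.zeros((height, width, channels), dtype=np.uint8)

# flattening turns every pixel/channel value into its own feature
flat = fake_image.flatten()
n_features = flat.shape[0]
print(n_features)  # 128 * 128 * 3 = 49152
```

Even at this modest size, each image has tens of thousands of features, far more than the number of columns in a typical tabular dataset.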

In [None]:
# we provide some simple code to read in each training image and flatten
# the pixel-based values to a tidy DataFrame, where each row is a house image
# and each column is a feature (the R/G/B value at an individual pixel location)

# get houseids for homes in the training dataset
home_train = pd.read_csv('home_data_train.csv')
img_ids = home_train['houseid'].astype(str).tolist()

img_vect = []
for idx in img_ids:
  infile = os.path.join("house_imgs", idx + ".jpg")
  with Image.open(infile) as im:
    img_vect.append(np.asarray(im).flatten())

pixel_df = pd.DataFrame(np.vstack(img_vect))

Here, we store the flattened pixel values for each image, but we may want to look at the original image that corresponds to these flattened vectors. 
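
Flattening is reversible as long as the original shape is known. The sketch below shows the round trip on a tiny random array; the helper name `unflatten` and the 4 x 5 size are hypothetical. An array restored this way could then be passed to `Image.fromarray` to obtain a PIL image.

```python
import numpy as np

def unflatten(flat_pixels, height, width, channels=3):
    """Reshape a flattened pixel vector back to (height, width, channels)."""
    return np.asarray(flat_pixels).reshape(height, width, channels)

# tiny random "image" used only to demonstrate the round trip
rng = np.random.default_rng(0)
original = rng.integers(0, 256, size=(4, 5, 3), dtype=np.uint8)

restored = unflatten(original.flatten(), 4, 5)
print(np.array_equal(original, restored))  # True
```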

In [None]:
# TODO: write a simple function that, given a single house id as input (aka one of the elements
# in img_ids), loads the image file from its location in LOCALDIR and displays it in your notebook directly
# Hint: this is essentially a simpler version of the display_data function we provided in
# Part 2 of the assignment, so you may be able to use some of the similar methods referenced
# there as well as the provided HouseImagesDataset code

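def display_data(house_id):
  # loads the image for the given house id and displays it in the notebook
  # (the parameter is named house_id to avoid shadowing Python's built-in id)
  image = Image.open(os.path.join("house_imgs", str(house_id) + ".jpg"))
  display(image)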

In [None]:
# TODO: test your function on one of the images in img_ids
display_data(img_ids[0])

Now, use a dimensionality reduction method that we covered in class that you think is appropriate for this problem. 
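
As a reminder of what one candidate method, PCA, computes, here is a minimal NumPy sketch that projects data onto its first two principal components using the SVD. It is a sketch on random toy data, not a replacement for a library implementation; the function name `pca_2d` and the toy array are made up for illustration.

```python
import numpy as np

def pca_2d(X):
    """Project the rows of X onto the first two principal components."""
    X = np.asarray(X, dtype=float)
    # PCA operates on mean-centered data
    X_centered = X - X.mean(axis=0)
    # rows of Vt are the principal directions, ordered by decreasing singular value
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:2].T

# toy stand-in for the pixel matrix: 20 "images" with 50 "pixels" each
rng = np.random.default_rng(0)
toy_pixels = rng.normal(size=(20, 50))
coords = pca_2d(toy_pixels)
print(coords.shape)  # (20, 2)
```

Each row of `coords` is one image reduced to a single 2D point, which is the form needed for a scatter plot.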

In [None]:
# TODO: calculate the 2D reduced dimensionality space and plot it (each image is a single point)
# Note: depending on the dimensionality reduction method chosen,
# this step can take a couple of minutes to complete

PCA_pixel = pixel_df.copy(deep=True)

#Uses PCA on the pixel df and plots the results
pca = PCA(n_components=2)
PCA_pixel = pd.DataFrame(pca.fit_transform(PCA_pixel), columns=['x', 'y'])

(ggplot(PCA_pixel, aes(x='x', y='y'))
+ geom_point()
+ theme(figure_size=(12, 12))
)

If we look closely at some of our house images, we can see that some listings show a schematic or floorplan instead of a "real" picture of the house. Two such examples are the houses with `houseid` 4112 and 7758.


In [None]:
# TODO: use your image display function to verify that these 2 houseids used drawings instead of real pictures of the house(s).
display_data(4112)
display_data(7758)

In [None]:
# TODO: color the 2 points that correspond to these 2 houses in a new 2D reduced dimensionality plot

#Finds the index of the 2 given house ids 
idx_4112 = home_train.index[home_train['houseid'] == 4112][0]
idx_7758 = home_train.index[home_train['houseid'] == 7758][0]

PCA_pixel_drawings = PCA_pixel.copy(deep=True)

#Adds a boolean column to differentiate the previous house ids
#(selects rows by index label and the column by name rather than by position)
PCA_pixel_drawings['drawing'] = False
PCA_pixel_drawings.loc[[idx_4112, idx_7758], 'drawing'] = True

(ggplot(PCA_pixel_drawings, aes(x='x', y='y', color='drawing'))
+ geom_point()
+ theme(figure_size=(12, 12))
)

In [None]:

#Goes through a sample of about 40 images and labels them to get training data
drawing_idxs = [idx_4112, idx_7758, 50, 596, 752, 1743]
#.copy() avoids modifying a view of PCA_pixel (and the SettingWithCopyWarning)
drawings = PCA_pixel[PCA_pixel.index.isin(drawing_idxs)].copy()
drawings['drawing'] = 1

nondrawing_idxs = [109, 330, 347, 537, 587, 614, 1284, 1556, 1557, 1558, 1559, 1616, 1626, 1793, 1916, 1917, 1924, 1928, 1982, 
                   1984, 2020, 2046, 2091, 2119, 2123, 2184, 2206, 2258, 2359, 2579, 2740, 2769, 2782, 2857, 2961, 2981]
nondrawings = PCA_pixel[PCA_pixel.index.isin(nondrawing_idxs)].copy()
nondrawings['drawing'] = 0

#Using logistic regression to classify the rest of the data as either a drawing or not
drawings = pd.concat([drawings, nondrawings])
y_drawing = drawings.pop('drawing')
X_drawing = drawings

clf = LogisticRegression().fit(X_drawing, y_drawing)
pred_drawings = pd.DataFrame(clf.predict(PCA_pixel), columns=['pred'])

#There are about 49 houses in the training data that have a drawing as their primary image.

#Displays the images that the model predicts as a drawing
for i in pred_drawings.loc[pred_drawings.pred == 1].index:
  print(i, ": ")
  #Looks up the house id by column name instead of assuming it is the first column
  house_id = home_train.loc[i, 'houseid']
  display_data(house_id)

### Part 2: Predicting List Price
Now that we also have some sense of what the house listing images are like based on Part 1, we will set up a regression framework that can use both the tabular features and image features. Along the way, we will also see how our predictions may change depending on what data we use.
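
One detail worth keeping in mind when combining the two kinds of features: they should be joined per house, along the feature axis, so that the batch dimension is preserved. The NumPy sketch below shows the shapes involved; the batch size and feature sizes are hypothetical and only chosen for illustration.

```python
import numpy as np

batch_size = 4
# hypothetical convolution output: (batch, channels, height, width)
conv_features = np.ones((batch_size, 8, 15, 15))
# hypothetical tabular MLP output: (batch, hidden units)
tab_features = np.zeros((batch_size, 16))

# flatten everything except the batch dimension: (4, 8 * 15 * 15) = (4, 1800)
flat_conv = conv_features.reshape(batch_size, -1)

# concatenate along the feature axis, giving one combined row per house
combined = np.concatenate([flat_conv, tab_features], axis=1)
print(combined.shape)  # (4, 1816)
```

Flattening the whole batch into a single vector instead would mix values from different houses, so each prediction would depend on which other houses happen to be in the batch.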

We provide several helper functions for setting up the data and functionality to visualize individual examples, which can sometimes be helpful to get a sense of what the model is doing.

In [None]:
# transforms.ToTensor() converts the 0-255 RGB values to 0-1 tensors, but it can
# also be beneficial to standardize the values (or, as we
# see here, subtract the mean RGB values from the images)

# these transformations below help facilitate this
# inv_normalize is provided mainly for visualization sake, so that 
# we can flip the standardization process to see the image in its 
# original colors

house_mean = [0.5230, 0.5416, 0.4989]
# house_sd = [0.2271, 0.2162, 0.2640]
# only subtracting mean and not also dividing by standard deviation
# can actually sometimes work better, which is what we are doing here
house_sd = [1, 1, 1]

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(house_mean, house_sd)
])

inv_normalize = transforms.Normalize(
   mean= [-m/s for m, s in zip(house_mean, house_sd)],
   std= [1/s for s in house_sd]
)


In [None]:
# convenient function for displaying images
# by default, will reverse the standardization calculation so that we can
# see the images in a "normal" color scheme
def display_data(d, inv_norm=True):
  if isinstance(d['houseid'], list): # we can handle a list of houses
    batch_size = len(d['houseid'])
    for i in range(batch_size):
      if 'price' in d:
        print('price:', "${:,.0f}".format(d['price'][i]))

      if inv_norm:
        display(transforms.ToPILImage()(inv_normalize(d['image'][i])))
      else:
        display(transforms.ToPILImage()(d['image'][i]))
  else: # only an individual house to be displayed
    if 'price' in d:
      print('price:', "${:,.0f}".format(d['price']))

    if inv_norm:
      display(transforms.ToPILImage()(inv_normalize(d['image'])))
    else:
      display(transforms.ToPILImage()(d['image'])) 

In [None]:
class HouseImagesDataset(Dataset):
    def __init__(self, annot_file, image_dir, train=True):
        # the annotation file is tidy, aka each row is a unique observation in the dataset,
        # but it is not yet clean, which you will address in the TODO below
        df = pd.read_csv(annot_file)

        #Cleaning / preprocessing of features in df

        #Imputes the missing numeric values using KNN imputation
        KNN_imputer = KNNImputer(n_neighbors=5, weights='distance')
        num_df = df[['beds', 'baths', 'sqft', 'lot_size']]
        num_df = pd.DataFrame(KNN_imputer.fit_transform(num_df), columns = num_df.columns)
        #Scales the choice numeric features
        scaler = StandardScaler()
        num_df = pd.DataFrame(scaler.fit_transform(num_df), columns=num_df.columns)

        #Imputes the missing categorical values using simple imputation of the most frequent value
        simp_imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
        cat_df = df[['property_type']]
        cat_df = pd.DataFrame(simp_imputer.fit_transform(cat_df), columns = cat_df.columns)
        #One-hot encodes choice categorical features (as floats, so they convert cleanly to tensors)
        cat_df = pd.get_dummies(cat_df, dtype=float)

        #Combines the scaled numeric features with the one-hot encoded categorical features
        clean_df = pd.concat([num_df, cat_df], axis=1)

        # TODO: fill in this feature_cols list with the column names of
        # features you would like to use to predict list price (many of the columns 
        # will likely be transformed from the original data in annot_file)
        # only the cleaned predictors are used: houseid is an identifier, and
        # including list_price would leak the target into the inputs
        self.feature_cols = list(clean_df.columns)

        #Finally, adds the houseid and list_price columns back (not used as features)
        clean_df = pd.concat([df['houseid'], clean_df], axis=1)
        if train:
          clean_df = pd.concat([clean_df, df['list_price']], axis=1)

        self.house_annot = clean_df
        self.image_dir = image_dir
        self.train=train

    def __len__(self):
        # TODO: fill in this method (replacing pass) to return the length of the dataset
        return len(self.house_annot)

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()

        # we have provided code that will load and transform the indexed ("ith") image
        # as well as features specified earlier in self.feature_cols within the processed
        # pandas DataFrame

        img = Image.open(os.path.join(self.image_dir, str(self.house_annot.loc[idx, 'houseid']) + '.jpg'))
        img = transform(img)

        features = self.house_annot.loc[idx, self.feature_cols]
        features = features.tolist()
        features = torch.FloatTensor(features)

        # depending on whether the Dataset is in training mode, we will have the price data or not
        if self.train:
            item = {'image': img,
                    'houseid': self.house_annot.loc[idx, 'houseid'],
                    'features': features,
                    'price': torch.tensor(self.house_annot.loc[idx, 'list_price'], dtype=torch.float)}
        else:
            item = {'image': img,
                    'houseid': self.house_annot.loc[idx, 'houseid'],
                    'features': features}

        return item

In [None]:
# TODO: initialize the house dataset using the training data you were provided and check the length of the dataset
house_dataset = HouseImagesDataset("home_data_train.csv", "house_imgs/", train=True)

print(len(house_dataset))

# number of tabular features per house; the models below use this as their input width
N_TAB_FEATS = len(house_dataset.feature_cols)
print(N_TAB_FEATS)

In [None]:
# TODO: check that the data loads properly by calling the provided display_data function with a specifically indexed
# item in your house dataset (e.g., house_dataset[3])

display_data(house_dataset[3])

In [None]:
# TODO: use the convenient torch.utils.data.random_split function to split your loaded dataset into training and 
# validation portions, using 75% of the data for training and 25% of the data for validation

# computes the split sizes from the dataset length instead of hard-coding them
n_train = int(0.75 * len(house_dataset))
n_valid = len(house_dataset) - n_train
house_train, house_valid = torch.utils.data.random_split(house_dataset, lengths=[n_train, n_valid])

print(len(house_train))
print(len(house_valid))

In [None]:
class HybridHouseNN(nn.Module):
  def __init__(self, n_feats=N_TAB_FEATS): 
    super().__init__()
    # TODO: set up some convolutional layers
    # it is your choice as to how many convolutional blocks, as well as
    # specifics within the blocks: activation function, pooling, etc

    conv1 = nn.Conv2d(3, 8, kernel_size=5, stride=1, padding=0)
    a1 = nn.ReLU()
    pool1 = nn.MaxPool2d(kernel_size=8)

    # TODO: set up an MLP for the tabular features that you will be
    # inputting into this model

    fc1 = nn.Linear(n_feats, 16)
    a2 = nn.ReLU()

    # TODO: set up the final set of fully connected layers that
    # takes as input the concatenated set of flattened convolution features
    # together with the output of the MLP from the tabular features
    # to eventually output a single non-negative prediction

    # per house, the flattened convolution output has 8 * 15 * 15 = 1800 values
    # and the tabular MLP contributes 16 more
    fc2 = nn.Linear(1800 + 16, 2000)
    a3 = nn.Sigmoid()
    fc3 = nn.Linear(2000, 1)

    self.image_list = nn.ModuleList([conv1, a1, pool1])
    self.tabular_list = nn.ModuleList([fc1, a2])
    self.combined_list = nn.ModuleList([fc2, a3, fc3])
        
  def forward(self, ximg, xfeats): 
    # TODO: write out the forward pass steps
    # note that forward now has 2 inputs because we are using both
    # images and non-image features separately at first, before
    # merging them together for the final set of predictions
    # Note: you may also need to adjust the shape of your final prediction
    # so that it plays nice with the loss function etc.

    #Passes the features through their respective layers
    for func in self.image_list:
      ximg = func(ximg)
    for func in self.tabular_list:
      xfeats = func(xfeats)
    
    #Flattens the convolution output per house, keeping the batch dimension
    ximg = torch.flatten(ximg, start_dim=1)

    #Passes the combined features through the final set of layers
    xcombined = torch.cat((ximg, xfeats), dim=1)
    for func in self.combined_list:
      xcombined = func(xcombined)

    #Drops the trailing dimension so the output has shape (batch,), matching price
    return xcombined.squeeze(-1)

Before training our model, we also want to set up some additional models to see what the differences might be if we use *only* images or *only* the tabular features for our predictions. Of course, if we set the models up differently with different hyperparameters, we cannot have a truly equivalent comparison, but we will try to keep as many of the model blocks the same as possible.

In [None]:
class HouseImageOnly(nn.Module):
  def __init__(self): 
    super().__init__()

    #Convolutional Layer
    conv1 = nn.Conv2d(3, 8, kernel_size=5, stride=1, padding=0)
    a1 = nn.ReLU()
    pool1 = nn.MaxPool2d(kernel_size=8)

    #Fully connected layers (8 * 15 * 15 = 1800 flattened values per house)
    fc1 = nn.Linear(1800, 2000)
    a2 = nn.Sigmoid()
    fc2 = nn.Linear(2000, 1)

    self.conv_pool_list = nn.ModuleList([conv1, a1, pool1])
    self.fcl_list = nn.ModuleList([fc1, a2, fc2])
        
  def forward(self, ximg): 
    #Pass the image through the convolutional layers
    for func in self.conv_pool_list:
      ximg = func(ximg)

    #Flatten per house (keeping the batch dimension) and pass through the fully connected layers
    ximg = torch.flatten(ximg, start_dim=1)
    for func in self.fcl_list:
      ximg = func(ximg)

    #Output shape (batch,), matching price
    return ximg.squeeze(-1)

In [None]:
class HouseFeatsOnly(nn.Module):
  def __init__(self, n_feats=N_TAB_FEATS): 
    super().__init__()

    #First layer from the MLP
    fc1 = nn.Linear(n_feats, 16)
    a1 = nn.ReLU()

    #Fully connected layers (the MLP outputs 16 values per house)
    fc2 = nn.Linear(16, 2000)
    a2 = nn.Sigmoid()
    fc3 = nn.Linear(2000, 1)

    self.mlp_list = nn.ModuleList([fc1, a1])
    self.fcl_list = nn.ModuleList([fc2, a2, fc3])
        
  def forward(self, xfeats): 
    #Pass the tabular features through the MLP
    for func in self.mlp_list:
      xfeats = func(xfeats)

    #Pass the MLP output through the fully connected layers
    for func in self.fcl_list:
      xfeats = func(xfeats)

    #Output shape (batch,), matching price
    return xfeats.squeeze(-1)

As mentioned earlier, we will use RMSE for our loss function:

In [None]:
loss_fn = MeanSquaredError(squared=False).to(device)

Now, let's fill in the details of the training and validation methods.
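
As a reminder of what each step of a training loop does, here is a minimal NumPy stand-in that fits a linear model by gradient descent on MSE. It is a sketch on synthetic data, not the PyTorch loop for this assignment: in PyTorch, `loss.backward()` computes the gradient and `opt.step()` applies the update. All names and values in the block are made up for illustration.

```python
import numpy as np

# synthetic regression problem with known weights
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w

w = np.zeros(3)   # parameters to learn
lr = 0.1          # learning rate
losses = []
for step in range(200):
    pred = X @ w                        # forward pass
    error = pred - y
    loss = np.mean(error ** 2)          # mean squared error
    grad = 2 * X.T @ error / len(y)     # gradient of the loss with respect to w
    w = w - lr * grad                   # parameter update
    losses.append(loss)

print(losses[0] > losses[-1])  # True: the loss decreases over training
```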

In [None]:
def train(model, train_loader, opt, epoch, mode="both", verbose=False):
  # mode can be "both", "image", or "features", depending on if we are using
  # our HybridHouseNN, HouseImageOnly, or HouseFeatsOnly model
  # we will assume that the model passed to this function matches the mode,
  # and mode will affect whether the model uses image, features, or a combination
  # as input to get the predictions in the forward pass
  
  if verbose:
    print("starting epoch", epoch)
  train_loss = 0
  # put the model in training mode once for the whole epoch
  model.train(True)
  for i, batch in enumerate(train_loader):
    image, features, price = batch['image'].to(device), \
                             batch['features'].to(device), \
                             batch['price'].to(device)

    # TODO: fill in the code for each of the steps in the
    # training loop, remembering that we want to account for
    # different modes in the forward pass step
    #Reset gradients to zero
    opt.zero_grad()
    #Forward pass
    if mode == "both":
      pred = model(image, features)
    elif mode == "image":
      pred = model(image)
    else:
      pred = model(features)
    #Calculate loss
    loss = loss_fn(pred, price)
    #Backward pass
    loss.backward()
    #Update weight estimates
    opt.step()
    #We are tracking the sum of losses to calculate the average training loss for this epoch
    train_loss += loss.item()
  
    if verbose and ((i % 20) == 0):
      print('training [epoch {}: {}/{} ({:.0f}%)] loss: {:.6f}'.format(
          epoch, i * len(image), len(train_loader.dataset),
          100. * i / len(train_loader), loss.item()))
    if (i % 20) == 0:
      # log the running average of the training loss so far in this epoch
      # (train_loss is a sum over i+1 batches, so we divide by i+1)
      writer.add_scalar('training loss',
                        train_loss / (i + 1),
                        epoch * len(train_loader) + i)

  model.train(False)
  avg_tl = train_loss / (i+1)
  print('epoch {} avg training loss: {:.6f}'.format(epoch, avg_tl))
  return avg_tl

In [None]:
def valid(model, valid_loader, mode="both"):
  # as in train, mode can be "both", "image", or "features", depending on if we are using
  # our HybridHouseNN, HouseImageOnly, or HouseFeatsOnly model
  # we will assume that the model passed to this function matches the mode,
  # and mode will affect whether the model uses image, features, or a combination
  # as input to get the predictions in the forward pass
  valid_loss = 0
  # make sure the model is in evaluation mode
  model.eval()
  with torch.no_grad():
    for i, batch in enumerate(valid_loader):
      image, features, price = batch['image'].to(device), \
                               batch['features'].to(device), \
                               batch['price'].to(device)
      
      # TODO: fill in code to calculate pred (the prediction), paying attention to 
      # different usage of the model depending on the inputted mode variable
      if mode == "both":
        pred = model(image, features)
      elif mode == "image":
        pred = model(image)
      else:
        pred = model(features)
      
      #Get loss
      valid_loss += loss_fn(pred, price).item()


  # get the loss for the epoch
  avg_vl = valid_loss / (i+1)
  print('avg validation loss: {:.6f}'.format(avg_vl))
  
  return avg_vl

In [None]:
batch_size = 64
train_loader = torch.utils.data.DataLoader(house_train, batch_size = batch_size, shuffle=True, drop_last=True)
valid_loader = torch.utils.data.DataLoader(house_valid, batch_size = batch_size, shuffle=True, drop_last=True)

In [None]:
epoch_list = defaultdict(list)
train_loss = defaultdict(list)
valid_loss = defaultdict(list)

#I only did 10 epochs because this takes me hours to run.
epochs = 10
modes = {'both': HybridHouseNN(),'image': HouseImageOnly(),'features': HouseFeatsOnly()}
for m in modes:
  model = modes[m].to(device)
  # TODO: initialize the optimizer (and associated hyperparameters like learning rate) of your choice
  opt = torch.optim.Adam(model.parameters(), lr=.1)

  print("Current mode:", m)
  for e in range(1, epochs+1):
    epoch_list[m].append(e)
    train_loss[m].append(train(model, train_loader, opt, e, m, verbose=True))
    valid_loss[m].append(valid(model, valid_loader, m))

In [None]:
#Gets the number of epochs
epochs = pd.DataFrame(epoch_list['both'], columns=['epochs'])

#Concats the number of epochs, training loss, and validation loss
train_losses = pd.DataFrame(train_loss).rename(columns={'both': 'train_both', 'image': 'train_image', 'features': 'train_features'})
valid_losses = pd.DataFrame(valid_loss).rename(columns={'both': 'valid_both', 'image': 'valid_image', 'features': 'valid_features'})
losses = pd.concat([epochs, train_losses, valid_losses], axis=1)

#Plots each loss type as a separate line with their own color
(ggplot(losses, aes(x='epochs'))
+ geom_line(aes(y='train_both', color="'red'"))
+ geom_line(aes(y='train_image', color="'orange'"))
+ geom_line(aes(y='train_features', color="'yellow'"))
+ geom_line(aes(y='valid_both', color="'blue'"))
+ geom_line(aes(y='valid_image', color="'green'"))
+ geom_line(aes(y='valid_features', color="'purple'"))
+ theme(figure_size=(12, 12))
+ labs(y='Loss')
+ scale_color_identity(guide='legend',name='Loss Type',
                        breaks=['red', 'orange', 'yellow', 'blue', 'green', 'purple'],
                        labels=['train_both','train_image','train_features', 'valid_both', 'valid_image', 'valid_features'])
)

The hybrid model (images plus tabular features) and the image-only model usually perform best, although which of the two comes out ahead changes between runs. Both reach the lowest loss on the training and validation sets. The model that uses only the tabular features, by contrast, performs very poorly.

In [None]:
# Reads in the training and test data
home_train = pd.read_csv('home_data_train.csv')
home_test = pd.read_csv('home_data_test.csv')

num_cols = ['beds', 'baths', 'sqft', 'lot_size']

#Imputes the missing numeric values using KNN imputation and scales them;
#the imputer and scaler are fit on the training data only and then applied to the test data
KNN_imputer = KNNImputer(n_neighbors=5, weights='distance')
scaler = StandardScaler()
num_train = pd.DataFrame(scaler.fit_transform(KNN_imputer.fit_transform(home_train[num_cols])), columns=num_cols)
num_test = pd.DataFrame(scaler.transform(KNN_imputer.transform(home_test[num_cols])), columns=num_cols)

cat_train = home_train[['property_type']]
cat_test = home_test[['property_type']]

#Gets the categorical zipcodes
zip_train = pd.Categorical(home_train["zipcode"], categories=home_train["zipcode"].unique().tolist())
cat_train = cat_train.assign(zip_cat = zip_train)
zip_test = pd.Categorical(home_test["zipcode"], categories=home_test["zipcode"].unique().tolist())
cat_test = cat_test.assign(zip_cat = zip_test)

#Imputes the missing categorical values using Simple Imputation of the most frequent value
#(the most frequent values are learned from the training data and reused for the test data)
simp_imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
cat_train = pd.DataFrame(simp_imputer.fit_transform(cat_train), columns = cat_train.columns)
cat_test = pd.DataFrame(simp_imputer.transform(cat_test), columns = cat_test.columns)

#Makes sure both dataframes have the same columns
temp = pd.get_dummies(pd.concat([cat_train[['property_type', 'zip_cat']], cat_test[['property_type', 'zip_cat']]],keys=[0,1]))
cat_train, cat_test = temp.xs(0),temp.xs(1)

#Combines the scaled numeric features with the one-hot encoded categorical features
X_train = pd.concat([num_train, cat_train], axis=1)
X_test = pd.concat([num_test, cat_test], axis=1)

#Combines the tabular features with our PCA reduced image data
train_images = PCA_pixel[['x', 'y']]
X_train = pd.concat([train_images, X_train], axis=1)

#Gets the PCA reduced image data of the test set
test_img_ids = home_test['houseid'].astype(str).tolist()

img_vect = []
for idx in test_img_ids:
  infile = os.path.join("house_imgs", idx + ".jpg")
  with Image.open(infile) as im:
    img_vect.append(np.asarray(im).flatten())

#Projects the test images with the PCA fit on the training images in Part 1 (pca),
#so that x and y mean the same thing in the training and test sets
test_images = pd.DataFrame(pca.transform(np.vstack(img_vect)), columns=['x', 'y'])
X_test = pd.concat([test_images, X_test], axis=1)

y_train = home_train[['list_price']]

#Uses XGBoost to model the list_prices
xgb_reg = xgboost.XGBRegressor(booster='gbtree', eta=0.3, max_depth=6, objective='reg:squarederror', eval_metric='rmse')
xgb_reg.fit(X_train, y_train)
y_pred = xgb_reg.predict(X_test)

In [None]:
results = pd.Series(y_pred.flatten(), name="price")
results = pd.concat([home_test['houseid'], results], axis=1)
results.to_csv('my_submission.csv', index=False)