# Inference on OOT data

Using the champion model, we will score the inference dataframe and save the predictions for submission.

First, let's compare the train, test, and OOT performance of the XGBoost and CatBoost models to determine which model we should use for inference.

### Import Libraries and Path

Libraries

In [1]:
import os

import pandas as pd

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)


In [2]:
import numpy as np
import catboost as cb

print("NumPy version:", np.__version__)
print("CatBoost version:", cb.__version__)


NumPy version: 1.24.4
CatBoost version: 1.2.5


Data path

In [3]:
data_path_in  = "C:/Users/Ahmed/OneDrive/Documents/projects/procore/01_data/inbound/"
data_path_out = "C:/Users/Ahmed/OneDrive/Documents/projects/procore/01_data/outbound/"

In [4]:
function_path = "C:/Users/Ahmed/OneDrive/Documents/projects/procore/06_functions/"
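
The paths above are later joined to file names by string concatenation, which only works as long as each base path ends in a slash. A minimal sketch using `pathlib` avoids that dependency. The helper name `build_paths` is hypothetical, and the folder layout simply mirrors the one shown above.

```python
from pathlib import Path

def build_paths(project_root):
    """Return the inbound, outbound and function folders under a project root."""
    root = Path(project_root)
    return {
        "data_in": root / "01_data" / "inbound",
        "data_out": root / "01_data" / "outbound",
        "functions": root / "06_functions",
    }

# Same project root as in the cells above; adjust to the local machine.
paths = build_paths("C:/Users/Ahmed/OneDrive/Documents/projects/procore")

# Joining with "/" inserts the separator, so no trailing slash is needed.
inference_csv = paths["data_in"] / "inference.csv"
print(inference_csv)
```

`pd.read_csv` accepts path-like objects, so `inference_csv` could be passed to it directly.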

### Load Performance Data

In [5]:
print("\nXGBoost Performance:\n\n")

pd.read_csv(data_path_out + 'xgb/model_performance.csv')



XGBoost Performance:




| | Dataset | RMSE | R^2 |
|---|---|---|---|
| 0 | Train | 103.786528 | 0.446058 |
| 1 | Test | 124.024431 | 0.422472 |
| 2 | OOT | 201.894417 | 0.143924 |
| 3 | CV (Mean) | 124.310478 | 0.344022 |
| 4 | CV Fold 1 | 113.933711 | 0.308497 |
| 5 | CV Fold 2 | 107.061988 | 0.371114 |
| 6 | CV Fold 3 | 116.574085 | 0.436248 |
| 7 | CV Fold 4 | 120.438698 | 0.338384 |
| 8 | CV Fold 5 | 163.543908 | 0.265867 |


In [6]:
print("\nCatboost Performance:\n\n")
pd.read_csv(data_path_out + 'catboost/model_performance.csv')


Catboost Performance:




| | Dataset | RMSE | R^2 |
|---|---|---|---|
| 0 | Train | 99.042034 | 0.495546 |
| 1 | Test | 116.255507 | 0.492559 |
| 2 | OOT | 201.895335 | 0.143916 |
| 3 | CV Fold 1 | 111.130061 | 0.342111 |
| 4 | CV Fold 2 | 100.622342 | 0.444492 |
| 5 | CV Fold 3 | 106.546684 | 0.529062 |
| 6 | CV Fold 4 | 114.533518 | 0.401672 |
| 7 | CV Fold 5 | 160.668779 | 0.291452 |
| 8 | CV Mean | 118.700277 | 0.401758 |


##### Though both models underperform on an OOT sample of 200 listings (OOT RMSE is about 201.9 for both), the CatBoost model achieves a lower RMSE and a higher R^2 than the XGBoost model on Train, Test, and CV mean. Hence it is the model we will use for inference.
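
To make the comparison explicit, here is a small sketch that computes the RMSE gap between the two models per dataset. The numbers are copied from the two performance tables above; the helper name `rmse_gaps` is hypothetical.

```python
# RMSE values copied from the two performance tables above.
xgb_rmse = {"Train": 103.786528, "Test": 124.024431, "OOT": 201.894417, "CV Mean": 124.310478}
cat_rmse = {"Train": 99.042034, "Test": 116.255507, "OOT": 201.895335, "CV Mean": 118.700277}

def rmse_gaps(candidate, baseline):
    """Return candidate RMSE minus baseline RMSE per dataset (negative = candidate is better)."""
    return {name: candidate[name] - baseline[name] for name in baseline}

gaps = rmse_gaps(cat_rmse, xgb_rmse)
for name, gap in gaps.items():
    winner = "CatBoost" if gap < 0 else "XGBoost"
    print(f"{name:8s} gap = {gap:+.6f}  lower RMSE: {winner}")
```

On these numbers the OOT gap is under 0.001 RMSE, so the two models are effectively tied out of sample and the choice rests on the Train, Test, and CV results.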

### Load Data

In [7]:
df = pd.read_csv(data_path_in + 'inference.csv')

print("df shape:", df.shape)

df shape: (500, 82)


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 82 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   Unnamed: 0                        500 non-null    int64  
 1   id                                500 non-null    int64  
 2   listing_url                       500 non-null    object 
 3   scrape_id                         500 non-null    float64
 4   last_scraped                      500 non-null    object 
 5   name                              499 non-null    object 
 6   summary                           482 non-null    object 
 7   space                             353 non-null    object 
 8   description                       489 non-null    object 
 9   neighborhood_overview             315 non-null    object 
 10  notes                             249 non-null    object 
 11  transit                           313 non-null    object 
 12  access  

In [9]:
df.sample(3)

Unnamed: 0.1,Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,neighborhood_overview,notes,transit,access,interaction,house_rules,picture_url,host_id,host_url,host_name,host_since,host_location,host_about,host_response_time,host_response_rate,host_is_superhost,host_thumbnail_url,host_picture_url,host_neighborhood,host_verifications,host_has_profile_pic,host_identity_verified,street,neighborhood,city,suburb,state,zipcode,smart_location,country_code,country,latitude,longitude,is_location_exact,property_type,room_type,accommodates,bathrooms,bedrooms,beds,bed_type,amenities,security_deposit,cleaning_fee,guests_included,extra_people,minimum_nights,maximum_nights,calendar_updated,has_availability,availability_30,availability_60,availability_90,availability_365,calendar_last_scraped,number_of_reviews,first_review,last_review,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,requires_license,license,instant_bookable,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,reviews_per_month
72,10345,18884447,https://www.airbnb.com/rooms/18884447,20181200000000.0,12/7/2018,Large Bright Room With En Suite in Sunny Port ...,"A bright spacious apartment, beach at the end ...",,"A bright spacious apartment, beach at the end ...","A beautiful, quiet neighbourhood with the beac...",,Street parking is available and the 109 tram ...,"You'll have access to the living room, kitchen...","I will be contactable via phone, text or email...",#NAME?,https://a0.muscache.com/im/pictures/1ea6eceb-3...,16510225,https://www.airbnb.com/users/show/16510225,Gemma,6/7/2014,"Victoria, Australia",I am a twin born in NZ and raised in Ireland n...,,,f,https://a0.muscache.com/im/pictures/47f5c9ec-e...,https://a0.muscache.com/im/pictures/47f5c9ec-e...,Port Melbourne,"['email', 'phone', 'reviews', 'jumio', 'offlin...",t,f,"Port Melbourne, VIC, Australia",Port Melbourne,Port Phillip,Port Melbourne,VIC,3207.0,"Port Melbourne, Australia",AU,Australia,-37.838501,144.936597,t,Apartment,Private room,1,2.0,1.0,1.0,Real Bed,"{Wifi,Kitchen,Heating,Washer,Dryer,""Smoke dete...",200.0,0.0,1,0,5,1125,12 months ago,t,0,0,0,0,12/7/2018,1,6/24/2017,6/24/2017,100.0,10.0,10.0,10.0,10.0,10.0,10.0,f,,f,flexible,f,f,1,0.06
325,4909,11258658,https://www.airbnb.com/rooms/11258658,20181200000000.0,12/7/2018,Top reviews next to public transport 15mins to...,Metres from the Gardiner train station and tra...,There are two bedrooms in our apartment both b...,Metres from the Gardiner train station and tra...,"Glen Iris, south-east of the city, is leafy, w...","We have a little dog, Ron. He is a very friend...",The apartment is just metres to the new Gardin...,Guests will have access to most parts of the a...,Ruby and Jesse are both a 9 to 5 working profe...,The most important thing for us would be mindf...,https://a0.muscache.com/im/pictures/a6d69f9a-1...,957970,https://www.airbnb.com/users/show/957970,Ruby,8/11/2011,"Melbourne, Victoria, Australia",I have been lived in Melbourne for 17 years. I...,within a few hours,100%,t,https://a0.muscache.com/im/pictures/3b4acd3d-7...,https://a0.muscache.com/im/pictures/3b4acd3d-7...,Glen Iris,"['email', 'phone', 'facebook', 'reviews', 'jum...",t,t,"Glen Iris, VIC, Australia",Glen Iris,Stonnington,Glen Iris,VIC,3146.0,"Glen Iris, Australia",AU,Australia,-37.85658,145.051901,t,Apartment,Private room,2,1.0,1.0,1.0,Real Bed,"{TV,Internet,Wifi,Kitchen,Gym,Breakfast,""Pets ...",0.0,16.0,1,22,2,1125,2 months ago,t,22,22,45,320,12/7/2018,32,2/23/2016,11/26/2018,96.0,10.0,10.0,10.0,10.0,10.0,10.0,f,,f,strict_14_with_grace_period,f,f,1,0.94
399,3627,9139846,https://www.airbnb.com/rooms/9139846,20181200000000.0,12/7/2018,"Studio/unit, beautiful gardens",A beautiful slice of heaven. Charming locatio...,Lovely established garden setting where you ca...,A beautiful slice of heaven. Charming locatio...,Close to the beach and the gateway to the Morn...,"I have a dog, so please ask if you would like ...",The train station is only 10 minutes away by c...,You have access to the garden and the Al-fresc...,I would be happy to help though out your stay,I live in a quiet area and am always considera...,https://a0.muscache.com/im/pictures/1a7d80f8-b...,47600522,https://www.airbnb.com/users/show/47600522,Carole,10/28/2015,"Victoria, Australia",,within an hour,100%,t,https://a0.muscache.com/im/pictures/886604b4-8...,https://a0.muscache.com/im/pictures/886604b4-8...,,"['email', 'phone', 'reviews']",t,f,"Frankston South, VIC, Australia",,Frankston,Frankston South,VIC,3199.0,"Frankston South, Australia",AU,Australia,-38.164599,145.133769,t,Bungalow,Entire home/apt,2,,,1.0,Real Bed,"{TV,Wifi,Kitchen,""Free parking on premises"",""S...",,,2,15,1,31,5 weeks ago,t,5,29,54,329,12/7/2018,195,11/23/2015,11/22/2018,93.0,10.0,9.0,10.0,10.0,9.0,10.0,f,,f,strict_14_with_grace_period,f,f,1,5.27


### Preprocess data

In [10]:
import sys
sys.path.append(function_path)

# Import only the function used below instead of a wildcard import
from preprocessor_inference import preprocess_data

df_cleaned = preprocess_data(df)

print("df shape:", df.shape)
print("df_cleaned: ", df_cleaned.shape)

df shape: (500, 82)
df_cleaned:  (500, 75)


### Organize features

Organize the columns into numerical features, categorical features, the target, and features to drop.

In [11]:
drop_features = ['zipcode', 'price']  

numerical_features = df_cleaned.select_dtypes(include=['int64', 'float64']).columns.tolist()
numerical_features = [feature for feature in numerical_features if feature not in drop_features]

categorical_features = df_cleaned.select_dtypes(include=['object', 'category']).columns.tolist()
categorical_features = [feature for feature in categorical_features if feature not in drop_features]

target = ['price']

print("drop_features:\t\t", len(drop_features))
print("numerical_features:\t", len(numerical_features))
print("categorical_features:\t", len(categorical_features))
print("target:\t\t\t", len(target))

drop_features:		 2
numerical_features:	 31
categorical_features:	 43
target:			 1


### Inference

Load model

In [12]:
from catboost import CatBoostRegressor

# Load the saved CatBoost model
catboost_model = CatBoostRegressor()

model_folder = "C:/Users/Ahmed/OneDrive/Documents/projects/procore/07_artifacts/"
model_filename = "catboost_model.cbm"
model_path = os.path.join(model_folder, model_filename)

catboost_model.load_model(model_path)

<catboost.core.CatBoostRegressor at 0x1d5ebd85fd0>

In [13]:
subset_columns = ['id'] + catboost_model.feature_names_
df_final = df_cleaned[subset_columns]

drop_features = ['zipcode', 'price']  

numerical_features = df_final.select_dtypes(include=['int64', 'float64']).columns.tolist()
numerical_features = [feature for feature in numerical_features if feature not in drop_features]

categorical_features = df_final.select_dtypes(include=['object', 'category']).columns.tolist()
categorical_features = [feature for feature in categorical_features if feature not in drop_features]

print("numerical_features:\t", len(numerical_features))
print("categorical_features:\t", len(categorical_features))
print("df_final shape:\t\t", df_final.shape)

numerical_features:	 29
categorical_features:	 30
df_final shape:		 (500, 60)
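
Subsetting by the model's feature names raises a `KeyError` if preprocessing failed to produce one of them. A minimal sketch of an explicit check that could run before the subset is shown below. The helper name `check_feature_columns` is hypothetical, and the column lists are a toy example rather than the real feature set.

```python
def check_feature_columns(available_columns, model_features):
    """Split model features into those present in the data and those missing."""
    available = set(available_columns)
    present = [name for name in model_features if name in available]
    missing = [name for name in model_features if name not in available]
    return present, missing

# Toy example; with the real objects one would pass
# df_cleaned.columns and catboost_model.feature_names_.
present, missing = check_feature_columns(
    available_columns=["id", "latitude", "longitude", "room_type"],
    model_features=["latitude", "longitude", "bedrooms"],
)
if missing:
    print("Missing model features:", missing)
```

Reporting every missing name at once is usually easier to debug than the first-failure message of a `KeyError`.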


In [14]:
from catboost import Pool

# Create a Pool object for the prediction data, specifying the categorical features
df_pool = Pool(df_final, cat_features=categorical_features)

# Perform predictions
predictions = catboost_model.predict(df_pool)

# Add the predictions as a new column. assign() returns a new DataFrame,
# which avoids pandas' SettingWithCopyWarning on the sliced df_final.
df_final = df_final.assign(predicted_price=predictions)


### Save prediction file for submission

In [15]:
df_final.head()

Unnamed: 0,id,host_location,host_response_time,host_response_rate,host_is_superhost,host_verifications,host_has_profile_pic,host_identity_verified,street,neighborhood,city,suburb,state,zipcode,smart_location,country_code,country,latitude,longitude,is_location_exact,property_type,room_type,accommodates,bathrooms,bedrooms,beds,bed_type,amenities,security_deposit,cleaning_fee,guests_included,extra_people,minimum_nights,maximum_nights,calendar_updated,has_availability,availability_30,availability_60,availability_90,availability_365,calendar_last_scraped,number_of_reviews,first_review,last_review,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,requires_license,instant_bookable,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,reviews_per_month,last_scraped_delta,host_since_delta,predicted_price
0,28964846,"Brunswick, Victoria, Australia",within a day,81%,f,"['email', 'phone', 'facebook', 'jumio', 'offli...",t,t,"Brunswick, VIC, Australia",Brunswick,Moreland,Brunswick,VIC,3056.0,"Brunswick, Australia",AU,Australia,-37.758036,144.971277,f,House,Entire home/apt,4,1.0,1.0,2.0,Real Bed,"{TV,Kitchen,""Free parking on premises"",""Smokin...",250.0,60.0,1,0,3,1125,yesterday,t,20,35,35,269,12/7/2018,0,Unknown,Unknown,97.0,10.0,10.0,10.0,10.0,10.0,10.0,f,f,flexible,f,f,2,1.02,2084,3499,146.336564
1,19648696,"Werribee, Victoria, Australia",within an hour,100%,f,"['email', 'phone', 'reviews', 'jumio', 'offlin...",t,f,"Maidstone, VIC, Australia",Footscray,Maribyrnong,Maidstone,VIC,3012.0,"Maidstone, Australia",AU,Australia,-37.782292,144.881223,t,House,Entire home/apt,10,1.0,1.0,8.0,Real Bed,"{TV,Internet,Wifi,""Air conditioning"",Kitchen,""...",600.0,140.0,6,15,3,90,today,t,12,36,64,154,12/7/2018,50,7/10/2017,12/3/2018,90.0,10.0,9.0,10.0,10.0,9.0,10.0,f,t,moderate,f,f,1,2.91,2084,3360,194.772625
2,11346949,"Melbourne, Victoria, Australia",Unknown,Unknown,f,"['email', 'phone', 'facebook', 'jumio', 'gover...",t,t,"Caulfield North, VIC, Australia",Caulfield,Glen Eira,Caulfield North,VIC,3161.0,"Caulfield North, Australia",AU,Australia,-37.872654,145.022953,f,House,Entire home/apt,5,1.0,1.0,4.0,Real Bed,"{TV,Internet,Wifi,""Air conditioning"",Kitchen,""...",250.0,59.0,1,0,1,1125,31 months ago,t,0,0,0,0,12/7/2018,0,Unknown,Unknown,97.0,10.0,10.0,10.0,10.0,10.0,10.0,f,f,flexible,f,f,1,1.02,2084,3106,155.766262
3,29476373,"Ferny Creek, Victoria, Australia",within an hour,100%,f,['phone'],t,f,"Melbourne, VIC, Australia",Central Business District,Melbourne,Melbourne,VIC,3000.0,"Melbourne, Australia",AU,Australia,-37.821687,144.955512,f,Apartment,Entire home/apt,2,1.0,1.0,1.0,Real Bed,"{TV,""Air conditioning"",""Paid parking off premi...",250.0,59.0,1,10,1,3,2 days ago,t,9,13,31,31,12/7/2018,11,10/28/2018,12/2/2018,91.0,9.0,9.0,9.0,9.0,10.0,9.0,f,f,moderate,f,f,1,8.05,2084,2989,128.605501
4,21627660,China,within an hour,100%,f,"['email', 'phone', 'reviews', 'jumio', 'offlin...",t,f,"Melbourne, VIC, Australia",Prahran/Windsor,Port Phillip,Melbourne,VIC,3004.0,"Melbourne, Australia",AU,Australia,-37.851703,144.978613,f,Apartment,Private room,2,1.0,1.0,1.0,Real Bed,"{Wifi,Pool,Kitchen,Gym,Elevator,Heating,Washer...",0.0,10.0,1,0,1,30,6 weeks ago,t,14,37,37,37,12/7/2018,16,12/24/2017,11/10/2018,85.0,9.0,9.0,9.0,9.0,10.0,9.0,f,t,strict_14_with_grace_period,f,f,3,1.38,2084,2881,79.885652
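
Before writing the file, it can help to sanity-check the predictions, since a price model should not emit missing or non-positive values. The sketch below uses the five predicted prices shown above; the helper name `summarize_predictions` is hypothetical.

```python
import math

def summarize_predictions(values):
    """Count non-finite and non-positive predictions and return basic range stats."""
    finite = [v for v in values if math.isfinite(v)]
    return {
        "n": len(values),
        "n_non_finite": len(values) - len(finite),
        "n_non_positive": sum(1 for v in finite if v <= 0),
        "min": min(finite) if finite else None,
        "max": max(finite) if finite else None,
    }

# The five predicted prices shown in df_final.head() above.
sample = [146.336564, 194.772625, 155.766262, 128.605501, 79.885652]
summary = summarize_predictions(sample)
print(summary)
```

For the full dataframe, the same function could be called with `df_final['predicted_price'].tolist()`.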


In [17]:
df_final.to_csv(data_path_out + 'catboost/inference_outbound.csv', index=False)