# MGMTMSA 403: Optimization

## Assignment 3: Predicting Airbnb Prices

Group Members: Guangying Pan, Yijie Fu

### Data Preprocessing

In [1]:
import pandas as pd

In [2]:
train = pd.read_csv('AirbnbTrain.csv')
train.head()

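|  | latitude | longitude | Entire home | accommodates | bathrooms | bedrooms | beds | cleaning_fee | minimum_nights | number_of_reviews | review_scores_rating | instant_bookable | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 34.103701 | -118.332241 | 1 | 13 | 2.0 | 3 | 2 | 150 | 2 | 1 | 100 | 1 | 350 |
| 1 | 34.099484 | -118.331645 | 1 | 8 | 2.0 | 2 | 4 | 150 | 1 | 11 | 96 | 1 | 190 |
| 2 | 34.104321 | -118.329662 | 1 | 4 | 1.0 | 0 | 1 | 55 | 1 | 1 | 80 | 0 | 85 |
| 3 | 34.101028 | -118.317848 | 0 | 2 | 1.0 | 1 | 1 | 20 | 1 | 8 | 98 | 0 | 75 |
| 4 | 34.098292 | -118.32498 | 1 | 2 | 1.0 | 1 | 1 | 20 | 1 | 11 | 96 | 0 | 130 |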


In [3]:
train.shape

(1700, 13)

In [4]:
test = pd.read_csv('AirbnbTest.csv')
test.head()

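|  | latitude | longitude | Entire home | accommodates | bathrooms | bedrooms | beds | cleaning_fee | minimum_nights | number_of_reviews | review_scores_rating | instant_bookable | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 34.100604 | -118.341787 | 0 | 2 | 1.0 | 1 | 1 | 40 | 1 | 261 | 96 | 1 | 100 |
| 1 | 34.100607 | -118.350583 | 1 | 8 | 2.0 | 2 | 2 | 100 | 2 | 10 | 98 | 0 | 300 |
| 2 | 34.10061 | -118.347617 | 1 | 2 | 1.0 | 1 | 1 | 80 | 2 | 1 | 100 | 1 | 125 |
| 3 | 34.100611 | -118.34218 | 1 | 3 | 1.0 | 0 | 2 | 55 | 2 | 54 | 97 | 1 | 169 |
| 4 | 34.100618 | -118.342791 | 1 | 4 | 1.0 | 1 | 1 | 70 | 2 | 233 | 92 | 1 | 119 |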


In [5]:
test.shape

(699, 13)

### Question 1

**Model 1. Formulate the least absolute deviations regression problem as a linear program. Solve the linear program using the data given in the file AirbnbTrain.csv. What is the prediction error, in $/night, of your model on the test set (provided in AirbnbTest.csv)?**
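For reference, one way to write the linear program that the notebook code builds is sketched here. The symbols are notation assumed for this sketch: $n = 1700$ training listings, $p = 12$ features, $x_{ij}$ the value of feature $j$ for listing $i$, $y_i$ the price, $\beta_j$ the coefficients, and $z_i$ an auxiliary variable that bounds the absolute residual of listing $i$. There is no separate intercept term, matching the code.

$$
\begin{aligned}
\min_{\beta,\, z} \quad & \frac{1}{n} \sum_{i=1}^{n} z_i \\
\text{s.t.} \quad & z_i \ge y_i - \sum_{j=1}^{p} \beta_j x_{ij}, \quad i = 1, \dots, n, \\
& z_i \ge \sum_{j=1}^{p} \beta_j x_{ij} - y_i, \quad i = 1, \dots, n.
\end{aligned}
$$

Because the objective pushes each $z_i$ down, at an optimal solution $z_i = \left| y_i - \sum_{j=1}^{p} \beta_j x_{ij} \right|$, so the objective value equals the mean absolute error on the training set.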

In [6]:
from gurobipy import Model, GRB

In [7]:
# The last column (price) is the target; the remaining 12 columns are the features
y = train.iloc[:, -1]
x = train.iloc[:, :-1]

In [8]:
# Construct a 'blank' model
mod = Model()

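# Define decision variables: 12 coefficients b and 1700 absolute residuals z
# Note: addVars uses a default lower bound of 0, so each b[j] is restricted to be nonnegative
b = mod.addVars(12)
z = mod.addVars(1700)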

# construct constraints

for i in range(1700):
    mod.addConstr(z[i] >= y[i] - sum(b[j]*x.iloc[i,j] for j in range(12)))

for i in range(1700):
    mod.addConstr(z[i] >= sum(b[j]*x.iloc[i,j] for j in range(12)) - y[i])


# Create the objective function, and set it to be minimized
mod.setObjective((1/1700)*sum(z[i] for i in range(1700)), GRB.MINIMIZE)

mod.update()

mod.optimize()

Set parameter Username
Academic license - for non-commercial use only - expires 2025-01-10
Gurobi Optimizer version 11.0.0 build v11.0.0rc2 (mac64[arm] - Darwin 22.5.0 22F82)

CPU model: Apple M2
Thread count: 8 physical cores, 8 logical processors, using up to 8 threads

Optimize a model with 3400 rows, 1712 columns and 41372 nonzeros
Model fingerprint: 0xcb8b6cd1
Coefficient statistics:
  Matrix range     [5e-01, 5e+02]
  Objective range  [6e-04, 6e-04]
  Bounds range     [0e+00, 0e+00]
  RHS range        [1e+01, 2e+03]
Presolve time: 0.05s
Presolved: 3400 rows, 1712 columns, 41372 nonzeros

Concurrent LP optimizer: primal simplex, dual simplex, and barrier
Showing barrier log only...

Ordering time: 0.00s

Barrier statistics:
 Dense cols : 12
 AA' NZ     : 2.995e+04
 Factor NZ  : 3.260e+04 (roughly 2 MB of memory)
 Factor Ops : 4.141e+05 (less than 1 second per iteration)
 Threads    : 1

                  Objective                Residual
Iter       Primal          Dual         Pri

In [9]:
# Extract the solution status. 
if mod.status == GRB.OPTIMAL:
    print("Solved to optimality")

    # print optimized value obtained at the optimal solution
    print(f"\nOptimized Value: {mod.objval}\n")

    for j in range(12):
        print(f'Beta[{j}] = {b[j].X}')

Solved to optimality

Optimized Value: 36.426247398213434

Beta[0] = 290.2514663518523
Beta[1] = 84.03092762130778
Beta[2] = 36.78323755553825
Beta[3] = 9.936817142125033
Beta[4] = 31.69433118878754
Beta[5] = 19.703957114478886
Beta[6] = 0.0
Beta[7] = 0.31061628141023223
Beta[8] = 0.0
Beta[9] = 0.0
Beta[10] = 0.2681158824601272
Beta[11] = 5.167454373349817


In [10]:
y_test = test.iloc[:, -1]
x_test = test.iloc[:, :-1]

In [11]:
pred_x = [sum(b[j].X * x_test.iloc[i, j] for j in range(12)) for i in range(699)]

In [12]:
error = (1/699)*sum(abs(y_test.iloc[i] - pred_x[i]) for i in range(699))

error

35.604535030377846

**The prediction error of Model 1 on the test set is $35.60/night.**
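The prediction error reported above is the mean absolute error between actual and predicted prices. A minimal vectorized sketch of the same calculation follows, assuming the features and coefficients are available as NumPy arrays. The function name `mean_absolute_error` and the toy numbers are illustrative and do not come from the notebook or the data files.

```python
import numpy as np

def mean_absolute_error(X, y, beta):
    """Average absolute gap between actual prices y and linear predictions X @ beta."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    beta = np.asarray(beta, dtype=float)
    predictions = X @ beta  # one prediction per listing
    return float(np.mean(np.abs(y - predictions)))

# Toy data, made up for illustration (not rows from AirbnbTest.csv)
X_toy = np.array([[1.0, 2.0],
                  [0.0, 4.0],
                  [1.0, 1.0]])
y_toy = np.array([100.0, 90.0, 60.0])
beta_toy = np.array([50.0, 20.0])

# Predictions are 90, 80, 70, so every absolute error is 10
print(mean_absolute_error(X_toy, y_toy, beta_toy))
```

With the notebook's objects, the same function could presumably be called as `mean_absolute_error(x_test.to_numpy(), y_test.to_numpy(), [b[j].X for j in range(12)])`.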

### Question 2

**Model 2. Suppose that to improve interpretability, you wish to build a model that predicts Airbnb prices using only the three most important variables. Modify Model 1 by including a constraint that allows at most three variables to have non-zero coefficients.**
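The "at most three non-zero coefficients" requirement can be sketched as a big-M formulation added to the Model 1 linear program. The notation is assumed for this sketch: $s_j$ is a binary variable that equals 1 when feature $j$ is allowed a non-zero coefficient, $p = 12$ is the number of features, and $M$ is a constant larger than the magnitude of any coefficient at the optimum (the code uses $M = 10000$).

$$
\begin{aligned}
& \sum_{j=1}^{p} s_j \le 3, \\
& -M s_j \le \beta_j \le M s_j, \quad j = 1, \dots, p, \\
& s_j \in \{0, 1\}, \quad j = 1, \dots, p.
\end{aligned}
$$

When $s_j = 0$ the bounds collapse to $\beta_j = 0$, and when $s_j = 1$ they leave $\beta_j$ effectively unrestricted as long as $M$ is large enough. The problem becomes a mixed-integer linear program rather than a pure linear program.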

In [13]:
# Construct a 'blank' model
mod = Model()

# Define decision variables (s[j] = 1 if feature j is allowed a non-zero coefficient)
b = mod.addVars(12)
z = mod.addVars(1700)
s = mod.addVars(12, vtype = GRB.BINARY)

# construct constraints

for i in range(1700):
    mod.addConstr(z[i] >= y[i] - sum(b[j]*x.iloc[i,j] for j in range(12)))

for i in range(1700):
    mod.addConstr(z[i] >= sum(b[j]*x.iloc[i,j] for j in range(12)) - y[i])

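# Cardinality constraint: at most three coefficients may be non-zero
mod.addConstr(sum(s[j] for j in range(12)) <= 3)

# Big-M linking constraints: b[j] is forced to 0 whenever s[j] = 0
for j in range(12):
    mod.addConstr(10000*s[j] >= b[j])
    mod.addConstr(-10000*s[j] <= b[j])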


# Create the objective function, and set it to be minimized
mod.setObjective((1/1700)*sum(z[i] for i in range(1700)), GRB.MINIMIZE)

mod.update()

mod.optimize()

Gurobi Optimizer version 11.0.0 build v11.0.0rc2 (mac64[arm] - Darwin 22.5.0 22F82)

CPU model: Apple M2
Thread count: 8 physical cores, 8 logical processors, using up to 8 threads

Optimize a model with 3425 rows, 1724 columns and 41432 nonzeros
Model fingerprint: 0x3def90f5
Variable types: 1712 continuous, 12 integer (12 binary)
Coefficient statistics:
  Matrix range     [5e-01, 1e+04]
  Objective range  [6e-04, 6e-04]
  Bounds range     [1e+00, 1e+00]
  RHS range        [3e+00, 2e+03]
Found heuristic solution: objective 144.9682353
Presolve removed 840 rows and 414 columns
Presolve time: 0.04s
Presolved: 2585 rows, 1310 columns, 31274 nonzeros
Variable types: 1298 continuous, 12 integer (12 binary)

Root relaxation: objective 3.642625e+01, 1419 iterations, 0.19 seconds (0.34 work units)

    Nodes    |    Current Node    |     Objective Bounds      |     Work
 Expl Unexpl |  Obj  Depth IntInf | Incumbent    BestBd   Gap | It/Node Time

     0     0   36.42625    0    7  144.96824   

**a) List the names and coefficients of the three variables selected by the optimization model.**

In [14]:
# Extract the solution status. 
if mod.status == GRB.OPTIMAL:
    print("Solved to optimality")

    # print optimized value obtained at the optimal solution
    print(f"\nOptimized Value: {mod.objval}\n")

    for j in range(12):
        if b[j].X != 0:
            print(f'Name:', train.columns[j],f', beta: {b[j].X}')

Solved to optimality

Optimized Value: 38.33882352941181

Name: Entire home , beta: 52.0
Name: accommodates , beta: 14.0
Name: bedrooms , beta: 32.0


**b) What is the new prediction error, in $/night, of Model 2?**

In [15]:
pred_x = [sum(b[j].X * x_test.iloc[i, j] for j in range(12)) for i in range(699)]

In [16]:
error = (1/699)*sum(abs(y_test.iloc[i] - pred_x[i]) for i in range(699))

error

37.73676680972818

**The prediction error of Model 2 on the test set is $37.74/night.**

### Question 3

**Model 3. Suppose now you wish to build a model that predicts Airbnb listing price using only three variables, where one of the variables is the number of beds.**
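Forcing a variable into the model amounts to fixing its binary selection variable to 1, which requires knowing that variable's position among the features. A small sketch of looking up the position of `beds` by name instead of hard-coding it is shown here. The column names are copied from the `train.head()` output, while `feature_names`, `features`, and `beds_index` are illustrative names that do not appear in the notebook.

```python
import pandas as pd

# Feature columns in the order shown by train.head() (price excluded)
feature_names = [
    "latitude", "longitude", "Entire home", "accommodates", "bathrooms",
    "bedrooms", "beds", "cleaning_fee", "minimum_nights",
    "number_of_reviews", "review_scores_rating", "instant_bookable",
]

# An empty frame is enough to demonstrate the column lookup
features = pd.DataFrame(columns=feature_names)

# Position of the 'beds' column, usable as the index of its selection variable
beds_index = features.columns.get_loc("beds")
print(beds_index)  # 6
```

In the notebook this would presumably correspond to `x.columns.get_loc('beds')`, which avoids a silent mistake if the column order ever changes.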

In [17]:
# Construct a 'blank' model
mod = Model()

# Define decision variables
b = mod.addVars(12)
z = mod.addVars(1700)
s = mod.addVars(12, vtype = GRB.BINARY)


# construct constraints

for i in range(1700):
    mod.addConstr(z[i] >= y[i] - sum(b[j]*x.iloc[i,j] for j in range(12)))

for i in range(1700):
    mod.addConstr(z[i] >= sum(b[j]*x.iloc[i,j] for j in range(12)) - y[i])

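# Cardinality constraint: exactly three selected variables, one of which is beds (column index 6)
mod.addConstr(sum(s[i] for i in range(12)) == 3)
mod.addConstr(s[6] == 1)

# Big-M linking constraints: b[j] is forced to 0 whenever s[j] = 0
for j in range(12):
    mod.addConstr(10000*s[j] >= b[j])
    mod.addConstr(-10000*s[j] <= b[j])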


# Create the objective function, and set it to be minimized
mod.setObjective((1/1700)*sum(z[i] for i in range(1700)), GRB.MINIMIZE)

mod.update()

mod.optimize()

Gurobi Optimizer version 11.0.0 build v11.0.0rc2 (mac64[arm] - Darwin 22.5.0 22F82)

CPU model: Apple M2
Thread count: 8 physical cores, 8 logical processors, using up to 8 threads

Optimize a model with 3426 rows, 1724 columns and 41433 nonzeros
Model fingerprint: 0xb1cf5923
Variable types: 1712 continuous, 12 integer (12 binary)
Coefficient statistics:
  Matrix range     [5e-01, 1e+04]
  Objective range  [6e-04, 6e-04]
  Bounds range     [1e+00, 1e+00]
  RHS range        [1e+00, 2e+03]
Found heuristic solution: objective 144.9682353
Presolve removed 842 rows and 415 columns
Presolve time: 0.10s
Presolved: 2584 rows, 1309 columns, 31271 nonzeros
Variable types: 1298 continuous, 11 integer (11 binary)

Root relaxation: objective 3.642625e+01, 1391 iterations, 0.37 seconds (0.32 work units)

    Nodes    |    Current Node    |     Objective Bounds      |     Work
 Expl Unexpl |  Obj  Depth IntInf | Incumbent    BestBd   Gap | It/Node Time

     0     0   36.42625    0    8  144.96824   

**a) List the names and coefficients of the two other variables selected by the optimization model.**

In [18]:
# Extract the solution status. 
if mod.status == GRB.OPTIMAL:
    print("Solved to optimality")

    # print optimized value obtained at the optimal solution
    print(f"\nOptimized Value: {mod.objval}\n")

    for j in range(12):
        if b[j].X != 0:
            print(f'Name:', train.columns[j],f', coefficient: {b[j].X}')

Solved to optimality

Optimized Value: 40.073014705882386

Name: Entire home , coefficient: 67.875
Name: bedrooms , coefficient: 47.375
Name: beds , coefficient: 12.125


The other two variables are Entire home (coefficient 67.875) and bedrooms (coefficient 47.375).

**b) Which variable was in Model 2 but is no longer in Model 3? Briefly explain in 1-2 sentences why this variable might have been dropped.**

The variable 'accommodates' was in Model 2 but is no longer in Model 3. Model 3 is required to include 'beds', which is strongly correlated with 'accommodates' (0.714887 in the training data), so the two variables carry largely overlapping information. With only two free slots remaining, 'accommodates' likely reduces the error very little once 'beds' is in the model, so the optimizer keeps 'Entire home' and 'bedrooms' instead.

In [19]:
train.corr()

|  | latitude | longitude | Entire home | accommodates | bathrooms | bedrooms | beds | cleaning_fee | minimum_nights | number_of_reviews | review_scores_rating | instant_bookable | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| latitude | 1.0 | -0.146861 | 0.155917 | 0.092906 | -0.104603 | -0.092728 | 0.084267 | 0.021063 | -0.024054 | 0.018956 | -0.052867 | 0.14725 | 0.017187 |
| longitude | -0.146861 | 1.0 | -0.061369 | -0.073256 | -0.068875 | -0.076943 | -0.079002 | -0.180739 | -0.104919 | -0.033092 | -0.062512 | -0.020672 | -0.147224 |
| Entire home | 0.155917 | -0.061369 | 1.0 | 0.387529 | 0.064571 | 0.064114 | 0.163433 | 0.407221 | 0.044382 | -0.01346 | -0.086006 | 0.098798 | 0.298168 |
| accommodates | 0.092906 | -0.073256 | 0.387529 | 1.0 | 0.603855 | 0.712638 | 0.714887 | 0.548043 | -0.139038 | 0.013333 | -0.056631 | 0.226832 | 0.592929 |
| bathrooms | -0.104603 | -0.068875 | 0.064571 | 0.603855 | 1.0 | 0.735774 | 0.546788 | 0.489532 | -0.091059 | -0.019455 | 0.020201 | 0.08944 | 0.599028 |
| bedrooms | -0.092728 | -0.076943 | 0.064114 | 0.712638 | 0.735774 | 1.0 | 0.565229 | 0.50295 | -0.111259 | 0.026365 | 0.05389 | 0.105767 | 0.601706 |
| beds | 0.084267 | -0.079002 | 0.163433 | 0.714887 | 0.546788 | 0.565229 | 1.0 | 0.351499 | -0.130375 | -0.013726 | -0.090479 | 0.209118 | 0.39576 |
| cleaning_fee | 0.021063 | -0.180739 | 0.407221 | 0.548043 | 0.489532 | 0.50295 | 0.351499 | 1.0 | 0.265867 | -0.081668 | 0.010097 | -0.033599 | 0.632186 |
| minimum_nights | -0.024054 | -0.104919 | 0.044382 | -0.139038 | -0.091059 | -0.111259 | -0.130375 | 0.265867 | 1.0 | -0.124261 | -0.005363 | -0.214826 | -0.064613 |
| number_of_reviews | 0.018956 | -0.033092 | -0.01346 | 0.013333 | -0.019455 | 0.026365 | -0.013726 | -0.081668 | -0.124261 | 1.0 | 0.014055 | 0.09475 | -0.037913 |
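Rather than scanning the full correlation matrix, a single pairwise correlation can be read off directly. The sketch below uses a small made-up DataFrame (not rows from AirbnbTrain.csv) so that it runs on its own; the names `toy`, `pair_corr`, and `beds_corr` are illustrative.

```python
import pandas as pd

# Toy listings with made-up values, for illustration only
toy = pd.DataFrame({
    "accommodates": [2, 4, 6, 8],
    "beds":         [1, 2, 3, 4],
    "price":        [80, 120, 150, 260],
})

# Pearson correlation between two named columns
pair_corr = toy["accommodates"].corr(toy["beds"])
print(pair_corr)

# Correlation of every other column with 'beds', strongest first
beds_corr = toy.corr()["beds"].drop("beds").sort_values(ascending=False)
print(beds_corr)
```

Applied to the notebook's data, `train["accommodates"].corr(train["beds"])` should return the 0.714887 entry shown in the matrix.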


**c) What is the new prediction error, in $/night, of Model 3?**

In [20]:
pred_x = [sum(b[j].X * x_test.iloc[i, j] for j in range(12)) for i in range(699)]

In [21]:
error = (1/699)*sum(abs(y_test.iloc[i] - pred_x[i]) for i in range(699))

error

38.59960658082976

**The prediction error of Model 3 on the test set is $38.60/night.**