# SDS Challenge #4 - Hostel Listings

## Problem Statement

Welcome, Data Scientist, to the 4th SDS Club Monthly Challenge! In this month's challenge, you have been hired by a hostel company to 
help them determine a fair price for their new hostels based on previous hostel data. Your mission is to predict the price of each hostel based on the data provided.

## Evaluation

$$\begin{equation*}
MSE = {\frac{1}{n}\sum_{i=1}^{n}(y_{i} - \hat{y}_{i})^{2}}
\end{equation*}$$
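
As a rough illustration of the metric above, the following sketch computes the mean squared error with NumPy. The helper name `mean_squared_error` and the sample prices are assumptions made for this example, not part of the challenge data.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# Hypothetical prices, for illustration only
actual = [100.0, 80.0, 120.0]
predicted = [110.0, 80.0, 100.0]

mse = mean_squared_error(actual, predicted)
print(mse)
```

Because the errors are squared, a single large miss weighs more heavily than several small ones.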

## Understanding the Dataset

The dataset has 65 columns, each with a descriptive name.

## Dataset Files

**public_listings.csv** - Dataset to train and analyze <br />
**pred_listings.csv** - Dataset to predict listings' prices

## Submission

All submissions should be sent through email to challenges@superdatascience.com. When submitting, the file should contain predictions made on the pred_listings.csv file, and it should have the following format:

In [None]:
150.00
95.00
80.00
105.00
72.00
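
One possible way to produce a file in this shape, one prediction per line with two decimals, is sketched below. The file name `submission.csv` and the sample values are assumptions for this example; the challenge text does not specify a file name.

```python
import numpy as np

# Hypothetical predictions standing in for real model output
predictions = np.array([150.0, 95.0, 80.0, 105.0, 72.0])

# Hypothetical output file name (not given in the challenge text)
output_path = "submission.csv"

# A 1-D array is written as one value per line
np.savetxt(output_path, predictions, fmt="%.2f")

# Read the file back to check its layout
with open(output_path) as f:
    lines = f.read().splitlines()
print(lines)
```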

## Acknowledgements

The data was scraped and collected by Inside Airbnb from publicly available information on the Airbnb site.

## Importing the Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Importing the datasets and cleaning them

In [None]:
dataset = pd.read_csv("public_cars.csv")
dataset_test = pd.read_csv("pred_cars.csv")

In [None]:
dataset

Unnamed: 0,manufacturer_name,model_name,transmission,color,odometer_value,year_produced,engine_fuel,engine_has_gas,engine_type,engine_capacity,body_type,has_warranty,state,drivetrain,feature_0,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,feature_8,feature_9,duration_listed,price_usd
0,Ford,Fusion,mechanical,blue,245000,2006,gasoline,False,gasoline,1.6,hatchback,False,owned,front,True,False,False,False,False,False,False,False,False,False,7,4250.0
1,Dodge,Caravan,automatic,silver,265542,2002,gasoline,False,gasoline,3.3,minivan,False,owned,front,False,True,False,False,False,False,False,False,False,False,133,4000.0
2,Ford,Galaxy,mechanical,blue,168000,2009,diesel,False,diesel,1.8,minivan,False,owned,front,False,False,False,True,False,False,True,True,True,True,0,10900.0
3,Mazda,6,mechanical,other,225522,2008,gasoline,False,gasoline,1.8,universal,False,owned,front,False,True,True,False,False,True,False,False,True,True,20,6999.0
4,Audi,80,mechanical,black,370000,1991,gasoline,False,gasoline,1.8,sedan,False,owned,front,False,False,False,False,False,False,False,False,False,True,160,1600.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
30819,Mazda,Millenia,automatic,green,360493,1999,gasoline,False,gasoline,2.5,sedan,False,owned,front,False,True,False,False,True,True,False,False,False,True,66,2000.0
30820,Audi,A3,automatic,grey,117000,2009,gasoline,False,gasoline,1.4,universal,False,owned,front,False,True,True,True,False,False,False,True,True,True,58,8800.0
30821,Mazda,626,mechanical,black,333000,1997,gasoline,False,gasoline,2.0,hatchback,False,owned,front,False,False,False,False,False,False,False,False,False,True,87,1400.0
30822,Audi,A6,automatic,violet,530000,1995,gasoline,False,gasoline,2.6,universal,False,owned,all,False,True,True,False,False,False,False,False,True,True,52,3500.0


In [None]:
dataset_test

Unnamed: 0,manufacturer_name,model_name,transmission,color,odometer_value,year_produced,engine_fuel,engine_has_gas,engine_type,engine_capacity,body_type,has_warranty,state,drivetrain,feature_0,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,feature_8,feature_9,duration_listed
0,Renault,Megane,mechanical,blue,360000,1997,gasoline,False,gasoline,1.6,coupe,False,owned,front,False,True,False,False,False,False,False,True,False,True,114
1,Peugeot,206,mechanical,silver,267000,1999,gasoline,False,gasoline,1.4,hatchback,False,owned,front,False,False,False,False,False,False,False,False,False,True,67
2,Volkswagen,Sharan,mechanical,blue,172000,2000,gasoline,False,gasoline,2.0,minivan,False,owned,front,True,False,False,False,False,False,False,False,False,False,50
3,Volvo,XC60,mechanical,white,230000,2009,diesel,False,diesel,2.4,universal,False,owned,front,False,True,True,True,False,True,True,True,True,True,79
4,Mazda,3,mechanical,silver,206000,2007,gasoline,False,gasoline,1.6,sedan,False,owned,front,False,True,False,True,False,True,False,False,False,False,74
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7702,Chrysler,Sebring,automatic,blue,347618,2001,gasoline,False,gasoline,2.4,sedan,False,owned,front,True,False,False,False,False,False,False,False,False,False,81
7703,Geely,Emgrand 7,automatic,black,9700,2018,gasoline,False,gasoline,2.0,suv,False,owned,front,False,True,False,True,False,True,False,True,True,True,84
7704,Chrysler,Sebring,automatic,red,111111,2001,gasoline,False,gasoline,2.4,sedan,False,owned,front,True,False,False,False,False,False,False,False,False,False,1
7705,Ford,EcoSport,mechanical,white,109000,2016,diesel,False,diesel,1.5,suv,False,owned,front,False,False,False,True,False,False,True,True,True,True,2


## Taking care of missing data

In [None]:
# Drop four text columns from both datasets
dataset.drop(['manufacturer_name', 'model_name', 'color', 'body_type'],
  axis='columns', inplace=True)
dataset_test.drop(['manufacturer_name', 'model_name', 'color', 'body_type'],
  axis='columns', inplace=True)

# Features are all columns except the last; the target is the last column (price_usd)
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

In [None]:
dataset

Unnamed: 0,transmission,odometer_value,year_produced,engine_fuel,engine_has_gas,engine_type,engine_capacity,has_warranty,state,drivetrain,feature_0,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,feature_8,feature_9,duration_listed,price_usd
0,mechanical,245000,2006,gasoline,False,gasoline,1.6,False,owned,front,True,False,False,False,False,False,False,False,False,False,7,4250.0
1,automatic,265542,2002,gasoline,False,gasoline,3.3,False,owned,front,False,True,False,False,False,False,False,False,False,False,133,4000.0
2,mechanical,168000,2009,diesel,False,diesel,1.8,False,owned,front,False,False,False,True,False,False,True,True,True,True,0,10900.0
3,mechanical,225522,2008,gasoline,False,gasoline,1.8,False,owned,front,False,True,True,False,False,True,False,False,True,True,20,6999.0
4,mechanical,370000,1991,gasoline,False,gasoline,1.8,False,owned,front,False,False,False,False,False,False,False,False,False,True,160,1600.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
30819,automatic,360493,1999,gasoline,False,gasoline,2.5,False,owned,front,False,True,False,False,True,True,False,False,False,True,66,2000.0
30820,automatic,117000,2009,gasoline,False,gasoline,1.4,False,owned,front,False,True,True,True,False,False,False,True,True,True,58,8800.0
30821,mechanical,333000,1997,gasoline,False,gasoline,2.0,False,owned,front,False,False,False,False,False,False,False,False,False,True,87,1400.0
30822,automatic,530000,1995,gasoline,False,gasoline,2.6,False,owned,all,False,True,True,False,False,False,False,False,True,True,52,3500.0


In [None]:
dataset_test

Unnamed: 0,transmission,odometer_value,year_produced,engine_fuel,engine_has_gas,engine_type,engine_capacity,has_warranty,state,drivetrain,feature_0,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,feature_8,feature_9,duration_listed
0,mechanical,360000,1997,gasoline,False,gasoline,1.6,False,owned,front,False,True,False,False,False,False,False,True,False,True,114
1,mechanical,267000,1999,gasoline,False,gasoline,1.4,False,owned,front,False,False,False,False,False,False,False,False,False,True,67
2,mechanical,172000,2000,gasoline,False,gasoline,2.0,False,owned,front,True,False,False,False,False,False,False,False,False,False,50
3,mechanical,230000,2009,diesel,False,diesel,2.4,False,owned,front,False,True,True,True,False,True,True,True,True,True,79
4,mechanical,206000,2007,gasoline,False,gasoline,1.6,False,owned,front,False,True,False,True,False,True,False,False,False,False,74
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7702,automatic,347618,2001,gasoline,False,gasoline,2.4,False,owned,front,True,False,False,False,False,False,False,False,False,False,81
7703,automatic,9700,2018,gasoline,False,gasoline,2.0,False,owned,front,False,True,False,True,False,True,False,True,True,True,84
7704,automatic,111111,2001,gasoline,False,gasoline,2.4,False,owned,front,True,False,False,False,False,False,False,False,False,False,1
7705,mechanical,109000,2016,diesel,False,diesel,1.5,False,owned,front,False,False,False,True,False,False,True,True,True,True,2


In [None]:
from sklearn.impute import SimpleImputer

# Fill missing engine_capacity values (column index 6) with the training mean
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_test = dataset_test.values
imputer.fit(X[:, 6:7])
X[:, 6:7] = imputer.transform(X[:, 6:7])
X_test[:, 6:7] = imputer.transform(X_test[:, 6:7])

In [None]:
print(X)
X.shape

[['mechanical' 245000 2006 ... False False 7]
 ['automatic' 265542 2002 ... False False 133]
 ['mechanical' 168000 2009 ... True True 0]
 ...
 ['mechanical' 333000 1997 ... False True 87]
 ['automatic' 530000 1995 ... True True 52]
 ['automatic' 15000 2018 ... True True 75]]


(30824, 21)

In [None]:
print(X_test)
X_test.shape

[['mechanical' 360000 1997 ... False True 114]
 ['mechanical' 267000 1999 ... False True 67]
 ['mechanical' 172000 2000 ... False False 50]
 ...
 ['automatic' 111111 2001 ... False False 1]
 ['mechanical' 109000 2016 ... True True 2]
 ['mechanical' 180000 2000 ... False False 26]]


(7707, 21)
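
To make the imputation step easier to follow, here is a minimal sketch on a toy column. The values are made up for illustration; the point is that the mean is learned from the training column and then reused for the test column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical engine capacities with one missing value in each set
train_col = np.array([[1.6], [np.nan], [2.0]])
test_col = np.array([[np.nan], [3.0]])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer.fit(train_col)                       # learns the mean of the observed training values
train_filled = imputer.transform(train_col)  # NaN replaced by the training mean
test_filled = imputer.transform(test_col)    # the same training mean is reused here

print(train_filled.ravel())
print(test_filled.ravel())
```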

## Encoding categorical data

### Encoding the Independent Variable

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

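# One-hot encode the categorical and boolean columns; pass the numeric ones through
categorical_cols = [0, 3, 4, 5, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
ct = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(), categorical_cols)],
    remainder='passthrough',
)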
X = np.array(ct.fit_transform(X))
X_test = np.array(ct.transform(X_test))

In [None]:
print(X_test)
X_test.shape

[[0.0 1.0 0.0 ... 1997 1.6 114]
 [0.0 1.0 0.0 ... 1999 1.4 67]
 [0.0 1.0 0.0 ... 2000 2.0 50]
 ...
 [1.0 0.0 0.0 ... 2001 2.4 1]
 [0.0 1.0 1.0 ... 2016 1.5 2]
 [0.0 1.0 0.0 ... 2000 1.4 26]]


(7707, 45)

In [None]:
print(X)
X.shape

[[0.0 1.0 0.0 ... 2006 1.6 7]
 [1.0 0.0 0.0 ... 2002 3.3 133]
 [0.0 1.0 1.0 ... 2009 1.8 0]
 ...
 [0.0 1.0 0.0 ... 1997 2.0 87]
 [1.0 0.0 0.0 ... 1995 2.6 52]
 [1.0 0.0 0.0 ... 2018 3.5 75]]


(30824, 45)
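
The column layout after encoding can be surprising, so here is a small sketch with made-up rows. It assumes three toy columns (transmission, odometer value, fuel type) and sets `sparse_threshold=0` only to force a dense result in this tiny example. The encoded columns come first and the passthrough columns are appended at the end.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical rows: [transmission, odometer_value, engine_fuel]
train = np.array([
    ["mechanical", 245000, "gasoline"],
    ["automatic", 265542, "diesel"],
    ["mechanical", 168000, "diesel"],
], dtype=object)
test = np.array([["automatic", 360000, "gasoline"]], dtype=object)

ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [0, 2])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array in this toy example
)
train_enc = np.asarray(ct.fit_transform(train))  # categories are learned here
test_enc = np.asarray(ct.transform(test))        # the same categories are reused

# Columns: automatic, mechanical, diesel, gasoline, odometer_value
print(train_enc)
print(test_enc)
```

Calling `transform` (not `fit_transform`) on the test rows keeps the column order identical between the two sets.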

## Feature Scaling

In [None]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X = sc.fit_transform(X)
# Reuse the training mean and standard deviation for the test set
X_test = sc.transform(X_test)

In [None]:
print(X)

[[-0.71001645  0.71001645 -0.70856972 ...  0.38022628 -0.6805601
  -0.65105271]
 [ 1.40841807 -1.40841807 -0.70856972 ... -0.11675116  1.86649355
   0.46740533]
 [-0.71001645  0.71001645  1.41129372 ...  0.75295936 -0.38090673
  -0.71318927]
 ...
 [-0.71001645  0.71001645 -0.70856972 ... -0.73797296 -0.08125336
   0.05907938]
 [ 1.40841807 -1.40841807 -0.70856972 ... -0.98646168  0.81770675
  -0.25160341]
 [ 1.40841807 -1.40841807 -0.70856972 ...  1.87115859  2.16614693
  -0.04744044]]
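
A common convention is to learn the scaling statistics from the training data only and apply them unchanged to the test data, so both sets live on the same scale. The sketch below shows this on a made-up single-feature example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[2.0], [4.0]])

sc = StandardScaler()
train_scaled = sc.fit_transform(train)  # learns the mean and standard deviation from train
test_scaled = sc.transform(test)        # applies the training statistics to test

print(sc.mean_, sc.scale_)
print(test_scaled.ravel())
```

A test value equal to the training mean maps to zero, which would not hold if the scaler were refit on the test set.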


In [None]:
from sklearn.cross_decomposition import PLSRegression
pls = PLSRegression(n_components=30)
X = pls.fit_transform(X, y)
X_test = pls.transform(X_test)



In [None]:
# fit_transform returned an (x_scores, y_scores) tuple; keep only the X scores
X = X[0]
print(X)

[[-3.2406207   1.30461024  0.60500718 ...  0.          0.
   0.        ]
 [-1.18948098 -0.12083718  0.40252081 ...  0.          0.
   0.        ]
 [ 2.25046596  1.0697606   0.82689294 ...  0.          0.
   0.        ]
 ...
 [-2.65499076 -1.04375939 -0.45875171 ...  0.          0.
   0.        ]
 [ 0.96429328 -1.62127777 -0.72155814 ...  0.          0.
   0.        ]
 [ 6.53432936 -0.24156533 -0.13357443 ...  0.          0.
   0.        ]]


## Making the ANN and training it on the Training set

In [None]:
import tensorflow as tf
import time

In [None]:
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

Found GPU at: /device:GPU:0


In [None]:
!nvidia-smi

Sat Jan  2 00:17:10 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.27.04    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   41C    P0    27W /  70W |    227MiB / 15079MiB |      5%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
y = y.reshape(len(y), 1)
y.shape

(30824, 1)

In [None]:
X.shape

(30824, 30)

### Initializing the ANN

In [None]:
def build_model():
    with tf.device('/gpu:0'):
        ann = tf.keras.models.Sequential()

        # Four hidden layers: 30 units, then three layers of 100 units
        ann.add(tf.keras.layers.Dense(units=30, activation='relu'))
        ann.add(tf.keras.layers.Dense(units=100, activation='relu'))
        ann.add(tf.keras.layers.Dense(units=100, activation='relu'))
        ann.add(tf.keras.layers.Dense(units=100, activation='relu'))

        # Single output unit for the price; ReLU keeps predictions non-negative
        ann.add(tf.keras.layers.Dense(units=1, activation='relu'))

        # Compile with MSE loss, matching the challenge metric
        start = time.time()
        ann.compile(loss="mean_squared_error", optimizer="adam")
        print("> Compilation Time : ", time.time() - start)
        return ann

In [None]:
ann = build_model()

> Compilation Time :  0.015689373016357422


### Training the ANN model on the Training set

In [None]:
ann.fit(x=X, y=y, batch_size=32, epochs=100)

Epoch 1/100
Epoch 2/100
...
Epoch 78

<tensorflow.python.keras.callbacks.History at 0x7fe090026198>

### Predicting the results of the Test set

In [None]:
y_pred = ann.predict(X_test)
print(y_pred)

[[ 2109.5005]
 [ 1813.2583]
 [ 2960.6238]
 ...
 [ 2654.2727]
 [14215.504 ]
 [ 1745.5387]]


## Building the csv file

In [None]:
df = pd.DataFrame(y_pred, columns=["price_usd"])

df_total = pd.concat([pd.read_csv("pred_cars.csv"), df], axis=1)

print(df_total)

     manufacturer_name model_name  ... duration_listed     price_usd
0              Renault     Megane  ...             114   2109.500488
1              Peugeot        206  ...              67   1813.258301
2           Volkswagen     Sharan  ...              50   2960.623779
3                Volvo       XC60  ...              79  13271.837891
4                Mazda          3  ...              74   5188.108887
...                ...        ...  ...             ...           ...
7702          Chrysler    Sebring  ...              81   2515.612549
7703             Geely  Emgrand 7  ...              84  21078.427734
7704          Chrysler    Sebring  ...               1   2654.272705
7705              Ford   EcoSport  ...               2  14215.503906
7706             Skoda      Fabia  ...              26   1745.538696

[7707 rows x 26 columns]


In [None]:
# Note: this overwrites the original pred_cars.csv input file
df_total.to_csv("pred_cars.csv", index=False)