# Machine Learning for Research

#### [Institute of Data Science at Maastricht University](https://www.maastrichtuniversity.nl/research/institute-data-science)

Copyright 2018 Pedro Hernandez Serrano, Maastricht University  
License: [Creative Commons Attribution 4.0 International](http://creativecommons.org/licenses/by/4.0/)

---

We will use the [Turicreate](https://github.com/apple/turicreate) library. The dataset comes from [Tom Slee's blog](http://tomslee.net/airbnb-data-collection-get-the-data) and contains the most recent listings (2017-07-22) of Airbnb rooms in Amsterdam. The data is itself extracted from [Inside Airbnb](http://insideairbnb.com/), where it is possible to explore the listings visually under the following hypothesis: Airbnb claims to be part of the "sharing economy" and to be disrupting the hotel industry. However, the data shows that the majority of Airbnb listings in most cities are entire homes, many of which are rented all year round, disrupting housing and communities.

## Answering the question

>Did you specify the type of data analytic question (e.g. exploration, association, causality) before touching the data?

>Did you define the metric for success before beginning?

>Did you understand the context for the question and the scientific or business application?

>Did you record the experimental design?

>Did you consider whether the question could be answered with the available data?


In [1]:
import turicreate as tc

In [2]:
sf_rooms = tc.SFrame('https://s3.eu-west-3.amazonaws.com/pedrohserrano-datasets/airbnb_amsterdam.csv')

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[int,int,int,str,str,str,int,float,int,int,int,str,str,float,float]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


In [3]:
sf_rooms.head()

room_id,survey_id,host_id,room_type,city,neighborhood,reviews,overall_satisfaction
10176931,1476,49180562,Shared room,Amsterdam,De Pijp / Rivierenbuurt,7,4.5
8935871,1476,46718394,Shared room,Amsterdam,Centrum West,45,4.5
14011697,1476,10346595,Shared room,Amsterdam,Watergraafsmeer,1,0.0
6137978,1476,8685430,Shared room,Amsterdam,Centrum West,7,5.0
18630616,1476,70191803,Shared room,Amsterdam,De Baarsjes / Oud West,1,0.0
5790170,1476,29968916,Shared room,Amsterdam,De Pijp / Rivierenbuurt,184,4.5
934060,1476,5037506,Shared room,Amsterdam,Oostelijk Havengebied / Indische Buurt ...,67,5.0
19590049,1476,132687356,Shared room,Amsterdam,Westerpark,2,0.0
5020280,1476,4059485,Shared room,Amsterdam,Oud Oost,2,0.0
15810783,1476,84978218,Shared room,Amsterdam,Centrum West,0,0.0

accommodates,bedrooms,price,name,last_modified,latitude,longitude
2,1,156,Red Light/ Canal view apartment (Shared) ...,2017-07-23 13:06:27.391699 ...,52.356209,4.887491
4,1,126,Sunny and Cozy Living room in quite neighbours ...,2017-07-23 13:06:23.607187 ...,52.378518,4.89612
3,1,132,Amsterdam,2017-07-23 13:06:23.603546 ...,52.338811,4.943592
4,1,121,Canal boat RIDE in Amsterdam ...,2017-07-23 13:06:22.689787 ...,52.376319,4.890028
2,1,93,One room for rent in a three room appartment ...,2017-07-23 13:06:19.681469 ...,52.370384,4.852873
2,1,102,Beautiful apartment,2017-07-23 13:06:19.663975 ...,52.342265,4.897126
16,1,462,"LOTUS, Classic Dutch Saling Barge ...",2017-07-23 13:06:09.988016 ...,52.377552,4.930418
2,1,414,big boot Adam 04,2017-07-23 13:06:09.984748 ...,52.375205,4.866117
2,1,222,Bright modern appartment in East! ...,2017-07-23 13:06:07.452609 ...,52.357346,4.912887
12,1,301,"CANAL BOATTOUR AMSTERDAM covered boat 1,5 hour ...",2017-07-23 13:06:07.447989 ...,52.38661,4.890128


In [4]:
# Cast the rating to a string so it is treated as a categorical class label
sf_rooms['overall_satisfaction'] = sf_rooms['overall_satisfaction'].astype(str)

# Make a train-test split (about 80% of the rows go to training)
train_data, test_data = sf_rooms.random_split(0.8)

# Train a boosted trees classifier on the selected features
model = tc.boosted_trees_classifier.create(train_data, target='overall_satisfaction',
                                    features=['room_type',
                                              'neighborhood',
                                              'reviews',
                                              'accommodates',
                                              'bedrooms',
                                              'price'], max_iterations=10)

# Save predictions to an SArray
predictions = model.predict(test_data)

# Evaluate the model and save the results into a dictionary
results = model.evaluate(test_data)

PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.
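
Because the split above is random, the counts reported further down will change from run to run. The following is a minimal plain-Python sketch of a seeded row-wise split, only to illustrate the idea behind `random_split(0.8)`. The helper `split_rows` and the stand-in `rows` list are hypothetical and are not part of Turicreate. As far as we know, `SFrame.random_split` also accepts an optional `seed` argument for the same purpose.

```python
import random


def split_rows(rows, fraction, seed=None):
    """Send each row to the first split with probability `fraction`."""
    rng = random.Random(seed)  # a fixed seed makes the split repeatable
    first, second = [], []
    for row in rows:
        (first if rng.random() < fraction else second).append(row)
    return first, second


# Hypothetical stand-in for the listing rows (not the real dataset).
rows = list(range(1000))

train_rows, test_rows = split_rows(rows, 0.8, seed=42)
train_again, test_again = split_rows(rows, 0.8, seed=42)

# Same seed, same split; the sizes are only approximately 80/20.
print(train_rows == train_again, len(train_rows) + len(test_rows))
```

Fixing the seed makes it possible to compare models trained with different settings on exactly the same training and test rows.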



In [5]:
results['confusion_matrix'].sort('count', ascending=False)

target_label,predicted_label,count
5.0,5.0,1430
0.0,0.0,1140
4.5,5.0,743
4.5,4.5,164
5.0,4.5,158
4.0,5.0,102
4.0,4.5,25
3.5,5.0,10
3.0,5.0,2
0.0,5.0,2
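
The confusion matrix can be summarised as a single accuracy figure: the count on the diagonal (target equals prediction) divided by the total count. The sketch below does this in plain Python for the ten rows printed above. It assumes those rows are all that matter; the printed table may be truncated, so the value is only an approximation of the model's test accuracy. The `results` dictionary returned by `model.evaluate` should also contain an overall `accuracy` entry to compare against.

```python
# Rows copied from the confusion-matrix output above:
# (target_label, predicted_label, count)
confusion_rows = [
    ("5.0", "5.0", 1430),
    ("0.0", "0.0", 1140),
    ("4.5", "5.0", 743),
    ("4.5", "4.5", 164),
    ("5.0", "4.5", 158),
    ("4.0", "5.0", 102),
    ("4.0", "4.5", 25),
    ("3.5", "5.0", 10),
    ("3.0", "5.0", 2),
    ("0.0", "5.0", 2),
]

# Correct predictions sit on the diagonal: target == predicted.
correct = sum(count for target, predicted, count in confusion_rows
              if target == predicted)
total = sum(count for _, _, count in confusion_rows)
accuracy = correct / total

print("{} of {} displayed predictions are correct ({:.1%})".format(
    correct, total, accuracy))
# prints "2734 of 3776 displayed predictions are correct (72.4%)"
```

Most of the errors come from listings rated 4.5 being predicted as 5.0, so a single accuracy number hides how poorly the intermediate ratings are separated.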


**Create new data values and predict**

In [6]:
new_data = tc.SFrame({'room_type':['Entire home/apt'], 'neighborhood':['Westerpark'], 'reviews':[5], 'accommodates':[2], 'bedrooms':[1], 'price':[50]})
new_data

accommodates,bedrooms,neighborhood,price,reviews,room_type
2,1,Westerpark,50,5,Entire home/apt


In [7]:
satisfaction = model.predict(new_data)
print("Predicted Airbnb Satisfaction:  {}".format(satisfaction[0]))

Predicted Airbnb Satisfaction:  5
