We will use the [Turicreate](https://github.com/apple/turicreate) library. The dataset comes from [Tom Slee's blog](http://tomslee.net/airbnb-data-collection-get-the-data) and contains the most recent Airbnb room listings (2017-07-22) for London. The data is itself extracted from [Inside Airbnb](http://insideairbnb.com/), where you can explore a visual analysis built around the following hypothesis: Airbnb claims to be part of the "sharing economy" and to be disrupting the hotel industry. However, the data shows that the majority of Airbnb listings in most cities are entire homes, many of which are rented all year round, disrupting housing and communities.

In [10]:
import turicreate as tc

In [2]:
sf_rooms = tc.SFrame('airbnb_london.csv').dropna()

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[int,int,int,str,str,str,int,float,int,int,int,int,int,float,float]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


In [3]:
sf_rooms.head()

X1,room_id,host_id,room_type,borough,neighborhood,reviews,overall_satisfaction
1,9293,29896,Entire home/apt,Kensington and Chelsea,Chelsea Riverside,108,4.5
4,11551,43039,Entire home/apt,Lambeth,Ferndale,146,4.5
5,13913,54730,Private room,Islington,Tollington,9,5.0
6,15400,60302,Entire home/apt,Kensington and Chelsea,Stanley,56,4.5
8,18317,37014,Entire home/apt,Richmond upon Thames,Kew,11,4.5
13,26223,110865,Entire home/apt,Islington,St. Mary's,36,4.5
14,26482,110865,Entire home/apt,Islington,St. Mary's,29,4.5
15,26682,113354,Private room,Kensington and Chelsea,Redcliffe,3,4.5
16,28010,119316,Private room,Tower Hamlets,Weavers,20,4.5
17,28311,54987,Entire home/apt,Tower Hamlets,Canary Wharf,10,3.5

accommodates,bedrooms,bathrooms,price,minstay,latitude,longitude
4,2,1,150,1,51.482968,-0.174777
4,1,1,142,3,51.462254,-0.117324
2,1,1,74,1,51.568017,-0.111208
2,1,1,147,3,51.487962,-0.168981
5,2,1,143,3,51.473664,-0.287364
5,1,1,191,3,51.542362,-0.103796
4,1,1,191,2,51.53915,-0.10137
1,1,1,112,1,51.485037,-0.185547
3,1,1,90,3,51.525102,-0.073727
5,2,2,198,5,51.49974,-0.01867


In [4]:
# Cast the rating to a string so it is treated as a categorical feature
sf_rooms['overall_satisfaction'] = sf_rooms['overall_satisfaction'].astype(str)

# Make a train-test split (roughly 80% train, 20% test)
train_data, test_data = sf_rooms.random_split(0.8)

# Train several classifiers and keep the one with the best validation accuracy
model = tc.classifier.create(train_data, target='room_type',
                             features=['overall_satisfaction',
                                       'neighborhood',
                                       'reviews',
                                       'accommodates',
                                       'bedrooms',
                                       'price',
                                       'minstay'])

# Save predictions to an SArray
predictions = model.predict(test_data)

# Evaluate the model and save the results into a dictionary
results = model.evaluate(test_data)

PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.

PROGRESS: The following methods are available for this type of problem.
PROGRESS: BoostedTreesClassifier, RandomForestClassifier, DecisionTreeClassifier, LogisticClassifier
PROGRESS: The returned model will be chosen according to validation accuracy.


PROGRESS: Model selection based on validation accuracy:
PROGRESS: ---------------------------------------------
PROGRESS: BoostedTreesClassifier          : 0.8874074074074074
PROGRESS: RandomForestClassifier          : 0.88
PROGRESS: DecisionTreeClassifier          : 0.8740740740740741
PROGRESS: LogisticClassifier              : 0.8548148148148148
PROGRESS: ---------------------------------------------
PROGRESS: Selecting BoostedTreesClassifier based on validation set performance.


In [5]:
results['confusion_matrix'].sort('count', ascending=False)

target_label,predicted_label,count
Entire home/apt,Entire home/apt,1680
Private room,Private room,1593
Entire home/apt,Private room,177
Private room,Entire home/apt,139
Shared room,Private room,38
Shared room,Shared room,7
Shared room,Entire home/apt,2
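
The confusion matrix is easier to read once it is reduced to overall accuracy and per-class recall. The sketch below does that in plain Python using only the counts printed above; the helper names `accuracy` and `recall_per_class` are hypothetical and are not part of Turicreate.

```python
# Counts copied from the confusion matrix above:
# (target_label, predicted_label) -> count
confusion = {
    ("Entire home/apt", "Entire home/apt"): 1680,
    ("Private room", "Private room"): 1593,
    ("Entire home/apt", "Private room"): 177,
    ("Private room", "Entire home/apt"): 139,
    ("Shared room", "Private room"): 38,
    ("Shared room", "Shared room"): 7,
    ("Shared room", "Entire home/apt"): 2,
}


def accuracy(cm):
    """Fraction of all test rows whose predicted label equals the target."""
    total = sum(cm.values())
    correct = sum(n for (target, predicted), n in cm.items() if target == predicted)
    return correct / total


def recall_per_class(cm):
    """For each target label, the fraction of its rows predicted correctly."""
    actual = {}
    hits = {}
    for (target, predicted), n in cm.items():
        actual[target] = actual.get(target, 0) + n
        if target == predicted:
            hits[target] = hits.get(target, 0) + n
    return {label: hits.get(label, 0) / count for label, count in actual.items()}


print(f"accuracy: {accuracy(confusion):.3f}")
for label, value in recall_per_class(confusion).items():
    print(f"recall[{label}]: {value:.3f}")
```

Only 7 of the 47 shared rooms in the test set are classified correctly, so the overall accuracy is carried almost entirely by the two large classes.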


---
# PGGM Dataset

In [8]:
import turicreate as tc

In [2]:
pggm = tc.SFrame('pggm_dataset.csv').dropna()

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,str,int,str,float,float,float,float,str,str,float,float,float,float,float,float,float]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


In [3]:
import numpy as np
# Binary label from the sign of Universe_Returns_F4W: 1 if the return is >= 0, otherwise -1
pggm['Universe_Returns_F4W_cat'] = np.where(pggm['Universe_Returns_F4W']>=0, 1, -1)
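
The labelling rule can be checked on a few values without loading the dataset. This is a plain-Python sketch of the same rule; `label_direction` is a hypothetical helper, the first three values are taken from the `Universe_Returns_F4W` column shown by `pggm.head()`, and the `0.0` is an added edge case.

```python
def label_direction(ret):
    """Return 1 for a non-negative return and -1 otherwise."""
    return 1 if ret >= 0 else -1


# Three values from the head() output plus an assumed zero-return edge case
sample_returns = [-0.121111, 6.307948, -1.28358, 0.0]
labels = [label_direction(r) for r in sample_returns]
print(labels)
```

Note that a return of exactly zero falls into the positive class under this rule.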

In [4]:
pggm.head()

Identifier,Name,Period,Period_YYYYMMDD,Ticker,Universe_Returns_F1W,Universe_Returns_F4W
17290810,Cintas Corporation,12/31/2014,20141231,CTAS-US,-2.517855,-0.121111
80589M10,SCANA Corporation,12/31/2014,20141231,SCG-US,2.036428,6.307948
50241310,L-3 Communications Holdings Inc. ...,12/31/2014,20141231,LLL-US,-0.396162,-1.28358
91301710,United Technologies Corporation ...,12/31/2014,20141231,UTX-US,-1.973909,1.669562
92939U10,Wisconsin Energy Corporation ...,12/31/2014,20141231,WEC-US,1.118696,7.679176
00130H10,AES Corporation,12/31/2014,20141231,AES-US,-5.374008,-10.530144
31190010,Fastenal Company,12/31/2014,20141231,FAST-US,-4.352397,-5.361647
03662Q10,"ANSYS, Inc.",12/31/2014,20141231,ANSS-US,-2.792686,-0.493902
20911510,"Consolidated Edison, Inc.",12/31/2014,20141231,ED-US,1.590657,7.332218
29444U70,"Equinix, Inc.",12/31/2014,20141231,EQIX-US,-2.902132,-4.247337

Universe_Returns_F12W,Weight,GICS_Sector,GICS_Ind_Grp,Market_Cap_USD,Price_USD
4.156041,0.000402,Industrials,Commercial & Professional Services ...,7761.12,78.44
-8.426744,0.000422,Utilities,Utilities,8151.001,60.4
-0.753021,0.000563,Industrials,Capital Goods,10883.341,126.21
1.815629,0.005174,Industrials,Capital Goods,99942.99,115.0
-6.160975,0.000616,Utilities,Utilities,11893.872,52.74
-8.48009,0.00049,Utilities,Utilities,9461.443,13.77
-11.370432,0.000694,Industrials,Capital Goods,13408.662,47.56
5.390239,0.000391,Information Technology,Software & Services,7545.312,82.0
-7.292783,0.001001,Utilities,Utilities,19333.338,66.01
1.67712,0.000624,Information Technology,Software & Services,12060.699,226.73

NTM_EP,LTM_ROA,BP,LTM_EP,5Y_Sales_Growth,Universe_Returns_F4W_cat
0.044387,9.089989,0.246962,0.042708,4.718765,-1
0.061397,3.472852,0.572871,0.062748,-0.949881,1
0.060554,4.744629,0.570099,0.059821,-4.316938,-1
0.062889,6.805052,0.325584,0.059088,3.083364,1
0.051359,4.201019,0.369798,0.050815,1.421392,1
0.097947,1.139182,0.453672,0.044365,-0.455666,-1
0.039999,21.42764,0.135137,0.033642,9.695387,-1
0.044055,9.653068,0.303087,0.03378,12.52306,-1
0.05951,3.008556,0.657278,0.06393,-1.785031,1
0.024895,1.895395,0.202957,0.011752,23.247128,-1


In [5]:
pggm.column_names()

['Identifier',
 'Name',
 'Period',
 'Period_YYYYMMDD',
 'Ticker',
 'Universe_Returns_F1W',
 'Universe_Returns_F4W',
 'Universe_Returns_F12W',
 'Weight',
 'GICS_Sector',
 'GICS_Ind_Grp',
 'Market_Cap_USD',
 'Price_USD',
 'NTM_EP',
 'LTM_ROA',
 'BP',
 'LTM_EP',
 '5Y_Sales_Growth',
 'Universe_Returns_F4W_cat']

In [6]:
# Cast the label to a string so it is treated as a class rather than a number
pggm['Universe_Returns_F4W_cat'] = pggm['Universe_Returns_F4W_cat'].astype(str)

# Make a train-test split (roughly 80% train, 20% test)
train_data, test_data = pggm.random_split(0.8)

# Train several classifiers and keep the one with the best validation accuracy
model = tc.classifier.create(train_data, target='Universe_Returns_F4W_cat',
                             features=['Universe_Returns_F1W',
                                       'Universe_Returns_F12W',
                                       'Weight',
                                       'GICS_Sector',
                                       'GICS_Ind_Grp',
                                       'Market_Cap_USD',
                                       'Price_USD',
                                       'NTM_EP',
                                       'LTM_ROA',
                                       'BP',
                                       'LTM_EP',
                                       '5Y_Sales_Growth'])

# Save predictions to an SArray
predictions = model.predict(test_data)

# Evaluate the model and save the results into a dictionary
results = model.evaluate(test_data)

PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.

PROGRESS: The following methods are available for this type of problem.
PROGRESS: BoostedTreesClassifier, RandomForestClassifier, DecisionTreeClassifier, SVMClassifier, LogisticClassifier
PROGRESS: The returned model will be chosen according to validation accuracy.


PROGRESS: Model selection based on validation accuracy:
PROGRESS: ---------------------------------------------
PROGRESS: BoostedTreesClassifier          : 0.7293136626042335
PROGRESS: RandomForestClassifier          : 0.7203335471456062
PROGRESS: DecisionTreeClassifier          : 0.7280307889672867
PROGRESS: SVMClassifier                   : 0.7305965362411803
PROGRESS: LogisticClassifier              : 0.731879409878127
PROGRESS: ---------------------------------------------
PROGRESS: Selecting LogisticClassifier based on validation set performance.
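
The log states that the returned model is chosen by validation accuracy. The sketch below reproduces that choice from the printed numbers and shows how close the candidates are; it illustrates the selection rule only and is not Turicreate's internal code.

```python
# Validation accuracies copied from the PROGRESS log above
validation_accuracy = {
    "BoostedTreesClassifier": 0.7293136626042335,
    "RandomForestClassifier": 0.7203335471456062,
    "DecisionTreeClassifier": 0.7280307889672867,
    "SVMClassifier": 0.7305965362411803,
    "LogisticClassifier": 0.731879409878127,
}

# Rank candidates from best to worst validation accuracy
ranked = sorted(validation_accuracy.items(), key=lambda kv: kv[1], reverse=True)
best_name, best_acc = ranked[0]
runner_up_name, runner_up_acc = ranked[1]

# Gap to the runner-up and to the weakest candidate
margin = best_acc - runner_up_acc
spread = best_acc - ranked[-1][1]

print(f"selected:  {best_name} ({best_acc:.4f})")
print(f"runner-up: {runner_up_name} ({runner_up_acc:.4f})")
print(f"margin: {margin:.4f}, spread across all models: {spread:.4f}")
```

All five candidates land within about 1.2 percentage points of each other, so a different random split could plausibly select a different model.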


In [7]:
results['confusion_matrix'].sort('count', ascending=False)

target_label,predicted_label,count
1,1,3068
-1,-1,2526
-1,1,1096
1,-1,895
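
To put the test result in context, it helps to compare the model's accuracy with a baseline that always predicts the most frequent label. The sketch below computes both from the counts printed above; the variable names are illustrative.

```python
# Counts copied from the confusion matrix above:
# (target_label, predicted_label) -> count
confusion = {
    ("1", "1"): 3068,
    ("-1", "-1"): 2526,
    ("-1", "1"): 1096,
    ("1", "-1"): 895,
}

total = sum(confusion.values())

# Model accuracy: rows on the diagonal divided by all rows
model_accuracy = sum(n for (t, p), n in confusion.items() if t == p) / total

# Number of test rows per true label
class_totals = {}
for (target, _predicted), n in confusion.items():
    class_totals[target] = class_totals.get(target, 0) + n

# Baseline: always predict the most frequent true label
majority_label = max(class_totals, key=class_totals.get)
baseline_accuracy = class_totals[majority_label] / total

print(f"model accuracy:    {model_accuracy:.3f}")
print(f"majority baseline: {baseline_accuracy:.3f} (always predict {majority_label})")
```

On these counts the classifier scores roughly 0.74 against a majority-class baseline of roughly 0.52.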
