# Regression & Classification

In [2]:
import graphlab as gl
gl.canvas.set_target('ipynb')

## Data Overview
This notebook uses a subset of the data from the Yelp Dataset Challenge. The task is to predict the star rating that a given user gives a restaurant. The dataset comprises three tables that cover 11,537 businesses, 8,282 check-ins, 43,873 users, and 229,907 reviews. The full dataset and its documentation are available on the Yelp website.

### Review Data

The review table includes information about each review. Specifically, it contains:

* business_id: An encrypted business ID for the business being reviewed.
* user_id: An encrypted user ID for the user who provided the review.
* stars: A star rating (on a scale of 1-5).
* text: The raw review text.
* date: The review date, formatted like '2012-03-14'.
* votes: The number of 'useful', 'funny' or 'cool' votes provided by other users for this review.

### User Data

The user table consists of details about each user:

* user_id: The encrypted user ID (cross referenced in the Review table)
* name: First name.
* review_count: Total number of reviews made by the user.
* average_stars: Average rating (on a scale of 1-5) made by the user.
* votes: For each vote type (i.e., 'useful', 'funny', 'cool'), the total number of votes received by reviews made by this user.

### Business Data

The business table contains details about each business:

* business_id: Encrypted business ID (cross referenced in the Review table)
* name: Business name.
* neighborhoods: Neighborhoods served by the business.
* full_address: Address (text format)
* city: City where the business is located.
* state: State where the business is located.
* latitude: Latitude of the business.
* longitude: Longitude of the business.
* stars: A star rating (rounded to half-stars) for this business.
* review_count: The total number of reviews about this business.
* categories: Category tags for this business.
* open: Is this business still open? (True/False)

In [4]:
business = gl.SFrame('http://s3.amazonaws.com/dato-datasets/regression/business.csv')
user = gl.SFrame('http://s3.amazonaws.com/dato-datasets/regression/user.csv')
review = gl.SFrame('http://s3.amazonaws.com/dato-datasets/regression/review.csv')

PROGRESS: Finished parsing file http://s3.amazonaws.com/dato-datasets/regression/business.csv
PROGRESS: Parsing completed. Parsed 100 lines in 0.060661 secs.
------------------------------------------------------
Inferred types from first line of file as 
column_type_hints=[str,list,str,str,float,float,str,int,int,float,str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------
PROGRESS: Finished parsing file http://s3.amazonaws.com/dato-datasets/regression/business.csv
PROGRESS: Parsing completed. Parsed 11537 lines in 0.071249 secs.
PROGRESS: Finished parsing file http://s3.amazonaws.com/dato-datasets/regression/user.csv
PROGRESS: Parsing completed. Parsed 100 lines in 0.083054 secs.
------------------------------------------------------
Inferred types from first line of file as 
column_type_hints=[float,str,int,str,str,int,int,int]
If p

In [5]:
review.show()

In [6]:
user.show()

In [7]:
business.show()

In [13]:
rev_bus_SF = review.join(business, how='inner', on='business_id')
rev_bus_SF.head()

business_id,date,review_id,stars,text,type
9yKzy9PApeiPPOUJEtnvkg,2011-01-26,fWKvX83p0-ka4JS3dc6E5A,5,My wife took me here on my birthday for break ...,review
ZRJwVLyzEJq1VAihDhYiow,2011-07-27,IjZ33sJrzXqU-0X6U8NwyA,5,I have no idea why some people give bad reviews ...,review
6oRAC4uyJCsJl1X0WZpVSA,2012-06-14,IESLBzqUCLdSzSqm0eCSxQ,4,love the gyro plate. Rice is so good and I also ...,review
_1QQZuf4zZOyFCvXc0o6Vg,2010-05-27,G-WvGaISbqqaMHlNnByodA,5,"Rosie, Dakota, and I LOVE Chaparral Dog Park!!! ...",review
6ozycU1RpktNG2-1BroVtw,2012-01-05,1uJFq2r5QfJG_6ExMRCaGw,5,General Manager Scott Petello is a good egg!!! ...,review
-yxfBYGB6SEqszmxJxd97A,2007-12-13,m2CKSsepBCoRYWxiRUsxAg,4,"Quiessence is, simply put, beautiful. Full ...",review
zp713qNhx8d9KCJJnrw1xA,2010-02-12,riFQ3vxNpP4rWLk_CSri2A,5,Drop what you're doing and drive here. After I ...,review
hW0Ne_HTHEAgGF1rAdmR-g,2012-07-12,JL7GXJ9u4YMx7Rzs05NfiQ,4,"Luckily, I didn't have to travel far to make my ...",review
wNUea3IXZWD63bbOQaOH-g,2012-08-17,XtnfnYmnJYi71yIuGsXIUA,4,Definitely come for Happy hour! Prices are amaz ...,review
nMHhuYan8e3cONo3PornJA,2010-08-11,jJAIXA46pU1swYyRCdfXtQ,5,Nobuo shows his unique talents with everything ...,review

user_id,votes,year,month,day,categories,city
rLtl8ZkDX5vH5nAx9C3q5Q,"{'funny': 0, 'useful': 5, 'cool': 2} ...",2011,1,26,"[Breakfast & Brunch, Restaurants] ...",Phoenix
0a2KyEL0d3Yb1V6aivbIuQ,"{'funny': 0, 'useful': 0, 'cool': 0} ...",2011,7,27,"[Italian, Pizza, Restaurants] ...",Phoenix
0hT2KtfLiobPvh6cDC8JQg,"{'funny': 0, 'useful': 1, 'cool': 0} ...",2012,6,14,"[Middle Eastern, Restaurants] ...",Tempe
uZetl9T0NcROGOyFfughhg,"{'funny': 0, 'useful': 2, 'cool': 1} ...",2010,5,27,"[Active Life, Dog Parks, Parks] ...",Scottsdale
vYmM4KTsC8ZfQBg-j5MWkw,"{'funny': 0, 'useful': 0, 'cool': 0} ...",2012,1,5,"[Tires, Automotive]",Mesa
sqYN3lNgvPbPCTRsMFu27g,"{'funny': 1, 'useful': 3, 'cool': 4} ...",2007,12,13,"[Wine Bars, Bars, American (New), ...",Phoenix
wFweIWhv2fREZV_dYkz_1g,"{'funny': 4, 'useful': 7, 'cool': 7} ...",2010,2,12,"[Mexican, Restaurants]",Phoenix
1ieuYcKS7zeAv_U15AB13A,"{'funny': 0, 'useful': 1, 'cool': 0} ...",2012,7,12,"[Hotels & Travel, Airports] ...",Phoenix
Vh_DlizgGhSqQh4qfZ2h6A,"{'funny': 0, 'useful': 0, 'cool': 0} ...",2012,8,17,"[Sushi Bars, Restaurants]",Phoenix
sUNkXg8-KFtCMQDV6zRzQg,"{'funny': 0, 'useful': 1, 'cool': 0} ...",2010,8,11,"[Food, Tea Rooms, Japanese, Restaurants] ...",Phoenix

full_address,latitude,longitude,name,open,review_count,stars.1,state
"6106 S 32nd St\nPhoenix, AZ 85042 ...",33.3908,-112.013,Morning Glory Cafe,1,116,4.0,AZ
"4848 E Chandler Blvd\nPhoenix, AZ 85044 ...",33.3056,-111.979,Spinato's Pizzeria,1,102,4.0,AZ
"1513 E Apache Blvd\nTempe, AZ 85281 ...",33.4143,-111.913,Haji-Baba,1,265,4.5,AZ
"5401 N Hayden Rd\nScottsdale, AZ 85250 ...",33.5229,-111.908,Chaparral Dog Park,1,88,4.5,AZ
"1357 S Power Road\nMesa, AZ 85206 ...",33.391,-111.684,Discount Tire,1,5,4.5,AZ
"6106 S 32nd St\nPhoenix, AZ 85042 ...",33.3908,-112.013,Quiessence Restaurant,1,109,3.5,AZ
"1919 N 16th St\nPhoenix, AZ 85006 ...",33.4691,-112.048,La Condesa Gourmet Taco Shop ...,1,307,4.0,AZ
"3400 E Sky Harbor Blvd\nPhoenix, AZ 85034 ...",33.4348,-112.006,Phoenix Sky Harbor International Airport ...,1,862,3.0,AZ
"2574 E Camelback Rd\nPhoenix, AZ 85016 ...",33.5096,-112.026,Stingray Sushi,1,163,3.0,AZ
"622 E Adams St\nPhoenix, AZ 85004 ...",33.4495,-112.066,Nobuo At Teeter House,1,189,4.5,AZ

type.1
business
business
business
business
business
business
business
business
business
business


In [14]:
# The join appends '.1' to business columns whose names collide with review columns.
rev_bus_SF = rev_bus_SF.rename({'stars.1': 'business_avg_stars',
                                'type.1': 'business_type',
                                'review_count': 'business_review_count'})

In [15]:
user_rev_bus_SF = rev_bus_SF.join(user, how='inner', on='user_id')

In [16]:
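# After this join, 'name.1' and 'type.1' come from the user table; 'review_count' is the user's count.
user_rev_bus_SF = user_rev_bus_SF.rename({'name.1': 'user_name',
                                          'type.1': 'user_type',
                                          'average_stars': 'user_avg_stars',
                                          'review_count': 'user_review_count'})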

In [17]:
user_rev_bus_SF.head(3)

business_id,date,review_id,stars,text,type
9yKzy9PApeiPPOUJEtnvkg,2011-01-26,fWKvX83p0-ka4JS3dc6E5A,5,My wife took me here on my birthday for break ...,review
ZRJwVLyzEJq1VAihDhYiow,2011-07-27,IjZ33sJrzXqU-0X6U8NwyA,5,I have no idea why some people give bad reviews ...,review
6oRAC4uyJCsJl1X0WZpVSA,2012-06-14,IESLBzqUCLdSzSqm0eCSxQ,4,love the gyro plate. Rice is so good and I also ...,review

user_id,votes,year,month,day,categories,city
rLtl8ZkDX5vH5nAx9C3q5Q,"{'funny': 0, 'useful': 5, 'cool': 2} ...",2011,1,26,"[Breakfast & Brunch, Restaurants] ...",Phoenix
0a2KyEL0d3Yb1V6aivbIuQ,"{'funny': 0, 'useful': 0, 'cool': 0} ...",2011,7,27,"[Italian, Pizza, Restaurants] ...",Phoenix
0hT2KtfLiobPvh6cDC8JQg,"{'funny': 0, 'useful': 1, 'cool': 0} ...",2012,6,14,"[Middle Eastern, Restaurants] ...",Tempe

full_address,latitude,longitude,name,open,business_review_count,business_avg_stars
"6106 S 32nd St\nPhoenix, AZ 85042 ...",33.3908,-112.013,Morning Glory Cafe,1,116,4.0
"4848 E Chandler Blvd\nPhoenix, AZ 85044 ...",33.3056,-111.979,Spinato's Pizzeria,1,102,4.0
"1513 E Apache Blvd\nTempe, AZ 85281 ...",33.4143,-111.913,Haji-Baba,1,265,4.5

state,business_type,user_avg_stars,user_name,user_review_count,user_type,votes_funny,votes_cool,votes_useful
AZ,business,3.72,Jason,376,user,331,322,1034
AZ,business,5.0,Paul,2,user,2,0,0
AZ,business,4.33,Nicole,3,user,0,0,3


In [18]:
# Hold out roughly 20% of the rows for testing; the seed makes the split reproducible.
train, test = user_rev_bus_SF.random_split(0.8, seed=1)

In [20]:
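# gl.regression.create picks a regression model automatically (boosted trees here, per the log below).
model = gl.regression.create(train,
                             target='stars',
                             features=['user_avg_stars', 'business_avg_stars',
                                       'user_review_count', 'business_review_count'])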

PROGRESS: Boosted trees regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 172608
PROGRESS: Number of features          : 4
PROGRESS: Number of unpacked features : 4
PROGRESS: Starting Boosted Trees
PROGRESS: --------------------------------------------------------
PROGRESS:   Iter        RMSE Elapsed time
PROGRESS:      0   2.540e+00        0.16s
PROGRESS:      1   1.906e+00        0.47s
PROGRESS:      2   1.500e+00        0.64s
PROGRESS:      3   1.253e+00        0.82s
PROGRESS:      4   1.112e+00        0.99s
PROGRESS:      5   1.036e+00        1.16s
PROGRESS:      6   9.958e-01        1.33s
PROGRESS:      7   9.753e-01        1.56s
PROGRESS:      8   9.648e-01        1.74s
PROGRESS:      9   9.594e-01        1.93s


In [21]:
predictions = model.predict(test)

In [22]:
predictions.head()

dtype: float
Rows: 10
[2.882311338823247, 4.483300353741852, 4.525471693792973, 3.632639937409009, 4.593913328669084, 4.0762198455695975, 4.814230824053438, 4.086396972627913, 3.3185162552237286, 4.5956380832787245]

In [23]:
model.evaluate(test)

{'max_error': 3.7118803532122646, 'rmse': 0.9627572562293729}
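
The two metrics returned by `model.evaluate` can be reproduced by hand. The sketch below is a plain-Python illustration of the definitions, not GraphLab's implementation; `rmse` and `max_error` are hypothetical helper names, and the sample ratings are made up.

```python
import math

def rmse(actual, predicted):
    """Root-mean-square error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def max_error(actual, predicted):
    """Largest absolute difference between a target and its prediction."""
    return max(abs(a - p) for a, p in zip(actual, predicted))

# Made-up ratings and predictions, for illustration only.
actual = [5, 4, 1, 3]
predicted = [4.5, 4.0, 2.5, 3.0]

print(rmse(actual, predicted))
print(max_error(actual, predicted))
```

An RMSE close to 1 therefore means the predictions are typically off by about one star, while `max_error` reports the single worst prediction.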

In [28]:
# Compare predictions with the true ratings: average prediction and row count per actual rating.
sf = gl.SFrame()
sf['Predicted_Rating'] = predictions
sf['Actual_Rating'] = test['stars']
predict_count = sf.groupby('Actual_Rating',
                           [gl.aggregate.AVG('Predicted_Rating'),
                            gl.aggregate.COUNT('Predicted_Rating')])
predict_count.topk('Actual_Rating', k=5, reverse=True)

Actual_Rating,Avg of Predicted_Rating,Count
1,2.5274579684,3280
2,3.16392865092,4003
3,3.4532731599,6455
4,3.73709090454,15150
5,4.13840883145,14383


In [32]:
model.list_fields()

['column_subsample',
 'features',
 'max_depth',
 'max_iterations',
 'min_child_weight',
 'min_loss_reduction',
 'num_examples',
 'num_features',
 'num_trees',
 'num_unpacked_features',
 'num_validation_examples',
 'row_subsample',
 'step_size',
 'target',
 'training_rmse',
 'training_time',
 'trees_json',
 'unpacked_features',
 'validation_rmse']

In [34]:
model.summary()

Class                         : BoostedTreesRegression

Schema
------
Number of examples            : 172608
Number of feature columns     : 4
Number of unpacked features   : 4

Settings
--------
Number of trees               : 10
Max tree depth                : 6
Train RMSE                    : 0.9594
Validation RMSE               : None
Training time (sec)           : 2.1059



In [40]:
# Binary target: 1 if the review gave at least 3 stars, 0 otherwise.
user_rev_bus_SF['is_good'] = user_rev_bus_SF['stars'] >= 3

In [41]:
train, test = user_rev_bus_SF.random_split(0.8, seed=1)

In [43]:
model = gl.logistic_classifier.create(train, target="is_good", 
                                      features = ['user_avg_stars','business_avg_stars', 
                                                'user_review_count', 'business_review_count'])

PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.

PROGRESS: Logistic regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 163883
PROGRESS: Number of classes           : 2
PROGRESS: Number of feature columns   : 4
PROGRESS: Number of unpacked features : 4
PROGRESS: Number of coefficients    : 5
PROGRESS: Starting Newton Method
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+--------------+-------------------+---------------------+
PROGRESS: | Iteration | Passes   | Elapsed Time | Training-accuracy | Validation-accuracy |
PROGRESS: +-----------+----------+--------------+-------------------+---------------------+
PROGRESS: | 1         | 2        | 1.326315     | 0.863878          | 0.862120            |
PROGRESS: | 2         | 3        | 1.581255     | 

In [44]:
model.predict(test)

dtype: int
Rows: 43271
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... ]

In [46]:
model.predict(test, output_type='margin').head()

dtype: float
Rows: 10
[0.47645544358402425, 3.9443249528954585, 3.8910113277722296, 2.2093278863022263, 4.3851715061667615, 2.8464627090705754, 5.937036818319687, 2.9983530979984607, 1.2756987925740475, 4.225598974611657]

In [47]:
model.predict(test, output_type='probability').head()

dtype: float
Rows: 10
[0.6169105307869611, 0.9810035685271968, 0.9799841380229354, 0.9010840364025863, 0.9876926084839169, 0.945135547433027, 0.9973671100096887, 0.9524996698117504, 0.7817167263260186, 0.9855939894360589]
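
The `margin` and `probability` outputs are two views of the same score: in logistic regression, the probability is the logistic (sigmoid) function of the margin. The sketch below checks this against the first two values printed above; `sigmoid` is a helper defined here for illustration, not a GraphLab function.

```python
import math

def sigmoid(margin):
    """Logistic function: maps a real-valued margin to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-margin))

# First two margins and probabilities copied from the outputs above.
margins = [0.47645544358402425, 3.9443249528954585]
reported = [0.6169105307869611, 0.9810035685271968]

for m, p in zip(margins, reported):
    # The computed value should agree with the reported probability.
    print(m, sigmoid(m), p)
```

A margin of 0 corresponds to a probability of 0.5, so a positive margin means the model predicts class 1.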

In [48]:
model.evaluate(test)

{'accuracy': 0.8650828499456911, 'confusion_matrix': Columns:
 	target_label	int
 	predicted_label	int
 	count	int
 
 Rows: 4
 
 Data:
 +--------------+-----------------+-------+
 | target_label | predicted_label | count |
 +--------------+-----------------+-------+
 |      0       |        0        |  2378 |
 |      0       |        1        |  4905 |
 |      1       |        0        |  933  |
 |      1       |        1        | 35055 |
 +--------------+-----------------+-------+
 [4 rows x 3 columns]}
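
Accuracy alone hides how the errors are distributed, so it helps to derive a few more numbers from the confusion matrix. The sketch below uses only the four counts printed above; the variable names are chosen here for illustration.

```python
# Counts copied from the confusion matrix above: (target_label, predicted_label) -> count.
confusion = {(0, 0): 2378, (0, 1): 4905, (1, 0): 933, (1, 1): 35055}

total = sum(confusion.values())
accuracy = (confusion[(0, 0)] + confusion[(1, 1)]) / float(total)

# Precision and recall for the positive class (is_good == 1).
precision = confusion[(1, 1)] / float(confusion[(1, 1)] + confusion[(0, 1)])
recall = confusion[(1, 1)] / float(confusion[(1, 1)] + confusion[(1, 0)])

# Accuracy of always predicting the majority class (1), as a baseline.
baseline = (confusion[(1, 0)] + confusion[(1, 1)]) / float(total)

print(accuracy, precision, recall, baseline)
```

Because most reviews are labeled good, the majority-class baseline is already high, and most of the model's mistakes are actual negatives predicted as positive (4905 of 7283).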

# Multiclass Classification

In [50]:
model = gl.logistic_classifier.create(train, target="stars", 
                                      features = ['user_avg_stars','business_avg_stars', 
                                                'user_review_count', 'business_review_count'])

PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.

PROGRESS: Logistic regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 164120
PROGRESS: Number of classes           : 5
PROGRESS: Number of feature columns   : 4
PROGRESS: Number of unpacked features : 4
PROGRESS: Number of coefficients    : 20
PROGRESS: Starting Newton Method
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+--------------+-------------------+---------------------+
PROGRESS: | Iteration | Passes   | Elapsed Time | Training-accuracy | Validation-accuracy |
PROGRESS: +-----------+----------+--------------+-------------------+---------------------+
PROGRESS: | 1         | 2        | 0.456509     | 0.450494          | 0.451343            |
PROGRESS: | 2         | 3        | 0.714741     |

In [51]:
model.predict_topk(test,output_type='probability', k=2)

id,class,probability
0,4,0.290756917045
0,3,0.246313849962
1,5,0.708060448989
1,4,0.244908802068
2,5,0.673131287052
2,4,0.270266511659
3,4,0.453833383631
3,5,0.246664523108
4,5,0.720110955415
4,4,0.23953510783


In [53]:
model.predict_topk(test, output_type = 'rank', k = 2)

id,class,rank
0,4,0
0,3,1
1,5,0
1,4,1
2,5,0
2,4,1
3,4,0
3,5,1
4,5,0
4,4,1
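
Conceptually, `predict_topk` sorts the per-class probabilities of each row and keeps the best `k`, reporting either the probability or the zero-based rank. The following is a minimal plain-Python sketch of that idea, not GraphLab's implementation; `top_k` and the probabilities are hypothetical.

```python
def top_k(class_probs, k):
    """Return (class, probability, rank) triples for the k most probable classes."""
    ranked = sorted(class_probs.items(), key=lambda item: item[1], reverse=True)
    return [(label, prob, rank) for rank, (label, prob) in enumerate(ranked[:k])]

# Hypothetical class probabilities for one review (they sum to 1).
probs = {1: 0.05, 2: 0.10, 3: 0.25, 4: 0.40, 5: 0.20}

for label, prob, rank in top_k(probs, k=2):
    print(label, prob, rank)
```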


In [54]:
model.evaluate(test)

{'accuracy': 0.47935106653416837, 'confusion_matrix': Columns:
 	target_label	int
 	predicted_label	int
 	count	int
 
 Rows: 25
 
 Data:
 +--------------+-----------------+-------+
 | target_label | predicted_label | count |
 +--------------+-----------------+-------+
 |      0       |        2        |  129  |
 |      1       |        1        |   19  |
 |      3       |        1        |   17  |
 |      3       |        0        |  391  |
 |      0       |        0        |  1600 |
 |      0       |        1        |   18  |
 |      1       |        3        |  2625 |
 |      4       |        2        |   43  |
 |      4       |        3        |  5107 |
 |      0       |        3        |  1377 |
 +--------------+-----------------+-------+
 [25 rows x 3 columns]
 Note: Only the head of the SFrame is printed.
 You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.}

In [57]:
review['stars'].astype(str).show()

In [59]:
model = gl.logistic_classifier.create(train, target="stars", 
                                      features = ['user_avg_stars','business_avg_stars', 
                                                'user_review_count', 'business_review_count'], 
                                      class_weights = 'auto')

PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.

PROGRESS: Logistic regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 164050
PROGRESS: Number of classes           : 5
PROGRESS: Number of feature columns   : 4
PROGRESS: Number of unpacked features : 4
PROGRESS: Number of coefficients    : 20
PROGRESS: Starting Newton Method
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+--------------+-------------------+---------------------+
PROGRESS: | Iteration | Passes   | Elapsed Time | Training-accuracy | Validation-accuracy |
PROGRESS: +-----------+----------+--------------+-------------------+---------------------+
PROGRESS: | 1         | 2        | 0.499012     | 0.381530          | 0.387240            |
PROGRESS: | 2         | 3        | 0.737497     |

In [60]:
model.evaluate(test)

{'accuracy': 0.42573548103810865, 'confusion_matrix': Columns:
 	target_label	int
 	predicted_label	int
 	count	int
 
 Rows: 25
 
 Data:
 +--------------+-----------------+-------+
 | target_label | predicted_label | count |
 +--------------+-----------------+-------+
 |      1       |        3        |  726  |
 |      1       |        1        |  1011 |
 |      0       |        2        |  255  |
 |      0       |        1        |  626  |
 |      0       |        0        |  1969 |
 |      4       |        2        |  810  |
 |      4       |        3        |  2896 |
 |      0       |        3        |  262  |
 |      1       |        4        |  383  |
 |      2       |        1        |  1464 |
 +--------------+-----------------+-------+
 [25 rows x 3 columns]
 Note: Only the head of the SFrame is printed.
 You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.}
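
As far as the GraphLab Create documentation describes it, `class_weights='auto'` weights each class inversely to its frequency in the training data, so rare ratings count for more during training. The sketch below illustrates inverse-frequency weighting in plain Python. The normalization is an assumption made for this example, and the counts are the per-rating counts from the `Actual_Rating` table earlier in the notebook (test split), used only to show the imbalance.

```python
# Per-class review counts copied from the Actual_Rating table earlier in the notebook
# (test split; used here only to illustrate the imbalance).
counts = {1: 3280, 2: 4003, 3: 6455, 4: 15150, 5: 14383}

total = sum(counts.values())
num_classes = len(counts)

# Assumed normalization: weight = total / (num_classes * count), so that a
# perfectly balanced dataset would give every class a weight of 1.
weights = {label: total / float(num_classes * n) for label, n in counts.items()}

for label in sorted(weights):
    print(label, round(weights[label], 3))
```

This is consistent with the results above: overall accuracy drops from about 0.479 to about 0.426, but the weighted model predicts the rarer ratings far more often than the unweighted one.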

# Feature Engineering

In [62]:
train_set, test_set = user_rev_bus_SF.random_split(0.8, seed=1)

In [63]:
train_set['city'].show()

In [64]:
model = gl.linear_regression.create(train_set, target='stars', 
                                    features = ['user_avg_stars','business_avg_stars', 
                                                'user_review_count', 'business_review_count', 
                                                'city'])

PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.

PROGRESS: Linear regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 163999
PROGRESS: Number of features          : 5
PROGRESS: Number of unpacked features : 5
PROGRESS: Number of coefficients    : 65
PROGRESS: Starting Newton Method
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
PROGRESS: | Iteration | Passes   | Elapsed Time | Training-max_error | Validation-max_error | Training-rmse | Validation-rmse |
PROGRESS: +-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
PROGRESS: | 1         | 2        | 0.147692     | 3.970074           | 3.8750

In [65]:
model.evaluate(test_set)

{'max_error': 4.016611633717115, 'rmse': 0.9709879379195403}

In [66]:
model.summary()

Class                         : LinearRegression

Schema
------
Number of coefficients        : 65
Number of examples            : 163999
Number of feature columns     : 5
Number of unpacked features   : 5

Hyperparameters
---------------
L1 penalty                    : 0.0
L2 penalty                    : 0.01

Training Summary
----------------
Solver                        : auto
Solver iterations             : 1
Solver status                 : SUCCESS: Optimal solution found.
Training time (sec)           : 0.1914

Settings
--------
Residual sum of squares       : 154630.2751
Training RMSE                 : 0.971

Highest Positive Coefficients
-----------------------------
user_avg_stars                : 0.8108
business_avg_stars            : 0.7795
city[Sun City Anthem]         : 0.4094
city[North Pinal]             : 0.3632
city[Tonopah]                 : 0.2502

Lowest Negative Coefficients
----------------------------
(intercept)                   : -2.2281
city[Charleston]      
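
The summary lists one coefficient per city value (for example `city[Sun City Anthem]`), which reflects that a string column such as `city` is treated as categorical and expanded into indicator features. The sketch below illustrates that expansion in plain Python. `one_hot` is a hypothetical helper, and details of GraphLab's internal encoding (such as whether a reference category is dropped) are not shown in this notebook.

```python
def one_hot(values):
    """Expand a list of category strings into one indicator column per category."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows

# City values taken from the first rows of the joined table.
cities = ['Phoenix', 'Phoenix', 'Tempe', 'Scottsdale', 'Mesa']
categories, rows = one_hot(cities)

print(categories)
for city, row in zip(cities, rows):
    print(city, row)
```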

In [67]:
model = gl.linear_regression.create(train_set, target='stars', 
                                    features = ['user_id','business_id',
                                                'user_avg_stars','business_avg_stars'],
                                    max_iterations=10)

PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.

PROGRESS: Linear regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 163812
PROGRESS: Number of features          : 4
PROGRESS: Number of unpacked features : 4
PROGRESS: Number of coefficients    : 49546
PROGRESS: Starting L-BFGS
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+-----------+--------------+--------------------+----------------------+---------------+-----------------+
PROGRESS: | Iteration | Passes   | Step size | Elapsed Time | Training-max_error | Validation-max_error | Training-rmse | Validation-rmse |
PROGRESS: +-----------+----------+-----------+--------------+--------------------+----------------------+---------------+-----------------+
PROGRESS: | 1         | 6        | 0.000001  

In [68]:
model.summary()

Class                         : LinearRegression

Schema
------
Number of coefficients        : 49546
Number of examples            : 163812
Number of feature columns     : 4
Number of unpacked features   : 4

Hyperparameters
---------------
L1 penalty                    : 0.0
L2 penalty                    : 0.01

Training Summary
----------------
Solver                        : auto
Solver iterations             : 10
Solver status                 : TERMINATED: Iteration limit reached.
Training time (sec)           : 1.5253

Settings
--------
Residual sum of squares       : 111921.2105
Training RMSE                 : 0.8266

Highest Positive Coefficients
-----------------------------
user_id[fu7wivArEkJm6ZsxJ6LNSQ]: 5.5599
user_id[IsysDvB1ZovwbzjJsHhJOw]: 5.4157
user_id[nIC5jJesAzTDjtquK335BA]: 5.1976
business_id[HvCIOs3WQiycBk_3VqdGYQ]: 5.1355
user_id[GDzMvqfqxozwOR8yNJcQnA]: 5.1172

Lowest Negative Coefficients
----------------------------
user_id[5Lj1Ox3Hf6yfKVSuJUlTbg]: -4.5199
use

In [69]:
model = gl.linear_regression.create(train_set, target='stars', 
                                    features = ['user_id','business_id',
                                                'user_avg_stars','business_avg_stars'],
                                    max_iterations=100)

PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.

PROGRESS: Linear regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 164052
PROGRESS: Number of features          : 4
PROGRESS: Number of unpacked features : 4
PROGRESS: Number of coefficients    : 49533
PROGRESS: Starting L-BFGS
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+-----------+--------------+--------------------+----------------------+---------------+-----------------+
PROGRESS: | Iteration | Passes   | Step size | Elapsed Time | Training-max_error | Validation-max_error | Training-rmse | Validation-rmse |
PROGRESS: +-----------+----------+-----------+--------------+--------------------+----------------------+---------------+-----------------+
PROGRESS: | 1         | 6        | 0.000001  

In [73]:
train_set['votes'].head(3)

dtype: dict
Rows: 3
[{'funny': 0, 'useful': 5, 'cool': 2}, {'funny': 0, 'useful': 0, 'cool': 0}, {'funny': 0, 'useful': 1, 'cool': 0}]

In [74]:
model = gl.linear_regression.create(train_set, target='stars', 
                                    features = ['user_id','business_id',
                                                'user_avg_stars','votes', 'business_avg_stars'])

PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.

PROGRESS: Linear regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 164159
PROGRESS: Number of features          : 5
PROGRESS: Number of unpacked features : 7
PROGRESS: Number of coefficients    : 49604
PROGRESS: Starting L-BFGS
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+-----------+--------------+--------------------+----------------------+---------------+-----------------+
PROGRESS: | Iteration | Passes   | Step size | Elapsed Time | Training-max_error | Validation-max_error | Training-rmse | Validation-rmse |
PROGRESS: +-----------+----------+-----------+--------------+--------------------+----------------------+---------------+-----------------+
PROGRESS: | 1         | 6        | 0.000001  

In [76]:
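# Read the keys in a fixed order; dict.values() does not guarantee a consistent order across rows.
train_set['votes_list'] = train_set['votes'].apply(lambda x: [x['funny'], x['useful'], x['cool']])
train_set['votes_list'].head(3)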

dtype: array
Rows: 3
[array('d', [0.0, 5.0, 2.0]), array('d', [0.0, 0.0, 0.0]), array('d', [0.0, 1.0, 0.0])]

In [80]:
train_set['categories'].head(5)

dtype: list
Rows: 5
[['Breakfast & Brunch', 'Restaurants'], ['Italian', 'Pizza', 'Restaurants'], ['Middle Eastern', 'Restaurants'], ['Active Life', 'Dog Parks', 'Parks'], ['Tires', 'Automotive']]

In [81]:
# Turn a list of tags into a {tag: 1} dictionary so each tag becomes an indicator feature.
tag_dict = lambda tags: {tag: 1 for tag in tags}

In [82]:
train_set['categories_dict'] = train_set.apply(lambda row: tag_dict(row['categories']))

In [83]:
train_set['categories_dict'].head(5)

dtype: dict
Rows: 5
[{'Breakfast & Brunch': 1, 'Restaurants': 1}, {'Restaurants': 1, 'Pizza': 1, 'Italian': 1}, {'Middle Eastern': 1, 'Restaurants': 1}, {'Dog Parks': 1, 'Parks': 1, 'Active Life': 1}, {'Tires': 1, 'Automotive': 1}]

In [84]:
model = gl.linear_regression.create(train_set, target='stars', 
                                    features = ['user_id','business_id', 'categories_dict',
                                                'user_avg_stars','votes', 'business_avg_stars'])

PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.

PROGRESS: Linear regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 163990
PROGRESS: Number of features          : 6
PROGRESS: Number of unpacked features : 515
PROGRESS: Number of coefficients    : 50067
PROGRESS: Starting L-BFGS
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+-----------+--------------+--------------------+----------------------+---------------+-----------------+
PROGRESS: | Iteration | Passes   | Step size | Elapsed Time | Training-max_error | Validation-max_error | Training-rmse | Validation-rmse |
PROGRESS: +-----------+----------+-----------+--------------+--------------------+----------------------+---------------+-----------------+
PROGRESS: | 1         | 6        | 0.000000

In [85]:
train_set['text'].head(1)

dtype: str
Rows: 1
['My wife took me here on my birthday for breakfast and it was excellent.  The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure.  Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning.  It looked like the place fills up pretty quickly so the earlier you get here the better.

Do yourself a favor and get their Bloody Mary.  It was phenomenal and simply the best I've ever had.  I'm pretty sure they only use ingredients from their garden and blend them fresh when you order it.  It was amazing.

While EVERYTHING on the menu looks excellent, I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious.  It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete.  It was the best "toast" I've ever had.

Anyway, I can't wait to go back!']

In [86]:
# Count every word in each review; the column is trimmed to negative words in a later cell.
train_set['negative_review_tags'] = gl.text_analytics.count_words(train_set['text'])
train_set['negative_review_tags'].head(1)

dtype: dict
Rows: 1
[{'better.': 1, 'looks': 1, 'go': 1, 'perfect': 1, 'everything': 1, 'menu': 1, 'had': 1, 'to': 1, 'only': 1, 'pleasure.': 1, 'pretty': 2, 'it.': 1, 'do': 1, 'them': 1, 'garden': 1, 'sitting': 1, 'food': 1, 'they': 1, 'yourself': 1, '"toast"': 1, 'bread': 1, 'like': 1, 'had.': 2, 'weather': 1, 'amazing.': 1, 'meal': 1, 'absolutely': 1, 'our': 2, 'saturday': 1, 'best': 2, 'for': 1, 'phenomenal': 1, 'favor': 1, 'outside': 1, 'truffle': 1, 'ever': 2, 'anyway,': 1, 'here': 2, 'wait': 1, 'on': 3, 'semi-busy': 1, 'of': 1, 'place': 1, "i'm": 1, 'waitress': 1, 'grounds': 1, 'complete.': 1, 'bloody': 1, 'griddled': 1, 'simply': 1, 'skillet': 1, 'morning.': 1, 'use': 1, 'from': 1, 'quickly': 2, 'their': 4, '2': 1, 'delicious.': 1, 'white': 1, 'was': 8, "i've": 2, 'took': 1, 'excellent': 1, 'an': 1, 'with': 2, 'me': 1, 'made': 2, 'wife': 1, 'up': 1, 'while': 1, 'my': 2, 'and': 8, 'it': 8, 'pieces': 1, 'tasty': 1, 'breakfast': 1, 'absolute': 1, 'ingredients': 1, 'get': 2, 'when'

In [87]:
bad_review_words = ['hate','terrible', 'awful', 'spit', 'disgusting', 'filthy', 'tasteless', 'rude', 
                    'dirty', 'slow', 'poor', 'late', 'angry', 'flies', 'disappointed', 'disappointing', 'wait', 
                    'waiting', 'dreadful', 'appalling', 'horrific', 'horrifying', 'horrible', 'horrendous', 'atrocious', 
                    'abominable', 'deplorable', 'abhorrent', 'frightful', 'shocking', 'hideous', 'ghastly', 'grim', 
                    'dire', 'unspeakable', 'gruesome']

In [89]:
# exclude=False keeps only the listed keys and drops every other word.
train_set['negative_review_tags'] = train_set['negative_review_tags'].dict_trim_by_keys(bad_review_words, exclude=False)

In [94]:
train_set['negative_review_tags']

dtype: dict
Rows: 172608
[{'wait': 1}, {'wait': 1}, {}, {}, {}, {'waiting': 1}, {'waiting': 1}, {}, {}, {}, {}, {}, {}, {}, {}, {'waiting': 1, 'disappointed': 1}, {}, {}, {}, {}, {'wait': 1}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {'hate': 1}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {'dirty': 1}, {'waiting': 1}, {}, {'wait': 1}, {}, {}, {}, {}, {}, {}, {'wait': 2}, {}, {}, {}, {}, {}, {'slow': 1, 'wait': 1}, {}, {}, {}, {}, {}, {}, {}, {'late': 1, 'wait': 1}, {'waiting': 1}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {'slow': 1}, {}, {'atrocious': 1}, {'wait': 1}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, ... ]
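
The word counts shown earlier keep punctuation attached to tokens (`'better.'`, `'had.'`, `'anyway,'`), so an exact-match filter on the word list will miss occurrences such as `terrible.` at the end of a sentence. The sketch below is a plain-Python analogue of the count-then-filter steps, with and without punctuation stripping. All helper names and the sample sentence are hypothetical, and this is not how GraphLab implements `count_words` or `dict_trim_by_keys`.

```python
import string
from collections import Counter

def count_words(text):
    """Lowercase, split on whitespace, and count tokens (punctuation is kept)."""
    return Counter(text.lower().split())

def count_clean_words(text):
    """Same as count_words, but strip leading/trailing punctuation from each token."""
    tokens = (token.strip(string.punctuation) for token in text.lower().split())
    return Counter(token for token in tokens if token)

def keep_keys(counts, keys):
    """Keep only the entries whose key is in `keys`."""
    wanted = set(keys)
    return {word: n for word, n in counts.items() if word in wanted}

bad_words = ['terrible', 'rude', 'wait']
text = "The service was terrible. Rude staff, and a long wait!"

print(keep_keys(count_words(text), bad_words))        # only 'rude' matches
print(keep_keys(count_clean_words(text), bad_words))  # all three words match
```

Stripping punctuation before filtering is one way to make the negative-word feature less sparse.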

In [95]:
model = gl.linear_regression.create(train_set, target='stars', 
                                    features = ['user_id', 'business_id', 'categories_dict', 'negative_review_tags', 
                                                'user_avg_stars', 'votes', 'business_avg_stars'])

PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.

PROGRESS: Linear regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 163872
PROGRESS: Number of features          : 7
PROGRESS: Number of unpacked features : 551
PROGRESS: Number of coefficients    : 50129
PROGRESS: Starting L-BFGS
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+-----------+--------------+--------------------+----------------------+---------------+-----------------+
PROGRESS: | Iteration | Passes   | Step size | Elapsed Time | Training-max_error | Validation-max_error | Training-rmse | Validation-rmse |
PROGRESS: +-----------+----------+-----------+--------------+--------------------+----------------------+---------------+-----------------+
PROGRESS: | 1         | 6        | 0.000000

In [98]:
# Apply the same feature engineering to the test set before evaluating.
test_set['categories_dict'] = test_set.apply(lambda row: tag_dict(row['categories']))
test_set['categories_dict'].head(5)

test_set['negative_review_tags'] = gl.text_analytics.count_words(test_set['text'])
test_set['negative_review_tags'] = test_set['negative_review_tags'].dict_trim_by_keys(bad_review_words, exclude=False)

In [99]:
model.evaluate(test_set)

{'max_error': 6.5267151390294735, 'rmse': 1.159982299334633}

In [100]:
predictions = model.predict(test_set)