# MLWARE 2 - Recommendation Challenge

This was a 60-hour challenge organized for Analytics Vidhya. It started on Feb 22, 2017.

Challenge link: https://datahack.analyticsvidhya.com/contest/mlware-2/

# Problem Statement

Understanding customers and their preferences is the holy grail for online businesses. Building a recommender system is one of the common ways to do so.
In this contest, you need to build a model that predicts a given user's rating (from 0 to 10 stars) for a given item based on past ratings of other items and/or other information. Rating prediction is the primary part of a recommendation problem (the part where explicit ratings are given). No additional information (user demographics, item content features, etc.) is given, and the prediction has to be made using only the ratings of already rated items.

### Dataset:
The data contains ratings from 40,000 users on 120 items. Ratings of users who have rated fewer than 10 items have been removed.

1.- training.csv - Contains 958,529 ratings selected randomly from 1,599,544 ratings. It has 4 columns:

    ID - Unique ID for each record
    userId - Unique user ID for each customer
    itemId - Item ID of the product
    rating - Rating given to each item by user

2.- test.csv - This file has three columns: ID, userId and itemId. The predictions on this set are the ones that are judged.

### Evaluation:
The metric used for evaluating the performance of the model is the "Root Mean Squared Error" (RMSE) between the predicted and the actual ratings.
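As a concrete reference for this metric, the sketch below computes RMSE directly with NumPy. The helper name `rmse` and the sample ratings are illustrative assumptions, not part of the contest material.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equally sized sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical ratings on the 0-10 scale
actual = [7.0, 2.5, 10.0, 0.0]
predicted = [6.0, 3.5, 8.0, 2.0]

# The errors are 1, -1, 2, -2, so the mean squared error is 2.5
print(rmse(actual, predicted))
```

Because the errors are squared before averaging, a few large misses weigh more heavily than many small ones.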


# SOLUTION

In this solution, I present how I obtained an RMSE of 1.9858 on the testing data (a 30% hold-out taken from the original training data given for the problem).

#### Author: Keven Ronald Fernández Carrillo ( A passionate newbie in machine learning and data science topics )

# I. Import Data

In [1]:
# Import libraries
import pandas as pd
import numpy as np

In [2]:
# Load training and testing data
train_data = pd.read_csv("C:/Users/b33580/Documents/Python Scripts/MLWARE/TRAIN/train_MLWARE2.csv")
test_data = pd.read_csv("C:/Users/b33580/Documents/Python Scripts/MLWARE/TEST/test_MLWARE2.csv")

In [3]:
# Check info of training data
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 958529 entries, 0 to 958528
Data columns (total 4 columns):
ID        958529 non-null object
userId    958529 non-null int64
itemId    958529 non-null int64
rating    958529 non-null float64
dtypes: float64(1), int64(2), object(1)
memory usage: 29.3+ MB


In [4]:
# Check info of testing data (submitting data)
test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 641015 entries, 0 to 641014
Data columns (total 3 columns):
ID        641015 non-null object
userId    641015 non-null int64
itemId    641015 non-null int64
dtypes: int64(2), object(1)
memory usage: 14.7+ MB


In [5]:
# Creating a DataFrame that will contain our own training and testing data
data = train_data
print("Shape: {}".format(data.shape))

Shape: (958529, 4)


# II. Feature Engineering

##  II.1 User Set

We start by computing features for each user individually:

In [6]:
# Sort data by "userId" and "itemId"
data_user_temp = data.sort_values(by = ["userId","itemId"], ascending = [True,True])
data_user_temp.head(10)

Unnamed: 0,ID,userId,itemId,rating
488842,0_1,0,1,0.5
488839,0_28,0,28,1.5
488844,0_29,0,29,6.5
488841,0_47,0,47,2.0
488840,0_107,0,107,1.5
488845,0_121,0,121,6.0
488838,0_129,0,129,1.0
488843,0_137,0,137,6.0
438447,1_1,1,1,7.0
438496,1_3,1,3,8.0


In [7]:
# Reset index
data_user_temp.reset_index(inplace = True, drop = True)
data_user_temp.head(5)

Unnamed: 0,ID,userId,itemId,rating
0,0_1,0,1,0.5
1,0_28,0,28,1.5
2,0_29,0,29,6.5
3,0_47,0,47,2.0
4,0_107,0,107,1.5


For each userId we compute the count, mean, median, min, max and std of their ratings:

In [8]:
# Compute all per-user statistics in a single groupby and name the columns rat_<stat>_user
data_user = (data_user_temp.groupby("userId")["rating"]
             .agg(["count", "mean", "median", "min", "max", "std"])
             .add_prefix("rat_")
             .add_suffix("_user")
             .reset_index())
data_user_temp = [] # Release the temporary variable

data_user.head(10)

Unnamed: 0,userId,rat_count_user,rat_mean_user,rat_median_user,rat_min_user,rat_max_user,rat_std_user
0,0,8,3.125,1.75,0.5,6.5,2.559994
1,1,70,7.314286,8.0,2.0,9.5,1.558454
2,2,11,3.090909,0.5,0.0,8.5,3.277333
3,4,10,8.55,8.5,8.0,9.0,0.437798
4,5,13,5.269231,5.5,2.5,7.5,1.494649
5,6,26,2.076923,1.5,0.0,5.5,1.677452
6,7,22,3.590909,4.25,0.5,7.0,2.152719
7,8,52,6.413462,6.25,3.5,10.0,1.247433
8,9,11,5.909091,7.0,1.0,8.5,2.508168
9,10,50,6.71,7.5,0.0,10.0,3.136503


##  II.2 Item Set

Next, we compute features for each item individually:

In [9]:
# Sort data by "itemId" and "userId"
data_item_temp = data.sort_values(by = ["itemId","userId"], ascending = [True,True])
data_item_temp.head(10)

Unnamed: 0,ID,userId,itemId,rating
488842,0_1,0,1,0.5
438447,1_1,1,1,7.0
62369,4_1,4,1,9.0
175260,5_1,5,1,5.0
525179,6_1,6,1,1.5
5596,7_1,7,1,4.5
711690,8_1,8,1,5.5
871059,10_1,10,1,8.5
223519,16_1,16,1,1.5
18231,18_1,18,1,0.0


In [10]:
# Reset index
data_item_temp.reset_index(inplace = True, drop = True)
data_item_temp.head(5)

Unnamed: 0,ID,userId,itemId,rating
0,0_1,0,1,0.5
1,1_1,1,1,7.0
2,4_1,4,1,9.0
3,5_1,5,1,5.0
4,6_1,6,1,1.5


For each itemId we compute the count, mean, median, min, max and std of the ratings given by users:

In [11]:
# Compute all per-item statistics in a single groupby and name the columns rat_<stat>_item
data_item = (data_item_temp.groupby("itemId")["rating"]
             .agg(["count", "mean", "median", "min", "max", "std"])
             .add_prefix("rat_")
             .add_suffix("_item")
             .reset_index())
data_item_temp = [] # Release the temporary variable
data_item.head(10)

Unnamed: 0,itemId,rat_count_item,rat_mean_item,rat_median_item,rat_min_item,rat_max_item,rat_std_item
0,1,23934,5.377288,5.5,0.0,10.0,2.638525
1,2,7393,6.247531,6.5,0.0,10.0,2.425138
2,3,8479,6.547647,7.0,0.0,10.0,2.456252
3,4,3569,5.626086,6.0,0.0,10.0,2.467517
4,5,3028,4.777576,5.0,0.0,10.0,2.979029
5,6,3319,5.29301,5.5,0.0,10.0,2.763852
6,7,4642,5.950883,6.0,0.0,10.0,2.58692
7,8,7881,6.510088,6.5,0.0,10.0,2.361828
8,10,23706,4.693917,5.0,0.0,10.0,2.696327
9,11,5603,5.819115,6.0,0.0,10.0,2.843468


Check if there are null values:

In [12]:
print (data_user.isnull().sum() )

userId             0
rat_count_user     0
rat_mean_user      0
rat_median_user    0
rat_min_user       0
rat_max_user       0
rat_std_user       6
dtype: int64


There are null values, so we inspect the rows that contain them:

In [13]:
data_user[data_user["rat_std_user"].isnull()]

Unnamed: 0,userId,rat_count_user,rat_mean_user,rat_median_user,rat_min_user,rat_max_user,rat_std_user
4628,6615,1,7.0,7.0,7.0,7.0,
8631,13046,1,9.0,9.0,9.0,9.0,
16782,27746,1,5.0,5.0,5.0,5.0,
18319,29589,1,6.5,6.5,6.5,6.5,
21332,33146,1,4.0,4.0,4.0,4.0,
30872,44813,1,0.0,0.0,0.0,0.0,


Each of these users has a single rating, so the sample standard deviation is undefined (NaN). We replace these null values with 0, since a single rating has no spread:

In [14]:
data_user["rat_std_user"] = data_user["rat_std_user"].fillna(0)
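To see where the null values come from, here is a minimal sketch on toy data (the frame `toy` and its values are hypothetical). pandas computes the sample standard deviation with `ddof=1` by default, which divides by n - 1 and is therefore undefined for a single observation.

```python
import pandas as pd

single = pd.Series([7.0])      # a user with exactly one rating
print(single.std())            # NaN: sample std (ddof=1) divides by n - 1 = 0
print(single.std(ddof=0))      # 0.0: population std is defined for n = 1

# Toy frame: user 1 has one rating, user 2 has two
toy = pd.DataFrame({"userId": [1, 2, 2], "rating": [7.0, 4.0, 6.0]})
std_per_user = toy.groupby("userId")["rating"].std().fillna(0)
print(std_per_user)
```

Filling with 0 matches what the population standard deviation would report for these single-rating users.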

Now we have:

In [15]:
print (data_user.isnull().sum() )

userId             0
rat_count_user     0
rat_mean_user      0
rat_median_user    0
rat_min_user       0
rat_max_user       0
rat_std_user       0
dtype: int64


## II.3 Join between initial DataFrame and created DataFrames

Adding the user and item features to our initial dataframe:

In [16]:
# User features
dataset_temp = pd.merge(data, data_user, how="left", on = "userId")
# Item features
dataset = pd.merge(dataset_temp, data_item, how="left", on = "itemId")

dataset_temp = [] # Release the temporary variable

dataset.head(50)

Unnamed: 0,ID,userId,itemId,rating,rat_count_user,rat_mean_user,rat_median_user,rat_min_user,rat_max_user,rat_std_user,rat_count_item,rat_mean_item,rat_median_item,rat_min_item,rat_max_item,rat_std_item
0,16041_129,16041,129,0.5,29,5.862069,5.5,0.5,10.0,3.390893,23951,4.027619,3.5,0.0,10.0,2.67462
1,16041_25,16041,25,0.5,29,5.862069,5.5,0.5,10.0,3.390893,23844,4.677403,5.0,0.0,10.0,2.561905
2,16041_28,16041,28,5.5,29,5.862069,5.5,0.5,10.0,3.390893,23988,4.29077,4.0,0.0,10.0,2.656385
3,16041_101,16041,101,0.5,29,5.862069,5.5,0.5,10.0,3.390893,23976,5.368702,5.5,0.0,10.0,2.640348
4,16041_47,16041,47,1.5,29,5.862069,5.5,0.5,10.0,3.390893,23891,4.935101,5.0,0.0,10.0,2.550463
5,16041_132,16041,132,0.5,29,5.862069,5.5,0.5,10.0,3.390893,4468,5.660586,6.0,0.0,10.0,2.564918
6,16041_38,16041,38,0.5,29,5.862069,5.5,0.5,10.0,3.390893,4046,5.54004,5.5,0.0,10.0,2.564231
7,16041_89,16041,89,10.0,29,5.862069,5.5,0.5,10.0,3.390893,9977,6.646637,7.0,0.0,10.0,2.368218
8,16041_17,16041,17,2.5,29,5.862069,5.5,0.5,10.0,3.390893,3583,5.460787,5.5,0.0,10.0,2.679421
9,16041_116,16041,116,6.5,29,5.862069,5.5,0.5,10.0,3.390893,11737,6.713002,7.0,0.0,10.0,2.155627


## II.4 Split into training & test sets

We'll use 70% of the data for training and 30% for testing:

In [17]:
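# train_test_split lives in sklearn.model_selection (the cross_validation module was deprecated and later removed)
from sklearn.model_selection import train_test_split
data_train , data_test = train_test_split(dataset, train_size = 0.70, random_state = 99)
print ("Train:", data_train.shape)
print ("Test: ", data_test.shape)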



Train: (670970, 16)
Test:  (287559, 16)
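Note that the user and item statistics above were computed on all ratings before this split, so each held-out row's own rating contributes to its features. One possible variation, which is not what this notebook does, is to compute the statistics from the training rows only and merge them onto both splits. The sketch below shows the idea on a hypothetical miniature table; `DataFrame.sample` stands in for `train_test_split`, and the fallback for unseen users is an assumption.

```python
import pandas as pd

# Hypothetical miniature ratings table with the notebook's column names
toy = pd.DataFrame({
    "userId": [0, 0, 0, 1, 1, 1, 2, 2, 2, 2],
    "itemId": [1, 2, 3, 1, 2, 3, 1, 2, 3, 4],
    "rating": [0.5, 1.5, 6.5, 7.0, 8.0, 9.5, 3.0, 4.0, 5.0, 6.0],
})

# 70/30 split (stand-in for train_test_split)
toy_train = toy.sample(frac=0.7, random_state=99)
toy_test = toy.drop(toy_train.index)

# Statistics are computed from the training rows only
user_stats = (toy_train.groupby("userId")["rating"]
              .agg(rat_count_user="count", rat_mean_user="mean")
              .reset_index())

toy_train_feat = toy_train.merge(user_stats, how="left", on="userId")
toy_test_feat = toy_test.merge(user_stats, how="left", on="userId")

# Users absent from the training split get NaN; fall back to the global training mean
global_mean = toy_train["rating"].mean()
toy_test_feat["rat_mean_user"] = toy_test_feat["rat_mean_user"].fillna(global_mean)
toy_test_feat["rat_count_user"] = toy_test_feat["rat_count_user"].fillna(0)
print(toy_test_feat)
```

With features built this way, the hold-out RMSE should be closer to what the model would achieve on the contest test set, where the true ratings are not available when the features are computed.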


In [18]:
# Identify the target:
variables_set_y = ['rating']

# Identify the features (the columns we'll use to predict the target):
delete_var = ['rating','ID','userId','itemId'] # Columns to exclude
variables_set_x = [var for var in dataset.columns if var not in delete_var]
print ("# Features: ", len(variables_set_x))
print ("Features: ")
variables_set_x

# Features:  12
Features: 


['rat_count_user',
 'rat_mean_user',
 'rat_median_user',
 'rat_min_user',
 'rat_max_user',
 'rat_std_user',
 'rat_count_item',
 'rat_mean_item',
 'rat_median_item',
 'rat_min_item',
 'rat_max_item',
 'rat_std_item']

In [19]:
# Store X (features) and y (target) separately for the training and testing datasets
# (.loc replaces the removed .ix indexer)
X_train = data_train.loc[:, variables_set_x]
y_train = data_train.loc[:, variables_set_y]

X_test = data_test.loc[:, variables_set_x]
y_test = data_test.loc[:, variables_set_y]

In [20]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 670970 entries, 836349 to 684673
Data columns (total 12 columns):
rat_count_user     670970 non-null int64
rat_mean_user      670970 non-null float64
rat_median_user    670970 non-null float64
rat_min_user       670970 non-null float64
rat_max_user       670970 non-null float64
rat_std_user       670970 non-null float64
rat_count_item     670970 non-null int64
rat_mean_item      670970 non-null float64
rat_median_item    670970 non-null float64
rat_min_item       670970 non-null float64
rat_max_item       670970 non-null float64
rat_std_item       670970 non-null float64
dtypes: float64(10), int64(2)
memory usage: 66.5 MB


# III. Modeling

First, we'll train and test with RandomForestRegressor and ExtraTreesRegressor models individually:

## III.1 Individual Models

### A. Random Forest Regressor

In [21]:
from sklearn.ensemble import RandomForestRegressor

In [22]:
# Show RandomForestRegressor's parameters
RandomForestRegressor()

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_split=1e-07, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False)

In [23]:
# Initialize our RandomForestRegressor  
# n_estimators: the number of trees in the forest.
# min_samples_split: the minimum number of samples required to split an internal node
# min_samples_leaf: the minimum number of samples required at a leaf (the end of a tree branch)
# criterion: the function that measures the quality of a split; left at its default (mean squared error),
#            which older scikit-learn releases call 'mse' and newer ones call 'squared_error'
# random_state: the seed used by the random number generator
# n_jobs: the number of jobs to run in parallel for both fit and predict (-1 uses all cores)
rfr_model = RandomForestRegressor(n_estimators = 50, random_state=100, n_jobs=-1, 
                                  min_samples_leaf=4, min_samples_split=8)

In [24]:
# We'll calculate the running time
from time import time
time_star = time()

# Training our model:
rfr_model.fit(X_train, y_train)

time_end = time()
print ("Time: ", np.round((time_end-time_star)/60,2), " minutes")



Time:  0.99  minutes


In [25]:
# Predict the target for our training and testing data
y_pred_rfr_train = rfr_model.predict(X_train)
y_pred_rfr = rfr_model.predict(X_test)

In [26]:
from sklearn import metrics
import numpy as np
print('Metrics for Training Set:')
print('MAE:', metrics.mean_absolute_error(y_train, y_pred_rfr_train))
print('MSE:', metrics.mean_squared_error(y_train, y_pred_rfr_train))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_train, y_pred_rfr_train)))

print('\nMetrics for Test Set:')
print('MAE:', metrics.mean_absolute_error(y_test, y_pred_rfr))
print('MSE:', metrics.mean_squared_error(y_test, y_pred_rfr))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred_rfr)))

Metrics for Training Set:
MAE: 1.00914955162
MSE: 1.90360949667
RMSE: 1.37971355602

Metrics for Test Set:
MAE: 1.50396081277
MSE: 4.10212944038
RMSE: 2.0253714327


### B. ExtraTrees Regressor

In [27]:
from sklearn.ensemble import ExtraTreesRegressor

In [28]:
# Show ExtraTreesRegressor's parameters
ExtraTreesRegressor()

ExtraTreesRegressor(bootstrap=False, criterion='mse', max_depth=None,
          max_features='auto', max_leaf_nodes=None,
          min_impurity_split=1e-07, min_samples_leaf=1,
          min_samples_split=2, min_weight_fraction_leaf=0.0,
          n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
          verbose=0, warm_start=False)

In [29]:
# Initialize our ExtraTreesRegressor  
# n_estimators: the number of trees in the forest.
# min_samples_split: the minimum number of samples required to split an internal node
# min_samples_leaf: the minimum number of samples required at a leaf (the end of a tree branch)
# criterion: the function that measures the quality of a split; left at its default (mean squared error),
#            which older scikit-learn releases call 'mse' and newer ones call 'squared_error'
# random_state: the seed used by the random number generator
# n_jobs: the number of jobs to run in parallel for both fit and predict (-1 uses all cores)
etr_model = ExtraTreesRegressor(n_estimators = 50, random_state=100, n_jobs=-1,
                                min_samples_leaf=4, min_samples_split=8)

In [30]:
# We'll calculate the running time
from time import time
time_star = time()

# Training our model:
etr_model.fit(X_train, y_train)

time_end = time()
print ("Time: ", np.round((time_end-time_star)/60,2), " minutes")



Time:  0.59  minutes


In [31]:
# Predict the target for our training and testing data
y_pred_etr_train = etr_model.predict(X_train)
y_pred_etr = etr_model.predict(X_test)

In [32]:
from sklearn import metrics
import numpy as np
print('Metrics for Training Set:')
print('MAE:', metrics.mean_absolute_error(y_train, y_pred_etr_train))
print('MSE:', metrics.mean_squared_error(y_train, y_pred_etr_train))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_train, y_pred_etr_train)))

print('\nMetrics for Test Set:')
print('MAE:', metrics.mean_absolute_error(y_test, y_pred_etr))
print('MSE:', metrics.mean_squared_error(y_test, y_pred_etr))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred_etr)))

Metrics for Training Set:
MAE: 1.02168226937
MSE: 1.92032092862
RMSE: 1.38575644636

Metrics for Test Set:
MAE: 1.49539578074
MSE: 4.05270048168
RMSE: 2.01313200801


#### Of these two models, ExtraTreesRegressor is the better one on the test set (RMSE 2.0131 vs. 2.0254).

## III.2 First simple ensemble model

To start, we build a simple ensemble by averaging the predictions of our two models:

In [33]:
y_pred = (y_pred_rfr + y_pred_etr)/2

In [34]:
print('Metrics for Test Set:')
print('MAE:', metrics.mean_absolute_error(y_test, y_pred))
print('MSE:', metrics.mean_squared_error(y_test, y_pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

Metrics for Test Set:
MAE: 1.49429490222
MSE: 4.03993119611
RMSE: 2.00995800855


Next, we weight the two predictions with integer coefficients a and b, each ranging from 1 to 10, and search for the combination with the lowest RMSE:

In [35]:
# Above, "a" and "b" were both 1 (a simple average):
a = 1
b = 1
# Now "a" and "b" each range from 1 to 10
array_mrse = []
for a in range(1,11):
    for b in range(1,11):
        var_sum = a + b
        y_pred = (y_pred_etr*a + y_pred_rfr*b)/var_sum
        rmse = np.sqrt(metrics.mean_squared_error(y_test, y_pred))
        array_mrse.append([a,b,rmse])
pd.DataFrame(array_mrse, columns = ['a','b','mrse']).sort_values(by = ['mrse','a']).head(5)

Unnamed: 0,a,b,mrse
10,2,1,2.008944
31,4,2,2.008944
52,6,3,2.008944
73,8,4,2.008944
94,10,5,2.008944


The best combination of coefficients is a = 2 and b = 1 (the multiples 4:2, 6:3, ... tie because only the ratio a:b matters). This blend has a lower RMSE (2.0089) than either individual model.
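Since only the ratio a:b matters, the blend can be written with a single weight w = a / (a + b). For two predictors, the weight that minimises the squared error has a closed form, so a grid is not strictly needed. The sketch below is an illustration on synthetic data, not the search used in this notebook; `best_blend_weight` and the synthetic arrays are assumptions standing in for `y_test`, `y_pred_etr` and `y_pred_rfr`.

```python
import numpy as np

def best_blend_weight(y_true, pred_a, pred_b):
    """Weight w minimising the MSE of w * pred_a + (1 - w) * pred_b."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    pred_a = np.asarray(pred_a, dtype=float).ravel()
    pred_b = np.asarray(pred_b, dtype=float).ravel()
    diff = pred_a - pred_b
    denom = np.dot(diff, diff)
    if denom == 0.0:            # identical predictions: every weight gives the same blend
        return 0.5
    # Least-squares solution of y - pred_b = w * (pred_a - pred_b)
    return float(np.dot(y_true - pred_b, diff) / denom)

# Synthetic stand-ins for the true ratings and the two models' predictions
rng = np.random.RandomState(0)
y = rng.uniform(0, 10, size=1000)
pred_etr = y + rng.normal(0, 1.0, size=1000)
pred_rfr = y + rng.normal(0, 1.5, size=1000)

w = best_blend_weight(y, pred_etr, pred_rfr)
blend = w * pred_etr + (1 - w) * pred_rfr
print("w =", round(w, 3))
print("RMSE blend:", np.sqrt(np.mean((y - blend) ** 2)))
```

The weight is not constrained to the interval [0, 1] here. Whichever search is used, fitting the weight on a split that is separate from the one used to report the final RMSE gives a less optimistic estimate.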

## III.3 Hyperparameter Tuning using GridSearchCV and k-fold CV

Now we'll tune some parameters of the ExtraTreesRegressor model to improve our result, using GridSearchCV:

In [36]:
from sklearn.ensemble import ExtraTreesRegressor

In [None]:
from time import time
time_star = time()

n_estimators = list(np.arange(50, 251, 50)) #values = [50,100,150,200,250]
max_features = [None, 'sqrt', 'log2'] 

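# GridSearchCV lives in sklearn.model_selection (the grid_search module was deprecated and later removed)
from sklearn.model_selection import GridSearchCV
# criterion is left at its default (mean squared error)
etr_tunning = ExtraTreesRegressor(random_state=100, n_jobs=-1, 
                                  min_samples_leaf=4, min_samples_split=8)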
hyperparameters = {
                     'n_estimators' : n_estimators,
                     'max_features' : max_features
                  }

gridCV = GridSearchCV(etr_tunning , param_grid = hyperparameters, cv = 4, n_jobs = -1)

gridCV.fit(X_train, y_train)

print('Best score: {}'.format(gridCV.best_score_))
print('Best parameters: {}'.format(gridCV.best_params_))

time_end = time()
print ("Time: ", (time_end-time_star)/60, " minutes")



Best score: 0.4226315365964312
Best parameters: {'max_features': 'sqrt', 'n_estimators': 250}
Time:  539.5460170626641  minutes


Running the code above gives the following result:

#### Best parameters: {'max_features': 'sqrt', 'n_estimators': 250}

Best score: 0.4226315365964312 (the mean cross-validated R², which is the default scoring for regressors in GridSearchCV)

Time: 539.5 minutes (about 9 hours)
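With no `scoring` argument, GridSearchCV ranks regressors by R² rather than by the contest metric. A minimal sketch of a search that optimises squared error instead is shown below; the synthetic data and the tiny grid are assumptions chosen so that it runs in seconds, and it is not the search that produced the result above.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import GridSearchCV

# Small synthetic regression problem standing in for X_train / y_train
rng = np.random.RandomState(100)
X_demo = rng.uniform(0, 10, size=(300, 4))
y_demo = X_demo[:, 0] * 0.5 + X_demo[:, 1] * 0.3 + rng.normal(0, 0.5, size=300)

search = GridSearchCV(
    ExtraTreesRegressor(random_state=100, min_samples_leaf=4, min_samples_split=8),
    param_grid={"n_estimators": [10, 20], "max_features": [None, "sqrt"]},
    scoring="neg_mean_squared_error",  # scores are maximised, so the MSE is negated
    cv=4,
)
search.fit(X_demo, y_demo)

# Convert the best (negated) MSE back to an RMSE
print("Best CV RMSE:", np.sqrt(-search.best_score_))
print("Best parameters:", search.best_params_)
```

Keeping the grid small, or setting `n_jobs` on only one of the estimator and the search, may also help with the multi-hour run time reported above.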

## III.4 Final Model

We now configure both models with the parameter values found in the previous step:

### A.2 Random Forest Regression

In [37]:
from sklearn.ensemble import RandomForestRegressor
# max_features:
# The number of features to consider when looking for the best split.
# If max_features = 'sqrt', then max_features = sqrt(n_features).
# criterion is left at its default (mean squared error)
rfr_model = RandomForestRegressor(n_estimators=250, random_state=100, n_jobs=-1, max_features = 'sqrt',
                                  min_samples_leaf=4, min_samples_split=8)

In [38]:
from time import time
time_star = time()

rfr_model.fit(X_train, y_train)

time_end = time()
print ("Time: ", np.round((time_end-time_star)/60,2), " minutes")



Time:  1.6  minutes


In [39]:
# Predict the target for our training and testing data
y_pred_rfr_train = rfr_model.predict(X_train)
y_pred_rfr = rfr_model.predict(X_test)

In [40]:
from sklearn import metrics
import numpy as np
print('Metrics for Training Set:')
print('MAE:', metrics.mean_absolute_error(y_train, y_pred_rfr_train))
print('MSE:', metrics.mean_squared_error(y_train, y_pred_rfr_train))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_train, y_pred_rfr_train)))

print('\nMetrics for Test Set:')
print('MAE:', metrics.mean_absolute_error(y_test, y_pred_rfr))
print('MSE:', metrics.mean_squared_error(y_test, y_pred_rfr))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred_rfr)))

Metrics for Training Set:
MAE: 1.13405267898
MSE: 2.34096648521
RMSE: 1.53002172704

Metrics for Test Set:
MAE: 1.48530504212
MSE: 3.96996689933
RMSE: 1.99247757813


### B.2 ExtraTrees Regressor

In [41]:
from sklearn.ensemble import ExtraTreesRegressor

# Set the ExtraTreesRegressor parameters (criterion is left at its default, mean squared error)
etr_model = ExtraTreesRegressor(n_estimators=250, random_state=100, n_jobs=-1, max_features = 'sqrt',
                                min_samples_leaf=4, min_samples_split=8)

In [42]:
from time import time
time_star = time()
etr_model.fit(X_train, y_train)
time_end = time()
print ("Time: ", np.round((time_end-time_star)/60,2), " minutes")



Time:  1.05  minutes


In [43]:
# Predict the target for our training and testing data
y_pred_etr_train = etr_model.predict(X_train)
y_pred_etr = etr_model.predict(X_test)

In [44]:
from sklearn import metrics
import numpy as np
print('Metrics for Training Set:')
print('MAE:', metrics.mean_absolute_error(y_train, y_pred_etr_train))
print('MSE:', metrics.mean_squared_error(y_train, y_pred_etr_train))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_train, y_pred_etr_train)))

print('\nMetrics for Test Set:')
print('MAE:', metrics.mean_absolute_error(y_test, y_pred_etr))
print('MSE:', metrics.mean_squared_error(y_test, y_pred_etr))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred_etr)))

Metrics for Training Set:
MAE: 1.30613412701
MSE: 3.04343795822
RMSE: 1.74454520097

Metrics for Test Set:
MAE: 1.48941371716
MSE: 3.94587457389
RMSE: 1.98642255673


### C. Ensemble

Again, we build a simple ensemble by averaging the predictions of our two models:

In [45]:
y_pred = (y_pred_etr + y_pred_rfr)/2

In [46]:
print('\nMetrics for Test Set:')
print('MAE:', metrics.mean_absolute_error(y_test, y_pred))
print('MSE:', metrics.mean_squared_error(y_test, y_pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))


Metrics for Test Set:
MAE: 1.48578154481
MSE: 3.94673091613
RMSE: 1.9866380939


Searching for the best values of the "a" and "b" coefficients:

In [47]:
# Above, "a" and "b" were both 1 (a simple average):
a = 1
b = 1
# Now "a" and "b" each range from 1 to 10
array_mrse = []
for a in range(1,11):
    for b in range(1,11):
        var_sum = a + b
        y_pred = (y_pred_etr*a + y_pred_rfr*b)/var_sum
        rmse = np.sqrt(metrics.mean_squared_error(y_test, y_pred))
        array_mrse.append([a,b,rmse])
pd.DataFrame(array_mrse, columns = ['a','b','mrse']).sort_values(by = ['mrse','a']).head(5)

Unnamed: 0,a,b,mrse
92,10,3,1.985822
61,7,2,1.985823
20,3,1,1.985826
51,6,2,1.985826
82,9,3,1.985826


Finally, the best combination of coefficients is a = 10 and b = 3. This blend has a lower RMSE (1.9858) than either individual model (1.9864 for ExtraTreesRegressor and 1.9925 for RandomForestRegressor).

#### In conclusion, the final RMSE on our testing data is 1.9858

# IV. Submission

Merge the features into the submission dataset:

In [49]:
dataset_summit_temp = pd.merge(test_data, data_user, how="left", on = "userId")
dataset_summit = pd.merge(dataset_summit_temp, data_item, how="left", on = "itemId")
dataset_summit_temp = [] 
dataset_summit

Unnamed: 0,ID,userId,itemId,rat_count_user,rat_mean_user,rat_median_user,rat_min_user,rat_max_user,rat_std_user,rat_count_item,rat_mean_item,rat_median_item,rat_min_item,rat_max_item,rat_std_item
0,16041_10,16041,10,29,5.862069,5.5,0.5,10.0,3.390893,23706,4.693917,5.0,0.0,10.0,2.696327
1,16041_107,16041,107,29,5.862069,5.5,0.5,10.0,3.390893,24107,4.197038,4.0,0.0,10.0,2.689592
2,16041_1,16041,1,29,5.862069,5.5,0.5,10.0,3.390893,23934,5.377288,5.5,0.0,10.0,2.638525
3,16041_40,16041,40,29,5.862069,5.5,0.5,10.0,3.390893,6969,6.494332,7.0,0.0,10.0,2.548807
4,16041_96,16041,96,29,5.862069,5.5,0.5,10.0,3.390893,5071,5.647604,6.0,0.0,10.0,2.563235
5,16041_137,16041,137,29,5.862069,5.5,0.5,10.0,3.390893,15353,6.779066,7.0,0.0,10.0,2.345042
6,16041_51,16041,51,29,5.862069,5.5,0.5,10.0,3.390893,4143,5.588342,6.0,0.0,10.0,2.608309
7,16041_59,16041,59,29,5.862069,5.5,0.5,10.0,3.390893,6643,6.283080,6.5,0.0,10.0,2.380900
8,16041_135,16041,135,29,5.862069,5.5,0.5,10.0,3.390893,5807,6.034441,6.0,0.0,10.0,2.361113
9,16041_15,16041,15,29,5.862069,5.5,0.5,10.0,3.390893,7191,6.210332,6.5,0.0,10.0,2.514399


In [50]:
print("Shape: {} \n".format(dataset_summit.shape))
dataset_summit.info()

Shape: (641015, 15) 

<class 'pandas.core.frame.DataFrame'>
Int64Index: 641015 entries, 0 to 641014
Data columns (total 15 columns):
ID                 641015 non-null object
userId             641015 non-null int64
itemId             641015 non-null int64
rat_count_user     641015 non-null int64
rat_mean_user      641015 non-null float64
rat_median_user    641015 non-null float64
rat_min_user       641015 non-null float64
rat_max_user       641015 non-null float64
rat_std_user       641015 non-null float64
rat_count_item     641015 non-null int64
rat_mean_item      641015 non-null float64
rat_median_item    641015 non-null float64
rat_min_item       641015 non-null float64
rat_max_item       641015 non-null float64
rat_std_item       641015 non-null float64
dtypes: float64(10), int64(4), object(1)
memory usage: 78.2+ MB
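The info above shows no null values, so every user and item in the submission set also appears in the training data. If that were not the case, the left merge would leave NaN features for the unseen IDs. The sketch below shows one way to detect and fill such rows on hypothetical toy frames; the fallback value is an assumption, not something this notebook needed.

```python
import pandas as pd

# Hypothetical feature table and test rows; user 3 never appears in training
user_feats = pd.DataFrame({"userId": [1, 2], "rat_mean_user": [7.0, 4.5]})
test_rows = pd.DataFrame({"ID": ["1_5", "2_5", "3_5"],
                          "userId": [1, 2, 3],
                          "itemId": [5, 5, 5]})

# indicator=True adds a "_merge" column that flags rows without a match
merged = test_rows.merge(user_feats, how="left", on="userId", indicator=True)
unseen = merged["_merge"] == "left_only"
print("Rows with unseen users:", int(unseen.sum()))

# One possible fallback (an assumption): the mean of the known per-user means
fallback = user_feats["rat_mean_user"].mean()
merged["rat_mean_user"] = merged["rat_mean_user"].fillna(fallback)
merged = merged.drop(columns="_merge")
print(merged)
```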


In [51]:
X_dataset_summit = dataset_summit.loc[:, variables_set_x]
print ("Nro_Variables: ", len(X_dataset_summit.columns))
X_dataset_summit.columns

Nro_Variables:  12


Index(['rat_count_user', 'rat_mean_user', 'rat_median_user', 'rat_min_user',
       'rat_max_user', 'rat_std_user', 'rat_count_item', 'rat_mean_item',
       'rat_median_item', 'rat_min_item', 'rat_max_item', 'rat_std_item'],
      dtype='object')

### Apply the ensemble model to the submission dataset:

In [52]:
y_pred_summiDS_etr = etr_model.predict(X_dataset_summit)
y_pred_summiDS_etr_df = pd.DataFrame(y_pred_summiDS_etr, columns = ['rating'])

In [53]:
y_pred_summiDS_rfr = rfr_model.predict(X_dataset_summit)
y_pred_summiDS_rfr_df = pd.DataFrame(y_pred_summiDS_rfr, columns = ['rating'])

In [54]:
a = 10
b = 3
sum_var = a + b
y_pred_summiDS = (y_pred_summiDS_etr_df*a + y_pred_summiDS_rfr_df*b)/sum_var

In [55]:
summit = pd.concat( [dataset_summit.loc[:,'ID'] , y_pred_summiDS ], axis = 1)

In [56]:
summit

Unnamed: 0,ID,rating
0,16041_10,4.252705
1,16041_107,3.551483
2,16041_1,5.342552
3,16041_40,6.568824
4,16041_96,5.387443
5,16041_137,7.278069
6,16041_51,5.358696
7,16041_59,5.568638
8,16041_135,5.113431
9,16041_15,5.925161


In [57]:
summit.to_csv("C:/Users/b33580/Documents/Python Scripts/MLWARE/TEST/submission_MLWARE2_ensenmble.csv")
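By default, `to_csv` also writes the DataFrame index as an extra unnamed first column. If the platform expects only the ID and rating columns (an assumption, since the submission format is not stated here), passing `index=False` leaves it out. A small sketch with an in-memory buffer instead of a file path:

```python
import io
import pandas as pd

# Hypothetical two-row submission with the same columns as `summit`
demo = pd.DataFrame({"ID": ["16041_10", "16041_107"],
                     "rating": [4.252705, 3.551483]})

buffer = io.StringIO()            # in-memory stand-in for a file path
demo.to_csv(buffer, index=False)  # write only the ID and rating columns
print(buffer.getvalue())
```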
