### Feature Description

Number |Attribute Information
--|--
1...50 | Average, standard deviation, min, max and median of Attributes 51...60 for the source of the current blog post. By source we mean the blog on which the post appeared. For example, myblog.blog.org would be the source of the post myblog.blog.org/post_2010_09_10
51| Total number of comments before basetime 
52| Number of comments in the last 24 hours before the basetime 
53| Let T1 denote the datetime 48 hours before basetime, Let T2 denote the datetime 24 hours before basetime. This attribute is the number of comments in the time period between T1 and T2 
54| Number of comments in the first 24 hours after the publication of the blog post, but before basetime 
55| The difference of Attribute 52 and Attribute 53 
56...60| The same features as the attributes 51...55, but features 56...60 refer to the number of links (trackbacks), while features 51...55 refer to the number of comments. 
61| The length of time between the publication of the blog post and basetime 
62| The length of the blog post 
63...262| The 200 bag of words features for 200 frequent words of the text of the blog post 
263...269| binary indicator features (0 or 1) for the weekday (Monday...Sunday) of the basetime 
270...276| binary indicator features (0 or 1) for the weekday (Monday...Sunday) of the date of publication of the blog post 
277| Number of parent pages: we consider a blog post P as a parent of blog post B, if B is a reply (trackback) to  blog post P. 
278...280| Minimum, maximum, average number of comments that the parents received 
281| The target: the number of comments in the next 24 hours (relative to basetime)
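
The 281 columns are loaded without a header, so they are only addressable by position. As a minimal sketch, the attribute ranges in the table can be turned into readable column names. The prefixes in `GROUPS` and the helper `blog_column_names` are hypothetical names chosen for illustration; they are not part of the dataset.

```python
# Attribute ranges copied from the table above (1-indexed, inclusive).
# The prefixes are illustrative assumptions, not official names.
GROUPS = [
    ("source_stat", 1, 50),
    ("comments", 51, 55),
    ("trackbacks", 56, 60),
    ("post_age", 61, 61),
    ("post_length", 62, 62),
    ("bow", 63, 262),
    ("basetime_weekday", 263, 269),
    ("publication_weekday", 270, 276),
    ("n_parents", 277, 277),
    ("parent_comments", 278, 280),
    ("target", 281, 281),
]


def blog_column_names():
    """Return one name per attribute, e.g. 'comments_51'."""
    names = []
    for prefix, first, last in GROUPS:
        for number in range(first, last + 1):
            names.append(f"{prefix}_{number}")
    return names


names = blog_column_names()
print(len(names))  # 281
```

These names could then be passed to `pd.read_csv(..., header=None, names=blog_column_names())`, which makes the target column available as `target_281` instead of by position.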


### Loading Data

In [1]:
import pandas as pd

data_train = pd.read_csv('blogdata/train.csv', header=None)
data_test = pd.read_csv('blogdata/test1.csv', header=None)


In [2]:
# Columns 0..279 are the 280 input features; the last column is the target.
x_train = data_train.iloc[:, 0:280]
y_train = data_train.iloc[:, -1]

x_test = data_test.iloc[:, 0:280]
y_test = data_test.iloc[:, -1]

### Feature Selection 

**Variance Threshold** is a feature selection method that removes features whose variance does not exceed a given threshold.

**RFE** (recursive feature elimination) repeatedly fits an estimator and drops the least important features until the desired number remains.

**K-best** selects the k highest-scoring features according to a univariate statistical test.
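
The three selectors can be compared on a small synthetic problem. This is a sketch under stated assumptions: the data is random, only columns 0 and 1 drive the target, and `RFE` with `LinearRegression` stands in for the slower `RFECV` with a linear `SVR` used in this notebook.

```python
import numpy as np
from sklearn.feature_selection import RFE, SelectKBest, VarianceThreshold, f_regression
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption): 6 features, only columns 0 and 1 matter.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 6))
X[:, 5] = 1.0  # constant column, zero variance
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# 1) Variance threshold: drop the constant column.
vt = VarianceThreshold(threshold=0.0)
X_vt = vt.fit_transform(X)

# 2) RFE: refit the estimator, dropping the weakest feature each round.
rfe = RFE(LinearRegression(), n_features_to_select=2)
rfe.fit(X_vt, y)
rfe_cols = rfe.get_support(indices=True)

# 3) K-best: score each feature on its own with a univariate F-test.
kbest = SelectKBest(f_regression, k=2)
kbest.fit(X_vt, y)
kbest_cols = kbest.get_support(indices=True)

print(X_vt.shape, rfe_cols, kbest_cols)
```

The variance filter ignores the target entirely, while RFE and K-best both use `y`; RFE looks at features jointly through the fitted model, K-best one at a time.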



In [3]:
from sklearn.feature_selection import VarianceThreshold, RFECV, SelectKBest, f_regression
from sklearn.svm import SVR

# Drops low-variance features (threshold 0.16); for a 0/1 feature this means
# the same value appears in more than 80% of the samples.
vt = VarianceThreshold(threshold=(.8 * (1 - .8)))
rfecv = RFECV(SVR(kernel='linear'))
selbest = SelectKBest(f_regression, k=10)
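
The threshold `.8 * (1 - .8)` comes from the variance of a 0/1 feature, which is p(1 - p) when a fraction p of the samples equals 1. A small sketch with two made-up binary columns shows the effect; the column names are illustrative only.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

threshold = 0.8 * (1 - 0.8)  # 0.16

# Two hypothetical binary columns with different class balance.
col_90 = np.array([1] * 90 + [0] * 10)  # variance 0.9 * 0.1 = 0.09
col_50 = np.array([1] * 50 + [0] * 50)  # variance 0.5 * 0.5 = 0.25
X = np.column_stack([col_90, col_50])

vt = VarianceThreshold(threshold=threshold)
vt.fit(X)
kept = vt.get_support()  # boolean mask over the input columns

print(vt.variances_, kept)
```

Note that the same numeric threshold is applied to every column, including non-binary ones such as comment counts, where the 80% interpretation does not hold.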


##### Processing

In [4]:
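# Fit the selector on the training data only, then apply the same column
# selection to the test data so both sets keep identical features.
x_train_vt = vt.fit_transform(x_train, y_train)
x_test_vt = vt.transform(x_test)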
#x_train_rfecv = rfecv.fit_transform(x_train, y_train)
#x_train_selbest = selbest.fit_transform(x_train,y_train)


In [5]:
print(x_train_vt.shape)

(52397, 68)
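
To see which of the original columns survive the filter, and to confirm that train and test end up with the same columns, the fitted selector can be queried with `get_support`. The sketch below uses small random arrays as a stand-in for the blog data.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Stand-in data (assumption): column 2 is constant in train but not in test.
rng = np.random.RandomState(0)
X_train = rng.normal(size=(100, 4))
X_train[:, 2] = 0.0
X_test = rng.normal(size=(30, 4))

vt = VarianceThreshold(threshold=0.16)
X_train_sel = vt.fit_transform(X_train)  # learn the mask on train
X_test_sel = vt.transform(X_test)        # reuse the same mask on test

kept_columns = vt.get_support(indices=True)
print(kept_columns, X_train_sel.shape, X_test_sel.shape)
```

Calling `fit_transform` separately on the test set would compute a new mask from the test variances, which can select a different set of columns than the one the model was trained on.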


### Start Model

In [6]:
from sklearn.ensemble import RandomForestRegressor

model_rfg = RandomForestRegressor(n_estimators=25)


### Fit Model


In [7]:
model_rfg.fit(x_train_vt,y_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=25, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

## Result

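Reference metric (explained variance): http://scikit-learn.org/stable/modules/generated/sklearn.metrics.explained_variance_score.html#sklearn.metrics.explained_variance_score. Note that the cell below reports R² (`scoring='r2'`), not explained variance.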

In [9]:
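from sklearn.model_selection import cross_val_score

# cross_val_score clones model_rfg and refits it on folds of the test set,
# so the model fitted on the training data above is not the one being scored.
# cv=3 matches the three scores shown below (newer scikit-learn defaults to 5).
print(cross_val_score(model_rfg, x_test_vt, y_test, scoring='r2', cv=3))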


[ 0.21391377  0.28003014 -1.41391377]
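
An alternative to cross-validating on the test set is to score the already fitted model directly on held-out data. The sketch below does this on synthetic data and reports both R² and explained variance; the data, the split, and the fixed `random_state` are assumptions for illustration and say nothing about the scores on the blog dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import explained_variance_score, r2_score

# Synthetic regression problem (assumption), split into train and held-out parts.
rng = np.random.RandomState(0)
X = rng.normal(size=(400, 5))
y = 2.0 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.1, size=400)
X_tr, X_te = X[:300], X[300:]
y_tr, y_te = y[:300], y[300:]

# Fit once on the training part, then predict the held-out part.
model = RandomForestRegressor(n_estimators=25, random_state=0)
model.fit(X_tr, y_tr)
pred = model.predict(X_te)

r2 = r2_score(y_te, pred)
ev = explained_variance_score(y_te, pred)
print(f"R^2: {r2:.3f}  explained variance: {ev:.3f}")
```

Explained variance ignores a constant offset in the predictions while R² penalizes it, so explained variance is never lower than R² on the same predictions.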


In [12]:
import joblib  # sklearn.externals.joblib was removed in newer scikit-learn releases

joblib.dump(model_rfg,'randonforest_and_variancthreshold.sav')


['randonforest_and_variancthreshold.sav']
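
The saved file can be loaded back with `joblib.load`. The sketch below round-trips a small model through a temporary file to check that predictions are unchanged; the toy data and the file name `model_demo.sav` are placeholders, not the notebook's artifacts.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy model (assumption) standing in for model_rfg.
rng = np.random.RandomState(0)
X = rng.normal(size=(50, 3))
y = X[:, 0] + rng.normal(scale=0.1, size=50)
model = RandomForestRegressor(n_estimators=5, random_state=0).fit(X, y)

# Save to a temporary location, then restore and compare predictions.
path = os.path.join(tempfile.mkdtemp(), "model_demo.sav")
joblib.dump(model, path)
restored = joblib.load(path)

same = np.allclose(model.predict(X), restored.predict(X))
print(same)
```

Only the random forest is stored in this notebook's file, so the fitted `VarianceThreshold` would have to be saved as well (or both wrapped in a pipeline) to reproduce the same 68-column input at prediction time.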