# <center> Идентификация пользователей по посещенным веб-страницам
<img src='http://i.istockimg.com/file_thumbview_approve/21546327/5/stock-illustration-21546327-identification-de-l-utilisateur.jpg'>


вспомним про концепцию стохастического градиентного спуска и опробуем классификатор Scikit-learn SGDClassifier, который работает намного быстрее на больших выборках, чем алгоритмы.

In [1]:
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score

**Считаем данные [соревнования](https://inclass.kaggle.com/c/identify-me-if-you-can-yandex-mipt/data) в DataFrame train_df и test_df (обучающая и тестовая выборки).**

In [2]:
train_df = pd.read_csv('kaggle_data/train_sessions.csv', index_col='session_id')
test_df = pd.read_csv('kaggle_data/test_sessions.csv', index_col='session_id')

In [3]:
train_df.head()

Unnamed: 0_level_0,site1,time1,site2,time2,site3,time3,site4,time4,site5,time5,...,time6,site7,time7,site8,time8,site9,time9,site10,time10,user_id
session_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,8,2014-01-04 08:44:50,11.0,2014-01-04 08:44:50,82.0,2014-01-04 08:45:19,68.0,2014-01-04 08:45:25,8.0,2014-01-04 08:45:25,...,2014-01-04 08:45:51,8403.0,2014-01-04 08:45:51,932.0,2014-01-04 08:45:53,3260.0,2014-01-04 08:45:53,8.0,2014-01-04 08:45:53,1845
2,111,2014-03-18 10:33:20,78.0,2014-03-18 10:33:31,151.0,2014-03-18 10:33:31,111.0,2014-03-18 10:33:31,1401.0,2014-03-18 10:33:31,...,2014-03-18 10:33:32,1375.0,2014-03-18 10:33:32,38.0,2014-03-18 10:33:32,1401.0,2014-03-18 10:33:32,97.0,2014-03-18 10:33:34,3322
3,11,2014-12-02 13:13:41,3187.0,2014-12-02 13:13:41,132.0,2014-12-02 13:13:42,496.0,2014-12-02 13:13:42,1969.0,2014-12-02 13:13:45,...,2014-12-02 13:13:45,3187.0,2014-12-02 13:13:45,82.0,2014-12-02 13:13:46,3191.0,2014-12-02 13:13:46,3184.0,2014-12-02 13:13:47,2003
4,668,2014-02-14 15:16:45,1965.0,2014-02-14 15:17:13,598.0,2014-02-14 15:20:47,1965.0,2014-02-14 15:21:13,284.0,2014-02-14 15:21:14,...,2014-02-14 15:21:14,38.0,2014-02-14 15:21:14,4451.0,2014-02-14 15:21:14,4537.0,2014-02-14 15:21:15,11.0,2014-02-14 15:21:15,1373
5,1943,2014-03-17 15:19:40,1943.0,2014-03-17 15:20:10,1943.0,2014-03-17 15:21:40,1943.0,2014-03-17 15:22:10,1943.0,2014-03-17 15:22:39,...,2014-03-17 15:22:39,1952.0,2014-03-17 15:22:41,1943.0,2014-03-17 15:22:41,1943.0,2014-03-17 15:22:42,1943.0,2014-03-17 15:22:43,1737


**Объединим обучающую и тестовую выборки – это понадобится, чтоб вместе потом привести их к разреженному формату.**

In [4]:
train_df.shape, test_df.shape

((95319, 21), (41177, 20))

В обучающей выборке видим следующие признаки:
    - site1 – индекс первого посещенного сайта в сессии
    - time1 – время посещения первого сайта в сессии
    - ...
    - site10 – индекс 10-го посещенного сайта в сессии
    - time10 – время посещения 10-го сайта в сессии
    - user_id – ID пользователя
    
Сессии пользователей выделены таким образом, что они не могут быть длинее получаса или 10 сайтов. То есть сессия считается оконченной либо когда пользователь посетил 10 сайтов подряд, либо когда сессия заняла по времени более 30 минут. 

**Посмотрим на статистику признаков.**

Пропуски возникают там, где сессии короткие (менее 10 сайтов). Скажем, если человек 1 января 2015 года посетил *vk.com* в 20:01, потом *yandex.ru* в 20:29, затем *google.com* в 20:33, то первая его сессия будет состоять только из двух сайтов (site1 – ID сайта *vk.com*, time1 – 2015-01-01 20:01:00, site2 – ID сайта  *yandex.ru*, time2 – 2015-01-01 20:29:00, остальные признаки – NaN), а начиная с *google.com* пойдет новая сессия, потому что уже прошло более 30 минут с момента посещения *vk.com*.

**В обучающей выборке – 550 пользователей.**

In [5]:
train_df['user_id'].nunique()

550

**Пока для прогноза ID пользователя будем использовать только индексы посещенных сайтов. Индексы нумеровались с 1, так что заменим пропуски на нули.**

In [6]:
train_sites = train_df[['site1', 'site2', 'site3','site4','site5','site6','site7',
          'site8', 'site9', 'site10', 'user_id']].fillna(0).astype('int')

In [7]:
def ures_top_sites(data, threshold=1):
    """ input dataframe
    return 
    list of top sites for users with the threshold
    dictionary user, top site's list """
    top_users_sites =[]
    dic_user_top ={}
    
    for user, values in pd.groupby(data, by = 'user_id'):
        n,m = values.shape
        sites, freq = np.unique(np.ravel(values.values), return_counts=True)
        mask = np.logical_not(np.logical_not(sites))
        sites = sites[mask]
        freq = freq[mask]
        sort_sites = sorted([(s, fr) for s,fr in zip(sites, freq)], key= lambda x: x[1], reverse =True )
        #user_site_top = sort_sites[:threshold]
        top_list = np.array([x[0] for x in  sort_sites[:threshold]])
        dic_user_top[user] = top_list
        top_users_sites.append(top_list)
    return np.unique(top_users_sites), dic_user_top

In [8]:
def top_sites(data, threshold=30):
    """ input dataframe
    return top sites with the threshold"""
    data_ravel =np.ravel(data.drop(['user_id'], axis =1).values)
    sites, freq = np.unique(data_ravel, return_counts=True)
    mask = np.logical_not(np.logical_not(sites))
    sites = sites[mask]
    freq = freq[mask]
    sort_sites = sorted([(s, fr) for s,fr in zip(sites, freq)], key= lambda x: x[1], reverse =True )
    #user_site_top = sort_sites[:threshold]
    top_list = np.array([x[0] for x in  sort_sites[:threshold]])
    return top_list

- создадим список из 10 самых популярных для каждого пользователя сайта

In [9]:
users_top_sites, dic_user_sites = ures_top_sites(train_sites, 4)

In [10]:
len(users_top_sites)

788

- создадим список из 300 самых популярных  сайтов в выборке

In [11]:
top_list_sites = top_sites(train_sites, 100)

In [12]:
def in2D(data, ar2):
    """ return the mask like data where True only the element of data is in ar2"""
    data = data.values
    n,m = data.shape
    mask = np.in1d(np.ravel(data),ar2)
    mask_2d = mask.reshape((n,m))
    return mask_2d
#np.where(mask_2d,train_sites.head(3).values, 0)

In [13]:
def gramm_2 (data):
    """ return 2_gramm from all rows of data """
    data = data.values
    n,m = data.shape
    ngramm = np.empty((n,m-1), dtype = object)
    for  i, list_ in enumerate (data.tolist()):
        ngramm[i] =[str(x[0])+'andsite'+str(x[1]) for x in zip(list_[:-1], list_[1:])]
    return ngramm

#### создадим признаки связанные с временем

In [14]:
train_time = train_df[['time1', 'time2', 'time3', 'time4','time5', 'time6',
                        'time7','time8', 'time9', 'time10']].fillna(np.datetime64('nat'))
train_time.head(3) 

Unnamed: 0_level_0,time1,time2,time3,time4,time5,time6,time7,time8,time9,time10
session_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
1,2014-01-04 08:44:50,2014-01-04 08:44:50,2014-01-04 08:45:19,2014-01-04 08:45:25,2014-01-04 08:45:25,2014-01-04 08:45:51,2014-01-04 08:45:51,2014-01-04 08:45:53,2014-01-04 08:45:53,2014-01-04 08:45:53
2,2014-03-18 10:33:20,2014-03-18 10:33:31,2014-03-18 10:33:31,2014-03-18 10:33:31,2014-03-18 10:33:31,2014-03-18 10:33:32,2014-03-18 10:33:32,2014-03-18 10:33:32,2014-03-18 10:33:32,2014-03-18 10:33:34
3,2014-12-02 13:13:41,2014-12-02 13:13:41,2014-12-02 13:13:42,2014-12-02 13:13:42,2014-12-02 13:13:45,2014-12-02 13:13:45,2014-12-02 13:13:45,2014-12-02 13:13:46,2014-12-02 13:13:46,2014-12-02 13:13:47


In [15]:
test_sites = test_df[['site1', 'site2', 'site3','site4','site5','site6','site7',
                      'site8', 'site9', 'site10']].fillna(0).astype('int')
test_time = test_df[['time1', 'time2', 'time3', 'time4','time5', 'time6',
                        'time7','time8', 'time9', 'time10']].fillna(np.datetime64('nat'))

- создадим 2 граммы сайтов идущих подряд в каждой сессии

In [16]:
X_train_2gramm = gramm_2 (train_sites.drop(['user_id'], axis =1))
X_test_2gramm = gramm_2 (test_sites)

- переведем строчки базы sites в строчки типа документов, сначала заменим 0 на пробел чтобы избежать обучения на 0

In [17]:
train_str = train_sites.drop(['user_id'], axis =1).astype(str).replace('0','')
test_str = test_sites.astype(str).replace('0','')

In [18]:
train_str = train_str.apply(lambda x: ' '.join(x), axis =1)
test_str = test_str.apply(lambda x: ' '.join(x), axis =1)

In [19]:
train_time = train_time.apply(pd.to_datetime).values
test_time = test_time.apply(pd.to_datetime).values

In [20]:
train_time.shape, test_time.shape

((95319L, 10L), (41177L, 10L))

In [21]:
# время перехода между сайтами в 3 сек 
train_time_diff = (np.diff(train_time, axis =1)/np.timedelta64(3, 's')).round()
test_time_diff = (np.diff(test_time, axis =1)/np.timedelta64(3, 's')).round()

In [22]:
train_time_diff = np.nan_to_num(train_time_diff).astype(int) # заменим пропуски на ноль
test_time_diff = np.nan_to_num(test_time_diff).astype(int)

- к созданым 2gramm из посещенных сайтов в сессии добавим к этим 2gramm время перехода между сайтами

In [23]:
X_train_2gramm_time = X_train_2gramm +'andtime' + train_time_diff.astype(str)
X_test_2gramm_time = X_test_2gramm +'andtime' + test_time_diff.astype(str)

In [24]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
#cvt = CountVectorizer( ngram_range = (1,2), min_df =2)
#tvt = TfidfVectorizer( ngram_range = (1,2), min_df =2)

In [25]:
def ngramm_analys (Train, Test, ngramm = (1,1) , min_freq = 2, analyzer = u'word'):
    cvt = CountVectorizer( ngram_range = ngramm, min_df =min_freq, analyzer = analyzer)
    tvt = TfidfVectorizer( ngram_range = ngramm, min_df =min_freq, analyzer = analyzer)
    X_train_cvt = cvt.fit_transform(Train)
    X_train_tvt = tvt.fit_transform(Train)
    X_test_cvt  = cvt.transform(Test)
    X_test_tvt  = tvt.transform(Test)
    return X_train_cvt, X_train_tvt, X_test_cvt, X_test_tvt

**создадим признаки из списка посещенных сайтов ипользуя инструмент CountVectorizer, TfidfVectorizer**

In [68]:
X_train_cvt1, X_train_tvt1, X_test_cvt1, X_test_tvt1 = ngramm_analys (train_str,  test_str, ngramm =(1,2), min_freq =2)
X_train_cvt1.shape, X_train_tvt1.shape

((95319, 60252), (95319, 60252))

**обучим и проверим на валидации два классификатора SGDClassifier( alpha=0.00007, loss ='log') и SGDClassifier( alpha=0.00007, loss ='hinge')**

In [73]:
def fit_SGD (Train, y, explain = "fit_SGD"):
    # Разобьем обучающую выборку на 2 части в пропорции 7/3
    X_train, X_valid, y_train, y_valid = train_test_split(Train, y, test_size=0.3, 
                                                     random_state=5, stratify=y)
    
    sgd_logit = SGDClassifier( alpha=0.00007, loss ='log', random_state = 0,  n_jobs = -1)
    sgd_svm  = SGDClassifier( alpha=0.00007, loss ='hinge', random_state=0,  n_jobs = -1)
    print('fit...')
    %time sgd_logit.fit(X_train, y_train)
    %time sgd_svm.fit(X_train, y_train)
    # Сделаем прогнозы на отложенной выборке (X_valid, y_valid)
    print('predict...')
    pred_log = sgd_logit.predict(X_valid)
    pred_svm = sgd_svm.predict(X_valid)
    print (explain)
    accu = accuracy_score(y_valid, pred_log)
    print('log',accuracy_score(y_valid, pred_log))
    print('svm',accuracy_score(y_valid, pred_svm))
    return sgd_logit, sgd_svm, accu*sgd_logit.predict_proba(X_valid), sgd_svm.predict(X_train)

In [70]:
y = train_df['user_id'].values

In [74]:
sgd_log1, sgd_svm1, pred1, pred_svm1 = fit_SGD(X_train_cvt1,y,'only sites in session with CountVectorizer preperation')

fit...
Wall time: 40.4 s
Wall time: 29.5 s
predict...
only sites in session with CountVectorizer preperation
('log', 0.29248845992446498)
('svm', 0.27402433906840118)


In [75]:
sgd_log2, sgd_svm2, pred2, pred_svm2 = fit_SGD(X_train_tvt1,y,'only sites in session with TfidfVectorizer preperation')

fit...
Wall time: 40.8 s
Wall time: 34.2 s
predict...
only sites in session with TfidfVectorizer preperation
('log', 0.19775493075954678)
('svm', 0.30791019723038188)


In [76]:
pred = pred1 + pred2
svm_pred = np.hstack((pred_svm1, pred_svm2))

**добавим 2gramm в качестве признаков**

In [77]:
train_2gramm_str = pd.DataFrame(X_train_2gramm).apply(lambda x: ' '.join(x), axis =1)
test_2gramm_str  = pd.DataFrame(X_test_2gramm).apply(lambda x: ' '.join(x), axis =1)

In [80]:
X_train_cvt2, X_train_tvt2, X_test_cvt2, X_test_tvt2 = ngramm_analys (train_2gramm_str,test_2gramm_str,\
                                                                      ngramm =(1,2), min_freq =4)
X_train_cvt2.shape, X_train_tvt2.shape

((95319, 39543), (95319, 39543))

In [82]:
sgd_log3, sgd_svm3, pred3, pred_svm3= fit_SGD(X_train_cvt2,y,'2gramm in session with CountVectorizer preperation')

fit...
Wall time: 44.9 s
Wall time: 33.3 s
predict...
2gramm in session with CountVectorizer preperation
('log', 0.21499510421037907)
('svm', 0.20747657014967127)


In [83]:
sgd_log4, sgd_svm4, pred4, pred_svm4 = fit_SGD(X_train_tvt2,y,'2gramm in session with TfidfVectorizer preperation')

fit...
Wall time: 40.8 s
Wall time: 30.3 s
predict...
2gramm in session with TfidfVectorizer preperation
('log', 0.13159183102531824)
('svm', 0.23206042803189258)


In [84]:
pred = pred + pred3 + pred4
svm_pred = np.hstack((svm_pred, pred_svm3, pred_svm4))

**добавим 2gramm + time diff в качестве признаков**

In [85]:
train_2gramm_time_str = pd.DataFrame(X_train_2gramm_time).apply(lambda x: ' '.join(x), axis =1)
test_2gramm_time_str  = pd.DataFrame(X_test_2gramm_time).apply(lambda x: ' '.join(x), axis =1)

In [86]:
X_train_cvt3, X_train_tvt3, X_test_cvt3, X_test_tvt3 =\
                        ngramm_analys (train_2gramm_time_str,test_2gramm_time_str, min_freq =3)
X_train_cvt3.shape, X_train_tvt3.shape

((95319, 33664), (95319, 33664))

In [87]:
sgd_log5, sgd_svm5, pred5, pred_svm5  = fit_SGD(X_train_cvt3,y,'2gramm_time in session with CountVectorizer preperation')

fit...
Wall time: 36.7 s
Wall time: 27.8 s
predict...
2gramm_time in session with CountVectorizer preperation
('log', 0.19838438942509443)
('svm', 0.20988949503427054)


In [88]:
sgd_log6, sgd_svm6, pred6, pred_svm6 = fit_SGD(X_train_tvt3,y,'2gramm_time in session with TfidfVectorizer preperation')

fit...
Wall time: 46.3 s
Wall time: 28.3 s
predict...
2gramm_time in session with TfidfVectorizer preperation
('log', 0.11760386067981536)
('svm', 0.22583578122814379)


In [84]:
pred = pred + pred5 + pred6
svm_pred = np.hstack((svm_pred, pred_svm5, pred_svm6))

In [57]:
def next_step (X_train, X_test, clf1):
    
    n1,_ = X_train.shape
    tr_pred1= clf1.predict(X_train).reshape(n1,1)
    
    
    n2,_ = X_test.shape
    ts_pred1= clf1.predict(X_test).reshape(n2,1)
    
    
    from sklearn.preprocessing import OneHotEncoder
    OHEn = OneHotEncoder()
    tr_encode1 = OHEn.fit_transform(tr_pred1)
    ts_encode1 = OHEn.transform(ts_pred1)
    
    
    return tr_encode1, ts_encode1

In [58]:
X_tr, X_ts = next_step (X_train_tvt1, X_test_tvt1, sgd_svm2)

In [59]:
X_tr2, X_ts2 = next_step (X_train_tvt2, X_test_tvt2, sgd_svm4)

In [60]:
X_tr3, X_ts3 = next_step (X_train_tvt3, X_test_tvt3, sgd_svm6)

In [61]:
from scipy.sparse import hstack
X_train_new = hstack((X_tr, X_tr2, X_tr3))
X_test_new  = hstack((X_ts, X_ts2, X_ts3))
X_train_new.shape, X_test_new.shape

((95319, 1641), (41177, 1641))

In [62]:
sgd_log7, sgd_svm7 = fit_SGD(X_train_new,y,'new essemble')

fit...
Wall time: 36 s
Wall time: 20.6 s
predict...
new essemble
('log', 0.26528185760246187)
('svm', 0.26195971464540496)


In [64]:
X_train1, X_valid1, y_train1, y_valid1 = train_test_split(X_train_cvt1, y, test_size=0.3, 
                                                     random_state=5, stratify=y)
sgd_logit = SGDClassifier( alpha=0.00007, loss ='log', random_state = 5,  n_jobs = -1)
sgd_logit8 = sgd_logit.fit(X_train1, y_train1)
pred1 = sgd_logit8.predict_proba(X_valid1)

X_train2, X_valid2, y_train1, y_valid1 = train_test_split(X_train_cvt2, y, test_size=0.3, 
                                                     random_state=5, stratify=y)
sgd_logit9 = sgd_logit.fit(X_train2, y_train1)
pred2 = sgd_logit9.predict_proba(X_valid2)

X_train3, X_valid3, y_train1, y_valid1 = train_test_split(X_train_cvt3, y, test_size=0.3, 
                                                     random_state=5, stratify=y)
sgd_logit10 = sgd_logit.fit(X_train3, y_train1)
pred3 = sgd_logit10.predict_proba(X_valid3)

X_train4, X_valid4, y_train1, y_valid1 = train_test_split(X_train_new, y, test_size=0.3, 
                                                     random_state=5, stratify=y)
sgd_logit9 = sgd_logit.fit(X_train4, y_train1)
pred4 = sgd_logit9.predict_proba(X_valid4)

pred_sum_ = 0.28*pred1 + 0.23*pred2 + 0.2*pred3 + 0.27*pred4 

pred_sum = sgd_logit9.classes_[pred_sum_.argmax(axis=1)]
print accuracy_score(y_valid1, pred_sum)

0.277940970765


In [65]:
pred_sum_ = 0.28*pred1 + 0.23*pred2 + 0.2*pred3 + 0.27*pred4 

pred_sum = sgd_logit9.classes_[pred_sum_.argmax(axis=1)]
print accuracy_score(y_valid1, pred_sum)

0.277975940691


**Добавим временные признаки**

In [66]:
# время захода на сайт для train_sites, test_sites соответсвенно разница между заходами на сайты в сессии
train_time.shape, test_time.shape, train_time_diff.shape, test_time_diff.shape

((95319L, 10L), (41177L, 10L), (95319L, 9L), (41177L, 9L))

In [27]:
# время сессии в 3-х сек
train_session_time = (np.max(train_time, axis =1)
                                          -np.min(train_time, axis =1))/np.timedelta64(3, 's')
test_session_time  = (np.max(test_time, axis =1)
                                          -np.min(test_time, axis =1))/np.timedelta64(3, 's')

train_session_time = train_session_time.reshape(train_time.shape[0],1)
test_session_time  = test_session_time.reshape(test_time.shape[0],1)

- отмаштабируем признаки

In [44]:
from sklearn.preprocessing import StandardScaler
scale = StandardScaler()

In [45]:
train_session_scaled = scale.fit_transform(train_session_time) # время сессии отмаштабированное
test_session_scaled  = scale.transform(train_session_time) # время сессии отмаштабированное

In [46]:
scale = StandardScaler()
train_time_diff_scaled = scale.fit_transform(train_time_diff)
test_time_diff_scaled  = scale.transform(test_time_diff)

**начало сессии в часах день недели, время обращения к top300 sites, top10user sites**

In [28]:
# время начала сессии в часах
train_start_hour = (pd.to_datetime(np.min(train_time, axis =1))).hour
train_start_hour = train_start_hour.reshape(train_time.shape[0],1)
test_start_hour  = (pd.to_datetime(np.min(test_time, axis =1))).hour
test_start_hour  = test_start_hour.reshape(test_time.shape[0],1)

In [29]:
# начало сессии по дням недели
train_day_of_week = (pd.to_datetime(np.min(train_time, axis =1))).dayofweek
train_day_of_week = train_day_of_week.reshape(train_time.shape[0],1) 
test_day_of_week = (pd.to_datetime(np.min(test_time, axis =1))).dayofweek
test_day_of_week = test_day_of_week.reshape(test_time.shape[0],1) 

In [30]:
diff_top_sites = np.setdiff1d( users_top_sites, top_list_sites)
len(diff_top_sites), len(users_top_sites)

(828, 988)

In [31]:
train_mask_top_sites = in2D(train_sites.drop(['user_id'], axis =1), top_list_sites) #  users_top_sites
test_mask_top_sites = in2D(test_sites, top_list_sites) #  users_top_sites
train_mask_top_sites.shape, test_mask_top_sites.shape

((95319L, 10L), (41177L, 10L))

In [32]:
train_time_top_site = np.where(train_mask_top_sites, train_time, np.datetime64('nat'))
test_time_top_site = np.where(test_mask_top_sites, test_time, np.datetime64('nat'))

In [33]:
train_mask_top_sites = in2D(train_sites.drop(['user_id'], axis =1), diff_top_sites) #  users_top_sites
test_mask_top_sites = in2D(test_sites, diff_top_sites) #  users_top_sites
train_mask_top_sites.shape, test_mask_top_sites.shape

((95319L, 10L), (41177L, 10L))

In [34]:
train_time_top_user_site = np.where(train_mask_top_sites, train_time, np.datetime64('nat'))
test_time_top_user_site = np.where(test_mask_top_sites, test_time, np.datetime64('nat'))

In [35]:
# время первого захода на сайт из top300 в часах
train_top_hour = (pd.to_datetime(np.min(train_time_top_site, axis =1))).hour
train_top_hour = np.nan_to_num(train_top_hour).reshape(train_time.shape[0],1)
test_top_hour  = (pd.to_datetime(np.min(test_time_top_site, axis =1))).hour
test_top_hour  = np.nan_to_num(test_top_hour).reshape(test_time.shape[0],1)

In [36]:
# время первого захода на сайт из top300 в день недели
train_top_day = (pd.to_datetime(np.min(train_time_top_site, axis =1))).dayofweek
train_top_day = np.nan_to_num(train_top_day).reshape(train_time.shape[0],1)
test_top_day  = (pd.to_datetime(np.min(test_time_top_site, axis =1))).dayofweek
test_top_day  = np.nan_to_num(test_top_day).reshape(test_time.shape[0],1)

In [38]:
# время первого захода на сайт из top10_user в часах
train_top_user_hour = (pd.to_datetime(np.min(train_time_top_user_site, axis =1))).hour
train_top_user_hour = np.nan_to_num(train_top_user_hour).reshape(train_time.shape[0],1)
test_top_user_hour  = (pd.to_datetime(np.min(test_time_top_user_site, axis =1))).hour
test_top_user_hour  = np.nan_to_num(test_top_user_hour).reshape(test_time.shape[0],1)

In [39]:
# время первого захода на сайт из top10_user в день недели
train_top_user_day = (pd.to_datetime(np.min(train_time_top_user_site, axis =1))).dayofweek
train_top_user_day = np.nan_to_num(train_top_user_day).reshape(train_time.shape[0],1)
test_top_user_day  = (pd.to_datetime(np.min(test_time_top_user_site, axis =1))).dayofweek
test_top_user_day  = np.nan_to_num(test_top_user_day).reshape(test_time.shape[0],1)

- произведем кодирование временных признаков в категориальные

In [40]:
from sklearn.preprocessing import OneHotEncoder
OHE = OneHotEncoder()
tr_start_hour_encode = OHE.fit_transform(train_start_hour) # время захода на сайт в часах
ts_start_hour_encode = OHE.transform(test_start_hour)

In [41]:
OHE = OneHotEncoder()
tr_start_day_encode = OHE.fit_transform(train_day_of_week) # время захода на сайт в дняхнедели
ts_start_day_encode = OHE.transform(test_day_of_week)

In [42]:
OHE = OneHotEncoder()
tr_top_hour_encode = OHE.fit_transform(train_top_hour.astype(int).astype(str)) # время захода на сайт из top300 в часах
ts_top_hour_encode = OHE.transform(test_top_hour.astype(int))

In [43]:
OHE = OneHotEncoder()
tr_top_day_encode = OHE.fit_transform(train_top_day) # время захода на сайт из top300 в днях
ts_top_day_encode = OHE.transform(test_top_day)

In [44]:
OHE = OneHotEncoder()
tr_top_user_hour_encode = OHE.fit_transform(train_top_user_hour) # время захода на сайт из top10user в часах
ts_top_user_hour_encode = OHE.transform(test_top_user_hour)

In [45]:
OHE = OneHotEncoder()
tr_top_user_day_encode = OHE.fit_transform(train_top_user_day) # время захода на сайт из top10user в днях
ts_top_user_day_encode = OHE.transform(test_top_user_day)

**добавим по одному признаку из 6 новых к sites data**

In [73]:
X_train_cvt1.shape, tr_start_hour_encode.shape

((95319, 12341), (95319, 17))

In [74]:
from scipy.sparse import hstack
X_train_cvt_time = hstack((X_train_cvt1, tr_start_hour_encode))
X_train_cvt_time.shape

(95319, 12358)

In [78]:
sgd_log7, sgd_svm7 = fit_SGD(X_train_cvt_time,y,'sites + hour in session with CountVectorizer preperation')

fit...
Wall time: 34.7 s
Wall time: 22.1 s
predict...
sites + hour in session with CountVectorizer preperation
('log', 0.3170373478808225)
('svm', 0.3016156105749056)


In [79]:
X_train_cvt_time2 = hstack((X_train_cvt1, tr_start_day_encode))
X_train_cvt_time2.shape

(95319, 12348)

In [80]:
sgd_log8, sgd_svm8 = fit_SGD(X_train_cvt_time2,y,'sites + hour in session with CountVectorizer preperation')

fit...
Wall time: 34.3 s
Wall time: 26.8 s
predict...
sites + hour in session with CountVectorizer preperation
('log', 0.31791159602741642)
('svm', 0.30056651279899288)


In [81]:
X_train_cvt_time2 = hstack((X_train_cvt_time2, tr_start_hour_encode))
X_train_cvt_time2.shape

(95319, 12365)

In [82]:
sgd_log9, sgd_svm9 = fit_SGD(X_train_cvt_time2,y,'sites + hour+day in session with CountVectorizer preperation')

fit...
Wall time: 37.3 s
Wall time: 31 s
predict...
sites + hour+day in session with CountVectorizer preperation
('log', 0.35770737166037209)
('svm', 0.33875367184221572)


In [83]:
X_train_cvt_time3 = hstack((X_train_cvt_time2, tr_top_hour_encode, tr_top_day_encode))
X_train_cvt_time3.shape

(95319, 12390)

In [84]:
sgd_log10, sgd_svm10 = fit_SGD(X_train_cvt_time3,y,'sites + hour+day+top300 in session with CountVectorizer preperation')

fit...
Wall time: 33 s
Wall time: 28.3 s
predict...
sites + hour+day+top300 in session with CountVectorizer preperation
('log', 0.36526087564694365)
('svm', 0.33945307035949085)


In [85]:
X_train_cvt_time4 = hstack((X_train_cvt_time3, tr_top_user_hour_encode, tr_top_user_day_encode))
X_train_cvt_time4.shape

(95319, 12426)

In [86]:
sgd_log11, sgd_svm11 = fit_SGD(X_train_cvt_time4,y,\
                               'sites + hour+day+top300+top10users in session with CountVectorizer preperation')

fit...
Wall time: 38.5 s
Wall time: 25.5 s
predict...
sites + hour+day+top300+top10users in session with CountVectorizer preperation
('log', 0.36714925164358653)
('svm', 0.3426703035389565)


In [87]:
X_train_tvt_time6 = hstack((X_train_gramm_tvt,tr_top_user_hour_encode, tr_top_user_day_encode, 
            tr_top_hour_encode, tr_top_day_encode, tr_start_hour_encode, tr_start_day_encode))
X_train_tvt_time6.shape

(95319, 24767)

In [88]:
sgd_log12, sgd_svm12 = fit_SGD(X_train_tvt_time6,y,\
                               'sites + hour+day+top300+top10users in session with TfidfVectorizer preperation')

fit...
Wall time: 38.7 s
Wall time: 27.1 s
predict...
sites + hour+day+top300+top10users in session with TfidfVectorizer preperation
('log', 0.32221289690865856)
('svm', 0.35634354455168554)


In [92]:
#X_train_cvt_time4, 
time_sparse = csr_matrix(train_time_diff_scaled)

In [95]:
X_train_cvt_time7 = hstack((X_train_cvt_time4, time_sparse))
X_train_cvt_time7.shape

(95319, 12435)

In [96]:
sgd_log14, sgd_svm14 = fit_SGD(X_train_cvt_time7,y,\
                               'sites + hour+day+top300+top10users+diff_time in session with CountVectorizer preperation')

fit...
Wall time: 39.5 s
Wall time: 31.4 s
predict...
sites + hour+day+top300+top10users+diff_time in session with CountVectorizer preperation
('log', 0.36368722898307454)
('svm', 0.34043222828367603)


In [48]:
X_train_cvt21, X_train_tvt21, X_test_cvt21, X_test_tvt21 = ngramm_analys (train_str,test_str, ngramm =(1,1),
                                                                          min_freq =2, analyzer =u'word')
X_train_cvt21.shape, X_train_tvt21.shape

((95319, 12341), (95319, 12341))

In [56]:
X_train_cvt32, X_train_tvt32, X_test_cvt32, X_test_tvt32 = ngramm_analys (train_str,test_str, ngramm =(2,2),
                                                                          min_freq =3, analyzer =u'word')
X_train_cvt32.shape, X_train_tvt32.shape

((95319, 27265), (95319, 27265))

In [57]:
X_train_cvt42, X_train_tvt42, X_test_cvt42, X_test_tvt42 = ngramm_analys (train_str,test_str, ngramm =(2,2),
                                                                          min_freq =4, analyzer =u'word')
X_train_cvt42.shape, X_train_tvt42.shape

((95319, 19263), (95319, 19263))

In [58]:
X_train_cvt33, X_train_tvt33, X_test_cvt33, X_test_tvt33 = ngramm_analys (train_str,test_str, ngramm =(3,3),
                                                                          min_freq =3, analyzer =u'word')
X_train_cvt33.shape, X_train_tvt33.shape

((95319, 24879), (95319, 24879))

In [59]:
X_train_cvt43, X_train_tvt43, X_test_cvt43, X_test_tvt43 = ngramm_analys (train_str,test_str, ngramm =(3,3),
                                                                          min_freq =4, analyzer =u'word')
X_train_cvt43.shape, X_train_tvt43.shape

((95319, 16358), (95319, 16358))

In [97]:
from scipy.sparse import hstack
X_train_cvt_time8 = hstack((X_train_cvt21,tr_top_user_hour_encode, tr_top_user_day_encode, 
            tr_top_hour_encode, tr_top_day_encode, tr_start_hour_encode, tr_start_day_encode))
X_train_cvt_time8.shape

(95319, 12415)

In [98]:
sgd_log15, sgd_svm15 = fit_SGD(X_train_cvt_time8,y,\
            'sites+ hour+ day+ top300+ top10users  in session with CountVectorizer preperation')

fit...
Wall time: 33.5 s
Wall time: 23.7 s
predict...
sites+ hour+ day+ top300+ top10users  in session with CountVectorizer preperation
('log', 0.37117079311791856)
('svm', 0.3410267170233599)


In [62]:
X_train_cvt_time9 = hstack((X_train_cvt21,X_train_cvt32,tr_top_user_hour_encode, tr_top_user_day_encode, 
            tr_top_hour_encode, tr_top_day_encode, tr_start_hour_encode, tr_start_day_encode))
X_train_cvt_time9.shape

(95319, 39680)

In [64]:
sgd_log16, sgd_svm16 = fit_SGD(X_train_cvt_time9,y,\
            'sites+ ngramm+ hour+ day+ top300+ top10users  in session with CountVectorizer preperation')

fit...
Wall time: 54.4 s
Wall time: 34.3 s
predict...
sites+ ngramm+ hour+ day+ top300+ top10users  in session with CountVectorizer preperation
('log', 0.36802349979018045)
('svm', 0.3434046719820954)


In [65]:
X_train_cvt_time10 = hstack((X_train_cvt21,X_train_cvt42,tr_top_user_hour_encode, tr_top_user_day_encode, 
            tr_top_hour_encode, tr_top_day_encode, tr_start_hour_encode, tr_start_day_encode))
X_train_cvt_time10.shape

(95319, 31678)

In [66]:
sgd_log17, sgd_svm17 = fit_SGD(X_train_cvt_time10,y,\
            'sites+ ngramm+ hour+ day+ top300+ top10users  in session with CountVectorizer preperation')

fit...
Wall time: 48.4 s
Wall time: 37.4 s
predict...
sites+ ngramm+ hour+ day+ top300+ top10users  in session with CountVectorizer preperation
('log', 0.36777871030913417)
('svm', 0.34746118338229121)


In [67]:
X_train_cvt_time11 = hstack((X_train_cvt21,X_train_cvt42, X_train_cvt33, tr_top_user_hour_encode, tr_top_user_day_encode, 
            tr_top_hour_encode, tr_top_day_encode, tr_start_hour_encode, tr_start_day_encode))
X_train_cvt_time11.shape

(95319, 56557)

In [68]:
sgd_log18, sgd_svm18 = fit_SGD(X_train_cvt_time11,y,\
            'sites+ ngramm+ hour+ day+ top300+ top10users  in session with CountVectorizer preperation')

fit...
Wall time: 55.9 s
Wall time: 46.3 s
predict...
sites+ ngramm+ hour+ day+ top300+ top10users  in session with CountVectorizer preperation
('log', 0.3648062666107148)
('svm', 0.34441879983214435)


In [69]:
X_train_cvt_time12 = hstack((X_train_cvt21,X_train_cvt42, X_train_cvt43, tr_top_user_hour_encode, tr_top_user_day_encode, 
            tr_top_hour_encode, tr_top_day_encode, tr_start_hour_encode, tr_start_day_encode))
X_train_cvt_time12.shape

(95319, 48036)

In [70]:
sgd_log19, sgd_svm19 = fit_SGD(X_train_cvt_time12,y,\
            'sites+ ngramm+ hour+ day+ top300+ top10users  in session with CountVectorizer preperation')

fit...
Wall time: 57 s
Wall time: 39.5 s
predict...
sites+ ngramm+ hour+ day+ top300+ top10users  in session with CountVectorizer preperation
('log', 0.36473632675898726)
('svm', 0.34172611554063503)


In [71]:
X_train_cvt_time13 = hstack((X_train_cvt21,X_train_cvt43, tr_top_user_hour_encode, tr_top_user_day_encode, 
            tr_top_hour_encode, tr_top_day_encode, tr_start_hour_encode, tr_start_day_encode))
X_train_cvt_time13.shape

(95319, 28773)

In [72]:
sgd_log20, sgd_svm20 = fit_SGD(X_train_cvt_time13,y,\
            'sites+ ngramm+ hour+ day+ top300+ top10users  in session with CountVectorizer preperation')

fit...
Wall time: 59 s
Wall time: 34.9 s
predict...
sites+ ngramm+ hour+ day+ top300+ top10users  in session with CountVectorizer preperation
('log', 0.36987690586095956)
('svm', 0.34735627360469995)


In [73]:
X_train_cvt11, X_train_tvt11, X_test_cvt11, X_test_tvt11 = ngramm_analys (train_str,test_str, ngramm =(1,1),
                                                                          min_freq =1, analyzer =u'word')
X_train_cvt11.shape, X_train_tvt11.shape

((95319, 21523), (95319, 21523))

In [74]:
X_train_cvt_time14 = hstack((X_train_cvt11, tr_top_user_hour_encode, tr_top_user_day_encode, 
            tr_top_hour_encode, tr_top_day_encode, tr_start_hour_encode, tr_start_day_encode))
X_train_cvt_time14.shape

(95319, 21597)

In [75]:
sgd_log21, sgd_svm21 = fit_SGD(X_train_cvt_time14,y,\
            'sites+ ngramm+ hour+ day+ top300+ top10users  in session with CountVectorizer preperation')

fit...
Wall time: 48.8 s
Wall time: 27.9 s
predict...
sites+ ngramm+ hour+ day+ top300+ top10users  in session with CountVectorizer preperation
('log', 0.37127570289550987)
('svm', 0.34190096516995383)


In [91]:
X_train_cvt_time15 = hstack((X_train_tvt11, tr_top_user_hour_encode, tr_top_user_day_encode, 
            tr_top_hour_encode, tr_top_day_encode, tr_start_hour_encode, tr_start_day_encode))
X_train_cvt_time15.shape

(95319, 21597)

In [92]:
sgd_log22, sgd_svm22 = fit_SGD(X_train_cvt_time15,y,\
            'sites + hour+ day+ top300+ top10users  in session with TfidfVectorizer preperation')

fit...
Wall time: 40.2 s
Wall time: 25.9 s
predict...
sites + hour+ day+ top300+ top10users  in session with TfidfVectorizer preperation
('log', 0.28098335431528887)
('svm', 0.32399636312771019)


**Сделаем прогноз для тестовой выборки с помощью sgd_log21.**

In [99]:
X_test_cvt_time = hstack((X_test_cvt21, ts_top_user_hour_encode, ts_top_user_day_encode, 
            ts_top_hour_encode, ts_top_day_encode, ts_start_hour_encode, ts_start_day_encode))
X_test_cvt_time.shape

(41177, 12415)

In [100]:
pred_test = sgd_log15.predict(X_test_cvt_time)

In [95]:
def write_to_submission_file(predicted_labels, out_file,
                             target='user_id', index_label="session_id"):
    # turn predictions into data frame and save as csv file
    predicted_df = pd.DataFrame(predicted_labels,
                                index = np.arange(1, predicted_labels.shape[0] + 1),
                                columns=[target])
    predicted_df.to_csv(out_file, index_label=index_label)

In [101]:
write_to_submission_file (pred_test, 'kaggle_data/[YDF&MIPT]_Coursera_Oleg67.csv')

**ансабль моделей**

In [79]:
X_train, X_valid, y_train, y_valid = train_test_split(X_train_cvt_time14, y, test_size=0.3, 
                                                     random_state=0, stratify=y)

In [83]:
from sklearn.ensemble import VotingClassifier
sgd_logit  = SGDClassifier( loss ='log', random_state = 0,  n_jobs = -1)
sgd_svm    = SGDClassifier( loss ='hinge', random_state=0,  n_jobs = -1)
sgd_modif  = SGDClassifier( loss ='modified_huber', random_state = 0,  n_jobs = -1)
sgd_sq_svm = SGDClassifier( loss ='squared_hinge', random_state=0,  n_jobs = -1)
 
estimators = [('log', sgd_logit), ('svm', sgd_svm), ('modif', sgd_modif), ('sq_svm', sgd_sq_svm)]
eclf = VotingClassifier(estimators, voting= 'hard', n_jobs = -1)

In [84]:
%%time
eclf.fit(X_train, y_train)

Wall time: 2min 5s


VotingClassifier(estimators=[('log', SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='log', n_iter=5, n_jobs=-1,
       penalty='l2', power_t=0.5, random_state=0, shuffle=True, verbose=0,
       warm_s...      penalty='l2', power_t=0.5, random_state=0, shuffle=True, verbose=0,
       warm_start=False))],
         n_jobs=-1, voting='hard', weights=None)

In [85]:
pred_eclf = eclf.predict(X_valid)
print('VotingClassifier',accuracy_score(y_valid, pred_eclf))

('VotingClassifier', 0.35665827388445936)


In [89]:
%%time
estimators = [('log', sgd_logit), ('modif', sgd_modif)]
eclf2 = VotingClassifier(estimators, voting= 'soft', n_jobs = -1)

eclf2.fit(X_train, y_train)

Wall time: 1min 20s


In [90]:
pred_eclf2 = eclf2.predict(X_valid)
print('VotingClassifier',accuracy_score(y_valid, pred_eclf2))

('VotingClassifier', 0.3503986571548468)


In [102]:
from sklearn.gaussian_process import GaussianProcessClassifier

In [103]:
GPC = GaussianProcessClassifier()

In [105]:
%%time
#GPC.fit(X_train, y_train)

Wall time: 0 ns


**Создаем разреженные матрицы *X_train_sparse* и *X_test_sparse* аналогично тому, как ранее. Используем объединенную матрицу train_test_df_sites – потом разделим обратно на обучающую и тестовую части.**


**Выделите в отдельный вектор *y* ответы на обучающей выборке.**

In [25]:
def matrix_to_sparse_matrix (matrix):
    """переводим обычную матрицу в разреженноу матрицу 
    где 
    номер столбца это уникальное число из исходной матрицы от 1  до максимального
    значение в строке это сколько раз уникальное число встречалось в строке оригинальной матрицы"""
    import numpy as np
    from scipy.sparse import csr_matrix
        
    NMZ = np.prod(np.array(matrix.shape)) # колличество элементов в matrix
    data = np.array([1]*NMZ)
    indptr = np.arange(0, NMZ+matrix.shape[1], matrix.shape[1])
    return csr_matrix((data, matrix.reshape(-1), indptr), dtype=int)[:,1:]

In [26]:
X_train_test_sparse =  matrix_to_sparse_matrix(train_test_df_sites.values)

In [27]:
X_train_test_sparse.shape#, matrix_time.shape

(136496, 24052)

** добавим к признакам посещенные сайты временные: начало сессии в часах и день недели**

In [28]:
from scipy.sparse import hstack
X_train_test_sparse_time = hstack((X_train_test_sparse, start_hour_encod, day_of_week_encod))#, tr_ts_df_time_encode))
X_train_test_sparse_time.shape

(136496, 24076)

- разделим train , test данные

In [29]:
X_train_sparse = X_train_test_sparse_time.tocsr()[: 95319, :]
X_test_sparse = X_train_test_sparse_time.tocsr()[95319 : ,:]
y = train_df['user_id'].values

In [30]:
X_train_sparse.shape, X_test_sparse.shape

((95319, 24076), (41177, 24076))

**Сохраним в pickle-файлы объекты *X_train_sparse*, *X_test_sparse* и *y* (последний – в файл *kaggle_data/train_target.pkl*).**

In [32]:
import pickle
with open('kaggle_data/X_train_sparse.pkl', 'wb') as X_train_sparse_pkl:
    pickle.dump(X_train_sparse, X_train_sparse_pkl)
with open('kaggle_data/X_test_sparse.pkl', 'wb') as X_test_sparse_pkl:
    pickle.dump(X_test_sparse, X_test_sparse_pkl)
with open('kaggle_data/train_target.pkl', 'wb') as train_target_pkl:
    pickle.dump(y, train_target_pkl)

In [31]:
def fit_SGD (Train, y, explain = "fit_SGD"):
    # Разобьем обучающую выборку на 2 части в пропорции 7/3
    X_train, X_valid, y_train, y_valid = train_test_split(Train, y, test_size=0.3, 
                                                     random_state=0, stratify=y)
    
    sgd_logit = SGDClassifier( alpha=0.00007, loss ='log', random_state = 0,  n_jobs = -1)
    sgd_svm  = SGDClassifier( alpha=0.00007, loss ='hinge', random_state=0,  n_jobs = -1)
    print('fit...')
    %time sgd_logit.fit(X_train, y_train)
    %time sgd_svm.fit(X_train, y_train)
    # Сделаем прогнозы на отложенной выборке (X_valid, y_valid)
    print('predict...')
    pred_log = sgd_logit.predict(X_valid)
    pred_svm = sgd_svm.predict(X_valid)
    print (explain)
    print('log',accuracy_score(y_valid, pred_log))
    print('svm',accuracy_score(y_valid, pred_svm))
    return sgd_logit, sgd_svm

**обучим класификаторы  по признакам site + day + hour**

In [32]:
sgd_log, sgd_svm = fit_SGD(X_train_sparse, y, 'site + time')

fit...
Wall time: 34.1 s
Wall time: 26.4 s
predict...
site + time
('log', 0.35812701077073716)
('svm', 0.33224926563155688)


**сделаем класификатор на временных признаках**

In [33]:
X_train_test_sparse_time = hstack((session_timespan_encode,start_hour_encod, day_of_week_encod, tr_ts_df_time_encode))
X_train_test_sparse_time.shape

(136496, 5219)

In [34]:
X_train_time = X_train_test_sparse_time.tocsr()[: 95319, :]
X_test_time = X_train_test_sparse_time.tocsr()[95319 :, :]

**обучим класификаторы по признакам diff time+ session time + day + hour**

In [35]:
sgd_log_time, sgd_svm_time = fit_SGD(X_train_time, y, 'diff time+ session time + day + hour')

fit...
Wall time: 36.6 s
Wall time: 27.5 s
predict...
diff time+ session time + day + hour
('log', 0.078052874527906005)
('svm', 0.034760106308574623)


**точность классификатора очень низкое**

**составим из ответов этого класификатора и первого новые признаки и обучим на них новую модель**

In [36]:
log_pred_time = sgd_log_time.predict_proba(X_train_time) # предсказание временной модели
log_pred = sgd_log.predict_proba(X_train_sparse) # предсказание первой модели

In [38]:
X_train_add = np.hstack((log_pred, log_pred_time) )
X_train_add.shape

(95319L, 1100L)

**обучим класификаторы  по признакам first.model + time.model**

**точность улучшилась по сравнению с sites+ day + hour  но не значительно**

- загрузим словарь сайтов

In [32]:
site_diction = pd.read_csv('kaggle_data/site_indexes.txt', names =['key','site'])
site_diction.head()

Unnamed: 0,key,site
0,1,fr.openclassrooms.com
1,2,sigayret.fr
2,3,c1.adform.net
3,4,dnn506yrbagrg.cloudfront.net
4,5,ocsp.verisign.com


- создадим дополнительные признаки по первому и второму доменному именю

In [33]:
site_diction['_second'] = site_diction.site.apply(lambda x: '.'.join(x.split('.')[-2:]))
site_diction['_first'] = site_diction.site.apply(lambda x: x.split('.')[-1])
site_diction.astype(str).head()

Unnamed: 0,key,site,_second,_first
0,1,fr.openclassrooms.com,openclassrooms.com,com
1,2,sigayret.fr,sigayret.fr,fr
2,3,c1.adform.net,adform.net,net
3,4,dnn506yrbagrg.cloudfront.net,cloudfront.net,net
4,5,ocsp.verisign.com,verisign.com,com


- преобразуем в словари

In [34]:
site_diction = site_diction.astype(str)
second_dic = site_diction.set_index('key')['_second'].to_dict()
first_dic = site_diction.set_index('key')['_first'].to_dict()
second_dic['0'] = ''
first_dic['0'] =''

#### используем для создания признаков инструменты из text features

In [35]:
df_train = train_df[['site1', 'site2', 'site3', 'site4','site5', 
                    'site6','site7', 'site8', 'site9', 'site10']].fillna(0).astype(int).astype(str)

In [36]:
df_test = test_df[['site1', 'site2', 'site3', 
                                     'site4','site5', 
                                     'site6','site7', 'site8', 
                                     'site9', 'site10']].fillna(0).astype(int).astype(str)

In [37]:
df_train = df_train.replace('0','')
df_test = df_test.replace('0','')

In [38]:
df_train_str = df_train.apply(lambda x: ' '.join(x), axis =1)

In [39]:
df_test_str = df_test.apply(lambda x: ' '.join(x), axis =1)

- создадим несколько таблиц признаков с разными значениями

**обучим класификаторы по этим признакам** 

-выберем наиболее сильные признаки - ngramm =(1,2) min_df = 2 для log CountVectorizer() сделаем предсказание и будем использовать его в следующем классификаторе

**добавим в качестве признаков первое и второе доменное имя**

In [45]:
df_train = train_df[['site1', 'site2', 'site3', 'site4','site5', 
                    'site6','site7', 'site8', 'site9', 'site10']].fillna(0).astype(int).astype('str')

In [46]:
df_test = test_df[['site1', 'site2', 'site3', 
                                     'site4','site5', 
                                     'site6','site7', 'site8', 
                                     'site9', 'site10']].fillna(0).astype(int).astype('str')

In [47]:
df2 = df_train.applymap(lambda x: second_dic[x])
df3 = df_train.applymap(lambda x: first_dic[x])
df3.head()

Unnamed: 0_level_0,site1,site2,site3,site4,site5,site6,site7,site8,site9,site10
session_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
1,com,com,com,fr,com,com,net,com,net,com
2,com,com,com,com,com,com,com,com,com,com
3,com,fr,io,com,com,fr,fr,com,com,fr
4,com,com,com,com,com,com,com,com,com,com
5,org,org,org,org,org,fr,fr,org,org,org


In [27]:
df2_test = df_test.applymap(lambda x: second_dic[x])
df3_test = df_test.applymap(lambda x: first_dic[x])

In [29]:
df_train2 = df_train2.apply(lambda x: ' '.join(x) if x != '0' else '', axis =1)
df_test2 = df_test2.apply(lambda x: ' '.join(x), axis =1)

In [30]:
df2 = df2.apply(lambda x: ' '.join(x), axis =1)
df2_test = df2_test.apply(lambda x: ' '.join(x), axis =1)
df3 = df3.apply(lambda x: ' '.join(x), axis =1)
df3_test = df3_test.apply(lambda x: ' '.join(x), axis =1)

In [31]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
cvt = CountVectorizer( ngram_range = (1,2), min_df =2)#, analyzer=u'char')
tvt = TfidfVectorizer( ngram_range = (1,2), min_df =2)#, analyzer=u'char')

In [32]:
%time X_train_ngram_c = cvt.fit_transform(df_train2.values)
X_train_ngram_t = tvt.fit_transform(df_train2.values)

Wall time: 6.49 s


In [35]:
%time X_test_ngram_c = cvt.transform(df_test2.values)
X_test_ngram_t = tvt.transform(df_test2.values)

Wall time: 2.16 s


In [36]:
X_train_ngram_c.shape, X_train_ngram_t.shape

((95319, 60252), (95319, 60252))

In [37]:
y = train_df['user_id'].values

In [59]:
%%time
cvt = CountVectorizer()
tvt = TfidfVectorizer()
X_train_ngram_c2 = cvt.fit_transform(df2.values)
X_train_ngram_t2 = tvt.fit_transform(df2.values)
X_test_ngram_c2 = cvt.transform(df2_test.values)
X_test_ngram_t2 = tvt.transform(df2_test.values)

Wall time: 10.7 s


In [60]:
X_train_ngram_c2.shape, X_train_ngram_t2.shape

((95319, 9710), (95319, 9710))

In [61]:
%%time
cvt = CountVectorizer()#, analyzer=u'char')
tvt = TfidfVectorizer()#, analyzer=u'char')
X_train_ngram_c3 = cvt.fit_transform(df3.values)
X_train_ngram_t3 = tvt.fit_transform(df3.values)
X_test_ngram_c3 = cvt.transform(df3_test.values)
X_test_ngram_t3 = tvt.transform(df3_test.values)

Wall time: 6.34 s


In [62]:
X_train_ngram_c3.shape, X_train_ngram_t3.shape

((95319, 144), (95319, 144))

In [63]:
from scipy.sparse import hstack
X_train_new_c = hstack((X_train_ngram_c, X_train_ngram_c2, X_train_ngram_c3))
X_train_new_t = hstack((X_train_ngram_t, X_train_ngram_t2, X_train_ngram_t3))
X_train_new_c.shape, X_train_new_t.shape

((95319, 70106), (95319, 70106))

**обучим те же класификаторы что и до этого с преобразованием CountVectorizer**

In [111]:
sgd_log_new_c, sgd_svm_new_c = fit_SGD(X_train_new_c, y, 'cvt ngramm + dominate names')

fit...
Wall time: 55.8 s
Wall time: 37.1 s
predict...
cvt ngramm + dominate names
('log', 0.2470625262274444)
('svm', 0.23080151070079732)


**обучим те же класификаторы что и до этого с преобразованием TfidfVectorizer**

In [110]:
sgd_log_new_t, sgd_svm_new_t = fit_SGD(X_train_new_t, y, 'tvt ngramm + dominate names')

fit...
Wall time: 46.7 s
Wall time: 35.3 s
predict...
tvt ngramm + dominate names
('log', 0.23569730032172331)
('svm', 0.30661630997342287)


In [112]:
from scipy.sparse import hstack
X_train_new2_c = hstack((X_train_ngram_c, X_train_ngram_c2))
X_train_new2_t = hstack((X_train_ngram_t, X_train_ngram_t2))
X_train_new_c.shape, X_train_new_t.shape

((95319, 70106), (95319, 70106))

**обучим те же класификаторы что и до этого с преобразованием CountVectorizer**

In [113]:
sgd_log_new2_c, sgd_svm_new2_c = fit_SGD(X_train_new2_c, y, 'cvt ngramm + 1 dominate names')

fit...
Wall time: 50.2 s
Wall time: 37.3 s
predict...
cvt ngramm + 1 dominate names
('log', 0.26409288012309412)
('svm', 0.24422996223248006)


**обучим те же класификаторы что и до этого с преобразованием TfidfVectorizer**

In [114]:
sgd_log_new2_t, sgd_svm_new2_t = fit_SGD(X_train_new2_t, y, 'tvt ngramm + 1 dominate names')

fit...
Wall time: 58 s
Wall time: 43.3 s
predict...
tvt ngramm + 1 dominate names
('log', 0.24674779689467058)
('svm', 0.31095258078052873)


**добавим временные признаки к ngramm**

In [103]:
start_hour_encod.shape, day_of_week_encod.shape

((136496, 17), (136496, 7))

In [104]:
start_hour_train = start_hour_encod.tocsr()[: 95319, :]
start_hour_test = start_hour_encod.tocsr()[95319 :, :]
day_of_week_train = day_of_week_encod.tocsr()[: 95319, :]
day_of_week_test = day_of_week_encod.tocsr()[95319 :, :]

In [105]:

X_train_time_c = hstack((X_train_ngram_c, start_hour_train, day_of_week_train))
X_train_time_t = hstack((X_train_ngram_t, start_hour_train, day_of_week_train))
X_train_time_c.shape, X_train_time_t.shape

((95319, 60276), (95319, 60276))

**обучим те же класификаторы что и до этого с преобразованием CountVectorizer**

In [108]:
sgd_log_time, sgd_svm_time = fit_SGD(X_train_time_c, y, 'cvt ngramm + time')

fit...
Wall time: 41.3 s
Wall time: 32.8 s
predict...
cvt ngramm + time
('log', 0.35798713106728214)
('svm', 0.33676038606798153)


**обучим те же класификаторы что и до этого с преобразованием TfidfVectorizer**

In [109]:
sgd_log_time2, sgd_svm_time2 = fit_SGD(X_train_time_t, y, 'tvt ngramm + time')

fit...
Wall time: 46.3 s
Wall time: 35.5 s
predict...
tvt ngramm + time
('log', 0.25884739124353057)
('svm', 0.36071478528465517)


In [82]:
X_train_new = hstack((X_train_sparse, X_train_ngram))
X_test_new = hstack((X_test_sparse, X_test_ngram))

In [20]:
from sklearn.model_selection import GridSearchCV

In [21]:
parameters = {'penalty': ('l2', 'l1'), 
              'alpha': [0.00015, 0.00012, 0.00008, 0.0001]
              }           
parameters = {'alpha': [0.00007, 0.00006, 0.00009, 0.00012, 0.00008, 0.0001]
              }           

In [22]:
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold, GridSearchCV
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=7)

In [23]:
clf = GridSearchCV(sgd_svm, parameters, cv =skf)

In [24]:
%time clf.fit(X_train, y_train)

Wall time: 19min 49s


GridSearchCV(cv=StratifiedKFold(n_splits=4, random_state=7, shuffle=True),
       error_score='raise',
       estimator=SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', n_iter=5, n_jobs=-1,
       penalty='l2', power_t=0.5, random_state=7, shuffle=True, verbose=0,
       warm_start=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'penalty': ('l2', 'l1'), 'alpha': [0.00015, 0.00012, 8e-05, 0.0001]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [25]:
clf.best_score_, clf.best_params_

(0.2875919847728669, {'alpha': 8e-05, 'penalty': 'l2'})

In [31]:
accuracy_score(y_valid, clf.best_estimator_ .predict(X_valid))

0.30196530983354314

In [53]:
clf.best_score_, clf.best_params_

(0.28089264571437134, {'alpha': 0.0001, 'penalty': 'l2'})

In [32]:
write_answer_to_file('{}  {}'.format(round(accuracy_score(y_valid, logit_valid_pred), 3),
                                    round(accuracy_score(y_valid, svm_valid_pred), 3)),
                     'answer5_2.txt')

**Сделайте прогноз для тестовой выборки с помощью sgd_logit.**

In [90]:
logit_test_pred = sgd_logit.predict(X_test_new)

In [26]:
logit_test_pred = clf.best_estimator_ .predict(X_test_ngram)

In [91]:
def write_to_submission_file(predicted_labels, out_file,
                             target='user_id', index_label="session_id"):
    # turn predictions into data frame and save as csv file
    predicted_df = pd.DataFrame(predicted_labels,
                                index = np.arange(1, predicted_labels.shape[0] + 1),
                                columns=[target])
    predicted_df.to_csv(out_file, index_label=index_label)

In [92]:
write_to_submission_file (logit_test_pred, 'kaggle_data/[YDF&MIPT]_Coursera_Oleg67.csv')

## Пути улучшения
соревнования, в финальный проект (.pdf или .ipynb).
Что можно попробовать:
 - Использовать ранее построенные признаки для улучшения модели (проверить их можно на меньшей выборке по 150 пользователям – это быстрее)
 - Настроить параметры моделей (например, коэффициенты регуляризации)
 - Если позволяют мощности (или хватает терпения), можно попробовать смешивание (блендинг) ответов бустинга и линейной модели. [Вот](http://mlwave.com/kaggle-ensembling-guide/) один из самых известных тьюториалов по смешиванию ответов алгоритмов


In [41]:
df_train1 = train_df[['site1', 'site2', 'site3', 
                                     'site4','site5', 
                                     'site6','site7', 'site8', 
                                     'site9', 'site10']].fillna(0).astype('str')

In [None]:
from sklearn.cluster import DBSCAN
clu = DBSCAN(eps=0.6, min_samples=500 )
%time clus = clu.fit_predict(matrix_time)

In [None]:
np.unique(clus, return_counts =True)

In [45]:
from sklearn.mixture import BayesianGaussianMixture
BGM = BayesianGaussianMixture(n_components =20)
%time BGM.fit(df_train1.values)

Wall time: 3min 1s




BayesianGaussianMixture(covariance_prior=None, covariance_type='full',
            degrees_of_freedom_prior=None, init_params='kmeans',
            max_iter=100, mean_precision_prior=None, mean_prior=None,
            n_components=20, n_init=1, random_state=None, reg_covar=1e-06,
            tol=0.001, verbose=0, verbose_interval=10, warm_start=False,
            weight_concentration_prior=None,
            weight_concentration_prior_type='dirichlet_process')

In [57]:
lables = BGM.predict(df_train1.values)

In [58]:
np.unique(lables, return_counts =True)

(array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
        17, 18, 19], dtype=int64),
 array([12802,  3332,  3239,  8405,  2428,  3098,  2490,  5426,  2067,
         3414,  2800,  6209, 16916,  2866,  5041,  2243,  2183,  2292,
         5109,  2959], dtype=int64))