**Idea of the current approach:** If artists are played within the same sessions, those artists are similar.

1. For each artist, find the neighbours across all sessions in which that artist appeared. Each artist is thus mapped to a sequence of its neighbours over all sessions.

2. The neighbour sequence can be converted:
    - into a vector where each session neighbour gets a one and all other artists get zeros \[binary\];
    - into a vector where each session neighbour gets a value equal to the number of sessions that the artist and that neighbour shared, and all other artists get zeros;
    - into a TF-IDF vector.

3. **However**, because the number of artists is very large, this yields a sparse matrix of size $N_{persons} \times N_{persons}$, which is too high-dimensional to work with directly. The proposal is to apply [dimensionality reduction](https://ru.wikipedia.org/wiki/Снижение_размерности) techniques such as SVD, PCA, etc.

4. Each artist now corresponds to a dense (as in `dense`, not `sparse`) vector that carries information about co-occurrence with other artists within sessions. This vector is the artist's **vector representation (embedding)**.

5. Artist similarity is defined as the closeness of their embeddings. Closeness measures:
    - cosine distance;
    - Jaccard index.

6. ??? Validation of the similar-artist search on test data
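
The steps above can be sketched end to end on toy data. This is only an illustration under stated assumptions: the sessions are made up, `toy_sessions`, `toy_embeddings` and `cosine_similarity` are hypothetical names, and a full `numpy` SVD truncated to `k` components stands in for the `TruncatedSVD` used later in the notebook.

```python
import numpy as np

# Made-up sessions: each list holds the artist ids played in one session.
# Artists 0-2 always appear together, and so do artists 3-5.
toy_sessions = [[0, 1, 2]] * 3 + [[3, 4, 5]] * 2
n_artists = 6

# Steps 1-2: counts[i, j] = number of sessions shared by artists i and j.
counts = np.zeros((n_artists, n_artists))
for session in toy_sessions:
    for a in session:
        for b in session:
            if a != b:
                counts[a, b] += 1

# Steps 3-4: keep the k strongest SVD components as dense embeddings
# (a stand-in for sklearn's TruncatedSVD on this tiny dense matrix).
k = 2
u, s, _ = np.linalg.svd(counts)
toy_embeddings = u[:, :k] * s[:k]


# Step 5: similarity of two artists = cosine similarity of their embeddings.
def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


sim_same_cluster = cosine_similarity(toy_embeddings[0], toy_embeddings[1])
sim_other_cluster = cosine_similarity(toy_embeddings[0], toy_embeddings[3])
```

On this toy input, artists from the same group end up with a similarity close to 1 and artists from different groups with a similarity close to 0. Real data is far less clean, so the choice of `k` and of the vector encoding matters much more there.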

# Import

In [1]:
import numpy as np
import pandas as pd
from tqdm import tqdm

import sys
sys.path.append('../')

In [2]:
from src.config import (
    path_users_temp,
    path_persons_temp,
    path_sessions_train,
    path_sessions_test,
    path_sessions_val
)

# Reading the files

### User data

In [3]:
users = pd.read_csv(path_users_temp)
print(users.shape)
users.head()

(45167, 9)


Unnamed: 0,user_id,timestamp,age,gender,country,playcount,playlists,user_name,subscribertype
0,1,1116715959,24.0,f,US,221012.0,2.0,000123,base
1,2,1163123792,39.0,m,CZ,217535.0,9.0,000333,base
2,3,1184426573,,f,,49733.0,2.0,00elen,base
3,4,1123157597,32.0,m,DE,168054.0,2.0,00Eraser00,base
4,5,1171302116,23.0,m,UK,45700.0,2.0,00fieldsy,base


### Artist data

In [4]:
# artists
persons = pd.read_csv(path_persons_temp)
print(persons.shape)
persons.head()

# for a fixed "person_id", the "person_name" field can have several variants
# example with "David Guetta"
# we will use this table to decode person_id
persons[persons['person_id'].isin([227])]

(595049, 3)


Unnamed: 0,person_id,person_name,person_MBID
4673,227,David+Guetta+&+Nicky+Romero,
12079,227,"David+Guetta,+Sam+Martin",1bb1eec6-88c3-4028-8920-a985c4b9081a
17306,227,David+Guetta+ft.+Chris+Willis,1bb1eec6-88c3-4028-8920-a985c4b9081a
31403,227,David+Guetta+-+Ne-Yo+-+Kelly+Rowland,1bb1eec6-88c3-4028-8920-a985c4b9081a
38019,227,David+Guetta+&+Chris+Willis+Feat.+Fergie+&+LMFAO,1bb1eec6-88c3-4028-8920-a985c4b9081a
...,...,...,...
570347,227,David+Guetta+&+Alesso+feat.+Tegan+&+Sara,
572905,227,David+Guetta+ft.+Kelly+Rowland,1bb1eec6-88c3-4028-8920-a985c4b9081a
578055,227,David+Guetta+Feat+Lil+Wayne+&+Chris+Brown,1bb1eec6-88c3-4028-8920-a985c4b9081a
588531,227,David+Guetta+&+Glowinthedark+feat.+Harrison+Shaw,


### User session data

In [5]:
%%time
sessions_train = pd.read_csv(path_sessions_train)
sessions_test = pd.read_csv(path_sessions_test)
# sessions_val = pd.read_csv(path_sessions_val)

CPU times: user 3.92 s, sys: 638 ms, total: 4.56 s
Wall time: 6.45 s


In [6]:
print(sessions_train.shape)
sessions_train.head()

(6549461, 10)


Unnamed: 0,session_id,timestamp,playtime,numtracks,user_id,track_id,track_playratio,playcount,person_id,numpersons
0,12,1405519516,5202,25,41504,1210840,1.0,353.0,154295,4
1,12,1405519516,5202,25,41504,1210840,1.0,94.0,154295,4
2,12,1405519516,5202,25,41504,1210766,1.0,1093.0,154295,4
3,12,1405519516,5202,25,41504,1210626,1.0,328.0,154295,4
4,12,1405519516,5202,25,41504,1210759,1.0,4.0,154295,4


In [7]:
print(sessions_test.shape)
sessions_test.head()

(1403506, 10)


Unnamed: 0,session_id,timestamp,playtime,numtracks,user_id,track_id,track_playratio,playcount,person_id,numpersons
0,20,1418303964,7654,31,41504,3488336,1.0,415.0,435780,4
1,20,1418303964,7654,31,41504,3488271,1.0,2395.0,435780,4
2,20,1418303964,7654,31,41504,3488314,1.0,1448.0,435780,4
3,20,1418303964,7654,31,41504,3488330,0.81,1806.0,435780,4
4,20,1418303964,7654,31,41504,1975962,1.46,76.0,247876,4


In [8]:
def gather_session_persons(sessions_df):
    return sessions_df.groupby(['session_id'])['person_id'].unique()# .apply(np.array)


def get_person_session_neighbours(sessions_df):
    # gather_session_persons
    session_neighbours = gather_session_persons(sessions_df)
    # separate_person
    session_neighbours = [
        (person, session[session != person])
        for session in session_neighbours
        for person in session
    ]
    session_neighbours = pd.DataFrame(
        session_neighbours,
        columns=['person_id', 'session_neighbours']
    )
    # drop sessions that have no neighbours left
    session_neighbours = session_neighbours[
        session_neighbours['session_neighbours'].apply(len) != 0
    ]
    
    return session_neighbours


def join_person_neighbours(session_neighbours):
    df_lst = []
    for i, person in enumerate(tqdm(session_neighbours['person_id'].unique())):
        mask_person = session_neighbours['person_id'].isin([person])
        neighbours = np.concatenate(
            session_neighbours.loc[mask_person, 'session_neighbours'].values
        )
        df_lst.append((person, neighbours))
    session_neighbours = pd.DataFrame(
        df_lst,
        columns=['person_id', 'session_neighbours']
    )
    
    return session_neighbours
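
`join_person_neighbours` builds a boolean mask over the whole frame for every artist. A possible single-pass alternative is sketched below. It has not been benchmarked on this dataset, `join_neighbours_single_pass` is a hypothetical helper, and it assumes the `(person_id, neighbours)` pairs fit in memory.

```python
from collections import defaultdict

import numpy as np


def join_neighbours_single_pass(pairs):
    """Concatenate neighbour arrays per artist in one pass over the pairs.

    pairs: iterable of (person_id, array-like of neighbour ids).
    Returns a dict person_id -> 1-D numpy array of all neighbours.
    """
    buckets = defaultdict(list)
    for person, neighbours in pairs:
        buckets[person].append(np.asarray(neighbours))
    return {person: np.concatenate(chunks) for person, chunks in buckets.items()}


toy_pairs = [(1, [2, 3]), (2, [1, 3]), (1, [4])]
joined = join_neighbours_single_pass(toy_pairs)
```

With the notebook's frame, the pairs could be fed in as `zip(session_neighbours['person_id'], session_neighbours['session_neighbours'])`.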

In [9]:
%%time
session_neighbours = get_person_session_neighbours(sessions_train)

CPU times: user 29.9 s, sys: 2.32 s, total: 32.2 s
Wall time: 31.9 s


In [10]:
session_neighbours.head()

Unnamed: 0,person_id,session_neighbours
0,154295,"[288626, 341684, 325050]"
1,288626,"[154295, 341684, 325050]"
2,341684,"[154295, 288626, 325050]"
3,325050,"[154295, 288626, 341684]"
4,134615,"[28445, 377440, 303680]"


In [11]:
session_neighbours = join_person_neighbours(session_neighbours)

100%|██████████| 5911/5911 [00:27<00:00, 212.17it/s]


In [12]:
session_neighbours.head()

Unnamed: 0,person_id,session_neighbours
0,154295,"[288626, 341684, 325050, 46425, 390636, 360406..."
1,288626,"[154295, 341684, 325050, 17514, 203180, 291223..."
2,341684,"[154295, 288626, 325050, 318259, 383294, 11601..."
3,325050,"[154295, 288626, 341684, 296457, 33937, 264248..."
4,134615,"[28445, 377440, 303680, 232667, 145566, 458471..."
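
The three encodings listed in the approach (binary, counts, TF-IDF) can be compared on a tiny example. This is a sketch with made-up ids. The TF-IDF variant shown (`tf` = raw count, `idf = log(N / df) + 1`, no normalisation) is an assumption chosen for brevity and is not identical to the defaults of `TfidfVectorizer`.

```python
import numpy as np

# Hypothetical neighbour sequences for three artists (ids are made up).
toy_neighbours = {
    10: np.array([20, 30, 20, 20]),
    20: np.array([10, 30, 10, 10]),
    30: np.array([10, 20]),
}
vocabulary = sorted(toy_neighbours)                 # column order: [10, 20, 30]
column_of = {pid: j for j, pid in enumerate(vocabulary)}

# Count encoding: how many times each neighbour occurs in the sequence.
count_matrix = np.zeros((len(vocabulary), len(vocabulary)))
for i, pid in enumerate(vocabulary):
    ids, cnt = np.unique(toy_neighbours[pid], return_counts=True)
    for neighbour, c in zip(ids, cnt):
        count_matrix[i, column_of[neighbour]] = c

# Binary encoding: 1 if the artist was ever a neighbour, 0 otherwise.
binary_matrix = (count_matrix > 0).astype(float)

# Simplified TF-IDF: down-weight neighbours that co-occur with many artists.
n_rows = count_matrix.shape[0]
doc_freq = binary_matrix.sum(axis=0)                # non-zero for every column here
tfidf_matrix = count_matrix * (np.log(n_rows / doc_freq) + 1.0)
```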


In [13]:
%%time

from sklearn.pipeline import Pipeline

from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

from sklearn.decomposition import TruncatedSVD

from sklearn.preprocessing import MinMaxScaler

from scipy.spatial.distance import cdist, cosine

def dict_unique_counter(arr):
    return dict(zip(*np.unique(arr, return_counts=True)))


def get_features_unique(input_series):
    return input_series.apply(np.unique)


def get_features_dict_counter(input_series):
    return input_series.apply(dict_unique_counter)


class DataframeFunctionTransformer():
    def __init__(self, func):
        self.func = func

    def transform(self, input_series, **transform_params):
        return self.func(input_series)

    def fit(self, X, y=None, **fit_params):
        return self

# pipe_embedding
pipeline_embedding = Pipeline([
    ('get_features_dict_counter', DataframeFunctionTransformer(get_features_dict_counter)),
    ('vectorizer', DictVectorizer()),
#     ('min_max_scaler', MinMaxScaler()),
    ('tsvd', TruncatedSVD(n_components=100))
])

# apply the pipeline to the input dataframe
X_tsvd = pipeline_embedding.fit_transform(session_neighbours['session_neighbours'])

embeddings = pd.concat([
    session_neighbours['person_id'],
    pd.DataFrame(X_tsvd)
], axis=1).set_index('person_id')

CPU times: user 23.6 s, sys: 1.07 s, total: 24.7 s
Wall time: 16 s


In [14]:
embeddings

person_id,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
154295,4949.699412,-1750.532505,-648.187752,1474.375876,-209.351247,-761.835271,179.374223,363.606450,-98.846362,308.307823,...,17.565058,34.499740,77.430516,136.984935,13.797780,46.308876,-91.865726,-3.053490,9.590776,-6.720554
288626,82.715366,-41.403656,53.267186,8.886980,26.156459,-14.720249,-22.338997,57.682953,54.214461,3.254450,...,3.244507,1.641206,-1.774092,-7.211024,0.830528,-3.173485,-9.240561,-1.447317,-12.668585,-12.698976
341684,41.936810,-13.121896,-6.118134,1.901221,26.894174,-6.804689,-1.830265,23.198585,21.147444,27.898749,...,-0.224087,-6.975390,-2.463210,3.125271,1.297435,-0.244111,-1.410001,1.228939,0.514455,3.316155
325050,13.821278,-0.392727,-6.010899,4.230200,13.715804,-1.689387,-3.955677,11.847167,11.034525,17.633932,...,0.784181,-3.527830,-1.079744,0.637190,1.332873,-2.358532,-1.991598,0.769981,-0.078114,1.378448
134615,1457.369544,-508.200270,395.422965,-156.180770,-37.106836,-238.249289,221.957981,-17.815713,193.668008,-115.163174,...,-7.512580,-17.194230,1.270385,29.104769,6.063966,12.585002,-21.414331,21.913996,14.984256,7.974710
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
248012,2.603932,-3.303726,-0.945596,-3.403730,-0.733316,0.245777,0.393385,1.353973,0.585685,-0.163009,...,-0.513576,1.064326,-0.066586,-0.220023,0.267288,0.452300,0.172706,0.283347,0.374657,0.116384
94389,0.608153,-0.579995,-0.351880,-0.287543,-0.115787,-0.064446,-0.023173,-0.039208,-0.089390,0.190402,...,-0.265560,-0.223689,-0.135595,0.016479,-0.017147,0.194216,0.287763,-0.099811,0.263073,0.426325
182689,0.462265,-0.300386,0.440709,-0.466143,0.229269,0.149709,0.193954,-0.155520,0.505380,-0.083902,...,-0.185281,0.058959,-0.086513,-0.213944,0.100478,-0.096681,-0.013594,0.259404,-0.247643,-0.286601
143872,2.580737,-1.653577,-0.465707,0.978933,0.453157,0.399190,0.984348,-0.293290,0.913649,-0.058016,...,-0.204675,-0.193787,-0.416844,-0.253220,-0.132588,0.057706,0.435706,0.590833,0.132004,-0.191056


In [14]:
sessions_test

Unnamed: 0,session_id,timestamp,playtime,numtracks,user_id,track_id,track_playratio,playcount,person_id,numpersons
0,20,1418303964,7654,31,41504,3488336,1.00,415.0,435780,4
1,20,1418303964,7654,31,41504,3488271,1.00,2395.0,435780,4
2,20,1418303964,7654,31,41504,3488314,1.00,1448.0,435780,4
3,20,1418303964,7654,31,41504,3488330,0.81,1806.0,435780,4
4,20,1418303964,7654,31,41504,1975962,1.46,76.0,247876,4
...,...,...,...,...,...,...,...,...,...,...
1403501,2764392,1418637043,7526,28,39896,2228905,1.02,1136.0,278529,12
1403502,2764392,1418637043,7526,28,39896,2228905,1.02,11.0,278529,12
1403503,2764392,1418637043,7526,28,39896,1298797,1.03,532.0,164418,12
1403504,2764392,1418637043,7526,28,39896,2552183,1.02,10.0,320151,12


In [20]:
def split_session_one(session):
    return [(session[session != person], person) for person in session]


class FeatureExtractTransformer():
    def __init__(self):
        return None

    def transform(self, sessions_df, **transform_params):
        sessions_gathered = gather_session_persons(sessions_df)
        sessions_splited = sessions_gathered.apply(split_session_one)
        sessions_splited = sessions_splited.explode()
        sessions_splited = pd.DataFrame(
            sessions_splited.tolist(),
            columns=['X', 'y'],
            index=sessions_splited.index
        )#.reset_index()
        
        return sessions_splited['X'], sessions_splited['y']

    def fit(self, X, y=None, **fit_params):
        return self


pipeline_data_process = Pipeline([
    ('feature_extract_transformer', FeatureExtractTransformer())
])
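
The leave-one-out split performed by `split_session_one` can be illustrated on a single toy session. The sketch below uses a hypothetical `leave_one_out_pairs` helper and assumes the ids within a session are unique, as they are after `groupby(...).unique()`.

```python
import numpy as np


def leave_one_out_pairs(session):
    """For each artist in the session, return (remaining artists, held-out artist)."""
    session = np.asarray(session)
    return [(session[session != person], person) for person in session]


toy_pairs_loo = leave_one_out_pairs([11, 12, 13])
# Each pair is (X, y): X is the visible part of the session, y is the artist to predict.
```

A session with `n` distinct artists therefore contributes `n` evaluation examples.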

In [70]:
X, y = pipeline_data_process.fit_transform(sessions_test) # .head(1000)
X, y = X.head(1000), y.head(1000)

In [181]:
from sklearn.base import BaseEstimator, ClassifierMixin


def hit_rate(y, y_pred):
    hits = [
        int(y_el in y_pred_el)
        for y_el, y_pred_el in zip(y, y_pred)
    ]
    return sum(hits) / len(hits)


def search_nn(embeddings, base_nearest_search, num_top_elements=10):
    dist = cdist(embeddings, base_nearest_search, metric='cosine')
    dist = dist.mean(axis=1)
    dist = pd.Series(
        dist,
        index=embeddings.index
    )
    dist = dist.drop(index=base_nearest_search.index).sort_values()
    top_n = dist.index.values[:num_top_elements]
    return top_n


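class TopNSearch(BaseEstimator, ClassifierMixin):
    """Base class: subclasses implement predict(); score() reports the hit rate."""
    def __init__(self, embeddings, num_top_elements):
        self.embeddings = embeddings
        self.num_top_elements = num_top_elements
    def fit(self, X, y=None):
        # nothing to learn; return self to follow the scikit-learn convention
        return self
    def predict(self, X):
        raise NotImplementedError
    def score(self, X, y):
        y_pred = self.predict(X)
        return hit_rate(y, y_pred)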


class TopNRandom(TopNSearch):
    def __init__(self, embeddings, num_top_elements, weights=None, random_seed=None):
        super().__init__(embeddings, num_top_elements)
        self.weights = weights
        self.random_seed = random_seed
    def predict(self, X):
        np.random.seed(self.random_seed)
        top_n = np.array([
            self.embeddings.sample(self.num_top_elements, weights=self.weights).index.values
            for _ in tqdm(X)
        ])
        return top_n

    
class NearestNeighboursSearch(TopNSearch):
    def __init__(self, embeddings, num_top_elements):
        super().__init__(embeddings, num_top_elements)
    def predict(self, X):
        y_pred = []
        for x in tqdm(X):
            base_nearest_search = self.embeddings.loc[x]
            top_n = search_nn(
                self.embeddings,
                base_nearest_search,
                self.num_top_elements
            )
            y_pred.append(top_n)
        return np.array(y_pred)
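
The evaluation logic, the mean cosine distance to the visible artists followed by a hit rate over the top-N lists, can be checked by hand on a toy example. This is a simplified `numpy`-only sketch. `mean_cosine_top_n` and `hit_rate_at_n` are hypothetical stand-ins for `search_nn` and `hit_rate`, and the embeddings are made up.

```python
import numpy as np


def hit_rate_at_n(targets, predictions):
    """Share of examples whose held-out artist appears in its top-N list."""
    hits = [int(t in set(p)) for t, p in zip(targets, predictions)]
    return sum(hits) / len(hits)


def mean_cosine_top_n(emb, ids, query_ids, n):
    """Rank artists by mean cosine distance to the query artists.

    emb: (num_artists, dim) array; ids: list of artist ids aligned with emb rows.
    The query artists themselves are excluded from the result.
    """
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    query_rows = [ids.index(q) for q in query_ids]
    dist = 1.0 - unit @ unit[query_rows].T      # cosine distance to every query artist
    order = np.argsort(dist.mean(axis=1), kind='stable')
    return [ids[i] for i in order if ids[i] not in query_ids][:n]


toy_ids = [1, 2, 3, 4]
toy_emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])

toy_predictions = [
    mean_cosine_top_n(toy_emb, toy_ids, [1], n=1),   # nearest to artist 1
    mean_cosine_top_n(toy_emb, toy_ids, [3], n=1),   # nearest to artist 3
]
toy_score = hit_rate_at_n([2, 1], toy_predictions)
```

Here the first held-out artist is recovered and the second is not, so the hit rate is 0.5.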

In [182]:
num_top_elements = 10
weights = sessions_train.groupby(['person_id'])['playcount'].sum()
weights = weights.loc[embeddings.index]


random_search = TopNRandom(
    embeddings,
    num_top_elements,
    random_seed=42
)
weighted_random_search = TopNRandom(
    embeddings,
    num_top_elements,
    weights,
    random_seed=42
)
nn_search = NearestNeighboursSearch(
    embeddings,
    num_top_elements
)

In [73]:
# X = sessions_splited_test['X'].head(1000)
# y = sessions_splited_test['y'].head(1000)

In [74]:
random_search.score(X, y)

100%|██████████| 1000/1000 [00:00<00:00, 3584.12it/s]


0.0

In [75]:
weighted_random_search.score(X, y)

100%|██████████| 1000/1000 [00:01<00:00, 724.12it/s]


0.003

In [183]:
nn_search.score(X, y)

100%|██████████| 1000/1000 [00:06<00:00, 166.16it/s]


0.107

In [165]:
def add_user_column(X, session_user_mapping):
    X_ext = pd.merge(
        X.reset_index(),
        session_user_mapping,
        how='left',
        on=['session_id']
    )
    return X_ext


def search_history_freq(x, user, user_person):
    mask = (
        user_person['user_id'].isin([user])
        & (~user_person['person_id'].isin(x))
    )
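    # rename_axis + reset_index(name=...) yields the same two columns
    # ('person_id', 'freq') regardless of how the installed pandas version
    # names the output of value_counts()
    historical_person_freq = (user_person
        .loc[mask, 'person_id']
        .value_counts(normalize=True)
        .rename_axis('person_id')
        .reset_index(name='freq')
    )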
    return historical_person_freq
    

class UserPersonFreqSearch(TopNSearch):
    def __init__(self, embeddings, sessions_train, sessions_test, num_top_elements):
        self.embeddings = embeddings
        self.user_person = sessions_train[['user_id', 'person_id']]
        self.session_user_mapping = sessions_test[['session_id', 'user_id']].drop_duplicates()
        self.num_top_elements = num_top_elements
    def fit(self, X, y):
        pass  
    def predict(self, X):
        random_case_cnt = 0
        X_ext = add_user_column(X, self.session_user_mapping)
        y_pred = []
        for _, (session_id, x, user) in tqdm(X_ext.iterrows()):
            historical_person_freq = search_history_freq(x, user, self.user_person)
            if len(historical_person_freq) == 0:
                top_n = self.embeddings.sample(self.num_top_elements).index.values
                random_case_cnt += 1
            else:
                n = self.num_top_elements
                if self.num_top_elements > len(historical_person_freq):
                    n = len(historical_person_freq)
                top_n = historical_person_freq['person_id'].sample(
                    n=n,
                    weights=historical_person_freq['freq']
                ).values
            y_pred.append(top_n)
        print(f'num of random generated top_n: {random_case_cnt}')
        return np.array(y_pred)
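
The history-frequency lookup behind this baseline can be tried on a small frame. The sketch below uses made-up data and a hypothetical `toy_history_freq` helper. It computes, for one user, the relative listening frequency of every artist that is not already in the current session.

```python
import pandas as pd

# Made-up listening history: one row per (user, artist) play.
toy_history = pd.DataFrame({
    'user_id':   [1, 1, 1, 1, 2],
    'person_id': [7, 7, 8, 9, 7],
})


def toy_history_freq(session_persons, user, user_person):
    """Relative frequency of the user's past artists, excluding the session's artists."""
    mask = (
        (user_person['user_id'] == user)
        & ~user_person['person_id'].isin(session_persons)
    )
    return (user_person
        .loc[mask, 'person_id']
        .value_counts(normalize=True)
        .rename_axis('person_id')
        .reset_index(name='freq')
    )


# User 1 is currently listening to artist 9, so only artists 7 and 8 remain.
toy_freq = toy_history_freq([9], 1, toy_history)
```

The resulting frequencies can then serve as sampling weights, as in `UserPersonFreqSearch.predict`.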

In [166]:
upf = UserPersonFreqSearch(embeddings, sessions_train, sessions_test, num_top_elements)

In [167]:
upf.score(X, y)

1000it [01:13, 13.62it/s]

num of random generated top_n: 94





0.241

In [184]:
class UserPersonFreqSearchPlus(TopNSearch):
    def __init__(self, embeddings, sessions_train, sessions_test, num_top_elements):
        self.embeddings = embeddings
        self.user_person = sessions_train[['user_id', 'person_id']]
        self.session_user_mapping = sessions_test[['session_id', 'user_id']].drop_duplicates()
        self.num_top_elements = num_top_elements
    def fit(self, X, y):
        pass  
    def predict(self, X):
        random_case_cnt = 0
        X_ext = add_user_column(X, self.session_user_mapping)
        y_pred = []
        for _, (session_id, x, user) in tqdm(X_ext.iterrows()):
            historical_person_freq = search_history_freq(x, user, self.user_person)
            if len(historical_person_freq) == 0:
                base_nearest_search = self.embeddings.loc[x]
                top_n = search_nn(
                    self.embeddings,
                    base_nearest_search,
                    self.num_top_elements
                )
                random_case_cnt += 1
            else:
                n = self.num_top_elements
                if self.num_top_elements > len(historical_person_freq):
                    n = len(historical_person_freq)
                top_n = historical_person_freq['person_id'].sample(
                    n=n,
                    weights=historical_person_freq['freq']
                ).values
            y_pred.append(top_n)
        # counts sessions without user history, which fall back to nearest-neighbour search
        print(f'num of random generated top_n: {random_case_cnt}')
        return np.array(y_pred)

In [185]:
upf_plus_search = UserPersonFreqSearchPlus(embeddings, sessions_train, sessions_test, num_top_elements)

In [186]:
upf_plus_search.score(X, y)

1000it [01:09, 14.32it/s]

num of random generated top_n: 94





0.264

In [27]:
pipeline_evaluation = Pipeline([
    ('nn_search', NearestNeighboursSearch(embeddings, num_top_elements))
])

In [30]:
pipeline_evaluation.score(X, y)

100%|██████████| 1000/1000 [00:05<00:00, 176.73it/s]


0.047

In [39]:
# NOTE: despite the variable name, the slice 1:51 selects 50 ids
top_20 = dist['person_id'].iloc[1:51].values

In [40]:
persons[persons['person_id'].isin(top_20)].head(60)

Unnamed: 0,person_id,person_name,person_MBID
5,120605,Drake+&+Coldplay,
158,75441,Soulja+Boy+Tell%60em,
2061,4807,50+Cent+feat.+Snoop+Dogg+&+Pre,8e68819d-71be-4e7d-b41d-f1df81b01d3f
3184,148708,Fat+Joe+&+50+Cent,
5677,463801,Juvenile,3d7cc904-86e6-4f65-b1fd-d7a3e8bfefa3
7841,49227,Birdman+&+Lil%27+Wayne,ab794c44-fc35-48c9-a1c1-817bb54e070d
8259,47600,Big+Sean,942a9807-9c1a-4a0e-a285-1fde2c5be9d1
9086,208682,Ludacris+&+Shawnna,0638ba22-040f-438d-83a5-9b670c4adaf5
10646,173505,Gucci+Mane+&+Tity+Boi,
10855,173505,Gucci+Mane+&+Chief+Keef,


In [2]:
# session_neighbours = pd.read_csv('../data/temp/session_neighbours_500.csv', chunksize=1000)

In [None]:
# df_lst = []
# for df in tqdm(session_neighbours):
#     df['neighbours_dct'] = df['session_neighbours'].str.split(' ').apply(dict_unique_counter)
#     df_lst.append(df[['person_id', 'neighbours_dct']])
# #     break

79it [02:52,  1.61it/s]

In [11]:
# vectorizer = CountVectorizer(max_features=10000)

# X = vectorizer.fit_transform(session_neighbours['session_neighbours'])

# X

# print(vectorizer.get_feature_names())

# print(X.toarray())

In [None]:
# session_neighbours['session_neighbours'] = session_neighbours['session_neighbours'].str.split(' ')

In [None]:
df_lst = []
for _, (person, neighbours) in tqdm(session_neighbours.iterrows()):
    df_lst.append(
        (person, dict_unique_counter(neighbours.split(' ')))
    )
#     break

49940it [04:35, 11.22s/it]  

In [13]:
df_lst

[(332749,
  {'100758': 1,
   '100889': 1,
   '100904': 2,
   '101590': 1,
   '101611': 1,
   '101774': 1,
   '101903': 2,
   '102300': 1,
   '102352': 1,
   '102835': 1,
   '102923': 12,
   '103018': 1,
   '103260': 2,
   '103498': 1,
   '103673': 1,
   '103871': 7,
   '104139': 7,
   '104207': 1,
   '104243': 1,
   '104521': 4,
   '104676': 4,
   '104885': 4,
   '105144': 1,
   '105256': 3,
   '10617': 1,
   '106222': 2,
   '106237': 1,
   '106350': 1,
   '107039': 3,
   '107103': 2,
   '107168': 1,
   '107364': 1,
   '10764': 1,
   '107743': 1,
   '107750': 1,
   '107846': 2,
   '107920': 1,
   '108114': 4,
   '108228': 2,
   '108267': 5,
   '108734': 1,
   '108742': 3,
   '108766': 1,
   '109145': 4,
   '109210': 1,
   '109301': 1,
   '109433': 8,
   '109444': 7,
   '109595': 1,
   '109597': 1,
   '109784': 1,
   '109854': 2,
   '11005': 1,
   '110264': 4,
   '110578': 1,
   '110729': 1,
   '110792': 1,
   '110855': 6,
   '11092': 1,
   '111203': 1,
   '111262': 2,
   '111494': 1,
 

In [4]:
print(session_neighbours.shape)
session_neighbours.head()

(190037, 2)


Unnamed: 0,person_id,session_neighbours
0,332749,288070 124041 11423 442798 291362 52354 70915 ...
1,288070,332749 269833 5399 88671 109097 313066 147287 ...
2,154295,288626 341684 325050 46425 390636 360406 10326...
3,288626,154295 341684 325050 17514 203180 100758 29122...
4,341684,154295 288626 325050 84554 219840 318259 21984...


In [5]:
# session_neighbours.loc[0, 'session_neighbours']

'288070 124041 11423 442798 291362 52354 70915 245488 417177 399353 50321 359758 455930 399329 42218 21950 249900 332244 286859 110855 159300 159611 62678 84244 161015 114966 258705 371043 196132 287752 67354 369381 338261 27165 441362 452099 262896 121972 165405 70070 106222 55207 368250 155863 379822 234386 227562 434646 392962 86376 354892 73357 291837 360509 204114 364910 199934 85842 12631 62568 363805 348947 364597 386855 163656 353623 318475 214743 398072 52023 55136 258927 82359 62444 448656 34180 452823 47307 284448 221307 208657 108267 203935 328397 174073 273651 415453 3452 449356 304049 297016 117020 433041 277494 211094 323879 83974 413521 461101 235831 57272 74545 47552 383822 159844 42456 385714 460815 49971 279548 431811 55438 444371 41200 273573 283744 449220 242068 149679 76404 304591 267416 433523 14125 114283 411805 47154 253248 371765 236107 393820 42032 400168 74637 261027 314219 8102 303270 349103 362541 154377 305019 203487 273495 426900 123386 40656 233878 2048

In [None]:
# %%time
# session_neighbours['neighbours_counts'] = session_neighbours['session_neighbours'].apply(dict_unique_counter)

In [55]:
df_lst = []
for _, (person, neighbours) in tqdm(session_neighbours.iterrows()):
    neighbours_counts = dict_unique_counter(neighbours)
    df_lst.append((person, neighbours_counts))
#     break

100it [00:01, 79.74it/s]


In [54]:
session_neighbours = pd.DataFrame(
    df_lst,
    columns=['person_id', 'neighbours_counts']
)


In [39]:
from sklearn.feature_extraction import DictVectorizer
v = DictVectorizer(sparse=False)
D = [{11: 1, 12: 2}, {11: 3, 13: 1}]
X = v.fit_transform(D)
X

array([[1., 2., 0.],
       [3., 0., 1.]])

# DRAFT

In [64]:
# def session_split(session, train_size=0.5):
#     session_length = len(session)
#     split_index = int(np.ceil(session_length * train_size))
    
#     return session[:split_index], session[split_index:]

# sessions_splited_test = pd.DataFrame(
#     sessions_gathered_test.apply(session_split).tolist(),
#     columns=['X', 'y'],
#     index=sessions_gathered_test.index
# )#.reset_index()

# sessions_splited_test

In [None]:
# from itertools import permutations

# def split_session_with_permutation(session, num_train):
#     session_length = len(session)
    
#     mask = np.zeros(session_length, dtype=bool)
#     mask[:num_train] = True
#     mask_permutations = np.array(list(set(permutations(mask))))
    
#     session_splits = [(session[mask], session[~mask]) for mask in mask_permutations]
    
#     return session_splits


# def split_session_one(session):
#     session_length = len(session)
#     num_train = session_length - 1
#     session_splits = split_session_with_permutation(session, num_train)
    
#     return session_splits



In [None]:
# %%time
# # not enough computing power
# session_neighbours = (session_neighbours#.head(1000)
#     .groupby(['person_id'], as_index=False)
#     .agg({'session_neighbours' : lambda x: list(np.concatenate(x.values))})
# )

In [16]:
# 
sessions_extend[sessions_extend['playtime'].isin([-1])]

Unnamed: 0,session_id,timestamp,playtime,numtracks,user_id,track_id,track_playratio,playcount,person_id
4,1873088,1406217037,-1,1,23183,838286,,212.0,107103
19,361604,1400260299,-1,1,40718,838286,,212.0,107103
40,2700519,1405970598,-1,1,24690,838286,,212.0,107103
58,846,1411031704,-1,1,40433,838286,,212.0,107103
82,2604837,1410191003,-1,1,377,838286,,212.0,107103
...,...,...,...,...,...,...,...,...,...
30732660,2704365,1392347333,-1,1,25560,366126,,4.0,45178
30732664,373369,1408739136,-1,1,40091,4789796,1.0,,552076
30732704,2751633,1408490773,-1,1,36788,3517368,,166.0,439634
30732765,1885736,1408391273,-1,1,26603,2749494,,112.0,344789


In [17]:
sessions_extend[sessions_extend['track_playratio'].isin([np.nan])]

Unnamed: 0,session_id,timestamp,playtime,numtracks,user_id,track_id,track_playratio,playcount,person_id
0,287144,1390231051,4547,23,44361,4698874,,,142266
4,1873088,1406217037,-1,1,23183,838286,,212.0,107103
9,1591983,1421070443,6740,28,25592,838286,,212.0,107103
10,1591217,1399022203,3066,14,25458,838286,,212.0,107103
19,361604,1400260299,-1,1,40718,838286,,212.0,107103
...,...,...,...,...,...,...,...,...,...
30732763,1885731,1407078600,749,4,26603,3465265,,259.0,432719
30732765,1885736,1408391273,-1,1,26603,2749494,,112.0,344789
30732766,1885735,1408300690,-1,1,26603,3208862,,1297.0,397734
30732769,2480038,1407956764,940,5,33058,2014661,,1466.0,252514


In [25]:
sessions_extend['person_id'].nunique()

560926

In [18]:
(sessions_extend['numtracks'] >= 5).sum() / sessions_extend.shape[0]

0.9114987875489852

In [29]:
user_person_statistics = (sessions_extend
    .groupby(['user_id', 'person_id'], as_index=False)
    ['track_playratio'].median()
#     .agg({
#         'track_playratio':['count', 'sum', 'mean', 'median'],
#     })
)
user_person_statistics

Unnamed: 0,user_id,person_id,track_playratio
0,1,11467,1.00
1,1,11617,1.00
2,1,19627,1.01
3,1,28510,0.53
4,1,42218,1.43
...,...,...,...
6305840,45175,361085,2.18
6305841,45175,390280,1.03
6305842,45175,426475,0.88
6305843,45175,438476,


In [30]:
user_person_statistics.columns

Index(['user_id', 'person_id', 'track_playratio'], dtype='object')

In [32]:
from surprise import Dataset
# Load the movielens-100k dataset (download it if needed),
data = Dataset.load_builtin('ml-100k')

Dataset ml-100k could not be found. Do you want to download it? [Y/n] y
Trying to download dataset from http://files.grouplens.org/datasets/movielens/ml-100k.zip...
Done! Dataset ml-100k has been saved to /home/user/.surprise_data/ml-100k


In [31]:
from surprise import SVD

# Use the famous SVD algorithm.
svd = SVD()

In [20]:
users_listen_count = (sessions_extend
    .groupby(['user_id'], as_index=False)['track_playratio']
    .count()
    .rename(columns={'track_playratio':'listen_count'})
)
users_listen_count

Unnamed: 0,user_id,listen_count
0,1,404
1,2,672
2,3,1821
3,4,1013
4,5,173
...,...,...
45170,45171,174
45171,45172,16
45172,45173,364
45173,45174,351


In [24]:
(users_listen_count['listen_count'] >= 50).sum() / users_listen_count.shape[0]

0.8976646375207527