            +------------+     +---------------+     +-----------+
            | user_data  |     | post_text_df  |     | feed_data |
            +------------+     +---------------+     +-----------+
            | age        |     | id            |     | timestamp |
            | city       |     | text          |     | user_id   |
            | country    |     | topic         |     | post_id   |
            | exp_group  |     +---------------+     | action    |
            | gender     |           7,023           | target    |
            | id         |                           +-----------+
            | os         |                             76,892,800
            | source     |
            +------------+
                163,205 

# Data loading

In [75]:
import pandas as pd
from sqlalchemy import create_engine
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px

In [76]:
engine = create_engine(
    "postgresql://robot-startml-ro:pheiph0hahj1Vaif@"
    "postgres.lab.karpov.courses:6432/startml"
)


In [77]:
# Read the user_data table
query = "SELECT * FROM user_data"
user_data = pd.read_sql(query, engine)

# Read the post_text_df table
query = "SELECT * FROM post_text_df"
post_text_df = pd.read_sql(query, engine)

# Read a limited number of rows from the feed_data table
query = "SELECT * FROM feed_data LIMIT 100000"
feed_data = pd.read_sql(query, engine)
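
Since `feed_data` has about 76.9 million rows, only a `LIMIT`ed slice is read above. If a larger sample is ever needed, one option is to stream the result in chunks. The sketch below is only an illustration: it uses an in-memory SQLite table with made-up rows as a stand-in for the real Postgres database, and the chunk size of 2 is arbitrary.

```python
import sqlite3
import pandas as pd

# Stand-in database: an in-memory SQLite table shaped like feed_data (toy rows).
conn = sqlite3.connect(":memory:")
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3],
    "post_id": [10, 11, 10, 12, 11],
    "target": [0, 1, 0, 0, 1],
})
toy.to_sql("feed_data", conn, index=False)

# With chunksize set, read_sql returns an iterator of DataFrames
# instead of materialising the whole result at once.
chunks = pd.read_sql(
    "SELECT * FROM feed_data ORDER BY user_id, post_id",
    conn,
    chunksize=2,
)
feed_sample = pd.concat(chunks, ignore_index=True)
conn.close()

print(feed_sample.shape)
```

The `ORDER BY` is there because a bare `LIMIT` does not guarantee which rows come back.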

# Data preprocessing


## One-hot encoding

In [78]:
user_data

Unnamed: 0,user_id,gender,age,country,city,exp_group,os,source
0,200,1,34,Russia,Degtyarsk,3,Android,ads
1,201,0,37,Russia,Abakan,0,Android,ads
2,202,1,17,Russia,Smolensk,4,Android,ads
3,203,0,18,Russia,Moscow,1,iOS,ads
4,204,0,36,Russia,Anzhero-Sudzhensk,3,Android,ads
...,...,...,...,...,...,...,...,...
163200,168548,0,36,Russia,Kaliningrad,4,Android,organic
163201,168549,0,18,Russia,Tula,2,Android,organic
163202,168550,1,41,Russia,Yekaterinburg,4,Android,organic
163203,168551,0,38,Russia,Moscow,3,iOS,organic


In [79]:
user_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 163205 entries, 0 to 163204
Data columns (total 8 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   user_id    163205 non-null  int64 
 1   gender     163205 non-null  int64 
 2   age        163205 non-null  int64 
 3   country    163205 non-null  object
 4   city       163205 non-null  object
 5   exp_group  163205 non-null  int64 
 6   os         163205 non-null  object
 7   source     163205 non-null  object
dtypes: int64(4), object(4)
memory usage: 10.0+ MB


We will one-hot encode `country`, `city`, `os`, and `source`, because they are categorical variables.

In [80]:
categorical_columns = ['country', 'city', 'os', 'source']
user_data_encoded = pd.get_dummies(user_data, columns=categorical_columns)

In [81]:
numeric_columns = ['user_id', 'gender', 'age']

Should we add `user_id` to the numerical features?
No: `user_id` is an identifier, not a numerical quantity, so it behaves like a categorical feature. We will come back to it later.

In [82]:
user_data_encoded.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 163205 entries, 0 to 163204
Columns: 3934 entries, user_id to source_organic
dtypes: int64(4), uint8(3930)
memory usage: 616.7 MB


The `age` feature is numerical, so we normalize it to the [0, 1] range with `MinMaxScaler`.

In [83]:
from sklearn.preprocessing import MinMaxScaler
# Normalize the 'age' column using MinMaxScaler
scaler = MinMaxScaler()
user_data_encoded['age'] = scaler.fit_transform(user_data_encoded['age'].values.reshape(-1, 1))

print(user_data_encoded)

        user_id  gender       age  exp_group  country_Azerbaijan  \
0           200       1  0.246914          3                   0   
1           201       0  0.283951          0                   0   
2           202       1  0.037037          4                   0   
3           203       0  0.049383          1                   0   
4           204       0  0.271605          3                   0   
...         ...     ...       ...        ...                 ...   
163200   168548       0  0.271605          4                   0   
163201   168549       0  0.049383          2                   0   
163202   168550       1  0.333333          4                   0   
163203   168551       0  0.296296          3                   0   
163204   168552       1  0.024691          4                   0   

        country_Belarus  country_Cyprus  country_Estonia  country_Finland  \
0                     0               0                0                0   
1                     0      
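
As a sanity check on the scaled values, min-max scaling can be reproduced by hand as `(x - min) / (max - min)`. The sketch below uses a toy series; the bounds 14 and 95 are an assumption inferred from the printed values (for example, 34 maps to 0.246914), not something stated in the data description.

```python
import pandas as pd

# Toy ages; 14 and 95 are assumed to be the minimum and maximum age.
ages = pd.Series([14, 17, 34, 95], name="age")

# Min-max scaling: shift by the minimum, divide by the range.
scaled = (ages - ages.min()) / (ages.max() - ages.min())

print(scaled.round(6).tolist())  # [0.0, 0.037037, 0.246914, 1.0]
```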

With one-hot encoding, the number of columns rose to 3,934, which is a lot. Memory usage increased roughly 60x (from 10.0+ MB to 616.7 MB, as reported by `info()`). We will use PCA to reduce the dimensionality of the data.
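
The PCA step is not shown in this notebook, so the following is only a minimal sketch of the idea on a toy one-hot matrix. It computes PCA with a plain NumPy SVD; the toy data, the choice of `k = 2`, and all variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy categorical data standing in for the city / os columns.
toy = pd.DataFrame({
    "city": ["Moscow", "Tula", "Moscow", "Abakan", "Tula", "Moscow"],
    "os": ["iOS", "Android", "Android", "Android", "iOS", "iOS"],
})
X = pd.get_dummies(toy).to_numpy(dtype=float)  # 6 rows x 5 one-hot columns

# PCA via SVD: centre the columns, then project onto the top-k right singular vectors.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2
X_reduced = Xc @ Vt[:k].T

# Share of total variance kept by the first k components.
explained = float((S[:k] ** 2).sum() / (S ** 2).sum())
print(X_reduced.shape, round(explained, 3))
```

On the real 3,934-column matrix, a sparse-friendly decomposition would probably be preferable to a dense SVD, but that choice is outside what the notebook shows.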

In [84]:
post_text_df

Unnamed: 0,post_id,text,topic
0,1,UK economy facing major risks\n\nThe UK manufa...,business
1,2,Aids and climate top Davos agenda\n\nClimate c...,business
2,3,Asian quake hits European shares\n\nShares in ...,business
3,4,India power shares jump on debut\n\nShares in ...,business
4,5,Lacroix label bought by US firm\n\nLuxury good...,business
...,...,...,...
7018,7315,"OK, I would not normally watch a Farrelly brot...",movie
7019,7316,I give this movie 2 stars purely because of it...,movie
7020,7317,I cant believe this film was allowed to be mad...,movie
7021,7318,The version I saw of this film was the Blockbu...,movie


In [85]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
tfidf_matrix = vectorizer.fit_transform(post_text_df['text'])
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())

print(tfidf_df)

           000  000m        10       100  100m   11        12  120        13  \
0     0.000000   0.0  0.000000  0.000000   0.0  0.0  0.053308  0.0  0.000000   
1     0.161013   0.0  0.000000  0.050808   0.0  0.0  0.047776  0.0  0.000000   
2     0.063193   0.0  0.055471  0.000000   0.0  0.0  0.000000  0.0  0.000000   
3     0.000000   0.0  0.000000  0.000000   0.0  0.0  0.000000  0.0  0.077277   
4     0.000000   0.0  0.000000  0.000000   0.0  0.0  0.000000  0.0  0.000000   
...        ...   ...       ...       ...   ...  ...       ...  ...       ...   
7018  0.000000   0.0  0.000000  0.000000   0.0  0.0  0.000000  0.0  0.000000   
7019  0.000000   0.0  0.000000  0.000000   0.0  0.0  0.000000  0.0  0.000000   
7020  0.000000   0.0  0.107996  0.000000   0.0  0.0  0.000000  0.0  0.000000   
7021  0.000000   0.0  0.000000  0.000000   0.0  0.0  0.000000  0.0  0.000000   
7022  0.000000   0.0  0.000000  0.000000   0.0  0.0  0.000000  0.0  0.000000   

       14  ...  youth  youve  yugansk  
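
To make the numbers in this matrix less opaque, here is a small pure-Python sketch of how a TF-IDF weight is formed, assuming the defaults that `TfidfVectorizer` uses: raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation of each row. The three toy documents and the whitespace tokeniser are simplifications (no stop-word removal, no `max_features`).

```python
import math
from collections import Counter

docs = ["shares jump on debut", "shares fall", "film debut"]
tokenized = [d.lower().split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
n_docs = len(docs)

# Document frequency: in how many documents each term appears.
df = Counter(t for doc in tokenized for t in set(doc))

# Smoothed inverse document frequency.
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_row(tokens):
    """Return the L2-normalised TF-IDF vector of one document over vocab."""
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm for w in raw]

matrix = [tfidf_row(doc) for doc in tokenized]

# In "shares fall", the rarer word "fall" outweighs the more common "shares".
print(dict(zip(vocab, (round(w, 3) for w in matrix[1]))))
```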

# Collaborative filtering

### User-based Collaborative Filtering

### Timestamp conversion

In [86]:
# Convert the 'timestamp' column to datetime objects
feed_data['timestamp'] = pd.to_datetime(feed_data['timestamp'])

# Extract the desired components from the 'timestamp' column
feed_data['year'] = feed_data['timestamp'].dt.year
feed_data['month'] = feed_data['timestamp'].dt.month
feed_data['day'] = feed_data['timestamp'].dt.day
feed_data['day_of_week'] = feed_data['timestamp'].dt.dayofweek
feed_data['hour'] = feed_data['timestamp'].dt.hour
feed_data['minute'] = feed_data['timestamp'].dt.minute

# Drop the original 'timestamp' column
feed_data = feed_data.drop('timestamp', axis=1)

print(feed_data.head())

   user_id  post_id action  target  year  month  day  day_of_week  hour  \
0   161123     5908   view       0  2021     11   15            0    10   
1   161123     4505   view       0  2021     11   15            0    10   
2   161123     3464   view       1  2021     11   15            0    10   
3   161123     3464   like       0  2021     11   15            0    10   
4   161123     4160   view       0  2021     11   15            0    10   

   minute  
0      19  
1      21  
2      24  
3      26  
4      26  


In [87]:
# Combine 'hour' and 'minute' into a single numeric column: minute of the day (0-1439)
feed_data['minute_of_day'] = feed_data['hour'] * 60 + feed_data['minute']

# Drop the 'hour' and 'minute' columns
feed_data = feed_data.drop(['hour', 'minute'], axis=1)

print(feed_data.head())


   user_id  post_id action  target  year  month  day  day_of_week  \
0   161123     5908   view       0  2021     11   15            0   
1   161123     4505   view       0  2021     11   15            0   
2   161123     3464   view       1  2021     11   15            0   
3   161123     3464   like       0  2021     11   15            0   
4   161123     4160   view       0  2021     11   15            0   

   minute_of_day  
0            619  
1            621  
2            624  
3            626  
4            626  


In [88]:
# One-hot encode the 'action' column
one_hot = pd.get_dummies(feed_data['action'])

# Concatenate the one-hot encoded columns with the original DataFrame
feed_data = pd.concat([feed_data, one_hot], axis=1)

# Drop the original 'action' column
feed_data = feed_data.drop('action', axis=1)

print(feed_data.head())


   user_id  post_id  target  year  month  day  day_of_week  minute_of_day  \
0   161123     5908       0  2021     11   15            0            619   
1   161123     4505       0  2021     11   15            0            621   
2   161123     3464       1  2021     11   15            0            624   
3   161123     3464       0  2021     11   15            0            626   
4   161123     4160       0  2021     11   15            0            626   

   like  view  
0     0     1  
1     0     1  
2     0     1  
3     1     0  
4     0     1  


In [89]:
# Build a datetime from the 'year', 'month' and 'day' columns
date = pd.to_datetime(feed_data[['year', 'month', 'day']])

# Extract the day of the year
feed_data['day_of_year'] = date.dt.dayofyear

# Drop the original 'month' and 'day' columns ('year' is kept)
feed_data = feed_data.drop(['month', 'day'], axis=1)

print(feed_data.head())


   user_id  post_id  target  year  day_of_week  minute_of_day  like  view  \
0   161123     5908       0  2021            0            619     0     1   
1   161123     4505       0  2021            0            621     0     1   
2   161123     3464       1  2021            0            624     0     1   
3   161123     3464       0  2021            0            626     1     0   
4   161123     4160       0  2021            0            626     0     1   

   day_of_year  
0          319  
1          319  
2          319  
3          319  
4          319  
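
The derived time features can be checked against a single timestamp. The sketch below uses the date and time of the first printed row (2021-11-15, 10:19); the seconds are not visible in the output and are assumed to be zero.

```python
import pandas as pd

# First row of feed_data as printed above; seconds are an assumption.
ts = pd.Timestamp("2021-11-15 10:19:00")

features = {
    "day_of_week": ts.dayofweek,               # Monday == 0
    "day_of_year": ts.dayofyear,               # 1-based day within the year
    "minute_of_day": ts.hour * 60 + ts.minute, # minutes since midnight
}
print(features)  # {'day_of_week': 0, 'day_of_year': 319, 'minute_of_day': 619}
```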


In [90]:
# Let's divide the data into train and test sets
train = feed_data.iloc[:-20000].copy()
test = feed_data.iloc[-20000:].copy()

In [91]:
pivot = train.pivot_table(index='post_id', columns='user_id', values='target')
corrs = pivot.corr()

corrs

user_id,11549,11550,11551,11552,11553,11554,11555,11556,11557,11558,...,166342,166343,166344,166345,166346,166347,166348,166349,166350,166351
11549,1.000000,1.000000,0.244444,0.218218,-0.142857,,-0.188982,-0.100219,-0.146667,-0.236598,...,-0.133631,0.385543,,-0.101956,-0.080000,,-0.166667,0.038018,-0.161848,-0.149071
11550,1.000000,1.000000,-0.197814,-0.142857,,0.019087,,-0.134840,-0.316228,1.000000,...,,-0.192394,0.498646,-0.167332,-0.123443,-0.258199,,0.149786,-0.072836,-0.108465
11551,0.244444,-0.197814,1.000000,,,-0.184968,0.107833,0.146886,-0.205883,-0.187500,...,-0.080064,-0.094333,-0.127491,-0.201228,0.208587,0.362933,0.534586,-0.125000,0.186290,
11552,0.218218,-0.142857,,1.000000,,0.019231,,-0.096374,0.106311,-0.365148,...,-0.294174,-0.230089,-0.254824,,0.126179,-0.258199,-0.461538,0.412691,0.167228,-0.175412
11553,-0.142857,,,,1.000000,-0.090909,,,-0.316228,,...,,,,-0.142857,,,-0.100000,,-0.071429,1.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
166347,,-0.258199,0.362933,-0.258199,,,,,-0.066667,,...,-0.707107,0.101130,-0.169031,0.150828,-1.000000,1.000000,-0.258199,-0.222222,-0.286972,-0.196116
166348,-0.166667,,0.534586,-0.461538,-0.100000,-0.166667,0.409091,0.020064,0.005436,,...,0.566667,0.153897,-0.032359,-0.031083,-0.096236,-0.258199,1.000000,-0.085214,-0.147669,
166349,0.038018,0.149786,-0.125000,0.412691,,-0.022243,-0.218218,-0.089087,0.423077,0.177667,...,0.443203,-0.104828,-0.170213,-0.157472,,-0.222222,-0.085214,1.000000,0.079279,0.327968
166350,-0.161848,-0.072836,0.186290,0.167228,-0.071429,0.216802,-0.247121,0.074795,-0.184542,0.090930,...,-0.068073,-0.029891,0.180948,0.049245,0.203817,-0.286972,-0.147669,0.079279,1.000000,-0.066593
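
Several off-diagonal entries above are exactly 1.0 or -1.0, which may simply mean that the two users share very few posts. The toy sketch below shows the effect and one possible guard, the `min_periods` argument of `DataFrame.corr`; the data and the threshold of 3 are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy interactions: user 2 has seen only two of the three posts.
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3, 3, 3],
    "post_id": [10, 11, 12, 10, 11, 10, 11, 12],
    "target":  [1, 0, 0, 1, 0, 1, 0, 1],
})
pivot = toy.pivot_table(index="post_id", columns="user_id", values="target")

# Default: any pair with overlapping posts gets a correlation,
# so two shared posts are enough to produce a perfect 1.0.
loose = pivot.corr()

# Require at least 3 shared posts; pairs with less overlap become NaN.
strict = pivot.corr(min_periods=3)

print(loose.loc[1, 2], strict.loc[1, 2], strict.loc[1, 3])
```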


In [92]:
corrs = (
    corrs
    .stack()
    .rename_axis(['userId1', 'userId2'])
    .reset_index()
)

corrs.columns = ['userId1', 'userId2', 'corr']

In [94]:
corrs = corrs[corrs['corr'] >= 0]

corrs

Unnamed: 0,userId1,userId2,corr
0,11549,11549,1.000000
1,11549,11550,1.000000
2,11549,11551,0.244444
3,11549,11552,0.218218
9,11549,11562,0.238964
...,...,...,...
23256,166351,166343,0.051852
23257,166351,166344,0.217407
23258,166351,166346,0.250161
23260,166351,166349,0.327968


In [99]:
### For each user in the test set,
### find all "neighbours":
### users who viewed the same posts
### as the test user

preds = []

### Users seen in train, computed once instead of on every iteration
train_users = set(train['user_id'].unique())

for user in test['user_id'].unique():
    
    ### If the user did not appear in train,
    ### this approach cannot produce a prediction
    
    if user in train_users:
        part = test[test['user_id']==user]

        ### Select this user's neighbours
        
        neighbours = corrs[corrs['userId1']==user]
        neighbours_users = neighbours['userId2'].unique()
        
        ### If there are no neighbours, there is nothing to predict from,
        ### except the user's own mean over posts,
        ### which would be a very crude fallback
        
        if neighbours_users.shape[0]==0:
            continue
        
        ### Select the posts that need a prediction
        
        posts_ = part['post_id'].unique()

        ### Select the neighbours' rows from train
        
        train_part = train[train['user_id'].isin(neighbours_users)]
        
        ### Compute each neighbour's mean target
        
        neighbours_means = train_part.groupby('user_id')['target'].mean()
        
        ### Attach this information and compute
        ### the remaining terms of the prediction formula
        ### for the neighbours and posts
        ### for which a prediction is possible
        
        train_part = train_part[train_part['post_id'].isin(posts_)]
        train_part = pd.merge(train_part,
                              neighbours[['userId2', 'corr']],
                              right_on='userId2',
                              left_on='user_id',
                              how='left')
        
        train_part['neighbour_mean'] = train_part['userId2'].map(neighbours_means)
        train_part['diff'] = train_part['target'] - train_part['neighbour_mean']
        train_part['diff_dot_corr'] = train_part['diff'] * train_part['corr']
        
        ### Compute the user's own mean target
    
        user_mean = train[train['user_id']==user]['target'].mean()
        
        ### Apply the prediction formula
        
        upper_part = train_part.groupby('post_id')['diff_dot_corr'].sum()
        lower_part = train_part.groupby('post_id')['corr'].sum()
        
        predictions = upper_part / lower_part + user_mean
        predictions = predictions.reset_index()
        predictions.columns = ['post_id', 'prediction']
        predictions['user_id'] = user
        
        preds.append(predictions)
        
preds = pd.concat(preds)

preds = pd.merge(preds,
            test[['user_id', 'post_id', 'target']],
            on=['user_id', 'post_id'],
            how='left'
            )
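
Read as a formula, the loop above appears to implement the usual mean-centred, correlation-weighted neighbour average. The notation is introduced here for illustration only: $r_{v,i}$ is the `target` of neighbour $v$ on post $i$, $\bar r_v$ is that neighbour's mean `target` in train, $\bar r_u$ is the test user's mean `target` in train, and $N(u,i)$ is the set of neighbours with non-negative correlation who have train rows for post $i$.

```latex
\hat r_{u,i} \;=\; \bar r_u \;+\;
\frac{\sum_{v \in N(u,i)} \operatorname{corr}(u,v)\,\bigl(r_{v,i} - \bar r_v\bigr)}
     {\sum_{v \in N(u,i)} \operatorname{corr}(u,v)}
```

Two details follow from the code rather than from the formula: the sums run over train rows, so a neighbour with several rows for the same post is counted once per row, and the result is not clipped to [0, 1], so a prediction can be negative.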
        
        


In [100]:
print(f"""Could make predictions for only {preds.shape[0]} 
          item-user pairs out of {test.shape[0]} test pairs""")

preds

Could make predictions for only 644 
          item-user pairs out of 20000 test pairs


Unnamed: 0,post_id,prediction,user_id,target
0,142,-0.111521,154490,0
1,187,-0.134921,154490,0
2,218,-0.111521,154490,0
3,228,-0.103412,154490,0
4,257,0.166439,154490,0
...,...,...,...,...
639,5044,0.016031,97968,0
640,5189,0.016031,97968,0
641,5392,0.016031,97968,0
642,7180,0.016031,97968,0


In [102]:
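### Compute DSG@2 (presumably DCG@2) at least for these pairs!
### Note: standard DCG divides each relevance by log2(rank + 1) with 1-based ranks.
### The expression below multiplies by log2(index + 1) with a 0-based index,
### so the top-ranked item gets weight log2(1) = 0.
### The value printed below comes from this expression as written.
import numpy as np

users_dsgs = []

for user in preds['user_id'].unique():
    part = preds[preds['user_id']==user]
    part = part.sort_values('prediction', ascending=False)
    part = part.reset_index()
    user_dsg2 = (np.log2(part.index+1) * part.target)[:2].sum()
    
    users_dsgs.append(user_dsg2)

print(f"Mean DSG@2 across test users: {np.mean(users_dsgs)}")

Mean DSG@2 across test users: 0.25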
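
For comparison, the textbook discounted cumulative gain at cutoff k is the sum of `rel_i / log2(i + 1)` over ranks `i = 1..k`. The helper below is a minimal sketch of that definition; the function name `dcg_at_k` is hypothetical and it is not the computation that produced the number above.

```python
import numpy as np

def dcg_at_k(relevances, k):
    """DCG@k for relevances already sorted by predicted score, best first."""
    rel = np.asarray(relevances, dtype=float)[:k]
    ranks = np.arange(1, rel.size + 1)  # 1-based ranks
    return float((rel / np.log2(ranks + 1)).sum())

# A hit at rank 1 counts fully; a hit at rank 2 is discounted by log2(3).
print(dcg_at_k([1, 0], 2))
print(dcg_at_k([0, 1], 2))
```

Applied per user to `part.target` after sorting by `prediction`, this would give a value that is comparable with DCG figures reported elsewhere.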
