# EDA

This is an exploratory data analysis (EDA) of the [KuaiRec](https://kuairec.com/) dataset, presented in the August 2022 paper [KuaiRec: A Fully-observed Dataset and Insights for Evaluating Recommender Systems](https://arxiv.org/pdf/2202.10842.pdf).

The goal is to run a preliminary analysis of every provided dataset in order to choose the most relevant preprocessing steps and model for our final implementation.

Here are the questions I will try to answer:
* General questions:
    - Data types ✓
    - Missing values / Inconsistency ✓
    - Features in each dataset ✓
    - Size ✓
    - Duplicates ✓

* User Activity / Features:
    - Who are the unique users? ✓
    - What is the number of interactions per user? ✓
    - What is the frequency of interactions? Is there any periodic behavior?
    - Which users are famous (lots of followers)?
    - How do interactions differ between engaged and non-engaged users?
    - How do interactions differ between video authors and non-authors?
    - Do specific users interact with specific types of content?
    - What is the average number of videos each user interacts with over time?
    - Are there groups of users who tend to interact with the same videos?

* Video Characteristics:
    - What are the types of videos?
    - What are the unique videos? ✓
    - Which videos are famous? ✓
    - Is there a correlation between engagement and duration?
    - Are there any patterns between the number of views and the level of engagement?
    - How long does a video keep having an impact?
    - Are there temporal trends in interactions?

## Imports

In [None]:
# Import libraries
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

# set plot size
plt.rcParams["figure.figsize"] = (20, 13)
%matplotlib inline
%config InlineBackend.figure_format = "retina"

## Dataset

According to the paper, the two interaction matrices have the following properties:

| Matrix | #Users | #Items | #Interactions | Density  |
|--------|--------|--------|---------------|----------|
| Small  | 1,411  | 3,327  | 4,676,570     | 99.6%    |
| Big    | 7,176  | 10,728 | 12,530,806    | 16.3%    |
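
The density column can be re-derived from the other three columns as interactions divided by (users × items). The sketch below does that arithmetic, assuming the first row of the table describes the small matrix and the second row the big matrix; the `density` helper is only illustrative.

```python
# Figures copied from the paper's table above
matrices = {
    "small": {"users": 1_411, "items": 3_327, "interactions": 4_676_570},
    "big": {"users": 7_176, "items": 10_728, "interactions": 12_530_806},
}


def density(users: int, items: int, interactions: int) -> float:
    """Share of the user x item grid that holds an observed interaction."""
    return interactions / (users * items)


densities = {name: density(**stats) for name, stats in matrices.items()}
for name, value in densities.items():
    print(f"{name} matrix density: {value:.1%}")
```

Both values round to the percentages reported in the table.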

---

* User features: each user has 30 features, including 12 explicit features and 18 encrypted vectors.

* Item features: each video has at least 1 and at most 4 tags out of 31 tags in total, e.g., {Sports}.
Each item also has 56 explicit features, 45 of which are daily statistics.

* Social network: in the small matrix, 146 users have friends.


In [None]:
%%bash
wget --no-check-certificate 'https://drive.usercontent.google.com/download?id=1qe5hOSBxzIuxBb1G_Ih5X-O65QElollE&export=download&confirm=t&uuid=b2002093-cc6e-4bd5-be47-9603f0b33470' -O KuaiRec.zip
unzip KuaiRec.zip -d data_final_project

## Interactions (big matrix)

This is our training dataset.

The training set is known to be sparse: only a small share of all possible user-video pairs have an interaction.

In [None]:
interactions_train = pd.read_csv("data_final_project/KuaiRec 2.0/data/big_matrix.csv")
interactions_train.head()

In [None]:
interactions_train.shape

In [None]:
interactions_train.describe()

In [None]:
interactions_test = pd.read_csv("data_final_project/KuaiRec 2.0/data/small_matrix.csv")
interactions_test.head()

In [None]:
interactions_test.shape

### General questions

#### DTypes

In [None]:
print(interactions_train.dtypes)
print(interactions_test.dtypes)
print(interactions_train["date"].head())
print(interactions_test["date"].head())  # TODO: convert to datetime
print(interactions_train["time"].head())
print(interactions_test["time"].head())  # TODO: convert to datetime
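
The datetime conversion noted in the comments above could look like the sketch below. It runs on a toy frame and assumes that `date` is stored as a `YYYYMMDD` number and `time` as a timestamp string; both layouts are assumptions to check against the printed heads before applying this to `interactions_train` or `interactions_test`.

```python
import pandas as pd

# Toy rows mimicking the assumed layout of the interaction files
sample = pd.DataFrame({
    "date": [20200705.0, 20200706.0, None],
    "time": ["2020-07-05 00:08:23.438", "2020-07-06 12:30:00.000", None],
})

converted = sample.copy()
# YYYYMMDD numbers -> strings -> datetimes; missing values become NaT
date_as_text = converted["date"].astype("Int64").astype("string")
converted["date"] = pd.to_datetime(date_as_text, format="%Y%m%d", errors="coerce")
converted["time"] = pd.to_datetime(converted["time"], errors="coerce")

print(converted.dtypes)
```

Using `errors="coerce"` keeps rows with missing timestamps as `NaT` instead of raising, which matters if the conversion runs before the `dropna()` step.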

#### Missing values

In [None]:
print(interactions_train.isna().any())
print(interactions_test.isna().any())
print(interactions_test.dropna().shape[0] / interactions_test.shape[0])
interactions_test = interactions_test.dropna()

In [None]:
print(f"{interactions_test.memory_usage(deep=True).sum() / (1024 ** 2)} MB")
print(f"{interactions_train.memory_usage(deep=True).sum() / (1024 ** 2)} MB")

In [None]:
# Duplicates: several rows for the same (user, video) pair
print(interactions_train[interactions_train.duplicated(['user_id', 'video_id'], keep=False)])
print(interactions_test[interactions_test.duplicated(['user_id', 'video_id'], keep=False)])

#### Unique Users

In [None]:
unique_train_users = set(interactions_train["user_id"].unique())
unique_test_users = set(interactions_test["user_id"].unique())
print(f"{len(unique_train_users)} users in training")
print(f"{len(unique_test_users)} users in test")
print(f"{len(unique_train_users & unique_test_users)} users in both")  # data is consistent and matches the paper

#### Unique Videos

In [None]:
unique_train_videos = set(interactions_train["video_id"].unique())
unique_test_videos = set(interactions_test["video_id"].unique())
print(f"{len(unique_train_videos)} videos in training")
print(f"{len(unique_test_videos)} videos in test")
print(f"{len(unique_train_videos & unique_test_videos)} videos in both")  # data is consistent and matches the paper

#### Interaction per user

There is no need to check the interactions per user in the test set since its density is almost 100%.

**Unique interactions**
After removing duplicate interactions (a user replaying the same video), the number of interactions per user follows a roughly Gaussian distribution, with outliers among users who have fewer than 600 interactions.

In [None]:
# Interactions per user with unique videos
nb_interactions_per_user = interactions_train.drop_duplicates(subset=['user_id', 'video_id']) \
    .groupby('user_id')['video_id'] \
    .count() \
    .reset_index(name='total_interactions')
plt.figure(figsize=(10, 6))
sns.histplot(nb_interactions_per_user['total_interactions'], kde=True, bins=30)

plt.xlabel('Number of Interactions per User (unique videos)')
plt.ylabel('Number of Users')
plt.title('Distribution of Total Interactions per User')

plt.show()

**Non-unique interactions**
The plot shows a widely spread distribution: some users have close to 0 interactions whereas others have more than 3000.

* 91.9% have fewer than 3000 interactions
* 0.069% have more than 6000 interactions (high outliers)
* 17.7% have fewer than 500 interactions
* 0.11% have fewer than 200 interactions (low outliers)
* **91.18% have between 200 and 3000 interactions**
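
The shares listed above all follow the same pattern: count the users whose total falls inside a range, then divide by the number of users. A small helper makes that explicit. This is only a sketch on toy counts; `share_between` is a hypothetical name, and it uses strict inequalities on both bounds.

```python
import pandas as pd


def share_between(counts: pd.Series, low: float = float("-inf"),
                  high: float = float("inf")) -> float:
    """Fraction of entries strictly between `low` and `high`."""
    return float(((counts > low) & (counts < high)).mean())


# Toy per-user interaction counts, not taken from KuaiRec
toy_counts = pd.Series([150, 250, 800, 2900, 6500])

print(f"< 200: {share_between(toy_counts, high=200):.1%}")
print(f"(200, 3000): {share_between(toy_counts, 200, 3000):.1%}")
print(f"> 6000: {share_between(toy_counts, low=6000):.1%}")
```

On the real data, `counts` would be the `total_interactions` column of `nb_interactions_per_user`.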

In [None]:
# Interactions per user (replays included)
nb_interactions_per_user = interactions_train.groupby('user_id')['video_id'] \
    .count() \
    .reset_index(name='total_interactions')
print(f"> 6000 interactions: {nb_interactions_per_user[nb_interactions_per_user['total_interactions'] > 6000]['total_interactions'].count() / nb_interactions_per_user['total_interactions'].count()}")
print(f"< 3000 interactions: {nb_interactions_per_user[nb_interactions_per_user['total_interactions'] < 3000]['total_interactions'].count() / nb_interactions_per_user['total_interactions'].count()}")
print(f"< 200 interactions: {nb_interactions_per_user[nb_interactions_per_user['total_interactions'] < 200]['total_interactions'].count() / nb_interactions_per_user['total_interactions'].count()}")
print(f"[200,3000] interactions: {nb_interactions_per_user[(nb_interactions_per_user['total_interactions'] > 200) & (nb_interactions_per_user['total_interactions'] < 3000)]['total_interactions'].count() / nb_interactions_per_user['total_interactions'].count()}")

plt.figure(figsize=(10, 6))
sns.histplot(nb_interactions_per_user['total_interactions'], kde=True, bins=30)

plt.xlabel('Number of Interactions per User')
plt.ylabel('Number of Users')
plt.title('Distribution of Total Interactions per User')

plt.show()

We can see two peaks around 300-500 interactions.

In [None]:
# Interactions in 200 - 3000
nb_interactions_per_user = interactions_train.groupby('user_id')['video_id'] \
    .count() \
    .reset_index(name='total_interactions')
plt.figure(figsize=(10, 6))
sns.histplot(nb_interactions_per_user[(nb_interactions_per_user['total_interactions'] > 200) & (nb_interactions_per_user['total_interactions'] < 3000)]['total_interactions'], kde=True, bins=30)

plt.xlabel('Number of Interactions per User')
plt.ylabel('Number of Users')
plt.title('Distribution of Total Interactions per User')

plt.show()

#### Video interactions

Every video has at least one interaction.

Around 10% of the videos have fewer than 10 interactions. A few videos (3%) have a lot of interactions, more than 5000.

A handful of them (4 videos) skyrocket past 10000 interactions.

* More than 5000 interactions: 2.96%
* Fewer than 5000 interactions: 97.03%
* Fewer than **50 interactions: 28.51%**
* Fewer than **10 interactions: 11.14%**
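
An alternative to the cumulative thresholds above is to split the videos into disjoint buckets with `pd.cut`, so that the shares sum to 100%. The sketch below uses toy counts and bucket labels chosen for illustration; it is not the computation used in this notebook.

```python
import pandas as pd

# Toy per-video interaction counts, not taken from KuaiRec
toy_video_counts = pd.Series([3, 8, 40, 120, 900, 4800, 7000, 12000])

# Right-open buckets: [1, 10), [10, 50), [50, 5000), [5000, 10000), [10000, inf)
edges = [1, 10, 50, 5000, 10000, float("inf")]
labels = ["<10", "10-49", "50-4999", "5000-9999", ">=10000"]
buckets = pd.cut(toy_video_counts, bins=edges, labels=labels, right=False)

# Share of videos in each bucket, kept in bucket order
bucket_share = buckets.value_counts(normalize=True, sort=False)
print(bucket_share)
```

On the real data, the input would be the `total_interactions` column of `nb_interactions_per_video`.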

In [None]:
# Interactions per video
nb_interactions_per_video = interactions_train.groupby('video_id')['user_id'] \
    .count() \
    .reset_index(name='total_interactions')
plt.figure(figsize=(10, 6))
print(f"> 10000 interactions: {nb_interactions_per_video[nb_interactions_per_video['total_interactions'] > 10000]['total_interactions'].count() / nb_interactions_per_video['total_interactions'].count()}")
print(f"> 5000 interactions: {nb_interactions_per_video[nb_interactions_per_video['total_interactions'] > 5000]['total_interactions'].count() / nb_interactions_per_video['total_interactions'].count()}")
print(f"< 5000 interactions: {nb_interactions_per_video[nb_interactions_per_video['total_interactions'] < 5000]['total_interactions'].count() / nb_interactions_per_video['total_interactions'].count()}")
print(f"< 50 interactions: {nb_interactions_per_video[nb_interactions_per_video['total_interactions'] < 50]['total_interactions'].count() / nb_interactions_per_video['total_interactions'].count()}")
print(f"< 10 interactions: {nb_interactions_per_video[nb_interactions_per_video['total_interactions'] < 10]['total_interactions'].count() / nb_interactions_per_video['total_interactions'].count()}")
print(f"0 interactions: {nb_interactions_per_video[nb_interactions_per_video['total_interactions'] == 0]['total_interactions'].count() / nb_interactions_per_video['total_interactions'].count()}")
print(f"[50,5000] interactions: {nb_interactions_per_video[(nb_interactions_per_video['total_interactions'] > 50) & (nb_interactions_per_video['total_interactions'] < 5000)]['total_interactions'].count() / nb_interactions_per_video['total_interactions'].count()}")
print(f"NB videos with more than 10000 interactions: {nb_interactions_per_video[nb_interactions_per_video['total_interactions'] > 10000]['total_interactions'].count()}")

sns.histplot(nb_interactions_per_video['total_interactions'], kde=True, bins=30)

plt.xlabel('Number of Interactions per Video')
plt.ylabel('Number of Videos')
plt.title('Distribution of Total Interactions per Video')

plt.show()

In [None]:
# Interactions per video (unique user-video pairs only)
nb_interactions_per_video = interactions_train.drop_duplicates(subset=['user_id', 'video_id']).groupby('video_id')['user_id'] \
    .count() \
    .reset_index(name='total_interactions')
plt.figure(figsize=(10, 6))
sns.histplot(nb_interactions_per_video['total_interactions'], kde=True, bins=30)

plt.xlabel('Number of Interactions per Video (unique)')
plt.ylabel('Number of Videos')
plt.title('Distribution of Total Interactions per Video')

plt.show()

## User Features

The user dataset contains features describing each user.

This can be useful if we want to take user characteristics into account in our recommendations.

In [94]:
users = pd.read_csv("data_final_project/KuaiRec 2.0/data/user_features.csv")
users

|  | user_id | user_active_degree | is_lowactive_period | is_live_streamer | is_video_author | follow_user_num | follow_user_num_range | fans_user_num | fans_user_num_range | friend_user_num | ... | onehot_feat8 | onehot_feat9 | onehot_feat10 | onehot_feat11 | onehot_feat12 | onehot_feat13 | onehot_feat14 | onehot_feat15 | onehot_feat16 | onehot_feat17 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | high_active | 0 | 0 | 0 | 5 | (0,10] | 0 | 0 | 0 | ... | 184 | 6 | 3 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 1 | full_active | 0 | 0 | 0 | 386 | (250,500] | 4 | [1,10) | 2 | ... | 186 | 6 | 2 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 2 | full_active | 0 | 0 | 0 | 27 | (10,50] | 0 | 0 | 0 | ... | 51 | 2 | 3 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 3 | full_active | 0 | 0 | 0 | 16 | (10,50] | 0 | 0 | 0 | ... | 251 | 3 | 2 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 4 | full_active | 0 | 0 | 0 | 122 | (100,150] | 4 | [1,10) | 0 | ... | 99 | 4 | 2 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7171 | 7171 | full_active | 0 | 0 | 1 | 52 | (50,100] | 1 | [1,10) | 0 | ... | 259 | 1 | 4 | 0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 7172 | 7172 | full_active | 0 | 0 | 0 | 45 | (10,50] | 2 | [1,10) | 2 | ... | 11 | 2 | 0 | 0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 7173 | 7173 | full_active | 0 | 0 | 0 | 615 | 500+ | 3 | [1,10) | 2 | ... | 51 | 2 | 2 | 0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 7174 | 7174 | full_active | 0 | 0 | 0 | 959 | 500+ | 0 | 0 | 0 | ... | 107 | 3 | 2 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [87]:
users.describe()

|  | user_id | is_lowactive_period | is_live_streamer | is_video_author | follow_user_num | fans_user_num | friend_user_num | register_days | onehot_feat0 | onehot_feat1 | ... | onehot_feat8 | onehot_feat9 | onehot_feat10 | onehot_feat11 | onehot_feat12 | onehot_feat13 | onehot_feat14 | onehot_feat15 | onehot_feat16 | onehot_feat17 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 7176.0 | 7176.0 | 7176.0 | 7176.0 | 7176.0 | 7176.0 | 7176.0 | 7176.0 | 7176.0 | 7176.0 | ... | 7176.0 | 7176.0 | 7176.0 | 7176.0 | 7099.0 | 7101.0 | 7101.0 | 7102.0 | 7102.0 | 7102.0 |
| mean | 3587.5 | 0.000418 | 0.006828 | 0.169593 | 197.327899 | 12.553094 | 4.494844 | 296.790691 | 0.39228 | 2.670569 | ... | 168.661511 | 3.83194 | 2.264353 | 0.137124 | 0.298774 | 0.104633 | 0.094775 | 0.018586 | 0.017882 | 0.014503 |
| std | 2071.677098 | 0.020444 | 0.082357 | 0.375301 | 426.543245 | 181.017537 | 44.897861 | 286.38132 | 0.488293 | 1.782502 | ... | 96.254783 | 1.747046 | 1.063131 | 0.500184 | 0.457753 | 0.306102 | 0.292925 | 0.135068 | 0.132533 | 0.11956 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 1793.75 | 0.0 | 0.0 | 0.0 | 9.0 | 0.0 | 0.0 | 132.0 | 0.0 | 1.0 | ... | 88.0 | 3.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 3587.5 | 0.0 | 0.0 | 0.0 | 33.0 | 2.0 | 0.0 | 225.0 | 0.0 | 2.0 | ... | 167.0 | 4.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 5381.25 | 0.0 | 0.0 | 0.0 | 130.0 | 6.0 | 1.0 | 324.0 | 1.0 | 4.0 | ... | 255.0 | 5.0 | 3.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| max | 7175.0 | 1.0 | 1.0 | 1.0 | 2100.0 | 11401.0 | 1425.0 | 2245.0 | 1.0 | 7.0 | ... | 339.0 | 6.0 | 4.0 | 2.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


In [88]:
users.shape

(7176, 31)

### General questions

In [89]:
print(users.dtypes)
print(users["follow_user_num_range"].unique())
print(users["register_days_range"].unique())
print(users["fans_user_num_range"].unique())
print(users["is_live_streamer"].unique())
print(users["is_video_author"].unique())
# follow_user_num_range, register_days_range, fans_user_num_range -> OHE
# is_live_streamer, is_video_author -> boolean

user_id                    int64
user_active_degree        object
is_lowactive_period        int64
is_live_streamer           int64
is_video_author            int64
follow_user_num            int64
follow_user_num_range     object
fans_user_num              int64
fans_user_num_range       object
friend_user_num            int64
friend_user_num_range     object
register_days              int64
register_days_range       object
onehot_feat0               int64
onehot_feat1               int64
onehot_feat2               int64
onehot_feat3               int64
onehot_feat4             float64
onehot_feat5               int64
onehot_feat6               int64
onehot_feat7               int64
onehot_feat8               int64
onehot_feat9               int64
onehot_feat10              int64
onehot_feat11              int64
onehot_feat12            float64
onehot_feat13            float64
onehot_feat14            float64
onehot_feat15            float64
onehot_feat16            float64
onehot_fea
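
The encoding plan noted in the cell above (range columns one-hot encoded, flag columns cast to boolean) could be sketched as follows. The column names come from the user features table, but the rows are toy values and the choice of `pd.get_dummies` is an assumption, not necessarily the encoder used in the final implementation.

```python
import pandas as pd

# Toy rows; column names match the user features table, values are illustrative
toy_users = pd.DataFrame({
    "user_id": [0, 1, 2],
    "follow_user_num_range": ["(0,10]", "(250,500]", "(10,50]"],
    "fans_user_num_range": ["0", "[1,10)", "0"],
    "is_live_streamer": [0, 0, 1],
    "is_video_author": [0, 1, 0],
})

range_columns = ["follow_user_num_range", "fans_user_num_range"]
flag_columns = ["is_live_streamer", "is_video_author"]

# One indicator column per observed range value
encoded = pd.get_dummies(toy_users, columns=range_columns, dtype="uint8")
# 0/1 integers -> booleans
encoded[flag_columns] = encoded[flag_columns].astype(bool)

print(encoded.dtypes)
```

`register_days_range` would be handled the same way by adding it to `range_columns`.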

In [101]:
print(users.isna().any())
# Drop the two-digit one-hot columns (onehot_feat10 to onehot_feat17)
users = users.drop(columns=users.filter(regex=r'^onehot_feat\d{2}$').columns)
print(users[users.isna().any(axis=1)])
# Share of users with no missing value left
print(users.dropna().shape[0] / users.shape[0])
#users = users.dropna()

user_id                  False
user_active_degree       False
is_lowactive_period      False
is_live_streamer         False
is_video_author          False
follow_user_num          False
follow_user_num_range    False
fans_user_num            False
fans_user_num_range      False
friend_user_num          False
friend_user_num_range    False
register_days            False
register_days_range      False
onehot_feat0             False
onehot_feat1             False
onehot_feat2             False
onehot_feat3             False
onehot_feat4              True
onehot_feat5             False
onehot_feat6             False
onehot_feat7             False
onehot_feat8             False
onehot_feat9             False
onehot_feat10            False
onehot_feat11            False
onehot_feat12             True
onehot_feat13             True
onehot_feat14             True
onehot_feat15             True
onehot_feat16             True
onehot_feat17             True
dtype: bool
      user_id user_active_d

In [91]:
f"{users.memory_usage(deep=True).sum() / (1024 ** 2)} MB"

'3.3163766860961914 MB'

## Item Daily Features

This dataset contains daily statistics for each video.

This may be useful when we consider **video characteristics on a daily basis** or want to **build a longer-term feature**.
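
One way to build such a longer-term feature is to collapse the daily rows into a single row per video. The sketch below runs on toy data; `show_cnt` and `play_cnt` are assumed column names and should be checked against `video_daily.columns` before reuse.

```python
import pandas as pd

# Toy daily rows; `show_cnt` and `play_cnt` are assumed column names
toy_daily = pd.DataFrame({
    "video_id": [0, 0, 0, 1, 1],
    "date": [20200705, 20200706, 20200707, 20200705, 20200706],
    "show_cnt": [100, 200, 100, 50, 50],
    "play_cnt": [40, 100, 60, 5, 15],
})

# Collapse the daily rows into one long-term row per video
long_term = toy_daily.groupby("video_id").agg(
    days_observed=("date", "nunique"),
    total_show=("show_cnt", "sum"),
    total_play=("play_cnt", "sum"),
)
# Plays per impression over the whole observation window
long_term["play_rate"] = long_term["total_play"] / long_term["total_show"]

print(long_term)
```

Summing before dividing weights each day by its traffic, whereas averaging daily rates would give low-traffic days the same weight as busy ones.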

In [None]:
video_daily = pd.read_csv("data_final_project/KuaiRec 2.0/data/item_daily_features.csv")
video_daily.head()

In [None]:
video_daily.describe()

## Social Network

The social network graph is also given.

This can be useful to **recommend a video based on friendship / followers**

We could even detect clusters of users that have similar interests.
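
A first, simple notion of cluster is the set of users connected to each other through friendship links. The sketch below computes these connected components on a toy table. It assumes a `friend_list` column holding a stringified list of user ids, which is an assumption about the file layout, and it says nothing yet about shared interests.

```python
import ast

import pandas as pd

# Toy friendship table; `friend_list` as a stringified list is an assumed layout
toy_social = pd.DataFrame({
    "user_id": [1, 2, 5],
    "friend_list": ["[2, 3]", "[1]", "[6]"],
})

# Build an undirected adjacency map: user -> set of friends
adjacency = {}
for user, friends in zip(toy_social["user_id"], toy_social["friend_list"]):
    for friend in ast.literal_eval(friends):
        adjacency.setdefault(int(user), set()).add(int(friend))
        adjacency.setdefault(int(friend), set()).add(int(user))


def connected_components(graph):
    """Group nodes that can reach each other through friendship links."""
    seen = set()
    components = []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        components.append(component)
    return components


clusters = connected_components(adjacency)
print(clusters)
```

Comparing the videos watched inside each component against the rest of the users would be the next step to see whether friends actually share interests.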

In [None]:
social_network = pd.read_csv("data_final_project/KuaiRec 2.0/data/social_network.csv")
social_network.head()

In [None]:
social_network.shape

## Video Features

We can use the video features, including captions and categories.

This may be useful if we want to **recommend a video using its categories and its caption**.

This dataset can also be used to build **the category id mapping** (category_id -> category_name).
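
The category id mapping could be built as below. This is a sketch on toy rows: the column names `first_level_category_id` and `first_level_category_name` are assumptions about the caption/category file, and the id/name pairs are made up for illustration.

```python
import pandas as pd

# Toy rows; column names are assumed and the id/name pairs are invented
toy_video_features = pd.DataFrame({
    "video_id": [0, 1, 2],
    "first_level_category_id": [8, 27, 8],
    "first_level_category_name": ["Sports", "Food", "Sports"],
})

# Keep one row per (id, name) pair, then index the names by id
category_names = (
    toy_video_features[["first_level_category_id", "first_level_category_name"]]
    .drop_duplicates()
    .set_index("first_level_category_id")["first_level_category_name"]
    .to_dict()
)

print(category_names)
```

If an id ever maps to several names, `to_dict()` silently keeps the last one, so checking that ids are unique after `drop_duplicates()` is a worthwhile safeguard.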

In [None]:
video_features = pd.read_csv("data_final_project/KuaiRec 2.0/data/kuairec_caption_category.csv", lineterminator='\n')
video_features.head()

## Video Categories

The video categories dataset lists the category tags attached to each video.

In [None]:
video_categories = pd.read_csv("data_final_project/KuaiRec 2.0/data/item_categories.csv")
video_categories.head()