# KuaiRec Dataset Exploration

## Theoretical Framework: Complex Networks & Data-Driven Modeling

This notebook explores the KuaiRec dataset through the lens of **complex networks theory** and **data-driven modeling**.

### What is a Complex Network?

A **complex network** is a graph with non-trivial topological features that don't occur in simple networks (like lattices or random graphs). Examples include:
- Social networks (people connected by friendships)
- Information networks (web pages connected by hyperlinks)
- Recommendation networks (users connected through shared interests)

### Two Core Approaches in This Course

#### 1. Network Characterization
**Goal**: Understand the structure and properties of networks

**Key Concepts**:
- **Degree Distribution**: How many connections does each node have? In social networks, most people have few friends while a few have very many, giving a heavy-tailed distribution that is often approximated by a power law
- **Clustering**: Do friends of your friends tend to be your friends? (Triangles in the network)
- **Path Lengths**: How many steps to get from one user to another?
- **Centrality**: Who are the most important/influential nodes?
  - *Degree centrality*: Simply how many connections
  - *Betweenness centrality*: How often a node lies on shortest paths between others
  - *Eigenvector centrality*: Being connected to important nodes makes you important
- **Community Detection**: Are there clusters of tightly connected users?

**Why this matters for KuaiRec**: The social network structure can influence what videos people watch. Users connected to many others might spread content faster (diffusion).
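
As a small illustration of the concepts above, the sketch below computes degree, local clustering, and a shortest-path length on a hypothetical five-node friendship graph using only the standard library. The graph and the helper names are made up for this example; NetworkX offers equivalents (`G.degree`, `nx.clustering`, `nx.shortest_path_length`).

```python
from collections import deque
from itertools import combinations

# Hypothetical undirected friendship graph stored as an adjacency dict
toy_graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a", "e"},
    "e": {"d"},
}

def degree(graph, node):
    """Number of direct connections of a node."""
    return len(graph[node])

def local_clustering(graph, node):
    """Fraction of pairs of neighbors that are themselves connected."""
    neighbors = graph[node]
    k = len(neighbors)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(neighbors, 2) if v in graph[u])
    return 2 * links / (k * (k - 1))

def shortest_path_length(graph, source, target):
    """Breadth-first search; returns None if target is unreachable."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        if current == target:
            return dist[current]
        for neighbor in graph[current]:
            if neighbor not in dist:
                dist[neighbor] = dist[current] + 1
                queue.append(neighbor)
    return None

toy_degrees = {node: degree(toy_graph, node) for node in toy_graph}
toy_clustering = {node: local_clustering(toy_graph, node) for node in toy_graph}
print(toy_degrees)
print(toy_clustering)
print(shortest_path_length(toy_graph, "b", "e"))
```

Node `a` has the highest degree but a low clustering coefficient because only one pair of its three neighbors is connected, while `b` and `c` sit in a closed triangle.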

#### 2. Data-Driven Modeling
**Goal**: Build predictive models based on data patterns

**Key Concepts**:
- **Regression Models**: Predict a continuous outcome (e.g., watch time) from features
  - *LASSO regression*: Automatically selects important features by pushing irrelevant coefficients to zero
  - *Ridge regression*: Prevents overfitting by penalizing large coefficients
- **Feature Engineering**: Creating informative variables from raw data
  - Network features: degree, centrality, clustering coefficient of each user
  - Content features: video tags, duration, popularity
  - Temporal features: time of day, day of week
- **Model Interpretation**: What factors actually drive behavior?

**Why this matters for KuaiRec**: We want to understand what predicts watch time - is it the user's social position? Video popularity? Content type?

---

## Our Research Question

**Primary Focus**: What factors influence user watch time and engagement patterns?

We'll combine both approaches:
1. **Network Characterization**: Analyze the social network structure
2. **Data-Driven Modeling**: Build predictive models using network features + content features

---

## Setup

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Network analysis
import networkx as nx

# Machine learning
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Plotting style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')

%matplotlib inline

## 1. Data Loading

The KuaiRec dataset contains:
- **User-item interaction matrix**: Which users watched which videos and for how long
- **Social network**: Friend lists for each user (`user_id`, `friend_list`)
- **Video features**: Content tags, duration, popularity metrics
- **User features**: Demographics, activity levels

In [None]:
# Load KuaiRec dataset
print("Loading KuaiRec dataset...")

# Start with small_matrix for faster initial exploration
# Switch to big_matrix once you're ready for full analysis
interactions = pd.read_csv('../data/raw/small_matrix.csv')
print(f"✓ Loaded interactions: {interactions.shape[0]:,} rows, {interactions.shape[1]} columns")

# Social network (friend connections)
social_network = pd.read_csv('../data/raw/social_network.csv')
print(f"✓ Loaded social network: {social_network.shape[0]:,} users")

# Video features (categories/tags)
video_features = pd.read_csv('../data/raw/item_categories.csv')
print(f"✓ Loaded video features: {video_features.shape[0]:,} videos")

# User features (demographics, activity levels)
user_features = pd.read_csv('../data/raw/user_features.csv')
print(f"✓ Loaded user features: {user_features.shape[0]:,} users")

print("\n" + "="*60)
print("Dataset Overview")
print("="*60)
print(f"Total interactions: {interactions.shape[0]:,}")
print(f"Unique users: {interactions['user_id'].nunique():,}")
print(f"Unique videos: {interactions['video_id'].nunique():,}")
print(f"Date range: {interactions['date'].min()} to {interactions['date'].max()}")

# Display sample of interaction data
print("\nSample interaction data:")
print(interactions.head())

print("\nInteraction columns:")
print(interactions.columns.tolist())

print("\nUser feature columns:")
print(user_features.columns.tolist())

## 2. Network Characterization

### Theory: Why Analyze the Social Network?

**Social influence hypothesis**: People's behavior (including video watching) is influenced by their social connections.

**Key questions**:
- Do central users watch more content?
- Do users in dense communities have similar watch patterns?
- How does content diffuse through the network?

### Network Metrics to Compute

1. **Basic Statistics**
   - Number of nodes (users) and edges (connections)
   - Degree distribution
   - Density: How connected is the network?

2. **Structural Properties**
   - Clustering coefficient: Do friends cluster together?
   - Average path length: How many hops between users?
   - Connected components: Is the network fragmented?

3. **Centrality Measures** (for each user)
   - Degree: Number of connections
   - Betweenness: Bridge between communities
   - Eigenvector: Connected to important people
   - PageRank: Like Google's algorithm for web pages
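
To make two of these metrics concrete, here is a minimal sketch of directed density and a plain power-iteration PageRank on a hypothetical four-node graph. The edge list, function names, and iteration count are assumptions for illustration; the damping factor of 0.85 matches the NetworkX default, but this is not a reimplementation of `nx.pagerank`.

```python
# Hypothetical directed graph: edge (u, v) means u points to v
toy_edges = [("a", "c"), ("b", "c"), ("c", "a"), ("d", "c"), ("a", "b")]
toy_nodes = sorted({node for edge in toy_edges for node in edge})

def directed_density(n_nodes, n_edges):
    """Fraction of the n * (n - 1) possible directed edges that are present."""
    return n_edges / (n_nodes * (n_nodes - 1))

def pagerank_sketch(nodes, edges, damping=0.85, n_iter=100):
    """Power iteration; nodes without out-links spread their rank uniformly."""
    out_links = {node: [] for node in nodes}
    for u, v in edges:
        out_links[u].append(v)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(n_iter):
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for u in nodes:
            for v in out_links[u]:
                new_rank[v] += damping * rank[u] / len(out_links[u])
        rank = new_rank
    return rank

toy_density = directed_density(len(toy_nodes), len(toy_edges))
toy_pagerank = pagerank_sketch(toy_nodes, toy_edges)
print(f"density = {toy_density:.3f}")
print({node: round(score, 3) for node, score in toy_pagerank.items()})
```

Node `c` receives links from three of the four nodes and ends up with the highest score, while `d`, which nobody points to, keeps only the baseline share.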

In [None]:
# Build network from social connections
print("Building social network graph...")

# Parse friend_list (stored as string representation of list)
import ast

edges = []
for _, row in social_network.iterrows():
    user_id = row['user_id']
    friend_list = ast.literal_eval(row['friend_list'])
    for friend_id in friend_list:
        edges.append((user_id, friend_id))

print(f"Total edges extracted: {len(edges):,}")

# Create directed graph (an edge u -> v means v appears in u's friend list)
G = nx.DiGraph()
G.add_edges_from(edges)

print(f"\nNetwork Statistics:")
print(f"  Number of users (nodes): {G.number_of_nodes():,}")
print(f"  Number of connections (edges): {G.number_of_edges():,}")
print(f"  Network density: {nx.density(G):.6f}")
print(f"  Is weakly connected: {nx.is_weakly_connected(G)}")

# For undirected metrics, create undirected version
G_undirected = G.to_undirected()
print(f"  Connected components: {nx.number_connected_components(G_undirected)}")

# Degree distribution (for a DiGraph, degree = in-degree + out-degree)
degrees = dict(G.degree())
degree_values = list(degrees.values())

plt.figure(figsize=(14, 5))

# Linear scale
plt.subplot(1, 2, 1)
plt.hist(degree_values, bins=50, edgecolor='black', alpha=0.7)
plt.xlabel('Degree (Number of Connections)')
plt.ylabel('Frequency')
plt.title('Degree Distribution (Linear Scale)')
plt.grid(alpha=0.3)

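# Log-log scale with logarithmically spaced bins (to check for a heavy tail)
# Every node has degree >= 1 here because nodes are only added through edges
plt.subplot(1, 2, 2)
log_bins = np.logspace(0, np.log10(max(degree_values) + 1), 50)
plt.hist(degree_values, bins=log_bins, edgecolor='black', alpha=0.7)
plt.xlabel('Degree (Number of Connections)')
plt.ylabel('Frequency')
plt.title('Degree Distribution (Log-Log Scale)')
plt.yscale('log')
plt.xscale('log')
plt.grid(alpha=0.3)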

plt.tight_layout()
plt.show()

print(f"\nDegree Statistics:")
print(f"  Mean degree: {np.mean(degree_values):.2f}")
print(f"  Median degree: {np.median(degree_values):.2f}")
print(f"  Max degree: {np.max(degree_values)}")
print(f"  Min degree: {np.min(degree_values)}")

### Interpretation Guide

**What to look for**:
- **Power-law degree distribution**: A few highly connected users (influencers), many with few connections
- **High clustering**: Indicates community structure
- **Small world property**: Short average path length despite large network size

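**Scale-free networks**: If the degree distribution follows a power law $P(k) \sim k^{-\gamma}$, the network is called scale-free. Many social networks are reported to look approximately scale-free, but a roughly straight line on a log-log histogram is only suggestive, not conclusive.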
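
One way to go beyond eyeballing the log-log plot is to estimate the exponent directly. The sketch below uses the continuous maximum-likelihood estimator $\hat{\gamma} = 1 + n / \sum_i \ln(k_i / k_{\min})$ on synthetic data. It assumes $k_{\min}$ is known and treats values as continuous, so for integer degrees it is only an approximation; the function names and sample values are hypothetical.

```python
import math
import random

def sample_power_law(n, gamma, k_min, rng):
    """Draw n continuous power-law samples by inverse-transform sampling."""
    return [k_min * (1.0 - rng.random()) ** (-1.0 / (gamma - 1.0)) for _ in range(n)]

def estimate_gamma(values, k_min):
    """Continuous maximum-likelihood estimate of gamma using values >= k_min."""
    tail = [v for v in values if v >= k_min]
    log_sum = sum(math.log(v / k_min) for v in tail)
    return 1.0 + len(tail) / log_sum

rng = random.Random(42)
synthetic = sample_power_law(20000, gamma=2.5, k_min=1.0, rng=rng)
gamma_hat = estimate_gamma(synthetic, k_min=1.0)
print(f"true gamma = 2.5, estimated gamma = {gamma_hat:.3f}")
```

On the real degree sequence, the estimate depends strongly on the chosen `k_min`, so it is worth trying several cutoffs before drawing conclusions.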

In [None]:
# Calculate centrality measures for each user
# These become FEATURES for our predictive model
print("Computing centrality measures...")
print("(This may take a minute for larger networks)")

# Degree centrality (fast)
degree_centrality = nx.degree_centrality(G)
print("✓ Degree centrality computed")

# PageRank (fast, good for directed graphs)
pagerank = nx.pagerank(G, max_iter=100)
print("✓ PageRank computed")

# Betweenness centrality (slower - use sampling for large networks)
# For small network, compute exactly; for large, use approximation
if G.number_of_nodes() < 5000:
    betweenness_centrality = nx.betweenness_centrality(G)
    print("✓ Betweenness centrality computed (exact)")
else:
    # Use sampling for large networks
    sample_k = min(100, G.number_of_nodes())
    betweenness_centrality = nx.betweenness_centrality(G, k=sample_k)
    print(f"✓ Betweenness centrality computed (sampled, k={sample_k})")

# Clustering coefficient (use undirected for this metric)
clustering_coef = nx.clustering(G_undirected)
print("✓ Clustering coefficient computed")

# Create dataframe of network features
network_features = pd.DataFrame({
    'user_id': list(degree_centrality.keys()),
    'degree_centrality': list(degree_centrality.values()),
    'betweenness_centrality': [betweenness_centrality[node] for node in degree_centrality.keys()],
    'pagerank': [pagerank[node] for node in degree_centrality.keys()],
    'clustering_coef': [clustering_coef[node] for node in degree_centrality.keys()]
})

print(f"\nNetwork features created for {len(network_features)} users")
print("\nSample network features:")
print(network_features.head())

# Visualize centrality distributions
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

axes[0, 0].hist(network_features['degree_centrality'], bins=50, edgecolor='black', alpha=0.7)
axes[0, 0].set_xlabel('Degree Centrality')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].set_title('Degree Centrality Distribution')
axes[0, 0].grid(alpha=0.3)

axes[0, 1].hist(network_features['pagerank'], bins=50, edgecolor='black', alpha=0.7)
axes[0, 1].set_xlabel('PageRank')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].set_title('PageRank Distribution')
axes[0, 1].grid(alpha=0.3)

axes[1, 0].hist(network_features['betweenness_centrality'], bins=50, edgecolor='black', alpha=0.7)
axes[1, 0].set_xlabel('Betweenness Centrality')
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].set_title('Betweenness Centrality Distribution')
axes[1, 0].grid(alpha=0.3)

axes[1, 1].hist(network_features['clustering_coef'], bins=50, edgecolor='black', alpha=0.7)
axes[1, 1].set_xlabel('Clustering Coefficient')
axes[1, 1].set_ylabel('Frequency')
axes[1, 1].set_title('Clustering Coefficient Distribution')
axes[1, 1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

## 3. Exploratory Data Analysis

### Understanding Watch Time Patterns

**Watch ratio** = (time watched) / (video duration)
- 1.0 = watched entire video
- <1.0 = skipped parts
- >1.0 = rewatched parts

This is our **target variable** for prediction.
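
A quick worked example of the three cases, assuming `watch_ratio` is `play_duration` divided by `video_duration` with both in milliseconds. The rows below are made up for illustration.

```python
# Hypothetical interactions with durations in milliseconds
toy_interactions = [
    {"play_duration": 5000, "video_duration": 10000},   # stopped halfway
    {"play_duration": 10000, "video_duration": 10000},  # watched once in full
    {"play_duration": 25000, "video_duration": 10000},  # looped 2.5 times
]

def watch_ratio(play_duration_ms, video_duration_ms):
    """Time watched divided by video length."""
    return play_duration_ms / video_duration_ms

toy_ratios = [
    watch_ratio(row["play_duration"], row["video_duration"])
    for row in toy_interactions
]
print(toy_ratios)
```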

In [None]:
# Explore watch time distribution
print("Watch Ratio Statistics:")
print(f"  Mean: {interactions['watch_ratio'].mean():.3f}")
print(f"  Median: {interactions['watch_ratio'].median():.3f}")
print(f"  Std: {interactions['watch_ratio'].std():.3f}")
print(f"  Min: {interactions['watch_ratio'].min():.3f}")
print(f"  Max: {interactions['watch_ratio'].max():.3f}")
print(f"\nInterpretation:")
print("  watch_ratio = 1.0 means watched entire video")
print("  watch_ratio < 1.0 means skipped/stopped early")
print("  watch_ratio > 1.0 means rewatched or looped")

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Watch ratio distribution
axes[0, 0].hist(interactions['watch_ratio'], bins=50, edgecolor='black', alpha=0.7)
axes[0, 0].set_xlabel('Watch Ratio')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].set_title('Distribution of Watch Ratios')
axes[0, 0].axvline(1.0, color='red', linestyle='--', label='Complete watch')
axes[0, 0].legend()
axes[0, 0].grid(alpha=0.3)

# Watch ratio vs video duration
sample_size = min(10000, len(interactions))
sample = interactions.sample(sample_size, random_state=42)
axes[0, 1].scatter(sample['video_duration'], sample['watch_ratio'], alpha=0.1, s=1)
axes[0, 1].set_xlabel('Video Duration (ms)')
axes[0, 1].set_ylabel('Watch Ratio')
axes[0, 1].set_title(f'Watch Ratio vs Video Duration (n={sample_size:,})')
axes[0, 1].grid(alpha=0.3)

# Video duration distribution
axes[1, 0].hist(interactions['video_duration'] / 1000, bins=50, edgecolor='black', alpha=0.7)
axes[1, 0].set_xlabel('Video Duration (seconds)')
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].set_title('Distribution of Video Durations')
axes[1, 0].grid(alpha=0.3)

# Play duration distribution
axes[1, 1].hist(interactions['play_duration'] / 1000, bins=50, edgecolor='black', alpha=0.7)
axes[1, 1].set_xlabel('Play Duration (seconds)')
axes[1, 1].set_ylabel('Frequency')
axes[1, 1].set_title('Distribution of Actual Watch Time')
axes[1, 1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

# User and video activity
print(f"\nUser Activity:")
interactions_per_user = interactions.groupby('user_id').size()
print(f"  Mean interactions per user: {interactions_per_user.mean():.1f}")
print(f"  Median interactions per user: {interactions_per_user.median():.1f}")
print(f"  Max interactions by single user: {interactions_per_user.max()}")

print(f"\nVideo Popularity:")
views_per_video = interactions.groupby('video_id').size()
print(f"  Mean views per video: {views_per_video.mean():.1f}")
print(f"  Median views per video: {views_per_video.median():.1f}")
print(f"  Max views for single video: {views_per_video.max()}")

## 4. Feature Engineering

### Theory: What Features Matter?

To predict watch time, we need features that capture:

1. **User characteristics**
   - Network position (centrality measures)
   - Activity level (total videos watched)
   - Average watch behavior

2. **Video characteristics**
   - Duration
   - Popularity (how many views)
   - Content tags (comedy, education, etc.)

3. **Social influence**
   - How many of user's friends watched this video?
   - Average watch ratio among friends who watched the video

4. **Temporal patterns**
   - Time of day
   - Day of week
   - Sequence position (1st video of session vs 10th)
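
The social-influence and temporal features in the list can be sketched with the standard library alone. The friend lists, watch events, and function names below are hypothetical, and interpreting timestamps as UTC is an assumption; the dataset may use a different time zone.

```python
from datetime import datetime, timezone

# Hypothetical friend lists and watch events (user_id, video_id, unix timestamp)
toy_friends = {1: [2, 3], 2: [1], 3: [1, 2]}
toy_watches = [
    (2, "v1", 1593561600),
    (3, "v1", 1593565200),
    (1, "v1", 1593568800),
    (1, "v2", 1593572400),
]

def friends_who_watched_before(user_id, video_id, timestamp, friends, watches):
    """Count the user's friends who watched the same video strictly earlier."""
    friend_set = set(friends.get(user_id, []))
    earlier_viewers = {
        u for u, v, t in watches
        if v == video_id and t < timestamp and u in friend_set
    }
    return len(earlier_viewers)

def temporal_features(timestamp):
    """Hour of day and day of week (Monday = 0), interpreting the timestamp as UTC."""
    moment = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    return {"hour": moment.hour, "day_of_week": moment.weekday()}

for user_id, video_id, timestamp in toy_watches:
    exposure = friends_who_watched_before(
        user_id, video_id, timestamp, toy_friends, toy_watches
    )
    print(user_id, video_id, exposure, temporal_features(timestamp))
```

Counting only earlier views matters: including friends who watched later would leak future information into the feature.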

In [None]:
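# Example feature engineering

# User-level features (named aggregation keeps column names flat for merging)
# Note: aggregates of watch_ratio computed on all rows leak the target;
# recompute them on the training split before modeling
user_activity = interactions.groupby('user_id').agg(
    user_mean_watch_ratio=('watch_ratio', 'mean'),
    user_std_watch_ratio=('watch_ratio', 'std'),
    user_n_interactions=('watch_ratio', 'count'),
    user_mean_video_duration=('video_duration', 'mean'),
).reset_index()

# Video popularity
video_popularity = interactions.groupby('video_id').agg(
    video_mean_watch_ratio=('watch_ratio', 'mean'),
    video_popularity=('user_id', 'count'),  # number of views
).reset_index()

# Merge all features (left joins keep users who are absent from the social graph;
# their network features will be NaN and need handling before modeling)
full_data = interactions.merge(network_features, on='user_id', how='left')
full_data = full_data.merge(video_features, on='video_id', how='left')
full_data = full_data.merge(user_activity, on='user_id', how='left')
full_data = full_data.merge(video_popularity, on='video_id', how='left')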

## 5. Predictive Modeling

### Theory: LASSO Regression

**Why LASSO?**

LASSO (Least Absolute Shrinkage and Selection Operator) minimizes:

$$\text{Loss} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$$

- First term: Prediction error (standard regression)
- Second term: L1 penalty on coefficients

**Key advantage**: Automatic feature selection
- Features that add little predictive value get coefficient = 0
- Helps narrow down which features matter for watch ratio (among strongly correlated features, LASSO may keep only one)

**λ (lambda)**: Controls regularization strength
- Small λ: More features retained, risk of overfitting
- Large λ: Fewer features, simpler model
- Choose via cross-validation
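
To see why the L1 penalty produces exact zeros, consider the special case of orthonormal features, where each coefficient can be solved separately. Under the loss above, the LASSO solution is the least-squares coefficient $b$ soft-thresholded by $\lambda/2$, while a ridge penalty $\lambda \beta^2$ only rescales it to $b / (1 + \lambda)$. The sketch below applies both rules to made-up coefficients; the feature names and values are hypothetical, and real data rarely has orthonormal features.

```python
def lasso_shrink(b, lam):
    """Soft-thresholding: minimizer of (b - beta)**2 + lam * abs(beta)."""
    magnitude = max(abs(b) - lam / 2.0, 0.0)
    return magnitude if b >= 0 else -magnitude

def ridge_shrink(b, lam):
    """Minimizer of (b - beta)**2 + lam * beta**2."""
    return b / (1.0 + lam)

# Hypothetical least-squares coefficients for three orthonormal features
ols_coefs = {"strong_feature": 2.0, "weak_feature": 0.3, "negative_feature": -1.0}
lam = 1.0

lasso_coefs = {name: lasso_shrink(b, lam) for name, b in ols_coefs.items()}
ridge_coefs = {name: ridge_shrink(b, lam) for name, b in ols_coefs.items()}
print("LASSO:", lasso_coefs)
print("Ridge:", ridge_coefs)
```

LASSO sets the weak coefficient exactly to zero, whereas ridge shrinks every coefficient but leaves all of them non-zero.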

### Interpretation

After fitting LASSO:
- **Non-zero coefficients**: Features retained by the model
- **Sign of coefficient**: Positive = associated with higher watch ratio, Negative = associated with lower watch ratio
- **Magnitude**: Strength of association (comparable across features only after standardization)

In [None]:
# Prepare features and target
# feature_columns = ['degree_centrality', 'betweenness_centrality', 'pagerank',
#                   'video_duration', 'video_popularity', ...]

# X = full_data[feature_columns]
# y = full_data['watch_ratio']

# Split data
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize features (important for LASSO)
# scaler = StandardScaler()
# X_train_scaled = scaler.fit_transform(X_train)
# X_test_scaled = scaler.transform(X_test)

# Fit LASSO
# lasso = Lasso(alpha=0.01)  # alpha plays the role of λ; scikit-learn scales the squared error by 1/(2n), so alpha = λ/(2n)
# lasso.fit(X_train_scaled, y_train)

# Model performance
# train_score = lasso.score(X_train_scaled, y_train)
# test_score = lasso.score(X_test_scaled, y_test)
# print(f"Train R²: {train_score:.3f}")
# print(f"Test R²: {test_score:.3f}")

In [None]:
# Feature importance analysis
# coefficients = pd.DataFrame({
#     'feature': feature_columns,
#     'coefficient': lasso.coef_
# })
# coefficients = coefficients[coefficients['coefficient'] != 0].sort_values('coefficient', ascending=False)

# plt.figure(figsize=(10, 6))
# plt.barh(coefficients['feature'], coefficients['coefficient'])
# plt.xlabel('LASSO Coefficient')
# plt.title('Important Features for Predicting Watch Time')
# plt.tight_layout()

## 6. Network Influence on Watch Behavior

### Theory: Social Influence and Diffusion

**Question**: Does network position affect what/how much people watch?

**Hypotheses to test**:
1. Users with higher centrality watch more content
2. Users in dense communities have correlated watch patterns
3. Content diffuses through network connections

**Diffusion models**:
- **Threshold models**: Users adopt after k friends do
- **Cascade models**: Probabilistic spread through edges
- **Exposure models**: More exposures = higher adoption probability
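
As an illustration of the first model, here is a minimal sketch of a threshold cascade in which a user adopts once at least `k` friends have adopted. The graph, seed sets, and function name are hypothetical; this is a toy simulation, not an analysis of the KuaiRec data.

```python
def threshold_cascade(graph, seeds, k):
    """Spread adoption until no more users reach the threshold of k adopting friends."""
    adopted = set(seeds)
    changed = True
    while changed:
        changed = False
        for user, friends in graph.items():
            if user not in adopted and sum(f in adopted for f in friends) >= k:
                adopted.add(user)
                changed = True
    return adopted

# Hypothetical undirected friendship graph
toy_graph = {
    "a": ["b", "c"],
    "b": ["a", "c", "d"],
    "c": ["a", "b", "d"],
    "d": ["b", "c", "e"],
    "e": ["d"],
}

print(sorted(threshold_cascade(toy_graph, seeds={"a", "b"}, k=2)))
print(sorted(threshold_cascade(toy_graph, seeds={"a"}, k=2)))
print(sorted(threshold_cascade(toy_graph, seeds={"a"}, k=1)))
```

With two seeds and `k=2` the cascade reaches `c` and `d` but stops before `e`, which has only one friend. A single seed cannot start a cascade at `k=2`, while `k=1` reaches everyone.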

In [None]:
# Analyze correlation between network position and behavior

# Summed watch ratio per user (an engagement proxy, not watch time in seconds)
user_total_watch = interactions.groupby('user_id')['watch_ratio'].sum()

# Merge with centrality; drop graph users with no interactions in this matrix
analysis_df = network_features.copy()
analysis_df['total_watch'] = analysis_df['user_id'].map(user_total_watch)
analysis_df = analysis_df.dropna(subset=['total_watch'])

# Scatter plots: one panel per centrality measure
panels = [
    ('degree_centrality', 'Degree Centrality', 'Degree'),
    ('betweenness_centrality', 'Betweenness Centrality', 'Betweenness'),
    ('pagerank', 'PageRank', 'PageRank'),
]
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for ax, (column, label, short_name) in zip(axes, panels):
    ax.scatter(analysis_df[column], analysis_df['total_watch'], alpha=0.3)
    ax.set_xlabel(label)
    ax.set_ylabel('Total Watch Ratio')
    ax.set_title(f'{short_name} vs Watch Behavior')

plt.tight_layout()
plt.show()

## 7. Next Steps and Research Questions

### Questions to Explore

1. **Feature importance**: Which factors most strongly predict watch time?
   - Network position? Content features? Social influence?

2. **Community structure**: Do communities have different watch patterns?
   - Use community detection (Louvain, Girvan-Newman)
   - Compare watch behavior across communities

3. **Temporal dynamics**: How does engagement change over time?
   - Time series analysis
   - Session-level patterns

4. **Diffusion analysis**: How do videos spread through the network?
   - Track which users watch which videos when
   - Model as contagion process

5. **Advanced modeling**:
   - Graph Neural Networks (GNNs) to incorporate network structure
   - Recommender systems
   - Causal inference (does network position CAUSE different behavior?)

### Potential Findings

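- **If network features are important**: Consistent with social influence on engagement, although similar users befriending each other (homophily) could produce the same pattern
- **If content features dominate**: Individual preferences appear to outweigh social effects
- **If both matter**: Suggests an interaction between social and content factors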

### Deliverables for Team Discussion

1. Network characterization summary
2. Initial predictive model results
3. Key insights and surprising findings
4. Proposed research directions for final project

## References and Further Reading

### Key Concepts

- **Complex Networks**: Barabási, A. L., & Albert, R. (1999). Emergence of scaling in random networks. Science.
- **Social Influence**: Aral, S., & Walker, D. (2012). Identifying influential and susceptible members of social networks. Science.
- **LASSO Regression**: Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. JRSS-B.
- **Network Diffusion**: Kempe, D., Kleinberg, J., & Tardos, É. (2003). Maximizing the spread of influence through a social network. KDD.

### Tools

- **NetworkX**: https://networkx.org/
- **scikit-learn**: https://scikit-learn.org/
- **KuaiRec Dataset**: https://chongminggao.github.io/KuaiRec/