<a href="https://colab.research.google.com/github/Kuzay3t/Recommendation-System-Simulator-/blob/main/Recommendation_Simulator.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Model Description: AI Content Recommendation Simulator**

**Goal:** Simulate how social media algorithms (e.g., TikTok) curate content for teens by predicting recommended videos based on engagement patterns, using collaborative filtering and reinforcement learning (RL).

**Key Components:**

Collaborative Filtering (CF): A user-item interaction matrix built from implicit feedback (video_like_count, video_share_count).

**Model:** Pretrained Hugging Face all-MiniLM-L6-v2 to generate video embeddings from video_transcription_text, then compute cosine similarity for content-based filtering.
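
The content-based step can be sketched as follows. This is a minimal illustration rather than the notebook's implementation: the toy 3-dimensional vectors stand in for the 384-dimensional vectors that all-MiniLM-L6-v2 produces, and `top_k_similar` is a hypothetical helper name.

```python
import numpy as np

def top_k_similar(embeddings, query_idx, k=3):
    """Return indices of the k rows most cosine-similar to row query_idx."""
    # L2-normalise the rows so that a dot product equals cosine similarity
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    sims = unit @ unit[query_idx]
    sims[query_idx] = -np.inf  # never return the query video itself
    return np.argsort(sims)[::-1][:k]

# Toy vectors standing in for real video embeddings (assumption)
toy_embeddings = np.array([
    [1.0, 0.0, 0.0],   # video 0
    [0.9, 0.1, 0.0],   # video 1: close to video 0
    [0.0, 1.0, 0.0],   # video 2: orthogonal to video 0
    [0.7, 0.7, 0.0],   # video 3: in between
])
neighbours = top_k_similar(toy_embeddings, query_idx=0, k=2)
print(neighbours)
```

With real data, the `embeddings` array would be replaced by the encoded transcriptions, one row per video.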

**Library:** Surprise or LightFM for matrix factorization.
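
To show what matrix factorization does with implicit feedback, here is a small NumPy sketch. It does not use the Surprise or LightFM APIs; `factorize`, the learning rate, the regularisation strength, and the binary toy matrix are all illustrative assumptions.

```python
import numpy as np

def factorize(interactions, n_factors=2, n_epochs=500, lr=0.05, reg=0.01, seed=0):
    """Approximate a dense user-item matrix as user_factors @ item_factors.T
    using full-batch gradient descent on the squared error."""
    rng = np.random.default_rng(seed)
    n_users, n_items = interactions.shape
    user_factors = rng.normal(scale=0.1, size=(n_users, n_factors))
    item_factors = rng.normal(scale=0.1, size=(n_items, n_factors))
    for _ in range(n_epochs):
        error = interactions - user_factors @ item_factors.T
        # Gradient step with L2 regularisation on both factor matrices
        user_step = error @ item_factors - reg * user_factors
        item_step = error.T @ user_factors - reg * item_factors
        user_factors += lr * user_step
        item_factors += lr * item_step
    return user_factors, item_factors

# Toy implicit feedback: 1 means the user liked or shared the video
interactions = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
])
user_factors, item_factors = factorize(interactions)
predicted = user_factors @ item_factors.T
print(np.round(predicted, 2))
```

The two user groups end up with distinct factor vectors, which is the structure a recommender then exploits to score unseen videos.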

Reinforcement Learning (RL): Optimize recommendations to maximize engagement (likes/shares) while simulating filter bubbles.

**Framework:** Custom OpenAI Gym environment with:
**States:** User’s watch history + engagement metrics.
**Actions:** Recommend a video from the top-*k* similar content pool.
**Rewards:** +1 for predicted engagement (like/share), −1 for harmful/low-quality content.
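
The reward rule above could be written as a small function. This is a sketch under assumptions the description does not settle: the 0.5 engagement threshold, the `is_harmful` flag, and the choice to let the penalty take precedence over engagement are all hypothetical.

```python
def recommendation_reward(predicted_engagement, is_harmful, threshold=0.5):
    """+1 for predicted engagement, -1 for harmful or low-quality content, else 0."""
    if is_harmful:
        # Assumption: the penalty overrides any engagement the video would earn
        return -1.0
    return 1.0 if predicted_engagement >= threshold else 0.0

# Return for a short hypothetical episode of (predicted_engagement, is_harmful) pairs
episode = [(0.9, False), (0.2, False), (0.8, True)]
episode_return = sum(recommendation_reward(p, h) for p, h in episode)
print(episode_return)
```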

**Input Data:**

tiktok_dataset.csv: Video metadata (video_transcription_text) + engagement signals (video_like_count, video_share_count).

**Output Simulation:**

*"Teens who engage with 1 body-image video receive 5+ similar videos within 2 hours (78% similarity score)."*

**Why This Works:**
Hugging Face Embeddings: Capture semantic meaning of video content (e.g., body-image, fitness).


**Importing Necessary Libraries**

In [8]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from huggingface_hub import from_pretrained_keras
from tensorflow.keras.optimizers import Adam
from sklearn.feature_extraction.text import TfidfVectorizer

**Loading a pretrained model from the Hugging Face Hub**

In [2]:
from huggingface_hub import from_pretrained_keras
model = from_pretrained_keras("keras-io/collaborative-filtering-movielens")


**Uploading the dataset**

In [3]:
from google.colab import files
file = files.upload()

Saving tiktok_dataset.csv to tiktok_dataset.csv


In [4]:
data = pd.read_csv('tiktok_dataset.csv')
data.head()

Unnamed: 0,video_duration_sec,video_transcription_text,video_view_count,video_like_count,video_share_count,video_download_count,video_comment_count
0,59,someone shared with me that drone deliveries a...,343296.0,19425.0,241.0,1.0,0.0
1,32,someone shared with me that there are more mic...,140877.0,77355.0,19034.0,1161.0,684.0
2,31,someone shared with me that american industria...,902185.0,97690.0,2858.0,833.0,329.0
3,25,someone shared with me that the metro of st. p...,437506.0,239954.0,34812.0,1234.0,584.0
4,19,someone shared with me that the number of busi...,56167.0,34987.0,4110.0,547.0,152.0


In [10]:
# Create a composite engagement score
data['engagement_score'] = (0.3*data['video_like_count'] +
                          0.2*data['video_share_count'] +
                          0.2*data['video_comment_count'] +
                          0.2*data['video_download_count'] +
                          0.1*data['video_view_count'])

# Normalize engagement score (0-1 range)
scaler = MinMaxScaler()
data['engagement_score'] = scaler.fit_transform(data[['engagement_score']])

# Extract content features from text
tfidf = TfidfVectorizer(max_features=100, stop_words='english')
content_features = tfidf.fit_transform(data['video_transcription_text'].fillna(''))
content_feature_names = tfidf.get_feature_names_out()

# Create synthetic user IDs (since real dataset lacks them)
num_users = 1000  # Simulated user base
data['user_id'] = np.random.randint(0, num_users, size=len(data))
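
The composite score and the min-max step can be checked by hand on a few rows. The counts below are made up for illustration; the weights are the ones used in the cell above, and the normalisation is the (x - min) / (max - min) formula that `MinMaxScaler` applies by default.

```python
import numpy as np

# Hypothetical counts for three videos: likes, shares, comments, downloads, views
counts = np.array([
    [100.0, 10.0, 5.0, 2.0, 1000.0],
    [400.0, 50.0, 20.0, 8.0, 5000.0],
    [900.0, 90.0, 45.0, 18.0, 9000.0],
])
weights = np.array([0.3, 0.2, 0.2, 0.2, 0.1])

raw_score = counts @ weights
scaled = (raw_score - raw_score.min()) / (raw_score.max() - raw_score.min())

# Fraction of each raw score that comes from views alone
view_share = (counts[:, 4] * weights[4]) / raw_score

print(np.round(raw_score, 1))
print(np.round(scaled, 3))
print(np.round(view_share, 2))
```

In this toy data the view term supplies roughly three quarters of every raw score despite its 0.1 weight, because the weights are applied to raw counts on very different scales.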

**Adapting the pretrained Model**

In [11]:
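# Size the embedding tables for this dataset
num_unique_videos = len(data)
# User IDs are drawn from 0..num_users-1, so size the table by the largest ID
# rather than by nunique(), which undercounts if any ID was never sampled
num_unique_users = int(data['user_id'].max()) + 1

# Reuse the embedding width of the pretrained model
original_embedding_size = model.layers[0].output_dim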

# Rebuild model
user_input = tf.keras.layers.Input(shape=(1,), name='user_input')
video_input = tf.keras.layers.Input(shape=(1,), name='video_input')

user_embedding = tf.keras.layers.Embedding(
    num_unique_users, original_embedding_size, name='user_embedding')(user_input)
video_embedding = tf.keras.layers.Embedding(
    num_unique_videos, original_embedding_size, name='video_embedding')(video_input)

dot_product = tf.keras.layers.Dot(axes=2)([user_embedding, video_embedding])
flattened = tf.keras.layers.Flatten()(dot_product)

# Add content features to the model
content_input = tf.keras.layers.Input(shape=(content_features.shape[1],), name='content_input')
merged = tf.keras.layers.Concatenate()([flattened, content_input])

output = tf.keras.layers.Dense(1, activation='sigmoid')(merged)

model = tf.keras.Model(inputs=[user_input, video_input, content_input], outputs=output)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
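
For a single (user, video) pair, the forward pass of this architecture reduces to a dot product, a concatenation, and a sigmoid. The NumPy sketch below mirrors that with made-up vectors and weights; `predict_engagement` and all numbers are illustrative, not values from the trained model.

```python
import numpy as np

def predict_engagement(user_vec, video_vec, content_vec, dense_w, dense_b):
    """Dot(user, video) joined with content features, then a sigmoid unit."""
    interaction = np.dot(user_vec, video_vec)               # collaborative signal
    merged = np.concatenate([[interaction], content_vec])   # shape (1 + n_content,)
    logit = merged @ dense_w + dense_b
    return 1.0 / (1.0 + np.exp(-logit))

# Hypothetical embeddings, TF-IDF features, and dense-layer weights
user_vec = np.array([1.0, 0.0])
video_vec = np.array([0.5, 2.0])
content_vec = np.array([0.0, 1.0])
dense_w = np.array([2.0, 0.0, -1.0])
dense_b = 0.0

probability = predict_engagement(user_vec, video_vec, content_vec, dense_w, dense_b)
print(probability)
```

Here the interaction term contributes 2.0 * 0.5 and the second content feature contributes -1.0, so the logit is 0 and the prediction is 0.5.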


**Preparing the Training Data**

In [12]:
# Create video IDs
data['video_id'] = data.index

# Split into train/test
train_data, test_data = train_test_split(data, test_size=0.2)

# Prepare input data
def prepare_data(sub_data):
    user_data = sub_data['user_id'].values
    video_data = sub_data['video_id'].values
    content_data = content_features[sub_data.index].toarray()
    engagement_data = sub_data['engagement_score'].values
    return [user_data, video_data, content_data], engagement_data

X_train, y_train = prepare_data(train_data)
X_test, y_test = prepare_data(test_data)

**Training the Model**

In [13]:
history = model.fit(
    X_train, y_train,
    batch_size=64,
    epochs=10,
    validation_data=(X_test, y_test)
)

Epoch 1/10
243/243 ━━━━━━━━━━━━━━━━━━━━ 5s 10ms/step - accuracy: 0.0000e+00 - loss: nan - val_accuracy: 2.5793e-04 - val_loss: nan
Epoch 2/10
243/243 ━━━━━━━━━━━━━━━━━━━━ 2s 9ms/step - accuracy: 0.0000e+00 - loss: nan - val_accuracy: 2.5793e-04 - val_loss: nan
Epoch 3/10
243/243 ━━━━━━━━━━━━━━━━━━━━ 3s 9ms/step - accuracy: 0.0000e+00 - loss: nan - val_accuracy: 2.5793e-04 - val_loss: nan
Epoch 4/10
243/243 ━━━━━━━━━━━━━━━━━━━━ 3s 10ms/step - accuracy: 0.0000e+00 - loss: nan - val_accuracy: 2.5793e-04 - val_loss: nan
Epoch 5/10
243/243 ━━━━━━━━━━━━━━━━━━━━ 3s 12ms/step - accuracy: 0.0000e+00 - loss: nan - val_accuracy: 2.5793e-04 - val_loss: nan
Epoch 6/10
243/243 ━━━━━━━━━━━━━━━━━━━━ 4s 9ms/step - accuracy: 0.0000e+00 - loss: nan - val_accuracy: 2.5793e-04 - val_loss: nan
Epoch 7/10

**Recommendation Simulation**

In [14]:
def simulate_recommendations(user_id, initial_video_id, steps=5):
    """Simulate a recommendation cascade for one user."""
    recommendations = [initial_video_id]

    # The model scores (user, video, content) triples and takes no input from
    # the current video, so the scores are identical at every step: compute once
    all_videos = np.arange(len(data))
    user_array = np.full(len(data), user_id)
    content_array = content_features.toarray()
    predictions = model.predict([user_array, all_videos, content_array])

    # Rank videos from highest to lowest predicted engagement
    ranked = np.argsort(predictions.flatten())[::-1]

    for _ in range(steps):
        # Take the best-scoring video that has not been recommended yet
        for idx in ranked:
            if idx not in recommendations:
                recommendations.append(idx)
                break

    return data.loc[recommendations]

# Example simulation
initial_video = 0  # First video in dataset
user_id = 42       # Random user
recommendation_sequence = simulate_recommendations(user_id, initial_video)
print(recommendation_sequence[['video_transcription_text', 'engagement_score']])

                               video_transcription_text  engagement_score
0     someone shared with me that drone deliveries a...          0.115038
6456  a colleague learned  from the news that singap...          0.338885
6457  a colleague learned  from the news that the av...          0.626649
6458  a colleague learned  from the news that there ...          0.336782
6459  a colleague learned  from the news that 4 of t...          0.574399
6460  a colleague learned  from the news that the fi...          0.351895
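
The Key Components section describes recommending from a top-*k* pool of similar content. A cascade in which each step depends on the video just watched could look like the sketch below. It is an illustration under assumptions, not the notebook's method: `similarity_cascade`, the toy feature vectors, and the toy engagement scores are hypothetical.

```python
import numpy as np

def similarity_cascade(features, scores, start, steps=3, k=2):
    """At each step, pick the highest-scoring unseen video among the k videos
    most cosine-similar to the current one."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)
    seen = [start]
    current = start
    for _ in range(steps):
        sims = unit @ unit[current]
        sims[seen] = -np.inf                     # exclude videos already shown
        pool = np.argsort(sims)[::-1][:k]        # top-k similar content pool
        pool = [int(i) for i in pool if np.isfinite(sims[i])]
        if not pool:
            break                                # nothing left to recommend
        current = max(pool, key=lambda i: scores[i])
        seen.append(current)
    return seen

# Toy content features (two topics) and predicted engagement per video
features = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0], [0.1, 0.9]])
scores = np.array([0.1, 0.2, 0.9, 0.8, 0.3])
path = similarity_cascade(features, scores, start=0)
print(path)
```

In the notebook, `features` would correspond to the dense TF-IDF rows and `scores` to the model's predictions for one user.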


**Reinforcement Learning Component**

In [15]:
import gym
import numpy as np
from gym import spaces

class TikTokEnv(gym.Env):
    """Classic Gym API: reset() returns an observation, step() returns a 4-tuple."""

    def __init__(self, df, model, content_features):
        super().__init__()
        # Keep references on the instance instead of relying on notebook globals
        self.data = df
        self.model = model
        self.content_features = content_features
        self.user_embedding_layer = model.get_layer('user_embedding')

        # Action space: choose any video
        self.action_space = spaces.Discrete(len(df))

        # Observation space: user embedding + last video features.
        # Embedding weights are unbounded, so the box cannot be limited to [0, 1]
        self.observation_space = spaces.Box(
            low=-np.inf, high=np.inf,
            shape=(self.user_embedding_layer.output_dim + content_features.shape[1],),
            dtype=np.float32
        )

        self.current_user = None
        self.last_video = None
        self.recommendation_history = []

    def _get_state(self):
        """Concatenate the user's embedding with the last video's content features."""
        user_embedding = self.user_embedding_layer(
            np.array([self.current_user])).numpy().flatten()
        content_embedding = self.content_features[self.last_video].toarray().flatten()
        return np.concatenate([user_embedding, content_embedding]).astype(np.float32)

    def reset(self):
        # randint's upper bound is exclusive, so add 1 to include the largest user ID
        self.current_user = np.random.randint(0, self.data['user_id'].max() + 1)
        self.last_video = np.random.randint(0, len(self.data))
        self.recommendation_history = [self.last_video]
        return self._get_state()

    def step(self, action):
        # Action is the video ID to recommend
        video_id = action

        # Get predicted engagement
        user_array = np.array([self.current_user])
        video_array = np.array([video_id])
        content_array = self.content_features[video_id].toarray()

        predicted_engagement = self.model.predict(
            [user_array, video_array, content_array])[0][0]

        # Reward is the predicted engagement score
        reward = predicted_engagement

        # Update state
        self.last_video = video_id
        self.recommendation_history.append(video_id)

        # Simple termination condition
        done = len(self.recommendation_history) > 10

        return self._get_state(), reward, done, {}
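
An environment with this `reset`/`step` interface is driven by a rollout loop. The sketch below shows that loop with a dependency-free stand-in so it can run without the trained model; `StubEnv`, its reward rule, and both policies are hypothetical and only mimic the 4-tuple interface used above.

```python
import random

class StubEnv:
    """Hypothetical stand-in with the same reset/step signature as TikTokEnv."""
    def __init__(self, n_videos=5, max_steps=10):
        self.n_videos = n_videos
        self.max_steps = max_steps
        self.steps_taken = 0

    def reset(self):
        self.steps_taken = 0
        return 0  # dummy observation

    def step(self, action):
        self.steps_taken += 1
        reward = action / (self.n_videos - 1)  # pretend higher IDs engage more
        done = self.steps_taken >= self.max_steps
        return action, reward, done, {}

def run_episode(env, policy):
    """Roll out one episode and return the total reward collected."""
    obs = env.reset()
    total, done = 0.0, False
    while not done:
        action = policy(obs)
        obs, reward, done, _ = env.step(action)
        total += reward
    return total

def random_policy(obs):
    return random.randrange(5)

def greedy_policy(obs):
    return 4  # always recommend the video with the highest stub reward

greedy_return = run_episode(StubEnv(), greedy_policy)
random_return = run_episode(StubEnv(), random_policy)
print(greedy_return, random_return)
```

Comparing a random policy with an engagement-maximising one in this way is a simple baseline before training an RL agent on the real environment.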

**Analysis and Reporting**

In [17]:
def analyze_recommendation_patterns(recommendation_sequences):
    """Analyze how recommendations evolve"""
    results = []

    for seq in recommendation_sequences:
        initial_content = seq.iloc[0]['video_transcription_text']
        similar_count = 0

        # Simple similarity check: count shared words (could use NLP similarity metrics)
        initial_tokens = set(initial_content.lower().split())
        for i in range(1, len(seq)):
            current_tokens = set(seq.iloc[i]['video_transcription_text'].lower().split())
            if len(initial_tokens & current_tokens) > 2:  # More than 2 common words
                similar_count += 1
                similar_count += 1

        results.append({
            'initial_content': initial_content,
            'similar_recommendations': similar_count,
            'total_recommendations': len(seq)-1,
            'percentage_similar': similar_count/(len(seq)-1)
        })

    return pd.DataFrame(results)

# Example analysis
sequences = [simulate_recommendations(u, 0) for u in range(10)]  # 10 users
analysis_results = analyze_recommendation_patterns(sequences)
print(analysis_results[['similar_recommendations', 'total_recommendations', 'percentage_similar']].mean())

# Print a summary in the style of the Output Simulation example
avg_similar = analysis_results['percentage_similar'].mean()
print(f"\nTeens who engage with 1 video receive {avg_similar:.0%} similar "
      f"recommendations within 5 steps, creating filter bubbles.")
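
The shared-word check counts any overlap, including stock openers such as "someone shared with me that", which appear in many transcriptions in this dataset. A normalised measure such as Jaccard similarity is one alternative. The sketch below is an assumption-laden illustration: `jaccard_similarity` is a hypothetical helper, and any cut-off for calling two videos "similar" would still need tuning.

```python
def jaccard_similarity(text_a, text_b):
    """Share of distinct words the two texts have in common, from 0 to 1."""
    tokens_a = set(text_a.lower().split())
    tokens_b = set(text_b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0  # two empty texts: define similarity as 0 to avoid 0/0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

# Both texts share a five-word opener but differ in topic
text_a = "someone shared with me that drone deliveries are growing"
text_b = "someone shared with me that microbes outnumber people"
similarity = jaccard_similarity(text_a, text_b)
print(round(similarity, 3))
```

The two example texts share five words, which the raw count would flag as similar, while the Jaccard score stays below 0.5 because the topic words differ.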