# 🎯 Simulating Random Spamming Attacks on a Ratings Dataset

## 📄 Dataset Overview

This notebook uses the `ratings.dat` dataset, where each row represents one user's rating of one movie, with fields separated by `::`:

`UserID::MovieID::Rating::Timestamp::NormalizedOverall`

- **UserID**: Integer between 1 and 6040
- **MovieID**: Integer between 1 and 3952
- **Rating**: Whole number from 1 to 5
- **Timestamp**: Seconds since the Unix epoch
- **NormalizedOverall**: Normalized rating on a [0, 1] scale
- Each user has rated **at least 20 items**
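
The properties listed above can be checked quickly once the data is loaded. The following is a minimal sketch, not part of the original notebook: `summarize_ratings` is a hypothetical helper, and the toy rows are made up purely to illustrate the column layout.

```python
import pandas as pd

def summarize_ratings(df: pd.DataFrame) -> dict:
    """Summarize the dataset properties for a ratings DataFrame (hypothetical helper)."""
    ratings_per_user = df.groupby("UserID").size()
    return {
        "rating_range": (int(df["Rating"].min()), int(df["Rating"].max())),
        # between() is inclusive on both ends by default
        "normalized_in_unit_interval": bool(df["NormalizedOverall"].between(0, 1).all()),
        "min_ratings_per_user": int(ratings_per_user.min()),
    }

# Toy rows in the same column layout (values are illustrative only).
toy = pd.DataFrame(
    [
        [1, 1193, 5, 978300760, 1.0],
        [1, 661, 3, 978302109, 0.5],
        [2, 1357, 1, 978298709, 0.0],
        [2, 3068, 4, 978299000, 0.75],
    ],
    columns=["UserID", "MovieID", "Rating", "Timestamp", "NormalizedOverall"],
)
summary = summarize_ratings(toy)
print(summary)
```

On the real dataset, `min_ratings_per_user` would be expected to be at least 20 if the property above holds.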

---

## 🧪 Goal of This Notebook

The objective is to **simulate a random spamming attack** on this dataset. This attack represents a common type of noise or malicious behavior in rating systems, where spammers add non-informative or random feedback.

---

## 🔥 Attack Model: Random Spamming

Each simulated spammer behaves as follows:

- **Rating values**: Randomly chosen from a uniform distribution over the rating scale {1, 2, 3, 4, 5}.
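- **Number of ratings per spammer**: Sampled from a **Poisson distribution** with λ = 20, plus 1 so that every spammer submits at least one rating. (The functions below default to `lambda_poisson=5`; the final run passes `lambda_poisson=20` explicitly.)
- **Items rated**: Sampled without replacement according to the **item popularity distribution** of the real dataset, i.e., more frequently rated items are more likely to be selected.
- **Timestamp and normalization**: Each spam rating is stamped with the current time, and its value is mapped onto [0, 1] as `(rating - 1) / 4`.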
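
The behavior of a single spammer can be sketched as follows. This is an illustrative sketch under stated assumptions, not the notebook's implementation: `sketch_one_spammer` is a hypothetical name, the 100-item catalogue and its rating counts are invented, and NumPy's `default_rng` generator is used here instead of the legacy `np.random` functions.

```python
import numpy as np

def sketch_one_spammer(rng, item_ids, item_probs, lam=20, max_rating=5):
    """Draw one spammer's items and ratings (hypothetical helper)."""
    # Poisson draw plus 1 guarantees at least one rating; the cap avoids asking
    # for more distinct items than the catalogue contains.
    n_ratings = min(rng.poisson(lam) + 1, len(item_ids))
    # Popularity-weighted sampling without replacement: no item is rated twice.
    items = rng.choice(item_ids, size=n_ratings, replace=False, p=item_probs)
    # Uniform integer ratings in 1..max_rating (upper bound is exclusive).
    ratings = rng.integers(1, max_rating + 1, size=n_ratings)
    # Map 1..max_rating onto [0, 1].
    normalized = (ratings - 1) / (max_rating - 1)
    return items, ratings, normalized

rng = np.random.default_rng(0)        # seed chosen only to make the sketch repeatable
item_ids = np.arange(1, 101)          # assumed toy catalogue of 100 items
counts = np.arange(100, 0, -1)        # assumed rating counts: item 1 is the most popular
item_probs = counts / counts.sum()

items, ratings, normalized = sketch_one_spammer(rng, item_ids, item_probs)
print(len(items), ratings[:5], normalized[:5])
```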

---

## ⚙️ Configurable Parameter

- Rather than fixing the number of spammers, we set it as a **proportion** of the number of real users (e.g., 10%), rounded up to a whole number.

This ensures the attack is **scalable and adaptable** to datasets of different sizes.
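
As a small worked example of the proportion-to-count conversion, the sketch below assumes 6040 distinct users (the upper bound of the UserID range given earlier; the actual count is whatever `nunique()` reports on the loaded data). `spammer_count` is a hypothetical helper name.

```python
import math

def spammer_count(total_users: int, ratio: float) -> int:
    """Number of spammers for a given proportion of real users, rounded up."""
    return math.ceil(ratio * total_users)

TOTAL_USERS = 6040  # assumption: every UserID in 1..6040 appears in the data

for ratio in (0.10, 0.30, 0.50, 0.70):
    print(f"{ratio:.0%} of {TOTAL_USERS} users: {spammer_count(TOTAL_USERS, ratio)} spammers")
```

Rounding up means that even a very small ratio on a small dataset yields at least one spammer.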

---

## 📦 Outcome

- The notebook will output:
  - The original dataset with **spam ratings injected**.
  - A separate DataFrame containing only the spammer ratings.
  - CSV files of both, one pair per spam ratio, saved under `spam_versions/` by default.

---

Let's get started! 🚀





**1. Import**

In [8]:
import pandas as pd
import numpy as np
from datetime import datetime
import os

**2. Load Dataset**

In [1]:
def load_dataset(path="ratings.dat"):
    # Fields are separated by "::"; a multi-character separator needs the Python parsing engine.
    return pd.read_csv(path, sep="::", engine="python",
                       names=["UserID", "MovieID", "Rating", "Timestamp", "NormalizedOverall"])

**3. Item Popularity**

In [2]:
def compute_item_popularity(df):
    return df["MovieID"].value_counts(normalize=True)
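
To see what `value_counts(normalize=True)` produces, here is a toy illustration with invented MovieIDs (not taken from the dataset). The result is a Series indexed by MovieID, sorted from most to least rated, whose values sum to 1 and can therefore be used directly as sampling probabilities.

```python
import pandas as pd

# Six toy ratings: movie 10 rated three times, movie 20 twice, movie 30 once.
toy = pd.DataFrame({"MovieID": [10, 10, 10, 20, 20, 30]})

popularity = toy["MovieID"].value_counts(normalize=True)
print(popularity)
print("sum of probabilities:", popularity.sum())
```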

**4. Spammer Simulation**

In [3]:
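def simulate_random_spammers(num_spammers, item_popularity_dist, user_id_start, lambda_poisson=5):
    spam_data = []
    item_ids = item_popularity_dist.index.tolist()
    item_probs = item_popularity_dist.values
    max_rating = 5

    for i in range(num_spammers):
        # Spammer IDs continue consecutively from user_id_start.
        user_id = user_id_start + i
        # Poisson draw plus 1 guarantees at least one rating; cap at the catalogue size
        # because sampling without replacement cannot return more items than exist.
        num_ratings = min(np.random.poisson(lam=lambda_poisson) + 1, len(item_ids))
        # Popularity-weighted sampling without replacement: no item is rated twice.
        sampled_items = np.random.choice(item_ids, size=num_ratings, replace=False, p=item_probs)

        for movie_id in sampled_items:
            # randint excludes the upper bound, so this draws uniformly from 1..5.
            rating = np.random.randint(1, max_rating + 1)
            timestamp = int(datetime.now().timestamp())
            # Map 1..5 onto [0, 1].
            normalized = (rating - 1) / (max_rating - 1)
            spam_data.append([user_id, movie_id, rating, timestamp, normalized])

    return pd.DataFrame(spam_data, columns=["UserID", "MovieID", "Rating", "Timestamp", "NormalizedOverall"])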
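
The simulation draws from NumPy's global random state, so repeated runs produce different spam sets. If reproducible output is wanted (the notebook itself does not seed), one option is to call `np.random.seed` before simulating. The sketch below shows the effect on Poisson draws; `draw_counts` is a hypothetical helper and the seed values are arbitrary.

```python
import numpy as np

def draw_counts(seed, n=5, lam=5):
    """Seed the global NumPy state, then draw n 'ratings per spammer' counts."""
    np.random.seed(seed)
    return (np.random.poisson(lam=lam, size=n) + 1).tolist()

first = draw_counts(42)
second = draw_counts(42)   # same seed: identical draws
print(first)
print(second)
print("identical:", first == second)
```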

**5. Combine Real Data + Spam**

In [5]:
def add_spammers_to_dataset(df, spammer_ratio=0.1, lambda_poisson=5):
    total_users = df["UserID"].nunique()
    num_spammers = int(np.ceil(spammer_ratio * total_users))
    item_popularity = compute_item_popularity(df)
    user_id_start = df["UserID"].max() + 1
    spam_df = simulate_random_spammers(num_spammers, item_popularity, user_id_start, lambda_poisson)
    return pd.concat([df, spam_df], ignore_index=True), spam_df
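
Because spammer IDs start at the largest real `UserID` plus one, real and fake users never collide, and the spam rows can be recovered from the combined frame by thresholding on `UserID`. A toy illustration with invented rows (names such as `first_spam_id` are hypothetical):

```python
import pandas as pd

cols = ["UserID", "MovieID", "Rating", "Timestamp", "NormalizedOverall"]

# Two toy real ratings and two toy spam ratings (values are illustrative only).
real = pd.DataFrame([[1, 10, 4, 0, 0.75], [2, 20, 5, 0, 1.0]], columns=cols)
first_spam_id = real["UserID"].max() + 1
spam = pd.DataFrame(
    [[first_spam_id, 10, 1, 0, 0.0], [first_spam_id + 1, 20, 3, 0, 0.5]],
    columns=cols,
)

combined = pd.concat([real, spam], ignore_index=True)

# Real and spam user IDs do not overlap.
ids_disjoint = set(real["UserID"]).isdisjoint(set(spam["UserID"]))
# Spam rows are exactly those whose UserID is at or above the first spam ID.
is_spam = combined["UserID"] >= first_spam_id
print(ids_disjoint, int(is_spam.sum()))
```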


**6. Generate Multiple CSVs**

In [4]:
def generate_spam_datasets(base_df, ratios, lambda_poisson=5, output_dir="spam_versions"):
    os.makedirs(output_dir, exist_ok=True)
    for ratio in ratios:
        combined_df, spam_df = add_spammers_to_dataset(base_df, spammer_ratio=ratio, lambda_poisson=lambda_poisson)
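        # Round before converting so floating-point error cannot truncate the label to one percent less.
        percent = int(round(ratio * 100))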
        combined_df.to_csv(f"{output_dir}/ratings_with_{percent}percent_spam.csv", index=False)
        spam_df.to_csv(f"{output_dir}/spam_only_{percent}percent.csv", index=False)
        print(f"✔ Generated {percent}% spam version with {len(spam_df)} fake ratings.")

**7. Main Function**

In [9]:
if __name__ == "__main__":
    df_real = load_dataset("/home/martimsbaltazar/Desktop/tese/datasets/ml-1m/normalized_ratings.dat")

    spam_ratios = [0.10, 0.30, 0.50, 0.70]  
    generate_spam_datasets(df_real, spam_ratios, lambda_poisson=20)

✔ Generated 10% spam version with 12517 fake ratings.
✔ Generated 30% spam version with 37798 fake ratings.
✔ Generated 50% spam version with 63297 fake ratings.
✔ Generated 70% spam version with 88807 fake ratings.
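
As a rough sanity check on these counts: each spammer contributes Poisson(λ) + 1 ratings, so with n spammers the expected total is n × (λ + 1) with standard deviation √(n × λ). The sketch below compares that expectation with the printed totals, assuming 6040 distinct real users (an assumption based on the UserID range, not confirmed by the output) and λ = 20 as passed in the main cell.

```python
import math

TOTAL_USERS = 6040   # assumption: number of distinct real users
LAM = 20             # lambda_poisson used in the main cell

# Totals copied from the printed output above.
observed = {0.10: 12517, 0.30: 37798, 0.50: 63297, 0.70: 88807}

def expected_total(ratio, total_users=TOTAL_USERS, lam=LAM):
    """Return (spammers, expected ratings, standard deviation) for a spam ratio."""
    n = math.ceil(ratio * total_users)
    mean = n * (lam + 1)          # each spammer adds Poisson(lam) + 1 ratings
    std = math.sqrt(n * lam)      # the sum of n Poisson(lam) draws has variance n * lam
    return n, mean, std

for ratio, seen in observed.items():
    n, mean, std = expected_total(ratio)
    print(f"{ratio:.0%}: {n} spammers, expected {mean} ± {std:.0f}, observed {seen}")
```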
