# This code has dynamic allocation of resources.

In [None]:
'''🧠 Cell 1 – The Brain that Adapts: Dynamic Dask Initialization
Imagine you're about to load and analyze thousands of Steam game reviews — each review is rich, unstructured text. Now, if we throw all this data at a single processor, it’ll choke. But what if our code could intelligently adapt to the machine it's running on?

Welcome to Cell 1 — the brain of our pipeline. It starts with:

🧰 Toolbox Assembly
We load essential Python libraries:

psutil to peek into system resources.

dask.distributed to launch parallel computation across multiple cores.

sentence-transformers, transformers, tqdm, and more for embedding, similarity scoring, and progress bars.

🧮 Step 1: Measuring the Machine
We define get_system_resources(), a smart function that inspects:

🧠 Total RAM: How much memory do we have to work with?

🔧 CPU cores: How many brains can we use in parallel?

It makes two clever decisions:

It uses only 70% of total memory — to avoid crashes and leave room for the OS.

It calculates memory per worker by splitting that 70% evenly across a sensible number of worker processes (leaving one CPU core free).

🛠️ Step 2: Setting Up the Workers
Once we have the memory and CPU plan, we initialize a Dask LocalCluster:

Each worker is like a mini-computer that handles a chunk of the workload.

Each worker gets 2 threads and a specific memory limit.

This setup ensures predictable performance, tailored to the hardware.

📡 Step 3: Launching the Cluster
With Client(cluster), Dask connects to the cluster, and — boom — a dashboard link appears! This is your control tower, showing how tasks are distributed, how memory is used, and how fast everything is running.

You now have a flexible, dynamic, and scalable backbone for the rest of the pipeline.

🔍 TL;DR for Presentation Slide:
“This cell intelligently analyzes your system's RAM and CPU, dynamically spins up a custom-sized Dask cluster, and connects a live dashboard — so every run is hardware-optimized and ready for heavy review analysis.”'''
# Modified Cell 1: Dynamic resource allocation for Dask Client
import os
import json
import numpy as np
import pandas as pd
import psutil
from tqdm.auto import tqdm
from dask.distributed import Client, LocalCluster
import dask.dataframe as dd
import dask.bag as db
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from transformers import pipeline

# Dynamically determine system resources
def get_system_resources():
    # Total installed memory in GB (not the currently free amount)
    total_memory = psutil.virtual_memory().total / (1024**3)
    # Prefer physical cores; fall back to logical cores if unavailable
    cpu_count = psutil.cpu_count(logical=False)
    if not cpu_count:
        cpu_count = psutil.cpu_count(logical=True) or 1
    
    # Budget 70% of total memory for Dask; keep it as a float so small
    # machines are not truncated down to 0GB per worker
    dask_memory = total_memory * 0.7
    # Leave one core for the system, but always start at least one worker
    worker_count = max(1, cpu_count - 1)
    # Memory per worker in GB, rounded to two decimals
    memory_per_worker = round(dask_memory / worker_count, 2)
    
    return {
        'cpu_count': cpu_count,
        'worker_count': worker_count,
        'memory_per_worker': memory_per_worker,
        'total_memory': total_memory
    }

# Get system resources
resources = get_system_resources()
print(f"System has {resources['total_memory']:.1f}GB memory and {resources['cpu_count']} CPU cores")
print(f"Allocating {resources['worker_count']} workers with {resources['memory_per_worker']}GB each")

# Start a local Dask cluster with dynamically determined resources
cluster = LocalCluster(
    n_workers=resources['worker_count'],
    threads_per_worker=2,
    memory_limit=f"{resources['memory_per_worker']}GB"
)
client = Client(cluster)
print(f"Dashboard link: {client.dashboard_link}")
client
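
To make the sizing arithmetic concrete, here is a minimal sketch of the same 70% split written as a pure function, so it can be checked without psutil or Dask. The name `plan_workers` and its parameters are illustrative assumptions and are not part of the notebook.

```python
def plan_workers(total_memory_gb, cpu_count, memory_fraction=0.7):
    """Return (worker_count, memory_per_worker_gb) for a given machine.

    Hypothetical helper mirroring get_system_resources(), with the
    hardware passed in as arguments instead of being read from psutil.
    """
    # Leave one core for the system, but never go below one worker
    worker_count = max(1, cpu_count - 1)
    # Split the memory budget evenly and keep fractions of a GB
    memory_per_worker = round(total_memory_gb * memory_fraction / worker_count, 2)
    return worker_count, memory_per_worker


# A 32GB machine with 8 cores: 7 workers share a 22.4GB budget
print(plan_workers(32, 8))
# A single-core machine still gets one worker
print(plan_workers(4, 1))
```

Keeping the per-worker figure as a float matters on small machines, where integer truncation could round the limit down to zero.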

In [None]:
'''🎯 Cell 2 – Building the Knowledge Base: Game Themes & Semantic Fingerprints
Now that we’ve built a scalable engine in Cell 1, it’s time to feed it something meaningful — the intelligence that powers topic detection for each game.

This cell is all about loading the brains of the operation: the themes per game and their semantic representations.

📚 Step 1: Load the Theme Dictionary
We start by opening a file called game_themes.json. Think of this as a manual provided by indie devs or curators — it lists themes for each game, such as:

{
  "123456": {
    "story": ["plot", "narrative", "characters"],
    "visuals": ["art", "graphics", "color"]
  }
}
Each Steam appid is mapped to themes, and each theme has a set of seed keywords that represent it. This forms the semantic space of what each game cares about.

We store this in a Python dictionary: GAME_THEMES.

🧠 Step 2: Initialize the SBERT Model
Here comes the transformer — SentenceTransformer('all-MiniLM-L6-v2').

This model takes in keywords or phrases and converts them into dense numerical vectors — basically, semantic fingerprints that machines can compare.

This is crucial, because later we’ll compare review texts to these fingerprints to figure out:
📌 “Is this review talking about story, visuals, or gameplay?”

⚡ Step 3: Embedding Only What You Need
Rather than encoding all themes for all games upfront (which would be wasteful and memory-heavy), we define a smart function:

def get_theme_embeddings(app_ids):
This function:

Takes a list of specific game IDs.

For each game:

Loops through its themes.

For each theme, embeds all its seed keywords using SBERT.

Averages these vectors to get a single theme vector.

Stacks all the theme vectors together for that game.

This way, we get a compact matrix of theme embeddings per game — just in time and just for the data we’re processing. No memory bloat. ⚙️

💡 Why This Matters:
This step gives us a shared language between reviews and themes. Without this, we’d just have a pile of raw text. With it, we can ask:

“Which predefined theme is this review most similar to?”

🧾 TL;DR for Presentation Slide:
“We load curated theme keywords for each game and convert them into semantic embeddings using SBERT. These vectors form the core reference that allows us to match incoming reviews to meaningful themes — efficiently and on-demand.”'''
# Cell 2: Load Theme Dictionary & Optimize Theme Embeddings
# Load per-game theme keywords
with open('game_themes.json', 'r') as f:
    raw = json.load(f)
GAME_THEMES = {int(appid): themes for appid, themes in raw.items()}

# Initialize SBERT embedder
embedder = SentenceTransformer('all-MiniLM-L6-v2')

# Function to get theme embeddings for specific app IDs
# This avoids loading all embeddings at once
def get_theme_embeddings(app_ids):
    """Get theme embeddings for a specific set of app IDs"""
    embeddings = {}
    for appid in app_ids:
        if appid not in embeddings and appid in GAME_THEMES:
            emb_list = []
            for theme, seeds in GAME_THEMES[appid].items():
                seed_emb = embedder.encode(seeds, convert_to_numpy=True)
                emb_list.append(seed_emb.mean(axis=0))
            embeddings[appid] = np.vstack(emb_list)
    return embeddings
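
The averaging step can be illustrated without loading SBERT. The sketch below uses a toy `toy_encode` function as a stand-in for `embedder.encode` (an assumption purely for illustration; real embeddings are 384-dimensional vectors from the model) and shows how each theme collapses to one mean vector, with rows kept in the dictionary's key order.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def build_theme_matrix(themes, encode):
    """Return one averaged vector per theme, in the dict's key order."""
    return [
        mean_vector([encode(seed) for seed in seeds])
        for seeds in themes.values()
    ]


# Toy stand-in for embedder.encode (hypothetical): maps a keyword to a
# 2-D vector of (length, vowel count)
def toy_encode(word):
    return [float(len(word)), float(sum(ch in "aeiou" for ch in word))]


themes = {"story": ["plot", "narrative"], "visuals": ["art", "color"]}
matrix = build_theme_matrix(themes, toy_encode)
print(matrix)
```

Because the rows follow the theme dictionary's insertion order, a row index can later be mapped back to a theme name by position.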

In [None]:
'''🧱 Cell 3 – Smart Data Loading: The Right Bite for Every File
Our pipeline so far has been resource-aware (Cell 1) and meaning-aware (Cell 2). Now it’s time to dive into the actual review data — thousands of user thoughts spread across dozens of .parquet files.

But here’s the twist:

You don’t want to gulp all the data at once… you want to bite just the right amount.

This cell is where smart, adaptive data ingestion happens.

📏 Step 1: Peek Before You Leap
We define a function estimate_dataset_size() — it walks through the review directory and adds up the file sizes of all .parquet files:

This helps us measure the total dataset size in GB.

Why? Because memory issues can sneak in if we blindly read everything with default settings.

Example output:

Estimated dataset size: 21.4GB
🧮 Step 2: Choose the Right Block Size
Here’s where the code gets smart — it adjusts how data is loaded based on size:

🔹 Big dataset (>100GB) → use tiny chunks (16MB)

🔸 Medium dataset (>10GB) → moderate chunks (32MB)

⚪ Small dataset (≤10GB) → larger chunks (64MB)

Why does this matter?

Smaller chunks mean less memory strain and better parallelism.

Larger chunks are fine when memory isn't tight.

It's like loading a truck:

For heavy cargo, you use more trips with smaller boxes. For lighter loads, you pack more in each trip.

🛠️ Step 3: Load with Dask using Dynamic Blocksize
We now read the review data using:

dd.read_parquet(..., blocksize=blocksize)
The data is read as a Dask DataFrame, meaning it's still lazy and parallelized.

We only pick essential columns for the analysis:

steam_appid – which game?

review – the actual text

review_language – filter for English later

voted_up – thumbs up or down?

The result: a memory-efficient, scalable review loader whose partition size adapts to the size of the dataset.

🧾 TL;DR for Presentation Slide:
“This cell estimates the dataset size and adjusts the read blocksize accordingly. Smaller blocks on larger datasets reduce the risk of memory overload and keep the .parquet reviews streaming efficiently into a Dask DataFrame.”'''
# Modified Cell 3: Dynamic blocksize for reading Parquet Files
# Estimate dataset size first
def estimate_dataset_size(path):
    import os
    total_size = 0
    for file in os.listdir(path):
        if file.endswith('.parquet'):
            file_path = os.path.join(path, file)
            total_size += os.path.getsize(file_path)
    return total_size / (1024**3)  # Convert to GB

# Estimate dataset size
dataset_path = 'parquet_output_theme_combo'
estimated_size = estimate_dataset_size(dataset_path)
print(f"Estimated dataset size: {estimated_size:.2f}GB")

# Pick the blocksize from the estimated dataset size (system memory is not consulted here)
# Use smaller blocks for larger datasets to prevent memory issues
if estimated_size > 100:  # Very large dataset
    blocksize = '16MB'
elif estimated_size > 10:  # Medium-large dataset
    blocksize = '32MB'
else:  # Smaller dataset
    blocksize = '64MB'

print(f"Using dynamic blocksize: {blocksize}")

# Read with dynamic blocksize
ddf = dd.read_parquet(
    f'{dataset_path}/*.parquet',
    columns=['steam_appid', 'review', 'review_language', 'voted_up'],
    blocksize=blocksize
)
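
The size thresholds above can be captured in a small function, which makes them easy to test and reuse. This is a sketch only; the name `choose_blocksize` is an assumption, and the thresholds are the ones used in this cell.

```python
def choose_blocksize(size_gb):
    """Map an estimated dataset size in GB to a read blocksize string."""
    if size_gb > 100:   # very large dataset: smallest blocks
        return '16MB'
    if size_gb > 10:    # medium-large dataset
        return '32MB'
    return '64MB'       # 10GB or less


for size in (21.4, 150, 10):
    print(f"{size}GB -> {choose_blocksize(size)}")
```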

In [None]:
'''🧹 Cell 4 – Filtering the Noise: Clean Reviews, Clear Insights
Now that the raw reviews are set up as a lazy Dask DataFrame (chunked, distributed, and ready, though nothing is actually read until we compute), it’s time to clean the data pipeline.

Because let’s face it:

Not all reviews are useful. Some are in different languages. Some are empty. Some are just noise.

This cell is the data janitor — quietly ensuring we only work with reviews that matter.

🌐 Step 1: Keep It English
ddf = ddf[ddf['review_language'] == 'english']
Steam users write reviews in dozens of languages. But our theme keywords and SBERT model are trained on English — mixing in other languages would dilute and confuse our embeddings.

So here, we filter out all non-English reviews, keeping only those with:

'review_language' == 'english'
🕳️ Step 2: Drop Empty Reviews
ddf = ddf.dropna(subset=['review'])
Sometimes the review field is missing or null. Such rows offer no insight and no words.

So we call .dropna() to remove every row where the 'review' column is missing. Note that .dropna() only catches nulls: an empty or whitespace-only string is not null and needs its own filter. Cleaning here ensures:

Every row we embed has a review value.

No wasted compute cycles or model calls on missing data.

💡 Why This Is Important:
This cleaning step happens before any heavy embedding or summarization.

It prevents wasted GPU/CPU usage on irrelevant rows.

It boosts accuracy and efficiency — we only analyze real, readable content.

🧾 TL;DR for Presentation Slide:
“This cell filters the dataset to include only valid English reviews with non-empty text. It eliminates irrelevant or malformed rows, ensuring that our pipeline processes only meaningful content.”'''
# Cell 4: Filter & Clean Data
# Keep only English reviews, then drop missing or whitespace-only text
ddf = ddf[ddf['review_language'] == 'english']
ddf = ddf.dropna(subset=['review'])
ddf = ddf[ddf['review'].str.strip() != '']
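
The filtering rule can be stated as a single predicate over one record. The sketch below applies it to plain dictionaries rather than a Dask DataFrame, purely to show which rows survive; `is_usable_review` and the sample records are hypothetical.

```python
def is_usable_review(record):
    """True for English reviews whose text is present and not blank."""
    text = record.get("review")
    return (
        record.get("review_language") == "english"
        and isinstance(text, str)
        and text.strip() != ""
    )


# Made-up sample rows covering each rejection case
records = [
    {"review": "Great story and art", "review_language": "english"},
    {"review": "Sehr gutes Spiel", "review_language": "german"},
    {"review": None, "review_language": "english"},
    {"review": "   ", "review_language": "english"},
]
kept = [r for r in records if is_usable_review(r)]
print(len(kept), "of", len(records), "records kept")
```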

In [None]:
'''🧠 Cell 5 – The Matchmaker: Assigning Reviews to Game-Specific Themes
Until now, we’ve:

Loaded and cleaned data.

Parsed themes and embedded them.

Set up a Dask engine to scale it all.

But here comes the magic moment — the point where we ask:

“What is this review really talking about?”

This cell is the theme assignment engine, assigning each review to the most relevant topic — and it does it intelligently, efficiently, and in parallel.

🧩 Step 1: Partition-wise Processing
def assign_topic(df_partition):
Dask splits your huge dataset into smaller partitions, each processed separately. This function works on each chunk independently, which is:

More scalable.

Easier on memory.

Inherently parallel.

The first check:

if df_partition.empty: return as-is
This avoids wasting compute on empty chunks.

🎮 Step 2: Game-Specific Theme Lookup
app_ids = df_partition['steam_appid'].unique().tolist()
Each review belongs to a specific Steam game. Instead of loading all game embeddings, we fetch only the theme embeddings relevant to this partition:

local_theme_embeddings = get_theme_embeddings(app_ids)
This keeps memory low and speeds up lookup.

🧠 Step 3: Semantic Comparison
We embed all reviews in this chunk in a single batch:

review_embeds = embedder.encode(reviews, batch_size=64, convert_to_numpy=True)
Then, for each review:

We find its appid.

We fetch the game’s theme embeddings.

We compute cosine similarity between the review vector and each theme vector.

We assign the theme with the highest similarity as the topic_id.

If for some reason the game’s themes are missing (edge case), we fallback to:

topic_ids.append(0)
🧾 Step 4: Return Result with Topic Labels
df_partition['topic_id'] = topic_ids
Each row now has a topic_id, telling us which theme this review is most aligned with.

Then we apply it across all partitions:

ddf.map_partitions(assign_topic, meta=meta)
We also specify the output structure using meta, to make Dask’s lazy execution happy.

⚙️ Why This Is Special:
✅ It’s game-specific: different games have different theme vocabularies.

✅ It’s parallel: partitions are processed concurrently, as many at a time as the workers have threads.

✅ It’s batch-embedded: maximizes GPU/CPU efficiency.

✅ It adds a critical column: topic_id — the foundation for aggregation, summarization, and insights.

🧾 TL;DR for Presentation Slide:
“Each review is semantically compared to the themes of its game. This cell assigns a topic_id to every review by finding the closest matching theme vector using cosine similarity — all done in parallel across Dask partitions.”'''
# Cell 5: Optimized Partition-wise Topic Assignment
def assign_topic(df_partition):
    """Assign topics using only theme embeddings for app IDs in this partition"""
    # If no rows, return the partition with an empty int64 topic_id column
    # so its dtype matches the declared meta
    if df_partition.empty:
        return df_partition.assign(topic_id=pd.Series(dtype='int64'))
    
    # Get unique app IDs in this partition
    app_ids = df_partition['steam_appid'].unique().tolist()
    app_ids = [int(appid) for appid in app_ids]
    
    # Get embeddings only for app IDs in this partition
    local_theme_embeddings = get_theme_embeddings(app_ids)
    
    reviews = df_partition['review'].tolist()
    # Compute embeddings in one go with batching
    review_embeds = embedder.encode(reviews, convert_to_numpy=True, batch_size=64)
    
    # Assign each review to its game-specific theme
    topic_ids = []
    for idx, appid in enumerate(df_partition['steam_appid']):
        appid = int(appid)
        if appid in local_theme_embeddings:
            theme_embs = local_theme_embeddings[appid]
            sims = cosine_similarity(review_embeds[idx:idx+1], theme_embs)
            topic_ids.append(int(sims.argmax()))
        else:
            # Default topic if theme embeddings not available
            topic_ids.append(0)
    
    # Return a new frame instead of mutating the input partition in place
    return df_partition.assign(topic_id=np.asarray(topic_ids, dtype='int64'))

# Apply to each partition; specify output metadata
meta = ddf._meta.assign(topic_id=np.int64())
ddf_with_topic = ddf.map_partitions(assign_topic, meta=meta)
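
For readers who want to see what the cosine-similarity and argmax step does numerically, here is a minimal plain-Python sketch. The helper names `cosine` and `closest_theme` and the 2-D toy vectors are assumptions for illustration; the notebook itself uses scikit-learn's `cosine_similarity` on SBERT embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def closest_theme(review_vec, theme_vecs):
    """Index of the theme vector most similar to the review vector."""
    sims = [cosine(review_vec, theme) for theme in theme_vecs]
    # On ties, the first maximum wins, matching argmax behaviour
    return max(range(len(sims)), key=sims.__getitem__)


# Two toy theme vectors pointing along different axes
theme_vecs = [[1.0, 0.0], [0.0, 1.0]]
print(closest_theme([0.9, 0.2], theme_vecs))
print(closest_theme([0.1, 0.8], theme_vecs))
```

The returned index is only meaningful together with the theme order of the same game, since each game has its own theme list.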

In [None]:
'''📊 Cell 6 – Smart Batching for Scalable Review Aggregation
We’ve now reached a powerful phase in our pipeline — we’ve cleaned the data, labeled each review with a topic_id, and embedded everything efficiently.

Now it’s time to zoom out:

"How are reviews distributed across themes? What topics are loved? Which are controversial?"

But there’s a challenge:
Each game (each steam_appid) has hundreds or thousands of reviews. If we try to aggregate all app IDs at once, we could overwhelm memory — especially on larger datasets.

So this cell introduces something clever:

⚖️ Step 1: Adaptive Batching
unique_app_ids = ddf['steam_appid'].unique().compute()
We compute the number of distinct Steam games present.

Then we dynamically decide a batch size:

if total_app_ids > 1000:
    batch_size = 3
elif total_app_ids > 500:
    batch_size = 5
...
🔁 This means:

For huge datasets → small batches to avoid memory overflow.

For smaller datasets → larger batches for speed.

📦 Step 2: Batch-by-Batch Processing
We now process game IDs in manageable chunks:

for i in range(0, len(unique_app_ids), batch_size):
In each iteration:

We isolate reviews for a few games using .isin(batch_app_ids).

We group by both steam_appid and topic_id to calculate:

📈 review_count – how many reviews belong to this theme

👍 likes_sum – how many of them were upvoted

These values help us measure engagement and sentiment toward each theme.

🧾 Step 3: Collecting Raw Reviews for Summarization
We also collect the raw review text, grouped by game and topic:

reviews_series = batch_ddf.groupby(['steam_appid', 'topic_id'])['review'] \
    .apply(lambda x: list(x), meta=('review', object))
This allows future steps (like summarization) to read all reviews related to a specific theme for a game.

⚙️ Step 4: Parallel Execution & Final Assembly
agg_df, reviews_df = dd.compute(agg, reviews_series)
Dask executes both operations in parallel — leveraging our cluster — and we:

Flatten the results with .reset_index()

Rename the review list column to Reviews

Append everything to the final result lists

This gives us:

all_agg_dfs: count-based summaries

all_review_dfs: theme-wise grouped raw review text

Both ready for visualization or summarization.

📌 TL;DR for Presentation Slide:
“This cell dynamically adjusts batch size based on dataset scale, then processes each game in chunks — aggregating review counts and upvotes per theme while also collecting raw reviews for future summarization. It balances memory use and compute power for scalable insight extraction.”'''
# Modified Cell 6: Dynamic batch sizing for aggregation
# Get unique app IDs
unique_app_ids = ddf['steam_appid'].unique().compute()
total_app_ids = len(unique_app_ids)

# Pick the batch size from the number of app IDs (system memory is not consulted here)
# For larger datasets, use smaller batches to avoid memory issues
if total_app_ids > 1000:  # Very large number of app IDs
    batch_size = 3
elif total_app_ids > 500:  # Medium-large number
    batch_size = 5
elif total_app_ids > 100:  # Medium number
    batch_size = 10
else:  # Smaller number
    batch_size = 20

print(f"Processing {total_app_ids} unique app IDs with batch size {batch_size}")

# Initialize empty dataframes for results
all_agg_dfs = []
all_review_dfs = []

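# Process in dynamically sized batches.
# Note: ddf_with_topic is lazy, so each batch re-runs the upstream steps
# (including embedding) unless it is persisted first, memory permitting.
for i in tqdm(range(0, len(unique_app_ids), batch_size)):
    # Positional slice of the pandas Series, converted to a plain list for isin()
    batch_app_ids = unique_app_ids.iloc[i:i+batch_size].tolist()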
    
    # Filter data for this batch of app IDs
    batch_ddf = ddf_with_topic[ddf_with_topic['steam_appid'].isin(batch_app_ids)]
    
    # Aggregate for this batch
    agg = batch_ddf.groupby(['steam_appid', 'topic_id']).agg(
        review_count=('review', 'count'),
        likes_sum=('voted_up', 'sum')
    )
    
    # Collect reviews for this batch
    reviews_series = batch_ddf.groupby(['steam_appid', 'topic_id'])['review'] \
        .apply(lambda x: list(x), meta=('review', object))
    
    # Compute both in parallel
    agg_df, reviews_df = dd.compute(agg, reviews_series)
    
    # Convert to DataFrames
    agg_df = agg_df.reset_index()
    reviews_df = reviews_df.reset_index().rename(columns={'review': 'Reviews'})
    
    # Append to results
    all_agg_dfs.append(agg_df)
    all_review_dfs.append(reviews_df)
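
The batching logic is easy to check in isolation. This sketch separates the threshold rule from the slicing; `choose_batch_size`, `iter_batches`, and the 45 placeholder IDs are assumptions for illustration, while the thresholds are the ones used in this cell.

```python
def choose_batch_size(total_app_ids):
    """Fewer app IDs per batch as the number of distinct games grows."""
    if total_app_ids > 1000:
        return 3
    if total_app_ids > 500:
        return 5
    if total_app_ids > 100:
        return 10
    return 20


def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# 45 placeholder app IDs fall in the smallest tier
app_ids = list(range(1, 46))
size = choose_batch_size(len(app_ids))
batches = list(iter_batches(app_ids, size))
print(size, [len(batch) for batch in batches])
```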

In [None]:
'''🧾 Cell 7 – The Final Report: Turning Numbers into Meaning
After all the data crunching, embedding, batching, and aggregation, we arrive at the final step of our pipeline:

“Let’s turn this into something a developer can actually use.”

This cell crafts a concise, readable summary that answers the big question:

“What themes are people talking about in my game, and how are they reacting to them?”

🔗 Step 1: Merge Everything
pd.merge(agg_df, reviews_df, on=['steam_appid', 'topic_id'], how='left')
We combine:

agg_df: the numerical summary — review counts and like sums.

reviews_df: the raw grouped review lists — needed for qualitative insights or summarization.

They’re merged on:

steam_appid — which game?

topic_id — which theme?

Now we have both numbers and text in one place.

🏗️ Step 2: Build Final Rows
We loop over each row of the merged dataframe, and construct the final human-readable format.

For each review group, we:

🔢 Extract game ID (appid) and topic index (tid)

🧠 Convert tid into a real theme name using the GAME_THEMES dictionary:

If the theme index is valid, we extract its name.

Otherwise, we assign it a fallback label like "Unknown Theme 2" (to avoid breaking things).

💡 Step 3: Compute Like Ratio
We calculate how positively that theme is being received:

like_ratio = f"{(likes / total * 100):.1f}%" if total > 0 else '0%'
This gives game developers a quick feel:

Are reviews around this theme mostly positive?

Or is this a hot-button issue?

📋 Step 4: Construct Final Rows
Each row includes:

steam_appid – which game

Theme – human-readable theme name

#Reviews – number of reviews in this group

LikeRatio – % of upvotes

Reviews – actual review texts (for later summarization)

We store each entry in a rows[] list, and finally convert it into:

final_report = pd.DataFrame(rows)
💾 Step 5: Save as CSV
final_report.to_csv('output_csvs/SBERT_DD_new_report.csv', index=False)
This creates a snapshot of everything we’ve processed — so if summarization or visualization fails later, we don’t lose progress. It’s the final checkpoint.

📌 TL;DR for Presentation Slide:
“This cell merges aggregated metrics with review text and maps each topic ID back to a theme name. It calculates like ratios per theme, formats the result into a clean report, and saves it as a CSV for downstream analysis or visualization.”'''
# Cell 7: Construct Final Report DataFrame
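# Combine the per-batch results from Cell 6; on their own, agg_df and
# reviews_df only hold the last batch processed by the loop
agg_df = pd.concat(all_agg_dfs, ignore_index=True)
reviews_df = pd.concat(all_review_dfs, ignore_index=True)

# Merge counts, likes, and reviews
report_df = pd.merge(
    agg_df,
    reviews_df,
    on=['steam_appid', 'topic_id'],
    how='left'
)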

# Build the final output structure
rows = []
for _, row in report_df.iterrows():
    appid = int(row['steam_appid'])
    tid = int(row['topic_id'])
    
    # Check if appid exists in GAME_THEMES
    if appid in GAME_THEMES:
        theme_keys = list(GAME_THEMES[appid].keys())
        # Check if tid is a valid index
        if tid < len(theme_keys):
            theme_name = theme_keys[tid]
        else:
            theme_name = f"Unknown Theme {tid}"
    else:
        theme_name = f"Unknown Theme {tid}"
    
    total = int(row['review_count'])
    likes = int(row['likes_sum'])
    like_ratio = f"{(likes / total * 100):.1f}%" if total > 0 else '0%'
    rows.append({
        'steam_appid': appid,
        'Theme': theme_name,
        '#Reviews': total,
        'LikeRatio': like_ratio,
        'Reviews': row['Reviews']
    })

final_report = pd.DataFrame(rows)

# Save intermediate results to avoid recomputation if summarization fails
os.makedirs('output_csvs', exist_ok=True)  # make sure the output folder exists
final_report.to_csv('output_csvs/SBERT_DD_new_report.csv', index=False)
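
The two per-row computations, mapping a topic index back to a theme name and formatting the like ratio, can be sketched as standalone helpers. The names `theme_name_for` and `like_ratio` and the sample dictionary are assumptions for illustration; the logic follows the loop above, with an added lower-bound check on the index.

```python
def theme_name_for(appid, tid, game_themes):
    """Map a topic index to its theme name, with a fallback label."""
    theme_keys = list(game_themes.get(appid, {}))
    if 0 <= tid < len(theme_keys):
        return theme_keys[tid]
    return f"Unknown Theme {tid}"


def like_ratio(likes, total):
    """Share of upvoted reviews as a percentage string."""
    return f"{likes / total * 100:.1f}%" if total > 0 else "0%"


# Sample theme dictionary in the same shape as GAME_THEMES
sample_themes = {123456: {"story": ["plot"], "visuals": ["art"]}}
print(theme_name_for(123456, 1, sample_themes))
print(theme_name_for(999, 0, sample_themes))
print(like_ratio(45, 60))
```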

In [None]:
'''👁️ Cell 8 – Final Preview: A Glimpse Into The Insight
After all the work — loading, embedding, filtering, batching, labeling, aggregating — it’s time to answer:

“What did we actually get out of this pipeline?”

This cell does just that: it opens the curtains on the final report and gives us a preview without overwhelming us.

🖼️ Step 1: Print Summary View
print(final_report[['steam_appid', 'Theme', '#Reviews', 'LikeRatio']].head())
We show only the essential columns:

🕹️ steam_appid – which game this row belongs to

🏷️ Theme – the specific topic (e.g., “story”, “visuals”)

🔢 #Reviews – how many reviews were assigned to this theme

👍 LikeRatio – what percentage of those were positive

This is the part you could easily show in a dashboard or export to a spreadsheet.

🧐 Step 2: Check Raw Review Content
sample_reviews = final_report['Reviews'].iloc[0]
Each entry in the Reviews column is actually a list of raw review texts — grouped by game and theme. But we don’t print the whole list — that would flood the terminal.

Instead, we:

Confirm it's a list ✅

Print how many reviews it contains

Show just the first review, safely truncated at 100 characters

print(f"First review (truncated): {sample_reviews[0][:100]}...")
This is useful for:

🧪 Sanity checking: Are the reviews grouped correctly?

📝 Summarization: This is what the summarizer will later process into a “quick insights” paragraph.

✅ Step 3: Clean Exit
client.close()
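We politely shut down the Dask client. Note that client.close() only disconnects the client: because the LocalCluster was created separately, its workers keep their RAM and CPU until cluster.close() is called as well. This is especially good hygiene for heavy pipelines running in notebooks or servers.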

🧾 TL;DR for Presentation Slide:
“This cell previews the final report, showing game-wise theme stats and verifying that raw reviews were correctly grouped. It offers a final check before visualization or summarization — and closes the Dask client cleanly.”'''
# Cell 8: View the Report
# Print preview of the DataFrame (excluding the Reviews column as it contains lists)
print("Final report preview (Reviews column contains lists of review texts):")
print(final_report[['steam_appid', 'Theme', '#Reviews', 'LikeRatio']].head())

# Verify that Reviews column contains lists
sample_reviews = final_report['Reviews'].iloc[0]
print(f"\nSample from first Reviews entry (showing first review only):")
if isinstance(sample_reviews, list) and len(sample_reviews) > 0:
    print(f"Number of reviews in list: {len(sample_reviews)}")
    print(f"First review (truncated): {sample_reviews[0][:100]}...")
# Close the client and the LocalCluster it was connected to
client.close()
cluster.close()

# Tuned for my hardware: 1 min 50 s inference

In [None]:
'''⚡ Cell 9 – The Fast Lane: Hardware-Optimized Summarization for RTX 4080 Super + Ryzen 9700X
You’ve built the themes. Assigned the topics. Collected the data.
Now it’s time to summarize thousands of game reviews, game by game, theme by theme — into short, readable insights.

But summarization is expensive — it runs on transformer models, which means:

High VRAM usage 💾

High CPU throughput 🧠

High memory churn ⚙️

So we don’t just run it — we engineer the entire process for your specific hardware: an RTX 4080 Super + Ryzen 9700X + 20GB RAM workstation.

Let’s break it down.

🛠️ Step 1: Build an Optimized Runtime Environment
We define a HARDWARE_CONFIG:

6 workers, each using 3GB memory

96 reviews per GPU batch — a sweet spot for the RTX 4080

‘sshleifer/distilbart-cnn-12-6’ — a fast, high-quality summarization model

Chunk sizes, checkpoint frequency, and cleanup parameters — all hand-tuned

Then we launch:

cluster = LocalCluster(...)
client = Client(cluster)
This is the second Dask cluster, purpose-built for heavy summarization.

🧱 Step 2: Create Balanced Partitions
prepare_partition(start_idx, end_idx)
We divide the final_report into N partitions, one per worker. Each partition contains a chunk of games to process. Bigger chunks mean:

Fewer I/O calls

Better GPU utilization

Faster throughput

Each partition is prepared using dask.delayed for lazy evaluation.

🤖 Step 3: Custom Worker Function
We define a highly tuned function:

def process_partition(partition_df, worker_id):
Each worker:

Initializes its own summarizer pipeline, using:

Half-precision (float16) for speed

device_map="auto" for seamless GPU assignment

Reports its own GPU memory usage

Batches reviews into large chunks, summarized with:

summarizer(..., num_beams=2)
We use a hierarchical summarization strategy:

For small review sets → summarize directly

For large ones → break into chunks → summarize each → then summarize the summaries

This gives fast + coherent outputs, while preventing model overload.

⏱️ Step 4: Schedule and Track
We submit all partitioned tasks:

futures = client.compute(delayed_results)
And we monitor progress with a background thread that updates a tqdm progress bar every 5 seconds — with zero UI lag and minimal CPU load.

⚙️ Step 5: Execution and Fallback
If distributed futures fail, we fallback to:

results = dask.compute(*delayed_results)
This ensures robustness, even in environments where GPU config changes.

📋 Step 6: Assemble the Results
Once done, we:

Collect and sort the results

Extract the summaries

Inject them into a new column in final_report:

final_report['QuickSummary'] = summaries
Each row now has a bite-sized summary of all reviews for a specific theme in a specific game.

💾 Step 7: Output & Clean-Up
We report performance:

Average time per item: 0.93 seconds
Then:

Preview the results

Save the full CSV to optimized_hardware_report.csv

Shut down the Dask cluster cleanly

📌 TL;DR for Presentation Slide:
“This cell uses hardware-tuned Dask workers and half-precision transformers to summarize thousands of grouped reviews in parallel. With batching, chunking, hierarchical compression, and aggressive memory reuse, it extracts fast, theme-specific insights from massive text volumes — in under a second per summary.”'''
# Cell 9: Hardware-optimized GPU summarization with Dask - Tuned for Ryzen 9700X & RTX 4080 Super

import pandas as pd
import numpy as np
import torch
import dask
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster
from tqdm.auto import tqdm
import time
import os
import threading

# Create checkpoint directory if it doesn't exist (minimal overhead)
os.makedirs('checkpoints', exist_ok=True)

# Optimized configuration for your specific hardware
# RTX 4080 Super (12GB usable VRAM) + Ryzen 9700X + 20GB usable RAM
HARDWARE_CONFIG = {
    'worker_count': 6,                # Optimal for Ryzen 9700X
    'memory_per_worker': '3GB',       # 18GB total for workers, leaving headroom
    'gpu_batch_size': 96,             # Aggressive batch size for RTX 4080 Super
    'model_name': 'sshleifer/distilbart-cnn-12-6',  # Best model for your GPU
    'chunk_size': 400,                # Larger chunks for faster processing
    'checkpoint_frequency': 25,       # Less frequent checkpoints for speed
    'cleanup_frequency': 10,          # Less frequent memory cleanup
}

print(f"Starting optimized Dask cluster for Ryzen 9700X + RTX 4080 Super configuration")
cluster = LocalCluster(
    n_workers=HARDWARE_CONFIG['worker_count'], 
    threads_per_worker=2,
    memory_limit=HARDWARE_CONFIG['memory_per_worker']
)
client = Client(cluster)
print(f"Dask dashboard available at: {client.dashboard_link}")

# Determine optimal partition sizes - larger for better throughput
@dask.delayed
def prepare_partition(start_idx, end_idx):
    """Prepare a partition optimized for high-end hardware"""
    return final_report.iloc[start_idx:end_idx].copy()

# Create larger partitions for better throughput
n_workers = HARDWARE_CONFIG['worker_count']
partition_size = len(final_report) // n_workers
partitions = []
for i in range(n_workers):
    start_idx = i * partition_size
    end_idx = (i + 1) * partition_size if i < n_workers - 1 else len(final_report)
    partitions.append(prepare_partition(start_idx, end_idx))
    print(f"Prepared partition {i+1} with {end_idx-start_idx} items")
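
# Illustrative sketch, not part of the original pipeline: the loop above hands
# all remainder rows to the last partition. A hypothetical helper such as
# `balanced_bounds` could spread the remainder over the first partitions
# instead, so partition sizes differ by at most one row.
def balanced_bounds(n_items, n_parts):
    """Return (start, end) index pairs covering range(n_items) in n_parts slices."""
    base, extra = divmod(n_items, n_parts)
    bounds = []
    start = 0
    for part in range(n_parts):
        end = start + base + (1 if part < extra else 0)  # first `extra` parts get one more row
        bounds.append((start, end))
        start = end
    return bounds
# Usage: balanced_bounds(10, 3) returns [(0, 4), (4, 7), (7, 10)].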

# Optimized worker function with aggressive resource usage
@dask.delayed
def process_partition(partition_df, worker_id):
    """Optimized worker for RTX 4080 Super"""
    # Import needed packages
    from transformers import pipeline, AutoModelForSeq2SeqLM, AutoTokenizer
    import torch
    
    # Load model components with optimal settings for RTX 4080 Super
    print(f"Worker {worker_id} initializing with optimized settings for RTX 4080 Super")
    
    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(HARDWARE_CONFIG['model_name'])
    
    # Load model with optimized settings for RTX 4080 Super
    model = AutoModelForSeq2SeqLM.from_pretrained(
        HARDWARE_CONFIG['model_name'],
        torch_dtype=torch.float16,        # Half precision for speed
        device_map="auto",                # Automatic device placement
        low_cpu_mem_usage=True            # Optimized memory usage
    )
    
    # Create optimized pipeline
    summarizer = pipeline(
        task='summarization',
        model=model,
        tokenizer=tokenizer,
        framework='pt',
        model_kwargs={
            "use_cache": True,            # Enable caching for speed
            "return_dict_in_generate": True  # More efficient generation
        }
    )
    
    # Report GPU status (skipped when no CUDA device is visible)
    if torch.cuda.is_available():
        gpu_mem = torch.cuda.memory_allocated(0) / (1024**3)
        print(f"Worker {worker_id}: GPU Memory: {gpu_mem:.2f}GB allocated")
    
    # Highly optimized batch processing function
    def process_chunks_batched(chunks):
        """Process chunks in large batches for RTX 4080 Super"""
        all_summaries = []
        
        # Use large batches for the RTX 4080 Super
        for i in range(0, len(chunks), HARDWARE_CONFIG['gpu_batch_size']):
            batch = chunks[i:i+HARDWARE_CONFIG['gpu_batch_size']]
            batch_summaries = summarizer(
                batch,
                max_length=60,
                min_length=20,
                truncation=True,
                do_sample=False,
                num_beams=2  # Use beam search for better quality with minimal speed impact
            )
            all_summaries.extend([s["summary_text"] for s in batch_summaries])
            
            # Minimal cleanup - only when really needed
            if i % (HARDWARE_CONFIG['gpu_batch_size'] * 3) == 0 and torch.cuda.is_available():
                torch.cuda.empty_cache()
                    
        return all_summaries
    
    # Optimized hierarchical summary function
    def hierarchical_summary(reviews):
        """Create hierarchical summary with optimized chunk sizes"""
        # Handle edge cases: check the type first so non-list values
        # (NaN, arrays) never reach the truthiness test
        if not isinstance(reviews, list) or not reviews:
            return "No reviews available for summarization."
        
        # Fast path for small review sets
        if len(reviews) <= HARDWARE_CONFIG['chunk_size']:
            doc = "\n\n".join(reviews)
            return summarizer(
                doc,
                max_length=60,
                min_length=20,
                truncation=True,
                do_sample=False
            )[0]['summary_text']
        
        # Process larger review sets with optimized chunking
        all_chunks = []
        for i in range(0, len(reviews), HARDWARE_CONFIG['chunk_size']):
            batch = reviews[i:i+HARDWARE_CONFIG['chunk_size']]
            text = "\n\n".join(batch)
            all_chunks.append(text)
        
        # Process chunks with optimized batching
        intermediate_summaries = process_chunks_batched(all_chunks)
        
        # Create final summary
        joined = " ".join(intermediate_summaries)
        return summarizer(
            joined,
            max_length=60,
            min_length=20,
            truncation=True,
            do_sample=False
        )[0]['summary_text']
    
    # Process the partition with minimal overhead
    results = []
    
    # Use tqdm for progress tracking
    with tqdm(total=len(partition_df), desc=f"Worker {worker_id}", position=worker_id) as pbar:
        for idx, row in partition_df.iterrows():
            # Process the review
            summary = hierarchical_summary(row['Reviews'])
            results.append((idx, summary))
            
            # Minimal cleanup - only every N iterations
            if len(results) % HARDWARE_CONFIG['cleanup_frequency'] == 0:
                torch.cuda.empty_cache()
                
            # Update progress bar
            pbar.update(1)
    
    # Final cleanup
    torch.cuda.empty_cache()
    
    print(f"Worker {worker_id} completed successfully")
    return results

# Schedule tasks
print(f"Scheduling {n_workers} optimized partitions...")
delayed_results = []
for i in range(n_workers):
    delayed_result = process_partition(partitions[i], i)
    delayed_results.append(delayed_result)

# Streamlined progress tracking
print("\nStarting optimized computation...")
main_progress = tqdm(total=len(final_report), desc="Overall Progress")

# Start timing
start_time = time.time()

# Progress monitor: polls future status and updates the overall progress bar
def update_main_progress(futures):
    while not stop_flag:
        # Count completed futures
        completed_count = sum(f.status == 'finished' for f in futures)
        completed_percentage = completed_count / len(futures)
        
        # Update progress bar
        main_progress.n = int(len(final_report) * completed_percentage)
        main_progress.refresh()
        
        # Only check every 5 seconds to reduce overhead
        time.sleep(5)

# Submit tasks to cluster
futures = client.compute(delayed_results)

# Start progress monitor with minimal overhead
stop_flag = False
monitor_thread = threading.Thread(target=update_main_progress, args=(futures,))
monitor_thread.daemon = True
monitor_thread.start()

# Wait for computation
try:
    print("Computing with optimal settings for RTX 4080 Super...")
    results = client.gather(futures)
except Exception as e:
    print(f"Error with futures: {e}")
    print("Falling back to direct computation...")
    results = dask.compute(*delayed_results)

# Stop progress monitor
stop_flag = True
monitor_thread.join(timeout=3)
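
# Illustrative sketch, not the code used above: the monitor could be stopped
# with a threading.Event instead of a module-level flag, which also lets the
# thread wake up immediately rather than finishing its sleep. The names
# `monitor_progress`, `on_update` and `interval` are hypothetical.
import threading

def monitor_progress(futures, on_update, stop_event, interval=5.0):
    """Call on_update with the finished fraction until stop_event is set."""
    while not stop_event.is_set():
        done = sum(f.status == 'finished' for f in futures)
        on_update(done / len(futures) if futures else 1.0)
        stop_event.wait(interval)  # returns early once the event is set
# Usage: create stop_event = threading.Event(), run monitor_progress in a
# thread, then call stop_event.set() followed by thread.join() when done.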

# Update progress to completion
main_progress.n = len(final_report)
main_progress.refresh()
main_progress.close()

# Process results efficiently
all_results = []
for worker_results in results:
    all_results.extend(worker_results)

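# Results arrive in partition order and each worker preserves row order, so
# they already match final_report's row positions (sorting by index label
# would misalign them if the index is not monotonically increasing)
summaries = [summary for _, summary in all_results]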

# Store results
final_report['QuickSummary'] = summaries

# Report timing
elapsed_time = time.time() - start_time
print(f"\nOptimized processing completed in {elapsed_time:.2f} seconds")
print(f"Average time per item: {elapsed_time/len(final_report):.2f} seconds")

# Display results
print("\nResults sample:")
display(final_report[['steam_appid', 'Theme', 'QuickSummary']].head())

# Save results (create the output directory first if it is missing)
os.makedirs('output_csvs', exist_ok=True)
final_report.to_csv('output_csvs/optimized_hardware_report.csv')
print("Results saved to output_csvs/optimized_hardware_report.csv")

# Clean up
client.close()
cluster.close()
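
# Illustrative sketch, not part of the original cell: the chunk-then-combine
# flow of hierarchical_summary, isolated from the model so it can be checked
# without a GPU. `hierarchical_reduce` and its `summarize` callable are
# hypothetical; in the cell above that role is played by the transformers
# summarization pipeline.
def hierarchical_reduce(items, chunk_size, summarize):
    """Summarize a list of strings in two levels: per chunk, then overall."""
    if not items:
        return ""
    if len(items) <= chunk_size:
        # Small input: one pass over the joined text
        return summarize("\n\n".join(items))
    # Level 1: summarize each chunk of `chunk_size` items
    partials = [
        summarize("\n\n".join(items[i:i + chunk_size]))
        for i in range(0, len(items), chunk_size)
    ]
    # Level 2: summarize the joined intermediate summaries
    return summarize(" ".join(partials))
# Usage: pass any callable that maps a string to a shorter string as `summarize`.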