# Steam Reviews Sentiment Analysis for PUBG: Battlegrounds using Naive Bayes

This notebook performs sentiment analysis on Steam reviews for PUBG: Battlegrounds using a Naive Bayes algorithm. We'll load the reviews from a parquet file, preprocess the text data, and train a Naive Bayes model to classify reviews as positive or negative.

In [None]:
# Import required libraries
import re

import nltk
import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Download required NLTK data
nltk.download('punkt')
nltk.download('punkt_tab')  # tokenizer tables that word_tokenize needs in newer NLTK releases
nltk.download('stopwords')

[nltk_data] Downloading package punkt to C:\Users\Gerson
[nltk_data]     Leite\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to C:\Users\Gerson
[nltk_data]     Leite\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to C:\Users\Gerson
[nltk_data]     Leite\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Gerson Leite\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

## Data Loading and Preprocessing
Let's load the PUBG reviews and prepare them for sentiment analysis:
1. Load reviews from parquet file
2. Preprocess the text data
3. Create features and labels

In [None]:
# Build the stopword set once instead of rebuilding it on every call
STOP_WORDS = set(stopwords.words('english'))


def preprocess_text(text: str) -> str:
    """Lowercase, strip non-letters, tokenize, and drop English stopwords."""
    # Convert to lowercase
    text = text.lower()

    # Keep only ASCII letters and whitespace
    # (digits, punctuation, emoji, and non-English letters are dropped)
    text = re.sub(r'[^a-z\s]', '', text)

    # Tokenize
    tokens = word_tokenize(text)

    # Remove stopwords
    tokens = [token for token in tokens if token not in STOP_WORDS]

    # Join tokens back into a string
    return ' '.join(tokens)
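
One side effect of stopword removal is worth keeping in mind: NLTK's English stopword list includes negations such as "not", "no", and "nor", so "not good" and "good" look the same to the model after preprocessing. The sketch below shows one possible way to keep negations. It is an illustration only: `preprocess_keep_negations`, the tiny `DEMO_STOP_WORDS` set, and `NEGATIONS` are hypothetical names, and plain `str.split` stands in for `word_tokenize` so the example runs without NLTK data.

```python
import re

# Hypothetical, deliberately tiny stopword list for illustration only.
# NLTK's English list is much larger and also contains these negations.
DEMO_STOP_WORDS = {"the", "is", "a", "it", "but", "not", "no"}
NEGATIONS = {"not", "no"}


def preprocess_keep_negations(text, stop_words=DEMO_STOP_WORDS, keep=NEGATIONS):
    """Same cleaning steps as above, but negation words survive the filter."""
    # Lowercase, then keep only ASCII letters and whitespace
    text = re.sub(r"[^a-z\s]", "", text.lower())
    # Remove the negations from the set of words to drop
    effective_stop_words = stop_words - keep
    # Whitespace split stands in for word_tokenize in this sketch
    return " ".join(t for t in text.split() if t not in effective_stop_words)


print(preprocess_keep_negations("It is NOT a good game, but the maps are fun!"))
# prints: not good game maps are fun
```

Whether keeping negations helps on this dataset is an open question that would need to be checked against the held-out reviews.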

## Loading and Preparing the Reviews
The next cell loads the saved PUBG reviews and prepares them for sentiment analysis:
1. Load the reviews from the parquet file
2. Preprocess the text data
3. Create features and labels

In [None]:
# Load PUBG reviews from parquet file
reviews_df = pd.read_parquet('pubg_reviews.parquet')

# Display basic information about the dataset
print(f"Total reviews collected: {len(reviews_df)}")
print("\nSentiment distribution:")
print(reviews_df['voted_up'].value_counts())  # same column that is used as the label below

# Preprocess reviews
print("\nPreprocessing reviews...")
reviews_df['processed_review'] = reviews_df['review'].apply(preprocess_text)

# Prepare features and labels
X = reviews_df['processed_review']
y = reviews_df['voted_up']  # True for positive, False for negative

Fetching Steam reviews for App ID: 578080
Target: 2000 reviews, Type: all, Language: english
--------------------------------------------------
Fetched 100 reviews so far...
Fetched 200 reviews so far...
Fetched 300 reviews so far...
Fetched 400 reviews so far...
Fetched 500 reviews so far...
Fetched 600 reviews so far...
Fetched 700 reviews so far...
Fetched 800 reviews so far...
Fetched 900 reviews so far...
Fetched 1000 reviews so far...
Fetched 1100 reviews so far...
Fetched 1200 reviews so far...
Fetched 1300 reviews so far...
Fetched 1400 reviews so far...
[output truncated]
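
Before training, it can help to confirm that every row has usable text and to see how the labels are distributed. The sketch below is a minimal example on an invented toy frame that only mimics the `review` and `voted_up` columns used above; the rows, `toy_df`, and `clean_df` are hypothetical and are not taken from `pubg_reviews.parquet`.

```python
import pandas as pd

# Invented rows that mimic the two columns the notebook relies on
toy_df = pd.DataFrame({
    "review": ["great game", None, "too many cheaters", "   ", "fun with friends"],
    "voted_up": [True, True, False, False, True],
})

# Treat missing and whitespace-only reviews as unusable
has_text = toy_df["review"].fillna("").str.strip() != ""
clean_df = toy_df[has_text].reset_index(drop=True)

# The mean of a boolean column is the share of True (positive) labels
positive_share = clean_df["voted_up"].mean()

print(f"Kept {len(clean_df)} of {len(toy_df)} rows")
print(f"Positive share: {positive_share:.2%}")
```

A strongly uneven positive share is an early hint that accuracy alone will be a weak measure of model quality.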

## Training the Naive Bayes Model
Now we'll:
1. Convert text to TF-IDF features
2. Split the data into training and testing sets
3. Train the Naive Bayes model
4. Evaluate its performance
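
The four steps above can also be chained in a scikit-learn `Pipeline`, which makes it easy to split the raw text first so that the TF-IDF vocabulary and weights are learned from the training reviews only. The sketch below is an alternative arrangement shown under stated assumptions, not the notebook's own method: the toy reviews are invented, and `stratify` is added to keep the class ratio equal in both splits. Applying the same changes to the real data would alter the numbers reported in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy reviews, only to make the sketch runnable
texts = [
    "great game love it", "fun with friends", "best battle royale",
    "awesome gunplay", "really enjoyable matches", "love the new map",
    "too many cheaters", "terrible servers lag", "game keeps crashing",
    "awful matchmaking", "full of bugs", "unplayable after update",
]
labels = [True] * 6 + [False] * 6

# Split the raw text first, keeping the class ratio similar in both parts
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The vectorizer inside the pipeline is fitted on the training texts only
model = make_pipeline(TfidfVectorizer(max_features=5000), MultinomialNB())
model.fit(train_texts, train_labels)

predictions = model.predict(test_texts)
print(f"{len(test_texts)} held-out reviews, predictions: {list(predictions)}")
```

With only twelve invented reviews the predictions themselves mean nothing; the point is the order of operations.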

In [6]:
# Convert text to TF-IDF features
tfidf = TfidfVectorizer(max_features=5000)
X_tfidf = tfidf.fit_transform(X)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X_tfidf,
    y,
    test_size=0.2,
    random_state=42
)

print(f"\nTraining data shape: {X_train.shape}")
print(f"Testing data shape: {X_test.shape}")

# Initialize and train the Naive Bayes model
print("\nTraining Naive Bayes model...")
nb_model = MultinomialNB()
nb_model.fit(X_train, y_train)

# Make predictions
y_pred = nb_model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"\nAccuracy: {accuracy:.2%}")

# Print detailed classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['Negative', 'Positive']))


Training data shape: (1600, 5000)
Testing data shape: (400, 5000)

Training Naive Bayes model...

Accuracy: 76.00%

Classification Report:
              precision    recall  f1-score   support

    Negative       0.76      0.99      0.86       291
    Positive       0.84      0.15      0.25       109

    accuracy                           0.76       400
   macro avg       0.80      0.57      0.55       400
weighted avg       0.78      0.76      0.69       400
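
The report above can be put in context with two quick calculations that use only numbers it already prints: the accuracy of always predicting the majority class, and the unweighted mean of the two recall values (balanced accuracy). This is a small arithmetic sketch; the variable names are illustrative.

```python
# Support and recall values copied from the classification report above
support = {"Negative": 291, "Positive": 109}
recall = {"Negative": 0.99, "Positive": 0.15}

total = sum(support.values())

# Accuracy of a trivial classifier that always predicts the majority class
majority_baseline = max(support.values()) / total

# Balanced accuracy is the unweighted mean of the per-class recalls
balanced_accuracy = sum(recall.values()) / len(recall)

print(f"Majority-class baseline: {majority_baseline:.2%}")  # 72.75%
print(f"Balanced accuracy: {balanced_accuracy:.2f}")        # 0.57
```

The 76.00% accuracy is therefore only about three percentage points above the always-negative baseline, and the 0.15 recall on positive reviews shows that the model labels almost everything as negative.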



In [9]:
def analyze_sentiment(text: str, model, vectorizer) -> str:
    """Analyze the sentiment of a given text using the trained model."""
    # Preprocess the text
    processed_text = preprocess_text(text)

    # Vectorize
    text_vectorized = vectorizer.transform([processed_text])

    # Predict
    prediction = model.predict(text_vectorized)[0]

    return 'Positive' if prediction else 'Negative'

# Test the analyzer with some example reviews
example_reviews = [
    "Lots of fun, but the matchmaking is terrible.",
    "It's okay, not the best but not the worst either.",
    "I had a lot of fun, but there are some issues that need fixing.",
    "While the game has potential, it suffers from numerous bugs.",
    "Best battle royale game ever, highly recommended!"
]

# print("\nTesting sentiment analyzer with example reviews:")
# for review in example_reviews:
#     sentiment = analyze_sentiment(review, nb_model, tfidf)
#     print(f"\nReview: {review}")
#     print(f"Sentiment: {sentiment}")

# Calculate prediction probabilities for the example reviews
print("\nPrediction probabilities:")
for review in example_reviews:
    processed_text = preprocess_text(review)
    text_vectorized = tfidf.transform([processed_text])
    probs = nb_model.predict_proba(text_vectorized)[0]
    print(f"\nReview: {review}")
    print(f"Negative probability: {probs[0]:.2%}")
    print(f"Positive probability: {probs[1]:.2%}")


Prediction probabilities:

Review: Lots of fun, but the matchmaking is terrible.
Negative probability: 78.06%
Positive probability: 21.94%

Review: It's okay, not the best but not the worst either.
Negative probability: 75.14%
Positive probability: 24.86%

Review: I had a lot of fun, but there are some issues that need fixing.
Negative probability: 79.91%
Positive probability: 20.09%

Review: While the game has potential, it suffers from numerous bugs.
Negative probability: 73.41%
Positive probability: 26.59%

Review: Best battle royale game ever, highly recommended!
Negative probability: 41.09%
Positive probability: 58.91%
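
The cell above labels `probs[0]` as negative and `probs[1]` as positive. That matches how scikit-learn orders `predict_proba` columns: they follow `model.classes_`, and for boolean labels `False` sorts before `True`. The sketch below shows one way to read the mapping from the model instead of relying on position. The toy reviews are invented and the variable names are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented toy reviews: True = recommended, False = not recommended
texts = [
    "great game", "fun with friends", "great fun maps",
    "too many cheaters", "terrible lag", "game keeps crashing",
]
labels = [True, True, True, False, False, False]

vectorizer = TfidfVectorizer()
model = MultinomialNB().fit(vectorizer.fit_transform(texts), labels)

# classes_ gives the label that belongs to each predict_proba column
print(model.classes_.tolist())  # [False, True]

probs = model.predict_proba(vectorizer.transform(["great fun"]))[0]
by_label = dict(zip(model.classes_.tolist(), probs))

print(f"Negative probability: {by_label[False]:.2%}")
print(f"Positive probability: {by_label[True]:.2%}")
```

Looking the columns up by label keeps the printout correct even if the label encoding changes, for example to strings.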


## Model Comparison

Key differences between Naive Bayes and SVM for sentiment analysis:

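1. **Speed**: Naive Bayes trains in a single pass that accumulates per-class feature statistics, so it is typically faster to train than SVM, especially a kernel SVM on large datasets
2. **Probability Output**: Naive Bayes provides probability estimates directly (although they are often poorly calibrated), while SVM outputs a decision margin and needs an extra calibration step to produce probabilities
3. **Assumptions**: Naive Bayes assumes features are conditionally independent given the class, while SVM doesn't make this assumption
4. **Memory Usage**: Naive Bayes stores only per-class feature statistics and typically uses less memory than a kernel SVM, which has to keep its support vectors
5. **Performance**: While both models can perform well, SVM might handle complex decision boundaries better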

The choice between them often depends on:
- Dataset size
- Training time requirements
- Need for probability estimates
- Available computational resources
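
A comparison like the one above is easiest to judge empirically. The sketch below shows one possible setup, assuming scikit-learn's `LinearSVC` as the SVM and an invented toy corpus. Both choices are assumptions for illustration, and the scores on twelve made-up reviews say nothing about PUBG reviews. Note that `LinearSVC` has no `predict_proba`, which reflects point 2 in the list.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented toy reviews: True = recommended, False = not recommended
texts = [
    "great game love it", "fun with friends", "best battle royale",
    "awesome gunplay", "really enjoyable matches", "love the new map",
    "too many cheaters", "terrible servers lag", "game keeps crashing",
    "awful matchmaking", "full of bugs", "unplayable after update",
]
labels = [True] * 6 + [False] * 6

candidates = {
    "Naive Bayes": make_pipeline(TfidfVectorizer(), MultinomialNB()),
    "Linear SVM": make_pipeline(TfidfVectorizer(), LinearSVC()),
}

scores = {}
for name, pipeline in candidates.items():
    # 3-fold stratified cross-validation; the vectorizer is refitted per fold
    fold_scores = cross_val_score(pipeline, texts, labels, cv=3, scoring="accuracy")
    scores[name] = fold_scores.mean()
    print(f"{name}: mean accuracy {scores[name]:.2f}")
```

On the real dataset, a class-aware metric such as macro-averaged F1 (`scoring="f1_macro"`) would likely be more informative than accuracy, given the imbalance between negative and positive reviews.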