# Steam Reviews Sentiment Analysis for PUBG: Battlegrounds using SVM

This notebook performs sentiment analysis on Steam reviews of PUBG: Battlegrounds with a Support Vector Machine (SVM). We load the reviews from a Parquet file, preprocess the text, and train a linear SVM on TF-IDF features to classify each review as positive or negative.

In [None]:
# Import required libraries
import re

import nltk
import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Download required NLTK data
nltk.download('punkt')
nltk.download('punkt_tab')  # tokenizer tables required by word_tokenize in recent NLTK releases
nltk.download('stopwords')

[nltk_data] Downloading package punkt to C:\Users\Gerson
[nltk_data]     Leite\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to C:\Users\Gerson
[nltk_data]     Leite\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to C:\Users\Gerson
[nltk_data]     Leite\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Gerson Leite\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

## Text Preprocessing
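Before training our SVM model, we preprocess each review by:
1. Converting to lowercase
2. Removing every character that is not an ASCII letter or whitespace (punctuation, digits, emoji)
3. Tokenizing the text
4. Removing English stopwords
5. Joining the remaining tokens back into a single string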

In [5]:
def preprocess_text(text: str) -> str:
    """Preprocess text data for sentiment analysis.

    Parameters:
    -----------
    text : str
        The input text to preprocess

    Returns:
    --------
    str
        Preprocessed text
    """
    # Convert to lowercase
    text = text.lower()

    # Remove special characters and numbers
    text = re.sub(r'[^a-zA-Z\s]', '', text)

    # Tokenize
    tokens = word_tokenize(text)

    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words]

    # Join tokens back into a string
    return ' '.join(tokens)
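
To see what these steps do to a concrete review, here is a minimal sketch of the same sequence. It makes two simplifying assumptions so that it runs without any NLTK downloads: `str.split` stands in for `word_tokenize`, and `DEMO_STOPWORDS` is a small hand-picked subset of the NLTK English stopword list rather than the full list. The name `preprocess_text_demo` is hypothetical.

```python
import re

# Assumption: a tiny subset of NLTK's English stopwords, for illustration only.
DEMO_STOPWORDS = {"this", "is", "not", "it", "the", "a", "and", "don't"}


def preprocess_text_demo(text: str) -> str:
    """Simplified stand-in for preprocess_text (no NLTK required)."""
    text = text.lower()                       # 1. lowercase
    text = re.sub(r"[^a-zA-Z\s]", "", text)   # 2. keep only letters and whitespace
    tokens = text.split()                     # 3. tokenize (whitespace split here)
    tokens = [t for t in tokens if t not in DEMO_STOPWORDS]  # 4. drop stopwords
    return " ".join(tokens)                   # 5. join back into one string


print(preprocess_text_demo("This game is NOT good, don't buy it!!! 10/10"))
# -> game good dont buy
```

Two effects are visible in this example. The negation "not" is treated as a stopword and disappears, so "not good" is reduced to "good". Also, the apostrophe is stripped before stopword removal, so "don't" becomes "dont", which no longer matches the stopword entry and stays in the output. Both can influence what the classifier learns from negated phrases.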

In [6]:
# Load PUBG reviews from parquet file
reviews_df = pd.read_parquet('pubg_reviews.parquet')

# Display basic information about the dataset
print(f"Total reviews collected: {len(reviews_df)}")
print("\nSentiment distribution:")
# voted_up is the label column used below (True = positive, False = negative)
print(reviews_df['voted_up'].value_counts())

# Preprocess reviews
print("\nPreprocessing reviews...")
reviews_df['processed_review'] = reviews_df['review'].apply(preprocess_text)

# Prepare features and labels
X = reviews_df['processed_review']
y = reviews_df['voted_up']  # True for positive, False for negative

Fetching Steam reviews for App ID: 578080
Target: 2000 reviews, Type: all, Language: english

## Training the SVM Model
Now we'll:
1. Convert text to TF-IDF features
2. Split the data into training and testing sets
3. Train the SVM model
4. Evaluate its performance
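
As background for step 1, the sketch below reproduces by hand the weight that `TfidfVectorizer` assigns with its default settings (smoothed IDF, then L2 normalization of each row) and compares it with the library's result. The three-document corpus is a hypothetical example, not data from the reviews.

```python
import math

from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus; each string plays the role of one preprocessed review.
docs = ["fun game", "buggy game", "fun fun maps"]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs).toarray()
column = vectorizer.vocabulary_  # maps each term to its column index

# Manual computation for the third document, "fun fun maps".
n_docs = len(docs)
doc_freq = {"fun": 2, "maps": 1}    # number of documents containing each term
term_count = {"fun": 2, "maps": 1}  # raw counts inside the third document

# Default smoothed IDF: ln((1 + n) / (1 + df)) + 1
idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
raw = {t: term_count[t] * idf[t] for t in term_count}

# Each row is scaled to unit Euclidean length.
norm = math.sqrt(sum(w * w for w in raw.values()))
manual = {t: w / norm for t, w in raw.items()}

# Print the manual weight next to the value computed by scikit-learn.
for term, weight in manual.items():
    print(term, round(weight, 4), round(matrix[2][column[term]], 4))
```

The rarer term "maps" gets a higher IDF than "fun", which appears in two of the three documents, while the repeated "fun" gains weight from its term count.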

In [7]:
# Convert text to TF-IDF features
tfidf = TfidfVectorizer(max_features=5000)
X_tfidf = tfidf.fit_transform(X)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X_tfidf,
    y,
    test_size=0.2,
    random_state=42
)

print(f"\nTraining data shape: {X_train.shape}")
print(f"Testing data shape: {X_test.shape}")


Training data shape: (1600, 5000)
Testing data shape: (400, 5000)
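
One design point in the cell above: the vectorizer is fitted on all reviews before the split, so its vocabulary and IDF weights also reflect the test reviews, and the split is not stratified by label. A possible alternative, sketched here on a hypothetical toy corpus, is to split the raw text first and wrap the vectorizer and classifier in a `Pipeline`, so that `fit` only ever sees training data. This illustrates the idea only; it is not the notebook's method, and the scores reported in this notebook were not produced this way.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy corpus standing in for the processed reviews and voted_up labels.
texts = [
    "great game fun friends", "love gameplay graphics", "amazing battle royale",
    "fun shooter good maps", "terrible bugs hackers", "bad optimization crashes",
    "cheaters everywhere awful", "laggy servers broken",
]
labels = [True, True, True, True, False, False, False, False]

# Split the raw text first; stratify keeps the class ratio the same in both splits.
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits the vectorizer on the training split only,
# then applies the fitted vocabulary and IDF weights to the test split.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=5000)),
    ("svm", SVC(kernel="linear", C=1.0)),
])
pipeline.fit(train_texts, train_labels)
predictions = pipeline.predict(test_texts)

print(f"Train size: {len(train_texts)}, test size: {len(test_texts)}")
print(f"Vocabulary size: {len(pipeline.named_steps['tfidf'].vocabulary_)}")
```

A further benefit of the pipeline form is that the fitted object accepts raw strings directly, so the vectorizer and the model cannot drift apart when predicting on new text.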


In [8]:
# Initialize and train a linear SVM. probability=True adds calibrated probability
# estimates (fitted with internal cross-validation), which makes training slower.
print("\nTraining SVM model...")
svm_model = SVC(kernel='linear', C=1.0, random_state=42, probability=True)
svm_model.fit(X_train, y_train)

# Make predictions
y_pred = svm_model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"\nAccuracy: {accuracy:.2%}")

# Print detailed classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['Negative', 'Positive']))


Training SVM model...

Accuracy: 84.25%

Classification Report:
              precision    recall  f1-score   support

    Negative       0.85      0.95      0.90       291
    Positive       0.79      0.57      0.66       109

    accuracy                           0.84       400
   macro avg       0.82      0.76      0.78       400
weighted avg       0.84      0.84      0.83       400
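
The report's summary rows can be cross-checked from the per-class rows. The sketch below uses only the rounded support and recall figures printed above, so the reconstructed accuracy is approximate. It also computes the accuracy of a trivial classifier that always predicts the majority class, which is a useful reference point for an imbalanced test set.

```python
# Per-class figures copied from the classification report (rounded to two decimals).
support = {"Negative": 291, "Positive": 109}
recall = {"Negative": 0.95, "Positive": 0.57}

total = sum(support.values())

# Accuracy of always predicting the majority class.
baseline_accuracy = max(support.values()) / total

# Accuracy is the support-weighted average of the per-class recalls.
approx_accuracy = sum(recall[c] * support[c] for c in support) / total

# Macro-averaged recall weights both classes equally.
macro_recall = sum(recall.values()) / len(recall)

print(f"Test reviews:         {total}")
print(f"Majority baseline:    {baseline_accuracy:.2%}")
print(f"Approx. accuracy:     {approx_accuracy:.1%}")
print(f"Macro-average recall: {macro_recall:.2f}")
```

Always predicting Negative would already score 72.75% on this test set, so the model's 84.25% is a gain of about 11.5 percentage points. The Positive recall of 0.57 means that roughly 43% of the positive test reviews are classified as negative.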




In [11]:
def analyze_sentiment(text: str, model, vectorizer) -> tuple[str, dict[str, float]]:
    """Analyze the sentiment of a given text using the trained model.

    Parameters:
    -----------
    text : str
        The text to analyze
    model : sklearn model
        Trained model
    vectorizer : TfidfVectorizer
        Fitted vectorizer

    Returns:
    --------
    tuple
        Contains (sentiment prediction, probability scores dictionary)
    """
    # Preprocess the text
    processed_text = preprocess_text(text)

    # Vectorize
    text_vectorized = vectorizer.transform([processed_text])

    # Get prediction and probabilities
    prediction = model.predict(text_vectorized)[0]
    probabilities = model.predict_proba(text_vectorized)[0]

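    # Create probability dictionary. predict_proba orders its columns like
    # model.classes_, which is [False, True] for the boolean voted_up labels,
    # so index 0 is Negative and index 1 is Positive.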
    prob_dict = {
        'Negative': probabilities[0],
        'Positive': probabilities[1]
    }

    return ('Positive' if prediction else 'Negative', prob_dict)

# Test the analyzer with some example reviews
example_reviews = [
    "This game is amazing! Great gameplay and graphics.",
    "Terrible optimization, lots of bugs and hackers.",
    "Really enjoyed playing with friends, good battle royale experience.",
    "Lots of fun, but the matchmaking is terrible.",
    "It's okay, not the best but not the worst either.",
    "I had a lot of fun, but there are some issues that need fixing."
]

print("\nTesting sentiment analyzer with example reviews:")
for review in example_reviews:
    sentiment, probabilities = analyze_sentiment(review, svm_model, tfidf)
    print(f"\nReview: {review}")
    print(f"Sentiment: {sentiment}")
    print("Probability scores:")
    print(f"- Positive: {probabilities['Positive']:.2%}")
    print(f"- Negative: {probabilities['Negative']:.2%}")


Testing sentiment analyzer with example reviews:

Review: This game is amazing! Great gameplay and graphics.
Sentiment: Positive
Probability scores:
- Positive: 99.63%
- Negative: 0.37%

Review: Terrible optimization, lots of bugs and hackers.
Sentiment: Negative
Probability scores:
- Positive: 6.51%
- Negative: 93.49%

Review: Really enjoyed playing with friends, good battle royale experience.
Sentiment: Positive
Probability scores:
- Positive: 95.70%
- Negative: 4.30%

Review: Lots of fun, but the matchmaking is terrible.
Sentiment: Negative
Probability scores:
- Positive: 41.08%
- Negative: 58.92%

Review: It's okay, not the best but not the worst either.
Sentiment: Positive
Probability scores:
- Positive: 55.19%
- Negative: 44.81%

Review: I had a lot of fun, but there are some issues that need fixing.
Sentiment: Positive
Probability scores:
- Positive: 82.82%
- Negative: 17.18%
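
The 58.92% / 41.08% and 55.19% / 44.81% splits above show that mixed reviews sit close to the decision boundary. For `SVC`, these probabilities come from a separate calibration step (Platt scaling), and the scikit-learn documentation notes that they may occasionally be inconsistent with `predict`. One alternative confidence signal, sketched below, is the signed distance to the separating hyperplane returned by `decision_function`; in the binary case its sign determines the predicted class. The toy corpus and the helper name `margin_sentiment` are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Hypothetical toy corpus standing in for the processed reviews and voted_up labels.
texts = [
    "great game fun friends", "love gameplay graphics", "amazing battle royale",
    "fun shooter good maps", "terrible bugs hackers", "bad optimization crashes",
    "cheaters everywhere awful", "laggy servers broken",
]
labels = [True, True, True, True, False, False, False, False]

vectorizer = TfidfVectorizer()
model = SVC(kernel="linear", C=1.0)  # probability=True is not needed for margins
model.fit(vectorizer.fit_transform(texts), labels)


def margin_sentiment(text: str) -> tuple[str, float]:
    """Return the predicted label and the signed margin for one text."""
    features = vectorizer.transform([text])
    # Positive scores fall on the side of model.classes_[1] (True, i.e. Positive).
    score = float(model.decision_function(features)[0])
    return ("Positive" if score > 0 else "Negative", score)


for review in ["great fun with friends", "awful bugs and hackers"]:
    label, score = margin_sentiment(review)
    print(f"{review!r}: {label} (margin {score:+.3f})")
```

A margin near zero plays the same role as a probability near 50%: the review lies close to the boundary and the label should be treated with caution. Unlike the calibrated probabilities, margins are not bounded to a fixed range, so they are best compared against each other rather than read as percentages.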
