# Natural Language Processing Application: Sentimental Analysis on Steam Reviews (Possibly?)

## Team

* Gabriel Aracena
* Joshua Canode
* Aaron Galicia

### Project Description

A key area of knowledge in data analytics is the ability to extract meaning from text. This assignment provides the foundational skills in this area by detecting whether a text conveys a positive or negative message.

Analyze the sentiment (e.g., negative, neutral, positive) conveyed in a large body (corpus) of texts using the NLTK package in Python. Complete the steps below. Then, write a comprehensive technical report as a Python Jupyter notebook to include all code, code comments, all outputs, plots, and analysis. Make sure the project documentation contains a) Problem statement, b) Algorithm of the solution, c) Analysis of the findings, and d) References.

## Abstract

TODO

### Data Preparation:

TODO

### ANN Model Building:

TODO


### Training the ANN:



### Evaluation:


## Model Architecture


## Interpretation and Conclusion



In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from bs4 import BeautifulSoup
import nltk


Importing the data set

In [3]:
df = pd.read_csv('dataset.csv')
print(df.head())


   app_id        app_name                                        review_text  \
0      10  Counter-Strike                                    Ruined my life.   
1      10  Counter-Strike  This will be more of a ''my experience with th...   
2      10  Counter-Strike                      This game saved my virginity.   
3      10  Counter-Strike  • Do you like original games? • Do you like ga...   
4      10  Counter-Strike           Easy to learn, hard to master.             

   review_score  review_votes  
0             1             0  
1             1             1  
2             1             0  
3             1             0  
4             1             1  


In [4]:

# sampling the dataset to decrease run time
sample_size = int(0.01 * len(df))
reduced_sample = df.sample(n=sample_size, random_state=42) 
print(reduced_sample.head())

print(reduced_sample.shape)


         app_id                                   app_name  \
301327    12210  Grand Theft Auto IV: The Complete Edition   
1662500  226320                        Marvel Heroes Omega   
2061157  236450           PAC-MAN Championship Edition DX+   
1171799  218620                                   PAYDAY 2   
1450080  221640                              Super Hexagon   

                                               review_text  review_score  \
301327   Best bowling simulator 2014 10/10 It has good ...             1   
1662500  Marvel characters? Check. Tons of loot? Check....             1   
2061157  This game while its not the original is defina...             1   
1171799  This game ♥♥♥♥ing awesome ,You can be professi...             1   
1450080  If you are high, play this game. 420/420 would...             1   

         review_votes  
301327              1  
1662500             0  
2061157             0  
1171799             0  
1450080             0  
(64171, 5)


## Data Preprocessing and Visualization:

In [5]:
import nltk
import string

# Specify the NLTK data path explicitly
nltk.data.path.append('C:/Users/josh/nltk_data')  # Replace with the actual path to your nltk_data directory

# Download the required NLTK data
nltk.download('punkt')
nltk.download('stopwords')


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\aaron\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\aaron\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [6]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string


In [7]:
def preprocess_text_lower(text):
    if isinstance(text, str):
        text = text.lower()
    else:
        cleaned_text = ""

    return text

# tokenize the text
def tokenize_text(text):
    if isinstance(text, str):
        # Tokenization
        tokens = word_tokenize(text)
        cleaned_text = " ".join(tokens)
    else:
        cleaned_text = ""

    return cleaned_text

def remove_punctuation(text):
    if isinstance(text, str):
        # Removing Punctuation
        tokens = word_tokenize(text)
        tokens = [word for word in tokens if word not in string.punctuation]
        cleaned_text = " ".join(tokens)
    else:
        cleaned_text = ""

    return cleaned_text

def remove_stopwords(text):
    if isinstance(text, str):
        # Tokenization
        tokens = word_tokenize(text)

        # Stop Word Removal
        stop_words = set(stopwords.words('english'))
        filtered_tokens = [word for word in tokens if word not in stop_words]

        cleaned_text = " ".join(filtered_tokens)
    else:
        cleaned_text = ""

    return cleaned_text

import re
# handleing things like 10/10
def replace_good_ratings(text):
    pattern = r'(\d+)/(\d+)'

    def replace(match):
        numerator = int(match.group(1))
        denominator = int(match.group(2))

        # Check if the numerator is not 0
        if numerator != 0:
            return 'great'
        else:
            return 'very bad'  # Replace "0/number" with "very bad"

    cleaned_text = re.sub(pattern, replace, text)

    return cleaned_text



In [8]:
# lower case the text
reduced_sample['review_text'] = df['review_text'].apply(preprocess_text_lower)
# 6 seconds

In [9]:
# tokenize the text
reduced_sample['review_text'] = reduced_sample['review_text'].apply(tokenize_text)
# 24 seconds

In [10]:
# remove punctuation
reduced_sample['review_text'] = reduced_sample['review_text'].apply(remove_punctuation)
# 31 seconds

In [11]:
# remove stopwords
reduced_sample['review_text'] = reduced_sample['review_text'].apply(remove_stopwords)
# 50 seconds

In [12]:
# replace good ratings
reduced_sample['review_text'] = reduced_sample['review_text'].apply(replace_good_ratings)


In [13]:
# Print the result (original and cleaned text for the first few rows)
print(reduced_sample.head())

         app_id                                   app_name  \
301327    12210  Grand Theft Auto IV: The Complete Edition   
1662500  226320                        Marvel Heroes Omega   
2061157  236450           PAC-MAN Championship Edition DX+   
1171799  218620                                   PAYDAY 2   
1450080  221640                              Super Hexagon   

                                               review_text  review_score  \
301327    best bowling simulator 2014 great good storyline             1   
1662500  marvel characters check tons loot check tons c...             1   
2061157  game original definately one best renditions p...             1   
1171799  game ♥♥♥♥ing awesome professional heister fun ...             1   
1450080                    high play game great would dank             1   

         review_votes  
301327              1  
1662500             0  
2061157             0  
1171799             0  
1450080             0  


In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import csr_matrix

print(reduced_sample.shape)

# 1. TF-IDF Vectorization
tfidf_vectorizer = TfidfVectorizer(max_features=5000)  # Adjust the number of features as needed
tfidf_matrix = tfidf_vectorizer.fit_transform(reduced_sample['review_text'])
tfidf_matrix = csr_matrix(tfidf_matrix)

# 2. Calculate Sentiment Scores in Batches and Append to DataFrame
batch_size = 1000  # Number of rows to process in each batch
sentiment_scores = []

for start in range(0, len(reduced_sample), batch_size):
    end = min(start + batch_size, len(reduced_sample))
    batch_tfidf_matrix = tfidf_matrix[start:end]
    batch_scores = batch_tfidf_matrix.mean(axis=1)
    sentiment_scores.extend(batch_scores)

# Add the 'sentiment_scores' column to 'reduced_sample' from the TF-IDF scores
reduced_sample['sentiment_scores'] = sentiment_scores


# slopy very bad code... 
def extract_sentiment_score(scores):
    try:
        return float(scores[0][0][0][0])
    except (IndexError, ValueError, TypeError):
        return None

reduced_sample['sentiment_scores'] = reduced_sample['sentiment_scores'].apply(extract_sentiment_score)


# Print the result
print(reduced_sample.head())

(64171, 5)
         app_id                                   app_name  \
301327    12210  Grand Theft Auto IV: The Complete Edition   
1662500  226320                        Marvel Heroes Omega   
2061157  236450           PAC-MAN Championship Edition DX+   
1171799  218620                                   PAYDAY 2   
1450080  221640                              Super Hexagon   

                                               review_text  review_score  \
301327    best bowling simulator 2014 great good storyline             1   
1662500  marvel characters check tons loot check tons c...             1   
2061157  game original definately one best renditions p...             1   
1171799  game ♥♥♥♥ing awesome professional heister fun ...             1   
1450080                    high play game great would dank             1   

         review_votes  sentiment_scores  
301327              1          0.000461  
1662500             0          0.000660  
2061157             0          0.

In [17]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Initialize Vader sentiment intensity analyzer
sia = SentimentIntensityAnalyzer()

# Function to calculate sentiment score
def calculate_sentiment_score(review):
    sentiment_dict = sia.polarity_scores(review)
    return sentiment_dict['compound']

# Calculate sentiment scores for each review
reduced_sample['vader_sentiment_score'] = reduced_sample['review_text'].apply(calculate_sentiment_score)

# Print the result
print(reduced_sample.head())

         app_id                                   app_name  \
301327    12210  Grand Theft Auto IV: The Complete Edition   
1662500  226320                        Marvel Heroes Omega   
2061157  236450           PAC-MAN Championship Edition DX+   
1171799  218620                                   PAYDAY 2   
1450080  221640                              Super Hexagon   

                                               review_text  review_score  \
301327    best bowling simulator 2014 great good storyline             1   
1662500  marvel characters check tons loot check tons c...             1   
2061157  game original definately one best renditions p...             1   
1171799  game ♥♥♥♥ing awesome professional heister fun ...             1   
1450080                    high play game great would dank             1   

         review_votes  sentiment_scores  vader_sentiment_score  
301327              1          0.000461                 0.9042  
1662500             0          0.000660 

In [19]:
from afinn import Afinn

# Initialize Afinn sentiment analyzer
afinn = Afinn()

# Function to calculate sentiment score for each word
def calculate_word_sentiment(review):
    words = review.split()
    scores = [afinn.score(word) for word in words]
    return scores

# Calculate sentiment scores for each word in each review
reduced_sample['word_sentiment_scores'] = reduced_sample['review_text'].apply(calculate_word_sentiment)

# Print the result
print(reduced_sample.head())


         app_id                                   app_name  \
301327    12210  Grand Theft Auto IV: The Complete Edition   
1662500  226320                        Marvel Heroes Omega   
2061157  236450           PAC-MAN Championship Edition DX+   
1171799  218620                                   PAYDAY 2   
1450080  221640                              Super Hexagon   

                                               review_text  review_score  \
301327    best bowling simulator 2014 great good storyline             1   
1662500  marvel characters check tons loot check tons c...             1   
2061157  game original definately one best renditions p...             1   
1171799  game ♥♥♥♥ing awesome professional heister fun ...             1   
1450080                    high play game great would dank             1   

         review_votes  sentiment_scores  vader_sentiment_score  \
301327              1          0.000461                 0.9042   
1662500             0          0.00066

## Sentiment Analysis Model

In [22]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1. Prepare your data
# Use the word sentiment scores as features
# Convert list of word sentiment scores to fixed size arrays or take mean/sum
X = reduced_sample['word_sentiment_scores'].apply(lambda x: sum(x) / len(x) if len(x) > 0 else 0).values.reshape(-1, 1)
y = reduced_sample['review_score']

# 2. Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Train the model
clf = LogisticRegression(random_state=42)
clf.fit(X_train, y_train)

# 4. Evaluate the model
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:', accuracy)


Accuracy: 0.8176860148032723


## Make Predictions

In [130]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
# Make predictions on the test set
y_pred = clf.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

# Select random samples with at least one having a review score less than 0
n_samples = 5
sample_indices = np.random.choice(X_test.shape[0], n_samples - 1, replace=False)
negative_sample_index = df[df['review_score'] < 0].sample(1, random_state=1).index

# Extract reviews, scores, and predictions
sample_reviews = df.loc[negative_sample_index.union(sample_indices), 'review_text']
sample_scores = df.loc[negative_sample_index.union(sample_indices), 'review_score']
sample_preds = clf.predict(X_test[sample_indices])

# Print the reviews, review scores, and corresponding predicted sentiments
for idx, (review, score, pred) in enumerate(zip(sample_reviews, sample_scores, sample_preds)):
    print(f"Review {idx + 1}: {review}\nReview Score: {score}\nPredicted Sentiment: {'Positive' if pred == 1 else 'Negative'}\n")

# Calculate additional performance metrics
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred)

# Print the performance metrics
print(f'Accuracy: {accuracy:.4f}')
print(f'Precision: {precision:.4f}')
print(f'Recall: {recall:.4f}')
print(f'F1 Score: {f1:.4f}')
print(f'ROC-AUC: {roc_auc:.4f}')


Review 1: My name is Commander Shepard and this game that ruined my grades
Review Score: 1
Predicted Sentiment: Positive

Review 2: EPIC and LEGENDARY game
Review Score: 1
Predicted Sentiment: Positive

Review 3: ♥♥♥♥♥♥♥♥
Review Score: 1
Predicted Sentiment: Positive

Review 4: I played this game for a few years and it made me puke constantly and have bloody diarrhea every day. 10/10 A++++++ would play again
Review Score: 1
Predicted Sentiment: Positive

Accuracy: 0.8177
Precision: 0.8245
Recall: 0.9874
F1 Score: 0.8986
ROC-AUC: 0.5203


## Evaluate the Model

## Summary

In [29]:
nltk.download('vader_lexicon')
nltk.data.path.append('C:/Users/josh/nltk_data')


[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\aaron\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
