# Natural Language Processing Application: Sentimental Analysis on Steam Reviews (Possibly?)

## Team

* Gabriel Aracena
* Joshua Canode
* Aaron Galicia

### Project Description

A key area of knowledge in data analytics is the ability to extract meaning from text. This assignment provides the foundational skills in this area by detecting whether a text conveys a positive or negative message.

Analyze the sentiment (e.g., negative, neutral, positive) conveyed in a large body (corpus) of texts using the NLTK package in Python. Complete the steps below. Then, write a comprehensive technical report as a Python Jupyter notebook to include all code, code comments, all outputs, plots, and analysis. Make sure the project documentation contains a) Problem statement, b) Algorithm of the solution, c) Analysis of the findings, and d) References.

## Abstract

TODO

### Data Preparation:

TODO

### ANN Model Building:

TODO


### Training the ANN:



### Evaluation:


## Model Architecture


## Interpretation and Conclusion



In [18]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from bs4 import BeautifulSoup
import nltk


Importing the data set

In [30]:
df = pd.read_csv('dataset.csv')
print(df.head())


   app_id        app_name                                        review_text  \
0      10  Counter-Strike                                    Ruined my life.   
1      10  Counter-Strike  This will be more of a ''my experience with th...   
2      10  Counter-Strike                      This game saved my virginity.   
3      10  Counter-Strike  • Do you like original games? • Do you like ga...   
4      10  Counter-Strike           Easy to learn, hard to master.             

   review_score  review_votes  
0             1             0  
1             1             1  
2             1             0  
3             1             0  
4             1             1  
         app_id                                   app_name  \
301327    12210  Grand Theft Auto IV: The Complete Edition   
1662500  226320                        Marvel Heroes Omega   
2061157  236450           PAC-MAN Championship Edition DX+   
1171799  218620                                   PAYDAY 2   
1450080  221640  

In [136]:

# sampling the dataset to decrease run time
sample_size = int(0.01 * len(df))
reduced_sample = df.sample(n=sample_size, random_state=42) 
print(reduced_sample.head())

print(reduced_sample.shape)


         app_id                                   app_name  \
301327    12210  Grand Theft Auto IV: The Complete Edition   
1662500  226320                        Marvel Heroes Omega   
2061157  236450           PAC-MAN Championship Edition DX+   
1171799  218620                                   PAYDAY 2   
1450080  221640                              Super Hexagon   

                                               review_text  review_score  \
301327   Best bowling simulator 2014 10/10 It has good ...             1   
1662500  Marvel characters? Check. Tons of loot? Check....             1   
2061157  This game while its not the original is defina...             1   
1171799  This game ♥♥♥♥ing awesome ,You can be professi...             1   
1450080  If you are high, play this game. 420/420 would...             1   

         review_votes  
301327              1  
1662500             0  
2061157             0  
1171799             0  
1450080             0  
(64171, 5)


## Data Preprocessing and Visualization:

In [137]:
import nltk
import string

# Specify the NLTK data path explicitly
nltk.data.path.append('C:/Users/josh/nltk_data')  # Replace with the actual path to your nltk_data directory

# Download the required NLTK data
nltk.download('punkt')
nltk.download('stopwords')


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\josh\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\josh\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [138]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string


In [139]:
def preprocess_text_lower(text):
    if isinstance(text, str):
        text = text.lower()
    else:
        cleaned_text = ""

    return text

# tokenize the text
def tokenize_text(text):
    if isinstance(text, str):
        # Tokenization
        tokens = word_tokenize(text)
        cleaned_text = " ".join(tokens)
    else:
        cleaned_text = ""

    return cleaned_text

def remove_punctuation(text):
    if isinstance(text, str):
        # Removing Punctuation
        tokens = word_tokenize(text)
        tokens = [word for word in tokens if word not in string.punctuation]
        cleaned_text = " ".join(tokens)
    else:
        cleaned_text = ""

    return cleaned_text

def remove_stopwords(text):
    if isinstance(text, str):
        # Tokenization
        tokens = word_tokenize(text)

        # Stop Word Removal
        stop_words = set(stopwords.words('english'))
        filtered_tokens = [word for word in tokens if word not in stop_words]

        cleaned_text = " ".join(filtered_tokens)
    else:
        cleaned_text = ""

    return cleaned_text

import re
# handleing things like 10/10
def replace_good_ratings(text):
    pattern = r'(\d+)/(\d+)'

    def replace(match):
        numerator = int(match.group(1))
        denominator = int(match.group(2))

        # Check if the numerator is not 0
        if numerator != 0:
            return 'great'
        else:
            return 'very bad'  # Replace "0/number" with "very bad"

    cleaned_text = re.sub(pattern, replace, text)

    return cleaned_text



In [140]:
# lower case the text
reduced_sample['review_text'] = df['review_text'].apply(preprocess_text_lower)
# 6 seconds

In [141]:
# tokenize the text
reduced_sample['review_text'] = reduced_sample['review_text'].apply(tokenize_text)
# 24 seconds

In [142]:
# remove punctuation
reduced_sample['review_text'] = reduced_sample['review_text'].apply(remove_punctuation)
# 31 seconds

In [143]:
# remove stopwords
reduced_sample['review_text'] = reduced_sample['review_text'].apply(remove_stopwords)
# 50 seconds

In [144]:
# replace good ratings
reduced_sample['review_text'] = reduced_sample['review_text'].apply(replace_good_ratings)


In [145]:
# Print the result (original and cleaned text for the first few rows)
print(reduced_sample.head())

         app_id                                   app_name  \
301327    12210  Grand Theft Auto IV: The Complete Edition   
1662500  226320                        Marvel Heroes Omega   
2061157  236450           PAC-MAN Championship Edition DX+   
1171799  218620                                   PAYDAY 2   
1450080  221640                              Super Hexagon   

                                               review_text  review_score  \
301327    best bowling simulator 2014 great good storyline             1   
1662500  marvel characters check tons loot check tons c...             1   
2061157  game original definately one best renditions p...             1   
1171799  game ♥♥♥♥ing awesome professional heister fun ...             1   
1450080                    high play game great would dank             1   

         review_votes  
301327              1  
1662500             0  
2061157             0  
1171799             0  
1450080             0  


In [151]:
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import csr_matrix

print(reduced_sample.shape)

# 1. TF-IDF Vectorization
tfidf_vectorizer = TfidfVectorizer(max_features=5000)  # Adjust the number of features as needed
tfidf_matrix = tfidf_vectorizer.fit_transform(reduced_sample['review_text'])
tfidf_matrix = csr_matrix(tfidf_matrix)

# 2. Calculate Sentiment Scores in Batches and Append to DataFrame
batch_size = 1000  # Number of rows to process in each batch
sentiment_scores = []

for start in range(0, len(reduced_sample), batch_size):
    end = min(start + batch_size, len(reduced_sample))
    batch_tfidf_matrix = tfidf_matrix[start:end]
    batch_scores = batch_tfidf_matrix.mean(axis=1)
    sentiment_scores.extend(batch_scores)

# Add the 'sentiment_scores' column to 'reduced_sample' from the TF-IDF scores
reduced_sample['sentiment_scores'] = sentiment_scores

# Print the result
print(reduced_sample.head())

(64171, 6)
         app_id                                   app_name  \
301327    12210  Grand Theft Auto IV: The Complete Edition   
1662500  226320                        Marvel Heroes Omega   
2061157  236450           PAC-MAN Championship Edition DX+   
1171799  218620                                   PAYDAY 2   
1450080  221640                              Super Hexagon   

                                               review_text  review_score  \
301327    best bowling simulator 2014 great good storyline             1   
1662500  marvel characters check tons loot check tons c...             1   
2061157  game original definately one best renditions p...             1   
1171799  game ♥♥♥♥ing awesome professional heister fun ...             1   
1450080                    high play game great would dank             1   

         review_votes      sentiment_scores  
301327              1  [[[[[0.00046087]]]]]  
1662500             0  [[[[[0.00065986]]]]]  
2061157             0

In [None]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string
from nltk.stem import WordNetLemmatizer
from bs4 import BeautifulSoup
from spellchecker import SpellChecker

def preprocess_text2(text):
    if isinstance(text, str):
        # Lowercasing
        text = text.lower()

        # Tokenization
        tokens = word_tokenize(text)

        # Removing Punctuation
        tokens = [word for word in tokens if word not in string.punctuation]

        # Stop Word Removal
        stop_words = set(stopwords.words('english'))
        filtered_tokens = [word for word in tokens if word not in stop_words]

        # Lemmatization
        lemmatizer = WordNetLemmatizer()
        lemmatized_tokens = [lemmatizer.lemmatize(token, pos="v") for token in filtered_tokens]

        # Join the filtered and lemmatized tokens back into a single string
        cleaned_text = " ".join(lemmatized_tokens)

        # HTML tag removal
        soup = BeautifulSoup(cleaned_text, "html.parser")
        cleaned_text = soup.get_text()

        # Replacing emoticons and abbreviations
        replacements = {
            ":)": "smile",
            "lol": "laugh out loud",
            # Add more replacements as needed
        }
        for key, value in replacements.items():
            cleaned_text = cleaned_text.replace(key, value)

        # Spell checking
        spell = SpellChecker()
        words = cleaned_text.split()
        corrected_words = [spell.correction(word) for word in words]
        cleaned_text = " ".join(corrected_words)
    else:
        # Handle missing values (NaN or non-string values)
        cleaned_text = ""

    return cleaned_text

# Apply the preprocessing function to the 'review_text' column and create a new column 'preprocessed_text2'
df['preprocessed_text2'] = df['review_text'].apply(preprocess_text2)

# Print the result (original and cleaned text for the first few rows)
print(df[['review_text', 'preprocessed_text2']].head())


## Sentiment Analysis Model

## Make Predictions

## Evaluate the Model

## Summary

In [11]:
import nltk
from nltk.chunk import ChunkParserI
from nltk.chunk.util import conlltags2tree, tree2conlltags
from nltk.tag import UnigramTagger, BigramTagger
from nltk.sentiment import SentimentIntensityAnalyzer

import nltk
nltk.data.path.append('C:\\Users\\josh/nltk_data')

nltk.download('vader_lexicon')

class NEChunkParser(ChunkParserI):
    def __init__(self, train_sents):
        train_data = [[(t, c) for w, t, c in sent] for sent in train_sents]
        self.tagger = BigramTagger(train_data, backoff=UnigramTagger(train_data))
        self.sentiment_analyzer = SentimentIntensityAnalyzer()

    def parse(self, sentence):
        pos_tags = [pos for (word, pos) in sentence]
        tagged_pos_tags = self.tagger.tag(pos_tags)
        chunktags = [chunktag if chunktag is not None else 'O' for (pos, chunktag) in tagged_pos_tags]
        conlltags = [(word, pos, chunktag) for ((word, pos), chunktag) in zip(sentence, chunktags)]
        return conlltags2tree(conlltags)

    def analyze_sentiment(self, sentence):
        sentence_text = ' '.join(word for word, pos in sentence)
        score = self.sentiment_analyzer.polarity_scores(sentence_text)
        return score

# Expanded sample training data
train_sents = [
    [('James', 'NNP', 'B-PERSON'), ('works', 'VBZ', 'O'), ('in', 'IN', 'O'), ('Intel', 'NNP', 'B-ORG')],
    [('Mary', 'NNP', 'B-PERSON'), ('lives', 'VBZ', 'O'), ('in', 'IN', 'O'), ('New York', 'NNP', 'B-LOC')],
    [('Google', 'NNP', 'B-ORG'), ('is', 'VBZ', 'O'), ('a', 'DT', 'O'), ('technology', 'NN', 'O'), ('company', 'NN', 'O')],
    [('Barack', 'NNP', 'B-PERSON'), ('Obama', 'NNP', 'I-PERSON'), ('was', 'VBD', 'O'), ('the', 'DT', 'O'), ('president', 'NN', 'O'), ('of', 'IN', 'O'), ('the', 'DT', 'O'), ('United States', 'NNP', 'B-LOC')],
    [('The', 'DT', 'O'), ('Eiffel', 'NNP', 'B-LOC'), ('Tower', 'NNP', 'I-LOC'), ('is', 'VBZ', 'O'), ('in', 'IN', 'O'), ('Paris', 'NNP', 'B-LOC')],
    [('Apple', 'NNP', 'B-ORG'), ('produces', 'VBZ', 'O'), ('the', 'DT', 'O'), ('iPhone', 'NN', 'O')]
]

chunker = NEChunkParser(train_sents)

# Test the custom NER
test_sent = [('James', 'NNP'), ('is', 'VBZ'), ('from', 'IN'), ('Intel', 'NNP')]
parsed_tree = chunker.parse(test_sent)
print("Named Entity Recognition:")
print(parsed_tree)

# Test sentiment analysis
test_sentence = [('James', 'NNP'), ('loves', 'VBZ'), ('working', 'VBG'), ('at', 'IN'), ('Intel', 'NNP')]
sentiment_score = chunker.analyze_sentiment(test_sentence)
print("\nSentiment Analysis:")
print("Sentence: ", " ".join(word for word, pos in test_sentence))
print("Positive Sentiment: ", sentiment_score['pos'])
print("Negative Sentiment: ", sentiment_score['neg'])
print("Neutral Sentiment: ", sentiment_score['neu'])
print("Compound Sentiment: ", sentiment_score['compound'])


Named Entity Recognition:
(S (PERSON James/NNP) is/VBZ from/IN (LOC Intel/NNP))

Sentiment Analysis:
Sentence:  James loves working at Intel
Positive Sentiment:  0.481
Negative Sentiment:  0.0
Neutral Sentiment:  0.519
Compound Sentiment:  0.5719


[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\josh\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


LookupError: 
**********************************************************************
  Resource [93mpunkt[0m not found.
  Please use the NLTK Downloader to obtain the resource:

  [31m>>> import nltk
  >>> nltk.download('punkt')
  [0m
  For more information see: https://www.nltk.org/data.html

  Attempted to load [93mtokenizers/punkt/english.pickle[0m

  Searched in:
    - 'C:\\Users\\josh/nltk_data'
    - 'C:\\Program Files\\WindowsApps\\PythonSoftwareFoundation.Python.3.9_3.9.3568.0_x64__qbz5n2kfra8p0\\nltk_data'
    - 'C:\\Program Files\\WindowsApps\\PythonSoftwareFoundation.Python.3.9_3.9.3568.0_x64__qbz5n2kfra8p0\\share\\nltk_data'
    - 'C:\\Program Files\\WindowsApps\\PythonSoftwareFoundation.Python.3.9_3.9.3568.0_x64__qbz5n2kfra8p0\\lib\\nltk_data'
    - 'C:\\Users\\josh\\AppData\\Roaming\\nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
    - 'C:\\Users\\josh/nltk_data'
    - 'C:\\Users\\josh/nltk_data'
    - 'C:/Users/josh/nltk_data'
    - ''
**********************************************************************


In [14]:
nltk.download('vader_lexicon')
nltk.data.path.append('C:/Users/josh/nltk_data')


[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\josh\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
