<a href="https://colab.research.google.com/github/SudhakarShivashankar/MachineLearningSamples/blob/master/Amazon_Reviews.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Amazon Reviews dataset
https://www.kaggle.com/datasets/kritanjalijain/amazon-reviews

The full dataset contains 34,686,770 Amazon reviews from 6,643,669 users on 2,441,053 products, collected by the Stanford Network Analysis Project (SNAP). This subset contains 1,800,000 training samples and 200,000 testing samples for each polarity class.

The dataset contains three columns: polarity, title, and text. They correspond to the class index (1 or 2), the review title, and the review text.

polarity - 1 for negative and 2 for positive
title - review heading
text - review body

The review title and text are enclosed in double quotes ("), and any internal double quote is escaped by two double quotes (""). New lines are escaped as a backslash followed by an "n" character, that is "\n".
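
As a rough illustration of the quoting rules above, the sketch below parses one line with Python's standard csv module and then turns the literal "\n" sequence back into a real newline. The sample line and the helper name parse_review_line are made up for this example; they are not taken from the dataset.

```python
import csv

def parse_review_line(line):
    """Parse one raw CSV line into (polarity, title, text).

    Hypothetical helper, for illustration only.
    """
    # csv.reader handles the surrounding quotes and the doubled "" escape
    polarity, title, text = next(csv.reader([line]))
    # The files store a newline as the two characters backslash + n
    title = title.replace("\\n", "\n")
    text = text.replace("\\n", "\n")
    return int(polarity), title, text

# Made-up sample line in the documented format (raw string keeps \n literal)
sample = r'"2","He said ""wow""","Line one\nLine two"'
polarity, title, text = parse_review_line(sample)
print(polarity, repr(title), repr(text))
```

pandas.read_csv applies the same quote handling by default, but it leaves the "\n" sequences as they are, so the replace step would still be needed if real line breaks matter.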

CITATION
The Amazon reviews polarity dataset is constructed by Xiang Zhang (xiang.zhang@nyu.edu). It is used as a text classification benchmark in the following paper: Xiang Zhang, Junbo Zhao, Yann LeCun. Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28 (NIPS 2015).

In [1]:
!pip install kaggle



In [2]:
!kaggle datasets download kritanjalijain/amazon-reviews

Dataset URL: https://www.kaggle.com/datasets/kritanjalijain/amazon-reviews
License(s): CC0-1.0
Downloading amazon-reviews.zip to /content
 99% 1.28G/1.29G [00:17<00:00, 146MB/s]
100% 1.29G/1.29G [00:17<00:00, 80.7MB/s]


In [3]:
!unzip amazon-reviews.zip

Archive:  amazon-reviews.zip
  inflating: amazon_review_polarity_csv.tgz  
  inflating: test.csv                
  inflating: train.csv               


Load the dataset into pandas

In [5]:
!tar -xvzf /content/amazon_review_polarity_csv.tgz

amazon_review_polarity_csv/
amazon_review_polarity_csv/train.csv
amazon_review_polarity_csv/readme.txt
amazon_review_polarity_csv/test.csv


In [9]:
import pandas as pd

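# Adjust the file path and file name according to the extracted file.
# Note: train.csv has no header row, so the default header=0 turns the first
# review into column names (visible in the output below); passing header=None
# to read_csv would keep that row as data.
df = pd.read_csv('train.csv')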

# Preview the data
df.head()

Unnamed: 0,2,Stuning even for the non-gamer,This sound track was beautiful! It paints the senery in your mind so well I would recomend it even to people who hate vid. game music! I have played the game Chrono Cross but out of all of the games I have ever played it has the best music! It backs away from crude keyboarding and takes a fresher step with grate guitars and soulful orchestras. It would impress anyone who cares to listen! ^_^
0,2,The best soundtrack ever to anything.,I'm reading a lot of reviews saying that this ...
1,2,Amazing!,This soundtrack is my favorite music of all ti...
2,2,Excellent Soundtrack,I truly like this soundtrack and I enjoy video...
3,2,"Remember, Pull Your Jaw Off The Floor After He...","If you've played the game, you know how divine..."
4,2,an absolute masterpiece,I am quite sure any of you actually taking the...


In [12]:
# Get and display column names
columns = list(df.columns)
print(columns)

['2', 'Stuning even for the non-gamer', 'This sound track was beautiful! It paints the senery in your mind so well I would recomend it even to people who hate vid. game music! I have played the game Chrono Cross but out of all of the games I have ever played it has the best music! It backs away from crude keyboarding and takes a fresher step with grate guitars and soulful orchestras. It would impress anyone who cares to listen! ^_^']


In [14]:
# Rename the columns
df.columns = ['polarity', 'review_title', 'review_text']

# Check the DataFrame to ensure the renaming was successful
df.head()

Unnamed: 0,polarity,review_title,review_text
0,2,The best soundtrack ever to anything.,I'm reading a lot of reviews saying that this ...
1,2,Amazing!,This soundtrack is my favorite music of all ti...
2,2,Excellent Soundtrack,I truly like this soundtrack and I enjoy video...
3,2,"Remember, Pull Your Jaw Off The Floor After He...","If you've played the game, you know how divine..."
4,2,an absolute masterpiece,I am quite sure any of you actually taking the...
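
Because train.csv has no header row, reading it with the default settings and renaming afterwards drops the first review from the data. A minimal sketch of an alternative, assuming the same three-column layout, is to pass header=None and names to read_csv. The two sample rows below are invented so the snippet runs without the 1.29 GB download.

```python
import io
import pandas as pd

# Two made-up rows in the same headerless format as train.csv
sample_csv = io.StringIO(
    '"2","Great album","Loved every track."\n'
    '"1","Disappointing","Stopped working after a week."\n'
)

# header=None keeps the first line as data; names supplies the column labels
sample_df = pd.read_csv(
    sample_csv,
    header=None,
    names=["polarity", "review_title", "review_text"],
)
print(sample_df)
```

For the real file, sample_csv would be replaced by 'train.csv'. The optional nrows argument of read_csv can cap how many rows are read, which may help when experimenting on a memory-limited Colab runtime.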


Topic Modeling and Sentiment Analysis

In [15]:
!pip install gensim
!pip install nltk
!pip install textblob



In [16]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
from gensim import corpora
from gensim.models.ldamodel import LdaModel
from textblob import TextBlob

# Download NLTK data
nltk.download('stopwords')
# Use a set so each membership test in preprocess() is O(1) instead of a list scan
stop_words = set(stopwords.words('english'))

# Preprocess the review text
def preprocess(text):
    # Convert to lowercase and remove stopwords
    text = text.lower().split()
    text = [word for word in text if word not in stop_words]
    return ' '.join(text)

# Apply preprocessing to the 'review_text' column
df['processed_review_text'] = df['review_text'].apply(preprocess)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [17]:
df.head()

Unnamed: 0,polarity,review_title,review_text,processed_review_text
0,2,The best soundtrack ever to anything.,I'm reading a lot of reviews saying that this ...,i'm reading lot reviews saying best 'game soun...
1,2,Amazing!,This soundtrack is my favorite music of all ti...,"soundtrack favorite music time, hands down. in..."
2,2,Excellent Soundtrack,I truly like this soundtrack and I enjoy video...,truly like soundtrack enjoy video game music. ...
3,2,"Remember, Pull Your Jaw Off The Floor After He...","If you've played the game, you know how divine...","played game, know divine music is! every singl..."
4,2,an absolute masterpiece,I am quite sure any of you actually taking the...,quite sure actually taking time read played ga...
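
The processed text above still carries punctuation ("time,", "is!") because the reviews are split on whitespace only, so "time," and "time" would count as different tokens later on. One possible refinement, sketched here under assumptions, is to tokenize with a regular expression. SAMPLE_STOP_WORDS is a tiny stand-in for the NLTK list, and preprocess_tokens is a hypothetical name, not part of the notebook.

```python
import re

# Tiny stand-in stop-word set for illustration; the notebook uses NLTK's English list
SAMPLE_STOP_WORDS = {"this", "is", "my", "of", "all", "the", "a", "and", "to", "it"}

# Runs of letters, optionally with one internal apostrophe (e.g. "i'm")
TOKEN_PATTERN = re.compile(r"[a-z]+(?:'[a-z]+)?")

def preprocess_tokens(text, stop_words=SAMPLE_STOP_WORDS):
    """Lowercase, drop punctuation and digits, and remove stop words."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [token for token in tokens if token not in stop_words]

print(preprocess_tokens("This soundtrack is my favorite music of all time, hands down."))
```

Returning a list of tokens rather than a joined string would also avoid splitting the text a second time before building the gensim dictionary.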


In [1]:
from gensim.corpora.dictionary import Dictionary
from gensim.models import LdaModel

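# This cell needs df['processed_review_text'] from the preprocessing cell above.
# After a runtime restart, re-run the earlier cells first; otherwise df is undefined.

# Tokenize the reviews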
reviews_tokenized = df['processed_review_text'].apply(lambda x: x.split())

# Create a dictionary and corpus for LDA
dictionary = Dictionary(reviews_tokenized)
corpus = [dictionary.doc2bow(review) for review in reviews_tokenized]

# Train LDA model
lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=5, passes=10)

# Function to get the dominant topic for each review
def get_dominant_topic(review):
    bow = dictionary.doc2bow(review.split())
    topics = lda_model.get_document_topics(bow)
    # Sort topics by probability
    topics = sorted(topics, key=lambda x: x[1], reverse=True)
    if topics:
        return topics[0][0]  # Return the most probable topic
    return None

# Add a new column for the dominant topic in each review
df['dominant_topic'] = df['processed_review_text'].apply(get_dominant_topic)

NameError: name 'df' is not defined
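
The selection step inside get_dominant_topic can be tried in isolation, without a trained model. gensim's get_document_topics returns a list of (topic_id, probability) pairs, so the sketch below works on a hand-written list of that shape. The helper name dominant_topic and the probabilities are made up for illustration.

```python
def dominant_topic(topic_probs):
    """Return the topic id with the highest probability, or None if the list is empty.

    topic_probs: list of (topic_id, probability) pairs, the shape that
    LdaModel.get_document_topics returns.
    """
    if not topic_probs:
        return None
    # max() finds the best pair in one pass, so no full sort is needed
    topic_id, _ = max(topic_probs, key=lambda pair: pair[1])
    return topic_id

# Made-up topic distribution for a single review
example = [(0, 0.05), (3, 0.72), (4, 0.23)]
print(dominant_topic(example))
```

Using max instead of sorted gives the same answer as the cell above while skipping the sort, which could add up when the function is applied to millions of reviews.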