<a href="https://colab.research.google.com/github/RidhaAnsar/Sentiment_analysis_using_Amazon_review_/blob/main/Sentiment_analysis_using_Amazon_review_dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras import models, layers, optimizers  # public Keras API instead of the private tensorflow.python path
import tensorflow
from tensorflow.keras.preprocessing.text import Tokenizer, text_to_word_sequence
from tensorflow.keras.preprocessing.sequence import pad_sequences
import re
import bz2
from sklearn.metrics import f1_score, roc_auc_score, accuracy_score

In [17]:
# Function to load the texts and labels from the train and test sets

def get_labels_and_texts(file):
    labels=[]
    texts=[]
    for line in bz2.BZ2File(file):
        x=line.decode("utf-8")
        labels.append(int(x[9])-1)
        texts.append(x[10:].strip())
    return np.array(labels), texts

train_labels, train_texts=get_labels_and_texts('/content/train.ft.txt.bz2')
test_labels, test_texts=get_labels_and_texts('/content/test.ft.txt.bz2')

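This function reads a bz2-compressed file line by line. Each line starts with a prefix such as __label__2, so the character at index 9 is the label digit; subtracting 1 turns it into a zero-based label. The rest of the line is the review text. The labels are returned as a NumPy array and the texts as a list, ready for model training and evaluation.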
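
To make the indexing concrete, here is a minimal sketch on a single made-up line in the same layout. `parse_line` and the sample string are hypothetical and not part of the notebook.

```python
def parse_line(line):
    """Split one '__label__N text' line into a zero-based label and its text."""
    # "__label__" is 9 characters long, so index 9 holds the label digit.
    label = int(line[9]) - 1
    # Everything after the digit is the review text; strip the space and newline.
    text = line[10:].strip()
    return label, text

sample = "__label__2 Great soundtrack: would recommend it to anyone\n"
label, text = parse_line(sample)
print(label, text)
```

This fixed-position indexing only works while the label is a single digit.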


In [18]:
train_labels[0]
train_texts[0]

'Stuning even for the non-gamer: This sound track was beautiful! It paints the senery in your mind so well I would recomend it even to people who hate vid. game music! I have played the game Chrono Cross but out of all of the games I have ever played it has the best music! It backs away from crude keyboarding and takes a fresher step with grate guitars and soulful orchestras. It would impress anyone who cares to listen! ^_^'

In [19]:
train_labels=train_labels[0:500]
train_texts=train_texts[500:0]  # note: start 500 is past stop 0, so this slice is empty; [0:500] was likely intended

In [20]:
import re
NON_ALPHANUM = re.compile(r'[\W]')
NON_ASCII = re.compile(r'[^a-z0-9\s]')  # keep lowercase letters, digits, and whitespace

def normalize_texts(texts):
    normalized_texts = []
    for text in texts:
        lower = text.lower()
        no_punctuation = NON_ALPHANUM.sub(r' ', lower)
        no_non_ascii = NON_ASCII.sub(r'', no_punctuation)
        normalized_texts.append(no_non_ascii)
    return normalized_texts

train_texts = normalize_texts(train_texts)
test_texts = normalize_texts(test_texts)


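This function takes a list of texts and normalizes each one in three steps: it lowercases the text, replaces every non-word character (punctuation and symbols) with a space, and then deletes any remaining character that the NON_ASCII pattern does not allow, such as accented letters and underscores.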
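
A minimal sketch of those three steps on one made-up string. The names `NON_WORD`, `NOT_ALLOWED` and `normalize` are hypothetical, and the digit range 0-9 is an assumption about what the pattern is meant to keep.

```python
import re

NON_WORD = re.compile(r'[\W]')
NOT_ALLOWED = re.compile(r'[^a-z0-9\s]')  # assumption: digits 0-9 are meant to be kept

def normalize(text):
    lower = text.lower()
    spaced = NON_WORD.sub(' ', lower)    # punctuation and symbols become spaces
    return NOT_ALLOWED.sub('', spaced)   # accented letters, underscores, etc. are dropped

result = normalize("Café, GREAT!!")
print(repr(result))
```

The accented letter survives the first substitution because Python treats it as a word character; only the second pattern removes it.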

In [21]:
train_texts[0]


IndexError: list index out of range

In [22]:
print(len(train_texts))

0
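
The empty list traces back to cell 19: the slice `train_texts[500:0]` starts at index 500 and stops at index 0, which selects nothing. A minimal illustration on a stand-in list (`items` is hypothetical):

```python
items = list(range(1000))

empty = items[500:0]       # start is past stop with a positive step, so nothing is selected
first_500 = items[0:500]   # the first 500 elements

print(len(empty), len(first_500))
```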


In [23]:
import os
print("File exists:", os.path.exists('train.ft.txt.bz2'))


File exists: True


In [24]:
import bz2

def test_read_file(file):
    try:
        with bz2.BZ2File(file) as f:
            for i, line in enumerate(f):
                print("Line", i + 1, ":", line)
                if i >= 4:  # Limit output to the first 5 lines for testing
                    break
    except Exception as e:
        print("Error reading file:", e)

test_read_file('train.ft.txt.bz2')


Line 1 : b'__label__2 Stuning even for the non-gamer: This sound track was beautiful! It paints the senery in your mind so well I would recomend it even to people who hate vid. game music! I have played the game Chrono Cross but out of all of the games I have ever played it has the best music! It backs away from crude keyboarding and takes a fresher step with grate guitars and soulful orchestras. It would impress anyone who cares to listen! ^_^\n'
Line 2 : b"__label__2 The best soundtrack ever to anything.: I'm reading a lot of reviews saying that this is the best 'game soundtrack' and I figured that I'd write a review to disagree a bit. This in my opinino is Yasunori Mitsuda's ultimate masterpiece. The music is timeless and I'm been listening to it for years now and its beauty simply refuses to fade.The price tag on this is pretty staggering I must say, but if you are going to buy any cd for this much money, this is the only one that I feel would be worth every penny.\n"
Line 3 : b'__

In [25]:
def test_decode_line(file):
    try:
        with bz2.BZ2File(file) as f:
            line = next(f)  # Read the first line
            decoded_line = line.decode("utf-8")
            print("Decoded line:", decoded_line)
    except Exception as e:
        print("Error decoding line:", e)

test_decode_line('train.ft.txt.bz2')


Decoded line: __label__2 Stuning even for the non-gamer: This sound track was beautiful! It paints the senery in your mind so well I would recomend it even to people who hate vid. game music! I have played the game Chrono Cross but out of all of the games I have ever played it has the best music! It backs away from crude keyboarding and takes a fresher step with grate guitars and soulful orchestras. It would impress anyone who cares to listen! ^_^



In [26]:
import bz2
import numpy as np

def get_labels_and_texts(file):
    labels = []
    texts = []
    try:
        with bz2.BZ2File(file) as f:
            for line in f:
                x = line.decode("utf-8").strip()  # Decode and strip whitespace
                label, text = x.split(' ', 1)  # Split into label and text
                labels.append(int(label.split('__label__')[1]) - 1)  # Extract and convert label to int
                texts.append(text)
    except Exception as e:
        print(f"Error processing file: {e}")
    return np.array(labels), texts

train_labels, train_texts = get_labels_and_texts('train.ft.txt.bz2')
print("Number of training texts:", len(train_texts))
print("First training text:", train_texts[0])

Number of training texts: 3600000
First training text: Stuning even for the non-gamer: This sound track was beautiful! It paints the senery in your mind so well I would recomend it even to people who hate vid. game music! I have played the game Chrono Cross but out of all of the games I have ever played it has the best music! It backs away from crude keyboarding and takes a fresher step with grate guitars and soulful orchestras. It would impress anyone who cares to listen! ^_^


In [27]:
train_texts[0]

'Stuning even for the non-gamer: This sound track was beautiful! It paints the senery in your mind so well I would recomend it even to people who hate vid. game music! I have played the game Chrono Cross but out of all of the games I have ever played it has the best music! It backs away from crude keyboarding and takes a fresher step with grate guitars and soulful orchestras. It would impress anyone who cares to listen! ^_^'

In [28]:
import re
NON_ALPHANUM = re.compile(r'[\W]')
NON_ASCII = re.compile(r'[^a-z0-9\s]')  # keep lowercase letters, digits, and whitespace

def normalize_texts(texts):
    normalized_texts = []
    for text in texts:
        lower = text.lower()
        no_punctuation = NON_ALPHANUM.sub(r' ', lower)
        no_non_ascii = NON_ASCII.sub(r'', no_punctuation)
        normalized_texts.append(no_non_ascii)
    return normalized_texts

train_texts = normalize_texts(train_texts)
test_texts = normalize_texts(test_texts)


In [32]:
train_texts[0]

'stuning even for the non gamer  this sound track was beautiful  it paints the senery in your mind so well i would recomend it even to people who hate vid  game music  i have played the game chrono cross but out of all of the games i have ever played it has the best music  it backs away from crude keyboarding and takes a fresher step with grate guitars and soulful orchestras  it would impress anyone who cares to listen    '

In [34]:
from sklearn.feature_extraction.text import CountVectorizer
cv=CountVectorizer(binary=True)
cv.fit(train_texts)
x=cv.transform(train_texts)
x_test=cv.transform(test_texts)


KeyboardInterrupt: 

The cell above was interrupted manually (hence the KeyboardInterrupt): fitting CountVectorizer on all 3,600,000 training reviews took longer than expected because of the size of the dataset.
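
One way to keep the run time manageable is to vectorize only a subset of the reviews. The sketch below is an assumption rather than the notebook's own approach: it draws a random subset while keeping texts and labels aligned. `sample_aligned` and the stand-in data are hypothetical.

```python
import random

def sample_aligned(texts, labels, n, seed=0):
    """Return n randomly chosen texts together with their matching labels."""
    rng = random.Random(seed)                 # fixed seed so the subset is repeatable
    idx = rng.sample(range(len(texts)), n)    # n distinct positions
    return [texts[i] for i in idx], [labels[i] for i in idx]

# Stand-in data: the label is 1 for even-numbered reviews, 0 otherwise
texts = ["review %d" % i for i in range(100)]
labels = [1 if i % 2 == 0 else 0 for i in range(100)]

sample_texts, sample_labels = sample_aligned(texts, labels, n=10)
print(len(sample_texts), len(sample_labels))
```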

In [37]:
# Sampling: fit the vectorizer on a small subset so it finishes in reasonable time
from sklearn.feature_extraction.text import CountVectorizer


cv = CountVectorizer(binary=True, max_features=5000, min_df=0.01)


train_sample = train_texts[:1000]  # Taking a smaller sample for testing

# Fit and transform
cv.fit(train_sample)
x_train = cv.transform(train_sample)
x_test = cv.transform(test_texts)


In [41]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

x_train, x_val, y_train, y_val=train_test_split(x_train, train_labels[:1000], train_size=0.75)

In [43]:
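# Compare validation accuracy for several regularization strengths (smaller C = stronger regularization)
for c in [0.01, 0.05, 0.25, 0.5, 1]:
    model = LogisticRegression(C=c)
    model.fit(x_train, y_train)
    print("accuracy for c=%s: %s" % (c, accuracy_score(y_val, model.predict(x_val))))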

accuracy for c=0.01: 0.808
accuracy for c=0.05: 0.796
accuracy for c=0.25: 0.816
accuracy for c=0.5: 0.816
accuracy for c=1: 0.792
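
Because the loop overwrites `model` on every pass, the predictions that follow come from the C=1 fit rather than the best-scoring one. A minimal sketch of picking the best value from the accuracies printed above. `scores` and `best_c` are hypothetical names, and the numbers are copied from this run, which will vary because the split is not seeded.

```python
# Validation accuracies copied from the output above (C value -> accuracy)
scores = {0.01: 0.808, 0.05: 0.796, 0.25: 0.816, 0.5: 0.816, 1: 0.792}

# max() returns the first key with the highest accuracy, so ties go to the smaller C
best_c = max(scores, key=scores.get)
print("best C:", best_c, "accuracy:", scores[best_c])
```

Refitting `LogisticRegression(C=best_c)` on `x_train` before calling `predict` would make the later cells use that model.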


In [45]:
model.predict(x_test[29])

array([0])

In [46]:
test_texts[29]

'three days of use and it broke  very disappointed in this product  it worked perfectly for exactly three days and could not be resuscitated  it was very inexpensive so i did not want to pay half again the price to ship it back for an exchange  so the company would do nothing when they sent me an inquiry as to product satisfaction '

In [49]:
model.predict(x_test[1])

array([1])

In [50]:
test_texts[1]

'one of the best game music soundtracks   for a game i didn t really play  despite the fact that i have only played a small portion of the game  the music i heard  plus the connection to chrono trigger which was great as well  led me to purchase the soundtrack  and it remains one of my favorite albums  there is an incredible mix of fun  epic  and emotional songs  those sad and beautiful tracks i especially like  as there s not too many of those kinds of songs in my other video game soundtracks  i must admit that one of the songs  life a distant promise  has brought tears to my eyes on many occasions my one complaint about this soundtrack is that they use guitar fretting effects in many of the songs  which i find distracting  but even if those weren t included i would still consider the collection worth it '

In [60]:
model.predict(x_test[47])

array([0])

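In this dataset the labels are binary: class 0 (from __label__1) is a negative review and class 1 (from __label__2) is a positive review. There is no neutral class. This matches the predictions above: the broken-product review is classified as 0 and the soundtrack review as 1.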
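
A minimal sketch of turning a prediction into a readable name, assuming the binary mapping produced by `get_labels_and_texts` (0 for `__label__1`, 1 for `__label__2`). `LABEL_NAMES` and `describe` are hypothetical.

```python
LABEL_NAMES = {0: "negative", 1: "positive"}  # hypothetical lookup table

def describe(prediction):
    """Map a one-element prediction such as array([0]) or [0] to a readable name."""
    return LABEL_NAMES[int(prediction[0])]

print(describe([0]))
print(describe([1]))
```

With the notebook's model this could be called as `describe(model.predict(x_test[29]))`.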