# **Amazon Sentiment Analysis**

## Install Required Library for Data

In [None]:
!pip install datasets

## **Importing Required Libraries**

In [3]:
from datasets import load_dataset
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report, accuracy_score

warnings.filterwarnings('ignore')

## Loading the Amazon Polarity Dataset

In [None]:
dataset = load_dataset("amazon_polarity")
dataset

DatasetDict({
    train: Dataset({
        features: ['label', 'title', 'content'],
        num_rows: 3600000
    })
    test: Dataset({
        features: ['label', 'title', 'content'],
        num_rows: 400000
    })
})

### Converting the Dataset Into a DataFrame

In [None]:
# Dataset.to_pandas() converts each split directly to a pandas DataFrame
df_train = dataset['train'].to_pandas()
df_test = dataset['test'].to_pandas()

In [None]:
df = pd.concat([df_train,df_test],ignore_index=True)

In [None]:
df.head()

Unnamed: 0,label,title,content
0,1,Stuning even for the non-gamer,This sound track was beautiful! It paints the ...
1,1,The best soundtrack ever to anything.,I'm reading a lot of reviews saying that this ...
2,1,Amazing!,This soundtrack is my favorite music of all ti...
3,1,Excellent Soundtrack,I truly like this soundtrack and I enjoy video...
4,1,"Remember, Pull Your Jaw Off The Floor After He...","If you've played the game, you know how divine..."


## Data Preprocessing

In [None]:
df.shape

(4000000, 3)

In [None]:
df = df.sample(3000000, random_state=42)

In [None]:
df.shape

(3000000, 3)

In [None]:
df.isnull().sum()

label      0
title      0
content    0


In [None]:
df['text'] = df['title'] + " " + df['content']

In [None]:
df = df[['label','text']]

In [None]:
df.head()

Unnamed: 0,label,text
1049554,0,"Deeply disappointing, faulty morality & social..."
214510,1,insight into the philosophy of libertarian soc...
2145764,1,"a great book ""In vain did the Bedouins strive ..."
2198867,0,"toys for great sex wow, that was bad, I threw ..."
1184366,1,i love this movie!!!! i just finished reading ...


In [None]:
df['label'].unique()

array([0, 1])

## **Text Preprocessing for NLP**

In [None]:
import re
import string
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

def text_preprocess(text):
    # Drop every non-ASCII character (emojis, but also accented letters)
    text = text.encode('ascii', 'ignore').decode('ascii')

    # Lowercase
    text = text.lower()

    # Remove numbers
    text = re.sub(r'\d+', '', text)

    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))

    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()

    # Remove stopwords
    text = ' '.join([word for word in text.split() if word not in stop_words])

    return text

# Clean every review in the text column
df['text'] = df['text'].apply(text_preprocess)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
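
To make the effect of each cleaning step concrete, here is a minimal sketch that applies the same sequence of operations to a single review. It is an illustration only: `demo_preprocess`, `DEMO_STOP_WORDS`, and the sample sentence are hypothetical, and the tiny hand-picked stopword set stands in for the NLTK list so the snippet runs without downloading anything.

```python
import re
import string

# Hypothetical, hand-picked subset of stopwords (assumption: stands in for
# the NLTK English list so the example needs no download).
DEMO_STOP_WORDS = {"this", "was", "it", "the", "a", "i", "in"}

def demo_preprocess(text, stop_words=DEMO_STOP_WORDS):
    # Drop every non-ASCII character (this is what removes the emoji)
    text = text.encode("ascii", "ignore").decode("ascii")
    # Lowercase
    text = text.lower()
    # Remove digits
    text = re.sub(r"\d+", "", text)
    # Remove punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Collapse repeated whitespace
    text = re.sub(r"\s+", " ", text).strip()
    # Remove stopwords
    return " ".join(w for w in text.split() if w not in stop_words)

sample = "This soundtrack was AMAZING!!! I bought it in 2005 \U0001F60D"
cleaned = demo_preprocess(sample)
print(cleaned)  # soundtrack amazing bought
```

Note that the ASCII round trip removes more than emojis: any accented or non-Latin character is dropped as well, which may or may not be desirable for a given dataset.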


## **Model Building**

### Splitting Data Into X and y Variables

In [None]:
X = df['text']
y = df['label']

### Splitting Data Into Train Set and Test Set

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

In [None]:
bow_vectorizer = CountVectorizer()

In [None]:
X_train_bow = bow_vectorizer.fit_transform(X_train)
X_test_bow = bow_vectorizer.transform(X_test)

### **Naive Bayes**

In [None]:
model_NB = MultinomialNB()

In [None]:
model_NB.fit(X_train_bow,y_train)

In [None]:
pred_nb = model_NB.predict(X_test_bow)

In [None]:
print('accuracy = ',accuracy_score(y_test,pred_nb))

accuracy =  0.851355


In [None]:
vectorizer = TfidfVectorizer(max_features=10000, ngram_range=(1, 2))
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

### **Support Vector Machine Model (SVM)**

In [None]:
svm = LinearSVC()
svm.fit(X_train_vec, y_train)

In [None]:
y_pred = svm.predict(X_test_vec)

print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

Accuracy: 0.9062266666666666
              precision    recall  f1-score   support

           0       0.91      0.90      0.91    299992
           1       0.90      0.91      0.91    300008

    accuracy                           0.91    600000
   macro avg       0.91      0.91      0.91    600000
weighted avg       0.91      0.91      0.91    600000



## **Saving the Model as Pickle Files**


---

*SVM -> amazon_review_model.pkl*

*vectorizer -> vectorizer.pkl*

---


In [None]:
import joblib
joblib.dump(svm, 'amazon_review_model.pkl')

['amazon_review_model.pkl']

In [None]:
joblib.dump(vectorizer, 'vectorizer.pkl')

['vectorizer.pkl']
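
Both files are needed at prediction time, because new text has to pass through the same fitted vectorizer before it reaches the classifier. The sketch below shows the save and load round trip on a tiny made-up dataset so that it runs on its own. The `demo_*` names, the training sentences, and the temporary file paths are all hypothetical. With the real artifacts, the equivalent would be `joblib.load('amazon_review_model.pkl')` and `joblib.load('vectorizer.pkl')`.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Tiny made-up training set standing in for the real reviews (assumption)
texts = [
    "great product love it",
    "excellent quality works great",
    "terrible waste of money",
    "awful broke after one day",
]
labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative, as in amazon_polarity

demo_vectorizer = TfidfVectorizer()
demo_model = LinearSVC()
demo_model.fit(demo_vectorizer.fit_transform(texts), labels)

new_reviews = ["love this great product", "terrible awful waste"]
expected = demo_model.predict(demo_vectorizer.transform(new_reviews))

with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "demo_model.pkl")
    vec_path = os.path.join(tmp, "demo_vectorizer.pkl")

    # Save both artifacts, then load them back as a fresh session would
    joblib.dump(demo_model, model_path)
    joblib.dump(demo_vectorizer, vec_path)
    loaded_model = joblib.load(model_path)
    loaded_vectorizer = joblib.load(vec_path)

# Vectorize with transform (never fit again), then predict
predictions = loaded_model.predict(loaded_vectorizer.transform(new_reviews))
print(predictions)
```

In the notebook's setting, new reviews should also go through `text_preprocess` before `transform`, since the vectorizer was fitted on cleaned text.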