Welcome to my text classification notebook! In this project, I worked with the Amazon Reviews dataset, which contains Amazon product reviews scraped between May 1996 and October 2018; the version used here has about 3.6 million training reviews and 400,000 test reviews. My goal was to build a model that classifies the sentiment of these reviews as positive or negative. Using TF-IDF vectorization to convert the reviews into numerical features and LinearSVC as the classifier, I achieved an accuracy of 92%. This is a strong result for distinguishing sentiment from the review title and text alone. Check out the steps, code, and results in the sections below!
Source: https://www.kaggle.com/datasets/kritanjalijain/amazon-reviews/data

In [1]:
import pandas as pd
import numpy as np
import tarfile
import os
import kagglehub

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC 
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score, accuracy_score
from catboost import CatBoostClassifier

from sklearn.metrics import classification_report
from sklearn import metrics


In [2]:


# Download latest version
path = kagglehub.dataset_download("kritanjalijain/amazon-reviews")

print("Path to dataset files:", path)

Path to dataset files: C:\Users\caioe\.cache\kagglehub\datasets\kritanjalijain\amazon-reviews\versions\2


In [3]:


files = os.listdir(path)  # List all files and folders in the directory

print("Files in the directory", files)


Files in the directory ['amazon_review_polarity_csv.tgz', 'test.csv', 'train.csv']


In [4]:
test = pd.read_csv(path + "/test.csv")
train = pd.read_csv(path + "/train.csv")
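
The `head()` outputs in the EDA section show the first review being used as the column names, which suggests the CSV files have no header row, so `read_csv` consumes one review per file as the header. Below is a minimal sketch of one way to avoid that, assuming the three columns are label, title and text in that order. The in-memory sample is made up for illustration.

```python
import io
import pandas as pd

# Two made-up rows in the same header-less layout as train.csv / test.csv
sample_csv = io.StringIO(
    '2,"Great CD","I have listened to this CD for years."\n'
    '1,"Batteries died","It stopped working within a year."\n'
)

# header=None keeps the first line as data; names= assigns the column names
sample = pd.read_csv(sample_csv, header=None, names=["label", "title", "text"])

print(sample.shape)           # both rows are kept
print(list(sample.columns))
```

With the real files, the same `header=None, names=[...]` arguments could be passed to the two `read_csv` calls above. The row counts and outputs in the rest of this notebook were produced without them.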

EDA

In [5]:
test.head()

Unnamed: 0,2,Great CD,"My lovely Pat has one of the GREAT voices of her generation. I have listened to this CD for YEARS and I still LOVE IT. When I'm in a good mood it makes me feel better. A bad mood just evaporates like sugar in the rain. This CD just oozes LIFE. Vocals are jusat STUUNNING and lyrics just kill. One of life's hidden gems. This is a desert isle CD in my book. Why she never made it big is just beyond me. Everytime I play this, no matter black, white, young, old, male, female EVERYBODY says one thing ""Who was that singing ?"""
0,2,One of the best game music soundtracks - for a...,Despite the fact that I have only played a sma...
1,1,Batteries died within a year ...,I bought this charger in Jul 2003 and it worke...
2,2,"works fine, but Maha Energy is better",Check out Maha Energy's website. Their Powerex...
3,2,Great for the non-audiophile,Reviewed quite a bit of the combo players and ...
4,1,DVD Player crapped out after one year,I also began having the incorrect disc problem...


In [6]:
train.head()

Unnamed: 0,2,Stuning even for the non-gamer,This sound track was beautiful! It paints the senery in your mind so well I would recomend it even to people who hate vid. game music! I have played the game Chrono Cross but out of all of the games I have ever played it has the best music! It backs away from crude keyboarding and takes a fresher step with grate guitars and soulful orchestras. It would impress anyone who cares to listen! ^_^
0,2,The best soundtrack ever to anything.,I'm reading a lot of reviews saying that this ...
1,2,Amazing!,This soundtrack is my favorite music of all ti...
2,2,Excellent Soundtrack,I truly like this soundtrack and I enjoy video...
3,2,"Remember, Pull Your Jaw Off The Floor After He...","If you've played the game, you know how divine..."
4,2,an absolute masterpiece,I am quite sure any of you actually taking the...


In [7]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 399999 entries, 0 to 399998
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------

In [8]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3599999 entries, 0 to 3599998
Data columns (total 3 columns):
 #   Column  Dtype
---  ------  -----
 0   2

Renaming the columns:

In [9]:
new_names = ["label", "title", "text"]

for df in [test, train]:
    # The files have no header row, so assign the column names by position
    df.columns = new_names
    # Merge title and body into one text field; a missing title or body
    # makes the merged value NaN, and those rows are dropped further down
    df['text'] = df['title'] + ' ' + df['text']
    df.drop(columns=['title'], inplace=True)





In [10]:
test

Unnamed: 0,label,text
0,2,One of the best game music soundtracks - for a...
1,1,Batteries died within a year ... I bought this...
2,2,"works fine, but Maha Energy is better Check ou..."
3,2,Great for the non-audiophile Reviewed quite a ...
4,1,DVD Player crapped out after one year I also b...
...,...,...
399994,1,Unbelievable- In a Bad Way We bought this Thom...
399995,1,"Almost Great, Until it Broke... My son recieve..."
399996,1,Disappointed !!! I bought this toy for my son ...
399997,2,Classic Jessica Mitford This is a compilation ...


In [11]:
train

Unnamed: 0,label,text
0,2,The best soundtrack ever to anything. I'm read...
1,2,Amazing! This soundtrack is my favorite music ...
2,2,Excellent Soundtrack I truly like this soundtr...
3,2,"Remember, Pull Your Jaw Off The Floor After He..."
4,2,an absolute masterpiece I am quite sure any of...
...,...,...
3599994,1,Don't do it!! The high chair looks great when ...
3599995,1,"Looks nice, low functionality I have used this..."
3599996,1,"compact, but hard to clean We have a small hou..."
3599997,1,what is it saying? not sure what this book is ...


In [12]:
dfs = [test, train]

for i, df in enumerate(dfs):
    print(f"DataFrame {i}:")

    for column in df.columns:
        if (df[column] == ' ').any():
            print('Empty Space in', column)
        else:
            print('No empty space in', column)




DataFrame 0:
No empty space in label
No empty space in text
DataFrame 1:
No empty space in label
No empty space in text
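
The check above only matches cells that are exactly one space. Below is a minimal sketch of a slightly broader check that also counts empty strings and runs of whitespace. `count_blank` is a hypothetical helper and the small DataFrame is made up for illustration.

```python
import pandas as pd

def count_blank(series: pd.Series) -> int:
    """Count entries that are empty or whitespace-only strings (hypothetical helper)."""
    if pd.api.types.is_numeric_dtype(series):
        return 0  # numeric columns cannot hold blank strings
    # NaN values are skipped here; they are handled separately by dropna()
    stripped = series.dropna().astype(str).str.strip()
    return int((stripped == "").sum())

# Made-up data: three of the four text entries are blank in different ways
demo = pd.DataFrame({
    "label": [2, 1, 2, 1],
    "text": ["Great CD", " ", "   ", ""],
})

for column in demo.columns:
    print(column, count_blank(demo[column]))
```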


Dropping null values

In [13]:
test.dropna(inplace = True)
train.dropna(inplace = True)

In [14]:
train['label'].value_counts()*100/len(train)

label
2    50.000444
1    49.999556
Name: count, dtype: float64

In [15]:
train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3599792 entries, 0 to 3599998
Data columns (total 2 columns):
 #   Column  Dtype 
---  ------  ----- 
 0   label   int64 
 1   text    object
dtypes: int64(1), object(1)
memory usage: 82.4+ MB


Split the data into train & test sets

The training set is balanced (about 50% per class), and the dataset already ships with separate train and test files, so no further splitting or resampling is needed.

In [16]:


X_train = train.drop(columns = ['label'])
y_train = train['label']
X_test = test.drop(columns = ['label'])
y_test = test['label']

In [17]:
print(f"Size of X_train: {len(X_train)}")
print(f"Size of y_train: {len(y_train)}")


Size of X_train: 3599792
Size of y_train: 3599792


In [18]:
print(type(X_train))
print(type(y_train))


<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>


In [19]:
# Convert the single-column DataFrames to Series so the vectorizer receives the raw strings
X_train = X_train.squeeze()
X_test = X_test.squeeze()



I compared Logistic Regression, Linear SVC, and Multinomial Naive Bayes. The other classifiers are left commented out, so you can enable them if you like. The function below fits each model and ranks them by micro-averaged F1 score.

In [20]:




def pipelines(X_train, y_train, X_test, y_test):
    """Fit several classifiers on TF-IDF features and rank them by micro F1."""
    # Fit the vectorizer on the training text only, then reuse it for the test text
    vectorizer = TfidfVectorizer()
    X_train_tfidf = vectorizer.fit_transform(X_train)
    X_test_tfidf = vectorizer.transform(X_test)

    dict_methods = {
        'LogisticRegression': LogisticRegression(),
        'LinearSVC': LinearSVC(),
        #'CatBoost': CatBoostClassifier(),
        'MultinomialNB': MultinomialNB(),
        #'DecisionTreeClassifier': DecisionTreeClassifier(),
        #'RandomForestClassifier': RandomForestClassifier()
    }

    results = []

    for name, method in dict_methods.items():
        print(name)
        method.fit(X_train_tfidf, y_train)
        predictions = method.predict(X_test_tfidf)
        f1_micro = f1_score(y_test, predictions, average='micro')
        accuracy = accuracy_score(y_test, predictions)

        print(f1_micro)

        results.append({
            'Method': name,
            'F1 Micro': f1_micro,
            'Accuracy': accuracy
        })

    # One row per model, best micro F1 first
    df = pd.DataFrame(results)
    return df.sort_values(by='F1 Micro', ascending=False)




In [21]:
pipelines(X_train, y_train, X_test, y_test)

LogisticRegression
0.9100868804300268
LinearSVC
0.910534408400525
MultinomialNB
0.8469654353397087


Unnamed: 0,Method,F1 Micro,Accuracy
1,LinearSVC,0.910534,0.910534
0,LogisticRegression,0.910087,0.910087
2,MultinomialNB,0.846965,0.846965
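
The F1 Micro and Accuracy columns in the table above are identical. For single-label classification, micro-averaged F1 pools true positives, false positives and false negatives across all classes, which reduces to accuracy. Below is a small sketch with made-up labels, using 1 for negative and 2 for positive as in this dataset. It also computes macro F1, which averages the per-class scores.

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up labels for illustration: 1 = negative, 2 = positive
y_true = [1, 1, 1, 2, 2, 2, 2, 2]
y_pred = [1, 2, 1, 2, 2, 2, 1, 2]

accuracy = accuracy_score(y_true, y_pred)
f1_micro = f1_score(y_true, y_pred, average="micro")
f1_macro = f1_score(y_true, y_pred, average="macro")

# accuracy and f1_micro coincide; f1_macro can differ when classes behave differently
print(accuracy, f1_micro, f1_macro)
```

On a balanced dataset like this one the three numbers tend to be close, so ranking by micro F1 is effectively ranking by accuracy.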


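Linear SVC scores highest, narrowly ahead of Logistic Regression (0.9105 vs. 0.9101), while Multinomial Naive Bayes trails at 0.847. The commented-out methods are also options, but they take much longer to train on this much data, which makes them inefficient here. So far I have used the default vectorizer and the default LinearSVC settings, so the next step is fine-tuning.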

First fine-tuning: vectorization

In [22]:
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')  # only downloads if the corpus is missing or outdated
stop_words = stopwords.words('english')


In [23]:
clf = LinearSVC()

In [24]:



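tfidf = TfidfVectorizer(
    max_features=40000,     # keep only the 40,000 most frequent terms
    ngram_range=(1, 2),     # unigrams and bigrams
    stop_words=stop_words,  # stop words are common words removed in NLP as they add little meaning
    sublinear_tf=True,      # use 1 + log(tf) instead of the raw term frequency
    max_df=0.9,             # ignore terms that appear in more than 90% of the documents
    min_df=100              # ignore terms that appear in fewer than 100 documents
)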
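
To make the effect of `max_df` and `min_df` concrete, here is a minimal sketch on a made-up four-document corpus. The thresholds are scaled down to fit the toy data, so they are not the values used above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up mini corpus: "product" is in every document, "refund" in only one
docs = [
    "great product fast shipping",
    "bad product slow shipping",
    "great product great price",
    "bad product want refund",
]

vectorizer = TfidfVectorizer(
    max_df=0.9,  # drop terms found in more than 90% of the documents
    min_df=2,    # drop terms found in fewer than 2 documents
)
vectorizer.fit(docs)

# Only terms that appear in 2 or 3 of the 4 documents survive
print(sorted(vectorizer.vocabulary_))
```

"product" is removed by `max_df` because it appears everywhere, and the one-off words are removed by `min_df`, leaving `bad`, `great` and `shipping`.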




In [25]:
def text_clf(clf, tfidf, X_train, y_train):
    """Build a TF-IDF + classifier pipeline and fit it on the training data."""
    pipeline = Pipeline([
        ('tfidf', tfidf),
        ('clf', clf)
    ])
    pipeline.fit(X_train, y_train)
    return pipeline


pipeline_2 = text_clf(clf, tfidf, X_train, y_train)  # returned already fitted


In [26]:
predictions = pipeline_2.predict(X_test)  # Predict directly on the raw text

In [27]:
print(metrics.classification_report(y_test,predictions))


              precision    recall  f1-score   support

           1       0.92      0.92      0.92    199984
           2       0.92      0.92      0.92    199991

    accuracy                           0.92    399975
   macro avg       0.92      0.92      0.92    399975
weighted avg       0.92      0.92      0.92    399975



The tuned vectorizer improved accuracy and F1 score (from about 0.91 to 0.92) and is more efficient, since the vocabulary is capped at 40,000 features. Next fine-tuning step: Linear SVC.

I could search for the best parameters with a grid search, but on a dataset this size that would be too heavy for a simple notebook. Instead, I picked the values by reasoned guesswork.
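
For reference, a grid search over a pipeline could look roughly like the sketch below. It was not run for this notebook: the tiny made-up corpus stands in for a small random sample of the training data, and the parameter grid is an arbitrary illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Made-up stand-in for a small sample of the reviews (2 = positive, 1 = negative)
texts = [
    "great product love it", "excellent quality works great",
    "amazing value very happy", "love it works perfectly",
    "terrible product broke fast", "awful quality very disappointed",
    "waste of money broke", "horrible do not buy",
] * 3
labels = [2, 2, 2, 2, 1, 1, 1, 1] * 3

pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LinearSVC())])

# Pipeline parameters are addressed as <step name>__<parameter name>
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1, 10],
}

# 3-fold cross-validation over all 6 parameter combinations
search = GridSearchCV(pipe, param_grid, scoring="f1_micro", cv=3)
search.fit(texts, labels)

print(search.best_params_)
print(search.best_score_)
```

Every combination is refitted once per fold, which is why the cost grows quickly on millions of reviews. Searching on a subsample first is one way to keep it manageable.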

In [28]:
clf = LinearSVC(
    C=10,            # I could use a value below 1, but I want to pursue better accuracy
    dual=False,      # a lot of data and few features
    max_iter=40000,  # allow more iterations
    tol=1e-6         # lower tolerance for a more precise solution
)


In [29]:
pipeline_2 = text_clf(clf, tfidf, X_train, y_train)

In [30]:
predictions = pipeline_2.predict(X_test) 

In [31]:
print(metrics.classification_report(y_test,predictions))


              precision    recall  f1-score   support

           1       0.92      0.92      0.92    199984
           2       0.92      0.92      0.92    199991

    accuracy                           0.92    399975
   macro avg       0.92      0.92      0.92    399975
weighted avg       0.92      0.92      0.92    399975



Well, tuning the hyperparameters did not help: the scores are unchanged. The best settings depend on the data, so it is better to keep the default LinearSVC with a few modifications, such as dual=False. This analysis could still be improved with deep learning techniques or other machine learning methods. Another hypothesis is that the title column alone might be enough to reach better accuracy and F1 scores.

Examples of functionality:

In [32]:
clf = LinearSVC(
    dual=False  # a lot of data and few features
)

In [33]:
pipeline_2 = text_clf(clf, tfidf, X_train, y_train)

In [34]:

data = {
    "Title": [
        "The product arrived super fast and works perfectly! I ordered it on Monday, and by Wednesday it was already at my door. The quality is top-notch, and I couldn’t be happier with how well it performs.",
        "Terrible experience, the item was damaged on delivery. When I opened the box, the product was cracked, and it’s clear the shipping company didn’t handle it with care. Really frustrating after waiting a week.",
        "Amazing quality for such a low price, I'm thrilled! I wasn’t expecting much given the discount, but this exceeded all my hopes—it’s durable, stylish, and worth every penny I spent.",
        "Shipping took forever, really disappointing. I waited almost three weeks for my order, and there were no updates until the last day. It ruined the excitement I had for this purchase.",
        "Customer service was so helpful with my order. I had an issue with the payment, but their team responded quickly, fixed it in no time, and even threw in a small discount as an apology—great experience!",
        "The package got lost, worst purchase ever. I tracked it for days, and suddenly the status just stopped updating. No one could tell me where it went, and I ended up with nothing after paying full price.",
        "Great deal, I'd definitely buy again! The price was unbeatable, and the item arrived in perfect condition within two days. It’s exactly what I needed, and I’m already planning my next order.",
        "Poor packaging, everything was broken. The box was flimsy, and by the time it got to me, the contents were shattered. It’s a shame because the product itself looked promising online.",
        "Fast delivery and the item exceeded my expectations. I got it in just 48 hours, and it’s even better than the pictures—super functional and feels premium. Couldn’t ask for more!",
        "Overpriced and low quality, I regret this purchase. I paid way too much for something that feels cheap and stopped working after a few uses. Should’ve read the reviews more carefully.",
        "Fantastic purchase, I’m so impressed! The item arrived early, beautifully packaged, and it works even better than advertised—what a pleasant surprise after shopping online.",
        "Complete disaster, the wrong item was sent. I ordered a specific model, but they shipped something totally different, and now I’m stuck dealing with returns. Such a waste of time.",
        "Best online shopping experience yet! The website was easy to use, shipping was lightning-fast, and the product quality is outstanding—I’ll be recommending this to everyone.",
        "Horrible delays and no communication. My order sat in ‘processing’ for two weeks with no explanation, and when it finally arrived, it was missing parts. Never again.",
        "Super satisfied with this buy! It arrived ahead of schedule, and the attention to detail in the design is incredible—definitely feels like I got more than I paid for.",
        "Disappointing purchase, it broke on day one. I was excited to try it out, but it fell apart almost immediately, and the return process is a nightmare to deal with.",
        "Quick shipping and a flawless product! I ordered it late at night, and it was at my doorstep by the next afternoon—works like a charm and looks great too.",
        "Awful quality, not worth the hype. The reviews made it sound amazing, but it arrived scratched and barely functional—should’ve saved my money for something else.",
        "Really happy with this order! The item was well-priced, shipped promptly, and came with a little thank-you note that made the experience feel personal and special.",
        "Total letdown, delivery was a mess. The package was left in the rain despite my instructions, and the contents were soaked and ruined—terrible service all around."
    ],
    "Sentiment": [2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1]
}

df_test = pd.DataFrame(data)



In [35]:
df_test

Unnamed: 0,Title,Sentiment
0,The product arrived super fast and works perfe...,2
1,"Terrible experience, the item was damaged on d...",1
2,"Amazing quality for such a low price, I'm thri...",2
3,"Shipping took forever, really disappointing. I...",1
4,Customer service was so helpful with my order....,2
5,"The package got lost, worst purchase ever. I t...",1
6,"Great deal, I'd definitely buy again! The pric...",2
7,"Poor packaging, everything was broken. The box...",1
8,Fast delivery and the item exceeded my expecta...,2
9,"Overpriced and low quality, I regret this purc...",1


In [36]:
df_test["Predicted_Sentiment"] = pipeline_2.predict(df_test["Title"])

In [37]:
df_test

Unnamed: 0,Title,Sentiment,Predicted_Sentiment
0,The product arrived super fast and works perfe...,2,2
1,"Terrible experience, the item was damaged on d...",1,1
2,"Amazing quality for such a low price, I'm thri...",2,2
3,"Shipping took forever, really disappointing. I...",1,1
4,Customer service was so helpful with my order....,2,2
5,"The package got lost, worst purchase ever. I t...",1,1
6,"Great deal, I'd definitely buy again! The pric...",2,2
7,"Poor packaging, everything was broken. The box...",1,1
8,Fast delivery and the item exceeded my expecta...,2,2
9,"Overpriced and low quality, I regret this purc...",1,1


In [38]:
df_test['Correct'] = (df_test['Sentiment'] == df_test['Predicted_Sentiment']).astype(int)

In [39]:
correct = 100 * df_test['Correct'].mean()

In [40]:
print(f'Correct rate: {correct} %')

Correct rate: 100.0 %


In my test, the model classified all of these examples correctly. Feel free to try your own.

Second part: Sentiment analysis with NLTK

In [41]:
## VADER (Valence Aware Dictionary and sEntiment Reasoner)

In [42]:
test

Unnamed: 0,label,text
0,2,One of the best game music soundtracks - for a...
1,1,Batteries died within a year ... I bought this...
2,2,"works fine, but Maha Energy is better Check ou..."
3,2,Great for the non-audiophile Reviewed quite a ...
4,1,DVD Player crapped out after one year I also b...
...,...,...
399994,1,Unbelievable- In a Bad Way We bought this Thom...
399995,1,"Almost Great, Until it Broke... My son recieve..."
399996,1,Disappointed !!! I bought this toy for my son ...
399997,2,Classic Jessica Mitford This is a compilation ...


In [44]:
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')
sia = SentimentIntensityAnalyzer()

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\caioe\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


In [45]:
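# VADER's compound score ranges from -1 (most negative) to +1 (most positive)
test['predicted_label'] = test['text'].apply(lambda x: sia.polarity_scores(x)['compound'])

# Map to the dataset's labels: 2 (positive) if the score is above 0, otherwise 1 (negative)
test['predicted_label'] = test['predicted_label'].apply(lambda x: 2 if x>0 else 1)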



In [46]:
test

Unnamed: 0,label,text,predicted_label
0,2,One of the best game music soundtracks - for a...,2
1,1,Batteries died within a year ... I bought this...,2
2,2,"works fine, but Maha Energy is better Check ou...",2
3,2,Great for the non-audiophile Reviewed quite a ...,2
4,1,DVD Player crapped out after one year I also b...,1
...,...,...,...
399994,1,Unbelievable- In a Bad Way We bought this Thom...,2
399995,1,"Almost Great, Until it Broke... My son recieve...",1
399996,1,Disappointed !!! I bought this toy for my son ...,2
399997,2,Classic Jessica Mitford This is a compilation ...,2


In [47]:
print(classification_report(test['label'], test['predicted_label']))

              precision    recall  f1-score   support

           1       0.87      0.51      0.65    199984
           2       0.66      0.92      0.77    199991

    accuracy                           0.72    399975
   macro avg       0.76      0.72      0.71    399975
weighted avg       0.76      0.72      0.71    399975



As expected, the trained ML model works better than VADER (0.92 vs. 0.72 accuracy), but VADER is an easy, lightweight method that needs no training data.
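
The VADER cell above treats any compound score above 0 as positive, and the report shows a recall of only 0.51 on class 1, so many negative reviews end up labelled positive. One thing to try is a higher cutoff. Below is a minimal sketch, where `compound_to_label` is a hypothetical helper, the scores and labels are made up, and the thresholds are arbitrary illustrations rather than tuned values.

```python
import numpy as np

def compound_to_label(scores, threshold=0.0):
    """Map compound scores in [-1, 1] to labels: 2 = positive, 1 = negative."""
    scores = np.asarray(scores, dtype=float)
    # Scores strictly above the threshold count as positive
    return np.where(scores > threshold, 2, 1)

# Made-up compound scores and true labels, for illustration only
scores = [0.85, 0.30, 0.10, -0.40, 0.02, -0.75]
y_true = np.array([2, 1, 2, 1, 1, 1])

for threshold in (0.0, 0.2, 0.5):
    y_pred = compound_to_label(scores, threshold)
    accuracy = float((y_pred == y_true).mean())
    print(threshold, y_pred.tolist(), round(accuracy, 2))
```

On the real data, the cutoff would have to be chosen on a held-out part of the training set, not on the test set.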