<a href="https://colab.research.google.com/github/Cutie-tee/Roboreviews_project/blob/main/reviews_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The dataset consists of three files:

- 1429_1.csv
- Datafiniti_Amazon_Consumer_Reviews_of_Amazon_Products.csv
- Datafiniti_Amazon_Consumer_Reviews_of_Amazon_Products_May19.csv




In [7]:
!pip install --upgrade pandas




In [8]:
import pandas as pd

def safe_read_csv(file_path):
    """
    Safely reads a CSV file by handling parsing issues.
    """
    try:
        # Read the CSV with specific options for handling errors
        return pd.read_csv(
            file_path,
            on_bad_lines='skip',  # Skips problematic lines
            quotechar='"',       # Specifies the quote character
            escapechar='\\',     # Escapes special characters
            engine='python'      # Use Python engine for better handling of malformed rows
        )
    except Exception as e:
        print(f"Error reading file {file_path}: {e}")
        return None

# Load datasets safely
file1_data = safe_read_csv('1429_1.csv')
file2_data = safe_read_csv('Datafiniti_Amazon_Consumer_Reviews_of_Amazon_Products.csv')
file3_data = safe_read_csv('Datafiniti_Amazon_Consumer_Reviews_of_Amazon_Products_May19.csv')

# Check if all files loaded successfully
if file1_data is not None and file2_data is not None and file3_data is not None:
    # Standardizing column names
    file1_data.rename(columns=lambda x: x.strip(), inplace=True)
    file2_data.rename(columns=lambda x: x.strip(), inplace=True)
    file3_data.rename(columns=lambda x: x.strip(), inplace=True)

    # Align datasets to common columns
    common_columns = list(set(file1_data.columns) & set(file2_data.columns) & set(file3_data.columns))

    # Selecting only common columns
    file1_data = file1_data[common_columns]
    file2_data = file2_data[common_columns]
    file3_data = file3_data[common_columns]

    # Concatenate datasets
    combined_data = pd.concat([file1_data, file2_data, file3_data], ignore_index=True)

    # Dropping duplicates
    combined_data.drop_duplicates(inplace=True)

    # Resetting index
    combined_data.reset_index(drop=True, inplace=True)

    # Save cleaned dataset
    combined_data.to_csv('combined_reviews_cleaned.csv', index=False)

    # Display overview
    print("Dataset successfully cleaned and saved.")
    print(combined_data.info())
else:
    print("One or more files could not be loaded.")


Dataset successfully cleaned and saved.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51023 entries, 0 to 51022
Data columns (total 17 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   name                 44263 non-null  object 
 1   reviews.doRecommend  38206 non-null  object 
 2   reviews.numHelpful   38288 non-null  float64
 3   reviews.id           41 non-null     float64
 4   reviews.text         51022 non-null  object 
 5   reviews.title        51004 non-null  object 
 6   manufacturer         51023 non-null  object 
 7   brand                51023 non-null  object 
 8   reviews.date         50984 non-null  object 
 9   reviews.dateSeen     51023 non-null  object 
 10  reviews.username     51016 non-null  object 
 11  asins                51021 non-null  object 
 12  reviews.sourceURLs   51023 non-null  object 
 13  keys                 51023 non-null  object 
 14  id                   51023 non-null  object 
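
The merge step above builds `common_columns` from a `set`, whose iteration order is not guaranteed, so the column order of the combined file can differ between runs. A minimal sketch of an order-preserving alternative follows. `ordered_common_columns` is a hypothetical helper, not part of the notebook, and the column lists are illustrative stand-ins for `file1_data.columns` and so on.

```python
def ordered_common_columns(first, *others):
    """Return the names in `first` that also appear in every other list, keeping `first`'s order."""
    other_sets = [set(cols) for cols in others]
    return [col for col in first if all(col in s for s in other_sets)]

# Toy column lists standing in for the three DataFrames' columns (illustrative only)
cols_a = ["id", "name", "reviews.text", "reviews.userCity"]
cols_b = ["name", "id", "reviews.text", "imageURLs"]
cols_c = ["reviews.text", "id", "name"]

common = ordered_common_columns(cols_a, cols_b, cols_c)
print(common)
```

In the notebook this could be called as `ordered_common_columns(list(file1_data.columns), list(file2_data.columns), list(file3_data.columns))`, so the result follows the column order of the first file.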
 

In [9]:
# Load the combined dataset
import pandas as pd

# Load the cleaned dataset
combined_data = pd.read_csv('combined_reviews_cleaned.csv')

# Display initial missing values
print("Missing Values Before Processing:")
print(combined_data.isnull().sum())

# 1. Drop rows with missing `reviews.text` (essential for all tasks)
combined_data = combined_data.dropna(subset=['reviews.text'])

# 2. Drop rows with missing `reviews.rating` (essential for classification and clustering)
combined_data = combined_data.dropna(subset=['reviews.rating'])

# 3. Optional: Handle missing values in other columns (example: reviews.doRecommend)
# Here, missing values are replaced with the placeholder 'Unknown'
combined_data['reviews.doRecommend'] = combined_data['reviews.doRecommend'].fillna('Unknown')

# 4. Drop columns with minimal or irrelevant data (e.g., `reviews.id` with only 41 non-null values)
columns_to_drop = ['reviews.id']
combined_data = combined_data.drop(columns=columns_to_drop, errors='ignore')

# Display missing values after processing
print("\nMissing Values After Processing:")
print(combined_data.isnull().sum())

# Save the preprocessed dataset
preprocessed_dataset_path = 'preprocessed_reviews.csv'
combined_data.to_csv(preprocessed_dataset_path, index=False)

print(f"Preprocessed dataset saved to {preprocessed_dataset_path}.")



Missing Values Before Processing:
name                    6760
reviews.doRecommend    12817
reviews.numHelpful     12735
reviews.id             50982
reviews.text               1
reviews.title             19
manufacturer               0
brand                      0
reviews.date              39
reviews.dateSeen           0
reviews.username           7
asins                      2
reviews.sourceURLs         0
keys                       0
id                         0
reviews.rating            33
categories                 0
dtype: int64

Missing Values After Processing:
name                    6759
reviews.doRecommend        0
reviews.numHelpful     12701
reviews.text               0
reviews.title             19
manufacturer               0
brand                      0
reviews.date              29
reviews.dateSeen           0
reviews.username           7
asins                      2
reviews.sourceURLs         0
keys                       0
id                         0
reviews.rating      

In [10]:
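# Fill remaining gaps in descriptive columns with explicit placeholders
combined_data['reviews.username'] = combined_data['reviews.username'].fillna("Anonymous")
combined_data['name'] = combined_data['name'].fillna("Unknown")

# Normalize `reviews.doRecommend` to booleans. Values are compared as lowercase text
# because they may be stored as booleans or as strings; anything unrecognized
# (including the 'Unknown' placeholder) becomes NaN.
recommend_map = {'true': True, 'yes': True, 'false': False, 'no': False}
combined_data['reviews.doRecommend'] = (
    combined_data['reviews.doRecommend'].astype(str).str.strip().str.lower().map(recommend_map)
)

# Sanity check: every remaining rating must lie in the 1-5 range
assert combined_data['reviews.rating'].between(1, 5).all()

combined_data.to_csv('final_preprocessed_reviews.csv', index=False)
print("Final preprocessed dataset saved.")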



Final preprocessed dataset saved.


**Sentiment analysis with RoBERTa-base**

In [11]:
!pip install datasets



In [12]:
import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from datasets import Dataset

# Load the final_preprocessed dataset
data = pd.read_csv('final_preprocessed_reviews.csv', low_memory=False)

# Map ratings to sentiment using the given labels
def map_rating_to_sentiment(rating):
    if rating > 3:
        return 2  # Positive
    elif rating == 3:
        return 1  # Neutral
    else:
        return 0  # Negative

data['sentiment'] = data['reviews.rating'].apply(map_rating_to_sentiment)

# Split data into training and testing sets
train_texts, test_texts, train_labels, test_labels = train_test_split(
    data['reviews.text'], data['sentiment'], test_size=0.2, random_state=42
)

# Convert to Hugging Face Dataset format
train_df = pd.DataFrame({"text": train_texts, "label": train_labels})
test_df = pd.DataFrame({"text": test_texts, "label": test_labels})

train_dataset = Dataset.from_pandas(train_df)
test_dataset = Dataset.from_pandas(test_df)

# Tokenizer and model setup
model_name = "roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)  # 3 classes

# Tokenize datasets
def preprocess_function(examples):
    return tokenizer(examples["text"], padding=True, truncation=True)

train_dataset = train_dataset.map(preprocess_function, batched=True)
test_dataset = test_dataset.map(preprocess_function, batched=True)

# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",  # or "steps" for more frequent evaluation
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
    save_total_limit=2,
    report_to="none"  # Disable W&B and other integrations
)

import os
os.environ["WANDB_DISABLED"] = "true"  # Disable Weights & Biases (W&B)

# Define Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    tokenizer=tokenizer
)

# Train the model
trainer.train()

# Optionally, evaluate the model
# eval_results = trainer.evaluate()

# Save the fine-tuned model
model.save_pretrained("./fine_tuned_roberta_sentiment")
tokenizer.save_pretrained("./fine_tuned_roberta_sentiment")

print("Model fine-tuned and saved successfully!")


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Map:   0%|          | 0/40791 [00:00<?, ? examples/s]

Map:   0%|          | 0/10198 [00:00<?, ? examples/s]


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 0.0968        | 0.208934        |
| 2     | 0.1446        | 0.201895        |
| 3     | 0.0976        | 0.251213        |


Model fine-tuned and saved successfully!


In [13]:
# Evaluate the model
trainer.evaluate()


{'eval_loss': 0.2512132227420807,
 'eval_runtime': 245.8632,
 'eval_samples_per_second': 41.478,
 'eval_steps_per_second': 5.186,
 'epoch': 3.0}
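
The evaluation output above reports only the loss because the `Trainer` was created without a `compute_metrics` callback. The sketch below shows one way such a callback could look, using NumPy only. `compute_metrics` is an assumed helper name (it is not defined in the notebook), and it expects the `(logits, labels)` pair that `Trainer` passes during evaluation.

```python
import numpy as np

def compute_metrics(eval_pred, num_labels=3):
    """Compute accuracy and macro-averaged F1 from a (logits, labels) pair."""
    logits, labels = eval_pred
    preds = np.argmax(np.asarray(logits), axis=-1)
    labels = np.asarray(labels)

    f1_scores = []
    for cls in range(num_labels):
        tp = np.sum((preds == cls) & (labels == cls))
        fp = np.sum((preds == cls) & (labels != cls))
        fn = np.sum((preds != cls) & (labels == cls))
        denom = 2 * tp + fp + fn
        # A class with no predictions and no true examples contributes 0.0
        f1_scores.append(2 * tp / denom if denom else 0.0)

    return {
        "accuracy": float(np.mean(preds == labels)),
        "macro_f1": float(np.mean(f1_scores)),
    }

# Toy check: four examples, three classes (0 = Negative, 1 = Neutral, 2 = Positive)
toy_logits = np.array([[2.0, 0.1, 0.3], [0.2, 1.5, 0.1], [0.1, 0.2, 3.0], [0.3, 0.2, 2.5]])
toy_labels = np.array([0, 1, 2, 1])
toy_metrics = compute_metrics((toy_logits, toy_labels))
print(toy_metrics)
```

Passing it as `Trainer(..., compute_metrics=compute_metrics)` should add these values to the dictionary returned by `trainer.evaluate()`. Macro F1 is included because a ratings dataset may be heavily skewed toward positive reviews, in which case accuracy alone can look better than the model really is.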

In [None]:
metrics = trainer.evaluate()
metrics


In [15]:
# Predict sentiment for new reviews using the fine-tuned model

from transformers import pipeline

# Load the fine-tuned model
sentiment_classifier = pipeline(
    "text-classification",
    model="./fine_tuned_roberta_sentiment",
    tokenizer="./fine_tuned_roberta_sentiment"
)

# Predict sentiment for a new review
new_review = "This product is amazing! It exceeded all my expectations."
prediction = sentiment_classifier(new_review)
print(prediction)


Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


[{'label': 'LABEL_2', 'score': 0.9995864033699036}]
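
The pipeline returns generic names such as `LABEL_2` because no `id2label` mapping was stored with the model. Based on the rating-to-sentiment mapping used for training (0 = Negative, 1 = Neutral, 2 = Positive), the output can be translated with a small helper. `LABEL_NAMES` and `to_sentiment_name` are hypothetical names introduced for this sketch.

```python
# Mapping derived from map_rating_to_sentiment: 0 = Negative, 1 = Neutral, 2 = Positive
LABEL_NAMES = {"LABEL_0": "Negative", "LABEL_1": "Neutral", "LABEL_2": "Positive"}

def to_sentiment_name(prediction):
    """Translate one pipeline result dict into a readable sentiment name."""
    # Unknown labels are returned unchanged rather than raising an error
    return LABEL_NAMES.get(prediction["label"], prediction["label"])

# Example using the result shape printed above
example = [{"label": "LABEL_2", "score": 0.9995864033699036}]
readable = [to_sentiment_name(p) for p in example]
print(readable)
```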


In [None]:
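# Predict sentiments for the entire dataset.
# Passing the texts as a list lets the pipeline batch them; truncation=True keeps
# reviews longer than the model's maximum input length from raising an error.
texts = data['reviews.text'].astype(str).tolist()
results = sentiment_classifier(texts, batch_size=32, truncation=True)
data['predicted_sentiment'] = [result['label'] for result in results]
data.to_csv('reviews_with_sentiments.csv', index=False)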


In [None]:
import matplotlib.pyplot as plt

sentiment_counts = data['predicted_sentiment'].value_counts()
sentiment_counts.plot(kind='bar', title='Sentiment Distribution')
plt.xlabel('Sentiment')
plt.ylabel('Count')
plt.show()


**Clustering based on emerging trends to ascertain partner growth opportunities**

- **Lifestyle Enhancers:** Products designed to improve daily routines or convenience. Includes items like Nespresso pods, smart assistants, pet carriers, and other gadgets that simplify life.
- **Creative and Productivity Tools:** Products for work, study, or creative activities. Includes laptops, tablets, keyboards, laptop stands, webcams, and styluses.
- **Health and Wellness:** Devices that focus on personal health, fitness, or beauty. Includes fitness trackers, electric massagers, hairdryers, and grooming devices.
- **Entertainment and Immersion:** Products that provide entertainment or enhanced experiences. Includes gaming consoles, headphones, speakers, VR headsets, and streaming devices.
- **Power and Connectivity Solutions:** Products that enable devices to stay powered or connected. Includes chargers, batteries, power banks, docking stations, and USB hubs.
- **Eco-Friendly and Sustainable Solutions:** Products marketed as sustainable or environmentally friendly.
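
As a rough illustration of how product names might be assigned to the categories above, the sketch below uses simple keyword matching. This is an assumption for demonstration only: the keyword lists are taken from the examples in the category descriptions, `CATEGORY_KEYWORDS` and `assign_category` are hypothetical names, the product names are made up, and the project's actual clustering approach may differ.

```python
# Keywords are lowercase stems taken from the category descriptions (illustrative, not exhaustive)
CATEGORY_KEYWORDS = {
    "Lifestyle Enhancers": ["nespresso", "smart assistant", "pet carrier"],
    "Creative and Productivity Tools": ["laptop", "tablet", "keyboard", "webcam", "stylus"],
    "Health and Wellness": ["fitness tracker", "massager", "hairdryer", "grooming"],
    "Entertainment and Immersion": ["gaming console", "headphone", "speaker", "vr headset", "streaming"],
    "Power and Connectivity Solutions": ["charger", "batter", "power bank", "docking station", "usb hub"],
    "Eco-Friendly and Sustainable Solutions": ["sustainable", "eco-friendly", "environmentally friendly"],
}

def assign_category(product_name, default="Uncategorized"):
    """Return the first category whose keyword occurs in the product name."""
    name = product_name.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in name for keyword in keywords):
            return category
    return default

# Made-up product names for demonstration
samples = ["10-inch Tablet with Case", "AAA Alkaline Batteries (36-pack)", "Paperback Notebook"]
assigned = [assign_category(s) for s in samples]
print(assigned)
```

Because the first matching category wins, the order of `CATEGORY_KEYWORDS` acts as a priority list. A name that matches no keyword falls back to `"Uncategorized"`, which makes gaps in the keyword lists easy to spot.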

Testing the model