<a href="https://colab.research.google.com/github/Cutie-tee/Roboreviews_project/blob/main/reviews_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The dataset consists of three files:

- 1429_1.csv
- Datafiniti_Amazon_Consumer_Reviews_of_Amazon_Products.csv
- Datafiniti_Amazon_Consumer_Reviews_of_Amazon_Products_May19.csv




In [3]:
!pip install --upgrade pandas




In [4]:
import pandas as pd

def safe_read_csv(file_path):
    """
    Safely reads a CSV file by handling parsing issues.
    """
    try:
        return pd.read_csv(file_path, low_memory=False, on_bad_lines='skip', quotechar='"', escapechar='\\')
    except Exception as e:
        print(f"Error reading file {file_path}: {e}")
        return None

# Load datasets safely
file1_data = safe_read_csv('1429_1.csv')
file2_data = safe_read_csv('Datafiniti_Amazon_Consumer_Reviews_of_Amazon_Products.csv')
file3_data = safe_read_csv('Datafiniti_Amazon_Consumer_Reviews_of_Amazon_Products_May19.csv')

# Check if all files loaded successfully
if file1_data is not None and file2_data is not None and file3_data is not None:
    # Standardizing column names
    file1_data.rename(columns=lambda x: x.strip(), inplace=True)
    file2_data.rename(columns=lambda x: x.strip(), inplace=True)
    file3_data.rename(columns=lambda x: x.strip(), inplace=True)

    # Align datasets to common columns
    common_columns = list(set(file1_data.columns) & set(file2_data.columns) & set(file3_data.columns))

    # Selecting only common columns
    file1_data = file1_data[common_columns]
    file2_data = file2_data[common_columns]
    file3_data = file3_data[common_columns]

    # Concatenate datasets
    combined_data = pd.concat([file1_data, file2_data, file3_data], ignore_index=True)

    # Dropping duplicates
    combined_data.drop_duplicates(inplace=True)

    # Resetting index
    combined_data.reset_index(drop=True, inplace=True)

    # Save cleaned dataset
    combined_data.to_csv('combined_reviews_cleaned.csv', index=False)

    # Display overview
    print("Dataset successfully cleaned and saved.")
    combined_data.info()  # info() prints its report directly; wrapping it in print() would also print "None"
else:
    print("One or more files could not be loaded.")

Dataset successfully cleaned and saved.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 67351 entries, 0 to 67350
Data columns (total 17 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   reviews.text         67350 non-null  object 
 1   reviews.username     67339 non-null  object 
 2   reviews.sourceURLs   67351 non-null  object 
 3   reviews.dateSeen     67351 non-null  object 
 4   categories           67351 non-null  object 
 5   reviews.date         67312 non-null  object 
 6   reviews.id           71 non-null     float64
 7   reviews.doRecommend  54511 non-null  object 
 8   reviews.title        67332 non-null  object 
 9   asins                67349 non-null  object 
 10  name                 60591 non-null  object 
 11  manufacturer         67351 non-null  object 
 12  keys                 67351 non-null  object 
 13  reviews.rating       67318 non-null  float64
 14  brand                67351 non-null  object 
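
The alignment step above can be sketched on toy data. This is a minimal illustration, not the notebook's exact code: the helper name `align_and_concat` and the sample values are assumptions made for the example. Because a set intersection does not guarantee any particular column order, the sketch orders the shared columns by their position in the first frame so that repeated runs produce the same layout.

```python
import pandas as pd

def align_and_concat(frames):
    """Keep only the columns shared by every frame, in the first frame's order."""
    shared = set(frames[0].columns)
    for frame in frames[1:]:
        shared &= set(frame.columns)
    # Walk the first frame's columns so the resulting order is reproducible
    ordered = [col for col in frames[0].columns if col in shared]
    combined = pd.concat([frame[ordered] for frame in frames], ignore_index=True)
    return combined.drop_duplicates().reset_index(drop=True)

# Toy frames standing in for the CSV files (values are made up)
a = pd.DataFrame({"id": [1, 2], "reviews.text": ["good", "bad"], "extra_a": [0, 0]})
b = pd.DataFrame({"reviews.text": ["good", "fine"], "id": [1, 3], "extra_b": [1, 1]})

combined = align_and_concat([a, b])
print(combined)
```

The duplicated row (`id` 1, "good") appears in both toy frames and is kept only once.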
 

In [5]:
import pandas as pd

# Load the combined dataset; low_memory=False reads each column in one pass,
# which avoids mixed-dtype warnings on large CSV files
combined_data = pd.read_csv('combined_reviews_cleaned.csv', low_memory=False)

# Display initial missing values
print("Missing Values Before Processing:")
print(combined_data.isnull().sum())

# 1. Drop rows with missing `reviews.text` (essential for all tasks)
combined_data = combined_data.dropna(subset=['reviews.text'])

# 2. Drop rows with missing `reviews.rating` (essential for classification and clustering)
combined_data = combined_data.dropna(subset=['reviews.rating'])

# 3. Optional: Handle missing values in other columns (example: reviews.doRecommend)
# Here, missing values are replaced with the placeholder 'Unknown' rather than a guessed value
combined_data['reviews.doRecommend'] = combined_data['reviews.doRecommend'].fillna('Unknown')

# 4. Drop columns with minimal or irrelevant data (e.g., `reviews.id` with only 71 non-null values)
columns_to_drop = ['reviews.id']
combined_data = combined_data.drop(columns=columns_to_drop, errors='ignore')

# Display missing values after processing
print("\nMissing Values After Processing:")
print(combined_data.isnull().sum())

# Save the preprocessed dataset
preprocessed_dataset_path = 'preprocessed_reviews.csv'
combined_data.to_csv(preprocessed_dataset_path, index=False)

print(f"Preprocessed dataset saved to {preprocessed_dataset_path}.")





Missing Values Before Processing:
reviews.text               1
reviews.username          12
reviews.sourceURLs         0
reviews.dateSeen           0
categories                 0
reviews.date              39
reviews.id             67280
reviews.doRecommend    12840
reviews.title             19
asins                      2
name                    6760
manufacturer               0
keys                       0
reviews.rating            33
brand                      0
reviews.numHelpful     12746
id                         0
dtype: int64

Missing Values After Processing:
reviews.text               0
reviews.username          12
reviews.sourceURLs         0
reviews.dateSeen           0
categories                 0
reviews.date              29
reviews.doRecommend        0
reviews.title             19
asins                      2
name                    6759
manufacturer               0
keys                       0
reviews.rating             0
brand                      0
reviews.numHelpful  

Next Steps:
Now that the problematic file has been cleaned and saved as Datafiniti_Cleaned_May19.csv, the remaining steps are:

1. Inspect the cleaned file to ensure it is ready for analysis:
Check for missing values.
Display an overview of the data.

2. Merge the cleaned data with the other files:
Combine the cleaned file with the previously processed datasets to create a unified dataset.

3. Sentiment Classification and Clustering
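
The inspection in step 1 might look like the sketch below. The helper `inspect_dataframe` and the toy rows are assumptions made for illustration; in the notebook the frame would be loaded from the cleaned CSV instead.

```python
import pandas as pd

def inspect_dataframe(df):
    """Return per-column missing counts and percentages, worst first."""
    summary = pd.DataFrame({
        "missing": df.isnull().sum(),
        "missing_pct": (df.isnull().mean() * 100).round(2),
    })
    return summary.sort_values("missing", ascending=False)

# Toy data standing in for the cleaned file; in the notebook this would be
# something like pd.read_csv('Datafiniti_Cleaned_May19.csv', low_memory=False)
sample = pd.DataFrame({
    "reviews.text": ["great", None, "ok", "bad"],
    "reviews.rating": [5.0, 4.0, None, None],
    "brand": ["Amazon"] * 4,
})

report = inspect_dataframe(sample)
print(report)   # missing-value check
sample.info()   # overview of columns, dtypes, and non-null counts
```

Sorting by the missing count puts the columns that need attention at the top of the report.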


**Sentiment analysis with RoBERTa-base**

In [6]:
!pip install datasets



In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from datasets import Dataset

# Load the preprocessed dataset
data = pd.read_csv('preprocessed_reviews.csv', low_memory=False)

# Map ratings to sentiment using the given labels
def map_rating_to_sentiment(rating):
    if rating > 3:
        return 2  # Positive
    elif rating == 3:
        return 1  # Neutral
    else:
        return 0  # Negative

data['sentiment'] = data['reviews.rating'].apply(map_rating_to_sentiment)

# Split data into training and testing sets
train_texts, test_texts, train_labels, test_labels = train_test_split(
    data['reviews.text'], data['sentiment'], test_size=0.2, random_state=42
)

# Convert to Hugging Face Dataset format
train_df = pd.DataFrame({"text": train_texts, "label": train_labels})
test_df = pd.DataFrame({"text": test_texts, "label": test_labels})

train_dataset = Dataset.from_pandas(train_df)
test_dataset = Dataset.from_pandas(test_df)

# Tokenizer and model setup
model_name = "roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)  # 3 classes

# Tokenize datasets
def preprocess_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

train_dataset = train_dataset.map(preprocess_function, batched=True)
test_dataset = test_dataset.map(preprocess_function, batched=True)

# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
    save_total_limit=2,
    report_to="none"  # Disable W&B and other integrations
)

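# Also disable Weights & Biases through the environment; report_to="none"
# above already turns off logging integrations, so this is only a safeguard
import os
os.environ["WANDB_DISABLED"] = "true"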


# Define Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    tokenizer=tokenizer
)

# Train the model
trainer.train()

# Save the fine-tuned model
model.save_pretrained("./fine_tuned_roberta_sentiment")
tokenizer.save_pretrained("./fine_tuned_roberta_sentiment")

print("Model fine-tuned and saved successfully!")


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Map:   0%|          | 0/53853 [00:00<?, ? examples/s]

Map:   0%|          | 0/13464 [00:00<?, ? examples/s]



Epoch,Training Loss,Validation Loss
1,0.1521,0.210057
2,0.2407,0.208434
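
The table reports loss only, because the `Trainer` was created without a metrics function. A hedged sketch of one that could be supplied is shown below; the choice of accuracy and macro-F1 is an assumption for illustration, not something the notebook specifies, and the logits are made up.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
    """Turn (logits, labels) into accuracy and macro-averaged F1."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, predictions),
        "macro_f1": f1_score(labels, predictions, average="macro"),
    }

# Made-up logits for four reviews over the three classes
# (0 = Negative, 1 = Neutral, 2 = Positive)
fake_logits = np.array([
    [2.0, 0.1, 0.3],
    [0.2, 1.5, 0.1],
    [0.1, 0.2, 3.0],
    [0.3, 0.2, 2.5],
])
fake_labels = np.array([0, 1, 2, 1])

metrics = compute_metrics((fake_logits, fake_labels))
print(metrics)
```

Passing the function as `compute_metrics=compute_metrics` when constructing the `Trainer` would add these values to each evaluation. Macro-F1 is included because star ratings tend to skew positive, so accuracy alone can hide weak performance on the smaller classes.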




Clustering based on emerging trends to ascertain partner growth opportunities.

Lifestyle Enhancers:

Products designed to improve daily routines or convenience.
Includes items like Nespresso pods, smart assistants, pet carriers, and other gadgets that simplify life.

Creative and Productivity Tools:

Products for work, study, or creative activities.
Includes laptops, tablets, keyboards, laptop stands, webcams, and styluses.

Health and Wellness:

Devices that focus on personal health, fitness, or beauty.
Includes fitness trackers, electric massagers, hairdryers, and grooming devices.

Entertainment and Immersion:

Products that provide entertainment or enhanced experiences.
Includes gaming consoles, headphones, speakers, VR headsets, and streaming devices.

Power and Connectivity Solutions:

Products that enable devices to stay powered or connected.
Includes chargers, batteries, power banks, docking stations, and USB hubs.

Eco-Friendly and Sustainable Solutions:

Products marketed as sustainable or environmentally friendly.
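
As a rough illustration of how product names could be mapped onto these categories, the sketch below uses a plain keyword lookup. This is an assumption made for the example, not the clustering method used in the project: the keyword lists are taken from the example items named above, and `assign_category` is a hypothetical helper.

```python
# Keywords drawn from the example items listed for each category
CATEGORY_KEYWORDS = {
    "Lifestyle Enhancers": ["nespresso", "smart assistant", "pet carrier"],
    "Creative and Productivity Tools": ["laptop", "tablet", "keyboard", "webcam", "stylus"],
    "Health and Wellness": ["fitness tracker", "massager", "hairdryer", "grooming"],
    "Entertainment and Immersion": ["gaming console", "headphone", "speaker", "vr headset", "streaming"],
    "Power and Connectivity Solutions": ["charger", "battery", "batteries", "power bank", "docking station", "usb hub"],
    "Eco-Friendly and Sustainable Solutions": ["sustainable", "eco-friendly", "environmentally friendly"],
}

def assign_category(product_name, default="Uncategorized"):
    """Return the first category whose keyword appears in the product name."""
    name = product_name.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in name for keyword in keywords):
            return category
    return default

# Product names below are made up for the example
for product in ["Fire HD 8 Tablet", "AmazonBasics AAA Batteries", "Mystery Item"]:
    print(product, "->", assign_category(product))
```

A lookup like this is easy to audit but only matches literal substrings, so it serves best as a sanity check alongside whatever clustering is actually run on the review text.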

Testing the model
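
A minimal sketch of how the saved model could be tried on new reviews is shown below. It assumes the directory `./fine_tuned_roberta_sentiment` written by the training cell exists; the helper `predict_sentiment` and the sample sentences are hypothetical. The label names follow the rating mapping used during training (0 = Negative, 1 = Neutral, 2 = Positive).

```python
ID2LABEL = {0: "Negative", 1: "Neutral", 2: "Positive"}

def predict_sentiment(texts, model_dir="./fine_tuned_roberta_sentiment"):
    """Classify a list of review strings with the saved fine-tuned model."""
    # Imported inside the function so this cell can be defined even where
    # torch and transformers are not installed yet
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForSequenceClassification.from_pretrained(model_dir)
    model.eval()  # switch off dropout for inference

    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return [ID2LABEL[i] for i in logits.argmax(dim=-1).tolist()]

# Example call (requires the directory saved by the training cell):
# predict_sentiment(["Battery died after a week.", "Love this tablet!"])
```

Padding to the longest text in the batch, rather than to the maximum length, keeps inference on a handful of short reviews fast.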