<a href="https://www.kaggle.com/code/nadaarfaoui/preprocessing-the-amazon-electronics-dataset?scriptVersionId=289392054" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [1]:
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Download resources (run once)
nltk.download('stopwords')
nltk.download('wordnet')

# Load the merged Amazon electronics dataset
df = pd.read_csv("/kaggle/input/merged-amazon-electronics-dataset/merged_electronics_dataset.csv")

[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /usr/share/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [2]:
df.head()

Unnamed: 0,name,main_category,sub_category,image,link,no_of_ratings,discount_price,actual_price,review_rating,review_text
0,"Redmi 10 Power (Power Black, 8GB RAM, 128GB St...","tv, audio & cameras",All Electronics,https://m.media-amazon.com/images/I/81eM15lVcJ...,https://www.amazon.in/Redmi-Power-Black-128GB-...,965,"₹10,999","₹18,999",3.0 out of 5 stars,NOTE:
1,"OnePlus Nord CE 2 Lite 5G (Blue Tide, 6GB RAM,...","tv, audio & cameras",All Electronics,https://m.media-amazon.com/images/I/71AvQd3Vzq...,https://www.amazon.in/OnePlus-Nord-Lite-128GB-...,113956,"₹18,999","₹19,999",1.0 out of 5 stars,Very bad experience with this device xr phone....
2,OnePlus Bullets Z2 Bluetooth Wireless in Ear E...,"tv, audio & cameras",All Electronics,https://m.media-amazon.com/images/I/51UhwaQXCp...,https://www.amazon.in/Oneplus-Bluetooth-Wirele...,90304,"₹1,999","₹2,299",5.0 out of 5 stars,Amazing phone with amazing camera coming from ...
3,"Samsung Galaxy M33 5G (Mystique Green, 6GB, 12...","tv, audio & cameras",All Electronics,https://m.media-amazon.com/images/I/81I3w4J6yj...,https://www.amazon.in/Samsung-Mystique-Storage...,24863,"₹15,999","₹24,999",1.0 out of 5 stars,So I got the device XR just today. The product...
4,"OnePlus Nord CE 2 Lite 5G (Black Dusk, 6GB RAM...","tv, audio & cameras",All Electronics,https://m.media-amazon.com/images/I/71V--WZVUI...,https://www.amazon.in/OnePlus-Nord-Black-128GB...,113956,"₹18,999","₹19,999",5.0 out of 5 stars,I've been an android user all my life until I ...


In [3]:
df['name'] = (
    df['name'].astype(str)
    .str.replace(r'\(Renewed\)', '', regex=True)      # drop the "(Renewed)" tag
    .str.replace(r'[^a-zA-Z0-9\s]', '', regex=True)   # keep only letters, digits, whitespace
    .str.strip()                                      # remove leading/trailing spaces
)
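
The effect of the name-cleaning chain is easiest to see on a couple of strings. This is a minimal sketch using two made-up product names (they are assumptions, not rows from the dataset); the final whitespace-collapsing step is an optional addition that the notebook itself does not perform.

```python
import pandas as pd

# Hypothetical product names, invented for illustration
sample = pd.Series([
    "Redmi 10 Power (Power Black, 8GB RAM)",
    "Apple iPhone XR (Renewed) 64GB",
])

# Same three steps as the notebook cell
cleaned = (
    sample.astype(str)
    .str.replace(r'\(Renewed\)', '', regex=True)
    .str.replace(r'[^a-zA-Z0-9\s]', '', regex=True)
    .str.strip()
)

# Optional extra step (not in the notebook): collapse runs of whitespace
collapsed = cleaned.str.replace(r'\s+', ' ', regex=True)

print(cleaned.tolist())
print(collapsed.tolist())
```

When "(Renewed)" sits in the middle of a name, removing it leaves a double space behind, which `.str.strip()` does not touch. Collapsing whitespace may be worth adding if names are later compared or grouped.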

In [4]:
# Drop unwanted columns
df = df.drop(columns=['main_category', 'sub_category'])
# Extract brand: the leading run of word characters in the cleaned name
df['brand'] = df['name'].str.extract(r'^(\w+)', expand=False)
# Keep only the numeric part of review_rating (e.g. "3.0 out of 5 stars" -> 3.0)
df['review_rating'] = df['review_rating'].str.extract(r'(\d+\.\d+)', expand=False).astype(float)
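
As a quick check of the two regular expressions used above, here is a small sketch on sample values. The rating strings follow the "N.N out of 5 stars" format visible in `df.head()`; the third product name is hypothetical and included only to show a limitation.

```python
import pandas as pd

names = pd.Series([
    "Redmi 10 Power",
    "OnePlus Nord CE 2 Lite 5G",
    "Western Digital Elements",   # hypothetical two-word brand
])
ratings = pd.Series(["3.0 out of 5 stars", "1.0 out of 5 stars"])

# expand=False returns a Series rather than a one-column DataFrame
brands = names.str.extract(r'^(\w+)', expand=False)
numeric = ratings.str.extract(r'(\d+\.\d+)', expand=False).astype(float)

print(brands.tolist())
print(numeric.tolist())
```

Taking the first word is a heuristic: a multi-word brand is truncated to its first token, so the `brand` column is best treated as approximate.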

In [5]:
print("Missing values before cleaning:\n", df.isnull().sum(), "\n")

Missing values before cleaning:
 name                0
image               0
link                0
no_of_ratings      34
discount_price    270
actual_price       31
review_rating       0
review_text         3
brand               0
dtype: int64 



In [6]:
df = df.dropna(subset=['review_text'])

In [7]:
def clean_numeric(col):
    """Strip currency symbols and thousands separators, then convert to float."""
    col = col.astype(str).str.replace(r'[^\d.]', '', regex=True)  # keep digits and the decimal point only
    col = pd.to_numeric(col, errors='coerce')                      # unparseable values (e.g. '') become NaN
    return col

# Apply cleaning
for col in ['no_of_ratings', 'discount_price', 'actual_price', 'review_rating']:
    df[col] = clean_numeric(df[col])
    # Impute missing values with the column mean; review_rating is left as observed
    if col != 'review_rating':
        df[col] = df[col].fillna(df[col].mean())
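
To illustrate what the numeric cleaning and mean imputation do, here is a minimal sketch on three values. The price strings use the "₹10,999" format shown in `df.head()`; the missing entry is added as an assumption to show the fill step.

```python
import pandas as pd

def clean_numeric(col):
    # Same logic as the notebook: keep digits and '.', then coerce to float
    col = col.astype(str).str.replace(r'[^\d.]', '', regex=True)
    return pd.to_numeric(col, errors='coerce')

prices = pd.Series(["₹10,999", "₹1,999", None])

cleaned = clean_numeric(prices)          # the missing entry ends up as NaN
filled = cleaned.fillna(cleaned.mean())  # NaN replaced by the mean of the observed values

print(cleaned.tolist())
print(filled.tolist())
```

The mean is sensitive to outliers, and columns such as `no_of_ratings` can be heavily skewed. Swapping `.mean()` for `.median()` is one possible alternative, not something the notebook does.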

In [8]:
print("Missing values after cleaning:\n", df.isnull().sum(), "\n")

Missing values after cleaning:
 name              0
image             0
link              0
no_of_ratings     0
discount_price    0
actual_price      0
review_rating     0
review_text       0
brand             0
dtype: int64 



In [9]:
# Initialize tools
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

# Define text preprocessing function
def preprocess_text(text):
    """Lowercase, strip URLs and non-letters, drop stop words, and lemmatize."""
    if pd.isna(text):
        return ""
    text = str(text).lower()  # lowercase (str() guards against non-string cells)
    text = re.sub(r'http\S+|www\S+', '', text)  # remove URLs
    text = re.sub(r'[^a-z\s]', '', text)  # remove punctuation, digits, and non-ASCII characters
    text = ' '.join(word for word in text.split() if word not in stop_words)  # remove stopwords
    text = ' '.join(lemmatizer.lemmatize(word) for word in text.split())  # lemmatize (each word treated as a noun by default)
    return text

# Apply preprocessing to review_text
df['cleaned_review_text'] = df['review_text'].apply(preprocess_text)

# Display results
print(df[['review_text', 'cleaned_review_text']].head())

# Rename review_rating to rating for the labeling step below
df = df.rename(columns={'review_rating': 'rating'})


# 1️⃣ Create binary sentiment label
def label_sentiment(rating):
    if rating >= 4:
        return "Positive"
    elif rating <= 2:
        return "Negative"
    else:
        return None  # Neutral (rating strictly between 2 and 4, or missing); dropped below

df['sentiment'] = df['rating'].apply(label_sentiment)

# 2️⃣ Drop neutral reviews (copy so later edits do not trigger SettingWithCopyWarning)
df_binary = df[df['sentiment'].notnull()].copy()

# 3️⃣ Optional: check class distribution
print("Class distribution (binary):")
print(df_binary['sentiment'].value_counts())

# 4️⃣ Save cleaned binary dataset
df_binary.to_csv("cleaned_dataset.csv", index=False)

                                         review_text  \
0                                              NOTE:   
1  Very bad experience with this device xr phone....   
2  Amazing phone with amazing camera coming from ...   
3  So I got the device XR just today. The product...   
4  I've been an android user all my life until I ...   

                                 cleaned_review_text  
0                                               note  
1  bad experience device xr phone back camera fou...  
2  amazing phone amazing camera coming device plu...  
3  got device xr today product look amazing unfor...  
4  ive android user life decided try device xr io...  
Class distribution (binary):
sentiment
Positive    4448
Negative     406
Name: count, dtype: int64
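
In the sample output above, "I've" survives as `ive`. Punctuation is stripped before the stop-word filter runs, so a contraction such as "don't" (which is in NLTK's English list) turns into "dont" and no longer matches. One possible adjustment is to filter stop words while apostrophes are still present. The sketch below is not the notebook's method: the function name is hypothetical, and a tiny hand-written stop-word set stands in for `stopwords.words('english')` so that it runs without any downloads.

```python
import re

# Hypothetical stand-in for NLTK's English stop-word list (kept tiny on purpose)
STOP_WORDS = {"i", "the", "this", "is", "do", "not", "don't"}

def preprocess_keep_contractions(text):
    """Variant ordering: remove stop words first, strip apostrophes afterwards."""
    text = text.lower()
    text = re.sub(r'http\S+|www\S+', '', text)   # remove URLs
    text = re.sub(r"[^a-z'\s]", ' ', text)       # keep apostrophes for now
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    tokens = [t.replace("'", "") for t in tokens]  # now drop the apostrophes
    return ' '.join(t for t in tokens if t)

print(preprocess_keep_contractions("I don't like this phone!"))
```

With this ordering the negation "don't" is removed entirely, which changes the apparent polarity of the sentence. For sentiment work it may be preferable to exclude negations from the stop-word set; either way, the ordering makes that choice explicit.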
