## PHASE 2: FEATURE CLASSIFICATION
  
This phase classifies potential engineered features into three core buckets:
1. Numeric
2. Categorical
3. Text/NLP


| Type | Column / Derived Feature | Why Useful |
|----------|----------|----------|
| Numeric | discounted_price, actual_price, discount_percentage, Discount Ratio = discounted_price / actual_price | Numeric fields are core inputs for ranking, regression models, trend tracking, and pricing optimization. |
| Categorical | Category grouping, Price Tier ("Low", "Mid", "High" based on quantiles), Discount Tier ("Low", "Mid", "High" based on quantiles) | Enables aggregation, segmentation, and strategic targeting. Useful for dashboards and category-level forecasting. |
| Text/NLP | about_product, review_title, review_content, Title Length, Description Word Count, Sentiment Score (from review_content), Keyword Flags | Extracts qualitative insights from customer reviews and product descriptions. |
 
  

In [20]:
import pandas as pd
import numpy as np
import re

In [21]:
df_2 = pd.read_csv("/workspaces/Amazon-Sales-data-analysis/notebooks/cleaned_data.csv")
df_2.head()

Unnamed: 0,product_id,product_name,category,discounted_price,actual_price,discount_percentage,rating,rating_count,about_product,user_id,user_name,review_id,review_title,review_content,img_link,product_link,discount_percentage_num
0,B07JW9H4J1,Wayona Nylon Braided USB to Lightning Fast Cha...,Computers&Accessories|Accessories&Peripherals|...,399.0,1099.0,64%,4.2,24269.0,High Compatibility : Compatible With iPhone 12...,"AG3D6O4STAQKAY2UVGEUV46KN35Q,AHMY5CWJMMK5BJRBB...","Manav,Adarsh gupta,Sundeep,S.Sayeed Ahmed,jasp...","R3HXWT0LRP0NMF,R2AJM3LFTLZHFO,R6AQJGUP6P86,R1K...","Satisfied,Charging is really fast,Value for mo...",Looks durable Charging is fine tooNo complains...,https://m.media-amazon.com/images/W/WEBP_40237...,https://www.amazon.in/Wayona-Braided-WN3LG1-Sy...,64
1,B098NS6PVG,Ambrane Unbreakable 60W / 3A Fast Charging 1.5...,Computers&Accessories|Accessories&Peripherals|...,199.0,349.0,43%,4.0,43994.0,"Compatible with all Type C enabled devices, be...","AECPFYFQVRUWC3KGNLJIOREFP5LQ,AGYYVPDD7YG7FYNBX...","ArdKn,Nirbhay kumar,Sagar Viswanathan,Asp,Plac...","RGIQEG07R9HS2,R1SMWZQ86XIN8U,R2J3Y1WL29GWDE,RY...","A Good Braided Cable for Your Type C Device,Go...",I ordered this cable to connect my phone to An...,https://m.media-amazon.com/images/W/WEBP_40237...,https://www.amazon.in/Ambrane-Unbreakable-Char...,43
2,B096MSW6CT,Sounce Fast Phone Charging Cable & Data Sync U...,Computers&Accessories|Accessories&Peripherals|...,199.0,1899.0,90%,3.9,7928.0,【 Fast Charger& Data Sync】-With built-in safet...,"AGU3BBQ2V2DDAMOAKGFAWDDQ6QHA,AESFLDV2PT363T2AQ...","Kunal,Himanshu,viswanath,sai niharka,saqib mal...","R3J3EQQ9TZI5ZJ,R3E7WBGK7ID0KV,RWU79XKQ6I1QF,R2...","Good speed for earlier versions,Good Product,W...","Not quite durable and sturdy,https://m.media-a...",https://m.media-amazon.com/images/W/WEBP_40237...,https://www.amazon.in/Sounce-iPhone-Charging-C...,90
3,B08HDJ86NZ,boAt Deuce USB 300 2 in 1 Type-C & Micro USB S...,Computers&Accessories|Accessories&Peripherals|...,329.0,699.0,53%,4.2,94363.0,The boAt Deuce USB 300 2 in 1 cable is compati...,"AEWAZDZZJLQUYVOVGBEUKSLXHQ5A,AG5HTSFRRE6NL3M5S...","Omkar dhale,JD,HEMALATHA,Ajwadh a.,amar singh ...","R3EEUZKKK9J36I,R3HJVYCLYOY554,REDECAZ7AMPQC,R1...","Good product,Good one,Nice,Really nice product...","Good product,long wire,Charges good,Nice,I bou...",https://m.media-amazon.com/images/I/41V5FtEWPk...,https://www.amazon.in/Deuce-300-Resistant-Tang...,53
4,B08CF3B7N1,Portronics Konnect L 1.2M Fast Charging 3A 8 P...,Computers&Accessories|Accessories&Peripherals|...,154.0,399.0,61%,4.2,16905.0,[CHARGE & SYNC FUNCTION]- This cable comes wit...,"AE3Q6KSUK5P75D5HFYHCRAOLODSA,AFUGIFH5ZAFXRDSZH...","rahuls6099,Swasat Borah,Ajay Wadke,Pranali,RVK...","R1BP4L2HH9TFUP,R16PVJEXKV6QZS,R2UPDB81N66T4P,R...","As good as original,Decent,Good one for second...","Bought this instead of original apple, does th...",https://m.media-amazon.com/images/W/WEBP_40237...,https://www.amazon.in/Portronics-Konnect-POR-1...,61


In [22]:
# Convert the category column to a categorical dtype
df_2["category"] = df_2["category"].astype("category")
print(df_2["category"].dtype)

category


In [23]:
# Convert the free-text and identifier columns to the pandas string dtype
string_cols = ["product_name", "about_product", "review_title", "review_content", "user_id", "user_name", "review_id", "product_link", "img_link"]
df_2[string_cols] = df_2[string_cols].astype("string")
print(df_2.dtypes) 

product_id                         object
product_name               string[python]
category                         category
discounted_price                  float64
actual_price                      float64
discount_percentage                object
rating                            float64
rating_count                      float64
about_product              string[python]
user_id                    string[python]
user_name                  string[python]
review_id                  string[python]
review_title               string[python]
review_content             string[python]
img_link                   string[python]
product_link               string[python]
discount_percentage_num             int64
dtype: object


## Numeric Features

1. Discount Ratio

Formula: discounted_price / actual_price

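Why: It is the fraction of the list price the customer still pays, so it equals 1 - discount_percentage / 100. It carries the same information as discount_percentage, but on a 0 to 1 scale and computed directly from the two price columns, which also makes it a check on the rounded percentage. It does not capture absolute scale: a $10,000 product and a $10 product, both at 10% off, share a ratio of 0.9. The price difference (item 3) is what separates them.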

2. Ultra Discount Flag  

Logic: Mark if discount_percentage_num > 90.

Why: Helps isolate products being nearly given away.  

3. Price Difference 

Formula: actual_price - discounted_price

Why: Unlike the ratio or percentage, it gives the absolute amount taken off the list price.

In [24]:
df_2["discount_ratio"] = df_2["discounted_price"] / df_2["actual_price"]
df_2["price_difference"] = df_2["actual_price"] - df_2["discounted_price"]
df_2["ultra_discount"] = df_2["discount_percentage_num"] > 90   
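
As a quick check of the three definitions, here is a minimal sketch on a toy frame. The prices are hypothetical values chosen for illustration, not rows from the dataset, and `discount_pct` is a helper column that stands in for discount_percentage_num.

```python
import pandas as pd

# Toy data: hypothetical prices, not taken from cleaned_data.csv
toy = pd.DataFrame({
    "actual_price": [10000.0, 10.0, 1000.0],
    "discounted_price": [9000.0, 9.0, 50.0],
})

# Fraction of the list price that is still paid
toy["discount_ratio"] = toy["discounted_price"] / toy["actual_price"]
# Absolute amount taken off the list price
toy["price_difference"] = toy["actual_price"] - toy["discounted_price"]
# Percentage recovered from the ratio (stand-in for discount_percentage_num)
toy["discount_pct"] = (1 - toy["discount_ratio"]) * 100
# Strictly greater than 90, so a product at exactly 90% off is not flagged
toy["ultra_discount"] = toy["discount_pct"] > 90

print(toy)
```

The first two rows share the same ratio but differ by three orders of magnitude in price_difference, which is why both features are kept.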

In [25]:
df_2.head()

Unnamed: 0,product_id,product_name,category,discounted_price,actual_price,discount_percentage,rating,rating_count,about_product,user_id,user_name,review_id,review_title,review_content,img_link,product_link,discount_percentage_num,discount_ratio,price_difference,ultra_discount
0,B07JW9H4J1,Wayona Nylon Braided USB to Lightning Fast Cha...,Computers&Accessories|Accessories&Peripherals|...,399.0,1099.0,64%,4.2,24269.0,High Compatibility : Compatible With iPhone 12...,"AG3D6O4STAQKAY2UVGEUV46KN35Q,AHMY5CWJMMK5BJRBB...","Manav,Adarsh gupta,Sundeep,S.Sayeed Ahmed,jasp...","R3HXWT0LRP0NMF,R2AJM3LFTLZHFO,R6AQJGUP6P86,R1K...","Satisfied,Charging is really fast,Value for mo...",Looks durable Charging is fine tooNo complains...,https://m.media-amazon.com/images/W/WEBP_40237...,https://www.amazon.in/Wayona-Braided-WN3LG1-Sy...,64,0.363057,700.0,False
1,B098NS6PVG,Ambrane Unbreakable 60W / 3A Fast Charging 1.5...,Computers&Accessories|Accessories&Peripherals|...,199.0,349.0,43%,4.0,43994.0,"Compatible with all Type C enabled devices, be...","AECPFYFQVRUWC3KGNLJIOREFP5LQ,AGYYVPDD7YG7FYNBX...","ArdKn,Nirbhay kumar,Sagar Viswanathan,Asp,Plac...","RGIQEG07R9HS2,R1SMWZQ86XIN8U,R2J3Y1WL29GWDE,RY...","A Good Braided Cable for Your Type C Device,Go...",I ordered this cable to connect my phone to An...,https://m.media-amazon.com/images/W/WEBP_40237...,https://www.amazon.in/Ambrane-Unbreakable-Char...,43,0.570201,150.0,False
2,B096MSW6CT,Sounce Fast Phone Charging Cable & Data Sync U...,Computers&Accessories|Accessories&Peripherals|...,199.0,1899.0,90%,3.9,7928.0,【 Fast Charger& Data Sync】-With built-in safet...,"AGU3BBQ2V2DDAMOAKGFAWDDQ6QHA,AESFLDV2PT363T2AQ...","Kunal,Himanshu,viswanath,sai niharka,saqib mal...","R3J3EQQ9TZI5ZJ,R3E7WBGK7ID0KV,RWU79XKQ6I1QF,R2...","Good speed for earlier versions,Good Product,W...","Not quite durable and sturdy,https://m.media-a...",https://m.media-amazon.com/images/W/WEBP_40237...,https://www.amazon.in/Sounce-iPhone-Charging-C...,90,0.104792,1700.0,False
3,B08HDJ86NZ,boAt Deuce USB 300 2 in 1 Type-C & Micro USB S...,Computers&Accessories|Accessories&Peripherals|...,329.0,699.0,53%,4.2,94363.0,The boAt Deuce USB 300 2 in 1 cable is compati...,"AEWAZDZZJLQUYVOVGBEUKSLXHQ5A,AG5HTSFRRE6NL3M5S...","Omkar dhale,JD,HEMALATHA,Ajwadh a.,amar singh ...","R3EEUZKKK9J36I,R3HJVYCLYOY554,REDECAZ7AMPQC,R1...","Good product,Good one,Nice,Really nice product...","Good product,long wire,Charges good,Nice,I bou...",https://m.media-amazon.com/images/I/41V5FtEWPk...,https://www.amazon.in/Deuce-300-Resistant-Tang...,53,0.470672,370.0,False
4,B08CF3B7N1,Portronics Konnect L 1.2M Fast Charging 3A 8 P...,Computers&Accessories|Accessories&Peripherals|...,154.0,399.0,61%,4.2,16905.0,[CHARGE & SYNC FUNCTION]- This cable comes wit...,"AE3Q6KSUK5P75D5HFYHCRAOLODSA,AFUGIFH5ZAFXRDSZH...","rahuls6099,Swasat Borah,Ajay Wadke,Pranali,RVK...","R1BP4L2HH9TFUP,R16PVJEXKV6QZS,R2UPDB81N66T4P,R...","As good as original,Decent,Good one for second...","Bought this instead of original apple, does th...",https://m.media-amazon.com/images/W/WEBP_40237...,https://www.amazon.in/Portronics-Konnect-POR-1...,61,0.385965,245.0,False


## Categorical Features

Purpose: To group products for aggregation & segmentation.

| Feature Name | Method |
|----------|----------|
| category_group | Map raw category values into 5–10 super-groups by taking the top-level segment of the category path; groups with fewer than five products become "Others" |
| price_tier | Bin actual_price into "Low", "Mid", "High" using quantiles |
| discount_tier | Bin discount_percentage_num into "Low", "Mid", "High" using quantiles |

In [26]:
df_2["category"].unique()

['Computers&Accessories|Accessories&Peripherals..., 'Computers&Accessories|NetworkingDevices|Netwo..., 'Electronics|HomeTheater,TV&Video|Accessories|..., 'Electronics|HomeTheater,TV&Video|Televisions|..., 'Electronics|HomeTheater,TV&Video|Accessories|..., ..., 'Home&Kitchen|Kitchen&HomeAppliances|SmallKitc..., 'Home&Kitchen|Heating,Cooling&AirQuality|Parts..., 'Home&Kitchen|Kitchen&HomeAppliances|SmallKitc..., 'Home&Kitchen|Heating,Cooling&AirQuality|Fans|..., 'Home&Kitchen|Kitchen&HomeAppliances|Vacuum,Cl...]
Length: 211
Categories (211, object): ['Car&Motorbike|CarAccessories|InteriorAccessor..., 'Computers&Accessories|Accessories&Peripherals..., 'Computers&Accessories|Accessories&Peripherals..., 'Computers&Accessories|Accessories&Peripherals..., ..., 'OfficeProducts|OfficePaperProducts|Paper|Stat..., 'OfficeProducts|OfficePaperProducts|Paper|Stat..., 'OfficeProducts|OfficePaperProducts|Paper|Stat..., 'Toys&Games|Arts&Crafts|Drawing&PaintingSuppli...]

In [27]:
df_2["category_group"] = df_2["category"].str.split("|").str[0]
df_2["category_group"].unique() 

array(['Computers&Accessories', 'Electronics', 'MusicalInstruments',
       'OfficeProducts', 'Home&Kitchen', 'HomeImprovement', 'Toys&Games',
       'Car&Motorbike', 'Health&PersonalCare'], dtype=object)

In [28]:
# Replace category groups that have fewer than five products with "Others"
group_counts = df_2["category_group"].value_counts()
rare_groups = group_counts[group_counts < 5].index
df_2["category_group"] = df_2["category_group"].replace(list(rare_groups), "Others")
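
The rare-group step can be illustrated on a small Series. This is a sketch with made-up group counts; it uses `where` with `isin` as one possible way to express the same rule, not necessarily the form used in the notebook.

```python
import pandas as pd

# Hypothetical group labels with uneven frequencies
groups = pd.Series(
    ["Electronics"] * 6 + ["Home&Kitchen"] * 5 + ["Toys&Games"] * 2 + ["Car&Motorbike"]
)

counts = groups.value_counts()
rare = counts[counts < 5].index  # groups with fewer than five rows

# Keep frequent groups as they are, relabel the rest as "Others"
grouped = groups.where(~groups.isin(rare), "Others")

print(grouped.value_counts())
```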
 

In [29]:
# Bin prices and discount percentages into three quantile-based tiers: Low, Mid, High
df_2["price_tier"] = pd.qcut(df_2["actual_price"], q=3, labels=["Low", "Mid", "High"])
df_2["discount_tier"] = pd.qcut(df_2["discount_percentage_num"], q=3, labels=["Low", "Mid", "High"])
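
To see what quantile binning does, the sketch below runs `pd.qcut` on nine hypothetical prices and also returns the bin edges via `retbins=True`. The values are illustrative only.

```python
import pandas as pd

# Hypothetical prices, sorted for readability
prices = pd.Series([99, 149, 199, 399, 699, 999, 1899, 4999, 24999], dtype="float64")

# q=3 cuts at the 1/3 and 2/3 quantiles, so each tier gets about a third of the rows
tiers, edges = pd.qcut(prices, q=3, labels=["Low", "Mid", "High"], retbins=True)

print(tiers.value_counts().sort_index())
print(edges)  # four edges delimit three bins
```

Because the edges come from the data, tiers are balanced by row count rather than by price range. If many rows share the same value, two quantile edges can coincide and `pd.qcut` raises a ValueError; its `duplicates="drop"` option merges such bins, in which case the number of labels has to match the number of bins that remain.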
 

## Text/NLP Features

Purpose: To quantify and extract patterns from text columns such as review_title and review_content.
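
The feature table in this phase also lists Title Length and Description Word Count. A minimal sketch of both follows, assuming Title Length means the character count of review_title and Description Word Count means the whitespace-separated word count of about_product. The column names `title_length` and `description_word_count` and the toy rows are hypothetical.

```python
import pandas as pd

# Toy rows, including missing and empty text
toy = pd.DataFrame({
    "review_title": ["Good product", "Not worth it", None],
    "about_product": ["Fast charging cable with braided nylon", "", None],
})

# Treat missing text as empty so both features default to 0
title = toy["review_title"].fillna("")
about = toy["about_product"].fillna("")

toy["title_length"] = title.str.len()                       # characters
toy["description_word_count"] = about.str.split().str.len() # words

print(toy[["title_length", "description_word_count"]])
```

In this dataset review_title appears to hold several comma-joined titles per product, so the length measures the joined string rather than a single title.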

We use the review content and review title to gauge sentiment and customer satisfaction by picking out the words and phrases that best describe the product's validity and integrity, then sort each review into a satisfied or dissatisfied index accordingly.
 

In [30]:
pos_words = ["great", "excellent", "amazing", "good product", "satisfied", "happy", "love", "perfect", "worth it", "nice", "really nice", "value for money", "durable", "sturdy", "reliable", "authentic", "genuine", "as described", "recommend", "good quality"]

neg_words = ["bad", "poor", "disappointed", "terrible", "awful", "regret", "waste", "fake", "duplicate", "broken", "cracked", "flimsy", "weak", "overpriced", "cheap quality", "not working", "stopped working", "not as described", "useless", "returned"]

In [31]:
df_2["review_content"] = df_2["review_content"].fillna("").str.lower() 
df_2["review_title"] = df_2["review_title"].fillna("").str.lower() 
 

In [34]:
# Build one alternation pattern per word list; re.escape keeps each phrase literal
pos_pattern = "|".join(re.escape(word) for word in pos_words)
df_2["positive"] = df_2["review_content"].str.contains(pos_pattern, regex=True)

neg_pattern = "|".join(re.escape(word) for word in neg_words)
df_2["negative"] = df_2["review_content"].str.contains(neg_pattern, regex=True)
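
One caveat with plain alternation is that it matches substrings, so "bad" also matches inside "badge". The sketch below compares the plain pattern with a word-bounded variant on a few hypothetical reviews. The `\b` wrapper is a suggested refinement, not part of the notebook's pipeline, and `neg_terms` is a short illustrative subset of the negative list.

```python
import re
import pandas as pd

neg_terms = ["bad", "weak", "not working"]  # illustrative subset

reviews = pd.Series([
    "the badge looks fine",     # "bad" appears only inside "badge"
    "bad cable, not working",   # genuine negative phrases
    "a small tweak fixed it",   # "weak" appears only inside "tweak"
])

plain = "|".join(re.escape(term) for term in neg_terms)
bounded = r"\b(?:" + plain + r")\b"  # match whole words or phrases only

print(reviews.str.contains(plain, regex=True).tolist())
print(reviews.str.contains(bounded, regex=True).tolist())
```

Word boundaries remove the accidental substring hits, but neither pattern handles negation: a review such as "not quite durable and sturdy" still matches the positive terms "durable" and "sturdy".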

In [None]:
def customer(positive, negative):
    # Label a single review from its two keyword flags
    if positive and not negative:
        return "satisfied"
    elif negative and not positive:
        return "dissatisfied"
    elif positive and negative:
        return "Mixed"
    else:
        return np.nan  # neither word list matched

def main():
    # Apply the rule row by row and store the labels in a new column
    df_2["customer_sentiment"] = [
        customer(pos, neg) for pos, neg in zip(df_2["positive"], df_2["negative"])
    ]


main()


"not quite durable and sturdy,https://m.media-amazon.com/images/w/webp_402378-t1/images/i/71riggrbucl._sy88.jpg,working good,https://m.media-amazon.com/images/w/webp_402378-t1/images/i/61bkp9yo6wl._sy88.jpg,product,very nice product,working well,it's a really nice product"