# Smart E-commerce Catalog Data Analysis
##5. Feature Engineering on E-commerce Data

###1. Importing the dataset

In [0]:
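# Load the product table through Spark, then convert it to a pandas DataFrame
# (`spark` is assumed to be the session that the notebook environment provides)

import pandas as pd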
product_df = spark.sql("SELECT * FROM product_db.product_table")
product_pd_df = product_df.toPandas()
product_pd_df.head(10)

Unnamed: 0,product_id,product_name,category,price,stock,rating,launch_date,description,is_active
0,251,Product 251,Home,474.100006,62,1.8,2024-08-21,This is a great product.,True
1,252,Product 252,Electronics,151.419998,100,1.4,2024-01-28,This is a great product.,True
2,253,Product 253,Home,438.809998,79,1.5,2024-04-21,This is a great product.,True
3,254,Product 254,Clothing,94.519997,88,4.1,2024-08-07,This is a great product.,False
4,255,Product 255,Electronics,208.380005,86,3.9,2024-03-17,This is a great product.,False
5,256,Product 256,Books,120.830002,30,2.2,2024-03-04,This is a great product.,True
6,257,Product 257,Home,340.76001,73,4.9,2024-03-08,This is a great product.,False
7,258,Product 258,Books,253.940002,53,4.9,2024-07-20,This is a great product.,True
8,259,Product 259,Home,30.83,61,4.5,2024-01-21,This is a great product.,True
9,260,Product 260,Toys,426.329987,36,1.3,2024-08-09,This is a great product.,False


###2. Creating "in_demand" Feature

####Deriving an "in_demand" feature from the rating column is a sensible approach in e-commerce data: ratings reflect customer satisfaction and often product popularity, both of which can correlate with demand.

#####A threshold of 3.0 or 3.5 seems most appropriate given the data I have: either value keeps products that perform reasonably well while excluding those that are likely not meeting customer expectations.

- Threshold of 3.0: A balanced choice that classifies products performing reasonably well as in demand while still filtering out those with very low ratings.

- Threshold of 3.5: A more selective definition of "in demand" that focuses on higher-rated products. That extra selectivity is not needed in this case.

- Choosing the threshold: I will use `3.0`, the more inclusive of the two options.

- Implementation: Add the `in_demand` feature to the dataset.

In [0]:
# Flag products with a rating of at least 3.0 as in demand (1); all others get 0
product_pd_df['in_demand'] = (product_pd_df['rating'] >= 3.0).astype(int)
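
To make the threshold comparison above concrete, here is a minimal sketch that measures what share of products each candidate threshold would flag. It uses only the ten ratings visible in the preview output, not the full table, and the helper name `in_demand_share` is hypothetical.

```python
import pandas as pd

# Ratings copied from the ten preview rows shown earlier (not the full table).
sample = pd.DataFrame({
    "product_id": range(251, 261),
    "rating": [1.8, 1.4, 1.5, 4.1, 3.9, 2.2, 4.9, 4.9, 4.5, 1.3],
})

def in_demand_share(df, threshold):
    """Return the fraction of rows whose rating is >= threshold."""
    return float((df["rating"] >= threshold).mean())

# Compare the two candidate thresholds discussed above.
shares = {t: in_demand_share(sample, t) for t in (3.0, 3.5)}
print(shares)
```

On these ten rows both thresholds flag the same five products, because no sampled rating falls between 3.0 and 3.5. Running the same comparison on the full `product_pd_df` would show whether the choice matters in practice.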

###3. Creating "rating_category" Feature
####Creating bins for the ratings to categorize products into different levels of performance (low, medium, high, very high).


In [0]:
# Categorize 'rating' values into 'Low', 'Medium', 'High', or 'Very_High' based on defined bins
# pd.cut uses right-closed intervals by default: (0, 2], (2, 3], (3, 4], (4, 5]

bins = [0, 2.0, 3.0, 4.0, 5.0]
labels = ['Low', 'Medium', 'High', 'Very_High']
product_pd_df['rating_category'] = pd.cut(product_pd_df['rating'], bins=bins, labels=labels)
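
The bin edges deserve a quick check, since `pd.cut` treats each interval as open on the left and closed on the right by default. The sketch below probes the edges with a few hypothetical ratings (they are not taken from the dataset) and contrasts the default with `include_lowest=True`.

```python
import pandas as pd

bins = [0, 2.0, 3.0, 4.0, 5.0]
labels = ['Low', 'Medium', 'High', 'Very_High']

# Hypothetical ratings chosen to sit on or near the bin edges.
edge_ratings = pd.Series([0.0, 2.0, 3.0, 3.1, 5.0])

# Default behavior: intervals are (0, 2], (2, 3], (3, 4], (4, 5].
default_cut = pd.cut(edge_ratings, bins=bins, labels=labels)

# include_lowest=True also admits the lowest edge (0.0) into the first bin.
inclusive_cut = pd.cut(edge_ratings, bins=bins, labels=labels, include_lowest=True)

print(pd.DataFrame({
    "rating": edge_ratings,
    "default": default_cut,
    "include_lowest": inclusive_cut,
}))
```

Two consequences follow. A rating of exactly 0 gets no category under the default settings, and a rating of exactly 3.0 is labeled `Medium` even though the `>= 3.0` rule marks it as in demand.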


###4. Handling Categorical Variables
One-Hot Encoding:

- Categorical variables like `category` must be converted to a format suitable for modeling.

In [0]:
# Convert specified categorical columns to dummy variables, dropping the first level to avoid multicollinearity

product_pd_df = pd.get_dummies(product_pd_df, columns=['category', 'rating_category'], drop_first=True)
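
As an illustration of what `drop_first=True` does, the sketch below encodes a small hypothetical frame that contains the five category values seen in the preview. Passing `dtype=int` is an addition not present in the original cell; it asks pandas for integer dummies instead of the version-dependent default (`uint8` in older releases, `bool` since pandas 2.0).

```python
import pandas as pd

# Hypothetical toy frame with the five category values from the preview.
toy = pd.DataFrame({
    "category": ["Home", "Electronics", "Books", "Toys", "Clothing"],
})

# drop_first=True removes the first level in sorted order ("Books"),
# which then acts as the implicit baseline category.
encoded = pd.get_dummies(toy, columns=["category"], drop_first=True, dtype=int)

print(list(encoded.columns))
print(encoded)
```

The `Books` row ends up with zeros in every dummy column. This matches the prepared table, which has no `category_Books` column.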

###5. Saving the prepared data

In [0]:
# Drop rows with missing values, then preview the dataset before saving
product_pd_df.dropna(inplace=True)
product_pd_df.head(10)

Unnamed: 0,product_id,product_name,price,stock,rating,launch_date,description,is_active,in_demand,category_Clothing,category_Electronics,category_Home,category_Toys,rating_category_Medium,rating_category_High,rating_category_Very_High
0,251,Product 251,474.100006,62,1.8,2024-08-21,This is a great product.,True,0,0,0,1,0,0,0,0
1,252,Product 252,151.419998,100,1.4,2024-01-28,This is a great product.,True,0,0,1,0,0,0,0,0
2,253,Product 253,438.809998,79,1.5,2024-04-21,This is a great product.,True,0,0,0,1,0,0,0,0
3,254,Product 254,94.519997,88,4.1,2024-08-07,This is a great product.,False,1,1,0,0,0,0,0,1
4,255,Product 255,208.380005,86,3.9,2024-03-17,This is a great product.,False,1,0,1,0,0,0,1,0
5,256,Product 256,120.830002,30,2.2,2024-03-04,This is a great product.,True,0,0,0,0,0,1,0,0
6,257,Product 257,340.76001,73,4.9,2024-03-08,This is a great product.,False,1,0,0,1,0,0,0,1
7,258,Product 258,253.940002,53,4.9,2024-07-20,This is a great product.,True,1,0,0,0,0,0,0,1
8,259,Product 259,30.83,61,4.5,2024-01-21,This is a great product.,True,1,0,0,1,0,0,0,1
9,260,Product 260,426.329987,36,1.3,2024-08-09,This is a great product.,False,0,0,0,0,1,0,0,0


In [0]:
# Import the SparkSession from the PySpark library
from pyspark.sql import SparkSession

# Initialize a Spark session or retrieve an existing one
spark = SparkSession.builder.getOrCreate()

# Convert a Pandas DataFrame (product_pd_df) to a Spark DataFrame
spark_df = spark.createDataFrame(product_pd_df)

# Write the Spark DataFrame to a Delta table in overwrite mode
# - format("delta"): Specifies Delta format to enable ACID transactions and time-travel features
# - mode("overwrite"): Overwrites any existing data in the specified table location
# - saveAsTable("product_db.products_data_prepared"): Saves the data to a table in the 'product_db' database with the name 'products_data_prepared'
spark_df.write.format("delta").mode("overwrite").saveAsTable("product_db.products_data_prepared")


  Unable to convert the field category_Clothing. If this column is not necessary, you may consider dropping it or converting to primitive type before the conversion.
Direct cause: [UNSUPPORTED_DATA_TYPE_FOR_ARROW_CONVERSION] uint8 is not supported in conversion to Arrow.
Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
  warn(msg)
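
The warning says that Arrow cannot convert `uint8` columns, so Spark falls back to its slower non-Arrow conversion path. One possible remedy is to cast those columns to `int64` before calling `spark.createDataFrame`. The sketch below shows the cast on a small hypothetical stand-in frame named `encoded`; it is not the original table and it does not call Spark.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the one-hot encoded frame, with uint8 dummy columns.
encoded = pd.DataFrame({
    "price": [474.10, 151.42],
    "category_Home": np.array([1, 0], dtype="uint8"),
    "category_Toys": np.array([0, 0], dtype="uint8"),
})

# Find every uint8 column and cast it to int64; other columns are left alone.
uint8_cols = list(encoded.select_dtypes(include="uint8").columns)
encoded = encoded.astype({col: "int64" for col in uint8_cols})

print(encoded.dtypes)
```

Applying the same cast to `product_pd_df` before the conversion should let the Arrow path handle the dummy columns, though this has not been verified against the original table.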
