## About the Dataset

This dataset records online sales transactions across several product categories. Each row is a single transaction with its transaction ID, date, product category, product name, units sold, unit price, total revenue, region, and payment method.

### Columns:

- **Transaction ID**: Unique identifier for each sales order.
- **Date**: Date of the sales transaction.
- **Product Category**: Broad category of the product sold (e.g., Electronics, Home Appliances, Clothing, Books, Beauty Products, Sports).
- **Product Name**: Specific name or model of the product sold.
- **Units Sold**: Number of units of the product sold in the transaction.
- **Unit Price**: Price of one unit of the product.
- **Total Revenue**: Total revenue generated from the sales transaction (Units Sold * Unit Price).
- **Region**: Geographic region where the transaction occurred (e.g., North America, Europe, Asia).
- **Payment Method**: Method used for payment (e.g., Credit Card, PayPal, Debit Card).
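
The definition of **Total Revenue** can be checked directly against the other two numeric columns. The following is a minimal sketch, assuming the column names listed above; the `sample` frame is a hand-built stand-in whose five rows match the first rows of the file, not the full dataset.

```python
import numpy as np
import pandas as pd

# Five sample rows copied from the first rows of the dataset.
sample = pd.DataFrame({
    "Units Sold": [2, 1, 3, 4, 1],
    "Unit Price": [999.99, 499.99, 69.99, 15.99, 89.99],
    "Total Revenue": [1999.98, 499.99, 209.97, 63.96, 89.99],
})

# Recompute revenue and compare with a tolerance, since prices are floats.
expected = sample["Units Sold"] * sample["Unit Price"]
consistent = np.isclose(sample["Total Revenue"], expected)

print(consistent.all())
```

Running the same comparison on the full DataFrame would flag any row where the stored total disagrees with the product of the two columns.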

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

In [2]:
data = pd.read_csv("/content/drive/MyDrive/Datasets/Online_Sales_Data.csv")
data.head()

,Transaction ID,Date,Product Category,Product Name,Units Sold,Unit Price,Total Revenue,Region,Payment Method
0,10001,2024-01-01,Electronics,iPhone 14 Pro,2,999.99,1999.98,North America,Credit Card
1,10002,2024-01-02,Home Appliances,Dyson V11 Vacuum,1,499.99,499.99,Europe,PayPal
2,10003,2024-01-03,Clothing,Levi's 501 Jeans,3,69.99,209.97,Asia,Debit Card
3,10004,2024-01-04,Books,The Da Vinci Code,4,15.99,63.96,North America,Credit Card
4,10005,2024-01-05,Beauty Products,Neutrogena Skincare Set,1,89.99,89.99,Europe,PayPal


In [3]:
# Replace spaces with underscores in the column names used for modeling
data.rename(columns={'Product Name': 'Product_Name', 'Payment Method': 'Payment_Method', 'Total Revenue': 'Total_Revenue', 'Product Category': 'Product_Category'}, inplace=True)
data.head()

,Transaction ID,Date,Product_Category,Product_Name,Units Sold,Unit Price,Total_Revenue,Region,Payment_Method
0,10001,2024-01-01,Electronics,iPhone 14 Pro,2,999.99,1999.98,North America,Credit Card
1,10002,2024-01-02,Home Appliances,Dyson V11 Vacuum,1,499.99,499.99,Europe,PayPal
2,10003,2024-01-03,Clothing,Levi's 501 Jeans,3,69.99,209.97,Asia,Debit Card
3,10004,2024-01-04,Books,The Da Vinci Code,4,15.99,63.96,North America,Credit Card
4,10005,2024-01-05,Beauty Products,Neutrogena Skincare Set,1,89.99,89.99,Europe,PayPal


**Observations:**

- The selected columns have no missing values.
- **Product Name** and **Payment Method** are categorical variables.
- **Total Revenue** is a numerical variable.
- **Product Category** is the target variable.


In [4]:
# Check for missing values
missing_values = data.isnull().sum()

# Select relevant columns for training
selected_columns = ["Product_Name", "Payment_Method", "Total_Revenue", "Product_Category"]
df_selected = data[selected_columns]

# Display missing values and dataset summary
missing_values, df_selected.info(), df_selected.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 240 entries, 0 to 239
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Product_Name      240 non-null    object 
 1   Payment_Method    240 non-null    object 
 2   Total_Revenue     240 non-null    float64
 3   Product_Category  240 non-null    object 
dtypes: float64(1), object(3)
memory usage: 7.6+ KB


(Transaction ID      0
 Date                0
 Product_Category    0
 Product_Name        0
 Units Sold          0
 Unit Price          0
 Total_Revenue       0
 Region              0
 Payment_Method      0
 dtype: int64,
 None,
               Product_Name Payment_Method  Total_Revenue Product_Category
 0            iPhone 14 Pro    Credit Card        1999.98      Electronics
 1         Dyson V11 Vacuum         PayPal         499.99  Home Appliances
 2         Levi's 501 Jeans     Debit Card         209.97         Clothing
 3        The Da Vinci Code    Credit Card          63.96            Books
 4  Neutrogena Skincare Set         PayPal          89.99  Beauty Products)

In [5]:
# Number of rows per product category (class balance)

data['Product_Category'].value_counts()

Product_Category,count
Electronics,40
Home Appliances,40
Clothing,40
Books,40
Beauty Products,40
Sports,40


**Next Steps:**

- Convert categorical variables into numerical format using encoding.
- Split the dataset into training and testing sets.
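
To make the encoding step concrete, here is a small sketch of what `LabelEncoder` does to the six category labels. It sorts the distinct labels and assigns each one its position in that sorted order, so the integers carry no meaning beyond identity. The `categories` list below is typed in by hand from the value counts above.

```python
from sklearn.preprocessing import LabelEncoder

categories = ["Electronics", "Home Appliances", "Clothing",
              "Books", "Beauty Products", "Sports"]

encoder = LabelEncoder()
codes = encoder.fit_transform(categories)

# classes_ holds the sorted labels; each label's code is its index there.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original strings from the integer codes.
decoded = encoder.inverse_transform(codes)
print(list(decoded))
```

Because the codes follow alphabetical order, "Beauty Products" becomes 0 and "Sports" becomes 5. Tree-based models can split on such codes, but the ordering itself should not be read as a ranking.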

In [6]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Work on an explicit copy so the assignments below do not trigger
# pandas' SettingWithCopyWarning
df_selected = df_selected.copy()

# Encode categorical features
label_encoders = {}
for column in ["Product_Name", "Payment_Method", "Product_Category"]:
    le = LabelEncoder()
    df_selected[column] = le.fit_transform(df_selected[column])
    label_encoders[column] = le  # Store encoders for future decoding

# Split dataset into train and test sets (80% train, 20% test)
X = df_selected.drop(columns=["Product_Category"])
y = df_selected["Product_Category"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Display transformed dataset
df_selected.head(), X_train.shape, X_test.shape


(   Product_Name  Payment_Method  Total_Revenue  Product_Category
 0           230               0        1999.98                 3
 1            59               2         499.99                 4
 2           115               1         209.97                 2
 3           182               0          63.96                 1
 4           131               2          89.99                 0,
 (192, 3),
 (48, 3))

**Data Encoding and Splitting Summary:**

- The categorical columns (**Product Name**, **Payment Method**, and **Product Category**) have been converted into numerical values using Label Encoding.
- The dataset is split into:
  - Training Set: 192 samples (80%)
  - Testing Set: 48 samples (20%)
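
The split above is purely random, so the six classes need not be equally represented in the 48 test rows. One possible refinement, shown here only as a sketch on synthetic stand-in data (the `X` and `y` arrays below are placeholders, not the real features), is to pass `stratify=y` to `train_test_split` so that each class keeps its 1/6 share in both subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded target: 6 classes with 40 rows each.
y = np.repeat(np.arange(6), 40)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature, not real data

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# With stratification every class contributes the same share to the test set.
test_counts = np.bincount(y_test, minlength=6)
print(X_train.shape, X_test.shape)
print(test_counts)
```

With 40 rows per class and a 20% test size, stratification places 8 rows of every class in the test set, which makes per-class metrics easier to compare.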

**Next Step:**

Train a machine learning model to classify the **Product Category** from the three encoded inputs (Product Name, Payment Method, and Total Revenue). I'll start with a Decision Tree classifier.

In [7]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

# Train a Decision Tree model
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# Predictions on the test set
y_pred = model.predict(X_test)

# Evaluate model performance
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred, target_names=label_encoders['Product_Category'].classes_)

print("Model Accuracy:", accuracy)
print("Classification Report:\n", report)


Model Accuracy: 0.7291666666666666
Classification Report:
                  precision    recall  f1-score   support

Beauty Products       0.43      0.75      0.55         4
          Books       1.00      0.78      0.88         9
       Clothing       1.00      1.00      1.00         8
    Electronics       0.67      0.67      0.67        12
Home Appliances       0.80      0.50      0.62         8
         Sports       0.56      0.71      0.62         7

       accuracy                           0.73        48
      macro avg       0.74      0.73      0.72        48
   weighted avg       0.77      0.73      0.74        48
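
The two average rows in the report can be reproduced by hand from the per-class rows. The sketch below copies the rounded precision, recall, and support values from the report above, so the results agree with the report only up to that rounding.

```python
# Per-class values copied from the report (rounded to two decimals there).
precision = [0.43, 1.00, 1.00, 0.67, 0.80, 0.56]
recall = [0.75, 0.78, 1.00, 0.67, 0.50, 0.71]
support = [4, 9, 8, 12, 8, 7]

total = sum(support)

# Macro average: plain mean over classes, so every class counts equally.
macro_precision = sum(precision) / len(precision)

# Weighted average: mean over classes weighted by each class's support.
weighted_precision = sum(p * s for p, s in zip(precision, support)) / total

# Support-weighted recall is the share of correct predictions, i.e. accuracy.
weighted_recall = sum(r * s for r, s in zip(recall, support)) / total

print(f"macro precision:    {macro_precision:.3f}")
print(f"weighted precision: {weighted_precision:.3f}")
print(f"weighted recall:    {weighted_recall:.3f}")
```

The macro average treats the 4-sample Beauty Products class the same as the 12-sample Electronics class, while the weighted average lets larger classes dominate. That difference is why the two precision figures in the report are not identical.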



In [8]:
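# Function to make predictions
def predict_category(product_name, payment_method, total_revenue):
    """Predict the product category for a single transaction.

    product_name and payment_method must be values the encoders were
    fitted on; LabelEncoder.transform raises ValueError for unseen labels.
    """
    # Encode the two categorical inputs with the encoders fitted earlier
    product_encoded = label_encoders["Product_Name"].transform([product_name])[0]
    payment_encoded = label_encoders["Payment_Method"].transform([payment_method])[0]
    # Build a one-row frame with the same column names used in training
    input_data = pd.DataFrame([[product_encoded, payment_encoded, total_revenue]],
                              columns=["Product_Name", "Payment_Method", "Total_Revenue"])
    category_encoded = model.predict(input_data)[0]
    # Map the predicted integer code back to the category name
    category = label_encoders["Product_Category"].inverse_transform([category_encoded])[0]
    return category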
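
Since `LabelEncoder.transform` raises a `ValueError` for a label it was not fitted on, the function above only works for product names and payment methods that appear in the dataset. One way to handle that case is sketched below. `encode_or_none` is a hypothetical helper that is not part of this notebook, and the small `payment_encoder` is a stand-in fitted on the three payment methods listed in the column description.

```python
from sklearn.preprocessing import LabelEncoder


def encode_or_none(encoder, value):
    """Return the integer code for value, or None if the encoder never saw it.

    Hypothetical helper for illustration only.
    """
    if value not in set(encoder.classes_):
        return None
    return int(encoder.transform([value])[0])


# Stand-in encoder fitted on the three payment methods named earlier.
payment_encoder = LabelEncoder().fit(["Credit Card", "PayPal", "Debit Card"])

known = encode_or_none(payment_encoder, "PayPal")
unknown = encode_or_none(payment_encoder, "Bank Transfer")
print(known, unknown)
```

A caller could check for `None` and report that the input is outside the training vocabulary instead of letting the exception propagate.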

In [9]:
# Example usage
# Pass "Product_Name", "Payment_Method", and "Total_Revenue" as input; the function returns the predicted "Product_Category".

example_prediction = predict_category("iPhone 14 Pro", "Credit Card", 1999.98)
print("Predicted Category:", example_prediction)

Predicted Category: Electronics


In [10]:
# Second example
# Pass "Product_Name", "Payment_Method", and "Total_Revenue" as input; the function returns the predicted "Product_Category".

example_prediction = predict_category("Neutrogena Skincare Set", "PayPal", 89.99)
print("Predicted Category:", example_prediction)

Predicted Category: Beauty Products
