# Initial Exploration: Amazon Sales 2025 Dataset

### Driving Questions
1. Which product categories and states generate the highest total sales during the Diwali season?
2. What factors influence customer review ratings?
3. How do different payment methods affect revenue and return rates?
4. Are there monthly or seasonal patterns in order volume or revenue in 2025?

### Phase 1: Load & Initial Reconnaissance 
1. Is the dataset tidy?
2. What are the entities (rows) and attributes (columns) represented in the data?
3. What variable types are present?
4. What is the rough shape and scale of the dataset?
5. Are there obvious problems?

In [None]:
import pandas as pd
import numpy as np

df = pd.read_csv(
    "../data/raw/amazon_sales_2025_INR.csv", parse_dates=["Date"], na_values=["NA", "", "null"]
)
df.head()

***Explanation***
1. The dataset loads with `Date` parsed as datetime, since time-based patterns are needed for driving question 4.
2. The dataset looks structured, with clearly labeled columns representing orders, products, customers, payments, and reviews.
3. Each row corresponds to a single transaction and each column holds one attribute, so the dataset is tidy.
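
Parsing `Date` as datetime is what makes monthly grouping straightforward later on. A minimal sketch of that idea, using a tiny made-up frame (`sample` and its values are assumptions for illustration, not rows from the real file):

```python
import pandas as pd

# Synthetic stand-in for the real data; values are invented for illustration.
sample = pd.DataFrame(
    {
        "Date": pd.to_datetime(
            ["2025-01-05", "2025-01-20", "2025-02-11", "2025-10-21"]
        ),
        "Total_Sales_INR": [1200.0, 800.0, 500.0, 3000.0],
    }
)

# Group by calendar month, then count orders and sum revenue per month.
monthly = (
    sample.groupby(sample["Date"].dt.to_period("M"))["Total_Sales_INR"]
    .agg(order_count="count", revenue="sum")
)
print(monthly)
```

The same grouping should work on `df` as long as `Date` parsed cleanly into a datetime column.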

In [None]:
df.info()

***Explanations***
1. The dataset contains 14 attributes: order ID, date, customer ID, product category, product name, quantity, unit price (INR), total sales (INR), payment method, delivery status, review rating, review text, state, and country.
2. The initial dataset contains object, datetime, and numeric (int, float) types. Some object columns are categorical in nature: Product_Category, Payment_Method, Delivery_Status, and State.
3. Those categorical-like columns need to be converted to the category dtype later.
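
The conversion could look roughly like the sketch below. It runs on a small synthetic frame (`sample` and its values are assumptions) so that it is self-contained; on the real data the same `astype` call would be applied to `df`.

```python
import pandas as pd

# Synthetic rows for illustration only; the column names follow the notebook.
sample = pd.DataFrame(
    {
        "Product_Category": ["Electronics", "Books", "Electronics"],
        "Payment_Method": ["UPI", "Credit Card", "UPI"],
        "Delivery_Status": ["Delivered", "Returned", "Delivered"],
        "State": ["Kerala", "Goa", "Kerala"],
    }
)

categorical_cols = ["Product_Category", "Payment_Method", "Delivery_Status", "State"]

# astype accepts a {column: dtype} mapping, so only the listed columns change.
sample = sample.astype({col: "category" for col in categorical_cols})
print(sample.dtypes)
```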


In [None]:
df.shape

***Explanations***

The dataset has 15,000 rows and 14 columns.

In [None]:
df.info(memory_usage="deep")

***Explanations***
The dataset requires around 8 MB of memory, so it is small to medium in scale and fits comfortably in memory.
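
Much of the deep memory footprint of a frame like this usually comes from text columns. The sketch below compares one repeated-label column stored as text and as category; the `states` series is synthetic and only illustrates the measurement, not the real numbers.

```python
import pandas as pd

# Synthetic column with two repeated labels; values are illustrative only.
states = pd.Series(["Kerala", "Goa"] * 5000, name="State")

# deep=True counts the actual string contents, not just the references to them.
text_bytes = states.memory_usage(deep=True)
category_bytes = states.astype("category").memory_usage(deep=True)

print(f"as text:     {text_bytes / 1024**2:.3f} MB")
print(f"as category: {category_bytes / 1024**2:.3f} MB")
```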

In [None]:
df.describe()

***Explanations***

Besides Date, there are 4 numeric variables: Quantity, Unit_Price_INR, Total_Sales_INR, and Review_Rating. Quantity ranges from 1 to 5, which is reasonable for consumer orders and does not suggest extreme outliers. Unit_Price_INR and Total_Sales_INR show large spreads between their minimum and maximum values, so they may contain outliers. Review_Rating stays within the expected 1-5 scale, with no invalid values.
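
One common way to follow up on the suspected outliers is the 1.5 × IQR rule. This is a sketch of that rule on made-up prices, not necessarily the method used later in this analysis; the `prices` values are assumptions.

```python
import pandas as pd

# Invented unit prices with one obviously extreme value.
prices = pd.Series(
    [250.0, 300.0, 320.0, 350.0, 400.0, 5000.0], name="Unit_Price_INR"
)

q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1

# Flag values more than 1.5 * IQR outside the interquartile range.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = prices[(prices < lower) | (prices > upper)]
print(outliers)
```

Flagged values are candidates for inspection rather than automatic removal, since a high price can be legitimate for some product categories.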

In [None]:
df.nunique()

***Explanations***
1. Product_Category, Payment_Method, and Delivery_Status each have only a few unique values, indicating they are true categorical variables. They are well suited to grouping and comparison when answering the driving questions.
2. Product_Name has 25 unique values, meaning the dataset covers 25 distinct products. It can probably be converted to category as well.
3. Customer_ID identifies customers and should not be treated as a categorical feature.
4. State contains 28 unique values, which is plausible for Indian geography (India has 28 states). This makes it a strong variable for analyzing regional sales differences in support of the driving questions.
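
The distinction between identifiers and categorical columns can be roughly quantified with the share of unique values per column. The sketch below uses a synthetic frame, and the 0.5 cutoff is a hypothetical threshold chosen only for illustration.

```python
import pandas as pd

# Synthetic rows: an ID-like column and a low-cardinality column.
sample = pd.DataFrame(
    {
        "Customer_ID": ["C001", "C002", "C003", "C004", "C005", "C006"],
        "Payment_Method": ["UPI", "Card", "UPI", "UPI", "Card", "UPI"],
    }
)

n_unique = sample.nunique()
cardinality = pd.DataFrame(
    {"n_unique": n_unique, "unique_ratio": n_unique / len(sample)}
)

# Hypothetical heuristic: a low unique ratio suggests a categorical column.
candidates = cardinality.index[cardinality["unique_ratio"] < 0.5].tolist()
print(cardinality)
print(candidates)
```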

In [None]:
df.duplicated().sum()

***Explanation***
`df.duplicated().sum()` returns 0, so no row is an exact duplicate of another and no transaction appears to be recorded twice.
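
`duplicated()` with no arguments only catches rows that match in every column. A stricter check looks at the order identifier alone. In the sketch below, the column name `Order_ID` and the sample rows are assumptions; the real header should be confirmed against `df.columns`.

```python
import pandas as pd

# Synthetic rows: ORD2 appears twice with different quantities.
sample = pd.DataFrame(
    {
        "Order_ID": ["ORD1", "ORD2", "ORD2", "ORD3"],
        "Quantity": [1, 2, 3, 1],
    }
)

# Full-row check: the two ORD2 rows differ in Quantity, so nothing is flagged.
full_row_duplicates = sample.duplicated().sum()

# Key-based check: keep=False marks every row that shares a repeated Order_ID.
repeated_ids = sample.duplicated(subset=["Order_ID"], keep=False)

print(full_row_duplicates)
print(sample[repeated_ids])
```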

In [None]:
df["Delivery_Status"].value_counts()

In [None]:
df["Payment_Method"].value_counts()

In [None]:
df["Product_Category"].value_counts()

In [None]:
df["Product_Name"].value_counts()

In [None]:
df["Review_Text"].value_counts()

***Explanation***
  
1. Based on `value_counts()` and `df.describe()` alone, I cannot find any missing values so far.
2. Delivery_Status, Payment_Method, Product_Category, Product_Name, and Review_Text should be converted to the categorical data type.
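
Because `value_counts()` drops NaN by default, a direct count of missing and blank entries is a more reliable check. A minimal sketch on synthetic data (the `sample` frame and its values are assumptions):

```python
import numpy as np
import pandas as pd

# Synthetic rows with one missing rating, one missing review, one blank review.
sample = pd.DataFrame(
    {
        "Review_Rating": [5.0, 4.0, np.nan, 3.0],
        "Review_Text": ["Great", None, "Okay", "   "],
    }
)

# Count missing values per column.
missing_per_column = sample.isna().sum()

# Whitespace-only strings are not NaN, so they need a separate check.
blank_reviews = sample["Review_Text"].str.strip().eq("").sum()

# dropna=False keeps missing values visible in the frequency table.
review_counts = sample["Review_Text"].value_counts(dropna=False)

print(missing_per_column)
print(blank_reviews)
print(review_counts)
```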


#### Overall Hypothesis generation & Iteration

- The variables most relevant to answering the research questions include:
  - Product_Category, State, Total_Sales_INR, Date (for 1 and 4)
  - Review_Rating, Review_Text, Delivery_Status, Unit_Price_INR (for 2)
  - Payment_Method, Total_Sales_INR, Delivery_Status (for 3)
- Useful relationships to explore later:
  - State vs Total_Sales_INR
  - Delivery_Status vs Review_Rating
  - Payment_Method vs Revenue or Return Rate
  - Date vs Order Volume and Seasonal Trends
- Iteration signals based on the initial load:
  - Some object columns should be converted to categorical types.
  - Text field (Review_Text) may contain noise or blanks.
  - Need to verify Total_Sales_INR, because it should equal Unit_Price_INR times Quantity.
  - Investigate any unexpected or extreme values.
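
The Total_Sales_INR check listed above could be sketched as follows. The frame is synthetic, with one deliberately inconsistent row, and `np.isclose` is used so that floating-point rounding is not reported as a mismatch.

```python
import numpy as np
import pandas as pd

# Synthetic rows; the third total is deliberately wrong (3 * 80 = 240, not 250).
sample = pd.DataFrame(
    {
        "Quantity": [1, 2, 3],
        "Unit_Price_INR": [499.0, 1250.5, 80.0],
        "Total_Sales_INR": [499.0, 2501.0, 250.0],
    }
)

expected = sample["Quantity"] * sample["Unit_Price_INR"]

# Compare with a tolerance rather than exact equality on floats.
matches = np.isclose(sample["Total_Sales_INR"], expected)
mismatched = sample.loc[~matches]

print(mismatched)
```

If the real data includes discounts, taxes, or shipping in the total, mismatches would be expected and the rule itself would need revisiting.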
