# 🌟 Step 1: Concept — What is Feature Engineering?

- In machine learning, the model doesn’t magically understand your raw data.
- It learns from features → numeric or categorical variables that describe the problem.

### Example:
- Raw column → "Sales"
- Engineered feature → "Average Order Value per Customer"

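- Good features usually lead to better predictions.
- Feature engineering is often said to be most of the work in ML. The commonly quoted "80%" is a rule of thumb, not a measured figure.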

# 🌟 Step 2: Why Hypothesis Thinking?

- When preparing features, you don’t just randomly create them. You think like a scientist:

- Hypothesis: "Customers who buy more frequently are less likely to churn."
    - So, we create order frequency.

- Hypothesis: "Customers with higher average order value bring more profit."
    - So, we create AOV (Average Order Value).

- This connects business intuition → features → ML performance.
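
A hypothesis like the first one can be sanity-checked before the feature goes into a model. The sketch below is only an illustration: the `customers` table and its `churned` label are invented here, since the dataset in this notebook has no churn column.

```python
import pandas as pd

# Hypothetical customer table; the `churned` label is invented for illustration.
customers = pd.DataFrame({
    "order_frequency": [0.50, 0.40, 0.35, 0.10, 0.08, 0.05],
    "churned": [0, 0, 0, 1, 1, 1],
})

# If the hypothesis holds, churned customers should show a lower mean frequency.
mean_by_group = customers.groupby("churned")["order_frequency"].mean()
print(mean_by_group)
```

A clear gap between the two group means is a hint that the feature is worth keeping, not proof that it will improve the model.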

In [4]:
import pandas as pd

df = pd.read_csv("data/superstore.csv")
df['Order Date'] = pd.to_datetime(df['Order Date'])
print(df.info())
print(df.head(10))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9993 entries, 0 to 9992
Data columns (total 21 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   Order ID         9993 non-null   object        
 1   Order Date       9993 non-null   datetime64[ns]
 2   Ship Date        9993 non-null   object        
 3   Ship Mode        9993 non-null   object        
 4   Customer ID      9993 non-null   object        
 5   Customer Name    9993 non-null   object        
 6   Segment          9993 non-null   object        
 7   Country          9993 non-null   object        
 8   City             9993 non-null   object        
 9   State            9993 non-null   object        
 10  Postal Code      9993 non-null   int64         
 11  Region           9993 non-null   object        
 12  Product ID       9993 non-null   object        
 13  Category         9993 non-null   object        
 14  Sub-Category     9993 non-null   object 

## Average Order Value (AOV)

#### In pandas:

- .unique() → returns an array of the unique values.
    - Example: df['Order ID'].unique() → gives you an array of all unique order IDs.
- .nunique() → returns the count of unique values (number of distinct entries).
    - Example: df['Order ID'].nunique() → gives you the number of unique order IDs.
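
A minimal sketch of the difference, using a made-up three-row frame rather than the real dataset:

```python
import pandas as pd

# Toy data: three line items that belong to two distinct orders.
orders = pd.DataFrame({"Order ID": ["A-1", "A-1", "B-2"]})

distinct_ids = orders["Order ID"].unique()     # array of the distinct values
distinct_count = orders["Order ID"].nunique()  # how many distinct values there are

print(distinct_ids)
print(distinct_count)
```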

In [None]:
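# One row per customer: total revenue and number of distinct orders
aov = df.groupby('Customer ID').agg(
    total_sales=('Sales', 'sum'),
    total_orders=('Order ID', 'nunique')
)
# AOV = total sales / number of distinct orders
aov['AOV'] = aov['total_sales'] / aov['total_orders']
print(aov)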

             total_sales  total_orders          AOV
Customer ID                                        
AA-10315        5563.560             5  1112.712000
AA-10375        1056.390             9   117.376667
AA-10480        1790.512             4   447.628000
AA-10645        5086.935             6   847.822500
AB-10015         886.156             3   295.385333
...                  ...           ...          ...
XP-21865        2374.658            11   215.878000
YC-21895        5454.350             5  1090.870000
YS-21880        6720.444             8   840.055500
ZC-21910        8025.707            13   617.362077
ZD-21925        1493.944             5   298.788800

[793 rows x 3 columns]
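
Why 'nunique' and not 'count' for total_orders? The sketch below assumes the data has one row per line item, so a single order can span several rows. The toy frame and the names `row_count` and `sales_per_row` are made up for illustration.

```python
import pandas as pd

# Hypothetical line-item data: one customer, two orders, three rows.
toy = pd.DataFrame({
    "Customer ID": ["C-1", "C-1", "C-1"],
    "Order ID": ["A-1", "A-1", "B-2"],
    "Sales": [100.0, 50.0, 150.0],
})

per_customer = toy.groupby("Customer ID").agg(
    total_sales=("Sales", "sum"),
    row_count=("Order ID", "count"),       # counts line items
    total_orders=("Order ID", "nunique"),  # counts distinct orders
)
per_customer["sales_per_row"] = per_customer["total_sales"] / per_customer["row_count"]
per_customer["AOV"] = per_customer["total_sales"] / per_customer["total_orders"]
print(per_customer)
```

Dividing by the row count gives the average value of a line item, which understates the value of an order whenever an order has more than one row.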


## Order Frequency

In [8]:
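# First/last order date and number of distinct orders per customer
customer_orders = df.groupby('Customer ID').agg(
    first_order=('Order Date', 'min'),
    last_order=('Order Date', 'max'),
    total_orders=('Order ID', 'nunique')
)
# Approximate active months: days between first and last order, divided by 30
customer_orders['active_months'] = ((customer_orders['last_order'] - customer_orders['first_order']).dt.days / 30).round(1)
# Orders per active month (becomes inf when active_months is 0)
customer_orders['order_frequency'] = customer_orders['total_orders'] / customer_orders['active_months']
print(customer_orders)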

            first_order last_order  total_orders  active_months  \
Customer ID                                                       
AA-10315     2019-03-31 2022-06-29             5           39.5   
AA-10375     2019-04-21 2022-12-11             9           44.3   
AA-10480     2019-05-04 2022-04-15             4           35.9   
AA-10645     2019-06-22 2022-11-05             6           41.1   
AB-10015     2019-02-18 2021-11-10             3           33.2   
...                 ...        ...           ...            ...   
XP-21865     2019-01-20 2022-11-17            11           46.6   
YC-21895     2019-11-17 2022-12-26             5           37.8   
YS-21880     2020-01-12 2022-12-21             8           35.8   
ZC-21910     2019-10-13 2022-11-06            13           37.3   
ZD-21925     2019-08-27 2022-06-11             5           34.0   

             order_frequency  
Customer ID                   
AA-10315            0.126582  
AA-10375            0.203160  
AA-1
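
One edge case to keep in mind: if a customer's first and last orders fall on the same day, active_months is 0 and the division returns inf. The sketch below shows one possible guard on made-up data. Treating a zero-length window as missing is an assumption here, not the only reasonable choice.

```python
import numpy as np
import pandas as pd

# Hypothetical customers: C-2 placed all of its orders on a single day.
toy = pd.DataFrame(
    {
        "first_order": pd.to_datetime(["2022-01-01", "2022-03-01"]),
        "last_order": pd.to_datetime(["2022-07-01", "2022-03-01"]),
        "total_orders": [6, 1],
    },
    index=pd.Index(["C-1", "C-2"], name="Customer ID"),
)

toy["active_months"] = ((toy["last_order"] - toy["first_order"]).dt.days / 30).round(1)

# Treat a zero-length window as missing so the division yields NaN instead of inf.
safe_months = toy["active_months"].replace(0, np.nan)
toy["order_frequency"] = toy["total_orders"] / safe_months
print(toy)
```

NaN can then be imputed or dropped explicitly before modelling, whereas inf tends to break scalers and many estimators.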

## Retention_Bucket

In [9]:
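# Use the most recent order date in the data as the reference "today"
max_date = df['Order Date'].max()
# Recency: days between the reference date and each customer's last order
customer_orders['days_since_last_order'] = (max_date - customer_orders['last_order']).dt.days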

def retention_status(days):
    if days <= 90:
        return "Active"
    elif days <= 180:
        return "At-risk"
    else:
        return "Lost"

customer_orders['Retention_Bucket'] = customer_orders['days_since_last_order'].apply(retention_status)
print(customer_orders)

            first_order last_order  total_orders  active_months  \
Customer ID                                                       
AA-10315     2019-03-31 2022-06-29             5           39.5   
AA-10375     2019-04-21 2022-12-11             9           44.3   
AA-10480     2019-05-04 2022-04-15             4           35.9   
AA-10645     2019-06-22 2022-11-05             6           41.1   
AB-10015     2019-02-18 2021-11-10             3           33.2   
...                 ...        ...           ...            ...   
XP-21865     2019-01-20 2022-11-17            11           46.6   
YC-21895     2019-11-17 2022-12-26             5           37.8   
YS-21880     2020-01-12 2022-12-21             8           35.8   
ZC-21910     2019-10-13 2022-11-06            13           37.3   
ZD-21925     2019-08-27 2022-06-11             5           34.0   

             order_frequency  days_since_last_order Retention_Bucket  
Customer ID                                              
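
The same 90/180-day bucketing can also be written without a Python-level function. This is a sketch of an alternative using pd.cut on made-up values, not the approach used in the cell above.

```python
import numpy as np
import pandas as pd

# Made-up recency values that sit on and around the two thresholds.
days = pd.Series([10, 90, 91, 180, 181, 400], name="days_since_last_order")

# Bins are right-inclusive by default: (-inf, 90], (90, 180], (180, inf)
buckets = pd.cut(
    days,
    bins=[-np.inf, 90, 180, np.inf],
    labels=["Active", "At-risk", "Lost"],
)
print(buckets.tolist())
```

The result is a categorical column, which keeps the bucket order (Active, At-risk, Lost) and is convenient for later encoding.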

In [12]:
# Merge features into one customer-level dataset
final_features = aov.merge(customer_orders, left_index=True, right_index=True)

# Save for ML
final_features.to_csv("data/customer_features.csv")
print(final_features.head())

             total_sales  total_orders_x          AOV first_order last_order  \
Customer ID                                                                    
AA-10315        5563.560               5  1112.712000  2019-03-31 2022-06-29   
AA-10375        1056.390               9   117.376667  2019-04-21 2022-12-11   
AA-10480        1790.512               4   447.628000  2019-05-04 2022-04-15   
AA-10645        5086.935               6   847.822500  2019-06-22 2022-11-05   
AB-10015         886.156               3   295.385333  2019-02-18 2021-11-10   

             total_orders_y  active_months  order_frequency  \
Customer ID                                                   
AA-10315                  5           39.5         0.126582   
AA-10375                  9           44.3         0.203160   
AA-10480                  4           35.9         0.111421   
AA-10645                  6           41.1         0.145985   
AB-10015                  3           33.2         0.090361  
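
The total_orders_x and total_orders_y columns appear because both frames carry a total_orders column, so merge adds suffixes to tell them apart. A minimal sketch of one way to avoid the duplicate, using toy frames (`aov_toy` and `orders_toy` are made-up stand-ins for aov and customer_orders):

```python
import pandas as pd

idx = pd.Index(["C-1", "C-2"], name="Customer ID")

# Toy stand-ins for the two customer-level frames; both contain total_orders.
aov_toy = pd.DataFrame({"total_sales": [300.0, 80.0], "total_orders": [2, 1]}, index=idx)
aov_toy["AOV"] = aov_toy["total_sales"] / aov_toy["total_orders"]
orders_toy = pd.DataFrame({"total_orders": [2, 1], "order_frequency": [0.5, 0.25]}, index=idx)

# Drop the overlapping column from one side before merging on the index.
merged = aov_toy.merge(
    orders_toy.drop(columns="total_orders"),
    left_index=True,
    right_index=True,
)
print(merged.columns.tolist())
```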