# Feature Engineering Notebook

This notebook engineers features from the Online Retail Transactions dataset, following the feature plan in the README. The engineered dataset is saved as `engineered_data.csv` for later analysis.

In [11]:
import pandas as pd

# Load the cleaned dataset
df = pd.read_csv('../data/clean_online_retail.csv')
df.head()

| Unnamed: 0 | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850 | United Kingdom |


## Feature: TotalPrice

Pseudocode:
- Multiply Quantity by UnitPrice to get TotalPrice per transaction.

In [12]:
df['TotalPrice'] = df['Quantity'] * df['UnitPrice']
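
As a quick illustration of the product above, the sketch below applies the same multiplication to a small hypothetical sample (the rows and the rounding step are assumptions for illustration, not part of the notebook's pipeline). It shows that a row with a negative Quantity produces a negative TotalPrice.

```python
import pandas as pd

# Hypothetical three-row sample; the last row mimics a return (negative Quantity).
sample = pd.DataFrame({
    "Quantity": [6, 8, -2],
    "UnitPrice": [2.55, 2.75, 3.39],
})

# Same element-wise product as in the notebook cell.
sample["TotalPrice"] = sample["Quantity"] * sample["UnitPrice"]

# Optional: round to two decimals to hide floating-point noise when displaying.
sample["TotalPrice"] = sample["TotalPrice"].round(2)
```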

## Feature: InvoiceMonth

Pseudocode:
- Extract the month from InvoiceDate for monthly analysis.

In [13]:
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])
df['InvoiceMonth'] = df['InvoiceDate'].dt.month
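
One thing to keep in mind: `dt.month` returns only the month number, so the same month in two different years gets the same value. If the data spans more than one calendar year (an assumption here, suggested by the 2010 sample rows and the 2011 reference date used later), a year-month label may be a useful complement. A minimal sketch on hypothetical timestamps:

```python
import pandas as pd

# Two hypothetical invoice timestamps, one year apart.
dates = pd.Series(pd.to_datetime(["2010-12-01 08:26:00", "2011-12-09 12:50:00"]))

# Month number only: both rows become 12, so the years are indistinguishable.
month_only = dates.dt.month

# Year-month label: keeps the two Decembers apart.
year_month = dates.dt.to_period("M").astype(str)
```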

## Feature: InvoiceDayOfWeek

Pseudocode:
- Extract day of the week from InvoiceDate (0 = Monday, 6 = Sunday).

In [14]:
df['InvoiceDayOfWeek'] = df['InvoiceDate'].dt.dayofweek
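
To make the 0 to 6 encoding concrete, the sketch below maps two hypothetical timestamps to their day numbers and, for readability, to day names with `dt.day_name()`. The name column is an illustration only and is not one of the notebook's features.

```python
import pandas as pd

# Hypothetical timestamps: 2010-12-01 was a Wednesday, 2010-12-05 a Sunday.
dates = pd.Series(pd.to_datetime(["2010-12-01 08:26:00", "2010-12-05 10:00:00"]))

day_number = dates.dt.dayofweek   # Monday = 0 ... Sunday = 6
day_name = dates.dt.day_name()    # human-readable label for the same days
```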

## Feature: InvoiceWeekOfYear

Pseudocode:
- Extract the ISO 8601 week number (1 to 53) from InvoiceDate.

In [15]:
df['InvoiceWeekOfYear'] = df['InvoiceDate'].dt.isocalendar().week
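
Because `isocalendar()` follows ISO 8601, dates in early January can belong to the last week of the previous ISO year. The sketch below, on hypothetical dates, shows the `year` and `week` columns side by side so that this boundary case is visible.

```python
import pandas as pd

# Hypothetical dates around a year boundary.
dates = pd.Series(pd.to_datetime(["2010-12-01", "2011-01-01", "2011-01-03"]))

# isocalendar() returns a DataFrame with columns: year, week, day.
iso = dates.dt.isocalendar()

iso_year = iso["year"].tolist()
iso_week = iso["week"].tolist()
```

Here 2011-01-01 (a Saturday) falls in ISO week 52 of 2010, while 2011-01-03 (a Monday) starts ISO week 1 of 2011.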

## Feature: ReturnsFlag

Pseudocode:
- Identify returns where Quantity or UnitPrice is negative.
- If either is negative, mark as return (1), else 0.

In [16]:
df['ReturnsFlag'] = ((df['Quantity'] < 0) | (df['UnitPrice'] < 0)).astype(int)
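
The rule above can be checked on a small hypothetical sample. The rows below are invented for illustration; the flag expression is the same as in the notebook cell, and the mean of the flag gives the share of rows marked as returns.

```python
import pandas as pd

# Hypothetical rows: a normal sale, a negative quantity, and a negative price.
sample = pd.DataFrame({
    "Quantity": [6, -2, 1],
    "UnitPrice": [2.55, 3.39, -1.00],
})

# Flag is 1 when either Quantity or UnitPrice is negative, else 0.
sample["ReturnsFlag"] = (
    (sample["Quantity"] < 0) | (sample["UnitPrice"] < 0)
).astype(int)

# Share of rows flagged as returns.
return_share = sample["ReturnsFlag"].mean()
```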

## Feature: Recency

Pseudocode:
- Calculate the number of whole days between each customer’s most recent purchase and a fixed reference date.

In [17]:
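reference_date = pd.to_datetime('2011-12-10')  # Example reference date
# Most recent purchase timestamp per customer
recency_df = df.groupby('CustomerID')['InvoiceDate'].max().reset_index()
# Whole days between that purchase and the reference date
recency_df['Recency'] = (reference_date - recency_df['InvoiceDate']).dt.days
# Attach the per-customer value to every transaction row
df = df.merge(recency_df[['CustomerID', 'Recency']], on='CustomerID', how='left')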
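
As an alternative sketch (not the notebook's method), the same per-row Recency can be obtained without a merge by using `groupby(...).transform("max")`, which broadcasts each customer's latest date back to every row. The sample rows below are hypothetical. Note that `.dt.days` keeps only whole days, so a purchase less than 24 hours before the reference date yields 0.

```python
import pandas as pd

# Hypothetical transactions for two customers.
sample = pd.DataFrame({
    "CustomerID": [17850, 17850, 13047],
    "InvoiceDate": pd.to_datetime([
        "2010-12-01 08:26:00",
        "2011-11-30 09:00:00",
        "2011-12-09 12:50:00",
    ]),
})
reference_date = pd.Timestamp("2011-12-10")

# Latest purchase per customer, aligned to the original rows.
last_purchase = sample.groupby("CustomerID")["InvoiceDate"].transform("max")

# Whole days between the latest purchase and the reference date.
sample["Recency"] = (reference_date - last_purchase).dt.days
```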

## Feature: MonetaryValue

Pseudocode:
- Sum TotalPrice per customer. Rows with a negative TotalPrice (returns) reduce the total.

In [18]:
# Sum TotalPrice per customer, then attach the total to every transaction row
monetary_df = df.groupby('CustomerID')['TotalPrice'].sum().reset_index()
monetary_df = monetary_df.rename(columns={'TotalPrice': 'MonetaryValue'})
df = df.merge(monetary_df, on='CustomerID', how='left')
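
The sketch below repeats the sum-and-merge pattern on hypothetical rows to show how a negative TotalPrice affects the result: the customer's MonetaryValue is the net of purchases and returns, and the same value is repeated on each of that customer's rows.

```python
import pandas as pd

# Hypothetical rows: customer 1 has a purchase and a return, customer 2 one purchase.
sample = pd.DataFrame({
    "CustomerID": [1, 1, 2],
    "TotalPrice": [20.0, -5.0, 7.5],
})

# Net total per customer.
monetary = (
    sample.groupby("CustomerID")["TotalPrice"]
    .sum()
    .rename("MonetaryValue")
    .reset_index()
)

# Broadcast the per-customer total back onto every row.
sample = sample.merge(monetary, on="CustomerID", how="left")
```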

## Feature: CustomerFrequency

Pseudocode:
- Count number of unique invoices per customer.

In [19]:
# Count distinct invoices per customer, then attach the count to every transaction row
frequency_df = df.groupby('CustomerID')['InvoiceNo'].nunique().reset_index()
frequency_df = frequency_df.rename(columns={'InvoiceNo': 'CustomerFrequency'})
df = df.merge(frequency_df, on='CustomerID', how='left')
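
Since one invoice usually spans several line items, counting unique invoice numbers differs from counting rows. The sketch below contrasts the two on hypothetical data; the `LineItems` column is introduced only for comparison and is not one of the notebook's features.

```python
import pandas as pd

# Hypothetical rows: customer 17850 has two invoices spread over three line items.
sample = pd.DataFrame({
    "CustomerID": [17850, 17850, 17850, 13047],
    "InvoiceNo": ["536365", "536365", "536366", "536367"],
})

# Named aggregation: distinct invoices versus number of rows per customer.
per_customer = sample.groupby("CustomerID")["InvoiceNo"].agg(
    CustomerFrequency="nunique",
    LineItems="count",
)
```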

In [20]:
# Save the engineered dataset
df.to_csv('../data/engineered_data.csv', index=False)
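
CSV does not store column types, so when the saved file is read back, InvoiceDate arrives as plain text unless it is parsed again. The sketch below shows one way to reload with types restored. It uses an in-memory buffer and a hypothetical one-row frame instead of the real file, and treating InvoiceNo as a string is an assumption made for illustration.

```python
import io

import pandas as pd

# Hypothetical one-row frame standing in for the engineered dataset.
sample = pd.DataFrame({
    "InvoiceNo": ["536365"],
    "InvoiceDate": pd.to_datetime(["2010-12-01 08:26:00"]),
    "TotalPrice": [15.3],
})

# Write to an in-memory buffer rather than to disk.
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)

# Re-parse the date column and keep invoice numbers as strings on reload.
reloaded = pd.read_csv(buffer, parse_dates=["InvoiceDate"], dtype={"InvoiceNo": str})
```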