
# 📊 Elasticity Project — Phase 2: Data Cleaning and Feature Engineering

---

## 📝 Purpose of this Notebook

This notebook initiates **Phase 2** of the elasticity modeling project:
- Clean the raw dataset after initial exploration
- Engineer features necessary for elasticity regression modeling
- Prepare a finalized dataset ready for modeling

---

## 📚 Tasks Covered

- Remove zero-sales observations, which cannot be log-transformed and would skew elasticity estimates
- Create log-transformed sales feature (`Log_Sales`)
- Engineer promotional flags and seasonal features (Month, Weekday, Year)
- Output a clean dataset for modeling
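
The tasks above can be sketched on a toy frame. This is a minimal illustration rather than the notebook's final implementation: the sample values are invented, and `Promo_Flag` is a hypothetical column name. `Log_Sales`, `Month`, `Weekday`, and `Year` follow the task list.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only (values are invented, not taken from the project dataset)
toy = pd.DataFrame(
    {"Sales": [5263, 0, 8314], "Promo": [1, 0, 1]},
    index=pd.to_datetime(["2015-07-31", "2015-08-02", "2015-08-03"]),
)

# Drop zero-sales rows first: log(0) is undefined
features = toy[toy["Sales"] > 0].copy()

features["Log_Sales"] = np.log(features["Sales"])
features["Promo_Flag"] = features["Promo"].astype(bool)  # hypothetical column name
features["Month"] = features.index.month
features["Weekday"] = features.index.dayofweek  # pandas convention: Monday=0 ... Sunday=6
features["Year"] = features.index.year

print(features)
```

Note that pandas numbers Monday as 0, so a Friday gets `Weekday` 4, while the dataset's own `DayOfWeek` column shows 5 for Friday 2015-07-31. Pick one convention before modeling.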

---

## 🔥 Next Steps After This Notebook

- Model log-sales as a function of price and promotions
- Estimate price elasticity across stores and products
- Build a Streamlit dashboard to visualize elasticity curves
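
For orientation, a common log-log specification for this kind of model is sketched below. It is an assumption about the modeling phase, not something this notebook defines: $P$ stands for a price variable, and the coefficient symbols are illustrative.

```latex
\log(\text{Sales}) = \alpha + \beta \, \log(P) + \gamma \, \text{Promo} + \varepsilon
```

Under this form, $\beta$ is read as the price elasticity (the approximate percentage change in sales for a one percent change in price), which is why zero-sales rows have to be handled before the log transform.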

---

## 🚀 Let's Get Started!

In [1]:
# 📚 Import necessary libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# 🏗️ Set some basic visual configs
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_context('talk')


In [5]:
# Load exploration-cleaned dataset
train_df2 = pd.read_csv(
    '../data/processed/train_df_exploration_clean.csv',
    index_col=0,
    parse_dates=['Date'],
    on_bad_lines='skip',
    low_memory=False
)

# Re-parse defensively: any value that cannot be parsed becomes NaT instead of raising
train_df2['Date'] = pd.to_datetime(train_df2['Date'], errors='coerce')



## 🚦 Step 1: Validate Date Column and Index
### Check that the Date column is:
- Actually parsed as datetime64
- Set as the index properly (or ready to be if needed)
- In ascending order (important for any time series modeling later)

In [6]:
# Check the index dtype (this reports the index, not the Date column)
print("Column type: ", train_df2.index.dtype)

# Check whether the index is sorted in ascending order
print("Data sorted: ", train_df2.index.is_monotonic_increasing)

# Display a sample
print("Head sample: \n", train_df2.head(3))


Column type:  int64
Data sorted:  True
Head sample: 
    Store  DayOfWeek       Date  Sales  Customers  Open  Promo StateHoliday  \
0      1          5 2015-07-31   5263        555     1      1            0   
1      2          5 2015-07-31   6064        625     1      1            0   
2      3          5 2015-07-31   8314        821     1      1            0   

   SchoolHoliday  
0              1  
1              1  
2              1  


### Results of above checks:
- Data sorted: `True`, but this describes the integer row index, not the dates, so date order is still unverified.
- Head of sample data: looks reasonable.
- Index dtype is int64: needs to be addressed.
    - The index is still just row numbers (int64), not the Date column.
    - Date is currently a regular column, not the index.
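
The difference between the index dtype and the `Date` column dtype can be checked directly. A minimal sketch on a made-up two-row frame (the values are illustrative), using the pandas helper `pd.api.types.is_datetime64_any_dtype`:

```python
import pandas as pd

# Illustrative frame with Date as a regular column and the default integer index
demo = pd.DataFrame(
    {"Date": pd.to_datetime(["2015-07-31", "2015-07-30"]), "Sales": [5263, 5020]}
)

index_is_datetime = pd.api.types.is_datetime64_any_dtype(demo.index)
column_is_datetime = pd.api.types.is_datetime64_any_dtype(demo["Date"])

print("Index is datetime: ", index_is_datetime)
print("Date column is datetime: ", column_is_datetime)
```

Checking both makes it clear whether a parsing problem or a missing `set_index` call is responsible for an unexpected dtype.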

## 🎯 Next Step:
- Convert the Date column into the actual DataFrame index.

In [7]:
# Make sure Date is datetime, then set it as the index
train_df2['Date'] = pd.to_datetime(train_df2['Date'], errors='coerce')
train_df2 = train_df2.set_index('Date')

# Confirm it worked
print("Column type after setting index: ", train_df2.index.dtype)
print("Data sorted: ", train_df2.index.is_monotonic_increasing)
print("Head sample: \n", train_df2.head(3))


Column type after setting index:  datetime64[ns]
Data sorted:  False
Head sample: 
             Store  DayOfWeek  Sales  Customers  Open  Promo StateHoliday  \
Date                                                                       
2015-07-31      1          5   5263        555     1      1            0   
2015-07-31      2          5   6064        625     1      1            0   
2015-07-31      3          5   8314        821     1      1            0   

            SchoolHoliday  
Date                       
2015-07-31              1  
2015-07-31              1  
2015-07-31              1  


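### Checks:
- The index is **not** monotonically increasing, so the rows are not in ascending date order.
- The same date appears for several stores, so the Date index contains repeated values.
- Repeated dates are expected here (one row per store per day). `is_monotonic_increasing` tolerates equal neighbouring values, so the repeats are not what makes the check fail.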
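
A minimal sketch of restoring ascending date order, using made-up rows. It assumes a stable sort (`kind="mergesort"`) is wanted so that rows sharing a date keep their relative order; the repeated dates remain after sorting.

```python
import pandas as pd

# Illustrative rows in descending date order, two stores per day
demo = pd.DataFrame(
    {"Store": [1, 2, 1, 2], "Sales": [5263, 6064, 5020, 5567]},
    index=pd.to_datetime(["2015-07-31", "2015-07-31", "2015-07-30", "2015-07-30"]),
)
demo.index.name = "Date"

print("Sorted before: ", demo.index.is_monotonic_increasing)

# Stable sort: rows with the same date stay in their original relative order
demo_sorted = demo.sort_index(kind="mergesort")

print("Sorted after: ", demo_sorted.index.is_monotonic_increasing)
print("Index unique: ", demo_sorted.index.is_unique)
```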

## 🎯 Next Step:
- Check for nulls, infinities, and implausible values across the dataset.
- Perform a "Data Health Check".
- Sort the index by date before any time-series modeling.

## 🎯 Data Health Check:
✅ 1. Check for NaNs<br>
✅ 2. Check for infinite values<br>
✅ 3. Check data types<br>
✅ 4. Check for duplicates<br>

In [10]:
# 1. Check for missing values
print("Missing Values Per Column:")
print(train_df2.isnull().sum())
print("-" * 50)

# 2. Check for infinite values
# Only check numeric columns for infinities
numeric_cols = train_df2.select_dtypes(include=['number'])

print("Any Infinite Values in Numeric Columns?")
print(np.isinf(numeric_cols).values.any())

# 3. Check data types
print("Data Types Overview:")
print(train_df2.dtypes)
print("-" * 50)

# 4. Check for duplicate rows (note: duplicated() ignores the index, so Date is not compared)
print("Number of Duplicate Rows:")
print(train_df2.duplicated().sum())


Missing Values Per Column:
Store            0
DayOfWeek        0
Sales            0
Customers        0
Open             0
Promo            0
StateHoliday     0
SchoolHoliday    0
dtype: int64
--------------------------------------------------
Any Infinite Values in Numeric Columns?
False
Data Types Overview:
Store             int64
DayOfWeek         int64
Sales             int64
Customers         int64
Open              int64
Promo             int64
StateHoliday     object
SchoolHoliday     int64
dtype: object
--------------------------------------------------
Number of Duplicate Rows:
154077


## 🔥 Professional Tip:
- To run numeric checks (`isinf`, `isnan`, outlier tests) on a DataFrame, call `.select_dtypes(include='number')` first.
- Leave string columns out unless you are processing text on purpose.
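
The tip can be wrapped in a small helper. This is a sketch: `count_infinite` is a hypothetical name, and the demo values are invented.

```python
import numpy as np
import pandas as pd


def count_infinite(df: pd.DataFrame) -> pd.Series:
    """Count +inf/-inf values per numeric column; non-numeric columns are skipped."""
    numeric = df.select_dtypes(include="number")
    return np.isinf(numeric).sum()


# Illustrative frame with one numeric and one string column
demo = pd.DataFrame(
    {"Sales": [5263.0, np.inf, 8314.0], "StateHoliday": ["0", "a", "0"]}
)

inf_counts = count_infinite(demo)
print(inf_counts)
```

Returning per-column counts, rather than a single boolean, shows where a problem sits if one is found.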

## 🏥 Diagnostics of Data Health Check

### 🧠 Summary of Checks:

| Check | Result | Verdict |
|------|--------|---------|
| Missing Values | 0 | ✅ Excellent |
| Infinite Values | False | ✅ Perfect |
| Data Types | Mostly Correct (minor note on `StateHoliday`) | ⚡ Flagged for later |
| Duplicate Rows | 154,077 (Date index not compared) | ⚡ Needs Investigation |

---

### 📚 Detailed Analysis:

#### ✅ 1. Missing Values
- **No NaNs** detected across any columns.
- **Verdict:** No immediate action needed.

---

#### ✅ 2. Infinite Values
- No `inf` or `-inf` values detected in numeric columns.
- **Verdict:** Safe to proceed.

---

#### ⚡ 3. Data Types
| Column | Data Type | Issue? |
|--------|-----------|--------|
| Store | int64 | No issues |
| DayOfWeek | int64 | No issues |
| Sales | int64 | No issues |
| Customers | int64 | No issues |
| Open | int64 | No issues |
| Promo | int64 | No issues |
| StateHoliday | object | ⚡ Flagged: Should be properly encoded |
| SchoolHoliday | int64 | No issues |

- `StateHoliday` is stored as an object type (`'0'`, `'a'`, `'b'`, `'c'`).
- This is normal for this dataset but should be **properly encoded** later during preprocessing.
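
Two possible encodings are sketched below under stated assumptions: the column holds only the four codes listed above, an integer `0` that slipped through parsing should be treated the same as the string `'0'`, and `IsStateHoliday` and the `StateHoliday_*` columns are hypothetical names. Which encoding to use is left to the preprocessing step.

```python
import pandas as pd

# Illustrative values, including a stray integer 0 alongside the string codes
demo = pd.DataFrame({"StateHoliday": ["0", "a", 0, "b", "c"]})

# Normalise to strings so 0 and '0' count as the same code
codes = demo["StateHoliday"].astype(str)

# Option A: a single binary flag (any holiday vs. none)
demo["IsStateHoliday"] = (codes != "0").astype(int)

# Option B: one indicator column per holiday code
dummies = pd.get_dummies(codes, prefix="StateHoliday", dtype=int)

encoded = pd.concat([demo[["IsStateHoliday"]], dummies], axis=1)
print(encoded)
```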

---

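#### ⚡ 4. Duplicate Rows
- **154,077 duplicate rows detected.**
- Caveat: `duplicated()` compares column values only and ignores the index. Since `Date` is now the index, rows that match on every column but fall on different dates are counted as duplicates.
- Next Step: **Investigate** whether these are:
  - Accidental duplicates (need removal)
  - Legitimate entries that only look identical once `Date` is ignored (keep)

**Verdict:** Investigation required before making changes.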
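
One way to start the investigation is to recount duplicates with `Date` included, since `DataFrame.duplicated()` looks only at column values. A minimal sketch on made-up rows (the data are illustrative, not drawn from the project dataset):

```python
import pandas as pd

# Store 1 is closed on two different Sundays; Store 2 has a true repeated row
demo = pd.DataFrame(
    {"Store": [1, 1, 2, 2], "Sales": [0, 0, 6064, 6064], "Open": [0, 0, 1, 1]},
    index=pd.to_datetime(["2015-07-26", "2015-07-19", "2015-07-31", "2015-07-31"]),
)
demo.index.name = "Date"

# Index ignored: identical column values on different dates look like duplicates
dupes_ignoring_date = int(demo.duplicated().sum())

# Date moved back into the columns so it takes part in the comparison
with_date = demo.reset_index()
dupes_with_date = int(with_date.duplicated().sum())

# Stricter key check: more than one row for the same store on the same day
dupes_store_date = int(with_date.duplicated(subset=["Store", "Date"]).sum())

print(dupes_ignoring_date, dupes_with_date, dupes_store_date)
```

If the count drops sharply once `Date` is included, most of the flagged rows are legitimate observations rather than accidental copies.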

---

### 📋 Professional Path Forward:

| Task | Action |
|-----|--------|
| `StateHoliday` object type | Flag for later encoding |
| Duplicate rows | Investigate and assess before removal |
| All other checks | ✅ Green light to proceed |

---

## 🚀 Notes:

- **This project follows the workflow of a real-world data science project.**
- **Data cleaning decisions are documented for full transparency.**
- **No blind assumptions: every step should be defensible in peer review.**
