# 📊 Problem: E-Commerce Customer Orders Dataset

You’re given raw data from an online store.

| OrderID | CustomerName | Age | Gender | OrderDate   | ProductCategory | Quantity | Price | PaymentMethod | Review          |
|---------|--------------|-----|--------|-------------|-----------------|----------|-------|---------------|-----------------|
| 1001    | John Doe     | 25  | Male   | 2023-01-15  | Electronics     | 1        | 500   | Credit Card   | Great product!  |
| 1002    | Jane Smith   | NaN | female | 2023-01-16  | Fashion         | 2        | -50   | PayPal        |                 |
| 1003    | Alex Johnson | 45  | M      | 2023/01/18  | Electronics     | NaN      | 300   | cash          | Product ok      |
| 1004    | Maria Santos | 200 | F      | 18-01-2023  |                 | 1        | 100   | Credit card   | Fast delivery   |
| 1005    | John Doe     | 25  | male   | 2023-01-15  | Electronics     | 1        | 500   | credit card   | Great product!  |

---

## 🔎 Issues to Solve

### Missing Values
- **Age** missing for some customers.  
- **ProductCategory** blank.  
- **Quantity** missing.  
- **Review** sometimes empty.  

### Inconsistent Formats
- **Gender** has `"Male"`, `"male"`, `"M"`, `"F"`, `"female"`.  
- **OrderDate** stored in different formats (`2023-01-15`, `18-01-2023`, `2023/01/18`).  
- **PaymentMethod** inconsistencies (`Credit Card`, `Credit card`, `credit card`, `cash`).  

### Outliers / Wrong Values
- **Age = 200** (unrealistic).  
- **Price = -50** (negative).  

### Duplicates
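- Same order repeated (`John Doe`, orders `1001` and `1005`, with the same details apart from `OrderID` and letter case in **Gender** and **PaymentMethod**).  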
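
The issues listed above can be surfaced programmatically before any cleaning. The following is a minimal sketch, not part of the original notebook: it rebuilds only the audited columns from the sample table, and the variable names (`nan_counts`, `blank_counts`, `bad_age`, `bad_price`, `gender_variants`) and the thresholds are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the table above (only the columns being audited).
df = pd.DataFrame({
    "OrderID": [1001, 1002, 1003, 1004, 1005],
    "Age": [25, np.nan, 45, 200, 25],
    "Gender": ["Male", "female", "M", "F", "male"],
    "ProductCategory": ["Electronics", "Fashion", "Electronics", "", "Electronics"],
    "Quantity": [1, 2, np.nan, 1, 1],
    "Price": [500, -50, 300, 100, 500],
    "Review": ["Great product!", "", "Product ok", "Fast delivery", "Great product!"],
})

# True missing values (NaN) per column.
nan_counts = df.isna().sum()

# Empty strings are not NaN, so count them separately for the text columns.
text_cols = ["ProductCategory", "Review"]
blank_counts = df[text_cols].apply(lambda s: s.str.strip().eq("").sum())

# Rows that break simple sanity rules (the thresholds are assumptions).
bad_age = df[df["Age"] > 100]
bad_price = df[df["Price"] < 0]

# Distinct spellings that will need to be mapped to one label.
gender_variants = sorted(df["Gender"].unique())

print(nan_counts[nan_counts > 0])
print(blank_counts)
print(bad_age[["OrderID", "Age"]])
print(bad_price[["OrderID", "Price"]])
print(gender_variants)
```

Note that `isna()` alone would miss the blank **ProductCategory** and **Review** cells, because an empty string is a valid (non-missing) value to pandas.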

---

## 🛠 Preprocessing Plan

### Handle Missing Values
- **Age** → median imputation.  
- **Quantity** → fill with `1` (assume default).  
- **ProductCategory** → `"Unknown"`.  
- **Review** → `"No Review"`.  
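
The fills above could look roughly like the sketch below. It assumes the unrealistic age has already been set to `NaN` (otherwise `200` would pull the median up), and that blanks may arrive either as `NaN` or as empty strings. `fill_blank` is a hypothetical helper introduced only for this illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    # Assumption: the outlier Age = 200 was already replaced by NaN.
    "Age": [25, np.nan, 45, np.nan, 25],
    "Quantity": [1, 2, np.nan, 1, 1],
    "ProductCategory": ["Electronics", "Fashion", "Electronics", "", "Electronics"],
    "Review": ["Great product!", "", "Product ok", "Fast delivery", "Great product!"],
})


def fill_blank(series, placeholder):
    """Replace NaN and whitespace-only strings with a placeholder (hypothetical helper)."""
    cleaned = series.fillna("").str.strip()
    return cleaned.where(cleaned != "", placeholder)


# Numeric columns: median for Age, a default of 1 for Quantity.
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Quantity"] = df["Quantity"].fillna(1).astype(int)

# Text columns: explicit placeholder labels.
df["ProductCategory"] = fill_blank(df["ProductCategory"], "Unknown")
df["Review"] = fill_blank(df["Review"], "No Review")

print(df)
```

Filling **Quantity** with `1` is the plan's stated default; on real data it may be safer to flag such rows than to assume a value.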

### Fix Formats
- Standardize **Gender** → `"Male"`, `"Female"`.  
- Convert all **OrderDate** into proper datetime.  
- Normalize **PaymentMethod** → `"Credit Card"`, `"PayPal"`, `"Cash"`.  
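
One possible way to apply these format fixes is sketched below. The mapping dictionaries, the list of date formats, and the helpers `normalize_label` and `parse_mixed_dates` are assumptions made for this example. They cover only the variants that appear in the sample data, so real data would likely need more entries.

```python
import pandas as pd

df = pd.DataFrame({
    "Gender": ["Male", "female", "M", "F", "male"],
    "OrderDate": ["2023-01-15", "2023-01-16", "2023/01/18", "18-01-2023", "2023-01-15"],
    "PaymentMethod": ["Credit Card", "PayPal", "cash", "Credit card", "credit card"],
})

# Lowercased variant -> canonical label (assumed mappings for the sample data).
GENDER_MAP = {"male": "Male", "m": "Male", "female": "Female", "f": "Female"}
PAYMENT_MAP = {"credit card": "Credit Card", "paypal": "PayPal", "cash": "Cash"}

# Formats seen in the sample, tried in this order.
DATE_FORMATS = ["%Y-%m-%d", "%Y/%m/%d", "%d-%m-%Y"]


def normalize_label(series, mapping):
    """Trim and lowercase, then map to a canonical label; unknown values become NaN."""
    return series.str.strip().str.lower().map(mapping)


def parse_mixed_dates(series, formats):
    """Try each explicit format in turn and keep the first successful parse."""
    parsed = pd.to_datetime(series, format=formats[0], errors="coerce")
    for fmt in formats[1:]:
        parsed = parsed.fillna(pd.to_datetime(series, format=fmt, errors="coerce"))
    return parsed


df["Gender"] = normalize_label(df["Gender"], GENDER_MAP)
df["PaymentMethod"] = normalize_label(df["PaymentMethod"], PAYMENT_MAP)
df["OrderDate"] = parse_mixed_dates(df["OrderDate"], DATE_FORMATS)

print(df)
```

Explicit formats are used instead of automatic inference so that a day-first string such as `18-01-2023` is never read month-first. A mapping is used for **PaymentMethod** because `str.title()` would turn `PayPal` into `Paypal`.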

### Handle Outliers
- Replace unrealistic **Ages** (>100) with `NaN` → then impute.  
- Fix negative **Prices** → set to the absolute value, or set to `NaN` and impute.  
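
The two options for negative prices give different results, which the sketch below illustrates on the sample **Price** column. The names `price_abs`, `price_nan`, and `price_imputed` are illustrative, and median imputation is an assumption; the plan does not say which imputation to use for prices.

```python
import pandas as pd

price = pd.Series([500, -50, 300, 100, 500], name="Price")

# Option A: assume the sign is a data-entry error and keep the magnitude.
price_abs = price.abs()

# Option B: treat negative prices as missing, then impute with the median
# of the remaining valid prices.
price_nan = price.astype(float).mask(price < 0)
price_imputed = price_nan.fillna(price_nan.median())

print(pd.DataFrame({"raw": price, "abs": price_abs, "imputed": price_imputed}))
```

Option A keeps `50` for order `1002`, while Option B replaces it with `400`, the median of the four valid prices. Which one is appropriate depends on whether the negative sign is believed to be a typo or a sign of a corrupted value.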

### Remove Duplicates
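- Drop duplicate rows, comparing every column except `OrderID` once **Gender** and **PaymentMethod** are standardized (the repeated order has a different `OrderID` and different letter case).  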
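
A plain `drop_duplicates()` does not catch the repeated order in the sample, because the two rows differ in `OrderID` and in letter case. The sketch below shows one way around this on a reduced set of columns. Treating order `1005` as a duplicate rather than a genuine repeat purchase is an assumption, and `key_cols` is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    "OrderID": [1001, 1002, 1005],
    "CustomerName": ["John Doe", "Jane Smith", "John Doe"],
    "Gender": ["Male", "female", "male"],
    "OrderDate": ["2023-01-15", "2023-01-16", "2023-01-15"],
    "PaymentMethod": ["Credit Card", "PayPal", "credit card"],
    "Price": [500, -50, 500],
})

# Exact-row comparison finds nothing: OrderID and casing differ.
naive = df.drop_duplicates()

# Compare on case-normalized values and ignore the OrderID column.
normalized = df.copy()
for col in ["Gender", "PaymentMethod"]:
    normalized[col] = normalized[col].str.strip().str.lower()

key_cols = [c for c in normalized.columns if c != "OrderID"]
is_duplicate = normalized.duplicated(subset=key_cols, keep="first")
deduped = df[~is_duplicate]

print(len(naive), len(deduped))
print(deduped)
```

In the full pipeline the normalization loop would be unnecessary if deduplication runs after the format fixes.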

### Encoding for ML (Later Steps)
- One-hot encode **categorical** columns.  
- Scale **numeric** features (`Age`, `Quantity`, `Price`).  
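
The encoding step might look like the following sketch. The `clean` frame holds hypothetical already-cleaned values chosen for illustration, not the notebook's actual output. `pd.get_dummies` and scikit-learn's `StandardScaler` are one common choice among several.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical cleaned rows (illustrative values, not the notebook's output).
clean = pd.DataFrame({
    "Age": [25.0, 25.0, 45.0, 25.0],
    "Gender": ["Male", "Female", "Male", "Female"],
    "ProductCategory": ["Electronics", "Fashion", "Electronics", "Unknown"],
    "Quantity": [1.0, 2.0, 1.0, 1.0],
    "Price": [500.0, 50.0, 300.0, 100.0],
    "PaymentMethod": ["Credit Card", "PayPal", "Cash", "Credit Card"],
})

numeric_cols = ["Age", "Quantity", "Price"]
categorical_cols = ["Gender", "ProductCategory", "PaymentMethod"]

# One-hot encode: one 0/1 column per category value.
encoded = pd.get_dummies(clean[categorical_cols], dtype=int)

# Standardize numeric columns to zero mean and unit variance.
scaler = StandardScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(clean[numeric_cols]),
    columns=numeric_cols,
    index=clean.index,
)

features = pd.concat([scaled, encoded], axis=1)
print(features.columns.tolist())
print(features.shape)
```

When a model is trained later, the scaler would normally be fit on the training split only and then applied to the test split, so that test statistics do not leak into training.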


In [9]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

# --- Step 1: Create messy dataset ---
data = {
    "OrderID": [1001, 1002, 1003, 1004, 1005],
    "CustomerName": ["John Doe", "Jane Smith", "Alex Johnson", "Maria Santos", "John Doe"],
    "Age": [25, np.nan, 45, 200, 25],
    "Gender": ["Male", "female", "M", "F", "male"],
    "OrderDate": ["2023-01-15", "2023-01-16", "2023/01/18", "18-01-2023", "2023-01-15"],
    "ProductCategory": ["Electronics", "Fashion", "Electronics", "", "Electronics"],
    "Quantity": [1, 2, np.nan, 1, 1],
    "Price": [500, -50, 300, 100, 500],
    "PaymentMethod": ["Credit Card", "PayPal", "cash", "Credit card", "credit card"],
    "Review": ["Great product!", "", "Product ok", "Fast delivery", "Great product!"]
}

df = pd.DataFrame(data)
print("----- RAW DATA -----")
display(df)

----- RAW DATA -----


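|   | OrderID | CustomerName | Age   | Gender | OrderDate  | ProductCategory | Quantity | Price | PaymentMethod | Review         |
|---|---------|--------------|-------|--------|------------|-----------------|----------|-------|---------------|----------------|
| 0 | 1001    | John Doe     | 25.0  | Male   | 2023-01-15 | Electronics     | 1.0      | 500   | Credit Card   | Great product! |
| 1 | 1002    | Jane Smith   | NaN   | female | 2023-01-16 | Fashion         | 2.0      | -50   | PayPal        |                |
| 2 | 1003    | Alex Johnson | 45.0  | M      | 2023/01/18 | Electronics     | NaN      | 300   | cash          | Product ok     |
| 3 | 1004    | Maria Santos | 200.0 | F      | 18-01-2023 |                 | 1.0      | 100   | Credit card   | Fast delivery  |
| 4 | 1005    | John Doe     | 25.0  | male   | 2023-01-15 | Electronics     | 1.0      | 500   | credit card   | Great product! |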


# ABOVE IS THE SAMPLE DATA


In [None]:
# Treat unrealistic ages (> 100) as missing, then impute with the median.
# Assign the result back to the column: chained `inplace=True` calls on
# df["Age"] raise a FutureWarning and stop working in pandas 3.0.
df["Age"] = df["Age"].mask(df["Age"] > 100)
df["Age"] = df["Age"].fillna(df["Age"].median())

display(df)


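|   | OrderID | CustomerName | Age  | Gender | OrderDate  | ProductCategory | Quantity | Price | PaymentMethod | Review         |
|---|---------|--------------|------|--------|------------|-----------------|----------|-------|---------------|----------------|
| 0 | 1001    | John Doe     | 25.0 | Male   | 2023-01-15 | Electronics     | 1.0      | 500   | Credit Card   | Great product! |
| 1 | 1002    | Jane Smith   | 25.0 | female | 2023-01-16 | Fashion         | 2.0      | -50   | PayPal        |                |
| 2 | 1003    | Alex Johnson | 45.0 | M      | 2023/01/18 | Electronics     | NaN      | 300   | cash          | Product ok     |
| 3 | 1004    | Maria Santos | 25.0 | F      | 18-01-2023 |                 | 1.0      | 100   | Credit card   | Fast delivery  |
| 4 | 1005    | John Doe     | 25.0 | male   | 2023-01-15 | Electronics     | 1.0      | 500   | credit card   | Great product! |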
