_____

<p style="background: linear-gradient(to right, #1E90FF, #8A2BE2); -webkit-background-clip: text; color: transparent; display:block; font-weight:bold; font-size:40px;">
 Data Loading
</p>

<p style="background: linear-gradient(to right, #1E90FF, #8A2BE2); -webkit-background-clip: text; color: transparent; display:block;">
1. Introduction<br>
- Project description and objective of this notebook<br>
- Data sources (CSV files)<br>
- Goal: load and preview datasets
</p>

<p style="background: linear-gradient(to right, #1E90FF, #8A2BE2); -webkit-background-clip: text; color: transparent; display:block;">
2. Load Datasets<br>
- Load all CSV files<br>
- Handle encoding and separators<br>
- Display the row and column counts for each dataset
</p>

<p style="background: linear-gradient(to right, #1E90FF, #8A2BE2); -webkit-background-clip: text; color: transparent; display:block;">
3. Standardize Column Names<br>
- Convert all columns to lowercase<br>
- Replace spaces with underscores `_`<br>
- Review final column names
</p>

In [3]:
import pandas as pd

In [5]:
# Function to standardize column names
def standardize_columns(df):
    df.columns = (
        df.columns
        .str.strip()               # remove spaces at edges
        .str.lower()               # convert to lowercase
        .str.replace(' ', '_')     # replace spaces with underscore
    )
    return df
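
The function above replaces only literal single spaces. As a minimal sketch, a stricter variant could collapse any run of whitespace or hyphens into one underscore. The name `standardize_columns_strict` and the toy column names are hypothetical and are not part of the original notebook.

```python
import pandas as pd

def standardize_columns_strict(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical variant: lowercase the names and collapse any run of
    whitespace or hyphens into a single underscore."""
    out = df.copy()  # leave the caller's DataFrame untouched
    out.columns = (
        out.columns
        .str.strip()
        .str.lower()
        .str.replace(r"[\s\-]+", "_", regex=True)
    )
    return out

# Toy column names chosen for illustration only
demo = pd.DataFrame(columns=[" Product ID", "List  Price", "model-year"])
print(list(standardize_columns_strict(demo).columns))
# ['product_id', 'list_price', 'model_year']
```

Returning a copy instead of mutating in place is a design choice here; it makes the function safe to re-run in a notebook.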


In [6]:
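# Function to safely load CSV files
def load_csv(data):
    try:
        # Default attempt: UTF-8 encoding, comma separator
        df = pd.read_csv(data)
    except UnicodeDecodeError:
        # Not valid UTF-8: latin-1 maps every byte to a character, so decoding cannot fail
        df = pd.read_csv(data, encoding='latin-1')
    except pd.errors.ParserError:
        # Rows did not parse with a comma: auto-detect the separator
        df = pd.read_csv(data, sep=None, engine='python')

    df = standardize_columns(df)

    print(f"{data} → Rows: {df.shape[0]}, Columns: {df.shape[1]}")
    return df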
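
One caveat about the separator fallback: a file that uses a different delimiter usually does not raise an error with the default settings; it is simply read as a single column. The sketch below shows one way to detect that case, using an in-memory sample. The helper name `read_with_separator_check` and the sample text are assumptions made for illustration.

```python
import io
import pandas as pd

def read_with_separator_check(source_text: str) -> pd.DataFrame:
    """Hypothetical helper: re-read with separator detection when the
    default comma parse yields a single column."""
    df = pd.read_csv(io.StringIO(source_text))
    if df.shape[1] == 1:
        # sep=None asks the python engine to sniff the delimiter
        df = pd.read_csv(io.StringIO(source_text), sep=None, engine="python")
    return df

# Illustrative semicolon-separated sample, not taken from the real files
sample = "brand_id;brand_name\n1;Brand A\n2;Brand B\n"
df = read_with_separator_check(sample)
print(df.shape)
```

A genuinely single-column CSV would also trigger the second read, so this check is a heuristic rather than a guarantee.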

In [7]:
# ---- Load all files ----
products     = load_csv("products.csv")
brands       = load_csv("brands.csv")
categories   = load_csv("categories.csv")
customers    = load_csv("customers.csv")
orders       = load_csv("orders.csv")
order_items  = load_csv("order_items.csv")
stores       = load_csv("stores.csv")
staffs       = load_csv("staffs.csv")
stocks       = load_csv("stocks.csv")

products.csv → Rows: 334, Columns: 6
brands.csv → Rows: 9, Columns: 2
categories.csv → Rows: 7, Columns: 2
customers.csv → Rows: 1445, Columns: 9
orders.csv → Rows: 1615, Columns: 8
order_items.csv → Rows: 4764, Columns: 6
stores.csv → Rows: 3, Columns: 8
staffs.csv → Rows: 10, Columns: 8
stocks.csv → Rows: 939, Columns: 3


In [10]:
# Function to clean column names
def clean_columns(df):
    df.columns = (
        df.columns
        .str.lower()
        .str.strip()
        .str.replace(" ", "_")
    )
    return df


___________

<p style="background: linear-gradient(to right, #1E90FF, #8A2BE2); -webkit-background-clip: text; color: transparent; display:block; font-weight:bold; font-size:40px;">
 Data Cleaning
</p>

<p style="background: linear-gradient(to right, #1E90FF, #8A2BE2); -webkit-background-clip: text; color: transparent; display:block;">
1. Handling Missing Values<br>
- Detect missing values<br>
- Drop or fill missing values logically
</p>

<p style="background: linear-gradient(to right, #1E90FF, #8A2BE2); -webkit-background-clip: text; color: transparent; display:block;">
2. Correct Data Types<br>
- Convert IDs to integers<br>
- Convert dates to datetime<br>
- Convert prices to float<br>
- Ensure foreign key columns have matching types
</p>

<p style="background: linear-gradient(to right, #1E90FF, #8A2BE2); -webkit-background-clip: text; color: transparent; display:block;">
3. Outliers & Format Issues<br>
- Fix incorrect or inconsistent values<br>
- Remove negative quantities<br>
- Clean multi-valued phone numbers
</p>

<p style="background: linear-gradient(to right, #1E90FF, #8A2BE2); -webkit-background-clip: text; color: transparent; display:block;">
4. Duplicates<br>
- Detect duplicate rows<br>
- Remove or fix duplicates
</p>

__________


<p style="background: linear-gradient(to right, #1E90FF, #8A2BE2); -webkit-background-clip: text; color: transparent; display:block; font-weight:bold; font-size:22px;">
Missing Values
</p>

In [None]:
def clean_missing(df):
    print(df.isna().sum())

    df.dropna(how='all', inplace=True)

    if 'phone' in df.columns:
        df['phone'] = df['phone'].fillna("Unknown")

    return df
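
Raw counts from `isna().sum()` are easier to act on when shown next to the share of rows affected. The following is a rough sketch of such a report on toy data; `missing_report` and `toy_orders` are hypothetical names.

```python
import pandas as pd

def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: count and percentage of missing values,
    limited to columns that actually have gaps."""
    counts = df.isna().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    return report[report["missing"] > 0].sort_values("missing", ascending=False)

# Toy data for illustration only
toy_orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "order_date": ["2016-01-01", "2016-01-01", None, "2016-01-04"],
    "shipped_date": ["2016-01-03", None, None, "2016-01-05"],
})
report = missing_report(toy_orders)
print(report)
```

Whether a column such as `shipped_date` should be filled or left missing depends on what a gap means in the data (for example, an order not yet shipped), so the report informs the decision rather than making it.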


<p style="background: linear-gradient(to right, #1E90FF, #8A2BE2); -webkit-background-clip: text; color: transparent; display:block; font-weight:bold; font-size:22px;">
Data Types
</p>

In [21]:
def fix_dtypes(df):
    for col in df.columns:

        if col.endswith("_id"):
            if df[col].astype(str).str.isnumeric().all():
                df[col] = df[col].astype(int)
            else:
                print(f"⚠ Warning: Column '{col}' contains non-numeric values. Skipping.")

        if "price" in col:
            df[col] = pd.to_numeric(df[col], errors="coerce")

        if "date" in col:
            df[col] = pd.to_datetime(df[col], errors="coerce")

    return df
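
The `astype(str).str.isnumeric()` check rejects ID columns that contain missing values, because pandas stores those as floats and strings such as `"1.0"` or `"nan"` are not numeric. A possible alternative, sketched here under the assumption that missing IDs should stay missing, is pandas' nullable `Int64` dtype. The helper name `to_nullable_int` is hypothetical.

```python
import pandas as pd

def to_nullable_int(series: pd.Series) -> pd.Series:
    """Hypothetical helper: convert an ID column to the nullable Int64
    dtype, leaving it unchanged if any non-missing value is not numeric."""
    numeric = pd.to_numeric(series, errors="coerce")
    # Coercion must not create new missing values
    if numeric.isna().sum() != series.isna().sum():
        return series
    return numeric.astype("Int64")

# Toy column with a missing value, stored by pandas as float64
manager_id = pd.Series([None, 1.0, 2.0, 5.0], name="manager_id")
converted = to_nullable_int(manager_id)
print(converted.dtype)  # Int64

# A column with a non-numeric entry is returned as-is
mixed = to_nullable_int(pd.Series(["1", "x"]))
print(mixed.tolist())   # ['1', 'x']
```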


**Outliers & Format Issues**


In [None]:
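def clean_format_issues(df):

    if 'quantity' in df.columns:
        # Keep non-negative quantities; .copy() avoids SettingWithCopyWarning on later assignments
        df = df[df['quantity'] >= 0].copy()

    if 'phone' in df.columns:
        # Keep only the first number when several are comma-separated
        df['phone'] = df['phone'].astype(str)
        df['phone'] = df['phone'].apply(lambda x: x.split(",")[0].strip())

    return df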


**Duplicates**

In [14]:
def remove_duplicates(df):
    before = len(df)
    df.drop_duplicates(inplace=True)
    after = len(df)

    print(f"Removed {before - after} duplicates.")
    return df
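
Before dropping duplicates, it can help to look at them. This is a small sketch on toy data: `duplicated(keep=False)` marks every member of a duplicate group, and `subset=` restricts the comparison to key columns. Treating `(order_id, item_id)` as the key is an assumption here, and `toy_items` is a hypothetical frame.

```python
import pandas as pd

# Toy data for illustration only
toy_items = pd.DataFrame({
    "order_id":   [1, 1, 1, 2],
    "item_id":    [1, 2, 2, 1],
    "product_id": [20, 8, 8, 16],
    "quantity":   [1, 2, 2, 1],
})

# keep=False flags all rows of each duplicate group, not only the repeats
dupes = toy_items[toy_items.duplicated(keep=False)]
print(dupes)

# Number of repeated rows when only the assumed key columns are compared
key_dupes = int(toy_items.duplicated(subset=["order_id", "item_id"]).sum())
print(key_dupes)  # 1
```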


**Combine the cleaning steps into one function for all files**

In [22]:
def clean_dataset(df):
    df = clean_missing(df)
    df = fix_dtypes(df)
    df = clean_format_issues(df)
    df = remove_duplicates(df)
    return df

products     = clean_dataset(products)
brands       = clean_dataset(brands)
categories   = clean_dataset(categories)
customers    = clean_dataset(customers)
orders       = clean_dataset(orders)
order_items  = clean_dataset(order_items)
stores       = clean_dataset(stores)
staffs       = clean_dataset(staffs)
stocks       = clean_dataset(stocks)



product_id      0
product_name    0
brand_id        0
category_id     0
model_year      0
list_price      0
dtype: int64
Removed 0 duplicates.
brand_id      0
brand_name    0
dtype: int64
Removed 0 duplicates.
category_id      0
category_name    0
dtype: int64
Removed 0 duplicates.
customer_id    0
first_name     0
last_name      0
phone          0
email          0
street         0
city           0
state          0
zip_code       0
dtype: int64
Removed 0 duplicates.
order_id           0
customer_id        0
order_status       0
order_date        11
required_date     13
shipped_date     183
store_id           0
staff_id           0
dtype: int64
Removed 0 duplicates.
order_id      0
item_id       0
product_id    0
quantity      0
list_price    0
discount      0
dtype: int64
Removed 34 duplicates.
store_id      0
store_name    0
phone         0
email         1
street        0
city          0
state         0
zip_code      1
dtype: int64
Removed 0 duplicates.
staff_id      0
first_name    0

In [18]:
for df_name, df in {
    "products": products,
    "customers": customers,
    "orders": orders,
    "order_items": order_items,
    "staffs": staffs,
    "stores": stores,
}.items():

    print("\nChecking:", df_name)
    for col in df.columns:
        if col.endswith("_id"):
            print(col, "→", df[col].unique()[:10])



Checking: products
product_id → [ 1  2  3  4  5  6  7  8  9 10]
brand_id → [9 5 8 3 1 4 7 2 6]
category_id → [6 5 4 3 1 2 7]

Checking: customers
customer_id → [ 1  2  3  4  5  6  7  8  9 10]

Checking: orders
order_id → [ 1  2  3  4  5  6  7  8  9 10]
customer_id → [ 259 1212  523  175 1324   94  324 1204   60  442]
store_id → [1 2 3]
staff_id → [2 6 7 3 8 9]

Checking: order_items
order_id → [ 1  2  3  4  5  6  7  8  9 10]
item_id → ['1' '2' '3' '4' '5' '60' '14' '22' '17' '9']
product_id → [20  8 10 16  4  3  2 17 26 18]

Checking: staffs
staff_id → [ 1  2  3  4  5  6  7  8  9 10]
store_id → [nan  1.  2.  3.]
manager_id → [nan  1.  2.  5.  7.]

Checking: stores
store_id → [1 2 3]
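
The output above shows that a foreign key can have a different dtype from the column it refers to (`store_id` is float in `staffs` but integer in `stores`). A possible follow-up check, sketched on toy data, compares the dtypes and lists key values with no match in the parent table. The helper `check_foreign_key` and both toy frames are hypothetical.

```python
import pandas as pd

def check_foreign_key(child: pd.DataFrame, parent: pd.DataFrame, key: str) -> dict:
    """Hypothetical helper: report dtype agreement and orphaned key values."""
    same_dtype = bool(child[key].dtype == parent[key].dtype)
    child_keys = child[key].dropna()
    # Values present in the child table but absent from the parent table
    orphans = child_keys[~child_keys.isin(parent[key])].unique().tolist()
    return {"same_dtype": same_dtype, "orphans": orphans}

# Toy data for illustration only
toy_stores = pd.DataFrame({"store_id": [1, 2, 3]})
toy_staffs = pd.DataFrame({
    "staff_id": [1, 2, 3, 4],
    "store_id": [None, 1.0, 2.0, 4.0],
})
result = check_foreign_key(toy_staffs, toy_stores, "store_id")
print(result)  # {'same_dtype': False, 'orphans': [4.0]}
```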


_____________

<p style="background: linear-gradient(to right, #1E90FF, #8A2BE2); -webkit-background-clip: text; color: transparent; display:block; font-weight:bold; font-size:40px;">
Data Transformation
</p>

<p style="background: linear-gradient(to right, #1E90FF, #8A2BE2); -webkit-background-clip: text; color: transparent; display:block;">
1. Merge Tables<br>
- Merge products with brands and categories
</p>

<p style="background: linear-gradient(to right, #1E90FF, #8A2BE2); -webkit-background-clip: text; color: transparent; display:block;">
2. Calculate Metrics<br>
- Calculate total_price per order item<br>
- Calculate the total amount per order
</p>

<p style="background: linear-gradient(to right, #1E90FF, #8A2BE2); -webkit-background-clip: text; color: transparent; display:block;">
3. Customer Data<br>
- Create full_name column for customers<br>
- Clean and standardize phone numbers
</p>

In [None]:
# 1. Merge products with brands and categories
products_merged = (
    products
    .merge(brands, on="brand_id", how="left")
    .merge(categories, on="category_id", how="left")
)
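
A left merge silently keeps products whose `brand_id` has no match and silently multiplies rows if the lookup table has repeated keys. As a sketch on toy data, the `validate` and `indicator` parameters of `DataFrame.merge` can surface both problems. The frames `toy_products` and `toy_brands` are hypothetical.

```python
import pandas as pd

# Toy data for illustration only; brand_id 99 has no match on purpose
toy_products = pd.DataFrame({"product_id": [1, 2, 3], "brand_id": [9, 5, 99]})
toy_brands = pd.DataFrame({"brand_id": [5, 9], "brand_name": ["Brand A", "Brand B"]})

merged = toy_products.merge(
    toy_brands,
    on="brand_id",
    how="left",
    validate="many_to_one",  # raises MergeError if brand_id repeats in toy_brands
    indicator=True,          # adds a _merge column: "both" or "left_only"
)

# Rows that found no brand
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched[["product_id", "brand_id"]])
```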


In [None]:
# 2. Calculate total price per order item
order_items["total_price"] = order_items["quantity"] * order_items["list_price"]
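
The calculation above uses the list price only. Since `order_items` also has a `discount` column, a net amount could be computed as well. The sketch below assumes `discount` is stored as a fraction between 0 and 1, which the notebook does not confirm; `net_price` is a hypothetical column name.

```python
import pandas as pd

# Toy data for illustration only
toy_items = pd.DataFrame({
    "quantity":   [2, 1],
    "list_price": [100.0, 50.0],
    "discount":   [0.25, 0.0],   # assumed to be a fraction, not a percentage
})

# Net amount per line: quantity * unit price * (1 - discount)
toy_items["net_price"] = (
    toy_items["quantity"] * toy_items["list_price"] * (1 - toy_items["discount"])
)
print(toy_items["net_price"].tolist())  # [150.0, 50.0]
```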

In [28]:
# 3. Calculate total order amount
order_totals = (
    order_items
    .groupby("order_id")["total_price"]
    .sum()
    .reset_index()
    .rename(columns={"total_price": "order_total"})
)

orders = orders.merge(order_totals, on="order_id", how="left")
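
Re-running a merge cell in a notebook adds suffixed copies of the column (`order_total_x`, `order_total_y`), and orders without items end up with a missing total. One alternative, shown here as a sketch on toy data, is to map a per-order sum onto the orders table, which can be re-run safely. Filling missing totals with 0 is an assumption about what "no items" should mean.

```python
import pandas as pd

# Toy data for illustration only; order 3 has no items
toy_items = pd.DataFrame({
    "order_id":    [1, 1, 2],
    "total_price": [10.0, 5.0, 7.5],
})
toy_orders = pd.DataFrame({"order_id": [1, 2, 3]})

# Series indexed by order_id holding the sum of its line totals
totals = toy_items.groupby("order_id")["total_price"].sum()

# map() looks up each order_id in the Series; unmatched IDs become NaN
toy_orders["order_total"] = toy_orders["order_id"].map(totals).fillna(0.0)
print(toy_orders["order_total"].tolist())  # [15.0, 7.5, 0.0]
```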

In [29]:
# 4. Create full_name for customers
customers["full_name"] = customers["first_name"] + " " + customers["last_name"]
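
String concatenation with `+` returns a missing value whenever either part is missing, and it keeps stray spaces. A more tolerant version might look like the following sketch; the toy names are invented for illustration.

```python
import pandas as pd

# Toy data for illustration only
toy_customers = pd.DataFrame({
    "first_name": ["Ada", " Grace ", None],
    "last_name":  ["Lovelace", "Hopper", "Turing"],
})

# Treat missing parts as empty strings and trim surrounding spaces
first = toy_customers["first_name"].fillna("").str.strip()
last = toy_customers["last_name"].fillna("").str.strip()

# The final strip removes the separator left over when one part is empty
toy_customers["full_name"] = (first + " " + last).str.strip()
print(toy_customers["full_name"].tolist())
# ['Ada Lovelace', 'Grace Hopper', 'Turing']
```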

In [30]:
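# 5. Clean phone numbers
import re

def clean_phone(phone):
    if pd.isna(phone):
        return "Unknown"
    # Keep digits only (this also removes spaces, dashes and parentheses)
    digits = re.sub(r"\D", "", str(phone))
    # Placeholders such as "Unknown" contain no digits; keep them as "Unknown"
    return digits if digits else "Unknown"

customers["phone"] = customers["phone"].apply(clean_phone)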
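
The same cleaning can be done without a Python-level function call per row. This is a sketch of a vectorized equivalent using the `.str` accessor; the sample numbers are invented, and treating entries with no digits as `"Unknown"` is an assumption.

```python
import pandas as pd

# Toy data for illustration only
phones = pd.Series(["(555) 010-0199", None, "Unknown", "555 010 0142"])

# Remove every non-digit character; missing values become empty strings first
digits = phones.fillna("").astype(str).str.replace(r"\D", "", regex=True)

# Entries with no digits left are labelled "Unknown"
cleaned = digits.where(digits.str.len() > 0, "Unknown")
print(cleaned.tolist())
# ['5550100199', 'Unknown', 'Unknown', '5550100142']
```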

_____________

<p style="background: linear-gradient(to right, #1E90FF, #8A2BE2); -webkit-background-clip: text; color: transparent; display:block; font-weight:bold; font-size:30px;">
Done 💯
</p>
