## Data Processing

### Purpose
This notebook is focused on cleaning and preparing raw data for further analysis and modeling. The raw data originates from the data_task.xlsx file and includes the following datasets:

- order_numbers: Records the number of orders placed and their corresponding dates.
- transaction_data: Contains indices for total spending and weekly active users.
- reported_data: Aggregated revenue information for specific time periods.

The goal is to clean and process this data by applying the DataProcessor class.


### Import and Set Up
 

In [1]:
import pandas as pd
import os

from pathlib import Path

raw_path = "../data/raw/data_task.xlsx"
processed_path = "../data/processed/"
Path(processed_path).mkdir(parents=True, exist_ok=True)  # Ensure directory exists


### Load Raw Data

In [2]:
df_orders = pd.read_excel(raw_path, sheet_name="order_numbers")
df_transactions = pd.read_excel(raw_path, sheet_name="transaction_data")
df_reported = pd.read_excel(raw_path, sheet_name="reported_data")

df_orders.head()


Unnamed: 0,date,order_number
0,2018-01-07,33841906
1,2018-01-22,34008921
2,2018-01-25,34397468
3,2018-02-06,34434432
4,2018-02-08,34579365
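
The Unnamed: 0 column in the output above looks like an index that was written into the sheet, although that is an assumption about how the workbook was produced. If that is the case, it could be dropped right after loading. A minimal sketch on toy data, where `drop_unnamed_columns` is a hypothetical helper that is not part of this notebook:

```python
import pandas as pd

def drop_unnamed_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a new frame without columns whose name starts with 'Unnamed:'."""
    unnamed = [c for c in df.columns if str(c).startswith("Unnamed:")]
    return df.drop(columns=unnamed)

# Toy frame mimicking the first rows shown above
sample = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "date": ["2018-01-07", "2018-01-22"],
    "order_number": [33841906, 34008921],
})

trimmed = drop_unnamed_columns(sample)
print(list(trimmed.columns))
```

The original frame is left untouched, so the raw data stays available for comparison.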


### Define and Apply Cleaning Functions

The original data contained inconsistencies and errors that were corrected through a cleaning process. This included removing out-of-range values in order_number, converting dates to datetime, and standardizing the format of the period column.

The following functions were created:

- clean_order_numbers sorts the rows chronologically and removes invalid order_number entries.
- clean_transaction_data standardizes date formatting.
- clean_reported_data removes spaces in the period column and converts dates.

In [3]:
def clean_order_numbers(df_orders):
    df_orders = df_orders.copy()
    df_orders['date'] = pd.to_datetime(df_orders['date'])
    df_orders = df_orders.sort_values(by='date')
    # Keep rows whose order_number does not fall below the previous row
    keep = df_orders['order_number'].diff() >= 0
    # diff() is NaN for the first row, so keep that row explicitly
    keep.iloc[:1] = True
    df_orders = df_orders[keep]
    return df_orders

def clean_transaction_data(df_transactions):
    df_transactions = df_transactions.copy()
    df_transactions['date'] = pd.to_datetime(df_transactions['date'])
    return df_transactions

def clean_reported_data(df_reported):
    df_reported = df_reported.copy()
    df_reported['start_date'] = pd.to_datetime(df_reported['start_date'])
    df_reported['end_date'] = pd.to_datetime(df_reported['end_date'])
    df_reported['period'] = df_reported['period'].str.replace(" ", "", regex=False)
    return df_reported
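
To make the order_number rule concrete, here is a small sketch on made-up data (the values and the name `drop_decreasing_orders` are illustrative assumptions, not taken from the workbook). It sorts by date, compares each order_number with the previous one, and drops rows where the number goes down. Since `diff()` yields NaN for the first row, the sketch keeps that row explicitly.

```python
import pandas as pd

def drop_decreasing_orders(df: pd.DataFrame) -> pd.DataFrame:
    """Sort by date and drop rows whose order_number is below the previous row."""
    out = df.copy()
    out["date"] = pd.to_datetime(out["date"])
    out = out.sort_values(by="date")
    keep = out["order_number"].diff() >= 0
    keep.iloc[:1] = True  # the first row has no predecessor
    return out[keep].reset_index(drop=True)

# Hypothetical rows: unsorted dates and one order_number that is too low
toy = pd.DataFrame({
    "date": ["2018-01-07", "2018-01-25", "2018-01-22", "2018-02-06"],
    "order_number": [100, 5, 110, 120],
})

cleaned = drop_decreasing_orders(toy)
print(cleaned)
```

One limitation worth knowing: a single pass only compares neighbours, so an entry that is too high survives and the valid row after it is dropped instead.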


In [5]:
df_orders_clean = clean_order_numbers(df_orders)
df_transactions_clean = clean_transaction_data(df_transactions)
df_reported_clean = clean_reported_data(df_reported)
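
After applying the functions, a few quick checks can confirm that the cleaning did what was intended. The sketch below is one possible way to do this; `check_orders` is a hypothetical helper and the frames are toy data. The same function could be called on df_orders_clean.

```python
import pandas as pd

def check_orders(df: pd.DataFrame) -> dict:
    """Hypothetical sanity checks for a cleaned orders table."""
    return {
        "date_is_datetime": pd.api.types.is_datetime64_any_dtype(df["date"]),
        "dates_sorted": df["date"].is_monotonic_increasing,
        "orders_non_decreasing": df["order_number"].is_monotonic_increasing,
    }

# Toy frames: one that passes and one with a drop in order_number
good = pd.DataFrame({
    "date": pd.to_datetime(["2018-01-07", "2018-01-22", "2018-02-06"]),
    "order_number": [100, 110, 120],
})
bad = good.assign(order_number=[100, 90, 120])

print(check_orders(good))
print(check_orders(bad))
```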


### Save the Cleaned Data

In [7]:
df_orders_clean.to_csv(os.path.join(processed_path, "orders_cleaned.csv"), index=False)
df_transactions_clean.to_csv(os.path.join(processed_path, "transactions_cleaned.csv"), index=False)
df_reported_clean.to_csv(os.path.join(processed_path, "reported_cleaned.csv"), index=False)

print("Cleaned data saved ")


Cleaned data saved 


Cleaned data is saved in the processed/ directory as:
- orders_cleaned.csv
- transactions_cleaned.csv
- reported_cleaned.csv
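
CSV files do not store column types, so the datetime columns come back as text unless they are parsed again on load. A minimal round-trip sketch, using a temporary directory instead of the real processed/ folder and two of the rows shown earlier:

```python
import tempfile
from pathlib import Path

import pandas as pd

orders = pd.DataFrame({
    "date": pd.to_datetime(["2018-01-07", "2018-01-22"]),
    "order_number": [33841906, 34008921],
})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "orders_cleaned.csv"
    orders.to_csv(csv_path, index=False)

    # Without parse_dates the date column is read back as text
    plain = pd.read_csv(csv_path)
    # With parse_dates the datetime type is restored
    parsed = pd.read_csv(csv_path, parse_dates=["date"])

print(plain["date"].dtype)
print(parsed["date"].dtype)
```

For reported_cleaned.csv the same idea would apply to start_date and end_date.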
