### Section 1: Data Cleaning & Wrangling

#### Objectives
Transform raw, messy e-commerce data into analysis-ready format while handling data quality issues and creating meaningful features.

In [5]:
%pip install -r requirements.txt

Collecting pyarrow>=5.0.0 (from -r requirements.txt (line 15))
  Downloading pyarrow-21.0.0-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (3.3 kB)
Downloading pyarrow-21.0.0-cp312-cp312-manylinux_2_28_x86_64.whl (42.8 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m42.8/42.8 MB[0m [31m1.9 MB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0m
[?25hInstalling collected packages: pyarrow
Successfully installed pyarrow-21.0.0
Note: you may need to restart the kernel to use updated packages.


In [2]:
# imports
%matplotlib inline
import numpy as np
import pandas as pd
import seaborn as sns
from typing import Self
import dask.dataframe as dd
import matplotlib.pyplot as plt
from abc import ABC, abstractmethod

### Task 1.1: Data Quality Assessment

**Questions to Investigate**:

1. **Missing Data Analysis**
   - What percentage of records have missing user_session values?
   - Which product categories have the highest proportion of missing brand information?
   - Is there a correlation between missing price data and specific event types?

In [None]:
# transforming the data from csv format to parquet due to memory issues
df = dd.read_csv("/home/tovix/projects/E-commerceCustomerBehaviorAnalysis/data/raw/2019-Nov.csv")
df.to_parquet("/home/tovix/projects/E-commerceCustomerBehaviorAnalysis/data/raw/2019_nov_prqt")

In [25]:
df = dd.read_csv("/home/tovix/projects/E-commerceCustomerBehaviorAnalysis/data/raw/2019-Oct.csv")
df.to_parquet("/home/tovix/projects/E-commerceCustomerBehaviorAnalysis/data/raw/2019_oct_prqt")

### DataRawAnalysis Class Summary
---
####  Overview
A Dask-powered analyzer for large e-commerce datasets that performs automated data quality assessments with focus on missing values.
#### Key Capabilities
- **Missing Value Analysis**: Tracks nulls across critical fields (user_session, brand, price)
- **Hierarchical Category Breakdown**: Analyzes brand gaps across main/sub/product-type categories  
- **Event Type Correlation**: Checks price completeness across different event types
- **Memory-Efficient**: Handles 40M+ rows using Dask parallel processing
####  Output Metrics
- Total rows & null percentages
- Top categories with missing brands (3 hierarchy levels)
- Price data completeness by event type
- User session integrity stats

In [7]:
class DataRawAnalysis:
    """A comprehensive data analysis class for processing large e-commerce datasets with Dask.
    
    This class specializes in analyzing data quality issues in e-commerce data, with particular
    focus on missing values distribution across various fields including user sessions, brand 
    information, and pricing data. It leverages Dask for parallel processing to efficiently 
    handle large-scale datasets that exceed memory limits.
    
    Key Analysis Capabilities:
    - Overall data quality statistics (null counts, percentages)
    - User session missingness analysis
    - Hierarchical analysis of missing brand information across category levels
    - Correlation analysis between missing price data and event types
    - Memory-efficient processing of large datasets
    
    Attributes:
        datapath (str): Path to the parquet dataset directory or file
        totalRows (int): Total number of rows in the dataset
        totalUserSessionNulls (int): Total number of rows with missing user_session values
        nullPercentageAny (float): Percentage of rows with missing user_session values
        highestCatWithNulls (str): Category code with the highest number of missing brand values
        df (dask.DataFrame): The loaded DataFrame for analysis
    
    Example:
        >>> analyzer = DataRawAnalysis("path/to/parquet/data")
        >>> print(f"Total rows processed: {analyzer.totalRows:,}")
        >>> print(f"Missing user sessions: {analyzer.nullPercentageAny:.2f}%")
        >>> print(f"Highest null brand category: {analyzer.highestCatWithNulls}")
    
    Note:
        The constructor automatically executes all analysis methods upon initialization.
        For very large datasets, initialization may take several minutes to complete.
    """
    
    def __init__(self, datapath: str) -> None:
        """Initializes the DataRawAnalysis instance and loads the dataset.
        
        Args:
            datapath: Path to the parquet dataset directory or file. Can be a single
                     file or a directory containing multiple parquet files. Supports
                     any storage format supported by Dask (local, S3, HDFS, etc.).
        
        Raises:
            Exception: If the parquet file cannot be loaded or is malformed
            FileNotFoundError: If the specified path does not exist
            ValueError: If the parquet file contains unsupported data formats
        
        Note:
            The constructor automatically triggers the full analysis pipeline including:
            - Basic statistics calculation (including user_session analysis)
            - Hierarchical category analysis for missing brands
            - Price missingness correlation with event types
            For large datasets, this may take significant time to compute.
        """
        self.datapath = datapath
        self.totalUserSessionNulls = 0
        self.highestCatWithNulls = ""
        self.totalRows = 0
        self.nullPercentageAny = 0.0
        
        try:
            self.df = dd.read_parquet(datapath)
            self._calculateStatistics()
            self._extractHighNullBrandsPerHierCategory()
            self._corrMissingPricePerEventType()
            
        except Exception as e:
            print("error init the raw data analysis: {}".format(e))
            raise

    def _calculateStatistics(self) -> None:
        """Calculates comprehensive data quality statistics with focus on user_session.
        
        Computes fundamental data quality metrics including:
            - Total row count of the dataset
            - Number of rows with missing user_session values
            - Percentage of rows with missing user_session values
            - Identification of the category with highest missing brand values
        
        This method populates class attributes with analysis results and provides
        immediate feedback via console output. It serves as the foundation for
        all subsequent specialized analyses.
        
        Raises:
            Exception: If statistical calculations fail due to data integrity issues
            MemoryError: If the dataset size exceeds available computational resources
        
        Note:
            Uses Dask's lazy evaluation with explicit compute() for memory efficiency.
            Now includes specific analysis of user_session completeness in addition to brand analysis.
        """
        try:
            self.totalRows = len(self.df)
            self.totalUserSessionNulls = self.df['user_session'].isnull().sum().compute()
            self.nullPercentageAny = (self.totalUserSessionNulls / self.totalRows) * 100
            
            # Additional brand analysis for completeness
            catPerNullBrand = (self.df[self.df['brand'].isna()]
                              .groupby('category_code')
                              .size()
                              .reset_index()
                              .rename(columns={0: 'nullBrandCount'})
                              .sort_values(by='nullBrandCount', ascending=False)
                              .compute())
            self.highestCatWithNulls = catPerNullBrand.iat[0, 0]
            
            print(f"Total rows: {self.totalRows:,}")
            print(f"Rows with missing user_session: {self.totalUserSessionNulls:,}")
            print(f"Missing user_session percentage: {self.nullPercentageAny:.2f}%")
            print(f"Top category with most missing brands: {self.highestCatWithNulls}")
            
        except Exception as e:
            print("error calculating statistics: {}".format(e))
            
    def _extractHighNullBrandsPerHierCategory(self) -> None:
        """Analyzes missing brand information across hierarchical category levels.
        
        Performs granular analysis of missing brand values by decomposing the category_code
        into three hierarchical levels:
            - main_category: First level (e.g., 'electronics')
            - sub_category: Second level (e.g., 'electronics.smartphone')
            - product_type: Third level (e.g., 'electronics.smartphone.accessories')
        
        For each level, identifies the top 5 categories with the highest number of
        missing brand values. Uses memory-efficient processing by persisting filtered
        data and cleaning up intermediate variables.
        
        Output:
            Console display of top 5 categories per hierarchy level with missing brand counts
        
        Note:
            Uses Dask's persist() to optimize memory usage for large filtered datasets.
            Implements careful memory management by deleting intermediate variables.
        """
        nullBrands = self.df[self.df['brand'].isna()].persist()
        print(f"Processing {len(nullBrands):,} rows with null brands")
        
        categoryLevels = {
            'main_category': lambda x: x.str.split('.').str[0],
            'sub_category': lambda x: x.str.split('.').str[0] + '.' + x.str.split('.').str[1],
            'product_type': lambda x: x.str.split('.').str[2]
        }
        
        for catName, extractFunc in categoryLevels.items():
            try:
                print(f"\n--- Processing {catName} ---")
                
                categoryValues = extractFunc(nullBrands['category_code'])
                counts = categoryValues.value_counts().compute()
                
                topFive = counts.head(5).sort_values(ascending=False)
                
                print(f"Top 5 {catName} with missing brands:")
                for category, count in topFive.items():
                    print(f"  {category}: {count:,}")
                
                del categoryValues, counts, topFive
                
            except Exception as e:
                print(f"Error with {catName}: {e}")
                
    def _corrMissingPricePerEventType(self) -> None:
        """Analyzes correlation between missing price data and event types.
        
        Investigates whether missing price values are correlated with specific event types
        in the e-commerce data. Provides insights into potential data collection issues
        or business process gaps that might affect price data completeness.
        
        Steps:
            1. Checks overall presence of missing price values
            2. If missing prices exist, analyzes their distribution across event_types
            3. Provides clear reporting on findings
        
        Output:
            Console display of missing price statistics and event_type distribution
            Explicit message if no missing price data is found (indicating data quality)
        
        Note:
            This analysis helps identify potential systematic issues in data collection
            pipelines where certain event types might be missing critical pricing information.
        """
        totalNullPrices = self.df['price'].isna().sum().compute()
        print(f"Total null prices in dataset: {totalNullPrices}")
        
        if totalNullPrices == 0:
            print("No missing price values found in the entire dataset!")
            return
        
        nullPricePerEvent = (self.df[self.df['price'].isna()]
                            .groupby('event_type')
                            .size()
                            .reset_index()
                            .rename(columns={0: 'nullPriceCount'})
                            .sort_values(by='nullPriceCount', ascending=False)
                            .compute())
        print(nullPricePerEvent)

In [4]:
rawDataAnalysis = DataRawAnalysis("/home/tovix/projects/E-commerceCustomerBehaviorAnalysis/data/raw/2019_nov_prqt/*.parquet")

Total rows: 67,501,979
Rows with any nulls: 10
Any null percentage: 0.00%
Top category with the most null furniture.living_room.cabinet
Processing 9,224,078 rows with null brands

--- Processing main_category ---
Top 5 main_category with missing brands:
  furniture: 1,108,386
  computers: 135,503
  sport: 61,955
  accessories: 51,705
  medicine: 621

--- Processing sub_category ---
Top 5 sub_category with missing brands:
  apparel.shoes: 364,225
  kids.toys: 48,474
  computers.components: 42,961
  accessories.wallet: 2,054
  electronics.tablet: 1,326

--- Processing product_type ---
Top 5 product_type with missing brands:
  drill: 80,599
  compressor: 42,325
  microphone: 7,201
  hood: 1,007
  cpu: 51
Total null prices in dataset: 0
No missing price values found in the entire dataset!


---

**Task 1.1: Data Quality Assessment - November 2019 Dataset**


   - **What percentage of records have missing user_session values?**  
     **0.00%** - Only 10 records out of 67,501,979 have missing user_session values, representing exceptional data quality for this field.

   - **Which product categories have the highest proportion of missing brand information?**  
     **Main Category Level**: Furniture (1,108,386 missing brands), Computers (135,503), Sport (61,955)  
     **Sub-Category Level**: Apparel.Shoes (364,225 missing brands), Kids.Toys (48,474), Computers.Components (42,961)  
     **Product Type Level**: Drill (80,599 missing brands), Compressor (42,325), Microphone (7,201)  
     **Overall Highest**: Furniture.Living_Room.Cabinet has the most missing brand values

   - **Is there a correlation between missing price data and specific event types?**  
     **No correlation possible** - There are **zero missing price values** in the entire dataset of 67.5 million records. Price data demonstrates 100% completeness across all event types.

**Summary**: The November 2019 dataset shows excellent quality for user_session (99.999985% complete) and price (100% complete) fields, but has significant brand data gaps particularly in furniture and apparel categories.

---

**Data Quality Assessment Findings**

Based on the comprehensive analysis of our e-commerce dataset, I can provide the following assurances:

1. **Price Data Integrity**: The price information is 100% free of null values. This ensures that any analysis based on pricing data and customer purchasing behavior will be statistically reliable and accurate.

2. **User Session Quality**: The user_session field contains only 10 missing records out of millions, representing effectively 0% null rate. This indicates exceptionally high data quality for user tracking and session-based analytics.

3. **Brand Data Analysis**: The furniture category demonstrates the highest number of missing values when examining the category hierarchy. To address this business challenge, we will implement a Category-Aware Brand Handling approach.

This methodology recognizes that different product categories require distinct handling strategies. For electronics, brand loyalty significantly influences purchasing decisions as customers often seek product compatibility within brand ecosystems like Apple or Samsung. Conversely, for furniture categories, brand is less critical as customers prioritize aesthetic coordination and decorative harmony over brand consistency.

The Category-Aware approach ensures we handle missing brand data in a way that aligns with actual customer behavior patterns in each product category.

---

2. **Data Consistency Evaluation**
   - Are there users with sessions spanning impossible time durations (>24 hours)?
   - What proportion of products appear in multiple category hierarchies?
   - How many unique timestamp formats exist in the event_time field?