# E-commerce Customer Behavior Prediction and Recommendation Engine
## Notebook 1: Data Acquisition and Integration

This notebook covers the first step of the project: building the data acquisition and integration pipeline that all subsequent analysis and modeling depend on.


## Environment Setup

Before diving into data acquisition, let's set up our environment with the necessary libraries and configurations:


In [8]:
# Import necessary libraries
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import warnings
import logging
import json
import hashlib
from pathlib import Path
import time
from typing import Dict, List, Tuple, Optional, Union, Any
import glob
import re
from dotenv import load_dotenv
import zipfile
import shutil
from tqdm.notebook import tqdm

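# Determine the project root.
# When run as a script, __file__ is defined and the root is one level above the script's directory.
# In a Jupyter notebook __file__ is undefined, so we assume the current working directory
# is 'notebooks' and take its parent as the project root.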
PROJECT_ROOT = Path(__file__).resolve().parents[1] if '__file__' in globals() else Path.cwd().parent

# Load environment variables from the .env file located in the project root
load_dotenv(PROJECT_ROOT / ".env")

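# Suppress all warnings to keep notebook output readable
# (note: this also hides deprecation notices from the libraries above)
warnings.filterwarnings('ignore')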

# Ensure the logs directory exists (located at PROJECT_ROOT/logs)
log_dir = PROJECT_ROOT / "logs"
log_dir.mkdir(parents=True, exist_ok=True)

# Set up logging to track our data acquisition process.
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler(log_dir / "data_acquisition.log"),
        logging.StreamHandler()
    ]
)
logger = logging.getLogger("DataAcquisition")

# Set display options for better readability of dataframes
pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)
pd.set_option('display.max_colwidth', 100)

# Set random seed for reproducibility across runs
np.random.seed(42)
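
The `PROJECT_ROOT` fallback above only works when the notebook is launched from the `notebooks` directory. As a hedged alternative, the sketch below walks upward from a starting directory until it finds a marker file. The helper name `find_project_root` and the default markers (`.env`, `.git`) are assumptions made for illustration; they are not part of the original notebook.

```python
from pathlib import Path
from typing import Iterable, Optional


def find_project_root(start: Path, markers: Iterable[str] = (".env", ".git")) -> Optional[Path]:
    """Walk upward from `start` and return the first directory that contains a marker.

    Both the function name and the default marker names are hypothetical;
    adjust them to whatever file reliably sits in your project root.
    """
    start = start.resolve()
    marker_names = tuple(markers)
    # Check the starting directory itself, then each ancestor up to the filesystem root.
    for candidate in (start, *start.parents):
        if any((candidate / name).exists() for name in marker_names):
            return candidate
    return None  # No marker found; the caller decides on a fallback
```

One possible use is `PROJECT_ROOT = find_project_root(Path.cwd()) or Path.cwd().parent`, which keeps the original behavior as the fallback when no marker is found.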

## Project Directory Structure

Let's create a project structure that keeps raw data, processed data, models, and outputs in separate directories. Separating the original, immutable raw data from everything derived from it makes each processing step easier to rerun and the project easier to maintain.


In [9]:
# Create project directory structure
def create_project_structure(base_dir: Path) -> Dict[str, Path]:
    """
    Creates a standardized project directory structure for organizing data and outputs.
    
    Args:
        base_dir: The base directory for the project (as a Path object)
        
    Returns:
        A dictionary mapping directory names to their paths (as Path objects)
    """
    directories = {
        "raw_data": base_dir / "data" / "raw",           # Original, immutable data
        "processed_data": base_dir / "data" / "processed", # Cleaned and transformed data
        "interim_data": base_dir / "data" / "interim",     # Temporary data between processing steps
        "models": base_dir / "models",                     # Trained models and model artifacts
        "outputs": base_dir / "outputs",                   # Analysis outputs and visualizations
        "logs": base_dir / "logs",                         # Logs from various processes
        "reports": base_dir / "reports",                   # Generated analysis reports
        "notebooks": base_dir / "notebooks",               # Jupyter notebooks
        "configs": base_dir / "configs"                    # Configuration files
    }
    
    # Create directories if they don't exist
    for name, dir_path in directories.items():
        dir_path.mkdir(parents=True, exist_ok=True)
        logger.info(f"Created directory: {dir_path}")
    
    return directories

# Use the previously defined PROJECT_ROOT from the earlier code block
project_dirs = create_project_structure(PROJECT_ROOT)

# Display the project structure
print("Project Directory Structure Created:")
for name, path in project_dirs.items():
    print(f"- {name}: {path}")

2025-03-26 12:15:32,182 - DataAcquisition - INFO - Created directory: C:\_Arash\github\ecom-reco-predictor\full_recom_prediction_engine\ecommerce_recommendation_project\data\raw
2025-03-26 12:15:32,184 - DataAcquisition - INFO - Created directory: C:\_Arash\github\ecom-reco-predictor\full_recom_prediction_engine\ecommerce_recommendation_project\data\processed
2025-03-26 12:15:32,187 - DataAcquisition - INFO - Created directory: C:\_Arash\github\ecom-reco-predictor\full_recom_prediction_engine\ecommerce_recommendation_project\data\interim
2025-03-26 12:15:32,188 - DataAcquisition - INFO - Created directory: C:\_Arash\github\ecom-reco-predictor\full_recom_prediction_engine\ecommerce_recommendation_project\models
2025-03-26 12:15:32,192 - DataAcquisition - INFO - Created directory: C:\_Arash\github\ecom-reco-predictor\full_recom_prediction_engine\ecommerce_recommendation_project\outputs
2025-03-26 12:15:32,194 - DataAcquisition - INFO - Created directory: C:\_Arash\github\ecom-reco-predic

Project Directory Structure Created:
- raw_data: C:\_Arash\github\ecom-reco-predictor\full_recom_prediction_engine\ecommerce_recommendation_project\data\raw
- processed_data: C:\_Arash\github\ecom-reco-predictor\full_recom_prediction_engine\ecommerce_recommendation_project\data\processed
- interim_data: C:\_Arash\github\ecom-reco-predictor\full_recom_prediction_engine\ecommerce_recommendation_project\data\interim
- models: C:\_Arash\github\ecom-reco-predictor\full_recom_prediction_engine\ecommerce_recommendation_project\models
- outputs: C:\_Arash\github\ecom-reco-predictor\full_recom_prediction_engine\ecommerce_recommendation_project\outputs
- logs: C:\_Arash\github\ecom-reco-predictor\full_recom_prediction_engine\ecommerce_recommendation_project\logs
- reports: C:\_Arash\github\ecom-reco-predictor\full_recom_prediction_engine\ecommerce_recommendation_project\reports
- notebooks: C:\_Arash\github\ecom-reco-predictor\full_recom_prediction_engine\ecommerce_recommendation_project\noteboo
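
Because `mkdir(..., exist_ok=True)` succeeds silently whether or not a directory already existed, later notebooks may want a quick check that the expected layout is still in place. The following is a minimal sketch of such a check, assuming a dictionary shaped like the one returned by `create_project_structure`. The helper name `find_missing_directories` is hypothetical, and the demo uses a temporary directory rather than the real project paths.

```python
import tempfile
from pathlib import Path
from typing import Dict, List


def find_missing_directories(directories: Dict[str, Path]) -> List[str]:
    """Return the names whose paths do not point to an existing directory."""
    return [name for name, path in directories.items() if not path.is_dir()]


# Demo on a throwaway layout: only data/raw is created, so 'models' is reported.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "data" / "raw").mkdir(parents=True)
    example_dirs = {"raw_data": base / "data" / "raw", "models": base / "models"}
    print(find_missing_directories(example_dirs))  # prints ['models']
```

In the notebook itself, calling `find_missing_directories(project_dirs)` right after `create_project_structure(PROJECT_ROOT)` should return an empty list.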