# Processing Keepa Data

## Sales Rank Data Processing Documentation

### Overview
This documentation describes how to extract and analyze sales rank data from Keepa product data stored in pickle files. The program converts Keepa's proprietary time format to US Eastern (EST/EDT) dates and produces visualizations and CSV exports of sales rank trends.

### Data Structure
#### Input Data (Pickle Files)
The raw data is stored in pickle files with the following relevant fields:
```
{
    'asin': str,                # Amazon product identifier
    'title': str,               # Product title
    'salesRankReference': int,  # Reference category ID
    'salesRanks': {            # Sales rank history by category
        'category_id': [keepa_time, rank, keepa_time, rank, ...]
    }
}
```

#### Key Data Fields
**1. salesRankReference:**
- Integer value representing the main category ID
- Special values: -1 (not available), -2 (listed in launchpad)

**2. salesRanks:**
- Dictionary with category IDs as keys
- Values are lists alternating between Keepa time and sales rank
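
To make the alternating layout concrete, here is a minimal sketch that splits one category's flat list into timestamps and ranks. The product dictionary, its ASIN, and all numbers are made up for illustration. The string category key is an assumption that mirrors how the processing code in this document looks up the category.

```python
# Synthetic product record shaped like the structure described above.
# All values are invented; the string key in 'salesRanks' is an assumption.
product = {
    'asin': 'B000000000',
    'title': 'Example product',
    'salesRankReference': 12345,
    'salesRanks': {'12345': [7000000, 1500, 7000060, 1480, 7001440, 1510]},
}

# Look up the history for the reference category
series = product['salesRanks'][str(product['salesRankReference'])]

# Even positions hold Keepa times, odd positions hold the matching ranks
keepa_times = series[0::2]
ranks = series[1::2]
pairs = list(zip(keepa_times, ranks))

print(pairs)  # [(7000000, 1500), (7000060, 1480), (7001440, 1510)]
```

Slicing with a step of 2 keeps each timestamp aligned with the rank that follows it, and `zip` silently drops a trailing timestamp that has no rank.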


### Data Processing Steps
1. Time Conversion
2. Sales Rank Data Extraction
3. Output Files



## Initialization

In [1]:
import pickle
from pathlib import Path
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime, timedelta
import pytz

### Step 1. Time Conversion

In [2]:
def keepa_to_est(keepa_time):
    """
    Convert Keepa time (minutes) to a 'YYYY-MM-DD' date string in US Eastern time.
    Formula: unix_time = (keepa_time + 21564000) * 60
    The result follows EST or EDT, depending on the date.
    """
    unix_time = (keepa_time + 21564000) * 60  # Convert to seconds
    utc_time = datetime.fromtimestamp(unix_time, tz=pytz.UTC)
    est_time = utc_time.astimezone(pytz.timezone('US/Eastern'))
    return est_time.strftime('%Y-%m-%d')
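
As a worked check of the formula, the sketch below converts Keepa minute 0 with the standard library only. The helper name `keepa_to_utc` and the fixed UTC-5 offset are illustrative assumptions. The fixed offset stands in for EST and ignores daylight saving, unlike the `pytz` zone used in the function above.

```python
from datetime import datetime, timedelta, timezone

KEEPA_OFFSET_MINUTES = 21564000  # offset taken from the conversion formula

def keepa_to_utc(keepa_time):
    """Return the timezone-aware UTC datetime for a Keepa minute count."""
    unix_seconds = (keepa_time + KEEPA_OFFSET_MINUTES) * 60
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc)

# Keepa minute 0 is the start of the Keepa time scale
epoch = keepa_to_utc(0)
print(epoch.isoformat())  # 2011-01-01T00:00:00+00:00

# Fixed UTC-5 offset for illustration only (no daylight saving handling)
est_fixed = timezone(timedelta(hours=-5))
print(epoch.astimezone(est_fixed).strftime('%Y-%m-%d'))  # 2010-12-31
```

The second print shows why the timezone matters: a reading taken shortly after midnight UTC falls on the previous calendar day in Eastern time.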

### Step 2. Sales Rank Data Extraction

In [3]:

def process_sales_ranks(pickle_path):
    """
    Process steps:
    1. Load pickle file
    2. Get reference category
    3. Extract sales rank data
    4. Convert time format
    5. Create date-rank dictionary
    """
    # Load pickle file
    with open(pickle_path, 'rb') as f:
        data = pickle.load(f)
    
    # Get reference category (-1 and -2 are special values with no usable category)
    ref_value = data.get('salesRankReference')
    if ref_value is None or ref_value in (-1, -2):
        print(f"No valid sales rank reference for {pickle_path}")
        return None
    ref_category = str(ref_value)
    
    # Get sales rank data
    sales_ranks = data.get('salesRanks', {})
    if not sales_ranks or ref_category not in sales_ranks:
        print(f"No sales rank data for category {ref_category}")
        return None
    
    # Process time series data
    rank_data = sales_ranks[ref_category]
    sales_dict = {}
    
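    # Convert list to dictionary with EST dates.
    # Stop at len - 1 so a trailing timestamp without a rank is ignored.
    # Several readings on the same date overwrite each other; the last one is kept.
    for i in range(0, len(rank_data) - 1, 2):
        keepa_time = rank_data[i]
        rank = rank_data[i + 1]
        est_date = keepa_to_est(keepa_time)
        sales_dict[est_date] = rank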
    
    return {
        'asin': data.get('asin'),
        'title': data.get('title'),
        'category': ref_category,
        'sales_data': sales_dict
    }
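
Because the dictionary is keyed by calendar date, several readings from the same day collapse into one entry. The sketch below shows that effect on synthetic data. It converts to UTC dates with a hypothetical helper, `keepa_to_utc_date`, so that it runs without `pytz`. The function above uses US Eastern dates instead.

```python
from datetime import datetime, timezone

def keepa_to_utc_date(keepa_time):
    """Hypothetical helper: Keepa minutes -> 'YYYY-MM-DD' in UTC."""
    unix_seconds = (keepa_time + 21564000) * 60
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc).strftime('%Y-%m-%d')

# Synthetic series: two readings on the first day, one on the next day
rank_data = [0, 500, 60, 480, 1440, 450]

sales_dict = {}
for keepa_time, rank in zip(rank_data[0::2], rank_data[1::2]):
    # A later reading on the same date replaces the earlier one
    sales_dict[keepa_to_utc_date(keepa_time)] = rank

print(sales_dict)  # {'2011-01-01': 480, '2011-01-02': 450}
```

If a daily minimum or mean is preferred over the last reading of the day, the readings would need to be grouped by date before being reduced.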





### Step 3. Output Files
1. CSV File ({ASIN}_sales_rank.csv)
Format: Date-indexed time series
```
Date, Sales Rank
2024-01-01, 1000
2024-01-02, 999
...
```
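
A small sketch of reading a file in the layout shown above back into typed values, using only the standard library. The sample text is copied from the format example. The file that pandas actually writes may differ in spacing or in the header of the index column, so treat this as an illustration of the documented layout only.

```python
import csv
import io
from datetime import date

# Sample text in the documented layout (not real product data)
sample = "Date, Sales Rank\n2024-01-01, 1000\n2024-01-02, 999\n"

# skipinitialspace drops the blank that follows each comma
reader = csv.reader(io.StringIO(sample), skipinitialspace=True)
header = next(reader)
rows = [(date.fromisoformat(day), int(rank)) for day, rank in reader]

print(header)   # ['Date', 'Sales Rank']
print(rows[0])  # (datetime.date(2024, 1, 1), 1000)
```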

2. Visualization ({ASIN}_sales_rank.png)
Line plot showing sales rank trends over time
Features:
- Title with ASIN and product name
- Date on x-axis
- Sales rank on y-axis
- Grid for better readability

### Data Format Details
#### Input Time Format
- Keepa time: minutes since January 1st, 2011, 00:00 UTC (21564000 minutes after the Unix epoch)
- Conversion formula: unix_time = (keepa_time + 21564000) * 60
#### Output Time Format
- Date format: 'YYYY-MM-DD'
- Timezone: US Eastern Time (EST/EDT)
#### Sales Rank Values
- Lower numbers indicate better sales performance
- Raw integer values preserved from original data
- Products without a valid reference category, or without rank history for that category, are skipped
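
One way to exclude invalid readings at the level of individual data points can be sketched as follows. It assumes that a non-positive rank (for example -1) marks a missing value. This document does not state that convention, so check it against the actual data first. The function name `drop_invalid_ranks` is hypothetical.

```python
def drop_invalid_ranks(pairs):
    """Keep only (keepa_time, rank) pairs whose rank is a positive integer.

    Assumption: non-positive ranks are placeholders for missing data.
    """
    return [(t, r) for t, r in pairs if isinstance(r, int) and r > 0]

# Synthetic readings; the middle one uses the assumed "missing" marker
readings = [(7000000, 1500), (7000060, -1), (7001440, 1510)]
cleaned = drop_invalid_ranks(readings)

print(cleaned)  # [(7000000, 1500), (7001440, 1510)]
```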




In [4]:
def plot_sales_rank(sales_data, output_dir):
    """Plot sales rank data and save to CSV"""
    asin = sales_data['asin']
    title = sales_data['title']
    
    # Create DataFrame
    df = pd.DataFrame.from_dict(sales_data['sales_data'], 
                               orient='index', 
                               columns=['Sales Rank'])
    df.index = pd.to_datetime(df.index)
    df = df.sort_index()
    
    # Plot
    plt.figure(figsize=(15, 7))
    plt.plot(df.index, df['Sales Rank'])
    plt.title(f'Sales Rank History for {asin}\n{title}')
    plt.xlabel('Date')
    plt.ylabel('Sales Rank')
    plt.grid(True)
    plt.xticks(rotation=45)
    
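    # Save plot (create the output subfolder if it does not exist yet)
    figures_dir = output_dir / 'sales_rank_figures'
    figures_dir.mkdir(parents=True, exist_ok=True)
    plt.tight_layout()
    plt.savefig(figures_dir / f'{asin}_sales_rank.png')
    plt.close()
    
    # Save CSV (create the CSV subfolder in the same way)
    csv_dir = output_dir / 'sales_rank_csv'
    csv_dir.mkdir(parents=True, exist_ok=True)
    df.to_csv(csv_dir / f'{asin}_sales_rank.csv')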
    
    return df

In [10]:
def main():
    # Setup paths
    base_path = Path('/Users/takedownccp/Documents/Cursor/DDU/data')
    raw_data_path = base_path / 'raw_data'
    output_path = base_path / 'processed_data'
    output_path.mkdir(exist_ok=True)
    
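    # Process all pickle files
    # pickle_files = list(raw_data_path.glob('*_raw.pkl'))
    # Build full Path objects so that .name and open() work in the loop below
    pickle_names = ['B0CHTZ6NCL_raw.pkl', 'B07N52NLC3_raw.pkl', 'B09MZ9T3KT_raw.pkl', 'B09MZBXNHP_raw.pkl', 'B09QL5K6LW_raw.pkl', 'B09ZLSR8PH_raw.pkl']
    pickle_files = [raw_data_path / name for name in pickle_names]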
    
    for pickle_file in pickle_files:
        print(f"\nProcessing {pickle_file.name}")
        
        # Process sales rank data
        sales_data = process_sales_ranks(pickle_file)
        
        if sales_data:
            print(f"Found sales rank data for {sales_data['asin']}")
            print(f"Category: {sales_data['category']}")
            print(f"Data points: {len(sales_data['sales_data'])}")
            
            # Plot and save data
            df = plot_sales_rank(sales_data, output_path)
            print(f"Data saved to {output_path}")
            
            # Display first few rows
            print("\nFirst few data points:")
            print(df.head())
        else:
            print("No valid sales rank data found")

if __name__ == "__main__":
    main()


Processing B0BHL4C4MW_raw.pkl
Found sales rank data for B0BHL4C4MW
Category: 3760901
Data points: 81
Data saved to /Users/takedownccp/Documents/Cursor/DDU/data/processed_data

First few data points:
            Sales Rank
2024-10-18       11888
2024-10-23       10390
2024-10-24       12956
2024-10-25       12548
2024-10-26       12756

Processing B00JT3TP1O_raw.pkl
Found sales rank data for B00JT3TP1O
Category: 2619533011
Data points: 750
Data saved to /Users/takedownccp/Documents/Cursor/DDU/data/processed_data

First few data points:
            Sales Rank
2022-12-14         437
2022-12-15         727
2022-12-16         766
2022-12-17         876
2022-12-18        1042

Processing B08HP8SM3J_raw.pkl
Found sales rank data for B08HP8SM3J
Category: 2619533011
Data points: 759
Data saved to /Users/takedownccp/Documents/Cursor/DDU/data/processed_data

First few data points:
            Sales Rank
2022-12-14        1052
2022-12-15        1020
2022-12-16        1164
2022-12-17        1070
2



Data saved to /Users/takedownccp/Documents/Cursor/DDU/data/processed_data

First few data points:
            Sales Rank
2024-01-10         986
2024-01-11         825
2024-01-12         820
2024-01-13         760
2024-01-14         826

Processing B00BPA11YI_raw.pkl
Found sales rank data for B00BPA11YI
Category: 2619533011
Data points: 756
Data saved to /Users/takedownccp/Documents/Cursor/DDU/data/processed_data

First few data points:
            Sales Rank
2022-12-14         708
2022-12-15         795
2022-12-16         756
2022-12-17         785
2022-12-18         835

Processing B0C3HX86TB_raw.pkl
Found sales rank data for B0C3HX86TB
Category: 2619533011
Data points: 486
Data saved to /Users/takedownccp/Documents/Cursor/DDU/data/processed_data

First few data points:
            Sales Rank
2023-09-08         312
2023-09-09         329
2023-09-10         308
2023-09-11         362
2023-09-12         279

Processing B001HYB2P0_raw.pkl
Found sales rank data for B001HYB2P0
Category: 26