# Amazon Sales Data EDA - For Report Writing

## Introduction

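> Over the previous three weeks, a small dataset of about **1,400** rows was analyzed to learn basic statistical techniques and clustering.  
> The goal of this EDA is to analyze **large-scale** 2023 marketplace sales data (~1.5 million rows) for India, the UK, the US, and Canada.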

|#| Table of Contents | Finished |
|:--|:--:|:--:|
|1| Install Necessary Packages & Load Data | &check; |
|2| Acknowledge Characteristics of Data | &cross;|


&copy; 2024 Yoori Choi <it.glasschoi@gmail.com>
* * *

## 1. Install Necessary Packages & Load Data

In [3]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [4]:
# Import Necessary Packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy as sp
import warnings

# from tqdm import tqdm_notebook
warnings.simplefilter(action='ignore')
%matplotlib inline

In [5]:
import sys
import os

# Get the full path to the project root
project_root = os.path.abspath(os.path.join(os.getcwd(), '../..'))

# Add the project root to sys.path
sys.path.insert(0, project_root)

# Now try to import
from functions.amazon_analysis import AmazonAnalyzer, AmazonDataframe

In [6]:
project_root

'/Users/yoorichoi/Documents/ds_study/Amazon_sales_machine_learning'

In [17]:
analyzer = AmazonAnalyzer.from_config('config.json', project_root)

Attempting to access: /Users/yoorichoi/Documents/ds_study/Amazon_sales_machine_learning/data/amazon_india.csv
File path verified: /Users/yoorichoi/Documents/ds_study/Amazon_sales_machine_learning/data/amazon_india.csv
Loaded India from CSV and cached
Successfully added dataframe: India
Attempting to access: /Users/yoorichoi/Documents/ds_study/Amazon_sales_machine_learning/data/amazon_usa.csv
File path verified: /Users/yoorichoi/Documents/ds_study/Amazon_sales_machine_learning/data/amazon_usa.csv
Loaded US from CSV and cached
Successfully added dataframe: US
Attempting to access: /Users/yoorichoi/Documents/ds_study/Amazon_sales_machine_learning/data/amazon_uk.csv
File path verified: /Users/yoorichoi/Documents/ds_study/Amazon_sales_machine_learning/data/amazon_uk.csv
Loaded UK from CSV and cached
Successfully added dataframe: UK
Attempting to access: /Users/yoorichoi/Documents/ds_study/Amazon_sales_machine_learning/data/amazon_canada.csv
File path verified: /Users/yoorichoi/Documents/ds_

In [18]:
if analyzer:
    print("Loaded dataframes:")
    for name, df_obj in analyzer.dataframes.items():
        try:
            print(f"- {name}: {df_obj.df.shape}")
        except Exception as e:
            print(f"- {name}: Error loading - {str(e)}")
    
    result = analyzer.compare_columns_presence()
    if result is not None:
        print(result)
    else:
        print("Failed to compare columns presence")
else:
    print("Failed to initialize analyzer from config.")

Loaded dataframes:
Successfully loaded CSV from /Users/yoorichoi/Documents/ds_study/Amazon_sales_machine_learning/data/amazon_india.csv
- India: (1497145, 14)
Successfully loaded CSV from /Users/yoorichoi/Documents/ds_study/Amazon_sales_machine_learning/data/amazon_usa.csv
- US: (1393614, 14)
Successfully loaded CSV from /Users/yoorichoi/Documents/ds_study/Amazon_sales_machine_learning/data/amazon_uk.csv
- UK: (2222724, 11)
Successfully loaded CSV from /Users/yoorichoi/Documents/ds_study/Amazon_sales_machine_learning/data/amazon_canada.csv
- Canada: (1988016, 14)
Successfully loaded CSV from /Users/yoorichoi/Documents/ds_study/Amazon_sales_machine_learning/data/amazon.csv
- India_2022: (1465, 16)
                      India     US     UK  Canada  India_2022
rating_count          False  False  False   False        True
product_id             True   True   True    True        True
review_id             False  False  False   False        True
img_link               True   True   True    T

In [19]:
india_df = analyzer.get_dataframe("India")
us_df = analyzer.get_dataframe("US")
uk_df = analyzer.get_dataframe("UK")
canada_df = analyzer.get_dataframe("Canada")
india_2022_df = analyzer.get_dataframe("India_2022")

In [16]:
# analyzer.clear_cache()

Cache cleared


## 2. Acknowledge Characteristics of Data

In [23]:
# Compare columns across all dataframes
presence_df = analyzer.compare_columns_presence()
presence_df

| | India | US | UK | Canada | India_2022 |
|:--|:--:|:--:|:--:|:--:|:--:|
| rating_count | False | False | False | False | True |
| product_id | True | True | True | True | True |
| review_id | False | False | False | False | True |
| img_link | True | True | True | True | True |
| product_name | True | True | True | True | True |
| product_link | True | True | True | True | True |
| discount_percentage | True | True | False | True | True |
| about_product | False | False | False | False | True |
| user_id | False | False | False | False | True |
| boughtInLastMonth | True | True | True | True | False |
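
The internals of `AmazonAnalyzer.compare_columns_presence()` are not shown in this notebook. As a rough illustration only, a table of this shape could be built with plain pandas from a dict of dataframes. The helper name `column_presence` and the toy frames below are hypothetical and stand in for the real datasets.

```python
import pandas as pd


def column_presence(frames):
    """Return a boolean table: one row per column name, one column per dataframe.

    Hypothetical helper for illustration; not the AmazonAnalyzer implementation.
    """
    # Union of all column names, in first-seen order
    all_columns = []
    for df in frames.values():
        for col in df.columns:
            if col not in all_columns:
                all_columns.append(col)

    # For each dataframe, mark which of those columns it contains
    return pd.DataFrame(
        {name: [col in df.columns for col in all_columns] for name, df in frames.items()},
        index=all_columns,
    )


# Toy frames standing in for the real datasets (assumed data)
frames = {
    "A": pd.DataFrame({"product_id": [1], "rating": [4.2]}),
    "B": pd.DataFrame({"product_id": [2], "user_id": ["u1"]}),
}
presence = column_presence(frames)
print(presence)
```

Summing such a table along `axis=1` gives, for each column name, the number of dataframes that contain it, which is the quantity a presence threshold filters on.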


In [27]:
def get_common_columns_by_threshold(analyzer, min_true_count=4):
    """
    Get the presence rows for columns found in at least min_true_count dataframes

    Parameters:
    analyzer (AmazonAnalyzer): The analyzer instance
    min_true_count (int): Minimum number of True values required (default=4)

    Returns:
    pandas.DataFrame: Rows of the presence table that meet the threshold
        (empty if the presence table is unavailable). Call .index.tolist()
        on the result to get only the column names.
    """
    presence_df = analyzer.compare_columns_presence()
    if presence_df is None:
        # Return the same type as the normal path so callers can treat both alike
        return pd.DataFrame()

    # Count True values across each row (one row per column name)
    true_counts = presence_df.sum(axis=1)

    # Keep only the rows whose count meets the threshold
    return presence_df[true_counts >= min_true_count]

In [28]:
get_common_columns_by_threshold(analyzer, 5)

| | India | US | UK | Canada | India_2022 |
|:--|:--:|:--:|:--:|:--:|:--:|
| product_id | True | True | True | True | True |
| img_link | True | True | True | True | True |
| product_name | True | True | True | True | True |
| product_link | True | True | True | True | True |
| discounted_price | True | True | True | True | True |
| category | True | True | True | True | True |
| rating | True | True | True | True | True |


In [29]:
get_common_columns_by_threshold(analyzer, 4)

| | India | US | UK | Canada | India_2022 |
|:--|:--:|:--:|:--:|:--:|:--:|
| product_id | True | True | True | True | True |
| img_link | True | True | True | True | True |
| product_name | True | True | True | True | True |
| product_link | True | True | True | True | True |
| discount_percentage | True | True | False | True | True |
| boughtInLastMonth | True | True | True | True | False |
| actual_price | True | True | False | True | True |
| discounted_price | True | True | True | True | True |
| discounted_price_KRW | True | True | True | True | False |
| category | True | True | True | True | True |
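
One possible use of a shared-column result is to restrict every dataframe to the columns they all contain and stack them with a market label, so the markets can be compared in a single frame. The following is a minimal sketch, assuming the datasets are available as plain pandas DataFrames. `stack_shared_columns`, the `market` column name, and the toy data are hypothetical and not part of `AmazonAnalyzer`.

```python
import pandas as pd


def stack_shared_columns(frames):
    """Concatenate dataframes on the columns common to all of them.

    Hypothetical helper for illustration only.
    """
    # Columns present in every dataframe, kept in the first frame's order
    first = next(iter(frames.values()))
    shared = [c for c in first.columns if all(c in df.columns for df in frames.values())]

    # Keep only the shared columns and tag each row with its source market
    parts = [df[shared].assign(market=name) for name, df in frames.items()]
    return pd.concat(parts, ignore_index=True)


# Toy frames standing in for the real datasets (assumed data)
toy_frames = {
    "India": pd.DataFrame({"product_id": ["a", "b"], "rating": [4.1, 3.9], "actual_price": [10.0, 20.0]}),
    "UK": pd.DataFrame({"product_id": ["c"], "rating": [4.5]}),
}
combined = stack_shared_columns(toy_frames)
print(combined)
```

With frames of 1.4 to 2.2 million rows each, the stacked result holds every row in memory at once, so selecting only the needed columns before concatenating keeps the footprint down.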


In [21]:
# Get top 10 categories by discounted price for India
top_categories_india = analyzer.top_items("India", "category", n=10, sort_by="discounted_price")
top_categories_india

category
Laptop                          5.167377e+08
Presius jewelery                2.224557e+08
Accessory                       1.543568e+08
Home and kitchen                1.155153e+08
Sports, fitness and outdoor     1.064918e+08
Desktop                         9.532797e+07
Watches                         9.449648e+07
Monitor                         9.277302e+07
Loose jamstone and diamond      9.083936e+07
Suitcase, check in and straw    8.269167e+07
Name: discounted_price, dtype: float64
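
The values above are on the order of 10^8, which suggests that `top_items` aggregates `discounted_price` within each category, although the aggregation it uses is not shown in this notebook. Assuming a per-category sum, a comparable plain-pandas computation might look like the sketch below. `top_by_sum` and the toy data are hypothetical.

```python
import pandas as pd


def top_by_sum(df, group_col, value_col, n=10):
    """Sum value_col within each group_col value and return the n largest totals."""
    return df.groupby(group_col)[value_col].sum().nlargest(n)


# Toy data standing in for the India dataset (assumed values)
toy = pd.DataFrame({
    "category": ["Laptop", "Laptop", "Monitor", "Watches"],
    "discounted_price": [50000.0, 40000.0, 12000.0, 3000.0],
})
top = top_by_sum(toy, "category", "discounted_price", n=2)
print(top)
```

A sum rewards categories with many listings as well as expensive ones, so a per-category mean or median may be worth checking alongside it. Setting `pd.set_option("display.float_format", "{:,.0f}".format)` before displaying the result prints the totals without scientific notation.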