
In [3]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta



In [4]:
# Defining the number of rows for the dataset
num_rows = 500000  # At least 500,000 rows

# Creating a dictionary to hold the column data
data = {}

# Generating unique identifiers
data['order_id'] = np.arange(1, num_rows + 1)
data['product_id'] = ['P' + str(i).zfill(5) for i in np.random.randint(1, 10000, num_rows)]
data['customer_id'] = ['C' + str(i).zfill(4) for i in np.random.randint(1, 5000, num_rows)]

# Generating date/time columns
start_date = datetime(2022, 1, 1)
end_date = datetime(2024, 12, 31)

time_delta_seconds = int((end_date - start_date).total_seconds())
data['order_date'] = [start_date + timedelta(seconds=np.random.randint(time_delta_seconds)) for _ in range(num_rows)]

data['delivery_date'] = [d + timedelta(days=np.random.randint(1, 7)) for d in data['order_date']]

# Generating categorical columns
product_categories = ['Electronics', 'Home Appliances', 'Computers', 'Smartphones', 'Accessories', 'Wearables']
data['product_category'] = np.random.choice(product_categories, num_rows)

customer_segments = ['Premium', 'Regular', 'New']
data['customer_segment'] = np.random.choice(customer_segments, num_rows, p=[0.2, 0.7, 0.1])

# Generating numeric columns
data['product_price'] = np.round(np.random.uniform(50, 2000, num_rows), 2)
data['quantity'] = np.random.randint(1, 6, num_rows)
data['rating'] = np.random.randint(1, 6, num_rows) # 1-5 star rating
data['discount'] = np.round(np.random.uniform(0, 0.30, num_rows), 2)

# Calculating total price (another numeric column)
data['total_price'] = np.round(data['product_price'] * data['quantity'] * (1 - data['discount']), 2)

# Converting the dictionary into a pandas DataFrame
amazon_sales_df = pd.DataFrame(data)
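
The date columns above are built with per-row list comprehensions, which create one Python `timedelta` object per order. A vectorized variant is sketched below as an alternative, not as the method this notebook used. The seed of 42 and the reduced sample size of 1,000 rows are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Assumption: a fixed seed (42) so the synthetic data is reproducible.
rng = np.random.default_rng(42)
n = 1_000  # small sample for illustration; the notebook uses 500,000 rows

start = pd.Timestamp("2022-01-01")
end = pd.Timestamp("2024-12-31")
span_seconds = int((end - start).total_seconds())

# Draw every offset at once instead of building timedelta objects row by row.
order_offsets = rng.integers(0, span_seconds, size=n)
order_date = start + pd.to_timedelta(order_offsets, unit="s")

# Delivery takes 1 to 6 whole days, matching np.random.randint(1, 7) above.
delivery_days = rng.integers(1, 7, size=n)
delivery_date = order_date + pd.to_timedelta(delivery_days, unit="D")

sample = pd.DataFrame({"order_date": order_date, "delivery_date": delivery_date})
print(sample.head())
```

Seeding the generator also makes the dataset reproducible between runs, which the unseeded `np.random` calls above do not guarantee.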



# Displaying the first few rows and information about the DataFrame

In [5]:
amazon_sales_df.head()

| | order_id | product_id | customer_id | order_date | delivery_date | product_category | customer_segment | product_price | quantity | rating | discount | total_price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | P01592 | C3565 | 2023-01-11 05:05:01 | 2023-01-13 05:05:01 | Computers | Regular | 299.98 | 4 | 5 | 0.04 | 1151.92 |
| 1 | 2 | P02166 | C3175 | 2024-09-06 21:26:22 | 2024-09-10 21:26:22 | Smartphones | Regular | 1922.54 | 5 | 3 | 0.3 | 6728.89 |
| 2 | 3 | P04065 | C0007 | 2024-04-07 06:04:43 | 2024-04-13 06:04:43 | Electronics | Regular | 744.01 | 2 | 4 | 0.24 | 1130.9 |
| 3 | 4 | P09042 | C4278 | 2024-10-01 21:00:02 | 2024-10-07 21:00:02 | Computers | New | 96.0 | 3 | 5 | 0.16 | 241.92 |
| 4 | 5 | P05397 | C0333 | 2023-10-26 19:44:11 | 2023-10-29 19:44:11 | Smartphones | Regular | 819.62 | 4 | 3 | 0.01 | 3245.7 |


In [6]:
amazon_sales_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500000 entries, 0 to 499999
Data columns (total 12 columns):
 #   Column            Non-Null Count   Dtype         
---  ------            --------------   -----         
 0   order_id          500000 non-null  int64         
 1   product_id        500000 non-null  object        
 2   customer_id       500000 non-null  object        
 3   order_date        500000 non-null  datetime64[ns]
 4   delivery_date     500000 non-null  datetime64[ns]
 5   product_category  500000 non-null  object        
 6   customer_segment  500000 non-null  object        
 7   product_price     500000 non-null  float64       
 8   quantity          500000 non-null  int64         
 9   rating            500000 non-null  int64         
 10  discount          500000 non-null  float64       
 11  total_price       500000 non-null  float64       
dtypes: datetime64[ns](2), float64(3), int64(3), object(4)
memory usage: 45.8+ MB


In [7]:
amazon_sales_df.describe(include='all')

| | order_id | product_id | customer_id | order_date | delivery_date | product_category | customer_segment | product_price | quantity | rating | discount | total_price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 500000.0 | 500000 | 500000 | 500000 | 500000 | 500000 | 500000 | 500000.0 | 500000.0 | 500000.0 | 500000.0 | 500000.0 |
| unique | | 9999 | 4999 | | | 6 | 3 | | | | | |
| top | | P00713 | C3474 | | | Electronics | Regular | | | | | |
| freq | | 78 | 134 | | | 83575 | 349976 | | | | | |
| mean | 250000.5 | | | 2023-07-02 18:41:06.927495936 | 2023-07-06 06:43:53.161096448 | | | 1025.244978 | 3.000294 | 2.999616 | 0.150033 | 2614.167392 |
| min | 1.0 | | | 2022-01-01 00:00:59 | 2022-01-02 00:42:44 | | | 50.0 | 1.0 | 1.0 | 0.0 | 35.29 |
| 25% | 125000.75 | | | 2022-10-02 11:59:04.750000128 | 2022-10-05 22:10:50.500000 | | | 537.55 | 2.0 | 2.0 | 0.08 | 990.21 |
| 50% | 250000.5 | | | 2023-07-02 16:14:07.500000 | 2023-07-06 04:48:45 | | | 1024.995 | 3.0 | 3.0 | 0.15 | 2049.21 |
| 75% | 375000.25 | | | 2024-04-01 08:23:42 | 2024-04-04 22:09:46.500000 | | | 1512.87 | 4.0 | 4.0 | 0.23 | 3826.2725 |
| max | 500000.0 | | | 2024-12-30 23:54:18 | 2025-01-05 23:41:21 | | | 2000.0 | 5.0 | 5.0 | 0.3 | 9991.3 |


## Saving Dataset as CSV




In [8]:
file_path_csv = 'amazon_electronic_sales.csv'
amazon_sales_df.to_csv(file_path_csv, index=False)
print(f"DataFrame successfully saved to {file_path_csv}")

DataFrame successfully saved to amazon_electronic_sales.csv


## Saving Dataset as Parquet



In [9]:
import pyarrow as pa
import pyarrow.parquet as pq

file_path_parquet = 'amazon_electronic_sales.parquet'
amazon_sales_df.to_parquet(file_path_parquet, index=False, engine='pyarrow')
print(f"DataFrame successfully saved to {file_path_parquet}")

DataFrame successfully saved to amazon_electronic_sales.parquet


## Comparing File Sizes



In [10]:
import os

# Getting file sizes
csv_file_size_bytes = os.path.getsize(file_path_csv)
parquet_file_size_bytes = os.path.getsize(file_path_parquet)

# Converting to megabytes for better readability
csv_file_size_mb = csv_file_size_bytes / (1024 * 1024)
parquet_file_size_mb = parquet_file_size_bytes / (1024 * 1024)

# The file sizes
print(f"CSV file size: {csv_file_size_mb:.2f} MBs")
print(f"Parquet file size: {parquet_file_size_mb:.2f} MBs")

CSV file size: 49.25 MBs
Parquet file size: 18.17 MBs


## Read Performance Tests

In [11]:
import time
import psutil
import os

def get_memory_usage():
    process = psutil.Process(os.getpid())
    return process.memory_info().rss / (1024 * 1024) # in MBs
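
The difference in process RSS before and after a read is only a rough signal, because it also reflects memory the interpreter had already reserved or later released. As a complementary measurement, a minimal sketch using only the standard library is shown below. The helper name `measure` is hypothetical, and `tracemalloc` only counts allocations made through Python's allocator, so memory that a native library allocates on its own may not be included. Tracing also slows the code down, so timings taken this way are upper bounds.

```python
import time
import tracemalloc


def measure(fn):
    """Run fn() and return (result, elapsed_seconds, peak_traced_mb)."""
    tracemalloc.start()
    start = time.perf_counter()  # monotonic clock intended for timing
    try:
        result = fn()
        elapsed = time.perf_counter() - start
        _, peak_bytes = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, elapsed, peak_bytes / (1024 * 1024)


# Stand-in workload; in this notebook the callable could instead be
# lambda: pd.read_csv(file_path_csv)
values, seconds, peak_mb = measure(lambda: list(range(1_000_000)))
print(f"elapsed: {seconds:.4f} s, peak traced memory: {peak_mb:.2f} MB")
```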


In [20]:


initial_memory = get_memory_usage()
start_time = time.time()

df_csv = pd.read_csv(file_path_csv)

end_time = time.time()
final_memory = get_memory_usage()

csv_read_time = end_time - start_time
csv_memory_usage = final_memory - initial_memory

print(f"CSV read time: {csv_read_time:.4f} seconds")
print(f"CSV memory usage: {csv_memory_usage:.2f} MBs")
df_csv.info(memory_usage='deep')


CSV read time: 1.9162 seconds
CSV memory usage: 57.02 MBs
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500000 entries, 0 to 499999
Data columns (total 13 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   order_id          500000 non-null  int64  
 1   product_id        500000 non-null  object 
 2   customer_id       500000 non-null  object 
 3   order_date        500000 non-null  object 
 4   delivery_date     500000 non-null  object 
 5   product_category  500000 non-null  object 
 6   customer_segment  500000 non-null  object 
 7   product_price     500000 non-null  float64
 8   quantity          500000 non-null  int64  
 9   rating            500000 non-null  int64  
 10  discount          500000 non-null  float64
 11  total_price       500000 non-null  float64
 12  region            500000 non-null  object 
dtypes: float64(3), int64(3), object(7)
memory usage: 220.6 MB


### Reading the full Parquet file

In [19]:


initial_memory = get_memory_usage()
start_time = time.time()

df_parquet_full = pd.read_parquet(file_path_parquet, engine='pyarrow')

end_time = time.time()
final_memory = get_memory_usage()

parquet_full_read_time = end_time - start_time
parquet_full_memory_usage = final_memory - initial_memory

print(f"Parquet full read time: {parquet_full_read_time:.4f} seconds")
print(f"Parquet full memory usage (approx): {parquet_full_memory_usage:.2f} MBs")
print("Parquet full DataFrame info:")
df_parquet_full.info(memory_usage='deep')


Parquet full read time: 0.1828 seconds
Parquet full memory usage (approx): 78.39 MBs
Parquet full DataFrame info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500000 entries, 0 to 499999
Data columns (total 12 columns):
 #   Column            Non-Null Count   Dtype         
---  ------            --------------   -----         
 0   order_id          500000 non-null  int64         
 1   product_id        500000 non-null  object        
 2   customer_id       500000 non-null  object        
 3   order_date        500000 non-null  datetime64[ns]
 4   delivery_date     500000 non-null  datetime64[ns]
 5   product_category  500000 non-null  object        
 6   customer_segment  500000 non-null  object        
 7   product_price     500000 non-null  float64       
 8   quantity          500000 non-null  int64         
 9   rating            500000 non-null  int64         
 10  discount          500000 non-null  float64       
 11  total_price       500000 non-null  float64       
dtype

### Reading a partial Parquet file (selected columns)

In [21]:


selected_columns = ['product_category', 'total_price']

initial_memory = get_memory_usage()
start_time = time.time()

df_parquet_partial = pd.read_parquet(file_path_parquet, columns=selected_columns, engine='pyarrow')

end_time = time.time()
final_memory = get_memory_usage()

parquet_partial_read_time = end_time - start_time
parquet_partial_memory_usage = final_memory - initial_memory

print(f"Parquet Partial Read Time: {parquet_partial_read_time:.4f} seconds")
print(f"Parquet Partial Memory Usage (approx): {parquet_partial_memory_usage:.2f} MBs")
print("Parquet Partial DataFrame Info:")
df_parquet_partial.info(memory_usage='deep')

Parquet Partial Read Time: 0.0407 seconds
Parquet Partial Memory Usage (approx): 7.83 MBs
Parquet Partial DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500000 entries, 0 to 499999
Data columns (total 2 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   product_category  500000 non-null  object 
 1   total_price       500000 non-null  float64
dtypes: float64(1), object(1)
memory usage: 32.4 MB
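
The measurements recorded above can be turned into ratios with a few lines of arithmetic. The figures below are copied from the outputs in this notebook. They come from single runs, and the CSV read output lists 13 columns (it includes `region`), so the CSV that was timed is slightly larger than the 49.25 MB file measured earlier. The ratios should therefore be read as indicative rather than as a controlled benchmark.

```python
# Figures copied from the outputs recorded in this notebook.
csv_size_mb, parquet_size_mb = 49.25, 18.17
csv_read_s = 1.9162
parquet_full_read_s = 0.1828
parquet_partial_read_s = 0.0407

size_ratio = csv_size_mb / parquet_size_mb
full_speedup = csv_read_s / parquet_full_read_s
partial_speedup = csv_read_s / parquet_partial_read_s

print(f"Parquet file is {size_ratio:.1f}x smaller than the CSV")
print(f"Full Parquet read is {full_speedup:.1f}x faster than the CSV read")
print(f"Two-column Parquet read is {partial_speedup:.1f}x faster than the CSV read")
```

The two-column read is the fastest because Parquet stores each column separately, so the reader can skip the ten columns that were not requested.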


## Adding Region column to dataset




In [24]:
regions = ['North', 'South', 'East', 'West', 'Central']
amazon_sales_df['region'] = np.random.choice(regions, num_rows)

# Save the modified DataFrame back to CSV
amazon_sales_df.to_csv(file_path_csv, index=False)

In [26]:
amazon_sales_df.head()

| | order_id | product_id | customer_id | order_date | delivery_date | product_category | customer_segment | product_price | quantity | rating | discount | total_price | region |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | P01592 | C3565 | 2023-01-11 05:05:01 | 2023-01-13 05:05:01 | Computers | Regular | 299.98 | 4 | 5 | 0.04 | 1151.92 | East |
| 1 | 2 | P02166 | C3175 | 2024-09-06 21:26:22 | 2024-09-10 21:26:22 | Smartphones | Regular | 1922.54 | 5 | 3 | 0.3 | 6728.89 | Central |
| 2 | 3 | P04065 | C0007 | 2024-04-07 06:04:43 | 2024-04-13 06:04:43 | Electronics | Regular | 744.01 | 2 | 4 | 0.24 | 1130.9 | South |
| 3 | 4 | P09042 | C4278 | 2024-10-01 21:00:02 | 2024-10-07 21:00:02 | Computers | New | 96.0 | 3 | 5 | 0.16 | 241.92 | Central |
| 4 | 5 | P05397 | C0333 | 2023-10-26 19:44:11 | 2023-10-29 19:44:11 | Smartphones | Regular | 819.62 | 4 | 3 | 0.01 | 3245.7 | West |


## Calculating total number of records



In [16]:
# using chunk
chunk_size = 50000
total_records = 0

print(f"Reading CSV file in chunks of {chunk_size} records...")

for chunk in pd.read_csv(file_path_csv, chunksize=chunk_size):
    total_records += len(chunk)

print(f"Total number of records in the CSV file: {total_records}")

Reading CSV file in chunks of 50000 records...
Total number of records in the CSV file: 500000


## Calculating Average Transaction Value per Category



In [29]:


# Chunk size for streaming the CSV
chunk_size = 50000

# Running per-category totals and transaction counts, kept as pandas Series
category_total_sales = pd.Series(dtype=float)
category_transaction_counts = pd.Series(dtype=int)

# Stream the CSV in chunks and fold each chunk into the running aggregates
for chunk in pd.read_csv(file_path_csv, chunksize=chunk_size):

    # Group the chunk by product_category
    grouped_chunk = chunk.groupby('product_category')

    # Sum of 'total_price' for each category within the chunk
    chunk_sales = grouped_chunk['total_price'].sum()

    # Number of transactions for each category within the chunk (one order_id per transaction)
    chunk_counts = grouped_chunk['order_id'].count()

    # Add the chunk's partial results to the running aggregates
    category_total_sales = category_total_sales.add(chunk_sales, fill_value=0)
    category_transaction_counts = category_transaction_counts.add(chunk_counts, fill_value=0)

# Average transaction value = total sales / number of transactions, per category
average_transaction_value_per_category = category_total_sales / category_transaction_counts

print("\nAverage Transaction Value per Product Category:")
print(average_transaction_value_per_category.sort_values(ascending=False))


Average Transaction Value per Product Category:
product_category
Wearables          2620.012524
Computers          2615.104014
Accessories        2615.071892
Smartphones        2612.375037
Home Appliances    2611.272325
Electronics        2611.205564
dtype: float64


## Identifying Top 5 Regions by Total Sales




In [30]:
chunk_size = 50000

# Running total of sales per region, kept as a pandas Series
region_total_sales = pd.Series(dtype=float)

# Stream the CSV in chunks and fold each chunk into the running totals
for chunk in pd.read_csv(file_path_csv, chunksize=chunk_size):

    # Sum of 'total_price' for each region within the chunk
    chunk_region_sales = chunk.groupby('region')['total_price'].sum()

    # Add the chunk's partial sums to the running totals
    region_total_sales = region_total_sales.add(chunk_region_sales, fill_value=0)

# Sort regions by total sales, highest first
region_total_sales = region_total_sales.sort_values(ascending=False)

# Keep the top 5 (the dataset defines exactly five regions, so this ranks all of them)
top_5_regions = region_total_sales.head(5)

print("\nTop 5 Regions by Total Sales:")
print(top_5_regions)



Top 5 Regions by Total Sales:
region
North      2.626809e+08
South      2.624424e+08
Central    2.611047e+08
East       2.609171e+08
West       2.599386e+08
dtype: float64


## Explaining Chunking and Distributed Systems

### Why Chunking is Necessary for Limited Memory

Chunking is a critical strategy for processing large datasets on machines with limited memory. When a dataset is too large to fit entirely into a computer's RAM, attempting to load the whole dataset at once will lead to `MemoryError` or severe performance degradation due to constant swapping between RAM and disk. Chunking addresses this by breaking down the large dataset into smaller, manageable pieces that can be loaded into memory and processed sequentially.

Each piece is processed independently, and the partial results are then combined. This keeps the memory footprint at any moment bounded by the chunk size rather than the file size, which prevents crashes and makes it possible to analyze datasets that would not otherwise fit on the machine. The trade-off is that the file must be streamed from storage on every pass, and each statistic has to be expressed as something that can be merged across chunks, such as a sum or a count.
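
The point about aggregating partial results can be checked on a toy example. The sketch below uses a small, made-up CSV held in memory (`csv_text` is hypothetical data, not the sales file) and shows that carrying sums and counts across chunks reproduces the full-file mean, while averaging the per-chunk means does not when categories are unevenly spread across chunks.

```python
import io
import pandas as pd

# Tiny hypothetical dataset standing in for the sales CSV.
csv_text = """product_category,total_price
Computers,100.0
Computers,200.0
Wearables,50.0
Computers,600.0
Wearables,150.0
Wearables,250.0
Wearables,450.0
"""

sums = pd.Series(dtype=float)
counts = pd.Series(dtype=float)
chunk_means = []

# chunksize=3 spreads each category unevenly across the chunks.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=3):
    grouped = chunk.groupby("product_category")["total_price"]
    sums = sums.add(grouped.sum(), fill_value=0)
    counts = counts.add(grouped.count(), fill_value=0)
    chunk_means.append(grouped.mean())

# Correct: merge sums and counts, then divide once at the end.
chunked_mean = sums / counts

# Reference: the same statistic computed on the whole file at once.
full_mean = (
    pd.read_csv(io.StringIO(csv_text))
    .groupby("product_category")["total_price"]
    .mean()
)

# Incorrect shortcut: average the per-chunk means.
mean_of_means = pd.concat(chunk_means).groupby(level=0).mean()

print(pd.DataFrame({
    "chunked": chunked_mean,
    "full": full_mean,
    "mean_of_means": mean_of_means,
}))
```

This is why the analysis cells in this notebook accumulate `total_price` sums and `order_id` counts separately and divide only after the last chunk.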

### Chunking's Relation to Distributed Data Processing (Hadoop as an example)

Hadoop applies the same divide-and-aggregate pattern across a cluster of machines. In Hadoop's architecture:

1.  **Data Distribution (HDFS - Hadoop Distributed File System):** Similar to how a large file is logically divided into chunks for sequential processing, HDFS physically distributes large files into smaller blocks across a cluster of commodity machines. Each block is essentially a 'chunk' of the overall dataset. This distribution allows for parallel access and processing of different parts of the data.

2.  **Parallel Processing (MapReduce):** Hadoop's processing framework, MapReduce, operates on these distributed data blocks. The 'Map' phase involves processing these individual data blocks (chunks) in parallel across various nodes in the cluster. Each mapper processes its assigned chunk of data, performing computations locally on that subset. This is analogous to processing a chunk in a single-machine scenario.

3.  **Aggregation of Results:** Just as results from individual chunks are aggregated in a single-machine chunking process, the 'Reduce' phase in MapReduce aggregates the intermediate results generated by the mappers, each of which worked on its own 'chunk', to produce the final output. This distributed aggregation combines results from separate parts of the dataset.
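
The three steps above can be imitated on one machine to make the analogy concrete. This is a minimal sketch, not Hadoop's API: the `blocks` list is hypothetical data already split into pieces, threads stand in for cluster nodes, and the `merge` function plays the role that shuffle and reduce play in a real MapReduce job.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

# Hypothetical (category, total_price) records, pre-split into "blocks".
blocks = [
    [("Computers", 100.0), ("Wearables", 50.0)],
    [("Computers", 200.0), ("Accessories", 30.0)],
    [("Wearables", 70.0)],
]


def map_block(block):
    """Map step: aggregate one block locally into per-category totals."""
    totals = Counter()
    for category, price in block:
        totals[category] += price
    return totals


def merge(left, right):
    """Reduce step: combine two partial results into one."""
    merged = Counter(left)
    merged.update(right)  # Counter.update adds values for matching keys
    return merged


# Each block is handled by its own worker, as each HDFS block gets a mapper.
with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(map_block, blocks))

totals = reduce(merge, partials, Counter())
print(dict(totals))
```

The structure mirrors the chunked loops in this notebook: `map_block` corresponds to the per-chunk `groupby(...).sum()`, and `merge` corresponds to `Series.add(..., fill_value=0)`.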

**Key Principles:**

*   **Scalability:** Both chunking and distributed systems handle datasets larger than memory. Chunking on a single machine remains bounded by that machine's disk, CPU, and available time, so it can only scale vertically, whereas distributed systems scale horizontally by adding more machines to the cluster.
*   **Fault Tolerance:** In distributed systems, if a node processing a 'chunk' fails, the system can re-assign that chunk to another node, ensuring resilience. This is a more advanced concept than simple chunking but originates from the same necessity to manage parts of a whole.
*   **Locality of Data:** Distributed systems like Hadoop prioritize moving computation to the data (processing data where it resides) rather than moving data to computation. This minimizes network overhead, which is crucial for efficiency, just as minimizing disk reads is important for single-machine chunking.
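
The fault-tolerance principle can be illustrated with a toy simulation. The sketch below is not how Hadoop schedules tasks; the function names, the worker names `node-a` and `node-b`, and the failure model are all hypothetical. It only shows the idea that a chunk is a unit of work that can be handed to another worker when the first one fails.

```python
def run_with_reassignment(chunks, workers, process):
    """Process each chunk, moving on to the next worker if one fails."""
    results = {}
    for chunk_id, chunk in enumerate(chunks):
        for worker in workers:
            try:
                results[chunk_id] = process(worker, chunk)
                break  # chunk done; stop trying other workers
            except RuntimeError:
                continue  # this worker failed; re-assign the chunk
        else:
            # No worker succeeded for this chunk.
            raise RuntimeError(f"chunk {chunk_id} failed on every worker")
    return results


def flaky_sum(worker, chunk):
    # Hypothetical failure model: the worker named "node-a" is down.
    if worker == "node-a":
        raise RuntimeError("node-a unavailable")
    return sum(chunk)


results = run_with_reassignment([[1, 2], [3, 4], [5]], ["node-a", "node-b"], flaky_sum)
print(results)
```

Re-assignment is only safe because each chunk can be processed independently of the others, which is the same property that makes single-machine chunking work.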

In essence, chunking is a sequential, single-machine approach to memory management for large datasets, while distributed systems like Hadoop operationalize and scale this concept across an entire cluster, enabling concurrent processing of many 'chunks' (data blocks) to tackle the challenges of Big Data.

## R vs Python Reflection
### Reflection:

**1. How would this task (processing a large dataset, performing aggregations) typically be handled in R?**

In R, processing large datasets and performing aggregations would commonly involve packages like `data.table` or `dplyr`. `data.table` is highly optimized for performance and memory efficiency, often outperforming `data.frame` for large-scale operations due to its C-based backend and efficient indexing. Similarly, `dplyr` (part of the Tidyverse) provides a grammar of data manipulation that is intuitive and can be very efficient, especially when used with backends that support lazy evaluation or out-of-memory processing, like `dbplyr` for databases. For extremely large datasets, R users might leverage `disk.frame` or connect to external databases or Spark clusters using packages like `sparklyr` to handle data that doesn't fit into RAM.

**2. What fundamental limitations or challenges arise and 'break down' when datasets become very large, even with chunking on a single machine?**

Even with chunking, processing very large datasets on a single machine eventually faces fundamental limitations. The primary challenge is the sheer volume of I/O operations required to read and write chunks from disk, which becomes a bottleneck. While chunking helps manage RAM, the constant disk access slows down processing significantly. Furthermore, coordinating state across chunks for complex aggregations or joins can become difficult and error-prone, requiring careful management of intermediate results. A single machine also has finite CPU cores and memory bandwidth, which can be quickly overwhelmed by the computational demands of truly massive datasets, leading to extremely long processing times or even system crashes.
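
The difficulty of coordinating state across chunks shows up as soon as a statistic cannot be merged by simple addition. A minimal sketch with made-up customer IDs is given below: counting distinct customers per chunk and adding the counts overstates the answer, so the exact version has to carry a set across all chunks.

```python
# Hypothetical customer IDs arriving in three chunks.
chunks = [
    ["C0001", "C0002", "C0001"],
    ["C0002", "C0003"],
    ["C0003", "C0004", "C0001"],
]

# Naive: add up per-chunk distinct counts.
# Customers that appear in several chunks are counted more than once.
naive_total = sum(len(set(chunk)) for chunk in chunks)

# Exact: carry a set across chunks.
# Memory now grows with the number of distinct IDs, not with the chunk size.
seen = set()
for chunk in chunks:
    seen.update(chunk)
exact_total = len(seen)

print(f"naive: {naive_total}, exact: {exact_total}")
```

With enough distinct keys the carried state itself stops fitting in memory, which is one of the points where single-machine chunking breaks down.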

**3. Explain the core principle behind why Big Data processing paradigms emphasize moving computation to the data rather than moving the data to the computation.**

The core principle behind moving computation to the data in Big Data paradigms stems from the prohibitive cost and time associated with moving massive datasets across a network. When data sizes reach petabytes or exabytes, transferring all that data to a central processing node becomes impractical due to network bandwidth limitations and latency. Instead, it is far more efficient to send the much smaller program to where the data resides, process the data locally on the machines that store it, and transmit only the aggregated results back. This minimizes data movement, uses the combined computing power of many nodes, and significantly improves performance and scalability.
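
A back-of-the-envelope calculation makes the cost of moving data tangible. The sketch below assumes a dedicated 10 Gbit/s link, a 1 PB dataset, and a 10 MB aggregated result; all three numbers are illustrative assumptions, and the formula ignores latency and protocol overhead.

```python
def transfer_seconds(size_bytes, bandwidth_bits_per_s):
    """Idealized time to move size_bytes over a link (no latency or overhead)."""
    return size_bytes * 8 / bandwidth_bits_per_s


PB = 10**15
MB = 10**6
link = 10 * 10**9  # assumed 10 Gbit/s link

dataset_seconds = transfer_seconds(1 * PB, link)   # ship the whole dataset
result_seconds = transfer_seconds(10 * MB, link)   # ship only an aggregated result

print(f"moving 1 PB of data: {dataset_seconds / 86400:.1f} days")
print(f"moving a 10 MB result: {result_seconds * 1000:.0f} ms")
```

Under these assumptions the raw data needs more than nine days on the wire, while the aggregated result arrives in milliseconds.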

## Storage Architecture Thinking
#### Conceptual Answer: Handling a 5 TB Dataset

When dealing with a 5 TB dataset, traditional single-machine approaches become inadequate, necessitating a shift towards distributed systems and Big Data technologies.

*   **Storage Location:**
    A 5 TB dataset would typically be stored in a **distributed file system** or **cloud object storage**. Popular options include:
    *   **Hadoop Distributed File System (HDFS):** Stores files as blocks spread across a cluster of commodity machines, as described above.
    *   **Amazon S3, Google Cloud Storage, or Azure Blob Storage:** Cloud-based object storage services that offer massive scalability, high availability, and durability. They are ideal for storing large amounts of unstructured data and integrate well with cloud-native Big Data processing services.

*   **Partitioning Strategy:**
    Partitioning is crucial for optimizing storage, retrieval, and processing efficiency. For a 5 TB dataset, effective partitioning strategies would involve dividing the data into smaller, manageable chunks based on specific criteria:
    *   **By Date/Time:** If the data has a temporal component (e.g., `order_date`, `log_timestamp`), partitioning by year, month, or day is common. This allows queries to quickly access data for specific periods without scanning the entire dataset.
    *   **By Region/Geography:** For datasets like 'Amazon Electronic sales' with a `region` column, partitioning by region (e.g., North, South, East, West) enables localized analysis and reduces data scanned for region-specific queries.
    *   **By Hash or ID Range:** For evenly distributing data that doesn't have a natural partitioning key, a hash of a unique identifier (`customer_id`, `order_id`) can be used, or data can be partitioned by ranges of IDs.
    *   **By Product Category:** For product-centric analysis, partitioning by `product_category` can be efficient.
    
    The goal is to ensure that related data is stored together and queries only read the necessary partitions, minimizing I/O and improving performance.

*   **Why a Single Machine Would Fail:**
    A single machine would typically fail or be severely inadequate for handling a 5 TB dataset due to several limitations:
    *   **Memory (RAM):** A 5 TB dataset far exceeds the RAM capacity of even high-end single machines (which might have 128 GB to 1 TB of RAM). Loading or processing such a dataset entirely in memory is impossible, and attempting it leads to excessive swapping to disk, which is very slow.
    *   **Storage (Disk Space):** While a single machine can have a 5 TB hard drive, using a single drive for primary storage lacks redundancy and introduces a single point of failure. Performance would also be bottlenecked by a single disk's read/write speeds.
    *   **I/O Throughput:** Reading 5 TB from a single disk is slow: at an assumed sustained 200 MB/s, one full pass takes roughly seven hours, and most analyses need several passes. A RAID array improves this but is still limited by one machine's controllers and bus. Distributed systems parallelize I/O across many disks.
    *   **Processing Power (CPU):** Complex analyses on 5 TB of data require significant CPU cycles. A single CPU (or even multiple CPUs on one machine) cannot process this volume of data as efficiently as a distributed cluster working in parallel.
    *   **Fault Tolerance:** A single machine is a single point of failure. If the machine crashes, all in-progress work is lost and the data is unavailable until the machine is restored; if its storage fails, the data itself may be lost. Distributed systems are designed with redundancy and fault tolerance, where the failure of one node does not bring down the entire system.

*   **Necessary Big Data Technologies:**
    To manage and process a 5 TB dataset effectively, several Big Data technologies become essential:
    1.  **Apache Hadoop (HDFS & YARN):**
        *   **Role:** Hadoop provides the foundational layer for distributed storage (HDFS) and resource management (YARN). HDFS stores the 5 TB data reliably across a cluster of commodity machines, providing fault tolerance and high throughput. YARN manages the computational resources for various processing engines.
        *   **Suitability:** It addresses the storage and fault tolerance issues of a single machine by distributing data and processing. It's designed for batch processing of large datasets.

    2.  **Apache Spark:**
        *   **Role:** Spark is a fast and general-purpose cluster computing system for large-scale data processing. It can perform various tasks, including ETL, machine learning, graph processing, and streaming, often significantly faster than traditional MapReduce due to in-memory processing capabilities.
        *   **Suitability:** It tackles the processing power and speed limitations. Its ability to perform iterative computations and interactive queries efficiently makes it ideal for complex analytics and machine learning on 5 TB datasets.

    3.  **Apache Hive or Presto/Trino:**
        *   **Role:** These technologies provide SQL-like query interfaces over data stored in distributed systems like HDFS or cloud object storage. Hive translates SQL queries into MapReduce or Spark jobs, while Presto/Trino are distributed SQL query engines designed for interactive analytics over large datasets, often providing faster query times than Hive for certain workloads.
        *   **Suitability:** They enable data analysts and scientists to interact with the massive 5 TB dataset using familiar SQL without needing to write complex programming code, bridging the gap between traditional data warehousing and Big Data ecosystems.
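
The partitioning strategies listed above are often combined, for example date partitions with a hash bucket inside each one. A minimal sketch of such a layout follows. The `key=value` directory naming is a convention used by Hive and Spark; the function name `partition_path`, the bucket count of 16, and the choice of `customer_id` as the hash key are assumptions made for illustration.

```python
import zlib
from datetime import datetime


def partition_path(order_date, customer_id, n_buckets=16):
    """Build a Hive-style partition path: date partitions plus a hash bucket."""
    # zlib.crc32 is stable across runs, unlike Python's built-in hash() for strings.
    bucket = zlib.crc32(customer_id.encode("utf-8")) % n_buckets
    return (
        f"year={order_date.year}/"
        f"month={order_date.month:02d}/"
        f"bucket={bucket:02d}"
    )


# First row of the sample data shown earlier in this notebook.
path = partition_path(datetime(2023, 1, 11, 5, 5, 1), "C3565")
print(path)
```

A query restricted to January 2023 then only needs to read files under `year=2023/month=01/`, and the hash bucket keeps all orders of one customer in the same file group within that month.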



## Summary:

### Data Analysis Key Findings

*   A synthetic `region` column was successfully added to the `amazon_sales_df` DataFrame and saved to `amazon_electronic_sales.csv` to facilitate regional analysis.
*   The `amazon_electronic_sales.csv` dataset contains a total of 500,000 records, which was determined by processing the file in chunks of 50,000 records to efficiently handle memory.
*   The average transaction value per product category was calculated, with values ranging from approximately \$2,611.21 (Electronics) to \$2,620.01 (Wearables).
*   All five regions were ranked by total sales, with approximate totals of: North (\$262,680,900), South (\$262,442,400), Central (\$261,104,700), East (\$260,917,100), and West (\$259,938,600). This analysis was performed by processing the data in chunks.
*   Chunking is a necessary strategy for processing large datasets on machines with limited memory, breaking down data into manageable pieces to prevent `MemoryError` and performance degradation.
*   The concept of chunking directly relates to distributed data processing systems like Hadoop, where data is distributed into blocks (HDFS) and processed in parallel (MapReduce) across multiple machines.
*   Single-machine processing of very large datasets (e.g., 5 TB) eventually faces limitations due to I/O bottlenecks, difficulty in coordinating state across chunks, finite CPU/memory resources, and lack of fault tolerance.
*   Big Data processing paradigms emphasize moving computation to the data (e.g., in systems like Hadoop and Spark) rather than moving data to computation, primarily due to the prohibitive cost and time associated with transferring massive datasets across networks.

### Insights or Next Steps

*   The practical application of chunking successfully demonstrated how to perform analyses like record counting, average transaction value per category, and top region identification on a moderately large dataset without exhausting single-machine memory.
*   For datasets exceeding single-machine capabilities (e.g., 5 TB as discussed), transitioning to Big Data technologies such as Apache Hadoop (HDFS, YARN), Apache Spark, and SQL engines like Apache Hive or Presto/Trino becomes essential for scalable storage, distributed processing, and efficient querying.
