# Amazon Sales Data Analysis Project

## Introduction
This project analyzes large-scale Amazon sales data with Apache Spark (PySpark), a widely used framework for distributed, high-performance data processing. Beyond presenting analysis results, it explores the technology behind modern data science workflows. It starts with the fundamental principles of Big Data (Volume, Velocity, Variety), then compares Spark's core building blocks, RDDs and DataFrames, and covers data loading in the context of distributed storage solutions such as HDFS and S3. Finally, the data is processed with PySpark's filtering, aggregation, and join operations, the findings are presented through visualization tasks, and the work concludes with a mini analysis report. The goal is to use PySpark to uncover business trends and operational insights in the Amazon sales data.

## Project Details

* **Framework:** Apache Spark (PySpark)
* **Environment:** Google Colab / Jupyter Notebook
* **Visualization:** Matplotlib & Seaborn
* **Output:** Single .ipynb file + Short Summary Report

## Setup and Environment Configuration

### 1. Install Dependencies

In [None]:
# Install the OpenJDK
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

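# Download Apache Spark 3.5.1
# (dlcdn.apache.org serves only current releases; older versions such as 3.5.1 are kept on archive.apache.org)
!wget -q https://archive.apache.org/dist/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz

# Extract the archive
!tar xf spark-3.5.1-bin-hadoop3.tgz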

# (Optional) Remove the downloaded archive file
!rm spark-3.5.1-bin-hadoop3.tgz

# Install findspark
!pip install -q findspark

### 2. Configure Environment Variables

In [None]:
import os

# Set the location of the Spark installation
os.environ["SPARK_HOME"] = "/content/spark-3.5.1-bin-hadoop3"

# Set the location of the Java installation
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"

# Add the PySpark library to the system path
import findspark
findspark.init()
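
A wrong `SPARK_HOME` or `JAVA_HOME` tends to surface only later, as a hard-to-read error when Spark starts. The sketch below is one possible way to check both paths using only the standard library. The helper name `missing_env_dirs` is hypothetical and is not part of findspark or PySpark.

```python
import os


def missing_env_dirs(names=("SPARK_HOME", "JAVA_HOME"), environ=None):
    """Return the variables in `names` that are unset or do not point to a directory."""
    env = os.environ if environ is None else environ
    return [name for name in names if not os.path.isdir(env.get(name, ""))]


problems = missing_env_dirs()
if problems:
    print("Check these variables before starting Spark:", ", ".join(problems))
else:
    print("SPARK_HOME and JAVA_HOME both point to existing directories.")
```

If a variable is reported, the usual causes are a failed download or extraction step, or a path that does not match the installed version.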

### 3. Initialize Spark Session

In [None]:
from pyspark.sql import SparkSession

# Create a SparkSession (or reuse the active one) that runs locally on all available cores
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("AmazonAnalysis")
    .getOrCreate()
)

# Display the SparkSession object (to verify success)
spark

## Section 1 — What is Big Data? (Volume, Velocity, Variety)

In [None]:
## DATA UPLOAD

# Upload the Amazon sales data file from your local machine to the Colab environment.
# A file selector will appear after executing this cell.
from google.colab import files
uploaded = files.upload()
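
`files.upload()` returns a dictionary that maps each uploaded file name to its contents as bytes, and Colab also saves the files to the current working directory. A small sketch for summarizing what arrived is shown below. The helper `summarize_uploads` and the file name `amazon_sales.csv` are hypothetical, and the sample dictionary stands in for the real `uploaded` value so that the block also runs outside Colab.

```python
def summarize_uploads(uploaded):
    """Return (file name, size in bytes) pairs, sorted by file name."""
    return sorted((name, len(data)) for name, data in uploaded.items())


# Stand-in for the dictionary returned by files.upload(); the file name is made up.
sample_uploaded = {"amazon_sales.csv": b"order_id,amount\n1,9.99\n"}

for name, size in summarize_uploads(sample_uploaded):
    print(f"{name}: {size} bytes")
```

In the notebook, pass the real `uploaded` variable instead of `sample_uploaded`.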

## Section 2 — Hadoop & MapReduce
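
As a rough illustration of the MapReduce model, the sketch below totals sales per category in three phases (map, shuffle, reduce) using plain Python. It runs in a single process and does not use Hadoop. The records, field layout, and function names are invented for illustration and are not taken from the Amazon dataset.

```python
from collections import defaultdict

# Toy (category, amount) records; made up for illustration only.
records = [("Books", 12.0), ("Toys", 5.0), ("Books", 8.0), ("Toys", 7.5)]


def map_phase(rows):
    """Map: emit one (key, value) pair per input row."""
    for category, amount in rows:
        yield category, amount


def shuffle_phase(pairs):
    """Shuffle: group all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each key's values into a single total."""
    return {key: sum(values) for key, values in groups.items()}


totals = reduce_phase(shuffle_phase(map_phase(records)))
print(totals)
```

In Hadoop, the map and reduce steps run in parallel across many machines and the shuffle moves data over the network; here all three phases run sequentially on one machine.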

## Section 3 — Apache Spark (PySpark): RDD vs DataFrame

## Section 4 — Spark SQL & Streaming

## Section 5 — Distributed Storage (HDFS & S3)

## Section 6 — Data Processing with PySpark (Filter, Aggregation, Joins)

## Section 7 — Loading & Reading Data on S3

## Section 8 — Visualization Tasks

## Section 9 — Mini Analysis Report (Short Comments)