# AGUME KENNETH B30309 S24B38/017

## SECTION A

## Q1. CSV vs Parquet (Core)

How CSV stores data?

- In my experiment, the CSV file size was 32.86 MB while the Parquet file was only 7.03 MB. CSV stores data in plain text format, where each row is written sequentially and values are separated by commas. This means numbers and categories are stored as readable text, which increases file size and requires more data to be read from disk.

How Parquet stores data?

- Parquet, on the other hand, stores data in a columnar format. Each column is stored separately and compressed efficiently. Because values of the same type are stored together, compression is more effective, which explains why the Parquet file was significantly smaller.

What this means for disk I/O?

- Although the CSV loaded slightly faster (1.09 s vs 1.51 s for Parquet), Parquet reduces disk I/O overall because less data needs to be physically read from disk. As data grows larger, the reduced disk I/O becomes more important than the small decompression overhead.
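
As a quick check on the file sizes reported in this answer, the snippet below works out how much smaller the Parquet file is. The variable names are illustrative only; the two sizes are the ones measured in the experiment.

```python
# File sizes reported in the experiment (in MB).
csv_mb = 32.86
parquet_mb = 7.03

ratio = csv_mb / parquet_mb        # how many times larger the CSV file is
saving = 1 - parquet_mb / csv_mb   # fraction of bytes Parquet avoids storing

print(f"CSV is about {ratio:.1f}x larger than Parquet")
print(f"Parquet stores about {saving:.0%} fewer bytes")
```

With these figures the CSV is roughly 4.7 times larger, so Parquet stores about 79% fewer bytes for the same data.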

## Q2. Column Selection Experiment

Why is this possible with Parquet but not efficient with CSV?

- When I loaded only two columns from the Parquet file, the load time dropped to 0.09 seconds. This is because Parquet stores data column by column, which allows the system to read only the required columns from disk.

Relate your answer to columnar storage design.

- CSV files store data row by row. Even when only two columns are needed, every row must be read and parsed in full, which makes selective column reading inefficient in CSV. My experiment clearly demonstrated that columnar storage improves performance for analytical queries that touch only a few columns.
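
The difference between the two layouts can be sketched with a toy table in plain Python. This is only an illustration of the idea, not how Parquet is implemented: real Parquet files also use row groups, metadata, and compression. The column names and values are hypothetical.

```python
COLUMNS = ("id", "category", "amount", "city")

# Row-oriented layout (CSV-like): one complete record after another.
rows = [
    (1, "food", 12.5, "A"),
    (2, "rent", 300.0, "B"),
    (3, "food", 7.25, "A"),
]

# Column-oriented layout (Parquet-like): one list per column.
columns = {name: [row[i] for row in rows] for i, name in enumerate(COLUMNS)}


def select_from_rows(rows, wanted):
    """Return the wanted columns plus the number of values scanned."""
    idx = [COLUMNS.index(name) for name in wanted]
    result = {name: [] for name in wanted}
    scanned = 0
    for row in rows:
        scanned += len(row)  # the whole row has to be read and parsed
        for name, i in zip(wanted, idx):
            result[name].append(row[i])
    return result, scanned


def select_from_columns(columns, wanted):
    """Return the wanted columns plus the number of values scanned."""
    result = {name: columns[name] for name in wanted}
    scanned = sum(len(values) for values in result.values())
    return result, scanned


wanted = ["category", "amount"]
from_rows, row_cost = select_from_rows(rows, wanted)
from_cols, col_cost = select_from_columns(columns, wanted)
print(row_cost, col_cost)
```

Both functions return the same two columns, but the row layout touches all 12 values while the column layout touches only the 6 that were requested.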

## Q3. Storage as a Bottleneck

- Loading the full dataset increased memory usage from 90 MB to 127 MB. Even though the dataset was only 32.86 MB on disk, reading it required additional memory and parsing time.

- As datasets increase in size, disk reading becomes a major bottleneck because data must first be transferred from storage into memory before processing. Disk I/O is much slower than CPU operations, so the speed of reading data from disk limits performance even before the CPU becomes overloaded. This is why Big Data systems optimize storage formats and reduce unnecessary reads.

## SECTION B

## Q4. Full Load vs Chunk Processing

- Loading the entire CSV file at once increased memory usage and required the system to hold all rows in RAM simultaneously.
- While this worked for 500,000 rows, it would not scale well for millions or billions of rows.

- Chunk processing worked more reliably because only 50,000 rows were loaded at a time. After each chunk was processed, its memory was freed before the next one was loaded. This reduced memory pressure and made the process more stable, preventing crashes and improving scalability.

## Q5. Chunk Processing Logic

What a chunk represents?

- A chunk represents a small portion of the dataset, in my case 50,000 rows. Instead of loading all rows into memory, the file was divided into manageable blocks.

How partial results were combined?


- For each chunk, partial results were computed. For example, when calculating average transaction value per category, I accumulated partial sums and counts for each category. After all chunks were processed, the final averages were calculated.

Why this approach scales better?

- This approach scales better because memory usage remains constant regardless of dataset size. Only a fixed-size chunk is processed at any time, making it suitable for very large datasets.
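
A minimal sketch of this accumulate-then-combine pattern, assuming pandas is installed. The `category` and `amount` column names and the tiny in-memory CSV are hypothetical stand-ins for the real file, and the chunk size is shrunk to 2 rows so the example stays small.

```python
import io
from collections import defaultdict

import pandas as pd

# Tiny in-memory stand-in for the transactions CSV (hypothetical columns).
csv_text = """category,amount
food,10
rent,300
food,20
transport,5
rent,500
transport,15
"""

sums = defaultdict(float)   # running total per category
counts = defaultdict(int)   # running row count per category

# chunksize=2 for the demo; the experiment used 50,000 rows per chunk.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    partial = chunk.groupby("category")["amount"].agg(["sum", "count"])
    for category, row in partial.iterrows():
        sums[category] += float(row["sum"])
        counts[category] += int(row["count"])

# Final averages are computed only after every chunk has been seen.
averages = {category: sums[category] / counts[category] for category in sums}
print(averages)
```

Sums and counts are kept instead of per-chunk averages because an average of averages is wrong when chunks contain different numbers of rows per category.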

## Q6. Manual Effort Observation

- Chunk processing required manually maintaining dictionaries to accumulate partial sums and counts. It also required explicit loops and aggregation logic across chunks, which made the process more complex than a simple pandas `groupby` on a fully loaded DataFrame.

- Big Data systems automate this process by distributing data automatically and combining results behind the scenes. For example, systems like Hadoop and Spark handle partitioning and aggregation without manual tracking. This reduces coding complexity and human error.

## SECTION C

## Q7. From Chunks to Partitions

- The chunk processing I implemented is conceptually similar to data partitioning in Big Data systems. Each chunk represents a small partition of the overall dataset.

- In distributed systems, data is split across multiple machines instead of being read sequentially on one machine. Each machine processes its partition independently and the results are combined. My chunk logic mimics this idea, but sequentially on a single machine.
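
The same idea can be sketched with partitions processed in parallel. In this sketch threads stand in for worker machines, which is a simplification: a real cluster would run each partition on a separate node. The data and function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical transaction amounts.
values = list(range(1, 101))


def make_partitions(data, n):
    """Split data into at most n contiguous partitions of similar size."""
    size = -(-len(data) // n)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def process_partition(partition):
    """Work done independently on one partition: a partial (sum, count)."""
    return sum(partition), len(partition)


partitions = make_partitions(values, 4)

# Each partition is processed independently; threads simulate the workers.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(process_partition, partitions))

# Combine step: merge the partial results into the final answer.
total = sum(s for s, _ in partials)
count = sum(c for _, c in partials)
mean = total / count
print(mean)
```

The combine step is identical to the one used for chunks; the only change is that the partial results are produced at the same time rather than one after another.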

## Q8. Why Distributed Storage Is Necessary

- If the dataset grew to several terabytes, storing it on a single machine would be impractical. First, storage capacity would be insufficient. Second, reading and writing such large files would be extremely slow on one disk. Third, a single machine represents a single point of failure.

- Distributed storage systems solve these problems by spreading data across multiple machines. This increases storage capacity, improves read and write performance through parallelism, and ensures fault tolerance through replication.
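
A back-of-envelope calculation makes the parallelism argument concrete. The dataset size and the per-disk throughput below are assumed figures for illustration only, and the model assumes perfect parallelism with no network or coordination overhead.

```python
def read_time_hours(size_tb, throughput_mb_s, disks=1):
    """Hours needed to scan size_tb terabytes with `disks` drives in parallel."""
    size_mb = size_tb * 1_000_000  # 1 TB = 1,000,000 MB (decimal units)
    seconds = size_mb / (throughput_mb_s * disks)
    return seconds / 3600


# Assumed figures: 5 TB of data, 150 MB/s sustained read speed per disk.
single = read_time_hours(5, 150)
cluster = read_time_hours(5, 150, disks=20)
print(f"one disk: {single:.1f} h, twenty disks: {cluster:.2f} h")
```

Under these assumptions a single disk needs a little over 9 hours for one full scan, while twenty disks reading in parallel need under half an hour.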

## Q9. Moving Computation to the Data

- During the first assignment, all processing occurred on my local machine. Even with 500,000 rows, loading data into memory increased RAM usage noticeably.

- If the dataset were terabytes in size, transferring all data to a single processor would be inefficient and slow. Big Data systems move computation to where data is stored to reduce network traffic and avoid transferring massive datasets. This improves performance and scalability.

## SECTION D

## Q10. Hadoop Motivation

- Hadoop was created to address problems such as large file sizes and memory constraints. In my assignment, I observed that loading large files increases memory usage and requires careful chunk processing.

- Hadoop solves this by splitting files into blocks and distributing them across multiple machines. It also provides fault tolerance through data replication. This removes the limitations of sequential processing on a single machine.
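
Block splitting can be sketched in a few lines of plain Python. This is a toy model, not HDFS code: the block size is tiny so the output is readable (recent HDFS versions default to 128 MB blocks), the node names are hypothetical, and the round-robin assignment is a simplification of real placement.

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into fixed-size blocks; the last one may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


BLOCK_SIZE = 8                           # bytes, tiny for the demo
NODES = ["node-a", "node-b", "node-c"]   # hypothetical machine names

data = b"0123456789abcdefghijklmnopqrstuvwxyz"  # a 36-byte "file"

blocks = split_into_blocks(data, BLOCK_SIZE)

# Round-robin assignment of block index to node.
assignment = {i: NODES[i % len(NODES)] for i in range(len(blocks))}

for i, block in enumerate(blocks):
    print(i, assignment[i], block)

# The original file is recovered by joining the blocks in order.
restored = b"".join(blocks)
```

No single machine holds the whole file, yet the file can still be rebuilt from its blocks in order.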

## Q11. Why Spark Improves on Hadoop

- In my assignment, repeated file reads were required during experimentation. Each time the file was reloaded, it took additional time and memory.

- Spark improves on Hadoop MapReduce by keeping data in memory across operations, whereas MapReduce writes intermediate results to disk between stages. This reduces repeated disk reads and significantly speeds up iterative or interactive analytics. Spark is therefore better suited for data exploration and machine learning tasks.
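
The benefit of keeping data in memory can be illustrated with a toy loader that counts simulated disk reads. This is plain Python, not Spark; it only mirrors the idea behind Spark's `cache()` and `persist()`. The `Dataset` class and its fields are hypothetical.

```python
class Dataset:
    """Toy dataset that counts how often the simulated file is read."""

    def __init__(self, rows):
        self._rows = rows
        self._cached = None
        self.disk_reads = 0

    def load(self, use_cache=False):
        if use_cache and self._cached is not None:
            return self._cached        # served from memory, no disk read
        self.disk_reads += 1           # stands in for an expensive file read
        data = list(self._rows)
        if use_cache:
            self._cached = data
        return data


amounts = [10, 20, 30, 40]
analyses = (sum, max, min)

# Without caching, every analysis reloads the data.
no_cache = Dataset(amounts)
results_a = [f(no_cache.load()) for f in analyses]

# With caching, only the first analysis reads from "disk".
cached = Dataset(amounts)
results_b = [f(cached.load(use_cache=True)) for f in analyses]

print(no_cache.disk_reads, cached.disk_reads)
```

Both runs produce the same results, but the uncached version reads the data three times and the cached version reads it once.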

## SECTION E

## Q12. Limits of Single-Machine Python

- Using a more powerful computer may temporarily improve performance, but it does not solve scalability problems. Hardware upgrades have limits and become expensive, and memory and disk capacity remain finite.

- Big Data problems require horizontal scaling across multiple machines rather than vertical scaling on one powerful machine. My experiment already showed increasing memory usage with moderate data size.

## Q13. Conceptual Architecture Question

In a modern Big Data system:

- Storage (e.g., HDFS or cloud storage) stores large datasets.
- File format (e.g., Parquet) determines how data is physically organized.
- Computation (e.g., Spark) processes data where it is stored.

- In my assignment, the above components were combined on a single machine. However, Big Data systems separate them to improve scalability and flexibility.

## Q14. Fault Tolerance

- My laptop was a single point of failure. If it crashed, all processing would stop. Distributed storage systems replicate data across multiple machines. If one machine fails, another replica is used. This ensures reliability and continuous operation.
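
Replica failover can be sketched as follows. The round-robin placement is a simplification (HDFS uses rack-aware placement), the replication factor of 3 matches the HDFS default, and the node and block names are hypothetical.

```python
REPLICATION = 3
nodes = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical machines


def place_replicas(block_ids, nodes, replication):
    """Round-robin placement: each block goes to `replication` distinct nodes.

    Assumes replication <= len(nodes).
    """
    return {
        block: [nodes[(i + r) % len(nodes)] for r in range(replication)]
        for i, block in enumerate(block_ids)
    }


def read_block(block_id, placement, alive):
    """Return the first live node holding the block, or raise if none is left."""
    for node in placement[block_id]:
        if node in alive:
            return node
    raise RuntimeError(f"block {block_id} is unavailable")


placement = place_replicas(["blk-0", "blk-1", "blk-2"], nodes, REPLICATION)
alive = set(nodes)

before = read_block("blk-0", placement, alive)
alive.discard("node-a")  # simulate a machine crash
after = read_block("blk-0", placement, alive)

print(before, "->", after)
```

After `node-a` fails, the read for `blk-0` is served by the next replica instead of stopping, which is the behaviour a single laptop cannot provide.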

## Q15. Real-World Scaling

- If this approach were applied to national telecom logs, the data would quickly reach terabytes or petabytes. Chunk-based processing on a single laptop would be too slow and unreliable. Distributed storage (HDFS or cloud storage), Spark for processing, and possibly Kafka for streaming would become necessary to handle real-time and large-scale analytics.