data-engineering-zoomcamp-1/week_5_batch_processing at main · JMGGarcia/data-engineering-zoomcamp-1

Week 5: Batch Processing

5.1 Introduction

🎥 5.1.1 Introduction to Batch Processing
🎥 5.1.2 Introduction to Spark

5.2 Installation

Follow these instructions to install Spark:

Then follow this guide to run PySpark in Jupyter.

🎥 5.2.1 (Optional) Installing Spark (Linux)

5.3 Spark SQL and DataFrames

🎥 5.3.1 First Look at Spark/PySpark
🎥 5.3.2 Spark Dataframes
🎥 5.3.3 (Optional) Preparing Yellow and Green Taxi Data

Script to prepare the dataset: download_data.sh

Note: Besides using pandas, another way to infer the schema of the CSV files is to set the inferSchema option to true when reading them in Spark.
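As a minimal sketch of the note above, the snippet below reads CSV files with schema inference enabled. The helper names (csv_read_options, read_taxi_csv, demo), the header setting, and the path in the usage note are assumptions made for illustration; they are not taken from the course code.

```python
def csv_read_options(infer_schema=True):
    """Build the reader options for the taxi CSV files.

    Assumes the files have a header row (an assumption for this sketch).
    """
    return {
        "header": "true",
        # With inferSchema enabled, Spark makes an extra pass over the data
        # to guess column types instead of reading every column as a string.
        "inferSchema": "true" if infer_schema else "false",
    }


def read_taxi_csv(spark, path):
    """Read CSV files at `path` into a DataFrame, letting Spark infer the schema."""
    return spark.read.options(**csv_read_options()).csv(path)


def demo(path):
    """Hypothetical entry point: start a local session and print the inferred schema."""
    # Imported here so the helpers above can be used without PySpark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[*]")
        .appName("infer-schema-sketch")
        .getOrCreate()
    )
    df = read_taxi_csv(spark, path)
    df.printSchema()
    spark.stop()
```

Calling demo with the location of the downloaded files (for example, a hypothetical data/raw/green/2021/01/) prints the inferred column types. Inference costs an extra read of the data, so an explicit schema may still be preferable for large files.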

🎥 5.3.4 SQL with Spark

5.4 Spark Internals

Community notes

Did you take notes? You can share them here.

Notes by Alvaro Navas
Sandy's DE Learning Blog
Notes by Alain Boisvert
Alternative: Using docker-compose to launch Spark, by rafik
Marcos Torregrosa's blog (Spanish)
Notes by Victor Padilha
Add your notes here (above this line)
