This course is offered by the University of California, Davis on Coursera and provides a comprehensive overview of distributed computing using Spark.
The four modules build on one another and cover:
- Spark architecture
- Spark DataFrame
- Optimizing reading/writing data
- Building a machine learning model
By understanding when to use Spark, whether to scale out because the model or data is too large to process on a single machine or simply to get results faster, students like me will hone their SQL skills and become more adept data scientists.
This repository includes the following: