Cloud computing

In the first notebook, I discovered and experimented with Apache Spark locally.

Then, I created a second notebook, slightly adapted to run on an AWS EMR cluster with a PySpark kernel.

At the end of the first notebook and in the third notebook, I assessed performance differences between running the image processing chain via Spark and via a classic TensorFlow pipeline, and tried to understand where the variations in output values came from (the pre-processing steps use different resize functions). I also briefly evaluated the quality of the online Spark processing with t-SNE visualizations.
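
To illustrate how two resize functions can produce different values from the same image, here is a minimal, dependency-free sketch. It is not the code from the notebooks: the choice of nearest-neighbour versus bilinear interpolation is an assumption made for illustration (the resize functions actually used by each pipeline are not named here), and `resize_nearest`, `resize_bilinear` and `max_abs_diff` are hypothetical helpers.

```python
def resize_nearest(img, new_h, new_w):
    """Resize a 2D list of pixel values by nearest-neighbour sampling."""
    h, w = len(img), len(img[0])
    return [
        [img[min(h - 1, int(r * h / new_h))][min(w - 1, int(c * w / new_w))]
         for c in range(new_w)]
        for r in range(new_h)
    ]


def resize_bilinear(img, new_h, new_w):
    """Resize a 2D list of pixel values by bilinear interpolation."""
    h, w = len(img), len(img[0])
    out = []
    for r in range(new_h):
        # Map the output pixel centre back to input coordinates,
        # then clamp so we never read outside the image.
        y = min(max((r + 0.5) * h / new_h - 0.5, 0.0), h - 1)
        y0 = int(y)
        y1 = min(y0 + 1, h - 1)
        dy = y - y0
        row = []
        for c in range(new_w):
            x = min(max((c + 0.5) * w / new_w - 0.5, 0.0), w - 1)
            x0 = int(x)
            x1 = min(x0 + 1, w - 1)
            dx = x - x0
            # Blend the four neighbouring pixels.
            top = img[y0][x0] * (1 - dx) + img[y0][x1] * dx
            bottom = img[y1][x0] * (1 - dx) + img[y1][x1] * dx
            row.append(top * (1 - dy) + bottom * dy)
        out.append(row)
    return out


def max_abs_diff(a, b):
    """Largest absolute per-pixel difference between two same-sized images."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


# Toy 4x4 "image" (made-up values), downscaled to 2x2 with both methods.
image = [
    [0, 50, 100, 150],
    [10, 60, 110, 160],
    [20, 70, 120, 170],
    [30, 80, 130, 180],
]
near = resize_nearest(image, 2, 2)
bil = resize_bilinear(image, 2, 2)
diff = max_abs_diff(near, bil)
print(near, bil, diff)
```

Even on this toy input the two methods disagree on every output pixel, so features extracted downstream from the resized images would not be expected to match exactly between two pipelines that resize differently.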

On the small dataset used (320 images), Spark was 10 times slower than the TensorFlow pipeline (see the presentation for the EMR cluster configuration).

The remaining files in this repository are configuration files stored in my S3 bucket, which I referred to when configuring the EMR cluster.

Repository files:

.gitignore
README.md
bootstrap-emr.sh
cloud_image_processing.ipynb
jupyter_persistence.json
local.ipynb
not_distributed_plus_comparison.ipynb
presentation_p8.pdf

JulienfLeBoucher/OC_cloud_computing

Folders and files

Latest commit

History

Repository files navigation

Cloud computing

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages